
How do I encode international (non-ASCII) characters in URLs?

Master encoding international characters in URLs, from UTF-8 encoding to percent-encoding, with practical examples and implementation strategies.

By Inventive HQ Team

Understanding Character Encoding in URLs

URLs present a unique challenge when dealing with international (non-ASCII) characters. While modern web applications support text in virtually any language, URLs are technically defined to contain only ASCII characters. This creates a gap that must be bridged through proper encoding. Understanding how to encode international characters in URLs is essential for building truly global applications.

The process of encoding international characters in URLs involves two steps: character encoding (usually UTF-8) and then percent-encoding (also called URL encoding). These steps must happen in the correct order, and understanding why each is necessary is fundamental to implementing this correctly.

The Two-Step Encoding Process

The first step in encoding international characters is character encoding. Before any character can be percent-encoded, it must first be represented as bytes using a specific character encoding standard. UTF-8 (Unicode Transformation Format - 8-bit) is the de facto standard for this purpose and is strongly recommended for all modern applications.

UTF-8 represents characters as variable-length byte sequences. ASCII characters (0-127) are represented as single bytes, which means they don't change during UTF-8 encoding. However, international characters require multiple bytes. For example:

  • The Euro symbol (€) is encoded as three bytes in UTF-8: E2 82 AC
  • The Chinese character (中) is encoded as three bytes: E4 B8 AD
  • The Cyrillic character (я) is encoded as two bytes: D1 8F
  • The Arabic character (ع) is encoded as two bytes: D8 B9

The second step is percent-encoding, where each byte is represented as a percent sign followed by its two-digit hexadecimal value. So the Euro symbol, after UTF-8 encoding to E2 82 AC, becomes "%E2%82%AC" in a URL.
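
These two steps are easy to see in code. Here is a minimal Python sketch; the standard library's quote() performs both steps in a single call:

euro_bytes = "€".encode("utf-8")                            # Step 1: character encoding -> b'\xe2\x82\xac'
percent_encoded = "".join(f"%{b:02X}" for b in euro_bytes)  # Step 2: percent-encode each byte
print(percent_encoded)                                      # %E2%82%AC

from urllib.parse import quote
print(quote("€"))                                           # %E2%82%AC (the library does both steps at once)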

Practical Examples of International Character Encoding

Let's walk through concrete examples to illustrate the encoding process. If you want to create a URL with a German filename containing umlauts, like "münchen.pdf", here's what happens:

  1. The string "münchen" contains the character "ü"
  2. UTF-8 encoding converts "ü" to the bytes C3 BC (in hexadecimal)
  3. Percent-encoding converts C3 BC to "%C3%BC"
  4. The full URL becomes: https://example.com/files/m%C3%BCnchen.pdf

For a Japanese example, if you want to encode "日本語.txt" (Japanese language):

  1. "日" in UTF-8 is E6 97 A5
  2. "本" in UTF-8 is E6 9C AC
  3. "語" in UTF-8 is E8 AA 9E
  4. The full encoded filename becomes: "%E6%97%A5%E6%9C%AC%E8%AA%9E.txt"
  5. The complete URL: https://example.com/files/%E6%97%A5%E6%9C%AC%E8%AA%9E.txt
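
Both walkthroughs can be verified in a couple of lines of Python; urllib.parse.quote produces exactly the byte sequences shown above:

from urllib.parse import quote

print(quote("münchen.pdf"))  # m%C3%BCnchen.pdf
print(quote("日本語.txt"))    # %E6%97%A5%E6%9C%AC%E8%AA%9E.txt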

Language-Specific Implementation

Different programming languages handle international character encoding with varying levels of convenience.

JavaScript Implementation:

const text = "München café naïve";
const encoded = encodeURIComponent(text);
// Result: "M%C3%BCnchen%20caf%C3%A9%20na%C3%AFve"

// For complete URIs where you want to preserve slashes
const uri = "https://example.com/München/café/";
const encodedUri = encodeURI(uri);
// Result: "https://example.com/M%C3%BCnchen/caf%C3%A9/"

JavaScript's encodeURIComponent() function automatically handles UTF-8 encoding and percent-encoding, making it convenient for most use cases. The encodeURI() variant preserves structural characters like slashes.

Python Implementation:

from urllib.parse import quote, quote_plus

text = "München café naïve"
# For path components
encoded_path = quote(text)
# Result: "M%C3%BCnchen%20caf%C3%A9%20na%C3%AFve"

# For query strings (uses + for spaces)
encoded_query = quote_plus(text)
# Result: "M%C3%BCnchen+caf%C3%A9+na%C3%AFve"

Python's urllib.parse module provides functions that handle the complete encoding pipeline. By default, they use UTF-8, which is the right choice for international characters.
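
Decoding is the mirror image. As a quick sketch, unquote() and unquote_plus() reverse the two variants shown above:

from urllib.parse import unquote, unquote_plus

print(unquote("M%C3%BCnchen%20caf%C3%A9%20na%C3%AFve"))    # München café naïve
print(unquote_plus("M%C3%BCnchen+caf%C3%A9+na%C3%AFve"))   # München café naïve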

PHP Implementation:

$text = "München café naïve";
// Using rawurlencode (RFC 3986 compliant)
$encoded = rawurlencode($text);
// Result: "M%C3%BCnchen%20caf%C3%A9%20na%C3%AFve"

// Using urlencode (application/x-www-form-urlencoded)
$form_encoded = urlencode($text);
// Result: "M%C3%BCnchen+caf%C3%A9+na%C3%AFve"

PHP distinguishes between urlencode() and rawurlencode(). The former uses "+" for spaces (legacy HTML form format), while the latter uses "%20" (RFC 3986 compliant).

Java Implementation:

import java.net.URLEncoder;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

String text = "München café naïve";
// The Charset overload (Java 10+) avoids the checked UnsupportedEncodingException
String encoded = URLEncoder.encode(text, StandardCharsets.UTF_8);
// Result: "M%C3%BCnchen+caf%C3%A9+na%C3%AFve"

// Decode it back
String decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8);

Note that Java's URLEncoder implements the application/x-www-form-urlencoded format (spaces become "+"), so it is suited to query-string parameters rather than path segments. Always pass the charset explicitly: the legacy single-argument encode(String) overload uses the platform's default encoding, which is not guaranteed to be UTF-8.

Handling Different URL Components

The way you encode international characters might differ depending on the URL component where they appear.

In Path Segments: Path segments should use percent-encoding consistently. The forward slash (/) stays unencoded when it acts as a path separator, but a literal slash that is part of a segment's data must be encoded as %2F, and all other special characters should be encoded. For example: https://example.com/café/menü/coffee becomes https://example.com/caf%C3%A9/men%C3%BC/coffee

In Query String Parameters: Query parameters are typically encoded more aggressively. Both reserved characters (like & and =) and international characters should be percent-encoded. For example: https://example.com/search?q=café+français becomes https://example.com/search?q=caf%C3%A9+fran%C3%A7ais

In Fragment Identifiers: Fragments (the part after #) are also subject to encoding rules. International characters should be percent-encoded similarly to query strings.
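
As a sketch in Python (the example path, parameter, and fragment values are illustrative), each component can be encoded with the function that matches its rules:

from urllib.parse import quote, urlencode

# Path: keep "/" unencoded (it is in quote's default safe set)
path = quote("/café/menü/coffee")           # /caf%C3%A9/men%C3%BC/coffee

# Query string: urlencode handles keys, values, and the "&"/"=" separators
query = urlencode({"q": "café français"})   # q=caf%C3%A9+fran%C3%A7ais

# Fragment: encode like a single component, with nothing left unescaped
fragment = quote("Überblick", safe="")      # %C3%9Cberblick

print(f"https://example.com{path}?{query}#{fragment}")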

Character Encoding Selection and UTF-8 Dominance

While multiple character encoding standards exist (ISO-8859-1, Big5, Shift_JIS), UTF-8 has become the universal standard for web applications for several important reasons:

  1. Unicode Support: UTF-8 can represent any Unicode character, making it suitable for any language
  2. Backward Compatibility: ASCII characters are represented identically in UTF-8, ensuring compatibility with legacy systems
  3. Web Standards: HTML5 requires UTF-8, and modern web standards assume UTF-8
  4. Efficiency: UTF-8 uses variable-length encoding, so common ASCII characters take minimal space
  5. Browser Support: All modern browsers handle UTF-8 URL encoding correctly

When implementing international character support, always specify UTF-8 as your character encoding. If you inherit code that uses other encodings, consider migrating to UTF-8 for consistency.
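
The difference is easy to demonstrate: the same character percent-encodes differently depending on which byte encoding is applied first. A short sketch (quote's encoding parameter makes the choice explicit):

from urllib.parse import quote

print(quote("café", encoding="utf-8"))       # caf%C3%A9
print(quote("café", encoding="iso-8859-1"))  # caf%E9 (a legacy form many modern servers will misinterpret)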

International Domain Names (IDNs)

A special case in URL encoding involves international domain names. While domain names might contain international characters, they don't use percent-encoding. Instead, they use Punycode encoding, which converts international characters to ASCII-compatible encoding (ACE).

For example, "münchen.de" becomes "xn--mnchen-3ya.de" through Punycode conversion. This is different from percent-encoding and is handled by domain name systems separately from URL path encoding.

When working with IDNs:

  • The browser typically displays the international form in the address bar
  • The DNS system uses the Punycode form internally
  • Your URL encoding functions don't need to handle domain name encoding
  • Only the path and query string portions need percent-encoding

Common Pitfalls and Solutions

Pitfall 1: Assuming your text editor or source code encoding will handle it automatically. Never assume this. Always explicitly specify UTF-8 encoding when reading or processing text that might contain international characters.

Pitfall 2: Encoding before you're sure about the character encoding of the source. If you're receiving text from an external source, verify its encoding first. Many legacy systems use ISO-8859-1 or other non-UTF-8 encodings. Convert to UTF-8 before percent-encoding.
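
A short sketch of that conversion, assuming the incoming bytes are known to be ISO-8859-1:

from urllib.parse import quote

legacy_bytes = b"caf\xe9"                  # "café" as ISO-8859-1 bytes from a legacy system
text = legacy_bytes.decode("iso-8859-1")   # decode using the *source* encoding first
print(quote(text))                         # caf%C3%A9 (percent-encoded as UTF-8)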

Pitfall 3: Mixing percent-encoded and non-encoded international characters. Be consistent. Either encode all non-ASCII characters or encode none. Mixing them in the same URL creates confusion and potential issues.

Pitfall 4: Not considering URL normalization. URLs can be normalized in different ways. Some normalization processes might change how international characters are represented. Be aware of this if you're comparing or storing URLs.
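
Unicode normalization is a concrete case: the composed and decomposed forms of "café" look identical but percent-encode differently, as this short sketch shows:

import unicodedata
from urllib.parse import quote

composed = unicodedata.normalize("NFC", "café")    # é as one code point (U+00E9)
decomposed = unicodedata.normalize("NFD", "café")  # "e" plus combining acute accent (U+0301)

print(quote(composed))    # caf%C3%A9
print(quote(decomposed))  # cafe%CC%81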

Testing International Character Encoding

When testing URL encoding for international characters, create test cases that include:

  1. Latin characters with diacritics: café, naïve, Zürich
  2. Non-Latin scripts: 中文 (Chinese), العربية (Arabic), Русский (Russian)
  3. Mixed scripts: A string combining multiple languages
  4. Characters that need four bytes in UTF-8: emoji such as 😀 (though emoji support in URLs is inconsistent)
  5. Complete URLs with multiple components: Path, query string, and fragment with international characters

Test that:

  • Encoding produces correct percent-encoded output
  • Decoding recovers the original characters
  • URLs work correctly when transmitted over HTTP
  • Server-side code correctly receives and processes the decoded values
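
A minimal round-trip check in Python (the sample strings are just the suggestions above; substitute your own data):

from urllib.parse import quote, unquote

samples = ["café", "naïve", "Zürich", "中文", "العربية", "Русский", "café 中文 Русский"]

for s in samples:
    encoded = quote(s)
    assert encoded.isascii()          # the encoded form contains only ASCII
    assert unquote(encoded) == s      # decoding recovers the original text
print("all round-trips passed")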

Best Practices for International URL Encoding

  1. Always use UTF-8: Make UTF-8 your default character encoding for all applications
  2. Use standard library functions: Don't write custom encoding logic; use tested, standard functions
  3. Encode at the right point: Encode when building the URL, not before
  4. Document encoding assumptions: Make it clear in your code and documentation that UTF-8 is used
  5. Test with real international characters: Don't just test with ASCII
  6. Handle decoding on the server: Always verify that your server correctly decodes international characters from URLs
  7. Be consistent: Apply the same encoding strategy throughout your application
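
Pulling these practices together, here is a small helper sketch in Python (the function name and example values are illustrative, not from any particular framework):

from urllib.parse import quote, urlencode

def build_url(base, path_segments, params):
    """Percent-encode each path segment and query parameter as UTF-8."""
    path = "/".join(quote(segment, safe="") for segment in path_segments)
    return f"{base}/{path}?{urlencode(params)}"

url = build_url("https://example.com", ["files", "münchen.pdf"], {"lang": "中文"})
print(url)  # https://example.com/files/m%C3%BCnchen.pdf?lang=%E4%B8%AD%E6%96%87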

Conclusion

Encoding international characters in URLs is a fundamental requirement for global web applications. By understanding the two-step process of UTF-8 character encoding followed by percent-encoding, and by using standard library functions specific to your programming language, you can reliably handle any international character. UTF-8 has become the de facto standard precisely because it solves this problem elegantly, making it the obvious choice for any modern application. With these practices in place, your URLs will work correctly with any language and script the world's users speak.
