How to Remove Non-ASCII Characters from a String in Python?

To remove non-ASCII characters from a string in Python, you can use a combination of regular expressions and the unicodedata module. Here is an example code snippet that removes non-ASCII characters from a string:

import unicodedata
import re

original_string = "Hello, World! This is a Python string with non-ASCII characters: é, ë, ê."

# Remove non-ASCII characters
clean_string = re.sub(r'[^\x00-\x7F]+', '', unicodedata.normalize('NFKD', original_string).encode('ASCII', 'ignore').decode('ASCII'))

print(clean_string)  # Output: Hello, World! This is a Python string with non-ASCII characters: e, e, e.

In this code, we import the unicodedata module and the re module to use regular expressions.

Next, we define an original string (original_string) that contains non-ASCII characters.

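Then, we use the unicodedata.normalize() function with the NFKD argument to decompose each character into a base character followed by its combining diacritic marks, so é becomes e plus a separate combining acute accent. We then call .encode('ASCII', 'ignore'), which silently drops every character that has no ASCII encoding (including those combining marks), and .decode('ASCII') to turn the resulting bytes back into a string. The result contains only ASCII characters, with accented letters reduced to their base letters.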
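The decomposition step is easier to follow when it is run on a single character. The following is a minimal sketch using only the standard library; the variable names are illustrative and not part of the original snippet.

```python
import unicodedata

text = "\u00e9"  # the precomposed character é
decomposed = unicodedata.normalize("NFKD", text)

# One code point becomes two: a base letter plus a combining mark.
print(len(text), len(decomposed))  # 1 2
print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

# errors="ignore" drops the combining mark, which has no ASCII encoding.
ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
print(ascii_text)  # e
```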

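Finally, we pass the result through re.sub() with the pattern [^\x00-\x7F]+, which matches one or more characters outside the ASCII range and replaces them with an empty string. Inside the brackets, \x00-\x7F covers all 128 ASCII characters and the leading ^ negates the class. Because the encode/decode step has already discarded everything non-ASCII, this substitution acts only as a safety net here. Used on its own, without normalization, the same pattern would delete accented letters outright instead of reducing them to their base letters.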
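For comparison, here is a short sketch of the regular expression applied on its own, with no normalization beforehand. The sample text and the NON_ASCII name are assumptions chosen for illustration.

```python
import re

text = "Caf\u00e9 na\u00efve"  # "Café naïve"

# Any run of characters outside U+0000 to U+007F is deleted.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")
stripped = NON_ASCII.sub("", text)

print(stripped)  # Caf nave
```

The accented letters disappear completely, which is the main difference from the normalize-then-encode approach.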

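Note that this approach is lossy: accented letters are reduced to their base letters (é becomes e), while characters with no ASCII decomposition, such as ß, emoji, or CJK characters, are removed entirely. It may therefore not be appropriate in all cases, so consider the context and purpose of the string before removing non-ASCII characters.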
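One way to make that trade-off explicit is to wrap both behaviours in a small helper. This is only a sketch: the function name to_ascii and its keep_base_letters parameter are hypothetical and do not come from any library.

```python
import unicodedata


def to_ascii(text: str, keep_base_letters: bool = True) -> str:
    """Return text with every non-ASCII character removed (hypothetical helper).

    With keep_base_letters=True, accented letters are first decomposed
    with NFKD so that their base letters survive.
    """
    if keep_base_letters:
        text = unicodedata.normalize("NFKD", text)
    # Characters that cannot be encoded as ASCII are silently dropped.
    return text.encode("ascii", "ignore").decode("ascii")


sample = "cr\u00e8me br\u00fbl\u00e9e"  # "crème brûlée"
print(to_ascii(sample))                           # creme brulee
print(to_ascii(sample, keep_base_letters=False))  # crme brle
print(to_ascii("Stra\u00dfe"))                    # Strae (ß has no decomposition)
```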
