Some lines in a .csv file are jamming up a database import because of funky characters in one of the fields.
I have searched, found articles on how to replace non-ascii characters in Python 3, but nothing works.
When I open the file in vi and do :set list, there is a $ at the end of a line where there should not be, and ^I^I at the beginning of the next line. The two lines should be one joined line and no ^I there. I know that $ is end of line '\n' and have tried to replace those, but nothing works.
I don't know what the ^I represents, possibly a tab.
I have tried this function to no avail:
import re
import string

def remove_non_ascii(text):
    new_text = re.sub(r"[\n\t\r]", "", text)
    new_text = ''.join(new_text.split("\n"))
    new_text = ''.join([i if ord(i) < 128 else ' ' for i in new_text])
    new_text = "".join([x for x in new_text if ord(x) < 128])
    new_text = re.sub(r'[^\x00-\x7F]+', ' ', new_text)
    new_text = new_text.rstrip('\r\n')
    new_text = new_text.strip('\n')
    new_text = new_text.strip('\r')
    new_text = new_text.strip('\t')
    new_text = new_text.replace('\n', '')
    new_text = new_text.replace('\r', '')
    new_text = new_text.replace('\t', '')
    new_text = filter(lambda x: x in string.printable, new_text)
    new_text = "".join(list(new_text))
    return new_text
Is there some tool that will show me exactly what this offending character is, so that I can then find a method to replace it?
I am opening the file like so (the .csv was saved as UTF-8)
f_csv_in = open(csv_in, "r", encoding="utf-8")
Below are two lines that should be one with the problem non-ascii characters visible.
These two lines should be one line. Notice the $ at the end of line 37, and line 38 begins with ^I^I.
Part of the problem, that vi is showing, is that there is a new line $ on line 37 where I don't want it to be. This should be one line.
37 Cancelled,01-19-17,,basket,00-00-00,00-00-00,,,,98533,SingleSource,,,17035 Cherry Hill Dr,"L/o 1-19-17 # 11:45am$
38 ^I^IVictorville",SAN BERNARDINO,CA,92395,,,,,0,,,,,Lock:6111 ,,,No,No,,0.00,0.00,No,01-19-17,0.00,0.00,,01-19-17,00-00-00,,provider,,,Unread,00-00-00,,$
A simple way to remove non-ascii chars could be doing:
new_text = "".join([c for c in text if c.isascii()])
NB: If you are reading this text from a file, make sure you read it with the correct encoding
For non-printable characters there are a couple of built-in ways to filter: the str.isprintable() method tests a whole string, and the string module's string.printable constant lists the printable ASCII characters.
A concise way of filtering the whole string at once is presented below
>>> import string
>>>
>>> str1 = '\nsomestring'
>>> str1.isprintable()
False
>>> str2 = 'otherstring'
>>> str2.isprintable()
True
>>>
>>> res = filter(lambda x: x in string.printable, '\x01mystring')
>>> "".join(list(res))
'mystring'
This question has had some discussion on SO in the past, but there are many ways to do things, so I understand it may be confusing, since you can use anything from Regular Expressions to str.translate()
Another thing one could do is to take a look at Unicode Categories, and filter out your data based on the set of symbols you need.
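To answer the "which character is it exactly?" part, one option is to print the code point and Unicode category of every character outside printable ASCII. This is only a sketch: the helper name describe_chars and the sample string are made up for illustration.

```python
import unicodedata

def describe_chars(text):
    """List (repr, code point, category, name) for chars outside printable ASCII."""
    rows = []
    for ch in text:
        if not (" " <= ch <= "~"):
            rows.append((
                repr(ch),
                f"U+{ord(ch):04X}",
                unicodedata.category(ch),
                # control characters have no Unicode name, so supply a default
                unicodedata.name(ch, "<unnamed>"),
            ))
    return rows

# Hypothetical sample resembling the broken field: newline, two tabs, one accent
for row in describe_chars("11:45am\n\t\tVictorvill\u00e9"):
    print(row)
```

A newline shows up as U+000A and a tab (what vi displays as ^I) as U+0009, both in category Cc.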
It looks as if you have a csv file that contains quoted values, that is values such as embedded commas or newlines which have to be surrounded with quotes so that csv readers handle them correctly.
If you look at the example data you can see there's an opening doublequote but no closing doublequote at the end of the first line, and a closing doublequote with no opening doublequote on the second line, indicating that the quotes contain a value with an embedded newline.
The fact that the lines are broken in two may be an artefact of the application used to view them, or the code that's processing them: if the software doesn't understand csv quoting it will assume each newline character denotes a new line.
It's not clear exactly what problem this is causing in the database, but it's quite likely that quote characters - especially unmatched quotes - could be causing a problem, particularly if the data isn't being properly escaped before insertion.
This snippet rewrites the file, removing embedded commas, newlines and tabs, and instructs the writer not to quote any values. It will fail with the error message _csv.Error: need to escape, but no escapechar set if it finds a value that needs to be escaped. Depending on your data, you may need to adjust the regex pattern.
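import csv
import re

# newline='' lets the csv module handle embedded newlines itself
with open('lines.csv', newline='') as f, open('fixed.csv', 'w', newline='') as out:
    reader = csv.reader(f)
    writer = csv.writer(out, quoting=csv.QUOTE_NONE)
    for line in reader:
        # replace tabs, newlines and commas inside each value with a space
        new_row = [re.sub(r'\t|\n|,', ' ', x) for x in line]
        writer.writerow(new_row)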
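To check that the csv module really treats the quoted newline as part of one field, here is a small in-memory sketch. The sample text is a shortened, made-up version of the two rows in the question, and collapsing whitespace with split/join is just one possible cleanup.

```python
import csv
import io

# Shortened stand-in for the two physical lines from the question
raw = (
    'Cancelled,01-19-17,"L/o 1-19-17 11:45am\n'
    '\t\tVictorville",SAN BERNARDINO,CA\n'
)

# newline="" keeps the embedded newline intact for the csv reader
rows = list(csv.reader(io.StringIO(raw, newline="")))
print(len(rows))  # one logical row, even though the text spans two lines

# Collapse any run of whitespace (newline, tabs, spaces) into a single space
cleaned = [[" ".join(field.split()) for field in row] for row in rows]

out = io.StringIO()
csv.writer(out).writerows(cleaned)
print(out.getvalue())
```

The writer here uses default quoting, so any value that still needs quotes gets them.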
Another approach uses re to keep only printable ASCII characters:
import re
import string
string_with_printable = re.sub(f'[^{re.escape(string.printable)}]', '', original_string)
re.escape escapes special characters in the given pattern.
I am trying to scan a string and every time it reads a certain character 3 times, I would like to cut the remaining string
for example:
The string "C:\Temp\Test\Documents\Test.doc" would turn into "C:\Temp\Test\"
Every time the string hits "\" 3 times it should trim the string
here is my code that I am working on
prefix = ["" for x in range(size)]
num = 0
...
...
for char in os.path.realpath(src):
    for x in prefix:
        x = char
        if x =='\': # I get an error here
            num = num + 1
        if num == 3:
            break
    print (num)
    print(prefix)
...
...
os.path.realpath(src) is the string with the file path. The "prefix" variable is the string array in which I want to store the trimmed string.
Please let me know what I need to fix or if there is a simpler way to perform this.
Split on the backslash, slice off the first three pieces, and join them again:
s = r'C:\Temp\Test\Documents\Test.doc'
print('\\'.join(s.split('\\')[:3]) + '\\')
# C:\Temp\Test\
Note that \ (backslash) is the escape character in string literals. To mean a literal backslash, write it as \\, which removes its special meaning. A raw string (the r prefix above) keeps the backslashes in the path as they are.
In python the backslash character is used as an escape character. If you do \n it does a newline, \t does a tab. There are many other things such as \" lets you do a quote in a string. If you want a regular backslash you should do "\\"
try
s = "C:\\Temp\\Test\\Documents\\Test.doc"
answer = '\\'.join(s.split('\\', 3)[:3]) + '\\'
Something like this would do:
x = r"C:\Temp\Test\Documents\Test.doc"
print('\\'.join(x.split("\\")[:3]) + "\\")
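If the separator or the count may change, the same idea can be written as a small helper that walks the string with str.find. This is a sketch; the name keep_through_nth is made up, and it returns the input unchanged when there are fewer than n separators.

```python
def keep_through_nth(text, sep, n):
    """Return text up to and including the n-th occurrence of sep."""
    start = 0
    for _ in range(n):
        pos = text.find(sep, start)
        if pos == -1:
            return text  # fewer than n separators: nothing to trim
        start = pos + len(sep)
    return text[:start]

path = r"C:\Temp\Test\Documents\Test.doc"
print(keep_through_nth(path, "\\", 3))  # C:\Temp\Test\
```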
Suppose I want to read a sequence of inputs, where each input is a tuple of the form <string> , <integer>, <string>. Additionally, there can be an arbitrary amount of whitespace around the commas. An easy way to do this in C/C++ is to use scanf with the format string "%s , %d , %s". What is the equivalent function in Python?
Suppose we knew that each input is on a separate line, then you could easily parse this in python using split and strip. But the newline requirement complicates things. Furthermore, we could even have weird inputs such as
<s11>, <i1>
, <s12> <s21>,
<i2> , <s22>
Where s11, i1, s12 is the first input and s21, i2, s22 is the second. And scanf would still be able to handle this. How does one do it in python? I also don't want to take the entire input at once and parse it, since I know that there will be other inputs that don't fit this format later on, and I don't want to do the parsing manually.
You should be able to first strip the whitespace, then split on commas, then handle the resulting strings and integers however you want. The regular expression \s+ matches any nonzero amount of whitespace characters:
import re
input_string = " hello \n \t , 10 , world \n "
stripped_string = re.sub(r'\s+', '', input_string)
substrings = stripped_string.split(',')
string1 = substrings[0]
integer1 = int(substrings[1])
string2 = substrings[2]
You'd just have to put those last three lines inside a loop if you need to handle multiple s,i,s tuples in a row.
EDIT: I realize now you want to interpret any whitespace as a comma. I'm not sure how wise that is, but a hacky way to do it is to replace all the commas with whitespace, split on whitespace, and call it a day
input_string = " hello \n \t , 10 world \n "
stripped_string = re.sub(',', ' ', input_string)
substrings = stripped_string.split()
string1 = substrings[0]
integer1 = int(substrings[1])
string2 = substrings[2]
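A closer match to how scanf consumes input is to match one tuple at a time from the current position and stop at the first thing that doesn't fit, leaving the rest for other parsing. The sketch below assumes the string fields contain neither whitespace nor commas; scan_tuples and TUPLE_RE are made-up names.

```python
import re

# <string> , <integer> , <string> with optional whitespace (including newlines)
TUPLE_RE = re.compile(r"\s*([^\s,]+)\s*,\s*([+-]?\d+)\s*,\s*([^\s,]+)")

def scan_tuples(text):
    """Consume as many tuples as possible; return them and the unparsed rest."""
    pos = 0
    found = []
    while True:
        m = TUPLE_RE.match(text, pos)
        if m is None:
            break
        found.append((m.group(1), int(m.group(2)), m.group(3)))
        pos = m.end()
    return found, text[pos:]

tuples, rest = scan_tuples("s11, 1\n , s12 s21,\n 2 , s22\nEND")
print(tuples)
print(repr(rest))
```

The unparsed remainder is returned as well, so later inputs in a different format can be handled by other code.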
For a delimited format it's pretty easy with the csv module.
You can plug any kind of file-like input into it.
Stripping whitespace and type casting are then handled downstream. Here's a sample to get you going:
In [25]: import fileinput
In [26]: import csv
In [28]: reader = csv.reader(fileinput.input())
In [29]: for l in reader:
...: print(l)
...:
stdin input -> a,b, c, d
print output -> ['a', 'b', ' c', ' d ']
I'm working on strings where I'm taking input from the command line. For example, with this input:
format driveName "datahere"
when I go string.split(), it comes out as:
>>> input.split()
['format', 'driveName', '"datahere"']
which is what I want.
However, when I specify it to be string.split(" ", 2), I get:
>>> input.split(' ', 2)
['format\n', 'driveName\n', '"datahere"']
Does anyone know why and how I can resolve this? I thought it could be because I'm creating it on Windows and running on Unix, but the same problem occurs when I use nano in unix.
The third argument (data) could contain newlines, so I'm cautious not to use a sweeping newline remover.
Default separator in split() is all whitespace which includes newlines \n and spaces.
Here is what the docs on split say:
str.split([sep[, maxsplit]])
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace.
When you define a new sep it only uses that separator to split the strings.
Use None to get the default whitespace splitting behaviour with a limit:
input.split(None, 2)
This leaves the whitespace at the end of input() untouched.
Or you could strip the values afterwards; this removes whitespace from the start and end, not the middle, of each resulting string, just like input.split() would:
[v.strip() for v in input.split(' ', 2)]
The default str.split targets a number of "whitespace characters", including also tabs and others. If you do str.split(' '), you tell it to split only on ' ' (a space). You can get the default behavior by specifying None, as in str.split(None, 2).
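The difference is easy to see side by side. The input below is made up: two single-word fields followed by data that itself contains spaces and a newline.

```python
# Hypothetical input: fields separated by newlines, data containing whitespace
raw = 'format\ndriveName\n"line one\nline two"'

by_whitespace = raw.split(None, 2)  # any whitespace run separates, at most 2 splits
by_space = raw.split(' ', 2)        # only a literal space separates

print(by_whitespace)
print(by_space)
```

With None the first two fields come out clean and the data keeps its inner newline; with ' ' the newlines stay glued to the field names and the data is cut at its spaces.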
There may be a better way of doing this, depending on what your actual use-case is (your example does not replicate the problem...). As your example output implies newlines as separators, you should consider splitting on them explicitly.
inp = """
format
driveName
datahere
datathere
"""
inp.strip().split('\n', 2)
# ['format', 'driveName', 'datahere\ndatathere']
This allows you to have spaces (and tabs etc) in the first and second item as well.
I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.
This can be done without regex:
>>> string = "Special $#! characters spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'
You can use str.isalnum:
S.isalnum() -> bool
Return True if all characters in S are alphanumeric
and there is at least one character in S, False otherwise.
If you insist on using regex, other solutions will do fine. However note that if it can be done without using a regular expression, that's the best way to go about it.
Here is a regex to match a string of characters that are not letters or numbers:
[^A-Za-z0-9]+
Here is the Python command to do a regex substitution:
re.sub('[^A-Za-z0-9]+', '', mystring)
Shorter way:
import re
cleanString = re.sub(r'\W+', '', string)
If you want spaces between words and numbers, substitute '' with ' '. Note that \W does not match the underscore, so underscores are kept.
TLDR
I timed the provided answers.
import re
re.sub(r'\W+', '', string)
is typically 3x faster than the next fastest provided top answer.
Caution should be taken when using this option. Some special characters (e.g. ø) may not be stripped using this method.
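To make that caution concrete: on Python 3 str patterns, \w is Unicode-aware by default and also includes the underscore, so \W+ leaves both ø and _ in place. A sketch with a made-up sample string, showing one way to get ASCII-only letters and digits:

```python
import re

sample = "smørrebrød_1 !"

# Default (Unicode) matching: ø and _ count as word characters and are kept
unicode_default = re.sub(r"\W+", "", sample)

# re.ASCII limits \w to [a-zA-Z0-9_]; adding _ to the class removes it too
ascii_only = re.sub(r"[\W_]+", "", sample, flags=re.ASCII)

print(unicode_default)  # smørrebrød_1
print(ascii_only)       # smrrebrd1
```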
After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit against two of the example strings:
string1 = 'Special $#! characters spaces 888323'
string2 = 'how much for the maple syrup? $20.99? That s ridiculous!!!'
Example 1
''.join(e for e in string if e.isalnum())
string1 - Result: 10.7061979771
string2 - Result: 7.78372597694
Example 2
import re
re.sub('[^A-Za-z0-9]+', '', string)
string1 - Result: 7.10785102844
string2 - Result: 4.12814903259
Example 3
import re
re.sub(r'\W+', '', string)
string1 - Result: 3.11899876595
string2 - Result: 2.78014397621
The above results are a product of the lowest returned result from an average of: repeat(3, 2000000)
Example 3 can be 3x faster than Example 1.
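For anyone who wants to rerun the comparison, a minimal timeit sketch follows. It uses far fewer iterations than the 2,000,000 above so it finishes quickly, and the numbers depend on machine and Python version, so it makes no claim about which approach wins.

```python
import re
import timeit

string1 = 'Special $#! characters spaces 888323'

# The three approaches timed above, wrapped as callables
candidates = {
    "isalnum join": lambda: ''.join(e for e in string1 if e.isalnum()),
    "explicit class": lambda: re.sub(r'[^A-Za-z0-9]+', '', string1),
    "non-word class": lambda: re.sub(r'\W+', '', string1),
}

# Best of 3 runs of 10,000 calls each (an arbitrary, small choice)
timings = {
    name: min(timeit.repeat(fn, repeat=3, number=10000))
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```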
Python 2.*
I think just filter(str.isalnum, string) works
In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'
Python 3.*
In Python 3, filter() returns an iterator (not a string, as in Python 2 above). Join the result to get a string back:
''.join(filter(str.isalnum, string))
or pass join a list instead (not sure, but it may be a bit faster):
''.join([*filter(str.isalnum, string)])
Note: unpacking with [*args] is valid from Python 3.5 onward.
#!/usr/bin/python
import re
strs = "how much for the maple syrup? $20.99? That's ridiculous!!!"
print(strs)
# inside [...] the characters are listed directly; no | is needed between them
nstr = re.sub(r'[?$.!]', '', strs)
print(nstr)
# drop anything that is not a letter, digit or space
nestr = re.sub(r'[^a-zA-Z0-9 ]', '', nstr)
print(nestr)
You can add more special characters to the class; each one is replaced by '' (nothing), i.e. removed.
Differently than everyone else did using regex, I would try to exclude every character that is not what I want, instead of enumerating explicitly what I don't want.
For example, if I want only characters from 'a to z' (upper and lower case) and numbers, I would exclude everything else:
import re
s = re.sub(r"[^a-zA-Z0-9]","",s)
This means "substitute every character that is not a number, or a character in the range 'a to z' or 'A to Z' with an empty string".
In fact, if you insert the special character ^ at the first place of your regex, you will get the negation.
Extra tip: if you also need to lowercase the result, lowercase the string first. The character class then no longer needs the uppercase range, which makes the regex simpler (and possibly a little faster).
import re
s = re.sub(r"[^a-z0-9]","",s.lower())
string.punctuation contains following characters:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
You can use translate and maketrans functions to map punctuations to empty values (replace)
import string
'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))
Output:
'This is A test'
s = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s)
Assuming you want to use a regex and you want/need Unicode-cognisant 2.x code that is 2to3-ready:
>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>
The most generic approach is to use the 'categories' of the unicodedata table, which classifies every single character. E.g. the following code keeps only characters in the listed categories (letters, decimal digits and space separators) and replaces everything else with a space:
import unicodedata
# strip unwanted characters (based on the Unicode database
# categorization:
# http://www.sql-und-xml.de/unicode-database/#kategorien)
PRINTABLE = set(('Lu', 'Ll', 'Nd', 'Zs'))
def filter_non_printable(s):
    result = []
    for c in s:
        # keep the character if its category is allowed, otherwise use a space
        c = c if unicodedata.category(c) in PRINTABLE else u' '
        result.append(c)
    return u''.join(result)
Look at the URL given above for all related categories. You can of course also filter by the punctuation categories.
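A shorter variant of the same idea checks only the first letter of the category, which names the major class (L for letters, N for numbers, P for punctuation and so on). This is a sketch with a made-up helper name; unlike the function above it drops unwanted characters instead of replacing them.

```python
import unicodedata

def keep_letters_and_digits(text):
    """Keep characters whose Unicode major category is L (letter) or N (number)."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] in ("L", "N")
    )

print(keep_letters_and_digits("Grüße, 123!"))  # Grüße123
```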
For other languages like German, Spanish, Danish, French etc that contain special characters (like German "Umlaute" as ü, ä, ö) simply add these to the regex search string:
Example for German:
re.sub('[^A-ZÜÖÄa-z0-9]+', '', mystring)
This will remove all special characters, punctuation, and spaces from a string, leaving only letters and numbers.
import re
sample_str = "Hel&&lo %% Wo$#rl#d"
# using isalnum()
print("".join(k for k in sample_str if k.isalnum()))
# using regex
op2 = re.sub("[^A-Za-z0-9]", "", sample_str)
print("op2 = ", op2)
# the two variants below remove only the listed characters, so spaces are kept
special_char_list = ["$", "#", "&", "%"]
# using list comprehension
op1 = "".join([k for k in sample_str if k not in special_char_list])
print("op1 = ", op1)
# using lambda function
op3 = "".join(filter(lambda x: x not in special_char_list, sample_str))
print("op3 = ", op3)
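Use translate:
import string
def clean(instr):
    # Python 3: build a deletion table with str.maketrans
    return instr.translate(str.maketrans('', '', string.punctuation + ' '))
Caveat: string.punctuation covers only ASCII punctuation. The two-argument form instr.translate(None, string.punctuation + ' ') works only on Python 2 byte strings.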
This will remove all non-alphanumeric characters except spaces.
string = "Special $#! characters spaces 888323"
''.join(e for e in string if (e.isalnum() or e.isspace()))
Special characters spaces 888323
import re
my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the
same as double quotes."""
# count the word "python" whether or not it is followed by ',' or '.'
text = my_string.lower().split()
for count, word in enumerate(text):
    # strip one trailing '.' or ',' from the word, if present
    text[count] = re.sub(r"^([a-z]+)[.,]?$", r"\1", word)
print("The count of Python : ", text.count("python"))
Ten years on, I think the solution below is the best one.
The clean-text package can remove special characters, punctuation, non-ASCII characters and extra spaces from a string.
from cleantext import clean  # from the clean-text package
string = 'Special $#! characters spaces 888323'
new = clean(string, lower=False, no_currency_symbols=True, no_punct=True, replace_with_currency_symbol='')
print(new)
Output ==> 'Special characters spaces 888323'
You can remove the spaces as well if you want.
update = new.replace(' ', '')
print(update)
Output ==> 'Specialcharactersspaces888323'
function regexFunction(st) {
  const regx = /[^\w\s]/gi; // keep only [a-zA-Z0-9_] and whitespace
  st = st.replace(regx, ''); // remove everything else
  st = st.replace(/\s\s+/g, ' '); // collapse runs of whitespace into one space
  return st;
}
console.log(regexFunction('$Hello; # -world--78asdf+-===asdflkj******lkjasdfj67;'));
// Output: Hello world78asdfasdflkjlkjasdfj67
abc = "askhnl#$%askdjalsdk"
# str.replace removes this exact substring; no regex is needed
ddd = abc.replace("#$%", "")
print(ddd)
and you will see the result
askhnlaskdjalsdk