How to find/replace non printable / non-ascii characters using Python 3? - python

I have a file, some lines in a .csv file that are jamming up a database import because of funky characters in some field in the line.
I have searched, found articles on how to replace non-ascii characters in Python 3, but nothing works.
When I open the file in vi and do :set list, there is a $ at the end of a line where there should not be, and ^I^I at the beginning of the next line. The two lines should be one joined line and no ^I there. I know that $ is end of line '\n' and have tried to replace those, but nothing works.
I don't know what the ^I represents, possibly a tab.
I have tried this function to no avail:
def remove_non_ascii(text):
new_text = re.sub(r"[\n\t\r]", "", text)
new_text = ''.join(new_text.split("\n"))
new_text = ''.join([i if ord(i) < 128 else ' ' for i in new_text])
new_text = "".join([x for x in new_text if ord(x) < 128])
new_text = re.sub(r'[^\x00-\x7F]+', ' ', new_text)
new_text = new_text.rstrip('\r\n')
new_text = new_text.strip('\n')
new_text = new_text.strip('\r')
new_text = new_text.strip('\t')
new_text = new_text.replace('\n', '')
new_text = new_text.replace('\r', '')
new_text = new_text.replace('\t', '')
new_text = filter(lambda x: x in string.printable, new_text)
new_text = "".join(list(new_text))
return new_text
Is there some tool that will show me exactly what this offending character is, and a then find a method to replace it?
I am opening the file like so (the .csv was saved as UTF-8)
f_csv_in = open(csv_in, "r", encoding="utf-8")
Below are two lines that should be one with the problem non-ascii characters visible.
These two lines should be one line. Notice the $ at the end of line 37, and line 38 begins with ^I^I.
Part of the problem, that vi is showing, is that there is a new line $ on line 37 where I don't want it to be. This should be one line.
37 Cancelled,01-19-17,,basket,00-00-00,00-00-00,,,,98533,SingleSource,,,17035 Cherry Hill Dr,"L/o 1-19-17 # 11:45am$
38 ^I^IVictorville",SAN BERNARDINO,CA,92395,,,,,0,,,,,Lock:6111 ,,,No,No,,0.00,0.00,No,01-19-17,0.00,0.00,,01-19-17,00-00-00,,provider,,,Unread,00-00-00,,$

A simple way to remove non-ascii chars could be doing:
new_text = "".join([c for c in text if c.isascii()])
NB: If you are reading this text from a file, make sure you read it with the correct encoding

In the case of non-printable characters, the built-in string module has some ways of filtering out non-printable or non-ascii characters, eg. with the isprintable() functionality.
A concise way of filtering the whole string at once is presented below
>>> import string
>>>
>>> str1 = '\nsomestring'
>>> str1.isprintable()
False
>>> str2 = 'otherstring'
>>> str2.isprintable()
True
>>>
>>> res = filter(lambda x: x in string.printable, '\x01mystring')
>>> "".join(list(res))
'mystring'
This question has had some discussion on SO in the past, but there are many ways to do things, so I understand it may be confusing, since you can use anything from Regular Expressions to str.translate()
Another thing one could do is to take a look at Unicode Categories, and filter out your data based on the set of symbols you need.

It looks as if you have a csv file that contains quoted values, that is values such as embedded commas or newlines which have to be surrounded with quotes so that csv readers handle them correctly.
If you look at the example data you can see there's an opening doublequote but no closing doublequote at the end of the first line, and a closing doublequote with no opening doublequote on the second line, indicating that the quotes contain a value with an embedded newline.
The fact that the lines are broken in two may be an artefact of the application used to view them, or the code that's processing them: if the software doesn't understand csv quoting it will assume each newline character denotes a new line.
It's not clear exactly what problem this is causing in the database, but it's quite likely that quote characters - especially unmatched quotes - could be causing a problem, particularly if the data isn't being properly escaped before insertion.
This snippet rewrites the file, removing embedded commas, newlines and tabs, and instructs the writer not to quote any values. It will fail with the error message _csv.Error: need to escape, but no escapechar set if it finds a value that needs to be escaped. Depending on your data, you may need to adjust the regex pattern.
with open('lines.csv') as f, open('fixed.csv', 'w') as out:
reader = csv.reader(f)
writer = csv.writer(out, quoting=csv.QUOTE_NONE)
for line in reader:
new_row = [re.sub(r'\t|\n|,', ' ', x) for x in line]
writer.writerow(new_row)

Another approach using re, python to filter non printable ASCII character:
import re
import string
string_with_printable = re.sub(f'[^{re.escape(string.printable)}]', '', original_string)
re.escape escapes special characters in the given pattern.

Related

Replace all newline characters using python

I am trying to read a pdf using python and the content has many newline (crlf) characters. I tried removing them using below code:
from tika import parser
filename = 'myfile.pdf'
raw = parser.from_file(filename)
content = raw['content']
content = content.replace("\r\n", "")
print(content)
But the output remains unchanged. I tried using double backslashes also which didn't fix the issue. can someone please advise?
content = content.replace("\\r\\n", "")
You need to double escape them.
I don't have access to your pdf file, so I processed one on my system. I also don't know if you need to remove all new lines or just double new lines. The code below remove double new lines, which makes the output more readable.
Please let me know if this works for your current needs.
from tika import parser
filename = 'myfile.pdf'
# Parse the PDF
parsedPDF = parser.from_file(filename)
# Extract the text content from the parsed PDF
pdf = parsedPDF["content"]
# Convert double newlines into single newlines
pdf = pdf.replace('\n\n', '\n')
#####################################
# Do something with the PDF
#####################################
print (pdf)
If you are having issues with different forms of line break, try the str.splitlines() function and then re-join the result using the string you're after. Like this:
content = "".join(l for l in content.splitlines() if l)
Then, you just have to change the value within the quotes to what you need to join on.
This will allow you to detect all of the line boundaries found here.
Be aware though that str.splitlines() returns a list not an iterator. So, for large strings, this will blow out your memory usage.
In those cases, you are better off using the file stream or io.StringIO and read line by line.
print(open('myfile.txt').read().replace('\n', ''))
When you write something like t.replace("\r\n", "") python will look for a carriage-return followed by a new-line.
Python will not replace carriage returns by themselves or replace new-line characters by themselves.
Consider the following:
t = "abc abracadabra abc"
t.replace("abc", "x")
Will t.replace("abc", "x") replace every occurrence of the letter a with the letter x? No
Will t.replace("abc", "x") replace every occurrence of the letter b with the letter x? No
Will t.replace("abc", "x") replace every occurrence of the letter c with the letter x? No
What will t.replace("abc", "x") do?
t.replace("abc", "x") will replace the entire string "abc" with the letter "x"
Consider the following:
test_input = "\r\nAPPLE\rORANGE\nKIWI\n\rPOMEGRANATE\r\nCHERRY\r\nSTRAWBERRY"
t = test_input
for _ in range(0, 3):
t = t.replace("\r\n", "")
print(repr(t))
result2 = "".join(test_input.split("\r\n"))
print(repr(result2))
The output sent to the console is as follows:
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
Note that:
str.replace() replaces every occurrence of the target string, not just the left-most occurrence.
str.replace() replaces the target string, but not every character of the target string.
If you want to delete all new-line and carriage returns, something like the following will get the job done:
in_string = "\r\n-APPLE-\r-ORANGE-\n-KIWI-\n\r-POMEGRANATE-\r\n-CHERRY-\r\n-STRAWBERRY-"
out_string = "".join(filter(lambda ch: ch not in "\n\r", in_string))
print(repr(out_string))
# prints -APPLE--ORANGE--KIWI--POMEGRANATE--CHERRY--STRAWBERRY-
You can also just use
text = '''
As she said these words her foot slipped, and in another moment, splash! she
was up to her chin in salt water. Her first idea was that she had somehow
fallen into the sea, “and in that case I can go back by railway,”
she said to herself.”'''
text = ' '.join(text.splitlines())
print(text)
# As she said these words her foot slipped, and in another moment, splash! she was up to her chin in salt water. Her first idea was that she had somehow fallen into the sea, “and in that case I can go back by railway,” she said to herself.”
#write a file
enter code here
write_File=open("sample.txt","w")
write_File.write("line1\nline2\nline3\nline4\nline5\nline6\n")
write_File.close()
#open a file without new line of the characters
open_file=open("sample.txt","r")
open_new_File=open_file.read()
replace_string=open_new_File.replace("\n",." ")
print(replace_string,end=" ")
open_file.close()
OUTPUT
line1 line2 line3 line4 line5 line6

A way to remove all occurrences of words within brackets in a string?

I'm trying to find a way to delete all mentions of references in a text file.
I haven't tried much, as I am new to Python but thought that this is something that Python could do.
def remove_bracketed_words(text_from_file: string) -> string:
"""Remove all occurrences of words with brackets surrounding them,
including the brackets.
>>> remove_bracketed_words("nonsense (nonsense, 2015)")
"nonsense "
>>> remove_bracketed_words("qwerty (qwerty) dkjah (Smith, 2018)")
"qwerty dkjah "
"""
with open('random_text.txt') as file:
wholefile = f.read()
for '(' in
I have no idea where to go from here or if what I've done is right. Any suggestions would be helpful!
You'll have an easier time with a text editing program that handles regular expressions, like Notepad++, than learning Python for this one task (reading in a file, correcting fundamental errors like for '(' in..., etc.). You can even use tools available online for this, such as RegExr (a regular expression tester). In RegExr, write an appropriate expression into the "expression" field and paste your text into the "text" field. Then, in the "tools" area below the text, choose the "replace" option and remove the placeholder expression. Your cleaned-up text will appear there.
You're looking for a space, then a literal opening parenthesis, then some characters, then a comma, then a year (let's just call that 3 or 4 digits), then a literal closing parenthesis, so I'd suggest the following expression:
\(.*?, \d{3,4}\)
This will preserve non-citation parenthesized text and remove the leading space before a citation.
Try re
>>> import re
>>> re.sub(r'\(.*?\)', '', 'nonsense (nonsense, 2015)')
'nonsense '
>>> re.sub(r'\(.*?\)', '', 'qwerty (qwerty) dkjah (Smith, 2018)')
'qwerty dkjah '
import re
def remove_bracketed_words(text_from_file: string) -> string:
"""Remove all occurrences of words with brackets surrounding them,
including the brackets.
>>> remove_bracketed_words("nonsense (nonsense, 2015)")
"nonsense "
>>> remove_bracketed_words("qwerty (qwerty) dkjah (Smith, 2018)")
"qwerty dkjah "
"""
with open('random_text.txt', 'r') as file:
wholefile = file.read()
# Be care for use 'w', it will delete raw data.
whth open('random_text.txt', 'w') as file:
file.write(re.sub(r'\(.*?\)', '', wholefile))

function to get rid of delimeters in python

I have a function where the user passes in a file and a String and the code should get rid of the specificed delimeters. I am having trouble finishing the part where I loop through my code and get rid of each of the replacements. I will post the code down below
def forReader(filename):
try:
# Opens up the file
file = open(filename , "r")
# Reads the lines in the file
read = file.readlines()
# closes the files
file.close()
# loops through the lines in the file
for sentence in read:
# will split each element by a spaace
line = sentence.split()
replacements = (',', '-', '!', '?' '(' ')' '<' ' = ' ';')
# will loop through the space delimited line and get rid of
# of the replacements
for sentences in line:
# Exception thrown if File does not exist
except FileExistsError:
print('File is not created yet')
forReader("mo.txt")
mo.txt
for ( int i;
After running the filemo.txt I would like for the output to look like this
for int i
Here's a way to do this using regex. First, we create a pattern consisting of all the delimiter characters, being careful to escape them, since several of those characters have special meaning in a regex. Then we can use re.sub to replace each delimiter with an empty string. This process can leave us with two or more adjacent spaces, which we then need to replace with a single space.
The Python re module allows us to compile patterns that are used frequently. Theoretically, this can make them more efficient, but it's a good idea to test such patterns against real data to see if it does actually help. :)
import re
delimiters = ',-!?()<=;'
# Make a pattern consisting of all the delimiters
pat = re.compile('|'.join(re.escape(c) for c in delimiters))
s = 'for ( int i;'
# Remove the delimiters
z = pat.sub('', s)
#Clean up any runs of 2 or more spaces
z = re.sub(r'\s{2,}', ' ', z)
print(z)
output
for int i

Python re.sub with a string that contains a special character without modifying the string

My problem is that in order to do sub I need to escape a special character. But I don't want to modify the string that I'm substituting. Does Python have way to handle this?
fstring = r'C:\Temp\1_file.txt' #this is the new data that I want to substitute in
old_data = r'Some random text\n .*' #I'm looking for this in the file that I'll read
new_data = r'Some random text\n '+fstring #I want to change it to this
f = open(myfile,'r') #open the file
filedata = f.read() #read the file
f.close()
newfiledata = re.sub(old_data,new_data,filedata) #substitute the new data
An error is returned becasue the "\1" in the in "fstring" is seen as a group object.
error: invalid group reference
Escape eventual backslashes:
new_data = r'Some random text\n ' + fstring.replace('\\', r'\\')
\1 would normally imply a backreference to group 1 of the regex match, but that is not what you intend here (in fact no group 1, which is the reason for the error), you'll, therefore, need to escape the string so re does not treat the \1 as a metacharacter:
fstring = r'C:\Temp\\1_file.txt'

Replace with abbreviations from dictionary using Python

I'm trying to replace words like 'rna' with 'ribonucleic acid' from a dictionary of abbreviations. I tried writing the following, but it doesn't replace the abbreviations.
import csv,re
outfile = open ("Dict.txt", "w")
with open('Dictionary.csv', mode='r') as infile:
reader = csv.reader(infile)
mydict = {rows[0]:rows[1] for rows in reader}
print >> outfile, mydict
out = open ("out.txt", "w")
ss = open ("trial.csv", "r").readlines()
s = str(ss)
def process(s):
da = ''.join( mydict.get( word, word ) for word in re.split( '(\W+)', s ) )
print >> out, da
process(s)
A sample trial.csv file would be
A,B,C,D
RNA,lung cancer,15,biotin
RNA,lung cancer,15,biotin
RNA,breast cancer,15,biotin
RNA,breast cancer,15,biotin
RNA,lung cancer,15,biotin
Sample Dictionary.csv:
rna,ribonucleic acid
rnd,radical neck dissection
rni,recommended nutrient intake
rnp,ribonucleoprotein
My output file should have 'RNA' replaced by 'ribonucleic acid'
I'm trying to replace 'RNA' but my dictionary has 'rna'. Is there a way I can ignore the case.
Sure. Just call casefold on each key while creating the dictionary, and again while looking up values:
mydict = {rows[0].casefold(): rows[1] for rows in reader}
# ...
da = ''.join( mydict.get(word.casefold(), word) for word in re.split( '(\W+)', s ) )
If you're using an older version of Python that doesn't have casefold (IIRC, it was added in 2.7 and 3.2, but it may have been later than that…), use lower instead. It won't always do the right thing for non-English characters (e.g., 'ß'.casefold() is 'ss', while 'ß'.lower() is 'ß'), but it seems like that's OK for your application. (If it's not, you have to either write something more complicated with unicodedata, or find a third-party library.)
Also, I don't want it to replace 'corna' (I know such a word doesn't exist, but I want to make sure it doesn't happen) with 'coribonucleic acid'.
Well, you're already doing that with your re.split, which splits on any "non-word" characters; you then look up each resulting word separtely. Since corna won't be in the dict, it won't be replaced. (Although note that re's notion of "word" characters may not actually be what you want—it includes underscores and digits as part of a word, so rna2dna won't match, while a chunk of binary data like s1$_2(rNa/ might.)
You've also got another serious problem in your code:
ss = open ("trial.csv", "r").readlines()
s = str(ss)
Calling readlines means that ss is going to be a list of lines. Calling str on that list means that s is going to be a big string with [, then the repr of each line (with quotes around it, backslash escapes within it, etc.) separated by commas, then ]. You almost certainly don't want that. Just use read() if you want to read the whole file into a string as-is.
And you appear to have a problem in your data, too:
rna,ibonucleic acid
If you replace rna with ibonucleic acid, and so forth, you're going to have some hard-to-read output. If this is really your dictionary format, and the dictionary's user is supposed to infer some logic, e.g., that the first letter gets copied from the abbreviation, you have to write that logic. For example:
def lookup(word):
try:
return word[0] + mydict[word.casefold()]
except KeyError:
return word
da = ''.join(lookup(word) for word in re.split('(\W+), s))
Finally, it's a bad idea to use unescaped backslashes in a string literal. In this case, you get away with it, because Python happens to not have a meaning for \W, but that's not always going to be true. The best way around this is to use raw string literals, like r'(\W+)'.
I think this line s = str(ss) is causing the problem - the list that was created just became a string!
Try this instead:
def process(ss):
for line in ss:
da = ''.join( mydict.get( word, word ) for word in re.split( '(\W+)', line ) )
print >> out, da
process(ss)

Categories

Resources