Replace with abbreviations from dictionary using Python

I'm trying to replace words like 'rna' with 'ribonucleic acid' from a dictionary of abbreviations. I tried writing the following, but it doesn't replace the abbreviations.
import csv,re
outfile = open ("Dict.txt", "w")
with open('Dictionary.csv', mode='r') as infile:
    reader = csv.reader(infile)
    mydict = {rows[0]:rows[1] for rows in reader}
    print >> outfile, mydict
out = open ("out.txt", "w")
ss = open ("trial.csv", "r").readlines()
s = str(ss)
def process(s):
    da = ''.join( mydict.get( word, word ) for word in re.split( '(\W+)', s ) )
    print >> out, da
process(s)
A sample trial.csv file would be
A,B,C,D
RNA,lung cancer,15,biotin
RNA,lung cancer,15,biotin
RNA,breast cancer,15,biotin
RNA,breast cancer,15,biotin
RNA,lung cancer,15,biotin
Sample Dictionary.csv:
rna,ribonucleic acid
rnd,radical neck dissection
rni,recommended nutrient intake
rnp,ribonucleoprotein
My output file should have 'RNA' replaced by 'ribonucleic acid'

I'm trying to replace 'RNA' but my dictionary has 'rna'. Is there a way I can ignore the case.
Sure. Just call casefold on each key while creating the dictionary, and again while looking up values:
mydict = {rows[0].casefold(): rows[1] for rows in reader}
# ...
da = ''.join( mydict.get(word.casefold(), word) for word in re.split( '(\W+)', s ) )
If you're using a version of Python that doesn't have casefold (it was added in Python 3.3 and does not exist in Python 2 at all, which your print >> statements suggest you are using), use lower instead. It won't always do the right thing for non-English characters (e.g., 'ß'.casefold() is 'ss', while 'ß'.lower() is 'ß'), but it seems like that's OK for your application. (If it's not, you have to either write something more complicated with unicodedata, or find a third-party library.)
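As a quick illustration of the difference, here is a minimal sketch (Python 3 only, since str.casefold does not exist in Python 2; the variable names are just for illustration):

```python
# casefold() is a more aggressive lower(): it also folds characters
# such as the German sharp s, which lower() leaves unchanged.
sharp_s_folded = 'ß'.casefold()
sharp_s_lowered = 'ß'.lower()

# For plain ASCII abbreviations the two methods give the same result.
abbreviation_key = 'RNA'.casefold()

print(sharp_s_folded, sharp_s_lowered, abbreviation_key)
```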
Also, I don't want it to replace 'corna' (I know such a word doesn't exist, but I want to make sure it doesn't happen) with 'coribonucleic acid'.
Well, you're already doing that with your re.split, which splits on any "non-word" characters; you then look up each resulting word separately. Since corna won't be in the dict, it won't be replaced. (Although note that re's notion of "word" characters may not actually be what you want: it includes underscores and digits as part of a word, so rna2dna won't match, while a chunk of binary data like s1$_2(rNa/ might.)
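Putting the two points together, a rough Python 3 sketch of the whole replacement could look like this. The in-memory strings are hypothetical stand-ins for Dictionary.csv and trial.csv, and load_abbreviations and expand are names made up for this example:

```python
import csv
import io
import re

# Hypothetical stand-ins for Dictionary.csv and trial.csv
DICTIONARY_CSV = "rna,ribonucleic acid\nrnd,radical neck dissection\n"
TRIAL_TEXT = "A,B,C,D\nRNA,lung cancer,15,biotin\ncorna,rna2dna,15,biotin\n"

def load_abbreviations(csv_text):
    """Build a case-insensitive lookup table from two-column CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0].casefold(): row[1] for row in reader if len(row) >= 2}

def expand(text, abbreviations):
    """Replace whole words found in the table; keep everything else."""
    # The capturing group makes re.split keep the separators, so joining
    # the pieces reproduces the original text around the replacements.
    pieces = re.split(r'(\W+)', text)
    return ''.join(abbreviations.get(p.casefold(), p) for p in pieces)

abbreviations = load_abbreviations(DICTIONARY_CSV)
result = expand(TRIAL_TEXT, abbreviations)
print(result)
```

Here 'RNA' is expanded, while 'corna' and 'rna2dna' are left alone because each is looked up as a whole word.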
You've also got another serious problem in your code:
ss = open ("trial.csv", "r").readlines()
s = str(ss)
Calling readlines means that ss is going to be a list of lines. Calling str on that list means that s is going to be a big string with [, then the repr of each line (with quotes around it, backslash escapes within it, etc.) separated by commas, then ]. You almost certainly don't want that. Just use read() if you want to read the whole file into a string as-is.
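To make that concrete, here is a small Python 3 sketch that uses io.StringIO as a stand-in for the real file (the sample text is made up):

```python
import io

# Stand-in for open("trial.csv"); the contents are illustrative only.
fake_file = io.StringIO("RNA,lung cancer\nRNA,breast cancer\n")

lines = fake_file.readlines()   # a list with one string per line
as_str = str(lines)             # the repr of that list, brackets and all

fake_file.seek(0)
whole = fake_file.read()        # the file contents as one plain string

print(as_str)
print(whole)
```

The first print shows the brackets, quotes and escaped \n sequences that end up in s; the second shows the text as it actually is in the file.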
And you appear to have a problem in your data, too:
rna,ibonucleic acid
If you replace rna with ibonucleic acid, and so forth, you're going to have some hard-to-read output. If this is really your dictionary format, and the dictionary's user is supposed to infer some logic, e.g., that the first letter gets copied from the abbreviation, you have to write that logic. For example:
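def lookup(word):
    try:
        # word[:1] instead of word[0], so the empty strings that
        # re.split can produce don't raise IndexError
        return word[:1] + mydict[word.casefold()]
    except KeyError:
        return word
da = ''.join(lookup(word) for word in re.split('(\W+)', s))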
Finally, it's a bad idea to use unescaped backslashes in a string literal. In this case, you get away with it, because Python happens to not have a meaning for \W, but that's not always going to be true. The best way around this is to use raw string literals, like r'(\W+)'.
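A case where the missing r prefix does bite is \b, which Python itself interprets as a backspace. A short sketch (the sample text is made up):

```python
import re

text = 'rna corna'

# Without the r prefix, '\b' is the backspace character (one character),
# so the pattern looks for literal backspaces and finds nothing.
plain_pattern = '\brna\b'

# With the r prefix, the backslash reaches the regex engine, where \b
# means "word boundary".
raw_pattern = r'\brna\b'

plain_matches = re.findall(plain_pattern, text)
raw_matches = re.findall(raw_pattern, text)

print(len('\b'), len(r'\b'))
print(plain_matches, raw_matches)
```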

I think this line s = str(ss) is causing the problem - the list that was created just became a string!
Try this instead:
def process(ss):
    for line in ss:
        da = ''.join( mydict.get( word, word ) for word in re.split( '(\W+)', line ) )
        print >> out, da
process(ss)

Related

Replace all newline characters using python

I am trying to read a pdf using python and the content has many newline (crlf) characters. I tried removing them using below code:
from tika import parser
filename = 'myfile.pdf'
raw = parser.from_file(filename)
content = raw['content']
content = content.replace("\r\n", "")
print(content)
But the output remains unchanged. I also tried using double backslashes, which didn't fix the issue. Can someone please advise?
content = content.replace("\\r\\n", "")
You need to double escape them.
I don't have access to your pdf file, so I processed one on my system. I also don't know if you need to remove all new lines or just double new lines. The code below removes double new lines, which makes the output more readable.
Please let me know if this works for your current needs.
from tika import parser
filename = 'myfile.pdf'
# Parse the PDF
parsedPDF = parser.from_file(filename)
# Extract the text content from the parsed PDF
pdf = parsedPDF["content"]
# Convert double newlines into single newlines
pdf = pdf.replace('\n\n', '\n')
#####################################
# Do something with the PDF
#####################################
print (pdf)
If you are having issues with different forms of line break, try the str.splitlines() function and then re-join the result using the string you're after. Like this:
content = "".join(l for l in content.splitlines() if l)
Then, you just have to change the value within the quotes to what you need to join on.
This will allow you to detect all of the line boundaries found here.
Be aware though that str.splitlines() returns a list not an iterator. So, for large strings, this will blow out your memory usage.
In those cases, you are better off using the file stream or io.StringIO and read line by line.
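As a small sketch of the splitlines approach on text with mixed line endings (the sample string is made up):

```python
# A made-up string mixing \r\n, \r, \n and an empty line
content = "first\r\nsecond\rthird\n\nfourth"

# splitlines() recognises all of these as line boundaries
lines = content.splitlines()

# Drop the empty entries and join with whatever separator is needed
joined = " ".join(line for line in lines if line)

print(lines)
print(joined)
```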
print(open('myfile.txt').read().replace('\n', ''))
When you write something like t.replace("\r\n", "") python will look for a carriage-return followed by a new-line.
Python will not replace carriage returns by themselves or replace new-line characters by themselves.
Consider the following:
t = "abc abracadabra abc"
t.replace("abc", "x")
Will t.replace("abc", "x") replace every occurrence of the letter a with the letter x? No
Will t.replace("abc", "x") replace every occurrence of the letter b with the letter x? No
Will t.replace("abc", "x") replace every occurrence of the letter c with the letter x? No
What will t.replace("abc", "x") do?
t.replace("abc", "x") will replace the entire string "abc" with the letter "x"
Consider the following:
test_input = "\r\nAPPLE\rORANGE\nKIWI\n\rPOMEGRANATE\r\nCHERRY\r\nSTRAWBERRY"
t = test_input
for _ in range(0, 3):
    t = t.replace("\r\n", "")
    print(repr(t))
result2 = "".join(test_input.split("\r\n"))
print(repr(result2))
The output sent to the console is as follows:
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
Note that:
str.replace() replaces every occurrence of the target string, not just the left-most occurrence.
str.replace() replaces the target string, but not every character of the target string.
If you want to delete all new-line and carriage returns, something like the following will get the job done:
in_string = "\r\n-APPLE-\r-ORANGE-\n-KIWI-\n\r-POMEGRANATE-\r\n-CHERRY-\r\n-STRAWBERRY-"
out_string = "".join(filter(lambda ch: ch not in "\n\r", in_string))
print(repr(out_string))
# prints -APPLE--ORANGE--KIWI--POMEGRANATE--CHERRY--STRAWBERRY-
You can also just use
text = '''
As she said these words her foot slipped, and in another moment, splash! she
was up to her chin in salt water. Her first idea was that she had somehow
fallen into the sea, “and in that case I can go back by railway,”
she said to herself.”'''
text = ' '.join(text.splitlines())
print(text)
# As she said these words her foot slipped, and in another moment, splash! she was up to her chin in salt water. Her first idea was that she had somehow fallen into the sea, “and in that case I can go back by railway,” she said to herself.”
# write a file
write_File=open("sample.txt","w")
write_File.write("line1\nline2\nline3\nline4\nline5\nline6\n")
write_File.close()
# open the file and replace the newline characters with spaces
open_file=open("sample.txt","r")
open_new_File=open_file.read()
replace_string=open_new_File.replace("\n"," ")
print(replace_string,end=" ")
open_file.close()
OUTPUT
line1 line2 line3 line4 line5 line6

How to find/replace non printable / non-ascii characters using Python 3?

I have a file, some lines in a .csv file that are jamming up a database import because of funky characters in some field in the line.
I have searched, found articles on how to replace non-ascii characters in Python 3, but nothing works.
When I open the file in vi and do :set list, there is a $ at the end of a line where there should not be, and ^I^I at the beginning of the next line. The two lines should be one joined line and no ^I there. I know that $ is end of line '\n' and have tried to replace those, but nothing works.
I don't know what the ^I represents, possibly a tab.
I have tried this function to no avail:
def remove_non_ascii(text):
    new_text = re.sub(r"[\n\t\r]", "", text)
    new_text = ''.join(new_text.split("\n"))
    new_text = ''.join([i if ord(i) < 128 else ' ' for i in new_text])
    new_text = "".join([x for x in new_text if ord(x) < 128])
    new_text = re.sub(r'[^\x00-\x7F]+', ' ', new_text)
    new_text = new_text.rstrip('\r\n')
    new_text = new_text.strip('\n')
    new_text = new_text.strip('\r')
    new_text = new_text.strip('\t')
    new_text = new_text.replace('\n', '')
    new_text = new_text.replace('\r', '')
    new_text = new_text.replace('\t', '')
    new_text = filter(lambda x: x in string.printable, new_text)
    new_text = "".join(list(new_text))
    return new_text
Is there some tool that will show me exactly what this offending character is, and then a method to replace it?
I am opening the file like so (the .csv was saved as UTF-8)
f_csv_in = open(csv_in, "r", encoding="utf-8")
Below are two lines that should be one with the problem non-ascii characters visible.
These two lines should be one line. Notice the $ at the end of line 37, and line 38 begins with ^I^I.
Part of the problem, that vi is showing, is that there is a new line $ on line 37 where I don't want it to be. This should be one line.
37 Cancelled,01-19-17,,basket,00-00-00,00-00-00,,,,98533,SingleSource,,,17035 Cherry Hill Dr,"L/o 1-19-17 # 11:45am$
38 ^I^IVictorville",SAN BERNARDINO,CA,92395,,,,,0,,,,,Lock:6111 ,,,No,No,,0.00,0.00,No,01-19-17,0.00,0.00,,01-19-17,00-00-00,,provider,,,Unread,00-00-00,,$
A simple way to remove non-ascii chars could be doing:
new_text = "".join([c for c in text if c.isascii()])
NB: If you are reading this text from a file, make sure you read it with the correct encoding
In the case of non-printable characters, the str type and the built-in string module offer some ways of filtering out non-printable or non-ascii characters, e.g. the str.isprintable() method and the string.printable constant.
A concise way of filtering the whole string at once is presented below
>>> import string
>>>
>>> str1 = '\nsomestring'
>>> str1.isprintable()
False
>>> str2 = 'otherstring'
>>> str2.isprintable()
True
>>>
>>> res = filter(lambda x: x in string.printable, '\x01mystring')
>>> "".join(list(res))
'mystring'
This question has had some discussion on SO in the past, but there are many ways to do things, so I understand it may be confusing, since you can use anything from Regular Expressions to str.translate()
Another thing one could do is to take a look at Unicode Categories, and filter out your data based on the set of symbols you need.
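To answer the "which character is it" part directly, one possible sketch uses unicodedata to report every control or non-ASCII character with its position. The function name describe_suspicious and the sample text are made up for this example:

```python
import unicodedata

def describe_suspicious(text):
    """List (index, repr, category, name) for control and non-ASCII chars."""
    report = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        # Categories starting with "C" are control/format/unassigned etc.
        if ord(ch) > 127 or category.startswith("C"):
            name = unicodedata.name(ch, "<unnamed>")
            report.append((index, repr(ch), category, name))
    return report

# Made-up sample loosely modelled on the two broken lines
sample = 'L/o 1-19-17 # 11:45am\n\t\tVictorville caf\u00e9'
report = describe_suspicious(sample)
for entry in report:
    print(entry)
```

Control characters such as newline and tab show up with category Cc and no Unicode name, which makes them easy to tell apart from genuine non-ASCII letters.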
It looks as if you have a csv file that contains quoted values, that is, values containing embedded commas or newlines, which have to be surrounded with quotes so that csv readers handle them correctly.
If you look at the example data you can see there's an opening doublequote but no closing doublequote at the end of the first line, and a closing doublequote with no opening doublequote on the second line, indicating that the quotes contain a value with an embedded newline.
The fact that the lines are broken in two may be an artefact of the application used to view them, or the code that's processing them: if the software doesn't understand csv quoting it will assume each newline character denotes a new line.
It's not clear exactly what problem this is causing in the database, but it's quite likely that quote characters - especially unmatched quotes - could be causing a problem, particularly if the data isn't being properly escaped before insertion.
This snippet rewrites the file, removing embedded commas, newlines and tabs, and instructs the writer not to quote any values. It will fail with the error message _csv.Error: need to escape, but no escapechar set if it finds a value that needs to be escaped. Depending on your data, you may need to adjust the regex pattern.
import csv
import re

with open('lines.csv', newline='') as f, open('fixed.csv', 'w', newline='') as out:
    reader = csv.reader(f)
    writer = csv.writer(out, quoting=csv.QUOTE_NONE)
    for line in reader:
        # Replace embedded tabs, newlines and commas with a space
        new_row = [re.sub(r'\t|\n|,', ' ', x) for x in line]
        writer.writerow(new_row)
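To see why a csv-aware reader treats the two physical lines as a single record, here is a small self-contained sketch. The data is a shortened, made-up version of the lines in the question, and io.StringIO stands in for the file:

```python
import csv
import io
import re

# Two physical lines, but the quoted field spans the line break
raw = 'Cancelled,"L/o 1-19-17\n\t\tVictorville",CA\nOpen,plain,NY\n'

# newline='' keeps the embedded newline exactly as it is in the data
rows = list(csv.reader(io.StringIO(raw, newline='')))

# Collapse any run of whitespace inside a value into a single space
cleaned = [[re.sub(r'\s+', ' ', value) for value in row] for row in rows]

print(len(rows))
print(cleaned[0])
```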
Another approach, using Python's re module to strip every character that is not printable ASCII:
import re
import string
string_with_printable = re.sub(f'[^{re.escape(string.printable)}]', '', original_string)
re.escape escapes special characters in the given pattern.

pandas read_table with regex header definition

For the data file formatted like this:
("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")
0 0.55432343242 0.34323443432242 0.00001
I can use pd.read_table(filename, sep=' ', header=0) and it will get everything correct except for the very first header, "Time Step".
Is there a way to specify a regex string for read_table() to use to parse out the header names?
I know a way to solve the issue is to just use regex to create a list of names for the read_table() function to use, but I figured there might/should be a way to directly express that in the import itself.
Edit: Here's what it returns as headers:
['("Time', 'Step"', 'courantnumber_max', 'courantnumber_avg', 'flow-time']
So it doesn't appear to be actually possible to do this inside the pandas.read_table() function. Below is posted the actual solution I ended up using to fix the problem:
import re

def get_headers(file, headerline, regexstring, exclude):
    # Get string of selected headerline
    with file.open() as f:
        for i, line in enumerate(f):
            if i == headerline-1:
                headerstring = line
            elif i > headerline-1:
                break
    # Parse headerstring
    reglist = re.split(regexstring, headerstring)
    # Filter entries in reglist
    # filter out blank strs
    filteredlist = list(filter(None, reglist))
    # filter out items in exclude list (if exclude is empty or None,
    # every non-blank entry is kept)
    headerslist = [entry for entry in filteredlist
                   if entry not in (exclude or [])]
    return headerslist

get_headers(filename, 3, r'(?:" ")|["\)\(]', ['\n'])
Code explanation:
get_headers():
Arguments: file is a path object with an open() method, such as a pathlib.Path, pointing at the file that contains the header (the function calls file.open()). headerline is the line number (starting at 1) on which the header names appear. regexstring is the pattern that will be fed into re.split(); it is highly recommended that you prefix the pattern with r to make it a raw string. exclude is a list of miscellaneous strings that you want removed from the header list.
The regex pattern I used:
First up we have the pipe (|) symbol. It separates the "normal" split pattern (the " " between two names) from the other characters that need to be removed (namely the quotes and parentheses).
Starting with the first group: (?:" "). We have the (...) since we want to match those characters in order. The " " is what we want to match as the stuff to split around. The ?: basically says to not capture the contents of the group. This is important/useful as otherwise re.split() will keep any groups as a separate item. See re.split() in documentation.
The second group is simply the other characters. Without them, the first and last items would be '("Time Step' and 'flow-time)\n'. Note that this causes \n to be treated as a separate entry to the list. This is why we use the exclude argument to fix that up after the fact.
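The splitting step can be tried on its own against the header line from the question. This sketch uses only the standard library; the re.findall variant is an alternative added here for comparison, not part of the original solution:

```python
import re

header = '("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")\n'

# The pattern from the answer: split on '" "' or on any quote/parenthesis
pieces = re.split(r'(?:" ")|["\)\(]', header)

# Drop empty strings and the trailing newline entry
names_from_split = [p for p in pieces if p and p != '\n']

# Alternative: capture whatever sits between each pair of double quotes
names_from_findall = re.findall(r'"([^"]*)"', header)

print(pieces)
print(names_from_split)
```

Either list could then be handed to pd.read_table through its names parameter, skipping the original header row with skiprows.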

How to encode a python list

I'm having a hard time trying to encode a python list, I already did it with a text file in order to count specific words inside it, using re module.
This is the code:
# encoding text file
with codecs.open('projectsinline.txt', 'r', encoding="utf-8") as f:
    for line in f:
        # Using re module to extract specific words
        unicode_pattern = re.compile(r'\b\w{4,20}\b', re.UNICODE)
        result = unicode_pattern.findall(line)
    word_counts = Counter(result) # It creates a dictionary key and wordCount
    Allwords = []
    for clave in word_counts:
        if word_counts[clave] >= 10: # We look for the most repeated words
            word = clave
            Allwords.append(word)
    print Allwords
Part of the output looks like this:
[...u'recursos', u'Partidos', u'Constituci\xf3n', u'veh\xedculos', u'investigaci\xf3n', u'Pol\xedticos']
If I print the variable word, the output looks as it should. However, when I use append, all the words break again, as in the example above.
I use this example:
[x.encode("utf-8") for x in Allwords]
The output looks exactly the same as before.
I also use this example:
Allwords.append(str(word.encode("utf-8")))
The output changes, but the words don't look as they should:
[...'recursos', 'Partidos', 'Constituci\xc3\xb3n', 'veh\xc3\xadculos', 'investigaci\xc3\xb3n', 'Pol\xc3\xadticos']
Some of the answers have given this example:
print('[' + ', '.join(Allwords) + ']')
The output looks like this:
[...recursos, Partidos, Constitución, vehículos, investigación, Políticos]
To be honest I do not want to print the list, just encode it, so that all items (words) are recognized.
I'm looking for something like this:
[...'recursos', 'Partidos', 'Constitución', 'vehículos', 'investigación', 'Políticos']
Any suggestions to solve the problem are appreciated
Thanks,
You might want to try
print('[' + ', '.join(Allwords) + ']')
Your Unicode string list is correct. When you print a list, its items are displayed using repr(). When you print the items themselves, they are displayed using str(). It is only a display option, similar to printing integers as decimal or hexadecimal.
So print the individual words if you want to see them correctly; for comparisons the content is already correct.
It's worth noting that Python 3 changes the behavior of repr(): it now displays non-ASCII characters without escape codes if the terminal supports them directly. The ascii() function reproduces the Python 2 repr() behavior.
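A short Python 3 sketch of that difference, using a few of the words from the question:

```python
words = ['recursos', 'Constitución', 'vehículos']

# Printing a list shows the repr() of each item; in Python 3 that keeps
# non-ASCII characters readable.
list_display = str(words)

# Printing a single item shows its str() form.
item_display = str(words[1])

# ascii() escapes non-ASCII characters, like repr() did in Python 2.
escaped_display = ascii(words)

print(list_display)
print(item_display)
print(escaped_display)
```

In all three cases the strings stored in the list are identical; only the way they are displayed differs.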

Python: Equivalent of searchlines for characters/strings

I have a .txt doc full of text. I'd like to search it for specific characters (or ideally groups of characters, i.e. strings), then do things with the character found and with the characters 2 in front of / 4 behind the selected characters.
I made a version that searches lines for the character, but I can't find the equivalent for characters.
f = open("C:\Users\Calum\Desktop\Robopipe\Programming\data2.txt", "r")
searchlines = f.readlines()
f.close()
for i, line in enumerate(searchlines):
    if "_" in line:
        for l in searchlines[i:i+2]: print l, #if i+2 then prints line and the next
        print
If I understand the problem, what you want is to repeatedly search one giant string, instead of searching a list of strings one by one.
So, the first step is, don't use readlines, use read, so you get that one giant string in the first place.
Next, how do you repeatedly search for all matches in a string?
Well, a string is an iterable, just like a list is—it's an iterable of characters (which are themselves strings with length 1). So, you can just iterate over the string:
f = open(path)
searchstring = f.read()
f.close()
for i, ch in enumerate(searchstring):
    if ch == "_":
        print searchstring[i-4:i+2]
However, notice that this only works if you're only searching for a single-character match. And it will fail if you find a _ in the first four characters. And it can be inefficient to loop over a few MB of text character by character.* So, you probably want to instead loop over str.find:
i = 4
while True:
    i = searchstring.find("_", i)
    if i == -1:
        break
    print searchstring[i-4:i+2]
    # move past this match, otherwise find() returns the same index forever
    i += 1
* You may be wondering how find could possibly be doing anything but the same kind of loop. And you're right, it's still iterating character by character. But it's doing it in optimized code provided by the standard library—with the usual CPython implementation, this means the "inner loop" is in C code rather than Python code, it doesn't have to "box up" each character to test it, etc., so it can be much, much faster.
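The find loop also extends to multi-character search strings. Below is a rough Python 3 sketch; the function name contexts and its before/after parameters are made up for this example, and the slice is clamped so that a match near the start of the text still works:

```python
def contexts(text, needle, before=4, after=2):
    """Return each match of needle with some surrounding characters."""
    results = []
    start = 0
    while True:
        i = text.find(needle, start)
        if i == -1:
            break
        # max(0, ...) avoids a negative start index near the beginning
        results.append(text[max(0, i - before): i + len(needle) + after])
        # Continue searching just after the start of this match
        start = i + 1
    return results

found = contexts("hello_there my_wonderful_friend", "_")
print(found)
```

Note that before and after here count the characters on each side of the match, which differs slightly from the searchstring[i-4:i+2] slice above, where the end index includes the matched character itself.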
You could use a regex for this:
The regex searches for any two characters (that are not _), an _, then any four characters that are not an underscore.
import re
with open(path) as f:
    searchstring = f.read()
regex = re.compile("([^_]{2}_[^_]{4})")
for match in regex.findall(searchstring):
    print match
With the input of:
hello_there my_wonderful_friend
The script returns:
lo_ther
my_wond
ul_frie
