Replacing special patterns in a string, reading from a file - python

I'm trying to replace special patterns in a string with tabs. The string (if I may call it that) is the result of reading a file that has accents (I'm Portuguese, so the encoding is UTF-8 or Latin-1).
So imagine my input is:
Aubrecht, Christoph; Özceylan, Aubrecht Dilek; Klerx, Joachim; Freire, Sérgio (2013) “Future-oriented activities as a concept for improved disaster risk management. Disaster Advances”, 6(12), 1-10. (IF = 2.272) E-ISSN 2278-4543. REVISTA INDEXADA NO WEB OF SCIENCE
Aubrecht, Christoph; Özceylan, Dilek; Steinnocher, Klaus; Freire, Sérgio (2013), “Multi-level geospatial modeling of human exposure patterns and vulnerability indicators”. Natural Hazards, 68:147-163. (IF = 1.639).. ISSN: 0921-030X (print version). ISSN: 1573-0840 (electronic version. Accession Number: WOS:000322724000008
Some of those special patterns are:
') "' --> '\t'
'), "' --> '\t'
'),"' --> '\t'
') "' --> '\t'
'),«' --> '\t'
'), «' --> '\t'
') "' --> '\t'
Until now I've tried using a dictionary to replace all those patterns, but it happens that the dictionary doesn't recognize some of them. I know the re.sub function is the "man" for this (python replace space with special characters between strings), but that's cool when you have a predefined string; when you read from a file, how do you do it?
My code:
# -*- coding: utf-8 -*-
import Tkinter as tk
import codecs, string, sys, re

root = tk.Tk()
root.title("Final?")

f = open('INPUT TEXT', 'r')
with codecs.open('INPUT TEXT', encoding='latin1') as f:
    sentence = f.read()
if isinstance(sentence, unicode):
    sentence = sentence.encode('latin1')

def results1():
    print '\n', sentence
print results1, '\n'

key = {0:') "', 1:'replace'}
regx = re.compile('\t\t{[0]}\t\t'.format(key))
print( regx.sub(key[1],results1) )

def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i,j)
    return text

reps = {' (':'\t', ') "':'\t', '), "':'\t', '),"':'\t', ') "':'\t', '),«':'\t', '), «':'\t', ') "':'\t', 'p.':'\t', ',':' '}
converts = replace_all(sentence, reps)

def converts():
    sys.stdout = open('output.txt', 'w')
    converts = replace_all(sentence, reps)
    print '\n', converts

results = tk.Button(root, text='Resultados', width=25, command=resultadosnormais)
results.pack()
txt = tk.Button(root, text='Conversor resultados', width=25, command=conversortexto)
txt.pack()
root.mainloop()
I saw this post too, but I can't seem to apply it to my code specifically: Re.sub not working for me
Somehow it just prints where the function is stored in memory, and then it gives an error right after that:
File "C:\Users\Joao\Desktop\Tryout2.py", line 30, in <module>
regx = re.compile('\t\t{[0]}\t\t'.format(key))
error: unbalanced parenthesis
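For reference, a minimal Python 3 sketch of how re.sub could be applied to text read from a file (file names are hypothetical). The key point is re.escape: a bare ')' in a pattern is a regex metacharacter, which is exactly what triggers the unbalanced parenthesis error in the traceback above.

import re

# the literal patterns from above, longest first so '), "' is not
# shadowed by the shorter '),"'
patterns = sorted([') "', '), "', '),"', '),«', '), «'], key=len, reverse=True)

with open('input.txt', encoding='latin-1') as f:  # hypothetical file name
    sentence = f.read()

# re.escape neutralises '(' and ')', which otherwise raise the
# "unbalanced parenthesis" error seen in the traceback
regx = re.compile('|'.join(re.escape(p) for p in patterns))
converted = regx.sub('\t', sentence)

with open('output.txt', 'w', encoding='latin-1') as f:
    f.write(converted)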

Related

Problem with using spacy to lemmatize text and transform into CSV

I’m using Anaconda and I want to lemmatize, tokenize and morphologically annotate a text using spacy. I have a text file which I want to transform into a CSV file with all annotations etc. using the following script:
import os
import re
import csv
import glob
import spacy
from collections import Counter

nlp = spacy.load("de_core_news_md")

plaintextfolder = ""  # here would be my file path
taggedfolder = ""  # here would be my file path
language = "de"

doc = nlp("Dies ist ein Satz.")
for token in doc:
    print(token.text, token.pos_, token.lemma_, token.morph)

nlp = spacy.load("de_core_news_md")

def read_plaintext(file):
    with open(file, "r", encoding="utf-8") as infile:
        text = infile.read()
        text = re.sub("’", "'", text)
    return text

def save_tagged(taggedfolder, filename, tagged):
    taggedfilename = taggedfolder + "/" + filename + ".csv"
    with open(taggedfilename, "w", encoding="utf-8") as outfile:
        writer = csv.writer(outfile, delimiter='\t')
        for token in tagged:
            token = token.text, token.pos_, token.lemma_, token.morph
            writer.writerow(token)

def main(plaintextfolder, taggedfolder, language):
    print("\n--preprocess")
    if not os.path.exists(taggedfolder):
        os.makedirs(taggedfolder)
    counter = 0
    for file in glob.glob(plaintextfolder + "*.txt"):
        filename, ext = os.path.basename(file).split(".")
        counter += 1
        print("next: file", counter, ":", filename)
        text = read_plaintext(file)
        tagged = nlp(text)
        save_tagged(taggedfolder, filename, tagged)

main(plaintextfolder, taggedfolder, language)
What I would like to have at the end is a CSV file looking like this:
Dies PRON Dies Case=Nom|Gender=Neut|Number=Sing|PronType=Dem
ist AUX sein Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
ein DET einen Case=Nom|Definite=Ind|Gender=Masc|Number=Sing|PronType=Art
Satz NOUN Satz Case=Nom|Gender=Masc|Number=Sing
But I only get a weird CSV file looking like this (I copied only the first lines):
"' PUNCT ' "
"D'i'e's X D'i'e's Foreign=Yes"
"' PUNCT ' "
"' PUNCT ' "
"i's't X i's't Foreign=Yes"
"' PUNCT ' "
If you could help me with this issue, I would really appreciate it!
So as a general note: when asking on Stack Overflow, you should reduce your problem to the smallest part; there's too much going on in your code sample. That said...
I cannot reproduce your problem exactly, but I can get part of it. I was able to get some weird lines kind of similar to yours when I didn't call .strip() on the text. So you need to make sure that you don't pass newlines to the CSV writer - it's not escaping them correctly. For example, if my input file contains "Das ist gut", the output looks like this:
Das PRON der Case=Nom|Gender=Neut|Number=Sing|PronType=Dem
ist AUX sein Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
gut ADV gut Degree=Pos
"
" SPACE "
" Case=Nom|Gender=Masc|Number=Sing
This looks wrong, but what happens is that the fourth line just includes a \n inside quotes. The fifth line has the closing quote and then an opening quote for another quoted newline.
You may be able to fiddle with the newline settings in the csv writer to fix it, but it's probably easier to just strip the newlines; see the sketch below.
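As a rough sketch of that fix (reusing save_tagged from the question; token.is_space is spaCy's flag for whitespace-only tokens, and newline="" is how the csv module expects its files opened):

import csv

def save_tagged(taggedfolder, filename, tagged):
    taggedfilename = taggedfolder + "/" + filename + ".csv"
    # newline="" lets the csv module control line endings itself
    with open(taggedfilename, "w", encoding="utf-8", newline="") as outfile:
        writer = csv.writer(outfile, delimiter="\t")
        for token in tagged:
            if token.is_space:  # skip pure-whitespace tokens such as "\n"
                continue
            writer.writerow((token.text, token.pos_, token.lemma_, str(token.morph)))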
As for tokens like D'i'e's, that looks like either your input text was bad or something went wrong with your regex for quotes, though in its current form the regex looks OK.
Anyway, to understand what's going wrong, I suggest you debug your code step by step to understand what text you're reading in and what the input to the csvwriter is.

Locating and extracting a string from multiple text files in Python

I am just picking up and learning Python. For work I go through a lot of PDFs, so I found a PDFMiner tool that converts a directory of PDFs to text files. I then wrote the code below to tell me whether a PDF file is an approved claim or a denied claim. What I don't understand is how I can say: find the string that starts with "Tracking Identification Number...", take the 18 characters that follow it, and stuff them into an array.
import os
import glob
import csv

def check(filename):
    if 'DELIVERY NOTIFICATION' in open(filename).read():
        isDenied = True
        print("This claim was Denied")
        print(isDenied)
    elif 'Dear Customer:' in open(filename).read():
        isDenied = False
        print("This claim was Approved")
        print(isDenied)
    else:
        print("I don't know if this is approved or denied")

def iterate():
    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print('current file is:' + infile)
        filename = infile
        check(filename)

iterate()
Any help would be appreciated. This is what the text file looks like:
Shipper Number............................577140Pickup Date....................................06/27/17
Number of Parcels........................1Weight.............................................1 LBS
Shipper Invoice Number..............30057010Tracking Identification Number...1Z000000YW00000000
Merchandise..................................1 S NIKE EQUALS EVERYWHERE T BK B
WE HAVE BEEN UNABLE TO PROVIDE SATISFACTORY PROOF OF DELIVERY FOR THE ABOVE
SHIPMENT. WE APOLOGIZE FOR THE INCONVENIENCE THIS CAUSES.
NPT8AEQ:000A0000LDI 07
----------------Page (1) Break----------------
Update: Many helpful answers. Here is the route I took, and it's working quite nicely if I do say so myself. This is gonna save tons of time!! Here is the entire code for any future viewers.
import os
import glob
import csv

arrayDenied = []
arrayApproved = []

def iterate():
    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print('current file is:' + infile)
        check(infile)

def check(filename):
    with open(filename, 'rt') as file_contents:
        myText = file_contents.read()
    if 'DELIVERY NOTIFICATION' in myText:
        start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[start : start+18]
        print("Denied: " + myNumber)
        arrayDenied.append(myNumber)
    elif 'Dear Customer:' in myText:
        print("This claim was Approved")
        startTrackingNum = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[startTrackingNum : startTrackingNum+18]
        startClaimNumberIndex = myText.index("Claim Number ") + len("Claim Number ")
        myClaimNumber = myText[startClaimNumberIndex : startClaimNumberIndex+11]
        arrayApproved.append(myNumber + " - " + myClaimNumber)
    else:
        print("I don't know if this is approved or denied")

iterate()

with open('Approved.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayApproved:
        writer.writerow([val])

with open('Denied.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayDenied:
        writer.writerow([val])

print(arrayDenied)
print(arrayApproved)
Update: Added the rest of my finished code. It writes the lists to CSV files, where I go execute some =LEFT()'s and such, and boom, I have 1000 tracking numbers in a matter of minutes. This is why programming is great.
If your goal is just to find the "Tracking Identification Number..." string and the subsequent 18 characters, you can find the index of that string, move to where it ends, and slice from that point to the end of the subsequent 18 characters.
# Read the text file into memory:
with open(filename, 'rt') as txt_file:
    myText = txt_file.read()

if 'DELIVERY NOTIFICATION' in myText:
    # Find the desired string and get the subsequent 18 characters:
    start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
    myNumber = myText[start : start+18]
    arrayDenied.append(myNumber)
You can also modify the append line into arrayDenied.append(myText + ' ' + myNumber) or things like that.
Regular expressions are the way to go for your task. Here is a way to modify your code to search for the pattern.
import re

pattern = r"(?<=Tracking Identification Number)(?:(\.+))[A-Za-z0-9]{18}"

def check(filename):
    file_contents = open(filename, 'r').read()
    if 'DELIVERY NOTIFICATION' in file_contents:
        isDenied = True
        print("This claim was Denied")
        print(isDenied)
        matches = re.finditer(pattern, file_contents)
        for match in matches:
            print("Tracking Number = %s" % match.group().strip("."))
    elif 'Dear Customer:' in file_contents:
        isDenied = False
        print("This claim was Approved")
        print(isDenied)
    else:
        print("I don't know if this is approved or denied")
Explanation:
r"(?<=Tracking Identification Number)(?:(\.+))[A-Z-a-z0-9]{18}"
(?<=Tracking Identification Number) Looks behind the capturing group to find the string "Tracking Identification Number"
(?:(\.+)) matches one or more dots (.) (we strip these out after)
[A-Z-a-z0-9]{18} matches 18 instances of (capital or lowercase) letters or numbers
More on Regex.
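A quick sanity check of the pattern against the sample line from the question:

import re

pattern = r"(?<=Tracking Identification Number)(?:(\.+))[A-Za-z0-9]{18}"
sample = "Shipper Invoice Number..............30057010Tracking Identification Number...1Z000000YW00000000"
match = re.search(pattern, sample)
if match:
    print("Tracking Number = %s" % match.group().strip("."))  # -> 1Z000000YW00000000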
I think this solves your issue, just turn it into a function.
import re

string = 'Tracking Identification Number...1Z000000YW00000000'
no_dots = re.sub(r'\.', '', string)  # Removes all dots from the string
matchObj = re.search(r'^Tracking Identification Number(.*)', no_dots)  # Matches anything after "Tracking Identification Number"
try:
    print(matchObj.group(1))
except:
    print("No match!")
If you want to read the documentation it is here: https://docs.python.org/3/library/re.html#re.search

Python: Issue printing to file special characters (Spanish alphabet)

I'm making an algorithm to classify words by the number of times they appear in a text given by a file.
Here is my method:
def printToFile(self, fileName):
    file_to_print = open(fileName, 'w')
    file_to_print.write(str(self))
    file_to_print.close()
and here is the __str__ method:
def __str__(self):
    cadena = ""
    self.processedWords = collections.OrderedDict(sorted(self.processedWords.items()))
    for key in self.processedWords:
        cadena += str(key) + ": " + str(self.processedWords[key]) + "\n"
    return cadena.decode('string_escape')
When I print the data to the console there are no issues; nevertheless, when I print to a file, random characters appear.
(The original post shows two screenshots: what the file output should be, and the garbled output actually produced.)
This looks like an encoding issue; try opening the file like this:
open("file name","w",encoding="utf8")
UTF-8 is the most popular encoding, but it might not be the real one; you might have to check out other encodings such as UTF-16.
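Note that the encoding= argument above is the Python 3 spelling. Since the question's code looks like Python 2 (it calls decode('string_escape')), a rough equivalent there is io.open, which also encodes on write. A sketch, under the assumption that str(self) returns UTF-8-encoded bytes:

import io

def printToFile(self, fileName):
    # io.open exists in Python 2 and 3 and encodes unicode text on write
    with io.open(fileName, 'w', encoding='utf-8') as file_to_print:
        # assumption: str(self) yields UTF-8 bytes, so decode them first
        file_to_print.write(str(self).decode('utf-8'))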

Add "\n" to specific line in text

Say I have a text file with something like this:
Områder dorsalt i overgangssonen, midtre tredjedel med blodpunkter.R: Malignitet ikke påvist
How can I add a \n before each R: in the text, for several documents?
This is the code I have so far:
import os

for root, dirs, files in os.walk(".", topdown=True):
    for name in files:
        if name != "merge_reports_into_metadata_csv.py" or name != "BakgrunnsData_v2.csv" or name != "remove_text_windows.py":
            slash = "\\"
            if root == ".":
                slash = ""
            f = open(root.strip(".").strip("\\") + slash + name, "r")
            lines = f.readlines()
            f.close()
            f = open(root.strip(".").strip("\\") + slash + name, "w")
            for line in lines:
                if line != "R:" + "\n":
                    f.write(line)
                else:
                    print("adding line space the word 'R:' from " + name)
            f.close()
print("all 'R:'s are moved one line down")
You may use regex substitution with the re module:
In [1768]: text = u'Områder dorsalt i overgangssonen, midtre tredjedel med blodpunkter.R: Malignitet ikke påvist'
In [1771]: new_text = re.sub(r'(R:)', r'\n\1', text, flags=re.M)
In [1773]: print(new_text)
Områder dorsalt i overgangssonen, midtre tredjedel med blodpunkter.
R: Malignitet ikke påvist
You can read your file at once with f.read() and pass the text to re.sub.
If your file is rather large, I would recommend reading line by line and writing each line as it is replaced to a new file.
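A sketch of the whole-file variant (Python 3; file names are hypothetical):

import re

with open('report.txt', encoding='utf-8') as f:
    text = f.read()

new_text = re.sub(r'(R:)', r'\n\1', text)

with open('report_fixed.txt', 'w', encoding='utf-8') as f:
    f.write(new_text)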
It looks to me like you can do this by a simple text replace:
# -*- coding: utf-8 -*-
text = "Områder dorsalt i overgangssonen, midtre tredjedel " \
       "med blodpunkter.R: Malignitet ikke påvist\n"
print text.replace("R:", "\nR:")
If your pattern is more complex, or if it has spaces around it on occasions, then the other answers mentioning regular expressions are a good way to go.
You can replace all "R:" occurrences in the text with:
text.replace('R:','\nR:')

Swedish characters in python error

I am making a program that uses words with Swedish characters and stores them in a list. I can print Swedish characters before I put them into a list, but after they are put in, they do not appear normally, just a big mess of characters.
Here is my code:
# coding=UTF-8
def get_word(lines, eng=0):
    if eng == 1:  # function to get word in english
        word_start = lines[1]

def do_format(word, lang):
    if lang == "sv":
        first_word = word
        second_word = translate(word, lang)
        element = first_word + " - " + second_word
    elif lang == "en":
        first_word = translate(word, lang)
        second_word = word
        element = first_word + " - " + second_word
    return element

def translate(word, lang):
    if lang == "sv":
        return "ENGLISH"
    if lang == "en":
        return "SWEDISH"

translated = []
path = "C:\Users\LK\Desktop\Dropbox\Dokumentai\School\Swedish\V47.txt"
doc = open(path, 'r')  # opens the document

doc_list = []  # the variable that will contain the list of words

for lines in doc.readlines():  # repeat as many times as there are lines
    if len(lines) > 1:  # ignore empty spaces
        lines = lines.rstrip()  # don't add "\n" at the end
        doc_list.append(lines)  # add to the list

for i in doc_list:
    print i

for i in doc_list:
    if "-" in i:
        if i[0] == "-":
            element = do_format(i[2:], "en")
            translated.append(element)
        else:
            translated.append(i)
    else:
        element = do_format(i, "sv")
        translated.append(element)

print translated
raw_input()
I can reduce the problem to this simple piece of code:
# -*- coding: utf-8 -*-
test_string = "ö"
test_list = ["å"]
print test_string, test_list
If I run that, I get this
ö ['\xc3\xa5']
There are multiple things to notice:
The broken character. This seems to happen because your Python outputs UTF-8 while your terminal is configured in some ISO-8859-X mode (hence the two characters). I'd try to use proper unicode strings in Python 2 (always u"ö" instead of "ö"), and check your locale settings (the locale command on Linux).
The weird string in the list. In Python, print e will print str(e). For lists (such as ["å"]) the implementation of __str__ is the same as __repr__, and since repr(some_list) calls repr on each of the elements contained in the list, you end up with the string you see.
Example for repr(string):
>>> print u"ö"
ö
>>> print repr(u"ö")
u'\xf6'
>>> print repr("ö")
'\xc3\xb6'
If you print a list, it is printed as a structure. You should convert it to a string, for example by using the join() string method. With your test code it may look like this:
print test_string, test_list
print('%s, %s, %s' % (test_string, test_list[0], ','.join(test_list)))
And output:
ö ['\xc3\xa5']
ö, å, å
I think in your main program you can:
print('%s' % (', '.join(translated)))
You can use the codecs module to specify the encoding of the bytes being read.
import codecs
doc = codecs.open(path, 'r', encoding='utf-8') #opens the document
Files opened with codecs.open will give you unicode strings after decoding the raw bytes with the specified encoding.
In your code, prefix your string literals with u to make them unicode strings.
# -*- coding: utf-8 -*-
test_string = u"ö"
test_list = [u"å"]
print test_string, test_list[0]
