I want to transform chunks of text into a database of single-line entries using regex, but I don't know why the regex group isn't recognized.
Maybe because the multiline flag isn't properly set.
I am a beginner at python.
import re
with open("a-j-0101.txt", encoding="cp1252") as f:
start=1
ecx=r"(?P<entrcnt>[0-9]{1,3}) célébrités ou évènements"
ec1=""
nmx=r"(?P<ename>.+)\r\nAfficher le.*"
nm1=""
for line in f:
if start == 1:
out = open('AST0101.txt' + ".txt", "w", encoding="cp1252") #utf8 cp1252
ec1 = re.search(ecx,line)
out.write(ec1.group("entrcnt"))
start=0
out.write(r"\r\n")
nm1 = re.search(nmx,line, re.M)
out.write(str(nm1.group("ename")).rstrip('\r\n'))
out.close()
But I get the error:
File "C:\work-python\transform-asth-b.py", line 16, in <module>
out.write(str(nm1.group("ename")).rstrip('\r\n'))
builtins.AttributeError: 'NoneType' object has no attribute 'group'
here is the input:
210 célébrités ou évènements ont été trouvés pour la date du 1er janvier.
Création de l'euro
Afficher le...
...
...
...
expected output:
210
Création de l'euro ;...
... ;...
... ;...
EDIT: I tried changing nmx to match \n or \r\n, but with no result:
nmx=r"(?P<ename>.+)(\n|\r\n)Afficher le"
best regards
In this statement:
nm1 = re.search(nmx,line, re.M)
you get a NoneType object (nm1 = None) because no match was found. So investigate the nmx pattern and why it produces no matches on that line.
By the way, since re.search can return None, you can guard against that before calling .group():
if nm1 is not None:
    out.write(str(nm1.group("ename")).rstrip('\r\n'))
else:
    # handle the no-match case
If you are reading a single line at a time, there is no way for a regex to match on a previous line you have read and then forgotten.
If you read a group of lines, you can apply a regex to the collection of lines, and the multiline flag will do something useful. But your current code should probably simply search for r'^Afficher le\.\.\.' and use the state machine (start == 0 or start == 1) to do this in the right context.
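For illustration, here is a minimal sketch of that state-machine idea (untested, and assuming the layout shown in the question: a count line first, then pairs of a name line followed by an "Afficher le..." line):
import re

with open("a-j-0101.txt", encoding="cp1252") as f, \
     open("AST0101.txt", "w", encoding="cp1252") as out:
    # The first line carries the entry count.
    first_line = next(f)
    count = re.search(r"(?P<entrcnt>[0-9]{1,3}) célébrités ou évènements", first_line)
    if count is not None:
        out.write(count.group("entrcnt") + "\n")
    # Remember the previous line; when the current line is the "Afficher le..."
    # marker, the remembered line is the entry name.
    previous = ""
    for line in f:
        if line.startswith("Afficher le"):
            out.write(previous.rstrip("\r\n") + " ;\n")
        previous = line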
Related
I'm working on a simple Python project to practice. I'm trying to retrieve data from a file and run a test on one value.
In my case I read the data as a table from a file, and I test the last value of each row; if the test passes I add the whole line to another file.
Here my data
AE300812 AFROUKH HAMZA 21 admis
AE400928 VIEGO SAN 22 refuse
AE400599 IBN KHYAT mohammed 22 admis
B305050 BOUNNEDI SALEM 39 refuse
Here is my code:
fichier = open("concours.txt","r")
fichier2 = open("admis.txt","w")
contenu = fichier.read()
tab = contenu.split()
for i in range(0,len(tab),5):
    if tab[i+4]=="admis":
        fichier2.write(tab[i]+" "+tab[i+1]+" "+tab[i+2]+" "+tab[i+3]+" "+tab[i+4]+" "+"\n")
fichier.close()
And here is the error:
if tab[i+4]=="admis":
IndexError: list index out of range
You look at tab[i+4], so you have to make sure you stop the loop before that, e.g. with range(0, len(tab)-4, 5). The step=5 alone does not guarantee that you have a full "block" of 5 elements left.
But why does this occur, since each of the lines has 5 elements? They don't! Notice how one line has 6 elements (maybe a double name?), so if you just read and then split, you will run out of sync with the lines. Better iterate lines, and then split each line individually. Also, the actual separator seems to be either a tab \t or double-spaces, not entirely clear from your data. Just split() will split at any whitespace.
Something like this (not tested):
fichier = open("concours.txt","r")
fichier2 = open("admis.txt","w")
for line in fichier:
    tab = line.strip().split("  ")  # actual separator seems to be tab or double-space
    if tab[4]=="admis":
        fichier2.write(tab[0]+" "+tab[1]+" "+tab[2]+" "+tab[3]+" "+tab[4]+" "+"\n")
Depending on what you actually want to do, you might also try this:
with open("concours.txt","r") as fichier, open("admis.txt","w") as fichier2:
for line in fichier:
if line.strip().endswith("admis"):
fichier2.write(line)
This should just copy the admis lines to the second file, with the original double-space separator.
I am new to Python and I wonder if there is an efficient way to find the original sentence in a text file, knowing only the offset of a word. Suppose that I have a test.txt file like this:
test.txt
Ceci est une wheat phrase corn.
Ceci est une deuxième phrase barley.
This is the third wheat word.
Suppose that I know the offset of the word "wheat" which is [13,18].
My code looks like this:
import nltk
from nltk.tokenize import word_tokenize
with open("test.txt") as f:
list_phrase = f.readlines()
f.seek(0)
contents = f.read()
for index, phrase in enumerate(list_phrase):
j = word_tokenize(phrase)
if contents[13:18] in j:
print(list_phrase[index])
The output of my code prints both sentences, i.e. "Ceci est une wheat phrase corn." and "This is the third wheat word."
How to detect exactly the real phrase of a word by knowing its offset?
Note that the offset I am using is counted continuously across sentences (2 sentences in this case). For example, the offset of the word "barley" should be [61,67].
The desired output of the print above should be:
Ceci est une wheat phrase corn.
As we know that its offset is [13,18].
Any help for this would be much appreciated. Thank you so much!
If you are looking for raw speed then the standard library is probably the best approach to take.
# Generate a large text file with 10,000,001 lines.
with open('very-big.txt', 'w') as file:
    for _ in range(10000000):
        file.write("All work and no play makes Jack a dull boy.\n")
    file.write("Finally we get to the line containing the word 'wheat'.\n")
Given the search_word and its offset in the line we're looking for we can calculate the limit for the string comparison.
search_word = 'wheat'
offset = 48
limit = offset + len(search_word)
The simplest approach is to iterate over the enumerated lines of text and perform a string comparison on each line.
with open('very-big.txt', 'r') as file:
    for line, text in enumerate(file, start=1):
        if (text[offset:limit] == search_word):
            print(f'Line {line}: "{text.strip()}"')
The runtime for this solution is 155 ms on a 2012 Mac mini (2.3GHz i7 CPU). That seems pretty fast for processing 10,000,001 lines but it can be improved upon by checking the length of the text before attempting the string comparison.
with open('very-big.txt', 'r') as file:
    for line, text in enumerate(file, start=1):
        if (len(text) >= limit) and (text[offset:limit] == search_word):
            print(f'Line {line}: "{text.strip()}"')
The runtime for the improved solution is 71 ms on the same computer. It's a significant improvement but of course mileage will vary depending on the text file.
Generated output:
Line 10000001: "Finally we get to the line containing the word 'wheat'."
EDIT: Including file offset information
with open('very-big.txt', 'r') as file:
    file_offset = 0
    for line, text in enumerate(file, start=1):
        line_length = len(text)
        if line_length >= limit and (text[offset:limit] == search_word):
            print(f'[{file_offset + offset}, {file_offset + limit}] Line {line}: "{text.strip()}"')
        file_offset += line_length
Sample output:
[430000048, 430000053] Line 10000001: "Finally we get to the line containing the word 'wheat'."
Once again
This code checks if the known offset of the text is between the values of the offset of the start of the current line and the end of the line. The text found at the offset is also verified.
long_string = """Ceci est une wheat phrase corn.
Ceci est une deuxième phrase barley.
This is the third wheat word.
"""
import io
search_word = 'barley'
known_offset = 61
limit = known_offset + len(search_word)
# Use the multi-line string defined above as file input
with io.StringIO(long_string) as file:
    file_offset = 0
    for line, text in enumerate(file, start=1):
        line_length = len(text)
        if file_offset < known_offset < (file_offset + line_length) \
                and (text[(known_offset-file_offset):(limit-file_offset)] == search_word):
            print(f'[{known_offset},{limit}]\nLine: {line}\n{text}')
        file_offset += line_length
Output:
[61,67]
Line: 2
Ceci est une deuxième phrase barley.
If you already know the position of the word, tokenizing is not what you want to do. By tokenizing, you change the sequence (for which you know the position) to a list of words, where you don't know which element is your word.
Therefore, you should leave it at the phrase and just compare the part of the phrase with your word:
with open("test.txt") as f:
list_phrase = f.readlines()
f.seek(0)
contents = f.read()
for index, phrase in enumerate(list_phrase):
if phrase[13:18].lower() == "wheat": ## .lower() is only necessary if the word might be in upper case.
print(list_phrase[index])
This would only return the sentences where wheat is at the position [13:18]. All other occurrences of wheat would not be recognized.
So I have some code that reads a .txt file into a variable as a string.
Then, I try to use .replace() on it to change the character "ó" to "o", but it is not working! The console prints the same thing.
Code:
def normalize(filename):
    # Ignores errors because I get the .txt from my WhatsApp conversations and emojis raise an error.
    # File says: "Es una rubrica de evaluación." (among many emojis)
    txt_raw = open(filename, "r", errors="ignore")
    txt_read = txt_raw.read()
    # Here, only the "o" is replaced. In the real code, I use a for loop to iterate through all chrs.
    rem_accent_txt = txt_read.replace("ó", "o")
    print(rem_accent_txt)
    return
Expected output:
"Es una rubrica de evaluacion."
Current Output:
"Es una rubrica de evaluación."
It does not print an error or anything, it just prints it as it is.
I believe the problem lies in the fact that the string comes from a file, because when I just create a string and use the same code it does work, but it does not work when I get the string from a file.
EDIT: SOLUTION!
Thanks to #juanpa.arrivillaga and #das-g I came up with this solution:
from unidecode import unidecode
def get_txt(filename):
    txt_raw = open(filename, "r", encoding="utf8")
    txt_read = txt_raw.read()
    txt_decode = unidecode(txt_read)
    print(txt_decode)
    return txt_decode
Almost certainly, what is occurring is that you have unnormalized Unicode strings. Essentially, there are two ways to create "ó" in Unicode:
>>> combining = 'ó'
>>> composed = 'ó'
>>> len(combining), len(composed)
(2, 1)
>>> list(combining)
['o', '́']
>>> list(composed)
['ó']
>>> import unicodedata
>>> list(map(unicodedata.name, combining))
['LATIN SMALL LETTER O', 'COMBINING ACUTE ACCENT']
>>> list(map(unicodedata.name, composed))
['LATIN SMALL LETTER O WITH ACUTE']
Just normalize your strings:
>>> composed == combining
False
>>> composed == unicodedata.normalize("NFC", combining)
True
Although, taking a step back, do you really want to remove accents? Or do you just want to normalize to composed, like the above?
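If stripping the accents really is the goal, here is a rough standard-library sketch (my own illustration, not part of the original answer) that decomposes the text and drops the combining marks:
import unicodedata

def strip_accents(text):
    # NFD decomposition splits "ó" into "o" + COMBINING ACUTE ACCENT;
    # dropping the combining marks leaves just the base letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Es una rúbrica de evaluación."))  # Es una rubrica de evaluacion.
The unidecode approach in the question's edit goes further (it also transliterates characters that have no plain-ASCII decomposition), so pick whichever matches your needs.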
As an aside, you shouldn't ignore the errors when reading your text file. You should use the correct encoding. I suspect what is happening is that you are writing your text file using an incorrect encoding, because you should be able to handle emojis just fine, they aren't anything special in unicode.
>>> emoji = "😀"
>>> print(emoji)
😀
>>>
>>> unicodedata.name(emoji)
'GRINNING FACE'
I've got a small Python script that compares a word list imported from document A with a set of line endings in document B in order to copy the ones that don't match those rules to document C. Example:
A (word list):
salir
entrar
leer
B (line endings list):
ir
ar
C (those from A that do not match B):
leer
In general it works fine, but I realized that it doesn't work with line endings that contain a Unicode character such as ó: there is no error message and everything seems smooth, but list C still contains words ending with ó.
Here is an excerpt of my code:
inputobj = codecs.open(A, "r")
ruleobj = codecs.open(B, "r")
nomatch = codecs.open(C, "w")
inputtext = inputobj.readlines()
ruletext = ruleobj.readlines()
for line in inputtext:
    x = 0
    line = line.strip()
    for rule in ruletext:
        rule = rule.strip()
        if line.endswith(rule):
            print "rule", rule, " in line", line
            x = x+1
    if x == 0:
        nomatchlist.append(line)
for i in nomatchlist:
    print >> nomatch, i
I've tried some code locally and it works well for 'ó'.
Could you check that A and B are in the same encoding?
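For example, here is a minimal sketch (the file names are placeholders and UTF-8 is only an assumption; use whatever encoding your documents actually have) that opens all three documents with an explicit, matching encoding:
import codecs

with codecs.open("A.txt", "r", encoding="utf-8") as inputobj, \
     codecs.open("B.txt", "r", encoding="utf-8") as ruleobj, \
     codecs.open("C.txt", "w", encoding="utf-8") as nomatch:
    rules = [rule.strip() for rule in ruleobj]
    for line in inputobj:
        word = line.strip()
        # Keep only the words that match none of the ending rules.
        if not any(word.endswith(rule) for rule in rules):
            nomatch.write(word + "\n")
If the two input files were saved with different encodings (say, one UTF-8 and one Latin-1), the decoded 'ó' from A would never equal the 'ó' from B, which would explain the silent mismatch.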
I have been trying to parse a file with xml.etree.ElementTree:
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError
def analyze(xml):
    it = ET.iterparse(file(xml))
    count = 0
    last = None
    try:
        for (ev, el) in it:
            count += 1
            last = el
    except ParseError:
        print("catastrophic failure")
        print("last successful: {0}".format(last))
    print('count: {0}'.format(count))
This is of course a simplified version of my code, but this is enough to break my program. I get this error with some files if I remove the try-catch block:
Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    from yparse import analyze; analyze('file.xml')
  File "C:\Python27\yparse.py", line 10, in analyze
    for (ev, el) in it:
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1258, in next
    self._parser.feed(data)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1624, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
    raise err
ParseError: reference to invalid character number: line 1, column 52459
The results are deterministic though, if a file works it will always work. If a file fails, it always fails and always fails at the same point.
The strangest thing is I'm using the trace to find out if I have any malformed XML that's breaking the parser. I then isolate the node that caused the failure. But when I create an XML file containing that node and a few of its neighbors, the parsing works!
This doesn't seem to be a size problem either. I have managed to parse much larger files with no problems.
Any ideas?
Here are some ideas:
(0) Explain "a file" and "occasionally": do you really mean it works sometimes and fails sometimes with the same file?
Do the following for each failing file:
(1) Find out what is in the file at the point that it is complaining about:
text = open("the_file.xml", "rb").read()
err_col = 52459
print repr(text[err_col-50:err_col+100]) # should include the error text
print repr(text[:50]) # show the XML declaration
(2) Throw your file at a web-based XML validation service e.g. http://www.validome.org/xml/ or http://validator.aborla.net/
and edit your question to display your findings.
Update: Here is the minimal xml file that illustrates your problem:
[badcharref.xml]
<a>&#0;</a>
[Python 2.7.1 output]
>>> import xml.etree.ElementTree as ET
>>> it = ET.iterparse(file("badcharref.xml"))
>>> for ev, el in it:
... print el.tag
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1258, in next
    self._parser.feed(data)
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1624, in feed
    self._raiseerror(v)
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: reference to invalid character number: line 1, column 3
>>>
Not all valid Unicode characters are valid in XML. See the XML 1.0 Specification.
You may wish to examine your files using regexes like r'&#([0-9]+);' and r'&#x([0-9A-Fa-f]+);', convert the matched text to an int ordinal and check against the valid list from the spec i.e. #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
... or maybe the numeric character reference is syntactically invalid, e.g. not terminated by a ';', &#not-a-digit, etc.
Update 2 I was wrong, the number in the ElementTree error message is counting Unicode code points, not bytes. See the code below and snippets from the output from running it over the two bad files.
# coding: ascii
# Find numeric character references that refer to Unicode code points
# that are not valid in XML.
# Get byte offsets for seeking etc in undecoded file bytestreams.
# Get unicode offsets for checking against ElementTree error message,
# **IF** your input file is small enough.
BYTE_OFFSETS = True
import sys, re, codecs
fname = sys.argv[1]
print fname
if BYTE_OFFSETS:
    text = open(fname, "rb").read()
else:
    # Assumes file is encoded in UTF-8.
    text = codecs.open(fname, "rb", "utf8").read()
rx = re.compile("&#([0-9]+);|&#x([0-9a-fA-F]+);")
endpos = len(text)
pos = 0
while pos < endpos:
    m = rx.search(text, pos)
    if not m: break
    mstart, mend = m.span()
    target = m.group(1)
    if target:
        num = int(target)
    else:
        num = int(m.group(2), 16)
    # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    if not(num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
           or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF):
        print mstart, m.group()
    pos = mend
Output:
comments.xml
6615405
10205764
10213901
10213936
10214123
13292514
...
155656543
155656564
157344876
157722583
posts.xml
7607143
12982273
12982282
12982292
12982302
12982310
16085949
16085955
...
36303479
36303494 <<=== whoops
38942863
...
785292911
801282472
848911592
As #John Machin suggested, the files in question do have dubious numeric entities in them, though the error messages seem to be pointing at the wrong place in the text. Perhaps the streaming nature and buffering are making it difficult to report accurate positions.
In fact, all of these entities appear in the text:
set(['', '', '', '', '', '', '
', '', '', '', '', '', '', '', '
', '', '', ' ', '', '', '', '', ''])
Most are not allowed. Looks like this parser is quite strict, you'll need to find another that is not so strict, or pre-process the XML.
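If you go the pre-processing route, here is a rough sketch of my own (not from the answer above, and assuming the numeric character references are the only problem) that rewrites the forbidden references before the text reaches the parser:
import re

# Numeric character references, decimal or hexadecimal.
CHARREF = re.compile(r"&#(?:([0-9]+)|x([0-9A-Fa-f]+));")

def is_valid_xml_char(num):
    # Code points allowed by the XML 1.0 specification.
    return (num in (0x9, 0xA, 0xD)
            or 0x20 <= num <= 0xD7FF
            or 0xE000 <= num <= 0xFFFD
            or 0x10000 <= num <= 0x10FFFF)

def scrub(xml_text, replacement="?"):
    def fix(match):
        num = int(match.group(1)) if match.group(1) else int(match.group(2), 16)
        return match.group(0) if is_valid_xml_char(num) else replacement
    return CHARREF.sub(fix, xml_text)
Whether to drop the bad references entirely or substitute a placeholder depends on what the data is used for downstream.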
I'm not sure if this answers your question, but if you want to use an exception with the ParseError raised by element tree, you would do this:
except ET.ParseError:
    print("catastrophic failure")
    print("last successful: {0}".format(last))
Source: http://effbot.org/zone/elementtree-13-intro.htm
It might also be worth noting that you can catch the error rather easily, and avoid having to stop your program completely, by using what you are already using later in the function and placing your statement:
it = ET.iterparse(file(xml))
inside a try/except block:
try:
    it = ET.iterparse(file(xml))
except:
    print('iterparse error')
Of course, this will not fix your XML file or pre-processing technique, but could help in identifying which file (if you're parsing lots) is causing your error.