Parsing structured data from file python

Parsing structured data from file python - python

The file has the following format:
Component_name - version - author#email.com - multi-line comment with new lines and other white space characters
\t ...continue multi-line comment
Component_name2 - version - author2#email.com - possibly multi-line comment with new lines and other white space characters
Component_name - version - author#email.com - possibly multi-line comment with new lines and other white space characters 2
Component_name - version - author2#email.com - possibly multi-line comment with new lines and other white space characters 2
and so on...
After parsing the output format should be grouped by component_name:
output = [
"component_name" -> ["version - author#email.com - comment 1", "version - author#email.com - comment 2", ...],
"component_name2" -> [...],
...
]
Currently, this is what I have so far to parse it:
reTemp = r"[\w\_\-]*( \- )(\d*\.?){3}( \- )[\w\d\_\-\.\#]*( \- )[\S ]*"
numData = 4
reFormat = re.compile(reTemp)
textFileLines = textFile.split("\n")
temp = [x.split(" - ", numData - 1) for x in textFileLines if re.search(reFormat, x)]
m = filter(None, temp) # remove all empty lists
group = groupby(m, lambda y: y[0].strip())
This works well for single line comments but fails with multi-line comments. Also, I am not sure if Regex is the right tool for this. Is there a better/pythonic way to do this?
EDIT:
Multi-line comments are tab delimited \t on a new line (e.g. look at first entry above)
Comments are GIT commit messages and can contain JSON or code
Entries are separated by a newline character

I've had to deal with structured data files like this and ended up writing a state machine to parse the file. Something like this (rough pseudocode):
for line in file:
if line matches new_record_regex:
records.append(record)
record = {"version": field1, "author": field2, "comment": field3}
else:
record["comment"] += line

You might want to formalize the file format as a grammar and then use one of the many parsers / parser generators Python has to offer to interpret the file according to the grammar.

Related

Get rid of unicode decimal charater

I have a huge file which looks like this :
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h&#7893mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;h&#432&#417u cao cổ152;298;0
6854;huy&#7873n đề62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0
As you can see the file contains some unicode decimal, I would like to replace all of them with their latin character before using the file. Even opening it with the utf-8 encoding, the errors are not suppress.
do you know a way to do it. I want to create a dictionary and retrieve the Numbers at index 2.
for : 6883;jumarre;83;295;0; => i have 83
for : 6887;khướu;62;325;0 => i have &#7899 => which is false , i should have 62
with codecs.open('JeuxdeMotsPolarise_test.txt', 'r', 'utf-8', errors = 'ignore') as text_file:
text_file =(text_file.read())
#print(text_file)
dico_lexique = ({i.split(";")[1]:i.split(";")[2:]for i in text_file.split("\n") if i})
This is the result given with trying #serge proposition, but it leaves blank spaces between lines.
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hi âu;81;294;0
6819;hi cu;64;338;0
6820;hi yn;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hu cao c;152;298;0
6854;huyn ;62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kn kn;73;303;0
6886;khoang;64;323;0
6887;khu;62;325;0
Edit : I redownload the original file and the error of missing ";" was corrected.
for example:
=> 6850;hổ mang;54;298;0 (that is how is appeared in the now update file)
Thank you everybody

#PanagiotisKanavos has correctly guessed that html.unescape was able to replace the xml char reference with their unicode character. The hard part is that some refs are correctly ended with their terminating semicolon (;) while others are not. And in that latter case, if one entity if followed with a semicolon separator, the separator will be eaten by the convertion, shifting the following fields.
So the only reliable way is to:
process the file line by line as as CSV file with ; delimiter
eventually contatenate the middle field from the second to the fourth starting form the end
unescape that middle field
If you want to convert the file, you could do:
with open('file.csv') as fd, open('fixed.csv', 'w', newline='') as fdout:
rd = csv.reader(fd, delimiter=';')
wr = csv.writer(fdout, delimiter=';')
for row in rd:
if len(row)> 5:
row[1] = ';'.join(row[1:len(row)-3])
del row[2:len(row)-3]
row[1] = html.unescape(row[1])
wr.writerow(row)
If you only want to build a mapping from field 0 to field 2:
values = {}
with open('file.csv') as fd:
rd = csv.reader(fd, delimiter=';')
for row in rd:
values[field[0]] = field[-3]

This text isn't UTF8 or Unicode in general. It's HTML-encoded text, most likely Vietnamese. Those escape sequences correspond to Vietnamese characters, for example &#432 is ư - in fact, I just typed the edit sequence in the SO edit box and the correct character appeared. ớ is ớ.
Copying the entire text outside a code block produces
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h&#7893mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;h&#432&#417u cao cổ152;298;0
6854;huy&#7873n đề62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0
Googling for Họ Khướu returns this Wikipedia page about Họ Khướu.
I think it's safe to assume this is HTML-encoded Vietnamese text. To convert it to Unicode you can use html.unescape :
import html
line='6887;khướu;62;325;0'
properLine=html.unescape(line)
UPDATE
The text posted above is just the original text with an extra newline per page. It's SO's markdown renderer that converts the escape sequences to the corresponding glyphs.
The funny thing is that this line :
6853;h&#432&#417u cao cổ152;298;0
Can't be rendered because the HTML entities aren't properly terminated. html.unescape on the other hand will convert the characters. Clearly, html.unescape is far more forgiving than SO's markdown renderer.
Either of these lines :
html.unescape('6853;hươu cao cổ152;298;0')
html.unescape('6853;h&#432&#417u cao cổ152;298;0')
Returns :
6853;h\u01b0\u01a1u cao c\u1ed5152;298;0

Repair the file first before you load it into a CSV parser.
Assuming Maarten in the comments is right, change the encoding:
iconv -f cp1252 -t utf-8 < JeuxdeMotsPolarise_test.txt > JeuxdeMotsPolarise_test.utf8.txt
Then replace the escapes with proper characters.
perl -C -i -lpe'
s/&#([0-9]+);?/chr $1/eg; # replace entities
s/;?(\d+;\d+;\d+)$/;$1/; # put back semicolon
# if it was consumed accidentally
' JeuxdeMotsPolarise_test.utf8.txt
Contents of JeuxdeMotsPolarise_test.utf8.txt after running the substitutions:
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;hổmang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hươu cao cổ;152;298;0
6854;huyền đề;62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0

Remove quotes holding 2 words and remove comma between them

Following up on Python to replace a symbol between between 2 words in a quote
Extended input and expected output:
trying to replace comma between 2 words Durango and PC in the second line by & and then remove the quotes " as well. Same for third line with Orbis and PC and 4th line has 2 word combos in quotes that I would like to process "AAA - Character Tech, SOF - UPIs","Durango, Orbis, PC"
I would like to retain the rest of the lines using Python.
INPUT
2,SIN-Rendering,Core Tech - Rendering,PC,147,Reopened
2,Kenny Chong,Core Tech - Rendering,"Durango, PC",55,Reopened
3,SIN-Audio,AAA - Audio,"Orbis, PC",13,Open
LTY-168499,[PC][PS4][XB1] Missing textures from Fort Capture NPC face,3,CTU-CharacterTechBacklog,"AAA - Character Tech, SOF - UPIs","Durango, Orbis, PC",29,Waiting For
...
...
...
Like these, there can be 100 lines in my sample. So the expected output is:
2,SIN-Rendering,Core Tech - Rendering,PC,147,Reopened
2,Kenny Chong,Core Tech - Rendering, Durango & PC,55,Reopened
3,SIN-Audio,AAA - Audio, Orbis & PC,13,Open
LTY-168499,[PC][PS4][XB1] Missing textures from Fort Capture NPC face,3,CTU-CharacterTechBacklog,AAA - Character Tech & SOF - UPIs,Durango, Orbis & PC,29,Waiting For
...
...
...
So far, I could think of reading line by line and then if the line contains quote replace it with no character but then replacement of symbol inside is something I am stuck with.
Here is what I have right now:
for line in lines:
expr2 = re.findall('"(.*?)"', line)
if len(expr2)!=0:
expr3 = re.split('"',line)
expr4 = expr3[0]+expr3[1].replace(","," &")+expr3[2]
print >>k, expr4
else:
print >>k, line
but it does not consider the case in 4th line? There can be more than 3 combos as well. For eg.
3,SIN-Audio,"AAA - Audio, xxxx, yyyy","Orbis, PC","13, 22",Open
and wish to make this
3,SIN-Audio,AAA - Audio & xxxx & yyyy, Orbis & PC, 13 & 22,Open
How to achieve this, any suggestion? Learning Python.

So, by treating the input file as a .csv we can easily turn the lines into something easy to work with.
For example,
2,Kenny Chong,Core Tech - Rendering, Durango & PC,55,Reopened
is read as:
['2', 'Kenny Chong', 'Core Tech - Rendering', 'Durango, PC', '55', 'Reopened']
Then, by replacing all instances of , with _& (space) we would have the line:
['2', 'Kenny Chong', 'Core Tech - Rendering', 'Durango & PC', '55', 'Reopened']
And it replaces multiple instances of ,s within a line, and when finally writing we no longer have the original double quotes.
Here is the code, given that in.txt is your input file and it will write to out.txt.
import csv
with open('in.txt') as infile:
reader = csv.reader(infile)
with open('out.txt', 'w') as outfile:
for line in reader:
line = list(map(lambda s: s.replace(',', ' &'), line))
outfile.write(','.join(line) + '\n')
The fourth line is outputted as:
LTY-168499,[PC][PS4][XB1] Missing textures from Fort Capture NPC face,3,CTU-CharacterTechBacklog,AAA - Character Tech & SOF - UPIs,Durango & Orbis & PC,29,Waiting For

Please check this once: I could not find a single expression that could do this. So did it in a bit elaborate way. Will update if I can find a better way(Python 3)
import re
st = "3,SIN-Audio,\"AAA - Audio, xxxx, yyyy\",\"Orbis, PC\",\"13, 22\",Open"
found = re.findall(r'\"(.*)\"',st)[0].split("\",\"")
final = ""
for word in found:
final = final + (" &").join(word.split(","))+","
result = re.sub(r'\"(.*)\"',final[:-1],st)
print(result)

How to join second line to end of first line in python?

I tried to read lines like below:
A:129 Tyr -P- 9 - - - 10xR,4xG,3xD,3xK,2xP,2xV,2xY,1xE,1xI,1xL,1xM,1xN,1
xQ,1xT
A:181 Ser -P- 8 - - - 9xR,9xS,8xG,4xT,3xD,3xL,3xQ,3xV,2xK,2xM,1xA,1xF,1x
H,1xY
A:50 His --- 9 - - - 17xL,9xA,4xK,3xI,3xR,3xV,2xN,2xS,1xC,1xE,1xH,1xQ,1
xT
Where each Even lines is a continuation of ODD lines but which is split by "\n\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s" so I want to replace those '\n\s(n)' to '' and join back to end of ODD lines .
FOR EXAMPLE:
A:181 Ser -P- 8 - - - 9xR,9xS,8xG,4xT,3xD,3xL,3xQ,3xV,2xK,2xM,1xA,1xF,1x
H,1xY
TO
A:181 Ser -P- 8 - - - 9xR,9xS,8xG,4xT,3xD,3xL,3xQ,3xV,2xK,2xM,1xA,1xF,1xH,1xY
CODE:
import os
import sys
import re
lines=["A:129 Tyr -P- 9 - - - 10xR,4xG,3xD,3xK,2xP,2xV,2xY,1xE,1xI,1xL,1xM,1xN,1"," xQ,1xT","A:181 Ser -P- 8 - - - 9xR,9xS,8xG,4xT,3xD,3xL,3xQ,3xV,2xK,2xM,1xA,1xF,1x"," H,1xY","A:50 His --- 9 - - - 17xL,9xA,4xK,3xI,3xR,3xV,2xN,2xS,1xC,1xE,1xH,1xQ,1"," xT"]
for i in lines:
print i.replace(" ","")
Here,I just replaced spaces by empty space but i didnt get how to join those replaced EVEN lines to end of ODD lines.
So could some one help me to do the same.
Thanking you in advance.
Hi guys , First of all Many more thanks for your kind replies. I tried all the ways but the followed one works correct:
WILD= open("INPUT.txt", 'r')
merged = []
for line in WILD:
if line.startswith(" "):
merged[-1] += line.strip()
else:
merged.append(line.replace("\n",""))
OUTPUT:
A:129 Tyr -P- 9 - - - 10xR,4xG,3xD,3xK,2xP,2xV,2xY,1xE,1xI,1xL,1xM,1xN,1xQ,1xT
A:181 Ser -P- 8 - - - 9xR,9xS,8xG,4xT,3xD,3xL,3xQ,3xV,2xK,2xM,1xA,1xF,1xH,1xY
A:50 His --- 9 - - - 17xL,9xA,4xK,3xI,3xR,3xV,2xN,2xS,1xC,1xE,1xH,1xQ,1xT

Instead of that replace statement, you can just use str.strip to strip away whitespace at the beginning or the end of the string. Also, you can use zip to iterate pairs of lines.
for x, y in zip(l[::2],l[1::2]):
print "".join([x, y.strip()])
Or use next to get the next line if this is an iterator, like a file.
for x in iterator:
y = next(iterator)
print "".join([x, y.strip()])
Both ways, all the even lines (0, 2, ...) go to x and all the odd ones (1, 3, ...) to y.
Of course, this is assuming that all the entries in the list/file are spanning exactly two lines.
If they can span an arbitrary number of lines (just one, or two, or maybe five), then this will get more complicated. In this case, you might try something like this:
merged = []
for line in lines:
if line.startswith(" "):
merged[-1] += line.strip()
else:
merged.append(line)
Note: If thoses are indeed lines from a file, you might have to apply strip to all the lines, i.e. also x.strip() and merged.append(line.strip()), as each line will be terminated by \n which you might want to get rid of.

Read the entire file as a single string, then replace the entire whitespace with a single tab:
filepointer = open("INPUT.txt")
text = filepointer.read()
text = re.sub(r"\n\s{20,}", "\t", text)
This matches and removes sequences of a newline followed by 20 or more spaces, replacing them with a tab. (That way I don't have to count the precise number of spaces, and the program still works if some lines are slightly different).
If you don't want a tab between the joined lines, just use a space (" ") instead of "\t".
And if you must have the result as a list of lines, split text afterwards:
merged = text.splitlines()

regular expressions in python using quotes

I am attempting to create a regular expression pattern for strings similar to the below which are stored in a file. The aim is to get any column for any row, the rows need not be on a single line. So for example, consider the following file:
"column1a","column2a","column
3a,", #entity 1
"column\"this is, a test\"4a"
"column1b","colu
mn2b,","column3b", #entity 2
"column\"this is, a test\"4b"
"column1c,","column2c","column3c", #entity 3
"column\"this is, a test\"4c"
Each entity consists of four columns, column 4 for entity 2 would be "column\"this is, a test\"4b", column 2 for entity 3 would be "column2c". Each column begins with a quote and closes with a quote, however you must be careful because some columns have escaped quotes. Thanks in advance!

You could do like this, ie
Read the whole file.
Split the input according to the newline character which was not preceded by a comma.
Iterate over the spitted elements and again do splitting on the comma (and also the following optional newline character) which was preceded and followed by double quotes.
Code:
import re
with open(file) as f:
fil = f.read()
m = re.split(r'(?<!,)\n', fil.strip())
for i in m:
print(re.split('(?<="),\n?(?=")', i))
Output:
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
Here is the check..
$ cat f
"column1a","column2a","column3a,",
"column\"this is, a test\"4a"
"column1b","column2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"
$ python3 f.py
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
f is the input file name and f.py is the file-name which contains the python script.

Your problem is terribly familiar to what I have to deal thrice every month :) Except I'm not using python to solve it, but I can 'translate' what I usually do:
text = r'''"column1a","column2a","column
3a,",
"column\"this is, a test\"4a"
"column1a2","column2a2","column3a2","column4a2"
"column1b","colu
mn2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"'''
import re
# Number of columns one line is supposed to have
columns = 4
# Temporary variable to hold partial lines
buffer = ""
# Our regex to check for each column
check = re.compile(r'"(?:[^"\\]*|\\.)*"')
# Read the file line by line
for line in text.split("\n"):
# If there's no stored partial line, this is a new line
if buffer == "":
# Check if we get 4 columns and print, if not, put the line
# into buffer so we store a partial line for later
if len(check.findall(line)) == columns:
print matches
else:
# use line.strip() if you need to trim whitespaces
buffer = line
else:
# Update the variable (containing a partial line) with the
# next line and recheck if we get 4 columns
# use line.strip() if you need to trim whitespaces
buffer = buffer + line
# If we indeed get 4, our line is complete and print
# We must not forget to empty buffer now that we got a whole line
if len(check.findall(buffer)) == columns:
print matches
buffer = ""
# Optional; always good to have a safety backdoor though
# If there is a problem with the csv itself like a weird unescaped
# quote, you send it somewhere else
elif len(check.findall(buffer)) > columns:
print "Error: cannot parse line:\n" + buffer
buffer = ""
ideone demo

Trouble concatenating two strings

I am having trouble concatenating two strings. This is my code:
info = infl.readline()
while True:
line = infl.readline()
outfl.write(info + line)
print info + line
The trouble is that the output appears on two different lines. For example, output text looks like this:
490250633800 802788.0 953598.2
802781.968872 953674.839355 193.811523 1 0.126805 -999.000000 -999.000000 -999.000000
I want both strings on the same line.

There must be a '\n' character at the end of info. You can remove it with:
info = infl.readline().rstrip()

You should remove line breaks in the line and info variables like this :
line=line.replace("\n","")

readline will return a "\n" at the end of the string 99.99% of the time. You can get around this by calling rstrip on the result.
info = infl.readline().rstip()
while True:
#put it both places!
line = infl.readline().rstip()
outfl.write(info + line)
print info + line
readline's docs:
Read one entire line from the file. A trailing newline character is kept in the string (but may be absent when a file ends with an incomplete line)...

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parsing structured data from file python - python

You might want to formalize the file format as a grammar and then use one of the many parsers / parser generators Python has to offer to interpret the file according to the grammar.

Related

Get rid of unicode decimal charater

Remove quotes holding 2 words and remove comma between them

How to join second line to end of first line in python?

regular expressions in python using quotes

Trouble concatenating two strings

Categories

Resources