How to read a file block-wise in Python

I am a bit stuck reading a file block-wise and am having difficulty extracting selected data from each block.
Here is my file content:
DATA.txt
#-----FILE-----STARTS-----HERE--#
#--COMMENTS CAN BE ADDED HERE--#
BLOCK IMPULSE DATE 01-JAN-2010 6 DEHDUESO203028DJE \
SEQUENCE=ai=0:at=221:ae=3:lu=100:lo=NNU:ei=1021055:lr=1: \
USERID=ID=291821 NO_USERS=3 GROUP=ONE id_info=1021055 \
CREATION_DATE=27-JUNE-2013 SN=1021055 KEY ="22WS \
DE34 43RE ED54 GT65 HY67 AQ12 ES23 54CD 87BG 98VC \
4325 BG56"
BLOCK PASSION DATE 01-JAN-2010 6 DEHDUESO203028DJE \
SEQUENCE=ai=0:at=221:ae=3:lu=100:lo=NNU:ei=324356:lr=1: \
USERID=ID=291821 NO_USERS=1 GROUP=ONE id_info=324356 \
CREATION_DATE=27-MAY-2012 SN=324356 KEY ="22WS \
DE34 43RE 342E WSEW T54R HY67 TFRT 4ER4 WE23 XS21 \
CD32 12QW"
BLOCK VICTOR DATE 01-JAN-2010 6 DEHDUESO203028DJE \
SEQUENCE=ai=0:at=221:ae=3:lu=100:lo=NNU:ei=324356:lr=1: \
USERID=ID=291821 NO_USERS=5 GROUP=ONE id_info=324356 \
CREATION_DATE=27-MAY-2012 SN=324356 KEY ="22WS \
DE34 43RE 342E WSEW T54R HY67 TFRT 4ER4 WE23 XS21 \
CD32 12QW"
#--BLOCK--ENDS--HERE#
#--NEW--BLOCKS--CAN--BE--APPENDED--HERE--#
I am only interested in the block name, NO_USERS, and id_info of each block.
These three values should be saved in a data structure (say, a dict), which is then stored in a list:
[{Name: IMPULSE ,NO_USER=3,id_info=1021055},{Name: PASSION ,NO_USER=1,id_info=324356}. . . ]
Any other data structure that can hold the info would also be fine.
So far I have tried getting the block names by reading line by line:
fOpen = open('DATA.txt')
unique = []
for row in fOpen:
    if "BLOCK" in row:
        unique.append(row.split()[1])
print(unique)
I am thinking of a regular-expression approach, but I have no idea where to start.
Any help would be appreciated. Meanwhile I am still trying and will update if I get something.

You could use groupby to find each block, use a regex to extract the info, and put the values in dicts:
from itertools import groupby
import re

with open("test.txt") as f:
    data = []
    # match NO_USERS= followed by 1+ digits, or id_info= followed by 1+ digits
    r = re.compile(r"NO_USERS=\d+|id_info=\d+")
    grps = groupby(f, key=lambda x: x.strip().startswith("BLOCK"))
    for k, v in grps:
        # if k is True we have a block line
        if k:
            # get name after BLOCK
            name = next(v).split(None, 2)[1]
            # get lines after BLOCK and get the second of those
            t = next(grps)[1]
            # we want two lines after BLOCK
            _, l = next(t), next(t)
            d = dict(s.split("=") for s in r.findall(l))
            # add name to dict
            d["Name"] = name
            # add dict to data list
            data.append(d)
print(data)
Output:
[{'NO_USERS': '3', 'id_info': '1021055', 'Name': 'IMPULSE'},
{'NO_USERS': '1', 'id_info': '324356', 'Name': 'PASSION'},
{'NO_USERS': '5', 'id_info': '324356', 'Name': 'VICTOR'}]
Or, without groupby: since your file follows a fixed format, we just need to extract the second line after each BLOCK line:
import re

with open("test.txt") as f:
    data = []
    r = re.compile(r"NO_USERS=\d+|id_info=\d+")
    for line in f:
        # if True we have a new block
        if line.startswith("BLOCK"):
            # call next twice to get the second line after BLOCK
            _, l = next(f), next(f)
            # get name after BLOCK
            name = line.split(None, 2)[1]
            # find our substrings from l
            d = dict(s.split("=") for s in r.findall(l))
            d["Name"] = name
            data.append(d)
print(data)
Output:
[{'NO_USERS': '3', 'id_info': '1021055', 'Name': 'IMPULSE'},
{'NO_USERS': '1', 'id_info': '324356', 'Name': 'PASSION'},
{'NO_USERS': '5', 'id_info': '324356', 'Name': 'VICTOR'}]
To extract values you can iterate:
for dct in data:
    print(dct["NO_USERS"])
Output:
3
1
5
If you want a dict of dicts and want to access each section by number from 1 to n, you can store them as nested dicts using 1 to n as the keys:
from itertools import count
import re

with open("test.txt") as f:
    data, cn = {}, count(1)
    r = re.compile(r"NO_USERS=\d+|id_info=\d+")
    for line in f:
        if line.startswith("BLOCK"):
            _, l = next(f), next(f)
            name = line.split(None, 2)[1]
            d = dict(s.split("=") for s in r.findall(l))
            d["Name"] = name
            data[next(cn)] = d
    # the counter is one ahead of the number of blocks stored
    data["num_blocks"] = next(cn) - 1
Output:
from pprint import pprint as pp
pp(data)
{1: {'NO_USERS': '3', 'Name': 'IMPULSE', 'id_info': '1021055'},
2: {'NO_USERS': '1', 'Name': 'PASSION', 'id_info': '324356'},
3: {'NO_USERS': '5', 'Name': 'VICTOR', 'id_info': '324356'},
'num_blocks': 3}
'num_blocks' will tell you exactly how many blocks you extracted.
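The snippets above depend on NO_USERS and id_info always sitting on the second line after BLOCK. A minimal sketch that drops this dependency, assuming only that continuation lines end with a backslash as in the sample, is to join each block into one logical line first and search that. The helper names iter_logical_lines and parse_blocks are made up for illustration, and SAMPLE is a shortened copy of the file from the question.

```python
import re

# Two of the blocks from the question; a raw string keeps the trailing backslashes.
SAMPLE = r'''#--COMMENTS CAN BE ADDED HERE--#
BLOCK IMPULSE DATE 01-JAN-2010 6 DEHDUESO203028DJE \
SEQUENCE=ai=0:at=221:ae=3:lu=100:lo=NNU:ei=1021055:lr=1: \
USERID=ID=291821 NO_USERS=3 GROUP=ONE id_info=1021055 \
CREATION_DATE=27-JUNE-2013 SN=1021055 KEY ="22WS \
4325 BG56"
BLOCK PASSION DATE 01-JAN-2010 6 DEHDUESO203028DJE \
SEQUENCE=ai=0:at=221:ae=3:lu=100:lo=NNU:ei=324356:lr=1: \
USERID=ID=291821 NO_USERS=1 GROUP=ONE id_info=324356 \
CREATION_DATE=27-MAY-2012 SN=324356 KEY ="22WS \
CD32 12QW"
#--BLOCK--ENDS--HERE#
'''

FIELD_RE = re.compile(r"\b(NO_USERS|id_info)=(\d+)")


def iter_logical_lines(lines):
    """Join physical lines that end with a backslash into one logical line."""
    buf = []
    for raw in lines:
        line = raw.rstrip()
        if line.endswith("\\"):
            buf.append(line[:-1].strip())
            continue
        buf.append(line.strip())
        yield " ".join(buf)
        buf = []
    if buf:  # file ended in the middle of a continuation
        yield " ".join(buf)


def parse_blocks(lines):
    """Return one dict per BLOCK with its name and the wanted fields."""
    records = []
    for logical in iter_logical_lines(lines):
        if not logical.startswith("BLOCK"):
            continue  # comments and anything else
        record = {"Name": logical.split()[1]}
        record.update(FIELD_RE.findall(logical))
        records.append(record)
    return records


records = parse_blocks(SAMPLE.splitlines())
print(records)
```

With a real file, passing the open file object to parse_blocks should work the same way, since it only iterates over lines.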

Related

How to arrange html sentences having different structures

I have a few hundred HTML files that look like the one below.
<nonDerivativeTable>
<nonDerivativeHolding> #First Holding
<securityTitle>
<value>Stock</value>
</securityTitle>
</nonDerivativeHolding>
<nonDerivativeHolding> #Second Holding
<securityTitle>
<footnoteId id="F1"/>
</securityTitle>
</nonDerivativeHolding>
<nonDerivativeHolding> #Third Holding
<securityTitle>
<value>Option</value>
<footnoteId id="F2"/>
<footnoteId id="F3"/>
</securityTitle>
</nonDerivativeHolding>
</nonDerivativeTable>
The two variables I would like to extract are security ('Stock' in #First holding, '' in #Second holding, and 'Option' in #Third holding) and security_footnote ('' in #First holding, 'F1' in #Second holding, and 'F2; F3' in #Third holding). But securityTitle and securityTitleFootnote do not always exist.
Also, sometimes there are multiple footnote IDs, as in the #Third holding.
I want to write one row per "Holding" tag, allowing for empty values.
import csv
from bs4 import BeautifulSoup

with open('output.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    soup = BeautifulSoup(doc, 'html.parser')  # Let's say doc has the html.
    try:
        securityTitles = soup.select('securityTitle > value').text
    except:
        securityTitles = ''
    try:
        securityTitleFootnotes = '; '.join(soup.select('securityTitle > footnoteid').get('id'))
    except:
        securityTitleFootnotes = ''
    for securityTitle, securityTitleFootnote in zip(securityTitles, securityTitleFootnotes):
        writer.writerow([securityTitle, securityTitleFootnote])
I want the result to be
Want Table
Note: one of the URLs that I am trying to parse is "https://www.sec.gov/Archives/edgar/data/12927/0001225208-09-018738.txt". The snippet that I posted is only part of the data.
Now I see that those are XML rather than HTML.
You can find the contents for each nonDerivativeHolding, and then apply a custom list of handlers for each:
from bs4 import BeautifulSoup as soup
c = [i.securitytitle.contents for i in soup(s, 'html.parser').find_all('nonderivativeholding')]
h = [('value', lambda x:x.text), ('footnoteid', lambda x:x['id'])]
results = [[i for i in b if i != '\n'] for b in c]
r = [{a:(lambda x:'' if not x else x[0] if len(x) == 1 else x)([b(j) for j in i if j.name == a]) for a, b in h} for i in results]
Output:
[{'value': 'Stock', 'footnoteid': ''}, {'value': '', 'footnoteid': 'F1'}, {'value': 'Option', 'footnoteid': ['F2', 'F3']}]
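Since the asker notes the data is XML, a more explicit alternative is the standard library's xml.etree.ElementTree. This is only a sketch: it assumes the nonDerivativeTable fragment has already been isolated from the surrounding filing and is well-formed, and the function name holdings_to_rows is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, without the "#First Holding" annotations.
SAMPLE = """<nonDerivativeTable>
<nonDerivativeHolding>
<securityTitle>
<value>Stock</value>
</securityTitle>
</nonDerivativeHolding>
<nonDerivativeHolding>
<securityTitle>
<footnoteId id="F1"/>
</securityTitle>
</nonDerivativeHolding>
<nonDerivativeHolding>
<securityTitle>
<value>Option</value>
<footnoteId id="F2"/>
<footnoteId id="F3"/>
</securityTitle>
</nonDerivativeHolding>
</nonDerivativeTable>"""


def holdings_to_rows(xml_text):
    """Return one (security, footnotes) tuple per holding, '' when missing."""
    root = ET.fromstring(xml_text)
    rows = []
    for holding in root.iter("nonDerivativeHolding"):
        title = holding.find("securityTitle")
        if title is None:
            rows.append(("", ""))
            continue
        value = title.findtext("value", default="").strip()
        footnotes = [f.get("id", "") for f in title.findall("footnoteId")]
        rows.append((value, "; ".join(footnotes)))
    return rows


rows = holdings_to_rows(SAMPLE)
for row in rows:
    print(row)
```

Each tuple can be passed straight to csv.writer's writerow, which covers the "one row per holding" requirement.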

parse a structured (structure of machine) text-file (config-file) into a structured table format

The main goal is to turn a more or less readable config file into a table format that anyone can read without a deeper understanding of the machine and its configuration standards.
I've got a config file:
******A MANO:111111 ,20190726,001,0914,06621242746
DXS*HAWA776A0A*VA*V0/6*1
ST*001*0001
ID1*HAW250755*VMI1-9900****250755*6*0
CB1*021545*DeBright*7.030.16*3.02*250755
PA1*0*100
PA1*1*60
PA2*2769*166140*210*12600*0*0*0*0
******E MANO:111111 ,20190726,001,0914,06621242746
******A MANO:222222 ,20190726,001,0914,06621242746
DXS*HAWA776A0A*VA*V0/6*1
ST*001*0001
ID1*HAW250755*VMI1-9900****250755*6*0
CB1*021545*DeBright*7.030.16*3.02*250755
PA1*0*100
PA1*1*60
PA2*2769*166140*210*12600*0*0*0*0
******E MANO:222222 ,20190726,001,0914,06621242746
There are several objects in the file, each starting with 'A MANO:' and ending with 'E MANO:' followed by the object number.
All the lines in between are the attributes of the object (settings of the machine). Not all objects have the same number of settings: there could be 55 lines for one object and 199 for another.
What I tried so far:
from pyparsing import *
'''
grammar:
object_nr ::= Word(nums, exact=6)
num ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
'''
path_input = r'\\...\...'
with open(path_input) as input_file:
    line = input_file.readline()
    cnt = 1
    object_nr_parser = Word(nums, exact=6)
    for match, start, stop in object_nr_parser.scanString(input_file):
        print(match, start, stop)
which gives me the printout:
['201907'] 116 122
['019211'] 172 178
These are the numbers it finds, with their start and end positions in the string. But these numbers are neither what I'm looking for nor correct, and I can't even find the second number in the config file.
Is pyparsing the right way to solve this, or is there a more convenient way? Where did I make the mistake?
In the end it would be great to have an object for every machine, with attributes that are all the lines between the A MANO: and the E MANO:
The expected result would be something like this:
{"object": "111111",
"line1":"DXS*HAWA776A0A*VA*V0/6*1",
"line2":"ST*001*0001",
"line3":"ID1*HAW250755*VMI1-9900****250755*6*0",
"line4":"CB1*021545*DeBright*7.030.16*3.02*250755",
"line5":"PA1*0*100",
"line6":"PA1*1*60",
"line7":"PA2*2769*166140*210*12600*0*0*0*0"},
{"object": "222222",
"line1":"DXS*HAWA776A0A*VA*V0/6*1",
"line2":"ST*001*0001",
"line3":"ID1*HAW250755*VMI1-9900****250755*6*0",
"line4":"CB1*021545*DeBright*7.030.16*3.02*250755",
"line5":"PA1*0*100",
"line6":"PA1*1*60",
"line7":"PA2*2769*166140*210*12600*0*0*0*0",
"line8":"PA2*2769*166140*210*12600*0*0*0*0",
"line9":"PA2*2769*166140*210*12600*0*0*0*0",
"line10":"PA2*2769*166140*210*12600*0*0*0*0"}
Not sure if that is the best solution for the purpose, but it's the one that came to mind.
One of the dirtiest ways to get this done would be to use a regex to replace the MANO markers with line breaks and all the line breaks with ';'. I don't think that is a solution one should use.
You can parse it line by line:
import re

with open('file.txt', 'r') as f:
    lines = f.readlines()
lines = [x.strip() for x in lines]
result = []
name = ''
i = 1
for line in lines:
    if 'A MANO' in line:
        name = re.findall(r'A MANO:(\d+)', line)[0]
        result.append({'object': name})
        i = 1
    elif 'E MANO' not in line:
        result[-1][f'line{i}'] = line
        i += 1
Output:
[{
'object': '111111',
'line1': 'DXS*HAWA776A0A*VA*V0/6*1',
'line2': 'ST*001*0001',
'line3': 'ID1*HAW250755*VMI1-9900****250755*6*0',
'line4': 'CB1*021545*DeBright*7.030.16*3.02*250755',
'line5': 'PA1*0*100',
'line6': 'PA1*1*60',
'line7': 'PA2*2769*166140*210*12600*0*0*0*0'
}, {
'object': '222222',
'line1': 'DXS*HAWA776A0A*VA*V0/6*1',
'line2': 'ST*001*0001',
'line3': 'ID1*HAW250755*VMI1-9900****250755*6*0',
'line4': 'CB1*021545*DeBright*7.030.16*3.02*250755',
'line5': 'PA1*0*100',
'line6': 'PA1*1*60',
'line7': 'PA2*2769*166140*210*12600*0*0*0*0'
}
]
But I suggest using a more compact output format:
import re

with open('file.txt', 'r') as f:
    lines = f.readlines()
lines = [x.strip() for x in lines]
result = {}
name = ''
for line in lines:
    if 'A MANO' in line:
        name = re.findall(r'A MANO:(\d+)', line)[0]
        result[name] = []
    elif 'E MANO' not in line:
        result[name].append(line)
Output:
{
'111111': ['DXS*HAWA776A0A*VA*V0/6*1', 'ST*001*0001', 'ID1*HAW250755*VMI1-9900****250755*6*0', 'CB1*021545*DeBright*7.030.16*3.02*250755', 'PA1*0*100', 'PA1*1*60', 'PA2*2769*166140*210*12600*0*0*0*0'],
'222222': ['DXS*HAWA776A0A*VA*V0/6*1', 'ST*001*0001', 'ID1*HAW250755*VMI1-9900****250755*6*0', 'CB1*021545*DeBright*7.030.16*3.02*250755', 'PA1*0*100', 'PA1*1*60', 'PA2*2769*166140*210*12600*0*0*0*0']
}
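Because the stated goal is a table, the per-object lines can be flattened further into rows. The sketch below assumes that '*' separates the fields of each setting line and that the first field names the record type; both are guesses from the sample, not documented facts about the format. The names mano_rows and START_RE are made up for illustration, and SAMPLE is a shortened copy of the config file.

```python
import csv
import io
import re

SAMPLE = """******A MANO:111111 ,20190726,001,0914,06621242746
DXS*HAWA776A0A*VA*V0/6*1
ST*001*0001
PA1*0*100
******E MANO:111111 ,20190726,001,0914,06621242746
******A MANO:222222 ,20190726,001,0914,06621242746
DXS*HAWA776A0A*VA*V0/6*1
PA1*1*60
******E MANO:222222 ,20190726,001,0914,06621242746
"""

START_RE = re.compile(r"A MANO:(\d+)")


def mano_rows(lines):
    """Yield (object_nr, record_type, fields) for every setting line."""
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        start = START_RE.search(line)
        if start:
            current = start.group(1)  # a new object begins
        elif "E MANO:" in line:
            current = None  # the object ends
        elif current is not None:
            record_type, *fields = line.split("*")
            yield current, record_type, fields


rows = list(mano_rows(SAMPLE.splitlines()))

# Write the rows as CSV; an in-memory buffer stands in for a real file.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["object", "record", "fields"])
for object_nr, record_type, fields in rows:
    writer.writerow([object_nr, record_type, "|".join(fields)])
print(out.getvalue())
```

Keeping the fields in a single joined column avoids guessing how many columns each record type has; splitting them into separate columns would need knowledge of the format.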

Parse vertical text in a file with repeated block

What is the best way to parse the file below? The blocks repeat multiple times.
The expected result is output to a CSV file as:
{Place: REGION-1, Host: ABCD, Area: 44...}
I tried the code below, but it only iterates over the first block and then finishes.
with open('/tmp/t2.txt', 'r') as input_data:
    for line in input_data:
        if re.findall('(.*_RV)\n',line):
            myDict={}
            myDict['HOST'] = line[6:]
            continue
        elif re.findall('Interface(.*)\n',line):
            myDict['INTF'] = line[6:]
        elif len(line.strip()) == 0:
            print(myDict)
Text file is below.
Instance REGION-1:
ABCD_RV
Interface: fastethernet01/01
Last state change: 0h54m44s ago
Sysid: 01441
Speaks: IPv4
Topologies:
ipv4-unicast
SAPA: point-to-point
Area Address(es):
441
IPv4 Address(es):
1.1.1.1
EFGH_RV
Interface: fastethernet01/01
Last state change: 0h54m44s ago
Sysid: 01442
Speaks: IPv4
Topologies:
ipv4-unicast
SAPA: point-to-point
Area Address(es):
442
IPv4 Address(es):
1.1.1.2
Instance REGION-2:
IJKL_RV
Interface: fastethernet01/01
Last state change: 0h54m44s ago
Sysid: 01443
Speaks: IPv4
Topologies:
ipv4-unicast
SAPA: point-to-point
Area Address(es):
443
IPv4 Address(es):
1.1.1.3
Or if you prefer an ugly regex route:
import re

region_re = re.compile(r"^Instance\s+([^:]+):.*")
host_re = re.compile(r"^\s+(.*?)_RV.*")
interface_re = re.compile(r"^\s+Interface:\s+(.*?)\s+")
other_re = re.compile(r"^\s+([^\s]+).*?:\s+([^\s]*){0,1}")

myDict = {}
extra = None
with open('/tmp/t2.txt', 'r') as input_data:
    for line in input_data:
        if extra:  # value on next line from key
            myDict[extra] = line.strip()
            extra = None
            continue
        region = region_re.match(line)
        if region:
            if len(myDict) > 1:
                print(myDict)
            myDict = {'Place': region.group(1)}
            continue
        host = host_re.match(line)
        if host:
            if len(myDict) > 1:
                print(myDict)
            myDict = {'Place': myDict['Place'], 'Host': host.group(1)}
            continue
        interface = interface_re.match(line)
        if interface:
            myDict['INTF'] = interface.group(1)
            continue
        other = other_re.match(line)
        if other:
            groups = other.groups()
            if groups[1]:
                myDict[groups[0]] = groups[1]
            else:
                extra = groups[0]

# dump out final one
if len(myDict) > 1:
    print(myDict)
output:
{'Place': 'REGION-1', 'Host': 'ABCD', 'INTF': 'fastethernet01/01', 'Last': '0h54m44s', 'Sysid': '01441', 'Speaks': 'IPv4', 'Topologies': 'ipv4-unicast', 'SAPA': 'point-to-point', 'Area': '441', 'IPv4': '1.1.1.1'}
{'Place': 'REGION-1', 'Host': 'EFGH', 'INTF': 'fastethernet01/01', 'Last': '0h54m44s', 'Sysid': '01442', 'Speaks': 'IPv4', 'Topologies': 'ipv4-unicast', 'SAPA': 'point-to-point', 'Area': '442', 'IPv4': '1.1.1.2'}
{'Place': 'REGION-2', 'Host': 'IJKL', 'INTF': 'fastethernet01/01', 'Last': '0h54m44s', 'Sysid': '01443', 'Speaks': 'IPv4', 'Topologies': 'ipv4-unicast', 'SAPA': 'point-to-point', 'Area': '443', 'IPv4': '1.1.1.3'}
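The question asks for CSV output, while the code above only prints each dictionary. One way to bridge that, sketched here under the assumption that the dictionaries are collected in a list instead of printed, is csv.DictWriter with the column list built from all keys seen. The function name records_to_csv and the shortened sample records are illustrative only.

```python
import csv
import io

# Shortened stand-ins for the dictionaries printed above.
records = [
    {'Place': 'REGION-1', 'Host': 'ABCD', 'INTF': 'fastethernet01/01', 'Sysid': '01441'},
    {'Place': 'REGION-2', 'Host': 'IJKL', 'Sysid': '01443', 'Area': '443'},
]


def records_to_csv(records, out):
    """Write records to out as CSV; keys missing from a record become ''."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen order
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return fieldnames


buffer = io.StringIO()  # replace with open('out.csv', 'w', newline='') for a file
columns = records_to_csv(records, buffer)
print(buffer.getvalue())
```

Building the header from the data means a block that lacks a field, or carries an extra one, does not break the writer.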
This doesn't use much regex and could be more optimized. Hope it helps!
import re
import pandas as pd
from collections import defaultdict

_level_1 = re.compile(r'instance region.*', re.IGNORECASE)
with open('stack_formatting.txt') as f:
    data = f.readlines()

"""
Format data so that it could be split easily
"""
data_blocks = defaultdict(lambda: defaultdict(str))
header = None
instance = None
for line in data:
    line = line.strip()
    if _level_1.match(line):
        header = line
    else:
        if "_RV" in line:
            instance = line
        elif not line.endswith(":"):
            data_blocks[header][instance] += line + ";"
        else:
            data_blocks[header][instance] += line

def parse_text(data_blocks):
    """
    Generate a dict which could be converted easily to a pandas dataframe
    :param data_blocks: splittable data
    :return: dict with row values for every column
    """
    final_data = defaultdict(list)
    for key1 in data_blocks.keys():
        for key2 in data_blocks.get(key1):
            final_data['instance'].append(key1)
            final_data['sub_instance'].append(key2)
            for items in data_blocks[key1][key2].split(";"):
                print(items)
                if items.isspace() or len(items) == 0:
                    continue
                a, b = re.split(r':\s*', items)
                final_data[a].append(b)
    return final_data

print(pd.DataFrame(parse_text(data_blocks)))
This worked for me but it's not pretty:
import pandas as pd

text=input_data  # input_data is assumed to hold the whole file as one string
text=text.rstrip(' ').rstrip('\n').strip('\n')
#first I get ready to create a csv by replacing the headers for the data
text=text.replace('Instance REGION-1:',',')
text=text.replace('Instance REGION-2:',',')
text=text.replace('Interface:',',')
text=text.replace('Last state change:',',')
text=text.replace('Sysid:',',')
text=text.replace('Speaks:',',')
text=text.replace('Topologies:',',')
text=text.replace('SAPA:',',')
text=text.replace('Area Address(es):',',')
text=text.replace('IPv4 Address(es):',',')
#now I strip out the leading whitespace, because it messes up the split on '\n\n'
lines=[x.lstrip(' ') for x in text.split('\n')]
clean_text=''
#now that the leading whitespace is gone I recreate the text file
for line in lines:
    clean_text+=line+'\n'
#Now split the data into groups based on single entries
entries=clean_text.split('\n\n')
#create one liners out of the entries so they can be split like csv
entry_lines=[x.replace('\n',' ') for x in entries]
#create a dataframe to hold the data for each line
df=pd.DataFrame(columns=['Instance REGION','Interface',
                         'Last state change','Sysid','Speaks',
                         'Topologies','SAPA','Area Address(es)',
                         'IPv4 Address(es)']).T
#now the meat and potatoes
count=0
for line in entry_lines:
    data=line[1:].split(',') #split like a csv on commas
    data=[x.lstrip(' ').rstrip(' ') for x in data] #get rid of extra leading/trailing whitespace
    df[count]=data #create an entry for each split
    count+=1 #increment the count
df=df.T #transpose back to normal so it doesn't look weird
Output looks like this for me
Edit: Also, since you have various answers here, I tested the performance of mine. It is mildly exponential, as described by the equation y = 100.97e^(0.0003x).
Here are my timeit results:
Entries Milliseconds
18 49
270 106
1620 394
178420 28400

How can I match strings that may or may not exist in regex but have placeholders if the match does not exist

Suppose I have a big text file in the following form
[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]
There are 8 keys in total, always in the order "Surname","Name","Age","Weight","Height","School","Siblings","Quote", which I know beforehand. As you can see, some profiles do not have the full set of variables. The only thing you can be sure will exist is the name.
I want to create a pandas dataframe with each observation as a row and each key as a column. In the case of James, since he has no entries for "School" and "Siblings", I would like those cells to hold the numpy nan object.
My attempt uses something like (?:\[Surname: \"()\"\]) for every variable. But even for the single case of surname I run into problems: if the surname does not exist, it returns no placeholder, just an empty list.
Update:
As an example, I would like the return for Monica's profile to be
('','Monica','','33','','','','I am looking forward to christmas')
You can parse the file data, group the results, and pass to a dataframe:
import re
import pandas as pd

def group_results(d):
    _group = [d[0]]
    for a, b in d[1:]:
        if a == 'Name' and not any(c == 'Name' for c, _ in _group):
            _group.append([a, b])
        elif a == 'Surname' and any(c == 'Name' for c, _ in _group):
            yield _group
            _group = [[a, b]]
        else:
            if a == 'Name':
                yield _group
                _group = [[a, b]]
            else:
                _group.append([a, b])
    yield _group

headers = ["Surname","Name","Age","Weight","Height","School","Siblings","Quote"]
data = list(filter(None, [i.strip('\n') for i in open('filename.txt')]))
parsed = [(lambda x:[x[0], x[-1][1:-1]])(re.findall(r'(?<=^\[)\w+|".*?"(?=\]$)', i)) for i in data]
_grouped = list(map(dict, group_results(parsed)))
result = pd.DataFrame([[c.get(i, "") for i in headers] for c in _grouped], columns=headers)
Output:
Surname Name ... Siblings Quote
0 Gordon James ... I want to be a pilot
1 Monica ... I am looking forward to christmas
[2 rows x 8 columns]
Building on @WiktorStribiżew's comment, you could use groupby (from itertools) to group the lines into empty lines and data lines, for instance like this:
import re
from itertools import groupby

text = '''[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]

[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]

[Name: "John"]
[Height: "33"]
[Quote: "I am looking forward to christmas"]

[Surname: "Gordon"]
[Name: "James"]
[Height: "44"]
[Quote: "I am looking forward to christmas"]'''

patterns = [re.compile(r'(\[Surname: "(?P<surname>\w+?)"\])'),
            re.compile(r'(\[Name: "(?P<name>\w+?)"\])'),
            re.compile(r'(\[Age: "(?P<age>\d+?)"\])'),
            re.compile(r'\[Weight: "(?P<weight>\d+?)"\]'),
            re.compile(r'\[Height: "(?P<height>\d+?)"\]'),
            re.compile(r'\[Quote: "(?P<quote>.+?)"\]')]

records = []
# the blank lines between profiles are what separates the groups
for non_empty, group in groupby(text.splitlines(), key=lambda l: bool(l.strip())):
    if non_empty:
        lines = list(group)
        record = {}
        for line in lines:
            for pattern in patterns:
                match = pattern.search(line)
                if match:
                    record.update(match.groupdict())
                    break
        records.append(record)

for record in records:
    print(record)
Output
{'weight': '46', 'quote': 'I want to be a pilot', 'age': '13', 'name': 'James', 'height': '12', 'surname': 'Gordon'}
{'weight': '33', 'quote': 'I am looking forward to christmas', 'name': 'Monica'}
{'height': '33', 'quote': 'I am looking forward to christmas', 'name': 'John'}
{'height': '44', 'surname': 'Gordon', 'quote': 'I am looking forward to christmas', 'name': 'James'}
Note: This creates a dictionary whose keys are the field names and whose values are the field values. This format does not match your intended output, but I believe it is more complete than what you requested. In any case you can easily convert from this format into the desired tuple format.
Explanation
The groupby function from itertools groups the input data into contiguous groups of empty lines and record lines, so you only need to process the groups that are not empty. The processing is simple: for each line, try each pattern in turn; on the first match, update the record dictionary with the value of the field (leveraging named groups) and break, on the assumption that each line matches at most one pattern.
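The conversion from these record dictionaries to the tuple format asked for in the question could look like the sketch below. The lower-case key names follow the named groups used above; HEADERS and to_row are illustrative names, and "school" and "siblings" are included only because the question lists them, even though no pattern above fills them.

```python
# Column order from the question, lower-cased to match the named groups.
HEADERS = ["surname", "name", "age", "weight", "height", "school", "siblings", "quote"]


def to_row(record, placeholder=""):
    """Return the record as a tuple in HEADERS order, filling gaps."""
    return tuple(record.get(key, placeholder) for key in HEADERS)


monica = {"weight": "33", "quote": "I am looking forward to christmas", "name": "Monica"}
row = to_row(monica)
print(row)
```

Passing placeholder=float("nan") instead of the empty string gives the NaN cells mentioned in the question, and a list of such rows together with HEADERS as the column names should be enough to build a pandas DataFrame.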
You can rewrite your data file. The code parses your original file into instances of a class D, then uses csv.DictWriter to write them into a regular CSV file that should be readable by pandas:
Create demo file:
fn = "t.txt"
with open(fn, "w") as f:
    f.write("""
[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]
""")
Intermediate class:
class D:
    fields = ["Surname","Name","Age","Weight","Height","Quote"]

    def __init__(self, textlines):
        # each text line looks like 'Key: "value"'; split on the first colon only
        t = [(k.strip(), v.strip()) for k, v in (x.strip().split(":", 1) for x in textlines)]
        self.data = {k: "" for k in D.fields}
        self.data.update(t)

    def surname(self): return self.data["Surname"]
    def name(self): return self.data["Name"]
    def age(self): return self.data["Age"]
    def weight(self): return self.data["Weight"]
    def height(self): return self.data["Height"]
    def quote(self): return self.data["Quote"]

    def get_data(self):
        return self.data
Parsing and rewriting:
fn = "t.txt"
# list of all collected D-Instances
data = []
with open(fn) as f:
    # each dataset contains all lines belonging to one "person"
    dataset = []
    surname = False
    for line in f.readlines():
        clean = line.strip().strip("[]")
        if clean and (clean.startswith("Surname") or clean.startswith("Name")):
            if any(e.startswith("Name") for e in dataset):
                data.append(D(dataset))
                dataset = []
                if clean:
                    dataset.append(clean)
            else:
                if clean:
                    dataset.append(clean)
        elif clean:
            dataset.append(clean)
    if dataset:
        data.append(D(dataset))

import csv
with open("other.txt", "w", newline="") as f:
    dw = csv.DictWriter(f, fieldnames=D.fields)
    dw.writeheader()
    for entry in data:
        dw.writerow(entry.get_data())
Check what was written:
with open("other.txt", "r") as f:
    print(f.read())
Output:
Surname,Name,Age,Weight,Height,Quote
"""Gordon""","""James""","""13""","""46""","""12""","""I want to be a pilot"""
,"""Monica""",,"""33""",,"""I am looking forward to christmas"""
Create a list of (key,value) tuples for each info block with re.findall(), and put them in separate dictionaries:
import re

text="""[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]"""
keys=['Surname','Name','Age','Weight','Height','Quote']
rslt=[{}]
for k,v in re.findall(r"(?m)(?:^\s*\[(\w+):\s*\"\s*([^\]\"]+)\"\s*\])+",text):
    d=rslt[-1]
    # a Surname after any data, or a second Name, starts a new profile
    if (k=="Surname" and d) or (k=="Name" and "Name" in d):
        d={}
        rslt.append(d)
    d[k]=v
for d in rslt:
    print( [d.get(k,'') for k in keys] )
Out:
['Gordon', 'James', '13', '46', '12', 'I want to be a pilot']
['', 'Monica', '', '33', '', 'I am looking forward to christmas']

How to search for multiple data from multiple lines and store them in dictionary?

Say I have a file with the following:
/* Full name: abc */
.....
.....(.....)
.....(".....) ;
/* .....
/* .....
..... : "....."
}
"....., .....
Car : true ;
House : true ;
....
....
Age : 33
....
/* Full name: xyz */
....
....
Car : true ;
....
....
Age : 56
....
I am only interested in the full name, car, house and age of each person. There are many other lines of data in different formats between the variables/attributes that I am interested in.
My code so far:
import re
initial_val = {'House': 'false', 'Car': 'false'}
with open('input.txt') as f:
    records = []
    current_record = None
    for line in f:
        if not line.strip():
            continue
        elif current_record is None:
            people_name = re.search('.+Full name ?: (.+) ', line)
            if people_name:
                current_record = dict(initial_val, Name = people_name.group(1))
            else:
                continue
        elif current_record is not None:
            house = re.search(' *(House) ?: ?([a-z]+)', line)
            if house:
                current_record['House'] = house.group(2)
            car = re.search(' *(Car) ?: ?([a-z]+)', line)
            if car:
                current_record['Car'] = car.group(2)
            people_name = re.search('.+Full name ?: (.+) ', line)
            if people_name:
                records.append(current_record)
                current_record = dict(initial_val, Name = people_name.group(1))
print records
What I get:
[{'Name': 'abc', 'House': 'true', 'Car': 'true'}]
My question:
How am I supposed to extract the data and store it in a dictionary like:
{'abc': {'Car': true, 'House': true, 'Age': 33}, 'xyz':{'Car': true, 'House': false, 'Age': 56}}
My purpose:
Check whether each person has a car, house and age; if not, return false.
Then I could print them in a table like this:
Name Car House Age
abc true true 33
xyz true false 56
Note that I am using Python 2.7 and I do not know what is the actual value of each variable/attribute (Eg. abc, true, true, 33) of each person.
What is the best solution to my question? Thanks.
Well, you just have to keep track of the current record:
from __future__ import print_function  # needed for print(..., sep=) on Python 2.7

def parse_name(line):
    # drop the newline first, then the initial '/* ' and final ' */'
    stripped_line = line.strip().strip('/* ')
    return stripped_line.split(':')[-1].strip()

WANTED_KEYS = ('Car', 'Age', 'House')
# default values for when the lines are not present for a record
INITIAL_VAL = {'Car': False, 'House': False, 'Age': -1}

with open('the_filename') as f:
    records = []
    current_record = None
    for line in f:
        if not line.strip():
            # skip empty lines
            continue
        elif current_record is None:
            # first record in the file
            if line.startswith('/*'):
                current_record = dict(INITIAL_VAL, name=parse_name(line))
            else:
                # this should probably be an error in the file contents
                continue
        elif line.startswith('/*'):
            # this means that the current record finished, and a new one is starting
            records.append(current_record)
            current_record = dict(INITIAL_VAL, name=parse_name(line))
        else:
            # partition does not raise when the line has no ':' or several
            key, _, val = line.partition(':')
            if key.strip() in WANTED_KEYS:
                # we want to keep track of this field; drop the trailing ' ;'
                current_record[key.strip()] = val.strip(' ;\n')
            # otherwise just ignore the line
    # the last record is not followed by a new header, so store it here
    if current_record is not None:
        records.append(current_record)

print('Name\tCar\tHouse\tAge')
for record in records:
    print(record['name'], record['Car'], record['House'], record['Age'], sep='\t')
Note that for Age you may want to convert it to an integer using int:
if key.strip() == 'Age':
    current_record['Age'] = int(val.strip(' ;\n'))
The above code produces a list of dictionaries, but it is easy enough to convert it to a dictionary of dicts:
new_records = {r['name']: dict(r) for r in records}
for val in new_records.values():
    del val['name']
After this new_records will be something like:
{'abc': {'Car': True, 'House': True, 'Age': 20}, ...}
If you have other lines with a different format in between the interesting ones you can simply write a function that returns True or False depending on whether the line is in the format you require and use it to filter the lines of the file:
def is_interesting_line(line):
    if line.startswith('/*'):
        return True
    elif ':' in line:
        return True
    return False

for line in filter(is_interesting_line, f):
    # code as before
Change is_interesting_line to suit your needs. If you have to handle several different formats, a regex may be better; in that case you could do something like:
import re

LINE_REGEX = re.compile(r'(/\*.*\*/)|(\w+\s*:.*)| <other stuff>')

def is_interesting_line(line):
    return LINE_REGEX.match(line) is not None
If you want you can obtain fancier formatting for the table, but you probably first need to determine the maximum length of the name etc. or you can use something like tabulate to do that for you.
For example something like (not tested):
max_name_length = max(max(len(r['name']) for r in records), 4)
format_string = '{:<{}}\t{:<{}}\t{}\t{}'
print(format_string.format('Name', max_name_length, 'Car', 5, 'House', 'Age'))
for record in records:
    print(format_string.format(record['name'], max_name_length, record['Car'], 5, record['House'], record['Age']))
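For the dictionary-of-dictionaries shape requested in the question, a regex-based variant can store each record under the person's name as soon as the header line is seen, so the last person is not lost at end of file. This is a sketch under the assumption that every person starts with a "/* Full name: ... */" line and that the wanted fields start their line; parse_people, NAME_RE and FIELD_RE are illustrative names.

```python
import re

SAMPLE = """/* Full name: abc */
.....(.....)
/* .....
..... : "....."
}
Car : true ;
House : true ;
....
Age : 33
/* Full name: xyz */
....
Car : true ;
....
Age : 56
....
"""

NAME_RE = re.compile(r"/\*\s*Full name\s*:\s*(.+?)\s*\*/")
FIELD_RE = re.compile(r"^\s*(Car|House|Age)\s*:\s*(\w+)")


def parse_people(lines):
    """Return {name: {'Car': ..., 'House': ..., 'Age': ...}} for every person."""
    people = {}
    current = None
    for line in lines:
        name = NAME_RE.search(line)
        if name:
            # defaults for fields that never show up for this person
            current = {"Car": "false", "House": "false", "Age": None}
            people[name.group(1)] = current
            continue
        field = FIELD_RE.match(line)
        if field and current is not None:
            current[field.group(1)] = field.group(2)
    return people


people = parse_people(SAMPLE.splitlines())
print("Name\tCar\tHouse\tAge")
for person, info in people.items():
    print("\t".join([person, info["Car"], info["House"], str(info["Age"])]))
```

The values stay strings here; converting Age with int() or mapping 'true'/'false' to booleans is left out because the question does not say which values can occur.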
