Parsing Text Structured with Indents in Python

I am getting stuck trying to figure out an efficient way to parse some plaintext that is structured with indents (from a word doc). Example (note: indentation below not rendering on mobile version of SO):
Attendance records                          8 F     1921-2010    Box 2
    1921-1927, 1932-1944
    1937-1939,1948-1966,
    1971-1979, 1989-1994, 2010
Number of meetings attended each year       1 F     1991-1994    Box 2
Papers re: Safaris                          10 F    1951-2011    Box 2
    Incomplete; Includes correspondence
    about beginning “Safaris” may also
    include announcements, invitations,
    reports, attendance, and charges; some
    photographs.
    See also: Correspondence and Minutes
So the unindented text is the parent record data, and each set of indented text below a parent line is a set of notes for that record (the notes themselves are also split across multiple lines).
So far I have a crude script to parse out the unindented parent lines so that I get a list of dictionary items:
import re

f = open('example_text.txt', 'r')
lines = f.readlines()

records = []
for line in lines:
    if line[0].isalpha():
        processed = re.split(r'\s{2,}', line.strip())
        title = processed[0]
        rec_id = processed[1]
        years = processed[2]
        location = processed[3]
        records.append({
            "title": title,
            "id": rec_id,
            "years": years,
            "location": location
        })
    else:
        print "These are the notes, but attaching them to the above records is not clear"

print records
and this produces:
[{'id': '8 F',
'location': 'Box 2',
'title': 'Attendance records',
'years': '1921-2010'},
{'id': '1 F',
'location': 'Box 2',
'title': 'Number of meetings attended each year',
'years': '1991-1994'},
{'id': '10 F',
'location': 'Box 2',
'title': 'Papers re: Safaris',
'years': '1951-2011'}]
But now I want to add to each record the notes to the effect of:
[{'id': '8 F',
'location': 'Box 2',
'title': 'Attendance records',
'years': '1921-2010',
'notes': '1921-1927, 1932-1944 1937-1939,1948-1966, 1971-1979, 1989-1994, 2010'
},
...]
What's confusing me is that I'm taking this procedural approach, line by line, and I'm not sure if there is a more Pythonic way to do this. I'm more used to scraping web pages, where at least you have selectors; here it's hard to double back while going one by one down the lines. I was hoping someone might be able to shake my thinking loose and provide a fresh view on a better way to attack this.
Update
Just adding the condition suggested by answer below over the indented lines worked fine:
import re
from pprint import pprint

f = open('example_text.txt', 'r')
lines = f.readlines()

records = []
for line in lines:
    if not line[0].isalpha():
        record['notes'].append(line.strip())
        continue
    processed = re.split(r'\s{2,}', line.strip())
    title = processed[0]
    rec_id = processed[1]
    years = processed[2]
    location = processed[3]
    record = {"title": title,
              "id": rec_id,
              "years": years,
              "location": location,
              "notes": []}
    records.append(record)

pprint(records)

As you have already solved the parsing of the records, I will only focus on how to read the notes of each one:
records = []

with open('data.txt', 'r') as lines:
    for line in lines:
        if line.startswith('\t'):
            record['notes'].append(line[1:])
            continue
        record = {'title': line, 'notes': []}
        records.append(record)

for record in records:
    print('Record is', record['title'])
    print('Notes are', record['notes'])
    print()
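For reference, a minimal sketch that combines the two pieces, splitting parent lines on runs of two or more spaces and attaching indented lines to the most recent record. It assumes every parent line has exactly four columns and that note lines never start with a letter, per the question's data:

import re
from pprint import pprint

records = []
with open('example_text.txt') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        if line[0].isalpha():
            # parent line: four columns separated by runs of 2+ spaces
            title, rec_id, years, location = re.split(r'\s{2,}', line.strip())
            records.append({'title': title, 'id': rec_id, 'years': years,
                            'location': location, 'notes': []})
        elif records:
            # indented continuation: attach to the most recent parent record
            records[-1]['notes'].append(line.strip())

# collapse each record's note lines into a single string
for rec in records:
    rec['notes'] = ' '.join(rec['notes'])

pprint(records)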

Related

Extracting data as python dict from CSV file

How do I extract the data in this CSV as a Python dictionary without importing packages?
sample of the data:
User-ID;"ISBN";"Book-Rating"
276725;"034545104X";"0"
276726;"0155061224";"5"
276727;"0446520802";"0"
276729;"052165615X";"3"
def loadRatings():
    # Get book ratings
    try:
        bookR = {}
        for line in open('booktext.csv'):
            (id, title) = line.split(';')[0:2]
            bookR[id] = title
        return bookR
    except IOError as ioerr:
        print('File error: ' + str(ioerr))

print(loadRatings())
but I need my result to be like
bookR = {User-ID: 276725, ISBN: 034545104X, Rating: 0}
This code will return:
with open("booktext.csv") as f:
    for i, line in enumerate(f):
        # skip header
        if i == 0:
            continue
        row_lst = line.replace("\n", "").replace('"', '').split(";")
        if len(row_lst) == 3:
            bookR = {
                "User-ID": row_lst[0],
                "ISBN": row_lst[1],
                "Rating": row_lst[2]
            }
            print(bookR)
{'User-ID': '276725', 'ISBN': '034545104X', 'Rating': '0'}
{'User-ID': '276726', 'ISBN': '0155061224', 'Rating': '5'}
{'User-ID': '276727', 'ISBN': '0446520802', 'Rating': '0'}
{'User-ID': '276729', 'ISBN': '052165615X', 'Rating': '3'}
You should always use the with context manager when working with files, unless you really know what you are doing and have a good reason not to. Read more on that at https://stackoverflow.com/a/3012921/20646982
The description is vague about what you are looking for; it is not clear whether it should be a single dict of all items or a separate dict per line. In case you need a normal dict, you can use this simple approach, with just a little formatting afterwards depending on the data types you require.
I managed to recreate results like this:
with open('ex.csv', newline="") as f:
    d = list(f.read().split('\n'))

keys = d[0].split(';')
values = d[1:]

book = {}
for idx, key in enumerate(keys):
    book[key] = []
    for i in range(len(values)):
        book[key].append(values[i].split(';')[idx])
Which produces results:
{'User-ID': ['276725', '276726', '276727', '276729'],
'"ISBN"': ['"034545104X"', '"0155061224"', '"0446520802"', '"052165615X"'],
'"Book-Rating"': ['"0"', '"5"', '"0"', '"3"']}
import csv

filename = "Geeks.csv"

# opening the file using the "with" statement
with open(filename, 'r') as data:
    for line in csv.DictReader(data):
        print(line)
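Since the sample data is semicolon-delimited with quoted values, csv.DictReader can also do the header handling, splitting, and quote-stripping in one pass; a minimal sketch, assuming the data lives in booktext.csv:

import csv

with open('booktext.csv', newline='') as f:
    # DictReader takes the field names from the first row and
    # strips the double quotes around each value automatically
    for row in csv.DictReader(f, delimiter=';'):
        print(row)
        # e.g. {'User-ID': '276725', 'ISBN': '034545104X', 'Book-Rating': '0'}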

Python JSON append if value doesn't exist

I've got a JSON file with 30-ish blocks of "dicts", where every block has an ID, like this:
{
    "ID": "23926695",
    "webpage_url": "https://.com",
    "logo_url": null,
    "headline": "aewafs",
    "application_deadline": "2020-03-31T23:59:59"
}
Since my script pulls information in the same way from an API more than once, I would like to append new "blocks" to the json file only if the ID doesn't already exist in the JSON file.
I've got something like this so far:
import os
import json

check_empty = os.stat('pbdb.json').st_size

if check_empty == 0:
    with open('pbdb.json', 'w') as f:
        f.write('[\n]')  # writes '[', then a line break, then ']'

output = json.load(open("pbdb.json"))

for i in jobs:
    output.append({
        'ID': job_id,
        'Title': jobtitle,
        'Employer': company,
        'Employment type': emptype,
        'Fulltime': tid,
        'Deadline': deadline,
        'Link': webpage
    })

with open('pbdb.json', 'w') as job_data_file:
    json.dump(output, job_data_file)
but I would like to only do the "output.append" part if the ID doesn't already exist in the JSON file.
I am not able to complete the code you provided, but I added an example to show how you can achieve the non-duplicate list of jobs (hopefully it helps):
# suppose `data` is your input data with duplicate ids included
data = [{'id': 1, 'name': 'john'}, {'id': 1, 'name': 'mary'}, {'id': 2, 'name': 'george'}]

# using a dictionary comprehension you can eliminate the duplicates and
# finally get the results by calling the `values` method on the dict
noduplicate = list({itm['id']: itm for itm in data}.values())

with open('pbdb.json', 'w') as job_data_file:
    json.dump(noduplicate, job_data_file)
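For the actual question (append only when the ID is new), one way is to collect the existing IDs into a set first; a minimal sketch, where new_jobs is a hypothetical name standing in for whatever list of freshly fetched records your API loop builds:

import json

with open('pbdb.json') as f:
    output = json.load(f)

# IDs already stored in the file
existing_ids = {record['ID'] for record in output}

for job in new_jobs:  # new_jobs: hypothetical list of freshly fetched records
    if job['ID'] not in existing_ids:
        output.append(job)
        existing_ids.add(job['ID'])

with open('pbdb.json', 'w') as job_data_file:
    json.dump(output, job_data_file)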
I'll just go with a database, guys. Thank you for your time, we can close this thread now.

Sequentially filescraping text files - a smarter way?

I'm trying to scrape some text files into a DB. The format is similar to this, with a couple of thousand segments like this:
Posted By
Date
John Keys
31.08.2019, 10:10 AM
Peter Hall 200 150
Ed Parker 14 1
Posted By
Date
John Keys
31.08.2019, 10:15 AM
Rose Stone 200 150
Travis Anderson 14 1
The records that are important are the ones coming right after "Date", so the logic is:
inside_match_flag = 0
for line in ins:
    if inside_match_flag == 1:
        inside_match_flag = 2   # bump it, as we will now collect all lines
    if line == "Posted By":     # until we see Posted By again (or EOF)
        inside_match_flag = 0   # we are now outside the segment
    if line == "Date":          # lines after Date are the ones we want
        inside_match_flag = 1   # the following lines are to be stored
This is roughly how I've done it before (the above is not running code): keep track of a flag, and depending on the flag value I know which lines are most likely coming next.
The issue is of course "the lines coming next": since I'm reading line by line, I can't easily grab out whole segments, and I don't want to rely on loading the complete file into memory (it can get huge).
But the code always gets ugly when I implement something like this, so I'm wondering whether anyone here has a much smarter approach.
Note: I'm also interested in a super-smart, compact way to do this if it requires loading everything into memory, as long as the code doesn't get so ugly; if it's all in memory, I guess I can just look for the Date field and save all lines until I see Posted By again.
Edit 1
Note the number of players can be more than 2 per game, so a record could also look like this :
Posted By
Date
John Keys
31.08.2019, 10:10 AM
Peter Hall 200 150
Ed Parker 54 1
Rose Stone 20 15
Travis Anderson 1 150
Posted By
...
....
My dream format would be an object like this (example based on the match above with 4 players):
{
    "Game 1:"
    {
        "posted by" : "john keys"
        "date" : "31.08.2019, 10:10 AM"
        "players" : {
            { 1, "Peter Hall", "200", "150" }
            { 2, "Ed Parker", "54", "1" }
            { 3, "Rose Stone", "20", "15" }
            { 4, "Travis Anderson", "1", "150" }
        }
    }
}
Note: that's not 100% correct JSON format, and it doesn't have to be JSON, just some object, as I will throw them into an SQLite database where it's stored per game, as illustrated above.
Optimized and memory-efficient generator function approach which yields records on demand:
import pprint

def extract_records(fname):
    def prepare_record(rec):
        return {'posted by': rec[0], 'date': rec[1],
                'players': [[i] + p.rsplit(maxsplit=2)
                            for i, p in enumerate(rec[2:], 1)]}

    with open(fname) as f:
        record = []
        add_item = False
        for line in f:
            line = line.strip()
            if line == 'Date':
                add_item = True
                continue
            elif line == 'Posted By':
                add_item = False
                if record:
                    yield prepare_record(record)
                record = []
                continue
            if add_item:
                record.append(line)

        if record:
            yield prepare_record(record)

records_gen = extract_records('datafile.txt')  # generator
for rec in records_gen:
    pprint.pprint(rec)  # further processing, e.g. inserting into DB
The output (2 sample records):
{'date': '31.08.2019, 10:10 AM',
'players': [[1, 'Peter Hall', '200', '150'],
[2, 'Ed Parker', '14', '1'],
[3, 'Rose Stone', '20', '15'],
[4, 'Travis Anderson', '1', '150']],
'posted by': 'John Keys'}
{'date': '31.08.2019, 10:15 AM',
'players': [[1, 'Rose Stone', '200', '150'],
[2, 'Travis Anderson', '14', '1']],
'posted by': 'John Keys'}
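Since the records end up in SQLite anyway, the generator can feed the database directly; a minimal sketch reusing extract_records from above, with an assumed two-table schema (the table and column names here are made up for illustration):

import sqlite3

conn = sqlite3.connect('games.db')
conn.execute('CREATE TABLE IF NOT EXISTS games '
             '(game_id INTEGER PRIMARY KEY, posted_by TEXT, date TEXT)')
conn.execute('CREATE TABLE IF NOT EXISTS players '
             '(game_id INTEGER, pos INTEGER, name TEXT, val1 TEXT, val2 TEXT)')

for game_id, rec in enumerate(extract_records('datafile.txt'), 1):
    conn.execute('INSERT INTO games VALUES (?, ?, ?)',
                 (game_id, rec['posted by'], rec['date']))
    # each player entry is [pos, name, val1, val2]
    conn.executemany('INSERT INTO players VALUES (?, ?, ?, ?, ?)',
                     [(game_id,) + tuple(player) for player in rec['players']])

conn.commit()
conn.close()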
There is no magic method for this specific case. Here is an example solution:
buf_size = ...
start_marker = "Posted By\n"
date_marker = "Date\n"

def parse_game(filename):
    fh = open(filename)
    page = ""
    buffer = True  # just the start value
    while buffer:
        buffer = fh.read(buf_size)
        page += buffer
        records = page.split(start_marker)
        if buffer:
            page = records.pop()
        for record in records:
            # skip everything before "Date" and split by lines
            chunks = record.split(date_marker, 1)[-1].split("\n")
            posted_by, date = chunks[:2]
            players = [chunk.split() for chunk in chunks[2:]]
            yield {
                "posted_by": posted_by,
                "date": date,
                "players": players
            }
If you can read the whole file into memory, it will be just:
def read_game(filename):
    for record in open(filename).read().split(start_marker):
        # skip everything before "Date" and split by lines
        chunks = record.split(date_marker, 1)[-1].split("\n")
        posted_by, date = chunks[:2]
        players = [chunk.split() for chunk in chunks[2:]]
        yield {
            "posted_by": posted_by,
            "date": date,
            "players": players
        }
This solution is very similar to Roman's. It is slightly less memory-efficient (assuming you have buf_size of memory), but will result in less IO.

Python - CSV File to Dict with Dataflow Template

I am trying to process a CSV file into a dict using a Dataflow template and Python.
As it is a template I have to use ReadFromText from the textio module, to be able to provide the path at runtime.
| beam.io.ReadFromText(contact_options.path)
All I need is to be able to extract the first line of this text/CSV file; I can then use this data in DictReader as the fieldnames.
If I use splitlines, it brings back each element of the text file in a list:
return element.splitlines()
or
csv_data = []
split_element = element.split('\n')
for row in split_element:
    csv_data.append(row)
return csv_data
['phone_number', 'cid', 'first_name', 'last_name']
[' ', '101XXXXX', 'MurXXX', 'LevXXXX']
['3052XXXXX', '109XXXXX', 'MerXXXX', 'CoXXXX']
['954XXXXX', '10XXXXXX', 'RoXXXX', 'MaXXXXX']
If I then use, say, element[0], it just brings everything back without the list brackets. I have also tried splitting by '\n' and then using a for loop to produce a list object, but it produces almost the same result.
I cannot rely on using predetermined fieldnames as the csv files to be processed will all have different fieldnames and DictReader will not work effectively without fieldnames given.
EDIT:
The expected output is:
[{'phone_Number': '561XXXXX', 'first_Name': '', 'last_Name': 'BeXXXX', 'cid': '745XXXXX'}, {'phone_Number': '561XXXXX', 'first_Name': 'A', 'last_Name': 'BXXXX', 'cid': '61XXXXX'}]
EDIT:
Element contents:
"phone_Number","cid","first_Name","last_Name"
"5616XXXXX","745XXXX","","BeXXXXX"
"561XXXXXX","61XXXXX","A","BXXXXXXt"
"95XXXXXXX","6XXXXXX","A","BXXXXXX"
"727XXXXXX","98XXXXXX","A","CaXXXXXX"
Use Pandas to load the values and use the first line as column headers:
import pandas as pd

a_big_list = [['phone_number', 'cid', 'first_name', 'last_name'],
              [' ', '101XXXXX', 'MurXXX', 'LevXXXX'],
              ['3052XXXXX', '109XXXXX', 'MerXXXX', 'CoXXXX'],
              ['954XXXXX', '10XXXXXX', 'RoXXXX', 'MaXXXXX']]

df = pd.DataFrame(a_big_list[1:], columns=a_big_list[0])
df.to_dict('records')

# [{'cid': '101XXXXX',
#   'first_name': 'MurXXX',
#   'last_name': 'LevXXXX',
#   'phone_number': ' '},
#  {'cid': '109XXXXX',
#   'first_name': 'MerXXXX',
#   'last_name': 'CoXXXX',
#   'phone_number': '3052XXXXX'},
#  {'cid': '10XXXXXX',
#   'first_name': 'RoXXXX',
#   'last_name': 'MaXXXXX',
#   'phone_number': '954XXXXX'}]
I was able to figure this problem out with inspiration from #mad_'s answer, but it still didn't give me the correct answer initially, as I needed to first group my pcollection into one element. I found a way of doing this inspired by this answer from Jiayuan Ma, and slightly altered it like so:
import apache_beam as beam
from apache_beam.io import ReadFromText

class Group(beam.DoFn):
    def __init__(self):
        self._buffer = []

    def process(self, element):
        self._buffer.append(element)

    def finish_bundle(self):
        if len(self._buffer) != 0:
            yield list(self._buffer)
            self._buffer = []

lines = (p
         | 'File reading' >> ReadFromText(known_args.input)
         | 'Group' >> beam.ParDo(Group())
         ...
Thus it grouped the entire CSV file as one object, and then I was able to apply mad_'s method to turn it into a dictionary.
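To then turn that single grouped element into dicts, csv.DictReader can be pointed at the joined lines; a minimal sketch (element_to_dicts is a hypothetical helper, assuming the grouped element is a list of raw CSV lines with the header first):

import csv
import io

def element_to_dicts(element):
    # element: list of raw CSV lines, header row first (hypothetical helper)
    # DictReader picks up the fieldnames from that first line automatically
    return list(csv.DictReader(io.StringIO('\n'.join(element))))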

Python iterate over list and join lines without a special character to the previous item

I'm wondering if anyone has a sort of hacky/cool solution to this problem. I have a text file like so:
NAME:name
ID:id
PERSON:person
LOCATION:location
NAME:name
morenamestuff
ID:id
PERSON:person
LOCATION:location
JUNK
So I have some blocks that all contain lines that can be split into a dict, and some that cannot. How can I take lines without the : character and join them to the previous line? Here's what I'm currently doing:
# loop through chunk
# the first element of dat is a Title, so skip that
key_map = dict(x.split(':') for x in dat[1:])
But I of course get an error because the second chunk has a line without the : character. So I wanted my dict to look something like this after correctly splitting it:
# there will be a key_map for each chunk of data
key_map['NAME'] == 'name morenamestuff'  # 3rd line appended to previous
key_map['ID'] == 'id'
key_map['PERSON'] == 'person'
key_map['LOCATION'] == 'location'
Solution
EDIT: Here's my final solution on github, and the full code here:
parseScript.py
import re
import string

bad_chars = '(){}"<>[] '  # characters we want to strip from the string
key_map = []

# parse file
with open("dat.txt") as f:
    data = f.read()
    data = data.strip('\n')
    data = re.split(r'}|\[{', data)

# format file
with open("format.dat") as f:
    formatData = [x.strip('\n') for x in f.readlines()]

data = filter(len, data)

# strip and split each station
for dat in data[1:-1]:
    # perform black magic, don't even try to understand this
    dat = dat.translate(string.maketrans("", ""), bad_chars).split(',')
    key_map.append(dict(x.split(':') for x in dat if ':' in x))
    if ':' not in dat[1]:
        key_map[-1]['NAME'] += dat[1]

for station in range(0, len(key_map)):
    for opt in formatData:
        print opt, ":", key_map[station][opt]
    print ""
dat.txt
View raw here
format.dat
NAME
STID
LONGITUDE
LATITUDE
ELEVATION
STATE
ID
out.dat
View raw here
When in doubt, write your own generator.
Add in itertools.groupby to chunk by groups of text delimited by whitespace breaks.
def chunker(s):
    it = iter(s)
    out = [next(it)]
    for line in it:
        if ':' in line or not line:
            yield ' '.join(out)
            out = []
        out.append(line)
    if out:
        yield ' '.join(out)
usage:
from itertools import groupby
[dict(x.split(':') for x in g) for k,g in groupby(chunker(lines), bool) if k]
Out[65]:
[{'ID': 'id', 'LOCATION': 'location', 'NAME': 'name', 'PERSON': 'person'},
{'ID': 'id',
'LOCATION': 'location',
'NAME': 'name morenamestuff',
'PERSON': 'person'}]
(if those fields are always the same, I'd go with something like creating some namedtuples instead of a bunch of dicts)
from collections import namedtuple
Thing = namedtuple('Thing', 'ID LOCATION NAME PERSON')
[Thing(**dict(x.split(':') for x in g)) for k,g in groupby(chunker(lines), bool) if k]
Out[76]:
[Thing(ID='id', LOCATION='location', NAME='name', PERSON='person'),
Thing(ID='id', LOCATION='location', NAME='name morenamestuff', PERSON='person')]
Here is something that addresses all your requirements. It handles joining of multiple lines, ignoring blank lines, and ignoring junk lines that do not appear within a block. It is implemented as a generator that yields each dictionary as it is completed.
def parser(data):
    d = {}
    for line in data:
        line = line.strip()
        if not line:
            if d:
                yield d
                d = {}
        else:
            if ':' in line:
                key, value = line.split(':')
                d[key] = value
            else:
                if d:
                    d[key] = '{} {}'.format(d[key], line)
    if d:
        yield d
When run with this data:
ignore me

NAME:name1
ID:id1
PERSON:person1
LOCATION:location1

NAME:name2
morenamestuff
ID:id2
PERSON:person2
LOCATION:location2

junk
and
other
stuff

NAME:name3
morenamestuff
and more
ID:id3
PERSON:person3
more person stuff
LOCATION:location3

JUNK
MORE JUNK
>>> for d in parser(open('data')):
... print d
{'PERSON': 'person1', 'LOCATION': 'location1', 'NAME': 'name1', 'ID': 'id1'}
{'PERSON': 'person2', 'LOCATION': 'location2', 'NAME': 'name2 morenamestuff', 'ID': 'id2'}
{'PERSON': 'person3 more person stuff', 'LOCATION': 'location3', 'NAME': 'name3 morenamestuff and more', 'ID': 'id3'}
You can grab the lot as a list:
>>> results = list(parser(open('data')))
>>> results
[{'PERSON': 'person1', 'LOCATION': 'location1', 'NAME': 'name1', 'ID': 'id1'}, {'PERSON': 'person2', 'LOCATION': 'location2', 'NAME': 'name2 morenamestuff', 'ID': 'id2'}, {'PERSON': 'person3 more person stuff', 'LOCATION': 'location3', 'NAME': 'name3 morenamestuff and more', 'ID': 'id3'}]
I don't find itertools or regex particularly nice to work with; here's a pure-Python solution:
separator = ':'
output = []
chunk = None

with open('/tmp/stuff.txt') as f:
    for line in (x.strip() for x in f):
        if not line:
            # we are between 'chunks'
            chunk, key = None, None
            continue
        if chunk is None:
            # we are at the beginning of a new 'chunk'
            chunk, key = {}, None
            output.append(chunk)
        if separator in line:
            key, val = line.split(separator)
            chunk[key] = val
        else:
            chunk[key] += line
Not as elegant as you requested, but this works:
dat = [['NAME:name',
        'ID:id',
        'PERSON:person',
        'LOCATION:location'],
       ['NAME:name',
        'morenamestuff',
        'ID:id',
        'PERSON:person',
        'LOCATION:location']]

k = 1
key_map = dict(x.split(':') for x in dat[k] if ':' in x)
if ':' not in dat[k][1]: key_map['NAME'] += dat[k][1]
>>> key_map
{'ID': 'id',
 'LOCATION': 'location',
 'NAME': 'namemorenamestuff',
 'PERSON': 'person'}
Just add something to lines with no ":".
if line.find(':') == -1:
    line = line + ':None'
Then you won't get an error.
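A minimal sketch of that padding applied to one chunk; note that, unlike the answers above, it stores the stray line as its own key with the value 'None' rather than joining it to the previous field:

chunk = ['NAME:name', 'morenamestuff', 'ID:id', 'PERSON:person', 'LOCATION:location']

padded = [line if ':' in line else line + ':None' for line in chunk]
key_map = dict(line.split(':', 1) for line in padded)
# {'NAME': 'name', 'morenamestuff': 'None', 'ID': 'id',
#  'PERSON': 'person', 'LOCATION': 'location'}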
