Parsing unstructured text file with Python

I have a text file, a few snippets of which look like this:
Page 1 of 515
Closing Report for Company Name LLC
222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
File number: Jackie Grant Status: Fell Thru Primary closing party: Seller
Acceptance: 01/01/2001 Closing date: 11/11/2011 Property type: Commercial Lease
MLS number: Sale price: $200,000 Commission: $1,500.00
Notes: 08/15/2000 02:30PM by Roger Lodge This property is a Commercial Lease handled by etc..
Seller: Company Name LLC
Company name: Company Name LLC
Address: 222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
Home: Pager:
Business: Fax:
Mobile: Email:
Buyer: Tomlinson, Ladainian
Address: 222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
Home: Pager:
Business: 555-555-5555 Fax:
Mobile: Email:
Lessee Agent: Blank, Arthur
Company name: Sprockets Inc.
Address: 5001 Old Man Dr, North Las Vegas, NV, 89002
Home: (575) 222-3455 Pager:
Business: Fax: 999-9990
Mobile: (702) 600-3492 Email: sprockets@yoohoo.com
Leasing Agent: Van Uytnyck, Chameleon
Company name: Company Name LLC
Address:
Home: Pager:
Business: Fax: 909-222-2223
Mobile: 595-595-5959 Email:
(should be 2 blank lines here.. this note is not in the actual text file)
Printed on Friday, June 12, 2015
Account owner: Roger Goodell
Page 2 of 515
Report for Adrian (Allday) Peterson
242 N 9th Street, #100 & 200
File number: Soap Status: Closed/Paid Primary closing party: Buyer
Acceptance: 01/10/2010 Closing date: 01/10/2010 Property type: RRR
MLS number: Sale price: $299,000 Commission: 33.00%
Seller: SOS, Bank
Address: 242 N 9th Street, #100 & 200
Home: Pager:
Business: Fax:
Mobile: Email:
Buyer: Sabel, Aaron
Address:
Home: Pager:
Business: Fax:
Mobile: Email: sia@yoohoo.com
Escrow Co: Schneider, Patty
Company name: National Football League
Address: 242 N 9th Street, #100 & 200
Home: Pager:
Business: 800-2009 Fax: 800-1100
Mobile: Email:
Buyers Agent: Munchak, Mike
Company name: Commission Group
Address:
Home: Pager:
Business: Fax:
Mobile: 483374-3892 Email: donation@yoohoo.net
Listing Agent: Ricci, Christina
Company name: Other Guys
Address:
Home: Pager:
Business: Fax:
Mobile: 888-333-3333 Email: general.adama@cylon.net
Here's my code:
import re

file = open('file-path.txt', 'r')
# if there are more than two consecutive blank lines, then we start a new Entry
entries = []
curr = []
prev_blank = False
for line in file:
    line = line.rstrip('\n').strip()
    if line == '':
        if prev_blank == True:
            # end of the entry: append it
            if len(curr) > 0:
                entries.append(curr)
                print(curr)
            curr = []
            prev_blank = False
        else:
            prev_blank = True
    else:
        prev_blank = False
        # we need to parse the line
        line_list = line.split()
        str = ''
        start = False
        for item in line_list:
            if re.match(r'[a-zA-Z\s]+:.*', item):
                if len(str) > 0:
                    curr.append(str)
                str = item
                start = True
            elif start == True:
                str = str + ' ' + item
Here is the output:
['number: Jackie Grant', 'Status: Fell Thru Primary closing', 'Acceptance: 01/01/2001 Closing', 'date: 11/11/2011 Property', 'number: Sale', 'price: $200,000', 'Home:', 'Business:', 'Mobile:', 'Home:', 'Business: 555-555-5555', 'Mobile:', 'Home: (575) 222-3455', 'Business:', 'Mobile: (702) 600-3492', 'Home:', 'Business:', 'Mobile: 595-595-5959']
My issues are as follows:
First, there should be 2 records as output, and I'm only outputting one.
In the top block of text, my script has trouble knowing where the previous value ends and the new one starts: 'Status: Fell Thru' should be one value, and 'Primary closing party: Buyer', 'Acceptance: 01/10/2010', 'Closing date: 01/10/2010', 'Property type: RRR', 'MLS number:', 'Sale price: $299,000', 'Commission: 33.00%' should each be caught.
Once this is parsed correctly, I will need to parse again to separate keys from values (i.e. 'Closing date': 01/10/2010), ideally into a list of dicts.
I can't think of a better way other than using regex to pick out keys, and then grabbing the snippets of text that follow.
When complete, I'd like a csv w/a header row filled with keys, that I can import into pandas w/read_csv. I've spent quite a few hours on this one..

(This isn't a complete answer, but it's too long for a comment).
Field names can have spaces (e.g. MLS number)
Several fields can appear on each line (e.g. Home: Pager:)
The Notes field has the time in it, with a : in it
These mean you can't take your approach to identifying the fieldnames by regex. It's impossible for it to know whether "MLS" is part of the previous data value or the subsequent fieldname.
Some of the Home: Pager: lines refer to the Seller, some to the Buyer or the Lessee Agent or the Leasing Agent. This means the naive line-by-line approach I take below doesn't work either.
This is the code I was working on; it runs against your test data but gives incorrect output due to the above. It's here as a reference for the approach I was taking:
replaces = [
    ('Closing Report for', 'Report_for:')
    ,('Report for', 'Report_for:')
    ,('File number', 'File_number')
    ,('Primary closing party', 'Primary_closing_party')
    ,('MLS number', 'MLS_number')
    ,('Sale price', 'Sale_price')
    ,('Account owner', 'Account_owner')
    # ...
    # etc.
]

def fix_linemash(data):
    # splits many fields on one line into several lines
    results = []
    mini_collection = []
    for token in data.split(' '):
        if ':' not in token:
            mini_collection.append(token)
        else:
            results.append(' '.join(mini_collection))
            mini_collection = [token]
    if mini_collection:  # flush the final field, otherwise it is lost
        results.append(' '.join(mini_collection))
    return [line for line in results if line]

def process_record(data):
    # takes a collection of lines, fixes them, and builds a record dict
    record = {}
    for old, new in replaces:
        data = data.replace(old, new)
    for line in fix_linemash(data):
        print(line)
        name, value = line.split(':', 1)
        record[name.strip()] = value.strip()
    return record

records = []
collection = []
blank_flag = False
for line in open('d:/lol.txt'):
    # Read through the file collecting lines and
    # looking for double blank lines;
    # every pair of blank lines, process the stored ones and reset
    line = line.strip()
    if line.startswith('Page '): continue
    if line.startswith('Printed on '): continue
    if not line and blank_flag:  # record finished
        records.append(process_record(' '.join(collection)))
        blank_flag = False
        collection = []
    elif not line:  # maybe end of record?
        blank_flag = True
    else:  # false alarm, record continues
        blank_flag = False
        collection.append(line)

for record in records:
    print(record)
I'm now thinking it would be a much better idea to do some pre-processing tidyup steps over the data:
Strip out "Page n of n" and "Printed on ..." lines, and similar
Identify all valid field names, then break up the combined lines, meaning every line has one field only, fields start at the start of a line.
Run through and just process the Seller/Buyer/Agents blocks, replacing fieldnames with an identifying prefix, e.g. Email: -> Seller Email:.
Then write a record parser, which should be easy - check for two blank lines, split the lines at the first colon, use the left bit as the field name and the right bit as the value. Store however you want (nb. that dictionary keys are unordered).
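As a rough sketch of step 2 (the field-name whitelist here is illustrative and would need to cover every name in the real report), the combined lines could be broken up with a regex alternation built from that list:

```python
import re

# Hypothetical whitelist of field names; extend it to cover the real report.
FIELDS = ['File number', 'Status', 'Primary closing party', 'Acceptance',
          'Closing date', 'Property type', 'MLS number', 'Sale price',
          'Commission', 'Home', 'Pager', 'Business', 'Fax', 'Mobile', 'Email']

# Longest names first so e.g. 'Primary closing party' wins over shorter matches.
pattern = re.compile('(' + '|'.join(sorted(map(re.escape, FIELDS), key=len, reverse=True)) + '):')

def split_fields(line):
    # Break one physical line into (field, value) pairs.
    parts = pattern.split(line)
    # parts[0] is any leading junk; then alternating field, value.
    return [(parts[i].strip(), parts[i + 1].strip())
            for i in range(1, len(parts) - 1, 2)]

print(split_fields('Status: Fell Thru Primary closing party: Seller'))
# [('Status', 'Fell Thru'), ('Primary closing party', 'Seller')]
```

Since every line is matched against the same whitelist, the "MLS" ambiguity disappears: "MLS" can only ever be the start of "MLS number".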

I suppose it is easier to start a new record by hitting the word "Page".
Just to share a little bit of my own experience: it is just too difficult to write a generalized parser.
The situation isn't that bad given the data here. Instead of using a simple list to store an entry, use an object. Add all other fields as attributes/values to the object.
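For example, a minimal sketch of that idea with a dataclass (the attribute names here are illustrative, not a complete mapping of the report):

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    # Illustrative subset of the report's fields.
    file_number: str = ''
    status: str = ''
    sale_price: str = ''
    parties: dict = field(default_factory=dict)  # e.g. {'Seller': {...}, 'Buyer': {...}}

e = Entry(file_number='Jackie Grant', status='Fell Thru')
e.parties['Seller'] = {'Company name': 'Company Name LLC'}
print(e.status)  # Fell Thru
```

Keeping the Seller/Buyer/Agent blocks in a nested dict also sidesteps the problem of the repeated Home:/Pager: lines, since each block gets its own namespace.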

Related

How to extract certain paragraph from text file

def extract_book_info(self):
    books_info = []
    for file in os.listdir(self.book_folder_path):
        title = "None"
        author = "None"
        release_date = "None"
        last_update_date = "None"
        language = "None"
        producer = "None"
        with open(self.book_folder_path + file, 'r', encoding='utf-8') as content:
            book_info = content.readlines()
            for lines in book_info:
                if lines.startswith('Title'):
                    title = lines.strip().split(': ')
                elif lines.startswith('Author'):
                    try:
                        author = lines.strip().split(': ')
                    except IndexError:
                        author = 'Empty'
                elif lines.startswith('Release date'):
                    release_date = lines.strip().split(': ')
                elif lines.startswith('Last updated'):
                    last_update_date = lines.strip().split(': ')
                elif lines.startswith('Produce by'):
                    producer = lines.strip().split(': ')
                elif lines.startswith('Language'):
                    language = lines.strip().split(': ')
                elif lines.startswith('***'):
                    pass
        books_info.append(Book(title, author, release_date, last_update_date, producer, language, self.book_folder_path))
    with open(self.book_info_path, 'w', encoding="utf-8") as book_file:
        for book_info in books_info:
            book_file.write(book_info.__str__() + "\n")
I was using this code to try to extract the book title, author, release_date, last_update_date, language, producer, and book_path.
This is the output I achieved:
['Title', 'The Adventures of Sherlock Holmes'];;;['Author', 'Arthur Conan Doyle'];;;None;;;None;;;None;;;['Language', 'English'];;;data/books_data/;;;
This is the output I should have achieved. What method should I use to get the following output?
The Adventures of Sherlock Holmes;;;Arthur Conan Doyle;;;November29,2002;;;May20,2019;;;English;;;
This is the example of input:
Title: The Adventures of Sherlock Holmes
Author: Arthur Conan Doyle
Release Date: November 29, 2002 [eBook #1661]
[Most recently updated: May 20, 2019]
Language: English
Character set encoding: UTF-8
Produced by: an anonymous Project Gutenberg volunteer and Jose Menendez
*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***
cover
str.split gives you a list as a result. You're using it to assign to a single value instead.
'Title: Sherlock Holmes'.split(':') # => ['Title', 'Sherlock Holmes']
From what I can gather, you want to access the second element of the split every time. You can do so like this:
...
for lines in book_info:
    if lines.startswith('Author'):
        _, author = lines.strip().split(':')
    elif ...
Be careful, since this unpacking can throw a ValueError if there is no second element in the split result (indexing the result with [1] would raise IndexError instead; that's why there's a try around author in your code).
Also, avoid calling __str__ directly. That's what the str() function calls for you anyway. Use that instead.
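A small sketch of that first point: str.partition splits on the first colon only and yields an empty string when the value is missing, so no exception handling is needed (the helper name is made up):

```python
def safe_value(line):
    # Split on the first colon only; returns '' when the value is missing
    # instead of raising an exception.
    _, _, value = line.partition(':')
    return value.strip()

print(safe_value('Author: Arthur Conan Doyle'))  # Arthur Conan Doyle
print(repr(safe_value('Author')))                # ''
```

And for the write-out step, book_file.write(str(book_info) + "\n") is the idiomatic replacement for the explicit __str__ call.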

Split Australian addresses into street_address, suburb, state and postcode

I have scraped addresses from a website but their format is not consistent, for instance:
address = '139 McKinnon Road, PINELANDS, NT, 829'
address_2 = '108 East Point Road, Fannie Bay, NT, 820'
address_3 = '3-11 Hamilton Street, Townsville City, QLD, 4810'
I have tried to split them by space ' ' but couldn't get the desired result.
I have tried:
if "," in address:
    raw_address = address.split(",")
    splitted_address = [
        adr for adr in raw_address if not adr.islower() and not adr.isupper()
    ]
    splitted_suburb = [adr for adr in raw_address if adr.isupper()]
    item["Street_Address"] = splitted_address[0].strip()
    item["Suburb"] = splitted_address[1].strip()
    item["State"] = splitted_suburb[0].strip()
    item["Postcode"] = splitted_address[2].strip()
else:
    raw_address = address.split(" ")
    splitted_address = [
        adr for adr in raw_address if not adr.islower() and not adr.isupper()
    ]
    splitted_suburb = [adr for adr in raw_address if adr.isupper()]
    item["Street_Address"] = " ".join(splitted_address[:-1])
    item["Suburb"] = splitted_suburb[0]
    item["State"] = splitted_suburb[1]
    item["Postcode"] = splitted_address[-1]
And my desired output should be like this:
Street_Address,Suburb,State,Postcode
Units 1-14, 29 Wiltshire Lane, DELACOMBE, VIC, 3356
How can I split the full address into these specific fields?
Update:
I have parsed out the desired fields using a regex pattern:
regex_str = "(^.*?(?:Lane|Street|Boulevard|Crescent|Place|Road|Highway|Avenue|Drive|Circuit|Parade|Telopea|Nicklin Way|Terrace|Square|Court|Close|Endeavour Way|Esplanade|East|The Centreway|Mall|Quay|Gateway|Low Way|Point|Rd|Morinda|Way|Ave|St|South Steyne|Broadway|HQ|Expressway|Strett|Castlereagh|Meadow Way|Track|Kulkyne Way|Narabang Way|Bank)),? ?(.*?),? ?([A-Z]{3}),? ?(\d{,4})$"
matches = re.search(regex_str, full_address)
street, suburb, state, postcode = matches.groups()
item["Street_Address"] = street
item["Suburb"] = suburb
item["State"] = state
item["Postcode"] = postcode
It works for some addresses, like address_3, but for the first two addresses the pattern does not match, so matches is None and I get this error:
File "colliers_sale.py", line 164, in parse_details
street, suburb, state, postcode = matches.groups()
AttributeError: 'NoneType' object has no attribute 'groups'
How can I fix this?
You can use regular expressions, but you will probably need multiple patterns, something like this:
import re

match = None
if (match := re.search(r'(.*?\d+-\d+),? (.+?) ([A-Z ]+) ([A-Z]+) (\d+)$', address)):
    pass  # this matches address, address_3, address_4
elif (match := re.search(r'(\d+-\d+) (.+?), (.+?), ([A-Z]+), (\d+)$', address)):
    pass  # this matches address_2
# elif (...another pattern...)

if match:
    print(match[1], match[2], match[3], match[4], match[5], sep=' # ')
else:
    print('nothing matched')
Try the re package. You can do it using regular expressions like this:
import re

address = 'Units 1-14, 29 Wiltshire Lane DELACOMBE VIC 3356'
address_2 = '3-11 Hamilton Street, Townsville City, QLD, 4810'
address_3 = '6-10 Mount Street MOUNT DRUITT NSW 2770'
address_4 = '34-36 Fairfield Street FAIRFIELD EAST NSW 2165'
addresses = [address, address_2, address_3, address_4]
for add in addresses:
    print(', '.join(re.findall(r"(.*\d+-\d+)[, ]+(\w*\s*\w+\s+\w+)[, ]+(\w*\s*\w+)[, ]+(\w+)[, ]+(\d+)", add)[0]))
The parentheses in the pattern passed to re.findall capture the wanted parts.
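Whichever patterns you settle on, the AttributeError in the question comes from calling .groups() on a failed match. A hedged sketch of guarding against None (the pattern here is illustrative, not a complete solution):

```python
import re

# Illustrative patterns only; real data will need more variants.
patterns = [
    r'(?P<street>.+?),\s*(?P<suburb>.+?),\s*(?P<state>[A-Z]{2,3}),\s*(?P<postcode>\d{3,4})$',
]

def parse_address(full_address):
    for pat in patterns:
        m = re.search(pat, full_address)
        if m:  # only touch .groups()/.groupdict() once the search succeeded
            return m.groupdict()
    return None  # nothing matched; the caller decides what to do

print(parse_address('139 McKinnon Road, PINELANDS, NT, 829'))
# {'street': '139 McKinnon Road', 'suburb': 'PINELANDS', 'state': 'NT', 'postcode': '829'}
```

Named groups plus groupdict() also make the field assignment one line instead of four.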

Extract data from text file using Python (or any language)

I have a text file that looks like:
First Name Bob
Last name Smith
Phone 555-555-5555
Email bob@bob.com
Date of Birth 11/02/1986
Preferred Method of Contact Text Message
Desired Appointment Date 04/29
Desired Appointment Time 10am
City Pittsburgh
Location State
IP Address x.x.x.x
User-Agent (Browser/OS) Apple Safari 14.0.3 / OS X
Referrer http://www.example.com
First Name john
Last name Smith
Phone 555-555-4444
Email john@gmail.com
Date of Birth 03/02/1955
Preferred Method of Contact Text Message
Desired Appointment Date 05/22
Desired Appointment Time 9am
City Pittsburgh
Location State
IP Address x.x.x.x
User-Agent (Browser/OS) Apple Safari 14.0.3 / OS X
Referrer http://www.example.com
.... and so on
I need to extract each entry to a csv file, so the data should look like: first name, last name, phone, email, etc. I don't even know where to start on something like this.
First of all, you'll need to open the text file in read mode.
I'd suggest using a context manager like so:
with open('path/to/your/file.txt', 'r') as file:
    for line in file.readlines():
        # do something with the line (it is a string)
        ...
as for managing the info you could build some intermediate structure, for example a dictionary or a list of dictionaries, and then translate that into a CSV file with the csv module.
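For that last step, a minimal sketch with csv.DictWriter (the filename, field list, and entries here are illustrative):

```python
import csv

# Hypothetical parsed entries; the fieldnames list drives the header row.
entries = [
    {'First Name': 'Bob', 'Last name': 'Smith', 'Phone': '555-555-5555'},
    {'First Name': 'john', 'Last name': 'Smith', 'Phone': '555-555-4444'},
]

with open('contacts.csv', 'w', newline='') as out:
    writer = csv.DictWriter(out, fieldnames=['First Name', 'Last name', 'Phone'])
    writer.writeheader()
    writer.writerows(entries)
```

DictWriter also quotes values for you, so commas inside a field won't break the CSV.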
You could, for example, split the file whenever there is a blank line, maybe like this:
with open('Downloads/test.txt', 'r') as f:
    my_list = list()  # this will be the final list
    entry = dict()  # this contains each user info as a dict
    for line in f.readlines():
        if line.strip() == "":  # if line is empty start a new dict
            my_list.append(entry)  # and append the old one to the list
            entry = dict()
        else:  # otherwise split the line and add to the current dict
            line_items = line.split(' ')
            print(line_items)
            entry[line_items[0]] = line_items[1]
print(my_list)
this code won't work because your text is not formatted in a consistent way: you need to find a way to make the split between "title" and "content" (like "first name" and "bob") in a consistent way. I suggest maybe looking at regex and fixing the txt file by making spacing more consistent.
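One way to make that split consistent without editing the file is to anchor on the known field names. A rough sketch (the key list is assumed from the sample; longest names first so e.g. 'Desired Appointment Date' cannot be mistaken for a shorter prefix):

```python
import re

# Known field names, assumed from the sample data.
KEYS = sorted(['First Name', 'Last name', 'Phone', 'Email', 'Date of Birth',
               'Preferred Method of Contact', 'Desired Appointment Date',
               'Desired Appointment Time', 'City', 'Location', 'IP Address',
               'User-Agent (Browser/OS)', 'Referrer'], key=len, reverse=True)
KEY_RE = re.compile('^(' + '|'.join(map(re.escape, KEYS)) + r')\s*(.*)$')

def split_line(line):
    # Returns (key, value) when the line starts with a known field name,
    # (None, line) otherwise.
    m = KEY_RE.match(line)
    return (m.group(1), m.group(2)) if m else (None, line)

print(split_line('Date of Birth 11/02/1986'))  # ('Date of Birth', '11/02/1986')
```

With this, the dict-building loop above can use split_line instead of line.split(' ') and multi-word keys like "First Name" come out intact.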
Assuming the data resides in a:
a="""
First Name Bob
Last name Smith
Phone 555-555-5555
Email bob@bob.com
Date of Birth 11/02/1986
Preferred Method of Contact Text Message
Desired Appointment Date 04/29
Desired Appointment Time 10am
City Pittsburgh
Location State
IP Address x.x.x.x
User-Agent (Browser/OS) Apple Safari 14.0.3 / OS X
Referrer http://www.example.com
First Name john
Last name Smith
Phone 555-555-4444
Email john@gmail.com
Date of Birth 03/02/1955
Preferred Method of Contact Text Message
Desired Appointment Date 05/22
Desired Appointment Time 9am
City Pittsburgh
Location State
IP Address x.x.x.x
User-Agent (Browser/OS) Apple Safari 14.0.3 / OS X
Referrer http://www.example.com
"""
line_sep = "\n" # CHANGE ME ACCORDING TO DATA
fields = ["First Name", "Last name", "Phone",
"Email", "Date of Birth", "Preferred Method of Contact",
"Desired Appointment Date", "Desired Appointment Time",
"City", "Location", "IP Address", "User-Agent","Referrer"]
records = a.split(line_sep * 2)
all_records = []
for record in records:
    splitted_record = record.split(line_sep)
    one_record = {}
    csv_record = []
    for f in fields:
        found = False
        for one_field in splitted_record:
            if one_field.startswith(f):
                data = one_field[len(f):].strip()
                one_record[f] = data
                csv_record.append(data)
                found = True
        if not found:
            csv_record.append("")
    all_records.append(";".join(csv_record))
one_record will hold the record as a dictionary and csv_record will hold it as a list of fields (ordered as in the fields variable).
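To get the CSV file the question asks for, a header row built from fields can be written ahead of the joined records. A minimal sketch with illustrative data (shortened lists stand in for the real fields and all_records):

```python
# Illustrative: write a header row built from the field list, then one
# record per line, using ';' as the delimiter to match the join above.
fields = ["First Name", "Last name", "Phone"]  # shortened for the sketch
all_records = ["Bob;Smith;555-555-5555"]       # as produced by the loop

with open('records.csv', 'w') as out:
    out.write(';'.join(fields) + '\n')
    out.write('\n'.join(all_records) + '\n')
```

pandas can then read it back with read_csv('records.csv', sep=';').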
Edited to add: ignore this answer, the code from Koko Jumbo looks infinitely more sensible and actually gives you a CSV file at the end of it! It was a fun exercise though :)
Just to expand on fcagnola's code a bit.
If it's a quick and dirty one-off, and you know that the data will be consistently presented, the following should work to create a list of dictionaries with the correct key/value pairing. Each line is processed by splitting the line and comparing the line number (reset to 0 with each new dict) against an array of values that represent where the boundary between key and value falls.
For example, "First Name Bob" becomes ["First","Name","Bob"]. The function has been told that linenumber= 0 so it checks entries[linenumber] to get the value "2", which it uses to join the key name (items 0 & 1) and then join the data (items 2 onwards). The end result is ["First Name", "Bob"] which is then added to the dictionary.
class Extract:
    def extractEntry(self, linedata, lineindex):
        # Hardcoded list! The quick and dirty part.
        # This is specific to the example data provided. The entries
        # represent the index to be used when splitting the string
        # between the key and the data
        entries = (2, 2, 1, 1, 3, 4, 3, 3, 1, 1, 2, 2, 1)
        return self.createNewEntry(linedata, entries[lineindex])

    def createNewEntry(self, linedata, dataindex):
        list_data = linedata.split()
        key = " ".join(list_data[:dataindex])
        data = " ".join(list_data[dataindex:])
        return [key, data]

with open('test.txt', 'r') as f:
    my_list = list()  # this will be the final list
    entry = dict()  # this contains each user info as a dict
    extr = Extract()  # class for splitting the entries into key/value
    x = 0
    for line in f.readlines():
        if line.strip() == "":  # if line is empty start a new dict
            my_list.append(entry)  # and append the old one to the list
            entry = dict()
            x = 0
        else:  # otherwise split the line and create new dict
            extracted_data = extr.extractEntry(line, x)
            entry[extracted_data[0]] = extracted_data[1]
            x += 1
    my_list.append(entry)
print(my_list)

Separating text/text processing using regex

I have a paragraph that needs to be separated by a certain list of keywords.
Here is the text (a single string):
"Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore Address: 189 West Moncler Drive Home Phone: 353 273 400 Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019 Author: social worker"
So I want to separate this paragraph based on the variable names using python. "Evaluation Note", "Date","ID","Contact","Name","Address","Home Phone","Additional Information" and "Author" are the variable names. I think using regex seems nice but I don't have a lot of experience in regex.
Here is what I am trying to do:
import re

regex = r"Evaluation Note(?:\:)? (?P<note>\D+) Date(?:\:)? (?P<date>\D+) ID(?:\:)? (?P<id>\D+) Contact(?:\:)? (?P<contact>\D+)Name(?:\:)? (?P<name>\D+)"
test_str = "Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore "
matches = re.finditer(regex, test_str, re.MULTILINE)
But it doesn't find any matches.
You can probably generate that regex on the fly, so long as the order of the params is fixed.
Here is my try at it; it does the job. The actual regex it builds is something like Some Key(?P<some_key>.*)Some Other Key(?P<some_other_key>.*), and so on.
import re

test_str = r'Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore '
keys = ['Evaluation Note', 'Date', 'ID', 'Contact', 'Name']

def find(keys, string):
    keys = [(key, key.replace(' ', '_')) for key in keys]  # spaces aren't valid group names
    pattern = ''.join([f'{key}(?P<{name}>.*)' for key, name in keys])  # generate the actual regex
    for found in re.findall(pattern, string):
        for item in found:
            yield item.strip(':').strip()  # clean up the result

for find in find(keys, test_str):
    print(find)
Which returns:
Suspected abuse by own mother.
3/13/2019
#N/A
Not Specified
Cecilia Valore
You can use search to get locations of variables and parse text accordingly. You can customize it easily.
import re
en = re.compile('Evaluation Note:').search(text)
print(en.group())
d = re.compile('Date').search(text)
print(text[en.end()+1: d.start()-1])
print(d.group())
i_d = re.compile('ID:').search(text)
print(text[d.end()+1: i_d.start()-1])
print(i_d.group())
c = re.compile('Contact:').search(text)
print(text[i_d.end()+1: c.start()-1])
print(c.group())
n = re.compile('Name:').search(text)
print(text[c.end()+1: n.start()-1])
print(n.group())
ad = re.compile('Address:').search(text)
print(text[n.end()+1: ad.start()-1])
print(ad.group())
p = re.compile('Home Phone:').search(text)
print(text[ad.end()+1: p.start()-1])
print(p.group())
ai = re.compile('Additional Information:').search(text)
print(text[p.end()+1: ai.start()-1])
print(ai.group())
aut = re.compile('Author:').search(text)
print(text[ai.end()+1: aut.start()-1])
print(aut.group())
print(text[aut.end()+1:])
this will output:
Evaluation Note: Suspected abuse by own mother.
Date: 3/13/2019
ID: #N/A
Contact: Not Specified
Name: Cecilia Valore
Address: 189 West Moncler Drive
Home Phone: 353 273 400
Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019
Author: social worker
I hope this helps
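A more compact variant of the same idea: split the text just before each known key with a lookahead, so each chunk carries its own key (the key list is taken from the question; note that Date has no colon in the sample, hence the fallback to the first space):

```python
import re

text = ("Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 "
        "ID: #N/A Contact: Not Specified Name: Cecilia Valore")
keys = ['Evaluation Note', 'Date', 'ID', 'Contact', 'Name']

# Split just before each key; the lookahead keeps the key inside its chunk.
chunks = re.split('(?=' + '|'.join(map(re.escape, keys)) + ')', text)

fields = {}
for chunk in filter(None, chunks):
    # 'Date' has no colon in the sample, so fall back to the first space.
    key, _, value = chunk.partition(':') if ':' in chunk else chunk.partition(' ')
    fields[key.strip()] = value.strip()

print(fields['Name'])  # Cecilia Valore
```

This avoids writing one search per variable, at the cost of assuming no key string appears inside a value.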

Append a text word into a text file after searching for specific word in python

I want to read a text file, find a specific word, and then append another word next to it.
For example:
I want to find a first name in the file, like John, and then append a last name to it, so that "John" becomes "John Smith".
Here is the code I have written up till now:
usrinp = input("Enter name: ")
lines = []
with open('names.txt', 'rt') as in_file:
    for line in in_file:
        lines.append(line.rstrip('\n'))
for element in lines:
    if usrinp in element is not -1:
        print(lines[0] + " Smith")
        print(element)
This is what the text file looks like:
My name is FirstName
My name is FirstName
My name is FirstName
FirstName is a asp developer
Java developer is FirstName
FirstName is a python developer
Using replace is one way to do it.
Input file (names.txt):
My name is John
My name is John
My name is John
John is a asp developer
Java developer is John
John is a python developer
Script:
name = 'John'
last_name = 'Smith'

with open('names.txt', 'r') as names_file:
    content = names_file.read()

new = content.replace(name, ' '.join([name, last_name]))

with open('new_names.txt', 'w') as new_names_file:
    new_names_file.write(new)
Output file (new_names.txt):
My name is John Smith
My name is John Smith
My name is John Smith
John Smith is a asp developer
Java developer is John Smith
John Smith is a python developer
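One caveat with str.replace: it substitutes every occurrence of the substring, so a name like "Johnson" would also be rewritten to "John Smithson". A word-boundary regex is a safer sketch (the sample text here is illustrative):

```python
import re

content = "My name is John\nJohnson is a java developer\n"
# \b anchors the match to whole words, so 'Johnson' is left alone.
new = re.sub(r'\bJohn\b', 'John Smith', content)
print(new)
```

This prints "My name is John Smith" on the first line while leaving "Johnson" untouched on the second.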
search_string = 'john'
file_content = open(file_path, 'r+')
lines = []
flag = 0
for line in file_content:
    line = line.lower()
    stripped_line = line
    if search_string in line:
        flag = 1
        stripped_line = line.strip('\n') + ' ' + 'smith \n'
    lines.append(stripped_line)
file_content.close()
if flag == 1:
    file_content = open(file_path, 'w')
    file_content.writelines(lines)
    file_content.close()
**OUTPUT**
My name is FirstName
My name is FirstName
My name is FirstName
FirstName is a asp developer
Java developer is john smith
FirstName is a developer
