I have some YAML data that looks like this:
people:
  - name: "John"
    age: "20"
    family:
      - mother: "Alice"
        father: "Jeff"
        brother: "Tim"
        sister: "Enid"
  - name: "Jake"
    age: "23"
    family:
      - mother: "Meg"
        father: "Rick"
        brother: "Carl"
        sister: "Maddy"
How do I print out the name of Jake's mother?
I'm really just trying to print out all family members for each person, but when I make a loop to go through each entry, the code views "name", "age", and "family" as strings.
My code looks like this:
(I have the yaml loaded as a dictionary "persons")
for a in persons["people"]:
    if a == "family":
        for b in persons["people"][a]:
            print(b)
people is a list of dictionaries. If you iterate over it with for a in persons["people"], then in each iteration of the loop, a will be a dictionary with the keys name, age, and family. You're looking for the entry where name is Jake, so:
for a in persons["people"]:
    if a["name"] == "Jake":
        ...
For reasons that are unclear from your question, family is a list of dictionaries, with only a single item in the list in each of your examples. You want the value of the key mother from this dictionary, which gets us:
for a in persons["people"]:
    if a["name"] == "Jake":
        mother_name = a["family"][0]["mother"]
If you have control over the format of the data, consider making family a single dictionary instead of a list of dictionaries:
people:
  - name: "John"
    age: "20"
    family:
      mother: "Alice"
      father: "Jeff"
      brother: "Tim"
      sister: "Enid"
  - name: "Jake"
    age: "23"
    family:
      mother: "Meg"
      father: "Rick"
      brother: "Carl"
      sister: "Maddy"
With that data structure, your code becomes:
for a in persons["people"]:
    if a["name"] == "Jake":
        mother_name = a["family"]["mother"]
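For completeness, the parsed result of the original (flattened) YAML is equivalent to the following Python structure, so the whole lookup can be sketched end to end; literal dicts stand in here for the loaded file:

```python
# Equivalent of the parsed YAML (literal dicts for illustration;
# note family is a one-item list of dicts, as in the question)
persons = {
    "people": [
        {"name": "John", "age": "20",
         "family": [{"mother": "Alice", "father": "Jeff",
                     "brother": "Tim", "sister": "Enid"}]},
        {"name": "Jake", "age": "23",
         "family": [{"mother": "Meg", "father": "Rick",
                     "brother": "Carl", "sister": "Maddy"}]},
    ]
}

mother_name = None
for a in persons["people"]:
    if a["name"] == "Jake":
        mother_name = a["family"][0]["mother"]

print(mother_name)  # Meg
```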
import yaml

if __name__ == '__main__':
    with open('./file.yml') as file:
        people = yaml.full_load(file)
        for person in people["people"]:
            print(f"{person['name']} parents: ")
            family = person["family"][0]
            for role, name in family.items():
                print(role + ": " + name)
            print("")
The output will be
John parents:
mother: Alice
father: Jeff
brother: Tim
sister: Enid
Jake parents:
mother: Meg
father: Rick
brother: Carl
sister: Maddy
I have a list in the following format:
list_names = ['Name: Mark - Age: 42 - Country: NL',
'Name: Katherine - Age: 23 - Country: NL',
'Name: Tom - Age: 31 - Country: NL']
As you can see, all the information is set in one string. What I need is to order this list based on the age, which is located somewhere in the middle of the string.
How can I do this?
The key to sorting the list is to extract the age from each string. Once you have defined a function that does that, you can pass it as the key argument of the .sort method. With regular expressions, extracting the age is simple. A solution could look as follows:
import re

pattern = re.compile(r'age:\s*(\d+)', re.IGNORECASE)

def extract_age(s):
    return int(pattern.search(s).group(1))

list_names = ['Name: Mark - Age: 42 - Country: NL',
              'Name: Katherine - Age: 23 - Country: NL',
              'Name: Tom - Age: 31 - Country: NL']

list_names.sort(key=extract_age)
print(list_names)
You can use regex to capture the age and use it as the sort key.
import re

list_names = ['Name: Mark - Age: 42 - Country: NL',
              'Name: Katherine - Age: 23 - Country: NL',
              'Name: Tom - Age: 31 - Country: NL']

def get_age(value):
    match = re.search(r"Age: (\d+)", value)
    return int(match.group(1))

list_names_sorted = sorted(list_names, key=get_age)
print(list_names_sorted)
Output (pretty printed):
[
    'Name: Katherine - Age: 23 - Country: NL',
    'Name: Tom - Age: 31 - Country: NL',
    'Name: Mark - Age: 42 - Country: NL'
]
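If the field layout is guaranteed to be exactly 'Name: ... - Age: ... - Country: ...' (an assumption; the regex versions above tolerate more variation), you can also split the string instead of matching a pattern:

```python
list_names = ['Name: Mark - Age: 42 - Country: NL',
              'Name: Katherine - Age: 23 - Country: NL',
              'Name: Tom - Age: 31 - Country: NL']

def get_age(value):
    # 'Name: Mark - Age: 42 - Country: NL' -> ['Name: Mark', 'Age: 42', 'Country: NL']
    fields = value.split(' - ')
    # 'Age: 42' -> '42' -> 42
    return int(fields[1].split(': ')[1])

print(sorted(list_names, key=get_age))
```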
When I was reading multiple files and exporting them, I realised that the values in these 4 columns got overwritten by the latest value. Every file has the same iat cell locations. I would like to know whether this can be looped so the values are not overwritten.
name = df.iat[1,1]
age = df.iat[2,1]
height = df.iat[2,2]
address = df.iat[2,3]

Details = {'Name':name, 'Age':age, 'Height':height, 'Address':address}
df1 = pd.Series(Details).to_frame()
df1 = df1.T
For example,
(1st Data):
Name: John
Age: 20
Height: 1.7m
Address: Bla Bla Bla
(2nd Data):
Name: Jack
Age: 21
Height: 1.7m
Address: Blah Blah Blah
(3rd Data):
Name: Jane
Age: 20
Height: 1.62m
Address: Blah Blah
You can loop and append your values to lists:
name, age, height, address = [], [], [], []

for df in dfs:
    name.append(df.iat[1,1])
    age.append(df.iat[2,1])
    height.append(df.iat[2,2])
    address.append(df.iat[2,3])

Details = {'Name':name, 'Age':age, 'Height':height, 'Address':address}
df1 = pd.DataFrame(Details)
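An equivalent pattern is to collect one dict per file and build the DataFrame in a single call. In this sketch the dfs list holds hypothetical stand-ins for the loaded files, shaped so that the df.iat[...] positions from the question line up:

```python
import pandas as pd

# Hypothetical stand-ins for the DataFrames read from each file;
# the values sit at the same iat positions used in the question.
dfs = [
    pd.DataFrame([['', ''], ['', 'John'], ['', '20', '1.7m', 'Bla Bla Bla']]),
    pd.DataFrame([['', ''], ['', 'Jack'], ['', '21', '1.7m', 'Blah Blah Blah']]),
]

rows = []
for df in dfs:
    rows.append({'Name': df.iat[1, 1], 'Age': df.iat[2, 1],
                 'Height': df.iat[2, 2], 'Address': df.iat[2, 3]})

df1 = pd.DataFrame(rows)  # one row per file; nothing is overwritten
print(df1)
```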
I have a text file, a few snippets of which look like this:
Page 1 of 515
Closing Report for Company Name LLC
222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
File number: Jackie Grant Status: Fell Thru Primary closing party: Seller
Acceptance: 01/01/2001 Closing date: 11/11/2011 Property type: Commercial Lease
MLS number: Sale price: $200,000 Commission: $1,500.00
Notes: 08/15/2000 02:30PM by Roger Lodge This property is a Commercial Lease handled by etc..
Seller: Company Name LLC
Company name: Company Name LLC
Address: 222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
Home: Pager:
Business: Fax:
Mobile: Email:
Buyer: Tomlinson, Ladainian
Address: 222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
Home: Pager:
Business: 555-555-5555 Fax:
Mobile: Email:
Lessee Agent: Blank, Arthur
Company name: Sprockets Inc.
Address: 5001 Old Man Dr, North Las Vegas, NV, 89002
Home: (575) 222-3455 Pager:
Business: Fax: 999-9990
Mobile: (702) 600-3492 Email: sprockets#yoohoo.com
Leasing Agent: Van Uytnyck, Chameleon
Company name: Company Name LLC
Address:
Home: Pager:
Business: Fax: 909-222-2223
Mobile: 595-595-5959 Email:
(there should be 2 blank lines here; this note is not in the actual text file)
Printed on Friday, June 12, 2015
Account owner: Roger Goodell
Page 2 of 515
Report for Adrian (Allday) Peterson
242 N 9th Street, #100 & 200
File number: Soap Status: Closed/Paid Primary closing party: Buyer
Acceptance: 01/10/2010 Closing date: 01/10/2010 Property type: RRR
MLS number: Sale price: $299,000 Commission: 33.00%
Seller: SOS, Bank
Address: 242 N 9th Street, #100 & 200
Home: Pager:
Business: Fax:
Mobile: Email:
Buyer: Sabel, Aaron
Address:
Home: Pager:
Business: Fax:
Mobile: Email: sia#yoohoo.com
Escrow Co: Schneider, Patty
Company name: National Football League
Address: 242 N 9th Street, #100 & 200
Home: Pager:
Business: 800-2009 Fax: 800-1100
Mobile: Email:
Buyers Agent: Munchak, Mike
Company name: Commission Group
Address:
Home: Pager:
Business: Fax:
Mobile: 483374-3892 Email: donation#yoohoo.net
Listing Agent: Ricci, Christina
Company name: Other Guys
Address:
Home: Pager:
Business: Fax:
Mobile: 888-333-3333 Email: general.adama#cylon.net
Here's my code:
import re

file = open('file-path.txt', 'r')

# if there are more than two consecutive blank lines, then we start a new Entry
entries = []
curr = []
prev_blank = False
for line in file:
    line = line.rstrip('\n').strip()
    if line == '':
        if prev_blank == True:
            # end of the entry, append the entry
            if len(curr) > 0:
                entries.append(curr)
                print(curr)
            curr = []
            prev_blank = False
        else:
            prev_blank = True
    else:
        prev_blank = False
        # we need to parse the line
        line_list = line.split()
        field = ''
        start = False
        for item in line_list:
            if re.match(r'[a-zA-Z\s]+:.*', item):
                if len(field) > 0:
                    curr.append(field)
                field = item
                start = True
            elif start == True:
                field = field + ' ' + item
Here is the output:
['number: Jackie Grant', 'Status: Fell Thru Primary closing', 'Acceptance: 01/01/2001 Closing', 'date: 11/11/2011 Property', 'number: Sale', 'price: $200,000', 'Home:', 'Business:', 'Mobile:', 'Home:', 'Business: 555-555-5555', 'Mobile:', 'Home: (575) 222-3455', 'Business:', 'Mobile: (702) 600-3492', 'Home:', 'Business:', 'Mobile: 595-595-5959']
My issues are as follows:
First, there should be 2 records as output, and I'm only outputting one.
In the top block of text, my script has trouble knowing where the previous value ends and the next one starts: 'Status: Fell Thru' should be one value, and 'Primary closing party: Buyer', 'Acceptance: 01/10/2010', 'Closing date: 01/10/2010', 'Property type: RRR', 'MLS number:', 'Sale price: $299,000', 'Commission: 33.00%' should all be caught.
Once this is parsed correctly, I will need to parse it again to separate keys from values (i.e. 'Closing date': 01/10/2010), ideally into a list of dicts.
I can't think of a better way other than using regex to pick out keys, and then grabbing the snippets of text that follow.
When complete, I'd like a csv w/a header row filled with keys, that I can import into pandas w/read_csv. I've spent quite a few hours on this one..
(This isn't a complete answer, but it's too long for a comment).
Field names can have spaces (e.g. MLS number)
Several fields can appear on each line (e.g. Home: Pager:)
The Notes field has the time in it, with a : in it
These mean you can't take your approach to identifying the fieldnames by regex. It's impossible for it to know whether "MLS" is part of the previous data value or the subsequent fieldname.
Some of the Home: Pager: lines refer to the Seller, some to the Buyer or the Lessee Agent or the Leasing Agent. This means the naive line-by-line approach I take below doesn't work either.
This is the code I was working on, it runs against your test data but gives incorrect output due to the above. It's here for a reference of the approach I was taking:
replaces = [
    ('Closing Report for', 'Report_for:'),
    ('Report for', 'Report_for:'),
    ('File number', 'File_number'),
    ('Primary closing party', 'Primary_closing_party'),
    ('MLS number', 'MLS_number'),
    ('Sale price', 'Sale_price'),
    ('Account owner', 'Account_owner'),
    # ...
    # etc.
]
def fix_linemash(data):
    # splits many fields on one line into several lines
    results = []
    mini_collection = []
    for token in data.split(' '):
        if ':' not in token:
            mini_collection.append(token)
        else:
            results.append(' '.join(mini_collection))
            mini_collection = [token]
    return [line for line in results if line]

def process_record(data):
    # takes a collection of lines,
    # fixes them, and builds a record dict
    record = {}
    for old, new in replaces:
        data = data.replace(old, new)
    for line in fix_linemash(data):
        print(line)
        name, value = line.split(':', 1)
        record[name.strip()] = value.strip()
    return record
records = []
collection = []
blank_flag = False

for line in open('d:/lol.txt'):
    # Read through the file collecting lines and
    # looking for double blank lines;
    # every pair of blank lines, process the stored ones and reset
    line = line.strip()
    if line.startswith('Page '): continue
    if line.startswith('Printed on '): continue
    if not line and blank_flag:  # record finished
        records.append(process_record(' '.join(collection)))
        blank_flag = False
        collection = []
    elif not line:  # maybe end of record?
        blank_flag = True
    else:  # false alarm, record continues
        blank_flag = False
        collection.append(line)

for record in records:
    print(record)
I'm now thinking it would be a much better idea to do some pre-processing tidy-up steps over the data:
1. Strip out "Page n of n" and "Printed on ..." lines, and similar.
2. Identify all valid field names, then break up the combined lines, so that every line has one field only and fields start at the start of a line.
3. Run through and process just the Seller/Buyer/Agents blocks, replacing field names with an identifying prefix, e.g. Email: -> Seller Email:.
Then write a record parser, which should be easy - check for two blank lines, split the lines at the first colon, use the left bit as the field name and the right bit as the value. Store however you want (nb. that dictionary keys are unordered).
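Step 2 of that plan (splitting the combined lines using a list of known field names) could be sketched like this; the FIELDS list is an illustrative subset that you would extend from the real reports:

```python
import re

# Known field names (an illustrative subset; extend with the full list)
FIELDS = ['File number', 'Status', 'Primary closing party', 'Acceptance',
          'Closing date', 'Property type', 'MLS number', 'Sale price',
          'Commission', 'Home', 'Pager', 'Business', 'Fax', 'Mobile', 'Email']

# Longest names first, so 'MLS number:' wins over any shorter overlap
pattern = re.compile('|'.join(re.escape(f) + ':'
                              for f in sorted(FIELDS, key=len, reverse=True)))

def split_fields(line):
    """Break a combined line into (field, value) pairs using the known names."""
    starts = [m.start() for m in pattern.finditer(line)]
    pairs = []
    for i, s in enumerate(starts):
        chunk = line[s:starts[i + 1]] if i + 1 < len(starts) else line[s:]
        name, value = chunk.split(':', 1)
        pairs.append((name.strip(), value.strip()))
    return pairs

print(split_fields('File number: Jackie Grant Status: Fell Thru Primary closing party: Seller'))
```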
I suppose it is easier to start a new record by hitting the word "Page".
Just to share a little of my own experience: it is simply too difficult to write a generalized parser.
The situation isn't that bad given the data here. Instead of using a simple list to store an entry, use an object, and add all the other fields as attributes/values on the object.
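That suggestion might be sketched with a small record class (the name Entry and its methods are hypothetical, not from the original code):

```python
class Entry:
    """One report record; fields live in a dict instead of a flat list."""

    def __init__(self):
        self.fields = {}

    def add(self, name, value):
        # later duplicate names (e.g. a second 'Email:') would need
        # prefixing, as suggested above
        self.fields[name] = value

    def __repr__(self):
        return 'Entry(%r)' % (self.fields,)

e = Entry()
e.add('File number', 'Jackie Grant')
e.add('Status', 'Fell Thru')
print(e.fields['Status'])  # Fell Thru
```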
I have a little database text file db.txt:
(peter)
name = peter
surname = asd
year = 23
(tom)
name = tom
surname = zaq
year = 22
hobby = sport
(paul)
name = paul
surname = zxc
hobby = music
job = teacher
How do I get the whole data section for, for example, tom? I want to get this in a variable:
(tom)
name = tom
surname = zaq
year = 22
hobby = sport
Then I want to change the data:
replace("year = 22", "year = 23")
and get:
(tom)
name = tom
surname = zaq
year = 23
hobby = sport
Now add (job) and delete (surname) data:
(tom)
name = tom
year = 23
hobby = sport
job = taxi driver
And finally rewrite that changed section to old db.txt file:
(peter)
name = peter
surname = asd
year = 23
(tom)
name = tom
year = 23
hobby = sport
job = taxi driver
(paul)
name = paul
surname = zxc
hobby = music
job = teacher
Any solutions or hints how to do it? Thanks a lot!
Using PyYAML as suggested by @aitchnyu, and making a few small modifications to the original format, makes this an easy task:
import yaml

text = """
peter:
  name: peter
  surname: asd
  year: 23
tom:
  name: tom
  surname: zaq
  year: 22
  hobby: sport
paul:
  name: paul
  surname: zxc
  hobby: music
  job: teacher
"""

persons = yaml.safe_load(text)
persons["tom"]["year"] = persons["tom"]["year"] * 4  # Tom is older now
print(yaml.dump(persons, default_flow_style=False))
Result:
paul:
  hobby: music
  job: teacher
  name: paul
  surname: zxc
peter:
  name: peter
  surname: asd
  year: 23
tom:
  hobby: sport
  name: tom
  surname: zaq
  year: 88
Of course, you should read text from your file (db.txt) and write it back after you are finished.
Addendum to Sebastien's comment: use an in-memory SQLite DB. SQLite is already embedded in Python, so it's just a few lines to set up.
Also, unless the format cannot be changed, consider YAML for the text. Python can readily translate between YAML and Python objects (an object composed of Python dicts, lists, strings, real numbers, etc.) in a single step.
http://pyyaml.org/wiki/PyYAML
So my suggestion is a YAML -> Python object -> SQLite DB and back again.
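The in-memory SQLite part of that pipeline is only a few lines; here is a minimal sketch with an illustrative table layout (the YAML loading step is omitted):

```python
import sqlite3

# In-memory DB: nothing touches disk; the table layout is illustrative
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE person (key TEXT, name TEXT, surname TEXT, year INTEGER)')
conn.execute("INSERT INTO person VALUES ('tom', 'tom', 'zaq', 22)")

# The edit from the question: change tom's year
conn.execute("UPDATE person SET year = 23 WHERE key = 'tom'")

row = conn.execute("SELECT name, year FROM person WHERE key = 'tom'").fetchone()
print(row)  # ('tom', 23)
```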