Extract part of string based on a template in Python

I'd like to use Python to read in a list of directories and store data in variables based on a template such as /home/user/Music/%artist%/[%year%] %album%.
An example would be:
artist, year, album = None, None, None
template = "/home/user/Music/%artist%/[%year%] %album%"
path = "/home/user/Music/3 Doors Down/[2002] Away From The Sun"
if text == "%artist%":
    artist = key
if text == "%year%":
    year = key
if text == "%album%":
    album = key
print(artist)
# 3 Doors Down
print(year)
# 2002
print(album)
# Away From The Sun
I can do the reverse easily enough with str.replace("%artist%", artist), but how can I extract the data?

If your folder structure template is reliable the following should work without the need for regular expressions.
path = "/home/user/Music/3 Doors Down/[2002] Away From The Sun"
path_parts = path.split("/") # divide up the path into array by slashes
print(path_parts)
artist = path_parts[4] # get element of array at index 4
year = path_parts[5][1:5] # get characters at indices 1 to 4 (the slice end is exclusive) of the element at index 5
album = path_parts[5][7:]
print(artist)
# 3 Doors Down
print(year)
# 2002
print(album)
# Away From The Sun
# to put the path back together again using an F-string (No need for str.replace)
reconstructed_path = f"/home/user/Music/{artist}/[{year}] {album}"
print(reconstructed_path)
output:
['', 'home', 'user', 'Music', '3 Doors Down', '[2002] Away From The Sun']
3 Doors Down
2002
Away From The Sun
/home/user/Music/3 Doors Down/[2002] Away From The Sun

The following works for me:
from difflib import SequenceMatcher
def extract(template, text):
    seq = SequenceMatcher(None, template, text, True)
    return [text[c:d] for tag, a, b, c, d in seq.get_opcodes() if tag == 'replace']
template = "home/user/Music/%/[%] %"
path = "home/user/Music/3 Doors Down/[2002] Away From The Sun"
artist, year, album = extract(template, path)
print(artist)
print(year)
print(album)
Output:
3 Doors Down
2002
Away From The Sun
Each template placeholder can be any single character as long as the character is not present in the value to be returned.
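If you want to keep the named %artist%-style placeholders from the question, one option is to turn the template into a regular expression with named groups. The following is only a minimal sketch under some assumptions: placeholder names consist of word characters, and extract_fields is a hypothetical helper name, not part of any library. Each placeholder matches lazily, so a value that itself contains the literal text following its placeholder may be split in an unexpected place.

```python
import re


def extract_fields(template, text):
    """Return a dict of placeholder values, or None if text does not fit."""
    # re.split with a capturing group alternates literal text and names:
    # ['/home/user/Music/', 'artist', '/[', 'year', '] ', 'album', '']
    parts = re.split(r"%(\w+)%", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2:
            # Odd positions hold placeholder names: make a named group.
            pattern += "(?P<{}>.+?)".format(part)
        else:
            # Even positions hold literal text: escape regex metacharacters.
            pattern += re.escape(part)
    match = re.fullmatch(pattern, text)
    return match.groupdict() if match else None


template = "/home/user/Music/%artist%/[%year%] %album%"
path = "/home/user/Music/3 Doors Down/[2002] Away From The Sun"
fields = extract_fields(template, path)
print(fields)
```

Because the result is a dict keyed by placeholder name, the same template string can also be used for the reverse direction with str.replace.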

Related

Why does a Python if inside a while end in an endless loop

order = 2
selected = 0
while selected < 21: # because I can select at most 20 rows at once
    current_tr = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/div/div[3]/table/tbody/tr[%d]' % order) # from line 1, below the table's header
    if current_tr.get_attribute("bgcolor") is None: # no bgcolor means not yet reviewed
        driver.find_element_by_xpath("//td[2]/div/a").click() # check the onclick content
        div_content = driver.find_element_by_xpath("//td[2]/div/div").text # fetch onclick content
        driver.find_element_by_xpath("//td[2]/div/div/a").click() # close the onclick content
        print(div_content)
        if "car" in div_content: # check whether a certain string exists in the onclick content
            list_content = div_content.split("【car】")
            car_close = list_content[1].strip() # fetch the content
            list_car = car_close.split(" ")
            car = list_car[0]
            print(car)
            orderminus = order - 1
            driver.find_element_by_xpath('//*[@id="%d"]/td[6]/a' % orderminus).click() # pick this row
            time.sleep(1)
            selected = selected + 1
            order = order + 0 # if this row is picked, the row will disappear, so the order won't change
        else: ### problem is here: the else branch seems never to be executed. Or is the if always true? Not possible, some rows do not contain "car". The problem occurs at the first div_content without "car"
            order = order + 1 # if "car" is not in
            time.sleep(1)
    else: # if already reviewed, order + 1
        order = order + 1
Above is my code, which uses Selenium to walk through a table on a web page.
First check: has the current row been reviewed?
Not yet reviewed: print the info.
Already reviewed: skip it.
Second check: does the info contain the string "car"?
No: skip the row.
Yes: click it, and the row disappears.
But when I run this, what actually happens is:
during the second check, if the string "car" is not in the info,
it keeps printing the info. It seems the else branch is never executed and lines 6-9 of this snippet run over and over in an endless loop.
Why? Can anybody give me a clue?
To make things clear, I have simplified my code as below:
list = []
list.append("ff122")
list.append("carff")
list.append("ff3232")
list.append("ffcar")
list.append("3232")
order = 0
selected = 0
while selected < 6:
    current_tr = list[order]
    print("round %d %s" % (order, current_tr))
    if "ff" in current_tr:
        print("ff is in current_tr")
        if "car" in current_tr:
            print("car")
            selected = selected + 1
            order = order + 0
        else:
            order = order + 1
            print("order is %d" % order)
    else: # if already reviewed, order + 1
        order = order + 1
        print("order is %d" % order)
Anybody can run this. What I need to do is first filter on "ff"; if "ff" exists, then filter on "car". When both conditions are true, selected is increased by 1, until selected reaches a certain number. In the real case, you can assume the list is long enough.
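One difference between the simplified list version and the Selenium version is that a picked table row disappears, while a list item stays where it is unless it is removed. The following is only a sketch of the intended control flow under that assumption: pick_rows is a hypothetical helper, picking a row is modelled with list.pop, and a bounds check is added so the loop also stops when the rows run out before the limit is reached.

```python
def pick_rows(rows, limit):
    """Pick up to `limit` rows that contain both "ff" and "car"."""
    remaining = list(rows)  # work on a copy; picked rows "disappear" from it
    picked = []
    index = 0
    # Stop when enough rows are picked OR when there are no rows left to check.
    while len(picked) < limit and index < len(remaining):
        row = remaining[index]
        if "ff" in row and "car" in row:
            # The row is removed, so the next row slides into this index.
            picked.append(remaining.pop(index))
        else:
            # Nothing was removed, so move on to the next row.
            index += 1
    return picked, remaining


rows = ["ff122", "carff", "ff3232", "ffcar", "3232"]
picked, remaining = pick_rows(rows, 6)
print(picked)
print(remaining)
```

Every pass through the loop either removes a row or advances the index, so the loop cannot spin on the same row forever.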

How to compare 2 list where string matches element in alternate list

Hi, I'm in the process of learning, so you may have to bear with me. I have 2 lists I'd like to compare, appending any matches to one output list and any non-matches to another.
Here's my code:
def EntryToFieldMatch(Entry, Fields):
    valid = []
    invalid = []
    for c in Entry:
        count = 0
        for s in Fields:
            count +=1
            if s in c:
                valid.append(c)
            elif count == len(Entry):
                invalid.append(s)
                Fields.remove(s)
    print valid
    print "-"*50
    print invalid
def main():
    vEntry = ['27/04/2014', 'Hours = 28', 'Site = Abroad', '03/05/2015', 'Date = 28-04-2015', 'Travel = 2']
    Fields = ['Week_Stop', 'Date', 'Site', 'Hours', 'Travel', 'Week_Start', 'Letters']
    EntryToFieldMatch(vEntry, Fields)
if __name__ == "__main__":
    main()
The output seems fine, except it's not returning all the fields in the 2 output lists. This is the output I receive:
['Hours = 28', 'Site = Abroad', 'Date = 28-04-2015', 'Travel = 2']
--------------------------------------------------
['Week_Start', 'Letters']
I just have no idea why the second list doesn't include "Week_Stop". I've run the debugger and followed the code through a few times to no avail. I've read about sets but I didn't see any way to return fields that match and discard fields that don't.
Also, I'm open to suggestions if anybody knows of a way to simplify this whole process. I'm not asking for free code, just a nod in the right direction.
Python 2.7, Thanks

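Your inner loop has only two branches: either the field is in the entry string, or count equals len(Entry). Neither ever catches the first element, 'Week_Stop': count is 1 when it is checked, so the elif can never fire for it. Because you remove items from Fields while iterating over it, the list shrinks from 7 to 6 to 5 elements; 'Week_Start' and 'Letters' are caught on the way, and after that count can never reach len(Entry) again.
A more efficient way would be to use a set, or a collections.OrderedDict if you want to keep order: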
from collections import OrderedDict
def EntryToFieldMatch(Entry, Fields):
    valid = []
    # create orderedDict from the words in Fields
    # dict lookups are O(1)
    st = OrderedDict.fromkeys(Fields)
    # iterate over Entry
    for word in Entry:
        # split the words once on whitespace
        spl = word.split(None, 1)
        # if the first word/word appears in our dict keys
        if spl[0] in st:
            # add to valid list
            valid.append(word)
            # remove the key
            del st[spl[0]]
    print valid
    print "-"*50
    # only invalid words will be left
    print st.keys()
Output:
['Hours = 28', 'Site = Abroad', 'Date = 28-04-2015', 'Travel = 2']
--------------------------------------------------
['Week_Stop', 'Week_Start', 'Letters']
For large lists this would be significantly faster than your quadratic approach. With O(1) dict lookups the code goes from quadratic to linear; every time you scan Fields, that is an O(n) operation.
Using a set the approach is similar:
def EntryToFieldMatch(Entry, Fields):
    valid = []
    st = set(Fields)
    for word in Entry:
        spl = word.split(None,1)
        if spl[0] in st:
            valid.append(word)
            st.remove(spl[0])
    print valid
    print "-"*50
    print st
The difference when using sets is that order is not maintained.
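For readers on Python 3, the same idea can be sketched with a plain dict, which preserves insertion order from Python 3.7 on. This is only an illustration under assumptions: entry_to_field_match is a hypothetical name, each entry is assumed to be non-empty with the field name as its first whitespace-separated word, and the function returns the two lists instead of printing them.

```python
def entry_to_field_match(entries, fields):
    """Return (matching entries, fields that no entry matched)."""
    # dict.fromkeys keeps the order of `fields` and gives O(1) lookups.
    remaining = dict.fromkeys(fields)
    valid = []
    for entry in entries:
        # The first whitespace-separated word is treated as the field name.
        name = entry.split(None, 1)[0]
        if name in remaining:
            valid.append(entry)
            del remaining[name]
    # Whatever is left was never matched by any entry.
    return valid, list(remaining)


entries = ['27/04/2014', 'Hours = 28', 'Site = Abroad', '03/05/2015',
           'Date = 28-04-2015', 'Travel = 2']
fields = ['Week_Stop', 'Date', 'Site', 'Hours', 'Travel', 'Week_Start', 'Letters']
valid, unmatched = entry_to_field_match(entries, fields)
print(valid)
print(unmatched)
```

Returning the lists rather than printing them makes the function easier to reuse and to check.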

Using list comprehension:
def EntryToFieldMatch(Entries, Fields):
    # using list comprehension
    # (typically they go on one line, but they can be multiline
    # so they look more like their for loop equivalents)
    valid = [entry for entry in Entries
             if any([field in entry
                     for field in Fields])]
    invalidEntries = [entry for entry in Entries
                      if not any([field in entry
                                  for field in Fields])]
    missedFields = [field for field in Fields
                    if not any([field in entry
                                for entry in Entries])]
    print 'valid entries:', valid
    print '-' * 80
    print 'invalid entries:', invalidEntries
    print '-' * 80
    print 'missed fields:', missedFields
vEntry = ['27/04/2014', 'Hours = 28', 'Site = Abroad', '03/05/2015', 'Date = 28-04-2015', 'Travel = 2']
Fields = ['Week_Stop', 'Date', 'Site', 'Hours', 'Travel', 'Week_Start', 'Letters']
EntryToFieldMatch(vEntry, Fields)
Output:
valid entries: ['Hours = 28', 'Site = Abroad', 'Date = 28-04-2015', 'Travel = 2']
--------------------------------------------------------------------------------
invalid entries: ['27/04/2014', '03/05/2015']
--------------------------------------------------------------------------------
missed fields: ['Week_Stop', 'Week_Start', 'Letters']

file processing in python

I'm working on text file processing using Python.
I've got a text file (ctl_Files.txt) which has the following content, or something similar:
------------------------
Changeset: 143
User: Sarfaraz
Date: Tuesday, April 05, 2011 5:34:54 PM
Comment:
Initial add, all objects.
Items:
add $/Systems/DB/Expences/Loader
add $/Systems/DB/Expences/Loader/AAA.txt
add $/Systems/DB/Expences/Loader/BBB.txt
add $/Systems/DB/Expences/Loader/CCC.txt
Check-in Notes:
Code Reviewer:
Performance Reviewer:
Reviewer:
Security Reviewer:
------------------------
Changeset: 145
User: Sarfaraz
Date: Thursday, April 07, 2011 5:34:54 PM
Comment:
edited objects.
Items:
edit $/Systems/DB/Expences/Loader
edit $/Systems/DB/Expences/Loader/AAA.txt
edit $/Systems/DB/Expences/Loader/AAB.txt
Check-in Notes:
Code Reviewer:
Performance Reviewer:
Reviewer:
Security Reviewer:
------------------------
Changeset: 147
User: Sarfaraz
Date: Wednesday, April 06, 2011 5:34:54 PM
Comment:
Initial add, all objects.
Items:
delete, source rename $/Systems/DB/Expences/Loader/AAA.txt;X34892
rename $/Systems/DB/Expences/Loader/AAC.txt.
Check-in Notes:
Code Reviewer:
Performance Reviewer:
Reviewer:
Security Reviewer:
------------------------
To process this file I wrote the following code:
#Tags - used for splitting the information
tag1 = 'Changeset:'
tag2 = 'User:'
tag3 = 'Date:'
tag4 = 'Comment:'
tag5 = 'Items:'
tag6 = 'Check-in Notes:'
#opening and reading the input file
#In path to input file use '\' as escape character
with open ("C:\\Users\\md_sarfaraz\\Desktop\\ctl_Files.txt", "r") as myfile:
    val=myfile.read().replace('\n', ' ')
#counting the occurrence of any one of the above tags
#As count will be same for all the tags
occurence = val.count(tag1)
#initializing row variable
row=""
#passing the count - occurence to the loop
for count in range(1, occurence+1):
    row += ( (val.split(tag1)[count].split(tag2)[0]).strip() + '|' \
        + (val.split(tag2)[count].split(tag3)[0]).strip() + '|' \
        + (val.split(tag3)[count].split(tag4)[0]).strip() + '|' \
        + (val.split(tag4)[count].split(tag5)[0]).strip() + '|' \
        + (val.split(tag5)[count].split(tag6)[0]).strip() + '\n')
#opening and writing the output file
#In path to output file use '\' as escape character
file = open("C:\\Users\\md_sarfaraz\\Desktop\\processed_ctl_Files.txt", "w+")
file.write(row)
file.close()
and got the following result/File (processed_ctl_Files.txt):
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader add $/Systems/DB/Expences/Loader/AAA.txt add $/Systems/DB/Expences/Loader/BBB.txt add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader edit $/Systems/DB/Expences/Loader/AAA.txt edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892 rename $/Systems/DB/Rascal/Expences/AAC.txt.
But, I want the result like this:
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader
add $/Systems/DB/Expences/Loader/AAA.txt
add $/Systems/DB/Expences/Loader/BBB.txt
add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader
edit $/Systems/DB/Expences/Loader/AAA.txt
edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892
rename $/Systems/DB/Rascal/Expences/AAC.txt.
or it would be great if we can get results like this :
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/AAA.txt
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/BBB.txt
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader/AAA.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|rename $/Systems/DB/Rascal/Expences/AAC.txt.
Let me know how I can do this. Also, I'm very new to Python so please ignore if I've written some lousy or redundant code. And help me to improve this.

This solution is not as short and probably not as effective as the answer utilizing regular expressions, but it should be quite easy to understand. The solution does make it easier to use the parsed data because each section data is stored into a dictionary.
ctl_file = "ctl_Files.txt" # path of source file
processed_ctl_file = "processed_ctl_Files.txt" # path of destination file
#Tags - used for splitting the information
changeset_tag = 'Changeset:'
user_tag = 'User:'
date_tag = 'Date:'
comment_tag = 'Comment:'
items_tag = 'Items:'
checkin_tag = 'Check-in Notes:'
section_separator = "------------------------"
changesets = []
#open and read the input file
with open(ctl_file, 'r') as read_file:
    first_section = True
    changeset_dict = {}
    items = []
    comment_stage = False
    items_stage = False
    checkin_dict = {}
    # Read one line at a time
    for line in read_file:
        # Check which tag matches the current line and store the data to matching key in the dictionary
        # (split only on the first colon so that values containing colons, such as the time, stay intact)
        if changeset_tag in line:
            changeset = line.split(":", 1)[1].strip()
            changeset_dict[changeset_tag] = changeset
        elif user_tag in line:
            user = line.split(":", 1)[1].strip()
            changeset_dict[user_tag] = user
        elif date_tag in line:
            date = line.split(":", 1)[1].strip()
            changeset_dict[date_tag] = date
        elif comment_tag in line:
            comment_stage = True
        elif items_tag in line:
            items_stage = True
        elif checkin_tag in line:
            pass # not implemented due to example file not containing any data
        elif section_separator in line: # new section
            if first_section:
                first_section = False
                continue
            tmp = changeset_dict
            changesets.append(tmp)
            changeset_dict = {}
            items = []
            # Set stages to false just in case
            items_stage = False
            comment_stage = False
        elif not line.strip(): # empty line
            if items_stage:
                changeset_dict[items_tag] = items
                items_stage = False
            comment_stage = False
        else:
            if comment_stage:
                changeset_dict[comment_tag] = line.strip() # Only works for one line comment
            elif items_stage:
                items.append(line.strip())
#open and write to the output file
with open(processed_ctl_file, 'w') as write_file:
    for changeset in changesets:
        row = "{0}|{1}|{2}|{3}|".format(changeset[changeset_tag], changeset[user_tag], changeset[date_tag], changeset[comment_tag])
        distance = len(row)
        items = changeset[items_tag]
        join_string = "\n" + distance * " "
        items_part = str.join(join_string, items)
        row += items_part + "\n"
        write_file.write(row)
Also, try to use variable names that describe their content. Names like tag1, tag2, etc. do not say much about what the variable holds. This makes code difficult to read, especially as scripts get longer. Readability might seem unimportant at first, but when revisiting old code, non-descriptive names make it take much longer to understand what the code does.

I would start by extracting the values into variables. Then create a prefix from the first few tags. You can count the number of characters in the prefix and use that for the padding. When you get to items, append the first one to the prefix and any other item can be appended to padding created from the number of spaces that you need.
# keywords used in the tag "Items: "
keywords = ['add', 'delete', 'edit', 'source', 'rename']
#passing the count - occurence to the loop
for cs in val.split(tag1)[1:]:
    changeset = cs.split(tag2)[0].strip()
    user = cs.split(tag2)[1].split(tag3)[0].strip()
    date = cs.split(tag3)[1].split(tag4)[0].strip()
    comment = cs.split(tag4)[1].split(tag5)[0].strip()
    items = cs.split(tag5)[1].split(tag6)[0].strip().split()
    notes = cs.split(tag6)
    prefix = '{0}|{1}|{2}|{3}'.format(changeset, user, date, comment)
    space_count = len(prefix)
    i = 0
    while i < len(items):
        # if we are printing the first item, add it to the other text
        if i == 0:
            pref = prefix
        # otherwise create padding from spaces
        else:
            pref = ' '*space_count
        # add all keywords
        words = ''
        for j in range(i, len(items)):
            if items[j] in keywords:
                words += ' ' + items[j]
            else:
                break
        if i >= len(items): break
        row += '{0}|{1} {2}\n'.format(pref, words, items[j])
        i += j - i + 1 # increase by the number of keywords + the param
This seems to do what you want, but I am not sure if this is the best solution. Maybe it is better to process the file line by line and print the values straight to the stream?
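As an illustration of the line-by-line idea, here is a minimal Python 3 sketch that produces the second desired format, with the prefix repeated for every item. It rests on assumptions taken from the sample file only: sections are separated by lines of dashes, "Comment:" and "Items:" stand alone on their line, and every non-empty line between "Items:" and "Check-in Notes:" is one item. The name flatten_changesets is hypothetical.

```python
def flatten_changesets(lines):
    """Return one 'changeset|user|date|comment|item' row per item line."""
    rows = []
    fields = {}
    section = None  # "comment", "items", or None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if set(line) == {"-"}:
            # A dashed separator starts a new changeset.
            fields, section = {}, None
        elif line.startswith(("Changeset:", "User:", "Date:")):
            # Split once so the colons inside the time survive.
            key, value = line.split(":", 1)
            fields[key] = value.strip()
        elif line == "Comment:":
            section = "comment"
        elif line == "Items:":
            section = "items"
        elif line.startswith("Check-in Notes:"):
            section = None  # reviewer lines that follow are ignored
        elif section == "comment":
            # Join multi-line comments with a single space.
            fields["Comment"] = (fields.get("Comment", "") + " " + line).strip()
        elif section == "items":
            rows.append("|".join([fields["Changeset"], fields["User"],
                                  fields["Date"], fields.get("Comment", ""),
                                  line]))
    return rows


sample = [
    "------------------------",
    "Changeset: 143",
    "User: Sarfaraz",
    "Date: Tuesday, April 05, 2011 5:34:54 PM",
    "Comment:",
    "  Initial add, all objects.",
    "Items:",
    "  add $/Systems/DB/Expences/Loader",
    "  add $/Systems/DB/Expences/Loader/AAA.txt",
    "Check-in Notes:",
    "  Code Reviewer:",
    "------------------------",
]
rows = flatten_changesets(sample)
print("\n".join(rows))
```

With a real file, the list can be replaced by the open file object, since the function only iterates over lines.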

You can use a regular expression to search for 'add', 'edit' etc.
import re
#Tags - used for splitting the information
tag1 = 'Changeset:'
tag2 = 'User:'
tag3 = 'Date:'
tag4 = 'Comment:'
tag5 = 'Items:'
tag6 = 'Check-in Notes:'
#opening and reading the input file
#In path to input file use '\' as escape character
with open ("wibble.txt", "r") as myfile:
    val=myfile.read().replace('\n', ' ')
#counting the occurrence of any one of the above tags
#As count will be same for all the tags
occurence = val.count(tag1)
#initializing row variable
row=""
prevlen = 0
#passing the count - occurence to the loop
for count in range(1, occurence+1):
    row += ( (val.split(tag1)[count].split(tag2)[0]).strip() + '|' \
        + (val.split(tag2)[count].split(tag3)[0]).strip() + '|' \
        + (val.split(tag3)[count].split(tag4)[0]).strip() + '|' \
        + (val.split(tag4)[count].split(tag5)[0]).strip() + '|' )
    distance = len(row) - prevlen
    # match whole keywords with alternation; [edit] would be a character class matching a single letter
    row += re.sub(r"\s\s+(edit|add|delete|rename)", r"\n"+r" "*distance+r"\1", (val.split(tag5)[count].split(tag6)[0])) + '\n'
    prevlen = len(row)
#opening and writing the output file
#In path to output file use '\' as escape character
file = open("wobble.txt", "w+")
file.write(row)
file.close()

regex to parse well-formatted multi-line data dictionary

I am trying to read and parse a data dictionary for the Census Bureau's American Community Survey Public Use Microsample data release, as found here.
It is reasonably well formatted, although with a few lapses where explanatory notes are inserted.
I think my preferred outcome is a dataframe with one row per variable, with all value labels for a given variable serialized into one dictionary stored in a value-dictionary field in the same row (a hierarchical JSON-like format would not be bad either, but more complicated).
I got the following code:
import pandas as pd
import re
import urllib2
# urlopen returns a file-like object; read() gives the text as a string
data = urllib2.urlopen('http://www.census.gov/acs/www/Downloads/data_documentation/pums/DataDict/PUMSDataDict13.txt').read()
## replace newline characters so we can use dots and find everything until a double
## carriage return (replaced to ||) with a lookahead assertion.
data=data.replace('\n','|')
datadict=pd.DataFrame(re.findall("([A-Z]{2,8})\s{2,9}([0-9]{1})\s{2,6}\|\s{2,4}([A-Za-z\-\(\) ]{3,85})",data,re.MULTILINE),columns=['variable','width','description'])
datadict.head(5)
+----+----------+-------+------------------------------------------------+
| | variable | width | description |
+----+----------+-------+------------------------------------------------+
| 0 | RT | 1 | Record Type |
+----+----------+-------+------------------------------------------------+
| 1 | SERIALNO | 7 | Housing unit |
+----+----------+-------+------------------------------------------------+
| 2 | DIVISION | 1 | Division code |
+----+----------+-------+------------------------------------------------+
| 3 | PUMA | 5 | Public use microdata area code (PUMA) based on |
+----+----------+-------+------------------------------------------------+
| 4 | REGION | 1 | Region code |
+----+----------+-------+------------------------------------------------+
| 5 | ST | 2 | State Code |
+----+----------+-------+------------------------------------------------+
So far so good. The list of variables is there, along with the width in characters of each.
I can expand this and get additional lines (where the value labels live), like so:
datadict_exp=pd.DataFrame(
    re.findall("([A-Z]{2,9})\s{2,9}([0-9]{1})\s{2,6}\|\s{4}([A-Za-z\-\(\)\;\<\> 0-9]{2,85})\|\s{11,15}([a-z0-9]{0,2})[ ]\.([A-Za-z/\-\(\) ]{2,120})",
               data,re.MULTILINE))
datadict_exp.head(5)
+----+----------+-------+---------------------------------------------------+---------+--------------+
| id | variable | width | description | value_1 | label_1 |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 0 | DIVISION | 1 | Division code | 0 | Puerto Rico |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 1 | REGION | 1 | Region code | 1 | Northeast |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 2 | ST | 2 | State Code | 1 | Alabama/AL |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 3 | NP | 2 | Number of person records following this housin... | 0 | Vacant unit |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 4 | TYPE | 1 | Type of unit | 1 | Housing unit |
+----+----------+-------+---------------------------------------------------+---------+--------------+
So that gets the first value and associated label. My regex issue is how to repeat the multi-line match starting with \s{11,15} through to the end, i.e. some variables have tons of unique values (ST, the state code, is followed by some 50 lines, denoting the value and label for each state).
Early on I replaced the carriage returns in the source file with a pipe, thinking that I could then shamelessly rely on the dot to match everything until a double carriage return, which indicates the end of that particular variable, and here is where I got stuck.
So: how do I repeat a multi-line pattern an arbitrary number of times?
(A complication for later is that some variables are not fully enumerated in the dictionary, but are shown with valid ranges of values. NP, for example [number of persons associated with the same household], is denoted with `02..20` following a description. If I don't account for this, my parsing will miss such entries, of course.)
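One way to sidestep repeating a group inside a single regex is to work in two steps: split the text into blocks at blank lines, then apply a small per-line pattern inside each block. The following Python 3 sketch only illustrates that technique. The layout it assumes (a header line with name and width, one description line, then indented "value .label" lines) is a simplification of the real file, the sample text is made up to match that layout, and parse_blocks is a hypothetical name.

```python
import re

# Header line, e.g. "DIVISION    1": variable name, spaces, width.
HEADER = re.compile(r"^([A-Z][A-Z0-9]{1,8})\s+(\d+)\s*$")
# Value line, e.g. "    0 .Puerto Rico" or "    02..20 .Number of persons".
VALUE = re.compile(r"^\s+(\S+)\s+\.(.+)$")


def parse_blocks(text):
    """Return one dict per variable block, with all value labels collected."""
    records = []
    # Step 1: a blank line ends a variable block.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        header = HEADER.match(lines[0])
        if not header or len(lines) < 2:
            continue  # skip notes and anything that is not a variable block
        labels = {}
        # Step 2: every remaining line that looks like "value .label" is kept.
        for line in lines[2:]:
            found = VALUE.match(line)
            if found:
                labels[found.group(1)] = found.group(2).strip()
        records.append({
            "variable": header.group(1),
            "width": header.group(2),
            "description": lines[1].strip(),
            "labels": labels,
        })
    return records


sample = "\n".join([
    "RT          1",
    "    Record Type",
    "           H .Housing Record or Group Quarters Unit",
    "",
    "DIVISION    1",
    "    Division code",
    "           0 .Puerto Rico",
    "           1 .New England (Northeast region)",
])
records = parse_blocks(sample)
print(records)
```

The list of dicts can be passed to pandas.DataFrame to get one row per variable, with the labels dictionary stored in its own column. Because the value pattern accepts any run of non-space characters, ranges such as 02..20 are captured as a single key.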

This isn't a regex, but I parsed PUMSDataDict2013.txt and PUMS_Data_Dictionary_2009-2013.txt (Census ACS 2013 documentation, FTP server) with the Python 3.x script below. I used pandas.DataFrame.from_dict and pandas.concat to create a hierarchical dataframe, also below.
Python 3.x function to parse PUMSDataDict2013.txt and PUMS_Data_Dictionary_2009-2013.txt:
import collections
import os
def parse_pumsdatadict(path:str) -> collections.OrderedDict:
r"""Parse ACS PUMS Data Dictionaries.
Args:
path (str): Path to downloaded data dictionary.
Returns:
ddict (collections.OrderedDict): Parsed data dictionary with original
key order preserved.
Raises:
FileNotFoundError: Raised if `path` does not exist.
Notes:
* Only some data dictionaries have been tested.[^urls]
* Values are all strings. No data types are inferred from the
original file.
* Example structure of returned `ddict`:
ddict['title'] = '2013 ACS PUMS DATA DICTIONARY'
ddict['date'] = 'August 7, 2015'
ddict['record_types']['HOUSING RECORD']['RT']\
['length'] = '1'
['description'] = 'Record Type'
['var_codes']['H'] = 'Housing Record or Group Quarters Unit'
ddict['record_types']['HOUSING RECORD'][...]
ddict['record_types']['PERSON RECORD'][...]
ddict['notes'] =
['Note for both Industry and Occupation lists...',
'* In cases where the SOC occupation code ends...',
...]
References:
[^urls]: http://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/
PUMSDataDict2013.txt
PUMS_Data_Dictionary_2009-2013.txt
"""
# Check arguments.
if not os.path.exists(path):
raise FileNotFoundError(
"Path does not exist:\n{path}".format(path=path))
# Parse data dictionary.
# Note:
# * Data dictionary keys and values are "codes for variables",
# using the ACS terminology,
# https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.html
# * The data dictionary is not all encoded in UTF-8. Replace encoding
# errors when found.
# * Catch instances of inconsistently formatted data.
ddict = collections.OrderedDict()
with open(path, encoding='utf-8', errors='replace') as fobj:
# Data dictionary name is line 1.
ddict['title'] = fobj.readline().strip()
# Data dictionary date is line 2.
ddict['date'] = fobj.readline().strip()
# Initialize flags to catch lines.
(catch_var_name, catch_var_desc,
catch_var_code, catch_var_note) = (None, )*4
var_name = None
var_name_last = 'PWGTP80' # Necessary for unformatted end-of-file notes.
for line in fobj:
# Replace tabs with 4 spaces
line = line.replace('\t', ' '*4).rstrip()
# Record type is section header 'HOUSING RECORD' or 'PERSON RECORD'.
if (line.strip() == 'HOUSING RECORD'
or line.strip() == 'PERSON RECORD'):
record_type = line.strip()
if 'record_types' not in ddict:
ddict['record_types'] = collections.OrderedDict()
ddict['record_types'][record_type] = collections.OrderedDict()
# A newline precedes a variable name.
# A newline follows the last variable code.
elif line == '':
# Example inconsistent format case:
# WGTP54 5
# Housing Weight replicate 54
#
# -9999..09999 .Integer weight of housing unit
if (catch_var_code
and 'var_codes' not in ddict['record_types'][record_type][var_name]):
pass
# Terminate the previous variable block and look for the next
# variable name, unless past last variable name.
else:
catch_var_code = False
catch_var_note = False
if var_name != var_name_last:
catch_var_name = True
# Variable name is 1 line with 0 space indent.
# Variable name is followed by variable description.
# Variable note is optional.
# Variable note is preceded by newline.
# Variable note is 1+ lines.
# Variable note is followed by newline.
elif (catch_var_name and not line.startswith(' ')
and var_name != var_name_last):
# Example: "Note: Public use microdata areas (PUMAs) ..."
if line.lower().startswith('note:'):
var_note = line.strip() # type(var_note) == str
if 'notes' not in ddict['record_types'][record_type][var_name]:
ddict['record_types'][record_type][var_name]['notes'] = list()
# Append a new note.
ddict['record_types'][record_type][var_name]['notes'].append(var_note)
catch_var_note = True
# Example: """
# Note: Public Use Microdata Areas (PUMAs) designate areas ...
# population. Use with ST for unique code. PUMA00 applies ...
# ...
# """
elif catch_var_note:
var_note = line.strip() # type(var_note) == str
if 'notes' not in ddict['record_types'][record_type][var_name]:
ddict['record_types'][record_type][var_name]['notes'] = list()
# Concatenate to most recent note.
ddict['record_types'][record_type][var_name]['notes'][-1] += ' '+var_note
# Example: "NWAB 1 (UNEDITED - See 'Employment Status Recode' (ESR))"
else:
# type(var_note) == list
(var_name, var_len, *var_note) = line.strip().split(maxsplit=2)
ddict['record_types'][record_type][var_name] = collections.OrderedDict()
ddict['record_types'][record_type][var_name]['length'] = var_len
# Append a new note if exists.
if len(var_note) > 0:
if 'notes' not in ddict['record_types'][record_type][var_name]:
ddict['record_types'][record_type][var_name]['notes'] = list()
ddict['record_types'][record_type][var_name]['notes'].append(var_note[0])
catch_var_name = False
catch_var_desc = True
var_desc_indent = None
# Variable description is 1+ lines with 1+ space indent.
# Variable description is followed by variable code(s).
# Variable code(s) is 1+ line with larger whitespace indent
# than variable description. Example:"""
# PUMA00 5
# Public use microdata area code (PUMA) based on Census 2000 definition for data
# collected prior to 2012. Use in combination with PUMA10.
# 00100..08200 .Public use microdata area codes
# 77777 .Combination of 01801, 01802, and 01905 in Louisiana
# -0009 .Code classification is Not Applicable because data
# .collected in 2012 or later
# """
# The last variable code is followed by a newline.
elif (catch_var_desc or catch_var_code) and line.startswith(' '):
indent = len(line) - len(line.lstrip())
# For line 1 of variable description.
if catch_var_desc and var_desc_indent is None:
var_desc_indent = indent
var_desc = line.strip()
ddict['record_types'][record_type][var_name]['description'] = var_desc
# For lines 2+ of variable description.
elif catch_var_desc and indent <= var_desc_indent:
var_desc = line.strip()
ddict['record_types'][record_type][var_name]['description'] += ' '+var_desc
# For lines 1+ of variable codes.
else:
catch_var_desc = False
catch_var_code = True
is_valid_code = None
if not line.strip().startswith('.'):
# Example case: "01 .One person record (one person in household or"
if ' .' in line:
(var_code, var_code_desc) = line.strip().split(
sep=' .', maxsplit=1)
is_valid_code = True
# Example inconsistent format case:"""
# bbbb. N/A (age less than 15 years; never married)
# """
elif '. ' in line:
(var_code, var_code_desc) = line.strip().split(
sep='. ', maxsplit=1)
is_valid_code = True
else:
raise AssertionError(
"Program error. Line unaccounted for:\n" +
"{line}".format(line=line))
if is_valid_code:
if 'var_codes' not in ddict['record_types'][record_type][var_name]:
ddict['record_types'][record_type][var_name]['var_codes'] = collections.OrderedDict()
ddict['record_types'][record_type][var_name]['var_codes'][var_code] = var_code_desc
# Example case: ".any person in group quarters)"
else:
var_code_desc = line.strip().lstrip('.')
ddict['record_types'][record_type][var_name]['var_codes'][var_code] += ' '+var_code_desc
# Example inconsistent format case:"""
# ADJHSG 7
# Adjustment factor for housing dollar amounts (6 implied decimal places)
# """
elif (catch_var_desc and
'description' not in ddict['record_types'][record_type][var_name]):
var_desc = line.strip()
ddict['record_types'][record_type][var_name]['description'] = var_desc
catch_var_desc = False
catch_var_code = True
# Example inconsistent format case:"""
# WGTP10 5
# Housing Weight replicate 10
# -9999..09999 .Integer weight of housing unit
# WGTP11 5
# Housing Weight replicate 11
# -9999..09999 .Integer weight of housing unit
# """
elif ((var_name == 'WGTP10' and 'WGTP11' in line)
or (var_name == 'YOEP12' and 'ANC' in line)):
# type(var_note) == list
(var_name, var_len, *var_note) = line.strip().split(maxsplit=2)
ddict['record_types'][record_type][var_name] = collections.OrderedDict()
ddict['record_types'][record_type][var_name]['length'] = var_len
if len(var_note) > 0:
if 'notes' not in ddict['record_types'][record_type][var_name]:
ddict['record_types'][record_type][var_name]['notes'] = list()
ddict['record_types'][record_type][var_name]['notes'].append(var_note[0])
catch_var_name = False
catch_var_desc = True
var_desc_indent = None
else:
if (catch_var_name, catch_var_desc,
catch_var_code, catch_var_note) != (False, )*4:
raise AssertionError(
"Program error. All flags to catch lines should be set " +
"to `False` by end-of-file.")
if var_name != var_name_last:
raise AssertionError(
"Program error. End-of-file notes should only be read "+
"after `var_name_last` has been processed.")
if 'notes' not in ddict:
ddict['notes'] = list()
ddict['notes'].append(line)
return ddict
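The variable-code branch above is the part most affected by the file's inconsistent formatting, so it can help to see it in isolation. What follows is a minimal sketch, not part of the original function: `split_code_line` is a hypothetical helper name, and it assumes the same two separators the parser checks for (`' .'` first, then `'. '`).

```python
def split_code_line(line):
    """Split one variable-code line into (code, description).

    Returns None for continuation lines, which start with '.'.
    Hypothetical helper: the original function does this inline.
    """
    text = line.strip()
    if text.startswith('.'):
        return None
    # Try the regular separator first, then the inconsistent one.
    for sep in (' .', '. '):
        if sep in text:
            code, desc = text.split(sep, maxsplit=1)
            return code, desc
    raise ValueError("Line unaccounted for: {!r}".format(line))


print(split_code_line("01 .One person record (one person in household or"))
print(split_code_line("bbbb. N/A (age less than 15 years; never married)"))
print(split_code_line(".any person in group quarters)"))
```

Checking `' .'` before `'. '` matters for lines such as `-9999..09999 .Integer weight of housing unit`, where only the first separator gives the intended split.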
Create the hierarchical dataframe (formatted below as Jupyter Notebook cells):
In [ ]:
import pandas as pd
ddict = parse_pumsdatadict(path=r'/path/to/PUMSDataDict2013.txt')
tmp = dict()
for record_type in ddict['record_types']:
    tmp[record_type] = pd.DataFrame.from_dict(ddict['record_types'][record_type], orient='index')
df_ddict = pd.concat(tmp, names=['record_type', 'var_name'])
df_ddict.head()
Out[ ]:
# Output of `df_ddict.head()`, rendered as an HTML table:
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th>length</th>
<th>description</th>
<th>var_codes</th>
<th>notes</th>
</tr>
<tr>
<th>record_type</th>
<th>var_name</th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="5" valign="top">HOUSING RECORD</th>
<th>ACCESS</th>
<td>1</td>
<td>Access to the Internet</td>
<td>{'b': 'N/A (GQ)', '1': 'Yes, with subscription...</td>
<td>NaN</td>
</tr>
<tr>
<th>ACR</th>
<td>1</td>
<td>Lot size</td>
<td>{'b': 'N/A (GQ/not a one-family house or mobil...</td>
<td>NaN</td>
</tr>
<tr>
<th>ADJHSG</th>
<td>7</td>
<td>Adjustment factor for housing dollar amounts (...</td>
<td>{'1000000': '2013 factor (1.000000)'}</td>
<td>[Note: The value of ADJHSG inflation-adjusts r...</td>
</tr>
<tr>
<th>ADJINC</th>
<td>7</td>
<td>Adjustment factor for income and earnings doll...</td>
<td>{'1007549': '2013 factor (1.007549)'}</td>
<td>[Note: The value of ADJINC inflation-adjusts r...</td>
</tr>
<tr>
<th>AGS</th>
<td>1</td>
<td>Sales of Agriculture Products (Yearly sales)</td>
<td>{'b': 'N/A (GQ/vacant/not a one family house o...</td>
<td>[Note: no adjustment factor is applied to AGS.]</td>
</tr>
</tbody>
</table>
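Once the dataframe has the two-level index (`record_type`, `var_name`), single variables and whole record types can be looked up by label. The sketch below assumes pandas is installed and uses a small hand-written dictionary in place of the parsed file. The `HOUSING RECORD` entries are copied from the output above, while `EXAMPLE RECORD` and `VAR1` are made-up placeholders.

```python
import pandas as pd

# Hand-written stand-in for ddict['record_types'].
record_types = {
    'HOUSING RECORD': {
        'ACCESS': {'length': '1', 'description': 'Access to the Internet'},
        'ACR': {'length': '1', 'description': 'Lot size'},
    },
    'EXAMPLE RECORD': {  # hypothetical record type, for illustration only
        'VAR1': {'length': '2', 'description': 'Placeholder variable'},
    },
}

# Same construction as above: one dataframe per record type, then concat.
frames = {
    record_type: pd.DataFrame.from_dict(variables, orient='index')
    for record_type, variables in record_types.items()
}
df_small = pd.concat(frames, names=['record_type', 'var_name'])

# Look up one cell by (record_type, var_name) and column name.
acr_desc = df_small.loc[('HOUSING RECORD', 'ACR'), 'description']

# Select every variable of one record type, dropping the outer index level.
housing = df_small.xs('HOUSING RECORD', level='record_type')
print(acr_desc)
print(list(housing.index))
```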

for loop to insert things into a tkinter window

I have a Tkinter program that has to add a significant amount of data to the window, so I tried to write a for loop to take care of it. The problem is that I build the name of the object that .insert() should run on as a string variable. I may not be explaining this well, so here is the method:
def fillWindow(self):
    global fileDirectory
    location = os.path.join(fileDirectory, family + '.txt')
    file = open(location, 'r')
    ordersDict = {}
    for line in file:
        (key, value) = line.split(':', 1)
        ordersDict[key] = value
    for key in ordersDict:
        ordersDict[key] = ordersDict[key][:-2]
    for item in ordersDict:
        if item[0] == '#':
            if item[1] == 'o':
                name = 'ordered%s' %item[2:]
The next line is the problem. I already created an entry object whose variable name matches this string, but 'name' itself is only a string variable, so it gives me the error "AttributeError: 'str' object has no attribute 'insert'":
                name.insert(0,ordersDict[item])
Here is the entire class. It creates a Tkinter window and fills it with a sort of shipping screen, so all the entries are for how many orders of a certain item are needed. I'm also very new, so I know that I often do things the long way.
class EditShippingWindow(Tkinter.Toplevel):
    def __init__(self, student):
        Tkinter.Toplevel.__init__(self)
        self.title('Orders')
        family = student
        ## Window Filling
        ageGroupLabel = Tkinter.Label(self,text='Age Group')
        ageGroupLabel.grid(row=0,column=0)
        itemColumnLabel = Tkinter.Label(self,text='Item')
        itemColumnLabel.grid(row=0, column=1)
        costColumnLabel = Tkinter.Label(self,text='Cost')
        costColumnLabel.grid(row=0, column=2)
        orderedColumnLabel = Tkinter.Label(self,text='Ordered')
        orderedColumnLabel.grid(row=0, column=3)
        paidColumnLabel = Tkinter.Label(self,text='Paid')
        paidColumnLabel.grid(row=0, column=4)
        receivedColumnLabel = Tkinter.Label(self,text='Received')
        receivedColumnLabel.grid(row=0, column=5)
        #Item Filling
        column1list = ['T-Shirt (2T):$9.00', 'T-Shirt (3T):$9.00', 'T-Shirt (4T):$9.00',
                       'Praise Music CD:$10.00', ':', 'Vest L(Size 6):$10.00', 'Vest XL(Size 8):$10.00',
                       'Hand Book (KJ/NIV):$8.75', 'Handbook Bag:$6.00', 'Memory CD (KJ/NIV):$10.00',
                       ':', 'Vest L(size 10):$10.00', 'Vest XL(Size 12):$10.00', 'Hand Glider (KJ/NIV/NKJ):$10.00',
                       'Wing Runner (KJ/NIV/NKJ):$10.00', 'Sky Stormer (KJ/NIV/NKJ):$10.00', 'Handbook Bag:$5.00',
                       'Memory CD (S/H/C):$10.00', 'Hand Glider Freq. Flyer:$8.00', 'Wing Runner Freq. Flyer:$8.00',
                       'Sky Stormer Handbook:$8.00' , ':', 'Uniform T-Shirt Size (10/12/14):$13.00',
                       'Uniform T-Shirt Size(10/12/14):$13.00', 'Uniform T-Shirt(Adult S / M / L / XL):$13.00',
                       '3rd & 4th Gr. Book 1 (KJ / NIV / NKJ):$8.75', '3rd & 4th Gr. Book 2 (KJ / NIV / NKJ):$8.75',
                       '4th & 5th Gr. Book 1 (KJ / NIV / NKJ):$8.75', '4th & 5th Gr. Book 2 (KJ / NIV / NKJ):$8.75',
                       'Memory CD 3rd & 4th Gr. Book (1/2):$10.00', 'Drawstring Backpack:$5.50']
        column1num = 1
        for item in column1list:
            num = str(column1num)
            (title, price) = item.split(':')
            objectName1 = 'column1row' + num
            objectName1 = Tkinter.Label(self,text=title)
            objectName1.grid(row=column1num, column=1)
            objectName2 = 'column1row' + num
            objectName2 = Tkinter.Label(self,text=price)
            objectName2.grid(row=column1num, column=2)
            column1num += 1
        #Ordered Paid Recieved Filler
        for i in range(32):
            if i == 11 or i == 22 or i == 0 or i == 5:
                pass
            else:
                width = 10
                # First Column
                title1 = 'ordered' + str(i)
                self.title1 = Tkinter.Entry(self,width=width)
                self.title1.grid(row=i,column=3)
                #self.title1.insert(0, title1)
                #Second
                title2 = 'paid' + str(i)
                self.title2 = Tkinter.Entry(self,width=width)
                self.title2.grid(row=i,column=4)
                #self.title2.insert(0, title2)
                #Third
                title3 = 'received' + str(i)
                self.title3 = Tkinter.Entry(self,width=width)
                self.title3.grid(row=i,column=5)
                #self.title3.insert(0, title3)
        ## Methods
        def fillWindow(self):
            global fileDirectory
            location = os.path.join(fileDirectory, family + '.txt')
            file = open(location, 'r')
            ordersDict = {}
            for line in file:
                (key, value) = line.split(':', 1)
                ordersDict[key] = value
            for key in ordersDict:
                ordersDict[key] = ordersDict[key][:-2]
            for item in ordersDict:
                if item[0] == '#':
                    if item[1] == 'o':
                        self.name = 'ordered%s' %item[2:]
                        self.name.insert(0,ordersDict[item])
        fillWindow(self)

It looks like there is a conceptual error here: inside this method, the variable "name" does not exist until the name = 'ordered%s' line at the end of the first listing. At that point it is created and points to an ordinary Python string. If you use a "name" variable elsewhere in your class, that variable is not visible inside this method.
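For a quick change to your existing code, create the variable as "self.name" instead of just name, and on the last line of the method use:
self.name.insert(0,ordersDict[item])
The self. prefix turns the variable into an instance variable, which is shared across methods on the same instance of the class. Keep in mind that self.name still holds a string at that point, so .insert() can only work once you look up the Entry widget that the string names, for example with getattr(self, name) if each widget was stored as an attribute under that name.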
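To make the difference between a string and the object it names concrete, here is a minimal sketch. It assumes each entry is stored on the instance under a generated attribute name, which the class in the question does not do yet. `FakeEntry` and `OrderForm` are hypothetical stand-ins so the example runs without a display; in the real program the stored objects would be `Tkinter.Entry` widgets.

```python
class FakeEntry:
    """Hypothetical stand-in for Tkinter.Entry (no display needed)."""

    def __init__(self):
        self.text = ''

    def insert(self, index, value):
        # Mimic Entry.insert: place value at the given character position.
        self.text = self.text[:index] + value + self.text[index:]


class OrderForm:
    def __init__(self):
        # Store each widget as an instance attribute named 'ordered1', 'ordered2', ...
        for i in (1, 2, 3):
            setattr(self, 'ordered%d' % i, FakeEntry())

    def fill(self, orders):
        for suffix, value in orders.items():
            name = 'ordered%s' % suffix    # name is only a str here
            widget = getattr(self, name)   # fetch the attribute with that name
            widget.insert(0, value)


form = OrderForm()
form.fill({'1': '4', '3': '12'})
print(form.ordered1.text, form.ordered3.text)
```

A plain dictionary that maps each name to its widget would work just as well and avoids generating attribute names; which one to use is a design choice.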
On a side note, you don't need the dictionary at all, much less three consecutive for loops in this method: just insert the relevant values into the matching entry as you extract them from each "line" of the file.
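A single pass over the lines could look roughly like the sketch below. It assumes the file holds lines such as `#o1:4`, which is a guess based on the parsing code in the question, and it uses `strip()` where the question slices off the last two characters. `ordered_values` is a hypothetical helper name.

```python
def ordered_values(lines):
    """Yield (row_suffix, value) for every '#o...' line, in one pass."""
    for line in lines:
        key, sep, value = line.partition(':')
        # Keep only well-formed lines whose key starts with '#o'.
        if sep and key.startswith('#o'):
            yield key[2:], value.strip()


# Assumed sample content; any iterable of lines works, including an open file.
sample = ["#o1:4\n", "#p1:2\n", "#o3:12\n", "notes:none\n"]
for suffix, value in ordered_values(sample):
    print('ordered' + suffix, value)
```

Each yielded pair can then be passed straight to the matching entry's `insert` call, so no intermediate dictionary is needed.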
