Matching states and cities with possibly multiple words - python

I have a Python list with elements like the following:
['Alabama[edit]',
'Auburn (Auburn University)[1]',
'Florence (University of North Alabama)',
'Jacksonville (Jacksonville State University)[2]',
'Livingston (University of West Alabama)[2]',
'Montevallo (University of Montevallo)[2]',
'Troy (Troy University)[2]',
'Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]',
'Tuskegee (Tuskegee University)[5]',
'Alaska[edit]',
'Fairbanks (University of Alaska Fairbanks)[2]',
'Arizona[edit]',
'Flagstaff (Northern Arizona University)[6]',
'Tempe (Arizona State University)',
'Tucson (University of Arizona)',
'Arkansas[edit]',
'Arkadelphia (Henderson State University, Ouachita Baptist University)[2]',
'Conway (Central Baptist College, Hendrix College, University of Central Arkansas)[2]',
'Fayetteville (University of Arkansas)[7]']
The list is not complete, but is sufficient to give you an idea of what's in it.
The data is structured like this:
Each US state name is followed by the names of cities in that state. The state name, as you can see, ends in "[edit]", and each city name ends either in a bracketed number (for example "[1]" or "[2]") or in a university's name within parentheses (for example "(University of North Alabama)").
(Find the full reference file for this problem here)
Ideally, I want a Python dictionary with the state names as keys and, for each key, a list of all the city names in that state as the value. So, for example, the dictionary should look like:
{'Alabama': ['Auburn', 'Florence', 'Jacksonville', ...], 'Arizona': ['Flagstaff', 'Tempe', 'Tucson', ...], ...}
Now, I tried the following solution, to weed out the unnecessary parts:
import numpy as np
import pandas as pd
def get_list_of_university_towns():
    '''
    Returns a DataFrame of towns and the states they are in from the
    university_towns.txt list. The format of the DataFrame should be:
    DataFrame( [ ["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"] ],
    columns=["State", "RegionName"] )
    The following cleaning needs to be done:
    1. For "State", removing characters from "[" to the end.
    2. For "RegionName", when applicable, removing every character from " (" to the end.
    3. Depending on how you read the data, you may need to remove newline character '\n'.
    '''
    fhandle = open("university_towns.txt")
    ftext = fhandle.read().split("\n")
    reftext = list()
    for item in ftext:
        reftext.append(item.split(" ")[0])
    #pos = reftext[0].find("[")
    #reftext[0] = reftext[0][:pos]
    towns = list()
    dic = dict()
    for item in reftext:
        if item == "Alabama[edit]":
            state = "Alabama"
        elif item.endswith("[edit]"):
            dic[state] = towns
            towns = list()
            pos = item.find("[")
            item = item[:pos]
            state = item
        else:
            towns.append(item)
    return ftext
get_list_of_university_towns()
A snippet of my output generated by my code looks like this:
{'Alabama': ['Auburn',
'Florence',
'Jacksonville',
'Livingston',
'Montevallo',
'Troy',
'Tuscaloosa',
'Tuskegee'],
'Alaska': ['Fairbanks'],
'Arizona': ['Flagstaff', 'Tempe', 'Tucson'],
'Arkansas': ['Arkadelphia',
'Conway',
'Fayetteville',
'Jonesboro',
'Magnolia',
'Monticello',
'Russellville',
'Searcy'],
'California': ['Angwin',
'Arcata',
'Berkeley',
'Chico',
'Claremont',
'Cotati',
'Davis',
'Irvine',
'Isla',
'University',
'Merced',
'Orange',
'Palo',
'Pomona',
'Redlands',
'Riverside',
'Sacramento',
'University',
'San',
'San',
'Santa',
'Santa',
'Turlock',
'Westwood,',
'Whittier'],
'Colorado': ['Alamosa',
'Boulder',
'Durango',
'Fort',
'Golden',
'Grand',
'Greeley',
'Gunnison',
'Pueblo,'],
'Connecticut': ['Fairfield',
'Middletown',
'New',
'New',
'New',
'Storrs',
'Willimantic'],
'Delaware': ['Dover', 'Newark'],
'Florida': ['Ave',
'Boca',
'Coral',
'DeLand',
'Estero',
'Gainesville',
'Orlando',
'Sarasota',
'St.',
'St.',
'Tallahassee',
'Tampa'],
'Georgia': ['Albany',
'Athens',
'Atlanta',
'Carrollton',
'Demorest',
'Fort',
'Kennesaw',
'Milledgeville',
'Mount',
'Oxford',
'Rome',
'Savannah',
'Statesboro',
'Valdosta',
'Waleska',
'Young'],
'Hawaii': ['Manoa'],
But there is one error in the output: states with a space in their names (e.g. "North Carolina") are not included. I can see the reason behind it.
I thought of using regular expressions, but since I have yet to study them, I do not know how to form one. Any ideas as to how it could be done, with or without the use of regex?

Praise the power of regular expressions then:
import re
states_rx = re.compile(r'''
    ^
    (?P<state>.+?)\[edit\]
    (?P<cities>[\s\S]+?)
    (?=^.*\[edit\]$|\Z)
    ''', re.MULTILINE | re.VERBOSE)
cities_rx = re.compile(r'''^[^()\n]+''', re.MULTILINE)
# lst_ is the list of strings from the question
transformed = '\n'.join(lst_)
result = {state.group('state'): [city.group(0).rstrip()
                                 for city in cities_rx.finditer(state.group('cities'))]
          for state in states_rx.finditer(transformed)}
print(result)
This yields
{'Alabama': ['Auburn', 'Florence', 'Jacksonville', 'Livingston', 'Montevallo', 'Troy', 'Tuscaloosa', 'Tuskegee'], 'Alaska': ['Fairbanks'], 'Arizona': ['Flagstaff', 'Tempe', 'Tucson'], 'Arkansas': ['Arkadelphia', 'Conway', 'Fayetteville']}
Explanation:
The idea is to split the task up into several smaller tasks:
Join the complete list with \n
Separate states
Separate towns
Use a dict comprehension for all found items
First subtask
transformed = '\n'.join(your_list)
Second subtask
^ # match start of the line
(?P<state>.+?)\[edit\] # capture anything in that line up to [edit]
(?P<cities>[\s\S]+?) # afterwards match anything up to
(?=^.*\[edit\]$|\Z) # ... either another state or the very end of the string
See the demo on regex101.com.
Third subtask
^[^()\n]+ # match start of the line, anything not a newline character or ( or )
See another demo on regex101.com.
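As a small illustration of the third subtask, the sketch below runs the same city pattern over a short, made-up block of city lines. The sample text and the variable names `sample_cities` and `city_names` are assumptions for illustration, not part of the original data file.

```python
import re

# Same pattern as above: from the start of each line, take everything
# that is not a parenthesis or a newline.
cities_rx = re.compile(r'^[^()\n]+', re.MULTILINE)

# Hypothetical sample of the text that follows a state header.
sample_cities = (
    "Isla Vista (University of California, Santa Barbara)\n"
    "San Luis Obispo (California Polytechnic State University)[2]\n"
    "Davis (University of California, Davis)[1]\n"
)

# rstrip() removes the blank left in front of the opening parenthesis.
city_names = [m.group(0).rstrip() for m in cities_rx.finditer(sample_cities)]
print(city_names)
```

Multi-word names survive because the match only stops at a parenthesis or the end of the line, not at a space.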
Fourth subtask
result = {state.group('state'): [city.group(0).rstrip() for city in cities_rx.finditer(state.group('cities'))] for state in states_rx.finditer(transformed)}
This is roughly equivalent to:
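result = {}
for state in states_rx.finditer(transformed):
    # state is in state.group('state')
    cities = []
    for city in cities_rx.finditer(state.group('cities')):
        # city is in city.group(0), possibly with trailing whitespace
        # hence the rstrip
        cities.append(city.group(0).rstrip())
    result[state.group('state')] = cities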
Lastly, a note on timing (findstatesandcities is a function wrapping the code above):
import timeit
print(timeit.timeit(findstatesandcities, number=10**5))
# 12.234304904000965
So running the above 100,000 times took around 12 seconds on my computer, which should be reasonably fast.
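For anyone who wants to reproduce a measurement like this, here is a minimal self-contained sketch. The function name `find_states_and_cities` and the four-line `sample_lines` input are assumptions made for the example; timings will differ by machine and by input size, so no figure is quoted.

```python
import re
import timeit

states_rx = re.compile(r'''
    ^
    (?P<state>.+?)\[edit\]      # state name up to the edit marker
    (?P<cities>[\s\S]+?)        # everything after it ...
    (?=^.*\[edit\]$|\Z)         # ... up to the next state or the end
    ''', re.MULTILINE | re.VERBOSE)
cities_rx = re.compile(r'^[^()\n]+', re.MULTILINE)

# Hypothetical sample input, including a state with a space in its name.
sample_lines = [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'North Carolina[edit]',
    'Chapel Hill (University of North Carolina)[2]',
]

def find_states_and_cities(lines=sample_lines):
    """Build {state: [cities]} from the list of raw lines."""
    transformed = '\n'.join(lines)
    return {
        state.group('state'): [
            city.group(0).rstrip()
            for city in cities_rx.finditer(state.group('cities'))
        ]
        for state in states_rx.finditer(transformed)
    }

result = find_states_and_cities()
print(result)

# timeit calls the function repeatedly and returns the total seconds taken.
elapsed = timeit.timeit(find_states_and_cities, number=1000)
print(elapsed)
```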

You [c/sh]ould change
fhandle = open("university_towns.txt")
ftext = fhandle.read().split("\n")
# to
with open("university_towns.txt","r") as f:
    d = f.readlines()
# file is autoclosed here, lines are autosplit by readlines()
No regex solution:
def save(state,city,dic):
    '''convenience function: append city to an existing state's list,
    or create an empty list for a state not seen before'''
    if state in dic:
        dic[state].append(city)
    else:
        dic[state] = [] # fix for glitch
dic = {}
state = ""
with open("university_towns.txt","r") as f:
    d = f.readlines()
for n in d: # iterate all lines
    if "[edit]" in n: # handles states
        state = n.replace("[edit]","").strip() # clean up state
        # needed in case 2 states w/o cities follow right after each other
        save(state,"", dic) # create state in dic, no cities
    else:
        # splits at ( takes first and splits at [ takes first removes blanks
        # => get city name before ( or [
        city = n.split("(")[0].split("[")[0].strip()
        save(state,city,dic) # adds city to state in dic
print (dic)
Yields (re-formatted):
{
'Alabama' : ['Auburn', 'Florence', 'Jacksonville', 'Livingston',
'Montevallo', 'Troy', 'Tuscaloosa', 'Tuskegee'],
'Alaska' : ['Fairbanks'],
'Arizona' : ['Flagstaff', 'Tempe', 'Tucson'],
'Arkansas': ['Arkadelphia', 'Conway', 'Fayetteville']
}
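The same no-regex idea can also be sketched as a function that works on any iterable of lines, using `str.partition` and `dict.setdefault`. The function name `towns_by_state` and the sample lines (including the made-up entry `'Elon[3]'`) are assumptions for illustration only.

```python
def towns_by_state(lines):
    """Group town names under the most recent '[edit]' state header."""
    marker = '[edit]'
    result = {}
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.endswith(marker):
            # Drop the marker as a whole suffix, keeping multi-word names.
            state = line[:-len(marker)].strip()
            result.setdefault(state, [])
        elif state is not None:
            # Keep the text before the first "(" and before the first "[".
            town = line.partition('(')[0].partition('[')[0].strip()
            result[state].append(town)
    return result

# Hypothetical sample input.
sample = [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'North Carolina[edit]',
    'Chapel Hill (University of North Carolina)[2]',
    'Elon[3]',
]
towns = towns_by_state(sample)
print(towns)
```

Because the whole line is inspected rather than its first word, state names such as "North Carolina" are kept intact.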

Let's solve your problem step by step:
First step:
Collect all the data, inserting the marker word 'pos_flag' before every state name. With the help of this marker we can later track the states and chunk the list:
import re
pattern=r'\w+(?=\[edit\])'
track=[]
with open('mon.txt','r') as f:
    for line in f:
        match=re.search(pattern,line)
        if match:
            track.append('pos_flag')
            track.append(line.strip().split('[')[0])
        else:
            track.append(line.strip().split('(')[0])
This gives output like the following:
['pos_flag', 'Alabama', 'Auburn ', 'Florence ', 'Jacksonville ', 'Livingston ', 'Montevallo ', 'Troy ', 'Tuscaloosa ', 'Tuskegee ', 'pos_flag', 'Alaska', 'Fairbanks ', 'pos_flag', 'Arizona', 'Flagstaff ', 'Tempe ', 'Tucson ', 'pos_flag', 'Arkansas', 'Arkadelphia ', 'Conway ', 'Fayetteville ', 'Jonesboro ', 'Magnolia ', 'Monticello ', 'Russellville ', 'Searcy ', 'pos_flag',
As you can see, every state name is preceded by the word 'pos_flag'. Now let's use this marker:
Second step:
Record the index of every 'pos_flag' entry in the list:
index_no=[]
for index,value in enumerate(track):
    if value=='pos_flag':
        index_no.append(index)
This gives output like the following:
[0, 10, 13, 18, 28, 55, 66, 75, 79, 93, 111, 114, 119, 131, 146, 161, 169, 182, 192, 203, 215, 236, 258, 274, 281, 292, 297, 306, 310, 319, 331, 338, 371, 391, 395, 419, 432, 444, 489, 493, 506, 512, 527, 551, 559, 567, 581, 588, 599, 614]
We now have the index numbers and can chunk the list with them:
Last step:
Chunk the list using the index numbers, setting the state name as the dict key and the remaining entries as the dict value:
city_dict={}
for i in range(0,len(index_no),1):
    try:
        value_1=track[index_no[i:i + 2][0]:index_no[i:i + 2][1]]
        city_dict[value_1[1]]=value_1[2:]
    except IndexError:
        # last chunk: there is no following 'pos_flag', so slice to the end;
        # [2:] skips both the marker and the state name
        city_dict[track[index_no[i:i + 2][0]:][1]]=track[index_no[i:i + 2][0]:][2:]
print(city_dict)
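As a possible simplification of this chunking step, each marker index can be paired with the next one (or with the end of the list), which avoids the try/except. This is only a sketch on a shortened, made-up `track` list; the name `bounds` is an assumption.

```python
# Hypothetical shortened version of the track list built in the first step.
track = ['pos_flag', 'Alabama', 'Auburn', 'Florence',
         'pos_flag', 'Alaska', 'Fairbanks']

# Positions of every marker.
index_no = [i for i, value in enumerate(track) if value == 'pos_flag']

# End of each chunk: the next marker, or the end of the list for the last one.
bounds = index_no[1:] + [len(track)]

city_dict = {}
for start, end in zip(index_no, bounds):
    chunk = track[start:end]        # ['pos_flag', state, city, city, ...]
    city_dict[chunk[1]] = chunk[2:]

print(city_dict)
```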
Output:
Since dicts are not ordered in Python 3.5, the order of the output differs from the input file:
{'Kentucky': ['Bowling Green ', 'Columbia ', 'Georgetown ', 'Highland Heights ', 'Lexington ', 'Louisville ', 'Morehead ', 'Murray ', 'Richmond ', 'Williamsburg ', 'Wilmore '], 'Mississippi': ['Cleveland ', 'Hattiesburg ', 'Itta Bena ', 'Oxford ', 'Starkville '], 'Wisconsin': ['Appleton ', 'Eau Claire ', 'Green Bay ', 'La Crosse ', 'Madison ', 'Menomonie ', 'Milwaukee ',
Full code:
import re
pattern=r'\w+(?=\[edit\])'
track=[]
with open('mon.txt','r') as f:
    for line in f:
        match=re.search(pattern,line)
        if match:
            track.append('pos_flag')
            track.append(line.strip().split('[')[0])
        else:
            track.append(line.strip().split('(')[0])
index_no=[]
for index,value in enumerate(track):
    if value=='pos_flag':
        index_no.append(index)
city_dict={}
for i in range(0,len(index_no),1):
    try:
        value_1=track[index_no[i:i + 2][0]:index_no[i:i + 2][1]]
        city_dict[value_1[1]]=value_1[2:]
    except IndexError:
        # last chunk: [2:] skips both the marker and the state name
        city_dict[track[index_no[i:i + 2][0]:][1]]=track[index_no[i:i + 2][0]:][2:]
print(city_dict)
Second solution:
If you want to use regex then try this small solution :
import re
pattern=r'((\w+\[edit\])(?:(?!^\w+\[edit\]).)*)'
with open('file.txt','r') as f:
    prt=re.finditer(pattern,f.read(),re.DOTALL | re.MULTILINE)
    for line in prt:
        dict_p={}
        match = []
        match.append(line.group(1))
        dict_p[match[0].split('\n')[0].strip().split('[')[0]]= [i.split('(')[0].strip() for i in match[0].split('\n')[1:][:-1]]
        print(dict_p)
it will give:
{'Alabama': ['Auburn', 'Florence', 'Jacksonville', 'Livingston', 'Montevallo', 'Troy', 'Tuscaloosa', 'Tuskegee']}
{'Alaska': ['Fairbanks']}
{'Arizona': ['Flagstaff', 'Tempe', 'Tucson']}
{'Arkansas': ['Arkadelphia', 'Conway', 'Fayetteville', 'Jonesboro', 'Magnolia', 'Monticello', 'Russellville', 'Searcy']}
{'California': ['Angwin', 'Arcata', 'Berkeley', 'Chico', 'Claremont', 'Cotati', 'Davis', 'Irvine', 'Isla Vista', 'University Park, Los Angeles', 'Merced', 'Orange', 'Palo Alto', 'Pomona', 'Redlands', 'Riverside', 'Sacramento', 'University District, San Bernardino', 'San Diego', 'San Luis Obispo', 'Santa Barbara', 'Santa Cruz', 'Turlock', 'Westwood, Los Angeles', 'Whittier']}
{'Colorado': ['Alamosa', 'Boulder', 'Durango', 'Fort Collins', 'Golden', 'Grand Junction', 'Greeley', 'Gunnison', 'Pueblo, Colorado']}

I tried to eliminate the need for more than one regex.
import re
def mkdict(data):
    state, result = None, {}
    # group 1: state name before "[edit]"; group 2: city name up to "(", ":" or end of line
    rx = re.compile(r'^(?:(.+?)\[edit\]|([^(\n:]+))', re.M)
    for m in rx.finditer(data):
        if m.group(1):
            # the marker is left out of the group, so no stripping is needed
            state = m.group(1)
            result[state] = []
        else:
            result[state].append(m.group(2).rstrip())
    return result
if __name__ == '__main__':
    import sys, timeit, functools
    data = sys.stdin.read()
    print(timeit.Timer(functools.partial(mkdict, data)).timeit(10**3))
    print(mkdict(data))
Try it online.
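One thing to watch for when removing the "[edit]" marker: `str.rstrip` treats its argument as a set of characters, not as a suffix, so calling `rstrip('[edit]')` on a state name can eat trailing letters of the name itself. A short sketch (the variable names are assumptions for illustration):

```python
name = 'Connecticut[edit]'

# rstrip removes any trailing run of the characters [, e, d, i, t, ]
# so the final "t" of the state name is removed as well.
stripped_by_chars = name.rstrip('[edit]')

# Removing the marker as a whole suffix keeps the name intact.
suffix = '[edit]'
stripped_suffix = name[:-len(suffix)] if name.endswith(suffix) else name

print(stripped_by_chars)
print(stripped_suffix)
```

Names such as "Alabama" happen to be unaffected because their last letter is not one of those characters, which makes the problem easy to miss on a short sample. On Python 3.9 and later, `str.removesuffix` does the suffix removal directly.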

Related

How to create a nested dictionary from a text file

So, my file looks like this :
Intestinal infectious diseases (001-003)
001 Cholera
002 Typhoid and paratyphoid fevers
003 Other salmonella infections
Tuberculosis (004-006)
004 Primary tuberculous infection
005 Pulmonary tuberculosis
006 Other respiratory tuberculosis
.
.
.
I'm supposed to make a nested dictionary with the disease group as keys and the dictionary containing the disease code and name, as value for the first dictionary. I'm having some trouble separating the disease codes into their own disease groups. Here's what I've done so far:
import json
icd9_encyclopedia={}
lines = []
f = open("icd9_info.txt", 'r')
for line in f:
    line = line.rstrip("\n")
    if line[0].isnumeric() == True:
        icd9_encyclopedia[line] = ???
f.close()
solution
import itertools
from pathlib import Path
# load text lines
lines = Path('data.txt').read_text().split('\n')
# build output dictionary
icd9_encyclopedia = {
    # build single group dictionary
    group_name: {
        int(code): disease_name
        # split each disease line into code and text name
        for disease_string in disease_strings
        for (code, _, disease_name) in [disease_string.partition(' ')]
    }
    # get groups separated by an empty line
    # isolate first item in each group as its name
    for x, (group_name, *disease_strings) in itertools.groupby(lines, bool) if x
}
result
{'Intestinal infectious diseases (001-003)': {1: 'Cholera',
2: 'Typhoid and paratyphoid '
'fevers',
3: 'Other salmonella infections'},
'Tuberculosis (004-006)': {4: 'Primary tuberculous infection',
5: 'Pulmonary tuberculosis',
6: 'Other respiratory tuberculosis'}}
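To make the grouping step easier to follow, here is a small sketch of what `itertools.groupby(lines, bool)` does on its own. It assumes, as that solution does, that the groups in the file are separated by an empty line; the sample `lines` list is made up for the example.

```python
import itertools

# Hypothetical input: two groups separated by one empty string.
lines = [
    'Intestinal infectious diseases (001-003)',
    '001 Cholera',
    '002 Typhoid and paratyphoid fevers',
    '',
    'Tuberculosis (004-006)',
    '004 Primary tuberculous infection',
]

# bool('') is False and bool of any non-empty string is True, so groupby
# yields alternating runs of non-empty lines and empty lines.
groups = [list(group)
          for non_empty, group in itertools.groupby(lines, bool)
          if non_empty]

# partition splits a disease line at the first space only.
first_code, _, first_name = groups[0][1].partition(' ')
print(groups)
print(first_code, first_name)
```

If the real file has no blank lines between groups, everything ends up in a single run and a line-by-line approach is needed instead.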
Here's another take on the problem that uses just basic Python:
from pprint import pprint
icd9_encyclopedia={}
key = None
item = {}
with open("icd9_info.txt") as f:
    for line in f:
        line = line.strip()
        if not line:
            # Skip blank lines, which would otherwise break line[0]
            continue
        if not line[0].isdigit():
            # Start a new item
            if key:
                # Store the prior item in the main dictionary
                icd9_encyclopedia[key] = item
            # Initialize the new item
            key = line
            item = {}
        else:
            # A detail entry - add it to the current item
            num, rest = line.split(' ', 1)
            item[num] = rest
# Store the final item to the dictionary
if key:
    icd9_encyclopedia[key] = item
pprint(icd9_encyclopedia)
Result:
{'Intestinal infectious diseases (001-003)': {'001': 'Cholera',
'002': 'Typhoid and paratyphoid '
'fevers',
'003': 'Other salmonella '
'infections'},
'Tuberculosis (004-006)': {'004': 'Primary tuberculous infection',
'005': 'Pulmonary tuberculosis',
'006': 'Other respiratory tuberculosis'}}
I used defaultdict to easily make a nested dictionary, as follows:
from collections import defaultdict
icd9_encyclopedia = defaultdict(dict)
disease_group = ""
with open("icd9_info.txt", 'r') as f:
    for line in f:
        line = line.rstrip('\n') # remove the trailing '\n', if there is one
        if line == "": # skip if blank line
            continue
        if not line[0].isdigit():
            disease_group = line # temporarily save current disease group name for the following lines
        else:
            code, name = line.split(maxsplit=1)
            icd9_encyclopedia[disease_group][code] = name
for key, value in icd9_encyclopedia.items():
    print(key, value)
#Intestinal infectious diseases (001-003) {'001': 'Cholera', '002': 'Typhoid and paratyphoid fevers', '003': 'Other salmonella infections'}
#Tuberculosis (004-006) {'004': 'Primary tuberculous infection', '005': 'Pulmonary tuberculosis', '006': 'Other respiratory tuberculosis'}
You can see more detail about defaultdict here: https://www.geeksforgeeks.org/defaultdict-in-python/
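As a quick illustration of why `defaultdict(dict)` is convenient here, the sketch below shows that the inner dictionary is created automatically the first time a group name is used. The variable names are assumptions for the example.

```python
from collections import defaultdict

nested = defaultdict(dict)

# No need to create nested['Tuberculosis (004-006)'] first: the missing
# key is filled in with an empty dict on first access.
nested['Tuberculosis (004-006)']['004'] = 'Primary tuberculous infection'
nested['Tuberculosis (004-006)']['005'] = 'Pulmonary tuberculosis'

# Convert to a plain dict, for example before printing or serializing.
plain = dict(nested)
print(plain)

# Caveat: merely reading a missing key also creates it.
size_before = len(nested)
_ = nested['not a real group']
size_after = len(nested)
print(size_before, size_after)
```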
validInt checks whether the data is a valid integer:
def validInt(data):
    try:
        int(data)
    except Exception as e:
        return False
    return True
encyclo = {}
with open("file.data",'r') as f:
    lines = f.readlines()
for line in lines:
    if len(line.strip()) == 0:#line should not be empty
        continue
    first = line.split(' ')[0]
    if validInt(first):
        di = encyclo[list(encyclo.keys())[-1]] # returns a dictionary
        di[first] = line[len(first):] # inserting data to dictionary len(first) is used to skip the numeric part
    else:
        encyclo[line] = {}
for key, value in encyclo.items():#displaying data
    print(key, value)
$ python3 test.py
Intestinal infectious diseases (001-003)
{'001': ' Cholera\n', '002': ' Typhoid and paratyphoid fevers\n', '003': ' Other salmonella infections\n'}
Tuberculosis (004-006)
{'004': ' Primary tuberculous infection\n', '005': ' Pulmonary tuberculosis\n', '006': ' Other respiratory tuberculosis\n'}

How to set contents of a file that don't start with "\t" as keys, and those who start with "\t" and end with "\n" as values to the key before them?

I want to make a dictionary that looks like this: { 'The Dorms': {'Public Policy' : 50, 'Physics Building' : 100, 'The Commons' : 120}, ...}
This is the list :
['The Dorms\n', '\tPublic Policy, 50\n', '\tPhysics Building, 100\n', '\tThe Commons, 120\n', 'Public Policy\n', '\tPhysics Building, 50\n', '\tThe Commons, 60\n', 'Physics Building\n', '\tThe Commons, 30\n', '\tThe Quad, 70\n', 'The Commons\n', '\tThe Quad, 15\n', '\tBiology Building, 20\n', 'The Quad\n', '\tBiology Building, 35\n', '\tMath Psych Building, 50\n', 'Biology Building\n', '\tMath Psych Building, 75\n', '\tUniversity Center, 125\n', 'Math Psych Building\n', '\tThe Stairs by Sherman, 50\n', '\tUniversity Center, 35\n', 'University Center\n', '\tEngineering Building, 75\n', '\tThe Stairs by Sherman, 25\n', 'Engineering Building\n', '\tITE, 30\n', 'The Stairs by Sherman\n', '\tITE, 50\n', 'ITE']
This is my code:
def load_map(map_file_name):
    # map_list = []
    map_dict = {}
    map_file = open(map_file_name, "r")
    map_list = map_file.readlines()
    for map in map_file:
        map_content = map.strip("\n").split(",")
        map_list.append(map_content)
    for map in map_list:
        map_dict[map[0]] = map[1:]
    print(map_dict)
if __name__ == "__main__":
    map_file_name = input("What is the map file? ")
    load_map(map_file_name)
Since your file's content is apparently literal Python data, you should use ast.literal_eval to parse it, not some ad-hoc method.
Then you can just loop over your values and process them:
import ast
def load_map(mapfile):
    with open(mapfile, encoding='utf-8') as f:
        data = ast.literal_eval(f.read())
    m = {}
    current_section = None
    for item in data:
        if not item.startswith('\t'):
            current_section = m[item.strip()] = {}
        else:
            k, v = item.split(',')
            current_section[k.strip()] = int(v.strip())
    print(m)
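To try the idea without a file on disk, the sketch below parses a shortened copy of the list from a string. It assumes, as above, that the file really contains a Python list literal; the names `raw_text` and `build_map` are made up for the example, and `rsplit(',', 1)` is a small variation that splits only at the last comma.

```python
import ast

# Shortened, hypothetical file content: a Python list literal as text.
# The raw string keeps \n and \t as escape sequences for literal_eval.
raw_text = r"['The Dorms\n', '\tPublic Policy, 50\n', '\tThe Commons, 120\n', 'ITE']"

def build_map(text):
    """Turn the list literal into {place: {neighbour: distance}}."""
    data = ast.literal_eval(text)
    result = {}
    current_section = None
    for item in data:
        if not item.startswith('\t'):
            # A line without a leading tab starts a new section.
            current_section = result[item.strip()] = {}
        else:
            # Split at the last comma so a comma in a name would not break it.
            name, distance = item.rsplit(',', 1)
            current_section[name.strip()] = int(distance.strip())
    return result

campus_map = build_map(raw_text)
print(campus_map)
```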

Python - iterating over a list and passing each element as a parameter to a function

I've been going through the DataQuest "Introduction to Python" program, and in the "Modules" section of the course, there's a section on defining functions. I got through that section but then decided I wanted to try to extend it. The original code, which works, is:
#!/usr/bin/env python3
import csv
f = open("nfl.csv", 'r')
nfl = list(csv.reader(f))
# Define your function here.
def nfl_wins(team):
    count = 0
    for row in nfl:
        if row[2] == team:
            count += 1
    return count
cowboys_wins = nfl_wins("Dallas Cowboys")
falcons_wins = nfl_wins("Atlanta Falcons")
print(cowboys_wins)
print(falcons_wins)
My code, which is intended to iterate over a list containing all the teams and give the number of wins for each team, is as follows:
#!/usr/bin/env python3
import csv
f = open('nfl.csv', 'r')
nfl = list(csv.reader(f))
# Define your function here.
def nfl_wins(team):
    print(team)
    count = 0
    for row in nfl:
        if row[2] == team:
            count = count + 1
    return count
nfl_teams = ['Denver Broncos', 'Detroit Lions', 'Green Bay Packers', 'Houston Texans', 'Indianapolis Colts', 'Jacksonville Jaguars', 'Kansas City Chiefs', 'Miami Dolphins', 'Minnesota Vikings', 'New England Patriots', 'New Orleans Saints', 'New York Giants', 'New York Jets', 'Oakland Raiders', 'Philadelphia Eagles', 'Pittsburgh Steelers', 'San Diego Chargers', 'San Francisco 49ers', 'Seattle Seahawks', 'St. Louis Rams', 'Tampa Bay Buccaneers', 'Tennessee Titans', 'Washington Redskins']
for squad in nfl_teams:
    print(squad, nfl_wins(squad))
When I just iterate over the "nfl_teams" list and print the names, it works fine. When I try to pass the team names to the "nfl_wins" function, I get the following error:
$ python3 nfl_wins.py
Denver Broncos
Traceback (most recent call last):
File "nfl_wins.py", line 20, in <module>
print(squad, nfl_wins(squad))
File "nfl_wins.py", line 13, in nfl_wins
if row[2] == team:
IndexError: list index out of range
My environment is Python 3.4.3 on Cygwin running under Windows 7 Enterprise.
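The CSV itself is not shown, so the cause of the IndexError can only be guessed at. One common cause, assumed here, is a blank or short row in the file: `csv.reader` yields an empty list for a blank line, so `row[2]` fails on it. The sketch below uses made-up sample data (the column layout is an assumption) and a hypothetical helper `count_wins` that skips rows too short to have a third column.

```python
import csv
import io

# Hypothetical CSV content with a blank line in the middle.
sample = (
    "2009,1,Dallas Cowboys,Tampa Bay Buccaneers\n"
    "\n"
    "2009,2,Dallas Cowboys,New York Giants\n"
)

rows = list(csv.reader(io.StringIO(sample)))
print(rows[1])  # the blank line becomes an empty list

def count_wins(rows, team):
    """Count rows whose third column equals team, ignoring short rows."""
    return sum(1 for row in rows if len(row) > 2 and row[2] == team)

cowboys = count_wins(rows, "Dallas Cowboys")
print(cowboys)
```

Printing the offending row (or its index) inside the original loop would confirm whether this is what happens with the real file.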

I am trying to import a set of lists in python and call a random item from one of the lists

#Set of lists I want to import into my python program called "setlist.txt"
---------------------------------------------------------------------------------------------
Tripolee = ('Saeed Younan', 'Matrixxman', 'Pete Tong', 'Dubfire', 'John Digweed', 'Carl Cox')
Ranch = ('Dabin', 'Galantis', 'Borgeous', 'Shpongle', 'ODESZA', 'Kaskade')
Sherwood = ('Nadus', 'Mr. Carmack', 'Wave Racer', 'Lido', 'Goldlink', 'Four Tet', 'Flume')
Jubilee = ('Chaz French', 'MartyParty', 'Sango', 'Brodinski', 'Phutureprimitive', 'EOTO')
The Hangar = ('Vourteque', 'The Gentlemen Callers', 'Bart&Baker', 'Jaga Jazzist', 'JPOD')
Forest = ('Vibe Street', 'Lafa Taylor', 'Vaski', 'Little People', 'jackLNDN', 'MartyParty')
---------------------------------------------------------------------------------------------
#program
from sys import exit
from random import randint
from sys import argv
script, setlist = argv
setlist = open(setlist)
print "Here is the setlist for day 1"
print setlist.read()
print "%r is playing on the Tripolee stage" % random.choice(setlist.readline(2))
I have a bunch more code in between all this that I'm not putting up here, but basically that last line is what I'm having trouble with.
Probably not the best format for your file but you can split and use ast.literal_eval:
from ast import literal_eval
with open("in.txt") as f:
    choices = [literal_eval(line.split(" = ")[-1]) for line in f]
Which will give you a list of tuples which you can pass to random.choice:
[('Saeed Younan', 'Matrixxman', 'Pete Tong', 'Dubfire', 'John Digweed', 'Carl Cox'), ('Dabin', 'Galantis', 'Borgeous', 'Shpongle', 'ODESZA', 'Kaskade'), ('Nadus', 'Mr. Carmack', 'Wave Racer', 'Lido', 'Goldlink', 'Four Tet', 'Flume'), ('Chaz French', 'MartyParty', 'Sango', 'Brodinski', 'Phutureprimitive', 'EOTO'), ('Vourteque', 'The Gentlemen Callers', 'Bart&Baker', 'Jaga Jazzist', 'JPOD'), ('Vibe Street', 'Lafa Taylor', 'Vaski', 'Little People', 'jackLNDN', 'MartyParty')]
I have no idea where setlist is supposed to come from; your file looks like a series of tuple assignments. setlist.readline(2) would read 2 characters, or in your case nothing, as you have already exhausted the file by calling read.
I would suggest that, after extracting the data using literal_eval, you put your file in a more usable format, maybe by creating a dict using the name as the key and dumping the dict.
from ast import literal_eval
with open("in.txt") as f:
    choices = {}
    for line in f:
        ven, tpl = line.split(" = ")
        choices[ven] = literal_eval(tpl)
print(choices)
Output:
{'Jubilee': ('Chaz French', 'MartyParty', 'Sango', 'Brodinski', 'Phutureprimitive', 'EOTO'), 'Tripolee': ('Saeed Younan', 'Matrixxman', 'Pete Tong', 'Dubfire', 'John Digweed', 'Carl Cox'), 'The Hangar': ('Vourteque', 'The Gentlemen Callers', 'Bart&Baker', 'Jaga Jazzist', 'JPOD'), 'Ranch': ('Dabin', 'Galantis', 'Borgeous', 'Shpongle', 'ODESZA', 'Kaskade'), 'Sherwood': ('Nadus', 'Mr. Carmack', 'Wave Racer', 'Lido', 'Goldlink', 'Four Tet', 'Flume'), 'Forest': ('Vibe Street', 'Lafa Taylor', 'Vaski', 'Little People', 'jackLNDN', 'MartyParty')}
You can persist the dict using json.dump or the pickle module, so your data will be in a format that is a lot easier to load each time.
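A minimal sketch of the JSON route, using an in-memory string instead of a file so it runs anywhere (with a real file you would use `json.dump` and `json.load` on an open file object). The shortened `choices` dict is made up for the example. Note that JSON has no tuple type, so the tuples come back as lists.

```python
import json
import random

# Shortened, hypothetical version of the choices dict.
choices = {
    'Ranch': ('Dabin', 'Galantis', 'Borgeous'),
    'The Hangar': ('Vourteque', 'JPOD'),
}

serialized = json.dumps(choices)   # dict -> JSON text
restored = json.loads(serialized)  # JSON text -> dict (tuples become lists)

# random.choice works on lists just as it does on tuples.
pick = random.choice(restored['Ranch'])
print(restored)
print(pick)
```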
To make it a little clearer, this is what I take the content of your .txt file to be:
---------------------------------------------------------------------------------------------
Tripolee = ('Saeed Younan', 'Matrixxman', 'Pete Tong', 'Dubfire', 'John Digweed', 'Carl Cox')
Ranch = ('Dabin', 'Galantis', 'Borgeous', 'Shpongle', 'ODESZA', 'Kaskade')
Sherwood = ('Nadus', 'Mr. Carmack', 'Wave Racer', 'Lido', 'Goldlink', 'Four Tet', 'Flume')
Jubilee = ('Chaz French', 'MartyParty', 'Sango', 'Brodinski', 'Phutureprimitive', 'EOTO')
The Hangar = ('Vourteque', 'The Gentlemen Callers', 'Bart&Baker', 'Jaga Jazzist', 'JPOD')
Forest = ('Vibe Street', 'Lafa Taylor', 'Vaski', 'Little People', 'jackLNDN', 'MartyParty')
To print the venue and set list you can use dict.items:
for ven, set_l in choices.items():
    print("Set list for {}: {}".format(ven, ", ".join(set_l)))
Output:
Set list for Jubilee: Chaz French, MartyParty, Sango, Brodinski, Phutureprimitive, EOTO
Set list for Tripolee: Saeed Younan, Matrixxman, Pete Tong, Dubfire, John Digweed, Carl Cox
Set list for The Hangar: Vourteque, The Gentlemen Callers, Bart&Baker, Jaga Jazzist, JPOD
Set list for Ranch: Dabin, Galantis, Borgeous, Shpongle, ODESZA, Kaskade
Set list for Sherwood: Nadus, Mr. Carmack, Wave Racer, Lido, Goldlink, Four Tet, Flume
Set list for Forest: Vibe Street, Lafa Taylor, Vaski, Little People, jackLNDN, MartyParty
When you open the file and call read, you have all of the file's content stored as a string, which you then print. Next you try random.choice(setlist.readline(2)): readline(2) tries to read at most two characters, but it cannot read anything because the file pointer is already at the end of the file after the call to read, so you get an empty string.
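The file-pointer behaviour can be seen with an in-memory file object; `io.StringIO` is used here as a stand-in for the real file, and the two-line content is made up for the example.

```python
import io

# Stand-in for the opened setlist file.
setlist = io.StringIO("Tripolee = ('Saeed Younan', 'Matrixxman')\n"
                      "Ranch = ('Dabin', 'Galantis')\n")

everything = setlist.read()         # consumes the whole file
after_read = setlist.readline(2)    # nothing left to read

setlist.seek(0)                     # move the pointer back to the start
after_seek = setlist.readline(2)    # at most two characters of line one

print(repr(after_read))
print(repr(after_seek))
```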
If you want to get a random string from the first tuple:
choices = [literal_eval(line.split(" = ")[-1]) for line in f]
from random import choice
print(choice(choices[0]))

Python: Parse a list of strings into a dictionary

This is somewhat complicated. I have a list that looks like this:
['19841018 ID1\n', ' Plunging oil... \n', 'cut in the price \n', '\n', '19841018 ID2\n', ' The U.S. dollar... \n', 'the foreign-exchange markets \n', 'late New York trading \n', '\n']
In my list, the '\n' is what separates one story from the next. What I would like to do is create a dictionary from the above list that looks like this:
dict = {ID1: [19841018, 'Plunging oil... cut in the price'], ID2: [19841018, 'The U.S. dollar... the foreign-exchange markets']}
You can see that the key of my dictionary is the ID and the values are the date and the combined text of the story. Is that doable?
My IDs are in this format: J00100394, J00384932. So they all start with J00.
The tricky part is splitting your list on a given value, so I've taken that part from here. Then I've parsed the resulting sublists to build the res dict:
>>> import itertools
>>> def isplit(iterable,splitters):
...     return [list(g) for k,g in itertools.groupby(iterable,lambda x:x in splitters) if not k]
...
>>> l = ['19841018 ID1\n', ' Plunging oil... \n', 'cut in the price \n', '\n', '19841018 ID2\n', ' The U.S. dollar... \n', 'the foreign-exchange markets \n', 'late New York trading \n', '\n']
>>> res = {}
>>> for sublist in isplit(l,('\n',)):
...     id_parts = sublist[0].split()
...     story = ' '.join (sentence.strip() for sentence in sublist[1:])
...     res[id_parts[1].strip()] = [id_parts[0].strip(), story]
...
>>> res
{'ID2': ['19841018', 'The U.S. dollar... the foreign-exchange markets late New York trading'], 'ID1': ['19841018', 'Plunging oil... cut in the price']}
I wrote an answer that uses a generator. The idea is that every time a new ID token starts, the generator yields the entry computed so far. You can customize it by changing check_fun() and the way the parts of the description are combined.
def trailing_carriage(s):
    if s.endswith('\n'):
        return s[:-1]
    return s
def check_fun(s):
    """
    :param s: Take a string s
    :return: None if s doesn't match the ID rules. Otherwise return the
    name,value of the token
    """
    if ' ' in s:
        id_candidate,name = s.split(" ",1)
        try:
            return trailing_carriage(name),int(id_candidate)
        except ValueError:
            pass
def parser_list(list, check_id_prefix=check_fun):
    name = None #key dict
    id_val = None
    desc = "" #description string
    for token in list:
        check = check_id_prefix(token)
        if check is not None:
            if name is not None:
                # Return the previously computed entry
                yield name,id_val,desc
            name,id_val = check
            desc = "" # start a fresh description for the new entry
        else:
            # Append the description
            desc += trailing_carriage(token)
    if name is not None:
        # Flush the last entry
        yield name,id_val,desc
>>> list = ['19841018 ID1\n', ' Plunging oil... \n', 'cut in the price \n', '\n', '19841018 ID2\n', ' The U.S. dollar... \n', 'the foreign-exchange markets \n', 'late New York trading \n', '\n']
>>> print({k:[i,d] for k,i,d in parser_list(list)})
{'ID1': [19841018, ' Plunging oil... cut in the price '], 'ID2': [19841018, ' The U.S. dollar... the foreign-exchange markets late New York trading ']}
