How to fix "string indices must be integers" - python

I am working with JSON files from Foursquare, and I keep getting the error "string indices must be integers".
This is my dataset, county_merge:
county density lat lng
0 Alameda 2532.292000 37.609029 -121.899142
1 Alpine 30.366667 38.589393 -119.834501
2 Amador 218.413333 38.449089 -120.591102
3 Butte 329.012500 39.651927 -121.585844
4 Calaveras 214.626316 38.255818 -120.498149
5 Colusa 393.388889 39.146558 -122.220956
6 Contra Costa 1526.334000 37.903481 -121.917535
7 Del Norte 328.485714 41.726177 -123.913280
8 El Dorado 444.043750 38.757414 -120.527613
9 Fresno 654.509259 36.729529 -119.708861
10 Glenn 477.985714 39.591277 -122.377866
11 Humboldt 392.427083 40.599742 -123.899773
12 Imperial 796.919048 33.030549 -115.359567
13 Inyo 127.561905 36.559533 -117.407471
14 Kern 608.326471 35.314570 -118.753822
15 Kings 883.560000 36.078481 -119.795634
16 Lake 608.338462 39.050541 -122.777656
17 Lassen 179.664706 40.768558 -120.730998
18 Los Angeles 2881.756000 34.053683 -118.242767
19 Madera 486.887500 37.171626 -119.773799
20 Marin 1366.937143 38.040914 -122.619964
21 Mariposa 48.263636 37.570148 -119.903659
22 Mendocino 198.010345 39.317649 -123.412640
23 Merced 1003.309091 37.302957 -120.484327
24 Modoc 100.856250 41.545049 -120.743600
25 Mono 133.145455 37.953393 -118.939876
26 Monterey 946.090323 36.600256 -121.894639
27 Napa 592.020000 38.297137 -122.285529
28 Nevada 338.892857 39.354033 -120.808984
29 Orange 1992.962500 33.750038 -117.870493
30 Placer 492.564000 39.101206 -120.765061
31 Plumas 87.817778 39.943099 -120.805952
32 Riverside 976.692105 33.953355 -117.396162
33 Sacramento 1369.729032 38.581572 -121.494400
34 San Benito 577.637500 36.624809 -121.117738
35 San Bernardino 612.176636 34.108345 -117.289765
36 San Diego 1281.848649 32.717421 -117.162771
37 San Francisco 7279.000000 37.779281 -122.419236
38 San Joaquin 1282.122222 37.937290 -121.277372
39 San Luis Obispo 627.285185 35.282753 -120.659616
40 San Mateo 1594.372973 37.496904 -122.333057
41 Santa Barbara 1133.525806 34.422132 -119.702667
42 Santa Clara 2090.724000 37.354113 -121.955174
43 Santa Cruz 1118.844444 36.974942 -122.028526
44 Shasta 180.137931 40.796512 -121.997919
45 Sierra 115.681818 39.584907 -120.530573
46 Siskiyou 202.170000 41.500722 -122.544354
47 Solano 871.818182 38.221894 -121.916355
48 Sonoma 926.674286 38.511080 -122.847339
49 Stanislaus 1181.864000 37.550087 -121.050143
50 Sutter 552.355556 38.950967 -121.697088
51 Tehama 206.862500 40.125133 -122.201553
52 Trinity 63.056250 40.605326 -123.171268
53 Tulare 681.425806 36.251647 -118.852583
54 Tuolumne 349.471429 38.056944 -119.991935
55 Ventura 1465.400000 34.343649 -119.295171
56 Yolo 958.890909 38.718454 -121.905900
{'meta': {'code': 200, 'requestId': '5cab80f04c1f6715df4e698d'},
'response': {'venues': [{'id': '4b9bf2abf964a520573936e3',
'name': 'Bishop Ranch Veterinary Center & Urgent Care',
'location': {'address': '2000 Bishop Dr',
'lat': 37.77129467449237,
'lng': -121.97112176203284,
'labeledLatLngs': [{'label': 'display',
'lat': 37.77129467449237,
'lng': -121.97112176203284}],
'distance': 19143,
'postalCode': '94583',
'cc': 'US',
'city': 'San Ramon',
'state': 'CA',
'country': 'United States',
'formattedAddress': ['2000 Bishop Dr',
'San Ramon, CA 94583',
'United States']},
'categories': [{'id': '4d954af4a243a5684765b473',
'name': 'Veterinarian',
'pluralName': 'Veterinarians',
'shortName': 'Veterinarians',
'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/building/medical_veterinarian_',
'suffix': '.png'},
'primary': True}],
'venuePage': {'id': '463205329'},
'referralId': 'v-1554743537',
'hasPerk': False},
{'id': '4b9acbfef964a5209dd635e3',
'name': 'San Francisco SPCA Veterinary Hospital',
'location': {'address': '201 Alabama St',
'crossStreet': 'at 16th St.',
'lat': 37.766633450405465,
'lng': -122.41214303998395,
'labeledLatLngs': [{'label': 'display',
'lat': 37.766633450405465,
'lng': -122.41214303998395}],
'distance': 48477,
'postalCode': '94103',
'cc': 'US',
'city': 'San Francisco',
'state': 'CA',
'country': 'United States',
'formattedAddress': ['201 Alabama St (at 16th St.)',
'San Francisco, CA 94103',
'United States']},
'categories': [{'id': '4d954af4a243a5684765b473',
'name': 'Veterinarian',
'pluralName': 'Veterinarians',
'shortName': 'Veterinarians',
'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/building/medical_veterinarian_',
'suffix': '.png'},
'primary': True}],
'referralId': 'v-1554743537',
'hasPerk': False},
{'id': '4b00d8ecf964a5204d4122e3',
'name': 'Pleasanton Veterinary Hospital',
'location': {'address': '3059B Hopyard Rd Ste B',
'lat': 37.67658,
'lng': -121.89778,
'labeledLatLngs': [{'label': 'display',
'lat': 37.67658,
'lng': -121.89778}],
'distance': 7520,
'postalCode': '94588',
'cc': 'US',
'city': 'Pleasanton',
'state': 'CA',
'country': 'United States',
'formattedAddress': ['3059B Hopyard Rd Ste B',
'Pleasanton, CA 94588',
'United States']},
'categories': [{'id': '4d954af4a243a5684765b473',
'name': 'Veterinarian',
'pluralName': 'Veterinarians',
'shortName': 'Veterinarians',
'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/building/medical_veterinarian_',
'suffix': '.png'},
'primary': True}],
'referralId': 'v-1554743537',
'hasPerk': False},
This is the JSON file I am working with. I am trying to extract the name, latitude, longitude, and city for each venue.
results = requests.get(url).json()
results

names = county_merge['county']
la = county_merge['lat']
ln = county_merge['lng']

venues_list = []
venues_list.append([(
    names,
    la,
    ln,
    v['response']['venues'][0]['name'],
    v['response']['venues'][0]['location']['lat'],
    v['response']['venues'][0]['location']['lng'],
    v['response']['venues'][0]['location']['city']) for v in results])
I am expecting this to give me several lines of lists:
[County name, 38.xxxx, -120.xxxx, XXX veterinary clinic, 38.xxxx, -120.xxxx, san diego]
[County name, 38.xxxx, -120.xxxx, XXX veterinary clinic, 38.xxxx, -120.xxxx, san diego]
[County name, 38.xxxx, -120.xxxx, XXX veterinary clinic, 38.xxxx, -120.xxxx, san diego]
[County name, 38.xxxx, -120.xxxx, XXX veterinary clinic, 38.xxxx, -120.xxxx, san diego]
[County name, 38.xxxx, -120.xxxx, XXX veterinary clinic, 38.xxxx, -120.xxxx, san diego]
...
But it only gives me an error and frustration.
TypeError Traceback (most recent call last)
<ipython-input-44-321b1c667727> in <module>
11 v['response']['venues'][0]['location']['lat'],
12 v['response']['venues'][0]['location']['lng'],
---> 13 v['response']['venues'][0]['location']['city']) for v in results])
<ipython-input-44-321b1c667727> in <listcomp>(.0)
11 v['response']['venues'][0]['location']['lat'],
12 v['response']['venues'][0]['location']['lng'],
---> 13 v['response']['venues'][0]['location']['city']) for v in results])
TypeError: string indices must be integers
Do you have any idea how to fix this code?

[v for v in results]
gives you
['meta', 'response']
Iterating over a dict yields its keys, so each v here is a string; v['response'] then tries to index a string with another string, which is exactly what raises "TypeError: string indices must be integers". I think you want:
venues_list.append([(
    names,
    la,
    ln,
    v['name'],
    v['location']['lat'],
    v['location']['lng'],
    v['location']['city']) for v in results['response']['venues']])
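Note also that names, la and ln are entire columns of county_merge, so each tuple above would carry whole Series objects rather than scalars. If the goal is one row per county, here is a minimal sketch of a per-county loop. It assumes a hypothetical build_url() helper that rebuilds the Foursquare request URL from each county's coordinates, since the question doesn't show how url is constructed:

import requests

venues_list = []
for county, lat, lng in zip(county_merge['county'], county_merge['lat'], county_merge['lng']):
    # build_url is a placeholder -- substitute however you construct `url`
    results = requests.get(build_url(lat, lng)).json()
    for v in results['response']['venues']:
        venues_list.append((
            county,
            lat,
            lng,
            v['name'],
            v['location']['lat'],
            v['location']['lng'],
            v['location'].get('city'),  # .get(): some venues omit 'city'
        ))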

Related

Renaming new split columns with prefix

I have a dataframe which includes two columns whose values are dicts:
type possession_team
0 {'id': 35, 'name': 'Starting XI'} {'id':9101,'name':'San Diego Wave'}
1 {'id': 35, 'name': 'Starting XI'} {'id':9101,'name':'San Diego Wave'}
2 {'id': 18, 'name': 'Half Start'} {'id':9101,'name':'San Diego Wave'}
3 {'id': 18, 'name': 'Half Start'} {'id':9101,'name':'San Diego Wave'}
4 {'id': 30, 'name': 'Pass'} {'id':9101,'name':'San Diego Wave'}
I use
pd.concat([df, df['type'].apply(pd.Series)], axis=1).drop('type', axis=1)
to split the columns manually at the minute. How would I use this code but also add a prefix to the resulting columns, the prefix being the name of the original column? So I would have:
type_id type_name
0 35 'Starting XI'
1 35 'Starting XI'
2 18 'Half Start'
3 18 'Half Start'
4 30 'Pass'
IIUC, and assuming dictionaries, you could do:
df['type_id'] = df['type'].str['id']
df['type_name'] = df['type'].str['name']
For a more generic approach:
for c in df['type'].explode().unique():
    df[f'type_{c}'] = df['type'].str[c]
And even more generic (apply to all columns):
for col in ['type', 'possession_team']:  # or df.columns
    for c in df[col].explode().unique():
        df[f'{col}_{c}'] = df[col].str[c]
output:
type possession_team \
0 {'id': 35, 'name': 'Starting XI'} {'id': 9101, 'name': 'San Diego Wave'}
1 {'id': 35, 'name': 'Starting XI'} {'id': 9101, 'name': 'San Diego Wave'}
2 {'id': 18, 'name': 'Half Start'} {'id': 9101, 'name': 'San Diego Wave'}
3 {'id': 18, 'name': 'Half Start'} {'id': 9101, 'name': 'San Diego Wave'}
4 {'id': 30, 'name': 'Pass'} {'id': 9101, 'name': 'San Diego Wave'}
type_id type_name possession_team_id possession_team_name
0 35 Starting XI 9101 San Diego Wave
1 35 Starting XI 9101 San Diego Wave
2 18 Half Start 9101 San Diego Wave
3 18 Half Start 9101 San Diego Wave
4 30 Pass 9101 San Diego Wave
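If you'd rather stay close to your original apply(pd.Series) approach, add_prefix does the renaming; a sketch, assuming the same df as above:

import pandas as pd

expanded = df['type'].apply(pd.Series).add_prefix('type_')
df = pd.concat([df.drop(columns='type'), expanded], axis=1)

The same two lines work for possession_team with the prefix swapped.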

json_normalize with multiple record paths

I'm using the example from the json_normalize documentation (pandas.json_normalize — pandas 1.0.3 documentation). I unfortunately can't paste my actual JSON, but this example works. Pasted from the documentation:
data = [{'state': 'Florida',
'shortname': 'FL',
'info': {'governor': 'Rick Scott'},
'counties': [{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}]},
{'state': 'Ohio',
'shortname': 'OH',
'info': {'governor': 'John Kasich'},
'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}]}]
from pandas import json_normalize

result = json_normalize(data, 'counties', ['state', 'shortname',
                                           ['info', 'governor']])
result
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
What if the JSON were instead the one below, where info is an array instead of a dict:
data = [{'state': 'Florida',
'shortname': 'FL',
'info': [{'governor': 'Rick Scott'},
{'governor': 'Rick Scott 2'}],
'counties': [{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}]},
{'state': 'Ohio',
'shortname': 'OH',
'info': [{'governor': 'John Kasich'},
{'governor': 'John Kasich 2'}],
'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}]}]
How would you get the following output using json_normalize:
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Dade 12345 Florida FL Rick Scott 2
2 Broward 40000 Florida FL Rick Scott
3 Broward 40000 Florida FL Rick Scott 2
4 Palm Beach 60000 Florida FL Rick Scott
5 Palm Beach 60000 Florida FL Rick Scott 2
6 Summit 1234 Ohio OH John Kasich
7 Summit 1234 Ohio OH John Kasich 2
8 Cuyahoga 1337 Ohio OH John Kasich
9 Cuyahoga 1337 Ohio OH John Kasich 2
Or if there is another way to do it, please do let me know.
json_normalize is designed for convenience rather than flexibility. It can't handle all forms of JSON out there (JSON is just too flexible to write a universal parser for).
How about calling json_normalize twice and then merging? This assumes each state appears only once in your JSON:
counties = json_normalize(data, 'counties', ['state', 'shortname'])
governors = json_normalize(data, 'info', ['state'])
result = counties.merge(governors, on='state')
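A runnable sketch of this two-pass approach, using the data list from the question; the rename is only needed if you want the info.governor column name from the expected output (json_normalize would otherwise call the column governor):

import pandas as pd

counties = pd.json_normalize(data, 'counties', ['state', 'shortname'])
governors = (pd.json_normalize(data, 'info', ['state'])
             .rename(columns={'governor': 'info.governor'}))
result = counties.merge(governors, on='state')

Since merge performs an inner join on state, each county row is paired with every governor row for its state, which produces exactly the ten-row expansion shown in the expected output.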

Vectorize get() method in python

I'm trying to vectorize the get() method from one column containing dictionaries to another column in the same dataframe. For example, I would like the cities in the address column dictionaries to populate the address.city column.
df = pd.DataFrame({
    'address': [
        {'city': 'Lake Ashley', 'state': 'MN', 'street': '56833 Baker Branch', 'zip': '15884'},
        {'city': 'Reginaldfurt', 'state': 'MO', 'street': '045 Bennett Motorway Suite 404', 'zip': '68916'},
        {'city': 'East Stephaniefurt', 'state': 'VI', 'street': '908 Matthew Ports Suite 313', 'zip': '15956-9706'},
    ],
    'address.city': [None, None, None],
    'address.street': [None, None, None],
})
I was trying
df['address.city'].apply(df.address.get('city'))
but that doesn't work. I figured I was close since df.address[0].get('city') does extract the city value for that row. As you can imagine I want to do the same for address.street.
The direct answer to your question is at the bottom. First, though, you can parse the whole address column in one go like this:
df.address.apply(pd.Series).add_prefix('address.')
# or
# pd.DataFrame(df.address.tolist()).add_prefix('address.')
address.city address.state address.street address.zip
0 Lake Ashley MN 56833 Baker Branch 15884
1 Reginaldfurt MO 045 Bennett Motorway Suite 404 68916
2 East Stephaniefurt VI 908 Matthew Ports Suite 313 15956-9706
This answers your question:
df['address.city'] = df.address.apply(lambda d: d['city'])
df
address address.city address.street
0 {'city': 'Lake Ashley', 'state': 'MN', 'street... Lake Ashley None
1 {'city': 'Reginaldfurt', 'state': 'MO', 'stree... Reginaldfurt None
2 {'city': 'East Stephaniefurt', 'state': 'VI', ... East Stephaniefurt None
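An equivalent, more compact spelling uses the .str accessor, which indexes into each element of an object column (dicts included); a sketch with the same df:

df['address.city'] = df['address'].str['city']
df['address.street'] = df['address'].str['street']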

In a pandas dataframe, how do I add a field that is a running total with a group by

I have the following dataframe:
import pandas
mydata = [{'city': 'London', 'age': 75, 'fdg': 1.78},
{'city': 'Paris', 'age': 22, 'fdg': 1.56},
{'city': 'Paris', 'age': 32, 'fdg': 1.56},
{'city': 'New York', 'age': 37, 'fdg': 1.56},
{'city': 'London', 'age': 24, 'fdg': 1.56},
{'city': 'London', 'age': 22, 'fdg': 1.56},
{'city': 'New York', 'age': 60, 'fdg': 1.56},
{'city': 'Paris', 'age': 22, 'fdg': 1.56},
]
df = pandas.DataFrame(mydata)
age city fdg
0 75 London 1.78
1 22 Paris 1.56
2 32 Paris 1.56
3 37 New York 1.56
4 24 London 1.56
5 22 London 1.56
6 60 New York 1.56
7 22 Paris 1.56
I'd like to add a field at the end called age_total, which will be a cumulative total of the age field. The cumulative calculation should work over a group-by of city: row 1 for London would be 75, row 2 for Paris would be 22, and row 3 for Paris would be 54 (22 + 32).
df['age_total'] = df.groupby('city')['age'].cumsum()
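For the sample dataframe above, this yields:
   age      city   fdg  age_total
0   75    London  1.78         75
1   22     Paris  1.56         22
2   32     Paris  1.56         54
3   37  New York  1.56         37
4   24    London  1.56         99
5   22    London  1.56        121
6   60  New York  1.56         97
7   22     Paris  1.56         76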

writing to CSV in python

My CSV writer currently does not write row by row; it just jumbles everything up. Any help would be great. Basically, I need a CSV with the four lines from the "yields" section below in one column.
tweets_df = tweets_df.dropna()
for i in tweets_df.ix[:, 0]:
    if regex_getter(i) != None:
        print(regex_getter(i))
yields
Burlington, VT
Minneapolis, MN
Bloomington, IN
Irvine, CA
with open('Bernie.csv', 'w') as mycsvfile:
    for i in tweets_df.ix[:, 0]:
        if regex_getter(i) != None:
            row = regex_getter(i)
            writer.writerow([i])
def regex_getter(entry):
    txt = entry
    re1 = '((?:[a-z][a-z]+))'  # Word 1
    re2 = '(,)'  # Any Single Character 1
    re3 = '(\\s+)'  # White Space 1
    re4 = '((?:(?:AL)|(?:AK)|(?:AS)|(?:AZ)|(?:AR)|(?:CA)|(?:CO)|(?:CT)|(?:DE)|(?:DC)|(?:FM)|(?:FL)|(?:GA)|(?:GU)|(?:HI)|(?:ID)|(?:IL)|(?:IN)|(?:IA)|(?:KS)|(?:KY)|(?:LA)|(?:ME)|(?:MH)|(?:MD)|(?:MA)|(?:MI)|(?:MN)|(?:MS)|(?:MO)|(?:MT)|(?:NE)|(?:NV)|(?:NH)|(?:NJ)|(?:NM)|(?:NY)|(?:NC)|(?:ND)|(?:MP)|(?:OH)|(?:OK)|(?:OR)|(?:PW)|(?:PA)|(?:PR)|(?:RI)|(?:SC)|(?:SD)|(?:TN)|(?:TX)|(?:UT)|(?:VT)|(?:VI)|(?:VA)|(?:WA)|(?:WV)|(?:WI)|(?:WY)))(?![a-z])'  # US State 1
    rg = re.compile(re1 + re2 + re3 + re4, re.IGNORECASE | re.DOTALL)
    m = rg.search(txt)
    if m:
        word1 = m.group(1)
        c1 = m.group(2)
        ws1 = m.group(3)
        usstate1 = m.group(4)
        return str((word1 + c1 + ws1 + usstate1))
This is what my data looks like without the regex method; the method basically drops all data that is not in "City, State" format, excluding everything that doesn't look like, for example, "Raleigh, NC".
for i in tweets_df.ix[:, 0]:
    print(i)
Indiana, USA
Burlington, VT
United States
Saint Paul - Minneapolis, MN
Inland Valley, The Pass, S. CA
In the Dreamatorium
Nova Scotia;Canada
North Carolina, USA
INTP. West Michigan
Los Angeles, California
Waterbury Connecticut
Right side of the tracks
I would do it this way:
states = {
'AK': 'Alaska',
'AL': 'Alabama',
'AR': 'Arkansas',
'AS': 'American Samoa',
'AZ': 'Arizona',
'CA': 'California',
'CO': 'Colorado',
'CT': 'Connecticut',
'DC': 'District of Columbia',
'DE': 'Delaware',
'FL': 'Florida',
'GA': 'Georgia',
'GU': 'Guam',
'HI': 'Hawaii',
'IA': 'Iowa',
'ID': 'Idaho',
'IL': 'Illinois',
'IN': 'Indiana',
'KS': 'Kansas',
'KY': 'Kentucky',
'LA': 'Louisiana',
'MA': 'Massachusetts',
'MD': 'Maryland',
'ME': 'Maine',
'MI': 'Michigan',
'MN': 'Minnesota',
'MO': 'Missouri',
'MP': 'Northern Mariana Islands',
'MS': 'Mississippi',
'MT': 'Montana',
'NA': 'National',
'NC': 'North Carolina',
'ND': 'North Dakota',
'NE': 'Nebraska',
'NH': 'New Hampshire',
'NJ': 'New Jersey',
'NM': 'New Mexico',
'NV': 'Nevada',
'NY': 'New York',
'OH': 'Ohio',
'OK': 'Oklahoma',
'OR': 'Oregon',
'PA': 'Pennsylvania',
'PR': 'Puerto Rico',
'RI': 'Rhode Island',
'SC': 'South Carolina',
'SD': 'South Dakota',
'TN': 'Tennessee',
'TX': 'Texas',
'UT': 'Utah',
'VA': 'Virginia',
'VI': 'Virgin Islands',
'VT': 'Vermont',
'WA': 'Washington',
'WI': 'Wisconsin',
'WV': 'West Virginia',
'WY': 'Wyoming'
}
import io
import pandas as pd

# sample DF
data = """\
location
Indiana, USA
Burlington, VT
United States
Saint Paul - Minneapolis, MN
Inland Valley, The Pass, S. CA
In the Dreamatorium
Nova Scotia;Canada
North Carolina, USA
INTP. West Michigan
Los Angeles, California
Waterbury Connecticut
Right side of the tracks
"""
df = pd.read_csv(io.StringIO(data), sep=r'\|')
re_states = r'.*,\s*(?:' + '|'.join(states.keys()) + ')'
df.loc[df.location.str.contains(re_states), 'location'].to_csv('filtered.csv', index=False)
Explanation:
In [3]: df
Out[3]:
location
0 Indiana, USA
1 Burlington, VT
2 United States
3 Saint Paul - Minneapolis, MN
4 Inland Valley, The Pass, S. CA
5 In the Dreamatorium
6 Nova Scotia;Canada
7 North Carolina, USA
8 INTP. West Michigan
9 Los Angeles, California
10 Waterbury Connecticut
11 Right side of the tracks
generated RegEx:
In [9]: re_states
Out[9]: '.*,\\s*(?:VA|AK|ND|CA|CO|AR|MD|DC|KY|LA|OR|VT|IL|CT|OH|GA|WA|AS|NC|MN|NH|ID|HI|NA|MA|MS|WV|VI|FL|MO|MI|AL|ME|GU|NM|SD|WY|AZ|MP|DE|RI|PA|
NJ|WI|OK|TN|TX|KS|IN|NV|NY|NE|PR|UT|IA|MT|SC)'
Search mask:
In [10]: df.location.str.contains(re_states)
Out[10]:
0 False
1 True
2 False
3 True
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
Name: location, dtype: bool
Filtered DF:
In [11]: df.loc[df.location.str.contains(re_states)]
Out[11]:
location
1 Burlington, VT
3 Saint Paul - Minneapolis, MN
Now just spool it to CSV:
df.loc[df.location.str.contains(re_states), 'location'].to_csv('d:/temp/filtered.csv', index=False)
filtered.csv:
"Burlington, VT"
"Saint Paul - Minneapolis, MN"
UPDATE:
starting from Pandas 0.20.1 the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.
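For example, the loops above would use the position-based .iloc instead of .ix; a sketch:

# replaces tweets_df.ix[:, 0]
for i in tweets_df.iloc[:, 0]:
    print(i)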
