Renaming new split columns with prefix - python

I have a dataframe that includes two columns whose values are dicts.
type possession_team
0 {'id': 35, 'name': 'Starting XI'} {'id':9101,'name':'San Diego Wave'}
1 {'id': 35, 'name': 'Starting XI'} {'id':9101,'name':'San Diego Wave'}
2 {'id': 18, 'name': 'Half Start'} {'id':9101,'name':'San Diego Wave'}
3 {'id': 18, 'name': 'Half Start'} {'id':9101,'name':'San Diego Wave'}
4 {'id': 30, 'name': 'Pass'} {'id':9101,'name':'San Diego Wave'}
I use
pd.concat([df, df['type'].apply(pd.Series)], axis = 1).drop('type', axis = 1)
to split the columns manually at the minute. How would I use this code, but also add a prefix to the columns that it creates? The prefix should be the name of the original column, so I would have:
type_id type_name
0 35 'Starting XI'
1 35 'Starting XI'
2 18 'Half Start'
3 18 'Half Start'
4 30 'Pass'

IIUC, and assuming dictionaries, you could do:
df['type_id'] = df['type'].str['id']
df['type_name'] = df['type'].str['name']
For a more generic approach:
for c in df['type'].explode().unique():
    df[f'type_{c}'] = df['type'].str[c]
And even more generic (apply to all columns):
for col in ['type', 'possession_team']:  # or df.columns
    for c in df[col].explode().unique():
        df[f'{col}_{c}'] = df[col].str[c]
output:
type possession_team \
0 {'id': 35, 'name': 'Starting XI'} {'id': 9101, 'name': 'San Diego Wave'}
1 {'id': 35, 'name': 'Starting XI'} {'id': 9101, 'name': 'San Diego Wave'}
2 {'id': 18, 'name': 'Half Start'} {'id': 9101, 'name': 'San Diego Wave'}
3 {'id': 18, 'name': 'Half Start'} {'id': 9101, 'name': 'San Diego Wave'}
4 {'id': 30, 'name': 'Pass'} {'id': 9101, 'name': 'San Diego Wave'}
type_id type_name possession_team_id possession_team_name
0 35 Starting XI 9101 San Diego Wave
1 35 Starting XI 9101 San Diego Wave
2 18 Half Start 9101 San Diego Wave
3 18 Half Start 9101 San Diego Wave
4 30 Pass 9101 San Diego Wave
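To stay closer to the concat code in the question, one possible sketch is to expand each dict column into its own frame and call add_prefix on it before concatenating. The helper name expand_dict_columns is hypothetical, the sample frame only mirrors three of the rows shown above, and the sketch assumes every cell in the listed columns is a dict.

```python
import pandas as pd

# Small sample mirroring the question's data (three rows only).
df = pd.DataFrame({
    "type": [
        {"id": 35, "name": "Starting XI"},
        {"id": 18, "name": "Half Start"},
        {"id": 30, "name": "Pass"},
    ],
    "possession_team": [{"id": 9101, "name": "San Diego Wave"}] * 3,
})


def expand_dict_columns(frame, columns):
    """Hypothetical helper: replace each dict column with prefixed key columns."""
    parts = [frame.drop(columns=columns)]
    for col in columns:
        # Build one column per dict key, then prefix with the source column name.
        expanded = pd.DataFrame(frame[col].tolist(), index=frame.index)
        parts.append(expanded.add_prefix(f"{col}_"))
    return pd.concat(parts, axis=1)


out = expand_dict_columns(df, ["type", "possession_team"])
print(out)
```

Unlike the loops above, this drops the original dict columns, which matches the drop('type', axis = 1) step in the question.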

Related

How to access data and handle missing data in a dictionaries within a dataframe

Given, df:
import pandas as pd
import numpy as np
data =\
{'Col1': [1, 2, 3],
'Person': [{'ID': 10001,
'Data': {'Address': {'Street': '1234 Street A',
'City': 'Houston',
'State': 'Texas',
'Zip': '77002'}},
'Age': 30,
'Income': 50000},
{'ID': 10002,
'Data': {'Address': {'Street': '7892 Street A',
'City': 'Greenville',
'State': 'Maine',
'Zip': np.nan}},
'Age': np.nan,
'Income': 63000},
{'ID': 10003, 'Data': {'Address': np.nan}, 'Age': 56, 'Income': 85000}]}
df = pd.DataFrame(data)
Input Dataframe:
Col1 Person
0 1 {'ID': 10001, 'Data': {'Address': {'Street': '1234 Street A', 'City': 'Houston', 'State': 'Texas', 'Zip': '77002'}}, 'Age': 30, 'Income': 50000}
1 2 {'ID': 10002, 'Data': {'Address': {'Street': '7892 Street A', 'City': 'Greenville', 'State': 'Maine', 'Zip': nan}}, 'Age': nan, 'Income': 63000}
2 3 {'ID': 10003, 'Data': {'Address': nan}, 'Age': 56, 'Income': 85000}
My expected output dataframe is df[['Col1', 'Income', 'Age', 'Street', 'Zip']] where Income, Age, Street, and Zip come from within Person:
Col1 Income Age Street Zip
0 1 50000 30.0 1234 Street A 77002
1 2 63000 NaN 7892 Street A nan
2 3 85000 56.0 NaN nan
Using list comprehension, we can create most of these columns.
df['Income'] = [x.get('Income') for x in df['Person']]
df['Age'] = [x.get('Age') for x in df['Person']]
df['Age']
Output:
0 30.0
1 NaN
2 56.0
Name: Age, dtype: float64
However, dealing with np.nan values inside a nested dictionary is a real pain. Let's look at getting data from a nested dictionary data where one of the values is nan.
df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
We get an AttributeError:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-80-cc2f92bfe95d> in <module>
1 #However, let's look at getting data rom a nested dictionary where one of the values is nan.
2
----> 3 df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
4
5 #We get and AttributeError because NoneType object has no get method
<ipython-input-80-cc2f92bfe95d> in <listcomp>(.0)
1 #However, let's look at getting data rom a nested dictionary where one of the values is nan.
2
----> 3 df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
4
5 #We get and AttributeError because NoneType object has no get method
AttributeError: 'float' object has no attribute 'get'
Let's use the .str accessor with dictionary keys to fetch this data.
There is little documentation in pandas that shows how you can use .str.get or .str[] to fetch values from dictionary objects in a dataframe column/pandas series.
df['Street'] = df['Person'].str['Data'].str['Address'].str['Street']
Output:
0 1234 Street A
1 7892 Street A
2 NaN
Name: Street, dtype: object
And, likewise with
df['Zip'] = df['Person'].str['Data'].str['Address'].str['Zip']
Leaving us with the columns to build the desired dataframe
df[['Col1', 'Income', 'Age', 'Street', 'Zip']]
from dictionaries.
Output:
Col1 Income Age Street Zip
0 1 50000 30.0 1234 Street A 77002
1 2 63000 NaN 7892 Street A NaN
2 3 85000 56.0 NaN NaN
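If chaining .str feels too implicit, a plain-Python alternative can be sketched as a small lookup helper that stops as soon as an intermediate value is not a dict. The name deep_get is hypothetical (it is not a pandas function), and the sample series is a trimmed version of the Person column above.

```python
import numpy as np
import pandas as pd


def deep_get(obj, *keys, default=np.nan):
    """Hypothetical helper: follow keys through nested dicts, or return default."""
    for key in keys:
        if not isinstance(obj, dict):
            # An intermediate value such as NaN has no keys to follow.
            return default
        obj = obj.get(key, default)
    return obj


# Trimmed version of the Person column from the question.
people = pd.Series([
    {"Data": {"Address": {"Street": "1234 Street A", "Zip": "77002"}}},
    {"Data": {"Address": {"Street": "7892 Street A", "Zip": np.nan}}},
    {"Data": {"Address": np.nan}},
])

streets = people.map(lambda p: deep_get(p, "Data", "Address", "Street"))
zips = people.map(lambda p: deep_get(p, "Data", "Address", "Zip"))
print(streets)
print(zips)
```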
import pandas as pd
import numpy as np
df = pd.DataFrame({
"Col1": [1, 2, 3],
"Person": [
{
"ID": 10001,
"Data": {
"Address": {
"Street": "1234 Street A",
"City": "Houston",
"State": "Texas",
"Zip": "77002",
}
},
"Age": 30,
"Income": 50000,
},
{
"ID": 10002,
"Data": {
"Address": {
"Street": "7892 Street A",
"Zip": np.nan,
"City": "Greenville",
"State": "Maine",
}
},
"Age": np.nan,
"Income": 63000,
},
{
"ID": 10003,
"Data": {"Address": np.nan},
"Age": 56, "Income": 85000
},
],
})
row_dic_list = df.to_dict(orient='records') # convert to dict
# remain = ['Col1', 'Income', 'Age', 'Street', 'Zip']
new_row_dict_list = []
# Iterate over each row to generate new data
for row_dic in row_dic_list:
    col1 = row_dic['Col1']
    person_dict = row_dic['Person']
    age = person_dict['Age']
    income = person_dict['Income']
    address = person_dict["Data"]["Address"]
    # Default to NaN when the address is missing (not a dict)
    street = np.nan
    zip_v = np.nan
    if isinstance(address, dict):
        street = address["Street"]
        zip_v = address["Zip"]
    new_row_dict = {
        'Col1': col1,
        'Income': income,
        'Age': age,
        'Street': street,
        'Zip': zip_v,
    }
    new_row_dict_list.append(new_row_dict)
# Generate a dataframe from each new row of data
new_df = pd.DataFrame(new_row_dict_list)
print(new_df)
"""
Col1 Income Age Street Zip
0 1 50000 30.0 1234 Street A 77002
1 2 63000 NaN 7892 Street A NaN
2 3 85000 56.0 NaN NaN
"""

How to only keep rows in a DataFrame based on column values found in separate dataframe

Say I have a main_df that I'm working on:
ID Name
1 Bob
2 Bill
3 Bacon
4 Bernie
and I have a separate_df with certain IDs that holds other data:
ID Location
1 California
3 New York
How can I filter out the data in main_df so that only the rows with IDs found in separate_df remain?
ID Name
1 Bob
3 Bacon
Use Series.isin
import pandas as pd

main_df = pd.DataFrame([{'ID': 1, 'Name': 'Bob'}, {'ID': 2, 'Name': 'Bill'},
                        {'ID': 3, 'Name': 'Bacon'}, {'ID': 4, 'Name': 'Bernie'}])
loc_df = pd.DataFrame([{'ID': 1, 'Location': 'California'},
                       {'ID': 3, 'Location': 'New York'}])
# Keep only the rows whose ID also appears in loc_df
main_df = main_df[main_df['ID'].isin(loc_df['ID'])]
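As a small illustration of why this works: isin returns a boolean mask aligned with main_df, so the same mask can be negated with ~ to get the rows that were filtered out. The variable names mask, filtered and excluded are only for this sketch.

```python
import pandas as pd

main_df = pd.DataFrame({"ID": [1, 2, 3, 4],
                        "Name": ["Bob", "Bill", "Bacon", "Bernie"]})
separate_df = pd.DataFrame({"ID": [1, 3],
                            "Location": ["California", "New York"]})

# True where the ID in main_df also exists in separate_df.
mask = main_df["ID"].isin(separate_df["ID"])

filtered = main_df[mask]    # rows to keep
excluded = main_df[~mask]   # rows that would be dropped
print(filtered)
```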

How to convert multiple columns of list of dictionary, set, tuple to columns

I have some structure and data
CusID Name Shop Item Card Type Price
1 Paul Pascal [{"Food":"001","Water":"Melon","Dessert":"Mango"}] [{"Main":"Yes", "Second":""}] {"VIP":"YES"} 24000
2 Mark Casio [{"Food":"001","Water":"Apple","Dessert":"Mango"}] [{"Main":"", "Second":"Yes"}] {"VIP":"YES"} 30800
3 Bill Nike [{"Food":"004","Water":"","Dessert":""}] [] {} 900
I want to split Item, Card, and Type columns. This is the expected output
Name Shop Food Water Dessert Card_Main Card_Second VIP Price
Paul Pascal 1 Melon Mango Yes YES 24000
Mark Casio 1 Apple Mango Yes YES 30800
Bill Nike 4 900
Code for the dataframe:
d = [{'CusID': 1, 'Name': 'Paul', 'Shop': 'Pascal',
'Item': [{"Food":"001","Water":"Melon","Dessert":"Mango"}],
'Card': [{"Main":"Yes", "Second":""}], 'Type': {"VIP":"YES"}, 'Price': 24000},
{'CusID': 2, 'Name': 'Mark', 'Shop': 'Casio', 'Item': [{"Food":"001","Water":"Apple","Dessert":"Mango"}],
'Card': [{"Main":"", "Second":"Yes"}], 'Type': {"VIP":"YES"}, 'Price': 30800},
{'CusID': 3, 'Name': 'Bill', 'Shop': 'Nike', 'Item': [{"Food":"004","Water":"","Dessert":""}],
'Card': [], 'Type': {}, 'Price': 900}]
df = pd.DataFrame(d)
Here is the code for my actual dataframe.
Here the lists of dictionaries are wrapped in ' ' (they are strings), but the printed dataframe looks no different!
d = [{'CusID': 1, 'Name': 'Paul', 'Shop': 'Pascal',
'Item': '[{"Food":"001","Water":"Melon","Dessert":"Mango"}]',
'Card': '[{"Main":"Yes", "Second":""}]', 'Type': '{"VIP":"YES"}', 'Price': 24000},
{'CusID': 2, 'Name': 'Mark', 'Shop': 'Casio', 'Item': '[{"Food":"001","Water":"Apple","Dessert":"Mango"}]',
'Card': '[{"Main":"", "Second":"Yes"}]', 'Type': '{"VIP":"YES"}', 'Price': 30800},
{'CusID': 3, 'Name': 'Bill', 'Shop': 'Nike', 'Item': '[{"Food":"004","Water":"","Dessert":""}]',
'Card': [], 'Type': {}, 'Price': 900}]
df = pd.DataFrame(d)
Not that dynamic but this can be solved using:
a = pd.DataFrame(df.pop('Item').str[0].dropna().tolist())
b = pd.DataFrame(df.pop('Card').str[0].dropna().tolist()).add_prefix('Card_')
c = pd.DataFrame(df.pop('Type').tolist())
out = df.join([a, b, c]).fillna('')
print(out)
CusID Name Shop Price Food Water Dessert Card_Main Card_Second VIP
0 1 Paul Pascal 24000 001 Melon Mango Yes YES
1 2 Mark Casio 30800 001 Apple Mango Yes YES
2 3 Bill Nike 900 004
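Note that .str[0] on a column of strings returns the first character rather than the first list element, so for the string-encoded version of the dataframe the values probably need to be parsed first. A minimal sketch, assuming the strings are valid JSON as in the question; parse_if_string is a hypothetical helper name and the frame is a reduced two-row sample:

```python
import json

import pandas as pd

# Reduced sample: some cells are JSON strings, some are already Python objects.
df = pd.DataFrame([
    {"Name": "Paul",
     "Item": '[{"Food":"001","Water":"Melon","Dessert":"Mango"}]',
     "Type": '{"VIP":"YES"}'},
    {"Name": "Bill",
     "Item": '[{"Food":"004","Water":"","Dessert":""}]',
     "Type": {}},
])


def parse_if_string(value):
    """Hypothetical helper: decode JSON strings, leave other objects untouched."""
    return json.loads(value) if isinstance(value, str) else value


for col in ["Item", "Type"]:
    df[col] = df[col].map(parse_if_string)

# After parsing, the .str accessor reaches into the list and then the dict.
foods = df["Item"].str[0].str["Food"]
print(foods)
```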

How to fix "string indices must be integers"

I am working with JSON files from Foursquare. And I keep getting this error "string indices must be integers"
This is my dataset, county_merge
county density lat lng
0 Alameda 2532.292000 37.609029 -121.899142
1 Alpine 30.366667 38.589393 -119.834501
2 Amador 218.413333 38.449089 -120.591102
3 Butte 329.012500 39.651927 -121.585844
4 Calaveras 214.626316 38.255818 -120.498149
5 Colusa 393.388889 39.146558 -122.220956
6 Contra Costa 1526.334000 37.903481 -121.917535
7 Del Norte 328.485714 41.726177 -123.913280
8 El Dorado 444.043750 38.757414 -120.527613
9 Fresno 654.509259 36.729529 -119.708861
10 Glenn 477.985714 39.591277 -122.377866
11 Humboldt 392.427083 40.599742 -123.899773
12 Imperial 796.919048 33.030549 -115.359567
13 Inyo 127.561905 36.559533 -117.407471
14 Kern 608.326471 35.314570 -118.753822
15 Kings 883.560000 36.078481 -119.795634
16 Lake 608.338462 39.050541 -122.777656
17 Lassen 179.664706 40.768558 -120.730998
18 Los Angeles 2881.756000 34.053683 -118.242767
19 Madera 486.887500 37.171626 -119.773799
20 Marin 1366.937143 38.040914 -122.619964
21 Mariposa 48.263636 37.570148 -119.903659
22 Mendocino 198.010345 39.317649 -123.412640
23 Merced 1003.309091 37.302957 -120.484327
24 Modoc 100.856250 41.545049 -120.743600
25 Mono 133.145455 37.953393 -118.939876
26 Monterey 946.090323 36.600256 -121.894639
27 Napa 592.020000 38.297137 -122.285529
28 Nevada 338.892857 39.354033 -120.808984
29 Orange 1992.962500 33.750038 -117.870493
30 Placer 492.564000 39.101206 -120.765061
31 Plumas 87.817778 39.943099 -120.805952
32 Riverside 976.692105 33.953355 -117.396162
33 Sacramento 1369.729032 38.581572 -121.494400
34 San Benito 577.637500 36.624809 -121.117738
35 San Bernardino 612.176636 34.108345 -117.289765
36 San Diego 1281.848649 32.717421 -117.162771
37 San Francisco 7279.000000 37.779281 -122.419236
38 San Joaquin 1282.122222 37.937290 -121.277372
39 San Luis Obispo 627.285185 35.282753 -120.659616
40 San Mateo 1594.372973 37.496904 -122.333057
41 Santa Barbara 1133.525806 34.422132 -119.702667
42 Santa Clara 2090.724000 37.354113 -121.955174
43 Santa Cruz 1118.844444 36.974942 -122.028526
44 Shasta 180.137931 40.796512 -121.997919
45 Sierra 115.681818 39.584907 -120.530573
46 Siskiyou 202.170000 41.500722 -122.544354
47 Solano 871.818182 38.221894 -121.916355
48 Sonoma 926.674286 38.511080 -122.847339
49 Stanislaus 1181.864000 37.550087 -121.050143
50 Sutter 552.355556 38.950967 -121.697088
51 Tehama 206.862500 40.125133 -122.201553
52 Trinity 63.056250 40.605326 -123.171268
53 Tulare 681.425806 36.251647 -118.852583
54 Tuolumne 349.471429 38.056944 -119.991935
55 Ventura 1465.400000 34.343649 -119.295171
56 Yolo 958.890909 38.718454 -121.905900
{'meta': {'code': 200, 'requestId': '5cab80f04c1f6715df4e698d'},
'response': {'venues': [{'id': '4b9bf2abf964a520573936e3',
'name': 'Bishop Ranch Veterinary Center & Urgent Care',
'location': {'address': '2000 Bishop Dr',
'lat': 37.77129467449237,
'lng': -121.97112176203284,
'labeledLatLngs': [{'label': 'display',
'lat': 37.77129467449237,
'lng': -121.97112176203284}],
'distance': 19143,
'postalCode': '94583',
'cc': 'US',
'city': 'San Ramon',
'state': 'CA',
'country': 'United States',
'formattedAddress': ['2000 Bishop Dr',
'San Ramon, CA 94583',
'United States']},
'categories': [{'id': '4d954af4a243a5684765b473',
'name': 'Veterinarian',
'pluralName': 'Veterinarians',
'shortName': 'Veterinarians',
'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/building/medical_veterinarian_',
'suffix': '.png'},
'primary': True}],
'venuePage': {'id': '463205329'},
'referralId': 'v-1554743537',
'hasPerk': False},
{'id': '4b9acbfef964a5209dd635e3',
'name': 'San Francisco SPCA Veterinary Hospital',
'location': {'address': '201 Alabama St',
'crossStreet': 'at 16th St.',
'lat': 37.766633450405465,
'lng': -122.41214303998395,
'labeledLatLngs': [{'label': 'display',
'lat': 37.766633450405465,
'lng': -122.41214303998395}],
'distance': 48477,
'postalCode': '94103',
'cc': 'US',
'city': 'San Francisco',
'state': 'CA',
'country': 'United States',
'formattedAddress': ['201 Alabama St (at 16th St.)',
'San Francisco, CA 94103',
'United States']},
'categories': [{'id': '4d954af4a243a5684765b473',
'name': 'Veterinarian',
'pluralName': 'Veterinarians',
'shortName': 'Veterinarians',
'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/building/medical_veterinarian_',
'suffix': '.png'},
'primary': True}],
'referralId': 'v-1554743537',
'hasPerk': False},
{'id': '4b00d8ecf964a5204d4122e3',
'name': 'Pleasanton Veterinary Hospital',
'location': {'address': '3059B Hopyard Rd Ste B',
'lat': 37.67658,
'lng': -121.89778,
'labeledLatLngs': [{'label': 'display',
'lat': 37.67658,
'lng': -121.89778}],
'distance': 7520,
'postalCode': '94588',
'cc': 'US',
'city': 'Pleasanton',
'state': 'CA',
'country': 'United States',
'formattedAddress': ['3059B Hopyard Rd Ste B',
'Pleasanton, CA 94588',
'United States']},
'categories': [{'id': '4d954af4a243a5684765b473',
'name': 'Veterinarian',
'pluralName': 'Veterinarians',
'shortName': 'Veterinarians',
'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/building/medical_veterinarian_',
'suffix': '.png'},
'primary': True}],
'referralId': 'v-1554743537',
'hasPerk': False},
This is the JSON file I am working on, and I am trying to extract the name, latitude, longitude, and city name.
results = requests.get(url).json()
results
names=county_merge['county']
la=county_merge['lat']
ln=county_merge['lng']
venues_list = []
venues_list.append([(
names,
la,
ln,
v['response']['venues'][0]['name'],
v['response']['venues'][0]['location']['lat'],
v['response']['venues'][0]['location']['lng'],
v['response']['venues'][0]['location']['city']) for v in results])
I am expecting this to give me several lists, one per line:
[County name, 38.xxxx, -120.xxxx, XXX veterinary clinic, 38.xxxx, -120.xxxx, san diego]
[County name, 38.xxxx, -120.xxxx, XXX veterinary clinic, 38.xxxx, -120.xxxx, san diego]
[County name, 38.xxxx, -120.xxxx, XXX veterinary clinic, 38.xxxx, -120.xxxx, san diego]
[County name, 38.xxxx, -120.xxxx, XXX veterinary clinic, 38.xxxx, -120.xxxx, san diego]
[County name, 38.xxxx, -120.xxxx, XXX veterinary clinic, 38.xxxx, -120.xxxx, san diego]
.
.
.
.
But it only gives me an error and frustration.
TypeError Traceback (most recent call last)
<ipython-input-44-321b1c667727> in <module>
11 v['response']['venues'][0]['location']['lat'],
12 v['response']['venues'][0]['location']['lng'],
---> 13 v['response']['venues'][0]['location']['city']) for v in results])
<ipython-input-44-321b1c667727> in <listcomp>(.0)
11 v['response']['venues'][0]['location']['lat'],
12 v['response']['venues'][0]['location']['lng'],
---> 13 v['response']['venues'][0]['location']['city']) for v in results])
TypeError: string indices must be integers
Do you have any idea how to fix this code?
[v for v in results]
gives you
['meta', 'response']
So you got the keys of results, which are strings. I think you want
venues_list.append([(
names,
la,
ln,
v['name'],
v['location']['lat'],
v['location']['lng'],
v['location']['city']) for v in results['response']['venues'])
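The difference can be reproduced on a cut-down response. The sketch below uses a hand-written results dict with the same shape as the Foursquare output above; the second venue (with no city) is a made-up entry to show that .get avoids a KeyError when a field is absent.

```python
# Hand-written stand-in for requests.get(url).json(); same shape as above.
results = {
    "meta": {"code": 200},
    "response": {
        "venues": [
            {"name": "Pleasanton Veterinary Hospital",
             "location": {"lat": 37.67658, "lng": -121.89778,
                          "city": "Pleasanton"}},
            # Made-up venue without a 'city' key, for illustration only.
            {"name": "Example Clinic",
             "location": {"lat": 38.0, "lng": -120.0}},
        ]
    },
}

# Iterating over a dict yields its keys, which are strings.
keys = [v for v in results]

# Iterating over the venues list yields one dict per venue.
rows = [
    (v["name"],
     v["location"]["lat"],
     v["location"]["lng"],
     v["location"].get("city"))  # None when the key is missing
    for v in results["response"]["venues"]
]
print(keys)
print(rows)
```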

In a pandas dataframe how do I add a field that is a running total with a group by

I have the following dataframe:
import pandas
mydata = [{'city': 'London', 'age': 75, 'fdg': 1.78},
{'city': 'Paris', 'age': 22, 'fdg': 1.56},
{'city': 'Paris', 'age': 32, 'fdg': 1.56},
{'city': 'New York', 'age': 37, 'fdg': 1.56},
{'city': 'London', 'age': 24, 'fdg': 1.56},
{'city': 'London', 'age': 22, 'fdg': 1.56},
{'city': 'New York', 'age': 60, 'fdg': 1.56},
{'city': 'Paris', 'age': 22, 'fdg': 1.56},
]
df = pandas.DataFrame(mydata)
age city fdg
0 75 London 1.78
1 22 Paris 1.56
2 32 Paris 1.56
3 37 New York 1.56
4 24 London 1.56
5 22 London 1.56
6 60 New York 1.56
7 22 Paris 1.56
I'd like to add a field to the end called age_total which will be a cumulative total of the age field. The cumulative calculation would work over a group by of city - So row 1 for London would be 75, row 2 for Paris would be 22 and row 3 for Paris would be 54 - (22+32)
df['age_total'] = df.groupby('city')['age'].cumsum()
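A quick check against the data from the question, selecting the age column before calling cumsum so that only that column is accumulated. The totals follow the original row order within each city:

```python
import pandas as pd

mydata = [
    {"city": "London", "age": 75, "fdg": 1.78},
    {"city": "Paris", "age": 22, "fdg": 1.56},
    {"city": "Paris", "age": 32, "fdg": 1.56},
    {"city": "New York", "age": 37, "fdg": 1.56},
    {"city": "London", "age": 24, "fdg": 1.56},
    {"city": "London", "age": 22, "fdg": 1.56},
    {"city": "New York", "age": 60, "fdg": 1.56},
    {"city": "Paris", "age": 22, "fdg": 1.56},
]
df = pd.DataFrame(mydata)

# Running total of age within each city, aligned to the original index.
df["age_total"] = df.groupby("city")["age"].cumsum()
print(df)
# Paris rows (index 1, 2, 7) get 22, 54, 76
```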
