Adding value to dataframe based on dict - python

I have a problem with a list of dicts like this:
list_validation = [{'name': 'Alice', 'street': 'Baker Street', 'stamp': 'T05', 'city': 'London'}, {'name': 'Margaret', 'street': 'Castle Street', 'stamp': 'T01', 'city': 'Cambridge'}, {'name': 'Fred', 'street': 'Baker Street', 'stamp': 'T012', 'city': 'London'}]
My dataframe has these columns:
df = pd.DataFrame({'name': ['Fred', 'Jane', 'Alice', 'Margaret'], 'street': ['Baker Street', 'Downing Street', 'Baker Street', 'Castle Street'],
'stamp': ['', 'T03', '', ''],
'city': ['', 'London', '', ''],
'other irrelevant columns for this task' : [1, 2, 3, 4]
})
What I want is to fill the gaps of the stamp columns and the city columns, so it looks like this:
df2 = pd.DataFrame({'name': ['Fred', 'Jane', 'Alice', 'Margaret'], 'street': ['Baker Street', 'Downing Street', 'Baker Street', 'Castle Street'],
'stamp': ['T012', 'T03', 'T05', 'T01'],
'city': ['London', 'London', 'London', 'Cambridge'],
'other irrelevant columns for this task' : [1, 2, 3, 4]
})
I have been trying this, but it is not working:
new_dict = df[['name', 'street', 'stamp', 'city']].to_dict()
list(new_dict)
for l in list_validation:
    for row in new_dict:
        if l['name'] == row['name'] and l['street'] == row['street']:
            row['stamp'] = l['stamp']
            row['city'] = l['city']

One approach is to iterate over each row in the dataframe and fill the missing values from the list.
List Definition:
list_validation = [{'name': 'Alice', 'street': 'Baker Street', 'stamp': 'T05', 'city': 'London'}, {'name': 'Margaret', 'street': 'Castle Street', 'stamp': 'T01', 'city': 'Cambridge'}, {'name': 'Fred', 'street': 'Baker Street', 'stamp': 'T012', 'city': 'London'}]
DataFrame Definition:
df = pd.DataFrame({'name': ['Fred', 'Jane', 'Alice', 'Margaret'], 'street': ['Baker Street', 'Downing Street', 'Baker Street', 'Castle Street'],
'stamp': ['', 'T03', '', ''],'city': ['', 'London', '', ''],'other irrelevant columns for this task' : [1, 2, 3, 4]})
Logic
for r, i in df.iterrows():
    name_in_df = i['name']
    # if pd.isna(i['stamp']):
    if not i['stamp']:
        for j in list_validation:
            if j['name'] == name_in_df:
                value_in_list = j['stamp']
                df.loc[r, 'stamp'] = value_in_list
                break
    # if pd.isna(i['city']):
    if not i['city']:
        name_in_df = i['name']
        for j in list_validation:
            if j['name'] == name_in_df:
                value_in_list = j['city']
                df.loc[r, 'city'] = value_in_list
                break
df
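The loop above matches on name only and rescans list_validation for every row. Below is a minimal sketch of the same fill that matches on both name and street through a dictionary lookup. It assumes the rows are available as plain dicts (for example from df.to_dict('records')) and that a (name, street) pair identifies one person; the helper name fill_gaps is hypothetical.

```python
list_validation = [
    {'name': 'Alice', 'street': 'Baker Street', 'stamp': 'T05', 'city': 'London'},
    {'name': 'Margaret', 'street': 'Castle Street', 'stamp': 'T01', 'city': 'Cambridge'},
    {'name': 'Fred', 'street': 'Baker Street', 'stamp': 'T012', 'city': 'London'},
]

# Rows as plain dicts, e.g. what df.to_dict('records') would return
rows = [
    {'name': 'Fred', 'street': 'Baker Street', 'stamp': '', 'city': ''},
    {'name': 'Jane', 'street': 'Downing Street', 'stamp': 'T03', 'city': 'London'},
    {'name': 'Alice', 'street': 'Baker Street', 'stamp': '', 'city': ''},
    {'name': 'Margaret', 'street': 'Castle Street', 'stamp': '', 'city': ''},
]


def fill_gaps(rows, validation, keys=('name', 'street'), fields=('stamp', 'city')):
    """Return copies of rows with empty fields filled from validation."""
    # Build the lookup once: (name, street) -> validation record
    lookup = {tuple(v[k] for k in keys): v for v in validation}
    filled = []
    for row in rows:
        new_row = dict(row)
        match = lookup.get(tuple(row[k] for k in keys))
        if match is not None:
            for field in fields:
                # Only fill gaps; existing values are left alone
                if not new_row[field]:
                    new_row[field] = match[field]
        filled.append(new_row)
    return filled


filled_rows = fill_gaps(rows, list_validation)
print([r['stamp'] for r in filled_rows])
print([r['city'] for r in filled_rows])
```

Rows without a matching validation record (Jane here) pass through unchanged.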

Here is the approach that I would use
Set the index of given dataframe to name and street
Create a new dataframe from list_validation and set its index to name and street as well.
Mask the empty values in df1 and fill the masked values using the values from df2
c = ['name', 'street']
df1 = df.set_index(c)
df2 = pd.DataFrame(list_validation).set_index(c)
df1.mask(df1.eq('')).fillna(df2).reset_index()
       name          street stamp       city  other irrelevant columns for this task
0      Fred    Baker Street  T012     London                                       1
1      Jane  Downing Street   T03     London                                       2
2     Alice    Baker Street   T05     London                                       3
3  Margaret   Castle Street   T01  Cambridge                                       4

Related

Need help translating a nested dictionary into a pandas dataframe

Looking into translating the following nested dictionary which is an API pull from Yelp into a pandas dataframe to run visualization on:
Top 50 Pizzerias in Chicago
{'businesses': [{'alias': 'pequods-pizzeria-chicago',
'categories': [{'alias': 'pizza', 'title': 'Pizza'}],
'coordinates': {'latitude': 41.92187, 'longitude': -87.664486},
'display_phone': '(773) 327-1512',
'distance': 2158.7084581522413,
'id': 'DXwSYgiXqIVNdO9dazel6w',
'image_url': 'https://s3-media1.fl.yelpcdn.com/bphoto/8QJUNblfCI0EDhOjuIWJ4A/o.jpg',
'is_closed': False,
'location': {'address1': '2207 N Clybourn Ave',
'address2': '',
'address3': '',
'city': 'Chicago',
'country': 'US',
'display_address': ['2207 N Clybourn Ave',
'Chicago, IL 60614'],
'state': 'IL',
'zip_code': '60614'},
'name': "Pequod's Pizzeria",
'phone': '+17733271512',
'price': '$$',
'rating': 4.0,
'review_count': 6586,
'transactions': ['restaurant_reservation', 'delivery'],
'url': 'https://www.yelp.com/biz/pequods-pizzeria-chicago?adjust_creative=wt2WY5Ii_urZB8YeHggW2g&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=wt2WY5Ii_urZB8YeHggW2g'},
{'alias': 'lou-malnatis-pizzeria-chicago',
'categories': [{'alias': 'pizza', 'title': 'Pizza'},
{'alias': 'italian', 'title': 'Italian'},
{'alias': 'sandwiches', 'title': 'Sandwiches'}],
'coordinates': {'latitude': 41.890357,
'longitude': -87.633704},
'display_phone': '(312) 828-9800',
'distance': 4000.9990531720227,
'id': '8vFJH_paXsMocmEO_KAa3w',
'image_url': 'https://s3-media3.fl.yelpcdn.com/bphoto/9FiL-9Pbytyg6usOE02lYg/o.jpg',
'is_closed': False,
'location': {'address1': '439 N Wells St',
'address2': '',
'address3': '',
'city': 'Chicago',
'country': 'US',
'display_address': ['439 N Wells St',
'Chicago, IL 60654'],
'state': 'IL',
'zip_code': '60654'},
'name': "Lou Malnati's Pizzeria",
'phone': '+13128289800',
'price': '$$',
'rating': 4.0,
'review_count': 6368,
'transactions': ['pickup', 'delivery'],
'url': 'https://www.yelp.com/biz/lou-malnatis-pizzeria-chicago?adjust_creative=wt2WY5Ii_urZB8YeHggW2g&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=wt2WY5Ii_urZB8YeHggW2g'},
....]
I've tried the below and iterations of it but haven't had any luck.
df = pd.DataFrame.from_dict(topresponse)
I'm really new to coding, so any advice would be helpful.
response["businesses"] (topresponse["businesses"] in the question's code) is a list of records, so:
df = pd.DataFrame.from_records(response["businesses"])
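With from_records, nested values such as coordinates and location stay as dicts inside a single cell. If separate columns are wanted for plotting, pd.json_normalize can expand nested dicts into dotted column names. This is a minimal sketch that assumes pandas 1.0 or later and uses two trimmed-down records in the same shape as the question's data:

```python
import pandas as pd

# Two trimmed-down records shaped like response["businesses"]
businesses = [
    {'name': "Pequod's Pizzeria", 'rating': 4.0, 'review_count': 6586,
     'coordinates': {'latitude': 41.92187, 'longitude': -87.664486},
     'location': {'city': 'Chicago', 'zip_code': '60614'}},
    {'name': "Lou Malnati's Pizzeria", 'rating': 4.0, 'review_count': 6368,
     'coordinates': {'latitude': 41.890357, 'longitude': -87.633704},
     'location': {'city': 'Chicago', 'zip_code': '60654'}},
]

# Nested dicts become columns such as 'coordinates.latitude'
flat = pd.json_normalize(businesses)
print(sorted(flat.columns))
```

List-valued fields (categories, transactions) are not expanded by this call and remain lists in their cells.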

sum values of specific keys in a dict

I have a list of dicts that looks like this:
source_dict = [{'ppl': 10, 'items': 15, 'airport': 'lax', 'city': 'Los Angeles', 'timestamp': 1, 'region': 'North America', 'country': 'United States'},
{'ppl': 20, 'items': 32, 'airport': 'JFK', 'city': 'New York', 'timestamp': 2, 'region': 'North America', 'country': 'United States'},
{'ppl': 50, 'items': 20, 'airport': 'ABC', 'city': 'London', 'timestamp': 1, 'region': 'Europe', 'country': 'United Kingdom'}... ]
#Gets the list of countries in the dict
countries = list(set(stats['country'] for stats in source_dict))
I know I can use a collections.Counter:
counter = collections.Counter()
for d in source_dict:
    counter.update(d)
But I want to group by country and get totals for only certain keys, not all of them.
So the result should be
{'Country': 'United States', 'ppl': 30, 'items': 47},
{'Country': 'United Kingdom', 'ppl': 50, 'items': 20},...
I'm not sure how to incorporate multiple keys into a counter to produce that result.
This is one approach using collections.defaultdict & collections.Counter.
Ex:
from collections import defaultdict, Counter
source_dict = [{'ppl': 10, 'items': 15, 'airport': 'lax', 'city': 'Los Angeles', 'timestamp': 1, 'region': 'North America', 'country': 'United States'},
{'ppl': 20, 'items': 32, 'airport': 'JFK', 'city': 'New York', 'timestamp': 2, 'region': 'North America', 'country': 'United States'},
{'ppl': 50, 'items': 20, 'airport': 'ABC', 'city': 'London', 'timestamp': 1, 'region': 'Europe', 'country': 'United Kingdom'} ]
result = defaultdict(Counter)
for stats in source_dict:
    result[stats['country']].update(Counter({'ppl': stats['ppl'], "items": stats['items']}))
#result = [{'Country': k, **v} for k, v in result.items()] #Required output
print(result)
Output:
defaultdict(<class 'collections.Counter'>,
{'United Kingdom': Counter({'ppl': 50, 'items': 20}),
'United States': Counter({'items': 47, 'ppl': 30})})
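To get the list-of-dicts shape asked for in the question, the per-country counters can be unpacked at the end. A minimal sketch of that idea; the helper name sum_by and its parameters are hypothetical, not part of the answer above:

```python
from collections import Counter, defaultdict

source_dict = [
    {'ppl': 10, 'items': 15, 'airport': 'lax', 'city': 'Los Angeles', 'timestamp': 1,
     'region': 'North America', 'country': 'United States'},
    {'ppl': 20, 'items': 32, 'airport': 'JFK', 'city': 'New York', 'timestamp': 2,
     'region': 'North America', 'country': 'United States'},
    {'ppl': 50, 'items': 20, 'airport': 'ABC', 'city': 'London', 'timestamp': 1,
     'region': 'Europe', 'country': 'United Kingdom'},
]


def sum_by(records, group_key, sum_keys):
    """Sum only sum_keys, grouped by the value stored under group_key."""
    totals = defaultdict(Counter)
    for rec in records:
        for key in sum_keys:
            totals[rec[group_key]][key] += rec[key]
    # One plain dict per group, with the group value under 'Country'
    return [{'Country': group, **counts} for group, counts in totals.items()]


result = sum_by(source_dict, 'country', ('ppl', 'items'))
print(result)
```

Keys that are not listed in sum_keys (timestamp, airport and so on) are ignored rather than summed.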
In pandas you can do this:
import io
import pandas as pd
dff=io.StringIO("""ppl,items,airport,city,timestamp,region,country
10,15,lax,Los Angeles,1,North America,United States
20,32,JFK,New York,2,North America,United States
50,20,ABC,London,1,Europe,United Kingdom""")
df3=pd.read_csv(dff)
df3
   ppl  items airport         city  timestamp         region         country
0   10     15     lax  Los Angeles          1  North America   United States
1   20     32     JFK     New York          2  North America   United States
2   50     20     ABC       London          1         Europe  United Kingdom
# group by country, as asked in the question
df3.groupby('country').agg({'ppl':'sum', 'items':'sum'})
#                 ppl  items
#country
#United Kingdom    50     20
#United States     30     47

How to find the number of people from a particular country in the data below?

[
{'Year': 1901,
'Category': 'Chemistry',
'Prize': 'The Nobel Prize in Chemistry 1901',
'Motivation': '"in recognition of the extraordinary services he has rendered by the discovery of the laws of chemical dynamics and osmotic pressure in solutions"',
'Prize Share': '1/1',
'Laureate ID': 160,
'Laureate Type': 'Individual',
'Full Name': "Jacobus Henricus van 't Hoff",
'Birth Date': '1852-08-30',
'Birth City': 'Rotterdam',
'Birth Country': 'Netherlands',
'Sex': 'Male',
'Organization Name': 'Berlin University',
'Organization City': 'Berlin',
'Organization Country': 'Germany',
'Death Date': '1911-03-01',
'Death City': 'Berlin',
'Death Country': 'Germany'},
{'Year': 1901,
'Category': 'Literature',
'Prize': 'The Nobel Prize in Literature 1901',
'Motivation': '"in special recognition of his poetic composition, which gives evidence of lofty idealism, artistic perfection and a rare combination of the qualities of both heart and intellect"',
'Prize Share': '1/1',
'Laureate ID': 569,
'Laureate Type': 'Individual',
'Full Name': 'Sully Prudhomme',
'Birth Date': '1839-03-16',
'Birth City': 'Paris',
'Birth Country': 'France',
'Sex': 'Male',
'Organization Name': '',
'Organization City': '',
'Organization Country': '',
'Death Date': '1907-09-07',
'Death City': 'Châtenay',
'Death Country': 'France'}
]
If you want to find how many people share each birth country in the given list of dicts, you can use the following code (val is the list shown above):
from collections import Counter
li = [each['Birth Country'] for each in val if each['Birth Country']]
print(dict(Counter(li)))
OUTPUT
{'Netherlands': 1, 'France': 1}
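If only one particular country is of interest, a direct count avoids building the full tally. A minimal sketch, using records trimmed to the two fields needed; the function name count_from and the variable laureates are hypothetical:

```python
from collections import Counter

# Records trimmed to the fields used here
laureates = [
    {'Full Name': "Jacobus Henricus van 't Hoff", 'Birth Country': 'Netherlands'},
    {'Full Name': 'Sully Prudhomme', 'Birth Country': 'France'},
]


def count_from(records, country, field='Birth Country'):
    """Count the records whose field equals the given country."""
    return sum(1 for rec in records if rec.get(field) == country)


# Tally for every country at once, skipping empty values
by_country = Counter(rec['Birth Country'] for rec in laureates if rec.get('Birth Country'))

print(count_from(laureates, 'France'))
print(by_country['Netherlands'])
```

Passing field='Death Country' or field='Organization Country' counts by those columns instead.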

How do you handle outputs from google maps distance matrix api when using with python?

I need assistance handling the dict file type that is returned from the google maps api.
Currently, the results are handing me a dict of the resulting data (starting addresses, ending addresses, travel times, distances etc) which I cannot process. I can extract the start and end addresses simply, but the bulk data is proving difficult to extract, and I think it is because of its structure.
The sample of the code I have is as follows;
import googlemaps
import csv
import pandas as pd
gmaps = googlemaps.Client(key='YOUR_API_KEY')  # assumption: the client setup was not shown in the original sample
# squeeze=True was removed from read_csv in pandas 2.0; .squeeze("columns") gives the same Series
postcodes = pd.read_csv("SW.csv", sep=',', usecols=['postcode']).squeeze("columns")
infile1 = open('SW.csv', 'r')
reader1 = csv.reader(infile1)
Location1 = postcodes[0:10]
Location2 = 'SW1A 2HQ'
my_distance = gmaps.distance_matrix(Location1, Location2, mode='bicycling', language=None, avoid=None, units='metric',
                                    departure_time='2475925955', arrival_time=None,
                                    transit_routing_preference=None)
print(my_distance)
Which generates the following output;
{'origin_addresses': ['Cossar Mews, Brixton, London SW2 2TR, UK',
'Bushnell Rd, London SW17 8QP, UK', 'Maltings Pl, Fulham, London SW6
2BX, UK', 'Knightsbridge, London SW7 1BJ, UK', 'Chelsea, London SW3
3EE, UK', 'Hester Rd, London SW11 4AJ, UK', 'Brixton, London SW2 1HZ,
UK', 'Randall Cl, London SW11 3TG, UK', 'Sloane St, London SW1X 9SF,
UK', 'Binfield Rd, London SW4 6TA, UK'], 'rows': [{'elements':
[{'duration': {'text': '28 mins', 'value': 1657}, 'status': 'OK',
'distance': {'text': '7.5 km', 'value': 7507}}]}, {'elements':
[{'duration': {'text': '31 mins', 'value': 1850}, 'status': 'OK',
'distance': {'text': '9.2 km', 'value': 9176}}]}, {'elements':
[{'duration': {'text': '27 mins', 'value': 1620}, 'status': 'OK',
'distance': {'text': '7.0 km', 'value': 7038}}]}, {'elements':
[{'duration': {'text': '16 mins', 'value': 953}, 'status': 'OK',
'distance': {'text': '4.0 km', 'value': 4038}}]}, {'elements':
[{'duration': {'text': '15 mins', 'value': 899}, 'status': 'OK',
'distance': {'text': '3.4 km', 'value': 3366}}]}, {'elements':
[{'duration': {'text': '21 mins', 'value': 1260}, 'status': 'OK',
'distance': {'text': '5.3 km', 'value': 5265}}]}, {'elements':
[{'duration': {'text': '28 mins', 'value': 1682}, 'status': 'OK',
'distance': {'text': '7.5 km', 'value': 7502}}]}, {'elements':
[{'duration': {'text': '23 mins', 'value': 1368}, 'status': 'OK',
'distance': {'text': '5.9 km', 'value': 5876}}]}, {'elements':
[{'duration': {'text': '14 mins', 'value': 839}, 'status': 'OK',
'distance': {'text': '3.3 km', 'value': 3341}}]}, {'elements':
[{'duration': {'text': '16 mins', 'value': 982}, 'status': 'OK',
'distance': {'text': '4.3 km', 'value': 4294}}]}],
'destination_addresses': ['Horse Guards Rd, London SW1A 2HQ, UK'],
'status': 'OK'}
I am then using the following code to extract it;
origin = my_distance['origin_addresses']
dest = my_distance['destination_addresses']
dist = my_distance['rows']
I have tried the df_from_list and many others to try and process the dist data. The end goal is to have a matrix with the origin addresses on each row, the destination addresses forming columns, with distance and time as data variables within these columns.
Something similar to this
| DEST 1 | DEST 2 |
| TIME | DIST | TIME | DIST |
START 1 | X | Y | Z | T |
START 2 | A | B | C | T |
Please can someone help me process the my_distance output (shown above) into an architecture similar to that shown above.
Thanks!
This basically creates a dictionary with the start and destination addresses.
Each destination address has a list of tuples as its value. The first element of each tuple is the duration and the second is the distance,
e.g. (45, 7.0) means 45 min and 7.0 km. Then I create the dataframe with pandas.DataFrame.from_dict()
import pandas as pd
data = my_distance  # the response dict from the question
dct = {d_address: [] for d_address in data['destination_addresses']}
dct['starts'] = []
for i in range(len(data['origin_addresses'])):
    # elements[j] holds the result for destination j of origin i
    for j, key in enumerate(data['destination_addresses']):
        element = data['rows'][i]['elements'][j]
        # parsing the text assumes values like '28 mins' and '7.5 km'
        duration = int(element['duration']['text'].split(' ')[0])
        distance = float(element['distance']['text'].split(' ')[0])
        dct[key].append((duration, distance))
    dct['starts'].append(data['origin_addresses'][i])
df = pd.DataFrame.from_dict(dct)
df.set_index('starts', inplace=True)
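Parsing the 'text' fields breaks on values such as '1 hour 5 mins'. The numeric 'value' fields (seconds for duration, metres for distance) are safer to work with. The sketch below builds a nested mapping of origin to (destination, 'TIME') and (destination, 'DIST') cells, mirroring the layout asked for in the question. The function name to_matrix is hypothetical, and the sample is the first two rows of the output shown in the question:

```python
my_distance = {
    'origin_addresses': ['Cossar Mews, Brixton, London SW2 2TR, UK',
                         'Bushnell Rd, London SW17 8QP, UK'],
    'destination_addresses': ['Horse Guards Rd, London SW1A 2HQ, UK'],
    'rows': [
        {'elements': [{'duration': {'text': '28 mins', 'value': 1657}, 'status': 'OK',
                       'distance': {'text': '7.5 km', 'value': 7507}}]},
        {'elements': [{'duration': {'text': '31 mins', 'value': 1850}, 'status': 'OK',
                       'distance': {'text': '9.2 km', 'value': 9176}}]},
    ],
    'status': 'OK',
}


def to_matrix(response):
    """Map each origin to {(destination, 'TIME'|'DIST'): number}."""
    matrix = {}
    # rows[i] belongs to origin i; elements[j] belongs to destination j
    for origin, row in zip(response['origin_addresses'], response['rows']):
        cells = {}
        for dest, element in zip(response['destination_addresses'], row['elements']):
            if element['status'] != 'OK':
                # No route found: leave the cells empty
                cells[(dest, 'TIME')] = None
                cells[(dest, 'DIST')] = None
                continue
            cells[(dest, 'TIME')] = element['duration']['value'] / 60    # minutes
            cells[(dest, 'DIST')] = element['distance']['value'] / 1000  # kilometres
        matrix[origin] = cells
    return matrix


matrix = to_matrix(my_distance)
for origin, cells in matrix.items():
    print(origin, cells)
```

Each inner dict is one row of the desired table, so the mapping can be turned into whatever tabular structure is convenient afterwards.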
I create an empty dataframe before running gmaps.distance_matrix, and place the dictionary keys into the dataframe. Similar to the above solution:
traffic = pd.DataFrame({'time': [], 'origins': [], 'destinations': [], 'destination_addresses': [], 'origin_addresses': [], 'rows': [], 'status': []})
for origin in origins:
    for destination in destinations:
        # DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
        # '00:00' is a placeholder string: the original used the bare literal 00:00, which is not valid Python
        row = {'time': ['00:00'], 'origins': [origin], 'destinations': [destination]}
        traffic = pd.concat([traffic, pd.DataFrame([row])], ignore_index=True)
        if origin != destination:
            if cityname == cityname:
                # Get travel distance and time for a matrix of origins and destinations
                traffic_result = gmaps.distance_matrix((origin), (destination),
                    mode="driving", language=None, avoid=None, units="metric",
                    departure_time=None,  # the original passed 00:00 here; supply a datetime or Unix timestamp
                    arrival_time=None, transit_mode=None,
                    transit_routing_preference=None, traffic_model=None, region=None)
                for key in traffic_result.keys():
                    for value in traffic_result[key]:
                        print(key, value)
                        traffic = pd.concat([traffic, pd.DataFrame([{key: [value]}])], ignore_index=True)

Python Pandas Adding conditions on cells with duplicate values

This is a followup to my previous question propagating values over non-unique (duplicate) cells in pandas
I have a DataFrame:
import pandas as pd
df = pd.DataFrame({'First': ['Sam', 'Greg', 'Steve', 'Sam',
'Jill', 'Bill', 'Nod', 'Mallory', 'Ping', 'Lamar'],
'Last': ['Stevens', 'Hamcunning', 'Strange', 'Stevens',
'Vargas', 'Simon', 'Purple', 'Green', 'Simon', 'Simon'],
'Address': ['112 Fake St',
'13 Crest St',
'14 Main St',
'112 Fake St',
'2 Morningwood',
'7 Cotton Dr',
'14 Main St',
'20 Main St',
'7 Cotton Dr',
'7 Cotton Dr'],
'Status': ['Infected', '', 'Infected', '', '', '', '','', '', 'Infected'],
'Level': [10, 2, 7, 5, 2, 10, 10, 20, 1, 1],
})
And let's say this time I want to propagate the Status value 'Infected' to everyone at the same Address, with an additional condition such as having the same value in Last.
So the result would look like:
df2 = df.copy(deep=True)
df2['Status'] = ['Infected', '', 'Infected', 'Infected', '', 'Infected', '', '', 'Infected', 'Infected']
What if I wanted an individual to be marked infected if they are at the same address but not at the same level? The results would be:
df3 = df.copy(deep=True)
df3['Status'] = ['Infected', '', 'Infected', '', '', 'Infected', '', '', '', 'Infected']
How would I do this? Is this a groupby problem?
"Same address" is expressed by "groupby".
import pandas as pd
df=pd.DataFrame({'First': [ 'Sam', 'Greg', 'Steve', 'Sam',
'Jill', 'Bill', 'Nod', 'Mallory', 'Ping', 'Lamar'],
'Last': [ 'Stevens', 'Hamcunning', 'Strange', 'Stevens',
'Vargas', 'Simon', 'Purple', 'Green', 'Simon', 'Simon'],
'Address': ['112 Fake St','13 Crest St','14 Main St','112 Fake St','2 Morningwood','7 Cotton Dr','14 Main St','20 Main St','7 Cotton Dr','7 Cotton Dr'],
'Status': ['Infected','','Infected','','','','','','','Infected'],
'Level': [10,2,7,5,2,10,10,20,1,1],
})
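# df2: mark every row of an (Address, Last) group that contains an infected row
df2_index = df.groupby(['Address', 'Last']).filter(lambda x: (x['Status'] == 'Infected').any()).index
df2 = df.copy()
df2.loc[df2_index, 'Status'] = 'Infected'

# df3: a row is infected if it already is, or if an infected row at the
# same Address has a different Level
def spread_by_level(x):
    infected = x['Status'] == 'Infected'
    return pd.Series(list('Infected' if (row['Status'] == 'Infected') or (infected & (x['Level'] != row['Level'])).any() else ''
                          for _, row in x.iterrows()), index=x.index)

df3_status = df.groupby('Address', as_index=False, group_keys=False).apply(spread_by_level)
df3 = df.copy()
df3['Status'] = df3_status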
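For the first case (same Address and same Last), a boolean mask built with groupby and transform is a possible alternative to filter. This is a sketch rather than the answer's method; it assumes a pandas version in which transform accepts the reduction name 'any', and it keeps only the columns the rule needs:

```python
import pandas as pd

df = pd.DataFrame({
    'Last': ['Stevens', 'Hamcunning', 'Strange', 'Stevens',
             'Vargas', 'Simon', 'Purple', 'Green', 'Simon', 'Simon'],
    'Address': ['112 Fake St', '13 Crest St', '14 Main St', '112 Fake St',
                '2 Morningwood', '7 Cotton Dr', '14 Main St', '20 Main St',
                '7 Cotton Dr', '7 Cotton Dr'],
    'Status': ['Infected', '', 'Infected', '', '', '', '', '', '', 'Infected'],
})

# True for every row whose (Address, Last) group contains an infected row
has_infected = (
    df['Status'].eq('Infected')
    .groupby([df['Address'], df['Last']])
    .transform('any')
)

df2 = df.copy()
df2.loc[has_infected, 'Status'] = 'Infected'
print(df2['Status'].tolist())
```

Because transform returns a Series aligned with the original index, the mask can be used directly in .loc without collecting index labels first.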
