Python Pandas Adding conditions on cells with duplicate values - python

This is a follow-up to my previous question, "Propagating values over non-unique (duplicate) cells in pandas".
I have a DataFrame:
import pandas as pd
df = pd.DataFrame({'First': ['Sam', 'Greg', 'Steve', 'Sam',
'Jill', 'Bill', 'Nod', 'Mallory', 'Ping', 'Lamar'],
'Last': ['Stevens', 'Hamcunning', 'Strange', 'Stevens',
'Vargas', 'Simon', 'Purple', 'Green', 'Simon', 'Simon'],
'Address': ['112 Fake St',
'13 Crest St',
'14 Main St',
'112 Fake St',
'2 Morningwood',
'7 Cotton Dr',
'14 Main St',
'20 Main St',
'7 Cotton Dr',
'7 Cotton Dr'],
'Status': ['Infected', '', 'Infected', '', '', '', '','', '', 'Infected'],
'Level': [10, 2, 7, 5, 2, 10, 10, 20, 1, 1],
})
Let's say this time I want to propagate the Status value 'Infected' to everyone at the same Address, with an additional condition: they must also have the same value in Last.
So the result would look like:
df2 = df.copy(deep=True)
df2['Status'] = ['Infected', '', 'Infected', 'Infected', '', 'Infected', '', '', 'Infected', 'Infected']
What if I wanted an individual to be marked infected if they are at the same address but not at the same level? The result would be:
df3 = df.copy(deep=True)
df3['Status'] = ['Infected', '', 'Infected', '', '', 'Infected', '', '', '', 'Infected']
How would I do this? Is this a groupby problem?

"Same address" is expressed by "groupby".
import pandas as pd
df=pd.DataFrame({'First': [ 'Sam', 'Greg', 'Steve', 'Sam',
'Jill', 'Bill', 'Nod', 'Mallory', 'Ping', 'Lamar'],
'Last': [ 'Stevens', 'Hamcunning', 'Strange', 'Stevens',
'Vargas', 'Simon', 'Purple', 'Green', 'Simon', 'Simon'],
'Address': ['112 Fake St','13 Crest St','14 Main St','112 Fake St','2 Morningwood','7 Cotton Dr','14 Main St','20 Main St','7 Cotton Dr','7 Cotton Dr'],
'Status': ['Infected','','Infected','','','','','','','Infected'],
'Level': [10,2,7,5,2,10,10,20,1,1],
})
df2_index = df.groupby(['Address', 'Last']).filter(lambda x: (x['Status'] == 'Infected').any()).index
df2 = df.copy()
df2.loc[df2_index, 'Status'] = 'Infected'
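def mark_infected(group):
    # A row is infected if it already is, or if some infected row at the
    # same address has a different Level.
    infected = group['Status'] == 'Infected'
    return pd.Series(
        ['Infected' if (row['Status'] == 'Infected')
         or (infected & (group['Level'] != row['Level'])).any() else ''
         for _, row in group.iterrows()],
        index=group.index)

df3_status = df.groupby('Address', as_index=False, group_keys=False).apply(mark_infected)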
df3 = df.copy()
df3['Status'] = df3_status
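A vectorized sketch of the same two rules, for cases where filter and iterrows become slow. The names infected, same_group, sources, pairs and hit are introduced here for illustration and are not part of the answer above. The second rule is applied as worded (same address, different level), so rows 3 and 6 are marked as well, which differs from the expected list in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'First': ['Sam', 'Greg', 'Steve', 'Sam', 'Jill',
              'Bill', 'Nod', 'Mallory', 'Ping', 'Lamar'],
    'Last': ['Stevens', 'Hamcunning', 'Strange', 'Stevens', 'Vargas',
             'Simon', 'Purple', 'Green', 'Simon', 'Simon'],
    'Address': ['112 Fake St', '13 Crest St', '14 Main St', '112 Fake St',
                '2 Morningwood', '7 Cotton Dr', '14 Main St', '20 Main St',
                '7 Cotton Dr', '7 Cotton Dr'],
    'Status': ['Infected', '', 'Infected', '', '', '', '', '', '', 'Infected'],
    'Level': [10, 2, 7, 5, 2, 10, 10, 20, 1, 1],
})

infected = df['Status'].eq('Infected')

# Rule 1: same Address and same Last as an infected row.
same_group = infected.groupby([df['Address'], df['Last']]).transform('any')
df2 = df.copy()
df2.loc[same_group, 'Status'] = 'Infected'

# Rule 2: same Address as an infected row, but a different Level.
# Pair every row with each infected row at its address, then keep the
# original row labels where the two levels differ.
sources = df.loc[infected, ['Address', 'Level']].rename(
    columns={'Level': 'SourceLevel'})
pairs = df.reset_index().merge(sources, on='Address')
hit = pairs.loc[pairs['Level'] != pairs['SourceLevel'], 'index'].unique()
df3 = df.copy()
df3.loc[hit, 'Status'] = 'Infected'

print(df2['Status'].tolist())
print(df3['Status'].tolist())
```

The merge builds one row per (person, infected housemate) pair, so it trades memory for speed when an address has many infected rows.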

Related

Need help translating a nested dictionary into a pandas dataframe

Looking into translating the following nested dictionary which is an API pull from Yelp into a pandas dataframe to run visualization on:
Top 50 Pizzerias in Chicago
{'businesses': [{'alias': 'pequods-pizzeria-chicago',
'categories': [{'alias': 'pizza', 'title': 'Pizza'}],
'coordinates': {'latitude': 41.92187, 'longitude': -87.664486},
'display_phone': '(773) 327-1512',
'distance': 2158.7084581522413,
'id': 'DXwSYgiXqIVNdO9dazel6w',
'image_url': 'https://s3-media1.fl.yelpcdn.com/bphoto/8QJUNblfCI0EDhOjuIWJ4A/o.jpg',
'is_closed': False,
'location': {'address1': '2207 N Clybourn Ave',
'address2': '',
'address3': '',
'city': 'Chicago',
'country': 'US',
'display_address': ['2207 N Clybourn Ave',
'Chicago, IL 60614'],
'state': 'IL',
'zip_code': '60614'},
'name': "Pequod's Pizzeria",
'phone': '+17733271512',
'price': '$$',
'rating': 4.0,
'review_count': 6586,
'transactions': ['restaurant_reservation', 'delivery'],
'url': 'https://www.yelp.com/biz/pequods-pizzeria-chicago?adjust_creative=wt2WY5Ii_urZB8YeHggW2g&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=wt2WY5Ii_urZB8YeHggW2g'},
{'alias': 'lou-malnatis-pizzeria-chicago',
'categories': [{'alias': 'pizza', 'title': 'Pizza'},
{'alias': 'italian', 'title': 'Italian'},
{'alias': 'sandwiches', 'title': 'Sandwiches'}],
'coordinates': {'latitude': 41.890357,
'longitude': -87.633704},
'display_phone': '(312) 828-9800',
'distance': 4000.9990531720227,
'id': '8vFJH_paXsMocmEO_KAa3w',
'image_url': 'https://s3-media3.fl.yelpcdn.com/bphoto/9FiL-9Pbytyg6usOE02lYg/o.jpg',
'is_closed': False,
'location': {'address1': '439 N Wells St',
'address2': '',
'address3': '',
'city': 'Chicago',
'country': 'US',
'display_address': ['439 N Wells St',
'Chicago, IL 60654'],
'state': 'IL',
'zip_code': '60654'},
'name': "Lou Malnati's Pizzeria",
'phone': '+13128289800',
'price': '$$',
'rating': 4.0,
'review_count': 6368,
'transactions': ['pickup', 'delivery'],
'url': 'https://www.yelp.com/biz/lou-malnatis-pizzeria-chicago?adjust_creative=wt2WY5Ii_urZB8YeHggW2g&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=wt2WY5Ii_urZB8YeHggW2g'},
....]
I've tried the below and iterations of it but haven't had any luck.
df = pd.DataFrame.from_dict(topresponse)
I'm really new to coding, so any advice would be helpful.
topresponse["businesses"] is a list of records, so:
df = pd.DataFrame.from_records(topresponse["businesses"])
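With from_records, nested fields such as coordinates and location stay as dictionaries inside single cells. If flat columns are easier to plot, pd.json_normalize (available in pandas 1.0 and later) is one option. A minimal sketch, assuming the API result is held in topresponse and using a trimmed copy of the two records shown in the question:

```python
import pandas as pd

# Trimmed copy of the question's data; only a few keys are kept.
topresponse = {'businesses': [
    {'name': "Pequod's Pizzeria",
     'rating': 4.0,
     'review_count': 6586,
     'coordinates': {'latitude': 41.92187, 'longitude': -87.664486},
     'location': {'city': 'Chicago', 'zip_code': '60614'}},
    {'name': "Lou Malnati's Pizzeria",
     'rating': 4.0,
     'review_count': 6368,
     'coordinates': {'latitude': 41.890357, 'longitude': -87.633704},
     'location': {'city': 'Chicago', 'zip_code': '60654'}},
]}

# Nested dicts become dotted column names such as 'coordinates.latitude'.
df = pd.json_normalize(topresponse['businesses'])
print(df.columns.tolist())
print(df[['name', 'review_count', 'location.zip_code']])
```

List-valued fields such as categories or transactions are left as lists in a single column.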

Adding value to dataframe based on dict

I have a problem with a list of dicts like this:
list_validation = [{'name': 'Alice', 'street': 'Baker Street', 'stamp': 'T05', 'city': 'London'}, {'name': 'Margaret', 'street': 'Castle Street', 'stamp': 'T01', 'city': 'Cambridge'}, {'name': 'Fred', 'street': 'Baker Street', 'stamp': 'T012', 'city': 'London'}]
Now in my dataframe there are columns
df = pd.DataFrame({'name': ['Fred', 'Jane', 'Alice', 'Margaret'], 'street': ['Baker Street', 'Downing Street', 'Baker Street', 'Castle Street'],
'stamp': ['', 'T03', '', ''],
'city': ['', 'London', '', ''],
'other irrelevant columns for this task' : [1, 2, 3, 4]
})
What I want is to fill the gaps of the stamp columns and the city columns, so it looks like this:
df2 = pd.DataFrame({'name': ['Fred', 'Jane', 'Alice', 'Margaret'], 'street': ['Baker Street', 'Downing Street', 'Baker Street', 'Castle Street'],
'stamp': ['T012', 'T03', 'T05', 'T01'],
'city': ['London', 'London', 'London', 'Cambridge'],
'other irrelevant columns for this task' : [1, 2, 3, 4]
})
I have been trying this, but it is not working:
new_dict = df[['name', 'street', 'stamp', 'city']].to_dict()
list(new_dict)
for l in list_validation:
    for row in new_dict:
        if l['name'] == row['name'] and l['street'] == row['street']:
            row['stamp'] = l['stamp']
            row['city'] = l['city']
One approach is to iterate over each row in the DataFrame and fill the missing values from the list.
List Definition:
list_validation = [{'name': 'Alice', 'street': 'Baker Street', 'stamp': 'T05', 'city': 'London'}, {'name': 'Margaret', 'street': 'Castle Street', 'stamp': 'T01', 'city': 'Cambridge'}, {'name': 'Fred', 'street': 'Baker Street', 'stamp': 'T012', 'city': 'London'}]
DataFrame Definition:
df = pd.DataFrame({'name': ['Fred', 'Jane', 'Alice', 'Margaret'], 'street': ['Baker Street', 'Downing Street', 'Baker Street', 'Castle Street'],
'stamp': ['', 'T03', '', ''],'city': ['', 'London', '', ''],'other irrelevant columns for this task' : [1, 2, 3, 4]})
Logic
for r, i in df.iterrows():
    name_in_df = i['name']
    # if pd.isna(i['stamp']):
    if not i['stamp']:
        for j in list_validation:
            if j['name'] == name_in_df:
                value_in_list = j['stamp']
                df.loc[r, 'stamp'] = value_in_list
                break
    # if pd.isna(i['city']):
    if not i['city']:
        name_in_df = i['name']
        for j in list_validation:
            if j['name'] == name_in_df:
                value_in_list = j['city']
                df.loc[r, 'city'] = value_in_list
                break
df
Here is the approach that I would use
Set the index of given dataframe to name and street
Create a new dataframe from list_validation and set its index to name and street as well.
Mask the empty values in df1 and fill the masked values using the values from df2
c = ['name', 'street']
df1 = df.set_index(c)
df2 = pd.DataFrame(list_validation).set_index(c)
df1.mask(df1.eq('')).fillna(df2).reset_index()
       name          street stamp       city  other irrelevant columns for this task
0      Fred    Baker Street  T012     London                                       1
1      Jane  Downing Street   T03     London                                       2
2     Alice    Baker Street   T05     London                                       3
3  Margaret   Castle Street   T01  Cambridge                                       4
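The same fill can also be sketched with a left merge on name and street, which keeps the original row order and matches on both keys. The '_valid' suffix and the names valid, merged and result are chosen here for illustration only.

```python
import pandas as pd

list_validation = [
    {'name': 'Alice', 'street': 'Baker Street', 'stamp': 'T05', 'city': 'London'},
    {'name': 'Margaret', 'street': 'Castle Street', 'stamp': 'T01', 'city': 'Cambridge'},
    {'name': 'Fred', 'street': 'Baker Street', 'stamp': 'T012', 'city': 'London'},
]
df = pd.DataFrame({
    'name': ['Fred', 'Jane', 'Alice', 'Margaret'],
    'street': ['Baker Street', 'Downing Street', 'Baker Street', 'Castle Street'],
    'stamp': ['', 'T03', '', ''],
    'city': ['', 'London', '', ''],
    'other irrelevant columns for this task': [1, 2, 3, 4],
})

valid = pd.DataFrame(list_validation)

# Bring the validation values alongside the originals, row by row.
merged = df.merge(valid, on=['name', 'street'], how='left',
                  suffixes=('', '_valid'))

# Replace only the empty strings with the validation value.
for col in ['stamp', 'city']:
    merged[col] = merged[col].mask(merged[col].eq(''), merged[col + '_valid'])

result = merged.drop(columns=['stamp_valid', 'city_valid'])
print(result)
```

Rows without a match in list_validation (Jane here) keep whatever they already had.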

Looping through a list of dictionaries in a dictionary

I'm extremely new to python and I have been having trouble creating a loop for my dictionary which contains a list of dictionaries. I'd appreciate the help!
mylist = {'ID_01': [{'blood type': 'A',
'Age': '15',
'eye colour': 'Green',
'Location': 'Toronto',
'Initial Score': '30',
'Final Score': '50'},
{'blood type': 'B',
'Age': '20',
'eye colour': 'Green',
'Location': 'Tokyo',
'Initial Score': '50',
'Final Score': '80'}],
'ID_02': [{'blood type': 'C',
'Age': '10',
'eye colour': 'Blue',
'Location': 'Toronto',
'Initial Score': '90',
'Final Score': '100'},
{'blood type': 'D',
'Age': '13',
'eye colour': 'Blue',
'Location': 'Tokyo',
'Initial Score': '60',
'Final Score': '90'}]}
new_dictionary = {}
if location is Toronto, add ID
and
if location is tokyo, check if initial score of Tokyo (50) is smaller than initial score of Toronto (30) AND if final score of Tokyo (80) is bigger than the initial score of Toronto(30) but smaller than the final score of Toronto, if yes, add all data associated with that ID to new_dictionary.
a loop to add the ID data to new_dictionary if :
initial score of tokyo < initial score of toronto
AND
initial score of toronto < final score of tokyo < final score of toronto
Thank You!
Here is what you can do:
mylist = {'ID_01': [{'blood type': 'A',
'Age': '15',
'eye colour': 'Green',
'Location': 'Toronto',
'Initial Score': '30',
'Final Score': '50'}],
'ID_02': [{'blood type': 'B',
'Age': '10',
'eye colour': 'Blue',
'Location': 'Tokyo',
'Initial Score': '50',
'Final Score': '80'}]}
# Convert the scores to int so they compare numerically rather than as strings.
initial_score_of_tokyo = int([mylist[ID][0]["Initial Score"] for ID in mylist.keys() if mylist[ID][0]['Location'] == 'Tokyo'][0])
initial_score_of_toronto = int([mylist[ID][0]["Initial Score"] for ID in mylist.keys() if mylist[ID][0]['Location'] == 'Toronto'][0])
final_score_of_tokyo = int([mylist[ID][0]["Final Score"] for ID in mylist.keys() if mylist[ID][0]['Location'] == 'Tokyo'][0])
final_score_of_toronto = int([mylist[ID][0]["Final Score"] for ID in mylist.keys() if mylist[ID][0]['Location'] == 'Toronto'][0])
new_dictionary = {}
for ID in mylist.keys():
    if mylist[ID][0]['Location'] == 'Toronto' or (initial_score_of_tokyo < initial_score_of_toronto and initial_score_of_toronto < final_score_of_tokyo < final_score_of_toronto):
        new_dictionary.update({ID: mylist[ID]})
print(new_dictionary)
Output:
{'ID_01': [{'blood type': 'A',
'Age': '15',
'eye colour': 'Green',
'Location': 'Toronto',
'Initial Score': '30',
'Final Score': '50'}]}
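The answer above reshapes the data to one record per ID. A minimal sketch that keeps the question's original layout (a Toronto and a Tokyo record under each ID) and applies the two conditions per ID is shown here. It assumes the comparison is meant within each ID, trims the records to the keys that matter, and adds a hypothetical ID_03 (not in the question) so that one entry passes. The helper name scores is also an illustration.

```python
mylist = {
    'ID_01': [{'Location': 'Toronto', 'Initial Score': '30', 'Final Score': '50'},
              {'Location': 'Tokyo', 'Initial Score': '50', 'Final Score': '80'}],
    'ID_02': [{'Location': 'Toronto', 'Initial Score': '90', 'Final Score': '100'},
              {'Location': 'Tokyo', 'Initial Score': '60', 'Final Score': '90'}],
    # Hypothetical entry, not in the question, added so one ID qualifies.
    'ID_03': [{'Location': 'Toronto', 'Initial Score': '40', 'Final Score': '95'},
              {'Location': 'Tokyo', 'Initial Score': '20', 'Final Score': '70'}],
}


def scores(records, city):
    """Return (initial, final) as ints for the given city, or None."""
    for rec in records:
        if rec['Location'] == city:
            return int(rec['Initial Score']), int(rec['Final Score'])
    return None


new_dictionary = {}
for person_id, records in mylist.items():
    toronto = scores(records, 'Toronto')
    tokyo = scores(records, 'Tokyo')
    if toronto is None or tokyo is None:
        continue  # skip IDs that lack one of the two cities
    # initial Tokyo < initial Toronto, and
    # initial Toronto < final Tokyo < final Toronto
    if tokyo[0] < toronto[0] and toronto[0] < tokyo[1] < toronto[1]:
        new_dictionary[person_id] = records

print(list(new_dictionary))
```

With the question's own two IDs neither condition set holds, so only the hypothetical ID_03 is kept.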

How to find the number of persons from a particular country in the data below?

[
{'Year': 1901,
'Category': 'Chemistry',
'Prize': 'The Nobel Prize in Chemistry 1901',
'Motivation': '"in recognition of the extraordinary services he has rendered by the discovery of the laws of chemical dynamics and osmotic pressure in solutions"',
'Prize Share': '1/1',
'Laureate ID': 160,
'Laureate Type': 'Individual',
'Full Name': "Jacobus Henricus van 't Hoff",
'Birth Date': '1852-08-30',
'Birth City': 'Rotterdam',
'Birth Country': 'Netherlands',
'Sex': 'Male',
'Organization Name': 'Berlin University',
'Organization City': 'Berlin',
'Organization Country': 'Germany',
'Death Date': '1911-03-01',
'Death City': 'Berlin',
'Death Country': 'Germany'},
{'Year': 1901,
'Category': 'Literature',
'Prize': 'The Nobel Prize in Literature 1901',
'Motivation': '"in special recognition of his poetic composition, which gives evidence of lofty idealism, artistic perfection and a rare combination of the qualities of both heart and intellect"',
'Prize Share': '1/1',
'Laureate ID': 569,
'Laureate Type': 'Individual',
'Full Name': 'Sully Prudhomme',
'Birth Date': '1839-03-16',
'Birth City': 'Paris',
'Birth Country': 'France',
'Sex': 'Male',
'Organization Name': '',
'Organization City': '',
'Organization Country': '',
'Death Date': '1907-09-07',
'Death City': 'Châtenay',
'Death Country': 'France'}
]
If you want to find how many persons share the same birth country in the given list of dicts (called val here), you can use the following code:
from collections import Counter
# val is the list of dicts shown above
li = [each['Birth Country'] for each in val if each['Birth Country']]
print(dict(Counter(li)))
OUTPUT
{'Netherlands': 1, 'France': 1}
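To get the count for one particular country, the Counter can be indexed directly. A small sketch, assuming the list is named laureates (a name chosen here) and trimmed to the two keys that matter:

```python
from collections import Counter

# Trimmed copy of the two records from the question.
laureates = [
    {'Full Name': "Jacobus Henricus van 't Hoff", 'Birth Country': 'Netherlands'},
    {'Full Name': 'Sully Prudhomme', 'Birth Country': 'France'},
]

# Skip records whose birth country is an empty string.
counts = Counter(p['Birth Country'] for p in laureates if p['Birth Country'])

print(counts['France'])   # 1
print(counts['Germany'])  # 0, a Counter returns 0 for missing keys
```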

How can I store data to a data dictionary in Python when headings are in mixed up order

I'd like to store the following data in a data dictionary so that I can easily export it to a CSV file. The problem is that the columns for each school id are not always in the same order:
text = """
school id= 28392
name|year|degree|age|race
Susan A. Smith|2007|PhD|27|white
Fred Collins|2006|PhD|26|hispanic
Amber Real|2007|MBA|28|white
Mike Lee|2003|PhD|27|white
school id= 273533123
name|year|age|race|degree
John B. Black|2003|27|hispanic|MBA
Steven Smith|2005|28|black|PhD
Jacob Waters|2003|25|hispanic|MBA
school id = 3452332
name|year|race|age|degree
Peter Hintze|2002|white|27|Bachelors
Ann Graden|2004|black|25|MBA
Bryan Stewart|2004|white|28|PhD
"""
I'd like to be able to eventually output all data to a CSV file with the following headings:
school id|year|name|age|race|degree
Can I do this in Python?
This actually seems pretty easy. Process the file into a data structure, then export it into a csv.
school = None
headers = None
data = {}
for line in text.splitlines():
    if line.startswith("school id"):
        # New school block: remember its id and expect a header line next.
        school = line.split('=')[1].strip()
        headers = None
        continue
    if school is not None and headers is None:
        headers = line.split('|')
        continue
    if school is not None and headers is not None and line:
        if school not in data:
            data[school] = []
        # Pair each value with its header, whatever the column order is.
        datum = dict(zip(headers, line.split('|')))
        data[school].append(datum)
In [29]: data
Out[29]:
{'273533123': [{'age': '27',
'degree': 'MBA',
'name': 'John B. Black',
'race': 'hispanic',
'year': '2003'},
{'age': '28',
'degree': 'PhD',
'name': 'Steven Smith',
'race': 'black',
'year': '2005'},
{'age': '25',
'degree': 'MBA',
'name': 'Jacob Waters',
'race': 'hispanic',
'year': '2003'}],
'28392': [{'age': '27',
'degree': 'PhD',
'name': 'Susan A. Smith',
'race': 'white',
'year': '2007'},
{'age': '26',
'degree': 'PhD',
'name': 'Fred Collins',
'race': 'hispanic',
'year': '2006'},
{'age': '28',
'degree': 'MBA',
'name': 'Amber Real',
'race': 'white',
'year': '2007'},
{'age': '27',
'degree': 'PhD',
'name': 'Mike Lee',
'race': 'white',
'year': '2003'}],
'3452332': [{'age': '27',
'degree': 'Bachelors',
'name': 'Peter Hintze',
'race': 'white',
'year': '2002'},
{'age': '25',
'degree': 'MBA',
'name': 'Ann Graden',
'race': 'black',
'year': '2004'},
{'age': '28',
'degree': 'PhD',
'name': 'Bryan Stewart',
'race': 'white',
'year': '2004'}]}
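The export step is not shown above. A minimal sketch with csv.DictWriter, which writes each row's values in the order given by fieldnames regardless of the order of the keys in the dict. It uses a shortened copy of data and writes to an in-memory buffer; for a real file, open('out.csv', 'w', newline='') could be used instead (the file name is a placeholder).

```python
import csv
import io

# Shortened copy of the parsed structure: one student per school.
data = {
    '28392': [{'name': 'Susan A. Smith', 'year': '2007', 'degree': 'PhD',
               'age': '27', 'race': 'white'}],
    '273533123': [{'name': 'John B. Black', 'year': '2003', 'age': '27',
                   'race': 'hispanic', 'degree': 'MBA'}],
}

# Column order requested in the question.
fieldnames = ['school id', 'year', 'name', 'age', 'race', 'degree']

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter='|',
                        lineterminator='\n')
writer.writeheader()
for school, rows in data.items():
    for row in rows:
        # Add the school id to each student's record before writing.
        writer.writerow({'school id': school, **row})

csv_text = buffer.getvalue()
print(csv_text)
```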
