I have the following pandas dataframe that has thousands of rows:
import pandas
...
print(df)
FAVORITE_FOOD FAVORITE_DRINK ... USER_A USER_B
0 hamburgers cola ... John John
1 pasta lemonade ... John John
2 omelette coffee ... John John
3 hotdogs beer ... Marie Marie
4 pizza wine ... Marie Marie
7 popcorn oj ... Adam Adam
8 sushi sprite ... Adam Adam
...
...
I want to create a nested dictionary where people's names are the keys and the dictionary of their food/drink combination is the value.
Something like this:
dict = {John : {hamburgers : cola, pasta : lemonade, omelette : coffee},
Marie : {hotdogs : beer, pizza : wine},
Adam : {popcorn : oj, sushi : sprite}
}
I solved this problem with the following code:
import pandas as pd
# this line groups each user's favorite foods with the list of matching drinks
group_dict = {k: f.groupby('FAVORITE_FOOD')['FAVORITE_DRINK'].apply(list).to_dict() for k, f in df.groupby('USER_A')}
# then a dict comprehension flattens each one-element drink list into a single value
nested_dict = {outer_k: {inner_k: v[0] for inner_k, v in outer_v.items()} for outer_k, outer_v in group_dict.items()}
You can create the desired dictionary with:
dict1 = {}
for i in range(len(df)):
    row = df.iloc[i, :]
    dict1.setdefault(row["USER_A"], {}).update({row["FAVORITE_FOOD"]: row["FAVORITE_DRINK"]})
The setdefault method creates an empty inner dictionary the first time a user is seen; each later row then updates it with another food/drink pair.
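For comparison, here is a groupby-based sketch of the same idea (assuming one drink per food, as in the sample data): each user's rows are zipped into an inner food-to-drink dictionary.

```python
import pandas as pd

# small stand-in for the thousands-of-rows frame in the question
df = pd.DataFrame({
    'FAVORITE_FOOD': ['hamburgers', 'pasta', 'hotdogs', 'pizza'],
    'FAVORITE_DRINK': ['cola', 'lemonade', 'beer', 'wine'],
    'USER_A': ['John', 'John', 'Marie', 'Marie'],
})

# one inner dict per user: food -> drink
nested = {user: dict(zip(g['FAVORITE_FOOD'], g['FAVORITE_DRINK']))
          for user, g in df.groupby('USER_A')}
print(nested)
# {'John': {'hamburgers': 'cola', 'pasta': 'lemonade'},
#  'Marie': {'hotdogs': 'beer', 'pizza': 'wine'}}
```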
I have the following toy dataset df:
import pandas as pd
data = {
    'id': [1, 2, 3],
    'name': ['John Smith', 'Sally Jones', 'William Lee']
}
df = pd.DataFrame(data)
df
id name
0 1 John Smith
1 2 Sally Jones
2 3 William Lee
My ultimate goal is to add a column that represents a Google search of the value in the name column.
I do this using:
def create_hyperlink(search_string):
    return f'https://www.google.com/search?q={search_string}'
df['google_search'] = df['name'].apply(create_hyperlink)
df
id name google_search
0 1 John Smith https://www.google.com/search?q=John Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones
2 3 William Lee https://www.google.com/search?q=William Lee
Unfortunately, the newly created google_search column returns a malformed URL: there should be a "+" between the first name and the last name.
The google_search column should return the following:
https://www.google.com/search?q=John+Smith
It's possible to do this using split() and join().
df['foo'] = df['name'].str.split()
df['foo']
0       [John, Smith]
1      [Sally, Jones]
2      [William, Lee]
Name: foo, dtype: object
Now, joining them:
df['bar'] = ['+'.join(map(str, l)) for l in df['foo']]
df
id name google_search foo bar
0 1 John Smith https://www.google.com/search?q=John Smith [John, Smith] John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones [Sally, Jones] Sally+Jones
2 3 William Lee https://www.google.com/search?q=William Lee [William, Lee] William+Lee
Lastly, creating the updated google_search column:
df['google_search'] = df['bar'].apply(create_hyperlink)
df
Is there a more elegant, streamlined, Pythonic way to do this?
Thanks!
Rather than reinvent the wheel and build the query string by hand, use a library that's guaranteed to give you a correctly encoded result:
from urllib.parse import quote_plus

def create_hyperlink(search_string):
    return f"https://www.google.com/search?q={quote_plus(search_string)}"
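As a quick check of what quote_plus does: spaces become "+", and reserved characters are percent-encoded, which a plain string replace would miss.

```python
from urllib.parse import quote_plus

print(quote_plus("John Smith"))     # John+Smith
print(quote_plus("A&W Root Beer"))  # A%26W+Root+Beer ('&' is escaped too)
```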
Use Series.str.replace:
df['google_search'] = 'https://www.google.com/search?q=' + \
                      df.name.str.replace(' ', '+')
print(df)
id name google_search
0 1 John Smith https://www.google.com/search?q=John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally+Jones
2 3 William Lee https://www.google.com/search?q=William+Lee
I have a list of objects for each name and a dataframe like this:
Jimmy = ['chair','table','pencil']
Charles = ['smartphone','cake']
John = ['clock','paper']
id  name
 1  Jimmy
 2  Charles
 3  John
I would like to use a loop that allows me to obtain the following result.
id  name     picks
 1  Jimmy    chair
 1  Jimmy    table
 1  Jimmy    pencil
 2  Charles  smartphone
 2  Charles  cake
 3  John     clock
 3  John     paper
You can assign and explode:
values = {'Jimmy': Jimmy, 'Charles': Charles, 'John': John}
out = df.assign(picks=df['name'].map(values)).explode('picks')
Or set up a DataFrame, stack and merge:
values = {'Jimmy': Jimmy, 'Charles': Charles, 'John': John}
out = df.merge(
    pd.DataFrame.from_dict(values, orient='index')
      .stack().droplevel(1).rename('picks'),
    left_on='name', right_index=True
)
output:
id name picks
0 1 Jimmy chair
0 1 Jimmy table
0 1 Jimmy pencil
1 2 Charles smartphone
1 2 Charles cake
2 3 John clock
2 3 John paper
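For completeness, here is the assign + explode route as a self-contained sketch, reconstructing the toy frame from the question:

```python
import pandas as pd

Jimmy = ['chair', 'table', 'pencil']
Charles = ['smartphone', 'cake']
John = ['clock', 'paper']

df = pd.DataFrame({'id': [1, 2, 3], 'name': ['Jimmy', 'Charles', 'John']})
values = {'Jimmy': Jimmy, 'Charles': Charles, 'John': John}

# map each name to its list of picks, then emit one row per pick
out = df.assign(picks=df['name'].map(values)).explode('picks')
print(out)
```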
We can make a dataframe relating names to picks, then join them together with merge:
import pandas as pd
#dataframe from question
df = pd.DataFrame()
df["id"] = [1, 2, 3]
df["name"] = ["Jimmy", "Charles", "John"]
#dataframe relating names to picks.
picks_df = pd.DataFrame()
picks_df["name"] = ["Jimmy", "Jimmy", "Jimmy", "Charles", "Charles", "John", "John"]
picks_df["picks"] = ["chair", "table", "pencil", "smartphone", "cake", "clock", "paper"]
#Merge and print
print(pd.merge(df, picks_df, on="name"))
I'm trying to do a 'find and replace' on a specific column type1 of a dataframe data, using terms stored in a dictionary. I first make all existing values in the column uppercase. I then create the dictionary mdata and make sure its keys and values are uppercase as well. Finally, I loop through the items in mdata with a for loop, replacing accordingly. This code used to work before I turned it into a function.
Any ideas where I've gone wrong?
def to_fish(data, fish):
    data['type1'] = data['type1'].str.upper()
    if fish == 'monument':
        mdata = {
            'natural': 'NATURAL FEATURe',
            'DITCH TERMINUS': 'DITCH',
            'DITCH RECUT': 'DITCH',
            'NATURAL_lyr': 'NATURAL FEATURE'
        }
        mdata = {k.upper(): v.upper() for k, v in mdata.items()}
        for copa, fish in mdata.items():
            data = data.str.rstrip().str.lstrip().replace(copa, fish, regex=True)
Try the map method:
data['type1'] = data['type1'].map(mdata)
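One caveat worth noting (a small sketch with hypothetical values, not the question's data): Series.map returns NaN for any value missing from the dictionary, so if only some values should be translated, Series.replace keeps the rest intact:

```python
import pandas as pd

lookup = {'NATURAL': 'NATURAL FEATURE', 'DITCH TERMINUS': 'DITCH'}
s = pd.Series(['NATURAL', 'DITCH TERMINUS', 'WALL'])

# map: values absent from the dict become NaN
print(s.map(lookup).tolist())      # ['NATURAL FEATURE', 'DITCH', nan]

# replace: values absent from the dict are left as-is
print(s.replace(lookup).tolist())  # ['NATURAL FEATURE', 'DITCH', 'WALL']
```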
You can try this to get the results:
import pandas as pd
df = pd.DataFrame({'A': ['Hello', 'Bye', 'OK', 'Hi', 'Bonjour'],
                   'B': ['Jack', 'Jill', 'Bryan', 'Kevin', 'Susan'],
                   'C': ['High', 'High', 'Middle', 'Middle', 'Low']})
print (df)
lookup_dict = {'High':'California','Middle':'Chicago','Low':'New York'}
df['C'] = df['C'].map(lookup_dict)
print (df)
Before:
A B C
0 Hello Jack High
1 Bye Jill High
2 OK Bryan Middle
3 Hi Kevin Middle
4 Bonjour Susan Low
After:
A B C
0 Hello Jack California
1 Bye Jill California
2 OK Bryan Chicago
3 Hi Kevin Chicago
4 Bonjour Susan New York
I have a data set about 50k~ rows that has a certain Job ID and the User ID of the person that performed the job. It is represented by this sample I've created:
df = pd.DataFrame(data={
    'job_id': ['00001', '00002', '00003', '00004', '00005', '00006', '00007', '00008', '00009', '00010', '00011', '00012', '00013', '00014', '00015'],
    'user_id': ['frank', 'josh', 'frank', 'jessica', 'josh', 'eric', 'frank', 'josh', 'eric', 'jessica', 'jessica', 'james', 'frank', 'josh', 'james']
})
job_id user_id
0 00001 frank
1 00002 josh
2 00003 frank
3 00004 jessica
4 00005 josh
5 00006 eric
6 00007 frank
7 00008 josh
8 00009 eric
9 00010 jessica
10 00011 jessica
11 00012 james
12 00013 frank
13 00014 josh
14 00015 james
I wish to assign peer reviewers for those jobs in a new column called 'reviewer_id', where the reviewer is drawn from the list of user_ids but cannot be the same user_id. For example: frank can't review his own job, but jessica can.
My desired output would be something like this:
job_id user_id reviewer_id
0 00001 frank jessica
1 00002 josh frank
2 00003 frank josh
3 00004 jessica eric
4 00005 josh james
...
11 00012 james frank
12 00013 frank josh
13 00014 josh eric
14 00015 james eric
I'm quite new to python so I can only think of getting a list of unique user_id from reviewers = df['user_id'].unique().tolist() and iterating over the dataframe and assigning a reviewer ID but I know you should typically never iterate over a pandas dataframe. So I'm lost on how I would go about something like this.
The simplest way I can think of is to keep changing the reviewer until no one reviews their own works:
import numpy as np

users = df['user_id'].unique()
df['reviewer_id'] = df['user_id']
self_review = lambda: df['reviewer_id'] == df['user_id']
while self_review().any():
    reviewers = np.random.choice(users, len(df))
    df['reviewer_id'] = df['reviewer_id'].mask(self_review(), reviewers)
In terms of performance, the code runs faster when there are more distinct users. Here's a faster version (requires Python 3.8 for the walrus := operator):
users = df['user_id'].unique()
df['reviewer_id'] = df['user_id']
while (self_review := df['user_id'] == df['reviewer_id']).any():
    reviewers = np.random.choice(users, self_review.sum())
    df.loc[self_review, 'reviewer_id'] = reviewers
You can use apply with set:
import random
unique_ids = set(df.user_id.unique())
assign = lambda x: random.choice(list(unique_ids - {x}))
df['reviewer_id'] = df.user_id.apply(assign)
print(df)
Output:
job_id user_id reviewer_id
0 00001 frank eric
1 00002 josh eric
2 00003 frank jessica
3 00004 jessica frank
4 00005 josh eric
5 00006 eric jessica
6 00007 frank josh
7 00008 josh frank
8 00009 eric james
9 00010 jessica eric
10 00011 jessica frank
11 00012 james josh
12 00013 frank jessica
13 00014 josh jessica
14 00015 james eric
You could use pandas apply to check 2 random reviewer choices against the value of the user, then return the first reviewer that's not the user.
import pandas as pd
from random import sample

personnel = df.user_id.unique().tolist()

def random_reviewer(x):
    # sample two distinct candidates so at least one differs from the current user
    reviewers = sample(personnel, 2)
    if reviewers[0] == x['user_id']:
        return reviewers[1]
    return reviewers[0]

df['reviewer_id'] = df.apply(random_reviewer, axis=1)
Well you can always create a list from the column and then iterate over the list?
import pandas as pd
import random

user_list = []
reviewers = df['user_id'].unique().tolist()  # unique names in user_id column
user_id_col = list(df['user_id'])  # assign column to list

def rand_reviewer(list_of_reviewers):  # function to pick a random user
    return list_of_reviewers[random.randint(0, len(list_of_reviewers) - 1)]

for i in range(0, len(user_id_col)):  # iterate over list ;)
    user_list.append(rand_reviewer(reviewers))
    while user_id_col[i] == user_list[i]:  # re-draw until the ids don't match
        user_list[i] = rand_reviewer(reviewers)

df['reviewer_id'] = user_list  # add new column to df
You can get the user ID values from the dataframe. The idea is to reshuffle the IDs in such a way that no value ends up in its original position, so the same user ID can never be assigned as the reviewer ID.
You can shuffle the list using random.shuffle, then zip the original user ID list with the reshuffled list to check the positional values.
import random
## shuffle the list until no element stays at its original position
def make_index_shuffle(user_id):
    random_index = user_id[:]
    while True:
        random.shuffle(random_index)
        for index, index_value in zip(user_id, random_index):
            if index == index_value:
                break
        else:
            return random_index

## get the list of user ID values from the dataframe
user_id = df.user_id.tolist()

## reshuffle so that no position keeps its original user ID
rearrange_id = make_index_shuffle(user_id)
df["reviewer_id"] = rearrange_id
df
The simplest way would be using apply and sample features of pandas as below:
df['reviewer_id'] = df.apply(lambda row: df[df['user_id']!=row['user_id']].sample()['user_id'].values[0], axis=1)
In the above line, df[df['user_id'] != row['user_id']] takes all user ids except the current user's; sample() then draws one row, and .values[0] reduces the result to a plain string, which is assigned to the new reviewer_id column.
Note that this does not restrict sampling in any way, so one person may end up with more review jobs than others; the sampling is completely random and unconstrained.
You can create a dictionary with the possible reviewers for each ID, and then use map to assign those possible reviewers for each row. So you get a list for each row, and you need to randomly select an element from each. I wasn't aware of a way to do this without a loop, but perhaps this is still reasonable:
import numpy as np

unique = list(df['user_id'].unique())
conversion = {}
for u in unique:
    conversion[u] = [i for i in unique if i != u]
df['reviewer_id'] = [np.random.choice(i) for i in df['user_id'].map(conversion)]
Result:
job_id user_id reviewer_id
0 00001 frank james
1 00002 josh eric
2 00003 frank josh
3 00004 jessica james
4 00005 josh jessica
...
Here is what my dataset looks like:
Name | Country
---------------
Alex | USA
Tony | DEU
Alex | GBR
Alex | USA
I am trying to get something like this out, essentially grouping and counting:
Name | Country
---------------
Alex | {USA:2,GBR:1}
Tony | {DEU:1}
Works, but slow on LARGE datasets
Here is my code that does work on smaller dfs, but takes forever on bigger dfs (mine is around 14 million rows). I also use the multiprocessing module to speed up, but it doesn't help much:
from collections import Counter
import pandas as pd

def countNames(x):
    return dict(Counter(x))

def aggregate(df_full, nameList):
    df_list = []
    for q in nameList:
        df = df_full[df_full['Name'] == q]
        df_list.append(df.groupby('Name')['Country'].apply(lambda x: str(countNames(x))).to_frame().reset_index())
    return pd.concat(df_list)

df = pd.DataFrame({'Name': ['Alex', 'Tony', 'Alex', 'Alex'],
                   'Country': ['USA', 'GBR', 'USA', 'DEU']})[['Name', 'Country']]
aggregate(df, df.Name.unique())
Is there anything that can speed up the internal logic (except for running with multiprocessing)?
This is essentially a cross tabulation. You said "something like this" which implies that you aren't quite sure what the output should be.
Option 1
Group by and value_counts
df.groupby('Name').Country.value_counts()
Name Country
Alex USA 2
GBR 1
Tony DEU 1
Name: Country, dtype: int64
To get your specified output:
pd.Series({
    name: pd.value_counts(d).to_dict()
    for name, d in df.groupby('Name').Country
}).rename_axis('Name').reset_index(name='Country')
Name Country
0 Alex {'USA': 2, 'GBR': 1}
1 Tony {'DEU': 1}
Option 2
However, I'd prefer the representations below; there are a number of ways to produce them.
pd.crosstab(df.Name, df.Country)
Country DEU GBR USA
Name
Alex 0 1 2
Tony 1 0 0
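If the per-name dictionary representation is still wanted, the crosstab can be folded back into dicts (a sketch using the question's sample rows; zero counts are dropped):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alex', 'Tony', 'Alex', 'Alex'],
                   'Country': ['USA', 'DEU', 'GBR', 'USA']})

ct = pd.crosstab(df.Name, df.Country)

# keep only the nonzero counts for each name
result = {name: {country: int(n) for country, n in row.items() if n}
          for name, row in ct.iterrows()}
print(result)
# {'Alex': {'GBR': 1, 'USA': 2}, 'Tony': {'DEU': 1}}
```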
Are you looking for this?
import pandas as pd

df = pd.DataFrame({'Name': ['Alex', 'Tony', 'Alex', 'Alex'],
                   'Country': ['USA', 'GBR', 'USA', 'DEU']})[['Name', 'Country']]
df = (df.groupby('Name')['Country']
        .apply(lambda x: str(x.value_counts().to_dict()))
        .reset_index(name='Country'))
Returns:
Name Country
0 Alex {'USA': 2, 'DEU': 1}
1 Tony {'GBR': 1}
For an O(n) complexity solution, use collections.Counter.
from collections import Counter, defaultdict
import pandas as pd
df = pd.DataFrame({'Name': ['Alex', 'Tony', 'Alex', 'Alex'],
                   'Country': ['USA', 'GBR', 'USA', 'DEU']})[['Name', 'Country']]
c = Counter(map(tuple, df.values))
# Counter({('Alex', 'DEU'): 1, ('Alex', 'USA'): 2, ('Tony', 'GBR'): 1})
Dictionary result
You can then get a Name -> Country dictionary mapping via collections.defaultdict. I would not put dictionaries in a pandas dataframe, it's not designed for this purpose.
tree = lambda: defaultdict(tree)
d = tree()
for k, v in c.items():
    d[k[0]][k[1]] = v

for k, v in d.items():
    print(k, v)
# Alex defaultdict(<function <lambda>>, {'USA': 2, 'DEU': 1})
# Tony defaultdict(<function <lambda>>, {'GBR': 1})
Dataframe result
For display purposes, you can build a dataframe directly from the defaultdict:
res_df = pd.DataFrame.from_dict(d, orient='index').fillna(0)
# USA DEU GBR
# Alex 2.0 1.0 0.0
# Tony 0.0 0.0 1.0