Pandas slow: want first occurrence in DataFrame - python

I have a DataFrame of people. One of the columns in this DataFrame is a place_id. I also have a DataFrame of places, where one of the columns is place_id and another is weather. For every person, I am trying to find the corresponding weather. Importantly, many people have the same place_ids.
Currently, my setup is this:
def place_id_to_weather(pid):
    return place_df[place_df['place_id'] == pid]['weather'].item()
person_df['weather'] = person_df['place_id'].map(place_id_to_weather)
But this is untenably slow. I would like to speed this up. I suspect that I could achieve a speedup like this:
Instead of place_df[...].item(), which scans the entire place_id column for place_id == pid and then extracts the single matching value, I really just want to stop the search in place_df after the first match of place_df['place_id'] == pid has been found; after that, I don't need to search any further. How do I limit the search to first occurrences only?
Are there other methods I could use to achieve a speedup here? Some kind of join-type method?

I think you need drop_duplicates with merge. If the only common columns in both DataFrames are place_id and weather, you can omit the parameter on (it depends on the data; sometimes on='place_id' is necessary):
df1 = place_df.drop_duplicates(['place_id'])
print (df1)
print (pd.merge(person_df, df1))
Sample data:
person_df = pd.DataFrame({'place_id':['s','d','f','s','d','f'],
                          'A':[4,5,6,7,8,9]})
print (person_df)
A place_id
0 4 s
1 5 d
2 6 f
3 7 s
4 8 d
5 9 f
place_df = pd.DataFrame({'place_id':['s','d','f', 's','d','f'],
                         'weather':['y','e','r', 'h','u','i']})
print (place_df)
place_id weather
0 s y
1 d e
2 f r
3 s h
4 d u
5 f i
def place_id_to_weather(pid):
    # for the first occurrence, use iloc[0]
    return place_df[place_df['place_id'] == pid]['weather'].iloc[0]
person_df['weather'] = person_df['place_id'].map(place_id_to_weather)
print (person_df)
A place_id weather
0 4 s y
1 5 d e
2 6 f r
3 7 s y
4 8 d e
5 9 f r
# keep='first' is the default, so it can be omitted
print (place_df.drop_duplicates(['place_id']))
place_id weather
0 s y
1 d e
2 f r
print (pd.merge(person_df, place_df.drop_duplicates(['place_id'])))
A place_id weather
0 4 s y
1 7 s y
2 5 d e
3 8 d e
4 6 f r
5 9 f r

The map function is your quickest method here; its purpose is to avoid repeatedly filtering an entire dataframe to run some function. That is what you ended up doing in your function, i.e. filtering the whole place_df, which is fine once but not when repeated per person. Tweaking your code just a little, so that the place_df dataframe is read only once to build a lookup dict, will significantly speed up your process:
person_df['weather'] = person_df['place_id'].map(dict(zip(place_df.place_id, place_df.weather)))
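One caveat worth flagging: for duplicate keys, dict(zip(...)) keeps the last value seen, not the first occurrence the question asks for. A minimal sketch (reusing the sample frames from the answer above) that preserves the first occurrence by deduplicating before building the dict:
import pandas as pd

place_df = pd.DataFrame({'place_id': ['s','d','f','s','d','f'],
                         'weather':  ['y','e','r','h','u','i']})
person_df = pd.DataFrame({'place_id': ['s','d','f','s','d','f'],
                          'A': [4, 5, 6, 7, 8, 9]})

# dict(zip(...)) alone would map 's' -> 'h' (the last occurrence);
# dropping duplicates first keeps 's' -> 'y' (the first occurrence)
first = place_df.drop_duplicates('place_id')   # keep='first' is the default
mapping = dict(zip(first['place_id'], first['weather']))
person_df['weather'] = person_df['place_id'].map(mapping)
print(person_df)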

You can use merge to do the operation:
people = pd.DataFrame([['bob', 1], ['alice', 2], ['john', 3], ['paul', 2]], columns=['name', 'place'])
# name place
#0 bob 1
#1 alice 2
#2 john 3
#3 paul 2
weather = pd.DataFrame([[1, 'sun'], [2, 'rain'], [3, 'snow'], [1, 'rain']], columns=['place', 'weather'])
# place weather
#0 1 sun
#1 2 rain
#2 3 snow
#3 1 rain
pd.merge(people, weather, on='place')
# name place weather
#0 bob 1 sun
#1 bob 1 rain
#2 alice 2 rain
#3 paul 2 rain
#4 john 3 snow
In case you have several weather values for the same place, you may want to use drop_duplicates; then you get the following result:
pd.merge(people, weather, on='place').drop_duplicates(subset=['name', 'place'])
# name place weather
#0 bob 1 sun
#2 alice 2 rain
#3 paul 2 rain
#4 john 3 snow
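A hedged alternative to the merge, reusing the people and weather frames defined just above: map through a Series indexed by place. Series.map requires a unique index, which is another reason to drop_duplicates first:
# Unique place -> weather lookup; Series.map requires a unique index
lookup = weather.drop_duplicates('place').set_index('place')['weather']
people['weather'] = people['place'].map(lookup)
print(people)
# name place weather
#0 bob 1 sun
#1 alice 2 rain
#2 john 3 snow
#3 paul 2 rain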

Related

Use values from lists to create several new DataFrames based on an existing one

My current DF looks like the one below:
x  y  z  x  c  name  status
1  2  3  2  5  Jon   Work
1  2  5  4  5  Adam  Work
9  7  3  9  5  Adam  Holiday
3  2  3  4  5  Anna  Work
1  4  6  8  5  Anna  Work
4  1  6  8  5  Kate  Off
2  1  6  1  5  Jon   Off
My lists with specific values look like this:
name = ['Jon', 'Adam']
status = ['Off', 'Work']
Using those lists, I need to create a new dataframe for each unique element in the status list. It should look like this:
df_Off:
x  y  z  x  c  name  status
2  1  6  1  5  Jon   Off
There is only one row, because the name Kate is not in the name list.
df_Work:
x  y  z  x  c  name  status
1  2  3  2  5  Jon   Work
1  2  5  4  5  Adam  Work
In the second DF there is no "Anna" because she is not in the name list.
I hope it is clear. Do you have any idea how I can solve this issue?
Regards,
Tomasz
First, filter your data using:
name = ['Jon', 'Adam']
status = ['Off', 'Work']
df[df['name'].isin(name)&df['status'].isin(status)]
Then use groupby and collect the output into a dictionary:
conditions = df['name'].isin(name)&df['status'].isin(status)
dfs = {'df_%s' % k:v for k,v in df[conditions].groupby('status')}
Then access your dataframes using:
>>> dfs['df_Work']
x y z x.1 c name status
0 1 2 3 2 5 Jon Work
1 1 2 5 4 5 Adam Work
You can even use multiple groups:
dfs = {'df_%s_%s' % k:v for k,v in df.groupby(['name', 'status'])}
dfs['df_Adam_Work']
If your goal is to save the sub-frames:
for groupname, group_df in df[conditions].groupby('status'):
    group_df.to_excel(f'df_{groupname}.xlsx')
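To tie the pieces together, a self-contained sketch of the whole recipe on the question's sample data. The question's duplicate x column is renamed x2 here so the frame can be built directly; that rename is my assumption, not part of the question:
import pandas as pd

df = pd.DataFrame({
    'x':  [1, 1, 9, 3, 1, 4, 2],
    'y':  [2, 2, 7, 2, 4, 1, 1],
    'z':  [3, 5, 3, 3, 6, 6, 6],
    'x2': [2, 4, 9, 4, 8, 8, 1],   # the question's second 'x' column, renamed
    'c':  [5, 5, 5, 5, 5, 5, 5],
    'name':   ['Jon', 'Adam', 'Adam', 'Anna', 'Anna', 'Kate', 'Jon'],
    'status': ['Work', 'Work', 'Holiday', 'Work', 'Work', 'Off', 'Off'],
})

name = ['Jon', 'Adam']
status = ['Off', 'Work']
conditions = df['name'].isin(name) & df['status'].isin(status)
dfs = {'df_%s' % k: v for k, v in df[conditions].groupby('status')}
print(dfs['df_Off'])    # only Jon's Off row; Kate is filtered out by name
print(dfs['df_Work'])   # Jon's and Adam's Work rows; Anna is filtered out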

Create multiple DataFrames from one pandas DataFrame by grouping by column values [duplicate]

This question already has answers here: Split pandas dataframe based on groupby (4 answers). Closed 2 years ago.
So I have the following dataframe, but with a variable number of rows (100, 1000, etc.):
#  Person1  Person2  Age
1  Alex     Maria    20
2  Paul     Peter    20
3  Klaus    Hans     30
4  Victor   Otto     30
5  Gerry    Justin   30
Problem:
Now I want to print separate dataframes, each containing all the people of the same age, so the output should look like this:
DF1:
#  Person1  Person2  Age
1  Alex     Maria    20
2  Paul     Peter    20

DF2:
#  Person1  Person2  Age
3  Klaus    Hans     30
4  Victor   Otto     30
5  Gerry    Justin   30
I've tried this with the following functions:
Try1:
def groupAge(data):
    x = -1
    for x in range(len(data)):
        #q = len(data[data["Age"] == data.loc[x, "Age"]])
        b = data[data["Age"] == data.loc[x,"Age"]]
        x = x + 1
        print(b,x)
    return b
Try2:
def groupAge(data):
    x = 0
    for x in range(len(data)):
        q = len(data[data["Age"] == data.loc[x, "Age"]])
        x = x + 1
    for k in range(0,q,q):
        b = data[data["Age"] == data.loc[k,"Age"]]
        print(b)
    return b
Neither of them produced the right output. Try1 prints a few groups, all of them twice, but doesn't go through the entire dataframe, and Try2 only prints the first Age "group", also twice.
I can't work out why it always prints the output twice, nor why it doesn't go through the entire dataframe.
Can anyone help?
In your first try, you are looping over the length of the dataframe and repeating the line below each time, with x replaced by 0, 1, 2, 3 and 4, respectively. On a side note, x = x + 1 is not required; range already takes care of that.
b = data[data["Age"] == data.loc[x,"Age"]]
It will obviously print them twice every time because you are scanning through the entire dataframe data and executing duplicate commands. For example:
print(data.loc[0, 'Age'])
print(data.loc[1, 'Age'])
20
20
Both the above statements print 20, so by substituting 20 in the loop, essentially you will be executing the following commands twice.
b = data[data["Age"] == 20]
I think all you need is this,
unq_age = data['Age'].unique()
df1 = df.loc[df['Age'] == unq_age[0]]
df2 = df.loc[df['Age'] == unq_age[1]]
df1
# Person1 Person2 Age
0 1 Alex Maria 20
1 2 Paul Peter 20
df2
# Person1 Person2 Age
2 3 Klaus Hans 30
3 4 Victor Otto 30
4 5 Gerry Justin 30
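If the number of distinct ages is not known in advance, a hedged generalization of the same idea collects one sub-frame per age via groupby instead of hardcoding df1 and df2:
import pandas as pd

data = pd.DataFrame({'#': [1, 2, 3, 4, 5],
                     'Person1': ['Alex', 'Paul', 'Klaus', 'Victor', 'Gerry'],
                     'Person2': ['Maria', 'Peter', 'Hans', 'Otto', 'Justin'],
                     'Age': [20, 20, 30, 30, 30]})

# One sub-frame per unique age, however many ages there are
frames = {age: group for age, group in data.groupby('Age')}
print(frames[20])   # the two rows with Age == 20
print(frames[30])   # the three rows with Age == 30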

How to use variable names iteratively in a loop in python/pandas

This seems to be a very common question, but mine is slightly different. Most of the questions I have searched for on SO show how to create different variables iteratively; I want to iteratively use variables that already exist, let's say in a dictionary.
Consider pandas: let us say I define 3 dataframes df_E1, df_E2, df_E3, each with columns like name, date, purpose.
Let's say I want to print describe from all of them: df_E1.describe(), df_E2.describe(), df_E3.describe(). Now, instead of printing one by one, what if I have 20-30 such dataframes and want to call df_E{number}.describe() in a for loop? How can I do that?
import pandas as pd

df_E1 = {'name':['john','tom','barney','freddie'], 'number':['3','4','5','6'], 'description':['a','b','c','d']}
df_E2 = {'name':['andy','candy','bruno','mars'], 'number':['1','2','5','8'], 'description':['g','h','j','k']}
df_E3 = {'name':['donald','trump','harry','thomas'], 'number':['9','4','5','7'], 'description':['c','g','j','r']}
df_E1 = pd.DataFrame(df_E1)
df_E2 = pd.DataFrame(df_E2)
df_E3 = pd.DataFrame(df_E3)
print(df_E1.head())
print(df_E2.head())
print(df_E3.head())
#### Instead of the above three statements, is there any way I can print them
#### in a for loop? The below does not work, as it just prints the string.
for i in range(1,4):
    print(str('df_E')+str(i))
### Now if I am able to print dataframes somehow, I will be able to use all
### of these in a loop. E.g. if I need to print describe of these, I will be
### able to do it in a for loop:
for i in range(1,4):
    print((str('df_E')+str(i)).describe())  # something like this which works
This isn't a duplicate of the already asked questions: they focus on creating the variable names as strings in a for loop, which I know can be done using a dictionary. Here, the requirement is to use already existing variables.
Here the best approach is to use a dict:
d = {'df_E1':df_E1, 'df_E2':df_E2, 'df_E3':df_E3}
print (d)
{'df_E1': name number description
0 john 3 a
1 tom 4 b
2 barney 5 c
3 freddie 6 d, 'df_E2': name number description
0 andy 1 g
1 candy 2 h
2 bruno 5 j
3 mars 8 k, 'df_E3': name number description
0 donald 9 c
1 trump 4 g
2 harry 5 j
3 thomas 7 r}
for k, v in d.items():
    print (v.describe())
name number description
count 4 4 4
unique 4 4 4
top john 5 b
freq 1 1 1
name number description
count 4 4 4
unique 4 4 4
top bruno 5 g
freq 1 1 1
name number description
count 4 4 4
unique 4 4 4
top harry 5 r
freq 1 1 1
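A minimal sketch of the same idea taken one step further, assuming you control the creation code: build the frames inside the dict from the start, so the numbered variables never need to exist at all:
import pandas as pd

raw = {
    'df_E1': {'name': ['john','tom','barney','freddie'],
              'number': ['3','4','5','6'], 'description': ['a','b','c','d']},
    'df_E2': {'name': ['andy','candy','bruno','mars'],
              'number': ['1','2','5','8'], 'description': ['g','h','j','k']},
    'df_E3': {'name': ['donald','trump','harry','thomas'],
              'number': ['9','4','5','7'], 'description': ['c','g','j','r']},
}
d = {k: pd.DataFrame(v) for k, v in raw.items()}
for k, v in d.items():
    print(k)
    print(v.describe())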
But it is possible, though not recommended, with globals:
for i in range(1,4):
    print(globals()[str('df_E')+str(i)])
name number description
0 john 3 a
1 tom 4 b
2 barney 5 c
3 freddie 6 d
name number description
0 andy 1 g
1 candy 2 h
2 bruno 5 j
3 mars 8 k
name number description
0 donald 9 c
1 trump 4 g
2 harry 5 j
3 thomas 7 r
You can do this simply by using the eval() function, but it is frowned upon:
for i in range(1, 4):
    print(eval('df_E{}.describe()'.format(i)))
Here is a link on why it's considered a bad practice

Pandas dataframe - select all user's rows if any of his rows contains certain value

I have a dataframe with patients, date, medications, and diagnosis.
Each patient has a unique id ('pid'), and may or may not be treated with different drugs.
What is best practice to select all patients that at some point have been treated with a certain drug?
Since my dataset is so huge, for-loops and if-statements are a last resort.
Example:
IN:
pid drug
1 A
1 B
1 C
2 A
2 C
2 E
3 B
3 C
3 D
4 D
4 E
4 F
Select all patients who have at some point been treated with drug 'B'. Note that all entries for those patients must be included, meaning not just the treatments with drug B, but all their treatments:
OUT:
1 A
1 B
1 C
3 B
3 C
3 D
My current solution:
1) Get all pids for rows that include drug 'B'.
2) Get all rows whose pid is in the set from step 1.
The problem with this solution is that I would need to make a loooong if-statement with all the pids (millions).
I do support COLDSPEED's answer, but if you say
My current solution:
1) Get all pids for rows that include drug 'B'.
2) Get all rows whose pid is in the set from step 1.
The problem with this solution is that I would need to make a loooong if-statement with all the pids (millions).
then it can be solved a lot more simply than hardcoding the ifs:
patients_B = df.loc[df['drug'] == 'B', 'pid']
or
patients_B = set(df.loc[df['drug'] == 'B', 'pid'])
and then
result = df[df['pid'].isin(patients_B)]
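The two steps can also be fused into a single line; a hedged one-liner equivalent of the above:
# Select every row whose pid appears in some row with drug == 'B'
result = df[df['pid'].isin(df.loc[df['drug'] == 'B', 'pid'])]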
The easiest method involves groupby + transform:
df[df.drug.eq('B').groupby(df.pid).transform('any')]
pid drug
0 1 A
1 1 B
2 1 C
6 3 B
7 3 C
8 3 D
For a potentially faster variant, call groupby on df rather than on a Series. Note that the leading ~ below inverts the mask, so this selects the complement, i.e. the patients never treated with 'B' (drop the ~ to keep the original selection):
df[~df.groupby('pid').drug.transform(lambda x: x.eq('B').any())]
pid drug
3 2 A
4 2 C
5 2 E
9 4 D
10 4 E
11 4 F
Here is one way.
s = df.groupby('drug')['pid'].apply(set)
result = df[df['pid'].isin(s['B'])]
# pid drug
# 0 1 A
# 1 1 B
# 2 1 C
# 6 3 B
# 7 3 C
# 8 3 D
Explanation
Create a mapping series s as a separate initial step so that it
does not need recalculating for each result.
For the comparisons, use set for O(1) complexity lookup.
IIUC, use filter:
df.groupby('pid').filter(lambda x : (x['drug']=='B').any())
Out[18]:
pid drug
0 1 A
1 1 B
2 1 C
6 3 B
7 3 C
8 3 D
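For reference, a self-contained sketch that rebuilds the question's sample data so the recipes above can be tried directly; the groupby/transform variant is shown here:
import pandas as pd

df = pd.DataFrame({'pid':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                   'drug': list('ABCACEBCDDEF')})

# True for every row whose patient has at least one 'B' row
mask = df['drug'].eq('B').groupby(df['pid']).transform('any')
print(df[mask])   # all rows for patients 1 and 3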

How to pass a column from a data frame into wordnet.synsets() in NLTK python

I have a dataframe in which one of the columns contains English words. I want to pass each of the elements in that column through NLTK's synsets() function. My issue is that synsets() only takes a single word at a time,
e.g. wordnet.synsets('father')
Now if I have a dataframe like:
dc = {'A':[0,9,4,5],'B':['father','mother','kid','sister']}
df = pd.DataFrame(dc)
df
A B
0 0 father
1 9 mother
2 4 kid
3 5 sister
I want to pass column B through the synsets() function and have another column that contains its output. I want to do this without iterating through the dataframe.
How do I do that?
You could use the apply method (this assumes wordnet has been imported, e.g. from nltk.corpus import wordnet):
In [4]: df['C'] = df['B'].apply(wordnet.synsets)
In [5]: df
Out[5]:
A B C
0 0 father [Synset('father.n.01'), Synset('forefather.n.0...
1 9 mother [Synset('mother.n.01'), Synset('mother.n.02'),...
2 4 kid [Synset('child.n.01'), Synset('kid.n.02'), Syn...
3 5 sister [Synset('sister.n.01'), Synset('sister.n.02'),...
However, having a column of lists is usually not a very useful data structure. It might be better to put each synonym in its own column. You can do that by making the callback function return a pd.Series:
In [29]: df.join(df['B'].apply(lambda word: pd.Series([w.name() for w in wordnet.synsets(word)])))
Out[29]:
A B 0 1 2 3 \
0 0 father father.n.01 forefather.n.01 father.n.03 church_father.n.01
1 9 mother mother.n.01 mother.n.02 mother.n.03 mother.n.04
2 4 kid child.n.01 kid.n.02 kyd.n.01 child.n.02
3 5 sister sister.n.01 sister.n.02 sister.n.03 baby.n.05
4 5 6 7 8
0 father.n.05 father.n.06 founder.n.02 don.n.03 beget.v.01
1 mother.n.05 mother.v.01 beget.v.01 NaN NaN
2 kid.n.05 pull_the_leg_of.v.01 kid.v.02 NaN NaN
3 NaN NaN NaN NaN NaN
(I've chosen to display just the name of each Synset (name() is a method in NLTK 3); you could of course use
df.join(df['B'].apply(lambda word: pd.Series(wordnet.synsets(word))))
if you want the Synset objects themselves.)
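A minimal runnable sketch of the apply approach, assuming NLTK is installed and the WordNet corpus has been downloaded (nltk.download('wordnet')):
import pandas as pd
from nltk.corpus import wordnet

df = pd.DataFrame({'A': [0, 9, 4, 5],
                   'B': ['father', 'mother', 'kid', 'sister']})

# Each cell of C holds the list of Synset objects for the word in B
df['C'] = df['B'].apply(wordnet.synsets)
print(df)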
