I am trying to go through my dataframe two rows at a time, checking whether a column value is the same in both rows and removing such rows. My dataframe tracks the locations of different people during different encounters.
I have a dataframe, called transfers, in which each row consists of an ID number for a person, an encounter number, and a location. The transfers dataframe was created by running duplicated() on my original dataframe to find rows with the same person ID and grouping them together.
For example, we would want to get rid of the rows with ID = 2 in the dataframe below because the location was "D" in both encounters, so this person has not moved.
However, we would want to keep the rows with ID = 3 because that person moved from "A" to "F".
Another issue arises because some people have more than two rows, for example where ID = 1. For this person, we would want to keep their rows because they have moved from "A" -> "B" and then from "B" -> "C". However, if you only compare the encounters 12 and 13, it does not look like this person has changed locations.
Example dataframe df:
ID Encounter Location
1 11 A
1 12 B
1 13 B
1 14 C
2 21 D
2 22 D
3 31 A
3 32 F
Expected output:
ID Encounter Location
1 11 A
1 12 B
1 13 B
1 14 C
3 31 A
3 32 F
I have tried nested for loops using .iterrows(), but found that this did not work: it was terribly slow and did not properly handle cases where a person had more than two encounters. I have also tried applying a function to my dataframe, but the runtime was nearly the same as crude looping.
EDIT: I should have stated this explicitly: I am trying to keep the data of any person who has moved locations, even if they end up back where they started.
Given
>>> df
ID Encounter Location
0 1 11 A
1 1 12 B
2 1 13 B
3 1 14 C
4 2 21 D
5 2 22 D
6 3 31 A
7 3 32 F
you can filter your dataframe via
>>> places = df.groupby('ID')['Location'].transform('nunique')
>>> df[places > 1]
ID Encounter Location
0 1 11 A
1 1 12 B
2 1 13 B
3 1 14 C
6 3 31 A
7 3 32 F
The idea is to count the number of unique places per group (ID) and then drop the rows where a person has only been to one place.
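This also covers the case from the EDIT: a person who moves and later returns to their starting location still has more than one unique location, so their rows are kept. A quick sanity check with a hypothetical round trip (an ID 4 going A -> B -> A, not part of the original data):
>>> trip = pd.DataFrame({'ID': [4, 4, 4],
...                      'Encounter': [41, 42, 43],
...                      'Location': ['A', 'B', 'A']})
>>> trip[trip.groupby('ID')['Location'].transform('nunique') > 1]
ID Encounter Location
0 4 41 A
1 4 42 B
2 4 43 A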
Comparison versus a groupby().filter() solution:
# setup
>>> df = pd.concat([df.assign(ID=df['ID'] + i) for i in range(1000)], ignore_index=True)
>>> df
ID Encounter Location
0 1 11 A
1 1 12 B
2 1 13 B
3 1 14 C
4 2 21 D
... ... ... ...
7995 1000 14 C
7996 1001 21 D
7997 1001 22 D
7998 1002 31 A
7999 1002 32 F
[8000 rows x 3 columns]
# timings  # i5-6200U CPU @ 2.30GHz
>>> %timeit df.groupby('ID').filter(lambda x: x['Location'].nunique() > 1)
356 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df[df.groupby('ID')['Location'].transform('nunique') > 1]
5.56 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Let's say I have the following dataframe:
import pandas as pd
df = pd.DataFrame({
    'Est': [1.18, 1.83, 2.08, 2.30, 2.45, 3.21, 3.26, 3.54, 3.87, 4.58, 4.59, 4.98],
    'Buy': [0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
})
Est Buy
0 1.18 0
1 1.83 1
2 2.08 1
3 2.30 1
4 2.45 0
5 3.21 1
6 3.26 1
7 3.54 0
8 3.87 1
9 4.58 0
10 4.59 0
11 4.98 1
I would like to create a new dataframe with two columns and 4 rows, in the following format: the first row contains how many 'Est' values are between 1 and 2 and how many 1's are in the column 'Buy' for those rows; the second row the same for the 'Est' values between 2 and 3; the third row for values between 3 and 4, and so on. So my output should be
A B
0 2 1
1 3 2
2 4 3
3 3 1
I tried to use the where clause in pandas (or np.where) to create new columns with restrictions like (df['Est'] >= 1) & (df['Est'] <= 2) and then count. But is there an easier and cleaner way to do this? Thanks
Sounds like you want to group by the floor of the first column:
g = df.groupby(df['Est'] // 1)
You count the Est column:
count = g['Est'].count()
And sum the Buy column:
buys = g['Buy'].sum()
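If you then want the exact two-column frame from the question (a small sketch; the column names 'A' and 'B' are simply taken from the expected output), you can combine the two results, which align on the shared group index:
result = pd.DataFrame({'A': count, 'B': buys}).reset_index(drop=True)
which gives
A B
0 2 1
1 3 2
2 4 3
3 3 1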
I have a DataFrame like this
id subid a
1 1 1 2
2 1 1 10
3 1 1 20
4 1 2 30
5 1 2 35
6 1 2 36
7 1 2 40
8 2 2 20
9 2 2 29
10 2 2 30
And I want to calculate and save the mean of the variable "a" for each id. For example, I want the mean of the variable "a" where id = 2, and then I want to save that result in a list.
This is what I have so far:
for i in range(2):
    results=[]
    if df.iloc[:,3]==i:
        value=np.mean(df)
        results.append(value)
I think what you are trying to do is:
df.groupby('id')['a'].mean()
It will return the mean for both id 1 and id 2, but if you want to take only the mean for id 2, then you can do this:
df.groupby('id')['a'].mean()[2]
By doing this you're only taking mean of a column whose id is 2.
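If you also want the result in a list, as the question mentions, one straightforward sketch is:
results = df.groupby('id')['a'].mean().tolist()
# roughly [24.71, 26.33] for the data shown above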
Problems here:
results=[] should be outside the loop; otherwise, each time the loop runs, results resets to [].
Also, the column you're looking for is df.iloc[:,2] (column 'a'), not df.iloc[:,3].
value = df['a'].mean()
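For completeness, a corrected version of the original loop could look like the sketch below, although the groupby above remains the more idiomatic choice:
results = []                                   # keep the list outside the loop
for i in df['id'].unique():                    # iterate over the actual id values
    value = df.loc[df['id'] == i, 'a'].mean()  # mean of column 'a' for this id
    results.append(value)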
I often run into a problem when doing queries with lots of conditions: how can I speed up the process?
Basically, I very often use the apply function to get a result, but quite often the computation takes a long time.
Is there a good practice for finding ways to optimize Pandas code?
Here is an example, I have a DataFrame representing the exchange of a chat containing 3 columns:
timestamp: the timestamp of the message
sender_id: the id of the sender
receiver_id: the id of the receiver
The goal is to find the fraction of messages that had a response in less than 5 minutes. Here is my code:
import pandas as pd
import numpy as np
import datetime
size_df = 30000
np.random.seed(42)
data = {
    'timestamp': pd.date_range('2019-03-01', periods=size_df, freq='30S').astype(int),
    'sender_id': np.random.randint(5, size=size_df),
    'receiver_id': np.random.randint(5, size=size_df)
}
dataframe = pd.DataFrame(data)
This is what the DataFrame looks like:
print(dataframe.head().to_string())
timestamp sender_id receiver_id
0 1551398400000000000 4 2
1 1551398430000000000 3 2
2 1551398460000000000 1 1
3 1551398490000000000 4 3
4 1551398520000000000 4 3
The function used by apply:
def apply_find_next_answer_within_5_min(row):
    """
    Find the index of the next response within a range of 5 minutes.
    """
    [timestamp, sender, receiver] = row
    ## find the next responses from receiver to sender in the next 5 minutes
    next_responses = df_groups.get_group((receiver, sender))["timestamp"]\
        .loc[lambda x: (x > timestamp) & (x < timestamp + 5 * 60 * 1000 * 1000 * 1000)]
    ## if there is no next response, just return NaN
    if not next_responses.size:
        return np.nan
    ## find the next messages from sender to receiver in the next 5 minutes
    next_messages = df_groups.get_group((sender, receiver))["timestamp"]\
        .loc[lambda x: (x > timestamp) & (x < timestamp + 5 * 60 * 1000 * 1000 * 1000)]
    ## if the first next message comes before the next response, return NaN; otherwise return the index of the next response
    return np.nan if next_messages.size and next_messages.iloc[0] < next_responses.iloc[0] else next_responses.index[0]
%%timeit
df_messages = dataframe.copy()
## group messages by (sender_id, receiver_id) to easily find messages from a specific sender to a specific receiver; this speeds up the querying inside the apply function.
df_groups = df_messages.groupby(["sender_id", "receiver_id"])
df_messages["next_message"] = df_messages.apply(lambda row: apply_find_next_answer_within_5_min(row), axis=1)
Output timeit:
42 s ± 2.16 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
So it takes 42 seconds to apply the function for a 30 000 rows DataFrame. I think it is very long, but I don't find a way to make it more efficient. I already gained 40 seconds by using the intermediate dataframe that groups the sender and receiver instead of querying the big dataframe in the apply function.
This would be the answer for this specific problem:
1 - df_messages.next_message[lambda x: pd.isnull(x)].size / df_messages.next_message.size
0.2753
So in such scenarios, how do you find a way to compute more efficiently? Are there some tricks to think about?
In this example, I don't believe it is possible to use vectorization all the way, but maybe by using more groups it is possible to go quicker?
You can try to group your dataframe:
# reset_index for later, to get the value of the original index
# need frozenset to allow the groupby (so (sender, receiver) and (receiver, sender) land in the same group)
groups = (dataframe.reset_index()
          .groupby([frozenset([se, re])
                    for se, re in dataframe[['sender_id', 'receiver_id']].values]))
Now you can create a boolean mask meeting your condition:
mask_1 = (  # within a group, check if the following message is sent by the other person
    (groups.sender_id.diff(-1).ne(0)
     # or if the person talks to themselves
     | dataframe.sender_id.eq(dataframe.receiver_id))
    # and check if the following message is within 5 min
    & groups.timestamp.diff(-1).gt(-5*60*1000*1000*1000))
Now create the column with the index you are looking for, using the mask and a shift on the index:
df_messages.loc[mask_1, 'next_message'] = groups['index'].shift(-1)[mask_1]
and you get the same result as with your method, but it should be faster:
print (df_messages.head(20))
timestamp sender_id receiver_id next_message
0 1551398400000000000 3 1 NaN
1 1551398430000000000 4 1 NaN
2 1551398460000000000 2 3 NaN
3 1551398490000000000 4 1 NaN
4 1551398520000000000 4 3 NaN
5 1551398550000000000 1 1 NaN
6 1551398580000000000 2 3 10.0
7 1551398610000000000 2 4 NaN
8 1551398640000000000 2 4 NaN
9 1551398670000000000 4 1 NaN
10 1551398700000000000 3 2 NaN
11 1551398730000000000 2 4 NaN
12 1551398760000000000 4 0 18.0
13 1551398790000000000 1 0 NaN
14 1551398820000000000 3 3 16.0
15 1551398850000000000 1 2 NaN
16 1551398880000000000 3 3 NaN
17 1551398910000000000 4 1 NaN
18 1551398940000000000 0 4 NaN
19 1551398970000000000 3 2 NaN
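With next_message filled in this way, the fraction of answered messages can then be computed exactly as in the question:
1 - df_messages.next_message[lambda x: pd.isnull(x)].size / df_messages.next_message.size
# equivalently: df_messages['next_message'].notna().mean()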
I want to order this DataFrame by a given column and by the number of entries for each value in that column.
So let's say I have a very simple dataframe, looking something like this:
name age
0 Paul 12
1 Ryan 17
2 Michael 100
3 Paul 36
4 Paul 66
5 Michael 45
What I want as a result is something like
name age
0 Paul 12
1 Paul 36
2 Paul 66
3 Michael 100
4 Michael 45
5 Ryan 17
I have three Pauls, so they come up first, then two Michaels, and finally only one Ryan.
One option: use value_counts to get the most frequent names, then set, sort, and reset the index:
x = list(df['name'].value_counts().index)
df.set_index('name').loc[x].reset_index()
returns
name age
0 Paul 12
1 Paul 36
2 Paul 66
3 Michael 100
4 Michael 45
5 Ryan 17
You need to create a helper column to sort on, in this case the size of each name group. Add a .reset_index(drop=True) if you prefer a brand new RangeIndex, or keep it as is if the original index is useful.
Sorting (with a stable sort) does not change the ordering within equal values, so the first 'Paul' row will still appear first among the 'Paul' rows.
(df.assign(s=df.groupby('name').name.transform('size'))
   .sort_values('s', ascending=False)
   .drop(columns='s'))
Output
name age
0 Paul 12
3 Paul 36
4 Paul 66
2 Michael 100
5 Michael 45
1 Ryan 17
To allay fears raised in the comments, this method is performant, much more so than the method above. Plus, you don't ruin your initial index.
import numpy as np
np.random.seed(42)
N = 10**6
df = pd.DataFrame({'name': np.random.randint(1, 10000, N),
                   'age': np.random.normal(0, 1, N)})
%%timeit
(df.assign(s=df.groupby('name').name.transform('size'))
   .sort_values('s', ascending=False)
   .drop(columns='s'))
#500 ms ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
x = list(df['name'].value_counts().index)
df.set_index('name').loc[x].reset_index()
#2.67 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The only change I added was the ability to sort by the count of name, and then by age.
df['name_count'] = df['name'].map(df['name'].value_counts())
df = df.sort_values(by=['name_count', 'age'],
                    ascending=[False, True]).drop('name_count', axis=1)
df.reset_index(drop=True)
name age
0 Paul 12
1 Paul 36
2 Paul 66
3 Michael 45
4 Michael 100
5 Ryan 17
Test data:
In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {'AAA': [4, 5, 6, 7, 9, 10],
     'BBB': [10, 20, 30, 40, 11, 10],
     'CCC': [100, 50, 25, 10, 10, 11]});
In [2]: df
Out[2]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 25
3 7 40 10
4 9 11 10
5 10 10 11
In [3]: thresh = 2
df['aligned'] = np.where(df.AAA == df.BBB,max(df.AAA)|(df.BBB),np.nan)
The np.where statement above provides max(df.AAA or df.BBB) when df.AAA and df.BBB are exactly aligned. I would like to have the max when the columns are within the value in thresh, and I would also like to consider all columns. It does not have to be via np.where. Can you please show me ways of approaching this?
So for row 5, df.aligned should be 11.0, as this is the max value and it is within thresh of df.AAA and df.BBB.
Ultimately I am looking for ways to find levels across multiple columns where the values are closely aligned.
Current Output with my code:
df
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 NaN
5 10 10 11 10.0
Desired Output:
df
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 11.0
5 10 10 11 11.0
The desired output shows rows 4 and 5 with values on df.aligned. As these have values within thresh of each other (values 10 and 11 are within the range specified in thresh variable).
"Within thresh distance" to me means that the difference between the max
and the min of a row should be less than thresh. We can use DataFrame.apply with parameter axis=1 so that we apply the lambda function on each row.
In [1]: filt_thresh = df.apply(lambda x: (x.max() - x.min())<thresh, axis=1)
100 loops, best of 3: 1.89 ms per loop
Alternatively, there's a faster solution, as pointed out below by @root:
filt_thresh = np.ptp(df.values, axis=1) < thresh
10000 loops, best of 3: 48.9 µs per loop
Or, staying with pandas:
filt_thresh = df.max(axis=1) - df.min(axis=1) < thresh
1000 loops, best of 3: 943 µs per loop
We can now use boolean indexing and calculate the max of each row that matches (hence the axis=1 parameter in max() again):
In [2]: df.loc[filt_thresh, 'aligned'] = df[filt_thresh].max(axis=1)
In [3]: df
Out[3]:
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 NaN
5 10 10 11 11.0
Update:
If you wanted to calculate the minimum distance between elements in each row, that would be equivalent to sorting the array of values (np.sort()), calculating the difference between consecutive numbers (np.diff()), and taking the min of the resulting array. Finally, compare that to thresh.
Here's the apply way, which has the advantage of being a bit easier to understand:
filt_thresh = df.apply(lambda row: np.min(np.diff(np.sort(row))) < thresh, axis=1)
1000 loops, best of 3: 713 µs per loop
And here's the vectorized equivalent:
filt_thresh = np.diff(np.sort(df)).min(axis=1) < thresh
The slowest run took 4.31 times longer than the fastest.
This could mean that an intermediate result is being cached.
10000 loops, best of 3: 67.3 µs per loop
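Putting the update together with the assignment from above (a sketch; I select the three value columns explicitly so that the already-created aligned column is excluded from the row-wise computation):
value_cols = ['AAA', 'BBB', 'CCC']
filt_thresh = np.diff(np.sort(df[value_cols])).min(axis=1) < thresh
df.loc[filt_thresh, 'aligned'] = df.loc[filt_thresh, value_cols].max(axis=1)
With thresh = 2, this marks rows 4 and 5 and sets aligned to 11.0 for both, matching the desired output.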