efficiently find maxes of one column over id in a Pandas dataframe

efficiently find maxes of one column over id in a Pandas dataframe - python

I am working with a very large dataframe (3.5 million X 150 and takes 25 gigs of memory when unpickled) and I need to find maximum of one column over an id number and a date and keep only the row with the maximum value. Each row is a recorded observation for one id at a certain date and I also need the latest date.
This is animal test data where there are twenty additional columns seg1-seg20 for each id and date that are filled with test day information consecutively, for example, first test data fills seg1, second test data fills seg2 ect. The "value" field indicates how many segments have been filled, in other words how many tests have been done, so the row with the maximum "value" has the most test data. Ideally I only want these rows and not the previous rows. For example:
df= DataFrame({'id':[1000,1000,1001,2000,2000,2000],
"date":[20010101,20010201,20010115,20010203,20010223,20010220],
"value":[3,1,4,2,6,6],
"seg1":[22,76,23,45,12,53],
"seg2":[23,"",34,52,24,45],
"seg3":[90,"",32,"",34,54],
"seg4":["","",32,"",43,12],
"seg5":["","","","",43,21],
"seg6":["","","","",43,24]})
df
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
1 20010201 1000 76 1
2 20010115 1001 23 34 32 32 4
3 20010203 2000 45 52 2
4 20010223 2000 12 24 34 43 43 41 6
5 20010220 2000 12 24 34 43 44 35 6
And eventually it should be:
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
2 20010115 1001 23 34 32 32 4
4 20010223 2000 12 24 34 43 43 41 6
I first tried to use .groupby('id').max but couldnt find a way to use it to drop rows. The resulting dataframe MUST contain the ORIGINAL ROWS and not just the maximum value of each column with each id. My current solution is:
for i in df.id.unique():
df =df.drop(df.loc[df.id==i].sort(['value','date']).index[:-1])
But this takes around 10 seconds to run each time through, I assume because its trying to call up the entire dataframe each time through. There are 760,000 unique ids, each are 17 digits long, so it will take way too long to be feasible at this rate.
Is there another method that would be more efficient? Currently it reads every column in as an "object" but converting relevant columns to the lowest possible bit of integer doesnt seem to help either.

I tried with groupby('id').max() and it works, and it also drop the rows. Did you remeber to reassign the df variable? Because this operation (and almost all Pandas' operations) are not in-place.
If you do:
df.groupby('id', sort = False).max()
You will get:
date value
id
1000 20010201 3
1001 20010115 4
2000 20010223 6
And if you don't want id as the index, you do:
df.groupby('id', sort = False, as_index = False).max()
And you will get:
id date value
0 1000 20010201 3
1 1001 20010115 4
2 2000 20010223 6
I don't know if that's going to be much faster, though.
Update
This way the index will not be reseted:
df.iloc[df.groupby('id').apply(lambda x: x['value'].idxmax())]
And you will get:
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
2 20010115 1001 23 34 32 32 4
4 20010223 2000 12 24 34 43 43 43 6

Related

Rearranging data from a table in python to correlate columns

I got a table that look like this:
code
year
month
Value A
Value B
1
2020
1
120
100
1
2020
2
130
90
1
2020
3
90
89
1
2020
4
67
65
...
...
...
...
...
100
2020
10
90
90
100
2020
11
115
100
100
2020
12
150
135
I would like to know if there's a way to rearrange the data to find the correlation between A and B for every distinct code.
What I'm thinking is, for example, getting an array for every code, like:
[(A1,A2,A3...,A12),(B1,B2,B3...,B12)]
where A and B is the values for the respective month, and then I could see the correlation between these two columns. Is there a way to make this dynamic?

IIUC, you don't need to re-arrange to get the correlation for each "code". Instead, try with groupby:
>>> df.groupby("code").apply(lambda x: x["Value A"].corr(x["Value B"]))
code
1 0.830163
100 0.977093
dtype: float64

Pandas: calculating mean value of multiple columns using datetime and Grouper removes columns or doesn't return correct Dataframe

As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
Which I can't seem to use as a regular dataframe, and the datetime is messed up as it doesn't show the monthly mean but gives the last day back. Also the station name is a single index, and not for the whole column. Plus the mean value doesn't have a "column name" at all. This isn't a dataframe, but a pandas.core.series.Series. I can't convert this again because it's not correct, and using the .to_frame() method shows that it is still indeed a Dataframe. I don't get this part.
I found that in order to return a normal dataframe, to use
as_index = False
In the groupby method. But this results in the months not being shown:
df.groupby(['station_name', pd.Grouper(freq='M')], as_index = False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to do this on multiple columns. It returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000

ok , how baout this :
df = df.groupby(['Station_Name',df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
outut:
>>
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2

Filter dataframe based on list with ranges

probably the title of my question is some kind of wrong. Currently I have a list:
a = [11,12,13,14,15,16,17,18,19,20,21,22,25,26,27,28,29,30,31,37,38,39]
and a dataframe df:
colfrom colto
1 99
23 24
25 32
25 40
How can I filter my dataframe that the colfrom is inside the array a or smaller then it, and that coltois inside the array or bigger then it? So basically this rule would lead to:
colfrom colto
1 99
25 32
25 40
The only row who gets kicked out is row 2 (or in python row 1), as 23 and 24 are not in the array (and not lower then 11 and not higher then 39).

Use:
mask = ((df['colfrom'].isin(a)) | (df['colfrom']<min(a)) & (df['colto'].isin(a)) | (df['colto']>max(a)))
df[mask]
colfrom colto
0 1 99
2 25 32
3 25 40

Pandas groupby two columns and only keep records satisfying condition based on count

Trying to filter out a number of actions a user has done if the number of actions reaches a threshold.
Here is the data set: (Only Few records)
user_id,session_id,item_id,rating,length,time
123,36,28,3.5,6243.0,2015-03-07 22:44:40
123,36,29,2.5,4884.0,2015-03-07 22:44:14
123,36,30,3.5,6846.0,2015-03-07 22:44:28
123,36,54,6.5,10281.0,2015-03-07 22:43:56
123,36,61,3.5,7639.0,2015-03-07 22:43:44
123,36,62,7.5,18640.0,2015-03-07 22:43:34
123,36,63,8.5,7189.0,2015-03-07 22:44:06
123,36,97,2.5,7627.0,2015-03-07 22:42:53
123,36,98,4.5,9000.0,2015-03-07 22:43:04
123,36,99,7.5,7514.0,2015-03-07 22:43:13
223,63,30,8.0,5412.0,2015-03-22 01:42:10
123,36,30,5.5,8046.0,2015-03-07 22:42:05
223,63,32,8.5,4872.0,2015-03-22 01:42:03
123,36,32,7.5,11914.0,2015-03-07 22:41:54
225,63,35,7.5,6491.0,2015-03-22 01:42:19
123,36,35,5.5,7202.0,2015-03-07 22:42:15
123,36,36,6.5,6806.0,2015-03-07 22:42:43
123,36,37,2.5,6810.0,2015-03-07 22:42:34
225,63,41,5.0,15026.0,2015-03-22 01:42:37
225,63,45,6.5,8532.0,2015-03-07 22:42:25
I can groupby the data using user_id and session_id and get a count of items a user has rated in a session:
df.groupby(['user_id', 'session_id']).agg({'item_id':'count'}).rename(columns={'item_id': 'count'})
List of items that user has rated in a session can be obtained:
df.groupby(['user_id','session_id'])['item_id'].apply(list)
The goal is to get following if a user has rated more than 3 items in session, I want to pick only the first three items (keep only first three per user per session) from the original data frame. Maybe use the time to sort the items?
First tried to obtain which sessions contain more than 3, somewhat struggling to go beyond.
df.groupby(['user_id', 'session_id'])['item_id'].apply(
lambda x: (x > 3).count())
Example: from original df, user 123 should have first three records belong to session 36

It seems like you want to use groupby with head:
In [8]: df.groupby([df.user_id, df.session_id]).head(3)
Out[8]:
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
10 223 63 30 8.0 5412.0 2015-03-22 01:42:10
12 223 63 32 8.5 4872.0 2015-03-22 01:42:03
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25

One way is to use sort_values followed by groupby.cumcount. A method I find useful is to extract any series or MultiIndex data before applying any filtering.
The below example filters for minimum user_id / session_id combination of 3 items and only takes the first 3 in each group.
sizes = df.groupby(['user_id', 'session_id']).size()
counter = df.groupby(['user_id', 'session_id']).cumcount() + 1 # counting begins at 0
indices = df.set_index(['user_id', 'session_id']).index
df = df.sort_values('time')
res = df[(indices.map(sizes.get) >= 3) & (counter <=3)]
print(res)
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25

Calculating average/standard deviations of rows containing certain string in pandas dataframe

I have a large pandas dataframe read as table. I would like to calculate the means and standard deviations of the two different groups, CRPS and Age, so I can plot them in a bar plot with std deviations as the error bars.
I can get the mean calculated by just the Age column. I figured it's a for loop that I have to construct, but I don't know how to construct further than table["Age"].mean(), which just gives me the average of all data points' age values. This is where I need some guidance. I want to look in the group column, tell it to calculate the average and standard deviation for the ages of that group. So, an average and standard deviation value for the ages of the CRPS group, for example.
I have the first 25 rows down below just to show what the dataframe looks like. I also have imported numpy as np as well.
Group Age
0 CRPS 50
1 CRPS 59
2 CRPS 22
3 CRPS 48
4 CRPS 53
5 CRPS 48
6 CRPS 29
7 CRPS 44
8 CRPS 28
9 CRPS 42
10 CRPS 35
11 CONTROLS 54
12 CONTROLS 43
13 CRPS 50
14 CRPS 62
15 CONTROLS 64
16 CONTROLS 39
17 CRPS 40
18 CRPS 59
19 CRPS 46
20 CONTROLS 56
21 CRPS 21
22 CRPS 45
23 CONTROLS 41
24 CRPS 46
25 CONTROLS 35

I don't think you need a for-loop.
Instead, you might try something like:
table.iloc[table['Group'] == 'CRPS']['Age'].mean()
I haven't tested with your table, but I think that will work.
The idea is to first create a boolean array, which is true for row indices where the group field contains 'CRPS', then to select all of those rows using iloc, and finally to take the mean. You could iterate over all of the groups in the following way:
mean_age = dict()
for group in set(table['Group']):
mean_age[group] = table.iloc[table['Group'] == group]['Age'].mean()
Maybe this is where you intended to use a for loop.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

efficiently find maxes of one column over id in a Pandas dataframe - python

Related

Rearranging data from a table in python to correlate columns

Pandas: calculating mean value of multiple columns using datetime and Grouper removes columns or doesn't return correct Dataframe

Filter dataframe based on list with ranges

Pandas groupby two columns and only keep records satisfying condition based on count

Calculating average/standard deviations of rows containing certain string in pandas dataframe

Categories

Resources