pandas reset_index after groupby.value_counts() - python

I am trying to groupby a column and compute value counts on another column.
import pandas as pd
dftest = pd.DataFrame({'A':   [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                       'Amt': [20, 20, 20, 30, 30, 30, 30, 40, 40, 10, 10, 40, 40, 40]})
print(dftest)
dftest looks like
A Amt
0 1 20
1 1 20
2 1 20
3 1 30
4 1 30
5 1 30
6 1 30
7 1 40
8 1 40
9 2 10
10 2 10
11 2 40
12 2 40
13 2 40
Perform the grouping:
grouper = dftest.groupby('A')
df_grouped = grouper['Amt'].value_counts()
which gives
A Amt
1 30 4
20 3
40 2
2 40 3
10 2
Name: Amt, dtype: int64
What I want is to keep the top two rows of each group.
Also, I was perplexed by an error when I tried to reset_index:
df_grouped.reset_index()
which gives the following error:
ValueError: cannot insert Amt, already exists

You need the name parameter in reset_index, because the Series name is the same as the name of one of the levels of the MultiIndex:
df_grouped.reset_index(name='count')
Another solution is to rename the Series first:
print (df_grouped.rename('count').reset_index())
A Amt count
0 1 30 4
1 1 20 3
2 1 40 2
3 2 40 3
4 2 10 2
A more common solution, instead of value_counts, is to aggregate with size:
df_grouped1 = dftest.groupby(['A','Amt']).size().reset_index(name='count')
print (df_grouped1)
A Amt count
0 1 20 3
1 1 30 4
2 1 40 2
3 2 10 2
4 2 40 3

Related

Remove groups from dataframe where a min value within that group is not below a threshold

The dataframe looks like this:
id1 id2 value
1 1 35
1 1 23
1 1 20
1 2 5
1 2 50
2 1 42
2 1 3
2 1 12
2 2 64
2 3 34
2 3 1
I want to group them by id1 and id2, and remove all rows of a group if the minimum value of that group is not less than 10.
So the result would look like this:
id1 id2 value
1 2 5
1 2 50
2 1 3
2 1 12
2 3 34
2 3 1
I have tried this:
dfmin = df.groupby(["id1", "id2"])["value"].min().reset_index()
df = df[
    dfmin.loc[
        (dfmin["id1"] == df["id1"]) & (dfmin["id1"] == df["id1"]),
        "value",
    ].iat[0]
    < 10
]
But I get the error "Can only compare identically-labeled Series objects".
What am I doing wrong and is there a better way?
Use GroupBy.filter:
out = df.groupby(['id1', 'id2']).filter(lambda x: x['value'].min() < 10)
out
id1 id2 value
3 1 2 5
4 1 2 50
5 2 1 42
6 2 1 3
7 2 1 12
9 2 3 34
10 2 3 1
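A variant worth knowing about (a sketch, not part of the answer above): building a boolean mask with transform('min') gives the same rows and tends to be faster than filter on large frames, because it avoids calling a Python lambda once per group.
import pandas as pd

df = pd.DataFrame({'id1':   [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'id2':   [1, 1, 1, 2, 2, 1, 1, 1, 2, 3, 3],
                   'value': [35, 23, 20, 5, 50, 42, 3, 12, 64, 34, 1]})

# broadcast each (id1, id2) group's minimum back onto its rows,
# then keep the rows whose group minimum is below 10
mask = df.groupby(['id1', 'id2'])['value'].transform('min') < 10
out = df[mask]
print(out)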

python pandas: Remove duplicates by columns A, which is not satisfying a condition in column B

I have a dataframe with repeated values in column A. I want to drop duplicates, keeping the rows whose value in column B is greater than 0.
So this:
A B
1 20
1 10
1 -3
2 30
2 -9
2 40
3 10
Should turn into this:
A B
1 20
1 10
2 30
2 40
3 10
Any suggestions on how this can be achieved? I shall be grateful!
There are no duplicates in the sample data, so it is enough to use:
df = df[df['B'].gt(0)]
print (df)
A B
0 1 20
1 1 10
3 2 30
5 2 40
6 3 10
If there are duplicates:
print (df)
A B
0 1 20
1 1 10
2 1 10
3 1 10
4 1 -3
5 2 30
6 2 -9
7 2 40
8 3 10
df = df[df['B'].gt(0) & ~df.duplicated()]
print (df)
A B
0 1 20
1 1 10
5 2 30
7 2 40
8 3 10
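For what it's worth, the same result can also be spelled with drop_duplicates instead of the ~df.duplicated() mask (an equivalent variant, sketched here on the same sample): filter on B first, then drop exact duplicate rows.
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 2, 2, 2, 3],
                   'B': [20, 10, 10, 10, -3, 30, -9, 40, 10]})

# keep positive B first, then drop full-row duplicates
out = df[df['B'].gt(0)].drop_duplicates()
print(out)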

Subtract fixed row value in reference to column value in pandas dataframe

I would like to subtract a fixed row's value from the other rows, in reference to their values in another column.
My data looks like this:
TRACK TIME POSITION_X
0 1 0 12
1 1 30 13
2 1 60 15
3 1 90 11
4 2 0 10
5 2 20 11
6 2 60 13
7 2 90 17
I would like to subtract the fixed row value (where TIME = 0) of the POSITION_X column within each TRACK group, and create a new column ("NEW_POSX") with those values. The output should be like this:
TRACK TIME POSITION_X NEW_POSX
0 1 0 12 0
1 1 30 13 1
2 1 60 15 3
3 1 90 11 -1
4 2 0 10 0
5 2 20 11 1
6 2 60 13 3
7 2 90 17 7
I have been using the following code to get this done:
import pandas as pd
data = {'TRACK': [1, 1, 1, 1, 2, 2, 2, 2],
        'TIME': [0, 30, 60, 90, 0, 20, 60, 90],
        'POSITION_X': [12, 13, 15, 11, 10, 11, 13, 17],
        }
df = pd.DataFrame(data, columns=['TRACK', 'TIME', 'POSITION_X'])
df['NEW_POSX']= df.groupby('TRACK')['POSITION_X'].diff().fillna(0).astype(int)
df.head(8)
... but I don't get the desired output. Instead, I get a new column where the previous row is subtracted from every row (within each "TRACK" group):
TRACK TIME POSITION_X NEW_POSX
0 1 0 12 0
1 1 30 13 1
2 1 60 15 2
3 1 90 11 -4
4 2 0 10 0
5 2 20 11 1
6 2 60 13 2
7 2 90 17 4
Can anyone help me with this?
You can use transform and first to get the value at time 0, and then subtract it from the 'POSITION_X' column:
s = df.groupby('TRACK')['POSITION_X'].transform('first')
df['NEW_POSX'] = df['POSITION_X'] - s
# Same as:
# df['NEW_POSX'] = df['POSITION_X'].sub(s)
Output:
df
TRACK TIME POSITION_X NEW_POSX
0 1 0 12 0
1 1 30 13 1
2 1 60 15 3
3 1 90 11 -1
4 2 0 10 0
5 2 20 11 1
6 2 60 13 3
7 2 90 17 7
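One caveat, with a sketch to go with it: transform('first') takes whatever row happens to come first in each TRACK group, which matches the TIME = 0 row only because the data is already sorted that way. If that ordering is not guaranteed, one option (an assumption beyond the answer above) is to build a TRACK -> POSITION_X-at-TIME-0 lookup explicitly and map it back:
import pandas as pd

df = pd.DataFrame({'TRACK': [1, 1, 1, 1, 2, 2, 2, 2],
                   'TIME': [0, 30, 60, 90, 0, 20, 60, 90],
                   'POSITION_X': [12, 13, 15, 11, 10, 11, 13, 17]})

# baseline: POSITION_X of the TIME == 0 row, indexed by TRACK
baseline = df.loc[df['TIME'].eq(0)].set_index('TRACK')['POSITION_X']
# subtract each row's track baseline from its POSITION_X
df['NEW_POSX'] = df['POSITION_X'] - df['TRACK'].map(baseline)
print(df)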

How to downsample time series data in pandas?

I have a time series in pandas that looks like this (ordered by id):
id time value
1 0 2
1 1 4
1 2 5
1 3 10
1 4 15
1 5 16
1 6 18
1 7 20
2 15 3
2 16 5
2 17 8
2 18 10
4 6 5
4 7 6
I want to downsample the time from 1 minute to 3 minutes for each id group.
And value should be the maximum within each (id, 3-minute) group.
The output should be like:
id time value
1 0 5
1 1 16
1 2 20
2 0 8
2 1 10
4 0 6
I tried a loop, but it takes a long time to process.
Any idea how to solve this for a large dataframe?
Thanks!
You can convert your time series to an actual timedelta, then use resample for a vectorized solution:
t = pd.to_timedelta(df.time, unit='T')
s = df.set_index(t).groupby('id').resample('3T').last().reset_index(drop=True)
s.assign(time=s.groupby('id').cumcount())
id time value
0 1 0 5
1 1 1 16
2 1 2 20
3 2 0 8
4 2 1 10
5 4 0 6
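A note on this: .last() reproduces the expected output only because value never decreases within each id in the sample. Since the question asks for the maximum per 3-minute bucket, a variant of the same idea (a sketch, using the 'min'/'3min' frequency aliases) would aggregate with max instead:
import pandas as pd

df = pd.DataFrame({'id':    [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 4, 4],
                   'time':  [0, 1, 2, 3, 4, 5, 6, 7, 15, 16, 17, 18, 6, 7],
                   'value': [2, 4, 5, 10, 15, 16, 18, 20, 3, 5, 8, 10, 5, 6]})

t = pd.to_timedelta(df.time, unit='min')   # minutes as a timedelta index
s = (df.set_index(t)
       .groupby('id')['value']
       .resample('3min')                   # 3-minute buckets per id
       .max()                              # maximum per bucket, as asked
       .reset_index())
s['time'] = s.groupby('id').cumcount()     # relabel buckets 0, 1, 2, ...
print(s)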
Use np.r_ and .iloc with groupby:
import numpy as np
df.groupby('id')['value'].apply(lambda x: x.iloc[np.r_[2:len(x):3, -1]])
Output:
id
1 2 5
5 16
7 20
2 10 8
11 10
4 13 6
Name: value, dtype: int64
Going a little further with column naming, etc.:
df_out = df.groupby('id')['value']\
           .apply(lambda x: x.iloc[np.r_[2:len(x):3, -1]]).reset_index()
df_out.assign(time=df_out.groupby('id').cumcount()).drop('level_1', axis=1)
Output:
id value time
0 1 5 0
1 1 16 1
2 1 20 2
3 2 8 0
4 2 10 1
5 4 6 0
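Another option, sketched here under the assumption that each id's rows are already sorted by time and represent consecutive minutes (as in the sample), is to bucket rows positionally with cumcount and aggregate with max, skipping the timedelta conversion entirely:
import pandas as pd

df = pd.DataFrame({'id':    [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 4, 4],
                   'time':  [0, 1, 2, 3, 4, 5, 6, 7, 15, 16, 17, 18, 6, 7],
                   'value': [2, 4, 5, 10, 15, 16, 18, 20, 3, 5, 8, 10, 5, 6]})

# positional 3-row buckets within each id: 0, 0, 0, 1, 1, 1, 2, ...
bucket = df.groupby('id').cumcount() // 3
out = (df.groupby(['id', bucket])['value']
         .max()
         .rename_axis(['id', 'time'])
         .reset_index())
print(out)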

Run calculations on list of selected columns [duplicate]

With the DataFrame below as an example,
In [83]:
df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
df
Out[83]:
A B values
0 1 1 10
1 1 2 15
2 2 1 20
3 2 2 25
What would be a simple way to generate a new column containing some aggregation of the data over one of the columns?
For example, if I sum values over items in A
In [84]:
df.groupby('A').sum()['values']
Out[84]:
A
1 25
2 45
Name: values
How can I get
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
In [20]: df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
In [21]: df
Out[21]:
A B values
0 1 1 10
1 1 2 15
2 2 1 20
3 2 2 25
In [22]: df['sum_values_A'] = df.groupby('A')['values'].transform(np.sum)
In [23]: df
Out[23]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
I found a way using join:
In [101]:
aggregated = df.groupby('A').sum()['values']
aggregated.name = 'sum_values_A'
df.join(aggregated,on='A')
Out[101]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
Does anyone have a simpler way to do it?
This is not as direct, but I found it very intuitive (using map to create new columns from another column), and it can be applied to many other cases:
gb = df.groupby('A').sum()['values']
def getvalue(x):
    return gb[x]
df['sum'] = df['A'].map(getvalue)
df
In [15]: def sum_col(df, col, new_col):
   ....:     df[new_col] = df[col].sum()
   ....:     return df
In [16]: df.groupby("A").apply(sum_col, 'values', 'sum_values_A')
Out[16]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
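A small present-day note (not from the original answers): recent pandas versions prefer the string alias 'sum' over passing np.sum to transform, which lets pandas dispatch to its optimized groupby sum; the result is the same.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 1, 2], 'values': np.arange(10, 30, 5)})

# string alias instead of np.sum; identical output, uses pandas' built-in groupby sum
df['sum_values_A'] = df.groupby('A')['values'].transform('sum')
print(df)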
