I have a dataframe df:
AID JID CID
0 1 A NaN
1 1 A NaN
2 1 B NaN
3 1 NaN X
4 3 A NaN
5 4 NaN NaN
6 4 C X
7 5 C Y
8 5 C X
9 6 A NaN
10 6 B NaN
I want to calculate how many times each AID has used each JID or CID value.
The resulting dataframe should look like this, where the index is the AID values and the columns are the JID and CID values:
A B C X Y
1 2 1 0 1 0
3 1 0 0 0 0
4 0 0 1 1 0
5 0 0 2 1 1
6 1 1 0 0 0
I know how to do it by looping and counting manually, but I was wondering what a more efficient way would be.
I'd melt and then use pivot_table:
In [80]: d2 = pd.melt(df, id_vars="AID")
In [81]: d2.pivot_table(index="AID", columns="value", values="variable",
aggfunc="count", fill_value=0)
Out[81]:
value A B C X Y
AID
1 2 1 0 1 0
3 1 0 0 0 0
4 0 0 1 1 0
5 0 0 2 1 1
6 1 1 0 0 0
This works because melt "flattens" the dataframe into something where we can more easily access the values together, and pivot_table is for exactly the type of aggregation you have in mind:
In [90]: pd.melt(df, "AID")
Out[90]:
AID variable value
0 1 JID A
1 1 JID A
2 1 JID B
3 1 JID NaN
4 3 JID A
[... skipped]
17 4 CID X
18 5 CID Y
19 5 CID X
20 6 CID NaN
21 6 CID NaN
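As a side note, once the frame is melted, pd.crosstab gives the same count table in one call (a sketch; crosstab drops the NaN entries and fills missing combinations with 0):
pd.crosstab(d2["AID"], d2["value"])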
You can first create a Series with stack, then use groupby with value_counts, and finally reshape with unstack:
df = df.set_index('AID').stack().groupby(level=0).value_counts().unstack(1, fill_value=0)
print (df)
A B C X Y
AID
1 2 1 0 1 0
3 1 0 0 0 0
4 0 0 1 1 0
5 0 0 2 1 1
6 1 1 0 0 0
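For reference, the same chain broken into steps (a sketch using the original df from the question; the intermediate names are just illustrative):
s = df.set_index('AID').stack()                 # long Series of JID/CID values; stack drops the NaNs
counts = s.groupby(level=0).value_counts()      # count each value per AID -> MultiIndex (AID, value)
result = counts.unstack(1, fill_value=0)        # move the counted values back into columns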
I am trying to pivot on a dataframe that looks like so
country col_a col_b col_c status group
a 4 5 6 confirmed z
a 4 5 6 failed z
a 4 5 6 unknown y
a 4 5 6 confirmed z
b 4 5 6 failed y
b 4 5 6 confirmed y
b 4 5 6 failed z
b 4 5 6 confirmed z
b 4 5 6 confirmed z
I am trying to pivot so that I have a total for each country, and then a breakdown for each group within that country, as below.
country group confirmed failed unknown
a NaN 2 1 1
NaN z 2 1 0
NaN y 0 0 1
b NaN 3 2 0
NaN z 2 1 0
NaN y 1 1 0
The issue I'm having is that, whilst it will look just like this, it then appends the other columns across the top and simply repeats the status columns, as below.
col_a col_b col_c
country group confirmed failed unknown confirmed failed unknown confirmed failed unknown
a NaN 2 1 1 2 1 1 2 1 1
NaN z 2 1 0 2 1 0 2 1 0
NaN y 0 0 1 0 0 1 0 0 1
b NaN 3 2 0 3 2 0 3 2 0
NaN z 2 1 0 2 1 0 2 1 0
NaN y 1 1 0 1 1 0 1 1 0
The code I'm using is:
testdf = df2.pivot_table(index=['country','group'], columns='status', aggfunc=len, fill_value=0)
and when it prints in the console, it looks fine. But as soon as I output it to Excel, it's all broken!
Any ideas?
I am not sure whether this is a duplicate question, so I decided to reopen it. I think you want aggfunc='size':
new_df = (df.pivot_table(index=['country', 'group'],
                         columns='status',
                         aggfunc='size',
                         fill_value=0)
            .reset_index()
            .rename_axis(None, axis=1))
print(new_df)
country group confirmed failed unknown
0 a y 0 0 1
1 a z 2 1 0
2 b y 1 1 0
3 b z 2 1 0
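The reason the Excel output repeated the status block is that pivot_table without values= aggregates every remaining column (col_a, col_b, col_c), producing one block per column. If you want to keep aggfunc=len, restricting the aggregation to a single column gives the same counts just once; a sketch of the adjusted original call:
testdf = df2.pivot_table(index=['country', 'group'], columns='status',
                         values='col_a', aggfunc=len, fill_value=0)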
I noticed that OP is looking for what appears to be a "total" row per country. This is a strategy to get that.
from collections import defaultdict

result = defaultdict(int)
cols = ('country', 'group', 'status')

for c, g, s in zip(*map(df2.get, cols)):
    result[(c, g, s)] += 1
    result[(c, 'total', s)] += 1

pd.Series(result).rename_axis(cols[:2] + (None,)).unstack(fill_value=0).reset_index()
country group confirmed failed unknown
0 a total 2 1 1
1 a y 0 0 1
2 a z 2 1 0
3 b total 3 2 0
4 b y 1 1 0
5 b z 2 1 0
Strategy 2
result = {}
for c, grp in df2.groupby('country'):
    result[(c, 'total')] = {**grp.status.value_counts()}
    for g, grp_ in grp.groupby('group'):
        result[(c, g)] = {**grp_.status.value_counts()}

idx = pd.MultiIndex.from_tuples(result.keys(), names=['country', 'group'])
pd.DataFrame.from_records([*result.values()], idx) \
    .fillna(0, downcast='infer').reset_index()
country group confirmed unknown failed
0 a total 2 1 1
1 a y 0 1 0
2 a z 2 0 1
3 b total 3 0 2
4 b y 1 0 1
5 b z 2 0 1
Strategy 3
x = df2.groupby(['group', 'country', 'status']).size()
# sum (not size) the per-group counts to get the per-country totals
y = pd.concat({'total': x.groupby(['country', 'status']).sum()}, names=['group'])
x.append(y).unstack(fill_value=0) \
    .rename_axis(None, axis=1).swaplevel(0, 1).sort_index().reset_index()
country group confirmed failed unknown
0 a total 2 1 1
1 a y 0 0 1
2 a z 2 1 0
3 b total 3 2 0
4 b y 1 1 0
5 b z 2 1 0
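Note that Series.append was removed in pandas 2.0; on newer versions the same chain can be written with pd.concat:
out = (pd.concat([x, y]).unstack(fill_value=0)
         .rename_axis(None, axis=1).swaplevel(0, 1).sort_index().reset_index())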
Another option is to collapse the three value columns into one first, so the pivot has a single values column and the status block is not repeated:
df['abc'] = df[['col_a', 'col_b', 'col_c']].sum(axis=1)
table = pd.pivot_table(df, index=['country', 'group'], columns='status',
                       values='abc', aggfunc='count', fill_value=0)
So I have a pandas dataframe that looks something like this.
name is_something
0 a 0
1 b 1
2 c 0
3 c 1
4 a 1
5 b 0
6 a 1
7 c 0
8 a 1
Is there a way to use groupby and merge to create a new column that gives the number of times a name appears with an is_something value of 1 in the whole dataframe? The updated dataframe would look like this:
name is_something no_of_times_is_something_is_1
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
I know you can just loop through the dataframe to do this but I'm looking for a more efficient way because the dataset I'm working with is quite large. Thanks in advance!
If the is_something column contains only 0 and 1 values, just use sum with GroupBy.transform for a new column filled with the aggregated values:
df['new'] = df.groupby('name')['is_something'].transform('sum')
print (df)
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
If other values are possible, first compare with 1, convert to integer, and then use transform with sum:
df['new'] = df['is_something'].eq(1).view('i1').groupby(df['name']).transform('sum')
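To see why the comparison matters, here is a small sketch with hypothetical data containing a third value (it uses astype(int) instead of view, just for clarity):
demo = pd.DataFrame({'name': ['a', 'a', 'b'], 'is_something': [1, 2, 1]})
demo['new'] = demo['is_something'].eq(1).astype(int).groupby(demo['name']).transform('sum')
# 'a' gets 1 because the row with value 2 is not counted; a plain sum would have given 3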
Or we can just map the per-name counts back onto the name column:
df['New']=df.name.map(df.query('is_something ==1').groupby('name')['is_something'].sum())
df
name is_something New
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
You could do:
df['new'] = df.groupby('name')['is_something'].transform(lambda xs: xs.eq(1).sum())
print(df)
Output
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
I have a dataset like:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
1 0 # --> gets removed since this row appears after id 1 already had a status of 1
2 0
3 0
3 0
I want to drop all rows of an id after its status became 1, i.e. my new dataset will be:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
2 0
3 0
3 0
I want to learn how to implement this computation efficiently since I have a very large (200 GB+) dataset.
The solution I currently have is to find the index of the first 1 and slice each group that way. In cases where no 1 exists, return the group unchanged:
def remove(series):
    indexless = series.reset_index(drop=True)
    ones = indexless[indexless['Status'] == 1]
    if len(ones) > 0:
        return indexless.iloc[:ones.index[0] + 1]
    else:
        return indexless

df.groupby('Id').apply(remove).reset_index(drop=True)
However, this runs very slowly. Is there any way to fix this or otherwise speed up the computation?
The first idea is to create a cumulative sum per group from the boolean mask, but a shift is also necessary to avoid losing the first 1:
#pandas 0.24+
s = (df['Status'] == 1).groupby(df['Id']).apply(lambda x: x.shift(fill_value=0).cumsum())
#pandas below
#s = (df['Status'] == 1).groupby(df['Id']).apply(lambda x: x.shift().fillna(0).cumsum())
df = df[s == 0]
print (df)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
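For large data, here is a fully vectorized sketch of the same idea that avoids the per-group apply (it assumes Status only holds 0 and 1 and starts from the original, unfiltered df):
# cumulative sum minus the current row's Status = number of 1s seen in earlier rows of the group
prev_ones = df.groupby('Id')['Status'].cumsum() - df['Status']
df = df[prev_ones.eq(0)]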
Another solution is to use a custom function with Series.idxmax:
def f(x):
    if x['new'].any():
        return x.iloc[:x['new'].idxmax() + 1, :]
    else:
        return x

df1 = (df.assign(new=(df['Status'] == 1))
         .groupby(df['Id'], group_keys=False)
         .apply(f)
         .drop('new', axis=1))
print (df1)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
Or, a slightly modified version of the first solution: filter only the groups that contain a 1 and apply the solution only there:
m = df['Status'].eq(1)
ids = df.loc[m, 'Id'].unique()
print (ids)
[1]
m1 = df['Id'].isin(ids)
m2 = (m[m1].groupby(df['Id'])
           .apply(lambda x: x.shift(fill_value=0).cumsum())
           .eq(0))
df = df[m2.reindex(df.index, fill_value=True)]
print (df)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
Let's start with this dataset.
l =[[1,0],[1,0],[1,0],[1,0],[1,1],[2,0],[1,0], [2,0], [2,1],[3,0],[2,0], [3,0]]
df_ = pd.DataFrame(l, columns = ['id', 'status'])
We will find the status=1 index for each id.
status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')
index
id
1 4
2 8
Now we join df_ with status_1_indice:
join_table = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)
Notice the .fillna(np.inf) for ids that don't have status=1. Result:
level_0 id status index
0 0 1 0 4.000000
1 1 1 0 4.000000
2 2 1 0 4.000000
3 3 1 0 4.000000
4 4 1 1 4.000000
5 5 2 0 8.000000
6 6 1 0 4.000000
7 7 2 0 8.000000
8 8 2 1 8.000000
9 9 3 0 inf
10 10 2 0 8.000000
11 11 3 0 inf
The required dataframe can be obtained by:
join_table.query('level_0 <= index')[['id', 'status']]
Together:
status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')
join_table = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)
required_df = join_table.query('level_0 <= index')[['id', 'status']]
id status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 2 1
9 3 0
11 3 0
I can't vouch for the performance, but this is more straightforward than the method in the question.
I have a dataframe like the following
df
idA idB yA yB
0 3 2 0 1
1 0 1 0 0
2 0 4 0 1
3 0 2 0 1
4 0 3 0 0
I would like to have a unique y for each id, like so:
df
id y
0 0 0
1 1 0
2 2 1
3 3 0
4 4 1
First create a new DataFrame by flattening the columns selected by iloc with numpy.ravel, then use sort_values and drop_duplicates on the id column:
df2 = (pd.DataFrame({'id': df.iloc[:, :2].values.ravel(),
                     'y': df.iloc[:, 2:4].values.ravel()})
         .sort_values('id')
         .drop_duplicates(subset=['id'])
         .reset_index(drop=True))
print (df2)
id y
0 0 0
1 1 0
2 2 1
3 3 0
4 4 1
Detail:
print (pd.DataFrame({'id': df.iloc[:, :2].values.ravel(),
                     'y': df.iloc[:, 2:4].values.ravel()}))
id y
0 3 0
1 2 1
2 0 0
3 1 0
4 0 0
5 4 1
6 0 0
7 2 1
8 0 0
9 3 0
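If you prefer not to rely on column positions, here is a name-based sketch of the same idea (it assumes the column names from the question; which duplicate wins could differ from the ravel version if an id ever carried different y values):
ids = pd.concat([df['idA'], df['idB']], ignore_index=True)
ys = pd.concat([df['yA'], df['yB']], ignore_index=True)
df2 = (pd.DataFrame({'id': ids, 'y': ys})
         .sort_values('id')
         .drop_duplicates(subset=['id'])
         .reset_index(drop=True))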
Is there a way to convert a pandas dataframe to a series with a MultiIndex? The dataframe's columns could be multi-indexed too.
The approach below works, but only when the column MultiIndex levels have names:
In [163]: d
Out[163]:
a 0 1
b 0 1 0 1
a 0 0 0 0
b 1 2 3 4
c 2 4 6 8
In [164]: d.stack(d.columns.names)
Out[164]:
a b
a 0 0 0
1 0
1 0 0
1 0
b 0 0 1
1 2
1 0 3
1 4
c 0 0 2
1 4
1 0 6
1 8
dtype: int64
I think you can use nlevels to find the number of levels in the columns MultiIndex, then build a range over them and pass it to stack:
print (d.columns.nlevels)
2
#for python 3 add `list`
print (list(range(d.columns.nlevels)))
[0, 1]
print (d.stack(list(range(d.columns.nlevels))))
a b
a 0 0 0
1 0
1 0 0
1 0
b 0 0 1
1 2
1 0 3
1 4
c 0 0 2
1 4
1 0 6
1 8
dtype: int64
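For completeness, here is a sketch that reconstructs the example frame d so the snippet can be run end to end (the construction itself is an assumption, not from the original post):
import pandas as pd
cols = pd.MultiIndex.from_product([[0, 1], [0, 1]], names=['a', 'b'])
d = pd.DataFrame([[0, 0, 0, 0], [1, 2, 3, 4], [2, 4, 6, 8]],
                 index=['a', 'b', 'c'], columns=cols)
print (d.stack(list(range(d.columns.nlevels))))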