Dataframe: sum Size over duplicate Items - python

I have a dataframe df as follows:
Item Size
0 .decash 1
1 .decash 2
2 usdjpy 1
3 .decash 1
4 usdjpy 1
I would like to transform it into a df2 as follows (drop duplicates and sum Size):
Item Size
0 .decash 4
1 usdjpy 2

You can use groupby(..., as_index=False) and sum():
In [270]: df.groupby('Item', as_index=False)['Size'].sum()
Out[270]:
Item Size
0 .decash 4
1 usdjpy 2
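For reference, a minimal self-contained version rebuilding the sample frame shown above (variable names are just illustrative):
import pandas as pd

df = pd.DataFrame({'Item': ['.decash', '.decash', 'usdjpy', '.decash', 'usdjpy'],
                   'Size': [1, 2, 1, 1, 1]})

# one row per Item, with Size summed across the duplicates
df2 = df.groupby('Item', as_index=False)['Size'].sum()
print(df2)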

Related

python pandas adding 2 dataframe with specific column

I have 2 dataframes; the first one looks like this:
Date id name amount period
2011-06-30 1 A 10000 1
2011-06-30 2 B 10000 1
2011-06-30 3 C 10000 1
And the other one looks like this:
id amount period
1 10000 1
3 10000 0
And the result that I want looks like this:
id amount period
1 20000 2
2 10000 1
3 20000 1
How can I do that in python pandas?
Use concat with the filtered columns, then aggregate with sum:
df = pd.concat([df1[['id','amount','period']], df2]).groupby('id', as_index=False).sum()
print (df)
id amount period
0 1 20000 2
1 2 10000 1
2 3 20000 1
EDIT:
If you need to subtract by id, create an index from id and then use DataFrame.sub:
df11 = df1[['id','amount','period']].set_index('id')
df22 = df2.set_index('id')
df3 = df11.sub(df22, fill_value=0).reset_index()
print (df3)
id amount period
0 1 0.0 0.0
1 2 10000.0 1.0
2 3 0.0 1.0
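For completeness, a minimal sketch that rebuilds the two frames from the sample data above and applies the concat/groupby approach (frame and variable names are just for illustration):
import pandas as pd

df1 = pd.DataFrame({'Date': ['2011-06-30'] * 3, 'id': [1, 2, 3],
                    'name': ['A', 'B', 'C'], 'amount': [10000] * 3, 'period': [1] * 3})
df2 = pd.DataFrame({'id': [1, 3], 'amount': [10000, 10000], 'period': [1, 0]})

# stack only the shared columns, then sum per id
out = (pd.concat([df1[['id', 'amount', 'period']], df2])
         .groupby('id', as_index=False).sum())
print(out)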

Get consecutive occurrences of an event by group in pandas

I'm working with a DataFrame that has id, wage and date, like this:
id wage date
1 100 201212
1 100 201301
1 0 201302
1 0 201303
1 120 201304
1 0 201305
.
2 0 201302
2 0 201303
And I want to create an n_months_no_income column that counts how many consecutive months a given individual has had wage==0, like this:
id wage date n_months_no_income
1 100 201212 0
1 100 201301 0
1 0 201302 1
1 0 201303 2
1 120 201304 0
1 0 201305 1
. .
2 0 201302 1
2 0 201303 2
I feel it's some sort of mix between groupby('id'), cumcount(), maybe diff() or apply(), and then a fillna(0), but I'm not finding the right one.
Do you have any ideas?
Here's an example for the dataframe for ease of replication:
df = pd.DataFrame({'id': [1,1,1,1,1,1,2,2], 'wage': [100,100,0,0,120,0,0,0],
                   'date': [201212,201301,201302,201303,201304,201305,201302,201303]})
Edit: Added code for ease of use.
In your case, use two groupbys with cumcount, creating the additional key with cumsum:
df.groupby('id').wage.apply(lambda x : x.groupby(x.ne(0).cumsum()).cumcount())
Out[333]:
0 0
1 0
2 1
3 2
4 0
5 1
Name: wage, dtype: int64
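If you want the result attached as the n_months_no_income column from the question, one possible follow-up is below; it is only a sketch and assumes the rows are already ordered by id and date, as in the sample, so positional assignment is safe:
counts = df.groupby('id').wage.apply(lambda x: x.groupby(x.ne(0).cumsum()).cumcount())
# .values aligns by position with the original (already id-ordered) rows
df['n_months_no_income'] = counts.values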

How to Transpose multiple columns into multiple rows but retain primary keys as is using Pandas

I have a dataframe which can be generated from the code given below
df = pd.DataFrame({'person_id': [1, 2, 3],
                   'date1': ['12/31/2007', '11/25/2009', '10/06/2005'], 'date1derived': [0, 0, 0], 'val1': [2, 4, 6],
                   'date2': ['12/31/2017', '11/25/2019', '10/06/2015'], 'date2derived': [0, 0, 0], 'val2': [1, 3, 5],
                   'date3': ['12/31/2027', '11/25/2029', '10/06/2025'], 'date3derived': [0, 0, 0], 'val3': [7, 9, 11]})
The dataframe looks as shown in the screenshot below.
I would like the data for each person to appear as separate rows rather than spread across columns as in the screenshot above. In addition, I want the date1derived and date2derived columns to be dropped.
I did try the approaches below, but they didn't provide the expected output:
1) df.set_index(['person_id']).stack()/unstack
2) df.set_index(['person_id','date1','date2','date3']).stack()/unstack()
3) df.set_index('person_id').unstack()/stack
How can I get an output like this? I have more than 600 columns, so I don't think writing the column names manually would help me.
This is a wide_to_long problem:
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id', j='grp').sort_index(level=0)
date val
person_id grp
1 1 12/31/2007 2
2 12/31/2017 1
3 12/31/2027 7
2 1 11/25/2009 4
2 11/25/2019 3
3 11/25/2029 9
3 1 10/06/2005 6
2 10/06/2015 5
3 10/06/2025 11
To match your expected output:
df = pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id', j='grp').sort_index(level=0)
df = df.reset_index(level=1, drop=True).reset_index()
person_id date val
0 1 12/31/2007 2
1 1 12/31/2017 1
2 1 12/31/2027 7
3 2 11/25/2009 4
4 2 11/25/2019 3
5 2 11/25/2029 9
6 3 10/06/2005 6
7 3 10/06/2015 5
8 3 10/06/2025 11
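The date1derived/date2derived/date3derived columns do not match the stubnames, so wide_to_long leaves them intact; since the question wants them dropped, one hedged option is to filter them out before reshaping:
# drop the *derived helper columns first (sketch; assumes the naming shown in the question)
df = df.drop(columns=[c for c in df.columns if c.endswith('derived')])
df = pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id', j='grp').sort_index(level=0)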
You can also do it without wide_to_long(), using just append():
df2 = pd.DataFrame()
for i in range(1, 4):
    new_df = df[['person_id', f'date{i}', f'val{i}']]
    new_df.columns = ['person_id', 'date', 'val']
    df2 = df2.append(new_df)
df2.sort_values('person_id').reset_index(drop=True)
output:
person_id date val
0 1 12/31/2007 2
1 1 12/31/2017 1
2 1 12/31/2027 7
3 2 11/25/2009 4
4 2 11/25/2019 3
5 2 11/25/2029 9
6 3 10/06/2005 6
7 3 10/06/2015 5
8 3 10/06/2025 11
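Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a sketch of an equivalent loop that collects the pieces in a list (frames is just an illustrative name) and concatenates once:
frames = []
for i in range(1, 4):
    new_df = df[['person_id', f'date{i}', f'val{i}']]
    new_df.columns = ['person_id', 'date', 'val']
    frames.append(new_df)
df2 = pd.concat(frames).sort_values('person_id').reset_index(drop=True)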

make a unique enumeration for concatenated pandas df

I have some dataframes where data is tagged in groups, let's say as such:
df1 = pd.DataFrame({'id':[1,3,7, 10,30, 70, 100, 300], 'name':[1,1,1,1,1,1,1,1], 'tag': [1,1,1, 2,2,2, 3,3]})
df2 = pd.DataFrame({'id':[2,5,6, 20, 50, 200, 500, 600], 'name': [2,2,2,2,2,2,2,2], 'tag':[1,1,1, 2, 2, 3,3,3]})
df3 = pd.DataFrame({'id':[4, 8, 9, 40, 400, 800, 900], 'name': [3,3,3,3,3,3,3], 'tag':[1,1,1, 2, 3, 3,3]})
In each dataframe, the tag is attributed in ascending order of ids (so bigger ids will have equal or bigger tags).
My wish is to recalculate tags in the concatenated dataframe,
df = pd.concat([df1, df2, df3])
so that the tag of each group will be in ascending order of the id of the first element of each group. So, the group starting with id=1 will be tagged 1 (that is, ids 1,3,7), the group starting with id=2 will be tagged 2 (that is, ids 2,5,6), the group starting with 4 will be tagged 3, the group starting with 10 will be tagged 4, and so on.
I did manage to get a (complicated!) solution:
1) Get the first row of each group, put those in a dataframe, sort by id and create the new tags:
dff = pd.concat([df1.groupby('tag').first(), df2.groupby('tag').first(), df3.groupby('tag').first()])
dff = dff.sort_values('id')
dff = dff.reset_index()
dff['new_tags'] = dff.index +1
2) Concatenate this dataframe with the initial ones, drop_duplicates so as to keep the newly tagged rows, order by group, then propagate the new tags:
df = pd.concat([dff, df1, df2, df3])
df = df.drop_duplicates(subset=['id', 'tag', 'name'])
df = df.sort_values(['name', 'tag'])
df = df.fillna(method = 'pad')
The new tags are exactly what is needed, but my solution seems too complicated. Would you have a suggestion on how to make it easier? I think I must be missing something!
Thanks in advance,
M.
Using pd.concat + keys, I break down the steps:
df=pd.concat([df1,df2,df3],keys=[0,1,2])
df=df.reset_index(level=0)#get the level=0 index
df=df.sort_values(['tag','level_0']) # sort the value
df['New']=(df['tag'].diff().ne(0)|df['level_0'].diff().ne(0)).cumsum()
df
Out[110]:
level_0 id name tag New
0 0 1 1 1 1
1 0 3 1 1 1
2 0 7 1 1 1
0 1 2 2 1 2
1 1 5 2 1 2
2 1 6 2 1 2
0 2 4 3 1 3
1 2 8 3 1 3
2 2 9 3 1 3
3 0 10 1 2 4
4 0 30 1 2 4
5 0 70 1 2 4
3 1 20 2 2 5
4 1 50 2 2 5
3 2 40 3 2 6
6 0 100 1 3 7
7 0 300 1 3 7
5 1 200 2 3 8
6 1 500 2 3 8
7 1 600 2 3 8
4 2 400 3 3 9
5 2 800 3 3 9
6 2 900 3 3 9
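If you then want the frame back in ascending id order without the helper column, a possible cleanup step (sketch):
df = df.drop(columns='level_0').sort_values('id').reset_index(drop=True)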
Once concatenated, you can groupby the columns 'tag' and 'name' and use transform('first') on the column 'id'. Then sort_values this series and cumsum the positions where the diff is non-zero, such as:
df = pd.concat([df1, df2, df3]).sort_values('id').reset_index(drop=True)
df['new'] = (df.groupby(['tag','name'])['id'].transform('first')
               .sort_values().diff().ne(0.).cumsum())
and you get the expected output:
id name tag new
0 1 1 1 1
1 2 2 1 2
2 3 1 1 1
3 4 3 1 3
4 5 2 1 2
5 6 2 1 2
6 7 1 1 1
7 8 3 1 3
8 9 3 1 3
9 10 1 2 4
10 20 2 2 5
11 30 1 2 4
12 40 3 2 6
...
EDIT: to avoid using groupby, you can use drop_duplicates to get the index of the first id of each group, create the column new with an incremental value using loc and range, and then ffill after sort_values to fill in the values:
df = pd.concat([df1, df2, df3]).sort_values('id').reset_index(drop=True)
list_ind = df.drop_duplicates(['name','tag']).index
df.loc[list_ind,'new'] = range(1,len(list_ind)+1)
df['new'] = df.sort_values(['tag','name'])['new'].ffill().astype(int)
and you get the same result

How to convert pandas dataframe so that index is the unique set of values and data is the count of each value?

I have a dataframe from multiple choice questions, and it is formatted like so:
Sex Qu1 Qu2 Qu3
Name
Bob M 1 2 1
John M 3 3 5
Alex M 4 1 2
Jen F 3 2 4
Mary F 4 3 4
The data is a rating from 1 to 5 for the 3 multiple choice questions. I want to rearrange the data so that the index is range(1,6) where 1='bad', 2='poor', 3='ok', 4='good', 5='excellent', the columns are the same, and the data is the count of occurrences of each value (excluding the Sex column). This is basically a histogram with fixed bin sizes and the x-axis labeled with strings. I like the output of df.plot() much better than df.hist() for this, but I can't figure out how to rearrange the table to give me a histogram of the data. Also, how do you change the x-labels to strings?
Series.value_counts gives you the histogram you're looking for:
In [9]: df['Qu1'].value_counts()
Out[9]:
4 2
3 2
1 1
So, apply this function to each of those 3 columns:
In [13]: table = df[['Qu1', 'Qu2', 'Qu3']].apply(lambda x: x.value_counts())
In [14]: table
Out[14]:
Qu1 Qu2 Qu3
1 1 1 1
2 NaN 2 1
3 2 2 NaN
4 2 NaN 2
5 NaN NaN 1
In [15]: table = table.fillna(0)
In [16]: table
Out[16]:
Qu1 Qu2 Qu3
1 1 1 1
2 0 2 1
3 2 2 0
4 2 0 2
5 0 0 1
Using table.reindex or table.loc[some_array] (table.ix in older pandas, since removed) you can rearrange the data.
To transform to strings, use table.rename:
In [17]: table.rename(index=str)
Out[17]:
Qu1 Qu2 Qu3
1 1 1 1
2 0 2 1
3 2 2 0
4 2 0 2
5 0 0 1
In [18]: table.rename(index=str).index[0]
Out[18]: '1'
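To get the string labels from the question ('bad' through 'excellent') on the x-axis, one hedged sketch is to reindex over the full 1-5 range, fill the gaps, rename, and plot (the labels dict is taken from the question's mapping):
labels = {1: 'bad', 2: 'poor', 3: 'ok', 4: 'good', 5: 'excellent'}
table = (df[['Qu1', 'Qu2', 'Qu3']]
         .apply(lambda x: x.value_counts())
         .reindex(range(1, 6))
         .fillna(0)
         .rename(index=labels))
table.plot(kind='bar')  # bars labeled 'bad' ... 'excellent'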
