I have the DataFrame below:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.random.randint(1, 10, 9),
                   "B": np.random.randint(1, 10, 9),
                   "C": list('abbcacded')})
A B C
0 9 6 a
1 2 2 b
2 1 9 b
3 8 2 c
4 7 6 a
5 3 5 c
6 1 3 d
7 9 9 e
8 3 4 d
I would like to get the grouping result below (with column "C" as the key); the rows for c, d and e are dropped intentionally.
   number  A_sum  B_sum
a       2     16     12
b       2      3     11
This is a 2-row by 3-column DataFrame, and the grouping key is column C.
The column "number" represents the count of each letter (a and b).
A_sum and B_sum represent the per-group sums of columns A and B, grouped by column C.
I guess we should use the groupby method, but how can I get this summary table?
You can do this with a single groupby using named aggregation (the nested-dict renaming syntax of older pandas, e.g. {'B': {'sum': 'sum', 'count': 'count'}}, was removed in pandas 1.0):
res = df.groupby('C').agg(number=('C', 'size'),
                          A_sum=('A', 'sum'),
                          B_sum=('B', 'sum'))
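Since the question drops the c, d and e groups on purpose, here is a minimal follow-up sketch; the .loc filter is my addition, not part of the answer above:
res = res.loc[['a', 'b']]  # keep only the groups of interest
print(res)
#    number  A_sum  B_sum
# C
# a       2     16     12
# b       2      3     11
The values shown assume the example data printed in the question; with np.random.randint the actual numbers will vary from run to run.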
One option is to count the size and sum the columns for each group separately and then join them by index (again avoiding the removed dict-renaming form df.groupby("C")['A'].agg({"number": 'size'})):
df.groupby("C").size().to_frame("number").join(df.groupby("C").sum())
#    number   A   B
# C
# a       2  11   8
# b       2  14  12
# c       2   8   5
# d       2  11  12
# e       1   7   2
You can also do df.groupby('C').agg(["sum", "size"]), which gives a duplicated size column (one per value column), but if you are fine with that, it should also work.
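For reference, a minimal sketch of that variant with the resulting MultiIndex columns flattened; the renaming scheme is my own choice, not from the answers above:
res = df.groupby("C").agg(["sum", "size"])
# the columns are now a MultiIndex: ('A', 'sum'), ('A', 'size'), ('B', 'sum'), ('B', 'size')
res.columns = [f"{col}_{stat}" for col, stat in res.columns]
# keep one of the duplicated size columns and call it "number"
res = res.drop(columns="B_size").rename(columns={"A_size": "number"})
print(res[["number", "A_sum", "B_sum"]])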
Related
I have two columns.
The first column has the values A, B, C, D, and the second column has the values corresponding to A, B, C, and D.
I'd like to convert/transpose A, B, C and D into four columns named A, B, C, D, with the values that previously corresponded to each letter (the original second column) placed beneath the respective column. The original order must be preserved.
Here's an example.
Input:
A|1
B|2
C|3
D|4
A|3
B|6
C|3
D|6
Desired output:
A|B|C|D
1|2|3|4
3|6|3|6
Any ideas on how I can accomplish this using Pandas/Python?
Thanks a lot!
Very similar to pivoting with two columns (Q/A 10 here):
(df.assign(idx=df.groupby('col1').cumcount())
.pivot(index='idx', columns='col1', values='col2')
)
Output:
col1 A B C D
idx
0 1 2 3 4
1 3 6 3 6
To preserve the original column order, you need to "capture" it first; I am going to use the unique method for this situation:
Given df,
df = pd.DataFrame({'Col1':[*'ZCYBWA']*2, 'Col2':np.arange(12)})
Col1 Col2
0 Z 0
1 C 1
2 Y 2
3 B 3
4 W 4
5 A 5
6 Z 6
7 C 7
8 Y 8
9 B 9
10 W 10
11 A 11
Let's get order using unique:
order = df['Col1'].unique()
Then we can reshape using:
df.set_index([df.groupby('Col1').cumcount(), 'Col1'])['Col2'].unstack()
Col1 A B C W Y Z
0 5 3 1 4 2 0
1 11 9 7 10 8 6
But by adding reindex with the captured order (unique preserves order of appearance), we get the original order back:
df.set_index([df.groupby('Col1').cumcount(), 'Col1'])['Col2'].unstack().reindex(order, axis=1)
Col1 Z C Y B W A
0 0 1 2 3 4 5
1 6 7 8 9 10 11
Is there a way to pass column index positions as arguments, rather than column names?
Every example that I see is written with column names in value_vars. I need to use the column index.
For instance, instead of:
df2 = pd.melt(df,value_vars=['asset1','asset2'])
Using something similar to:
df2 = pd.melt(df,value_vars=[0,1])
Select the column names by indexing df.columns:
df = pd.DataFrame({
    'asset1': list('acacac'),
    'asset2': [4]*6,
    'A': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4]
})
df2 = pd.melt(df,
              id_vars=df.columns[[0, 1]],
              value_vars=df.columns[[2, 3]],
              var_name='c_name',
              value_name='Value')
print(df2)
asset1 asset2 c_name Value
0 a 4 A 7
1 c 4 A 8
2 a 4 A 9
3 c 4 A 4
4 a 4 A 2
5 c 4 A 3
6 a 4 D 1
7 c 4 D 3
8 a 4 D 5
9 c 4 D 7
10 a 4 D 1
11 c 4 D 0
I currently have a pandas DataFrame df of size 168078 rows × 43 columns. A summary of df is shown below:
doi gender order year ... count
9384155 10.1103/PRL.102.039801 male 1 2009 ... 1
...
3679211 10.1103/PRD.69.024009 male 2 2004 ... 501
The df is currently sorted by count, which varies from 1 to 501.
I would like to split the df into 501 smaller sub-DataFrames by splitting on count. In other words, at the end of the process I would have 501 sub-DataFrames, each with a characteristic count value.
Since the number of resulting (desired) DataFrames is quite high, and since it is quantitative data, I was wondering if:
a) it is possible to split the DataFrame that many times (if yes, then how), and
b) it is possible to name each DataFrame quantitatively without manually assigning a name 501 times; for example, the df with count == 1 would be df.1, without having to assign it by hand.
The best practice is to create a dictionary of DataFrames. Below is an example:
df = pd.DataFrame({'A': [4, 5, 6, 7, 7, 5, 4, 5, 6, 7],
                   'count': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'C': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']})
print(df)
A count C
0 4 1 a
1 5 2 b
2 6 3 c
3 7 4 d
4 7 5 e
5 5 6 f
6 4 7 g
7 5 8 h
8 6 9 i
9 7 10 j
Now we create the dictionary. As you can see, the key is the value of count in each row.
Keep in mind that Series.unique is used here so that, when two rows share the same count value, they end up in the same dictionary entry.
dfs={key:df[df['count']==key] for key in df['count'].unique()}
Below I show the content of the entire dictionary created and how to access it:
for key in dfs:
print(f'dfs[{key}]')
print(dfs[key])
print('-'*50)
dfs[1]
A count C
0 4 1 a
--------------------------------------------------
dfs[2]
A count C
1 5 2 b
--------------------------------------------------
dfs[3]
A count C
2 6 3 c
--------------------------------------------------
dfs[4]
A count C
3 7 4 d
--------------------------------------------------
dfs[5]
A count C
4 7 5 e
--------------------------------------------------
dfs[6]
A count C
5 5 6 f
--------------------------------------------------
dfs[7]
A count C
6 4 7 g
--------------------------------------------------
dfs[8]
A count C
7 5 8 h
--------------------------------------------------
dfs[9]
A count C
8 6 9 i
--------------------------------------------------
dfs[10]
A count C
9 7 10 j
--------------------------------------------------
You can just use groupby to get the result, as shown below. Here:
g.groups: gives the group names (group keys) of all groups
g.get_group: gives you the single group with a given group name
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': np.random.choice(["a", "b", "c", "d"], 10),
                   'count': np.random.choice(10, 10)})
g = df.groupby("count")
for key in g.groups:
print(g.get_group(key))
print("\n---------------")
Result
A count
3 c 0
---------------
A count
9 a 2
---------------
A count
0 c 3
2 b 3
---------------
A count
1 b 4
5 d 4
6 a 4
7 b 4
---------------
A count
8 c 5
---------------
A count
4 d 8
---------------
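As a side note, a compact variant of the same idea builds the whole dictionary at once; this is a sketch relying on the fact that iterating a GroupBy yields (key, sub-DataFrame) pairs:
dfs = dict(iter(df.groupby("count")))
print(dfs[3])  # the sub-DataFrame whose count equals 3, assuming that key exists in the random data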
In the following dataset, what's the best way to duplicate the rows of any groupby(['Type']) group with count < 3 until it reaches 3? df is the input and df1 is my desired outcome; you can see that row 3 from df was duplicated twice at the end. This is only an example deck: the real data has approximately 20 million lines and 400K unique Types, so an efficient method is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
I thought about using something like the following, but I do not know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0).astype(int)
extra = df.set_index('Type')['Val'].repeat(df['repeat_num'].to_numpy()).reset_index()
df = pd.concat([df, extra], sort=False, ignore_index=True)[['Type', 'Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: on pandas versions before 2.0 you can use df.append(extra, sort=False, ignore_index=True) instead of pd.concat; sort=False for append requires pandas >= 0.23.0.
EDIT: If the data contains multiple value columns, move all columns except one into the index, repeat, and then reset_index:
extra = (df.set_index(['Type', 'Val_1', 'Val_2'])['Val']
           .repeat(df['repeat_num'].to_numpy())
           .reset_index())
df = pd.concat([df, extra], sort=False, ignore_index=True)
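Here is a minimal self-contained sketch of that multi-column case; Val_1 and Val_2 are made-up columns for illustration, and repeat_num is hard-coded rather than derived from value_counts:
import pandas as pd

df = pd.DataFrame({'Type': ['a', 'b'], 'Val': [1, 2],
                   'Val_1': [10, 20], 'Val_2': [100, 200],
                   'repeat_num': [0, 2]})
# repeat each row repeat_num extra times, carrying the other columns along in the index
extra = (df.set_index(['Type', 'Val_1', 'Val_2'])['Val']
           .repeat(df['repeat_num'].to_numpy())
           .reset_index())
out = pd.concat([df.drop(columns='repeat_num'), extra],
                sort=False, ignore_index=True)
print(out)
#   Type  Val  Val_1  Val_2
# 0    a    1     10    100
# 1    b    2     20    200
# 2    b    2     20    200
# 3    b    2     20    200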
Currently, I'm using:
csvdata.update(data, overwrite=True)
How can I make it update and overwrite a specific column but not another? It is a small but simple question; is there a simple answer?
Rather than update with the entire DataFrame, just update with the sub-DataFrame of the columns you are interested in. For example:
In [11]: df1
Out[11]:
A B
0 1 99
1 3 99
2 5 6
In [12]: df2
Out[12]:
A B
0 a 2
1 b 4
2 c 6
In [13]: df1.update(df2[['B']]) # subset of cols = ['B']
In [14]: df1
Out[14]:
A B
0 1 2
1 3 4
2 5 6
If you want to do it for a single column:
import pandas
import numpy

csvdata = pandas.DataFrame({"a": range(12), "b": range(12)})
other = pandas.Series(list("abcdefghijk") + [numpy.nan])
csvdata["a"].update(other)
print(csvdata)
a b
0 a 0
1 b 1
2 c 2
3 d 3
4 e 4
5 f 5
6 g 6
7 h 7
8 i 8
9 j 9
10 k 10
11 11 11
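One caveat, stated as an assumption about newer pandas: with copy-on-write enabled, updating a column through the chained csvdata["a"] access may no longer write back to the DataFrame; csvdata.update(other.rename("a")) is a safer spelling, since update then aligns the named Series to column "a" directly.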
or, as long as the column names match, you can do this:
other = pandas.DataFrame({"a":list("abcdefghijk")+[numpy.nan], "b":list("abcdefghijk")+[numpy.nan]})
csvdata.update(other["a"])
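This works because update aligns on both the row index and the Series name, so only column "a" is touched; NaN values in the updater are skipped by default, which is why row 11 in the earlier output keeps its original value.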