Group by with aggregation function as new field in pandas - python

If I do the following group by on a mysql table
SELECT col1, count(col2) * count(distinct(col3)) as agg_col
FROM my_table
GROUP BY col1
what I get is a table with two columns
col1 agg_col
How can I do the same on a pandas dataframe?
Suppose I have a DataFrame that has three columns: col1, col2 and col3. The group by operation
grouped = my_df.groupby('col1')
will return the data grouped by col1.
Also
agg_col_series = grouped.col2.size() * grouped.col3.nunique()
will return the aggregated column equivalent to the one in the SQL query. But how can I add this to the grouped dataframe?

We'd need to see your data to be sure, but I think you need to simply reset the index of your agg_col_series:
agg_col_series.reset_index(name='agg_col')
Full example with dummy data:
import random
import pandas as pd
col1 = [random.randint(1,5) for x in range(1,1000)]
col2 = [random.randint(1,100) for x in range(1,1000)]
col3 = [random.randint(1,100) for x in range(1,1000)]
df = pd.DataFrame(data={
    'col1': col1,
    'col2': col2,
    'col3': col3,
})
grouped = df.groupby('col1')
agg_col_series = grouped.col2.size() * grouped.col3.nunique()
print(agg_col_series.reset_index(name='agg_col'))
   col1  agg_col
0     1    15566
1     2    20056
2     3    17313
3     4    17304
4     5    16380
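One caveat when translating from SQL: COUNT(col2) skips NULLs, while .size() counts every row in the group. If col2 can contain NaN, .count() is the closer equivalent; a minimal sketch using the grouped object above:
# .count() ignores NaN in col2, matching SQL's COUNT(col2);
# .size() counts every row regardless.
agg_col_series = grouped.col2.count() * grouped.col3.nunique()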

Let's use groupby with a lambda function that uses size and nunique, then rename the series to 'agg_col' and reset_index to get a dataframe.
import pandas as pd
import numpy as np
np.random.seed(443)
df = pd.DataFrame({'Col1':np.random.choice(['A','B','C'],50),
                   'Col2':np.random.randint(1000,9999,50),
                   'Col3':np.random.choice(['A','B','C','D','E','F','G','H','I','J'],50)})
df_out = df.groupby('Col1').apply(lambda x: x.Col2.size * x.Col3.nunique()).rename('agg_col').reset_index()
Output:
Col1 agg_col
0 A 120
1 B 96
2 C 190
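On pandas 0.25 or later you can get the same result without apply by using named aggregation; a sketch of the equivalent computation, assuming the df defined above:
# Aggregate row count and distinct count per group, then multiply.
df_out = (df.groupby('Col1')
            .agg(n_rows=('Col2', 'size'), n_unique=('Col3', 'nunique'))
            .assign(agg_col=lambda g: g['n_rows'] * g['n_unique'])
            .reset_index()[['Col1', 'agg_col']])
print(df_out)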

Related

How to convert string date column to timestamp in a new column in Python Pandas

I have the following example dataframe:
d = {'col1': ["2022-05-16T12:31:00Z", "2021-01-11T11:32:00Z"]}
df = pd.DataFrame(data=d)
df
col1
0 2022-05-16T12:31:00Z
1 2021-01-11T11:32:00Z
I need a second column (say col2) that holds the corresponding timestamp for each date string in col1.
How can I do that without using a for loop?
Maybe try this?
import pandas as pd
import numpy as np
d = {'col1': ["2022-05-16T12:31:00Z", "2021-01-11T11:32:00Z"]}
df = pd.DataFrame(data=d)
df['col2'] = pd.to_datetime(df['col1'])
df['col2'] = df.col2.values.astype(np.int64) // 10 ** 9
df
Let us try to_datetime
df['col2'] = pd.to_datetime(df['col1'])
df
Out[614]:
col1 col2
0 2022-05-16T12:31:00Z 2022-05-16 12:31:00+00:00
1 2021-01-11T11:32:00Z 2021-01-11 11:32:00+00:00
Update
st = pd.to_datetime('1970-01-01T00:00:00Z')
df['unix'] = (pd.to_datetime(df['col1'])- st).dt.total_seconds()
Out[632]:
0 1.652704e+09
1 1.610365e+09
Name: col1, dtype: float64
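Putting the two ideas together, a self-contained sketch that yields integer Unix timestamps (assuming the inputs are UTC, as the trailing Z indicates):
import pandas as pd

df = pd.DataFrame({'col1': ["2022-05-16T12:31:00Z", "2021-01-11T11:32:00Z"]})
epoch = pd.Timestamp('1970-01-01', tz='UTC')
# Parse the ISO 8601 strings, then count whole seconds since the epoch.
df['col2'] = (pd.to_datetime(df['col1']) - epoch).dt.total_seconds().astype('int64')
print(df)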

Filter multiple dataframes with criteria from list using loop

The code below creates multiple empty dataframes named from the report2 list. They are then populated from a filtered existing dataframe called dfsource.
With a nested for loop, I'd like to filter each of these dataframes using a list of values, but the inner loop does not work as shown.
import pandas as pd
report=['A','B','C']
suffix='_US'
report2=[s + suffix for s in report]
print (report2) #result: ['A_US', 'B_US', 'C_US']
source = {'COL1': ['A','B','C'], 'COL2': ['D','E','F']}
dfsource=pd.DataFrame(source)
print(dfsource)
df_dict = {}
for i in report2:
    df_dict[i]=pd.DataFrame()
    for x in report:
        df_dict[i]=dfsource.query('COL1==x')
        #df_dict[i]=dfsource.query('COL1=="A"') #Example, this works filtering for value A but not what I need.
print(df_dict['A_US'])
print(df_dict['B_US'])
print(df_dict['C_US'])
You can reference a variable in a query by using @
df_dict[i]=dfsource.query('COL1==@x')
So the total code looks like this
import pandas as pd
report=['A','B','C']
suffix='_US'
report2=[s + suffix for s in report]
print (report2) #result: ['A_US', 'B_US', 'C_US']
source = {'COL1': ['A','B','C'], 'COL2': ['D','E','F']}
dfsource=pd.DataFrame(source)
print(dfsource)
df_dict = {}
for i in report2:
    df_dict[i]=pd.DataFrame()
    for x in report:
        df_dict[i]=dfsource.query('COL1==@x')
        #df_dict[i]=dfsource.query('COL1=="A"') #Example, this works filtering for value A but not what I need.
print(df_dict['A_US'])
print(df_dict['B_US'])
print(df_dict['C_US'])
which outputs
COL1 COL2
0 A D
1 B E
2 C F
COL1 COL2
2 C F
COL1 COL2
2 C F
COL1 COL2
2 C F
However, I think you want a new dictionary keyed by each combination of i and x. You can move the creation of the dataframe into the inner loop and build a new key on each iteration.
import pandas as pd
report=['A','B','C']
suffix='_US'
report2=[s + suffix for s in report]
print (report2) #result: ['A_US', 'B_US', 'C_US']
source = {'COL1': ['A','B','C'], 'COL2': ['D','E','F']}
dfsource=pd.DataFrame(source)
print(dfsource)
df_dict = {}
for i in report2:
    for x in report:
        new_key = x + i
        df_dict[new_key]=pd.DataFrame()
        df_dict[new_key]=dfsource.query('COL1==@x')
for item in df_dict.items():
    print(item)
This outputs nine unique dataframes, each filtered on whichever x value was passed:
('AA_US', COL1 COL2
0 A D)
('BA_US', COL1 COL2
1 B E)
('CA_US', COL1 COL2
2 C F)
('AB_US', COL1 COL2
0 A D)
('BB_US', COL1 COL2
1 B E)
('CB_US', COL1 COL2
2 C F)
('AC_US', COL1 COL2
0 A D)
('BC_US', COL1 COL2
1 B E)
('CC_US', COL1 COL2
2 C F)
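If the underlying goal is just one filtered frame per report value (pairing A_US with A, and so on), a dict comprehension over zip avoids the nested loop entirely; a sketch, assuming the same dfsource:
# One entry per (key, value) pair: A_US -> rows where COL1 == 'A', etc.
df_dict = {key: dfsource.query('COL1 == @val')
           for key, val in zip(report2, report)}
print(df_dict['A_US'])
#  COL1 COL2
#0    A    D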

How to split a column by a delimiter, while respecting the relative position of items to be separated

Below is my script for a generic dataframe in Python using pandas. I am hoping to split a certain column into new columns, while respecting the original orientation of the items in that column.
Please see below for clarity. Thank you in advance!
My script:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': ['x,y,z', 'a,b', 'c']})
print(df)
Here's what I want
df = pd.DataFrame({'col1': ['x',np.nan,np.nan],
                   'col2': ['y','a',np.nan],
                   'col3': ['z','b','c']})
print(df)
Here's what I get
df = pd.DataFrame({'col1': ['x','a','c'],
                   'col2': ['y','b',np.nan],
                   'col3': ['z',np.nan,np.nan]})
print(df)
You can use the justify function from this answer with Series.str.split:
dfn = pd.DataFrame(
    justify(df['col1'].str.split(',', expand=True).to_numpy(),
            invalid_val=None,
            axis=1,
            side='right')
).add_prefix('col')
col0 col1 col2
0 x y z
1 None a b
2 None None c
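The justify function itself lives in the linked answer; for reference, a minimal version along those lines (a sketch, not a verbatim copy) could look like:
import numpy as np

def justify(a, invalid_val=0, axis=1, side='left'):
    # Mask of valid entries; sorting the mask pushes valid values
    # toward the requested side, then we place them accordingly.
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if side in ('up', 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=object)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out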
Here is a way of tweaking the split:
max_delim = df['col1'].str.count(',').max() #count the max occurrence of `,`
delim_to_add = max_delim - df['col1'].str.count(',') #get difference of count from max
# multiply the delimiter and add it to the series, followed by split
df[['col1','col2','col3']] = (df['col1'].radd([','*i for i in delim_to_add])
                              .str.split(',',expand=True).replace('',np.nan))
print(df)
col1 col2 col3
0 x y z
1 NaN a b
2 NaN NaN c
Try something like
s=df.col1.str.count(',')
#(s.max()-s).map(lambda x : x*',')
#0
#1 ,
#2 ,,
#Name: col1, dtype: object
(s.max()-s).map(lambda x : x*',').add(df.col1).str.split(',',expand=True)
0 1 2
0 x y z
1 a b
2 c
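A plain-Python variant of the same idea, for comparison: split, then left-pad each row's fields with NaN so the values end up right-aligned (a sketch, assuming the delimiter never appears inside a field):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['x,y,z', 'a,b', 'c']})
parts = df['col1'].str.split(',')
width = parts.str.len().max()
# Pad each row on the left with NaN so it ends flush right.
df_out = pd.DataFrame([[np.nan] * (width - len(p)) + p for p in parts],
                      columns=[f'col{i + 1}' for i in range(width)])
print(df_out)
#  col1 col2 col3
#0    x    y    z
#1  NaN    a    b
#2  NaN  NaN    c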

pandas select rows by condition for all of dataframe columns

I have a dataframe
d = {'col1': [1, 2], 'col2': [3, 4], 'col3' : [5,6]}
df = pd.DataFrame(data=d)
df
col1 col2 col3
0 1 3 5
1 2 4 6
For example, I need to select all rows with value = 1, so my code is:
df[df['col1']==1]
col1 col2 col3
0 1 3 5
But how can I check not just 'col1' but all columns? I have tried this code:
for col in df.columns:
    print(df[df[col]==1])
but the output is not a single pandas dataframe view:
col1 col2 col3
0 1 3 5
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
Can I go over all the columns and get a single dataframe view?
You can use df.eq to check whether any value in the df is equal to 1, and df.any on axis=1 will then return True for all rows where any of the column values is 1. Finally, use boolean indexing:
output = df[df.eq(1).any(axis=1)]
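A complete run on the question's data, for reference:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})
# Keep rows where any column equals 1.
print(df[df.eq(1).any(axis=1)])
#   col1  col2  col3
#0     1     3     5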

Python Pandas .groupby().mean() with NaN values [duplicate]

Assuming that I have a dataframe with the following values:
df:
col1 col2 value
1 2 3
1 2 1
2 3 1
I want to first group my dataframe by the first two columns (col1 and col2) and then average over the values of the third column (value). So the desired output would look like this:
col1 col2 avg-value
1 2 2
2 3 1
I am using the following code:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby('col1','col2').mean())
which gets the following error:
ValueError: No axis named col2 for object type <class 'pandas.core.frame.DataFrame'>
Any help would be much appreciated.
You need to pass a list of the columns to groupby; what you passed was interpreted as the axis param, which is why it raised an error:
In [30]:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby(['col1','col2']).mean())
          avg
col1 col2
1    2      3
     3      3
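If you would rather have col1 and col2 back as ordinary columns instead of a MultiIndex, closer to the asker's desired output, as_index=False does that; a minimal sketch on the question's data:
df = pd.DataFrame({'col1': [1, 1, 2], 'col2': [2, 2, 3], 'value': [3, 1, 1]})
print(df.groupby(['col1', 'col2'], as_index=False).mean())
#   col1  col2  value
#0     1     2    2.0
#1     2     3    1.0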
If you want to group by multiple columns, you should put them in a list:
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
print(df.groupby(['col1','col2']).mean())
Or slightly more verbose, for the sake of getting the word 'avg' in your aggregated dataframe:
import numpy as np
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
print(df.groupby(['col1','col2']).agg({'value': {'avg': np.mean}}))
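Note that nested renaming dicts in .agg were deprecated (around pandas 0.20) and later removed; on current pandas, named aggregation gives the 'avg' label instead (rebuilding df with numeric dtypes here, since rows added via .loc above end up as object):
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2], 'col2': [2, 3, 3], 'value': [3, 3, 1]})
# Name the output column directly in the aggregation.
print(df.groupby(['col1', 'col2']).agg(avg=('value', 'mean')))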
