How to write a complicated function to aggregate a DataFrame - Python

I have a DataFrame in Python like the one below, which represents clients' agreements:
df = pd.DataFrame({"ID" : [1,2,1,1,3],
"amount" : [100,200,300,400,500],
"status" : ["active", "finished", "finished",
"active", "finished"]})
I need to write a FUNCTION in Python which will calculate:
1. Number (NumAg) and amount (AmAg) of contracts per ID
2. Number (NumAct) and amount (AmAct) of active contracts per ID
3. Number (NumFin) and amount (AmFin) of finished contracts per ID
To be more precise, this function needs to create a DataFrame like the one below:
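(Desired output, reconstructed from points 1-3 above and the sample data:)
    NumAg  AmAg  NumAct  AmAct  NumFin  AmFin
ID
1       3   800       2    500       1    300
2       1   200       0      0       1    200
3       1   500       0      0       1    500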

The below solution should fit your use case.
import pandas as pd

def summarise_df(df):
    # Define mask to filter df by 'active' value in 'status' column,
    # used for the 'NumAct', 'AmAct', 'NumFin', and 'AmFin' columns
    active_mask = df['status'].str.contains('active')
    return df.groupby('ID').agg(  # Create first columns in output df using agg (no mask needed)
        NumAg=pd.NamedAgg(column='amount', aggfunc='count'),
        AmAg=pd.NamedAgg(column='amount', aggfunc='sum')
    ).join(  # Add columns using values with 'active' status
        df[active_mask].groupby('ID').agg(
            NumAct=pd.NamedAgg(column='amount', aggfunc='count'),
            AmAct=pd.NamedAgg(column='amount', aggfunc='sum')
        )
    ).join(  # Add columns using values with NOT 'active' (i.e. 'finished') status
        df[~active_mask].groupby('ID').agg(
            NumFin=pd.NamedAgg(column='amount', aggfunc='count'),
            AmFin=pd.NamedAgg(column='amount', aggfunc='sum')
        )
    ).fillna(0)  # Replace NaN values with 0
I would recommend reading over this function and its comments alongside documentation for groupby() and join() so that you can develop a better understanding of exactly what is being done here. It is seldom a wise decision to rely upon code that you don't have a good grasp on.
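As a quick check, calling the function on the sample frame above should give roughly the following (a sketch; the NumAct/AmAct columns come back as floats because of the NaN-then-fillna step):
summarise_df(df)
    NumAg  AmAg  NumAct  AmAct  NumFin  AmFin
ID
1       3   800     2.0  500.0       1    300
2       1   200     0.0    0.0       1    200
3       1   500     0.0    0.0       1    500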

You could use groupby on ID with agg, after adding two helper columns that make the aggregation easier:
df['AmAct'] = df.amount[df.status.eq('active')]
df['AmFin'] = df.amount[df.status.eq('finished')]
df = df.groupby('ID').agg(
    NumAg = ('ID', 'count'),
    AmAg = ('amount', 'sum'),
    NumAct = ('status', lambda col: col.eq('active').sum()),
    AmAct = ('AmAct', 'sum'),
    NumFin = ('status', lambda col: col.eq('finished').sum()),
    AmFin = ('AmFin', 'sum')
)
Result:
    NumAg  AmAg  NumAct  AmAct  NumFin  AmFin
ID
1       3   800       2  500.0       1  300.0
2       1   200       0    0.0       1  200.0
3       1   500       0    0.0       1  500.0
Or add some more columns to df to do a simpler groupby on ID with sum:
df.insert(1, 'NumAg', 1)
df['NumAct'] = df.status.eq('active')
df['AmAct'] = df.amount[df.NumAct]
df['NumFin'] = df.status.eq('finished')
df['AmFin'] = df.amount[df.NumFin]
df.drop(columns=['status'], inplace=True)
df = df.groupby('ID').sum().rename(columns={'amount': 'AmAg'})
with the same result.
Or, maybe the easiest way, let pivot_table do most of the work, after adding a count column to df, and some column-rearranging afterwards:
df['count'] = 1
df = df.pivot_table(index='ID', columns='status', values=['count', 'amount'],
                    aggfunc=sum, fill_value=0, margins=True).drop('All')
df.columns = ['AmAct', 'AmFin', 'AmAg', 'NumAct', 'NumFin', 'NumAg']
df = df[['NumAg', 'AmAg', 'NumAct', 'AmAct', 'NumFin', 'AmFin']]

Related

KEGG Drug database Python script

I have a drug database saved in a SINGLE column of a CSV file that I can read with Pandas. The file contains 750000 rows and its elements are divided by "///". The column also ends with "///". It seems every row ends with ";".
I would like to split it into multiple columns in order to create a structured database. Capitalized words (drug information) like "ENTRY", "NAME" etc. will be the headers of these new columns.
So it has some structure, although the elements can be described by a different number and kind of information fields, meaning some elements will just have NaN in some cells. I have never worked with such an SQL-like format, and it is difficult to reproduce it as Pandas code, too. Please see the PrtScs for more information.
An example of desired output would look like this:
df = pd.DataFrame({
    "ENTRY":["001", "002", "003"],
    "NAME":["water", "ibuprofen", "paralen"],
    "FORMULA":["H2O","C5H16O85", "C14H24O8"],
    "COMPONENT":[NaN, NaN, "paracetamol"]})
I am guessing there will be .split() involved, based on the CAPITALIZED words? A Python 3 code solution would be appreciated. It could help a lot of people. Thanks!
I helped as much as I could:
import pandas as pd
cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
# We create an additional dataframe.
dfi = pd.DataFrame()
# We read the file, get two columns and leave only the necessary lines.
df = pd.read_fwf(r'drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# To "flip" the dataframe, we first prepare an additional column
# with indexing by groups from one 'ENTRY' row to another.
dfi['Key1'] = dfi['Key'] = df[(df['Key'] == 'ENTRY')].index
dfi = dfi.set_index('Key1')
df = df.join(dfi, lsuffix='_caller', rsuffix='_other')
df.fillna(method="ffill", inplace=True)
df = df.astype({"Key_other": "Int64"})
# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key_caller', values='Value')
df = df.reindex(columns=cols)
# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
Small code refactoring:
import pandas as pd
cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
# We read the file, get two columns and leave only the necessary lines.
df = pd.read_fwf(r'C:\Users\ф\drug\drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# To "flip" the dataframe, we first prepare an additional column
# with indexing by groups from one 'ENTRY' row to another.
df['Key_other'] = None
df.loc[(df['Key'] == 'ENTRY'), 'Key_other'] = df[(df['Key'] == 'ENTRY')].index
df['Key_other'].fillna(method="ffill", inplace=True)
# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key', values='Value')
df = df.reindex(columns=cols)
# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df['NAME'] = df['NAME'].str.split(r'\(', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
print(df)
Key ENTRY NAME FORMULA \
0 D00001 Water H2O
1 D00002 Nadide C21H28N7O14P2
2 D00003 Oxygen O2
3 D00004 Carbon dioxide CO2
4 D00005 Flavin adenine dinucleotide C27H33N9O15P2
... ... ... ...
11983 D12452 Fostroxacitabine bralpamide hydrochloride C22H30BrN4O8P. HCl
11984 D12453 Guretolimod C24H34F3N5O4
11985 D12454 Icenticaftor C12H13F6N3O3
11986 D12455 Lirafugratinib C28H24FN7O2
11987 D12456 Lirafugratinib hydrochloride C28H24FN7O2. HCl
Key COMPONENT
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
11983 NaN
11984 NaN
11985 NaN
11986 NaN
11987 NaN
[11988 rows x 4 columns]
It still needs a little more polishing, which I leave to you.

if duplicate row, update rows to 0 in PySpark

I need to update values in the DF.EMAIL column to 0 if there are duplicate values in the DF.EMAIL column.
generate DF
data = [('2345', 'leo#gmai.com'),
        ('2398', 'leo#hotmai.com'),
        ('2398', 'leo#hotmai.com'),
        ('2328', 'leo#yahoo.con'),
        ('3983', 'leo#yahoo.com.ar')]
serialize DF
df = sc.parallelize(data).toDF(['ID', 'EMAIL'])
# show DF
df.show()
Partial Solution
from pyspark.sql.functions import count, when

# Create an indicator column: 0 if the row has no duplicates, 1 if it does
df_join = df.join(
    df.groupBy(df.columns).agg((count("*") > 1).cast("int").alias("duplicate_indicator")),
    on=df.columns,
    how="inner"
)
# Blank out EMAIL where the row is a duplicate
df1 = df_join.withColumn(
    "EMAIL",
    when(df_join.duplicate_indicator == 1, "")
    .otherwise(df_join.EMAIL)
)
Syntax-wise, this looks more compact but yours might perform better.
from pyspark.sql import Window
from pyspark.sql.functions import col, count, when
df = (df.withColumn('count', count('*').over(Window.partitionBy('ID')))
      .withColumn('EMAIL', when(col('count') > 1, '').otherwise(col('EMAIL'))))

Collapse certain columns horizontally

I have:
haves = pd.DataFrame({'Product':['R123','R234'],
                      'Price':[1.18,0.23],
                      'CS_Medium':[1, 0],
                      'CS_Small':[0, 1],
                      'SC_A':[1,0],
                      'SC_B':[0,1],
                      'SC_C':[0,0]})
print(haves)
given a list of columns, like so:
list_of_starts_with = ["CS_", "SC_"]
I would like to arrive here:
wants = pd.DataFrame({'Product':['R123','R234'],
                      'Price':[1.18,0.23],
                      'CS':['Medium', 'Small'],
                      'SC':['A', 'B']})
print(wants)
I am aware of wide_to_long but don't think it is applicable here?
We could convert "SC" and "CS" column values to boolean mask to filter the column names; then join it back to the original DataFrame:
msk = haves.columns.str.contains('_')
s = haves.loc[:, msk].astype(bool)
s = s.apply(lambda x: dict(s.columns[x].str.split('_')), axis=1)
out = haves.loc[:, ~msk].join(pd.DataFrame(s.tolist(), index=s.index))
Output:
Product Price CS SC
0 R123 1.18 Medium A
1 R234 0.23 Small B
Based on the list of columns (assuming the starts_with is enough to identify them), it is possible to do the changes in bulk:
def preprocess_column_names(list_of_starts_with, column_names):
    "Returns a list of tuples (merged_column_name, options, columns)"
    columns_to_transform = []
    for starts_with in list_of_starts_with:
        len_of_start = len(starts_with)
        columns = [col for col in column_names if col.startswith(starts_with)]
        options = [col[len_of_start:] for col in columns]
        merged_column_name = starts_with[:-1]  # Assuming that the last char is not needed
        columns_to_transform.append((merged_column_name, options, columns))
    return columns_to_transform

def merge_columns(df, merged_column_name, options, columns):
    for col, option in zip(columns, options):
        df.loc[df[col] == 1, merged_column_name] = option
    return df.drop(columns=columns)

def merge_all(df, columns_to_transform):
    for merged_column_name, options, columns in columns_to_transform:
        df = merge_columns(df, merged_column_name, options, columns)
    return df
And to run:
columns_to_transform = preprocess_column_names(list_of_starts_with, haves.columns)
wants = merge_all(haves, columns_to_transform)
If your column names hold no surprises (such as "Index_" appearing in list_of_starts_with), the above code should solve the problem with reasonable performance.
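On the example data this should reproduce the wants frame (a quick check; note that merge_columns also writes the new CS/SC columns into haves itself, so work on a copy if you need the original dummy columns kept):
print(wants)
  Product  Price      CS SC
0    R123   1.18  Medium  A
1    R234   0.23   Small  B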
One option is to convert the data to a long form, filter for rows that have a value of 1, then convert back to wide form. We can use pivot_longer from pyjanitor for the wide to long part, and pivot to return to wide form:
# pip install pyjanitor
import pandas as pd
import janitor
(haves
 .pivot_longer(index=["Product", "Price"],
               names_to=("main", "other"),
               names_sep="_")
 .query("value==1")
 .pivot(index=["Product", "Price"],
        columns="main",
        values="other")
 .rename_axis(columns=None)
 .reset_index()
)
Product Price CS SC
0 R123 1.18 Medium A
1 R234 0.23 Small B
You can totally avoid pyjanitor by transforming the columns before reshaping (it still involves wide to long, then long to wide):
index = [col for col in haves
         if not col.startswith(tuple(list_of_starts_with))]
temp = haves.set_index(index)
temp.columns = (temp
                .columns.str.split("_", expand=True)
                .set_names(["main", "other"]))
# reshape to get the final dataframe
(temp
 .stack(["main", "other"])
 .loc[lambda df: df == 1]
 .reset_index("other")
 .drop(columns=0)
 .unstack()
 .droplevel(0, 1)
 .rename_axis(columns=None)
 .reset_index()
)
Product Price CS SC
0 R123 1.18 Medium A
1 R234 0.23 Small B

convert a list in rows of a dataframe column to a simple string

I have a dataframe which has a list in one column that I want to convert into a simple string:
id data_words_nostops
26561364 [andrographolide, major, labdane, diterpenoid]
26561979 [dgat, plays, critical, role, hepatic, triglyc]
26562217 [despite, success, imatinib, inhibiting, bcr]
DESIRED OUTPUT
id data_words_nostops
26561364 andrographolide, major, labdane, diterpenoid
26561979 dgat, plays, critical, role, hepatic, triglyc
26562217 despite, success, imatinib, inhibiting, bcr
Try this :
df['data_words_nostops'] = df['data_words_nostops'].apply(lambda row : ','.join(row))
Complete code :
import pandas as pd
l1 = ['26561364', '26561979', '26562217']
l2 = [['andrographolide', 'major', 'labdane', 'diterpenoid'],['dgat', 'plays', 'critical', 'role', 'hepatic', 'triglyc'],['despite', 'success', 'imatinib', 'inhibiting', 'bcr']]
df = pd.DataFrame(list(zip(l1, l2)),
                  columns=['id', 'data_words_nostops'])
df['data_words_nostops'] = df['data_words_nostops'].apply(lambda row : ','.join(row))
Output :
id data_words_nostops
0 26561364 andrographolide,major,labdane,diterpenoid
1 26561979 dgat,plays,critical,role,hepatic,triglyc
2 26562217 despite,success,imatinib,inhibiting,bcr
df["data_words_nostops"] = df.apply(lambda row: row["data_words_nostops"][0], axis=1)
You can use pandas str join for this:
df["data_words_nostops"] = df["data_words_nostops"].str.join(",")
df
id data_words_nostops
0 26561364 andrographolide,major,labdane,diterpenoid
1 26561979 dgat,plays,critical,role,hepatic,triglyc
2 26562217 despite,success,imatinib,inhibiting,bcr
I tried the following as well
df_ready['data_words_nostops_Joined'] = df_ready.data_words_nostops.apply(', '.join)

Python Pandas aggregation with condition

I need to group my dataframe and use several aggregation functions on different columns, and some of these aggregations have conditions.
Here is an example. The data are all the orders from 2 customers, and I would like to calculate some information on each customer, like their order count, their total spending and average spending.
import pandas as pd
data = {'order_id' : range(1,9),
        'cust_id' : [1]*5 + [2]*3,
        'order_amount' : [100,50,70,75,80,105,30,20],
        'cust_days_since_reg' : [0,10,25,37,52,0,17,40]}
orders = pd.DataFrame(data)
aggregation = {'order_id' : 'count',
               'order_amount' : ['sum', 'mean']}
cust = orders.groupby('cust_id').agg(aggregation).reset_index()
cust.columns = ['_'.join(col) for col in cust.columns.values]
This works fine and gives me:
   cust_id_  order_id_count  order_amount_sum  order_amount_mean
0         1               5               375          75.000000
1         2               3               155          51.666667
But I have to add an aggregation function with an argument and a condition: the amount a customer spent in their first X months (X must be customizable).
Since I need an argument in this aggregation, I tried:
def spendings_X_month(group, n_months):
    return group.loc[group['cust_days_since_reg'] <= n_months*30,
                     'order_amount'].sum()

aggregation = {'order_id' : 'count',
               'order_amount' : ['sum',
                                 'mean',
                                 lambda x: spendings_X_month(x, 1)]}
cust = orders.groupby('cust_id').agg(aggregation).reset_index()
But that last line gets me the error: KeyError: 'cust_days_since_reg'.
It must be a scoping issue: the cust_days_since_reg column must not be visible in this situation.
I could calculate this last column separately and then join the resulting dataframe to the first, but there must be a better solution that does everything in only one groupby.
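Roughly, the separate calculation I have in mind would be something like this (a sketch for n_months = 1):
first_month = (orders[orders['cust_days_since_reg'] <= 1 * 30]
               .groupby('cust_id')['order_amount'].sum()
               .rename('order_amount_spendings'))
cust = cust.join(first_month, on='cust_id_').fillna(0)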
Could anyone help me with this problem, please?
Thank you.
You cannot use agg, because each aggregation function works on only one column, so this kind of filtering based on another column is not possible there.
Solution: use GroupBy.apply:
def spendings_X_month(group, n_months):
    a = group['order_id'].count()
    b = group['order_amount'].sum()
    c = group['order_amount'].mean()
    d = group.loc[group['cust_days_since_reg'] <= n_months*30,
                  'order_amount'].sum()
    cols = ['order_id_count','order_amount_sum','order_amount_mean','order_amount_spendings']
    return pd.Series([a,b,c,d], index=cols)

cust = orders.groupby('cust_id').apply(spendings_X_month, 1).reset_index()
print (cust)
cust_id order_id_count order_amount_sum order_amount_mean \
0 1 5.0 375.0 75.000000
1 2 3.0 155.0 51.666667
order_amount_spendings
0 220.0
1 135.0
