Pandas: data frame transformation - python

I have a pandas dataframe which looks like below:
print (df)
   customerid acc_type  amount premium_member
0           1  Savings     200              N
1           1  Current     300              Y
2           2  Savings     250              N
I want to transform it so that acc_type and amount are each spread into two columns, one per account type, and the original columns are dropped.
A customer is guaranteed to have at most two rows in the original dataframe, and acc_type is always either Savings or Current (no other value).
The premium_member attribute is computed as the logical OR of the customer's boolean (Y/N) values.

Use:
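#keep only customers with at most 2 rows; copy so the assignment below does not warn
df = df[df.groupby('customerid')['acc_type'].transform('size') < 3].copy()
#indicator column, becomes isSavings/isCurrent after reshaping
df['is'] = 1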
#reshape and replace missing values to 0
df1 = df.set_index(['customerid','acc_type']).unstack(fill_value=0)
#check if Y in premium_member
s = df1.pop('premium_member').eq('Y').any(axis=1)
#change order of columns
df1 = df1.sort_index(axis=1, ascending=False)
#flatten MultiIndex
df1.columns = df1.columns.map(''.join)
#Y if any of the customer's rows is premium (needs import numpy as np)
df1['premium_member'] = np.where(s, 'Y','N')
#convert index to column
df1 = df1.reset_index().rename_axis(None, axis=1)
print (df1)
   customerid  isSavings  isCurrent  amountSavings  amountCurrent premium_member
0           1          1          1            200            300              Y
1           2          1          0            250              0              N
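For reference, the steps above can be reproduced end to end on the question's sample data. This is a minimal sketch, not the answer's exact code: it assumes the three sample rows shown in the question, computes the logical OR with a separate groupby instead of popping premium_member out of the unstacked frame, and maps the boolean back to Y/N with Series.map rather than np.where. The names premium, wide and result are illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "customerid": [1, 1, 2],
    "acc_type": ["Savings", "Current", "Savings"],
    "amount": [200, 300, 250],
    "premium_member": ["N", "Y", "N"],
})

# Keep customers with at most two rows, as the answer does
df = df[df.groupby("customerid")["acc_type"].transform("size") < 3]

# Logical OR of the Y/N flags per customer
premium = df["premium_member"].eq("Y").groupby(df["customerid"]).any()

# Reshape only the numeric columns; missing account types become 0
wide = (
    df.assign(**{"is": 1})
      .set_index(["customerid", "acc_type"])[["is", "amount"]]
      .unstack(fill_value=0)
      .sort_index(axis=1, ascending=False)
)

# Flatten the two column levels into names such as isSavings
wide.columns = wide.columns.map("".join)

# Assignment aligns on customerid, so row order does not matter
wide["premium_member"] = premium.map({True: "Y", False: "N"})
result = wide.reset_index()
print(result)
```

Keeping the Y/N column out of the unstack means fill_value=0 only ever touches numeric columns.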

Related

How to Pivot/Stack for multi header column dataframe

np.random.seed(2022) # added to make the data the same each time
cols = pd.MultiIndex.from_arrays([['A','A' ,'B','B'], ['min','max','min','max']])
df = pd.DataFrame(np.random.rand(3,4),columns=cols)
df.index.name = 'item'
             A                   B
           min       max       min       max
item
0     0.009359  0.499058  0.113384  0.049974
1     0.685408  0.486988  0.897657  0.647452
2     0.896963  0.721135  0.831353  0.827568
There are two levels of column headers, and when I work with the CSV, unmerging the header cells leaves every other column with a blank name.
I want a result that looks like this. How can I do it?
I tried to use a pivot table but couldn't get it to work.
Try:
df = (
df.stack(level=0)
.reset_index()
.rename(columns={"level_1": "title"})
.sort_values(by=["title", "item"])
)
print(df)
Prints:
   item title       max       min
0     0     A  0.499058  0.009359
2     1     A  0.486988  0.685408
4     2     A  0.721135  0.896963
1     0     B  0.049974  0.113384
3     1     B  0.647452  0.897657
5     2     B  0.827568  0.831353
Then to CSV:
df.to_csv('out.csv', index=False)
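A self-contained sketch of the same reshape, using the values printed in the question instead of random data so the result can be checked by eye. The explicit column selection and the final reset_index are additions, made on the assumption that column order after stack may differ between pandas versions; recent versions may also emit a FutureWarning for stack. The name long_df is illustrative.

```python
import pandas as pd

# Two-level header: outer level A/B, inner level min/max
cols = pd.MultiIndex.from_arrays(
    [["A", "A", "B", "B"], ["min", "max", "min", "max"]]
)
df = pd.DataFrame(
    [
        [0.009359, 0.499058, 0.113384, 0.049974],
        [0.685408, 0.486988, 0.897657, 0.647452],
        [0.896963, 0.721135, 0.831353, 0.827568],
    ],
    columns=cols,
)
df.index.name = "item"

long_df = (
    df.stack(level=0)                        # move the outer header into the index
      .reset_index()                         # index levels become item / level_1
      .rename(columns={"level_1": "title"})
      .sort_values(by=["title", "item"])
      .loc[:, ["item", "title", "max", "min"]]  # fix the column order explicitly
      .reset_index(drop=True)
)
print(long_df)
```

Every column now has a single, non-blank name, so writing long_df with to_csv(index=False) produces a flat header row.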

How to create a dataframe with a named index and an unnamed default subindex

I want to create a dataframe indexed by date, where one date can have one or more records.
So I want to create a dataframe like:
              A  B
2021-11-12 1  0  0
           2  1  1
2021-11-13 1  0  0
           2  1  0
           3  0  1
Can I append a row with an existing date to this dataframe and have the subindex increase automatically?
Or is there any other way to save records with the same date index in one dataframe?
Use:
#remove counter level
df = df.reset_index(level=1, drop=True)
#add new row
#your code
#sort so the new row sits with the other rows of its date
df = df.sort_index()
#add subindex
df = df.set_index(df.groupby(level=0).cumcount().add(1), append=True)
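The cycle above (drop the counter, add the row, sort, rebuild the counter) can be wrapped in a small helper. The following is a sketch under assumptions: the helper name with_counter, the sample values, and the use of pd.concat for the "add new row" step are all illustrative and not part of the answer. A stable sort is used so rows within one date keep their insertion order.

```python
import pandas as pd

def with_counter(frame):
    """Sort by date and append a 1-based per-date counter as a second index level."""
    frame = frame.sort_index(kind="mergesort")  # stable: keeps order within a date
    counter = frame.groupby(level=0).cumcount().add(1)
    return frame.set_index(counter, append=True)

# Sample data shaped like the frame in the question
dates = pd.to_datetime(
    ["2021-11-12", "2021-11-12", "2021-11-13", "2021-11-13", "2021-11-13"]
)
df = with_counter(pd.DataFrame({"A": [0, 1, 0, 1, 0], "B": [0, 1, 0, 0, 1]}, index=dates))

# Add a row for an existing date: drop the counter, concatenate, rebuild the counter
new_row = pd.DataFrame({"A": [1], "B": [1]}, index=pd.to_datetime(["2021-11-12"]))
df = with_counter(pd.concat([df.reset_index(level=1, drop=True), new_row]))
print(df)
```

The counter is recomputed from scratch on every call, so it stays consistent even if rows are added out of date order.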

Pandas: DataFrame.apply returning series instead of dataframe

I am applying a function on a dataframe df and that function returns a dataframe int_df, but the result is getting stored as a series.
df
   limit
0      4
new_df
   A       B
0  0  Number
1  1  Number
2  2  Number
3  3  Number
This is a pseudocode of what I have done:
def foo(x):
    limit = x['limit']
    int_df = pd.DataFrame(columns=['A', 'B'])  # Create empty dataframe
    # Append a new row to the dataframe
    for i in range(0, limit):
        int_df.loc[len(int_df.index)] = [i, 'Number']
    return int_df  # This is a dataframe

new_df = df.apply(foo, axis=1)
new_df  # This is a series but I need a dataframe
Is this the right way to do this?
IIUC, here's one way:
df = df.limit.apply(range).explode().reset_index(drop=True).to_frame('A').assign(B='Number')
OUTPUT:
A B
0 0 Number
1 1 Number
2 2 Number
3 3 Number
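A runnable sketch of the same idea on the question's one-row frame. It assumes every limit is at least 1 (an empty range would explode to NaN and break the integer cast), and it builds explicit lists rather than range objects; the name new_df follows the question.

```python
import pandas as pd

df = pd.DataFrame({"limit": [4]})

new_df = (
    df["limit"]
      .apply(lambda n: list(range(n)))  # one list [0, ..., n-1] per input row
      .explode()                        # one output row per list element
      .astype(int)                      # explode leaves object dtype behind
      .to_frame("A")
      .assign(B="Number")
      .reset_index(drop=True)           # explode repeats the original index label
)
print(new_df)
```

With several rows in df, the exploded rows are stacked one after another, so the result is a single dataframe rather than a series of dataframes.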

Pandas: Delete rows with different encoding of 0s in python

I have calculated statistical values and written them to a CSV file. The NaN values are replaced with zeros. Some rows contain only 0 values, and some contain only a mix of 0 and 0.0 values. How can I delete these rows? According to the attached image, rows 5 and 6 (only 0.0s) and rows 9 and 11 (both 0s and 0.0s) need to be deleted.
import pandas as pd
all_df = pd.read_csv('source.csv')
all_df.dropna(subset=all_df.columns.tolist()[1:], how='all', inplace=True)
all_df.fillna(0, inplace=True)
all_df.to_csv('outfile.csv', index=False)
Use all_df[(all_df.T != 0).any()] or all_df[(all_df != 0).any(axis=1)]:
all_df = pd.DataFrame({'a':[0,0,0,1], 'b':[0,0,0,1]})
print(all_df)
a b
0 0 0
1 0 0
2 0 0
3 1 1
all_df = all_df[(all_df.T != 0).any()]
all_df
a b
3 1 1
EDIT 1: After looking at your data, a solution is to convert all numerical columns to float and then do the operations. This problem arises from the way the initial data were saved into the .csv file.
all_df = pd.read_csv('/Users/me/Downloads/Test11.csv')
# do not select 'activity' column
df = all_df.loc[:, all_df.columns != 'activity']
# convert to float
df = df.astype(float)
# remove rows with all 0s
mask = (df != 0).any(axis=1)
df = df[mask]
# apply the same mask to the 'activity' column
recover_lines_of_activity_column = all_df['activity'][mask]
# Final result
final_df = pd.concat([recover_lines_of_activity_column, df], axis = 1)
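The approach from the edit can be tried without the original file. In this sketch the CSV text, its column names other than activity, and its values are hypothetical stand-ins for the question's data, and pd.to_numeric is used in place of astype(float) so that unparseable cells become NaN instead of raising.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the question's file
csv_text = """activity,mean,std
walk,1.5,0.2
sit,0,0.0
run,0.0,0.0
stand,0,3
"""
all_df = pd.read_csv(io.StringIO(csv_text))

# Compare every column except the label column as numbers, so 0 and 0.0 are equal
value_cols = all_df.columns.drop("activity")
values = all_df[value_cols].apply(pd.to_numeric, errors="coerce").fillna(0)

# True for rows that have at least one non-zero value
mask = values.ne(0).any(axis=1)
final_df = all_df[mask]
print(final_df)
```

Filtering all_df directly with the mask keeps the activity column in place, so no concat step is needed afterwards.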

Grouping data from multiple columns in data frame into summary view

I have a data frame as below and would like to create the summary information shown. How can this be done in pandas?
Data-frame:
import pandas as pd
ds = pd.DataFrame([
    {"id": "1", "owner": "A", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
    {"id": "2", "owner": "A", "delivery": "2-Jan", "priority": "Medium", "exception": ""},
    {"id": "3", "owner": "B", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
    {"id": "4", "owner": "B", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
    {"id": "5", "owner": "C", "delivery": "1-Jan", "priority": "High", "exception": ""},
    {"id": "6", "owner": "C", "delivery": "2-Jan", "priority": "High", "exception": ""},
    {"id": "7", "owner": "C", "delivery": "", "priority": "High", "exception": ""},
])
Use:
#crosstab and rename empty string column
df = pd.crosstab(ds['owner'], ds['delivery']).rename(columns={'':'No delivery Date'})
#change positions of columns - first one to last one
df = df[df.columns[1:].tolist() + df.columns[:1].tolist()]
#get counts by comparing and sum of True values
df['high_count'] = ds['priority'].eq('High').groupby(ds['owner']).sum().astype(int)
df['exception_count'] = ds['exception'].eq('No Bill').groupby(ds['owner']).sum().astype(int)
#convert id to string and join with ,
df['ids'] = ds['id'].astype(str).groupby(ds['owner']).agg(','.join)
#index to column
df = df.reset_index()
#remove the columns name 'delivery'
df.columns.name = None
print (df)
  owner  1-Jan  2-Jan  No delivery Date  high_count  exception_count    ids
0     A      1      1                 0           1                1    1,2
1     B      2      0                 0           2                2    3,4
2     C      1      1                 1           3                0  5,6,7
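The three per-owner columns (high_count, exception_count, ids) could also be computed in one pass with named aggregation. This is an alternative sketch rather than the answer's method; the helper columns is_high and is_exception and the names flags and summary are illustrative. The delivery counts would still come from pd.crosstab as shown above.

```python
import pandas as pd

# Same data as the question, written column-wise
ds = pd.DataFrame({
    "id": ["1", "2", "3", "4", "5", "6", "7"],
    "owner": ["A", "A", "B", "B", "C", "C", "C"],
    "delivery": ["1-Jan", "2-Jan", "1-Jan", "1-Jan", "1-Jan", "2-Jan", ""],
    "priority": ["High", "Medium", "High", "High", "High", "High", "High"],
    "exception": ["No Bill", "", "No Bill", "No Bill", "", "", ""],
})

# Boolean helper columns; summing booleans counts the True values
flags = ds.assign(
    is_high=ds["priority"].eq("High"),
    is_exception=ds["exception"].eq("No Bill"),
)

summary = flags.groupby("owner").agg(
    high_count=("is_high", "sum"),
    exception_count=("is_exception", "sum"),
    ids=("id", ",".join),  # join the id strings of each owner
)
print(summary)
```

Because summary is indexed by owner, it can be joined onto the crosstab result with df.join(summary) before the final reset_index.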
