Grouping data from multiple columns in data frame into summary view - python

I have the data frame below and would like to create the summary information shown in the result. How can this be done in pandas?
Data-frame:
import pandas as pd
ds = pd.DataFrame(
[{"id":"1","owner":"A","delivery":"1-Jan","priority":"High","exception":"No Bill"},{"id":"2","owner":"A","delivery":"2-Jan","priority":"Medium","exception":""},{"id":"3","owner":"B","delivery":"1-Jan","priority":"High","exception":"No Bill"},{"id":"4","owner":"B","delivery":"1-Jan","priority":"High","exception":"No Bill"},{"id":"5","owner":"C","delivery":"1-Jan","priority":"High","exception":""},{"id":"6","owner":"C","delivery":"2-Jan","priority":"High","exception":""},{"id":"7","owner":"C","delivery":"","priority":"High","exception":""}]
)
Result:

Use:
#crosstab and rename empty string column
df = pd.crosstab(ds['owner'], ds['delivery']).rename(columns={'':'No delivery Date'})
#change positions of columns - first one to last one
df = df[df.columns[1:].tolist() + df.columns[:1].tolist()]
#get counts by comparing and sum of True values
df['high_count'] = ds['priority'].eq('High').groupby(ds['owner']).sum().astype(int)
df['exception_count'] = ds['exception'].eq('No Bill').groupby(ds['owner']).sum().astype(int)
#convert id to string and join with ,
df['ids'] = ds['id'].astype(str).groupby(ds['owner']).agg(','.join)
#index to column
df = df.reset_index()
#remove columns name delivery
df.columns.name = None
print (df)
  owner  1-Jan  2-Jan  No delivery Date  high_count  exception_count    ids
0     A      1      1                 0           1                1    1,2
1     B      2      0                 0           2                2    3,4
2     C      1      1                 1           3                0  5,6,7
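The three per-owner columns above (high_count, exception_count, ids) can also be computed in one pass with named aggregation. This is only a sketch of that alternative, not the answer's method; the name `summary` is a placeholder introduced here, and the crosstab part for the delivery dates is left as in the answer.

```python
import pandas as pd

ds = pd.DataFrame([
    {"id": "1", "owner": "A", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
    {"id": "2", "owner": "A", "delivery": "2-Jan", "priority": "Medium", "exception": ""},
    {"id": "3", "owner": "B", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
    {"id": "4", "owner": "B", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
    {"id": "5", "owner": "C", "delivery": "1-Jan", "priority": "High", "exception": ""},
    {"id": "6", "owner": "C", "delivery": "2-Jan", "priority": "High", "exception": ""},
    {"id": "7", "owner": "C", "delivery": "", "priority": "High", "exception": ""},
])

# one output column per keyword: (source column, aggregation function)
summary = ds.groupby("owner").agg(
    high_count=("priority", lambda s: int(s.eq("High").sum())),
    exception_count=("exception", lambda s: int(s.eq("No Bill").sum())),
    ids=("id", ",".join),
).reset_index()
print(summary)
```

The result can be joined to the crosstab on owner if the delivery-date counts are needed in the same frame.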

Related

How to Pivot/Stack for multi header column dataframe

import numpy as np
import pandas as pd
np.random.seed(2022) # added to make the data the same each time
cols = pd.MultiIndex.from_arrays([['A','A' ,'B','B'], ['min','max','min','max']])
df = pd.DataFrame(np.random.rand(3,4),columns=cols)
df.index.name = 'item'
             A                   B
           min       max       min       max
item
0     0.009359  0.499058  0.113384  0.049974
1     0.685408  0.486988  0.897657  0.647452
2     0.896963  0.721135  0.831353  0.827568
There are two column header rows, and when working with the CSV I get a blank column name for every other column after unmerging.
I want a result that looks like this. How can I do it?
I tried to use a pivot table but couldn't get it to work.
Try:
df = (
df.stack(level=0)
.reset_index()
.rename(columns={"level_1": "title"})
.sort_values(by=["title", "item"])
)
print(df)
Prints:
   item title       max       min
0     0     A  0.762221  0.737758
2     1     A  0.930523  0.275314
4     2     A  0.746246  0.123621
1     0     B  0.044137  0.264969
3     1     B  0.577637  0.699877
5     2     B  0.601034  0.706978
Then to CSV:
df.to_csv('out.csv', index=False)
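To make the reshaping step easier to follow, here is a small sketch with fixed integers instead of random numbers. It assumes the top column level can be given a name up front (the name "title" and the variables `wide` and `long_df` are choices made here, not part of the question), which makes the rename step unnecessary.

```python
import pandas as pd

# name the top column level so the stacked index level is already called "title"
cols = pd.MultiIndex.from_arrays(
    [["A", "A", "B", "B"], ["min", "max", "min", "max"]],
    names=["title", None],
)
wide = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=cols)
wide.index.name = "item"

# move the top column level into the row index, then flatten the index
long_df = wide.stack(level=0).reset_index().sort_values(["title", "item"])
print(long_df)
```

Each (item, title) pair becomes one row with min and max as ordinary columns, which writes to CSV with a single header row.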

How to create a dataframe with a named index and an unnamed default subindex

I want to create a dataframe indexed by date, where one date can have one or more records.
So I want to create a dataframe like:
                A  B
2021-11-12  1   0  0
            2   1  1
2021-11-13  1   0  0
            2   1  0
            3   0  1
Could I append a row with an existing date to this dataframe and have the subindex increase automatically?
Or is there another way to store records with the same date index in one dataframe?
Use:
#remove counter level
df = df.reset_index(level=1, drop=True)
#add new row
#your code
#correct add new row after last datetime
df = df.sort_index()
#add subindex
df = df.set_index(df.groupby(level=0).cumcount().add(1), append=True)
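The steps above can be sketched end to end as follows. The sample values and the name `new_row` are made up for illustration, and the sketch starts from a frame that has only the date level, so the first step (dropping an existing counter level) is not needed here. A stable sort is assumed so that the new row lands after the existing rows of the same date.

```python
import pandas as pd

# frame indexed by date only; dates may repeat
df = pd.DataFrame(
    {"A": [0, 1, 0], "B": [0, 1, 0]},
    index=pd.to_datetime(["2021-11-12", "2021-11-12", "2021-11-13"]),
)

# hypothetical new record for an already existing date
new_row = pd.DataFrame({"A": [1], "B": [0]}, index=pd.to_datetime(["2021-11-12"]))

# append, then sort by date; mergesort is stable, so insertion order per date is kept
df = pd.concat([df, new_row]).sort_index(kind="mergesort")

# rebuild the per-date counter (1, 2, 3, ...) as a second index level
df = df.set_index(df.groupby(level=0).cumcount().add(1), append=True)
print(df)
```

The counter is derived data, so recomputing it after each batch of appends is simpler than trying to maintain it row by row.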

how to detect value changed in python, pandas in each object

180762508,1268510763,374723980,293,20180402035748,198,25,1,1
180762508,1268503685,374717256,307,20180402035758,225,38,1,1
180762508,1268492506,374708540,236,20180402035808,222,52,1,1
180762508,1268485868,374697563,248,20180402035818,197,47,1,1
180762508,1268482430,374688520,272,20180402035828,196,31,1,1
180707764,1270608366,374988433,246,20180402035925,66,37,1,0
180707764,1270620899,374992366,222,20180402035935,68,49,1,0
The first column is a unique id and the last column is the one I am interested in.
I want to know how I can find where the last column changes from 0 to 1.
I built a really big data frame from this dataset in pandas:
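import glob
import os
import pandas as pd
path = r"1\1"
#os.path.join avoids the invalid "\*" escape sequence
all_files = glob.glob(os.path.join(path, "*.DAT"))
frames = []  #avoid shadowing the built-in list
for filename in all_files:
    frames.append(pd.read_csv(filename, header=None))
a = pd.concat(frames)
a.head()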
This is all I have done so far.
I don't get an error, but I want to know an algorithm for finding where the last column's value changes within each unique id.
My goal is to build a data frame in which
the first column is the unique id, the second and third columns are latitude and longitude (the third and second columns of my dataset), followed by the timestamp (the 5th column) at which the last column's value changes from 0 to 1.
If I understood you, you need to get the 5th row, where the change from 0 to 1, in the last column, takes place.
I made a dataframe with your first and last column (by the way, you said the 1st column is some kind of unique id, but I see repeated numbers), anyway based on your sample data, one possible solution is:
import pandas as pd
data = [[180762508,1],[180762508,1],[180762508,1],[180762508,1],[180762508,1],[180707764,0],[180707764,0]]
df = pd.DataFrame(data,columns=['my_id','interest'])
#new dataframe to compare the column interest
df2 = df.loc[df['interest'] != df['interest'].shift(-1)]
#output:
# my_id interest
# 4 180762508 1
# 6 180707764 0
imax = df2.index.max() #index after the change
imin = df2.index.min() #index before the change
for i in range(imin,imax,1):
    i
#the row with the change in the original dataframe
print(df.loc[i])
Hi and thanks for posting. It looks like the first column doesn't have unique values, so I'm guessing you want the index or the timestamp returned?
In any case, here's a sample of what might work for you if you want to find when the interest column for an ID changes from 0 to 1:
import pandas as pd
# Provided data
raw_str = """
180762508,1268510763,374723980,293,20180402035748,198,25,1,1 180762508,1268503685,374717256,307,20180402035758,225,38,1,1 180762508,1268492506,374708540,236,20180402035808,222,52,1,1 180762508,1268485868,374697563,248,20180402035818,197,47,1,1 180762508,1268482430,374688520,272,20180402035828,196,31,1,1 180707764,1270608366,374988433,246,20180402035925,66,37,1,0 180707764,1270620899,374992366,222,20180402035935,68,49,1,0
"""
# Replace newline and split on single whitespace
chunks = raw_str.replace('\n', '').split(' ')
# Create simple dictionary for ID, timestamp, and interest columns
ddict = {}
ddict['id'] = [i.split(',')[0] for i in chunks]
ddict['timestamp'] = [i.split(',')[4] for i in chunks]
ddict['interest'] = [i.split(',')[-1] for i in chunks]
# Convert dictionary to pandas DataFrame
df = pd.DataFrame(ddict)
# Create dictionary for sample data
# This is an existing ID with timestamp in the future and 1 as interest
tdict = {
'id': '180707764',
'timestamp': '20180402035945',
'interest': '1',
}
What df looks like:
          id       timestamp interest
0  180707764  20180402035925        0
1  180707764  20180402035935        0
2  180707764  20180402035945        1
3  180762508  20180402035748        1
4  180762508  20180402035758        1
5  180762508  20180402035808        1
6  180762508  20180402035818        1
7  180762508  20180402035828        1
Continuing on:
# Append that dictionary to your dataframe and sort by id, timestamp
# DataFrame.append was removed in pandas 2.0, so pd.concat is used here
df = pd.concat([df, pd.DataFrame([tdict])], ignore_index=True)
df = df.sort_values(['id', 'timestamp']).reset_index(drop=True)
# Shift dataframe back 1 period by rows
df2 = pd.DataFrame(df.shift(periods=-1, axis=0))
# Merge that dataframe with our original dataframe by index values
# We're dropping an extra id column and renaming our primary id column for aesthetics
df3 = df.merge(df2, left_index=True, right_index=True, suffixes=('_prev', '_curr'))
df3 = df3.drop('id_curr', axis=1).rename(columns={'id_prev': 'id'})
What df3 looks like:
          id  timestamp_prev interest_prev  timestamp_curr interest_curr
0  180707764  20180402035925             0  20180402035935             0
1  180707764  20180402035935             0  20180402035945             1
2  180707764  20180402035945             1  20180402035748             1
3  180762508  20180402035748             1  20180402035758             1
4  180762508  20180402035758             1  20180402035808             1
5  180762508  20180402035808             1  20180402035818             1
6  180762508  20180402035818             1  20180402035828             1
7  180762508  20180402035828             1             NaN           NaN
Now we can just create a conditional to return the row where interest changed from 0 to 1:
In[0]: df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')]
Which returns:
          id  timestamp_prev interest_prev  timestamp_curr interest_curr
1  180707764  20180402035935             0  20180402035945             1
You can also return specific columns by adding those onto the end of the result set:
df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')]['timestamp_curr']
df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')][['id', 'timestamp_curr']]
Or use the original dataframe (df) and .iloc to get specified data:
df.iloc[df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')].index, :]
Out:
id timestamp interest
1 180707764 20180402035935 0
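One caveat with shifting the whole frame is that the last row of one id gets compared with the first row of the next id (row 2 of df3 above). A minimal sketch that shifts within each id instead is shown below; the ids, timestamps, and the names `prev` and `changed` are illustrative values chosen here, not the asker's data.

```python
import pandas as pd

# hypothetical sample: id 1 switches from 0 to 1, id 2 is 1 throughout
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "timestamp": [10, 20, 30, 10, 20, 30],
    "interest": [0, 0, 1, 1, 1, 1],
})

df = df.sort_values(["id", "timestamp"])

# previous interest value within the same id (NaN for each id's first row)
prev = df.groupby("id")["interest"].shift()

# rows where the value goes from 0 to 1 inside one id
changed = df[(prev == 0) & (df["interest"] == 1)]
print(changed)
```

Because the first row of each id has no previous value, a frame that starts at 1 for some id is never reported as a change.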

Pandas DataFrames: Extract Information and Collapse Columns

I have a pandas DataFrame which contains information in columns which I would like to extract into a new column.
It is best explained visually:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Number Type 1':[1,2,np.nan],
                   'Number Type 2':[np.nan,3,4],
                   'Info':list('abc')})
The table shows the initial DataFrame with the Number Type 1 and Number Type 2 columns.
I would like to extract the types and create a new Type column, reshaping the DataFrame accordingly.
Basically, the numbers are collapsed into a single Number column and the types are extracted into the Type column. The information in the Info column stays bound to the numbers (e.g. 2 and 3 both have the information b).
What is the best way to do this in Pandas?
Use melt with dropna:
df = df.melt('Info', value_name='Number', var_name='Type').dropna(subset=['Number'])
df['Type'] = df['Type'].str.extract(r'(\d+)', expand=False)
df['Number'] = df['Number'].astype(int)
print (df)
  Info Type  Number
0    a    1       1
1    b    1       2
4    b    2       3
5    c    2       4
Another solution with set_index and stack:
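#dropna makes the result independent of whether stack itself drops missing values
df = df.set_index('Info').stack().dropna().rename_axis(('Info','Type')).reset_index(name='Number')
df['Type'] = df['Type'].str.extract(r'(\d+)', expand=False)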
df['Number'] = df['Number'].astype(int)
print (df)
  Info Type  Number
0    a    1       1
1    b    1       2
2    b    2       3
3    c    2       4
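Since the column names share the stub "Number Type" followed by a numeric suffix, pd.wide_to_long is another option. This is a sketch rather than one of the posted answers, and it assumes the Info values are unique per row, which wide_to_long requires of its id column; the name `out` is a placeholder.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Number Type 1': [1, 2, np.nan],
                   'Number Type 2': [np.nan, 3, 4],
                   'Info': list('abc')})

out = (
    # split "Number Type <n>" into the stub and the numeric suffix <n>
    pd.wide_to_long(df, stubnames="Number Type", i="Info", j="Type", sep=" ")
    .dropna()                                   # drop combinations with no number
    .rename(columns={"Number Type": "Number"})
    .astype({"Number": int})
    .reset_index()
)
print(out)
```

Here the suffix is parsed for you, so no regular expression is needed to extract the type.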

Pandas : data frame transformation

I have a pandas dataframe which looks like below:
print (df)
customerid acc_type amount premium_member
0 1 Savings 200 N
1 1 Current 300 Y
2 2 Savings 250 N
I want to transform it into the data frame below, which turns acc_type and amount into two columns each and drops the original ones.
It is also guaranteed that no customer has more than two rows in the original dataframe, and that the account type is always Savings or Current (no other value).
The premium_member attribute is computed by taking the logical OR of the boolean (Y and N) values.
Use:
import numpy as np
#keep only customers with at most 2 rows; copy so the new column is not set on a slice
df = df[df.groupby('customerid')['acc_type'].transform('size') < 3].copy()
#new column
df['is'] = 1
#reshape and replace missing values to 0
df1 = df.set_index(['customerid','acc_type']).unstack(fill_value=0)
#check if Y in premium_member
s = df1.pop('premium_member').eq('Y').any(axis=1)
#change order of columns
df1 = df1.sort_index(axis=1, ascending=False)
#flatten MultiIndex
df1.columns = df1.columns.map(''.join)
#new column
df1['premium_member'] = np.where(s, 'Y','N')
#convert index to column
df1 = df1.reset_index().rename_axis(None, axis=1)
print (df1)
customerid isSavings isCurrent amountSavings amountCurrent \
0 1 1 1 200 300
1 2 1 0 250 0
premium_member
0 Y
1 N
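The same result can be assembled from smaller pieces, which may be easier to follow. The sketch below is an alternative to the unstack approach, not a restatement of it; it assumes each customer has at most one row per account type (pivot raises on duplicates), and the intermediate names `wide`, `is_cols`, `amount_cols`, and `premium` are introduced here. Column order differs from the output above (Current comes before Savings).

```python
import pandas as pd

df = pd.DataFrame({
    "customerid": [1, 1, 2],
    "acc_type": ["Savings", "Current", "Savings"],
    "amount": [200, 300, 250],
    "premium_member": ["N", "Y", "N"],
})

# one column per account type, NaN where the customer has no such account
wide = df.pivot(index="customerid", columns="acc_type", values="amount")

is_cols = wide.notna().astype(int).add_prefix("is")          # 1 if the account exists
amount_cols = wide.fillna(0).astype(int).add_prefix("amount")

# logical OR of the Y/N flags per customer
premium = (
    df["premium_member"].eq("Y")
    .groupby(df["customerid"]).any()
    .map({True: "Y", False: "N"})
    .rename("premium_member")
)

out = pd.concat([is_cols, amount_cols, premium], axis=1).reset_index()
out.columns.name = None
print(out)
```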
