I have a pandas DataFrame that contains the data given below:
ID Q1_rev Q1_transcnt Q2_rev Q2_transcnt Q3_rev Q3_transcnt Q4_rev Q4_transcnt
1 100 2 200 4 300 6 400 8
2 101 3 201 5 301 7 401 9
I would like to do the following:
a) For each ID, create 3 rows (from the 8 input data columns).
b) Each row should contain two quarters' worth of columns.
c) Each subsequent row should shift the columns by one (one quarter of data).
To understand better, I expect my output to be as below:
ID 1st_rev 1st_transcnt 2nd_rev 2nd_transcnt
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
I tried the below, based on the SO post here, but am unable to get the expected output:
s = 3
n = 2
cols = ['1st_rev', '1st_transcnt', '2nd_rev', '2nd_transcnt']
output = pd.concat(
    (df.iloc[:, 0 + i*s : 6 + i*s].set_axis(cols, axis=1)
     for i in range(int((df.shape[1] - (s*n)) / n))),
    ignore_index=True, axis=0
).set_index(np.tile(df.index, 2))
Can you help me with this? The problem is that in my real data, n=2 will not always be the case; it could be 4 or 5 as well. That is, instead of '1st_rev','1st_transcnt','2nd_rev','2nd_transcnt', I may have the below. You can see there are 4 pairs of columns:
'1st_rev','1st_transcnt','2nd_rev','2nd_transcnt','3rd_rev','3rd_transcnt','4th_rev','4th_transcnt'
Use a custom function with DataFrame.groupby, grouping the column names split by _ and selecting the second substring with x.split('_')[1]:
N = 2
df1 = df.set_index('ID')

def f(x, n=N):
    # build every length-n sliding window over the columns of each row
    out = np.array([[list(row[i:i + n]) for i in range(len(row) - n + 1)]
                    for row in x.to_numpy()])
    return pd.DataFrame(np.vstack(out))

df2 = (df1.groupby(lambda x: x.split('_')[1], axis=1, sort=False)
          .apply(f)
          .sort_index(axis=1, level=1, sort_remaining=False))
df2.index = np.repeat(df1.index, int(len(df2.index) / len(df1.index)))
df2.columns = df2.columns.map(lambda x: f'{x[1] + 1}_{x[0]}')
print (df2)
1_rev 1_transcnt 2_rev 2_transcnt
ID
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
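To see which columns each group hands to f, you can inspect the group keys first (a quick check on the same df1; note that grouping with axis=1 is deprecated in recent pandas releases):
print (df1.groupby(lambda x: x.split('_')[1], axis=1, sort=False).groups)
# roughly: {'rev': ['Q1_rev', 'Q2_rev', 'Q3_rev', 'Q4_rev'],
#           'transcnt': ['Q1_transcnt', 'Q2_transcnt', 'Q3_transcnt', 'Q4_transcnt']}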
Test with a window of 3:
N = 3
df1 = df.set_index('ID')

def f(x, n=N):
    out = np.array([[list(row[i:i + n]) for i in range(len(row) - n + 1)]
                    for row in x.to_numpy()])
    return pd.DataFrame(np.vstack(out))

df2 = (df1.groupby(lambda x: x.split('_')[1], axis=1, sort=False)
          .apply(f)
          .sort_index(axis=1, level=1, sort_remaining=False))
df2.index = np.repeat(df1.index, int(len(df2.index) / len(df1.index)))
df2.columns = df2.columns.map(lambda x: f'{x[1] + 1}_{x[0]}')
print (df2)
1_rev 1_transcnt 2_rev 2_transcnt 3_rev 3_transcnt
ID
1 100 2 200 4 300 6
1 200 4 300 6 400 8
2 101 3 201 5 301 7
2 201 5 301 7 401 9
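As a side note, the nested comprehension in f can be replaced with NumPy's sliding_window_view (available since NumPy 1.20); a minimal sketch that should produce the same stacked windows:
from numpy.lib.stride_tricks import sliding_window_view

def f(x, n=N):
    # one length-n window per shift, per row: shape (rows, cols-n+1, n)
    out = sliding_window_view(x.to_numpy(), n, axis=1)
    # flatten row by row, matching the np.vstack result above
    return pd.DataFrame(out.reshape(-1, n))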
One option is with a for loop or list comprehension, followed by a concatenation, and a sort:
temp = df.set_index('ID')
cols = ['1st_rev','1st_transcnt','2nd_rev','2nd_transcnt']
outcome = [temp.iloc(axis=1)[n:n + 4].set_axis(cols, axis=1)
           for n in range(0, len(cols) + 2, 2)]
pd.concat(outcome).sort_index()
1st_rev 1st_transcnt 2nd_rev 2nd_transcnt
ID
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
To make it more generic, a while loop can be used (a for loop works too; a while loop seems more readable/easier to understand):
def reshape_N(df, N):
    # you can pass your custom column names here instead,
    # as long as the list matches the window width
    columns = ['rev', 'transcnt']
    columns = np.tile(columns, N)
    numbers = np.arange(1, N + 1).repeat(2)
    columns = [f"{n}_{ent}" for n, ent in zip(numbers, columns)]
    contents = []
    start = 0
    width = N * 2
    temp = df.set_index("ID")
    # slide a window of `width` columns, two columns (one quarter) at a time
    while start + width <= temp.columns.size:
        frame = temp.iloc(axis=1)[start:start + width]
        frame.columns = columns
        contents.append(frame)
        start += 2
    if not contents:
        return df
    return pd.concat(contents).sort_index()
Let's apply the function:
reshape_N(df, 2)
1_rev 1_transcnt 2_rev 2_transcnt
ID
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
reshape_N(df, 3)
1_rev 1_transcnt 2_rev 2_transcnt 3_rev 3_transcnt
ID
1 100 2 200 4 300 6
1 200 4 300 6 400 8
2 101 3 201 5 301 7
2 201 5 301 7 401 9
I have a dataframe with the below columns:
Index(['Location', 'Dec-2021_x', 'Jan-2022_x', 'Feb-2022_x', 'Mar-2022_x',
'Apr-2022_x', 'May-2022_x', 'Jun-2022_x', 'Jul-2022_x', 'Aug-2022_x',
'Sep-2022_x', 'Oct-2022_x', 'Nov-2022_x', 'Dec-2022_x', 'Jan-2023_x',
'Feb-2023_x', 'Mar-2023_x', 'Apr-2023_x', 'May-2023_x', 'Jun-2023_x',
'Jul-2023_x', 'Aug-2023_x', 'Sep-2023_x', 'Oct-2023_x', 'Nov-2023_x',
'Dec-2023_x', 'Jan-2024_x', 'Feb-2024_x', 'Mar-2024_x', 'Apr-2024_x',
'May-2024_x', 'Jun-2024_x', 'Jul-2024_x', 'Aug-2024_x', 'Sep-2024_x',
'Oct-2024_x', 'Nov-2024_x', 'Dec-2024_x',
'sum_val',
'Dec-2021_y', 'Jan-2022_y', 'Feb-2022_y',
'Mar-2022_y', 'Apr-2022_y', 'May-2022_y', 'Jun-2022_y', 'Jul-2022_y',
'Aug-2022_y', 'Sep-2022_y', 'Oct-2022_y', 'Nov-2022_y', 'Dec-2022_y',
'Jan-2023_y', 'Feb-2023_y', 'Mar-2023_y', 'Apr-2023_y', 'May-2023_y',
'Jun-2023_y', 'Jul-2023_y', 'Aug-2023_y', 'Sep-2023_y', 'Oct-2023_y',
'Nov-2023_y', 'Dec-2023_y', 'Jan-2024_y', 'Feb-2024_y', 'Mar-2024_y',
'Apr-2024_y', 'May-2024_y', 'Jun-2024_y', 'Jul-2024_y', 'Aug-2024_y',
'Sep-2024_y', 'Oct-2024_y', 'Nov-2024_y', 'Dec-2024_y'],
dtype='object')
Sample dataframe with reduced columns looks like this:
df:
Location Dec-2021_x Jan-2022_x sum_val Dec-2021_y Jan-2022_y
A 212 315 1000 12 13
B 312 612 1100 13 17
C 242 712 1010 15 15
D 215 382 1001 16 17
E 252 319 1110 17 18
I have to create a resultant dataframe which will be in the below format:
Index(['Location', 'Dec-2021', 'Jan-2022', 'Feb-2022', 'Mar-2022',
'Apr-2022', 'May-2022', 'Jun-2022', 'Jul-2022', 'Aug-2022',
'Sep-2022', 'Oct-2022', 'Nov-2022', 'Dec-2022', 'Jan-2023',
'Feb-2023', 'Mar-2023', 'Apr-2023', 'May-2023', 'Jun-2023',
'Jul-2023', 'Aug-2023', 'Sep-2023', 'Oct-2023', 'Nov-2023',
'Dec-2023', 'Jan-2024', 'Feb-2024', 'Mar-2024', 'Apr-2024',
'May-2024', 'Jun-2024', 'Jul-2024', 'Aug-2024', 'Sep-2024',
'Oct-2024', 'Nov-2024', 'Dec-2024'],
dtype='object')
The way we do this is using the formula:
'Dec-2021' = 'Dec-2021_x' * sum_val * 'Dec-2021_y' (these are all numeric columns)
and in a similar way for all the months; there are 36 months to be precise. Is there any way to do this in a loop over each month-year column? There are around 65,000+ rows here, so I do not want to overwhelm the system.
Use:
#sample data
np.random.seed(2022)
c = ['Location', 'Dec-2021_x', 'Jan-2022_x', 'Feb-2022_x', 'Mar-2022_x',
'Apr-2022_x','sum_val', 'Dec-2021_y', 'Jan-2022_y', 'Feb-2022_y',
'Mar-2022_y', 'Apr-2022_y']
df = (pd.DataFrame(np.random.randint(10, size=(5, len(c))), columns=c)
.assign(Location=list('abcde')))
print (df)
Location Dec-2021_x Jan-2022_x Feb-2022_x Mar-2022_x Apr-2022_x \
0 a 1 1 0 7 8
1 b 8 0 3 6 8
2 c 1 7 5 5 4
3 d 0 7 5 5 8
4 e 8 0 3 9 5
sum_val Dec-2021_y Jan-2022_y Feb-2022_y Mar-2022_y Apr-2022_y
0 2 8 0 5 9 1
1 0 1 2 0 5 7
2 8 2 3 1 0 4
3 2 4 0 9 4 9
4 2 1 7 2 1 7
# remove the unnecessary column
df1 = df.drop(['sum_val'], axis=1)
# set 'Location' as the index so it is excluded from the multiplication but kept in the output
df1 = df1.set_index('Location')
# split column names by the last _
df1.columns = df1.columns.str.rsplit('_', n=1, expand=True)
# select the x and y DataFrames by the second level and multiply
df2 = (df1.xs('x', axis=1, level=1).mul(df['sum_val'].to_numpy(), axis=0) *
       df1.xs('y', axis=1, level=1))
print (df2)
Dec-2021 Jan-2022 Feb-2022 Mar-2022 Apr-2022
Location
a 16 0 0 126 16
b 0 0 0 0 0
c 16 168 40 0 128
d 0 0 90 40 144
e 16 0 12 18 70
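For a fully generic variant that avoids hard-coding positions, the _x and _y blocks can also be paired with DataFrame.filter (a sketch, assuming every month column ends in exactly one _x or _y suffix and both blocks list the months in the same order):
x = df.filter(regex='_x$')
y = df.filter(regex='_y$')
months = x.columns.str.rsplit('_', n=1).str[0]
df2 = pd.DataFrame(x.to_numpy() * y.to_numpy() * df[['sum_val']].to_numpy(),
                   columns=months, index=df['Location'])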
My dataframe looks like this:
id std number
A 1 1
A 0 12
B 123.45 34
B 1 56
B 12 78
C 134 90
C 1234 100
C 12345 111
I'd like to select random ids while retaining all of the rows belonging to each selected id, such that the dataframe would look like this:
id std number
A 1 1
A 0 12
C 134 90
C 1234 100
C 12345 111
I tried it with
size = 1000
replace = True
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
df2 = df1.groupby('Id', as_index=False).apply(fn)
and
df2 = df1.sample(n=1000).groupby('id')
but obviously that didn't work. Any help would be appreciated.
You need to create random ids first and then compare the original id column with Series.isin in boolean indexing:
#number of groups
N = 2
df2 = df1[df1['id'].isin(df1['id'].drop_duplicates().sample(N))]
print (df2)
id std number
0 A 1.0 1
1 A 0.0 12
5 C 134.0 90
6 C 1234.0 100
7 C 12345.0 111
Or:
N = 2
# replace=False prevents picking the same id twice
df2 = df1[df1['id'].isin(np.random.choice(df1['id'].unique(), N, replace=False))]
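If the picked groups need to be reproducible between runs, sample accepts a random_state (a small usage sketch):
N = 2
ids = df1['id'].drop_duplicates().sample(N, random_state=42)
df2 = df1[df1['id'].isin(ids)]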
I have a dataframe df1:
id value
1 100
2 100
3 100
4 100
5 100
I have another dataframe df2:
id value
2 50
5 30
I want to replace the values in df1 for the ids present in df2 with the values from df2.
Final modified df1:
id value
1 100
2 50
3 100
4 100
5 30
I will be running this in a loop, i.e. df2 will change from time to time (df1 is created outside the loop).
What would be the best way to change the values?
Use combine_first, but first set_index by id in both DataFrames.
Note: the id column in df2 has to be unique.
df = df2.set_index('id').combine_first(df1.set_index('id')).reset_index()
print (df)
id value
0 1 100.0
1 2 50.0
2 3 100.0
3 4 100.0
4 5 30.0
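Note that combine_first upcasts value to float because of the index alignment, as visible above; if the integer dtype matters, cast it back afterwards (a small follow-up, assuming no NaN remain):
df['value'] = df['value'].astype(int)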
A loc-based solution:
i = df1.set_index('id')
j = df2.set_index('id')
i.loc[j.index, 'value'] = j['value']
df1 = i.reset_index()
df1
id value
0 1 100
1 2 50
2 3 100
3 4 100
4 5 30
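A Series.map variant of the same replacement (a sketch, again assuming the ids in df2 are unique) avoids setting the index on df1:
mapper = df2.set_index('id')['value']
# map returns NaN for ids not in df2, so fall back to the original values
df1['value'] = df1['id'].map(mapper).fillna(df1['value'])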
I want to apply a custom operation on a column by grouping the values on another column: group by the column to get the count, then divide the other column's value by this count for all the grouped records.
My Data Frame:
emp opp amount
0 a 1 10
1 b 1 10
2 c 2 30
3 b 2 30
4 d 2 30
My scenario:
For opp=1, two emps worked (a, b), so the amount should be shared:
10/2 = 5
For opp=2, three emps worked (b, c, d), so the amount should be:
30/3 = 10
Final Output DataFrame:
emp opp amount
0 a 1 5
1 b 1 5
2 c 2 10
3 b 2 10
4 d 2 10
What is the best possible way to do so?
df['amount'] = df.groupby('opp')['amount'].transform(lambda g: g/g.size)
df
# emp opp amount
# 0 a 1 5
# 1 b 1 5
# 2 c 2 10
# 3 b 2 10
# 4 d 2 10
Or:
df['amount'] = df.groupby('opp')['amount'].apply(lambda g: g/g.size)
does a similar thing.
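The same division can also use the built-in 'size' aggregation, which avoids a Python-level lambda per group (a sketch):
df['amount'] = df['amount'] / df.groupby('opp')['amount'].transform('size')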
You could try something like this (note that .ix has long been removed from pandas; use .loc instead):
df2 = df.groupby('opp').amount.count()
df['calculated'] = df.apply(lambda row: row.amount / df2.loc[row.opp], axis=1)
df
Yields:
emp opp amount calculated
0 a 1 10 5
1 b 1 10 5
2 c 2 30 10
3 b 2 30 10
4 d 2 30 10
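The row-wise apply above makes one Python call per row; mapping the precomputed counts back onto opp is a vectorized variant of the same idea (a sketch, reusing df2 from above):
df['calculated'] = df['amount'] / df['opp'].map(df2)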