I have a pandas DataFrame that contains the data given below:
ID Q1_rev Q1_transcnt Q2_rev Q2_transcnt Q3_rev Q3_transcnt Q4_rev Q4_transcnt
1 100 2 200 4 300 6 400 8
2 101 3 201 5 301 7 401 9
I would like to do the following:
a) For each ID, create 3 rows (from the 8 input data columns).
b) Each row should contain two quarters' worth of columns.
c) Each subsequent row should shift the columns by one (one quarter of data).
To understand better, I expect my output to be as below:
ID 1st_rev 1st_transcnt 2nd_rev 2nd_transcnt
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
I tried the below, based on the SO post here, but am unable to get the expected output:
s = 3
n = 2
cols = ['1st_rev', '1st_transcnt', '2nd_rev', '2nd_transcnt']
output = pd.concat(
    (df.iloc[:, 0 + i*s : 6 + i*s].set_axis(cols, axis=1)
     for i in range(int((df.shape[1] - (s*n)) / n))),
    ignore_index=True, axis=0
).set_index(np.tile(df.index, 2))
Can you help me with this? The problem is that in my real data, n=2 will not always be the case; it could be 4 or 5 as well. That is, instead of '1st_rev','1st_transcnt','2nd_rev','2nd_transcnt', I may have the below. You can see there are 4 pairs of columns:
'1st_rev','1st_transcnt','2nd_rev','2nd_transcnt','3rd_rev','3rd_transcnt','4th_rev','4th_transcnt'
Use a custom function with DataFrame.groupby, grouping the column names split by _ and selecting the second substring with x.split('_')[1]:
N = 2
df1 = df.set_index('ID')

def f(x, n=N):
    # build every length-n sliding window over the columns of each row
    out = np.array([[list(row[i:i + n]) for i in range(len(row) - n + 1)]
                    for row in x.to_numpy()])
    return pd.DataFrame(np.vstack(out))

df2 = (df1.groupby(lambda x: x.split('_')[1], axis=1, sort=False)
          .apply(f)
          .sort_index(axis=1, level=1, sort_remaining=False))
df2.index = np.repeat(df1.index, int(len(df2.index) / len(df1.index)))
df2.columns = df2.columns.map(lambda x: f'{x[1] + 1}_{x[0]}')
print (df2)
1_rev 1_transcnt 2_rev 2_transcnt
ID
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
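To see which columns each group hands to f, you can inspect the group keys first (a quick check on the same df1; note that grouping with axis=1 is deprecated in recent pandas releases):
print (df1.groupby(lambda x: x.split('_')[1], axis=1, sort=False).groups)
# roughly: {'rev': ['Q1_rev', 'Q2_rev', 'Q3_rev', 'Q4_rev'],
#           'transcnt': ['Q1_transcnt', 'Q2_transcnt', 'Q3_transcnt', 'Q4_transcnt']}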
Test with a window of 3:
N = 3
df1 = df.set_index('ID')

def f(x, n=N):
    out = np.array([[list(row[i:i + n]) for i in range(len(row) - n + 1)]
                    for row in x.to_numpy()])
    return pd.DataFrame(np.vstack(out))

df2 = (df1.groupby(lambda x: x.split('_')[1], axis=1, sort=False)
          .apply(f)
          .sort_index(axis=1, level=1, sort_remaining=False))
df2.index = np.repeat(df1.index, int(len(df2.index) / len(df1.index)))
df2.columns = df2.columns.map(lambda x: f'{x[1] + 1}_{x[0]}')
print (df2)
1_rev 1_transcnt 2_rev 2_transcnt 3_rev 3_transcnt
ID
1 100 2 200 4 300 6
1 200 4 300 6 400 8
2 101 3 201 5 301 7
2 201 5 301 7 401 9
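As a side note, the nested comprehension in f can be replaced with NumPy's sliding_window_view (available since NumPy 1.20); a minimal sketch that should produce the same stacked windows:
from numpy.lib.stride_tricks import sliding_window_view

def f(x, n=N):
    # one length-n window per shift, per row: shape (rows, cols-n+1, n)
    out = sliding_window_view(x.to_numpy(), n, axis=1)
    # flatten row by row, matching the np.vstack result above
    return pd.DataFrame(out.reshape(-1, n))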
One option is with a for loop or list comprehension, followed by a concatenation, and a sort:
temp = df.set_index('ID')
cols = ['1st_rev','1st_transcnt','2nd_rev','2nd_transcnt']
outcome = [temp.iloc(axis=1)[n:n + 4].set_axis(cols, axis=1)
           for n in range(0, len(cols) + 2, 2)]
pd.concat(outcome).sort_index()
1st_rev 1st_transcnt 2nd_rev 2nd_transcnt
ID
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
To make it more generic, a while loop can be used (a for loop works too; a while loop seems more readable/easier to understand):
def reshape_N(df, N):
    # you can pass your custom column names here instead,
    # as long as the list matches the window width
    columns = ['rev', 'transcnt']
    columns = np.tile(columns, N)
    numbers = np.arange(1, N + 1).repeat(2)
    columns = [f"{n}_{ent}" for n, ent in zip(numbers, columns)]
    contents = []
    start = 0
    width = N * 2
    temp = df.set_index("ID")
    # slide a window of `width` columns, two columns (one quarter) at a time
    while start + width <= temp.columns.size:
        frame = temp.iloc(axis=1)[start:start + width]
        frame.columns = columns
        contents.append(frame)
        start += 2
    if not contents:
        return df
    return pd.concat(contents).sort_index()
Let's apply the function:
reshape_N(df, 2)
1_rev 1_transcnt 2_rev 2_transcnt
ID
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
reshape_N(df, 3)
1_rev 1_transcnt 2_rev 2_transcnt 3_rev 3_transcnt
ID
1 100 2 200 4 300 6
1 200 4 300 6 400 8
2 101 3 201 5 301 7
2 201 5 301 7 401 9
I have a dataframe with the below columns:
Index(['Location', 'Dec-2021_x', 'Jan-2022_x', 'Feb-2022_x', 'Mar-2022_x',
'Apr-2022_x', 'May-2022_x', 'Jun-2022_x', 'Jul-2022_x', 'Aug-2022_x',
'Sep-2022_x', 'Oct-2022_x', 'Nov-2022_x', 'Dec-2022_x', 'Jan-2023_x',
'Feb-2023_x', 'Mar-2023_x', 'Apr-2023_x', 'May-2023_x', 'Jun-2023_x',
'Jul-2023_x', 'Aug-2023_x', 'Sep-2023_x', 'Oct-2023_x', 'Nov-2023_x',
'Dec-2023_x', 'Jan-2024_x', 'Feb-2024_x', 'Mar-2024_x', 'Apr-2024_x',
'May-2024_x', 'Jun-2024_x', 'Jul-2024_x', 'Aug-2024_x', 'Sep-2024_x',
'Oct-2024_x', 'Nov-2024_x', 'Dec-2024_x',
'sum_val',
'Dec-2021_y', 'Jan-2022_y', 'Feb-2022_y',
'Mar-2022_y', 'Apr-2022_y', 'May-2022_y', 'Jun-2022_y', 'Jul-2022_y',
'Aug-2022_y', 'Sep-2022_y', 'Oct-2022_y', 'Nov-2022_y', 'Dec-2022_y',
'Jan-2023_y', 'Feb-2023_y', 'Mar-2023_y', 'Apr-2023_y', 'May-2023_y',
'Jun-2023_y', 'Jul-2023_y', 'Aug-2023_y', 'Sep-2023_y', 'Oct-2023_y',
'Nov-2023_y', 'Dec-2023_y', 'Jan-2024_y', 'Feb-2024_y', 'Mar-2024_y',
'Apr-2024_y', 'May-2024_y', 'Jun-2024_y', 'Jul-2024_y', 'Aug-2024_y',
'Sep-2024_y', 'Oct-2024_y', 'Nov-2024_y', 'Dec-2024_y'],
dtype='object')
Sample dataframe with reduced columns looks like this:
df:
Location Dec-2021_x Jan-2022_x sum_val Dec-2021_y Jan-2022_y
A 212 315 1000 12 13
B 312 612 1100 13 17
C 242 712 1010 15 15
D 215 382 1001 16 17
E 252 319 1110 17 18
I have to create a resultant dataframe which will be in the below format:
Index(['Location', 'Dec-2021', 'Jan-2022', 'Feb-2022', 'Mar-2022',
'Apr-2022', 'May-2022', 'Jun-2022', 'Jul-2022', 'Aug-2022',
'Sep-2022', 'Oct-2022', 'Nov-2022', 'Dec-2022', 'Jan-2023',
'Feb-2023', 'Mar-2023', 'Apr-2023', 'May-2023', 'Jun-2023',
'Jul-2023', 'Aug-2023', 'Sep-2023', 'Oct-2023', 'Nov-2023',
'Dec-2023', 'Jan-2024', 'Feb-2024', 'Mar-2024', 'Apr-2024',
'May-2024', 'Jun-2024', 'Jul-2024', 'Aug-2024', 'Sep-2024',
'Oct-2024', 'Nov-2024', 'Dec-2024'],
dtype='object')
The way we do this is using the formula:
'Dec-2021' = 'Dec-2021_x' * sum_val * 'Dec-2021_y' (these are all numeric columns)
and in a similar way for all the months; there are 36 months to be precise. Is there any way to do this in a loop over each month-year column? There are around 65,000+ rows here, so I do not want to overwhelm the system.
Use:
#sample data
np.random.seed(2022)
c = ['Location', 'Dec-2021_x', 'Jan-2022_x', 'Feb-2022_x', 'Mar-2022_x',
'Apr-2022_x','sum_val', 'Dec-2021_y', 'Jan-2022_y', 'Feb-2022_y',
'Mar-2022_y', 'Apr-2022_y']
df = (pd.DataFrame(np.random.randint(10, size=(5, len(c))), columns=c)
.assign(Location=list('abcde')))
print (df)
Location Dec-2021_x Jan-2022_x Feb-2022_x Mar-2022_x Apr-2022_x \
0 a 1 1 0 7 8
1 b 8 0 3 6 8
2 c 1 7 5 5 4
3 d 0 7 5 5 8
4 e 8 0 3 9 5
sum_val Dec-2021_y Jan-2022_y Feb-2022_y Mar-2022_y Apr-2022_y
0 2 8 0 5 9 1
1 0 1 2 0 5 7
2 8 2 3 1 0 4
3 2 4 0 9 4 9
4 2 1 7 2 1 7
# remove the unnecessary column
df1 = df.drop(['sum_val'], axis=1)
# set 'Location' as the index so it is excluded from the multiplication but kept in the output
df1 = df1.set_index('Location')
# split column names by the last _
df1.columns = df1.columns.str.rsplit('_', n=1, expand=True)
# select the x and y DataFrames by the second level and multiply
df2 = (df1.xs('x', axis=1, level=1).mul(df['sum_val'].to_numpy(), axis=0) *
       df1.xs('y', axis=1, level=1))
print (df2)
Dec-2021 Jan-2022 Feb-2022 Mar-2022 Apr-2022
Location
a 16 0 0 126 16
b 0 0 0 0 0
c 16 168 40 0 128
d 0 0 90 40 144
e 16 0 12 18 70
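For a fully generic variant that avoids hard-coding positions, the _x and _y blocks can also be paired with DataFrame.filter (a sketch, assuming every month column ends in exactly one _x or _y suffix and both blocks list the months in the same order):
x = df.filter(regex='_x$')
y = df.filter(regex='_y$')
months = x.columns.str.rsplit('_', n=1).str[0]
df2 = pd.DataFrame(x.to_numpy() * y.to_numpy() * df[['sum_val']].to_numpy(),
                   columns=months, index=df['Location'])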
My dataframe looks like this:
id std number
A 1 1
A 0 12
B 123.45 34
B 1 56
B 12 78
C 134 90
C 1234 100
C 12345 111
I'd like to select random ids while retaining all of the rows belonging to each selected id, such that the dataframe would look like this:
id std number
A 1 1
A 0 12
C 134 90
C 1234 100
C 12345 111
I tried it with
size = 1000
replace = True
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
df2 = df1.groupby('Id', as_index=False).apply(fn)
and
df2 = df1.sample(n=1000).groupby('id')
but obviously that didn't work. Any help would be appreciated.
You need to create random ids first and then compare the original id column with Series.isin in boolean indexing:
#number of groups
N = 2
df2 = df1[df1['id'].isin(df1['id'].drop_duplicates().sample(N))]
print (df2)
id std number
0 A 1.0 1
1 A 0.0 12
5 C 134.0 90
6 C 1234.0 100
7 C 12345.0 111
Or:
N = 2
# replace=False prevents picking the same id twice
df2 = df1[df1['id'].isin(np.random.choice(df1['id'].unique(), N, replace=False))]
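If the picked groups need to be reproducible between runs, sample accepts a random_state (a small usage sketch):
N = 2
ids = df1['id'].drop_duplicates().sample(N, random_state=42)
df2 = df1[df1['id'].isin(ids)]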
I have a dataframe df1:
id value
1 100
2 100
3 100
4 100
5 100
I have another dataframe df2:
id value
2 50
5 30
I want to replace the values in df1 for the ids present in df2 with the values from df2.
Final modified df1:
id value
1 100
2 50
3 100
4 100
5 30
I will be running this in a loop, i.e. df2 will change from time to time (df1 is created outside the loop).
What would be the best way to change the values?
Use combine_first, but first set_index by id in both DataFrames.
Note: the id column in df2 has to be unique.
df = df2.set_index('id').combine_first(df1.set_index('id')).reset_index()
print (df)
id value
0 1 100.0
1 2 50.0
2 3 100.0
3 4 100.0
4 5 30.0
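Note that combine_first upcasts value to float because of the index alignment, as visible above; if the integer dtype matters, cast it back afterwards (a small follow-up, assuming no NaN remain):
df['value'] = df['value'].astype(int)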
A loc-based solution:
i = df1.set_index('id')
j = df2.set_index('id')
i.loc[j.index, 'value'] = j['value']
df1 = i.reset_index()
df1
id value
0 1 100
1 2 50
2 3 100
3 4 100
4 5 30
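A Series.map variant of the same replacement (a sketch, again assuming the ids in df2 are unique) avoids setting the index on df1:
mapper = df2.set_index('id')['value']
# map returns NaN for ids not in df2, so fall back to the original values
df1['value'] = df1['id'].map(mapper).fillna(df1['value'])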
I want to apply a custom operation on a column by grouping the values on another column: group by the column to get the count, then divide the other column's value by this count for all the grouped records.
My Data Frame:
emp opp amount
0 a 1 10
1 b 1 10
2 c 2 30
3 b 2 30
4 d 2 30
My scenario:
For opp=1, two emps worked (a, b), so the amount should be shared:
10/2 = 5
For opp=2, three emps worked (b, c, d), so the amount should be:
30/3 = 10
Final Output DataFrame:
emp opp amount
0 a 1 5
1 b 1 5
2 c 2 10
3 b 2 10
4 d 2 10
What is the best possible way to do so?
df['amount'] = df.groupby('opp')['amount'].transform(lambda g: g/g.size)
df
# emp opp amount
# 0 a 1 5
# 1 b 1 5
# 2 c 2 10
# 3 b 2 10
# 4 d 2 10
Or:
df['amount'] = df.groupby('opp')['amount'].apply(lambda g: g/g.size)
does a similar thing.
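The same division can also use the built-in 'size' aggregation, which avoids a Python-level lambda per group (a sketch):
df['amount'] = df['amount'] / df.groupby('opp')['amount'].transform('size')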
You could try something like this (note that .ix has long been removed from pandas; use .loc instead):
df2 = df.groupby('opp').amount.count()
df['calculated'] = df.apply(lambda row: row.amount / df2.loc[row.opp], axis=1)
df
Yields:
emp opp amount calculated
0 a 1 10 5
1 b 1 10 5
2 c 2 30 10
3 b 2 30 10
4 d 2 30 10
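The row-wise apply above makes one Python call per row; mapping the precomputed counts back onto opp is a vectorized variant of the same idea (a sketch, reusing df2 from above):
df['calculated'] = df['amount'] / df['opp'].map(df2)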