Create a column based on a computation over another column - python
I would like to create another column based on the sales for the previous week. Here is the sample input:
df = pd.DataFrame({'Week':[1,1,2,2,3,3,4,4,5,5,1,1,2,2,3,3,4,4,5,5],
'Category':['Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White'],
'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
'Sales':[100,200,300,400,100,200,300,400,100,200,100,200,300,400,100,200,300,400,100,200],
'Sales_others':[10,20,30,40,10,20,30,40,10,20,10,20,30,40,10,20,30,40,10,20]})
print(df)
Based on this, I would like to create another column which is simply the sales of the previous week. Here is a sample of the desired output:
df_output = pd.DataFrame({'Week':[1,1,2,2,3,3,4,4,5,5,1,1,2,2,3,3,4,4,5,5],
'Category':['Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White'],
'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
'Sales':[100,200,300,400,100,200,300,400,100,200,100,200,300,400,100,200,300,400,100,200],
'Sales_others':[10,20,30,40,10,20,30,40,10,20,10,20,30,40,10,20,30,40,10,20],
'Sales_previous_week':[0,0,100,200,300,400,100,200,300,400,0,0,100,200,300,400,100,200,300,400]})
print(df_output)
I am finding it hard to write what would effectively be a self join. The previous-week value should only be driven by the Sales column, and I should be able to retain the "Sales_others" column.
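For illustration, the self join I have in mind would look roughly like the merge below (a rough sketch only; the Sales_previous_week name and the Week + 1 shift are just how I picture it):
# Build a copy keyed one week later, then merge it back onto the original rows.
prev = df[['Week', 'Category', 'id', 'Sales']].copy()
prev['Week'] = prev['Week'] + 1                      # week w becomes the "previous week" of w+1
prev = prev.rename(columns={'Sales': 'Sales_previous_week'})

out = df.merge(prev, on=['Week', 'Category', 'id'], how='left')
out['Sales_previous_week'] = out['Sales_previous_week'].fillna(0).astype(int)
print(out)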
--Edit
Adding original code
CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID','CATEGORY_NAME'])['CURRENT_WEEK'].shift(fill_value=0)
print(CR_UK_NL_Weeklevel)
Renaming columns
CR_UK_NL_Weeklevel.columns.values[4] = 'CURRENT_WEEK'
CR_UK_NL_Weeklevel.columns.values[3] = 'LAST_YEAR_WEEK'
CR_UK_NL_Weeklevel.columns.values
Trying to implement solution:
CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID','CATEGORY_NAME'])['CURRENT_WEEK'].shift(fill_value=0)
print(CR_UK_NL_Weeklevel)
--Error
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
in
----> 1 CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID','CATEGORY_NAME'])['CURRENT_WEEK'].shift(fill_value=0)
2 print(CR_UK_NL_Weeklevel)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\base.py in __getitem__(self, key)
273 else:
274 if key not in self.obj:
--> 275 raise KeyError("Column not found: {key}".format(key=key))
276 return self._gotitem(key, ndim=1)
277
KeyError: 'Column not found: CURRENT_WEEK'
If there are always the same categories per week and the weeks are consecutive, use DataFrameGroupBy.shift, grouping by the Category column:
df['Sales_PREVIOUS'] = df.groupby('Category')['Sales'].shift(fill_value=0)
print (df)
Week Category Sales Sales_PREVIOUS
0 1 Red 100 0
1 1 White 200 0
2 2 Red 300 100
3 2 White 400 200
4 3 Red 100 300
5 3 White 200 400
6 4 Red 300 100
7 4 White 400 200
8 5 Red 100 300
9 5 White 200 400
Another idea is pivoting: use DataFrame.pivot, then DataFrame.shift followed by DataFrame.stack to get a Series, and finally add the new column with DataFrame.join:
s = df.pivot(index='Week', columns='Category', values='Sales').shift(fill_value=0).stack()
df = df.join(s.rename('Sales_PREVIOUS WEEK'), on=['Week','Category'])
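To make the pivot step concrete, this is roughly what the intermediate wide table looks like before stacking (an illustrative sketch; it restricts the sample frame to id == 1 because DataFrame.pivot needs unique Week/Category pairs):
# One row per Week, one column per Category; shifting by one row moves each
# week's sales down so that week w holds the sales of week w-1.
wide = df[df['id'] == 1].pivot(index='Week', columns='Category', values='Sales')
print(wide)
print(wide.shift(fill_value=0))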
EDIT:
With the new data, add the id column to the grouping:
df['Sales_PREVIOUS'] = df.groupby(['id','Category'])['Sales'].shift(fill_value=0)
And for the second solution:
s = df.set_index(['Week','id','Category'])['Sales'].unstack([1,2]).shift(fill_value=0).unstack()
df = df.join(s.rename('Sales_PREVIOUS WEEK'), on=['id','Category','Week'])
print (df)
Week Category id Sales Sales_others Sales_PREVIOUS WEEK
0 1 Red 1 100 10 0
1 1 White 1 200 20 0
2 2 Red 1 300 30 100
3 2 White 1 400 40 200
4 3 Red 1 100 10 300
5 3 White 1 200 20 400
6 4 Red 1 300 30 100
7 4 White 1 400 40 200
8 5 Red 1 100 10 300
9 5 White 1 200 20 400
10 1 Red 2 100 10 0
11 1 White 2 200 20 0
12 2 Red 2 300 30 100
13 2 White 2 400 40 200
14 3 Red 2 100 10 300
15 3 White 2 200 20 400
16 4 Red 2 300 30 100
17 4 White 2 400 40 200
18 5 Red 2 100 10 300
19 5 White 2 200 20 400
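As a quick sanity check (not part of the original answer), both approaches should reproduce the Sales_previous_week column from the desired df_output in the question:
# Compare the groupby/shift result against the expected column, row by row.
check = df.groupby(['id', 'Category'])['Sales'].shift(fill_value=0)
print((check == df_output['Sales_previous_week']).all())   # expected: True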
EDIT:
The problem is with the column names: assigning into CR_UK_NL_Weeklevel.columns.values does not reliably rename the columns, so 'CURRENT_WEEK' is never found. Rebuild the column list and assign it back instead:
cols = CR_UK_NL_Weeklevel.columns.tolist()
cols[4] = 'CURRENT_WEEK'
cols[3] = 'LAST_YEAR_WEEK'
CR_UK_NL_Weeklevel.columns = cols
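An equivalent fix, sketched with DataFrame.rename on the positional labels (positions 3 and 4 are taken from the snippet in the question), followed by the groupby shift from above:
# Rename by position, then compute the previous-week column per site/category.
old_cols = CR_UK_NL_Weeklevel.columns
CR_UK_NL_Weeklevel = CR_UK_NL_Weeklevel.rename(
    columns={old_cols[3]: 'LAST_YEAR_WEEK', old_cols[4]: 'CURRENT_WEEK'})

CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = (
    CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID', 'CATEGORY_NAME'])['CURRENT_WEEK']
                      .shift(fill_value=0))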
Related
pandas dynamic wide to long based on time
I have a pandas dataframe that contains the data given below:

ID  Q1_rev  Q1_transcnt  Q2_rev  Q2_transcnt  Q3_rev  Q3_transcnt  Q4_rev  Q4_transcnt
1   100     2            200     4            300     6            400     8
2   101     3            201     5            301     7            401     9

I would like to do the following:
a) For each ID, create 3 rows (from the 8 input columns of data)
b) Each row should contain two columns' worth of data
c) Each subsequent row should shift the columns by one (one quarter of data)

To understand better, I expect my output to look like the below. I tried the following, based on the SO post here, but was unable to get the expected output:

s = 3
n = 2
cols = ['1st_rev','1st_transcnt','2nd_rev','2nd_transcnt']
output = pd.concat((df.iloc[:,0+i*s:6+i*s].set_axis(cols, axis=1)
                    for i in range(int((df.shape[1]-(s*n))/n))),
                   ignore_index=True, axis=0).set_index(np.tile(df.index,2))

Can you help me with this? The problem is that in real data n=2 will not always be the case; it could be 4 or 5 as well. Meaning, instead of '1st_rev','1st_transcnt','2nd_rev','2nd_transcnt', I may have the below, where you can see there are 4 pairs of columns:

'1st_rev','1st_transcnt','2nd_rev','2nd_transcnt','3rd_rev','3rd_transcnt','4th_rev','4th_transcnt'
Use a custom function with DataFrame.groupby over the column names split by _, selecting the second substring with x.split('_')[1]:

N = 2
df1 = df.set_index('ID')

def f(x, n=N):
    out = np.array([[list(L[x:x+n]) for x in range(len(L)-n+1)] for L in x.to_numpy()])
    return pd.DataFrame(np.vstack(out))

df2 = (df1.groupby(lambda x: x.split('_')[1], axis=1, sort=False)
          .apply(f)
          .sort_index(axis=1, level=1, sort_remaining=False))
df2.index = np.repeat(df1.index, int(len(df2.index) / len(df1.index)))
df2.columns = df2.columns.map(lambda x: f'{x[1] + 1}_{x[0]}')
print (df2)

    1_rev  1_transcnt  2_rev  2_transcnt
ID
1     100           2    200           4
1     200           4    300           6
1     300           6    400           8
2     101           3    201           5
2     201           5    301           7
2     301           7    401           9

Test with a window of 3:

N = 3
df1 = df.set_index('ID')

def f(x, n=N):
    out = np.array([[list(L[x:x+n]) for x in range(len(L)-n+1)] for L in x.to_numpy()])
    return pd.DataFrame(np.vstack(out))

df2 = (df1.groupby(lambda x: x.split('_')[1], axis=1, sort=False)
          .apply(f)
          .sort_index(axis=1, level=1, sort_remaining=False))
df2.index = np.repeat(df1.index, int(len(df2.index) / len(df1.index)))
df2.columns = df2.columns.map(lambda x: f'{x[1] + 1}_{x[0]}')
print (df2)

    1_rev  1_transcnt  2_rev  2_transcnt  3_rev  3_transcnt
ID
1     100           2    200           4    300           6
1     200           4    300           6    400           8
2     101           3    201           5    301           7
2     201           5    301           7    401           9
One option is with a for loop or list comprehension, followed by a concatenation and a sort:

temp = df.set_index('ID')
cols = ['1st_rev','1st_transcnt','2nd_rev','2nd_transcnt']
outcome = [temp
           .iloc(axis=1)[n:n+4]
           .set_axis(cols, axis=1)
           for n in range(0, len(cols)+2, 2)]
pd.concat(outcome).sort_index()

    1st_rev  1st_transcnt  2nd_rev  2nd_transcnt
ID
1       100             2      200             4
1       200             4      300             6
1       300             6      400             8
2       101             3      201             5
2       201             5      301             7
2       301             7      401             9

To make it more generic, a while loop can be used (you can use a for loop - a while loop seems more readable/easier to understand):

def reshape_N(df, N):
    # you can pass your custom column names here instead,
    # as long as they match the width of the dataframe
    columns = ['rev', 'transcnt']
    columns = np.tile(columns, N)
    numbers = np.arange(1, N+1).repeat(2)
    columns = [f"{n}_{ent}" for n, ent in zip(numbers, columns)]
    contents = []
    start = 0
    end = N * 2
    temp = df.set_index("ID")
    while (end < temp.columns.size):
        end += start
        frame = temp.iloc(axis=1)[start:end]
        frame.columns = columns
        contents.append(frame)
        start += 2
    if not contents:
        return df
    return pd.concat(contents).sort_index()

Let's apply the function:

reshape_N(df, 2)

    1_rev  1_transcnt  2_rev  2_transcnt
ID
1     100           2    200           4
1     200           4    300           6
1     300           6    400           8
2     101           3    201           5
2     201           5    301           7
2     301           7    401           9

reshape_N(df, 3)

    1_rev  1_transcnt  2_rev  2_transcnt  3_rev  3_transcnt
ID
1     100           2    200           4    300           6
1     200           4    300           6    400           8
2     101           3    201           5    301           7
2     201           5    301           7    401           9
Calculating average in array under different conditions using pandas [duplicate]
I have a DataFrame

>>> df = pd.DataFrame({'a':[1,1,1,2,2,2],
...                    'b':[10,20,20,10,20,20],
...                    'result':[100,200,300,400,500,600]})
...
>>> df
   a   b  result
0  1  10     100
1  1  20     200
2  1  20     300
3  2  10     400
4  2  20     500
5  2  20     600

and want to create a new column that is the average result for the corresponding values for 'a' and 'b'. I can get those values with a groupby:

>>> df.groupby(['a','b'])['result'].mean()
a  b
1  10    100
   20    250
2  10    400
   20    550
Name: result, dtype: int64

but cannot figure out how to turn that into a new column in the original DataFrame. The final result should look like this:

>>> df
   a   b  result  avg_result
0  1  10     100         100
1  1  20     200         250
2  1  20     300         250
3  2  10     400         400
4  2  20     500         550
5  2  20     600         550

I could do this by looping through the combinations of 'a' and 'b', but that would get really slow and unwieldy for larger sets of data. There is probably a much simpler and faster way to go.
You need transform:

df['avg_result'] = df.groupby(['a', 'b'])['result'].transform('mean')

This generates a correctly indexed column of the groupby values for you:

   a   b  result  avg_result
0  1  10     100         100
1  1  20     200         250
2  1  20     300         250
3  2  10     400         400
4  2  20     500         550
5  2  20     600         550
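For comparison, the same column can also be built by aggregating and merging back; a bit more verbose, but sometimes easier to reason about (a sketch, not part of the original answer):
# Aggregate per (a, b), then merge the averages back onto the original rows.
means = (df.groupby(['a', 'b'], as_index=False)['result']
           .mean()
           .rename(columns={'result': 'avg_result'}))
df = df.merge(means, on=['a', 'b'], how='left')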
Since the previous answer (https://stackoverflow.com/a/33445035/6504287) is pandas based, I'm adding a PySpark based solution. Here it is better to go with a Window function, as in the snippet below:

from pyspark.sql import Window
from pyspark.sql.functions import avg

windowSpecAgg = Window.partitionBy('a', 'b')
ext_data_df.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()

The code above uses the same example data as the previously linked answer (https://stackoverflow.com/a/33445035/6504287).
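A minimal end-to-end sketch of the same idea, assuming a local Spark installation and rebuilding the toy data from the question (the DataFrame name sdf is hypothetical):
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg

spark = SparkSession.builder.master("local[*]").getOrCreate()
sdf = spark.createDataFrame(
    [(1, 10, 100), (1, 20, 200), (1, 20, 300),
     (2, 10, 400), (2, 20, 500), (2, 20, 600)],
    ['a', 'b', 'result'])

# Average of result within each (a, b) partition, attached to every row.
sdf.withColumn('avg_result', avg('result').over(Window.partitionBy('a', 'b'))).show()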
Is there any efficient way to filter out cluster data simultaneously in a large pandas dataframe?
I have a large pandas dataframe which looks like this:

DF:
ID              setID  Weight
PG_002456788.1  1      100
UG_004678935.1  2      110
UG_012975895.1  2      150
PG_023788904.1  3      200
UR_073542247.1  3      200
UR_099876678.2  3      264
PR_066120875.1  4      400
PR_098759678.1  4      600
UR_096677888.2  4      750
PG_012667994.1  5      800
PG_077555239.1  5      800

I would like to filter rows down to one representative per setID, chosen in the following order of priority:

Preference 1: ID starting with PG_
Preference 2: ID starting with UG_
Preference 3: ID starting with PR_
Preference 4: ID starting with UR_

Along with this, the next priority is to choose the highest Weight within each setID cluster.

Desired output:

ID              setID  Weight
PG_002456788.1  1      100
UG_012975895.1  2      150
PG_023788904.1  3      200
PR_098759678.1  4      600
PG_012667994.1  5      800

Also, I would like to print rows that share the same ID initials and the same weight separately, if there are any. For example:

ID              setID  Weight
PG_012667994.1  5      800
PG_077555239.1  5      800
IIUC you can define a pd.Categorical dummy column with the initial substring of ID, and use it together with Weight to order the dataframe. Then group by setID and take the first row:

df['ID_init'] = pd.Categorical(df.ID.str.split('_').str[0],
                               categories=['PG','UG','PR','UR'],
                               ordered=True)

(df.sort_values(by=['ID_init','Weight'], ascending=[True, False])
   .groupby('setID')
   .head(1)
   .sort_values('setID')
   .drop(columns='ID_init'))

               ID  setID  Weight
0  PG_002456788.1      1     100
2  UG_012975895.1      2     150
3  PG_023788904.1      3     200
7  PR_098759678.1      4     600
9  PG_012667994.1      5     800
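The second part of the question (rows that tie on both the ID prefix and the Weight within a setID) is not covered above; one possible sketch using the same ID_init helper column, interpreting the request as exact ties, would be:
# Rows whose (setID, prefix, Weight) combination occurs more than once, i.e. ties.
ties = df[df.duplicated(subset=['setID', 'ID_init', 'Weight'], keep=False)]
print(ties[['ID', 'setID', 'Weight']])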
For the first part: create a new column called code from the ID. Then sort the data frame by code and Weight, group by setID and take the first entry:

df['code'] = df['ID'].str[:2].replace({'PG': 1, 'UG': 2, 'PR': 3, 'UR': 4})
df2 = df.sort_values(['code', 'Weight'], ascending=[True, False]).groupby('setID').first()
df2 = df2.reset_index().drop('code', axis=1)

Output:

   setID              ID  Weight
0      1  PG_002456788.1     100
1      2  UG_012975895.1     150
2      3  PG_023788904.1     200
3      4  PR_098759678.1     600
4      5  PG_012667994.1     800

The second part:

df3 = df.join(df.groupby(['setID', 'code']).count()['ID'],
              on=['setID', 'code'], rsuffix='_Count')
df3[df3['ID_Count'] > 1].drop(['code', 'ID_Count'], axis=1)

Output:

                ID  setID  Weight
1   UG_004678935.1      2     110
2   UG_012975895.1      2     150
4   UR_073542247.1      3     200
5   UR_099876678.2      3     264
6   PR_066120875.1      4     400
7   PR_098759678.1      4     600
9   PG_012667994.1      5     800
10  PG_077555239.1      5     800