Create a column based on computation of a another column - python

I would like to create another column based on the sales for the previous week. Here is the sample input:
df = pd.DataFrame({'Week':[1,1,2,2,3,3,4,4,5,5,1,1,2,2,3,3,4,4,5,5],
'Category':['Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White'],
'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
'Sales':[100,200,300,400,100,200,300,400,100,200,100,200,300,400,100,200,300,400,100,200],
'Sales_others':[10,20,30,40,10,20,30,40,10,20,10,20,30,40,10,20,30,40,10,20]})
print(df)
Based on this, i would like to create another column which is nothing but the sales of the previous week. Here is the sample of the desired output
df_output = pd.DataFrame({'Week':[1,1,2,2,3,3,4,4,5,5,1,1,2,2,3,3,4,4,5,5],
'Category':['Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White'],
'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
'Sales':[100,200,300,400,100,200,300,400,100,200,100,200,300,400,100,200,300,400,100,200],
'Sales_others':[10,20,30,40,10,20,30,40,10,20,10,20,30,40,10,20,30,40,10,20],
'Sales_previous_week':[0,0,100,200,300,400,100,200,300,400,0,0,100,200,300,400,100,200,300,400]})
print(df_output)
Am finding it hard to create what would be a self join. The previous week should only be influenced by sales file and i should be able to retain the "sales_others" column
--Edit
Adding original code
CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID','CATEGORY_NAME'])['CURRENT_WEEK'].shift(fill_value=0)
print(CR_UK_NL_Weeklevel)
Renaming columns
CR_UK_NL_Weeklevel.columns.values[4] = 'CURRENT_WEEK'
CR_UK_NL_Weeklevel.columns.values[3] = 'LAST_YEAR_WEEK'
CR_UK_NL_Weeklevel.columns.values
Trying to implement solution:
CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID','CATEGORY_NAME'])['CURRENT_WEEK'].shift(fill_value=0)
print(CR_UK_NL_Weeklevel)
[78]:
CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID','CATEGORY_NAME'])['CURRENT_WEEK'].shift(fill_value=0)
print(CR_UK_NL_Weeklevel)
--Error
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
in
----> 1 CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID','CATEGORY_NAME'])['CURRENT_WEEK'].shift(fill_value=0)
2 print(CR_UK_NL_Weeklevel)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\base.py in getitem(self, key)
273 else:
274 if key not in self.obj:
--> 275 raise KeyError("Column not found: {key}".format(key=key))
276 return self._gotitem(key, ndim=1)
277
KeyError: 'Column not found: CURRENT_WEEK'

If there are always same categories per week and consecutive weeks use DataFrameGroupBy.shift grouping by Category column:
df['Sales_PREVIOUS'] = df.groupby('Category')['Sales'].shift(fill_value=0)
print (df)
Week Category Sales Sales_PREVIOUS
0 1 Red 100 0
1 1 White 200 0
2 2 Red 300 100
3 2 White 400 200
4 3 Red 100 300
5 3 White 200 400
6 4 Red 300 100
7 4 White 400 200
8 5 Red 100 300
9 5 White 200 400
Another idea with pivoting is use DataFrame.pivot, then DataFrame.shift with DataFrame.stack for Series and last add new column by DataFrame.join:
s = df.pivot('Week','Category','Sales').shift(fill_value=0).stack()
df = df.join(s.rename('Sales_PREVIOUS WEEK'), on=['Week','Category'])
EDIT:
With new data add column id:
df['Sales_PREVIOUS'] = df.groupby(['id','Category'])['Sales'].shift(fill_value=0)
And for second solution:
s = df.set_index(['Week','id','Category'])['Sales'].unstack([1,2]).shift(fill_value=0).unstack()
df = df.join(s.rename('Sales_PREVIOUS WEEK'), on=['id','Category','Week'])
print (df)
Week Category id Sales Sales_others Sales_PREVIOUS WEEK
0 1 Red 1 100 10 0
1 1 White 1 200 20 0
2 2 Red 1 300 30 100
3 2 White 1 400 40 200
4 3 Red 1 100 10 300
5 3 White 1 200 20 400
6 4 Red 1 300 30 100
7 4 White 1 400 40 200
8 5 Red 1 100 10 300
9 5 White 1 200 20 400
10 1 Red 2 100 10 0
11 1 White 2 200 20 0
12 2 Red 2 300 30 100
13 2 White 2 400 40 200
14 3 Red 2 100 10 300
15 3 White 2 200 20 400
16 4 Red 2 300 30 100
17 4 White 2 400 40 200
18 5 Red 2 100 10 300
19 5 White 2 200 20 400
EDIT:
Problem is with columns names, use:
cols = CR_UK_NL_Weeklevel.columns.tolist()
cols[4] = 'CURRENT_WEEK'
cols[3] = 'LAST_YEAR_WEEK'
CR_UK_NL_Weeklevel.columns = cols

Related

pandas dynamic wide to long based on time

I have pandas dataframe that contains data given below
ID Q1_rev Q1_transcnt Q2_rev Q2_transcnt Q3_rev Q3_transcnt Q4_rev Q4_transcnt
1 100 2 200 4 300 6 400 8
2 101 3 201 5 301 7 401 9
dataframe looks like below
I would like to do the below
a) For each ID, create 3 rows (from 8 input columns data)
b) Each row should contain the two columns data
c) subsequent rows should shift the columns by 1 (one quarter data).
To understand better, I expect my output to be like as below
I tried the below based on the SO post here but unable to get the expected output
s = 3
n = 2
cols = ['1st_rev','1st_transcnt','2nd_rev','2nd_transcnt']
output = pd.concat((df.iloc[:,0+i*s:6+i*s].set_axis(cols, axis=1) for i in range(int((df.shape[1]-(s*n))/n))), ignore_index=True, axis=0).set_index(np.tile(df.index,2))
Can help me with this? The problem is in real time, n=2 will not be the case. It could be 4 or 5 as well. Meaning, Instead of '1st_rev','1st_transcnt','2nd_rev','2nd_transcnt', I may have the below. You can see there are 4 pairs of columns.
'1st_rev','1st_transcnt','2nd_rev','2nd_transcnt','3rd_rev','3rd_transcnt','4th_rev','4th_transcnt'
Use custom function with DataFrame.groupby by splitted columns names by _ and selected second splitted substring by x.split('_')[1]:
N = 2
df1 = df.set_index('ID')
def f(x,n=N):
out = np.array([[list(L[x:x+n]) for x in range(len(L)-n+1)] for L in x.to_numpy()])
return pd.DataFrame(np.vstack(out))
df2 = (df1.groupby(lambda x: x.split('_')[1], axis=1, sort=False)
.apply(f)
.sort_index(axis=1, level=1, sort_remaining=False))
df2.index = np.repeat(df1.index, int(len(df2.index) / len(df1.index)))
df2.columns = df2.columns.map(lambda x: f'{x[1] + 1}_{x[0]}')
print (df2)
1_rev 1_transcnt 2_rev 2_transcnt
ID
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
Test with 3 window:
N = 3
df1 = df.set_index('ID')
def f(x,n=N):
out = np.array([[list(L[x:x+n]) for x in range(len(L)-n+1)] for L in x.to_numpy()])
return pd.DataFrame(np.vstack(out))
df2 = (df1.groupby(lambda x: x.split('_')[1], axis=1, sort=False)
.apply(f)
.sort_index(axis=1, level=1, sort_remaining=False))
df2.index = np.repeat(df1.index, int(len(df2.index) / len(df1.index)))
df2.columns = df2.columns.map(lambda x: f'{x[1] + 1}_{x[0]}')
print (df2)
1_rev 1_transcnt 2_rev 2_transcnt 3_rev 3_transcnt
ID
1 100 2 200 4 300 6
1 200 4 300 6 400 8
2 101 3 201 5 301 7
2 201 5 301 7 401 9
One option is with a for loop or list comprehension, followed by a concatenation, and a sort:
temp = df.set_index('ID')
cols = ['1st_rev','1st_transcnt','2nd_rev','2nd_transcnt']
outcome = [temp
.iloc(axis=1)[n:n+4]
.set_axis(cols, axis = 1)
for n in range(0, len(cols)+2, 2)]
pd.concat(outcome).sort_index()
1st_rev 1st_transcnt 2nd_rev 2nd_transcnt
ID
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
To make it more generic, a while loop can be used (you can use a for loop - a while loop seems more readable/easier to understand):
def reshape_N(df, N):
# you can pass your custom column names here instead
# as long as it matches the width
# of the dataframe
columns = ['rev', 'transcnt']
columns = np.tile(columns, N)
numbers = np.arange(1, N+1).repeat(2)
columns = [f"{n}_{ent}"
for n, ent
in zip(numbers, columns)]
contents = []
start = 0
end = N * 2
temp = df.set_index("ID")
while (end < temp.columns.size):
end += start
frame = temp.iloc(axis=1)[start:end]
frame.columns = columns
contents.append(frame)
start += 2
if not contents:
return df
return pd.concat(contents).sort_index()
let's apply the function:
reshape_N(df, 2)
1_rev 1_transcnt 2_rev 2_transcnt
ID
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
reshape_N(df, 3)
1_rev 1_transcnt 2_rev 2_transcnt 3_rev 3_transcnt
ID
1 100 2 200 4 300 6
1 200 4 300 6 400 8
2 101 3 201 5 301 7
2 201 5 301 7 401 9

Calculating average in array under different conditions using pandas [duplicate]

I have a DataFrame
>>> df = pd.DataFrame({'a':[1,1,1,2,2,2],
... 'b':[10,20,20,10,20,20],
... 'result':[100,200,300,400,500,600]})
...
>>> df
a b result
0 1 10 100
1 1 20 200
2 1 20 300
3 2 10 400
4 2 20 500
5 2 20 600
and want to create a new column that is the average result for the corresponding values for 'a' and 'b'. I can get those values with a groupby:
>>> df.groupby(['a','b'])['result'].mean()
a b
1 10 100
20 250
2 10 400
20 550
Name: result, dtype: int64
but can not figure out how to turn that into a new column in the original DataFrame. The final result should look like this,
>>> df
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
I could do this by looping through the combinations of 'a' and 'b' but that would get really slow and unwieldy for larger sets of data. There is probably a much simpler and faster way to go.
You need transform:
df['avg_result'] = df.groupby(['a', 'b'])['result'].transform('mean')
This generates a correctly indexed column of the groupby values for you:
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
Since the previous answer(https://stackoverflow.com/a/33445035/6504287) is pandas based, I'm adding the pyspark based solution as in below:
So it is better to go with the Window function as in the below code snippet example:
windowSpecAgg = Window.partitionBy('a', 'b')
ext_data_df.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()
The above code is with respect to the example took in the previously provided solution(https://stackoverflow.com/a/33445035/6504287).

How to calculate the mean of rows with rows having same content from Columns A to C in Excel using python? [duplicate]

I have a DataFrame
>>> df = pd.DataFrame({'a':[1,1,1,2,2,2],
... 'b':[10,20,20,10,20,20],
... 'result':[100,200,300,400,500,600]})
...
>>> df
a b result
0 1 10 100
1 1 20 200
2 1 20 300
3 2 10 400
4 2 20 500
5 2 20 600
and want to create a new column that is the average result for the corresponding values for 'a' and 'b'. I can get those values with a groupby:
>>> df.groupby(['a','b'])['result'].mean()
a b
1 10 100
20 250
2 10 400
20 550
Name: result, dtype: int64
but can not figure out how to turn that into a new column in the original DataFrame. The final result should look like this,
>>> df
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
I could do this by looping through the combinations of 'a' and 'b' but that would get really slow and unwieldy for larger sets of data. There is probably a much simpler and faster way to go.
You need transform:
df['avg_result'] = df.groupby(['a', 'b'])['result'].transform('mean')
This generates a correctly indexed column of the groupby values for you:
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
Since the previous answer(https://stackoverflow.com/a/33445035/6504287) is pandas based, I'm adding the pyspark based solution as in below:
So it is better to go with the Window function as in the below code snippet example:
windowSpecAgg = Window.partitionBy('a', 'b')
ext_data_df.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()
The above code is with respect to the example took in the previously provided solution(https://stackoverflow.com/a/33445035/6504287).

Is there any efficient way to filter out cluster data simultaneously in a large pandas dataframe?

I have large pandas dataframe which look like this:
DF:
ID setID Weight
PG_002456788.1 1 100
UG_004678935.1 2 110
UG_012975895.1 2 150
PG_023788904.1 3 200
UR_073542247.1 3 200
UR_099876678.2 3 264
PR_066120875.1 4 400
PR_098759678.1 4 600
UR_096677888.2 4 750
PG_012667994.1 5 800
PG_077555239.1 5 800
I would like to filter out rows based on criteria:
Criteria to choose representative per setID is in below order of priority
Preference 1 ID starting with PG_
Preference 2 ID starting with UG_
Preference 3 ID starting with PR_
Preference 4 ID starting with UR_
Along with this next priority is to choose the highest Weight simultaneously for each setID cluster.
'Desired output:'
ID setID weight
PG_002456788.1 1 100
UG_012975895.1 2 150
PG_023788904.1 3 200
PR_098759678.1 4 600
PG_012667994.1 5 800
Also, I would like to print rows with same ID 'Initials' as well as weight separatly if there is any.
For example,
ID setID weight
PG_012667994.1 5 800
PG_077555239.1 5 800
IIUC you can define a pd.Categorical dummy column with the initial substring in ID, and use it and the Weight to order the dataframe. Then groupby setID, take the first:
df['ID_init'] = pd.Categorical(df.ID.str.split('_',1).str[0],
categories=['PG','UG','PR','UR'],
ordered=True)
(df.sort_values(by=['ID_init','Weight'], ascending=[True, False])
.groupby('setID')
.head(1)
.sort_values('setID')
.drop('ID_init',1))
ID setID Weight
0 PG_002456788.1 1 100
2 UG_012975895.1 2 150
3 PG_023788904.1 3 200
7 PR_098759678.1 4 600
9 PG_012667994.1 5 800
For the first part: create a new column called code from the ID. Then, sort the data frame by the code and weight, group by setID and take first entry.
df['code'] = df['ID'].str[:2].replace({'PG': 1, 'UG': 2, 'PR': 3, 'UR': 4})
df2 = df.sort_values(['code', 'Weight'], ascending=[True, False]).groupby('setID').first()
df2 = df2.reset_index().drop('code', axis=1)
Output
setID ID Weight
0 1 PG_002456788.1 100
1 2 UG_012975895.1 150
2 3 PG_023788904.1 200
3 4 PR_098759678.1 600
4 5 PG_012667994.1 800
The second part:
df3 = df.join(df.groupby(['setID', 'code']).count()['ID'],
on=['setID', 'code'], rsuffix='_Count')
df3[ df3['ID_Count'] > 1].drop(['code', 'ID_Count'], axis=1)
Output:
ID setID Weight
1 UG_004678935.1 2 110
2 UG_012975895.1 2 150
4 UR_073542247.1 3 200
5 UR_099876678.2 3 264
6 PR_066120875.1 4 400
7 PR_098759678.1 4 600
9 PG_012667994.1 5 800
10 PG_077555239.1 5 800

Pandas new column from groupby averages

I have a DataFrame
>>> df = pd.DataFrame({'a':[1,1,1,2,2,2],
... 'b':[10,20,20,10,20,20],
... 'result':[100,200,300,400,500,600]})
...
>>> df
a b result
0 1 10 100
1 1 20 200
2 1 20 300
3 2 10 400
4 2 20 500
5 2 20 600
and want to create a new column that is the average result for the corresponding values for 'a' and 'b'. I can get those values with a groupby:
>>> df.groupby(['a','b'])['result'].mean()
a b
1 10 100
20 250
2 10 400
20 550
Name: result, dtype: int64
but can not figure out how to turn that into a new column in the original DataFrame. The final result should look like this,
>>> df
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
I could do this by looping through the combinations of 'a' and 'b' but that would get really slow and unwieldy for larger sets of data. There is probably a much simpler and faster way to go.
You need transform:
df['avg_result'] = df.groupby(['a', 'b'])['result'].transform('mean')
This generates a correctly indexed column of the groupby values for you:
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
Since the previous answer(https://stackoverflow.com/a/33445035/6504287) is pandas based, I'm adding the pyspark based solution as in below:
So it is better to go with the Window function as in the below code snippet example:
windowSpecAgg = Window.partitionBy('a', 'b')
ext_data_df.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()
The above code is with respect to the example took in the previously provided solution(https://stackoverflow.com/a/33445035/6504287).

Categories

Resources