Pandas new column from groupby averages - python

I have a DataFrame
>>> df = pd.DataFrame({'a':[1,1,1,2,2,2],
... 'b':[10,20,20,10,20,20],
... 'result':[100,200,300,400,500,600]})
...
>>> df
   a   b  result
0  1  10     100
1  1  20     200
2  1  20     300
3  2  10     400
4  2  20     500
5  2  20     600
and want to create a new column that is the average result for the corresponding values for 'a' and 'b'. I can get those values with a groupby:
>>> df.groupby(['a','b'])['result'].mean()
a  b
1  10    100
   20    250
2  10    400
   20    550
Name: result, dtype: int64
but I cannot figure out how to turn that into a new column in the original DataFrame. The final result should look like this:
>>> df
   a   b  result  avg_result
0  1  10     100         100
1  1  20     200         250
2  1  20     300         250
3  2  10     400         400
4  2  20     500         550
5  2  20     600         550
I could do this by looping through the combinations of 'a' and 'b' but that would get really slow and unwieldy for larger sets of data. There is probably a much simpler and faster way to go.
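For reference, the kind of loop I mean would look roughly like this (shown only for contrast; it reuses the df defined above):
# loop over every unique (a, b) combination and write the group mean back;
# correct, but it scans the whole frame once per combination
for a_val, b_val in df[['a', 'b']].drop_duplicates().itertuples(index=False):
    mask = (df['a'] == a_val) & (df['b'] == b_val)
    df.loc[mask, 'avg_result'] = df.loc[mask, 'result'].mean()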

You need transform:
df['avg_result'] = df.groupby(['a', 'b'])['result'].transform('mean')
This generates a correctly indexed column of the groupby values for you:
   a   b  result  avg_result
0  1  10     100         100
1  1  20     200         250
2  1  20     300         250
3  2  10     400         400
4  2  20     500         550
5  2  20     600         550
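For comparison, the same column can also be built by aggregating first and merging the averages back. A minimal sketch, assuming the df from the question (before avg_result has been added):
# one row of averages per (a, b) pair ...
avg = (df.groupby(['a', 'b'], as_index=False)['result']
         .mean()
         .rename(columns={'result': 'avg_result'}))
# ... broadcast back onto the original rows
df = df.merge(avg, on=['a', 'b'], how='left')
transform is the shorter route, but the merge form can be convenient when several aggregations need to be attached at once.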

Since the previous answer (https://stackoverflow.com/a/33445035/6504287) is pandas-based, I'm adding a PySpark-based solution. In PySpark it is better to use a window function, as in the snippet below:
windowSpecAgg = Window.partitionBy('a', 'b')
# ext_data_df is a Spark DataFrame with the columns a, b and result
ext_data_df.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()
The code above uses the same example data as the previously provided solution (https://stackoverflow.com/a/33445035/6504287).
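To run that end to end, here is a self-contained sketch; the local SparkSession and the sdf name are assumptions for illustration, and the rows mirror the pandas example:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg

spark = SparkSession.builder.master('local[*]').getOrCreate()

# same rows as the pandas example above
sdf = spark.createDataFrame(
    [(1, 10, 100), (1, 20, 200), (1, 20, 300),
     (2, 10, 400), (2, 20, 500), (2, 20, 600)],
    ['a', 'b', 'result'])

windowSpecAgg = Window.partitionBy('a', 'b')
sdf.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()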

Related

Calculating average in array under different conditions using pandas [duplicate]


How to calculate the mean of rows with rows having same content from Columns A to C in Excel using python? [duplicate]


Create a column based on computation of another column

I would like to create another column based on the sales for the previous week. Here is the sample input:
df = pd.DataFrame({'Week':[1,1,2,2,3,3,4,4,5,5,1,1,2,2,3,3,4,4,5,5],
'Category':['Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White'],
'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
'Sales':[100,200,300,400,100,200,300,400,100,200,100,200,300,400,100,200,300,400,100,200],
'Sales_others':[10,20,30,40,10,20,30,40,10,20,10,20,30,40,10,20,30,40,10,20]})
print(df)
Based on this, I would like to create another column containing the sales of the previous week. Here is a sample of the desired output:
df_output = pd.DataFrame({'Week':[1,1,2,2,3,3,4,4,5,5,1,1,2,2,3,3,4,4,5,5],
'Category':['Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White'],
'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
'Sales':[100,200,300,400,100,200,300,400,100,200,100,200,300,400,100,200,300,400,100,200],
'Sales_others':[10,20,30,40,10,20,30,40,10,20,10,20,30,40,10,20,30,40,10,20],
'Sales_previous_week':[0,0,100,200,300,400,100,200,300,400,0,0,100,200,300,400,100,200,300,400]})
print(df_output)
I am finding it hard to express what amounts to a self-join. The previous week's value should be driven only by the Sales column, and I should be able to retain the "Sales_others" column.
--Edit
Adding my original code. I renamed the columns first:
CR_UK_NL_Weeklevel.columns.values[4] = 'CURRENT_WEEK'
CR_UK_NL_Weeklevel.columns.values[3] = 'LAST_YEAR_WEEK'
CR_UK_NL_Weeklevel.columns.values
and then tried to implement the suggested solution:
CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID','CATEGORY_NAME'])['CURRENT_WEEK'].shift(fill_value=0)
print(CR_UK_NL_Weeklevel)
--Error
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-78> in <module>
----> 1 CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID','CATEGORY_NAME'])['CURRENT_WEEK'].shift(fill_value=0)
      2 print(CR_UK_NL_Weeklevel)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\base.py in __getitem__(self, key)
    273         else:
    274             if key not in self.obj:
--> 275                 raise KeyError("Column not found: {key}".format(key=key))
    276             return self._gotitem(key, ndim=1)
    277
KeyError: 'Column not found: CURRENT_WEEK'
If the same categories always appear in each week and the weeks are consecutive, use DataFrameGroupBy.shift, grouping by the Category column:
df['Sales_PREVIOUS'] = df.groupby('Category')['Sales'].shift(fill_value=0)
print (df)
Week Category Sales Sales_PREVIOUS
0 1 Red 100 0
1 1 White 200 0
2 2 Red 300 100
3 2 White 400 200
4 3 Red 100 300
5 3 White 200 400
6 4 Red 300 100
7 4 White 400 200
8 5 Red 100 300
9 5 White 200 400
Another idea with pivoting: use DataFrame.pivot, then DataFrame.shift, then DataFrame.stack to get a Series, and finally add the new column with DataFrame.join:
s = df.pivot(index='Week', columns='Category', values='Sales').shift(fill_value=0).stack()
df = df.join(s.rename('Sales_PREVIOUS WEEK'), on=['Week','Category'])
EDIT:
With the new data, add the id column to the grouping:
df['Sales_PREVIOUS'] = df.groupby(['id','Category'])['Sales'].shift(fill_value=0)
And for the second solution:
s = df.set_index(['Week','id','Category'])['Sales'].unstack([1,2]).shift(fill_value=0).unstack()
df = df.join(s.rename('Sales_PREVIOUS WEEK'), on=['id','Category','Week'])
print (df)
Week Category id Sales Sales_others Sales_PREVIOUS WEEK
0 1 Red 1 100 10 0
1 1 White 1 200 20 0
2 2 Red 1 300 30 100
3 2 White 1 400 40 200
4 3 Red 1 100 10 300
5 3 White 1 200 20 400
6 4 Red 1 300 30 100
7 4 White 1 400 40 200
8 5 Red 1 100 10 300
9 5 White 1 200 20 400
10 1 Red 2 100 10 0
11 1 White 2 200 20 0
12 2 Red 2 300 30 100
13 2 White 2 400 40 200
14 3 Red 2 100 10 300
15 3 White 2 200 20 400
16 4 Red 2 300 30 100
17 4 White 2 400 40 200
18 5 Red 2 100 10 300
19 5 White 2 200 20 400
EDIT:
The problem is with the way the columns are renamed: writing into CR_UK_NL_Weeklevel.columns.values does not properly update the Index, so the 'CURRENT_WEEK' lookup fails. Rebuild and reassign the column list instead:
cols = CR_UK_NL_Weeklevel.columns.tolist()
cols[4] = 'CURRENT_WEEK'
cols[3] = 'LAST_YEAR_WEEK'
CR_UK_NL_Weeklevel.columns = cols
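If you prefer not to rebuild the whole list, DataFrame.rename should achieve the same thing; a sketch that reuses the column positions from the question rather than hard-coded old names:
# map the current labels at positions 3 and 4 to the new names
CR_UK_NL_Weeklevel = CR_UK_NL_Weeklevel.rename(columns={
    CR_UK_NL_Weeklevel.columns[3]: 'LAST_YEAR_WEEK',
    CR_UK_NL_Weeklevel.columns[4]: 'CURRENT_WEEK',
})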

pandas reshape dataframe based on repetitive values in a column

Consider the pandas DataFrame below:
df = pd.DataFrame({0:['a',1,2,3,'a',1,2,3],1:[10,20,30,40,50,60,70,80],2:[100,200,300,400,500,600,700,800]})
0 1 2
0 a 10 100
1 1 20 200
2 2 30 300
3 3 40 400
4 a 50 500
5 1 60 600
6 2 70 700
7 3 80 800
I want to reshape the DataFrame so that the desired output looks like this:
1 2 3 4
a 10 100 50 500
1 20 200 60 600
2 30 300 70 700
3 40 400 80 800
Basically, I have a repetitive and finite set of values in df[0], but the corresponding values in the other columns are unique at each repetition. I want to unstack the table in such a way that I get the desired output. A NumPy solution is also acceptable.
You can do something like this: group rows by the 0th column and then convert the groups into Series.
df.groupby(0)[1].apply(list).apply(pd.Series)
# 0 1
#0
#1 20 60
#2 30 70
#3 40 80
#a 10 50
Use groupby and then convert values to columns:
df.groupby(by=[0])[1].apply(lambda x: pd.Series(x.tolist())).unstack()
Out[37]:
0 1
0
1 20 60
2 30 70
3 40 80
a 10 50
Here's one solution, using a dictionary to store your repetitive values and the corresponding columns, and then converting it back to a DataFrame. Keep in mind that dicts do not preserve insertion order on Python versions before 3.7, so if you need to keep the order of your repetitive values there, you would have to tweak this a bit.
df = pd.DataFrame({0:['a',1,2,3,'a',1,2,3],1:[10,20,30,40,50,60,70,80]})
unstacked = {}
for index, row in df.iterrows():
    if row.iloc[0] not in unstacked:
        unstacked[row.iloc[0]] = list(row.iloc[1:])
    else:
        unstacked[row.iloc[0]] += list(row.iloc[1:])
unstacked_df = pd.DataFrame.from_dict(unstacked, orient='index')
print(unstacked_df)
0 1
a 10 50
1 20 60
2 30 70
3 40 80
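Since the question says a NumPy solution is also acceptable, here is a sketch that reproduces the exact four-column layout, under the assumption that the keys in column 0 repeat in the same order in every block:
import numpy as np
import pandas as pd

df = pd.DataFrame({0: ['a', 1, 2, 3, 'a', 1, 2, 3],
                   1: [10, 20, 30, 40, 50, 60, 70, 80],
                   2: [100, 200, 300, 400, 500, 600, 700, 800]})

keys = df[0].unique()            # ['a', 1, 2, 3] in original order
n_keys = len(keys)
n_reps = len(df) // n_keys       # number of repeated blocks (2 here)

# shape (n_reps, n_keys, 2): columns 1 and 2 for each repetition
vals = df[[1, 2]].to_numpy().reshape(n_reps, n_keys, 2)

# lay the repetitions out side by side: 10, 100, 50, 500 for row 'a', etc.
wide = np.concatenate(list(vals), axis=1)

out = pd.DataFrame(wide, index=keys, columns=range(1, wide.shape[1] + 1))
print(out)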

Pandas framework: determining the count of a column data

I have a TSV file with data as shown below:
UserID ItemID
100 1
200 1
300 2
400 3
500 2
600 4
700 4
800 5
...
...
N X
I am new to the pandas framework and want to know how I can get the count of each ItemID across all users for the above dataset. For example, if ItemID 1 appears only two times in the TSV file, I need to get the count 2, and so on. An example would be very helpful to get me going. I appreciate your help in advance!
As mentioned by @EdChum, value_counts can be used on the "ItemID" column. It returns a Series whose index is the ItemID values and whose values are their counts.
counter = df["ItemID"].value_counts()  # df is your DataFrame
print(counter[1])  # prints how many times ItemID 1 occurred
Here are 2 methods:
In [14]:
import io
import pandas as pd

# set up the data; note I have put UserID 100 in 3 times
temp = """UserID ItemID
100 1
100 1
100 2
400 3
500 2
600 4
700 4
800 5"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')
# count the occurrences of each user
df.groupby('UserID').count()
Out[14]:
ItemID
UserID
100 3
400 1
500 1
600 1
700 1
800 1
In [15]:
# count each ItemID unique values
df['ItemID'].value_counts()
Out[15]:
4 2
2 2
1 2
5 1
3 1
dtype: int64
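Putting it together for the original TSV file, a minimal sketch (the file name items.tsv is just a placeholder for your actual path):
import pandas as pd

# placeholder path; replace with your real TSV file
df = pd.read_csv('items.tsv', sep='\t')

# count of rows per ItemID, e.g. ItemID 1 -> 2 for the sample data
counts = df['ItemID'].value_counts()
print(counts)

# or attach the count to every row, mirroring the transform idea used earlier on this page
df['item_count'] = df.groupby('ItemID')['UserID'].transform('count')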
