Consider the below pandas DataFrame:
df = pd.DataFrame({0:['a',1,2,3,'a',1,2,3],1:[10,20,30,40,50,60,70,80],2:[100,200,300,400,500,600,700,800]})
0 1 2
0 a 10 100
1 1 20 200
2 2 30 300
3 3 40 400
4 a 50 500
5 1 60 600
6 2 70 700
7 3 80 800
I want to reshape the DataFrame so that my desired output looks like:
1 2 3 4
a 10 100 50 500
1 20 200 60 600
2 30 300 70 700
3 40 400 80 800
Basically, I have a repetitive and finite set of values in df[0], but the corresponding values in the other columns are unique at each repetition. I want to unstack the table in such a way that I get the desired output. A NumPy solution is also acceptable.
You can do something like this: group rows by the 0th column and then convert the groups into Series.
df.groupby(0)[1].apply(list).apply(pd.Series)
# 0 1
#0
#1 20 60
#2 30 70
#3 40 80
#a 10 50
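Note that this reshapes column 1 only. For the full desired output covering both value columns, a minimal sketch (assuming the three-column df from the question) using a per-key cumcount plus unstack:
import pandas as pd

df = pd.DataFrame({0: ['a', 1, 2, 3, 'a', 1, 2, 3],
                   1: [10, 20, 30, 40, 50, 60, 70, 80],
                   2: [100, 200, 300, 400, 500, 600, 700, 800]})

# Number each repetition of a key, then push that counter into the columns.
out = df.set_index([0, df.groupby(0).cumcount()]).unstack()
# Order columns repetition-first so each repetition's (1, 2) pair stays together.
out = out.sort_index(axis=1, level=1)
out.columns = range(1, len(out.columns) + 1)
out = out.loc[df[0].unique()]  # restore first-appearance row order: a, 1, 2, 3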
Use groupby and then convert values to columns:
df.groupby(by=[0])[1].apply(lambda x: pd.Series(x.tolist())).unstack()
Out[37]:
0 1
0
1 20 60
2 30 70
3 40 80
a 10 50
Here's one solution, using a dictionary to store your repetitive values and the corresponding columns, and converting it back to a DataFrame. Keep in mind that plain dicts only preserve insertion order on Python 3.7+, so on older versions you would need collections.OrderedDict if you want to keep the order of your repetitive values. (Because the loop slices everything after column 0, the same code also handles the original three-column frame.)
df = pd.DataFrame({0:['a',1,2,3,'a',1,2,3],1:[10,20,30,40,50,60,70,80]})

unstacked = {}
for index, row in df.iterrows():
    if row.iloc[0] not in unstacked:
        unstacked[row.iloc[0]] = list(row.iloc[1:])
    else:
        unstacked[row.iloc[0]] += list(row.iloc[1:])

unstacked_df = pd.DataFrame.from_dict(unstacked, orient='index')
print(unstacked_df)
0 1
a 10 50
1 20 60
2 30 70
3 40 80
Related
I have a DataFrame
>>> df = pd.DataFrame({'a':[1,1,1,2,2,2],
... 'b':[10,20,20,10,20,20],
... 'result':[100,200,300,400,500,600]})
...
>>> df
a b result
0 1 10 100
1 1 20 200
2 1 20 300
3 2 10 400
4 2 20 500
5 2 20 600
and want to create a new column that is the average result for the corresponding values for 'a' and 'b'. I can get those values with a groupby:
>>> df.groupby(['a','b'])['result'].mean()
a b
1 10 100
20 250
2 10 400
20 550
Name: result, dtype: int64
but cannot figure out how to turn that into a new column in the original DataFrame. The final result should look like this:
>>> df
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
I could do this by looping through the combinations of 'a' and 'b' but that would get really slow and unwieldy for larger sets of data. There is probably a much simpler and faster way to go.
You need transform:
df['avg_result'] = df.groupby(['a', 'b'])['result'].transform('mean')
This generates a correctly indexed column of the groupby values for you:
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
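For contrast, a sketch of the merge-based alternative that transform replaces (same df as above); transform is shorter because its result is already aligned to the original index:
# Compute one mean per (a, b) group, then broadcast it back onto every row.
means = df.groupby(['a', 'b'])['result'].mean().rename('avg_result').reset_index()
df = df.merge(means, on=['a', 'b'], how='left')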
Since the previous answer (https://stackoverflow.com/a/33445035/6504287) is pandas-based, I'm adding a PySpark-based solution. It is better to go with a Window function here, as in the snippet below:
from pyspark.sql import Window
from pyspark.sql.functions import avg

windowSpecAgg = Window.partitionBy('a', 'b')
ext_data_df.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()
The above code uses the same example as the previously linked solution (https://stackoverflow.com/a/33445035/6504287).
I have two dataframes. The first one looks like:
Age  200  300  400
1     34   32   50
2     42   73   20
The second dataframe looks like:
ID  Age  Score
10    2    200
23    1    300
My goal is to create another column in the second dataframe that fetches the value from the first dataframe by the corresponding values of both the Age and Score columns. The Scores are the column headers in the first dataframe.
The resulting dataframe:
ID  Age  Score  Count
10    2    200     42
23    1    300     32
Try with melt and merge. melt turns df1's column headers into a Score column (the int cast makes those labels comparable with df2['Score']), and df2.merge(tomerge) then joins on the shared Age and Score columns by default:
tomerge = df1.melt('Age',var_name = 'Score',value_name='Count')
tomerge['Score'] = tomerge['Score'].astype(int)
out = df2.merge(tomerge)
out
Out[988]:
ID Age Score Count
0 10 2 200 42
1 23 1 300 32
You can build a pd.MultiIndex.from_arrays from df2's Age and Score columns and map it against the Series produced by setting 'Age' as df1's index and stacking:
# Look up each (Age, Score) pair from df2 in df1 reshaped to a long Series.
df2['Count'] = pd.Series(
    pd.MultiIndex.from_arrays([df2.Age, df2.Score]).map(df1.set_index('Age').stack())
)
Intermediate outputs:
df1.set_index('Age').stack()
Age
1 200 34
300 32
400 50
2 200 42
300 73
400 20
dtype: int64
pd.MultiIndex.from_arrays([df2.Age, df2.Score])
MultiIndex([(2, 200),
(1, 300)],
names=['Age', 'Score'])
print(df2):
ID Age Score Count
0 10 2 200 42
1 23 1 300 32
Imagine you have a dataframe df as follows:
Id Side Volume
2 a 40
2 b 30
1 a 20
2 b 10
You want the following output
Id Side sum
1 a 20
1 all 20
2 a 40
2 b 40
2 all 80
all a 60
all b 40
all all 100
Which would be df.groupby(['Id','Side']).Volume.sum().reset_index() AND the sums across all Sides and all Ids (df.Volume.sum(), df[df.Side == 'a'].Volume.sum(), df[df.Side == 'b'].Volume.sum(), etc.)?
Is there a way to do this without calculating it outside and then merging both results?
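One option is pivot_table with margins=True, which produces every 'all' subtotal in a single call. A minimal sketch, assuming the df above (the sums may come back as floats because of the missing (1, b) combination):
import pandas as pd

df = pd.DataFrame({'Id': [2, 2, 1, 2],
                   'Side': ['a', 'b', 'a', 'b'],
                   'Volume': [40, 30, 20, 10]})

# margins=True appends per-Id, per-Side and grand totals under the label 'all'.
out = (pd.pivot_table(df, index='Id', columns='Side', values='Volume',
                      aggfunc='sum', margins=True, margins_name='all')
         .stack()    # back to long form: one row per (Id, Side)
         .dropna()   # Id 1 never traded side b
         .reset_index(name='sum'))
print(out)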