Rearranging Pandas Dataframe - python

I have a DataFrame as follows:
d = {'name': ['a', 'a','a','b','b','b'],
'var': ['v1', 'v2', 'v3', 'v1', 'v2', 'v3'],
'Yr1': [11, 21, 31, 41, 51, 61],
'Yr2': [12, 22, 32, 42, 52, 62],
'Yr3': [13, 23, 33, 43, 53, 63]}
df = pd.DataFrame(d)
name var Yr1 Yr2 Yr3
a v1 11 12 13
a v2 21 22 23
a v3 31 32 33
b v1 41 42 43
b v2 51 52 53
b v3 61 62 63
and I want to rearrange it to look like this:
name Yr v1 v2 v3
a 1 11 21 31
a 2 12 22 32
a 3 13 23 33
b 1 41 51 61
b 2 42 52 62
b 3 43 53 63
I am new to pandas and tried using other threads I found here but struggled to make it work. Any help would be much appreciated.

Try this
import pandas as pd
d = {'name': ['a', 'a', 'a', 'b', 'b', 'b'],
'var': ['v1', 'v2', 'v3', 'v1', 'v2', 'v3'],
'Yr1': [11, 21, 31, 41, 51, 61],
'Yr2': [12, 22, 32, 42, 52, 62],
'Yr3': [13, 23, 33, 43, 53, 63]}
df = pd.DataFrame(d)
# Solution
df.set_index(['name', 'var'], inplace=True)
df = df.unstack().stack(0)
print(df.reset_index())
output:
var name level_1 v1 v2 v3
0 a Yr1 11 21 31
1 a Yr2 12 22 32
2 a Yr3 13 23 33
3 b Yr1 41 51 61
4 b Yr2 42 52 62
5 b Yr3 43 53 63
Reference: pandas.DataFrame.stack
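The output above still shows level_1 and the Yr1/Yr2/Yr3 labels rather than the requested Yr column with 1, 2, 3. A minimal sketch of one way to close that gap, assuming the year columns always carry the literal prefix "Yr" (the variable name out is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "a", "a", "b", "b", "b"],
    "var": ["v1", "v2", "v3", "v1", "v2", "v3"],
    "Yr1": [11, 21, 31, 41, 51, 61],
    "Yr2": [12, 22, 32, 42, 52, 62],
    "Yr3": [13, 23, 33, 43, 53, 63],
})

out = (
    df.set_index(["name", "var"])
      .rename_axis(columns="Yr")   # name the column axis so stack() yields a "Yr" level
      .stack()                     # long Series indexed by (name, var, Yr)
      .unstack("var")              # v1/v2/v3 become columns again
      .rename_axis(columns=None)   # drop the leftover "var" label on the columns
      .reset_index()
)

# Assumption: every year label starts with "Yr", so stripping it leaves an integer.
out["Yr"] = out["Yr"].str.replace("Yr", "", regex=False).astype(int)
print(out)
```

Naming the column axis before stacking avoids the auto-generated level_1 label, so no separate rename is needed afterwards.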

Try groupby apply:
df.groupby("name").apply(
lambda x: x.set_index("var").T.drop("name")
).reset_index().rename(columns={"level_1": "Yr"}).rename_axis(columns=None)
name Yr v1 v2 v3
0 a Yr1 11 21 31
1 a Yr2 12 22 32
2 a Yr3 13 23 33
3 b Yr1 41 51 61
4 b Yr2 42 52 62
5 b Yr3 43 53 63
Or better:
df.pivot(index="var", columns="name", values=["Yr1", "Yr2", "Yr3"]).T.sort_index(
    level=1
).reset_index().rename({"level_0": "Yr"}, axis=1).rename_axis(columns=None)
Yr name v1 v2 v3
0 Yr1 a 11 21 31
1 Yr2 a 12 22 32
2 Yr3 a 13 23 33
3 Yr1 b 41 51 61
4 Yr2 b 42 52 62
5 Yr3 b 43 53 63

We can use pd.wide_to_long + df.unstack here.
pd.wide_to_long doc:
With stubnames [‘A’, ‘B’], this function expects to find one or more groups of columns with format A-suffix1, A-suffix2,…, B-suffix1, B-suffix2,… You specify what you want to call this suffix in the resulting long format with j (for example j=’year’).
pd.wide_to_long(
df, stubnames="Yr", i=["name", "var"], j="Y"
).squeeze().unstack(level=1).reset_index()
var name Y v1 v2 v3
0 a 1 11 21 31
1 a 2 12 22 32
2 a 3 13 23 33
3 b 1 41 51 61
4 b 2 42 52 62
5 b 3 43 53 63
We can use df.melt + df.pivot here.
out = df.melt(id_vars=['name', 'var'], var_name='Yr')
out['Yr'] = out['Yr'].str.replace('Yr', '')
out.pivot(index=['name', 'Yr'], columns='var', values='value').reset_index()
var name Yr v1 v2 v3
0 a 1 11 21 31
1 a 2 12 22 32
2 a 3 13 23 33
3 b 1 41 51 61
4 b 2 42 52 62
5 b 3 43 53 63

Related

Saving result from for loop to different columns

I am trying to run a nested loop and save the output in four different columns. Let C1R1 be the value I want in the first column, first row, C2R2 the one I want in the second column, second row, etc. What I have come up with so far gives me a flat list where the output is saved like this:
['C1R1', 'C2R1', 'C3R1', 'C4R1']. This is the code I am using:
dfs1 = []
for i in range(24):
    row = data_json2['data']['Rows'][i]  # renamed from pd so it does not shadow pandas
    for j in range(4):
        value = row['Columns'][j]['Value']
        dfs1.append(value)  # every value lands in one flat list
What could be a good way to achieve this?
EDIT: This is what I want to achieve:
Column 1 Column 2 Column 3 Column 4
0 0 24 48 72
1 1 25 49 73
2 2 26 50 74
3 3 27 51 75
4 4 28 52 76
5 5 29 53 77
6 6 30 54 78
7 7 31 55 79
8 8 32 56 80
9 9 33 57 81
10 10 34 58 82
11 11 35 59 83
12 12 36 60 84
13 13 37 61 85
14 14 38 62 86
15 15 39 63 87
16 16 40 64 88
17 17 41 65 89
18 18 42 66 90
19 19 43 67 91
20 20 44 68 92
21 21 45 69 93
22 22 46 70 94
23 23 47 71 95
While this is what I got now:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
Thank you.
Try:
import pandas as pd
def get_dataframe(num_cols=4, num_values=24):
    # Row c holds one value per column v: v * num_values + c.
    return pd.DataFrame(
        ([v * num_values + c for v in range(num_cols)] for c in range(num_values)),
        columns=[f"Column {c}" for c in range(1, num_cols + 1)],
    )
df = get_dataframe()
print(df)
Prints:
Column 1 Column 2 Column 3 Column 4
0 0 24 48 72
1 1 25 49 73
2 2 26 50 74
3 3 27 51 75
4 4 28 52 76
5 5 29 53 77
6 6 30 54 78
7 7 31 55 79
8 8 32 56 80
9 9 33 57 81
10 10 34 58 82
11 11 35 59 83
12 12 36 60 84
13 13 37 61 85
14 14 38 62 86
15 15 39 63 87
16 16 40 64 88
17 17 41 65 89
18 18 42 66 90
19 19 43 67 91
20 20 44 68 92
21 21 45 69 93
22 22 46 70 94
23 23 47 71 95
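The answer above generates the numbers itself. To build the same table from the JSON in the question, one option is to collect one list per row instead of appending everything to a single flat list. This is only a sketch: the data_json2 literal below is a hypothetical stand-in that mimics the structure implied by the question's loop (Rows, then Columns, then Value), and the real payload may differ.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data_json2: 24 rows, each holding
# four {"Value": ...} cells. The real payload's shape is an assumption.
data_json2 = {
    "data": {
        "Rows": [
            {"Columns": [{"Value": col * 24 + row} for col in range(4)]}
            for row in range(24)
        ]
    }
}

# One inner list per JSON row, so each row's four values stay side by side.
records = [
    [cell["Value"] for cell in row["Columns"]]
    for row in data_json2["data"]["Rows"]
]

df = pd.DataFrame(records, columns=[f"Column {n}" for n in range(1, 5)])
print(df.head())
```

Iterating over the rows directly also removes the hard-coded 24 and 4, so the same code keeps working if the payload grows.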

Calculate time difference of 2 adjacent datapoints for each user

I have the following dataframe:
df = pd.DataFrame(
{'user_id': [53, 53, 53, 53, 53, 53, 53, 53, 54, 54, 54, 54, 54, 54, 54],
'timestamp': [10, 15, 20, 25, 30, 31, 34, 37, 14, 16, 18, 20, 22, 25, 28],
'activity': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
'D', 'D', 'D', 'D', 'D', 'D', 'D']}
)
df
user_id timestamp activity
0 53 10 A
1 53 15 A
2 53 20 A
3 53 25 A
4 53 30 A
5 53 31 A
6 53 34 A
7 53 37 A
8 54 14 D
9 54 16 D
10 54 18 D
11 54 20 D
12 54 22 D
13 54 25 D
14 54 28 D
I want to calculate the time difference between every two adjacent data points (rows) within each user_id and plot the CDF per activity. Assume each user starts a new activity at 0 seconds. The timestamp column holds Unix timestamps; I show only the last two digits for brevity.
Target df (required result):
user_id timestamp activity timestamp_diff
0 53 10 A 0
1 53 15 A 5
2 53 20 A 5
3 53 25 A 5
4 53 30 A 5
5 53 31 A 1
6 53 34 A 3
7 53 37 A 3
8 54 14 D 0
9 54 16 D 2
10 54 18 D 2
11 54 20 D 2
12 54 22 D 2
13 54 25 D 3
14 54 28 D 3
My attempts (to calculate the time differences):
df['shift1'] = df.groupby('user_id')['timestamp'].shift(1, fill_value=0)
df['shift2'] = df.groupby('user_id')['timestamp'].shift(-1, fill_value=0)
df['diff1'] = df.timestamp - df.shift1
df['diff2'] = df.shift2 - df.timestamp
df['shift3'] = df.groupby('user_id')['timestamp'].shift(-1)
df['shift3'].fillna(method='ffill', inplace=True)
df['diff3'] = df.shift3 - df.timestamp
df
user_id timestamp activity shift1 shift2 diff1 diff2 shift3 diff3
0 53 10 A 0 15 10 5 15.0 5.0
1 53 15 A 10 20 5 5 20.0 5.0
2 53 20 A 15 25 5 5 25.0 5.0
3 53 25 A 20 30 5 5 30.0 5.0
4 53 30 A 25 31 5 1 31.0 1.0
5 53 31 A 30 34 1 3 34.0 3.0
6 53 34 A 31 37 3 3 37.0 3.0
7 53 37 A 34 0 3 -37 37.0 0.0
8 54 14 D 0 16 14 2 16.0 2.0
9 54 16 D 14 18 2 2 18.0 2.0
10 54 18 D 16 20 2 2 20.0 2.0
11 54 20 D 18 22 2 2 22.0 2.0
12 54 22 D 20 25 2 3 25.0 3.0
13 54 25 D 22 28 3 3 28.0 3.0
14 54 28 D 25 0 3 -28 28.0 0.0
I cannot reach the target: none of the diff1, diff2, or diff3 columns matches timestamp_diff.
IIUC you are looking for diff:
df['timestamp_diff'] = df.groupby('user_id')['timestamp'].diff().fillna(0).astype(int)
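The question also asks for a CDF per activity. A rough sketch of that second step, computing an empirical CDF with NumPy after the diff; the helper name empirical_cdf and the cdfs dictionary are illustrative and not part of pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user_id": [53] * 8 + [54] * 7,
    "timestamp": [10, 15, 20, 25, 30, 31, 34, 37, 14, 16, 18, 20, 22, 25, 28],
    "activity": ["A"] * 8 + ["D"] * 7,
})

# Per-user difference between adjacent rows; the first row of each user gets 0.
df["timestamp_diff"] = (
    df.groupby("user_id")["timestamp"].diff().fillna(0).astype(int)
)

def empirical_cdf(values):
    """Return the sorted values and the fraction of observations <= each one."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# One (x, y) pair per activity, ready to be passed to a plotting library.
cdfs = {
    activity: empirical_cdf(group["timestamp_diff"])
    for activity, group in df.groupby("activity")
}
```

Each pair can then be drawn as a step curve, for example with matplotlib's plt.step(x, y, where="post"), one call per activity.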

How to concatenate rows side by side in pandas

I want to combine every five rows of a dataset into a single row.
I have 700 rows and want to combine each group of five consecutive rows.
A B C D E F G
1 10,11,12,13,14,15,16
2 17,18,19,20,21,22,23
3 24,25,26,27,28,29,30
4 31,32,33,34,35,36,37
5 38,39,40,41,42,43,44
.
.
.
.
.
700
After combining the first five rows, my first row should look like this:
A B C D E F G A B C D E F G A B C D E F G A B C D E F G A B C D E F G
1 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44
If you can guarantee that the total number of rows you have is a multiple of 5, dipping into numpy will be the most efficient way to solve this problem:
import numpy as np
import pandas as pd
data = np.arange(70).reshape(-1, 7)
df = pd.DataFrame(data, columns=[*'ABCDEFG'])
print(df)
A B C D E F G
0 0 1 2 3 4 5 6
1 7 8 9 10 11 12 13
2 14 15 16 17 18 19 20
3 21 22 23 24 25 26 27
4 28 29 30 31 32 33 34
5 35 36 37 38 39 40 41
6 42 43 44 45 46 47 48
7 49 50 51 52 53 54 55
8 56 57 58 59 60 61 62
9 63 64 65 66 67 68 69
out = pd.DataFrame(
df.to_numpy().reshape(-1, df.shape[1] * 5),
columns=[*df.columns] * 5
)
print(out)
A B C D E F G A B C D E F ... B C D E F G A B C D E F G
0 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 22 23 24 25 26 27 28 29 30 31 32 33 34
1 35 36 37 38 39 40 41 42 43 44 45 46 47 ... 57 58 59 60 61 62 63 64 65 66 67 68 69
[2 rows x 35 columns]
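If the row count is not guaranteed to be a multiple of 5, the same reshape can still be used after padding the tail. A minimal sketch, assuming numeric data (the values are cast to float so the padding can be NaN); combine_rows is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def combine_rows(df, n=5):
    """Lay every n consecutive rows side by side, padding a short tail with NaN."""
    values = df.to_numpy(dtype=float)   # float so NaN padding is possible
    pad = (-len(values)) % n            # rows missing from the last group
    if pad:
        filler = np.full((pad, values.shape[1]), np.nan)
        values = np.vstack([values, filler])
    return pd.DataFrame(
        values.reshape(-1, values.shape[1] * n),
        columns=[*df.columns] * n,
    )

# 12 rows: two full groups of five plus a tail of two rows.
df = pd.DataFrame(np.arange(84).reshape(-1, 7), columns=[*"ABCDEFG"])
out = combine_rows(df)
print(out.shape)
```

The cost of this approach is that integer columns become floats; if that matters, the incomplete last group could be handled separately instead.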
You can do:
cols = df.columns.tolist() * 5  # one copy of the column names per combined row
dfs = [df[i:min(i+5,len(df))].reset_index(drop=True) for i in range(0,len(df),5)]
df2 = pd.concat([pd.DataFrame(df.stack()).T for df in dfs])
df2.columns = cols
df2.reset_index(drop=True, inplace=True)
Another option: unstack moves the column labels into the row index, producing a Series. reset_index turns that Series into a DataFrame, set_index('level_0') makes the original column names the index, and transposing then puts them back as column labels. Note that the values come out grouped by column rather than in row order.
df.unstack().reset_index().set_index('level_0')[[0]].T
level_0 A A A A A B B B B B ... F F F F F G G G G G
0 10 17 24 31 38 11 18 25 32 39 ... 15 22 29 36 43 16 23 30 37 44
The easiest way is to convert your DataFrame to a NumPy array, reshape it, and then build a new DataFrame from the result.
Edit:
import numpy as np
import pandas as pd
data = df  # your DataFrame; its row count must be a multiple of 5
new_dataframe = pd.DataFrame(data.to_numpy().reshape(len(data) // 5, -1), columns=np.tile(data.columns, 5))
Stacking and unstacking data in pandas
Tabular data are often presented in more than one way. Long form ("tidy data") stacks the values into a single column, with one or more companion columns holding categorical labels that identify each value. In contrast, wide form ("unstacked data") gives each category its own column.
In your example, you present the wide form of data, and you're trying to get it into long form. The DataFrame methods melt, groupby, pivot, stack, unstack, and reset_index help convert between these forms.
Start with your original dataframe:
df = pd.DataFrame({
'A' : [10, 17, 24, 31, 38],
'B' : [11, 18, 25, 32, 39],
'C' : [12, 19, 26, 33, 40],
'D' : [13, 20, 27, 34, 41],
'E' : [14, 21, 28, 35, 42],
'F' : [15, 22, 29, 36, 43],
'G' : [16, 23, 30, 37, 44]})
A B C D E F G
0 10 11 12 13 14 15 16
1 17 18 19 20 21 22 23
2 24 25 26 27 28 29 30
3 31 32 33 34 35 36 37
4 38 39 40 41 42 43 44
Use melt to convert it to long form, then sort by value to get the order you requested. Passing ignore_index=False keeps the original row labels, which lets us get back to wide form later.
melted_df = df.melt(ignore_index=False).sort_values(by='value')
variable value
0 A 10
0 B 11
0 C 12
0 D 13
0 E 14
0 F 15
0 G 16
1 A 17
1 B 18
...
Use groupby, unstack, and reset_index to convert it back to wide form. Going back is often the harder direction: you group by the index and the stacked variable column (plus any other identifying columns), aggregate the values, then unstack and reset the index.
(melted_df
    .reset_index()                   # move the index values into a column called 'index'
    .groupby(['index', 'variable'])  # group by the original row label and the variable
    .value                           # select the value column in each group
    .mean()                          # one item per group, so the mean is just that item
    .unstack()                       # move the innermost index level ('variable') to columns
    .reset_index()                   # turn 'index' back into a regular column
    .set_index('index')              # and use it as the index again
)
A B C D E F G
0 10 11 12 13 14 15 16
1 17 18 19 20 21 22 23
2 24 25 26 27 28 29 30
3 31 32 33 34 35 36 37
4 38 39 40 41 42 43 44
This stuff can be quite difficult and requires trial and error. I would recommend making a smaller version of your problem and experimenting with it; that way you can see how each function works.
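When each (row label, variable) pair occurs exactly once, as it does here, the groupby and mean steps can be skipped. A small sketch of that shortcut using pivot on a cut-down three-column version of the frame; the names melted and restored are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [10, 17, 24],
    "B": [11, 18, 25],
    "C": [12, 19, 26],
})

# Long form; ignore_index=False keeps the original row labels.
melted = df.melt(ignore_index=False)

restored = (
    melted.reset_index()   # row labels move into an "index" column
          .pivot(index="index", columns="variable", values="value")
          .rename_axis(index=None, columns=None)   # drop the axis labels pivot adds
)
print(restored)
```

pivot raises an error on duplicate index/column pairs, which is the case where the groupby-and-aggregate route above is still needed.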
Try grouping with np.arange() and floor division to form groups of five rows, then build a new DataFrame from the groups. This should work even if the number of rows is not a multiple of 5.
l = 5
(df.groupby(np.arange(len(df.index))//l)
.apply(lambda x: pd.DataFrame([x.to_numpy().ravel()]))
.set_axis(df.columns.tolist() * l,axis=1)
.reset_index(drop=True))
or
(df.groupby(np.arange(len(df.index))//5)
.apply(lambda x: x.reset_index(drop=True).stack())
.unstack(level=[1,2])
.droplevel(0,axis=1))
Output:
A B C D E F G A B C ... E F G A B C D E F G
0 9 0 3 2 6 2 9 1 7 5 ... 2 5 9 5 4 9 7 3 8 9
1 9 5 0 8 1 5 8 7 7 7 ... 6 3 5 5 2 3 9 7 5 6

Pivot some rows to new columns in DataFrame

I'm after a pythonic and pandemic (from pandas, pun not intended =) way to pivot some rows in a dataframe into new columns.
My data has this format:
dof foo bar qux
idxA idxB
100 101 1 10 30 50
101 2 11 31 51
101 3 12 32 52
102 1 13 33 53
102 2 14 34 54
102 3 15 35 55
200 101 1 16 36 56
101 2 17 37 57
101 3 18 38 58
102 1 19 39 59
102 2 20 40 60
102 3 21 41 61
The variables foo, bar and qux actually have 3 dimensional coordinates, which I would like to call foo1, foo2, foo3, bar1, ..., qux3. These are identified by the column dof. Each row represents one axis in 3D, dof == 1 is the x axis, dof == 2 the y axis and dof == 3 is the z axis.
So, here is the final dataframe I want:
foo1 bar1 qux1 foo2 bar2 qux2 foo3 bar3 qux3
idxA idxB
100 101 10 30 50 11 31 51 12 32 52
102 13 33 53 14 34 54 15 35 55
200 101 16 36 56 17 37 57 18 38 58
102 19 39 59 20 40 60 21 41 61
Question is: what is the best way to do that?
Here is what I have done.
A code to re-create my dataset in a dataframe:
import pandas as pd
data = [[100, 101, 1, 10, 30, 50],
[100, 101, 2, 11, 31, 51],
[100, 101, 3, 12, 32, 52],
[100, 102, 1, 13, 33, 53],
[100, 102, 2, 14, 34, 54],
[100, 102, 3, 15, 35, 55],
[200, 101, 1, 16, 36, 56],
[200, 101, 2, 17, 37, 57],
[200, 101, 3, 18, 38, 58],
[200, 102, 1, 19, 39, 59],
[200, 102, 2, 20, 40, 60],
[200, 102, 3, 21, 41, 61],
]
df = pd.DataFrame(data=data, columns=['idxA', 'idxB', 'dof', 'foo', 'bar', 'qux'])
df.set_index(['idxA', 'idxB'], inplace=True)
Code to do what I would like to do:
df2 = df[df.dof == 1].reset_index()[['idxA', 'idxB']]
df2.set_index(['idxA', 'idxB'], inplace=True)
for pivot in [1, 2, 3]:
df2.loc[:, 'foo%d' % pivot] = df[df.dof == pivot]['foo']
df2.loc[:, 'bar%d' % pivot] = df[df.dof == pivot]['bar']
df2.loc[:, 'qux%d' % pivot] = df[df.dof == pivot]['qux']
However I'm not too happy with these .loc calls and incremental column additions to the dataframe. I thought that pandas being awesome as it is would have a neater way of doing that. A one-liner would be super cool.
You can try df.pivot
df = df.pivot(columns='dof')
foo bar qux
dof 1 2 3 1 2 3 1 2 3
idxA idxB
100 101 10 11 12 30 31 32 50 51 52
102 13 14 15 33 34 35 53 54 55
200 101 16 17 18 36 37 38 56 57 58
102 19 20 21 39 40 41 59 60 61
Now flatten the column MultiIndex by joining each pair of labels:
df.columns = df.columns.map('{0[0]}{0[1]}'.format) #suggested by #YOBEN_S
foo1 foo2 foo3 bar1 bar2 bar3 qux1 qux2 qux3
idxA idxB
100 101 10 11 12 30 31 32 50 51 52
102 13 14 15 33 34 35 53 54 55
200 101 16 17 18 36 37 38 56 57 58
102 19 20 21 39 40 41 59 60 61
You can add dof to index and do a unstack:
new_df = df.set_index('dof',append=True).unstack('dof')
new_df.columns = [f'{x}{y}' for x,y in new_df.columns]
Output:
foo1 foo2 foo3 bar1 bar2 bar3 qux1 qux2 qux3
idxA idxB
100 101 10 11 12 30 31 32 50 51 52
102 13 14 15 33 34 35 53 54 55
200 101 16 17 18 36 37 38 56 57 58
102 19 20 21 39 40 41 59 60 61
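Both answers return the columns grouped by variable (foo1 foo2 foo3 ...), while the question lists them grouped by dof (foo1 bar1 qux1 ...). If that order matters, the columns can be reordered before flattening. A sketch under the assumption that the dof values and variable names are known up front; order and wide are illustrative names:

```python
import pandas as pd

# Rebuild the question's frame: 2 x 2 index pairs, three dof rows each.
rows = [
    (idx_a, idx_b, dof)
    for idx_a in (100, 200)
    for idx_b in (101, 102)
    for dof in (1, 2, 3)
]
df = pd.DataFrame(rows, columns=["idxA", "idxB", "dof"])
df["foo"] = range(10, 22)
df["bar"] = range(30, 42)
df["qux"] = range(50, 62)
df = df.set_index(["idxA", "idxB"])

wide = df.set_index("dof", append=True).unstack("dof")

# Columns are (variable, dof) pairs grouped by variable; list them dof-first instead.
order = [(var, dof) for dof in (1, 2, 3) for var in ("foo", "bar", "qux")]
wide = wide.loc[:, order]

# Flatten the two column levels into names such as "foo1".
wide.columns = [f"{var}{dof}" for var, dof in wide.columns]
print(wide)
```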

Finding the maximum difference for a subset of columns with pandas

I have a dataframe:
A B C D E
0 a 34 55 43 aa
1 b 53 77 65 bb
2 c 23 100 34 cc
3 d 54 43 23 dd
4 e 23 67 54 ee
5 f 43 98 23 ff
I need to get the maximum difference among columns B, C and D for each row and store that value in column A. For example, in row 'a' the maximum difference between the columns is 55 - 34 = 21. The data is in a DataFrame.
The expected result is
A B C D E
0 21 34 55 43 aa
1 24 53 77 65 bb
2 77 23 100 34 cc
3 31 54 43 23 dd
4 44 23 67 54 ee
5 75 43 98 23 ff
Use np.ptp:
# df['A'] = np.ptp(df.loc[:, 'B':'D'], axis=1)
df['A'] = np.ptp(df[['B', 'C', 'D']], axis=1)
df
A B C D E
0 21 34 55 43 aa
1 24 53 77 65 bb
2 77 23 100 34 cc
3 31 54 43 23 dd
4 44 23 67 54 ee
5 75 43 98 23 ff
Or, find the max and min yourself:
df['A'] = df[['B', 'C', 'D']].max(1) - df[['B', 'C', 'D']].min(1)
df
A B C D E
0 21 34 55 43 aa
1 24 53 77 65 bb
2 77 23 100 34 cc
3 31 54 43 23 dd
4 44 23 67 54 ee
5 75 43 98 23 ff
If performance is important, you can do this in NumPy space:
v = df[['B', 'C', 'D']].values
df['A'] = v.max(1) - v.min(1)
df
A B C D E
0 21 34 55 43 aa
1 24 53 77 65 bb
2 77 23 100 34 cc
3 31 54 43 23 dd
4 44 23 67 54 ee
5 75 43 98 23 ff
