I'm after a pythonic and pandemic (from pandas, pun not intended =) way to pivot some rows in a dataframe into new columns.
My data has this format:
           dof  foo  bar  qux
idxA idxB
100  101     1   10   30   50
     101     2   11   31   51
     101     3   12   32   52
     102     1   13   33   53
     102     2   14   34   54
     102     3   15   35   55
200  101     1   16   36   56
     101     2   17   37   57
     101     3   18   38   58
     102     1   19   39   59
     102     2   20   40   60
     102     3   21   41   61
The variables foo, bar and qux are actually 3-dimensional quantities, whose components I would like to call foo1, foo2, foo3, bar1, ..., qux3. The component is identified by the column dof: each row represents one axis in 3D, dof == 1 is the x axis, dof == 2 the y axis and dof == 3 the z axis.
So, here is the final dataframe I want:
           foo1  bar1  qux1  foo2  bar2  qux2  foo3  bar3  qux3
idxA idxB
100  101     10    30    50    11    31    51    12    32    52
     102     13    33    53    14    34    54    15    35    55
200  101     16    36    56    17    37    57    18    38    58
     102     19    39    59    20    40    60    21    41    61
Question is: what is the best way to do that?
Here is what I have done.
Code to re-create my dataset in a dataframe:
import pandas as pd
data = [[100, 101, 1, 10, 30, 50],
        [100, 101, 2, 11, 31, 51],
        [100, 101, 3, 12, 32, 52],
        [100, 102, 1, 13, 33, 53],
        [100, 102, 2, 14, 34, 54],
        [100, 102, 3, 15, 35, 55],
        [200, 101, 1, 16, 36, 56],
        [200, 101, 2, 17, 37, 57],
        [200, 101, 3, 18, 38, 58],
        [200, 102, 1, 19, 39, 59],
        [200, 102, 2, 20, 40, 60],
        [200, 102, 3, 21, 41, 61],
        ]
df = pd.DataFrame(data=data, columns=['idxA', 'idxB', 'dof', 'foo', 'bar', 'qux'])
df.set_index(['idxA', 'idxB'], inplace=True)
Code to do what I would like to do:
df2 = df[df.dof == 1].reset_index()[['idxA', 'idxB']]
df2.set_index(['idxA', 'idxB'], inplace=True)
for pivot in [1, 2, 3]:
    df2.loc[:, 'foo%d' % pivot] = df[df.dof == pivot]['foo']
    df2.loc[:, 'bar%d' % pivot] = df[df.dof == pivot]['bar']
    df2.loc[:, 'qux%d' % pivot] = df[df.dof == pivot]['qux']
However, I'm not too happy with these .loc calls and the incremental column additions. I thought that pandas, being as awesome as it is, would have a neater way of doing this. A one-liner would be super cool.
You can try df.pivot
df = df.pivot(columns='dof')
          foo         bar         qux
dof         1   2   3   1   2   3   1   2   3
idxA idxB
100  101   10  11  12  30  31  32  50  51  52
     102   13  14  15  33  34  35  53  54  55
200  101   16  17  18  36  37  38  56  57  58
     102   19  20  21  39  40  41  59  60  61
Now flatten the column MultiIndex by joining its two levels:
df.columns = df.columns.map('{0[0]}{0[1]}'.format)  # suggested by @YOBEN_S
           foo1  foo2  foo3  bar1  bar2  bar3  qux1  qux2  qux3
idxA idxB
100  101     10    11    12    30    31    32    50    51    52
     102     13    14    15    33    34    35    53    54    55
200  101     16    17    18    36    37    38    56    57    58
     102     19    20    21    39    40    41    59    60    61
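Note that this flattens in variable-major order (foo1 foo2 foo3 ...), while the question asks for dof-major order (foo1 bar1 qux1 ...). If that ordering matters, a plain column reindex afterwards should do it; a minimal sketch:
# reorder to foo1, bar1, qux1, foo2, ... as in the question's target layout
df = df[[f'{name}{dof}' for dof in (1, 2, 3) for name in ('foo', 'bar', 'qux')]]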
You can add dof to the index and then unstack it:
new_df = df.set_index('dof', append=True).unstack('dof')
new_df.columns = [f'{x}{y}' for x,y in new_df.columns]
Output:
           foo1  foo2  foo3  bar1  bar2  bar3  qux1  qux2  qux3
idxA idxB
100  101     10    11    12    30    31    32    50    51    52
     102     13    14    15    33    34    35    53    54    55
200  101     16    17    18    36    37    38    56    57    58
     102     19    20    21    39    40    41    59    60    61
Related
I am trying to run a nested loop in which I want the output to be saved in four different columns. Let C1R1 be the value I want in the first column first row, C2R2 the one I want in the second column second row, etc. What I have come up with so far gives me a flat list where the output is saved like this:
['C1R1', 'C2R1', 'C3R1', 'C4R1']. This is the code I am using:
dfs1 = []
for i in range(24):
    pd = data_json2['data']['Rows'][i]  # careful: this name shadows the common pandas alias 'pd'
    for j in range(4):
        pd1 = pd['Columns'][j]['Value']
        dfs1.append(pd1)
What could be a good way to achieve this?
EDIT: This is what I want to achieve:
    Column 1  Column 2  Column 3  Column 4
0          0        24        48        72
1          1        25        49        73
2          2        26        50        74
3          3        27        51        75
4          4        28        52        76
5          5        29        53        77
6          6        30        54        78
7          7        31        55        79
8          8        32        56        80
9          9        33        57        81
10        10        34        58        82
11        11        35        59        83
12        12        36        60        84
13        13        37        61        85
14        14        38        62        86
15        15        39        63        87
16        16        40        64        88
17        17        41        65        89
18        18        42        66        90
19        19        43        67        91
20        20        44        68        92
21        21        45        69        93
22        22        46        70        94
23        23        47        71        95
While this is what I got now:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
Thank you.
Try:
import pandas as pd

def get_dataframe(num_cols=4, num_values=24):
    # row c holds the values c, num_values + c, 2*num_values + c, ...
    return pd.DataFrame(
        ([v * num_values + c for v in range(num_cols)] for c in range(num_values)),
        columns=[f"Column {c}" for c in range(1, num_cols + 1)],
    )

df = get_dataframe()
print(df)
Prints:
    Column 1  Column 2  Column 3  Column 4
0          0        24        48        72
1          1        25        49        73
2          2        26        50        74
3          3        27        51        75
4          4        28        52        76
5          5        29        53        77
6          6        30        54        78
7          7        31        55        79
8          8        32        56        80
9          9        33        57        81
10        10        34        58        82
11        11        35        59        83
12        12        36        60        84
13        13        37        61        85
14        14        38        62        86
15        15        39        63        87
16        16        40        64        88
17        17        41        65        89
18        18        42        66        90
19        19        43        67        91
20        20        44        68        92
21        21        45        69        93
22        22        46        70        94
23        23        47        71        95
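Alternatively, if you'd rather keep your original nested loop, you can reshape the flat list after the fact. A sketch, assuming dfs1 ends up holding the 96 values in the row-major order your loop produces (the stand-in data below is hypothetical):
import numpy as np
import pandas as pd

dfs1 = list(range(96))  # stand-in for the values collected by the loop

# the loop appends the 4 column values of each row in turn, so the flat
# list is row-major; reshape it into 24 rows x 4 columns
df = pd.DataFrame(
    np.array(dfs1).reshape(24, 4),
    columns=[f"Column {c}" for c in range(1, 5)],
)
print(df)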
I have a DataFrame as follows:
d = {'name': ['a', 'a', 'a', 'b', 'b', 'b'],
     'var': ['v1', 'v2', 'v3', 'v1', 'v2', 'v3'],
     'Yr1': [11, 21, 31, 41, 51, 61],
     'Yr2': [12, 22, 32, 42, 52, 62],
     'Yr3': [13, 23, 33, 43, 53, 63]}
df = pd.DataFrame(d)
name  var  Yr1  Yr2  Yr3
a      v1   11   12   13
a      v2   21   22   23
a      v3   31   32   33
b      v1   41   42   43
b      v2   51   52   53
b      v3   61   62   63
and I want to rearrange it to look like this:
name  Yr  v1  v2  v3
a      1  11  21  31
a      2  12  22  32
a      3  13  23  33
b      1  41  51  61
b      2  42  52  62
b      3  43  53  63
I am new to pandas and tried using other threads I found here but struggled to make it work. Any help would be much appreciated.
Try this:
import pandas as pd
d = {'name': ['a', 'a', 'a', 'b', 'b', 'b'],
     'var': ['v1', 'v2', 'v3', 'v1', 'v2', 'v3'],
     'Yr1': [11, 21, 31, 41, 51, 61],
     'Yr2': [12, 22, 32, 42, 52, 62],
     'Yr3': [13, 23, 33, 43, 53, 63]}
df = pd.DataFrame(d)
# Solution
df.set_index(['name', 'var'], inplace=True)
df = df.unstack().stack(0)
print(df.reset_index())
Output:
var  name level_1  v1  v2  v3
0       a     Yr1  11  21  31
1       a     Yr2  12  22  32
2       a     Yr3  13  23  33
3       b     Yr1  41  51  61
4       b     Yr2  42  52  62
5       b     Yr3  43  53  63
Reference: pandas.DataFrame.stack
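To match the asked-for layout exactly (an integer Yr column instead of level_1 holding Yr1, Yr2, ...), a small post-processing sketch (Series.str.removeprefix needs pandas >= 1.4):
# rename the unnamed index level, then strip the 'Yr' prefix and cast
out = df.reset_index().rename(columns={'level_1': 'Yr'})
out['Yr'] = out['Yr'].str.removeprefix('Yr').astype(int)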
Try groupby apply:
df.groupby("name").apply(
lambda x: x.set_index("var").T.drop("name")
).reset_index().rename(columns={"level_1": "Yr"}).rename_axis(columns=None)
  name   Yr  v1  v2  v3
0    a  Yr1  11  21  31
1    a  Yr2  12  22  32
2    a  Yr3  13  23  33
3    b  Yr1  41  51  61
4    b  Yr2  42  52  62
5    b  Yr3  43  53  63
Or better (note that df.pivot takes keyword-only arguments as of pandas 2.0):
df.pivot(index="var", columns="name", values=["Yr1", "Yr2", "Yr3"]).T.sort_index(
    level=1
).reset_index().rename({"level_0": "Yr"}, axis=1).rename_axis(columns=None)
    Yr name  v1  v2  v3
0  Yr1    a  11  21  31
1  Yr2    a  12  22  32
2  Yr3    a  13  23  33
3  Yr1    b  41  51  61
4  Yr2    b  42  52  62
5  Yr3    b  43  53  63
We can use pd.wide_to_long + df.unstack here.
pd.wide_to_long doc:
With stubnames ['A', 'B'], this function expects to find one or more groups of columns with format A-suffix1, A-suffix2, ..., B-suffix1, B-suffix2, ... You specify what you want to call this suffix in the resulting long format with j (for example j='year').
pd.wide_to_long(
    df, stubnames="Yr", i=["name", "var"], j="Y"
).squeeze().unstack(level=1).reset_index()
var name  Y  v1  v2  v3
0      a  1  11  21  31
1      a  2  12  22  32
2      a  3  13  23  33
3      b  1  41  51  61
4      b  2  42  52  62
5      b  3  43  53  63
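The leftover var in the header is just the name of the columns axis from the unstack; rename_axis clears it if it bothers you. A one-line sketch, assuming the chain above was assigned to out:
out = out.rename_axis(columns=None)  # drop the 'var' label on the columns axis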
We can use df.melt + df.pivot here.
out = df.melt(id_vars=['name', 'var'], var_name='Yr')
out['Yr'] = out['Yr'].str.replace('Yr', '')
out.pivot(index=['name', 'Yr'], columns='var', values='value').reset_index()
var name Yr  v1  v2  v3
0      a  1  11  21  31
1      a  2  12  22  32
2      a  3  13  23  33
3      b  1  41  51  61
4      b  2  42  52  62
5      b  3  43  53  63
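One caveat: after the str.replace, the Yr column holds strings ('1', '2', '3'), not integers. If you need it numeric, cast it before the pivot, e.g.:
out['Yr'] = out['Yr'].astype(int)  # so Yr comes out as an integer column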
I have a dataframe, dataframe_1, that looks like this:
    0   1   2   3   4   5  ...  192
0  12  35  60  78  23  90  ...   32
And another dataframe, dataframe_2, that looks like this:
   58  59   60  61  62  ...  350
0   1   4  192   4   4  ...    1
1   0   3    3   5   3  ...    4
2   3   1    4   2   2  ...  192
The values in dataframe_2 are the column names from dataframe_1. What I'd like to do is replace each value in dataframe_2 with the value dataframe_1 holds under the column of that name, like so:
   58  59  60  61  62  ...  350
0  35  23  32  23  23  ...   35
1  12  78  78  90  78  ...   23
2  78  35  23  60  60  ...   32
I tried a for loop using .loc, but it did not work.
Any help is greatly appreciated!
Using replace:
df2.replace(dict(zip(df1.columns, df1.iloc[0])))
stack and map
# if necessary, cast,
# df1.columns = df1.columns.astype(int)
df2.stack().map(df1.iloc[0]).unstack()
   58  59  60  61  62  350
0  35  23  32  23  23   35
1  12  78  78  90  78   23
2  78  35  23  60  60   32
Stack df2 so we can call Series.map to perform a single vectorised replacement using df1.
apply and map
df2.apply(pd.Series.map, args=(df1.iloc[0],))
   58  59  60  61  62  350
0  35  23  32  23  23   35
1  12  78  78  90  78   23
2  78  35  23  60  60   32
Instead of stacking to get a Series, we apply a map operation across each column.
You could define a dictionary from df1 and use it to replace the values in df2:
d = dict(zip(df1.columns, df1.values.ravel()))
df2.replace(d)
   58  59  60  61  62  350
0  35  23  32  23  23   35
1  12  78  78  90  78   23
2  78  35  23  60  60   32
Or stacking df1 and then replacing:
df2.replace(df1.stack().droplevel(0))
   58  59  60  61  62  350
0  35  23  32  23  23   35
1  12  78  78  90  78   23
2  78  35  23  60  60   32
Create a lookup table and map the values using the underlying numpy array. This assumes integer column names.
import numpy as np

u = np.zeros(df1.columns.max() + 1, dtype=int)
u[df1.columns] = df1.iloc[0].values
u[df2.values]
array([[35, 23, 32, 23, 23, 35],
       [12, 78, 78, 90, 78, 23],
       [78, 35, 23, 60, 60, 32]])
If there are values that might not match a value in df1:
u = np.full(df1.columns.max()+1, np.nan)
u[df1.columns] = df1.iloc[0].values
u[df2.values]
And then fillna with df2 if desired.
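For example, a sketch that wraps the lookup result back into a DataFrame so the unmatched NaNs fall back to the original values:
out = pd.DataFrame(u[df2.values], index=df2.index, columns=df2.columns)
out = out.fillna(df2)  # keep the original value wherever no mapping existed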
df2.applymap(lambda x: df1.loc[0, x])
(Note that applymap is deprecated since pandas 2.1; df2.map(lambda x: df1.loc[0, x]) is the modern spelling.)
Let's say I have an ndarray with 100 elements, and I want to select the first 4 elements, skip the next 6, and continue like that (in other words, select the first 4 elements out of every 10).
I tried Python slicing with a step, but I don't think it works in my case. How can I do that? I'm using pandas and numpy; can they help? I searched around but found nothing like this kind of slicing. Thanks!
You could use NumPy slicing to solve your case.
For a 1D array case -
A.reshape(-1,10)[:,:4].reshape(-1)
This can be extended to a 2D array case with the selection to be made along the first axis -
A.reshape(-1,10,A.shape[1])[:,:4].reshape(-1,A.shape[1])
You could reshape the array to a 10x10, then use slicing to pick the first 4 elements of each row. Then flatten the reshaped, sliced array:
In [46]: print a
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99]
In [47]: print a.reshape((10,-1))[:,:4].flatten()
[ 0 1 2 3 10 11 12 13 20 21 22 23 30 31 32 33 40 41 42 43 50 51 52 53 60
61 62 63 70 71 72 73 80 81 82 83 90 91 92 93]
Use % 10:
print([i for i in range(100) if i % 10 in (0, 1, 2, 3)])
[0, 1, 2, 3, 10, 11, 12, 13, 20, 21, 22, 23, 30, 31, 32, 33, 40, 41, 42, 43, 50, 51, 52, 53, 60, 61, 62, 63, 70, 71, 72, 73, 80, 81, 82, 83, 90, 91, 92, 93]
Or index with a boolean mask over the element positions:
shorter_arr = arr[np.arange(len(arr)) % 10 < 4]
In the example in the OP, the length of the input array is divisible by m+n. If it's not, you could use the function take_n_skip_m below. It expands on @Divakar's answer by padding the input array to make it reshapeable into a proper 2D matrix; then slice, flatten and slice again to get the desired outcome:
import numpy as np

def take_n_skip_m(arr, n=4, m=6):
    # in case len(arr) is not divisible by (n+m), get the remainder
    remainder = len(arr) % (n+m)
    # pad arr with (n+m-remainder) zeros at the back
    pad_size = (0, n+m-remainder)
    # pad arr; reshape to a 2D array; take the first n of each row; flatten 2D -> 1D
    sliced_arr = np.pad(arr, pad_size).reshape(-1, n+m)[:, :n].flatten()
    # remove any remaining padding (depends on whether remainder >= n or not)
    return sliced_arr if remainder >= n else sliced_arr[:remainder-n]
Examples:
>>> out = take_n_skip_m(np.arange(20), n=5, m=4)
>>> print(out)
[ 0 1 2 3 4 9 10 11 12 13 18 19]
>>> out = take_n_skip_m(np.arange(20), n=5, m=6)
>>> print(out)
[ 0 1 2 3 4 11 12 13 14 15]
I have an XY problem. My setup is as follows - I have a dataframe with multi-index of 2 levels. I want to split it to two dataframes, taking only a fraction of rows from each label in the first level. For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 7, 7, 10, 10, 10, 10, 10, 10, 10],
                   'b': np.random.randint(0, 100, 13),
                   'c': np.random.randint(0, 100, 13)}).set_index(['a', 'b'])
df
Out[13]:
        c
a   b
1   86  83
    1   37
    57  64
    53   5
7   4   66
    13  49
10  61   0
    32  84
    97  59
    69  98
    25  52
    17  31
    37  95
So, let's say the fraction is 0.5; I want to split it into these two dataframes:
        c
a   b
1   86  83
    1   37
7   4   66
10  61   0
    32  84
    97  59
    69  98

        c
a   b
1   57  64
    53   5
7   13  49
10  25  52
    17  31
    37  95
I thought about doing (df.groupby(level=0).count() * 0.5).astype(int) to get the limit at which to "slice" the dataframe. Then, if only I had a way to add a level-aware running index such as this:
        c   r
a   b
1   38  36  0
    6   47  1
    57   6  2
    55  45  3
7   7   51  0
    90  96  1
10  59  75  0
    27  16  1
    58   7  2
    79  51  3
    58  77  4
    63  48  5
    87  60  6
I could join the limits and this df and filter with a boolean condition. Any suggestions on either problem? (splitting a fraction of rows or adding a level-aware running index)
This turns out to be pretty trivial with groupby:
In [36]: df.groupby(level=0).apply(lambda x:x.head(int(x.shape[0] * 0.5))).reset_index(level=0, drop=True)
Out[36]:
        c
a   b
1   86  83
    1   37
7   4   66
10  61   0
    32  84
    97  59
Also getting the running index per group:
In [33]: df.groupby(level=0).cumcount()
Out[33]:
a   b
1   38    0
    6     1
    57    2
    55    3
7   7     0
    90    1
10  59    0
    27    1
    58    2
    79    3
    58    4
    63    5
    87    6
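Putting the two together, here is a sketch that uses the level-aware running index plus the per-group sizes to build a boolean mask, so both halves fall out of a single split:
# running index within each level-0 group
r = df.groupby(level=0).cumcount()
# size of each level-0 group, broadcast back onto the rows
n = df.groupby(level=0)['c'].transform('size')
# a row goes in the first half if its running index is below frac * group size
mask = r < (n * 0.5).astype(int)
df_first, df_second = df[mask], df[~mask]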