Pandas: Reshaping Long Data to Wide with duplicated columns - python

I need to pivot a long pandas DataFrame to wide format. The issue is that for some ids there are multiple values for the same parameter, and some parameters are present in only a few ids.
df = pd.DataFrame({'indx':[11,11,11,11,12,12,12,13,13,13,13],'param':['a','b','b','c','a','b','d','a','b','c','c'],'value':[100,54,65,65,789,24,98,24,27,75,35]})
indx param value
11 a 100
11 b 54
11 b 65
11 c 65
12 a 789
12 b 24
12 d 98
13 a 24
13 b 27
13 c 75
13 c 35
I want to get something like this:
indx a b c d
11 100 `54,65` 65 None
12 789 24 None 98
13 24 27 `75,35` None
or
indx a b b1 c c1 d
11 100 54 65 65 None None
12 789 24 None None None 98
13 24 27 None 75 35 None
So obviously a direct df.pivot() is not a solution.
Thanks for any help.

Option 1:
df.astype(str).groupby(['indx', 'param'])['value'].agg(','.join).unstack()
Output:
param a b c d
indx
11 100 54,65 65 NaN
12 789 24 NaN 98
13 24 27 75,35 NaN
Option 2:
df_out = df.set_index(['indx', 'param', df.groupby(['indx','param']).cumcount()])['value'].unstack([1,2])
df_out.columns = [f'{i}_{j}' if j != 0 else f'{i}' for i, j in df_out.columns]
df_out.reset_index()
Output:
indx a b b_1 c d c_1
0 11 100.0 54.0 65.0 65.0 NaN NaN
1 12 789.0 24.0 NaN NaN 98.0 NaN
2 13 24.0 27.0 NaN 75.0 NaN 35.0

OK, I found a solution: there is a method df.pivot_table for such cases, which allows different aggregation functions:
df.pivot_table(index='indx', columns='param', values='value', aggfunc=lambda x: ','.join(x.astype(str)))
indx a b c d
11 100 54,65 65 NaN
12 789 24 NaN 98
13 24 27 75,35 NaN
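If you also want the second desired layout (numbered duplicate columns like b/b1) via pivot_table, one option is to number the repeats with cumcount first and pivot on both levels. This is only a sketch along the lines of Option 2, not part of the accepted answer:

```python
import pandas as pd

df = pd.DataFrame({'indx': [11, 11, 11, 11, 12, 12, 12, 13, 13, 13, 13],
                   'param': ['a', 'b', 'b', 'c', 'a', 'b', 'd', 'a', 'b', 'c', 'c'],
                   'value': [100, 54, 65, 65, 789, 24, 98, 24, 27, 75, 35]})

# number repeated (indx, param) pairs so each occurrence gets its own column
wide = (df.assign(n=df.groupby(['indx', 'param']).cumcount())
          .pivot_table(index='indx', columns=['param', 'n'], values='value'))
# flatten the (param, n) MultiIndex: 'b'/'b1' instead of ('b', 0)/('b', 1)
wide.columns = [p if n == 0 else f'{p}{n}' for p, n in wide.columns]
print(wide)
```

Each (indx, param, n) triple is unique, so the default mean aggregation just passes the single value through.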

Related

Pandas custom groupby fill

I have this dataset:
menu alternative id varA varB varC
1 NaN A NaN NaN NaN
1 NaN A NaN NaN NaN
2 94 A 8.089481019 7.07639559 0.90627215
2 89 A 7.52310322 19.49894193 14.4562262
3 79 A 24.79634962 18.91163612 23.85341972
3 95 A 21.10990397 17.00630516 1.09875582
4 47 A 5.681766806 4.136047755 17.38880496
4 62 A 10.39459876 0.997853805 0.045331687
5 58 A 11.91790497 5.696799013 27.21424163
5 23 A 11.71107828 2.165751058 11.56534045
6 57 A 1.068603487 27.95362014 1.334049372
6 100 A 26.31848796 6.757305213 4.742282633
7 NaN A NaN NaN NaN
7 NaN A NaN NaN NaN
8 24 A 29.99608877 28.49057834 0.14073638
8 7 A 8.749041949 14.17745528 9.604565417
9 64 A 29.4316969 19.57593592 9.174503643
9 60 A 13.53995541 1.898164567 16.49089291
10 85 A 20.1394155 0.995839592 16.18638727
10 22 A 22.68625486 14.26052953 17.79707308
1 82 B 21.34428845 9.901326583 1.053134597
1 91 B 19.04689216 16.29217346 29.56962312
2 35 B 25.44168095 29.00407645 2.246459981
2 100 B 15.79687903 20.37920541 28.45071525
3 44 B 7.359501131 23.66924419 7.198215907
3 41 B 22.65272801 8.66227065 12.05186217
4 59 B 26.67565422 9.608511948 26.45016581
4 53 B 5.64870847 21.83063691 19.20105218
5 48 B 5.591317152 25.17616679 24.30522374
5 16 B 23.85069753 23.12154586 0.781450997
6 NaN B NaN NaN NaN
6 NaN B NaN NaN NaN
7 68 B 9.334935288 16.39114327 21.17696541
7 41 B 5.841577934 6.901223007 28.38116983
8 35 B 21.20288984 9.665414964 4.472546438
8 96 B 0.451299457 27.66880932 26.2120144
9 84 B 19.67310555 1.993071082 9.08442779
9 65 B 0.475983889 16.72261394 17.17122898
10 40 B 9.553130945 17.88616649 22.17570401
10 40 B 19.70487161 5.898428653 11.25844279
1 19 C 20.47792809 9.344376127 7.855311112
1 59 C 14.59141273 8.090534362 19.6972446
2 19 C 6.624345353 0.192145343 26.31356322
2 67 C 24.483236 6.718856437 25.75609679
3 67 C 27.6408808 24.91014602 25.90758755
3 30 C 26.52738124 10.78363589 4.873602089
4 14 C 3.776964641 21.16561036 24.03153234
4 46 C 16.53719818 23.86634958 25.61504006
5 NaN C NaN NaN NaN
5 NaN C NaN NaN NaN
6 58 C 28.1357636 15.89359176 0.567406646
6 28 C 0.708229201 12.20641988 0.309303591
7 17 C 4.319991082 21.29233667 3.516184987
7 8 C 24.09490443 9.507000131 14.93472971
8 85 C 19.99606403 21.61509867 0.161222766
8 5 C 6.056082264 25.35186187 5.375641692
9 24 C 19.83904205 24.54037422 11.08571464
9 13 C 4.388769239 7.928106767 4.279531285
10 78 C 13.67598922 5.3140143 15.2710129
10 13 C 12.27642791 16.04610858 1.815260029
As you can see here, I have some null values which I need to fill. However, I need to fill them in a somewhat custom manner. For every id and for every menu I need to fill the null values based on random selection among the same menus (same menu number) in different ids which have non-null values.
Example: menu 1 in id A has null values. I want to randomly select a menu 1 in a different id which has non-null values and fill from it. Let it be id B and menu 1. For menu 7 in id A, let it be menu 7 in id C, and so on.
It is somewhat similar to this question, but in my case the filling should happen within the same "subgroups", so to speak.
The final output should be something like this:
menu alternative id varA varB varC
1 82 A 21.34428845 9.901326583 1.053134597
1 91 A 19.04689216 16.29217346 29.56962312
2 94 A 8.089481019 7.07639559 0.90627215
2 89 A 7.52310322 19.49894193 14.4562262
3 79 A 24.79634962 18.91163612 23.85341972
3 95 A 21.10990397 17.00630516 1.09875582
4 47 A 5.681766806 4.136047755 17.38880496
4 62 A 10.39459876 0.997853805 0.045331687
5 58 A 11.91790497 5.696799013 27.21424163
5 23 A 11.71107828 2.165751058 11.56534045
6 57 A 1.068603487 27.95362014 1.334049372
6 100 A 26.31848796 6.757305213 4.742282633
7 17 A 4.319991082 21.29233667 3.516184987
7 8 A 24.09490443 9.507000131 14.93472971
8 24 A 29.99608877 28.49057834 0.14073638
8 7 A 8.749041949 14.17745528 9.604565417
9 64 A 29.4316969 19.57593592 9.174503643
9 60 A 13.53995541 1.898164567 16.49089291
10 85 A 20.1394155 0.995839592 16.18638727
10 22 A 22.68625486 14.26052953 17.79707308
1 82 B 21.34428845 9.901326583 1.053134597
1 91 B 19.04689216 16.29217346 29.56962312
2 35 B 25.44168095 29.00407645 2.246459981
2 100 B 15.79687903 20.37920541 28.45071525
3 44 B 7.359501131 23.66924419 7.198215907
3 41 B 22.65272801 8.66227065 12.05186217
4 59 B 26.67565422 9.608511948 26.45016581
4 53 B 5.64870847 21.83063691 19.20105218
5 48 B 5.591317152 25.17616679 24.30522374
5 16 B 23.85069753 23.12154586 0.781450997
6 57 B 1.068603487 27.95362014 1.334049372
6 100 B 26.31848796 6.757305213 4.742282633
7 68 B 9.334935288 16.39114327 21.17696541
7 41 B 5.841577934 6.901223007 28.38116983
8 35 B 21.20288984 9.665414964 4.472546438
8 96 B 0.451299457 27.66880932 26.2120144
9 84 B 19.67310555 1.993071082 9.08442779
9 65 B 0.475983889 16.72261394 17.17122898
10 40 B 9.553130945 17.88616649 22.17570401
10 40 B 19.70487161 5.898428653 11.25844279
1 19 C 20.47792809 9.344376127 7.855311112
1 59 C 14.59141273 8.090534362 19.6972446
2 19 C 6.624345353 0.192145343 26.31356322
2 67 C 24.483236 6.718856437 25.75609679
3 67 C 27.6408808 24.91014602 25.90758755
3 30 C 26.52738124 10.78363589 4.873602089
4 14 C 3.776964641 21.16561036 24.03153234
4 46 C 16.53719818 23.86634958 25.61504006
5 48 C 5.591317152 25.17616679 24.30522374
5 16 C 23.85069753 23.12154586 0.781450997
6 58 C 28.1357636 15.89359176 0.567406646
6 28 C 0.708229201 12.20641988 0.309303591
7 17 C 4.319991082 21.29233667 3.516184987
7 8 C 24.09490443 9.507000131 14.93472971
8 85 C 19.99606403 21.61509867 0.161222766
8 5 C 6.056082264 25.35186187 5.375641692
9 24 C 19.83904205 24.54037422 11.08571464
9 13 C 4.388769239 7.928106767 4.279531285
10 78 C 13.67598922 5.3140143 15.2710129
10 13 C 12.27642791 16.04610858 1.815260029
Any guidance would be appreciated. Maybe there is even some groupby/apply logic which could assist with this.
You can run fillna() row-wise in apply(), then fill with a random sample from the dataframe filtered by your conditions:
df.apply(lambda row: row.fillna(
    df[(df['menu'] == row['menu']) & (df['id'] != row['id'])]
    .dropna().sample(n=1).iloc[0]), axis=1)
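A reproducible sketch of the same idea on a toy frame (the column names follow the question; the helper name, the tiny data, and the fixed random_state are mine, the latter only for repeatability):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'menu': [1, 1, 1, 1, 1, 1],
    'id':   ['A', 'A', 'B', 'B', 'C', 'C'],
    'varA': [np.nan, np.nan, 1.0, 2.0, 3.0, 4.0],
})

def fill_row(row):
    # rows with no gaps are returned unchanged
    if row.notna().all():
        return row
    # donor pool: same menu, different id, fully populated rows
    pool = df[(df['menu'] == row['menu']) & (df['id'] != row['id'])].dropna()
    # fillna aligns on column labels, so only the NaN cells are taken from the donor
    return row.fillna(pool.sample(n=1, random_state=0).iloc[0])

filled = df.apply(fill_row, axis=1)
```

Note that each NaN row samples its own donor, so the two rows of a given menu may be filled from different ids; the question's expected output takes both rows of a menu from the same donor id, which would need sampling per (id, menu) group instead of per row.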

pandas dataframe in def

I tried the code below to pass a df to a function.
The first line, with df.dropna, works fine.
However, df.replace has an issue: it does not do the replacement as I expected.
def Max(df):
    df.dropna(subset=df.columns[3:10], inplace=True)
    print(df)
    df.replace(to_replace=65535, value=-10, inplace=True)
    print(df)
    return df
Does anyone know the issue and how to solve it?
Your code works well. Maybe try this version without inplace modifications:
>>> df
A B C D E F G H I J
0 1 2 3 4 5.0 6 7 8 9 10.0
1 11 65535 13 14 15.0 16 17 18 19 20.0
2 21 22 23 24 25.0 26 27 28 29 NaN
3 65535 32 33 34 NaN 36 37 38 39 40.0
4 41 42 65535 44 45.0 46 47 48 49 50.0
5 51 52 53 54 55.0 56 57 58 59 60.0
def Max(df):
    return df.dropna(subset=df.columns[3:10]).replace(65535, -10)
>>> Max(df)
A B C D E F G H I J
0 1 2 3 4 5.0 6 7 8 9 10.0
1 11 -10 13 14 15.0 16 17 18 19 20.0
4 41 42 -10 44 45.0 46 47 48 49 50.0
5 51 52 53 54 55.0 56 57 58 59 60.0
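A quick check of the difference between the two versions: the non-inplace variant returns a cleaned copy and leaves the caller's frame untouched, so you must use the return value. A minimal sketch (the tiny two-row frame is an assumption; the logic mirrors Max above):

```python
import pandas as pd

def max_clean(df):
    # same logic as Max above, without inplace mutation
    return df.dropna(subset=df.columns[3:10]).replace(65535, -10)

df = pd.DataFrame({c: [1, 65535] for c in list('ABCDEFGHIJ')})
out = max_clean(df)

assert (out['B'] == pd.Series([1, -10])).all()   # returned copy is cleaned
assert (df['B'] == pd.Series([1, 65535])).all()  # original is left as-is
```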

Finding minimum value of a column between two entries in another column

I have two columns in a data frame containing more than 1000 rows. Column A can take values X,Y,None. Column B contains random numbers from 50 to 100.
Every time there is a non-None occurrence in Column A, it is considered occurrence 4; the previous non-None occurrence in Column A is then occurrence 3, the one before that occurrence 2, and the one before that occurrence 1. I want to find the minimum value of Column B between occurrence 4 and occurrence 3 and check whether it is greater than the minimum value of Column B between occurrence 2 and occurrence 1. The results can be stored in a new column in the data frame as "YES" or "NO".
SAMPLE INPUT
ROWNUM A B
1 None 68
2 None 83
3 X 51
4 None 66
5 None 90
6 Y 81
7 None 81
8 None 100
9 None 83
10 None 78
11 X 68
12 None 53
13 None 83
14 Y 68
15 None 94
16 None 50
17 None 71
18 None 71
19 None 52
20 None 67
21 None 82
22 X 76
23 None 66
24 None 92
For example, I need to find the minimum value of Column B between ROWNUM 14 and ROWNUM 11 and check if it is GREATER THAN the minimum value of Column B between ROWNUM 6 and ROWNUM 3. Next, I need to find the minimum value between ROWNUM 22 and ROWNUM 14 and check if it is GREATER THAN the minimum value between ROWNUM 11 and ROWNUM 6, and so on.
EDIT:
In the sample data, we start the calculation from row 14, since that is where we have the fourth non-None occurrence of column A. The minimum value between row 14 and row 11 is 53. The minimum value between row 6 and row 3 is 51. Since 53 > 51, the minimum value of column B between occurrence 4 and occurrence 3 is GREATER THAN the minimum value of column B between occurrence 2 and occurrence 1. So, the output at row 14 would be "YES" or 1.
Next, at row 22, the minimum value between row 22 and row 14 is 50. The minimum value between row 11 and row 6 is 68. Since 50 < 68, the minimum between occurrence 4 and occurrence 3 is NOT GREATER THAN the minimum between occurrence 2 and occurrence 1. So, the output at row 22 would be "NO" or 0.
I have the following code.
import numpy as np
import pandas as pd
df = pd.DataFrame([[0, 0]]*100, columns=list('AB'), index=range(1, 101))
df.loc[[3, 6, 11, 14, 22, 26, 38, 51, 64, 69, 78, 90, 98], 'A'] = 1
df['B'] = np.random.randint(50, 100, size=len(df))
df['result'] = df.index[df['A'] != 0].to_series().rolling(4).apply(
    lambda x: df.loc[x[2]:x[3], 'B'].min() > df.loc[x[0]:x[1], 'B'].min(), raw=True)
print(df)
This code works when column A has inputs [0, 1]. But I need code where column A could contain [None, X, Y]. Also, this code produces output as [0, 1]; I need output as [YES, NO] instead.
I read your sample data as follows:
df = pd.read_fwf('input.txt', widths=[7, 6, 3], na_values=['None'])
Note na_values=['None'], which ensures that the string None in the input is read as NaN.
This way the DataFrame is:
ROWNUM A B
0 1 NaN 68
1 2 NaN 83
2 3 X 51
3 4 NaN 66
4 5 NaN 90
5 6 Y 81
6 7 NaN 81
7 8 NaN 100
8 9 NaN 83
9 10 NaN 78
10 11 X 68
11 12 NaN 53
12 13 NaN 83
13 14 Y 69
14 15 NaN 94
15 16 NaN 50
16 17 NaN 71
17 18 NaN 71
18 19 NaN 52
19 20 NaN 67
20 21 NaN 82
21 22 X 76
22 23 NaN 66
23 24 NaN 92
The code to do your task is:
res = df.index[df.A.notnull()].to_series().rolling(4).apply(
    lambda x: df.loc[x[2]:x[3], 'B'].min() > df.loc[x[0]:x[1], 'B'].min(), raw=True)\
    .dropna().map(lambda x: 'YES' if x > 0 else 'NO').rename('Result')
df = df.join(res)
df.Result.fillna('', inplace=True)
As you can see, it is largely a slight modification of your code, with some additions.
The result is:
ROWNUM A B Result
0 1 NaN 68
1 2 NaN 83
2 3 X 51
3 4 NaN 66
4 5 NaN 90
5 6 Y 81
6 7 NaN 81
7 8 NaN 100
8 9 NaN 83
9 10 NaN 78
10 11 X 68
11 12 NaN 53
12 13 NaN 83
13 14 Y 69 YES
14 15 NaN 94
15 16 NaN 50
16 17 NaN 71
17 18 NaN 71
18 19 NaN 52
19 20 NaN 67
20 21 NaN 82
21 22 X 76 NO
22 23 NaN 66
23 24 NaN 92
The advantages of my solution over the other are that:
the content is either YES or NO, just as you want,
this content shows up only for non-null values in the A column,
"ignoring" the first 3, which don't have enough "predecessors".
Here's my approach:
def is_incr(x):
    return x[:2].min() > x[2:].min()

# replace with s = df['A'] == 'None' if needed
s = df['A'].isna()
df['new_col'] = df.loc[s, 'B'].rolling(4).apply(is_incr)
Output:
ROWNUM A B new_col
0 1 NaN 68 NaN
1 2 NaN 83 NaN
2 3 X 51 NaN
3 4 NaN 66 NaN
4 5 NaN 90 1.0
5 6 Y 81 NaN
6 7 NaN 81 0.0
7 8 NaN 100 0.0
8 9 NaN 83 0.0
9 10 NaN 78 1.0
10 11 X 68 NaN
11 12 NaN 53 1.0
12 13 NaN 83 1.0
13 14 Y 68 NaN
14 15 NaN 94 0.0
15 16 NaN 50 1.0
16 17 NaN 71 1.0
17 18 NaN 71 0.0
18 19 NaN 52 0.0
19 20 NaN 67 1.0
20 21 NaN 82 0.0
21 22 X 76 NaN
22 23 NaN 66 0.0
23 24 NaN 92 1.0

Get columns names from one data frame and add them as empty columns in another data frame in pandas

I have one data frame (df1) with 5 columns and another (df2) with 10 columns. I want to add the columns from df2 to df1, but only the column names (without values). Likewise, I want to add the columns without values from df1 to df2.
Here are the data frames:
df1
A B C D E
1 234 52 1 54
54 23 87 5 125
678 67 63 8 18
45 21 36 5 65
8 5 24 3 13
df2
F G H I J K L M N O
12 34 2 17 4 19 54 7 58 123
154 3 7 53 25 2 47 27 84 6
78 7 3 82 8 56 21 29 547 1
And I want to get this:
df1
A B C D E F G H I J K L M N O
1 234 52 1 54
54 23 87 5 125
678 67 63 8 18
45 21 36 5 65
8 5 24 3 13
And I want df2 to become this:
df2
A B C D E F G H I J K L M N O
12 34 2 17 4 19 54 7 58 123
154 3 7 53 25 2 47 27 84 6
78 7 3 82 8 56 21 29 547 1
I tried df.columns.values and got the array of column names, but then I have to apply them as data frame columns and give them empty values, and the way I am doing it now takes too many lines of code. I wonder whether there is an easier way to do this.
I will appreciate any help.
Use Index.union with DataFrame.reindex:
cols = df1.columns.union(df2.columns)
#if order is important
#cols = df1.columns.append(df2.columns)
df1 = df1.reindex(columns=cols)
df2 = df2.reindex(columns=cols)
print (df1)
A B C D E F G H I J K L M N O
0 1 234 52 1 54 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 54 23 87 5 125 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 678 67 63 8 18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 45 21 36 5 65 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 8 5 24 3 13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
print (df2)
A B C D E F G H I J K L M N O
0 NaN NaN NaN NaN NaN 12 34 2 17 4 19 54 7 58 123
1 NaN NaN NaN NaN NaN 154 3 7 53 25 2 47 27 84 6
2 NaN NaN NaN NaN NaN 78 7 3 82 8 56 21 29 547 1
If the same index values are possible in both DataFrames, use DataFrame.align:
print (df1)
A B C D E
0 1 234 52 1 54
1 54 23 87 5 125
2 678 67 63 8 18
df1, df2 = df1.align(df2)
print (df1)
A B C D E F G H I J K L M N O
0 1 234 52 1 54 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 54 23 87 5 125 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 678 67 63 8 18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
print (df2)
A B C D E F G H I J K L M N O
0 NaN NaN NaN NaN NaN 12 34 2 17 4 19 54 7 58 123
1 NaN NaN NaN NaN NaN 154 3 7 53 25 2 47 27 84 6
2 NaN NaN NaN NaN NaN 78 7 3 82 8 56 21 29 547 1
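One nuance worth noting: align with no arguments also unions the row indexes. If you only want to exchange column names and keep each frame's rows as they are, the alignment can be restricted to the column axis. A sketch on small assumed frames:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 54, 678], 'B': [234, 23, 67]})
df2 = pd.DataFrame({'F': [12, 154], 'G': [34, 3]})

# axis=1 aligns columns only; each frame keeps its own rows
df1a, df2a = df1.align(df2, join='outer', axis=1)
print(df1a.columns.tolist())  # ['A', 'B', 'F', 'G']
```

The new columns are filled with NaN, matching the reindex approach above.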

pandas drop row below each row containing an 'na'

I have a dataframe with, say, 4 columns [['a','b','c','d']], to which I add another column ['total'] containing the sum of all the other columns for each row. I then add another column ['growth of total'] with the growth rate of the total.
Some of the values in [['a','b','c','d']] are blank, rendering the ['total'] column invalid for those rows. I can easily get rid of these rows with df.dropna(how='any').
However, my growth rate will be invalid not only for rows with missing values in [['a','b','c','d']] but also for the following row. How do I drop all of these rows?
IIUC you can use notnull with all to mask off any rows with NaN and any rows that follow NaN rows:
In [43]:
df = pd.DataFrame({'a':[0,np.NaN, 2, 3,np.NaN], 'b':[np.NaN, 1,2,3,4], 'c':[0, np.NaN,2,3,4]})
df
Out[43]:
a b c
0 0 NaN 0
1 NaN 1 NaN
2 2 2 2
3 3 3 3
4 NaN 4 4
In [44]:
df[df.notnull().all(axis=1) & df.shift().notnull().all(axis=1)]
Out[44]:
a b c
3 3 3 3
Here's one option that I think does what you're looking for:
In [76]: df = pd.DataFrame(np.arange(40).reshape(10,4))
In [77]: df.loc[1, 2] = np.nan
In [78]: df.loc[6, 1] = np.nan
In [79]: df['total'] = df.sum(axis=1, skipna=False)
In [80]: df
Out[80]:
0 1 2 3 total
0 0 1 2 3 6
1 4 5 NaN 7 NaN
2 8 9 10 11 38
3 12 13 14 15 54
4 16 17 18 19 70
5 20 21 22 23 86
6 24 NaN 26 27 NaN
7 28 29 30 31 118
8 32 33 34 35 134
9 36 37 38 39 150
In [81]: df['growth'] = df['total'].iloc[1:] - df['total'].values[:-1]
In [82]: df
Out[82]:
0 1 2 3 total growth
0 0 1 2 3 6 NaN
1 4 5 NaN 7 NaN NaN
2 8 9 10 11 38 NaN
3 12 13 14 15 54 16
4 16 17 18 19 70 16
5 20 21 22 23 86 16
6 24 NaN 26 27 NaN NaN
7 28 29 30 31 118 NaN
8 32 33 34 35 134 16
9 36 37 38 39 150 16
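The values-slicing trick above can also be written with Series.diff, which computes the same row-to-row difference and lets NaN propagate to the following row automatically. A sketch on the same kind of frame (using loc instead of the deprecated ix):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(40, dtype=float).reshape(10, 4))
df.loc[1, 2] = np.nan
df.loc[6, 1] = np.nan

df['total'] = df.sum(axis=1, skipna=False)  # skipna=False keeps bad rows invalid
df['growth'] = df['total'].diff()           # NaN propagates to the following row

# drop rows where either the total or the growth is invalid
clean = df.dropna(subset=['total', 'growth'])
print(clean.index.tolist())  # [3, 4, 5, 8, 9]
```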
