Transform a dataset from wide to long pandas - python

I have a little problem with a transformation from wide to long on a dataset. I tried melt but I didn't get a good result; I hope someone can help me. The dataset is as follows:
pd.DataFrame({'id': [0, 1, 2, 3, 4, 5],
              'type': ['a', 'b', 'c', 'd', 'e', 'f'],
              'rank': ['alpha', 'beta', 'gamma', 'epsilon', 'phi', 'ro'],
              'type.1': ['d', 'g', 'z', 'a', 'nan', 'nan'],
              'rank.1': ['phi', 'sigma', 'gamma', 'lambda', 'nan', 'nan'],
              'type.2': ['nan', 'nan', 'j', 'r', 'nan', 'nan'],
              'rank.2': ['nan', 'nan', 'eta', 'theta', 'nan', 'nan']})
And I need the dataset to look like this:
pd.DataFrame({'id': [0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 5],
              'type': ['a', 'd', 'b', 'g', 'c', 'z', 'j', 'd', 'a', 'r', 'e', 'f'],
              'rank': ['alpha', 'phi', 'beta', 'sigma', 'gamma', 'gamma', 'eta',
                       'epsilon', 'lambda', 'theta', 'phi', 'ro']})
Can anyone help me with that? Thanks a lot

Use wide_to_long:
# normalize the `type` and `rank` columns so they have the same format as others
df = df.rename(columns={'type': 'type.0', 'rank': 'rank.0'})
(pd.wide_to_long(df, stubnames=['type', 'rank'], i='id', j='var', sep='.')
[lambda x: (x['type'] != 'nan') | (x['rank'] != 'nan')].reset_index())
id var type rank
0 0 0 a alpha
1 1 0 b beta
2 2 0 c gamma
3 3 0 d epsilon
4 4 0 e phi
5 5 0 f ro
6 0 1 d phi
7 1 1 g sigma
8 2 1 z gamma
9 3 1 a lambda
10 2 2 j eta
11 3 2 r theta
You can drop the var column if not needed.
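A hedged variation (my sketch, using the renamed df from above): since the "missing" entries are the literal string 'nan' rather than real NaN, converting them first lets dropna do the filtering, and var can be dropped in the same chain:
import numpy as np
out = (pd.wide_to_long(df.replace('nan', np.nan),
                       stubnames=['type', 'rank'], i='id', j='var', sep='.')
       .dropna(how='all')
       .reset_index()
       .drop(columns='var'))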

One option is pivot_longer from pyjanitor, which abstracts the reshaping process:
# pip install pyjanitor
import janitor
(df
 .pivot_longer(
     index='id',
     names_to=['type', 'rank'],
     names_pattern=['type', 'rank'],
     sort_by_appearance=True)
 .loc[lambda df: ~df.eq('nan').any(axis=1)]
)
id type rank
0 0 a alpha
1 0 d phi
3 1 b beta
4 1 g sigma
6 2 c gamma
7 2 z gamma
8 2 j eta
9 3 d epsilon
10 3 a lambda
11 3 r theta
12 4 e phi
15 5 f ro
The idea for this particular reshape is that each regex in names_pattern collects the columns it matches under the corresponding name in names_to.
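For comparison, base pandas has a similar column-grouping idea in pd.lreshape, where you list the columns behind each target name explicitly instead of using a regex (a sketch on the question's original df; the literal 'nan' strings still have to be filtered by hand):
out = pd.lreshape(df, {'type': ['type', 'type.1', 'type.2'],
                       'rank': ['rank', 'rank.1', 'rank.2']})
out = out[(out['type'] != 'nan') | (out['rank'] != 'nan')]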

Related

Count sequence within a column in pandas

I have the following problem. Suppose I have this dataframe:
import pandas as pd
d = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'],
     'Project': ['aa', 'ab', 'bc', 'aa', 'ab', 'aa', 'ab', 'ca', 'cb'],
     'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3]}
df = pd.DataFrame(data=d)
I need to add a new column that assigns a number to each group of projects per name. The desired output is:
import pandas as pd
dnew = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'],
        'Project': ['aa', 'ab', 'bc', 'aa', 'ab', 'aa', 'ab', 'ca', 'cb'],
        'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3],
        'New_column': ['1', '1', '1', '2', '2', '2', '2', '3', '3']}
NEWdf = pd.DataFrame(data=dnew)
In other words: 'aa', 'ab', 'bc' in Project occur in the first rows, so I add 1 to the new column. 'aa', 'ab' is the second project set from the beginning; it occurs for names 'a' and 'b', so I add 2 to both. 'ca', 'cb' is the third project set and occurs only for name 'd', so I add 3 only for name 'd'.
I tried to combine groupby with a for loop, but it did not work for me. Thanks a lot for any help!
This looks like networkx territory, since Name and Project are related; you can use:
import networkx as nx
# connect each Name to its Projects and take the connected components
G = nx.from_pandas_edgelist(df, 'Name', 'Project')
l = list(nx.connected_components(G))
# explode so every node maps back to the index of its component
s = pd.Series(map(list, l)).explode()
df['new'] = df['Project'].map({v: k for k, v in s.items()}).add(1)
print(df)
  Name Project  col2  new
0    c      aa     3    1
1    c      ab     4    1
2    c      bc     0    1
3    a      aa     6    1
4    a      ab    45    1
5    b      aa     6    1
6    b      ab    -3    1
7    d      ca     8    2
8    d      cb    -3    2
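Note that connected components merge every name that shares any project, so here 'c', 'a' and 'b' all end up in one group. If the numbering from the desired output is wanted (names with an identical project set share a number, in order of first appearance), a plain-pandas sketch would be:
import pandas as pd
df = pd.DataFrame({'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'],
                   'Project': ['aa', 'ab', 'bc', 'aa', 'ab', 'aa', 'ab', 'ca', 'cb'],
                   'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3]})
# map each Name to the frozenset of its projects (hashable, so factorize can code it)
proj_sets = df.groupby('Name', sort=False)['Project'].agg(frozenset)
# identical project sets get the same code, numbered by first appearance
df['New_column'] = pd.factorize(df['Name'].map(proj_sets))[0] + 1
print(df)   # New_column: 1 1 1 2 2 2 2 3 3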

pandas: select a certain number of rows based on column ranking using a loop

I have a dataframe which looks like this:
pd.DataFrame({'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
              'b': ['N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
              'c': [4, 5, 9, 8, 1, 3, 7, 2, 6, 10]})
a b c
0 A N 4
1 B Y 5
2 C Y 9
3 D N 8
4 E Y 1
5 F N 3
6 G Y 7
7 H N 2
8 I N 6
9 J Y 10
Out of the 10 rows I want to select 5 rows based on the following criteria:
column 'c' is my rank column.
select the rows with lowest 2 ranks (rows 4 and 7 selected)
select all rows where column 'b' = 'Y' AND rank <=5 (row 1 selected)
in the event fewer than 5 rows are selected using the above criteria, the remaining open positions should be filled in rank order (lowest first) with rows where 'b' = 'Y' and rank <= 7 (row 6 selected)
in the event fewer than 5 rows pass the first 3 criteria, fill the remaining positions in rank order (lowest first) with rows where 'b' = 'N'
I have tried this (which covers rules 1 & 2) but I am struggling with how to go on from there:
df['selected'] = ''
df.loc[(df.c <= 2), 'selected'] = 'rule_1'
df.loc[((df.c <= 5) & (df.b == 'Y')), 'selected'] = 'rule_2'
My resulting dataframe should look like this:
a b c selected
0 A N 4 False
1 B Y 5 rule_2
2 C Y 9 False
3 D N 8 rule_4
4 E Y 1 rule_1
5 F N 3 False
6 G Y 7 rule_3
7 H N 2 rule_1
8 I N 6 False
9 J Y 10 False
Based on one of the solutions provided by Vinod Karantothu below, I went for the following, which seems to work:
def solution(df):
    def sol(df, b='Y'):
        result_df_rule1 = df.sort_values('c')[:2]
        result_df_rule1['action'] = 'rule_1'
        result_df_rule2 = df.sort_values('c')[2:].loc[df['b'] == b].loc[df['c'] <= 5]
        result_df_rule2['action'] = 'rule_2'
        result = pd.concat([result_df_rule1, result_df_rule2]).head(5)
        # compute the leftovers before the if, so the return below cannot hit a
        # NameError when the first two rules already yield 5 rows
        remaining_rows = pd.concat([df, result, result]).drop_duplicates(subset='a', keep=False)
        if len(result) < 5:
            result_df_rule3 = remaining_rows.loc[df['b'] == b].loc[df['c'] <= 7]
            result_df_rule3['action'] = 'rule_3'
            result = pd.concat([result, result_df_rule3]).head(5)
        return result, pd.concat([remaining_rows, result, result]).drop_duplicates(subset='a', keep=False)

    result, remaining_data = sol(df)
    if len(result) < 5:
        result1, remaining_data = sol(remaining_data, 'N')
        result1['action'] = 'rule_4'
        result = (pd.concat([result, result1]).head(5)
                  .drop_duplicates(subset='a', keep=False)
                  .merge(df, how='outer', on='a'))
    return result

if __name__ == '__main__':
    df = pd.DataFrame({'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
                       'b': ['N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
                       'c': [4, 5, 9, 8, 1, 3, 7, 2, 6, 10]})
    result = solution(df)
    print(result)
import pandas as pd

def solution(df):
    def sol(df, b='Y'):
        result_df_rule1 = df.sort_values('c')[:2]
        result_df_rule2 = df.sort_values('c')[2:].loc[df['b'] == b].loc[df['c'] <= 5]
        result = pd.concat([result_df_rule1, result_df_rule2]).head(5)
        # defined before the if for the same NameError reason as above
        remaining_rows = pd.concat([df, result, result]).drop_duplicates(keep=False)
        if len(result) < 5:
            result_df_rule3 = remaining_rows.loc[df['b'] == b].loc[df['c'] <= 7]
            result = pd.concat([result, result_df_rule3]).head(5)
        return result, pd.concat([remaining_rows, result, result]).drop_duplicates(keep=False)

    result, remaining_data = sol(df)
    if len(result) < 5:
        result1, remaining_data = sol(remaining_data, 'N')
        result = pd.concat([result, result1]).head(5)
    return result

if __name__ == '__main__':
    df = pd.DataFrame({'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
                       'b': ['N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
                       'c': [4, 5, 9, 8, 1, 3, 7, 2, 6, 10]})
    result = solution(df)
    print(result)
Result:
a b c
4 E Y 1
7 H N 2
1 B Y 5
6 G Y 7
5 F N 3
For your 4th rule: your expected dataframe shows ROW_INDEX 3 being selected, but it has rank 8, which is not the lowest. ROW_INDEX 5 should come instead, according to the rules you have given:
import pandas as pd
data = pd.DataFrame({'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
                     'b': ['N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
                     'c': [4, 5, 9, 8, 1, 3, 7, 2, 6, 10]})
data1 = data.nsmallest(2, ['c'])                  # rule 1: the two lowest ranks
dataX = data.drop(data1.index)
data2 = dataX[(dataX.b == "Y") & (dataX.c <= 5)]  # rule 2
dataX = dataX.drop(data2.index)
data3 = dataX[(dataX.b == "Y") & (dataX.c <= 7)]  # rule 3
dataX = dataX.drop(data3.index)
data4 = dataX[dataX.b == "N"]                     # rule 4
data4 = data4.nsmallest(1, ['c'])                 # one open position left to fill
resultframes = [data1, data2, data3, data4]
resultfinal = pd.concat(resultframes)
print(resultfinal)
And here is the output:
a b c
4 E Y 1
7 H N 2
1 B Y 5
6 G Y 7
5 F N 3
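A hedged tweak to the last step: the 1 in nsmallest hardcodes the single open position left for this particular dataset; computing it keeps the script working when the earlier rules select a different number of rows (assuming 5 rows are wanted in total, as in the question):
open_slots = 5 - len(data1) - len(data2) - len(data3)  # positions still to fill
data4 = data4.nsmallest(open_slots, ['c'])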
You can create extra columns for the rules, then sort and take the head. IIUC from the comments, rule 3 already covers rule 2, so there is no need to calculate it separately.
df['r1'] = df.c < 3                     # rule 1: works here because ranks are the unique integers 1 to 10
df['r3'] = (df.c <= 7) & (df.b == 'Y')  # rule 3 (subsumes rule 2)
print(df.sort_values(['r1', 'r3', 'c'], ascending=[False, False, True])[['a', 'b', 'c']].head(5))
a b c
4 E Y 1
7 H N 2
1 B Y 5
6 G Y 7
5 F N 3
Sorting on boolean columns works because True > False.
Note: you might need to tweak the code to your expectations for different datasets. For example, your last row (9 J Y 10) is currently not covered by any of the rules. You can take this approach and extend it if needed, as sketched below.
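For instance, a minimal sketch of extending it with rule 4 (prefer the remaining 'N' rows, still in rank order) is one more boolean column in the sort; on this dataset it returns the same five rows:
df['r4'] = df.b == 'N'  # rule 4: fall back to 'N' rows in rank order
print(df.sort_values(['r1', 'r3', 'r4', 'c'],
                     ascending=[False, False, False, True])[['a', 'b', 'c']].head(5))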

Keep largest value based on sum of two groupbys in pandas

I have a pandas dataframe with 4 columns. The last one is just some random data. The first two are the columns I will group by, summing the value column. Within each value of the first grouping column, I would like to keep only the rows of the subgroup with the largest sum.
My data:
import pandas as pd
df = pd.DataFrame(data=[['0', 'A', 3, 'a'],
                        ['0', 'A', 2, 'b'],
                        ['0', 'A', 1, 'c'],
                        ['0', 'B', 3, 'd'],
                        ['0', 'B', 4, 'e'],
                        ['0', 'B', 4, 'f'],
                        ['1', 'C', 3, 'g'],
                        ['1', 'C', 2, 'h'],
                        ['1', 'C', 1, 'i'],
                        ['1', 'D', 3, 'j'],
                        ['1', 'D', 4, 'k'],
                        ['1', 'D', 4, 'l']],
                  columns=['group col 1', 'group col 2', 'value', 'random data'])
Desired output:
group col 1 group col 2 value random data
3 0 B 3 d
4 0 B 4 e
5 0 B 4 f
9 1 D 3 j
10 1 D 4 k
11 1 D 4 l
I have an inefficient way of getting there, but I am looking for a simpler solution.
My solution:
df1 = df.groupby(['group col 1', 'group col 2']).agg('sum').reset_index()
biggest_groups = df1.sort_values(by=['group col 1', 'value'], ascending=[True, False])
biggest_groups = biggest_groups.groupby('group col 1').head(1)
pairs = biggest_groups[['group col 1', 'group col 2']].values.tolist()
pairs = [tuple(i) for i in pairs]
df = df[df[['group col 1', 'group col 2']].apply(tuple, axis=1).isin(pairs)]
IIUC, you need two groupbys here: one to get the sum, then, based on the group, select the max again.
# per-row sum of its ('group col 1', 'group col 2') group
s = df.groupby(['group col 1', 'group col 2']).value.transform('sum')
# keep the rows whose group sum is the maximum within 'group col 1'
s = df[s.groupby(df['group col 1']).transform('max') == s]
group col 1 group col 2 value random data
3 0 B 3 d
4 0 B 4 e
5 0 B 4 f
9 1 D 3 j
10 1 D 4 k
11 1 D 4 l
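If exactly one winning group per 'group col 1' is wanted even when sums tie, an idxmax-based variation is possible (a sketch of mine; the transform version above keeps all tied groups, which may be what you want):
totals = df.groupby(['group col 1', 'group col 2'])['value'].sum()
winners = totals.groupby(level=0).idxmax()  # one ('group col 1', 'group col 2') pair per group
out = df.set_index(['group col 1', 'group col 2']).loc[winners.tolist()].reset_index()
print(out)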

Creating a new column based on selecting by multiple conditions between two pandas dataframes

I have two dataframes that contain (some) common columns (A,B,C), but are ordered differently and have different values for C.
I'd like to replace the 'C' values in first dataframe with those from the second.
I can create a toy example like this:
A = [1, 1, 1, 2, 2, 2, 3, 3, 3]
B = ['x', 'y', 'z', 'x', 'y', 'y', 'x', 'x', 'x']
C = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
df1 = pd.DataFrame({'A': A, 'B': B, 'C': C})
A.reverse()
B.reverse()
C = [c.upper() for c in reversed(C)]
df2 = pd.DataFrame({'A': A, 'B': B, 'C': C})
I'd like to update df1 so that it looks like this - i.e. it has the 'C' values from df2:
A = [1, 1, 1, 2, 2, 2, 3, 3, 3]
B = ['x', 'y', 'z', 'x', 'y', 'y', 'x', 'x', 'x']
C = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']
I've tried:
df1['C'] = df2[ (df2['A'] == df1['A']) & (df2['B'] == df1['B']) ]['C']
but that doesn't work because, I think, the ordering of A and B differs between the two frames.
merge_df = pd.merge(df1, df2, on=['A', 'B'])
df1['C'] = merge_df['C_y']
# caution: with duplicated (A, B) pairs such as (3, 'x') the merge yields a
# cross product with more rows than df1, so this only lines up for unique keys
I think your toy code has a problem in [ c.upper() for c in C.reverse() ]:
C.reverse() reverses the list in place and returns None.
It is not easy, because of the duplicates in columns A and B (the (3, 'x') rows). So I create a new helper column D by cumcount, then merge on it, and finally remove the unnecessary columns:
df1['D'] = df1.groupby(['A', 'B']).C.cumcount()
# df2 runs in reverse order, so number its duplicates from the end to pair them up
df2['D'] = df2.groupby(['A', 'B']).C.cumcount(ascending=False)
df3 = pd.merge(df1, df2, on=['A', 'B', 'D'], how='right', suffixes=('_', ''))
df3 = df3.drop(['C_', 'D'], axis=1)
print(df3)
A B C
0 1 x A
1 1 y B
2 1 z C
3 2 x D
4 2 y E
5 2 y F
6 3 x G
7 3 x H
8 3 x I
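The same cumcount pairing can also be written as a lookup that writes into df1 in place instead of merging (a sketch, assuming the toy df1 and df2 from the question):
key1 = df1.groupby(['A', 'B']).cumcount()
key2 = df2.groupby(['A', 'B']).cumcount(ascending=False)
# index df2's C by (A, B, occurrence) and look df1's keys up in it
mapping = df2.set_index(['A', 'B', key2])['C']
df1['C'] = mapping.reindex(pd.MultiIndex.from_arrays([df1['A'], df1['B'], key1])).to_numpy()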

Python: Create vectors of same length using two DataFrames

I have two dataframes as follows:
d1 = {'person': ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
      'category': ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
      'value': [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group': [100, 100, 100, 200, 200, 300, 300],
      'category': ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
      'value': [10, 8, 8, 6, 7, 8, 5]}
I want to get vectors of the same length out of the column category (i.e. indexed by category) for each person and group. In other words, I want to transform these long dataframes into wide format, where the names of the new columns are the values of the column category.
What is the best way to do this? This is an example of what I need:
id type A B C D E F
0 100 group 10 0 0 8 0 8
1 200 group 0 6 7 0 0 0
2 300 group 8 0 0 0 0 5
3 1 person 2 3 1 0 0 0
4 2 person 0 2 0 1 0 0
5 3 person 0 0 0 0 4 2
6 4 person 0 0 0 3 0 1
My current script appends both dataframes and then builds a pivot table. My concern is that in this case the types of the id columns are different.
I do this because sometimes not all the categories are in each dataframe (e.g. 'E' is not in df2).
This is what I have:
import pandas as pd

d1 = {'person': ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
      'category': ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
      'value': [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group': [100, 100, 100, 200, 200, 300, 300],
      'category': ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
      'value': [10, 8, 8, 6, 7, 8, 5]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df1['type'] = 'person'
df2['type'] = 'group'
df1.rename(columns={'person': 'id'}, inplace=True)
df2.rename(columns={'group': 'id'}, inplace=True)
# DataFrame.append was removed in pandas 2.x; concat does the same job
rawpivot = pd.concat([df1, df2], ignore_index=True)
pivot = rawpivot.pivot_table(index=['id', 'type'], columns='category',
                             values='value', aggfunc='sum', fill_value=0)
pivot.reset_index(inplace=True)
import pandas as pd

d1 = {'person': ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
      'category': ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
      'value': [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group': [100, 100, 100, 200, 200, 300, 300],
      'category': ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
      'value': [10, 8, 8, 6, 7, 8, 5]}
cols = ['idx', 'type', 'A', 'B', 'C', 'D', 'E', 'F']
df1 = pd.DataFrame(columns=cols)

def add_data(type_, data):
    global df1
    for id_, category, value in zip(data[type_], data['category'], data['value']):
        if id_ not in df1.idx.values:
            # start a new row for ids we have not seen yet
            row = pd.DataFrame({'idx': id_, 'type': type_}, columns=cols, index=[0])
            df1 = pd.concat([df1, row], ignore_index=True)  # append was removed in pandas 2.x
        df1.loc[df1['idx'] == id_, category] = value

add_data('group', d2)
add_data('person', d1)
df1 = df1.fillna(0)
df1 now holds the following values
idx type A B C D E F
0 100 group 10 0 0 8 0 8
1 200 group 0 6 7 0 0 0
2 300 group 8 0 0 0 0 5
3 1 person 2 3 1 0 0 0
4 2 person 0 2 0 1 0 0
5 3 person 0 0 0 0 4 2
6 4 person 0 0 0 3 0 1
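For what it's worth, the concat-plus-pivot_table idea from the question can be written compactly without the deprecated append calls (a sketch reusing d1 and d2 from above; it produces the same wide table, with 0 for missing categories):
df1 = pd.DataFrame(d1).rename(columns={'person': 'id'}).assign(type='person')
df2 = pd.DataFrame(d2).rename(columns={'group': 'id'}).assign(type='group')
pivot = (pd.concat([df1, df2], ignore_index=True)
           .pivot_table(index=['id', 'type'], columns='category',
                        values='value', aggfunc='sum', fill_value=0)
           .reset_index())
print(pivot)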
