I have my input in a Pandas dataframe in the following format.
I would like to convert it into the below format
What I have managed to do so far:
I managed to extract the col values A and B from the column names and have cross-joined them with the name column to obtain the following dataframe. I am not sure if my approach is correct.
I am not sure how I should go about it. Any help would be appreciated. Thanks
I agree with the earlier comment about posting data/code, but in this case it's simple enough to type in an example:
df = pd.DataFrame({'Name'   : ['AA','BB','CC'],
                   'col1_A' : [5,2,5],
                   'col2_A' : [10,3,6],
                   'col1_B' : [15,4,7],
                   'col2_B' : [20,6,21],
                   })
print(df)
Name col1_A col2_A col1_B col2_B
0 AA 5 10 15 20
1 BB 2 3 4 6
2 CC 5 6 7 21
You can create a pd.MultiIndex to replace the column names so that they match the structure of the table:
df = df.set_index('Name')
df.columns = pd.MultiIndex.from_product([['A','B'],['val_1','val_2']], names=('col', None))
print(df)
col A B
val_1 val_2 val_1 val_2
Name
AA 5 10 15 20
BB 2 3 4 6
CC 5 6 7 21
Then stack() the 'col' column index, and reset both indices to be columns:
df = df.stack('col').reset_index()
print(df)
Name col val_1 val_2
0 AA A 5 10
1 AA B 15 20
2 BB A 2 3
3 BB B 4 6
4 CC A 5 6
5 CC B 7 21
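For reference, starting from the original wide frame, the same reshape can be done in a single call with pd.wide_to_long. A minimal sketch, assuming the colN_X naming convention holds for every value column:

import pandas as pd

df = pd.DataFrame({'Name': ['AA', 'BB', 'CC'],
                   'col1_A': [5, 2, 5],
                   'col2_A': [10, 3, 6],
                   'col1_B': [15, 4, 7],
                   'col2_B': [20, 6, 21]})

# stubnames are the column prefixes; the suffix after '_' becomes the 'col' column
out = (pd.wide_to_long(df, stubnames=['col1', 'col2'],
                       i='Name', j='col', sep='_', suffix=r'\w+')
         .rename(columns={'col1': 'val_1', 'col2': 'val_2'})
         .reset_index())
print(out)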
Example code:
import pandas as pd
import re
# Dummy dataframe
d = {'Name': ['AA', 'BB'], 'col1_A': [5, 4], 'col1_B': [10, 9], 'col2_A': [15, 14], 'col2_B': [20, 19]}
df = pd.DataFrame(d)
# Get the numeric index inside each 'col' column name
col_idx = [re.findall(r'\d+', name)[0] for name in list(df.columns[df.columns.str.contains('col')])]
# Get the alphabetic suffix at the end of each 'col' column name
col_sfx = [name.split('_')[-1] for name in list(df.columns[df.columns.str.contains('col')])]
# Get the unique values in each list, preserving order
col_idx = list(dict.fromkeys(col_idx))
col_sfx = list(dict.fromkeys(col_sfx))
# Create new df with repeated 'Name' and 'col'
new_d = {'Name': [name for name in df['Name'] for i in range(len(col_sfx))], 'col': col_sfx * len(df.index)}
new_df = pd.DataFrame(new_d)
all_sub_df = []
all_sub_df.append(new_df)
print("Name and col:\n{}\n".format(new_df))
# Create new df for each val columns
for i_c in col_idx:
    df_coli = df.filter(like='col' + i_c, axis=1)
    df_coli = df_coli.stack().reset_index()
    df_coli = df_coli[df_coli.columns[-1:]]
    df_coli.columns = ['val_' + i_c]
    print("df_col{}:\n{}\n".format(i_c, df_coli))
    all_sub_df.append(df_coli)
# Concatenate all columns for result
new_df = pd.concat(all_sub_df, axis=1)
new_df
Outputs:
Name and col:
Name col
0 AA A
1 AA B
2 BB A
3 BB B
df_col1:
val_1
0 5
1 10
2 4
3 9
df_col2:
val_2
0 15
1 20
2 14
3 19
Name col val_1 val_2
0 AA A 5 15
1 AA B 10 20
2 BB A 4 14
3 BB B 9 19
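A more concise sketch of the same idea, using melt plus string extraction instead of the manual bookkeeping. This assumes the colN_X naming convention and pandas >= 1.1 (for pivot with a list of index columns):

import pandas as pd

d = {'Name': ['AA', 'BB'], 'col1_A': [5, 4], 'col1_B': [10, 9],
     'col2_A': [15, 14], 'col2_B': [20, 19]}
df = pd.DataFrame(d)

long_df = df.melt(id_vars='Name')
# Split a name like 'col1_A' into the number ('1') and the suffix ('A')
long_df[['idx', 'col']] = long_df['variable'].str.extract(r'col(\d+)_(\w+)')
result = (long_df.pivot(index=['Name', 'col'], columns='idx', values='value')
                 .add_prefix('val_')
                 .rename_axis(columns=None)
                 .reset_index())
print(result)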
I come from a SQL background and I use the following data processing step frequently:
Partition the table of data by one or more fields
For each partition, add a rownumber to each of its rows that ranks the row by one or more other fields, where the analyst specifies ascending or descending
EX:
df = pd.DataFrame({'key1'  : ['a','a','a','b','a'],
                   'data1' : [1,2,2,3,3],
                   'data2' : [1,10,2,3,30]})
df
data1 data2 key1
0 1 1 a
1 2 10 a
2 2 2 a
3 3 3 b
4 3 30 a
I'm looking for how to do the PANDAS equivalent of this SQL window function:
RN = ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 ASC, Data2 DESC)
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
I've tried the following which I've gotten to work where there are no 'partitions':
def row_number(frame, orderby_columns, orderby_direction, name):
    frame.sort_values(by=orderby_columns, ascending=orderby_direction, inplace=True)
    frame[name] = list(range(len(frame.index)))
I tried to extend this idea to work with partitions (groups in pandas) but the following didn't work:
df1 = df.groupby('key1').apply(lambda t: t.sort_values(by=['data1', 'data2'], ascending=[True, False], inplace=True)).reset_index()
def nf(x):
    x['rn'] = list(range(len(x.index)))
df1['rn1'] = df1.groupby('key1').apply(nf)
But I just got a lot of NaNs when I do this.
Ideally, there'd be a succinct way to replicate the window function capability of SQL (I've figured out the window-based aggregates; that's a one-liner in pandas). Can someone share with me the most idiomatic way to number rows like this in PANDAS?
You can also use sort_values(), groupby() and finally cumcount() + 1:
df['RN'] = df.sort_values(['data1','data2'], ascending=[True,False]) \
             .groupby(['key1']) \
             .cumcount() + 1
print(df)
yields:
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
PS tested with pandas 0.18
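If this pattern comes up often, it can be wrapped in a small helper. A sketch (the helper name row_number is just illustrative):

import pandas as pd

def row_number(frame, partition_by, order_by, ascending, name='RN'):
    # Sort by the ORDER BY columns, then number rows within each partition;
    # cumcount's result aligns back on the original index, so row order is preserved.
    out = frame.copy()
    out[name] = (out.sort_values(order_by, ascending=ascending)
                    .groupby(partition_by)
                    .cumcount() + 1)
    return out

df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
                   'data1': [1,2,2,3,3],
                   'data2': [1,10,2,3,30]})
print(row_number(df, partition_by=['key1'],
                 order_by=['data1', 'data2'], ascending=[True, False]))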
Use the groupby.rank function.
Here is a working example.
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
  C1  C2
0  a   1
1  a   2
2  a   3
3  b   4
4  b   5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
  C1  C2  RANK
0  a   1   1.0
1  a   2   2.0
2  a   3   3.0
3  b   4   1.0
4  b   5   2.0
You can do this by using groupby twice along with the rank method:
In [11]: g = df.groupby('key1')
Use the min method argument to give values which share the same data1 the same RN:
In [12]: g['data1'].rank(method='min')
Out[12]:
0 1
1 2
2 2
3 1
4 4
dtype: float64
In [13]: df['RN'] = g['data1'].rank(method='min')
And then groupby these results and add the rank with respect to data2:
In [14]: g1 = df.groupby(['key1', 'RN'])
In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0 0
1 0
2 1
3 0
4 0
dtype: float64
In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1
In [17]: df
Out[17]:
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
It feels like there ought to be a native way to do this (there may well be!...).
You can use transform and rank together. Here is an example:
df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
                   'C2' : [1,2,3,4,5]})
df['Rank'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.rank())
df
Have a look at the pandas rank method for more information.
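The lambda is not strictly needed here; calling rank on the groupby directly gives the same result:

df['Rank'] = df.groupby('C1')['C2'].rank()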
pandas.lib.fast_zip() (available in older pandas versions) can create a tuple array from a list of arrays. You can use this function to create a tuple Series, and then rank it:
values = {'key1' : ['a','a','a','b','a','b'],
          'data1' : [1,2,2,3,3,3],
          'data2' : [1,10,2,3,30,20]}
df = pd.DataFrame(values, index=list("abcdef"))
def rank_multi_columns(df, cols, **kw):
    data = []
    for col in cols:
        if col.startswith("-"):
            flag = -1
            col = col[1:]
        else:
            flag = 1
        data.append(flag * df[col])
    values = pd.lib.fast_zip(data)
    s = pd.Series(values, index=df.index)
    return s.rank(**kw)

rank = df.groupby("key1").apply(lambda df: rank_multi_columns(df, ["data1", "-data2"]))
print(rank)
the result:
a 1
b 2
c 3
d 2
e 4
f 1
dtype: float64
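Since pandas.lib.fast_zip was removed from the public API in later pandas releases, here is a minimal sketch of the same tuple-ranking idea using the built-in zip instead:

import pandas as pd

values = {'key1' : ['a','a','a','b','a','b'],
          'data1' : [1,2,2,3,3,3],
          'data2' : [1,10,2,3,30,20]}
df = pd.DataFrame(values, index=list("abcdef"))

def rank_multi_columns(g, cols, **kw):
    data = []
    for col in cols:
        if col.startswith("-"):
            data.append(-g[col[1:]])  # negate numeric columns for descending order
        else:
            data.append(g[col])
    s = pd.Series(list(zip(*data)), index=g.index)  # one tuple per row
    return s.rank(**kw)

rank = df.groupby("key1", group_keys=False).apply(
    lambda g: rank_multi_columns(g, ["data1", "-data2"]))
print(rank)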
I have two dataframes like this:
df1 = pd.DataFrame({'ID1':['A','B','C','D','E','F'],
                    'ID2':['0','10','80','0','0','0']})
df2 = pd.DataFrame({'ID1':['A','D','E','F'],
                    'ID2':['50','30','90','50'],
                    'aa':['1','2','3','4']})
I want to insert ID2 in df2 into ID2 in df1, and at the same time insert aa into df1 according to ID1 to obtain a new dataframe like this:
df_result = pd.DataFrame({'ID1':['A','B','C','D','E','F'],
                          'ID2':['50','10','80','30','90','50'],
                          'aa':['1','NaN','NaN','2','3','4']})
I've tried to use merge, but it didn't work.
You can use combine_first on the DataFrame after setting the index to ID1:
(df2.set_index('ID1')                     # values of df2 have priority in case of overlap
    .combine_first(df1.set_index('ID1'))  # add missing values from df1
    .reset_index()                        # reset ID1 as column
)
output:
ID1 ID2 aa
0 A 50 1
1 B 10 NaN
2 C 80 NaN
3 D 30 2
4 E 90 3
5 F 50 4
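This works because combine_first keeps every non-NaN value from the calling frame (df2 here) and falls back to df1 only where df2 has no value, which is why B and C keep their original ID2 entries and get NaN for aa.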
Try this:
import numpy as np

new_df = (df1.assign(ID2=df1['ID2'].replace('0', np.nan))
             .merge(df2, on='ID1', how='left')
             .pipe(lambda g: g.assign(ID2=g.filter(like='ID2').bfill(axis=1).iloc[:, 0])
                              .drop(['ID2_x', 'ID2_y'], axis=1)))
Output:
>>> new_df
ID1 aa ID2
0 A 1 50
1 B NaN 10
2 C NaN 80
3 D 2 30
4 E 3 90
5 F 4 50
Use df.merge with Series.combine_first:
In [568]: x = df1.merge(df2, on='ID1', how='left')
In [571]: x['ID2'] = x.ID2_y.combine_first(x.ID2_x)
In [574]: x.drop(['ID2_x', 'ID2_y'], axis=1, inplace=True)
In [575]: x
Out[575]:
ID1 aa ID2
0 A 1 50
1 B NaN 10
2 C NaN 80
3 D 2 30
4 E 3 90
5 F 4 50
Or use df.filter with df.ffill:
In [568]: x = df1.merge(df2, on='ID1', how='left')
In [597]: x['ID2'] = x.filter(like='ID2').ffill(axis=1)['ID2_y']
In [599]: x.drop(['ID2_x', 'ID2_y'], axis=1, inplace=True)
I am trying to copy Name from df2 into df1 where ID is common between both dataframes.
df1:
ID Name
1 A
2 B
4 C
16 D
7 E
df2:
ID Name
1 X
2 Y
7 Z
Expected Output:
ID Name
1 X
2 Y
4 C
16 D
7 Z
I have tried it like this, but it didn't work. I am not able to understand how to assign the value here. I am assigning df2['Name'], which is wrong.
for i in df2["ID"].tolist():
    df1['Name'].loc[(df1['ID'] == i)] = df2['Name']
Try with update:
df1 = df1.set_index('ID')
df1.update(df2.set_index('ID'))
df1 = df1.reset_index()
df1
Out[476]:
ID Name
0 1 X
1 2 Y
2 4 C
3 16 D
4 7 Z
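Note that update modifies df1 in place, aligns on the index, and by default only overwrites values where the other frame has non-NaN entries.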
If the order of rows does not matter, then concatenating the two dfs and dropping duplicates will achieve the result:
pd.concat([df2, df1]).drop_duplicates(subset='ID')
Another solution would be:
s = df1["Name"]
df1.loc[:,"Name"]=df1["ID"].map(df2.set_index("ID")["Name"].to_dict()).fillna(s)
Output:
ID Name
0 1 X
1 2 Y
2 4 C
3 16 D
4 7 Z
One more for consideration:
df,dg = df1,df2
df = df.set_index('ID')
dg = dg.set_index('ID')
df.loc[dg.index,:] = dg # All columns
#df.loc[dg.index,'Name'] = dg.Name # Single column
df = df.reset_index()
>>> df
ID Name
0 1 X
1 2 Y
2 4 C
3 16 D
4 7 Z
Or, for a single column, use the commented line above (the index for both is 'ID').
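One caveat on this pattern: df.loc[dg.index, :] = dg assumes every ID in dg also exists in df; indexing with .loc using labels that are missing from the index raises a KeyError in recent pandas versions.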
How am I supposed to remove the index column in the first row? I know it is not counted as a column, but when I transpose the data frame, it does not allow me to use my headers anymore.
In[297] df = df.transpose()
print(df)
df = df.drop('RTM',1)
df = df.drop('Requirements', 1)
df = df.drop('Test Summary Report', 1)
print(df)
This throws me an error "labels ['RTM'] not contained in axis".
RTM is contained in an axis, and the drop works if I use index_col=0:
df = xl.parse(sheet_name, header=1, index_col=0, usecols="A:E", nrows=6)
but then I lose my (0,0) value "Artifact name" as a header. Any help will be appreciated.
You can do this with .iloc, assigning the first row to the column names after transposing. Then you can drop that row and clean up the names:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': list('ABCDE'),
                   'val1': np.arange(1,6,1),
                   'val2': np.arange(11,16,1)})
id val1 val2
0 A 1 11
1 B 2 12
2 C 3 13
3 D 4 14
4 E 5 15
Transpose and clean up the names:
df = df.T
df.columns = df.iloc[0]               # the first row ('id') becomes the column names
df = df.drop(df.iloc[0].index.name)   # the columns are now named 'id', so this drops the 'id' row
df.columns.name = None                # remove 'id' as the columns' name
df is now:
A B C D E
val1 1 2 3 4 5
val2 11 12 13 14 15
Alternatively, just create a new DataFrame to begin with, specifying which column you want to be the header column.
header_col = 'id'
cols = [x for x in df.columns if x != header_col]
pd.DataFrame(df[cols].values.T, columns=df[header_col], index=cols)
Output:
id A B C D E
val1 1 2 3 4 5
val2 11 12 13 14 15
Using the setup from @ALollz:
df.set_index('id').rename_axis(None).T
A B C D E
val1 1 2 3 4 5
val2 11 12 13 14 15
I have a pandas DataFrame with about 200 columns. Roughly, I want to do this
for col in df.columns:
if col begins with a number:
df.drop(col)
I'm not sure what the best practices are when it comes to handling pandas DataFrames. How should I handle this? Will my pseudocode work, or is it not recommended to modify a pandas dataframe in a for loop?
I think the simplest is to select all columns which do not start with a number, using filter with a regex: ^ matches the start of the string and \D matches a non-digit character.
df1 = df.filter(regex=r'^\D')
Similar alternative:
df1 = df.loc[:, df.columns.str.contains(r'^\D')]
Or invert the condition and exclude columns that start with a number:
df1 = df.loc[:, ~df.columns.str.contains(r'^\d')]
df1 = df.loc[:, ~df.columns.str[0].str.isnumeric()]
If you want to use your pseudocode:
for col in df.columns:
    if col[0].isnumeric():
        df = df.drop(col, axis=1)
Sample:
df = pd.DataFrame({'2A':list('abcdef'),
                   '1B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D3':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print(df)
1B 2A C D3 E F
0 4 a 7 1 5 a
1 5 b 8 3 3 a
2 4 c 9 5 6 a
3 5 d 4 7 9 b
4 5 e 2 1 2 b
5 4 f 3 0 4 b
df1 = df.filter(regex=r'^\D')
print(df1)
C D3 E F
0 7 1 5 a
1 8 3 3 a
2 9 5 6 a
3 4 7 9 b
4 2 1 2 b
5 3 0 4 b
An alternative can be this:
columns = [x for x in df.columns if not x[0].isdigit()]
df = df[columns]
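An equivalent sketch using drop with a boolean mask over the column names (same result under the same assumptions):

# keep columns whose first character is not a digit
df1 = df.drop(columns=df.columns[df.columns.str[0].str.isdigit()])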