I have the following dataset with a numeric outcome and several columns that represent tags for the numeric outcome:
outcome tag1 tag2 tag3
340 a b a
123 a a b
23 d c b
54 c a c
I would like to unstack the dataset by creating rows from the column values (a, b, c, ...) and the corresponding outcome value, something like:
tag outcome
a 340
a 123
a 54
b 340
b 123
b 23
c 23
c 54
d 23
How?
Thanks!
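For anyone who wants to run the answers below, the sample frame can be rebuilt from the table in the question (column names taken from the header row):

```python
import pandas as pd

# Rebuild the question's sample data: one numeric outcome plus three tag columns
df = pd.DataFrame({'outcome': [340, 123, 23, 54],
                   'tag1': ['a', 'a', 'd', 'c'],
                   'tag2': ['b', 'a', 'c', 'a'],
                   'tag3': ['a', 'b', 'b', 'c']})
```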
Use:
In [321]: (df.set_index('outcome').unstack()
.reset_index(level=0, drop=True)
.sort_values()
.reset_index(name='tag')
.drop_duplicates())
Out[321]:
outcome tag
0 340 a
1 123 a
3 54 a
5 340 b
6 123 b
7 23 b
8 54 c
9 23 c
11 23 d
Use:
df1 = (df.melt('outcome', value_name='tag')
.sort_values('tag')
.drop('variable', axis=1)
.dropna(subset=['tag'])
.drop_duplicates()[['tag','outcome']])
Explanation:
Reshape by melt
Change order by sort_values
Remove column by drop
Remove possible missing values by dropna
Finally, remove duplicates by drop_duplicates
Or:
df1 = (df.set_index('outcome')
.stack()
.sort_values()
.reset_index(level=1, drop=True)
.reset_index(name='tag')
.drop_duplicates()[['tag','outcome']])
Explanation:
Reshape by set_index with stack
Series is sorted by sort_values
Double reset_index - first remove level 1 and then create a column from the index
Finally, remove duplicates by drop_duplicates
print (df1)
tag outcome
0 a 340
1 a 123
7 a 54
4 b 340
9 b 123
10 b 23
3 c 54
6 c 23
2 d 23
Related
I have two dataframes like this:
df1 = pd.DataFrame({'ID1':['A','B','C','D','E','F'],
'ID2':['0','10','80','0','0','0']})
df2 = pd.DataFrame({'ID1':['A','D','E','F'],
'ID2':['50','30','90','50'],
'aa':['1','2','3','4']})
I want to overwrite ID2 in df1 with the matching ID2 values from df2, and at the same time bring aa into df1, matching on ID1, to obtain a new dataframe like this:
df_result = pd.DataFrame({'ID1':['A','B','C','D','E','F'],
'ID2':['50','10','80','30','90','50'],
'aa':['1','NaN','NaN','2','3','4']})
I've tried to use merge, but it didn't work.
You can use combine_first on the DataFrame after setting the index to ID1:
(df2.set_index('ID1') # values of df2 have priority in case of overlap
.combine_first(df1.set_index('ID1')) # add missing values from df1
.reset_index() # reset ID1 as column
)
output:
ID1 ID2 aa
0 A 50 1
1 B 10 NaN
2 C 80 NaN
3 D 30 2
4 E 90 3
5 F 50 4
Try this:
new_df = (df1.assign(ID2=df1['ID2'].replace('0', np.nan))
             .merge(df2, on='ID1', how='left')
             .pipe(lambda g: g.assign(ID2=g.filter(like='ID2').bfill(axis=1).iloc[:, 0])
                              .drop(['ID2_x', 'ID2_y'], axis=1)))
Output:
>>> new_df
ID1 aa ID2
0 A 1 50
1 B NaN 10
2 C NaN 80
3 D 2 30
4 E 3 90
5 F 4 50
Use df.merge with Series.combine_first:
In [568]: x = df1.merge(df2, on='ID1', how='left')
In [571]: x['ID2'] = x.ID2_y.combine_first(x.ID2_x)
In [574]: x.drop(['ID2_x', 'ID2_y'], axis=1, inplace=True)
In [575]: x
Out[575]:
ID1 aa ID2
0 A 1 50
1 B NaN 10
2 C NaN 80
3 D 2 30
4 E 3 90
5 F 4 50
OR use df.filter with df.ffill:
In [568]: x = df1.merge(df2, on='ID1', how='left')
In [597]: x['ID2'] = x.filter(like='ID2').ffill(axis=1)['ID2_y']
In [599]: x.drop(['ID2_x', 'ID2_y'], axis=1, inplace=True)
I have a pandas dataframe with more than 100 columns.
For example in the following df:
df with columns ['A','B','C','D','E','date','G','H','F','I']
How can I move date to be the last column, assuming the dataframe is large and I can't write all the column names manually?
You can try this:
new_cols = [col for col in df.columns if col != 'date'] + ['date']
df = df[new_cols]
Test data:
cols = ['A','B','C','D','E','date','G','H','F','I']
df = pd.DataFrame([np.arange(len(cols))],
columns=cols)
print(df)
# A B C D E date G H F I
# 0 0 1 2 3 4 5 6 7 8 9
Output of the code:
A B C D E G H F I date
0 0 1 2 3 4 6 7 8 9 5
Use pandas.DataFrame.pop and pandas.concat:
print(df)
col1 col2 col3
0 1 11 111
1 2 22 222
2 3 33 333
s = df.pop('col1')
new_df = pd.concat([df, s], axis=1)
print(new_df)
Output:
col2 col3 col1
0 11 111 1
1 22 222 2
2 33 333 3
This way :
df_new = df.loc[:, df.columns != 'date'].copy()
df_new['date'] = df['date']
Simple reindexing should do the job:
original = df.columns
new_cols = original.delete(original.get_loc('date'))
df = df.reindex(columns=[*new_cols, 'date'])
You can use reindex and union:
df.reindex(df.columns[df.columns != 'date'].union(['date']), axis=1)
(Note that Index.union returns a sorted result, which happens to put lowercase 'date' last here but also re-sorts the other columns.)
Let's only work with the column headers and not the complete dataframe.
Then, use reindex to reorder the columns.
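The code for this answer isn't shown; a sketch that would reproduce the output below, assuming `Index.difference` was the tool used (it returns the remaining labels sorted, which explains the alphabetical column ordering):

```python
import numpy as np
import pandas as pd

# Same test data as in the earlier answer
cols = ['A', 'B', 'C', 'D', 'E', 'date', 'G', 'H', 'F', 'I']
df = pd.DataFrame([np.arange(len(cols))], columns=cols)

# Drop 'date' from the header index (result comes back sorted), then append it last
new_cols = df.columns.difference(['date']).tolist() + ['date']
df = df.reindex(columns=new_cols)
```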
Output using #QuangHoang setup:
A B C D E F G H I date
0 0 1 2 3 4 8 6 7 9 5
You can use the movecolumn package in Python to move columns:
pip install movecolumn
Then you can write your code as:
import movecolumn as mc
mc.MoveToLast(df,'date')
Hope that helps.
P.S.: The package can be found at https://pypi.org/project/movecolumn/
I have two dataframes that I want to sum along the y axis, conditionally.
For example:
df_1
a b value
1 1 1011
1 2 1012
2 1 1021
2 2 1022
df_2
a b value
9 9 99
1 2 12
2 1 21
I want to make df_1['value'] -= df_2['value'] if df_1[a] == df_2[a] & df_1[b] == df_2[b], so the output would be:
OUTPUT
a b value
1 1 1011
1 2 1000
2 1 1000
2 2 1022
Is there a way to achieve that instead of iterating the whole dataframe? (It's pretty big)
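For reference, the two frames can be rebuilt from the tables above:

```python
import pandas as pd

df_1 = pd.DataFrame({'a': [1, 1, 2, 2],
                     'b': [1, 2, 1, 2],
                     'value': [1011, 1012, 1021, 1022]})
df_2 = pd.DataFrame({'a': [9, 1, 2],
                     'b': [9, 2, 1],
                     'value': [99, 12, 21]})
```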
Make use of index alignment that pandas provides here, by setting a and b as your index before subtracting.
for df in [df_1, df_2]:
    df.set_index(['a', 'b'], inplace=True)

df_1.sub(df_2, fill_value=0).reindex(df_1.index)
value
a b
1 1 1011.0
2 1000.0
2 1 1000.0
2 1022.0
You could also perform a left join and subtract matching values. Here is how to do that:
(pd.merge(df_1, df_2, how='left', on=['a', 'b'], suffixes=('_1', '_2'))
.fillna(0)
.assign(value=lambda x: x.value_1 - x.value_2)
)[['a', 'b', 'value']]
You could do:
merged = pd.merge(df_1.reset_index(), df_2, on=['a', 'b']).set_index('index')
df_1.loc[merged.index, 'value'] = merged.value_x - merged.value_y
Result:
In [37]: df_1
Out[37]:
a b value
0 1 1 1011
1 1 2 1000
2 2 1 1000
3 2 2 1022
I have two columns as below:
id, colA, colB
0, a, 13
1, a, 52
2, b, 16
3, a, 34
4, b, 946
etc...
I am trying to create a third column, colC, that is colB if colA == a, otherwise 0.
This is what I was thinking, but it does not work:
data[data['colA']=='a']['colC'] = data[data['colA']=='a']['colB']
I was also thinking about using np.where(), but I don't think that would work here.
Any thoughts?
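For reference, the sample frame can be rebuilt as below, and contrary to the worry in the question, np.where does handle this case directly (a sketch of the same idea the answers use):

```python
import numpy as np
import pandas as pd

# Rebuild the sample data (rows behind "etc..." omitted)
data = pd.DataFrame({'id': [0, 1, 2, 3, 4],
                     'colA': ['a', 'a', 'b', 'a', 'b'],
                     'colB': [13, 52, 16, 34, 946]})

# Take colB where colA == 'a', otherwise 0
data['colC'] = np.where(data['colA'] == 'a', data['colB'], 0)
```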
Use loc with a mask to assign:
In [300]:
df.loc[df['colA'] == 'a', 'colC'] = df['colB']
df['colC'] = df['colC'].fillna(0)
df
Out[300]:
id colA colB colC
0 0 a 13 13
1 1 a 52 52
2 2 b 16 0
3 3 a 34 34
4 4 b 946 0
EDIT
or use np.where:
In [296]:
df['colC'] = np.where(df['colA'] == 'a', df['colB'], 0)
df
Out[296]:
id colA colB colC
0 0 a 13 13
1 1 a 52 52
2 2 b 16 0
3 3 a 34 34
4 4 b 946 0
df['colC'] = df[df['colA'] == 'a']['colB']
should result in exactly what you want, afaik.
Then replace the NaN's with zeroes with df.fillna(0, inplace=True)
I have these two data frames with different row indices, but some columns are the same.
What I want to do is get a data frame that sums the numbers of the two data frames over the columns with the same names:
df1 = pd.DataFrame([(1,2,3),(3,4,5),(5,6,7)], columns=['a','b','d'], index=['A','B','C'])
df1
a b d
A 1 2 3
B 3 4 5
C 5 6 7
df2 = pd.DataFrame([(10,20,30)], columns=['a','b','c'])
df2
a b c
0 10 20 30
Output dataframe:
a b d
A 11 22 3
B 13 24 5
C 15 26 7
What's the best way to do this? .add() doesn't seem to work with data frames with different indices.
This one-liner does the trick:
In [30]: df1 + df2.iloc[0].reindex(df1.columns).fillna(0)
Out[30]:
a b d
A 11 22 3
B 13 24 5
C 15 26 7
Here's one way to do it.
Extract common columns on which you want to add from df1 and df2.
In [153]: col1 = df1.columns
In [154]: col2 = df2.columns
In [155]: cols = sorted(set(col1) & set(col2))
In [156]: cols
Out[156]: ['a', 'b']
And, now add the values
In [157]: dff = df1.copy()
In [158]: dff[cols] = df1[cols].values+df2[cols].values
In [159]: dff
Out[159]:
a b d
A 11 22 3
B 13 24 5
C 15 26 7
I think this might be the shortest approach:
In [36]:
print(df1 + df2.iloc[0].reindex(df1.columns).fillna(0))
a b d
A 11 22 3
B 13 24 5
C 15 26 7
Step 1 results in a series like this:
In [44]:
df2.iloc[0].reindex(df1.columns)
Out[44]:
a 10
b 20
d NaN
Name: 0, dtype: float64
Filling the NaN and adding it to df1 will suffice, as the index will be aligned:
In [45]:
df2.iloc[0].reindex(df1.columns).fillna(0)
Out[45]:
a 10
b 20
d 0
Name: 0, dtype: float64