I have the following dataset with a numeric outcome and several columns that represent tags for the numeric outcome:
outcome tag1 tag2 tag3
340 a b a
123 a a b
23 d c b
54 c a c
I would like to unstack the dataset by creating rows from the column values (a, b, c, ...) and the corresponding outcome value, something like:
tag outcome
a 340
a 123
a 54
b 340
b 123
b 23
c 23
c 54
d 23
How?
Thanks!
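For anyone who wants to run the answers below, the sample frame can be rebuilt from the table in the question (column names taken from the header row):

```python
import pandas as pd

# Rebuild the question's sample data: one numeric outcome plus three tag columns
df = pd.DataFrame({'outcome': [340, 123, 23, 54],
                   'tag1': ['a', 'a', 'd', 'c'],
                   'tag2': ['b', 'a', 'c', 'a'],
                   'tag3': ['a', 'b', 'b', 'c']})
```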
Use:
In [321]: (df.set_index('outcome').unstack()
.reset_index(level=0, drop=True)
.sort_values()
.reset_index(name='tag')
.drop_duplicates())
Out[321]:
outcome tag
0 340 a
1 123 a
3 54 a
5 340 b
6 123 b
7 23 b
8 54 c
9 23 c
11 23 d
Use:
df1 = (df.melt('outcome', value_name='tag')
.sort_values('tag')
.drop('variable', axis=1)
.dropna(subset=['tag'])
.drop_duplicates()[['tag','outcome']])
Explanation:
Reshape by melt
Change order by sort_values
Remove column by drop
Remove possible missing values by dropna
Finally, remove duplicates by drop_duplicates
Or:
df1 = (df.set_index('outcome')
.stack()
.sort_values()
.reset_index(level=1, drop=True)
.reset_index(name='tag')
.drop_duplicates()[['tag','outcome']])
Explanation:
Reshape by set_index with stack
Series is sorted by sort_values
Double reset_index - first remove level 1 and then create a column from the index
Finally, remove duplicates by drop_duplicates
print (df1)
tag outcome
0 a 340
1 a 123
7 a 54
4 b 340
9 b 123
10 b 23
3 c 54
6 c 23
2 d 23
Related
I have two dataframes like this:
df1 = pd.DataFrame({'ID1':['A','B','C','D','E','F'],
'ID2':['0','10','80','0','0','0']})
df2 = pd.DataFrame({'ID1':['A','D','E','F'],
'ID2':['50','30','90','50'],
'aa':['1','2','3','4']})
I want to overwrite ID2 in df1 with the matching ID2 values from df2, and at the same time bring aa into df1, matching on ID1, to obtain a new dataframe like this:
df_result = pd.DataFrame({'ID1':['A','B','C','D','E','F'],
'ID2':['50','10','80','30','90','50'],
'aa':['1','NaN','NaN','2','3','4']})
I've tried to use merge, but it didn't work.
You can use combine_first on the DataFrame after setting the index to ID1:
(df2.set_index('ID1') # values of df2 have priority in case of overlap
.combine_first(df1.set_index('ID1')) # add missing values from df1
.reset_index() # reset ID1 as column
)
output:
ID1 ID2 aa
0 A 50 1
1 B 10 NaN
2 C 80 NaN
3 D 30 2
4 E 90 3
5 F 50 4
Try this:
new_df = (df1.assign(ID2=df1['ID2'].replace('0', np.nan))
             .merge(df2, on='ID1', how='left')
             .pipe(lambda g: g.assign(ID2=g.filter(like='ID2').bfill(axis=1).iloc[:, 0])
                              .drop(['ID2_x', 'ID2_y'], axis=1)))
Output:
>>> new_df
ID1 aa ID2
0 A 1 50
1 B NaN 10
2 C NaN 80
3 D 2 30
4 E 3 90
5 F 4 50
Use df.merge with Series.combine_first:
In [568]: x = df1.merge(df2, on='ID1', how='left')
In [571]: x['ID2'] = x.ID2_y.combine_first(x.ID2_x)
In [574]: x.drop(['ID2_x', 'ID2_y'], axis=1, inplace=True)
In [575]: x
Out[575]:
ID1 aa ID2
0 A 1 50
1 B NaN 10
2 C NaN 80
3 D 2 30
4 E 3 90
5 F 4 50
OR use df.filter with df.ffill:
In [568]: x = df1.merge(df2, on='ID1', how='left')
In [597]: x['ID2'] = x.filter(like='ID2').ffill(axis=1)['ID2_y']
In [599]: x.drop(['ID2_x', 'ID2_y'], axis=1, inplace=True)
I have a pandas dataframe with more than 100 columns.
For example in the following df:
df with columns ['A','B','C','D','E','date','G','H','F','I']
How can I move date to be the last column, assuming the dataframe is large and I can't write all the column names manually?
You can try this:
new_cols = [col for col in df.columns if col != 'date'] + ['date']
df = df[new_cols]
Test data:
cols = ['A','B','C','D','E','date','G','H','F','I']
df = pd.DataFrame([np.arange(len(cols))],
columns=cols)
print(df)
# A B C D E date G H F I
# 0 0 1 2 3 4 5 6 7 8 9
Output of the code:
A B C D E G H F I date
0 0 1 2 3 4 6 7 8 9 5
Use pandas.DataFrame.pop and pandas.concat:
print(df)
col1 col2 col3
0 1 11 111
1 2 22 222
2 3 33 333
s = df.pop('col1')
new_df = pd.concat([df, s], axis=1)
print(new_df)
Output:
col2 col3 col1
0 11 111 1
1 22 222 2
2 33 333 3
This way :
df_new = df.loc[:, df.columns != 'date'].copy()
df_new['date'] = df['date']
Simple reindexing should do the job:
original = df.columns
new_cols = original.delete(original.get_loc('date'))
df = df.reindex(columns=[*new_cols, 'date'])
You can use reindex and union:
df.reindex(df.columns[df.columns != 'date'].union(['date']), axis=1)
(Note that Index.union returns a sorted result, which happens to put lowercase 'date' last here but also re-sorts the other columns.)
Let's only work with the column headers and not the complete dataframe.
Then, use reindex to reorder the columns.
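The code for this answer isn't shown; a sketch that would reproduce the output below, assuming `Index.difference` was the tool used (it returns the remaining labels sorted, which explains the alphabetical column ordering):

```python
import numpy as np
import pandas as pd

# Same test data as in the earlier answer
cols = ['A', 'B', 'C', 'D', 'E', 'date', 'G', 'H', 'F', 'I']
df = pd.DataFrame([np.arange(len(cols))], columns=cols)

# Drop 'date' from the header index (result comes back sorted), then append it last
new_cols = df.columns.difference(['date']).tolist() + ['date']
df = df.reindex(columns=new_cols)
```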
Output using #QuangHoang setup:
A B C D E F G H I date
0 0 1 2 3 4 8 6 7 9 5
You can use the movecolumn package in Python to move columns:
pip install movecolumn
Then you can write your code as:
import movecolumn as mc
mc.MoveToLast(df,'date')
Hope that helps.
P.S.: The package can be found at https://pypi.org/project/movecolumn/
I have two dataframes that I want to sum along the y axis, conditionally.
For example:
df_1
a b value
1 1 1011
1 2 1012
2 1 1021
2 2 1022
df_2
a b value
9 9 99
1 2 12
2 1 21
I want to make df_1['value'] -= df_2['value'] if df_1[a] == df_2[a] & df_1[b] == df_2[b], so the output would be:
OUTPUT
a b value
1 1 1011
1 2 1000
2 1 1000
2 2 1022
Is there a way to achieve that instead of iterating the whole dataframe? (It's pretty big)
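For reference, the two frames can be rebuilt from the tables above:

```python
import pandas as pd

df_1 = pd.DataFrame({'a': [1, 1, 2, 2],
                     'b': [1, 2, 1, 2],
                     'value': [1011, 1012, 1021, 1022]})
df_2 = pd.DataFrame({'a': [9, 1, 2],
                     'b': [9, 2, 1],
                     'value': [99, 12, 21]})
```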
Make use of index alignment that pandas provides here, by setting a and b as your index before subtracting.
for df in [df_1, df_2]:
    df.set_index(['a', 'b'], inplace=True)

df_1.sub(df_2, fill_value=0).reindex(df_1.index)
value
a b
1 1 1011.0
2 1000.0
2 1 1000.0
2 1022.0
You could also perform a left join and subtract matching values. Here is how to do that:
(pd.merge(df_1, df_2, how='left', on=['a', 'b'], suffixes=('_1', '_2'))
.fillna(0)
.assign(value=lambda x: x.value_1 - x.value_2)
)[['a', 'b', 'value']]
You could do:
merged = pd.merge(df_1.reset_index(), df_2, on=['a', 'b']).set_index('index')
df_1.loc[merged.index, 'value'] = merged.value_x - merged.value_y
Result:
In [37]: df_1
Out[37]:
a b value
0 1 1 1011
1 1 2 1000
2 2 1 1000
3 2 2 1022
I have two columns as below:
id, colA, colB
0, a, 13
1, a, 52
2, b, 16
3, a, 34
4, b, 946
etc...
I am trying to create a third column, colC, that is colB if colA == a, otherwise 0.
This is what I was thinking, but it does not work:
data[data['colA']=='a']['colC'] = data[data['colA']=='a']['colB']
I was also thinking about using np.where(), but I don't think that would work here.
Any thoughts?
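For reference, the sample frame can be rebuilt as below, and contrary to the worry in the question, np.where does handle this case directly (a sketch of the same idea the answers use):

```python
import numpy as np
import pandas as pd

# Rebuild the sample data (rows behind "etc..." omitted)
data = pd.DataFrame({'id': [0, 1, 2, 3, 4],
                     'colA': ['a', 'a', 'b', 'a', 'b'],
                     'colB': [13, 52, 16, 34, 946]})

# Take colB where colA == 'a', otherwise 0
data['colC'] = np.where(data['colA'] == 'a', data['colB'], 0)
```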
Use loc with a mask to assign:
In [300]:
df.loc[df['colA'] == 'a', 'colC'] = df['colB']
df['colC'] = df['colC'].fillna(0)
df
Out[300]:
id colA colB colC
0 0 a 13 13
1 1 a 52 52
2 2 b 16 0
3 3 a 34 34
4 4 b 946 0
EDIT
or use np.where:
In [296]:
df['colC'] = np.where(df['colA'] == 'a', df['colB'], 0)
df
Out[296]:
id colA colB colC
0 0 a 13 13
1 1 a 52 52
2 2 b 16 0
3 3 a 34 34
4 4 b 946 0
df['colC'] = df[df['colA'] == 'a']['colB']
should result in exactly what you want, afaik.
Then replace the NaN's with zeroes with df.fillna(0, inplace=True)
I have these two data frames with different row indices, but some columns are the same.
What I want to do is get a data frame that sums the numbers of the two data frames over the columns with the same names:
df1 = pd.DataFrame([(1,2,3),(3,4,5),(5,6,7)], columns=['a','b','d'], index=['A','B','C'])
df1
a b d
A 1 2 3
B 3 4 5
C 5 6 7
df2 = pd.DataFrame([(10,20,30)], columns=['a','b','c'])
df2
a b c
0 10 20 30
Output dataframe:
a b d
A 11 22 3
B 13 24 5
C 15 26 7
What's the best way to do this? .add() doesn't seem to work with data frames with different indices.
This one-liner does the trick:
In [30]: df1 + df2.iloc[0].reindex(df1.columns).fillna(0)
Out[30]:
a b d
A 11 22 3
B 13 24 5
C 15 26 7
Here's one way to do it.
Extract common columns on which you want to add from df1 and df2.
In [153]: col1 = df1.columns
In [154]: col2 = df2.columns
In [155]: cols = sorted(set(col1) & set(col2))
In [156]: cols
Out[156]: ['a', 'b']
And, now add the values
In [157]: dff = df1.copy()
In [158]: dff[cols] = df1[cols].values+df2[cols].values
In [159]: dff
Out[159]:
a b d
A 11 22 3
B 13 24 5
C 15 26 7
I think this might be the shortest approach:
In [36]:
print(df1 + df2.iloc[0].reindex(df1.columns).fillna(0))
a b d
A 11 22 3
B 13 24 5
C 15 26 7
Step 1 results in a series like this:
In [44]:
df2.iloc[0].reindex(df1.columns)
Out[44]:
a 10
b 20
d NaN
Name: 0, dtype: float64
Filling the NaN and adding it to df1 will suffice, as the index will be aligned:
In [45]:
df2.iloc[0].reindex(df1.columns).fillna(0)
Out[45]:
a 10
b 20
d 0
Name: 0, dtype: float64