Python - Lookup value from different columns dynamically

I have the following two dataframes, df1 and df2:

df1:
Name | Data
A    | foo
A    | bar
B    | foo
B    | bar
C    | foo
C    | bar
C    | cat

df2:
Name | foo | bar | cat
A    | 1   | 2   | 3
B    | 4   | 5   | 6
C    | 7   | 8   | 9
I need to look up the values from the second dataframe and create a dataframe like this:
Name | Data | Value
A    | foo  | 1
A    | bar  | 2
B    | foo  | 4
B    | bar  | 5
C    | foo  | 7
C    | bar  | 8
C    | cat  | 9
I tried looping over df1 and filtering df2 for each row, like df2[df2['Name']=='A']['foo']. This works, but it takes forever to complete. I am new to Python, and any help reducing the runtime would be appreciated.
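For reference, a minimal sketch of the loop described above (slow because it filters df2 once per row of df1; variable names are assumed from the question):

# Slow approach: one boolean filter over df2 for every row of df1
values = []
for _, row in df1.iterrows():
    values.append(df2.loc[df2['Name'] == row['Name'], row['Data']].iloc[0])
df1['Value'] = values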

You can use .melt + .merge:
x = df1.merge(df2.melt("Name", var_name="Data"), on=["Name", "Data"])
print(x)
Prints:
  Name Data  value
0    A  foo      1
1    A  bar      2
2    B  foo      4
3    B  bar      5
4    C  foo      7
5    C  bar      8
6    C  cat      9
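melt names the value column value by default; if you want it capitalized as Value, matching the desired output, melt also accepts a value_name argument:

x = df1.merge(df2.melt("Name", var_name="Data", value_name="Value"), on=["Name", "Data"])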

You can melt your second dataframe and then merge it with your first:
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'Data': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'cat'],
})
df2 = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'foo': [1, 4, 7],
    'bar': [2, 5, 8],
    'cat': [3, 6, 9],
})

df1.merge(df2.melt('Name', var_name='Data'), on=['Name', 'Data'])

Related

Pandas group multiple columns and append value based on condition in non-grouped column

I'd like to group several columns in my dataframe, then append a new column to the original dataframe with a non-aggregated value determined by a condition in another column that falls outside of the grouping. For example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'cat': ['foo', 'foo', 'foo', 'foo', 'foo', 'foo',
                           'bar', 'bar', 'bar', 'bar', 'bar', 'bar'],
                   'subcat': ['a', 'a', 'a', 'b', 'b', 'b',
                              'c', 'c', 'c', 'd', 'd', 'd'],
                   'bin': [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
                   'value': [2, 5, 7, 6, 3, 9, 8, 3, 2, 1, 2, 4]})
I'd like to group by both 'cat' and 'subcat', and I'm hoping to append the corresponding 'value' as a new column where 'bin' == 1.
This is my desired output:
df = pd.DataFrame({'cat': ['foo', 'foo', 'foo', 'foo', 'foo', 'foo',
                           'bar', 'bar', 'bar', 'bar', 'bar', 'bar'],
                   'subcat': ['a', 'a', 'a', 'b', 'b', 'b',
                              'c', 'c', 'c', 'd', 'd', 'd'],
                   'bin': [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
                   'value': [2, 5, 7, 6, 3, 9, 8, 3, 2, 1, 2, 4],
                   'new_value': [2, 2, 2, 3, 3, 3, 2, 2, 2, 4, 4, 4]})
I've tried various approaches, including the following, but my merge yields more rows than expected, so I'm hoping to find a different route.
vals = df[df['bin'] == 1].loc[:,('cat', 'subcat', 'value')]
df_merged = pd.merge(left = df, right = vals, how = "left", on = ('cat','subcat'))
Thanks!
Try loc with groupby and transform('idxmax'):
df['new_value'] = df.loc[df.groupby(['subcat'])['bin'].transform('idxmax'), 'value'].reset_index(drop=True)
print(df)
Output:
    cat subcat  bin  value  new_value
0   foo      a    1      2          2
1   foo      a    0      5          2
2   foo      a    0      7          2
3   foo      b    0      6          3
4   foo      b    1      3          3
5   foo      b    0      9          3
6   bar      c    0      8          2
7   bar      c    0      3          2
8   bar      c    1      2          2
9   bar      d    0      1          4
10  bar      d    0      2          4
11  bar      d    1      4          4
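Note that grouping by 'subcat' alone works here only because each subcat value occurs under a single cat; if subcat labels can repeat across cats, group by both keys. A sketch of the same idea:

df['new_value'] = df.loc[
    df.groupby(['cat', 'subcat'])['bin'].transform('idxmax'), 'value'
].reset_index(drop=True)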

How to add new column group after using pivot pandas?

I'm trying to create a new column group consisting of 3 sub-columns after using pivot on a dataframe, but the result is only one column.
Let's say I have the following dataframe that I pivot:
import pandas as pd

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': [1, 2, 3, 4, 5, 6]})
df = df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
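For reference, the pivoted frame produced by the code above looks like this:

    baz       zoo
bar   A  B  C   A  B  C
foo
one   1  2  3   1  2  3
two   4  5  6   4  5  6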
Now I want an extra column group that is the sum of the two value columns baz and zoo.
My attempt:
df.loc[:, 'baz+zoo'] = df.loc[:, 'baz'] + df.loc[:, 'zoo']
The desired output:

    baz       zoo       baz+zoo
bar   A  B  C   A  B  C       A   B   C
foo
one   1  2  3   1  2  3       2   4   6
two   4  5  6   4  5  6       8  10  12
I know that performing the sum and then concatenating will do the trick, but I was hoping for a neater solution.
If there are many rows (or especially many columns), I think it is better/faster to create a new DataFrame, add the first level of the MultiIndex with MultiIndex.from_product, and join it to the original with DataFrame.join:
df1 = df.loc[:, 'baz'] + df.loc[:, 'zoo']
df1.columns = pd.MultiIndex.from_product([['baz+zoo'], df1.columns])
print(df1)

    baz+zoo
          A   B   C
foo
one       2   4   6
two       8  10  12

df = df.join(df1)
print(df)

    baz       zoo       baz+zoo
bar   A  B  C   A  B  C       A   B   C
foo
one   1  2  3   1  2  3       2   4   6
two   4  5  6   4  5  6       8  10  12
Another solution is to loop over the second level and select columns by MultiIndex tuples; for a large DataFrame performance should be worse, but it is best to test on real data:
for x in df.columns.levels[1]:
    df[('baz+zoo', x)] = df[('baz', x)] + df[('zoo', x)]

print(df)

    baz       zoo       baz+zoo
bar   A  B  C   A  B  C       A   B   C
foo
one   1  2  3   1  2  3       2   4   6
two   4  5  6   4  5  6       8  10  12
I was able to do it this way too. I'm not sure I understand the theory, but...
df['baz+zoo'] = df['baz']+df['zoo']
df.pivot(index='foo', columns='bar', values=['baz','zoo','baz+zoo'])

Excluding all data above a percentile for different categories

I have a dataframe with different categories and want to exclude all the values which are above a given percentile for each category.
import pandas as pd

d = {'cat': ['A', 'B', 'A', 'A', 'C', 'C', 'B', 'A', 'B', 'C'],
     'val': [1, 2, 4, 2, 1, 0, 9, 8, 7, 7]}
df = pd.DataFrame(data=d)
  cat  val
0   A    1
1   B    2
2   A    4
3   A    2
4   C    1
5   C    0
6   B    9
7   A    8
8   B    7
9   C    7
So, for example, excluding everything above the 0.95 quantile should result in:
  cat  val
0   A    1
1   B    2
2   A    4
3   A    2
4   C    1
5   C    0
8   B    7
because we have:
>>> df.loc[df['cat']=='A', 'val'].quantile(0.95)
7.399999999999999
>>> df.loc[df['cat']=='B', 'val'].quantile(0.95)
8.8
>>> df.loc[df['cat']=='C', 'val'].quantile(0.95)
6.399999999999999
In reality there are many categories and I need a neat way to do it.
You can use the quantile function in combination with groupby:
df.groupby('cat')['val'].apply(lambda x: x[x < x.quantile(0.95)]).reset_index().drop(columns='level_1')
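A variant using transform keeps the original index, so no reset_index/drop cleanup is needed (a sketch with the same 0.95 threshold):

df[df['val'] < df.groupby('cat')['val'].transform(lambda s: s.quantile(0.95))]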
I came up with the following solution:
idx = [False] * df.shape[0]
for cat in df['cat'].unique():
    idx |= (df['cat'] == cat) & (df['val'].between(0, df.loc[df['cat'] == cat, 'val'].quantile(0.95)))
df[idx]
but it would be nice to see other solutions (hopefully better ones).

change column name using index

import pandas as pd

d = {
    'one': [1, 2, 3, 4, 5],
    'one': [9, 8, 7, 6, 5],   # duplicate key: only the last 'one' survives
    'three': ['a', 'b', 'c', 'd', 'e']
}
df = pd.DataFrame(d)
I have a bigger dataframe with multiple columns that share the same name. I want to change a column name by its position, as in R, e.g. colnames(df)[2] = 'two'. That is, I want to rename the second column 'one' to 'two', but in Python.
I think the simplest is to assign new column names with np.arange or range:
import numpy as np

# a valid dictionary must have unique keys
d = {
    'one1': [1, 2, 3, 4, 5],
    'one2': [9, 8, 7, 6, 5],
    'three': ['a', 'b', 'c', 'd', 'e']
}
df = pd.DataFrame(d)
df.columns = ['one'] * 2 + ['three']
print(df)
   one  one three
0    1    9     a
1    2    8     b
2    3    7     c
3    4    6     d
4    5    5     e
df.columns = np.arange(len(df.columns))
# alternative:
# df.columns = range(len(df.columns))
print(df)

   0  1  2
0  1  9  a
1  2  8  b
2  3  7  c
3  4  6  d
4  5  5  e
Then select by name:
print(df[1])

0    9
1    8
2    7
3    6
4    5
Name: 1, dtype: int64
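If the goal is specifically the R-style positional rename (colnames(df)[2] = 'two'), you can also rebuild the column list by position; a minimal sketch:

cols = list(df.columns)
cols[1] = 'two'   # 0-based, so index 1 is the second column
df.columns = cols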

Unstack dataframe and keep columns

I have a DataFrame that is in too "compact" a form. It is currently like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': ['A', 'B'],
                   'bar': ['1', '2'],
                   'baz': [np.nan, '3']})

  bar  baz foo
0   1  NaN   A
1   2    3   B
And I need to "unstack" it to be like so:
df = pd.DataFrame({'foo': ['A', 'B', 'B'],
                   'type': ['bar', 'bar', 'baz'],
                   'value': ['1', '2', '3']})

  foo type value
0   A  bar     1
1   B  bar     2
2   B  baz     3
No matter how I try to pivot, I can't get it right.
Use the melt() method:
In [39]: pd.melt(df, id_vars='foo', value_vars=['bar','baz'], var_name='type')
Out[39]:
  foo type value
0   A  bar    1
1   B  bar    2
2   A  baz  NaN
3   B  baz    3
or
In [38]: pd.melt(df, id_vars='foo', value_vars=['bar','baz'], var_name='type').dropna()
Out[38]:
  foo type value
0   A  bar    1
1   B  bar    2
3   B  baz    3
Set your index to foo, then stack:
df.set_index('foo').stack()

foo
A    bar    1
B    bar    2
     baz    3
dtype: object
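To recover the exact desired columns from the stacked Series, reset the index and rename the generated level; a minimal sketch:

out = (df.set_index('foo').stack()
         .reset_index(name='value')
         .rename(columns={'level_1': 'type'}))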
