So I am working with a dataset that has two data frames.
The Data Frames look like this:
df1:
Item_ID Item_Name
0 A
1 B
2 C
df2:
Item_slot_1 Item_slot_2 Item_Slot_3
2 2 1
1 2 0
0 1 1
The values in df2 represent the Item_ID from df1. How can I replace the values in df2 with the actual item names, so that df2 looks like this:
Item_slot_1 Item_slot_2 Item_Slot_3
C C B
B C A
A B B
The data set in reality is much larger and has many more IDs and names than just A, B, and C.
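For reference, here is a minimal reconstruction of the example frames (a sketch based on the tables above; the real data is larger, as noted):
import pandas as pd

df1 = pd.DataFrame({'Item_ID': [0, 1, 2], 'Item_Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'Item_slot_1': [2, 1, 0],
                    'Item_slot_2': [2, 2, 1],
                    'Item_Slot_3': [1, 0, 1]})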
Create a dictionary with zip and pass it to applymap, replace, or apply with map:
s = dict(zip(df1['Item_ID'], df1['Item_Name']))
# if a value does not exist in df1['Item_ID'], you get None in df2
df2 = df2.applymap(s.get)
Or:
# if a value does not exist in df1['Item_ID'], the original value is kept in df2
df2 = df2.replace(s)
Or:
# if a value does not exist in df1['Item_ID'], you get NaN in df2
df2 = df2.apply(lambda x: x.map(s))
print (df2)
Item_slot_1 Item_slot_2 Item_Slot_3
0 C C B
1 B C A
2 A B B
EDIT:
You can specify the columns to process by name:
cols = ['Item_slot_1','Item_slot_2','Item_Slot_3']
df2[cols] = df2[cols].applymap(s.get)
df2[cols] = df2[cols].replace(s)
df2[cols] = df2[cols].apply(lambda x: x.map(s))
You can improve the speed of dictionary mapping with numpy. If your items are numbered 0-N this is trivial; if they are not, it gets a bit trickier, but it is still easily doable.
If the items in df1 are numbered 0-N, use basic indexing:
a = df1['Item_Name'].values
b = df2.values
pd.DataFrame(a[b], columns=df2.columns)
Item_slot_1 Item_slot_2 Item_Slot_3
0 C C B
1 B C A
2 A B B
If they are not numbered 0-N, here is a more general approach:
x = df1['Item_ID'].values
y = df1['Item_Name'].values
z = df2.values
m = np.arange(x.max() + 1, dtype=object)
m[x] = y
pd.DataFrame(m[z], columns=df2.columns)
Item_slot_1 Item_slot_2 Item_Slot_3
0 C C B
1 B C A
2 A B B
Replacing only a subset of columns of df2 is also simple; let's demonstrate by replacing just the first two columns:
x = df1['Item_ID'].values
y = df1['Item_Name'].values
cols = ['Item_slot_1', 'Item_slot_2']
z = df2[cols].values
m = np.arange(x.max() + 1, dtype=object)
m[x] = y
df2[cols] = m[z]
Item_slot_1 Item_slot_2 Item_Slot_3
0 C C 1
1 B C 0
2 A B 1
This type of indexing nets a hefty performance gain over apply and replace:
import string
df1 = pd.DataFrame({'Item_ID': np.arange(26), 'Item_Name': list(string.ascii_uppercase)})
df2 = pd.DataFrame(np.random.randint(1, 26, (10000, 100)))
%%timeit
s = dict(zip(df1['Item_ID'], df1['Item_Name']))
df2.applymap(s.get)
158 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
s = dict(zip(df1['Item_ID'], df1['Item_Name']))
df2.replace(s)
750 ms ± 34.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
s = dict(zip(df1['Item_ID'], df1['Item_Name']))
df2.apply(lambda x: x.map(s))
93.1 ms ± 4.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
x = df1['Item_ID'].values
y = df1['Item_Name'].values
z = df2.values
m = np.arange(x.max() + 1, dtype=object)
m[x] = y
pd.DataFrame(m[z], columns=df2.columns)
30.4 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I'd like to modify col1 of the following dataframe df:
col1 col2
0 Black 7
1 Death 2
2 Hardcore 6
3 Grindcore 1
4 Deathcore 4
...
I want to use a dict named cat_dic={'Black':'B', 'Death':'D', 'Hardcore':'H'} to get the following dataframe:
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
...
I know I can use df.map or df.replace, for example like this:
df.replace({"col1":cat_dic})
but I want missing dictionary keys to map to None, and with the previous line I got this result instead:
col1 col2
0 B 7
1 D 2
2 H 6
3 Grindcore 1
4 Deathcore 4
...
Given that Grindcore and Deathcore are not the only 2 values in col1 that I want set to None, have you got any idea how to do it?
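For reference, a minimal setup reproducing the example above (values taken from the tables in the question):
import pandas as pd

df = pd.DataFrame({'col1': ['Black', 'Death', 'Hardcore', 'Grindcore', 'Deathcore'],
                   'col2': [7, 2, 6, 1, 4]})
cat_dic = {'Black': 'B', 'Death': 'D', 'Hardcore': 'H'}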
Use dict.get:
df['col1'] = df['col1'].map(lambda x: cat_dic.get(x, None))
#default value is None
df['col1'] = df['col1'].map(cat_dic.get)
print (df)
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
Performance comparison on 50k rows:
df = pd.concat([df] * 10000, ignore_index=True)
cat_dic={'Black':'B', 'Death':'D', 'Hardcore':'H'}
In [93]: %timeit df['col1'].map(cat_dic.get)
3.22 ms ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [94]: %timeit df.col1.apply(lambda x: None if x not in cat_dic.keys() else cat_dic[x])
15 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [95]: %timeit df['col1'].replace(dict(dict.fromkeys(df['col1'].unique(), None), **cat_dic))
12.3 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [96]: %timeit df.col1.apply(lambda x: None if x not in cat_dic.keys() else x)
13.8 ms ± 837 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [97]: %timeit df['col1'].map(cat_dic).replace(dict({np.nan: None}))
9.97 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
You may use Series.apply first:
df.col1 = df.col1.apply(lambda x: None if x not in cat_dic.keys() else x)
Then you can safely use DataFrame.replace (assign the result back to keep it):
df = df.replace({"col1": cat_dic})
This can be done in one line:
df['col1'] = df.col1.apply(lambda x: None if x not in cat_dic.keys() else cat_dic[x])
Output is:
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
Here is an easy one-liner which gives the expected output:
df['col1'] = df['col1'].map(cat_dic).replace(dict({np.nan: None}))
Output:
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
Series.map already maps mismatched keys to NaN:
print(df['col1'].map(cat_dic))
0 B
1 D
2 H
3 NaN
4 NaN
Name: col1, dtype: object
Anyway, you can update cat_dic with the missing keys from the col1 column:
cat_dic = dict(dict.fromkeys(df['col1'].unique(), None), **cat_dic)
df['col1'] = df['col1'].replace(cat_dic)
print(cat_dic)
{'Black': 'B', 'Death': 'D', 'Hardcore': 'H', 'Grindcore': None, 'Deathcore': None}
print(df)
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
In [6]: df.col1.map(cat_dic.get)
Out[6]:
0 B
1 D
2 H
3 None
4 None
dtype: object
You could also use apply; both work. When working on a Series, map is faster, I think.
Explanation:
You can get a default value for missing keys by using dict.get instead of the [..] operator. By default, that value is None, so simply passing the dict.get method to apply/map just works.
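A quick plain-Python illustration of that default behaviour (hypothetical lookups, not from the question):
cat_dic = {'Black': 'B', 'Death': 'D', 'Hardcore': 'H'}

# cat_dic['Grindcore']                    # the [..] operator would raise KeyError
print(cat_dic.get('Grindcore'))           # None -> the implicit default
print(cat_dic.get('Grindcore', 'other'))  # 'other' -> an explicit default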
I'm fighting with pandas and for now I'm losing. I have a source table similar to this:
import pandas as pd
a=pd.Series([123,22,32,453,45,453,56])
b=pd.Series([234,4353,355,453,345,453,56])
df=pd.concat([a, b], axis=1)
df.columns=['First', 'Second']
I would like to add a new column to this data frame containing the first digit of the values in column 'First':
a) convert the numbers in column 'First' to strings
b) extract the first character from each newly created string
c) save the results from b as a new column in the data frame
I don't know how to apply this to the pandas data frame object. I would be grateful for help with that.
Cast the dtype of the column to str and you can perform vectorised slicing by calling .str:
In [29]:
df['new_col'] = df['First'].astype(str).str[0]
df
Out[29]:
First Second new_col
0 123 234 1
1 22 4353 2
2 32 355 3
3 453 453 4
4 45 345 4
5 453 453 4
6 56 56 5
If you need to, you can cast the dtype back again by calling astype(int) on the column.
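For example, the full round trip could look like this (a sketch using the question's df; the sliced characters are strings, so the final astype(int) restores a numeric dtype):
# slice the first character as a string, then cast back to int
df['new_col'] = df['First'].astype(str).str[0].astype(int)
print(df['new_col'].dtype)  # int64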
.str.get
This is the simplest way to specify string methods.
# Setup
df = pd.DataFrame({'A': ['xyz', 'abc', 'foobar'], 'B': [123, 456, 789]})
df
A B
0 xyz 123
1 abc 456
2 foobar 789
df.dtypes
A object
B int64
dtype: object
For string (read: object) type columns, use:
df['C'] = df['A'].str[0]
# Similar to,
df['C'] = df['A'].str.get(0)
.str handles NaNs by returning NaN as the output.
For numeric columns, an .astype conversion is required beforehand, as shown in Ed Chum's answer.
# Note that this won't work well if the data has NaNs.
# It'll return lowercase "n"
df['D'] = df['B'].astype(str).str[0]
df
A B C D
0 xyz 123 x 1
1 abc 456 a 4
2 foobar 789 f 7
List Comprehension and Indexing
There is enough evidence to suggest a simple list comprehension will work well here and probably be faster.
# For string columns
df['C'] = [x[0] for x in df['A']]
# For numeric columns
df['D'] = [str(x)[0] for x in df['B']]
df
A B C D
0 xyz 123 x 1
1 abc 456 a 4
2 foobar 789 f 7
If your data has NaNs, you will need to handle this appropriately with an if/else in the list comprehension:
df2 = pd.DataFrame({'A': ['xyz', np.nan, 'foobar'], 'B': [123, 456, np.nan]})
df2
A B
0 xyz 123.0
1 NaN 456.0
2 foobar NaN
# For string columns
df2['C'] = [x[0] if isinstance(x, str) else np.nan for x in df2['A']]
# For numeric columns
df2['D'] = [str(x)[0] if pd.notna(x) else np.nan for x in df2['B']]
A B C D
0 xyz 123.0 x 1
1 NaN 456.0 NaN 4
2 foobar NaN f NaN
Let's do some timeit tests on some larger data.
df_ = df.copy()
df = pd.concat([df_] * 5000, ignore_index=True)
%timeit df.assign(C=df['A'].str[0])
%timeit df.assign(D=df['B'].astype(str).str[0])
%timeit df.assign(C=[x[0] for x in df['A']])
%timeit df.assign(D=[str(x)[0] for x in df['B']])
12 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
27.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.77 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.84 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
List comprehensions are roughly 3-4x faster here.
Given a dataframe df1 that maps ids to names:
            id
names
a       535159
b       248909
c       548731
d       362555
e       398829
f       688939
g       674128
and a second dataframe df2 which contains lists of names:
names foo
0 [a, b, c] 9
1 [d, e] 16
2 [f] 2
3 [g] 3
What would be a vectorized method to retrieve the ids from df1 for each list item in each row, like this?
names foo ids
0 [a, b, c] 9 [535159, 248909, 548731]
1 [d, e] 16 [362555, 398829]
2 [f] 2 [688939]
3 [g] 3 [674128]
This is a working method to achieve the same result using apply:
import pandas as pd
import numpy as np
mock_uids = np.random.randint(100000, 999999, size=7)
df1=pd.DataFrame({'id':mock_uids, 'names': ['a','b','c','d','e','f','g'] })
df2=pd.DataFrame({'names':[['a','b','c'],['d','e'],['f'],['g']],'foo':[9,16,2,3]})
df1 = df1.set_index('names')
def with_apply(row):
    row['ids'] = [df1.loc[name]['id'] for name in row['names']]
    return row
df2 = df2.apply(with_apply, axis=1)
I think vectorizing this is really hard; one idea for improving performance is to map via a dictionary. The solution uses if y in d so it still works when there is no match in the dictionary:
df1 = df1.set_index('names')
d = df1['id'].to_dict()
df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
If all values match:
d = df1['id'].to_dict()
df2['ids2'] = [[d[y] for y in x] for x in df2['names']]
Test for 4k rows:
np.random.seed(2020)
mock_uids = np.random.randint(100000, 999999, size=7)
df1=pd.DataFrame({'id':mock_uids, 'names': ['a','b','c','d','e','f','g'] })
df2=pd.DataFrame({'names':[['a','b','c'],['d','e'],['f'],['g']],'foo':[9,16,2,3]})
df2 = pd.concat([df2] * 1000, ignore_index=True)
df1 = df1.set_index('names')
def with_apply(row):
    row['ids'] = [df1.loc[name]['id'] for name in row['names']]
    return row
In [8]: %%timeit
...: df2.apply(with_apply, axis=1)
...:
928 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: %%timeit
...: d = df1['id'].to_dict()
...: df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
...:
4.25 ms ± 47.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %%timeit
...: df2['ids3'] = list(df1.loc[name]['id'].values for name in df2['names'])
...:
...:
1.66 s ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
One way using operator.itemgetter:
from operator import itemgetter
def listgetter(x):
    i = itemgetter(*x)(d)
    return list(i) if isinstance(i, tuple) else [i]
d = df1.set_index("names")["id"]
df2["ids"] = df2["names"].apply(listgetter)
Output:
names foo ids
0 [a, b, c] 9 [535159, 248909, 548731]
1 [d, e] 16 [362555, 398829]
2 [f] 2 [688939]
3 [g] 3 [674128]
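As a side note on the isinstance check in listgetter above: with a single key, operator.itemgetter returns a scalar rather than a 1-tuple, so the result has to be wrapped in a list. A small stand-alone demo (made-up dictionary):
from operator import itemgetter

lookup = {'a': 1, 'b': 2, 'c': 3}
print(itemgetter('a')(lookup))       # 1       -> a single key yields a scalar
print(itemgetter('a', 'b')(lookup))  # (1, 2)  -> several keys yield a tuple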
Benchmark on 100k rows:
d = df1.set_index("names")["id"]  # shared setup for all the timings
df2 = pd.concat([df2] * 25000, ignore_index=True)
%%timeit
df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
# 453 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2["ids2"] = df2["names"].apply(listgetter)
# 349 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2['ids2'] = [[d[y] for y in x] for x in df2['names']]
# 371 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This seems to work:
df2['ids'] = list(df1.loc[name]['id'].values for name in df2['names'])
Interested to know if this is the best approach.
I need to add a new column on the basis of a condition in a pandas dataframe.
Input file:
Name C2Mean C1Mean
a 2 0
b 4 2
c 6 2.5
These are the conditions:
if C1Mean == 0: log2FC = log2(C2Mean)
if C1Mean > 0:  log2FC = log2(C2Mean / C1Mean)
Based on these conditions I want to add a new column 'log2FC' like this:
Name C2Mean C1Mean log2FC
a 2 0 1
b 4 2 1
c 6 2.5 1.2630344058
The code I tried:
import pandas as pd
import numpy as np
import os
def induced_genes(rsem_exp_data):
    pwd = os.getcwd()
    data = pd.read_csv(rsem_exp_data, header=0, sep="\t")
    data['log2FC'] = [np.log2(data['C2Mean']/data['C1Mean'])\
                      if data['C2Mean'] > 0] else np.log2(data['C2Mean'])]
    print(data.head(5))
induced_genes('induced.genes')
This should work, and it's faster than apply:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Name":["a", "b", "c"], "C2Mean":[2,4,6], "C1Mean":[0, 2, 2.5]})
df["log2FC"] = np.where(df["C1Mean"]==0,
np.log2(df["C2Mean"]),
np.log2(df["C2Mean"]/df["C1Mean"]))
UPDATE: Timing
N = 10000
df = pd.DataFrame({"C2Mean":np.random.randint(0,10,N),
"C1Mean":np.random.randint(0,10,N)})
%%timeit -n10
a = np.where(df["C1Mean"] == 0,
             np.log2(df["C2Mean"]),
             np.log2(df["C2Mean"] / df["C1Mean"]))
1.06 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n10
b = df.apply(lambda x: np.log2(x["C2Mean"]/x["C1Mean"]) if x["C1Mean"] > 0
             else np.log2(x["C2Mean"]), axis=1)
248 ms ± 5.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The speed up is ~233x.
UPDATE 2: Removing the RuntimeWarning
Just add this at the beginning:
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)
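If you prefer not to silence every RuntimeWarning process-wide, a narrower option (a sketch; it only suppresses the numpy floating-point warnings raised while log2 is evaluated on zeros) is numpy's errstate context manager:
import numpy as np

with np.errstate(divide='ignore', invalid='ignore'):
    df["log2FC"] = np.where(df["C1Mean"] == 0,
                            np.log2(df["C2Mean"]),
                            np.log2(df["C2Mean"] / df["C1Mean"]))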
You can use the code below:
df = pd.DataFrame({"Name":["a", "b", "c"], "C2Mean":[2,4,6], "C1Mean":[0, 2, 2.5]})
df.head()
Name C2Mean C1Mean
a 2 0.0
b 4 2.0
c 6 2.5
df["log2FC"] = df.apply(lambda x: np.log2(x["C2Mean"]/x["C1Mean"]) if x["C1Mean"]> 0 else np.log2(x["C2Mean"]), axis=1)
df.head()
Name C2Mean C1Mean log2FC
a 2 0.0 1.000000
b 4 2.0 1.000000
c 6 2.5 1.263034
Here axis=1 means the function is applied to each row (rather than to each column).
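For intuition, a minimal illustration with hypothetical columns x and y:
import pandas as pd

df_demo = pd.DataFrame({"x": [1, 2], "y": [10, 20]})
# axis=1: the function receives one row (a Series) at a time
print(df_demo.apply(lambda row: row["x"] + row["y"], axis=1))
# axis=0 (the default): the function receives one column at a time
print(df_demo.apply(lambda col: col.sum()))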
I need to calculate a value for each row in a Pandas data frame by comparing two columns to the values of the same columns for the previous row. I was able to do this by using iloc, but it takes a really long time when applying it to over 100K rows.
I tried using a lambda, but it seems that it only returns one row or one column at a time, so I can't use it to compare multiple columns and rows at the same time.
In this example, I subtract the value of 'b' for the previous row from the value of 'b' for the current row, but only if the value of 'a' is the same for both rows.
This is the code I've been using:
import pandas as pd
df = pd.DataFrame({'a':['a','a','b','b','b'],'b':[1,2,3,4,5]})
df['increase'] = 0
for row in range(len(df)):
    if row > 0:
        if df.iloc[row]['a'] == df.iloc[row - 1]['a']:
            df.iloc[row, 2] = df.iloc[row]['b'] - df.iloc[row - 1]['b']
Is there a faster way to do the same calculation?
Thanks.
IIUC, you can use groupby + diff:
df.groupby('a').b.diff().fillna(0)
Out[193]:
0 0.0
1 1.0
2 0.0
3 1.0
4 1.0
Name: b, dtype: float64
Then assign it back:
df['increase']=df.groupby('a').b.diff().fillna(0)
df
Out[198]:
a b increase
0 a 1 0.0
1 a 2 1.0
2 b 3 0.0
3 b 4 1.0
4 b 5 1.0
Here is one solution:
df['increase'] = [0] + [(d - c) if a == b else 0 for a, b, c, d in
                        zip(df.a, df.a[1:], df.b, df.b[1:])]
Some benchmarking vs Wen's pandonic solution:
df = pd.DataFrame({'a':['a','a','b','b','b']*20000,'b':[1,2,3,4,5]*20000})
%timeit [0] + [(d - c) if a == b else 0 for a, b, c, d in zip(df.a, df.a[1:], df.b, df.b[1:])]
# 51.6 ms ± 898 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.groupby('a').b.diff().fillna(0)
# 37.8 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)