I want to parse row values as column names and use them to look up values in a pandas DataFrame.
I tried iterrows and .loc indexing without success.
import pandas as pd
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
# build toy dataset
coltable = StringIO("""NA;NB;NC;ND;pair;desired_result
10;60;50;20;NANB;70
20;30;10;5;NANC;30
40;30;20;10;NCND;30
""")
df = pd.read_csv(coltable, sep=";")
I want to access the column elements of the pair (e.g. first row: NA=10 and NB=60) and use those values to create a new column (desired_result = 10 + 60 = 70).
I want the function that creates the new column to be compatible with np.vectorize, as the dataset is huge.
Something like this:
df['newcol'] = np.vectorize(myfunc)(pair=df['pair'])
thanks a lot for any assistance you can give!
Use DataFrame.lookup:
a = df.lookup(df.index, df['pair'].str[:2])
b = df.lookup(df.index, df['pair'].str[2:])
df['new'] = a + b
print (df)
   NA  NB  NC  ND  pair  desired_result  new
0  10  60  50  20  NANB              70   70
1  20  30  10   5  NANC              30   30
2  40  30  20  10  NCND              30   30
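Note: DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0. If it is not available in your version, a rough equivalent with numpy integer indexing (a sketch, assuming the two halves of pair always name one of the four value columns):
import numpy as np
# restrict to the value columns so the array is numeric
vals = df[['NA', 'NB', 'NC', 'ND']]
rows = np.arange(len(df))
idx_a = vals.columns.get_indexer(df['pair'].str[:2])   # column position of the first name
idx_b = vals.columns.get_indexer(df['pair'].str[2:])   # column position of the second name
arr = vals.to_numpy()
df['new'] = arr[rows, idx_a] + arr[rows, idx_b]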
Also, if no missing values are possible, you can use a list comprehension or apply:
#repeat dataframe 10000 times
df = pd.concat([df] * 10000, ignore_index=True)
In [263]: %%timeit
...: a = df.lookup(df.index, df['pair'].str[:2])
...: b = df.lookup(df.index, df['pair'].str[2:])
...:
...: df['new'] = a + b
...:
59.5 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [264]: %%timeit
...: a = df.lookup(df.index, [x[:2] for x in df['pair']])
...: b = df.lookup(df.index, [x[2:] for x in df['pair']])
...:
...: df['new'] = a + b
...:
60.8 ms ± 963 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [265]: %%timeit
...: a = df.lookup(df.index, df['pair'].apply(lambda x: x[:2]))
...: b = df.lookup(df.index, df['pair'].apply(lambda x: x[2:]))
...:
...: df['new'] = a + b
...:
...:
56.6 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
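If you specifically want the np.vectorize call signature from the question, one sketch is to pass the value columns in explicitly (myfunc and its argument names here are made up for illustration). Keep in mind np.vectorize still loops in Python under the hood, so the lookup approaches above are worth benchmarking against it:
import numpy as np
def myfunc(pair, na, nb, nc, nd):
    # pick the two values named by the 'pair' string, e.g. 'NANB' -> NA + NB
    vals = {'NA': na, 'NB': nb, 'NC': nc, 'ND': nd}
    return vals[pair[:2]] + vals[pair[2:]]
df['newcol'] = np.vectorize(myfunc)(df['pair'], df['NA'], df['NB'], df['NC'], df['ND'])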
Related
Basically, I am performing a simple operation to update 100 columns of my dataframe of size 550 rows x 2700 columns.
I am updating 100 columns like this:
df["col1"] = df["static"]-df["col1"])/df["col1"]*100
df["col2"] = (df["static"]-df["col2"])/df["col2"]*100
df["col3"] = (df["static"]-df["col3"])/df["col3"]*100
....
....
df["col100"] = (df["static"]-df["col100"])/df["col100"]*100
This operation is taking 170 ms in my original dataframe. I want to speed up the time. I am doing some real-time thing, so time is important.
You can select all the columns at once, subtract from the right side with DataFrame.rsub, and then divide with DataFrame.div, restricted to the columns in the list cols (append .mul(100) if you also need the * 100 scaling from the question):
cols = [f'col{c}' for c in range(1, 101)]
df[cols] = df[cols].rsub(df['static'], axis=0).div(df[cols], axis=0)
Performance:
np.random.seed(2022)
df=pd.DataFrame(np.random.randint(1001, size=(550,2700))).add_prefix('col')
df = df.rename(columns={'col0':'static'})
In [58]: %%timeit
...: for i in range(1, 101):
...: df[f"col{i}"] = (df["static"]-df[f"col{i}"])/df[f"col{i}"]*100
...:
59.9 ms ± 630 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [59]: %%timeit
...: cols = [f'col{c}' for c in range(1, 101)]
...: df[cols] = df[cols].rsub(df['static'], axis=0).div(df[cols], axis=0)
...:
11.9 ms ± 55.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
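A plain numpy broadcasting variant may also be worth timing (a sketch; note the * 100 to match the question's formula):
import numpy as np
cols = [f'col{c}' for c in range(1, 101)]
vals = df[cols].to_numpy()
static = df['static'].to_numpy()[:, None]   # column vector so it broadcasts across the 100 columns
df[cols] = (static - vals) / vals * 100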
Given a dataframe df1 that maps names to ids:
            id
names
a       535159
b       248909
c       548731
d       362555
e       398829
f       688939
g       674128
and a second dataframe df2 which contains lists of names:
       names  foo
0  [a, b, c]    9
1     [d, e]   16
2        [f]    2
3        [g]    3
What would be a vectorized method to retrieve the ids from df1 for each list item in each row, like this?
       names  foo                       ids
0  [a, b, c]    9  [535159, 248909, 548731]
1     [d, e]   16          [362555, 398829]
2        [f]    2                  [688939]
3        [g]    3                  [674128]
This is a working method to achieve the same result using apply:
import pandas as pd
import numpy as np
mock_uids = np.random.randint(100000, 999999, size=7)
df1=pd.DataFrame({'id':mock_uids, 'names': ['a','b','c','d','e','f','g'] })
df2=pd.DataFrame({'names':[['a','b','c'],['d','e'],['f'],['g']],'foo':[9,16,2,3]})
df1 = df1.set_index('names')
def with_apply(row):
    row['ids'] = [df1.loc[name]['id'] for name in row['names']]
    return row
df2 = df2.apply(with_apply, axis=1)
I think vectorizing this is really hard. One idea to improve performance is to map with a dictionary; this solution uses if y in d so it also works when there is no match in the dictionary:
df1 = df1.set_index('names')
d = df1['id'].to_dict()
df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
If all values match:
d = df1['id'].to_dict()
df2['ids2'] = [[d[y] for y in x] for x in df2['names']]
Test for 4k rows:
np.random.seed(2020)
mock_uids = np.random.randint(100000, 999999, size=7)
df1=pd.DataFrame({'id':mock_uids, 'names': ['a','b','c','d','e','f','g'] })
df2=pd.DataFrame({'names':[['a','b','c'],['d','e'],['f'],['g']],'foo':[9,16,2,3]})
df2 = pd.concat([df2] * 1000, ignore_index=True)
df1 = df1.set_index('names')
def with_apply(row):
    row['ids'] = [df1.loc[name]['id'] for name in row['names']]
    return row
In [8]: %%timeit
...: df2.apply(with_apply, axis=1)
...:
928 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: %%timeit
...: d = df1['id'].to_dict()
...: df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
...:
4.25 ms ± 47.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %%timeit
...: df2['ids3'] = list(df1.loc[name]['id'].values for name in df2['names'])
...:
...:
1.66 s ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
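Another option that stays at the pandas level is to explode the lists, map against df1['id'], and re-aggregate. It is not necessarily faster than the dictionary comprehension, but it avoids building the dictionary (a sketch, with df1 already indexed by 'names' and ids4 as an arbitrary new column name):
# explode keeps the original row index, so we can group back on it
s = df2['names'].explode().map(df1['id'])
df2['ids4'] = s.groupby(level=0).agg(list)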
One way using operator.itemgetter:
from operator import itemgetter
def listgetter(x):
    i = itemgetter(*x)(d)
    return list(i) if isinstance(i, tuple) else [i]
d = df1.set_index("names")["id"]
df2["ids"] = df2["names"].apply(listgetter)
Output:
       names  foo                       ids
0  [a, b, c]    9  [535159, 248909, 548731]
1     [d, e]   16          [362555, 398829]
2        [f]    2                  [688939]
3        [g]    3                  [674128]
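The isinstance check in listgetter is needed because itemgetter with a single key returns a bare value rather than a tuple:
from operator import itemgetter
d = {'a': 535159, 'b': 248909, 'f': 688939}
itemgetter('a', 'b')(d)   # (535159, 248909) -> tuple
itemgetter('f')(d)        # 688939 -> scalar, so it has to be wrapped in a list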
Benchmark on 100k rows:
d = df1.set_index("names")["id"]  # common setup for all approaches below
df2 = pd.concat([df2] * 25000, ignore_index=True)
%%timeit
df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
# 453 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2["ids2"] = df2["names"].apply(listgetter)
# 349 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2['ids2'] = [[d[y] for y in x] for x in df2['names']]
# 371 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This seems to work:
df2['ids'] = list(df1.loc[name]['id'].values for name in df2['names'])
Interested to know if this is the best approach.
Hi, I have the following DataFrame:
import pandas as pd
d = {'col1': [0.02,0.12,-0.1,-0.07,0.01]}
df = pd.DataFrame(data=d)
df['new'] = ''
df['new'].iloc[0] = 100
df
Beginning in row 1, I want to calculate in column 'new' the previous value divided by (the value of 'col1' + 1).
For example, in row 1, column 'new': 100 / (0.12 + 1) = 89.2857
In row 2: 89.2857 / (-0.10 + 1) = 99.2063
and so on.
I already tried to use a lambda function, without success. Thanks for your help!
Try this:
df['new'].iloc[0] = 100
for i in range(1,df.shape[0]):
    prev = df['new'].iloc[i-1]
    df['new'].iloc[i] = prev/(df['col1'].iloc[i]+1)
Output:
   col1      new
0  0.02      100
1  0.12  89.2857
2 -0.10  99.2063
3 -0.07  106.673
4  0.01  105.617
I think numba is the way to handle loops here if performance is important:
from numba import jit
d = {'col1': [0.02,0.12,-0.1,-0.07,0.01]}
df = pd.DataFrame(data=d)
df.loc[0, 'new'] = 100
@jit(nopython=True)
def f(a, b):
    for i in range(1, a.shape[0]):
        a[i] = a[i-1] / (b[i] +1)
    return a
df['new'] = f(df['new'].to_numpy(), df['col1'].to_numpy())
print (df)
col1 new
0 0.02 100.000000
1 0.12 89.285714
2 -0.10 99.206349
3 -0.07 106.673494
4 0.01 105.617321
Performance for 5000 rows:
d = {'col1': [0.02,0.12,-0.1,-0.07,0.01]}
df = pd.DataFrame(data=d)
df = pd.concat([df] * 1000, ignore_index=True)
In [168]: %timeit df['new'] = f(df['new'].to_numpy(), df['col1'].to_numpy())
277 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [169]: %%timeit
...: for i in range(1,df.shape[0]):
 ...:     prev = df['new'].iloc[i-1]
 ...:     df['new'].iloc[i] = prev/(df['col1'].iloc[i]+1)
...:
1.31 s ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [170]: %%timeit
...: for i_row, row in df.iloc[1:, ].iterrows():
 ...:     df.loc[i_row, 'new'] = df.loc[i_row - 1, 'new'] / (row['col1'] + 1)
...:
2.08 s ± 93.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I don't see any vectorized solution. Here's a plain loop:
df['new'] = 100
for i_row, row in df.iloc[1:, ].iterrows():
    df.loc[i_row, 'new'] = df.loc[i_row - 1, 'new'] / (row['col1'] + 1)
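For the record, this particular recurrence can be vectorized: each value is the previous one divided by (1 + col1), so the whole column is 100 divided by a cumulative product (a sketch, assuming 1 + col1 is never zero):
import numpy as np
factors = (df['col1'] + 1).to_numpy()   # fresh array, safe to modify
factors[0] = 1.0                        # row 0 keeps the starting value of 100
df['new'] = 100 / np.cumprod(factors)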
I'm looking for a solution that doesn't involve an .apply or lambda function that loops through the list and adds the string at the desired index. I have a column like this with many entries:
df = pd.DataFrame(["1:77631829:-:1:77641672:-"], columns=["position"])
position
0 1:77631829:-:1:77641672:-
I'd like:
position
0 chr1:77631829:-:chr1:77641672:-
So insert "chr" at beginning and after third colon :
I would have thought something like this would do, but insert hasn't been implemented in series:
"chr" + df["position"].str.split(":").insert(3, "chr").str.join(":")
This does it, but looks inefficient:
"chr" + df["position"].str.split(":").str[:3].str.join(":") + "chr" + df["position"].str.split(":").str[3:].str.join(":")
I think you can split on at most 3 occurrences of :, then take the head and tail of the list: join the head, prepend "chr", add ":chr" plus the tail, and append the result to a list L:
df = pd.DataFrame(["1:77631829:-:1:77641672:-","1:77631829:-:1:77641672:-"],
columns=["position"])
print (df)
position
0 1:77631829:-:1:77641672:-
1 1:77631829:-:1:77641672:-
L = []
for x in df["position"]:
    *i, j = x.split(':', 3)
    L.append("chr" + ':'.join(i) + ":chr" + j)
df['new'] = L
print (df)
                    position                              new
0  1:77631829:-:1:77641672:-  chr1:77631829:-:chr1:77641672:-
1  1:77631829:-:1:77641672:-  chr1:77631829:-:chr1:77641672:-
Hack solution with str.replace:
'chr' + df['position'].str.replace('-:', '-:chr')
Faster with list comprehension and f-strings:
df['new'] = [f"ch{x.replace('-:', '-:chr')}" for x in df['position']]
Performance:
df = pd.DataFrame(["1:77631829:-:1:77641672:-","1:77631829:-:1:77641672:-"],
columns=["position"])
df = pd.concat([df] * 10000, ignore_index=True)
#[20000 rows x 1 columns]
In [226]: %%timeit
...: L = []
...: for x in df["position"]:
 ...:     *i, j = x.split(':', 3)
 ...:     L.append("chr" + ':'.join(i) + ":chr" + j)
...:
...: df['new1'] = L
...:
18.9 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [227]: %%timeit
...: df['new2'] = "chr" + df["position"].str.split(":").str[:3].str.join(":") + "chr" + df["position"].str.split(":").str[3:].str.join(":")
...:
50.8 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [228]: %%timeit
...: df['new3'] = 'chr' + df['position'].str.replace('-:', '-:chr')
...:
21.5 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [229]: %%timeit
...: df['new4'] = [f"ch{x.replace('-:', '-:chr')}" for x in df['position']]
...:
8.59 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
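If you prefer the split/join idea from the question but want to split only once, a pandas-only variant looks like this (a sketch, not included in the timings above; new5 is just an arbitrary column name):
parts = df['position'].str.split(':', n=3)
df['new5'] = 'chr' + parts.str[:3].str.join(':') + ':chr' + parts.str[3]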
I would like to calculate the mean of age excluding the value 99. In real life the dataframe is much bigger, and I have other possible variables.
Is there a more efficient way (faster or more elegant) to do it? Maybe with a pivot table or group by, or a function?
data = {'age': [99,45,34,32,34,67,5,6,7,8,3,5]}
df = pd.DataFrame(data, columns = ['age'])
not99 = df['age'] != 99
mean_for_age = df.loc[not99, 'age'].mean()
A numpy solution is faster - first create the array and then filter it:
arr = df['age'].values
not99 = arr != 99
mean_for_age = arr[not99].mean()
But if you need a general solution that also allows selecting another column, use your original approach:
not99 = df['age'] != 99
mean_for_age = df.loc[not99, 'age'].mean()
mean_for_another_col = df.loc[not99, 'another col'].mean()
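If 99 is a sentinel value that can appear in several columns, another general option (a sketch) is to mask it to NaN and let mean skip missing values:
# per-column means, with 99 treated as missing
means = df.mask(df == 99).mean()
mean_for_age = means['age']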
Timings (depends on the data; best to test with real data):
data = {'age': [99,45,34,32,34,67,5,6,7,8,3,5]}
df = pd.DataFrame(data, columns = ['age'])
df = pd.concat([df] * 10000, ignore_index=True)
In [14]: %%timeit
...: arr = df['age'].values
...: not99 = arr != 99
...:
...: mean_for_age = arr[not99].mean()
...:
496 µs ± 36.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [15]: %%timeit
...: not99 = df['age'] != 99
...: mean_for_age = df.loc[not99, 'age'].mean()
...:
1.82 ms ± 40.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]: %%timeit
...: df.query("age != 99")['age'].mean()
...:
4.26 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)