Hi, I have the following DataFrame:
import pandas as pd
d = {'col1': [0.02, 0.12, -0.1, -0.07, 0.01]}
df = pd.DataFrame(data=d)
df['new'] = ''
df['new'].iloc[0] = 100
df
I want to calculate (beginning at row 1) the value in column 'new' as the previous value of 'new' divided by (the value of 'col1' + 1).
For example, in row 1, column 'new': 100/(0.12+1) = 89.286
and in row 2, column 'new': 89.286/(-0.10+1) = 99.206
and so on.
I already tried to use a lambda function, without success. Thanks for any help.
Try this:
df['new'].iloc[0] = 100
for i in range(1,df.shape[0]):
    prev = df['new'].iloc[i-1]
    df['new'].iloc[i] = prev/(df['col1'].iloc[i]+1)
Output:
   col1      new
0  0.02      100
1  0.12  89.2857
2 -0.10  99.2063
3 -0.07  106.673
4  0.01  105.617
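A note on the loop above: the chained assignment df['new'].iloc[i] = ... can raise SettingWithCopyWarning and will not write back once copy-on-write is enabled (the default in pandas 3). A minimal sketch of the same recurrence that builds the values in a plain list first, assuming the same starting value of 100:
vals = [100.0]
for x in df['col1'].iloc[1:]:
    vals.append(vals[-1] / (x + 1))
df['new'] = vals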
I think numba is the way to go for loops here if performance is important:
from numba import jit
d = {'col1': [0.02, 0.12, -0.1, -0.07, 0.01]}
df = pd.DataFrame(data=d)
df.loc[0, 'new'] = 100
@jit(nopython=True)
def f(a, b):
    for i in range(1, a.shape[0]):
        a[i] = a[i-1] / (b[i] + 1)
    return a
df['new'] = f(df['new'].to_numpy(), df['col1'].to_numpy())
print (df)
col1 new
0 0.02 100.000000
1 0.12 89.285714
2 -0.10 99.206349
3 -0.07 106.673494
4 0.01 105.617321
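As an aside, @jit(nopython=True) is equivalent to numba's @njit shorthand, and the first call to f pays the JIT compilation cost, so it is worth calling the function once before timing it.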
Performance for 5000 rows:
d = {'col1': [0.02, 0.12, -0.1, -0.07, 0.01]}
df = pd.DataFrame(data=d)
df = pd.concat([df] * 1000, ignore_index=True)
In [168]: %timeit df['new'] = f(df['new'].to_numpy(), df['col1'].to_numpy())
277 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [169]: %%timeit
...: for i in range(1,df.shape[0]):
...:     prev = df['new'].iloc[i-1]
...:     df['new'].iloc[i] = prev/(df['col1'].iloc[i]+1)
...:
1.31 s ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [170]: %%timeit
...: for i_row, row in df.iloc[1:, ].iterrows():
...:     df.loc[i_row, 'new'] = df.loc[i_row - 1, 'new'] / (row['col1'] + 1)
...:
2.08 s ± 93.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I don't see any vectorized solution. Here's a plain loop:
df['new'] = 100
for i_row, row in df.iloc[1:, ].iterrows():
    df.loc[i_row, 'new'] = df.loc[i_row - 1, 'new'] / (row['col1'] + 1)
I'm considering a multiindexed (i,j,k) DataFrame with one column that contains three categorical variables A, B or C.
I want to compute the frequency of the variables for each i over all (j,k). I have a solution, but I think there exists a more pythonic and efficient way of doing it.
The code for a MWE reads (in reality len(I)*len(J)*len(K) is large, in the millions, say):
import pandas as pd
import numpy as np
I = range(10)
J = range(3)
K = range(2)
data = pd.DataFrame(
    np.random.randint(0, 3, size=len(I)*len(J)*len(K)),
    index=pd.MultiIndex.from_product([I, J, K]),
    columns=['cat']
)
data.index.names = ['i', 'j', 'k']
data[data['cat'] == 0] = 'A'
data[data['cat'] == 1] = 'B'
data[data['cat'] == 2] = 'C'
data = data.unstack(['j', 'k'])
result = data.apply(lambda x: x.value_counts(), axis=1).fillna(0) / (len(J)*len(K))
You could use groupby, and also normalize your value_counts:
data.groupby(level=0)['cat'].value_counts(normalize=True).unstack(level=1).fillna(0)
To compare, first let's make the dummy data big (60 million rows):
import pandas as pd
import numpy as np
I = range(100000)
J = range(30)
K = range(20)
data = pd.DataFrame(
    np.random.randint(0, 3, size=len(I)*len(J)*len(K)),
    index=pd.MultiIndex.from_product([I, J, K]),
    columns=['cat']
)
data.index.names = ['i', 'j', 'k']
data[data['cat'] == 0] = 'A'
data[data['cat'] == 1] = 'B'
data[data['cat'] == 2] = 'C'
Timing your original method:
data_interim = data.unstack(['j', 'k'])
data_interim.apply(lambda x: x.value_counts(), axis=1).fillna(0) / (len(J)*len(K))
gives (on my machine) 1min 24s ± 1.98 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
the alternative:
data.groupby(level=0)['cat'].value_counts(normalize=True).unstack(level=1).fillna(0)
gives (on my machine) 8.86 s ± 216 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Baseline for s_pike's method:
%%timeit
(data.groupby(level=0)['cat']
.value_counts(normalize=True)
.unstack(level=1)
.fillna(0))
6.41 s ± 243 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If they're truly categorical, you can get a lot of benefit out of explicitly making the column categorical, and then using either of the methods below.
They're both still about twice as fast as the baseline without the categorical dtype, but become about 3x as fast when the column is made categorical.
data['cat'] = data['cat'].astype('category')
%%timeit
(data.groupby(level=0, as_index=False)['cat']
.value_counts(normalize=True)
.pivot(index='i', columns='cat', values='proportion'))
1.82 s ± 91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(x := data.pivot_table(index='i', columns='cat', aggfunc='value_counts')).div(x.sum(1), 0)
1.8 s ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Outputs:
cat A B C
i
0 0.341667 0.318333 0.340000
1 0.311667 0.388333 0.300000
2 0.351667 0.350000 0.298333
3 0.363333 0.333333 0.303333
4 0.326667 0.350000 0.323333
... ... ... ...
99995 0.315000 0.313333 0.371667
99996 0.323333 0.351667 0.325000
99997 0.305000 0.353333 0.341667
99998 0.318333 0.341667 0.340000
99999 0.331667 0.340000 0.328333
[100000 rows x 3 columns]
Given a dataframe df1 that maps names to ids:
           id
names
a      535159
b      248909
c      548731
d      362555
e      398829
f      688939
g      674128
and a second dataframe df2 which contains lists of names:
names foo
0 [a, b, c] 9
1 [d, e] 16
2 [f] 2
3 [g] 3
What would be the vectorized method to retrieve the ids from df1 for each list item in each row, like this?
names foo ids
0 [a, b, c] 9 [535159, 248909, 548731]
1 [d, e] 16 [362555, 398829]
2 [f] 2 [688939]
3 [g] 3 [674128]
This is a working method to achieve the same result using apply:
import pandas as pd
import numpy as np
mock_uids = np.random.randint(100000, 999999, size=7)
df1=pd.DataFrame({'id':mock_uids, 'names': ['a','b','c','d','e','f','g'] })
df2=pd.DataFrame({'names':[['a','b','c'],['d','e'],['f'],['g']],'foo':[9,16,2,3]})
df1 = df1.set_index('names')
def with_apply(row):
    row['ids'] = [df1.loc[name]['id'] for name in row['names']]
    return row
df2 = df2.apply(with_apply, axis=1)
I think vectorizing this is really hard; one idea to improve performance is to map by dictionary. This solution uses if y in d, so it also works if there is no match in the dictionary:
df1 = df1.set_index('names')
d = df1['id'].to_dict()
df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
If all values match:
d = df1['id'].to_dict()
df2['ids2'] = [[d[y] for y in x] for x in df2['names']]
Test for 4k rows:
np.random.seed(2020)
mock_uids = np.random.randint(100000, 999999, size=7)
df1=pd.DataFrame({'id':mock_uids, 'names': ['a','b','c','d','e','f','g'] })
df2=pd.DataFrame({'names':[['a','b','c'],['d','e'],['f'],['g']],'foo':[9,16,2,3]})
df2 = pd.concat([df2] * 1000, ignore_index=True)
df1 = df1.set_index('names')
def with_apply(row):
    row['ids'] = [df1.loc[name]['id'] for name in row['names']]
    return row
In [8]: %%timeit
...: df2.apply(with_apply, axis=1)
...:
928 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: %%timeit
...: d = df1['id'].to_dict()
...: df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
...:
4.25 ms ± 47.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %%timeit
...: df2['ids3'] = list(df1.loc[name]['id'].values for name in df2['names'])
...:
...:
1.66 s ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
One way using operator.itemgetter:
from operator import itemgetter
def listgetter(x):
    i = itemgetter(*x)(d)
    return list(i) if isinstance(i, tuple) else [i]
d = df1.set_index("names")["id"]
df2["ids"] = df2["names"].apply(listgetter)
Output:
names foo ids
0 [a, b, c] 9 [535159, 248909, 548731]
1 [d, e] 16 [362555, 398829]
2 [f] 2 [688939]
3 [g] 3 [674128]
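A note on the design: itemgetter(*x)(d) returns a single value when x has one element and a tuple otherwise, which is why listgetter normalizes the result to a list.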
Benchmark on 100k rows:
d = df1.set_index("names")["id"]  # lookup Series shared by the methods below
df2 = pd.concat([df2] * 25000, ignore_index=True)
%%timeit
df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
# 453 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2["ids2"] = df2["names"].apply(listgetter)
# 349 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2['ids2'] = [[d[y] for y in x] for x in df2['names']]
# 371 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
this seems to work:
df2['ids'] = list(df1.loc[name]['id'].values for name in df2['names'])
interested to know if this is the best approach
I want to parse row values as column names and use them to look up values in a pandas DataFrame.
I tried iterrows and .loc indexing without success.
import pandas as pd
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
Build a toy dataset:
coltable = StringIO("""NA;NB;NC;ND;pair;desired_result
10;60;50;20;NANB;70
20;30;10;5;NANC;30
40;30;20;10;NCND;30
""")
df = pd.read_csv(coltable, sep=";")
I want to access the columns named in pair (e.g. in the first row, NA=10 and NB=60) and use those values to create a new column (desired_result = 10 + 60 = 70).
I want the function that creates the new column to be compatible with np.vectorize, as the dataset is huge.
Something like this:
df['newcol'] = np.vectorize(myfunc)(pair=df['pair'])
thanks a lot for any assistance you can give!
Use DataFrame.lookup:
a = df.lookup(df.index, df['pair'].str[:2])
b = df.lookup(df.index, df['pair'].str[2:])
df['new'] = a + b
print (df)
NA NB NC ND pair desired_result new
0 10 60 50 20 NANB 70 70
1 20 30 10 5 NANC 30 30
2 40 30 20 10 NCND 30 30
Also, if no missing values are possible, you can build the column labels with a list comprehension or apply:
#repeat dataframe 10000 times
df = pd.concat([df] * 10000, ignore_index=True)
In [263]: %%timeit
...: a = df.lookup(df.index, df['pair'].str[:2])
...: b = df.lookup(df.index, df['pair'].str[2:])
...:
...: df['new'] = a + b
...:
59.5 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [264]: %%timeit
...: a = df.lookup(df.index, [x[:2] for x in df['pair']])
...: b = df.lookup(df.index, [x[2:] for x in df['pair']])
...:
...: df['new'] = a + b
...:
60.8 ms ± 963 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [265]: %%timeit
...: a = df.lookup(df.index, df['pair'].apply(lambda x: x[:2]))
...: b = df.lookup(df.index, df['pair'].apply(lambda x: x[2:]))
...:
...: df['new'] = a + b
...:
...:
56.6 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
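A caveat for newer pandas: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on a current version the snippets above need a replacement. A minimal sketch using Index.get_indexer and plain NumPy indexing, assuming every label derived from pair exists as a column:
import numpy as np
def lookup(frame, row_labels, col_labels):
    # positional row/column indices for the requested labels
    rows = frame.index.get_indexer(row_labels)
    cols = frame.columns.get_indexer(col_labels)
    return frame.to_numpy()[rows, cols]
a = lookup(df, df.index, df['pair'].str[:2])
b = lookup(df, df.index, df['pair'].str[2:])
df['new'] = a + b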
Let's say I have the following dataframe:
df = pd.DataFrame([[0.1,0],[0.2,1],[0.3,1],[0.4,0]], columns = ['score', 'correct_pred'])
score correct_pred
0 0.1 0
1 0.2 1
2 0.3 1
3 0.4 0
And for each row I would like to compute the proportion of rows with a score below it, and the proportion of correct_pred among rows with a score equal or above it.
That is, for the second row for instance, 25% of the rows have a score below 0.2 and 66% of rows equal or above 0.2 have a correct pred. The output would then look like:
threshold percentage_filtered percentage_correct_pred
0.1 0 0.5
0.2 0.25 0.66
0.3 0.5 0.5
0.4 0.75 0
So far I do it using this piece of code:
out = pd.DataFrame(columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred'])
for threshold in df.score:
    threshold_mask = df.score < threshold
    out.loc[len(out)] = [threshold,
                         np.mean(threshold_mask),
                         df[~threshold_mask].correct_pred.mean()]
Which works, but it is terribly slow on a real-size dataframe. So I need a faster version, I suspect there is a more vectorized method, maybe using numpy.cumsum or something?
I will assume that score may have repeated values, but if it does not it would also work (although it could be simpler). This is a way to get that result:
import pandas as pd
import numpy as np
df = pd.DataFrame([[0.1, 0], [0.2, 1], [0.3, 1], [0.4, 0]],
columns=['score', 'correct_pred'])
# Group by scores and count occurrences and number of correct predictions
df2 = (df.sort_values('score')
.groupby('score')['correct_pred']
.agg(['count', 'sum'])
.reset_index())
# Percentage of values below each threshold
perc_filtered = df2['count'].shift(1).fillna(0).cumsum() / df2['count'].sum()
# Percentage of values above each threshold with correct prediction
perc_correct_pred = df2['sum'][::-1].cumsum()[::-1] / df2['count'][::-1].cumsum()[::-1]
# Assemble result
result = pd.concat([df2['score'], perc_filtered, perc_correct_pred], axis=1)
result.columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred']
print(result)
# threshold percentage_filtered percentage_correct_pred
# 0 0.1 0.00 0.500000
# 1 0.2 0.25 0.666667
# 2 0.3 0.50 0.500000
# 3 0.4 0.75 0.000000
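To see what the reversed cumsums are doing on the toy data: after sorting by score, df2['sum'] is [0, 1, 1, 0] and df2['count'] is [1, 1, 1, 1]; reversing, accumulating and reversing back gives [2, 2, 1, 0] correct predictions at or above each threshold out of [4, 3, 2, 1] rows at or above it, i.e. [0.5, 0.667, 0.5, 0].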
Performance:
np.random.seed(123)
df = pd.DataFrame({'score': np.arange(0, 1, 0.0005),
'correct_pred':np.random.choice([1,0], size=2000)
})
print (df)
score correct_pred
0 0.0000 1
1 0.0005 0
2 0.0010 1
3 0.0015 1
4 0.0020 1
... ...
1995 0.9975 0
1996 0.9980 0
1997 0.9985 1
1998 0.9990 1
1999 0.9995 1
[2000 rows x 2 columns]
In [208]: %timeit do_it_jdehesa()
9.57 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [209]: %timeit do_it()
5.83 s ± 181 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [210]: %timeit do_it1()
3.21 s ± 203 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [211]: %timeit do_it2()
92.5 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Let's improve the runtime by a factor of 10.
For reference:
df = pd.DataFrame([[0.1,0],[0.2,1],[0.3,1],[0.4,0]], columns = ['score', 'correct_pred'])
def do_it():
    out = pd.DataFrame(columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred'])
    for threshold in df.score:
        threshold_mask = df.score < threshold
        out.loc[len(out)] = [threshold,
                             np.mean(threshold_mask),
                             df[~threshold_mask].correct_pred.mean()]
%timeit do_it()
13 ms ± 607 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
First, we move all calls to pandas methods out of the loop:
def do_it1():
    score_values = df.score.values
    score_list = list(set(score_values))
    correct_pred = df.correct_pred.values
    out = pd.DataFrame(columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred'])
    for threshold in score_list:
        mask = score_values < threshold
        out.loc[len(out)] = [threshold,
                             np.mean(mask),
                             np.mean(correct_pred[~mask])]
%timeit do_it1()
9.67 ms ± 331 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Then we also create the DataFrame only after collecting the results:
def do_it2():
    score_values = df.score.values
    score_list = list(set(score_values))
    correct_pred = df.correct_pred.values
    result = []
    for threshold in score_list:
        mask = score_values < threshold
        result.append((threshold, np.mean(mask), np.mean(correct_pred[~mask])))
    out = pd.DataFrame(result, columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred'])
%timeit do_it2()
960 µs ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
EDIT:
To take jdehesa's answer into account:
df = pd.DataFrame([[0.1, 0], [0.2, 1], [0.3, 1], [0.4, 0]],
columns=['score', 'correct_pred'])
def do_it_jdehesa():
    # Group by scores and count occurrences and number of correct predictions
    df2 = (df.sort_values('score')
             .groupby('score')['correct_pred']
             .agg(['count', 'sum'])
             .reset_index())
    # Percentage of values below each threshold
    perc_filtered = df2['count'].shift(1).fillna(0).cumsum() / df2['count'].sum()
    # Percentage of values above each threshold with correct prediction
    perc_correct_pred = df2['sum'][::-1].cumsum()[::-1] / df2['count'][::-1].cumsum()[::-1]
    # Assemble result
    result = pd.concat([df2['score'], perc_filtered, perc_correct_pred], axis=1)
    result.columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred']
%timeit do_it_jdehesa()
13.5 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
EDIT2: Just optimizing the function a little more, though nowhere near as fast as jdehesa's answer:
def do_it5():
    dfarray = df.values
    n = dfarray.shape[0]
    score_values = dfarray[:, 0]
    score_list = np.unique(score_values)
    correct_pred = dfarray[:, 1]
    result = []
    for threshold in score_list:
        mask = score_values < threshold
        result.append((threshold,
                       np.count_nonzero(mask) / n,
                       np.count_nonzero(correct_pred[~mask]) / np.count_nonzero(~mask)))
    result = pd.DataFrame(result, columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred'])
I'm looking for a solution that doesn't involve an .apply or lambda function that loops through the list and adds the string at the desired index. I have a column like this with many entries:
df = pd.DataFrame(["1:77631829:-:1:77641672:-"], columns=["position"])
position
0 1:77631829:-:1:77641672:-
I'd like:
position
0 chr1:77631829:-:chr1:77641672:-
So, insert "chr" at the beginning and after the third colon.
I would have thought something like this would do, but insert hasn't been implemented for Series:
"chr" + df["position"].str.split(":").insert(3, "chr").str.join(":")
This does it, but looks inefficient:
"chr" + df["position"].str.split(":").str[:3].str.join(":") + "chr" + df["position"].str.split(":").str[3:].str.join(":")
I think you can split on the first 3 occurrences of :, then separate the head and tail of each list: join the head, prepend "chr" to it, insert ":chr" before the tail, and append the result to list L:
df = pd.DataFrame(["1:77631829:-:1:77641672:-","1:77631829:-:1:77641672:-"],
columns=["position"])
print (df)
position
0 1:77631829:-:1:77641672:-
1 1:77631829:-:1:77641672:-
L = []
for x in df["position"]:
    *i, j = x.split(':', 3)
    L.append("chr" + ':'.join(i) + ":chr" + j)
df['new'] = L
print (df)
position new
0 1:77631829:-:1:77641672:- chr1:77631829:-:chr1:77641672:-
1 1:77631829:-:1:77641672:- chr1:77631829:-:chr1:77641672:-
Hack solution with str.replace:
'chr' + df['position'].str.replace('-:', '-:chr')
Faster with list comprehension and f-strings:
df['new'] = [f"ch{x.replace('-:', '-:chr')}" for x in df['position']]
Performance:
df = pd.DataFrame(["1:77631829:-:1:77641672:-","1:77631829:-:1:77641672:-"],
columns=["position"])
#[20000 rows x 1 columns]
df = pd.concat([df] * 10000, ignore_index=True)
In [226]: %%timeit
...: L = []
...: for x in df["position"]:
...:     *i, j = x.split(':', 3)
...:     L.append("chr" + ':'.join(i) + ":chr" + j)
...:
...: df['new1'] = L
...:
18.9 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [227]: %%timeit
...: df['new2'] = "chr" + df["position"].str.split(":").str[:3].str.join(":") + "chr" + df["position"].str.split(":").str[3:].str.join(":")
...:
50.8 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [228]: %%timeit
...: df['new3'] = 'chr' + df['position'].str.replace('-:', '-:chr')
...:
21.5 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [229]: %%timeit
...: df['new4'] = [f"chr{x.replace('-:', '-:chr')}" for x in df['position']]
...:
8.59 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)