Pandas: change values in a column based on values in another column - Python

I have a dataframe in which one column holds some data and the other column holds indices of elements that I want to delete from that data. So starting from this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'data':[np.arange(1,5),np.arange(3)],'to_delete': [np.array([2]),np.array([0,2])]})
df
>>>>
           data to_delete
0  [1, 2, 3, 4]       [2]
1     [0, 1, 2]    [0, 2]
This is what I want to end up with:
new_df
>>>>
        data to_delete
0  [1, 2, 4]       [2]
1        [1]    [0, 2]
I could iterate over the rows by hand and calculate the new data for each one like this:
new_data = []
for _, v in df.iterrows():
    foo = np.delete(v['data'], v['to_delete'])
    new_data.append(foo)
df.assign(data=new_data)
but I'm looking for a better way to do this.

The overhead from calling a numpy function for each row will really hurt performance here. I'd suggest you go with plain lists instead:
df['data'] = [[j for ix, j in enumerate(i[0]) if ix not in i[1]]
              for i in df.values]
print(df)
        data to_delete
0  [1, 2, 4]       [2]
1        [1]    [0, 2]
Timings on a 20K row dataframe:
df_large = pd.concat([df]*10000, axis=0)
%timeit [[j for ix, j in enumerate(i[0]) if ix not in i[1]] for i in df_large.values]
# 184 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
new_data = []
for _, v in df_large.iterrows():
    foo = np.delete(v['data'], v['to_delete'])
    new_data.append(foo)
# 5.44 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_large.apply(lambda row: np.delete(row["data"], row["to_delete"]), axis=1)
# 5.29 s ± 340 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
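The same comprehension can also iterate the two columns by name via zip, which avoids relying on the column order inside df.values (my variant, not part of the original answer; the result should be identical):
df['data'] = [[v for ix, v in enumerate(data) if ix not in skip]
              for data, skip in zip(df['data'], map(set, df['to_delete']))]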

You should use the apply function in order to apply a function to every row in the dataframe:
df["data"] = df.apply(lambda row: np.delete(row["data"], row["to_delete"]), axis=1)

Another solution, based on starmap:
This solution uses a lesser-known tool from the itertools module called starmap.
Check its docs; it's worth a try!
import pandas as pd
import numpy as np
from itertools import starmap

df = pd.DataFrame({'data': [np.arange(1, 5), np.arange(3)],
                   'to_delete': [np.array([2]), np.array([0, 2])]})

# Solution:
df2 = df.copy()
A = list(starmap(lambda v, l: np.delete(v, l),
                 zip(df['data'], df['to_delete'])))
df2['data'] = pd.DataFrame(zip(A))
df2
prints out:
        data to_delete
0  [1, 2, 4]       [2]
1        [1]    [0, 2]
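Since A is already a plain list with one array per row, the final assignment can also be written more directly (a small simplification on my part, not part of the original answer):
df2['data'] = A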

Related

Combine two columns in pandas dataframe but in specific order

For example, I have a dataframe where two of the columns are "Zeroes" and "Ones" that contain only zeroes and ones, respectively. If I combine them into one column I get first all the zeroes, then all the ones.
I want to combine them in a way that I get each element from both columns, not all elements from the first column and all elements from the second column. So I don't want the result to be [0, 0, 0, 1, 1, 1], I need it to be [0, 1, 0, 1, 0, 1].
I process 100K+ rows of data. What is the fastest or optimal way to achieve this?
Thanks in advance!
Try:
import pandas as pd
df = pd.DataFrame({ "zeroes" : [0, 0, 0], "ones": [1, 1, 1], "some_other" : list("abc")})
res = df[["zeroes", "ones"]].to_numpy().ravel(order="C")
print(res)
Output
[0 1 0 1 0 1]
Micro-Benchmarks
import pandas as pd
from itertools import chain
df = pd.DataFrame({ "zeroes" : [0] * 10_000, "ones": [1] * 10_000})
%timeit df[["zeroes", "ones"]].to_numpy().ravel(order="C").tolist()
672 µs ± 8.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit [v for vs in zip(df["zeroes"], df["ones"]) for v in vs]
2.57 ms ± 54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(chain.from_iterable(zip(df["zeroes"], df["ones"])))
2.11 ms ± 73 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can use ndarray.flatten() as an alternative, like below:
import numpy as np
import pandas as pd
df[["zeroes", "ones"]].to_numpy().flatten()
Benchmark (running on Colab):
df = pd.DataFrame({ "zeroes" : [0] * 10_000_000, "ones": [1] * 10_000_000})
%timeit df[["zeroes", "ones"]].to_numpy().flatten().tolist()
1 loop, best of 5: 320 ms per loop
%timeit df[["zeroes", "ones"]].to_numpy().ravel(order="C").tolist()
1 loop, best of 5: 322 ms per loop
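As an aside (my note, not part of the original answers): ravel returns a view when the data is already contiguous, while flatten always returns a copy, so ravel can avoid one copy; here the cost is likely dominated by to_numpy() and tolist(), which is why the two timings are nearly identical. A quick check:
import numpy as np

a = np.array([[0, 1, 2], [3, 4, 5]])
print(a.ravel().base is a)    # True  -> ravel returned a view, no copy
print(a.flatten().base is a)  # False -> flatten always returns a copy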
I don't know if this is the optimal solution, but it should solve your case.
# Toy frame: column 0 holds the zeroes, column 1 holds the ones
df = pd.DataFrame([[0 for x in range(10)], [1 for x in range(10)]]).T
# Pair the values up row by row, then flatten the pairs
l = [[x, y] for x, y in zip(df[0], df[1])]
l = [x for y in l for x in y]
l
This may help you: Alternate elements of different columns using Pandas
# df1 and df2 here are the two single-column frames from the linked question
pd.concat(
    [df1, df2], axis=1
).stack().reset_index(1, drop=True).to_frame('C').rename(index='CC{}'.format)
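The same stack idea can be applied directly to the two example columns from earlier in this thread (a sketch, assuming the df with "zeroes" and "ones" defined above):
interleaved = df[["zeroes", "ones"]].stack().to_numpy()
print(interleaved)  # [0 1 0 1 0 1]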

pandas better runtime, going through a dataframe

I have a pandas dataframe where I want to search one column for numbers matching a pattern and put the matches in a new column.
import pandas as pd
import regex as re
import numpy as np

data = {'numbers': ['134.ABBC,189.DREB, 134.TEB', '256.EHBE, 134.RHECB, 345.DREBE',
                    '456.RHN,256.REBN,864.TREBNSE', '256.DREB, 134.ETNHR,245.DEBHTECM'],
        'rate': [434, 456, 454256, 2334544]}
df = pd.DataFrame(data)
print(df)

pattern = '134.[A-Z]{2,}'
df['mynumbers'] = None
index_numbers = df.columns.get_loc('numbers')
index_mynumbers = df.columns.get_loc('mynumbers')
length = np.array([])
for row in range(0, len(df)):
    number = re.findall(pattern, df.iat[row, index_numbers])
    df.iat[row, index_mynumbers] = number
print(df)
That gives me my numbers, e.g. mynumbers = [['134.ABBC', '134.TEB'], ['134.RHECB'], [], ['134.ETNHR']]. My dataframe is huge. Is there a better, faster method in pandas for going through it?
Sure, use Series.str.findall instead of loops:
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].str.findall(pattern)
print(df)
numbers rate mynumbers
0 134.ABBC,189.DREB, 134.TEB 434 [134.ABBC, 134.TEB]
1 256.EHBE, 134.RHECB, 345.DREBE 456 [134.RHECB]
2 456.RHN,256.REBN,864.TREBNSE 454256 []
3 256.DREB, 134.ETNHR,245.DEBHTECM 2334544 [134.ETNHR]
If you want to use re.findall, that is possible too, only about 2 times slower:
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].map(lambda x: re.findall(pattern, x))
# [40000 rows]
df = pd.concat([df] * 10000, ignore_index=True)
pattern = '134.[A-Z]{2,}'
In [46]: %timeit df['numbers'].map(lambda x: re.findall(pattern, x))
50 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [47]: %timeit df['numbers'].str.findall(pattern)
21.2 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
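One small caveat (my observation, not raised in the answer above): in the pattern '134.[A-Z]{2,}' the dot is an unescaped regex metacharacter that matches any character, so a string like '1349ABC' would also match. Escaping it makes the intent explicit:
pattern = r'134\.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].str.findall(pattern)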

Split and replace all strings in a pandas dataframe

I have a large dataframe where each row contains a string.
I want to split each string into several columns, and also replace two character types.
The code below does the job, but it is slow on a large dataframe. Is there a faster way than using a for loop?
import re
import pandas as pd
df = pd.DataFrame(['[3.4, 3.4, 2.5]', '[3.4, 3.4, 2.5]'])
df_new = pd.DataFrame({'col1': [0,0], 'col2': [0,0], 'col3': [0,0]})
for i in range(df.shape[0]):
    df_new.iloc[i, :] = re.split(',', df.iloc[i, 0].replace('[', '').replace(']', ''))
You can do it with:
import pandas as pd
df = pd.DataFrame(['[3.4, 3.4, 2.5]', '[3.4, 3.4, 2.5]'])
df_new = df[0].str[1:-1].str.split(",", expand=True)
df_new.columns = ["col1", "col2", "col3"]
The idea is to first get rid of the [ and ] and then split by , and expand the dataframe. The last step would be to rename the columns.
Your solution can be simplified with Series.str.strip and Series.str.split:
df1 = df[0].str.strip('[]').str.split(', ', expand=True).add_prefix('col')
print(df1)
col0 col1 col2
0 3.4 3.4 2.5
1 3.4 3.4 2.5
If performance is important, use a list comprehension instead of the pandas string functions:
df1 = pd.DataFrame([x.strip('[]').split(', ') for x in df[0]]).add_prefix('col')
Timings:
#20k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [208]: %timeit df[0].str.strip('[]').str.split(', ', expand=True).add_prefix('col')
61.5 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [209]: %timeit pd.DataFrame([x.strip('[]').split(', ') for x in df[0]]).add_prefix('col')
29.8 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
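Note that either way the split produces string columns; if the values are needed as numbers, a final cast does the conversion (my addition, assuming the df1 from above):
df1 = df1.astype(float)
print(df1.dtypes)  # col0/col1/col2 are now float64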

How to transform a sparse pandas dataframe to a 2d numpy array

I have a dataframe df containing the columns x, y (both starting at 0) and several value columns. The x and y coordinates are not complete: many x-y combinations are missing, and sometimes entire x or y values are missing. I would like to create a 2-d numpy array with the complete matrix of shape (df.x.max() + 1, df.y.max() + 1), with missing values replaced by np.nan. pd.pivot comes quite close already, but does not fill in completely missing x/y values.
The following code already achieves what is needed, but due to the for loop, this is rather slow:
img = np.full((df.x.max() + 1, df.y.max() + 1), np.nan)
col = 'value'
for ind, line in df.iterrows():
    img[line.x, line.y] = line[col]
A significantly faster version goes as follows:
ind = pd.MultiIndex.from_product((range(df.x.max() + 1), range(df.y.max() + 1)), names=['x', 'y'])
s_img = pd.Series([np.nan] * len(ind), index=ind, name='value')
temp = df.set_index(['x', 'y'])['value']
s_img.loc[temp.index] = temp
img = s_img.unstack().values
The question is whether a vectorized method exists which might make the code shorter and faster.
Thanks for any hints in advance!
Often the fastest way to populate a NumPy array is simply to allocate an array and then assign values
to it using a vectorized operator or function. In this case, np.put seems ideal since it allows you to assign values using a (flat) array of indices and an array of values.
nrows, ncols = df['x'].max() + 1, df['y'].max() +1
img = np.full((nrows, ncols), np.nan)
ind = df['x']*ncols + df['y']
np.put(img, ind, df['value'])
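An equivalent assignment (my variant, not benchmarked in the original answer) uses 2-D fancy indexing instead of computing flat indices by hand:
img = np.full((nrows, ncols), np.nan)
img[df['x'].to_numpy(), df['y'].to_numpy()] = df['value'].to_numpy()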
Here is a benchmark which shows using np.put can be 82x faster than alt (the unstacking method)
for making a (100, 100)-shaped resultant array:
In [184]: df = make_df(100,100)
In [185]: %timeit orig(df)
161 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [186]: %timeit alt(df)
31.2 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [187]: %timeit using_put(df)
378 µs ± 1.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [188]: 31200/378
Out[188]: 82.53968253968254
This is the setup used for the benchmark:
import numpy as np
import pandas as pd
def make_df(nrows, ncols):
    df = pd.DataFrame(np.arange(nrows*ncols).reshape(nrows, ncols))
    df.index.name = 'x'
    df.columns.name = 'y'
    ind_x = np.random.choice(np.arange(nrows), replace=False, size=nrows//2)
    ind_y = np.random.choice(np.arange(ncols), replace=False, size=ncols//2)
    df = df.drop(ind_x, axis=0).drop(ind_y, axis=1).stack().reset_index().rename(columns={0: 'value'})
    return df

def orig(df):
    img = np.full((df.x.max() + 1, df.y.max() + 1), np.nan)
    col = 'value'
    for ind, line in df.iterrows():
        img[line.x, line.y] = line['value']
    return img

def alt(df):
    ind = pd.MultiIndex.from_product((range(df.x.max() + 1), range(df.y.max() + 1)), names=['x', 'y'])
    s_img = pd.Series([np.nan]*len(ind), index=ind, name='value')
    temp = df.set_index(['x', 'y'])['value']
    s_img.loc[temp.index] = temp
    img = s_img.unstack().values
    return img

def using_put(df):
    nrows, ncols = df['x'].max() + 1, df['y'].max() + 1
    img = np.full((nrows, ncols), np.nan)
    ind = df['x']*ncols + df['y']
    np.put(img, ind, df['value'])
    return img
Alternatively, since your DataFrame is sparse, you might be interested in creating a sparse matrix:
import scipy.sparse as sparse

def using_coo(df):
    nrows, ncols = df['x'].max() + 1, df['y'].max() + 1
    result = sparse.coo_matrix(
        (df['value'], (df['x'], df['y'])), shape=(nrows, ncols), dtype='float64')
    return result
As one would expect, making sparse matrices (from sparse data) is even faster (and requires less memory) than creating dense NumPy arrays:
In [237]: df = make_df(100,100)
In [238]: %timeit using_put(df)
381 µs ± 2.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [239]: %timeit using_coo(df)
196 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [240]: 381/196
Out[240]: 1.9438775510204083
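One caveat with the sparse route (my note, not part of the original answer): densifying the result fills the missing (x, y) positions with 0 rather than np.nan:
dense = using_coo(df).toarray()   # missing entries become 0.0, not NaN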

How to vectorize (make use of pandas/numpy) instead of using a nested for loop

I wish to efficiently use pandas (or numpy) instead of a nested for loop with an if statement to solve a particular problem. Here is a toy version:
Suppose I have the following two DataFrames
import pandas as pd
import numpy as np
dict1 = {'vals': [100,200], 'in': [0,1], 'out' :[1,3]}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': [500,800,300,200], 'in': [0.1,0.5,2,4], 'out' :[0.5,2,4,5]}
df2 = pd.DataFrame(data=dict2)
Now I wish to loop through each row of each dataframe and multiply the vals if a particular condition is met. This code does what I want:
ans = []
for i in range(len(df1)):
    for j in range(len(df2)):
        if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
            ans.append(df1['vals'][i]*df2['vals'][j])
np.sum(ans)
However, this is clearly very inefficient, and in reality my DataFrames can have millions of entries, which makes this unusable. I am also not making use of pandas' or numpy's efficient vectorized implementations. Does anyone have an idea how to vectorize this nested loop efficiently?
I feel like this code is something akin to matrix multiplication so could progress be made utilising outer? It's the if condition that I'm finding hard to wedge in, as the if logic needs to compare each entry in df1 against all entries in df2.
You can also use a compiler like Numba to do this job. This would also outperform the vectorized solution and doesn't need a temporary array.
Example
import numba as nb
import numpy as np
import pandas as pd
import time
@nb.njit(fastmath=True, parallel=True, error_model='numpy')
def your_function(df1_in, df1_out, df1_vals, df2_in, df2_out, df2_vals):
    sum = 0.
    for i in nb.prange(len(df1_in)):
        for j in range(len(df2_in)):
            if (df1_in[i] <= df2_out[j] and df1_out[i] >= df2_in[j]):
                sum += df1_vals[i]*df2_vals[j]
    return sum
Testing
dict1 = {'vals': np.random.randint(1, 100, 1000),
         'in': np.random.randint(1, 10, 1000),
         'out': np.random.randint(1, 10, 1000)}
df1 = pd.DataFrame(data=dict1)

dict2 = {'vals': np.random.randint(1, 100, 1500),
         'in': 5*np.random.random(1500),
         'out': 5*np.random.random(1500)}
df2 = pd.DataFrame(data=dict2)

# First call has some compilation overhead
res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                    df2['in'].values, df2['out'].values, df2['vals'].values)

t1 = time.time()
for i in range(1000):
    res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                        df2['in'].values, df2['out'].values, df2['vals'].values)
print(time.time() - t1)
Timings
vectorized solution @AGN Gazer: 9.15 ms
parallelized Numba version: 0.7 ms

m1 = np.less_equal.outer(df1['in'], df2['out'])
m2 = np.greater_equal.outer(df1['out'], df2['in'])
m = np.logical_and(m1, m2)
v12 = np.outer(df1['vals'], df2['vals'])
print(v12[m].sum())
Or, replace first three lines with this long line:
m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
s = np.outer(df1['vals'], df2['vals'])[m].sum()
For very large problems, dask is recommended.
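If memory becomes the limiting factor before speed does, the value outer product never has to be materialized: since sum_ij m_ij * v1_i * v2_j = v1 . (m @ v2), the masked sum can be computed with two matrix-vector products over the boolean mask (a sketch of mine, reusing m from above):
v1 = df1['vals'].to_numpy()
v2 = df2['vals'].to_numpy()
total = v1 @ (m @ v2)   # same value as np.outer(v1, v2)[m].sum()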
Timing Tests:
Here is a timing comparison when using 1000 and 1500-long arrays:
In [166]: dict1 = {'vals': np.random.randint(1,100,1000), 'in': np.random.randint(1,10,1000), 'out': np.random.randint(1,10,1000)}
...: df1 = pd.DataFrame(data=dict1)
...:
...: dict2 = {'vals': np.random.randint(1,100,1500), 'in': 5*np.random.random(1500), 'out': 5*np.random.random(1500)}
...: df2 = pd.DataFrame(data=dict2)
Author's original method (Python loops):
In [167]: def f(df1, df2):
     ...:     ans = []
     ...:     for i in range(len(df1)):
     ...:         for j in range(len(df2)):
     ...:             if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
     ...:                 ans.append(df1['vals'][i]*df2['vals'][j])
     ...:     return np.sum(ans)
     ...:

In [168]: %timeit f(df1, df2)
47.3 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
@Ben.T method:
In [170]: %timeit df2['ans'] = df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) & (df1['out'] >= row['in'])].sum()*row['vals'], 1); df2['ans'].sum()
2.22 s ± 40.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Vectorized solution proposed here:
In [171]: def g(df1, df2):
     ...:     m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
     ...:     return np.outer(df1['vals'], df2['vals'])[m].sum()
     ...:

In [172]: %timeit g(df1, df2)
7.81 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Your answer:
471 µs ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 1 (3+ times slower):
df1.apply(lambda row: list((df2['vals'][(row['in'] <= df2['out']) & (row['out'] >= df2['in'])] * row['vals'])), axis=1).sum()
1.56 ms ± 7.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 2 (2+ times slower):
ans = []
for name, row in df1.iterrows():
    _in = row['in']
    _out = row['out']
    _vals = row['vals']
    ans.append(df2['vals'].loc[(df2['in'] <= _out) & (df2['out'] >= _in)].values * _vals)
1.01 ms ± 8.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 3 (3+ times faster):
df1_vals = df1.values
ans = np.zeros(shape=(len(df1_vals), len(df2.values)))
for i in range(df1_vals.shape[0]):
    df2_vals = df2.values
    df2_vals[:, 2][~np.logical_and(df1_vals[i, 1] >= df2_vals[:, 0], df1_vals[i, 0] <= df2_vals[:, 1])] = 0
    ans[i, :] = df2_vals[:, 2] * df1_vals[i, 2]
144 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In Method 3 you can view the solution by performing:
ans[ans.nonzero()]
Out[]: array([ 50000.,  80000., 160000.,  60000.])
I wasn't able to think of a way to remove the underlying loop :( but I learnt a lot about numpy in the process! (yay for learning)

One way to do it is by using apply: create a column in df2 containing the sum of the vals in df1 that meet your criteria on in and out, multiplied by the vals of that row of df2:
df2['ans'] = df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) &
                                               (df1['out'] >= row['in'])].sum()*row['vals'], 1)
then just sum this column
df2['ans'].sum()
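On the toy df1/df2 from the question this gives 350000, matching the four nonzero products (50000 + 80000 + 160000 + 60000) listed in the answer above:
print(df2['ans'].sum())   # 350000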
