I'm doing a matrix calculation using pandas in Python.
My raw data is in the form of a list of strings (which is unique for each row):
id list_of_value
0 ['a','b','c']
1 ['d','b','c']
2 ['a','b','c']
3 ['a','b','c']
I have to calculate a score for one row against all the other rows.
Score calculation algorithm:
Step 1: Take value of id 0: ['a','b','c'],
Step 2: Find the intersection between id 0 and id 1,
resultant = ['b','c']
Step 3: Score Calculation => resultant.size / id(0).size
Repeat steps 2 and 3 between id 0 and ids 1, 2, 3, and similarly for all the other ids.
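As a small worked example (plain Python, with illustrative variable names), the single score between id 0 and id 1 would be:
row0 = ['a', 'b', 'c']              # id 0
row1 = ['d', 'b', 'c']              # id 1
resultant = set(row0) & set(row1)   # {'b', 'c'}
score = len(resultant) / len(row0)  # 2 / 3 ≈ 0.67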
Create N * N matrix:
-      0     1     2     3
0      1    0.67   1     1
1     0.67   1    0.67  0.67
2      1    0.67   1     1
3      1    0.67   1     1
At present I'm using the pandas get_dummies approach to calculate the score:
s = pd.get_dummies(df.list_of_value.explode()).sum(level=0)
s.dot(s.T).div(s.sum(1))
But there is repetition in the calculation beyond the diagonal of the matrix; calculating the scores up to the diagonal is sufficient. For example:
the scores of ID 0 only need to be calculated up to ID (row, column) (0,0); the scores for ID (row, column) (0,1), (0,2), (0,3) can be copied from ID (row, column) (1,0), (2,0), (3,0).
Detail on the calculation:
I only need to calculate the scores up to the diagonal of the matrix; the values above the diagonal are already available in the lower triangle, so I just have to transpose the lower triangle into the upper triangle.
How can I do this in pandas?
First of all, here is a profiling of your code — first all commands separately, and then as you posted them.
%timeit df.list_of_value.explode()
%timeit pd.get_dummies(s)
%timeit s.sum(level=0)
%timeit s.dot(s.T)
%timeit s.sum(1)
%timeit s2.div(s3)
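Note that these snippets assume the intermediate results are (re)assigned between the timed steps; those assignments are not shown above, but presumably look roughly like this:
s  = df.list_of_value.explode()
s  = pd.get_dummies(s)
s  = s.sum(level=0)   # s is reused for each stage
s2 = s.dot(s.T)
s3 = s.sum(1)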
The above profiling returned the following results:
Explode : 1000 loops, best of 3: 201 µs per loop
Dummies : 1000 loops, best of 3: 697 µs per loop
Sum : 1000 loops, best of 3: 1.36 ms per loop
Dot : 1000 loops, best of 3: 453 µs per loop
Sum2 : 10000 loops, best of 3: 162 µs per loop
Divide : 100 loops, best of 3: 1.81 ms per loop
Running your two lines together results in:
100 loops, best of 3: 5.35 ms per loop
Using a different approach that relies less on the (sometimes expensive) functionality of pandas, the code below takes only about a third of the time, by skipping the calculation of the upper triangular matrix and the diagonal.
import numpy as np
import pandas as pd

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(df), len(df)))

for i in range(len(df)):
    d0 = set(df.iloc[i].list_of_value)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(df)):
        df2[j, i] = len(d0.intersection(df.iloc[j].list_of_value)) / d0_len

# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]

# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(df))])
With df given as
df = pd.DataFrame(
    [[['a','b','c']],
     [['d','b','c']],
     [['a','b','c']],
     [['a','b','c']]],
    columns = ["list_of_value"])
the profiling for this code results in a running time of only 1.68ms.
1000 loops, best of 3: 1.68 ms per loop
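For reference, the resulting df2 for this sample df is:
     score0    score1    score2    score3
0  1.000000  0.666667  1.000000  1.000000
1  0.666667  1.000000  0.666667  0.666667
2  1.000000  0.666667  1.000000  1.000000
3  1.000000  0.666667  1.000000  1.000000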
UPDATE
Instead of operating on the entire DataFrame, just picking the Series that is needed gives a huge speedup.
Three methods to iterate over the entries in the Series have been tested, and all of them are more or less equal regarding the performance.
%%timeit df = pd.DataFrame([[['a','b','c']], [['d','b','c']], [['a','b','c']], [['a','b','c']]], columns = ["list_of_value"])
# %%timeit df = pd.DataFrame([[random.choices(list("abcdefghijklmnopqrstuvwxyz"), k = 15)] for _ in range(100)], columns = ["list_of_value"])

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(df), len(df)))

# get the Series from the DataFrame
dfl = df.list_of_value

for i, d0 in enumerate(dfl.values):
# for i, d0 in dfl.iteritems():   # in terms of performance about equal to the line above
# for i in range(len(dfl)):       # slightly less performant than enumerate(dfl.values)
    d0 = set(d0)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(dfl)):
        df2[j, i] = len(d0.intersection(dfl.iloc[j])) / d0_len

# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]

# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])
There are a lot of pitfalls with pandas. E.g., always access the rows of a DataFrame or Series via .iloc[0] instead of plain [0] indexing. Both can work, but .iloc[0] is much faster.
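A minimal sketch of the two access styles meant here (no timings asserted; df is the sample frame from above):
s = df["list_of_value"]   # a Series
s[0]                      # plain [0]: label-based lookup through the index
s.iloc[0]                 # .iloc[0]: explicit positional lookup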
The timings for the first matrix (4 rows, each with a list of size 3) show a speedup of about 3x:
1000 loops, best of 3: 443 µs per loop
And when using a bigger dataset I got far better results, with a speedup of over 11x:
# operating on the DataFrame
10 loops, best of 3: 565 ms per loop
# operating on the Series
10 loops, best of 3: 47.7 ms per loop
UPDATE 2
When not using pandas at all during the calculation, you get another significant speedup. To do this, simply convert the column you operate on into a list.
%%timeit df = pd.DataFrame([[['a','b','c']], [['d','b','c']], [['a','b','c']], [['a','b','c']]], columns = ["list_of_value"])
# %%timeit df = pd.DataFrame([[random.choices(list("abcdefghijklmnopqrstuvwxyz"), k = 15)] for _ in range(100)], columns = ["list_of_value"])

# convert the column of the DataFrame to a list
dfl = list(df.list_of_value)

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(dfl), len(dfl)))

for i, d0 in enumerate(dfl):
    d0 = set(d0)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(dfl)):
        df2[j, i] = len(d0.intersection(dfl[j])) / d0_len

# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]

# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])
On the data provided in the question we only see a slightly better result compared to the first update.
1000 loops, best of 3: 363 µs per loop
But when using bigger data (100 rows with lists of size 15) the advantage becomes obvious:
100 loops, best of 3: 5.26 ms per loop
Here is a comparison of all the suggested methods:
+----------+-----------------------------------------+
| | Using the Dataset from the question |
+----------+-----------------------------------------+
| Question | 100 loops, best of 3: 4.63 ms per loop |
+----------+-----------------------------------------+
| Answer | 1000 loops, best of 3: 1.59 ms per loop |
+----------+-----------------------------------------+
| Update 1 | 1000 loops, best of 3: 447 µs per loop |
+----------+-----------------------------------------+
| Update 2 | 1000 loops, best of 3: 362 µs per loop |
+----------+-----------------------------------------+
Although this question is already well answered, I will show a more readable and also very efficient alternative:
import numpy as np
import pandas as pd
from itertools import product

len_df = df.shape[0]
values = tuple(map(lambda comb: np.isin(*comb).sum() / len(comb[0]),
                   product(df['list_of_value'], repeat=2)))

pd.DataFrame(index=df['id'],
             columns=df['id'],
             data=np.array(values).reshape(len_df, len_df))
id 0 1 2 3
id
0 1.000000 0.666667 1.000000 1.000000
1 0.666667 1.000000 0.666667 0.666667
2 1.000000 0.666667 1.000000 1.000000
3 1.000000 0.666667 1.000000 1.000000
%%timeit
len_df = df.shape[0]
values = tuple(map(lambda comb: np.isin(*comb).sum() / len(comb[0]),
product(df['list_of_value'], repeat=2)))
pd.DataFrame(index=df['id'],
columns=df['id'],
data=np.array(values).reshape(len_df, len_df))
850 µs ± 18.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
# convert the column of the DataFrame to a list
dfl = list(df.list_of_value)

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(dfl), len(dfl)))

for i, d0 in enumerate(dfl):
    d0 = set(d0)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(dfl)):
        df2[j, i] = len(d0.intersection(dfl[j])) / d0_len

# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]

# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])
470 µs ± 79.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I am not inclined to change your first line, although I'm sure it could be faster, because it's not going to be the bottleneck as your data gets larger. But the second line could be, and is also extremely easy to improve:
Change this:
s.dot(s.T).div(s.sum(1))
To:
arr = s.values
np.dot(arr, arr.T) / arr[0].sum()
That's just doing it in numpy instead of pandas, but often you'll get a huge speedup. On your small sample data it will only speed up by about 2x, but if you increase your dataframe from 4 rows to 400 rows, I see a speedup of over 20x.
As an aside, I would be inclined to not worry about the triangular aspect of the problem, at least as far as speed. You have to make the code considerably more complex and you probably aren't even gaining any speed in a situation like this.
Conversely, if conserving storage space is important, then obviously retaining only the upper (or lower) triangle will cut your storage needs by slightly more than half.
(If you really do care about the triangular aspect for dimensionality, numpy does have related functions/methods, but I don't know them offhand and, again, it's not clear to me whether it's worth the extra complexity in this case.)
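For reference, one sketch of the kind of numpy helpers alluded to here is np.triu_indices, which can mirror a lower triangle into the upper triangle without loops (a minimal example, not tied to the data above):
import numpy as np

m = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.7, 0.9, 1.0]])

iu = np.triu_indices(len(m), k=1)  # strict upper-triangle indices
m[iu] = m.T[iu]                    # copy the lower triangle into the upper triangle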
Related
I was trying to find where the function crosses the line y = 0 (the x-axis). I used the fact that when the function crosses the x-axis, its sign changes.
Now, I have a dataframe like this, and I want to find the TWO rows that are closest to zero, given that the function crosses the x-axis at two points.
A value
0 105 0.662932
1 105 0.662932
2 107 0.052653 # sign changes here when A is 107
3 108 -0.228060 # among these two A 107 is closer to zero
4 110 -0.740819
5 112 -1.188906
6 142 -0.228060 # sign changes here when A is 142
7 143 0.052654 # among these two, A 143 is closer to zero
8 144 0.349638
Required output:
A value
2 107 0.052653
7 143 0.052654
import numpy as np
import pandas as pd
data = [
[105, 0.662932],
[105, 0.662932],
[107, 0.052653], # sign changes between here
[108, -0.228060], # and here; first row has `value` closer to 0
[110, -0.740819],
[112, -1.188906],
[142, -0.228060], # sign changes between here
[143, 0.052654], # and here; second row has `value` closer to 0
[144, 0.349638],
]
df = pd.DataFrame(data, columns=["A", "value"])
# where the sign is the same between two elements, the diff is 0
# otherwise, it's either 2 or -2 (doesn't matter which for this use case)
# use periods=1 and =-1 to do a diff forwards and backwards
sign = df.value.map(np.sign)
diff1 = sign.diff(periods=1).fillna(0)
diff2 = sign.diff(periods=-1).fillna(0)
# now we have the locations where sign changes occur. We just need to extract
# the `value` values at those locations to determine which of the two possibilities
# to choose for each sign change (whichever has `value` closer to 0)
df1 = df.loc[diff1[diff1 != 0].index]
df2 = df.loc[diff2[diff2 != 0].index]
idx = np.where(abs(df1.value.values) < abs(df2.value.values), df1.index.values, df2.index.values)
df.loc[idx]
A value
2 107 0.052653
7 143 0.052654
Thanks to @Vince W. for mentioning that one should use np.where; I was going with a more convoluted approach initially.
EDIT - see @user3483203's answer below, which is much faster than this. You can improve it even a bit more (2x as fast when I reran their timings) by doing the first few operations (diff, abs, compare equality) on a numpy array instead of a pandas Series. numpy's diff differs from the one in pandas, though, in that it drops the first element instead of returning NaN for it. This means we get back the index of the first row of the sign change instead of the second, and need to add one to also get the next row.
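A tiny illustration of that diff difference (just to make the off-by-one explicit):
import numpy as np
import pandas as pd

s = pd.Series([1, -1, -1, 1])
s.diff().values    # array([nan, -2.,  0.,  2.])  -> same length, NaN in front
np.diff(s.values)  # array([-2,  0,  2])          -> one element shorter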
def find_min_sign_changes(df):
    vals = df.value.values
    abs_sign_diff = np.abs(np.diff(np.sign(vals)))
    # idx of first row where the change is
    change_idx = np.flatnonzero(abs_sign_diff == 2)
    # +1 to get idx of second rows in the sign change too
    change_idx = np.stack((change_idx, change_idx + 1), axis=1)
    # now we have the locations where sign changes occur. We just need to extract
    # the `value` values at those locations to determine which of the two possibilities
    # to choose for each sign change (whichever has `value` closer to 0)
    min_idx = np.abs(vals[change_idx]).argmin(1)
    return df.iloc[change_idx[range(len(change_idx)), min_idx]]
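Calling it on the sample df gives the required output:
find_min_sign_changes(df)

     A     value
2  107  0.052653
7  143  0.052654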
You can generalize the approach using numpy:
a = df.value.values
u = np.sign(df.value)
m = np.flatnonzero(u.diff().abs().eq(2))
g = np.stack([m-1, m], axis=1)
v = np.abs(a[g]).argmin(1)
df.iloc[g[np.arange(g.shape[0]), v]]
A value
2 107 0.052653
7 143 0.052654
This solution is also going to be a lot more efficient, especially as the size scales.
In [122]: df = pd.concat([df]*100)
In [123]: %timeit chris(df)
870 µs ± 10 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [124]: %timeit nathan(df)
2.03 s ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [125]: %timeit df.loc[find_closest_to_zero_idx(df.value.values)]
1.81 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I managed to get a simple solution:
import numpy as np
import pandas as pd
data = [
[105, 0.662932],
[105, 0.662932],
[107, 0.052653], # sign changes between here
[108, -0.228060], # and here; first row has `value` closer to 0
[110, -0.740819],
[112, -1.188906],
[142, -0.228060], # sign changes between here
[143, 0.052654], # and here; second row has `value` closer to 0
[144, 0.349638],
]
df = pd.DataFrame(data, columns=["A", "value"])
Solution
def find_closest_to_zero_idx(arr):
    fx = np.zeros(len(arr))
    fy = np.array(arr)
    # lower index when sign changes in the array
    idx = np.argwhere(np.diff(np.sign(fx - fy)) != 0)

    nearest_to_zero = []
    # test which of the two values around each zero crossing is nearer to zero
    for i in range(len(idx)):
        if abs(arr[idx[i][0]]) < abs(arr[idx[i][0] + 1]):
            nearer = idx[i][0]
            nearest_to_zero.append(nearer)
        else:
            nearer = idx[i][0] + 1
            nearest_to_zero.append(nearer)
    return nearest_to_zero
idx = find_closest_to_zero_idx(df.value.values)
Result
idx = find_closest_to_zero_idx(df.value.values)
df.loc[idx]
A value
2 107 0.052653
7 143 0.052654
Slow but pure pandas method
df['value_shifted'] = df.value.shift(-1)
df['sign_changed'] = np.sign(df.value.values) * np.sign(df.value_shifted.values)
# lower index where sign changes
idx = df[df.sign_changed == -1.0].index.values

# give both the lower and upper row of each sign change the same negative
# label so that we can group by it later
for i in range(len(idx)):
    df.loc[[idx[i], idx[i] + 1], 'sign_changed'] = -1.0 * (i + 1)

df1 = df[np.sign(df.sign_changed) == -1.0]
df2 = df1.groupby('sign_changed')['value'].apply(lambda x: min(abs(x)))
df3 = df2.reset_index()
answer = df.merge(df3, on=['sign_changed', 'value'])
answer
answer
A value value_shifted sign_changed
0 107 0.052653 -0.228060 -1.0
1 143 0.052654 0.349638 -2.0
I have a data frame from which I want to get the row-wise average of the non-zero columns.
E.g., for
row 0: (1303 + 1316 + 1322 + 1315)/4
row 2: (1632 + 1628 + 1609)/3
Use replace, converting 0 to np.nan, then take the row mean:
df.replace(0,np.nan).mean(1)
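Since the original frame is not shown, here is a small hypothetical stand-in (zeros mark the missing readings) to demonstrate the idea:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1303, 0, 1632],
                   'b': [1316, 0, 1628],
                   'c': [1322, 5, 0],
                   'd': [1315, 0, 1609]})

df.replace(0, np.nan).mean(1)
# 0    1314.0
# 1       5.0
# 2    1623.0
# dtype: float64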
Use sum twice - divide the sum of all values by the count of non-zero entries (the sum of the boolean mask, where True is processed like 1):
df = df.sum(axis=1).div(df.ne(0).sum(1))
Timings:
np.random.seed(1997)
df = pd.DataFrame(np.random.randint(3, size=(1000,1000)))
#print (df)
In [60]: %timeit (df.replace(0,np.nan).mean(1))
1 loop, best of 3: 188 ms per loop
In [61]: %timeit (df.sum(axis=1).div(df.ne(0).sum(1)))
10 loops, best of 3: 21.8 ms per loop
I have the following df in pandas.
0 A B C
1 2 NaN 8
How can I check if df.iloc[1]['B'] is NaN?
I tried using df.isnull() and I get a table like this:
0 A B C
1 false true false
but I am not sure how to index the table, or whether this is an efficient way of performing the job at all.
Use pd.isnull; for selection use loc or iloc:
print (df)
0 A B C
0 1 2 NaN 8
print (df.loc[0, 'B'])
nan
a = pd.isnull(df.loc[0, 'B'])
print (a)
True
print (df['B'].iloc[0])
nan
a = pd.isnull(df['B'].iloc[0])
print (a)
True
jezrael's response is spot on. If you are only concerned with NaN values, I was exploring whether there's a faster option, since in my experience summing flat arrays is (strangely) faster than counting. This code seems faster:
df.isnull().values.any()
For example:
In [2]: df = pd.DataFrame(np.random.randn(1000,1000))
In [3]: df[df > 0.9] = pd.np.nan
In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop
In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop
In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop
In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop
If you are looking for the indexes of NaN in a specific column, you can use:
list(df['B'].index[df['B'].apply(np.isnan)])
In case you want to get the indexes of all NaN values in the dataframe, you may do the following:
row_col_indexes = list(map(list, np.where(np.isnan(np.array(df)))))
indexes = []
for i in zip(row_col_indexes[0], row_col_indexes[1]):
    indexes.append(list(i))
And if you are looking for a one liner you can use:
list(zip(*[x for x in list(map(list, np.where(np.isnan(np.array(df)))))]))
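A more compact way to get the same row/column pairs (a sketch; it yields lists rather than tuples) is:
import numpy as np

indexes = np.argwhere(df.isnull().values).tolist()   # [[row, col], ...]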
I take 3 values of the third column and put these values into a row across 3 new columns, then merge the new and old columns into a new matrix A.
Input: a time series in column 3, other values in columns 1 and 2:
[x x 1]
[x x 2]
[x x 3]
output : matrix A
[x x 1 0 0 0]
[x x 2 0 0 0]
[x x 3 1 2 3]
[x x 4 2 3 4]
For brevity, the code first generates a matrix of 6 rows / 3 columns. I want to use the last column to fill 3 extra columns and merge them into a new matrix A. This matrix A is prefilled with 2 rows of zeros to offset the starting position.
I have implemented this idea in the code below, but it takes a really long time to process large data sets.
How can I improve the speed of this conversion?
import numpy as np

matrix = np.arange(18).reshape((6, 3))
nr = 3

A = np.zeros((nr - 1, nr))
for x in range(matrix.shape[0] - nr + 1):
    newrow = np.transpose(matrix[x:x + nr, 2:3])
    A = np.vstack([A, newrow])

total = np.column_stack((matrix, A))
print(total)
Here's an approach using broadcasting to get those sliding windowed elements and then just some stacking to get A -
col2 = matrix[:,2]
nrows = col2.size-nr+1
out = np.zeros((nr-1+nrows,nr))
col2_2D = np.take(col2,np.arange(nrows)[:,None] + np.arange(nr))
out[nr-1:] = col2_2D
Here's an efficient alternative using NumPy strides to get col2_2D -
n = col2.strides[0]
col2_2D = np.lib.stride_tricks.as_strided(col2, shape=(nrows,nr), strides=(n,n))
It would be even better to initialize an output array of zeros of the same size as total and then assign values into it from col2_2D and finally from the input array matrix.
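A sketch of that suggestion, reusing the names from the snippets above (matrix, nr, col2_2D):
# build `total` directly instead of stacking A onto `matrix` afterwards
total = np.zeros((matrix.shape[0], matrix.shape[1] + nr))
total[:, :matrix.shape[1]] = matrix
total[nr - 1:, matrix.shape[1]:] = col2_2D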
Runtime test
Approaches as functions -
def org_app1(matrix, nr):
    A = np.zeros((nr - 1, nr))
    for x in range(matrix.shape[0] - nr + 1):
        newrow = np.transpose(matrix[x:x + nr, 2:3])
        A = np.vstack([A, newrow])
    return A

def vect_app1(matrix, nr):
    col2 = matrix[:, 2]
    nrows = col2.size - nr + 1
    out = np.zeros((nr - 1 + nrows, nr))
    col2_2D = np.take(col2, np.arange(nrows)[:, None] + np.arange(nr))
    out[nr - 1:] = col2_2D
    return out

def vect_app2(matrix, nr):
    col2 = matrix[:, 2]
    nrows = col2.size - nr + 1
    out = np.zeros((nr - 1 + nrows, nr))
    n = col2.strides[0]
    col2_2D = np.lib.stride_tricks.as_strided(col2,
                                              shape=(nrows, nr), strides=(n, n))
    out[nr - 1:] = col2_2D
    return out
Timings and verification -
In [18]: # Setup input array and params
...: matrix = np.arange(1800).reshape((60, 30))
...: nr=3
...:
In [19]: np.allclose(org_app1(matrix,nr),vect_app1(matrix,nr))
Out[19]: True
In [20]: np.allclose(org_app1(matrix,nr),vect_app2(matrix,nr))
Out[20]: True
In [21]: %timeit org_app1(matrix,nr)
1000 loops, best of 3: 646 µs per loop
In [22]: %timeit vect_app1(matrix,nr)
10000 loops, best of 3: 20.6 µs per loop
In [23]: %timeit vect_app2(matrix,nr)
10000 loops, best of 3: 21.5 µs per loop
In [28]: # Setup input array and params
...: matrix = np.arange(7200).reshape((120, 60))
...: nr=30
...:
In [29]: %timeit org_app1(matrix,nr)
1000 loops, best of 3: 1.19 ms per loop
In [30]: %timeit vect_app1(matrix,nr)
10000 loops, best of 3: 45 µs per loop
In [31]: %timeit vect_app2(matrix,nr)
10000 loops, best of 3: 27.2 µs per loop
Let's say I have a DataFrame with four columns, each of which has a threshold value against which I'd like to compare the DataFrame's values.
For each element, I would simply like the minimum of the DataFrame's value and the corresponding column's threshold.
For example:
df = pd.DataFrame(np.random.randn(100,4), columns=list('ABCD'))
>>> df.head()
A B C D
0 -2.060410 -1.390896 -0.595792 -0.374427
1 0.660580 0.726795 -1.326431 -1.488186
2 -0.955792 -1.852701 -0.895178 -1.353669
3 -1.002576 -0.321210 1.711597 -0.063274
4 1.217197 0.202063 -1.407561 0.940371
thresholds = pd.Series({'A': 1, 'B': 1.1, 'C': 1.2, 'D': 1.3})
This solution works (A at row 4 and C at row 3 were capped at their thresholds), but there must be an easier way:
df_filtered = df.lt(thresholds).multiply(df) + df.gt(thresholds).multiply(thresholds)
>>> df_filtered.head()
A B C D
0 -2.060410 -1.390896 -0.595792 -0.374427
1 0.660580 0.726795 -1.326431 -1.488186
2 -0.955792 -1.852701 -0.895178 -1.353669
3 -1.002576 -0.321210 1.200000 -0.063274
4 1.000000 0.202063 -1.407561 0.940371
Ideally, I'd like to use .loc to filter in place, but I haven't managed to figure it out. I'm using Pandas 0.14.1 (and can't upgrade).
RESPONSE
Below are the timed tests of my initial proposal against the alternatives:
%%timeit
df.lt(thresholds).multiply(df) + df.gt(thresholds).multiply(thresholds)
1000 loops, best of 3: 990 µs per loop
%%timeit
np.minimum(df, thresholds) # <--- Simple, fast, and returns DataFrame!
10000 loops, best of 3: 110 µs per loop
%%timeit
df[df < thresholds].fillna(thresholds)
1000 loops, best of 3: 1.36 ms per loop
This is pretty fast (and returns a dataframe):
np.minimum( df, [1.0,1.1,1.2,1.3] )
A pleasant surprise that numpy is so amenable to this without any reshaping or explicit conversions...
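As an aside (assuming a newer pandas than the 0.14.1 the question is pinned to), the same result can also be expressed as a clip along the columns:
df.clip(upper=thresholds, axis=1)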
How about:
df[df < thresholds].fillna(thresholds)