What would be a more elegant way of writing:
df[df['income'] > 0].count()['income']
I would like to simply count the number of column values meeting a condition (in this example the condition is just being greater than zero, but I want an approach that works for any arbitrary condition or set of conditions). It would obviously be more elegant if the column name did not have to appear twice in the expression. This should hopefully be easy.
df = pd.DataFrame([0, 30000, 75000, -300, 23000], columns=['income'])
print(df)
income
0 0
1 30000
2 75000
3 -300
4 23000
If you would like to count values in a column meeting a slightly more complex condition than just being positive, for example "value is in the range from 5000 to 25000", you can use two methods.
First, using boolean indexing,
((df['income'] > 5000) & (df['income'] < 25000)).sum()
Second, applying a function to every element of the series,
df['income'].map(lambda x: 5000 < x < 25000).sum()
Note that the second approach allows arbitrarily complex conditions but is much slower than the first approach, which uses low-level operations on the underlying arrays. See the documentation on boolean indexing for more information.
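For the original question (counting the values greater than zero), the same boolean-indexing idea gives a one-liner in which the column name appears only once; a minimal sketch using the example frame from the question:
(df['income'] > 0).sum()   # a boolean mask sums as 0/1, so this is the number of matching rows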
I have a pandas dataframe named 'matrix', it looks like this:
antecedent_sku consequent_sku similarity
0 001 002 0.3
1 001 003 0.2
2 001 004 0.1
3 001 005 0.4
4 002 001 0.4
5 002 003 0.5
6 002 004 0.1
Out of this dataframe I want to create a similarity matrix for further clustering. I do it in two steps.
Step 1: to create an empty similarity matrix ('similarity')
set_name = set(matrix['antecedent_sku'].values)
similarity = pd.DataFrame(index = list(set_name), columns = list(set_name))
Step 2: to fill it with values from 'matrix':
for ind in tqdm(list(similarity.index)):
    for col in list(similarity.columns):
        if ind==col:
            similarity.loc[ind, col] = 1
        elif len(matrix.loc[(matrix['antecedent_sku'].values==f'{ind}') & (matrix['consequent_sku'].values==f'{col}'), 'similarity'].values) < 1:
            similarity.loc[ind, col] = 0
        else:
            similarity.loc[ind, col] = matrix.loc[(matrix['antecedent_sku'].values==f'{ind}') & (matrix['consequent_sku'].values==f'{col}'), 'similarity'].values[0]
The problem: it takes 4 hours to fill a matrix of shape (3000,3000).
The question: what am I doing wrong? Should I aim at speeding up the code with something like Cython/Numba, or does the problem lie in the architecture of my approach, so that I should use built-in functions or some other clever way to transform 'matrix' into 'similarity' instead of a double loop?
P.S. I run Python 3.8.7
Iterating over a pandas dataframe using loc is known to be very slow, the CPython interpreter is slow at explicit loops, and every pandas operation has a high overhead. The main point, however, is that you iterate over 3000x3000 elements, and for each element you evaluate things like matrix['antecedent_sku'].values==f'{ind}', which itself iterates over 3000 items that are strings, an inefficient datatype (the processor needs to parse a variable-length UTF-8 sequence of multiple characters). Since this is done twice per iteration, and a new string is built from the integer for each comparison, this means 3000*3000*3000*2 = 54_000_000_000 string comparisons are performed, with overall 3000*3000*3000*2*2*3 = 324_000_000_000 characters to (inefficiently) compare! There is no chance this can be fast, since it is very inefficient. Not to mention that each of the 9_000_000 iterations creates/deletes several temporary arrays and Pandas objects.
The first thing to do is to reduce the number of recomputed operations with some precomputation. Indeed, you can store the values of matrix['antecedent_sku'].values==f'{ind}' (as NumPy arrays, since pandas series are inefficient) in a dictionary indexed by ind, so you can fetch them quickly inside the loop. This should make this part about 3000 times faster (since there should be only 3000 distinct items). Even better: you can use a groupby to do that more efficiently.
Moreover, you can convert the columns to integers (i.e. antecedent_sku and consequent_sku) to avoid many expensive string comparisons.
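A minimal sketch of these two ideas combined (the variable names are illustrative and it assumes matrix as shown in the question): groupby builds, in one pass, a dictionary mapping each antecedent_sku to the row positions where it occurs, so the inner loop no longer rescans the whole column.
import numpy as np

matrix['antecedent_sku'] = matrix['antecedent_sku'].astype(int)   # integers instead of strings
matrix['consequent_sku'] = matrix['consequent_sku'].astype(int)

rows_by_ant = matrix.groupby('antecedent_sku').indices            # {sku: array of row positions}
ant = matrix['antecedent_sku'].to_numpy()
con = matrix['consequent_sku'].to_numpy()
sim = matrix['similarity'].to_numpy()

# Inside the double loop, the lookup for a pair (ind, col) then becomes:
# pos = rows_by_ant.get(ind, np.array([], dtype=int))
# hits = pos[con[pos] == col]
# similarity.loc[ind, col] = sim[hits[0]] if hits.size else 0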
Then you can remove unnecessary operations like matrix.loc[..., 'similarity'].values. Indeed, since you only want to know the length of the result, you can use np.sum on the boolean NumPy array; in fact, you can even use np.any, since you only check whether the length is less than 1.
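As a small illustration of that point, assuming ant and con are the integer NumPy arrays of the two SKU columns from the sketch above (an assumption):
mask = (ant == ind) & (con == col)   # boolean NumPy array for the current pair
if not np.any(mask):                 # cheaper than materialising .values and calling len()
    similarity.loc[ind, col] = 0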
Then you can avoid the creation of temporary NumPy arrays by using a preallocated buffer and specifying the output buffer in NumPy operations. For example, you can use np.logical_and(A, B, out=your_preallocated_buffer) instead of just A & B.
Finally, if (and only if) all the previous steps are not enough to make the overall computation hundreds or thousands of times faster, you can use Numba, converting your dataframe to NumPy arrays first (since Numba does not support dataframes). If this is still not enough, you can use prange (instead of range) and Numba's parallel=True flag to use multiple threads.
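A hedged sketch of that last route (it assumes the SKUs have already been encoded as integer codes 0..n-1, for example with pd.factorize; the function and array names are made up for illustration):
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def fill_similarity(ant_codes, con_codes, sims, n):
    out = np.zeros((n, n))
    for k in range(len(ant_codes)):   # scatter the known (antecedent, consequent) similarities
        out[ant_codes[k], con_codes[k]] = sims[k]
    for i in prange(n):               # set the diagonal to 1 in parallel
        out[i, i] = 1.0
    return out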
Please note that Pandas is not really designed to manipulate dataframes with 3000 columns and will certainly not be very fast because of that. NumPy is better suited for manipulating matrices.
Following Jerome's lead with a dictionary, I've done the following:
Step 1: to create a dictionary
matrix_dict = matrix.copy()
matrix_dict = matrix_dict.set_index(['antecedent_sku', 'consequent_sku'])['similarity'].to_dict()
matrix_dict looks like this:
{(001, 002): 0.3}
Step 2: to fill similarity with values from matrix_dict
for ind in tqdm(list(similarity.index)):
    for col in list(similarity.columns):
        if ind==col:
            similarity.loc[ind, col] = 1
        else:
            similarity.loc[ind, col] = matrix_dict.get((int(ind), int(col)))
Step 3: fillna with zeroes
similarity = similarity.fillna(0)
Result: a 35x speedup (from 4 hours 20 minutes down to 7 minutes).
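For what it is worth, the double loop could also be replaced entirely by built-in reshaping; this is not part of the answers above, just a possible sketch under the same assumptions (at most one similarity value per (antecedent, consequent) pair):
import numpy as np

pairs = matrix.set_index(['antecedent_sku', 'consequent_sku'])['similarity']
similarity = pairs.unstack(fill_value=0)                        # wide matrix; missing pairs become 0
skus = sorted(set(matrix['antecedent_sku']) | set(matrix['consequent_sku']))
similarity = similarity.reindex(index=skus, columns=skus, fill_value=0)
similarity = similarity.mask(np.eye(len(skus), dtype=bool), 1)  # force a diagonal of ones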
I have a huge dataset, where I'm trying to reduce the dimensionality by removing the variables that fulfill these two conditions:
Count of unique values in a feature / sample size < 10%
Count of most common value / Count of second most common value > 20 times
The first condition is no problem; the second condition is where I'm stuck. I'm trying to be as efficient as possible because of the size of the dataset, and I'm trying to use numpy since I understand it is faster than pandas. A possible solution is numpy-most-efficient-frequency-counts-for-unique-values-in-an-array, but I'm having too much trouble getting the counts of the two most common values.
My attempt:
n = df.shape[0]/10
variable = []
condition_1 = []
condition_2 = []
for i in df:
    variable.append(i)
    condition_1.append(df[i].unique().shape[0] < n)
    condition_2.append(most_common_value_count / second_most_common_value_count > 20)   # this is the part I cannot work out

result = pd.DataFrame({"Variables": variable,
                       "Condition_1": condition_1,
                       "Condition_2": condition_2})
The dataset df contains positive and negative values (so I can't use np.bincount), and also categorical variables, objects, datetimes, dates, and NaN variables/values.
Any suggestions? Remember that it's critical to minimize the number of steps in order to maximize efficiency.
As noted in the comments, you may want to use np.unique (or pd.unique). You can set return_counts=True to get the value counts. These will be the second item in the tuple returned by np.unique, hence the [1] index below. After sorting them, the most common count will be the last value, and the second most common count will be the next to last value, so you can get them both by indexing with [-2:].
You could then construct a Boolean list indicating which columns meet your condition #2 (or rather the opposite). This list can then be used as a mask to reduce the dataframe:
def counts_ratio(s):
    """Take a pandas series s and return the count of its most common
    value divided by the count of its second most common value."""
    counts = np.sort(np.unique(s, return_counts=True)[1])[-2:]
    if counts.size < 2:   # constant column: there is no second most common value
        return np.inf
    return counts[1] / counts[0]

condition2 = [counts_ratio(df[col]) <= 20
              for col in df.columns]

df_reduced = df[df.columns[condition2]]
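If you also want to fold in condition #1 from the question, one possible combination (a sketch only; it reuses counts_ratio above and the n = df.shape[0]/10 threshold from the attempt, keeping every column that does not meet both removal criteria):
n = df.shape[0] / 10
keep = [col for col in df.columns
        if not (df[col].nunique() < n and counts_ratio(df[col]) > 20)]
df_reduced = df[keep]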
I need to work on a column and, based on a condition (if it is True), fill the entries with random numbers (not a constant string/number). I tried a for loop and it works, but is there a faster way to proceed, similar to np.select or np.where?
I have written a for loop and it works:
The 'NUMBER' column has a few entries greater than 1000; I need to replace each of them with a random float between 120 and 123, not the same value for all of them. I have used np.random.uniform and it works too.
for i in range(0, len(data['NUMBER'])):
    if data['NUMBER'][i] >= 1000:
        data['NUMBER'][i] = np.random.uniform(120, 123)
The output of this code fills each qualifying entry with a different random value between 120 and 123; after the replacement the entries are:
0 7.139093
1 12.592815
2 12.712103
3 **120.305773**
4 11.941386
5 **122.548703**
6 6.357255.............etc
But when using np.select or np.where as shown below (since they run faster), every entry satisfying the condition was replaced by the same single number. For example, instead of having different values at indexes 3 and 5 as shown above, all those entries get the same value between 120 and 123. Please guide me here.
data['NUMBER'] =np.where(data['NUMBER'] >= 1000,np.random.uniform(120,123), data['NUMBER'])
data['NUMBER'] = np.select([data['NUMBER'] >=1000],[np.random.uniform(120,123)], [data['NUMBER']])
np.random.uniform(120, 123) is a single random number:
In [1]: np.random.uniform(120, 123)
Out[1]: 120.51317994772921
Use the size parameter to make an array of random numbers:
In [2]: np.random.uniform(120, 123, size=5)
Out[2]:
array([122.22935075, 122.70963032, 121.97763459, 121.68375085,
121.13568039])
Passing this to np.where (as the second argument) allows np.where to select from this array when the condition is True:
data['NUMBER'] = np.where(data['NUMBER'] >= 1000,
                          np.random.uniform(120, 123, size=len(data)),
                          data['NUMBER'])
Use np.select when there is more than one condition. Since there is only one condition here, use np.where.
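For completeness, a hedged sketch of what the per-row random trick would look like with np.select if there really were several conditions (the second condition and its replacement range are made up purely for illustration; the final catch-all condition keeps the original values):
conds = [data['NUMBER'] >= 1000,
         data['NUMBER'] < 0,
         np.ones(len(data), dtype=bool)]                  # catch-all: keep the original value
choices = [np.random.uniform(120, 123, size=len(data)),
           np.random.uniform(-1, 0, size=len(data)),
           data['NUMBER']]
data['NUMBER'] = np.select(conds, choices)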
I am currently playing with financial data, missing financial data specifically. What I'm trying to do is fill the gaps basing on gap length, for example:
- if the length of the gap is fewer than 5 NaNs, then interpolate
- if the length is more than 5 NaNs, then fill with values from a different series
So what I am trying to accomplish here is a function that will scan a series for NaN runs, get their lengths, and then fill them appropriately. I just want to push as much as I can onto pandas/numpy operations and avoid doing it in loops.
Below is just an example; it is not optimal at all:
ser = pd.Series(np.sort(np.random.uniform(size=100)))
ser[48:52] = None
ser[10:20] = None
def count(a):
    """Overwrite each NaN run in a with the run's length, in place
    (a run at the very end of a is left as NaN)."""
    tmp = 0
    for i in range(len(a)):
        current = a[i]
        if not np.isnan(current) and tmp > 0:
            a[(i - tmp):i] = tmp   # write the gap length over the gap that just ended
            tmp = 0
        if np.isnan(current):
            tmp = tmp + 1
g = ser.copy()
count(g)
g[g<1]=0
df = pd.DataFrame(ser, columns=['ser'])
df['group'] = g
Now we want to interpolate when gap is < 10 and put something where gap > 9
df['ready'] = df.loc[df.group < 10, 'ser'].interpolate(method='linear')
df.loc[df.group > 9, 'ready'] = 100
To sum up, two questions:
- can pandas do this in a robust way?
- if not, what can you suggest to make my approach more robust and faster? Let's focus on two points here: first, there is this loop over the series, which will take ages once I have, say, 100 series with gaps. Maybe something like Numba? Second, I'm interpolating on copies; any suggestions on how to do it in place?
Thanks for having a look
You could leverage interpolate's limit parameter.
df['ready'] = df.loc[df.group < 10, 'ser'].interpolate(method='linear', limit=9)
limit : int, default None.
Maximum number of consecutive NaNs to fill.
Then run interpolate() a second time with a different method or even run fillna()
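A minimal sketch of that two-pass idea, where other stands for the fallback series from the question's rule (an assumption here). Note that limit only caps how many consecutive NaNs get filled, so a longer gap is still partially filled up to that many values rather than skipped:
df['ready'] = df['ser'].interpolate(method='linear', limit=9)  # fill at most 9 consecutive NaNs per gap
df['ready'] = df['ready'].fillna(other)                        # anything still missing comes from the other series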
After a lengthy look for an answer, it turns out there is no automated way of doing fillna based on gap length.
Conclusion: one can use the code from the question; the idea works.
I have a data frame in which I want to identify all pairs of rows whose time value t differs by a fixed amount, say diff.
In [8]: df.t
Out[8]:
0 143.082739
1 316.285739
2 344.315561
3 272.258814
4 137.052583
5 258.279331
6 114.069608
7 159.294883
8 150.112371
9 181.537183
...
For example, if diff = 22.2423, then we would have a match between rows 4 and 7.
The obvious way to find all such matches is to iterate over each row and apply a filter to the data frame:
for t in df.t:
    matches = df[abs(df.t - (t + diff)) < EPS]
    # log matches
But as I have a lot of values (10000+), this will be quite slow.
Further, I want to check whether any differences that are a multiple of diff exist; for instance, rows 4 and 9 in my example differ by 2 * diff. So my code takes a long time.
Does anyone have any suggestions on a more efficient technique for this?
Thanks in advance.
Edit: Thinking about it some more, the question boils down to finding an efficient way to identify the floating-point numbers that two lists/Series objects have in common, to within some tolerance.
If I can do this, then I can simply compare df.t, df.t - diff, df.t - 2 * diff, etc.
If you want to check many multiples, it might be best to take df.t modulo diff and compare the result to zero, within your tolerance.
Whether you use modulo or not, the efficient way to compare floats within some tolerance is numpy.allclose. In versions before 1.8, call it as numpy.testing.allclose.
So far, what I've described still involves looping over rows, because you must compare each row to every other. A better, but slightly more involved, approach would use scipy.spatial.cKDTree to query all pairs within a given distance (tolerance).
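A hedged sketch of that last idea for a single multiple of diff (tol plays the role of EPS from the question; the variable names are illustrative):
import numpy as np
from scipy.spatial import cKDTree

t = df.t.to_numpy()
tree = cKDTree(t[:, None])                    # KD-trees expect 2-D input, so add a dummy axis
shifted = cKDTree((t + diff)[:, None])        # every t value shifted by one multiple of diff
pairs = tree.query_ball_tree(shifted, r=tol)  # pairs[i] lists rows j with |t[i] - (t[j] + diff)| <= tol
Repeating this with 2 * diff, 3 * diff, and so on covers the multiples mentioned above.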