I have a data frame in which I want to identify all pairs of rows whose time value t differs by a fixed amount, say diff.
In [8]: df.t
Out[8]:
0 143.082739
1 316.285739
2 344.315561
3 272.258814
4 137.052583
5 258.279331
6 114.069608
7 159.294883
8 150.112371
9 181.537183
...
For example, if diff = 22.2423, then we would have a match between rows 4 and 7.
The obvious way to find all such matches is to iterate over each row and apply a filter to the data frame:
for t in df.t:
    matches = df[abs(df.t - (t + diff)) < EPS]
    # log matches
But as I have a lot of values (10,000+), this will be quite slow.
Further, I also want to check whether any differences equal to a multiple of diff exist; for instance, rows 4 and 9 in my example differ by 2 * diff. So my code takes a long time.
Does anyone have any suggestions on a more efficient technique for this?
Thanks in advance.
Edit: Thinking about it some more, the question boils down to finding an efficient way to identify the floating-point numbers that appear in both of two lists/Series objects, to within some tolerance.
If I can do this, then I can simply compare df.t, df.t - diff, df.t - 2 * diff, etc.
If you want to check many multiples, it might be best to take the modulo of df.t with respect to diff and compare the result to zero, within your tolerance.
Whether you use modulo or not, the efficient way to compare floats within some tolerance is numpy.allclose (in versions before 1.8, call it as numpy.testing.allclose).
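For illustration, here is a minimal sketch of that modulo idea (df, diff and EPS are the names from your question; numpy.isclose is the element-wise counterpart of allclose):
import numpy as np

t = df.t.to_numpy()
# all pairwise absolute time differences (n x n); fine for moderate n
pairwise = np.abs(t[:, None] - t[None, :])
# remainder after dividing by diff: values close to 0 or close to diff
# mean the difference is (nearly) a multiple of diff
rem = np.mod(pairwise, diff)
is_multiple = np.isclose(rem, 0, atol=EPS) | np.isclose(rem, diff, atol=EPS)
np.fill_diagonal(is_multiple, False)  # ignore a row matching itself
rows, cols = np.nonzero(is_multiple)  # each match appears as (i, j) and (j, i)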
So far what I've described still involves looping over rows, because you must compare each row to every other. A better, but slightly more involved, approach would use scipy.spatial.cKDTree to query all pairs within a given distance (tolerance).
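A hedged sketch of that idea for a single diff: build a KD-tree over the t values and query the shifted values t + diff with radius EPS (again using the names from your question):
from scipy.spatial import cKDTree
import numpy as np

t = df.t.to_numpy()
tree = cKDTree(t[:, None])                      # KD-tree over the 1-D time values
# for every row, find the rows whose t lies within EPS of t + diff
hits = tree.query_ball_point(t[:, None] + diff, r=EPS)
matches = [(i, j) for i, js in enumerate(hits) for j in js]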
I have a pandas dataframe named 'matrix', it looks like this:
antecedent_sku consequent_sku similarity
0 001 002 0.3
1 001 003 0.2
2 001 004 0.1
3 001 005 0.4
4 002 001 0.4
5 002 003 0.5
6 002 004 0.1
Out of this dataframe I want to create a similarity matrix for further clustering. I do it in two steps.
Step 1: to create an empty similarity matrix ('similarity')
set_name = set(matrix['antecedent_sku'].values)
similarity = pd.DataFrame(index = list(set_name), columns = list(set_name))
Step 2: to fill it with values from 'matrix':
for ind in tqdm(list(similarity.index)):
    for col in list(similarity.columns):
        if ind == col:
            similarity.loc[ind, col] = 1
        elif len(matrix.loc[(matrix['antecedent_sku'].values == f'{ind}') & (matrix['consequent_sku'].values == f'{col}'), 'similarity'].values) < 1:
            similarity.loc[ind, col] = 0
        else:
            similarity.loc[ind, col] = matrix.loc[(matrix['antecedent_sku'].values == f'{ind}') & (matrix['consequent_sku'].values == f'{col}'), 'similarity'].values[0]
The problem: it takes 4 hours to fill a matrix of shape (3000,3000).
The question: what am I doing wrong? Should I aim at speeding up the code with something like Cython/Numba, or does the problem lie in the architecture of my approach, and should I use built-in functions or some other clever way to transform 'matrix' into 'similarity' instead of a double loop?
P.S. I run Python 3.8.7
Iterating over a pandas dataframe with loc is known to be very slow, and the CPython interpreter is also slow (especially for loops). Every pandas operation has a high overhead. The main point, however, is that you iterate over 3000x3000 elements, and for each element you evaluate expressions like matrix['antecedent_sku'].values==f'{ind}', which themselves iterate over 3000 items. Those items are strings, an inefficient datatype (the processor has to walk a variable-length UTF-8 sequence of several characters for every comparison). This is done twice per iteration, and a new string is built from ind and col for each comparison, so 3000*3000*3000*2 = 54_000_000_000 string comparisons are performed, with roughly 3000*3000*3000*2*2*3 = 324_000_000_000 characters to (inefficiently) compare. There is no chance this can be fast. Not to mention that each of the 9_000_000 iterations creates and deletes several temporary arrays and pandas objects.
The first thing to do is to reduce the amount of recomputation with some precomputation. Indeed, you can store the result of matrix['antecedent_sku'].values == f'{ind}' (as a NumPy array, since pandas Series are less efficient) in a dictionary indexed by ind, so it can be fetched quickly inside the loop. This should make that part about 3000 times faster (since there should be only 3000 distinct values of ind). Even better, you can use a groupby to do the same thing more efficiently.
Moreover, you can convert the key columns (i.e. antecedent_sku and consequent_sku) to integers to avoid many expensive string comparisons.
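As a minimal sketch of those two ideas combined (column names taken from your question; the exact shape of the lookup table is just one possibility):
# convert the SKU columns to integers once, up front
matrix['antecedent_sku'] = matrix['antecedent_sku'].astype(int)
matrix['consequent_sku'] = matrix['consequent_sku'].astype(int)

# build one small lookup table per antecedent in a single groupby pass
rows_by_antecedent = {
    sku: sub.set_index('consequent_sku')['similarity']
    for sku, sub in matrix.groupby('antecedent_sku')
}

# inside the double loop, rows_by_antecedent[ind].get(col, 0)
# replaces the two boolean-mask comparisons over the whole frame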
Then you can remove useless operations like matrix.loc[..., 'similarity'].values. Indeed, since you only want to know the length of the result, you can just use np.sum on the boolean NumPy mask. In fact, you can even use np.any, since all you check is whether the length is less than 1.
Then you can avoid creating temporary NumPy arrays by preallocating a buffer and specifying the output buffer in NumPy operations. For example, you can use np.logical_and(A, B, out=your_preallocated_buffer) instead of just A & B.
Finally, if (and only if) all the previous steps are not enough to make the overall computation hundreds or thousands of times faster, you can use Numba, converting your dataframe to a NumPy array first (since Numba does not support dataframes). If this is still not enough, you can use prange (instead of range) together with Numba's parallel=True flag to use multiple threads.
Please note that pandas is not really designed to manipulate dataframes with 3000 columns and will certainly not be very fast because of that. NumPy is better suited for manipulating matrices.
Following Jerome's lead with a dictionary, I've done the following:
Step 1: to create a dictionary
matrix_dict = matrix.copy()
matrix_dict = matrix_dict.set_index(['antecedent_sku', 'consequent_sku'])['similarity'].to_dict()
matrix_dict looks like this:
{(001, 002): 0.3}
Step 2: to fill similarity with values from matrix_dict
for ind in tqdm(list(similarity.index)):
    for col in list(similarity.columns):
        if ind == col:
            similarity.loc[ind, col] = 1
        else:
            similarity.loc[ind, col] = matrix_dict.get((int(ind), int(col)))
Step 3: fillna with zeroes
similarity = similarity.fillna(0)
Result: roughly a 35x speedup (from 4 hours 20 minutes down to 7 minutes).
I need an efficient way to list and drop unary columns in a Spark DataFrame (I use the PySpark API). I define a unary column as one which has at most one distinct value and for the purpose of the definition, I count null as a value as well. That means that a column with one distinct non-null value in some rows and null in other rows is not a unary column.
Based on the answers to this question I managed to write an efficient way to obtain a list of null columns (which are a subset of my unary columns) and drop them as follows:
counts = df.summary("count").collect()[0].asDict()
null_cols = [c for c in counts.keys() if counts[c] == '0']
df2 = df.drop(*null_cols)
Based on my very limited understanding of the inner workings of Spark this is fast because the method summary manipulates the entire data frame simultaneously (I have roughly 300 columns in my initial DataFrame). Unfortunately, I cannot find a similar way to deal with the second type of unary columns - ones which have no null values but are lit(something).
What I currently have is this (using the df2 I obtain from the code snippet above):
prox_counts = (df2.agg(*(F.approx_count_distinct(F.col(c)).alias(c)
                         for c in df2.columns))
               .collect()[0].asDict())
poss_unarcols = [k for k in prox_counts.keys() if prox_counts[k] < 3]
unar_cols = [c for c in poss_unarcols if df2.select(c).distinct().count() < 2]
Essentially, I first find columns which could be unary in a fast but approximate way and then look at the "candidates" in more detail and more slowly.
What I don't like about it is that (a) even with the approximate pre-selection it is still fairly slow, taking over a minute to run even though at this point I only have roughly 70 columns (and about 6 million rows), and (b) I use approx_count_distinct with the magical constant 3 (approx_count_distinct does not count null, hence 3 instead of 2). Since I'm not exactly sure how approx_count_distinct works internally, I am a little worried that 3 is not a particularly good constant: the function might estimate the number of distinct (non-null) values as, say, 5 when it really is 1, so maybe a higher constant is needed to guarantee nothing is missing from the candidate list poss_unarcols.
Is there a smarter way to do this, ideally so that I don't even have to drop the null columns separately and can do it all in one fell swoop (although that step is actually quite fast and so not that big an issue)?
I suggest that you have a look at the following function
pyspark.sql.functions.collect_set(col)
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe
It returns all the values in col with duplicate elements eliminated. You can then check the length of the result (whether it equals one). I would wonder about performance, but I think it will definitely beat distinct().count(). Let's have a look on Monday :)
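A minimal sketch of what I have in mind (using the df2 and F names from your question); note that collect_set drops nulls, so by itself it will not distinguish a truly unary column from one holding a single non-null value plus nulls:
from pyspark.sql import functions as F

distinct_counts = (df2.agg(*[F.size(F.collect_set(F.col(c))).alias(c)
                             for c in df2.columns])
                   .collect()[0].asDict())
unar_candidates = [c for c, n in distinct_counts.items() if n <= 1]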
You can do df.na.fill("some non-existing value").summary() and then drop the relevant columns from the original dataframe.
So far the best solution I found is this (it is faster than the other proposed answers, although not ideal, see below):
rows = df.count()
nullcounts = df.summary("count").collect()[0].asDict()
del nullcounts['summary']
nullcounts = {key: (rows-int(value)) for (key, value) in nullcounts.items()}
# a list for columns with just null values
null_cols = []
# a list for columns with no null values
full_cols = []
for key, value in nullcounts.items():
    if value == rows:
        null_cols.append(key)
    elif value == 0:
        full_cols.append(key)
df = df.drop(*null_cols)
# only columns in full_cols can be unary
# all other remaining columns have at least 1 null and 1 non-null value
try:
    unarcounts = (df.agg(*(F.countDistinct(F.col(c)).alias(c) for c in full_cols))
                  .collect()[0]
                  .asDict())
    unar_cols = [key for key in unarcounts.keys() if unarcounts[key] == 1]
except AssertionError:
    unar_cols = []
df = df.drop(*unar_cols)
This works reasonably fast, mostly because I don't have too many "full columns", i.e. columns which contain no null rows; those are the only columns whose rows I go through in full, and I use the fast summary("count") method to classify as many columns as I can beforehand.
Going through all rows of a column seems incredibly wasteful to me, since once two distinct values are found I don't really care what's in the rest of the column. I don't think this can be solved in PySpark, though (but I am a beginner); it seems to require a UDF, and PySpark UDFs are so slow that this is not likely to be faster than using countDistinct(). Still, as long as a dataframe has many columns with no null rows, this method will be pretty slow (and I am not sure how much one can trust approx_count_distinct() to differentiate between one and two distinct values in a column).
As far as I can tell it beats the collect_set() approach, and filling the null values is actually not necessary, as I realized (see the comments in the code).
I tried your solution, and it was too slow in my situation, so I simply grabbed the first row of the data frame and checked for duplicates. This turned out to be far more performant. I'm sure there's a better way, but I don't know what it is!
import pyspark.sql.functions as sqlf

first_row = df.limit(1).collect()[0]

drop_cols = [
    key for key, value in df.select(
        [
            sqlf.count(
                sqlf.when(sqlf.col(column) != first_row[column], column)
            ).alias(column)
            for column in df.columns
        ]
    ).collect()[0].asDict().items()
    if value == 0
]

df = df.drop(*drop_cols)
What would be a more elegant way to writing:
df[df['income'] > 0].count()['income']
I would like to simply count the number of column values meeting a condition (in this example the condition is just being larger than zero, but I would like an approach applicable to any arbitrary condition or set of conditions). It would obviously be more elegant if the column name did not need to appear twice in the expression. This should hopefully be easy.
df = pd.DataFrame([0, 30000, 75000, -300, 23000], columns=['income'])
print(df)
income
0 0
1 30000
2 75000
3 -300
4 23000
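For the simple condition in the question, summing the boolean mask is the usual idiom (shown here on the df defined above):
count_positive = (df['income'] > 0).sum()
print(count_positive)  # 3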
If you would like to count values in a column meeting a slightly more complex condition than just being positive, for example "value is in the range from 5000 to 25000", you can use two methods.
First, using boolean indexing,
((df['income'] > 5000) & (df['income'] < 25000)).sum()
Second, applying a function on every row of the series,
df['income'].map(lambda x: 5000 < x < 25000).sum()
Note that the second approach allows arbitrarily complex conditions but is much slower than the first, which uses low-level operations on the underlying arrays. See the documentation on boolean indexing for more information.
I have a dataset from which I want a few averages of multiple variables I created.
I started off with:
data2['socialIdeology2'].mean()
data2['econIdeology'].mean()
^ that works perfectly, and gives me the averages I'm looking for.
Now, I'm trying to do a conditional mean, i.e. the mean only for a select group within the data set (I want the ideologies broken down by whom respondents voted for in the 2016 election). In Stata, the code would be similar to: mean(variable) if voteChoice == 'Clinton'
Looking into it, I came to the conclusion that a conditional mean just isn't a thing (although hopefully I am wrong?), so I started writing my own function for it.
This is me just starting out with a 'mean' function, to create a foundation for a conditional mean function:
def mean():
    sum = 0.0
    count = 0
    for index in range(0, len(data2['socialIdeology2'])):
        sum = sum + (data2['socialIdeology2'][index])
        print(data2['socialIdeology2'][index])
        count = count + 1
    return sum / count
print(mean())
Yet I keep getting 'nan' as the result. Printing data2['socialIdeology2'][index] within the loop prints nan over and over again.
So my question is: if the data stored within the socialIdeology2 variable really is a nan (which I don't understand how it could be), why is it that the .mean() function works with it?
And how can I generate means by category?
Conditional mean is indeed a thing in pandas. You can use DataFrame.groupby():
means = data2.groupby('voteChoice').mean()
or maybe, in your case, the following would be more efficient:
means = data2.groupby('voteChoice')['socialIdeology2'].mean()
to drill down to the mean you're looking for. (The first case will calculate means for all columns.) This is assuming that voteChoice is the name of the column you want to condition on.
If you're only interested in the mean for a single group (e.g. Clinton voters) then you could create a boolean series that is True for members of that group, then use this to index into the rows of the DataFrame before taking the mean:
voted_for_clinton = data2['voteChoice'] == 'Clinton'
mean_for_clinton_voters = data2.loc[voted_for_clinton, 'socialIdeology2'].mean()
If you want to get the means for multiple groups simultaneously then you can use groupby, as in Brad's answer. However, I would do it like this:
means_by_vote_choice = data2.groupby('voteChoice')['socialIdeology2'].mean()
Placing the ['socialIdeology2'] index before the .mean() means that you only compute the mean over the column you're interested in. If you instead place the indexing expression after the .mean() (i.e. data2.groupby('voteChoice').mean()['socialIdeology2']), the means over all columns are computed first and only then is the 'socialIdeology2' column selected from the result, which is less efficient.
See here for more info on indexing DataFrames using .loc and here for more info on groupby.
I am currently playing with financial data, missing financial data specifically. What I'm trying to do is fill the gaps basing on gap length, for example:
- if the length of the gap is fewer than 5 NaNs, then interpolate
- if the length is > 5 NaNs, then fill with values from a different series
So what I am trying to accomplish here is a function that will scan a series for NaNs, get the length of each gap and then fill it appropriately. I just want to push as much as I can into pandas/numpy operations and avoid doing it in loops, etc.
Below is just an example; it is not optimal at all:
ser = pd.Series(np.sort(np.random.uniform(size=100)))
ser[48:52] = None
ser[10:20] = None
def count(a):
    tmp = 0
    for i in range(len(a)):
        current = a[i]
        if not(np.isnan(current)) and tmp > 0:
            a[(i-tmp):i] = tmp
            tmp = 0
        if np.isnan(current):
            tmp = tmp + 1
g = ser.copy()
count(g)
g[g<1]=0
df = pd.DataFrame(ser, columns=['ser'])
df['group'] = g
Now we want to interpolate where the gap is < 10 and put something else where the gap is > 9:
df['ready'] = df.loc[df.group < 10, 'ser'].interpolate(method='linear')
df.loc[df.group > 9, 'ready'] = 100
To sum up, 2 questions:
- can pandas do this in a robust way?
- if not, what can you suggest to make my approach more robust and faster? Let's focus on 2 points here: first, there is this loop over the series - it will take ages once I have, say, 100 series with gaps. Maybe something like Numba? Second, I'm interpolating on copies; any suggestions on how to do it in place?
Thanks for having a look
You could leverage interpolate's limit parameter.
df['ready'] = df.loc[df.group < 10, 'ser'].interpolate(method='linear', limit=9)
limit : int, default None.
Maximum number of consecutive NaNs to fill.
Then run interpolate() a second time with a different method, or even run fillna().
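A minimal sketch of that two-pass idea (the placeholder value 100 comes from the question; note that with the default forward limit_direction the first 9 NaNs of a longer gap are still interpolated, which may or may not be what you want):
df['ready'] = df['ser'].interpolate(method='linear', limit=9)
df['ready'] = df['ready'].fillna(100)  # anything still NaN belongs to a long gap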
After a lengthy search for an answer, it turns out there is no automated way of doing fillna based on gap length.
Conclusion: one can use the code from the question; the idea works.