Add multiple columns resulting from the same operation to a pandas DataFrame - python

The question
Let's say I have a DataFrame df with a numeric column x and a categorical column y. For each group of y, I want to find a value q and the q-quantile of x such that this quantile is greater than 0. With these two values, q and the quantile, I then want to count the number of elements below the quantile. To do this, I start with q=0 and keep increasing q by 0.05 until the quantile is greater than 0.
The solution
groups = df.groupby(y_col)
less_quantile = np.empty(len(groups))
quantiles = np.empty(len(groups))
qs = np.empty(len(groups))
for i, (_, group) in zip(range(len(groups)), groups):
    q = 0.0
    while q <= 1.0:
        quantile = group[x_col].quantile(q)
        if quantile > 0.0:
            less_quantile[i] = (group[x_col] < quantile).sum()  # count Trues
            qs[i] = q
            quantiles[i] = quantile
            break
        q += 0.05
df_final = df.drop_duplicates(y_col)\
             .assign(quantile=quantiles, q=qs, less_quantile_count=less_quantile)
The problem
There are the following problems regarding the implementation above:
I'm not using any pandas or numpy optimizations (e.g. vectorization).
pandas does not guarantee that drop_duplicates and groupby produce the groups in the same order, so assigning the arrays the way I did is fragile. The only reason I did it is that, by experimentation, the orders matched.
What are my other options?
Using agg or apply and merge (a sketch of this option is shown after the example below)
Creating empty columns and filling them per group with q, then the quantile, and then the count of values below the quantile
What other problems can it cause?
I'll probably need to recalculate quantile lots of times.
I'll occupy lots of memory with repeated values within each row of each group.
Why do I care about that?
As a C programmer, it hurts to use so much memory and so many operations unnecessarily. Also, because this is a small DataFrame I can solve it easily, but if the DataFrame were very big, I would want to know the best way to do this. I'm not sure if it is a limitation of the library due to its abstraction level or if I just don't know enough to solve the problem the way it should be solved.
Edit 1 - Added an example
x y
0 pear 0.0
1 pear 0.0
2 pear 0.194329
3 apple 0.714319
4 apple 0.171905
5 apple 0.337234
6 apple 0.769216
7 orange 0.529154
8 orange 0.844691
# Let's take pear as an example:
quantile = group_pear.quantile(0) # 0.0
quantile = group_pear.quantile(0.05) # 0.0
...
quantile = group_pear.quantile(0.50) # 0.0
quantile = group_pear.quantile(0.55) # 0.09857
q = 0.55
# Found quantile of q with q=0.55 resulting in 0.09857
# Now I just need to count how many rows within the pear group have 'y'
# less than 0.09857
count_pear = (group_pear['y'] < 0.09857).sum()
# I just need to do the same for other groups and then produce a
# DataFrame like this
x q quantile count_less
0 pear 0.55 0.09857 2
1 apple 0.0 ... ...
2 orange 0.0 ... ...
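A minimal sketch of the apply-and-merge option mentioned above, assuming x_col and y_col hold the column names as in the original snippet and that first_positive_quantile is a hypothetical helper; merging on the group label avoids relying on drop_duplicates and groupby returning groups in the same order:
import numpy as np
import pandas as pd

def first_positive_quantile(s, step=0.05):
    # scan q = 0, 0.05, ... until the q-quantile of the group is > 0
    for q in np.arange(0.0, 1.0 + step, step):
        quantile = s.quantile(q)
        if quantile > 0.0:
            return pd.Series({'q': q,
                              'quantile': quantile,
                              'less_quantile_count': (s < quantile).sum()})
    return pd.Series({'q': np.nan, 'quantile': np.nan, 'less_quantile_count': np.nan})

stats = df.groupby(y_col).apply(lambda g: first_positive_quantile(g[x_col]))
df_final = df.drop_duplicates(y_col).merge(stats, left_on=y_col, right_index=True)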

Related

Remove outliers using groupby in data with several categories

I have a time-series with several products. I want to remove outliers using the Tukey Fence method. The idea is to create a column with a flag indicating outlier or not, using groupby. It should look like this (the flag column is added by the groupby):
date prod units flag
1 a 100 0
2 a 90 0
3 a 80 0
4 a 15 1
1 b 200 0
2 b 180 0
3 b 190 0
4 b 30000 1
I was able to do it by separating the prods with a for-loop and then doing the corresponding joins, but I would like to do it more cleanly.
I would compute the quantiles first; then derive IQR from them. Compute the fence bounds and call merge() to map these limits to the original dataframe and call eval() to check if the units are within their respective Tukey fence bounds.
# compute quantiles
quantiles = df.groupby('prod')['units'].quantile([0.25, 0.75]).unstack()
# compute interquartile range for each prod
iqr = quantiles.diff(axis=1).bfill(axis=1)
# compute fence bounds; the columns are still named 0.25 and 0.75,
# but after this step they hold the lower and upper Tukey fences
fence_bounds = quantiles + iqr * [-1.5, 1.5]
# flag units that fall outside their respective Tukey fences
df['flag'] = df.merge(fence_bounds, left_on='prod', right_index=True).eval('not (`0.25` < units < `0.75`)').astype(int)
df
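As a hedged alternative sketch, the same fences can be computed with groupby().transform(), which aligns the bounds to the original rows without a merge:
# per-prod quartiles broadcast back to each row
g = df.groupby('prod')['units']
q1 = g.transform(lambda s: s.quantile(0.25))
q3 = g.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1
# 1 if units fall outside the Tukey fences of their prod, else 0
df['flag'] = (~df['units'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)).astype(int)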

Cumulative sum of a pandas column until a maximum value is met, and average adjacent rows

I'm a biology student who is fairly new to Python, and I was hoping someone might be able to help with a problem I have yet to solve.
With some subsequent code I have created a pandas dataframe that looks like the example below:
Distance. No. of values Mean rSquared
1 500 0.6
2 80 0.3
3 40 0.4
4 30 0.2
5 50 0.2
6 30 0.1
I can provide my previous code to create this dataframe, but I didn't think it was particularly relevant.
I need to sum the 'No. of values' column until I reach a value >= 100, and then combine the data of the rows that went into that sum, taking the weighted average of the distance and mean rSquared values, as seen in the example below
Mean Distance. No. Of values Mean rSquared
1 500 0.6
(80*2+40*3)/120 (80+40) = 120 (80*0.3+40*0.4)/120
(30*4+50*5+30*6)/110 (30+50+30) = 110 (30*0.2+50*0.2+30*0.1)/110
etc...
I know pandas has its .cumsum function, which I might be able to use in a for-loop with an if statement that checks the upper limit and resets the sum back to 0 when it is greater than or equal to the upper limit. However, I have no clue how to average the adjacent columns.
Any help would be appreciated!
You can use this code snippet to solve your problem.
import pandas as pd

# First, compute some weighted values
df.loc[:, "weighted_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "weighted_mean_rSquared"] = df["Mean rSquared"] * df["No. of values"]
min_threshold = 100
indexes = []
temp_sum = 0
# placeholder for final result
final_df = pd.DataFrame()
columns = ["Distance", "No. of values", "Mean rSquared"]
# resetting the index to make 'df' usable in the loop below
df = df.reset_index(drop=True)
# main loop to check and compute the desired output
for index, _ in df.iterrows():
    temp_sum += df.iloc[index]["No. of values"]
    indexes.append(index)
    # if the sum exceeds 'min_threshold' then do some computation
    if temp_sum >= min_threshold:
        temp_distance = df.iloc[indexes]["weighted_distance"].sum() / temp_sum
        temp_mean_rSquared = df.iloc[indexes]["weighted_mean_rSquared"].sum() / temp_sum
        # create temporary dataframe and concatenate with the 'final_df'
        temp_df = pd.DataFrame([[temp_distance, temp_sum, temp_mean_rSquared]], columns=columns)
        final_df = pd.concat([final_df, temp_df])
        # reset the variables
        temp_sum = 0
        indexes = []
NumPy has a function, numpy.frompyfunc. You can use it to get the cumulative value based on a threshold.
Here's how to implement it. With that, you can then figure out the index when the value goes over the threshold. Use that to calculate the Mean Distance and Mean rSquared for the values in your original dataframe.
I also leveraged #sujanay's idea of calculating the weighted values first.
c = ['Distance','No. of values','Mean rSquared']
d = [[1,500,0.6], [2,80,0.3], [3,40,0.4],
[4,30,0.2], [5,50,0.2], [6,30,0.1]]
import pandas as pd
import numpy as np
df = pd.DataFrame(d,columns=c)
#calculate the weighted distance and weighted mean squares first
df.loc[:, "w_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "w_mean_rSqrd"] = df["Mean rSquared"] * df["No. of values"]
#use numpy.frompyfunc to setup the threshold condition
sumvals = np.frompyfunc(lambda a,b: a+b if a <= 100 else b,2,1)
#assign value to cumvals based on threshold
df['cumvals'] = sumvals.accumulate(df['No. of values'], dtype=object)
#find out all records that have >= 100 as cumulative values
idx = df.index[df['cumvals'] >= 100].tolist()
#if last row not in idx, then add it to the list
if (len(df)-1) not in idx: idx += [len(df)-1]
#iterate thru the idx for each set and calculate Mean Distance and Mean rSquared
i = 0
for j in idx:
    df.loc[j,'Mean Distance'] = (df.iloc[i:j+1]["w_distance"].sum() / df.loc[j,'cumvals']).round(2)
    df.loc[j,'New Mean rSquared'] = (df.iloc[i:j+1]["w_mean_rSqrd"].sum() / df.loc[j,'cumvals']).round(2)
    i = j+1
print (df)
The output of this will be:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
1 2 80 ... NaN NaN
2 3 40 ... 2.33 0.33
3 4 30 ... NaN NaN
4 5 50 ... NaN NaN
5 6 30 ... 5.00 0.17
If you want to extract only the records that are non NaN, you can do:
final_df = df[df['Mean Distance'].notnull()]
This will result in:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
2 3 40 ... 2.33 0.33
5 6 30 ... 5.00 0.17
I looked up BEN_YO's implementation of numpy.frompyfunc. The original SO post is "Restart cumsum and get index if cumsum more than value".
If you figure out the grouping first, pandas groupby-functionality will do a lot of the remaining work for you. A loop is appropriate to get the grouping (unless somebody has a clever one-liner):
>>> groups = []
>>> group = 0
>>> cumsum = 0
>>> for n in df["No. of values"]:
...     if cumsum >= 100:
...         cumsum = 0
...         group = group + 1
...     cumsum = cumsum + n
...     groups.append(group)
>>>
>>> groups
[0, 1, 1, 2, 2, 2]
Before doing the grouped operations you need to use the No. of values information to get the weighting in:
df[["Distance.", "Mean rSquared"]] = df[["Distance.", "Mean rSquared"]].multiply(df["No. of values"], axis=0)
Now get the sums like this:
>>> sums = df.groupby(groups)["No. of values"].sum()
>>> sums
0 500
1 120
2 110
Name: No. of values, dtype: int64
And finally the weighted group averages like this:
>>> df[["Distance.", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
Distance. Mean rSquared
0 1.000000 0.600000
1 2.333333 0.333333
2 5.000000 0.172727

How to calculate statistical properties before and after specific cutoff point in pandas?

I have a pandas dataframe that looks like this:
import pandas as pd
dt = pd.DataFrame({'idx':[1,2,3,4,5,1,2,3,4,5], 'id':[1,1,1,1,1,2,2,2,2,2], 'value':[5,10,15,20,25, 55,65,75,85,97]})
I have another that looks like this:
dt_idx = pd.DataFrame({'cutoff':[1,1,1,3,3,3,3,3,5,5,5,5,2,2,2,2,2,2,2,4,4]})
For the 3 "most common" cutoffs from dt_idx (in this toy example they are 3, 5 and 2), I would like to obtain the mean and the std of the value column of the dt dataframe, for the following 2 groups:
idx <= cutoff and
idx > cutoff
Is there a pythonic way to do that?
A simple loop here is a good option. Get the cutoffs you care about using value_counts and then loop over those cutoffs. You can use groupby to get both the <= and > at the same time. Store everything in a dict, keyed by the cutoffs, and then you can concat to get a DataFrame with a MultiIndex.
d = {}
for cutoff in dt_idx.cutoff.value_counts().head(3).index:
    d[cutoff] = dt.groupby(dt.idx.gt(cutoff))['value'].agg(['mean', 'std'])
pd.concat(d, names=['cutoff', 'greater_than_cutoff'])
mean std
cutoff greater_than_cutoff
2 False 33.750000 30.652624
True 52.833333 36.771819
3 False 37.500000 30.943497
True 56.750000 39.903007
5 False 45.200000 34.080949
If you want to use those cutoffs as ranges then we'll create the list, adding np.inf to the end, and we can use a single groupby with pd.cut to make the groups.
bins = dt_idx.cutoff.value_counts().head(3).index.sort_values().tolist() + [np.inf]
#[2, 3, 5, inf]
dt.groupby(pd.cut(dt.idx, bins, right=False))['value'].agg(['mean', 'std'])
# mean std
#idx
#[2.0, 3.0) 37.50 38.890873
#[3.0, 5.0) 48.75 36.371921
#[5.0, inf) 61.00 50.911688
First we get the 3 most common values, then we use GroupBy.agg for each of these values.
import numpy as np
n=3
l = dt_idx['cutoff'].value_counts()[:n].index
new_df = pd.concat({val: dt.groupby(np.where(dt['idx'].le(val),
                                             'less than or equal',
                                             'higher'))['value']
                           .agg(['mean','std'])
                    for val in l}, axis=1)
print(new_df)
2 3 5
mean std mean std mean std
higher 52.833333 36.771819 56.75 39.903007 NaN NaN
less than or equal 33.750000 30.652624 37.50 30.943497 45.2 34.080949
#new_df.stack(0).swaplevel().sort_index()
# mean std
#2 higher 52.833333 36.771819
# less than or equal 33.750000 30.652624
#3 higher 56.750000 39.903007
# less than or equal 37.500000 30.943497
#5 less than or equal 45.200000 34.080949

finding duplicate rows in pandas based on approximate match or formula

I have a pandas data frame
import pandas as pd
df = pd.DataFrame({"x" : [1.,1.,2.,3.,3.01,4.,5.],"y":[10.,11.,12.,12.95,13.0,11.,10.],
"name":["0ndx","1ndx","2ndx","3ndx","4ndx","5ndx","6ndx"]})
print(df.duplicated(subset=["x","y"]))
x y name
0 1.00 10.00 0ndx
1 1.00 11.00 1ndx
2 2.00 12.00 2ndx
3 3.00 12.95 3ndx
4 3.01 13.00 4ndx
5 4.00 11.00 5ndx
6 5.00 10.00 6ndx
I would like to find duplicate rows (in this case rows 3 and 4) using a formula based on distance with a tolerance of, say, 0.1. A row would be considered a duplicate if it is within a distance of 0.1 of another row (or, equivalently, if both x and y are within the tolerance). As one commenter pointed out, this could lead to a cluster of values with more than 0.1 of spread, since 1.1 is close to 1.18, which is close to 1.22. This might affect some of the things you can do, but I would still define any row that is within the tolerance of another as a duplicate.
This is a toy problem. I currently have a modest-sized problem, but I foresee problems large enough (250,000 rows) that the outer product might be expensive to construct.
Is there a way to do this?
You can compare with pandas.shift (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html).
Then, if you want to compare each row to the previous one and add a column flagging whether they are within some threshold of each other, let's say 0.1, it would look like this:
eps = 0.1
df['duplicated'] = 0
df.sort_values(by=['x'],inplace=True)
df.loc[abs(df['x'] - df['x'].shift()) <= eps,'duplicated'] = 1
Then rows with a 1 in the duplicated column are those that are duplicated within your threshold.
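If both x and y need to be within the tolerance and the data grows to the 250,000-row scale mentioned in the question, a hedged sketch using scipy.spatial.cKDTree avoids building the full pairwise distance matrix; treat it as an illustration rather than a drop-in answer:
import numpy as np
from scipy.spatial import cKDTree

tol = 0.1
coords = df[['x', 'y']].to_numpy()
tree = cKDTree(coords)
pairs = tree.query_pairs(r=tol)   # positional index pairs closer than tol (Euclidean)

flag = np.zeros(len(df), dtype=bool)
for i, j in pairs:                # mark both members of each close pair
    flag[i] = flag[j] = True
df['approx_duplicated'] = flag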

Iterate pandas.DataFrame efficiently while accessing more than one index row at a time

I have already read answers and blog entries about how to iterate over a pandas.DataFrame efficiently (https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6), but I still have one question left.
Currently, my DataFrame represents a GPS trajectory containing the columns time, longitude and latitude.
Now, I want to calculate a feature called distance-to-next-point. Therefore, I not only have to iterate through the rows and do operations on single rows, but also have to access subsequent rows within a single iteration.
i = 0
for index, row in df.iterrows():
    if i < len(df)-1:
        distance = calculate_distance([row['latitude'], row['longitude']],
                                      [df.loc[i+1, 'latitude'], df.loc[i+1, 'longitude']])
        row['distance'] = distance
Besides this problem, I have the same issue when calculating speed, applying smoothing or other similar methods.
Another example:
I want to search for datapoints with speed == 0 m/s and, starting from these points, add all subsequent datapoints into an array until the speed reaches 10 m/s (to find segments of accelerating from 0 m/s to 10 m/s).
Do you have any suggestions on how to code stuff like this as efficiently as possible?
You can use pd.DataFrame.shift to add shifted series to your dataframe, then feed into your function via apply:
def calculate_distance(row):
    # your function goes here, trivial function used for demonstration
    return sum(row[i] for i in df.columns)

df[['next_latitude', 'next_longitude']] = df[['latitude', 'longitude']].shift(-1)
df.loc[df.index[:-1], 'distance'] = df.iloc[:-1].apply(calculate_distance, axis=1)
print(df)
latitude longitude next_latitude next_longitude distance
0 1 5 2.0 6.0 14.0
1 2 6 3.0 7.0 18.0
2 3 7 4.0 8.0 22.0
3 4 8 NaN NaN NaN
This works for an arbitrary function calculate_distance, but the chances are your algorithm is vectorisable, in which case you should use column-wise Pandas / NumPy methods.
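As an illustration of that last point, here is a hedged vectorised sketch of the distance-to-next-point column, assuming calculate_distance is a haversine great-circle distance (the actual formula is not shown in the question):
import numpy as np

R = 6371000.0  # approximate Earth radius in metres (assumption)
lat1 = np.radians(df['latitude'])
lon1 = np.radians(df['longitude'])
lat2 = np.radians(df['latitude'].shift(-1))   # next point's coordinates
lon2 = np.radians(df['longitude'].shift(-1))

a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
df['distance'] = 2 * R * np.arcsin(np.sqrt(a))  # NaN for the last row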
