I have a 3 million rows dataframe that contains the different values :
d a0 a1 a2
0.5 10.0 5.0 1.0
0.8 10.0 2.0 0.0
I want to fill a fourth column with a linear interpolation of (a0,a1,a2) that takes the value in the "d" case,
d a0 a1 a2 newcol
1.5 10.0 5.0 1.0 3.0
0.8 10.0 2.0 0.0 3.6
newcol is the weighted average between a[int(d)] and a[int(d+1)], e.g. when d = 0.8, newcol = 0.2 * a0 + 0.8 * a1 because 0.8 is 80% of the way between 0 and 1
I found that np.interp can be used, but there is no way for me to put the three column names in variable) :
df["newcol"]=np.interp(df["d"],[0,1,2], [100,200,300])
will indeed give me
d a0 a1 a2 newcol
1.5 10.0 5.0 1.0 250.0
0.8 10.0 2.0 0.0 180.0
BUT I have no way to specify that the values vector changes :
df["newcol"]=np.interp(df["d"],[0,1,2], df[["a0","a1","a2"]])
gives me the following traceback :
File "C:\Python27\lib\site-packages\numpy\lib\function_base.py", line 1271, in interp
return compiled_interp(x, xp, fp, left, right)
ValueError: object too deep for desired array
Is there any way to use a different vector for values at each line? Could you think of any workaround ?
Basically, I could find no way to create this new column based on the definition :
What is the value in x = column "d" of the function that is piecewise linear
between given points and whose values at these points are described in the columns "ai"
Edit: Before, I used scipy.interp1d, which is not memory efficient, the comment helped me to solve partially my problem
Edit2 :
I tried the approach from ev-br that stated that I had to try to code the loop myself.
for i in range(len(tps)):
columns=["a1","a2","a3"]
length=len(columns)
x=np.maximum(0,np.minimum(df.ix[i,"d"],len-2))
xint = np.int(x)
xfrac = x-xint
name1=columns[xint]
name2=columns[xint+1]
tps.ix[i,"Multiplier"]=df.ix[i,name1]+xfrac*(df.ix[i,name2]-tps.ix[i,name1])
The above loop loops around 50 times a second, so I guess I have a major optimisation issue. What part of working on a DataFrame do I do wrong?
It might comes a bit too late, but I would use np.interpolate with pandas' apply function. Creating the DataFrame in your example:
t = pd.DataFrame([[1.5,10,5,1],[0.8,10,2,0]], columns=['d', 'a0', 'a1', 'a2'])
Then comes the apply function:
t.apply(lambda x: np.interp(x.d, [0,1,2], x['a0':]), axis=1)
which yields:
0 3.0
1 3.6
dtype: float64
This is perfectly usable on "normal" datasets. However, the size of your DataFrame might call for a better/more optimized solution. The processing time scales linearily, my machine clocks in 10000 lines per second, which means 5 minutes for 3 million...
OK, I have a second solution, which uses the numexpr module. This method is much more specific, but also much faster. I've measured the complete process to take 733 milliseconds for 1 million lines, which is not bad...
So we have the original DataFrame as before:
t = pd.DataFrame([[1.5,10,5,1],[0.8,10,2,0]], columns=['d', 'a0', 'a1', 'a2'])
We import the module and use it, but it requires that we separate the two cases where we will use 'a0' and 'a1' or 'a1' and 'a2' as lower/upper limits for the linear interpolation. We also prepare the numbers so they can be fed to the same evaluation (hence the -1). We do that by creating 3 arrays with the interpolation value (originally: 'd') and the limits, depending on the value of "d". So we have:
import numexpr as ne
lim = np.where(t.d > 1, [t.d-1, t.a1, t.a2], [t.d, t.a0, t.a1])
Then we evaluate the simple linear interpolation expression and finally add it as a new column like that:
x = ne.evaluate('(1-x)*a+x*b', local_dict={'x': lim[0], 'a': lim[1], 'b': lim[2]})
t['IP'] = np.where(t.d > 1, x+1, x)
Related
I have this simplified DataFrame where I want to add a new column Distance_km.
In this new column all values should be in kilometres and converted to float dtype.
d = {'Point': ['a','b','c','d'], 'Distance': ['3km', '400m','1.1km','200m']}
dist=pd.DataFrame(data=d)
dist
Point Distance
0 a 3km
1 b 400m
2 c 1.1km
3 d 200m
Point object
Distance object
dtype: object
How can I get this output?
Point Distance Distance_km
0 a 3.8km 3.8
1 b 400m 0.4
2 c 1.1km 1.1
3 d 200m 0.2
Point object
Distance object
Distance_km float64
dtype: object
Thanks in advance!
You could use Pandas apply method to pass your distance column values to a function that converts it to a standardized unit like so
From the documentation
Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is
either the DataFrame’s index (axis=0) or the DataFrame’s columns
(axis=1). By default (result_type=None), the final return type is
inferred from the return type of the applied function. Otherwise, it
depends on the result_type argument.
First create the function that will transform the data, apply can even take in a lambda
import re
def convert_to_km(distance):
'''
distance can be a string with km or m as units
e.g. 300km, 1.1km, 200m, 4.5m
'''
# split the string into value and unit ['300', 'km']
split_dist = re.match('([\d\.]+)?([a-zA-Z]+)', distance)
value = split_dist.group(1) # 300
unit = split_dist.group(2) # km
if unit == 'km':
return float(value)
if unit == 'm':
return round(float(value)/1000, 2)
d = {'Point': ['a','b','c','d'], 'Distance': ['3km', '400m','1.1km','200m']}
dist=pd.DataFrame(data=d)
You can then apply this funtion to your distance column
dist['Distanc_km'] = dist.apply(lambda row: convert_to_km(row['Distance']), axis=1)
dist
The output will be
Point Distance Distanc_km
0 a 3km 3.0
1 b 400m 0.4
2 c 1.1km 1.1
3 d 200m 0.2
You may try following as well:
Check if second last character of the string is 'k'.
If it is then only remove the last two character i.e. 'km'
Otherwise take the characters except last one (i.e. 'm') and divide the float value by 1000
Below is the implementation using apply to Distance column:
dist['Distance_km'] = dist['Distance'].apply(lambda row: float(row[:-1])/1000 if not row[-2]=='k' else row[:-2])
Result is:
Point Distance Distance_km
a 3km 3
b 400m 0.4
c 1.1km 1.1
d 200m 0.2
Try:
# An "Weight" column marking those are in "m" units
dist["Weight"] = 1e-3
dist.loc[dist["Distance"].str.contains("km"),"Weight"] = 1
# Extract the numeric part of string and convert it to float
dist["NumericPart"] = dist["Distance"].str.extract("([0-9.]+)\w+").astype(float)
# Merge the numeric parts with their units(weights) by multiplication
dist["Distance_km"] = dist["NumericPart"] * dist["Weight"]
You will get:
Point Distance Weight NumericPart Distance_km
0 a 3km 1.000 3.0 3.0
1 b 400m 0.001 400.0 0.4
2 c 1.1km 1.000 1.1 1.1
3 d 200m 0.001 200.0 0.2
BTW: Avoid using apply if you can, that will be very slow if your data is big.
What is the difference between the pandas DataFrame interpolate function called with args 'index' and 'values' respectively? It's ambiguous from the documentation:
pandas.DataFrame.interpolate
method : str, default ‘linear’
Interpolation technique to use. One of:
‘linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
‘time’: Works on daily and higher resolution data to interpolate given length of interval.
‘index’, ‘values’: use the actual numerical values of the index."
Both appear to use the numerical values of the index, is this the case?
UPDATE:
Following ansev's answer, they do indeed do the same thing
I think it's pretty clear, imagine you're going to interpolate points. The values of your DataFrame represent the Y values, it is about filling in the missing values in Y with some logic, for them an interpolation function is used, in this case for the variable X there are two options, to assume a fixed step, independent of the index or take into account the values of the index.
Example with linear interpolation:
Here for each row the index increases by 1 upward and therefore there is no difference between the methods.
df=pd.DataFrame({'Y':[1,np.nan,3]})
print(df)
Y
0 1.0
1 NaN
2 3.0
print(df.interpolate(method = 'index'))
Y
0 1.0
1 2.0
2 3.0
print(df.interpolate())
Y
0 1.0
1 2.0
2 3.0
but if we change the index values...
df.index = [0,1,10000]
print(df.interpolate(method = 'index'))
Y
0 1.0000
1 1.0002 #(3-1)*((1-0)/(10000-0))
10000 3.0000
print(df.interpolate())
Y
0 1.0
1 2.0
10000 3.0
df.index = [0,0.1,1]
print(df.interpolate(method = 'index'))
Y
0.0 1.0
0.1 1.2 #(3-1)*((0.1-0)/(1-0))
1.0 3.0
I have many pairs of coordinate arrays like so
a=[(1.001,3),(1.334, 4.2),...,(17.83, 3.4)]
b=[(1.002,3.0001),(1.67, 5.4),...,(17.8299, 3.4)]
c=[(1.00101,3.002),(1.3345, 4.202),...,(18.6, 12.511)]
Any coordinate in any of the pairs can be a duplicate of another coordinate in another array of pairs. The arrays are also not the same size.
The duplicates will vary slightly in their value and for an example, I would consider the first value in a, b and c to be duplicates.
I could iterate through each array and compare the values one by one using numpy.isclose, however that will be slow.
Is there an efficient way to tackle this problem, hopefully using numpy to keep computing times low?
you might wanna try the round() function which will round off the numbers in your lists to the nearest integers.
the next thing that I'd suggest might be too extreme:
concat the arrays and put them into a pandas dataframe and drop_duplicates()
this might not be the solution you want
You might want to take a look at numpy.testing if you allow for AsertionError handling.
from numpy import testing as ts
a = np.array((1.001,3))
b = np.array((1.000101, 3.002))
ts.assert_array_almost_equal(a, b, decimal=1) # output None
but
ts.assert_array_almost_equal(a, b, decimal=3)
results in
AssertionError:
Arrays are not almost equal to 3 decimals
Mismatch: 50%
Max absolute difference: 0.002
Max relative difference: 0.00089891
x: array([1.001, 3. ])
y: array([1. , 3.002])
There are some more interesting functions from numpy.testing. Make sure to take a look at the docs.
I'm using pandas to give you an intuitive result, rather than just numbers. Of course you can expand the solution to your need
Say you create a pd.DataFrame from each array, and tag them from which array each belongs to. I am rounding the results to 2 decimal places, you may use whatever tolerance you want
dfa = pd.DataFrame(a).round(2)
dfa['arr'] = 'a'
Then, by concatenating, using duplicated and sorting, you may find an intuitive Dataframe that might fulfill your needs
df = pd.concat([dfa, dfb, dfc])
df[df.duplicated(subset=[0,1], keep=False)].sort_values(by=[0,1])
yields
x y arr
0 1.00 3.0 a
0 1.00 3.0 b
0 1.00 3.0 c
1 1.33 4.2 a
1 1.33 4.2 c
2 17.83 3.4 a
2 17.83 3.4 b
The indexes are duplicated, so you can simply use reset_index() at the end and use the newly-generated column as a parameter that indicates the corresponding index on each array. I.e.:
index x y arr
0 0 1.00 3.0 a
1 0 1.00 3.0 b
2 0 1.00 3.0 c
3 1 1.33 4.2 a
4 1 1.33 4.2 c
5 2 17.83 3.4 a
6 2 17.83 3.4 b
So, for example, line 0 indicates a duplicate coordinate, and is found on index 0 of arr a. Line 1 also indicates a dupe coordinate, found or index 0 of arr b, etc.
Now, if you just want to delete the duplicates and get one final array with only non-duplicate values, you may usedrop_duplicates
df.drop_duplicates(subset=[0,1])[[0,1]].to_numpy()
which yields
array([[ 1. , 3. ],
[ 1.33, 4.2 ],
[17.83, 3.4 ],
[ 1.67, 5.4 ],
[18.6 , 12.51]])
I already read answers and blog entries about how to iterate pandas.DataFrame efficient (https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6), but i still have one question left.
Currently, my DataFrame represents a GPS trajectory containing the columns time, longitude and latitude.
Now, I want to calculate a feature called distance-to-next-point. Therefore, i not only have to iterate through the rows and doing operations on the single rows, but have to access subsequent rows in a single iteration.
i=0
for index, row in df.iterrows():
if i < len(df)-1:
distance = calculate_distance([row['latitude'],row['longitude']],[df.loc[i+1,'latitude'],df.loc[i+1,'longitude']])
row['distance'] = distance
Besides this problem, I have the same issue when calculating speed, applying smoothing or other similar methods.
Another example:
I want to search for datapoints with speed == 0 m/s and outgoing from these points I want to add all subsequent datapoints into an array until the speed reached 10 m/s (to find segments of accelerating from 0m/s to 10m/s).
Do you have any suggestions on how to code stuff like this as efficient as possbile?
You can use pd.DataFrame.shift to add shifted series to your dataframe, then feed into your function via apply:
def calculate_distance(row):
# your function goes here, trivial function used for demonstration
return sum(row[i] for i in df.columns)
df[['next_latitude', 'next_longitude']] = df[['latitude', 'longitude']].shift(-1)
df.loc[df.index[:-1], 'distance'] = df.iloc[:-1].apply(calculate_distance, axis=1)
print(df)
latitude longitude next_latitude next_longitude distance
0 1 5 2.0 6.0 14.0
1 2 6 3.0 7.0 18.0
2 3 7 4.0 8.0 22.0
3 4 8 NaN NaN NaN
This works for an arbitrary function calculate_distance, but the chances are your algorithm is vectorisable, in which case you should use column-wise Pandas / NumPy methods.
I have a pandas dataframe and I want to calculate the rolling mean of a column (after a groupby clause). However, I want to exclude NaNs.
For instance, if the groupby returns [2, NaN, 1], the result should be 1.5 while currently it returns NaN.
I've tried the following but it doesn't seem to work:
df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 3, lambda x: np.mean([i for i in x if i is not np.nan and i!='NaN']))
If I even try this:
df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 3, lambda x: 1)
I'm getting NaN in the output so it must be something to do with how pandas works in the background.
Any ideas?
EDIT:
Here is a code sample with what I'm trying to do:
import pandas as pd
import numpy as np
df = pd.DataFrame({'var1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'], 'value' : [1, 2, 3, np.nan, 2, 3, 4, 1] })
print df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 2, lambda x: np.mean([i for i in x if i is not np.nan and i!='NaN']))
The result is:
0 NaN
1 NaN
2 2.0
3 NaN
4 2.5
5 NaN
6 3.0
7 2.0
while I wanted to have the following:
0 NaN
1 NaN
2 2.0
3 2.0
4 2.5
5 3.0
6 3.0
7 2.0
As always in pandas, sticking to vectorized methods (i.e. avoiding apply) is essential for performance and scalability.
The operation you want to do is a little fiddly as rolling operations on groupby objects are not NaN-aware at present (version 0.18.1). As such, we'll need a few short lines of code:
g1 = df.groupby(['var1'])['value'] # group values
g2 = df.fillna(0).groupby(['var1'])['value'] # fillna, then group values
s = g2.rolling(2).sum() / g1.rolling(2).count() # the actual computation
s.reset_index(level=0, drop=True).sort_index() # drop/sort index
The idea is to sum the values in the window (using sum), count the NaN values (using count) and then divide to find the mean. This code gives the following output that matches your desired output:
0 NaN
1 NaN
2 2.0
3 2.0
4 2.5
5 3.0
6 3.0
7 2.0
Name: value, dtype: float64
Testing this on a larger DataFrame (around 100,000 rows), the run-time was under 100ms, significantly faster than any apply-based methods I tried.
It may be worth testing the different approaches on your actual data as timings may be influenced by other factors such as the number of groups. It's fairly certain that vectorized computations will win out, though.
The approach shown above works well for simple calculations, such as the rolling mean. It will work for more complicated calculations (such as rolling standard deviation), although the implementation is more involved.
The general idea is look at each simple routine that is fast in pandas (e.g. sum) and then fill any null values with an identity element (e.g. 0). You can then use groubpy and perform the rolling operation (e.g. .rolling(2).sum()). The output is then combined with the output(s) of other operations.
For example, to implement groupby NaN-aware rolling variance (of which standard deviation is the square-root) we must find "the mean of the squares minus the square of the mean". Here's a sketch of what this could look like:
def rolling_nanvar(df, window):
"""
Group df by 'var1' values and then calculate rolling variance,
adjusting for the number of NaN values in the window.
Note: user may wish to edit this function to control degrees of
freedom (n), depending on their overall aim.
"""
g1 = df.groupby(['var1'])['value']
g2 = df.fillna(0).groupby(['var1'])['value']
# fill missing values with 0, square values and groupby
g3 = df['value'].fillna(0).pow(2).groupby(df['var1'])
n = g1.rolling(window).count()
mean_of_squares = g3.rolling(window).sum() / n
square_of_mean = (g2.rolling(window).sum() / n)**2
variance = mean_of_squares - square_of_mean
return variance.reset_index(level=0, drop=True).sort_index()
Note that this function may not be numerically stable (squaring could lead to overflow). pandas uses Welford's algorithm internally to mitigate this issue.
Anyway, this function, although it uses several operations, is still very fast. Here's a comparison with the more concise apply-based method suggested by Yakym Pirozhenko:
>>> df2 = pd.concat([df]*10000, ignore_index=True) # 80000 rows
>>> %timeit df2.groupby('var1')['value'].apply(\
lambda gp: gp.rolling(7, min_periods=1).apply(np.nanvar))
1 loops, best of 3: 11 s per loop
>>> %timeit rolling_nanvar(df2, 7)
10 loops, best of 3: 110 ms per loop
Vectorization is 100 times faster in this case. Of course, depending on how much data you have, you may wish to stick to using apply since it allows you generality/brevity at the expense of performance.
Can this result match your expectations?
I slightly changed your solution with min_periods parameter and right filter for nan.
In [164]: df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 2, lambda x: np.mean([i for i in x if not np.isnan(i)]), min_periods=1)
Out[164]:
0 1.0
1 2.0
2 2.0
3 2.0
4 2.5
5 3.0
6 3.0
7 2.0
dtype: float64
Here is an alternative implementation without list comprehension, but it also fails to populate the first entry of the output with np.nan
means = df.groupby('var1')['value'].apply(
lambda gp: gp.rolling(2, min_periods=1).apply(np.nanmean))