An efficient way to calculate deltas in the DataFrame? - python

I need to calculate the delta and I did it, but I'm using itertuples and I want to avoid using it.
Is there a more efficient way to do that? Take a look at how I did it:
from numpy import append, around, array, float64
from numpy.random import uniform
from pandas import DataFrame
matrix = around(a=uniform(low=1.0, high=50.0, size=(10, 2)), decimals=2)
points = DataFrame(data=matrix, columns=['x', 'y'], dtype='float64')
x_column = points.columns.get_loc('x')
y_column = points.columns.get_loc('y')
x_delta = array(object=[], dtype=float64)
y_delta = array(object=[], dtype=float64)
for row, iterator in enumerate(iterable=points.itertuples(index=False, name='Point')):
    if row == 0:
        x_delta = append(arr=x_delta, values=0.0)
        y_delta = append(arr=y_delta, values=0.0)
    else:
        x_delta = append(arr=x_delta, values=iterator.x / points.iat[row - 1, x_column] - 1)
        y_delta = append(arr=y_delta, values=iterator.y / points.iat[row - 1, y_column] - 1)
x_delta = around(a=x_delta, decimals=2)
y_delta = around(a=y_delta, decimals=2)
points.insert(loc=points.shape[1], column='x_delta', value=x_delta)
points.insert(loc=points.shape[1], column='y_delta', value=y_delta)
print(points)
x y x_delta y_delta
0 26.08 1.37 0.00 0.00
1 8.34 6.82 -0.68 3.98
2 38.42 45.20 3.61 5.63
3 3.59 33.12 -0.91 -0.27
4 42.94 11.06 10.96 -0.67
5 31.99 17.38 -0.26 0.57
6 4.29 17.46 -0.87 0.00
7 19.68 22.28 3.59 0.28
8 27.55 12.98 0.40 -0.42
9 40.23 9.60 0.46 -0.26
Thanks a lot!

Pandas has a pct_change() function which compares each element with the prior one. You can achieve the same result with one line:
points[['x_delta', 'y_delta']] = points[['x', 'y']].pct_change().fillna(0).round(2)
The fillna(0) is to fix the first row which would otherwise return as NaN.

Pandas also has the built-in .diff() function.
Calculates the difference of a Dataframe element compared with
another element in the Dataframe (default is element in previous row).
delta_dataframe = original_dataframe.diff()
In this case delta_dataframe will give you the change between rows of the original_dataframe.
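For the example in the question, a quick sketch that applies both built-ins side by side could look like this (fresh random values, so the numbers will differ from the output above):
import numpy as np
import pandas as pd
matrix = np.around(np.random.uniform(low=1.0, high=50.0, size=(10, 2)), decimals=2)
points = pd.DataFrame(data=matrix, columns=['x', 'y'], dtype='float64')
# relative change between consecutive rows, as in the pct_change() answer above
points[['x_delta', 'y_delta']] = points[['x', 'y']].pct_change().fillna(0).round(2)
# absolute change between consecutive rows via diff()
points[['x_diff', 'y_diff']] = points[['x', 'y']].diff().fillna(0).round(2)
print(points)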

Related

Peak detection for unevenly spaced time series : dataframe with one datetime column and NaN values

I'm working with a dataframe containing environmental values (Sentinel-2 satellite: NDVI) like:
Date ID_151894 ID_109386 ID_111656 ID_110006 ID_112281 ID_132408
0 2015-07-06 0.82 0.61 0.85 0.86 0.76 nan
1 2015-07-16 0.83 0.81 0.77 0.83 0.84 0.82
2 2015-08-02 0.88 0.89 0.89 0.89 0.86 0.84
3 2015-08-05 nan nan 0.85 nan 0.83 0.77
4 2015-08-12 0.82 0.77 nan 0.65 nan 0.42
5 2015-08-22 0.85 0.85 0.88 0.87 0.83 0.83
The columns correspond to different places and the nan values are due to cloudy conditions (which happen often in Belgium). There are obviously a lot more values. To remove outliers, I use the method described in the TIMESAT manual (Jönsson & Eklundh, 2015): a value is removed if
it deviates more than a maximum deviation (here called cutoff) from the median,
it is lower than the mean value of its immediate neighbors minus the cutoff,
or it is larger than the highest value of its immediate neighbors plus the cutoff.
So, I have made the code below to do so:
import pandas as pd
NDVI = pd.read_excel("C:/Python_files/Cartofor/NDVI_frene_5ha.xlsx")
date = NDVI["Date"]
MED = NDVI.median(axis=0, skipna=True, numeric_only=True)
SD = NDVI.std(axis=0, skipna=True, numeric_only=True)
cutoff = 1.5 * SD
for j in range(1, 21):  # columns
    for i in range(1, 480):  # rows
        if NDVI.iloc[i, j] < ((NDVI.iloc[i - 1, j] + NDVI.iloc[i + 1, j]) / 2) - cutoff.iloc[j]:
            NDVI.iloc[i, j] = float('NaN')
        elif NDVI.iloc[i, j] > max(NDVI.iloc[i - 1, j], NDVI.iloc[i + 1, j]) + cutoff.iloc[j]:  # 2)
            NDVI.iloc[i, j] = float('NaN')
        elif (NDVI.iloc[i, j] >= abs(MED.iloc[j] - cutoff.iloc[j])) & (NDVI.iloc[i, j] <= abs(MED.iloc[j] + cutoff.iloc[j])):  # 1)
            NDVI.iloc[i, j] = NDVI.iloc[i, j]
        else:
            NDVI.iloc[i, j] = float('NaN')
The problem is that I need to omit the NaN values for the calculations. The goal is to have a dataframe like the one above, without the outliers.
Once this is done, I have to interpolate the values onto a new chosen time index (e.g. one value per day, or one value every five days, from 2016 to 2020) and write each interpolated column to a txt file to feed it into the TIMESAT software.
I hope my English is not too bad, and thank you for your answers! :)
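A minimal vectorised sketch of the three cutoff rules, applied per column with the NaN values dropped first so that the immediate neighbours are always real observations (remove_outliers is a made-up helper name; the 1.5 * SD cutoff is taken from the question):
import numpy as np
import pandas as pd
def remove_outliers(ndvi, value_cols):
    cleaned = ndvi.copy()
    for col in value_cols:
        s = ndvi[col].dropna()  # omit NaN so the neighbours are real observations
        med = s.median()
        cutoff = 1.5 * s.std()
        prev_, next_ = s.shift(1), s.shift(-1)
        far_from_median = (s - med).abs() > cutoff                # rule 1
        below_neighbours = s < (prev_ + next_) / 2 - cutoff       # rule 2
        above_neighbours = s > np.maximum(prev_, next_) + cutoff  # rule 3
        outliers = far_from_median | below_neighbours | above_neighbours
        cleaned.loc[s.index[outliers.to_numpy()], col] = np.nan
    return cleaned
clean = remove_outliers(NDVI, NDVI.columns.drop('Date'))
The cleaned frame can then be given a DatetimeIndex and interpolated onto whatever regular grid TIMESAT needs (e.g. with resample followed by interpolate) before writing each column to a txt file.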

How to display `.value_counts()` in interval in pandas dataframe

I need to display .value_counts() in interval in pandas dataframe. Here's my code
prob['bucket'] = pd.qcut(prob['prob good'], 20)
grouped = prob.groupby('bucket', as_index = False)
kstable = pd.DataFrame()
kstable['min_prob'] = grouped.min()['prob good']
kstable['max_prob'] = grouped.max()['prob good']
kstable['counts'] = prob['bucket'].value_counts()
My Output
min_prob max_prob counts
0 0.26 0.48 NaN
1 0.49 0.52 NaN
2 0.53 0.54 NaN
3 0.55 0.56 NaN
4 0.57 0.58 NaN
I know that I have a problem with the kstable['counts'] syntax, but how do I solve it?

Use named aggregation to simplify your code: for the counts, GroupBy.size is applied to the bucket column to build the new counts column:
prob['bucket'] = pd.qcut(prob['prob good'], 20)
kstable = prob.groupby('bucket', as_index=False).agg(min_prob=('prob good', 'min'),
                                                     max_prob=('prob good', 'max'),
                                                     counts=('bucket', 'size'))
In your solution, it should work with DataFrame.assign:
kstable = kstable.assign(counts = prob['bucket'].value_counts())
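As a self-contained illustration with made-up data (only the column names mirror the question):
import numpy as np
import pandas as pd
prob = pd.DataFrame({'prob good': np.random.uniform(0, 1, size=1000)})
prob['bucket'] = pd.qcut(prob['prob good'], 20)
kstable = prob.groupby('bucket', as_index=False).agg(min_prob=('prob good', 'min'),
                                                     max_prob=('prob good', 'max'),
                                                     counts=('bucket', 'size'))
print(kstable.head())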

Looping a function with irregular index increase

I have my function:
import numpy as np
def monte_carlo1(N):
    x = np.random.random(size=N)
    y = np.random.random(size=N)
    dist = np.sqrt(x ** 2 + y ** 2)
    hit = 0
    miss = 0
    for z in dist:
        if z <= 1:
            hit += 1
        else:
            miss += 1
    hit_ratio = hit / N
    return hit_ratio
What I want to do is run this function 100 times for each of 10 different values of N, collecting the results into arrays.
For example, a couple of the data collections could be generated by:
data1 = np.array([monte_carlo1(10) for i in range(100)])
data2 = np.array([monte_carlo1(50) for i in range(100)])
data3 = np.array([monte_carlo1(100) for i in range(100)])
But it would be better if I could create a loop which iterates 10 times to produce 10 arrays of data, instead of having 10 variables data1...data10.
However, I want to be able to increase the value of N inside monte_carlo1(N) by irregular amounts, so in my loop I can't just add a fixed value to N on each iteration.
Would someone suggest how I might build a loop like this?
Thanks
EDIT:
N_vals = [10, 50, 100, 250, 500, 1000, 3000, 5000, 7500, 10000]
def data_mc():
    for n in N_vals:
        data = np.array([monte_carlo1(n) for i in range(10)])
    return data
I've set up the function like this, but its output is just one array, which suggests I'm doing something wrong and N_vals isn't being cycled through.
In your edit, data holds only a single array by the time the function returns, so the results for the other values of n are lost; you need to collect them as the loop runs. Here is a solution using a pandas.DataFrame, where the row index is the value of n for that iteration and the columns represent each of the repeated iterations.
import pandas as pd
def calc(n_list, repeat):
    # this will be a simple list of lists, no special 'numpy' arrays
    result_list = []
    for n in n_list:
        result_list.append([monte_carlo1(n) for _ in range(repeat)])
    return pd.DataFrame(data=result_list, index=n_list)
This would allow you to do some data analysis afterwards:
>>> from my_script import calc
>>> n_list = [10, 50, 100, 250, 500]
>>> df = calc(n_list, 10)
>>> df
0 1 2 3 4 5 6 7 8 9
10 0.600 0.800 0.700 0.800 0.600 1.000 0.800 0.800 0.700 0.900
50 0.840 0.860 0.700 0.860 0.740 0.860 0.780 0.740 0.740 0.820
100 0.770 0.780 0.730 0.790 0.780 0.730 0.760 0.740 0.770 0.690
250 0.784 0.804 0.792 0.768 0.800 0.780 0.792 0.800 0.804 0.764
500 0.798 0.776 0.782 0.786 0.768 0.798 0.786 0.774 0.774 0.796
Now you can calculate statistics per value of n:
>>> import pandas as pd
>>> stats = pd.DataFrame()
>>> stats['mean'] = df.mean(axis=1)
>>> stats['standard_dev'] = df.std(axis=1)
>>> stats
mean standard_dev
10 0.7700 0.125167
50 0.7940 0.061137
100 0.7540 0.030984
250 0.7888 0.014459
500 0.7838 0.010891
This data analysis shows you, for example, that your predictions get more accurate (smaller std) as you increase n.

Express pandas operations as pipeline

df = df.loc[:, dict_lup.values()].rename(columns={v: k for k, v in dict_lup.items()})
df['cover'] = df.loc[:, 'cover'] * 100.
df['id'] = df['condition'].map(constants.dict_c)
df['temperature'] = (df['min_t'] + df['max_t'])/2.
Is there a way to express the code above as a pandas pipeline? I am stuck at the first step where I rename some columns in the dataframe and select a subset of the columns.
-- EDIT:
Data is here:
max_t col_a min_t cover condition pressure
0 38.02 1523106000 19.62 0.48 269.76 1006.64
1 39.02 1523196000 20.07 0.29 266.77 1008.03
2 39 1523282400 19.48 0.78 264.29 1008.29
3 39.11 1523368800 20.01 0.7 263.68 1008.29
4 38.59 1523455200 20.88 0.83 262.35 1007.36
5 39.33 1523541600 22 0.65 261.87 1006.82
6 38.96 1523628000 24.05 0.57 259.27 1006.96
7 39.09 1523714400 22.53 0.88 256.49 1007.94
I think you need assign:
df = (df.loc[:, dict_lup.values()].rename(columns={v: k for k, v in dict_lup.items()})
        .assign(cover=df['cover'] * 100.,
                id=df['condition'].map(constants.dict_c),
                temperature=(df['min_t'] + df['max_t'])/2.))
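A closely related variant (just a sketch, not from the answer above): passing callables to assign keeps everything in a single chained expression and makes the new columns refer to the renamed frame rather than the pre-rename df, which matters if dict_lup actually changes the column names:
df = (df.loc[:, dict_lup.values()]
        .rename(columns={v: k for k, v in dict_lup.items()})
        .assign(cover=lambda d: d['cover'] * 100.,
                id=lambda d: d['condition'].map(constants.dict_c),
                temperature=lambda d: (d['min_t'] + d['max_t']) / 2.))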

Pandas - get last n values from a group with an offset.

I have a data frame (pandas, Python 3.5) with date as the index.
The electricity_use column is the label I should predict.
e.g.
City Country electricity_use
DATE
7/1/2014 X A 1.02
7/1/2014 Y A 0.25
7/2/2014 X A 1.21
7/2/2014 Y A 0.27
7/3/2014 X A 1.25
7/3/2014 Y A 0.20
7/4/2014 X A 0.97
7/4/2014 Y A 0.43
7/5/2014 X A 0.54
7/5/2014 Y A 0.45
7/6/2014 X A 1.33
7/6/2014 Y A 0.55
7/7/2014 X A 2.01
7/7/2014 Y A 0.21
7/8/2014 X A 1.11
7/8/2014 Y A 0.34
7/9/2014 X A 1.35
7/9/2014 Y A 0.18
7/10/2014 X A 1.22
7/10/2014 Y A 0.27
Of course the data is larger.
My goal is to add to each row the last 3 electricity_use values within its group ('City', 'Country'), with a gap of 5 days (i.e. to take the 3 most recent values from 5 days back or earlier). The dates can be non-consecutive, but they are ordered.
For example, for the last two rows the result should be:
City Country electricity_use prev_1 prev_2 prev_3
DATE
7/10/2014 X A 1.22 0.54 0.97 1.25
7/10/2014 Y A 0.27 0.45 0.43 0.20
because the date is 7/10/2014 and the gap is 5 days, so we start looking from 7/5/2014, and those are the 3 last values up to that date for each group (in this case, the groups are (X, A) and (Y, A)).
I implemented it with a loop that goes over each group, but I have a feeling it could be done in a much more efficient way.
A naive approach to do this would be to copy the index into a column and iteratively merge n times:
from datetime import datetime, timedelta
# make sure the index is in datetime format
df['date'] = df.index
df1 = df.copy()
for i in range(3):
    df1['date'] = df['date'] - timedelta(5 + i)
    df = df1.merge(df, on=['City', 'Country', 'date'], how='left', suffixes=('', '_' + str(i)))
A faster approach would be to use shift and then remove the bogus values:
df['date'] = df.index
df.sort_values(by=['City', 'Country', 'date'], inplace=True)
temp = df[['City', 'Country', 'date']].groupby(['City', 'Country']).first()
# To pick the oldest date of every City + Country group
df = df.merge(temp, left_on=['City', 'Country'], right_index=True, suffixes=('', '_first'))
df['diff_date'] = df['date'] - df['date_first']
df['diff_date'] = [int(i.days) for i in df['diff_date']]
# Do a shift by 5, 6 and 7
for i in range(5, 8):
    df['days_prior_' + str(i)] = df['electricity_use'].shift(i)
    # The top i values for every City + Country group would be bogus, as they come from the previous group
    df.loc[df['diff_date'] < i, 'days_prior_' + str(i)] = 0
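A more compact variant of the same row-shift idea (a sketch, assuming exactly one row per date per group, so that "5 days back" equals 5 rows back) is to let groupby do the shifting, which avoids the bogus cross-group values altogether:
df = df.sort_values(['City', 'Country', 'date'])
for i in range(5, 8):
    # prev_1 is the value 5 rows back within the group, prev_2 is 6 back, prev_3 is 7 back
    df['prev_' + str(i - 4)] = df.groupby(['City', 'Country'])['electricity_use'].shift(i)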
