Resample data with pandas - python

My initial data.head() output is:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 45993 entries, 2009-11-17 14:14:00 to 2012-12-16 14:26:00
Data columns (total 4 columns):
rain 45993 non-null values
temp 45993 non-null values
windspeed 45993 non-null values
dew_point 45993 non-null values
dtypes: float64(4)
2009-11-17 14:14:00 0 22.5 4.9 12.3
2009-11-17 14:44:00 0 22.3 6.1 12.1
2009-11-17 15:14:00 0 22.1 5.3 12.5
2009-11-17 15:44:00 0 22.2 3.3 12.0
2009-11-17 16:14:00 0 20.4 4.9 11.7
When I resample:
data = data.resample('30min', how='sum')
data.head()
I get:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 68861 entries, 2009-01-12 00:00:00 to 2012-12-16 14:00:00
Freq: 30T
Data columns (total 4 columns):
rain 45987 non-null values
temp 45987 non-null values
windspeed 45987 non-null values
dew_point 45987 non-null values
dtypes: float64(4)
2009-01-12 00:00:00 0 17.4 7.1 14.6
2009-01-12 00:30:00 0 17.4 7.2 14.7
2009-01-12 01:00:00 0 18.0 10.5 14.3
2009-01-12 01:30:00 0 18.3 9.6 14.2
2009-01-12 02:00:00 0 18.4 10.8 14.8
As you can see, my initial date is 2009-11-17 14:14:00, but the resampled data starts at 2009-01-12. Can anyone explain what is happening?
EDIT: I found the problem, so for others:
the provided dataset had:
2009-01-12 00:00:00 value
2009-01-12 00:30:00 value ... but the next line was
2009-01-12 01:00 value
so the missing :00 seconds caused all the confusion.
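For later readers: resample always builds its output index from the earliest to the latest timestamp present, so a single stray early date stretches the whole range. A minimal sketch with made-up timestamps, using the modern .resample(...).sum() spelling in place of the long-deprecated how='sum':

```python
import pandas as pd

# Made-up timestamps: one stray early date (e.g. from a day/month
# parsing mix-up) is enough to stretch the resampled range.
idx = pd.to_datetime([
    "2009-01-12 00:00:00",   # stray early row
    "2009-11-17 14:14:00",
    "2009-11-17 14:44:00",
])
s = pd.Series([0.0, 0.0, 0.0], index=idx).sort_index()

# Modern spelling of resample('30min', how='sum'):
out = s.resample("30min").sum()
print(out.index.min())  # 2009-01-12 00:00:00, not 2009-11-17
```

The resampled index also gains a 30-minute bin for every empty gap between the stray row and the real data, which is why the entry count jumped in the question.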

Problem adding column to Pandas DataFrame

I have a Dataframe of raw data:
df
Out:
Date_time 10a 10b 10c 40a 40b 40c 100a 100b 100c
120 2019-02-04 16:00:00 26.7 26.9 NaN 26.7 NaN NaN 24.9 NaN NaN
121 2019-02-04 17:00:00 23.4 24.0 23.5 24.3 24.1 24.0 25.1 24.8 25.1
122 2019-02-04 18:00:00 23.1 24.0 23.3 24.3 24.1 24.0 25.1 24.8 25.1
123 2019-02-04 19:00:00 22.8 23.8 22.9 24.3 24.1 24.0 25.1 24.8 25.1
124 2019-02-04 20:00:00 NaN 23.5 22.6 24.3 24.1 24.0 25.1 24.8 25.1
I wish to create a DataFrame containing the 'Date_time' column and several columns of data means. In this instance there will be 3 means for each row: one each for 10, 40, and 100, averaging the a, b, and c columns within each numbered group.
means
Out:
Date_time 10cm 40cm 100cm
120 2019-02-04 16:00:00 26.800000 26.700000 24.9
121 2019-02-04 17:00:00 23.633333 24.133333 25.0
122 2019-02-04 18:00:00 23.466667 24.133333 25.0
123 2019-02-04 19:00:00 23.166667 24.133333 25.0
124 2019-02-04 20:00:00 23.050000 24.133333 25.0
I have tried the following (taken from this answer):
means = df['Date_time'].copy()
means['10cm'] = df.loc[:, '10a':'10c'].mean(axis=1)
But this results in all the mean values being clumped together in one cell at the bottom of the 'Date_time' column with '10cm' being given as the cell's index.
means
Out:
120 2019-02-04 16:00:00
121 2019-02-04 17:00:00
122 2019-02-04 18:00:00
123 2019-02-04 19:00:00
124 2019-02-04 20:00:00
10cm 120 26.800000
121 23.633333
122 23.46...
Name: Date_time, dtype: object
I believe that this is something to do with means being a Series object rather that a DataFrame object when I copy across the 'Date_time' column, but I'm not sure. Any pointers would be greatly appreciated!
It was the Series issue. Turns out writing out the question helped me realise the issue! My solution was altering the initial creation of means using to_frame():
means = df['Date_time'].copy().to_frame()
I'll leave the question up in case anyone else is having a similar issue, to save them having to spend time writing it all up!
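A minimal end-to-end sketch of that fix, using a cut-down stand-in for the question's frame:

```python
import pandas as pd
import numpy as np

# Cut-down stand-in for the question's frame.
df = pd.DataFrame({
    "Date_time": pd.to_datetime(["2019-02-04 16:00", "2019-02-04 17:00"]),
    "10a": [26.7, 23.4],
    "10b": [26.9, 24.0],
    "10c": [np.nan, 23.5],
})

# to_frame() turns the copied Series into a one-column DataFrame, so the
# assignment below adds a real column instead of appending index labels.
means = df["Date_time"].copy().to_frame()
means["10cm"] = df.loc[:, "10a":"10c"].mean(axis=1)
print(means["10cm"].tolist())  # [26.8, 23.633...]
```

Note that mean(axis=1) skips NaN by default, which matches the means shown in the question.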

Issues converting columns to datetime column ValueError: cannot assemble the datetimes: unconverted data remains: 2

I have this df:
CODE STATION year month day TMAX TMIN PPTOT
0 472606FA AYABACA 2001 1 1 18.0 10.1 0.0
1 472606FA AYABACA 2001 1 2 18.7 9.6 0.0
2 472606FA AYABACA 2001 1 3 19.6 9.3 0.7
3 472606FA AYABACA 2001 1 4 NaN 10.4 NaN
4 472606FA AYABACA 2001 1 5 NaN NaN NaN
... ... ... ... ... ... ... ...
7420 4725F170 HUAROS 2021 4 26 15.6 5.2 0.0
7421 4725F170 HUAROS 2021 4 27 14.4 4.6 0.0
7422 4725F170 HUAROS 2021 4 28 12.9 4.0 0.0
7423 4725F170 HUAROS 2021 4 29 13.5 3.7 0.0
7424 4725F170 HUAROS 2021 4 30 13.0 4.1 0.0
I want to convert the year, month, and day columns to a datetime column, so I wrote this code:
df['DATE']=pd.to_datetime(df[['year','month','day']],format="%d/%m/%Y")
I also tried without the format:
df['DATE']=pd.to_datetime(df[['year','month','day']])
But I get this error:
ValueError: cannot assemble the datetimes: unconverted data remains: 2
I checked all the values and there are no NaN values in year, month, or day. There are also no strange characters.
I don't know what can be the error.
I would appreciate any help.
Thanks in advance.
Try the errors='coerce' parameter of the to_datetime() method:
df['DATE']=pd.to_datetime(df[['year','month','day']],errors='coerce')
Note:
This works in pandas 1.2.4, so if it does not work for you, consider upgrading pandas.
Use .str.cat() to tie the components together and then convert using pd.to_datetime. Code below:
my_df['Date']=my_df[['year', 'month', 'day']].apply(lambda x: pd.to_datetime(x.astype(str).str.cat(sep='/'),format="%Y/%m/%d"), axis=1)
my_df.dtypes
CODE object
STATION object
year int64
month int64
day int64
TMAX float64
TMIN float64
PPTOT float64
Date datetime64[ns]
Use pd.to_datetime with a '/'.join of the values:
df[["year","month","day"]].apply(lambda x: pd.to_datetime('/'.join(x.values.astype(str))), axis=1)
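For later readers: the column-assembly form of pd.to_datetime needs no format string at all; it just requires numeric columns named year, month, and day (plus optional hour, minute, etc.). A minimal sketch:

```python
import pandas as pd

# Minimal stand-in for the question's year/month/day columns.
df = pd.DataFrame({"year": [2001, 2001], "month": [1, 1], "day": [1, 2]})

# The column-assembly form of to_datetime: no format string needed.
df["DATE"] = pd.to_datetime(df[["year", "month", "day"]])

# errors="coerce" turns any unassemblable row into NaT instead of raising,
# which also helps locate the offending rows.
df["DATE2"] = pd.to_datetime(df[["year", "month", "day"]], errors="coerce")
print(df["DATE"].iloc[0])  # 2001-01-01 00:00:00
```

After coercing, df[df["DATE2"].isna()] shows exactly which rows could not be assembled.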

Merge two data frames on three columns in Python

I have two data frames and I would like to merge them on the two columns Latitude and Longitude. The resulting df should include all columns.
df1:
Date Latitude Longitude LST
0 2019-01-01 66.33 17.100 -8.010004
1 2019-01-09 66.33 17.100 -6.675005
2 2019-01-17 66.33 17.100 -21.845003
3 2019-01-25 66.33 17.100 -26.940004
4 2019-02-02 66.33 17.100 -23.035009
... ... ... ... ...
and df2:
Station_Number Date Latitude Longitude Elevation Value
0 CA002100636 2019-01-01 69.5667 -138.9167 1.0 -18.300000
1 CA002100636 2019-01-09 69.5667 -138.9167 1.0 -26.871429
2 CA002100636 2019-01-17 69.5667 -138.9167 1.0 -19.885714
3 CA002100636 2019-01-25 69.5667 -138.9167 1.0 -17.737500
4 CA002100636 2019-02-02 69.5667 -138.9167 1.0 -13.787500
... ... ... ... ... ... ...
I have tried LST_1 = pd.merge(df1, df2, how='inner'), but merging that way loses several data points that are included in both data frames.
I am not sure whether you want to merge on a specific column; if so, you need to pick one with overlapping identifiers, for instance the "Date" column.
df_ = pd.merge(df1, df2, on="Date")
print(df_)
Date Latitude_x Longitude_x ... Longitude_y Elevation Value
0 01.01.2019 66.33 17.1 ... -138.9167 1.0 -18.300000
1 09.01.2019 66.33 17.1 ... -138.9167 1.0 -26.871429
2 17.01.2019 66.33 17.1 ... -138.9167 1.0 -19.885714
3 25.01.2019 66.33 17.1 ... -138.9167 1.0 -17.737500
4 02.02.2019 66.33 17.1 ... -138.9167 1.0 -13.787500
[5 rows x 9 columns]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 5 non-null object
1 Latitude_x 5 non-null float64
2 Longitude_x 5 non-null float64
3 LST 5 non-null object
4 Station_Number 5 non-null object
5 Latitude_y 5 non-null int64
6 Longitude_y 5 non-null int64
7 Elevation 5 non-null float64
8 Value 5 non-null object
dtypes: float64(3), int64(2), object(4)
memory usage: 400.0+ bytes
Because both frames share the column names Latitude and Longitude, pandas appends the _x and _y suffixes to them.
If you want all the columns and the data in each row is independent of the others, then you can use pd.concat. However, this will create some NaN values, due to missing data.
df_1 = pd.concat([df1, df2])
print(df_1)
Date Latitude Longitude ... Station_Number Elevation Value
0 01.01.2019 66.33 17.1 ... NaN NaN NaN
1 09.01.2019 66.33 17.1 ... NaN NaN NaN
2 17.01.2019 66.33 17.1 ... NaN NaN NaN
3 25.01.2019 66.33 17.1 ... NaN NaN NaN
4 02.02.2019 66.33 17.1 ... NaN NaN NaN
0 01.01.2019 69.56 -138.9167 ... CA002100636 1.0 -18.300000
1 09.01.2019 69.56 -138.9167 ... CA002100636 1.0 -26.871429
2 17.01.2019 69.56 -138.9167 ... CA002100636 1.0 -19.885714
3 25.01.2019 69.56 -138.9167 ... CA002100636 1.0 -17.737500
4 02.02.2019 69.56 -138.9167 ... CA002100636 1.0 -13.787500
df_1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 4
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 10 non-null object
1 Latitude 10 non-null float64
2 Longitude 10 non-null float64
3 LST 5 non-null object
4 Station_Number 5 non-null object
5 Elevation 5 non-null float64
6 Value 5 non-null object
dtypes: float64(3), object(4)
memory usage: 640.0+ bytes
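If the intent of the title (merging on three columns) is to match rows on Date, Latitude, and Longitude explicitly, those keys can be passed to on=, with a join type that keeps unmatched rows. A minimal sketch with made-up stand-in frames:

```python
import pandas as pd

# Made-up stand-ins for df1/df2 sharing the three key columns.
df1 = pd.DataFrame({"Date": ["2019-01-01", "2019-01-09"],
                    "Latitude": [66.33, 66.33],
                    "Longitude": [17.1, 17.1],
                    "LST": [-8.0, -6.7]})
df2 = pd.DataFrame({"Date": ["2019-01-01", "2019-01-09"],
                    "Latitude": [69.5667, 69.5667],
                    "Longitude": [-138.9167, -138.9167],
                    "Value": [-18.3, -26.9]})

# Merge on the three key columns; how="outer" keeps rows present in only
# one frame (an inner merge drops them, which is why data points
# "disappeared" in the question).
merged = pd.merge(df1, df2, on=["Date", "Latitude", "Longitude"], how="outer")
print(len(merged))  # 4: no coordinate triple matches, so all rows survive
```

With float keys like Latitude/Longitude, tiny rounding differences also break matches, so rounding the keys first may be needed.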

How to add a new column to a hierarchical dataframe grouped by groupby

The following script tries to calculate the resultant average of the wind direction and magnitude. My monthly dataframe has the following columns:
data
Fecha Hora DirViento MagViento Temperatura Humedad PreciAcu
0 2011/07/01 00:00 318 6.6 21.22 100 1.7
1 2011/07/01 00:15 342 5.5 21.20 100 1.7
2 2011/07/01 00:30 329 6.6 21.15 100 4.8
3 2011/07/01 00:45 279 7.5 21.11 100 4.2
4 2011/07/01 01:00 318 6.0 21.16 100 2.5
The first thing I do is convert the DirViento column to radians:
dir_rad = []
for i in range(len(data['DirViento'])):
    dir_rad.append(data['DirViento'][i] * (pi / 180.0))
data['DirViento'] = around(dir_rad, 1)
Now compute the u and v wind components and add them to data:
Uviento = []
Vviento = []
for i in range(len(data['MagViento'])):
    Uviento.append(data['MagViento'][i] * sin(data['DirViento'][i]))
    Vviento.append(data['MagViento'][i] * cos(data['DirViento'][i]))
data['u'] = around(Uviento, 1)
data['v'] = around(Vviento, 1)
data
Data columns:
Fecha 51 non-null values
Hora 51 non-null values
DirViento 51 non-null values
MagViento 51 non-null values
Temperatura 51 non-null values
Humedad 51 non-null values
PreciAcu 51 non-null values
u 51 non-null values
v 51 non-null values
dtypes: float64(6), int64(2), object(2)
Now I index the dataframe and group it:
data.set_index(['Fecha', 'Hora'], inplace=True)
grouped = data.groupby(level=0)
data['u']
Fecha Hora
2011/07/01 00:00 -4.4
00:15 -1.7
00:30 -3.4
00:45 -7.4
01:00 -4.0
2011/07/02 00:00 -4.5
00:15 -4.2
00:30 -7.6
00:45 -3.8
01:00 -2.0
2011/07/03 00:00 -6.3
00:15 -13.7
00:30 -0.3
00:45 -2.5
01:00 -2.7
Now get the resultant wind direction for each day:
grouped.apply(lambda x: (scipy.arctan2(mean(x['u']), mean(x['v']))) / (pi / 180.0))
Fecha
2011/07/01 -55.495677
2011/07/02 -39.176537
2011/07/03 -51.416339
To the result obtained, I need to apply the following conditions:
for i in grouped.apply(lambda x: (scipy.arctan2(mean(x['u']), mean(x['v']))) / (pi / 180.0)):
    if i < 180:
        i = i + 180
    elif i > 180:
        i = i - 180
    print i
124.504323033
140.823463279
128.5836605
How can I add the previous result to the following aggregation?
stat_cea = grouped.agg({'MagRes':np.mean,'DirRes':np.mean,'Temperatura':np.mean,'Humedad':np.mean,'PreciAcu':np.sum})
stat_cea
Fecha Humedad PreciAcu Temperatura
2011/07/01 100.000000 30.4 21.367059
2011/07/02 99.823529 18.0 21.841765
2011/07/03 99.823529 4.0 21.347059
You can write your own aggregate functions to apply to grouped data (see https://stackoverflow.com/a/10964938/2530083). For your case you could try something like:
import numpy as np

def DirRes(group):
    u = np.sum(group['MagViento'] * np.sin(np.deg2rad(group['DirViento'])))
    v = np.sum(group['MagViento'] * np.cos(np.deg2rad(group['DirViento'])))
    magdir = np.rad2deg(np.arctan2(u, v))
    if magdir < 180:
        magdir += 180
    elif magdir > 180:
        magdir -= 180
    return magdir

def MagRes(group):
    u = np.sum(group['MagViento'] * np.sin(np.deg2rad(group['DirViento'])))
    v = np.sum(group['MagViento'] * np.cos(np.deg2rad(group['DirViento'])))
    return np.sqrt(u*u + v*v)
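A sketch of how such an aggregator might be wired up with groupby (the sample values are taken from the question's first rows; only MagRes is shown, DirRes would be applied the same way, and the resulting Series can be joined onto the stat_cea aggregation on Fecha):

```python
import numpy as np
import pandas as pd

# Toy frame in the question's layout (first three sample rows).
data = pd.DataFrame({
    "Fecha": ["2011/07/01"] * 3,
    "DirViento": [318, 342, 329],
    "MagViento": [6.6, 5.5, 6.6],
})

def MagRes(group):
    # Vector-sum the u/v components, then take the magnitude.
    u = np.sum(group["MagViento"] * np.sin(np.deg2rad(group["DirViento"])))
    v = np.sum(group["MagViento"] * np.cos(np.deg2rad(group["DirViento"])))
    return np.sqrt(u * u + v * v)

# Apply the custom aggregate per day.
mag = data.groupby("Fecha")[["DirViento", "MagViento"]].apply(MagRes)
print(mag)
```

The result is a Series indexed by Fecha, so a plain join or assignment against the other per-day statistics lines up by date.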

Save daily averages obtained from a monthly database to a csv format file

I have a file 'tancoyol.csv' containing Fecha, DirViento, MagViento, Temperatura, Humedad, and PreciAcu data recorded every 15 minutes. It has the following form:
Fecha DirViento MagViento Temperatura HUmedad PreciAcu
2011-07-01 00:00:00 318 6.6 21.22 100 1.7
2011-07-01 00:15:00 342 5.5 21.20 100 1.7
2011-07-01 00:30:00 329 6.6 21.15 100 4.8
2011-07-01 00:45:00 279 7.5 21.11 100 4.2
2011-07-01 01:00:00 318 6.0 21.16 100 2.5
2011-07-01 01:15:00 329 7.1 21.13 100 4.0
2011-07-01 01:30:00 300 4.7 21.15 100 1.3
2011-07-01 01:45:00 291 3.1 21.23 100 2.2
2011-07-01 02:00:00 284 7.6 21.29 100 1.3
2011-07-01 02:15:00 315 0.0 21.43 100 1.0
2011-07-01 02:30:00 281 3.6 21.47 100 3.2
2011-07-01 02:45:00 0 2.7 21.52 100 2.5
2011-07-01 03:00:00 57 1.2 21.53 100 0.0
2011-07-01 03:15:00 300 3.4 21.69 100 0.0
2011-07-01 03:30:00 359 5.9 21.67 100 0.0
2011-07-01 03:45:00 309 1.8 21.65 100 0.0
2011-07-01 04:00:00 244 13.4 21.64 100 0.0
2011-07-02 00:00:00 312 6.0 23.05 97 0.0
2011-07-02 00:15:00 318 6.3 22.79 100 0.3
2011-07-02 00:30:00 303 9.1 22.44 100 0.7
2011-07-02 00:45:00 323 6.3 22.40 100 0.3
2011-07-02 01:00:00 319 5.4 22.07 100 0.7
2011-07-02 01:15:00 4 3.9 21.89 100 0.8
2011-07-02 01:30:00 6 4.5 21.74 100 0.7
2011-07-02 01:45:00 310 5.0 21.72 100 1.3
2011-07-02 02:00:00 307 0.0 21.79 100 1.0
2011-07-02 02:15:00 5 3.4 21.78 100 1.2
2011-07-02 02:30:00 288 3.4 21.78 100 1.5
2011-07-02 02:45:00 0 2.6 21.66 100 1.5
2011-07-02 03:00:00 280 5.8 21.48 100 1.3
2011-07-02 03:15:00 29 0.0 21.43 100 1.5
2011-07-02 03:30:00 332 2.0 21.23 100 1.7
2011-07-02 03:45:00 148 0.0 21.06 100 1.5
2011-07-02 04:00:00 132 0.0 21.00 100 2.0
2011-07-03 00:00:00 308 8.0 21.93 99 0.3
2011-07-03 00:15:00 288 14.4 21.85 99 0.2
2011-07-03 00:30:00 354 3.1 21.85 99 0.3
2011-07-03 00:45:00 335 5.8 21.75 100 0.2
2011-07-03 01:00:00 274 2.7 21.68 100 0.0
2011-07-03 01:15:00 328 5.6 21.55 100 0.3
2011-07-03 01:30:00 319 7.9 21.38 100 0.2
2011-07-03 01:45:00 322 5.1 21.32 100 0.3
2011-07-03 02:00:00 317 2.8 21.21 100 0.2
2011-07-03 02:15:00 322 5.3 21.08 100 0.3
2011-07-03 02:30:00 291 4.3 21.06 100 0.2
2011-07-03 02:45:00 284 5.7 21.04 100 0.3
2011-07-03 03:00:00 310 2.7 21.05 100 0.2
2011-07-03 03:15:00 318 4.6 21.06 100 0.3
2011-07-03 03:30:00 299 7.4 21.05 100 0.2
2011-07-03 03:45:00 238 0.0 20.99 100 0.3
2011-07-03 04:00:00 310 1.4 21.05 100 0.2
The first thing I want to do is get the average of the DirViento, MagViento, Temperatura, and Humedad columns. I do this as follows:
import pandas as pd
import numpy as np
df = pd.read_csv('tancoyol.csv', parse_dates=[['Fecha','Hora']])
df1=df.set_index('Fecha_Hora')
prom_diario=df1.resample('D',how=np.mean)
print prom_diario
Fecha DirViento MagViento Temperatura Humedad PreciAcu
2011-07-01 318.000000 6.600000 21.220000 100.000000 1.700000
2011-07-02 273.470588 5.064706 21.474706 99.823529 1.688235
2011-07-03 200.705882 3.864706 21.775882 99.941176 1.076471
2011-07-04 306.812500 4.925000 21.310625 99.875000 0.231250
However, the averages are not aligned with days 1, 2, and 3: the output is lagged, i.e., the average labelled day 2 should correspond to the first day, and so on. How can I resolve this problem?
Also, instead of obtaining the average for the PreciAcu column, I would like to get the daily sum for that column only. How can I do it?
Finally, how do I store the outputs (averages and sum) in a CSV file?
I will appreciate your help very much.
To sum one column and average others, pass a dictionary of column names and functions.
In [47]: df.resample('D', {'DirViento': np.mean, 'MagViento': np.mean, 'Temperatura': np.mean, 'HUmedad': np.mean, 'PreciAcu': np.sum})
Out[47]:
PreciAcu Temperatura HUmedad DirViento MagViento
0_1
2011-07-01 30.4 21.367059 100.000000 273.823529 5.100000
2011-07-02 18.0 21.841765 99.823529 200.941176 3.747059
2011-07-03 4.0 21.347059 99.823529 306.882353 5.105882
I don't follow your reasoning for why the output is lagged, but you can achieve it like this:
In [53]: resampled = df.resample('D', {'DirViento': np.mean, 'MagViento': np.mean, 'Temperatura': np.mean, 'HUmedad': np.mean, 'PreciAcu': np.sum})
In [54]: resampled.tshift(-1)
Out[54]:
PreciAcu Temperatura HUmedad DirViento MagViento
0_1
2011-06-30 30.4 21.367059 100.000000 273.823529 5.100000
2011-07-01 18.0 21.841765 99.823529 200.941176 3.747059
2011-07-02 4.0 21.347059 99.823529 306.882353 5.105882
To save it to a CSV is happily easy: df1.to_csv('filename.csv').
I think you're looking for the closed='right' and label='right' arguments of resample:
In [38]: hows = {'PreciAcu': 'sum'}
In [39]: func_keys = df.columns - Index(hows.keys())
In [40]: mean_funcs = zip(func_keys, ['mean'] * len(func_keys))
In [41]: hows.update(mean_funcs)
In [42]: hows
Out[42]:
{'DirViento': 'mean',
'HUmedad': 'mean',
'MagViento': 'mean',
'PreciAcu': 'sum',
'Temperatura': 'mean'}
In [48]: df.resample('D', how=hows, closed='right', label='right')
Out[48]:
PreciAcu HUmedad Temperatura DirViento MagViento
ts
2011-07-01 1.7 100.000 21.220 318.000 6.600
2011-07-02 28.7 99.824 21.475 273.471 5.065
2011-07-03 18.3 99.941 21.776 200.706 3.865
2011-07-04 3.7 99.875 21.311 306.812 4.925
And of course, as @Dan Allan says, use to_csv to write your newly resampled DataFrame to a file.
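For later readers: the how= dictionary form used in these answers was removed in later pandas versions. A minimal sketch of the modern equivalent, using made-up data and an assumed output filename:

```python
import pandas as pd

# Made-up data: two days of 6-hourly readings standing in for the CSV.
rng = pd.date_range("2011-07-01", periods=8, freq="6h")
df1 = pd.DataFrame({"Temperatura": 21.0, "PreciAcu": 0.5}, index=rng)

# Modern spelling of resample('D', how={...}): chain .agg with a
# column -> function dict, mixing mean and sum per column.
daily = df1.resample("D").agg({"Temperatura": "mean", "PreciAcu": "sum"})

# Writing the result to CSV (assumed filename):
daily.to_csv("prom_diario.csv")
print(daily)
```

The closed='right'/label='right' arguments still exist on .resample() itself and can be combined with .agg() in the same way.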
