Data changes while interpolating data frame using Pandas and numpy

I am trying to calculate degree hours based on hourly temperature values.
The data that I am using has some missing days and I am trying to interpolate that data. Below is part of the data:
2012-06-27 19:00:00 24
2012-06-27 20:00:00 23
2012-06-27 21:00:00 23
2012-06-27 22:00:00 16
2012-06-27 23:00:00 15
2012-06-29 00:00:00 15
2012-06-29 01:00:00 16
2012-06-29 02:00:00 16
2012-06-29 03:00:00 16
2012-06-29 04:00:00 17
2012-06-29 05:00:00 17
2012-06-29 06:00:00 18
....
2014-12-14 20:00:00 1
2014-12-14 21:00:00 0
2014-12-14 22:00:00 -1
2014-12-14 23:00:00 8
The full code is:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
filename = 'Temperature12.xls'
df_temp = pd.read_excel(filename)
df_temp = df_temp.set_index('datetime')
ts_temp = df_temp['temp']
def inter_lin_nan(ts_temp, rule):
    ts_temp = ts_temp.resample(rule)
    mask = np.isnan(ts_temp)
    # interpolating missing values
    ts_temp[mask] = np.interp(np.flatnonzero(mask), np.flatnonzero(~mask), ts_temp[~mask])
    return(ts_temp)
ts_temp = inter_lin_nan(ts_temp, '1H')
print ts_temp['2014-06-28':'2014-06-29']
def HDH(Tcurr, Tref=15.0):
    if Tref >= Tcurr:
        return ((Tref-Tcurr)/24)
    else:
        return (0)
df_temp['H-Degreehours'] = df_temp.apply(lambda row: HDH(row['temp']), axis=1)
df_temp['CDD-CUMSUM'] = df_temp['C-Degreehours'].cumsum()
df_temp['HDD-CUMSUM'] = df_temp['H-Degreehours'].cumsum()
df_temp1 = df_temp['H-Degreehours'].resample('H', how=sum)
print df_temp1
Now I have two questions. First, the inter_lin_nan function does interpolate the data, but it also changes the next day's data, which ends up totally different from what is in the Excel file. Is this expected, or have I missed something?
Second question: at the end of the code I am trying to sum the hourly degree-hour values, which is why I created another data frame. But when I print that data frame, it still has NaN values, as in the original data file. Could you please tell me why this is happening?
I may be missing something very obvious as I am new to Python.

Don't use numpy when pandas has its own version.
df = pd.read_csv(filepath)
df = df.asfreq('1d')  # get a time series with an index timestamp for each day
df['somelabel'] = df['somelabel'].interpolate(method='linear')  # interpolate NaN values
Use asfreq() to add the required frequency of timestamps to your time series, and use interpolate() to fill only the NaN values.
http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.Series.interpolate.html
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.asfreq.html
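To make the idea concrete for hourly data, here is a minimal sketch. The toy temperatures and the names ts, hourly, filled and hdh are made up for illustration; the division by 24 just mirrors the HDH function from the question. It also hints at why the second data frame in the question may still contain NaN: there, the interpolation was applied to ts_temp, while the degree hours were computed from the untouched df_temp.

```python
import pandas as pd

# Toy hourly temperatures with a two-hour gap (illustrative values only).
idx = pd.to_datetime([
    "2012-06-27 22:00", "2012-06-27 23:00",
    "2012-06-28 02:00", "2012-06-28 03:00",
])
ts = pd.Series([16.0, 12.0, 15.0, 16.0], index=idx, name="temp")

# asfreq inserts the missing hourly timestamps as NaN; existing rows are kept as they are.
hourly = ts.asfreq("h")

# interpolate fills only the NaN entries, so the original readings do not change.
filled = hourly.interpolate(method="linear")

# Heating degree hours computed from the interpolated series (same formula as HDH in the question).
t_ref = 15.0
hdh = (t_ref - filled).clip(lower=0) / 24

print(filled)
print(hdh)
```

The point of the sketch is that the degree hours are derived from the filled series, not from the original column that still has gaps.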

Related

Convert a column to a specific time format which contains different types of time formats in python

This is my data frame
df = pd.DataFrame({
    'Time': ['10:00PM', '15:45:00', '13:40:00AM', '5:00']
})
Time
0 10:00PM
1 15:45:00
2 13:40:00AM
3 5:00
I need to convert the times to a single format. My expected output is given below.
Time
0 22:00:00
1 15:45:00
2 01:40:00
3 05:00:00
I tried using the split and endswith functions of str, which turned into a complicated solution. Is there a better way to achieve this?
Thanks in advance!

Here you go. One thing to mention though: 13:40:00AM will result in an error, since a) 13 is the wrong format, as AM/PM hours only go from 1 to 12, and b) PM (which 13 would be) cannot be AM at the same time :)
Cheers
import pandas as pd
df = pd.DataFrame({'Time': ['10:00PM', '15:45:00', '01:40:00AM', '5:00']})
df['Time'] = pd.to_datetime(df['Time'])
print(df['Time'].dt.time)
<<< 22:00:00
<<< 15:45:00
<<< 01:40:00
<<< 05:00:00
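If you would rather not rely on automatic format detection, a possible variant is to try a short list of explicit formats and keep the first one that parses. This is only a sketch: the formats list is an assumption about which layouts occur in the column, and anything that matches none of them (such as 13:40:00AM) ends up as NaT instead of raising.

```python
import pandas as pd

times = pd.Series(["10:00PM", "15:45:00", "01:40:00AM", "5:00"])

# Candidate layouts (an assumption about the data, extend as needed).
formats = ["%I:%M%p", "%I:%M:%S%p", "%H:%M:%S", "%H:%M"]

# Parse with each format; entries that do not match become NaT.
attempts = [pd.to_datetime(times, format=fmt, errors="coerce") for fmt in formats]

# Keep the first successful parse for every row.
parsed = attempts[0]
for attempt in attempts[1:]:
    parsed = parsed.fillna(attempt)

# Render everything as HH:MM:SS strings.
result = parsed.dt.strftime("%H:%M:%S")
print(result)
```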

Pandas unstack() and pivot(): MemoryError

Problem description
I'd like to unstack or pivot a DataFrame, but it raises the numpy exception MemoryError: Unable to allocate 1.72 GiB for an array with shape (1844040704,) and data type bool. I have tried this with a DataFrame with a numerical index (df.pivot()) and with a MultiIndex (df.unstack()). Both raise the same exception and I don't know a way around it. I don't feel like I have an exceptionally large dataset with 175200 rows. I have previously used unstack on DataFrames with more than 5 million rows. The df will even become twice as large for the complete analysis!
I try to unstack with df_unstacked = df.unstack(level=0)
Additional info
Before pivot / unstack, I had to add a unique index with df['row_num'] = np.arange(len(df)), because the dataset contains (wanted) duplicate index entries. That's due to daylight saving time, where one day in October has 25 hours. The 2nd hour is duplicated.
I work with Jupyterlab from a virtualenv with python 3.7.
Package versions:
pandas==1.1.2
numpy==1.19.2
jupyterlab==2.2.8
Example data
                                     value
target_frame        row_num year
2017-01-01 01:00:00 0       2016   10,3706
2017-01-01 01:15:00 1       2016   27,2456
2017-01-01 01:30:00 2       2016   20,4022
2017-01-01 01:45:00 3       2016   14,4911
2017-01-01 02:00:00 4       2016   14,2611
...                                    ...
2017-12-31 23:45:00 175195  2020   30,7177
2017-01-01 00:00:00 175196  2020   21,4708
2017-01-01 00:15:00 175197  2020   44,9192
2017-01-01 00:30:00 175198  2020   37,8560
2017-01-01 00:45:00 175199  2020   30,9901
[175200 rows x 1 columns]
Desired result
The index will contain duplicates. For the record, I don't care if it's an index or a regular column.
                       value
year                    2016 2017 ... 2020
target_frame
2017-01-01 01:00:00  10,3706   11 ...   32
2017-01-01 01:15:00  27,2456   12 ...   32
2017-01-01 01:30:00  20,4022   13 ...  541
2017-01-01 01:45:00  14,4911   51 ...  123
2017-01-01 02:00:00  14,2611   56 ...   12
...                      ...
2017-12-31 23:45:00  30,7177   12 ...   12
2017-01-01 00:00:00  21,4708   21 ...   12
2017-01-01 00:15:00  44,9192   21 ...   13
2017-01-01 00:30:00  37,8560   21 ...   11
2017-01-01 00:45:00  30,9901   12 ...   10
[35040 rows x 5 columns]

I will try to help you by addressing the lack of memory and a way to deal with it.
As the array in the error message has almost 2 billion elements, and the error is related to memory, I will focus on that without taking the transformations themselves into account.
If you keep several variables such as df, df_pivoted, df_unstacked, etc., each transformation creates a new object and multiplies your memory consumption. So it is important to free memory along the way, even if your data doesn't seem big enough to consume all of it.
One way to solve this problem is to work on "chunks" and save each transformation step to a file in order to free the memory.
So the first step is to save the files, with a simple 'dataframe.to_csv()'.
The second step is to make the transformations using parts of the data that fit in memory.
For this, the pandas.read_csv() function has an argument called 'chunksize' that makes it return a TextFileReader iterator instead of a DataFrame.
That way, if you want to access the data, you need to iterate over it.
iterator = pandas.read_csv('file.csv', chunksize=32)
iterator.shape # will raise an error.
AttributeError: 'TextFileReader' object has no attribute 'shape'
The right way to do it:
for chunk in iterator:
    print(chunk.shape)
Output:
(32, ncols)
That way, to deal with your problem, you can work with chunks and use the join functions to do the analysis as you need the data.
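As a small illustration of the chunked workflow, the sketch below aggregates each chunk separately and then combines the partial results. The in-memory CSV and the column names year and value are invented for the example; with real data you would pass a file path to read_csv instead of io.StringIO.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a file on disk (hypothetical data).
csv_text = "year,value\n" + "\n".join(f"{2016 + i % 3},{i}" for i in range(10))

# With chunksize set, read_csv yields DataFrames of at most 4 rows each.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

# Reduce every chunk on its own, so only one chunk is in memory at a time.
partial_sums = [chunk.groupby("year")["value"].sum() for chunk in reader]

# Combine the per-chunk results into the final totals.
total = pd.concat(partial_sums).groupby(level=0).sum()
print(total)
```

This only works directly for aggregations that can be combined from partial results, such as sums or counts.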

I think this might be a bug in pandas or numpy. There are different error messages with different pandas and numpy versions (Anaconda vs. pip). I coded the transformation myself and it runs in no time.
# Get the 2017 timestamp for the side_df
side_df = pd.DataFrame({'timestamp': next_df.loc[next_df['year'] == 2017]['target_frame']})
for year in next_df['year'].unique():
    side_df[year] = next_df.loc[next_df['year'] == year]['value']
display(side_df)
Results in:
timestamp 2016 2017 2018 2019 2020
8839 2017-01-01 01:00:00 10,3706 4,4184 14,7919 30,6942 31,0594
8840 2017-01-01 01:15:00 27,2456 23,7641 31,0019 40,2778 46,8350
8841 2017-01-01 01:30:00 20,4022 14,9732 23,8531 34,4941 41,3688
8842 2017-01-01 01:45:00 14,4911 9,4986 17,0181 28,8678 37,8213
8843 2017-01-01 02:00:00 14,2611 5,1241 14,0869 24,3203 34,4150
... ... ... ... ... ... ...
43874 2017-12-31 23:45:00 10,9256 15,2959 22,6000 40,1677 NaN
43875 2017-01-01 00:00:00 10,9706 4,8184 11,5150 30,9208 NaN
43876 2017-01-01 00:15:00 35,6275 25,8251 30,2893 41,5722 NaN
43877 2017-01-01 00:30:00 24,555 17,7821 24,2928 35,5510 NaN
43878 2017-01-01 00:45:00 5,61 11,7059 20,0477 31,2884 NaN
There are still some problems in the dataset (like the NaNs), but that has nothing to do with this question.
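Another option that might be worth trying is to number the rows within each year and pivot on that position rather than on the unique row_num. The sketch below uses a tiny made-up long_df with the column names from the question; the helper column pos is hypothetical, and the approach assumes every year lists its rows in the same order.

```python
import pandas as pd

# Tiny long-format frame; the duplicated 01:15 mimics the repeated DST hour.
stamps = pd.to_datetime(["2017-01-01 01:00", "2017-01-01 01:15", "2017-01-01 01:15"])
long_df = pd.DataFrame({
    "target_frame": list(stamps) * 2,
    "year": [2016] * 3 + [2017] * 3,
    "value": [10.0, 27.0, 20.0, 4.0, 23.0, 14.0],
})

# Position of each row within its year (0, 1, 2, ...), unique per year.
long_df["pos"] = long_df.groupby("year").cumcount()

# Pivot on the position, which gives one column per year.
wide = long_df.pivot(index="pos", columns="year", values="value")

# Put the timestamps of one reference year back as the (non-unique) index.
reference = long_df.loc[long_df["year"] == 2016, "target_frame"].to_numpy()
wide.index = pd.DatetimeIndex(reference, name="target_frame")
print(wide)
```

Because the pivot key is only as long as one year, the intermediate result stays at rows-per-year times number-of-years cells.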

Calculate the sum between the fixed time range using Pandas

My dataset looks like this:
time Open
2017-01-01 00:00:00 1.219690
2017-01-01 01:00:00 1.688490
2017-01-01 02:00:00 1.015285
2017-01-01 03:00:00 1.357672
2017-01-01 04:00:00 1.293786
2017-01-01 05:00:00 1.040048
2017-01-01 06:00:00 1.225080
2017-01-01 07:00:00 1.145402
...., ....
2017-12-31 23:00:00 1.145402
I want to find the sum over a specified time range and save it to a new dataframe.
Let's say,
I want to find the sum between 2017-01-01 22:00:00 and 2017-01-02 04:00:00. This is the sum of 6 hours spanning 2 days. I want to find the sum of the data in a time range such as 10 PM to 4 AM the next day and put it in a different data frame, for example df_timerange_sum. Please note that the range being summed covers 2 different dates.
What did I do?
I used sum() to calculate over the time range like this: df[~df['time'].dt.hour.between(10, 4)].sum() but it gives me the sum of the whole df, not of the time range I have specified.
I also tried resample, but I cannot find a way to do it for specific times.

df['time'].dt.hour.between(10, 4) is always False because no number is larger than 10 and smaller than 4 at the same time. What you want is to mark between(4,21) and then negate that to get the other hours.
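A quick check on the 24 possible hour values shows the difference between the two masks. The variable names here are only for illustration.

```python
import pandas as pd

hours = pd.Series(range(24))

# Lower bound above upper bound: no hour can satisfy both conditions.
never = hours.between(10, 4)

# Mark 4..21 and negate to keep the overnight hours.
overnight = ~hours.between(4, 21)

print(never.any())
print(hours[overnight].tolist())
```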
Here's what I would do:
# mark those between 4AM and 10PM
# data we want is where s == False, i.e. ~s
s = df['time'].dt.hour.between(4, 21)
# s.cumsum() stays constant within each consecutive block of False,
# so it labels the blocks on which we will take the sum
blocks = s.cumsum()
# again we only care about ~s
(df[~s].groupby(blocks[~s], as_index=False)  # we don't need the blocks as index
       .agg({'time': 'min', 'Open': 'sum'})  # time: min selects the beginning of each block
)                                            # Open: sum computes the sum of Open
Output for random data:
time Open
0 2017-01-01 00:00:00 1.282701
1 2017-01-01 22:00:00 2.766324
2 2017-01-02 22:00:00 2.838216
3 2017-01-03 22:00:00 4.151461
4 2017-01-04 22:00:00 2.151626
5 2017-01-05 22:00:00 2.525190
6 2017-01-06 22:00:00 0.798234
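A different way to label the overnight blocks, shown here only as a sketch, is to shift every timestamp forward by two hours so that 22:00 lands on midnight of the next day; each window then falls inside a single calendar date. The constant Open values and the names shifted, in_window and window_date are made up for the example.

```python
import pandas as pd

# Three days of hourly data with a constant value, so the sums are easy to check.
df = pd.DataFrame({"time": pd.date_range("2017-01-01", periods=72, freq="h")})
df["Open"] = 1.0

# After shifting by 2 hours, the 22:00-03:59 window maps to hours 0..5 of one date.
shifted = df["time"] + pd.Timedelta(hours=2)
in_window = shifted.dt.hour < 6

# Group the selected rows by the shifted calendar date.
window_date = shifted[in_window].dt.normalize().rename("window_date")
df_timerange_sum = (
    df[in_window]
    .groupby(window_date)
    .agg(start=("time", "min"), Open=("Open", "sum"))
)
print(df_timerange_sum)
```

As in the answer above, the first and last blocks are partial because the data starts at midnight and ends at 11 PM.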

An alternative (in my opinion more straightforward) approach that accomplishes the same thing. There are definitely ways to shorten the code, but I am also relatively new to pandas.
df.set_index(['time'], inplace=True)  # make time the index (not 100% necessary)
df2 = pd.DataFrame(columns=['start_time', 'end_time', 'sum_Open'])  # new df that stores your desired output + start and end times if you need them
df2['start_time'] = df[df.index.hour == 22].index  # gets/stores all start datetimes (10 PM)
df2['end_time'] = df2['start_time'] + pd.Timedelta(hours=6)  # matching end datetimes (4 AM the next day)
for i, row in df2.iterrows():
    df2.at[i, 'sum_Open'] = df[(df.index >= row['start_time']) & (df.index <= row['end_time'])]['Open'].sum()
Deriving each end time from its start keeps the pairs aligned; taking df[df.index.hour == 4].index instead would pair each 10 PM start with the 4 AM of the same day. The last day ends at 11 PM, so its window is shorter and you may want an if statement to handle it.

How to calculate a mean of measurements taken at the same time (n-hours window) on different days in pandas dataframe?

I have a dataset with measurements acquired almost every 2-hours over a week. I would like to calculate a mean of measurements taken at the same time on different days. For example, I want to calculate the mean of every measurement taken between 12:00 and 13:59.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
#generating test dataframe
date_today = datetime.now()
time_of_taken_measurment = pd.date_range(date_today, date_today +
                                         timedelta(72), freq='2H20MIN')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100,
                         size=len(time_of_taken_measurment))
df = pd.DataFrame({'measurementTimestamp': time_of_taken_measurment, 'measurment': data})
df = df.set_index('measurementTimestamp')
#Calculating the mean for measurments taken in the same hour
hourly_average = df.groupby([df.index.hour]).mean()
hourly_average
The code above gives me this output:
0 47.967742
1 43.354839
2 46.935484
.....
22 42.833333
23 52.741935
I would like to have a result like this:
0 mean0
2 mean1
4 mean2
.....
20 mean10
22 mean11
I was trying to solve my problem using rolling_mean function, but I could not find a way to apply it to my static case.

Use the built-in floor functionality of DatetimeIndex, which allows you to easily create 2-hour time bins.
df.groupby(df.index.floor('2H').time).mean()
Output:
measurment
00:00:00 51.516129
02:00:00 54.868852
04:00:00 52.935484
06:00:00 43.177419
08:00:00 43.903226
10:00:00 55.048387
12:00:00 50.639344
14:00:00 48.870968
16:00:00 43.967742
18:00:00 49.225806
20:00:00 43.774194
22:00:00 50.590164
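If integer labels (0, 2, 4, ...) are preferred, as in the desired output of the question, one possible variant is to bin on the hour with integer division. This is a sketch on made-up data; the name two_hour_bin is hypothetical.

```python
import numpy as np
import pandas as pd

# Two days of hourly test data with values 0..47.
idx = pd.date_range("2021-01-01", periods=48, freq="h")
df = pd.DataFrame({"measurment": np.arange(48)}, index=idx)

# Map hours 0,1 -> 0; 2,3 -> 2; ...; 22,23 -> 22.
two_hour_bin = ((df.index.hour // 2) * 2).to_numpy()

# Mean of all measurements that fall into the same 2-hour bin on any day.
binned_mean = df.groupby(two_hour_bin)["measurment"].mean()
print(binned_mean)
```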

Data Frame in Panda with Time series data

I just started learning pandas. I came across this:
rng = date_range('1/1/2011', periods=72, freq='H')
s = Series(randn(len(rng)), index=rng)
I understand what the above code means, and I tried it in IPython:
import numpy as np
from numpy.random import randn
from pandas import date_range, Series, DataFrame
import time
r = date_range('1/1/2011', periods=72, freq='H')
r
len(r)
[r[i] for i in range(len(r))]
s = Series(randn(len(r)), index=r)
s
s.plot()
df_new = DataFrame(data = s, columns=['Random Number Generated'])
Is this the correct way of creating a data frame?
The next step given is to: Return a series where the absolute difference between a number and the next number in the series is less than 0.5
Do I need to find the difference between each random number generated and store only the sets where the abs diff is < 0.5? Can someone explain how I can do that in pandas?
I also tried to plot the series as a histogram with:
df_new.diff().hist()
The graph displays the random number on the x axis and 0 to 18 on the y axis (which I don't understand). Can someone explain this to me as well?

To give you some pointers in addition to @Dthal's comments:
r = pd.date_range('1/1/2011', periods=72, freq='H')
As commented by @Dthal, you can simplify the creation of your DataFrame randomly sampled from the normal distribution like so:
df = pd.DataFrame(index=r, data=randn(len(r)), columns=['Random Number Generated'])
To show only the differences from the preceding value that are smaller than 0.5 in absolute value:
diff = df.diff()
diff[abs(diff['Random Number Generated']) < 0.5]
Random Number Generated
2011-01-01 02:00:00 0.061821
2011-01-01 05:00:00 0.463712
2011-01-01 09:00:00 -0.402802
2011-01-01 11:00:00 -0.000434
2011-01-01 22:00:00 0.295019
2011-01-02 03:00:00 0.215095
2011-01-02 05:00:00 0.424368
2011-01-02 08:00:00 -0.452416
2011-01-02 09:00:00 -0.474999
2011-01-02 11:00:00 0.385204
2011-01-02 12:00:00 -0.248396
2011-01-02 14:00:00 0.081890
2011-01-02 17:00:00 0.421897
2011-01-02 18:00:00 0.104898
2011-01-03 05:00:00 -0.071969
2011-01-03 15:00:00 0.101156
2011-01-03 18:00:00 -0.175296
2011-01-03 20:00:00 -0.371812
You can chain .dropna() to get rid of the missing values.
The pandas.Series.hist() docs state that the default number of bins is 10, so that's the number of bars you should expect. The y axis shows how many differences fall into each bin. In this case the histogram turns out roughly symmetric around zero, ranging over roughly [-4, +4].
Series.hist(by=None, ax=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, figsize=None, bins=10, **kwds)
diff.hist()
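If the exercise wants the original numbers rather than the differences, one possible reading is to compare each value with the next one and use that as a mask on the series itself. A minimal sketch with fixed values, so the result is easy to verify (gap_to_next and close_to_next are hypothetical names):

```python
import pandas as pd

r = pd.date_range("2011-01-01", periods=6, freq="h")
s = pd.Series([0.0, 0.2, 1.5, 1.6, 3.0, 3.1], index=r)

# Difference between each value and the NEXT value; the last entry has no successor (NaN).
gap_to_next = s.shift(-1) - s

# Keep the values whose absolute difference to the next value is below 0.5.
close_to_next = s[gap_to_next.abs() < 0.5]
print(close_to_next)
```

The last element is dropped automatically, because a comparison with NaN is False.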
