How to resample using forward fill in pandas - python

My Dataframe df3 looks something like this:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
2 3 2018-01-02 00:00:09.507 127.0 52
3 4 2018-01-02 00:00:13.743 126.5 52
4 5 2018-01-03 00:00:15.407 125.5 50
...
11 11 2018-01-01 00:00:07.523 125.5 120
12 12 2018-01-01 00:00:08.757 125.0 120
13 13 2018-01-04 00:00:14.507 127.0 300
14 14 2018-01-04 00:00:15.743 126.5 300
15 15 2018-01-05 00:00:19.407 125.5 350
I wanted to resample using ffill for every second so that it looks like this:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:06.000 125.00 101
1 2 2018-01-01 00:00:07.000 125.00 101
2 3 2018-01-01 00:00:08.000 125.00 101
3 4 2018-01-02 00:00:09.000 125.00 52
4 5 2018-01-02 00:00:10.000 127.00 52
...
My code:
def resample(df):
    indexing = df[['Timestamp', 'Data']]
    indexing['Timestamp'] = pd.to_datetime(indexing['Timestamp'])
    indexing = indexing.set_index('Timestamp')
    indexing1 = indexing.resample('1S', fill_method='ffill')
    # indexing1 = indexing1.resample('D')
    return indexing1

indexing = resample(df3)
but it raised this error:
ValueError: cannot reindex a non-unique index with a method or limit
I don't quite understand what this error means. @jezrael, in this similar question, suggested using drop_duplicates with groupby. I am not sure what that does to the data, as there don't seem to be any duplicates in mine. Can someone explain this please? Thanks.

This error is caused by rows like the following:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
When you resample, both of these timestamps round to the same second,
2018-01-01 00:00:06, and pandas doesn't know which of the two Data values to pick.
Instead, you can apply an aggregation function such as last (though mean, max or min
may also be suitable) to select one of the values, and then apply the forward fill.
Example:
from io import StringIO
import pandas as pd
df = pd.read_table(StringIO("""Id  Timestamp                Data   Group_Id
0  1  2018-01-01 00:00:05.523  125.5  101
1  2  2018-01-01 00:00:05.757  125.0  101
2  3  2018-01-02 00:00:09.507  127.0  52
3  4  2018-01-02 00:00:13.743  126.5  52
4  5  2018-01-03 00:00:15.407  125.5  50"""), sep=r'\s\s+')

# Round each timestamp to the nearest second, keep the last row per second,
# then forward fill the empty seconds.
df['Timestamp'] = pd.to_datetime(df['Timestamp']).dt.round('s')
df.set_index('Timestamp', inplace=True)
df = df.resample('1S').last().ffill()
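Note that the example above resamples the whole frame at once, so the empty seconds between two groups get forward filled with the previous group's values. If you instead want to fill within each Group_Id separately, a minimal sketch along these lines should work (using the question's df3 and assuming only Data needs filling):

# Resample and forward fill the Data column per Group_Id, so values never
# bleed across group boundaries (a sketch, not the only possible layout).
df3['Timestamp'] = pd.to_datetime(df3['Timestamp'])
out = (df3.set_index('Timestamp')
          .groupby('Group_Id')['Data']
          .resample('1S')
          .last()      # collapse rows that land in the same second
          .ffill()     # then fill the empty seconds within the group
          .reset_index())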

Related

Pandas Dataframe days difference with zeros and null values: TypeError: unsupported operand type(s) for +: 'Timedelta' and 'int'

I have a DataFrame with dates. I simply want to fill a new column with the difference between the maximum end date and the minimum start date, i.e. the length in days.
My calculation works, but if either column contains zeros or NaN values it gives me this error.
Can anyone look at the code and give a suggestion?
Thanks in advance.
# here is the Dataframe
end_d start_d
0 2021-09-11 00:00:00 2021-08-01 00:00:00
1 2021-08-29 00:00:00 2021-05-23 00:00:00
2 2021-09-04 00:00:00 2021-06-13 00:00:00
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 NaN NaN
9 NaN NaN
10 2021-09-04 00:00:00 2021-06-13 00:00:00
11
12
13
# The code below works fine when there aren't any zeros or NaN values.
dsx['length'] = (dsx['end_d'] - dsx['start_d'] + pd.Timedelta(1, unit=freq)).max()
# I want something like the DataFrame below. Any suggestions?
end_d start_d length
0 2021-09-11 00:00:00 2021-08-01 00:00:00 99 days
1 2021-08-29 00:00:00 2021-05-23 00:00:00 99 days
2 2021-09-04 00:00:00 2021-06-13 00:00:00 99 days
3 0 0 99 days
4 0 0 99 days
5 0 0 99 days
6 0 0 99 days
7 0 0 99 days
8 NaN NaN 99 days
9 NaN NaN 99 days
10 2021-09-04 00:00:00 2021-06-13 00:00:00 99 days
Thanks in advance.
You can filter the dataframe down to the non-N/A values with dropna.
pandas.DataFrame.dropna
filtered_df = df.dropna()
df['length'] = (filtered_df['end_d'] - filtered_df['start_d'] + pd.Timedelta(1, unit=freq)).max()
That takes care of your N/A problem, but you still have an issue where your columns mix data types (int and datetime). I'm not sure what's going on there, but you need to fix that.
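One hedged way to fix that is to coerce the zeros and NaNs to NaT, so the subtraction produces NaT for those rows and max() skips them (column names are from the question; the Timedelta unit is assumed to be days, since freq is not defined in the snippet):

import pandas as pd

# Anything that isn't a parseable date (0, NaN, empty string) becomes NaT.
dsx['end_d'] = pd.to_datetime(dsx['end_d'], errors='coerce')
dsx['start_d'] = pd.to_datetime(dsx['start_d'], errors='coerce')

# NaT rows yield NaT in the subtraction, and .max() ignores them by default.
dsx['length'] = (dsx['end_d'] - dsx['start_d'] + pd.Timedelta(1, unit='D')).max()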

Pandas dataframe datetime filter doesn't work

I am learning how to work with pandas DataFrames and trying to pre-process some data. I have a set of weather data with the datetime field stored as a string. Each day appears twice in this dataset, once with 00:00 hours and once with 12:00 hours. I am trying to filter it and keep only the rows with 12:00 hours. I tried some options recommended here:
#pre-processing to get only required information
data = data[["date_time", "WindGustKmph", "humidity", "precipMM", "pressure", "tempC", "winddirDegree", "windspeedKmph"]]
print(data.head())
print(str(len(data)))
#set proper datatime index and keep only day time weather
dataIndex = pd.DatetimeIndex(data['date_time'].astype(str))
data.index = dataIndex
#filter the data
data.between_time('07:00:00', '21:00:00')
print(data.head())
print(str(len(data)))
As a result, I see that an index was added, but the filter was not applied. My question is: why?
date_time WindGustKmph ... winddirDegree windspeedKmph
0 2018-01-01 0:00 8 ... 21 4
1 2018-01-01 12:00 12 ... 79 10
2 2018-01-02 0:00 14 ... 19 7
3 2018-01-02 12:00 18 ... 57 16
4 2018-01-03 0:00 19 ... 16 9
[5 rows x 8 columns]
2192
date_time ... windspeedKmph
date_time ...
2018-01-01 00:00:00 2018-01-01 0:00 ... 4
2018-01-01 12:00:00 2018-01-01 12:00 ... 10
2018-01-02 00:00:00 2018-01-02 0:00 ... 7
2018-01-02 12:00:00 2018-01-02 12:00 ... 16
2018-01-03 00:00:00 2018-01-03 0:00 ... 9
[5 rows x 8 columns]
2192
Also, I tried another option:
data['date_time'] = pd.to_datetime(data['date_time'])
data['hours'] = data['date_time'].dt.hour
data[data['hours'] != 0]
Same result: the column was added, but the data was not filtered.
date_time WindGustKmph ... windspeedKmph hours
0 2018-01-01 00:00:00 8 ... 4 0
1 2018-01-01 12:00:00 12 ... 10 12
2 2018-01-02 00:00:00 14 ... 7 0
3 2018-01-02 12:00:00 18 ... 16 12
4 2018-01-03 00:00:00 19 ... 9 0
[5 rows x 9 columns]
2192
I would appreciate any suggestion on what I am missing.
You need to assign the filtered dataset back to data:
data = data.between_time('07:00:00', '21:00:00')
or (your second option)
data = data[data['hours'].between(7, 21)]
I don't like the 7:00:00 and 21:00:00 comparisons. Why not just do:
data = data[data['date_time'].dt.hour == 12]
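For completeness, a minimal end-to-end sketch (column names taken from the question); the key point is that both between_time and boolean indexing return a new DataFrame, so the result must be assigned back:

import pandas as pd

data['date_time'] = pd.to_datetime(data['date_time'])
data = data[data['date_time'].dt.hour == 12]   # keep only the 12:00 rows

print(data['date_time'].dt.hour.unique())      # expected: [12]
print(len(data))                               # roughly half of the original 2192 rows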

Monthly aggregated values, pandas dataframe

A sample CSV data in which the first column is a time stamp (date + time):
2018-01-01 10:00:00,23,43
2018-01-02 11:00:00,34,35
2018-01-05 12:00:00,25,4
2018-01-10 15:00:00,22,96
2018-01-01 18:00:00,24,53
2018-03-01 10:00:00,94,98
2018-04-20 10:00:00,90,9
2018-04-10 10:00:00,45,51
2018-01-01 10:00:00,74,44
2018-12-01 10:00:00,76,87
2018-11-01 10:00:00,76,87
2018-12-12 10:00:00,87,90
I already wrote some code to do the monthly aggregation task while waiting for someone to give me suggestions.
Thanks @moys, anyway!
import pandas as pd
df1 = pd.read_csv('Sample.txt', header=None, names=['Timestamp', 'Value 1', 'Value 2'])
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1['Monthly'] = df1['Timestamp'].dt.to_period('M')
grouper = pd.Grouper(key='Monthly')
df2 = df1.groupby(grouper)[['Value 1', 'Value 2']].sum().reset_index()
The output is:
Monthly Value 1 Value 2
0 2018-01 202 275
1 2018-03 94 98
2 2018-04 135 60
3 2018-12 163 177
4 2018-11 76 87
What if the dataset has more columns? How should I modify my code so that it works automatically on a dataset with more columns?
2018-02-01 10:00:00,23,43,32
2018-02-02 11:00:00,34,35,43
2018-03-05 12:00:00,25,4,43
2018-02-10 15:00:00,22,96,24
2018-05-01 18:00:00,24,53,98
2018-02-01 10:00:00,94,98,32
2018-02-20 10:00:00,90,9,24
2018-07-10 10:00:00,45,51,32
2018-01-01 10:00:00,74,44,34
2018-12-04 10:00:00,76,87,53
2018-12-02 10:00:00,76,87,21
2018-12-12 10:00:00,87,90,98
You can do something like below
df.groupby(pd.to_datetime(df['date']).dt.month).sum().reset_index()
Output (here the 'date' column is the month number):
date val1 val2
0 1 202 275
1 3 94 98
2 4 135 60
3 11 76 87
4 12 163 177
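To address the follow-up about extra columns while keeping the month-period output of the asker's own code, a sketch along these lines should work (the file name and the generated column names are assumptions):

import pandas as pd

# Read the wider file; name the first column and auto-name the rest.
df = pd.read_csv('Sample_wide.txt', header=None)
df.columns = ['Timestamp'] + [f'Value {i}' for i in range(1, df.shape[1])]

df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['Monthly'] = df['Timestamp'].dt.to_period('M')

# sum() aggregates every remaining numeric column, however many there are.
out = df.groupby('Monthly').sum(numeric_only=True).reset_index()
print(out)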

Calculate mean based on time elapsed in Pandas

I tried to ask this question previously, but it was too ambiguous so here goes again. I am new to programming, so I am still learning how to ask questions in a useful way.
In summary, I have a pandas dataframe that resembles "INPUT DATA" that I would like to convert to "DESIRED OUTPUT", as shown below.
Each row contains an ID, a DateTime, and a Value. For each unique ID, the first row corresponds to timepoint zero, each subsequent row contains a value 5 minutes after the previous row, and so on.
I would like to calculate the mean across all the IDs for every 'time elapsed' timepoint. For example, in "DESIRED OUTPUT" Time Elapsed=0.0 would have the value 128.3 ((100+105+180)/3); Time Elapsed=5.0 would have the value 150.0 ((150+110+190)/3); Time Elapsed=10.0 would have the value 133.3 ((125+90+185)/3); and so on for Time Elapsed=15, 20, 25, etc.
I'm not sure how to create a new column which has the value for the time elapsed for each ID (e.g. 0.0, 5.0, 10.0 etc). I think that once I know how to do that, then I can use the groupby function to calculate the means for each time elapsed.
INPUT DATA
ID DateTime Value
1 2018-01-01 15:00:00 100
1 2018-01-01 15:05:00 150
1 2018-01-01 15:10:00 125
2 2018-02-02 13:15:00 105
2 2018-02-02 13:20:00 110
2 2018-02-02 13:25:00 90
3 2019-03-03 05:05:00 180
3 2019-03-03 05:10:00 190
3 2019-03-03 05:15:00 185
DESIRED OUTPUT
Time Elapsed Mean Value
0.0 128.3
5.0 150.0
10.0 133.3
Here is one way: use transform with groupby to get the group key 'Time Elapsed', then just groupby that key to get the mean.
df['Time Elapsed']=df.DateTime-df.groupby('ID').DateTime.transform('first')
df.groupby('Time Elapsed').Value.mean()
Out[998]:
Time Elapsed
00:00:00 128.333333
00:05:00 150.000000
00:10:00 133.333333
Name: Value, dtype: float64
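If you prefer the index as elapsed minutes (0.0, 5.0, 10.0), matching the desired output, one small variation on the above is to convert the timedelta before grouping:

# Build on the 'Time Elapsed' timedelta column from the answer above and
# express it in minutes before grouping.
minutes = df['Time Elapsed'].dt.total_seconds() / 60
df.groupby(minutes).Value.mean()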
You can do this explicitly by taking advantage of the datetime attributes of the DateTime column in your DataFrame.
First, get the year, month and day for each DateTime, since they all vary in your data:
df['month'] = df['DateTime'].dt.month
df['day'] = df['DateTime'].dt.day
df['year'] = df['DateTime'].dt.year
print(df)
ID DateTime Value month day year
1 1 2018-01-01 15:00:00 100 1 1 2018
1 1 2018-01-01 15:05:00 150 1 1 2018
1 1 2018-01-01 15:10:00 125 1 1 2018
2 2 2018-02-02 13:15:00 105 2 2 2018
2 2 2018-02-02 13:20:00 110 2 2 2018
2 2 2018-02-02 13:25:00 90 2 2 2018
3 3 2019-03-03 05:05:00 180 3 3 2019
3 3 2019-03-03 05:10:00 190 3 3 2019
3 3 2019-03-03 05:15:00 185 3 3 2019
Then append a sequential counter column (per this SO post). The counter is computed within (1) each year, (2) then each month, and then (3) each day. Since the data are in multiples of 5 minutes, scale the counter accordingly (i.e. the counter will be in multiples of 5 minutes rather than a sequence of increasing integers):
df['Time Elapsed'] = df.groupby(['year', 'month', 'day']).cumcount() + 1
df['Time Elapsed'] *= 5
print(df)
ID DateTime Value month day year Time Elapsed
1 1 2018-01-01 15:00:00 100 1 1 2018 5
1 1 2018-01-01 15:05:00 150 1 1 2018 10
1 1 2018-01-01 15:10:00 125 1 1 2018 15
2 2 2018-02-02 13:15:00 105 2 2 2018 5
2 2 2018-02-02 13:20:00 110 2 2 2018 10
2 2 2018-02-02 13:25:00 90 2 2 2018 15
3 3 2019-03-03 05:05:00 180 3 3 2019 5
3 3 2019-03-03 05:10:00 190 3 3 2019 10
3 3 2019-03-03 05:15:00 185 3 3 2019 15
Perform the groupby over the newly appended counter column
dfg = df.groupby('Time Elapsed')['Value'].mean()
print(dfg)
Time Elapsed
5 128.333333
10 150.000000
15 133.333333
Name: Value, dtype: float64

Merge one file into another file in groups

In Python and Pandas, I have one dataframe for 2018 which looks like this:
Date Stock_id Stock_value
02/01/2018 1 4
03/01/2018 1 2
05/01/2018 1 7
01/01/2018 2 6
02/01/2018 2 9
03/01/2018 2 4
04/01/2018 2 6
and a dataframe with one column which has all the 2018 dates like the following:
Date
01/01/2018
02/01/2018
03/01/2018
04/01/2018
05/01/2018
06/01/2018
etc
I want to merge these so that my first dataframe has the full set of 2018 dates for each stock, with NAs wherever there isn't any data.
Basically, I want each stock to have a row for every date of 2018 (where the rows that do not have any data are filled in with NAs).
Thus, I want to have the following as an output for the sample above:
Date Stock_id Stock_value
01/01/2018 1 NA
02/01/2018 1 4
03/01/2018 1 2
04/01/2018 1 NA
05/01/2018 1 7
01/01/2018 2 6
02/01/2018 2 9
03/01/2018 2 4
04/01/2018 2 6
05/01/2018 2 NA
How can I do this?
I tested
data = data_1.merge(data_2, on='Date' , how='outer')
and
data = data_1.merge(data_2, on='Date' , how='right')
but I still got the original dataframe with no new dates added, only some extra rows that were entirely NAs.
Use product for all combinations of Stock_id and Date values, then merge with a left join:
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date'] = pd.to_datetime(df2['Date'], dayfirst=True)
from itertools import product
c = ['Stock_id','Date']
df = pd.DataFrame(list(product(df1['Stock_id'].unique(), df2['Date'])), columns=c)
print (df)
Stock_id Date
0 1 2018-01-01
1 1 2018-01-02
2 1 2018-01-03
3 1 2018-01-04
4 1 2018-01-05
5 1 2018-01-06
6 2 2018-01-01
7 2 2018-01-02
8 2 2018-01-03
9 2 2018-01-04
10 2 2018-01-05
11 2 2018-01-06
and
df = df[['Date','Stock_id']].merge(df1, how='left')
#if necessary specify both columns
#df = df[['Date','Stock_id']].merge(df1, how='left', on=['Date','Stock_id'])
print (df)
Date Stock_id Stock_value
0 2018-01-01 1 NaN
1 2018-01-02 1 4.0
2 2018-01-03 1 2.0
3 2018-01-04 1 NaN
4 2018-01-05 1 7.0
5 2018-01-06 1 NaN
6 2018-01-01 2 6.0
7 2018-01-02 2 9.0
8 2018-01-03 2 4.0
9 2018-01-04 2 6.0
10 2018-01-05 2 NaN
11 2018-01-06 2 NaN
Another idea, but it should be slow on large data:
df = (df1.groupby('Stock_id')[['Date','Stock_value']]
.apply(lambda x: x.set_index('Date').reindex(df2['Date']))
.reset_index())
print (df)
Stock_id Date Stock_value
0 1 2018-01-01 NaN
1 1 2018-01-02 4.0
2 1 2018-01-03 2.0
3 1 2018-01-04 NaN
4 1 2018-01-05 7.0
5 1 2018-01-06 NaN
6 2 2018-01-01 6.0
7 2 2018-01-02 9.0
8 2 2018-01-03 4.0
9 2 2018-01-04 6.0
10 2 2018-01-05 NaN
11 2 2018-01-06 NaN
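A third variant, offered only as a sketch, builds the full (Stock_id, Date) grid with MultiIndex.from_product and reindexes onto it (this assumes both Date columns have already been converted with pd.to_datetime as above, and that each (Stock_id, Date) pair in df1 is unique):

# Build every (Stock_id, Date) combination and reindex the stock data onto it;
# missing combinations come back as NaN.
full_idx = pd.MultiIndex.from_product(
    [df1['Stock_id'].unique(), df2['Date']], names=['Stock_id', 'Date'])

df = (df1.set_index(['Stock_id', 'Date'])
         .reindex(full_idx)
         .reset_index())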
