How to check if any string is missing in pandas - python

I have the following dataframe in pandas:
Date half_hourly_bucket Value
2018-01-01 00:00:01 - 00:30:00 123
2018-01-01 00:30:01 - 01:00:00 12
2018-01-01 01:00:01 - 01:30:00 122
2018-01-01 02:00:01 - 02:30:00 111
2018-01-01 03:00:01 - 03:30:00 122
2018-01-01 04:00:01 - 04:30:00 111
My desired dataframe would be
Date half_hourly_bucket Value
2018-01-01 00:00:01 - 00:30:00 123
2018-01-01 00:30:01 - 01:00:00 12
2018-01-01 01:00:01 - 01:30:00 122
2018-01-01 01:30:01 - 02:00:00 0
2018-01-01 02:00:01 - 02:30:00 122
2018-01-01 02:30:01 - 03:00:00 0
2018-01-01 03:00:01 - 03:30:00 111
2018-01-01 03:30:01 - 04:00:00 0
2018-01-01 04:00:01 - 04:30:00 111
2018-01-01 04:30:01 - 05:00:00 0
2018-01-01 05:00:01 - 05:30:00 0
2018-01-01 05:30:01 - 06:00:00 0
2018-01-01 06:00:01 - 06:30:00 0
2018-01-01 06:30:01 - 07:00:00 0
2018-01-01 07:00:01 - 07:30:00 0
2018-01-01 07:30:01 - 08:00:00 0
2018-01-01 08:00:01 - 08:30:00 0
2018-01-01 08:30:01 - 09:00:00 0
2018-01-01 09:00:01 - 09:30:00 0
2018-01-01 09:30:01 - 10:00:00 0
2018-01-01 10:00:01 - 10:30:00 0
2018-01-01 10:30:01 - 11:00:00 0
2018-01-01 11:00:01 - 11:30:00 0
2018-01-01 11:30:01 - 12:00:00 0
2018-01-01 12:00:01 - 12:30:00 0
2018-01-01 12:30:01 - 13:00:00 0
2018-01-01 13:00:01 - 13:30:00 0
2018-01-01 13:30:01 - 14:00:00 0
2018-01-01 14:00:01 - 14:30:00 0
2018-01-01 14:30:01 - 15:00:00 0
2018-01-01 15:00:01 - 15:30:00 0
2018-01-01 15:30:01 - 16:00:00 0
2018-01-01 16:00:01 - 16:30:00 0
2018-01-01 16:30:01 - 17:00:00 0
2018-01-01 17:00:01 - 17:30:00 0
2018-01-01 17:30:01 - 18:00:00 0
2018-01-01 18:00:01 - 18:30:00 0
2018-01-01 18:30:01 - 19:00:00 0
2018-01-01 19:00:01 - 19:30:00 0
2018-01-01 19:30:01 - 20:00:00 0
2018-01-01 20:00:01 - 20:30:00 0
2018-01-01 20:30:01 - 21:00:00 0
2018-01-01 21:00:01 - 21:30:00 0
2018-01-01 21:30:01 - 22:00:00 0
2018-01-01 22:00:01 - 22:30:00 0
2018-01-01 22:30:01 - 23:00:00 0
2018-01-01 23:00:01 - 23:30:00 0
2018-01-01 23:30:01 - 00:00:00 0
What I want to check, for each value of the Date column, is whether any half-hourly bucket (48 buckets in total per day) is missing. If a bucket is missing, it has to be inserted in order with a Value of 0.
How can I do it in pandas?
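For reference, the sample frame above can be reconstructed like this (a sketch, with the values copied from the table):
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2018-01-01'] * 6),
    'half_hourly_bucket': ['00:00:01 - 00:30:00', '00:30:01 - 01:00:00',
                           '01:00:01 - 01:30:00', '02:00:01 - 02:30:00',
                           '03:00:01 - 03:30:00', '04:00:01 - 04:30:00'],
    'Value': [123, 12, 122, 111, 122, 111]})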

Solution: split half_hourly_bucket into 2 new columns, process them and join back:
#create DatetimeIndex
df = df.set_index('Date')
#split the bucket into start and end time columns
df[['one','two']] = df['half_hourly_bucket'].str.split(' - ', expand=True)
#add the bucket start time to the DatetimeIndex
df.index += pd.to_timedelta(df['one'])
#add missing timestamps to the DatetimeIndex (48 half-hour buckets per day)
one_sec = pd.Timedelta(1, unit='s')
one_day = pd.Timedelta(1, unit='d')
df = df.reindex(pd.date_range(df.index.min().floor('D') + one_sec,
                              df.index.max().floor('D') + one_day - one_sec, freq='30T'))
#recreate column two (bucket end times)
df['two'] = df.index + pd.Timedelta(30*60 - 1, unit='s')
#join start and end back into the bucket label
df['half_hourly_bucket'] = (df.index.strftime('%H:%M:%S') + ' - ' +
                            df['two'].dt.strftime('%H:%M:%S'))
#replace missing values
df['Value'] = df['Value'].fillna(0)
df = df.rename_axis('Date').reset_index()
#keep only the necessary columns
df = df[['Date','half_hourly_bucket','Value']]
print (df)
Date half_hourly_bucket Value
0 2018-01-01 00:00:01 00:00:01 - 00:30:00 123.0
1 2018-01-01 00:30:01 00:30:01 - 01:00:00 12.0
2 2018-01-01 01:00:01 01:00:01 - 01:30:00 122.0
3 2018-01-01 01:30:01 01:30:01 - 02:00:00 0.0
4 2018-01-01 02:00:01 02:00:01 - 02:30:00 111.0
5 2018-01-01 02:30:01 02:30:01 - 03:00:00 0.0
6 2018-01-01 03:00:01 03:00:01 - 03:30:00 122.0
7 2018-01-01 03:30:01 03:30:01 - 04:00:00 0.0
8 2018-01-01 04:00:01 04:00:01 - 04:30:00 111.0
9 2018-01-01 04:30:01 04:30:01 - 05:00:00 0.0
10 2018-01-01 05:00:01 05:00:01 - 05:30:00 0.0
11 2018-01-01 05:30:01 05:30:01 - 06:00:00 0.0
12 2018-01-01 06:00:01 06:00:01 - 06:30:00 0.0
13 2018-01-01 06:30:01 06:30:01 - 07:00:00 0.0
14 2018-01-01 07:00:01 07:00:01 - 07:30:00 0.0
15 2018-01-01 07:30:01 07:30:01 - 08:00:00 0.0
16 2018-01-01 08:00:01 08:00:01 - 08:30:00 0.0
17 2018-01-01 08:30:01 08:30:01 - 09:00:00 0.0
18 2018-01-01 09:00:01 09:00:01 - 09:30:00 0.0
19 2018-01-01 09:30:01 09:30:01 - 10:00:00 0.0
20 2018-01-01 10:00:01 10:00:01 - 10:30:00 0.0
21 2018-01-01 10:30:01 10:30:01 - 11:00:00 0.0
22 2018-01-01 11:00:01 11:00:01 - 11:30:00 0.0
23 2018-01-01 11:30:01 11:30:01 - 12:00:00 0.0
24 2018-01-01 12:00:01 12:00:01 - 12:30:00 0.0
25 2018-01-01 12:30:01 12:30:01 - 13:00:00 0.0
26 2018-01-01 13:00:01 13:00:01 - 13:30:00 0.0
27 2018-01-01 13:30:01 13:30:01 - 14:00:00 0.0
28 2018-01-01 14:00:01 14:00:01 - 14:30:00 0.0
29 2018-01-01 14:30:01 14:30:01 - 15:00:00 0.0
30 2018-01-01 15:00:01 15:00:01 - 15:30:00 0.0
31 2018-01-01 15:30:01 15:30:01 - 16:00:00 0.0
32 2018-01-01 16:00:01 16:00:01 - 16:30:00 0.0
33 2018-01-01 16:30:01 16:30:01 - 17:00:00 0.0
34 2018-01-01 17:00:01 17:00:01 - 17:30:00 0.0
35 2018-01-01 17:30:01 17:30:01 - 18:00:00 0.0
36 2018-01-01 18:00:01 18:00:01 - 18:30:00 0.0
37 2018-01-01 18:30:01 18:30:01 - 19:00:00 0.0
38 2018-01-01 19:00:01 19:00:01 - 19:30:00 0.0
39 2018-01-01 19:30:01 19:30:01 - 20:00:00 0.0
40 2018-01-01 20:00:01 20:00:01 - 20:30:00 0.0
41 2018-01-01 20:30:01 20:30:01 - 21:00:00 0.0
42 2018-01-01 21:00:01 21:00:01 - 21:30:00 0.0
43 2018-01-01 21:30:01 21:30:01 - 22:00:00 0.0
44 2018-01-01 22:00:01 22:00:01 - 22:30:00 0.0
45 2018-01-01 22:30:01 22:30:01 - 23:00:00 0.0
46 2018-01-01 23:00:01 23:00:01 - 23:30:00 0.0
47 2018-01-01 23:30:01 23:30:01 - 00:00:00 0.0
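As a quick sanity check (a sketch, run on the result above), a full day should now contain exactly 48 half-hour buckets, with only the six original rows non-zero:
#48 buckets per day; the missing ones were filled with 0
assert len(df) == 48
assert df['Value'].ne(0).sum() == 6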

Related

How to extract the first and last value from a data sequence based on a column value?

I have a time series dataset that can be created with the following code.
idx = pd.date_range("2018-01-01", periods=100, freq="H")
ts = pd.Series(idx)
dft = pd.DataFrame(ts, columns=["date"])
dft["data"] = ""
dft["data"][0:5] = "a"
dft["data"][5:15] = "b"
dft["data"][15:20] = "c"
dft["data"][20:30] = "d"
dft["data"][30:40] = "a"
dft["data"][40:70] = "c"
dft["data"][70:85] = "b"
dft["data"][85:len(dft)] = "c"
In the data column, the unique values are a, b, c, d. These values repeat in sequences over different time windows. I want to capture the first and last timestamp of each such window. How can I do that?
Compute a grouper for your changing values using shift to compare consecutive rows, then use groupby+agg to get the min/max per group:
group = dft.data.ne(dft.data.shift()).cumsum()
dft.groupby(group)['date'].agg(['min', 'max'])
output:
min max
data
1 2018-01-01 00:00:00 2018-01-01 04:00:00
2 2018-01-01 05:00:00 2018-01-01 14:00:00
3 2018-01-01 15:00:00 2018-01-01 19:00:00
4 2018-01-01 20:00:00 2018-01-02 05:00:00
5 2018-01-02 06:00:00 2018-01-02 15:00:00
6 2018-01-02 16:00:00 2018-01-03 21:00:00
7 2018-01-03 22:00:00 2018-01-04 12:00:00
8 2018-01-04 13:00:00 2018-01-05 03:00:00
Edit: combining with the original data:
dft.groupby(group).agg({'data': 'first', 'date': ['min', 'max']})
output:
data date
first min max
data
1 a 2018-01-01 00:00:00 2018-01-01 04:00:00
2 b 2018-01-01 05:00:00 2018-01-01 14:00:00
3 c 2018-01-01 15:00:00 2018-01-01 19:00:00
4 d 2018-01-01 20:00:00 2018-01-02 05:00:00
5 a 2018-01-02 06:00:00 2018-01-02 15:00:00
6 c 2018-01-02 16:00:00 2018-01-03 21:00:00
7 b 2018-01-03 22:00:00 2018-01-04 12:00:00
8 c 2018-01-04 13:00:00 2018-01-05 03:00:00
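If you prefer flat column names instead of the two-level header above, named aggregation (available since pandas 0.25) should give the same information; a sketch:
dft.groupby(group).agg(value=('data', 'first'),
                       start=('date', 'min'),
                       end=('date', 'max'))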

How to either change the date or get rid of it after using pd.to_datetime()?

I have a df that looks as follows:
Datum Dates Time Menge day month
1/1/2018 0:00 2018-01-01 00:00:00 19.5 1 1
1/1/2018 0:15 2018-01-01 00:15:00 19.0 1 1
1/1/2018 0:30 2018-01-01 00:30:00 19.5 1 1
1/1/2018 0:45 2018-01-01 00:45:00 19.5 1 1
1/1/2018 1:00 2018-01-01 01:00:00 21.0 1 1
1/1/2018 1:15 2018-01-01 01:15:00 19.5 1 1
1/1/2018 1:30 2018-01-01 01:30:00 20.0 1 1
1/1/2018 1:45 2018-01-01 01:45:00 23.0 1 1
1/1/2018 2:00 2018-01-01 02:00:00 20.5 1 1
1/1/2018 2:15 2018-01-01 02:15:00 20.5 1 1
and their data types are:
Datum object
Dates object
Time object
Menge float64
day int64
month int64
dtype: object
I wanted to calculate a few things like the hourly, daily, and monthly averages, and for that I had to convert the types of the Dates and Time columns. For that, I did:
data_nan_dropped['Dates'] = pd.to_datetime(data_nan_dropped.Dates)
data_nan_dropped.Time = pd.to_datetime(data_nan_dropped.Time, format='%H:%M:%S')
which converted my df to:
Datum Dates Time Menge day month
1/1/2018 0:00 2018-01-01 00:00:00 1900-01-01 00:00:00 19.5 1 1
1/1/2018 0:15 2018-01-01 00:00:00 1900-01-01 00:15:00 19.0 1 1
1/1/2018 0:30 2018-01-01 00:00:00 1900-01-01 00:30:00 19.5 1 1
1/1/2018 0:45 2018-01-01 00:00:00 1900-01-01 00:45:00 19.5 1 1
1/1/2018 1:00 2018-01-01 00:00:00 1900-01-01 01:00:00 21.0 1 1
1/1/2018 1:15 2018-01-01 00:00:00 1900-01-01 01:15:00 19.5 1 1
1/1/2018 1:30 2018-01-01 00:00:00 1900-01-01 01:30:00 20.0 1 1
1/1/2018 1:45 2018-01-01 00:00:00 1900-01-01 01:45:00 23.0 1 1
1/1/2018 2:00 2018-01-01 00:00:00 1900-01-01 02:00:00 20.5 1 1
1/1/2018 2:15 2018-01-01 00:00:00 1900-01-01 02:15:00 20.5 1 1
Now, in the Time column, the converted times all carry the placeholder date 1900-01-01. I don't want that.
If possible, I would like one of the following:
the Time column converted to datetime64[ns] without a date being displayed,
or
the date from the Datum column displayed there instead of 1900-01-01.
How can I achieve this?
Expected output:
Datum Dates Time Menge day month
1/1/2018 0:00 2018-01-01 00:00:00 2018-01-01 00:00:00 19.5 1 1
1/1/2018 0:15 2018-01-01 00:00:00 2018-01-01 00:15:00 19.0 1 1
1/1/2018 0:30 2018-01-01 00:00:00 2018-01-01 00:30:00 19.5 1 1
1/1/2018 0:45 2018-01-01 00:00:00 2018-01-01 00:45:00 19.5 1 1
1/1/2018 1:00 2018-01-01 00:00:00 2018-01-01 01:00:00 21.0 1 1
1/1/2018 1:15 2018-01-01 00:00:00 2018-01-01 01:15:00 19.5 1 1
1/1/2018 1:30 2018-01-01 00:00:00 2018-01-01 01:30:00 20.0 1 1
1/1/2018 1:45 2018-01-01 00:00:00 2018-01-01 01:45:00 23.0 1 1
1/1/2018 2:00 2018-01-01 00:00:00 2018-01-01 02:00:00 20.5 1 1
1/1/2018 2:15 2018-01-01 00:00:00 2018-01-01 02:15:00 20.5 1 1
If I understand you correctly by looking at your expected output, we can use the Datum column to create the right Time column:
df['Dates'] = pd.to_datetime(df['Dates'])
df['Time'] = pd.to_datetime(df['Datum'], format='%d/%m/%Y %H:%M')
Datum Dates Time Menge day month
0 1/1/2018 0:00 2018-01-01 2018-01-01 00:00:00 19.5 1 1
1 1/1/2018 0:15 2018-01-01 2018-01-01 00:15:00 19.0 1 1
2 1/1/2018 0:30 2018-01-01 2018-01-01 00:30:00 19.5 1 1
3 1/1/2018 0:45 2018-01-01 2018-01-01 00:45:00 19.5 1 1
4 1/1/2018 1:00 2018-01-01 2018-01-01 01:00:00 21.0 1 1
5 1/1/2018 1:15 2018-01-01 2018-01-01 01:15:00 19.5 1 1
6 1/1/2018 1:30 2018-01-01 2018-01-01 01:30:00 20.0 1 1
7 1/1/2018 1:45 2018-01-01 2018-01-01 01:45:00 23.0 1 1
8 1/1/2018 2:00 2018-01-01 2018-01-01 02:00:00 20.5 1 1
9 1/1/2018 2:15 2018-01-01 2018-01-01 02:15:00 20.5 1 1
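If you instead want the first option from the question (the time of day alone, with no date at all), note that datetime64 always carries a date; a common workaround, sketched here on the original %H:%M:%S strings in Time, is to store the time of day as a timedelta:
# time of day as timedelta64[ns]: no dummy date, still supports arithmetic and comparisons
df['Time'] = pd.to_timedelta(df['Time'])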

Find missing values of datetime for every customer

CustID UsageDate EnergyConsumed
0 17111 2018-01-01 00:00:00 1.095
1 17111 2018-01-01 01:00:00 1.129
2 17111 2018-01-01 02:00:00 1.165
3 17111 2018-01-01 03:00:00 1.833
4 17111 2018-01-01 04:00:00 1.697
5 17111 2018-01-01 05:00:00 1.835
missing data point 1
6 17111 2018-01-01 07:00:00 1.835
7 17112 2018-01-01 00:00:00 1.095
8 17112 2018-01-01 01:00:00 1.129
missing data point 2
9 17112 2018-01-01 03:00:00 1.833
10 17112 2018-01-01 04:00:00 1.697
11 17112 2018-01-01 05:00:00 1.835
For every customer, I have hourly data. However, some data points are missing in between. I want to find the min and max of UsageDate for each customer and fill in the missing UsageDate values in that interval (all values are per hour) with EnergyConsumed as zero. I can later use ffill or bfill to take care of this.
Not every customer's max UsageDate is 2018-01-31 23:00:00. So we only want to extend the series till the max date of every customer.
missing point 1 is replaced by
17111 2018-01-01 06:00:00 0
missing point 2 is replaced by
17112 2018-01-01 02:00:00 0
My main point of trouble is how to find the min and max date of every customer and then generate the gaps of dates.
I have tried indexing by date and resampling, but that hasn't helped me reach the solution.
Also, I was wondering if there is a way to directly find the CustIDs which have missing values in the pattern described above. My data is very large and the solution provided by @Vaishali is computationally heavy. Any inputs would be helpful!
You can group the DataFrame by CustID and create an index with the desired date range, then use this index to reindex the data:
df['UsageDate'] = pd.to_datetime(df['UsageDate'])
#build a (CustID, UsageDate) MultiIndex covering each customer's own hourly range
idx = df.groupby('CustID')['UsageDate'].apply(
    lambda x: pd.Series(index=pd.date_range(x.min(), x.max(), freq='H'))).index
#reindex on that MultiIndex, fill the gaps with 0 and restore flat columns
df.set_index(['CustID', 'UsageDate']).reindex(idx).fillna(0).reset_index()\
  .rename(columns={'level_1': 'UsageDate'})
CustID UsageDate EnergyConsumed
0 17111 2018-01-01 00:00:00 1.095
1 17111 2018-01-01 01:00:00 1.129
2 17111 2018-01-01 02:00:00 1.165
3 17111 2018-01-01 03:00:00 1.833
4 17111 2018-01-01 04:00:00 1.697
5 17111 2018-01-01 05:00:00 1.835
6 17111 2018-01-01 06:00:00 0.000
7 17111 2018-01-01 07:00:00 1.835
8 17112 2018-01-01 00:00:00 1.095
9 17112 2018-01-01 01:00:00 1.129
10 17112 2018-01-01 02:00:00 0.000
11 17112 2018-01-01 03:00:00 1.833
12 17112 2018-01-01 04:00:00 1.697
13 17112 2018-01-01 05:00:00 1.835
Explanation: since the UsageDates have to cover every hour between the minimum and maximum date for each CustID, we group the data by CustID and, for each group, build a Series whose index is a date_range from min to max. The dates go into the index of the Series rather than its values, so the result of the groupby is a MultiIndex with CustID as level 0 and UsageDate as level 1. We then use this MultiIndex to reindex the original dataframe: it keeps the values where the index matches and assigns NaN for the rest. Finally, the NaNs are converted to 0 with fillna.
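For intuition, here is a minimal sketch (with a hypothetical two-customer frame) of what that groupby/date_range step produces: a MultiIndex of (CustID, hour) pairs spanning each customer's own min-to-max range, with the date level left unnamed, which is why the result is renamed from level_1 to UsageDate afterwards. Exact behaviour may vary slightly across pandas versions.
import pandas as pd

demo = pd.DataFrame({'CustID': [17111, 17111, 17112],
                     'UsageDate': pd.to_datetime(['2018-01-01 00:00:00',
                                                  '2018-01-01 02:00:00',
                                                  '2018-01-01 05:00:00'])})
idx = demo.groupby('CustID')['UsageDate'].apply(
    lambda x: pd.Series(index=pd.date_range(x.min(), x.max(), freq='H'), dtype=float)).index
print(idx.names)  # ['CustID', None] -> the date level still needs a name
print(len(idx))   # 4 hourly slots: 00, 01, 02 for 17111 and 05 for 17112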
First create a DatetimeIndex and then use asfreq inside apply:
df['UsageDate'] = pd.to_datetime(df['UsageDate'])
df = (df.set_index('UsageDate')
        .groupby('CustID')['EnergyConsumed']
        .apply(lambda x: x.asfreq('H'))
        .fillna(0)
        .reset_index()
      )
print (df)
CustID UsageDate EnergyConsumed
0 17111 2018-01-01 00:00:00 1.095
1 17111 2018-01-01 01:00:00 1.129
2 17111 2018-01-01 02:00:00 1.165
3 17111 2018-01-01 03:00:00 1.833
4 17111 2018-01-01 04:00:00 1.697
5 17111 2018-01-01 05:00:00 1.835
6 17111 2018-01-01 06:00:00 0.000
7 17111 2018-01-01 07:00:00 1.835
8 17112 2018-01-01 00:00:00 1.095
9 17112 2018-01-01 01:00:00 1.129
10 17112 2018-01-01 02:00:00 0.000
11 17112 2018-01-01 03:00:00 1.833
12 17112 2018-01-01 04:00:00 1.697
13 17112 2018-01-01 05:00:00 1.835
It is also possible to use the method parameter with 'ffill' or 'bfill':
df = (df.set_index('UsageDate')
        .groupby('CustID')['EnergyConsumed']
        .apply(lambda x: x.asfreq('H', method='ffill'))
        .reset_index()
      )
print (df)
CustID UsageDate EnergyConsumed
0 17111 2018-01-01 00:00:00 1.095
1 17111 2018-01-01 01:00:00 1.129
2 17111 2018-01-01 02:00:00 1.165
3 17111 2018-01-01 03:00:00 1.833
4 17111 2018-01-01 04:00:00 1.697
5 17111 2018-01-01 05:00:00 1.835
6 17111 2018-01-01 06:00:00 1.835
7 17111 2018-01-01 07:00:00 1.835
8 17112 2018-01-01 00:00:00 1.095
9 17112 2018-01-01 01:00:00 1.129
10 17112 2018-01-01 02:00:00 1.129
11 17112 2018-01-01 03:00:00 1.833
12 17112 2018-01-01 04:00:00 1.697
13 17112 2018-01-01 05:00:00 1.835
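On the side question of directly finding the CustIDs that have gaps (without rebuilding the whole frame), one possible sketch, run on the original frame with UsageDate already converted by pd.to_datetime, is to compare each customer's row count with the number of hours its min-to-max range should contain:
expected = df.groupby('CustID')['UsageDate'].agg(
    lambda x: int((x.max() - x.min()) / pd.Timedelta('1H')) + 1)
actual = df.groupby('CustID')['UsageDate'].size()
gapped = actual[actual != expected].index.tolist()
print(gapped)  # for the sample data: [17111, 17112]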

Mapping datetime columns of two tables

I created a dataframe with only a datetime column at a 1-second interval for Jan 1, 2018, as shown in the code below.
i = pd.date_range(start='2018-01-01 00:00:00', end='2018-01-01 23:59:00', freq="1S")
ts = pd.DataFrame(index=i)
ts = ts.reset_index()
ts = ts.rename(columns={'index': 'datetime'})
df1:
datetime
0 2018-01-01 00:00:00
1 2018-01-01 00:00:01
2 2018-01-01 00:00:02
3 2018-01-01 00:00:03
4 2018-01-01 00:00:04
5 2018-01-01 00:00:05
6 2018-01-01 00:00:06
7 2018-01-01 00:00:07
8 2018-01-01 00:00:08
9 2018-01-01 00:00:09
10 2018-01-01 00:00:10
11 2018-01-01 00:00:11
12 2018-01-01 00:00:12
13 2018-01-01 00:00:13
14 2018-01-01 00:00:14
15 2018-01-01 00:00:15
16 2018-01-01 00:00:16
17 2018-01-01 00:00:17
18 2018-01-01 00:00:18
19 2018-01-01 00:00:19
20 2018-01-01 00:00:20
21 2018-01-01 00:00:21
22 2018-01-01 00:00:22
23 2018-01-01 00:00:23
24 2018-01-01 00:00:24
25 2018-01-01 00:00:25
26 2018-01-01 00:00:26
27 2018-01-01 00:00:27
28 2018-01-01 00:00:28
29 2018-01-01 00:00:29
I have another dataframe with a datetime column and a few other columns:
df2:
datetime a b c d e
0 2018-01-01 00:00:04 0.9
1 2018-01-01 00:00:06 0.6 0.7
2 2018-01-01 00:00:09 0.5 0.7 0.8
3 2018-01-01 00:00:16 2.3 3.6 4.9 5.0
4 2018-01-01 00:00:17 0.9 3.5 5.5
5 2018-01-01 00:00:23 0.1 0.6 0.0 1.7
6 2018-01-01 00:00:29 2.7 5.5 4.3
Now I am trying to map the datetime columns of df1 and df2 using a pandas outer join, and I would like my expected result to look like this:
datetime a b c d e
0 2018-01-01 00:00:00
1 2018-01-01 00:00:01
2 2018-01-01 00:00:02
3 2018-01-01 00:00:03
4 2018-01-01 00:00:04 0.9
5 2018-01-01 00:00:05
6 2018-01-01 00:00:06 0.6 0.7
7 2018-01-01 00:00:07
8 2018-01-01 00:00:08
9 2018-01-01 00:00:09 0.5 0.7 0.8
10 2018-01-01 00:00:10
11 2018-01-01 00:00:11
12 2018-01-01 00:00:12
13 2018-01-01 00:00:13
14 2018-01-01 00:00:14
15 2018-01-01 00:00:15
16 2018-01-01 00:00:16 2.3 3.6 4.9 5.0
17 2018-01-01 00:00:17 0.9 3.5 5.5
18 2018-01-01 00:00:18
19 2018-01-01 00:00:19
20 2018-01-01 00:00:20
21 2018-01-01 00:00:21
22 2018-01-01 00:00:22
23 2018-01-01 00:00:23 0.1 0.6 0.0 1.7
24 2018-01-01 00:00:24
25 2018-01-01 00:00:25
26 2018-01-01 00:00:26
27 2018-01-01 00:00:27
28 2018-01-01 00:00:28
29 2018-01-01 00:00:29 2.7 5.5 4.3
but my output looks like this:
datetime a b c d e
0 2018-01-01 00:00:00
1 2018-01-01 00:00:01
2 2018-01-01 00:00:02
3 2018-01-01 00:00:03
4 2018-01-01 00:00:04
5 2018-01-01 00:00:05
6 2018-01-01 00:00:06
7 2018-01-01 00:00:07
8 2018-01-01 00:00:08
9 2018-01-01 00:00:09
10 2018-01-01 00:00:10
11 2018-01-01 00:00:11
12 2018-01-01 00:00:12
13 2018-01-01 00:00:13
14 2018-01-01 00:00:14
15 2018-01-01 00:00:15
16 2018-01-01 00:00:16
17 2018-01-01 00:00:17
18 2018-01-01 00:00:18
19 2018-01-01 00:00:19
20 2018-01-01 00:00:20
21 2018-01-01 00:00:21
22 2018-01-01 00:00:22
23 2018-01-01 00:00:23
24 2018-01-01 00:00:24
25 2018-01-01 00:00:25
26 2018-01-01 00:00:26
27 2018-01-01 00:00:27
28 2018-01-01 00:00:28
29 2018-01-01 00:00:29
30 2018-01-01 00:00:04 0.9
31 2018-01-01 00:00:06 0.6 0.7
32 2018-01-01 00:00:09 0.5 0.7 0.8
33 2018-01-01 00:00:16 2.3 3.6 4.9 5.0
34 2018-01-01 00:00:17 0.9 3.5 5.5
35 2018-01-01 00:00:23 0.1 0.6 0.0 1.7
36 2018-01-01 00:00:29 2.7 5.5 4.3
The code I am using to do that operation is:
test = pandas.merge(df1, df2, on=['datetime'], how='outer')
I am not quite sure how to approach this issue and would appreciate some help.
Keep ts with a DatetimeIndex and use reindex, as @Scott Boston mentioned in the comments:
i = pd.date_range(start='2018-01-01 00:00:00', end='2018-01-01 23:59:00', freq="1S")
ts = pd.DataFrame(index=i)
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime').reindex(ts.index)
a b c d e
2018-01-01 00:00:00 NaN NaN NaN NaN NaN
2018-01-01 00:00:01 NaN NaN NaN NaN NaN
2018-01-01 00:00:02 NaN NaN NaN NaN NaN
2018-01-01 00:00:03 NaN NaN NaN NaN NaN
2018-01-01 00:00:04 0.9
2018-01-01 00:00:05 NaN NaN NaN NaN NaN
2018-01-01 00:00:06 0.6 0.7
2018-01-01 00:00:07 NaN NaN NaN NaN NaN
2018-01-01 00:00:08 NaN NaN NaN NaN NaN
2018-01-01 00:00:09 0.5 0.7 0.8
2018-01-01 00:00:10 NaN NaN NaN NaN NaN
2018-01-01 00:00:11 NaN NaN NaN NaN NaN
2018-01-01 00:00:12 NaN NaN NaN NaN NaN
2018-01-01 00:00:13 NaN NaN NaN NaN NaN
2018-01-01 00:00:14 NaN NaN NaN NaN NaN
2018-01-01 00:00:15 NaN NaN NaN NaN NaN
2018-01-01 00:00:16 2.3 3.6 4.9 5.0
2018-01-01 00:00:17 0.9 3.5 5.5
Option 2: concat
pd.concat([ts, df.set_index('datetime')], axis = 1)
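As a side note, the outer merge in the question most likely stacked the two frames because the datetime columns had different dtypes (one already datetime64, the other still plain strings), so no keys matched. Converting before merging should make the original approach line up as well; a sketch using the df1/df2 names from the question:
df2['datetime'] = pd.to_datetime(df2['datetime'])
test = pd.merge(df1, df2, on='datetime', how='outer').sort_values('datetime')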

Resampling and add threshold information in Pandas dataframe

I have a pandas dataframe at 1-minute frequency, and I want to do the resampling based on threshold data (there are multiple thresholds in a numpy array).
Here is an example of my dataset:
2018-01-01 00:01:00 0.867609
2018-01-01 00:02:00 0.544493
2018-01-01 00:03:00 0.958497
2018-01-01 00:04:00 0.371790
2018-01-01 00:05:00 0.470320
2018-01-01 00:06:00 0.757448
2018-01-01 00:07:00 0.198261
2018-01-01 00:08:00 0.666350
2018-01-01 00:09:00 0.392574
2018-01-01 00:10:00 0.627608
2018-01-01 00:11:00 0.414380
2018-01-01 00:12:00 0.120925
2018-01-01 00:13:00 0.559495
2018-01-01 00:14:00 0.260619
2018-01-01 00:15:00 0.982731
2018-01-01 00:16:00 0.996133
2018-01-01 00:17:00 0.410816
2018-01-01 00:18:00 0.366457
2018-01-01 00:19:00 0.927745
2018-01-01 00:20:00 0.626804
2018-01-01 00:21:00 0.223193
2018-01-01 00:22:00 0.007136
2018-01-01 00:23:00 0.245006
2018-01-01 00:24:00 0.491245
2018-01-01 00:25:00 0.215716
2018-01-01 00:26:00 0.932378
2018-01-01 00:27:00 0.366263
2018-01-01 00:28:00 0.522177
2018-01-01 00:29:00 0.614966
2018-01-01 00:30:00 0.670983
threshold=np.array([0.5,0.8,0.9])
What I want is to extract the rows where the value crosses a threshold; where it doesn't cross any threshold, just resample the data at 30 minutes.
Sample answer:
Threshold
2018-01-01 00:01:00 0.867609 0.8
2018-01-01 00:02:00 0.544493 0.5
2018-01-01 00:03:00 0.958497 0.9
2018-01-01 00:05:00 0.421055 NA
2018-01-01 00:06:00 0.757448 0.5
2018-01-01 00:07:00 0.198261 NA
2018-01-01 00:08:00 0.666350 0.5
2018-01-01 00:09:00 0.392574 NA
2018-01-01 00:10:00 0.627608 0.5
2018-01-01 00:12:00 0.414380 NA
2018-01-01 00:13:00 0.559495 0.5
2018-01-01 00:14:00 0.260619 NA
2018-01-01 00:15:00 0.982731 0.9
2018-01-01 00:16:00 0.996133 0.9
2018-01-01 00:18:00 0.388636 NA
2018-01-01 00:19:00 0.927745 0.9
2018-01-01 00:20:00 0.626804 0.5
2018-01-01 00:25:00 0.215716 NA
2018-01-01 00:26:00 0.932378 0.9
2018-01-01 00:27:00 0.366263 NA
2018-01-01 00:28:00 0.522177 0.5
2018-01-01 00:29:00 0.614966 0.5
2018-01-01 00:30:00 0.670983 0.5
I got the solution for the resampling part from @Scott Boston:
df = df.set_index(0)
g = df[1].lt(-22).mul(1).diff().bfill().ne(0).cumsum()
df.groupby(g).apply(lambda x: x.resample('1T', kind='period').mean().reset_index()
                    if (x.iloc[0] < -22).any() else
                    x.resample('30T', kind='period').mean().reset_index())\
  .reset_index(drop=True)
Use pd.cut:
threshold = np.array([0.5, 0.8, 0.9]).tolist()
pd.cut(df[1], bins=threshold + [np.inf], labels=threshold)
Output:
0 0.8
1 0.5
2 0.9
3 NaN
4 NaN
5 0.5
6 NaN
7 0.5
8 NaN
9 0.5
10 NaN
11 NaN
12 0.5
13 NaN
14 0.9
15 0.9
16 NaN
17 NaN
18 0.9
19 0.5
20 NaN
21 NaN
22 NaN
23 NaN
24 NaN
25 0.9
26 NaN
27 0.5
28 0.5
29 0.5
Name: 1, dtype: category
Categories (3, float64): [0.5 < 0.8 < 0.9]
Now, let's add this to the dataframe and filter out all consecutive NaNs.
df['Threshold'] = pd.cut(df[1], bins=threshold + [np.inf], labels=threshold)
mask = ~(df.Threshold.isnull() & (df.Threshold.isnull() == df.Threshold.isnull().shift(1)))
df[mask]
Output:
0 1 Threshold
0 2018-01-01 00:01:00 0.867609 0.8
1 2018-01-01 00:02:00 0.544493 0.5
2 2018-01-01 00:03:00 0.958497 0.9
3 2018-01-01 00:04:00 0.371790 NaN
5 2018-01-01 00:06:00 0.757448 0.5
6 2018-01-01 00:07:00 0.198261 NaN
7 2018-01-01 00:08:00 0.666350 0.5
8 2018-01-01 00:09:00 0.392574 NaN
9 2018-01-01 00:10:00 0.627608 0.5
10 2018-01-01 00:11:00 0.414380 NaN
12 2018-01-01 00:13:00 0.559495 0.5
13 2018-01-01 00:14:00 0.260619 NaN
14 2018-01-01 00:15:00 0.982731 0.9
15 2018-01-01 00:16:00 0.996133 0.9
16 2018-01-01 00:17:00 0.410816 NaN
18 2018-01-01 00:19:00 0.927745 0.9
19 2018-01-01 00:20:00 0.626804 0.5
20 2018-01-01 00:21:00 0.223193 NaN
25 2018-01-01 00:26:00 0.932378 0.9
26 2018-01-01 00:27:00 0.366263 NaN
27 2018-01-01 00:28:00 0.522177 0.5
28 2018-01-01 00:29:00 0.614966 0.5
29 2018-01-01 00:30:00 0.670983 0.5
