I have a dataframe:
date C P
0 15.4.21 0.06 0.94
1 16.4.21 0.15 1.32
2 2.5.21 0.06 1.17
3 8.5.21 0.20 0.82
4 9.6.21 0.04 -5.09
5 1.2.22 0.05 7.09
I need to sum both C and P for each month.
So the new df will have a column for each month; for example, for month 4 (April): (0.06 + 0.94 + 0.15 + 1.32) = 2.47, so the new df:
4/21 5/21 6/21 2/22
0 2.47 2.25 .. ..
Column names and order don't matter; actually, a string month name would be even better (e.g. April 22).
I was playing with something like this, which is not what I need:
df[['C','P']].groupby(df['date'].dt.to_period('M')).sum()
You almost had it; you just need to convert to datetime first with pd.to_datetime (using dayfirst=True, since your dates are day-first):
out = (df[['C','P']]
       .groupby(pd.to_datetime(df['date'], dayfirst=True)
                  .dt.to_period('M'))
       .sum()
      )
Output:
C P
date
2021-04  0.21  2.26
2021-05  0.26  1.99
2021-06  0.04 -5.09
2022-02  0.05  7.09
If you want the grand total, sum again:
out = (df[['C','P']]
       .groupby(pd.to_datetime(df['date'], dayfirst=True).dt.to_period('M'))
       .sum().sum(axis=1)
      )
Output:
date
2021-04    2.47
2021-05    2.25
2021-06   -5.05
2022-02    7.14
Freq: M, dtype: float64
as "Month year"
If you want string labels, it is better to convert at the end so the chronological order is preserved:
out.index = out.index.strftime('%B %y')
Output:
date
April 21       2.47
May 21         2.25
June 21       -5.05
February 22    7.14
dtype: float64
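Since the desired output in the question has the months as columns of a single row, one optional last step (a small sketch reusing the out Series from above) is to transpose it:
wide = out.to_frame().T   # one row, one column per month
print(wide)
#    April 21  May 21  June 21  February 22
# 0      2.47    2.25    -5.05         7.14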
I am working with the Johns Hopkins COVID data for personal use to create charts. The data shows cumulative deaths by country, and I want deaths per day. It seems to me the easiest way is to create two dataframes and subtract one from the other. But the file has dates as column names, and code like df3 = df2 - df1 subtracts the columns with matching dates. So I want to rename all the columns with some simple index, for example 1, 2, 3, ....
I cannot figure out how to do this.
new_names=list(range(data.shape[1]))
data.columns=new_names
This renames the columns of data from 0 upwards.
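For context, a minimal sketch of the subtraction step described in the question, assuming df1 and df2 are the two cumulative-deaths frames (names taken from the question) and have the same shape:
# give both frames positional integer column labels so the subtraction
# no longer aligns on the original date labels
df1.columns = range(df1.shape[1])
df2.columns = range(df2.shape[1])
df3 = df2 - df1   # deaths per day, assuming df2 is the data shifted by one day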
You could re-shape the data: use dates as row labels, and use (country, province) as column labels.
import pandas as pd
covid_csv = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
df_raw = (pd.read_csv(covid_csv)
.set_index(['Country/Region', 'Province/State'])
.drop(columns=['Lat', 'Long'])
.transpose())
df_raw.index = pd.to_datetime(df_raw.index)
print( df_raw.iloc[-5:, 0:5] )
Country/Region Afghanistan Albania Algeria Andorra Angola
Province/State NaN NaN NaN NaN NaN
2020-07-27 1269 144 1163 52 41
2020-07-28 1270 148 1174 52 47
2020-07-29 1271 150 1186 52 48
2020-07-30 1271 154 1200 52 51
2020-07-31 1272 157 1210 52 52
Now, you can use the rich set of pandas tools for time-series analysis. For example, use diff() to go from cumulative deaths to per-day rates. Or, you could compute N-day moving averages, create time-series plots, ...
print(df_raw.diff().iloc[-5:, 0:5])
Country/Region Afghanistan Albania Algeria Andorra Angola
Province/State NaN NaN NaN NaN NaN
2020-07-27 10.0 6.0 8.0 0.0 1.0
2020-07-28 1.0 4.0 11.0 0.0 6.0
2020-07-29 1.0 2.0 12.0 0.0 1.0
2020-07-30 0.0 4.0 14.0 0.0 3.0
2020-07-31 1.0 3.0 10.0 0.0 1.0
Finally, df_raw.sum(level='Country/Region', axis=1) will aggregate all Provinces within a Country.
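The N-day moving average mentioned above could look like this (a sketch reusing df_raw from the code block):
daily = df_raw.diff()                       # cumulative -> deaths per day
smoothed = daily.rolling(window=7).mean()   # 7-day moving average
print(smoothed.iloc[-5:, 0:5])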
Thanks for the time and effort but I figured out a simple way.
for i, row in enumerate(df):
    df.rename(columns = { row : str(i)}, inplace = True)
to change the columns names and then
for i, row in enumerate(df):
    df.rename(columns = { row : str( i + 43853)}, inplace = True)
to change them back to the dates I want.
I am trying to index match 2 dataframes and write the data back to excel. The Excel file that has to be filled looks like this:
Name Location Date Open High TimeH Low TimeL Close
1 Orange New York 20200501.0 5.5 5.58 18:00 5.45 16:00 5.7
0 Apple Minsk 20200505.0 3.5 3.85 NaN 3.45 NaN 3.65
2 Steak Dallas 20200506.0 8.5 8.85 NaN 8.45 NaN 8.65
The 'TimeH' and 'TimeL' values should be looked up from a dataframe that looks like this:
Name Date Time Open High Low Close Volume VWAP Trades
4 Apple 20200505 15:30:00 3.50 3.85 3.45 3.70 1500 3.73 95
5 Apple 20200505 17:00:00 3.65 3.70 3.50 3.60 1600 3.65 54
6 Apple 20200505 20:00:00 3.80 3.85 3.35 3.81 1700 3.73 41
7 Apple 20200505 22:00:00 3.60 3.84 3.45 3.65 1800 3.75 62
4 Steak 20200506 10:00:00 8.50 8.85 8.45 8.70 1500 8.73 95
5 Steak 20200506 12:00:00 8.65 8.70 8.50 8.60 1600 8.65 54
6 Steak 20200506 14:00:00 8.80 8.85 8.45 8.81 1700 8.73 41
7 Steak 20200506 16:00:00 8.60 8.84 8.45 8.65 1800 8.75 62
And then be written back to the Excel file, which should look like this after everything has worked:
Name Location Date Open High TimeH Low TimeL Close
1 Orange New York 20200501.0 5.5 5.58 18:00:00 5.45 16:00:00 5.7
0 Apple Minsk 20200505.0 3.5 3.85 15:30:00 3.45 20:00:00 3.65
2 Steak Dallas 20200506.0 8.5 8.85 10:00:00 8.45 14:00:00 8.65
I was using the following code to index the values 'Open', 'High', 'Low', 'Close', which works great:
rdf13 = rdf12.groupby(['Name','Date']).agg(Open=('Open','first'),High=('High','max'),Low=('Low','min'), Close=('Close','last'),Volume=('Volume','sum'),VWAP=('VWAP','mean'),Trades=('Trades','sum')).reset_index()
result11 = pd.merge(rdf13, rdf11, how='inner', on=['Name', 'Date']).iloc[:,:-4].dropna(1).rename(columns = {"Open_x": "Open", "High_x": "High", "Low_x": "Low", "Close_x": "Close", "Volume_x": "Volume", "VWAP_x": "VWAP", "Trades_x": "Trades"})
result12 = result11.reindex(index=result11.index[::-1])
result13 = result12[['Name', 'Location', 'Date', 'Check_2','Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', 'Trades']].reset_index()
readfile11 = pd.read_excel("Trackers\TEST Tracker.xlsx")
readfile11['Count'] = np.arange(len(readfile11))
df11 = readfile11.set_index(['Name', 'Location', 'Date'])
df12 = result13.set_index(['Name', 'Location', 'Date'])
fdf11 = df12.combine_first(df11).reset_index().reindex(readfile11.columns, axis=1).sort_values('Count')
print("Updated Day1 Data Frame")
print(fdf11)
writefdf10 = fdf11.to_excel("Trackers\TEST Tracker.xlsx", "Entries", index=False)
But when I append the following code to look up the TimeH value:
colnames40 = rdf12.rename(columns = {"Time": "TimeH"})
result41 = pd.merge(colnames40, rdf11, how='inner', on=['Name', 'Date', 'High']).iloc[:,:-4].dropna(1).rename(columns = {"TimeH_x": "TimeH"})
result42 = result41.reindex(index=result41.index[::-1])
result43 = result42[['Name', 'Location', 'Date', 'Check_2', 'High', 'TimeH']].reset_index()
readfile41 = pd.read_excel("Trackers\TEST Tracker.xlsx")
readfile41['Count'] = np.arange(len(readfile41))
df41 = readfile41.set_index(['Name', 'Location', 'Date', 'High'])
df42 = result43.set_index(['Name', 'Location', 'Date', 'High'])
fdf41 = df42.combine_first(df41).reset_index().reindex(readfile41.columns, axis=1).sort_values('Count')
print("Updated Day3 Data Frame")
print(fdf41)
writefdf40 = fdf41.to_excel("Trackers\TEST Tracker.xlsx", "Entries", index=False)
it does not seem to work for some reason and returns nothing, so the 'NaN' values in the 'TimeH' column stay 'NaN'. I messed around with the variables, but I either got errors because I did something wrong or it still returned 'NaN' values to me.
Can someone here help me to make python index the time values?
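In case it helps, here is one possible way (a sketch, not taken from the code above) to pull the time of each day's high and low out of the intraday frame, assuming it is the rdf12 frame with the columns shown in the question:
# make sure the index is unique so idxmax/idxmin return single row labels
intraday = rdf12.reset_index(drop=True)

# row label of the (first) daily high / low per (Name, Date)
high_idx = intraday.groupby(['Name', 'Date'])['High'].idxmax()
low_idx = intraday.groupby(['Name', 'Date'])['Low'].idxmin()

time_high = intraday.loc[high_idx, ['Name', 'Date', 'Time']].rename(columns={'Time': 'TimeH'})
time_low = intraday.loc[low_idx, ['Name', 'Date', 'Time']].rename(columns={'Time': 'TimeL'})

# one row per (Name, Date) with the times of the high and the low,
# ready to be merged into the Excel frame on ['Name', 'Date']
times = time_high.merge(time_low, on=['Name', 'Date'])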
Apparently I just had a little typo in my code.
result41 = pd.merge(colnames40, rdf11, how='inner', on=['Name', 'Date', 'High']).iloc[:,:-4].dropna(1).rename(columns = {"TimeH_x": "TimeH"})
should have been
result41 = pd.merge(colnames40, rdf31, how='inner', on=['Name', 'Date', 'High']).iloc[:,:-4].dropna(1).rename(columns = {"TimeH_x": "TimeH"})
The problem now is that the data contains duplicate values, which makes sense because of the reference to rdf31, but the issue is that df.drop_duplicates(keep='first', inplace=False) returns 'None' values for some reason; that's outside the scope of this question though.
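On the drop_duplicates side note: getting None back usually means the call was made with inplace=True (which modifies the frame in place and returns None by design), or the inplace=False result was never assigned. A minimal sketch, reusing the result41 name from above:
# returns a new, deduplicated frame; assign it to keep the result
result41 = result41.drop_duplicates(keep='first')

# or modify in place; this call itself returns None by design
# result41.drop_duplicates(keep='first', inplace=True)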
I have a timeseries df comprised of daily Rates in column A and the relative change from one day to the next in column B.
DF looks something like the below:
IR Shift
May/24/2019 5.9% -
May/25/2019 6% 1.67%
May/26/2019 5.9% -1.67%
May/27/2019 20.2% 292%
May/28/2019 20.5% 1.4%
May/29/2019 20% -1.6%
May/30/2019 5.1% -292%
May/31/2019 5.1% 0%
I would like to delete all values in column A which occur between large relative shifts, > +/- 50%.
So the above DF should look as the below:
IR Shift
May/24/2019 5.9% -
May/25/2019 6% 1.67%
May/26/2019 5.9% -1.67%
May/27/2019 np.nan 292%
May/28/2019 np.nan 1.4%
May/29/2019 np.nan -1.6%
May/30/2019 5.1% -292%
May/31/2019 5.1% 0%
This is where I've got to so far; I would appreciate some help:
for i, j in df1.iterrows():
    if df1['Shift'][i] > .50 :
        x = df1['IR'][i]
    if df1['Shift'][j] < -.50 :
        y = df1['IR'][j]
    df1['IR'] = np.where(df1['Shift'].between(x,y), df1['Shift'],
                         np.nan)
Error ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
We can locate rows between pairs ([1st-2nd), [3rd-4th), ...) of outlier values to then mask the entire DataFrame at once.
Setup
import pandas as pd
import numpy as np
df = pd.read_clipboard()
df = df.apply(lambda x: pd.to_numeric(x.str.replace('%', ''), errors='coerce'))
IR Shift
May/24/2019 5.9 NaN
May/25/2019 6.0 1.67
May/26/2019 5.9 -1.67
May/27/2019 20.2 292.00
May/28/2019 20.5 1.40
May/29/2019 20.0 -1.60
May/30/2019 5.1 -292.00
May/31/2019 5.1 0.00
Code
# Locate the extremal values
s = df.Shift.lt(-50) | df.Shift.gt(50)
# Get the indices between consecutive pairs.
# This doesn't mask 2nd outlier, which matches your output
m = s.cumsum()%2==1
df.loc[m, 'IR'] = np.NaN
# IR Shift
#May/24/2019 5.9 NaN
#May/25/2019 6.0 1.67
#May/26/2019 5.9 -1.67
#May/27/2019 NaN 292.00
#May/28/2019 NaN 1.40
#May/29/2019 NaN -1.60
#May/30/2019 5.1 -292.00
#May/31/2019 5.1 0.00
Here I've added a few more rows to show how this will behave in the case of multiple spikes. IR_modified is how IR will be masked with the above logic.
IR Shift IR_modified
May/24/2019 5.9 NaN 5.9
May/25/2019 6.0 1.67 6.0
May/26/2019 5.9 -1.67 5.9
May/27/2019 20.2 292.00 NaN
May/28/2019 20.5 1.40 NaN
May/29/2019 20.0 -1.60 NaN
May/30/2019 5.1 -292.00 5.1
May/31/2019 5.1 0.00 5.1
June/1/2019 7.0 415.00 NaN
June/2/2019 17.0 15.00 NaN
June/3/2019 27.0 12.00 NaN
June/4/2019 17.0 315.00 17.0
June/5/2019 7.0 -12.00 7.0
You can also use the np.where function from numpy as follows:
import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30)], 'IR':[0.059,0.06,0.059,0.202, 0.205, 0.2, 0.051], 'Shift':[np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
df['IR'] = np.where(df['Shift'].between(df['Shift']*0.5, df['Shift']*1.5), df['Shift'], np.nan)
In [8]: df
Out[8]:
Date IR Shift
0 2019-05-24 NaN NaN
1 2019-05-25 0.0167 0.0167
2 2019-05-26 NaN -0.0167
3 2019-05-27 2.9200 2.9200
4 2019-05-28 0.0140 0.0140
5 2019-05-29 NaN -0.0160
6 2019-05-30 NaN -2.9200
Here's an attempt. There could be more "proper" ways to do it but I'm not familiar with all the pandas built-in functions.
df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30)], 'IR':[0.059,0.06,0.059,0.202, 0.205, 0.2, 0.051], 'Shift':[pd.np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
>>>df
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 0.202 2.9200
4 2019-05-28 0.205 0.0140
5 2019-05-29 0.200 -0.0160
6 2019-05-30 0.051 -2.9200
df['IR'] = [pd.np.nan if abs(y-z) > 0.5 else x for x, y, z in zip(df['IR'], df['Shift'], df['Shift'].shift(1))]
>>>df
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 NaN 2.9200
4 2019-05-28 NaN 0.0140
5 2019-05-29 0.200 -0.0160
6 2019-05-30 NaN -2.9200
Using df.at to access a single value for a row/column label pair.
import numpy as np
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30),datetime(2019,5,31)], 'IR':[5.9,6,5.9,20.2, 20.5, 20, 5.1, 5.1], 'Shift':[np.nan, 1.67, -1.67, 292, 1.4, -1.6, -292, 0]})
print("DataFrame Before :")
print(df)
count = 1
while count < len(df.index):
    if abs(df.at[count-1, 'Shift'] - df.at[count, 'Shift']) >= 50:
        df.at[count, 'IR'] = np.nan
    count = count + 1
print("DataFrame After :")
print(df)
Output of program:
DataFrame Before :
Date IR Shift
0 2019-05-24 5.9 NaN
1 2019-05-25 6.0 1.67
2 2019-05-26 5.9 -1.67
3 2019-05-27 20.2 292.00
4 2019-05-28 20.5 1.40
5 2019-05-29 20.0 -1.60
6 2019-05-30 5.1 -292.00
7 2019-05-31 5.1 0.00
DataFrame After :
Date IR Shift
0 2019-05-24 5.9 NaN
1 2019-05-25 6.0 1.67
2 2019-05-26 5.9 -1.67
3 2019-05-27 NaN 292.00
4 2019-05-28 NaN 1.40
5 2019-05-29 20.0 -1.60
6 2019-05-30 NaN -292.00
7 2019-05-31 NaN 0.00
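The same row-by-row condition can also be written without the explicit loop (a sketch, assuming the original df from before the while loop):
mask = df['Shift'].diff().abs().ge(50)   # difference between consecutive Shift values >= 50
df.loc[mask, 'IR'] = np.nan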
As per your description of triggering this on any large shift, positive or negative, you could do this:
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30)], 'IR':[0.059,0.06,0.059,0.202, 0.205, 0.2, 0.051], 'Shift':[np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
df.loc[(abs(df.Shift) > .5).cumsum() % 2 == 1, 'IR'] = np.nan
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 NaN 2.9200
4 2019-05-28 NaN 0.0140
5 2019-05-29 NaN -0.0160
6 2019-05-30 0.051 -2.9200
Steps:
abs(df.Shift) > .5: Finds shifts above +/- 50%.
.cumsum(): Gives a unique number to each period between outliers, where the odd-numbered periods are the ones we want to omit.
% 2 == 1: Checks which rows have an odd cumsum() value.
Note: This does not work if what you want is to constrain this so that every positive spike needs to be followed by a negative spike, or vice versa.
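If you did want a masked stretch to start at a positive spike and end at the next negative spike (or the reverse, with the signs flipped), one possible sketch, reusing the same df and np from above, is to forward-fill a +1/-1 state from the spike rows:
pos = df.Shift.gt(0.5)    # large upward shifts
neg = df.Shift.lt(-0.5)   # large downward shifts
# carry +1 forward from a positive spike and -1 from a negative spike
state = (pos.astype(int) - neg.astype(int)).replace(0, np.nan).ffill()
# mask from a positive spike up to, but not including, the next negative spike
df.loc[state.eq(1), 'IR'] = np.nan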
I was not sure about your Shift column, so I calculated it again. Does this work for you?
import pandas as pd
import numpy as np
df.drop(columns=['Shift'], inplace=True)  ## calculated via method below
df['nextval'] = df['IR'].shift(periods=1)

def shift(current, previous):
    return (current - previous) / previous * 100

indexlist = []  ## to save index that will be set to null
prior = 0       ## temporary flag to store value prior to a peak
flag = False

for index, row in df.iterrows():
    if index == 0:  ## to skip first row of data
        continue
    if flag == False and shift(row[1], row[2]) > 50:  ## to check for start of peak
        prior = row[2]
        indexlist.append(index)
        flag = True
        continue
    if flag == True:  ## checking until when the peak lasts
        if shift(row[1], prior) > 50:
            indexlist.append(index)

df.loc[df.index.isin(indexlist), 'IR'] = np.nan  ## replacing with nan
Output of print(df):
date IR nextval
0 May/24/2019 5.9 NaN
1 May/25/2019 6.0 5.9
2 May/26/2019 5.9 6.0
3 May/27/2019 NaN 5.9
4 May/28/2019 NaN 20.2
5 May/29/2019 NaN 20.5
6 May/30/2019 5.1 20.0
7 May/31/2019 5.1 5.1
df.loc[df['Shift']>0.5,'IR'] = np.nan