I have a dataset of 70000+ data points (see picture)
As you can see, in the column 'date' half of the format is different (more messy) compared to the other half (more clear). How can I make the whole format as the second half of my data frame?
I know how to do it manually, but it will take ages!
Thanks in advance!
EDIT
df['date'] = df['date'].apply(lambda x: dt.datetime.fromtimestamp(int(str(x)) / 1000).strftime('%Y-%m-%d %H:%M:%S') if str(x).isdigit() else x)
Date is in a strange format
EDIT 2
two data formats:
2012-01-01 00:00:00
2020-07-21T22:45:00+00:00
I've tried the below and it works. Note that it rests on two key assumptions:
1- Your date format follows one and ONLY ONE of the TWO formats in your example!
2- The final output is a string!
If so, this should do the trick; otherwise it's a starting point that can be altered to look the way you want:
import pandas as pd
import datetime
#data sample
d = {'date':['20090602123000', '20090602124500', '2020-07-22 18:45:00+00:00', '2020-07-22 19:00:00+00:00']}
#create dataframe
df = pd.DataFrame(data = d)
print(df)
date
0 20090602123000
1 20090602124500
2 2020-07-22 18:45:00+00:00
3 2020-07-22 19:00:00+00:00
#loop over records
for i, row in df.iterrows():
    #get date
    dateString = df.at[i,'date']
    #check if it's the undesired format or the desired format
    #NOTE: I'm using the '+' substring to identify that; this relies on my first assumption above that you only have two formats
    if '+' not in dateString:
        #reformat datetime
        #NOTE: this relies on my second assumption: the result is produced as a string so that '+00:00' can be appended
        df.at[i,'date'] = str(datetime.datetime.strptime(dateString, '%Y%m%d%H%M%S')) + '+00:00'
    else:
        continue
print(df)
date
0 2009-06-02 12:30:00+00:00
1 2009-06-02 12:45:00+00:00
2 2020-07-22 18:45:00+00:00
3 2020-07-22 19:00:00+00:00
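The same conversion can probably be done without a Python-level loop. The following is only a minimal sketch under the same two assumptions as above (exactly two formats, string output); the names `compact` and `converted` are mine and do not come from the answer:

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({"date": ["20090602123000", "20090602124500",
                            "2020-07-22 18:45:00+00:00",
                            "2020-07-22 19:00:00+00:00"]})

# Rows without a '+' are assumed to be in the compact YYYYmmddHHMMSS format.
compact = ~df["date"].str.contains("+", regex=False)

# Parse only those rows, format them as strings and append the UTC offset.
converted = (
    pd.to_datetime(df.loc[compact, "date"], format="%Y%m%d%H%M%S")
    .dt.strftime("%Y-%m-%d %H:%M:%S")
    + "+00:00"
)
df.loc[compact, "date"] = converted
print(df)
```

On 70000+ rows this should be noticeably faster than iterrows, since the parsing happens in one vectorized call.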
You can reformat the first part of your dataframe:
import datetime as dt
df['date'] = df['date'].apply(lambda x: dt.datetime.fromtimestamp(int(str(x)) / 1000).strftime('%Y-%m-%d %H:%M:%S') if str(x).isdigit() else x)
This checks whether every character of the value is a digit and, if so, formats the date like the second part.
EDIT
The timestamps seem to be in milliseconds, while fromtimestamp expects seconds, hence the / 1000.
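If the digit-only values really are epoch milliseconds, a vectorized variant might look like the sketch below. It assumes the timestamps are meant as UTC (the lambda above uses local time instead), and the sample values and the `parsed` column name are made up for illustration:

```python
import pandas as pd

# Hypothetical sample: one epoch-milliseconds value, one ISO-formatted value.
df = pd.DataFrame({"date": ["1325376000000", "2020-07-21T22:45:00+00:00"]})

# Digit-only strings are treated as epoch milliseconds.
is_epoch = df["date"].str.isdigit()

# Parse each group with the matching method, both as UTC-aware timestamps.
epoch_part = pd.to_datetime(df.loc[is_epoch, "date"].astype("int64"),
                            unit="ms", utc=True)
text_part = pd.to_datetime(df.loc[~is_epoch, "date"], utc=True)

# Put the two groups back together in the original row order.
df["parsed"] = pd.concat([epoch_part, text_part]).sort_index()
print(df)
```

Keeping the result as a datetime column rather than a string makes later comparisons and sorting straightforward.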
Related
I'm trying to get the index of my DataFrame to be of type datetime. My CSV file contains separate Date and Time columns, which I combine on import:
df = pd.read_csv("example.csv", sep=";", decimal=",", parse_dates=[["Date", "Time"]])
It will look like this after the import:
          Date_Time
0  1012020 00:00:00
1  1012020 00:15:00
The problem is the missing leading zero on the first 9 days of each month. Pandas to_datetime() needs a leading zero for the %d format option to work. When I use format="%d%m%Y%H:%M:%S", Python says "invalid syntax".
How can I convert this column to datetime?
Use Series.str.zfill (as suggested by @FObersteiner in the comments) and apply pd.to_datetime afterwards:
import pandas as pd
# changing 2nd val to `'12012020 00:15:00'` to show that
# only the 1st val is affected
data = {'Date_Time': {0: '1012020 00:00:00', 1: '12012020 00:15:00'}}
df = pd.DataFrame(data)
df['Date_Time'] = pd.to_datetime(df["Date_Time"].str.zfill(17),
                                 format="%d%m%Y %H:%M:%S")
print(df)
Date_Time
0 2020-01-01 00:00:00
1 2020-01-12 00:15:00
print(df['Date_Time'].dtype)
datetime64[ns]
Another (admittedly, unnecessarily complicated) way to go, would be to use a regex pattern to replace all "dates" with 7 digits by their 8-digit equivalent:
df['Date_Time'] = pd.to_datetime(
    df['Date_Time'].replace(r'^(\d{7}\s)', r'0\1', regex=True),
    format="%d%m%Y %H:%M:%S")
Explanation r'^(\d{7}\s)':
^ assert position at start of the string
\d{7}\s matches 7 digits followed by a whitespace
The encapsulating brackets turn this into a Capturing Group
Explanation r'0\1':
\1 refers back to the Capturing Group (1st of 1 group(s)), to which we prepend 0
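To see the pattern in isolation, here is a small sketch using Python's standard `re` module on the two sample strings from above (the variable names are mine):

```python
import re

pattern = r'^(\d{7}\s)'   # 7 digits followed by whitespace, at string start
replacement = r'0\1'      # prepend a zero to the captured group

short_date = re.sub(pattern, replacement, '1012020 00:00:00')
full_date = re.sub(pattern, replacement, '12012020 00:15:00')

print(short_date)  # the 7-digit date gains a leading zero
print(full_date)   # the 8-digit date does not match and stays unchanged
```

The 8-digit value is left alone because its first seven digits are followed by another digit rather than whitespace.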
I have a time column with the format XXXHHMMSS where XXX is the Day of Year. I also have a year column. I want to merge both these columns into one date time object.
Before I had detached XXX into a new column but this was making it more complicated.
I've converted the two columns to strings
points['UTC_TIME'] = points['UTC_TIME'].astype(str)
points['YEAR_'] = points['YEAR_'].astype(str)
Then I have the following line:
points['Time'] = pd.to_datetime(points['YEAR_'] * 1000 + points['UTC_TIME'], format='%Y%j%H%M%S')
I'm getting a ValueError: time data '137084552' does not match format '%Y%j%H%M%S' (match)
Here is a photo of my columns and a link to the data
It works fine for me if you combine both columns as strings, e.g.:
import pandas as pd
df = pd.DataFrame({'YEAR_': [2002, 2002, 2002],
                   'UTC_TIME': [99082552, 135082552, 146221012]})
pd.to_datetime(df['YEAR_'].astype(str) + df['UTC_TIME'].astype(str).str.zfill(9),
               format="%Y%j%H%M%S")
# 0 2002-04-09 08:25:52
# 1 2002-05-15 08:25:52
# 2 2002-05-26 22:10:12
# dtype: datetime64[ns]
Note, since %j expects zero-padded day of year, you might need to zero-fill, see first row in the example above.
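If you would rather avoid string padding altogether, one alternative (a different technique from the one above, shown only as a sketch) is to split the integer arithmetically and add the pieces as timedeltas. The intermediate names `doy`, `hh`, `mm` and `ss` are my own:

```python
import pandas as pd

df = pd.DataFrame({'YEAR_': [2002, 2002, 2002],
                   'UTC_TIME': [99082552, 135082552, 146221012]})

t = df['UTC_TIME']
doy = t // 1_000_000      # day of year (the XXX part)
hh = t // 10_000 % 100    # hours
mm = t // 100 % 100       # minutes
ss = t % 100              # seconds

# Start at January 1st of the year, then add the offsets.
result = (pd.to_datetime(df['YEAR_'].astype(str), format='%Y')
          + pd.to_timedelta(doy - 1, unit='D')
          + pd.to_timedelta(hh, unit='h')
          + pd.to_timedelta(mm, unit='min')
          + pd.to_timedelta(ss, unit='s'))
print(result)
```

This assumes UTC_TIME is stored as an integer column, as in the example data.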
I'm trying to compare 2 lists of dates by checking whether each date in the first dataframe's 'timekey' column lies between 2 dates: a date in timelist and that same date minus 1 year.
An example would be checking if 30Aug2020 is between 30Nov2020 and 30Nov2020 minus 1 year, i.e. 30Nov2019.
I then want to have a 3rd column in the original df where it shows the difference between the timekey date and the compared timelist date.
I'm doing all of this in python using pandas.
import pandas as pd
import datetime as dt
datelist = pd.date_range(start = dt.datetime(2016,8,31), end = dt.datetime(2020,11,30), freq = '3M')
data = {'ID': ['1', '2', '3'], 'timekey': ['31Dec2016', '30Jun2017', '30Aug2018']}
df = pd.DataFrame(data)
df['timekey'] = pd.to_datetime(df['timekey'])
print(df)
print(datelist)
Here is the code I tried, but I get a ValueError saying the lengths must match to compare. What's going on?
for date in datelist:
    if (df['timekey'] <= datelist) & (df['timekey'] >= (datelist - pd.offsets.DateOffset(years=1))):
        df['diff'] = df['timekey'] - (datelist - pd.offsets.DateOffset(years=1))
The expected output should be that for each timekey, if it is within the date range specified by the datelist, it should generate an entire new row with the same ID and timekey with the 3rd new column being the difference in months.
For example, if the timekey is 30Jun2020, it would be between 30Nov2019-30Nov2020, 30Aug2019-30Aug2020. There would be 2 rows created whereby the time difference in months would be 5 and 2 respectively.
The easiest way I could think of to solve your problem would be to compare Unix timestamps (the number of seconds elapsed since 1970-01-01). For that you would need to convert your dates to Unix time.
Something like this would work:
unixTime = (pd.to_datetime(<yourTime>) - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')
So a working example to check if a date is in between two dates could look like this:
def checkIfInbetween(date1, date2, dateToCheck):
    # convert all three dates to seconds since the Unix epoch
    epoch = pd.Timestamp('1970-01-01')
    date1 = (pd.to_datetime(date1) - epoch) // pd.Timedelta('1s')
    date2 = (pd.to_datetime(date2) - epoch) // pd.Timedelta('1s')
    dateToCheck = (pd.to_datetime(dateToCheck) - epoch) // pd.Timedelta('1s')
    return date1 < dateToCheck < date2
# axis=1 passes each row (not each column) to the lambda
df['isInbetween'] = df.apply(lambda x: checkIfInbetween(x['date1'], x['date2'], x['dateToCheck']), axis=1)
(Code not tested)
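For the output described in the question (one row per matching window, with the difference in months), a cross join may be closer to what is wanted. This is a sketch of a different technique, not the Unix-timestamp approach above; it needs pandas 1.2 or later for `how="cross"`, and the column names `end`, `start` and `diff_months`, as well as the three sample window dates, are my own choices:

```python
import pandas as pd

# One timekey from the question's example, plus three hypothetical window ends.
df = pd.DataFrame({"ID": ["1"], "timekey": pd.to_datetime(["2020-06-30"])})
windows = pd.DataFrame(
    {"end": pd.to_datetime(["2020-05-31", "2020-08-31", "2020-11-30"])}
)

# Pair every timekey with every window end, then derive the window start.
pairs = df.merge(windows, how="cross")
pairs["start"] = pairs["end"] - pd.DateOffset(years=1)

# Keep only the pairs where the timekey falls inside [start, end].
inside = pairs[(pairs["timekey"] >= pairs["start"])
               & (pairs["timekey"] <= pairs["end"])].copy()

# Whole-month difference between the window end and the timekey.
inside["diff_months"] = (
    (inside["end"].dt.year - inside["timekey"].dt.year) * 12
    + (inside["end"].dt.month - inside["timekey"].dt.month)
)
print(inside[["ID", "timekey", "end", "diff_months"]])
```

The element-wise `&` on whole columns also avoids the "lengths must match" error from the question, which comes from comparing a 3-row column against the full datelist inside an `if`.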
I have this kind of dataframe
These data represent the value of a consumption index, generally recorded once a month (at the end of the month or at the beginning of the following one), but sometimes more often. The value can be reset to "0" if the counter fails and is replaced. Moreover, for some months no data is available.
I would like to select only one entry per month, but this entry has to be the one nearest to the first day of the month AND before the 15th day of the month (because if the day is later, it could be the end-of-month measurement). Another condition is that if the difference between two values is negative (the counter has been replaced), the value needs to be kept even if its date is not the nearest to the first day of the month.
For example, the output data needs to be
The purpose is to calculate only one consumption value per month.
A solution is to parse the dataframe (as an array) and apply some if-conditions. However, I wonder if there is a "simple" alternative to achieve that.
Thank you
You can normalize the dates to their month end with MonthEnd, then drop duplicates on that column, keeping the entry closest to the month end.
from pandas.tseries.offsets import MonthEnd
# assumes the dates are in a datetime column named 'Index'
df['New'] = df['Index'] + MonthEnd(1)
df['Diff'] = (df['Index'] - df['New']).dt.days.abs()
df = df.sort_values(['New', 'Diff'])
df = df.drop_duplicates(subset='New', keep='first').drop(['New','Diff'], axis=1)
That should do the trick, but I was not able to test it, so please copy and paste the sample data into Stack Overflow if this isn't doing the job.
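To make the effect of this approach concrete, here is a small sketch on made-up data. It assumes the dates live in a datetime column called `Index` (the question's real data was only shown as an image), and the values are illustrative:

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical readings: two in October, one in November.
df = pd.DataFrame({
    "Index": pd.to_datetime(["2019-10-05", "2019-10-29", "2019-11-04"]),
    "Value": [1254, 1265, 1301],
})

df["New"] = df["Index"] + MonthEnd(1)                 # month end per row
df["Diff"] = (df["Index"] - df["New"]).dt.days.abs()  # days until month end
df = df.sort_values(["New", "Diff"])                  # closest to month end first
df = df.drop_duplicates(subset="New", keep="first").drop(["New", "Diff"], axis=1)
print(df)
```

Note that this keeps the reading nearest the end of each month, so the additional rules from the question (before the 15th, counter resets) would still need to be handled separately.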
Defining dataframe, converting index to datetime, defining helper columns,
using them to run shift method to conditionally remove rows, and finally removing the helper columns:
from pandas.tseries.offsets import MonthEnd, MonthBegin
import pandas as pd
from datetime import datetime as dt
import numpy as np
df = pd.DataFrame([
    [1254],
    [1265],
    [1277],
    [1301],
    [1345],
    [1541]
    ], columns=["Value"],
    index=[dt.strptime("05-10-19", '%d-%m-%y'),
           dt.strptime("29-10-19", '%d-%m-%y'),
           dt.strptime("30-10-19", '%d-%m-%y'),
           dt.strptime("04-11-19", '%d-%m-%y'),
           dt.strptime("30-11-19", '%d-%m-%y'),
           dt.strptime("03-02-20", '%d-%m-%y')
    ]
)
early_days = df.loc[df.index.day < 15]
early_month_end = early_days.index - MonthEnd(1)
early_day_diff = early_days.index - early_month_end
late_days = df.loc[df.index.day >= 15]
late_month_end = late_days.index + MonthBegin(1)
late_day_diff = late_month_end - late_days.index
df["day_offset"] = (early_day_diff.append(late_day_diff) / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
df["month"] = df.index.values.astype('M8[D]').astype(str)
df["month"] = df["month"].str[5:7].str.lstrip('0')
# df["month_diff"] = df["month"].astype(int).diff().fillna(0).astype(int)
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])
print(df)
Returns:
Value
2019-10-05 1254
2019-10-30 1277
2019-11-04 1301
2019-11-30 1345
2020-02-03 1541
Trying to convert the date (type=datetime) of a complete column into a date to use in a condition later on. The following error keeps showing up:
KeyError: Timestamp('2010-05-04 10:15:55')
Tried multiple things but I'm currently stuck with the code below.
for d in df.column:
    pd.to_datetime(df.column[d]).apply(lambda x: x.date())
Also, how do I format the column so I can use it in a statement as follows:
df = df[df.column > 2015-05-28]
Just adding an answer in case anyone else ends up here:
First, let's create a dataframe with some dates, change the dtype to string, and convert it back. With errors='ignore', pd.to_datetime returns the input unchanged when a value cannot be parsed, so if you had John Smith in row x it would remain (note that this option is deprecated in recent pandas releases). In the same vein, errors='coerce' would turn John Smith into NaT (not a time value).
# Create date range with frequency of a day
rng = pd.date_range(start='01/01/18', end ='01/01/19',freq='D')
#pass this into a dataframe
df = pd.DataFrame({'Date' : rng})
print(df.dtypes)
Date datetime64[ns]
#okay let's cast this into a str so we can convert it back
df['Date'] = df['Date'].astype(str)
print(df.dtypes)
Date object
# now lets convert it back #
df['Date'] = pd.to_datetime(df.Date,errors='ignore')
print(df.dtypes)
Date datetime64[ns]
# Okay lets slice the data frame for your desired date ##
print(df.loc[df.Date > '2018-12-29'])
Date
363 2018-12-30
364 2018-12-31
365 2019-01-01
The answer as provided by @Datanovice:
pd.to_datetime(df['your column'],errors='ignore')
Then inspect the dtype; it should be datetime. If so, just do
df.loc[df['your column'] > 'your-date']
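Putting the pieces together for the original question, a minimal sketch could look like the following. It uses errors='coerce' rather than 'ignore', and the column names `when` and `day` as well as the sample rows are invented for illustration:

```python
import pandas as pd

# Hypothetical column with two valid timestamps and one junk value.
df = pd.DataFrame({"when": ["2010-05-04 10:15:55",
                            "2016-01-02 08:00:00",
                            "John Smith"]})

# Convert the whole column at once; unparseable values become NaT.
df["when"] = pd.to_datetime(df["when"], errors="coerce")

# Filter with a quoted date (an unquoted 2015-05-28 would be integer arithmetic).
recent = df[df["when"] > pd.Timestamp("2015-05-28")]

# If only the calendar date is needed, use the .dt accessor on the column.
df["day"] = df["when"].dt.date
print(recent)
```

No loop over the column is needed; the KeyError in the question comes from using each Timestamp value as an index label in `df.column[d]`.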