I've got a dataframe with one column filled with milliseconds that I've been able to convert somewhat into datetime format. The issue is that for two years' worth of data, from 2017-2018, the converted dates all stay at 1970-01-01. The output datetime looks like this:
27 1970-01-01 00:25:04.232399999
28 1970-01-01 00:25:04.232699999
29 1970-01-01 00:25:04.232999999
...
85264 1970-01-01 00:25:29.962799999
85265 1970-01-01 00:25:29.963099999
85266 1970-01-01 00:25:29.963399999
It seems to me that the milliseconds, which begin at 1504224299999 and end at 1529971499999, are getting added to the first half hour of the epoch instead of representing the true range that they should.
This is my code so far...
import pandas as pd
import MySQLdb
import datetime
from pandas import DataFrame
con = MySQLdb.connect(host='localhost',user='root',db='binance',passwd='abcde')
cur = con.cursor()
ms = pd.read_sql('SELECT close_time FROM btcusdt', con=con)
ms['close_time'].apply( lambda x: datetime.datetime.fromtimestamp(x/1000) )
date = pd.to_datetime(ms['close_time'])
print(date)
I'm not quite sure where I'm going wrong, so if anybody can tell me what I'm doing stupidly it'd be greatly appreciated.
If you need to apply a function that doesn't accept a Series directly, you can apply it element-wise using a lambda.
Also, you need to assign the result back to your original pandas Series to overwrite it; use:
ms['close_time'] = ms['close_time'].apply(lambda x: datetime.datetime.fromtimestamp(x / 1000))
If you want to use pandas.to_datetime directly, use:
pd.to_datetime(ms['close_time'], unit='ms')
PS: There might be a difference between the datetimes obtained from these two methods, since fromtimestamp converts to your local timezone while to_datetime with unit='ms' returns naive UTC timestamps.
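A minimal sketch contrasting the two approaches, assuming a close_time column holding epoch milliseconds (the two sample values are taken from the question):
import datetime
import pandas as pd
ms = pd.DataFrame({'close_time': [1504224299999, 1529971499999]})
# Element-wise: fromtimestamp() interprets the seconds in the local timezone.
local_times = ms['close_time'].apply(lambda x: datetime.datetime.fromtimestamp(x / 1000))
# Vectorized: to_datetime with unit='ms' returns naive UTC timestamps.
utc_times = pd.to_datetime(ms['close_time'], unit='ms')
print(utc_times)  # the range now spans September 2017 to June 2018, as expected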
Related
I have a question. I have a set of numeric values that represent dates, but apparently they come out of SAS wrongly formatted. For example, I have the value 5893, which in SAS is 19.02.1976 when formatted correctly. I want to achieve the same conversion in Python/PySpark. From what I've found so far, there is a function fromtimestamp.
However, when I do this, it gives a wrong date:
value = 5893
date = datetime.datetime.fromtimestamp(value)
print(date)
1970-01-01 02:38:13
Any proposals to get the correct date? Thank you! :-)
EDIT: And what would the code look like when this operation is applied to a dataframe column rather than a single variable?
The Epoch, as far as SAS is concerned, is 1st January 1960. The number you have (5893) is the number of elapsed days since that Epoch. Therefore:
from datetime import timedelta, date
print(date(1960, 1, 1) + timedelta(days=5893))
...will give you the desired result.
import pandas as pd
# The same idea, vectorized for a pandas Series of SAS day counts (days since 1960-01-01):
ser = pd.Series([19411.0, 19325.0, 19325.0, 19443.0, 19778.0])
ser = pd.to_timedelta(ser, unit='D') + pd.Timestamp('1960-1-1')
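Since the question also mentions PySpark, here is a hedged sketch of the same conversion on a Spark DataFrame; the column name sas_date is an assumption for illustration:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(5893,)], ['sas_date'])
# Shift the SAS epoch (1960-01-01) forward by the stored number of days.
df = df.withColumn('date', F.expr("date_add(to_date('1960-01-01'), CAST(sas_date AS INT))"))
df.show()  # 5893 -> 1976-02-19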
I have a column of dates in the following format:
Jan-85
Apr-99
Nov-01
Feb-65
Apr-57
Dec-19
I want to convert this to a pandas datetime object.
The following syntax works to convert them:
pd.to_datetime(temp, format='%b-%y')
where temp is the pd.Series object of dates. The glaring issue here of course is that dates that are prior to 1970 are being wrongly converted to 20xx.
I tried updating the function call with the following parameter:
pd.to_datetime(temp, format='%b-%y', origin='1950-01-01')
However, I am getting the error:
Name: temp, Length: 42537, dtype: object' is not compatible with origin='1950-01-01'; it must be numeric with a unit specified
I tried specifying a unit as it said, but I got a different error citing that the unit cannot be specified alongside a format.
Any ideas how to fix this?
Just @DudeWah's logic, but improving upon the code:
def days_of_future_past(date, chk_y=pd.Timestamp.today().year):
    return date.replace(year=date.year - 100) if date.year > chk_y else date

temp = pd.to_datetime(temp, format='%b-%y').map(days_of_future_past)
Output:
>>> temp
0 1985-01-01
1 1999-04-01
2 2001-11-01
3 1965-02-01
4 1957-04-01
5 2019-12-01
6 1965-05-01
Name: date, dtype: datetime64[ns]
Gonna go ahead and answer my own question so others can use this solution if they come across the same issue. It's not the greatest, but it gets the job done. It should work until 2069, so hopefully pandas will have a better solution to this by then lol
Perhaps someone else will post a better solution.
def wrong_date_preprocess(data):
    """Correct date issues with pre-1970 dates in the whacky mon-yy format."""
    df1 = data.copy()
    dates = df1['date_column_of_interest']
    # use particular datetime format with data; ex: jan-91
    dates = pd.to_datetime(dates, format='%b-%y')
    # look at wrongly converted python dates (pre-1970) and get their indices
    date_dummy = dates[dates > pd.Timestamp.today().floor('D')]
    idx = list(date_dummy.index)
    # fix wrong dates by shifting back 100 years any date that defaulted into the future
    dummy2 = date_dummy.apply(lambda x: x.replace(year=x.year - 100)).to_list()
    dates.loc[idx] = dummy2
    df1['date_column_of_interest'] = dates
    return df1
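A quick usage sketch, assuming a DataFrame with a 'date_column_of_interest' column (the sample values below are just made up to mirror the question):
import pandas as pd
df = pd.DataFrame({'date_column_of_interest': ['Jan-85', 'Apr-99', 'Feb-65', 'Apr-57']})
fixed = wrong_date_preprocess(df)
print(fixed)  # Feb-65 and Apr-57 now become 1965-02-01 and 1957-04-01 instead of 2065/2057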
I am trying to import a dataframe from a spreadsheet using pandas and then carry out numpy operations with its columns. The problem is that I obtain the error specified in the title: TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value.
The reason for this is that my dataframe contains a column with dates, like:
ID Date
519457 25/02/2020 10:03
519462 25/02/2020 10:07
519468 25/02/2020 10:12
... ...
And Numpy requires the format to be floating point numbers (Excel-style serial dates), like so:
ID Date
519457 43886.41875
519462 43886.42153
519468 43886.425
... ...
How can I make this change without having to modify the spreadsheet itself?
I have seen a lot of posts on the forum asking about the opposite conversion and about the error itself, and I have read the docs on xlrd.xldate, but I have not managed to do this, even though it seems very simple.
I am sure this kind of problem has been dealt with before, but have not been able to find a similar post.
The code I am using is the following
xls = pd.ExcelFile(r'/home/.../TwoData.xlsx')
xls.sheet_names
df = pd.read_excel(xls, "Hoja 1")
df["E_t"] = df["Date"].diff()
Any help or pointers would be really appreciated!
PS. I have seen solutions that require computing the exact number that wants to be obtained, but this is not possible in this case due to the size of the dataframes.
You can convert the date into a Unix timestamp. In Python, if you have a datetime object in UTC, you can use timestamp() to get a UTC timestamp. This function returns the time since the epoch for that datetime object.
Please see an example below:
from datetime import datetime, timezone
dt = datetime(2015, 10, 19)
timestamp = dt.replace(tzinfo=timezone.utc).timestamp()
print(timestamp)
1445212800.0
Please check the datetime module for more info.
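To apply the same idea to the question's Date column rather than a single value, a hedged sketch could look like this (note that it yields Unix seconds, not the Excel serial day numbers the question asks for; Date_unix is a hypothetical column name):
import pandas as pd
from datetime import timezone
# dayfirst=True because the dates look like 25/02/2020 10:03
df["Date_unix"] = (pd.to_datetime(df["Date"], dayfirst=True)
                   .map(lambda d: d.replace(tzinfo=timezone.utc).timestamp()))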
I think you need:
#https://stackoverflow.com/a/9574948/2901002
#rewritten to vectorized solution
def excel_date(date1):
    temp = pd.Timestamp(1899, 12, 30)  # Note, not 31st Dec but 30th!
    delta = date1 - temp
    return delta.dt.days + delta.dt.seconds / 86400
df["Date"] = pd.to_datetime(df["Date"]).pipe(excel_date)
print (df)
ID Date
0 519457 43886.418750
1 519462 43886.421528
2 519468 43886.425000
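As a quick sanity check (a hedged sketch, not part of the original answer), the serial number can be converted back using the same 1899-12-30 origin:
pd.to_datetime(43886.41875, unit='D', origin=pd.Timestamp('1899-12-30'))
# ~ Timestamp('2020-02-25 10:03'), matching the first row above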
Python 3.6.0
I am importing a file with Unix timestamps.
I’m converting them to Pandas datetime and rounding to 10 minutes (12:00, 12:10, 12:20,…)
The data is collected from within a specified time period, but from different dates.
For our analysis, we want to change all dates to the same dates before doing a resampling.
At present we have a reduce_to_date that is the target for all dates.
current_date = pd.to_datetime('2017-04-05') #This will later be dynamic
reduce_to_date = current_date - pd.DateOffset(days=7)
I’ve tried to find an easy way to change the date in a series without changing the time.
I was trying to avoid lengthy conversions with .strftime().
One method that I've almost settled on is to add the difference between reduce_to_date and df['Timestamp'] to df['Timestamp']. However, I was trying to use the .date() function, and that only works on a single element, not on the whole Series.
GOOD!
passed_df['Timestamp'][0] = passed_df['Timestamp'][0] + (reduce_to_date.date() - passed_df['Timestamp'][0].date())
NOT GOOD
passed_df['Timestamp'][:] = passed_df['Timestamp'][:] + (reduce_to_date.date() - passed_df['Timestamp'][:].date())
AttributeError: 'Series' object has no attribute 'date'
I can use a loop:
x = 1
for line in passed_df['Timestamp']:
    passed_df['Timestamp'][x] = line + (reduce_to_date.date() - line.date())
    x += 1
But this throws a warning:
C:\Users\elx65i5\Documents\Lightweight Logging\newmain.py:60: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The goal is to have all dates the same, but leave the original time.
If we can simply specify the replacement date, that’s great.
If we can use mathematics and change each date according to a time delta, that's equally great.
Can we accomplish this in a vectorized fashion without using .strftime() or a lengthy procedure?
If I understand correctly, you can simply subtract an offset
passed_df['Timestamp'] -= pd.offsets.Day(7)
demo
passed_df=pd.DataFrame(dict(
Timestamp=pd.to_datetime(['2017-04-05 15:21:03', '2017-04-05 19:10:52'])
))
# Make sure your `Timestamp` column is datetime.
# Mine is because I constructed it that way.
# Use
# passed_df['Timestamp'] = pd.to_datetime(passed_df['Timestamp'])
passed_df['Timestamp'] -= pd.offsets.Day(7)
print(passed_df)
Timestamp
0 2017-03-29 15:21:03
1 2017-03-29 19:10:52
using strftime
Though this is not ideal, I wanted to make the point that you absolutely can use strftime. When your column is datetime, you can use strftime via the dt accessor with dt.strftime. You can create a dynamic column where you specify the target date like this:
pd.to_datetime(passed_df.Timestamp.dt.strftime('{} %H:%M:%S'.format('2017-03-29')))
0 2017-03-29 15:21:03
1 2017-03-29 19:10:52
Name: Timestamp, dtype: datetime64[ns]
I think you need to convert df['Timestamp'].dt.date back with to_datetime, because the output of date is a python date object, not a pandas datetime object:
df=pd.DataFrame({'Timestamp':pd.to_datetime(['2017-04-05 15:21:03','2017-04-05 19:10:52'])})
print (df)
Timestamp
0 2017-04-05 15:21:03
1 2017-04-05 19:10:52
current_date = pd.to_datetime('2017-04-05')
reduce_to_date = current_date - pd.DateOffset(days=7)
df['Timestamp'] = df['Timestamp'] - pd.to_datetime(df['Timestamp'].dt.date) + reduce_to_date
print (df)
Timestamp
0 2017-03-29 15:21:03
1 2017-03-29 19:10:52
I have some measurements that happened on specific days in a dictionary. It looks like
date_dictionary['YYYY-MM-DD'] = measurement.
I want to calculate the variance between the measurements within 7 days from a given date. When I convert the date strings to a datetime.datetime, the result looks like a tuple or an array, but doesn't behave like one.
Is there an easy way to generate all the dates one week from a given date? If so, how can I do that efficiently?
You can do this using timedelta. Example:
>>> from datetime import datetime,timedelta
>>> d = datetime.strptime('2015-07-22','%Y-%m-%d')
>>> for i in range(1,8):
... print(d + timedelta(days=i))
...
2015-07-23 00:00:00
2015-07-24 00:00:00
2015-07-25 00:00:00
2015-07-26 00:00:00
2015-07-27 00:00:00
2015-07-28 00:00:00
2015-07-29 00:00:00
You do not actually need to print it; adding a timedelta to a datetime object returns another datetime object, which you can use directly in your calculation.
Using datetime, to generate all 7 dates following a given date, including the given date, you can do:
import datetime
dt = datetime.datetime(...)
week_dates = [ dt + datetime.timedelta(days=i) for i in range(7) ]
There are libraries providing nicer APIs for performing datetime/date operations, most notably pandas (though it includes much much more). See pandas.date_range.
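For example, a minimal sketch using pandas.date_range together with the date_dictionary from the question; the 2015-07-22 start date and the string-keyed lookup are assumptions for illustration:
import numpy as np
import pandas as pd
week = pd.date_range('2015-07-22', periods=7, freq='D')  # seven consecutive days
values = [date_dictionary[d.strftime('%Y-%m-%d')]
          for d in week
          if d.strftime('%Y-%m-%d') in date_dictionary]
variance = np.var(values)  # variance of the measurements that fall within that week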