How to format Twitter (and other) timestamps? - python

UPDATE: The problem was dirty data and not a data type issue. The above options SHOULD work if your data is clean. In my case, I had about 10 records where the language code had been shifted over into the timestamp field :(
ORIGINAL POST:
I am trying to work with Twitter timestamps which look like this:
df.created_at.head()
0 2015-10-23T07:57:45.000Z
1 2015-10-23T07:56:04.000Z
2 2015-10-23T07:48:26.000Z
3 2015-10-23T07:48:07.000Z
4 2015-10-23T07:44:09.000Z
Name: created_at, dtype: object
I am trying to convert 'created_at' into a datetime data type. I have tried a few ways of doing this but they all give me errors.
If I try to change the data type I get this error:
df.created_at.astype('datetime64[ns]')
ValueError: Error parsing datetime string "en" at position 0
If I use a tweaked version of #Alexander's suggestion below, I get this error:
s = pd.Series(df.created_at)
datetime_idx = pd.DatetimeIndex(pd.to_datetime(s))
ValueError: Unable to convert 0 2015-10-23T07:57:45.000Z...
This approach gives me the following error:
pd.to_datetime(df.created_at, format="%Y-%m-%dT%H:%M:%S.000Z")
ValueError: time data u'en' does not match format '%Y-%m-%dT%H:%M:%S.000Z' (match)

Is this what you're looking for? I just used DatetimeIndex on the series converted to datetime with to_datetime.
s = pd.Series(['2015-10-23T07:57:45.000Z', '2015-10-23T07:56:04.000Z', '2015-10-23T07:48:26.000Z', '2015-10-23T07:48:07.000Z', '2015-10-23T07:44:09.000Z'], name='created_at')
datetime_idx = pd.DatetimeIndex(pd.to_datetime(s))
>>> datetime_idx
DatetimeIndex(['2015-10-23 07:57:45', '2015-10-23 07:56:04', '2015-10-23 07:48:26', '2015-10-23 07:48:07', '2015-10-23 07:44:09'], dtype='datetime64[ns]', freq=None, tz=None)

Related

How can I add a zero to dates in a string so all months are 2 characters? [duplicate]

Using a Python script, I need to read a CVS file where dates are formated as DD/MM/YYYY, and convert them to YYYY-MM-DD before saving this into a SQLite database.
This almost works, but fails because I don't provide time:
from datetime import datetime
lastconnection = datetime.strptime("21/12/2008", "%Y-%m-%d")
#ValueError: time data did not match format: data=21/12/2008 fmt=%Y-%m-%d
print lastconnection
I assume there's a method in the datetime object to perform this conversion very easily, but I can't find an example of how to do it. Thank you.
Your example code is wrong. This works:
import datetime
datetime.datetime.strptime("21/12/2008", "%d/%m/%Y").strftime("%Y-%m-%d")
The call to strptime() parses the first argument according to the format specified in the second, so those two need to match. Then you can call strftime() to format the result into the desired final format.
you first would need to convert string into datetime tuple, and then convert that datetime tuple to string, it would go like this:
lastconnection = datetime.strptime("21/12/2008", "%d/%m/%Y").strftime('%Y-%m-%d')
I am new to programming. I wanted to convert from yyyy-mm-dd to dd/mm/yyyy to print out a date in the format that people in my part of the world use and recognise.
The accepted answer above got me on the right track.
The answer I ended up with to my problem is:
import datetime
today_date = datetime.date.today()
print(today_date)
new_today_date = today_date.strftime("%d/%m/%Y")
print (new_today_date)
The first two lines after the import statement gives today's date in the USA format (2017-01-26). The last two lines convert this to the format recognised in the UK and other countries (26/01/2017).
You can shorten this code, but I left it as is because it is helpful to me as a beginner. I hope this helps other beginner programmers starting out!
Does anyone else else think it's a waste to convert these strings to date/time objects for what is, in the end, a simple text transformation? If you're certain the incoming dates will be valid, you can just use:
>>> ddmmyyyy = "21/12/2008"
>>> yyyymmdd = ddmmyyyy[6:] + "-" + ddmmyyyy[3:5] + "-" + ddmmyyyy[:2]
>>> yyyymmdd
'2008-12-21'
This will almost certainly be faster than the conversion to and from a date.
#case_date= 03/31/2020
#Above is the value stored in case_date in format(mm/dd/yyyy )
demo=case_date.split("/")
new_case_date = demo[1]+"-"+demo[0]+"-"+demo[2]
#new format of date is (dd/mm/yyyy) test by printing it
print(new_case_date)
If you need to convert an entire column (from pandas DataFrame), first convert it (pandas Series) to the datetime format using to_datetime and then use .dt.strftime:
def conv_dates_series(df, col, old_date_format, new_date_format):
df[col] = pd.to_datetime(df[col], format=old_date_format).dt.strftime(new_date_format)
return df
Sample usage:
import pandas as pd
test_df = pd.DataFrame({"Dates": ["1900-01-01", "1999-12-31"]})
old_date_format='%Y-%m-%d'
new_date_format='%d/%m/%Y'
conv_dates_series(test_df, "Dates", old_date_format, new_date_format)
Dates
0 01/01/1900
1 31/12/1999
The most simplest way
While reading the csv file, put an argument parse_dates
df = pd.read_csv("sample.csv", parse_dates=['column_name'])
This will convert the dates of mentioned column to YYYY-MM-DD format
Convert date format DD/MM/YYYY to YYYY-MM-DD according to your question, you can use this:
from datetime import datetime
lastconnection = datetime.strptime("21/12/2008", "%d/%m/%Y").strftime("%Y-%m-%d")
print(lastconnection)
df is your data frame
Dateclm is the column that you want to change
This column should be in DateTime datatype.
df['Dateclm'] = pd.to_datetime(df['Dateclm'])
df.dtypes
#Here is the solution to change the format of the column
df["Dateclm"] = pd.to_datetime(df["Dateclm"]).dt.strftime('%Y-%m-%d')
print(df)

Converting a string to Timestamp with Pyspark

I am currently attempting to convert a column "datetime" which has values that are dates/times in string form, and I want to convert the column such that all of the strings are converted to timestamps.
The date/time strings are of the form "10/11/2015 0:41", and I'd like to convert the string to a timestamp of form YYYY-MM-DD HH:MM:SS. At first I attempted to cast the column to timestamp in the following way:
df=df.withColumn("datetime", df["datetime"].cast("timestamp"))
Though when I did so, I received null for every value, which lead me to believe that the input dates needed to be formatted somehow. I have looked into numerous other possible remedies such as to_timestamp(), though this also gives the same null results for all of the values. How can a string of this format be converted into a timestamp?
Any insights or guidance are greatly appreciated.
Try:
import datetime
def to_timestamp(date_string):
return datetime.datetime.strptime(date_string, "%m/%d/%Y %H:%M")
df = df.withColumn("datetime", to_timestamp(df.datetime))
You can use the to_timestamp function. See Datetime Patterns for valid date and time format patterns.
df = df.withColumn('datetime', F.to_timestamp('datetime', 'M/d/y H:m'))
df.show(truncate=False)
You were doing it in the right way, except you missed to add the format ofstring type which is in this case MM/dd/yyyy HH:mm. Here M is used for months and m is used to detect minutes. Having said that, see the code below for reference -
df = spark.createDataFrame([('10/11/2015 0:41',), ('10/11/2013 10:30',), ('12/01/2016 15:56',)], ("String_Timestamp", ))
from pyspark.sql.functions import *
df.withColumn("Timestamp_Format", to_timestamp(col("String_Timestamp"), "MM/dd/yyyy HH:mm")).show(truncate=False)
+----------------+-------------------+
|String_Timestamp| Timestamp_Format|
+----------------+-------------------+
| 10/11/2015 0:41|2015-10-11 00:41:00|
|10/11/2013 10:30|2013-10-11 10:30:00|
|12/01/2016 15:56|2016-12-01 15:56:00|
+----------------+-------------------+

How to convert Pandas Series of strings to Pandas datetime with non-standard formats that contain dates before 1970

I have a column of dates in the following format:
Jan-85
Apr-99
Nov-01
Feb-65
Apr-57
Dec-19
I want to convert this to a pandas datetime object.
The following syntax works to convert them:
pd.to_datetime(temp, format='%b-%y')
where temp is the pd.Series object of dates. The glaring issue here of course is that dates that are prior to 1970 are being wrongly converted to 20xx.
I tried updating the function call with the following parameter:
pd.to_datetime(temp, format='%b-%y', origin='1950-01-01')
However, I am getting the error:
Name: temp, Length: 42537, dtype: object' is not compatible with origin='1950-01-01'; it must be numeric with a unit specified
I tried specifying a unit as it said, but I got a different error citing that the unit cannot be specified alongside a format.
Any ideas how to fix this?
Just #DudeWah's logic, but improving upon the code:
def days_of_future_past(date,chk_y=pd.Timestamp.today().year):
return date.replace(year=date.year-100) if date.year > chk_y else date
temp = pd.to_datetime(temp,format='%b-%y').map(days_of_future_past)
Output:
>>> temp
0 1985-01-01
1 1999-04-01
2 2001-11-01
3 1965-02-01
4 1957-04-01
5 2019-12-01
6 1965-05-01
Name: date, dtype: datetime64[ns]
Gonna go ahead and answer my own question so others can use this solution if they come across this same issue. Not the greatest, but it gets the job done. It should work until 2069, so hopefully pandas will have a better solution to this by then lol
Perhaps someone else will post a better solution.
def wrong_date_preprocess(data):
"""Correct date issues with pre-1970 dates with whacky mon-yy format."""
df1 = data.copy()
dates = df1['date_column_of_interest']
# use particular datetime format with data; ex: jan-91
dates = pd.to_datetime(dates, format='%b-%y')
# look at wrongly defined python dates (pre 1970) and get indices
date_dummy = dates[dates > pd.Timestamp.today().floor('D')]
idx = list(date_dummy.index)
# fix wrong dates by offsetting 100 years back dates that defaulted to > 2069
dummy2 = date_dummy.apply(lambda x: x.replace(year=x.year - 100)).to_list()
dates.loc[idx] = dummy2
df1['date_column_of_interest'] = dates
return(df1)

How to convert dataframe dates into floating point numbers?

I am trying to import a dataframe from a spreadsheet using pandas and then carry out numpy operations with its columns. The problem is that I obtain the error specified in the title: TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value.
The reason for this is that my dataframe contains a column with dates, like:
ID Date
519457 25/02/2020 10:03
519462 25/02/2020 10:07
519468 25/02/2020 10:12
... ...
And Numpy requires the format to be floating point numbers, as so:
ID Date
519457 43886.41875
519462 43886.42153
519468 43886.425
... ...
How can I make this change without having to modify the spreadsheet itself?
I have seen a lot of posts on the forum asking the opposite, and asking about the error, and read the docs on xlrd.xldate, but have not managed to do this, which seems very simple.
I am sure this kind of problem has been dealt with before, but have not been able to find a similar post.
The code I am using is the following
xls=pd.ExcelFile(r'/home/.../TwoData.xlsx')
xls.sheet_names
df=pd.read_excel(xls,"Hoja 1")
df["E_t"]=df["Date"].diff()
Any help or pointers would be really appreciated!
PS. I have seen solutions that require computing the exact number that wants to be obtained, but this is not possible in this case due to the size of the dataframes.
You can convert the date into the Unix timestamp. In python, if you have a datetime object in UTC, you can the timestamp() to get a UTC timestamp. This function returns the time since epoch for that datetime object.
Please see an example below-
from datetime import timezone
dt = datetime(2015, 10, 19)
timestamp = dt.replace(tzinfo=timezone.utc).timestamp()
print(timestamp)
1445212800.0
Please check the datetime module for more info.
I think you need:
#https://stackoverflow.com/a/9574948/2901002
#rewritten to vectorized solution
def excel_date(date1):
temp = pd.Timestamp(1899, 12, 30) # Note, not 31st Dec but 30th!
delta = date1 - temp
return (delta.dt.days) + (delta.dt.seconds) / 86400
df["Date"] = pd.to_datetime(df["Date"]).pipe(excel_date)
print (df)
ID Date
0 519457 43886.418750
1 519462 43886.421528
2 519468 43886.425000

Changing simple column of strings to date using to_datetime giving strange error

Trying to change date string to date in python. Fairly simple and not sure why not working. I realize I could consolidate the code a little but can someone point out why this is not working. Thanks.
data_act1.ActionDate data_act1.Action_Date_use
20201112 11/12/2020
20200415 04/15/2020
With dataframe like above, this code should work:
data_act1['month'] = data_act1['ActionDate'].str[4:6]
data_act1['day'] = data_act1['ActionDate'].str[6:]
data_act1['year'] = data_act1['ActionDate'].str[:4]
data_act1['Action_Date_use'] = data_act1['month']+"/"+data_act1['day']+"/"+data_act1['year']
data_act1.date_final = pd.to_datetime(data_act1.Action_Date_use, format = '%m/%d/%Y')
Running this yields this error:
ValueError: time data '//' does not match format '%m/%d/%Y' (match)
Update: Had instances of ActionDate being empty so that caused the error above. After removing those, I'm getting this error:
TypeError: cannot convert the series to <class 'int'>
Tried
data_act1['date_final'] = pd.datetime(data_act1['Action_Date_use'].astype(str), format = '%m/%d/%Y')
and still getting same error.
Check if you dont have any cells with '//' :
print(data_act1[data_act1['Action_Date_use']=='//'])
If yes maybe do a replace:
data_act1['Action_Date_use'] =data_act1['Action_Date_use'].str.replace('//','01/01/1990')

Categories

Resources