How do I turn string into datetime values and merge - python

I have a column of strings holding the month numbers 1-12 for a year of data. I am trying to convert it to datetime month values and then merge it into the main df, which already has dt.month values derived from a date column.
usage_12month["MONTH"]= pd.to_datetime(usage_12month["MONTH"])
usage_12month['MONTH'] = usage_12month['MONTH'].dt.month
display(usage_12month)
merge = pd.merge(df,usage_12month, how='left',on='MONTH')
ValueError: Given date string not likely a datetime.
I get the error on the first line.

.dt.month on a datetime returns an int, so I'm assuming you want to convert usage_12month["MONTH"] from a string to an int so that you can merge it with the other df.
There is a simpler way than converting it to a datetime. You could replace the first two lines with usage_12month["MONTH"] = pd.to_numeric(usage_12month["MONTH"]) and it should work.
--
The error you get on the first line occurs because you don't tell the to_datetime function how to interpret the string as a datetime (the number in the string could represent a day, an hour...).
To make your approach work, you have to pass a 'format' parameter to the to_datetime function. In your case the string contains only the month number, so the format string would be '%m' (see https://strftime.org/): usage_12month["MONTH"] = pd.to_datetime(usage_12month["MONTH"], format='%m')
When you supply the function with a "usual" date format like 'yyyy/mm/dd', it guesses how to interpret it, but it is always better to provide a format explicitly.
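Both approaches from this answer can be sketched on a small example. The sample frames and the DATE and USAGE column names below are assumptions made for illustration, not the asker's real data:

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's two frames.
usage_12month = pd.DataFrame({"MONTH": ["1", "2", "3"], "USAGE": [10.0, 20.0, 30.0]})
df = pd.DataFrame({"DATE": pd.to_datetime(["2021-01-15", "2021-03-02"])})
df["MONTH"] = df["DATE"].dt.month

# Option 1: convert the month strings straight to integers.
months_numeric = pd.to_numeric(usage_12month["MONTH"])

# Option 2: parse with an explicit format, then take the month number.
months_parsed = pd.to_datetime(usage_12month["MONTH"], format="%m").dt.month

# Either result gives an integer key that can be merged on.
usage_12month["MONTH"] = months_numeric
merged = pd.merge(df, usage_12month, how="left", on="MONTH")
print(merged)
```

Option 1 avoids building intermediate datetimes whose year and day carry no meaning here.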

Related

how to convert "2021-09-01T14:37:40.537Z" into "2021-09-01T14:37:40.5370000+00:00" in python?

How can I convert the string "2021-09-01T14:37:40.537Z" into "2021-09-01T14:37:40.5370000+00:00" in Python? We have a Hive table that stores datetimes in the format "2021-09-01T14:37:40.537Z", but we want to convert them to the "2021-09-01T14:37:40.5370000+00:00" format in Python.
General Solution
The datetime module has methods for converting datetimes to and from string representations with arbitrary formats. Specifically, datetime.strptime(date_string, format) converts from string to datetime object, and datetime.strftime(format) converts from datetime to string. By providing different formats to each method, we can convert between them.
from datetime import datetime
inp = "2021-09-01T14:37:40.537Z"
date = datetime.strptime(inp, "%Y-%m-%dT%H:%M:%S.%fZ")
output = date.strftime("%Y-%m-%dT%H:%M:%S.%f0+00:00")
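Note that this doesn't take time zones into account: the trailing Z is matched as a literal character and the +00:00 offset is hardcoded in the output. If some entries in your table use time zones other than GMT+00:00, a different solution is required.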
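For entries that do carry other offsets, one possible sketch is to parse the offset with %z and normalise to UTC before formatting. This assumes Python 3.7 or later (where %z accepts both "Z" and offsets such as "+02:00") and that every value has fractional seconds; the helper name to_utc_string is hypothetical:

```python
from datetime import datetime, timezone

def to_utc_string(value: str) -> str:
    """Parse an ISO-like timestamp with an offset and re-emit it in UTC."""
    # %z reads "Z" or "+HH:MM" and yields a timezone-aware datetime.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f%z")
    as_utc = parsed.astimezone(timezone.utc)
    # %f gives six digits; the extra "0" pads to the seven digits wanted.
    return as_utc.strftime("%Y-%m-%dT%H:%M:%S.%f") + "0+00:00"

print(to_utc_string("2021-09-01T14:37:40.537Z"))
print(to_utc_string("2021-09-01T16:37:40.537+02:00"))
```

Both calls describe the same instant, so they produce the same UTC string.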
Alternative solution
In this case, since the two specified formats are so similar, a simpler solution would suffice, although it wouldn't work for other cases. Simply trim the final Z from the input string and append 0000+00:00 as follows:
inp = "2021-09-01T14:37:40.537Z"
output = inp[:-1] + "0000+00:00"

Howto convert a string to a date format

Hi guys, I would appreciate some help. I'm analyzing a series (a set of columns) that has a date format like this:
'1060208'
The first three digits represent the year, where the first digit, '1', exists for comparison purposes. In the case above, the year is 2006. The 4th and 5th digits represent the month and the rest represent the day. I want to convert these dates to something like this:
106-02-08
so that I can use .groupby to group by month or year. Here is my code so far:
class Data:
    def convertdate(self):
        self.dates.apply(lambda x:x[0:3] + '-' + x[3:5] + '-' + x [5:7])
        return self.dates
When I run this, I get the error:
TypeError: 'int' object is not subscriptable
Can you please tell me what went wrong? Or can you suggest some alternative way to do this? Thank you so much.
Assuming that dates is a list of int, you can do:
input_dates = [1060208, 1060209]
input_dates_to_str = map(lambda x: str(x), input_dates)
output = list(map(lambda x: '-'.join([x[0:3], x[3:5], x[5:]]), input_dates_to_str))
In any case, when working with dates I suggest using the datetime package.
Quick answer to your question: 1060208 is an integer, integers are not subscriptable, so you need to change it to a string.
Some other thoughts:
Where is your data? Is it all in a pandas DataFrame? If so, why are you writing classes to convert it? There are better and faster ways of doing it, such as converting your integer date to a string, dropping the first digit, and converting the result to datetime.
What does "where 1 is put for comparison purposes" mean? It could have been recorded that way but obviously a date and a flag (I assume it is some kind of flag) should not be represented in the same field. So why don't you put that 1 in a field of its own?
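The string-then-datetime route suggested above could look like the following sketch. It assumes the dates sit in a pandas Series of integers and that the two-digit year after the leading 1 belongs to the 2000s (which is how %y treats values 00 to 68); the sample values are made up:

```python
import pandas as pd

# Hypothetical sample in the asker's encoding: flag digit, then yymmdd.
dates = pd.Series([1060208, 1060209, 1071231])

# Convert to string, drop the leading flag digit, then parse as yymmdd.
parsed = pd.to_datetime(dates.astype(str).str[1:], format="%y%m%d")

# Real datetimes can be grouped by month or year directly.
per_month = parsed.groupby(parsed.dt.month).size()
print(parsed)
print(per_month)
```

With real datetimes there is no need to build a '106-02-08' style string just to group.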

Plotly axis shows Datetime as numbers instead of dates

I am plotting my DataFrame using Plotly, but for some reason my datetime values get converted to numbers instead of being displayed as dates.
fig.add_trace(go.Scatter(x=df2plt["PyDate"].values,
y=df2plt["Data"].values))
If df2plt["PyDate"] is already in datetime format:
fig.add_trace(go.Scatter(x=df2plt["PyDate"],
y=df2plt["Data"].values))
Else:
fig.add_trace(go.Scatter(x=pd.to_datetime(df2plt["PyDate"]) ,
y=df2plt["Data"].values))
If the strings are not parsed as intended, you can control how they are interpreted with the format parameter of pd.to_datetime:
format : string, default None
strftime to parse time, e.g. "%d/%m/%Y"; note that "%f" will parse all the way up to nanoseconds. See the strftime documentation for more information on choices: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
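The check-then-convert logic from this answer can be sketched without Plotly itself. The sample values and the "%d/%m/%Y" format are assumptions; only the PyDate and Data column names come from the question:

```python
import pandas as pd

# Hypothetical frame whose date column is still plain strings.
df2plt = pd.DataFrame({"PyDate": ["01/09/2021", "02/09/2021"], "Data": [1.0, 2.0]})

# Convert only when the column is not already a datetime dtype.
if not pd.api.types.is_datetime64_any_dtype(df2plt["PyDate"]):
    df2plt["PyDate"] = pd.to_datetime(df2plt["PyDate"], format="%d/%m/%Y")

print(df2plt.dtypes)
# The Series itself (not .values) can then be passed as x, as in the answer:
# fig.add_trace(go.Scatter(x=df2plt["PyDate"], y=df2plt["Data"].values))
```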

How I can write the format of my date when I use apply and datetime functions?

I want to put days first in my datetime format. In another application I used the following:
df2.Timestamp = pd.to_datetime(df2.Timestamp,dayfirst=True) #the format is d/m/yyyy
Now I want to use the apply function, because I have more than one column, and instead of doing it in four lines I want to do it in one line using apply.
df2[["Detection_time", "Device_ack", "Reset/Run", "Duration"]] = df2[["Detection_time", "Device_ack", "Reset/Run", "Duration"]].apply(pd.to_datetime)
But I don't know how to configure "dayfirst" argument.
You can use:
(df2[["Detection_time", "Device_ack", "Reset/Run", "Duration"]]
.apply(pd.to_datetime,dayfirst=True))
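DataFrame.apply forwards extra keyword arguments to the function it calls, which is why dayfirst=True reaches pd.to_datetime. A small sketch with made-up values and only two of the asker's columns, assigning the result back:

```python
import pandas as pd

# Hypothetical day-first timestamps in two of the asker's columns.
df2 = pd.DataFrame({
    "Detection_time": ["01/02/2021 10:00", "13/02/2021 11:30"],
    "Device_ack": ["02/02/2021 10:05", "14/02/2021 11:45"],
})

cols = ["Detection_time", "Device_ack"]
# Keyword arguments after the function are passed on to pd.to_datetime.
df2[cols] = df2[cols].apply(pd.to_datetime, dayfirst=True)
print(df2.dtypes)
```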

Pyspark Dataframe: Check if values in date columns are valid

I have a spark DataFrame imported from a CSV file.
After applying some manipulations (mainly deleting columns/rows), I try to save the new DataFrame to Hadoop which brings up an error message:
ValueError: year out of range
I suspect that some columns of type DateType or TimestampType are corrupted. At least in one column I found an entry with a year '207' - this seems to create issues.
How can I check if the DataFrame adheres to the required time ranges?
I thought about writing a function that takes the DataFrame and gets, for each DateType / TimestampType column, the minimum and the maximum of the values, but I cannot get this to work.
Any ideas?
PS: In my understanding, spark would always check and enforce the schema. Would this not include a check for minimum/maximum values?
For validating the dates, regular expressions can help.
For example, to validate a date with the format MM-dd-yyyy:
Step 1: write a regular expression for your date format. For MM-dd-yyyy it will be ^(0[1-9]|1[012])[- \/.](0[1-9]|[12][0-9]|3[01])[- \/.](19|20)\d\d$
This step helps to find invalid dates that won't parse and would cause an error.
Step 2: convert the string to a date.
The following code can help:
import scala.util.{Try, Failure}
import org.apache.spark.sql.functions.udf
// Assumes a spark-shell session, where sc and the implicits for toDF and $ are in scope.
object FormatChecker extends java.io.Serializable {
  val fmt = org.joda.time.format.DateTimeFormat forPattern "MM-dd-yyyy"
  // Returns true when the string cannot be parsed as MM-dd-yyyy.
  def invalidFormat(s: String) = Try(fmt parseDateTime s) match {
    case Failure(_) => true
    case _ => false
  }
}
val df = sc.parallelize(Seq(
  "01-02-2015", "99-03-2010", "---", "2015-01-01", "03-30-2001")
).toDF("date")
// Wrap the checker in a UDF and count the rows that fail to parse.
val invalidFormat = udf((s: String) => FormatChecker.invalidFormat(s))
df.where(invalidFormat($"date")).count()
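Since the question is about PySpark, the same two-step check can be sketched in plain Python. This is not Spark code: the function is_invalid is a hypothetical helper that could in principle be wrapped in a UDF, and the regular expression here accepts only hyphens as separators:

```python
import re
from datetime import datetime

# Step 1: shape check for MM-dd-yyyy (hyphen separators only, by assumption).
DATE_RE = re.compile(r"^(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])-(19|20)\d\d$")

def is_invalid(value: str) -> bool:
    """Return True when value is not a real MM-dd-yyyy calendar date."""
    if not DATE_RE.match(value):
        return True
    # Step 2: actual parsing catches impossible dates such as 02-30-2015.
    try:
        datetime.strptime(value, "%m-%d-%Y")
    except ValueError:
        return True
    return False

samples = ["01-02-2015", "99-03-2010", "---", "2015-01-01", "03-30-2001"]
invalid = [s for s in samples if is_invalid(s)]
print(invalid)
```

The year group in the pattern also rejects out-of-range years such as '207', which is the kind of value the asker suspects.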
