Python - Convert epoch in object format to datetime

I'm pulling data from an API that gives me date values as epoch milliseconds, but the values come through in this format: '1,609,459,200,000.0000000000'.
I am trying to convert these to datetime with pd.to_datetime(df1.loc[:, 'Curve Value'], unit='ms'), but I get the error:
ValueError: non convertible value 1,609,459,200,000.0000000000 with
the unit 'ms'
I then tried to format the column as float with df1["Curve Value"] = df1["Curve Value"].astype(float), but then I get this error:
ValueError: could not convert string to float:
'1,609,459,200,000.0000000000'
I've also tried various ways to remove the commas and convert to float, but get errors trying that, as well.

A bit unwieldy, and it requires importing datetime from datetime, but it works as far as I can see.
from datetime import datetime
# keep the first four comma groups ('1,609,459,200' -> 1609459200) and parse as a seconds-based epoch
df['Real Time'] = df['Curve Value'].apply(lambda t: datetime.fromtimestamp(int(''.join(t.split(',')[:4]))))
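A vectorized alternative is sketched below, assuming the values are strings like the one in the error message: strip the commas, cast to float, and let pandas handle the millisecond unit.
df1['Real Time'] = pd.to_datetime(
    df1['Curve Value'].str.replace(',', '', regex=False).astype(float),
    unit='ms')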

Related

Pandas converting string to numeric - getting invalid literal for int() with base 10 error

I am trying to convert data from a csv file to a numeric type so that I can find the greatest and least value in each category. This is a short view of the data I am referencing:
Course    Grades_Recieved
098321    A,B,D
324323    C,B,D,F
213323    A,B,D,F
I am trying to convert Grades_Recieved to numeric types so that I can create new columns listing the highest and lowest grade received in each course.
This is my code so far:
import pandas as pd
df = pd.read_csv('grades.csv')
df.astype({'Grades_Recieved': 'int64'}).dtypes
I have tried the code above and I have tried using to_numeric, but I keep getting the error invalid literal for int() with base 10: 'A,B,D', and I am not sure how to fix it. I have also tried getting rid of the ',', but the error remains the same.
You can't convert a list of non-numeric strings into int/float, but you can get the desired result with something like this:
# letter grades sort alphabetically, so min() is the best grade and max() the worst
df['Highest_Grade'] = df['Grades_Recieved'].str.split(',').apply(lambda x: min(x))
df['Lowest_Grade'] = df['Grades_Recieved'].str.split(',').apply(lambda x: max(x))
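A self-contained version of the same idea, using the sample data from the question to show what comes out:
import pandas as pd

df = pd.DataFrame({'Course': ['098321', '324323', '213323'],
                   'Grades_Recieved': ['A,B,D', 'C,B,D,F', 'A,B,D,F']})
df['Highest_Grade'] = df['Grades_Recieved'].str.split(',').apply(min)
df['Lowest_Grade'] = df['Grades_Recieved'].str.split(',').apply(max)
print(df)  # Highest_Grade: A, B, A; Lowest_Grade: D, F, F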

How do I turn string into datetime values and merge

I have a string column of monthly data (months 1-12) that I am trying to convert to datetime month values and then merge onto the main df, which already has dt.month values derived from a date.
usage_12month["MONTH"]= pd.to_datetime(usage_12month["MONTH"])
usage_12month['MONTH'] = usage_12month['MONTH'].dt.month
display(usage_12month)
merge = pd.merge(df,usage_12month, how='left',on='MONTH')
ValueError: Given date string not likely a datetime.
I get the error on the first line.
.dt.month on a datetime returns an int, so I'm assuming you want to convert usage_12month["MONTH"] from a string to an int so you can merge it with the other df.
There is a simpler way than converting it to a datetime: replace the first two lines with usage_12month["MONTH"] = pd.to_numeric(usage_12month["MONTH"]) and it should work.
--
The error you get on the first line is because you don't tell the to_datetime function how to interpret the string as a datetime (the number in the string could represent a day, an hour...).
To make your approach work, pass a format parameter to to_datetime. In your case the string contains only the month number, so the format string is '%m' (see https://strftime.org/): usage_12month["MONTH"] = pd.to_datetime(usage_12month["MONTH"], format='%m')
When you supply the function with a "usual" date format like 'yyyy/mm/dd' it guesses how to interpret it, but it is always better to provide a format explicitly.
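A minimal end-to-end sketch of that route, with made-up frames standing in for the question's data:
import pandas as pd

# hypothetical stand-ins for the question's dataframes
usage_12month = pd.DataFrame({'MONTH': ['1', '2', '12'], 'USAGE': [10, 20, 30]})
df = pd.DataFrame({'MONTH': [1, 2, 12], 'SALES': [100, 200, 300]})

# parse the month-number strings with an explicit format, then pull out the int month
usage_12month['MONTH'] = pd.to_datetime(usage_12month['MONTH'], format='%m').dt.month
merged = pd.merge(df, usage_12month, how='left', on='MONTH')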

Pandas Sales Analysis Help - ValueError: could not convert string to float: ''

I'm currently running a sales analysis on an excel file with roughly 500 transactions. I have a category called "Sale Price" which should be read in as a float. Pandas read in the dtype as an object, and when trying to change the dtype to a float using:
df['Sale Price'].fillna(0).astype(float)
I get the following error:
ValueError: could not convert string to float: ''
I've tried mixing in various command combinations such as:
df.loc[pd.to_numeric(df['Sale Price'], errors='coerce').isnull()]
and:
pd.to_numeric(df['Sale Price']).astype(int)
in order to convert the column to a float, but now I'm thinking the issue is in how the data is being read in. I used the basic:
df = pd.read_excel('...')
Hopefully someone can help clarify where the issue is coming from, as I've been stuck for a while. Thank you!
You could replace your empty strings with 0 before changing the column to float:
df["Sale Price"] = df["Sale Price"].astype(str).str.strip().replace("", 0).astype(float)
You have an empty string somewhere in your Sale Price column, as indicated by the error:
ValueError: could not convert string to float: ''
To fix this, first run:
df['Sale Price'] = df['Sale Price'].where(df['Sale Price'] != '', 0)
This will replace any empty strings with zero.
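Another route, building on the pd.to_numeric call the question already tried: coerce anything non-numeric to NaN and then fill. A sketch:
df['Sale Price'] = pd.to_numeric(df['Sale Price'], errors='coerce').fillna(0)
This turns the empty strings (and any other stray text) into NaN before filling with 0, so no separate astype(float) step is needed.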

Plotly axis shows Datetime as numbers instead of dates

I am plotting my DataFrame using Plotly, but for some reason my datetime values get converted to numbers instead of being displayed as dates:
fig.add_trace(go.Scatter(x=df2plt["PyDate"].values,
                         y=df2plt["Data"].values))
If df2plt["PyDate"] is already in datetime format:
fig.add_trace(go.Scatter(x=df2plt["PyDate"],
y=df2plt["Data"].values))
Else:
fig.add_trace(go.Scatter(x=pd.to_datetime(df2plt["PyDate"]) ,
y=df2plt["Data"].values))
You can control how the strings are parsed with the format parameter of pd.to_datetime:
format : string, default None
    strftime to parse time, e.g. "%d/%m/%Y"; note that "%f" will parse all the way up to nanoseconds. See the strftime documentation for more information on choices: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
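Putting it together, a minimal runnable sketch with a hypothetical frame in place of df2plt:
import pandas as pd
import plotly.graph_objects as go

# hypothetical stand-in for df2plt
df2plt = pd.DataFrame({'PyDate': pd.date_range('2021-01-01', periods=5, freq='D'),
                       'Data': [1, 3, 2, 5, 4]})

fig = go.Figure()
# pass the datetime Series itself (not .values) so Plotly keeps the date type
fig.add_trace(go.Scatter(x=df2plt['PyDate'], y=df2plt['Data'].values))
fig.show()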

Pyspark Dataframe: Check if values in date columns are valid

I have a spark DataFrame imported from a CSV file.
After applying some manipulations (mainly deleting columns/rows), I try to save the new DataFrame to Hadoop which brings up an error message:
ValueError: year out of range
I suspect that some columns of type DateType or TimestampType are corrupted. At least in one column I found an entry with a year '207' - this seems to create issues.
How can I check whether the DataFrame adheres to the required time ranges?
I thought about writing a function that takes the DataFrame and gets, for each DateType/TimestampType column, the minimum and maximum values, but I cannot get this to work.
Any ideas?
PS: In my understanding, Spark always checks and enforces the schema. Would this not include a check for minimum/maximum values?
Regular expressions can help with validating the dates.
For example, to validate a date with the format MM-dd-yyyy:
Step 1: Build a regular expression for your date format. For MM-dd-yyyy it would be ^(0[1-9]|1[012])[- \/.](0[1-9]|[12][0-9]|3[01])[- \/.](19|20)\d\d$ (month group first, then day, then year). This step helps find invalid dates that won't parse and would cause the error.
Step 2: Convert the string to a date. The following code can help:
import scala.util.{Try, Failure}
import org.apache.spark.sql.functions.udf

object FormatChecker extends java.io.Serializable {
  // joda-time formatter for the expected pattern
  val fmt = org.joda.time.format.DateTimeFormat forPattern "MM-dd-yyyy"
  def invalidFormat(s: String) = Try(fmt parseDateTime s) match {
    case Failure(_) => true
    case _ => false
  }
}

val df = sc.parallelize(Seq(
  "01-02-2015", "99-03-2010", "---", "2015-01-01", "03-30-2001")
).toDF("date")

// flag rows whose date string fails to parse
val invalidFormat = udf((s: String) => FormatChecker.invalidFormat(s))
df.where(invalidFormat($"date")).count()
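The per-column min/max idea from the question is also workable; a PySpark sketch, assuming df is the Spark DataFrame in question:
from pyspark.sql import functions as F
from pyspark.sql.types import DateType, TimestampType

# collect every date/timestamp column from the schema
date_cols = [f.name for f in df.schema.fields
             if isinstance(f.dataType, (DateType, TimestampType))]

# one aggregate row holding min_<col> and max_<col> for each of them
aggs = ([F.min(c).alias('min_' + c) for c in date_cols]
        + [F.max(c).alias('max_' + c) for c in date_cols])
df.agg(*aggs).show()

Out-of-range entries such as the year '207' then surface as the minimum of their column.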
