Convert timestamp to datetime for a Vaex dataframe - python

I have a parquet file that I have loaded as a Vaex dataframe. The parquet file has a timestamp column in the format 2022-10-12 17:10:00+00:00.
When I try to do any kind of analysis with my dataframe, I get the following error:
KeyError: "Unknown variables or column: 'isna(timestamp)'"
When I remove that column, everything works, so I assume the timestamp column is not in the correct format, but I have been having trouble converting it.
I tried
df['timestamp'] = pd.to_datetime(df['timestamp'].astype(str))
but I get the error <class 'vaex.expression.Expression'> is not convertible to datetime, so I assume I can't mix pandas and Vaex like this.
I am also having trouble producing a minimal reproducible example, since when I create a dataframe from scratch the datetime column is a string and everything works fine.
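A workaround sketch that avoids mixing the two APIs lazily, assuming the file fits in memory (the filename here is a placeholder): parse the column with pandas first, drop the UTC offset, then hand the frame to Vaex.
import pandas as pd
import vaex

# Parse in pandas, strip the timezone, then wrap the frame in Vaex;
# 'data.parquet' stands in for the real file.
pdf = pd.read_parquet('data.parquet')
pdf['timestamp'] = pd.to_datetime(pdf['timestamp'], utc=True).dt.tz_convert(None)
df = vaex.from_pandas(pdf)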

Related

Pandas DataFrame to BigQuery date format Issue

I am trying to write a Pandas DataFrame to BigQuery on Cloud Functions using bq.load_table_from_dataframe. One of the columns, df['date'], is formatted as datetime64[ns]. The table schema for date is DATETIME. When I run the script, it gives this error:
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from double to timestamp using function cast_timestamp
I tried changing the BigQuery schema to DATE instead, but it raises this error:
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from double to date32 using function cast_date32
Is there a way to still write the DataFrame to BigQuery while still maintaining the DATE or DATETIME schema?
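A hedged sketch of one possible fix: the double-to-timestamp cast suggests the column reaching Arrow is numeric (for example all-NaN rows), so explicitly coercing it to datetime64[ns] and pinning the schema may help (dataset and table names are placeholders):
import pandas as pd
from google.cloud import bigquery

# df stands in for the frame from the question; the None becomes NaT,
# not a float, once the column is coerced to datetime64[ns].
df = pd.DataFrame({'date': ['2022-10-12 17:10:00', None]})
df['date'] = pd.to_datetime(df['date'], errors='coerce')

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField('date', 'DATETIME')],  # pin the BigQuery type
)
client.load_table_from_dataframe(df, 'my_dataset.my_table', job_config=job_config).result()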

How to treat date as plain text with pandas?

I use pandas to read a .csv file, then save it as an .xls file. Code as follows:
import pandas as pd
df = pd.read_csv('filename.csv', encoding='GB18030')
print(df)
df.to_excel('filename.xls')
There's a column containing dates like '2020/7/12'. It looks like pandas recognized it as a date and automatically output it as '2020-07-12'. I don't want to format this column, or any other columns like it; I'd like to keep all data as plain text.
This conversion happens at read_csv(), because print(df) already outputs YYYY-MM-DD, before to_excel().
I used df.info() to check the data type of that column; it is object. Then I added the argument dtype=pd.StringDtype() to read_csv(), and it doesn't help.
The file contains Chinese characters, so I set encoding to GB18030; I don't know if this matters.
My experience concerning pd.read_csv indicates that:
- Only columns convertible to int or float are converted to the respective types by default.
- "Date-like" strings are still read as strings (the column type in the resulting DataFrame is actually object).
If you want read_csv to convert such a column to datetime type, you should pass the parse_dates parameter, specifying a list of columns to be parsed as dates; a short sketch follows below. Since you didn't do that, no source column should be converted to datetime type.
To check this detail, after you read the file, run df.info() and check the type of the column in question.
So if the respective Excel column is of Date type, then this conversion is probably caused by to_excel.
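For instance, with the example Input.csv shown further below, the two behaviours can be compared directly (a minimal sketch):
import pandas as pd

df_text = pd.read_csv('Input.csv')  # DateBougth stays object (plain text)
df_dates = pd.read_csv('Input.csv', parse_dates=['DateBougth'])  # becomes datetime64[ns]
print(df_text.dtypes)
print(df_dates.dtypes)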
And one more remark, concerning variable names: what you have read using read_csv is a DataFrame, not a file. The actual file is the source object from which you read the content; here you passed only the file name. So don't use names like file for the resulting DataFrame, as this is misleading. It is much better to use e.g. df.
Edit following a comment as of 05:58Z
To fully check what you wrote in your comment, I created the following CSV file:
DateBougth,Id,Value
2020/7/12,1031,500.15
2020/8/18,1032,700.40
2020/10/16,1033,452.17
I ran: df = pd.read_csv('Input.csv') and then print(df), getting:
DateBougth Id Value
0 2020/7/12 1031 500.15
1 2020/8/18 1032 700.40
2 2020/10/16 1033 452.17
So, at the Pandas level, no format conversion occurred in the DateBougth column. Both remaining columns contain numeric content, so they were silently converted to int64 and float64, but DateBougth remained object.
Then I saved this df to an Excel file, running df.to_excel('Output.xls'), and opened it with Excel. The content matched the DataFrame printed above, so no data type conversion took place at the Excel level either.
To see the actual data type of cell B2 (the first DateBougth value), I clicked on the cell and pressed Ctrl-1 to display the cell formatting. The format is General (not Date), just as I expected.
Maybe you have an outdated version of the software? I use Python 3.8.2 and Pandas 1.0.3.
Another detail to check: look at your code after pd.read_csv. Maybe somewhere you put an instruction like df.DateBougth = pd.to_datetime(df.DateBougth) (an explicit type conversion), or at least a format conversion? Note that in my environment there was absolutely no change in the format of the DateBougth column.
Problem solved. I double-checked my .csv file and opened it with Notepad; the data is 2020-07-12, which Office displays as 2020/7/12. It turns out that Office reformatted the date to yyyy/m/d (based on your region). I'm developing a tool to process and import data to a DB for my company; we used to do this work manually by copy and paste, so no one noticed the issue. Thanks to @Valdi_Bo for his investigation and patience.
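For anyone double-checking the same thing, a small sketch that prints the raw CSV text, bypassing any spreadsheet display formatting (filename and encoding follow the question):
# Show the first three raw lines exactly as stored on disk.
with open('filename.csv', encoding='GB18030') as f:
    for _ in range(3):
        print(f.readline().rstrip())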

Pandas timestamps in ISO format cause Exasol error when importing

When using pyexasol's import_from_pandas(df) for a DataFrame, df, which has a datetime column, Exasol (6.2) throws an error because it can't parse the ISO-formatted string representation of the dataframe column. Specifically, the "+00:00" final characters are unparsable by Exasol. My current workaround is to turn all pandas datetime columns into string columns, but that can cost a lot of time.
What's the right way to import datetime columns from Pandas dataframes into an existing Exasol table with a TIMESTAMP column type?
PyEXASOL creator here.
You may use the import_params dictionary argument to pass additional parameters to the pandas.to_csv() method, which is used internally. One such parameter is date_format; just pass the right format compatible with Exasol.
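A minimal sketch of that approach, using the import_params argument described above (connection details are placeholders, and the strftime string is an assumption chosen to match Exasol's default YYYY-MM-DD HH24:MI:SS timestamp format):
import pandas as pd
import pyexasol

df = pd.DataFrame({'ts': pd.to_datetime(['2022-10-12 17:10:00'])})
conn = pyexasol.connect(dsn='exasol-host:8563', user='sys', password='secret')
# date_format is forwarded to the internal pandas.to_csv() call
conn.import_from_pandas(df, 'MY_SCHEMA.MY_TABLE',
                        import_params={'date_format': '%Y-%m-%d %H:%M:%S'})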
I'll consider adding this parameter by default.
Hope it helps!

Passing PySpark pandas_udf data limit?

The problem is simple. Please observe the code below.
@pyf.pandas_udf(pyt.StructType(RESULTS_SCHEMA_LIST), pyf.PandasUDFType.GROUPED_MAP)
def train_udf(df):
    return train_ml_model(df=df)

results_df = complete_df.groupby('training-zone').apply(train_udf)
One of the columns of results_df is typically a very large string (>4e6 characters). This isn't a problem for a pandas.DataFrame, or for a spark.DataFrame when I convert the pandas dataframe to a spark dataframe myself, but it is a problem when the pandas_udf() attempts the conversion. The error returned is pyarrow.lib.ArrowInvalid: could not convert **string** with type pyarrow.lib.StringValue: did not recognize the Python value type when inferring an Arrow data type
This UDF does work if I don't return the problematic column or I make the problematic column only contain some small string such as 'wow that is cool', so I know the problem is not with the udf itself per se.
I know the function train_ml_model() works because, when I take a random group from the spark dataframe, convert it to a pandas dataframe, and pass it to train_ml_model(), it produces the expected pandas dataframe with the massive-string column.
I know spark can handle such large strings because when I convert the pandas dataframe to a spark dataframe using spark.createDataFrame() the spark dataframe contains the full expected value.
PS: Why is pyarrow even trying to infer the datatype when I pass the types to the pandas_udf()?
Any help would be very much appreciated!
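One way to narrow this down (a sketch, not a confirmed fix): try serializing a comparably large string with pyarrow directly, outside Spark. If this reproduces the ArrowInvalid error, the failure is in the pandas-to-Arrow step rather than in Spark itself.
import pyarrow as pa

# Build a >4e6-character string and convert it with an explicit type,
# so no type inference is involved.
big = 'x' * 5_000_000
arr = pa.array([big], type=pa.string())
print(arr.nbytes)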

pandas to_datetime not working

I can't seem to apply to_datetime to a pandas dataframe column, although I've done it dozens of times in the past. The following code tells me that any random value in the "Date Time" column is still a string after I try to convert it to a timestamp. errors='coerce' should convert any parsing errors to NaT, but instead I still have '2015-10-10 12:31:04' as a string.
import pandas as pd
df = pd.read_csv(...)  # path elided in the original
df["Date Time"] = pd.to_datetime(df["Date Time"], errors="coerce")
print(str(type(df["Date Time"][9])) + " 1")
Why would pandas not raise an error, or not convert parsing errors to 'NaT'?
Here are a few rows of the csv. The real file has a million rows coming from different sources, so it is possible that the date formatting is not uniform; even so, I would expect to_datetime to return NaT or raise an error, depending on the errors argument.
Accuracy,Activity,Altitude,Bearing,Date Time,Date(GMT),Description,Distance,Latitude,Longitude,Name,Speed,_FileNames,datenum
,,null,,,,,,sj,,,,C:/Users/Alexis/Dropbox/Location/Path Tracking Lite/aacy.csv,17054710926
0.0,,0.0,0.0,,,,0.00292115,50.67713796,4.61960233,,4.5,C:/Users/Alexis/Dropbox/Location/Path Tracking Lite/aars.csv,17054710926
0.0,,0.0,0.0,2015-01-31 15:10:,,,0.00404488,39.91572515,116.43714731,,5.4,C:/Users/Alexis/Dropbox/Location/Path Tracking Lite/abch.csv,17054710926
0.0,Walk/Run,0.0,0.0,2015-01-11 10:36:22,,,0,39.94002308,116.43548671,tfdeddd,0.0,C:/Users/Alexis/Dropbox/Location/Path Tracking Lite/abbj.csv,20150111
0.0,Walk/Run,0.0,0.0,2015-01-11 10:36:24,,,0.00968132,39.93998097,116.43558673,,2.7,C:/Users/Alexis/Dropbox/Location/Path Tracking Lite/abbj.csv,20150111
0.0,Walk/Run,0.0,0.0,2015-01-11 10:36:26,,,0.00768588,39.94003147,116.43552386,,4.5,C:/Users/Alexis/Dropbox/Location/Path Tracking Lite/abbj.csv,20150111
0.0,Walk/Run,0.0,0.0,2015-01-11 10:36:28,,,0.00239565,39.94007265,116.43551403,,3.6,C:/Users/Alexis/Dropbox/Location/Path Tracking Lite/abbj.csv,20150111
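As a sanity check, here is a self-contained sketch of what errors='coerce' normally does with strings like the ones above (sample values are illustrative):
import pandas as pd

s = pd.Series(['2015-01-11 10:36:22', 'not a date', ''])
parsed = pd.to_datetime(s, errors='coerce')
print(parsed.dtype)  # datetime64[ns]: bad entries become NaT, not strings
print(parsed)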
