Pandas DataFrame to BigQuery date format Issue - python

I am trying to write a Pandas DataFrame to BigQuery on Cloud Functions using bq.load_table_from_dataframe. One of the columns, df['date'], is formatted as datetime64[ns]. The table schema for date is DATETIME. When I run the script, it raises this error:
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from double to timestamp using function cast_timestamp
I tried changing the BigQuery schema to DATE instead, but it raises this error:
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from double to date32 using function cast_date32
Is there a way to write the DataFrame to BigQuery while still keeping the DATE or DATETIME schema?
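The "cast from double" message usually means the column pyarrow receives is not actually datetime64[ns] at load time (for example, it became float/NaN somewhere along the way). A minimal sketch, assuming a bigquery.Client named bq and a placeholder table id, that re-coerces the column and passes an explicit schema:

import pandas as pd
from google.cloud import bigquery

bq = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder table id

# Toy frame standing in for the real one; the point is the dtype handling below.
df = pd.DataFrame({"date": ["2022-10-01 08:15:00", None]})

# Re-coerce right before the load; a float/NaN column would explain the
# "Unsupported cast from double" error from pyarrow.
df["date"] = pd.to_datetime(df["date"], errors="coerce")

job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("date", "DATETIME")],
    # for a DATE column, use SchemaField("date", "DATE") and df["date"].dt.date instead
)

bq.load_table_from_dataframe(df, table_id, job_config=job_config).result()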

Related

Convert timestamp to datetime for a Vaex dataframe

I have a parquet file that I have loaded as a Vaex dataframe. The parquet file has a column for a timestamp in the format 2022-10-12 17:10:00+00:00.
When I try to do any kind of analysis with my dataframe I get the following error.
KeyError: "Unknown variables or column: 'isna(timestamp)'"
When I remove that column everything works. I assume that the time column is not in the correct format. But I have been having trouble converting it.
I tried
df['timestamp'] = pd.to_datetime(df['timestamp'].astype(str)), but I get the error <class 'vaex.expression.Expression'> is not convertible to datetime. I assume I can't mix pandas and vaex like that.
I am also having trouble producing a minimal reproducible example, since when I create a dataframe from scratch the datetime column ends up as a string and everything works fine.
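A hedged workaround sketch: vaex can choke on timezone-aware Arrow timestamps, so one option (not necessarily the only one) is to normalize the column in pandas and hand the result to vaex via vaex.from_pandas. The file name is a placeholder; "timestamp" mirrors the column described above:

import pandas as pd
import vaex

pdf = pd.read_parquet("data.parquet")  # placeholder file name

# Parse to UTC, then drop the timezone so the column becomes a plain
# datetime64[ns], which vaex handles without complaint.
pdf["timestamp"] = pd.to_datetime(pdf["timestamp"], utc=True).dt.tz_localize(None)

df = vaex.from_pandas(pdf)
print(df["timestamp"].dtype)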

BigQuery load job from pandas dataframe timestamp column reading as unix nanoseconds, not microseconds

I have had a script running for a few months, but today I ran into an issue in a load job from a pandas df with a timestamp column.
df.published_at[0]
gives
Timestamp('2022-04-28 20:59:51-0700', tz='pytz.FixedOffset(-420)')
When I try to load to BigQuery through a load job, I get the following error:
[{'reason': 'invalidQuery', 'location': 'query', 'message': 'Cannot return an invalid timestamp value of 1651204791000000000 microseconds relative to the Unix epoch. The range of valid timestamp values is [0001-01-01 00:00:00, 9999-12-31 23:59:59.999999]; error in writing field published_at'}]
It seems that BigQuery is somehow reading that timestamp as Unix nanoseconds (1651204791000000000), not microseconds (which would be 1651204791000000) which is putting it out of the range of acceptable values. Why is it doing that?
I used a workaround to just use a string for that column before the load job, and the BQ schema accepts it as a timestamp. I'm just curious why this issue might have come up now and not previously?
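For reference, a minimal sketch of that string workaround (the column name is taken from the question): serialize the tz-aware timestamps as ISO 8601 strings yourself so nothing downstream reinterprets the nanosecond epoch values.

import pandas as pd

df = pd.DataFrame({"published_at": [pd.Timestamp("2022-04-28 20:59:51-0700")]})

# Render each tz-aware timestamp as an ISO 8601 string before the load job.
df["published_at"] = df["published_at"].map(
    lambda ts: ts.isoformat() if pd.notna(ts) else None
)
print(df["published_at"][0])  # 2022-04-28T20:59:51-07:00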
I come here 5 months later (29 September 2022) because I have the exact same problem.
I'm trying to load data into BigQuery from Python with client.load_table_from_json. One of my columns is a "processed_at" column which stores datetime objects (dtype: datetime64[ns, UTC]). I specify the right type in my table_schema:
table_schema = [
    bigquery.SchemaField("processed_at", "TIMESTAMP", mode="NULLABLE")
]
I get this error:
BadRequest: 400 Cannot return an invalid timestamp value of 1664454374000000000 microseconds relative to the Unix epoch.
The range of valid timestamp values is [0001-01-01 00:00:00, 9999-12-31 23:59:59.999999]; error in writing field processed_at
BigQuery really seems to think in microseconds instead of nanoseconds, and thus all my datetimes fall out of range.
I will try to cast them as strings, thanks for the workaround.
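A sketch of that casting step for the load_table_from_json path, with a placeholder table id: let pandas emit ISO 8601 strings via to_json(date_format="iso") and pass the resulting records to the load job, so the TIMESTAMP field receives strings rather than nanosecond integers.

import json

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder table id

df = pd.DataFrame({"processed_at": pd.to_datetime(["2022-09-29 12:26:14"], utc=True)})

table_schema = [
    bigquery.SchemaField("processed_at", "TIMESTAMP", mode="NULLABLE"),
]
job_config = bigquery.LoadJobConfig(schema=table_schema)

# to_json(date_format="iso") writes the datetime64[ns, UTC] column as ISO 8601
# strings instead of nanosecond epoch integers.
records = json.loads(df.to_json(orient="records", date_format="iso"))
client.load_table_from_json(records, table_id, job_config=job_config).result()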

Pandas to_sql using wrong data type, can it be changed?

I've loaded some SQL Server data into a pandas DataFrame, where some transformations take place. Once they're complete, I try to dump the DataFrame back into SQL using the SQLAlchemy to_sql function. The destination table is automatically created by SQLAlchemy.
However, I am getting this error message:
The conversion of a datetime2 data type to a datetime data type resulted in an out-of-range value.
This is because the DataFrame contains a number of date fields. Some of the values in these date fields are set to '0001-01-01', which obviously won't fit into a DATETIME data type but would fit into a DATETIME2 data type.
Is there a way to force DATETIME2 to be used instead of DATETIME?
You can specify DATETIME2 (from the SQLAlchemy MSSQL dialect) as a dtype, as shown below.
from sqlalchemy.dialects.mssql import DATETIME2

df.to_sql(
    'Temp',
    target_engine,
    schema='dbo',
    if_exists='replace',
    chunksize=250000,
    index=False,
    dtype={"CreatedDate": DATETIME2}  # force DATETIME2 instead of the default DATETIME
)

Converting the datetimeoffset data type of SQL Server for use in a Python dataframe

I have a column in a table with the datetimeoffset data type. While querying the data and storing it in a pandas dataframe in Python, I get this error:
Arguments: (ProgrammingError('ODBC SQL type -155 is not yet supported. column-index=1 type=-155', 'HY106'),)
How do I convert that to a value that can be used in a dataframe? The conversion must be accurate.
I am also exporting the dataframe to Excel, so date properties such as filtering and sorting must still work in Excel.
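pyodbc cannot decode SQL Server's datetimeoffset (ODBC type -155) out of the box, but it lets you register an output converter. The sketch below follows the widely used recipe from the pyodbc issue tracker; the connection string, query, and column name are placeholders. The timezone is dropped at the end because pandas' Excel writer rejects tz-aware datetimes, and plain datetimes keep Excel's filtering and sorting working.

import datetime
import struct

import pandas as pd
import pyodbc


def handle_datetimeoffset(dto_value):
    # SQL Server sends datetimeoffset as a packed struct:
    # year, month, day, hour, minute, second, fraction (ns), offset hours, offset minutes
    tup = struct.unpack("<6hI2h", dto_value)
    return datetime.datetime(
        tup[0], tup[1], tup[2], tup[3], tup[4], tup[5], tup[6] // 1000,
        datetime.timezone(datetime.timedelta(hours=tup[7], minutes=tup[8])),
    )


conn = pyodbc.connect("DSN=my_dsn")  # placeholder connection string
conn.add_output_converter(-155, handle_datetimeoffset)

df = pd.read_sql("SELECT * FROM my_table", conn)  # placeholder query

# Excel cannot store tz-aware datetimes: normalize to UTC and drop the offset first.
df["my_dto_column"] = pd.to_datetime(df["my_dto_column"], utc=True).dt.tz_localize(None)
df.to_excel("out.xlsx", index=False)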

Uploading a pandas dataframe to BigQuery using to_gbq rewrites integer numbers

I need to upload ~1000 dataframes to BigQuery, and I'm using pandas.io.gbq.to_gbq.
My code looks like this:
to_gbq(df, tableid, projectid, chunksize=10000, if_exists='append')
I am also writing all these dataframes to CSV and the data all looks good there. However, when uploading the dfs to BigQuery, for some of them pandas just decides one of my integer columns is a float type, so I have this line of code to force pandas to read it as an integer:
df = df.astype({"ISBN": int})
Then I looked at the data pushed into BigQuery: the schema mismatch error is gone, but the numbers are all different from what they are in the CSV export (which I suppose is the same as in the original dataframe).
For example, ISBN 9781607747307 in the CSV is now shown as 1967214315 in the BigQuery table.
To troubleshoot, I printed out the dtypes of the dataframe and noticed the row above forces the column to INT64 dtype, whereas the same column in a df that didn't go through the astype conversion has INT32 dtype.
Can I get pandas to see the ISBN column as an integer without changing the numbers when uploading to BigQuery?
Thank you in advance!
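One observation that may explain the "why": 1967214315 is exactly 9781607747307 modulo 2**32, so the value looks like a 32-bit integer wraparound, and astype(int) can resolve to a 32-bit integer on some platforms (notably Windows with older NumPy). A hedged sketch that requests a fixed-width 64-bit dtype instead:

import pandas as pd

df = pd.DataFrame({"ISBN": [9781607747307.0]})  # float column, as pandas read it

# Ask for an explicit 64-bit integer rather than the platform-dependent built-in int.
df = df.astype({"ISBN": "int64"})

# If the column can contain missing values, the nullable dtype works too:
# df = df.astype({"ISBN": "Int64"})

print(df["ISBN"][0])  # 9781607747307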
