uploading a pandas dataframe to BigQuery using to_gbq rewrites integer numbers - python

I need to upload ~1000 dataframes to BigQuery. I'm using pandas.io.gbq.to_gbq, and I have code like this:
to_gbq(df, tableid, projectid, chunksize=10000, if_exists='append')
I am also writing all these dataframes to CSV, and the data all looks good there. However, when uploading the dfs to BigQuery, for some of them pandas decides that one of my integer columns is float type, so I have this line of code to force pandas to read it as integer:
df = df.astype({"ISBN": int})
Then I looked at the data pushed into BigQuery: the schema mismatch error is gone, but the numbers are all different from what they are in the CSV export (which I suppose is the same as in the original dataframe)...
For example, ISBN 9781607747307 in the CSV is now shown as 1967214315 in the BigQuery table.
To troubleshoot, I printed out the dtypes of the dataframe and noticed that the line above forces the column to INT64 dtype, whereas a column that didn't go through the astype conversion has INT32 dtype.
Can I get pandas to see the ISBN column as integer, but not change the numbers when uploading to BigQuery?
Thank you in advance!
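For what it's worth, the corrupted value is consistent with 32-bit wraparound: 9781607747307 mod 2**32 is exactly 1967214315. Below is a minimal sketch of forcing an explicit 64-bit dtype before uploading, assuming the truncation happens wherever the column passes through a 32-bit integer (e.g. plain int on Windows); the to_gbq call is left as in the question:

import numpy as np
import pandas as pd

# The value observed in BigQuery is exactly the ISBN truncated to 32 bits:
assert 9781607747307 % 2**32 == 1967214315

df = pd.DataFrame({"ISBN": ["9781607747307"]})

# Plain `int` is platform-dependent (32-bit on Windows), so request a
# 64-bit integer explicitly; 13-digit ISBNs need more than 32 bits:
df = df.astype({"ISBN": np.int64})
print(df["ISBN"].iloc[0])  # 9781607747307, no wraparound

# df.to_gbq(tableid, projectid, chunksize=10000, if_exists='append')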

Related

How to treat date as plain text with pandas?

I use pandas to read a .csv file, then save it as an .xls file. Code as follows:
import pandas as pd
df = pd.read_csv('filename.csv', encoding='GB18030')
print(df)
df.to_excel('filename.xls')
There's a column containing dates like '2020/7/12'. It looks like pandas recognized it as a date and automatically output it as '2020-07-12'. I don't want to format this column, or any other columns like it; I'd like to keep all the data as it is, as plain text.
The conversion happens at read_csv(), because print(df) already outputs YYYY-MM-DD, before to_excel().
I tried using df.info() to check the data type of that column; the data type is object. Then I added the argument dtype=pd.StringDtype() to read_csv(), and it doesn't help.
The file contains Chinese characters, so I set the encoding to GB18030; I don't know if this matters.
My experience concerning pd.read_csv indicates that:
Only columns convertible to int or float are converted to the respective types by default. "Date-like" strings are still read as strings (the column type in the resulting DataFrame is actually object).
If you want read_csv to convert such a column to datetime type, you should pass the parse_dates parameter, specifying a list of columns to be parsed as dates (see the sketch below). Since you didn't do that, no source column should be converted to datetime type.
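A minimal sketch of that parameter, using the Input.csv / DateBougth names from the edit further down:

import pandas as pd

# Without parse_dates, date-like strings stay as object dtype;
# with it, the listed columns are converted to datetime64:
df = pd.read_csv('Input.csv', parse_dates=['DateBougth'])
print(df.dtypes)  # DateBougth: datetime64[ns]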
To check this detail, after you read the file, run df.info() and check the type of the column in question.
So if the respective Excel file column is of Date type, then this conversion is probably caused by to_excel.
And one more remark concerning variable names: what you have read using read_csv is a DataFrame, not a file. The actual file is the source object from which you read the content; here you passed only the file name. So don't use names like file for the resulting DataFrame, as this is misleading; it is much better to use e.g. df.
Edit following a comment as of 05:58Z
To check in full extent what you wrote in your comment, I created
the following CSV file:
DateBougth,Id,Value
2020/7/12,1031,500.15
2020/8/18,1032,700.40
2020/10/16,1033,452.17
I ran df = pd.read_csv('Input.csv') and then print(df), getting:
   DateBougth    Id   Value
0   2020/7/12  1031  500.15
1   2020/8/18  1032  700.40
2  2020/10/16  1033  452.17
So, at the Pandas level, no format conversion occurred in the DateBougth column. Both remaining columns contain numeric content, so they were silently converted to int64 and float64, but DateBougth remained as object.
Then I saved this df to an Excel file, running df.to_excel('Output.xls'), and opened it with Excel. The content was the same as in the CSV, so no data type conversion took place at the Excel level either.
To see the actual data type of cell B2 (the first DateBougth value), I clicked this cell and pressed Ctrl+1 to display the cell formatting. The format is General (not Date), just as I expected.
Maybe you have an outdated version of the software? I use Python 3.8.2 and Pandas 1.0.3.
Another detail to check: look at your code after pd.read_csv. Maybe somewhere you put an instruction like df.DateBougth = pd.to_datetime(df.DateBougth) (an explicit type conversion), or at least a format conversion? Note that in my environment there was absolutely no change in the format of the DateBougth column.
Problem solved. I double-checked my .csv file by opening it with Notepad: the data is 2020-07-12, which Office displays as 2020/7/12. It turns out that Office reformatted the date to yyyy/m/d (based on your region). I'm developing a tool to process and import data to a DB for my company; we used to do this work manually by copy and paste, so no one noticed this issue. Thanks to @Valdi_Bo for his investigation and patience.

Converting the datetimeoffset data type of SQL Server for use in a Python dataframe

I have a column in a table whose data type is datetimeoffset. While querying the data and storing it in a pandas dataframe in Python, I am getting the error:
Arguments: (ProgrammingError('ODBC SQL type -155 is not yet supported. column-index=1 type=-155', 'HY106'),)
How do I convert it to a value that can be used in a dataframe? The conversion must be accurate.
I am also exporting the dataframe to Excel, so date properties such as filtering and sorting must also hold in Excel.
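The error means pyodbc does not natively handle ODBC SQL type -155 (SQL Server's DATETIMEOFFSET). One known workaround, adapted from the pyodbc wiki, is to register an output converter that decodes the binary value yourself; the connection string, query, and column name below are placeholders:

import datetime
import struct

import pandas as pd
import pyodbc

def handle_datetimeoffset(dto_value):
    # DATETIMEOFFSET arrives as 20 bytes:
    # year, month, day, hour, minute, second (shorts),
    # nanoseconds (unsigned int), tz-offset hours and minutes (shorts).
    tup = struct.unpack("<6hI2h", dto_value)
    return datetime.datetime(
        tup[0], tup[1], tup[2], tup[3], tup[4], tup[5], tup[6] // 1000,
        datetime.timezone(datetime.timedelta(hours=tup[7], minutes=tup[8])),
    )

cnxn = pyodbc.connect("DSN=mydsn")  # placeholder connection string
cnxn.add_output_converter(-155, handle_datetimeoffset)  # -155 = DATETIMEOFFSET
df = pd.read_sql("SELECT mycol FROM mytable", cnxn)

# Excel cannot store timezone-aware datetimes (pandas raises on to_excel),
# so normalize to UTC and drop the offset before exporting; filtering and
# sorting then work on a plain datetime column:
df["mycol"] = pd.to_datetime(df["mycol"], utc=True).dt.tz_localize(None)
df.to_excel("out.xlsx")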

Passing PySpark pandas_udf data limit?

The problem is simple. Please observe the code below.
@pyf.pandas_udf(pyt.StructType(RESULTS_SCHEMA_LIST), pyf.PandasUDFType.GROUPED_MAP)
def train_udf(df):
    return train_ml_model(df=df)

results_df = complete_df.groupby('training-zone').apply(train_udf)
One of the columns of results_df is typically a very large string (>4e6 characters). This isn't a problem for a pandas.DataFrame, nor for a spark.DataFrame when I convert the pandas dataframe to a spark dataframe directly, but it is a problem when the pandas_udf() attempts the conversion. The error returned is:
pyarrow.lib.ArrowInvalid: could not convert **string** with type pyarrow.lib.StringValue: did not recognize the Python value type when inferring an Arrow data type
This UDF does work if I don't return the problematic column or I make the problematic column only contain some small string such as 'wow that is cool', so I know the problem is not with the udf itself per se.
I know the function train_ml_model() works because when I get a random group from the spark dataframe then convert it to a pandas dataframe and pass it to train_ml_model() it produces the expected pandas dataframe with the column with a massive string.
I know spark can handle such large strings because when I convert the pandas dataframe to a spark dataframe using spark.createDataFrame() the spark dataframe contains the full expected value.
PS: Why is pyarrow even trying to infer the datatype when I pass the types to the pandas_udf()?
Any help would be very much appreciated!
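For reference, here is a minimal, self-contained sketch of the grouped-map pattern from the question (Spark 2.x API). The schema, the dummy data, and the oversized string are stand-ins for RESULTS_SCHEMA_LIST, complete_df, and train_ml_model's output; whether it reproduces the ArrowInvalid error will depend on the pyarrow version in use:

import pandas as pd
import pyspark.sql.functions as pyf
import pyspark.sql.types as pyt
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for RESULTS_SCHEMA_LIST from the question.
schema = pyt.StructType([
    pyt.StructField("training-zone", pyt.StringType()),
    pyt.StructField("model", pyt.StringType()),
])

@pyf.pandas_udf(schema, pyf.PandasUDFType.GROUPED_MAP)
def train_udf(df):
    # Stand-in for train_ml_model(): one result row per group,
    # with a very large string (>4e6 characters) in one column.
    return pd.DataFrame({
        "training-zone": [df["training-zone"].iloc[0]],
        "model": ["x" * 5_000_000],
    })

complete_df = spark.createDataFrame(
    [("a", 1.0), ("b", 2.0)], ["training-zone", "value"])
results_df = complete_df.groupby("training-zone").apply(train_udf)
results_df.select(pyf.length("model")).show()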

export table to csv keeping format python

I have a dataframe grouped by 3 variables. It looks like:
https://i.stack.imgur.com/q8W0y.png
When I export the table to csv, the format changes. I want to keep the original format.
Any ideas?
Thanks!
Pandas to_csv (and csv in general) does not support the MultiIndex used in your data; csv is a flat format. As such, it just stores the indices "long" (each level of the MultiIndex becomes a column, and each row carries its index values). I suspect that's what you are calling "format changes".
The upshot is that if you expect to save a pandas dataframe to csv and then re-establish the dataframe from the csv, you need to re-index the dataframe to the MultiIndex yourself after importing it, as in the sketch below.
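A minimal sketch of that round trip, with hypothetical level names standing in for the three grouping variables:

import pandas as pd

# A small stand-in for a frame grouped by 3 variables.
idx = pd.MultiIndex.from_tuples(
    [("a", "x", 1), ("a", "x", 2), ("b", "y", 1), ("b", "y", 2)],
    names=["g1", "g2", "g3"])
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=idx)

df.to_csv("grouped.csv")  # index levels are written out as plain columns

# Re-establish the MultiIndex when reading the csv back:
restored = pd.read_csv("grouped.csv", index_col=["g1", "g2", "g3"])
assert restored.index.equals(df.index)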

Convert numbers to strings when reading an excel spreadsheet into a pandas DataFrame

I'm reading some excel spreadsheets (xlsx format) into pandas using read_excel, which generally works great. The problem I have is that when a column contains numbers, pandas converts these to float64 type, and I would like them to be treated as strings. After reading them in, I can convert the column to str:
my_frame.my_col = my_frame.my_col.astype('str')
This works as far as assigning the right type to the column, but when I view the values in this column, the strings are formatted in scientific notation, e.g. 8.027770e+14, which is not what I want. I'd like to work out how to tell pandas to read columns as strings, or to do the conversion later, so that I get values in their original (non-scientific) format.
pandas.read_csv() has a dtype argument:
dtype : Type name or dict of column -> type
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32}
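read_excel accepts the same dtype argument (in pandas 0.20+), so the column can be kept as strings from the start. A minimal sketch with hypothetical file and column names:

import pandas as pd

# With dtype, the column is kept as strings instead of float64, so no
# scientific notation appears. Note: digits stored in Excel as numeric
# cells may still arrive with a trailing '.0', since Excel holds them
# as floats internally.
df = pd.read_excel("book.xlsx", dtype={"my_col": str})
print(df["my_col"].head())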
I solved it with round: if you do round(number, 5), in most cases you will not lose data, and you will get zero decimals in the case of 8.027770e+14.
