Google Cloud BigQuery load_table_from_dataframe() Parquet AttributeError - python

I am trying to use the BigQuery package to interact with pandas DataFrames. In my scenario, I query a base table in BigQuery, call .to_dataframe(), then pass the result to load_table_from_dataframe() to load it into a new table in BigQuery.
My original problem was that str(uuid.uuid4()) (for random IDs) was automatically being converted to bytes instead of string, so I am forcing a schema instead of letting it auto-detect the types.
Now, though, I am passing job_config as a dict that contains the schema, and I get this error:
File "/usr/local/lib/python2.7/dist-packages/google/cloud/bigquery/client.py", line 903, in load_table_from_dataframe
job_config.source_format = job.SourceFormat.PARQUET
AttributeError: 'dict' object has no attribute 'source_format'
I already had PyArrow installed, and also tried installing FastParquet, but it didn't help, and this didn't happen before I tried to force a schema.
Any ideas?
https://google-cloud-python.readthedocs.io/en/latest/bigquery/usage.html#using-bigquery-with-pandas
https://google-cloud-python.readthedocs.io/en/latest/_modules/google/cloud/bigquery/client.html#Client.load_table_from_dataframe
Looking into the actual package, it seems that it forces Parquet format, but like I said, I had no issue before; it only happens now that I'm trying to pass a table schema.
EDIT: This only happens when I try to write to BigQuery.

Figured it out. After weeding through Google's documentation, I realized I had forgotten to put:
load_config = bigquery.LoadJobConfig()
load_config.schema = SCHEMA
Oops. I never created the config object from the BigQuery package; I was passing a plain dict instead.
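For anyone hitting the same error, here is a minimal sketch of the whole pattern with an explicit schema. The project, dataset, table, and column names are hypothetical, and the calls follow the google-cloud-bigquery version shown in the traceback:

from google.cloud import bigquery
import uuid

client = bigquery.Client()

# Query the base table into a pandas DataFrame and add a random string ID column
df = client.query('SELECT name FROM `my_project.my_dataset.base_table`').to_dataframe()
df['id'] = [str(uuid.uuid4()) for _ in range(len(df))]

# Build an actual LoadJobConfig (not a plain dict) and force the schema
job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField('id', 'STRING'),
    bigquery.SchemaField('name', 'STRING'),
]

load_job = client.load_table_from_dataframe(
    df, client.dataset('my_dataset').table('new_table'), job_config=job_config)
load_job.result()  # waits for the load to complete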

Related

Same code structure - different results when calling .to_sql()

Background
I have three separate Python scripts that share the same structure and effectively do the same thing: call a request to an API, retrieve XML data, convert the XML to an ElementTree object, then to a pandas DataFrame object, then use .to_sql() to import that DataFrame into an Oracle database. This was successful for two out of the three scripts I have written, but the third is not writing to the DB: there are no errors returned, the table is created empty, and the script hangs.
Code from successful files:
oracle_db = sa.create_engine('oracle://sName:sName@123.456.78/testDB')
connection = oracle_db.connect()
df.to_sql('TABLE_NAME', connection, if_exists='append', index=False)
I would post the code for the unsuccessful file but it is quite literally the same besides the table and variable name.
What I have Tried
I have tried to use cx_Oracle's engine to drive the connection to the DB, with no success:
conn = cx_Oracle.connect("sName", "sName", "123.456.789.1/Test", encoding="UTF-8")
I have verified the dataframe is valid.
I have verified the connection to the DB.
SOLVED - there was a column that was strictly integers, so I had to specify the data type in the .to_sql() call.
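A minimal sketch of that fix, assuming a hypothetical all-integer column named ITEM_COUNT (the real column name wasn't given):

import sqlalchemy as sa

df.to_sql(
    'TABLE_NAME',
    connection,
    if_exists='append',
    index=False,
    dtype={'ITEM_COUNT': sa.types.Integer()},  # explicit type for the integer-only column
)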

Read files from hdfs - pyspark

I am new to PySpark. When I execute the code below, I get an AttributeError.
I am using Apache Spark 2.4.3.
t=spark.read.format("hdfs:\\test\a.txt")
t.take(1)
I expect it to return the first record, but instead it throws an error:
AttributeError: 'DataFrameReader' object has no attribute 'take'
You're not using the API properly:
format() is used to specify the format of the input data source you want to read.
Here, you're reading a text file, so all you have to do is:
t = spark.read.text("hdfs://test/a.txt")
t.collect()
See the related documentation.
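If you do want to go through format(), its argument is the data source name and the path goes to load(); an equivalent sketch with the same (hypothetical) path:

t = spark.read.format("text").load("hdfs://test/a.txt")
t.take(1)  # take() works on the resulting DataFrame and returns a list with the first Row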

Pyarrow.lib.Schema vs. pyarrow.parquet.Schema

When I try to load a Parquet file split across many partitions, some of the schemas are inferred incorrectly because of missing data, which fills those columns in with nulls. I would think specifying the schema in the pyarrow.parquet.ParquetDataset would fix this, but I don't know how to construct a schema of the correct pyarrow.parquet.Schema type. Some example code:
import pyarrow as pa
import pyarrow.parquet as pq
test_schema = pa.schema([pa.field('field1', pa.string()), pa.field('field2', pa.float64())])
paths = ['test_root/partition1/file1.parquet', 'test_root/partition2/file2.parquet']
dataset = pq.ParquetDataset(paths, schema=test_schema)
And the error:
AttributeError: 'pyarrow.lib.Schema' object has no attribute 'to_arrow_schema'
But I can't find any documentation on how to construct a pyarrow.parquet.Schema as described in the docs (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html); I have only managed to make a pyarrow.lib.Schema, which gives the above error.
There is not an API to construct a Parquet schema in Python yet. You can use one that you read from a particular file, though (see pq.ParquetFile(...).schema).
Could you open an issue on the ARROW JIRA project to request the feature to construct Parquet schemas in Python?
https://issues.apache.org/jira
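In the meantime, a sketch of that workaround: borrow the Parquet-typed schema from one file that you know has all columns populated (using the paths from the example above), and pass it to the dataset:

import pyarrow.parquet as pq

# ParquetFile(...).schema is a pyarrow.parquet.ParquetSchema, not a pyarrow.lib.Schema
parquet_schema = pq.ParquetFile('test_root/partition1/file1.parquet').schema

paths = ['test_root/partition1/file1.parquet', 'test_root/partition2/file2.parquet']
dataset = pq.ParquetDataset(paths, schema=parquet_schema)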
So thank you (whoever you are) if you were the one who opened the ticket and fixed this in the Arrow JIRA.
I was able to merge the schemas of the files in the dataset and then read the dataset:
import os
import pyarrow as pa
import pyarrow.parquet as pq

merged_schema = pa.schema([])
for filename in os.listdir(dataset_folder):
    schema_ = pq.read_table(os.path.join(dataset_folder, filename)).schema
    merged_schema = pa.unify_schemas([schema_, merged_schema])
Read dataset:
dset = pq.ParquetDataset(
    'my_dataset_folder',
    schema=merged_schema,
    use_legacy_dataset=False
).read()

bsddb and reprepro (berkeley) database

I'm trying to read the database files created by reprepro. I don't have that much experience with bdb, so I might be confused here, but it looks like the database is layered in some way.
If I simply do btopen('path/to/packages.db', 'r'), I get the database object with contents like:
In [4]: packages.items()
Out[4]:
[('local-lenny|main|amd64', '\x00\x00\x00\x04'),
('local-lenny|main|i386', '\x00\x00\x00\x02'),
('local-lenny|main|powerpc', '\x00\x00\x00\x14'),
('local-lenny|main|source', '\x00\x00\x00\x06'),
('local-lenny|main|sparc', '\x00\x00\x00\x12')]
However the db4.6_dump shows:
VERSION=3
format=bytevalue
database=local-lenny|main|sparc
type=btree
db_pagesize=4096
HEADER=END
<loads of data>
The file itself is identified as: /var/packages/db/packages.db: Berkeley DB (Btree, version 9, native byte-order) by file.
How do I get to those contents? If I understand it correctly, keys() gives me only the names of the actual databases. How do I get to the contents of those DBs now?
And the answer seems to be that the "nice" version of the bsddb interface doesn't support multiple btree tables inside one file. You can open such a table explicitly via bsddb.db, using:
env = db.DBEnv()
env.open(None, db.DB_CREATE | db.DB_INIT_MPOOL)  # environment with a shared memory pool
internal_db = db.DB(env)
# the second argument selects the named sub-database inside the file
internal_db.open("the filename", "the internal db name", db.DB_BTREE, db.DB_RDONLY)
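Putting that together, a sketch (assuming the packages.db path from the question) that lists the internal databases and opens each one read-only to count its records:

from bsddb import btopen, db

PACKAGES_DB = '/var/packages/db/packages.db'

# The keys of the outer btree are the names of the internal databases
outer = btopen(PACKAGES_DB, 'r')
names = outer.keys()
outer.close()

env = db.DBEnv()
env.open(None, db.DB_CREATE | db.DB_INIT_MPOOL)

for name in names:
    internal_db = db.DB(env)
    internal_db.open(PACKAGES_DB, name, db.DB_BTREE, db.DB_RDONLY)
    print('%s: %d entries' % (name, len(internal_db.items())))
    internal_db.close()

env.close()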

App Engine - Save response from an API in the data store as file (blob)

I'm banging my head against the wall with this one:
What I want to do is store a file that is returned from an API in the data store as a blob.
Here is the code that I use on my local machine (which of course works due to an existing file system):
client.convertHtml(html, open('html.pdf', 'wb'))
Since I cannot write to a file on App Engine, I tried several ways to store the response, without success.
Any hints on how to do this? I was trying to do it with StringIO and managed to store the response, but then wasn't able to store it as a blob in the data store.
Thanks,
Chris
Found the error. Here is what it looks like right now (simplified).
import logging
import StringIO

output = StringIO.StringIO()
try:
    client.convertURI("example.com", output)
    Report.pdf = db.Blob(output.getvalue())
    Report.put()
except pdfcrowd.Error, why:
    logging.error('PDF creation failed %s' % why)
I was trying to save the output without calling getvalue(); that was the problem. Perhaps this is of use to someone in the future :)

Categories

Resources