I have around 4M (four million) rows that I am reading into a dataframe from BigQuery, but it no longer seems to be working. Since I cannot isolate what has changed, I want to know whether there is anything I can change in the code to make it more performant.
My code is the following:
def get_df_categories(table_name):
    query = """
    select cat, ref, engine from `{table_name}`
    """.format(table_name=table_name)
    df = client.query(query).to_dataframe()
    return df
It is better to read it via the list_rows method in batches. That way you can use multiple threads to read data in fixed-size pages, which makes output appear much faster and lets you handle heavy data loads in a systematic manner. You can also pass which fields you want in the output; this replicates the column list in the SELECT clause of your SQL query.
Here is the documentation that will help you get started: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html
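A minimal sketch of that approach, assuming the three columns are strings (adjust the field types to your schema) and that table_name is a fully qualified table ID:

from google.cloud import bigquery

client = bigquery.Client()

def get_df_categories(table_name):
    # Read the table directly instead of running a query; selected_fields
    # plays the role of the SELECT column list (the types below are assumed).
    fields = [
        bigquery.SchemaField("cat", "STRING"),
        bigquery.SchemaField("ref", "STRING"),
        bigquery.SchemaField("engine", "STRING"),
    ]
    rows = client.list_rows(table_name, selected_fields=fields, page_size=100_000)
    # to_dataframe() downloads the pages and concatenates them; installing
    # google-cloud-bigquery-storage makes the download run in parallel.
    return rows.to_dataframe()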
Related
I came across a problem: when I query for a large amount of data (35M rows, 22 GB), the same query gets executed multiple times (e.g. 400 times) in the background. I understand that the data is partitioned/shuffled in some way. This greatly increases query cost.
This is how I query for data:
from google.cloud import bigquery
bqclient = bigquery.Client(project)
query_job = bqclient.query(query).result()
df_result = query_job.to_dataframe()
Where project and query are Python strings.
I am using google-cloud-bigquery==2.30.1.
I am looking for any programmatic solutions to reduce query costs. E.g. is there different class/config/object/method/library that would handle such queries in better way?
I suspect it's because you're calling result() twice: once when you run query_job = bqclient.query(query).result() and again when you run df_result = query_job.to_dataframe() (by calling query_job again). I'm not sure why it runs so many times, but it probably has to do with how result() works (https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.query should have more info).
The "basic" answer to what you want is df_result = bqclient.query(query).to_dataframe(). However, if you're querying a large dataset, this will likely take time. See
Reading from BigQuery into a Pandas DataFrame and performance issues for a better way to do this.
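A minimal sketch of that single-call version, assuming project and query are defined as in the question:

from google.cloud import bigquery

bqclient = bigquery.Client(project)

# Run the job once and convert the result directly; with google-cloud-bigquery 2.x,
# to_dataframe() downloads via the BigQuery Storage API when the optional
# google-cloud-bigquery-storage package is installed, which is much faster
# than paging through the REST API.
df_result = bqclient.query(query).to_dataframe(create_bqstorage_client=True)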
Side note on reducing query costs: if you're working in a local Python environment, you probably don't want to be processing 22 GB of data there. If you're, say, building an ML model, you probably want to extract roughly 1/1000th of your data and work on that. A simple LIMIT in SQL won't reduce your query costs; instead, partition your table on a date column and filter on that, or create a new table with a subset of rows and query that. Note that a model built on the subset won't be accurate; it's just to make sure your Python code works on that data. You'd then deploy your code to a cloud environment and run it on the full data. Good end-to-end example here: https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/machine_learning/deepdive2/structured/solutions.
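One way to build such a subset table, sketched here with BigQuery's TABLESAMPLE clause; the table names are placeholders, and since TABLESAMPLE samples storage blocks, even the creation query only scans the sampled fraction:

from google.cloud import bigquery

bqclient = bigquery.Client(project)

# Placeholder table names; adjust the sampling percentage as needed.
subset_sql = """
CREATE OR REPLACE TABLE `my_project.my_dataset.big_table_sample` AS
SELECT *
FROM `my_project.my_dataset.big_table`
TABLESAMPLE SYSTEM (1 PERCENT)
"""
bqclient.query(subset_sql).result()

# Development queries now hit the small sample table instead of the 22 GB one.
df_sample = bqclient.query(
    "SELECT * FROM `my_project.my_dataset.big_table_sample`"
).to_dataframe()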
I have a table where I wrote 1.6 million records, and each has two columns: an ID, and a JSON string column.
I want to select all of those records and write the JSON in each row out as a file. However, the query result is too large, and I get the associated 403 error:
"403 Response too large to return. Consider specifying a destination table in your job configuration."
I've been looking at the documentation below and understand that the recommendation is to specify a destination table for the results and view them there. But all I want to do is SELECT * from the table, so that would effectively just be copying it over, and I suspect I would run into the same issue when querying the result table.
https://cloud.google.com/bigquery/docs/reference/standard-sql/introduction
https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationQuery.FIELDS.allow_large_results
What is the best practice here? Pagination? Table sampling? list_rows?
I'm using the python client library as stated in the question title. My current code is just this:
query = f'SELECT * FROM `{project}.{dataset}.{table}`'
return client.query(query)
I should also mention that the IDs are not sequential, they're just alphanumerics.
The best practice and most efficient approach is to export your data and then download it, instead of querying the whole table (SELECT *).
From there, you can extract the data you need from the exported files (e.g. CSV, JSON, etc.) with Python code, without having to wait for a SELECT * query to finish.
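A minimal sketch of that export using the client's extract_table() method; the project, dataset, table, and bucket names are placeholders, and newline-delimited JSON is chosen here because each row already holds a JSON string:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names; a wildcard URI lets BigQuery shard the export into
# multiple files, which is required for large tables.
table_ref = bigquery.TableReference.from_string("my_project.my_dataset.my_table")
destination_uri = "gs://my_bucket/my_table/export-*.json"

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)
extract_job = client.extract_table(table_ref, destination_uri, job_config=job_config)
extract_job.result()  # wait for the export job to finish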
I am working on a Python script to get data out of Salesforce. Everything seems to be working fine: I am passing a custom SOQL query to get the data, but the challenge is that the query returns only the first 500 rows, while there are close to 653,000 results in my object.
"data= sf.query('SELECT {} from abc'.format(','.join(column_names_list)))"
Now I tried using querymore() and queryall(), but those don't seem to work either.
In the end, my intent is to load all this information into a dataframe, push it to a table, and keep looking for new records by scheduling this code. Is there a way to achieve this?
To retrieve all the records with a single method call, just try
data = sf.query_all('SELECT {} from abc'.format(','.join(column_names_list)))
Refer to the following link.
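A sketch of loading the full result into a dataframe, assuming the simple_salesforce client (the credentials and field names below are placeholders):

import pandas as pd
from simple_salesforce import Salesforce

# Placeholder credentials and field names, for illustration only.
sf = Salesforce(username="user@example.com", password="***", security_token="***")
column_names_list = ["Id", "Name", "CreatedDate"]

# query_all() keeps following nextRecordsUrl internally, so 'records'
# contains every matching row rather than just the first batch.
data = sf.query_all('SELECT {} from abc'.format(','.join(column_names_list)))
df = pd.DataFrame(data["records"]).drop(columns="attributes")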
I have a database with a large table containing more than a hundred million rows. I want to export this data (after some transformation, like joining this table with a few others, cleaning some fields, etc.) and store it in a big text file for later processing with Hadoop.
So far, I tried two things:
Using Python, I browse the table in chunks (typically 10,000 records at a time) using this subquery trick, perform the transformation on each row, and write directly to a text file. The trick helps, but the LIMIT becomes slower and slower as the export progresses, and I have not been able to export the full table this way.
Using the mysql command-line tool, I tried to output the result of my query in CSV form to a text file directly. Because of the size, it ran out of memory and crashed.
I am currently investigating Sqoop as a tool to import the data directly to HDFS, but I was wondering how other people handle such large-scale exports?
Memory issues point towards using the wrong database query mechanism.
Normally, it is advisable to use mysql_store_result() at the C level, which corresponds to having a Cursor or DictCursor at the Python level. This ensures that the database is freed again as soon as possible, and the client can do whatever it wants with the data.
But it is not suitable for large amounts of data, as the data is cached in the client process, which can consume a lot of memory.
In this case, it may be better to use mysql_use_result() (C), i.e. SSCursor / SSDictCursor (Python). This restricts you to consuming the whole result set while doing nothing else with the database connection in the meantime, but it saves your client process a lot of memory. With the mysql CLI, you would achieve this with the -q argument.
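A sketch of that streaming approach with MySQLdb's SSCursor; the connection parameters, query, and output format are placeholders:

import MySQLdb
import MySQLdb.cursors

# Placeholder connection parameters.
conn = MySQLdb.connect(
    host="localhost",
    user="user",
    passwd="secret",
    db="mydb",
    cursorclass=MySQLdb.cursors.SSCursor,  # unbuffered, server-side result set
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM big_table")  # placeholder query/transformation

with open("export.txt", "w") as out:
    # Rows are streamed from the server instead of being buffered entirely in
    # client memory (mysql_use_result semantics).
    for row in cursor:
        out.write("\t".join(str(col) for col in row) + "\n")

cursor.close()
conn.close()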
I don't know exactly what query you used, because you have not given it here, but I suppose you're specifying LIMIT and OFFSET. Such queries are quite quick at the beginning of the data, but become very slow.
If you have a unique column such as an ID, you can still fetch only the first N rows, but modify the WHERE clause:
WHERE ID > (last_id)
This would use the index and would be acceptably fast.
However, it should generally be faster to simply do
SELECT * FROM table
and open a cursor for that query with a reasonably big fetch size.
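For completeness, a sketch of the keyset-pagination variant; the table and column names and the process() helper are placeholders:

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cursor = conn.cursor()

last_id = 0
batch_size = 10000
while True:
    # Seek past the last ID seen instead of using OFFSET, so each page is an
    # index range scan and stays fast no matter how far the export has gone.
    cursor.execute(
        "SELECT id, col1, col2 FROM big_table WHERE id > %s ORDER BY id LIMIT %s",
        (last_id, batch_size),
    )
    rows = cursor.fetchall()
    if not rows:
        break
    for row in rows:
        process(row)       # placeholder for the per-row transform + write
    last_id = rows[-1][0]

conn.close()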
Apologies for the longish description.
I want to run a transform on every doc in a largish MongoDB collection of about 10 million records (approx. 10 GB). Specifically, I want to apply a geoip transform to the ip field in every doc and either append the resulting record to that doc or just create a whole other record linked to this one by, say, id (the linking is not critical; I can just create a whole separate record). Then I want to count and group by, say, city (I do know how to do that last part).
The major reason I believe I can't use map-reduce is that I can't call out to the geoip library in my map function (or at least that's the constraint I believe exists).
So the central question is: how do I run through each record in the collection and apply the transform in the most efficient way?
Batching via limit/skip is out of the question, as it does a "table scan" and will get progressively slower.
Any suggestions?
Python or JS is preferred just because I have geoip libs for those, but code examples in other languages are welcome.
Since you have to go over "each record", you'll do one full table scan anyway, so a simple cursor (find()) that fetches only the few fields you need (_id, ip) should do it. The Python driver will do the batching under the hood, so you can give it a hint about the optimal batch size (batch_size) if the default is not good enough.
If you add a new field and the document no longer fits in its previously allocated space, Mongo will have to move it to another place, so you might be better off creating a new document.
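A sketch of that cursor approach with PyMongo, assuming geoip2 as the lookup library (any lookup function will do) and writing results into a separate collection; the database, collection, and file names are placeholders:

import pymongo
import geoip2.database  # assumed geoip library, for illustration

client = pymongo.MongoClient()
coll = client["mydb"]["events"]          # placeholder source collection
out = client["mydb"]["events_geo"]       # placeholder result collection
reader = geoip2.database.Reader("GeoLite2-City.mmdb")  # placeholder db file

buffer = []
# Project only the fields we need; batch_size hints how many docs the driver
# fetches per round trip.
for doc in coll.find({}, {"_id": 1, "ip": 1}, batch_size=5000):
    try:
        city = reader.city(doc["ip"]).city.name
    except Exception:
        city = None
    buffer.append({"src_id": doc["_id"], "ip": doc["ip"], "city": city})
    if len(buffer) >= 1000:
        out.insert_many(buffer)
        buffer = []
if buffer:
    out.insert_many(buffer)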
Actually, I am also attempting another approach in parallel (as plan B), which is to use mongoexport. I use it with --csv to dump a large CSV file with just the (id, ip) fields. The plan is then to use a Python script to do the geoip lookup and post back to Mongo as new docs, on which map-reduce can then be run for the count etc. Not sure whether this or the cursor is faster. We'll see.