Separating multiple query results - python

I am running multiple (about 60) queries in Impala from a file, using the Impala shell, and writing the output to a file. I am using:
impala-shell -q "query_here; query_here; etc;" -o output_path.csv -B --output_delimiter=','
The issue is that the results are not separated by query, so query 2's rows are appended directly below query 1's. I need to separate the results to match them up with each query, but I cannot tell where one query's results end and the next begins because the output is one continuous CSV file.
Is there a way to run multiple queries like this and leave some kind of space or delimiter between query results, or any other way to separate the results by the query they came from?
Thanks.

You could insert your own separators by issuing some extra queries, for example select '-----'; between the real queries.
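For instance, with such a separator query between each real query, a short Python script can split the combined output file back into per-query blocks afterwards. This is only a sketch, assuming the same output path and delimiter as in the question and '-----' as the separator text:
import csv

SEPARATOR = '-----'              # must match the literal used in the extra select '-----'; queries
blocks = [[]]                    # one list of rows per query

with open('output_path.csv', newline='') as f:
    for row in csv.reader(f):
        if row == [SEPARATOR]:
            blocks.append([])    # a separator row marks the start of the next query's results
        else:
            blocks[-1].append(row)

for i, block in enumerate(blocks, start=1):
    print('query', i, 'returned', len(block), 'rows')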
Writing results of individual queries to local files is not yet possible, but there is already a feature request for it (IMPALA-2073). You can, however, easily save query results into HDFS as CSV files. You just have to create a new table to store the results specifying row format delimited fields terminated by ',', then use insert into table [...] select [...] to populate it. Please refer to the documentation sections Using Text Data Files with Impala Tables and INSERT Statement for details.
One comment suggested running the individual queries as separate commands and saving their results into separate CSV files. If you choose this solution, please be aware that DDL statements like create table are only guaranteed to take immediate effect in the connection in which they were issued. This means that creating a table and then immediately querying it in another impala shell is prone to failure. Even if you find that it works correctly, it may fail the next time you run it. On the other hand, running such queries one after the other in the same shell is always okay.
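If you do decide to run the queries as separate commands, a small driver script along these lines keeps each result in its own file. This is a rough sketch: the queries.sql path and the result_NN.csv naming are just placeholders, and the DDL caveat above applies if any of the queries create tables.
import subprocess

# Hypothetical input file: the same queries, separated by semicolons
with open('queries.sql') as f:
    queries = [q.strip() for q in f.read().split(';') if q.strip()]

for i, query in enumerate(queries, start=1):
    out_file = 'result_%02d.csv' % i
    # Same flags as in the question, but one query and one output file per invocation
    subprocess.run(
        ['impala-shell', '-q', query, '-o', out_file, '-B', '--output_delimiter=,'],
        check=True,
    )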

Related

How can I safely parameterize table/column names in BigQuery SQL?

I am using Python's BigQuery client to create and keep up-to-date some tables in BigQuery that contain daily counts of certain Firebase events joined with data from other sources (sometimes grouped by country etc.). Keeping them up-to-date requires the deletion and replacement of data for past days, because the day tables for Firebase events can be changed after they are created (see here and here). I keep them up-to-date in this way to avoid querying the entire dataset, which is very financially/computationally expensive.
This deletion and replacement process needs to be repeated for many tables, and consequently I need to reuse some queries stored in text files. For example, one deletes everything in the table from a particular date onward (delete from x where event_date >= y). But because BigQuery disallows the parameterization of table names (see here), I have to duplicate these query text files for each table I need to do this for. If I want to run tests I would also have to duplicate the aforementioned queries for test tables too.
I basically need something like psycopg2.sql for BigQuery so that I can safely parameterize table and column names while avoiding SQLi. I actually tried to repurpose that module by calling the as_string() method and using the result to query BigQuery, but the resulting syntax doesn't match, and I would need to open a Postgres connection to do it (as_string() expects a cursor/connection object). I also tried something similar with sqlalchemy.text to no avail. So I concluded I'd have to either implement some way of parameterizing the table name myself or find a workaround using the Python client library. Any ideas on how I should go about doing this in a safe way that won't lead to SQLi? I cannot go into detail, but unfortunately I cannot store the tables in Postgres or any other DB.
As discussed in the comments, the best option for avoiding SQLi in your case is ensuring your server's security.
If you still need or want to validate your input parameters before building your query, I recommend using a regular expression to check the input strings.
In Python you could use the re library.
As I don't know how your code works, how your datasets/tables are organized, or exactly how you plan to check whether a string is a valid source, I created the basic example below that shows how you could check a string using this library:
import re

tests = ["your-dataset.your-table", "(SELECT * FROM <another table>)", "dataset-09123.my-table-21112019"]

# Supposing that the input pattern is <dataset>.<table>
regex = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+")

for t in tests:
    if regex.fullmatch(t):
        print("This source is ok")
    else:
        print("This source is not ok")
In this example, only strings that match the pattern dataset.table (where both the dataset and the table may contain only alphanumeric characters and dashes) are considered valid.
When you run the code, the first and third elements of the list are considered valid, while the second (which could potentially change your whole query) is rejected.
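Building on that check, here is a hedged sketch of how the validated name could then be used with the google-cloud-bigquery client: the table identifier is interpolated only after it passes the regex, while the actual values still go through real query parameters. The function, table, and parameter names are made up for illustration.
import re
from google.cloud import bigquery

TABLE_PATTERN = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+")

def delete_from_date(client, table_name, start_date):
    # Reject anything that is not a plain <dataset>.<table> identifier
    if not TABLE_PATTERN.fullmatch(table_name):
        raise ValueError("invalid table name: %r" % table_name)

    # The identifier is interpolated only after validation; the value is a real parameter
    sql = "DELETE FROM `{}` WHERE event_date >= @start_date".format(table_name)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("start_date", "DATE", start_date)]
    )
    return client.query(sql, job_config=job_config).result()

# client = bigquery.Client()
# delete_from_date(client, "dataset-09123.my-table-21112019", "2019-11-01")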

How To Store Query Results (Using Python)

Background:
I have an application written in Python to monitor the status of tools. The tools send their data from specific runs and it all gets stored in an Oracle database as JSON files.
My Problem/Solution:
Instead of connecting to the DB and querying it repeatedly whenever I want to compare the current run's data to the previous run's data, I want to keep a local copy of the query results so that I can compare the new run's data against that copy instead of against a fresh query.
The reason I want to do this is that constantly querying the server for the previous run's data is slow and puts unwanted load on the server.
For the previous run's data there are multiple files associated with it (because there are multiple tools), and therefore each query has more than one file that would need to be copied. Locally storing copies of the files from the query is what I intended to do, but I was wondering what the best way to go about this is, since I am relatively new to doing something like this.
So any help and suggestions on how to efficiently store the results of a query, which are multiple JSON files, would be greatly appreciated!
As you described, querying the DB too many times is not an option. In that case I would do it the following way:
When your program starts, you get the data for all tools as a set of JSON files per tool, right? I am not sure whether you get the data by querying the tools directly or by querying the DB, but it does not matter.
For each tool, check whether you have old data in the "cache dictionary". If yes, do your comparison and store the new data as the previous data in the cache, ready for the next run. Do this for all tools, and repeat on every run.
This "cache dictionary" can be implemented in memory or on disk. For your amount of data I think memory is just fine.
With that approach you do not have to query the DB for the old data. The case where you cannot do the comparison because there is no old data in the cache at program start can be handled by fetching it from the DB once (risking a long query time, but there is no way around that).
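A minimal sketch of such a cache, assuming each tool's run data is already a Python dict parsed from its JSON (the file name and the compare callback are placeholders):
import json
import os

CACHE_FILE = 'previous_runs.json'      # optional on-disk copy, so the cache survives restarts

def load_cache():
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}

def save_cache(cache):
    with open(CACHE_FILE, 'w') as f:
        json.dump(cache, f)

cache = load_cache()                   # {tool_name: previous_run_data}

def process_run(tool_name, new_data, compare):
    previous = cache.get(tool_name)    # None on the very first run for this tool
    if previous is not None:
        compare(previous, new_data)
    cache[tool_name] = new_data        # the new data becomes the previous data for the next run
    save_cache(cache)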

Write from a query to table in BigQuery only if query is not empty

In BigQuery it's possible to write the results of a query to a new table. I'd like the table to be created only when the query returns at least one row; basically, I don't want to end up creating an empty table. I can't find an option to do that. (I am using the Python library, but I suppose the same applies to the raw API.)
Since you have to specify the destination in the query definition, and you don't know what the query will return until you run it, you could tack a LIMIT 1 onto the end.
You can check the row count in the job result object, and if there are results, re-run the query without the limiter into your new table.
There's no option to do this in one step. I'd recommend running the query, inspecting the results, and then performing a table copy with WRITE_TRUNCATE to commit the results to the final location if the intermediate output contains at least one row.
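A rough sketch of that two-step approach with the Python client; the table IDs and the query are placeholders, and the exact job-config options are worth checking against the client version you have installed:
from google.cloud import bigquery

client = bigquery.Client()
tmp_table = "my-project.my_dataset.tmp_results"        # placeholder intermediate table
final_table = "my-project.my_dataset.final_results"    # placeholder final destination
sql = "SELECT ..."                                     # the query whose result may be empty

# Step 1: run the query into an intermediate table
job_config = bigquery.QueryJobConfig(destination=tmp_table, write_disposition="WRITE_TRUNCATE")
result = client.query(sql, job_config=job_config).result()

# Step 2: copy to the final table only if at least one row came back
if result.total_rows > 0:
    copy_config = bigquery.CopyJobConfig(write_disposition="WRITE_TRUNCATE")
    client.copy_table(tmp_table, final_table, job_config=copy_config).result()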

Query data from a PostgreSQL dump

I have a dump that was made from a PostgreSQL database. I want to check for some information in that dump, specifically checking if there are entries in a certain table with certain values in certain fields.
This is for a Python program that should run automatically on many different inputs on customer machines, so I need a programmatic solution, not manually opening the file and looking for where that table is defined. I could restore the dump to a database and then delete it, but I'm worried that this operation is heavy or has side effects. I want my check to have no side effects; I just want to do the check without affecting anything in my system.
Is that possible in any way? Preferably in Python?
Any dump format: restore and query
The most practical thing to do is restore the dump to a temporary PostgreSQL database and then query that database. It's by far the simplest option. If you have a non-superuser with createdb rights you can do this pretty trivially and safely with pg_restore.
SQL-format
If it's a plain-text (.sql) format dump, and you are desperate, and you know the dumps were not created with the --inserts or --column-inserts options, and you don't use the same table name in multiple schemas, you could just search for the text
COPY tablename (
at the start of a line, then read the COPY-format data (see below) until you find \. at the start of a line.
If you do use the same table name in different schemas you have to parse the dump to find the SET search_path entry for the schema you want, then start looking for the desired table COPY statement.
Custom-format
However, if the dump is in the PostgreSQL custom format, which you should always prefer and request by using -Fc with pg_dump, it is an archive format that pg_restore can list and extract from selectively, so that is what I'd do for this task. To list the tables in the dump:
pg_restore --list out.dump
To dump a particular table as tab-separated COPY format by qualified name, e.g. table address in schema public:
pg_restore -n public -t address out.dump
The output has a bunch of stuff at the start that you can't get pg_restore to skip, but your script can just look for the word COPY (uppercase) at the start of a line and start reading on the next line, until it reaches a \. on a line by itself (a sketch of this parsing follows below). For details on the format, see the PostgreSQL manual on COPY.
Of course you need the pg_restore binary for this.
Make sure there is no PGDATABASE environment variable set when you invoke pg_restore. Otherwise it'll restore to a DB instead of printing output to stdout.
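Here is a small sketch of that parsing step in Python; it shells out to pg_restore exactly as above and keeps only the lines between the COPY header and the terminating \. (the dump path, schema, and table name are the ones from the example, and NULLs/escapes in the COPY data are not handled):
import subprocess

def table_rows(dump_path, schema, table):
    # pg_restore prints the table's schema and data (as a COPY block) to stdout
    proc = subprocess.run(
        ['pg_restore', '-n', schema, '-t', table, dump_path],
        capture_output=True, text=True, check=True,
    )
    in_copy = False
    for line in proc.stdout.splitlines():
        if in_copy:
            if line == r'\.':            # end of the COPY data
                break
            yield line.split('\t')       # COPY output is tab-separated
        elif line.startswith('COPY '):
            in_copy = True               # data starts on the next line

for row in table_rows('out.dump', 'public', 'address'):
    print(row)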
Dump the database to a CSV file (or a CSV file for each table) and then you can load and query them using pandas.
You could convert your dump to an INSERT INTO style dump with this little tool I've written:
https://github.com/freddez/pg-dump2insert
It will be easier to grep specific table data in this form.

How to export a large table (100M+ rows) to a text file?

I have a database with a large table containing more than a hundred million rows. I want to export this data (after some transformation, like joining this table with a few others, cleaning some fields, etc.) and store it in a big text file, for later processing with Hadoop.
So far, I tried two things:
Using Python, I browse the table in chunks (typically 10,000 records at a time) using this subquery trick, perform the transformation on each row and write directly to a text file. The trick helps, but the LIMIT clause becomes slower and slower as the export progresses. I have not been able to export the full table with this.
Using the mysql command-line tool, I tried to output the result of my query in CSV form to a text file directly. Because of the size, it ran out of memory and crashed.
I am currently investigating Sqoop as a tool to import the data directly to HDFS, but I was wondering how other people handle such large-scale exports?
Memory issues point towards using the wrong database query mechanism.
Normally, it is advisable to use mysql_store_result() at the C level, which corresponds to having a Cursor or DictCursor at the Python level. This ensures that the database is free again as soon as possible and the client can do whatever it wants with the data.
But it is not suitable for large amounts of data, as the data is cached in the client process. This can be very memory-consuming.
In this case, it may be better to use mysql_use_result() (C), or SSCursor / SSDictCursor (Python). This requires you to consume the whole result set and do nothing else with the database connection in the meantime, but it saves your client process a lot of memory. With the mysql CLI, you would achieve this with the -q argument.
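For example, with a driver that provides a server-side cursor (pymysql's SSCursor is used here purely as an assumption about which driver is in play; MySQLdb has an equivalent), the export can stream row by row without buffering the result set in the client. The connection settings and table name are placeholders:
import csv
import pymysql
import pymysql.cursors

conn = pymysql.connect(host='localhost', user='user', password='secret',
                       database='mydb', cursorclass=pymysql.cursors.SSCursor)

with open('export.csv', 'w', newline='') as out, conn.cursor() as cur:
    writer = csv.writer(out)
    cur.execute("SELECT * FROM big_table")
    for row in cur:                      # rows are fetched from the server as we iterate
        # transform the row here if needed before writing
        writer.writerow(row)

conn.close()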
I don't know exactly what query you used, because you have not given it here, but I suppose you're specifying LIMIT and OFFSET. Such queries are quite fast at the beginning of the data, but become very slow as the offset grows.
If you have a unique column such as ID, you can still fetch only N rows at a time, but drive the pagination with the WHERE clause instead:
WHERE ID > (last_id)
This would use the index and would be acceptably fast.
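A sketch of that keyset approach (again with pymysql and placeholder names); each chunk starts where the previous one ended instead of scanning past an ever-growing offset:
import pymysql

conn = pymysql.connect(host='localhost', user='user', password='secret', database='mydb')
cur = conn.cursor()

chunk_size = 10000
last_id = 0

while True:
    # Seek on the indexed ID column instead of using LIMIT/OFFSET
    cur.execute(
        "SELECT * FROM big_table WHERE id > %s ORDER BY id LIMIT %s",
        (last_id, chunk_size),
    )
    rows = cur.fetchall()
    if not rows:
        break
    for row in rows:
        pass                             # transform and write the row here
    last_id = rows[-1][0]                # assumes id is the first selected column

conn.close()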
However, it is generally even faster to simply run
SELECT * FROM table
and open a server-side cursor for that query, with a reasonably big fetch size.
