I'm trying to query a Cosmos DB container and store a table in a pandas DataFrame (or just as a list; the problem is the same), using the following code:
table_link = 'dbs/' + database_name + '/colls/' + container_name
query = 'SELECT * FROM ' + container_name
df = pd.DataFrame(client.QueryItems(table_link, query,
                                    {'enableCrossPartitionQuery': True}))
but I have two problems with the output (see the image attached).
First, I have extra columns id, $pk, $id, ... that shouldn't be there (I could just ask for the columns that I want in the query, but I have several tables and that would mean writing a different query for each one of them). And second, for the actual columns of the table, I get a dict with two keys, "t" and "v", where v is the real value of that field.
Any help? I'm not sure if this is the expected behaviour of QueryItems, but I don't see any way to avoid it.
Please check the target API for your Cosmos DB account. More than likely it is Table API. If the API is not SQL API, you will need to use the SDK specific to that API of your Cosmos DB account.
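If it does turn out to be a Table API account, a minimal sketch using the azure-cosmosdb-table package might look like the following (the account name, key and table name are placeholders, and I'm assuming the entities can be loaded straight into pandas):

import pandas as pd
from azure.cosmosdb.table.tableservice import TableService

# Placeholder credentials -- replace with your Table API account details
table_service = TableService(account_name='<account_name>', account_key='<account_key>')

# Entities come back through the Table API as dict-like objects already unwrapped
# (PartitionKey, RowKey, Timestamp plus your own columns), without the $-prefixed
# system properties or the t/v wrapping
entities = table_service.query_entities('<table_name>')

df = pd.DataFrame([dict(e) for e in entities])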
I have a table into which I wrote 1.6 million records; each record has two columns: an ID and a JSON string.
I want to select all of those records and write the JSON in each row out as a file. However, the query result is too large, and I get the 403 associated with that:
"403 Response too large to return. Consider specifying a destination table in your job configuration."
I've been looking at the documentation below and understand that the recommendation is to specify a destination table for the results and view them there. But all I want to do is SELECT * from the table, so that would effectively just copy it over, and I feel like I would run into the same issue when querying that result table.
https://cloud.google.com/bigquery/docs/reference/standard-sql/introduction
https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationQuery.FIELDS.allow_large_results
What is the best practice here? Pagination? Table sampling? list_rows?
I'm using the Python client library, as stated in the question title. My current code is just this:
query = f'SELECT * FROM `{project}.{dataset}.{table}`'
return client.query(query)
I should also mention that the IDs are not sequential, they're just alphanumerics.
The best practice and most efficient way is to export your data and then download it, instead of querying the whole table with SELECT *.
From there, you can extract the data you need from the exported files (e.g. CSV, JSON, etc.) using Python code, without having to wait for a SELECT * query to finish.
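For reference, a minimal sketch of such an export with the Python client might look like this (the project, dataset, table and bucket names are placeholders, and it assumes newline-delimited JSON as the output format):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder identifiers -- replace with your own project/dataset/table/bucket
table_ref = 'my-project.my_dataset.my_table'
destination_uri = 'gs://my-bucket/export/records-*.json'  # wildcard lets large exports shard into multiple files

# Export the whole table to Cloud Storage; no query runs,
# so the "response too large" limit does not apply
job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
extract_job = client.extract_table(table_ref, destination_uri, job_config=job_config)
extract_job.result()  # wait for the export job to finish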
I am using Python's BigQuery client to create and keep up to date some tables in BigQuery that contain daily counts of certain Firebase events joined with data from other sources (sometimes grouped by country, etc.). Keeping them up to date requires deleting and replacing data for past days, because the day tables for Firebase events can be changed after they are created (see here and here). I keep them up to date in this way to avoid querying the entire dataset, which is very financially/computationally expensive.
This deletion and replacement process needs to be repeated for many tables, so I need to reuse some queries stored in text files. For example, one deletes everything in the table from a particular date onward (delete from x where event_date >= y). But because BigQuery disallows the parameterization of table names (see here), I have to duplicate these query text files for each table I need to do this for. If I want to run tests I would also have to duplicate the aforementioned queries for test tables too.
I basically need something like psycopg2.sql for BigQuery so that I can safely parameterize table and column names while avoiding SQLi. I actually tried to repurpose that module by calling the as_string() method and using the result to query BigQuery, but the resulting syntax doesn't match and I would need to open a Postgres connection to do it (as_string() expects a cursor/connection object). I also tried something similar with sqlalchemy.text, to no avail. So I concluded I'd have to implement some way of parameterizing the table name myself, or find some workaround using the Python client library. Any ideas on how I should go about doing this in a safe way that won't lead to SQLi? I can't go into detail, but unfortunately I cannot store the tables in Postgres or any other database.
As discussed in the comments, the best option for avoiding SQLi in your case is ensuring your server's security.
If you nevertheless need or want to validate an input parameter before building your query, I recommend using a regular expression to check the input string.
In Python you could use the re library.
As I don't know how your code works, how your datasets/tables are organized, or exactly how you are planning to check whether the string is a valid source, I created the basic example below that shows how you could check a string with this library:
import re

tests = ["your-dataset.your-table", "(SELECT * FROM <another table>)", "dataset-09123.my-table-21112019"]

# Supposing that the input pattern is <dataset>.<table>
# (alphanumeric characters and dashes only, on both sides of the dot)
regex = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+")

for t in tests:
    if regex.fullmatch(t):
        print("This source is ok")
    else:
        print("This source is not ok")
In this example, only strings that match the pattern dataset.table (where both the dataset and the table must contain only alphanumeric characters and dashes) will be considered valid.
When running the code, the first and the third elements of the list will be considered valid, while the second (which could potentially change your whole query) will be considered invalid.
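Once the table name has passed that check, the values in the query (the date, for example) can still go through BigQuery's regular query parameters, so only the validated identifier is ever spliced into the SQL string. A rough sketch, assuming a recent google-cloud-bigquery client; the delete_from_date helper and the cutoff parameter name are made up for illustration:

from google.cloud import bigquery

client = bigquery.Client()

def delete_from_date(table_name, cutoff_date):
    # table_name must already have passed the regex check above;
    # identifiers cannot be bound as query parameters, so it is interpolated here
    query = f"DELETE FROM `{table_name}` WHERE event_date >= @cutoff"
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("cutoff", "DATE", cutoff_date),
        ]
    )
    client.query(query, job_config=job_config).result()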
I have a table in SQL Server, and the table already has data for the month of November. I have to insert data for previous months, from January through October. I have the data in a spreadsheet. I want to do a bulk insert using Python. I have successfully established the connection to the server using Python and am able to access the table. However, I don't know how to insert data above the rows that are already present in the table on the server. The table doesn't have any constraints, primary keys, or indexes.
I am not sure whether insertion is possible based on that condition. If it is possible, kindly share some clues.
Notes: I don't have access to SSIS. I can't do the insertion using "BULK INSERT" because I can't map my shared drive to SQL Server. That's why I have decided to use a Python script to do the operation.
SQL Server Management Studio is just the GUI for interacting with SQL Server.
"However, I don't know how to insert data above the rows that are already present in the table on the server."
Tables are ordered or structured based on the clustered index. Since you said there aren't any PKs or indexes, you don't have one, so inserting the records "below" or "above" existing rows isn't something that happens. A table without a clustered index is called a heap, which is what you have.
Thus, just insert the data. The order will be determined by any order by clause you place on a statement (at least, the order of the results will be) or by the clustered index on the table, if you create one.
I assume you think your data is ordered because, by chance, when you run select * from table your results appear to be in the same order each time. However, this blog will show you that this isn't guaranteed, and it elaborates on the fact that your results truly aren't ordered without an order by clause.
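For the insert itself, a minimal sketch with pandas and pyodbc might look like the following (the connection string, spreadsheet file, table name and column names are all placeholders):

import pandas as pd
import pyodbc

# Placeholder connection details -- replace with your own server/database
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};SERVER=my_server;'
    'DATABASE=my_db;Trusted_Connection=yes;')
cursor = conn.cursor()
cursor.fast_executemany = True  # makes executemany behave like a bulk insert

# Read the spreadsheet; assumes the columns month_col and value_col exist
# both in the file and in the target table
df = pd.read_excel('jan_to_oct.xlsx')

# Just insert the rows; a heap has no order to preserve
cursor.executemany(
    'INSERT INTO my_table (month_col, value_col) VALUES (?, ?)',
    df[['month_col', 'value_col']].values.tolist())
conn.commit()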
I have a database with records stretching back to 2014 that I have to migrate to BigQuery, and I think that using the partitioned tables feature will help with the performance of the database.
So far, I have loaded a small sample of the real data via the web UI, and while the table was already partitioned, all the data went into a single partition tied to the date on which I ran the query, which was expected, to be fair.
I searched the documentation sites and ran into this, but I'm not sure it's what I'm looking for.
I have two questions:
1) In the above example, they use the decorator on a SELECT query, but can I use it on an INSERT query as well?
2) I'm using the Python client to connect to the BigQuery API, and while I found the table.insert_data method, I couldn't find anything that refers specifically to inserting into partitions, so I'm wondering whether I missed it or whether I will have to use the query API to insert data as well.
Investigated this a bit more:
1) I don't think I've managed to run an INSERT query at all, but this is moot for me, because..
2) It turns out that it is possible to insert into the partitions directly using the Python client, but it wasn't obvious to me:
I was using this snippet to insert some data into a table:
from google.cloud import bigquery

# Rows to stream in, matching the table's schema
items = [
    (1, 'foo'),
    (2, 'bar')
]

client = bigquery.Client()
dataset = client.dataset('<dataset>')
table = dataset.table('<table_name>')
table.reload()  # fetch the table metadata/schema so insert_data knows the columns

print table.insert_data(items)
The key is appending a $ and a date (say, 20161201) to the table name in the selector, like so:
table = dataset.table('<table_name>$20161201')
And it should insert in the correct partition.
Background
I am looking for a way to dump the results of MySQL queries made with Python & Peewee to an Excel file, including database column headers. I'd like the exported content to be laid out in a near-identical order to the columns in the database. Furthermore, I'd like a way for this to work across multiple similar databases that may have slightly differing fields. To clarify, one database may have a user table containing "User, PasswordHash, DOB, [...]", while another has "User, PasswordHash, Name, DOB, [...]".
The Problem
My primary problem is getting the column headers out in an ordered fashion. All attempts thus far have produced unordered results, and all of them have been less than elegant.
Second, my methodology thus far has resulted in code which I'd (personally) hate to maintain, which I know is a bad sign.
Work so far
At present, I have used Peewee's pwiz.py script to generate the models for each of the preexisting database tables in the target databases, then went in and entered all primary and foreign keys. The relations are set up, and some brief tests showed they're associating properly.
Code: I've managed to get the column headers out using something similar to:
for i, column in enumerate(User._meta.get_field_names()):
    ws.cell(row=0, column=i).value = column
As mentioned, this is unordered. Also, doing it this way forces me to do something along the lines of
getattr(some_object, title)
to dynamically populate the fields accordingly.
Thoughts and Possible Solutions
Manually write out the order that I want things in as an array, and use that for looping through and populating the data. The pro of this is very strict/granular control; the con is that I'd need to specify this for every database.
Create (whether manually or via a method) a hash of fields with an associated weighted value for all possibly encountered fields, then write a method for sorting "_meta.get_field_names()" according to weight. The con of this is that the columns may not be 100% in the right order, such as Name coming before DOB in one DB, while after it in another.
Feel free to tell me I'm doing it all wrong or to suggest completely different ways of doing this; I'm all ears. I'm very much new to Python and Peewee (and ORMs in general, actually). I could switch back to Perl and do the database querying via DBI with little to no hassle, but its libraries for Excel would cause me just as many problems, and I'd like to take this as a chance to expand my knowledge.
There is a method on the model meta you can use:
for field in User._meta.get_sorted_fields():
    print field.name
This will print the field names in the order they are declared on the model.
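Building on that, a rough sketch of the export itself could look like the following. It assumes a Peewee version where Model._meta.sorted_fields is available (it lists the fields in declaration order, like get_sorted_fields() above), and it uses openpyxl for the workbook; the User model, query and file name are placeholders:

from openpyxl import Workbook

# Placeholder query -- replace with your own model/filtering
users = User.select()

wb = Workbook()
ws = wb.active

# Fields in the order they are declared on the model
fields = User._meta.sorted_fields

# Header row (openpyxl rows/columns are 1-indexed)
for col, field in enumerate(fields, start=1):
    ws.cell(row=1, column=col, value=field.name)

# One row per record, pulling each value dynamically via getattr
for row, user in enumerate(users, start=2):
    for col, field in enumerate(fields, start=1):
        ws.cell(row=row, column=col, value=getattr(user, field.name))

wb.save('users.xlsx')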