How to read a csv using sql - python

I would like to know how to read a CSV file using SQL. I would like to use GROUP BY and join other CSV files together. How would I go about this in Python?
example:
select * from csvfile.csv where name LIKE 'name%'

SQL code is executed by a database engine. Python does not directly understand or execute SQL statements.
While some SQL databases store their data in CSV-like files, almost all of them use more complicated file structures. Therefore, you need to import each CSV file into a separate table in the SQL database engine. You can then use Python to connect to the SQL engine and send it SQL statements (such as SELECT). The engine will execute the SQL, extract the results from its data files, and return them to your Python program.
The most common lightweight engine is SQLite.
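A minimal sketch of that workflow, using only Python's standard csv and sqlite3 modules. The CSV contents, the table name people, and the columns name and city are made up for illustration; the CSV is built in memory so the example runs on its own, and a real script would open csvfile.csv instead.

```python
import csv
import io
import sqlite3

# Stand-in for csvfile.csv (hypothetical data).
# For a real file use: open("csvfile.csv", newline="")
csv_text = "name,city\nnameless,Oslo\nnamed,Paris\nbob,Oslo\n"

def load_csv(conn, table, fileobj):
    """Create `table` from the CSV header row and insert every data row."""
    reader = csv.reader(fileobj)
    header = next(reader)
    columns = ", ".join('"%s"' % col for col in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute('CREATE TABLE "%s" (%s)' % (table, columns))
    conn.executemany('INSERT INTO "%s" VALUES (%s)' % (table, placeholders), reader)

conn = sqlite3.connect(":memory:")
load_csv(conn, "people", io.StringIO(csv_text))

# The query from the question, run against the imported table
matches = conn.execute(
    "SELECT name, city FROM people WHERE name LIKE 'name%' ORDER BY name"
).fetchall()

# GROUP BY works the same way once the data is in a table
counts = conn.execute(
    "SELECT city, COUNT(*) FROM people GROUP BY city ORDER BY city"
).fetchall()

print(matches)
print(counts)
```

Loading a second CSV file into another table with the same helper would let you JOIN the two in a single SELECT.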

littletable is a Python module I wrote for working with lists of objects as if they were database tables, but using a relational-like API, not actual SQL select statements. Tables in littletable can easily read and write from CSV files. One of the features I especially like is that every query from a littletable Table returns a new Table, so you don't have to learn different interfaces for Table vs. RecordSet, for instance. Tables are iterable like lists, but they can also be selected, indexed, joined, and pivoted - see the opening page of the docs.
# print a particular customer name
# (unique indexes will return a single item; non-unique
# indexes will return a Table of all matching items)
print(customers.by.id["0030"].name)
print(len(customers.by.zipcode["12345"]))
# print all items sold by the pound
for item in catalog.where(unitofmeas="LB"):
    print(item.sku, item.descr)
# print all items that cost more than 10
for item in catalog.where(lambda o: o.unitprice > 10):
    print(item.sku, item.descr, item.unitprice)
# join tables to create queryable wishlists collection
wishlists = customers.join_on("id") + wishitems.join_on("custid") + catalog.join_on("sku")
# print all wishlist items with price > 10
bigticketitems = wishlists().where(lambda ob : ob.unitprice > 10)
for item in bigticketitems:
    print(item)
Columns of Tables are inferred from the attributes of the objects added to the table. namedtuples work well too, as do types.SimpleNamespace objects. You can insert dicts into a Table, and they will be converted to SimpleNamespaces.
littletable takes a little getting used to, but it sounds like you are already thinking along a similar line.

You can easily query an SQL Database using PHP script. PHP runs serverside, so all your code will have to be on a webserver (the one with the database). You could make a function to connect to the database like this:
$con = mysql_connect($hostname, $username, $password)
    or die("An error has occurred");
Then use the $con to accomplish other tasks such as looping through data and creating a table, or even adding rows and columns to an existing table.
EDIT: I noticed you said .CSV file. You can upload a CSV file into a SQL database and create a table out of it. If you are using a control panel service such as phpMyAdmin, you can simply import a CSV file into your database from its import page.
If you are looking for a free web host to test your SQL and PHP files on, check out x10 hosting.

Related

How to query bigquery tables in google data studio with python-like string formatting in table names based on custom parameters?

So I have several tables with each product for each year and tables go like:
2020product5, 2019product5, 2018product6 and so on. I have added two custom parameters in Google Data Studio as well, named year and product_id, but could not use them in the table names themselves. I have used parameterized queries before, but only in conditions like where product_id = @product_id, and that setup only works if all of the data is in the same table, which is not the case for me. In Python I use string formatters like f"{year}product{product_id}", but that obviously does not work in this case...
Using Bigquery Default CONCAT & FORMAT functions does not help as both throw following validation error: Table-valued function not found: CONCAT at [1:15]
So how do I get around with querying bigquery tables in google data studio with python-like string formatting in table names based on custom parameters?
After much research I (kind of) sorted it out. It turns out that whether schema-level entities such as table names can be queried dynamically is a database-level feature. BigQuery does not support formatting within a table name, so tables named as in the question (e.g. 2020product5, 2019product5, 2018product6) cannot be queried directly. However, it does have a _TABLE_SUFFIX pseudo-column for wildcard tables, which allows you to access tables dynamically, provided that the varying part of the table name is at the end. (This feature also enables date-wise partitioning, and many tools that use BQ as a data sink rely on it. So if you are using BQ as a data sink, there is a good chance that your original data source is already doing so.) Thus, table names like product52020, product52019 and product62018 can be accessed dynamically, and of course from Data Studio too, using the following:
SELECT * FROM `project_salsa_101.dashboards.product*` WHERE _TABLE_SUFFIX = CONCAT(@product_id, @year)
P.S.: I used Python to write a quick-and-dirty script that loops through the products and years, copies each table to a new one with the renamed suffix, and drops the old one. It goes as follows (I am adding the script with the formatted strings so that anyone with a similar case can reuse it with minimal effort):
import itertools

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'project_salsa_101-bq-admin.json')
project_id = 'project_salsa_101'
schema = 'dashboards'
client = bigquery.Client(credentials=credentials, project=project_id)

# product_ids and years are assumed to be defined earlier (iterables of ids and years)
for product_id, year in itertools.product(product_ids, years):
    # read the old table, e.g. 2020product5
    df = client.query(f"""
        SELECT * FROM `{project_id}.{schema}.{year}product{product_id}`
    """).result().to_dataframe()
    # write it back under the suffix-friendly name, e.g. product52020
    df.to_gbq(project_id=project_id,
              destination_table=f'{schema}.product{product_id}{year}',
              credentials=service_account.Credentials.from_service_account_file(
                  'credentials.json'),
              if_exists='replace')
    # drop the old table once the copy exists
    client.query(f"""
        DROP TABLE `{project_id}.{schema}.{year}product{product_id}`""").result()

storing sql in config file

I am using a big nested sql query in python and need to run that query using different dates (date is a field name used in the query).
I also need to change the table name (assume that there are various tables where I need to update the date or insert new records).
Now I am finding it a bit clunky and want to move the query to a config (or .ini) file. I also want to do this so that users can easily change the query without opening the code.
I am able to read the SQL, but Python does not substitute the variables inside it.
For example in .ini file the sql is stored as [SQL]:
p_insert_query = insert into + tbl_p + <...nested sql>
I read this inside Python, where tbl_p is already defined as 'My_tbl', but the query string is not updated with the table name.
Is there any other way to do this?
You could store a .sql or .txt file containing a "parameterized query".
If you use the psycopg2 library, you can do it this way (as described in the docs: http://initd.org/psycopg/docs/sql.html):
from psycopg2 import sql
# just load this string from the .sql or .txt file instead
my_query_text = "insert into {} values (%s, %s)"
tbl_p = "my_table_name"
# cur is an open psycopg2 cursor; [10, 20] are sample values
cur.execute(sql.SQL(my_query_text).format(sql.Identifier(tbl_p)), [10, 20])
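The same idea can be sketched with only the standard library, swapping in sqlite3 for psycopg2 so the example is runnable anywhere. The .ini content, the {table} placeholder convention, the ALLOWED_TABLES allowlist, and the column names are all assumptions made for illustration. The table name is filled in with str.format only after it has been checked against the allowlist, while the values are still passed as bound parameters.

```python
import configparser
import sqlite3

# Hypothetical .ini content; in practice call parser.read("queries.ini").
INI_TEXT = """
[SQL]
p_insert_query = INSERT INTO {table} (day, amount) VALUES (?, ?)
"""

ALLOWED_TABLES = {"My_tbl", "Other_tbl"}  # assumed allowlist of valid table names

def build_query(parser, key, table):
    """Fill the table name into the stored template after validating it."""
    if table not in ALLOWED_TABLES:
        raise ValueError("unexpected table name: %r" % table)
    return parser["SQL"][key].format(table=table)

# interpolation=None keeps configparser from treating % in SQL as its own syntax
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(INI_TEXT)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE My_tbl (day TEXT, amount INTEGER)")

query = build_query(parser, "p_insert_query", "My_tbl")
conn.execute(query, ("2020-01-31", 10))  # values are bound by the driver, not formatted in
rows = conn.execute("SELECT day, amount FROM My_tbl").fetchall()

print(query)
print(rows)
```

Keeping the table name on an allowlist matters because a user-editable config file plus string formatting would otherwise let arbitrary SQL into the statement.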

Python script to diff same table in two different databases

I am about to write a python script to help me migrate data between different versions of the same application.
Before I get started, I would like to know if there is a script or module that does something similar, and I can either use, or use as a starting point for rolling my own at least. The idea is to diff the data between specific tables, and then to store the diff as SQL INSERT statements to be applied to the earlier version database.
Note: This script is not robust in the face of schema changes
Generally the logic would be something along the lines of
def diff_table(table1, table2):
    # return all rows in table2 that are not in table1
    pass

def persist_rows_tofile(rows, tablename):
    # save rows to file
    pass

dbnames = ('db.v1', 'db.v2')
tables_to_process = ('foo', 'foobar')
for table in tables_to_process:
    table1 = dbnames[0] + '.' + table
    table2 = dbnames[1] + '.' + table
    rows = diff_table(table1, table2)
    if len(rows):
        persist_rows_tofile(rows, table)
Is this a good way to write such a script, or could it be improved? I suspect it could be improved by caching database connections etc. (which I have left out because I am not too familiar with SQLAlchemy).
Any tips on how to add SQLAlchemy and generally improve such a script?
To move data between two databases I use pg_comparator. It's like diff and patch for SQL! You can use it to swap the order of columns, but if you need to split or merge columns you need to use something else.
I also use it to duplicate a database asynchronously. A cron job runs every five minutes and pushes all changes on the "master" database to the "slave" databases. This is especially handy if you only need to distribute a single table, or not all columns of a table, etc.
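For readers who want to roll their own, the diff_table step from the question could be sketched roughly as follows. This is not how pg_comparator works; it is a plain whole-row comparison using SQL EXCEPT, shown with sqlite3 so it runs without a server. The tables foo_v1 and foo_v2 and the helper rows_to_inserts are hypothetical, and the literal rendering only handles NULL, numbers and strings.

```python
import sqlite3

def diff_table(conn, table1, table2):
    """Return all rows in table2 that are not in table1 (whole-row comparison)."""
    sql = 'SELECT * FROM "%s" EXCEPT SELECT * FROM "%s"' % (table2, table1)
    return conn.execute(sql).fetchall()

def rows_to_inserts(rows, tablename):
    """Render rows as INSERT statements; only handles None, numbers and strings."""
    def literal(value):
        if value is None:
            return "NULL"
        if isinstance(value, (int, float)):
            return str(value)
        return "'%s'" % str(value).replace("'", "''")  # double any single quotes
    return [
        "INSERT INTO %s VALUES (%s);" % (tablename, ", ".join(literal(v) for v in row))
        for row in rows
    ]

# Two hypothetical versions of table foo; real code might ATTACH two database files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo_v1 (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE foo_v2 (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO foo_v1 VALUES (?, ?)", [(1, "a")])
conn.executemany("INSERT INTO foo_v2 VALUES (?, ?)", [(1, "a"), (2, "it's")])

statements = rows_to_inserts(diff_table(conn, "foo_v1", "foo_v2"), "foo")
print(statements)
```

As the question already notes, this kind of comparison breaks down as soon as the two schemas differ.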

How to bulk insert data to mysql with python

Currently I'm using SQLAlchemy as an ORM, and I am looking for a way to speed up my insert operations. I have a bundle of XML files to import:
for name in names:
    p=Product()
    p.name="xxx"
    session.commit()
I use the above code to insert the data parsed from the batch of XML files into MySQL, and it's very slow.
I also tried
for name in names:
    p=Product()
    p.name="xxx"
session.commit()
but it didn't seem to change anything.
You could bypass the ORM for the insertion operation and use the SQL Expression generator instead.
Something like:
conn.execute(Product.insert(), [dict(name=name) for name in names])
That should create a single statement to do your inserting.
That example was taken from lower down the same page.
(I'd be interested to know what speedup you got from that)

What's the most efficient way to insert thousands of records into a table (MySQL, Python, Django)

I have a database table with a unique string field and a couple of integer fields. The string field is usually 10-100 characters long.
Once every minute or so I have the following scenario: I receive a list of 2-10 thousand tuples corresponding to the table's record structure, e.g.
[("hello", 3, 4), ("cat", 5, 3), ...]
I need to insert all these tuples into the table (assume I have verified that none of these strings appears in the database). For clarification, I'm using InnoDB, and I have an auto-incremental primary key for this table; the string is not the PK.
My code currently iterates through this list, creates a Python model object with the appropriate values for each tuple, and calls ".save()", something like so:
@transaction.commit_on_success
def save_data_elements(input_list):
    for (s, i1, i2) in input_list:
        entry = DataElement(string=s, number1=i1, number2=i2)
        entry.save()
This code is currently one of the performance bottlenecks in my system, so I'm looking for ways to optimize it.
For example, I could generate SQL codes each containing an INSERT command for 100 tuples ("hard-coded" into the SQL) and execute it, but I don't know if it will improve anything.
Do you have any suggestion to optimize such a process?
Thanks
You can write the rows to a file in the format
"field1", "field2", .. and then use LOAD DATA to load them
# quote every field and write one comma-separated row per line
lines = '\n'.join(','.join('"%s"' % field for field in row) for row in data)
with open('data.txt', 'w') as f:
    f.write(lines)
Then execute this:
LOAD DATA INFILE 'data.txt' INTO TABLE db2.my_table FIELDS TERMINATED BY ',' ENCLOSED BY '"';
Reference
For MySQL specifically, the fastest way to load data is using LOAD DATA INFILE, so if you can convert the data into the format it expects, that will probably be the fastest way to get it into the table.
If you don't use LOAD DATA INFILE as some of the other suggestions mention, two things you can do to speed up your inserts are:
Use prepared statements - this cuts out the overhead of parsing the SQL for every insert
Do all of your inserts in a single transaction - this would require using a DB engine that supports transactions (like InnoDB)
If you can do a hand-rolled INSERT statement, then that's the way I'd go. A single INSERT statement with multiple value clauses is much much faster than lots of individual INSERT statements.
Regardless of the insert method, you will want to use the InnoDB engine for maximum read/write concurrency. MyISAM will lock the entire table for the duration of the insert whereas InnoDB (under most circumstances) will only lock the affected rows, allowing SELECT statements to proceed.
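The "one prepared statement, one transaction" advice can be sketched as below. sqlite3 stands in for MySQL here so the example runs without a server; the table name data_element is an assumption based on the DataElement model in the question. MySQL drivers for Python generally use %s rather than ? as the placeholder, so the SQL string would need adjusting.

```python
import sqlite3

def insert_batch(conn, rows):
    """Insert all rows with one parameterized statement inside one transaction."""
    with conn:  # commits once at the end, or rolls back if an insert fails
        conn.executemany(
            "INSERT INTO data_element (string, number1, number2) VALUES (?, ?, ?)",
            rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_element ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "string TEXT UNIQUE, number1 INTEGER, number2 INTEGER)"
)

input_list = [("hello", 3, 4), ("cat", 5, 3)]
insert_batch(conn, input_list)

count = conn.execute("SELECT COUNT(*) FROM data_element").fetchone()[0]
print(count)
```

The point of the sketch is that the statement text is parsed once and the commit happens once, instead of once per row as with repeated .save() calls.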
What format do you receive the data in? If it is a file, you can do some sort of bulk load: http://www.classes.cs.uchicago.edu/archive/2005/fall/23500-1/mysql-load.html
This is unrelated to the actual load of data into the DB, but...
If providing a "The data is loading... The load will be done shortly" type of message to the user is an option, then you can run the INSERTs or LOAD DATA asynchronously in a different thread.
Just something else to consider.
I do not know the exact details, but you can use a JSON-style data representation and load it as fixtures or something similar. I saw something like this in the Django Video Workshop by Douglas Napoleone. See the videos at http://www.linux-magazine.com/online/news/django_video_workshop and http://www.linux-magazine.com/online/features/django_reloaded_workshop_part_1. Hope this helps.
Hope you can work it out. I just started learning django, so I can just point you to resources.
