Python - link database query to pandas dataframe

I have a spreadsheet with seven thousand rows of user ids. I need to query a database table and return results matching the ids in the spreadsheet.
My current approach is to read the entire database table into a pandas data frame and then merge it with another data frame created from the spreadsheet. I'd prefer not to read the entire table into memory due to its size. Is there any way to do this without reading in the entire table? In Access or SAS, I could write a query that links the locally created table (i.e. created from the spreadsheet) with the database table.
Current code that reads the entire table into memory:
# read spreadsheet
external_file = pd.read_excel("userlist.xlsx")
# query
qry = "select id,term_code,group_code from employee_table"
# read table from Oracle database
oracle_data = pd.read_sql(qry,connection)
# merge spreadsheet with oracle data
df = pd.merge(external_file,oracle_data,on=['id','term_code'])
I realize the following isn't possible but I would like to be able to query the database like this where "external_file" is a data frame created from my spreadsheet (or at least find an equivalent solution):
query = """
select a.id,
a.term_code,
a.group_code
from employee_table a
inner join external_file b on a.id = b.id and a.term_code=b.term_code
"""

I think you could use xlwings (https://www.xlwings.org) to create a function that reads the id column and builds the query you want.
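Even without xlwings, you can build that query from the dataframe you already read with read_excel. Below is only a minimal sketch: it assumes connection is a cx_Oracle-style DBAPI connection, and it chunks the id list because Oracle limits IN lists to 1,000 items; the :1, :2 bind style is driver-specific.
import pandas as pd

# Read the spreadsheet ids (same as the original code).
external_file = pd.read_excel("userlist.xlsx")
ids = external_file["id"].dropna().unique().tolist()

# Query the database in chunks of 1,000 ids (Oracle's IN-list limit),
# using numbered bind parameters (:1, :2, ...) for cx_Oracle.
chunks = []
for start in range(0, len(ids), 1000):
    id_chunk = ids[start:start + 1000]
    placeholders = ",".join(":{}".format(i + 1) for i in range(len(id_chunk)))
    qry = ("select id, term_code, group_code from employee_table "
           "where id in ({})".format(placeholders))
    chunks.append(pd.read_sql(qry, connection, params=id_chunk))

oracle_data = pd.concat(chunks, ignore_index=True)

# Merge on term_code as well, exactly as in the original code.
df = pd.merge(external_file, oracle_data, on=["id", "term_code"])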

Related

How to insert PDF table data into database

I have extracted PDF table data using Camelot, but how do I put the table data into my database? Do I need to convert it into CSV first, or is there another way? Also, is there a way to choose the specific tables I want rather than hard-coding the table number, because right now I need to specify the table number to be extracted.
import camelot

def tables_extract(file_name):
    filename_with_path = 'upload/media/pos/pdfs/{}'.format(file_name)
    tables = camelot.read_pdf(filename_with_path, pages="1-end")
    table = tables[2].df
    return table
Below is the table data in the PDF whose values I want to put into my DB.
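One common approach (a sketch only; the engine URL, PDF path, and table name pos_table are placeholders) is to skip the CSV step and write the Camelot dataframe straight to the database with pandas.DataFrame.to_sql:
import camelot
from sqlalchemy import create_engine

# Placeholder connection string; any SQLAlchemy-supported database works.
engine = create_engine("sqlite:///pos.db")

tables = camelot.read_pdf('upload/media/pos/pdfs/sample.pdf', pages="1-end")
df = tables[2].df

# Camelot often reads the header row as data; promote it to column names.
df.columns = df.iloc[0]
df = df.iloc[1:]

# Write the dataframe to a database table (placeholder name: pos_table).
df.to_sql("pos_table", engine, if_exists="append", index=False)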

pyodbc - write a new column of data to existing table in ms access

I have an MS Access db I've connected to with the following (ignore the ... in the driver name; it's working):
driver = 'DRIVER={...'
con = pyodbc.connect(driver)
cursor = con.cursor()
I have a pandas dataframe which is exactly the same as a table in the db except for an additional column. Basically, I pulled the table with pyodbc, merged it with external Excel data to add this additional column, and now want to push the data back to the MS Access table with the new column. The pandas column containing the new information is merged_df['Item'].
Trying things like the code below does not work; I've had a variety of errors.
cursor.execute("insert into ToolingData(Item) values (?)", merged_df['Item'])
con.commit()
How can I push the new column to the original table? Can I just overwrite the entire table instead? Would that be easier, since merged_df is literally the same thing with the addition of one new column?
If the target MS Access table does not already contain a field to house the data held within the additional column, you'll first need to execute an alter table statement to add the new field.
For example, the following will add a 255-character text field called item to the table ToolingData:
alter table ToolingData add column item text(255)
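Once the field exists, you can push the values from merged_df with a parameterized UPDATE via executemany rather than an INSERT. This is only a sketch; it assumes ToolingData has a key column (called ID here) that is also present in merged_df:
# Assumption: ToolingData has a key column named ID that also exists in merged_df.
sql = "update ToolingData set Item = ? where ID = ?"
params = list(zip(merged_df['Item'].tolist(), merged_df['ID'].tolist()))

cursor.executemany(sql, params)
con.commit()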

Store results from SQL Queries in a separate pandas data frame with the date of execution

I have created a lookup table (in Excel) which has the table name and column name for the various tables, along with the SQL queries to be run on those fields. Below is an example table.
Results from all SQL queries are in the format Total_Count and Fail_Count. I want to output these results, along with all the information in the current version of the lookup table and the date of execution, into a separate table.
Sample result Table:
Below is the code I used to get the results into the same lookup table, but I'm having trouble storing those results in a separate result_set table with separate columns for the total and fail counts.
df['Results'] = ''
for index, row in df.iterrows():
    cur.execute(row["SQL_Query"])
    df.loc[index, 'Results'] = cur.fetchall()
It might be easier to load the query results into a DataFrame directly using the read_sql method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html.
The one caveat is that you need to use a SQLAlchemy engine for the connection. I also find itertuples easier to work with.
Once you have those, your code is merely:
df['Total_count'] = None
df['Fail_count'] = None
for row in df.itertuples():
    # engine is the SQLAlchemy engine mentioned above
    df_result = pd.read_sql(row.SQL_Query, engine)
    df.loc[row.Index, 'Total_count'] = df_result['Total_Count'].iloc[0]
    df.loc[row.Index, 'Fail_count'] = df_result['Fail_Count'].iloc[0]
Your main problem above is that you're passing two columns from the result query to one column in df. You need to pass each column separately.
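To also land these results in a separate result_set table together with the date of execution, one option (a sketch, reusing the same SQLAlchemy engine; column names follow the Total_Count/Fail_Count format described above) is to collect each run into a list of dicts and write it out with to_sql:
from datetime import date

import pandas as pd

results = []
for row in df.itertuples():
    df_result = pd.read_sql(row.SQL_Query, engine)
    results.append({
        'Table_Name': row.Table_Name,
        'SQL_Query': row.SQL_Query,
        'Total_Count': df_result['Total_Count'].iloc[0],
        'Fail_Count': df_result['Fail_Count'].iloc[0],
        'Execution_Date': date.today(),
    })

result_set = pd.DataFrame(results)
result_set.to_sql('result_set', engine, if_exists='append', index=False)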

Create multiple views in shell script or big query

When I export data from MySQL to BigQuery, some data gets duplicated. As a way to fix this, I thought of creating views of these tables that deduplicate using row numbers. The query to do this is shown below. The problem is that a lot of tables in my dataset are duplicated, and when I add new tables and export them to BigQuery, they will likely have duplicated data too. I don't want to write this type of query every time I add a new table to my dataset (I want a view to be created the moment I export a new table). Is it possible to do this in a loop in the query (like 'for each table in my dataset, do this')? Is it possible in a shell script (when a table is exported to BigQuery, create a view for it)? Failing that, is it possible in Python?
SELECT
* EXCEPT (ROW_NUMBER)
FROM
(
SELECT
*, ROW_NUMBER() OVER (PARTITION BY id order by updated_at desc) ROW_NUMBER
FROM dataset1.table1
)
WHERE ROW_NUMBER = 1
It can definitely be done in Python.
I would recommend using the google-cloud-python library: https://github.com/GoogleCloudPlatform/google-cloud-python
So I think your script should be something like this:
from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = client.dataset('dataset_name')
tables = list(client.list_tables(dataset_ref))
for tab in tables:
    # Define a view named "v_<table>" over each table in the dataset.
    view = bigquery.Table(dataset_ref.table("v_{}".format(tab.table_id)))
    view.view_query = "select * from `my-project.dataset_name.{}`".format(tab.table_id)
    # if creating a legacy SQL view, set this to True instead
    view.view_use_legacy_sql = False
    client.create_table(view)

BigQuery insert job instead of streaming

I am currently using BigQuery's stream option to load data into tables. However, tables that have date partitioning enabled do not show any partitions... I am aware that this is an effect of the streaming.
The Python code I use:
from google.cloud import bigquery

def stream_data(dataset_name, table_name, data):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    # Reload the table to get the schema.
    table.reload()
    rows = data
    errors = table.insert_data(rows)
    if not errors:
        print('Loaded 1 row into {}:{}'.format(dataset_name, table_name))
    else:
        print('Errors:')
        print(errors)
Will date partitioned tables eventually show their partitions, and if not, how can I create an insert (load) job to achieve this?
Not sure what you mean by "partitions not being shown" but when you create a partitioned table you will only see one single table.
The only difference here is that you can query in this table for date partitions, like so:
SELECT
*
FROM
mydataset.partitioned_table
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2016-12-25')
AND TIMESTAMP('2016-12-31');
As you can see in this example, partitioned tables have the meta column _PARTITIONTIME and that's what you use to select the partitions you are interested in.
For more info, here are the docs explaining a bit more about querying data in partitioned tables.
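If you want to replace the streaming inserts with a batch load job (so rows land in the proper date partitions right away), a sketch using the newer google-cloud-bigquery client could look roughly like this; it serializes the rows as newline-delimited JSON in memory, and the function name load_data is just an illustration:
import io
import json

from google.cloud import bigquery

def load_data(dataset_name, table_name, rows):
    # rows is a list of dicts matching the table schema.
    client = bigquery.Client()
    table_ref = client.dataset(dataset_name).table(table_name)

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON

    # Serialize the rows as newline-delimited JSON and run a load job.
    ndjson = "\n".join(json.dumps(row) for row in rows).encode("utf-8")
    job = client.load_table_from_file(
        io.BytesIO(ndjson), table_ref, job_config=job_config)
    job.result()  # wait for the load job to finish

    print('Loaded {} rows into {}:{}'.format(
        job.output_rows, dataset_name, table_name))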
