I have data from a Google Sheet in a pandas DataFrame and I am using df.to_sql to import it into Postgres RDS.
import gspread
import pandas as pd
import sqlalchemy as sa
from oauth2client.service_account import ServiceAccountCredentials as sac

def gsheet2df(spreadsheet_name, sheet_num):
    scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
    credentials_path = 'billing-342104-8b351a7a2813.json'
    credentials = sac.from_json_keyfile_name(credentials_path, scope)
    client = gspread.authorize(credentials)
    sheet = client.open(spreadsheet_name).get_worksheet(sheet_num).get_all_records()
    df = pd.DataFrame.from_dict(sheet)
    print(df)
    return df

def write2db(ed):
    # note the "@" between PASSWORD and HOST
    connection_string = "postgresql+psycopg2://%s:%s@%s:%s/%s" % (USER, PASSWORD, HOST, str(PORT), DATABASE)
    engine = sa.create_engine(connection_string)
    connection = engine.connect()
    ed.to_sql('user_data', con=engine, if_exists='append', index_label='user_id')
But there are two use cases I have not been able to handle, even after a lot of exploring.
When I import the data into the DB, the columns are just the sheet's column names. I want two extra columns added for every row: one holding the time at which the row was updated, and one flagging whether it was deleted. These values should be present for every row.
I have imported the data once, but now I want to sync again and update the DB based on changes in the sheet, without wiping out the whole table.
Any suggestions on how to achieve this?
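For the first use case, a minimal sketch of one way to stamp the two extra columns onto the DataFrame before calling to_sql (the names updated_at and is_deleted are placeholders of my own, not anything pandas or Postgres requires):

from datetime import datetime, timezone

df = gsheet2df('my_spreadsheet', 0)             # hypothetical sheet name and index
df['updated_at'] = datetime.now(timezone.utc)   # time of this import
df['is_deleted'] = False                        # soft-delete flag
write2db(df)

For the second use case, to_sql by itself can only append or replace, so a common pattern is to load the sheet into a staging table with if_exists='replace' and then run an INSERT ... ON CONFLICT (user_id) DO UPDATE from the staging table into the real one; that updates changed rows without wiping the database.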
I am quite new to Python; any advice or link will help.
I have created two Python scripts:
Main.py, which calls SQLcon.py.
SQLcon.py, which only creates a connection to SQL Server and downloads data based on multiple queries.
Later,
Main.py reads the Excel files downloaded by SQLcon into pandas DataFrames and does calculations on them.
The file for the SQL connection and queries (SQLcon.py) has the structure shown below.
Problems:
A) Quite a lot of queries are run and quite a lot of temporary files are created.
B) I do not want to keep the SQL-related code in the Main file.
Wanted outcome:
I want to use dfX = pd.read_sql_query(qryX, engine) (or similar) in the main file and get rid of the part that saves/reads Excel files.
Also, it would be nice to keep one connection open during all these queries, as multiple reconnections will slow down the code.
I am not sure how to start...
I am thinking of putting the SQL connection into a function and calling it from Main...
But that would create multiple reconnections...
import pandas as pd
import sqlalchemy as sa
from sqlalchemy.engine import URL
from dotenv import load_dotenv
# and other imports

load_dotenv()
# .env passwords, connection_string, queries, etc.
'''...'''

# creating SQL connection via sqlalchemy
connection_url = URL.create("mssql+pyodbc", query={"odbc_connect": connection_string})
engine = sa.create_engine(connection_url)
engine.echo = False

# creating dfs
df1 = pd.read_sql_query(qry1, engine)
dfA = pd.read_sql_query(qryA, engine)
dfZ = pd.read_sql_query(qryZ, engine)

engine.dispose()  # not sure if dispose() is needed

# saving dfs
df1.to_excel(r'C:\Test\df1_tbl_Data.xlsx', index=False)
dfA.to_excel(r'C:\Test\dfA_tbl_Data.xlsx', index=False)
dfZ.to_excel(r'C:\Test\dfZ_tbl_Data.xlsx', index=False)
Consider building a collection of your data pulls inside a user-defined function. Then call it whenever needed from Main or other scripts:
SQLcon.py
import pandas as pd
import sqlalchemy as sa
from sqlalchemy.engine import URL
from dotenv import load_dotenv
# and other imports

load_dotenv()
# .env passwords, connection_string, queries, etc.
'''...'''
def pull_data():
    # creating SQL connection via sqlalchemy
    connection_url = URL.create(
        "mssql+pyodbc",
        query={"odbc_connect": connection_string}
    )
    engine = sa.create_engine(connection_url)
    engine.echo = False

    # creating dfs
    df_dict = {
        "df1": pd.read_sql_query(qry1, engine),
        "dfA": pd.read_sql_query(qryA, engine),
        "dfZ": pd.read_sql_query(qryZ, engine)
    }

    # releasing engine
    engine.dispose()
    return df_dict
Main.py (import above as a module)
from SQLcon import pull_data
...
# CALL AS NEEDED
df_dict = pull_data()
# ACCESS DICT ELEMENTS
df_dict["df1"]
df_dict["dfA"]
df_dict["dfZ"]
...
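To address the concern about reconnecting for every query, one option is to create the engine once and pass it into the pull function. A hedged sketch, with function names of my own, reusing the imports, connection_string and queries shown in SQLcon.py above:

# SQLcon.py (sketch)
def make_engine():
    connection_url = URL.create("mssql+pyodbc", query={"odbc_connect": connection_string})
    return sa.create_engine(connection_url)

def pull_data(engine):
    # reuse whatever engine the caller hands in; no connection setup here
    return {
        "df1": pd.read_sql_query(qry1, engine),
        "dfA": pd.read_sql_query(qryA, engine),
        "dfZ": pd.read_sql_query(qryZ, engine),
    }

# Main.py
from SQLcon import make_engine, pull_data

engine = make_engine()        # one engine (and connection pool) for the whole run
df_dict = pull_data(engine)
# ...further pulls can reuse the same engine...
engine.dispose()              # release pooled connections when completely done

Note that a SQLAlchemy engine already pools connections, so even the original pull_data pays the connection cost once per call rather than once per query; passing the engine in simply extends that reuse across calls.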
I am trying to import some data from the database (PostgreSQL) to work with it in Python. I tried the code below, which seems quite similar to the examples I've found on the internet.
import psycopg2
import sqlalchemy as db
import pandas as pd
engine = db.create_engine('database specifications')
connection = engine.connect()
metadata = db.MetaData()
data = db.Table(tabela, metadata, schema=shema, autoload=True, autoload_with=engine)
query = db.select([data])
ResultProxy = connection.execute(query)
ResultSet = ResultProxy.fetchall()
df = pd.DataFrame(ResultSet)
However, it returns data without column names. What did I forget?
It turned out the only thing needed was adding
columns = data.columns.keys()
df.columns = columns
There is a great debate about that in this thread.
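As a side note, a hedged alternative that skips the manual column step is to let pandas read the table directly; pd.read_sql_table picks the column names up from the database itself:

import pandas as pd
import sqlalchemy as db

engine = db.create_engine('database specifications')
# read the whole table; column names come along automatically
df = pd.read_sql_table(tabela, engine, schema=shema)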
I need to download a relatively small table from BigQuery and store it (after some parsing) in a pandas DataFrame.
Here is the relevant sample of my code:
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client(project="project_id")
job_config = bigquery.QueryJobConfig(allow_large_results=True)
query_job = client.query("my sql string", job_config=job_config)
result = query_job.result()
rows = [dict(row) for row in result]
pdf = pd.DataFrame.from_dict(rows)
My problem:
After a few thousand rows are parsed, one of them is too big and I get an exception: google.api_core.exceptions.Forbidden.
So, after a few iterations, I tried to transform my loop into something that looks like:
rows = list()
for _ in range(result.total_rows):
    try:
        rows.append(dict(next(result)))
    except google.api_core.exceptions.Forbidden:
        pass
BUT it doesn't work, since result is a bigquery.table.RowIterator and, despite its name, it's not an iterator: it's an iterable.
So... what do I do now? Is there a way to either:
ask for the next row in a try/except scope?
tell bigquery to skip bad rows?
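On the first point, an iterable can always be turned into an iterator with the built-in iter(), so the per-row try/except pattern can be written as below. This is only a sketch: it assumes the Forbidden error actually surfaces on the next() call for the offending row rather than when a whole page is fetched.

import google.api_core.exceptions

rows = []
it = iter(result)        # RowIterator is iterable, so iter() yields a real iterator
while True:
    try:
        rows.append(dict(next(it)))
    except StopIteration:
        break            # end of the result set
    except google.api_core.exceptions.Forbidden:
        pass             # skip the offending row and keep going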
Did you try paging through query results?
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
query = """
SELECT name, SUM(number) as total_people
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total_people DESC
"""
query_job = client.query(query) # Make an API request.
query_job.result() # Wait for the query to complete.
# Get the destination table for the query results.
#
# All queries write to a destination table. If a destination table is not
# specified, BigQuery populates it with a reference to a temporary
# anonymous table after the query completes.
destination = query_job.destination
# Get the schema (and other properties) for the destination table.
#
# A schema is useful for converting from BigQuery types to Python types.
destination = client.get_table(destination)
# Download rows.
#
# The client library automatically handles pagination.
print("The query data:")
rows = client.list_rows(destination, max_results=20)
for row in rows:
print("name={}, count={}".format(row["name"], row["total_people"]))
Also, you can try to filter out big rows in your query:
WHERE LENGTH(some_field) < 123
or
WHERE LENGTH(CAST(some_field AS BYTES)) < 123
I am trying to populate an already created Google Sheet from my SQL table using Python and gspread.
I can update the sheet one row at a time using a for loop, but I have a lot of data to add to the sheet and want to do a column at a time, or more if possible.
Any suggestions? Here's what I've been using; I get the error: Object of type 'Row' is not JSON serializable
#!/usr/bin/python3
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import dbconnect

# credentials for google
gc = gspread.authorize(credentials)
worksheet = gc.open('NAMEOFWS').sheet1
cell_list = worksheet.range('A2:A86')

# connect to database using dbconnect and grab cursor
query = "select loc from table"
cursor.execute(query)
results = cursor.fetchall()
cell_values = (results)

for i, val in enumerate(cell_values):
    cell_list[i].value = val

worksheet.update_cells(cell_list)
I am not sure how to do this with gspread, but you can modify your code very easily and use pygsheets, which allows you to update a column all at once. Also, I am not sure what your data looks like, so the code below may need to be altered, or you may need to alter your data set a little. Hope this helps.
import pygsheets

gc = pygsheets.authorize(service_file='client_secret2.json')

# Open spreadsheet and select worksheet
sh = gc.open('Api_Test')
wks = sh.sheet1

# update column
notes = [1, 2, 3, 4]  # the dataset to be written into the column
wks.update_col(4, notes, 1)  # 4 is the column number, notes is the data, 1 skips the first (header) row
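Tying this back to the question's error: cursor.fetchall() returns row tuples, and passing those tuples straight to the sheet is what triggers "Object of type 'Row' is not JSON serializable". A hedged sketch of flattening them first, assuming each row holds a single loc value:

results = cursor.fetchall()              # e.g. [('LOC1',), ('LOC2',), ...]
notes = [row[0] for row in results]      # plain scalars, JSON-serializable
wks.update_col(4, notes, 1)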
I have a dictionary with 3 keys which correspond to field names in a SQL Server table. The values of these keys come from an Excel file, and I store this dictionary in a DataFrame which I now need to insert into a SQL table. This can all be seen in the code below:
import pandas as pd
import pymssql

df = []
fp = "file path"
data = pd.read_excel(fp, sheet_name="CRM View")
row_date = data.loc[3, ]
row_sita = "ABZPD"
row_event = data.iloc[12, :]
df = pd.DataFrame({'date': row_date,
                   'sita': row_sita,
                   'event': row_event
                   }, index=None)
df = df[4:]
df = df.fillna("")
print(df)
My question is: how do I insert this DataFrame into a SQL table now?
Also, as a side note, this code is part of a loop which needs to go through several Excel files one by one, insert the data into the dictionary, then into SQL, then clear the dictionary and start again with the next Excel file.
You could try something like this:
import MySQLdb

# connect (host, user, password, database)
conn = MySQLdb.connect("127.0.0.1", "username", "password", "database_name")
x = conn.cursor()

# write (parameterized query; MySQLdb uses %s placeholders)
x.execute("INSERT INTO table_name (row_date, sita, event) VALUES (%s, %s, %s)",
          (row_date, sita, event))

# close
conn.commit()
conn.close()
You might have to change it a little based on your SQL restrictions, but it should give you a good start anyway.
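Since the question actually targets SQL Server through pymssql, here is a hedged equivalent; the server, credentials, table and column names are placeholders, and it assumes the DataFrame df built in the question:

import pymssql

# connect (placeholder credentials)
conn = pymssql.connect(server="myserver", user="username",
                       password="password", database="mydb")
cur = conn.cursor()

# insert the DataFrame row by row as plain tuples
cur.executemany(
    "INSERT INTO my_table (date, sita, event) VALUES (%s, %s, %s)",
    list(df[['date', 'sita', 'event']].itertuples(index=False, name=None)),
)

conn.commit()
conn.close()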
For the pandas DataFrame, you can use the built-in to_sql method to store it in the DB. Here is how to use it.
import urllib.parse
import sqlalchemy as sa

params = urllib.parse.quote_plus(
    "DRIVER={};SERVER={};DATABASE={};Trusted_Connection=True;".format(
        "{SQL Server}", "<db_server_url>", "<db_name>"))
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = sa.create_engine(conn_str)
df.to_sql(<table_name>, engine, schema=<schema_name>, if_exists="append", index=False)
For this method you will need to install the sqlalchemy package (the mssql+pyodbc dialect also requires pyodbc).
pip install sqlalchemy
You will also need to set up the MSSQL ODBC driver/DSN on the machine.