I'm trying to create a pandas DataFrame using the Snowflake connector package in Python.
I run some query
sf_cur = get_sf_connector()
sf_cur.execute("USE WAREHOUSE Warehouse;")
sf_cur.execute("""select Query"""
)
print('done')
The output is roughly 21k rows. Then using
df = pd.DataFrame(sf_cur.fetchall())
takes forever, even on a limited sample of only 100 rows. Is there a way to optimize this? Ideally the bigger query would be run in a loop, so handling even larger data sets would also work.
Since fetchall() copies the whole result set into memory, you should try to iterate over the cursor object directly and build the data frame inside the for block:
cursor.execute(query)
rows = []
for row in cursor:
    rows.append(row)          # collect rows as they stream from the server
df = pd.DataFrame(rows)       # build the data frame once the loop ends
Another example, just to illustrate:
query = "Select ID from Users"
cursor.execute(query)
for row in cursor:
list_ids.append(row["ID"])
Use df = cur.fetch_pandas_all() to build a pandas DataFrame directly on top of the result set.
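For reference, a minimal sketch of both approaches with the Snowflake connector; the warehouse and query are the placeholders from the question, process() is a hypothetical per-chunk handler, and fetch_pandas_batches() (the chunked variant, useful when the full result does not fit comfortably in memory) needs the pandas extra of snowflake-connector-python:

import pandas as pd

sf_cur = get_sf_connector()                    # helper from the question
sf_cur.execute("USE WAREHOUSE Warehouse;")
sf_cur.execute("""select Query""")             # placeholder query

# one shot: materialize the whole result set as a DataFrame
df = sf_cur.fetch_pandas_all()

# chunked: iterate over the result set as a sequence of smaller DataFrames
for batch_df in sf_cur.fetch_pandas_batches():
    process(batch_df)                          # hypothetical per-chunk handler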
I found an SQLite database, but I'm totally new to SQL and not sure where to start...
As you can see in the screenshot, I managed to open the data in a pandas df, and the daily flows are split across columns named "FLOW1, FLOW2, ... FLOW31".
I want to extract the full daily flow history and stack it into a separate column for each STATION_NUMBER (columns = station, rows/index = datetime), e.g.:
date STATION#1 STATION#2 ...
1-1-1969 value value
2-1-1969 value value
Here is the small bit of code I used to get there:
import sqlite3
import pandas as pd

conn = sqlite3.connect(r"E:\Python\Data\hydrology_forcasting\Caravan\timeseries\csv\hydat\Hydat.sqlite3")
for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    # get daily flow df of the screenshot
    flow = pd.read_sql_query("SELECT * from DLY_FLOWS", conn)
My way would be a for loop that picks the values one at a time and copies them into a new df... but I'm sure this is not the most efficient way; does SQL (or pandas) have a built-in method or something to do that?
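A possible sketch with pandas melt/pivot rather than a row-by-row loop, assuming DLY_FLOWS also carries STATION_NUMBER, YEAR and MONTH columns alongside FLOW1...FLOW31 as the screenshot suggests (adjust the names otherwise):

import pandas as pd

flow_cols = [f"FLOW{d}" for d in range(1, 32)]

# wide -> long: one row per station/day
long_df = flow.melt(id_vars=["STATION_NUMBER", "YEAR", "MONTH"],
                    value_vars=flow_cols,
                    var_name="DAY", value_name="FLOW")
long_df["DAY"] = long_df["DAY"].str.replace("FLOW", "").astype(int)

# build a proper date; impossible combinations (e.g. February 30) become NaT and are dropped
long_df["date"] = pd.to_datetime(dict(year=long_df["YEAR"],
                                      month=long_df["MONTH"],
                                      day=long_df["DAY"]),
                                 errors="coerce")

# long -> wide again, one column per station (assumes one row per station/month in DLY_FLOWS)
stacked = (long_df.dropna(subset=["date"])
                  .pivot(index="date", columns="STATION_NUMBER", values="FLOW")
                  .sort_index())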
I am a beginner in programming, so please excuse me if the question is stupid.
See the code below. It takes two values (price, link) from a CSV file named headphones_master_data.csv and writes the data into a MySQL table. When writing the data, the date is also written to the table.
There are 900 rows in the file. If you look at the writing part, the my_cursor.execute(sql, val) function is executed 900 times (once per row).
It got me thinking and I wanted to see if there are other ways to improve the data writing part. I came up with two ideas and they are as follows.
1 - Convert all the lists ( price, link) into a dictionary and write the dictionary. So the my_cursor.execute(sql, val) function is executed just once.
2 - Convert the lists into a data frame and write that into the database so the write happens just once.
Which method is the best one? Are there any drawbacks to writing the data only once? More importantly, am I thinking about the optimization correctly?
import time
import pandas as pd
import pymysql

data = pd.read_csv("headphones-master_data.csv") #read the csv file and save it into a variable named data
link_list = data['Product_url'].tolist() #take the url values from the data variable and turn them into a list
price_list = data['Sale_price'].tolist()
crawled_date = time.strftime('%Y-%m-%d') #generate the date in a format compatible with MySQL

connection = pymysql.connect(host='localhost',
                             user='root',
                             password='passme123##$',
                             db='hpsize') #connection object holding the database details

my_cursor = connection.cursor() #cursor object to communicate with the database

for i in range(len(link_list)):
    link = link_list[i]
    price = price_list[i]
    sql = "INSERT INTO comparison (link, price, crawled_date) VALUES (%s, %s, %s)" #sql query to add data to the database with three variables
    val = link, price, crawled_date #the variables to be added to the SQL query
    my_cursor.execute(sql, val) #execute the cursor object to insert the data

connection.commit() #commit and make the inserts permanent

my_cursor.execute("SELECT * from comparison") #load the table contents to verify the insert
result = my_cursor.fetchall()
for i in result:
    print(i)

connection.close()
The best way in my opinion is to pass the data into a DataFrame and then use the .to_sql method in order to save the data in your MySQL database.
This method takes an argument (method='multi') which lets you insert all the data in the DataFrame in one go and within a very short time. This works if your database supports multi-row inserts.
Read more here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
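As a rough sketch of that approach, reusing the file, table and credentials from the question (the SQLAlchemy connection URL and the percent-encoding of the password are assumptions, adjust to your setup):

import time
import pandas as pd
from sqlalchemy import create_engine

data = pd.read_csv("headphones-master_data.csv")

# frame whose column names match the comparison table
df = pd.DataFrame({
    "link": data["Product_url"],
    "price": data["Sale_price"],
    "crawled_date": time.strftime("%Y-%m-%d"),
})

# to_sql needs a SQLAlchemy engine; '#' and '$' in the password are percent-encoded in the URL
engine = create_engine("mysql+pymysql://root:passme123%23%23%24@localhost/hpsize")

# one call writes every row; method='multi' packs many rows into each INSERT statement
df.to_sql("comparison", con=engine, if_exists="append", index=False, method="multi")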
I have the following code that iterates through the rows of certain tables in AWS. It grabs the first 50k rows and keeps going as long as there are 50k more rows to grab, and it works extremely quickly because I'm usually only getting the last 2 days' worth of data.
top = 50000
i = 0
days = 2
df = pd.DataFrame()
result = pd.DataFrame()
curs = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)

while (i == 0) or (len(df) == top):
    start_time = (dt.datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")
    sql = f'SELECT * FROM {str.upper(table)} WHERE INSERTED_AT >= \'{start_time}\' OR UPDATED_AT >= \'{start_time}\' LIMIT {top} OFFSET {i}'
    curs.execute(sql)
    data = curs.fetchall()
    df = pd.DataFrame([i.copy() for i in data])
    result = result.append(df, ignore_index=True)
    #load result to snowflake
    i += top
The trouble is I have a very large table that is about 7 million rows long and growing exponentially. I found that if I backload all of its data (days=1000), I will be missing data, probably because each iteration's window (0-50k, 50k-100k, etc.) has shifted by the time it runs, since the table kept receiving new rows while my while loop was executing.
What is a better way to load data into Snowflake that will avoid missing-data issues? Do I have to use parallelization to get all these pieces of the table at once? Even with top=3mil I still find I'm missing large amounts of data, likely due to the lag between when I read and when the actual table rows are added. Is there a standardized block of code that works well for large tables?
I would skip the Python and favor Redshift's UNLOAD command.
UNLOAD lets you dump the contents of a Redshift table into an S3 bucket:
https://community.snowflake.com/s/article/How-To-Migrate-Data-from-Amazon-Redshift-into-Snowflake
unload ('select * from emp where date = GET_DATE()')
to 's3://mybucket/mypath/'
credentials 'aws_access_key_id=XXX;aws_secret_access_key=XXX'
delimiter '\001'
null '\\N'
escape
[allowoverwrite]
[gzip];
You could set up a stored procedure that runs and have a schedule that kicks it off once per day (I use Astronomer/Airflow for this).
From there you can build an external table on top of the bucket:
https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration.html
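A rough sketch of that last step through the Snowflake Python connector, reusing the bucket path and '\001' delimiter from the UNLOAD example; the account details, stage, integration and table names are placeholders, and the storage integration is assumed to already exist per the linked guide:

import snowflake.connector

# placeholder credentials; the warehouse name is taken from the first question in this thread
conn = snowflake.connector.connect(account="my_account", user="my_user",
                                   password="my_password", warehouse="Warehouse",
                                   database="my_db", schema="public")
cur = conn.cursor()

# stage pointing at the bucket the UNLOAD wrote to
cur.execute("""
    CREATE OR REPLACE STAGE emp_unload_stage
      URL = 's3://mybucket/mypath/'
      STORAGE_INTEGRATION = my_s3_integration
""")

# external table on top of the stage; each row arrives in a single VARIANT column named VALUE
cur.execute(r"""
    CREATE OR REPLACE EXTERNAL TABLE emp_external
      WITH LOCATION = @emp_unload_stage
      FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '\001')
      AUTO_REFRESH = FALSE
""")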
I have created a lookup table (in Excel) which has the table and column names for the various tables, along with all the SQL queries to be run on those fields. Below is an example table.
Results from all SQL queries are in the format Total_Count and Fail_Count. I want to output these results, along with all the information in the current version of the lookup table and the date of execution, into a separate table.
Sample result Table:
Below is the code I used to get the results together in the same lookup table, but I have trouble storing the same results in a separate result_set table with separate columns for the total and fail counts.
df['Results'] = ''
from pandas import DataFrame

for index, row in df.iterrows():
    cur.execute(row["SQL_Query"])
    df.loc[index, 'Results'] = cur.fetchall()
It might be easier to load the query results into a DataFrame directly using the read_sql method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html.
The one caveat is that you need to use a SQLAlchemy engine for the connection. I also find itertuples easier to work with.
Once you have those, your code is merely:
for row in df.itertuples():
    df_result = pd.read_sql(row.SQL_Query, engine)   # engine is the SQLAlchemy engine mentioned above
    df.loc[row.Index, 'Total_count'] = df_result['total_count'].iloc[0]
    df.loc[row.Index, 'Fail_count'] = df_result['fail_count'].iloc[0]
Your main problem above is that you're passing two columns from the result query to one column in df. You need to pass each column separately.
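For completeness, a small sketch of creating that engine; the connection URL below is a generic placeholder (swap in the dialect, driver and credentials for whichever database the lookup-table queries run against), and the sample query is hypothetical:

import pandas as pd
from sqlalchemy import create_engine

# placeholder DSN; adjust dialect/driver and credentials to your database
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/mydb")

# read_sql can then run any query from the lookup table straight into a DataFrame
df_result = pd.read_sql("SELECT COUNT(*) AS total_count, 0 AS fail_count FROM some_table", engine)
print(df_result['total_count'].iloc[0], df_result['fail_count'].iloc[0])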
I have an Oracle DB with over 5 million rows, with columns of type varchar and blob. To connect to the database and read the records I use Python 3.6 with a JDBC driver and the JayDeBeApi library. What I am trying to achieve is to read each row, perform some operations on the records (use a regex, for example) and then store the new record values in a new table. I don't want to load all records into memory, so what I want to do is fetch them from the database in consecutive chunks, store the fetched data, process it and then add it to the other table.
Currently I fetch all the records at once instead of, for example, the first 1000, then the next 1000 and so on. This is what I have so far:
statement = "... a select statement..."
connection = dbDriver.connect(jclassname, [driver_url, username, password], jars)
cursor = connection.cursor()
cursor.execute(statement)
fetched = cursor.fetchall()
for result in fetched:
    preprocess(result)
cursor.close()
How could I modify my code to fetch in consecutive chunks, and where should I put the second statement that inserts the new values into the other table?
As you said, fetchall() is a bad idea in this case, as it loads all the data into memory.
To avoid that, you can iterate over the cursor object itself:
cur.execute("SELECT * FROM test")
for row in cur: # iterate over result set row by row
do_stuff_with_row(row)
cur.close()
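If you specifically want the 1000-at-a-time behaviour from the question, the DB-API fetchmany() call does that, and the insert can live inside the same loop. A rough sketch: new_table and its columns are placeholders, and preprocess() (the question's own function) is assumed here to return the transformed row as a tuple:

batch_size = 1000

read_cur = connection.cursor()
write_cur = connection.cursor()

read_cur.execute(statement)
while True:
    rows = read_cur.fetchmany(batch_size)      # next chunk of at most 1000 rows
    if not rows:
        break
    processed = [preprocess(row) for row in rows]
    # insert the transformed batch into the target table (placeholder name and columns)
    write_cur.executemany(
        "INSERT INTO new_table (col_a, col_b) VALUES (?, ?)",
        processed,
    )
    connection.commit()                        # commit once per batch

read_cur.close()
write_cur.close()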