I often need to process several hundred million rows of a MySQL table line by line using Python. I want a script that is robust and does not need to be monitored.
Below I pasted a script that classifies the language of the message field in each row. It utilizes the sqlalchemy and MySQLdb.cursors.SSCursor modules. Unfortunately this script consistently throws a 'Lost connection to MySQL server during query' error after 4840 rows when I run it remotely and 42000 rows when I run it locally.
Also, I have checked that max_allowed_packet = 32M in my MySQL server's /etc/mysql/my.cnf file, as per the answers to the Stack Overflow question "Lost connection to MySQL server during query".
Any advice for fixing this error, or for another approach to processing very large MySQL tables robustly with Python, would be much appreciated!
import sqlalchemy
import MySQLdb.cursors
import langid

schema = "twitterstuff"
table = "messages_en"  # 900M row table
engine_url = "mysql://myserver/{}?charset=utf8mb4&read_default_file=~/.my.cnf".format(schema)
db_eng = sqlalchemy.create_engine(engine_url, connect_args={'cursorclass': MySQLdb.cursors.SSCursor})
langid.set_languages(['fr', 'de'])

print "Executing input query..."
data_iter = db_eng.execute("SELECT message_id, message FROM {} WHERE langid_lang IS NULL LIMIT 10000".format(table))

def process(inp_iter):
    for item in inp_iter:
        item = dict(item)
        (item['langid_lang'], item['langid_conf']) = langid.classify(item['message'])
        yield item

def update_table(update_iter):
    count = 0
    for item in update_iter:
        count += 1
        if count % 10 == 0:
            print "{} rows processed".format(count)
        lang = item['langid_lang']
        conf = item['langid_conf']
        message_id = item['message_id']
        db_eng.execute("UPDATE {} SET langid_lang = '{}', langid_conf = {} WHERE message_id = {}".format(table, lang, conf, message_id))

data_iter_upd = process(data_iter)

print "Begin processing..."
update_table(data_iter_upd)
According to MySQLdb developer Andy Dustman,
[When using SSCursor,] no new queries can be issued on the connection until
the entire result set has been fetched.
That post says that if you issue another query you will get a "commands out of sequence" error, which is not the error you are seeing. So I am not sure that the following will necessarily fix your problem. Nevertheless, it might be worth trying to remove SSCursor from your code and use the simpler default Cursor just to test if that is the source of the problem.
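If SSCursor does turn out to be the culprit, another option worth trying (a minimal sketch, not tested against your setup) is to keep the streaming read but issue the UPDATEs through a second engine, so the two never share a connection:

import sqlalchemy
import MySQLdb.cursors

# Sketch: one engine streams the SELECT with SSCursor, a separate engine
# handles the UPDATEs, so no query is ever issued on the streaming connection.
# engine_url, table, and process() are as defined in the question's code.
read_eng = sqlalchemy.create_engine(
    engine_url, connect_args={'cursorclass': MySQLdb.cursors.SSCursor})
write_eng = sqlalchemy.create_engine(engine_url)

data_iter = read_eng.execute(
    "SELECT message_id, message FROM {} WHERE langid_lang IS NULL".format(table))
for item in process(data_iter):
    write_eng.execute(
        "UPDATE {} SET langid_lang = %s, langid_conf = %s WHERE message_id = %s"
        .format(table),
        (item['langid_lang'], item['langid_conf'], item['message_id']))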
You could, for example, use LIMIT chunksize OFFSET n in your SELECT statement
to loop through the data set in chunks:
import sqlalchemy
import MySQLdb.cursors
import langid
import itertools as IT

chunksize = 1000

def process(inp_iter):
    for item in inp_iter:
        item = dict(item)
        (item['langid_lang'], item['langid_conf']) = langid.classify(item['message'])
        yield item

def update_table(update_iter, engine):
    for count, item in enumerate(update_iter):
        if count % 10 == 0:
            print "{} rows processed".format(count)
        lang = item['langid_lang']
        conf = item['langid_conf']
        message_id = item['message_id']
        engine.execute(
            "UPDATE {} SET langid_lang = '{}', langid_conf = {} WHERE message_id = {}"
            .format(table, lang, conf, message_id))

schema = "twitterstuff"
table = "messages_en"  # 900M row table
engine_url = ("mysql://myserver/{}?charset=utf8mb4&read_default_file=~/.my.cnf"
              .format(schema))
db_eng = sqlalchemy.create_engine(engine_url)
langid.set_languages(['fr', 'de'])

for offset in IT.count(start=0, step=chunksize):
    print "Executing input query..."
    # Fetch the next chunk of unprocessed rows
    result = db_eng.execute(
        "SELECT message_id, message FROM {} WHERE langid_lang IS NULL LIMIT {} OFFSET {}"
        .format(table, chunksize, offset))
    result = list(result)
    if not result:
        break
    data_iter_upd = process(result)
    print "Begin processing..."
    update_table(data_iter_upd, db_eng)
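If the connection still drops on a long run, it may also help to create the engine with connection recycling, so pooled connections are refreshed before the server's wait_timeout can kill them (a one-line sketch; pool_recycle is a standard create_engine option):

# Recycle pooled connections every hour so a chunk never runs
# on a connection the server has already timed out.
db_eng = sqlalchemy.create_engine(engine_url, pool_recycle=3600)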
Related
In my case I use the psycopg2 client and I need to create a table, but it gives me a timeout error; this is obviously because the table takes a long time to create and exceeds the 15 min limit.
For my purposes I found documentation that helped me a lot: the psycopg docs on asynchronous support.
I will leave the small implementation. Note that I have named the connection aconn because it works differently than a normal connection; for example, it does not use commit.
The little detail is async_=True in the connection line.
import select
import psycopg2

def wait(conn):
    # Poll the asynchronous connection until the current operation completes
    while True:
        state = conn.poll()
        if state == psycopg2.extensions.POLL_OK:
            break
        elif state == psycopg2.extensions.POLL_WRITE:
            select.select([], [conn.fileno()], [])
        elif state == psycopg2.extensions.POLL_READ:
            select.select([conn.fileno()], [], [])
        else:
            raise psycopg2.OperationalError("poll() returned %s" % state)

# db_secret is assumed to already hold the connection credentials
db_host = db_secret["host"]
db_name = db_secret["dbname"]
db_user = db_secret["username"]
db_pass = db_secret["password"]

stringConn = "dbname='%s' user='%s' host='%s' password='%s'" % (db_name, db_user, db_host, db_pass)
aconn = psycopg2.connect(stringConn, async_=True)
wait(aconn)

acursor = aconn.cursor()
query = "CREATE TABLE SCHEMA.TABLE AS SELECT * FROM BLA"  # placeholder query
acursor.execute(query)
wait(acursor.connection)
aconn.close()
# END AND EXIT
Question:
I have a Python script to scrape a website; it gets 2 variables and stores them in 2 lists. I then use executemany to update a MySQL database, using one variable to match a pre-existing row and inserting the other variable into it.
Code:
Python Script
import mysql.connector
from bs4 import BeautifulSoup as soup
from selenium import webdriver
import time, re

mydb = mysql.connector.connect(
    host="host",
    user="user",
    passwd="passwd",
    database="database"
)
mycursor = mydb.cursor()

d = webdriver.Chrome('D:/Uskompuf/Downloads/chromedriver')
d.get('https://au.pcpartpicker.com/products/cpu/overall-list/#page=1')

def cpus(_source):
    result = soup(_source, 'html.parser').find('ul', {'id':'category_content'}).find_all('li')
    _titles = list(filter(None, [(lambda x:'' if x is None else x.text)(i.find('div', {'class':'title'})) for i in result]))
    data = [list(filter(None, [re.findall('(?<=\().*?(?=\))', c.text) for c in i.find_all('div')])) for i in result]
    return _titles, [a for *_, [a] in filter(None, data)]

_titles, _cpus = cpus(d.page_source)

sql = "UPDATE cpu set family = %s where name = %s"
mycursor.executemany(sql, list(zip(_cpus, _titles)))
print(sql, list(zip(_titles, _cpus)))

_last_page = soup(d.page_source, 'html.parser').find_all('a', {'href':re.compile('#page\=\d+')})[-1].text
for i in range(2, int(_last_page)+1):
    d.get(f'https://au.pcpartpicker.com/products/cpu/overall-list/#page={i}')
    time.sleep(3)
    _titles, _cpus = cpus(d.page_source)
    sql = "UPDATE cpu set family = %s where name = %s"
    mycursor.executemany(sql, list(zip(_cpus, _titles)))

mydb.commit()
MySQL UPDATE code
sql = "UPDATE cpu set family = %s where name = %s"
mycursor.executemany(sql, list(zip(_cpus, _titles)))
MySQL UPDATE code print
print(sql, list(zip(_cpus, _titles)))
MySQL UPDATE code print output
UPDATE cpu set family = %s where name = %s [('Pinnacle Ridge', 'AMD Ryzen 5 2600'), ('Coffee Lake-S', 'Intel Core i7-8700K'),...
First 2 rows of table
Expected result
The first variable is the name, which needs to be matched; the second variable is the family to be written to that row. The names match perfectly and there are no errors when running the program, yet all family values are null.
I'm not sure of the best way to go about solving this; I thought I could make a fiddle, but I'm not sure about the list in executemany?
Other
If you need any more information please let me know.
Thanks
Just had to add mydb.commit() after executemany.
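For example, a minimal sketch of where the commit goes, using the names from the question:

sql = "UPDATE cpu set family = %s where name = %s"
mycursor.executemany(sql, list(zip(_cpus, _titles)))
mydb.commit()  # without this, mysql.connector rolls the UPDATEs back on close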
I am using MySQLdb to connect to and populate my DB with a Python script. Data from a BGP stream (dump) goes into the DB. But when I try to execute (insert data) with SQL on line 65 (near the bottom of the code), the DB is not affected, except that one of the row's fields does auto-increment. Is this an encoding issue? I am using utf-8 in Python and utf-8, utf8_swedish_ci in MySQL. Code I am using:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from _pybgpstream import BGPStream, BGPRecord, BGPElem
from collections import defaultdict
import time
import datetime
import os
import MySQLdb

db = MySQLdb.connect(user="bgpstream", host="localhost", passwd="Bgpstream9", db="bgpstream_copy")
db_cursor = db.cursor()

# Create a new bgpstream instance and a reusable bgprecord instance
stream = BGPStream()
rec = BGPRecord()

# Consider Route Views origin only
collector_name = 'rrc11'
stream.add_filter('collector', collector_name)  # maybe we want route-views4?

t_end = int(time.time())  # current time now
t_start = t_end - 3600  # the time interval (duration) we are getting from collector, i.e. 60*60 = 3600s = 1 hour
stream.add_interval_filter(t_start, t_end)

print "Total duration " + str(t_end-t_start) + " sec"

# Start the stream
stream.start()

### Insert loop ###
# This loop inserts new records and tries not to over count the records
# Get next record:
while(stream.get_next_record(rec)):
    # Print the record information only if it is not a valid record
    if rec.status != "valid":
        print rec.project, rec.collector, rec.type, rec.time, rec.status
    else:
        # Skip if rib
        if rec.type == "rib":
            continue
        # Get affected rows from insert
        affected_rows = db.affected_rows()
        # Skip if duplicate record
        if affected_rows <= 0:
            continue
        # Extract insert id of last inserted bgp record
        last_record_id = db.insert_id()
        print last_record_id
        # Traverse elements
        elem = rec.get_next_elem()
        while(elem):
            print last_record_id
            ## This should throw an exception
            if elem == None:
                continue
            # Insert element
            db_cursor.execute(
                """INSERT INTO bgp_elements
                (record_id_owner, element_time, peer_address, peer_asn)
                VALUES
                (
                '"""+str(last_record_id)+"""',
                '"""+str(elem.time)+"""',
                '"""+str(elem.peer_address)+"""',
                '"""+str(elem.peer_asn)+"""'
                )
                """)
            elem = rec.get_next_elem()
I'm not experienced with DBs in Python, but could it be that you need to run db.commit() after the cursor.execute() calls?
Also, check the validity of last_record_id: it might be passed as a non-number, which MySQL will reject; the same goes for element_time.
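A minimal sketch of both suggestions combined (same table and column names as in the question; the driver quotes the values, and the commit makes the INSERTs visible):

# Let MySQLdb escape the values, then commit: MySQLdb disables
# autocommit by default, so uncommitted INSERTs are lost.
db_cursor.execute(
    "INSERT INTO bgp_elements (record_id_owner, element_time, peer_address, peer_asn) "
    "VALUES (%s, %s, %s, %s)",
    (last_record_id, elem.time, elem.peer_address, elem.peer_asn))
db.commit()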
I am working on learning how to execute SQL in Python (I know SQL, not Python).
I have an external SQL file. It creates and inserts data into three tables: 'Zookeeper', 'Handles', 'Animal'.
Then I have a series of queries to run off the tables. The queries below are in the zookeeper.sql file that I load in at the top of the Python script. Examples for the first two are:
--1.1
SELECT ANAME,zookeepid
FROM ANIMAL, HANDLES
WHERE AID=ANIMALID;
--1.2
SELECT ZNAME, SUM(TIMETOFEED)
FROM ZOOKEEPER, ANIMAL, HANDLES
WHERE AID=ANIMALID AND ZOOKEEPID=ZID
GROUP BY zookeeper.zname;
These all execute fine in SQL. Now I need to execute them from within Python. I have been given and completed code to read in the file, then execute all the queries in a loop.
1.1 and 1.2 are where I am getting confused. I believe this is the line in the loop where I should put in something to run the first and then the second query:
result = c.execute("SELECT * FROM %s;" % table)
but what? I think I am missing something very obvious. I think what is throwing me off is % table. In queries 1.1 and 1.2, I am not creating a table, but rather looking for a query result.
My entire Python code is below.
import sqlite3
from sqlite3 import OperationalError

conn = sqlite3.connect('csc455_HW3.db')
c = conn.cursor()

# Open and read the file as a single buffer
fd = open('ZooDatabase.sql', 'r')
sqlFile = fd.read()
fd.close()

# all SQL commands (split on ';')
sqlCommands = sqlFile.split(';')

# Execute every command from the input file
for command in sqlCommands:
    # This will skip and report errors
    # For example, if the tables do not yet exist, this will skip over
    # the DROP TABLE commands
    try:
        c.execute(command)
    except OperationalError, msg:
        print "Command skipped: ", msg

# For each of the 3 tables, query the database and print the contents
for table in ['ZooKeeper', 'Animal', 'Handles']:
    # Plug in the name of the table into SELECT * query
    result = c.execute("SELECT * FROM %s;" % table)

    # Get all rows.
    rows = result.fetchall()

    # \n represents an end-of-line
    print "\n--- TABLE ", table, "\n"

    # This will print the name of the columns, padding each name up
    # to 22 characters. Note that comma at the end prevents new lines
    for desc in result.description:
        print desc[0].rjust(22, ' '),

    # End the line with column names
    print ""

    for row in rows:
        for value in row:
            # Print each value, padding it up with ' ' to 22 characters on the right
            print str(value).rjust(22, ' '),
        # End the values from the row
        print ""

c.close()
conn.close()
Your code already contains a beautiful way to execute all statements from a specified SQL file:
# Open and read the file as a single buffer
fd = open('ZooDatabase.sql', 'r')
sqlFile = fd.read()
fd.close()

# all SQL commands (split on ';')
sqlCommands = sqlFile.split(';')

# Execute every command from the input file
for command in sqlCommands:
    # This will skip and report errors
    # For example, if the tables do not yet exist, this will skip over
    # the DROP TABLE commands
    try:
        c.execute(command)
    except OperationalError, msg:
        print("Command skipped: ", msg)
Wrap this in a function and you can reuse it.
def executeScriptsFromFile(filename):
    # Open and read the file as a single buffer
    fd = open(filename, 'r')
    sqlFile = fd.read()
    fd.close()

    # all SQL commands (split on ';')
    sqlCommands = sqlFile.split(';')

    # Execute every command from the input file
    for command in sqlCommands:
        # This will skip and report errors
        # For example, if the tables do not yet exist, this will skip over
        # the DROP TABLE commands
        try:
            c.execute(command)
        except OperationalError, msg:
            print("Command skipped: ", msg)
To use it
executeScriptsFromFile('zookeeper.sql')
You said you were confused by
result = c.execute("SELECT * FROM %s;" % table);
In Python, you can add stuff to a string by using something called string formatting.
You have a string, "Some string with %s"; the %s is a placeholder for something else. To replace the placeholder, you add % ("what you want to replace it with") after the string.
For example:
a = "Hi, my name is %s and I have a %s hat" % ("Azeirah", "cool")
print(a)
>>> Hi, my name is Azeirah and I have a cool hat
Bit of a childish example, but it should be clear.
Now, what
result = c.execute("SELECT * FROM %s;" % table)
means is that it replaces %s with the value of the table variable, which is created in
for table in ['ZooKeeper', 'Animal', 'Handles']:
# for loop example
for fruit in ["apple", "pear", "orange"]:
    print(fruit)
>>> apple
>>> pear
>>> orange
If you have any additional questions, poke me.
A very simple way to read an external script into an sqlite database in python is using executescript():
import sqlite3
conn = sqlite3.connect('csc455_HW3.db')
with open('ZooDatabase.sql', 'r') as sql_file:
conn.executescript(sql_file.read())
conn.close()
First make sure that the table exists; if not, create it, then follow these steps.
import sqlite3
from sqlite3 import OperationalError

conn = sqlite3.connect('Client_DB.db')
c = conn.cursor()

def execute_sqlfile(filename):
    c.execute("CREATE TABLE clients_parameters (adress text, ie text)")

    fd = open(filename, 'r')
    sqlFile = fd.readlines()
    fd.close()

    lvalues = [tuple(v.split(';')) for v in sqlFile[1:]]

    try:
        c.executemany("INSERT INTO clients_parameters VALUES (?, ?)", lvalues)
    except OperationalError as msg:
        print("Command skipped: ", msg)

execute_sqlfile('clients.sql')
print(c.rowcount)
As far as I know, it is not possible directly.
Solution: import the .sql file on the MySQL server first, then
import mysql.connector
import pandas as pd
and then use the imported data by converting it to a DataFrame.
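A minimal sketch of that approach (the credentials and table name are made up for illustration, and the .sql file is assumed to have already been imported into the server):

import mysql.connector
import pandas as pd

# Hypothetical connection; the dump was loaded beforehand,
# e.g. with: mysql -u user -p mydb < dump.sql
conn = mysql.connector.connect(host="localhost", user="user",
                               passwd="passwd", database="mydb")
df = pd.read_sql("SELECT * FROM some_table", conn)
conn.close()
print(df.head())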
Somewhere in here lies a problem. http://paste.pocoo.org/show/528559/
Somewhere between lines 32 and 37. As you can see, the DELETE FROM is inside a for loop.
Running the script just makes the program go through the loop and exit, without actually removing any records.
Any help would be greatly appreciated! Thanks!
#!/usr/bin/env python
# encoding: utf-8
import os, os.path, MySQLdb, pprint, string

class MySQLclass(object):
    """Learning Classes"""
    def __init__(self, db):
        self.db = db
        self.cursor = self.db.cursor()

    def sversion(self):
        self.cursor.execute("SELECT VERSION()")
        row = self.cursor.fetchone()
        server_version = "server version:", row[0]
        return server_version

    def getRows(self, tbl):
        """ Returns the content of the table tbl """
        statmt = "select * from %s" % tbl
        self.cursor.execute(statmt)
        rows = list(self.cursor.fetchall())
        return rows

    def getEmailRows(self, tbl):
        """ Returns the content of the table tbl """
        statmt = "select email from %s" % tbl
        self.cursor.execute(statmt)
        rows = list(self.cursor.fetchall())
        return rows

    def removeRow(self, tbl, record):
        """ Remove specific record """
        print "Removing %s from table %s" % (record, tbl)
        print tbl
        self.cursor.execute("""DELETE FROM maillist_frogs where email LIKE %s""", (record,))

def main():
    ##### connections removed
    sql_frogs = MySQLclass(conn_frogs)
    sql_mailgust = MySQLclass(conn_mailgust)

    frogs_emails = sql_frogs.getEmailRows("emails")
    frogs_systemcatch = sql_frogs.getEmailRows("systemcatch")
    mailgust_emails = sql_mailgust.getEmailRows("maillist_frogs")

    aa = set(mailgust_emails)
    remove = aa.intersection(frogs_emails)
    remove = remove.union(aa.intersection(frogs_systemcatch))
    for x in remove:
        x = x[0]
        remove_mailgust = sql_mailgust.removeRow("maillist_frogs", x)

    conn_frogs.close()
    conn_mailgust.close()

if __name__ == '__main__':
    main()
The problem is python-mysqldb specific:
Starting with 1.2.0, MySQLdb disables autocommit by default, as required by the DB-API standard (PEP-249). If you are using InnoDB tables or some other type of transactional table type, you'll need to do connection.commit() before closing the connection, or else none of your changes will be written to the database.
Therefore, after the DELETE, you must call self.db.commit().
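For example, a minimal change along those lines (a sketch against the question's class):

def removeRow(self, tbl, record):
    """ Remove specific record and commit so the change is written """
    print "Removing %s from table %s" % (record, tbl)
    self.cursor.execute("DELETE FROM maillist_frogs WHERE email LIKE %s", (record,))
    self.db.commit()  # MySQLdb disables autocommit, so this persists the DELETE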
The removeRow() method does not return a value, yet remove_mailgust is expecting to receive this non-existent value.
Also, your removeRow() class method is statically fixed to only search the table maillist_frogs in its query. You should probably set the table name to accept the second parameter of the method, tbl.
Finally, your removeRow() method is comparing the value of a record (presumably an email address) using LIKE, which is typically used for looser string comparison. If email is the key you match on, an exact comparison is more appropriate. Note also that a table name cannot be passed as a query parameter (the driver would quote it as a string literal), so it has to be interpolated into the statement while the value stays parameterized:
self.cursor.execute("DELETE FROM %s WHERE email_id = %%s" % tbl, (record,))