Duplicate Insertions in Database using sqlite, sqlalchemy, python

I am learning Python and, through the help of online resources and people on this site, am getting the hang of it. In this first script of mine, in which I'm parsing Twitter RSS feed entries and inserting the results into a database, there is one remaining problem that I cannot fix. Namely, duplicate entries are being inserted into one of the tables.
As a bit of background, I originally found a base script on HalOtis.com for downloading RSS feeds and then modified it in several ways: 1) modified to account for idiosyncrasies in Twitter RSS feeds (entries are not separated into content, title, URL, etc.); 2) added tables for "hashtags" and for the many-to-many relationship (entry_tag table); 3) changed table set-up to sqlalchemy; 4) made some ad hoc changes to account for weird unicode problems that were occurring. As a result, the code is ugly in places, but it has been a good learning experience and now works--except that it keeps inserting duplicates in the "entries" table.
Since I'm not sure what would be most helpful to people, I've pasted in the entire code below, with some comments in a few places to point out what I think is most important.
I would really appreciate any help with this. Thanks!
Edit: Somebody suggested I provide a schema for the database. I've never done this before, so if I'm not doing it right, bear with me. I am setting up four tables:
RSSFeeds, which contains a list of Twitter RSS feeds
RSSEntries, which contains a list of individual entries downloaded (after parsing) from each of the feeds (with columns for content, hashtags, date, url)
Tags, which contains a list of all the hashtags that are found in individual entries (Tweets)
entry_tag, which contains columns allowing me to map tags to entries.
In short, the script below grabs the five test RSS feeds from the RSS Feeds table, downloads the 20 latest entries / tweets from each feed, parses the entries, and puts the information into the RSS Entries, Tags, and entry_tag tables.
#!/usr/local/bin/python
import sqlite3
import threading
import time
import Queue
from time import strftime
import re
from string import split
import feedparser
from django.utils.encoding import smart_str, smart_unicode
from sqlalchemy import schema, types, ForeignKey, select, orm
from sqlalchemy import create_engine
engine = create_engine('sqlite:///test98.sqlite', echo=True)
metadata = schema.MetaData(engine)
metadata.bind = engine
import datetime  # needed by now(); missing from the imports above
def now():
    return datetime.datetime.now()
#set up four tables, with many-to-many relationship
RSSFeeds = schema.Table('feeds', metadata,
schema.Column('id', types.Integer,
schema.Sequence('feeds_seq_id', optional=True), primary_key=True),
schema.Column('url', types.VARCHAR(1000), default=u''),
)
RSSEntries = schema.Table('entries', metadata,
schema.Column('id', types.Integer,
schema.Sequence('entries_seq_id', optional=True), primary_key=True),
schema.Column('feed_id', types.Integer, schema.ForeignKey('feeds.id')),
schema.Column('short_url', types.VARCHAR(1000), default=u''),
schema.Column('content', types.Text(), nullable=False),
schema.Column('hashtags', types.Unicode(255)),
schema.Column('date', types.String()),
)
tag_table = schema.Table('tag', metadata,
schema.Column('id', types.Integer,
schema.Sequence('tag_seq_id', optional=True), primary_key=True),
schema.Column('tagname', types.Unicode(20), nullable=False, unique=True),
)
entrytag_table = schema.Table('entrytag', metadata,
schema.Column('id', types.Integer,
schema.Sequence('entrytag_seq_id', optional=True), primary_key=True),
schema.Column('entryid', types.Integer, schema.ForeignKey('entries.id')),
schema.Column('tagid', types.Integer, schema.ForeignKey('tag.id')),
)
metadata.create_all(bind=engine, checkfirst=True)
# Insert test set of Twitter RSS feeds
stmt = RSSFeeds.insert()
stmt.execute(
{'url': 'http://twitter.com/statuses/user_timeline/14908909.rss'},
{'url': 'http://twitter.com/statuses/user_timeline/52903246.rss'},
{'url': 'http://twitter.com/statuses/user_timeline/41902319.rss'},
{'url': 'http://twitter.com/statuses/user_timeline/29950404.rss'},
{'url': 'http://twitter.com/statuses/user_timeline/35699859.rss'},
)
#These 3 lines for threading process (see HalOtis.com for example)
THREAD_LIMIT = 20
jobs = Queue.Queue(0)
rss_to_process = Queue.Queue(THREAD_LIMIT)
#connect to sqlite database and grab the 5 test RSS feeds
conn = engine.connect()
feeds = conn.execute('SELECT id, url FROM feeds').fetchall()
#This block contains all the parsing and DB insertion
def store_feed_items(id, items):
    """ Takes a feed_id and a list of items and stores them in the DB """
    for entry in items:
        conn.execute('SELECT id from entries WHERE short_url=?', (entry.link,))
        #note: entry.summary contains entire feed entry for Twitter,
        #i.e., not separated into content, etc.
        s = unicode(entry.summary)
        test = s.split()
        tinyurl2 = [i for i in test if i.startswith('http://')]
        hashtags2 = [i for i in s.split() if i.startswith('#')]
        content2 = ' '.join(i for i in s.split() if i not in tinyurl2+hashtags2)
        content = unicode(content2)
        tinyurl = unicode(tinyurl2)
        hashtags = unicode (hashtags2)
        print hashtags
        date = strftime("%Y-%m-%d %H:%M:%S",entry.updated_parsed)
        #Insert parsed feed data into entries table
        #THIS IS WHERE DUPLICATES OCCUR
        result = conn.execute(RSSEntries.insert(), {'feed_id': id, 'short_url': tinyurl,
            'content': content, 'hashtags': hashtags, 'date': date})
        entry_id = result.last_inserted_ids()[0]
        #Look up tag identifiers and create any that don't exist:
        tags = tag_table
        tag_id_query = select([tags.c.tagname, tags.c.id], tags.c.tagname.in_(hashtags2))
        tag_ids = dict(conn.execute(tag_id_query).fetchall())
        for tag in hashtags2:
            if tag not in tag_ids:
                result = conn.execute(tags.insert(), {'tagname': tag})
                tag_ids[tag] = result.last_inserted_ids()[0]
        #insert data into entrytag table
        if hashtags2: conn.execute(entrytag_table.insert(),
            [{'entryid': entry_id, 'tagid': tag_ids[tag]} for tag in hashtags2])
#Rest of file completes the threading process
def thread():
    while True:
        try:
            id, feed_url = jobs.get(False) # False = Don't wait
        except Queue.Empty:
            return
        entries = feedparser.parse(feed_url).entries
        rss_to_process.put((id, entries), True) # This will block if full
for info in feeds: # Queue them up
    jobs.put([info['id'], info['url']])
for n in xrange(THREAD_LIMIT):
    t = threading.Thread(target=thread)
    t.start()
while threading.activeCount() > 1 or not rss_to_process.empty():
    # That condition means we want to do this loop if there are threads
    # running OR there's stuff to process
    try:
        id, entries = rss_to_process.get(False, 1) # Wait for up to a second
    except Queue.Empty:
        continue
    store_feed_items(id, entries)

It looks like you included SQLAlchemy into a previously existing script that didn't use SQLAlchemy. There are too many moving parts here that none of us apparently understand well enough.
I would recommend starting from scratch. Don't use threading. Don't use sqlalchemy. To start, maybe don't even use an SQL database. Write a script that collects the information you want in the simplest possible way into a simple data structure, using simple loops and maybe a time.sleep(). Once that works, you can add storage to an SQL database. I really don't think writing SQL statements directly is much harder than using an ORM, and it's easier to debug IMHO. There is a good chance you will never need to add threading.
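As a rough sketch of the "plain SQL, no threads" route: the question's code runs a SELECT on short_url but never looks at the result before inserting, so one option is to let SQLite enforce uniqueness itself. The table layout, the choice of short_url as the unique key, and the store_entries helper below are assumptions for illustration, not the asker's actual schema.

```python
import sqlite3

def store_entries(conn, feed_id, entries):
    """Insert (short_url, content) pairs; rows whose short_url already
    exists are silently skipped thanks to the UNIQUE constraint."""
    for short_url, content in entries:
        conn.execute(
            "INSERT OR IGNORE INTO entries (feed_id, short_url, content) "
            "VALUES (?, ?, ?)",
            (feed_id, short_url, content),
        )
    conn.commit()

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries ("
    " id INTEGER PRIMARY KEY,"
    " feed_id INTEGER,"
    " short_url TEXT UNIQUE,"   # assumed unique key (hypothetical choice)
    " content TEXT NOT NULL)"
)

# Two "downloads" that overlap on one URL.
store_entries(conn, 1, [("http://a", "first"), ("http://b", "second")])
store_entries(conn, 1, [("http://a", "first"), ("http://c", "third")])
row_count = conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0]
```

Running the script repeatedly against the same feed would then leave one row per URL, provided each entry is stored as a single URL string rather than the string form of a list.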
"If you think you are smart enough to write multi-threaded programs, you're not." -- James Ahlstrom.

Related

"Maximum number of parameters" error with filter .in_(list) using pyodbc

One of our queries that was working in Python 2 + mxODBC is not working in Python 3 + pyodbc; it raises an error like this: Maximum number of parameters in the sql query is 2100. while connecting to SQL Server. Since both the printed queries have 3000 params, I thought it should fail in both environments, but clearly that doesn't seem to be the case here. In the Python 2 environment, both MSODBC 11 or MSODBC 17 works, so I immediately ruled out a driver related issue.
So my question is:
1. Is it correct to send a list as multiple params in SQLAlchemy because the param list will be proportional to the length of list? I think it looks a bit strange; I would have preferred concatenating the list into a single string because the DB doesn't understand the list datatype.
2. Are there any hints on why it would be working in mxODBC but not pyodbc? Does mxODBC optimize something that pyodbc does not? Please let me know if there are any pointers - I can try and paste more info here. (I am still new to debugging SQLAlchemy.)
Footnote: I have seen lot of answers that suggest to chunk the data, but because of 1 and 2, I wonder if I am doing the correct thing in the first place.
(Since it seems to be related to pyodbc, I have raised an internal issue in the official repository.)
import sqlalchemy
import sqlalchemy.orm
from sqlalchemy import MetaData, Table
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm.session import Session
Base = declarative_base()
create_tables = """
CREATE TABLE products(
idn NUMERIC(8) PRIMARY KEY
);
"""
check_tables = """
SELECT * FROM products;
"""
insert_values = """
INSERT INTO products
(idn)
values
(1),
(2);
"""
delete_tables = """
DROP TABLE products;
"""
engine = sqlalchemy.create_engine('mssql+pyodbc://user:password@dsn')
connection = engine.connect()
cursor = engine.raw_connection().cursor()
Session = sqlalchemy.orm.sessionmaker(bind=connection)
session = Session()
session.execute(create_tables)
metadata = MetaData(connection)
class Products(Base):
    __table__ = Table('products', metadata, autoload=True)
try:
    session.execute(check_tables)
    session.execute(insert_values)
    session.commit()
    query = session.query(Products).filter(
        Products.idn.in_(list(range(0, 3000)))
    )
    query.all()
    f = open("query.sql", "w")
    f.write(str(query))
    f.close()
finally:
    session.execute(delete_tables)
    session.commit()

When you do a straightforward .in_(list_of_values) SQLAlchemy renders the following SQL ...
SELECT team.prov AS team_prov, team.city AS team_city
FROM team
WHERE team.prov IN (?, ?)
... where each value in the IN clause is specified as a separate parameter value. pyodbc sends this to SQL Server as ...
exec sp_prepexec @p1 output,N'@P1 nvarchar(4),@P2 nvarchar(4)',N'SELECT team.prov AS team_prov, team.city AS team_city, team.team_name AS team_team_name
FROM team
WHERE team.prov IN (@P1, @P2)',N'AB',N'ON'
... so you hit the limit of 2100 parameters if your list is very long. Presumably, mxODBC inserted the parameter values inline before sending it to SQL Server, e.g.,
SELECT team.prov AS team_prov, team.city AS team_city
FROM team
WHERE team.prov IN ('AB', 'ON')
You can get SQLAlchemy to do that for you with
provinces = ["AB", "ON"]
stmt = (
session.query(Team)
.filter(
Team.prov.in_(sa.bindparam("p1", expanding=True, literal_execute=True))
)
.statement
)
result = list(session.query(Team).params(p1=provinces).from_statement(stmt))
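The question also mentions chunking as a commonly suggested alternative. A minimal sketch of that idea, independent of any database driver, might look like the following; the chunked helper and the batch size of 2000 are assumptions chosen only to stay below the 2100-parameter limit quoted in the error message.

```python
def chunked(values, size):
    """Yield successive slices of `values`, each at most `size` items long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(values), size):
        yield values[start:start + size]

# Same 3000 ids as in the question; 2000 per batch is an arbitrary
# choice that keeps each statement under the 2100-parameter limit.
ids = list(range(3000))
batches = list(chunked(ids, 2000))
batch_sizes = [len(b) for b in batches]
```

Each batch could then be passed to its own .in_(batch) query and the results concatenated, at the cost of one round trip per batch.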

Python Pony ORM Insert multiple values at once

I'm trying to insert multiple values into my postgres database using Pony ORM. My current approach is very inefficient:
from pony.orm import *
db = Database()
class Names(db.Entity):
    first_name = Optional(str)
    last_name = Optional(str)
family = [["Peter", "Mueller"], ["Paul", "Meyer"], ...]
@db_session
def populate_names(name_list):
    for name in name_list:
        db.insert("Names", first_name=name[0], last_name=name[1])
if __name__ == "__main__":
    db.bind(provider='postgres', user='', password='', host='', database='')
    db.generate_mappings(create_tables=True)
    populate_names(family)
This is just a short example but the structure of the input is the same:
a list of lists.
I'm extracting the data from several XML files and inserting one "file" at a time.
Does anyone have an idea how to put several rows of data into one insert query in Pony ORM?

Pony doesn't provide something special for this, you can use execute_values from psycopg2.extras. Get connection object from db to use it.
from psycopg2.extras import execute_values
...
names = [
('はると', '一温'),
('りく', '俐空'),
('はる', '和晴'),
('ひなた', '向日'),
('ゆうと', '佑篤')
]
@db_session
def populate_persons(names):
    sql = 'insert into Person(first_name, last_name) values %s'
    con = db.get_connection()
    cur = con.cursor()
    execute_values(cur, sql, names)
populate_persons(names)
execute_values is in the Fast execution helpers list, so I think it should be the most efficient way.

Currently I'm experimenting with PonyORM for a future project and came to the same conclusion you did.
The only way to insert data in bulk is:
# assuming data has this structure:
# [['foo','bar','bazooka'],...]
@db_session
def insert_bulk_array(field1, field2, field3):
    MyClass(field1=field1, field2=field2, field3=field3)
# assuming the data is:
# {'field1':'foo','field2':'bar','field3':'bazooka'}
@db_session
def insert_bulk_dict(data):
    MyClass(**data)
But from my point of view this is still fairly handy, especially when your data comes as JSON.

There is an open issue in the issue tracker of PonyORM which asks for exactly this feature.
I recommend voting for it.

Python - Collect data from APIs, Cache and Push To DB - in Parallel

I'm having a hard time figuring out how to develop phase 3 of this algorithm:
1. Fetch data from a series of APIs
2. Store the data in the script until a certain condition is reached (cache and don't disturb the DB)
3. Push that structured data to a database AND at the same time continue with 1 (launch 1 without waiting for the upload to the DB to complete; the two things should run in parallel)
import requests
import time
from sqlalchemy import schema, types
from sqlalchemy.engine import create_engine
import threading
# I usually work on postgres
meta = schema.MetaData(schema="example")
# table one
table_api_one = schema.Table('api_one', meta,
schema.Column('id', types.Integer, primary_key=True),
schema.Column('field_one', types.Unicode(255), default=u''),
schema.Column('field_two', types.BigInteger()),
)
# table two
table_api_two = schema.Table('api_two', meta,
schema.Column('id', types.Integer, primary_key=True),
schema.Column('field_one', types.Unicode(255), default=u''),
schema.Column('field_two', types.BigInteger()),
)
# create tables
engine = create_engine("postgres://......", echo=False, pool_size=15, max_overflow=15)
meta.bind = engine
meta.create_all(checkfirst=True)
# get the data from the API and return data as JSON
def getdatafrom(url):
    data = requests.get(url)
    structured = data.json()
    return structured
# push the data to the DB
def flush(list_one,list_two):
    connection = engine.connect()
    # both lists are list of json
    connection.execute(table_api_one.insert(),list_one)
    connection.execute(table_api_two.insert(),list_two)
    connection.close()
# start doing something
def main():
    timesleep = 30
    flush_limit = 10
    threading.Timer(timesleep * flush_limit, main).start()
    data_api_one = []
    data_api_two = []
    # repeat the process 10 times (flush_limit) to avoid keeping the DB too busy
    while len(data_api_one) < flush_limit and len(data_api_two) < flush_limit:
        data_api_one.append(getdatafrom("http://www.apiurlone.com/api...").copy())
        data_api_two.append(getdatafrom("http://www.apiurltwo.com/api...").copy())
        time.sleep(timesleep)
    # push the data when the limit is reached
    flush(data_api_one,data_api_two)
# start the example
main()
In this example script, main() is relaunched on a timer every 10 * 30 seconds (to avoid overlapping threads),
but with this algorithm the script stops collecting data from the APIs while flush() is running.
How is it possible to flush and keep getting data from the APIs continuously?
thanks!

Usual approach is a Queue object (from module named Queue or queue, depending on Python version).
Create a producer function (running in one thread) which collects api data and when flushing puts it in the queue and a consumer function running in another thread waiting to get the data from the queue and store it to the database.
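A minimal sketch of that producer/consumer layout, using the Python 3 queue module, could look like this. The fetch and store callables stand in for the API request and the database insert, and FLUSH_LIMIT, STOP, and the function names are all hypothetical.

```python
import queue
import threading

FLUSH_LIMIT = 3    # hypothetical batch size before handing data off
STOP = object()    # sentinel that tells the consumer to exit

def producer(fetch, rounds, out_queue):
    """Call fetch() repeatedly and hand off full batches through the queue."""
    batch = []
    for _ in range(rounds):
        batch.append(fetch())
        if len(batch) >= FLUSH_LIMIT:
            out_queue.put(batch)   # hand off, then keep collecting immediately
            batch = []
    if batch:
        out_queue.put(batch)       # final partial batch
    out_queue.put(STOP)

def consumer(in_queue, store):
    """Block on the queue and pass each batch to store() until STOP arrives."""
    while True:
        batch = in_queue.get()
        if batch is STOP:
            break
        store(batch)

# Stand-ins: an integer counter instead of an API, a list instead of a DB.
stored = []
counter = iter(range(100))
q = queue.Queue()
worker = threading.Thread(target=consumer, args=(q, stored.append))
worker.start()
producer(lambda: next(counter), 7, q)
worker.join()
```

Because the producer only calls put() and moves on, a slow store() delays the consumer thread but not the collection loop.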

Close SQLAlchemy connection

I have the following function in python:
def add_odm_object(obj, table_name, primary_key, unique_column):
    db = create_engine('mysql+pymysql://root:@127.0.0.1/mydb')
    metadata = MetaData(db)
    t = Table(table_name, metadata, autoload=True)
    s = t.select(t.c[unique_column] == obj[unique_column])
    rs = s.execute()
    r = rs.fetchone()
    if not r:
        i = t.insert()
        i_res = i.execute(obj)
        v_id = i_res.inserted_primary_key[0]
        return v_id
    else:
        return r[primary_key]
This function checks whether the object obj is in the database and, if it is not found, saves it to the DB. Now I have a problem: I call the above function in a loop many times, and after a few hundred calls I get an error: user root has exceeded the max_user_connections resource (current value: 30). I searched for answers; for example, the question "How to close sqlalchemy connection in MySQL" recommends creating a conn = db.connect() object, where db is the engine, and calling conn.close() after my query is completed.
But, where should I open and close the connection in my code? I am not working with the connection directly, but I'm using the Table() and MetaData functions in my code.

The engine is an expensive-to-create factory for database connections. Your application should call create_engine() exactly once per database server.
Similarly, the MetaData and Table objects describe a fixed schema object within a known database. These are also configurational constructs that in most cases are created once, just like classes, in a module.
In this case, your function seems to want to load up tables dynamically, which is fine; the MetaData object acts as a registry, which has the convenience feature that it will give you back an existing table if it already exists.
Within a Python function and especially within a loop, for best performance you typically want to refer to a single database connection only.
Taking these things into account, your module might look like:
# module level variable. can be initialized later,
# but generally just want to create this once.
db = create_engine('mysql+pymysql://root:@127.0.0.1/mydb')
# module level MetaData collection.
metadata = MetaData()
def add_odm_object(obj, table_name, primary_key, unique_column):
    with db.begin() as connection:
        # will load table_name exactly once, then store it persistently
        # within the above MetaData
        t = Table(table_name, metadata, autoload=True, autoload_with=connection)
        s = t.select(t.c[unique_column] == obj[unique_column])
        rs = connection.execute(s)
        r = rs.fetchone()
        if not r:
            # obj is the dict of column values, as in the question
            i_res = connection.execute(t.insert(), obj)
            v_id = i_res.inserted_primary_key[0]
            return v_id
        else:
            return r[primary_key]
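The same "create once, reuse in the loop" shape can be tried without a MySQL server. The sketch below swaps in the standard library's sqlite3 module for SQLAlchemy, so it only illustrates the structure; the items table and the get_or_create helper are hypothetical.

```python
import sqlite3

# Created once at module level and reused by every call below.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

def get_or_create(connection, name):
    """Return the id of the row with this name, inserting it if missing."""
    row = connection.execute(
        "SELECT id FROM items WHERE name = ?", (name,)
    ).fetchone()
    if row is not None:
        return row[0]
    cur = connection.execute("INSERT INTO items (name) VALUES (?)", (name,))
    connection.commit()
    return cur.lastrowid

# 600 calls in a loop, all sharing the single connection above.
ids = [get_or_create(conn, n) for n in ["a", "b", "a"] * 200]
```

However many times the loop runs, only one connection is ever opened, which is the property that avoids the max_user_connections error in the question.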

How to execute raw SQL in Flask-SQLAlchemy app

How do you execute raw SQL in SQLAlchemy?
I have a python web app that runs on flask and interfaces to the database through SQLAlchemy.
I need a way to run the raw SQL. The query involves multiple table joins along with Inline views.
I've tried:
connection = db.session.connection()
connection.execute( <sql here> )
But I keep getting gateway errors.

Have you tried:
result = db.engine.execute("<sql here>")
or:
from sqlalchemy import text
sql = text('select name from penguins')
result = db.engine.execute(sql)
names = [row[0] for row in result]
print(names)
Note that db.engine.execute() is "connectionless" execution, which was deprecated in SQLAlchemy 1.4 and removed in 2.0.

SQLAlchemy session objects have their own execute method:
result = db.session.execute('SELECT * FROM my_table WHERE my_column = :val', {'val': 5})
All your application queries should be going through a session object, whether they're raw SQL or not. This ensures that the queries are properly managed by a transaction, which allows multiple queries in the same request to be committed or rolled back as a single unit. Going outside the transaction using the engine or the connection puts you at much greater risk of subtle, possibly hard to detect bugs that can leave you with corrupted data. Each request should be associated with only one transaction, and using db.session will ensure this is the case for your application.
Also take note that execute is designed for parameterized queries. Use parameters, like :val in the example, for any inputs to the query to protect yourself from SQL injection attacks. You can provide the value for these parameters by passing a dict as the second argument, where each key is the name of the parameter as it appears in the query. The exact syntax of the parameter itself may be different depending on your database, but all of the major relational databases support them in some form.
Assuming it's a SELECT query, this will return an iterable of RowProxy objects.
You can access individual columns with a variety of techniques:
for r in result:
    print(r[0]) # Access by positional index
    print(r['my_column']) # Access by column name as a string
    r_dict = dict(r.items()) # convert to dict keyed by column names
Personally, I prefer to convert the results into namedtuples:
from collections import namedtuple
Record = namedtuple('Record', result.keys())
records = [Record(*r) for r in result.fetchall()]
for r in records:
    print(r.my_column)
    print(r)
If you're not using the Flask-SQLAlchemy extension, you can still easily use a session:
import sqlalchemy
from sqlalchemy.orm import sessionmaker, scoped_session
engine = sqlalchemy.create_engine('my connection string')
Session = scoped_session(sessionmaker(bind=engine))
s = Session()
result = s.execute('SELECT * FROM my_table WHERE my_column = :val', {'val': 5})
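The parameter-binding behaviour described in this answer can be observed without Flask or SQLAlchemy. The sketch below uses the standard library's sqlite3 module, which accepts the same :val named style; the table contents and the hostile input string are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_column TEXT)")
conn.executemany(
    "INSERT INTO my_table (my_column) VALUES (:val)",
    [{"val": "alice"}, {"val": "bob"}],
)

query = "SELECT my_column FROM my_table WHERE my_column = :val"

# A hostile-looking input is sent as a bound value, so it is compared as
# plain data and never parsed as SQL.
hostile = "alice' OR '1'='1"
safe_rows = conn.execute(query, {"val": hostile}).fetchall()

# An ordinary value matches the one row that holds it.
normal_rows = conn.execute(query, {"val": "alice"}).fetchall()
```

Had the hostile string been pasted into the SQL text with string formatting instead, the OR clause would have become part of the query.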

docs: SQL Expression Language Tutorial - Using Text
example:
from sqlalchemy.sql import text
connection = engine.connect()
# recommended
cmd = 'select * from Employees where EmployeeGroup = :group'
employeeGroup = 'Staff'
employees = connection.execute(text(cmd), group = employeeGroup)
# or - wee more difficult to interpret the command
employeeGroup = 'Staff'
employees = connection.execute(
text('select * from Employees where EmployeeGroup = :group'),
group = employeeGroup)
# or - notice the requirement to quote 'Staff'
employees = connection.execute(
text("select * from Employees where EmployeeGroup = 'Staff'"))
for employee in employees: logger.debug(employee)
# output
(0, 'Tim', 'Gurra', 'Staff', '991-509-9284')
(1, 'Jim', 'Carey', 'Staff', '832-252-1910')
(2, 'Lee', 'Asher', 'Staff', '897-747-1564')
(3, 'Ben', 'Hayes', 'Staff', '584-255-2631')

You can get the results of SELECT SQL queries using from_statement() and text() as shown here. You don't have to deal with tuples this way. As an example for a class User having the table name users you can try,
from sqlalchemy.sql import text
user = session.query(User).from_statement(
text("""SELECT * FROM users where name=:name""")
).params(name="ed").all()
return user

For SQLAlchemy ≥ 1.4
Starting in SQLAlchemy 1.4, connectionless or implicit execution has been deprecated, i.e.
db.engine.execute(...) # DEPRECATED
as well as bare strings as queries.
The new API requires an explicit connection, e.g.
from sqlalchemy import text
with db.engine.connect() as connection:
    result = connection.execute(text("SELECT * FROM ..."))
    for row in result:
        # ...
Similarly, it’s encouraged to use an existing Session if one is available:
result = session.execute(sqlalchemy.text("SELECT * FROM ..."))
or using parameters:
session.execute(sqlalchemy.text("SELECT * FROM a_table WHERE a_column = :val"),
{'val': 5})
See "Connectionless Execution, Implicit Execution" in the documentation for more details.

result = db.engine.execute(text("<sql here>"))
executes the <sql here> but doesn't commit it unless you're on autocommit mode. So, inserts and updates wouldn't reflect in the database.
To commit after the changes, do
result = db.engine.execute(text("<sql here>").execution_options(autocommit=True))

This is a simplified answer of how to run SQL query from Flask Shell
First, map your module (if your module/app is manage.py in the principal folder and you are in a UNIX Operating system), run:
export FLASK_APP=manage
Run Flask shell
flask shell
Import what we need:
from flask import Flask
from flask_sqlalchemy import SQLAlchemy
db = SQLAlchemy(app)
from sqlalchemy import text
Run your query:
result = db.engine.execute(text("<sql here>").execution_options(autocommit=True))
This uses the application's current database connection.

Flask-SQLAlchemy v: 3.0.x / SQLAlchemy v: 1.4
users = db.session.execute(db.select(User).order_by(User.title.desc()).limit(150)).scalars()
So for the latest stable version of Flask-SQLAlchemy, the documentation suggests using the session.execute() method together with db.select(Object).

Have you tried using connection.execute(text( <sql here> ), <bind params here> ) and bind parameters as described in the docs? This can help solve many parameter formatting and performance problems. Maybe the gateway error is a timeout? Bind parameters tend to make complex queries execute substantially faster.

If you want to avoid tuples, another way is by calling the first, one or all methods:
query = db.engine.execute("SELECT * FROM blogs "
"WHERE id = 1 ")
assert query.first().name == "Welcome to my blog"
