Python - Collect data from APIs, Cache and Push To DB - in Parallel

I'm having a hard time figuring out how to develop phase 3 of this algorithm:
1. Fetch data from a series of APIs
2. Store the data in the script until a certain condition is reached (cache, and don't disturb the DB)
3. Push that structured data to a database AND at the same time continue with step 1 (start step 1 again without waiting for the DB upload to complete; the two things should run in parallel)
import requests
import time
from sqlalchemy import schema, types
from sqlalchemy.engine import create_engine
import threading
# I usually work on postgres
meta = schema.MetaData(schema="example")
# table one
table_api_one = schema.Table('api_one', meta,
    schema.Column('id', types.Integer, primary_key=True),
    schema.Column('field_one', types.Unicode(255), default=u''),
    schema.Column('field_two', types.BigInteger()),
)
# table two
table_api_two = schema.Table('api_two', meta,
    schema.Column('id', types.Integer, primary_key=True),
    schema.Column('field_one', types.Unicode(255), default=u''),
    schema.Column('field_two', types.BigInteger()),
)
# create tables
engine = create_engine("postgres://......", echo=False, pool_size=15, max_overflow=15)
meta.bind = engine
meta.create_all(checkfirst=True)
# get the data from the API and return data as JSON
def getdatafrom(url):
    data = requests.get(url)
    structured = data.json()
    return structured
# push the data to the DB
def flush(list_one, list_two):
    connection = engine.connect()
    # both lists are lists of JSON rows (dicts)
    connection.execute(table_api_one.insert(), list_one)
    connection.execute(table_api_two.insert(), list_two)
    connection.close()
# start doing something
def main():
    timesleep = 30
    flush_limit = 10
    threading.Timer(timesleep * flush_limit, main).start()
    data_api_one = []
    data_api_two = []
    # repeat the process 10 times (flush_limit), avoiding keeping the DB too busy
    while len(data_api_one) < flush_limit and len(data_api_two) < flush_limit:
        data_api_one.append(getdatafrom("http://www.apiurlone.com/api...").copy())
        data_api_two.append(getdatafrom("http://www.apiurltwo.com/api...").copy())
        time.sleep(timesleep)
    # push the data when the limit is reached
    flush(data_api_one, data_api_two)
# start the example
main()
In this example script, a new main() thread is launched every 10 * 30 seconds (to avoid overlapping threads), but during the flush() the script stops collecting data from the APIs.
How is it possible to flush and keep fetching data from the APIs continuously?
thanks!

The usual approach is a Queue object (from the module named Queue or queue, depending on your Python version).
Create a producer function (running in one thread) that collects the API data and, when it is time to flush, puts it on the queue, and a consumer function (running in another thread) that waits to get the data from the queue and stores it in the database.
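Below is a minimal sketch of that producer/consumer layout, reusing the getdatafrom() and flush() functions defined in the question (the URLs are the same placeholders, and the queue module name assumes Python 3):
import queue
import threading
import time

flush_queue = queue.Queue()

def producer():
    # collect API responses and hand each full batch to the consumer via the queue
    timesleep = 30
    flush_limit = 10
    while True:
        data_api_one = []
        data_api_two = []
        while len(data_api_one) < flush_limit and len(data_api_two) < flush_limit:
            data_api_one.append(getdatafrom("http://www.apiurlone.com/api..."))
            data_api_two.append(getdatafrom("http://www.apiurltwo.com/api..."))
            time.sleep(timesleep)
        # put() returns immediately, so the next fetch cycle starts without waiting for the DB
        flush_queue.put((data_api_one, data_api_two))

def consumer():
    # runs in its own thread: block until a batch arrives, then push it to the DB
    while True:
        list_one, list_two = flush_queue.get()
        flush(list_one, list_two)
        flush_queue.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
Because flush() runs in the consumer thread, the producer keeps collecting from the APIs while the previous batch is being written.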

Related

Multi Processing with sqlalchemy

I have a python script that handles data transactions through sqlalchemy using:
def save_update(args):
    session, engine = create_session(config["DATABASE"])
    try:
        instance = get_record(session)
        if instance is None:
            instance = create_record(session)
        else:
            instance = update_record(session, instance)
        sync_errors(session, instance)
        sync_expressions(session, instance)
        sync_part(session, instance)
        session.commit()
    except:
        session.rollback()
        write_error(config)
        raise
    finally:
        session.close()
On top of the data transactions, I also have some processing unrelated to the database: data preparation before I can do my data transaction. Those prerequisite tasks take some time, so I wanted to execute multiple instances of this full script in parallel (data preparation + data transactions with sqlalchemy).
I am thus doing the following in a different script (simplified example here):
process1 = Thread(target=call_script, args=[["python", python_file_path,
                                             "-xml", xml_path,
                                             "-o", args.outputFolder,
                                             "-l", log_path]])
process2 = Thread(target=call_script, args=[["python", python_file_path,
                                             "-xml", xml_path,
                                             "-o", args.outputFolder,
                                             "-l", log_path]])
process1.start()
process2.start()
process1.join()
process2.join()
The target function call_script executes the first script mentioned above (data preparation + data transactions with sqlalchemy):
def call_script(args):
    status = subprocess.call(args, shell=True)
    print(status)
So, to summarize, I will for instance have 2 sub threads plus the main one running. Each of those sub threads executes the sqlalchemy code in a separate process.
My question is thus: should I be taking care of any specific considerations regarding the multiprocessing side of my code with sqlalchemy? For me the answer is no, as this is multiprocessing and not multithreading, exclusively due to the fact that I use subprocess.call() to execute my code.
Now in reality, from time to time I feel I am hitting database locks during execution. I am not sure whether this is related to my code or whether someone else is hitting the database while I am processing it, but I was expecting each subprocess to lock the database when it starts its work, so that the other subprocesses would wait for the current session to close.
EDIT
I have used multi processing to replace multi threading for testing:
processes = [subprocess.Popen(cmd[0], shell=True) for cmd in commands]
I still have the same issue, on which I got more details:
I see SQL Server showing the status "AWAITING COMMAND", and this only goes away when I kill the related python process executing the command.
I feel it appears when I am intensely parallelizing the subprocesses, but I am really not sure.
Thanks in advance for any support.
This is an interesting situation. It seems that maybe you can sidestep some of the manual process/thread handling and use something like multiprocessing's Pool. I made an example based on some other data-initializing code I had. It delegates creating test data and inserting it for each of 10 "devices" to a pool of 3 processes. One caveat that seems necessary is to dispose of the engine before it is shared across fork(), i.e. before the Pool tasks are created; this is mentioned here: engine-disposal
from random import randint
from datetime import datetime
from multiprocessing import Pool
from sqlalchemy import (
    create_engine,
    Integer,
    DateTime,
    String,
)
from sqlalchemy.schema import (
    Column,
    MetaData,
    ForeignKey,
)
from sqlalchemy.orm import declarative_base, relationship, Session, backref
db_uri = 'postgresql+psycopg2://username:password@/database'
engine = create_engine(db_uri, echo=False)
metadata = MetaData()
Base = declarative_base(metadata=metadata)
class Event(Base):
    __tablename__ = "events"
    id = Column(Integer, primary_key=True, index=True)
    created_on = Column(DateTime, nullable=False, index=True)
    device_id = Column(Integer, ForeignKey('devices.id'), nullable=True)
    device = relationship('Device', backref=backref("events"))

class Device(Base):
    __tablename__ = "devices"
    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String(50))
def get_test_data(device_num):
    """ Generate a test device and its test events for the given device number. """
    device_dict = dict(name=f'device-{device_num}')
    event_dicts = []
    for day in range(1, 5):
        for hour in range(0, 24):
            for _ in range(0, randint(0, 50)):
                event_dicts.append({
                    "created_on": datetime(day=day, month=1, year=2022, hour=hour),
                })
    return (device_dict, event_dicts)
def create_test_data(device_num):
    """ Actually write the test data to the database. """
    device_dict, event_dicts = get_test_data(device_num)
    print(f"creating test data for {device_dict['name']}")
    with Session(engine) as session:
        device = Device(**device_dict)
        session.add(device)
        session.flush()
        events = [Event(**event_dict) for event_dict in event_dicts]
        event_count = len(events)
        device.events.extend(events)
        session.add_all(events)
        session.commit()
    return event_count

if __name__ == '__main__':
    metadata.create_all(engine)
    # Throw this away before fork.
    engine.dispose()
    # I have a 4-core processor, so I chose 3.
    with Pool(3) as p:
        print(p.map(create_test_data, range(0, 10)))
    # Accessing engine here should still work
    # but a new connection will be created.
    with Session(engine) as session:
        print(session.query(Event).count())
Output
creating test data for device-1
creating test data for device-0
creating test data for device-2
creating test data for device-3
creating test data for device-4
creating test data for device-5
creating test data for device-6
creating test data for device-7
creating test data for device-8
creating test data for device-9
[2511, 2247, 2436, 2106, 2244, 2464, 2358, 2512, 2267, 2451]
23596
I am answering my own question, as it did not relate to SQLAlchemy at all in the end.
When executing:
processes = [subprocess.Popen(cmd[0], shell=True) for cmd in commands]
On a specific batch, and for no obvious reason, one of the subprocesses was not exiting properly even though the script it called was reaching the end.
I searched and saw that this is a known issue when using p.wait() with Popen and shell=True.
I set shell=False, piped stdout and stderr, and also added a sys.exit(0) at the end of the python script being executed by the subprocess to make sure it terminates the execution properly.
Hope it can help someone else! Thanks Ian for your support.
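For reference, a minimal sketch of that change; the script path and arguments below are placeholders standing in for the real ones built earlier in the script:
import subprocess

# placeholder command lines; in the real script these are assembled from the parsed arguments
commands = [
    ["python", "process_data.py", "-xml", "input1.xml", "-o", "out/", "-l", "run1.log"],
    ["python", "process_data.py", "-xml", "input2.xml", "-o", "out/", "-l", "run2.log"],
]

# shell=False and explicit pipes, so each child can exit cleanly when its work is done
processes = [
    subprocess.Popen(cmd, shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    for cmd in commands
]
for p in processes:
    out, err = p.communicate()  # drain the pipes and wait for the child to terminate
    print(p.returncode)
The called script itself ends with sys.exit(0) so that it terminates explicitly.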

How to create a database engine in each Dask subprocess to parallelize thousands of SQL queries, without recreating the engine for every query

I need to embarrassingly parallelize the fetch job for thousands of SQL queries from the database.
Here is the simplified example.
##Env info: python=3.7 postgresql=10 dask=latest
##generate the example db table.
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
engine = create_engine('postgresql://dbadmin:dbadmin@server:5432/db01')
data = pd.DataFrame(np.random.randint(0,100 , size=(30000,5)),columns=['a','b','c','d','e'])
data.to_sql('tablename',engine,index=True,if_exists='append')
First, this is the basic example without dask parallelism.
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
engine = create_engine('postgresql://dbadmin:dbadmin@server:5432/db01')
def job(indexstr):
    'send the query, fetch the data, do some calculation and return'
    sql = 'select * from public.tablename where index=' + indexstr
    df = pd.read_sql_query(sql, engine, index_col='index')
    ##get the data and do some analysis.
    return np.sum(df.values)
lists = []
for v in range(1000):
    lists.append(job(str(v)))
### wall time:17s
It's not as fast as we might imagine, since both the database query and the data analysis take time while CPUs sit idle.
Then I tried to use dask to parallelize it like this:
def jobWithEngine(indexstr):
    'engine cannot be serialized between processes, thus each process creates its own'
    engine = create_engine('postgresql://dbadmin:dbadmin@server:5432/db01')
    sql = 'select * from public.tablename where index=' + indexstr
    df = pd.read_sql_query(sql, engine, index_col='index')
    return np.sum(df.values)
import dask
dask.config.set(scheduler='processes')
import dask.bag as db
dbdata = db.from_sequence([str(v) for v in range(1000)])
dbdata = dbdata.map(lambda x: jobWithEngine(x))
results_bag = dbdata.compute()
###Wall time:1min8s
The problem is that I find engine creation takes extra time, and there are thousands of them.
The engine will be recreated for every SQL query, which is really costly and might crash the database service!
So I guess there must be a more elegant way, like this:
import dask
dask.config.set(scheduler='processes')
import dask.bag as db
dbdata=db.from_sequence([str(v) for v in range(1000)])
dbdata=dbdata.map(lambda x:job(x,init=create_engine))
results_bag = dbdata.compute()
1. The main process creates 8 sub-processes.
2. Each sub-process creates its own engine to initialize the job preparation.
3. The main process then sends them 1000 jobs and gets the 1000 returns.
4. After everything is done, the sub-process engines are stopped and the sub-processes are killed.
Or has dask already done this, and does the additional time come from communication between processes?
You can do this by setting a connected database as a variable for each worker using get_worker
from dask.distributed import get_worker
def connect_worker_db(db):
    worker = get_worker()
    worker.db = db  # DB settings, password, username etc
    worker.db.connect()  # Function that connects the database, e.g. create_engine()
Then have the client run the connect_worker_db:
from dask.distributed import Client, get_worker
client = Client()
client.run(connect_worker_db, db)
For the function using the connection, like jobWithEngine(), you have to get the worker and use the attribute you saved it to:
def jobWithEngine():
    db = get_worker().db
Then make sure to disconnect at the end:
def disconnect_worker_db():
    worker = get_worker()
    worker.db.disconnect()
client.run(disconnect_worker_db)
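Putting those pieces together, here is a minimal sketch assuming a local dask.distributed cluster, the placeholder connection string from the question, and a SQLAlchemy engine stored on each worker instead of a generic db object:
from dask.distributed import Client, get_worker
from sqlalchemy import create_engine
import pandas as pd
import numpy as np

def connect_worker_db(db_uri):
    # create one engine per worker and keep it on the worker object for reuse
    get_worker().engine = create_engine(db_uri)

def disconnect_worker_db():
    # dispose of the per-worker connection pool when the work is finished
    get_worker().engine.dispose()

def job_with_worker_engine(indexstr):
    # reuse the engine attached to this worker instead of creating a new one per query
    engine = get_worker().engine
    sql = 'select * from public.tablename where index=' + indexstr
    df = pd.read_sql_query(sql, engine, index_col='index')
    return np.sum(df.values)

if __name__ == '__main__':
    client = Client()  # local cluster of worker processes
    client.run(connect_worker_db, 'postgresql://dbadmin:dbadmin@server:5432/db01')
    futures = client.map(job_with_worker_engine, [str(v) for v in range(1000)])
    results = client.gather(futures)
    client.run(disconnect_worker_db)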
Amy's answer has the benefit of being simple, but if for any reason dask starts new workers, they will not have .db.
I don't know when it was first introduced, but Dask 1.12.2 has Client.register_worker_callbacks, which takes a function as a parameter and is intended for this kind of use. If the callback takes a parameter called dask_worker, then the worker itself will be passed to it.
def main():
    dask_client = dask.distributed.Client(cluster)
    db = dict(
        host="db-host",
        username="user",
        # etc etc
    )

    def worker_setup(dask_worker: dask.distributed.Worker):
        dask_worker.db = db

    dask_client.register_worker_callbacks(worker_setup)
https://distributed.dask.org/en/latest/api.html#distributed.Client.register_worker_callbacks
However, this doesn't close the db connections at the end. You will probably be covered by client.run(disconnect_worker_db), but I have seen some workers not releasing their resources. Fixing this in a more comprehensive manner needs a bit more code, as per https://distributed.dask.org/en/latest/api.html#distributed.Client.register_worker_plugin
import os
import dask.distributed

class MyWorkerPlugin(dask.distributed.WorkerPlugin):
    def __init__(self, *args, **kwargs):
        self.db = kwargs.get("db")
        assert self.db, "no db"

    def setup(self, worker: dask.distributed.Worker):
        worker.db = self.db

    def teardown(self, worker: dask.distributed.Worker):
        print(f"worker {worker.name} teardown")
        # eg db.disconnect()

def main():
    cluster = dask.distributed.LocalCluster(
        n_workers=os.cpu_count(),
        threads_per_worker=2,
    )
    dask_client = dask.distributed.Client(cluster)
    db = dict(
        host="db-host",
        username="user",
        # etc etc
    )
    dask_client.register_worker_plugin(MyWorkerPlugin, "set-dbs", db=db)
    dask_client.start()
You can give the plugin somewhat helpful names, and pass in kwargs to be used in the plugin's __init__.

Issues with concurrent inserts on Redshift table

I am trying to concurrently process inserts/updates into a Redshift database using a python script on AWS Glue. I am using the pg8000 library to do all my database operations. The concurrent insert/update fails with an error (Error Name: 1023, Error State: XX000). While researching the error I found that it was related to Serializable Isolation.
Can anyone look at the code and ensure that there would not be clashes while the insert/update happens?
I tried using a random sleep time within the calling class. It worked for a couple of cases, but then failed for an insert/update case as the number of workers increased.
import sys
import time
import concurrent.futures
import pg8000
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME','REDSHIFT_HOST','REDSHIFT_PORT','REDSHIFT_DB','REDSHIFT_USER_NAME','REDSHIFT_USER_PASSWORD'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
job_run_id = args['JOB_RUN_ID']
maximum_workers = 5
def executeSql(sqlStmt):
    conn = pg8000.connect(database=args['REDSHIFT_DB'], user=args['REDSHIFT_USER_NAME'], password=args['REDSHIFT_USER_PASSWORD'], host=args['REDSHIFT_HOST'], port=int(args['REDSHIFT_PORT']))
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute(sqlStmt)
    cur.close()
    conn.close()
def executeSqlProcedure(procedureName, procedureArgs=""):
    try:
        logProcStrFormat = "CALL table_insert_proc('{}','{}','{}','{}',{},{})"
        #Insert into the log table - create the record
        executeSql(logProcStrFormat.format(job_run_id, procedureName, 'pending', '', 'getdate()', 'null'))  #Code fails here
        #Executing the procedure
        procStrFormat = "CALL {}({})"
        executeSql(procStrFormat.format(procedureName, procedureArgs))
        print("Printing from {} process at ".format(procedureName), time.ctime())
        #Update the record in log table to complete
        executeSql(logProcStrFormat.format(job_run_id, procedureName, 'complete', '', 'null', 'getdate()'))  #Code fails here
    except Exception as e:
        errorMsg = str(e.message["M"])
        executeSql(logProcStrFormat.format(job_run_id, procedureName, 'failure', errorMsg, 'null', 'getdate()'))
        raise
        sys.exit(1)
def runDims():
    dimProcedures = ["test_proc1", "test_proc2", "test_proc3", "test_proc4", "test_proc5"]
    with concurrent.futures.ThreadPoolExecutor(max_workers=maximum_workers) as executor:
        result = list(executor.map(executeSqlProcedure, dimProcedures))

def runFacts():
    factProcedures = ["test_proc6", "test_proc7", "test_proc8", "test_proc9"]
    with concurrent.futures.ThreadPoolExecutor(max_workers=maximum_workers) as executor:
        result = list(executor.map(executeSqlProcedure, factProcedures))
runDims()
runFacts()
I expect the inserts/updates into the log table to occur without locking or erroring out.
Amazon Redshift does not work well with lots of small INSERT statements.
From Use a Multi-Row Insert - Amazon Redshift:
If a COPY command is not an option and you require SQL inserts, use a multi-row insert whenever possible. Data compression is inefficient when you add data only one row or a few rows at a time.
Multi-row inserts improve performance by batching up a series of inserts. The following example inserts three rows into a four-column table using a single INSERT statement. This is still a small insert, shown simply to illustrate the syntax of a multi-row insert.
insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);
Alternatively, output the data to Amazon S3, then perform a bulk load using the COPY command. This will be much more efficient because it can perform the load in parallel across all nodes.
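For illustration only, a COPY of that shape could be issued through the same executeSql() helper from the question; the table name, bucket path, and IAM role below are placeholders:
copy_stmt = """
    COPY log_table
    FROM 's3://my-bucket/staging/log_rows.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    FORMAT AS CSV;
"""
executeSql(copy_stmt)  # one bulk load instead of many single-row INSERTs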

Google BigQuery - python client - creating/managing jobs

I'm new to the BigQuery world... I'm using the python google.cloud package and I need simply to run a query from Python on a BigQuery table and print the results. This is the part of the query function which creates a query job.
def test():
    query = "SELECT * FROM " + dataset_name + '.' + table_name
    job = bigquery_client.run_async_query('test-job', query)
    job.begin()
    retry_count = 100
    while retry_count > 0 and job.state != 'DONE':
        retry_count -= 1
        sleep(10)
        job.reload()  # API call
    print(job.state)
    print(job.ended)
If I run the test() function multiple times, I get the error:
google.api.core.exceptions.Conflict: 409 POST https://www.googleapis.com/bigquery/v2/projects/myprocject/jobs:
Already Exists: Job myprocject:test-job
Since I have to run the test() function multiple times, do I have to delete the job named 'test-job' each time or do I have to assign a new job-name (e.g. a random one or datetime-based) each time?
do I have to delete the job named 'test-job' each time
You cannot delete a job. The jobs collection stores your project's complete job history, but availability is only guaranteed for jobs created in the past six months. The best you can do is request automatic deletion of jobs that are more than 50 days old, for which you should contact support.
or do I have to assign a new job-name (e.g. a random one or datetime-based) each time?
Yes, this is the way to go.
As a side recommendation, we usually do it like:
import uuid
job_name = str(uuid.uuid4())
job = bigquery_client.run_async_query(job_name, query)
Notice this is already automatic if you run a synced query.
Also, you don't have to manage the validation for job completeness yourself (as of version 0.27.0); if you want, you can use it like:
job = bigquery_client.run_async_query(job_name, query)
job_result = job.result()
query_result = job_result.query_results()
data = list(query_result.fetch_data())

Duplicate Insertions in Database using sqlite, sqlalchemy, python

I am learning Python and, through the help of online resources and people on this site, am getting the hang of it. In this first script of mine, in which I'm parsing Twitter RSS feed entries and inserting the results into a database, there is one remaining problem that I cannot fix. Namely, duplicate entries are being inserted into one of the tables.
As a bit of background, I originally found a base script on HalOtis.com for downloading RSS feeds and then modified it in several ways: 1) modified it to account for idiosyncrasies in Twitter RSS feeds (they're not separated into content, title, URL, etc.); 2) added tables for "hashtags" and for the many-to-many relationship (entry_tag table); 3) changed the table setup to sqlalchemy; 4) made some ad hoc changes to account for weird unicode problems that were occurring. As a result, the code is ugly in places, but it has been a good learning experience and now works, except that it keeps inserting duplicates into the "entries" table.
Since I'm not sure what would be most helpful to people, I've pasted in the entire code below, with some comments in a few places to point out what I think is most important.
I would really appreciate any help with this. Thanks!
Edit: Somebody suggested I provide a schema for the database. I've never done this before, so if I'm not doing it right, bear with me. I am setting up four tables:
RSSFeeds, which contains a list of Twitter RSS feeds
RSSEntries, which contains a list of individual entries downloaded (after parsing) from each of the feeds (with columns for content, hashtags, date, url)
Tags, which contains a list of all the hashtags that are found in individual entries (Tweets)
entry_tag, which contains columns allowing me to map tags to entries.
In short, the script below grabs the five test RSS feeds from the RSS Feeds table, downloads the 20 latest entries / tweets from each feed, parses the entries, and puts the information into the RSS Entries, Tags, and entry_tag tables.
#!/usr/local/bin/python
import sqlite3
import threading
import time
import datetime
import Queue
from time import strftime
import re
from string import split
import feedparser
from django.utils.encoding import smart_str, smart_unicode
from sqlalchemy import schema, types, ForeignKey, select, orm
from sqlalchemy import create_engine
engine = create_engine('sqlite:///test98.sqlite', echo=True)
metadata = schema.MetaData(engine)
metadata.bind = engine
def now():
    return datetime.datetime.now()
#set up four tables, with many-to-many relationship
RSSFeeds = schema.Table('feeds', metadata,
    schema.Column('id', types.Integer,
        schema.Sequence('feeds_seq_id', optional=True), primary_key=True),
    schema.Column('url', types.VARCHAR(1000), default=u''),
)
RSSEntries = schema.Table('entries', metadata,
    schema.Column('id', types.Integer,
        schema.Sequence('entries_seq_id', optional=True), primary_key=True),
    schema.Column('feed_id', types.Integer, schema.ForeignKey('feeds.id')),
    schema.Column('short_url', types.VARCHAR(1000), default=u''),
    schema.Column('content', types.Text(), nullable=False),
    schema.Column('hashtags', types.Unicode(255)),
    schema.Column('date', types.String()),
)
tag_table = schema.Table('tag', metadata,
    schema.Column('id', types.Integer,
        schema.Sequence('tag_seq_id', optional=True), primary_key=True),
    schema.Column('tagname', types.Unicode(20), nullable=False, unique=True),
)
entrytag_table = schema.Table('entrytag', metadata,
    schema.Column('id', types.Integer,
        schema.Sequence('entrytag_seq_id', optional=True), primary_key=True),
    schema.Column('entryid', types.Integer, schema.ForeignKey('entries.id')),
    schema.Column('tagid', types.Integer, schema.ForeignKey('tag.id')),
)
metadata.create_all(bind=engine, checkfirst=True)
# Insert test set of Twitter RSS feeds
stmt = RSSFeeds.insert()
stmt.execute(
    {'url': 'http://twitter.com/statuses/user_timeline/14908909.rss'},
    {'url': 'http://twitter.com/statuses/user_timeline/52903246.rss'},
    {'url': 'http://twitter.com/statuses/user_timeline/41902319.rss'},
    {'url': 'http://twitter.com/statuses/user_timeline/29950404.rss'},
    {'url': 'http://twitter.com/statuses/user_timeline/35699859.rss'},
)
#These 3 lines for threading process (see HalOtis.com for example)
THREAD_LIMIT = 20
jobs = Queue.Queue(0)
rss_to_process = Queue.Queue(THREAD_LIMIT)
#connect to sqlite database and grab the 5 test RSS feeds
conn = engine.connect()
feeds = conn.execute('SELECT id, url FROM feeds').fetchall()
#This block contains all the parsing and DB insertion
def store_feed_items(id, items):
    """ Takes a feed_id and a list of items and stores them in the DB """
    for entry in items:
        conn.execute('SELECT id from entries WHERE short_url=?', (entry.link,))
        #note: entry.summary contains entire feed entry for Twitter,
        #i.e., not separated into content, etc.
        s = unicode(entry.summary)
        test = s.split()
        tinyurl2 = [i for i in test if i.startswith('http://')]
        hashtags2 = [i for i in s.split() if i.startswith('#')]
        content2 = ' '.join(i for i in s.split() if i not in tinyurl2 + hashtags2)
        content = unicode(content2)
        tinyurl = unicode(tinyurl2)
        hashtags = unicode(hashtags2)
        print hashtags
        date = strftime("%Y-%m-%d %H:%M:%S", entry.updated_parsed)
        #Insert parsed feed data into entries table
        #THIS IS WHERE DUPLICATES OCCUR
        result = conn.execute(RSSEntries.insert(), {'feed_id': id, 'short_url': tinyurl,
            'content': content, 'hashtags': hashtags, 'date': date})
        entry_id = result.last_inserted_ids()[0]
        #Look up tag identifiers and create any that don't exist:
        tags = tag_table
        tag_id_query = select([tags.c.tagname, tags.c.id], tags.c.tagname.in_(hashtags2))
        tag_ids = dict(conn.execute(tag_id_query).fetchall())
        for tag in hashtags2:
            if tag not in tag_ids:
                result = conn.execute(tags.insert(), {'tagname': tag})
                tag_ids[tag] = result.last_inserted_ids()[0]
        #insert data into entrytag table
        if hashtags2:
            conn.execute(entrytag_table.insert(),
                [{'entryid': entry_id, 'tagid': tag_ids[tag]} for tag in hashtags2])
#Rest of file completes the threading process
def thread():
    while True:
        try:
            id, feed_url = jobs.get(False)  # False = Don't wait
        except Queue.Empty:
            return
        entries = feedparser.parse(feed_url).entries
        rss_to_process.put((id, entries), True)  # This will block if full

for info in feeds:  # Queue them up
    jobs.put([info['id'], info['url']])

for n in xrange(THREAD_LIMIT):
    t = threading.Thread(target=thread)
    t.start()

while threading.activeCount() > 1 or not rss_to_process.empty():
    # That condition means we want to do this loop if there are threads
    # running OR there's stuff to process
    try:
        id, entries = rss_to_process.get(False, 1)  # Wait for up to a second
    except Queue.Empty:
        continue
    store_feed_items(id, entries)
It looks like you bolted SQLAlchemy onto a previously existing script that didn't use it. There are too many moving parts here, which apparently none of us understands well enough.
I would recommend starting from scratch. Don't use threading. Don't use sqlalchemy. To start with, maybe don't even use an SQL database. Write a script that collects the information you want, in the simplest possible way, into a simple data structure, using simple loops and maybe a time.sleep(). Then, when that works, you can add storage in an SQL database; I really don't think writing SQL statements directly is much harder than using an ORM, and it's easier to debug IMHO. There is a good chance you will never need to add threading.
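For example, the first iteration could be nothing more than a single loop writing straight to sqlite3, sketched here with one of the test feed URLs from the question and a UNIQUE column plus INSERT OR IGNORE as one hypothetical way to skip rows that are already stored:
import sqlite3
import time
from time import strftime
import feedparser

feeds = ['http://twitter.com/statuses/user_timeline/14908909.rss']  # one of the test feeds above

conn = sqlite3.connect('simple_test.sqlite')
conn.execute('CREATE TABLE IF NOT EXISTS entries (short_url TEXT UNIQUE, content TEXT, date TEXT)')

for url in feeds:
    for entry in feedparser.parse(url).entries:
        # the UNIQUE constraint plus INSERT OR IGNORE skips entries stored on a previous run
        conn.execute('INSERT OR IGNORE INTO entries VALUES (?, ?, ?)',
                     (entry.link, entry.summary,
                      strftime("%Y-%m-%d %H:%M:%S", entry.updated_parsed)))
    time.sleep(1)  # simple pacing instead of threads
conn.commit()
Once a plain loop like this behaves, the tags and entry_tag tables can be added back one piece at a time.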
"If you think you are smart enough to write multi-threaded programs, you're not." -- James Ahlstrom.
