Bulk insert with multiprocessing using peewee - python

I'm working on a simple HTML scraper in Python 3.4, using peewee as the ORM (great ORM, btw!). My script takes a bunch of sites, extracts the necessary data, and saves it to the database. To improve performance, every site is scraped in a separate process, and the saved data should be unique. Duplicates can occur not only between sites but also within a single site, so I want to store each item only once.
Example:
Post and Category have a many-to-many relation. During scraping, the same category appears multiple times in different posts. The first time, I want to save that category to the database (create a new row). If the same category shows up in a different post, I want to link that post to the already-created row in the db.
My question is: do I have to use atomic updates/inserts (insert one post, save, get_or_create categories, save, insert new rows into the many-to-many table, save), or can I use bulk insert somehow? What is the fastest solution to this problem? Maybe some temporary tables shared between processes, which would be bulk inserted at the end of the work? I'm using a MySQL db.
Thanks for your answers and your time.

You can rely on the database to enforce unique constraints by adding unique=True to fields or multi-column unique indexes. You can also check the docs on get/create and bulk inserts:
http://docs.peewee-orm.com/en/latest/peewee/models.html#indexes-and-unique-constraints
http://docs.peewee-orm.com/en/latest/peewee/querying.html#get-or-create
http://docs.peewee-orm.com/en/latest/peewee/querying.html#bulk-inserts
http://docs.peewee-orm.com/en/latest/peewee/querying.html#upsert - upsert with on conflict
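As a rough illustration of letting the database enforce uniqueness, here is a minimal sketch of the get-or-create pattern from the question. It uses the standard-library sqlite3 module instead of peewee and MySQL so that it runs anywhere; the table names, column names, and the helper get_or_create_category are all hypothetical. The helper attempts the insert first and falls back to a select when the unique constraint rejects it, which is the part that stays safe when several writers race on the same category.

```python
import sqlite3

def get_or_create_category(conn, name):
    """Return the id of the category called `name`, inserting it if missing."""
    try:
        # The UNIQUE constraint decides whether this row is new.
        return conn.execute(
            "INSERT INTO category (name) VALUES (?)", (name,)
        ).lastrowid
    except sqlite3.IntegrityError:
        # The name already exists (saved earlier or by another writer).
        row = conn.execute(
            "SELECT id FROM category WHERE name = ?", (name,)
        ).fetchone()
        return row[0]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE post_category (
        post_id INTEGER NOT NULL REFERENCES post(id),
        category_id INTEGER NOT NULL REFERENCES category(id),
        UNIQUE (post_id, category_id)
    );
""")

# Hypothetical scraped data: "python" appears in both posts.
scraped = [("First post", ["python", "orm"]), ("Second post", ["python"])]
for title, categories in scraped:
    post_id = conn.execute(
        "INSERT INTO post (title) VALUES (?)", (title,)
    ).lastrowid
    for name in categories:
        category_id = get_or_create_category(conn, name)
        conn.execute(
            "INSERT INTO post_category VALUES (?, ?)", (post_id, category_id)
        )
conn.commit()

category_count = conn.execute("SELECT COUNT(*) FROM category").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM post_category").fetchone()[0]
```

Here "python" is stored once and linked to both posts. With peewee, get_or_create() on a model whose field is declared unique=True should play the same role as the helper.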

Looked for this myself for a while, but found it!
You can use the on_conflict_replace() or on_conflict_ignore() methods to define what happens when a record already exists in a table that has a uniqueness constraint.
PriceData.insert_many(values).on_conflict_replace().execute()
or
PriceData.insert_many(values).on_conflict_ignore().execute()
More info under "Upsert" in the peewee querying docs.
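To make the difference between the two conflict strategies concrete, here is a small sketch using the standard-library sqlite3 module rather than peewee. It assumes a hypothetical price_data table keyed by symbol. SQLite spells the statements INSERT OR IGNORE and INSERT OR REPLACE; MySQL's counterparts are INSERT IGNORE and REPLACE, which, as far as I know, is roughly what the peewee calls above generate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE price_data (symbol TEXT PRIMARY KEY, price REAL NOT NULL)"
)
conn.execute("INSERT INTO price_data VALUES ('AAA', 1.0)")

# One row conflicts with the existing 'AAA' key, one row is new.
values = [("AAA", 9.0), ("BBB", 2.0)]

# Ignore: the conflicting row is skipped, the new row is inserted.
conn.executemany("INSERT OR IGNORE INTO price_data VALUES (?, ?)", values)
after_ignore = dict(conn.execute("SELECT symbol, price FROM price_data"))

# Replace: the conflicting row overwrites the existing one.
conn.executemany("INSERT OR REPLACE INTO price_data VALUES (?, ?)", values)
after_replace = dict(conn.execute("SELECT symbol, price FROM price_data"))
```

After the "ignore" pass the old price for AAA survives; after the "replace" pass it is overwritten. Which one you want depends on whether later scrapes should win over earlier ones.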

Related

What should I use to enter data to my database on Django? Django admin or SQL code?

I am a newbie in programming, but I have now connected my project to PostgreSQL. I learned how to enter data with SQL code and also found out that we can use /admin (by creating a superuser and adding data there). So which one is widely used in web development?
It will depend completely on your application.
You can add rows to a table using SQL if that's the easiest way for you. Or you can add rows by creating new object instances in Python code and .save()ing them. Or you can create instances through a CreateView or through the Django admin.
Adding data with SQL has the drawback that you will lose the benefit of any validators declared on the model's fields. You may end up with data stored in your SQL tables which your app regards as "impossible", which may cause you minor or even major difficulties.
I have several times written management commands which all have the same general format: for each "row" in a data source (often a spreadsheet), construct one or more Django objects and save them. You can process each data "row" within a transaction (with transaction.atomic()) so that if anything goes wrong, the data row is not committed. Or you can treat the entire process as a single transaction (not recommended for vast numbers of "rows", though).
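The row-by-row transaction idea can be sketched without Django, using the standard-library sqlite3 module. Here "with conn:" plays the part that "with transaction.atomic():" would play in a management command, and a CHECK constraint stands in for a model validator. The book table and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE book (title TEXT NOT NULL, "
    "pages INTEGER NOT NULL CHECK (pages > 0))"
)
conn.commit()

# Hypothetical source rows, e.g. read from a spreadsheet; one is invalid.
rows = [("Dune", 412), ("Broken", -5), ("Emma", 474)]

failed = []
for title, pages in rows:
    try:
        # One transaction per row: commit on success, roll back on error.
        with conn:
            conn.execute("INSERT INTO book VALUES (?, ?)", (title, pages))
    except sqlite3.IntegrityError as exc:
        # Record the bad row and carry on with the rest.
        failed.append((title, str(exc)))

saved = [r[0] for r in conn.execute("SELECT title FROM book ORDER BY rowid")]
```

The invalid row is rejected and reported while the valid rows around it are still committed, which is the main advantage of per-row transactions over a single all-or-nothing one.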

SQLalchemy - Iterate through all mapped tables

I am currently creating a web app in Flask and use SQLAlchemy (not the Flask version) to deal with reading from and writing to my MySQL database.
I have about 15 different tables, each mapped to a different declarative class. However, the application is still in beta, so this number will probably increase.
I would like a way to iterate through every single table and run the same command on each one. This is part of an update function where an admin can change the name of a book; this name change should be reflected in all the other tables where that book is referred to.
Is there a way to iterate through all your SQLAlchemy tables?
Thanks!
Not exactly sure what you want to achieve here, but if you use declarative base, you can try something like this:
tables = Base.__subclasses__()
for t in tables:
    rows = Session.query(t).all()
    for r in rows:
        pass  # ... do something with r ...
This gets all tables by listing the subclasses of Base. Then it queries everything from each table in turn and loops through the selected rows. Note that __subclasses__() returns only the direct subclasses of Base.
However, I do not quite understand why you would want to do this. From how you describe your problem, you should have a Book table, and all the other tables should link to it if they want to reference books. That would be the relational model, instead of copying information about books into each and every table and trying to manage it manually.
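A minimal sketch of the relational layout suggested above, using the standard-library sqlite3 module rather than SQLAlchemy. The review and loan tables are hypothetical stand-ins for "all the other tables". Each one stores only book_id, so a rename touches a single row in book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE review (
        id INTEGER PRIMARY KEY,
        book_id INTEGER NOT NULL REFERENCES book(id),
        body TEXT
    );
    CREATE TABLE loan (
        id INTEGER PRIMARY KEY,
        book_id INTEGER NOT NULL REFERENCES book(id),
        borrower TEXT
    );
    INSERT INTO book VALUES (1, 'Old title');
    INSERT INTO review VALUES (1, 1, 'Great read');
    INSERT INTO loan VALUES (1, 1, 'alice');
""")

# The admin renames the book: one UPDATE, no loop over tables.
conn.execute("UPDATE book SET name = ? WHERE id = ?", ("New title", 1))

# Every table that references the book sees the new name through a join.
review_names = [r[0] for r in conn.execute(
    "SELECT book.name FROM review JOIN book ON book.id = review.book_id"
)]
loan_names = [r[0] for r in conn.execute(
    "SELECT book.name FROM loan JOIN book ON book.id = loan.book_id"
)]
```

In SQLAlchemy the same shape would be expressed with a ForeignKey column on each referencing class.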

Will the database pull the whole table at once or row by row after using model.objects.all().iterator() in Django?

I know Django will return the model objects one by one after using iterator() on a queryset, to save memory. On the database side, will Django pull the data row by row, or still pull the whole table at once, just like model.objects.all()?
See https://docs.djangoproject.com/en/1.10/ref/models/querysets/#when-querysets-are-evaluated. The query selects all records from the database. The best way to confirm this is to enable DEBUG-level logging for the database backend (the 'django.db.backends' logger) in your settings, so you can see the actual SQL queries being executed.
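For reference, a settings fragment along these lines should turn on that SQL logging. It is a sketch: the handler name "console" is arbitrary, and Django only emits these query log records when DEBUG = True.

```python
# settings.py fragment: send the SQL that Django executes to the console.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        # "console" is an arbitrary handler name chosen for this sketch.
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        # Logger that receives one record per executed SQL statement.
        "django.db.backends": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
    },
}
```

With this in place, iterating over model.objects.all().iterator() in a shell should print the single SELECT that is issued.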

Loading data from a (MySQL) database into Django without models

This might sound like a bit of an odd question - but is it possible to load data from a (in this case MySQL) table to be used in Django without the need for a model to be present?
I realise this isn't really the Django way, but given my current scenario, I don't really know how better to solve the problem.
I'm working on a site which, for one aspect, makes use of a table of data that has been bought from a third party. The columns of interest are likely to remain stable; however, the structure of the table could change with subsequent updates to the data set. The table is also massive (in terms of columns), so I'm not keen on typing out each field in the model one by one. I'd also like to leave the table intact, so coming up with a model which represents only the set of columns I am interested in is not really an ideal solution.
Ideally, I want to have this table in a database somewhere (possibly separate to the main site database) and access its contents directly using SQL.
You can always execute raw SQL directly against the database: see the docs.
Django has a feature called inspectdb. For legacy databases such as MySQL, it generates models automatically by inspecting your db tables; you save the output in your app as models.py, so you don't need to type every column manually. Read the documentation carefully before creating the models, though, because it may affect the DB data. I hope this is useful for you.
I guess you can use any SQL library available for Python. For example: http://www.sqlalchemy.org/
You then just have to connect to your database, run your query, and use the data as you wish. I don't think you can use Django's ORM without its model system, but nothing prevents you from using another library for this in parallel.
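As a sketch of reading such a table with plain SQL and no model, the example below uses the standard-library sqlite3 module as a stand-in for a MySQL driver. The vendor_data table and its columns are invented for illustration. Only the columns of interest are named in the query, so extra or changing columns in the vendor's table do not matter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be read by column name

# Hypothetical wide third-party table; only three columns interest us.
conn.execute(
    "CREATE TABLE vendor_data "
    "(id INTEGER, region TEXT, score REAL, extra_1 TEXT, extra_2 TEXT)"
)
conn.execute("INSERT INTO vendor_data VALUES (1, 'north', 0.5, 'x', 'y')")

rows = conn.execute(
    "SELECT id, region, score FROM vendor_data WHERE region = ?",
    ("north",),
).fetchall()

# Plain dicts are easy to pass on to a view or template.
records = [dict(row) for row in rows]
```

Inside a Django project, django.db.connection.cursor() gives a similar DB-API cursor, so the same query style applies there.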

Work with Postgres/PostGIS View in SQLAlchemy

Two questions:
I want to create a view in my PostGIS database. How do I add this view to my geometry_columns table?
What do I have to do to use a view with SQLAlchemy? Does SQLAlchemy treat a table and a view differently, or can I use a view the same way I use a table?
Sorry for my poor English.
If there are questions about my question, please feel free to ask so I can try to explain it another way :)
Nico
Table objects in SQLAlchemy have two roles. They can be used to issue DDL commands to create the table in the database. But their main purpose is to describe the columns and types of tabular data that can be selected from and inserted to.
If you only want to select, then a view looks to SQLAlchemy exactly like a regular table. It's enough to describe the view as a Table with the columns that interest you (you don't even need to describe all of the columns). If you want to use the ORM you'll need to declare for SQLAlchemy that some combination of the columns can be used as the primary key (anything that's unique will do). Declaring some columns as foreign keys will also make it easier to set up any relations. If you don't issue create for that Table object, then it is just metadata for SQLAlchemy to know how to query the database.
If you also want to insert to the view, then you'll need to create PostgreSQL rules or triggers on the view that redirect the writes to the correct location. I'm not aware of a good usage recipe to redirect writes on the Python side.
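The idea of redirecting writes on a view can be tried out with SQLite, which supports INSTEAD OF triggers on views. This is only a sketch of the mechanism: PostgreSQL also offers INSTEAD OF triggers and rules on views, but the syntax differs (a trigger there calls a trigger function), and the place table and visible_place view are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE place (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        visible INTEGER NOT NULL DEFAULT 1
    );
    CREATE VIEW visible_place AS
        SELECT id, name FROM place WHERE visible = 1;
    -- Redirect inserts on the view to the underlying table.
    CREATE TRIGGER visible_place_insert
    INSTEAD OF INSERT ON visible_place
    BEGIN
        INSERT INTO place (name) VALUES (NEW.name);
    END;
""")

# The client writes to the view as if it were a table.
conn.execute("INSERT INTO visible_place (name) VALUES (?)", ("Harbour",))

stored = conn.execute("SELECT name, visible FROM place").fetchall()
seen = conn.execute("SELECT name FROM visible_place").fetchall()
```

Once such a trigger exists, the client code (including an ORM mapping) can insert into the view without knowing where the row actually lands.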
