I have a CSV file with 4,500,000 rows that needs to be imported into my Django Postgres database. The file includes relations, so it isn't as simple as using COPY to load the CSV straight into the database.
If I wanted to load it straight into Postgres, I could change the CSV file to match the database tables, but I'm not sure how to build the relationships, since I need to know the inserted id of each row in order to relate it to the others.
Is there a way to generate SQL inserts that will get the last inserted id and use it in subsequent statements?
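For example, I'm imagining something along the lines of Postgres's INSERT ... RETURNING, where the generated id comes back and can be reused (sketched here with psycopg2; the table names and connection string are just placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

# insert the parent row and get its generated id back in the same statement
cur.execute("INSERT INTO myapp_job (title) VALUES (%s) RETURNING id", ["Programmer"])
job_id = cur.fetchone()[0]

# reuse the returned id when inserting the related row
cur.execute("INSERT INTO myapp_person (name, job_id) VALUES (%s, %s)", ["hailey", job_id])
conn.commit()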
I initially wrote this using the Django ORM, but it's going to take way too long that way and it seems to be slowing down as it runs. I removed all of my indexes and constraints, so that shouldn't be the issue.
The database is running locally on my machine. I figured once I get the data into a database, it wouldn't be hard to dump and reload it on the production database.
So how can I get this data into my database with the correct relationships?
Note that I don't know Java, so the answer suggested here isn't super practical for me: Django with huge mysql database
EDIT:
Here are more details:
I have a model something like this:
class Person(models.Model):
    name = models.CharField(max_length=100)
    offices = models.ManyToManyField(Office)
    job = models.ForeignKey(Job)

class Office(models.Model):
    address = models.CharField(max_length=100)

class Job(models.Model):
    title = models.CharField(max_length=100)
So I have a person who can have 1 job but many offices. (My real model has more fields, but you get the idea).
My CSV file is something like this:
name,office_1,office_2,job
hailey,"123 test st","222 USA ave.",Programmer
There are more fields than that, but I'm only including the relevant ones.
So I need to make the person object and the office objects and relate them. The job objects are already created, so all I need to do there is look up the job and set it as the person's job.
The original data was not in a database before this. Only the flat file. We are trying to make it relational so there is more flexibility.
Thanks!!!
Well, this is a tough one.
When you say relations, are they all in a single CSV file? I mean, like this, presuming a simple data model with a relation to itself?
id;parent_id;name
4;1;Frank
1;;George
2;1;Costanza
3;1;Stella
If this is the case and it's out of order, I would write a Python script to reorder these and then import them.
I had a scenario a while back where I had a number of CSV files, but they were for individual models, so I loaded the parent one first, then the second, and so on.
We wrote custom importers there that would read the data from a single CSV and do some processing on it, like checking whether a record already existed, whether certain values were valid, etc.: one method for each CSV file.
For CSVs that were big enough, we just split them into smaller files (around 200k records each) and processed them one after the other. The difference is that all the previous data that a big CSV depended on was already in the database, imported by the same kind of method described above. A sketch of such a splitter is shown below.
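For illustration, a minimal sketch of that splitting step, assuming a comma-separated file with a header row (the chunk size and file naming here are just placeholders):

import csv

def split_csv(path, chunk_size=200000):
    # split a large CSV into numbered chunk files, repeating the header in each
    with open(path, newline="") as source:
        reader = csv.reader(source)
        header = next(reader)
        chunk, part = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == chunk_size:
                write_chunk(path, part, header, chunk)
                chunk, part = [], part + 1
        if chunk:
            write_chunk(path, part, header, chunk)

def write_chunk(path, part, header, rows):
    with open("%s.part%03d" % (path, part), "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)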
Without an example, I can't comment much more.
EDIT
Well, since you gave us your model, and based on the fact that the job model is already there, I would go for something like this:
Create a custom method, even one you can invoke from the shell: a method/function, or whatever, that will receive a single line of the file.
In that method, discover how many offices that person is related to. Check whether each office already exists in the DB. If so, use it to relate the person and the office. If not, create it and relate them.
Look up the job. Does it exist? Yes, then use it. No? Create it and then use it.
Something like this:
def process_line(line):
    data = line.split(",")  # the sample CSV is comma-separated
    person = Person()
    # fill in the person details that are in the CSV
    # (the indices below follow the sample layout; adjust them to your real file)
    person.name = data[0]
    # look up the job, or create it; get_or_create returns (object, created_flag)
    job, _ = Job.objects.get_or_create(title=data[3])
    person.job = job
    person.save()  # you'll need to save before you can use the m2m
    offices = get_offices_from_line(line)  # returns the plain addresses, not Office instances
    for address in offices:
        office, created = Office.objects.get_or_create(address=address)
        person.offices.add(office)
Be aware that the function above was not tested or guarded against any kind of errors. You'll need to:
Do that yourself;
Create the function that identifies the offices each person has. I don't know the data, but perhaps if you look at the field preceding the first office and scan until the first field after the last office, you'll be able to collect all of them;
Create a function to parse the top-level file, iterate over its lines, and pass each one to your shiny import function (there is a sketch of that driver after the docs link below).
Here are the docs for get_or_create: https://docs.djangoproject.com/en/1.8/ref/models/querysets/#get-or-create
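A minimal sketch of that driver, assuming the file has a header row and that process_line takes a raw line as above (for real data with quoted commas, feeding parsed rows from csv.reader into the import function would be safer):

def import_people(path):
    with open(path) as source:
        next(source)  # skip the header row
        for i, line in enumerate(source, start=1):
            process_line(line.strip())
            if i % 10000 == 0:
                print("imported %d rows" % i)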
Related
I am looking for a more efficient way to update a bunch of model objects. Every night I have background jobs creating 'NCAABGame' objects from an API once the scores are final.
In the morning I have to update all the fields in the model with the stats that the API did not provide.
As of right now I get the stats formatted from an Excel file, and I copy and paste each update and run it like this:
NCAABGame.objects.filter(
    name__name='San Francisco', updated=False).update(
    field_goals=38,
    field_goal_attempts=55,
    three_points=11,
    three_point_attempts=24,
    ...
)
The other day there were 183 games; most days there are between 20 and 30, so doing it this way can be very time-consuming. I've looked into bulk_update and a few other things, but I can't really find a solution. I'm sure there is something simple that I'm just not seeing.
I appreciate any ideas or solutions you can offer.
If you need to update each object that gets created via the API manually anyway, I would not even bother going through Django. Just load your games from the API directly into Excel, make your edits in Excel, and save as a CSV file. Then I would add the CSV directly into the database table, unless there is a specific reason the objects must be created via Django? I mean, you can of course do that with something like the code below, which could also be modified to work with your current method via updates, but then you first need to retrieve the correct pk of the object you want to update.
import csv

with open("my_data.csv", 'r') as my_data_file:
    reader = csv.reader(my_data_file)
    for row in reader:
        # get_or_create returns a tuple. 'created' is a boolean that indicates
        # if a new object was created or not, with game holding the object that
        # was either retrieved or created
        game, created = NCAABGame.objects.get_or_create(
            name=row[0],
            field_goals=row[1],
            field_goal_attempts=row[2],
            ....,
        )
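For the update flow mentioned above (where the objects already exist and you only need to fill in the missing stats), a rough variant might look like this; it assumes the first CSV column holds the same name used in the name__name lookup and that the remaining columns line up with the fields shown in the question:

import csv

with open("my_stats.csv", 'r') as stats_file:  # hypothetical file name
    reader = csv.reader(stats_file)
    for row in reader:
        # update the already-created game rather than creating a new one
        NCAABGame.objects.filter(
            name__name=row[0], updated=False).update(
            field_goals=row[1],
            field_goal_attempts=row[2],
            updated=True,
        )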
We have the following database schema to store different types of data.
DataDefinition: basic information about the new data.
*FieldDefinition: every DataDefinition has some fields. Every field has a type, title, etc., and that information is stored here. Every DataDefinition has more than one FieldDefinition associated with it. I have put '*' because we have a lot of different models, one for every kind of field supported.
DataValue, *FieldValues: we store the definition and the values in different models.
With this setup, retrieving a single piece of data from our database requires a lot of queries:
Retrieve the DataDefinition.
Retrieve the DataValue.
Retrieve the *FieldDefinition associated to that DataDefinition.
Retrieve all the *FieldValues associated to those *FieldDefinition.
So, if n is the average number of fields of a DataDefinition, we need to make 2*n+2 queries to the database to retrieve a single value.
We cannot change this setup, but the queries are quite slow. So to speed things up I have thought of the following: storing a joined version of the tables. I do not know if this is possible, but I cannot think of any other way. Any suggestions?
Update: we are already using prefetch_related and select_related and it's still slow.
Use case right now: get an entire data object starting from a single value:
someValue = SomeTypeValue.objects.filter(value=value).select_related('DataValue', 'DataDefinition')[0]
# for each *FieldDefinition/*FieldValue model
definition = SomeFieldDefinition.objects.filter(*field_definition__id=someValue.data_value.data_definition.id)
value = SomeFieldValue.objects.filter(*field_definition__id=definition.id)
And with that info you can now build the entire data object.
Django: 1.11.20
Python: 2.7
I am currently creating a web app in Flask and use SQLAlchemy (not the Flask version) to handle reading from and writing to my MySQL database.
I have about 15 different tables, each mapped to a different declarative class; however, the application is still in beta, so this number will probably increase.
I would like a way to iterate through every single table and run the same command on each one. This is part of an update function where an admin can change the name of a book, and that name change should be reflected in all the other tables where the book is referred to.
Is there a way to iterate through all your SqlAlchemy tables?
Thanks!
Not exactly sure what you want to achieve here, but if you use declarative base, you can try something like this:
tables = Base.__subclasses__()
for t in tables:
    rows = Session.query(t).all()
    for r in rows:
        ... do something ...
This gets all tables by listing subclasses of Base. Then it queries everything from each table in turn and loops through selected rows.
However, I do not quite understand why you would want to do this. From how you describe your problem, it sounds like you should have a Book table, and all the other tables should link to it if they want to reference books. That would be the relational model, instead of duplicating book information in each and every table and trying to keep it in sync manually.
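For illustration, a minimal sketch of that relational layout (the class, table, and column names here are only assumptions):

from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

class Book(Base):
    __tablename__ = "books"
    id = Column(Integer, primary_key=True)
    name = Column(String(200))

class Review(Base):
    __tablename__ = "reviews"
    id = Column(Integer, primary_key=True)
    # reference the book by id; renaming the book then only touches the books table
    book_id = Column(Integer, ForeignKey("books.id"))
    book = relationship("Book")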
This might sound like a bit of an odd question - but is it possible to load data from a (in this case MySQL) table to be used in Django without the need for a model to be present?
I realise this isn't really the Django way, but given my current scenario, I don't really know how better to solve the problem.
I'm working on a site which, for one aspect, makes use of a table of data that has been bought from a third party. The columns of interest are likely to remain stable; however, the structure of the table could change with subsequent updates to the data set. The table is also massive (in terms of columns), so I'm not keen on typing out each field in the model one by one. I'd also like to leave the table intact, so coming up with a model which represents only the set of columns I am interested in is not really an ideal solution.
Ideally, I want to have this table in a database somewhere (possibly separate to the main site database) and access its contents directly using SQL.
You can always execute raw SQL directly against the database: see the docs.
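For example, a minimal sketch with Django's cursor (the table and column names here are hypothetical, and the second database alias is only needed if the data lives outside the default database):

from django.db import connection  # or: from django.db import connections

def fetch_third_party_rows(min_price):
    # run raw SQL against the third-party table; no model required
    with connection.cursor() as cursor:  # or connections['thirdparty'].cursor()
        cursor.execute(
            "SELECT id, name, price FROM third_party_table WHERE price >= %s",
            [min_price],
        )
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]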
Django has a feature called inspectdb for exactly this: for legacy databases like MySQL, it creates models automatically by inspecting your DB tables and writes them into your app's models.py, so you don't need to type every column out manually. But read the documentation carefully before creating the models, because it may affect the DB data... I hope this will be useful for you.
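Roughly, you run python manage.py inspectdb > yourapp/models.py, and the generated classes look something like the sketch below (the table and field names are made up):

from django.db import models

class ThirdPartyData(models.Model):
    name = models.CharField(max_length=255, blank=True, null=True)
    price = models.DecimalField(max_digits=10, decimal_places=2, blank=True, null=True)

    class Meta:
        managed = False  # Django will not create or drop this table
        db_table = 'third_party_data'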
I guess you can use any SQL library available for Python. For example: http://www.sqlalchemy.org/
You then just have to connect to your database, run your query, and use the data as you wish. I think you can't use Django without its model system, but nothing prevents you from using another library for this in parallel.
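A minimal sketch of that approach with SQLAlchemy Core (the connection URL and table name are assumptions):

from sqlalchemy import create_engine, text

# hypothetical connection URL and table name
engine = create_engine("mysql+pymysql://user:password@localhost/thirdparty")

with engine.connect() as conn:
    result = conn.execute(text("SELECT id, name, price FROM third_party_table"))
    for row in result:
        print(row)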
Background
I am looking for a way to dump the results of MySQL queries made with Python & Peewee to an excel file, including database column headers. I'd like the exported content to be laid out in a near-identical order to the columns in the database. Furthermore, I'd like a way for this to work across multiple similar databases that may have slightly differing fields. To clarify, one database may have a user table containing "User, PasswordHash, DOB, [...]", while another has "User, PasswordHash, Name, DOB, [...]".
The Problem
My primary problem is getting the column headers out in an ordered fashion. All attempts thus far have produced unordered results, all of them less than elegant.
Second, my methodology thus far has resulted in code which I'd (personally) hate to maintain, which I know is a bad sign.
Work so far
At present, I have used Peewee's pwiz.py script to generate the models for each of the preexisting tables in the target databases, then went through and entered all primary and foreign keys. The relations are set up, and some brief tests showed they're associating properly.
Code: I've managed to get the column headers out using something similar to:
for i, column in enumerate(User._meta.get_field_names()):
    ws.cell(row=0, column=i).value = column
As mentioned, this is unordered. Also, doing it this way forces me to do something along the lines of
getattr(some_object, title)
to dynamically populate the fields accordingly.
Thoughts and Possible Solutions
Manually write out the order I want the fields in as an array, and use that to loop through and populate the data. The pro of this is very strict/granular control. The con is that I'd need to specify this for every database.
Create (whether manually or via a method) a hash of all possibly encountered fields with an associated weight, then write a method that sorts "_meta.get_field_names()" by weight. The con of this is that the columns may not end up 100% in the right order, such as Name coming before DOB in one DB but after it in another.
Feel free to tell me I'm doing it all wrong or to suggest completely different ways of doing this; I'm all ears. I'm very much new to Python and Peewee (ORMs in general, actually). I could switch back to Perl and do the database querying via DBI with little to no hassle. However, its libraries for Excel would cause me just as many problems, and I'd like to take this as a chance to expand my knowledge.
There is a method on the model meta you can use:
for field in User._meta.get_sorted_fields():
    print field.name
This will print the field names in the order they are declared on the model.
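Combining that with the getattr idea from the question, a rough sketch of writing the headers and rows in that order might look like the following; it assumes get_sorted_fields() yields field objects as shown above and keeps the question's own ws.cell(...) usage for the worksheet:

fields = [f.name for f in User._meta.get_sorted_fields()]

# header row in declaration order
for col, name in enumerate(fields):
    ws.cell(row=0, column=col).value = name

# one row per user, pulling attributes dynamically in the same order
for row_idx, user in enumerate(User.select(), start=1):
    for col, name in enumerate(fields):
        ws.cell(row=row_idx, column=col).value = getattr(user, name)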