I am building a Table class to make it easy to retrieve data from a database, manipulate it arbitrarily in memory, then save it back. Ideally, these tables work both in the interactive interpreter and in normal code. "Work" means I can use all standard pandas DataFrame features, as well as all custom features from the Table class.
Generally, the tables contain data I use for academic research or personal interest. So, the user-base is currently just me, but for portability I'm trying to write as generically as possible.
I have seen several threads (example 1, example 2) discussing whether to subclass DataFrame, or use composition. After trying to walk through pandas's subclassing guide I decided to go for composition because pandas itself says this is easier.
The problem is, I want to be able to call any DataFrame function, property, or attribute on a Table, but to do so I have to keep track of every attribute I code into the Table class. See below; the points of interest are metadata and __getattr__, everything else is meant to be illustrative.
class Table(object):
    metadata = ['db', 'data', 'name', 'clean', 'refresh', 'save']

    def __getattr__(self, name):
        if name not in Table.metadata:
            return getattr(self.data, name)  # self.data is the DataFrame

    def __init__(self, db, name):
        ...  # set up Table-specific values

    def refresh(self):
        ...  # undo all changes since last save

    # etc.
Obviously, having to explicitly specify the Table attributes versus the DataFrame ones is not ideal (though, to my understanding, this is how pandas implements column names as attributes). I could write out tablename.data.foo, but I find that unintuitive and non-pythonic. Is there a better way to achieve the same functionality?
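For reference, a minimal sketch of that composition pattern, assuming the wrapped frame always lives in self.data and that db is an SQLAlchemy connectable. Since __getattr__ is only invoked when normal attribute lookup fails, Table's own attributes and methods never reach it, so the explicit metadata list isn't strictly required:

import pandas as pd


class Table(object):
    def __init__(self, db, name):
        self.db = db
        self.name = name
        self.data = pd.read_sql_table(name, db)  # the wrapped DataFrame

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for names that are not
        # Table attributes, so those get delegated to the DataFrame.
        return getattr(self.data, name)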
Here's my understanding of your desired workflow: (1) you have a table in a database, (2) you read part/all(?) of it into memory as a pandas DataFrame wrapped in a custom class, (3) you make any manipulations you want, and then (4) save it back to the database as the new state of that table.
I'm worried that arbitrary changes to the df could break db features
I'm guessing this is a relational db? Do other tables rely on primary keys of this table?
Are you trying to keep a certain schema?
Are you ok with adding/deleting/renaming columns arbitrarily?
If you decide there is an enumerable set of manipulations, rather than an arbitrary number, then I'd make a separate class method for each.
If you don't care about your db schema and your db table doesn't have relationships with other tables, then I guess you can do arbitrary manipulations in memory and replace the db table each time.
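For instance, a rough sketch of that replace-on-save idea with pandas plus SQLAlchemy (the connection string and table name are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///research.db")  # placeholder connection string

def save(df, name):
    # Overwrite the whole db table with the DataFrame's current in-memory state
    df.to_sql(name, engine, if_exists="replace", index=False)

def refresh(name):
    # Re-read the table, discarding any unsaved in-memory changes
    return pd.read_sql_table(name, engine)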
In this case I feel you are not benefiting from using a database over a CSV file
I guess one benefit could be the db is publicly accessible while the CSV wouldn't be (unless you were using S3 or something)
Related
I have a CSV file with 4,500,000 rows in it that needs to be imported into my Django Postgres database. This file includes relations, so it isn't as easy as using COPY to import the CSV file straight into the database.
If I wanted to load it straight into Postgres, I could change the CSV file to match the database tables, but I'm not sure how to build the relationships, since I need to know the inserted id in order to create them.
Is there a way to generate sql inserts that will get the last id and use that in future statements?
I initially wrote this using the Django ORM, but it's going to take way too long that way, and it seems to be slowing down. I removed all of my indexes and constraints, so that shouldn't be the issue.
The database is running locally on my machine. I figured once I get the data into a database, it wouldn't be hard to dump and reload it on the production database.
So how can I get this data into my database with the correct relationships?
Note that I don't know Java, so the answer suggested here isn't super practical for me: Django with huge mysql database
EDIT:
Here are more details:
I have a model something like this:
class Person(models.Model):
    name = models.CharField(max_length=100)
    offices = models.ManyToManyField('Office')
    job = models.ForeignKey('Job')

class Office(models.Model):
    address = models.CharField(max_length=100)

class Job(models.Model):
    title = models.CharField(max_length=100)
So I have a person who can have 1 job but many offices. (My real model has more fields, but you get the idea).
My CSV file is something like this:
name,office_1,office_2,job
hailey,"123 test st","222 USA ave.",Programmer
There are more fields than that, but I'm only including the relevant ones.
So I need to make the person object and the office objects and relate them. The job objects are already created so all I need to do there is find the job and save it as the person's job.
The original data was not in a database before this. Only the flat file. We are trying to make it relational so there is more flexibility.
Thanks!!!
Well, this is a tough one.
When you say relations, are they all in a single CSV file? I mean, like this, presuming a simple data model with a relation to itself?
id;parent_id;name
4;1;Frank
1;;George
2;1;Costanza
3;1;Stella
If this is the case and it's out of order, I would write a Python script to reorder these and then import them.
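Just to illustrate that reordering step (the file name and the ";" delimiter are taken from the sample above), a rough multi-pass sketch:

import csv

# Read the self-referencing CSV and emit rows so that every parent row
# appears before any of its children (simple multi-pass approach).
with open("people.csv") as f:
    rows = list(csv.DictReader(f, delimiter=";"))

emitted, ordered = set(), []
while rows:
    remaining = []
    for row in rows:
        if not row["parent_id"] or row["parent_id"] in emitted:
            ordered.append(row)
            emitted.add(row["id"])
        else:
            remaining.append(row)
    if len(remaining) == len(rows):
        raise ValueError("cycle or missing parent in the data")
    rows = remaining

# `ordered` now lists parents before children and can be imported top-down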
I had a scenario a while back that I had a number of CSV files, but they were from individual models, where I loaded the first parent one, then the second, etc.
We wrote custom importers that would read the data from a single CSV and do some processing on it, like checking whether a record already existed, whether certain values were valid, etc. One method for each CSV file.
For CSVs that were big enough, we just split them into smaller files (around 200k records each) and processed them one after the other. The difference is that all the previous data this big CSV depended on was already in the database, imported by the same method described previously.
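A rough sketch of that splitting step, in case it helps (the part file names and chunk size are illustrative):

import csv

def split_csv(path, chunk_size=200000):
    # Split a large CSV into numbered part files of roughly chunk_size
    # data rows each, repeating the header row in every part.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        out, writer, part = None, None, 0
        for i, row in enumerate(reader):
            if i % chunk_size == 0:
                if out:
                    out.close()
                part += 1
                out = open("part_%03d.csv" % part, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
            writer.writerow(row)
        if out:
            out.close()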
Without an example, I can't comment much more.
EDIT
Well, since you gave us your model, and based on the fact that the Job rows are already there, I would go for something like this:
Create a custom method, even one you can just invoke from the shell: a method/function or whatever that will receive a single line of the file.
In that method, discover how many offices that person is related to. Search to see if the office already exists in the DB. If so, use it to relate a person and the office. If not, create it and relate them
Look up the job. Does it exist? Yes, then use it. No? Create it and then use it.
Something like this:
def process_line(line):
    # The sample CSV is comma-separated; csv.reader would handle the
    # quoted addresses more robustly than a plain split().
    data = line.split(",")

    person = Person()
    # fill in the person details that are in the CSV
    person.name = data[0]

    # look up the job, creating it if it does not exist yet
    job_obj, job_created = Job.objects.get_or_create(title=data[3])
    person.job = job_obj

    person.save()  # the instance must be saved before using the m2m

    offices = get_offices_from_line(line)  # returns plain values, not Office instances
    for office in offices:
        office_obj, office_created = Office.objects.get_or_create(address=office)
        person.offices.add(office_obj)
Be aware that the function above was not tested or guarded against any kind of errors. You'll need to:
Do that yourself;
Create the function that identifies the offices each person has. I don't know the data, but perhaps if you look at the field preceding the first office and at the first field after the last office, you'll be able to grab all of them;
You'll need to create a function to parse the top-level file, iterate over its lines, and pass each one to your shiny import function.
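A rough sketch of that driver function, assuming the file has a single header row to skip:

def import_csv(path):
    # Iterate over the CSV and hand each data line to process_line()
    with open(path) as f:
        next(f)  # skip the header row
        for line in f:
            process_line(line.strip())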
Here are the docs for get_or_create: https://docs.djangoproject.com/en/1.8/ref/models/querysets/#get-or-create
Let's say there is a table of People, and let's say there are 1000+ in the system. Each People item has the following fields: name, email, occupation, etc.
And we want to allow a People item to have a list of names (nicknames & such) where no other data is associated with the name - a name is just a string.
Is this exactly what PickleType is for? What kind of performance difference is there between using PickleType and creating a Name table, so that the names of a People item form a one-to-many relationship?
Yes, this is one good use case of sqlalchemy's PickleType field, documented very well here. There are obvious performance advantages to using this.
Using your example, assume you have a People item which uses a one-to-many database lookup. This requires the database to perform a JOIN to collect the sub-elements; in this case, the Person's nicknames, if any. However, you have the benefit of having native objects ready to use in your Python code, without the cost of deserializing pickles.
In comparison, the list of strings can be pickled and stored as a PickleType in the database, which is internally stored as a LargeBinary. Querying for a Person will only require the database to hit a single table, with no JOINs, which results in an extremely fast return of data. However, you now incur the "cost" of de-pickling each item back into a Python object, which can be significant if you're not storing native datatypes; e.g. string, int, list, dict.
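For concreteness, a minimal declarative sketch of that single-table approach (the table and column names here are just illustrative):

from sqlalchemy import Column, Integer, String, PickleType
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Person(Base):
    __tablename__ = "people"

    id = Column(Integer, primary_key=True)
    name = Column(String(100))
    # A plain Python list of nickname strings, pickled into a single
    # binary column instead of a separate one-to-many table.
    nicknames = Column(PickleType)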
Additionally, by storing pickles in the database, you also lose the ability for the underlying database to filter results given a WHERE condition; especially with integers and datetime objects. A native database call can return values within a given numeric or date range, but will have no concept of what the string representing these items really is.
Lastly, a simple change to a single pickle could allow arbitrary code execution within your application. It's unlikely, but must be stated.
IMHO, storing pickles is a nice way to store certain types of data, but it will vary greatly with the type of data. I can tell you we use it pretty extensively in our schema, even on several tables with over half a billion records, quite nicely.
Background
I am looking for a way to dump the results of MySQL queries made with Python & Peewee to an excel file, including database column headers. I'd like the exported content to be laid out in a near-identical order to the columns in the database. Furthermore, I'd like a way for this to work across multiple similar databases that may have slightly differing fields. To clarify, one database may have a user table containing "User, PasswordHash, DOB, [...]", while another has "User, PasswordHash, Name, DOB, [...]".
The Problem
My primary problem is getting the column headers out in an ordered fashion. All attempts thus far have produced unordered results, and all of them are less than elegant.
Second, my methodology thus far has resulted in code which I'd (personally) hate to maintain, which I know is a bad sign.
Work so far
At present, I have used Peewee's pwiz.py script to generate the models for each of the preexisting database tables in the target databases, then went in and entered all primary and foreign keys. The relations are set up, and some brief tests showed they're associating properly.
Code: I've managed to get the column headers out using something similar to:
for i, column in enumerate(User._meta.get_field_names()):
    ws.cell(row=0, column=i).value = column
As mentioned, this is unordered. Also, doing it this way forces me to do something along the lines of
getattr(some_object, title)
to dynamically populate the fields accordingly.
Thoughts and Possible Solutions
Manually write out the order I want in an array, and use that to loop through and populate the data. The pro of this is very strict/granular control. The con is that I'd need to specify this for every database.
Create (whether manually or via a method) a hash of all possibly encountered fields with an associated weight for each, then write a method that sorts "_meta.get_field_names()" by weight. The con of this is that the columns may still not be 100% in the right order, e.g. Name coming before DOB in one DB but after it in another.
Feel free to tell me I'm doing it all wrong or to suggest completely different ways of doing this; I'm all ears. I'm very much new to Python and Peewee (and ORMs in general, actually). I could switch back to Perl and do the database querying via DBI with little to no hassle. However, its Excel libraries would cause me just as many problems, and I'd like to take this as a chance to expand my knowledge.
There is a method on the model meta you can use:
for field in User._meta.get_sorted_fields():
    print(field.name)
This will print the field names in the order they are declared on the model.
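Building on that, a rough export sketch with openpyxl (untested; the workbook name and query are placeholders, and openpyxl cells are 1-indexed):

from openpyxl import Workbook

wb = Workbook()
ws = wb.active

# Header row: field names in declaration order
fields = [field.name for field in User._meta.get_sorted_fields()]
for col, name in enumerate(fields, start=1):
    ws.cell(row=1, column=col, value=name)

# Data rows: getattr() pulls each field's value off the model instance
for row, user in enumerate(User.select(), start=2):
    for col, name in enumerate(fields, start=1):
        ws.cell(row=row, column=col, value=getattr(user, name))

wb.save("users.xlsx")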
I've got a table called "Projects" which has a mapped column "project". What I want to be able to do is define my own property on my mapped class, also called "project", that performs some manipulation of the project value before returning it. This will of course create an infinite loop when I try to reference the column value. So my question is whether there's a way of setting up my table mapper to use an alias for the project column, perhaps _project. Is there an easy way of doing this?
I worked it out myself in the end. You can specify an alternative name when calling orm.mapper:
orm.mapper(MappedClass, table, properties={'_project': table.c.project})
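To show how the manipulating property from the question sits on top of that mapping, here is a rough sketch; it assumes a projects_table Table object already exists, and the manipulation itself is just an example:

from sqlalchemy import orm


class Projects(object):
    @property
    def project(self):
        # Return the manipulated value of the underlying column
        return self._project.strip().title()  # illustrative manipulation

    @project.setter
    def project(self, value):
        self._project = value


orm.mapper(Projects, projects_table, properties={
    # Map the "project" column onto _project so the property above
    # does not shadow or recurse into the mapped column.
    '_project': projects_table.c.project,
})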
Have you checked the synonym feature of SQLAlchemy?
http://www.sqlalchemy.org/docs/05/reference/ext/declarative.html#defining-synonyms
http://www.sqlalchemy.org/docs/05/mappers.html#synonyms
I use this pretty often to provide a proper getter/setter public API for properties that have a fairly complicated underlying data structure, or where additional functionality/validation is needed.
I'm designing a python application which works with a database. I'm planning to use sqlite.
There are 15000 objects, and each object has a few attributes. Every day I need to add some data for each object (maybe create a column with the date as its name).
However, I would like to be able to easily delete data that is too old, and it is very hard to delete columns using SQLite (and it might be slow, because I would need to copy the required columns to a new table and then delete the old one).
Is there a better way to organize this data other than creating a column for every date? Or should I use something other than sqlite?
It'll probably be easiest to separate your data into two tables like so:
CREATE TABLE object(
    id INTEGER PRIMARY KEY,
    ...
);

CREATE TABLE extra_data(
    objectid INTEGER,
    date DATETIME,
    ...
    FOREIGN KEY(objectid) REFERENCES object(id)
);
This way, when you need to delete all of your entries from a given date, it's as easy as:
DELETE FROM extra_data WHERE date = curdate;
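From Python, the same cleanup is a single parameterized statement (the file name and cutoff date are placeholders):

import sqlite3

conn = sqlite3.connect("objects.db")  # placeholder path
# Drop everything older than a cutoff in one statement; no schema changes needed
conn.execute("DELETE FROM extra_data WHERE date < ?", ("2015-01-01",))
conn.commit()
conn.close()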
I would try to avoid altering tables all the time; doing so usually indicates a bad design.
For a db of that size, I would use something else. I've used SQLite once for a media library with about 10k objects and it was slow: it took around 5 minutes to query and display everything, and searches were painful. Switching to Postgres made life so much easier. This is on the performance issue only.
It also might be better to create an index table that contains the date, the data/column you want to add, and a reference to the primary key of the object it belongs to, and use that for your deletions instead of altering the table all the time. This can be done in SQLite by giving the reference column an INTEGER type and storing the object's primary key in it, instead of using a FOREIGN KEY as you would with MySQL/Postgres.
If your database is pretty much a collection of almost-homogeneous data, you could just as well go for a simpler key-value database. If the main action you perform on the data is scanning through everything, it would perform significantly better.
The Python standard library has bindings for popular ones via anydbm. There is also a dict-imitating proxy over anydbm in shelve. You could pickle your objects with their attributes using any serializer you want (simplejson, yaml, pickle).
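A tiny shelve sketch, just to show the shape of that approach (the file name, keys, and values are illustrative):

import shelve

# Each object's attributes live under a string key; shelve pickles the
# values transparently and persists them in a dbm file on disk.
with shelve.open("objects_shelf") as db:
    db["object-42"] = {"name": "sensor A", "2015-06-01": 3.14}

with shelve.open("objects_shelf") as db:
    print(db["object-42"]["name"])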