I would like to take data from the Facebook Graph API and analyze it to find out roughly how close one person is to another. I am attempting to use the Pylons framework with SqlAlchemy (right now it is attached to a SQLite database) to store information from the Graph API so that I can make it available to my other applications via a RESTful web service. I am wondering what would be the best approach to analyzing the data.
For example, should I create objects analogous to the nodes and edges in the Graph API (users, posts, statuses, etc.) and analyze them, then store only the aftermath of that analysis in the database, perhaps the UIDs of each node and its connections to other nodes? Or should I store even less, and only have a database of the users and their close friends? Or should I go through step by step and store each of the objects via the ORM mapper in the database and make the analysis from the database after having filled it?
What sorts of concerns go into the designing of a database in situations like this? How should objects relate/map to the model? Where should the analysis be taking place during the whole process of grabbing data and storing it?
I'd store as much as possible: dump everything you can, and try to maintain the relationships between nodes so you can traverse/analyze them later. This gives you the opportunity to analyze your data set as much as you want, over and over, trying different things. If you want to use SQLAlchemy you could use a simple self-referential relationship: http://www.sqlalchemy.org/docs/05/mappers.html#adjacency-list-relationships. That way you can maintain the connections between objects and traverse them easily. You should also think about using MongoDB. It's pretty nice for this sort of thing: you can pretty much just dump the JSON responses you get from Facebook straight into MongoDB, and it has a great Python client. Here are the MongoDB docs on storing a tree in MongoDB: http://www.mongodb.org/display/DOCS/Trees+in+MongoDB. There are a couple of approaches there that make sense.
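To make the SQLAlchemy suggestion concrete, here is a minimal sketch of that adjacency-list pattern in current declarative style (SQLAlchemy 1.4+); the Node model, its columns, and the SQLite URL are placeholders for whatever Graph API objects you end up storing:

```python
# Minimal sketch of a self-referential (adjacency-list) mapping in SQLAlchemy
# (1.4+ declarative style). Node, fb_uid and the SQLite URL are placeholders.
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import backref, declarative_base, relationship, sessionmaker

Base = declarative_base()

class Node(Base):
    __tablename__ = "nodes"

    id = Column(Integer, primary_key=True)
    fb_uid = Column(String, unique=True)     # ID returned by the Graph API
    kind = Column(String)                    # "user", "post", "status", ...
    parent_id = Column(Integer, ForeignKey("nodes.id"))

    # one-to-many children, with a backref to walk back up to the parent
    children = relationship("Node", backref=backref("parent", remote_side=[id]))

engine = create_engine("sqlite:///graph.db")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

user = Node(fb_uid="12345", kind="user")
post = Node(fb_uid="12345_678", kind="post", parent=user)
session.add_all([user, post])
session.commit()

# Traverse later: session.query(Node).filter_by(fb_uid="12345").one().children
```

For a real friendship graph you would likely swap the single parent_id for a self-referential many-to-many association table, but the traversal idea is the same.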
Related
I have collected a large Twitter dataset (>150GB) that is stored in some text files. Currently I retrieve and manipulate the data using custom Python scripts, but I am wondering whether it would make sense to use a database technology to store and query this dataset, especially given its size. If anybody has experience handling Twitter datasets of this size, please share your experiences, especially any suggestions as to which database technology to use and how long the import might take. Thank you.
I recommend using a database for this, especially considering its size (this is without knowing anything about what the dataset holds). That said, for this and future questions of this nature I'd suggest using the software recommendations site, and adding more detail about what the dataset looks like.
As for suggesting a specific database, I recommend doing some research into what each one does, but for something that just holds data with no relations, any of them will do, and you could see a big query improvement over plain text files: queries can be cached, and data is faster to retrieve because of how databases store and look up records, whether via hashed values or whatever indexing they use.
Some popular databases:
MySQL, PostgreSQL - Relational databases (simple, fast, and easy to use/set up, but you need some knowledge of SQL)
MongoDB - NoSQL database (also easy to use and set up, with no SQL needed; it relies more on dicts to access the DB through the API (a quick pymongo sketch follows below). It is also memory-mapped, so it can be faster than a relational database, but you need enough RAM for the indexes.)
ZODB - Pure-Python NoSQL database (kind of like MongoDB, but written in Python)
These are very light and brief explanations of each DB; be sure to do your research before using them, as they each have their pros and cons. Also, remember these are just a few of many popular and highly used databases; there are also TinyDB and PickleDB, which are pure Python, and SQLite (which ships with Python), all generally meant for small applications.
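As a quick illustration of the dict-style access mentioned for MongoDB above, here is a minimal pymongo sketch; the database, collection, and field names are made up:

```python
# Minimal pymongo sketch: documents go in and come out as plain Python dicts.
# The "twitter_data"/"tweets" names and the tweet fields are just examples.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["twitter_data"]["tweets"]

# Insert a (parsed) tweet as-is; no schema has to be defined up front.
collection.insert_one({"id": 1, "user": "alice", "text": "hello", "retweets": 3})

# Query with a dict as well; index the field first if you query it a lot.
collection.create_index("user")
for tweet in collection.find({"user": "alice"}):
    print(tweet["text"])
```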
My experience is mainly with PostgreSQL, TinyDB, and MongoDB, my favorites being MongoDB and PostgreSQL. For you, I'd look at either of those, but don't limit yourself: there's a slew of them, plus many drivers that help you write easier/less code if that's what you want. Remember, Google is your friend! And welcome to Stack Overflow!
Edit
If your dataset is, and will remain, fairly simple but large, and you want to stick with text files, consider pandas together with a JSON or CSV format and library. It can greatly increase efficiency when querying/managing data like this from text files, and it uses less memory since it won't always (or ever) need the entire dataset in memory.
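For example, here is a minimal pandas sketch of processing a large CSV in chunks, so the full dataset never has to fit in memory; the file name and columns are hypothetical:

```python
# Sketch: filter/aggregate a large CSV of tweets chunk by chunk with pandas,
# so the full dataset is never loaded into memory at once.
# "tweets.csv" and its columns are hypothetical.
import pandas as pd

counts = {}
for chunk in pd.read_csv("tweets.csv", chunksize=100_000):
    # keep only what you need from each chunk
    subset = chunk[chunk["lang"] == "en"]
    for user, n in subset["user_id"].value_counts().items():
        counts[user] = counts.get(user, 0) + n

top_users = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top_users)
```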
You can try using any NoSQL DB. MongoDB would be a good place to start.
Background:
I am developing a Django app for a business application that takes client data and displays charts in a dashboard. I have large databases full of raw information such as part sales by customer, and I will use that to populate the analyses. I have been able to do this very nicely in the past using Python with pandas, xlsxwriter, etc., and am now in the process of replicating in this web app what I have done before. I am using a PostgreSQL database to store the data, Django to build the app, and FusionCharts for the visualization. To get the information into Postgres, I am using a Python script with SQLAlchemy, which does a great job.
The question:
There are two ways I can manipulate the data that will populate the charts. 1) I can use the same script that exports the data to Postgres to arrange the data the way I like it before it is exported. For instance, in certain cases I need to group the data by some parameter (by customer, for instance), then perform calculations on the groups by column. I could do this for each different slice I want and then export a different table for each model class to Postgres (a rough sketch of this follows after option 2).
2) I can upload the entire database to Postgres and manipulate it later with Django commands that produce SQL queries.
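A rough sketch of option 1), with invented table/column names and connection URL: pre-aggregate with pandas, then push each slice to Postgres through SQLAlchemy's engine and DataFrame.to_sql:

```python
# Sketch of option 1): pre-aggregate with pandas, then export each slice
# to Postgres. Table names, columns, and the connection URL are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/dashboard")

raw = pd.read_csv("part_sales.csv")          # or however the raw data arrives

# One slice: total sales per customer
by_customer = (raw.groupby("customer")["sale_amount"]
                  .sum()
                  .reset_index())
by_customer.to_sql("sales_by_customer", engine, if_exists="replace", index=False)

# Another slice: sales per part per month
raw["month"] = pd.to_datetime(raw["sale_date"]).dt.to_period("M").astype(str)
by_part = raw.groupby(["part_number", "month"])["sale_amount"].sum().reset_index()
by_part.to_sql("sales_by_part_month", engine, if_exists="replace", index=False)
```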
I am much more comfortable doing it up front with Python because I have been doing it that way for a while. I also understand that Django's queries are a little more difficult to implement. However, doing it with Python would mean I will need more tables (because I will have grouped them in different ways), and I don't want to do it the way I know just because it is easier, if uploading a single database and using Django/SQL queries would be more efficient in the long run.
Any thoughts or suggestions are appreciated.
Well, it's the usual tradeoff between performance and flexibility. With the first approach you get better performance (your schema is tailored for the exact queries you want to run) but less flexibility (if you need to add more queries, the schema might not match so well, or even not match at all, in which case you'll have to repopulate the database, possibly from the raw sources, with an updated schema). With the second one you (hopefully) have a well-normalized schema, but one that makes queries much more complex and much heavier on the database server.
Now the question is: do you really have to choose? You could also keep both the fully normalized data AND the denormalized (pre-processed) data alongside each other.
As a side note: Django's ORM is indeed mostly an "80/20" tool. It's designed to make the 80% of simple queries super easy (much easier than, say, SQLAlchemy), and beyond that it does become a bit of a PITA, but nothing forces you to use Django's ORM for everything (you can always drop down to raw SQL or use SQLAlchemy alongside it).
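For example, dropping down below the ORM for the harder queries might look roughly like this; Order, its fields, and the myapp_order table name are placeholders:

```python
# Sketch: two ways to bypass the Django ORM for the harder queries.
# "Order", its fields, and "myapp_order" are placeholder names.
from django.db import connection
from myapp.models import Order

# 1) Raw SQL mapped back onto model instances
orders = Order.objects.raw(
    "SELECT * FROM myapp_order WHERE amount > %s", [100]
)

# 2) Plain cursor when you don't need model objects at all
with connection.cursor() as cursor:
    cursor.execute(
        "SELECT customer_id, SUM(amount) FROM myapp_order GROUP BY customer_id"
    )
    totals = cursor.fetchall()
```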
Oh, and yes: your problem is nothing new. You may want to read about OLAP.
For my app, I am using Flask, however the question I am asking is more general and can be applied to any Python web framework.
I am building a comparison website where I can update details about products in the database. I want to structure my app so that 99% of users who visit my website will never need to query the database, with information instead retrieved from the cache (memcached or Redis).
I require my app to be realtime, so any update I make to the database must be instantly available to any visitor to the site. Therefore I do not want to cache views/routes/html.
I want to cache the entire database. However, because there are so many different variables when it comes to querying, I am not sure how to structure this. For example, if I were to cache every query and then later need to update a product in the database, I would basically need to flush the entire cache, which isn't ideal for a large web app.
What I would prefer is to cache individual rows within the database. The problem is, how do I structure this so I can flush the cache appropriately when an update is made to the database? Also, how can I map all of this together from the cache?
I hope this makes sense.
I had this exact question myself, though with a PHP project. My solution was to use ElasticSearch as an intermediate cache between the application and database.
The trick to this is the ORM. I designed it so that when Entity.save() is called, the object is first stored in the database, then the complete object (with all references) is pushed to Elasticsearch, and only then is the transaction committed and control returned to the caller.
This way I maintained the full functionality of a relational database (atomic changes, transactions, constraints, triggers, etc.) and still had all entities cached with all their references (parent and child relations), together with the ability to invalidate individual cached objects.
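Translated to Python, a rough sketch of that save() flow might look like the following; the model fields, index name, and connection details are invented, and the exact keyword for the Elasticsearch client's index() call depends on the client version:

```python
# Sketch of the save()-through pattern described above: write to the RDBMS,
# push the denormalized document to Elasticsearch, and only then commit.
# Model/index names are invented; the kwarg is "document=" in the 8.x
# elasticsearch client ("body=" in older versions).
from elasticsearch import Elasticsearch
from sqlalchemy.orm import Session

es = Elasticsearch("http://localhost:9200")

def save_product(session: Session, product) -> None:
    try:
        session.add(product)
        session.flush()              # assigns the primary key, still uncommitted

        doc = {
            "id": product.id,
            "name": product.name,
            "category": product.category.name,   # denormalize references
            "price": product.price,
        }
        es.index(index="products", id=product.id, document=doc)

        session.commit()             # commit only once the cache accepted it
    except Exception:
        session.rollback()           # keep DB and cache consistent on failure
        raise
```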
Hope this helps.
So a free eBook called "Redis in Action" by Josiah Carlson answered all of my questions. It is quite long, but after reading through, I have a fairly solid understanding of how to structure a caching architecture. It gives real-world examples, such as a social network and a shopping site with tons of traffic. I will need to read through it again once or twice to fully understand. A great book!
Link: Redis in Action
I am running a web app on Google App Engine with Python. My app lets users post topics and respond to them, and the website is basically a collection of these posts categorized onto different pages.
Now I only have around 200 posts and 30 visitors a day, but that is already taking up nearly 20% of my reads and 10% of my writes with the datastore. I am wondering whether it is more efficient to use App Engine's built-in get_by_id() function to retrieve posts by their IDs, or whether it is better to build my own. For some of the queries I will simply have to use GQL or the built-in query language, because those posts are retrieved on more than just an ID, but I wanted to see which was better.
Thanks!
Are you doing efficient caching (or any caching at all)?
Also, if you're using that many writes for only a couple hundred posts, it seems like you might have a problem with your models. Have you looked at the Datastore viewer to see how many writes you use per entity?
You might read the docs on exploding indexes; maybe that's part of your problem?
It's way better to use get_by_id(). It finds the exact object and costs way less (it counts as a query that touches only one entity).
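For example, with the (old) ndb client library that pattern looks roughly like this; Post and its fields are placeholders:

```python
# Sketch: fetching a post by ID with ndb instead of running a query.
# "Post" and its fields are placeholders.
from google.appengine.ext import ndb

class Post(ndb.Model):
    title = ndb.StringProperty()
    body = ndb.TextProperty()

# Direct key lookup: a single-entity fetch
post = Post.get_by_id(12345)

# versus a GQL query for the cases that genuinely need one:
# posts = ndb.gql("SELECT * FROM Post WHERE category = :1", "python").fetch(20)
```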
I'd suggest using pre-existing code and building around that instead of re-inventing the wheel.
I'm developing a multi-player game in Python with a Flask frontend, and I'm using it as an opportunity to learn more about the NoSQL way of doing things.
Redis seems to be a good fit for some of the things I need for this app, including storage of server-side sessions and other transient data, e.g. what games are in progress, who's online, etc. There are also several good Flask/Redis recipes that have made things very easy so far.
However, there are still some things in the data model that I would prefer lived inside a traditional RDBMS, including user accounts, logs of completed games, etc. It's not that Redis can't do these things, but I just think the RDBMS is more suited to them, and since Redis wants everything in memory, it seems to make sense to "warehouse" some of this data on disk.
The one thing I don't quite have a good strategy for is how to make these two data stores live happily together. Using ORMs like SQLAlchemy and/or redisco seems right out, because the ORMs are going to want to own all the data that's part of their data model, and there are inevitably times I'm going to need to have classes from one ORM know about classes from the other one (e.g. "users are in the RDBMS, but games are in Redis, and games have users participating in them").
Does anyone have any experience deploying Python web apps using a NoSQL store like Redis for some things and an RDBMS for others? If so, do you have any strategies for making them work together?
You should have no problem using an ORM because, in the end, it just stores strings, numbers and other values. So you could have a game in progress, and keep its state in Redis, including the players' IDs from the SQL player table, because the ID is just a unique integer.
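For illustration, here is a minimal sketch of that split, with invented names: user rows live in the RDBMS via SQLAlchemy, while an in-progress game lives in a Redis hash that stores only the players' integer primary keys.

```python
# Sketch: users in the RDBMS (SQLAlchemy), in-progress game state in Redis.
# Redis only ever sees the users' integer primary keys. Names are hypothetical.
import json

import redis
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True)

engine = create_engine("sqlite:///game.db")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

r = redis.Redis()

def start_game(game_id, players):
    # Keep only the SQL primary keys (plain integers) in Redis
    r.hset(f"game:{game_id}", mapping={
        "state": "in_progress",
        "players": json.dumps([p.id for p in players]),
    })

def game_players(game_id):
    # Resolve the IDs back into full User rows when you actually need them
    ids = json.loads(r.hget(f"game:{game_id}", "players"))
    return session.query(User).filter(User.id.in_(ids)).all()
```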