I am using CouchDB to store Twitter data. I found that CouchDB stops updating its database even though I keep receiving Twitter data. I store each dictionary of Twitter data with the save method of the Python couchdb library, db.save(twitter_dic), where db is the database instance. Sometimes CouchDB stops storing after about 3 GB of data, and sometimes it stops as early as 0.6 GB. I don't know the reason. If someone has come across a similar situation, please help me out. If this problem cannot be solved, I would like to switch to another key-value database with a Python wrapper (very similar to CouchDB) that supports map-reduce and the like. Can someone recommend such a database?
I had to reinstall CouchDB, and I am marking this question as accepted.
I am using the Firebase Python client to write data to Firestore. Any read or write operation takes at least 1 second to complete. The Firestore DB is in us-central and our server is in Singapore.
Is that what is causing the issue?
For reads, I use a where query with a limit, like below:
collection_ref.where(
u"field", u"==", u"field_value").limit(1).get()
For writes, I use set and update(dict).
Sometimes the lag is around 10 to 12 seconds.
Has anyone faced similar issues?
Any pointers will be appreciated.
This article on why Cloud Firestore queries are slow lists the following reasons:
You are downloading a lot of data that you probably don't need. The solution is to limit the amount that comes back.
Your offline cache is too big. Cloud Firestore does some amazing offline caching, but the local cache does not apply the same indexes that the server does. This means that when you query documents in your offline cache, Cloud Firestore needs to unpack every document stored locally for the collection being queried and compare it against your query. The solution is to limit how much data is stored in the offline cache.
Without a composite index, Firestore would have to do a lot of searching to get the result set. So create a composite index, and Firestore can do a quick lookup instead.
You are used to the Realtime Database. The Realtime Database generally has lower latency, though in most cases you are not really going to notice the difference. But if your app is sensitive to every second of latency, you are probably better off using the Realtime Database in these scenarios.
The laws of physics are keeping you down. Your customer might be too far away from your Firestore database, and the round trip is taking too long. To work around this, use real-time listeners, which enable a technique called latency compensation.
I am currently developing a Python Discord bot that uses a Mongo database to store user data.
As this data is continually changed, the database would be subjected to a massive number of queries to both extract and update the data; so I'm trying to find ways to minimize client-server communication and reduce bot response times.
In this sense, is it a good idea to create a copy of a Mongo collection as a list of dictionaries as soon as the script starts, and manipulate the data offline instead of continually querying the database?
In particular, every time data would be looked up with collection.find(), it is instead read from the list. On the other hand, every time data needs to be updated with collection.update(), both the list and the database are updated.
I'll give an example to better explain what I'm trying to do. Let's say that my collection contains documents with the following structure:
{"user_id": id_of_the_user, "experience": current_amount_of_experience}
and the experience value must be continually increased.
Here's how I'm implementing it at the moment:
online_collection = db["collection_name"]  # MongoDB collection
offline_collection = list(online_collection.find())  # in-memory copy of the collection

def updateExperience(user_id):
    # Update the database and the in-memory copy together
    online_collection.update_one({"user_id": user_id}, {"$inc": {"experience": 1}})
    mydocument = next(document for document in offline_collection if document["user_id"] == user_id)
    mydocument["experience"] += 1

def findExperience(user_id):
    # Read only from the in-memory copy; raises StopIteration if the user is missing
    mydocument = next(document for document in offline_collection if document["user_id"] == user_id)
    return mydocument["experience"]
As you can see, the database is involved only for the update function.
Is this a valid approach?
For very large collections (millions of documents), does the next() function have the same execution time, or would there still be some slowdown?
Also, while not explicitly asked in the question, I'd be more than happy to get any advice on how to improve the performance of a Discord bot, as long as it doesn't involve using a VPS or sharding, since I'm already using these options.
I don't really see why not, as long as you're aware of the following:
You will need the system resources to load an entire database into memory.
It is your responsibility to sync the actual db and your local store.
You need to be the only person/system updating the database.
Eventually this pattern will fail, i.e. the db gets too large or more than one process needs to update it, so it isn't future-proof.
In essence you're talking about a caching solution, so there is no need to reinvent the wheel; there are many such products/solutions you could use.
It's probably not the traditional way of doing things, but if it works then why not.
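To make the caching idea concrete, here is a minimal sketch of a write-through cache indexed by user_id. It assumes a single process owns the collection, as noted above. The names ExperienceCache and FakeCollection are hypothetical; FakeCollection is only an in-memory stand-in so the example runs without a MongoDB server. A dict lookup takes roughly constant time, whereas next() over a list scans it linearly, so the list approach slows down in proportion to the collection size.

```python
class ExperienceCache:
    """Write-through cache: reads come from a dict, writes go to both places."""

    def __init__(self, collection):
        self._collection = collection
        # Index documents by user_id once, so lookups avoid a linear scan.
        self._by_user = {doc["user_id"]: doc for doc in collection.find()}

    def find_experience(self, user_id):
        doc = self._by_user.get(user_id)
        return doc["experience"] if doc is not None else None

    def update_experience(self, user_id, amount=1):
        # Update the database first so the cache never gets ahead of it.
        self._collection.update_one(
            {"user_id": user_id}, {"$inc": {"experience": amount}}, upsert=True
        )
        doc = self._by_user.setdefault(
            user_id, {"user_id": user_id, "experience": 0}
        )
        doc["experience"] += amount


class FakeCollection:
    """Hypothetical in-memory stand-in for a pymongo collection (demo only)."""

    def __init__(self, docs):
        self.docs = [dict(d) for d in docs]

    def find(self):
        return [dict(d) for d in self.docs]

    def update_one(self, query, update, upsert=False):
        inc = update["$inc"]["experience"]
        for doc in self.docs:
            if doc["user_id"] == query["user_id"]:
                doc["experience"] += inc
                return
        if upsert:
            self.docs.append({"user_id": query["user_id"], "experience": inc})
```

With a real database, the same class could be handed db["collection_name"] instead of the stand-in, since it only relies on find() and update_one().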
I'm impressed by the speed of running transformations, loading data and ease of use of Pandas and want to leverage all these nice properties (amongst others) to model some large-ish data sets (~100-200k rows, <20 columns). The aim is to work with the data on some computing nodes, but also to provide a view of the data sets in a browser via Flask.
I'm currently using a Postgres database to store the data, but importing the data (from CSV files) is slow, tedious and error-prone, and getting the data out of the database and processing it is not much easier. The data is never going to be changed once imported (no CRUD operations), so I thought it would be ideal to store it as several pandas DataFrames (stored in HDF5 format and loaded via PyTables).
The question is:
(1) Is this a good idea, and what are the things to watch out for? (For instance, I don't expect concurrency problems, as the DataFrames should be stateless and immutable, which is taken care of on the application side.) What else needs to be watched out for?
(2) How would I go about caching the data once it's loaded from the HDF5 file into a DataFrame, so it doesn't need to be loaded for every client request (at least for the most recent/frequent DataFrames)? Flask (or Werkzeug) has a SimpleCache class, but internally it pickles the data and unpickles the cached data on access. I wonder if this is necessary in my specific case (assuming the cached object is immutable). Also, is such a simple caching method usable when the system gets deployed with Gunicorn? Is it possible to have static data (the cache), and can concurrent requests (in different processes?) access the same cache?
I realise these are many questions, but before I invest more time and build a proof of concept, I thought I'd get some feedback here. Any thoughts are welcome.
Answers to some aspects of what you're asking for:
It's not quite clear from your description whether you have the tables in your SQL database only, stored as HDF5 files, or both. Something to look out for here is that if you use Python 2.x and create the files via pandas' HDFStore class, any strings will be pickled, leading to fairly large files. You can also generate pandas DataFrames directly from SQL queries using read_sql, for example.
If you don't need any relational operations, then I would say ditch the Postgres server. If it's already set up and you might need it in the future, keep using the SQL server. The nice thing about the server is that even if you don't expect concurrency issues, they will be handled automatically for you by (Flask-)SQLAlchemy, causing you less headache. In general, if you ever expect to add more tables (files), it's less of an issue to have one central database server than to maintain multiple files lying around.
Whichever way you go, Flask-Cache will be your friend, using either a memcached or a Redis backend. You can then cache/memoize the function that returns a prepared DataFrame from either SQL or an HDF5 file. Importantly, it also lets you cache templates, which may play a role in displaying large tables.
You could, of course, also create a global variable, for example where you create the Flask app, and just import it wherever it's needed. I have not tried this and would thus not recommend it. It might cause all sorts of concurrency issues.
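As a rough illustration of memoizing the loader function, the sketch below uses functools.lru_cache from the standard library rather than Flask-Cache, which is a swapped-in technique. read_table is a hypothetical stand-in for an expensive load such as pandas.read_hdf, and get_dataset is an assumed name. Unlike a memcached or Redis backend, this cache lives inside one process, so each Gunicorn worker would hold its own copy.

```python
from functools import lru_cache


def read_table(name):
    """Hypothetical stand-in for an expensive load such as pandas.read_hdf."""
    read_table.calls += 1
    return {"name": name, "rows": list(range(5))}


# Count how often the expensive load actually runs.
read_table.calls = 0


@lru_cache(maxsize=8)
def get_dataset(name):
    # Cached per process: the same object is returned by reference and is
    # never pickled, so callers must treat it as read-only.
    return read_table(name)
```

Because the cached object is shared rather than copied, this only fits the case described in the question, where the data never changes after import.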
We are setting up a data warehousing and mining project on a Linux cloud server. The primary language is Python.
We would like to use this pattern for querying and storing data:
SQL Database - The SQL database is used to query the data. However, it stores only the fields that need to be searched on; it does NOT store the "blob" of data itself. Instead it stores a key that references the full "blob" of data in a key-value Blobstore.
Blobstore - A key-value Blobstore is used to store actual "documents" or "blobs" of data.
The issue we are having is that we would like more frequently accessed blobs of data to be stored in RAM automatically. We were planning to use Redis for this. However, we would like a solution that automatically tries to get the data out of RAM first and, if it can't find it there, falls back to the blobstore.
Is there a good library or ready-made solution for this that we can use without rolling our own? Also, any comments and criticisms about the proposed architecture would also be appreciated.
Thanks so much!
Rather than using Redis or Memcached for caching, plus a "blobstore" package to store things on disk, I would suggest having a look at Couchbase Server, which does exactly what you want (i.e. serving hot blobs from memory, but still storing them to disk).
In the company I work for, we commonly use the pattern you described (i.e. indexing in a relational database, plus blob storage) for our archiving servers (terabytes of data). It works well when the I/O done to write the blobs is kept sequential. The blobs are never rewritten, but simply appended at the end of a file (which is fine for an archiving application).
The same approach has also been used by others. For instance:
Bitcask (used in Riak): http://downloads.basho.com/papers/bitcask-intro.pdf
Eblob (used in Elliptics project): http://doc.ioremap.net/eblob:eblob
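The index-plus-append-only-file pattern described above can be sketched with the standard library alone. This is an illustration under assumptions, not how Bitcask or Eblob are implemented: the class name AppendOnlyBlobStore, the table layout and the single data file are all hypothetical, and SQLite stands in for the relational index.

```python
import os
import sqlite3


class AppendOnlyBlobStore:
    """Blobs are appended to one file; SQLite records where each one lives."""

    def __init__(self, blob_path, index_path=":memory:"):
        self._blob_path = blob_path
        self._db = sqlite3.connect(index_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS blobs "
            "(blob_key TEXT PRIMARY KEY, start INTEGER, size INTEGER)"
        )

    def put(self, key, data):
        # Blobs are only ever appended, which keeps disk writes sequential.
        with open(self._blob_path, "ab") as f:
            f.seek(0, os.SEEK_END)
            start = f.tell()
            f.write(data)
        # Re-putting a key points the index at the newest copy; the old
        # bytes stay in the file, as in an archive.
        self._db.execute(
            "INSERT OR REPLACE INTO blobs VALUES (?, ?, ?)",
            (key, start, len(data)),
        )
        self._db.commit()

    def get(self, key):
        row = self._db.execute(
            "SELECT start, size FROM blobs WHERE blob_key = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        start, size = row
        with open(self._blob_path, "rb") as f:
            f.seek(start)
            return f.read(size)
```

A real archive would also store the searchable fields in the index table and would need a strategy for rotating or compacting data files, which this sketch leaves out.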
Any SQL database will work for the first part. The Blobstore could also be obtained, essentially, "off the shelf" by using cbfs. This is a new project, built on top of Couchbase 2.0, but it seems to be in pretty active development.
Couchbase already tries to serve results out of its RAM cache before checking disk, and is fully distributed to support large data sets.
CBFS puts a filesystem on top of that, and there is already a FUSE module written for it.
Since filesystems are effectively the lowest common denominator, it should be really easy for you to access it from Python, and it would reduce the amount of custom code you need to write.
Blog post:
http://dustin.github.com/2012/09/27/cbfs.html
Project Repository:
https://github.com/couchbaselabs/cbfs
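For comparison, the "try RAM first, then fall back to the blobstore" behaviour asked about in the question can be sketched as a small read-through cache. This is a hand-rolled illustration, not Couchbase's mechanism: ReadThroughCache is a hypothetical name, an in-process OrderedDict stands in for Redis, and the backing store is assumed to be any object with a get(key) method that returns None for a missing key.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Serve hot values from memory and fall back to a slower backing store."""

    def __init__(self, backing_store, max_items=1024):
        self._store = backing_store  # anything with a .get(key) method
        self._max_items = max_items
        self._hot = OrderedDict()  # least recently used entries come first

    def get(self, key):
        if key in self._hot:
            # Cache hit: mark the entry as most recently used.
            self._hot.move_to_end(key)
            return self._hot[key]
        # Cache miss: read from the backing store and remember the result.
        value = self._store.get(key)
        if value is not None:
            self._hot[key] = value
            if len(self._hot) > self._max_items:
                # Evict the least recently used entry.
                self._hot.popitem(last=False)
        return value
```

Callers only ever call get() on the cache, so the fallback to the blobstore stays invisible to the rest of the application.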
I am running a web app on Google App Engine with Python. My app lets users post topics and respond to them, and the website is basically a collection of these posts categorized onto different pages.
I only have around 200 posts and 30 visitors a day right now, but that is already taking up nearly 20% of my reads and 10% of my writes with the datastore. I am wondering whether it is more efficient to use Google App Engine's built-in get_by_id() function to retrieve posts by their IDs, or whether it is better to build my own. For some of the queries I will simply have to use GQL or the built-in query language, because they retrieve on more than just an ID, but I wanted to see which was better.
Thanks!
Are you doing efficient caching (or any caching at all)?
Also, if you're using that many writes for 300 posts, it seems like you might have a problem with your models. Have you looked at the Datastore viewer to see how many writes you use per entity?
You might read the docs on exploding indexes; maybe that's part of your problem.
It's way better to use get_by_id(). It finds the exact object and costs way less (it counts as a query with only one entity).
I'd suggest using pre-existing code and building around that instead of reinventing the wheel.