pymongo insert w=2, j=True speed up - python

I'm using Python 2.7.8 and pymongo 2.7,
and the MongoDB server is a replica set with one primary and two secondaries.
The MongoDB server is built on AWS, with a 500 GB EBS volume provisioned at 3000 IOPS.
I want to know whether there is any way to speed up inserts when w=2, j=True.
Using pymongo to insert a million files takes a lot of time,
and I know that using w=0 would speed things up, but it isn't safe.
So, any suggestions? Thanks.

Setting w=0 is deprecated. This is the older write model of MongoDB (pre-3.0), which they don't recommend using any more.
Using MongoDB as a file storage system also isn't a great idea; but, you can consider using GridFS if that's the case.
I assume you're trying some sort of mass-import, and you don't have many (or any) readers right now; in which case, you will be okay if any reader sees some, but not all, of the documents.
You have a couple of options:
set j=False. MongoDB will return more quickly (before the documents are committed to the journal), at the potential risk of documents being lost if the DB crashes.
set w=1. This waits only for the primary to acknowledge the write before returning, so slow replication to the secondaries no longer delays the insert.
If you do need strong consistency requirements (readers seeing everything inserted so far), neither of these options will help.
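As a hedged sketch of the second option (modern pymongo shown; the hostnames, database, and collection names here are placeholders), the write concern can be relaxed on the collection handle so each insert only waits for the primary's acknowledgement and skips the journal wait:
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

# Hostnames and replica-set name are placeholders.
client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
coll = client["mydb"].get_collection(
    "mycoll",
    write_concern=WriteConcern(w=1, j=False),  # ack from the primary only, no journal wait
)
coll.insert_one({"key": "value"})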

You can use unordered or ordered bulk inserts.
This speeds things up a lot. You might also take a look at my muBulkOps, a wrapper around pymongo bulk operations.
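For example, a minimal sketch with unordered batched inserts (the collection name and batch size are placeholders): insert_many with ordered=False lets the server keep processing a batch even if an individual document fails.
from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycoll"]  # placeholder names

def docs():
    # Placeholder generator standing in for the real documents.
    for i in range(1000000):
        yield {"n": i}

batch = []
for doc in docs():
    batch.append(doc)
    if len(batch) == 10000:
        coll.insert_many(batch, ordered=False)
        batch = []
if batch:
    coll.insert_many(batch, ordered=False)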

Related

Spark and Cassandra through Python

I have a huge amount of data stored in Cassandra and I want to process it with Spark through Python.
I just want to know how to connect Spark and Cassandra through Python.
I have seen people using sc.cassandraTable, but it isn't working for me, and fetching all the data from Cassandra at once and then feeding it to Spark doesn't make sense.
Any suggestions?
Have you tried the examples in the documentation?
Spark Cassandra Connector Python Documentation
# "spark" here is the SparkSession (predefined in the pyspark shell).
spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="kv", keyspace="test") \
    .load() \
    .show()
I'll just give my "short" 2 cents. The official docs are totally fine for getting started. You might want to specify why it isn't working: did you run out of memory (perhaps you just need to increase the "driver" memory), or is there some specific error causing your example to fail? It would also be nice if you provided that example.
Here are some opinions based on my own experience. Usually, though not always, you have multiple columns within a partition. You don't always have to load all the data in a table; most of the time you can keep the processing within a single partition. Since the data is sorted within a partition, this usually goes pretty fast and hasn't presented any significant problem.
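As a rough sketch of reading just one partition (the keyspace, table, and column names below are made up, and it assumes the Spark Cassandra Connector package is on the classpath): a filter on the partition key is pushed down to Cassandra, so the whole table is never pulled into Spark.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-partition-read")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="events", keyspace="test")
      .load()
      .filter("sensor_id = 'sensor-42'"))  # partition-key predicate, pushed down to Cassandra

df.show()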
If you don't want the whole store-in-Cassandra, fetch-into-Spark cycle for your processing, there are really a lot of solutions out there; that discussion is basically Quora material. Here are some of the more common ones:
Do the processing in your application right away. This might require some sort of inter-instance communication framework like Hazelcast or, even better, Akka Cluster; this is really a wide topic.
Spark Streaming: do your processing right away in micro-batches and flush the results to some persistence layer for reading, which might be Cassandra.
Apache Flink: use a proper streaming solution and periodically flush the state of the processing to, for example, Cassandra.
Store data in Cassandra the way it's supposed to be read. This approach is the most advisable (it's just hard to say more with the info you provided).
The list could go on and on: user-defined functions in Cassandra, aggregate functions if your task is something simpler, and so on.
It would also be a good idea to provide some details about your use case. What I've said here is pretty general and vague, but then again, putting all of this into a comment just wouldn't make sense.

peewee with bulk insert is very slow into sqlite db

I'm trying to do a large-scale bulk insert into a sqlite database with peewee. I'm using atomic, but the performance is still terrible. I'm inserting the rows in blocks of ~2500 rows, and due to SQLITE_MAX_VARIABLE_NUMBER I'm inserting about 200 of them at a time. Here is the code:
with helper.db.atomic():
    for i in range(0, len(expression_samples), step):
        gtd.GeneExpressionRead.insert_many(expression_samples[i:i+step]).execute()
And the list expression_samples is a list of dictionaries with the appropriate fields for the GeneExpressionRead model. I've timed this loop, and it takes anywhere from 2 to 8 seconds to execute. I have millions of rows to insert, and the way my code is written now it will likely take 2 days to complete. As per this post, there are several pragmas that I have set in order to improve performance; this also didn't really change anything for me performance-wise. Lastly, as per this test on the peewee GitHub page, it should be possible to insert many rows very fast (~50,000 in 0.3364 seconds), but it also seems that the author used raw SQL to get this performance. Has anyone been able to do such a high-performance insert using peewee methods?
Edit: I did not realize that the test on peewee's GitHub page was for MySQL inserts; it may or may not apply to this situation.
Mobius was trying to be helpful in the comments but there's a lot of misinformation in there.
Peewee creates indexes for foreign keys when you create the table. This happens for all database engines currently supported.
Turning on the foreign-key PRAGMA is going to slow things down; why would it be otherwise?
For best performance, do not create any indexes on the table you are bulk-loading into. Load the data, then create the indexes. This is much, much less work for the database.
As you noted, disabling auto increment for the bulk-load speeds things up.
Other information:
Use PRAGMA journal_mode=wal;
Use PRAGMA synchronous=0;
Use PRAGMA locking_mode=EXCLUSIVE;
Those are some good settings for loading in a bunch of data. Check the sqlite docs for more info:
http://sqlite.org/pragma.html
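Putting those pieces together, a minimal sketch (the database filename is an assumption; GeneExpressionRead and expression_samples stand for the model and data from the question): set the PRAGMAs on the open connection, then load inside a single transaction.
from peewee import SqliteDatabase

db = SqliteDatabase("data.db")  # filename is an assumption
db.connect()
db.execute_sql("PRAGMA journal_mode=wal;")
db.execute_sql("PRAGMA synchronous=0;")
db.execute_sql("PRAGMA locking_mode=EXCLUSIVE;")

step = 200  # keep each INSERT under SQLite's bound-variable limit
with db.atomic():
    for i in range(0, len(expression_samples), step):
        GeneExpressionRead.insert_many(expression_samples[i:i + step]).execute()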
In all of the documentation where atomic appears as a context manager, it is called as a function (i.e., with parentheses). Since it sounds like your code never exits the with block, you're probably just not seeing the error about the missing __exit__ method.
Can you try with helper.db.atomic():?
atomic() starts a transaction. Without an open transaction, inserts are much slower, because some expensive bookkeeping has to be done for every write instead of only once at the beginning and end of the transaction.
EDIT
Since the code at the start of the question has been changed, can I have some more information about the table you're inserting into? How large is it, and how many indexes does it have?
Since this is SQLite, you're just writing to a file, but do you know whether that file is on a local disk or on a network-mounted drive? I've had issues just like this because I was trying to insert into a database on an NFS mount.

Struggling to take the next step in how to store my data

I'm struggling with how to store some telemetry streams. I've played with a number of things, and I find myself feeling like I've hit writer's block.
Problem Description
Via a UDP connection, I receive telemetry from different sources. Each source is decomposed into a set of devices, and for each device there are at most 5 different value types I want to store. They come in no faster than once per minute and may be sparse. The values are transmitted with a hybrid edge/level-triggered scheme (data for a value is sent when it is either different enough or enough time has passed). So it's a 2- or 3-level hierarchy, with a dictionary of time series.
The things I want to do most with the data are a) access the latest values and b) enumerate the timespans (begin/end/value). I don't really care about a lot of "correlations" between the data; it's not the case that I want to compute averages or correlate between streams. Generally, I look at the latest value for a given type, across all of the hierarchy or some subset derived from it. Or I focus on one value stream and enumerate its spans.
I'm not a database expert at all. In fact I know very little. And my three colleagues aren't either. I do python (and want whatever I do to be python3). So I'd like whatever we do to be as approachable as possible. I'm currently trying to do development using Mint Linux. I don't care much about ACID and all that.
What I've Done So Far
Our first version of this used the Gemstone Smalltalk database. Building a specialized Timeseries object worked like a charm. I've done a lot of Smalltalk, but my colleagues haven't, and the Gemstone system is NOT just a "jump in and be happy right away". And we want to move away from Smalltalk (though I wish the marketplace made it otherwise). So that's out.
Played with RRD (Round Robin Database). A novel approach, but we don't need the compression that badly, and since our capture is edge-triggered, it doesn't work well for our data model.
A friend talked me into using sqlite3, and I may try this again. My first attempt didn't work out so well; I may have been trying to be too clever by doing things the "normalized" way. I got something working OK at first, but getting the "latest" value of a given field for a subset of devices was turning into some hairy (for me) SQL, and the speed of doing so was kind of disappointing. So it turned out I'd need to learn about indexing too. I found I was getting into a hole I didn't want to be in, headed right back to where we were with the Smalltalk DB: a lot of specialized knowledge, with me the only person who could work with it.
I thought I'd go the "roll your own" route. My data is not HUGE, disk is cheap, and I know very well how to read and write files. And aren't filesystems hierarchical databases anyway? I'm sure "people in the know" are rolling their eyes at this primitive approach, but it was the most approachable. With a little bit of Python code, I used directories for my structuring and a 2-file scheme for each value (one file for the latest value and an append log for the rest). This has worked OK, but I'd rather not be liable for the wrinkles I haven't quite worked out yet, and there's a fair amount of code involved in how the data is serialized to and from disk (just simple strings right now). One nice thing about this approach is that while I can write Python scripts to analyze the data, some things can be done just fine with classic command-line tools, e.g. a simple query to show all the latest rssi values:
ls Telemetry/*/*/rssi | xargs cat
I spent this morning looking at alternatives. Browsed the NoSQL sites, read up on PyTables, scanned the ZODB tutorial. PyTables looks very well suited to what I'm after: a hierarchy of named tables modeling time series. But I don't think PyTables works with Python 3 yet (at least, there is no Debian/Ubuntu package for Python 3 yet). Ditto for ZODB. And I'm afraid I don't know enough about what the many different NoSQL databases do to even take a stab at one.
Plea for Ideas
I find myself more bewildered and confused than at the start of this. I was probably too naive in thinking I'd find something a little more "fire and forget" and be past this by now. Any advice and direction you have would be hugely appreciated. If someone can give me a recipe that meets my needs without huge amounts of overhead/education/ingress, I'd mark that as the answer for sure.
Ok, I'm going to take a stab at this.
We use Elastic Search for a lot of our unstructured data: http://www.elasticsearch.org/. I'm no expert on this subject, but in my day-to-day work I rely on the indices a lot. Basically, you post JSON objects to the index, which lives on some server. You can query the index via a URL or by posting a JSON object to the appropriate place. I use pyelasticsearch to connect to the indices; that package is well documented, and the main class you use is thread-safe.
The query language is pretty robust itself, but you could just as easily add a "latest time" field to the records before you post them to the index.
Anyway, I don't feel that this deserves a check mark (even if you go that route), but it was too long for a comment.
What you describe fits the relational database model (e.g., sqlite3).
Keep one table.
id, device_id, valuetype1, valuetype2, valuetype3, ... ,valuetypen, timestamp
I assume all devices are of the same type (i.e., they have the same set of values that you care about). If they do not, consider simply setting value = NULL where a column doesn't apply to a specific device type.
Each time you get an update, duplicate the last row and update the newest value:
INSERT INTO DeviceValueTable (device_id, valuetype1, valuetype2,..., timestamp)
SELECT device_id, valuetype1, #new_value, ...., NOW()
FROM DeviceValueTable
WHERE device_id = #device_id
ORDER BY timestamp DESC
LIMIT 1;
To get the latest values for a specific device:
SELECT *
FROM DeviceValueTable
WHERE device_id = #device_id
ORDER BY timestamp DESC
LIMIT 1;
To get the latest values for all devices:
SELECT a.*
FROM DeviceValueTable a
INNER JOIN
    (SELECT device_id, MAX(timestamp) AS newest
     FROM DeviceValueTable
     GROUP BY device_id) b
    ON a.device_id = b.device_id AND a.timestamp = b.newest;
You might be worried about the storage cost of the duplicated values; rely on the database to handle compression.
Also, keep in mind simplicity over optimization. Make it work, then if it's too slow, find and fix the slowness.
Note, these queries were not tested on sqlite3 and may contain typos.
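For completeness, a small sketch of the same idea with Python's standard-library sqlite3 module (the filename and the trimmed column list are assumptions based on the schema above):
import sqlite3

conn = sqlite3.connect("telemetry.db")  # filename is an assumption
conn.execute("""
    CREATE TABLE IF NOT EXISTS DeviceValueTable (
        id INTEGER PRIMARY KEY,
        device_id TEXT,
        valuetype1 REAL,
        valuetype2 REAL,
        timestamp TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

# Latest row per device, as in the query above.
latest = conn.execute("""
    SELECT a.*
    FROM DeviceValueTable a
    JOIN (SELECT device_id, MAX(timestamp) AS newest
          FROM DeviceValueTable GROUP BY device_id) b
      ON a.device_id = b.device_id AND a.timestamp = b.newest
""").fetchall()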
It sounds to me like you want an on-disk, implicitly sorted data structure like a B-tree or similar.
Maybe check out:
http://liw.fi/larch/
http://www.egenix.com/products/python/mxBase/mxBeeBase/
Your issue isn't technical; it's a poorly specified problem.
If you are doing anything with sensor data then the old laboratory maxim applies "If you don't write it down, it didn't happen". In the lab, that means a notebook and pen, on a computer it means ACID.
You also seem to be prematurely optimizing the solution, which is well known to be the root of all evil. You don't say what size the data are, but you do say they "come no faster than once per minute, and may be sparse". Assuming they are an even 1.0 KB in size, that's about 1.5 MB/day, or roughly 0.5 GB/year. My StupidPhone has more storage than you would need in a year, and my laptop has sneezes that are larger.
The biggest problem is that you claim to "know very little" about databases and that is the crux of the matter. Your data is standard old 1950s data-processing boring. You're jumping into buzzword storage technologies when SQLite would do everything you need if only you knew how to ask it. Given that you've got Smalltalk DB down, I'd be quite surprised if it took more than a day's study to learn all the conventional RDBM principles you need and then some.
After that, you'd be able to write a question that can be answered in more than generalities.

*large* python dictionary with persistence storage for quick look-ups

I have 400 million lines of unique key-value info that I would like to be available for quick lookups in a script. I am wondering what would be a slick way of doing this. I did consider the following, but I'm not sure whether there is a way to disk-map the dictionary without using a lot of memory except during dictionary creation.
Pickled dictionary object: not sure if this is an optimal solution for my problem.
NoSQL-type databases: ideally I want something with minimal dependencies on third-party stuff, and the keys and values are simply numbers. If you feel this is still the best option, I would like to hear that too; maybe it will convince me.
Please let me know if anything is not clear.
Thanks!
-Abhi
If you want to persist a large dictionary, you are basically looking at a database.
Python comes with built-in support for sqlite3, which gives you an easy database solution backed by a file on disk.
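A minimal sketch (the file and table names are made up): since the question says the keys and values are simply numbers, a two-column table with a primary key on the key column gives indexed lookups straight from disk.
import sqlite3

conn = sqlite3.connect("kv.db")  # made-up filename
conn.execute("CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v INTEGER)")
conn.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)", [(1, 100), (2, 200)])
conn.commit()

row = conn.execute("SELECT v FROM kv WHERE k = ?", (1,)).fetchone()
print(row[0])  # 100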
No one has mentioned dbm. It is opened like a file, behaves like a dictionary and is in the standard distribution.
From the docs (https://docs.python.org/3/library/dbm.html):
import dbm

# Open database, creating it if necessary.
with dbm.open('cache', 'c') as db:

    # Record some values
    db[b'hello'] = b'there'
    db['www.python.org'] = 'Python Website'
    db['www.cnn.com'] = 'Cable News Network'

    # Note that the keys are considered bytes now.
    assert db[b'www.python.org'] == b'Python Website'
    # Notice how the value is now in bytes.
    assert db['www.cnn.com'] == b'Cable News Network'

    # Often-used methods of the dict interface work too.
    print(db.get('python.org', b'not present'))

    # Storing a non-string key or value will raise an exception (most
    # likely a TypeError).
    db['www.yahoo.com'] = 4

# db is automatically closed when leaving the with statement.
I would try this before any of the more exotic forms, and using shelve/pickle will pull everything into memory on loading.
Cheers
Tim
In principle, the shelve module does exactly what you want. It provides a persistent dictionary backed by a database file. Keys must be strings, but shelve will take care of pickling/unpickling values. The type of db file can vary, but it can be a Berkeley DB hash, which is an excellent lightweight key-value database.
Your data size sounds huge, so you must do some testing, but shelve/BDB is probably up to it.
Note: the bsddb module has been deprecated, so shelve may not support BDB hashes in the future.
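A minimal sketch (the filename is made up): shelve gives a dict-like object backed by a dbm file, pickling values automatically; keys must be strings.
import shelve

with shelve.open("lookup.shelf") as db:   # made-up filename
    db["12345"] = 42                      # keys are strings, values get pickled
    print(db.get("12345"))                # 42
    print(db.get("missing", None))        # None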
Without a doubt (in my opinion), if you want this to persist, then Redis is a great option.
Install redis-server
Start redis server
Install the redis Python package (pip install redis)
Profit.
import redis

ds = redis.Redis(host="localhost", port=6379)

with open("your_text_file.txt") as fh:
    for line in fh:
        line = line.strip()
        k, _, v = line.partition("=")
        ds.set(k, v)
The above assumes a file of values like:
key1=value1
key2=value2
etc=etc
Modify the insertion script to your needs.
import redis

ds = redis.Redis(host="localhost", port=6379)

# Do your code that needs to do look ups of keys:
for mykey in special_key_list:
    val = ds.get(mykey)
Why I like Redis:
Configurable persistence options
Blazingly fast
Offers more than just key/value pairs (other data types)
#antirez
I don't think you should try the pickled dict. I'm pretty sure that Python will slurp the whole thing in every time, which means your program will wait for I/O longer than perhaps necessary.
This is the sort of problem for which databases were invented. You are thinking "NoSQL" but an SQL database would work also. You should be able to use SQLite for this; I've never made an SQLite database that large, but according to this discussion of SQLite limits, 400 million entries should be okay.
What are the performance characteristics of sqlite with very large database files?
I personally use LMDB and its Python binding for a database with a few million records.
It is extremely fast, even for a database larger than RAM.
It's embedded in the process, so no server is needed.
Dependencies are managed with pip.
The only downside is that you have to specify the maximum size of the DB; LMDB is going to mmap a file of this size. If it's too small, inserting new data will raise an error; too large, and you create a sparse file.
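A minimal sketch with the third-party lmdb package (pip install lmdb; the path and map_size below are made up): map_size is the maximum database size mentioned above and must be declared up front.
import lmdb

env = lmdb.open("kv.lmdb", map_size=10 * 1024 ** 3)  # 10 GB ceiling (made up)
with env.begin(write=True) as txn:
    txn.put(b"12345", b"42")
with env.begin() as txn:
    print(txn.get(b"12345"))  # b'42'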

Is Using Python to MapReduce for Cassandra Dumb?

Since Cassandra doesn't have MapReduce built in yet (I think it's coming in 0.7), is it dumb to try to do MapReduce with my Python client, or should I just use CouchDB or Mongo or something?
The application is stats collection, so I need to be able to sum values with grouping in order to increment counters. I'm not actually building Google Analytics, but pretend I am: I want to keep track of which browsers appear, which pages they went to, and visits vs. pageviews.
I would just atomically update my counters on write, but Cassandra isn't very good at counters either.
Maybe Cassandra just isn't the right choice for this?
Thanks!
Cassandra has supported MapReduce since version 0.6. (The current stable release is 0.5.1, but go ahead and try the new MapReduce functionality in 0.6.0-beta3.) To get started, I recommend taking a look at the word-count MapReduce example in 'contrib/word_count'.
MongoDB has update-in-place, so MongoDB should be very good with counters. http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics
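For example, a minimal sketch (database, collection, and field names are made up): MongoDB's $inc update modifies a counter document in place, creating it on first use via upsert.
from pymongo import MongoClient

counters = MongoClient()["analytics"]["counters"]  # made-up names
counters.update_one(
    {"_id": {"browser": "Chrome", "page": "/home"}},
    {"$inc": {"pageviews": 1}},
    upsert=True,
)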
