Is Using Python to MapReduce for Cassandra Dumb?

Since Cassandra doesn't have MapReduce built in yet (I think it's coming in 0.7), is it dumb to try to MapReduce with my Python client, or should I just use CouchDB or Mongo or something?
The application is stats collection, so I need to be able to sum values with grouping in order to increment counters. I'm not, but pretend I'm building Google Analytics: I want to keep track of which browsers appear, which pages they went to, and visits vs. pageviews.
I would just atomically update my counters on write, but Cassandra isn't very good at counters either.
Maybe Cassandra just isn't the right choice for this?
Thanks!

Cassandra has supported MapReduce since version 0.6. (The current stable release is 0.5.1, but go ahead and try the new MapReduce functionality in 0.6.0-beta3.) To get started, I recommend taking a look at the word count MapReduce example in 'contrib/word_count'.
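(Conceptually, that word count job is just the standard map/reduce counting pattern. As an illustration only, and not the contrib example itself, here is a minimal in-memory Python sketch of that pattern; the row fields are made up:)
from collections import defaultdict

def map_phase(rows):
    # Emit (key, 1) pairs, e.g. key = (browser, page).
    for row in rows:
        yield (row["browser"], row["page"]), 1

def reduce_phase(pairs):
    # Sum the counts per key.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Hypothetical sample rows standing in for data read from Cassandra.
rows = [
    {"browser": "Firefox", "page": "/home"},
    {"browser": "Firefox", "page": "/home"},
    {"browser": "Chrome", "page": "/about"},
]
print(reduce_phase(map_phase(rows)))
# {('Firefox', '/home'): 2, ('Chrome', '/about'): 1}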

MongoDB has update-in-place, so MongoDB should be very good with counters. http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics
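For the counter use case in the question, a minimal PyMongo sketch might look like the following (assuming PyMongo 3+ for update_one; the database, collection, and field names are hypothetical):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
stats = client.analytics.daily_stats  # hypothetical database/collection names

# Atomically bump per-browser and per-page counters in one day's document.
# upsert=True creates the document on first use; $inc is applied in place.
stats.update_one(
    {"_id": "2024-01-01"},
    {"$inc": {"pageviews": 1,
              "browsers.Firefox": 1,
              "pages./home": 1}},
    upsert=True,
)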

Related

Spark and Cassandra through Python

I have huge data stored in Cassandra and I want to process it using Spark through Python.
I just want to know how to interconnect Spark and Cassandra through Python.
I have seen people using sc.cassandraTable, but it isn't working, and fetching all the data at once from Cassandra and then feeding it to Spark doesn't make sense.
Any suggestions?
Have you tried the examples in the documentation?
Spark Cassandra Connector Python Documentation
spark.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="kv", keyspace="test")\
.load().show()
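For a slightly fuller picture, here is a sketch of what that read can look like inside a script, assuming the Spark Cassandra Connector package was supplied at submit time (e.g. via --packages) and using the docs' test.kv table; the connection host and the filter column are illustrative:
from pyspark.sql import SparkSession

# Assumes the job is submitted with the Spark Cassandra Connector on the
# classpath, e.g. spark-submit --packages com.datastax.spark:spark-cassandra-connector_<scala>:<version>
spark = (
    SparkSession.builder
    .appName("cassandra-read-sketch")
    .config("spark.cassandra.connection.host", "127.0.0.1")  # hypothetical host
    .getOrCreate()
)

df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(table="kv", keyspace="test")
    .load()
)

# Equality predicates on partition-key columns are pushed down to Cassandra,
# so only the matching partition is fetched rather than the whole table.
df.filter(df.key == "some_partition_key").show()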
I'll just give my "short" 2 cents. The official docs are totally fine to get you started. You might want to specify why this isn't working, i.e. did you run out of memory (perhaps you just need to increase the driver memory), or is there some specific error that is causing your example not to work? It would also be nice if you provided that example.
Here are some of my opinions/experiences. Usually (not always, but most of the time) you have multiple columns per partition. You don't always have to load all the data in a table; more often than not you can keep the processing within a single partition. Since the data is sorted within a partition, this usually goes pretty fast and hasn't presented any significant problem.
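As an aside, if the data you need really does live in a single partition, you can skip Spark entirely and read it with the DataStax Python driver. A rough sketch, with a hypothetical keyspace, table, and partition key column:
from cassandra.cluster import Cluster

# Hypothetical contact point, keyspace, and table.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# Restricting on the partition key keeps the query on a single partition,
# where rows are already sorted by the clustering columns.
rows = session.execute(
    "SELECT event_time, value FROM readings WHERE sensor_id = %s",
    ("sensor-42",),
)

total = sum(row.value for row in rows)
print(total)

cluster.shutdown()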
If you don't want the whole "store in Cassandra, fetch into Spark" cycle for your processing, there are really a lot of solutions out there; basically, that would be Quora material. Here are some of the more common ones:
Do the processing in your application right away - this might require some sort of inter-instance communication framework like Hazelcast or, even better, an Akka cluster; this is really a wide topic.
Spark Streaming - do your processing right away in micro-batches and flush the results to some persistence layer for reading - which might be Cassandra.
Apache Flink - use a proper streaming solution and periodically flush the state of the process to, e.g., Cassandra.
Store the data in Cassandra the way it's supposed to be read - this approach is the most advisable (it's just hard to say for sure with the info you provided).
The list could go on and on ... user-defined functions in Cassandra, or aggregate functions if your task is something simpler.
It might also be a good idea to provide some details about your use case. More or less everything I said here is pretty general and vague, but then again, putting all of this into a comment just wouldn't make sense.

pymongo insert w=2, j=True speed up

I'm using Python 2.7.8 and PyMongo 2.7,
and the MongoDB server is a replica set with one primary and two secondaries.
The MongoDB server is built on AWS with EBS: 500 GB, 3000 IOPS.
I want to know: is there any way to speed up inserts when w=2, j=True?
Using pymongo to insert a million files takes a lot of time,
and I know that if I use w=0 it will speed things up, but it isn't safe.
So any suggestions? Please help me, thanks.
Setting W=0 is deprecated. This is the older model of MongoDB (pre-3.0), which they don't recommend using any more.
Using MongoDB as a file storage system also isn't a great idea, but you can consider using GridFS if that's the case.
I assume you're trying some sort of mass import and you don't have many (or any) readers right now, in which case you will be okay if a reader sees some, but not all, of the documents.
You have a couple of options:
set j=False. MongoDB will return more quickly (before the documents are committed to the journal), at the potential risk of documents being lost if the DB crashes.
set W=1. If replication is slow, this will only wait until one of the nodes (the primary) has the data before returning.
If you do need strong consistency requirements (readers seeing everything inserted so far), neither of these options will help.
You can use unordered or ordered bulk inserts.
This speeds things up a lot. Maybe also take a look at my muBulkOps, a wrapper for pymongo bulk operations.
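For illustration, here is a sketch of an unordered bulk insert with an acknowledged-but-relaxed write concern, assuming PyMongo 3+ (bulk_write/InsertOne); the database and collection names are hypothetical, and the asker's PyMongo 2.7 has an equivalent initialize_unordered_bulk_op() API:
from pymongo import MongoClient, InsertOne
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")

# w=1, j=False: acknowledged by the primary but not waiting for the journal
# or for replication to the secondaries.
coll = client.mydb.get_collection(
    "mycoll", write_concern=WriteConcern(w=1, j=False)
)

docs = [{"n": i} for i in range(100000)]  # stand-in documents

# ordered=False lets the server keep going past individual errors and
# process the batch more freely, which is usually faster for mass imports.
coll.bulk_write([InsertOne(d) for d in docs], ordered=False)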

Loading data into Titan with bulbs and then accessing it

I am a complete novice in graph databases and the whole Titan ecosystem, so please excuse me if I sound stupid. I am also suffering from the lack of documentation -_-
I've installed the titan server. I am using Cassandra as a back-end.
I am trying to load basic Twitter data into Titan using Python.
I use the bulbs library for this purpose.
Let's say I have a list of the people I follow on Twitter in the friends list;
my Python script goes like this:
from bulbs.titan import Graph
# some other imports here
# getting the *friends* list for a specified user here
g = Graph()
# a vertex of a specified user
center = g.vertices.create(name='sergiikhomenko')
for friend in friends:
    cur_friend = g.vertices.create(name=friend)
    g.edges.create(center, 'follows', cur_friend)
From what I understand, the above code should have created a graph in Titan with a number of vertices, some of which are connected by the follows edge.
My questions are:
How do I save it in Titan?? (like a commit in SQL)
How do I access it later?? Should I be able to access it through
gremlin shell?? If yes, how??
My next question would be about visualizing the data, but I am very far from there :)
Please help :) I am completely lost in all this Titan, Gremlin, Rexster,etc. :)
Update: One of the requirements of our POC project is ... Python :), which is why I jumped straight into bulbs. I'll definitely follow the advice below, though :)
My answer will be somewhat incomplete because I can't really supply answers around Bulbs but you do ask some specific questions which I can try to answer:
How do I save it in Titan?? (like a commit in SQL)
It's just g.commit() in Java/Groovy.
How do I access it later?? Should I be able to access it through gremlin shell?? If yes, how??
Once it's committed to Cassandra, access it with Bulbs, the Gremlin shell, some other application, whatever. Not sure what you're asking really, but I like the Gremlin Console for such things, so if you have Cassandra started locally, start up bin/gremlin.sh and do:
g = TitanFactory.build()
.set("storage.backend","cassandra")
.set("storage.hostname","127.0.0.1")
.open();
That will get you a connection to cassandra and you should be able to query your data.
I am completely lost in all this Titan, Gremlin, Rexster,etc
My advice to all new users (especially those new to graphs, cassandra, the jvm, etc.) is to slow down. The fastest way to get discouraged is to try to do python to the bulbs to the rexster to the gremlin over the titan to the cassandra cluster hosted in ec2 with hadoop - and try to load a billion edge graph into that.
If you are new, then start with the latest stuff: TinkerPop3 - http://tinkerpop.incubator.apache.org/ - which bulbs does not yet support - but that's ok because you're learning TinkerPop which is important to learning the whole stack and all of TinkerPop's implementations (e.g. Titan). Use TinkerGraph (not Titan) with a small subset of your data and make sure you get the pattern for loading that small subset right before you try to go full scale. Use the Gremlin Console for everything related to this initial goal. That is a recipe for an easy win. Under that approach you could likely have a Graph going with some queries over your own data in a day and learn a good portion of what you need to do it with Titan.
Once you have your Graph, get it working in Gremlin Server (the Rexster replacement for TP3). Then think about how you might access that via python tooling. Or maybe you figure out how to convert TinkerGraph to Titan (perhaps start with BerkeleyDB rather than cassandra). My point here is to more slowly increment your involvement with different pieces of the ecosystem because it is otherwise overwhelming.

How to count results from many GAE tasks?

I run many, many tasks to get some information and process it. After each task runs, I have an integer which indicates how many portions of the information I've got.
I would like to get sum of these integers received from different tasks.
Currently I use memcache to store the sum:
from google.appengine.api import memcache  # needed for memcache.get/set

def update_memcache_value(what, val, how_long=86400):
    # read the current total, add the new value, and write it back
    value_old = get_memcache_value(what)
    memcache.set('system_' + what, value_old + val, how_long)

def get_memcache_value(what):
    value = memcache.get('system_' + what)
    if not value:
        value = 0
    return int(value)
update_memcache_value is called within each task (quite a bit more often than once). But it looks like the data there is often lost during the day. I could use NDB to store the same data, but that would require a lot of write ops. Is there a better way to store this data (a counter)?
It sounds like you are specifically looking to have many tasks each do part of a sum and then have them all reduce down to one number at the end... so you want to use MapReduce. Or you could just use Pipelines, as MapReduce is actually built on top of it. If you're worried about write ops, then you aren't going to be able to take advantage of App Engine's parallelism.
Google I/O 2010 - Data pipelines with Google App Engine
https://www.youtube.com/watch?v=zSDC_TU7rtc
Pipelines Library
https://github.com/GoogleCloudPlatform/appengine-pipelines/wiki
MapReduce
https://cloud.google.com/appengine/docs/python/dataprocessing/
Unfortunately, if your tasks span the whole day, memcache is not an option on its own.
If you want to reduce the write ops, you could keep a second counter in memcache and back its value up (e.g. to the datastore) every 100 tasks, or whatever works for you.
If you are expecting to do this without using write ops for every task, you could try backing up those results in third-party storage like, for example, a Google Spreadsheet through the Spreadsheets API, but that seems like overkill just to save some write ops (and it's not as performant, which I guess is not an issue).
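A rough sketch of the periodic-backup idea, with a hypothetical NDB model and counter names: memcache.incr is atomic, unlike the separate get/set pair in the question, and an occasional datastore write limits how much a memcache eviction can lose.
from google.appengine.api import memcache
from google.appengine.ext import ndb

class CounterBackup(ndb.Model):  # hypothetical backup entity
    value = ndb.IntegerProperty(default=0)

BACKUP_EVERY = 100  # flush to the datastore every N task completions

def add_to_counter(what, val):
    # incr with initial_value is atomic; no read-modify-write race.
    total = memcache.incr('system_' + what, delta=val, initial_value=0)
    ticks = memcache.incr('system_' + what + '_ticks', initial_value=0)
    # Both calls return None if memcache is unavailable, so guard on that.
    if total is not None and ticks and ticks % BACKUP_EVERY == 0:
        # Occasional datastore write so an eviction only loses the
        # increments made since the last backup.
        CounterBackup(id=what, value=total).put()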

Struggling to take the next step in how to store my data

I'm struggling with how to store some telemetry streams. I've played with a number of things, and I find myself at something like writer's block.
Problem Description
Via a UDP connection, I receive telemetry from different sources. Each source is decomposed into a set of devices, and for each device there are at most 5 different value types I want to store. They come in no faster than once per minute and may be sparse. The values are transmitted with a hybrid edge/level-triggered scheme (send data for a value when it is either different enough or enough time has passed). So it's a 2- or 3-level hierarchy, with a dictionary of time series.
The things I want to do most with the data are a) access the latest values and b) enumerate the timespans (begin/end/value). I don't really care about a lot of "correlations" between the data; it's not the case that I want to compute averages or correlate between streams. Generally, I look at the latest value for a given type, across all or some hierarchy-derived subset, or I focus on one value stream and enumerate its spans.
I'm not a database expert at all. In fact I know very little. And my three colleagues aren't either. I do python (and want whatever I do to be python3). So I'd like whatever we do to be as approachable as possible. I'm currently trying to do development using Mint Linux. I don't care much about ACID and all that.
What I've Done So Far
Our first version of this used the Gemstone Smalltalk database. Building a specialized Timeseries object worked like a charm. I've done a lot of Smalltalk, but my colleagues haven't, and the Gemstone system is NOT just a "jump in and be happy right away". And we want to move away from Smalltalk (though I wish the marketplace made it otherwise). So that's out.
Played with RRD (Round Robin Database). A novel approach, but we don't need the compression that badly, and being edge-triggered, it doesn't work well for our data capture model.
A friend talked me into using sqlite3. I may try this again. My first attempt didn't work out so well; I may have been trying to be too clever. I was trying to do things the "normalized" way, and I got something working OK at first. But getting the "latest" value of a given field for a subset of devices was turning into some hairy (for me) SQL, and the speed was kind of disappointing. So it turned out I'd need to learn about indexing too. I found I was getting into a hole I didn't want to be in, and heading right back to where we were with the Smalltalk DB: a lot of specialized knowledge, with me the only person who could work with it.
I thought I'd go the "roll your own" route. My data is not HUGE. Disk is cheap. And I know very well how to read/write files. And aren't filesystems hierarchical databases anyway? I'm sure that "people in the know" are rolling their eyes at this primitive approach, but this method was the most approachable. With a little bit of Python code, I used directories for my structuring, and then a 2-file scheme for each value (one file for the latest value, and an append log for the rest of the values). This has worked OK, but I'd rather not be liable for the wrinkles I haven't quite worked out yet, and there's as much code involved in how the data is serialized to/from disk (just using simple strings right now). One nice thing about this approach is that while I can write Python scripts to analyze the data, some things can be done just fine with classic command line tools, e.g. (a simple query to show all latest rssi values):
ls Telemetry/*/*/rssi | xargs cat
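For reference, a bare-bones sketch of the two-file-per-value scheme described above, with hypothetical paths and the same plain-string serialization:
import os

ROOT = "Telemetry"  # hypothetical layout: Telemetry/<source>/<device>/<value_type>

def record(source, device, value_type, timestamp, value):
    # Append to the log and overwrite the "latest" file for one value stream.
    d = os.path.join(ROOT, source, device)
    os.makedirs(d, exist_ok=True)
    line = "{} {}\n".format(timestamp, value)
    with open(os.path.join(d, value_type + ".log"), "a") as log:
        log.write(line)
    with open(os.path.join(d, value_type), "w") as latest:
        latest.write(line)

def latest(source, device, value_type):
    # The "latest" file holds exactly one "timestamp value" line.
    with open(os.path.join(ROOT, source, device, value_type)) as f:
        return f.read().strip()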
I spent this morning looking at alternatives. Browsed the NoSQL sites. Read up on PyTables. Scanned the ZODB tutorial. PyTables looks very well suited to what I'm after: a hierarchy of named tables modeling time series. But I don't think PyTables works with Python 3 yet (at least, there is no Debian/Ubuntu package for Python 3 yet). Ditto for ZODB. And I'm afraid I don't know enough about what the many different NoSQL databases do to even take a stab at one.
Plea for Ideas
I find myself more bewildered and confused than at the start of this. I was probably too naive to think I'd find something a little more "fire and forget" and be past this point by now. Any advice and direction you have would be hugely appreciated. If someone can give me a recipe that meets my needs without huge amounts of overhead/education/ingress, I'd mark that as the answer for sure.
OK, I'm going to take a stab at this.
We use Elasticsearch for a lot of our unstructured data: http://www.elasticsearch.org/. I'm no expert on this subject, but in my day-to-day work I rely on the indices a lot. Basically, you post JSON objects to the index, which lives on some server. You can query the index via the URL, or by posting a JSON object to the appropriate place. I use pyelasticsearch to connect to the indices; that package is well documented, and the main class you use is thread-safe.
The query language is pretty robust itself, but you could just as easily add a "latest time" field to the records before you post them to the index.
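A minimal indexing sketch with pyelasticsearch (the index name, doc type, and field names are hypothetical, and call signatures may vary between versions):
from pyelasticsearch import ElasticSearch

es = ElasticSearch('http://localhost:9200/')  # hypothetical local node

# One JSON document per reading; 'telemetry' and 'reading' are made-up names.
doc = {
    "source": "site-a",
    "device": "dev-7",
    "value_type": "rssi",
    "value": -67,
    "timestamp": "2014-06-01T12:00:00Z",
}
es.index('telemetry', 'reading', doc)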
Anyway, I don't feel that this deserves a check mark (even if you go that route), but it was too long for a comment.
What you describe fits the relational database model (e.g., sqlite3).
Keep one table.
id, device_id, valuetype1, valuetype2, valuetype3, ... ,valuetypen, timestamp
I assume all devices are of the same type (i.e., have the same set of values that you care about). If they do not, consider simply setting value = NULL when it doesn't apply to a specific device type.
Each time you get an update, duplicate the last row and update the newest value:
INSERT INTO DeviceValueTable (device_id, valuetype1, valuetype2,..., timestamp)
SELECT device_id, valuetype1, #new_value, ...., NOW()
FROM DeviceValueTable
WHERE device_id = #device_id
ORDER BY timestamp DESC
LIMIT 1;
To get the latest values for a specific device:
SELECT *
FROM DeviceValueTable
WHERE device_id = #device_id
ORDER BY timestamp DESC
LIMIT 1;
To get the latest values for all devices:
SELECT a.*
FROM DeviceValueTable a
INNER JOIN (
    SELECT device_id, MAX(timestamp) AS newest
    FROM DeviceValueTable
    GROUP BY device_id
) b ON a.device_id = b.device_id AND a.timestamp = b.newest;
You might be worried about the cost (size) of storing the duplicate values. Rely on the database to handle compression.
Also, keep in mind simplicity over optimization. Make it work, then if it's too slow, find and fix the slowness.
Note, these queries were not tested on sqlite3 and may contain typos.
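Since you want to stay in Python 3, here is a minimal sqlite3 sketch of the single-table approach above, narrowed to two hypothetical value types:
import sqlite3

conn = sqlite3.connect("telemetry.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS DeviceValueTable (
        id        INTEGER PRIMARY KEY,
        device_id TEXT,
        rssi      REAL,
        temp      REAL,
        timestamp TEXT
    )
""")

def insert_reading(device_id, rssi, temp):
    conn.execute(
        "INSERT INTO DeviceValueTable (device_id, rssi, temp, timestamp) "
        "VALUES (?, ?, ?, datetime('now'))",
        (device_id, rssi, temp),
    )
    conn.commit()

def latest_for_device(device_id):
    # Latest row for one device, mirroring the query above.
    return conn.execute(
        "SELECT * FROM DeviceValueTable WHERE device_id = ? "
        "ORDER BY timestamp DESC LIMIT 1",
        (device_id,),
    ).fetchone()

insert_reading("dev-7", -67, 21.5)
print(latest_for_device("dev-7"))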
It sounds to me like you want an on-disk, implicitly sorted datastructure like a btree or similar.
Maybe check out:
http://liw.fi/larch/
http://www.egenix.com/products/python/mxBase/mxBeeBase/
Your issue isn't technical, it's a poor problem specification.
If you are doing anything with sensor data then the old laboratory maxim applies "If you don't write it down, it didn't happen". In the lab, that means a notebook and pen, on a computer it means ACID.
You also seem to be prematurely optimizing the solution, which is well known to be the root of all evil. You don't say what size the data are, but you do say they "come in no faster than once per minute, and may be sparse". Assuming an even 1.0 KB per minute, that's about 1.5 MB/day, or roughly 0.5 GB/year. My StupidPhone has more storage than you would need in a year, and my laptop has sneezes that are larger.
The biggest problem is that you claim to "know very little" about databases, and that is the crux of the matter. Your data is standard old 1950s data-processing boring. You're jumping into buzzword storage technologies when SQLite would do everything you need, if only you knew how to ask it. Given that you've got a Smalltalk DB down, I'd be quite surprised if it took more than a day's study to learn all the conventional RDBMS principles you need, and then some.
After that, you'd be able to write a question that can be answered in more than generalities.
