Loading data into Titan with bulbs and then accessing it - python

I am a complete novice with graph databases and the whole Titan ecosystem, so please excuse me if I sound stupid. I am also suffering from the lack of documentation -_-
I've installed the Titan server and I am using Cassandra as the backend.
I am trying to load basic twitter data into Titan using Python.
I use the bulbs library for this purpose.
Let's say I have a list of the people I follow on Twitter in the friends list.
My Python script goes like this:
from bulbs.titan import Graph
# some other imports here
# getting the *friends* list for a specified user here

g = Graph()

# a vertex for the specified user
center = g.vertices.create(name='sergiikhomenko')

# one vertex per friend, linked to the user by a "follows" edge
for friend in friends:
    cur_friend = g.vertices.create(name=friend)
    g.edges.create(center, 'follows', cur_friend)
From what I understand, the above code should have created a graph in Titan with a number of vertices, some of which are connected by the follows edge.
My questions are:
How do I save it in Titan?? (like a commit in SQL)
How do I access it later?? Should I be able to access it through the gremlin shell?? If yes, how??
My next question would be about visualizing the data, but I am very far from there :)
Please help :) I am completely lost in all this Titan, Gremlin, Rexster, etc. :)
Update: One of the requirements of our POC project is ... Python :), which is why I jumped into bulbs straight away. I'll definitely follow the advice below though :)

My answer will be somewhat incomplete because I can't really supply answers around Bulbs, but you do ask some specific questions that I can try to answer:
How do I save it in Titan?? (like a commit in SQL)
It's just g.commit() in Java/Groovy.
How do I access it later?? Should I be able to access it through gremlin shell?? If yes, how??
Once it's committed to Cassandra, access it with Bulbs, the gremlin shell, some other application, whatever. I'm not sure what you're asking really, but I like the Gremlin Console for such things, so if you have Cassandra started locally, start up bin/gremlin.sh and do:
g = TitanFactory.build()
.set("storage.backend","cassandra")
.set("storage.hostname","127.0.0.1")
.open();
That will get you a connection to Cassandra, and you should be able to query your data.
I am completely lost in all this Titan, Gremlin, Rexster,etc
My advice to all new users (especially those new to graphs, cassandra, the jvm, etc.) is to slow down. The fastest way to get discouraged is to try to do python to the bulbs to the rexster to the gremlin over the titan to the cassandra cluster hosted in ec2 with hadoop - and try to load a billion edge graph into that.
If you are new, then start with the latest stuff: TinkerPop3 - http://tinkerpop.incubator.apache.org/ - which bulbs does not yet support, but that's OK because you're learning TinkerPop, which is key to learning the whole stack and all of TinkerPop's implementations (e.g. Titan). Use TinkerGraph (not Titan) with a small subset of your data and make sure you get the loading pattern for that small subset right before you try to go full scale. Use the Gremlin Console for everything related to this initial goal. That is a recipe for an easy win. With that approach you could likely have a graph going with some queries over your own data in a day, and learn a good portion of what you will need to do the same with Titan.
Once you have your Graph, get it working in Gremlin Server (the Rexster replacement for TP3). Then think about how you might access that via python tooling. Or maybe you figure out how to convert TinkerGraph to Titan (perhaps start with BerkeleyDB rather than cassandra). My point here is to more slowly increment your involvement with different pieces of the ecosystem because it is otherwise overwhelming.
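To give a sense of what that later Python step might look like: once Gremlin Server is running, the gremlinpython driver can talk to it in a few lines. This is only a minimal sketch, assuming Gremlin Server is listening on its default port (8182) and gremlinpython is installed:
from gremlin_python.driver import client

# connect to a locally running Gremlin Server (default port 8182);
# 'g' is the traversal source the server exposes by default
gremlin_client = client.Client('ws://localhost:8182/gremlin', 'g')

# count the vertices in the graph behind the server
count = gremlin_client.submit('g.V().count()').all().result()
print(count)

gremlin_client.close()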

Related

Data Integration Structure using Python and/or SSIS

I have a question on the general strategy of how to integrate data into an MSSQL database.
Currently, I use python for my whole ETL process. I use it to clean, transform, and integrate the data in an MSSQL database. My data is small so I think this process works fine for now.
However, I think it's a little awkward for my code to constantly read data from and write data to the database. I think this strategy will become an issue once I'm dealing with large amounts of data, and the constant read/write seems very inefficient. However, I don't know enough to know whether this is a real problem or not.
I want to know if this is a feasible approach, or whether I should switch entirely to SSIS to handle it. SSIS seems clunky to me and I'd prefer not to rewrite my entire codebase. Any input on the general ETL architecture would be very helpful.
Is this practice alright? Maybe?
There are too many factors to give a definitive answer. Conceptually, what you're doing - extract data from a source, transform it, load it to a destination (ETL) - is all that SSIS does. It can likely do some things more efficiently than Python - at least, I've had a devil of a time getting a bulk load to work with memory-mapped data, while dumping to disk and bulk inserting that via Python is no problem. But if the existing process works, then let it run until it doesn't.
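To make the dump-and-bulk-insert idea concrete, here is a minimal sketch assuming pyodbc is available, the SQL Server service account can read the staging path, and dbo.StagingTable already matches the CSV layout (the table, path, and helper names here are all hypothetical):
import pyodbc

# run your existing Python transform, then dump to a flat file
# (extract() and transform() stand in for your current ETL steps;
# df is assumed to be a pandas DataFrame)
df = transform(extract())
df.to_csv(r'C:\etl\staging\stage.csv', index=False, header=False)

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;'
    'DATABASE=Warehouse;Trusted_Connection=yes'
)
cursor = conn.cursor()
# let SQL Server pull the file in one bulk operation instead of row-by-row inserts
cursor.execute(r"""
    BULK INSERT dbo.StagingTable
    FROM 'C:\etl\staging\stage.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');
""")
conn.commit()
conn.close()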
If your team knows Python, introducing SSIS just to do ETL is likely going to be a bigger maintenance cost than scaling up your existing approach. On the other hand, if it's standard-ish Python plus libraries and you're on SQL Server 2017+, you might be able to execute your scripts from within the database itself via sp_execute_external_script.
If the ETL process runs on the same box as the database, then ensure you have sufficient resources to support both processes at their maximum observed levels of activity. If the ETL runs elsewhere, then you'll want to ensure you have fast, full duplex connectivity between the database server and the processing box.
Stand up a load testing environment that parallels production's resources. Dummy up a 10x increase in source data and observe how the ETL fares. 100x, 1000x. At some point, you'll identify what development sins you committed that do not scale and then you're poised to ask a really good, detailed question describing the current architecture, the specific code that does not perform well under load and how one can reproduce this load.
The above design considerations will hold true for Python, SSIS or any other ETL solution - prepackaged or bespoke.

DB choice for webapp in Flask running on old SBC

I'll be setting up a webapp with Flask on an old Raspberry Pi B+ running Raspbian. The Pi will also handle the desktop fuzz, so I'll try to keep it as light as possible.
The point of this question is mainly (1) what DB should I use? But I'm also wondering whether (2) keeping it on an external USB stick would help. Let's take it step by step.
What DB: Consideration points
I'd rather do the programming using SQLAlchemy, so restrictions apply
The schema is not complex (around 10 tables)
Only one local user at first, probably forever, so only a few queries and connections
Low overhead; the Pi will most likely struggle, so I'm just trying to minimize it.
The second point is about SD card burnout. I read somewhere that any DB will hit SD cards pretty hard, and it got me thinking.
I'll set up some kind of external backup for this DB anyway, but should I also keep the DB file itself on the stick? This should be really simple if I choose to use SQLite.
TYA
SQLite sounds like a perfect fit for this sort of use case on embedded systems, where you need a lightweight yet full-featured database. Many folks use SQLite databases on mobile devices for the same reasons: fairly limited CPU/memory resources, and simple storage as a single file.
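If you do decide to keep the database file on the USB stick, the Flask side is just a path change. A minimal sketch, assuming Flask-SQLAlchemy and a stick mounted at /media/usbstick (the mount point, file name, and model are placeholders):
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
# four slashes after sqlite: means an absolute path; point it at the stick
# instead of the SD card
app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:////media/usbstick/app.db'
db = SQLAlchemy(app)

class User(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(80))

with app.app_context():
    db.create_all()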

Spark and Cassandra through Python

I have a huge amount of data stored in Cassandra and I want to process it using Spark through Python.
I just want to know how to connect Spark and Cassandra through Python.
I have seen people using sc.cassandraTable, but it isn't working for me, and fetching all the data from Cassandra at once and then feeding it to Spark doesn't make sense.
Any suggestions?
Have you tried the examples in the documentation?
Spark Cassandra Connector Python Documentation
spark.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="kv", keyspace="test")\
.load().show()
I'll just give my "short" 2 cents. The official docs are totally fine for getting started. You might want to specify why this isn't working, i.e. did you run out of memory (perhaps you just need to increase the "driver" memory), or is there some specific error that is causing your example not to work? Also, it would be nice if you provided that example.
Here are some opinions/experiences of my own. Usually - not always, but most of the time - you have multiple columns within a partition. You don't always have to load all the data in a table; more often than not you can keep the processing within a single partition. Since the data is sorted within a partition, this usually goes pretty fast and hasn't presented any significant problems.
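To make that concrete: a filter on the partition key gets pushed down by the Spark Cassandra connector, so Spark reads only that one partition instead of scanning the table. A rough sketch, assuming pyspark with the connector package on the classpath (the keyspace, table and column names are made up):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-partition-read")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="events", keyspace="analytics")
      .load()
      # the partition-key predicate is pushed down to Cassandra
      .filter("user_id = '42'"))

df.show()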
If you don't want the whole store-in-Cassandra, fetch-to-Spark cycle for your processing, there are really a lot of solutions out there. Basically that would be Quora material. Here are some of the more common ones:
Do the processing in your application right away - this might require some sort of inter-instance communication framework like Hazelcast or, even better, Akka Cluster; this is really a wide topic
Spark Streaming - do your processing right away in micro-batches and flush the results to some persistence layer for reading - which might be Cassandra
Apache Flink - use a proper streaming solution and periodically flush the state of the process to, e.g., Cassandra
Store data in Cassandra the way it's supposed to be read - this approach is the most advisable (it's just hard to say with the info you provided)
The list could go on and on ... user-defined functions in Cassandra, aggregate functions if your task is something simpler.
It might also be a good idea to provide some details about your use case. More or less everything I said here is pretty general and vague, but then again putting all of this into a comment just wouldn't make sense.

How to package SC instrument for beta testers?

I've built a sample instrument using the following architecture:
A Python script reads sample files from a Redis database stored on disk and sends OSC messages to SuperCollider with the path and pitch of a random selection of N samples. On the SC side, key presses from a MIDI interface are mapped to select and play one or more of the corresponding samples.
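Roughly, the Python side boils down to something like this (a simplified sketch only - the Redis key layout, OSC address and SuperCollider port shown here are placeholders, using the redis and python-osc packages):
import random
import redis
from pythonosc.udp_client import SimpleUDPClient

r = redis.Redis()                          # sample metadata indexed in local Redis
sc = SimpleUDPClient("127.0.0.1", 57120)   # sclang's default UDP port

# pick N random samples and tell SC where to find them and at what pitch
sample_keys = r.keys("sample:*")
for key in random.sample(sample_keys, 5):
    path = r.hget(key, "path").decode()
    pitch = float(r.hget(key, "pitch").decode())
    sc.send_message("/loadSample", [path, pitch])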
The prototype is functional and I would like to release a beta for testers; however, I have no clue how to package it. One option that seems plausible is to wrap it as a VST, but as far as I understand there is no stable wrapper for SC, and the safest bet would be to re-code the entire instrument as a VST.
Another option, which seems more viable, would be to wrap it as a standalone instrument. Would the beta testers need to have SC installed, or is there a way to wrap an SC server inside an executable?
Any ideas on this issue - even if they diverge from my original approach - will be highly appreciated.
Fortunately there are lots of options for this in SuperCollider. You may want to start by reviewing this article from the documentation, in which Making Standalone Applications is discussed rather thoroughly.
Alternately, there are some pre-built standalones floating around, frequently on GitHub. I frequently use this repository to package up an installation or instrument and deploy on Raspberry Pi.
I'm not super familiar with VST or SuperCollider, but maybe you could try something like Docker. This is a container-based solution that might meet your needs.
You set up a Dockerfile, which lets you provide instructions to build a container with the SC server. Then let the person using it decide whether they want a Redis instance inside the same Docker container or a separate Redis container.

best way to process local file and update Titan graph database

I am a newbie to the Titan graph database.
I am trying to process local data and insert it into Titan.
I am looking for a programming or scripting language that offers a fast way to process local data and update/insert into Titan.
Bulbs is a Python interface that uses the REST API to update Titan, but I sometimes see the program hang.
Can I use a shell script to process the file and call a Gremlin script to update Titan?
Thanks a lot for advice.
If the graph schema is not too complex and the data is in a single file, the easiest way is simply to use a Gremlin script. Check out this simple recipe to load an edge list:
http://gremlindocs.com/#recipes/reading-from-a-file
If you have a large amount of data, consider using the BatchGraph wrapper for easier programming, auto-commit and better performance:
https://github.com/tinkerpop/blueprints/wiki/Batch-Implementation
Once you have your script, you could run it in the Gremlin REPL or execute it from shell script with gremlin.sh:
https://github.com/tinkerpop/gremlin/wiki/Using-Gremlin-through-Groovy#gremlin-and-groovy-shell
Note that your question is about Titan, but I've responded generically with Blueprints in mind (so you will see TinkerGraph examples in many of these links); since Titan is Blueprints-compatible, the code should work just as well for Titan.
I know this is an old question, but gremlin-migrate is an npm package that runs Gremlin scripts in the order in which they are intended to run. I'd use it not so much for one-off data loads, but more for continuously ensuring your DB schema etc. is correct and up to date. It's good to include in your CI/CD pipeline :-).
Disclosure: I'm the author of the tool, which I created after reading this and not finding a gremlin based migration tool in npm.
