best way to process local file and update Titan graph database - python

I am a newbie to the Titan graph database.
I am trying to process local data and insert it into Titan DB.
I am looking for a programming or scripting language that can process local data quickly and update/insert into Titan DB.
Bulbs is a Python interface that uses the REST API to update Titan DB, but I sometimes see the program hang.
Can I use a shell script to process the file and call a Gremlin script to update Titan DB?
Thanks a lot for any advice.

If the graph schema is not too complex and the data is in a single file, the easiest way is to simply use a Gremlin script. Check out this simple recipe to load an edge list:
http://gremlindocs.com/#recipes/reading-from-a-file
If you have a large amount of data, consider using the BatchGraph wrapper for easier programming, auto-commit and better performance:
https://github.com/tinkerpop/blueprints/wiki/Batch-Implementation
Once you have your script, you could run it in the Gremlin REPL or execute it from a shell script with gremlin.sh:
https://github.com/tinkerpop/gremlin/wiki/Using-Gremlin-through-Groovy#gremlin-and-groovy-shell
Note that your question is about Titan, but I've responded generically with Blueprints in mind (so you will see TinkerGraph examples in many of these links); since Titan is Blueprints compatible, the code should work just as well for Titan.
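Since the question asks about Python and mentions bulbs, here is a rough bulbs-based sketch of the same edge-list idea, assuming a tab-separated file of source/target names and a locally running Titan/Rexster endpoint (the file name, delimiter and edge label are placeholders, not part of the recipe):

import csv
from bulbs.titan import Graph

g = Graph()  # assumes Titan/Rexster is running with default settings

seen = {}  # cache of already-created vertices, keyed by name
with open("edges.txt") as f:
    for src, dst in csv.reader(f, delimiter="\t"):
        if src not in seen:
            seen[src] = g.vertices.create(name=src)
        if dst not in seen:
            seen[dst] = g.vertices.create(name=dst)
        g.edges.create(seen[src], "follows", seen[dst])

If the file is large, though, the BatchGraph/gremlin.sh route above will generally be much faster than creating elements one REST call at a time.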

I know this is an old question, but gremlin-migrate is an npm package that runs Gremlin scripts in the intended order. I'd use it not so much for one-off data loads, but more for continuously ensuring your DB schema etc. is correct and up to date. It's a good thing to include in your CI/CD pipeline :-).
Disclosure: I'm the author of the tool, which I created after reading this and not finding a gremlin based migration tool in npm.

Related

Automate a manual task using Python

I have a question and hope someone can point me in the right direction. Basically, every week I have to run a query (SSMS) to get a table containing some information (date, clientnumber, clientID, orderid, etc.), and then I copy all the information from that table and paste it into a folder as a CSV file. It takes me about 15 minutes to do all this, but I am just thinking: can I automate this? If yes, how can I do that, and can I also schedule it so it runs by itself every week? I believe we live in a technological era and this should be done without human input, so I hope I can find someone here willing to show me how to do it using Python.
Many thanks for considering my request.
This should be pretty simple to automate:
Use a database adapter that can work with your database; for MSSQL, the one provided by pyodbc will be fine.
Within the script, connect to the database, perform the query, and parse the output.
Save the parsed output to a .csv file (you can use the csv Python module).
Run the script as a periodic task using cron/schtasks if you work on Linux/Windows respectively (a sketch of the query-and-export part follows below).
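A minimal sketch of the query-and-export part, assuming pyodbc, Windows authentication, and placeholder server/database/query/path values:

import csv
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.execute("SELECT date, clientnumber, clientID, orderid FROM dbo.Orders")

with open(r"C:\exports\weekly_report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor.fetchall())                      # data rows

conn.close()

Once this runs cleanly by hand, point Windows Task Scheduler (schtasks) or cron at it to run it weekly.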
Please note that your question is too broad, and shows no research effort.
You will find that Python can do the tasks you desire.
There are many different ways to interact with SQL servers, depending on your implementation. I suggest you learn Python+SQL using the built-in sqlite3 library. You will want to save your query as a string, and pass it into an SQL connection manager of your choice; this depends on your server setup, there are many different SQL packages for Python.
You can use pandas for parsing the data and saving it to a .csv file (the method is literally called to_csv).
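If you go the pandas route, the query-to-CSV part collapses to a few lines. This sketch uses the built-in sqlite3 module purely for illustration; any DB-API connection (e.g. pyodbc for MSSQL) can be passed to read_sql the same way, and the database/table names here are made up:

import sqlite3
import pandas as pd

conn = sqlite3.connect("example.db")             # placeholder database
df = pd.read_sql("SELECT * FROM orders", conn)   # placeholder query
df.to_csv("weekly_report.csv", index=False)      # literally to_csv, as noted above
conn.close()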
Python does have many libraries for scheduling tasks, but I suggest you hold off for a while. Develop your code in a way that it can be run manually, which will still be much faster/easier than doing it without Python. Once you know your code works, you can easily implement a scheduler. The downside is that your program will always need to be running, and you will need to keep checking to see if it is running. Personally, I would keep it restricted to manually running the script; you could compile it to an .exe and bind it to a hotkey if you need the accessibility.

Data Integration Structure using Python and/or SSIS

I have a question on the general strategy of how to integrate data into an MSSQL database.
Currently, I use python for my whole ETL process. I use it to clean, transform, and integrate the data in an MSSQL database. My data is small so I think this process works fine for now.
However, I think it is a little awkward for my code to constantly read data from and write data to the database. I think this strategy will become an issue once I'm dealing with large amounts of data, since the constant read/write seems very inefficient. However, I don't know enough to tell whether this is a real problem or not.
I want to know if this is a feasible approach or should I switch entirely to SSIS to handle it. SSIS to me is clunky and I'd prefer not to re-write my entire code. Any input on the general ETL architecture would be very helpful.
Is this practice alright? Maybe?
There are too many factors to give a definitive answer. Conceptually, what you're doing - Extract data from a source, Transform it, Load it to a destination (ETL) - is all that SSIS does. It can likely do things more efficiently than Python - at least I've had a devil of a time getting a bulk load to work with memory-mapped data; dump to disk and bulk insert that via Python - no problem. But if the existing process works, then let it go until it doesn't work.
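To illustrate the "dump to disk and bulk insert via Python" point, here is a rough sketch assuming pyodbc, an existing staging table, and a file path the SQL Server service account can read (all names are placeholders):

import csv
import pyodbc

rows = [("2024-01-01", 42, "widget"), ("2024-01-02", 17, "gadget")]  # transformed data

# write the transformed rows to disk first
with open(r"C:\staging\load.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# then let SQL Server bulk-load the file
conn = pyodbc.connect("DSN=mydsn", autocommit=True)  # placeholder DSN
conn.cursor().execute(
    r"BULK INSERT dbo.StagingTable FROM 'C:\staging\load.csv' "
    r"WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')"
)
conn.close()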
If your team knows Python, introducing SSIS just to do ETL is likely going to be a bigger maintenance cost than scaling up your existing approach. On the other hand, if it's standard-ish Python + libraries and you're on SQL Server 2017+, you might be able to execute your scripts from within the database itself via sp_execute_external_script.
If the ETL process runs on the same box as the database, then ensure you have sufficient resources to support both processes at their maximum observed levels of activity. If the ETL runs elsewhere, then you'll want to ensure you have fast, full duplex connectivity between the database server and the processing box.
Stand up a load testing environment that parallels production's resources. Dummy up a 10x increase in source data and observe how the ETL fares. 100x, 1000x. At some point, you'll identify what development sins you committed that do not scale and then you're poised to ask a really good, detailed question describing the current architecture, the specific code that does not perform well under load and how one can reproduce this load.
The above design considerations will hold true for Python, SSIS or any other ETL solution - prepackaged or bespoke.

How to run a SQL Server Script from Python

Having learned SQL before learning any Python, I have a fairly lengthy program/query that I wrote in SQL Server which heavily transforms and calculates the data (basically taking forecast, inventory, Bill of Materials, and efficiency data and then automatically generating a production plan; while I am sure there are things I could optimize, the query/program itself is around 3,000 lines).
While I have figured out how to update the data in SQL Server using a combination of pandas, pyodbc, and fast_to_sql, I have not been able to find a simple method for running a SQL Server script through Python.
I am sure that I could achieve the same thing by just having the data manipulation occur in Python rather than SQL Server, but it would be fairly time-intensive to translate everything.
If there is anything I can do to clarify, please let me know. For reference, I am using the 2017 version of Microsoft SQL Server and Python version 3.8.3.
Try to combine all of your MSSQL scripts into stored procedures and then call them from Python.
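For example, once the heavy lifting lives in a stored procedure, the Python side reduces to a call like the following (connection string and procedure name are placeholders; parameters can be passed with ? markers):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;",
    autocommit=True,
)
# execute the stored procedure that encapsulates the 3,000-line logic
conn.cursor().execute("EXEC dbo.GenerateProductionPlan @PlanDate = ?", "2024-01-01")
conn.close()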

Loading data into Titan with bulbs and then accessing it

I am a complete novice in graph databases and all the Titan ecosystem, so please excuse me sounding stupid. I am also suffering from the lack of documentation -_-
I've installed the titan server. I am using Cassandra as a back-end.
I am trying to load basic twitter data into Titan using Python.
I use the bulbs library for this purpose.
Let's say I have a list of the people I follow on Twitter in the friends list.
My Python script goes like this:
from bulbs.titan import Graph
# some other imports here
# getting the *friends* list for a specified user here
g = Graph()
# a vertex of a specified user
center = g.vertices.create(name = 'sergiikhomenko')
for friend in friends:
    cur_friend = g.vertices.create(name = friend)
    g.edges.create(center, 'follows', cur_friend)
From what I understand, the above code should have created a graph in Titan with a number of vertices, some of which are connected by the follows edge.
My questions are:
How do I save it in Titan?? (like a commit in SQL)
How do I access it later?? Should I be able to access it through gremlin shell?? If yes, how??
My next question would be about visualizing the data, but I am very far from there :)
Please help :) I am completely lost in all this Titan, Gremlin, Rexster, etc. :)
Update: One of the requirements of our POC project is ... Python :), that's why I jumped into bulbs straight away. I'll definitely follow the advice below though :)
My answer will be somewhat incomplete because I can't really supply answers around Bulbs but you do ask some specific questions which I can try to answer:
How do I save it in Titan?? (like a commit in SQL)
It's just g.commit() in Java/Groovy.
How do I access it later?? Should I be able to access it through gremlin shell?? If yes, how??
Once it's committed to cassandra, access it with Bulbs, the gremlin shell, some other application, whatever. Not sure what you're asking really, but I like the Gremlin Console for such things, so if you have cassandra started locally, start up bin/gremlin.sh and do:
g = TitanFactory.build()
.set("storage.backend","cassandra")
.set("storage.hostname","127.0.0.1")
.open();
That will get you a connection to cassandra and you should be able to query your data.
I am completely lost in all this Titan, Gremlin, Rexster, etc.
My advice to all new users (especially those new to graphs, cassandra, the jvm, etc.) is to slow down. The fastest way to get discouraged is to try to do python to the bulbs to the rexster to the gremlin over the titan to the cassandra cluster hosted in ec2 with hadoop - and try to load a billion edge graph into that.
If you are new, then start with the latest stuff: TinkerPop3 - http://tinkerpop.incubator.apache.org/ - which bulbs does not yet support - but that's ok because you're learning TinkerPop which is important to learning the whole stack and all of TinkerPop's implementations (e.g. Titan). Use TinkerGraph (not Titan) with a small subset of your data and make sure you get the pattern for loading that small subset right before you try to go full scale. Use the Gremlin Console for everything related to this initial goal. That is a recipe for an easy win. Under that approach you could likely have a Graph going with some queries over your own data in a day and learn a good portion of what you need to do it with Titan.
Once you have your Graph, get it working in Gremlin Server (the Rexster replacement for TP3). Then think about how you might access that via python tooling. Or maybe you figure out how to convert TinkerGraph to Titan (perhaps start with BerkeleyDB rather than cassandra). My point here is to more slowly increment your involvement with different pieces of the ecosystem because it is otherwise overwhelming.

Python program on server - control via browser

I have to setup a program which reads in some parameters from a widget/gui, calculates some stuff based on database values and the input, and finally sends some ascii files via ftp to remote servers.
In general, I would suggest a Python program to do the tasks: write a Qt widget as a GUI (interactively changing views, putting numbers into tables, setting up check boxes, switching between various layers - I've never done something as complex in Python, but I have some experience in IDL with event handling etc.), and set up data classes that have functions both to create the ASCII files with the given convention and to send the files via FTP to some remote server.
However, since my company is a bunch of Windows users, each sitting at their personal desktop, installing python and all necessary libraries on each individual machine would be a pain in the ass.
In addition, in a future version the program is supposed to become smart and do some optimization 24/7. Therefore, it makes sense to put it to a server. As I personally rather use Linux, the server is already set up using Ubuntu server.
The idea is now to run my application on the server. But how can the users access and control the program?
The easiest way for everybody to access something like a common control panel would be a browser, I guess. I have to make sure only one person at a time is sending signals to the same units, but that should be doable via flags in the database.
After some googling, next to QtWebKit, Django seems to be the first choice for such a task. But...
Can I run a full fledged python program underneath my web application? Is django the right tool to do so?
As mentioned previously, in the (intermediate) future (~1 year), we might have to implement some computationally expensive tasks. Is it then also possible to utilize C as it is within normal Python?
Another question I have is on the development. In order to become productive, we have to advance in small steps. Can I first create regular Python classes, which can later be imported into my web application? (The same question applies for widgets / Qt.)
Finally: Is there a better way to go? Any standards, any references?
Django is a good candidate for the website, however:
It is not a good idea to run heavy functionality from a website; it should happen in a separate process.
All functions should be asynchronous, i.e. you should never wait for something to complete.
I would personally recommend writing a separate process with a message queue; the website would only ask that process for statuses and always display a result immediately to the user (see the sketch below).
You can use ajax so that the browser will always have the latest result.
ZeroMQ or Celery are useful for implementing the functionality.
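A bare-bones sketch of the "separate process plus status polling" idea with Celery (broker URL, task body and view wiring are all placeholders; ZeroMQ or another queue would follow the same pattern):

# tasks.py - the heavy work lives outside the web process
from celery import Celery

app = Celery("worker",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def run_calculation(params):
    # ... heavy database work, ASCII file generation, FTP upload ...
    return {"status": "done"}

# views.py - Django only enqueues the job and reports its state
from django.http import JsonResponse
from tasks import run_calculation

def start(request):
    result = run_calculation.delay(dict(request.GET))   # enqueue, return immediately
    return JsonResponse({"task_id": result.id})

def status(request, task_id):
    return JsonResponse({"state": run_calculation.AsyncResult(task_id).state})

The browser can then poll the status view via Ajax, which is exactly the pattern described above.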
You can implement functionality in C pretty easily. I recommend, however, that you write that functionality as pure C with a SWIG wrapper rather than writing it as an extension module for Python. That way the functionality will be portable and not dependent on the Python website.
