Gremlin: how does the IO import work with Python?

I am trying to figure out which functions are called by the command g.io('file.json').read().iterate()
I can see that a 'read' step is put into the step_instructions, but I can't find the function that actually imports the file into the graph.
I ask because I want to import a lot of data without a file, using a Python object instead.
I see that io().read() imports a big file in a minute, and I want to recreate that behaviour but without using a file.
Thanks a lot.

First of all, to be clear on the nomenclature: io() is a step, while read() and write() are step modulators, and those modulators can only be applied to the io() step to tell it to read or write respectively. Since io() currently only works with a string file name, you can only read from and write to files.
If you want to send "a lot of data" with Python, I'd first consider what you mean by that in terms of size. If you're talking millions of vertices and edges, you should first check whether the graph database you are using has its own bulk loading tool. If it does, use that. You may also consider methods using Gremlin/Spark as described here in the case of JanusGraph. Finally, if you must use pure Gremlin for your bulk loading, then a parameterized traversal built from your Python object (I assume a list/dict of some sort) is probably the approach to take, as in the sketch below. This blog post might offer some inspiration along that line of thinking.
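For illustration, here is a rough sketch of that parameterized approach with gremlinpython, assuming a Gremlin Server reachable at the usual websocket URL and a list of dicts as the source data; the server URL, vertex label, and property names are all assumptions:

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.graph_traversal import __

# Connection details are assumptions; adjust for your server.
conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# The Python object to load: a list of dicts (example data).
people = [{'name': 'alice', 'age': 30}, {'name': 'bob', 'age': 42}]

# inject() pushes the whole list into one traversal, so the server creates
# all the vertices in a single round trip instead of one request per vertex.
g.inject(people).unfold().as_('row') \
    .addV('person') \
    .property('name', __.select('row').select('name')) \
    .property('age', __.select('row').select('age')) \
    .iterate()

conn.close()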

Related

Removing JSON objects that aren't correctly formatted Python

I'm building a chatbot database at the moment. I use data from pushshift.io. To deal with the big data file (I understand that json loads everything into RAM, so if you only have 16GB of RAM and are working with 30GB of data, that is a no-no), I wrote a bash script that splits the big file into smaller chunks of 3GB each so that I can run them through json.loads (or pd.read_json). The problem is that whenever I run my code it returns
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
So I took a look at the temp JSON file I had just created, and I see this in my JSON file:
ink_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
A correctly formatted sample of the data looks like this:
{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
I notice that my bash script splits the file without paying attention to the JSON object boundaries. So my question is: is there a way to write a function in Python that can detect JSON objects that are not correctly formatted and delete them?
There isn't a lot of information to go on, but I would challenge the framing a little.
There are several incremental JSON parsers available in Python. A quick search suggests ijson should allow you to traverse your very large data structure without blowing up memory.
You should also consider another data format (or a real database), or you will easily find yourself spending time reimplementing much slower versions of features that already exist with the right tools.
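As a rough sketch of the incremental approach, assuming the file holds one big top-level JSON array (the file name and the 'item' prefix are assumptions about your data layout):

import ijson

with open("comments.json", "rb") as f:
    # items() yields one fully parsed object at a time, so the whole file
    # never has to fit in RAM
    for comment in ijson.items(f, "item"):
        print(comment.get("author"), comment.get("score"))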
If you are using the json standard library, then calling json.loads on badly formatted data will raise a JSONDecodeError. You can wrap the call in a try/except block and catch this exception to make sure you only process correctly formatted data, as in the sketch below.
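A minimal sketch of that filtering, assuming each split chunk is meant to hold one JSON object per line (the file name is made up):

import json

valid_objects = []
with open("chunk_aa.json") as f:
    for line in f:
        try:
            valid_objects.append(json.loads(line))
        except json.JSONDecodeError:
            # malformed fragment left over from the split: skip it
            continue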

Cloudant database listener in python

I'm trying to create a listener in Python that automatically retrieves changes on a Cloudant database as they occur. When a change occurs I want to call a specific function.
I have read through the documentation and the API specification but couldn't find anything.
Is there a way to do this?
Here's a basic streaming changes feed reader (disclaimer: I wrote it):
https://github.com/xpqz/pylon/blob/master/pylon.py#L165
The official Cloudant Python client library also contains a changes feed follower:
https://python-cloudant.readthedocs.io/en/latest/feed.html
It's pretty easy to get a basic changes feed reader going, as the _changes endpoint with a feed=continuous parameter does quite a lot for you off the bat, including returning the results as self-contained JSON objects, one per line. The hard bit is dealing with a rather non-obvious set of failure conditions.
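A minimal sketch with the official cloudant library, assuming account credentials and a database called 'mydb' (the credentials, database name, and callback name are all assumptions):

from cloudant.client import Cloudant

def on_change(change):
    # replace with whatever function you want to call per change
    print(change)

client = Cloudant("USERNAME", "PASSWORD",
                  url="https://ACCOUNT.cloudant.com", connect=True)
db = client["mydb"]

# feed='continuous' keeps the HTTP connection open and yields one change
# at a time as the server emits it; empty heartbeats are skipped below.
for change in db.changes(feed="continuous", since="now", include_docs=True):
    if change:
        on_change(change)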

How to detect end of file using scipy.io.FortranFile

I am reading an unformatted sequential file output from a Fortran program. I am using the scipy.io.FortranFile class to do this and am successfully extracting the information I need.
My problem: I do not know how long the input file is and have no way of knowing how many records to read. Currently, I simply read the file iteratively until an exception is raised (a TypeError, but I don't know if this is how it would always fail). I would prefer to do this more elegantly.
Is there any way to detect EOF using the FortranFile class? Or, alternatively, is there a better way to read in unformatted sequential files?
Some cursory research (I am not a Fortran programmer) indicates that when reading such a file with the Fortran READ statement, one can check the IOSTAT flag to determine whether you are at the end of the file. I would be surprised if a similar capability isn't provided by the FortranFile class, but I don't see any mention of it in the documentation.
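For what it's worth, recent SciPy versions raise FortranEOFError (which subclasses TypeError, matching the exception described above) when a clean end of file is reached, so the loop can catch that explicitly. A minimal sketch, with the file name and record dtype as assumptions:

import numpy as np
from scipy.io import FortranFile, FortranEOFError  # FortranEOFError: newer SciPy only

records = []
with FortranFile("output.bin", "r") as f:
    while True:
        try:
            records.append(f.read_reals(dtype=np.float64))
        except FortranEOFError:
            # raised when a clean end of file is reached
            break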

complete beginner trying to create a flat file database in python

Trying to keep it stupidly simple: is it a bad idea to move a txt file into and out of a Python list? The txt files will probably get to about 2-5k entries. What is the preferred method to create a simple flat file database?
It might or might not be a bad idea. It depends on what you are trying to achieve, how much memory you have, and how big those lines are on average. It also depends on what you are doing with that data. Maybe it is worth reading and processing the file line by line? In any case, a database implies indexes: what are you going to do with a list of strings without an index? You cannot search it efficiently, for example.
In any case, if you feel like you need a database, take a look at SQLite. It is a small embedded SQL database engine written in C with a Python interface (the built-in sqlite3 module). It is stable and proven to work; for example, it is used on the iPhone in tons of applications.
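A minimal sketch with the standard library sqlite3 module; the file, table, and column names are made up for illustration:

import sqlite3

conn = sqlite3.connect("entries.db")  # a single file on disk
conn.execute("CREATE TABLE IF NOT EXISTS entries (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany("INSERT INTO entries (text) VALUES (?)",
                 [("first entry",), ("second entry",)])
conn.commit()

for row in conn.execute("SELECT id, text FROM entries"):
    print(row)
conn.close()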
If you're looking for a very simple file database, maybe you should look at the shelve module. Example usage:
import shelve
with shelve.open("myfile") as mydb:
mydb["0"] = "first value"
mydb["1"] = "second value"
# ...

I'm using Hadoop for data processing with python, what file format should be used?

I have a project with a substantial number of text pages.
Each text file has some header information that I need to preserve during processing; however, I don't want the headers to interfere with the clustering algorithms.
I'm using Python on Hadoop (or is there a subpackage better suited?).
How should I format my text files, and store those text files in Hadoop for processing?
1) Files
If you use Hadoop Streaming, you have to use line-based text files; the data up to the first tab is passed to your mapper as the key.
Just look at the documentation for streaming.
You can also put your input files into HDFS, which would be recommended for big files. Just look at the "Large Files" section in the link above.
2) Metadata-preservation
The problem I see is that your header information (metadata) will just be treated as ordinary data, so you will have to filter it out yourself (first step). Passing it along is more difficult, as the data from all input files is simply joined after the map step.
You will have to add the metadata somewhere to the data itself (second step) to be able to relate it later. You could emit (key, data+metadata) for each data line of a file and thus preserve the metadata for each data line. It might be a huge overhead, but we are talking MapReduce, which means: pfffrrrr ;)
Now comes the part where I don't know how much streaming really differs from a Java-implemented job.
If Streaming invokes one mapper per file, you could spare yourself the following trouble: just take the first input of map() as metadata and add it (or a placeholder) to all following data emits (see the mapper sketch after this list). If not, what follows is about Java jobs:
At least with a JAR mapper you can relate the data to its input file (see here). But you would have to extract the metadata first, as the map function might be invoked on a partition of the file that does not contain the metadata. I'd propose something like this:
create a metadata file beforehand, containing a placeholder index: keyx:filex, metadatax
put this metadata index into HDFS
use a JAR mapper and load the metadata index file during setup()
see org.apache.hadoop.hdfs.DFSClient
match filex, set keyx for this mapper
add the used keyx to each data line emitted in map()
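For the Streaming case mentioned above (one whole file per mapper, first line is the header), a rough mapper sketch could look like the following; whether each mapper really sees exactly one whole file is an assumption that depends on your input format and splitter:

#!/usr/bin/env python3
import sys

header = None
for line in sys.stdin:
    line = line.rstrip("\n")
    if header is None:
        header = line  # treat the first line this mapper sees as the metadata
        continue
    # text up to the first tab becomes the key of the emitted record
    print(f"{header}\t{line}")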
If you're using Hadoop Streaming, your input can be in any line-based format; your mapper and reducer input comes from sys.stdin, which you can read any way you want. You don't need to use the default tab-delimited fields (although in my experience, one format should be used across all tasks for consistency when possible).
However, with the default splitter and partitioner, you cannot control how your input and output are partitioned or sorted, so your mappers and reducers must decide whether any particular line is a header line or a data line using only that line; they won't know the original file boundaries.
You may be able to specify a partitioner which lets a mapper assume that the first input line is the first line in a file, or even move away from a line-based format. This was hard to do the last time I tried with Streaming, and in my opinion mapper and reducer tasks should be input agnostic for efficiency and reusability - it's best to think of a stream of input records, rather than keeping track of file boundaries.
Another option with Streaming is to ship the header information in a separate file, which is included with your data and will be available to your mappers and reducers in their working directories. One idea would be to associate each line with the appropriate header information in an initial task, perhaps by using three fields per line instead of two, rather than associating them by file.
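A rough sketch of that separate-file idea, assuming a file named header.txt was shipped with the job (for example via the streaming -files option) so it sits in the task's working directory; the file name and the three-field layout are illustrative:

#!/usr/bin/env python3
import sys

with open("header.txt") as f:
    header = f.read().strip()

for n, line in enumerate(sys.stdin):
    line = line.rstrip("\n")
    # three tab-separated fields: a line key, the header, and the data itself
    print(f"{n}\t{header}\t{line}")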
In general, try and treat the input as a stream and don't rely on file boundaries, input size, or order. All of these restrictions can be implemented, but at the cost of complexity. If you do need to implement them, do so at the beginning or end of your task chain.
If you're using Jython or SWIG, you may have other options, but I found those harder to work with than Streaming.
