I have a huge collection of .json files, each containing hundreds or thousands of documents, that I want to import into ArangoDB collections. Can I do this using Python, and if so, can anyone share an example of how to do it from a list of files? i.e.:
for i in filelist:
    import i to collection
I have read the documentation, but I couldn't find anything even resembling that.
So after a lot of trial and error, I found out that I had the answer in front of me. I didn't need to import the .json file itself; I just needed to read it and then do a bulk import of the documents. The code looks like this:
import json

a = db.collection('collection_name')  # db: an initialized python-arango database handle
for x in list_of_json_files:
    with open(x, 'r') as json_file:
        data = json.load(json_file)  # each file holds a list of documents
    a.import_bulk(data)
So it was actually quite simple. In my implementation I am collecting the .json files from multiple folders and importing them into multiple collections. I am using the python-arango 5.4.0 driver.
I had this same problem. Though your implementation will be slightly different, the answer you need (maybe not the one you're looking for) is to use the "bulk import" functionality.
Since ArangoDB doesn't have an "official" Python driver (that I know of), you will have to peruse other sources to get a good idea of how to solve this.
The HTTP bulk import/export docs provide curl commands, which can be neatly translated to Python web requests. Also see the section on headers and values.
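For instance, a curl call from those docs translates roughly to this python-requests sketch; the endpoint, database, credentials, and collection name below are assumptions to adapt:

import json
import requests

# Bulk import via ArangoDB's HTTP API; type=list means the body is one JSON array.
url = "http://localhost:8529/_db/_system/_api/import"
docs = [{"_key": "a", "value": 1}, {"_key": "b", "value": 2}]

resp = requests.post(
    url,
    params={"collection": "my_collection", "type": "list"},
    data=json.dumps(docs),
    auth=("root", ""),  # assumed credentials
)
print(resp.json())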
ArangoJS has a bulk import function, which works with an array of objects, so there's no special processing or preparation required.
I have also used the arangoimport tool to great effect. It's a command-line tool, so it can be driven from Python or used stand-alone in a script. For me, the key was making sure my data was in JSONL or "JSON Lines" format (each line of the file is a self-contained JSON object, with no bounding array or comma separators).
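As a sketch, it can be driven from Python via subprocess; the flags are standard arangoimport options, but the file, collection, and credentials here are assumptions:

import subprocess

# Run the arangoimport CLI on a JSONL file; adjust connection details to your setup.
subprocess.run([
    "arangoimport",
    "--file", "data.jsonl",
    "--type", "jsonl",
    "--collection", "my_collection",
    "--server.username", "root",
    "--server.password", "",
    "--create-collection", "true",
], check=True)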
Related
I used to upload CSV, Excel, JSON, or GeoJSON files into my PostgreSQL database using Python/Django.
I noticed that the scripts are redundant and sometimes difficult to maintain when we need to update keys or columns. Is there a way to use a design pattern? I have never used one before.
Any suggestions or links would be helpful!
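Not knowing your models, one common way to cut that redundancy is a strategy/registry pattern, where each file type gets its own registered loader and the persistence code stays in one place; a hedged sketch with illustrative names:

import csv
import json
import os

# Registry mapping file extensions to loader functions (strategy pattern).
LOADERS = {}

def loader(extension):
    def register(func):
        LOADERS[extension] = func
        return func
    return register

@loader(".csv")
def load_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

@loader(".json")
def load_json(path):
    with open(path) as f:
        return json.load(f)

def load_rows(path):
    # Dispatch on extension; every loader returns a list of dicts, so one
    # shared step can map the rows onto the Django model.
    ext = os.path.splitext(path)[1].lower()
    return LOADERS[ext](path)

Updating a key or column then means touching only the one loader concerned, not every script.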
I am trying to load an Avro file into a Spark dataframe so I can convert it to a pandas dataframe and eventually a dictionary. The method I want to use:
df = spark.read.format("avro").load(avro_file_in_memory)
(Note: the Avro file data I'm trying to load into the dataframe is already in memory, as the response from a request made with python-requests.)
However, this function relies on a library native to the Databricks environment, which I am not working in (I looked into PySpark for a similar function/code but could not see anything myself).
Is there any similar function that I can use outside of Databricks to produce the same results?
That Databricks library is open source, but it was actually added to core Spark in 2.4 (though still as an external library).
In any case, there's a native avro Python library, as well as fastavro, so I'm not entirely sure you want to be starting up a JVM (because you're using Spark) just to load Avro data into a dictionary. Besides that, an Avro file consists of multiple records, so it would at the very least be a list of dictionaries.
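For example, a fastavro sketch that parses Avro container bytes already in memory, assuming `response` is the python-requests response from the question:

import io
import fastavro

raw = response.content  # Avro container file bytes from python-requests
# fastavro.reader accepts a file-like object and yields one dict per record.
records = list(fastavro.reader(io.BytesIO(raw)))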
Basically, I think you're better off using the approach from your previous question, but starting by writing the Avro data to disk, since that seems to be your current issue.
Otherwise, maybe a little more searching for what you're looking for would solve this XY problem you're having
https://github.com/ynqa/pandavro
We're creating gamma-cat, an open data collection for gamma-ray astronomy, and are looking for advice (here, or links to resources, formats, tools, packages) on how best to set it up.
The data we have consists of measurements for different sources, from different papers. It's pretty heterogeneous: sometimes there's data for multiple sources in one paper, for each source there are usually several papers, sometimes there's no spectrum, sometimes one, sometimes many, ...
Currently we just collect the data in an input folder as YAML and CSV files, and now we'd like to expose it to users. Mainly access from Python, but also from Javascript and accessible from a static website.
The question is what format and organisation we should use for the data, and whether there are any Python packages that will help us generate the output files as a set of linked data, as well as Python and JavaScript packages that will help us access it.
We would like to get multiple "views" or simple "queries" of the data, e.g. "list of all sources", "list of all papers", "list of all spectra for source X", "spectrum A from paper B for source C".
As for the format, JSON would probably be a good choice? Although YAML is a bit nicer to read, and it allows comments and ordered maps. We're storing the output files in a git repo and have had a lot of meaningless diffs for JSON files, because key order changes all the time.
To make the datasets discoverable and linked, I don't know what to use. I found e.g. http://jsonapi.org/ but that seems to be for REST APIs, not for just a series of flat JSON files on a static webserver? Maybe it could still be used that way?
I also found http://json-ld.org/ which looks relevant, but also pretty complex. Would either of those or something else be a good choice?
And finally, we'd like to generate the linked and discoverable output files from just a bunch of somewhat organised YAML and CSV input files, using Python scripts. So far we've just written a bunch of Python classes and scripts based on Python dicts/lists and YAML/JSON files. Is there a Python package that would help with the task of generating the linked data files?
Apologies for the long and complex question! I hope it's still in scope for SO and someone will have some advice to share.
Judging from the breadth of your question, you are new to linked data. The least "strange" format for you might be the Data Package. In the most common case it's just a zip archive of a CSV file and JSON metadata. It has a Python package.
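A hedged sketch with the datapackage Python library (the glob pattern and file names are illustrative):

from datapackage import Package

package = Package()
package.infer('input/**/*.csv')    # infer resources and table schemas from the CSV files
package.save('datapackage.json')   # write the Data Package descriptor next to the data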
If you want to run queries against the data, you should settle on a database (triplestore) with a SPARQL endpoint. Take a look at Fuseki. You can then use Turtle or RDF/XML for file export.
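Once the data is in Fuseki, access from Python could look like this sketch using SPARQLWrapper (the endpoint URL and query are hypothetical):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:3030/gammacat/sparql")  # hypothetical Fuseki endpoint
sparql.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row)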
If the data comes from some kind of a tool, you can model the domain it represents using Eclipse Lyo (tutorial).
These tools are maintained by 3 different communities, you can reach out to their user mailing lists separately if you have further questions about them.
I'm working on a project where I will be using OrderedDict to store the data the code will be working on.
However, I also want to save many of these OrderedDicts as .json files and commit them to a git-enabled content repository.
But since json.dumps just serializes and in effect places the JSON content on one line, it will be tricky to track the differences between committed versions in a visually appealing way for users.
Is there a way to give the JSON output more readable formatting, with indents and line feeds, so the output file is easy to read as well as making better use of git's versioning and diff features?
Or are there any 3rd-party libraries, preferably with Python 3.5 support, out there that I've missed?
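For reference, the standard library's json module already supports this via keyword arguments; a minimal sketch:

import json
from collections import OrderedDict

data = OrderedDict([("name", "example"), ("values", [1, 2, 3])])

# indent=4 adds line feeds and indentation; the OrderedDict's insertion
# order is preserved, which keeps git diffs stable between commits.
with open("data.json", "w") as f:
    json.dump(data, f, indent=4)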
I'm programming a MySQL database with a web interface for remote access, using Django as the framework. Now I want to generate some reports from the MySQL data and modify them after generating. Therefore, I automatically think of exporting data to, or importing it from, Word. The thing is, how do I do this?
I have seen several options. One of them is Python-docx, a library for generating .docx documents in Python. I could have a problem with this, because the generated reports will be large, with lots of images, tables, pages, etc. I have worked with xlsxwriter, and when the files were large it took a long time to generate the .xlsx. I don't know if Python-docx would be the better solution.
The other option is to import the data directly from Microsoft Word, using some software made for this specific purpose or a VBA macro. I have written some example VBA code to import MySQL data using ODBC connectors, and it works right away, but there are thousands of Word VBA objects and classes to learn.
Having laid out the problem: any tips or suggestions? Thanks in advance!
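For the first option, a minimal python-docx sketch of the kind of report generation described; the headings, field names, and values are placeholders:

from docx import Document

doc = Document()
doc.add_heading('MySQL Report', level=1)
doc.add_paragraph('Generated from the Django/MySQL data.')

table = doc.add_table(rows=1, cols=2)
table.rows[0].cells[0].text = 'Field'
table.rows[0].cells[1].text = 'Value'
for name, value in [('source', 'A'), ('count', '12')]:  # placeholder rows
    row = table.add_row()
    row.cells[0].text = name
    row.cells[1].text = value

doc.save('report.docx')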
Another option is to generate HTML and open it as a Word document.
If you take a document similar to what you want to generate and save it as HTML, you will see what Word produces. Use that file as a template for your documents.
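A minimal sketch of that approach, with placeholder data: Word will open an HTML file saved with a .doc extension and render it as a document.

# Build a simple HTML report and save it with a .doc extension.
rows = [("Field", "Value"), ("source", "A"), ("count", "12")]  # placeholders

html = "<html><body><h1>Report</h1><table border='1'>"
for row in rows:
    html += "<tr>" + "".join("<td>%s</td>" % cell for cell in row) + "</tr>"
html += "</table></body></html>"

with open("report.doc", "w", encoding="utf-8") as f:
    f.write(html)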