Gremlin: Avoid duplicates without losing performance - python

I am trying to insert millions of vertices and edges in the shortest time possible with Gremlin Python.
I have 2 things to consider:
avoid duplicates for both vertices and edges
avoid spending 10 hours inserting all the data
Most of the time goes into looking up the existing vertices and creating the relationships.
If I insert edges without checking if a vertex already exists, the script is faster.
I have also tried batching the traversals, like:
g.addV("person").property("name", "X").as_("p1")
.addV("person").property("name", "Y").as_("p2")
.addE("has_address").from("p1").to(g.V().has("address", "name", "street"))
.addE("has_address").from("p2").to(g.V().has("address", "name", "street2")).iterate()
but I have not improved the performance.
Will I get the same results in my queries if there are duplicates?
I think the queries would be more expensive later with duplicates, no?
Thanks.

My answer to your last question provided some hints on how to load data "fast", and now that I know your size is in the millions, I hope you will consider those strategies.
If you happen to continue with loading with Gremlin and Python, then consider the following points:
I'm not sure where your duplicates arise from, but I would look for opportunities to clean the source data and organize it and the load plan so that you avoid loading duplicates in the first place, which will save you cleanup later. I can't say whether duplicates will leave you with the same results in your queries, because I don't know your data or your queries. In some graphs and queries I've known duplicates to be irrelevant and expected, and in others they could be a disaster.
Definitely try to organize your load into the batching pattern in the blog post I suggested in my other answer. This approach is faster than building giant chained Gremlin traversals filled with hundreds of addV() and addE().
Related to item 1, you've seen the performance pain of looking up graph elements prior to insert. Consider pre-sorting your data in such a way as to avoid repeated vertex lookups. Perhaps one way is to load all your vertices first so that you know they exist and then group/sort your edge load in such a way that you find vertices once and load all edges among that neighborhood of vertices.
Finally, if you can get 1 and 3 figured out, then perhaps you can parallelize the load.
Again, it's impossible to offer real specifics in this sort of forum, but perhaps these ideas, sketched roughly below, will inspire you to an answer.
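To make the batching and lookup points concrete, here is a rough sketch of the "get or create" (fold/coalesce/unfold) pattern plus a batched edge load in Gremlin Python. The server URL, the person/address labels, and the assumption that "name" uniquely identifies a vertex are illustrative only, not something I know about your data:

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

def upsert_person(name):
    # fold/coalesce/unfold: reuse the vertex if it exists, otherwise create it
    v = (g.V().has("person", "name", name).fold()
          .coalesce(__.unfold(),
                    __.addV("person").property("name", name))
          .next())
    return v.id

def load_edges(pairs, batch_size=100):
    # pairs: (person_vertex_id, address_vertex_id) tuples resolved up front,
    # so the edge load never has to search for vertices again
    for i in range(0, len(pairs), batch_size):
        t = g
        for pid, aid in pairs[i:i + batch_size]:
            t = t.V(pid).addE("has_address").to(__.V(aid))
        t.iterate()

The upsert avoids duplicates without a separate existence check per element, and resolving (or caching) vertex ids once keeps the edge batches free of repeated has() lookups.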

Related

Storing entries in a very large database

I am writing a Django application that will have entries entered by users of the site. Now suppose that everything goes well, and I get the expected number of visitors (unlikely, but I'm planning for the future). This would result in hundreds of millions of entries in a single PostgreSQL database.
As iterating through such a large number of entries and checking their values is not a good idea, I am considering ways of grouping entries together.
Is grouping entries into sets of (let's say) 100 a better idea for storing this many entries? Or is there a better way that I could optimize this?
Store one at a time until you absolutely cannot anymore, then design something else around your specific problem.
SQL is a declarative language, meaning "give me all records matching X" doesn't tell the db server how to do it. Consequently, you have a lot of ways to help the db server do this quickly even when you have hundreds of millions of records. Additionally, RDBMSs have been optimized for this problem over many years of experience, so up to a point you will not beat a system like PostgreSQL.
So as they say, premature optimization is the root of all evil.
So let's look at two ways PostgreSQL might go through a table to give you the results.
The first is a sequential scan, where it iterates over a series of pages, scans each page for the values and returns the records to you. This works better than any other method for very small tables. It is slow on large tables. Complexity is O(n) where n is the size of the table, for any number of records.
So a second approach might be an index scan. Here PostgreSQL traverses a series of pages in a b-tree index to find the records. Complexity is O(log(n)) to find each record.
Internally PostgreSQL stores the rows in batches with fixed sizes, as pages. It already solves this problem for you. If you try to do the same, then you have batches of records inside batches of records, which is usually a recipe for bad things.
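As a side note, it is easy to watch PostgreSQL switch between those two plans. A minimal sketch using psycopg2, assuming a hypothetical "entries" table with a "user_id" column:

import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

cur.execute("EXPLAIN SELECT * FROM entries WHERE user_id = 42")
print("\n".join(row[0] for row in cur))   # typically a Seq Scan without an index

cur.execute("CREATE INDEX IF NOT EXISTS entries_user_id_idx ON entries (user_id)")
conn.commit()

cur.execute("EXPLAIN SELECT * FROM entries WHERE user_id = 42")
print("\n".join(row[0] for row in cur))   # typically an Index Scan or Bitmap Index Scan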

Pytables EArray vs Table for Speed/Efficiency

I'm trying to figure out what is the most efficient way to store time-value pairs in PyTables. I'm using PyTables since I'm dealing with huge amounts of data. I will need to perform calculations on the data (average, interpolate, etc.). I don't know the number of rows ahead of time.
I know that an EArray can be appended to, much like a Table. Is there a reason to choose one over the other?
Given my simple data structure (homogeneous time-value pairs) I figured an EArray would be faster/more efficient, but the following quote from the PyTables creator himself threw me off:
"...PyTables is specially tuned for, well, tables.
And these entities wear special I/O buffers and query engines that are
fined tuned for maximum speed. *Array objects do not wear the same
machinery."quote location
If the columns have some particular meaning or name, then you should definitely use a Table.
The efficiency largely depends on what kinds of operations you are doing on the data. Most of the time there won't be much of a difference. EArray might be faster for row-access, Tables are probably slightly better at column access, and they should be very similar for whole Table/EArray access.
Of course, the moment you want to do something more than simply access elements and instead want to query or transform the data, you should use a Table. Tables are really built around this idea of querying, via where() methods, and indexing, which makes such operations very fast. EArrays lack this infrastructure and are therefore slower.
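For what it's worth, a minimal sketch of the Table route, with made-up file and column names, looks like this:

import tables

class Sample(tables.IsDescription):
    time = tables.Float64Col()
    value = tables.Float64Col()

with tables.open_file("data.h5", mode="w") as h5:
    table = h5.create_table("/", "samples", Sample)
    row = table.row
    for t, v in [(0.0, 1.5), (1.0, 2.5), (2.0, 3.5)]:
        row["time"] = t
        row["value"] = v
        row.append()
    table.flush()
    table.cols.time.create_index()   # speeds up where() conditions on "time"

    # in-kernel query: only matching rows are pulled into Python
    recent = [r["value"] for r in table.where("time >= 1.0")]
    print(recent)

This where()/indexing machinery is exactly what an EArray lacks; with an EArray you would read the array and filter the rows yourself.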

Data Transformation in Postgres Using SqlSoup

I have a large dataset of events in a Postgres database that is too large to analyze in memory. Therefore I would like to quantize the datetimes to a regular interval and perform group by operations within the database prior to returning results. I thought I would use SqlSoup to iterate through the records in the appropriate table and make the necessary transformations. Unfortunately I can't figure out how to perform the iteration in such a way that I'm not loading references to every record into memory at once. Is there some way of getting one record reference at a time in order to access the data and update each record as needed?
Any suggestions would be most appreciated!
Chris
After talking with some folks, it's pretty clear the better answer is to use Pig to process and aggregate my data locally. At the scale I'm operating at, it wasn't clear Hadoop was the appropriate tool to be reaching for. One person I talked to about this suggests Pig will be orders of magnitude faster than in-DB operations at the scale I'm operating at, which is about 10^7 records.
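For anyone who does want to stay inside the database, a rough sketch of both pieces (quantizing/grouping in SQL, and streaming rows with a server-side cursor) using plain psycopg2 rather than SqlSoup; the table and column names ("events", "ts", "value") are assumptions:

import psycopg2

conn = psycopg2.connect("dbname=mydb")

# push the quantization and grouping into the database entirely
cur = conn.cursor()
cur.execute("""
    SELECT date_trunc('hour', ts) AS bucket, count(*), avg(value)
    FROM events
    GROUP BY bucket
    ORDER BY bucket
""")
for bucket, n, mean in cur:
    print(bucket, n, mean)

# or stream rows in batches with a named (server-side) cursor, so references
# to every record are never held in memory at once
stream = conn.cursor(name="events_stream")
stream.itersize = 10000
stream.execute("SELECT id, ts, value FROM events ORDER BY ts")
for record in stream:
    pass  # transform/update each record as needed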

Storing and Processing Large Amounts of Temporal-Spatial Data

As part of our research group, we're collecting large amounts of location data. Our data essentially looks like (user id, lat/long co-ordinates, timestamp). There's other metadata involved too, but that's not relevant here.
We're collecting about 2-3 million records a week, and expect to collect about a year's worth of data in due time.
I'd really like some advice on techniques on storing and processing this data. We'd like to be able to answer queries similar to:
(1) For a given location, who was near that location (within a specified distance) over a specified period of time?
(2) Which locations are near each other?
That's the general idea. We don't need a real-time response, but what are good databases (or other data storage software)? I've come across people talking about k-d trees, does that work at this scale? What kind of hardware do I need? I'm hoping to get pointers towards general strategies. How do we store this data? Does it even make sense to store it all in a database? Which data/software/packages lend themselves well to distance/radius calculations?
We're most familiar with Python/Linux, would prefer to stay away from Java, and prefer open source/free software. We're new to all this, so pointers to books and papers would also be useful. Any and all advice would be greatly appreciated.
PostGIS is probably what you are looking for.
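To give a flavour of why: once the points are stored as geometries, query (1) becomes a single SQL statement. A rough sketch from Python with psycopg2, assuming a hypothetical "locations" table with user_id, geom and ts columns:

import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

# who was within 500 m of a point during a given week?
cur.execute("""
    SELECT DISTINCT user_id
    FROM locations
    WHERE ST_DWithin(geom::geography,
                     ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
                     500)
      AND ts BETWEEN %s AND %s
""", (-0.1276, 51.5074, "2012-01-01", "2012-01-07"))
print(cur.fetchall())

PostGIS also provides spatial (GiST) indexes to make such queries scale, which is roughly the role the question imagined for k-d trees.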

Store data series in file or database if I want to do row level math operations?

I'm developing an app that handles sets of financial series data (input as CSV or open document); one set could be, say, 10s x 1000s of up-to-double-precision numbers (simplifying, but that's what matters).
I plan to do operations on that data (e.g. sum, difference, averages etc.), including generating, say, another column based on computations on the input. This will be between columns (row-level operations) within one set and also between columns across many (potentially all) sets, also at the row level. I plan to write it in Python, and it will eventually need an intranet-facing interface to display the results/graphs etc.; for now, CSV output based on some input parameters will suffice.
What is the best way to store and manipulate the data? So far I see my choices as being either (1) to write CSV files to disk and trawl through them to do the math, or (2) to put them into a database and rely on the database to handle the math. My main concern is speed/performance as the number of datasets grows, since there will be inter-dataset row-level math that needs to be done.
-Has anyone had experience going down either path and what are the pitfalls/gotchas that I should be aware of?
-What are the reasons why one should be chosen over another?
-Are there any potential speed/performance pitfalls/boosts that I need to be aware of before I start that could influence the design?
-Is there any project or framework out there to help with this type of task?
-Edit-
More info:
The rows will all be read in order, BUT I may need to do some resampling/interpolation to match the differing input lengths as well as differing timestamps for each row. Since each dataset will always have a differing length that is not fixed, I'll have some scratch table/memory somewhere to hold the interpolated/resampled versions. I'm not sure if it makes more sense to try to store this (and try to upsample/interpolate to a common higher length) or just regenerate it each time it's needed.
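As an aside, the resampling step described in this edit can be a one-liner per series with numpy. A minimal sketch with made-up timestamps and values:

import numpy as np

t1, v1 = np.array([0.0, 1.0, 2.5, 4.0]), np.array([10.0, 11.0, 12.0, 14.0])
t2, v2 = np.array([0.0, 2.0, 4.0]), np.array([5.0, 7.0, 9.0])

grid = np.linspace(0.0, 4.0, 9)      # common, regular timestamps
r1 = np.interp(grid, t1, v1)         # linear interpolation onto the grid
r2 = np.interp(grid, t2, v2)

print(r1 - r2)                       # row-level math now lines up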
"I plan to do operations on that data (eg. sum, difference, averages etc.) as well including generation of say another column based on computations on the input."
This is the standard use case for a data warehouse star-schema design. Buy Kimball's The Data Warehouse Toolkit. Read (and understand) the star schema before doing anything else.
"What is the best way to store the data and manipulate?"
A Star Schema.
You can implement this as flat files (CSV is fine) or RDBMS. If you use flat files, you write simple loops to do the math. If you use an RDBMS you write simple SQL and simple loops.
"My main concern is speed/performance as the number of datasets grows"
Nothing is as fast as a flat file. Period. RDBMS is slower.
The RDBMS value proposition stems from SQL being a relatively simple way to specify SELECT SUM(), COUNT() FROM fact JOIN dimension WHERE filter GROUP BY dimension attribute. Python isn't as terse as SQL, but it's just as fast and just as flexible. Python competes against SQL.
"pitfalls/gotchas that I should be aware of?"
DB design. If you don't get the star schema and how to separate facts from dimensions, all approaches are doomed. Once you separate facts from dimensions, all approaches are approximately equal.
"What are the reasons why one should be chosen over another?"
RDBMS slow and flexible. Flat files fast and (sometimes) less flexible. Python levels the playing field.
"Are there any potential speed/performance pitfalls/boosts that I need to be aware of before I start that could influence the design?"
Star Schema: central fact table surrounded by dimension tables. Nothing beats it.
"Is there any project or framework out there to help with this type of task?"
Not really.
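To illustrate the "flat files plus simple loops" option above, here is a rough sketch of a sum grouped by a dimension attribute; the file and column names are illustrative only:

import csv
from collections import defaultdict

# dimension: instrument_id -> attributes
instruments = {}
with open("dim_instrument.csv", newline="") as f:
    for row in csv.DictReader(f):
        instruments[row["instrument_id"]] = row

# fact: one row per (date, instrument_id, value); sum the fact grouped by a
# dimension attribute, i.e. the SELECT SUM() ... JOIN ... GROUP BY mentioned above
totals = defaultdict(float)
with open("fact_prices.csv", newline="") as f:
    for row in csv.DictReader(f):
        sector = instruments[row["instrument_id"]]["sector"]
        totals[sector] += float(row["value"])

for sector, total in sorted(totals.items()):
    print(sector, total)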
For speed optimization, I would suggest two other avenues of investigation beyond changing your underlying storage mechanism:
1) Use an intermediate data structure.
If maximizing speed is more important than minimizing memory usage, you may get good results out of using a different data structure as the basis of your calculations, rather than focusing on the underlying storage mechanism. This is a strategy that, in practice, has dramatically reduced runtime in projects I've worked on, regardless of whether the data was stored in a database or text (in my case, XML).
While sums and averages will require only O(n) runtime, more complex calculations could easily push that into O(n^2) without applying this strategy. O(n^2) would be a performance hit that would likely have far more of a perceived speed impact than whether you're reading from CSV or a database. An example case would be if your data rows reference other data rows, and there's a need to aggregate data based on those references.
So if you find yourself doing calculations more complex than a sum or an average, you might explore data structures that can be created in O(n) and would keep your calculation operations in O(n) or better. As Martin pointed out, it sounds like your whole data sets can be held in memory comfortably, so this may yield some big wins. What kind of data structure you'd create would be dependent on the nature of the calculation you're doing.
2) Pre-cache.
Depending on how the data is to be used, you could store the calculated values ahead of time. As soon as the data is produced/loaded, perform your sums, averages, etc., and store those aggregations alongside your original data, or hold them in memory as long as your program runs. If this strategy is applicable to your project (i.e. if the users aren't coming up with unforeseen calculation requests on the fly), reading the data shouldn't be prohibitively long-running, whether the data comes from text or a database.
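A rough sketch of both ideas with a made-up row structure; the point is the O(n) index built once and the aggregates cached at load time:

from collections import defaultdict

rows = [
    {"id": 1, "parent_id": None, "value": 10.0},
    {"id": 2, "parent_id": 1, "value": 4.0},
    {"id": 3, "parent_id": 1, "value": 6.0},
]

# 1) intermediate data structure: one O(n) pass builds an index, so each
#    cross-row lookup afterwards is O(1) instead of a scan over all rows
children = defaultdict(list)
for row in rows:
    children[row["parent_id"]].append(row)

# 2) pre-cache: compute and store the aggregates once, right after loading
cache = {
    "total": sum(r["value"] for r in rows),
    "per_parent": {pid: sum(r["value"] for r in kids)
                   for pid, kids in children.items()},
}
print(cache)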
What matters most is whether all the data will fit into memory simultaneously. From the size that you give, it seems that this is easily the case (a few megabytes at worst).
If so, I would discourage using a relational database and do all operations directly in Python. Depending on what other processing you need, I would probably rather use binary pickles than CSV.
Are you likely to need all rows in order or will you want only specific known rows?
If you need to read all the data there isn't much advantage to having it in a database.
edit: If the data fits in memory then a simple CSV is fine. Plain text data formats are always easier to deal with than opaque ones if you can use them.
