Copying huge index to create parent-child structure - python

I have 2 sets of indexes, indexes A_* and indexes B_*. I need to create a parent-child structure in B_*. B_* already contains the parent documents, and A_* contains the child documents. So, essentially, I need to copy the child documents from A_* into B_* with some logic in the middle that matches child documents to parent documents based on matching on several fields that serve as a unique key.
A_* contains about 40 indexes with document counts ranging between 100-250 million. Each index is between 100-500 GB. B_* contains 16 indexes with 15 million documents each and of size 20 GB each.
I have tried to do this via a python script, with the main logic being the following:
doc_chunk = helpers.scan(self.es, index=some_index_from_A, size=4000, scroll='5m')
actions = self.doc_iterator(doc_chunk)
deque(helpers.parallel_bulk(self.es, actions, chunk_size=1000, thread_count=4))
The function doc_iterator scrolls through the iterator returned by helpers.scan and, based on values of certain fields in a given child document, determines the id of that document's parent. For each document, it yields indexing actions that index the child documents under the appropriate parent in B_*.
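The question does not show doc_iterator itself, so the following is only a minimal sketch of what such a generator might look like. The key field names, the parent_lookup mapping, the child type name and the extra arguments are all hypothetical. It assumes the 5.x Python client, where the bulk helpers pass a `_parent` key in the action through as document metadata.

```python
def doc_iterator(hits, parent_lookup, target_index, child_type="child"):
    """Yield one bulk index action per child hit that has a known parent.

    hits          -- iterable of scan hits (dicts with "_id" and "_source")
    parent_lookup -- hypothetical dict mapping a key tuple to a parent id
    """
    key_fields = ("field_a", "field_b")  # hypothetical unique-key fields
    for hit in hits:
        source = hit["_source"]
        key = tuple(source.get(name) for name in key_fields)
        parent_id = parent_lookup.get(key)
        if parent_id is None:
            continue  # no matching parent: skip (or log) this child
        yield {
            "_op_type": "index",
            "_index": target_index,
            "_type": child_type,
            "_id": hit["_id"],
            "_parent": parent_id,  # child is routed by its parent id
            "_source": source,
        }
```

Because it is a generator, it can be handed straight to helpers.parallel_bulk without materialising the whole scan in memory.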
I've tried several different approaches to create this parent-child index, but nothing seems to work:
Running the script in parallel using xargs results in BulkIndexingErrors and leads to, at most, only 1/3 of the corpus being indexed. If this worked, it would be the ideal approach as it would cut down this whole process to 2-4 days.
Running the python script in 1 process doesn't result in BulkIndexingErrors, but it only indexes about 22-28 million documents, at which point a read timeout occurs, and the whole process just hangs indefinitely. This is the less ideal approach as in the best case it would take 7-8 days to finish. During one of my attempts to run it this way, I was monitoring the cluster in Kibana and noticed that searches had spiked to 30,000 documents/second, after which they immediately plummeted to 0 and never picked up afterwards. Indexing tapered off at that point.
I have tried different values for scan size, chunk size, and thread count. I get the fastest performance for 1 process with scan size of 6000, chunk size of 1000, and thread count of 6, but I also noticed the aforementioned read spike with this setting, so it seems like I may be reading too much. Taking it down to a scan size of 4000 still resulted in the read timeouts (I was unable to monitor the search rate at that setting).
Some more details:
ES version: 5.2.1
Nodes: 6
Primary shards: 956
Replicas: 76
I currently need to run the script from a different server from the one where ES is running.
I need to find a way to finish the parent-child index in as few days as possible. Any tips to fix the problems with my aforementioned attempts would help, and new ideas are also welcome.

Related

How can I optimize postgres insert/update request of huge amount of data?

I'm working on a pathfinding project that uses topographic data of huge areas.
In order to reduce the huge memory load, my plan is to pre-process the map data by creating nodes that are saved in a Postgres DB on start-up, and then accessed as needed by the algorithm.
I've created 3 Docker containers for that: the Postgres DB, Adminer, and my Python app.
It works as expected with small amounts of data, so neither the communication between the containers nor the application itself is the problem.
The way it works is that you give it a 2D array; it takes the first row, converts each element into a node, and saves the nodes in the DB using psycopg2.extras.execute_values before moving on to the second row, then the third, and so on.
Once all nodes are registered, it updates each of them by searching for their neighbors and adding their id in the right column. That way it takes longer to pre-process the data, but I have easier access when running the algorithm.
However, I think the DB has trouble processing the data past a certain point. The map I gave it comes from a 9600x14400 .tif file, and even when ignoring useless/invalid data, that amounts to more than 10 million nodes.
Basically, it worked slowly but okay until around 90% of the node creation process, where the data stopped being processed. Both the Python and Postgres containers were still running and responsive, but no more nodes were being created, and the neighbor-linking part of the pre-processing didn't start either.
There were also no error messages on either side.
I've read that the row limit of a Postgres table is absurdly high, but that a table also becomes really slow once it holds a lot of rows. So could it be that it didn't crash or freeze, but just takes an insane amount of time to complete the remaining node creation requests?
Would reducing the batch size even more help in that regard?
Or would maybe splitting the table into multiple smaller ones be better?
The queries and psycopg2 functions I was using were not optimized for the mass inserts and updates I was doing.
The changes I've made were:
Reducing the batch size from 14k to 1k
Making larger SELECT queries instead of many smaller ones
Creating indexes on important columns
Changing a normal UPDATE query to an UPDATE FROM, run with execute_values instead of cursor.execute
It made the execution time go from around an estimated 5.5 days to around 8 hours.
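For anyone wanting to try the same thing, here is a rough sketch of the batched UPDATE FROM pattern with execute_values. The table name nodes, the columns id and neighbors, and the helper names are assumptions for illustration, not the schema actually used above. Depending on the column types, explicit casts inside the VALUES list may be needed.

```python
def chunked(rows, size):
    """Split a list of rows into consecutive batches of at most `size`."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


# Hypothetical table and column names. execute_values expands the single
# %s placeholder into a multi-row VALUES list.
UPDATE_SQL = """
    UPDATE nodes AS n
    SET neighbors = data.neighbors
    FROM (VALUES %s) AS data (id, neighbors)
    WHERE n.id = data.id
"""


def update_neighbors(conn, rows, batch_size=1000):
    """Apply (id, neighbors) pairs in batches, one statement per batch."""
    from psycopg2.extras import execute_values  # imported lazily

    with conn.cursor() as cur:
        for batch in chunked(rows, batch_size):
            execute_values(cur, UPDATE_SQL, batch, page_size=batch_size)
    conn.commit()
```

Each batch becomes a single statement, so the database sees one round trip per 1k rows instead of one per row, and the join on id can use the index on that column.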

Will the for loop affect the speed in a pyspark dataframe

I have this code, which splits the dataframe into chunks of 10000 rows and writes each chunk to a file.
I tried a z1d instance with 24 CPUs and 192GB, but even that did not speed things up much: 1 million rows took 9 minutes.
This is the code:
total = df2.count()
offset = 10000
counter = int(total/offset) + 1
idxDf = df.withColumn("idx", monotonically_increasing_id())
for i in range(0, counter):
    lower = i * offset
    upper = lower + offset
    filter = f"idx > {lower} and idx < {upper}"
    ddf = idxDf.filter(filter)
    ddf2 = ddf.drop("idx")
    ddf2.write.option("header", "false").option("delimiter", " ").option("compression","gzip").csv(outputpath)
Is there any way I can make it faster? Currently I am using a single master node only. I have 100 million rows and want to know how fast I can do this with EMR.
It looks like my normal Python code can do the same work in about the same number of minutes.
A few problems with what you’re trying to do here:
Stop trying to write pyspark code as if it’s normal python code. It isn’t. Read up on exactly how spark works first and foremost. You’ll have more success if you change the way you program when you use spark, not try to get spark to do what you want in the way you want.
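Avoid for loops with Spark wherever possible. The loop itself runs as native Python on the driver, so each iteration submits its own Spark job and waits for it to finish before starting the next one. Every one of those jobs filters the full dataframe again, so the same work is repeated over and over instead of being distributed once.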
Python is, by default, single threaded. Adding more CPUs will do literally nothing to performance for native python code (ie your for loop) unless you rewrite your code for either (a) multi-threaded processing (b) distributed processing (ie spark).
You only have one master node (and I assume zero slave nodes). That’s going to take aaaaaaggggggggeeeessss to process a 192GB file. The point of Spark is to distribute the workload onto many other slave nodes. There are some really technical ways to determine the optimal number of slave nodes for your problem. Try something like >50 or >100 slaves. That should help you see a decent performance uplift (each node being able to process at least 1GB-4GB of data). Still too slow? Either add more slave nodes, or choose more powerful machines for the slaves. I remember running a 100GB file through some heavy lifting took a whole day on 16 nodes. Upping the machine spec and number of slaves brought it down to an hour.
For writing files, don’t try and reinvent the wheel if you don’t need to.
Spark will automatically write your files in a distributed manner according to the level of partitioning on the dataframe. On disk, it should create a directory called outputpath which contains the n distributed files:
df = df.repartition(n_files)  # repartition returns a new dataframe, so keep the result
df.write.option("header", "false").option("delimiter", " ").option("compression","gzip").csv(outputpath)
You should get a directory structured something like this:
path/to/outputpath:
- part-737hdeu-74dhdhe-uru24.csv.gz
- part-24hejje-hrhehei-47dhe.csv.gz
- ...
Hope this helps. Also, partitioning is super important. If your initial file is not distributed (one big csv), it’s a good idea to do df.repartition(x) on the resulting dataframe after you load it, where x = number of slave nodes.
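To tie this back to the 10000-rows-per-file goal in the question, here is a minimal sketch that derives the partition count from the row count and then does a single distributed write. The function name write_in_files is made up for illustration, df is assumed to be a pyspark DataFrame, and repartition only spreads rows roughly evenly, so the files hold about rows_per_file rows rather than exactly that many.

```python
import math


def write_in_files(df, outputpath, rows_per_file=10000):
    """Write df as gzip CSV with roughly rows_per_file rows per part file."""
    total = df.count()
    n_files = max(1, math.ceil(total / rows_per_file))
    # repartition returns a new dataframe; it does not change df in place
    out = df.repartition(n_files)
    (out.write
        .option("header", "false")
        .option("delimiter", " ")
        .option("compression", "gzip")
        .csv(outputpath))
    return n_files
```

This replaces the whole for loop: one count, one shuffle, one write, and no idx column at all.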

MongoDB Update-Upsert Performance Barrier (Performance falls off a cliff)

I'm performing a repetitive update operation to add documents into my MongoDB as part of some performance evaluation. I've discovered a huge non-linearity in execution time based on the number of updates (w/ upserts) I'm performing:
Looping with the following command in Python...
collection.update({'timestamp': x}, {'$set': {'value1':y, v1 : y/2, v2 : y/4}}, upsert=True)
Gives me these results...
500 document upserts 2 seconds.
1000 document upserts 3 seconds.
2000 document upserts 3 seconds.
4000 document upserts 6 seconds.
8000 document upserts 14 seconds.
16000 document upserts 77 seconds.
32000 document upserts 280 seconds.
Notice how after 8k document updates the performance starts to rapidly degrade, and by 32k document updates we're seeing a 6x reduction in throughput. Why is this? It seems strange that "manually" running 4k document updates 8 times in a row would be 6x faster than having Python perform them all consecutively.
I've seen in mongostat that I'm getting a ridiculously high locked db ratio (>100%), and top is showing me >85% CPU usage while this is running. I've got an i7 processor with 4 cores available to the VM.
You should put an ascending index on your "timestamp" field:
collection.ensure_index("timestamp") # shorthand for single-key, ascending index
If this index should contain unique values:
collection.ensure_index("timestamp", unique=True)
Since the spec is not indexed and you are performing updates, the database has to check every document in the collection to see if any documents already exist with that spec. When you do this for 500 documents (in a blank collection), the effects are not so bad...but when you do it for 32k, it does something like this (in the worst case):
document 1 - assuming blank collection, definitely gets inserted
document 2 - check document 1, update or insert occurs
document 3 - check documents 1-2, update or insert occurs
...etc...
document 32000 - check documents 1-31999, update or insert
When you add the index, the database no longer has to check every document in the collection; instead, it can use the index to find any possible matches much more quickly using a B-tree cursor instead of a basic cursor.
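To put rough numbers on the worst case listed above, here is a small back-of-the-envelope sketch. It assumes every upsert inserts a new document and that, without an index, upsert number k has to examine all k-1 existing documents. That is a simplified model for illustration, not a measurement of the server.

```python
def worst_case_checks(n_upserts):
    """Documents examined when upsert k must scan the k-1 existing ones."""
    return n_upserts * (n_upserts - 1) // 2


for n in (500, 4000, 32000):
    print(n, worst_case_checks(n))
# 500 124750
# 4000 7998000
# 32000 511984000
```

Going from 500 to 32000 upserts is 64x more documents but roughly 4100x more document checks in this model, which matches the shape of the slowdown in the question much better than a linear cost would.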
You should compare the results of collection.find({"timestamp": x}).explain() with and without the index (note you may need to use the hint() method to force it to use the index). The critical factor is how many documents you have to iterate over (the "nscanned" result of explain()) versus how many documents match your query (the "n" key). If the db only has to scan exactly what matches or close to that, that is very efficient; if you scan 32000 items but only found 1 or a handful of matches, that is terribly inefficient, especially if the db has to do something like that for each and every upsert.
A notable wrinkle for you to double-check: since you have not set multi=True in your update call, if an update operation finds a matching document, it will update just that one and not continue to check the rest of the collection.
Sorry for the link spam, but these are all must-reads:
http://docs.mongodb.org/manual/core/indexes/
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.ensure_index
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.update
http://docs.mongodb.org/manual/reference/method/cursor.explain/

Number of map tasks and split size

What I'm trying to do
I'm new to hadoop and I'm trying to perform MapReduce several times with a different number of mappers and reducers, and compare the execution time. The file size is about 1GB, and I'm not specifying the split size so it should be 64MB. I'm using a machine with 4 cores.
What I've done
The mapper and reducer are written in Python, so I'm using Hadoop streaming. I specified the number of map tasks and reduce tasks by using '-D mapred.map.tasks=1 -D mapred.reduce.tasks=1'.
Problem
Because I specified 1 map task and 1 reduce task, I expected to see just one map attempt, but I actually have 38 map attempts and 1 reduce task. I read tutorials and SO questions about similar problems, and some said that the default number of map tasks is 2, but I'm getting 38 map tasks. I also read that mapred.map.tasks is only a hint and that the actual number of map tasks equals the number of input splits. However, 1GB divided by 64MB is about 16, so I still don't understand why 38 map tasks were created.
1) If I want to use only 1 map task, do I have to set the input splits size to 1GB??
2) Let's say I successfully specify that I want to use only 2 map tasks, does it use 2 cores? And each core has 1 map task??
The number of mappers is actually governed by the InputFormat you are using, and the InputFormat may vary with the type of data you are processing. Normally, for data stored as files in HDFS, FileInputFormat or a subclass is used, which works on the principle of MR split = HDFS block. However, this is not always true. Say you are processing a flat binary file. In such a case there is no delimiter (\n or something else) to mark the split boundary. What would you do in such a case? So, the above principle doesn't always work.
Consider another scenario wherein you are processing data stored in a DB, and not in HDFS. What will happen in such a case as there is no concept of 64MB block size when we talk about DBs?
The framework tries its best to carry out the computation as efficiently as possible, which might involve creating fewer or more mappers than you specified or expected. So, in order to see exactly how mappers are created, you need to look into the InputFormat you are using in your job, the getSplits() method to be precise.
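As an illustration of what getSplits() does for plain files, here is a simplified Python model of the split-size rule in the old-API FileInputFormat (the one Hadoop streaming goes through, as far as I know). It ignores details such as the slop factor applied to the last split, and the function names are mine. The file sizes in the usage note are hypothetical and say nothing about the asker's actual input.

```python
import math

MB = 1024 * 1024


def split_size(total_size, requested_maps, block_size, min_split_size=1):
    """Split size: max(min_size, min(goal_size, block_size))."""
    goal_size = total_size // max(requested_maps, 1)
    return max(min_split_size, min(goal_size, block_size))


def num_splits(file_sizes, requested_maps, block_size, min_split_size=1):
    """Count splits file by file; a split never spans two files."""
    size = split_size(sum(file_sizes), requested_maps,
                      block_size, min_split_size)
    return sum(math.ceil(s / size) for s in file_sizes if s > 0)
```

With one file of exactly 1 GiB, a 64 MB block size and a hint of 1 map task, num_splits([1024 * MB], 1, 64 * MB) gives 16, because the block size caps the split size no matter how small the hint is. Raising the minimum split size to the file size, num_splits([1024 * MB], 1, 64 * MB, min_split_size=1024 * MB), gives 1. Since splits are counted per file, an input directory made of many small files produces at least one split per file, which is one possible way to end up with more map tasks than size / block size suggests.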
If I want to use only 1 map task, do I have to set the input splits size to 1GB??
You can override the isSplitable(FileSystem, Path) method of your InputFormat to ensure that the input files are not split-up and are processed as a whole by a single mapper.
Let's say I successfully specify that I want to use only 2 map tasks, does it use 2 cores? And each core has 1 map task??
It depends on availability. Mappers can run on multiple cores simultaneously. And a single core can run multiple mappers sequentially.
An add-on to your question 2: the parallelism of running map/reduce tasks on a node is controllable. One can set the maximum number of map/reduce tasks that a tasktracker runs simultaneously via mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum. The default for both parameters is 2. For a 4-core node, mapreduce.tasktracker.map.tasks.maximum should be increased to at least 4, i.e. to make use of each core. 2 for max-reduce-tasks should be OK. Btw, finding the best values for max map/reduce tasks is non-trivial, as it depends on the degree of job parallelism on the cluster, whether the mappers/reducers of the job(s) are I/O- or compute-intensive, etc.

Insert performance with Cassandra

Sorry for my English in advance.
I am a beginner with Cassandra and its data model. I am trying to insert one million rows into a local Cassandra database on a single node. Each row has 10 columns, and I insert all of them into a single column family.
With one thread, that operation took around 3 min. But I would like to do the same operation with 2 million rows while keeping a good time. So I tried inserting 2 million rows with 2 threads, expecting a similar result of around 3-4 min. But I got a result of about 7 min, twice the first result. From what I have read on different forums, multithreading is recommended to improve performance.
That is why I am asking this question: is it useful to use multithreading to insert data into a local node (client and server are on the same computer), into only one column family?
Some information:
- I use pycassa
- I have put the commitlog directory and the data directory on different disks
- I use batch insert for each thread
- Consistency level: ONE
- Replication factor: 1
It's possible you're hitting the python GIL but more likely you're doing something wrong.
For instance, putting 2M rows in a single batch would be Doing It Wrong.
Try running multiple clients in multiple processes, NOT threads.
Then experiment with different insert sizes.
1M inserts in 3 mins is about 5500 inserts/sec, which is pretty good for a single local client. On a multi-core machine you should be able to get several times this amount provided that you use multiple clients, probably inserting small batches of rows, or individual rows.
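A rough sketch of the "multiple processes, small batches" layout, in case it helps. The worker below is only a placeholder that counts rows; a real one would open its own client connection inside the process and send each batch through it. All function names and the default sizes are hypothetical starting points to experiment with, not recommendations.

```python
def make_batches(rows, batch_size):
    """Group rows into consecutive small batches."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def share_out(batches, n_workers):
    """Deal batches round-robin so each worker gets a similar load."""
    return [batches[w::n_workers] for w in range(n_workers)]


def insert_worker(batches):
    """Placeholder worker: a real one would send each batch to the server."""
    sent = 0
    for batch in batches:
        # open one client connection per process and insert the batch here
        sent += len(batch)
    return sent


def run(rows, n_workers=4, batch_size=100):
    """Insert rows using separate OS processes instead of threads."""
    from multiprocessing import Pool

    shares = share_out(make_batches(rows, batch_size), n_workers)
    with Pool(n_workers) as pool:
        return sum(pool.map(insert_worker, shares))
```

Separate processes sidestep the GIL, so each client can keep a core busy while the others wait on the server.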
You might consider Redis. Its single-node throughput is supposed to be faster. It's different from Cassandra though, so whether or not it's an appropriate option would depend on your use case.
The time taken doubled because you inserted twice as much data. Is it possible that you are I/O bound?
