I have an application that reads a series of XML files containing logs of vehicle passages on a road. The application processes each record, transforms a few of the fields to match the database columns, and inserts it into a Cassandra database (a single node running on a remote server; it's on an internal network, so connectivity isn't really an issue). After inserting the data, the process for each file goes on to read it back and produce information for summary tables, which leaves the data ready for a drill-down analysis done in an unrelated part of the application.
I'm using multiprocessing to process many XML files in parallel, and the trouble I'm having is with communicating with the Cassandra server. Schematically, each process works as follows (a rough code sketch of this flow follows the list):
1. Read a record from the XML file
2. Process the record's data
3. Insert the processed data into the database (using .execute_async(query))
4. Repeat 1 to 3 until the XML file is exhausted
5. Wait for the responses of all the insert queries I made
6. Read data from the database
7. Process the read data
8. Insert the processed data into summary tables
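A minimal sketch of this flow with the Python cassandra-driver is below; the keyspace, table and column names and the parse_xml helper are made-up placeholders, not my actual code:

from cassandra.cluster import Cluster

cluster = Cluster(['10.0.0.5'])          # hypothetical node address
session = cluster.connect('transito')    # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO passagens (equipamento, data, faixa, velocidade) VALUES (?, ?, ?, ?)")

futures = []
for record in parse_xml('arquivo.xml'):  # parse_xml stands in for the real XML reader
    futures.append(session.execute_async(
        insert, (record.equipamento, record.data, record.faixa, record.velocidade)))

for future in futures:                   # step 5: wait for every async insert to finish
    future.result()

rows = session.execute(                  # steps 6-8: read back and build the summaries
    "SELECT * FROM passagens WHERE equipamento = %s AND data = %s",
    ('EQP-1', '2015-01-01'))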
Now, this runs smoothly across multiple parallel processes until one of them reaches step 6: its request (made with .execute(query), meaning I wait for the response) always times out. The error I receive is:
Process ProcessoImportacaoPNCT-1:
Traceback (most recent call last):
File "C:\Users\Lucas\Miniconda\lib\multiprocessing\process.py", line 258, in _bootstrap
self.run()
File "C:\Users\Lucas\PycharmProjects\novo_importador\app\core\ImportacaoArquivosPNCT.py", line 231, in run
core.CalculoIndicadoresPNCT.processa_equipamento(sessao_cassandra, equipamento, data, sentido, faixa)
File "C:\Users\Lucas\PycharmProjects\novo_importador\app\core\CalculoIndicadoresPNCT.py", line 336, in processa_equipamento
desvio_medias(sessao_cassandra, equipamento, data_referencia, sentido, faixa)
File "C:\Users\Lucas\PycharmProjects\novo_importador\app\core\CalculoIndicadoresPNCT.py", line 206, in desvio_medias
veiculos = sessao_cassandra.execute(sql_pronto)
File "C:\Users\Lucas\Miniconda\lib\site-packages\cassandra\cluster.py", line 1594, in execute
result = future.result(timeout)
File "C:\Users\Lucas\Miniconda\lib\site-packages\cassandra\cluster.py", line 3296, in result
raise self._final_exception
ReadTimeout: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
I have changed the timeout on the server to absurd amounts of time (500000000 ms, for instance), and I have also tried setting the timeout limit on the client with .execute(query, timeout=3000), but still no success.
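For clarity, a minimal sketch of where the client-side timeouts live in the Python driver (node address and table name are made up); the coordinator-side read timeout that produces the ReadTimeout above is a separate server setting (read_request_timeout_in_ms in cassandra.yaml):

from cassandra.cluster import Cluster

cluster = Cluster(['10.0.0.5'])
session = cluster.connect('transito')

session.default_timeout = 600                                      # seconds, used by every execute()
rows = session.execute("SELECT * FROM passagens", timeout=3000)    # per-call override, also in seconds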
Now, when more processes hit the same problem and the intense writing from steps 1-3 in multiple processes stops, the last processes to reach step 6 succeed, which makes me think the problem is that Cassandra is giving priority to the tens of thousands of insert requests I'm issuing per second and either ignoring my read request or pushing it way back in the queue.
One way to solve this, in my opinion, would be to somehow ask Cassandra to give priority to my read request so that I can keep processing, even if that means slowing down the other processes.
Now, as a side note, you might think my process modelling is not optimal, and I'd love to hear opinions on that, but for the reality of this application this is, in our view, the best way to proceed. We have actually thought extensively about optimising the process, and (as long as the Cassandra server can handle it) this is optimal for our situation.
So, TL;DR: Is there a way of giving priority to a query when executing tens of thousands of asynchronous queries? If not, is there a way of executing tens of thousands of insert and read queries per second without the requests timing out? And what would you suggest I do to solve the problem? Running fewer processes in parallel is obviously a solution, but one I'm trying to avoid. I'd love to hear everyone's thoughts.
Keeping the data in memory while inserting, so I don't need to read it back for the summaries, is not a possibility because the XML files are huge and memory is an issue.
I don't know of a way to give priority to read queries. I believe Cassandra internally has separate thread pools for read and write operations, so those run in parallel. Without seeing your schema and the queries you're running, it's hard to say whether you're doing a very expensive read operation or the system is simply so swamped with writes that it can't keep up with the reads.
You might want to try monitoring what's going on in Cassandra as your application is running. There are several tools you can use to monitor what's going on. For example, if you ssh to your Cassandra node and run:
watch -n 1 nodetool tpstats
This will show you the thread pool stats (updated once per second). You'll be able to see if the queues are filling up or operations are getting blocked. If any of the "Dropped" counters increase, that's a sign you don't have enough capacity for what you're trying to do. If that's the case, then add capacity by adding more nodes, or change your schema and approach so that the node has less work to do.
Other useful things to monitor (on linux use watch -n 1 to monitor continuously):
nodetool compactionstats
nodetool netstats
nodetool cfstats <keyspace.table name>
nodetool cfhistograms <keyspace> <table name>
It is also good to monitor the node with Linux commands like top and iostat to check CPU utilization and disk utilization.
My impression from what you say is that your single node doesn't have enough capacity to do all the work you're giving it, so either you need to process less data per unit of time, or add more Cassandra nodes to spread out the workload.
I'm currently facing my own timeout error due to partitions having too many rows, so I may have to add cardinality to my partition key to make the contents of each partition smaller.
Related
I have a simple one-node Cassandra cluster with a basic keyspace configuration that has replication_factor=1.
In this keyspace, we have about 230 tables. Each table has roughly 40 columns. The writes we do to these tables arrive at a rate of roughly 30k writes in five minutes, just once a day. I have about 6 Python worker scripts that make these writes to one table at a time, and they all continue making these writes until all 230 tables have been written to for the day. The scripts use the Python cassandra-driver with a simple session to make these writes. As for the data being written, a lot of the values are nulls.
Effectively, if I am right, this can be thought of as 6 concurrent connections making 30k+ writes in five minutes per day.
I understand how cassandra writes and deletes work and am familiar with coordinator nodes etc. I am observing a traceback that occurs intermittently as described below:
"cassandra/cluster.py", line 2030, in cassandra.cluster.Session.execute (cassandra/cluster.c:38536)
app_nstablebuilder.1.69j772led82k#swarm-worker-gg37 | File "cassandra/cluster.py", line 3844, in cassandra.cluster.ResponseFuture.result (cassandra/cluster.c:80834)
app_nstablebuilder.1.69j772led82k#swarm-worker-gg37 | cassandra.WriteTimeout: Error from server: code=1100 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'ONE', 'required_responses': 1, 'received_responses': 0}
My question has to do with how to approach solving this problem. I am unable to verify whether the problem has come out of my workers' scripts or with the Cassandra cluster itself. Should I be slowing down my workers in doing their writes? Should I run some sort of diagnostic to improve Cassandra performance?
All the solutions I have read till now have to do with multinode clusters and I couldn't find one for a single node cluster.
I feel like our cluster is unhealthy and that my efforts should be targeted at fixing that. If so, I'm unsure of where to begin. Could anyone point me in the right direction?
If there's any further information I could provide to help, do let me know.
Inserting nulls will create tombstones, while excluding the null columns from the query will not. You can read a little more on that matter here. I'm not sure the nulls are what's causing this error, but avoiding null inserts (and the tombstones they would create) is definitely an improvement to take into account.
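For illustration, a sketch of one way to skip null columns when building an insert; the table, columns and helper are hypothetical, and session is assumed to be an already-connected cassandra.cluster.Session:

def insert_without_nulls(session, table, row):
    """Insert only the columns whose value is not None, so no tombstones are written."""
    cols = [c for c, v in row.items() if v is not None]
    placeholders = ', '.join(['%s'] * len(cols))
    query = "INSERT INTO {} ({}) VALUES ({})".format(table, ', '.join(cols), placeholders)
    session.execute(query, [row[c] for c in cols])

insert_without_nulls(session, 'readings', {'id': 42, 'value': 3.14, 'comment': None})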
I currently have a large database (stored as a numpy array) that I'd like to perform a search on; however, due to its size I'd like to split the database into pieces and perform a search on each piece before combining the results.
I'm looking for a way to host the split database pieces on separate python processes where they will wait for a query, after which they perform the search, and send the results back to the main process.
I've tried a load of different things with the multiprocessing package, but I can't find any way to (a) keep processes alive after loading the database on them, and (b) send more commands to the same process after initialisation.
Been scratching my head about this one for several days now, so any advice would be much appreciated.
EDIT: My problem is analogous to trying to host 'web' APIs in the form of Python processes, and I want to be able to send and receive requests at will without reloading the database shards every time.
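One rough sketch of that kind of setup, assuming one long-lived worker process per shard that blocks on a request queue (the shard file names and the toy search are made up):

import multiprocessing as mp
import numpy as np

def shard_worker(shard_path, requests, results):
    shard = np.load(shard_path)            # load this shard once and keep it in memory
    while True:
        query = requests.get()             # block until the main process sends a query
        if query is None:                  # sentinel: shut down
            break
        # toy "search": indices of rows whose first column matches the query value
        hits = np.nonzero(shard[:, 0] == query)[0]
        results.put((shard_path, hits.tolist()))

if __name__ == '__main__':
    paths = ['shard0.npy', 'shard1.npy']   # hypothetical shard files
    requests, results = [mp.Queue() for _ in paths], mp.Queue()
    workers = [mp.Process(target=shard_worker, args=(p, q, results), daemon=True)
               for p, q in zip(paths, requests)]
    for w in workers:
        w.start()

    for q in requests:                     # fan a query out to every shard...
        q.put(42)
    partial = [results.get() for _ in paths]   # ...and combine the partial results
    print(partial)

    for q in requests:                     # tell the workers to exit
        q.put(None)
    for w in workers:
        w.join()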
I am using a Postgres database with SQLAlchemy and Flask. I have a couple of jobs for which I have to run through the entire database to update entries. When I do this on my local machine I get very different behavior compared to the server.
E.g. there seems to be an upper limit on how many entries I can get from the database?
On my local machine I just query all elements, while on the server I have to query 2000 entries step by step.
If I ask for too many entries, the server gives me the message 'Killed'.
I would like to know
1. Who is killing my jobs (sqlalchemy, postgres)?
2. Since this behaves differently on my local machine, there must be a way to control it. Where would that be?
thanks
carl
Just the message "killed" appearing in the terminal window usually means the kernel was running out of memory and killed the process as an emergency measure.
Most libraries that connect to PostgreSQL will read the entire result set into memory by default. But some libraries have a way to tell them to process the results row by row, so they aren't all read into memory at once. I don't know if Flask has this option or not.
Perhaps your local machine has more available RAM than the server does (or fewer demands on the RAM it does have), or perhaps your local machine is configured to read from the database row by row rather than all at once.
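For example, if the queries go through SQLAlchemy, something along these lines streams rows in batches instead of loading everything at once; Entry and update_entry are placeholders for the real mapped model and update logic, and the connection URL is made up:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('postgresql://user:password@localhost/mydb')
session = sessionmaker(bind=engine)()

# yield_per fetches rows from the server in batches of 1000 rather than
# materialising the whole result set in memory first
for entry in session.query(Entry).yield_per(1000):
    update_entry(entry)

session.commit()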
Most likely the kernel is killing your Python script. Python can have horrible memory usage.
I have a feeling you are trying to do these 2000-entry batches in a loop in one Python process. Python does not release all the memory it has used, so memory usage grows until the process gets killed. (You can watch this with the top command.)
You should try adapting your script to process 2000 records per run and then quit. If you run it multiple times, it should continue where it left off. Or, a better option, try using multiprocessing and run each batch in a separate worker. Run the jobs serially and let them die when they finish; this way they release the memory back to the OS when they exit.
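A rough sketch of that batch-per-worker idea with multiprocessing; the table size, batch size and process_batch body are placeholders:

from multiprocessing import Pool

def process_batch(offset):
    # open a fresh DB connection here and update rows [offset, offset + 2000)
    pass

if __name__ == '__main__':
    offsets = range(0, 1000000, 2000)                  # hypothetical number of rows
    # one worker at a time, replaced after every batch, so each batch's memory
    # is handed back to the OS when its process exits
    with Pool(processes=1, maxtasksperchild=1) as pool:
        pool.map(process_batch, offsets)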
I noticed that sqlite3 isn't really capable or reliable when I use it inside a multiprocessing environment. Each process tries to write some data into the same database, so a connection ends up being used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: sometimes it includes everything, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack their outputs into a list and put them into the table all together, or is there a reliable way to handle multiple connections with sqlite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than one thread.
On some operating systems, a database connection should always be used in the same thread in which it was originally created.
See here for more detailed information: Is SQLite thread-safe?
I've actually just been working on something very similar:
multiple processes (for me, a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information from the web (a call to the Alchemy API, in my case)
each process opens its own sqlite3 connection, all to a single file, and each process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you; then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there, I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination, and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)". Perhaps you could try replacing that data fetching with some dummy data to ensure that it really is the sqlite3 connection causing your problems. At least in my tested case (running right now in another window) multiple processes were able to add rows through their own connections without issues. Your description exactly matches the problem I was having when two processes stepped on each other while going for the web API (a very odd error, actually) and sometimes didn't get the expected data, which of course left an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (it could have been more elegant, but this was for a personal hack).
My apologies if this doesn't apply to your case, without code it's hard to know what you're facing, but the description makes me wonder if you might widen your considerations.
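For reference, a minimal sketch of that pattern, with each worker opening its own connection to the same file; the table and the stand-in web call are made up:

import sqlite3
from multiprocessing import Pool

DB = 'results.db'

def fetch_from_web(task_id):
    return 'payload-%d' % task_id                 # stand-in for the real web call

def worker(task_id):
    data = fetch_from_web(task_id)
    conn = sqlite3.connect(DB, timeout=30)        # each process gets its own connection
    with conn:                                    # commits on success, rolls back on error
        conn.execute("INSERT INTO results (task_id, payload) VALUES (?, ?)", (task_id, data))
    conn.close()

if __name__ == '__main__':
    setup = sqlite3.connect(DB)
    setup.execute("CREATE TABLE IF NOT EXISTS results (task_id INTEGER, payload TEXT)")
    setup.commit()
    setup.close()
    with Pool(8) as pool:
        pool.map(worker, range(100))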
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
If I had to build a system like the one you describe using SQLite, I would start by writing an async server (using the asynchat module) to handle all of the SQLite database access, and then I would write the other processes to use that server. When there is only one process accessing the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each other's toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions; in other words, I would try to write all the other processes so that every database transaction is independent. At a minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if-then capability so that you could SELECT a record, check that a field is equal to X, and only then UPDATE that field. If your existing app closes the database after every transaction, then you don't need to worry about sessions.
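A minimal sketch of the same single-writer idea, using a multiprocessing queue instead of an asynchat server (file and table names are made up): one process owns the SQLite connection and applies each submitted list of statements as a single transaction, in order.

import sqlite3
from multiprocessing import Process, Queue

def db_writer(db_path, inbox):
    conn = sqlite3.connect(db_path)
    while True:
        transaction = inbox.get()          # a list of (sql, params) pairs, or None to stop
        if transaction is None:
            break
        with conn:                         # each submitted list runs as one transaction
            for sql, params in transaction:
                conn.execute(sql, params)
    conn.close()

if __name__ == '__main__':
    inbox = Queue()
    writer = Process(target=db_writer, args=('app.db', inbox))
    writer.start()

    inbox.put([("CREATE TABLE IF NOT EXISTS kv (k TEXT, v TEXT)", ())])
    inbox.put([("INSERT INTO kv VALUES (?, ?)", ('a', '1')),
               ("INSERT INTO kv VALUES (?, ?)", ('b', '2'))])

    inbox.put(None)                        # shut the writer down
    writer.join()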
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/
We have hundreds of thousands of tasks that need to be run at a variety of arbitrary intervals, some every hour, some every day, and so on. The tasks are resource intensive and need to be distributed across many machines.
Right now tasks are stored in a database with an "execute at this time" timestamp. To find tasks that need to be executed, we query the database for jobs that are due to be executed, then update the timestamps when the task is complete. Naturally this leads to a substantial write load on the database.
As far as I can tell, we are looking for something to release tasks into a queue at a set interval. (Workers could then request tasks from that queue.)
What is the best way to schedule recurring tasks at scale?
For what it's worth we're largely using Python, although we have no problems using components (RabbitMQ?) written in other languages.
UPDATE: Right now we have about 350,000 tasks that run every half hour or so, with some variation. 350,000 tasks * 48 times per day is 16,800,000 tasks executed per day.
UPDATE 2: There are no dependencies. The tasks do not have to be executed in order and do not rely on previous results.
Since ACID isn't needed and you're okay with tasks potentially running twice, I wouldn't keep the timestamps in the database at all. For each task, create a list of [timestamp_of_next_run, task_id] and use a min-heap to store all of these lists. Python's heapq module can maintain the heap for you. You'll be able to pop off the task with the soonest timestamp very efficiently. When you need to run a task, use its task_id to look up in the database what the task needs to do. When a task completes, update its timestamp and push it back onto the heap. (Just be careful not to change an item while it's in the heap, as that breaks the heap invariant.)
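A minimal sketch of that heap-based loop with heapq; the task ids, intervals and run_task lookup are placeholders, and like a real scheduler it loops forever:

import heapq
import time

def run_task(task_id):
    print('running task', task_id)                 # placeholder: look the task up in the DB and run it

now = time.time()
heap = [[now, 1, 3600], [now + 5, 2, 86400]]       # [next_run_timestamp, task_id, interval_seconds]
heapq.heapify(heap)

while heap:
    next_run, task_id, interval = heap[0]          # peek at the soonest task
    if next_run > time.time():
        time.sleep(max(0.0, min(next_run - time.time(), 1.0)))
        continue
    heapq.heappop(heap)                            # it is due: pop it off and run it
    run_task(task_id)
    heapq.heappush(heap, [time.time() + interval, task_id, interval])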
Use the database only to store information that you will still care about after a crash and reboot. If you won't need the information after a reboot, don't spend the time writing to disk. You will still have a lot of database read operations to load the information about a task that needs to run, but a read is much cheaper than a write.
If you don't have enough RAM to store all of the tasks in memory at the same time, you could go with a hybrid setup where you keep the tasks for the next 24 hours (for example) in RAM and everything else stays in the database. Alternatively, you could rewrite the code in C or C++, which are less memory-hungry.
If you don't want a database, you could store just the next-run timestamp and task id in memory, and keep the properties for each task in a file named [task_id].txt. You would need a data structure to hold all the tasks in memory, sorted by timestamp; an AVL tree seems like it would work, and here's a simple one for Python: http://bjourne.blogspot.com/2006/11/avl-tree-in-python.html. Hopefully Linux (I assume that's what you are running on) can handle millions of files in a directory; otherwise you might need to hash on the task id to get a subfolder.
Your master server would just need to run a loop, popping tasks off the AVL tree until the next task's timestamp is in the future. Then you could sleep for a few seconds and start checking again. Whenever a task runs, you would update its next-run timestamp in the task file and re-insert it into the AVL tree.
When the master server reboots, there would be the overhead of reloading all the task ids and next-run timestamps back into memory, which could be painful with millions of files. Maybe you just have one giant file, give each task 1K of space in the file for its properties and next-run timestamp, and then use [task_id] * 1K to seek to the right offset for the task's properties.
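A small sketch of that fixed-offset layout; the record contents are made up:

RECORD_SIZE = 1024                                  # 1K of space per task

def write_task(f, task_id, properties):
    record = properties.encode('utf-8')[:RECORD_SIZE].ljust(RECORD_SIZE, b'\x00')
    f.seek(task_id * RECORD_SIZE)                   # task_id * 1K is the task's offset
    f.write(record)

def read_task(f, task_id):
    f.seek(task_id * RECORD_SIZE)
    return f.read(RECORD_SIZE).rstrip(b'\x00').decode('utf-8')

with open('tasks.dat', 'w+b') as f:
    write_task(f, 42, 'next_run=1700000000;cmd=rebuild_report')
    print(read_task(f, 42))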
If you are willing to use a database, I am confident MySQL could handle whatever you throw at it given the conditions you describe, assuming you have 4GB+ RAM and several hard drives in RAID 0+1 on your master server.
Finally, if you really want to get complicated, Hadoop might work too: http://hadoop.apache.org/
If you're worried about writes, you can have a set of servers that dispatch the tasks (perhaps striping the servers to equalize load) and have each server write bulk checkpoints to the DB (this way you will not have so many write queries). You still have to write enough to be able to recover if a scheduling server dies, of course.
In addition, if you don't have a clustered index on timestamp, you will avoid having a hot-spot at the end of the table.
350,000 tasks * 48 times per day is 16,800,000 tasks executed per day.
To schedule the jobs, you don't need a database.
Databases are for things that are updated. The only update visible here is a change to the schedule to add, remove or reschedule a job.
Cron does this in a totally scalable fashion with a single flat file.
Read the entire flat file into memory and start spawning jobs. Periodically, stat the file to see if it has changed. Or, even better, wait for a HUP signal and use that to reread the file. Use kill -HUP to signal the scheduler to reread the file.
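A minimal sketch of that reload-on-HUP behaviour in Python; the schedule file format here is made up, and SIGHUP is Unix-only:

import signal

SCHEDULE_FILE = 'schedule.txt'
schedule = []

def load_schedule(*_):
    """Reread the flat file; one 'interval_seconds command' entry per line."""
    global schedule
    with open(SCHEDULE_FILE) as f:
        schedule = [line.split(None, 1) for line in f if line.strip()]
    print('reloaded', len(schedule), 'entries')

signal.signal(signal.SIGHUP, load_schedule)   # `kill -HUP <pid>` triggers a reload
load_schedule()

# ... main loop: spawn whichever jobs in `schedule` are due, sleep, repeat ...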
It's unclear what you're updating the database for.
If the database is used to determine the future schedule based on job completion, then a single database is a Very Bad Idea.
If you're using the database to do some analysis of job history, then you have a simple data warehouse.
Record completion information (start time, end time, exit status, all that stuff) in a simple flat log file.
Process the flat log files to create a fact table and dimension updates.
When someone has the urge to do some analysis, load relevant portions of the flat log files into a datamart so they can do queries and counts and averages and the like.
Do not directly record 17,000,000 rows per day into a relational database. No one wants all that data. They want summaries: counts and averages.
Why hundreds of thousands and not hundreds of millions ? :evil:
I think you need Stackless Python (http://www.stackless.com/), created by the genius Christian Tismer.
Quoting:
Stackless Python is an enhanced version of the Python programming language. It allows programmers to reap the benefits of thread-based programming without the performance and complexity problems associated with conventional threads. The microthreads that Stackless adds to Python are a cheap and lightweight convenience which can, if used properly, give the following benefits: improved program structure, more readable code, and increased programmer productivity.
It is used for massively multiplayer games.