Apologies if this is too stupid a question:
I run the weather data logging program Cumulus on my PC. It writes each data sample to the realtime.txt file every second (the previous sample is always overwritten, i.e. the file always contains exactly one data row).
I want to publish a pseudo-realtime chart of the last hour of wind data (strength and direction) to the web. My plan is to have a Python program read the realtime.txt data row each second and append it to a buffer which is saved to a file. A Google Apps Script is then time-triggered every minute, reads the buffer file, and appends the last minute's worth of data to a circular one-hour data table in Google Sheets. The table's data is used in a line chart, which is then published and can be read by anyone.
So basically I have a short-term (1-minute) FIFO which feeds a longer-term (1-hour) FIFO, using a buffer file for the transfer.
Now my problem is how to sync these FIFOs so that every one-second data row is counted and no row is counted twice. As the two apps are unsynchronized, this is not trivial, at least not for me.
I'm thinking of giving each data row a unique row number, enabling the Apps Script to figure out where to put the next chunk of data in the longer-term buffer. I'm also thinking of expanding the short-term FIFO to 2 minutes, which would ensure there is always at least 1 minute's worth of fresh data available regardless of any jitter in the Apps Script's trigger time.
But converting these thoughts into code is proving a bit overwhelming for me ;( Has anyone had the same problem, or can you otherwise advise, even slightly?
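To make the row-numbering idea a bit more concrete, this is roughly the kind of Python loop I have in mind (a rough sketch only; the file names, the 2-minute window size and the CSV handling are placeholders, and this is exactly the part I'm struggling to get right):

import time
from collections import deque

REALTIME = "realtime.txt"   # written by Cumulus, always exactly one row (placeholder path)
BUFFER = "buffer.txt"       # read by the Apps Script once a minute (placeholder path)
BUFFER_ROWS = 120           # ~2 minutes of one-second samples

buffer = deque(maxlen=BUFFER_ROWS)
seq = 0

while True:
    with open(REALTIME) as f:
        row = f.read().strip()
    seq += 1
    buffer.append(f"{seq},{row}")           # unique row number in front of the sample
    with open(BUFFER, "w") as f:
        f.write("\n".join(buffer) + "\n")   # rewrite the whole 2-minute window each second
    time.sleep(1)

The Apps Script would then remember the highest sequence number it has already copied into the sheet and append only the rows with larger numbers.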
Related
I am writing my bachelor thesis on a project with a massive database that tracks around 8000 animals, three times a second. After a few months we now have approx. 127 million entries, and each row includes a column holding an array of 1000-3000 entries with the coordinates of every animal that was tracked in that square at that moment. All of that lives in a SQL database that now easily exceeds 2 TB in size.
To export the data and analyse the animals' movement patterns, they used phpMyAdmin's online CSV export, which would take hours to finish and would break down almost every time.
I wrote them a Python script (they wanted me to use Python) with mysql-connector-python that fetches the data for them automatically. The problem is that, since the database is so massive, one query can take minutes or technically even hours to complete (downloading a day of tracking data means 3*60*60*24 = 259,200 entries).
The moment anything goes wrong (the connection fails, the computer is overloaded, etc.), the whole query is closed and it has to start all over again because it's not cached anywhere.
I then rewrote the whole thing as a class that fetches the data using smaller, multithreaded queries.
I start about 5-7 threads; each takes a connection out of a connection pool, makes its query, writes the results to a CSV file as it goes, and puts the connection back in the pool once it is done with the query.
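Stripped down, the approach looks roughly like this (a simplified sketch, not my actual class; the host/user details, table and column names, and the hour-based chunking are made up):

import csv
import threading
from queue import Queue, Empty
from mysql.connector import pooling

pool = pooling.MySQLConnectionPool(
    pool_name="fetch_pool", pool_size=6,
    host="localhost", user="user", password="pw", database="tracking")

chunks = Queue()
for hour in range(24):                       # one day, split into 24 chunks
    chunks.put((hour, hour + 1))

csv_lock = threading.Lock()                  # only one thread appends to the file at a time

def worker(path):
    while True:
        try:
            start, end = chunks.get_nowait()
        except Empty:
            return
        conn = pool.get_connection()
        try:
            cur = conn.cursor()
            cur.execute("SELECT ts, positions FROM samples "
                        "WHERE HOUR(ts) >= %s AND HOUR(ts) < %s", (start, end))
            rows = cur.fetchall()
        finally:
            conn.close()                     # returns the connection to the pool
        with csv_lock, open(path, "a", newline="") as f:
            csv.writer(f).writerows(rows)    # append this chunk's rows to the CSV

threads = [threading.Thread(target=worker, args=("day.csv",)) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()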
My solution works perfectly: the queries are about 5-6 times faster, depending on the number of threads I use and the size of the chunks that I download. The data gets written into the file as it arrives, and when the connection breaks or anything else happens, the CSV file still holds all the data that has been downloaded up to that point.
But when looking for ways to improve my method, I can find absolutely nothing about a similar approach; no one seems to do it that way for large datasets.
What am I missing? Why does it seem like everyone is using a single-query approach to fetch their massive datasets, instead of splitting it into threads and avoiding these annoying issues with connection breaks and whatnot?
Is my solution even usable and good in a commercial environment, or are there things that I just don't see right now that would make my approach useless or even far worse?
Or maybe it is a matter of the programming language, and if I had used C# to do the same thing it would've been faster anyway?
EDIT:
To clear some things up: I am not responsible for the database. While I can tinker with it, since I also have admin rights, someone else who (hopefully) actually knows what he is doing has set it up and writes the data. My job is only to fetch the data as simply and efficiently as possible. And since exporting from phpMyAdmin is too slow, and so is a single query in Python for 100k rows (I do it using pd.read_sql), I switched to multithreading. So my question is only about SELECTing the data effectively, not about changing the DB.
I hope this is not becoming too long of a question...
There are many issues in a database of that size. We need to do the processing fast enough so that it never gets behind. (Once it lags, it will keel over, as you see.)
Ingestion. It sounds like a single client is receiving 8000 lat/lng values every 3 seconds, then INSERTing a single, quite wide row. Is that correct?
When you "process" the data, are you looking at each of the 8000 animals? Or looking at a selected animal? Fetching one lat/lng pair out of a wide row is messy and slow.
If the primary way things are SELECTed is one animal at a time, then your matrix needs to be transposed. That will make selecting all the data for one animal much faster, and we can mostly avoid the impact that Inserting and Selecting have on each other.
Are you inserting while you are reading?
What is the value of innodb_buffer_pool_size? You must plan carefully with the 2TB versus the much smaller RAM size. Depending on the queries, you may be terribly I/O-bound and maybe the data structure can be changed to avoid that.
"...csv file and put it back..." -- Huh? Are you deleting data, then re-inserting it? That seems 'wrong'. And very inefficient.
Do minimize the size of every column in the table. How big is the range for the animals? Your backyard? The Pacific Ocean? How much precision is needed in the location? Meters for whales; millimeters for ants. Maybe the coordinates can be scaled to a pair of SMALLINTs (2 bytes, 16-bit precision) or MEDIUMINTs (3 bytes each)?
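For example, something along these lines on the application side (the bounding box below is a made-up study area; pick one that covers yours):

# Sketch: scale a lat/lng pair into the signed 16-bit range so each
# coordinate fits in a 2-byte SMALLINT column. Bounds are hypothetical.
LAT_MIN, LAT_MAX = 59.0, 61.0
LNG_MIN, LNG_MAX = 24.0, 26.0

def to_smallint(value, lo, hi):
    """Map lo..hi onto the signed 16-bit range -32768..32767."""
    return round((value - lo) / (hi - lo) * 65535) - 32768

def from_smallint(code, lo, hi):
    """Invert the scaling when reading back."""
    return (code + 32768) / 65535 * (hi - lo) + lo

lat_code = to_smallint(60.1697, LAT_MIN, LAT_MAX)   # store this in a SMALLINT column
lng_code = to_smallint(24.9384, LNG_MIN, LNG_MAX)

With a 2-degree box that works out to roughly metre-level resolution per coordinate, and each pair shrinks from two 8-byte DOUBLEs to 4 bytes.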
I haven't dwelled on threading; I would like to wait until the rest of the issues are ironed out. Threads interfere with each other to some extent.
I find this topic interesting. Let's continue the discussion.
Background:
I am getting a temperature float from an Arduino via a serial connection. I need to be able to cache this temperature data every 30 seconds for other applications (e.g. web, thermostat controller) to access, without overloading the serial connection.
Currently I cache this data to RAM as a file in /run (I'm trying to follow Linux convention). Then other applications can poll the file for the temperature as they want it, all day long, with I/O now the only bottleneck (I'm using an RPi, so not a lot of enterprise-level need here).
Problem:
I think when an app reads this file, it risks reading corrupt data. If a writer updates the file while a reader is reading it at the same time, can corrupt data be read, causing the thermostat to behave erratically?
Should I just use sqlite3 as an overkill solution, or use file locks (and does that risk something else not working perfectly)?
This is all taking place in multiple Python processes. Is Linux able to handle this situation natively, or do I need to somehow apply the principles mentioned here?
Calls to write(2) ought to be atomic under Linux.
Which means that as long as you are writing a single buffer, you can be certain that readers won't read an incomplete record. You might want to use os.write to make sure that no buffering/chunking happens that you are not aware of.
If a read is happening and the file is updated, will the read pick up the new data while in the middle of the file, or does it somehow know how to get the data from the old file (and how)?
If there is exactly one read(2) and one write(2), you are guaranteed to see a consistent result. If you split your write into two, it might happen that you write the first part, the read happens, and then you write the second part, which would be an atomicity violation. In case you need to write multiple buffers, either combine them yourself or use writev(2).
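For example, a sketch of that pattern for the /run file (the path and record format are just examples; error handling omitted):

import os

PATH = "/run/thermostat/temperature"   # example path

def write_temperature(celsius):
    record = f"{celsius:.2f}\n".encode()
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, record)           # the whole record in a single write(2)
    finally:
        os.close(fd)

def read_temperature():
    fd = os.open(PATH, os.O_RDONLY)
    try:
        data = os.read(fd, 64)         # a single read(2) covering the whole record
    finally:
        os.close(fd)
    return float(data.decode())

# Note: O_TRUNC briefly leaves the file empty before the write lands; if that
# window matters, write to a temporary file and os.rename() it into place.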
OK, so I am using gspread to pull data from Google Spreadsheets, but for what I am doing I need to pull data from long columns. The data that is being pulled doesn't need to be there until halfway through the program. Is there a way to pull that data at the beginning, while the first half of the program is running?
- As it runs right now, it looks up some individual values (~5 seconds).
- Then it pulls the data from the columns, which takes ~4-15 seconds (it varies), but it isn't doing ANYTHING else while pulling the data, so it just sits there.
- Then it continues and does the rest of the calculations, which take ~1 second.
I feel like this is inefficient, and since it deals with minutes, I worry that it might start to interfere with the way the program runs when the columns get especially long...
Here is the pastebin for the code, with my information removed: http://pastebin.com/Wf5bfmZ0
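To show what I mean, roughly this shape is what I'm after (a sketch with placeholder sheet, column and function names, not my actual code):

from concurrent.futures import ThreadPoolExecutor
import gspread

gc = gspread.service_account()                 # or however you normally authorise
sheet = gc.open("my-spreadsheet").sheet1       # placeholder spreadsheet name

def fetch_columns():
    # the slow part: pull the long columns
    return sheet.col_values(1), sheet.col_values(2)

def first_half():                              # placeholder for the ~5 s of lookups
    pass

def second_half(col_a, col_b):                 # placeholder for the ~1 s of calculations
    pass

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(fetch_columns)        # starts downloading in the background
    first_half()                               # runs while the columns download
    col_a, col_b = future.result()             # blocks only if the fetch isn't done yet
    second_half(col_a, col_b)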
I have a ridiculously simple Python script that uses the arcpy module. I turned it into a script tool in ArcMap and am running it that way. It works just fine; I've tested it multiple times on small datasets. The problem is that I have a very large amount of data. I need to run the script/tool on a .dbf table with 4 columns and 490,481,440 rows, and so far it has taken days. Does anyone have any suggestions on how to speed it up? To save time I've already created the columns that will be populated in the table before I run the script. "back" represents the second comma-separated number in the "back_pres_dist" column and "dist" represents the fourth. All I want is for them to be in their own separate columns. The table and script look something like this:
back_pres_dist back dist
1,1,1,2345.6
1,1,2,3533.8
1,1,3,4440.5
1,1,4,3892.6
1,1,5,1292.0
import arcpy
from arcpy import env
inputTable = arcpy.GetParameterAsText(0)
back1 = arcpy.GetParameterAsText(1) #the empty back column to be populated
dist3 = arcpy.GetParameterAsText(2) #the empty dist column to be populated
arcpy.CalculateField_management(inputTable, back1, '!back_pres_dist!.split(",")[1]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("back column updated.")
arcpy.CalculateField_management(inputTable, dist3, '!back_pres_dist!.split(",")[3]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("dist column updated.")
updateMess = arcpy.AddMessage("All columns updated.")
Any suggestions would be greatly appreciated. I know that reading some parts of the data into memory might speed things up, but I'm not sure how to do that with Python (when I used R it took forever to read into memory and it was a nightmare trying to write to a .csv).
This is a ton of data. I'm guessing that your main bottleneck is read/write operations on the disk and not CPU or memory.
Your process appears to modify each row independently according to constant input values in what's essentially a tabular operation that doesn't really require GIS functionality. As a result, I would definitely look at doing this outside of the arcpy environment to avoid that overhead. While you could dump this stuff to numpy with the new arcpy.da functionality, I think that even this might be a bottleneck. It seems you should be able to read your *.dbf file more directly with a different library.
In fact, this operation is not really tabular; it's really about iteration. You'll probably want to exploit things like the "with"/"as" keywords (PEP 343; Raymond Hettinger has a good video on YouTube, too) or iterators in general (see PEPs 234 and 255), which load only one record at a time.
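For instance, a sketch that skips arcpy entirely and streams the table with the dbfread package (the output file name is arbitrary, and whether a plain CSV is an acceptable destination is up to you):

# Sketch: stream the .dbf one record at a time (nothing held in memory)
# and write the two wanted values out to a CSV.
import csv
from dbfread import DBF

with open("back_dist.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["back", "dist"])
    for record in DBF("input_table.dbf"):            # yields one record at a time
        parts = record["back_pres_dist"].split(",")
        writer.writerow([parts[1], parts[3]])         # 2nd and 4th comma-separated values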
Beyond those general programming approaches, I'm thinking that your best bet would be to break this data into chunks, parallelize, and then reassemble the results. Part of engineering the parallelization could be to spread your data across different disk platters to avoid competing I/O requests. IPython is an add-on for Python that has a pretty easy-to-use, high-level package, "parallel", if you want an easy place to start. There are lots of pretty good videos on YouTube from PyCon 2012; there's a 3-hour one where the parallel stuff starts at 2:13:00 or so.
I have a VPS that's hosting multiple virtual hosts. Each host has its own access.log and error.log. Currently there's no log rotation set up, though this may change.
Basically, I want to parse these logs to monitor bandwidth and collect stats.
My idea was to write a parser and save the information to a small sqlite database. The script will run every 5 minutes and use Python's seek and tell methods to open the log files from the last parsed locations. This prevents me from parsing a 10GB log file every 5 minutes when all I need is the new information sitting at the end of it (no log rotation, remember?).
After some thought, I realised that all I'm doing is taking the information from the log files and putting it into a database... moving the data from one location to another :/
So how else can I do this?
I want to be able to do something like:
python logparse.py --show=bandwidth --between-dates=25,05|30,05 --vhost=test.com
This would open the log file for test.com and show me the bandwidth used for the specified 5 days.
Now, my question is, how do I prevent myself from parsing 10GB worth of data when I only want 5 days worth of data?
If I were to use my idea of saving the log data to a database every 5 minutes, I could just save a unix timestamp of the dates and pull out the data between them. Easy. But I'd prefer to parse the log file directly.
Unless you create different log files for each day, you have no way other than to parse the whole log on request.
I would still use a database to hold the log data, but at your desired time-unit resolution (e.g. hold the bandwidth at a day/hour interval). Another advantage of using a database is that you can make range queries, like the one you give in your example, very easily and quickly. Whenever you have old data that you don't need any more, you can delete it from the database to save space.
Also, you don't need to parse the whole file each time. You could monitor writes to the file with the help of pyinotify: whenever a line is written, you could update the counters in the database. Or you can store the last position in the file whenever you read from it and read from that position the next time. Be careful when the file is truncated.
To sum it up:
hold your data in the database at day resolution (e.g. the bandwidth for each day)
use pyinotify to monitor the writes to the log file so that you don't read the whole file over and over again
If you don't want to code your own solution, take a look at Webalizer, AWStats or pick a tool from this list.
EDIT:
WebLog Expert also looks promising. Take a look at one of the reports.
Pulling just the required 5 days of data from a large logfile comes down to finding the right starting offset to seek() the file to before you begin parsing.
You could find that position each time using a binary search through the file: seek() to os.stat(filename).st_size / 2, call readline() once (discarding the result) to skip to the end of the current line, then do two more readline()s. If the first of those lines is before your desired starting time, and the second is after it, then your starting offset is tell() - len(second_line). Otherwise, do the standard binary search algorithm. (I'm ignoring the corner cases where the line you're looking for is the first or last or not in the file at all, but those are easy to add)
Once you have your starting offset, you just keep parsing lines from there until you reach one that's newer than the range you're interested in.
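A sketch of that search (parse_time is whatever extracts a comparable timestamp from a line of your log format; the corner cases above are still ignored):

import os

def find_start_offset(f, start_time, parse_time):
    """Binary-search a time-ordered log file, opened in binary mode ('rb'),
    for the byte offset of the first line whose timestamp is >= start_time."""
    lo, hi = 0, os.fstat(f.fileno()).st_size
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid)
        f.readline()                  # skip the (probably partial) current line
        line = f.readline()
        if not line or parse_time(line) >= start_time:
            hi = mid                  # the target line starts at or before here
        else:
            lo = mid + 1              # the target line starts further on
    f.seek(lo)
    f.readline()                      # realign to the start of the next full line
    return f.tell()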
This will be much faster than parsing the whole logfile each time, of course, but if you're going to be doing a lot of these queries, then a database probably is worth the extra complexity. If the size of the database is a concern, you could go for a hybrid approach where the database is an index to the log file. For example, you could store just the byte offset of the start of each day in the database. If you don't want to update the database every 5 minutes, you could have logparse.py update it with new data each time it runs.
After all that, though, as Pierre and the_void have said, do make sure you're not reinventing the wheel -- you're not the first person ever to need bandwidth statistics :-)
Save the last position
When you have finished parsing a log file, save the position in a table of your database that references both the full file path and the position. When you run the parser 5 minutes later, you query the database for the log you are going to parse, retrieve the position and start from there.
Save the first line of data
When you have log rotation, add an additional key in the database that will contain the first line of the log file. So when you start with a file, first read the first line. When you query the database, you then have to check against the first line and not the file name.
The first line should always be unique, since you have the timestamp. But don't forget that W3C-compliant log files usually write headers at the beginning of the file, so the first line you store should be the first line of data.
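A sketch of that bookkeeping with sqlite3 (the state-database file, table and column names are just examples, and handle_line stands for whatever your parser does with one line; header skipping is omitted):

import sqlite3

db = sqlite3.connect("logparse_state.db")
db.execute("""CREATE TABLE IF NOT EXISTS state
              (path TEXT PRIMARY KEY, first_line TEXT, position INTEGER)""")

def parse_new_lines(path, handle_line):
    """Parse only the lines added since the last run, detecting rotation by
    comparing the stored first line of the file."""
    with open(path) as f:
        first_line = f.readline()
        row = db.execute("SELECT first_line, position FROM state WHERE path = ?",
                         (path,)).fetchone()
        if row and row[0] == first_line:
            f.seek(row[1])              # same file as last time: resume where we stopped
        else:
            handle_line(first_line)     # new or rotated file: start from the top
        while True:
            line = f.readline()
            if not line:
                break
            handle_line(line)
        db.execute("INSERT OR REPLACE INTO state VALUES (?, ?, ?)",
                   (path, first_line, f.tell()))
        db.commit()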
Save the data you need only
When parsing W3C logs, it's very easy to read the bytes sent. Parsing will be very fast if you keep only that information. Then store it in your database, either by updating an existing row or by adding a new row with a timestamp that you can aggregate with others later in a query.
Don't reinvent the wheel
Unless what you are doing is very specific, I recommend grabbing an open source parser from the web: http://awstats.sourceforge.net/