I have data arriving at a rate of one item per second and need to store it in Python 3.5. I need to be able to calculate the running total over the last 24 hours, as well as the totals for each day, as the data arrives. Restarting the program should reset all of the stored data, so there's no need for persistent storage.
I was planning on using pandas DataFrames, but they don't seem to be well suited to appending data one row at a time.
Is there a way to store data indexed by time that allows appending one row at a time?
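A minimal sketch of one way to do this with just the standard library (no pandas), assuming the incoming items are numeric values timestamped on arrival; the function name add_item is only illustrative:

from collections import deque, defaultdict
from datetime import datetime, timedelta

window = deque()                    # (timestamp, value) pairs from the last 24 hours
window_total = 0.0                  # running total over that window
daily_totals = defaultdict(float)   # total per calendar day

def add_item(value, now=None):
    global window_total
    now = now or datetime.now()
    window.append((now, value))
    window_total += value
    daily_totals[now.date()] += value
    # Evict anything older than 24 hours from the running total.
    cutoff = now - timedelta(hours=24)
    while window and window[0][0] < cutoff:
        _, old_value = window.popleft()
        window_total -= old_value

Appending is O(1), and the eviction loop only touches items that actually fall out of the 24-hour window.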
Related
So I want to scrape a weather forecast table once a day and store my results for future analysis. I want to store the data but I'm not sure how.
Example of the data: Forecast Table
My four variables of interest are wind speed, wind gusts, wave height, and wave period.
This is my first Python project involving time series data and I'm fairly new to databases, so go easy on me and ELI5 please.
In the Python For Everyone course I recently took, I learned about relational databases and SQLite. The main idea there was to store data efficiently and never store the same data twice. However, none of the examples involved time series data, so now I'm not sure what the best approach is.
I could create a table for each variable, plus one for the date I scraped the forecast, with the date of scraping serving as the primary key. In that design, the first column of a variable table such as wind speed would be the date of scraping, followed by columns for the forecasted values at each timestamp. Although this would make the storage more efficient than creating a new table every day, there are a few problems. The timestamps are not uniform (see image; forecast times run only from 3am to 9pm). Also, depending on the time of day the forecast is scraped, the date and time values of the timestamps keep shifting, so the next timestamp is not always two hours later.
Seeing as I get a new table each time I scrape the forecast, should I create a new database table each time in SQLite? This seems like a rather rudimentary solution, and I'm sure there are better ways to store the data.
How would you go about this?
Summarizing my comments:
You may want to append the forecast data from each new scrape to the existing data in the same database table.
Each web scrape will give you roughly 40 new records with the same scrape timestamp but different forecast timestamps.
For example, these would be the columns of the table, using ID as a primary key with AUTOINCREMENT:
ID
Scraping_time
Forecast_hours
Wind_speed
Wind_gusts
Wind_direction
Wave_height
Wave_period
Wave_direction
Note: if you use SQLite, you can leave out the ID column, as SQLite adds a ROWID column by default when no other primary key is specified (https://www.sqlite.org/autoinc.html).
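A minimal sketch of that schema with the sqlite3 module; the database name weather.db, the table name forecasts, and the sample values are placeholders:

import sqlite3
from datetime import datetime

conn = sqlite3.connect("weather.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS forecasts (
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        Scraping_time TEXT,
        Forecast_hours TEXT,
        Wind_speed REAL,
        Wind_gusts REAL,
        Wind_direction TEXT,
        Wave_height REAL,
        Wave_period REAL,
        Wave_direction TEXT
    )
""")

# Each scrape produces ~40 rows that share the same Scraping_time.
scraping_time = datetime.now().isoformat()
rows = [
    (scraping_time, "03:00", 12.0, 18.0, "NW", 1.4, 9.0, "W"),   # placeholder values
    (scraping_time, "05:00", 14.0, 20.0, "NW", 1.5, 9.0, "W"),
]
conn.executemany(
    "INSERT INTO forecasts (Scraping_time, Forecast_hours, Wind_speed, Wind_gusts, "
    "Wind_direction, Wave_height, Wave_period, Wave_direction) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    rows,
)
conn.commit()

Because every row carries its own scrape and forecast timestamps, the irregular forecast hours are not a problem: you simply insert however many rows the page happens to contain.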
I am getting my data from Binance in two forms. First I fetch historical data and create a dataframe in pandas, then use sqlite3 to create a database so that I can call on this live-updated DB from another script.
Then I use Binance websockets to get data in 1-minute intervals. I want to use technical indicators like moving averages, and I have easily calculated them for the historical data:
frame['SMA7'] = frame['Close'].rolling(7).mean()     # 7-period simple moving average
frame['SMA25'] = frame['Close'].rolling(25).mean()   # 25-period simple moving average
However, the live data comes in one price point at a time, and I can't figure out how to append the data, SMAs included, to the dataframe without calling the DB over and over to fetch the old price data and recalculate the moving averages. Is there a methodology for this kind of database update, where the new row partially depends on the previous data?
The Excel equivalent would be to have the formulas pre-populated in the table so that when the data lands, the output is generated. Three possibilities I have considered (option 2 is sketched after the list):
Retrieve the old data from the DB, calculate the values, and then append the new row.
Maintain the DB and a dataframe simultaneously: when a new price point comes in, append it to the dataframe, calculate the values, and then append the last row of the dataframe to the database.
Re-fetch the data entirely and rewrite the DB as if it were new.
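A sketch of option 2, keeping the recent history in an in-memory dataframe so the SMAs can be computed without re-reading the database; the table name prices and the function handle_tick are placeholders, not part of the Binance or sqlite3 APIs:

import pandas as pd
import sqlite3

conn = sqlite3.connect("prices.db")
frame = pd.read_sql("SELECT * FROM prices", conn)   # seed once with the historical data

def handle_tick(timestamp, close):
    global frame
    # Append the new price point to the in-memory frame.
    new_row = pd.DataFrame({"Time": [timestamp], "Close": [close]})
    frame = pd.concat([frame, new_row], ignore_index=True)
    # Only the last 25 closes are needed for the longest moving average.
    window = frame["Close"].tail(25)
    frame.loc[frame.index[-1], "SMA7"] = window.tail(7).mean()
    frame.loc[frame.index[-1], "SMA25"] = window.mean()
    # Persist only the newly completed row.
    frame.tail(1).to_sql("prices", conn, if_exists="append", index=False)

The frame could also be trimmed to the last 25 rows after each tick, since older rows are already in the database and are no longer needed for the calculation.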
I like the idea of having my historical stock data stored in a database instead of CSV. Is there a speed penalty for fetching large data sets from MariaDB compared to CSV?
Quite the opposite. Whenever you fetch data from a CSV, unless you have a stopping condition (for example, take the first entry with x = 3), you must parse every single line in the file. This is an expensive operation: not only do you have to read all of the lines (making it O(n)), but in general you will be typecasting as well. In a database, all of the lines have already been processed, and if there is an index on x, or whatever attribute you are searching by, the database can find the information in O(log n) time without looking at the vast majority of entries.
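To make that concrete, a small sketch contrasting the two access patterns (sqlite3 is used here only to keep the example self-contained; the same reasoning applies to MariaDB, and the file, table, and column names are placeholders):

import csv
import sqlite3

# CSV: every row has to be read and typecast, even if only a handful match.
def csv_lookup(path, x_value):
    with open(path, newline="") as f:
        return [row for row in csv.DictReader(f) if int(row["x"]) == x_value]

# Database: with an index on x, the engine can jump straight to the matching rows.
conn = sqlite3.connect("stocks.db")
conn.execute("CREATE INDEX IF NOT EXISTS idx_stocks_x ON stocks (x)")
rows = conn.execute("SELECT * FROM stocks WHERE x = ?", (3,)).fetchall()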
I'm trying to process a "big" set of data: an Excel sheet with 5k rows and 30 columns. Most of the data stored in the cells are strings. What I have to do is perform simple tasks on this data, such as:
Number of repetitions of a string
Checking some rules that the data in the same row should follow (only a few ifs are needed to check them)
And so on...
My first attempt was to create 5k objects (one per row), load the data into them, and then start running the tests. But saving the data in these objects took about an hour for only 1k rows. I did this in Python with the openpyxl module in read-only mode.
My question is... is there a faster way to do this?
The answer to my question was here: link
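For comparison, one common way to speed this up (not necessarily what the linked answer suggests) is to read the whole sheet in one pass with pandas instead of building one object per row; data.xlsx and the column names are placeholders:

import pandas as pd

df = pd.read_excel("data.xlsx")   # ~5k rows x 30 columns fits easily in memory

# Number of repetitions of each string in a column.
counts = df["some_column"].value_counts()

# Row-wise rule check as a vectorised condition instead of per-row ifs.
rule_ok = (df["col_a"] == "expected") | df["col_b"].notna()
violations = df[~rule_ok]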
I am storing multiple time-series in a MongoDB with sub-second granularity. The DB is updated by a bunch of Python scripts, and the data stored serve two main purposes:
(1) It's a central information source for the latest data from all series. Multiple scripts access it every second or so to read the latest datapoint in each collection.
(2) It's a long-term data store. I often load the whole DB into Python to analyse trends in the data.
To keep the DB as efficient as possible, I want to bucket my data (ideally holding one document per day in each collection). Because of (1), however, the bigger the buckets, the more expensive the sorting required to access the last datapoint.
I can think of two solutions here, but I'm not sure what alternatives there are, or which is the best way:
a) Store the latest timestamp in a one-line document in a separate db/collection. No sorting required on read, but an additional write is required every time any series gets a new datapoint.
b) Keep the buckets smaller (say 1-hour each) and sort.
With a) you write smallish documents to a separate collection, which is preferable performance-wise to updating large documents. You could write all new datapoints to this collection and aggregate them per hour or day, depending on your preference. But, as you said, this requires an additional write operation.
With b) you need to keep the size of the index on the sort field in mind. Does the index fit in memory? That's crucial for the performance of the sort, as you do not want any in-memory sorting of a large collection.
I recommend exploring the hybrid approach of storing individual datapoints for a limited time in an 'incoming' collection. Once your bucketing interval of an hour or a day has passed, you can aggregate those datapoints into buckets and store them in a different collection. Of course, this adds some complexity to the application, which now needs to be able to read both the bucketed and the datapoint collections and merge them.
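A rough pymongo sketch of that hybrid approach with hourly buckets; the collection names incoming and buckets, the field names, and the helper functions are all placeholders:

from datetime import timedelta
from pymongo import MongoClient, DESCENDING

db = MongoClient()["timeseries"]

def write_datapoint(series, value, ts):
    # Purpose (1): cheap single-document writes into a small collection.
    db.incoming.insert_one({"series": series, "ts": ts, "value": value})

def latest_datapoint(series):
    # Only the small 'incoming' collection needs to be sorted to find the latest point.
    return db.incoming.find_one({"series": series}, sort=[("ts", DESCENDING)])

def roll_up(series, hour_start):
    # Purpose (2): once an hour is complete, fold its datapoints into one bucket document.
    hour_end = hour_start + timedelta(hours=1)
    span = {"$gte": hour_start, "$lt": hour_end}
    points = list(db.incoming.find({"series": series, "ts": span}).sort("ts", 1))
    if points:
        db.buckets.update_one(
            {"series": series, "hour": hour_start},
            {"$push": {"samples": {"$each": [
                {"ts": p["ts"], "value": p["value"]} for p in points]}}},
            upsert=True,
        )
        db.incoming.delete_many({"series": series, "ts": span})

Reads that span both recent and historical data then have to query incoming and buckets and merge the results, which is the extra application complexity mentioned above.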