Do data extracts need to be timestamped? - python

Should I timestamp my data extracts?
A few colleagues and I work together on a Python server to solve a data science problem. I wrote a few functions to extract my data from my source database and save it to the Python server for further processing. Now I'm struggling with whether I should save the extract with a timestamp, the result being that every time I start my pipeline another extract is saved, or omit the timestamp and overwrite the old extract. I read a lot about data not needing the same kind of version control as code does, and I don't really want to clutter the server with multiple, vastly redundant data extracts.

save the extract with a timestamp, the result being that every time I start my pipeline another extract is saved or omit the timestamp and overwrite the old extract.
Is the change of a feature over time important to your data science related problem?
Do you have any metrics which could tell a story if measured over time?
Perhaps you can store the delta since the last data pull instead of redundant features (feature-engineer on a different table).
Just a couple of thoughts. Good luck :)
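If you do decide to keep timestamped extracts, here is a minimal sketch of one way to do it while still avoiding clutter, assuming pandas with pyarrow for Parquet output; the directory, naming scheme and retention count are made up:

# Sketch: save each extract under a timestamped name and keep only the
# newest `keep_last` copies. Paths, file format and keep_last are assumptions.
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd

EXTRACT_DIR = Path("/srv/project/extracts")   # hypothetical location on the server

def save_extract(df: pd.DataFrame, keep_last: int = 3) -> Path:
    EXTRACT_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = EXTRACT_DIR / f"extract_{stamp}.parquet"
    df.to_parquet(path)

    # Prune older extracts so only the newest `keep_last` remain.
    for old in sorted(EXTRACT_DIR.glob("extract_*.parquet"))[:-keep_last]:
        old.unlink()
    return path

That keeps the reproducibility benefit of knowing which extract a run used without letting redundant copies pile up on the server.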

Related

Patterns for processing multi source CSVs in Python

I have multiple data sources of financial data that I want to parse into a common data model.
API retrieval. Single format from single source (currently)
CSV files – multiple formats from multiple sources
Once cleaned and validated, the data is stored in a database (this is a Django project, but I don’t think that’s important for this discussion).
I have opted to use Pydantic for the data cleaning and validation, but am open to other options.
Where I’m struggling is with the preprocessing of the data, especially with the CSVs.
Each CSV has a different set of headers and data structure. Some CSVs contain all the information in a single row, while others spread it over multiple rows. As you can tell, there are very specific rules for each data source based on its origin. I have a dict that maps all the header variations to the model fields. I filter this by source.
Currently, I'm loading the CSV into a Pandas data frame and using groupby to break the data up into blocks. I can then loop through the groups, modify the data based on its origin, and then assign the data to the appropriate columns to pass into a Pydantic BaseModel. After doing this, it seemed a bit pointless to be using Pydantic, as all the work was being done beforehand.
To make things more reusable, I thought of moving all the logic into the Pydantic BaseModel, passing the raw grouped data into a property, and processing it into the appropriate data elements. But this just seems wrong.
As with most problems, I’m sure this has been solved before. I’m looking for some guidance on appropriate patterns for this style of processing. All of the examples I’ve found to date are based on a single input format.
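One pattern that comes up repeatedly for this kind of problem is a small registry of per-source "adapter" functions, each of which turns one raw grouped block into the fields of a single shared Pydantic model, so Pydantic stays responsible only for validation and coercion. A rough sketch, with made-up source names and fields:

# Sketch of a per-source adapter registry feeding one common Pydantic model.
# The source names ("broker_a", "broker_b") and fields are illustrative only.
from datetime import date
from decimal import Decimal
from typing import Callable

import pandas as pd
from pydantic import BaseModel

class Transaction(BaseModel):          # the common data model, validated by Pydantic
    trade_date: date
    symbol: str
    amount: Decimal

Adapter = Callable[[pd.DataFrame], dict]
ADAPTERS: dict[str, Adapter] = {}

def adapter(source: str):
    def register(fn: Adapter) -> Adapter:
        ADAPTERS[source] = fn
        return fn
    return register

@adapter("broker_a")                   # single-row layout
def parse_broker_a(block: pd.DataFrame) -> dict:
    row = block.iloc[0]
    return {"trade_date": row["Date"], "symbol": row["Ticker"], "amount": row["Amount"]}

@adapter("broker_b")                   # multi-row layout: amounts are summed
def parse_broker_b(block: pd.DataFrame) -> dict:
    return {"trade_date": block["TradeDate"].iloc[0],
            "symbol": block["Sym"].iloc[0],
            "amount": block["Value"].sum()}

def load(source: str, grouped_blocks) -> list[Transaction]:
    parse = ADAPTERS[source]
    return [Transaction(**parse(block)) for _, block in grouped_blocks]

Each new CSV source then only needs a new adapter; the header-mapping dict can live inside the adapter for that source, and Transaction never needs to know where the data came from.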

Save periodically gathered data with python

I periodically receive data (every 15 minutes) and have it in an array (a NumPy array, to be precise) in Python that has roughly 50 columns; the number of rows varies, usually somewhere around 100-200.
Before, I only analyzed this data and tossed it, but now I'd like to start saving it, so that I can create statistics later.
I have considered saving it in a CSV file, but it did not seem right to me to save large numbers of such big 2D arrays to a CSV file.
I've looked at serialization options, particularly pickle and numpy's .tobytes(), but in both cases I run into an issue - I have to track the number of arrays stored. I've seen people write the number as the first thing in the file, but I don't know how I would be able to keep incrementing the number while having the file still open (the program that gathers the data runs practically non-stop). Constantly opening the file, reading the number, rewriting it, seeking to the end to write new data and closing the file again doesn't seem very efficient.
I feel like I'm missing some vital information and have not been able to find it. I'd love it if someone could show me something I cannot see and help me solve the problem.
Saving to a CSV file might not be a good idea in this case; think about the accessibility and availability of your data. Using a database would be better: you can easily update your data and control the amount of data you store.
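To make the database suggestion concrete, here is a minimal sketch using the sqlite3 module from the standard library; the file name, table name and fixed column count are assumptions, and each 15-minute batch is appended as rows tagged with its timestamp:

# Sketch: append each 15-minute batch of a ~50-column NumPy array to SQLite.
# The file name, table name and N_COLS are illustrative assumptions.
import sqlite3
from datetime import datetime, timezone

import numpy as np

N_COLS = 50
DB_PATH = "measurements.db"

def save_batch(batch: np.ndarray) -> None:
    cols = ", ".join(f"c{i} REAL" for i in range(N_COLS))
    placeholders = ", ".join("?" for _ in range(N_COLS + 1))
    stamp = datetime.now(timezone.utc).isoformat()

    conn = sqlite3.connect(DB_PATH)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(f"CREATE TABLE IF NOT EXISTS batches (ts TEXT, {cols})")
            conn.executemany(
                f"INSERT INTO batches VALUES ({placeholders})",
                [(stamp, *map(float, row)) for row in batch],
            )
    finally:
        conn.close()

This avoids having to track the number of stored arrays yourself (SELECT COUNT(DISTINCT ts) FROM batches recovers it later), and the writer never has to keep a file open between batches.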

HDF5 with Python, Pandas: Data Corruption and Read Errors

So I'm trying to store Pandas DataFrames in HDF5 and getting strange errors, rather inconsistently. At least half the time, some part of the read-process-move-write cycle fails, often with no clearer explanation than "HDF5 Read Error". Even worse, sometimes the table ends up with nonsense/corrupted data that doesn't stop things until downstream -- either values that are off by orders of magnitude (and not even correlated with the correct ones) or dates that don't make sense (recent data mismarked as being dated in the 1750s...etc).
I thought I'd go through the current process and then the things that I suspect might be causing problems, in case that helps. Here's what it looks like:
Read some of the tables (call them "QUERY1" and "QUERY2") to see if they're up to date, and if they aren't:
Take the table that had been in the HDF5 store as "QUERY1" and store it as "QUERY1_YYYY_MM_DD" in the HDF5 store instead
Run the associated query on the external database for that table. Each one is between 100 and 1500 columns of daily data back to 1980.
Store the result of query 1 as the new "QUERY1" in the HDF5 store
Compute several transformations of one or more of QUERY1, QUERY2,...QUERYn which will have hierarchical (Pandas MultiIndex) columns. Overwrite item "Derived_Frame1"...etc with its update/replacement in the HDF5 store
Multiple people with access to the relevant .h5 file on a Windows network drive run this routine -- potentially sometimes, but not usually, at the same time.
Some things I suspect could be part of the problem:
Using the default format (df.to_hdf(store, key)) instead of insisting on "table" format with df.to_hdf(store, key, format='table'). I do this because the default format is between 2 and 5x faster on both read and write according to %timeit
Using a network drive to allow several users to run this routine and access at least the derived frames. Not much I can do about this requirement, especially for read access to the derived dataframes at any time.
From the docs, it sounds like repeatedly deleting and re-writing an item in the HDF5 store can do weird things (at least gradually increasing the file size, not sure what else). Maybe I should be storing query archives in another file? Maybe I should be nuking and replacing the whole main file upon update?
Storing dataframes with MultiIndex columns in HDF5 in the first place -- this seems to be what gets me a "warning" under the default format, although it seems like the warning goes away if I use format='table'.
Edit: it is also possible/likely that different users running the routine above are using different versions of Pandas and different versions of PyTables.
Any ideas?
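For what it's worth, here is a minimal sketch of two of the mitigations touched on above: always writing in "table" format and serialising access across users with an advisory file lock. This assumes the third-party filelock package, and the path and keys are made up:

# Sketch: table-format HDF5 writes guarded by an advisory lock so two users
# on the network drive never touch the store at the same time.
# Assumes the third-party `filelock` package; the path and keys are illustrative.
import pandas as pd
from filelock import FileLock

STORE = r"\\share\data\frames.h5"      # hypothetical network path
LOCK = FileLock(STORE + ".lock")

def replace_key(df: pd.DataFrame, key: str) -> None:
    with LOCK:                         # blocks until no one else holds the lock
        df.to_hdf(STORE, key=key, format="table", mode="a")

def read_key(key: str) -> pd.DataFrame:
    with LOCK:
        return pd.read_hdf(STORE, key=key)

This doesn't rule out the mismatched pandas/PyTables versions mentioned in the edit, which are worth eliminating separately, but it removes concurrent writes as a variable.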

store large data python

I am new to Python. Recently, I got a project which processes a huge amount of health data in XML files.
Here is an example:
In my data there are about 100 of them, and each has a different id, origin, type and text. I want to store all of them so that I can train on this dataset. My first idea was to use 2D arrays (one storing id and origin, the other storing text). However, I found there are too many features, and I want to know which features belong to each document.
Could anyone recommend a good way to do this?
For scalability, simplicity and maintenance, you should normalise the data, build a database schema and move it into a database (SQLite, Postgres, MySQL, whatever).
This moves the complicated data logic out of Python, which is typical Model-View-Controller practice.
Creating a Python dictionary and traversing it is quick and dirty; it will become a huge technical time sink very soon if you want to make practical sense of the data.
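As a rough illustration of that advice, here is a sketch that parses the documents with the standard-library ElementTree and loads them into SQLite. The tag and attribute names (document, id, origin, type, text) are guesses, since the example XML isn't shown:

# Sketch: load XML health documents into a SQLite table.
# The element/attribute names below are guesses; adapt them to the real schema.
import sqlite3
import xml.etree.ElementTree as ET

def load_documents(xml_path: str, db_path: str = "health.db") -> None:
    conn = sqlite3.connect(db_path)
    try:
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS documents ("
                "id TEXT PRIMARY KEY, origin TEXT, type TEXT, text TEXT)"
            )
            root = ET.parse(xml_path).getroot()
            rows = [
                (doc.get("id"), doc.get("origin"), doc.get("type"),
                 (doc.findtext("text") or "").strip())
                for doc in root.iter("document")   # assumed element name
            ]
            conn.executemany(
                "INSERT OR REPLACE INTO documents VALUES (?, ?, ?, ?)", rows
            )
    finally:
        conn.close()

Once the documents are in a table, knowing which features belong to which document is a plain SQL query rather than bookkeeping across parallel arrays.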

splitting gtfs transit data into smaller ones

I sometimes have a very large GTFS zip file, valid for a period of 6 months, but loading such a big file is not economical on a low-resource EC2 server (for example, 2 GB of memory and 10 GB of disk).
I hope to be able to split this large GTFS file into 3 smaller GTFS zip files, each with 2 months (6 months / 3 files) of valid data; of course that means I will need to replace the data every 2 months.
I have found a Python program that achieves the opposite goal (merge) here: https://github.com/google/transitfeed/blob/master/merge.py (this is a very good Python project, by the way).
I am very thankful for any pointers.
Best regards,
Dunn.
It's worth noting that entries in stop_times.txt are usually the biggest memory hog when it comes to loading a GTFS feed. Since most systems do not replicate trips+stop_times for the dates when those trips are active, reducing the service calendar probably won't save you much.
That said, there are some tools for slicing and dicing GTFS. Check out the OneBusAway GTFS Transformer tool, for example:
http://developer.onebusaway.org/modules/onebusaway-gtfs-modules/1.3.3/onebusaway-gtfs-transformer-cli.html
Another, more recent option for processing large GTFS files is transitland-lib. It's written in the Go programming language, which is quite efficient at parsing huge GTFS feeds.
See the transitland extract command, which can take a number of arguments to cut an existing GTFS feed down to smaller size:
% transitland extract --help
Usage: extract <input> <output>
-allow-entity-errors
Allow entities with errors to be copied
-allow-reference-errors
Allow entities with reference errors to be copied
-create
Create a basic database schema if none exists
-create-missing-shapes
Create missing Shapes from Trip stop-to-stop geometries
-ext value
Include GTFS Extension
-extract-agency value
Extract Agency
-extract-calendar value
Extract Calendar
-extract-route value
Extract Route
-extract-route-type value
Extract Routes matching route_type
-extract-stop value
Extract Stop
-extract-trip value
Extract Trip
-fvid int
Specify FeedVersionID when writing to a database
-interpolate-stop-times
Interpolate missing StopTime arrival/departure values
-normalize-service-ids
Create Calendar entities for CalendarDate service_id's
-set value
Set values on output; format is filename,id,key,value
-use-basic-route-types
Collapse extended route_type's into basic GTFS values
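If you would rather stay in Python, here is a rough pandas sketch of the naive date-window slice, with the caveat above that it may not shrink stop_times.txt much if most trips run for the whole period. The directory layout is an assumption, and files not shown (stops.txt, routes.txt, agency.txt, shapes.txt, ...) would be copied through unchanged:

# Sketch: naively slice a GTFS feed to one date window with pandas.
# Caveat: if most trips are active for the whole 6 months, stop_times.txt
# will barely shrink. The directory layout is an assumption.
import pandas as pd

def slice_gtfs(src: str, dst: str, start: str, end: str) -> None:
    """Keep only service active between start and end (YYYYMMDD strings)."""
    cal = pd.read_csv(f"{src}/calendar.txt", dtype=str)
    cal = cal[(cal["start_date"] <= end) & (cal["end_date"] >= start)].copy()
    cal.loc[cal["start_date"] < start, "start_date"] = start   # clip to window
    cal.loc[cal["end_date"] > end, "end_date"] = end

    dates = pd.read_csv(f"{src}/calendar_dates.txt", dtype=str)
    dates = dates[(dates["date"] >= start) & (dates["date"] <= end)]

    keep_services = set(cal["service_id"]) | set(dates["service_id"])

    trips = pd.read_csv(f"{src}/trips.txt", dtype=str)
    trips = trips[trips["service_id"].isin(keep_services)]

    stop_times = pd.read_csv(f"{src}/stop_times.txt", dtype=str)
    stop_times = stop_times[stop_times["trip_id"].isin(set(trips["trip_id"]))]

    for name, df in [("calendar", cal), ("calendar_dates", dates),
                     ("trips", trips), ("stop_times", stop_times)]:
        df.to_csv(f"{dst}/{name}.txt", index=False)

# e.g. slice_gtfs("gtfs_full", "gtfs_jan_feb", "20240101", "20240229")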
