Splitting a large GTFS transit feed into smaller ones - Python

I sometimes have a very large GTFS zip file - valid for a period of 6 months - but it is not economical to load that much data onto a low-resource EC2 server (for example, 2 GB of memory and 10 GB of disk).
I would like to split this large GTFS feed into 3 smaller GTFS zip files, each covering 2 months (6 months / 3 files) of valid service; of course that means I will need to replace the data every 2 months.
I have found a Python program that achieves the opposite goal, MERGE, here: https://github.com/google/transitfeed/blob/master/merge.py (this is a very good Python project, by the way).
I would be very thankful for any pointers.
Best regards,
Dunn.

It's worth noting that entries in stop_times.txt are usually the biggest memory hog when loading a GTFS feed. Since most feeds do not duplicate trips and stop_times for each date on which those trips run, reducing the service calendar probably won't save you much.
That said, there are some tools for slicing and dicing GTFS. Check out the OneBusAway GTFS Transformer tool, for example:
http://developer.onebusaway.org/modules/onebusaway-gtfs-modules/1.3.3/onebusaway-gtfs-transformer-cli.html
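If you do want to experiment with cutting a feed down to a date window yourself, here is a minimal, hypothetical sketch in Python with pandas (the slice_gtfs function name is made up; it assumes service is defined in calendar.txt and skips calendar_dates.txt and the other feed files for brevity):

# Hypothetical sketch: keep only services, trips and stop_times active between
# start and end (both YYYYMMDD strings, which compare chronologically as plain
# strings). calendar_dates.txt handling and re-zipping are omitted for brevity.
import pandas as pd

def slice_gtfs(gtfs_dir, out_dir, start, end):
    cal = pd.read_csv(f"{gtfs_dir}/calendar.txt", dtype=str)
    # Keep services whose validity window overlaps the requested period,
    # then clamp their start/end dates to that period.
    cal = cal[(cal["start_date"] <= end) & (cal["end_date"] >= start)].copy()
    cal["start_date"] = cal["start_date"].where(cal["start_date"] >= start, start)
    cal["end_date"] = cal["end_date"].where(cal["end_date"] <= end, end)

    trips = pd.read_csv(f"{gtfs_dir}/trips.txt", dtype=str)
    trips = trips[trips["service_id"].isin(cal["service_id"])]

    stop_times = pd.read_csv(f"{gtfs_dir}/stop_times.txt", dtype=str)
    stop_times = stop_times[stop_times["trip_id"].isin(trips["trip_id"])]

    cal.to_csv(f"{out_dir}/calendar.txt", index=False)
    trips.to_csv(f"{out_dir}/trips.txt", index=False)
    stop_times.to_csv(f"{out_dir}/stop_times.txt", index=False)
    # routes.txt, stops.txt, etc. are usually small and can be copied unchanged.

Note that, per the point above, this only shrinks stop_times.txt to the extent that some trips run exclusively outside the chosen window; the dedicated tools mentioned here do the same job more carefully.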

Another, more recent option for processing large GTFS files is transitland-lib. It's written in the Go programming language, which is quite efficient at parsing huge GTFS feeds.
See the transitland extract command, which takes a number of arguments to cut an existing GTFS feed down to a smaller size:
% transitland extract --help
Usage: extract <input> <output>
  -allow-entity-errors
        Allow entities with errors to be copied
  -allow-reference-errors
        Allow entities with reference errors to be copied
  -create
        Create a basic database schema if none exists
  -create-missing-shapes
        Create missing Shapes from Trip stop-to-stop geometries
  -ext value
        Include GTFS Extension
  -extract-agency value
        Extract Agency
  -extract-calendar value
        Extract Calendar
  -extract-route value
        Extract Route
  -extract-route-type value
        Extract Routes matching route_type
  -extract-stop value
        Extract Stop
  -extract-trip value
        Extract Trip
  -fvid int
        Specify FeedVersionID when writing to a database
  -interpolate-stop-times
        Interpolate missing StopTime arrival/departure values
  -normalize-service-ids
        Create Calendar entities for CalendarDate service_id's
  -set value
        Set values on output; format is filename,id,key,value
  -use-basic-route-types
        Collapse extended route_type's into basic GTFS values

Related

Is there a Pandas feature for streaming to / from a large binary source fast, instead of CSV or JSON? Or is there another tool for it?

JSON isn't a particularly efficient way to store data, in terms of both byte overhead and parsing cost: it has to be parsed syntactically from the start rather than letting you seek to a specific segment. Say you have 20 years of timestep data, roughly 1 TB compressed, and you want to store it efficiently and load/save it as fast as possible for maximum-speed simulation.
At first I tried relational databases, but those are actually not that fast: they're designed to be accessed over a network rather than locally, and the network stack adds overhead.
I was able to speed this up by creating a custom binary format with defined block sizes and header indexes, a bit like a file system, but this was time-consuming and highly specialized for a single type of data (fixed-length records, for example). Editing the data wasn't a feature; it was a one-time export that took days to run. I'm sure some library could do it better.
I learned about Pandas, but it seems to be used mostly with CSV and JSON, and both of those are plain text, so storing an int takes the space of several characters rather than letting me choose, say, a 32-bit unsigned int.
What's the right tool? Can Pandas do this, or is there something better?
I need to be able to specify the data type of each stored property, so if I only need a 16-bit int, that's the space that gets used.
I need to be able to stream reads and writes over big (1-10 TB) data as fast as the hardware fundamentally allows.
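For what it's worth, one common way to get typed, compact, binary storage out of pandas is a columnar format such as Parquet (or HDF5). A minimal sketch, assuming pyarrow is installed and using made-up column names:

# Minimal sketch: typed, binary, columnar storage with pandas + Parquet.
# Assumes pyarrow is installed; column names and sizes are illustrative only.
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "timestep": np.arange(n, dtype=np.uint32),       # 32-bit unsigned int
    "sensor_id": np.zeros(n, dtype=np.uint16),       # only 16 bits stored per value
    "value": np.random.rand(n).astype(np.float32),   # 32-bit float instead of text
})

# One compressed, typed, binary file per chunk of the dataset.
df.to_parquet("chunk_000.parquet", engine="pyarrow", compression="zstd")

# Read back only the columns you need; dtypes are preserved.
loaded = pd.read_parquet("chunk_000.parquet", columns=["timestep", "value"])
print(loaded.dtypes)

If the records are fixed-length and access is purely sequential, numpy.memmap over a raw binary file is another option that keeps explicit dtypes and streams at close to disk speed.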

Do data extracts need to be timestamped?

Should I timestamp my data extracts?
A few colleagues and I work together on a Python server to solve a data-science problem. I wrote a few functions to extract my data from my source database and save it to the Python server for further processing. Now I'm struggling with whether I should save each extract with a timestamp, with the result that every time I run my pipeline another extract is saved, or omit the timestamp and overwrite the old extract. I have read a lot about data not needing the same kind of version control as code does, and I don't really want to clutter the server with multiple, largely redundant data extracts.
Is the change of a feature over time important to your data-science problem?
Do you have any metrics which could tell a story if measured over time?
Perhaps you can store the delta since the last data pull instead of redundant features (feature-engineering on a different table).
Just a couple of thoughts. Good luck :)
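If it helps, here is a small, hypothetical sketch of the two options being weighed (the file names and the "id" key column are assumptions):

# Hypothetical sketch: keep every extract with a timestamp, or overwrite a
# single extract and keep only the delta since the last pull. File names and
# the "id" key column are illustrative assumptions.
from datetime import datetime, timezone
from pathlib import Path
import pandas as pd

def save_timestamped(df, out_dir="extracts"):
    """Option 1: keep every extract, named by pull time."""
    Path(out_dir).mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(out_dir) / f"extract_{stamp}.csv"
    df.to_csv(path, index=False)
    return path

def save_delta(df, current_path="extract.csv", key="id"):
    """Option 2: overwrite the single current extract, but keep new rows as a delta."""
    previous = pd.read_csv(current_path)
    delta = df[~df[key].isin(previous[key])]
    df.to_csv(current_path, index=False)   # refresh the one "current" extract
    return delta                           # rows added since the last pull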

Convert CSV table to Redis data structures

I am looking for a method/data structure to implement an evaluation system for a binary matcher used for verification.
This system will be distributed over several PCs.
Basic idea is described in many places over the internet, for example, in this document: https://precisebiometrics.com/wp-content/uploads/2014/11/White-Paper-Understanding-Biometric-Performance-Evaluation.pdf
The matcher that I am testing takes two data items as input and calculates a matching score that reflects their similarity (a threshold will then be chosen, depending on the false match / false non-match rates).
Currently I store the matching scores along with labels in a CSV file, like the following:
label1, label2, genuine, 0.1
label1, label4, genuine, 0.2
...
label_2, label_n+1, impostor, 0.8
label_2, label_n+3, impostor, 0.9
...
label_m, label_m+k, genuine, 0.3
...
(I've got a labeled data base)
Then I run a Python script that loads this table into a Pandas DataFrame and calculates the FMR/FNMR curve, similar to the one shown in figure 2 of the link above. The processing is rather simple: sort the DataFrame, scan the rows from top to bottom, and count the impostors/genuines above and below each row.
The system should also support finding outliers, to help improve the matching algorithm (labels of pairs of data items that produced abnormally large genuine scores or abnormally small impostor scores). This is also easy with DataFrames (just sort and take the head rows).
Now I'm thinking about how to store the comparison data in RAM instead of CSV files on HDD.
I am considering Redis in this regard: the amount of data is large, several PCs are involved in the computations, and Redis has master-slave replication that lets it quickly sync data over the network, so that several PCs have exact clones of the data.
It is also free.
However, Redis does not seem to me to be a particularly good fit for storing such tabular data.
Therefore, I need to change the data structures and the algorithms for processing them.
However, it is not obvious to me how to translate this table into Redis data structures.
Another option would be using some other data storage system instead of Redis. However, I am unaware of such systems and will be grateful for suggestions.
You need to learn more about Redis to solve your challenges. I recommend you give https://try.redis.io a try and then think about your questions.
TL;DR - Redis isn't a "tabular data" store, it is a store for data structures. It is up to you to use the data structure(s) that serve your queries in the most optimal way.
IMO what you want to do is actually keep the large data (how big is it anyway?) on slower storage and just store the model (FMR curve computations? Outliers?) in Redis. This can almost certainly be done with the existing core data structures (probably Hashes and Sorted Sets in this case), but perhaps even more optimally with the new Modules API. See the redis-ml module as an example of serving machine learning models off Redis (and perhaps your use case would be a nice addition to it ;))
Disclaimer: I work at Redis Labs, home of open source Redis and provider of commercial solutions that leverage it, including the above-mentioned module (open source, AGPL licensed).
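To make the Sorted Set suggestion concrete, here is a minimal, hypothetical sketch with the redis-py client (the key names and the "label_a|label_b" member format are assumptions, not part of the answer):

# Minimal sketch: matching scores in Redis Sorted Sets, one set per class.
# Key names and the "label_a|label_b" member format are illustrative only.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def add_score(label_a, label_b, kind, score):
    # kind is "genuine" or "impostor"; the matching score becomes the set score
    r.zadd(f"scores:{kind}", {f"{label_a}|{label_b}": score})

add_score("label1", "label2", "genuine", 0.1)
add_score("label_2", "label_n+1", "impostor", 0.8)

# Outliers as described in the question fall straight out of the sorted order:
# genuine pairs with abnormally large scores ...
print(r.zrevrange("scores:genuine", 0, 9, withscores=True))
# ... and impostor pairs with abnormally small scores.
print(r.zrange("scores:impostor", 0, 9, withscores=True))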

I'm looking for a means of hash coding audio speech files for comparison via SQL

I've been developing a tool to compare an audio file recorded on day one to another recorded thirty days later. My training is in linguistics and this tool will be used to catalogue, index, and compare a database of unique vocal recordings. I am aware of commercial grade APIs such as MusicBrainz or EchoNest, but cannot use them for this project. All the files must be locally stored and cannot be contributed to an online database.
At present, I have spectrograms of each file and a batch converter that can convert to almost any sound format. I use a spectrum analyzer to match the spectrograms exactly (like a hash-map overlay) and can match my results with 96% accuracy. However, as the project grows, the storage requirements of this method will become far too large.
My thought is this: if I can adjust the audio files to a similar frame rate, I should be able to hash the acoustic data and store the hash strings in a simple SQL table rather than whole audio files or spectrograms. I don't want to hash the whole file, just the acoustics, for matching. I've found a few overkill solutions in Python (dejavu, libmo, etc.), but as a linguist, not a computer person, I am unsure whether a novice can wrangle the code for hashing audio data.
I'm looking to have a way to create hash values (or another checksum) within the next week or so. Thoughts from the interwebz?
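In case it is useful as a starting point, here is a small, hypothetical sketch of a coarse acoustic fingerprint in Python: it shrinks the spectrogram to a fixed binary grid and hashes that, rather than the raw audio (scipy and numpy are assumed; the band/frame counts and hash choice are illustrative, not tuned for speech):

# Hypothetical sketch: reduce a spectrogram to a fixed bands x frames binary
# grid and hash it. Parameters are illustrative, not tuned for speech.
import hashlib
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def acoustic_hash(path, bands=32, frames=64):
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:                          # mix stereo down to mono
        samples = samples.mean(axis=1)
    _, _, sxx = spectrogram(samples, fs=rate)
    bands = min(bands, sxx.shape[0])              # guard against very short clips
    frames = min(frames, sxx.shape[1])
    # Average the spectrogram into a fixed grid so that recordings of
    # different lengths produce comparable fingerprints.
    f_idx = np.linspace(0, sxx.shape[0], bands + 1, dtype=int)
    t_idx = np.linspace(0, sxx.shape[1], frames + 1, dtype=int)
    grid = np.array([[sxx[f_idx[i]:f_idx[i+1], t_idx[j]:t_idx[j+1]].mean()
                      for j in range(frames)] for i in range(bands)])
    bits = (grid > np.median(grid)).astype(np.uint8)    # binarize against the median
    return hashlib.sha1(bits.tobytes()).hexdigest()     # store this string in SQL

One caveat: a digest like this only matches on exact equality, so for "same speaker, thirty days later" comparisons you would more likely store the binary grid itself and compare grids by Hamming distance, keeping the digest only as an index key.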

Storing large data in Python

I am new to Python. Recently, I got a project that processes a huge amount of health data in XML files.
Here is an example:
In my data there are about 100 such documents, and each of them has a different id, origin, type and text. I want to store all of them so that I can train on this dataset. My first idea was to use 2D arrays (one storing id and origin, the other storing text). However, I found there are too many features, and I want to know which features belong to each document.
Could anyone recommend the best way to do it?
For scalability, simplicity and maintenance, you should normalise the data, build a database schema and move it into a database (SQLite, Postgres, MySQL, whatever).
This moves complicated data logic out of Python, which is typical Model-View-Controller practice.
Creating a Python dictionary and traversing it is quick and dirty; it will become a huge technical time sink very soon if you want to make practical sense of the data.
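As a concrete starting point for that suggestion, here is a minimal sketch using only Python's built-in sqlite3 module (the table and column names simply mirror the fields mentioned in the question):

# Minimal sketch: put the documents into SQLite using only the standard library.
# The schema mirrors the fields named in the question (id, origin, type, text).
import sqlite3

conn = sqlite3.connect("health_data.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id     TEXT PRIMARY KEY,
        origin TEXT,
        type   TEXT,
        text   TEXT
    )
""")

# docs would come from your XML parsing step (e.g. xml.etree.ElementTree),
# as a list of dicts; the values here are placeholders.
docs = [
    {"id": "doc-001", "origin": "hospital-a", "type": "report", "text": "..."},
    {"id": "doc-002", "origin": "clinic-b", "type": "note", "text": "..."},
]
conn.executemany(
    "INSERT OR REPLACE INTO documents (id, origin, type, text) "
    "VALUES (:id, :origin, :type, :text)",
    docs,
)
conn.commit()

# Later, pull back exactly the features that belong to one document.
print(conn.execute("SELECT origin, type, text FROM documents WHERE id = ?", ("doc-001",)).fetchone())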
