I am storing data (JSON) in blobs in Azure, capturing it hourly to create relatively small JSON documents. Between backup times, I may produce tens or hundreds (unlikely thousands, but possible) of these documents that I then want to back up into blobs and organise by year, month, day, and hour.
Two approaches I came up with are:
Making the hour a folder and storing a separate blob for every backup within it
Making each hour its own blob under the day's folder and appending all new documents to that blob so they are stored together
The access pattern will usually be somewhat frequent reads for a while, after which the blobs are moved into the cool/archive tier as they age.
My question is: should I be favouring one method over the other for best practice, resource, or logical reasons, or is it basically personal preference with negligible performance hits? I'm especially interested in any resource differences in terms of reads and writes as I couldn't find or work out any useful information about that.
I'm also curious whether there are any access benefits to the append method in particular (the trade-off being that you have to make sure you don't mess up the blob as you append to it), since the per-hour data is always stored together in the same file, and how nicely one method or the other fits with how the Python SDK is architected.
For this scenario I am using Python and making use of the Azure Python SDK packages.
Any other suggestions/methods also very welcome. Thanks.
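For concreteness, here's a minimal sketch of the two naming schemes as pure path-building functions. The `year/month/day/hour` layout and `.json` suffix are taken from the question; everything else (function names, the backup-id parameter) is my invention. In Azure, the `/` separators simply create virtual folders within the container:

```python
from datetime import datetime, timezone

def per_backup_blob_name(captured_at: datetime, backup_id: str) -> str:
    # Approach 1: the hour is a virtual folder; each backup is its own blob.
    return captured_at.strftime("%Y/%m/%d/%H") + f"/{backup_id}.json"

def per_hour_blob_name(captured_at: datetime) -> str:
    # Approach 2: one blob per hour under the day's folder; every new
    # document gets appended to this single blob.
    return captured_at.strftime("%Y/%m/%d") + f"/{captured_at:%H}.json"

ts = datetime(2023, 5, 17, 14, 30, tzinfo=timezone.utc)
print(per_backup_blob_name(ts, "backup-0001"))  # 2023/05/17/14/backup-0001.json
print(per_hour_blob_name(ts))                   # 2023/05/17/14.json
```

With the azure-storage-blob v12 SDK, the first scheme pairs naturally with `container_client.upload_blob(name, data)` (one block blob per backup), while the second pairs with an append blob, created once via `blob_client.create_append_blob()` and extended with `blob_client.append_block(data)`. One constraint to keep in mind for the append route: an append blob holds at most 50,000 blocks.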
If the read/write requirements are low, then it won't matter; if you need high throughput, you might opt not to name your files this way.
Take a look at this, specifically the partitioning section.
Performance and scalability checklist for Blob storage - Azure Storage | Microsoft Learn
Additional information: note that "relatively small" and "somewhat frequent" mean different things to different people. Some users might interpret them to mean < 1 KB and several times an hour, while someone else might interpret them to mean < 1 MB and several times a second (or even several times a millisecond). If the former, there is nothing to worry about.
If you still have any questions about performance, I would recommend contacting support.
I'm really new to APM & Kibana, but comfortable with Python & Elasticsearch. Before, I had Graphite, and it was quite easy to do custom tracking.
I'm looking to track 3 simple custom metrics and their evolution over time.
CounterName and its value. For example queue_size: 23, sent by any of the workers. What happens when different workers send different values? (Because of timing, the value might increase/decrease rapidly.)
I have 20 queue names to track. Should I put them all under a service_name, or should I use labels?
Before I used:
self._graphite.gauge("service.queuesize", 3322)
No idea what to have here now:
....
Time spent within a method. I saw here it's possible to have a context manager.
Before I had:
with self._graphite.timer("service.action"):
Will become
with elasticapm.capture_span('service.action'):
Number of requests. (only count no other tracking)
Before I had
self._graphite.incr("service.incoming_requests")
Is this correct?
client.begin_transaction('processors')
client.end_transaction('processors')
...
Thanks a lot!
You can add a couple of different types of metadata to your events in APM. Since it sounds like you want to be able to search/dashboard/aggregate over these counters, you probably want labels, using elasticapm.label().
elasticapm.capture_span is indeed the correct tool here. Note that it can be used either as a function decorator, or as a context manager.
Transactions are indeed the best way to keep track of request volume. If you're using one of the supported frameworks these transactions will be created automatically, so you don't have to deal with keeping track of the Client object or starting the transactions yourself.
I have huge data stored in Cassandra and I want to process it using Spark through Python.
I just want to know how to connect Spark and Cassandra through Python.
I have seen people using sc.cassandraTable, but it isn't working, and fetching all the data at once from Cassandra and then feeding it to Spark doesn't make sense.
Any suggestions?
Have you tried the examples in the documentation?
Spark Cassandra Connector Python Documentation
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .getOrCreate()
spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="kv", keyspace="test") \
    .load().show()
I'll just give my "short" 2 cents. The official docs are totally fine for getting started. You might want to specify why this isn't working, i.e. did you run out of memory (perhaps you just need to increase the driver memory), or is there some specific error causing your example not to work? It would also be nice if you provided that example.
Here are some opinions/experiences of mine. Usually (not always, but most of the time) you have multiple rows per partition. You don't always have to load all the data in a table, and you can more or less keep the processing within a single partition most of the time. Since the data is sorted within a partition, this usually goes pretty fast and hasn't presented any significant problem.
If you don't want the whole store-in-Cassandra-then-fetch-to-Spark cycle for your processing, there really are a lot of solutions out there. Basically that would be Quora material. Here are some of the more common ones:
Do the processing in your application right away - this might require some sort of inter-instance communication framework like Hazelcast, or even better an Akka cluster; this is a really wide topic
Spark Streaming - do your processing right away in micro-batches and flush the results to some persistence layer for reading - which might be Cassandra
Apache Flink - use a proper streaming solution and periodically flush the state of the process to e.g. Cassandra
Store the data in Cassandra the way it's supposed to be read - this approach is the most advisable (though it's hard to say with the info you provided)
The list could go on and on: user-defined functions in Cassandra, or aggregate functions if your task is something simpler.
It might also be a good idea to provide some details about your use case. More or less, what I've said here is pretty general and vague, but then again putting it all into a comment just wouldn't make sense.
I'm struggling with how to store some telemetry streams. I've played with a number of things, and I find myself feeling like I'm at a writer's block.
Problem Description
Via a UDP connection, I receive telemetry from different sources. Each source is decomposed into a set of devices. And for each device there's at most 5 different value types I want to store. They come in no faster than once per minute, and may be sparse. The values are transmitted with a hybrid edge/level triggered scheme (send data for a value when it is either different enough or enough time has passed). So it's a 2 or 3 level hierarchy, with a dictionary of time series.
The thing I want to do most with the data is a) access the latest values and b) enumerate the timespans (begin/end/value). I don't really care about a lot of "correlations" between data. It's not the case that I want to compute averages, or correlate between streams. Generally, I look at the latest value for a given type, across all or some hierarchy-derived subset. Or I focus on one value stream and enumerate the spans.
I'm not a database expert at all. In fact I know very little. And my three colleagues aren't either. I do Python (and want whatever I do to be Python 3), so I'd like whatever we do to be as approachable as possible. I'm currently trying to do development on Linux Mint. I don't care much about ACID and all that.
What I've Done So Far
Our first version of this used the Gemstone Smalltalk database. Building a specialized Timeseries object worked like a charm. I've done a lot of Smalltalk, but my colleagues haven't, and the Gemstone system is NOT just a "jump in and be happy right away". And we want to move away from Smalltalk (though I wish the marketplace made it otherwise). So that's out.
Played with RRD (Round Robin Database). A novel approach, but we don't need the compression that badly, and since our capture is edge-triggered, RRD doesn't fit our data capture model well.
A friend talked me into using sqlite3. I may try this again. My first attempt didn't work out so well; I may have been trying to be too clever. I was trying to do things the "normalized" way. I got something working OK at first, but getting the "latest" value of a given field for a subset of devices required some hairy (for me) SQL, and the speed of doing so was kind of disappointing. So it turned out I'd need to learn about indexing too. I found I was getting into a hole I didn't want to be in, headed right back to where we were with the Smalltalk DB: a lot of specialized knowledge, with me the only person who could work with it.
I thought I'd go the "roll your own" route. My data is not HUGE, disk is cheap, and I know real well how to read/write files. And aren't filesystems hierarchical databases anyway? I'm sure "people in the know" are rolling their eyes at this primitive approach, but it was the most approachable. With a little bit of Python code, I used directories for my structuring and a two-file scheme for each value (one file for the latest value, and an append log for the rest of the values). This has worked OK, but I'd rather not be liable for the wrinkles I haven't quite worked out yet, and there's as much code involved in serializing the data to/from disk (just simple strings right now) as anything else. One nice thing about this approach is that while I can write Python scripts to analyze the data, some things can be done just fine with classic command-line tools. E.g., a simple query to show all the latest rssi values:
ls Telemetry/*/*/rssi | xargs cat
I spent this morning looking at alternatives. Browsed the NoSQL sites. Read up on PyTables. Scanned the ZODB tutorial. PyTables looks very suited to what I'm after: a hierarchy of named tables modeling timeseries. But I don't think PyTables works with Python 3 yet (at least, there is no Debian/Ubuntu package for Python 3 yet). Ditto for ZODB. And I'm afraid I don't know enough about what the many different NoSQL databases do to even take a stab at one.
Plea for Ideas
I find myself more bewildered and confused than at the start. I was probably too naive in thinking I'd find something a little more "fire and forget" and be past this point by now. Any advice and direction would be hugely appreciated. If someone can give me a recipe that meets my needs without huge amounts of overhead/education/ingress, I'd mark that as the answer for sure.
Ok, I'm going to take a stab at this.
We use Elasticsearch for a lot of our unstructured data: http://www.elasticsearch.org/. I'm no expert on this subject, but in my day-to-day I rely on the indices a lot. Basically, you post JSON objects to the index, which lives on some server. You can query the index via the URL, or by posting a JSON object to the appropriate place. I use pyelasticsearch to connect to the indices; that package is well-documented, and the main class you use is thread-safe.
The query language is pretty robust itself, but you could just as easily add a "latest time" field to the records before you post them to the index.
Anyway, I don't feel that this deserves a check mark (even if you go that route), but it was too long for a comment.
What you describe fits the relational database model (e.g., sqlite3).
Keep one table.
id, device_id, valuetype1, valuetype2, valuetype3, ... ,valuetypen, timestamp
I assume all devices are of the same type (i.e., have the same set of values that you care about). If they do not, consider simply setting value = NULL when a column doesn't apply to a specific device type.
Each time you get an update, duplicate the device's last row, overwriting the value that changed:
INSERT INTO DeviceValueTable (device_id, valuetype1, valuetype2,..., timestamp)
SELECT device_id, valuetype1, #new_value, ...., NOW()
FROM DeviceValueTable
WHERE device_id = #device_id
ORDER BY timestamp DESC
LIMIT 1;
To get the latest values for a specific device:
SELECT *
FROM DeviceValueTable
WHERE device_id = #device_id
ORDER BY timestamp DESC
LIMIT 1;
To get the latest values for all devices:
select
a.*
from
DeviceValueTable a
inner join
(select device_id, max(timestamp) as newest
from DeviceValueTable group by device_id) as b on
a.device_id = b.device_id and a.timestamp = b.newest
You might be worried about the cost (size) of storing the duplicate values. Rely on the database to handle compression.
Also, keep in mind simplicity over optimization. Make it work, then if it's too slow, find and fix the slowness.
Note, these queries were not tested on sqlite3 and may contain typos.
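Since sqlite3 ships with Python, the whole scheme (duplicate-last-row inserts plus a latest-per-device join) can be sketched in a few lines. The two value columns here are placeholders for your real value types:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE DeviceValueTable (
    id INTEGER PRIMARY KEY,
    device_id TEXT,
    rssi REAL,        -- one column per value type you care about
    temperature REAL,
    timestamp REAL)""")

def update_value(device_id, column, value, ts):
    # Duplicate the device's last row, then overwrite the changed column.
    row = conn.execute(
        "SELECT rssi, temperature FROM DeviceValueTable "
        "WHERE device_id = ? ORDER BY timestamp DESC LIMIT 1",
        (device_id,)).fetchone() or (None, None)
    vals = dict(zip(("rssi", "temperature"), row))
    vals[column] = value
    conn.execute(
        "INSERT INTO DeviceValueTable (device_id, rssi, temperature, timestamp) "
        "VALUES (?, ?, ?, ?)",
        (device_id, vals["rssi"], vals["temperature"], ts))

def latest_all():
    # Latest row per device: join each device against its max timestamp.
    return conn.execute(
        "SELECT a.* FROM DeviceValueTable a "
        "JOIN (SELECT device_id, MAX(timestamp) AS newest "
        "      FROM DeviceValueTable GROUP BY device_id) b "
        "ON a.device_id = b.device_id AND a.timestamp = b.newest").fetchall()
```

An index on (device_id, timestamp) is the one piece of tuning this layout eventually wants, since both queries order and filter on exactly that pair.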
It sounds to me like you want an on-disk, implicitly sorted data structure like a B-tree or similar.
Maybe check out:
http://liw.fi/larch/
http://www.egenix.com/products/python/mxBase/mxBeeBase/
Your issue isn't technical; it's a poor problem specification.
If you are doing anything with sensor data then the old laboratory maxim applies "If you don't write it down, it didn't happen". In the lab, that means a notebook and pen, on a computer it means ACID.
You also seem to be prematurely optimizing the solution, which is well known to be the root of all evil. You don't say what size the data is, but you do say it "comes no faster than once per minute, and may be sparse". Assuming an even 1.0 KB per minute, that's about 1.4 MB/day, or roughly 0.5 GB/year. My StupidPhone has more storage than you would need in a year, and my laptop has sneezes that are larger.
The biggest problem is that you claim to "know very little" about databases, and that is the crux of the matter. Your data is standard old 1950s data-processing boring. You're jumping into buzzword storage technologies when SQLite would do everything you need if only you knew how to ask it. Given that you've got the Smalltalk DB down, I'd be quite surprised if it took more than a day's study to learn all the conventional RDBMS principles you need and then some.
After that, you'd be able to write a question that can be answered in more than generalities.
I'm working with two databases: a local version and the version on the server. The server holds the most up-to-date data, and instead of recopying all values in all tables from the server to my local version,
I would like to go through each table and only insert/update the values that have changed on the server, copying those values to my local version.
Is there some simple method of handling such a case? Some sort of batch insert/update? Googling for the answer isn't working, and I've tried my hand at coding one but am starting to get tied up in error handling.
I'm using Python and MySQLdb. Thanks for any insight.
Steve
If all of your tables' records had timestamps, you could identify "the values that have changed on the server" -- otherwise, it's not clear how you plan to do that part (which has nothing to do with insert or update; it's a question of "selecting things right").
Once you have all the important values, somecursor.executemany will let you apply them all as a batch. Depending on your indexing, it may be faster to put them into a non-indexed auxiliary temporary table, then insert/update from that table into the real one (before dropping the aux/temp table); the latter is, of course, a single somecursor.execute.
You can reduce the wall-clock time for the whole job by using one (or a few) threads to do the selects and put the results onto a Queue.Queue, and a few worker threads to apply results plucked from the queue to the local server. (The best balance of reading vs. writing threads is found by trying a few configurations and measuring; writing per se is slower than reading, but your bandwidth to your local server may be higher than to the remote one, so it's difficult to predict.)
However, all of this is moot unless you do have a strategy to identify "the values that have changed in the server", so it's not necessarily very useful to enter into more discussion about details "downstream" from that identification.
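Assuming you do have timestamps, the batch-apply step might look like the sketch below. I'm using sqlite3 so the example is self-contained; with MySQLdb the cursor's `executemany` has the same shape, but the placeholder is `%s` and the upsert is spelled `INSERT ... ON DUPLICATE KEY UPDATE` instead of sqlite's `ON CONFLICT` clause. The table and column names are made up:

```python
import sqlite3

local = sqlite3.connect(":memory:")
local.execute("""CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    name TEXT,
    updated_at REAL)""")
local.execute("INSERT INTO items VALUES (1, 'old-name', 100.0)")

# Rows selected from the server because their timestamp is newer than the
# last sync; the SELECT side is the part that actually needs timestamps.
changed_rows = [(1, "new-name", 200.0), (2, "brand-new", 210.0)]

# One executemany applies the whole batch: existing rows are updated,
# missing rows are inserted.
cur = local.cursor()
cur.executemany(
    "INSERT INTO items (id, name, updated_at) VALUES (?, ?, ?) "
    "ON CONFLICT(id) DO UPDATE SET name = excluded.name, "
    "updated_at = excluded.updated_at",
    changed_rows)
local.commit()
```

The upsert avoids having to decide per row whether it's an insert or an update, which is where hand-rolled sync code usually gets tied up in error handling.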