I have a number of python processes that each repetitively query a separate betting API. The requests come in bursts of ~20-100 all at once, then the process goes away to parse the responses and repeats a second later or so. I am hoping to use Cassandra as my raw storage for requests and responses. This will allow me to debug problems with the parsed data and/or re-parse later. I am trying to devise a schema for this.
I am thinking I can have a separate table (column family) per API; not much to say about that. My initial idea for the table schema was:
stripe text, // free text to describe the flavour of the request, e.g. live games
date int, // YYYYMMDD
requests map<datetime, text>,
responses map<datetime, text>
I could then append the requests and responses to the correct row as they happened and end up with one row per day of time-sorted requests and responses. I could easily go back and find the data for a given day (which seems like a reasonable chunk to process at a time), then drill down to a specific point in time on that day if required.
The problem here is obvious: two requests made at exactly the same time, given my timestamping resolution, will end up overwriting one another. As unlikely as that might be, it is wrong.
I then moved on to a second idea I didn't really like: disambiguate the key using the timestamp plus a hash of the request, on the assumption that the same request at the same time should return the same result and is therefore unique enough, i.e. str(timestamp) + str(hash(request)). The schema then becomes (datetime becomes text):
stripe text, // free text to describe the flavour of the request, e.g. live games
date int, // YYYYMMDD
requests map<text, text>,
responses map<text, text>
This sucks because text takes more space and is slower to compare, but I was willing to accept it. Then I hit this problem:
E InvalidRequest: code=2200 [Invalid query] message="Map value is too long. Map values are limited to 65535 bytes but 435145 bytes value provided"
This is basically telling me I can't ever put these things in a collection column anyway, as responses are of arbitrary size and almost always bigger than the limit.
I am new to the Cassandra world but thought that these CQL maps end up corresponding to separate column names and values in the record, and that each column has a size limit of 2GB. One thing I can think of is to not use a map and instead keep altering the table schema every time, then insert a normal value into the cell, but I am not sure how that is different in the underlying store.
So I guess I have 2 questions:
Is this just a limitation of CQL or all of Cassandra?
Can someone more experienced think of an overall better approach?
Thanks for reading
KCH
To answer my own question: my misunderstanding lay in the fact that in CQL the first part of the primary key is the partition (row) key, so for a composite key the remaining parts form the column (clustering) key. Maps also end up as separate columns of the same row, using their own key-name "unrolling" convention, but with the map value size limit still applied.
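For illustration, here is a minimal sketch of the kind of schema this leads to, with one clustering row per request instead of a map, using the DataStax Python driver; the keyspace, table and column names are my own assumptions, not anything from the original setup:

import datetime
import hashlib
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])           # assumed contact point
session = cluster.connect('api_raw_ks')    # assumed keyspace

# One partition per (stripe, day); each request/response pair gets its own
# clustering row, so simultaneous requests cannot overwrite each other, and
# plain text columns are not subject to the 64KB collection-value limit.
session.execute("""
    CREATE TABLE IF NOT EXISTS api_requests (
        stripe text,
        day int,              // YYYYMMDD
        ts timestamp,
        request_hash text,
        request text,
        response text,
        PRIMARY KEY ((stripe, day), ts, request_hash)
    )
""")

now = datetime.datetime.utcnow()
request_body = 'raw request text'
response_body = 'raw response text'
session.execute(
    "INSERT INTO api_requests (stripe, day, ts, request_hash, request, response) "
    "VALUES (%s, %s, %s, %s, %s, %s)",
    ('live games', int(now.strftime('%Y%m%d')), now,
     hashlib.sha1(request_body.encode()).hexdigest(), request_body, response_body),
)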
I use Django 1.11 with PostgreSQL as the database. I know how to store and retrieve data from a db, but I can't find an example of the correct way to store and retrieve an entire discussion between two users.
This is my simple idea:
Two users connect to 127.0.0.1 and on this page there is a text-area form. Both users can write into the text-area and, by pressing a button, post their content. The page then reloads and all messages are displayed.
What I want to know is whether the correct way to store and retrieve would be:
one db row => one message from one user
If two users exchange, say, 15 messages, that will store 15 rows. To make the discussion identifiable, I can add another column to the db, something like a discussion "id", so all 15 rows would share the same id along with the user:
db row1 ---> "pk=1, message=hello there, user=Mike, id=45")
db row2 ---> "pk=2, message=hello world, user=Jessy, id=45")
When the page reloads, Django will run:
discussion = Discussion.objects.all().filter(id=45)
to retrieve the discussion.
Only two users can discuss in private, so every pair of users has a discussion page like 127.0.0.1/one, 127.0.0.1/two and so on.
If this is the correct way to store and retrieve from the db, my question is: how would that scale? Can I rely on that design to store and retrieve data from the database efficiently, or will it get heavy in the near future? I worry that 1000 users could quickly grow into 10000 rows.
So the answer to your question depends on how you plan on using the data in the future and what you need to do with it. It is entirely possible to store an entire conversation between N users in a relational database such as Postgres as individual records per message. However, as with all programming questions, there are multiple paradigms that could answer yours. I will explore the pros and cons of a couple of them here (with the knowledge that there are certainly more).
Paradigm 1: New record (row) per message
Pros:
Simpler querying for individual messages.
Analytical functions can easily be applied at the message level (e.g. counting the number of messages from certain users)
Record size is (relatively) small
Cons:
Tables grow very long (many rows)
Searching becomes time consuming as table grows.
Post-processing needed on a collection (i.e. All records from a conversation)
More work is shifted to the server
Paradigm 2: New record (row) per conversation
Pros:
Simpler querying for individual conversations
Shorter table sizes
Post-processing happens on a single object (e.g. the entire conversation stored as a JSON object)
Cons:
Larger row size that can grow substantially depending on the number and size of messages.
Harder to query individual messages or text within messages (you need more expensive operations such as LIKE '%...%' on blobs of text, which is slow)
Less conducive to performing any type of analytical function on messages.
Adding a message becomes an append to the stored conversation object (see the sketch after this list)
More work is shifted to the client/application
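For concreteness, a minimal sketch of appending one message under this paradigm with PostgreSQL's jsonb type and psycopg2; the conversations table, its jsonb messages column, and the connection settings are assumptions for illustration:

import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect('dbname=chat user=app')  # assumed connection settings
cur = conn.cursor()

# Assumed table: conversations(id integer primary key, messages jsonb default '[]')
new_message = {'user': 'Mike', 'text': 'hello there', 'ts': '2018-01-01T12:00:00Z'}

# '||' concatenates jsonb arrays, so the new message is appended in place.
cur.execute(
    'UPDATE conversations SET messages = messages || %s::jsonb WHERE id = %s',
    (Json([new_message]), 45),
)
conn.commit()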
Which is best? YMMV
Again, there are probably a half-dozen or so more ways you could store your application's messages, and all depend on your downstream needs. Additionally, I would implore you to look into projects such as Apache Kafka, which specialize in message publishing, as a potentially scalable, drop-in solution.
Three recommendations:
If you give PostgreSQL a decent amount of resources (say, an Amazon m3.large instance), then "a lot of rows" for a PostgreSQL database is around 100 million rows (depending). That's not a limit, it's just enough rows that you'll have to spend some time working on performance. So assuming that chats average 100 messages, then that would be one million conversations. So having one row per message is not a performance problem at the scale you're talking about.
Don't use a numerical PK as your main way of ordering conversations (you might still have one, Django likes having one). Do have a timestamptz column, which is how you reconstruct the order of conversations.
Have a unique index on user, timestamptz (since a user can't post two messages simultaneously), and another unique index on conversation, timestamptz (this will allow you to reconstruct conversations quickly).
You should also have a table called "conversations" which summarizes conversation_id, list-of-users, because this will make it easy to answer the request "show me all my conversations".
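A minimal sketch of what those recommendations could look like as Django models; the model and field names are my own, and DateTimeField maps to timestamptz on PostgreSQL when USE_TZ is enabled:

from django.db import models

class Conversation(models.Model):
    # who takes part; makes "show me all my conversations" a simple query
    participants = models.ManyToManyField('auth.User', related_name='conversations')

class Message(models.Model):
    conversation = models.ForeignKey(Conversation, on_delete=models.CASCADE)
    user = models.ForeignKey('auth.User', on_delete=models.CASCADE)
    text = models.TextField()
    created = models.DateTimeField()

    class Meta:
        # a user can't post two messages at the same instant, and
        # (conversation, created) reconstructs a conversation in order
        unique_together = (('user', 'created'), ('conversation', 'created'))

# Reconstructing one conversation, oldest message first:
# Message.objects.filter(conversation_id=45).order_by('created')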
Does that answer your questions?
I'm trying to think of an algorithm to solve this problem I have. It's not a HW problem, but for a side project I'm working on.
There's a table A that has on the order of 10^5 rows, and it adds on the order of 10^2 new rows every day.
Table B has on the order of 10^6 rows and adds about 10^3 new rows every day. There's a one-to-many relation from A to B (many B rows for each row in A).
I was wondering how I could do continuous aggregates for this kind of data. I would like a job that runs every ~10 minutes and does this: for every row in A, find every row in B related to it that was created in the last day, week and month, sort by count, and save the results in a different DB or cache them.
If this is confusing, here's a practical example: Say table A has Amazon products and table B has product reviews. We would like to show a sorted list of products with highest reviews in the last 4hrs, day, week etc. New products and reviews are added at a fast pace, and we'd like the said list to be as up-to-date as possible.
My current implementation is just a for loop (pseudo-code):
result = {}
for product in db_products:
    # assumes db_reviews can filter by product and creation time
    reviews = db_reviews(product_id=product.id, created_after=some_time)
    result[product.id] = {
        'reviews': reviews,
        'reviews_count': len(reviews),
    }
result = sorted(result.values(), key=lambda r: r['reviews_count'], reverse=True)
return result
I do this every hour, and save the result in a json file to serve. The problem is that this doesn't really scale well, and takes a long time to compute.
So, where could I look to solve this problem?
UPDATE:
Thank you for your answers. But I ended up learning and using Apache Storm.
Summary of requirements
You have two fairly big tables in a database, and you need to regularly create aggregates for past time periods (hour, day, week, etc.) and store the results in another database.
I will assume that once a time period is past, there are no changes to the related records; in other words, the aggregate for a past period always has the same result.
Proposed solution: Luigi
Luigi is a framework for plumbing together dependent tasks, and one of its typical uses is calculating aggregates for past periods.
The concept is as follows (a minimal task sketch appears after this list):
You write a simple Task class, which defines its required input data, its output data (called a Target), and the process to create that output.
Tasks can be parametrized; a typical parameter is the time period (a specific day, hour, week, etc.).
Luigi can be stopped in the middle and started again later. It considers any task whose target already exists to be completed and will not rerun it (you would have to delete the target content to let it rerun).
In short: if the target exists, the task is done.
This works for multiple types of targets, such as files in the local file system, on Hadoop, in AWS S3, and also in a database.
To prevent half-done results, target implementations take care of atomicity, so e.g. files are first created in a temporary location and moved to their final destination only once they are complete.
In databases there are marker structures to denote that a given import is complete.
You are free to create your own target implementations (a target has to create something and provide an exists method to check whether the result is there).
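To make this concrete, a minimal sketch of such a task; the output path and the compute_counts_for helper are hypothetical:

import json
import luigi

class DailyReviewCounts(luigi.Task):
    date = luigi.DateParameter()  # the past period this task aggregates

    def output(self):
        # Luigi treats the task as done as soon as this target exists
        return luigi.LocalTarget('counts/{}.json'.format(self.date.isoformat()))

    def run(self):
        counts = compute_counts_for(self.date)  # hypothetical aggregation helper
        with self.output().open('w') as f:      # written atomically via a temp file
            json.dump(counts, f)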
Using Luigi for your task
For the task you describe, you will probably find everything you need already present. Just a few tips:
The class luigi.postgres.CopyToTable allows storing records into a Postgres database. The target automatically creates a so-called "marker table" where it marks all completed tasks.
There are similar classes for other types of databases; one of them uses SQLAlchemy, which should cover the database you use: see the class luigi.contrib.sqla.CopyToTable.
The Luigi docs contain a working example of importing data into an SQLite database.
A complete implementation is beyond what is feasible in a Stack Overflow answer, but I am sure you will experience the following:
The code to do the task is really clear: no boilerplate coding, you write only what has to be done.
Nice support for working with time periods, even from the command line; see e.g. Efficiently triggering recurring tasks. It even takes care of not going too far into the past, to prevent generating so many tasks that they could overload your servers (the default values are very reasonably set and can be changed).
The option to run the task on multiple servers (using the central scheduler, which is provided with the Luigi implementation).
I have processed huge amounts of XML files with Luigi, and have also built tasks importing aggregated data into a database, and I can recommend it (I am not the author of Luigi, just a happy user).
Speeding up database operations (queries)
If your task suffers from the database query taking too long to execute, you have a few options:
If you are counting reviews per product in Python, consider trying an SQL query instead; it is often much faster. It should be possible to write an SQL query that counts the proper records and returns directly the number you need. With GROUP BY you can even get the summary for all products in one run.
Set up a proper index, probably on the "reviews" table over the "product" and "time period" columns. This should speed up the query, but make sure it does not slow down inserting new records too much (too many indexes can cause that).
It might turn out that with an optimized SQL query you get a working solution even without using Luigi.
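As a sketch of the first tip, a query along these lines returns the per-product counts in one round trip; the table and column names are assumed, and it is shown with Python's built-in sqlite3, but the same SQL idea applies to any engine:

import sqlite3

conn = sqlite3.connect('reviews.db')   # assumed database file
since = '2014-01-01 00:00:00'          # start of the period of interest

# One query returns the review count per product, already sorted.
rows = conn.execute(
    """
    SELECT product_id, COUNT(*) AS reviews_count
    FROM reviews
    WHERE created >= ?
    GROUP BY product_id
    ORDER BY reviews_count DESC
    """,
    (since,),
).fetchall()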
Data Warehousing? Summary tables are the right way to go.
Does the data change (once it is written)? If it does, then incrementally updating summary tables becomes a challenge. Most DW applications do not have that problem.
Update the summary table (day + dimension(s) + count(s) + sum(s)) as you insert into the raw data table(s). Since you are getting only one insert per minute, INSERT INTO SummaryTable ... ON DUPLICATE KEY UPDATE ... would be quite adequate, and simpler than running a script every 10 minutes.
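A rough sketch of that upsert, assuming MySQL, a review_summary table with a unique key on (product_id, day), and the mysql.connector driver; all names are illustrative:

import mysql.connector  # assumed driver; any MySQL client works the same way

conn = mysql.connector.connect(host='localhost', user='app',
                               password='secret', database='shop')
cur = conn.cursor()

# Run this alongside each insert into the raw reviews table.
product_id, day = 123, '2014-05-21'    # taken from the incoming review
cur.execute(
    """
    INSERT INTO review_summary (product_id, day, review_count)
    VALUES (%s, %s, 1)
    ON DUPLICATE KEY UPDATE review_count = review_count + 1
    """,
    (product_id, day),
)
conn.commit()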
Do any reporting from a summary table, not the raw data (the Fact table). It will be a lot faster.
My Blog on Summary Tables discusses details. (It is aimed at bigger DW applications, but should be useful reading.)
I agree with Rick: summary tables make the most sense for you. Update the summary tables every 10 minutes and just pull data from them as users request summaries.
Also, make sure that your DB is indexed properly for performance. I'm sure db_products.id is set as a unique index, but also make sure that db_products.create is defined as a DATE or DATETIME and is also indexed, since you are using it in your WHERE clause.
After enabling Appstats and profiling my application, I went on a panic rage trying to figure out how to reduce costs by any means. A lot of my costs per request came from queries, so I sought out to eliminate querying as much as possible.
For example, I had one query where I wanted to get a User's StatusUpdates after a certain date X. I used a query to fetch: statusUpdates = StatusUpdates.query(StatusUpdates.date > X).
So I thought I might outsmart the system and avoid a query, accepting higher write costs for the sake of lower read costs: every time a user writes a Status, I store the key of that Status in a list property on the user. Then, instead of querying, I would just do ndb.get_multi(user.list_of_status_keys).
The question is, what is the difference for the system between these two approaches? Sure I avoid a query with the second case, but what is happening behind the scenes here? Is what I'm doing in the second case, where I'm collecting keys, just me doing a manual indexing that GAE would have done for me with queries?
In general, what is the difference between get_multi(keys) and a query? Which is more efficient? Which is less costly?
Check the docs on billing:
https://developers.google.com/appengine/docs/billing
It's pretty straightforward. Reads are $0.07/100k, smalls are $0.01/100k, so you want to do smalls.
A query is 1 read + 1 small per entity.
A get is 1 read per entity, so fetching 100 entities is 1 read + 100 smalls with a query but 100 reads with gets. If you are getting more than 1 entity back, it's cheaper to do a query than to read the entities from their keys.
Query is likely more efficient too. The only benefit from doing the gets is that they'll be fully consistent (whereas a query is eventually consistent).
Storing the keys does not replace the query, as you cannot do anything with just the keys: you will still have to fetch the Status objects themselves. Also, since you want to filter on the date of the Status object, you will need to fetch all the Status objects into memory and compare their dates yourself. If you use a Query, App Engine will fetch only the Status entities with the required date. Since you fetch less, your read costs will be lower.
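A short sketch of the two approaches side by side with ndb; X stands for the cutoff date and user for the User entity from the question:

import datetime
from google.appengine.ext import ndb

class StatusUpdates(ndb.Model):        # the question's model, stubbed here
    date = ndb.DateTimeProperty()

X = datetime.datetime(2013, 1, 1)      # example cutoff date

# Query: the datastore applies the date filter, so only matching entities
# come back (1 read for the query plus 1 small per match).
recent = StatusUpdates.query(StatusUpdates.date > X).fetch()

# Stored keys: every referenced Status comes back (1 read each) and the
# date filter has to run in Python after the fetch.
all_statuses = ndb.get_multi(user.list_of_status_keys)  # 'user' as in the question
recent = [s for s in all_statuses if s.date > X]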
As this is basically the same question as you have posed here, I suggest that you look at the answer I gave there.
I'm trying to move from Redis to DynamoDB and so far everything is working great! The only thing I have yet to figure out is key expiration. Currently, I have my data set up with one primary key and no range key, like so:
{
    "key" => string,
    "value" => ["string", "string"],
    "timestamp" => seconds since epoch
}
What I was thinking was to do a scan over the database for where timestamp is less than a particular value, and then explicitly delete them. This, however, seems extremely inefficient and would use up a ridiculous number of read/write units for no reason! On top of which, the expirations would only happen when I run the scan, so they could conceivably build up.
So, has anyone found a good solution to this problem?
I'm also using DynamoDB the way we used to use Redis.
My suggestion is to write the keys into different time-sliced tables.
For example, say a type of record should last a few minutes, at most an hour or so; then you can:
Create a new table every day for this type of record and store new records in today's table.
Use read repair when you read records: if you can't find a record in today's table, try to find it in yesterday's table and put it in today's table if necessary (see the sketch after this list).
If you find the record in either table, verify it against its timestamp. It's not necessary to delete expired records at this moment.
Drop entire stale tables in your tasks.
This is easier to maintain and cost-efficient.
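A rough sketch of the read-repair step, assuming boto3 and a records_YYYY-MM-DD naming scheme for the daily tables:

import datetime
import boto3

dynamodb = boto3.resource('dynamodb')

def get_record(key):
    today = datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)
    today_table = dynamodb.Table('records_{}'.format(today.isoformat()))
    item = today_table.get_item(Key={'key': key}).get('Item')
    if item is None:
        # Fall back to yesterday's table and repair into today's table.
        old_table = dynamodb.Table('records_{}'.format(yesterday.isoformat()))
        item = old_table.get_item(Key={'key': key}).get('Item')
        if item is not None:
            today_table.put_item(Item=item)
    # The caller still checks item['timestamp'] against the expiry window.
    return item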
You could do lazy expiration and delete expired entries when they are requested.
For example:
store key "a" with an attribute "expiration", expires in 10 minutes.
fetch in 9 minutes, check expiration, return it.
fetch in 11 minutes. check expiration. since it's less than now, delete the entry.
This is what memcached was doing when I looked at the source a few years ago.
You'd still need to do a scan to remove all the old entries.
You could also consider using Elasticache, which is meant for caching rather than a permanent data store.
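A sketch of that lazy check, assuming boto3 and a numeric expiration attribute:

import time
import boto3

table = boto3.resource('dynamodb').Table('cache')  # assumed table name

def get_fresh(key):
    item = table.get_item(Key={'key': key}).get('Item')
    if item is None:
        return None
    if item['expiration'] < time.time():
        # Expired: delete it lazily on read and behave as if it were never there.
        table.delete_item(Key={'key': key})
        return None
    return item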
It seems that Amazon has just added expiration support to DynamoDB (as of Feb 27, 2017). Take a look at the official blog post:
https://aws.amazon.com/blogs/aws/new-manage-dynamodb-items-using-time-to-live-ttl/
You could use the timestamp as the range key which would be indexed and allow for easier operations based on the time.
I am building an application where I am trying to allow users to submit a list of company and date pairs and find out whether or not there was a news event on that date. The news events are stored in a dictionary with a company identifier and a date as a key.
newsDict[('identifier', 'MM/DD/YYYY')] = [list of news events for that date]
The dictionary turned out to be much larger than I thought, too big even to build in memory, so I broke it down into three pieces, each limited to a particular range of company identifiers.
My plan was to take the user-submitted list and, using a dictionary, group its company identifiers by the particular newsDict in which the company's events would be expected to be found, and then load the newsDicts one after another to get the values.
Now I am wondering if it would not be better to keep the news events in a list, with each item of the list being a sublist made up of a tuple and another list:
[('identifier','MM/DD/YYYY'),[list of news events for that date]]
My thought then is that I would have a dictionary holding the range of positions in the list for each company identifier:
companyDict['identifier']=(begofRangeinListforComp,endofRangeinListforComp)
I would use the user input to look up the ranges I needed and construct a list of the identifiers and ranges sorted by the ranges. Then I would just read the appropriate section of the list to get the data and construct the output.
The biggest reason I see for this is that, even with the dictionary broken into thirds, each section takes about two minutes to load on my machine, and the dictionary ends up taking about 600 to 750 MB of RAM.
I was surprised to note that a list of eight million lines took only about 15 seconds to load and used about 1/3 of the memory of the dictionary that had 1/3 the entries.
Further, since I can discard lines from the list as I work through it, I will be freeing memory as I work down the user list.
I am surprised, as I thought a dictionary would be the most efficient way to do this, but my poking at it suggests that the dictionary requires significantly more memory than a list. My reading of other posts on SO and elsewhere suggests that any other structure is going to require pointer allocations that are more expensive than list pointers. Am I missing something here, and is there a better way to do this?
After reading Alberto's answer and the response to my comment, I spent some time trying to figure out how I would write the function if I were to use a db. I might be hobbled here because I don't know much about db programming, but I think the code to implement this using a db would be much more complicated than:
outList = []
# theDict maps each identifier to its begin/end line positions in the file
with open('theFile', 'r') as massiveFile:
    lines = massiveFile.readlines()
for identifier in sortedUserList:  # the user list, sorted by the dictionary's ranges
    identifierList = lines[theDict[identifier]['beginPosit']:theDict[identifier]['endPosit'] + 1]
    for item in identifierList:
        if item.startswith(identifier):  # or whatever manipulation of the identifier is needed
            outList.append(item)
I have to wrap this in a function, but I didn't see anything that would be comparably simple if I converted the list to a db.
Of course, simplicity was not the reason that brought me to this forum. I still don't see that using another structure will cost less memory. I have 30000 company identifiers and approximately 3600 dates. Each item in my list is an object, in the parlance of OOD. That is where I am struggling: I spent six hours this morning organizing the data for a dictionary before I gave up. Spending that amount of time to implement a database, only to find that I am using half a gig or more of someone else's memory to load it, seems problematic.
With such a large amount of data, you should be using a database. This would be far better than looking at a list, and would be the most appropriate way of storing your data anyway. If you're using Python, it has SQLite built in I believe.
The dictionary will take more memory because it is effectively a hash.
You don't have to go so far as using a database, since your lookup requirements are so simple. Just use the file system.
Create a directory structure based on the company name (or ticker), with subdirectories for each date. To find whether data exists and load it up, just form the name of the subdirectory where the data would be, and see if it exists.
E.g., IBM news for May 21 would be in C:\db\IBM\20090521\news.txt, if in fact there were news for that day. You just check if the file exists; no searches.
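A tiny sketch of that lookup, following the path layout above:

import os

def load_news(ticker, yyyymmdd, root=r'C:\db'):
    path = os.path.join(root, ticker, yyyymmdd, 'news.txt')
    if not os.path.exists(path):
        return None                    # no news for that company on that day
    with open(path) as f:
        return f.read().splitlines()

# load_news('IBM', '20090521') -> the news lines, or None if there were none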
If you want to try and boost speed from there, come up with a scheme to cache a limited amount of results that are likely to be frequently requested (assuming you're operating a server). For that, you'd use a hash.