I'm using a Python script to run hourly scrapes of a website that publishes the most popular hashtags for a social media platform. They are to be stored in a database (MySQL), with one row per hashtag and one column for each hour in which it appears in the top 20, holding the number of uses within that past hour.
So the number of rows as well as columns will constantly increase, as new hashtags appear and ones that have previously appeared resurface in the top 20.
Is there a best way to go about this?
Your design is poorly suited to a relational database such as MySQL. The best way to go about it is either to redesign your storage layout into a form that a relational database works well with (e.g. make each row a (hashtag, hour) pair), or to use something other than a relational database to store it.
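A minimal sketch of the (hashtag, hour) layout, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The table name, column names and sample counts are assumptions made up for illustration:

```python
import sqlite3

# SQLite stands in for MySQL here; all names below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE hashtag_counts (
        hashtag   TEXT    NOT NULL,
        hour      TEXT    NOT NULL,   -- e.g. '2017-05-01 13:00'
        use_count INTEGER NOT NULL,
        PRIMARY KEY (hashtag, hour)
    )
    """
)


def record_scrape(hour, top_hashtags):
    """Store one hourly scrape as one row per (hashtag, hour) pair."""
    conn.executemany(
        "INSERT INTO hashtag_counts (hashtag, hour, use_count) VALUES (?, ?, ?)",
        [(tag, hour, count) for tag, count in top_hashtags.items()],
    )
    conn.commit()


def history(hashtag):
    """Return [(hour, use_count), ...] for one hashtag, oldest first."""
    return conn.execute(
        "SELECT hour, use_count FROM hashtag_counts"
        " WHERE hashtag = ? ORDER BY hour",
        (hashtag,),
    ).fetchall()


# Made-up sample data: two hourly scrapes.
record_scrape("2017-05-01 13:00", {"#python": 120, "#mysql": 80})
record_scrape("2017-05-01 14:00", {"#python": 150})
```

With this layout only the number of rows grows: a new hour or a new hashtag never requires adding a column.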
I would like to create a website where I show some text but mainly dynamic data in tables and plots. Let us assume that the user can choose whether he wants to see the DAX or the DOW JONES prices for a specific timeframe. I guess I have to store these data in a database. As I am not experienced with creating websites, I have no idea what the most reasonable setup for this website would be.
Would it be reasonable for this example to choose a database where every row consists of 9 fields, where the first column is the timestamp (let's say data for every minute), columns 2 to 5 hold the high, low, open and close prices of the DAX for this timestamp, and columns 6 to 9 hold the high, low, open and close prices of the DOW JONES?
Could this be scaled to hundreds of columns with a reasonable speed of the database?
Is this an efficient implementation?
When this website is online, you can choose whether you want to see DAX or DOW JONES prices for a specific timeframe. The corresponding data would be selected from the database via Python and plotted in the graph. Is this the general idea of how this would be implemented?
To get the data, can I run another Python script on the webserver to dynamically collect the desired data and write them to the database?
As a total beginner with web hosting (is this even the right term?) it is very hard for me to ask precise questions. I would be happy if I could find out what general structure I need to create the website, the database and the connection between both. I was thinking about Amazon Web Services.
You could use a database, but that doesn't seem necessary for what you described.
It would be reasonable to build the database as you described. Look into SQL for doing so. You can download the XAMPP package, which will give you pretty much everything you need for that. This is easily scalable to hundreds of thousands of entries; that's what databases are for.
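As a rough sketch of that table, here is the nine-column layout from the question with a query for one index over a timeframe. It uses Python's built-in sqlite3 module rather than the MySQL server that ships with XAMPP, and the column names and prices are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE prices (
        ts TEXT PRIMARY KEY,  -- one row per minute, ISO 8601 text
        dax_high REAL, dax_low REAL, dax_open REAL, dax_close REAL,
        dow_high REAL, dow_low REAL, dow_open REAL, dow_close REAL
    )
    """
)
# Made-up sample rows, one per minute.
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("2017-01-02T09:00", 11500.0, 11490.0, 11495.0, 11498.0,
         19800.0, 19790.0, 19795.0, 19799.0),
        ("2017-01-02T09:01", 11502.0, 11496.0, 11498.0, 11501.0,
         19803.0, 19797.0, 19799.0, 19802.0),
        ("2017-01-02T09:02", 11505.0, 11499.0, 11501.0, 11503.0,
         19806.0, 19800.0, 19802.0, 19805.0),
    ],
)
conn.commit()

# Column names cannot be bound as SQL parameters, so only whitelisted
# prefixes are ever interpolated into the query text.
ALLOWED_INDEXES = {"dax", "dow"}


def close_prices(index, start, end):
    """Return [(ts, close), ...] for start <= ts < end, oldest first."""
    if index not in ALLOWED_INDEXES:
        raise ValueError(f"unknown index: {index!r}")
    column = f"{index}_close"
    return conn.execute(
        f"SELECT ts, {column} FROM prices"
        " WHERE ts >= ? AND ts < ? ORDER BY ts",
        (start, end),
    ).fetchall()
```

The primary key on the timestamp is what keeps the timeframe query fast as the number of rows grows.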
If your example of stock prices is actually what you are trying to show, however, this is completely unnecessary, as there are already plenty of databases that have this data and will allow you to query them. What you would really want in this scenario is an API. Alpha Vantage is a free service that will serve you data on stock prices, and has plenty of documentation to help you get it set up with Python.
I would structure the project like this:
Use the Python library Flask to set up the back end.
In addition to instantiating the Flask app, instantiate the Alpha Vantage class as well (you will need to pip install both of these).
In one of the routes you declare under Flask, use the Alpha Vantage API to get the data you need and simply display it on the screen.
Assuming you are a complete beginner, one or more of those steps may not make sense to you, in which case take them one at a time. Start by learning how to build a basic Flask app, then look at the API.
YouTube is your friend for both of these things.
I use Django 1.11 with PostgreSQL as the database. I know how to store and retrieve data from a db, but I can't find an example showing the correct way to store and retrieve an entire discussion between two users.
This is my simple idea:
Two users connect to 127.0.0.1, and on this page there is a text-area form. Both users can write into the text area and post their content by pressing a button. The page will reload and all messages are then displayed.
What I want to know is whether the correct way to store and retrieve would be:
one db row => single message user
If two users exchange, say, 15 messages, it will store 15 rows, and to identify the discussion uniquely I can add another column to the db, something like a discussion "id", so that all 15 rows have the same id alongside the user:
db row1 ---> "pk=1, message=hello there, user=Mike, id=45"
db row2 ---> "pk=2, message=hello world, user=Jessy, id=45"
When the page reloads, Django will run:
discussion = Discussion.objects.all().filter(id=45)
to retrieve the discussion.
Only two users can discuss in private, so every pair of users has a discussion page like 127.0.0.1/one, 127.0.0.1/two and so on.
If this is the correct way to store and retrieve from the db, my question is how that would scale. Can I rely on this design to store and retrieve data from the database efficiently, or will it become heavy in the near future? I worry that 1000 users could quickly grow into 10000 rows.
So the answer to your question depends on how you plan on using the data in the future and what you need to do with it. It is entirely possible to store an entire conversation between N users in a relational database such as Postgres as individual records per message. However, as with most programming questions, there are multiple paradigms to answer your question. I will explore the pros/cons of a couple of them here (with the knowledge that there are certainly more).
Paradigm 1: New record (row) per message
Pros:
Simpler querying for individual messages.
Analytical functions can easily be applied at a message level (e.g. summing the number of messages by certain users)
Record size is (relatively) small
Cons:
Very long tables
Searching becomes time consuming as the table grows.
Post-processing needed on a collection (e.g. all records from a conversation)
More work is shifted to the server
Paradigm 2: New record (row) per conversation
Pros:
Simpler querying for individual conversations
Shorter tables
Post-processing operates on a single object (e.g. the entire conversation stored as a JSON object)
Cons:
Larger row size that can grow substantially depending on the number and size of messages.
Harder to query individual messages or text within messages (need to use more expensive functions such as LIKE % on blobs of text = slow)
Less conducive to performing any type of analytical function on messages.
Messages become an append exercise
More work is shifted to the client/application
Which is best? YMMV
Again, there are probably a half-dozen or so more ways you could store your application's messages, and all depend on your downstream needs. Additionally, I would implore you to look into projects such as Apache Kafka, which specialize in message publishing, as a potentially scalable, drop-in solution.
Three recommendations:
If you give PostgreSQL a decent amount of resources (say, an Amazon m3.large instance), then "a lot of rows" for a PostgreSQL database is around 100 million rows (depending on the workload). That's not a limit, it's just enough rows that you'll have to spend some time working on performance. Assuming that chats average 100 messages, that would be one million conversations. So having one row per message is not a performance problem at the scale you're talking about.
Don't use a numerical PK as your main way of ordering conversations (you might still have one, Django likes having one). Do have a timestamptz column, which is how you reconstruct the order of conversations.
Have a unique index on user, timestamptz (since a user can't post two messages simultaneously), and another unique index on conversation, timestamptz (this will allow you to reconstruct conversations quickly).
You should also have a table called "conversations" which summarizes conversation_id, list-of-users, because this will make it easy to answer the request "show me all my conversations".
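A minimal sketch of these recommendations, using Python's built-in sqlite3 module so that it runs without a PostgreSQL server. All table, column and function names are assumptions for illustration; in PostgreSQL the posted_at column would be a timestamptz, and since the question only has two-user chats the list of users is stored here as two columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE conversations (
        conversation_id INTEGER PRIMARY KEY,
        user_a TEXT NOT NULL,
        user_b TEXT NOT NULL
    );
    CREATE TABLE messages (
        id              INTEGER PRIMARY KEY,  -- kept only as a row id
        conversation_id INTEGER NOT NULL
                        REFERENCES conversations (conversation_id),
        username        TEXT NOT NULL,
        posted_at       TEXT NOT NULL,        -- ISO 8601 with UTC offset
        body            TEXT NOT NULL
    );
    CREATE UNIQUE INDEX messages_user_time
        ON messages (username, posted_at);
    CREATE UNIQUE INDEX messages_conversation_time
        ON messages (conversation_id, posted_at);
    """
)


def start_conversation(user_a, user_b):
    """Create a conversation and return its id."""
    cursor = conn.execute(
        "INSERT INTO conversations (user_a, user_b) VALUES (?, ?)",
        (user_a, user_b),
    )
    conn.commit()
    return cursor.lastrowid


def post_message(conversation_id, username, posted_at, body):
    """Store one message as one row."""
    conn.execute(
        "INSERT INTO messages (conversation_id, username, posted_at, body)"
        " VALUES (?, ?, ?, ?)",
        (conversation_id, username, posted_at, body),
    )
    conn.commit()


def conversation(conversation_id):
    """Return [(username, body), ...] ordered by timestamp, not by id."""
    return conn.execute(
        "SELECT username, body FROM messages"
        " WHERE conversation_id = ? ORDER BY posted_at",
        (conversation_id,),
    ).fetchall()


def conversations_for(username):
    """Answer 'show me all my conversations' from the summary table."""
    rows = conn.execute(
        "SELECT conversation_id FROM conversations"
        " WHERE user_a = ? OR user_b = ? ORDER BY conversation_id",
        (username, username),
    ).fetchall()
    return [row[0] for row in rows]
```

The unique index on (conversation_id, posted_at) is the one that serves the "reload the page" query.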
Does that answer your questions?
We have a table in Azure Table Storage that is storing a LOT of data in it (IoT stuff). We are attempting a simple migration away from Azure Table Storage to our own data services.
I'm hoping to get a rough idea of how much data we are migrating exactly.
EG: 2,000,000 records for IoT device #1234.
The problem I am facing is getting a count of all the records present in the table under some constraints (e.g. count all records pertaining to one IoT device, #1234).
I did a fair amount of research and found posts saying that this count feature is not implemented in ATS. These posts, however, were from circa 2010 to 2014.
I assumed (hoped) that this feature has been implemented by now, since it's 2017, and I'm trying to find docs for it.
I'm using Python to interact with our ATS.
Could someone please post the link to the docs here that show how I can get the count of records using python (or even HTTP / rest etc)?
Or if someone knows for sure that this feature is still unavailable, that would help me move on as well and figure another way to go about things!
Thanks in advance!
Returning the number of entities in a table is for sure not available in the Azure Table Storage SDK or service. You could make a table scan query to return all entities from your table, but if you have millions of these entities the query will probably time out. It is also going to have a pretty big performance impact on your table. Alternatively, you could try making segmented queries in a loop until you reach the end of the table.
"Or if someone knows for sure that this feature is still unavailable, that would help me move on as well and figure another way to go about things!"
This feature is still not available; in other words, as of today there's no API which will give you a count of the total number of rows in a table. You would have to write your own code to do so.
"Could someone please post the link to the docs here that show how I can get the count of records using python (or even HTTP / rest etc)?"
For this you would need to list all entities in a table. Since you're only interested in the count, you can reduce the size of the response data by making use of Query Projection and fetching just one or two attributes of the entities (maybe PartitionKey and RowKey). Please see my answer here for more details: Count rows within partition in Azure table storage.
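The segmented-query loop can be sketched roughly as follows. Note that fetch_segment is a hypothetical callable, not a function from the Azure SDK: it stands for whatever call returns one page of (projected) entities plus a continuation token, and the fake table below exists only so the sketch can run on its own:

```python
def count_entities(fetch_segment):
    """Count entities by walking a table one segment at a time.

    fetch_segment(token) is a hypothetical callable (NOT a real SDK
    function). It takes a continuation token (None for the first call)
    and returns (entities, next_token); next_token is None after the
    last segment.
    """
    total = 0
    token = None
    while True:
        entities, token = fetch_segment(token)
        total += len(entities)
        if token is None:
            return total


def make_fake_table(n_rows, page_size):
    """Build an in-memory stand-in for a paged table query (test helper)."""
    # Only the two key attributes are kept, mimicking a projected query.
    rows = [{"PartitionKey": "1234", "RowKey": str(i)} for i in range(n_rows)]

    def fetch_segment(token):
        start = token or 0
        next_start = start + page_size
        next_token = next_start if next_start < n_rows else None
        return rows[start:next_start], next_token

    return fetch_segment
```

For example, count_entities(make_fake_table(2500, 1000)) walks three segments of 1000, 1000 and 500 rows. Only one segment is held in memory at a time, which is the point of paging rather than loading every entity at once.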
I'm trying to think of an algorithm to solve this problem I have. It's not a HW problem, but for a side project I'm working on.
There's a table A that has on the order of 10^5 rows and gains on the order of 10^2 new rows every day.
Table B has on the order of 10^6 rows and gains on the order of 10^3 new rows every day. There's a one-to-many relation from A to B (many B rows for some row in A).
I was wondering how I could do continuous aggregates for this kind of data. I would like to have a job that runs every ~10 minutes and does this: for every row in A, find every row in B related to it that was created in the last day, week and month (and then sort by count), and save them in a different DB or cache them.
If this is confusing, here's a practical example: Say table A has Amazon products and table B has product reviews. We would like to show a sorted list of products with highest reviews in the last 4hrs, day, week etc. New products and reviews are added at a fast pace, and we'd like the said list to be as up-to-date as possible.
Current implementation I have is just a for loop (pseudo-code):
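def top_products(db_products, db_reviews, some_time):
    result = []
    for product in db_products:
        # reviews for this product whose `create` timestamp is >= some_time
        reviews = db_reviews(product_id=product.id, create_from=some_time)
        result.append({'product': product, 'reviews': reviews, 'reviews_count': len(reviews)})
    # most-reviewed products first
    result.sort(key=lambda row: row['reviews_count'], reverse=True)
    return result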
I do this every hour, and save the result in a json file to serve. The problem is that this doesn't really scale well, and takes a long time to compute.
So, where could I look to solve this problem?
UPDATE:
Thank you for your answers. But I ended up learning and using Apache Storm.
Summary of requirements
Having two bigger tables in a database, you need to regularly create some aggregates for past time periods (hour, day, week etc.) and store the results in another database.
I will assume that once a time period is past, there are no changes to the related records; in other words, the aggregate for a past period always has the same result.
Proposed solution: Luigi
Luigi is a framework for plumbing dependent tasks, and one of its typical uses is calculating aggregates for past periods.
The concept is as follows:
Write a simple Task class which defines the required input data, the output data (called a Target) and the process that creates the target output.
Tasks can be parametrized; a typical parameter is a time period (a specific day, hour, week etc.).
Luigi can stop tasks in the middle and start them again later. It considers any task whose target already exists to be completed and will not rerun it (you would have to delete the target content to make it rerun).
In short: if the target exists, the task is done.
This works for multiple types of targets, such as files in the local file system, on Hadoop or on AWS S3, and also in a database.
To prevent half-done results, target implementations take care of atomicity, so e.g. files are first created in a temporary location and are moved to the final destination only after they are completed.
In databases there are structures to denote that a database import is completed.
You are free to create your own target implementations (one has to create something and provide an exists method to check that the result exists).
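To illustrate the "if the target exists, the task is done" idea without installing anything, here is a small standard-library sketch. It is not Luigi's API: FileTarget, DailyCountTask, run_if_needed and load_rows are names invented for this example, and only the concept (existence check plus atomic write) is taken from the description above:

```python
import json
import os
import tempfile


class FileTarget:
    """A file whose existence marks a task as complete (illustrative only)."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)

    def write_atomically(self, data):
        # Write to a temporary file in the same directory, then rename it.
        # A crash mid-write leaves no half-done file at the final path.
        directory = os.path.dirname(self.path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as handle:
            json.dump(data, handle)
        os.replace(tmp_path, self.path)


class DailyCountTask:
    """Aggregate one past day; parametrized by the day, like a Luigi task."""

    def __init__(self, day, out_dir, load_rows):
        self.day = day
        self.load_rows = load_rows  # hypothetical callable: day -> list of rows
        self.target = FileTarget(os.path.join(out_dir, f"counts-{day}.json"))

    def run_if_needed(self):
        """Return True if the aggregate was computed, False if skipped."""
        if self.target.exists():
            return False  # target exists, so the task is done
        rows = self.load_rows(self.day)
        self.target.write_atomically({"day": self.day, "count": len(rows)})
        return True
```

Calling run_if_needed() a second time for the same day does no work, which is what makes it safe to rerun the whole job after a failure.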
Using Luigi for your task
For the task you describe you will probably find everything you need already present. Just a few tips:
The class luigi.postgres.CopyToTable allows storing records in a Postgres database. The target will automatically create a so-called "marker table" where it will mark all completed tasks.
There are similar classes for other types of databases, one of them using SQLAlchemy, which should probably cover the database you use; see the class luigi.contrib.sqla.CopyToTable.
The Luigi docs contain a working example of importing data into an SQLite database.
A complete implementation is beyond what is feasible in a Stack Overflow answer, but I am sure you will experience the following:
The code to do the task is really clear: no boilerplate coding, you write only what has to be done.
Nice support for working with time periods, even from the command line; see e.g. Efficiently triggering recurring tasks. It even takes care of not going too far into the past, to prevent generating too many tasks and possibly overloading your servers (the default values are very reasonably set and can be changed).
The option to run the task on multiple servers (using the central scheduler, which is provided with the Luigi implementation).
I have processed huge amounts of XML files with Luigi and have also written some tasks importing aggregated data into a database, and I can recommend it (I am not the author of Luigi, I am just a happy user).
Speeding up database operations (queries)
If your task suffers from the database query taking too long to execute, you have a few options:
If you are counting reviews per product in Python, consider trying an SQL query; it is often much faster. It should be possible to create an SQL query which uses count on the proper records and directly returns the number you need. With group by you should even get summary information for all products in one run.
Set up a proper index, probably on the "reviews" table on the "product" and "time period" columns. This should speed up the query, but make sure it does not slow down inserting new records too much (too many indexes can cause that).
It might happen that with an optimized SQL query you will get a working solution even without using Luigi.
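A rough sketch of the count-with-group-by idea, using Python's built-in sqlite3 module so that it runs anywhere. The table, column and index names and the sample rows are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE reviews (
        id         INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL,
        created    TEXT    NOT NULL   -- ISO 8601 timestamp
    );
    -- index on product and time, as suggested above
    CREATE INDEX reviews_product_created ON reviews (product_id, created);
    """
)
# Made-up sample data: (product_id, created).
conn.executemany(
    "INSERT INTO reviews (product_id, created) VALUES (?, ?)",
    [
        (1, "2017-03-01 08:00:00"),
        (1, "2017-03-02 09:30:00"),
        (2, "2017-03-02 10:00:00"),
        (2, "2017-03-02 11:00:00"),
        (2, "2017-03-02 12:00:00"),
        (3, "2017-02-20 07:00:00"),
    ],
)
conn.commit()


def review_counts(since):
    """Return [(product_id, count), ...] for reviews created at or after
    `since`, most-reviewed first, computed in one query for all products."""
    return conn.execute(
        "SELECT product_id, COUNT(*) AS review_count FROM reviews"
        " WHERE created >= ?"
        " GROUP BY product_id"
        " ORDER BY review_count DESC, product_id",
        (since,),
    ).fetchall()
```

This replaces the per-product loop with a single round trip; products with no reviews in the period simply do not appear in the result.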
Data Warehousing? Summary tables are the right way to go.
Does the data change (once it is written)? If it does, then incrementally updating Summary Tables becomes a challenge. Most DW applications do not have that problem.
Update the summary table (day + dimension(s) + count(s) + sum(s)) as you insert into the raw data table(s). Since you are getting only one insert per minute, INSERT INTO SummaryTable ... ON DUPLICATE KEY UPDATE ... would be quite adequate, and simpler than running a script every 10 minutes.
Do any reporting from a summary table, not the raw data (the Fact table). It will be a lot faster.
My Blog on Summary Tables discusses details. (It is aimed at bigger DW applications, but should be useful reading.)
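A minimal sketch of updating a summary table at insert time. It uses Python's built-in sqlite3 module, where the counterpart of MySQL's INSERT ... ON DUPLICATE KEY UPDATE is INSERT ... ON CONFLICT ... DO UPDATE (this needs SQLite 3.24 or newer). Table, column and function names are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE reviews (            -- raw data (the Fact table)
        id         INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL,
        created    TEXT    NOT NULL   -- 'YYYY-MM-DD HH:MM:SS'
    );
    CREATE TABLE review_summary (     -- day + dimension + count
        day          TEXT    NOT NULL,
        product_id   INTEGER NOT NULL,
        review_count INTEGER NOT NULL,
        PRIMARY KEY (day, product_id)
    );
    """
)


def add_review(product_id, created):
    """Insert the raw row and bump the day's counter in one transaction."""
    with conn:
        conn.execute(
            "INSERT INTO reviews (product_id, created) VALUES (?, ?)",
            (product_id, created),
        )
        conn.execute(
            "INSERT INTO review_summary (day, product_id, review_count)"
            " VALUES (?, ?, 1)"
            " ON CONFLICT (day, product_id)"
            " DO UPDATE SET review_count = review_count + 1",
            (created[:10], product_id),  # first 10 characters are the date
        )


def top_products(first_day, last_day):
    """Report from the summary table only, never from the raw data."""
    return conn.execute(
        "SELECT product_id, SUM(review_count) AS total FROM review_summary"
        " WHERE day BETWEEN ? AND ?"
        " GROUP BY product_id"
        " ORDER BY total DESC, product_id",
        (first_day, last_day),
    ).fetchall()
```

A day, week or month report then sums at most one summary row per product per day, regardless of how many raw reviews exist.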
I agree with Rick, summary tables make the most sense for you. Update the summary tables every 10 minutes and just pull data from them as users request summaries.
Also, make sure that your DB is indexed properly for performance. I'm sure db_products.id is set as a unique index, but also make sure that db_products.create is defined as a DATE or DATETIME and is also indexed, since you are using it in your WHERE clause.
I am working on a web application for downloading resources of an unimportant type. It's written in Python using the Flask web framework. I use the SQLAlchemy DB system.
It has a user authentication system and you can download the resources only while logged in.
What I am trying to do is a download history chart for every resource and every user. To elaborate, each user could see two charts of their download activity on their profile page, for the last 7 days and the last year respectively. Each resource would also have a similar pair of charts, but they would instead visualize how many times the resource itself was downloaded in the time periods.
Here is an example screenshot of the charts
(Don't have enough reputation to embed images)
http://dl.dropbox.com/u/5011799/Selection_049.png
The problem is, I can't seem to figure out what the best way to store the downloads in a database would be. I found 2 ways that are relatively easy to implement and should work:
1) I could store the download count for each day in the last week in separate fields and every 24 hours just get rid of the first one and move them to the left by 1. This, however, seems like a kind of a hacky way to do this.
2) I could also create a separate table for the downloads, and every time a user downloads a resource I would insert a row into the table with the datetime, the user_id of the downloader and the resource_id of the downloaded resource. This would allow me to do some nice querying of time periods etc. The problem with that configuration could be the row count in the table. I have no idea how heavily the website is going to be used, but if I do the math with 1000 downloads / day, I am going to end up with over 360k rows in just the first year. I don't know how well that would perform. I know I could just archive old entries if performance started being a huge problem.
I would like to know whether the 2nd option would be fast enough for a web app and what configuration you would use.
Thanks in advance.
I recommend the second approach, with periodic aggregation to improve performance.
Storing counts by day will force you to SELECT the existing count so that you can either add to it with an UPDATE statement or know that you need to INSERT a new record. That's two trips to the database on every download. And if things get out of whack, there's really no easy way to determine what happened or what the correct numbers ought to be. (You're not saving information about the individual events.) That's probably not a significant concern for a simple download count, but if this were sensitive information it might matter.
The second approach simply requires a single INSERT for each download, and because each event is stored separately, it's easy to troubleshoot. And, as you point out, you can slice this data any way you like.
As for performance, 360,000 rows is trivial for a modern RDBMS on contemporary hardware, but you do want to make sure you have an index on date, username/resource name or any other columns that will be used to select data.
Still, you might have more volume than you expect, or maybe your DB is iffy (I'm not familiar with SQLAlchemy). To reduce your row count you could create a weekly batch process (yeah, I know, batch ain't dead despite what some people say) during non-peak hours to create summary records by week.
It would probably be easiest to create your summary records in a different table that is simply keyed by week and year, or start/end dates, depending on how you want to use it. After you've generated the weekly summary for a period, you can archive or delete the daily detail records for that period.
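A minimal sketch of the second approach with a weekly summary step, using Python's built-in sqlite3 module instead of SQLAlchemy so that it is self-contained. Table, column, index and function names are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE downloads (
        id            INTEGER PRIMARY KEY,
        user_id       INTEGER NOT NULL,
        resource_id   INTEGER NOT NULL,
        downloaded_at TEXT    NOT NULL   -- 'YYYY-MM-DD HH:MM:SS'
    );
    CREATE INDEX downloads_user_time
        ON downloads (user_id, downloaded_at);
    CREATE INDEX downloads_resource_time
        ON downloads (resource_id, downloaded_at);
    CREATE TABLE weekly_resource_downloads (
        week_start     TEXT    NOT NULL,
        resource_id    INTEGER NOT NULL,
        download_count INTEGER NOT NULL,
        PRIMARY KEY (week_start, resource_id)
    );
    """
)


def log_download(user_id, resource_id, downloaded_at):
    """One INSERT per download event."""
    with conn:
        conn.execute(
            "INSERT INTO downloads (user_id, resource_id, downloaded_at)"
            " VALUES (?, ?, ?)",
            (user_id, resource_id, downloaded_at),
        )


def daily_counts_for_user(user_id, start, end):
    """Return [(day, count), ...] for start <= downloaded_at < end."""
    return conn.execute(
        "SELECT date(downloaded_at) AS day, COUNT(*) FROM downloads"
        " WHERE user_id = ? AND downloaded_at >= ? AND downloaded_at < ?"
        " GROUP BY day ORDER BY day",
        (user_id, start, end),
    ).fetchall()


def summarize_week(week_start, week_end):
    """Batch job: roll one week up per resource, then drop the detail rows."""
    with conn:
        conn.execute(
            "INSERT INTO weekly_resource_downloads"
            " SELECT ?, resource_id, COUNT(*) FROM downloads"
            " WHERE downloaded_at >= ? AND downloaded_at < ?"
            " GROUP BY resource_id",
            (week_start, week_start, week_end),
        )
        conn.execute(
            "DELETE FROM downloads"
            " WHERE downloaded_at >= ? AND downloaded_at < ?",
            (week_start, week_end),
        )
```

In this sketch the per-user detail is lost once a week has been summarized, so only weeks older than the longest per-user chart should be rolled up, or a second summary table keyed by user would be needed.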