I have an aggregated data table in BigQuery with millions of rows, and it is growing every day.
I need a way to get one row from this aggregate table in milliseconds so I can append data to a real-time event.
What is the best way to tackle this problem?
BigQuery is not built to respond in milliseconds, so you need another solution in between. It is perfectly fine to use BigQuery for the large aggregation calculations, but you should never serve directly from BQ when response time is measured in milliseconds.
Also be aware that, if this is a web application for example, many reloads of a page could cost you a lot of money, as you pay per query.
There are many architectural solutions to this kind of issue, but which one you should use is hard to tell without any project context and objectives.
For real-time data we often use Pub/Sub somewhere in between, but that might be an issue if the (near) real-time demand is for an aggregate.
You could also apply the materialized-view concept by exporting the aggregated data to a downstream component: for example Cloud Storage -> Pub/Sub, a SQL instance, Memorystore, or any other kind of microservice.
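As a rough sketch of that last idea, assuming the aggregate is keyed by a user ID and served from Memorystore (Redis); the table name, key field, and Redis host below are placeholders, not part of the question:

    # Refresh a BigQuery aggregate into Redis on a schedule, then serve reads from Redis.
    # Table, key field and Redis host are placeholders.
    import json

    import redis
    from google.cloud import bigquery

    bq = bigquery.Client()
    cache = redis.Redis(host="10.0.0.3", port=6379)  # Memorystore (Redis) instance

    def refresh_cache():
        # The heavy aggregation stays in BigQuery and runs on a schedule, not per request.
        rows = bq.query("SELECT user_id, total FROM `my-project.my_dataset.aggregates`").result()
        for row in rows:
            cache.set(f"agg:{row.user_id}", json.dumps(dict(row)))

    def get_aggregate(user_id):
        # Millisecond read path: never touches BigQuery directly.
        value = cache.get(f"agg:{user_id}")
        return json.loads(value) if value else None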
I am using the Firebase Python client to write data to Firestore. Any read/write operation takes at least 1 second to complete. The Firestore DB is in us-central and our server is in Singapore.
Is that what is causing the issue?
For reads, I use a where query with a limit, like below.
    collection_ref.where(u"field", u"==", u"field_value").limit(1).get()
For writes, I use set() and update() with a dict.
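The write path looks roughly like this (using the google-cloud-firestore client; the collection, document ID, and field names are placeholders):

    from google.cloud import firestore

    db = firestore.Client()
    doc_ref = db.collection(u"users").document(u"user_123")  # placeholder path
    doc_ref.set({u"field": u"field_value", u"count": 1})     # create or replace the document
    doc_ref.update({u"count": 2})                            # partial update of existing fields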
Sometimes the lag is around 10 to 12 seconds.
Has anyone faced similar issues?
Any pointers would be appreciated.
This article on why Cloud Firestore queries are slow lists these reasons:
If you are downloading a bunch of data, you probably don't need all of it. The solution is to limit the amount that comes back.
Your offline cache is too big. Cloud Firestore does some amazing offline caching, but this local cache does not apply the same indexes that the server does. This means that when you query documents in your offline cache, Cloud Firestore needs to unpack every document stored locally for the collection being queried and compare it against your query. The solution is to limit how much data is stored in the offline cache.
Without a composite index, Firestore has to do a lot of searching to get the result set. Create a composite index so Firestore can do a quick lookup instead.
You are used to the Realtime Database. Realtime Database generally has lower latency; you are usually not going to notice the difference, but if your app needs to squeeze out every bit of latency, you are probably better off using Realtime Database in those scenarios.
The laws of physics are keeping you down. Your customer might be too far away from your Firestore database, so the round-trip latency is simply too long. To work around this, use real-time listeners, which apply a technique called latency compensation.
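A minimal sketch of that last point with the google-cloud-firestore client; the collection and field names are placeholders:

    from google.cloud import firestore

    db = firestore.Client()
    query = db.collection(u"items").where(u"field", u"==", u"field_value").limit(1)

    def on_snapshot(col_snapshot, changes, read_time):
        # Called whenever the query results change; listeners reflect local writes
        # immediately (latency compensation) instead of blocking on the round trip.
        for doc in col_snapshot:
            print(doc.id, doc.to_dict())

    watch = query.on_snapshot(on_snapshot)
    # ...
    # watch.unsubscribe()  # stop listening when done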
I have a requirement to extract data from an Amazon Aurora RDS instance and load it into S3 to build a data lake for analytics purposes. There are multiple schemas/databases in one instance, and each schema has a similar set of tables. I need to pull selective columns from these tables for all schemas in parallel. This should happen in near real time, capturing the DML operations periodically.
The question may arise of using dedicated services like Data Migration or a Copy activity provided by AWS, but I can't use them, since the plan is to keep the solution cloud-platform independent, as it could be hosted on Azure down the line.
I was thinking Apache Spark could be used for this, but I learned that it doesn't support JDBC as a source in Structured Streaming. I have read about multi-threading and multiprocessing techniques in Python, but I still have to assess whether they are suitable (the idea is to run the code as daemon threads, each thread fetching data from the tables of a single schema in the background, running continuously in defined cycles, say every 5 minutes; see the sketch after this question). Data synchronization between the RDS tables and S3 is also a crucial aspect to consider.
To say more about the data in the source tables: they have an auto-increment ID field, but the IDs are not sequential and may skip a few numbers where rows were removed due to the inactivity of the corresponding entity, say customers. It is not necessary to pull every column of a record, only a few, which would be predefined in the configuration. The solution must be reliable, sustainable, and automatable.
Now I'm unsure which approach to use and how to implement it once decided, so I'm seeking help from people who have dealt with, or know of a solution to, this problem statement. I'm happy to provide more info if it is required to get to the right solution. Any help on this would be greatly appreciated.
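As a rough sketch of the per-schema polling idea described above: one daemon thread per schema, each cycle pulling only the configured columns changed since the last watermark. pymysql, boto3, the bucket name, and the updated_at watermark column are all assumptions for illustration, not part of the actual requirement.

    import csv
    import io
    import threading
    import time

    import boto3
    import pymysql

    S3 = boto3.client("s3")
    COLUMNS = ["id", "name", "status", "updated_at"]  # assumed "selective columns"

    def poll_schema(schema, table, watermark):
        conn = pymysql.connect(host="aurora-host", user="etl_user", password="secret", database=schema)
        while True:
            with conn.cursor() as cur:
                # Assumes an updated_at column to capture changes since the last cycle.
                cur.execute(
                    f"SELECT {', '.join(COLUMNS)} FROM {table} WHERE updated_at > %s",
                    (watermark,),
                )
                rows = cur.fetchall()
            if rows:
                buf = io.StringIO()
                csv.writer(buf).writerows(rows)
                key = f"{schema}/{table}/{int(time.time())}.csv"
                S3.put_object(Bucket="my-data-lake", Key=key, Body=buf.getvalue())
                watermark = max(r[-1] for r in rows)  # advance the watermark
            time.sleep(300)  # 5-minute cycle, as in the question

    # One daemon thread per schema, as described above.
    for schema in ["schema_a", "schema_b"]:
        threading.Thread(target=poll_schema, args=(schema, "customers", "1970-01-01"), daemon=True).start()

    # Keep the main thread alive so the daemon workers keep running.
    while True:
        time.sleep(60)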
Background:
I have multiple asset tables stored in a Redshift database, one per city, 8 cities in total. These asset tables record status updates on an hourly basis: 8 SQL tables and about 500 million rows of data per year.
(I also have access to the server that updates this data every minute.)
Example: one market can have 20k assets producing 480k (20k * 24 hours) status updates a day.
These status updates are in a raw format and need to undergo a transformation process that is currently written as a SQL view. The end state goes into our BI tool (Tableau) for external stakeholders to look at.
Problem:
The current way the data is processed is slow and inefficient, and it is probably not realistic to run this job on an hourly basis in Tableau. The status transformation requires looking back at 30 days of data, so I do need the history throughout the query.
Possible Solutions:
Here are some solutions that I think might work; I would like feedback on what makes the most sense in my situation.
Run a Python script as a cron job that looks at the most recent update, queries the last 30 days of the large history table, and writes the result to a table in the Redshift database.
Materialize the SQL view and run an incremental refresh every hour.
Put the view in Tableau as a data source and run an incremental refresh every hour.
Please let me know how you would approach this problem. My background is SQL, limited data engineering experience, Tableau (Prep & Desktop), and scripting in Python or R.
First things first: you say that the data processing is "slow and inefficient" and ask how to efficiently query a large database, so I'd start by looking at how to improve that process. You indicate that the process is based on the past 30 days of data: are the large tables time-sorted, vacuumed, and analyzed? It is important to take maximum advantage of metadata when working with large tables. Make sure your WHERE clauses are effective at eliminating fact-table blocks; don't rely on dimension-table WHERE clauses to select the date range.
Next, look at your distribution keys and how they affect whether your critical query has to move large amounts of data across the network. The internode network has the lowest bandwidth in a Redshift cluster, and needlessly pushing lots of data across it will make things slow and inefficient. Using EVEN distribution can be a performance killer, depending on your query pattern.
Now to your question, which I'll paraphrase as: "is it better to use summary tables, materialized views, or external storage (a Tableau data source) to store summary data updated hourly?" All three work, and each has its own pros and cons.
Summary tables are good because you can choose how the data is distributed, and if it needs to be combined with other database tables, that can be done most efficiently. However, there is more data management to be performed to keep this data up to date and in sync.
Materialized views are nice because there is a lot less management to worry about: when the data changes, just refresh the view. The data is still in the database, so it is easy to combine with other tables, but since you don't have control over how the data is stored, these operations may not be the most efficient.
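For the Redshift case in the question, that refresh can be a tiny hourly cron job; a minimal sketch assuming psycopg2, with placeholder connection details and view name:

    # Hourly cron job: refresh the materialized view holding the 30-day status transform.
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="etl_user", password="secret",
    )
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("REFRESH MATERIALIZED VIEW asset_status_transformed;")
    conn.close()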
External storage is good in that the data is in your BI tool, so if you need to refetch the results during the hour, the data is local. However, it is now locked into your BI tool and far less efficient to combine with other database tables.
Summary data usually isn't that large, so how it is stored isn't a huge concern, and I'm a bit lazy, so I'd go with a materialized view. But as I said at the beginning, I'd first look at the "slow and inefficient" queries being run every hour.
Hope this helps
My Python app stores result data in BigQuery. In the code I generate JSON that reflects the target BQ table structure and then insert it.
Generally it works fine, but it fails to save rows whose size exceeds 1 MB. This is a limitation of streaming inserts.
I checked Google API documentation: https://googleapis.dev/python/bigquery/latest/index.html
It seems that Client methods like insert_rows or insert_rows_json use the insertAll method underneath, which relies on the streaming mechanism.
Is there a way to invoke a "standard" BigQuery insert from Python code to insert a row larger than 1 MB? It would be a rather rare occurrence, so I am not concerned about quotas on the daily table insert count.
The client library cannot get around the API limits.
See the current quotas: as of this writing, a streamed row cannot be larger than 1 MB.
The workaround we used is to save records as NDJSON to GCS in 100 MB batches (we use the gcsfs library) and then execute a BigQuery load job.
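A minimal sketch of that load path using the google-cloud-bigquery client (the bucket, object, and table names are placeholders); load jobs are not subject to the 1 MB per-row streaming limit:

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    # The NDJSON batch was previously written to GCS (e.g. via gcsfs).
    load_job = client.load_table_from_uri(
        "gs://my-bucket/batches/batch-0001.ndjson",
        "my-project.my_dataset.my_table",
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to finish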
I have actually just logged a feature request here to increase the limit as this is very limiting. If interested, make sure to "star" it to gain traction.
I'm using Google Cloud SQL to run advanced searches on people data and fetch lists of users. In the datastore, data is already stored with two models: the first tracks the users' current data, and the other tracks the historical timeline. The current data stored in Cloud SQL is more than a million rows across all users. Now I want to implement advanced search on historical data, including between dates, by adding all the history data to the cloud.
Can anyone suggest a better structure for this historical model? I've gone through many links and articles but cannot find a proper solution, as I have to take care of search performance (the current search returns results in normal time, but when history is fetched it scans all the records, which slows queries down because of the complex JOINs needed). The queries that fetch data from Cloud SQL are built dynamically based on the user's needs. For example, a user wants the list of employees whose manager is "xyz.123#abc.in"; Python code builds the query accordingly. Now a user wants to find users whose manager WAS "xyz.123#abc.in" with effectiveFrom 2016-05-02 to 2017-01-01.
I have found some possible structures, as below:
1) Same model as the current structure, with a new flag column isCurrentData (whether the row is history or active).
Disadvantages:
- Queries slow down when fetching data, as they will scan all records.
- Duplication of data might increase.
All these disadvantages hurt the performance of advanced search by increasing query time.
A solution to this problem is to partition the whole table into different tables.
2) Partition based on year.
As time passes, this will generate too many tables.
3) Maintain two tables:
the first for current data and the second for history. But when a user wants to search across both models, building the query becomes more complex.
So, I need suggestions for structuring the historical timeline with improved performance and effective data handling.
Thanks in advance.
Depending on how often you want to do live queries vs historical queries and the size of your data set, you might want to consider placing the historical data elsewhere.
For example, if you need quick queries for live data and do many of them, but can handle higher-latency queries and only execute them sometimes, you might consider periodically exporting data to Google BigQuery. BigQuery can be useful for searching a large corpus of data, but it has much higher latency and doesn't have a MySQL-compatible wire protocol (although its query language will look familiar to anyone who knows any flavor of SQL). In addition, while for Cloud SQL you pay for data storage and the amount of time your database is running, in BigQuery you mostly pay for data storage and the amount of data scanned during your query executions. Therefore, if you plan on executing many of these historical queries, it may get a little expensive.
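For instance, the "manager WAS x between two dates" search from the question could run as a parameterized BigQuery query from Python; the dataset, table, and column names here are assumptions about how the exported history might be modeled:

    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT employee_email
        FROM `my-project.hr_history.manager_assignments`
        WHERE manager_email = @manager
          AND effective_from BETWEEN @start AND @end
    """
    job = client.query(
        query,
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("manager", "STRING", "xyz.123#abc.in"),
                bigquery.ScalarQueryParameter("start", "DATE", "2016-05-02"),
                bigquery.ScalarQueryParameter("end", "DATE", "2017-01-01"),
            ]
        ),
    )
    for row in job.result():
        print(row.employee_email)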
Also, if you don't have a very large data set, BigQuery may be a bit of overkill. How large is your "live" data set, and how large do you expect your "historical" data set to grow over time? Is it possible to just increase the size of the Cloud SQL instance as the historical data grows, until the point at which it makes sense to start exporting to BigQuery?
@Kevin Malachowski: Thanks for guiding me with your info and questions; it gave me a new way of thinking.
Historical data will be 0.3-0.5 million records at most. I'll now use BigQuery for the historical advanced search.
For live data, Cloud SQL will be used, as we must focus on the performance of fetched data.
There will be some performance issues for historical search when a user wants results from both live and historical data (BigQuery takes about 5-6 seconds, or more, in the worst case), but it will be optimized according to the data and the structure of the model.