I'm using google cloudSQL for applying advance search on people data to fetch the list of users. In datastore, there are data already stored there with 2 model. First is used to track current data of users and other model is used to track historical timeline. The current data is stored on google cloudSQL are more than millions rows for all users. Now I want to implement advance search on historical data including between dates by adding all history data to cloud.
If anyone can suggest the better structure for this historical model as I've gone through many of the links and articles. But cannot find proper solution as I have to take care of the performance for search (In Current search, the time is taken to fetch result is normal but when history is fetched, It'll scan all the records which causes slowdown of queries because of complex JOINs as needed). The query that is used to fetch the data from cloudSQL are made dynamically based on the users' need. For example, A user want the employees list whose manager is "xyz.123#abc.in" , by using python code, the query will built accordingly. Now a user want to find users whose manager WAS "xyz.123#abc.in" with effectiveFrom 2016-05-02 to 2017-01-01.
As I've find some of the usecases for structure as below:
1) Same model as current structure with new column flag for isCurrentData (status of data whether it is history or active)
Disadv.:
- queries slowdown while fetching data as it will scan all records.
Duplication of data might increase.
These all disadv. will affect the performance of advance search by increasing time.
Solution to this problem is to partition whole table into diff tables.
2) Partition based on year.
As time passes, this will generate too many tables.
3) 2 tables might be maintained.
1st for current data and second one for history. But when user want to search data on both models will create complexity of build query.
So, need suggestions for structuring historical timeline with improved performance and effective data handling.
Thanks in advance.
Depending on how often you want to do live queries vs historical queries and the size of your data set, you might want to consider placing the historical data elsewhere.
For example, if you need quick queries for live data and do many of them, but can handle higher-latency queries and only execute them sometimes, you might consider periodically exporting data to Google BigQuery. BigQuery can be useful for searching a large corpus of data but has much higher latency and doesn't have a wire protocol that is MySQL-compatible (although it's query language will look familiar to those who know any flavor of SQL). In addition, while for Cloud SQL you pay for data storage and the amount of time your database is running, in BigQuery you mostly pay for data storage and the amount of data scanned during your query executions. Therefore, if you plan on executing many of these historical queries it may get a little expensive.
Also, if you don't have a very large data set, BigQuery may be a bit of an overkill. How large is your "live" data set and how large do you expect your "historical" data set to grow over time? Is it possible to just increase the size of the Cloud SQL instance as the historical data grows until the point at which it makes sense to start exporting to Big Query?
#Kevin Malachowski : Thanks for guiding me with your info and questions as It gave me new way of thinking.
Historical data records will be more than 0.3-0.5 million(maximum). Now I'll use BigQuery for historical advance search.
For live data-cloudSQL will be used as we must focus on perfomance for fetched data.
Some of performance issue will be there for historical search, when a user wants both results from live as well as historical data. (BigQuery is taking time near about 5-6 sec[or more] for worst case) But it will be optimized as per data and structure of the model.
Related
Is there any solution we can retrieve Salesforce data from Python with more than 2000 records for each chunk? I have used REST API to retrieve data and check nextRecordsUrl for the next chunk. But if it is a million records, this solution will take time. I tried to find a Salesforce parameter to increase the number of records for each chunk (>2000 records) but didn't find it yet.
Another idea is if we know how many nextRecordsUrl, we can use multi-threading in Python to retrieve data. But it seems we need to submit each nextRecordsUrl to get the next one.
If you have others ideas, pls suggest them. Currently, I can't use some filter conditions in SQL to limit data.
You could look into using Bulk API query, it'd let you return data in 10K chunks. But it comes with a bit of thinking shift. Your normal API is synchronous (give me next chunk, wait, give me next chunk, wait). With Bulk API you submit the job and from time to time you ask "is it done yet".
There's even a feature called "PK chunking" (split the results by primary key)
Consider going through the trailhead: https://trailhead.salesforce.com/content/learn/modules/large-data-volumes
And maybe play with Salesforce's Data Loader. Query your stuff normal way and measure the time, then with bulk api option selected. should give you idea what's the bottleneck and whether the big rewrite will gain anything.
https://developer.salesforce.com/docs/atlas.en-us.230.0.api_asynch.meta/api_asynch/asynch_api_bulk_query_intro.htm
I'm trying to create some weather models and I want to store and retrieve data on my hard drive.
Data is in this format:
{'Date_Time':'2020-07-18 18:16:17','Temp':29.0, 'Humidity':45.3}
{'Date_Time':'2020-07-18 18:18:17','Temp':28.9, 'Humidity':45.4}
{'Date_Time':'2020-07-18 18:20:17 ','Temp':28.8, 'Humidity':48.3}
I have new data coming in every day, I have old data from ~5 years ago.
I would like to periodically merge the data sets and create one large data set to manipulate.
Things I need:
1. Check if the date-time pair already exists, else add new data
2. Change old data values
3. Add new data values to the database
4. Must be on a local storage, I have plenty of space.
Things I would like but do not need:
1. Fastest Read access possible, not so concerned about storage time as that happens in the background mostly.
2. Something that makes searching for all data from today, last 7 days etc easy to retrieve
Things I have tried:
Appending to a json file
Works for now but is slow because I have to load the entire data set every time I want to append/modify
Appending to a text file
Easy to store, but hard to modify/check values
SQLLite3
I looked into this and it seemed workable, just wanted to know if there was something better before I just go ahead and do this.
Thank you for your help!
Not sure whether it's "better" but json_database seems to do what you're looking for:
save and load from file
search recursively by key and key/value pairs
fuzzy search
supports arbitrary objects
The selection of JSON vs TXT vs SQL or NoSQL DB would be based on your current and future requirements.
From your inputs, you have data for last 5 years and the data from the example is for every 2 seconds. Based on this, it seems like you will have a large dataset or will need to prune the dataset frequently. For large datasets, using a SQL or NoSQL DB would be ideal so that you do not load all data to memory for every read/write operation.
Using the date-time as your primary key, you would be able to read-write pretty quickly using a database.
Using SQLLite is a good start but if your data is going to grow, you should plan to move to an external SQL/NoSQL database.
Seeing that your data is mostly time based, it would be good to evaluate Time Series database like InfluxDB or Graphite.
Background:
I have multiple asset tables stored in a redshift database for each city, 8 cities in total. These asset tables display status updates on an hourly basis. 8 SQL tables and about 500 mil rows of data in a year.
(I also have access to the server that updates this data every minute.)
Example: One market can have 20k assets displaying 480k (20k*24 hrs) status updates a day.
These status updates are in a raw format and need to undergo a transformation process that is currently written in a SQL view. The end state is going into our BI tool (Tableau) for external stakeholders to look at.
Problem:
The current way the data is processed is slow and inefficient, and probably not realistic to run this job on an hourly basis in Tableau. The status transformation requires that I look back at 30 days of data, so I do need to look back at the history throughout the query.
Possible Solutions:
Here are some solutions that I think might work, I would like to get feedback on what makes the most sense in my situation.
Run a python script that looks at the most recent update and query the large history table 30 days as a cron job and send the result to a table in the redshift database.
Materialize the SQL view and run an incremental refresh every hour
Put the view in Tableau as a datasource and run an incremental refresh every hour
Please let me know how you would approach this problem. My knowledge is in SQL, limited Data Engineering experience, Tableau (Prep & Desktop) and scripting in Python or R.
So first things first - you say that the data processing is "slow and inefficient" and ask how to efficiently query a large database. First I'd look at how to improve this process. You indicate that the process is based on the past 30 days of data - is the large tables time sorted, vacuumed and analyzed? It is important to take maximum advantage of metadata when working with large tables. Make sure your where clauses are effective at eliminating fact table block - don't rely on dimension table where clauses to select the date range.
Next look at your distribution keys and how these are impacting the need for your critical query to move large amounts of data across the network. The internode network has the lowest bandwidth in a Redshift cluster and needlessly pushing lots of data across it will make things slow and inefficient. Using EVEN distribution can be a performance killer depending on your query pattern.
Now let me get to your question and let me paraphrase - "is it better to use summary tables, materialized views, or external storage (tableau datasource) to store summary data updated hourly?" All 3 work and each has its own pros and cons.
Summary tables are good because you can select the distribution of the data storage and if this data needs to be combined with other database tables it can be done most efficiently. However, there is more data management to be performed to keep this data up to data and in sync.
Materialized views are nice as there is a lot less management action to worry about - when the data changes, just refresh the view. The data is still in the database so is is easy to combine with other data tables but since you don't have control over storage of the data these action may not be the most efficient.
External storage is good in that the data is in your BI tool so if you need to refetch the results during the hour the data is local. However, it is not locked into your BI tool and far less efficient to combine with other database tables.
Summary data usually isn't that large so how it is stored isn't a huge concern and I'm a bit lazy so I'd go with a materialized view. Like I said at the beginning I'd first look at the "slow and inefficient" queries I'm running every hour first.
Hope this helps
I have a aggregated data table in bigquery that has millions of rows. This table is growing everyday.
I need a way to get 1 row from this aggregate table in milliseconds to append data in real time event.
What is the best way to tackle this problem?
BigQuery is not build to respond in miliseconds, so you need an other solution in between. It is perfectly fine to use BigQuery to do the large aggregration calculation. But you should never serve directly from BQ where response time is an issue of miliseconds.
Also be aware, that, if this is an web application for example, many reloads of a page, could cost you lots of money.. as you pay per Query.
There are many architectual solution to fix such issues, but what you should use is hard to tell without any project context and objectives.
For realtime data we often use PubSub to connect somewhere in between, but that might be an issue if the (near) realtime demand is an aggregrate.
You could also use materialized views concept, by exporting the aggregrated data to a sub component. For example cloud storage -> pubsub , or a SQL Instance / Memory store.. or any other kind of microservice.
We have a table in Azure Table Storage that is storing a LOT of data in it (IoT stuff). We are attempting a simple migration away from Azure Tables Storage to our own data services.
I'm hoping to get a rough idea of how much data we are migrating exactly.
EG: 2,000,000 records for IoT device #1234.
The problem I am facing is in getting a count of all the records that are present in the table with some constrains (EG: Count all records pertaining to one IoT device #1234 etc etc).
I did some fair amount of research to find posts that say that this count feature is not implemented in the ATS. These posts however, were circa 2010 to 2014.
I'm assumed (hoped) that this feature has been implemented now since it's now 2017 and I'm trying to find docs to it.
I'm using python to interact with out ATS.
Could someone please post the link to the docs here that show how I can get the count of records using python (or even HTTP / rest etc)?
Or if someone knows for sure that this feature is still unavailable, that would help me move on as well and figure another way to go about things!
Thanks in advance!
Returning number of entities in the table storage is for sure not available in Azure Table Storage SDK and service. You could make a table scan query to return all entities from your table but if you have millions of these entities the query will probably time out. it is also going to have pretty big perf impact on your table. Alternatively you could try making segmented queries in a loop until you reach the end of the table.
Or if someone knows for sure that this feature is still unavailable,
that would help me move on as well and figure another way to go about
things!
This feature is still not available or in other words as of today there's no API which will give you a count of total number of rows in a table. You would have to write your own code to do so.
Could someone please post the link to the docs here that show how I
can get the count of records using python (or even HTTP / rest etc)?
For this you would need to list all entities in a table. Since you're only interested in the count, you can reduce the size response data by making use of Query Projection and fetching just one or two attributes of the entities (may be PartitionKey and RowKey). Please see my answer here for more details: Count rows within partition in Azure table storage.