Is there an easy way, or an example, to load Google Cloud Storage data into Bigtable?
I have lots of JSON files generated by PySpark, and I wish to load the data into Bigtable.
But I cannot find an easy way to do that!
I have tried the Python code from google-cloud-python, and it worked fine, but it reads data into Bigtable line by line, which seemed strange to me.
Any help would be greatly appreciated.
There is no simple tool to import data into Cloud Bigtable. Here are some options:
Import the files using Dataflow. This requires Java development, and learning the Dataflow programming model.
Use Python (possibly with PySpark) to read those JSON files, and write to Cloud Bigtable using a method called mutate_rows, which writes to Bigtable in bulk.
FYI, I work on the Cloud Bigtable team. I'm a Java developer, so I opt for #1. Our team has been working to improve our Python experience. The extended team recently added some reliability improvements to make sure that mutate_rows is resilient for large jobs. We do not yet have any good examples of integrating with PySpark or Apache Beam's Python SDK, but they are on our radar.
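A rough sketch of option #2: batch the parsed JSON records and commit each batch with a single mutate_rows call. The column family name "cf1" and the "row_key" field are assumptions; the Bigtable client calls are shown but commented into the helper so the sketch stays runnable without credentials.

```python
import json

def chunked(items, size):
    """Yield successive fixed-size batches so each bulk write stays bounded."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def load_records(table, records, batch_size=500):
    # `table` is a google.cloud.bigtable Table; "cf1" / "row_key" are assumed names.
    for batch in chunked(records, batch_size):
        rows = []
        for rec in batch:
            row = table.direct_row(rec["row_key"])
            row.set_cell("cf1", "payload", json.dumps(rec).encode("utf-8"))
            rows.append(row)
        table.mutate_rows(rows)  # one bulk write per batch, not row by row
```

Batching keeps each request a bounded size, which is what makes mutate_rows resilient for large jobs compared with writing one row at a time.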
Related
I am currently trying to help a small business transition from using Google Sheets as a database to something more robust and scalable, preferably staying within Google services. I've looked into Google Cloud Storage and BigQuery; however, there are employees who need to manually enter new data, so anything purely in GCP won't be user-friendly for non-technical people. I was thinking of having employees still manually update the Google Sheets, and writing a Python program to automatically update GCS or BigQuery, but the issue is that Google Sheets is extremely slow and cannot handle the amount of data that's currently stored in it.
Has anyone faced a similar issue and have any ideas/suggestions? Thank you so much in advance :)
What you might be able to do is save the Google Sheet as a .csv file and then import it into BigQuery. From there, maybe the employees can use simple commands to insert data. Please note that this question is very opinion-based, and anyone can suggest various ways to achieve what you want.
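The CSV route above can be sketched roughly as follows. The export URL pattern is the standard Google Sheets CSV export endpoint; the sheet ID and table names are placeholders, and the BigQuery load call is left as a comment since it needs credentials.

```python
import csv
import io

def export_url(sheet_id):
    """Build the CSV export URL for a Google Sheet (sheet_id is a placeholder)."""
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"

def parse_rows(csv_text):
    """Turn CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

# The parsed rows could then be loaded with the BigQuery client library, e.g.:
# client.load_table_from_json(rows, "my_dataset.my_table")  # requires credentials
```

A scheduled job running this on an interval would let employees keep working in the sheet while BigQuery holds the authoritative, queryable copy.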
Create a web app hosted on App Engine for the front end and data entry, and connect it to Cloud SQL or BigQuery as a backend. You can create a UI in your web app where employees can access data from Cloud SQL/BigQuery if needed. Alternatively, you can use Google Forms for data entry and connect it to Cloud SQL.
I would like to display Google Analytics and Google Search Console data directly in Superset through their APIs.
Make direct queries to the Google Analytics API in JSON (instead of storing the results in my database and then showing them in Superset) and show the result in Superset
Make direct queries to the Google Search Console API in JSON and show the result in Superset
Make direct queries to other amazing JSON APIs and show the result in Superset
How can I do so?
I couldn't find a Google Analytics datasource. I couldn't find a Google Search Console datasource either.
I can't find a way to display in Superset data retrieved from an API, only data stored in a database. I must be missing something, but I can't find anything in the docs related to authenticating & querying external APIs.
Superset can't query external data APIs directly. Superset has to work with a supported database or data engine (https://superset.incubator.apache.org/installation.html#database-dependencies). This means that you need to find a way to fetch data out of the API and store it in a supported database / data engine. Some options:
Build a little Python pipeline that will query the data API, flatten the data to something tabular / relational, and upload that data to a supported data source - https://superset.incubator.apache.org/installation.html#database-dependencies - and set up Superset so it can talk to that database / data engine.
For more robust solutions, you may want to work with a devops / infrastructure team to stand up a workflow scheduler like Apache Airflow (https://airflow.apache.org/) to regularly ping this API and store the results in a database of some kind that Superset can talk to.
If you want to regularly query data from a popular 3rd party API, I also recommend checking out Meltano and learning more about Singer taps. These will handle some of the heavy lifting of fetching data from an API regularly and storing it in a database like Postgres. The good news is that there's a Singer tap for Google Analytics - https://github.com/singer-io/tap-google-analytics
Either way, Superset is just a thin layer above your database / data engine. So there’s no way around the reality that you need to find a way to extract data out of an API and store it in a compatible data source.
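The "flatten the data to something tabular" step above can be illustrated for the Google Analytics Reporting API v4, whose batchGet responses nest dimensions and metrics. The function below is a sketch, not part of any library; the resulting flat dicts could be inserted into Postgres for Superset to query.

```python
def flatten_report(report):
    """Flatten one GA Reporting API v4 report into a list of flat row dicts."""
    header = report["columnHeader"]
    dims = header.get("dimensions", [])
    metrics = [m["name"] for m in header["metricHeader"]["metricHeaderEntries"]]
    rows = []
    for r in report.get("data", {}).get("rows", []):
        # Pair dimension names with this row's dimension values...
        flat = dict(zip(dims, r.get("dimensions", [])))
        # ...then add the metric values from the first date range.
        flat.update(zip(metrics, r["metrics"][0]["values"]))
        rows.append(flat)
    return rows
```

Each flat dict maps cleanly to one table row, so loading into a relational store (and hence into Superset) becomes a plain bulk insert.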
There is a project named shillelagh by one of Superset's contributors. It gives a SQL interface to REST APIs. This same package is used in Apache Superset to connect with gsheets.
New adapters are relatively easy to implement. There's a step-by-step tutorial that explains how to create a new adapter to an API or file type in shillelagh.
Under the hood, shillelagh uses SQLite virtual tables via the SQLite wrapper APSW.
Redash is an alternative to Superset for that task, but it doesn't have the same features. Here is a comparison of the integrations for both tools: https://discuss.redash.io/t/a-comparison-of-redash-and-superset/1503
A quick alternative is paying for a third party service like: https://www.stitchdata.com/integrations/google-analytics/superset/
There is no such connector available by default.
A recommended solution would be to store your Google Analytics and Search Console data in a database; you could write a script that pulls data every 4 hours or at whatever interval works for you.
Also, you shouldn't store all the data, only the dimensions/metrics you wish to see in your reports.
I'm researching a good way to get my SPSS model logic into my website in real time. Currently we have a shoddy Python script that mimics what the SPSS model does with the data. The problem is that whenever we make an update or optimization to the SPSS model, we have to go in and adjust the Python code. How do people usually solve this?
I've gotten a suggestion to create a config file for all the frequently updated functions in SPSS to translate over to the current Python script. We're open to completely generating the Python script from the SPSS model, though, if there's a way to do that.
I've looked into the cursor method, but it seems to have its main value in automating SPSS with Python, which isn't really what we need.
You may want to look into the use of the Watson Machine Learning service on IBM Bluemix.
https://console.bluemix.net/catalog/services/machine-learning?taxonomyNavigation=data
The SPSS Streams Service that is part of this allows you to take your stream, deploy it, and then call it via a REST API.
Note that the Streams Service does not support legacy R nodes, Extension nodes (R or Python code), or Predictive Extensions. However, it does support Text Analytics.
The service allows up to 5,000 calls a month for free; beyond that it can be purchased on a per-prediction basis.
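As a hypothetical sketch of what calling such a scoring endpoint from the website could look like: the endpoint URL, token handling, and the fields/values payload shape below are all assumptions, so check the service documentation for the actual request format.

```python
import json
import urllib.request

def build_payload(field_names, rows):
    """Package input rows in a tabular fields/values shape (assumed format)."""
    return {"fields": field_names, "values": rows}

def score(endpoint, token, payload):
    """POST the payload to the deployed stream's REST endpoint (hypothetical URL)."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:  # network call; needs a real endpoint
        return json.load(resp)
```

The point of this setup is that redeploying the stream updates the scoring behavior behind the endpoint, so the website code calling score() never needs to change when the SPSS model does.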
I would like to do some machine learning tasks on data as it comes in through Stream Analytics from Event Hub. However, much of my data-processing pipeline and prediction service is in Python. Is there a way to send time-chunked data into the Python script for processing?
The Azure ML Studio function does not suit my need because it appears to work on single rows of data, and the aggregation functions available in Stream Analytics don't seem to work for this data.
With the recently rolled out integration with Azure Functions, you might be able to do that. Try that route.
This link describes creating an Azure Function. You will have to create an HTTP trigger and choose Python as the language. There are also templates for several languages.
This question also has additional details about Functions.
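This is not an Azure-specific API, just an illustration of the time-chunking the question asks about: inside such a function, timestamped events can be grouped into fixed windows before each batch is handed to the Python prediction code.

```python
def window_events(events, window_seconds):
    """Group (timestamp, value) pairs into consecutive buckets of window_seconds."""
    buckets = {}
    for ts, value in events:
        key = int(ts // window_seconds)  # which window this event falls in
        buckets.setdefault(key, []).append(value)
    # Return the batches in time order; each batch is one "chunk" to predict on.
    return [buckets[k] for k in sorted(buckets)]
```

Usage: window_events with a 60-second window turns a stream of per-event rows into per-minute batches, which sidesteps the single-row limitation mentioned above.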
In my experience, you could put your data into Azure Storage, then configure the Import Data module in Azure ML and connect an Execute Python Script module to it as input data.
Or you could use the Azure Storage Python SDK to query the data inside your Execute Python Script directly.
However, the two methods mentioned above can only process part of the data at a time, so they should be used only at the experimental stage.
If you need to process data continuously, I suggest you use the web service component.
You could put the logic for querying data and processing results into the web service. Please refer to this official tutorial to deploy your web service.
Hope this helps.
I have a Google Analytics (GA) account which tracks the user activity of an app. I got BigQuery set up so that I can access the raw GA data. Data is coming in from GA to BigQuery on a daily basis.
I have a python app which queries the BigQuery API programmatically. This app is giving me the required response, depending on what I am querying for.
My next step is to get this data from BigQuery and dump it into a Hadoop cluster. Ideally, I would like to create a Hive table from the data. I would like to build something like an ETL process around the Python app. For example, on a daily basis, I run the ETL process, which runs the Python app and also exports the data to the cluster.
Eventually, this ETL process should be put on Jenkins and should be able to run on production systems.
What architecture/design/general factors would I need to consider while planning for this ETL process?
Any suggestions on how I should go about this? I am interested in doing this in the most simple and viable way.
Thanks in advance.
The easiest way to go from BigQuery to Hadoop is to use the official Google BigQuery Connector for Hadoop:
https://cloud.google.com/hadoop/bigquery-connector
This connector defines a BigQueryInputFormat class, which:
Writes a query to select the appropriate BigQuery objects.
Splits the results of the query evenly among the Hadoop nodes.
Parses the splits into Java objects to pass to the mapper. The Hadoop Mapper class receives a JsonObject representation of each selected BigQuery object.
(It uses Google Cloud Storage as an intermediary between BigQuery's data and the splits that Hadoop consumes.)
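Since the rest of your pipeline is in Python, the connector can also be driven from PySpark. The sketch below only builds the job configuration; the project, dataset, table, and bucket names are placeholders, and the actual read (commented out) needs the connector jar on the cluster.

```python
def bq_hadoop_conf(project, dataset, table, bucket):
    """Hadoop job configuration for the BigQuery connector (placeholder names)."""
    return {
        "mapred.bq.project.id": project,
        "mapred.bq.gcs.bucket": bucket,  # GCS staging bucket (the intermediary)
        "mapred.bq.input.project.id": project,
        "mapred.bq.input.dataset.id": dataset,
        "mapred.bq.input.table.id": table,
    }

# On a cluster with a SparkContext `sc` and the connector jar installed:
# rdd = sc.newAPIHadoopRDD(
#     "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
#     "org.apache.hadoop.io.LongWritable",
#     "com.google.gson.JsonObject",
#     conf=bq_hadoop_conf("my-project", "ga_export", "daily", "staging-bucket"))
```

From the resulting RDD of JSON records, writing out to a Hive table is a normal Spark job, which fits the daily ETL process described above.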
Check out Oozie. It seems to fit your requirements: it has a workflow engine, scheduling support, and support for shell scripts and Hive.
In terms of installation and deployment, it's usually part of a Hadoop distribution, but it can be installed separately. It depends on a database as a persistence layer, which may require some extra effort.
It has a web UI and a REST API. Managing and monitoring jobs can be automated if desired.