Loading JSON file in BigQuery using Google BigQuery Client API

Loading JSON file in BigQuery using Google BigQuery Client API - python

Is there a way to load a JSON file from local file system to BigQuery using Google BigQuery Client API?
All the options I found are:
1- Streaming the records one by one.
2- Loading JSON data from GCS.
3- Using raw POST requests to load the JSON (i.e. not through Google Client API).

I'm assuming from the python tag that you want to do this from python. There is a load example here that loads data from a local file (it uses CSV, but it is easy to adapt it to JSON... there is another json example in the same directory).
The basic flow is:
# Load configuration with the destination specified.
load_config = {
'destinationTable': {
'projectId': PROJECT_ID,
'datasetId': DATASET_ID,
'tableId': TABLE_ID
}
}
load_config['schema'] = {
'fields': [
{'name':'string_f', 'type':'STRING'},
{'name':'boolean_f', 'type':'BOOLEAN'},
{'name':'integer_f', 'type':'INTEGER'},
{'name':'float_f', 'type':'FLOAT'},
{'name':'timestamp_f', 'type':'TIMESTAMP'}
]
}
load_config['sourceFormat'] = 'NEWLINE_DELIMITED_JSON'
# This tells it to perform a resumable upload of a local file
# called 'foo.json'
upload = MediaFileUpload('foo.json',
mimetype='application/octet-stream',
# This enables resumable uploads.
resumable=True)
start = time.time()
job_id = 'job_%d' % start
# Create the job.
result = jobs.insert(
projectId=project_id,
body={
'jobReference': {
'jobId': job_id
},
'configuration': {
'load': load
}
},
media_body=upload).execute()
# Then you'd also want to wait for the result and check the status. (check out
# the example at the link for more info).

Related

Replacing BCP utility with microservice

My team want to implement a ASP.NET CORE Web API based micro service with a plan to replace bulk copy program utility. Currently we are using BCP utility to return 200,000 rows with 30 columns. The data is returned in csv format.
We created a restful endpoint and using ADO.NET we are connecting to SQL server to extract same volume of data. Here is the code:
using (SqlConnection myConnection = new SqlConnection(con))
{
string oString = "Select * from Employees where runid = 1";
SqlCommand oCmd = new SqlCommand(oString, myConnection);
myConnection.Open();
using (SqlDataReader oReader = oCmd.ExecuteReader())
{
while (oReader.Read())
{
//Read data here
}
}
}
With this code, I am getting memory exceptions..
What is the best way to fix this issue considering in future I will get request to return higher volume of data with more users making simultaneous request. I am open to implementing this solution using C#, Java, Python or NodeJs.

The following code streams the data directly from the database, so it should be quite memory efficient and performant. It is using the Sylvan Csv library functionality to create csv records directly from the SqlDataReader.
using Microsoft.AspNetCore.Mvc;
using Microsoft.AspNetCore.Mvc.Infrastructure;
using Microsoft.Data.SqlClient;
using Sylvan.Data.Csv;
...
[HttpGet(Name = "Get")]
public IActionResult Get()
{
return new FileCallbackResult("text/csv", async (outputStream, _) =>
{
using (var myConnection = new SqlConnection(_configuration["ConnectionStrings"]))
{
var cmdText = "Select * from Employees where runid = 1";
var command = new SqlCommand(cmdText, myConnection);
myConnection.Open();
using (SqlDataReader oReader = await command.ExecuteReaderAsync())
{
var streamWriter = new StreamWriter(outputStream);
var csvDataWriter = CsvDataWriter.Create(streamWriter);
await csvDataWriter.WriteAsync(oReader);
}
}
})
{
FileDownloadName = "employees.csv"
};
}
FileCallbackResult is from: https://github.com/StephenClearyExamples/AsyncDynamicZip/blob/net6-ziparchive/Example/src/WebApplication/FileCallbackResult.cs
You can read about it here: https://blog.stephencleary.com/2016/11/streaming-zip-on-aspnet-core.html

How to set the Doc Id with Firebase REST API

Is there a way to manually create a docId when inserting a document into Firestore?
The following Python3 code will insert a new document in Firestore with an auto-generated docId.
import requests
import json
project_id = 'MY_PROJECT_NAME'
web_api_key = 'MY_WEB_API_KEY'
collection_name = 'MY_COLLECTION_NAME'
url = f'https://firestore.googleapis.com/v1/projects/{project_id}/databases/(default)/documents/{collection_name}/?key={web_api_key}'
payload = {
'fields': {
'title': { 'stringValue': 'myTitle' },
'category': { 'stringValue': 'myCat' },
'temperature': { 'doubleValue': 75 }
}
}
headers = {'Content-type': 'application/json', 'Accept': 'text/plain'}
response = requests.post(url, data=json.dumps(payload), headers=headers)
response_dict = json.loads(response.content)
for i in response_dict:
print(f'{i}: {response_dict[i]}')
In case anyone else wants to use this code in the future, to get a Web API key, go to Google Cloud Platform > APIs & Services > Credentials > Create Credentials > API key, then copy the value it generates here.
Thanks,
Ryan

answered in Google Cloud Firestore REST API createDocument auto genarates ID documentId should be added as a query parameter when document is created
So for the code in the question, just url should be changed with the same body:
url = f'https://firestore.googleapis.com/v1/projects/{project_id}/databases/(default)/documents/{collection_name}?documentId={your_custom_doc_id}&key={web_api_key}'

To set a custom document ID, you only need to append the name to the URL path after the respective collection. This is similar to how the document reference works where the path must point to the desired location.
From the documentation:
https://firestore.googleapis.com/v1/projects/YOUR_PROJECT_ID/databases/(default)/documents/cities/LA

Firebase Realtime Database Adding/Removing/Updating/Retrieving Data

So, I wanted to move my local database to the Firebase by using Firebase's Realtime Database feature, however, I am struggling a bit as I am completely new to Firebase, and I am using the library called 'pyrebase'
What I am looking for:
database {
userid1 {
mail:"email1"
},
userid2 {
mail:"email2"
},
userid3 {
mail:"email3"}
...
}
My first question is regarding to how to create such structure using Firebase?
If such structure in the realtime database was accomplished, how to update any specific userid's data?
If wanted any of the user to be deleted from the system by just using their userid, how is it done?
And lastly, which is very important, if wanted to retrieve any of the user emails by looking through their userid, how is it retrieved?
What I have done so far:
I have created the realtime database so far
Downloaded and integrated the credentials
p.s literally in need of source related to Firebase.

So, I have finally figured out how to do all of these as shown below:
Inserting Data:
from firebase import firebase
firebase = firebase.FirebaseApplication('https://xxxxx.firebaseio.com/', None)
data = { 'Name': 'Vivek',
'RollNo': 1,
'Percentage': 76.02
}
result = firebase.post('/python-sample-ed7f7/Students/',data)
print(result)
Retrieving Data:
from firebase import firebase
firebase = firebase.FirebaseApplication('https://xxxx.firebaseio.com/', None)
result = firebase.get('/python-sample-ed7f7/Students/', '')
print(result)
Updating Data:
from firebase import firebase
firebase = firebase.FirebaseApplication('https://xxxx.firebaseio.com/', None)
firebase.put('/python-sample-ed7f7/Students/-LAgstkF0DT5l0IRucvm','Percentage',79)
print('updated')
Delete Data:
from firebase import firebase
firebase = firebase.FirebaseApplication('https://xxxx.firebaseio.com/', None)
firebase.delete('/python-sample-ed7f7/Students/', '-LAgt5rGRlPovwNhOsWK')
print('deleted')

Extracting BigQuery Data From a Shared Dataset

Is it possible to extract data (to google cloud storage) from a shared dataset (where I have only have view permissions) using the client APIs (python)?
I can do this manually using the web browser, but cannot get it to work using the APIs.
I have created a project (MyProject) and a service account for MyProject to use as credentials when creating the service using the API. This account has view permissions on a shared dataset (MySharedDataset) and write permissions on my google cloud storage bucket. If I attempt to run a job in my own project to extract data from the shared project:
job_data = {
'jobReference': {
'projectId': myProjectId,
'jobId': str(uuid.uuid4())
},
'configuration': {
'extract': {
'sourceTable': {
'projectId': sharedProjectId,
'datasetId': sharedDatasetId,
'tableId': sharedTableId,
},
'destinationUris': [cloud_storage_path],
'destinationFormat': 'AVRO'
}
}
}
I get the error:
googleapiclient.errors.HttpError: https://www.googleapis.com/bigquery/v2/projects/sharedProjectId/jobs?alt=json
returned "Value 'myProjectId' in content does not agree with value
sharedProjectId'. This can happen when a value set through a parameter
is inconsistent with a value set in the request.">
Using the sharedProjectId in both the jobReference and sourceTable I get:
googleapiclient.errors.HttpError: https://www.googleapis.com/bigquery/v2/projects/sharedProjectId/jobs?alt=json
returned "Access Denied: Job myJobId: The user myServiceAccountEmail
does not have permission to run a job in project sharedProjectId">
Using myProjectId for both the job immediately comes back with a status of 'DONE' and with no errors, but nothing has been exported. My GCS bucket is empty.
If this is indeed not possible using the API, is there another method/tool that can be used to automate the extraction of data from a shared dataset?
* UPDATE *
This works fine using the API explorer running under my GA login. In my code I use the following method:
service.jobs().insert(projectId=myProjectId, body=job_data).execute()
and removed the jobReference object containing the projectId
job_data = {
'configuration': {
'extract': {
'sourceTable': {
'projectId': sharedProjectId,
'datasetId': sharedDatasetId,
'tableId': sharedTableId,
},
'destinationUris': [cloud_storage_path],
'destinationFormat': 'AVRO'
}
}
}
but this returns the error
Access Denied: Table sharedProjectId:sharedDatasetId.sharedTableId: The user 'serviceAccountEmail' does not have permission to export a table in
dataset sharedProjectId:sharedDatasetId
My service account now is an owner on the shared dataset and has edit permissions on MyProject, where else do permissions need to be set or is it possible to use the python API using my GA login credentials rather than the service account?
* UPDATE *
Finally got it to work. How? Make sure the service account has permissions to view the dataset (and if you don't have access to check this yourself and someone tells you that it does, ask them to double check/send you a screenshot!)

After trying to reproduce the issue, I was running into the parse errors.
I did how ever play around with the API on the Developer Console [2] and it worked.
What I did notice is that the request code below had a different format than the documentation on the website as it has single quotes instead of double quotes.
Here is the code that I ran to get it to work.
{
'configuration': {
'extract': {
'sourceTable': {
'projectId': "sharedProjectID",
'datasetId': "sharedDataSetID",
'tableId': "sharedTableID"
},
'destinationUri': "gs://myBucket/myFile.csv"
}
}
}
HTTP Request
POST https://www.googleapis.com/bigquery/v2/projects/myProjectId/jobs
If you are still running into problems, you can try the you can try the jobs.insert API on the website [2] or try the bq command tool [3].
The following command can do the same thing:
bq extract sharedProjectId:sharedDataSetId.sharedTableId gs://myBucket/myFile.csv
Hope this helps.
[2] https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert
[3] https://cloud.google.com/bigquery/bq-command-line-tool

Make sure the service account has permissions to view the dataset (and if you don't have access to check this yourself and someone tells you that it does, ask them to double check/send you a screenshot!)

How to create a commit and push into repo with GitHub API v3?

I want to create a repository and Commit a few files to it via any Python package. How do I do?
I do not understand how to add files for commit.

Solution using the requests library:
NOTES: I use the requests library to do the calls to GitHub REST API v3.
1. Get the last commit SHA of a specific branch
# GET /repos/:owner/:repo/branches/:branch_name
last_commit_sha = response.json()['commit']['sha']
2. Create the blobs with the file's content (encoding base64 or utf-8)
# POST /repos/:owner/:repo/git/blobs
# {
# "content": "aGVsbG8gd29ybGQK",
# "encoding": "base64"
#}
base64_blob_sha = response.json()['sha']
# POST /repos/:owner/:repo/git/blobs
# {
# "content": "hello world",
# "encoding": "utf-8"
#}
utf8_blob_sha = response.json()['sha']
3. Create a tree which defines the folder structure
# POST repos/:owner/:repo/git/trees/
# {
# "base_tree": last_commit_sha,
# "tree": [
# {
# "path": "myfolder/base64file.txt",
# "mode": "100644",
# "type": "blob",
# "sha": base64_blob_sha
# },
# {
# "path": "file-utf8.txt",
# "mode": "100644",
# "type": "blob",
# "sha": utf8_blob_sha
# }
# ]
# }
tree_sha = response.json()['sha']
4. Create the commit
# POST /repos/:owner/:repo/git/commits
# {
# "message": "Add new files at once programatically",
# "author": {
# "name": "Jan-Michael Vincent",
# "email": "JanQuadrantVincent16#rickandmorty.com"
# },
# "parents": [
# last_commit_sha
# ],
# "tree": tree_sha
# }
new_commit_sha = response.json()['sha']
5. Update the reference of your branch to point to the new commit SHA (on master branch example)
# POST /repos/:owner/:repo/git/refs/heads/master
# {
# "ref": "refs/heads/master",
# "sha": new_commit_sha
# }
Finally, for a more advanced setup read the docs.

You can see if the new update GitHub CRUD API (May 2013) can help
The repository contents API has allowed reading files for a while. Now you can easily commit changes to single files, just like you can in the web UI.
Starting today, these methods are available to you:
File Create
File Update
File Delete

Here is a complete snippet:
def push_to_github(filename, repo, branch, token):
url="https://api.github.com/repos/"+repo+"/contents/"+filename
base64content=base64.b64encode(open(filename,"rb").read())
data = requests.get(url+'?ref='+branch, headers = {"Authorization": "token "+token}).json()
sha = data['sha']
if base64content.decode('utf-8')+"\n" != data['content']:
message = json.dumps({"message":"update",
"branch": branch,
"content": base64content.decode("utf-8") ,
"sha": sha
})
resp=requests.put(url, data = message, headers = {"Content-Type": "application/json", "Authorization": "token "+token})
print(resp)
else:
print("nothing to update")
token = "lskdlfszezeirzoherkzjehrkzjrzerzer"
filename="foo.txt"
repo = "you/test"
branch="master"
push_to_github(filename, repo, branch, token)

Github provides a Git database API that gives you access to read and write raw objects and to list and update your references (branch heads and tags). For a better understanding of the topic, I would highly recommend you reading Git Internals chapter of Pro Git book.
As per the documentation, it is a 7 steps process to commit a change to a file in your repository:
get the current commit object
retrieve the tree it points to
retrieve the content of the blob object that tree has for that particular file path
change the content somehow and post a new blob object with that new content, getting a blob SHA back
post a new tree object with that file path pointer replaced with your new blob SHA getting a tree SHA back
create a new commit object with the current commit SHA as the parent and the new tree SHA, getting a commit SHA back
update the reference of your branch to point to the new commit SHA
This blog does a great job at explaining this process using perl. For a python implementation, you can use PyGithub library.

Based on previous answer, here is a complete example. Note that you need to use POST if you upload the commit to a new branch, or PATCH to upload to an existing one.
import whatsneeded
GITHUB_TOKEN = "WHATEVERWILLBEWILLBE"
def github_request(method, url, headers=None, data=None, params=None):
"""Execute a request to the GitHUB API, handling redirect"""
if not headers:
headers = {}
headers.update({
"User-Agent": "Agent 007",
"Authorization": "Bearer " + GITHUB_TOKEN,
})
url_parsed = urllib.parse.urlparse(url)
url_path = url_parsed.path
if params:
url_path += "?" + urllib.parse.urlencode(params)
data = data and json.dumps(data)
conn = http.client.HTTPSConnection(url_parsed.hostname)
conn.request(method, url_path, body=data, headers=headers)
response = conn.getresponse()
if response.status == 302:
return github_request(method, response.headers["Location"])
if response.status >= 400:
headers.pop('Authorization', None)
raise Exception(
f"Error: {response.status} - {json.loads(response.read())} - {method} - {url} - {data} - {headers}"
)
return (response, json.loads(response.read().decode()))
def upload_to_github(repository, src, dst, author_name, author_email, git_message, branch="heads/master"):
# Get last commit SHA of a branch
resp, jeez = github_request("GET", f"/repos/{repository}/git/ref/{branch}")
last_commit_sha = jeez["object"]["sha"]
print("Last commit SHA: " + last_commit_sha)
base64content = base64.b64encode(open(src, "rb").read())
resp, jeez = github_request(
"POST",
f"/repos/{repository}/git/blobs",
data={
"content": base64content.decode(),
"encoding": "base64"
},
)
blob_content_sha = jeez["sha"]
resp, jeez = github_request(
"POST",
f"/repos/{repository}/git/trees",
data={
"base_tree":
last_commit_sha,
"tree": [{
"path": dst,
"mode": "100644",
"type": "blob",
"sha": blob_content_sha,
}],
},
)
tree_sha = jeez["sha"]
resp, jeez = github_request(
"POST",
f"/repos/{repository}/git/commits",
data={
"message": git_message,
"author": {
"name": author_name,
"email": author_email,
},
"parents": [last_commit_sha],
"tree": tree_sha,
},
)
new_commit_sha = jeez["sha"]
resp, jeez = github_request(
"PATCH",
f"/repos/{repository}/git/refs/{branch}",
data={"sha": new_commit_sha},
)
return (resp, jeez)

I'm on Google App Engine (GAE) so beside of python, I can create a new file, update it, even delete it via a commit and push into my repo in GitHub with GitHub API v3 in php, java and go.
Checking and reviewing some of the available third party libraries to create like the example script that presented in perl, I would recommend to use the following:
PyGithub (python)
GitHub API for php (php)
GitHub API for Java (java)
go-github (go)
As you aware, you can get one site per GitHub account and organization, and unlimited project sites where the websites are hosted directly from your repo and powered by Jekyll as default.
Combining Jekyll, Webhooks, and GitHub API Script on GAE, along with an appropriate GAE Setting, it will give you a wide possibility like calling external script and create a dynamic page on GitHub.
Other than GAE, there is also an option run it on Heroku. Use JekyllBot that lives on a (free) Heroku instance to silently generates JSON files for each post and pushing the changes back to GitHub.

I have created an example for committing with multiple files using Python:
import datetime
import os
import github
# If you run this example using your personal token the commit is not going to be verified.
# It only works for commits made using a token generated for a bot/app
# during the workflow job execution.
def main(repo_token, branch):
gh = github.Github(repo_token)
repository = "josecelano/pygithub"
remote_repo = gh.get_repo(repository)
# Update files:
# data/example-04/latest_datetime_01.txt
# data/example-04/latest_datetime_02.txt
# with the current date.
file_to_update_01 = "data/example-04/latest_datetime_01.txt"
file_to_update_02 = "data/example-04/latest_datetime_02.txt"
now = datetime.datetime.now()
file_to_update_01_content = str(now)
file_to_update_02_content = str(now)
blob1 = remote_repo.create_git_blob(file_to_update_01_content, "utf-8")
element1 = github.InputGitTreeElement(
path=file_to_update_01, mode='100644', type='blob', sha=blob1.sha)
blob2 = remote_repo.create_git_blob(file_to_update_02_content, "utf-8")
element2 = github.InputGitTreeElement(
path=file_to_update_02, mode='100644', type='blob', sha=blob2.sha)
commit_message = f'Example 04: update datetime to {now}'
branch_sha = remote_repo.get_branch(branch).commit.sha
base_tree = remote_repo.get_git_tree(sha=branch_sha)
tree = remote_repo.create_git_tree([element1, element2], base_tree)
parent = remote_repo.get_git_commit(sha=branch_sha)
commit = remote_repo.create_git_commit(commit_message, tree, [parent])
branch_refs = remote_repo.get_git_ref(f'heads/{branch}')
branch_refs.edit(sha=commit.sha)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Loading JSON file in BigQuery using Google BigQuery Client API - python

Is there a way to load a JSON file from local file system to BigQuery using Google BigQuery Client API? All the options I found are: 1- Streaming the records one by one. 2- Loading JSON data from GCS. 3- Using raw POST requests to load the JSON (i.e. not through Google Client API).

Related

Replacing BCP utility with microservice

How to set the Doc Id with Firebase REST API

Firebase Realtime Database Adding/Removing/Updating/Retrieving Data

Extracting BigQuery Data From a Shared Dataset

How to create a commit and push into repo with GitHub API v3?

Categories

Resources