I have JSON data from an API which contains a page number called 'offset' and an 'items' field, which is nested JSON, like below:
{
    "data": {
        "offset": 0,
        "pageSize": 20,
        "items": [
            {
                "id": "6biewd5a",
                "title": "AAAAAAAAAAAAAA"
            },
            {
                "id": "er45ggg",
                "title": "BBBBBBBBBBBBBBBB"
            }
        ]
    }
}
I am creating a dataframe to write to BigQuery.
Here is my code:
import requests
from requests.auth import HTTPBasicAuth
import json
from google.cloud import bigquery
import pandas
import pandas_gbq

URL = 'xxxxxxxxxxxxxxxxxxxx'
auth = HTTPBasicAuth('name', 'password')
r = requests.get(url=URL, auth=auth)

# Extracting data in JSON format
data = r.json()
print(type(data))

offset = str(data['data']['offset'])
json_data = data['data']['items']
type(json_data)

df = pandas.DataFrame(
    {
        'offset': offset,
        'json_data': json_data
    }
)
# df['json_data'] = df['json_data'].astype('string')

client = bigquery.Client(project='ncau-data-newsquery-sit')
table_id = 'sdm_adpoint.testfapi1'
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("offset", "STRING"),
        bigquery.SchemaField("json_data", "STRING", "REPEATED")
        # , index=[0]
    ],
    autodetect=False
)

df.head()
pandas_gbq.to_gbq(df, table_id, project_id='ncau-data-newsquery-sit', if_exists='append')
I have made a dataframe out of this, like below:
[screenshot of the dataframe]
But when I write to BigQuery using
pandas_gbq.to_gbq(df, table_id, project_id='ncau-data-newsquery-sit', if_exists='append')
It throws the below error:
ArrowTypeError: Could not convert {'id': '6biewd5a',......
To overcome this, I tried this:
df['json_data'] = df['json_data'].astype('string')
It’s working, but it’s merging all rows as one, and writing each letter in a separate row.
I want data to be written as a row, the same way as displayed in a Pandas dataframe.
Your JSON schema is very similar to the schema given here in a Google Cloud Platform example:
Example schema
So you will need to change the schema definition in your code. For the items field, define the schema as below:
bigquery.SchemaField(
"items",
"RECORD",
mode="REPEATED",
fields=[
bigquery.SchemaField("id", "STRING", mode="NULLABLE"),
bigquery.SchemaField("title", "STRING", mode="NULLABLE"),
],
)
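Since the original ArrowTypeError comes from the dict objects in the dataframe column, one option is to skip the dataframe entirely and load the parsed JSON rows with the BigQuery client's load_table_from_json, letting the nested schema drive the load. A minimal sketch, assuming the data dict parsed from the API response and reusing the project and table id from the question:
from google.cloud import bigquery

client = bigquery.Client(project='ncau-data-newsquery-sit')
table_id = 'sdm_adpoint.testfapi1'

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("offset", "STRING"),
        bigquery.SchemaField(
            "items",
            "RECORD",
            mode="REPEATED",
            fields=[
                bigquery.SchemaField("id", "STRING", mode="NULLABLE"),
                bigquery.SchemaField("title", "STRING", mode="NULLABLE"),
            ],
        ),
    ],
    autodetect=False,
)

# One row per API page: the offset plus the repeated items records
rows = [{"offset": str(data['data']['offset']), "items": data['data']['items']}]
job = client.load_table_from_json(rows, table_id, job_config=job_config)
job.result()  # wait for the load job to finish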
Related
With Python I'm pulling a nested JSON, and I'm seeking to parse it via a loop and write the data to a CSV. The structure of the JSON is below. The values I'm after are in the "view" list, labeled "user_id" and "message".
{
    "view": [
        {
            "id": 109205,
            "user_id": 6354,
            "parent_id": null,
            "created_at": "2020-11-03T23:32:49Z",
            "updated_at": "2020-11-03T23:32:49Z",
            "rating_count": null,
            "rating_sum": null,
            "message": "message text",
            "replies": [
                # json continues
            ],
        }
After some study and assistance from this helpful tutorial I was able to structure requests like this:
import requests
import json
import pandas as pd
url = "URL"
headers = {'Authorization' : 'Bearer KEY'}
r = requests.get(url, headers=headers)
data = r.json()
print(data['view'][0]['user_id'])
print(data['view'][0]['message'])
Which successfully prints the outputs 6354 and "message text".
Now... how would I approach capturing all the user IDs and messages from the JSON to a CSV with Pandas?
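One possible approach (a sketch, assuming data is the parsed response from the requests call above): pull just the two fields out of the "view" list and let pandas write the CSV.
import pandas as pd

# Keep only the two fields of interest from each entry in "view"
rows = [
    {"user_id": entry["user_id"], "message": entry["message"]}
    for entry in data["view"]
]
pd.DataFrame(rows).to_csv("messages.csv", index=False)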
I can't figure out how to correctly organize the JSON data that is created from my pandas dataframe. This is my code:
with open(spreadsheetName, 'rb') as spreadsheet:
    newSheet = spreadsheet.read()
newSheet = pd.read_excel(newSheet)
exportSheet = newSheet.to_json('file.json', orient='index')
And I'd like the JSON data to look something like:
{
    "cars": [
        {
            "Model": "Camry",
            "Year": "2015"
        },
        {
            "Model": "Model S",
            "Year": "2018"
        }
    ]
}
But instead I'm getting a single line of JSON data from the code I have. Any ideas on how I can make it so that each row is a JSON 'object' with its own keys and values from the column headers (like Model and Year)?
Set the indent argument to the desired value in the to_json function.
exportSheet = newSheet.to_json('file.json', orient='index', indent=4)
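Note that indent only controls pretty-printing. To get a list of row objects keyed by the column headers, as in the "cars" example, orient='records' is the closer fit; a sketch, wrapping the records under a top-level key by hand (the 'cars' key is taken from the desired output above):
import json

# orient='records' yields [{"Model": ..., "Year": ...}, ...]
records = json.loads(newSheet.to_json(orient='records'))
with open('file.json', 'w') as f:
    json.dump({'cars': records}, f, indent=4)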
I am really struggling with this one. I'm new to Python and I'm trying to extract data from an API.
I have managed to run the script below, but I need to amend it to filter on multiple values for one column, let's say England and Scotland. Is there an equivalent of the SQL IN operator, e.g. Area_Name IN ('England','Scotland')?
from requests import get
from json import dumps

ENDPOINT = "https://api.coronavirus.data.gov.uk/v1/data"
AREA_TYPE = "nation"
AREA_NAME = "england"

filters = [
    f"areaType={ AREA_TYPE }",
    f"areaName={ AREA_NAME }"
]

structure = {
    "date": "date",
    "name": "areaName",
    "code": "areaCode",
    "dailyCases": "newCasesByPublishDate",
}

api_params = {
    "filters": str.join(";", filters),
    "structure": dumps(structure, separators=(",", ":")),
    "latestBy": "cumCasesByPublishDate"
}

formats = [
    "json",
    "xml",
    "csv"
]

for fmt in formats:
    api_params["format"] = fmt
    response = get(ENDPOINT, params=api_params, timeout=10)
    assert response.status_code == 200, f"Failed request for {fmt}: {response.text}"
    print(f"{fmt} data:")
    print(response.content.decode())
I have tried the script, and a dict is the easiest type to handle in this case.
Given your JSON data output:
data = {"length":1,"maxPageLimit":1,"data":[{"date":"2020-09-17","name":"England","code":"E92000001","dailyCases":2788}],"pagination":{"current":"/v1/data?filters=areaType%3Dnation%3BareaName%3Dengland&structure=%7B%22date%22%3A%22date%22%2C%22name%22%3A%22areaName%22%2C%22code%22%3A%22areaCode%22%2C%22dailyCases%22%3A%22newCasesByPublishDate%22%7D&latestBy=cumCasesByPublishDate&format=json&page=1","next":null,"previous":null,"first":"/v1/data?filters=areaType%3Dnation%3BareaName%3Dengland&structure=%7B%22date%22%3A%22date%22%2C%22name%22%3A%22areaName%22%2C%22code%22%3A%22areaCode%22%2C%22dailyCases%22%3A%22newCasesByPublishDate%22%7D&latestBy=cumCasesByPublishDate&format=json&page=1","last":"/v1/data?filters=areaType%3Dnation%3BareaName%3Dengland&structure=%7B%22date%22%3A%22date%22%2C%22name%22%3A%22areaName%22%2C%22code%22%3A%22areaCode%22%2C%22dailyCases%22%3A%22newCasesByPublishDate%22%7D&latestBy=cumCasesByPublishDate&format=json&page=1"}}
You can try something like this:
countries = ['England', 'France', 'Whatever']
rows = [row for row in data['data'] if row['name'] in countries]
I presume the 'data' list is the only interesting key in the data dict, since all the others do not carry any meaningful values.
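For completeness, a hedged sketch of how this plugs into the script above, assuming the API takes one areaName per request, so each area is fetched separately and the rows are combined client-side:
from requests import get

# Equivalent of SQL's Area_Name IN ('England','Scotland')
areas = ["england", "scotland"]
all_rows = []
for area in areas:
    api_params["filters"] = f"areaType=nation;areaName={area}"
    api_params["format"] = "json"
    response = get(ENDPOINT, params=api_params, timeout=10)
    response.raise_for_status()
    # "data" holds the list of rows, as in the sample output above
    all_rows.extend(response.json()["data"])
print(all_rows)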
I have a JSON file which resulted from YouTube's iframe API and needs to be preprocessed. I want to put this JSON data into a pandas dataframe, where each JSON key will be a column, and each recorded "event" should be a new row.
I was able to load the data as a dataframe using read_json, but with this the keys for each event are shown as an array.
Here is what my JSON data looks like:
{
    "events": [
        {
            "timemillis": 1563467463580,
            "date": "18.7.2019",
            "time": "18:31:03,580",
            "name": "Player is loading",
            "data": ""
        },
        {
            "timemillis": 1563467463668,
            "date": "18.7.2019",
            "time": "18:31:03,668",
            "name": "Player is loaded",
            "data": "5"
        }
    ]
}
And this is what I did to convert it to a dataframe:
data = pd.read_json("file.json")
df = pd.DataFrame(data)
print(df)
The output looks like this:
0 {'timemillis': 1563469276604, 'date': '18.7.20...
1 {'timemillis': 1563469276694, 'date': '18.7.20...
...
How can I convert this output into a table where I have separate columns for these keys, such as 'timemillis', 'date', 'name' and so on? I have never worked with JSON before, so I am a bit confused.
import pandas as pd
import json

data = {
    "events": [
        {
            "timemillis": 1563467463580,
            "date": "18.7.2019",
            "time": "18:31:03,580",
            "name": "Player is loading",
            "data": ""
        },
        {
            "timemillis": 1563467463668,
            "date": "18.7.2019",
            "time": "18:31:03,668",
            "name": "Player is loaded",
            "data": "5"
        }
    ]
}

# or read the data from file:
# rather than reading the file directly into a pandas dataframe, read it as JSON
# data = pd.read_json("file.json")
with open('file.json') as json_file:
    data = json.load(json_file)

df = pd.DataFrame(data['events'])
print(df)
Result
  data       date               name          time     timemillis
0       18.7.2019  Player is loading  18:31:03,580  1563467463580
1    5  18.7.2019   Player is loaded  18:31:03,668  1563467463668
import pandas as pd

df = pd.read_json("file.json", orient='columns')
rows = []
for i, r in df.iterrows():
    rows.append({'eventid': i + 1, 'timemillis': r['events']['timemillis'], 'name': r['events']['name']})

df = pd.DataFrame(rows)
print(df)
Now you can insert this df into a database.
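As a side note, pandas.json_normalize (pandas 1.0+) flattens the event dicts in one step; a sketch, assuming the same file.json:
import json
import pandas as pd

with open('file.json') as json_file:
    data = json.load(json_file)

# Each key of the event dicts becomes its own column
df = pd.json_normalize(data['events'])
print(df)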
I have a Pandas dataframe and I want to call an API and pass some parameters from that dataframe. Then I get the results from the API and create a new column from that. This is my working code:
import http.client, urllib.request, urllib.parse, urllib.error, base64
import pandas as pd
import json

headers = {
    # Request headers
    'Content-Type': 'application/json',
    'Ocp-Apim-Subscription-Key': 'my-api-key-goes-here',
}

params = urllib.parse.urlencode({
})

df = pd.read_csv('mydata.csv', names=['id', 'text'])

def call_api(row):
    try:
        body = {
            "documents": [
                {
                    "language": "en",
                    "id": row['id'],
                    "text": row['text']
                }
            ]
        }
        conn = http.client.HTTPSConnection('api-url')
        conn.request("POST", "api-endpoint?%s" % params, json.dumps(body), headers)
        response = conn.getresponse()
        data = json.loads(response.read())
        conn.close()
        return data['documents'][0]['score']
    except Exception as e:
        print(e)

df['score'] = df.apply(call_api, axis=1)
The above works quite well. However, I have a limit on the number of API requests I can make, and the API lets me send up to 100 documents in the same request by adding more entries to the body['documents'] list.
The returned data follows this schema:
{
    "documents": [
        {
            "score": 0.92,
            "id": "1"
        },
        {
            "score": 0.85,
            "id": "2"
        },
        {
            "score": 0.34,
            "id": "3"
        }
    ],
    "errors": null
}
So, what I am looking for is to apply the same API call not row by row, but in batches of 100 rows each time. Is there any way to do this in Pandas, or should I iterate over the dataframe rows, create the batches myself, and then iterate again to add the returned values to the new column?
DataFrame.apply() is slow; we can do better. This will create the "documents" list-of-dicts in one go:
df.to_dict('records')
Then all you need to do is split it into chunks of 100:
start = 0
while start < len(df):
    documents = df.iloc[start:start+100].to_dict('records')
    call_api(documents)
    start += 100
Finally, you could use a single HTTP session with the requests library:
import requests
session = requests.Session()
call_api(session, documents)
Then inside call_api() you do session.post(...). This is more efficient than setting up a new connection each time.
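Putting the pieces together, a hedged sketch of a batched call_api that maps the returned scores back onto the dataframe; the URL is a stand-in like in the question, and the response shape follows the schema shown above:
import requests
import pandas as pd

session = requests.Session()
session.headers.update({'Ocp-Apim-Subscription-Key': 'my-api-key-goes-here'})

def call_api(session, documents):
    """Score up to 100 documents in one request; return {id: score}."""
    body = {
        "documents": [
            {"language": "en", "id": str(d['id']), "text": d['text']}
            for d in documents
        ]
    }
    resp = session.post('https://api-url/api-endpoint', json=body)
    resp.raise_for_status()
    # Map each returned id to its score, per the response schema above
    return {doc['id']: doc['score'] for doc in resp.json()['documents']}

df = pd.read_csv('mydata.csv', names=['id', 'text'])

scores = {}
start = 0
while start < len(df):
    batch = df.iloc[start:start + 100].to_dict('records')
    scores.update(call_api(session, batch))
    start += 100

# The API echoes ids back as strings, so match on the string form of the id column
df['score'] = df['id'].astype(str).map(scores)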