How can I create external tables (federated data sources) in BigQuery using Python (google-cloud-bigquery)?
I know you can use bq commands like this, but that is not how I want to do it:
bq mk --external_table_definition=path/to/json tablename
bq update tablename path/to/schemafile
with external_table_definition as:
{
"autodetect": true,
"maxBadRecords": 9999999,
"csvOptions": {
"skipLeadingRows": 1
},
"sourceFormat": "CSV",
"sourceUris": [
"gs://bucketname/file_*.csv"
]
}
and a schemafile like this:
[
{
"mode": "NULLABLE",
"name": "mycolumn1",
"type": "INTEGER"
},
{
"mode": "NULLABLE",
"name": "mycolumn2",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "mycolumn3",
"type": "STRING"
}
]
Thank you for your help!
Lars
table_id = 'table1'
table = bigquery.Table(dataset_ref.table(table_id), schema=schema)
external_config = bigquery.ExternalConfig('CSV')
external_config = {
"autodetect": true,
"options": {
"skip_leading_rows": 1
},
"source_uris": [
"gs://bucketname/file_*.csv"
]
}
table.external_data_configuration = external_config
table = client.create_table(table)
The schema format is:
schema = [
bigquery.SchemaField(name='mycolumn1', field_type='INTEGER', is_nullable=True),
bigquery.SchemaField(name='mycolumn2', field_type='STRING', is_nullable=True),
bigquery.SchemaField(name='mycolumn3', field_type='STRING', is_nullable=True),
]
I know this is well after the question has been asked and answered, but the accepted answer above does not work. I attempted to do the same thing you are describing, and additionally tried to use the same approach to update an existing external table to which some new columns had been added. This would be the correct snippet to use, assuming you have that JSON schema stored somewhere like /tmp/schema.json:
[
{
"mode": "NULLABLE",
"name": "mycolumn1",
"type": "INTEGER"
},
{
"mode": "NULLABLE",
"name": "mycolumn2",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "mycolumn3",
"type": "STRING"
}
]
You should simply need to have the following if you already have the API representation of the options you want to add to the external table.
from google.cloud import bigquery
client = bigquery.Client()
# dataset must exist first
dataset_name = 'some_dataset'
dataset_ref = client.dataset(dataset_name)
table_name = 'tablename'
# Or wherever your json schema lives
schema = client.schema_from_json('/tmp/schema.json')
external_table_options = {
    "autodetect": True,
    "maxBadRecords": 9999999,
    "csvOptions": {
        "skipLeadingRows": 1
    },
    "sourceFormat": "CSV",
    "sourceUris": [
        "gs://bucketname/file_*.csv"
    ]
}
external_config = bigquery.ExternalConfig.from_api_repr(external_table_options)
table = bigquery.Table(dataset_ref.table(table_name), schema=schema)
table.external_data_configuration = external_config
client.create_table(
    table,
    # Now you can create the table safely with this option
    # so that it does not fail if the table already exists
    exists_ok=True
)
# And if you seek to update the table's schema and/or its
# external options through the same script then use
client.update_table(
    table,
    # As a side note, this portion of the code had me confounded for hours.
    # I could not for the life of me figure out that "fields" did not point
    # to the table's columns, but pointed to the `google.cloud.bigquery.Table`
    # object's attributes. IMHO, the naming of this parameter is horrible
    # given "fields" are already a thing (i.e. `SchemaField`s).
    fields=['schema', 'external_data_configuration']
)
In addition to setting the external table configuration using the API representation, you can set all of the same attributes by calling the names of those attributes on the bigquery.ExternalConfig object itself. So this would be another approach surrounding just the external_config portion of the code above.
external_config = bigquery.ExternalConfig('CSV')
external_config.autodetect = True
external_config.max_bad_records = 9999999
external_config.options.skip_leading_rows = 1
external_config.source_uris = ["gs://bucketname/file_*.csv"]
I must, however, once again raise some frustration with the Google documentation. The bigquery.ExternalConfig.options attribute claims that it can be set with a dictionary:
>>> from google.cloud import bigquery
>>> help(bigquery.ExternalConfig.options)
Help on property:
Optional[Dict[str, Any]]: Source-specific options.
but that is completely false. As you can see above, the Python object attribute names and the API representation names of those same attributes are slightly different. Either way you try it, though, if you have a dict of the source-specific options (e.g. CSVOptions, GoogleSheetsOptions, BigTableOptions, etc.) and attempt to pass that dict as the options attribute, it laughs in your face and says mean things like this:
>>> from google.cloud import bigquery
>>> external_config = bigquery.ExternalConfig('CSV')
>>> options = {'skip_leading_rows': 1}
>>> external_config.options = options
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: can't set attribute
>>> options = {'skipLeadingRows': 1}
>>> external_config.options = options
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: can't set attribute
>>> options = {'CSVOptions': {'skip_leading_rows': 1}}
>>> external_config.options = options
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: can't set attribute
>>> options = {'CSVOptions': {'skipLeadingRows': 1}}
>>> external_config.options = options
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: can't set attribute
The workaround was iterating over the options dict and using the __setattr__() method on the options object, which worked well for me. Pick your favorite approach from above. I have tested all of this code and will be using it for some time.
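For reference, here is a minimal sketch of that workaround (the option dict below is just an example, and setattr() is simply the built-in spelling of the __setattr__() call mentioned above):

from google.cloud import bigquery

external_config = bigquery.ExternalConfig('CSV')
external_config.autodetect = True
external_config.source_uris = ["gs://bucketname/file_*.csv"]

# Source-specific options expressed with the *Python* attribute names
# (snake_case), not the API representation names (camelCase).
csv_options = {"skip_leading_rows": 1}

# Apply each option onto the CSVOptions object behind external_config.options.
for name, value in csv_options.items():
    setattr(external_config.options, name, value)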
I am trying to reproduce a simple example from my reading material: I want to extract data from a JSON file and plot dots at the capitals of countries on a map.
This is the error I am running into:
Traceback (most recent call last):
File "C:\Users\serta\Desktop\python\db\capitals.py", line 14, in <module>
lons.append(cp_dicts['geometries']['coordinates'][0])
TypeError: string indices must be integers
[Finished in 188ms]
I read similar posts here and I think I understand the "why" of the issue. I double-checked my brackets and the depth of the nesting, but I cannot seem to fix it myself.
I am pretty sure the element I am targeting with my code (the lons.append line) is an integer, but I am still getting "TypeError: string indices must be integers".
Here is the code:
import json
from plotly.graph_objs import Scattergeo, Layout
from plotly import offline

# Explore the structure of the data.
filename = 'data/capitals.topo.json'
with open(filename) as f:
    all_cp_data = json.load(f)

all_cp_dicts = all_cp_data
lons, lats, hover_texts = [], [], []
for cp_dicts in all_cp_dicts['objects']['capitals']:
    lons.append(cp_dicts['geometries']['coordinates'][0])
    lats.append(cp_dicts['geometries']['coordinates'][1])
    hover_texts.append(cp_dicts['properties']['capital'])

# Map the earthquakes.
data = [{
    'type': 'scattergeo',
    'lon': lons,
    'lat': lats,
    'text': hover_texts,
    'marker': {
        'size': [5],
        #'color': mags,
        #'colorscale': 'plasma',
        #'reversescale': True,
        #'colorbar': {'title': 'Magnitude'},
    },
}]

my_layout = Layout(title="Capital Cities")
fig = {'data': data, 'layout': my_layout}
offline.plot(fig, filename='capital_cities.html')
Here is also the capitals.topo.json I am using:
{
"type": "Topology",
"objects": {
"capitals": {
"type": "GeometryCollection",
"geometries": [
{
"type": "Point",
"coordinates": [
90.24,
23.43
],
"id": "BD",
"properties": {
"country": "Bangladesh",
"city": "Dhaka",
"tld": "bd",
"iso3": "BGD",
"iso2": "BD"
}
},
On line 14, this is not valid, given the input data:
lons.append(cp_dicts['geometries']['coordinates'][0])
You need to update the loop along these lines:
for geometry in all_cp_dicts['objects']['capitals']['geometries']:
    lons.append(geometry['coordinates'][0])
    lats.append(geometry['coordinates'][1])
    hover_texts.append(geometry['properties'].get('city', ""))
Note that for some of the locations the 'city' key is missing in the JSON, so you need to handle that when populating the hover_texts list, as shown.
Also, the 'data' variable was not working with Scattergeo. Below is a suggested revision of the syntax:
data = Scattergeo(
    lat=lats,
    lon=lons,
    text=hover_texts,
    marker=dict(
        size=5
    )
)
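If it helps, here is one way the revised trace can be fed back into the rest of your script (a small sketch reusing the Layout and offline imports you already have; not part of the original answer):

# offline.plot() accepts a figure dict whose 'data' entry is a list of traces,
# so the Scattergeo trace above just needs to be wrapped in a list.
my_layout = Layout(title="Capital Cities")
fig = {'data': [data], 'layout': my_layout}
offline.plot(fig, filename='capital_cities.html')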
I searched quite a lot before asking this question, but it looks like I am stuck, so I am asking here. I know this type of error is encountered when the schema and the object do not match, for example when a datatype is missing or a field holds a different type of value.
However, I believe my case is different.
My application is simple: it only serializes and deserializes an object to Avro.
My DataClass:
from time import time
from faker import Faker
from dataclasses import dataclass, field
from dataclasses_avroschema import AvroModel

Faker.seed(0)
fake = Faker()

@dataclass
class Head(AvroModel):
    msgId: str = field()
    msgCode: str = field()

    @staticmethod
    def fakeMe():
        return Head(fake.md5(),
                    fake.pystr(min_chars=5, max_chars=5))

@dataclass
class Message(AvroModel):
    head: Head = field()
    status: bool = field()

    class Meta:
        namespace = "me.com.Message.v1"

    def fakeMe(self):
        self.head = Head.fakeMe()
        self.bool = fake.pybool()
Now the script that runs the serialization:
import json, io as mainio
from dto.temp_schema import Message
from avro import schema, datafile, io as avroio
obj = Message(None, True)
obj.fakeMe()
schema_obj = schema.parse(json.dumps(Message.avro_schema_to_python()))
buf = mainio.BytesIO()
writer = datafile.DataFileWriter(buf, avroio.DatumWriter(), schema_obj)
writer.append(obj)
writer.flush()
buf.seek(0)
data = buf.read()
print("serialized avro: ", data)
When I run this I get the following error:
Traceback (most recent call last):
File "/Users/office/Documents/projects/msg-bench/scrib.py", line 28, in <module>
writer.append(obj)
File "/Users/office/opt/anaconda3/envs/benchenv/lib/python3.9/site-packages/avro/datafile.py", line 329, in append
self.datum_writer.write(datum, self.buffer_encoder)
File "/Users/office/opt/anaconda3/envs/benchenv/lib/python3.9/site-packages/avro/io.py", line 771, in write
raise AvroTypeException(self.writer_schema, datum)
avro.io.AvroTypeException: The datum Message(head=Head(msgId='f112d652ecf13dacd9c78c11e1e7f987', msgCode='cYzVR'), status=True) is not an example of the schema {
"type": "record",
"name": "Message",
"namespace": "me.com.Message.v1",
"fields": [
{
"type": {
"type": "record",
"name": "Head",
"namespace": "me.com.Message.v1",
"fields": [
{
"type": "string",
"name": "msgId"
},
{
"type": "string",
"name": "msgCode"
}
],
"doc": "Head(msgId: str, msgCode: str)"
},
"name": "head"
},
{
"type": "boolean",
"name": "status"
}
],
"doc": "Message(head: dto.temp_schema.Head, status: bool)"
}
Please note that I am generating the schema from the dataclass object with the help of a Python library:
dataclasses-avroschema
And still, after using that same schema, I am not able to serialize the data to Avro.
Currently I am not sure where I am going wrong, and I am new to Avro. Why does this fail?
System and Library stats:
Python==3.9.7
avro==1.10.2
avro-python3==1.10.2
dataclasses-avroschema==0.25.1
Faker==9.3.1
fastavro==1.4.5
The problem is that you are trying to pass the Message object to the standard avro library, which doesn't expect that (it expects a dictionary instead). The library you are using has a section about serialization that you might want to take a look at: https://marcosschroh.github.io/dataclasses-avroschema/serialization/
So your script just needs to be something like this:
from dto.temp_schema import Message
obj = Message(None, True)
obj.fakeMe()
print("serialized avro: ", obj.serialize())
JSON file: https://1drv.ms/w/s!AizscpxS0QM4hJl99vVfUMvEjgXV3Q
I can extract TECH-XXX with:
#!/usr/bin/python
import sys
import json

sys.stdout = open('output.txt', 'wt')
datapath = sys.argv[1]
data = json.load(open(datapath))
for issue in data['issues']:
    if len(issue['fields']['subtasks']) == 0:
        print(issue['key'])
For every issue without subtasks (TECH-729, TECH-731) I want to extract TECH from:
project": {
"avatarUrls": {
"16x16": "https://jira.corp.company.com/secure/projectavatar?size=xsmall&pid=10001&avatarId=10201",
"24x24": "https://jira.corp.company.com/secure/projectavatar?size=small&pid=10001&avatarId=10201",
"32x32": "https://jira.corp.company.com/secure/projectavatar?size=medium&pid=10001&avatarId=10201",
"48x48": "https://jira.corp.company.com/secure/projectavatar?pid=10001&avatarId=10201"
},
"id": "10001",
"key": "TECH",
"name": "Technology",
"self": "https://jira.corp.company.com/rest/api/2/project/10001"
},
and customfield_10107.id.
I tried print(issue['customfield_10107']['id']) and got:
./tasks1.py 1.json
Traceback (most recent call last):
File "./tasks1.py", line 11, in <module>
print(issue['customfield_10107']['id'])
KeyError: 'customfield_10107'
The key field exists directly under issue, while customfield_10107 exists in issue['fields']:
for issue in response["issues"]:
    # For issues without subtasks
    if len(issue['fields']['subtasks']) == 0:
        # Print custom field id
        if 'customfield_10107' in issue['fields']:
            custom_field = issue['fields']['customfield_10107']
            print(custom_field['id'])
        # Print key
        if 'key' in issue:
            print(issue['key'])
I have the following bit of JSON data, which is a snippet from a larger JSON file.
I'm basically just looking to expand this data.
I'll worry about adding it to the existing JSON file later.
The JSON data snippet is:
"Roles": [
{
"Role": "STACiWS_B",
"Settings": {
"HostType": "AsfManaged",
"Hostname": "JTTstSTBWS-0001",
"TemplateName": "W2K16_BETA_4CPU",
"Hypervisor": "sys2Director-pool4",
"InCloud": false
}
}
],
So what I want to do is make many more datasets of "role" (for lack of a better term).
Something like this:
"Roles": [
{
"Role": "Clients",
"Settings": {
"HostType": "AsfManaged",
"Hostname": "JTClients-0001",
"TemplateName": "Win10_RTM_64_EN_1511",
"Hypervisor": "sys2director-pool3",
"InCloud": false
}
},
{
"Role": "Clients",
"Settings": {
"HostType": "AsfManaged",
"Hostname": "JTClients-0002",
"TemplateName": "Win10_RTM_64_EN_1511",
"Hypervisor": "sys2director-pool3",
"InCloud": false
}
},
I started with some Python code that looks like this, but it seems I'm fairly far off the mark:
import json
import pprint

Roles = ["STACiTS","STACiWS","STACiWS_B"]
RoleData = dict()
RoleData['Role'] = dict()
RoleData['Role']['Setttings'] = dict()
ASFHostType = "AsfManaged"
ASFBaseHostname = ["JTSTACiTS","JTSTACiWS","JTSTACiWS_"]
HypTemplateName = "W2K12R2_4CPU"
HypPoolName = "sys2director"

def CreateASF_Roles(Roles):
    for SingleRole in Roles:
        print SingleRole #debug purposes
        if SingleRole == 'STACiTS':
            print ("We found STACiTS!!!") #debug purposes
            NumOfHosts = 1
            for NumOfHosts in range(20): #Hardcoded for STACiTS - Generate 20 STACiTS datasets
                RoleData['Role']=SingleRole
                RoleData['Role']['Settings']['HostType']=ASFHostType
                ASFHostname = ASFBaseHostname + '-' + NumOfHosts.zfill(4)
                RoleData['Role']['Settings']['Hostname']=ASFHostname
                RoleData['Role']['Settings']['TemplateName']=HypTemplateName
                RoleData['Role']['Settings']['Hypervisor']=HypPoolName
                RoleData['Role']['Settings']['InCloud']="false"

CreateASF_Roles(Roles)
pprint.pprint(RoleData)
I keep getting this error, which is confusing, because I thought dictionaries could have named indices.
Traceback (most recent call last):
File ".\CreateASFRoles.py", line 34, in <module>
CreateASF_Roles(Roles)
File ".\CreateASFRoles.py", line 26, in CreateASF_Roles
RoleData['Role']['Settings']['HostType']=ASFHostType
TypeError: string indices must be integers, not str
Any thoughts are appreciated. Thanks.
Right here:
RoleData['Role']=SingleRole
You set RoleData['Role'] to be the string 'STACiTS'. So then the next command evaluates to:
'STACiTS'['Settings']['HostType']=ASFHostType
Which of course is trying to index into a string with another string, which is your error. Dictionaries can have named indices, but you overwrote the dictionary you created with a string.
You likely intended to create RoleData["Settings"] as a dictionary and then assign to that, rather than RoleData["Role"]["Settings"].
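To sketch what that might look like (the hostname pattern and the per-role host count here are assumptions for illustration, not taken from your code), one way is to build a list of role entries, each carrying its own "Settings" dict:

import pprint

ASFHostType = "AsfManaged"
HypTemplateName = "W2K12R2_4CPU"
HypPoolName = "sys2director"

def create_asf_roles(roles, hosts_per_role=20):
    role_entries = []
    for single_role in roles:
        for host_num in range(1, hosts_per_role + 1):
            role_entries.append({
                "Role": single_role,
                "Settings": {
                    "HostType": ASFHostType,
                    # Hypothetical hostname pattern, for illustration only
                    "Hostname": "JT{}-{}".format(single_role, str(host_num).zfill(4)),
                    "TemplateName": HypTemplateName,
                    "Hypervisor": HypPoolName,
                    "InCloud": False,
                },
            })
    return {"Roles": role_entries}

pprint.pprint(create_asf_roles(["STACiTS", "STACiWS", "STACiWS_B"]))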
Also, on another note, you have another mistake up here:
RoleData['Role']['Setttings'] = dict()
With a misspelling of "Settings" that will probably cause similar problems for you later on unless fixed.
I have a Mongo Collection that I need to update, and I'm trying to use the collection.update command to no avail.
Code below:
import pymongo
from pymongo import MongoClient

client = MongoClient()
db = client.SensorDB
sensors = db.Sensor

for sensor in sensors.find():
    lat = sensor['location']['latitude']
    lng = sensor['location']['longitude']
    sensor['location'] = {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [lat, lng]
        },
        "properties": {
            "name": sensor['name']
        }
    }
    sensors.update({'webid': sensor['webid']}, {"$set": sensor}, upsert=True)
However, running this gets me the following:
Traceback (most recent call last):
File "purgeDB.py", line 21, in <module>
cameras.update({'webid': sensor['webid']} , {"$set": sensor}, upsert=True)
File "C:\Anaconda\lib\site-packages\pymongo\collection.py", line 561, in update
check_keys, self.uuid_subtype), safe)
File "C:\Anaconda\lib\site-packages\pymongo\mongo_client.py", line 1118, in _send_message
rv = self.__check_response_to_last_error(response, command)
File "C:\Anaconda\lib\site-packages\pymongo\mongo_client.py", line 1060, in __check_response_to_last_error
raise OperationFailure(details["err"], code, result)
pymongo.errors.OperationFailure: Mod on _id not allowed
Change this line:
for sensor in sensors.find():
to this:
for sensor in sensors.find({}, {'_id': 0}):
What this does is prevent Mongo from returning the _id field, since you aren't using it, and it's causing your problem later in your update() call since you cannot "update" _id.
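If you would rather keep the default projection, another way to get the same effect (just an alternative sketch, not part of the original fix) is to drop _id from the document before using it in $set:

for sensor in sensors.find():
    # The immutable _id field must not appear in the $set document.
    sensor.pop('_id', None)
    lat = sensor['location']['latitude']
    lng = sensor['location']['longitude']
    sensor['location'] = {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lat, lng]},
        "properties": {"name": sensor['name']}
    }
    sensors.update({'webid': sensor['webid']}, {"$set": sensor}, upsert=True)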
An even better solution (only write the data that is needed):
for sensor in sensors.find():
    lat = sensor['location']['latitude']
    lng = sensor['location']['longitude']
    location = {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [lat, lng]
        },
        "properties": {
            "name": sensor['name']
        }
    }
    sensors.update({'webid': sensor['webid']}, {"$set": {'location': location}})
Edit:
As mentioned by Loïc Faure-Lacroix, you also do not need the upsert flag in your case - your code in this case is always updating, and never inserting.
Edit2:
Surrounded _id in quotes for first solution.