Wikidata - get labels for a large number of ids - python

I have a list of around 300,000 Wikidata ids (e.g. Q1347065, Q731635 etc.) in an ndjson file as
{"Q1347065": ""}
{"Q731635": ""}
{"Q191789": ""} ... etc
What I would like is to get the label of each id and form a dictionary of key-value pairs, such as
{"Q1347065":"epiglottitis", "Q731635":"Mount Vernon", ...} etc.
What I used before the list of ids got so large was the Wikidata python library (https://pypi.org/project/Wikidata/):
from wikidata.client import Client
import ndjson

client = Client()
with open("claims.ndjson") as f, open('claims_to_strings.json', 'w') as out:
    claims = ndjson.load(f)
    l = {}
    for d in claims:
        l.update(d)
    for key in l:
        v = client.get(key)
        l[key] = str(v.label)
    json.dumps(l, out)
But it is far too slow (around 15 hours for 1000 ids). Is there another way to achieve this that is faster than what I have been doing?

Before answering: I don't know what you mean by json.dumps(l, out); I'm assuming you want json.dump(l, out) instead.
My answer consists of sending the following SPARQL query to the Wikidata Query Service:
SELECT ?item ?itemLabel WHERE {
  VALUES ?item { wd:Q1347065 wd:Q731635 wd:Q105492052 }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
This asks for multiple labels in a single request.
It speeds up execution a lot, because your bottleneck is the number of connections, and with this method the id -> label mapping is done entirely on the server side.
import json
import ndjson
import re
import requests

def wikidata_query(query):
    url = 'https://query.wikidata.org/sparql'
    try:
        r = requests.get(url, params={'format': 'json', 'query': query})
        return r.json()['results']['bindings']
    except json.JSONDecodeError:
        raise Exception('Invalid query')

with open("claims.ndjson") as f, open('claims_to_strings.json', 'w') as out:
    claims = ndjson.load(f)
    l = {}
    for d in claims:
        l.update(d)
    item_ids = l.keys()
    sparql_values = list(map(lambda id: "wd:" + id, item_ids))
    item2label = wikidata_query('''
        SELECT ?item ?itemLabel WHERE {
          VALUES ?item { %s }
          SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        }''' % " ".join(sparql_values))
    for result in item2label:
        item = re.sub(r".*[#/\\]", "", result['item']['value'])
        label = result['itemLabel']['value']
        l[item] = label
    json.dump(l, out)
I guess you cannot do a single query for all 300,000 items, but you can easily find the maximum number of ids the endpoint accepts and split your original id list into chunks of that size.
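For example, here is a minimal sketch of that chunking, assuming a hypothetical batch size of 500 ids per query (tune it to whatever the endpoint actually accepts) and reusing the wikidata_query function above:
BATCH_SIZE = 500  # hypothetical limit; adjust to what the endpoint tolerates

item_ids = list(l.keys())
for i in range(0, len(item_ids), BATCH_SIZE):
    batch = item_ids[i:i + BATCH_SIZE]
    sparql_values = " ".join("wd:" + item_id for item_id in batch)
    results = wikidata_query('''
        SELECT ?item ?itemLabel WHERE {
          VALUES ?item { %s }
          SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        }''' % sparql_values)
    for result in results:
        item = re.sub(r".*[#/\\]", "", result['item']['value'])
        l[item] = result['itemLabel']['value']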

Related

How to use msearch() with "q" in ElasticSearch?

I've been using the standard Python ElasticSearch client to make single requests in the following format:
es.search(index='my_index', q=query, size=5, search_type='dfs_query_then_fetch')
I now want to make queries in batch for multiple strings q.
I've seen this question explaining how to use the msearch() functionality to do queries in batch. However, msearch requires the full json-formatted request body for each request. I'm not sure which parameters in the query API correspond to the q parameter from search(), or to size or search_type, which seem to be API shortcuts specific to the single-query search().
How can I use msearch but specify q, size, and search_type?
I read through the API and figured out how to batch simple search queries:
from typing import List

from elasticsearch import Elasticsearch
import json

def msearch(
    es: Elasticsearch,
    max_hits: int,
    query_strings: List[str],
    index: str
):
    search_arr = []
    for q in query_strings:
        # Header line: the index this query runs against.
        search_arr.append({'index': index})
        # Body line: the actual query, with the same shortcuts as search().
        search_arr.append(
            {
                "query": {
                    "query_string": {
                        "query": q
                    }
                },
                'size': max_hits
            })
    # msearch expects newline-delimited JSON: alternating header and body lines.
    request = '\n'.join(json.dumps(x) for x in search_arr)
    resp = es.msearch(body=request)
    return resp

msearch(es, query_strings=['query 1', 'query 2'], max_hits=1, index='my_index')
EDIT: For my use case, I made one more improvement, because I didn't want to return the entire document in the result; for my purposes, I just needed the document ID and its score.
So the final search request object looked like this, including the '_source': False bit:
search_arr.append(
    {
        # Queries `q` using Lucene syntax.
        "query": {
            "query_string": {
                "query": q
            },
        },
        # Don't return the full profile string, etc. with the result.
        # We just want the ID and the score.
        '_source': False,
        # Only return `max_hits` documents.
        'size': max_hits
    }
)
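If it helps, here is a minimal sketch of pulling the IDs and scores back out afterwards. msearch returns one entry per submitted query, in order, under the responses key (this is the standard Elasticsearch response shape, but worth double-checking against your cluster version):
queries = ['query 1', 'query 2']
resp = msearch(es, query_strings=queries, max_hits=1, index='my_index')
for q, result in zip(queries, resp['responses']):
    hits = result['hits']['hits']
    # With '_source': False, each hit still carries its ID and score.
    print(q, [(h['_id'], h['_score']) for h in hits])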

How to keep running the same GraphQL query until x is null?

I have a Tableau GraphQL query that requires pagination:
test_1 = """
{
fieldsConnection (
first: 10,
orderBy: {field: NAME, direction: ASC}) {
nodes {
name
}
}
pageInfo {
hasNextPage
endCursor
}
}
}
"""
The second query:
test_2 = """
{
fieldsConnection (
first: 10,
next: SOME_STRING
orderBy: {field: NAME, direction: ASC}) {
nodes {
name
}
}
pageInfo {
hasNextPage
endCursor
}
}
}
"""
This first query will have hasNextPage = true and endCursor = "huge-ass-string". What I saw on my server is that to extract all the fields of interest I need to run the query 13 times!
What I want to do, in Python using from tableau_api_lib import TableauServerConnection as tsc, is write a function that runs the first query (test_1). If hasNextPage is true, then run the second query (test_2), updating the next value to be the value we got from endCursor.
This is how I get the JSON response from my query:
response = conn.metadata_graphql_query(query = test_1)
Is this possible in Python?
I just implemented pagination in the query and kept looping, storing the extracted data in a DataFrame.
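Roughly, that loop can look like the sketch below. It assumes the response layout shown in the queries above (data -> fieldsConnection -> nodes / pageInfo) and keeps the question's next: argument for the cursor; check both against the Tableau Metadata API docs before relying on them:
import pandas as pd

query_template = """
{
  fieldsConnection (
    first: 10,
    %s
    orderBy: {field: NAME, direction: ASC}) {
    nodes {
      name
    }
    pageInfo {
      hasNextPage
      endCursor
    }
  }
}
"""

def fetch_all_fields(conn):
    rows = []
    cursor_arg = ""  # first page: no cursor argument
    while True:
        response = conn.metadata_graphql_query(query=query_template % cursor_arg)
        connection = response.json()['data']['fieldsConnection']
        rows.extend(connection['nodes'])
        page_info = connection['pageInfo']
        if not page_info['hasNextPage']:
            break
        # Feed the cursor from this page into the next request.
        cursor_arg = 'next: "%s",' % page_info['endCursor']
    return pd.DataFrame(rows)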

How can I filter API GET Request on multiple variables?

I am really struggling with this one. I'm new to Python and I'm trying to extract data from an API.
I have managed to run the script below, but I need to amend it to filter on multiple values for one column, let's say England and Scotland. Is there an equivalent to the SQL IN operator, e.g. Area_Name IN ('England','Scotland')?
from requests import get
from json import dumps

ENDPOINT = "https://api.coronavirus.data.gov.uk/v1/data"
AREA_TYPE = "nation"
AREA_NAME = "england"

filters = [
    f"areaType={ AREA_TYPE }",
    f"areaName={ AREA_NAME }"
]
structure = {
    "date": "date",
    "name": "areaName",
    "code": "areaCode",
    "dailyCases": "newCasesByPublishDate",
}
api_params = {
    "filters": str.join(";", filters),
    "structure": dumps(structure, separators=(",", ":")),
    "latestBy": "cumCasesByPublishDate"
}
formats = [
    "json",
    "xml",
    "csv"
]

for fmt in formats:
    api_params["format"] = fmt
    response = get(ENDPOINT, params=api_params, timeout=10)
    assert response.status_code == 200, f"Failed request for {fmt}: {response.text}"
    print(f"{fmt} data:")
    print(response.content.decode())
I have tried the script, and a dict is the easiest type to handle in this case.
Given your json data output:
data = {"length":1,"maxPageLimit":1,"data":[{"date":"2020-09-17","name":"England","code":"E92000001","dailyCases":2788}],"pagination":{"current":"/v1/data?filters=areaType%3Dnation%3BareaName%3Dengland&structure=%7B%22date%22%3A%22date%22%2C%22name%22%3A%22areaName%22%2C%22code%22%3A%22areaCode%22%2C%22dailyCases%22%3A%22newCasesByPublishDate%22%7D&latestBy=cumCasesByPublishDate&format=json&page=1","next":null,"previous":null,"first":"/v1/data?filters=areaType%3Dnation%3BareaName%3Dengland&structure=%7B%22date%22%3A%22date%22%2C%22name%22%3A%22areaName%22%2C%22code%22%3A%22areaCode%22%2C%22dailyCases%22%3A%22newCasesByPublishDate%22%7D&latestBy=cumCasesByPublishDate&format=json&page=1","last":"/v1/data?filters=areaType%3Dnation%3BareaName%3Dengland&structure=%7B%22date%22%3A%22date%22%2C%22name%22%3A%22areaName%22%2C%22code%22%3A%22areaCode%22%2C%22dailyCases%22%3A%22newCasesByPublishDate%22%7D&latestBy=cumCasesByPublishDate&format=json&page=1"}}
You can try something like this:
countries = ['England', 'France', 'Whatever']
filtered = [record for record in data['data'] if record['name'] in countries]
I presume the data list is the only interesting key in the data dict, since none of the others carry any meaningful values.
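If you would rather not filter client-side, a simple alternative (a sketch, assuming the filters parameter takes only one areaName per request, which is my reading of the script above rather than something confirmed by the API docs) is to issue one request per area and merge the results:
from requests import get
from json import dumps

ENDPOINT = "https://api.coronavirus.data.gov.uk/v1/data"
structure = {
    "date": "date",
    "name": "areaName",
    "code": "areaCode",
    "dailyCases": "newCasesByPublishDate",
}

records = []
for area_name in ["england", "scotland"]:
    api_params = {
        "filters": f"areaType=nation;areaName={area_name}",
        "structure": dumps(structure, separators=(",", ":")),
        "latestBy": "cumCasesByPublishDate",
        "format": "json",
    }
    response = get(ENDPOINT, params=api_params, timeout=10)
    assert response.status_code == 200, f"Failed request for {area_name}: {response.text}"
    # Collect the records from each per-area response.
    records.extend(response.json()["data"])
print(records)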

How to manipulate an object of Google Ads API's Enum class - python

I am using the Python client library to connect to the Google Ads API.
ga_service = client_service.get_service('GoogleAdsService')

query = ('SELECT campaign.id, campaign.name, campaign.advertising_channel_type '
         'FROM campaign WHERE date BETWEEN \'' + fecha + '\' AND \'' + fecha + '\'')

response = ga_service.search(<client_id>, query=query, page_size=1000)

result = {}
result['campanas'] = []
for row in response:
    print(row)
    info = {}
    info['id'] = row.campaign.id.value
    info['name'] = row.campaign.name.value
    info['type'] = row.campaign.advertising_channel_type
When I parse the values, this is the result I get:
{
    "campanas": [
        {
            "id": <campaign_id>,
            "name": "Lanzamiento SIKU",
            "type": 2
        },
        {
            "id": <campaign_id>,
            "name": "lvl1 - website traffic",
            "type": 2
        },
        {
            "id": <campaign_id>,
            "name": "Lvl 2 - display",
            "type": 3
        }
    ]
}
Why am I getting an integer for result["type"]? When I check the printed rows I can see a string:
campaign {
  resource_name: "customers/<customer_id>/campaigns/<campaign_id>"
  id {
    value: 397083380
  }
  name {
    value: "Lanzamiento SIKU"
  }
  advertising_channel_type: SEARCH
}
campaign {
  resource_name: "customers/<customer_id>/campaigns/<campaign_id>"
  id {
    value: 1590766475
  }
  name {
    value: "lvl1 - website traffic"
  }
  advertising_channel_type: SEARCH
}
campaign {
  resource_name: "customers/<customer_id>/campaigns/<campaign_id>"
  id {
    value: 1590784940
  }
  name {
    value: "Lvl 2 - display"
  }
  advertising_channel_type: DISPLAY
}
I've searched the documentation for the API and found out that it's because the field advertising_channel_type is of data type Enum. How can I manipulate this object of the Enum class to get the string value? There is no helpful information about this in their documentation.
Please help!
The Enums come with some methods to translate between index and string:
channel_types = client_service.get_type('AdvertisingChannelTypeEnum')
channel_types.AdvertisingChannelType.Value('SEARCH')
# => 2
channel_types.AdvertisingChannelType.Name(2)
# => 'SEARCH'
This was found by looking at docstrings, e.g.
channel_types.AdvertisingChannelType.__doc__
# => 'A utility for finding the names of enum values.'
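Applied to the loop in the question, that would look something like this sketch (reusing the response and result objects from the question):
channel_types = client_service.get_type('AdvertisingChannelTypeEnum')
for row in response:
    info = {}
    info['id'] = row.campaign.id.value
    info['name'] = row.campaign.name.value
    # Name() turns the enum's integer value back into its label, e.g. 2 -> 'SEARCH'.
    info['type'] = channel_types.AdvertisingChannelType.Name(row.campaign.advertising_channel_type)
    result['campanas'].append(info)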
I think the best way to do this is this one-liner:
import proto
row_dict = proto.Message.to_dict(google_ads_row, use_integers_for_enums=False)
This converts the entire Google Ads row into a dictionary in one go and automatically gives the enum names instead of the numbers.
@Vijaysinh Parmar, try the following:
from google.protobuf import json_format
row_dict = json_format.MessageToJson(row, use_integers_for_enums=False)
You can also just work around it by creating a lookup list, as long as the index positions match the enum's integer values (the Value/Name mapping above shows SEARCH = 2 and DISPLAY = 3):
lookup_list = ['UNSPECIFIED', 'UNKNOWN', 'SEARCH', 'DISPLAY', 'SHOPPING', 'HOTEL', 'VIDEO']
and change the assignment in your last row to
info['type'] = lookup_list[row.campaign.advertising_channel_type]

How do I loop through UIDs in Firebase

I am using Python to go through a Firebase DB to return an array of objects, then pick one at random and return its values. I've been using a small test JSON DB that I manually build and import into Firebase. When I do this, the DB's child nodes are 0, 1, 2, etc., and using the code below I can iterate through the data and grab what I need.
I've been building a CMS that lets me input data directly into Firebase (instead of importing a local JSON doc) using the push() method.
Accordingly, the child nodes become obfuscated timestamp keys that look like this: K036VOR90fh8sd80, KO698fhs7Hf8sfds, etc.
Now when I attempt to for-loop through the nodes, I get the following error at the caption = ... line:
TypeError: string indices must be integers
I'm assuming this is happening because the child nodes are strings now. Since I have to use the CMS, how do I loop through these nodes?
Code:
if 'Item' in intent['slots']:
    chosen_item = intent['slots']['Item']['value']
    result = firebase.get(chosen_item, None)
    if result:
        item_object = []
        for item in result:
            item_object.append(item)
        random_item = random.choice(item_object)
        caption = random_item['caption']
        audio_src = random_item['audioPath']
Here is an approximation of what the Firebase looks like:
{
    "1001": {
        "-K036VOR90fh8sd80EQ": {
            "audioPath": "https://s3.amazonaws.com/bucket-output/audio/audio_0_1001.mp3",
            "caption": "Record 0 from 1001"
        },
        "-KO698fhs7Hf8sfdsWJS": {
            "audioPath": "https://s3.amazonaws.com/bucket-output/audio/audio_1_1001.mp3",
            "caption": "Record 1 from 1001"
        }
    },
    "2001": {
        "-KOFsPBMKVtwHSOfiDJ": {
            "audioPath": "https://s3.amazonaws.com/bucket-output/audio/audio_0_2001.mp3",
            "caption": "Record 0 from 2001"
        },
        "-KOFsQvwgF9icSIolU": {
            "audioPath": "https://s3.amazonaws.com/bucket-output/audio/audio_1_2001.mp3",
            "caption": "Record 1 from 2001"
        }
    }
}
What needs to be done is to loop through the result dictionary with Python's .items() method, iterating over the key-value pairs (k, v) and appending only the values (v).
This strips the result of the parent Firebase dictionary keys, i.e. -KOFsQvwgF9icSIolU.
if 'Item' in intent['slots']:
    chosen_item = intent['slots']['Item']['value']
    result = firebase.get(chosen_item, None)
    if result:
        item_object = []
        # Iterate over (push-key, record) pairs; keep only the record values.
        for k, v in result.items():
            item_object.append(v)
        random_item = random.choice(item_object)
        caption = random_item['caption']
        audio_src = random_item['audioPath']
