Why is this SPARQL query missing so many results? - python

(First off, my apologies as this is a blatant cross-post. I thought opendata.SE would be the place for this, but it's gotten barely any views there and it appears to not be a very active site in general, so I figure I ought to try it here as it's programming-related.)
I'm trying to get a list of major cities in the world: their name, population, and location. I found what looked like a good query on Wikidata, slightly tweaking one of their built-in query examples:
SELECT DISTINCT ?cityLabel ?population ?gps WHERE {
?city (wdt:P31/wdt:P279*) wd:Q515.
?city wdt:P1082 ?population.
?city wdt:P625 ?gps.
FILTER (?population >= 500000) .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?population)
The results, at first glance, appear to be good, but it's missing a ton of important cities. For example, San Francisco (population 800,000+) and Seattle (population 650,000+) are not in the list, when I specifically asked for all cities with a population greater than 500,000.
Is there something wrong with my query? If not, there must be something wrong with the data Wikidata is using. Either way, how can I get a valid data set, with an API I can query from a Python script? (I've got the script all working for this; I'm just not getting back valid data.)
from SPARQLWrapper import SPARQLWrapper, JSON
from geopy.distance import great_circle
def parseCoords(gps):
base = gps[6:-1]
coords=base.split()
return (float(coords[1]), float(coords[0]))
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""SELECT DISTINCT ?cityLabel ?population ?gps WHERE {
?city (wdt:P31/wdt:P279*) wd:Q515.
?city wdt:P1082 ?population.
?city wdt:P625 ?gps.
FILTER (?population >= 500000) .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?population)""")
queryResults = sparql.query().convert()
cities = [(city["cityLabel"]["value"], int(city["population"]["value"]), parseCoords(city["gps"]["value"])) for city in queryResults["results"]["bindings"]]
print (cities)

The population of seattle is simply not in this database.
If you execute:
#Largest cities of the world
#defaultView:BubbleChart
SELECT * WHERE {
wd:Q5083 wdt:P1082 ?population.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
You get zero results. Altought the instance wd:Q5083(seattle) exists, it does not have a predicate wdt:P1082(population).

Related

Find all unique values for field in Elasticsearch through python

I've been scouring the web for some good python documentation for Elasticsearch. I've got a query term that I know returns the information I need, but I'm struggling to convert the raw string into something Python can interpret.
This will return a list of all unique 'VALUE's in the dataset.
{"find": "terms", "field": "hierarchy1.hierarchy2.VALUE"}
Which I have taken from a dashboarding tool which accesses this data.
But I don't seem to be able to convert this into correct python.
I've tried this:
body_test = {"find": "terms", "field": "hierarchy1.hierarchy2.VALUE"}
es = Elasticsearch(SETUP CONNECTION)
es.search(
index="INDEX_NAME",
body = body_test
)
but it doesn't like the find value. I can't find anything in the documentation about find.
RequestError: RequestError(400, 'parsing_exception', 'Unknown key for
a VALUE_STRING in [find].')
The only way I've got it to slightly work is with
es_search = (
Search(
using=es,
index=db_index
).source(['hierarchy1.hierarchy2.VALUE'])
)
But I think this is pulling the entire dataset and then filtering (which I obviously don't want to be doing each time I run this code). This needs to be done through python and so I cannot simply POST the query I know works.
I am completely new to ES and so this is all a little confusing. Thanks in advance!
So it turns out that the find in this case was specific to Grafana (the dashboarding tool I took the query from.
In the end I used this site and used the code from there. It's a LOT more complicated than I thought it was going to be. But it works very quickly and doesn't put a strain on the database (which my alternative method was doing).
In case the link dies in future years, here's the code I used:
from elasticsearch import Elasticsearch
es = Elasticsearch()
def iterate_distinct_field(es, fieldname, pagesize=250, **kwargs):
"""
Helper to get all distinct values from ElasticSearch
(ordered by number of occurrences)
"""
compositeQuery = {
"size": pagesize,
"sources": [{
fieldname: {
"terms": {
"field": fieldname
}
}
}
]
}
# Iterate over pages
while True:
result = es.search(**kwargs, body={
"aggs": {
"values": {
"composite": compositeQuery
}
}
})
# Yield each bucket
for aggregation in result["aggregations"]["values"]["buckets"]:
yield aggregation
# Set "after" field
if "after_key" in result["aggregations"]["values"]:
compositeQuery["after"] = \
result["aggregations"]["values"]["after_key"]
else: # Finished!
break
# Usage example
for result in iterate_distinct_field(es, fieldname="pattern.keyword", index="strings"):
print(result) # e.g. {'key': {'pattern': 'mypattern'}, 'doc_count': 315}

SPARQL + owlready2 : least common ancestor of a class

My final goal is to get the least common ancestor of a class from an owl file ( for example if i choose AP and TP the result should be base )
base-----=AP
|----TP
Iam trying to subclass of a class but no result !
for information i tried this and it doent work for me : How to get Least common subsumer in ontology using SPARQL Query?
I need help please
python file :
from owlready2 import *
onto = get_ontology("file://ess.owl").load()
owlready2.JAVA_EXE = "C:\\Program Files (x86)\\Java\jre1.8.0_311\\bin"
print(list(default_world.sparql("""
SELECT (?x AS ?nb)
{ ?x a owl:Class . }
""")))
print(list(default_world.sparql("""
SELECT ?y
{ "Star" rdfs:subClassOf ?y }
"""))) ```

Elasticsearch dsl what does q('match', path = ) do?

In python using elasticsearch_dsl.query there is a helper function Q that does the DSL query. However, i do not understand what this query is trying to say in a code i found:
ES_dsl.Q('match', path=path_to_file)
What exactly is Q('match', path = path_to_file) doing?
Where path_to_file is a valid path to a file in the system in the index.
Isn't path only in nested queries? There is no path in 'match' queries? I'm guessing it is to detokenize the path_to_file to find an exact match? An explanation to what is happening would be appreciated.
the approach it takes is the query type as the first value, then what you want to query next. so that is saying;
run a match query - https://www.elastic.co/guide/en/elasticsearch/reference/7.15/query-dsl-match-query.html
use the path field and search for the value path_to_file
so matching that back to the docs page from above, it'd look like this in direct DSL;
GET /_search
{
"query": {
"match": {
"path": {
"query": "path_to_file"
}
}
}
}

How to get Wikipedia page id from Wikidata id?

I want to get the Wikipedia page id from the Wikidata id, how can I get it from the Wikidata Query Service or other methods with python? Because I do not see any attribute in wikidata called something like wikipedia id.
I'm not sure, if DBpedia always contains both wikiPageID and Wikidata ID, but you can try the folowing query on DBpedia:
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?wikipedia_id WHERE {
?dbpedia_id owl:sameAs ?wikidata_id .
?dbpedia_id dbo:wikiPageID ?wikipedia_id .
VALUES (?wikidata_id) {(wd:Q123)}
}
Try it!
Or you can try the following federated query on Wikidata:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?wikipedia_id where {
VALUES (?wikidata_id) {(wd:Q123)}
SERVICE <http://dbpedia.org/sparql> {
?dbpedia_id owl:sameAs ?wikidata_id .
?dbpedia_id dbo:wikiPageID ?wikipedia_id
}
}
Try it!
Update
You can call out to the Wikipedia API using MWAPI on Wikidata:
SELECT ?pageid WHERE {
VALUES (?item) {(wd:Q123)}
[ schema:about ?item ; schema:name ?name ;
schema:isPartOf <https://en.wikipedia.org/> ]
SERVICE wikibase:mwapi {
bd:serviceParam wikibase:endpoint "en.wikipedia.org" .
bd:serviceParam wikibase:api "Generator" .
bd:serviceParam mwapi:generator "allpages" .
bd:serviceParam mwapi:gapfrom ?name .
bd:serviceParam mwapi:gapto ?name .
?pageid wikibase:apiOutput "#pageid" .
}
}
Try it!
Unfortunately, it seems you have to use a generator; allpages appears to be the most suitable one.
First, you need to get the Wikipedia page title from the Wikidata id, which can be done with a request to Wikidata API wbgetentities module, like so: https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q123&format=json&props=sitelinks
Then, once you found the Wikipedia title from the desired Wikipedia edition, you can get the associated page id from that Wikipedia API: https://en.wikipedia.org/w/api.php?action=query&titles=September&format=json
So from those example URLs you can get that:
Wikidata id = Q123
=> English Wikipedia (enwiki) title = September
=> pageid = 15580374
Use below URL in your CURL call. You have to change WikiDataID Q243 in below link.
For Example if you want wikiPageID of Taj_Mahal then replace Q243 with Q9141 in below link and do CURL call.
http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=PREFIX+wd%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E+%0D%0ASELECT+%3FwikiPageID+WHERE+%7B%0D%0A%3Fdbpedia_id+owl%3AsameAs+%3Fwikidata_id++.%0D%0A%3Fdbpedia_id+dbo%3AwikiPageID+%3FwikiPageID+.%0D%0AVALUES+%28%3Fwikidata_id%29+%7B%28wd%3AQ243%29%7D+%0D%0A%7D&format=application%2Fsparql-results%2Bjson&CXML_redir_for_subjs=121&CXML_redir_for_hrefs=&timeout=30000&debug=on&run=+Run+Query
To Get WikiPageID through wikiDataId you have to modify above link or by replacing wikiDataID of your choice in above link.
Note:
1) To Get WikiPageID with Label use this URL in CURL Call
2) Find Q243 and Replace with your wikiDataID

How to query dbpedia resource ontology 'wikiPageExternalLink'

Using sparql\sparqlwrapper in python, how will I be able to query for the values of a certain dbpedia resource? For example, how will I be able to get the dbpedia-owl:wikiPageExternalLink values of http://dbpedia.org/page/Asturias?
Here's a simple example on how will I be able to query for the rdfs:label of Asturias. But I don't know how to modify the query/query parameters to get values of property/ontology other than those included on rdfs schema. Here's the sample:
from SPARQLWrapper import SPARQLWrapper, JSON, XML, N3, RDF
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label
WHERE { <http://dbpedia.org/resource/Asturias> rdfs:label ?label }
""")
print '\n\n*** JSON Example'
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for result in results["results"]["bindings"]:
print result["label"]["value"]
Hoping to receive feedback. Thanks in advance!
Not sure where you're stuck—this is really easy:
SELECT ?label
WHERE { <http://dbpedia.org/resource/Asturias>
dbpedia-owl:wikiPageExternalLink ?label }
Usually you need to declare the namespace prefixes like rdfs: or dbpedia-owl: if you want to use them in the query, but on the DBpedia endpoint this works even without. If you want, you can declare them anyways:
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?label
WHERE { <http://dbpedia.org/resource/Asturias>
dbpedia-owl:wikiPageExternalLink ?label }
You can find out the full URI corresponding to the prefix by going to http://dbpedia.org/sparql and clicking on “Namespace Prefixes” near the top right corner.
If you want to rename the variable (for example, from ?label to ?link) then do it like this:
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?link
WHERE { <http://dbpedia.org/resource/Asturias>
dbpedia-owl:wikiPageExternalLink ?link }
and you also have to change "label" to "link" in the Python code that gets the value out of the JSON result.

Categories

Resources