I'm looking for a way to get all the movies on English Wikipedia, with their creation date.
For my purposes, a movie is a page with an IMDb ID attached to it.
So, this is my query so far:
SELECT DISTINCT ?item_label ?imdb_id (YEAR(?dateCreation) AS ?AnneeCreation) WHERE {
  ?item wdt:P345 ?imdb_id.
  FILTER STRSTARTS(?imdb_id, "tt")
  OPTIONAL {
    ?item wdt:P571 ?dateCreation.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> ;
           schema:name ?item_label .
}
The problem with this is that most of the pages don't have a P571 property, so I was wondering whether there is a better way to get the creation date.
Maybe from the revision history or something? I couldn't find such an option.
Any help will be appreciated!
So, as the comments have noted, Wikidata properties (with some rare exceptions like featured-article flags) describe the underlying concept, not the Wikipedia page metadata. There is some limited ability to talk to the Wikipedia API, as @AKSW points out, but my understanding is that this doesn't work very well for large numbers of articles (note that the example code has a LIMIT 50 in it).
However, all is not lost! I worked out a methodology to do this at scale for very large numbers of articles recently in Gender and Deletion on Wikipedia, using a bit of lateral thinking.
First step: figure out your Wikidata query. tt-prefixed IMDb IDs may apply to things other than films (e.g. TV episodes, sports broadcasts), so another approach might be to do a P31/P279 type/class search to find all things that are "films, or subclasses of films". You will also want to add a filter that explicitly says "and only has an article in English Wikipedia", which I see you've already done. Note that this gives you the name of the WP article, not the "label" of the Wikidata item, which is distinct, so you can drop the (time-consuming) label service clause. You'll end up with something like https://w.wiki/FH4 (this still uses the tt- prefix approach and gets 180k results) or https://w.wiki/FH8 (P31/P279 filter plus tt- prefix, 136k results).
Run this query, save the results as a TSV somewhere, and move on to step 2. The tool we will use here is PetScan, which is designed to link up data from Wikipedia categories, Wikipedia metadata, Wikidata queries, etc.
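(If you'd rather pull the results down programmatically than export the TSV by hand, a minimal Python sketch along these lines should work. The cut-down query, file name, and columns here are just assumptions, and bear in mind the public endpoint has its own query timeout, so a very large result set may still be easier to download from the Query Service UI.)

import csv
import requests

SPARQL = """
SELECT DISTINCT ?item ?imdb_id ?article WHERE {
  ?item wdt:P345 ?imdb_id .
  FILTER STRSTARTS(?imdb_id, "tt")
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> .
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "film-creation-dates example script"},
    timeout=300,
)
resp.raise_for_status()
rows = resp.json()["results"]["bindings"]

# Write the bindings out as a TSV for the next steps
with open("wikidata_films.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["item", "imdb_id", "article"])
    for row in rows:
        writer.writerow([row["item"]["value"], row["imdb_id"]["value"], row["article"]["value"]])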
Feed the SPARQL query into tab 4 ("Other sources") and say "Use wiki: enwiki" at the bottom of this tab. This will force it to output data on the Wikipedia articles linked from this query.
Now hit "do it", wait a little while, (it took ~100s when I tested it) and examine the results. You will see that we get title (the WP article), page ID, namespace (hopefully always "(Article)", size in bytes, and last-touched date. None of these are creation date...
...except one of them kind of is. PageIDs are assigned sequentially, so they are essentially a proxy for time of creation. There are some nuances here about edge cases - e.g. if I created a redirect called "Example (film)" in 2010, and in 2015 manually edited the redirect to become a real article called "Example (film)", it would show up as created in 2010. There may also be odd results for pages deleted and recreated, or ones that have had complicated page-move histories (straightforward page moves should maintain IDs, though). But, in general, for 95% of items, the pageID will reflect the time at which the page was first created on-wiki. For example, pageID 431900000 was created at 11.14am on 1 July 2014; 531900000 was created at 6.29pm on 14 February 2017; and so on.
Back to PetScan - let's pull down all these items. In PetScan, go to the last tab and select TSV. Re-run the search and save the resulting file.
Now we have one TSV with Wikidata IDs, IMDb IDs, and WP page titles (plus anything else you want to recover from WD queries); we have another with WP page titles and page IDs. You can link them together using the WP page titles, letting you go from "results in Wikidata" to "page ID". Clean these up and link them however you prefer - I did it in bash; you might want to use something more sensible like Python.
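For instance, a minimal Python sketch of the join (the file names and column headers are assumptions about how you exported the two TSVs, so adjust to taste):

import csv
from urllib.parse import unquote

# Map WP article title -> page ID from the PetScan export
with open("petscan.tsv", encoding="utf-8") as f:
    petscan = {row["title"]: row["pageid"] for row in csv.DictReader(f, delimiter="\t")}

# Attach the page ID to each row of the Wikidata export, joining on the article title
with open("wikidata_films.tsv", encoding="utf-8") as f, \
     open("films_with_pageids.tsv", "w", newline="", encoding="utf-8") as out:
    reader = csv.DictReader(f, delimiter="\t")
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["article", "imdb_id", "pageid"])
    for row in reader:
        # Turn the sitelink URL back into a plain page title
        title = unquote(row["article"].rsplit("/", 1)[-1]).replace("_", " ")
        if title in petscan:
            writer.writerow([title, row["imdb_id"], petscan[title]])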
Now you can convert pageID to creation date. For the work I did, I was only interested in six-month bins, so I just worked out an arbitrary pageID created on 1 January and 1 July each year and counted IDs between them. You could do the same thing, or use the API to look up individual pageIDs and get creation timestamps back - it depends on exactly what you want to get.
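If you go the per-page API route, a rough sketch of looking up a creation timestamp from a page ID via the MediaWiki revisions API (queried oldest-first) might look like this:

import requests

def creation_timestamp(pageid):
    """Return the timestamp of the first revision of an enwiki page (roughly its creation date)."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "revisions",
            "pageids": pageid,
            "rvlimit": 1,
            "rvdir": "newer",       # oldest revision first
            "rvprop": "timestamp",
            "format": "json",
            "formatversion": 2,
        },
        timeout=30,
    )
    resp.raise_for_status()
    page = resp.json()["query"]["pages"][0]
    return page["revisions"][0]["timestamp"]

print(creation_timestamp(431900000))   # roughly 2014-07-01, per the example above

Doing that for ~136k page IDs is a lot of requests, though, which is why the pageID-binning shortcut is attractive.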
This is all a bit more complicated than just using the query service, and it will probably give spurious results for one or two articles with complicated histories, but it will basically let you do what you originally asked for.
I'm doing a work-related project in which I should study whether we could extract certain fields of information (e.g. contract parties, start and end dates) from contracts automatically.
I am quite new to working with text data and am wondering if those pieces of information could be extracted with ML by having the whole contract as input and the information as output, without tagging or annotating the whole text.
I understand that the extraction should be run separately for each targeted field.
Thanks!
First question - how are the contracts stored? Are they PDFs or text-based?
If they're PDFs, there are a handful of packages that can extract text from a PDF (e.g. pdftotext).
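For instance, a quick sketch of pulling the raw text out with the pdftotext command-line tool (assuming it is installed; the file name is hypothetical):

import subprocess

# Run pdftotext and capture the extracted text on stdout ("-" means write to stdout)
result = subprocess.run(
    ["pdftotext", "contract.pdf", "-"],
    capture_output=True, text=True, check=True,
)
contract_text = result.stdout
print(contract_text[:500])  # first 500 characters, just to inspect the output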
Second question - is the data you're looking for in the same place in every document?
If so, you can extract the information you're looking for (like start and end dates) from a known location in the contract. If not, you'll have to do something more sophisticated. For example, you may need to do a text search for "start date", if the same terminology is used in every contract. If different terminology is used from contract to contract, you may need to work to extract meaning from the text, which can be done using some sophisticated natural language processing (NLP).
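As a toy illustration of the text-search route, a naive Python sketch - the phrasing and date format are assumptions and would need to match your actual contracts:

import re

contract_text = "... The start date of this agreement is 01/02/2020 and the end date is 31/12/2022 ..."

# Look for "start date" / "end date" followed (within a few words) by a dd/mm/yyyy-style date
pattern = r"(start|end) date\D{0,30}?(\d{1,2}[./-]\d{1,2}[./-]\d{2,4})"
for label, date in re.findall(pattern, contract_text, flags=re.IGNORECASE):
    print(label.lower(), "->", date)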
Without more knowledge of your problem or a concrete example, it's hard to say what your best option may be.
I was searching for an already answered question about this but couldn't find one so please forgive me if I somehow missed it.
I'm using the Google Books API and I know I can search for books by a specific category.
My question is, how can I get all the available categories from the API?
I looked in the API documentation but couldn't find any mention of this.
The Google Books API does not have an endpoint for returning categories that are not associated with a specific book.
The Google Books API is only there to list books. You can:
search and browse through the list of books that match a given query;
view information about a book, including metadata, availability and price, and links to the preview page;
manage your own bookshelves.
You can see the categories of a given book, but you cannot get a list of all available categories in the whole system.
You may be interested to know this has been on their to-do list since 2012: category list
We have numerous requests for this and we're investigating how we can properly provide the data. One issue is Google does not own all the category information. "New York Times Bestsellers" is one obvious example. We need to first identify what we can publish through the API.
Workaround
I worked around it by implementing my own category-list mechanism, so I can pull all the categories that exist in my app's database.
(Unfortunately, the newly announced ScriptDb deprecation means my whole system will go to waste in a couple of months anyway... but that's another story.)
https://support.google.com/books/partner/answer/3237055?hl=en
Scroll down to subject/genres and you will see this link.
https://bisg.org/page/bisacedition
This list is apparently a list of subjects, AKA categories, for North American books. I am making various GET requests with an API testing tool and getting, for the most part, perfect matches between whatever subject I choose from the BISG subjects list and what comes back in the JSON response under the "categories" key (you may have to drop a word from the query string, e.g. "criticism" instead of "literary criticism").
Ex: GET https://www.googleapis.com/books/v1/volumes?q=business+subject:juvenile+fiction
Long story short, the BISG link is where I'm pretty sure Google got all the options for their "categories" key from.
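For what it's worth, the same kind of request from Python, collecting whatever comes back under the "categories" key (the query string is just an example):

import requests

resp = requests.get(
    "https://www.googleapis.com/books/v1/volumes",
    params={"q": "business subject:juvenile fiction", "maxResults": 40},
    timeout=30,
)
resp.raise_for_status()

seen_categories = set()
for item in resp.json().get("items", []):
    # Each volume may carry a "categories" list under volumeInfo
    seen_categories.update(item.get("volumeInfo", {}).get("categories", []))

print(sorted(seen_categories))

As the other answer notes, this only surfaces the categories of the volumes you happen to fetch; there is still no endpoint that enumerates them all.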
So, I'm working on a Python web application; it's a search engine for sporting goods (sport outfits, tools, etc.). Basically, it should search for a given keyword on multiple stores and compare the results to return the 20 best ones.
I was thinking that the best and easiest way to do this is to write a JSON file which contains rules for the scraper on how to extract data from each website. For example:
[{"www.decathlon.com" : { "rules" : { "productTag" : "div['.product']",
"priceTag" : "span[".price"]" } }]
So for Decathlon, to get a product item we search for div tags with the product class.
I have a list of around 10-15 websites to scrape. So for each website, the scraper goes to rules.json, looks up the related rules, and uses them to extract the data.
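A minimal sketch of that idea (the selectors, file layout, and URL are just the example above, so treat them as assumptions):

import json
import requests
from bs4 import BeautifulSoup

# Flatten rules.json into {domain: rules}
with open("rules.json", encoding="utf-8") as f:
    all_rules = {domain: cfg["rules"] for entry in json.load(f) for domain, cfg in entry.items()}

def scrape(url, domain):
    rules = all_rules[domain]
    soup = BeautifulSoup(requests.get(url, timeout=5).text, "html.parser")
    products = soup.select(rules["productTag"])   # e.g. "div.product"
    prices = soup.select(rules["priceTag"])       # e.g. "span.price"
    return list(zip((p.get_text(strip=True) for p in products),
                    (p.get_text(strip=True) for p in prices)))

# e.g. scrape("https://www.decathlon.com/search?q=tennis+racket", "www.decathlon.com")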
Pros of this method:
Very easy to write; we need only a minimal Python script for the logic of reading the rules, mapping URLs to them, and extracting the data through BeautifulSoup. It's also very easy to add or delete URLs and their rules.
Cons of this method: for each search we launch a request to each website, so we make around 10 requests at the same time and then compare the results. If 20 users search at the same time, we will have around 200 requests, which will slow down our app a lot!
Another method:
I thought that we could have a huge list of keywords; then, at 00:00, a script would launch requests to all the URLs for each keyword in the list, compare them, and store the results in CouchDB to be used throughout the day, updated daily. The only problem with this method is that it's nearly impossible to have a list of all possible keywords.
So please help me decide how I should proceed with this, given that I don't have a lot of time.
Along the lines of your "keyword" list: rather than keeping a list of all possible keywords, perhaps you could maintain a priority queue of keywords, with importance based on how often a keyword is searched. When a new keyword is encountered, add it to the list; otherwise, update its importance every time it's searched. Launch a script to request URLs for the top, say, 30 keywords each day (more or less, depending on how often words are searched and what you want to do).
This doesn't necessarily solve your problem of having too many requests, but may decrease the likelihood of it becoming too much of a problem.
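A bare-bones sketch of that bookkeeping (the in-memory Counter and the cutoff of 30 are placeholders for whatever storage and threshold you settle on):

from collections import Counter

keyword_counts = Counter()   # in practice you would persist this, e.g. in CouchDB

def record_search(keyword):
    """Bump the importance of a keyword every time someone searches for it."""
    keyword_counts[keyword.strip().lower()] += 1

def keywords_to_prefetch(n=30):
    """The n most-searched keywords, to be re-crawled by the nightly script."""
    return [kw for kw, _ in keyword_counts.most_common(n)]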
HTTP requests can be very expensive. That's why you want to make sure you parallelize your requests, and for that you can use something like Celery. This way you will reduce the total time to that of the slowest-responding website.
It may be a good idea to set the request timeout to a shorter time (5 seconds?) in case one of the websites is not responding to your request.
Have the ability to flag a domain as "down/not responding" and be able to handle those exceptions.
Another optimization would be to store page contents for some time after each search, in case the same search keyword comes in again, so you can skip the expensive requests.
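A rough sketch of the parallel-requests-with-timeout idea; it uses concurrent.futures rather than Celery just to keep the example self-contained, and the cache is a bare in-memory dict standing in for CouchDB:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

CACHE = {}                 # url -> (fetched_at, html)
CACHE_TTL = 15 * 60        # keep pages for 15 minutes

def fetch(url):
    cached = CACHE.get(url)
    if cached and time.time() - cached[0] < CACHE_TTL:
        return url, cached[1]                      # reuse a recent result, skip the request
    try:
        html = requests.get(url, timeout=5).text   # short timeout for slow sites
    except requests.RequestException:
        return url, None                           # flag the domain as down / not responding
    CACHE[url] = (time.time(), html)
    return url, html

def fetch_all(urls):
    with ThreadPoolExecutor(max_workers=15) as pool:
        return dict(pool.map(fetch, urls))         # total time is roughly the slowest site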
I am looking for a way to extract basic stats (total count, density, count in links, hrefs) for words on an arbitrary website, ideally a Python-based solution.
While it is easy to parse a specific website using, say, BeautifulSoup, and determine where the bulk of the content is, it requires you to define the location of the content in the DOM tree ahead of processing. This is easy for, say, hrefs or any arbitrary tag, but gets more complicated when determining where the rest of the data (not enclosed in well-defined markers) is.
If I understand correctly, robots used by the likes of Google (GoogleBot?) are able to extract data from any website to determine the keyword density. My scenario is similar, obtain the info related to the words that define what the website is about (i.e. after removing js, links and fillers).
My question is, are there any libraries or web APIs that would allow me to get statistics of meaningful words from any given page?
There is no API for this, but there are a few libraries that you can use as tools.
You should count the meaningful words and record them over time.
You can also start from something like this (C#):
string Link= "http://www.website.com/news/Default.asp";
string itemToSearch= "Word";
int count = new Regex(itemToSearch).Matches(Link).Count;
MessageBox.Show(count.ToString());
There are multiple libraries that deal with more advanced processing of web articles; this question should be a duplicate of this one.
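For the basic stats in the question (total count, density, count in links), a small sketch with requests, BeautifulSoup, and a Counter goes a long way - no specialised library needed:

import re
from collections import Counter

import requests
from bs4 import BeautifulSoup

html = requests.get("http://www.website.com/news/Default.asp", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Drop script/style so we only count visible text
for tag in soup(["script", "style"]):
    tag.decompose()

words = re.findall(r"[A-Za-z']+", soup.get_text(" ").lower())
link_text = " ".join(a.get_text(" ") for a in soup.find_all("a"))
link_words = Counter(re.findall(r"[A-Za-z']+", link_text.lower()))

counts = Counter(words)
total = sum(counts.values())
for word, count in counts.most_common(20):
    # word, total count, density, count inside links
    print(word, count, round(count / total, 4), link_words[word])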
I'm using python to build an application which functions in a similar way to an RSS aggregator. I'm using the feedparser library to do this. However, I'm struggling to get the program to correctly detect if there is new content.
I'm mainly concerned with news-related feeds. Besides seeing if a new item has been added to the feed, I also want to be able to detect if a previous article has been updated. Does anybody know how I can use feedparser to do this, bearing in mind that the only compulsory item elements are either the title or the description? I'm willing to assume that the link element will always be present as well.
Feedparser's "id" attribute associated with each item seems to simply be the link to the article so this may help with detecting new articles on the feed, but not with detecting updates to previous articles since the "id" for those will not have changed.
I've looked at previous threads on Stack Overflow, and some people have suggested hashing the content or hashing title+url, but I'm not really sure what that means or how one would go about it (if indeed it is the right approach).
Hashing in this context means to calculate a shorter value to represent each combination of url and title. This approach works when you use a hash function that ensures the odds of a collision (two different items generate the same value) are low.
Traditionally, MD5 has been a good function for this (but be careful not to use it for cryptographic operations - it's deprecated for that purpose).
So, for example:
>>> import hashlib
>>> url = "http://www.example.com/article/001"
>>> title = "The Article's Title"
>>> article_id = hashlib.md5((url + title).encode("utf-8")).hexdigest()  # encode to bytes so this works on Python 3
>>> print(article_id)
785cbba05a2929a9f76a06d834140439
>>>
This will provide an id that will change if the URL or title changes - indicating that it is a new article.
You can download and add the content of the article to the hash if you also want to detect edits to the article content.
Note, if you do intend to pull entire pages down, you may want to learn about HTTP conditional GET with Python in order to save bandwidth and be a little friendlier to the sites you are hitting.
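A small sketch of the conditional GET idea using requests; note that feedparser itself also accepts etag and modified arguments when fetching a feed, which achieves much the same thing:

import requests

url = "http://www.example.com/article/001"   # reusing the example URL above

first = requests.get(url, timeout=10)
etag = first.headers.get("ETag")
last_modified = first.headers.get("Last-Modified")

# Later: only re-download if the server says the resource has changed
headers = {}
if etag:
    headers["If-None-Match"] = etag
if last_modified:
    headers["If-Modified-Since"] = last_modified

second = requests.get(url, headers=headers, timeout=10)
if second.status_code == 304:
    print("Not modified - skip re-processing this article")
else:
    print("Changed (or conditional requests unsupported) - process the new content")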