Does anybody have experience getting a Wikipedia page using wikitools for Python (and Django)? I am trying to get the article, but I only get the first few lines and that's it. I need to fetch the whole article and I can't seem to figure it out. The documentation is not very helpful either. My code is:
wikiobj = wiki.Wiki("http://en.wikipedia.org/w/api.php?title=Some_Title&action=raw&maxlag=-1")
wikipage = page.Page(wikiobj, url, section='content')
wikidata = wikipage.getWikiText(True).decode('utf-8', 'replace')
Any help will be appreciated.
I'm using wikitools in my project, not for getting the text of a page, but I initialize the wiki object in a different way:
wikiobj = wiki.Wiki("http://en.wikipedia.org/w/api.php")
wikipage = page.Page(wikiobj, title="Some_Title")
You don't need to supply any query string after api.php in the Wiki class.
Next, look at the definition of Page class:
__init__(self, site, title=False, check=True, followRedir=True, section=False, sectionnumber=False, pageid=False, namespace=False)
So you need to supply the title to the constructor of the Page class (you supplied an unknown url parameter instead).
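Putting it together, a minimal sketch of the whole fetch (assuming the article really is titled "Some_Title"; check getWikiText's exact behaviour against the wikitools docs) would be:

from wikitools import wiki, page

# Point the Wiki object at the bare api.php endpoint, with no query string.
wikiobj = wiki.Wiki("http://en.wikipedia.org/w/api.php")

# Pass the article title to Page; getWikiText() returns the full wikitext of the article.
wikipage = page.Page(wikiobj, title="Some_Title")
wikidata = wikipage.getWikiText()
print(wikidata)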
I'm trying to get the authors of a publication by using Scopus. For that I got an API key and started. I searched for the DOI and got a response. Everything is fine and there is also an "authors" entry, but for each request this field is simply empty. My code in Python is below:
import json
import pyscopus
key = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXX'
doi = '10.1016/0270-0255(87)90003-0'
scopus = pyscopus.Scopus(key)
response_json = json.loads(scopus.search(f'doi({doi})', view='STANDARD').to_json(orient="records"))
So as I said, you can call response_json['authors'], but it is always empty. The authors are given on the website, but web scraping is forbidden. Am I doing something wrong, or do they simply not provide this information (which is confusing, since there is a field for it)? So far I couldn't find an answer.
I know there are other ways, like Crossref, to get this information, but for reasons I want to do it with Scopus.
Thanks!
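One thing that might be worth trying is querying Elsevier's Abstract Retrieval endpoint for the DOI directly instead of the Search API, since the STANDARD search view may simply not carry author data. This is only a sketch: the endpoint path, header name, and response keys below are assumptions to verify against the Scopus API documentation and your subscription level.

import requests

key = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXX'
doi = '10.1016/0270-0255(87)90003-0'

# Abstract Retrieval API addressed by DOI; path and response layout are assumptions.
url = f'https://api.elsevier.com/content/abstract/doi/{doi}'
headers = {'X-ELS-APIKey': key, 'Accept': 'application/json'}
response = requests.get(url, headers=headers)
data = response.json()

# Inspect the payload to locate the author list; the keys below are guesses.
print(data.get('abstracts-retrieval-response', {}).get('authors'))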
I have recently started learning web scraping using Scrapy in Python and am facing issues with scraping data from the AccuWeather site (https://www.accuweather.com/en/gb/london/ec4a-2/may-weather/328328?year=2020).
Basically I am capturing dates and their weather temperatures for my reporting purposes.
When I inspected the site, I found too many div tags, so I am getting confused about how to write the code. Hence I thought I would seek expert help on this.
Here is my code for your reference.
import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://www.accuweather.com/en/gb/london/ec4a-2/may-weather/328328?year=2020']

    def parse(self, response):
        All_div_tags = response.css('div.content-module')[0]
        #Grid_tag = All_div_tags.css('div.monthly-grid')
        Date_tag = All_div_tags.css('div.date::text').extract()
        yield {
            'Date': Date_tag}
I wrote this in PyCharm and am getting the error "code is not handled or not allowed".
Please could someone help me with this?
I've tried to read some websites that gave me the same error. It happens because some websites don't allow web scraping on them. To get data from these websites, you would probably need to use their API if they have one.
Fortunately, AccuWeather has made it easy to use their API (unlike other APIs):
You first need to create an account at their developers' website: https://developer.accuweather.com/
Now, create a new app by going to My Apps > Add a new app.
You will probably see some information about your app (if you don't, press its name and it will probably show up). The only information you will need is your API Key, which is essential for APIs.
AccuWeather has pretty good documentation about their API here, yet I will show you how to use the most useful parts. You will need the location key of the city you want to get the weather for; it is shown in the URL of its weather page. For example, London's URL is www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328, so its location key is 328328.
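If you would rather not read the location key off the URL, AccuWeather also exposes a Locations API you can query by city name. This is a sketch from memory; double-check the endpoint path and response fields against their documentation:

import requests

API_KEY = "APIKEY"  # replace with your own API key

# Search for a city by name to find its location key programmatically.
response = requests.get(
    url="http://dataservice.accuweather.com/locations/v1/cities/search",
    params={"apikey": API_KEY, "q": "London"},
)
for city in response.json():
    print(city["LocalizedName"], city["Key"])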
When you have the location key of the city/cities you want to get the weather from, open a file, and type:
import requests
import json
If you want the daily weather (as shown here), type:
response = requests.get(url="http://dataservice.accuweather.com/forecasts/v1/daily/1day/LOCATIONKEY?apikey=APIKEY")
print(response.status_code)
Replace APIKEY with your API key and LOCATIONKEY with the city's location key. It should now display 200 when you run it (meaning the request was successful).
Now, parse the response body as JSON:
response_json = json.loads(response.content)
And you can now get some information from it, such as the day's "definition":
print(response_json["Headline"]["Text"])
The minimum temperature:
min_temperature = response_json["DailyForecasts"][0]["Temperature"]["Minimum"]["Value"]
print(f"Minimum Temperature: {min_temperature}")
The maximum temperature:
max_temperature = response_json["DailyForecasts"][0]["Temperature"]["Maximum"]["Value"]
print(f"Maximum Temperature: {max_temperature}")
The minimum temperature and maximum temperature with the unit:
min_temperature = str(response_json["DailyForecasts"][0]["Temperature"]["Minimum"]["Value"]) + response_json["DailyForecasts"][0]["Temperature"]["Minimum"]["Unit"]
print(f"Minimum Temperature: {min_temperature}")
max_temperature = str(response_json["DailyForecasts"][0]["Temperature"]["Maximum"]["Value"]) + response_json["DailyForecasts"][0]["Temperature"]["Maximum"]["Unit"]
print(f"Maximum Temperature: {max_temperature}")
And more.
If you have any questions, let me know. I hope I could help you!
Let me get straight to it: I'm trying to make a reader web app like Google Reader, Feedly, etc. Hence I'm trying to get RSS with Python using the feedparser library. The thing is, not every website's RSS is in the same format; some of them have no title, some have no publish date in the RSS. However, I found that digg.com/reader is very useful: Digg's reader gets the RSS with a publish date and title too. I wonder how this works? Anyone got a clue? Any little help would be appreciated.
I've recently done some projects with the feed parser library and it can be very frustrating since many rss feeds are different. What works the most for me is something like this:
# to get posts from hackaday.com
import feedparser

feed = feedparser.parse("http://www.hackaday.com/blog/feed/")  # get feed from hackaday
feed = feed['items']  # get the items in the feed (this is the best way I've found)

print(feed[0]['title'])      # print post title
print(feed[0]['summary'])    # print post summary
print(feed[0]['published'])  # print date published
These are just a few of the different "fields" that feedparser has. To find the one you want, just run these commands in the Python shell and see what fits your needs.
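Because feeds differ in which fields they expose, it is often safer to read them defensively. Entries behave like dictionaries, so something along these lines (a small sketch) avoids KeyErrors when a feed omits a field:

import feedparser

feed = feedparser.parse("http://www.hackaday.com/blog/feed/")
for entry in feed['items']:
    # .get() returns the fallback instead of raising when a feed omits a field
    title = entry.get('title', 'No title')
    published = entry.get('published', 'No publish date')
    print(title, '-', published)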
You can use feedparser to find out whether a website has an Atom or RSS feed, and then deal with each type. If a website's feed has no publish date or title, you can extract them using other libraries such as newspaper or goose-extractor. As an example:
import feedparser
from newspaper import Article

def extract_date(url):
    article = Article(url)
    article.download()
    article.parse()
    return article.publish_date

d = feedparser.parse("http://feeds.feedburner.com/webnewsit")  # an Italian website

latest_entry = d.entries[0]  # the most recent entry
try:
    publish_date = latest_entry.published
except AttributeError:
    # the feed entry carries no publish date, so pull it from the article itself
    publish_date = extract_date(latest_entry.link)
Let me know if you still don't get the publication date
I am trying to export a category from the Turkish Wikipedia by following http://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export . Here is the code I am using:
# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulStoneSoup
from sys import version

link = "http://tr.wikipedia.org/w/index.php?title=%C3%96zel:D%C4%B1%C5%9FaAktar&action=submit"

def get(pages=[], category=False, curonly=True):
    params = {}
    if pages:
        params["pages"] = "\n".join(pages)
    if category:
        params["addcat"] = 1
        params["category"] = category
    if curonly:
        params["curonly"] = 1
    headers = {"User-Agent": "Wiki Downloader -- Python %s, contact: Yaşar Arabacı: yasar11732#gmail.com" % version}
    r = requests.post(link, headers=headers, data=params)
    return r.text

print get(category="Matematik")
Since I am trying to get data from the Turkish Wikipedia, I have used its URL. Other things should be self-explanatory. I am getting the form page that you can use to export data instead of the actual XML. Can anyone see what I am doing wrong here? I have also tried making a GET request.
There is no parameter named category; the category name should be in the catname parameter.
But Special:Export was not built for bots, it was built for humans. So, if you use catname correctly, it will return the form again, this time with the pages from the category filled in. Then you are supposed to click "Submit" again, which will return the XML you want.
I think doing this in code would be too complicated. It would be easier if you used the API instead. There are some Python libraries that can help you with that: Pywikipediabot or wikitools.
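For example, with the API you can list the members of the category and then fetch each page. A sketch using plain requests (on tr.wikipedia the category prefix is "Kategori:", and for large categories you would need to follow the continuation value the API returns):

import requests

api = "http://tr.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Kategori:Matematik",
    "cmlimit": "500",
    "format": "json",
}
r = requests.get(api, params=params)
for member in r.json()["query"]["categorymembers"]:
    print(member["title"])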
Sorry my original answer was horribly flawed. I misunderstood the original intent.
I did some more experimenting because I was curious. It seems that the code you have above is not necessarily incorrect; rather, the Special:Export documentation is misleading. The documentation states that using catname and addcat will add the categories to the output, but instead it only lists the pages and categories within the specified catname inside an HTML form. It seems that Wikipedia actually requires that the pages you wish to download be specified explicitly. Granted, their documentation doesn't appear to be very thorough on that matter. I would suggest that you parse the page for the pages within the category and then explicitly download those pages with your script. I do see an issue with this approach in terms of efficiency: due to the nature of Wikipedia's data, you'll get a lot of pages which are simply category pages of other pages.
As an aside, it could possibly be faster to use the actual corpus of data from Wikipedia which is available for download.
Good luck!
I'm currently adding search functionality to my Django application using django-haystack v2.0.0-beta and Whoosh as the back end. Creating the index and returning the search results works fine so far. Now I want to enable the highlighting feature but I don't get it to work.
I'm using a highly customized setup for which the haystack documentation is not a great help. My Django application is a pure AJAX application, i.e., all requests between client and server are handled asynchronously by using jQuery and $.ajax(). That's why I have written a custom Django view that creates the haystack search queryset manually and dumps the search results into a JSON object. All of this works fine, but the addition of highlighting does not work. Here is my code that I have so far:
search_indexes.py
class CrawledWebpageIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)

    def get_model(self):
        return CrawledWebpage  # This is my Django model
forms.py
class HaystackSearchForm(forms.Form):
    q = forms.CharField(
        max_length=100,
        label='Enter your search query')
views.py (I adapted some code from this post as it looked reasonable to me, but it's probably wrong)
def return_search_results_ajax(request):
    haystack_search_form = HaystackSearchForm(request.POST)
    response = {}
    if haystack_search_form.is_valid():
        search_query = haystack_search_form.cleaned_data['q']
        sqs = SearchQuerySet().filter(content=search_query)
        highlighted_search_form = HighlightedSearchForm(request.POST, searchqueryset=sqs, load_all=True)
        search_results = highlighted_search_form.search()

        # Here I extract those fields of my model that should be displayed as results
        webpage_urls = [result.object.url for result in search_results[:10]]
        response['webpage_urls'] = webpage_urls
    return HttpResponse(json.dumps(response), mimetype='application/json')
This code works fine as far as the search results are returned properly. But when I try to access the highlighted text snippet for a search result, for example for the first one:
print search_results[0].highlighted
Then I always get an empty string as the result: {'text': ['']}
Can anyone help me to get the highlighting feature working? Thank you very much in advance.
It looks like this is possibly a Haystack bug that has gone unresolved for a long time: http://github.com/toastdriven/django-haystack/issues/310
http://github.com/toastdriven/django-haystack/issues/273
http://github.com/toastdriven/django-haystack/issues/582
As an alternative, you could use Haystack's highlighting functionality instead of Whoosh's to highlight the results yourself. For example, once you get your search results in sqs, you could do
from haystack.utils import Highlighter
highlighter = Highlighter(search_query)
print highlighter.highlight(sqs[0].text)
to get the highlighted text of the first result. See http://django-haystack.readthedocs.org/en/latest/highlighting.html for the documentation.
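In your AJAX view, that could look roughly like this (a sketch under the assumption that your index's text field holds the content you want to excerpt; by default highlight() wraps matched terms in <span class="highlighted">):

from haystack.utils import Highlighter

highlighter = Highlighter(search_query)
response['results'] = [
    {
        'url': result.object.url,
        # snippet with the query terms wrapped in <span class="highlighted">
        'snippet': highlighter.highlight(result.text),
    }
    for result in search_results[:10]
]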
I'm not familiar with Haystack but could it be because you're using HaystackSearchForm in one place and HighlightedSearchForm in another?