mail ID can't be scraped - python

I am trying to scrape the mail ID with Scrapy, Python and RegEx from this page: https://allevents.in/bangalore/project-based-summer-training-program/1851553244864163 .
For that purpose, I wrote the following commands, each of which returned an empty list:
response.xpath('//a/*[@href = "#"]/text()').extract()
response.xpath('//a/@onclick').extract()
response.xpath('//a/@onclick/text()').extract()
response.xpath('//span/*[@class = ""]/a/text()').extract()
Apart from these, I had a plan to scrape the email ID from the description using RegEx. For that, I wrote a command to scrape the description, which scraped everything except the email ID at the end of the description:
response.xpath('//*[@property = "schema:description"]/text()').extract()
The output of the above command is:
[u'\n\t\t\t\t\t\t\t "Your Future is created by what you do today Let\'s shape it With Summer Training Program \u2026\u2026\u2026 ."', u'\n', u'\nWith ever changing technologies & methodologies, the competition today is much greater than ever before. The industrial scenario needs constant technical enhancements to cater to the rapid demands.', u'\nHT India Labs is presenting Summer Training Program to acquire and clear your concepts about your respective fields. ', u'\nEnroll on ', u' and avail Early bird Discounts.', u'\n', u'\nFor Registration or Enquiry call 9911330807, 7065657373 or write us at ', u'\t\t\t\t\t\t']

I don't have much knowledge of the onclick event attribute. I suppose that when it is set to return false, the request usually skips that portion. However, if you try the approach shown below, you may get a result very close to what you want.
import requests
from scrapy import Selector

res = requests.get("https://allevents.in/bangalore/project-based-summer-training-program/1851553244864163")
sel = Selector(res)
for items in sel.css("div[property='schema:description']"):
    # the obfuscated address sits in a span inside the description block
    emailid = items.css("span::text").extract_first()
    print(emailid)
Output:
htindialabsworkshops | gmail ! com
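The page obfuscates the address, so the scraped text still needs normalizing. A minimal sketch, assuming " | " stands for "@" and " ! " stands for "." (which matches the output above):
import re

def deobfuscate(text):
    # Assumption: " | " encodes "@" and " ! " encodes "." in the page's obfuscation scheme.
    text = re.sub(r'\s*\|\s*', '@', text.strip())
    text = re.sub(r'\s*!\s*', '.', text)
    return text

print(deobfuscate("htindialabsworkshops | gmail ! com"))  # -> htindialabsworkshops@gmail.com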

Related

Web Scraping AccuWeather site

I have recently started learning web scraping using Scrapy in Python and am facing issues scraping data from the AccuWeather site (https://www.accuweather.com/en/gb/london/ec4a-2/may-weather/328328?year=2020).
Basically I am capturing dates and their weather temperatures for my reporting purposes.
When I inspected the site I found too many div tags, so I am getting confused about how to write the code. Hence I thought I would seek experts' help on this.
Here is my code for your reference.
import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://www.accuweather.com/en/gb/london/ec4a-2/may-weather/328328?year=2020']

    def parse(self, response):
        All_div_tags = response.css('div.content-module')[0]
        #Grid_tag = All_div_tags.css('div.monthly-grid')
        Date_tag = All_div_tags.css('div.date::text').extract()
        yield {'Date': Date_tag}
I wrote this in PyCharm and am getting an error: "code is not handled or not allowed".
Please could someone help me with this?
I've tried to scrape some websites that gave me the same error. It happens because some websites don't allow web scraping. To get data from these websites, you would probably need to use their API, if they have one.
Fortunately, AccuWeather has made it easy to use their API (unlike other APIs):
You first need to create an account at their developers' website: https://developer.accuweather.com/
Now, create a new app by going to My Apps > Add a new app.
You will probably see some information about your app (if you don't, press its name and it will probably show up). The only piece of information you will need is your API key, which every request must include.
AccuWeather has pretty good documentation about their API here, but I will show you how to use the most useful endpoints. You will need the location key of the city you want to get the weather for; it is shown in the URL of the city's weather page. For example, London's URL is www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328, so its location key is 328328.
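If you want to pull that key out of a URL programmatically, here is a small sketch; the regex simply grabs the trailing number, which may not hold for every AccuWeather URL format:
import re

url = "https://www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328"
match = re.search(r'/(\d+)(?:\?.*)?$', url)  # trailing numeric path segment
if match:
    location_key = match.group(1)
    print(location_key)  # 328328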
When you have the location key of the city/cities you want to get the weather from, open a file, and type:
import requests
import json
If you want the daily weather (as shown here), type:
response = requests.get(url="http://dataservice.accuweather.com/forecasts/v1/daily/1day/LOCATIONKEY?apikey=APIKEY")
print(response.status_code)
Replace APIKEY with your API key and LOCATIONKEY with the city's location key. It should now display 200 when you run it (meaning the request was successful).
Now, load it as a JSON file:
response_json = json.loads(response.content)
And you can now get some information from it, such as the day's "definition":
print(response_json["Headline"]["Text"])
The minimum temperature:
min_temperature = response_json["DailyForecasts"][0]["Temperature"]["Minimum"]["Value"]
print(f"Minimum Temperature: {min_temperature}")
The maximum temperature:
max_temperature = response_json["DailyForecasts"][0]["Temperature"]["Maximum"]["Value"]
print(f"Maximum Temperature: {max_temperature}")
The minimum temperature and maximum temperature with the unit:
min_temperature = str(response_json["DailyForecasts"][0]["Temperature"]["Minimum"]["Value"]) + response_json["DailyForecasts"][0]["Temperature"]["Minimum"]["Unit"]
print(f"Minimum Temperature: {min_temperature}")
max_temperature = str(response_json["DailyForecasts"][0]["Temperature"]["Maximum"]["Value"]) + response_json["DailyForecasts"][0]["Temperature"]["Maximum"]["Unit"]
print(f"Maximum Temperature: {max_temperature}")
And more.
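Putting those pieces together, here is a minimal sketch of a helper; the function name and return shape are my own, not part of the AccuWeather docs:
import requests

def get_daily_forecast(location_key, api_key):
    # 1-day daily forecast endpoint, as used above
    url = f"http://dataservice.accuweather.com/forecasts/v1/daily/1day/{location_key}"
    response = requests.get(url, params={"apikey": api_key})
    response.raise_for_status()  # fail loudly instead of silently parsing an error body
    data = response.json()
    forecast = data["DailyForecasts"][0]
    minimum = forecast["Temperature"]["Minimum"]
    maximum = forecast["Temperature"]["Maximum"]
    return {
        "headline": data["Headline"]["Text"],
        "min": f'{minimum["Value"]}{minimum["Unit"]}',
        "max": f'{maximum["Value"]}{maximum["Unit"]}',
    }

# Example (replace with your own key and location key):
# print(get_daily_forecast("328328", "APIKEY"))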
If you have any questions, let me know. I hope I could help you!

Trying to scrape this website using python but unable to get the required data

I am trying to get the company names in this website https://siftery.com/microsoft-outlook
Basically it lists some companies that use Microsoft Outlook.
I used BeautifulSoup, requests, urllib and urllib2, but I still am not getting the names of the companies that use Microsoft Outlook, not even on the first page of the website.
The code I wrote is below -
r = requests.get('http://siftery.com/microsoft-outlook')
print(str(r.content))
f = open('abc.txt', 'wb')  # r.content is bytes, so write in binary mode
f.write(r.content)
f.close()
and part of the output that looks interesting is this -
({"name":"Marketing","handle":"marketing","categories":[{"name":"Marketing Automation","handle":"marketing-automation","external_id":"tgJ_49k7v4J-wV","parent_handle":null,"categories":[{"name":"Marketing Automation Platforms","handle":"marketing-automation-platforms","external_id":"tgJLE9aHoLdneT","parent_handle":"marketing-automation"},
BeautifulSoup also gives me the same output, as do the other libraries.
It seems like "external_id" is where the company name is? I'm not sure. I also tried to manually find the name of a company, for example Acxiom, using gedit, but couldn't find any occurrence.
That site loads the information using JavaScript, which means that when you make the request, the DOM is rendered without the information because it is loaded asynchronously. For sites like that you should use Selenium.
Note:
Before you build a scraper, check whether the site has an API or endpoints with CORS disabled. In your case you can get the information by making a POST request to https://siftery.com/product-json/<product_name>
The data is available directly as JSON. You can use requests to get it like this:
import requests

r = requests.post('https://siftery.com/product-json/microsoft-outlook')
data = r.json()['content']
companies = data['companies']
for company in companies:
    print(companies[company]['name'])
Output
Public Technologies
Consalta
PagesJaunes.ca
Chumbak
Media Classified
P.I. Works
Saatchi & Saatchi Pro
Tribeck Strategies
Marketecture Solutions, LLC
Trinity Ventures
ARGOS
CFN Services
Last.Backend
Saatchi & Saatchi USA
Netcad
Central Element
NextGear Capital
Masao
Avalon
Motiwe
Bilge Adam
Impakt Athletics
SOZO Design
ThroughTek
Abovo42
Acxiom
ICEPAY
Connexta
Clearview
Mortgage Coach
There are other categories of information which you might want to investigate:
>>> data.keys()
[u'product', u'vendor', u'users', u'group_members', u'companies', u'customers', u'other_categories', u'current_user', u'page_info', u'portfolio_products', u'primary_category', u'metadata']
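Since the layout of those sub-dictionaries isn't documented anywhere, it is worth dumping a sample before writing extraction code against it; for example:
import json

# Pretty-print one section to see its structure before relying on it.
print(json.dumps(data['page_info'], indent=2))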

Scrapy, Crawling Reviews on Tripadvisor: extract more hotel and user information

I need to extract more information from TripAdvisor.
My code:
item = TripadvisorItem()
item['url'] = response.url.encode('ascii', errors='ignore')
item['state'] = hxs.xpath('//*[@id="PAGE"]/div[2]/div[1]/ul/li[2]/a/span/text()').extract()[0].encode('ascii', errors='ignore')
if(item['state']==[]):
    item['state'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[contains(@class,"region_title")][2]/text()').extract()
item['city'] = hxs.select('//*[@id="PAGE"]/div[2]/div[1]/ul/li[3]/a/span/text()').extract()
if(item['city']==[]):
    item['city'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[1]/span/text()').extract()
if(item['city']==[]):
    item['city'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[3]/span/text()').extract()
item['city'] = item['city'][0].encode('ascii', errors='ignore')
item['hotelName'] = hxs.xpath('//*[@id="HEADING"]/span[2]/span/a/text()').extract()
item['hotelName'] = item['hotelName'][0].encode('ascii', errors='ignore')
reviews = hxs.select('.//div[contains(@id, "review")]')
1. For every hotel on TripAdvisor there is an ID number, like 80075 for this hotel: http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-Amsterdam_Court_Hotel-New_York_City_New_York.html#REVIEWS
How can I extract this ID from the TA item?
2. More information I need for every hotel: shortDescription, stars, zipCode, country and coordinates (long, lat). Can I extract these things?
3. I need to extract the traveller type for every review. How?
My code for the reviews:
for review in reviews:
    it = Review()
    it['state'] = item['state']
    it['city'] = item['city']
    it['hotelName'] = item['hotelName']
    it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/@title').extract()
    if(it['date']==[]):
        it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/text()').extract()
    if(it['date']!=[]):
        it['date'] = it['date'][0].encode('ascii', errors='ignore').replace("Reviewed","").strip()
    it['userName'] = review.xpath('.//div[contains(@class,"username mo")]/span/text()').extract()
    if (it['userName']!=[]):
        it['userName'] = it['userName'][0].encode('ascii', errors='ignore')
    it['userLocation'] = ''.join(review.xpath('.//div[contains(@class,"location")]/text()').extract()).strip().encode('ascii', errors='ignore')
    it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div[1]/div[contains(@class,"quote")]/text()').extract()
    if(it['reviewTitle']!=[]):
        it['reviewTitle'] = it['reviewTitle'][0].encode('ascii', errors='ignore')
    else:
        it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div/div[1]/a/span[contains(@class,"noQuotes")]/text()').extract()
        if(it['reviewTitle']!=[]):
            it['reviewTitle'] = it['reviewTitle'][0].encode('ascii', errors='ignore')
    it['reviewContent'] = review.xpath('.//div[1]/div[2]/div[1]/div[3]/p/text()').extract()
    if(it['reviewContent']!=[]):
        it['reviewContent'] = it['reviewContent'][0].encode('ascii', errors='ignore').strip()
    it['generalRating'] = review.xpath('.//div/div[2]/div/div[2]/span[1]/img/@alt').extract()
    if(it['generalRating']!=[]):
        it['generalRating'] = it['generalRating'][0].encode('ascii', errors='ignore').split()[0]
Is there a good manual for how to find these things? I got lost among all the spans and divs...
Thanks!
I'll try to do this purely in XPath. Unfortunately, it looks like most of the info you want is contained in <script> tags:
Hotel ID - Returns "80075"
substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "locId:")), ",")
Alternatively, the Hotel ID is in the URL, as another answerer mentioned. If you're sure the format will always be the same (such as including a "d" prior to the ID), then you can use that instead.
Rating (the one at the top) - Returns "3.5"
//span[contains(@class, "rating_rr")]/img/@content
There are a couple instances of ratings on this page. The main rating at the top is what I've grabbed here. I haven't tested this within Scrapy, so it's possible that it's populated by JavaScript and not initially loaded as part of the HTML. If that's the case, you'll need to grab it somewhere else or use something like Selenium/PhantomJS.
Zip Code - Returns "10019"
(//span[@property="v:postal-code"]/text())[1]
Again, same deal as above. It's in the HTML, but you should check whether it's there upon page load.
Country - Returns ""US""
substring-before(substring-after(//script[contains(., "modelLocaleCountry")]/text(), "modelLocaleCountry = "), ";")
This one comes with quotes. You can always (and you should) use a pipeline to sanitize scraped data to get it to look the way you want; see the pipeline sketch after the coordinates entry below.
Coordinates - Returns "40.76174" and "-73.985275", respectively
Lat: substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "lat:")), ",")
Lon: substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "lng:")), ",")
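As promised above, a minimal pipeline sketch for stripping those quotes; the field name country is illustrative, not from the question's item class, and you would enable the pipeline via ITEM_PIPELINES in settings.py:
class StripQuotesPipeline(object):
    """Strip surrounding quote characters from a scraped string field."""

    def process_item(self, item, spider):
        # 'country' is a hypothetical field name; adapt it to your item class.
        if item.get('country'):
            item['country'] = item['country'].strip('"')
        return item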
I'm not entirely sure where the short description exists on this page, so I didn't include that. It's possible you have to navigate elsewhere to get it. I also wasn't 100% sure what the "traveler type" meant, so I'll leave that one up to you.
As far as a manual, it's really about practice. You learn tricks and hacks for working within XPath, and Scrapy allows you to use some added features, such as regex and pipelines. I wouldn't recommend the whole "absolute path" XPath (i.e., ./div/div[3]/div[2]/ul/li[3]/...), since any deviation from that within the DOM will completely ruin your scraping. If you have a lot of data to scrape, and you plan on keeping this around a while, your project will become unmanageable very quickly if the site moves around even a single <div>.
I'd recommend more "querying" XPaths, such as //div[contains(#class, "foo")]//a[contains(#href, "detailID")]. Paths like that will make sure that no matter how many elements are placed between the elements you know will be there, and even if multiple target elements are slightly different from each other, you'll be able to grab them consistently.
XPaths are a lot of trial and error. A LOT. Here are a few tools that help me out significantly:
XPath Helper (Chrome extension)
scrapy shell <URL>
scrapy view <URL> (for rendering Scrapy's response in a browser)
PhantomJS (if you're interested in getting data that's been inserted via JavaScript)
Hope some of this helped.
Is it acceptable to get it from the URL using a regex?
import re
hotel_id = re.search('(-d)([0-9]+)', url).group(2)  # group 2 is the numeric ID after "-d"

Preserving line breaks when parsing with Scrapy in Python

I've written a Scrapy spider that extracts text from a page. The spider parses and outputs correctly on many of the pages, but is thrown off by a few. I'm trying to maintain line breaks and formatting in the document. Pages such as http://www.state.gov/r/pa/prs/dpb/2011/04/160298.htm are formatted properly like such:
April 7, 2011
Mark C. Toner
2:03 p.m. EDT
MR. TONER: Good afternoon, everyone. A couple of things at the top,
and then I’ll take your questions. We condemn the attack on innocent
civilians in southern Israel in the strongest possible terms, as well
as ongoing rocket fire from Gaza. As we have reiterated many times,
there’s no justification for the targeting of innocent civilians,
and those responsible for these terrorist acts should be held
accountable. We are particularly concerned about reports that indicate
the use of an advanced anti-tank weapon in an attack against civilians
and reiterate that all countries have obligations under relevant
United Nations Security Council resolutions to prevent illicit
trafficking in arms and ammunition. Also just a brief statement --
QUESTION: Can we stay on that just for one second?
MR. TONER: Yeah. Go ahead, Matt.
QUESTION: Apparently, the target of that was a school bus. Does that
add to your outrage?
MR. TONER: Well, any attack on innocent civilians is abhorrent, but
certainly the nature of the attack is particularly so.
While pages like http://www.state.gov/r/pa/prs/dpb/2009/04/121223.htm have output like this with no line breaks:
April 2, 2009
Robert Wood
11:53 a.m. EDTMR. WOOD: Good morning, everyone. I think it’s just
about still morning. Welcome to the briefing. I don’t have anything,
so – sir.QUESTION: The North Koreans have moved fueling tankers, or
whatever, close to the site. They may or may not be fueling this
missile. What words of wisdom do you have for the North Koreans at
this moment?MR. WOOD: Well, Matt, I’m not going to comment on, you
know, intelligence matters. But let me just say again, we call on the
North to desist from launching any type of missile. It would be
counterproductive. It’s provocative. It further inflames tensions in
the region. We want to see the North get back to the Six-Party
framework and focus on denuclearization.Yes.QUESTION: Japan has also
said they’re going to call for an emergency meeting in the Security
Council, you know, should this launch go ahead. Is this something that
you would also be looking for?MR. WOOD: Well, let’s see if this test
happens. We certainly hope it doesn’t. Again, calling on the North
not to do it. But certainly, we will – if that test does go forward,
we will be having discussions with our allies.
The code I'm using is as follows:
def parse_item(self, response):
    self.log('Hi, this is an item page! %s' % response.url)
    hxs = HtmlXPathSelector(response)
    speaker = hxs.select("//span[contains(@class, 'official_s_name')]") #gets the speaker
    speaker = speaker.select('string()').extract()[0] #extracts speaker text
    date = hxs.select('//*[@id="date_long"]') #gets the date
    date = date.select('string()').extract()[0] #extracts the date
    content = hxs.select('//*[@id="centerblock"]') #gets the content
    content = content.select('string()').extract()[0] #extracts the content
    texts = "%s\n\n%s\n\n%s" % (date, speaker, content) #puts everything together in a string
    filename = ("/path/StateDailyBriefing-" + '%s' ".txt") % (date) #creates a file using the date
    #opens the file defined above and writes 'texts' using utf-8
    with codecs.open(filename, 'w', encoding='utf-8') as output:
        output.write(texts)
I think the problem lies in the formatting of the HTML of the page. On the pages that output the text incorrectly, the paragraphs are separated by <br> <p></p>, while on the pages that output correctly the paragraphs are contained within <p align="left" dir="ltr">. So, while I've identified this, I'm not sure how to make everything output consistently in the correct form.
The problem is that when you get text() or string(), <br> tags are not converted to newlines.
Workaround - replace <br> tags before doing XPath requests. Code:
response = response.replace(body=response.body.replace('<br />', '\n'))
hxs = HtmlXPathSelector(response)
And let me give some advice: if you know that there is only one node, you can use text() instead of string():
date = hxs.select('//*[@id="date_long"]/text()').extract()[0]
Try this xpath:
//*[@id="centerblock"]//text()
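Note that //text() returns a list of text nodes rather than one string, so you would join them yourself; a minimal sketch:
# Join the individual text nodes, dropping whitespace-only ones.
parts = hxs.select('//*[@id="centerblock"]//text()').extract()
content = '\n'.join(t.strip() for t in parts if t.strip())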

How to get plain text out of Wikipedia

I'd like to write a script that gets the Wikipedia description section only. That is, when I say
/wiki bla bla bla
it will go to the Wikipedia page for bla bla bla, get the following, and return it to the chatroom:
"Bla Bla Bla" is the name of a song
made by Gigi D'Agostino. He described
this song as "a piece I wrote thinking
of all the people who talk and talk
without saying anything". The
prominent but nonsensical vocal
samples are taken from UK band
Stretch's song "Why Did You Do It"
How can I do this?
Here are a few different possible approaches; use whichever works for you. All my code examples below use requests for HTTP requests to the API; you can install requests with pip install requests if you have Pip. They also all use the Mediawiki API, and two use the query endpoint; follow those links if you want documentation.
1. Get a plain text representation of either the entire page or the page "extract" straight from the API with the extracts prop
Note that this approach only works on MediaWiki sites with the TextExtracts extension. This notably includes Wikipedia, but not some smaller Mediawiki sites like, say, http://www.wikia.com/
You want to hit a URL like
https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Bla_Bla_Bla&prop=extracts&exintro&explaintext
Breaking that down, we've got the following parameters in there (documented at https://www.mediawiki.org/wiki/Extension:TextExtracts#query+extracts):
action=query, format=json, and titles=Bla_Bla_Bla are all standard MediaWiki API parameters
prop=extracts makes us use the TextExtracts extension
exintro limits the response to content before the first section heading
explaintext makes the extract in the response be plain text instead of HTML
Then parse the JSON response and extract the extract:
>>> import requests
>>> response = requests.get(
... 'https://en.wikipedia.org/w/api.php',
... params={
... 'action': 'query',
... 'format': 'json',
... 'titles': 'Bla Bla Bla',
... 'prop': 'extracts',
... 'exintro': True,
... 'explaintext': True,
... }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> print(page['extract'])
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
2. Get the full HTML of the page using the parse endpoint, parse it, and extract the first paragraph
MediaWiki has a parse endpoint that you can hit with a URL like https://en.wikipedia.org/w/api.php?action=parse&page=Bla_Bla_Bla to get the HTML of a page. You can then parse it with an HTML parser like lxml (install it first with pip install lxml) to extract the first paragraph.
For example:
>>> import requests
>>> from lxml import html
>>> response = requests.get(
... 'https://en.wikipedia.org/w/api.php',
... params={
... 'action': 'parse',
... 'page': 'Bla Bla Bla',
... 'format': 'json',
... }
... ).json()
>>> raw_html = response['parse']['text']['*']
>>> document = html.document_fromstring(raw_html)
>>> first_p = document.xpath('//p')[0]
>>> intro_text = first_p.text_content()
>>> print(intro_text)
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
3. Parse wikitext yourself
You can use the query API to get the page's wikitext, parse it using mwparserfromhell (install it first using pip install mwparserfromhell), then reduce it down to human-readable text using strip_code. strip_code doesn't work perfectly at the time of writing (as shown clearly in the example below) but will hopefully improve.
>>> import requests
>>> import mwparserfromhell
>>> response = requests.get(
... 'https://en.wikipedia.org/w/api.php',
... params={
... 'action': 'query',
... 'format': 'json',
... 'titles': 'Bla Bla Bla',
... 'prop': 'revisions',
... 'rvprop': 'content',
... }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> wikicode = page['revisions'][0]['*']
>>> parsed_wikicode = mwparserfromhell.parse(wikicode)
>>> print(parsed_wikicode.strip_code())
{{dablink|For Ke$ha's song, see Blah Blah Blah (song). For other uses, see Blah (disambiguation)}}
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
Background and writing
He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"''.
Music video
The song also featured a popular music video in the style of La Linea. The music video shows a man with a floating head and no arms walking toward what appears to be a shark that multiplies itself and can change direction. This style was also used in "The Riddle", another song by Gigi D'Agostino, originally from British singer Nik Kershaw.
Chart performance
Chart (1999-00)PeakpositionIreland (IRMA)Search for Irish peaks23
References
External links
Category:1999 singles
Category:Gigi D'Agostino songs
Category:1999 songs
Category:ZYX Music singles
Category:Songs written by Gigi D'Agostino
Use the MediaWiki API, which runs on Wikipedia. You will have to do some parsing of the data yourself.
For instance:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&&titles=Bla%20Bla%20Bla
means
fetch (action=query) the content (rvprop=content) of the most recent revision of the page Bla Bla Bla (titles=Bla%20Bla%20Bla) in JSON format (format=json).
You will probably want to search for the query and use the first result, to handle spelling errors and the like.
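A minimal sketch of such a title search using the API's list=search module (the first hit is assumed to be the intended page):
import requests

response = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'query',
        'format': 'json',
        'list': 'search',
        'srsearch': 'Bla Bla Bla',
    },
).json()
# Take the top hit's canonical title and feed it back into the revisions query.
best_title = response['query']['search'][0]['title']
print(best_title)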
You can get wiki data in text format. If you need to access many titles' information, you can get all of the titles' wiki data in a single call. Use the pipe character (|) to separate the titles.
http://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exlimit=max&explaintext&exintro&titles=Yahoo|Google&redirects=
This API call returns both Google's and Yahoo's data.
explaintext => return extracts as plain text instead of limited HTML.
exlimit=max (currently 20) => otherwise only one result will be returned.
exintro => return only the content before the first section. If you want the full data, just remove this.
redirects= => resolve redirects.
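The same call expressed with requests, so you don't have to build the URL by hand (the parameter set is identical to the URL above):
import requests

response = requests.get(
    'http://en.wikipedia.org/w/api.php',
    params={
        'format': 'json',
        'action': 'query',
        'prop': 'extracts',
        'exlimit': 'max',
        'explaintext': True,
        'exintro': True,
        'titles': 'Yahoo|Google',
        'redirects': True,
    },
).json()
for page in response['query']['pages'].values():
    print(page['title'], '->', page['extract'][:60])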
You can fetch just the first section using the API:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvsection=0&titles=Bla%20Bla%20Bla&rvprop=content
This will give you raw wikitext; you'll have to deal with templates and markup.
Or you can fetch the whole page rendered into HTML which has its own pros and cons as far as parsing:
http://en.wikipedia.org/w/api.php?action=parse&prop=text&page=Bla_Bla_Bla
I can't see an easy way to get parsed HTML of the first section in a single call, but you can do it with two calls by passing the wikitext you receive from the first URL back with text= in place of page= in the second URL.
UPDATE
Sorry I neglected the "plain text" part of your question. Get the part of the article you want as HTML. It's much easier to strip HTML than to strip wikitext!
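A hedged sketch of that two-call approach (the first call fetches section 0 as wikitext, the second renders it via text= with contentmodel=wikitext):
import requests

API = 'https://en.wikipedia.org/w/api.php'

# Call 1: raw wikitext of the first section only (rvsection=0).
pages = requests.get(API, params={
    'action': 'query', 'format': 'json', 'prop': 'revisions',
    'rvprop': 'content', 'rvsection': 0, 'titles': 'Bla Bla Bla',
}).json()['query']['pages']
wikitext = next(iter(pages.values()))['revisions'][0]['*']

# Call 2: render that wikitext to HTML via text= instead of page=.
rendered = requests.get(API, params={
    'action': 'parse', 'format': 'json',
    'text': wikitext, 'contentmodel': 'wikitext',
}).json()
print(rendered['parse']['text']['*'])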
DBPedia is the perfect solution for this problem. Here: http://dbpedia.org/page/Metallica, look at the perfectly organised data using RDF. One can query for anything at http://dbpedia.org/sparql using SPARQL, the query language for RDF. There's always a way to find the pageID so as to get descriptive text, but this should do for the most part.
There will be a learning curve for RDF and SPARQL before you can write any useful code, but this is the perfect solution.
For example, a query run for Metallica returns an HTML table with the abstract in several different languages:
<table class="sparql" border="1">
<tr>
<th>abstract</th>
</tr>
<tr>
<td><pre>"Metallica is an American heavy metal band formed..."#en</pre></td>
</tr>
<tr>
<td><pre>"Metallica es una banda de thrash metal estadounidense..."#es</pre></td>
...
SPARQL query:
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbres: <http://dbpedia.org/resource/>

SELECT ?abstract WHERE {
    dbres:Metallica dbpedia-owl:abstract ?abstract .
}
Change "Metallica" to any resource name (resource name as in wikipedia.org/resourcename) for queries pertaining to abstract.
Alternatively, you can try to load the text of any wiki page simply like this:
https://bn.wikipedia.org/w/index.php?title=User:ShohagS&action=raw&ctype=text
where you change bn to your wiki's language code and User:ShohagS is the page name. In your case use:
https://en.wikipedia.org/w/index.php?title=Bla_bla_bla&action=raw&ctype=text
In a browser, this returns the raw wikitext of the page as plain text.
You can use Python's wikipedia package, and specifically the content attribute for the given page.
From the documentation:
>>> import wikipedia
>>> print wikipedia.summary("Wikipedia")
# Wikipedia (/ˌwɪkɨˈpiːdiə/ or /ˌwɪkiˈpiːdiə/ WIK-i-PEE-dee-ə) is a collaboratively edited, multilingual, free Internet encyclopedia supported by the non-profit Wikimedia Foundation...
>>> wikipedia.search("Barack")
# [u'Barak (given name)', u'Barack Obama', u'Barack (brandy)', u'Presidency of Barack Obama', u'Family of Barack Obama', u'First inauguration of Barack Obama', u'Barack Obama presidential campaign, 2008', u'Barack Obama, Sr.', u'Barack Obama citizenship conspiracy theories', u'Presidential transition of Barack Obama']
>>> ny = wikipedia.page("New York")
>>> ny.title
# u'New York'
>>> ny.url
# u'http://en.wikipedia.org/wiki/New_York'
>>> ny.content
# u'New York is a state in the Northeastern region of the United States. New York is the 27th-most exten'...
I think the better option is to use the extracts prop that the MediaWiki API provides. It returns you only some tags (b, i, h#, span, ul, li) and removes tables, infoboxes, references, etc.
http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Bla%20Bla%20Bla&format=xml
gives you something very simple:
<api><query><pages><page pageid="4456737" ns="0" title="Bla Bla Bla"><extract xml:space="preserve">
<p>"<b>Bla Bla Bla</b>" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, <i>L'Amour Toujours</i>. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with <i>L'Amour Toujours (I'll Fly With You)</i> in its US radio version.</p> <p></p> <h2><span id="Background_and_writing">Background and writing</span></h2> <p>He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song <i>"Why Did You Do It"</i>.</p> <h2><span id="Music_video">Music video</span></h2> <p>The song also featured a popular music video in the style of La Linea. The music video shows a man with a floating head and no arms walking toward what appears to be a shark that multiplies itself and can change direction. This style was also used in "The Riddle", another song by Gigi D'Agostino, originally from British singer Nik Kershaw.</p> <h2><span id="Chart_performance">Chart performance</span></h2> <h2><span id="References">References</span></h2> <h2><span id="External_links">External links</span></h2> <ul><li>Full lyrics of this song at MetroLyrics</li> </ul>
</extract></page></pages></query></api>
You can then run it through a regular expression; in JavaScript it would be something like this (you may have to make some minor modifications):
/^.*<\s*extract[^>]*\s*>\s*((?:[^<]*|<\s*\/?\s*[^>hH][^>]*\s*>)*).*<\s*(?:h|H).*$/.exec(data)
Which gives you (only paragraphs, bold and italic):
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
"...a script that gets the Wikipedia description section only..."
For your application you might want to look at the dumps, e.g.: http://dumps.wikimedia.org/enwiki/20120702/
The particular files you need are the 'abstract' XML files, e.g., this small one (22.7 MB):
http://dumps.wikimedia.org/enwiki/20120702/enwiki-20120702-abstract19.xml
The XML has a tag called 'abstract' which contains the first part of each article.
Otherwise wikipedia2text uses, e.g., w3m to download the page with templates expanded and formatted to text. From that you might be able to pick out the abstract via a regular expression.
You can try WikiExtractor: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
It's for Python 2.7 and 3.3+.
First check here.
There are a lot of invalid syntaxes in MediaWiki's text markup (mistakes made by users...), and only MediaWiki itself can parse this hellish text. But there are still some alternatives to try in the link above. Not perfect, but better than nothing!
You can try the BeautifulSoup HTML parsing library for Python, but you'll have to write a simple parser.
There is also the option to consume Wikipedia pages through a wrapper API like JSONpedia. It works both live (ask for the current JSON representation of a wiki page) and storage-based (query multiple pages previously ingested into Elasticsearch and MongoDB). The output JSON also includes the plain rendered page text.
