I am attempting to extract the URLs from within an HTML ordered list using the BeautifulSoup Python module. My code returns a list of None values, equal in number to the items in the ordered list, so I know I'm in the right place in the document. What am I doing wrong?
The URL I am scraping from is http://www.dailykos.com/story/2013/04/27/1203495/-GunFAIL-XV
Here are 5 of 50 lines from the HTML list (apologies for the length):
<div id="body" class="article-body">
<ol>
<li>WACO, TX, 3/18/13: Police responding to a domestic disturbance call found a man struggling to restrain his grandson, who was agitated and holding an AR-15. The cops shot grandpa. But that would totally never happen in a crowded theater.</li>
<li>GROSSE POINTE PARK, MI, 4/06/13: Grosse Pointe Park police arrested a 20-year-old Detroit man April 6 after he accidentally shot a 9mm handgun into the floor of a home in the 1000 block of Beaconsfield. The man was trying to make the gun safe when it discharged.</li>
<li>OTTAWA, KS, 4/13/13: No one was injured when a “negligent” rifle shot rang out Saturday night inside a residence in the 1600 block of South Cedar Street in Ottawa. Dylan Spencer, 22, Ottawa, was arrested by Ottawa police about 7 p.m. on suspicion of unlawfully discharging an AR-15 rifle in his apartment, according to a police report. The bullet exited his apartment, passed through both walls of an occupied apartment and lodged into a utility pole. But of course, Dylan didn't think the gun was loaded. So it's cool.</li>
<li>KLAMATH FALLS, OR, 4/13/13: An investigation into the shooting death of Lee Roy Myers, 47, has been ruled accidental. The Klamath County Major Crimes Team was called to investigate a shooting on Saturday, April 13. An autopsy concluded the cause of death was an accidental, self-inflicted handgun wound.</li>
<li>SOUTHAMPTON, NY, 4/13/13: The report states that the detective visited the home and interviewed the man, who legally owned the Ruger 10/22 rifle. The man said he was cleaning the rifle when it accidentally discharged into his big toe. When the rifle was pointed in a downward angle, inertia caused the firing pin to strike the primer, which caused the rifle to fire, according to the incident report. The detective advised the man on safety techniques while cleaning his rifle. (Step one: unload it.)</li>
And here is my code:
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
li = soup.select("ol > li")
for link in li:
    print(link.get('href'))
You're iterating over li elements, which don't have an href attribute; the a tags inside them do:
import urllib2
from bs4 import BeautifulSoup

url = "http://www.dailykos.com/story/2013/04/27/1203495/-GunFAIL-XV"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

li = soup.select("ol > li > a")
for link in li:
    print(link.get('href'))
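To see the difference without hitting the network, here is a minimal sketch against an inline snippet (the list items and URLs are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mirroring the structure of the page in the question.
html = """
<ol>
  <li><a href="http://example.com/story1">Story 1</a></li>
  <li><a href="http://example.com/story2">Story 2</a></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# Selecting the <li> elements yields None for .get('href') ...
print([li.get("href") for li in soup.select("ol > li")])
# ... while selecting the nested <a> tags yields the URLs.
print([a.get("href") for a in soup.select("ol > li > a")])
```

The first list comes back as [None, None]; the second contains the two URLs.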
I have the following HTML page inside the content_list variable:
<h3 class="sds-heading--7 title">Problems with battery capacity long-term</h3>
<div class="review-byline review-section">
<div>July 21, 2014</div>
<div>By Cathie from San Diego</div>
<div class="review-type"><strong>Owns this car</strong></div>
</div>
<div class="review-section">
<p class="review-body">We have owned our Leaf since May 2011. We have loved the car but are now getting quite concerned. My husband drives the car, on average, 20-40 miles/day to and from work and running errands, mostly 100% on city roads. We live in San Diego, so no issue with winter weather and we live 7 miles from the ocean so seldom have daytime temperatures above 85. Originally, we would get 65-70 miles per 80-90% charge. Last fall we noticed that there was considerably less remaining charge left after a day of driving. He began to track daily miles, remaining "bars", as well as started charging it 100%. For 9 months we have only been getting 40-45 miles on a full charge with only 1-2 "bars" remaining at the end of the day. Sometimes it will be blinking and "talking" to us to get to a charging place ASAP. We just had it into the dealership. Though on a full charge, the car gauge shows 12 bars, the dealership states that the batteries have lost 2 bars via the computer diagnostics (which we are told is a different reading from the car gauge itself) and, that they say, is average and excepted for the car at this age. Everything else (software, diagnostics, etc.) shows 100%, so the dealership thinks that the car is functioning as it should. They are unable to explain why we can only go 40-45 miles on a charge, but keep saying that the car tests out fine. If the distance one is able to drive on a full charge decreases any further, it will begin to render the car useless. As someone else recommended, in retrospect, the best way to go is to lease the Leaf so that battery life is not an issue.</p>
</div>
First I used this code to get to the collection of reviews
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()
header = {'User-Agent': str(ua.safari)}
url = 'https://www.cars.com/research/nissan-leaf-2011/consumer-reviews/?page=1'
response = requests.get(url, headers=header)
print(response)
html_soup = BeautifulSoup(response.text, 'lxml')
content_list = html_soup.find_all('div', attrs={'class': 'consumer-review-container'})
Now I would like to take the date of the review and the name of the reviewer, which in this case would be
<div class="review-byline review-section">
<div>July 21, 2014</div>
<div>By Cathie from San Diego</div>
The problem is that I can't separate those two divs.
My code:
data = []
for e in content_list:
    data.append({
        'review_date': e.find_all("div", {"class": "review-byline"})[0].text,
        'overall_rating': e.select_one('span.sds-rating__count').text,
        'review_title': e.h3.text,
        'review_content': e.select_one('p').text,
    })
The result of my code
{'overall_rating': '4.7',
'review_content': 'This is the perfect electric car for driving around town, doing errands or even for a short daily commuter. It is very comfy and very quick. The only issue was the first gen battery. The 2011-2014 battery degraded quickly and if the owner did not have Nissan replace it, all those cars are now junk and can only go 20 miles or so on a charge. We had Nissan replace our battery with the 2nd gen battery and it is good as new!',
'review_date': '\nFebruary 24, 2020\nBy EVs are the future from Tucson, AZ\nOwns this car\n',
'review_title': 'Great Electric Car!'}
For the first one you could access the <div> directly:
'review_date':e.find("div", {"class":"review-byline"}).div.text,
for the second one use e.g. a CSS selector:
'reviewer_name':e.select_one("div.review-byline div:nth-of-type(2)").text,
Example
url = 'https://www.cars.com/research/nissan-leaf-2011/consumer-reviews/?page=1'
response = requests.get(url, headers=header)
html_soup = BeautifulSoup(response.text, 'lxml')
content_list = html_soup.find_all('div', attrs={'class': 'consumer-review-container'})

data = []
for e in content_list:
    data.append({
        'review_date': e.find("div", {"class": "review-byline"}).div.text,
        'reviewer_name': e.select_one("div.review-byline div:nth-of-type(2)").text,
        'overall_rating': e.select_one('span.sds-rating__count').text,
        'review_title': e.h3.text,
        'review_content': e.select_one('p').text,
    })
data
data
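The two selectors can be checked offline against the byline snippet from the question, without requesting the live page:

```python
from bs4 import BeautifulSoup

# The byline markup quoted in the question.
html = """
<div class="review-byline review-section">
  <div>July 21, 2014</div>
  <div>By Cathie from San Diego</div>
  <div class="review-type"><strong>Owns this car</strong></div>
</div>
"""
e = BeautifulSoup(html, "html.parser")

# First child <div> of the byline holds the date.
review_date = e.find("div", {"class": "review-byline"}).div.text
# nth-of-type(2) picks the second <div> among its siblings: the reviewer.
reviewer_name = e.select_one("div.review-byline div:nth-of-type(2)").text
print(review_date)
print(reviewer_name)
```

This prints the date and reviewer name separately, which is exactly the split the question asks for.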
I am scraping a wiki page, but there are some empty <td> elements in some rows, so I used:
for tr in table1.tbody:
    list = []
    for td in tr:
        try:
            if td.text is None:
                list.append('NA')
            else:
                list.append(td.text.strip())
        except:
            list.append(td.strip())
to store those row elements in a list. But when I print the list, the rows with empty <td> values, which should now contain 'NA', are still empty; i.e., 'NA' has not been appended to the list.
How can I fix this?
Note: the question needs improvement. While you update it, here are two options to fix the problem.
Option#1
Use pandas to get the tables in a quick and proper way:
import pandas as pd
pd.concat(pd.read_html('https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches#Past_launches')[2:11])
Option #2
Put the list outside, before your loops, so you avoid overwriting it, and check your indentation:
data = []
for tr in table1.tbody:
    for td in tr:
        try:
            if td.text is None:
                data.append('NA')
            else:
                data.append(td.text.strip())
        except:
            data.append(td.strip())
A few things here:
Don't use list as a variable name; it's a built-in type in Python.
td.text is never None here; the content is actually a string (e.g. ' '), so the is None check never fires.
You are not iterating over the tr and td tags (at least not in the code you are providing here). You need to build your lists of tr and td elements to use in your for loops.
Try this:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches#Past_launches'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table1 = soup.find_all('table')[2]

stored_list = []
for tr in table1.tbody.find_all('tr'):
    for td in tr.find_all('td'):
        if td.text.strip() == '':
            stored_list.append('NA')
        else:
            stored_list.append(td.text.strip())
Output:
print(stored_list)
['4 June 2010,18:45', 'F9 v1.0[7]B0003[8]', 'CCAFS,SLC-40', 'Dragon Spacecraft Qualification Unit', 'NA', 'LEO', 'SpaceX', 'Success', 'Failure[9][10](parachute)', 'First flight of Falcon 9 v1.0.[11] Used a boilerplate version of Dragon capsule which was not designed to separate from the second stage.(more details below) Attempted to recover the first stage by parachuting it into the ocean, but it burned up on reentry, before the parachutes even go to deploy.[12]', '8 December 2010,15:43[13]', 'F9 v1.0[7]B0004[8]', 'CCAFS,SLC-40', 'Dragon demo flight C1(Dragon C101)', 'NA', 'LEO (ISS)', 'NASA (COTS)\nNRO', 'Success[9]', 'Failure[9][14](parachute)', "Maiden flight of SpaceX's Dragon capsule, consisting of over 3 hours of testing thruster maneuvering and then reentry.[15] Attempted to recover the first stage by parachuting it into the ocean, but it disintegrated upon reentry, again before the parachutes were deployed.[12] (more details below) It also included two CubeSats,[16] and a wheel of Brouère cheese. Before the launch, SpaceX discovered that there was a crack in the nozzle of the 2nd stage's Merlin vacuum engine. So Elon just had them cut off the end of the nozzle with a pair of shears and launched the rocket a few days later. After SpaceX had trimmed the nozzle, NASA was notified of the change and they agreed to it.[17]", '22 May 2012,07:44[18]', 'F9 v1.0[7]B0005[8]', 'CCAFS,SLC-40', 'Dragon demo flight C2+[19](Dragon C102)', '525\xa0kg (1,157\xa0lb)[20] (excl. Dragon mass)', 'LEO (ISS)', 'NASA (COTS)', 'Success[21]', 'No attempt', 'The Dragon spacecraft demonstrated a series of tests before it was allowed to approach the International Space Station. Two days later, it became the first commercial spacecraft to board the ISS.[18] (more details below)', '8 October 2012,00:35[22]', 'F9 v1.0[7]B0006[8]', 'CCAFS,SLC-40', 'SpaceX CRS-1[23](Dragon C103)', '4,700\xa0kg (10,400\xa0lb) (excl. 
Dragon mass)', 'LEO (ISS)', 'NASA (CRS)', 'Success', 'No attempt', 'Orbcomm-OG2[24]', '172\xa0kg (379\xa0lb)[25]', 'LEO', 'Orbcomm', 'Partial failure[26]', "CRS-1 was successful, but the secondary payload was inserted into an abnormally low orbit and subsequently lost. This was due to one of the nine Merlin engines shutting down during the launch, and NASA declining a second reignition, as per ISS visiting vehicle safety rules, the primary payload owner is contractually allowed to decline a second reignition. NASA stated that this was because SpaceX could not guarantee a high enough likelihood of the second stage completing the second burn successfully which was required to avoid any risk of secondary payload's collision with the ISS.[27][28][29]", '1 March 2013,15:10', 'F9 v1.0[7]B0007[8]', 'CCAFS,SLC-40', 'SpaceX CRS-2[23](Dragon C104)', '4,877\xa0kg (10,752\xa0lb) (excl. Dragon mass)', 'LEO (ISS)', 'NASA (CRS)', 'Success', 'No attempt', 'Last launch of the original Falcon 9 v1.0 launch vehicle, first use of the unpressurized trunk section of Dragon.[30]', '29 September 2013,16:00[31]', 'F9 v1.1[7]B1003[8]', 'VAFB,SLC-4E', 'CASSIOPE[23][32]', '500\xa0kg (1,100\xa0lb)', 'Polar orbit LEO', 'MDA', 'Success[31]', 'Uncontrolled(ocean)[d]', 'First commercial mission with a private customer, first launch from Vandenberg, and demonstration flight of Falcon 9 v1.1 with an improved 13-tonne to LEO capacity.[30] After separation from the second stage carrying Canadian commercial and scientific satellites, the first stage booster performed a controlled reentry,[33] and an ocean touchdown test for the first time. 
This provided good test data, even though the booster started rolling as it neared the ocean, leading to the shutdown of the central engine as the roll depleted it of fuel, resulting in a hard impact with the ocean.[31] This was the first known attempt of a rocket engine being lit to perform a supersonic retro propulsion, and allowed SpaceX to enter a public-private partnership with NASA and its Mars entry, descent, and landing technologies research projects.[34] (more details below)', '3 December 2013,22:41[35]', 'F9 v1.1B1004', 'CCAFS,SLC-40', 'SES-8[23][36][37]', '3,170\xa0kg (6,990\xa0lb)', 'GTO', 'SES', 'Success[38]', 'No attempt[39]', 'First Geostationary transfer orbit (GTO) launch for Falcon 9,[36] and first successful reignition of the second stage.[40] SES-8 was inserted into a Super-Synchronous Transfer Orbit of 79,341\xa0km (49,300\xa0mi) in apogee with an inclination of 20.55° to the equator.']
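The 'NA' substitution logic can be verified offline on a minimal table with an empty cell (the cell values below are made up for illustration):

```python
from bs4 import BeautifulSoup

# Tiny hypothetical table with one empty <td>, mimicking the wiki rows.
html = "<table><tbody><tr><td>4 June 2010</td><td></td><td>LEO</td></tr></tbody></table>"
table1 = BeautifulSoup(html, "html.parser").find("table")

stored_list = []
for tr in table1.tbody.find_all("tr"):
    for td in tr.find_all("td"):
        # Empty cells become 'NA'; others are stripped of whitespace.
        stored_list.append("NA" if td.text.strip() == "" else td.text.strip())

print(stored_list)
```

The empty middle cell comes out as 'NA', which is the behavior the original try/except version failed to produce.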
So I am working on a BeautifulSoup scraper that should scrape 100 names from the ranker.com page list. The code is as follows:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.ranker.com/crowdranked-list/best-anime-series-all-time')
soup = BeautifulSoup(r.text, 'html.parser')
for p in soup.find_all('a', class_='gridItem_name__3zasT gridItem_nameLink__3jE6V'):
    print(p.text)
This works and gives the output as
Attack on Titan
My Hero Academia
Naruto: Shippuden
Hunter x Hunter (2011)
One-Punch Man
Fullmetal Alchemist: Brotherhood
One Piece
Naruto
Tokyo Ghoul
Assassination Classroom
The Seven Deadly Sins
Parasyte: The Maxim
Code Geass
Haikyuu!!
Your Lie in April
Noragami
Akame ga Kill!
Dragon Ball
No Game No Life
Fullmetal Alchemist
Dragon Ball Z
Cowboy Bebop
Steins;Gate
Mob Psycho 100
Fairy Tail
I wanted the program to fetch 100 items from the list, but it only gives 25. Can someone please help me with this?
The additional items come from an API call with offset and limit params that determine the next batch of 25 results to return. You can simply remove both of these and get a maximum of 200 results, or leave limit in and set it to 100. You can ignore everything else in the API call apart from the endpoint.
import requests
r = requests.get('https://api.ranker.com/lists/538997/items?limit=100')
data = r.json()['listItems']
ranked_titles = {i['rank']:i['name'] for i in data}
print(ranked_titles)
I have a function which gives commentary statements. The problem is they contain <br> and </br> tags, and I want each statement arranged on a new line.
from pycricbuzz import Cricbuzz

c = Cricbuzz()
commentary1 = []
current_game3 = {}
matches = c.matches()
for match in matches:
    if match['mchstate'] != 'nextlive':
        col = c.commentary(match['id'])
        for my_str in col['commentary']:
            current_game3["commentary2"] = my_str
            commentary1.append(current_game3)
            current_game3 = {}
print(commentary1)
When I print this, I get output as below:
{'commentary2': 'Preview by Tristan Lavalette<br/><br/>The Twenty20 tri-series decider between Australia and New Zealand is set to finish with a bang at the tiny Eden Park on Wednesday (February 21), as another bout of belligerent batting is expected in Auckland.<br/><br/>In a preview of the final, the teams clashed at Eden Park last Friday and produced a run-fest with the rampaging Australia successfully chasing down a record target of 244. The unbeaten Australia head into the final as favourites after a dazzling campaign from their new look side brimming with in-form Big Bash League players and headed by skipper David Warner, whose inventive captaincy has been inspirational.<br/><br/>Astoundingly, Australia is on the brink of leapfrogging into the No.1 T20 ranking having started the tournament a lowly No.7. A victory would be their sixth straight in the format equalling their best ever streak.<br/><br/>Australia\'s hard-hitting batting has relished chasing in every match and New Zealand\'s brains trust might deeply consider bowling first if skipper Kane Williamson wins the toss. Packed with firepower, Australia ooze with match-winners and chased down the record target with relative ease, confirming their penchant to chase. At the comically miniature Eden Park ground, Australia\'s powerful batting will be confident no matter the situation of the match.<br/><br/>Of course, the beleaguered bowlers aren\'t quite as cheery after copping a flogging last start especially to New Zealand dynamo Martin Guptill. Much like their counterparts, the Black Caps boast a high-octane batting order that has been inconsistent throughout the tournament but, ominously, has the artillery to spearhead New Zealand to a triumph.<br/><br/>Australia\'s attack has been settled throughout the tri-series but selectors might be tempted to tweak it in a bid to ruffle the Black Caps. 
Legspinner Adam Zampa could be given a call-up on the wearing pitch - the same one used for Friday\'s encounter - which is set to be helpful for spin.<br/><br/>If Zampa gets the nod, Australia will be faced with a dilemma of culling one of their frontline quicks of Billy Stanlake, Kane Richardson and Andrew Tye, who have each starred at various stages during the tri-series. Australia\'s fresh team has matured quickly but the pressure will be intensified in an away final amid an electrifying atmosphere.<br/><br/>Even they though endured a rocky tournament yielding just one win, New Zealand squeaked past England to reach the decider but will need to lift their game if they are to cause an upset. The Black Caps have been unable to consistently recapture their best after coming into the tri-series ranked No. 2 in the world.<br/><br/>New Zealand\'s eclectic bowling has struggled although the spin combination of Mitchell Santner and Ish Sodhi could prove a handful on this deck. For such a composed and experienced team, New Zealand has looked occasionally rattled having agonisingly lost consecutive matches.<br/><br/>Despite their struggles, New Zealand know one strong performance is enough for them to claim glory in front of their parochial home crowd desperate for some revelry.<br/><br/>With all to play for, the stage is set for a memorably entertaining finish for this inaugural tri-series tournament.<br/><br/>When: Wednesday, February 21, 2018; 7PM local, 11.30AM IST<br/><br/>Where: Eden Park, Auckland<br/><br/>What to expect: There is a chance of showers intervening. 
Once again, there should be plenty of runs on offer on the small ground although the pitch is tipped to produce some turn.<br/><br/>Team News<br/><br/>New Zealand: Despite agonisingly losing their last couple of games, New Zealand are set to stick with the same line-up.<br/><br/>Probable XI: Martin Guptill, Colin Munro, Kane Williamson (c), Colin de Grandhomme, Mark Chapman, Ross Taylor, Tim Seifert (wk), Mitchell Santner, Tim Southee, Ish Sodhi, Trent Boult<br/><br/>Australia: Zampa could be in line to play with the pitch possibly providing some turn. However, a red hot Australia may not want to disturb a winning combination.<br/><br/>Probable XI: David Warner, D\'Arcy Short, Chris Lynn, Glenn Maxwell, Aaron Finch, Marcus Stoinis, Alex Carey (wk), Ashton Agar, Kane Richardson, Andrew Tye, Billy Stanlake<br/><br/>Did you know<br/><br/>- Australia\'s greatest winning streak in T20Is is their six straight victories at the 2010 World T20 before losing the final to England<br/><br/>- David Warner has won 8 of 9 as T20 captain. The best record overall - minimum 10 matches - is Pakistan\'s Sarfraz Ahmed\'s 14 wins from 17 matches<br/><br/>- New Zealand have lost their last four T20I matches at Eden Park<br/><br/>What they said<br/><br/>"We\'ve had three pretty close T20 games, Australia batting exceptionally well at Eden Park and chasing down a score that was pretty formidable. But you\'ve got to be in the final and give yourself a chance" - Mike Hesson, the New Zealand coach.<br/><br/>"You\'ve just got to find a way to get one or two wickets in the first six (overs), it\'s as simple as that" - David Warner, the Australia captain, said about bowling at the tiny Eden Park.'},
I want to arrange like this
Preview by Tristan Lavalette
The Twenty20 tri-series decider between Australia and New Zealand is set to finish with a bang at the tiny Eden Park on Wednesday (February 21), as another bout of belligerent batting is expected in Auckland.
In a preview of the final, the teams clashed at Eden Park last Friday and produced a run-fest with the rampaging Australia successfully chasing down a record target of 244. The unbeaten Australia head into the final as favourites after a dazzling campaign from their new look side brimming with in-form Big Bash League players and headed by skipper David Warner, whose inventive captaincy has been inspirational.
Astoundingly, Australia is on the brink of leapfrogging into the No.1 T20 ranking having started the tournament a lowly No.7. A victory would be their sixth straight in the format equalling their best ever streak.
Australia's hard-hitting batting has relished chasing in every match and New Zealand's brains trust might deeply consider bowling first if skipper Kane Williamson wins the toss. Packed with firepower, Australia ooze with match-winners and chased down the record target with relative ease, confirming their penchant to chase. At the comically miniature Eden Park ground, Australia's powerful batting will be confident no matter the situation of the match.
Of course, the beleaguered bowlers aren't quite as cheery after copping a flogging last start especially to New Zealand dynamo Martin Guptill. Much like their counterparts, the Black Caps boast a high-octane batting order that has been inconsistent throughout the tournament but, ominously, has the artillery to spearhead New Zealand to a triumph.
Australia's attack has been settled throughout the tri-series but selectors might be tempted to tweak it in a bid to ruffle the Black Caps. Legspinner Adam Zampa could be given a call-up on the wearing pitch - the same one used for Friday's encounter - which is set to be helpful for spin.
If Zampa gets the nod, Australia will be faced with a dilemma of culling one of their frontline quicks of Billy Stanlake, Kane Richardson and Andrew Tye, who have each starred at various stages during the tri-series. Australia's fresh team has matured quickly but the pressure will be intensified in an away final amid an electrifying atmosphere.
Even they though endured a rocky tournament yielding just one win, New Zealand squeaked past England to reach the decider but will need to lift their game if they are to cause an upset. The Black Caps have been unable to consistently recapture their best after coming into the tri-series ranked No. 2 in the world.
New Zealand's eclectic bowling has struggled although the spin combination of Mitchell Santner and Ish Sodhi could prove a handful on this deck. For such a composed and experienced team, New Zealand has looked occasionally rattled having agonizingly lost consecutive matches.
Despite their struggles, New Zealand knows one strong performance is enough for them to claim glory in front of their parochial home crowd desperate for some revelry.
Assuming you want to print each commentary dictionary in the commentary1 list, you want to replace the
print(commentary1)
line with
print("\n".join([" ".join(i.values()).replace("<br/><br/>", "\n") for i in commentary1]))
That will take all the dictionaries in the commentary1 list, take all of their values, join them with a space, replace the <br/><br/> tags with \n, and then join the results with newlines.
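Here is the one-liner on a small made-up stand-in for the real commentary data, so the transformation is easy to follow:

```python
# Hypothetical commentary entries; the real ones come from pycricbuzz.
commentary1 = [
    {"commentary2": "Preview<br/><br/>First paragraph.<br/><br/>Second paragraph."},
    {"commentary2": "Another ball-by-ball line."},
]

# Join each dict's values, turn paragraph breaks into newlines, join entries.
print("\n".join([" ".join(i.values()).replace("<br/><br/>", "\n") for i in commentary1]))
```

Each <br/><br/> becomes a line break, and each dictionary's text lands on its own line(s).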
Use this:
from pycricbuzz import Cricbuzz

c = Cricbuzz()
commentary1 = []
current_game3 = {}
matches = c.matches()
for match in matches:
    if match['mchstate'] != 'nextlive':
        col = c.commentary(match['id'])
        for my_str in col['commentary']:
            current_game3["commentary2"] = my_str.replace('<br/>', '\n')
            commentary1.append(current_game3)
            current_game3 = {}

for comment in commentary1:
    print(comment['commentary2'])
Partial Output:
Preview by Tristan Lavalette
The Twenty20 tri-series decider between Australia and New Zealand is
set to finish with a bang at the tiny Eden Park on Wednesday (February
21), as another bout of belligerent batting is expected in Auckland.
In a preview of the final, the teams clashed at Eden Park last Friday
and produced a run-fest with the rampaging Australia successfully
chasing down a record target of 244. The unbeaten Australia head into
the final as favourites after a dazzling campaign from their new look
side brimming with in-form Big Bash League players and headed by
skipper David Warner, whose inventive captaincy has been
inspirational.
Astoundingly, Australia is on the brink of leapfrogging into the No.1
T20 ranking having started the tournament a lowly No.7. A victory
would be their sixth straight in the format equalling their best ever
streak.
I wanted to get a paragraph from a site, and I've done it this way: I get the text of the webpage, removing all HTML tags, and I want to find out if it's possible to get a certain paragraph from all the text it returned.
Here's my code:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://en.wikipedia.org/wiki/Aras_(river)")
txt = response.content
soup = BeautifulSoup(txt,'lxml')
filtered = soup.get_text()
print(filtered)
Here's part of the text it printed out:
Basin
Main source
Erzurum Province, Turkey
River mouth
Kura river
Physical characteristics
Length
1,072 km (666 mi)
The Aras or Araxes is a river in and along the countries of Turkey,
Armenia, Azerbaijan, and Iran. It drains the south side of the Lesser
Caucasus Mountains and then joins the Kura River which drains the north
side of those mountains. Its total length is 1,072 kilometres (666 mi).
Given its length and a basin that covers an area of 102,000 square
kilometres (39,000 sq mi), it is one of the largest rivers of the
Caucasus.
Contents
1 Names
2 Description
3 Etymology and history
4 Iğdır Aras Valley Bird Paradise
5 Gallery
6 See also
7 Footnotes
And I only want to get this paragraph:
The Aras or Araxes is a river in and along the countries of Turkey,
Armenia, Azerbaijan, and Iran. It drains the south side of the Lesser
Caucasus Mountains and then joins the Kura River which drains the north
side of those mountains. Its total length is 1,072 kilometres (666 mi).
Given its length and a basin that covers an area of 102,000 square
kilometres (39,000 sq mi), it is one of the largest rivers of the
Caucasus.
Is it possible to filter out this paragraph?
soup = BeautifulSoup(txt,'lxml')
filtered = soup.p.get_text() # get the first p tag.
print(filtered)
out:
The Aras or Araxes is a river in and along the countries of Turkey, Armenia, Azerbaijan, and Iran. It drains the south side of the Lesser Caucasus Mountains and then joins the Kura River which drains the north side of those mountains. Its total length is 1,072 kilometres (666 mi). Given its length and a basin that covers an area of 102,000 square kilometres (39,000 sq mi), it is one of the largest rivers of the Caucasus.
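The "first <p>" approach can be checked offline on a tiny made-up page, since soup.p always returns the document's first <p> tag:

```python
from bs4 import BeautifulSoup

# Hypothetical page: infobox-style table noise followed by the lead paragraph.
txt = ("<html><body><table><tr><td>Basin</td></tr></table>"
       "<p>The Aras or Araxes is a river.</p><p>More text.</p></body></html>")
soup = BeautifulSoup(txt, "html.parser")

filtered = soup.p.get_text()  # .p is shorthand for the first <p> in the document
print(filtered)
```

Only the first paragraph's text comes back; the table text and later paragraphs are skipped.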
Use XPath instead! It is much easier and more accurate, and it was designed specifically for these use cases. Unfortunately, BeautifulSoup does not support XPath directly; you need to use the lxml package instead.
import urllib2
from lxml import etree

response = urllib2.urlopen("https://en.wikipedia.org/wiki/Aras_(river)")
parser = etree.HTMLParser()
tree = etree.parse(response, parser)
tree.xpath('string(//*[@id="mw-content-text"]/p[1])')
Explanation of the XPath:
// selects nodes anywhere in the document.
* matches any tag.
[@id="mw-content-text"] specifies a condition on the id attribute.
p[1] selects the first element of type p inside the container.
string(...) is a function that gives you the string representation of the element(s).
By the way, if you use Google Chrome or Firefox you can test the XPath expression inside DevTools using the $x function:
$x('string(//*[@id="mw-content-text"]/p[1])')
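The same XPath pattern can be exercised offline with lxml on an inline page (the markup below is a minimal stand-in for the Wikipedia article):

```python
from lxml import etree

# Hypothetical page with the same container id as the Wikipedia article.
html = """
<html><body>
  <div id="mw-content-text">
    <p>The Aras or Araxes is a river.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""
tree = etree.fromstring(html, etree.HTMLParser())

# Note the attribute axis is @id, not #id.
print(tree.xpath('string(//*[@id="mw-content-text"]/p[1])'))
```

p[1] selects only the first paragraph inside the container; the second is ignored.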