I have collected a list of links from a folder of documents that are essentially Wikipedia pages. I eventually realized that my list of links is incomplete, because my code only collects a few of the links from each Wikipedia page. My goal is to get all links and then filter them afterwards. I should end up with a list of links to train-related accidents. The keywords for such accidents in the links vary (disaster, tragedy, etc.); I don't know them beforehand.
My input is
list_of_urls = []
for file in files:
    text = open('files_overview/'+file, encoding="utf-8").read()
    soup = BeautifulSoup(text, features="lxml")
    for item in soup.findAll("div", attrs={'class':'mw-content-ltr'}):
        url = item.find('a', attrs={'class':'href'=="accident"}):
        # If I don't add something, like "accident", it gives me a syntax error..
        urls = url.get("href")
        urls1 = "https://en.wikipedia.org"+urls
        list_of_urls.append(urls1)
The HTML code from one of my documents, in which multiple links lie, is given below:
</div><div class="mw-category-generated" lang="en" dir="ltr"><div id="mw-pages">
<h2><span class="anchor" id="Pages_in_category"></span>Pages in category "Railway accidents in 1860"</h2>
<p>The following 3 pages are in this category, out of 3 total. This list may not reflect recent changes (learn more).
</p><div lang="en" dir="ltr" class="mw-content-ltr"><h3>A</h3>
<ul><li><a href="/wiki/Atherstone_rail_accident">Atherstone rail accident</a></li></ul><h3>B</h3>
<ul><li><a href="/wiki/Bull_bridge_accident">Bull bridge accident</a></li></ul><h3>H</h3>
<ul><li><span class="redirect-in-category"><a href="/wiki/Helmshore_rail_accident">Helmshore rail accident</a></span></li></ul></div>
</div></div><noscript><img src="//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1" alt="" title="" width="1" height="1" style="border: none; position: absolute;" /></noscript>
<div class="printfooter">Retrieved from "<a dir="ltr" href="https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&oldid=895698968">https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&oldid=895698968</a>"</div></div>
<div id="catlinks" class="catlinks" data-mw="interface"><div id="mw-normal-catlinks"
From the above, I managed to get Atherstone_rail_accident, but not Bull_bridge nor Helmshore.
Does anyone have a better approach?
Thank you for your time
What happens?
You iterate over the result set of soup.findAll("div", attrs={'class':'mw-content-ltr'}), but item.find('a', ...) only returns the first matching element from each <div>; that's why you only get the first link.
Example
list_of_urls = []
for file in files:
    text = open('files_overview/'+file, encoding="utf-8").read()
    soup = BeautifulSoup(text, features="lxml")
    for a in soup.select('div.mw-content-ltr a'):
        list_of_urls.append(f'https://en.wikipedia.org{a["href"]}')
How to fix?
Instead of selecting the <div>, select all the links in your <div> and iterate over them:
for a in soup.select('div.mw-content-ltr a'):
    list_of_urls.append(f'https://en.wikipedia.org{a["href"]}')
Output
['https://en.wikipedia.org/wiki/Atherstone_rail_accident',
'https://en.wikipedia.org/wiki/Bull_bridge_accident',
'https://en.wikipedia.org/wiki/Helmshore_rail_accident']
EDIT
To add the prefix https://en.wikipedia.org later in the process, just skip this task while appending the href to your list:
for a in soup.select('div.mw-content-ltr a'):
    list_of_urls.append(a["href"])
If you would like to request the URLs in a second step, you can do it like this:
for url in list_of_urls:
    response = requests.get(f'https://en.wikipedia.org{url}')
Or, if you just need a list with full URLs, you can build it with a list comprehension:
list_of_urls = [f'https://en.wikipedia.org{url}' for url in list_of_urls]
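For the filtering step mentioned in the question (keeping only train-related accident links), a minimal sketch could look like this. Note that the keyword list below is only an assumption and would need to be extended as you discover more terms in your data:
keywords = ['accident', 'disaster', 'tragedy', 'crash', 'derailment']  # assumed keywords, adjust as needed
filtered_urls = [url for url in list_of_urls
                 if any(keyword in url.lower() for keyword in keywords)]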
You can do it like this:
First, find all the <div> with class name mw-content-ltr using .find_all().
For each <div> obtained above, find all the <a> tags using .find_all(). This will give you a list of <a> tags for each <div>.
Iterate over this list and extract the href from each <a> tag.
Here is the code.
from bs4 import BeautifulSoup
s = """
<div class="mw-category-generated" lang="en" dir="ltr">
<div id="mw-pages">
<h2><span class="anchor" id="Pages_in_category"></span>Pages in category "Railway accidents in 1860"</h2>
<p>The following 3 pages are in this category, out of 3 total. This list may not reflect recent changes (learn more).</p>
<div lang="en" dir="ltr" class="mw-content-ltr">
<h3>A</h3>
<ul>
<li><a href="/wiki/Atherstone_rail_accident">Atherstone rail accident</a></li>
</ul>
<h3>B</h3>
<ul>
<li><a href="/wiki/Bull_bridge_accident">Bull bridge accident</a></li>
</ul>
<h3>H</h3>
<ul>
<li><span class="redirect-in-category">Helmshore rail accident</span></li>
</ul>
</div>
</div>
</div>
<noscript><img src="//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1" alt="" title="" width="1" height="1" style="border: none; position: absolute;" /></noscript>
<div class="printfooter">Retrieved from "<a dir="ltr" href="https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&oldid=895698968">https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&oldid=895698968</a>"</div>
</div>
<div id="catlinks" class="catlinks" data-mw="interface">
"""
soup = BeautifulSoup(s, 'lxml')
divs = soup.find_all('div', class_='mw-content-ltr')
for div in divs:
    for a in div.find_all('a'):
        print(a['href'])
/wiki/Atherstone_rail_accident
/wiki/Bull_bridge_accident
/wiki/Helmshore_rail_accident
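If you need absolute URLs instead of the relative paths printed above, a small additional sketch (not part of the original answer) is to join them with urljoin from the standard library:
from urllib.parse import urljoin

for div in divs:
    for a in div.find_all('a'):
        print(urljoin('https://en.wikipedia.org', a['href']))
        # e.g. https://en.wikipedia.org/wiki/Atherstone_rail_accident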
Related
I need to get hrefs from <a> tags on a website, but not all of them, only the ones that are in the spans located in the <div>s with class arm:
<html>
<body>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="footnote">
<span>
anotherLink
anotherLink
anotherLink
</span>
</div>
</body>
</html>
import requests
from bs4 import BeautifulSoup as bs
request = requests.get("url")
html = bs(request.content, 'html.parser')
for arm in html.select(".arm"):
    anchor = arm.select("span > a")
    print("anchor['href']")
But my code doesn't print anything
Your code looks fine until you get to the print("anchor['href']") line, which I assume is meant to be print(anchor['href']).
Now, anchor is a ResultSet, which means you will need another loop to get the hrefs. Here is how those final lines should look if you want minimal modification to your code:
for arm in html.select(".arm"):
    anchor = arm.select("span > a")
    for x in anchor:
        print(x.attrs['href'])
We basically add:
for x in anchor:
    print(x.attrs['href'])
And you should get the hrefs. All the best.
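As a side note, the two loops can also be collapsed into a single CSS selector; this is just an equivalent, more compact sketch of the code above:
for a in html.select(".arm span > a"):
    print(a['href'])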
Try using the .find_all() method to obtain the values in specific tags and classes.
I have replicated your HTML file and obtained the values in the span tags. Please see my sample code below.
Replicated HTML file:
# Creating the HTML file
file_html = open("demo.html", "w")
# Adding the input data to the HTML file
file_html.write('''<html>
<body>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="footnote">
<span>
anotherLink
anotherLink
anotherLink
</span>
</div>
</body>
</html>''')
# Saving the data into the HTML file
file_html.close()
code:
import requests
from bs4 import BeautifulSoup as bs
#reading the replicated html file
demo = open("demo.html", "r")
results = bs(demo, 'html.parser')
# Using the find_all method to find specific tags and classes
job_elements = results.find_all("div", class_="arm")
for job_element in job_elements:
    links = job_element.find_all("a")
    for link in links:
        print(link['href'])
reference:
https://realpython.com/beautiful-soup-web-scraper-python/
I'm trying to retrieve the table in the ul tag in the following HTML code, which mostly looks like this:
<ul class='list' id='js_list'>
<li class="first">
<div class="meta">
<div class="avatar">...</div>
<div class="name">黑崎一护</div>
<div class="type">...</div>
</div>
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>
but just with more entries. It's from this website: https://moba.163.com/m/wx/ss/.
So far I have this (for specifically getting the win rates):
from bs4 import BeautifulSoup
import requests
r = requests.get("https://moba.163.com/m/wx/ss/")
soup = BeautifulSoup(r.content, 'html5lib')
win_rates = soup.find_all('div', class_ = "winrate")
But this returns an empty list, and it seems like the farthest Beautiful Soup was able to get was the ul tag, with none of the information under it. Is this a parsing issue? Or is there JavaScript source code that I'm missing?
I think your issue is that your format is incorrect for pulling the div with the attribute. I was able to pull the winrate div with this:
soup.find('div',attrs={'class':'winrate'})
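A minimal follow-up sketch, assuming the win-rate divs are present in the static HTML that requests returns (if the page fills them in with JavaScript, requests alone will not see them):
win_rates = [div.get_text(strip=True)
             for div in soup.find_all('div', attrs={'class': 'winrate'})]
print(win_rates)  # expected to be something like ['56.11%', ...] if the data is in the static HTML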
I have an HTML snippet below and I need to delete a block based on its text, for example Name: John. I know I can do this with decompose() from BeautifulSoup using the class name sample, but I cannot apply decompose because the blocks have different attributes as well as tag names, while the text within has the same pattern. Is there any module in bs4 that can solve this?
<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<div>
result:
<div id="container"><div>
To find tags based on inner text see How to select div by text content using Beautiful Soup?
Once you have the required div, you can simply call decompose():
import re
from bs4 import BeautifulSoup

html = '''<div id="container">
    <div class="sample">
        Name:<br>
        <b>John</b>
    </div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
sample = soup.find(text=re.compile('Name'))
sample.parent.decompose()
print(soup.prettify())
Side note: notice that I fixed the closing tag for your "container" div!
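If several blocks share the same text pattern, a hedged extension of the same idea is to loop over every match and decompose each parent:
for match in soup.find_all(text=re.compile('Name')):
    match.parent.decompose()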
I want to get only the content in the <p> tag and remove the surplus div tags.
My code is:
page = """
<p style="text-align: justify">content that I want
<div ><!-- /316485075/agk_116000_pos_3_sidebar_mobile -->
<div id="agk_116000_pos_3_sidebar_mobile">
<script>
script code
</script>
</div>
<div class="nopadding clearfix hidden-print">
<div align="center" class="col-md-12">
<!-- /316485075/agk_116000_pos_4_conteudo_desktop -->
<div id="agk_116000_pos_4_conteudo_desktop" style="height:90px; width:728px;">
<script>
script code
</script>
</div>
</div>
</div>
</div>
</p>
"""
soup = BeautifulSoup(page, 'html.parser')
p = soup.find_all('p', {'style' : 'text-align: justify'})
And I just want to get the string <p>content that I want</p> and remove all the divs
You can use the replace_with() function to remove the tags along with their contents.
soup = BeautifulSoup(html, 'html.parser')  # html is the HTML you've provided in the question
soup.find('div').replace_with('')
print(soup)
Output:
<p style="text-align: justify">content that I want
</p>
Note: I'm using soup.find('div') here as all the unwanted tags are inside the first div tag. Hence, if you remove that tag, all others will get removed. But, if you want to remove all the tags other than the p tags in HTML where the format is not like this, you'll have to use this:
for tag in soup.find_all():
    if tag.name == 'p':
        continue
    tag.replace_with('')
Which is equivalent to:
[tag.replace_with('') for tag in soup.find_all(lambda t: t.name != 'p')]
If you simply want the content that I want text, you can use this:
print(soup.find('p').contents[0])
# content that I want
Capture group 2 contains your content: <(.*?)(?:\s.+?>)(.*?)</\1[>]?
See https://regex101.com/r/m8DQic/1
I have a website that has plenty of hidden tags in the html.
I have pasted the source code below.
The challenge is that there are 2 types of hidden tags:
1. Ones with style="display:none"
2. Ones hidden via a list of class styles declared under every td tag, and that list changes with every td tag.
For the example below, it has the following styles:
.hLcj{display:none}
.J9pE{display:inline}
.kUC-{display:none}
.Dzkb{display:inline}
.mXJU{display:none}
.DZqk{display:inline}
.rr9s{display:none}
.nGF_{display:inline}
So the elements with class hLcj, kUC-, mXJU, rr9s, etc. are hidden elements.
I want to extract the text of the entire tr but exclude these hidden tags.
I have been scratching my head for hours and still no success.
Any help would be much appreciated. Thanks
I am using bs4 and python 2.7
<td class="leftborder timestamp" rel="1416853322">
<td>
<span>
<style>
.hLcj{display:none}
.J9pE{display:inline}
.kUC-{display:none}
.Dzkb{display:inline}
.mXJU{display:none}
.DZqk{display:inline}
.rr9s{display:none}
.nGF_{display:inline}
</style>
<span class="rr9s">35</span>
<span></span>
<div style="display:none">121</div>
<span class="226">199</span>
.
<span class="rr9s">116</span>
<div style="display:none">116</div>
<span></span>
<span class="Dzkb">200</span>
<span style="display: inline">.</span>
<span style="display:none">86</span>
<span class="kUC-">86</span>
<span></span>
120
<span class="kUC-">134</span>
<div style="display:none">134</div>
<span class="mXJU">151</span>
<div style="display:none">151</div>
<span class="rr9s">154</span>
<span class="Dzkb">.</span>
<span class="119">36</span>
<span class="kUC-">157</span>
<div style="display:none">157</div>
<span class="rr9s">249</span>
<div style="display:none">249</div>
</span>
</td>
<td> 7808</td>
Using selenium would make the task much easier since it knows which elements are hidden and which aren't.
But, anyway, here's some basic code that you would probably need to improve further. The idea here is to parse the style tag to get the list of classes to exclude, keep a list of tags to exclude, and check the style attribute of each child element in the tr:
import re
from bs4 import BeautifulSoup

data = """ your html here """
soup = BeautifulSoup(data)
tr = soup.tr

# get classes to exclude
classes_to_exclude = []
for line in tr.style.text.split():
    match = re.match(r'^\.(.*?)\{display:none\}', line)
    if match:
        classes_to_exclude.append(match.group(1))

tags_to_exclude = ['style', 'script']
texts = []
for item in tr.find_all(text=True):
    if item.parent.name in tags_to_exclude:
        continue
    class_ = item.parent.get('class')
    if class_ and class_[0] in classes_to_exclude:
        continue
    if item.parent.get('style') == 'display:none':
        continue
    texts.append(item)

print ''.join(text.strip() for text in texts)
Prints:
199.200.120.36
Also see:
BeautifulSoup Grab Visible Webpage Text