I'm trying to get the number of commits of github repos using python and beautiful soup
html code:
<div class="flex-shrink-0">
<h2 class="sr-only">Git stats</h2>
<ul class="list-style-none d-flex">
<li class="ml-0 ml-md-3">
<a data-pjax href="..." class="pl-3 pr-3 py-3 p-md-0 mt-n3 mb-n3 mr-n3 m-md-0 Link--primary no-underline no-wrap">
<span class="d-none d-sm-inline">
<strong>26</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
</a>
</li>
</ul>
</div>
my code:
r = requests.get(source_code_link)
soup = bs(r.content, 'lxml')
spans = soup.find_all('span', class_='d-none d-sm-inline')
for span in spans:
number = span.select_one('strong')
sometimes works but sometimes no because there are more then one span tag with class d-none d-sm-inline.
how can i solve ?
Here's an approach using list commits from GitHub's REST API
import requests
user = ... # username or organisation
repo = ... # repository name
response = requests.get(f"https://api.github.com/repos/{user}/{repo}/commits")
if response.ok:
ncommits = len(response.json())
else:
raise ValueError(f"error: {response.url} responded {response.reason}")
print(ncommits)
from bs4 import BeautifulSoup as bs
html="""<span class="d-none d-sm-inline">
<strong>26</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
<div class="flex-shrink-0">
<h2 class="sr-only">Git stats</h2>
<ul class="list-style-none d-flex">
<li class="ml-0 ml-md-3">
<a data-pjax href="..." class="pl-3 pr-3 py-3 p-md-0 mt-n3 mb-n3 mr-n3 m-md-0 Link--primary no-underline no-wrap">
<span class="d-none d-sm-inline">
<strong>23</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
</a>
</li>
</ul>
</div>"""
I have combine both example which look up for tag strong and based on that prints data by using .contents method
soup = bs(html, 'lxml')
spans = soup.find_all('span', class_='d-none d-sm-inline')
for span in spans:
for tag in span.contents:
if tag.name=="strong" :
print(tag.get_text())
using list comprehension :
for span in spans:
data=[tag for tag in span.contents if tag.name=="strong"]
print(data[0].get_text())
Ouput for both case:
26
23
You can use the find_next() method to look for a <strong> after the class d-none d-sm-inline.
In your case:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("span", class_="d-none d-sm-inline"):
print(tag.find_next("strong").text)
Related
I am trying to do web scraping using BeautifulSoup. The code I have written is below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(type(questions[0]))
When I run the code, I get the error message below:
print(type(questions[10]))
IndexError: list index out of range
Then i tried to print the list like below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(questions)
And then I get an empty list: []
What am I doing wrong?
Thanks for your answers.
.question-summary is incorrect locator because it's a portion of id meaning each id value start with question-summary. Now it's working.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select('[id^="question-summary"]')
print(questions)
Output:
1" data-post-type-id="1" id="question-summary-71715531">
<div class="s-post-summary--stats js-post-summary-stats">
<div class="s-post-summary--stats-item s-post-summary--stats-item__emphasized" title="Score of 0">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">votes</span>
</div>
<div class="s-post-summary--stats-item" title="0 answers">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">answers</span>
</div>
<div class="s-post-summary--stats-item" title="5 views">
<span class="s-post-summary--stats-item-number">5</span>
<span class="s-post-summary--stats-item-unit">views</span>
</div>
</div>
<div class="s-post-summary--content">
<h3 class="s-post-summary--content-title">
<a class="s-link" href="/questions/71715531/is-it-possible-to-draw-a-logistic-regression-graph-with-multiple-x-variable">Is it possible to draw a
logistic regression graph with multiple x variable?</a>
</h3>
<div class="s-post-summary--content-excerpt">
Currently, this is my X and V value. May I know is it possible to draw a logistic regression curve with X that has multiple column? Or I am required to draw multiple graphs to do so?
X = df1.drop(['...
</div>
<div class="s-post-summary--meta">
<div class="s-post-summary--meta-tags tags js-tags t-python-3ûx t-machine-learning">
<a class="post-tag flex--item mt0 js-tagname-python-3ûx" href="/questions/tagged/python-3.x" rel="tag" title="show questions tagged 'python-3.x'">python-3.x</a> <a class="post-tag flex--item mt0 js-tagname-machine-learning" href="/questions/tagged/machine-learning" rel="tag" title="show questions tagged 'machine-learning'">machine-learning</a>
</div>
<div class="s-user-card s-user-card__minimal">
<a class="s-avatar s-avatar__16 s-user-card--avatar" href="/users/14128881/christopher-chua"> <div class="gravatar-wrapper-16" data-user-id="14128881">
<img ,="" alt="user avatar" class="s-avatar--image" height="16" src="https://lh6.googleusercontent.com/-Sn3B_E5hiJc/AAAAAAAAAAI/AAAAAAAAAAA/AMZuucl1oyfdhJiXhrx73JLYqzKAK9icag/photo.jpg?sz=32" width="16"/>
</div>
</a>
<div class="s-user-card--info">
<div class="s-user-card--link d-flex gs4">
<a class="flex--item" href="/users/14128881/christopher-chua">Christopher Chua</a>
</div>
<ul class="s-user-card--awards">
<li class="s-user-card--rep"><span class="todo-no-class-here" dir="ltr" title="reputation score ">7</span></li>
</ul>
</div>
<time class="s-user-card--time">asked <span class="relativetime" title="2022-04-02 07:03:06Z">13 mins ago</span></time>
.. so on
I want to extract that specific "English" text under (li-label-span) tags. How should I do that with beautifulsoup? if anyone is here to help...can you write me some code for this specific problem?
<div class="biblio-info-wrap">
<h2 class="biblio-title">
Product details</h2>
<ul class="biblio-info">
<li>
<label>Publication date</label>
<span itemprop="datePublished">18 Feb 2021</span>
</li>
<li>
<label>Publication City/Country</label>
<span>
Edinburgh, United Kingdom</span>
</li>
***<li>
<label>Language</label>
<span>
English</span>
</li>***
<li>
<label>Edition Statement</label>
<span>Main</span>
</li>
<li>
<label>ISBN10</label>
<span>1786892731</span>
</li>
</ul>
</div>
If html_doc contains the HTML code from your question, you can do:
soup = BeautifulSoup(html_doc, "html.parser")
print(
soup.find("label", text="Language").find_next("span").get_text(strip=True)
)
Prints:
English
Or using CSS selectors:
print(
soup.select_one('label:-soup-contains("Language") + span').get_text(
strip=True
)
)
import BeautifulSoup
from bs4 import BeautifulSoup
and then
soup = BeautifulSoup(html_doc, "html.parser")
name_tag =soup.find("label", text="Language").find_next("span").get_text(strip=True)
print(name_tag)
I stuck in getting all data within span tag. My code gives me only every first value in every a() within the span tag and ignore other values. In my example: (NB I reduced the span contents here, but it lot of inside)
<span class="block-niveaux-sponsors">
<a href="http://www.keolis.com/" id="a47-logo-part-keolis" target="_blank">
<img src="images/visuels_footer/footer/part_keolis.201910210940.jpg"/>
</a>
<div class="clearfix"></div>
</span>
<span class="block-niveaux-sponsors">
<a href="http://www.cg47.fr/" id="a47-logo-part-cg47" target="_blank">
<img src="images/visuels_footer/footer/part_cg47.201910210940.jpg"/>
</a>
<div class="clearfix"></div>
</span>
<span class="block-niveaux-sponsors">
<a href="http://www.errea.it/fr/" id="a47-logo-part-errea" target="_blank">
<img src="images/visuels_footer/footer/part_errea.201910210940.jpg"/>
</a>
<div class="clearfix"></div>
</span>
My code is:
page = urlopen(lien_suagen)
soup = bs(page, 'html.parser')
title_box_agen = soup.find_all('div', attrs={'id':'autres'})
for tag in title_box_agen:
for each_row in tag.find_all('span'):
links = each_row.find('a', href=True)
title = links.get('id')
print(title)
This give me only the first id values in .
I want all id.
You should try:
page = urlopen(lien_suagen)
soup = bs(page, 'html.parser')
title_box_agen = soup.find_all('div', attrs={'id':'autres'})
for tag in title_box_agen:
for each_row in tag.find_all('span'):
links = each_row.find_all('a', href=True)
for link in links:
title = link.get('id')
print(title)
You can get all the link ids for each of the niveux class like this.
(not tested)
page = urlopen(lien_suagen)
soup = bs(page, 'html.parser')
spans_niveux = soup.find_all('span' class_='block-niveaux-sponsors')
for span in spans_niveux:
link = span.find('a', href=True)
id = link.id
print(id)
Here's the HTML code:
<div class="sizeBlock">
<div class="size">
<a class="selectSize" id="44526" data-size-original="36.5">36.5</a>
</div>
<div class="size inactive active">
<a class="selectSize" id="44524" data-size-original="40">40</a>
</div>
<div class="size ">
<a class="selectSize" id="44525" data-size-original="40.5">40.5</a>
</div>
</div>
I want to get the values of the id tag and the data-size-original.
Here's my code:
for sizeBlock in soup.find_all('a', class_="selectSize"):
aid = sizeBlock.get('id')
size = sizeBlock.get('data-size-us')
The problem is that it gets the values of other ids that have the same class "selectSize".
I think this is what you want. You won't have ids and size from data in div class='size inactive active'
for sizeBlock in soup.select('div.size a.selectSize'):
aid = sizeBlock.get('id')
size = sizeBlock.get('data-size-us')
Already answered here How to Beautiful Soup (bs4) match just one, and only one, css class
Use soup.select. Here's a simple test:
from bs4 import BeautifulSoup
html_doc = """<div class="size">
<a class="selectSize otherclass" id="44526" data-ean="0193394075362" " data-tprice="" data-sku="1171177-36.5" data-size-original="36.5">5</a>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
#for sizeBlock in soup.find_all('a', class_= "selectSize"): # this would include the anchor
for sizeBlock in soup.select("a[class='selectSize']"):
aid = sizeBlock.get('id')
size = sizeBlock.get('data-size-original')
print aid, size
I'm trying to extract a class tag from an HTML file, but only if it is located before a given stopping point. What I have is:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
This works, but it finds all instances of myclass, and i only want those before the following text shows in the soup:
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
The thing that makes this block unique are the Title text N lines, especially the Title text N2. line. There are many cat-title tags before, so I can't use that as a stopping condition.
The code surrounding this block looks like this:
...
<div class="myc">
<a class="bbb" href="linkhere_893">
<span class="myclass">Text893</span>
<img data-lazy="https://link893.jpg"/>
</a>
</div>
<div class="myc">
<a class="bbb" href="linkhere_96">
<span class="myclass">Text96</span>
<img data-lazy="https://link96.jpg"/>
</a>
</div>
</div><!-- This closes a list that starts above -->
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div class="list" id="55">
<div class="myc">
<a class="bbb" href="linkhere_34">
<span class="myclass">Text34</span>
<img data-lazy="https://link34.jpg"/>
</a>
</div>
<div class="myc">
...
continuing both above and below.
How can I do this?
Try using find_all_previous():
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
stop_at = soup.find("h4", class_="cat-title", id='55') # finds your stop tag
class_extr = stop_at.find_all_previous("span", class_="myclass")
This will stop at the first <h4 class='cat-title', id=55> tag in the event that there are multiple.
Reference: Beautiful Soup Documentation
How about this:
page = requests.get("https://mysite")
# Split your page and unwanted string, then parse with BeautifulSoup
text = page.text.split('Title text N2.')
soup = BeautifulSoup(text[0], 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
You can try something like this:
from bs4 import BeautifulSoup
page = """
<html><body><p>
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
</p>
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
<p>
<span class="myclass">text 3</span>
<span class="myclass">text 4</span>
</p>
</body>
</html>
"""
soup = BeautifulSoup(page, 'html.parser')
for i in soup.find_all():
if i.name == 'h4' and i.has_attr('class') and i['class'][0] == 'cat-title' and i.has_attr('id') and i['id'] == '55':
if i.find("small") and i.find("small").text.strip()== "Title text N2.":
break
elif i.name == 'span'and i.has_attr('class') and i['class'][0] == 'myclass':
print (i)
Outputs:
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>