python loop for web-scraping - python

I am trying to write a loop that collects all the values inside the li tags so that I can build a DataFrame. So far I can only isolate the block using: new = soup.find("div", class_="PlayerList"). If I use a standard for loop over it, it shows only one value, not all of them.
The output I would like to show is:
Messi,
Shooting 9,
Passing 9,
Tackle 4,
<pre>
import requests
import pandas as pd
import numpy as np
from urllib.request import urlopen
from bs4 import BeautifulSoup
main_url = 'https://examplelistpython.000webhostapp.com/messi.html'
result = requests.get(main_url)
result.text
soup = BeautifulSoup(result.text, 'html.parser')
print(soup.prettify())
new = soup.find("div", class_="PlayerList")
new
</pre>
<ul class="List">
<li>
<div class="PlayerList">
<div class="HeaderList">
<span class="player">Messi</span>
</div>
<div class="PlayerStat">
<span class="stat">Shooting <span class="allStatContainer statShooting" data-stat="Shooting">
9
</span>
</span>
</div>
<div class="PlayerStat">
<span class="stat">Passing <span class="allStatContainer statPassing" data-stat="Passing">
9
</span>
</span>
</div>
<div class="PlayerStat">
<span class="stat">Tackle <span class="allStatContainer statTackle" data-stat="Tackle">
4
</span>
</span>
</div>
</li>
</ul>

player = [i.text.strip() for i in soup.find_all("span", class_="player")]
shooting = [i.text.strip() for i in soup.find_all("span", class_="allStatContainer statShooting")]
passing = [i.text.strip() for i in soup.find_all("span", class_="allStatContainer statPassing")]
tackle = [i.text.strip() for i in soup.find_all("span", class_="allStatContainer statTackle")]
df = pd.DataFrame({'Player': player, 'Shooting': shooting, 'Passing': passing, 'Tackle': tackle})
Result:
  Player Shooting Passing Tackle
0  Messi        9       9      4
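Since the question asks for a loop, here is a minimal sketch that builds one row per PlayerList block, so it also works when the page lists several players. It assumes every stat span carries a data-stat attribute, as in the HTML above; the second player (Xavi) and his values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, plus a hypothetical second player.
html = """
<ul class="List">
  <li><div class="PlayerList">
    <div class="HeaderList"><span class="player">Messi</span></div>
    <div class="PlayerStat"><span class="stat">Shooting
      <span class="allStatContainer statShooting" data-stat="Shooting"> 9 </span></span></div>
    <div class="PlayerStat"><span class="stat">Passing
      <span class="allStatContainer statPassing" data-stat="Passing"> 9 </span></span></div>
    <div class="PlayerStat"><span class="stat">Tackle
      <span class="allStatContainer statTackle" data-stat="Tackle"> 4 </span></span></div>
  </div></li>
  <li><div class="PlayerList">
    <div class="HeaderList"><span class="player">Xavi</span></div>
    <div class="PlayerStat"><span class="stat">Shooting
      <span class="allStatContainer statShooting" data-stat="Shooting"> 7 </span></span></div>
    <div class="PlayerStat"><span class="stat">Passing
      <span class="allStatContainer statPassing" data-stat="Passing"> 10 </span></span></div>
    <div class="PlayerStat"><span class="stat">Tackle
      <span class="allStatContainer statTackle" data-stat="Tackle"> 6 </span></span></div>
  </div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# One iteration per player block; find_all returns every match, find only the first.
for block in soup.find_all("div", class_="PlayerList"):
    row = {"Player": block.find("span", class_="player").get_text(strip=True)}
    # Use the data-stat attribute as the column name and the span text as the value.
    for stat in block.find_all("span", class_="allStatContainer"):
        row[stat["data-stat"]] = stat.get_text(strip=True)
    rows.append(row)

print(rows)
```

Passing rows to pd.DataFrame(rows) should then give one row per player with the columns Player, Shooting, Passing and Tackle.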

Related

BeautifulSoup Returns empty list which leads to an IndexError in my Python code

I am trying to do web scraping using BeautifulSoup. The code I have written is below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(type(questions[0]))
When I run the code, I get the error message below:
print(type(questions[10]))
IndexError: list index out of range
Then I tried printing the list instead:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(questions)
And then I get an empty list: []
What am I doing wrong?
Thanks for your answers.
.question-summary is the wrong locator: it selects by class, but question-summary is only the beginning of each element's id (for example id="question-summary-71715531"). Selecting on the id prefix works:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select('[id^="question-summary"]')
print(questions)
Output:
1" data-post-type-id="1" id="question-summary-71715531">
<div class="s-post-summary--stats js-post-summary-stats">
<div class="s-post-summary--stats-item s-post-summary--stats-item__emphasized" title="Score of 0">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">votes</span>
</div>
<div class="s-post-summary--stats-item" title="0 answers">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">answers</span>
</div>
<div class="s-post-summary--stats-item" title="5 views">
<span class="s-post-summary--stats-item-number">5</span>
<span class="s-post-summary--stats-item-unit">views</span>
</div>
</div>
<div class="s-post-summary--content">
<h3 class="s-post-summary--content-title">
<a class="s-link" href="/questions/71715531/is-it-possible-to-draw-a-logistic-regression-graph-with-multiple-x-variable">Is it possible to draw a
logistic regression graph with multiple x variable?</a>
</h3>
<div class="s-post-summary--content-excerpt">
Currently, this is my X and V value. May I know is it possible to draw a logistic regression curve with X that has multiple column? Or I am required to draw multiple graphs to do so?
X = df1.drop(['...
</div>
<div class="s-post-summary--meta">
<div class="s-post-summary--meta-tags tags js-tags t-python-3ûx t-machine-learning">
<a class="post-tag flex--item mt0 js-tagname-python-3ûx" href="/questions/tagged/python-3.x" rel="tag" title="show questions tagged 'python-3.x'">python-3.x</a> <a class="post-tag flex--item mt0 js-tagname-machine-learning" href="/questions/tagged/machine-learning" rel="tag" title="show questions tagged 'machine-learning'">machine-learning</a>
</div>
<div class="s-user-card s-user-card__minimal">
<a class="s-avatar s-avatar__16 s-user-card--avatar" href="/users/14128881/christopher-chua"> <div class="gravatar-wrapper-16" data-user-id="14128881">
<img ,="" alt="user avatar" class="s-avatar--image" height="16" src="https://lh6.googleusercontent.com/-Sn3B_E5hiJc/AAAAAAAAAAI/AAAAAAAAAAA/AMZuucl1oyfdhJiXhrx73JLYqzKAK9icag/photo.jpg?sz=32" width="16"/>
</div>
</a>
<div class="s-user-card--info">
<div class="s-user-card--link d-flex gs4">
<a class="flex--item" href="/users/14128881/christopher-chua">Christopher Chua</a>
</div>
<ul class="s-user-card--awards">
<li class="s-user-card--rep"><span class="todo-no-class-here" dir="ltr" title="reputation score ">7</span></li>
</ul>
</div>
<time class="s-user-card--time">asked <span class="relativetime" title="2022-04-02 07:03:06Z">13 mins ago</span></time>
.. so on
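The difference between the two selectors can be reproduced offline. This is a small sketch on made-up markup that only mimics the id and class pattern shown in the output above; it also guards the index access that raised the IndexError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the ids start with "question-summary", the class does not.
html = """
<div class="s-post-summary" id="question-summary-1">First</div>
<div class="s-post-summary" id="question-summary-2">Second</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select(".question-summary")            # class selector: no match
by_id_prefix = soup.select('[id^="question-summary"]')  # id starts with the prefix

print(len(by_class), len(by_id_prefix))

# Check for an empty result before indexing to avoid the IndexError.
first = by_id_prefix[0] if by_id_prefix else None
print(first["id"] if first is not None else "no match")
```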

GitHub get commits number using python and beautiful soup

I'm trying to get the number of commits of github repos using python and beautiful soup
html code:
<div class="flex-shrink-0">
<h2 class="sr-only">Git stats</h2>
<ul class="list-style-none d-flex">
<li class="ml-0 ml-md-3">
<a data-pjax href="..." class="pl-3 pr-3 py-3 p-md-0 mt-n3 mb-n3 mr-n3 m-md-0 Link--primary no-underline no-wrap">
<span class="d-none d-sm-inline">
<strong>26</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
</a>
</li>
</ul>
</div>
my code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get(source_code_link)
soup = bs(r.content, 'lxml')
spans = soup.find_all('span', class_='d-none d-sm-inline')
for span in spans:
    number = span.select_one('strong')
This sometimes works and sometimes does not, because there is more than one span tag with the class d-none d-sm-inline.
How can I solve this?
Here's an approach using the "list commits" endpoint of GitHub's REST API:
import requests
user = ... # username or organisation
repo = ... # repository name
response = requests.get(f"https://api.github.com/repos/{user}/{repo}/commits")
if response.ok:
    # Note: the endpoint is paginated (30 commits per page by default),
    # so this counts only the commits on the first page.
    ncommits = len(response.json())
else:
    raise ValueError(f"error: {response.url} responded {response.reason}")
print(ncommits)
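Because the list-commits endpoint is paginated, counting the items of a single response undercounts larger repositories. One workaround, offered here as an assumption rather than as part of the answer above, is to request one commit per page and read the page number of the rel="last" link from the Link response header. The helper name last_page_from_link_header and the example header values are hypothetical.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse


def last_page_from_link_header(link_header: str) -> Optional[int]:
    """Return the page number of the rel="last" link, or None if there is none."""
    for url, rel in re.findall(r'<([^>]+)>;\s*rel="([^"]+)"', link_header):
        if rel == "last":
            pages = parse_qs(urlparse(url).query).get("page")
            if pages:
                return int(pages[0])
    return None


# Example header in the shape of a paginated response (values are made up).
example = (
    '<https://api.github.com/repositories/1/commits?per_page=1&page=2>; rel="next", '
    '<https://api.github.com/repositories/1/commits?per_page=1&page=26>; rel="last"'
)
print(last_page_from_link_header(example))  # 26
```

With requests, the header would come from response.headers.get("Link", "") after a request made with params={"per_page": 1}. When the repository has a single page of results there is no rel="last" link, so a None result has to be handled separately.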
from bs4 import BeautifulSoup as bs
html="""<span class="d-none d-sm-inline">
<strong>26</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
<div class="flex-shrink-0">
<h2 class="sr-only">Git stats</h2>
<ul class="list-style-none d-flex">
<li class="ml-0 ml-md-3">
<a data-pjax href="..." class="pl-3 pr-3 py-3 p-md-0 mt-n3 mb-n3 mr-n3 m-md-0 Link--primary no-underline no-wrap">
<span class="d-none d-sm-inline">
<strong>23</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
</a>
</li>
</ul>
</div>"""
I have combined both examples. The code looks for the strong tag among each span's direct children (via .contents) and prints its text:
soup = bs(html, 'lxml')
spans = soup.find_all('span', class_='d-none d-sm-inline')
for span in spans:
    for tag in span.contents:
        if tag.name == "strong":
            print(tag.get_text())
Using a list comprehension:
for span in spans:
    data = [tag for tag in span.contents if tag.name == "strong"]
    print(data[0].get_text())
Output in both cases:
26
23
You can use the find_next() method to look for a <strong> after the class d-none d-sm-inline.
In your case:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("span", class_="d-none d-sm-inline"):
    print(tag.find_next("strong").text)
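The original problem was that some spans with the class d-none d-sm-inline contain no strong tag at all. As an alternative sketch, a CSS child combinator selects only the strong tags that sit directly inside such a span, so the spans without one are skipped. The first span in the sample below is a made-up stand-in for those unrelated spans.

```python
from bs4 import BeautifulSoup

html = """
<span class="d-none d-sm-inline">Unrelated label</span>
<span class="d-none d-sm-inline">
  <strong>26</strong>
  <span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
    commits
  </span>
</span>
"""
soup = BeautifulSoup(html, "html.parser")

# "parent > child": match <strong> only when its parent is a span with both classes.
counts = [
    strong.get_text(strip=True)
    for strong in soup.select("span.d-none.d-sm-inline > strong")
]
print(counts)
```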

beautiful soup misses text when scraping html

Here is a sample of my HTML. I need to parse all content for each record. This is one sample record.
<div class="list-group-item card-contact">
<div class="card-base">
<div class="card-name hot">ronald <b>loudia</b></div>
<div class="card-last-seen"><i>Inquired:</i> 8 day ago, Value $205,000 Engaged </div>
</div>
<a class="card-expand" href="#"></a>
<div class="card-mobile">
<ul class="card-activities" data-score="27">
<li class="card-update"><a class="btn" href="wrap.php?leaduuid=e888888e&aid=0">Update</a></li>
<li class="card-notes"><a class="btn" href="contact-notes.php?leaduuid=e888888e">Notes</a></li>
<li class="card-email">
<a class="btn" href="mailto:lou#ad.fr" title="leald#icloud.fr">
<div class="activity-label">Email:</div>
<div class="activity-value">lou#ad.fr</div>
</a>
</li>
<li class="card-sms"><a class="btn" href="sms:(222) 125-4444">SMS</a></li>
<li class="card-phone">
<a class="btn" href="tel:(222) 125-4444">
<div class="activity-label">Phone:</div>
<div class="activity-value">(222) 125-4444</div>
</a>
</li>
</ul>
</div>
</div>
I can't seem to get all the content between the tags.
Here is my code
from bs4 import BeautifulSoup
import re
import mechanize
import pandas as pd
br = mechanize.Browser()
br.open(url)
mylist = []
html_doc = br.response().read()
soup = BeautifulSoup(html_doc, 'html.parser')
mydivs = soup.find_all('div', ['card-name',
'card-last-seen',
'activity-value',
'card-activities',
'card-last-seen',
'activity-value',
'card-mobile'])
st = ''
for div in mydivs:
    if re.search(r'^\([0-9][0-9][0-9]\)', div.text):
        st += f'{div.text}\n'
    else:
        st += f'{div.text}, '
mylist.append(st)
#print(mylist)
smallerlist = [l.split(', ') for l in ', '.join(mylist).split('\n')]
smallerlist
df = pd.DataFrame(smallerlist)
I only get some of the content, missing data-score and other content.
Not sure how to get both div and ul content.
Html has many records that I need to loop through and write to pandas DataFrame.
The expected output in dataframe:
FirstName  LastName   LastLogin  Value             Score  Email      sms             phone-activity-Value
ronald     Lecloudia  8 day ago  $205,000 Engaged  27     lou#ad.fr  (222) 125-4444  (222) 125-4444
You can access the data-score attribute like this:
soup.ul.attrs['data-score']
Output:
'88888827'
More data:
import re
new_line_re = re.compile('\n{2,}')
new_line_re.sub('\n', soup.div.text).strip().split('\n')
Output:
['ronald Lecloudia',
'Inquired: 888888 day ago, Value $205,000 Engaged ',
'Update',
'Notes',
'Email:',
'leald#icloud.fr',
'SMS',
'Phone:',
'(232) 78888885-4444']
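data-score is an attribute of the ul, not text, which is why it never shows up in .text. To handle many records, one option is to loop over each card and pull every field with its own selector. The sketch below uses a trimmed copy of the sample record; the helper text_of and the dictionary keys are illustrative names, not part of the original code.

```python
from bs4 import BeautifulSoup

html = """
<div class="list-group-item card-contact">
  <div class="card-base">
    <div class="card-name hot">ronald <b>loudia</b></div>
    <div class="card-last-seen"><i>Inquired:</i> 8 day ago, Value $205,000 Engaged </div>
  </div>
  <div class="card-mobile">
    <ul class="card-activities" data-score="27">
      <li class="card-email"><a class="btn" href="mailto:lou#ad.fr">
        <div class="activity-label">Email:</div>
        <div class="activity-value">lou#ad.fr</div>
      </a></li>
      <li class="card-phone"><a class="btn" href="tel:(222) 125-4444">
        <div class="activity-label">Phone:</div>
        <div class="activity-value">(222) 125-4444</div>
      </a></li>
    </ul>
  </div>
</div>
"""


def text_of(parent, selector):
    """Text of the first match under parent, or None when nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(" ", strip=True) if node is not None else None


soup = BeautifulSoup(html, "html.parser")

records = []
for card in soup.select("div.card-contact"):  # one iteration per record
    activities = card.select_one("ul.card-activities")
    records.append({
        "Name": text_of(card, ".card-name"),
        "LastSeen": text_of(card, ".card-last-seen"),
        # Attributes are read with .get(), not with .text
        "Score": activities.get("data-score") if activities is not None else None,
        "Email": text_of(card, ".card-email .activity-value"),
        "Phone": text_of(card, ".card-phone .activity-value"),
    })

print(records)
```

The list of dictionaries can be passed to pd.DataFrame(records); splitting Name or LastSeen into further columns is left out here.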

Get number from span which is inside span with beautifulsoup

So I have this piece from html
<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>
And I want to get that '5 300' out of it.
My code to get that:
print(item.find('label',{'for':'productX'}).find('span', attrs={'class': 'p'}).find('span'))
but it only prints out this:
<span></span>
I hope somebody can help
Edit: already tried to write .text to the end but it gives nothing ' '.
You almost got it, you just need to add .text to the last find function.
from bs4 import BeautifulSoup
html = """<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>"""
item = BeautifulSoup(html, "lxml")
print(item.find('label',{'for':'productX'}).find('span', attrs={'class': 'p'}).find('span').text)
Outputs:
5 300
You can try this:
from bs4 import BeautifulSoup as soup
import re
s = """
<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>
"""
final_result = re.sub(r'^\s+|[a-zA-Z\s]+$', '', soup(s, 'lxml').find('span', {'class':'p'}).text)
Output:
u'5 300'
Here's one with select, which doesn't give you as many options but is quite readable
import bs4
s = """<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>"""
soup = bs4.BeautifulSoup(s, 'xml')
soup.select_one("#_productX_label > span > span").text
Output: '5 300'
For your other issue of not being able to use the text property, perhaps the data is being filled out by a js function, or stored in an attribute?
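One way to tell the two situations apart is to check whether the inner span is empty in the HTML that was actually downloaded. The following is only a sketch: the function name price_from_label is hypothetical, and the "empty" sample imitates a page whose value would be filled in later by JavaScript.

```python
from bs4 import BeautifulSoup


def price_from_label(label_html):
    """Return the price as an int, or None if the inner span is missing or empty."""
    soup = BeautifulSoup(label_html, "html.parser")
    inner = soup.select_one("label span.p > span")
    if inner is None:
        return None
    text = inner.get_text(strip=True)
    if not text:
        # The span exists but has no text in the static HTML.
        return None
    # Drop the thousands separator (regular or non-breaking space) before converting.
    return int(text.replace(" ", "").replace("\xa0", ""))


static = '<label for="productX"><span class="p"> <span>5 300</span> Ft </span></label>'
empty = '<label for="productX"><span class="p"> <span></span> Ft </span></label>'

print(price_from_label(static))
print(price_from_label(empty))
```

If the function returns None for the real page, the number is probably not present in the response body, and parsing alone will not recover it.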

BeautifulSoup: find all tags before stopping condition is met

I'm trying to extract a class tag from an HTML file, but only if it is located before a given stopping point. What I have is:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
This works, but it finds all instances of myclass, and I only want those that appear before the following block in the soup:
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
The thing that makes this block unique are the Title text N lines, especially the Title text N2. line. There are many cat-title tags before, so I can't use that as a stopping condition.
The code surrounding this block looks like this:
...
<div class="myc">
<a class="bbb" href="linkhere_893">
<span class="myclass">Text893</span>
<img data-lazy="https://link893.jpg"/>
</a>
</div>
<div class="myc">
<a class="bbb" href="linkhere_96">
<span class="myclass">Text96</span>
<img data-lazy="https://link96.jpg"/>
</a>
</div>
</div><!-- This closes a list that starts above -->
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div class="list" id="55">
<div class="myc">
<a class="bbb" href="linkhere_34">
<span class="myclass">Text34</span>
<img data-lazy="https://link34.jpg"/>
</a>
</div>
<div class="myc">
...
continuing both above and below.
How can I do this?
Try using find_all_previous():
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
stop_at = soup.find("h4", class_="cat-title", id='55') # finds your stop tag
class_extr = stop_at.find_all_previous("span", class_="myclass")
This will stop at the first <h4 class='cat-title', id=55> tag in the event that there are multiple.
Reference: Beautiful Soup Documentation
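Two details may matter in practice: find_all_previous() returns matches nearest-first (reverse document order), and the question says the stop tag is best identified by its small text rather than by class alone. A minimal sketch covering both, on made-up markup that includes an earlier cat-title heading:

```python
from bs4 import BeautifulSoup

page = """
<h4 class="cat-title" id="54">Earlier title <small>Earlier subtitle.</small></h4>
<p><span class="myclass">text 1</span><span class="myclass">text 2</span></p>
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<p><span class="myclass">text 3</span></p>
"""
soup = BeautifulSoup(page, "html.parser")

# Pick the heading whose <small> text matches, not just any cat-title heading.
stop_at = next(
    (
        h4
        for h4 in soup.find_all("h4", class_="cat-title")
        if h4.small is not None and h4.small.get_text(strip=True) == "Title text N2."
    ),
    None,
)

texts = []
if stop_at is not None:
    before = stop_at.find_all_previous("span", class_="myclass")
    # Reverse to restore document order.
    texts = [span.get_text() for span in reversed(before)]

print(texts)
```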
How about this:
page = requests.get("https://mysite")
# Split your page and unwanted string, then parse with BeautifulSoup
text = page.text.split('Title text N2.')
soup = BeautifulSoup(text[0], 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
You can try something like this:
from bs4 import BeautifulSoup
page = """
<html><body><p>
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
</p>
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
<p>
<span class="myclass">text 3</span>
<span class="myclass">text 4</span>
</p>
</body>
</html>
"""
soup = BeautifulSoup(page, 'html.parser')
for i in soup.find_all():
    if i.name == 'h4' and i.has_attr('class') and i['class'][0] == 'cat-title' and i.has_attr('id') and i['id'] == '55':
        if i.find("small") and i.find("small").text.strip() == "Title text N2.":
            break
    elif i.name == 'span' and i.has_attr('class') and i['class'][0] == 'myclass':
        print(i)
Outputs:
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
