python loop for web-scraping

I am trying to create a loop that collects all the values inside the li tags so that I can build a DataFrame. So far I can only isolate the markup using: new = soup.find("div", class_="PlayerList"). If I use a standard for loop, it shows only one value, not all of them.
The output I would like to show is:
Messi,
Shooting 9,
Passing 9,
Tackle 4,
<pre>
import requests
import pandas as pd
import numpy as np
from urllib.request import urlopen
from bs4 import BeautifulSoup
main_url = 'https://examplelistpython.000webhostapp.com/messi.html'
result = requests.get(main_url)
result.text
soup = BeautifulSoup(result.text, 'html.parser')
print(soup.prettify())
new = soup.find("div", class_="PlayerList")
new
</pre>
<ul class="List">
<li>
<div class="PlayerList">
<div class="HeaderList">
<span class="player">Messi</span>
</div>
<div class="PlayerStat">
<span class="stat">Shooting <span class="allStatContainer statShooting" data-stat="Shooting">
9
</span>
</span>
</div>
<div class="PlayerStat">
<span class="stat">Passing <span class="allStatContainer statPassing" data-stat="Passing">
9
</span>
</span>
</div>
<div class="PlayerStat">
<span class="stat">Tackle <span class="allStatContainer statTackle" data-stat="Tackle">
4
</span>
</span>
</div>
</li>
</ul>

player = [i.text.strip() for i in soup.find_all("span", class_="player")]
shooting = [i.text.strip() for i in soup.find_all("span", class_="allStatContainer statShooting")]
passing = [i.text.strip() for i in soup.find_all("span", class_="allStatContainer statPassing")]
tackle = [i.text.strip() for i in soup.find_all("span", class_="allStatContainer statTackle")]
df = pd.DataFrame({'Player': player, 'Shooting': shooting, 'Passing': passing, 'Tackle': tackle})
Result:
  Player Shooting Passing Tackle
0  Messi        9       9      4
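
When the page lists several players, the separate lists above line up only if every player has every stat. A minimal sketch of a per-player loop follows, assuming each player sits in its own div with class PlayerList and each stat span carries a data-stat attribute, as in the question's markup. The second player and the parse_players helper are hypothetical and added only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical two-player sample, modelled on the markup in the question.
SAMPLE_HTML = """
<ul class="List">
  <li>
    <div class="PlayerList">
      <div class="HeaderList"><span class="player">Messi</span></div>
      <div class="PlayerStat"><span class="stat">Shooting
        <span class="allStatContainer statShooting" data-stat="Shooting"> 9 </span></span></div>
      <div class="PlayerStat"><span class="stat">Tackle
        <span class="allStatContainer statTackle" data-stat="Tackle"> 4 </span></span></div>
    </div>
  </li>
  <li>
    <div class="PlayerList">
      <div class="HeaderList"><span class="player">Example Player</span></div>
      <div class="PlayerStat"><span class="stat">Shooting
        <span class="allStatContainer statShooting" data-stat="Shooting"> 7 </span></span></div>
      <div class="PlayerStat"><span class="stat">Tackle
        <span class="allStatContainer statTackle" data-stat="Tackle"> 6 </span></span></div>
    </div>
  </li>
</ul>
"""


def parse_players(html):
    """Return one dict per PlayerList block: the player name plus each stat."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="PlayerList"):
        row = {"Player": card.find("span", class_="player").get_text(strip=True)}
        # The data-stat attribute names the stat; the span text holds its value.
        for stat in card.find_all("span", class_="allStatContainer"):
            row[stat["data-stat"]] = stat.get_text(strip=True)
        rows.append(row)
    return rows


rows = parse_players(SAMPLE_HTML)
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame(rows) should then give one DataFrame row per player, with a column for each stat name found.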

Related

BeautifulSoup Returns empty list which leads to an IndexError in my Python code

I am trying to do web scraping using BeautifulSoup. The code I have written is below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(type(questions[0]))
When I run the code, I get the error message below:
print(type(questions[10]))
IndexError: list index out of range
Then I tried to print the list like below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(questions)
And then I get an empty list: []
What am I doing wrong?
Thanks for your answers.

The .question-summary selector is the wrong locator: it matches a class, but question-summary is only the beginning of each element's id (every id value starts with question-summary). Matching on the id prefix works:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select('[id^="question-summary"]')
print(questions)
Output:
1" data-post-type-id="1" id="question-summary-71715531">
<div class="s-post-summary--stats js-post-summary-stats">
<div class="s-post-summary--stats-item s-post-summary--stats-item__emphasized" title="Score of 0">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">votes</span>
</div>
<div class="s-post-summary--stats-item" title="0 answers">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">answers</span>
</div>
<div class="s-post-summary--stats-item" title="5 views">
<span class="s-post-summary--stats-item-number">5</span>
<span class="s-post-summary--stats-item-unit">views</span>
</div>
</div>
<div class="s-post-summary--content">
<h3 class="s-post-summary--content-title">
<a class="s-link" href="/questions/71715531/is-it-possible-to-draw-a-logistic-regression-graph-with-multiple-x-variable">Is it possible to draw a
logistic regression graph with multiple x variable?</a>
</h3>
<div class="s-post-summary--content-excerpt">
Currently, this is my X and V value. May I know is it possible to draw a logistic regression curve with X that has multiple column? Or I am required to draw multiple graphs to do so?
X = df1.drop(['...
</div>
<div class="s-post-summary--meta">
<div class="s-post-summary--meta-tags tags js-tags t-python-3ûx t-machine-learning">
<a class="post-tag flex--item mt0 js-tagname-python-3ûx" href="/questions/tagged/python-3.x" rel="tag" title="show questions tagged 'python-3.x'">python-3.x</a> <a class="post-tag flex--item mt0 js-tagname-machine-learning" href="/questions/tagged/machine-learning" rel="tag" title="show questions tagged 'machine-learning'">machine-learning</a>
</div>
<div class="s-user-card s-user-card__minimal">
<a class="s-avatar s-avatar__16 s-user-card--avatar" href="/users/14128881/christopher-chua"> <div class="gravatar-wrapper-16" data-user-id="14128881">
<img ,="" alt="user avatar" class="s-avatar--image" height="16" src="https://lh6.googleusercontent.com/-Sn3B_E5hiJc/AAAAAAAAAAI/AAAAAAAAAAA/AMZuucl1oyfdhJiXhrx73JLYqzKAK9icag/photo.jpg?sz=32" width="16"/>
</div>
</a>
<div class="s-user-card--info">
<div class="s-user-card--link d-flex gs4">
<a class="flex--item" href="/users/14128881/christopher-chua">Christopher Chua</a>
</div>
<ul class="s-user-card--awards">
<li class="s-user-card--rep"><span class="todo-no-class-here" dir="ltr" title="reputation score ">7</span></li>
</ul>
</div>
<time class="s-user-card--time">asked <span class="relativetime" title="2022-04-02 07:03:06Z">13 mins ago</span></time>
.. so on
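
To see the difference between the two selectors without hitting the live site, here is a small sketch on a hypothetical two-item snippet. The markup is invented for illustration and is much simpler than the real page. It also guards against an empty result before indexing, which is what raised the IndexError in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the ids start with "question-summary",
# but no element has a class of that name.
html = """
<div id="question-summary-1" class="s-post-summary">first</div>
<div id="question-summary-2" class="s-post-summary">second</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Class selector: finds nothing, because "question-summary" is not a class here.
by_class = soup.select(".question-summary")

# Attribute selector: matches every element whose id starts with the prefix.
by_id_prefix = soup.select('[id^="question-summary"]')

# Check for an empty result before indexing to avoid an IndexError.
first = by_id_prefix[0] if by_id_prefix else None

print(len(by_class), len(by_id_prefix))
print(first.get_text() if first is not None else "no match")
```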

GitHub get commits number using python and beautiful soup

I'm trying to get the number of commits of github repos using python and beautiful soup
html code:
<div class="flex-shrink-0">
<h2 class="sr-only">Git stats</h2>
<ul class="list-style-none d-flex">
<li class="ml-0 ml-md-3">
<a data-pjax href="..." class="pl-3 pr-3 py-3 p-md-0 mt-n3 mb-n3 mr-n3 m-md-0 Link--primary no-underline no-wrap">
<span class="d-none d-sm-inline">
<strong>26</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
</a>
</li>
</ul>
</div>
my code:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get(source_code_link)
soup = bs(r.content, 'lxml')
spans = soup.find_all('span', class_='d-none d-sm-inline')
for span in spans:
    number = span.select_one('strong')
This sometimes works and sometimes does not, because there is more than one span tag with the class d-none d-sm-inline.
How can I solve this?

Here's an approach using the "list commits" endpoint of GitHub's REST API:
import requests
user = ...  # username or organisation
repo = ...  # repository name
response = requests.get(f"https://api.github.com/repos/{user}/{repo}/commits")
if response.ok:
    # Note: this endpoint is paginated, so this counts only the commits on the first page.
    ncommits = len(response.json())
else:
    raise ValueError(f"error: {response.url} responded {response.reason}")
print(ncommits)
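
Because the commits endpoint is paginated, one commonly used workaround is to request one commit per page and read the number of the last page from the Link response header. The sketch below rests on that assumption: with per_page=1, the last page number equals the number of commits on the default branch. The helper names last_page_number and count_commits are hypothetical.

```python
from urllib.parse import parse_qs, urlparse

import requests


def last_page_number(last_url):
    """Return the integer 'page' query parameter of a pagination URL."""
    query = parse_qs(urlparse(last_url).query)
    return int(query["page"][0])


def count_commits(user, repo):
    """Estimate the commit count of user/repo from the pagination links."""
    response = requests.get(
        f"https://api.github.com/repos/{user}/{repo}/commits",
        params={"per_page": 1},
        timeout=10,
    )
    response.raise_for_status()
    # requests parses the Link header into response.links, keyed by rel.
    last = response.links.get("last")
    if last is None:
        # No "last" link: everything fit on a single page.
        return len(response.json())
    return last_page_number(last["url"])
```

Calling count_commits with a real user and repository name makes a single request. Unauthenticated requests are rate-limited, so heavier use would likely need a token.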

from bs4 import BeautifulSoup as bs
html="""<span class="d-none d-sm-inline">
<strong>26</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
<div class="flex-shrink-0">
<h2 class="sr-only">Git stats</h2>
<ul class="list-style-none d-flex">
<li class="ml-0 ml-md-3">
<a data-pjax href="..." class="pl-3 pr-3 py-3 p-md-0 mt-n3 mb-n3 mr-n3 m-md-0 Link--primary no-underline no-wrap">
<span class="d-none d-sm-inline">
<strong>23</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
</a>
</li>
</ul>
</div>"""
I have combined both examples. The code finds the strong tag inside each span by iterating over its .contents and prints the tag's text:
soup = bs(html, 'lxml')
spans = soup.find_all('span', class_='d-none d-sm-inline')
for span in spans:
    for tag in span.contents:
        if tag.name == "strong":
            print(tag.get_text())
Using a list comprehension:
for span in spans:
    data = [tag for tag in span.contents if tag.name == "strong"]
    print(data[0].get_text())
Output for both cases:
26
23

You can use the find_next() method to look for a <strong> after the class d-none d-sm-inline.
In your case:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("span", class_="d-none d-sm-inline"):
    print(tag.find_next("strong").text)
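
If several spans share the class d-none d-sm-inline, another option is to anchor on something specific to the commits counter. The sketch below assumes the inner span's aria-label always starts with "Commits", as in the question's markup. The extra "branches" span is hypothetical and is there only to show the ambiguity.

```python
from bs4 import BeautifulSoup

# The first span is a hypothetical look-alike with the same classes.
html = """
<span class="d-none d-sm-inline"><strong>3</strong> branches</span>
<span class="d-none d-sm-inline">
<strong>26</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the label by its aria-label prefix, then read the <strong>
# that lives in the same parent span.
label = soup.select_one('span[aria-label^="Commits"]')
commits = label.parent.find("strong").get_text(strip=True) if label else None

print(commits)
```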

beautiful soup misses text when scraping html

Here is a sample of my HTML. I need to parse all content for each record. This is one sample record.
<div class="list-group-item card-contact">
<div class="card-base">
<div class="card-name hot">ronald <b>loudia</b></div>
<div class="card-last-seen"><i>Inquired:</i> 8 day ago, Value $205,000 Engaged </div>
</div>
<a class="card-expand" href="#"></a>
<div class="card-mobile">
<ul class="card-activities" data-score="27">
<li class="card-update"><a class="btn" href="wrap.php?leaduuid=e888888e&aid=0">Update</a></li>
<li class="card-notes"><a class="btn" href="contact-notes.php?leaduuid=e888888e">Notes</a></li>
<li class="card-email">
<a class="btn" href="mailto:lou#ad.fr" title="leald#icloud.fr">
<div class="activity-label">Email:</div>
<div class="activity-value">lou#ad.fr</div>
</a>
</li>
<li class="card-sms"><a class="btn" href="sms:(222) 125-4444">SMS</a></li>
<li class="card-phone">
<a class="btn" href="tel:(222) 125-4444">
<div class="activity-label">Phone:</div>
<div class="activity-value">(222) 125-4444</div>
</a>
</li>
</ul>
</div>
</div>
I can't seem to get all the content between the tags.
Here is my code
from bs4 import BeautifulSoup
import re
import mechanize
import pandas as pd
br = mechanize.Browser()
br.open(url)
mylist = []
html_doc = br.response().read()
soup = BeautifulSoup(html_doc, 'html.parser')
mydivs = soup.find_all('div', ['card-name',
'card-last-seen',
'activity-value',
'card-activities',
'card-last-seen',
'activity-value',
'card-mobile'])
st = ''
for div in mydivs:
    # A raw string keeps the backslashes in the regex intact.
    if re.search(r'^\([0-9][0-9][0-9]\)', div.text):
        st += f'{div.text}\n'
    else:
        st += f'{div.text}, '
mylist.append(st)
#print(mylist)
smallerlist = [l.split(', ') for l in ', '.join(mylist).split('\n')]
smallerlist
df = pd.DataFrame(smallerlist)
I only get some of the content, missing data-score and other content.
Not sure how to get both div and ul content.
Html has many records that I need to loop through and write to pandas DataFrame.
The expected output in dataframe:
FirstName  LastName   LastLogin  Value             Score  Email      sms             phone-activity-Value
ronald     Lecloudia  8 day ago  $205,000 Engaged  27     lou#ad.fr  (222) 125-4444  (222) 125-4444

You can access the data-score attribute like this:
soup.ul.attrs['data-score']
Output:
'88888827'
More data:
import re
new_line_re = re.compile('\n{2,}')
new_line_re.sub('\n', soup.div.text).strip().split('\n')
Output:
['ronald Lecloudia',
'Inquired: 888888 day ago, Value $205,000 Engaged ',
'Update',
'Notes',
'Email:',
'leald#icloud.fr',
'SMS',
'Phone:',
'(232) 78888885-4444']
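
To turn many such records into rows, one option is to loop over each card and pull the fields out with CSS selectors. This is a sketch under the assumption that every record is a div with class card-contact shaped like the sample. The text_of helper and the column names are hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample record from the question.
html = """
<div class="list-group-item card-contact">
<div class="card-base">
<div class="card-name hot">ronald <b>loudia</b></div>
<div class="card-last-seen"><i>Inquired:</i> 8 day ago, Value $205,000 Engaged </div>
</div>
<div class="card-mobile">
<ul class="card-activities" data-score="27">
<li class="card-email"><a class="btn" href="mailto:lou#ad.fr">
<div class="activity-label">Email:</div>
<div class="activity-value">lou#ad.fr</div>
</a></li>
<li class="card-phone"><a class="btn" href="tel:(222) 125-4444">
<div class="activity-label">Phone:</div>
<div class="activity-value">(222) 125-4444</div>
</a></li>
</ul>
</div>
</div>
"""


def text_of(card, selector):
    """Return the whitespace-normalised text of the first match, or None."""
    node = card.select_one(selector)
    return node.get_text(" ", strip=True) if node else None


soup = BeautifulSoup(html, "html.parser")
records = []
for card in soup.select("div.card-contact"):
    activities = card.select_one("ul.card-activities")
    records.append({
        "Name": text_of(card, ".card-name"),
        "LastSeen": text_of(card, ".card-last-seen"),
        # data-score is an attribute on the <ul>, not text, so read it with .get().
        "Score": activities.get("data-score") if activities else None,
        "Email": text_of(card, ".card-email .activity-value"),
        "Phone": text_of(card, ".card-phone .activity-value"),
    })

print(records)
```

The resulting list of dicts can be handed to pd.DataFrame(records). Splitting the name or the last-seen text into further columns would be a separate string-processing step.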

Get number from span which is inside span with beautifulsoup

So I have this piece from html
<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>
And I want to get that '5 300' out of it.
My code to get that:
print(item.find('label',{'for':'productX'}).find('span', attrs={'class': 'p'}).find('span'))
but it only prints out this:
<span></span>
I hope somebody can help
Edit: I already tried appending .text to the end, but it returns an empty string ' '.

You almost got it, you just need to add .text to the last find function.
from bs4 import BeautifulSoup
html = """<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>"""
item = BeautifulSoup(html, "lxml")
print(item.find('label',{'for':'productX'}).find('span', attrs={'class': 'p'}).find('span').text)
Outputs:
5 300

You can try this:
from bs4 import BeautifulSoup as soup
import re
s = """
<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>
"""
final_result = re.sub(r'^\s+|[a-zA-Z\s]+$', '', soup(s, 'lxml').find('span', {'class':'p'}).text)
Output:
u'5 300'

Here's one with select, which doesn't give you as many options but is quite readable
import bs4
s = """<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>"""
soup = bs4.BeautifulSoup(s, 'xml')
soup.select_one("#_productX_label > span > span").text
Output: '5 300'
For your other issue of not being able to use the text property, perhaps the data is being filled out by a js function, or stored in an attribute?
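
If the number is needed as an integer rather than as text, a small follow-up step can strip the thousands separator. The sketch below assumes the separator is a plain space or a non-breaking space. It also guards against the span being missing or empty, which is the situation described in the question's edit.

```python
from bs4 import BeautifulSoup

html = """<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>"""

soup = BeautifulSoup(html, "html.parser")

# Select the inner span directly; select_one returns None when nothing matches.
price_span = soup.select_one('label[for="productX"] span.p > span')
price_text = price_span.get_text(strip=True) if price_span else ""

# Remove regular and non-breaking spaces before converting to int.
digits = price_text.replace(" ", "").replace("\xa0", "")
price = int(digits) if digits else None

print(price)
```

If price comes back as None on the real page, the span is probably empty in the downloaded HTML, which would be consistent with the value being filled in by JavaScript.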

BeautifulSoup: find all tags before stopping condition is met

I'm trying to extract a class tag from an HTML file, but only if it is located before a given stopping point. What I have is:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
This works, but it finds all instances of myclass, and I only want those that appear before the following text in the soup:
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
The thing that makes this block unique are the Title text N lines, especially the Title text N2. line. There are many cat-title tags before, so I can't use that as a stopping condition.
The code surrounding this block looks like this:
...
<div class="myc">
<a class="bbb" href="linkhere_893">
<span class="myclass">Text893</span>
<img data-lazy="https://link893.jpg"/>
</a>
</div>
<div class="myc">
<a class="bbb" href="linkhere_96">
<span class="myclass">Text96</span>
<img data-lazy="https://link96.jpg"/>
</a>
</div>
</div><!-- This closes a list that starts above -->
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div class="list" id="55">
<div class="myc">
<a class="bbb" href="linkhere_34">
<span class="myclass">Text34</span>
<img data-lazy="https://link34.jpg"/>
</a>
</div>
<div class="myc">
...
continuing both above and below.
How can I do this?

Try using find_all_previous():
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
stop_at = soup.find("h4", class_="cat-title", id='55') # finds your stop tag
class_extr = stop_at.find_all_previous("span", class_="myclass")
This will stop at the first <h4 class='cat-title', id=55> tag in the event that there are multiple.
Reference: Beautiful Soup Documentation
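
Two details may matter when using find_all_previous(): it returns matches in reverse document order (nearest first), and the stop tag can be chosen by the text of its small child instead of its id. A minimal sketch on a hypothetical page follows. The extra cat-title heading is invented to show that earlier headings are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one unrelated cat-title heading, then the real stop tag.
page = """
<html><body>
<h4 class="cat-title" id="54">Earlier title <small>Other text.</small></h4>
<p>
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
</p>
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<p>
<span class="myclass">text 3</span>
<span class="myclass">text 4</span>
</p>
</body></html>
"""

soup = BeautifulSoup(page, "html.parser")

# Pick the heading whose <small> child holds the distinguishing text.
stop_at = next(
    h4
    for h4 in soup.find_all("h4", class_="cat-title")
    if h4.small and h4.small.get_text(strip=True) == "Title text N2."
)

# find_all_previous walks backwards, so reverse to restore document order.
before = stop_at.find_all_previous("span", class_="myclass")
texts = [span.get_text(strip=True) for span in reversed(before)]

print(texts)
```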

How about this:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://mysite")
# Split the page at the unwanted string, then parse only the part before it
text = page.text.split('Title text N2.')
soup = BeautifulSoup(text[0], 'html.parser')
class_extr = soup.find_all("span", class_="myclass")

You can try something like this:
from bs4 import BeautifulSoup
page = """
<html><body><p>
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
</p>
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
<p>
<span class="myclass">text 3</span>
<span class="myclass">text 4</span>
</p>
</body>
</html>
"""
soup = BeautifulSoup(page, 'html.parser')
for i in soup.find_all():
    if i.name == 'h4' and i.has_attr('class') and i['class'][0] == 'cat-title' and i.has_attr('id') and i['id'] == '55':
        if i.find("small") and i.find("small").text.strip() == "Title text N2.":
            break
    elif i.name == 'span' and i.has_attr('class') and i['class'][0] == 'myclass':
        print(i)
Outputs:
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
