I'm trying to remove the page numbers from this HTML. If you look at the list texts, they seem to follow the pattern '\n', 'number', '\n'. Would I be able to do it with BeautifulSoup? If not, how do I remove that pattern from the list?
import requests
from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    # Skip superscript text (footnote markers) and HTML comments.
    if element.parent.name in ['sup']:
        return False
    if isinstance(element, Comment):
        return False
    return True

url = 'https://www.sec.gov/Archives/edgar/data/1318605/000156459018019254/tsla-10q_20180630.htm'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
texts = soup.findAll(text=True)
### could remove ['\n','number','\n']
visible_texts = filter(tag_visible, texts)
You can extract the tags containing page numbers from the soup before getting the text.
soup = BeautifulSoup(html.text, 'html.parser')
for hr in soup.select('hr'):
    # Each page number sits in the <p> immediately before a horizontal rule.
    hr.find_previous('p').extract()
texts = soup.findAll(text=True)
This extracts the page-number tags, which look like this:
<p style="text-align:center;margin-top:12pt;margin-bottom:0pt;text-indent:0%;font-size:10pt;font-family:Times New Roman;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">57</p>
<p style="text-align:center;margin-top:12pt;margin-bottom:0pt;text-indent:0%;font-size:10pt;font-family:Times New Roman;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">58</p>
... etc.
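If you'd rather filter the list itself, here is a minimal sketch (assuming, as observed above, that each page number appears as a standalone digit string sandwiched between two '\n' entries):

visible = list(visible_texts)
# Drop bare numbers that sit between two newline entries; keep everything else.
cleaned = [
    t for i, t in enumerate(visible)
    if not (
        t.strip().isdigit()
        and 0 < i < len(visible) - 1
        and visible[i - 1] == '\n'
        and visible[i + 1] == '\n'
    )
]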
I'm trying to scrape Rotten Tomatoes with bs4. My aim is to find all the a hrefs from the table, but I cannot do it. Can you help me?
https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/
My code is:
from urllib import request
from bs4 import BeautifulSoup as BS

url = 'https://www.rottentomatoes.com/top/bestofrt'
html = request.urlopen(url)
bs = BS(html.read(), 'html.parser')
tags = bs.find_all('a', {'class': 'articleLink unstyled'})[7:]
links = ['https://www.rottentomatoes.com' + tag['href'] for tag in tags]

# links
webpages = []
for link in reversed(links):
    print(link)
    html = request.urlopen(link)
    bs = BS(html.read(), 'html.parser')
    tags = bs.find_all('a', {'class': 'unstyled articleLink'})[43:]
    links = ['https://www.rottentomatoes.com' + tag['href'] for tag in tags]
    webpages.extend(links)
print(webpages)
I put a limit of 43 to skip the useless links and keep only the movies, but that is a short-term hack and doesn't really help. I need a proper way to scrape the links from the table without picking up irrelevant ones. Thanks.
Just grab the main table and then extract all the <a> tags.
For example:
import requests
from bs4 import BeautifulSoup
rotten_tomatoes_url = 'https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/'
action_and_adventure = [
    f"https://www.rottentomatoes.com{link.get('href')}"
    for link in BeautifulSoup(
        requests.get(rotten_tomatoes_url).text,
        "lxml",
    )
    .find("table", class_="table")
    .find_all("a")
]
print(len(action_and_adventure))
print("\n".join(action_and_adventure[:10]))
Output (all 100 links to movies):
100
https://www.rottentomatoes.com/m/black_panther_2018
https://www.rottentomatoes.com/m/avengers_endgame
https://www.rottentomatoes.com/m/mission_impossible_fallout
https://www.rottentomatoes.com/m/mad_max_fury_road
https://www.rottentomatoes.com/m/spider_man_into_the_spider_verse
https://www.rottentomatoes.com/m/wonder_woman_2017
https://www.rottentomatoes.com/m/logan_2017
https://www.rottentomatoes.com/m/coco_2017
https://www.rottentomatoes.com/m/dunkirk_2017
https://www.rottentomatoes.com/m/star_wars_the_last_jedi
Try this:
tags = bs.find_all('a', attrs={'class': 'unstyled articleLink'})[43:]
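Note that a CSS selector matches both classes regardless of the order they appear in the class attribute, so this should work as well:

tags = bs.select('a.unstyled.articleLink')[43:]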
import requests
from bs4 import BeautifulSoup

URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
page = requests.get(URL)
print("testing")
soup = BeautifulSoup(page.content, 'html.parser')
words = soup.find_all('p', class_="entry-content")
print(len(words))
for word in words:
    print(word.text)
Nothing is being displayed on my console, and the length variable returns 0, which means nothing is being scraped.
If you look at the HTML, there is a div that contains all the p tags, so you can select that div by its class and then take the p tags from it to get your output.
import requests
from bs4 import BeautifulSoup

URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
page = requests.get(URL)
print("testing")
soup = BeautifulSoup(page.content, 'html.parser')
# The "entry-content" class is on the wrapping <div>, not on the <p> tags.
main_div = soup.find('div', attrs={"class": "entry-content"})
words = main_div.find_all("p")
for word in words:
    print(word.text)
Output:
testing
The slang used in Dominican Republic.
C – ce
*Caballo – person similar to a tigre but a little more decent
*Cabron – a large male goat,also means displeased
*Cacaito – candy
....
There are no p tags with class entry-content. Remove that filter and it should work:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
page = requests.get(URL)
print("testing")
soup = BeautifulSoup(page.content, 'html.parser')
words = soup.find_all('p')
print(len(words))
for word in words:
    print(word.text)
But if you want only the content of the p tags inside the div with class "entry-content", then:
entry_content = soup.find('div', attrs={"class": "entry-content"})
words = entry_content.find_all("p")
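Equivalently, a single CSS selector does the same descent in one step:

words = soup.select('div.entry-content p')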
You are passing an additional class parameter, entry-content, to find_all, but there is no need for it:
words = soup.find_all('p', class_="entry-content")
Try this instead:
words = soup.find_all('p')
Then it will get all the p content and give you the length:
print(len(words))
I hope this helps.
I'm trying to get the text from an 'a' HTML element I got with BeautifulSoup.
I am able to print the whole thing, and what I want to find is right there:
-1
Tensei Shitara Slime Datta Ken Manga
-1
But when I try to be more specific and get the text from it, I get this error:
File "C:\python\manga\manga.py", line 15, in <module>
print(title.text)
AttributeError: 'int' object has no attribute 'text'
Here is the code I'm running:
import requests
from bs4 import BeautifulSoup

URL = 'https://mangapark.net/manga/tensei-shitara-slime-datta-ken-fuse'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('section', class_='manga')
manga_title = soup.find('div', class_='pb-1 mb-2 line-b-f hd')
for m_title in manga_title:
    title = m_title.find('a')
    print(title.text)
I've searched for my problem but I couldn't find anything that helps.
When you iterate over the div, some of its children are plain strings (NavigableString), and .find on a string is the built-in str.find, which returns -1 when the substring isn't found.
This isn't a very common way in Python to signal that no value exists, but it is a common one in other languages.
import requests
from bs4 import BeautifulSoup

URL = 'https://mangapark.net/manga/tensei-shitara-slime-datta-ken-fuse'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('section', class_='manga')
manga_title = soup.find('div', class_='pb-1 mb-2 line-b-f hd')
for m_title in manga_title.children:
    title = m_title.find('a')
    # String children fall back to str.find, which returns -1 when
    # the substring isn't found, so skip those.
    if title != -1:
        print(title.text)
Output
Tensei Shitara Slime Datta Ken Manga
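A more direct route (a sketch along the same lines) is to skip the loop and search the div itself for the <a> tag, checking for None, which is what Tag.find returns on a miss:

title_div = soup.find('div', class_='pb-1 mb-2 line-b-f hd')
if title_div is not None:
    title = title_div.find('a')  # Tag.find returns None, not -1, on a miss
    if title is not None:
        print(title.text)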
I have an html document with the following bulleted list:
Body=<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li><li>Use Case Triggers<ul><li>T1</li><li>T2</li></ul></li><li>Postconditions<ul><li>PO1</li><li>PO2</li></ul></li></ul>
(Alternative View):
PreconditionsPC1PC2Use Case TriggersT1T2PostconditionsPO1PO2
I'm trying to write a function in Python that will dissect this list and pull out groups of data. The goal is to put this data in a matrix that would look like the following:
[[Preconditions, PC1],[Preconditions, PC2],[Use Case Triggers, T1],[Use Case Triggers, T2],[Postconditions, PO1],[Postconditions, PO2]]
The other hurdle is that this matrix needs to be generated regardless of the number of ul and li elements.
Any guidance is appreciated!
You can write a function that takes raw HTML and deletes all HTML tags:
import re

def cleanhtml(raw_html):
    # Replace tags and HTML entities with spaces.
    cleanr = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
    cleantext = re.sub(cleanr, " ", raw_html)
    return cleantext
Some other cleanr options:
cleanr = re.compile(r"<[A-Za-z\/][^>]*>")
cleanr = re.compile(r"<[^>]*>")
cleanr = re.compile(r"<\/?\w+\s*[^>]*?\/?>")
But there is a better and easier way with BeautifulSoup:
import requests
from bs4 import BeautifulSoup

def clean_with_soup(url: str) -> str:
    r = requests.get(url).text
    soup = BeautifulSoup(r, "html.parser")
    return soup.get_text()
A good library for parsing HTML is BeautifulSoup. Code example:
from bs4 import BeautifulSoup

html = "<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li><li>Use Case Triggers<ul><li>T1</li><li>T2</li></ul></li><li>Postconditions<ul><li>PO1</li><li>PO2</li></ul></li></ul>"
bs = BeautifulSoup(html, "html.parser")
uls = bs.findAll("ul")
for ul in uls:
    print(ul.findAll("li"))
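To get all the way to the matrix asked for, here is a minimal sketch (assuming the two-level structure shown: a top-level ul whose li items each hold a label followed by a nested ul of values):

from bs4 import BeautifulSoup

html = "<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li><li>Use Case Triggers<ul><li>T1</li><li>T2</li></ul></li><li>Postconditions<ul><li>PO1</li><li>PO2</li></ul></li></ul>"
soup = BeautifulSoup(html, "html.parser")

matrix = []
for li in soup.find("ul").find_all("li", recursive=False):
    label = li.find(text=True).strip()  # first text node, e.g. "Preconditions"
    nested = li.find("ul")
    if nested:
        for item in nested.find_all("li"):
            matrix.append([label, item.get_text()])
print(matrix)
# [['Preconditions', 'PC1'], ['Preconditions', 'PC2'],
#  ['Use Case Triggers', 'T1'], ['Use Case Triggers', 'T2'],
#  ['Postconditions', 'PO1'], ['Postconditions', 'PO2']]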
Hey guys, so I got as far as being able to add the a elements to a list. The problem is that I just want the href link added to the links_with_text list, not the entire a element. What am I doing wrong?
from bs4 import BeautifulSoup
import requests

URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='hnmain')
articles = results.find_all(class_="title")
links_with_text = []
for article in articles:
    link = article.find('a', href=True)
    links_with_text.append(link)
print('\n'.join(map(str, links_with_text)))
This prints the list exactly how I want it, but I just want the href from every a element, not the entire element. Thank you.
To get all the links from https://news.ycombinator.com, you can use the CSS selector 'a.storylink'.
For example:
from bs4 import BeautifulSoup
import requests

URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
links_with_text = []
for a in soup.select('a.storylink'):   # <-- find all <a> with class="storylink"
    links_with_text.append(a['href'])  # <-- note the ['href']
print(*links_with_text, sep='\n')
Prints:
https://blog.mozilla.org/futurereleases/2020/06/18/introducing-firefox-private-network-vpns-official-product-the-mozilla-vpn/
https://mxb.dev/blog/the-return-of-the-90s-web/
https://github.blog/2020-06-18-introducing-github-super-linter-one-linter-to-rule-them-all/
https://www.sciencemag.org/news/2018/11/why-536-was-worst-year-be-alive
https://www.strongtowns.org/journal/2020/6/16/do-the-math-small-projects
https://devblogs.nvidia.com/announcing-cuda-on-windows-subsystem-for-linux-2/
https://lwn.net/SubscriberLink/822568/61d29096a4012e06/
https://imil.net/blog/posts/2020/fakecracker-netbsd-as-a-function-based-microvm/
https://jepsen.io/consistency
https://tumblr.beesbuzz.biz/post/621010836277837824/advice-to-young-web-developers
https://archive.org/search.php?query=subject%3A%22The+Navy+Electricity+and+Electronics+Training+Series%22&sort=publicdate
https://googleprojectzero.blogspot.com/2020/06/ff-sandbox-escape-cve-2020-12388.html?m=1
https://apnews.com/1da061ce00eb531291b143ace0eed1c9
https://support.apple.com/library/content/dam/edam/applecare/images/en_US/appleid/android-apple-music-account-payment-none.jpg
https://standpointmag.co.uk/issues/may-june-2020/the-healing-power-of-birdsong/
https://steveblank.com/2020/06/18/the-coming-chip-wars-of-the-21st-century/
https://www.videolan.org/security/sb-vlc3011.html
https://onesignal.com/careers/2023b71d-2f44-4934-a33c-647855816903
https://www.bbc.com/news/world-europe-53006790
https://github.com/efficient/HOPE
https://everytwoyears.org/
https://www.historytoday.com/archive/natural-histories/intelligence-earthworms
https://cr.yp.to/2005-590/powerpc-cwg.pdf
https://quantum.country/
http://www.crystallography.net/cod/
https://parkinsonsnewstoday.com/2020/06/17/tiny-magnetically-powered-implant-may-be-future-of-deep-brain-stimulation/
https://spark.apache.org/releases/spark-release-3-0-0.html
https://arxiv.org/abs/1712.09624
https://www.washingtonpost.com/technology/2020/06/18/data-privacy-law-sherrod-brown/
https://blog.chromium.org/2020/06/improving-chromiums-browser.html
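If you would rather keep your original loop, the minimal change (a sketch; the None check guards against title cells that contain no link) is to append the tag's href attribute instead of the tag itself:

for article in articles:
    link = article.find('a', href=True)
    if link is not None:                      # some "title" cells may lack an <a>
        links_with_text.append(link['href'])  # <-- append the attribute, not the tag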