I'm trying to remove the page numbers from this HTML. If you look at the list texts, they seem to follow the pattern '\n', 'number', '\n'. Would I be able to do it with BeautifulSoup? If not, how do I remove that pattern from the list?
import requests
from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    # Skip superscript text (footnote markers) and HTML comments.
    if element.parent.name in ['sup']:
        return False
    if isinstance(element, Comment):
        return False
    return True

url = 'https://www.sec.gov/Archives/edgar/data/1318605/000156459018019254/tsla-10q_20180630.htm'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
texts = soup.findAll(text=True)
### could remove ['\n','number','\n']
visible_texts = filter(tag_visible, texts)
You can extract the tags containing page numbers from the soup before getting the text.
soup = BeautifulSoup(html.text, 'html.parser')
for hr in soup.select('hr'):
    # Each page number sits in the <p> immediately before a horizontal rule.
    hr.find_previous('p').extract()
texts = soup.findAll(text=True)
This extracts the page-number tags, which look like this:
<p style="text-align:center;margin-top:12pt;margin-bottom:0pt;text-indent:0%;font-size:10pt;font-family:Times New Roman;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">57</p>
<p style="text-align:center;margin-top:12pt;margin-bottom:0pt;text-indent:0%;font-size:10pt;font-family:Times New Roman;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">58</p>
... etc.
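If you'd rather filter the list itself, here is a minimal sketch (assuming, as observed above, that each page number appears as a standalone digit string sandwiched between two '\n' entries):

visible = list(visible_texts)
# Drop bare numbers that sit between two newline entries; keep everything else.
cleaned = [
    t for i, t in enumerate(visible)
    if not (
        t.strip().isdigit()
        and 0 < i < len(visible) - 1
        and visible[i - 1] == '\n'
        and visible[i + 1] == '\n'
    )
]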
I'm trying to scrape Rotten Tomatoes with bs4. My aim is to find all the a hrefs from the table, but I cannot do it. Can you help me?
https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/
My code is:
from urllib import request
from bs4 import BeautifulSoup as BS

url = 'https://www.rottentomatoes.com/top/bestofrt'
html = request.urlopen(url)
bs = BS(html.read(), 'html.parser')
tags = bs.find_all('a', {'class': 'articleLink unstyled'})[7:]
links = ['https://www.rottentomatoes.com' + tag['href'] for tag in tags]

# links
webpages = []
for link in reversed(links):
    print(link)
    html = request.urlopen(link)
    bs = BS(html.read(), 'html.parser')
    tags = bs.find_all('a', {'class': 'unstyled articleLink'})[43:]
    links = ['https://www.rottentomatoes.com' + tag['href'] for tag in tags]
    webpages.extend(links)
print(webpages)
I put a limit of 43 to skip the useless links and keep only the movies, but that is a short-term hack and doesn't really help. I need a proper way to scrape the links from the table without picking up irrelevant ones. Thanks.
Just grab the main table and then extract all the <a> tags.
For example:
import requests
from bs4 import BeautifulSoup
rotten_tomatoes_url = 'https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/'
action_and_adventure = [
    f"https://www.rottentomatoes.com{link.get('href')}"
    for link in BeautifulSoup(
        requests.get(rotten_tomatoes_url).text,
        "lxml",
    )
    .find("table", class_="table")
    .find_all("a")
]
print(len(action_and_adventure))
print("\n".join(action_and_adventure[:10]))
Output (all 100 links to movies):
100
https://www.rottentomatoes.com/m/black_panther_2018
https://www.rottentomatoes.com/m/avengers_endgame
https://www.rottentomatoes.com/m/mission_impossible_fallout
https://www.rottentomatoes.com/m/mad_max_fury_road
https://www.rottentomatoes.com/m/spider_man_into_the_spider_verse
https://www.rottentomatoes.com/m/wonder_woman_2017
https://www.rottentomatoes.com/m/logan_2017
https://www.rottentomatoes.com/m/coco_2017
https://www.rottentomatoes.com/m/dunkirk_2017
https://www.rottentomatoes.com/m/star_wars_the_last_jedi
Try this:
tags = bs.find_all('a', attrs={'class': 'unstyled articleLink'})[43:]
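Note that a CSS selector matches both classes regardless of the order they appear in the class attribute, so this should work as well:

tags = bs.select('a.unstyled.articleLink')[43:]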
import requests
from bs4 import BeautifulSoup

URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
page = requests.get(URL)
print("testing")
soup = BeautifulSoup(page.content, 'html.parser')
words = soup.find_all('p', class_="entry-content")
print(len(words))
for word in words:
    print(word.text)
Nothing is being displayed on my console, and the length variable returns 0, which means nothing is being scraped.
If you look at the HTML, there is a div that contains all the p tags, so you can select that div by its class and then take the p tags from it to get your output.
import requests
from bs4 import BeautifulSoup

URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
page = requests.get(URL)
print("testing")
soup = BeautifulSoup(page.content, 'html.parser')
# The "entry-content" class is on the wrapping <div>, not on the <p> tags.
main_div = soup.find('div', attrs={"class": "entry-content"})
words = main_div.find_all("p")
for word in words:
    print(word.text)
Output:
testing
The slang used in Dominican Republic.
C – ce
*Caballo – person similar to a tigre but a little more decent
*Cabron – a large male goat,also means displeased
*Cacaito – candy
....
There are no p tags with class entry-content. Remove that filter and it should work:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
page = requests.get(URL)
print("testing")
soup = BeautifulSoup(page.content, 'html.parser')
words = soup.find_all('p')
print(len(words))
for word in words:
    print(word.text)
But if you want only the content of the p tags inside the div with class "entry-content", then:
entry_content = soup.find('div', attrs={"class": "entry-content"})
words = entry_content.find_all("p")
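Equivalently, a single CSS selector does the same descent in one step:

words = soup.select('div.entry-content p')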
You are passing an additional class parameter, entry-content, to find_all, but there is no need for it:
words = soup.find_all('p', class_="entry-content")
Try this instead:
words = soup.find_all('p')
Then it will get all the p content and give you the length:
print(len(words))
I hope this helps.
I'm trying to get the text from an 'a' HTML element I got with BeautifulSoup.
I am able to print the whole thing, and what I want to find is right there:
-1
Tensei Shitara Slime Datta Ken Manga
-1
But when I try to be more specific and get the text from it, I get this error:
File "C:\python\manga\manga.py", line 15, in <module>
print(title.text)
AttributeError: 'int' object has no attribute 'text'
Here is the code I'm running:
import requests
from bs4 import BeautifulSoup

URL = 'https://mangapark.net/manga/tensei-shitara-slime-datta-ken-fuse'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('section', class_='manga')
manga_title = soup.find('div', class_='pb-1 mb-2 line-b-f hd')
for m_title in manga_title:
    title = m_title.find('a')
    print(title.text)
I've searched for my problem but I couldn't find anything that helps.
When you iterate over the div, some of its children are plain strings (NavigableString), and .find on a string is the built-in str.find, which returns -1 when the substring isn't found.
This isn't a very common way in Python to signal that no value exists, but it is a common one in other languages.
import requests
from bs4 import BeautifulSoup

URL = 'https://mangapark.net/manga/tensei-shitara-slime-datta-ken-fuse'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('section', class_='manga')
manga_title = soup.find('div', class_='pb-1 mb-2 line-b-f hd')
for m_title in manga_title.children:
    title = m_title.find('a')
    # String children fall back to str.find, which returns -1 when
    # the substring isn't found, so skip those.
    if title != -1:
        print(title.text)
Output
Tensei Shitara Slime Datta Ken Manga
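A more direct route (a sketch along the same lines) is to skip the loop and search the div itself for the <a> tag, checking for None, which is what Tag.find returns on a miss:

title_div = soup.find('div', class_='pb-1 mb-2 line-b-f hd')
if title_div is not None:
    title = title_div.find('a')  # Tag.find returns None, not -1, on a miss
    if title is not None:
        print(title.text)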
I have an html document with the following bulleted list:
Body=<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li><li>Use Case Triggers<ul><li>T1</li><li>T2</li></ul></li><li>Postconditions<ul><li>PO1</li><li>PO2</li></ul></li></ul>
(Alternative View):
PreconditionsPC1PC2Use Case TriggersT1T2PostconditionsPO1PO2
I'm trying to write a function in Python that will dissect this list and pull out groups of data. The goal is to put this data in a matrix that would look like the following:
[[Preconditions, PC1],[Preconditions, PC2],[Use Case Triggers, T1],[Use Case Triggers, T2],[Postconditions, PO1],[Postconditions, PO2]]
The other hurdle is that this matrix needs to be generated regardless of the number of ul and li elements.
Any guidance is appreciated!
You can write a function that takes raw HTML and deletes all HTML tags:
import re

def cleanhtml(raw_html):
    # Replace tags and HTML entities with spaces.
    cleanr = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
    cleantext = re.sub(cleanr, " ", raw_html)
    return cleantext
Some other cleanr options:
cleanr = re.compile(r"<[A-Za-z\/][^>]*>")
cleanr = re.compile(r"<[^>]*>")
cleanr = re.compile(r"<\/?\w+\s*[^>]*?\/?>")
But there is a better and easier way with BeautifulSoup:
import requests
from bs4 import BeautifulSoup

def clean_with_soup(url: str) -> str:
    r = requests.get(url).text
    soup = BeautifulSoup(r, "html.parser")
    return soup.get_text()
A good library for parsing HTML is BeautifulSoup. Code example:
from bs4 import BeautifulSoup

html = "<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li><li>Use Case Triggers<ul><li>T1</li><li>T2</li></ul></li><li>Postconditions<ul><li>PO1</li><li>PO2</li></ul></li></ul>"
bs = BeautifulSoup(html, "html.parser")
uls = bs.findAll("ul")
for ul in uls:
    print(ul.findAll("li"))
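To get all the way to the matrix asked for, here is a minimal sketch (assuming the two-level structure shown: a top-level ul whose li items each hold a label followed by a nested ul of values):

from bs4 import BeautifulSoup

html = "<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li><li>Use Case Triggers<ul><li>T1</li><li>T2</li></ul></li><li>Postconditions<ul><li>PO1</li><li>PO2</li></ul></li></ul>"
soup = BeautifulSoup(html, "html.parser")

matrix = []
for li in soup.find("ul").find_all("li", recursive=False):
    label = li.find(text=True).strip()  # first text node, e.g. "Preconditions"
    nested = li.find("ul")
    if nested:
        for item in nested.find_all("li"):
            matrix.append([label, item.get_text()])
print(matrix)
# [['Preconditions', 'PC1'], ['Preconditions', 'PC2'],
#  ['Use Case Triggers', 'T1'], ['Use Case Triggers', 'T2'],
#  ['Postconditions', 'PO1'], ['Postconditions', 'PO2']]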
Hey guys, so I got as far as being able to add the a elements to a list. The problem is that I just want the href link added to the links_with_text list, not the entire a element. What am I doing wrong?
from bs4 import BeautifulSoup
import requests

URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='hnmain')
articles = results.find_all(class_="title")
links_with_text = []
for article in articles:
    link = article.find('a', href=True)
    links_with_text.append(link)
print('\n'.join(map(str, links_with_text)))
This prints the list exactly how I want it, but I just want the href from every a element, not the entire element. Thank you.
To get all the links from https://news.ycombinator.com, you can use the CSS selector 'a.storylink'.
For example:
from bs4 import BeautifulSoup
import requests

URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
links_with_text = []
for a in soup.select('a.storylink'):   # <-- find all <a> with class="storylink"
    links_with_text.append(a['href'])  # <-- note the ['href']
print(*links_with_text, sep='\n')
Prints:
https://blog.mozilla.org/futurereleases/2020/06/18/introducing-firefox-private-network-vpns-official-product-the-mozilla-vpn/
https://mxb.dev/blog/the-return-of-the-90s-web/
https://github.blog/2020-06-18-introducing-github-super-linter-one-linter-to-rule-them-all/
https://www.sciencemag.org/news/2018/11/why-536-was-worst-year-be-alive
https://www.strongtowns.org/journal/2020/6/16/do-the-math-small-projects
https://devblogs.nvidia.com/announcing-cuda-on-windows-subsystem-for-linux-2/
https://lwn.net/SubscriberLink/822568/61d29096a4012e06/
https://imil.net/blog/posts/2020/fakecracker-netbsd-as-a-function-based-microvm/
https://jepsen.io/consistency
https://tumblr.beesbuzz.biz/post/621010836277837824/advice-to-young-web-developers
https://archive.org/search.php?query=subject%3A%22The+Navy+Electricity+and+Electronics+Training+Series%22&sort=publicdate
https://googleprojectzero.blogspot.com/2020/06/ff-sandbox-escape-cve-2020-12388.html?m=1
https://apnews.com/1da061ce00eb531291b143ace0eed1c9
https://support.apple.com/library/content/dam/edam/applecare/images/en_US/appleid/android-apple-music-account-payment-none.jpg
https://standpointmag.co.uk/issues/may-june-2020/the-healing-power-of-birdsong/
https://steveblank.com/2020/06/18/the-coming-chip-wars-of-the-21st-century/
https://www.videolan.org/security/sb-vlc3011.html
https://onesignal.com/careers/2023b71d-2f44-4934-a33c-647855816903
https://www.bbc.com/news/world-europe-53006790
https://github.com/efficient/HOPE
https://everytwoyears.org/
https://www.historytoday.com/archive/natural-histories/intelligence-earthworms
https://cr.yp.to/2005-590/powerpc-cwg.pdf
https://quantum.country/
http://www.crystallography.net/cod/
https://parkinsonsnewstoday.com/2020/06/17/tiny-magnetically-powered-implant-may-be-future-of-deep-brain-stimulation/
https://spark.apache.org/releases/spark-release-3-0-0.html
https://arxiv.org/abs/1712.09624
https://www.washingtonpost.com/technology/2020/06/18/data-privacy-law-sherrod-brown/
https://blog.chromium.org/2020/06/improving-chromiums-browser.html
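If you would rather keep your original loop, the minimal change (a sketch; the None check guards against title cells that contain no link) is to append the tag's href attribute instead of the tag itself:

for article in articles:
    link = article.find('a', href=True)
    if link is not None:                      # some "title" cells may lack an <a>
        links_with_text.append(link['href'])  # <-- append the attribute, not the tag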