Python 3 - Extract content between <td></td> [duplicate]

This question already has an answer here:
How to get inner text value of an HTML tag with BeautifulSoup bs4?
(1 answer)
Closed 5 years ago.
from bs4 import BeautifulSoup
import re
data = open(r'C:\folder')  # raw string so \f isn't read as a form-feed escape
soup = BeautifulSoup(data, 'html.parser')
emails = soup.find_all('td', text=re.compile('#'))
for line in emails:
    print(line)
I have the script above that works perfectly in Python 2.7 with BeautifulSoup, extracting the content between several <td></td> tags in an HTML file. When I run the same script in Python 3.6.4, however, I get the following results:
<td>xxx#xxx.com</td>
<td>xxx#xxx.com</td>
I want the content between the tags without the <td> markup...
Why is this happening in Python 3?

I found the answer...
from bs4 import BeautifulSoup
import re
data = open(r'C:\folder')  # raw string so \f isn't read as a form-feed escape
soup = BeautifulSoup(data, 'html.parser')  # Added html.parser
emails = soup.find_all('td', text=re.compile('#'))
for td in emails:
    print(td.get_text())
Look closely at the last two lines :)
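As an aside, newer BeautifulSoup releases (4.4+) prefer the string= keyword over text= for this filter. A minimal sketch, assuming a hypothetical local file emails.html:
from bs4 import BeautifulSoup
import re
with open('emails.html') as data:  # hypothetical file name
    soup = BeautifulSoup(data, 'html.parser')
# string= is the modern spelling of the text= filter (bs4 4.4+)
for td in soup.find_all('td', string=re.compile('#')):
    print(td.get_text())  # inner text only, no <td> markup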

Related

Web scrape links with Python, then turn them into a string [duplicate]

This question already has answers here:
Print string to text file
(8 answers)
Closed 3 months ago.
With Python I'm having issues turning web-scraped links into strings so I can save them as either a txt or csv file. I would really like them as a txt file. This is what I have at the moment:
import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
    type(link)
    print(link, file=open('example.txt','w'))
I've tried all sorts of things with no luck. I'm pretty much at a loss.
print(link, file=open('example.txt','w'))
will write the link variable, but because the file is reopened in 'w' (truncate) mode on every pass through the loop, only the last link survives.
To write them all, use:
import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
with open("example.txt", "w") as file:
    for link in soup.find_all('a', href=True):  # href=True skips anchors without an href
        file.write(link.get('href') + '\n')
This uses a context manager to open the file once, then writes each href followed by a newline.
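Since the question also mentions csv, here is a minimal sketch of the same loop writing one URL per row with the standard csv module (the file name example.csv is just a stand-in):
import csv
import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
with open("example.csv", "w", newline="") as f:  # newline="" avoids blank rows on Windows
    writer = csv.writer(f)
    for link in soup.find_all('a', href=True):  # href=True skips anchors without an href
        writer.writerow([link['href']])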

Extract text with Beautiful Soup [duplicate]

This question already has answers here:
BS4 Beautiful Soup extract text from find_all
(2 answers)
Closed 2 years ago.
I'm trying to learn how to use Beautiful Soup, using this website as a very simple example:
https://www.espncricinfo.com/ci/content/ground/56490.html#Profile
Let's say I want to extract the capacity of the ground. I have so far written the following code, which gives me the field names, but I can't seem to understand how to get the actual value of 18,000.
Can anyone help?
import requests
from bs4 import BeautifulSoup
url = "https://www.espncricinfo.com/ci/content/ground/56490.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')  # parser named to silence the bs4 warning
soup.findAll('label')
Perhaps something like
from bs4 import BeautifulSoup
import requests
url = "https://www.espncricinfo.com/ci/content/ground/56490.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
stats = soup.find('div', {'id': 'stats'})
for e in stats.findAll('label'):
    print(f"{e.text}: {e.nextSibling}")
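To pull out just the 18,000 figure rather than every field, one approach is to filter on the label text. A sketch that assumes the layout above (the value sitting right after its label as a text sibling) holds:
for e in stats.findAll('label'):
    if 'Capacity' in e.get_text():
        # the value is the text node immediately after the label
        print(e.nextSibling.strip())
        break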

How can I print web scraped text onto a single line?

I'm trying to scrape tracking information from a shipper website using BeautifulSoup. However, the format of the HTML is not conducive to what I'm trying to do: there is unnecessary spacing in the source text which clutters up my output. Ideally I'd just like to grab the date here, but at this point I'll take "Shipped" and the date, as long as they're on the same line.
I've tried using .replace(" ","") & .strip() with no success.
Python Script:
from bs4 import BeautifulSoup
import requests
TrackList = ["658744424"]
for TrackNum in TrackList:
    source = requests.get('https://track.xpoweb.com/en-us/ltl-shipment/' + TrackNum + '/').text
    soup = BeautifulSoup(source, 'lxml')
    ShipDate = soup.find('p', class_="Track-meter-itemLabel text--center").text
    print(ShipDate)
HTML Source Code:
<p class="Track-meter-itemLabel text--center">
  <strong class="text--bold">
    Shipped
  </strong>
  5/23/2019
</p>
This is what's being returned. Additional spaces and blank lines.
Shipped
5/23/2019
Try:
trac = """[your HTML code above]"""  # the <p> snippet, kept as a placeholder here
soup = BeautifulSoup(trac, "lxml")
soup.text.replace(' ', '').replace('\n', ' ').strip()
Output:
'Shipped 5/23/2019'
You are looking for the stripped_strings generator, which is built into BeautifulSoup but isn't common knowledge.
# Your code, switched to stripped_strings
for ShipDate in soup.find('p', class_="Track-meter-itemLabel text--center").stripped_strings:
    print(ShipDate)
Output:
Shipped
5/23/2019
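And since the goal was a single line, the same generator can be fed to join:
p = soup.find('p', class_="Track-meter-itemLabel text--center")
print(' '.join(p.stripped_strings))  # Shipped 5/23/2019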
Use regex
from bs4 import BeautifulSoup
import requests
import re
TrackList = ["658744424"]
for TrackNum in TrackList:
    source = requests.get('https://track.xpoweb.com/en-us/ltl-shipment/' + TrackNum + '/').text
    soup = BeautifulSoup(source, 'lxml')
    print(' '.join(re.sub(r'\s+', ' ', soup.select_one('.Track-meter-itemLabel').text.strip()).split('\n')))
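Unpacked, the one-liner collapses every run of whitespace into a single space (reusing the soup from the loop above); the final split/join on '\n' is effectively a no-op, since re.sub has already replaced the newlines:
text = soup.select_one('.Track-meter-itemLabel').text.strip()  # raw text, full of spaces and newlines
collapsed = re.sub(r'\s+', ' ', text)  # any whitespace run -> one space
print(collapsed)  # 'Shipped 5/23/2019'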

How to scrape href with Python 3.5 and BeautifulSoup [duplicate]

This question already has answers here:
retrieve links from web page using python and BeautifulSoup [closed]
(16 answers)
Closed 6 years ago.
I want to scrape the href of every project from the website https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1 with Python 3.5 and BeautifulSoup.
That's my code:
#Loading Libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup
#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")
#Scraping "Link" (href)
project_ref = soup.findAll('h6', {'class': 'project-title'})
project_href = [project.findChildren('a')[0].href for project in project_ref if project.findChildren('a')]
print(project_href)
I get [None, None, ..., None, None] back.
I need a list with all the hrefs from the project-title class.
Any ideas?
Try something like this:
import urllib.request
from bs4 import BeautifulSoup
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, 'html.parser')  # name the parser to avoid the bs4 warning
project_href = [i['href'] for i in soup.find_all('a', href=True)]
print(project_href)
This will return all the href instances. As I see in your link, a lot of hrefs are just #. You can avoid these with a simple regex for proper links, or just ignore the # symbols.
project_href = [i['href'] for i in soup.find_all('a', href=True) if i['href'] != "#"]
This will still give you some junk links like /discover?ref=nav, so if you want to narrow it down, use a proper regex for the links you need; see the sketch below.
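For example, bs4 accepts a compiled regex as an attribute filter, so you could keep only project pages (the /projects/ prefix here is an assumption about Kickstarter's URL scheme):
import re
pattern = re.compile(r'^/projects/')  # assumed prefix for project links
project_href = [a['href'] for a in soup.find_all('a', href=pattern)]
print(project_href)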
EDIT:
To solve the problem you mentioned in the comments:
soup = BeautifulSoup(thepage, 'html.parser')
for i in soup.find_all('div', attrs={'class': 'project-card-content'}):
    print(i.a['href'])
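For the record, the reason the original list comprehension produced [None, None, ...] is that dot access on a bs4 Tag looks up child elements, not HTML attributes; attributes are read by subscript or .get():
a = soup.find('a')
a.href  # None: looks for a child <href> tag, which doesn't exist
a['href']  # the attribute value; raises KeyError if the attribute is missing
a.get('href')  # the attribute value, or None if missing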

beautiful soup and requests not getting full page [duplicate]

This question already has an answer here:
Beautiful Soup 4 find_all don't find links that Beautiful Soup 3 finds
(1 answer)
Closed 8 years ago.
My code looks like this.
from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.data.com.sg/iCurrentLaunch.jsp")
data = r.text
soup = BeautifulSoup(data)
n = soup.findAll('table')[7].findAll('table')
for tab in n:
    print(tab.findAll('td')[1].text)
What I am getting is the property names up to IDYLLIC SUITES; after that I get the error "list index out of range". What is the problem?
I am not sure what exactly is bothering you, because when I tried your code (as it is) it worked for me.
Still, try changing the parser, maybe to html5lib.
So do,
pip install html5lib
And then change your code to,
from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.data.com.sg/iCurrentLaunch.jsp")
data = r.text
soup = BeautifulSoup(data, 'html5lib')  # Change of parser
n = soup.findAll('table')[7].findAll('table')
for tab in n:
    print(tab.findAll('td')[1].text)
Let me know if it helps
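The underlying issue (covered in the linked duplicate) is that different parsers repair broken markup differently, so the table tree can come out with different shapes. A quick sketch for comparing what each installed parser sees (lxml and html5lib must be pip-installed):
from bs4 import BeautifulSoup
broken = "<table><tr><td>a<table><tr><td>b</table>"  # deliberately unclosed markup
for parser in ('html.parser', 'lxml', 'html5lib'):
    soup = BeautifulSoup(broken, parser)
    print(parser, len(soup.find_all('table')))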
