How can I print web scraped text onto a single line? - python

I'm trying to scrape tracking information from a shipper website using beautifulsoup. However, the format of the html is not conducive to what I'm trying to do. There is unnecessary spacing included in the source code text which is cluttering up my output. Ideally I'd just like to grab the date here but I'll take "Shipped" and the date at this point as long as it's on the same line.
I've tried using .replace(" ","") & .strip() with no success.
Python Script:
from bs4 import BeautifulSoup
import requests
TrackList = ["658744424"]
for TrackNum in TrackList:
source = requests.get('https://track.xpoweb.com/en-us/ltl-shipment/'+TrackNum+"/").text
soup = BeautifulSoup(source, 'lxml')
ShipDate = soup.find('p', class_="Track-meter-itemLabel text--center").text
print(ShipDate)
HTML Source Code:
<p class="Track-meter-itemLabel text--center">
<strong class="text--bold">
Shipped
</strong>
5/23/2019
</p>
This is what's being returned. Additional spaces and blank lines.
Shipped
5/23/2019

Try:
trac = [your html code above]
soup = BeautifulSoup(trac, "lxml")
soup.text.replace(' ','').replace('\n',' ').strip()
Output:
'Shipped 5/23/2019'

You are looking for the stripped_strings generator which is already built into BeautifulSoup but it's not common knowledge.
### Your code
for ShipDate in soup.find('p', class_="Track-meter-itemLabel text--center").stripped_strings:
print(ShipDate)
Output:
Shipped
5/23/2019

Use regex
from bs4 import BeautifulSoup
import requests
import re
TrackList = ["658744424"]
for TrackNum in TrackList:
source = requests.get('https://track.xpoweb.com/en-us/ltl-shipment/'+TrackNum+"/").text
soup = BeautifulSoup(source, 'lxml')
print(' '.join(re.sub(r'\s+',' ', soup.select_one('.Track-meter-itemLabel').text.strip()).split('\n')))

Related

BeautifulSoup - can't find attribute

I'm trying to scrape this link.
I want to get to this part here:
I can see where this part of the website is when I inspect the page:
But I can't get to it from BeautifulSoup.
Here is the code that I'm using and all the ways I've tried to access it:
from bs4 import BeautifulSoup
import requests
link = 'https://www.sports-reference.com/cbb/players/temetrius-morant-1.html'
html_text = requests.get(link).text
soup = BeautifulSoup(html_text, 'html.parser')
soup.find_all(class_='data_grid')
soup.find_all(string="data_grid")
soup.find_all(attrs={"class": "data_grid"})
Also, when I just look at the html I can see that it is there:
You need to look at the actual source html code that you get in response (not the html you inspect, which you have shown to have done), you'll notice those tables are within the comments of the html Ie. <!-- and -->. BeautifulSoup ignores comments.
There are a few ways to go about it. BeautifulSoup does have a method to search and pull out comments, however with this particular site, I find it just easier to remove the comment tags.
Once you do that, you can easily parse the html with BeautifulSoup to get the desired <div> tag, then just let pandas parse the <table> tag within there.
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.sports-reference.com/cbb/players/temetrius-morant-1.html'
response = requests.get(url)
html = response.text
html = html.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, 'html.parser')
leaderboard_pts = soup.find('div', {'id':'leaderboard_pts'})
df = pd.read_html(str(leaderboard_pts))[0]
Output:
print(df)
0
0 2017-18 OVC 405 (18th)
1 2018-19 NCAA 808 (9th)
2 2018-19 OVC 808 (1st)
if you re looking for the point section i suggest to search with id like this:
point_section=soup.find("div",{"id":"leaderboard_pts"})

Trying to scrape Aliexpress

So I am trying to scrape the price of a product on Aliexpress. I tried inspecting the element which looks like
<span class="product-price-value" itemprop="price" data-spm-anchor-id="a2g0o.detail.1000016.i3.fe3c2b54yAsLRn">US $14.43</span>
I'm trying to run the following code
'''
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
url = 'https://www.aliexpress.com/item/32981494236.html?spm=a2g0o.productlist.0.0.44ba26f6M32wxY&algo_pvid=520e41c9-ba26-4aa6-b382-4aa63d014b4b&algo_expid=520e41c9-ba26-4aa6-b382-4aa63d014b4b-22&btsid=0bb0623b16170222520893504e9ae8&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_'
source = urlopen(url).read()
soup = BeautifulSoup(source, 'lxml')
soup.find('span', class_='product-price-value')
'''
but I keep getting a blank output. I must be doing something wrong but these methods seem to work in the tutorials I've seen.
So, what i got. As i understood right, the page what you gave, was recived by scripts, but in origin, it doesn't contain it, just script tags, so i just used split to get it. Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://aliexpress.ru/item/1005002281350811.html?spm=a2g0o.productlist.0.0.42d53b59T5ddTM&algo_pvid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5&algo_expid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5-1&btsid=0b8b035c16170960366785062e33c0&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_&sku_id=12000019900010138'
data = requests.get(url)
soup = BeautifulSoup(data.content, features="lxml")
res = soup.findAll("script")
total_value = str(res[-3]).split("totalValue:")[1].split("}")[0].replace("\"", "").replace(".", "").strip()
print(total_value)
It works fine, i tried on few pages from Ali.

How to extract value from span tag

I am writing a simple web scraper to extract the game times for the ncaa basketball games. The code doesn't need to be pretty, just work. I have extracted the value from other span tags on the same page but for some reason I cannot get this one working.
from bs4 import BeautifulSoup as soup
import requests
url = 'http://www.espn.com/mens-college-basketball/game/_/id/401123420'
response = requests.get(url)
soupy = soup(response.content, 'html.parser')
containers = soupy.findAll("div",{"class" : "team-container"})
for container in containers:
spans = container.findAll("span")
divs = container.find("div",{"class": "record"})
ranks = spans[0].text
team_name = spans[1].text
team_mascot = spans[2].text
team_abbr = spans[3].text
team_record = divs.text
time_container = soupy.find("span", {"class":"time game-time"})
game_times = time_container.text
refs_container = soupy.find("div", {"class" : "game-info-note__container"})
refs = refs_container.text
print(ranks)
print(team_name)
print(team_mascot)
print(team_abbr)
print(team_record)
print(game_times)
print(refs)
The specific code I am concerned about is this,
time_container = soupy.find("span", {"class":"time game-time"})
game_times = time_container.text
I just provided the rest of the code to show that the .text on other span tags work. The time is the only data I truly want. I just get an empty string with how my code is currently.
This is the output of the code I get when I call time_container
<span class="time game-time" data-dateformat="time1" data-showtimezone="true"></span>
or just '' when I do game_times.
Here is the line of the HTML from the website:
<span class="time game-time" data-dateformat="time1" data-showtimezone="true">6:10 PM CT</span>
I don't understand why the 6:10 pm is gone when I run the script.
The site is dynamic, thus, you need to use selenium:
from selenium import webdriver
d = webdriver.Chrome('/path/to/chromedriver')
d.get('http://www.espn.com/mens-college-basketball/game/_/id/401123420')
game_time = soup(d.page_source, 'html.parser').find('span', {'class':'time game-time'}).text
Output:
'7:10 PM ET'
See full selenium documentation here.
An alternative would be to use some of ESPN's endpoints. These endpoints will return JSON responses. https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/scoreboard
You can see other endpoints at this GitHub link https://gist.github.com/akeaswaran/b48b02f1c94f873c6655e7129910fc3b
This will make your application pretty light weight compared to running Selenium.
I recommend opening up inspect and going to the network tab. You can see all sorts of cool stuff happening. You can see all the requests that are happening in the site.
You can easily grab from an attribute on the page with requests
import requests
from bs4 import BeautifulSoup as bs
from dateutil.parser import parse
r = requests.get('http://www.espn.com/mens-college-basketball/game/_/id/401123420')
soup = bs(r.content, 'lxml')
timing = soup.select_one('[data-date]')['data-date']
print(timing)
match_time = parse(timing).time()
print(match_time)

BeautifulSoup scraping Bitcoin Price issue

I am still new to python, and especially BeautifulSoup. I've been reading up on this stuff for a few days and playing around with bunch of different codes and getting mix results. However, on this page is the Bitcoin Price I would like to scrape. The price is located in:
<span class="text-large2" data-currency-value="">$16,569.40</span>
Meaning that, I'd like to have my script print only that line where the value is. My current code prints the whole page and it doesn't look very nice, since it's printing a lot of data. Could anybody please help to improve my code?
import requests
from BeautifulSoup import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/bitcoin/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
div = soup.find('text-large2', attrs={'class': 'stripe'})
for row in soup.findAll('div'):
for cell in row.findAll('tr'):
print cell.text
And this is a snip of the output I get after running the code. It doesn't look very nice or readable.
#SourcePairVolume (24h)PriceVolume (%)Updated
1BitMEXBTC/USD$3,280,130,000$15930.0016.30%Recently
2BithumbBTC/KRW$2,200,380,000$17477.6010.94%Recently
3BitfinexBTC/USD$1,893,760,000$15677.009.41%Recently
4GDAXBTC/USD$1,057,230,000$16085.005.25%Recently
5bitFlyerBTC/JPY$636,896,000$17184.403.17%Recently
6CoinoneBTC/KRW$554,063,000$17803.502.75%Recently
7BitstampBTC/USD$385,450,000$15400.101.92%Recently
8GeminiBTC/USD$345,746,000$16151.001.72%Recently
9HitBTCBCH/BTC$305,554,000$15601.901.52%Recently
Try this:
import requests
from BeautifulSoup import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/bitcoin/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
div = soup.find("div", {"class" : "col-xs-6 col-sm-8 col-md-4 text-left"
}).find("span", {"class" : "text-large2"})
for i in div:
print i
This prints 16051.20 for me.
Later Edit: and if you put the above code in a function and loop it it will constantly update. I get different values now.
This works. But I think you use older version of BeautifulSoup, try pip install bs4 in command prompt or PowerShell
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/bitcoin/'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
value = soup.find('span', {'class': 'text-large2'})
print(''.join(value.stripped_strings))

Python trying to get paragraph from weather website

I'm pretty new to python 2.7 but I am trying to get a simple paragraph from a website but python outputs []. I've managed to extract numbers but not text.
Any help would be great, thanks.
import urllib
import re
HTML_File = urllib.urlopen("http://uk.weather.com/weather/10day/New+Romney+KEN+United+Kingdom+UKXX1121:1:UK")
HTML_Text = HTML_File.read()
LastUpdate_Pattern = re.compile('<div class="wx-24hour-title"> <h2>New Romney 10-Day Forecast</h2> <p class="wx-timestamp"> (.*?) </p>')
LastUpdate = re.findall(LastUpdate_Pattern, HTML_Text)
print LastUpdate
Use BeautifulSoup
import urllib
from bs4 import BeautifulSoup
HTML_File = urllib.urlopen("http://uk.weather.com/weather/10day/New+Romney+KEN+United+Kingdom+UKXX1121:1:UK")
HTML_Text = HTML_File.read()
soup = BeautifulSoup(HTML_Text, 'html.parser')
print soup.select('.wx-timestamp')[0].text
Output:
Updated:
last updated about 20 minutes ago

Categories

Resources