Python trying to get paragraph from weather website - python

I'm pretty new to Python 2.7, and I'm trying to get a simple paragraph from a website, but Python outputs []. I've managed to extract numbers but not text.
Any help would be great, thanks.
import urllib
import re
HTML_File = urllib.urlopen("http://uk.weather.com/weather/10day/New+Romney+KEN+United+Kingdom+UKXX1121:1:UK")
HTML_Text = HTML_File.read()
LastUpdate_Pattern = re.compile('<div class="wx-24hour-title"> <h2>New Romney 10-Day Forecast</h2> <p class="wx-timestamp"> (.*?) </p>')
LastUpdate = re.findall(LastUpdate_Pattern, HTML_Text)
print LastUpdate

Use BeautifulSoup. Your regex most likely returns [] because it hard-codes the exact spacing between the tags, which rarely matches the page's real whitespace; a parser is far more reliable:
import urllib
from bs4 import BeautifulSoup
HTML_File = urllib.urlopen("http://uk.weather.com/weather/10day/New+Romney+KEN+United+Kingdom+UKXX1121:1:UK")
HTML_Text = HTML_File.read()
soup = BeautifulSoup(HTML_Text, 'html.parser')
print soup.select('.wx-timestamp')[0].text
Output:
Updated:
last updated about 20 minutes ago
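To see why the original regex came up empty, here is a small self-contained sketch using a hypothetical fragment shaped like the weather page (not fetched live). The strict pattern with literal spaces finds nothing, while `\s*` tolerates whatever whitespace the page actually has:

```python
import re

# Hypothetical fragment shaped like the weather page; not fetched live
html = ('<div class="wx-24hour-title">\n'
        '  <h2>New Romney 10-Day Forecast</h2>\n'
        '  <p class="wx-timestamp">last updated about 20 minutes ago</p>\n'
        '</div>')

# The original pattern hard-codes single spaces around the text, so it finds nothing
strict = re.findall(r'<p class="wx-timestamp"> (.*?) </p>', html)

# \s* tolerates any amount of whitespace (or none) between the tags
loose = re.findall(r'<p class="wx-timestamp">\s*(.*?)\s*</p>', html)
print(strict, loose)
```

Even so, a parser like BeautifulSoup remains the sturdier choice, since attribute order and markup details can change as well as whitespace.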

Related

Trying to scrape Aliexpress

So I am trying to scrape the price of a product on Aliexpress. I tried inspecting the element which looks like
<span class="product-price-value" itemprop="price" data-spm-anchor-id="a2g0o.detail.1000016.i3.fe3c2b54yAsLRn">US $14.43</span>
I'm trying to run the following code
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
url = 'https://www.aliexpress.com/item/32981494236.html?spm=a2g0o.productlist.0.0.44ba26f6M32wxY&algo_pvid=520e41c9-ba26-4aa6-b382-4aa63d014b4b&algo_expid=520e41c9-ba26-4aa6-b382-4aa63d014b4b-22&btsid=0bb0623b16170222520893504e9ae8&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_'
source = urlopen(url).read()
soup = BeautifulSoup(source, 'lxml')
soup.find('span', class_='product-price-value')
but I keep getting a blank output. I must be doing something wrong but these methods seem to work in the tutorials I've seen.
Here's what I found: as far as I understand, the page you linked renders its content with scripts, so the original HTML doesn't contain the price itself, just script tags. I used split to pull the value out of one of them. Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://aliexpress.ru/item/1005002281350811.html?spm=a2g0o.productlist.0.0.42d53b59T5ddTM&algo_pvid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5&algo_expid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5-1&btsid=0b8b035c16170960366785062e33c0&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_&sku_id=12000019900010138'
data = requests.get(url)
soup = BeautifulSoup(data.content, features="lxml")
res = soup.findAll("script")
total_value = str(res[-3]).split("totalValue:")[1].split("}")[0].replace("\"", "").replace(".", "").strip()
print(total_value)
It works fine; I tried it on a few pages from AliExpress.
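Chained `split()` calls are brittle: they break as soon as the script's layout shifts. If the script embeds a JSON object (as many such pages do), pulling it out with a regex and parsing it with `json.loads` is sturdier. A minimal sketch, assuming a hypothetical script payload; the real page's variable and key names may differ:

```python
import json
import re

# Hypothetical script payload; the real page's variable and key names may differ
script_text = 'window.runParams = {"priceModule": {"totalValue": "US $14.43"}};'

# Grab the JSON object from the assignment and parse it instead of splitting strings
match = re.search(r'window\.runParams = ({.*});', script_text)
params = json.loads(match.group(1))
price = params["priceModule"]["totalValue"]
print(price)  # US $14.43
```

Once parsed, you can navigate the data by key rather than by string position, which survives cosmetic changes to the script.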

BS4 returns [] instead of the wanted HTML tag

I want to parse the given website and scrape the table. The code looks right to me. I'm new to Python and web parsing.
import requests
from bs4 import BeautifulSoup
response = requests.get('https://delhifightscorona.in/')
doc = BeautifulSoup(response.text, 'lxml-xml')
cases = doc.find_all('div', {"class": "cell"})
print(cases)
doing this returns
[]
Change your parser ('html.parser' instead of 'lxml-xml', which treats the page as XML) and the class, and there you have it.
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://delhifightscorona.in/').text, 'html.parser').find('div', {"class": "grid-x grid-padding-x small-up-2"})
print(soup.find("h3").getText())
Output:
423,831
You can choose to print only the cases or the total stats with the date.
import requests
from bs4 import BeautifulSoup
response = requests.get('https://delhifightscorona.in/')
doc = BeautifulSoup(response.text, 'html.parser')
stats = doc.find('div', {"class": "cell medium-5"})
print(stats.text) #Print the whole block with dates and the figures
cases = stats.find('h3')
print(cases.text) #Print the cases only
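One detail worth knowing here: with an HTML parser, `class` is treated as a multi-valued attribute, so searching for a single class name matches elements that carry several. A sketch on a hypothetical fragment shaped like the stats block (the live page may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the stats block; the live page may differ
html = '<div class="cell medium-5"><h3>423,831</h3><p>updated today</p></div>'

doc = BeautifulSoup(html, 'html.parser')
# With an HTML parser, class is multi-valued, so "cell" alone matches "cell medium-5"
stats = doc.find('div', {'class': 'cell'})
cases = stats.find('h3').get_text()
print(cases)  # 423,831
```

This is why searching for just `"cell"` works even though the element's full class string is `"cell medium-5"`.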

How can I print web scraped text onto a single line?

I'm trying to scrape tracking information from a shipper website using beautifulsoup. However, the format of the html is not conducive to what I'm trying to do. There is unnecessary spacing included in the source code text which is cluttering up my output. Ideally I'd just like to grab the date here but I'll take "Shipped" and the date at this point as long as it's on the same line.
I've tried using .replace(" ","") & .strip() with no success.
Python Script:
from bs4 import BeautifulSoup
import requests
TrackList = ["658744424"]
for TrackNum in TrackList:
    source = requests.get('https://track.xpoweb.com/en-us/ltl-shipment/'+TrackNum+"/").text
    soup = BeautifulSoup(source, 'lxml')
    ShipDate = soup.find('p', class_="Track-meter-itemLabel text--center").text
    print(ShipDate)
HTML Source Code:
<p class="Track-meter-itemLabel text--center">
<strong class="text--bold">
Shipped
</strong>
5/23/2019
</p>
This is what's being returned. Additional spaces and blank lines.
Shipped
5/23/2019
Try:
trac = [your html code above]
soup = BeautifulSoup(trac, "lxml")
soup.text.replace(' ','').replace('\n',' ').strip()
Output:
'Shipped 5/23/2019'
You are looking for the stripped_strings generator which is already built into BeautifulSoup but it's not common knowledge.
### Your code
for ShipDate in soup.find('p', class_="Track-meter-itemLabel text--center").stripped_strings:
    print(ShipDate)
Output:
Shipped
5/23/2019
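To get everything onto a single line, as asked, you can join the generator's output. Using the HTML fragment from the question:

```python
from bs4 import BeautifulSoup

# The HTML fragment from the question
html = '''<p class="Track-meter-itemLabel text--center">
<strong class="text--bold">
Shipped
</strong>
5/23/2019
</p>'''

soup = BeautifulSoup(html, 'html.parser')
# Joining the generator's output collapses everything onto one line
line = ' '.join(soup.find('p').stripped_strings)
print(line)  # Shipped 5/23/2019
```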
Use regex
from bs4 import BeautifulSoup
import requests
import re
TrackList = ["658744424"]
for TrackNum in TrackList:
    source = requests.get('https://track.xpoweb.com/en-us/ltl-shipment/'+TrackNum+"/").text
    soup = BeautifulSoup(source, 'lxml')
    print(' '.join(re.sub(r'\s+',' ', soup.select_one('.Track-meter-itemLabel').text.strip()).split('\n')))
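For completeness, BeautifulSoup's `get_text()` can do the whitespace cleanup in one call: pass a separator and `strip=True` and each string is stripped before joining. A sketch on the question's fragment:

```python
from bs4 import BeautifulSoup

# Same fragment as the question's HTML, compressed onto one line
html = '<p class="Track-meter-itemLabel text--center"><strong>\nShipped\n</strong>\n5/23/2019\n</p>'

soup = BeautifulSoup(html, 'html.parser')
# get_text(separator, strip=True) strips each string and joins them in one call
line = soup.find('p').get_text(' ', strip=True)
print(line)  # Shipped 5/23/2019
```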

How to extract certain parts of an HTML paragraph

I am new to web scraping and regular expressions and am facing a problem here. One of my scripts gives me HTML output, but I need to extract a certain part of the paragraph rather than the whole thing. I need help with this. Below is my code.
import mechanize
from bs4 import BeautifulSoup
import urllib2
br = mechanize.Browser()
response = br.open("http://www.consultadni.info/index.php")
br.select_form(name="form1")
br['APE_PAT']='PATRICIO'
br['APE_MAT']='GAMARRA'
br['NOMBRES']='MARCELINA'
req=br.submit().read()
soup = BeautifulSoup(req, "lxml")
for link in soup.findAll("a"):
    sub = link.get("href")
    soup1 = BeautifulSoup(sub, "lxml")
    print soup1.find_all('p')
Output on screen:
[<p>/</p>]
[<p>datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880</p>]
[<p>datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880</p>]
[<p>http://www.infocorpperuconsultatusdeudas.blogspot.com/2015/05/infocorp-consulta-gratis-tu-reporte-de.html?ref=dnionline</p>]
What I need: 30/06/1980 & 40631880
For Python 2.7 try this way:
from urlparse import parse_qs
result = set()
for link in soup.find_all("a"):
    sub = parse_qs(link.get("href"))
    if "id2" in sub:
        result.add((sub["id2"][0], sub["dni3"][0]))
print result
Clean way to parse URLs (Python 3):
from urllib import parse
URL = "datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880"
query_parts = parse.parse_qs(parse.urlparse(URL).query)
print(query_parts["id2"][0], query_parts["dni3"][0])

How regex until last occurrence?

I am using Python, and I need a regex to get the contacts link of a web page. So I made <a (.*?)>(.*?)Contacts(.*?)</a>, and the result is:
href="/ru/o-nas.html" id="menu263" title="About">About</a></li><li>Photo</li><li class="last"><a href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts
but I only need the last <a ...> match, like
href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts
What regex pattern should I use?
python code:
match = re.findall('<a (.*?)>(.*?)Contacts(.*?)</a>', body)
if match:
    for m in match:
        print ''.join(m)
Since you are parsing HTML, I would suggest using BeautifulSoup:
# sample html from question
html = '<li><a href="/ru/o-nas.html" id="menu263" title="About">About</a></li><li>Photo</li><li class="last"><a href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts</a></li>'
from bs4 import BeautifulSoup
doc = BeautifulSoup(html)
aTag = doc.find('a', id='menu583') # id for Contacts link
print(aTag['href'])
# '/ru/kontakt.html'
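If the id isn't known in advance, another option is to match anchors by their text and take the last one, which directly answers the "until last occurrence" part of the question. A sketch on hypothetical markup with two candidate links:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with several links; only the last one is the contacts link
html = ('<li><a href="/ru/o-nas.html" id="menu263" title="About">About</a></li>'
        '<li class="last"><a href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts</a></li>')

doc = BeautifulSoup(html, 'html.parser')
# Match anchors by their text instead of a hard-coded id, then take the last one
links = doc.find_all('a', string=re.compile('Contacts'))
href = links[-1]['href']
print(href)  # /ru/kontakt.html
```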
Try BeautifulSoup
from BeautifulSoup import BeautifulSoup
import urllib2

links = []
urls = ['www.u1.com', 'www.u2.om']  # your list of URLs
for url in urls:
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    for link in soup.findAll('a'):
        # link.string can be None for nested tags, so guard before calling lower()
        if link.string and link.string.lower() == 'contact':
            links.append(link.get('href'))
