Can't grab a phone number from a webpage - python

I've created a script in Python to fetch a phone number from a webpage, but I can't figure out how to grab it because the number is rendered as an image.
Website link
The number is displayed on that page as an image.
Here is what I've written so far:
import requests
from bs4 import BeautifulSoup

url = "use_above_link"

def get_phone_number(link):
    resp = requests.get(link)
    soup = BeautifulSoup(resp.text, "lxml")
    phone = soup.select_one("img.phone-num-img")['src']
    print(phone)

if __name__ == '__main__':
    get_phone_number(url)
How can I scrape this phone number from that webpage?

Here you go.
The clues start with the page HTML, which indicates the tel number likely has a base64 encoding.
The base64 encoded value of that tel number is MDA5NzE1MjE3NjQ4MDY=. This value is not present on that page, but it is present at one of the other URLs you can extract from the initial page HTML.
Issue a second request to that URL, target the [data-tel] attribute (which is where the encoded string is stored), extract the base64 encoded string, and decode it.
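Note that base64.b64decode returns bytes, so call .decode() on the result to get a string. You can check the value quoted above in isolation:

import base64

print(base64.b64decode('MDA5NzE1MjE3NjQ4MDY=').decode())  # -> 00971521764806

Putting it all together: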
import requests
from bs4 import BeautifulSoup as bs
import base64
with requests.Session() as s:
    r = s.get('https://dubai.dubizzle.com/motors/used-cars/hyundai/accent/2018/6/8/hyundai-accent-excellent-condition-still-u-2/?back=L21vdG9ycy91c2VkLWNhcnMvP3BhZ2U9MzUmcHJpY2VfX2d0ZT0mcHJpY2VfX2x0ZT0meWVhcl9fZ3RlPSZ5ZWFyX19sdGU9JmtpbG9tZXRlcnNfX2d0ZT0ma2lsb21ldGVyc19fbHRlPSZzZWxsZXJfdHlwZT1PVyZrZXl3b3Jkcz0maXNfYmFzaWNfc2VhcmNoX3dpZGdldD0wJmlzX3NlYXJjaD0xJnBsYWNlc19faWRfX2luPSZwbGFjZXNfX2lkX19pbj01OSUyQzkwJTJDMTMzJTJDMTA2JTJDMTg4JTJDJmFkZGVkX19ndGU9JmF1dG9fYWdlbnQ9&shownumber')
    soup = bs(r.content, 'lxml')
    # the alternate link that also ends in "shownumber" leads to the page holding the encoded number
    link = 'https://dubai.dubizzle.com' + soup.select_one('[media][href$=shownumber]')['href']
    r = s.get(link)
    soup = bs(r.content, 'lxml')
    encoded = soup.select_one('[data-tel]')['data-tel']
    tel = base64.b64decode(encoded).decode()  # decode the bytes into a string
    print(tel)
Notes:
It looks like the rel alternate link (the second URL) is simply a mobile-device URL, so you can issue just one request by substituting /m/ into the original URL, i.e.
https://dubai.dubizzle.com/m/motors/used-cars/hyundai/accent/2018/6/8/hyundai-accent-excellent-condition-still-u-2/?back=L21vdG9ycy91c2VkLWNhcnMvP3BhZ2U9MzUmcHJpY2VfX2d0ZT0mcHJpY2VfX2x0ZT0meWVhcl9fZ3RlPSZ5ZWFyX19sdGU9JmtpbG9tZXRlcnNfX2d0ZT0ma2lsb21ldGVyc19fbHRlPSZzZWxsZXJfdHlwZT1PVyZrZXl3b3Jkcz0maXNfYmFzaWNfc2VhcmNoX3dpZGdldD0wJmlzX3NlYXJjaD0xJnBsYWNlc19faWRfX2luPSZwbGFjZXNfX2lkX19pbj01OSUyQzkwJTJDMTMzJTJDMTA2JTJDMTg4JTJDJmFkZGVkX19ndGU9JmF1dG9fYWdlbnQ9&shownumber#
The code then simplifies to:
import requests
from bs4 import BeautifulSoup as bs
import base64
r = requests.get('https://dubai.dubizzle.com/m/motors/used-cars/hyundai/accent/2018/6/8/hyundai-accent-excellent-condition-still-u-2/?back=L21vdG9ycy91c2VkLWNhcnMvP3BhZ2U9MzUmcHJpY2VfX2d0ZT0mcHJpY2VfX2x0ZT0meWVhcl9fZ3RlPSZ5ZWFyX19sdGU9JmtpbG9tZXRlcnNfX2d0ZT0ma2lsb21ldGVyc19fbHRlPSZzZWxsZXJfdHlwZT1PVyZrZXl3b3Jkcz0maXNfYmFzaWNfc2VhcmNoX3dpZGdldD0wJmlzX3NlYXJjaD0xJnBsYWNlc19faWRfX2luPSZwbGFjZXNfX2lkX19pbj01OSUyQzkwJTJDMTMzJTJDMTA2JTJDMTg4JTJDJmFkZGVkX19ndGU9JmF1dG9fYWdlbnQ9&shownumber')
soup = bs(r.content, 'lxml')
encoded = soup.select_one('[data-tel]')['data-tel']
tel = base64.b64decode(encoded).decode()  # decode the bytes into a string
print(tel)

1. Use a paid OCR service
The quickest way to solve this problem is to use an OCR service. The downside: they are not free.
e.g. set up a Google Cloud project and enable the Vision API (instructions here), then pass the image URL you acquired to the API and get the numbers back.
import requests
from bs4 import BeautifulSoup
from google.cloud import vision
url = "use_above_link"
client = vision.ImageAnnotatorClient()
def get_phone_number(link):
    resp = requests.get(link)
    soup = BeautifulSoup(resp.text, "lxml")
    phone_src_url = soup.select_one("img.phone-num-img")['src']
    print(phone_src_url)
    response = client.annotate_image({
        'image': {'source': {'image_uri': phone_src_url}},
        'features': [{'type': vision.enums.Feature.Type.TEXT_DETECTION}],
    })
    # the first text annotation contains the full detected text
    if response.text_annotations:
        print(response.text_annotations[0].description)

if __name__ == '__main__':
    get_phone_number(url)
2. Use OpenCV
This method will involve writing a lot of code yourself. The main assumption is that you are going to parse dubizzle links; if so, the font of those phone numbers is standard. You will have to cut out an image of each digit from 0 to 9 and then detect those shapes in each phone-number image. Detailed instructions here.
You find and cut out 10 images, one for each digit; this is your master set. Then match those templates against the phone-number image by following the tutorial I linked, and order the matches from left to right by position. A rough sketch of the idea follows.
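Here is a minimal template-matching sketch, assuming you have saved the phone-number image as phone.png and the ten digit templates as digit_0.png through digit_9.png (hypothetical filenames), and that the digits render at the same scale as the templates:

import cv2
import numpy as np

image = cv2.imread('phone.png', cv2.IMREAD_GRAYSCALE)

candidates = []
for digit in range(10):
    template = cv2.imread('digit_{}.png'.format(digit), cv2.IMREAD_GRAYSCALE)
    width = template.shape[1]
    # slide the template over the image; high scores mean a likely match
    result = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
    for x in np.where(result >= 0.9)[1]:
        candidates.append((x, digit, width))

# sort matches left to right and drop overlapping duplicates of the same hit
candidates.sort()
number, last_x = '', -10**6
for x, digit, width in candidates:
    if x - last_x > width // 2:
        number += str(digit)
        last_x = x
print(number)

The 0.9 threshold and the overlap rule are starting points you would tune against real images.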

Related

Crawl dynamic page crawl website elements

I am crawling a page with Python.
The discounted price on the page is shaded in red, and it exists as text inside a script tag, which you can see with the website's developer tools.
from bs4 import BeautifulSoup as bs4
import requests as req
import json
url = 'https://www.11st.co.kr/products/4976666261?NaPm=ct=ld6p5dso|ci=e5e093b328f0ae7bb7c9b67d5fd75928ea152434|tr=slsbrc|sn=17703|hk=87f5ed3e082f9a3cd79cdd0650afa9612c37d9e8&utm_term=&utm_campaign=%B3%D7%C0%CC%B9%F6pc_%B0%A1%B0%DD%BA%F1%B1%B3%B1%E2%BA%BB&utm_source=%B3%D7%C0%CC%B9%F6_PC_PCS&utm_medium=%B0%A1%B0%DD%BA%F1%B1%B3'
res = req.get(url)
soup = bs4(res.text,'html.parser')
# json_data1=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')[1].split('=')[1].replace(';',"")
# data=json.loads(json_data1)
# print(data)
json_data2=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')
print(json_data2)
However, when I print the script content in the terminal, the discounted price I saw in the browser is printed as the regular price. How can I get the discounted value?
Selenium takes a long time, so I want an approach based on requests or something similar.
Using regular expressions will do the trick.
from bs4 import BeautifulSoup as bs4
import re
import requests as req
import json
url = 'https://www.11st.co.kr/products/4976666261?NaPm=ct=ld6p5dso|ci=e5e093b328f0ae7bb7c9b67d5fd75928ea152434|tr=slsbrc|sn=17703|hk=87f5ed3e082f9a3cd79cdd0650afa9612c37d9e8&utm_term=&utm_campaign=%B3%D7%C0%CC%B9%F6pc_%B0%A1%B0%DD%BA%F1%B1%B3%B1%E2%BA%BB&utm_source=%B3%D7%C0%CC%B9%F6_PC_PCS&utm_medium=%B0%A1%B0%DD%BA%F1%B1%B3'
res = req.get(url)
soup = bs4(res.text,'html.parser')
# json_data1=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')[1].split('=')[1].replace(';',"")
# data=json.loads(json_data1)
# print(data)
json_data2=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')
for i in json_data2:
    results = re.findall(r'lastPrc : (\d+?),', i)
    if results:
        print(results)
OUTPUT
['1310000']
The value that you are looking for is no longer there.

How to webscrape old school website that uses frames

I am trying to webscrape a government site that uses frameset.
Here is the URL - https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm
I've tried using splinter/selenium
url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
browser.visit(url)
time.sleep(10)
full_xpath_frame = '/html/frameset/frameset/frame[2]'
tree = browser.find_by_xpath(full_xpath_frame)
for i in tree:
    print(i.text)
It just returns an empty string.
I've tried using the requests library.
import requests
from lxml import html
url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
# get response object
response = requests.get(url)
# get byte string
data = response.content
print(data)
And it returns this:
b"<html>\r\n<head>\r\n<meta http-equiv='Content-Type'\r\ncontent='text/html; charset=iso-8859-1'>\r\n<title>Lake_ County Election Results</title>\r\n</head>\r\n
<FRAMESET rows='20%, *'>\r\n<FRAME src='titlebar.htm' scrolling='no'>\r\n
<FRAMESET cols='20%, *'>\r\n<FRAME src='menu.htm'>\r\n<FRAME src='Lake_ElecSumm_all.htm' name='reports'>\r\n
</FRAMESET>\r\n</FRAMESET>\r\n<body>\r\n</body>\r\n</html>\r\n"
I've also tried using Beautiful Soup and it gave me the same thing. Is there another Python library I can use in order to get the data that's inside the second table?
Thank you for any feedback.
As mentioned, you could go for the frames and their src:
BeautifulSoup(r.text).select('frame')[1].get('src')
or directly to the menu.htm:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/menu.htm')
link_list = ['https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults'+a.get('href') for a in BeautifulSoup(r.text).select('a')]
for link in link_list[:1]:
    r = requests.get(link)
    soup = BeautifulSoup(r.text)
    ### ...scrape what is needed
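If what you ultimately want is the tables themselves, you could also hand a frame URL straight to pandas. This is a sketch that assumes the frame pages (e.g. Lake_ElecSumm_all.htm, taken from the frameset source shown above) contain plain HTML tables:

import pandas as pd

# read_html fetches the page and returns one DataFrame per <table> it finds
url = 'https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/Lake_ElecSumm_all.htm'
tables = pd.read_html(url)
print(len(tables))
print(tables[1])  # the second table, per the question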

Get XHR info from URL

I have this website https://www.futbin.com/22/player/7504 and I want to know if there is a way to get the XHR URL for the information using Python. For example, for the URL above I know the XHR I want is https://www.futbin.com/22/playerPrices?player=231443 (I got it from inspect element -> Network).
My objective is to get the price value from https://www.futbin.com/22/player/1 to https://www.futbin.com/22/player/10000 at once, without using inspect element one by one.
import requests
URL = 'https://www.futbin.com/22/playerPrices?player=231443'
page = requests.get(URL)
x = page.json()
data = x['231443']['prices']
print(data['pc']['LCPrice'])
print(data['ps']['LCPrice'])
print(data['xbox']['LCPrice'])
You can find the player-resource id on the player page and build the URL yourself. I use BeautifulSoup here; it's made for parsing websites, though you could feed the requests content into another HTML parser if you don't want to install it.
With it, read the first URL, get the id, and use your code to pull the prices. To test, change the 10000 to 2 or 3 and you'll see it works.
import re, requests
from bs4 import BeautifulSoup
for i in range(1, 10000):
    url = 'https://www.futbin.com/22/player/{}'.format(str(i))
    html = requests.get(url).content
    soup = BeautifulSoup(html, "html.parser")
    player_resource = soup.find(id=re.compile('page-info')).get('data-player-resource')
    # print(player_resource)
    URL = 'https://www.futbin.com/22/playerPrices?player={}'.format(player_resource)
    page = requests.get(URL)
    x = page.json()
    # print(x)
    data = x[player_resource]['prices']
    print(data['pc']['LCPrice'])
    print(data['ps']['LCPrice'])
    print(data['xbox']['LCPrice'])
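One practical note: looping over ten thousand pages is a lot of requests, so it may be worth adding a timeout, a small delay, and basic error handling so a single missing player page doesn't abort the run. A sketch of the same loop with those guards, assuming skipping failed ids is acceptable:

import re
import time

import requests
from bs4 import BeautifulSoup

for i in range(1, 10000):
    try:
        url = 'https://www.futbin.com/22/player/{}'.format(i)
        soup = BeautifulSoup(requests.get(url, timeout=10).content, 'html.parser')
        player_resource = soup.find(id=re.compile('page-info')).get('data-player-resource')
        prices_url = 'https://www.futbin.com/22/playerPrices?player={}'.format(player_resource)
        prices = requests.get(prices_url, timeout=10).json()[player_resource]['prices']
        print(i, prices['pc']['LCPrice'], prices['ps']['LCPrice'], prices['xbox']['LCPrice'])
    except (requests.RequestException, AttributeError, KeyError, ValueError) as exc:
        # a missing player page or a changed layout just skips that id
        print('skipping player {}: {}'.format(i, exc))
    time.sleep(0.5)  # small courtesy delay; adjust to taste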

Beautifulsoup "findAll()" does not return the tags

I am trying to build a scraper to get some abstracts of academic papers and their corresponding titles on this page.
The problem is that my for link in bsObj.findAll('a',{'class':'search-track'}) does not return the links I need to further build my scraper. In my code, the check is like this:
for link in bsObj.findAll('a', {'class': 'search-track'}):
    print(link)
The for loop above does not print out anything, although the href links should be inside the <a class="search-track" ...</a> elements.
I have referred to this post, but changing the BeautifulSoup parser does not solve my problem. I am using "html.parser" in my BeautifulSoup constructor: bsObj = bs(html.content, features="html.parser").
Also, print(len(bsObj)) prints out "3", while it prints out "2" for both "lxml" and "html5lib".
Also, I started off using urllib.request.urlopen to get the page and then tried requests.get() instead. Unfortunately the two approaches give me the same bsObj.
Here is the code I've written:
#from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup as bs
import ssl

'''
The elsevier search is kind of a tree structure:
"keyword --> a list of journals (a journal contains many articles) --> lists of articles"
'''
address = input("Please type in your keyword: ")  # My keyword is catalyst for water splitting
# https://www.elsevier.com/en-xs/search-results?
# query=catalyst%20for%20water%20splitting&labels=journals&page=1
address = address.replace(" ", "%20")
address = "https://www.elsevier.com/en-xs/search-results?query=" + address + "&labels=journals&page=1"

journals = []
articles = []

def getJournals(url):
    global journals
    #html = urlopen(url)
    html = requests.get(url)
    bsObj = bs(html.content, features="html.parser")
    #print(len(bsObj))
    #testFile = open('testFile.txt', 'wb')
    #testFile.write(bsObj.text.encode(encoding='utf-8', errors='strict') + '\n'.encode(encoding='utf-8', errors='strict'))
    #testFile.close()
    for link in bsObj.findAll('a', {'class': 'search-track'}):
        print(link)
        ######## does not print anything ########
    '''
    if 'href' in link.attrs and link.attrs['href'] not in journals:
        newJournal = link.attrs['href']
        journals.append(newJournal)
    '''
    return None

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

getJournals(address)
print(journals)
Can anyone tell me what the problem in my code is that makes the for loop print no links? I need to store the journal links in a list and then visit each link to scrape the abstracts of the papers. The abstracts are free to view, so the website shouldn't have blocked my ID for accessing them.
This page is dynamically loaded with JavaScript, so BeautifulSoup can't handle it directly. You may be able to do it using Selenium, but in this case you can do it by tracking the API calls made by the page (for more, see, as one of many examples, here).
In your particular case it can be done this way:
from bs4 import BeautifulSoup as bs
import requests
import json
#this is where the data is hiding:
url = "https://site-search-api.prod.ecommerce.elsevier.com/search?query=catalyst%20for%20water%20splitting&labels=journals&start=0&limit=10&lang=en-xs"
html = requests.get(url)
soup = bs(html.content, features="html.parser")
data = json.loads(str(soup))  # response is in json format so we load it into a dictionary
Note: in this case, it's also possible to dispense with BeautifulSoup altogether and load the response directly, as in data = json.loads(html.content) (or simply data = html.json(), since requests can parse JSON responses itself). From this point:
hits = data['hits']['hits']  # target urls are hidden deep inside nested dictionaries and lists
for hit in hits:
    print(hit['_source']['url'])
Output:
https://www.journals.elsevier.com/water-research
https://www.journals.elsevier.com/water-research-x
etc.
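Since the API URL carries start and limit parameters, you can presumably page through the full result set by incrementing start until no hits come back. A sketch under that assumption:

import requests

base = ('https://site-search-api.prod.ecommerce.elsevier.com/search'
        '?query=catalyst%20for%20water%20splitting&labels=journals'
        '&start={start}&limit=10&lang=en-xs')

start = 0
while True:
    data = requests.get(base.format(start=start)).json()
    hits = data['hits']['hits']
    if not hits:
        break  # no more results
    for hit in hits:
        print(hit['_source']['url'])
    start += 10  # advance by the page size (limit)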

Problem with data extraction from Indeed by BeautifulSoup

I'm trying to extract the job description for each post on the Indeed website, but the result is not what I expected!
I've written code to get the job descriptions. I'm working with Python 2.7 and the latest BeautifulSoup. When you open the page and click on each job title, you see the related information on the right side of the screen. I need to extract those job descriptions for each job on this page. My code:
import sys
import urllib2
from BeautifulSoup import BeautifulSoup
url = "https://www.indeed.com/jobs?q=construction%20manager&l=Houston%2C%20TX&vjk=8000b2656aae5c08"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
N = soup.findAll("div", {"id" : "vjs-desc"})
print N
I expected to see the results, but instead I got [] as the result. Is it because the id is non-unique? If so, how should I edit the code?
The #vjs-desc element is generated by JavaScript and its content comes from an Ajax request. To get the descriptions you need to make that request yourself.
# -*- coding: utf-8 -*-
# it is easier to create http requests/sessions using this
import requests
import re, urllib
from bs4 import BeautifulSoup  # bs4 (works on 2.7) accepts the 'html.parser' argument used below

url = "https://www......"

# create session
s = requests.session()
html = s.get(url).text

# extract job IDs
job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' + urllib.quote(job_ids)

# do the Ajax request and convert the response to json
ajax_content = s.get(ajax_url).json()
print(ajax_content)

for id, desc in ajax_content.items():
    print id
    soup = BeautifulSoup(desc, 'html.parser')
    # or try this
    # soup = BeautifulSoup(desc.decode('unicode-escape'), 'html.parser')
    print soup.text.encode('utf-8')
    print('==============================')
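For reference, if you move to Python 3, the same approach looks roughly like this (a direct port; urllib.quote becomes urllib.parse.quote and print becomes a function):

# -*- coding: utf-8 -*-
import re
import urllib.parse

import requests
from bs4 import BeautifulSoup

url = "https://www......"  # the search-results URL from the question

with requests.Session() as s:
    html = s.get(url).text
    # extract the job IDs embedded in the page's javascript
    job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
    ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' + urllib.parse.quote(job_ids)
    ajax_content = s.get(ajax_url).json()

for job_id, desc in ajax_content.items():
    print(job_id)
    soup = BeautifulSoup(desc, 'html.parser')
    print(soup.text)
    print('==============================')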
