Problem with data extraction from Indeed by BeautifulSoup

Problem with data extraction from Indeed by BeautifulSoup - python

I'm trying to extract job descriptions for each post from Indeed website but, the result is not what I expected!
I've written a code to get job descriptions. I'm working with python 2.7 and the latest beautifulsoup. When you open the page and click on each job title, you will see the related information on the right side of the screen. I need to extract those job descriptions for each job on this page. My Code:
import sys
import urllib2
from BeautifulSoup import BeautifulSoup
url = "https://www.indeed.com/jobs?q=construction%20manager&l=Houston%2C%20TX&vjk=8000b2656aae5c08"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
N = soup.findAll("div", {"id" : "vjs-desc"})
print N
I expected to see the results but instead, I got [] as the result. Is it because the Id is non-unique. If so, how should I edit the code?

the #vjs-desc element is generated by javascript and the content are from Ajax request. To get the description you need to do that request.
# -*- coding: utf-8 -*-
# it easier to create http request/session using this
import requests
import re, urllib
from BeautifulSoup import BeautifulSoup
url = "https://www......"
# create session
s = requests.session()
html = s.get(url).text
# exctract job IDs
job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' + urllib.quote(job_ids)
# do Ajax request and convert the response to json
ajax_content = s.get(ajax_url).json()
print(ajax_content)
for id, desc in ajax_content.items():
print id
soup = BeautifulSoup(desc, 'html.parser')
# or try this
# soup = BeautifulSoup(desc.decode('unicode-escape'), 'html.parser')
print soup.text.encode('utf-8')
print('==============================')

Related

Crawl dynamic page crawl website elements

I am crawling through Python.
The discount price on the page above is shaded in red, and it exists in the form of text in the script tag when you search for the website developer tool.
from bs4 import BeautifulSoup as bs4
import requests as req
import json
url = 'https://www.11st.co.kr/products/4976666261?NaPm=ct=ld6p5dso|ci=e5e093b328f0ae7bb7c9b67d5fd75928ea152434|tr=slsbrc|sn=17703|hk=87f5ed3e082f9a3cd79cdd0650afa9612c37d9e8&utm_term=&utm_campaign=%B3%D7%C0%CC%B9%F6pc_%B0%A1%B0%DD%BA%F1%B1%B3%B1%E2%BA%BB&utm_source=%B3%D7%C0%CC%B9%F6_PC_PCS&utm_medium=%B0%A1%B0%DD%BA%F1%B1%B3'
res = req.get(url)
soup = bs4(res.text,'html.parser')
# json_data1=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')[1].split('=')[1].replace(';',"")
# data=json.loads(json_data1)
# print(data)
json_data2=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')
print(json_data2)
enter image description here
However, if you print the code on the terminal through the code, you can see that the discount price you saw on the web browser page is printed as the normal price as shown below. How can I get that value?
The selenium module takes a long time to implement, so I want to access requests or other directions.

Using regular expressions will do the trick.
from bs4 import BeautifulSoup as bs4
import re
import requests as req
import json
url = 'https://www.11st.co.kr/products/4976666261?NaPm=ct=ld6p5dso|ci=e5e093b328f0ae7bb7c9b67d5fd75928ea152434|tr=slsbrc|sn=17703|hk=87f5ed3e082f9a3cd79cdd0650afa9612c37d9e8&utm_term=&utm_campaign=%B3%D7%C0%CC%B9%F6pc_%B0%A1%B0%DD%BA%F1%B1%B3%B1%E2%BA%BB&utm_source=%B3%D7%C0%CC%B9%F6_PC_PCS&utm_medium=%B0%A1%B0%DD%BA%F1%B1%B3'
res = req.get(url)
soup = bs4(res.text,'html.parser')
# json_data1=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')[1].split('=')[1].replace(';',"")
# data=json.loads(json_data1)
# print(data)
json_data2=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')
for i in json_data2:
results = re.findall(r'lastPrc : (\d+?),',i)
if results:
print(results)
OUTPUT
['1310000']
The value that you are looking for is no longer there.

How to webscrape old school website that uses frames

I am trying to webscrape a government site that uses frameset.
Here is the URL - https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm
I've tried using splinter/selenium
url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
browser.visit(url)
time.sleep(10)
full_xpath_frame = '/html/frameset/frameset/frame[2]'
tree = browser.find_by_xpath(full_xpath_frame)
for i in tree:
print(i.text)
It just returns an empty string.
I've tried using the requests library.
import requests
from lxml import HTML
url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
# get response object
response = requests.get(url)
# get byte string
data = response.content
print(data)
And it returns this
b"<html>\r\n<head>\r\n<meta http-equiv='Content-Type'\r\ncontent='text/html; charset=iso-
8859-1'>\r\n<title>Lake_ County Election Results</title>\r\n</head>\r\n<FRAMESET rows='20%,
*'>\r\n<FRAME src='titlebar.htm' scrolling='no'>\r\n<FRAMESET cols='20%, *'>\r\n<FRAME
src='menu.htm'>\r\n<FRAME src='Lake_ElecSumm_all.htm' name='reports'>\r\n</FRAMESET>
\r\n</FRAMESET>\r\n<body>\r\n</body>\r\n</html>\r\n"
I've also tried using beautiful soup and it gave me the same thing. Is there another python library I can use in order to get the data that's inside the second table?
Thank you for any feedback.

As mentioned you could go for the frames and its src:
BeautifulSoup(r.text).select('frame')[1].get('src')
or directly to the menu.htm:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/menu.htm')
link_list = ['https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults'+a.get('href') for a in BeautifulSoup(r.text).select('a')]
for link in link_list[:1]:
r = requests.get(link)
soup = BeautifulSoup(r.text)
###...scrape what is needed

Retrieving data from HTML having the child direction using python

I'm trying to get the email from the city from http://www.comuni-italiani.it/110/index.html
I have the speceific child direction using xPath Finder which is /html/body/span[3]/table[2]/tbody/tr[1]/td[2]/table/tbody/tr[11]/td/b/a. Now I'm trying to retrieve the email from this page but I know very little of BeatifulSoup library (I'm just getting started). After reading several guides I managed to write the following code, but I'm not succesfull with indicating the child route correctly
from bs4 import BeautifulSoup
import requests
# sample web page
sample_web_page = 'http://www.comuni-italiani.it/110/index.html'
# call get method to request that page
page = requests.get(sample_web_page)
# with the help of beautifulSoup and html parser create soup
soup = BeautifulSoup(page.content, "html.parser")
child_soup = soup.find('span')
for i in child_soup.children:
print("child : ", i)
What am I doing wrong??

Please find my attempt to solve your problem below. It starts the same way as in your code, just has a bit of magic to find the email and print it out.
from bs4 import BeautifulSoup
import requests
sample_web_page = 'http://www.comuni-italiani.it/110/index.html'
page = requests.get(sample_web_page)
soup = BeautifulSoup(page.content, "html.parser")
email = soup.select_one('b > a[href^="mail"]')['href']
print(email.split(':')[1])

unable to Webscrape dropdown item [Python][beautifulsoup]

i am new to webscraping, i am scraping a website - https://www.valueresearchonline.com/funds/22/uti-mastershare-fund-regular-plan/
In this,i want to scrape this text - Regular Plan
But the thing is, when i do it using inspect element,
code -
import requests
from bs4 import BeautifulSoup
import csv
import sys
url = 'https://www.valueresearchonline.com/funds/newsnapshot.asp?schemecode=22'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
regular_direct = soup.find('span',class_="filter-option pull-left").text
print(regular_direct)
i get none in printing, and i don't know why, the code in inspect element and view page source is also different, because in view page source, this span and class is not there.
why i am getting none?? can anyone please tell me and how can i get that text and why inspect element code and view page source code are different?

You need to change the selector because the html source that gets downloaded is different.
import requests
from bs4 import BeautifulSoup
import csv
import sys
url = 'https://www.valueresearchonline.com/funds/newsnapshot.asp?schemecode=22'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
regular_direct = soup.find("select", {"id":"select-plan"}).find("option",{"selected":"selected"}).get_text(strip=True)
print(regular_direct)
Output:
Regular plan

How to get text following a table/span with BeautifulSoup and Python?

I need to get the text 2,585 shown in the screenshot below. I very new to coding, but this is what i have so far:
import urllib2
from bs4 import BeautifulSoup
url= 'insertURL'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
span = soup.find('span', id='d21475972e793-wk-Fact -8D34B98C76EF518C788A2177E5B18DB0')
print (span.text)
Any info is helpful!! Thanks.
Website HTML

3 things, your using requests not urllib2. Your selecting XML with namespaces so you need to use xml as the parser. The element you want is not span it is ix:nonFraction. Here is a working example using another web-page (you just need to point it at your page and use the commented line).
# Using requests no need for urllib2.
import requests
from bs4 import BeautifulSoup
# Using this page as an example.
url= 'https://www.sec.gov/Archives/edgar/data/27904/000002790417000004/0000027904-17-000004.txt'
r = requests.get(url)
data = r.text
# use xml as the parser.
soup = BeautifulSoup(data, 'xml')
ix = soup.find('ix:nonFraction', id="Fact-7365D69E1478B0A952B8159A2E39B9D8-wk-Fact-7365D69E1478B0A952B8159A2E39B9D8")
# Your original code for your page.
# ix = soup.find('ix:nonFraction', id='d21475972e793-wk-Fact-8D34B98C76EF518C788A2177E5B18DB0')
print (ix.text)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Problem with data extraction from Indeed by BeautifulSoup - python

Related

Crawl dynamic page crawl website elements

How to webscrape old school website that uses frames

Retrieving data from HTML having the child direction using python

unable to Webscrape dropdown item [Python][beautifulsoup]

How to get text following a table/span with BeautifulSoup and Python?

Categories

Resources