Crawl dynamic page crawl website elements

Crawl dynamic page crawl website elements - python

I am crawling through Python.
The discount price on the page above is shaded in red, and it exists in the form of text in the script tag when you search for the website developer tool.
from bs4 import BeautifulSoup as bs4
import requests as req
import json
url = 'https://www.11st.co.kr/products/4976666261?NaPm=ct=ld6p5dso|ci=e5e093b328f0ae7bb7c9b67d5fd75928ea152434|tr=slsbrc|sn=17703|hk=87f5ed3e082f9a3cd79cdd0650afa9612c37d9e8&utm_term=&utm_campaign=%B3%D7%C0%CC%B9%F6pc_%B0%A1%B0%DD%BA%F1%B1%B3%B1%E2%BA%BB&utm_source=%B3%D7%C0%CC%B9%F6_PC_PCS&utm_medium=%B0%A1%B0%DD%BA%F1%B1%B3'
res = req.get(url)
soup = bs4(res.text,'html.parser')
# json_data1=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')[1].split('=')[1].replace(';',"")
# data=json.loads(json_data1)
# print(data)
json_data2=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')
print(json_data2)
enter image description here
However, if you print the code on the terminal through the code, you can see that the discount price you saw on the web browser page is printed as the normal price as shown below. How can I get that value?
The selenium module takes a long time to implement, so I want to access requests or other directions.

Using regular expressions will do the trick.
from bs4 import BeautifulSoup as bs4
import re
import requests as req
import json
url = 'https://www.11st.co.kr/products/4976666261?NaPm=ct=ld6p5dso|ci=e5e093b328f0ae7bb7c9b67d5fd75928ea152434|tr=slsbrc|sn=17703|hk=87f5ed3e082f9a3cd79cdd0650afa9612c37d9e8&utm_term=&utm_campaign=%B3%D7%C0%CC%B9%F6pc_%B0%A1%B0%DD%BA%F1%B1%B3%B1%E2%BA%BB&utm_source=%B3%D7%C0%CC%B9%F6_PC_PCS&utm_medium=%B0%A1%B0%DD%BA%F1%B1%B3'
res = req.get(url)
soup = bs4(res.text,'html.parser')
# json_data1=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')[1].split('=')[1].replace(';',"")
# data=json.loads(json_data1)
# print(data)
json_data2=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')
for i in json_data2:
results = re.findall(r'lastPrc : (\d+?),',i)
if results:
print(results)
OUTPUT
['1310000']
The value that you are looking for is no longer there.

Related

Trying to scrape Aliexpress

So I am trying to scrape the price of a product on Aliexpress. I tried inspecting the element which looks like
<span class="product-price-value" itemprop="price" data-spm-anchor-id="a2g0o.detail.1000016.i3.fe3c2b54yAsLRn">US $14.43</span>
I'm trying to run the following code
'''
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
url = 'https://www.aliexpress.com/item/32981494236.html?spm=a2g0o.productlist.0.0.44ba26f6M32wxY&algo_pvid=520e41c9-ba26-4aa6-b382-4aa63d014b4b&algo_expid=520e41c9-ba26-4aa6-b382-4aa63d014b4b-22&btsid=0bb0623b16170222520893504e9ae8&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_'
source = urlopen(url).read()
soup = BeautifulSoup(source, 'lxml')
soup.find('span', class_='product-price-value')
'''
but I keep getting a blank output. I must be doing something wrong but these methods seem to work in the tutorials I've seen.

So, what i got. As i understood right, the page what you gave, was recived by scripts, but in origin, it doesn't contain it, just script tags, so i just used split to get it. Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://aliexpress.ru/item/1005002281350811.html?spm=a2g0o.productlist.0.0.42d53b59T5ddTM&algo_pvid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5&algo_expid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5-1&btsid=0b8b035c16170960366785062e33c0&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_&sku_id=12000019900010138'
data = requests.get(url)
soup = BeautifulSoup(data.content, features="lxml")
res = soup.findAll("script")
total_value = str(res[-3]).split("totalValue:")[1].split("}")[0].replace("\"", "").replace(".", "").strip()
print(total_value)
It works fine, i tried on few pages from Ali.

unable to Webscrape dropdown item [Python][beautifulsoup]

i am new to webscraping, i am scraping a website - https://www.valueresearchonline.com/funds/22/uti-mastershare-fund-regular-plan/
In this,i want to scrape this text - Regular Plan
But the thing is, when i do it using inspect element,
code -
import requests
from bs4 import BeautifulSoup
import csv
import sys
url = 'https://www.valueresearchonline.com/funds/newsnapshot.asp?schemecode=22'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
regular_direct = soup.find('span',class_="filter-option pull-left").text
print(regular_direct)
i get none in printing, and i don't know why, the code in inspect element and view page source is also different, because in view page source, this span and class is not there.
why i am getting none?? can anyone please tell me and how can i get that text and why inspect element code and view page source code are different?

You need to change the selector because the html source that gets downloaded is different.
import requests
from bs4 import BeautifulSoup
import csv
import sys
url = 'https://www.valueresearchonline.com/funds/newsnapshot.asp?schemecode=22'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
regular_direct = soup.find("select", {"id":"select-plan"}).find("option",{"selected":"selected"}).get_text(strip=True)
print(regular_direct)
Output:
Regular plan

Problem with data extraction from Indeed by BeautifulSoup

I'm trying to extract job descriptions for each post from Indeed website but, the result is not what I expected!
I've written a code to get job descriptions. I'm working with python 2.7 and the latest beautifulsoup. When you open the page and click on each job title, you will see the related information on the right side of the screen. I need to extract those job descriptions for each job on this page. My Code:
import sys
import urllib2
from BeautifulSoup import BeautifulSoup
url = "https://www.indeed.com/jobs?q=construction%20manager&l=Houston%2C%20TX&vjk=8000b2656aae5c08"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
N = soup.findAll("div", {"id" : "vjs-desc"})
print N
I expected to see the results but instead, I got [] as the result. Is it because the Id is non-unique. If so, how should I edit the code?

the #vjs-desc element is generated by javascript and the content are from Ajax request. To get the description you need to do that request.
# -*- coding: utf-8 -*-
# it easier to create http request/session using this
import requests
import re, urllib
from BeautifulSoup import BeautifulSoup
url = "https://www......"
# create session
s = requests.session()
html = s.get(url).text
# exctract job IDs
job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' + urllib.quote(job_ids)
# do Ajax request and convert the response to json
ajax_content = s.get(ajax_url).json()
print(ajax_content)
for id, desc in ajax_content.items():
print id
soup = BeautifulSoup(desc, 'html.parser')
# or try this
# soup = BeautifulSoup(desc.decode('unicode-escape'), 'html.parser')
print soup.text.encode('utf-8')
print('==============================')

Extract element from HTML with Python's BeautifulSoup library

I'm looking to extract data from Instagram and record the time of the post without using auth.
The below code gives me the HTML of the pages from the IG post, but I'm not able to extract the time element from the HTML.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import json
url_path = 'https://www.instagram.com/<username>'
session = HTMLSession()
r = session.get(url_path)
soup = BeautifulSoup(r.content,features='lxml')
print(soup)
I would like to extract data from the time element near the bottom of this screenshot

to extract time you can use html tag and its class :
time = soup.findAll("time", {"class": "_1o9PC Nzb55"}).text

I'm guessing that the picture you've shared is a browser inspector screenshot. Although inspecting the code is a good basic guideline on web scraping you should check what BeautifullSoup is getting. If you check the print of soup you will see that the data you are looking for its a json inside of a script tag. So your code and any other solution that targets the time tag aren't working on BS4. You might try with selenium maybe.
Anyway here goes the BeautifullSoup pseudo-solution using the instagram from your screenshot:
from bs4 import BeautifulSoup
import json
import re
import requests
import time
url_path = "https://www.instagram.com/srirachi9/"
response = requests.get(url_path)
soup = BeautifulSoup(response.content)
pattern = re.compile(r"window\._sharedData\ = (.*);", re.MULTILINE)
script = soup.find("script", text=lambda x: x and "window._sharedData" in x).text
data = json.loads(re.search(pattern, script).group(1))
times = len(data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'])
for x in range(times):
time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'][x]['node']['taken_at_timestamp']))
The times variable its the amount of timestamps the json contains. It may look like hell but its just a matter of patiently following the json structure and indexing accordingly.

Scraping using Inspect element

I am trying to get some information from Instagram by scraping it. I have tried this code on twitter and it was working fine but it shows no result on Instagram both of the code are available here.
Twitter code:
from bs4 import BeautifulSoup
from urllib2 import urlopen
theurl = "https://twitter.com/realmadrid"
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
print(soup.find('div',{"class":"ProfileHeaderCard"}))
Result: Perfectly given.
Instagram Code:
from bs4 import BeautifulSoup
from urllib2 import urlopen
theurl = "https://www.instagram.com/barackobama/"
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
print(soup.find('div',{"class":"_bugdy"}))
Result: None

If you look at the source, you will see the content is dynamically loaded so there is no div._bugdy in what is returned by your request, depending on what it is you want you may be able to pull it from the script json:
import requests
import re
import json
r = requests.get("https://www.instagram.com/barackobama/")
soup = BeautifulSoup(r.content)
js = soup.find("script",text=re.compile("window._sharedData")).text
_json = json.loads((js[js.find("{"):js.rfind("}")+1]))
from pprint import pprint as pp
pp(_json)
That gives you everything you see in the <script type="text/javascript">window._sharedData = ..... in the source returned.
If you want to ge the followers then you will need to use something like selenium, the site is pretty much all dynamically loaded content, to get the followers you need to click the link which is only visible if you are logged in, this will get you closer to what you want:
from selenium import webdriver
import time
login = "https://www.instagram.com"
dr = webdriver.Chrome()
dr.get(login)
dr.find_element_by_xpath("//a[#class='_k6cv7']").click()
dr.find_element_by_xpath("//input[#name='username']").send_keys(youruname")
dr.find_element_by_xpath("//input[#name='password']").send_keys("yourpass")
dr.find_element_by_css_selector("button._aj7mu._taytv._ki5uo._o0442").click()
time.sleep(5)
dr.get("https://www.instagram.com/barackobama")
dr.find_element_by_css_selector('a[href="/barackobama/followers/"]').click()
time.sleep(3)
for li in dr.find_element_by_css_selector("div._n3cp9._qjr85").find_elements_by_xpath("//ul/li"):
print(li.text)
That pulls some text from the li tags that appear in the popup after you click the link, you can pull whatever you want from the unordered list:

First of all there seems to be a typo in the address on line 3.
from bs4 import BeautifulSoup
from urllib2 import urlopen
theurl = "https://www.instagram.com/barackobama/"
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
print(soup.find('div',{"class":"_bugdy"}))
Secondly, since you are working with dynamically loaded content, Python might not be able to see all the content you see when browsing the page in your browser.
In order to solve that there are different webdrivers, such as Selenium webdriver (http://www.seleniumhq.org/projects/webdriver/) and PhantomJS (http://phantomjs.org/) which emulate the browser and can wait for Javascript to generate/display data before looking it up.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Crawl dynamic page crawl website elements - python

Related

Trying to scrape Aliexpress

unable to Webscrape dropdown item [Python][beautifulsoup]

Problem with data extraction from Indeed by BeautifulSoup

Extract element from HTML with Python's BeautifulSoup library

Scraping using Inspect element

Categories

Resources