How to crawl several review pages using Python?

I have a question about web crawlers.
I want to get several review pages using Python.
Here is my code for the crawler:
import requests

URL = 'https://www.example.co.kr/users/sign_in'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'
headers = {'Content-type': 'application/json', 'Accept': 'text/plain', 'User-Agent':user_agent}
login_data = {'user':{'email':'id', 'password':'password', 'remember_me':'true'}}
client = requests.session()
login_response = client.post(URL, json = login_data, headers = headers)
print(login_response.content.decode('utf-8'))
jre = 'https://www.example.co.kr/companies/reviews/ent?page=1'
index = client.get(jre)
html = index.content.decode('utf-8')
print(html)
This code only gets page=1, but I want to get page=1, page=2, page=3, ... using the format method. How can I achieve that?

You should use a while or a for loop over the pages, depending on your needs.
Try a pattern like this:
page = 1
while page <= MAX_PAGE or not REACHED_STOPPING_CONDITION:
    # Compose page url
    jre = f'https://www.example.co.kr/companies/reviews/ent?page={page}'
    # Get page url
    index = client.get(jre)
    # Do stuff...
    # Increment page counter
    page += 1
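If you prefer the str.format method mentioned in the question, the URL line inside the loop can be written equivalently as:
# Equivalent to the f-string above, using str.format
jre = 'https://www.example.co.kr/companies/reviews/ent?page={}'.format(page)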
I think that once you have logged in to the website there is no need to perform the login again. If it is needed, move the login part inside the loop.
Another way to navigate the website's pages is to find a sort of "Next page" or "Previous page" link in the document and follow it. Since plain requests cannot click elements, the sketch below locates the link with BeautifulSoup; adjust the selector to the site's actual markup:
from bs4 import BeautifulSoup

page = 1
# Compose first page url
jre = 'https://www.example.co.kr/companies/reviews/ent?page=1'
while page <= MAX_PAGE or not REACHED_STOPPING_CONDITION:
    # Get page
    index = client.get(jre)
    # Do stuff...
    # Search the next page link (ex. by CSS selector) and follow its href
    soup = BeautifulSoup(index.content, 'html.parser')
    next_link = soup.select_one('a.next-page')  # placeholder selector, adapt to the site
    if next_link is None:
        break
    jre = next_link['href']  # may need to be joined with the base url if relative
    # Increment page counter
    page += 1

Related

Site parsing myip.ms

I am writing a parser for the site https://myip.ms/. With this page https://myip.ms/browse/sites/1/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714 everything works fine, but if you go to another page, https://myip.ms/browse/sites/2/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714, it does not output any data, although the site structure is the same. I think this may be because the site has a limit on views, or because you need to register, but I can't find what request needs to be sent to log in to an account. What should I do?
import requests
from bs4 import BeautifulSoup
import time

link_list = []
URL = 'https://myip.ms/browse/sites/2/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714'
HEADERS = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 YaBrowser/20.12.2.105 Yowser/2.5 Safari/537.36','accept':'*/*'}
#HOST =

def get_html(url, params=None):
    r = requests.get(url, headers=HEADERS, params=params)
    return r

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('td', class_='row_name')
    for item in items:
        links = item.find('a').get('href')
        link_list.append({
            'link': links
        })

def parser():
    print(URL)
    html = get_html(URL)
    if html.status_code == 200:
        get_content(html.text)
    else:
        print('Error')

parser()
print(link_list)
Use a session ID with your request. It will allow you at least 50 requests per day.
If you use a proxy that supports cookies, this number might be even higher.
So the process is as follows (see the sketch after this list):
load the page with your browser.
find the session ID in the request inside your Dev Tools.
use this session ID in your request; no headers or additional info is required.
enjoy the results for 50 requests per day.
repeat in 24 hours.
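A minimal sketch of the third step, assuming you copied the session cookie from the DevTools Network tab; the cookie name used by myip.ms may differ, so treat 'sid' below as a placeholder:
import requests

# Placeholder cookie copied from the browser's DevTools (Network tab);
# check the actual session cookie name and value the site uses.
cookies = {'sid': 'PASTE_SESSION_ID_FROM_DEVTOOLS'}

URL = 'https://myip.ms/browse/sites/2/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714'
r = requests.get(URL, cookies=cookies)
print(r.status_code)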

Use Post to change page

I've been using Selenium for some time to scrape a website, but for some reason it doesn't work anymore. I was using Selenium because you need to interact with the site to flip through pages (i.e. click on a next button).
As a solution, I was thinking of using the Post method from Requests. I'm not sure if it's doable since I've never used the Post method and I'm not familiar with what it does (though I kind of understand the general idea).
My code would look something like this:
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent":
           "Mozilla/5.0 (Macintosh; Intel Mac OS X 10 11 5) "
           "AppleWebKit/537.36 (KHTML, like Gecko) "
           "Chrome/50.0.2661.102 Safari/537.36"}
url = "https://www.centris.ca/fr/propriete~a-vendre?view=Thumbnail"

def infinity():
    while True:
        yield

c = 0
urls = []
for i in infinity():
    c += 1
    page = list(str(soup.find("li",{"class":"pager-current"}).text).split())
    pageTot = int("".join(page[-2:]))  # Check the total number of pages
    if c <= pageTot:  # Scrape the current page
        if c <= 1:
            req = requests.get(url, headers=headers)
        else:
            pass
            # This is where I'm stuck, but ideally I'd be using the Post method in some way
        soup = BeautifulSoup(req.content,"lxml")
        for link in soup.find_all("a",{"class":"a-more-detail"}):
            try:  # For each page, scrape the ad urls
                urls.append("https://www.centris.ca" + link["href"])
            except KeyError:
                pass
    else:  # When all pages are scraped, exit the loop
        break

for url in list(dict.fromkeys(urls)):
    pass  # do stuff
This is what is going on when you click next on the webpage:
This is the Request (the startPosition begins at 0 on page 1 and increases in steps of 12)
And this is part of the Response:
{"d":{"Message":"","Result":{"html": [...], "count":34302,"inscNumberPerPage":12,"title":""},"Succeeded":true}}
With that information, is it possible to use the Post method to scrape every page? And how could I do that?
The following should do the trick. I've added duplicate filtering logic to avoid printing duplicate links. The script should break once there are no more results left to scrape.
import requests
from bs4 import BeautifulSoup

base = 'https://www.centris.ca{}'
post_link = 'https://www.centris.ca/Property/GetInscriptions'
url = 'https://www.centris.ca/fr/propriete~a-vendre?view=Thumbnail'

unique_links = set()
payload = {"startPosition":0}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    s.headers['content-type'] = 'application/json; charset=UTF-8'
    s.get(url)  # Send this request first to get the cookies
    while True:
        r = s.post(post_link, json=payload)
        if not len(r.json()['d']['Result']['html']): break
        soup = BeautifulSoup(r.json()['d']['Result']['html'], "html.parser")
        for item in soup.select(".thumbnailItem a.a-more-detail"):
            unique_link = base.format(item.get("href"))
            if unique_link not in unique_links:
                print(unique_link)
                unique_links.add(unique_link)
        payload['startPosition'] += 12

Can't extract a link connected to `see all` button from a webpage

I've created a script to log in to LinkedIn using requests. The script is doing fine.
After logging in, I used this url https://www.linkedin.com/groups/137920/ to scrape the name Marketing Intelligence Professionals from there, which you can see in this image.
The script can parse the name flawlessly. However, what I wish to do now is scrape the link connected to the See all button located at the bottom of that very page, shown in this image.
Group link (you have to log in to access the content)
This is what I've created so far (it can scrape the name shown in the first image):
import json
import requests
from bs4 import BeautifulSoup

link = 'https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin'
post_url = 'https://www.linkedin.com/checkpoint/lg/login-submit'
target_url = 'https://www.linkedin.com/groups/137920/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['session_key'] = 'your email'  # put your username here
    payload['session_password'] = 'your password'  # put your password here
    r = s.post(post_url, data=payload)
    r = s.get(target_url)
    soup = BeautifulSoup(r.text,"lxml")
    items = soup.select_one("code:contains('viewerGroupMembership')").get_text(strip=True)
    print(json.loads(items)['data']['name']['text'])
How can I scrape the link connected to the See all button from there?
There is an internal REST API which is called when you click on "See All":
GET https://www.linkedin.com/voyager/api/search/blended
The keywords query parameter contains the title of the group you requested initially (the group title in the initial page).
In order to get the group name, you could scrape the HTML of the initial page, but there is an API which returns the group information when you give it the group ID:
GET https://www.linkedin.com/voyager/api/groups/groups/urn:li:group:GROUP_ID
The group id in your case is 137920, which can be extracted from the URL directly.
An example:
import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urlencode

username = 'your username'
password = 'your password'

link = 'https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin'
post_url = 'https://www.linkedin.com/checkpoint/lg/login-submit'
target_url = 'https://www.linkedin.com/groups/137920/'

group_res = re.search('.*/(.*)/$', target_url)
group_id = group_res.group(1)

with requests.Session() as s:
    # login
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['session_key'] = username
    payload['session_password'] = password
    r = s.post(post_url, data=payload)

    # API
    csrf_token = s.cookies.get_dict()["JSESSIONID"].replace("\"","")
    r = s.get(f"https://www.linkedin.com/voyager/api/groups/groups/urn:li:group:{group_id}",
        headers={
            "csrf-token": csrf_token
        })
    group_name = r.json()["name"]["text"]
    print(f"searching data for group {group_name}")

    params = {
        "count": 10,
        "keywords": group_name,
        "origin": "SWITCH_SEARCH_VERTICAL",
        "q": "all",
        "start": 0
    }
    r = s.get(f"https://www.linkedin.com/voyager/api/search/blended?{urlencode(params)}&filters=List(resultType-%3EGROUPS)&queryContext=List(spellCorrectionEnabled-%3Etrue)",
        headers={
            "csrf-token": csrf_token,
            "Accept": "application/vnd.linkedin.normalized+json+2.1",
            "x-restli-protocol-version": "2.0.0"
        })
    result = r.json()["included"]
    print(result)

    print("list of groupName/link")
    print([
        (t["groupName"], f'https://www.linkedin.com/groups/{t["objectUrn"].split(":")[3]}')
        for t in result
    ])
A few notes:
these API calls require a session cookie
these API calls require a specific header carrying a CSRF token, whose value is the same as the JSESSIONID cookie value
a special media type application/vnd.linkedin.normalized+json+2.1 is necessary for the search call
the parentheses inside the queryContext and filters fields shouldn't be URL-encoded, otherwise these params will not be taken into account
You can try Selenium: click the See all button, then scrape the linked content:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.linkedin.com/xxxx')
# 's_image' is just a placeholder; locate the actual "See all" element on the page
driver.find_element_by_name('s_image').click()
selenium docs: https://selenium-python.readthedocs.io/

I get nothing when trying to scrape a table

So I want to extract the number 45.5 from here: https://www.myscore.com.ua/match/I9pSZU2I/#odds-comparison;over-under;1st-qrt
But when I try to find the table I get nothing. Here's my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.myscore.com.ua/match/I9pSZU2I/#odds-comparison;over-under;1st-qrt'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux armv7l) AppleWebKit/537.36 (KHTML, like Gecko) Raspbian Chromium/65.0.3325.181 Chrome/65.0.3325.181 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
text = soup.find_all('table', class_ = 'odds sortable')
print(text)
Can anybody help me extract the number and store its value in a variable?
You can try to do this without Selenium by recreating the dynamic request that loads the table.
Looking around in the Network tab of the page, I saw this XMLHttpRequest: https://d.myscore.com.ua/x/feed/d_od_I9pSZU2I_ru_1_eu
Try to reproduce the same parameters as the request.
To access the Network tab: right-click -> Inspect element -> Network tab -> select XHR and find the second request.
The final code would be like this:
headers = {'x-fsign': 'SW9D1eZo'}
page = requests.get('https://d.myscore.com.ua/x/feed/d_od_I9pSZU2I_ru_1_eu', headers=headers)
You should check whether the x-fsign value differs based on your browser/IP.

Posting form data with requests and python

Posting form data isn't working, and since my other post about this went nowhere, I figured I would ask the question again so maybe I can get another perspective. I am currently trying to get requests.get(url, data=q) to work. When I print, I am getting a page not found. I have resorted to just setting variables and joining them to the entire URL to make it work, but I really want to learn this aspect of requests. Where am I making the mistake? I am using the HTML form attributes name=search_terms and name=geo_location_terms.
search_terms = "Bars"
location = "New Orleans, LA"
url = "https://www.yellowpages.com"
q = {'search_terms': search_terms, 'geo_locations_terms': location}
page = requests.get(url, data=q)
print(page.url)
You have a few small mistakes in your code:
Check the form's action attribute. The URL should then be url = "https://www.yellowpages.com/search"
The second parameter is geo_location_terms, not geo_locations_terms.
You should pass query parameters to requests.get as params, not as request data (data).
So, the final version of the code:
import requests
search_terms = "Bars"
location = "New Orleans, LA"
url = "https://www.yellowpages.com/search"
q = {'search_terms': search_terms, 'geo_location_terms': location}
page = requests.get(url, params=q)
print(page.url)
Result:
https://www.yellowpages.com/search?search_terms=Bars&geo_location_terms=New+Orleans%2C+LA
Besides the issues pointed out by @Lev Zakharov, you need to set the cookies in your request, like this:
import requests

search_terms = "Bars"
location = "New Orleans, LA"
url = "https://www.yellowpages.com/search"

with requests.Session() as session:
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
        'Cookie': 'cookies'
    })
    q = {'search_terms': search_terms, 'geo_locations_terms': location}
    response = session.get(url, params=q)
    print(response.url)
    print(response.status_code)
Output
https://www.yellowpages.com/search?search_terms=Bars&geo_locations_terms=New+Orleans%2C+LA
200
To get the cookies, you can inspect the requests using a network listener, for instance the Chrome Developer Tools Network tab, then replace the placeholder value 'cookies'. A sketch of an equivalent cookie-jar approach follows.
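As an alternative to a raw Cookie header, here is a minimal sketch that sets the copied cookies on the session's cookie jar instead; the cookie name below is hypothetical, use whatever names and values your browser actually sent:
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'})
# Hypothetical cookie copied from the DevTools Network tab; real names/values vary per browser session
session.cookies.update({'example_cookie_name': 'value_copied_from_devtools'})
response = session.get('https://www.yellowpages.com/search',
                       params={'search_terms': 'Bars', 'geo_location_terms': 'New Orleans, LA'})
print(response.status_code)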
