Put data into a list from webpage (splinter) - python

I am writing a little bot that should take information from a website (eBay) and put it into a list, using splinter and Python. My first lines of code:
from splinter import Browser

with Browser() as browser:
    url = "http://www.ebay.com"
    browser.visit(url)
    browser.fill('_nkw', 'levis')
    button = browser.find_by_id('gh-btn')
    button.click()
How can I put the information from the search results page (item title and price) into a list, for example:
Like : [["Levi Strauss & Co. 513 Slim Straight Jean Ivory Men's SZ", 12.99, 0], ["Levi 501 Jeans Mens Original Levi's Strauss Denim Straight", 71.44, "Now"], ["Levis 501 Button Fly Jeans Shrink To Fit Many Sizes", [$29.99, $39.99]]]

This is not a perfect answer, but it should work.
First, install these two modules, requests and beautifulsoup4:
pip install requests
pip install beautifulsoup4
import requests
import json
from bs4 import BeautifulSoup

# set up the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Referer': 'https://www.ebay.com/',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.8',
    'Host': 'www.ebay.com',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
}

# set up my proxy; you can disable it
proxy = {
    'https': '127.0.0.1:8888'
}

# search term
search_term = 'armani'

# start a request session
ses = requests.session()

# first get the home page so the cookies are set
resp = ses.get('https://www.ebay.com/', headers=headers, proxies=proxy, verify=False)

# next get the search results page to parse
resp = ses.get('https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2374313.m570.l1313.TR12.TRC2.A0.H0.X'
               + search_term + '.TRS0&_nkw=' + search_term + '&_sacat=0',
               headers=headers, proxies=proxy, verify=False)

soup = BeautifulSoup(resp.text, 'html.parser')
items = soup.find_all('a', {"class": "vip"})
price_items = soup.find_all('span', {"class": "amt"})

final_list = list()
for item, price in zip(items, price_items):
    try:
        title = item.getText()
        price_val = price.find('span', {"class": "bold"}).getText()
        final_list.append((title, price_val))
    except Exception:
        pass

print(final_list)
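If you want numeric prices like in the list from the question, a small post-processing sketch could follow (an assumption here: the scraped price strings look like "$12.99" or "$29.99 to $39.99"; eBay's formatting may differ):

# Sketch: convert the (title, price_text) tuples into [title, price] lists.
# Assumes price strings look like "$12.99" or "$29.99 to $39.99"; adjust if eBay differs.
def to_number(text):
    return float(text.replace('$', '').replace(',', '').strip())

rows = []
for title, price_text in final_list:
    parts = [to_number(p) for p in price_text.split(' to ')]
    rows.append([title, parts[0] if len(parts) == 1 else parts])

print(rows)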

I agree with @Aki003. Something like this:
import requests
from bs4 import BeautifulSoup

def get_links(ebay_url):
    page = requests.get(ebay_url).text
    soup = BeautifulSoup(page, 'html.parser')
    links = []
    for item in soup.find_all('a'):
        links.append(item.get('href'))
    return links
You can scrape any other element on the webpage in the same way; check the BeautifulSoup documentation.
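For example, a sketch that collects link text together with the href, filtered to result links (the 's-item__link' class is an assumption about eBay's current markup and may need adjusting):

import requests
from bs4 import BeautifulSoup

def get_titled_links(ebay_url):
    # Sketch: the 's-item__link' class is assumed; inspect the page to confirm it.
    page = requests.get(ebay_url).text
    soup = BeautifulSoup(page, 'html.parser')
    return [(a.get_text(strip=True), a.get('href'))
            for a in soup.find_all('a', class_='s-item__link')]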

Related

Error 403 on public page w/ python get request

I'm a complete newbie to Python and I'm trying to get the content of a webpage with a GET request. The page I'm trying to access is public, without any authorization as far as I can see. It's a job listing from the career website of a popular company, and everyone can view the page.
My code looks like this:
import requests
from bs4 import BeautifulSoup

url = 'https://www.tuvsud.com/de-de/karriere/stellen/jobs/projektmanagerin-auditservice-food-corporate-functions-business-support-all-regions-133776'
headers = {
    'Host': '',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
print(r.status_code)
However, I get status code 403. With a Google URL, for example, it works.
I would be happy about any help! Thanks in advance.
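One detail worth checking in the snippet above (not a guaranteed fix): the empty 'Host': '' entry overrides the Host header requests would otherwise derive from the URL, and some servers reject such requests. A minimal variant without it:

import requests

# Same request, but without the empty Host header; requests will set Host from the URL.
url = 'https://www.tuvsud.com/de-de/karriere/stellen/jobs/projektmanagerin-auditservice-food-corporate-functions-business-support-all-regions-133776'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}
r = requests.get(url, headers=headers)
print(r.status_code)  # if this is still 403, the site is probably filtering bot-like clients

If the 403 persists, the site may require a rendered browser session (see the Selenium-based answers further down).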

Unable to fetch tabular content from a site using requests

I'm trying to fetch tabular content from a webpage using the requests module. After navigating to that webpage, when I manually type 0466425389 next to "Company number" and hit the search button, the table is produced accordingly. However, when I mimic the same with requests, I get the following response:
<?xml version='1.0' encoding='UTF-8'?>
<partial-response><redirect url="/bc9/web/catalog"></redirect></partial-response>
I've tried with:
import requests
link = 'https://cri.nbb.be/bc9/web/catalog?execution=e1s1'
payload = {
    'javax.faces.partial.ajax': 'true',
    'javax.faces.source': 'page_searchForm:actions:0:button',
    'javax.faces.partial.execute': 'page_searchForm',
    'javax.faces.partial.render': 'page_searchForm page_listForm pageMessagesId',
    'page_searchForm:actions:0:button': 'page_searchForm:actions:0:button',
    'page_searchForm': 'page_searchForm',
    'page_searchForm:j_id3:generated_number_2_component': '0466425389',
    'page_searchForm:j_id3:generated_name_4_component': '',
    'page_searchForm:j_id3:generated_address_zipCode_6_component': '',
    'page_searchForm:j_id3_activeIndex': '0',
    'page_searchForm:j_id2_stateholder': 'panel_param_visible;',
    'page_searchForm:j_idt133_stateholder': 'panel_param_visible;',
    'javax.faces.ViewState': 'e1s1'
}
headers = {
    'Faces-Request': 'partial/ajax',
    'X-Requested-With': 'XMLHttpRequest',
    'Origin': 'https://cri.nbb.be',
    'Accept': 'application/xml, text/xml, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': 'cri.nbb.be',
    'Referer': 'https://cri.nbb.be/bc9/web/catalog?execution=e1s1'
}
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    s.get(link)
    s.headers.update(headers)
    res = s.post(link, data=payload)
    print(res.text)
How can I fetch tabular content from that site using requests?
Judging by the "action" attribute on the search form, the form appears to generate a new JSESSIONID every time it is opened, and this seems to be required. I had some success by including it in the URL.
You don't need to explicitly set any headers other than the User-Agent.
I added some code: (a) to pull out the "action" attribute of the form using BeautifulSoup (you could do this with a regex if you prefer), and (b) to get the URL from the redirection XML that you showed at the top of your question.
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

...

with requests.Session() as s:
    s.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"

    # GET to fetch the search form
    req1 = s.get(link)

    # Get the form action
    soup = BeautifulSoup(req1.text, "lxml")
    form = soup.select_one("#page_searchForm")
    form_action = urljoin(link, form["action"])

    # POST the form
    req2 = s.post(form_action, data=payload)

    # Extract the target from the redirection XML response
    target = re.search('url="(.*?)"', req2.text).group(1)

    # Final GET to fetch the search result
    req3 = s.get(urljoin(link, target))

    # Parse and print (some of) the result
    soup = BeautifulSoup(req3.text, "lxml").body
    for detail in soup.select(".company-details tr"):
        columns = detail.select("td")
        if columns:
            print(f"{columns[0].text.strip()}: {columns[1].text.strip()}")
Result:
Company number: 0466.425.389
Name: A en B PARTNERS
Address: Quai de Willebroeck 37
: BE 1000 Bruxelles
Municipality code NIS: 21004 Bruxelles
Legal form: Cooperative company with limited liability
Legal situation: Normal situation
Activity code (NACE-BEL)
The activity code of the company is the statistical activity code in use on the date of consultation, given by the CBSO based on the main activity codes available at the Crossroads Bank for Enterprises and supplementary informations collected from the companies: 69201 - Accountants and fiscal advisors
I don't think requests can handle dynamic web pages. I used helium and pandas to do the work.
import helium as he
import pandas as pd

url = 'https://cri.nbb.be/bc9/web/catalog?execution=e1s1'

driver = he.start_chrome(url)
he.write('0466425389', into='Company number')
he.click('Search')
he.wait_until(he.Button('New search').exists)
he.select(he.ComboBox('10'), '100')
he.wait_until(he.Button('New search').exists)

with open('download.html', 'w') as html:
    html.write(driver.page_source)

he.kill_browser()

df = pd.read_html('download.html')
df[2]
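A note on the last two lines: pd.read_html returns a list of DataFrames, and df[2] on its own only displays in a notebook. In a plain script you would do something like:

# pd.read_html returns a list of DataFrames; index 2 is assumed to be the results table.
tables = pd.read_html('download.html')
results = tables[2]
print(results.head())
results.to_csv('results.csv', index=False)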

Cannot identify Javascript XHR API loading data to this page

I am trying to parse the EPG data at the link below. When I inspect the HTML using the code below, all the programme data is missing. I realise this is because it's being loaded asynchronously by JavaScript, but I cannot figure out in Chrome DevTools which API call is responsible, as there seems to be a lot loaded into this page at once:
import requests

url = 'https://mi.tv/ar/programacion/lunes'
headers = {
    'Accept': 'text/html, */*; q=0.01',
    'Referer': outer,  # 'outer' is defined elsewhere in my script
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"',
    'sec-ch-ua-mobile': '?0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'X-KL-Ajax-Request': 'Ajax_Request',
    'X-Requested-With': 'XMLHttpRequest'
}
r = requests.get(url=url, headers=headers)
rr = r.text
print(rr)
Can anyone identify the correct API for me? I can see there are API parameters given in the HTML, but I've not been able to assemble them into a working link, and I cannot see anything with that URL root in Chrome DevTools.
The following shows the right URL to use and how to return the listings in a dict keyed by channel:
import requests
from bs4 import BeautifulSoup as bs
from pprint import pprint

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://mi.tv/ar/async/guide/all/lunes/60', headers=headers)
soup = bs(r.content, 'lxml')

listings = {c.select_one('h3').text: list(zip([i.text for i in c.select('.time')],
                                              [i.text for i in c.select('.title')]))
            for c in soup.select('.channel')}
pprint(listings)
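A quick usage example: listings maps channel name to a list of (time, title) tuples, so you can print the schedule for a single channel. The channel chosen here is simply whichever key happens to come first:

# Print the schedule for one channel from the `listings` dict built above.
name = next(iter(listings))
print(name)
for start_time, title in listings[name]:
    print(start_time, title)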

Scraping site missing data

So I'm trying to scrape the open positions on this site, and when I use any type of requests (currently trying requests-html) it doesn't show everything that's in the HTML.
# Import libraries
import time
from bs4 import BeautifulSoup
from requests_html import HTMLSession

# Set the URL you want to webscrape from
url = 'https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican'

session = HTMLSession()

# Connect to the URL
response = session.get(url)
response.html.render()

# Parse HTML and save to a BeautifulSoup object
soup = BeautifulSoup(response.text, "html5lib")
b = soup.findAll('a')
Not sure where to go. Originally I thought the problem was due to JavaScript rendering, but this is not working.
The issue is that the initial GET doesn't return the data (i.e. the job listings); the JavaScript that fetches them uses a POST with an authorization token in the header. You need to get this token and then make that POST to get the data.
This token appears to be dynamic, so getting it is a little wonky, but doable.
import json

from bs4 import BeautifulSoup as bs
from requests_html import HTMLSession

url0 = r'https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican'
url = r'https://germanamerican.csod.com/services/x/career-site/v1/search'

s = HTMLSession()
r = s.get(url0)
print(r.status_code)
r.html.render()

# the auth token is embedded in a 'csod.context=' script tag on the rendered page
soup = bs(r.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
    if 'csod.context=' in script.text:
        x = script
j = json.loads(x.text.replace('csod.context=', '').replace(';', ''))

payload = {
    'careerSiteId': 5,
    'cities': [],
    'countryCodes': [],
    'cultureId': 1,
    'cultureName': "en-US",
    'customFieldCheckboxKeys': [],
    'customFieldDropdowns': [],
    'customFieldRadios': [],
    'pageNumber': 1,
    'pageSize': 25,
    'placeID': "",
    'postingsWithinDays': None,
    'radius': None,
    'searchText': "",
    'states': []
}
headers = {
    'accept': 'application/json; q=1.0, text/*; q=0.8, */*; q=0.1',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'authorization': 'Bearer ' + j['token'],
    'cache-control': 'no-cache',
    'content-length': '272',
    'content-type': 'application/json',
    'csod-accept-language': 'en-US',
    'origin': 'https://germanamerican.csod.com',
    'referer': 'https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest'
}

r = s.post(url, headers=headers, json=payload)
print(r.status_code)
print(r.json())
The r.json() that's printed out is a nicely formatted JSON version of the table of job listings.
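If you need more than the first page of results, the payload above already exposes pageNumber and pageSize, so a paging loop is straightforward. A sketch, reusing s, url, headers and payload from the code above:

# Paging sketch: request the first few pages by bumping pageNumber.
# Drop the hard-coded content-length header so requests recomputes it for each payload.
headers.pop('content-length', None)

pages = []
for page in range(1, 4):
    payload['pageNumber'] = page
    resp = s.post(url, headers=headers, json=payload)
    pages.append(resp.json())  # inspect each response to find the key holding the job list

print(len(pages))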
I don't think it's possible to scrape that website with Requests.
I would suggest using Selenium or Scrapy.
Welcome to SO!
Unfortunately, you won't be able to scrape that page with requests (nor with requests_html or similar libraries), because you need a tool that can handle dynamic, JavaScript-based pages.
With Python, I would strongly suggest Selenium and its webdriver. Below is a piece of code that prints the desired output, i.e. all listed jobs (NB: it requires Selenium and the Firefox webdriver to be installed and on the correct PATH to run).
# Import libraries
from bs4 import BeautifulSoup
from selenium import webdriver

# Set the URL you want to webscrape from
url = 'https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican'

browser = webdriver.Firefox()  # initialize the webdriver. I use FF; it might be Chromium or else
browser.get(url)               # go to the desired page. You might want to wait a bit in case of a slow connection

page = browser.page_source     # this is the page source, now full with the listings that have been loaded
soup = BeautifulSoup(page, "lxml")

jobs = soup.findAll('a', {'data-tag': 'displayJobTitle'})
for j in jobs:
    print(j.text)

browser.quit()
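If the page is slow to load, an explicit wait is more robust than sleeping. A sketch using Selenium's WebDriverWait, to be inserted between browser.get(url) and reading page_source (the selector mirrors the data-tag attribute used above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 20 seconds for at least one job-title link to be present in the DOM.
WebDriverWait(browser, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "a[data-tag='displayJobTitle']"))
)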

Scraping data from google finance using BeautifulSoup in python

I'm trying to get data from Google Finance from this link, like this:
url = "https://www.google.com/finance/historical?cid=4899364&startdate=Dec+1%2C+2016&enddate=Mar+23%2C+2017&num=200&ei=4wLUWImyJs-iuASgwIKYBg"
request = urllib.request.Request(url,None,headers)
response = urllib.request.urlopen(request).read()
soup = BeautifulSoup(response, 'html.parser')
prices = soup.find_all("tbody")
print(prices)
I'm getting an empty list. I have also tried alternatives, such as soup.find_all('tr'), but I still can't retrieve the data successfully.
Edit:
headers = {
    'Host': 'www.google.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive'
}
The problem was with html.parser. I used lxml instead and it worked. I also swapped urllib for requests.
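For reference, a sketch of the combination described (requests plus the lxml parser); note that this legacy Google Finance URL may no longer serve the historical table:

import requests
from bs4 import BeautifulSoup

# Sketch only: requests + lxml, as described above. The old Google Finance
# historical URL may no longer return the table.
url = ("https://www.google.com/finance/historical?cid=4899364"
       "&startdate=Dec+1%2C+2016&enddate=Mar+23%2C+2017&num=200")
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
rows = soup.find_all('tr')
print(len(rows))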
