How to scrape total search results using Python

I am a beginner in Python and web scraping but I am really interested. What I want to do is to extract the total number of search results per day.
If you open the page, you will see:
Used Cars for Sale
Results 1 - 20 of 30,376
What I want is only the number 30,376. Is there any way to extract it on a daily basis automatically and save it to an Excel file? I have played around with some packages in Python, but all I got was error messages and irrelevant output, with code like this:
from bs4 import BeautifulSoup
from urllib.request import urlopen
base_url = "..."
def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

make_soup(base_url)
Can someone show me how to extract that particular number please? Thanks!

Here is one way, using the requests module and the soup.select function.
from bs4 import BeautifulSoup
import requests
base_url = "http://www.autotrader.co.nz/used-cars-for-sale"
def make_soup(url):
    html = requests.get(url).content
    soup = BeautifulSoup(html, "lxml")
    txt = soup.select('#result-header .result-count')[0].text
    print(txt.split()[-1])

make_soup(base_url)
soup.select accepts a CSS selector as its argument. The #result-header .result-count selector finds the element with class result-count inside the element whose id is result-header.
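To see how the selector behaves, it can be tried on a small inline snippet that mirrors the structure described above (the HTML here is a made-up stand-in, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the page structure described above
html = """
<div id="result-header">
  <p class="result-count">Results 1 - 20 of 30,376</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# '#result-header .result-count' = class result-count inside id result-header
txt = soup.select('#result-header .result-count')[0].text
print(txt.split()[-1])  # 30,376
```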

from bs4 import BeautifulSoup
from urllib.request import urlopen
base_url = "http://www.autotrader.co.nz/used-cars-for-sale"
html = urlopen(base_url).read()
soup = BeautifulSoup(html, 'lxml')
result_count = soup.find(class_="result-count").text.split('of ')[-1]
print(result_count)
Output:
30,376

from bs4 import BeautifulSoup
import requests, re
base_url = "http://www.autotrader.co.nz/used-cars-for-sale"
a = BeautifulSoup(requests.get(base_url).content, "lxml").select('div#result-header p.result-count')[0].text
num = re.search(r'([\w,]+)$', a)
print(int(num.group(1).replace(',', '')))
Output:
30378
This will also pick up any other number that appears at the end of the sentence.
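The trailing-number pattern can be checked offline against sample strings (the sample text here is made up for illustration):

```python
import re

# r'([\w,]+)$' captures the final run of word characters and commas
for s in ["Results 1 - 20 of 30,376", "Showing 5 of 42"]:
    m = re.search(r'([\w,]+)$', s)
    print(int(m.group(1).replace(',', '')))  # 30376, then 42
```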
Appending new rows to an Existing Excel File
Script to append today's date and the extracted number to existing excel file:
!!!Important!!!: Don't run this code directly on your main file. Make a copy of it first and run it on that copy; if it works properly, then run it on your main file. I'm not responsible if you lose your data :)
import openpyxl
import datetime

wb = openpyxl.load_workbook('/home/yusuf/Desktop/data.xlsx')
sheet = wb['Sheet1']  # wb.get_sheet_by_name() is deprecated
next_row = sheet.max_row + 1  # get_highest_row() is deprecated; rows/columns are 1-indexed
sheet.cell(row=next_row, column=1).value = datetime.date.today()
sheet.cell(row=next_row, column=2).value = 30378  # use a variable here from the above (previous) code
wb.save('/home/yusuf/Desktop/data.xlsx')
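If a dependency-free alternative is acceptable, the same daily log can be appended to a CSV file using only the standard library, which Excel also opens directly (a sketch; the filename and the hard-coded count are placeholders for the values in your own script):

```python
import csv
import datetime

count = 30378  # in practice, use the number extracted by the scraper above

# Open in append mode so each run adds one row: date, count
with open('data.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([datetime.date.today().isoformat(), count])
```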

Related

How to Extract a Division from Html with BeautifulSoup

I am trying to extract the 'meanings' section of a dictionary entry from a html file using beautifulsoup but it is giving me some trouble. Here is a summary of what I have tried so far:
I right click on the dictionary entry page below and save the webpage to my Python directory as 'aufmachen.html'
https://www.duden.de/rechtschreibung/aufmachen
Within the source code of this webpage, the section I am trying to extract starts at line 1042, with the expression shown in the findAll call below.
I wrote the code below but neither tags nor Bedeutungen contains any search results.
import requests
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
with open("aufmachen.html", encoding="utf8") as f:
    doc = BeautifulSoup(f, "html.parser")

tags = doc.body.findAll(text='<div class="division " id="bedeutungen">')
print(tags)

Bedeutungen = doc.body.findAll("div", {"id": "bedeutungen"})
print(Bedeutungen)
Could you please help me with this problem?
Thanks for your time in advance.
One thing to note first: BeautifulSoup will accept an open file handle, but calling .read() on your file to pass a string is more explicit.
with open("aufmachen.html", "r", encoding="utf8") as f:
    doc = BeautifulSoup(f.read(), "html.parser")
However it seems you want to pull in the HTML file from a URL, not a file on your computer. This can be done like this:
from bs4 import BeautifulSoup
import requests
url = "https://www.duden.de/rechtschreibung/aufmachen"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
Bedeutungen = soup.body.findAll("div", {"id": "bedeutungen"})
print(Bedeutungen)
Your first call to .findAll() didn't work because the text kwarg looks for text inside a tag, not for a tag itself. The following also works, but there's no particular reason to use it over the version shown above.
tags = soup.body.findAll("div", class_="division", id="bedeutungen")
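The difference between the text kwarg and a tag search can be seen on a toy snippet (made-up HTML standing in for the real page):

```python
from bs4 import BeautifulSoup

# Toy snippet; the real page is much larger
html = '<div class="division" id="bedeutungen"><p>Bedeutung 1</p></div>'
soup = BeautifulSoup(html, "html.parser")

# text= matches node text, so searching for markup finds nothing
print(soup.find_all(text='<div class="division " id="bedeutungen">'))  # []

# Searching by tag name and attributes finds the element
print(soup.find("div", id="bedeutungen").get_text())  # Bedeutung 1
```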

Trying to scrape Aliexpress

So I am trying to scrape the price of a product on Aliexpress. I tried inspecting the element which looks like
<span class="product-price-value" itemprop="price" data-spm-anchor-id="a2g0o.detail.1000016.i3.fe3c2b54yAsLRn">US $14.43</span>
I'm trying to run the following code
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
url = 'https://www.aliexpress.com/item/32981494236.html?spm=a2g0o.productlist.0.0.44ba26f6M32wxY&algo_pvid=520e41c9-ba26-4aa6-b382-4aa63d014b4b&algo_expid=520e41c9-ba26-4aa6-b382-4aa63d014b4b-22&btsid=0bb0623b16170222520893504e9ae8&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_'
source = urlopen(url).read()
soup = BeautifulSoup(source, 'lxml')
soup.find('span', class_='product-price-value')
but I keep getting a blank output. I must be doing something wrong but these methods seem to work in the tutorials I've seen.
Here is what I found: if I understood correctly, the page you linked builds its content with scripts, so the original HTML response doesn't contain the price directly, just script tags. I used split to pull the value out of one of them. Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://aliexpress.ru/item/1005002281350811.html?spm=a2g0o.productlist.0.0.42d53b59T5ddTM&algo_pvid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5&algo_expid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5-1&btsid=0b8b035c16170960366785062e33c0&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_&sku_id=12000019900010138'
data = requests.get(url)
soup = BeautifulSoup(data.content, features="lxml")
res = soup.findAll("script")
total_value = str(res[-3]).split("totalValue:")[1].split("}")[0].replace("\"", "").replace(".", "").strip()
print(total_value)
It works fine; I tried it on a few pages from Ali.
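The split-based extraction can be illustrated offline on a minimal snippet whose data lives only inside a script tag (the markup below is a made-up stand-in, not Aliexpress's real page):

```python
from bs4 import BeautifulSoup

# Hypothetical page whose price exists only inside a script tag
html = '<html><body><script>window.runParams = {totalValue: "US $14.43"}</script></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Take the script tag and slice out the value between totalValue: and }
script = soup.find_all("script")[-1]
price = str(script).split("totalValue:")[1].split("}")[0].replace('"', '').strip()
print(price)  # US $14.43
```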

unable to Webscrape dropdown item [Python][beautifulsoup]

I am new to web scraping. I am scraping this website - https://www.valueresearchonline.com/funds/22/uti-mastershare-fund-regular-plan/
From it, I want to scrape this text - Regular Plan
But when I use the selector I found through inspect element, it doesn't work.
Code -
import requests
from bs4 import BeautifulSoup
import csv
import sys
url = 'https://www.valueresearchonline.com/funds/newsnapshot.asp?schemecode=22'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
regular_direct = soup.find('span',class_="filter-option pull-left").text
print(regular_direct)
I get None when printing, and I don't know why. The code in inspect element and in view page source is also different: in view page source, this span and class are not there.
Why am I getting None? Can anyone tell me how I can get that text, and why the inspect element code and the view page source code are different?
You need to change the selector because the html source that gets downloaded is different.
import requests
from bs4 import BeautifulSoup
import csv
import sys
url = 'https://www.valueresearchonline.com/funds/newsnapshot.asp?schemecode=22'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
regular_direct = soup.find("select", {"id":"select-plan"}).find("option",{"selected":"selected"}).get_text(strip=True)
print(regular_direct)
Output:
Regular plan
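The selected-option lookup can be verified offline against a toy dropdown mirroring the structure described (made-up HTML):

```python
from bs4 import BeautifulSoup

# Toy dropdown mirroring the structure in the downloaded HTML
html = """
<select id="select-plan">
  <option value="d">Direct plan</option>
  <option value="r" selected="selected">Regular plan</option>
</select>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the select by id, then the option carrying the selected attribute
plan = soup.find("select", {"id": "select-plan"}).find("option", {"selected": "selected"}).get_text(strip=True)
print(plan)  # Regular plan
```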

'NoneType' Error While WebScraping StockTwits

I am trying to write a script that simply reads and prints all of the tickers on a particular account's watchlist. I have managed to navigate to the page and print the user's name from the HTML. Now I want to print all the tickers he follows, using find() to locate the watchlist and then find_all() to find each ticker, but every time I use find() to navigate to the watchlist tickers, it returns 'NoneType'.
Here is my code:
import requests
import xlwt
from xlutils.copy import copy
from xlwt import Workbook
import xlrd
import urllib.request as urllib2
from bs4 import BeautifulSoup
hisPage = ("https://stocktwits.com/GregRieben/watchlist")
page = urllib2.urlopen(hisPage)
soup = BeautifulSoup(page, "html.parser")
his_name = soup.find("span", {"class":"st_33aunZ3 st_31YdEUQ st_8u0ePN3 st_2mehCkH"})
name = his_name.text.strip()
print(name)
watchlist = soup.find("div", {"class":"st_16989tz"})
tickers = watchlist.find_all('span', {"class":"st_1QzH2P8"})
print(type(watchlist))
print(len(watchlist))
Here I want the highlighted value (LSPD.CA) and all the others after it (they all have the exact same HTML setup).
Here is my error: the find() call for the watchlist returns None, so calling find_all() on it raises an AttributeError.
That content is dynamically added from an API call, so it is not present in the response to a request for the original URL (the DOM is not updated as it would be in a browser). You can find the API call for the watchlist in the network traffic. It returns JSON, and you can extract what you want from that.
import requests
r = requests.get('https://api.stocktwits.com/api/2/watchlists/user/396907.json').json()
tickers = [i['symbol'] for i in r['watchlist']['symbols']]
print(tickers)
If you need the user id to pass to the API, it is present in a number of places in the response from your original URL. I am using a regex to grab it from a script tag:
import requests, re
p = re.compile(r'subjectUser":{"id":(\d+)')
with requests.Session() as s:
    r = s.get('https://stocktwits.com/GregRieben/watchlist')
    user_id = p.findall(r.text)[0]
    r = s.get('https://api.stocktwits.com/api/2/watchlists/user/' + user_id + '.json').json()
    tickers = [i['symbol'] for i in r['watchlist']['symbols']]
    print(tickers)
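The regex can be sanity-checked against a sample of the embedded JSON (the sample string below is made up to match the pattern's shape, not copied from the live page):

```python
import re

# Capture the digits following subjectUser":{"id":
p = re.compile(r'subjectUser":{"id":(\d+)')
sample = '..."subjectUser":{"id":396907,"username":"GregRieben"...'
print(p.findall(sample)[0])  # 396907
```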

Python BeautifulSoup cannot find table ID

I am running into some trouble scraping a table using BeautifulSoup. Here is my code
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")
stats = soup.find('table', id='totals')
print(stats)  # None
When I right click on the table to inspect the element the HTML looks as I'd expect, however when I view the source the only element with id = 'totals' is commented out. Is there a way to scrape a table from the commented source code?
I have referenced this post but can't seem to replicate their solution.
Here is a link to the webpage I am interested in. I'd like to scrape the table labeled "Totals" and store it as a data frame.
I am relatively new to Python, HTML, and web scraping. Any help would be greatly appreciated.
Thanks in advance.
Michael
Comments are string instances in BeautifulSoup. You can use BeautifulSoup's find method with a regular expression to find the particular string that you're after. Once you have the string, have BeautifulSoup parse that and there you go.
In other words,
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")
stats_html = soup.find(string=re.compile('id="totals"'))
stats_soup = BeautifulSoup(stats_html, "html.parser")
print(stats_soup.table.caption.text)
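The same comment-parsing approach can be tried offline on a minimal made-up snippet with a table hidden inside an HTML comment:

```python
import re
from bs4 import BeautifulSoup

# Minimal page with the table commented out, as on the real site
html = '''
<div id="all_totals">
<!-- <table id="totals"><caption>Totals Table</caption></table> -->
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# Comments are string instances, so find(string=...) locates the comment text
comment = soup.find(string=re.compile('id="totals"'))

# Parse the comment's contents as HTML in their own right
inner = BeautifulSoup(comment, "html.parser")
print(inner.table.caption.text)  # Totals Table
```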
You can do this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page, "lxml")
stats = soup.findAll('div', id='all_totals')
print(stats)
Please inform me if I helped!
