Extract follower numbers from Vkontakte using Python & BeautifulSoup

I am trying to extract the follower count from a page on Vkontakte, a Russian social network. As I'm a complete beginner with Python, I have tried adapting code I found on Stack Overflow that was originally written to extract the follower count on Twitter. Here's the original code:
from bs4 import BeautifulSoup
import requests
username='realDonaldTrump'
url = 'https://www.twitter.com/'+username
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
f = soup.find('li', class_="ProfileNav-item--followers")
print(f)
I'm using this webpage as an example: https://vk.com/msk_my. Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://vk.com/msk_my'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
f = soup.find('span', class_="header_count fl_l")
print(f)
This, and many other variations I've tried (for example, trying to find "div" instead of "span"), only prints "None". It seems BeautifulSoup can't find the follower count, and I'm struggling to understand why. The only way I've managed to print the follower count is with this:
text = soup.div.get_text()
print(text)
But this prints much more stuff than I want, and I don't know how to get only the follower count.

Try this. It will fetch only the follower count. All you have to do is use Selenium so that you can grab the same page source you see when inspecting the element (the count is rendered by JavaScript, so a plain requests call never receives it).
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://vk.com/msk_my')
soup = BeautifulSoup(driver.page_source,"lxml")
driver.quit()
item = soup.select(".header_count")[0].text
print("Followers: {}".format(item))
Result:
Followers: 59,343
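As a side note, if you don't want a browser window to pop up, a headless run should work too. A minimal sketch, assuming chromedriver is installed and a reasonably recent Selenium:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get('https://vk.com/msk_my')
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()
print("Followers: {}".format(soup.select(".header_count")[0].text))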

Related

Python Web Scraping | How to scrape data from multiple urls by choosing page number as a range with Beautiful Soup and selenium?

from selenium import webdriver
import time
from bs4 import BeautifulSoup as Soup
driver = webdriver.Firefox(executable_path='C://Downloads//webdrivers//geckodriver.exe')
a = 'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page='
for c in range(8):
    #a = f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={c}'
    cd = driver.get(a+str(c))
    page_source = driver.page_source
    bs = Soup(page_source, 'html.parser')
    fetch_data = bs.find_all('div', {'class': 's-expand-height.s-include-content-margin.s-latency-cf-section.s-border-bottom'})
    for f_data in fetch_data:
        product_name = f_data.find('span', {'class': 'a-size-medium.a-color-base.a-text-normal'})
        print(product_name + '\n')
The problem here is that the webdriver successfully visits 7 pages but doesn't produce any output or an error.
I don't know where I'm going wrong.
Any suggestions, or a reference to an article that addresses this problem, would be welcome.
You are not selecting the right div tag to fetch the products with BeautifulSoup, which is why you get no output.
Try the following snippet:
#range of pages
for page in range(1, 20):
    driver.get(f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={page}')
    page_source = driver.page_source
    bs = Soup(page_source, 'html.parser')
    #get search results
    products = bs.find_all('div', {'data-component-type': "s-search-result"})
    #for each product in the search results, print the product name
    for product in products:
        product_name = product.find('span', class_="a-size-medium a-color-base a-text-normal")
        if product_name:
            print(product_name.text)
You can print bs or products to debug.
As an aside, in my opinion you can use requests or urllib to get the page source instead of Selenium.
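For instance, a minimal requests-based sketch (Amazon often blocks non-browser clients or serves different markup, so the User-Agent header and the selectors here are assumptions that may need adjusting):
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # pretend to be a browser; Amazon may still block this
for page in range(1, 8):
    r = requests.get(f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={page}', headers=headers)
    bs = BeautifulSoup(r.text, 'html.parser')
    # same data-component-type selector as in the answer above
    for product in bs.find_all('div', {'data-component-type': 's-search-result'}):
        name = product.find('span', class_='a-size-medium a-color-base a-text-normal')
        if name:
            print(name.text)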

Trying to scrape Aliexpress

So I am trying to scrape the price of a product on Aliexpress. I inspected the element, which looks like this:
<span class="product-price-value" itemprop="price" data-spm-anchor-id="a2g0o.detail.1000016.i3.fe3c2b54yAsLRn">US $14.43</span>
I'm trying to run the following code
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
url = 'https://www.aliexpress.com/item/32981494236.html?spm=a2g0o.productlist.0.0.44ba26f6M32wxY&algo_pvid=520e41c9-ba26-4aa6-b382-4aa63d014b4b&algo_expid=520e41c9-ba26-4aa6-b382-4aa63d014b4b-22&btsid=0bb0623b16170222520893504e9ae8&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_'
source = urlopen(url).read()
soup = BeautifulSoup(source, 'lxml')
soup.find('span', class_='product-price-value')
but I keep getting a blank output. I must be doing something wrong but these methods seem to work in the tutorials I've seen.
Here's what I found. If I understood correctly, the page you linked gets its content via scripts; the original HTML doesn't contain the price directly, just script tags, so I used split to extract it. Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://aliexpress.ru/item/1005002281350811.html?spm=a2g0o.productlist.0.0.42d53b59T5ddTM&algo_pvid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5&algo_expid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5-1&btsid=0b8b035c16170960366785062e33c0&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_&sku_id=12000019900010138'
data = requests.get(url)
soup = BeautifulSoup(data.content, features="lxml")
res = soup.findAll("script")
total_value = str(res[-3]).split("totalValue:")[1].split("}")[0].replace("\"", "").replace(".", "").strip()
print(total_value)
It works fine; I tried it on a few pages from Ali.
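If the chained split/replace feels brittle, a regular expression over the same script contents is a slightly more defensive variant. A sketch under the same assumption that the price appears as a totalValue: field inside one of the script tags (the exact format may differ from page to page):
import re
import requests
from bs4 import BeautifulSoup

url = 'https://aliexpress.ru/item/1005002281350811.html'  # product page; tracking parameters omitted
data = requests.get(url)
soup = BeautifulSoup(data.content, features="lxml")
for script in soup.find_all("script"):
    # look for totalValue: "..." (assumed format) inside the script text
    m = re.search(r'totalValue:\s*"?([^",}]+)', script.get_text())
    if m:
        print(m.group(1).strip())
        break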

beautiful soup- Scraping a site with hidden tag

I am trying to scrape NBA.com's play-by-play table, so I want to get the text of each box shown in the example picture.
For example: https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play.
Checking the HTML, I figured that each line is in an article tag, which contains a div tag, which contains two p tags with the information I want. However, when I run the following code, I get back 0 articles and only 9 p tags (there should be many more), and even the tags I do get hold text that isn't from the boxes but from something else. Since I get only 9 tags, I am doing something terribly wrong, and I am not sure what it is.
This is the code I use to get the tags:
from urllib.request import urlopen
from bs4 import BeautifulSoup
def contains_word(t):
    return t and 'keyword' in t
url = "https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
div_tags = soup.find_all('div', text=contains_word("playByPlayContainer"))
articles=soup.find_all('article')
p_tag = soup.find_all('p', text=contains_word("md:bg"))
thank you!
Use Selenium, since the page is rendered with JavaScript, and pass the page source to BeautifulSoup. You'll also need to pip install selenium and download chromedriver.exe.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play")
soup = BeautifulSoup(driver.page_source, "html.parser")
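If the soup still comes back without the rows, the page may not have finished rendering when page_source is read; an explicit wait helps. A sketch, assuming the rendered rows are article elements as the question suggests (adjust the locator if NBA.com's markup differs):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play")
# wait up to 10 seconds for at least one <article> row to render (assumed marker of the table)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "article")))
soup = BeautifulSoup(driver.page_source, "html.parser")
print(len(soup.find_all("article")))  # should now be non-zero if the assumption holds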

python BeautifulSoup4 - why does result print 5 times?

I'm trying to do some web scraping in python using BeautifulSoup 4.
I am trying to scrape the salary of a public employee. I am doing that successfully, but the result is returned 5 times and I cannot figure out why.
Here is the website I am scraping: https://data.richmond.com/salaries/2018/state/university-of-virginia/tony-bennett
Here is my code example:
import requests
from bs4 import BeautifulSoup
source = requests.get(f'https://data.richmond.com/salaries/2018/state/university-of-virginia/tony-bennett')
soup = BeautifulSoup(source.text, 'html.parser')
main_box = soup.find_all('div')
for i in main_box:
    try:
        x = i.find('div', class_='col-12 col-lg-4 pay')
        z = x.find('h2').text
        print(z)
    except Exception:
        pass
And my results are:
$525,000
$525,000
$525,000
$525,000
$525,000
This is the correct salary, but as I said the results print 5 times.
If I go to the page, right-click, and inspect, I find the class I am looking for, 'col-12 col-lg-4 pay', and within that the h2 tag; there is only one h2 tag, and I print its text.
So it seems I am missing something, but what?
I would just get rid of the for loop and use a more specific find query. (Your loop visits every div on the page, and the pay div sits inside several nested ancestor divs; each ancestor's find() locates the same inner element, which is why the same salary prints 5 times.)
import requests
from bs4 import BeautifulSoup
source = requests.get(f'https://data.richmond.com/salaries/2018/state/university-of-virginia/tony-bennett')
soup = BeautifulSoup(source.text, 'html.parser')
main_box = soup.find("div", {"class": "pay"})
print(main_box.find('h2').text)
You can also extract this by using a CSS selector:
import requests
from bs4 import BeautifulSoup
url = 'https://data.richmond.com/salaries/2018/state/university-of-virginia/tony-bennett'
res = requests.get(url).text
soup = BeautifulSoup(res , 'html.parser')
Value = soup.select('#paytotal')
print(Value[0].text)
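Equivalently, select_one returns the first match directly, so the list indexing can be skipped:
Value = soup.select_one('#paytotal')
print(Value.text)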

Using Python to Scrape Sky Cinema List

I'd like to gather a list of films and their links to all available movies on Sky Cinema website.
The website is:
http://www.sky.com/tv/channel/skycinema/find-a-movie#/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200
I am using Python 3.6 and Beautiful Soup.
I am having problems finding the title and link, especially as there are several pages to click through, possibly based on scroll position (in the URL?).
I've tried using BeautifulSoup and Python, but there is no output. The code I tried would at best return only the title; I'd like both the title and the link to the film, and as these are in different areas of the page I'm unsure how to do this.
Code I have tried:
from bs4 import BeautifulSoup
import requests
link = "http://www.sky.com/tv/channel/skycinema/find-a-movie#/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200"
r = requests.get(link)
page = BeautifulSoup(r.content, "html.parser")
for dd in page.find_all("div", {"class": "sentence-result-infos"}):
    title = dd.find(class_="title ellipsis ng-binding").text.strip()
    print(title)
spans=page.find_all('span', {'class': 'title ellipsis ng-binding'})
for span in spans:
print(span.text)
I'd like the output to show as the title, link.
EDIT:
I have just tried the following, but I get an error that "text" is not an attribute:
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
response = session.get('http://www.sky.com/tv/channel/skycinema/find-a-movie/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200')
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('span', {'class': 'title ellipsis ng-binding'}).text.strip()
print(title)
There is an API to be found in the network tab. You can get all the results with one call by setting the limit to a number greater than the expected result count:
r = requests.get('http://www.sky.com/tv/api/search/movie?limit=10000&window=skyMovies').json()
Or use the number you can see on the page:
import requests
import pandas as pd
base = 'http://www.sky.com/tv'
r = requests.get('http://www.sky.com/tv/api/search/movie?limit=1555&window=skyMovies').json()
data = [(item['title'], base + item['url']) for item in r['items']]
df = pd.DataFrame(data, columns = ['Title', 'Link'])
print(df)
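If you'd rather have the title/link pairs in a file than printed, the same DataFrame can be written out directly (the filename here is just an example):
df.to_csv('sky_movies.csv', index=False)  # writes Title,Link rows to CSV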
First of all, read the terms and conditions of the site you are going to scrape.
Next, you need Selenium:
from selenium import webdriver
import bs4
# MODIFY the url with YOURS
url = "type the url to scrape here"
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = bs4.BeautifulSoup(html, "html.parser")
baseurl = 'http://www.sky.com/'
titles = [n.text for n in soup.find_all('span', {'class':'title ellipsis ng-binding'})]
links = [baseurl+h['href'] for h in soup.find_all('a', {'class':'sentence-result-pod ng-isolate-scope'})]
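The two lists line up by position, so you can pair them to get the title, link output the question asks for:
for title, link in zip(titles, links):
    print(title, link)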
