There is no output when I try to scrape this page - Python

I'm using a Discord bot for a small community and I'm trying to display the number of online players for a specific game. The code I'm using looks OK to me, but this is my first time scraping, so I may be asking it to look for the wrong keywords. The module loads fine with no errors, but when I enter the trigger to display the information, nothing happens. Can anyone point out what I may have missed or input wrong?
Here is the code:
import discord
from discord.ext import commands
try:  # check if BeautifulSoup4 is installed
    from bs4 import BeautifulSoup
    soupAvailable = True
except:
    soupAvailable = False
import aiohttp

class bf1online:
    """My custom cog that does stuff!"""

    def __init__(self, bot):
        self.bot = bot
        """This does stuff!"""
        # Your code will go here

    @commands.command()
    async def bf1(self):
        """How many players are online atm?"""
        # Your code will go here
        url = "http://bf1stats.com/"  # build the web address
        async with aiohttp.get(url) as response:
            soupObject = BeautifulSoup(await response.text(), "html.parser")
        try:
            online = soupObject.find(id_='online_section').find('h2').find('p').find('b').get_text()
            await self.bot.say(online + ' players are playing this game at the moment')
        except:
            await self.bot.say("Couldn't load amount of players. No one is playing this game anymore or there's an error.")

def setup(bot):
    bot.add_cog(bf1online(bot))

Your first problem is that it should be id=, not id_= (no trailing underscore):
soupObject.find(id='online_section')
The next problem is that the element looks like this:
<div id="online_section">
    Loading currently playing player counts...
</div>
Because that div is rendered using JS. Luckily you can mimic the AJAX call that gets the data quite easily:

In [1]: import requests
   ...: data = requests.get("http://api.bf1stats.com/api/onlinePlayers").json()

In [2]: data
Out[2]:
{'pc': {'count': 126870, 'label': 'PC', 'peak24': 179935},
 'ps4': {'count': 237504, 'label': 'PS4', 'peak24': 358182},
 'xone': {'count': 98474, 'label': 'XBOXONE', 'peak24': 266869}}
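For reference, here's a rough, untested sketch of how the bf1 command could use that JSON endpoint instead of scraping the JS-rendered HTML. The 'count' fields come from the output above; the aiohttp.ClientSession usage and the summing of per-platform counts are my own assumptions, and the old self.bot.say style is kept to match the rest of the cog:

    @commands.command()
    async def bf1(self):
        """How many players are online atm?"""
        url = "http://api.bf1stats.com/api/onlinePlayers"
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                data = await response.json()
        # add up the 'count' field for each platform (pc, ps4, xone)
        total = sum(platform["count"] for platform in data.values())
        await self.bot.say(f"{total} players are playing this game at the moment")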


I'm finding it hard to understand how functions work. Would someone mind explaining them?

Please excuse the extra modules. I've taken a small part of my code out to convert it into functions to make my code less messy. However, I'm finding it really hard to understand how to put values in and take them out to print or do things with. See the code I'm using below; videoURL would be replaced with the URL of a video.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
from pytube import YouTube
from pytube import Channel

channelURL = "videoURL"
YouTubeDomain = "https://www.youtube.com/channel/"

def BeautifulSoup(Link):
    soup = BeautifulSoup(requests.get(Link, cookies={'CONSENT': 'YES+1'}).text, "html.parser")
    data = re.search(r"var ytInitialData = ({.*});", str(soup.prettify())).group(1)
    json_data = json.loads(data)
    channel_id = json_data["header"]["c4TabbedHeaderRenderer"]["channelId"]
    channel_name = json_data["header"]["c4TabbedHeaderRenderer"]["title"]
    channel_logo = json_data["header"]["c4TabbedHeaderRenderer"]["avatar"]["thumbnails"][2]["url"]
    channel_id_link = YouTubeDomain + channel_id
    print("Channel ID: " + channel_id)
    print("Channel Name: " + channel_name)
    print("Channel Logo: " + channel_logo)
    print("Channel ID: " + channel_id_link)

def vVersion(*arg):
    YTV = YouTube(channelURL)
    channel_id = YTV.channel_id
    channel_id_link = YTV.channel_url
    c = Channel(channel_id_link)
    channel_name = c.channel_name
    return channel_id_link, channelURL

channel_id_link, video = vVersion()
print(channel_id_link)
Link = channel_id_link
print(Link)
Test = print(BeautifulSoup(Link))
Test()
So the errors I keep getting are about having too many or too few args for the functions.
Here's the current error:

BeautifulSoup() takes 1 positional argument but 2 were given
  File "C:\Users\Admin\test\video1.py", line 26, in BeautifulSoup
    soup = BeautifulSoup(requests.get(Link, cookies={'CONSENT': 'YES+1'}).text, "html.parser")
  File "C:\Users\Admin\test\video1.py", line 53, in <module>
    Test = print(BeautifulSoup(Link))

I know I'm missing something very simple.
Any help would be welcome, thank you!
I have tried to take the code out of my main code to isolate the issue.
I was expecting to gain a perspective on the issue.
I tried the following code to train myself on functions but it didn't really help me fix the issue I'm having with my project.
def test():
    name = (input("Enter your name?"))
    favNumber = (input("Please enter your best number?"))
    return name, favNumber

name, favNumber = test()
print(name)
print(float(favNumber))
It's because you have named your function BeautifulSoup, which is the same as the name of the class you imported from the library. Instead of using BeautifulSoup from bs4, Python now runs the code you defined, which takes only one argument. So give your function another name.
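For example, here is a minimal sketch with the wrapper renamed (get_channel_info is just an illustrative name, and the re/json imports are added because the function uses them):

import json
import re

import requests
from bs4 import BeautifulSoup

def get_channel_info(link):
    # bs4's BeautifulSoup is no longer shadowed by our own function name
    soup = BeautifulSoup(requests.get(link, cookies={'CONSENT': 'YES+1'}).text, "html.parser")
    data = re.search(r"var ytInitialData = ({.*});", str(soup.prettify())).group(1)
    header = json.loads(data)["header"]["c4TabbedHeaderRenderer"]
    print("Channel ID: " + header["channelId"])
    print("Channel Name: " + header["title"])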

Using Pyppeteer to download CSV / Excel file from Vanguard via JavaScript

I'm trying to automate downloading the holdings of Vanguard funds from the web. The links resolve through JavaScript, so I'm using Pyppeteer, but I'm not getting the file. Note: the link says CSV, but it provides an Excel file.
From my browser it works like this:
Go to the fund URL, e.g.
https://www.vanguard.com.au/personal/products/en/detail/8225/portfolio
Follow the link, "See count total holdings"
Click the link, "Export to CSV"
My attempt to replicate this in Python follows. The first link follow seems to work because I get different HTML but the second click gives me the same page, not a download.
import asyncio
from pyppeteer import launch
import os

async def get_page(browser, url):
    page = await browser.newPage()
    await page.goto(url)
    return page

async def fetch(url):
    browser = await launch(options={'args': ['--no-sandbox']})  # headless=True,
    page = await get_page(browser, url)
    await page.waitFor(2000)

    # save the page so we can see the source
    wkg_dir = 'vanguard'
    t_file = os.path.join(wkg_dir, '8225.htm')
    with open(t_file, 'w', encoding="utf-8") as ef:
        ef.write(await page.content())

    accept = await page.xpath('//a[contains(., "See count total holdings")]')
    print(f'Found {len(accept)} "See count total holdings" links')
    if accept:
        await accept[0].click()
        await page.waitFor(2000)
    else:
        print('DID NOT FIND THE LINK')
        return False

    # save the pop-up page for debug
    t_file = os.path.join(wkg_dir, 'first_page.htm')
    with open(t_file, 'w', encoding="utf-8") as ef:
        ef.write(await page.content())

    links = await page.xpath('//a[contains(., "Export to CSV")]')
    print(f'Found {len(links)} "Export to CSV" links')  # 3 of these
    for i, link in enumerate(links):
        print(f'Trying link {i}')
        await link.click()
        await page.waitFor(2000)
        t_file = os.path.join(wkg_dir, f'csv_page{i}.csv')
        with open(t_file, 'w', encoding="utf-8") as ef:
            ef.write(await page.content())
    return True

#---------- Main ------------
# Set constants and global variables
url = 'https://www.vanguard.com.au/personal/products/en/detail/8225/portfolio'
loop = asyncio.get_event_loop()
status = loop.run_until_complete(fetch(url))
Would love to hear suggestions from anyone that knows Puppeteer / Pyppeteer well.
First of all, page.waitFor(2000) should be the last resort. That's a race condition that can lead to a false negative at worst and slows your scrape down at best. I recommend page.waitForXPath which spawns a tight polling loop to continue your code as soon as the xpath becomes available.
Also, on the topic of element selection, I'd use text() in your XPath instead of ., which is more precise.
I'm not sure how ef.write(await page.content()) is working for you -- that should only give page HTML, not the XLSX download. The link click triggers downloads via a dialog. Accepting this download involves enabling Chrome downloads with
await page._client.send("Page.setDownloadBehavior", {
    "behavior": "allow",
    "downloadPath": r"C:\Users\you\Desktop"  # TODO set your path
})
The next hurdle is bypassing or suppressing the "multiple downloads" permission prompt Chrome displays when you try to download multiple files on the same page. I wasn't able to figure out how to stop this, so my code just navigates to the page for each link as a workaround. I'll leave my solution as sub-optimal but functional and let others (or my future self) improve on it.
By the way, two of the XLSX files at indices 1 and 2 seem to be identical. This code downloads all 3 anyway, but you can probably skip the last depending on whether the page changes or not over time -- I'm not familiar with it.
I'm using a trick for clicking non-visible elements, using the browser console's click rather than Puppeteer: page.evaluate("e => e.click()", csv_link)
Here's my attempt:
import asyncio
from pyppeteer import launch

async def get_csv_links(page):
    await page.goto(url)
    xp = '//a[contains(text(), "See count total holdings")]'
    await page.waitForXPath(xp)
    accept, = await page.xpath(xp)
    await accept.click()
    xp = '//a[contains(text(), "Export to CSV")]'
    await page.waitForXPath(xp)
    return await page.xpath(xp)

async def fetch(url):
    browser = await launch(headless=False)
    page, = await browser.pages()
    await page._client.send("Page.setDownloadBehavior", {
        "behavior": "allow",
        "downloadPath": r"C:\Users\you\Desktop"  # TODO set your path
    })
    csv_links = await get_csv_links(page)
    for i in range(len(csv_links)):
        # open a fresh page each time as a hack to avoid multiple file prompts
        csv_link = (await get_csv_links(page))[i]
        await page.evaluate("e => e.click()", csv_link)
        # let download finish; this is a race condition
        await page.waitFor(3000)

if __name__ == "__main__":
    url = "https://www.vanguard.com.au/personal/products/en/detail/8225/portfolio"
    asyncio.get_event_loop().run_until_complete(fetch(url))
Notes for improvement:
Try an arg like --enable-parallel-downloading or a setting like 'profile.default_content_setting_values.automatic_downloads': 1 to suppress the "multiple downloads" warning.
Figure out how to wait for all downloads to complete so the final waitFor(3000) can be removed. Another option here might involve polling for the files you expect (see the sketch after this list); you can visit the linked thread for ideas.
Figure out how to download headlessly.
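Here's a rough sketch of that polling idea, assuming Chrome's usual behaviour of writing *.crdownload files while a download is in progress (the directory and expected count are whatever you configured above):

import glob
import os
import time

def wait_for_downloads(download_dir, expected_count, timeout=60):
    """Poll download_dir until the expected number of finished files appears."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        in_progress = glob.glob(os.path.join(download_dir, "*.crdownload"))
        finished = [f for f in os.listdir(download_dir) if not f.endswith(".crdownload")]
        if not in_progress and len(finished) >= expected_count:
            return True
        time.sleep(0.5)
    return False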
Other resources for posterity:
How do I get puppeteer to download a file?
Download content using Pyppeteer (will show a 404 unless you have 10k+ reputation)

Telegram command starts whenever another person starts it

So I don't know how to properly ask this question, so it might seem kind of off; sorry about that.
I made a Telegram bot that gets some images from a website and sends them to your chat. However, when a user calls the command, the photos are also sent to the other users that have started the bot.
For instance, if User A calls the command to get the photos, the bot will send them to him as well as to User B, User C and User D, all stacking together as if it were a single call to everyone using the bot.
import requests
import os
from tqdm import tqdm
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin, urlparse
import re
import telebot

API_KEY = os.getenv("API_KEY")
bot = telebot.TeleBot(API_KEY)

url_mainpage = "https://url.com"
soup = bs(requests.get(url_mainpage).content, "html.parser")
full_link = soup.find("h5", class_="elementor-image-box-title")
selectlist = full_link.select(".elementor-image-box-title a")
for a in selectlist:
    global lastchapterlink
    lastchapterlink = a['href']

images = []
stripped_images = []

def download_last_chapter():
    soup = bs(requests.get(lastchapterlink).content, "html.parser")
    images_link = soup.findAll("img")
    for img in images_link:
        images.append(img.get("src"))
    for link in images:
        stripped_images.append(link.strip())
    print(stripped_images)

@bot.message_handler(commands=["last_chapter"])
def send_images(message):
    download_last_chapter()
    for telegram_send in stripped_images:
        try:
            bot.send_photo(message.chat.id, photo=telegram_send)
        except:
            None

bot.polling()
This is the part of the code containing the bot.
Per the API documentation, the bot will reply in whatever channel it sees the message in. Are your users DMing it, or posting in a shared channel that you're all part of? Also, you're not clearing stripped_images between calls; you're just appending the new images to it.
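On that second point, here's a minimal sketch of the handler building a fresh list on every call instead of appending to the module-level images / stripped_images lists (same bs / requests / telebot objects as in your code):

@bot.message_handler(commands=["last_chapter"])
def send_images(message):
    # build a fresh list on every call instead of appending to globals
    soup = bs(requests.get(lastchapterlink).content, "html.parser")
    image_urls = [img["src"].strip() for img in soup.findAll("img") if img.get("src")]
    for image_url in image_urls:
        try:
            bot.send_photo(message.chat.id, photo=image_url)
        except Exception:
            pass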

How can I check a webscraping page with requests realtime (always), (autoupdate)? Python

I'm a young programmer and I have a question.
I have code that checks discount percentages on https://shadowpay.com/en?price_from=0.00&price_to=34.00&game=csgo&hot_deal=true
and I want to make it happen in real time.
Questions:
Is there a way to make it check in real time, or is it only possible by refreshing the page?
If it is by refreshing the page: how can I make it refresh? I saw older answers, but they did not work for me because they only worked in their code.
(I tried calling requests.get() every time the while loop runs, but it doesn't work. Or should it?)
This is the code:
import json
import requests
import time
import plyer
import random
import copy

min_notidication_perc = 26; un = 0; us = ""; biggest_number = 0;

r = requests.get('https://api.shadowpay.com/api/market/get_items?types=[]&exteriors=[]&rarities=[]&collections=[]&item_subcategories=[]&float={"from":0,"to":1}&price_from=0.00&price_to=34.00&game=csgo&hot_deal=true&stickers=[]&count_stickers=[]&short_name=&search=&stack=false&sort=desc&sort_column=price_rate&limit=50&offset=0', timeout=3)

while True:
    # Here is the place where I'm thinking of putting it
    time.sleep(5); skin_list = []; perc_list = []
    for i in range(len(r.json()["items"])):
        perc_list.append(r.json()["items"][i]["discount"])
        skin_list.append(r.json()["items"][i]["collection"]["name"])
    skin = skin_list[perc_list.index(max(perc_list))]; print(skin)
    biggest_number = int(max(perc_list))
    if un != biggest_number or us != skin:
        if int(max(perc_list)) >= min_notidication_perc:
            plyer.notification.notify(
                title=f'-{int(max(perc_list))}% ShadowPay',
                message=f'{skin}',
                app_icon="C:\\Users\\<user__name>\\Downloads\\Inipagi-Job-Seeker-Target.ico",
                timeout=120,
            )
        else:
            pass
    else:
        pass
    us = skin; un = biggest_number
    print(f'id: {random.randint(1, 99999999)}')
    print(f'-{int(max(perc_list))}% discount\n')
When you use requests.get() you retrieve the response for that URL once and it is then closed; calling r.json() again later does not fetch new data. You also don't need to wait before reading the response, since requests blocks until it has arrived, so the time.sleep(5) isn't needed for that purpose.
In order to get the real-time value you'll have to call the page again inside the loop; this is where you can use time.sleep(), so as not to abuse the API.
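In code, that looks roughly like the following sketch: the requests.get() call moves inside the while loop so each iteration fetches a fresh response (the query string here is a trimmed-down version of the one in the question):

import time
import requests

# trimmed-down version of the query string from the question
API_URL = ('https://api.shadowpay.com/api/market/get_items'
           '?price_from=0.00&price_to=34.00&game=csgo&hot_deal=true'
           '&sort=desc&sort_column=price_rate&limit=50&offset=0')

while True:
    r = requests.get(API_URL, timeout=3)   # fetch a fresh snapshot each iteration
    discounts = [item["discount"] for item in r.json()["items"]]
    if discounts:
        print(f'best discount right now: -{int(max(discounts))}%')
    time.sleep(5)                          # wait before hitting the API again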

Web Scraping with Python in combination with asyncio

I've written a script in Python to get some information from a webpage. The code itself runs flawlessly if it is taken out of asyncio. However, since my script runs synchronously, I wanted to make it asynchronous so that it accomplishes the task in the shortest possible time with optimum performance and, obviously, not in a blocking manner. As I have never worked with the asyncio library, I'm seriously confused about how to make it go. I've tried to fit my script within the asyncio process, but it doesn't seem right. If somebody lends a helping hand to complete this, I would really be grateful. Thanks in advance. Here is my erroneous code:
import requests
from lxml import html
import asyncio

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        processing_docs(base_link + titles.attrib['href'])

async def processing_docs(base_link):
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)
    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = link + next_page
        processing_docs(page_link)

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()
Upon execution what I see in the console is:
RuntimeWarning: coroutine 'processing_docs' was never awaited
processing_docs(base_link + titles.attrib['href'])
You need to call processing_docs() with await.
Replace:
processing_docs(base_link + titles.attrib['href'])
with:
await processing_docs(base_link + titles.attrib['href'])
And replace:
processing_docs(page_link)
with:
await processing_docs(page_link)
Otherwise the call just creates a coroutine object that never actually runs, which is exactly what the RuntimeWarning is telling you.
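Putting it together, the relevant parts would look roughly like this sketch (the requests calls are still blocking, exactly as in the question; only the awaits are added):

import asyncio
import requests
from lxml import html

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
    tree = html.fromstring(requests.get(base_link).text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        await processing_docs(base_link + titles.attrib['href'])   # awaited now

async def processing_docs(base_link):
    root = html.fromstring(requests.get(base_link).text)
    for soups in root.cssselect("div.quote"):
        print(soups.cssselect("span.text")[0].text,
              soups.cssselect("small.author")[0].text)
    next_page = root.cssselect("li.next a")
    if next_page:
        await processing_docs(link + next_page[0].attrib['href'])  # awaited now

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()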
