I have the following code to download something (in this example it's a video).
from playwright.sync_api import sync_playwright
import func
import os, time, shutil
import requests
from tqdm.auto import tqdm
def download():
    with page.expect_download() as download_info:
        page.click("text=720p (MP4")
    download = download_info.value
    with requests.get(download.url, stream=True) as r:
        # check header to get content length, in bytes
        total_length = int(r.headers.get("Content-Length"))
        # implement progress bar via tqdm
        with tqdm.wrapattr(r.raw, "read", total=total_length, desc="") as raw:
            # save the output to a file
            with open(f"{os.path.basename(r.url)}", 'wb') as output:
                shutil.copyfileobj(raw, output)
    download.save_as(os.path.join(func.report_folder_path, download.suggested_filename))
# initialize navigation
with sync_playwright() as p:
    browser = p.chromium.launch(channel="msedge", headless=False)
    context = browser.new_context(accept_downloads=True)
    page = context.new_page()
    # go to the video download page
    print("Entering on download page")
    page.goto('https://www.jw.org/en/library/videos/#en/mediaitems/StudioFeatured/docid-702023003_1_VIDEO')
    page.wait_for_selector("text=Download")
    page.locator("text=Download").first.click()
    download()
    time.sleep(3)
    print('end')
I'm trying to show a file download progress bar while the download is performed by Playwright. This code downloads the file into the directory correctly, and although it does not show a timeout error, the download function only returns after the default 30-second timeout expires.
Another thing I noticed is that the progress bar does not reflect the real download status: it shows the download at 100% even though the real file is still downloading.
Any idea how to fix these problems?
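One likely cause (an assumption, since only this snippet is shown): the file is transferred twice, once by `requests` through `tqdm.wrapattr` and once by the browser download that `download.save_as` waits on, so the bar tracks the fast `requests` copy while the function then blocks on the browser's own copy. A sketch that streams a single copy with `requests` and updates the bar per chunk; the function name and chunk size are my own choices:

```python
import os
import requests
from tqdm.auto import tqdm

def download_with_progress(url, dest_dir="."):
    # stream the response so the body is read chunk by chunk
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        # Content-Length may be missing; fall back to an indeterminate bar
        total = int(r.headers.get("Content-Length", 0)) or None
        path = os.path.join(dest_dir, os.path.basename(r.url))
        with tqdm(total=total, unit="B", unit_scale=True) as pbar, \
                open(path, "wb") as output:
            for chunk in r.iter_content(chunk_size=8192):
                output.write(chunk)
                pbar.update(len(chunk))  # advance the bar by the bytes written
    return path
```

With something like this, the URL from `download_info.value` could be passed straight in and the `download.save_as` call dropped, so the file is only transferred once (assuming the URL is fetchable outside the browser session, e.g. no cookies required).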
Related
I'm creating a small script to download the audio from YouTube videos.
Every time you download a track, I use a tqdm bar to display download info.
The first time you download, everything works fine, but the second time my bar is completely destroyed :(. I really don't know what's happening with it...
(I think the bar doesn't update correctly.)
Here's the code that handles the bar and downloads the sound.
Thanks for your time :)
import pafy
import tqdm

def DownloadAudioFromUrl(url):
    print("Getting the URL...")
    vid = pafy.new(url)
    print("Done")
    print("Getting best quality...")
    stream = vid.getbestaudio()
    fileSize = stream.get_filesize()
    print("Done")
    print("Downloading: " + vid.title + " ...")
    with tqdm.tqdm(total=fileSize, unit_scale=True, unit='B', initial=0) as pbar:
        stream.download("Download/", quiet=True, callback=lambda _, received, *args: UpdateBar(pbar, received))
        print("Done")
        ResetProgressBar(pbar)
    WebmToMp3()

def ResetProgressBar(bar):
    bar.reset()
    bar.clear()
    bar.close()

# i used these last time i tried, i don't understand how they work :/
def UpdateBar(pbar, current_received):
    global previous_received
    diff = current_received - previous_received
    pbar.update(diff)
    previous_received = current_received
So I tried to update the bar with "reset", "clear" and "stop", but it changed nothing.
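One likely culprit (an assumption, since the full script isn't shown): `previous_received` is a module-level global that keeps the final byte count from the previous download, so the first `diff` of the second download is hugely negative and wrecks the bar. Keeping the count in a closure so each download starts from zero avoids that; the helper name below is my own:

```python
from tqdm import tqdm

def make_progress_callback(pbar):
    # keep the previous byte count in a closure instead of a module-level
    # global, so every new download starts counting from zero
    previous = 0
    def update(received):
        nonlocal previous
        pbar.update(received - previous)  # feed tqdm only the delta
        previous = received
    return update

# hypothetical usage mirroring the question's pafy call:
# with tqdm(total=fileSize, unit='B', unit_scale=True) as pbar:
#     cb = make_progress_callback(pbar)
#     stream.download("Download/", quiet=True,
#                     callback=lambda _, received, *args: cb(received))
```

Resetting the global to 0 at the top of `DownloadAudioFromUrl` would work too; the closure just removes the shared state entirely.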
Starting point is Spyder IDE.
>Spyder IDE (5.1.0)
>
>The Scientific Python Development Environment | Spyder-IDE.org
>
>Python 3.8.5 64-bit | Qt 5.12.9 | PyQt5 5.12.3 | Linux 5.4.0-81-generic
What do I want to do?
Scrape a tricky blog; it seems Blogspot is obfuscating a lot more, but within Spyder I sometimes find that I cannot even scrape my own home page...
import asyncio
from requests_html import AsyncHTMLSession, HTML, HTMLSession
from bs4 import BeautifulSoup as bs
import re
import os, os.path
from pathlib2 import Path
from collections import OrderedDict as Odict
from datetime import datetime, date, timedelta
import pytz
import unicodedata
import sys
# asession = AsyncHTMLSession()
ass = AsyncHTMLSession()
sss = HTMLSession()
url='http://localhost/index.html'
def syncurl(session=None, url=None):
    r = session.get(url)
    return r

async def asyncurl(session=None, url=None):
    r = await session.get(url)
    # if r.status_code == 200:
    #     await r.html.arender()
    return r

def gurl(ass, url):
    fiz = lambda : asyncurl(ass, url)
    foz = ass.run(fiz)
    return foz
So if I run this in Spyder and then execute the call below, I get the expected 'loop already running' error.
gurl(ass,url)
Traceback (most recent call last):
  File "<ipython-input-2-ebc91fe79d44>", line 1, in <module>
    gurl(ass,url)
  File "/home/user/PycharmProjects/blogscrape/BlogScraping/asynctest.py", line 38, in gurl
    foz = ass.run(fiz)
  File "/opt/anaconda3/lib/python3.8/site-packages/requests_html.py", line 774, in run
    done, _ = self.loop.run_until_complete(asyncio.wait(tasks))
  File "/opt/anaconda3/lib/python3.8/asyncio/base_events.py", line 592, in run_until_complete
    self._check_running()
  File "/opt/anaconda3/lib/python3.8/asyncio/base_events.py", line 552, in _check_running
    raise RuntimeError(f'This event loop is already running : {self._thread_id}')
RuntimeError: This event loop is already running : 139750638774080
I'm not trying to reinvent the wheel here, and I'm sure many others have this issue, but so far I've not seen a concise answer, (other than it's a Spyder bug etc).
I just want it to work in Spyder, (principally, because I like to play around with pandas to look at the results).
I suppose one way would be to run the thing as a stand alone script saving the results into a pickle, and THEN use spyder to reload the dataframe and use that. But, hey, why is that necessary?
The principal problem is the lack of clarity in requests-html. The error is very opaque to anyone who is simply trying to work around the original problem of ..
RuntimeError: Cannot use HTMLSession within an existing event loop.
Use AsyncHTMLSession instead.
And yes, I have tried to Google this problem, but the answers always start talking about 'asyncio' internals. I'm reading the 'requests-html' help; anything beyond that is above my pay-grade (currently zero).
So any advice, please?
(only simple stuff from asyncio that a simple IC designer could understand).
Thanks @Daniel,
Yes, that does seem to fix the issue shown above. It is not 100% perfect, though: I sometimes get a timeout error that I'm not sure about, but I no longer get the event-loop error.
Just to put it all in one place.. After installing with,
pip install nest_asyncio
Just add the following to the python code.
import nest_asyncio
nest_asyncio.apply()
This is enough to get the code running within Spyder, (as this was the original issue).
Adding an extra sleep / timeout in the code for 'asyncurl' allows the script to run, albeit slowly, so don't try and run too many calls in the script. The above function is modified as follows.
async def asyncurl(session=None, url=None):
    r = await session.get(url)
    await asyncio.sleep(5.0)
    # if r.status_code == 200:
    await r.html.arender(timeout=20000)
    return r
I am trying to download files in Python using the wget module. I understand that it's supposed to have several progress bar modes, but none actually shows in the console.
I could not find any documentation for this module.
import wget
from pathlib import Path
print('Beginning download...')
url = 'https://storage.googleapis.com/meirtvmp3/archive/hebrew/mp3/sherki/daattvunot/idx_69115.mp3'
wget.download(url)
The file downloads but no progress bar shows.
The source code for the module does have a reference to it. But when I try it in my code, Python can't find a reference to it.
EDIT
I tested the script from the terminal and it worked as expected. I guess it's a PyCharm/venv bug.
This works for me,
# create this bar_progress method, which wget invokes automatically
import sys
import wget

def bar_progress(current, total, width=80):
    progress_message = "Downloading: %d%% [%d / %d] bytes" % (current / total * 100, current, total)
    # Don't use print() as it would print on a new line every time.
    sys.stdout.write("\r" + progress_message)
    sys.stdout.flush()

# Now use it like below,
url = 'http://url_to_download'
save_path = "/home/save.file"
wget.download(url, save_path, bar=bar_progress)
This is my first posting, so please forgive any lack of decorum.
I am building a SeeingWand as outlined in MagPi issue #71.
I have installed and tested all the hardware. Then I installed the Python code; the original code was for Python 2.7, and I have updated it to run under Python 3, but I get a strange error when I run it:
The system displays that the http module does not have a .client attribute.
The documentation says it does. I have tried the .client and .server attributes; both give the same error. What am I doing wrong?
I have tried several coding variations and several builds of the Raspberry Pi OS (Raspbian); most give the same errors.
import picamera, http, urllib, base64, json, re
from os import system
from gpiozero import Button

# CHANGE {MS_API_KEY} BELOW WITH YOUR MICROSOFT VISION API KEY
ms_api_key = "{MS_API_KEY}"

# camera button - this is the BCM number, not the pin number
camera_button = Button(27)

# setup camera
camera = picamera.PiCamera()

# setup vision API
headers = {
    'Content-Type': 'application/octet-stream',
    'Ocp-Apim-Subscription-Key': ms_api_key,
}
params = urllib.parse.urlencode({
    'visualFeatures': 'Description',
})

# loop forever waiting for button press
while True:
    camera_button.wait_for_press()
    camera.capture('/tmp/image.jpg')
    body = open('/tmp/image.jpg', "rb").read()
    try:
        conn = http.client.HTTPSConnection('westcentralus.api.cognitive.microsoft.com')
        conn.request("POST", "/vision/v1.0/analyze?%s" % params, body, headers)
        response = conn.getresponse()
        analysis = json.loads(response.read())
        image_caption = analysis["description"]["captions"][0]["text"].capitalize()
        # validate text before system() call; use subprocess in next version
        if re.match("^[a-zA-Z ]+$", image_caption):
            system('espeak -ven+f3 -k5 -s120 "' + image_caption + '"')
        else:
            system('espeak -ven+f3 -k5 -s120 "i do not know what i just saw"')
        conn.close()
    except Exception as e:
        print(e.args)
Expected results are:
when I push button 1, I expect the camera to take a picture
when I push button 2, I expect to access MSFT Azure to identify the picture using AI
the final output is for the Wand to access the audio HAT and describe what the Wand is "looking" at
try adding an import like this:
import http.client
Edit: http is a Python package. Even if the package contains some modules, it does not automatically import those modules when you import the package, unless the __init__.py for that package does so on your behalf. In the case of http, the __init__.py is empty, so you get nothing gratis just for importing the package.
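A minimal illustration of that point; the subprocess probe just runs a fresh interpreter so the parent process's imports can't interfere:

```python
import sys
import subprocess

# In a fresh interpreter, importing the bare package does not bind http.client:
probe = "import http; print(hasattr(http, 'client'))"
result = subprocess.run([sys.executable, "-c", probe],
                        capture_output=True, text=True)
print(result.stdout.strip())  # False

# Importing the submodule explicitly is what makes it available:
import http.client
print(hasattr(http, "client"))  # True
```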
I'm writing some tests and I'm using the Firefox webdriver with a FirefoxProfile to download a file from an external URL, but I need to read that file as soon as it finishes downloading to retrieve some specific data.
I set my profile and driver like this:
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", '/some/path/')
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "text/plain, application/vnd.ms-excel, text/csv, text/comma-separated-values, application/octet-stream")
ff = webdriver.Firefox(firefox_profile=fp)
Is there some way to know when the file finishes downloading, so that I know when to call the reader function without having to poll the download directory, waiting with time.sleep or using any Firefox add-on?
Thanks for any help :)
You could try hooking the file up to a file object as it downloads, using it like a stream buffer and polling it for the data you need; or you could monitor for download completion yourself, either by waiting for the file to reach the expected size or by assuming it is complete once no new data has been added for a certain amount of time.
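A minimal sketch of that last idea, returning once the file has stopped growing for a quiet period; the function name, timings, and exception choice are my own:

```python
import os
import time

def wait_for_download(path, quiet_period=2.0, timeout=60.0):
    """Return `path` once it exists and its size has stopped growing
    for `quiet_period` seconds; raise TimeoutError after `timeout`."""
    deadline = time.time() + timeout
    last_size, last_change = -1, time.time()
    while time.time() < deadline:
        size = os.path.getsize(path) if os.path.exists(path) else -1
        if size != last_size:
            # size changed (or file appeared): restart the quiet-period clock
            last_size, last_change = size, time.time()
        elif size >= 0 and time.time() - last_change >= quiet_period:
            return path
        time.sleep(0.25)
    raise TimeoutError(f"{path} still changing after {timeout}s")
```

This still polls the filesystem, which the question hoped to avoid, but it needs no add-ons and works with any browser.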
Edit:
You could try to look at the download tracking db in the profile folder as referenced here. Looks like you can wait for your file to have status 1.
I like to use inotify to watch for these kinds of events. Some example code,
import os
from pyinotify import (
    EventsCodes,
    ProcessEvent,
    Notifier,
    WatchManager,
)

class EventManager(ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        file_path = os.path.join(event.path, event.name)
        # do something with the file; you might want to wait a second here and
        # also test for existence, because ff might be making temp files

wm = WatchManager()
notifier = Notifier(wm, EventManager())
wdd = wm.add_watch('/some/path', EventsCodes.ALL_FLAGS['IN_CLOSE_WRITE'], rec=True)

while True:
    try:
        notifier.process_events()
        if notifier.check_events():
            notifier.read_events()
    except:
        notifier.stop()
        raise
The notifier decides which method to call on the event manager based on the name of the event, so in this case we are only watching for IN_CLOSE_WRITE events.
It's far from ideal; however, with Firefox you could check the target folder for the presence of a .part file, which exists while the download is still in progress (with other browsers you can do something similar).
A while loop will then halt everything while waiting for the download to complete:
import os
import time

def test_for_partfile():
    part_file = False
    dir = "C:\\Downloads"
    filelist = os.listdir(dir)
    for partfile in filelist:
        if partfile.endswith('.part'):
            part_file = True
    return part_file

while test_for_partfile():
    time.sleep(15)