python: streaming HTTP requests / loading a website partially

I would like to know if there is a way in Python to "stream" HTTP requests in order to avoid loading the whole page.
What I'm currently doing to get the HTML data of a given URL is this:
# build the request and download the entire response body
req = urllib2.Request(url)
response = urllib2.urlopen(req)
return response.read()
This way I'm always loading the whole website, but since I only need a small part of it, I'm using more bandwidth than I need to. If I could stop loading the page after finding a specific value/expression, or even better if I could specify where to start/end loading the page (e.g. start at character #3000 and load until #5000), I'd save a lot of bandwidth.
thanks in advance
tschery

This Stack Overflow answer shows how to do partial HTTP loading in Python. You can also use response.read(N) (N being the number of bytes to read), but there is no guarantee that exactly the amount you specify is downloaded.
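For illustration, here is a minimal sketch of both ideas with urllib2, assuming url holds the page you want: sending a Range header asking the server for bytes 3000-5000 only (not every server honours it), and reading only the first N bytes of the response.
import urllib2

# Ask the server for bytes 3000-5000 only; check the status code
# (206 = partial content, 200 = the server ignored the Range header).
req = urllib2.Request(url, headers={'Range': 'bytes=3000-5000'})
response = urllib2.urlopen(req)
print response.getcode()
data = response.read()

# Alternatively, read only the first 5000 bytes and stop;
# the remainder of the body is never pulled off the socket.
response = urllib2.urlopen(url)
data = response.read(5000)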

Related

Python script to access one of the resource responses of a given webpage

I'm trying to get the JSON response of one of the resources (line highlighted) that is automatically called by the URL I'm calling (first line).
I've tried to reconstruct the response's URL (highlighted), but I failed (I can't find where some parts of it come from).
Even if I could, it seems like there are numerous other parameters to specify (cookies, etc.) and a bunch of previous resources are called (that might include pieces of the puzzle).
Is there something like requests that can "cascade" the resources and let me choose the one whose response I want saved?
PS: there is far more in the JSON response I want to save than in the content shown in a web browser for the page whose URL I query.

How can I download free videos from YouTube or any other site with Python, without an external video downloader?

Problem
Hi! Most answers refer to using pytube when using Python. But the problem is that pytube doesn't work for many YouTube videos now. It's outdated, and I always get errors. I also want to be able to get other free videos from sites that are not YouTube.
And I know there are free sites and paid programs that let you put in a URL, and they'll download the video for you. But I want to understand the process of what's happening.
The following code works for easy things. Obviously it can be improved, but let's keep it super simple...
import requests
good_url = 'https://www.w3schools.com/tags/movie.mp4'
bad_url = 'https://r2---sn-vgqsknes.googlevideo.com/videoplayback?expire=1585432044&ei=jHF_XoXwBI7PDv_xtsgN&ip=12.345.678.99&id=743bcee1c959e9cd&itag=244&aitags=133%2C134%2C135%2C136%2C137%2C160%2C242%2C243%2C244%2C247%2C248%2C278%2C298%2C299%2C302%2C303&source=youtube&requiressl=yes&mh=1w&mm=31%2C26&mn=sn-vgqsknes%2Csn-ab5szn7z&ms=au%2Conr&mv=m&mvi=4&pl=23&pcm2=yes&initcwndbps=3728750&vprv=1&mime=video%2Fwebm&gir=yes&clen=22135843&dur=283.520&lmt=1584701992110857&mt=1585410393&fvip=5&keepalive=yes&beids=9466588&c=WEB&txp=5511222&sparams=expire%2Cei%2Cip%2Cid%2Caitags%2Csource%2Crequiressl%2Cpcm2%2Cvprv%2Cmime%2Cgir%2Cclen%2Cdur%2Clmt&sig=ADKhkGMwRgIhAI3WtBFTf4kklX4xl859U8yzqavSzu-2OEn8tvHPoqAWAiEAlSDPhPdb5y4xPxPoXJFCNKr-h2c4jxKU8sAaaxxa7ok%3D&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=ABSNjpQwRQIhAJkFK4xhfLraysF13jSZpHCoklyhJrwLjNSCQ1v7IzeXAiBLpVpYf72Gp-dlvwTM2tYzMcVl4Axzm2ARd7fN1gPW-g%3D%3D&alr=yes&cpn=EvFJNwgO-zNQOWkz&cver=2.20200327.05.01&ir=1&rr=12&fexp=9466588&range=15036316-15227116&rn=14&rbuf=0'
r = requests.get(good_url, stream=True)
with open('my_video.mp4', 'wb') as file:
    file.write(r.content)
This works. But when I want a YouTube video (and I obviously can't use a regular YouTube URL, because the document request is different from the video request)...
Steps taken
I'll check the network tab in the dev tools, and it's all filled with a bunch of XHR requests. The headers for them always have the very long URL for the request, accept-ranges: bytes, and content-type: video/webm, or something similar for mp4, etc.
Then I'll copy the URL for that XHR, change the file extension to the correct one, and run the program.
Result
Sometimes that downloads a small chunk of the video with no sound (a few seconds long), and other times it will download a bigger chunk but with no image. But I want the entire video with sound.
Question
Can someone please help me understand how to do this, and explain what's happening, whether it's on another site or YouTube?
Why does good_url work, but not bad_url? I figured it might be a timeout thing, so I got that XHR and immediately tested it from Python, but still no luck.
A related question (don't worry about answering this one, unless required)...
Sometimes YouTube has blob URLs in the HTML too, for example: <video src='blob:https://www.youtube.com/f4484c06-48ed-4531-a6ee-6a3ae0291d26'>...
I've read various answers about what blobs are, and I'm not understanding it, because it looks to me like a blob URL is doing an XHR to change a URL in the DOM, as if it were trying to do the equivalent of an internal redirect on a web server for a private file served based on view/object-level permissions. Is that what's happening? Because I don't see the point, especially when these videos are free. The way I've done something like that, such as with lazy loading, is to have a data-src attribute with the correct value and an onload event handler that runs a function switching the data-src value to the src value.
You can try this video:
https://www.youtube.com/watch?v=GAvr5_EtOnI
From the bad_url, remove &range=3478828-4655264 and try this code:
import requests
good_url = 'https://www.w3schools.com/tags/movie.mp4'
bad_url = 'https://r1---sn-gvcp5mp5u5-q5js.googlevideo.com/videoplayback?expire=1631119417&ei=2ZM4Yc33G8S8owPa1YiYDw&ip=42.0.7.242&id=o-AKG-sNstgjok92lJp_o4pF_iJ2MWD4skzEvFcTLl8LX8&itag=396&aitags=133,134,135,136,137,160,242,243,244,247,248,278,394,395,396,397,398,399&source=youtube&requiressl=yes&mh=gB&mm=31,29&mn=sn-gvcp5mp5u5-q5js,sn-i3belney&ms=au,rdu&mv=m&mvi=1&pl=24&initcwndbps=82500&vprv=1&mime=video/mp4&ns=mh_mFY1G7qq0apTltxepCQ8G&gir=yes&clen=7874090&dur=213.680&lmt=1600716258020131&mt=1631097418&fvip=1&keepalive=yes&fexp=24001373,24007246&beids=9466586&c=WEB&txp=5531432&n=Q3AfqZKoEoXUzw&sparams=expire,ei,ip,id,aitags,source,requiressl,vprv,mime,ns,gir,clen,dur,lmt&lsparams=mh,mm,mn,ms,mv,mvi,pl,initcwndbps&lsig=AG3C_xAwRQIgYZMQz5Tc2kucxFsorprl-3e4mCxJ3lpX1pbX-HnFloACIQD-CuHGtUeWstPodprweaA4sUp8ZikyxySZp1m3zlItKg==&alr=yes&sig=AOq0QJ8wRQIhAJZg4q9vLal64LO6KAyWkpY1T8OTlJRd9wNXrgDpNOuQAiB77lqm4Ka9uz2CAgrPWMSu6ApTf5Zqaoy5emABYqCB_g==&cpn=5E-Sqvee9UG2ZaNQ&cver=2.20210907.00.00&rn=19&rbuf=90320'
r = requests.get(bad_url, stream=True)  # bad_url here has the &range=... parameter removed
with open('my_video.mp4', 'wb') as file:
    file.write(r.content)
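Since the full video can be large, a streamed write avoids holding the whole response in memory. This is just a minimal sketch of the same download using iter_content (the 1 MB chunk size is an arbitrary choice):
import requests

r = requests.get(bad_url, stream=True)
with open('my_video.mp4', 'wb') as file:
    # write the body piece by piece instead of buffering it all in memory
    for chunk in r.iter_content(chunk_size=1024 * 1024):
        file.write(chunk)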

Doesn't get results after a while of scraping (Python)

I'm trying to scrape a large database for a project of mine; however, I find that after I scrape a relatively big amount of data, I stop receiving some of the XML information I'm interested in. I'm not sure if it's because the server is limiting my access or because my script scrapes too fast.
I put a "sleep" line between the scraping loops to overcome this, but as I try to reach more data it doesn't work anymore.
I guess this is a known problem in web scraping, but I'm very new to this field, so any suggestion will be very helpful.
Note: I tried requests with some free proxies, but that didn't work either (still some data missing). I also checked the original website, and it does have the data I seek.
Edit: It looks like most of the data I'm missing comes from specific attributes that don't load as fast as the rest of the data. So I think I'm looking for a way to tell whether the XML I'm looking for has already loaded.
I'm using lxml and requests.
Thanks.
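For what it's worth, here is a minimal sketch of the throttle-and-retry approach described above, using requests and lxml; fetch_with_retry, the URL and the XPath are placeholder names, not from the original post. Note that if the missing attributes are filled in by JavaScript after the page loads, re-requesting the raw HTML will never see them.
import time
import requests
from lxml import html

def fetch_with_retry(url, xpath, retries=3, delay=2.0):
    # Re-request the page a few times, pausing between attempts,
    # until the elements we expect are actually present.
    for attempt in range(retries):
        tree = html.fromstring(requests.get(url).content)
        nodes = tree.xpath(xpath)
        if nodes:
            return nodes
        time.sleep(delay)  # throttle before retrying
    return []  # give up: the data never appeared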

requests.get in Python 3 not working. Need help to make it get data from all pages of a search query

My code to get all the data from the URL with requests.get isn't working - it only retrieves one page's worth of data (30 records). How do I have to modify my code to make sure I get data from all the pages?
import requests

NAEYCData = requests.get('http://families.naeyc.org/search_programs/results/0/NJ/0/100/0/0/0/us/0?page=')
openFile = open('NAEYCData', 'wb')
for chunk in NAEYCData.iter_content(100000):
    openFile.write(chunk)
openFile.close()
The actual page only provides 30 results at a time. Each subsequent page is accessed with a different argument to page in the URL (the first page is page=0, the second is page=1, etc.).
You could download each page individually, but frankly, the better solution (for you and their server) is probably to download the CSV linked on the search results page you're trying to grab. It contains the same information as a single CSV file, requires fewer connections and less bandwidth to transfer, and is easy to parse programmatically (easier than HTML in general, and much easier than parsing nine separate HTML pages and gluing the results back together).
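If you do want to fetch the HTML pages directly, a minimal sketch of the per-page loop looks like this (the count of nine pages comes from the answer above and is otherwise an assumption):
import requests

base = 'http://families.naeyc.org/search_programs/results/0/NJ/0/100/0/0/0/us/0?page={}'
with open('NAEYCData', 'wb') as out:
    for page in range(9):  # page=0 .. page=8
        resp = requests.get(base.format(page))
        for chunk in resp.iter_content(100000):
            out.write(chunk)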

Using Beautifulsoup and regex to traverse javascript in page

I'm fetching webpages with a bunch of JavaScript on them, and I'm interested in parsing through the JavaScript portion of the pages for certain relevant info. Right now I have the following code in Python/BeautifulSoup/regex:
scriptResults = soup('script', {'type': 'text/javascript'})
which yields a list of scripts, through which I can loop to search for the text I'd like:
import re

pattern = re.compile(r'patterniwant')  # placeholder for the pattern you're after
for script in scriptResults:
    for block in script:
        match = pattern.search(block)
        if match:
            value = match.group(0)  # extract the matched text from the line
(The pattern and the extraction step are just placeholders, of course.)
I was wondering if there was a better way to just use regex to find the pattern in the soup itself, searching only through the scripts? My implementation works, but it just seems really clunky, so I wanted something more elegant and/or efficient and/or Pythonic.
Thanks in advance!
A lot of websites have client-side data in JSON format. In that case I would suggest extracting the JSON part from the JavaScript code and parsing it with Python's json module (e.g. json.loads). As a result you will get a standard dictionary object.
Another option is to check in your browser what sort of AJAX requests the application makes. Quite often those also return structured data as JSON.
I would also check whether the page has any structured data already available (e.g. OpenGraph, microformats, RDFa, RSS feeds). A lot of websites include this to improve their SEO and to integrate better with social network sharing.
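A minimal sketch of that first suggestion, pulling a JSON object assigned to a JavaScript variable out of a script tag and parsing it with json.loads; the variable name someData and the regex are placeholders for whatever the page actually contains, and the regex is deliberately naive (it will break on nested braces):
import json
import re

for script in scriptResults:
    text = script.string or ''
    # look for something like: var someData = {...};
    match = re.search(r'var\s+someData\s*=\s*(\{.*?\})\s*;', text, re.DOTALL)
    if match:
        data = json.loads(match.group(1))  # now a regular Python dict
        break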
