download video with python webscraping - python

Is it possible to download a video with controlslist="nodownload", and if so, how? There are poster and src attributes with URLs, but when I tried to open them I only got "Bad URL hash".
the whole thing looks like this: <video controlslist="nodownload" loop="" poster="https://scontent-fra3-1.xx.fbcdn.net/v/t39.35426-6/306972082_627455715702176_816739884095398058_n.jpg?_nc_cat=106&ccb=1-7&_nc_sid=cf96c8&_nc_ohc=gGkqXkxok9sAX-lD2Df&_nc_ht=scontent-fra3-1.xx&oh=00_AfAwpoDJdXRX_30nbuDBub38X9EcpUWJnI4yRPZ2PI1WUA&oe=63D59017" src="https://video-fra3-1.xx.fbcdn.net/v/t42.1790-2/307608880_755324925546967_7278828413698270618_n.?_nc_cat=106&ccb=1-7&_nc_sid=cf96c8&_nc_ohc=2s1fEESN7NUAX_El4Hb&_nc_oc=AQmgHTnQH8pGCmL7kHQnvHKKzkDFJc-6kTQazbteeA1cA21gUhBHplAVKoQmgAfQa2n1lKhOdkZlAXTbObUQycEp&_nc_ht=video-fra3-1.xx&oh=00_AfAOYW5RS8PEp52dlofPE3OtHjgd2SM0dxvk-dhnBIK8BQ&oe=63D18E33" width="100%" height="175"></video>
any help is appreciated <3

You can fix the "Bad URL hash" error by replacing every &amp; entity in the copied URL with a plain &; a simple string-replace in your language of choice does this.
Fixing that issue brings up a new error about an expired/invalid signature. The URL is signed and changes slightly every few hours, so the solution is to always extract the latest version of the link first, and only then attempt to save the file. If you're lucky, you won't need a step three.
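The two steps above can be sketched as follows. This is a minimal sketch, not a definitive implementation: clean_copied_url undoes the entity escaping, and save_video assumes the third-party requests package and a freshly re-extracted src URL.

```python
def clean_copied_url(url):
    # A src attribute copied out of HTML markup often contains "&amp;"
    # entities; the server expects plain "&" separators.
    return url.replace("&amp;", "&")

def save_video(src_url, out_path="video.mp4"):
    # Hypothetical helper: src_url must be freshly re-extracted from the
    # page, because the signature in its query string expires after hours.
    import requests  # third-party: pip install requests
    resp = requests.get(clean_copied_url(src_url), stream=True, timeout=30)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 16):
            f.write(chunk)
```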

Related

Python Selenium Finding Element With Dynamic URL and Name with Little HTML

Hey everyone! Today I am working with Python 3 and the Selenium web driver. I am running into a problem with finding an element. Here is the HTML:
<a _ngcontent-pxc-c302="" class="name ng-star-inserted" href="/person/20d4a795d3fb43bdbee7e480df27b05b">michele regina</a>
The goal is to click on the first name that appears in a listed column.
The name changes with every page refresh. This creates two problems: the link text is the name, which changes, and the href also changes for each name, except for the part that says /person/.
I tried the following:
driver.find_element_by_css_selector("a.name ng-star-inserted").click()
driver.find_element_by_class_name("name ng-star-inserted").click()
Both resulted in an "element is not clickable" error.
I also tried xpath by copying and pasting the entire XPath copied from Google inspector... lol
driver.find_element_by_xpath('/html/body/app-root/app-container/div/div/tc-app-view/div/div/div/ng-component/mat-sidenav-container/mat-sidenav-content/div/div[2]/tc-person-table-container/tc-collapse-expand[1]/section/tc-person-table/table/tbody/tr[1]/td[2]/tc-person-table-cell/a').click()
and it kind of works sometimes, which is weird, but I am sure there has to be a better way. Thank you so much in advance! Any help is appreciated! Also, I am very new to Python, I know zero HTML, and I am also new to Stack! Thanks!
From the given full XPath, the a tag you are looking for sits inside table, section, and div elements. Look for an id on one of those ancestor tags and trace down to the a from there.
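A sketch of two selector strategies that survive the changing name and key. Assumptions: Selenium 4's find_element(by, selector) API, where By.CSS_SELECTOR is literally the string "css selector"; note that class="name ng-star-inserted" is two separate classes, so they must be chained in CSS.

```python
# Two classes in the class attribute -> chain them with dots, and anchor
# on the stable "/person/" href prefix since the name and key change.
PERSON_CSS = "a.name.ng-star-inserted[href^='/person/']"
PERSON_XPATH = "//a[starts-with(@href, '/person/')]"

def click_first_person(driver):
    # Selenium 4 equivalent: driver.find_element(By.CSS_SELECTOR, PERSON_CSS)
    driver.find_element("css selector", PERSON_CSS).click()
```

find_element returns the first match in document order, which is exactly the "first name in the column" the question asks for.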

Getting error when i try to click on the link text using Selenium Python

It is an href. I am trying to click on it using the code below; however, Selenium is not able to find the link text. The page has no frames and the link is in the same window. Not sure what is going on.
self.driver.find_element_by_link_text("UNITED WAY OF EASTERN UTAH").click()
This is the screenshot of the element code:
Wild guess here, but has the page fully loaded (including any content created by dynamic code, e.g. JavaScript) before you try to click on the link? If the link is created after you try to find it, then obviously it will be missing. Try putting a time.sleep before you try to find it.
Took a closer look at the screenshot, and this should help: you are using find_element_by_link_text, which, if I am not mistaken, requires a complete match between the provided text and the text in the link. The text in your link is not an exact match, so you should use find_element_by_partial_link_text instead.
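Both suggestions combined might look like the sketch below. It polls for the link instead of using a fixed time.sleep, and "partial link text" is the string behind Selenium's By.PARTIAL_LINK_TEXT constant; the driver object and the timeout are assumptions.

```python
import time

def click_link_when_ready(driver, partial_text="UNITED WAY", timeout=10):
    # Poll until the dynamically created link exists, then click it.
    deadline = time.time() + timeout
    while True:
        try:
            driver.find_element("partial link text", partial_text).click()
            return
        except Exception:
            if time.time() > deadline:
                raise
            time.sleep(0.5)
```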

Issues downloading full HTML of webpage with Python

I'm working on a project where I need all of the game IDs found in the current scores section of http://www.nhl.com/ in order to download content and parse stats for each game. I want to be able to get all current game IDs in one go, but for some reason I'm unable to download the full HTML of the page, no matter how I try. I'm using requests and beautifulsoup4.
Here's my problem:
I've determined that the particular tags I'm interested in are div's where the CSS class = 'scrblk'. So, I wrote a function to pass into BeautifulSoup.find_all() to give me, specifically, blocks with that CSS class. It looks like this:
def find_scrblk(css_class):
    return css_class is not None and css_class == 'scrblk'
so, when I actually went to the web page in Firefox and saved it, then loaded the saved file in beautifulsoup4, I did the following:
>>> soup = bs(open('nhl.html'))
>>> soup.find_all(class_=find_scrblk)
[<div class="scrblk" id="hsb2015010029"> <div class="defaultState"....]
and everything was all fine and dandy; I had all the info I needed. However, when I tried to download the page using any of several automated methods I know, this returned simply an empty list. Here's what I tried:
- using requests.get() and saving the .text attribute in a file
- using the iter_content() and iter_lines() methods of the request object to write to the file piece by piece
- using wget to download the page (through subprocess.call()) and opening the resultant file. For this option, I was sure to use the --page-requisites and --convert-links flags so that I downloaded (or so I thought) all the necessary data.
With all of the above, I was unable to parse out the data that I need from the HTML files; it's as if they weren't being completely downloaded or something, but I have no idea what that something is or how to fix it. What am I doing wrong or missing here? I'm using python 2.7.9 on Ubuntu 15.04.
All of the files can be downloaded here:
https://www.dropbox.com/s/k6vv8hcxbkwy32b/nhl_html_examples.zip?dl=0
As the comments on your question state, you have to rethink your approach. What you see in the browser is not what the response contains: the site uses JavaScript to load the information you are after, so look more carefully at the response you actually receive to find out what it holds.
In the future, to diagnose such problems, open the site in Chrome with JavaScript disabled (you can toggle this in the developer console settings). You will then see whether you are facing JS-rendered content or whether the page itself contains the values you are looking for.
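An alternative to the browser check: fetch the raw HTML in code and test whether the fragment you saw in the browser is actually present in the server's response. A sketch using the Python 3 stdlib (the question's requests.get(url).text would give the same raw HTML):

```python
import urllib.request

def fetch_raw_html(url):
    # The raw server response, before any JavaScript has run.
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")

def has_fragment(html, fragment='class="scrblk"'):
    # If this is False for the fetched page but True for the browser-saved
    # copy, the divs you want are injected by JavaScript after load.
    return fragment in html
```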
And by the way, what you are doing is against the Terms of Service of the NHL website (according to Section 2, Prohibited Content and Activities):
Engage in unauthorized spidering, scraping, or harvesting of content or information, or use any other unauthorized automated means to compile information;

Cut and resubmit url in python

I'm new to python and trying to figure this out, so sorry if this has been asked. I couldn't find it and don't know what this may be called.
So the short of it. I want to take a link like:
http://www.somedomainhere.com/embed-somekeyhere-650x370.html
and turn it into this:
http://www.somedomainhere.com/somekeyhere
The long of it, I have been working on an addon for xbmc that goes to a website, grabs a url, goes to that url to find another url. Basically a url resolver.
So the program searches the site and comes up with somekeyhere-650x370.html. But that page is rendered with JavaScript and is unusable to me, whereas when I go to .com/somekeyhere the code is usable. So I need to grab the first URL, rewrite it to the usable page, and then scrape that page.
So far the code I have is
if 'somename' in name:
    try:
        n = re.compile('<iframe title="somename" type="text/html" frameborder="0" scrolling="no" width=".+?" height=".+?" src="(.+?)">" frameborder="0"', re.DOTALL).findall(net().http_GET(url).content)[0]
        # CONVERT URL to .com/somekeyhere SO THE na LINE BELOW CAN READ IT.
        na = re.compile("'file=(.+?)&.+?'", re.DOTALL).findall(net().http_GET(na).content)[0]
Any suggestions on how I can accomplish converting the url?
I didn't really follow the long version of your question, but here is an answer to the short version.
Assumptions: somekey is alphanumeric.
import re

a = 'http://www.domain.com/embed-somekey-650x370.html'
p = re.match(r'^http://www\.domain\.com/embed-(?P<key>[0-9A-Za-z]+)-650x370\.html$', a)
somekey = p.group('key')
requiredString = "http://www.domain.com/" + somekey  # comment1
I have provided a very specific answer here for just this domain name; you should modify the regex as required. Your code in the question already uses regexes, so I assume you can frame one that matches your requirement better.
EDIT 1: also see urlparse (https://docs.python.org/2/library/urlparse.html?highlight=urlparse#module-urlparse), which provides an easy way to parse your URL.
Also, in line with "#comment1", you can save the domain name to a variable and reuse it there.
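Putting the urlparse suggestion together, a sketch (urllib.parse in Python 3, urlparse in Python 2; the embed-<key>-650x370.html shape from the question is assumed):

```python
from urllib.parse import urlparse  # "import urlparse" on Python 2

def rewrite_embed_url(url):
    parts = urlparse(url)
    # path is "/embed-somekeyhere-650x370.html"
    name = parts.path.rsplit("/", 1)[-1]           # last path segment
    key = name[len("embed-"):].rsplit("-", 1)[0]   # strip prefix and "-650x370.html"
    # reuse the parsed domain instead of hard-coding it (see #comment1)
    return "%s://%s/%s" % (parts.scheme, parts.netloc, key)
```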

how to make youtube videos embed on your webpage when a link is posted

I have a website that gets a lot of links to YouTube and similar sites, and I wanted to know if there is any way to make a link automatically appear as a playable video, like what happens when you post a video link on Facebook: you can play it right on the page. Is there a way to do this without users posting the entire embed HTML themselves?
By the way I am using google app engine with python and jinja2 templating.
Each YouTube video has a unique ID, which is present in the URL.
Examples here:
http://www.youtube.com/watch?v=DU0Q0U08gAc&feature=g-all-esi
http://youtu.be/DU0Q0U08gAc
In this case, DU0Q0U08gAc is the movie id.
This just gets inserted in the embed tag, as you can see here:
<iframe width="560" height="315" src="http://www.youtube.com/embed/DU0Q0U08gAc" frameborder="0" allowfullscreen></iframe>
So you need to parse the url for the id and insert it to an embed tag. I believe that in the case of youtu.be style links, it's just whatever's after the '/', and in the case of youtube.com links it's probably best practice to use the urlparse library to get the 'v' variable from the url's query string. Hopefully someone will chime in if there's a corner case I'm not aware of.
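That parsing step might look like this. A sketch handling both URL shapes from the examples above; urllib.parse is the Python 3 name of the urlparse library mentioned.

```python
from urllib.parse import urlparse, parse_qs

def youtube_id(url):
    parts = urlparse(url)
    if parts.netloc.endswith("youtu.be"):
        return parts.path.lstrip("/")        # short links: id is the path
    return parse_qs(parts.query)["v"][0]     # watch links: id is the v param

def embed_tag(video_id):
    # Insert the extracted id into the embed iframe shown above.
    return ('<iframe width="560" height="315" '
            'src="http://www.youtube.com/embed/%s" '
            'frameborder="0" allowfullscreen></iframe>' % video_id)
```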
Your solution is Micawber...which is available in pure Python, as well as Django and Flask plugins. Works nicely with Jinja. Embeds vids and pics into your app exactly like Facebook. You can install it via pip. Good docs; easy to follow examples. The author is responsive to questions. Works great, and it's totally free. Check it out:
http://readthedocs.org/projects/micawber/
You can also check out http://oembed.com and http://embed.ly ...although the latter is not free, and starts at $19/mo (as of July 2012).
Use this code when getting the embed link from a list value. In the template, inside the iframe, use:
src="{{results[0].video_link}}"
Here video_link is the field name.
