Extract video from .swf using Python

I've written code that generates links to videos such as the one below.
Once I have a link, I try to download the video like this:
import urllib.request
import os
url = 'http://www.videodetective.net/flash/players/?customerid=300120&playerid=351&publishedid=319113&playlistid=0&videokbrate=750&sub=RTO&pversion=5.2%22%20width=%22670%22%20height=%22360%22'
response = urllib.request.urlopen(url).read()
outpath = os.path.join(os.getcwd(), 'video.mp4')
videofile = open(outpath , 'wb')
videofile.write(response)
videofile.close()
All I get is a 58kB file in that directory that can't be read. Could someone point me in the right direction?

With your code, you aren't downloading the encoded video file, but the Flash application (in CWS format) that is used to play the video. It is executed in the browser and dynamically loads and plays the video. You'd need to apply some reverse engineering to figure out the actual video source. The following is my attempt at it:
Decompressing the SWF file
First, save the 58K file you mentioned on your hard disk under the name test.swf (or similar).
You can then use the small Perl script cws2fws for that:
perl cws2fws test.swf
This results in a new file named test.fws.swf in the same directory.
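If you don't have Perl at hand, the same decompression takes only a few lines of Python: a CWS file is just an 8-byte SWF header followed by one zlib stream. A minimal sketch, assuming a well-formed, zlib-compressed SWF:

import zlib

with open('test.swf', 'rb') as f:
    data = f.read()

assert data[:3] == b'CWS', 'not a compressed SWF'

# keep the version and length bytes, swap the signature to FWS,
# and inflate everything after the 8-byte header
with open('test.fws.swf', 'wb') as f:
    f.write(b'FWS' + data[3:8] + zlib.decompress(data[8:]))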
Searching for the configuration URL in the FWS file
I used a simple
strings test.fws.swf | grep http
which yields:
...
cookieOhttp://www.videodetective.net/flash/players/flashconfiguration.aspx?customerid=
...
Interesting. Let's try to put our known customerid, playerid and publishedid arguments to this URL:
http://www.videodetective.net/flash/players/flashconfiguration.aspx?customerid=300120&playerid=351&publishedid=319113
If we open that in a browser, we can see the player configuration XML, which in turn points us to
http://www.videodetective.net/flash/players/playlist.aspx?videokbrate=450&version=4.6&customerid=300120&fmt=3&publishedid=&sub=
Now if we open that, we can finally see the source URL:
http://cdn.videodetective.net/svideo/mp4/450/6993/293732.mp4?c=300120&r=450&s=293732&d=153&sub=&ref=&fmt=4&e=20111228220329&h=03e5d78201ff0d2f7df9a
Now we can download this H.264 video file and we are finished.
Automating the whole process in a Python script
This is an entirely different task (left as an exercise to the reader).
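That said, here is a rough sketch of how such a script could look. It assumes the configuration and playlist XML keep the structure observed above; the URL-extracting regexes are untested guesses, not a verified implementation:

import re
import urllib.request

def fetch(url):
    # download a resource and decode it as text
    return urllib.request.urlopen(url).read().decode('utf-8', errors='replace')

config_url = ('http://www.videodetective.net/flash/players/'
              'flashconfiguration.aspx?customerid=300120'
              '&playerid=351&publishedid=319113')

# step 1: the player configuration XML points to the playlist
config_xml = fetch(config_url)
playlist_url = re.search(r'http://[^<"]+playlist\.aspx[^<"]*', config_xml).group(0)

# step 2: the playlist XML contains the actual .mp4 source URL
playlist_xml = fetch(playlist_url)
video_url = re.search(r'http://[^<"]+\.mp4[^<"]*', playlist_xml).group(0)

# step 3: download the video itself
with open('video.mp4', 'wb') as f:
    f.write(urllib.request.urlopen(video_url).read())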

Related

Read a csv file from bitbucket using Python and convert it to a df

I am trying to read a CSV file from a Bitbucket URL into a DataFrame using Python. Also, for the work I am doing, I cannot read it locally; it has to come from Bitbucket every time.
Any ideas on how to do this? Thank you!
Here is my example:
import pandas as pd

url = 'https://bitbucket.EXAMPLE.com/EXAMPLE/EXAMPLE/EXAMPLE/EXAMPLE/raw/wpcProjects.csv?at=refs%2Fheads%2Fmaster'
colnames = ['project_id', 'project_name', 'gourmet_url']
df7 = pd.read_csv(url, names=colnames)
However, the output is not correct; it's not the DataFrame being output, it's some bad data.
You have multiple options, but your question is actually 2 separate questions:
1. How to get a file (.csv in this case) from a remote location.
2. How to load a CSV into a "df", which is a pandas DataFrame.
For #2, you simply import pandas and use the df = pandas.read_csv() function call. See the documentation! If the CSV file were in the current directory, you would do pandas.read_csv('myfile.csv').
For #1, the CSV is on a server somewhere; in this case, it happens to be on Bitbucket's servers, accessed from their website. You can fetch it and save it locally, then access it; you can fetch it to a temporary location, read it into pandas, and discard it; you could even read the data from the file into Python as a string. However, having a lot of options doesn't mean they are all useful; I am just listing them for completeness. Looking at the documentation, pandas already has remote fetching built into the read_csv() function: if the path you pass in is a URL with a valid scheme, pandas fetches it for you, where "valid URL schemes include http, ftp, s3, gs, and file".
If you want to save it locally, you can use pandas once again, this time with the DataFrame's .to_csv() method.
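For example, both halves in a couple of lines (the URL here is just a placeholder):

import pandas as pd

# read_csv fetches remote files itself when given a URL
df = pd.read_csv('https://example.com/data.csv')  # placeholder URL

# save a local copy with to_csv
df.to_csv('local_copy.csv', index=False)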
FOR BITBUCKET SPECIFICALLY:
You need to make sure to link to the 'raw' file on Bitbucket. The link used to view the file in your web browser is not, by default, the direct link to the raw file; it's a webpage that offers a view into that file. Get the raw file link, then pass that into pandas.
Code example:
Assume we want (a random csv file I found on bitbucket):
https://bitbucket.org/pedrorijo91/nodejstutorial/src/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv?at=master
What you need is a link to the raw file! Clicking on the '...' menu and pressing 'Open raw', we get:
https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv
Let's look at this in detail. The link is the same up to the project name:
https://bitbucket.org/pedrorijo91/nodejstutorial/
afterwards, the raw file is under raw/
then comes the same commit hash (random-looking, but the same letters and numbers in both links):
db4c991864e65c4d72e98a1dc94e33606e3adde9/
Finally, it's the same directory structure:
node_modules/levelmeup/data/horse_js.csv
The first link ends with ?at=master, which is parsed by the web server, and it goes through src/. The second link, the actual link to the raw file, goes through raw/ and ends with .csv.
import pandas as pd
RAW_Bitbucket_URL = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
df = pd.read_csv(RAW_Bitbucket_URL)
The above code is successful for me.
You may need to download the entire file; you can make the request with requests and then read the saved file with pandas.read_csv():
>>> import pandas as pd
>>> import requests
>>> url = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
>>> r = requests.get(url, allow_redirects=True)
>>> open('file.csv', 'wb').write(r.content)
>>> pd.read_csv('file.csv', encoding='utf-8-sig').head()
ID Tweet Date Via
0 374667940827635712 So, yes, a 100% JS App is 100% awesome 08:59:32, 9-3, 2013 web
1 374656867466637312 "vituperating priests" who rail against JavaSc... 08:15:32, 9-3, 2013 web
2 374654221292806144 Node/Browserify/CJS folks, is there any benefit 08:05:01, 9-3, 2013 Twitter for iPhone
3 374640446955212800 100% JavaScript applications. You may get some 07:10:17, 9-3, 2013 Twitter for iPhone
4 374613490763169792 A node.js app that will order you a sandwich 05:23:10, 9-3, 2013 web

Generate and download tsv from a website (with python)

I have this website and want to write a script that produces the same output as clicking 'Export' -> 'Generate tsv' -> waiting for it to generate -> 'Download'.
The end goal is to use this for a list of approx. 1700 proteins which I have in a .txt file (so: extract a protein, in this case 'Q9BXF6', put it in the URL https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table, and download all results as .tsv files).
I tried inspecting the 'Export' button, but the source code wasn't illuminating (or I didn't know where to look). I also tried this:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table')
soup = BeautifulSoup(r.content, 'html.parser')
to locate what I need, but it outputs a bunch of characters that I can't really understand.
I also tried downloading the whole page just as it is with the urllib library:

myurl = 'https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table'
with urllib.request.urlopen(myurl) as f:
    html = f.read().decode('utf-8')

or

urllib.request.urlretrieve(myurl, 'interpro.txt')  # although this didn't work
It seems as if all the content is written somewhere else and only referred to, and everything I've tried outputs something unhelpful, but I don't know anything about HTML and am really new to Python (I only use R).
For your first question, you can use the URL of the following element to retrieve the protein value that you require for the next problem.
href="blob:https://www.ebi.ac.uk/806960aa-720c-4958-9392-f242adee627b"
The URL is set in the href attribute, which you can then use to make the request to download the file. You can also find it by right-clicking on the download button for the TSV and clicking 'Inspect Element'; you will then be able to see this href attribute.
Following that, download it by doing e.g.:

import urllib.request

url = 'https://www.ebi.ac.uk/806960aa-720c-4958-9392-f242adee627b'
urllib.request.urlretrieve(url, '/Users/abc/Downloads/file.tsv')  # any dir to save

with open('/Users/abc/Downloads/file.tsv') as file_in:
    for line in file_in:
        pass  # here make your calls for your second problem
You can also use a web automator such as selenium to gracefully solve this problem. If the latter is of interest, do look into it; it's not hard.
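For completeness, a minimal selenium sketch; the CSS selectors below are hypothetical placeholders, so inspect the live page to find the real ones:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table')

# placeholder selectors: replace with the real ones from the page
driver.find_element(By.CSS_SELECTOR, 'button.export').click()        # 'Export'
driver.find_element(By.CSS_SELECTOR, 'button.generate-tsv').click()  # 'Generate tsv'
time.sleep(30)  # crude wait while the file is generated
driver.find_element(By.CSS_SELECTOR, 'a.download').click()           # 'Download'

driver.quit()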

Fatal error reading PNG image file: Not a PNG file in Ubuntu 20.04 LTS

I'm trying to download an image using the requests module in Python. It works, but when I try to open the image, it shows "Fatal error reading PNG image file: Not a PNG file". Here is my error screenshot. And the code I used to download it is:
import requests

img_url = "http://dimik.pub/wp-content/uploads/2020/02/javaWeb.jpg"
r = requests.get(img_url)
with open("java_book.png", "wb") as f:
    f.write(r.content)
And I run my code in the terminal with python3 s.py (s.py is the name of the file).
Is something wrong in my code, or is it something else in my operating system (Ubuntu 20.04 LTS)?
import requests
response = requests.get("https://devnote.in/wp-content/uploads/2020/04/devnote.png")
file = open("sample_image.png", "wb")
file.write(response.content)
print (response.content)
file.close()
The URL https://devnote.in/wp-content/uploads/2020/04/devnote.png is protected by mod_security, so the request returns an error page like:
<html><head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>
Disabling mod_security using .htaccess on an Apache server:
mod_security can easily be disabled with the help of .htaccess:
<IfModule mod_security.c>
SecFilterEngine Off
SecFilterScanPOST Off
</IfModule>
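If you can't touch the server configuration, a common client-side workaround is to send a browser-like User-Agent header, since many mod_security rule sets block the default python-requests one. A sketch, with no guarantee it passes every rule set:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) '
                         'Gecko/20100101 Firefox/76.0'}
response = requests.get('https://devnote.in/wp-content/uploads/2020/04/devnote.png',
                        headers=headers)

with open('sample_image.png', 'wb') as f:
    f.write(response.content)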
It's because you tried to save javaWeb.jpg (a JPG file) as java_book.png (a PNG file).
In an attempt to see what we are working with, I've tried replicating the issue; please see below what I found out.
1.) The file you are attempting to open is an ENTIRE HTML document. I can support this because we find !DOCTYPE html at the beginning of what your 'wb' (WRITE BINARY) command wrote.
We are at an impasse.
From here we have a few options to solve our problem.
a.) We could simply download the image from the web page, placing it in a local folder/directory or wherever you want it. This is by far the easiest call, because it allows us to open it later without having to do too much. While I'm on a Windows machine, Ubuntu should have no problem doing this either (unless you aren't in an Ubuntu with a GUI; that can be fixed with startx, if supported).
b.) If you have to pull directly from the site itself, you could try something like the BeautifulSoup approach from this answer here; see the sketch below. Honestly, I've never really used the latter option, since downloading and moving is much more effective.
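A sketch of option b.), assuming the image sits in a plain <img> tag with an absolute src URL (the page URL is a placeholder):

import os
import requests
from bs4 import BeautifulSoup

page_url = 'http://dimik.pub/'  # placeholder: the page holding the image
soup = BeautifulSoup(requests.get(page_url).content, 'html.parser')

# download every absolutely-addressed image on the page
for img in soup.find_all('img'):
    src = img.get('src')
    if src and src.startswith('http'):
        with open(os.path.basename(src), 'wb') as f:
            f.write(requests.get(src).content)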
You just need to save the image as a JPG.
import requests

img_url = "http://dimik.pub/wp-content/uploads/2020/02/javaWeb.jpg"
r = requests.get(img_url)
with open("java_book.jpg", "wb") as f:
    f.write(r.content)
Yeah, it's a full HTML document.

When I run the code, it runs without errors, but the csv file is not created, why?

I found a tutorial and I'm trying to run this script; I haven't worked with Python before.
tutorial
I've already checked what is happening via logging.debug, verified that it is connecting to Google, and tried creating the CSV file with other scripts.
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
import csv

def scrape_run():
    with open('/Users/Work/Desktop/searches.txt') as searches:
        for search in searches:
            userQuery = search
            raw = get("https://www.google.com/search?q=" + userQuery).text
            page = fromstring(raw)
            links = page.cssselect('.r a')
            csvfile = '/Users/Work/Desktop/data.csv'
            for row in links:
                raw_url = row.get('href')
                title = row.text_content()
                if raw_url.startswith("/url?"):
                    url = parse_qs(urlparse(raw_url).query)['q']
                    csvRow = [userQuery, url[0], title]
                    with open(csvfile, 'a') as data:
                        writer = csv.writer(data)
                        writer.writerow(csvRow)
            print(links)

scrape_run()
The TL;DR of this script is that it does three basic things:
1. Locates and opens your searches.txt file.
2. Uses those keywords and searches the first page of Google for each result.
3. Creates a new CSV file and prints the results (keyword, URLs, and page titles).
Solved: Google added a captcha because I made too many requests; it works when I use mobile internet.
Assuming the links variable is full and contains data (please verify). If it is empty, test the API call itself that you are making; maybe it returns something different than you expected.
Other than that, I think you just need to tweak your file handling a little bit. Here you can find some guidelines regarding file handling in Python:
https://www.guru99.com/reading-and-writing-files-in-python.html
From my perspective, you need to make sure the file gets created first. Start with a script that is able to just create a file; after that, enhance the script to be able to write and append to the file. From there on, I think you are good to go and can continue with your script.
You would also probably prefer opening the file only once instead of on each loop iteration; it could mean much faster execution time. The inner loop could be restructured roughly like this, reusing the variables from the script above:
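# open the CSV once, write many times (reuses links, userQuery,
# parse_qs, and urlparse from the script above)
with open(csvfile, 'a', newline='') as data:
    writer = csv.writer(data)
    for row in links:
        raw_url = row.get('href')
        title = row.text_content()
        if raw_url.startswith("/url?"):
            url = parse_qs(urlparse(raw_url).query)['q']
            writer.writerow([userQuery, url[0], title])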
Let me know if something is not clear.

Only 1 KB of file is downloading, instead of the whole thing in Python

I've attempted to use urllib, requests, and wget; none of the three works.
I'm trying to download a 300 KB .npz file from a URL. When I download the file with wget.download(), urllib.request.urlretrieve(), or with requests, no error is thrown and the .npz file downloads. However, this .npz file is not 300 KB; the file size is only 1 KB. The file is also unreadable: when I use np.load(), the error OSError: Failed to interpret file 'x.npz' as a pickle shows up.
I am also certain that the URL is valid. When I download the file with a browser, it is correctly read by np.load() and has the right file size.
Thank you very much for the help.
Edit 1:
The full code was requested. This was the code:
loadfrom = "http://example.com/dist/x.npz"
savedir = "x.npz"
wget.download(loadfrom, savedir)
data = np.load(savedir)
I've also used variants with urllib:
loadfrom = "http://example.com/dist/x.npz"
savedir = "x.npz"
urllib.request.urlretrieve(loadfrom, savedir)
data = np.load(savedir)
and requests:
loadfrom = "http://example.com/dist/x.npz"
savedir = "x.npz"
r = requests.get(loadfrom).content
with open("x.npz", 'wb') as f:
    f.write(r)
data = np.load(savedir)
They all produce the same result, with the aforementioned conditions.
Kindly show the full code and the exact lines you use to download the file. Remember you need to use
r = requests.get("direct_URL_of_your_file.npz").content
with open("local_file.npz", 'wb') as f:
    f.write(r)
Also make sure the URL is a direct download link.
The issue was that the server needed JavaScript to run as a security precaution, so when I sent the request, all I got was HTML with "This Site Requires Javascript to Work". It turned out there was a __test cookie that needed to be passed with the request.
This answer explains it fully. This video may also be helpful.
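For anyone hitting the same wall, a sketch of passing such a cookie with requests; the cookie value is site-specific and would have to be copied from a real browser session:

import requests

# hypothetical value: copy the real __test cookie from your browser
cookies = {'__test': 'value_copied_from_browser'}

r = requests.get('http://example.com/dist/x.npz', cookies=cookies)
with open('x.npz', 'wb') as f:
    f.write(r.content)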
