I have a list containing the URLs of images. I want to read the image at each URL, line by line, using Python. I have tried different ways, but could only read one line.
I haven't seen your code, but I would recommend using Requests.
In a shell I did:
pip install --user requests
to get the above module.
If you have a URL, you can run the following in an interactive Python session:
import requests
r = requests.get("http://docs.python-requests.org/en/master/_static/requests-sidebar.png")
And to examine the content of the image:
print(r.content)
Beware the above prints the binary content to your console.
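For the original question (a list of image URLs, one per line), the single call above can be wrapped in a loop. This is only a minimal sketch: the helper names `read_url_lines` and `fetch_images`, the `get` hook, and the 10-second timeout are assumptions made for illustration, not part of Requests.

```python
def read_url_lines(text):
    """Return the non-empty lines of `text`, stripped of surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_images(urls, get=None):
    """Download every URL and return a dict mapping URL -> raw image bytes.

    `get` is a hypothetical hook: it defaults to requests.get, but any
    callable with the same interface can be passed in instead.
    """
    if get is None:
        import requests  # third-party: pip install --user requests
        get = requests.get
    images = {}
    for url in urls:
        response = get(url, timeout=10)  # the timeout value is an assumption
        response.raise_for_status()      # raise on HTTP errors such as 404
        images[url] = response.content   # bytes, not text
    return images
```

If the URLs live in a text file, something like `fetch_images(read_url_lines(open("urls.txt").read()))` would process every line; the file name `urls.txt` is a placeholder.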
Hope it helps.
I want to get the file link from the anime I'm watching from the site.
`import requests
from bs4 import BeautifulSoup
import re
page = requests.get("http://naruto-tube.org/shippuuden-sub-219")
soup = BeautifulSoup(page.content, "html.parser")
inner_content = requests.get(soup.find("iframe")["src"])
print(inner_content.text)`
The output is the source code from the file hoster's website (ani-stream). However, my problem now is: how do I get only the "file: xxxxxxx" line printed, and not the whole source code?
You can use Beautiful Soup to parse the iframe source code and find the script elements, but from there you're on your own. The file: "xxxxx", line is in JavaScript code, so you'll have to find the function call (to playerInstance.setup() in this case), decide which of the two such "file:" lines is the one you want, and strip away the unwanted JS syntax around the URL.
Regular expressions will help with that, and you're probably better off just looking for the lines in the iframe's HTML. You already have re imported, so I just replaced your last line with:
lines = re.findall("file: .*$", inner_content.text, re.MULTILINE)
print( '\n'.join(lines) )
...to get a list of lines with "file:" in them. You can (and should) use a fancier RE that finds just the one with "http://" and allows only whitespace before "file:" on the line. (Python, Java and my text editor all have different ideas about what's in an RE, so I have to go to docs every time I write one. You can do that too--it's your problem, after all.)
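As a rough idea of what such a fancier RE might look like, here is a sketch that only accepts spaces or tabs before "file:" and requires the quoted value to start with http:// or https://. The name `find_file_urls` is hypothetical, and the pattern assumes the URL is wrapped in single or double quotes; the real page source may differ.

```python
import re

# Only spaces/tabs may precede "file:", and the quoted value must be an
# http(s) URL. The capture group holds the URL without the quotes.
FILE_URL_RE = re.compile(
    r"""^[ \t]*file:[ \t]*["'](https?://[^"']+)["']""",
    re.MULTILINE,
)


def find_file_urls(source):
    """Return every http(s) URL found on a line that starts with `file:`."""
    return FILE_URL_RE.findall(source)
```

Called as `find_file_urls(inner_content.text)`, this would return a list of bare URLs instead of whole lines.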
The requests.get() function doesn't seem to work to get the bytes. Try Vishnu Kiran's urlretrieve approach--maybe that will work. Using the URL in a browser window does seem to get the right video, though, so there may be a user agent and/or cookie setting that you'll have to spoof.
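If the host really does check the user agent or referrer (that is a guess, not something verified here), one way to try it is to send browser-like headers with the request. The sketch below uses the standard library's urllib.request from Python 3; the function names and the User-Agent string are placeholders chosen for illustration.

```python
import shutil
import urllib.request


def build_browser_like_request(url, referer):
    """Build a request that carries browser-style headers (placeholder values)."""
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",  # placeholder UA string
        "Referer": referer,  # the page that embeds the video
    }
    return urllib.request.Request(url, headers=headers)


def download_with_headers(url, referer, destination):
    """Stream the response body to `destination` without loading it all into memory."""
    request = build_browser_like_request(url, referer)
    with urllib.request.urlopen(request, timeout=30) as response, \
            open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
```

With Requests, the same dictionary can be passed as `headers=` to `requests.get()`.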
If the iframe's source is not on the website's primary domain (naruto-tube.org), its contents cannot be accessed via scraping.
You will have to use a different website, or get the URL from the iframe and use a library such as requests to call that URL.
Note that you must also pass all of the URL's parameters, if any, to actually get a result. Like so:
import urllib
urllib.urlretrieve ("url from the Iframe", "mp4.mp4")
I am trying to run this basic code, but even after a long wait the Python shell simply gets stuck and I always find myself facing 'Python 3.6.5 Shell(Not Responding)'. Please suggest a fix.
import requests
from bs4 import BeautifulSoup
webdump = requests.get("https://www.flipkart.com/").text
soup = BeautifulSoup(webdump,'lxml')
print(soup.prettify())
This page is around 1MB, so spitting more than 974047 bytes (soup.prettify() adds more spaces and newlines) into the terminal at once is probably what makes it stuck.
Try printing this text line by line:
for line in soup.prettify().splitlines(False):
    print(line)
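Another option is to keep the bulk of the text out of the shell window altogether: save it to a file and print only the first few lines. A small sketch, where `save_text` and `preview` are hypothetical helper names and the 20-line default is arbitrary:

```python
def save_text(text, path):
    """Write `text` to `path` as UTF-8 so it can be opened in an editor."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def preview(text, max_lines=20):
    """Return at most the first `max_lines` lines of `text` as one string."""
    return "\n".join(text.splitlines()[:max_lines])
```

For example, `save_text(soup.prettify(), "flipkart.html")` followed by `print(preview(soup.prettify()))` keeps the console output short; the file name is a placeholder.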
I do the following Python commands:
import urllib
data = urllib.urlencode({"contains":"my_function"})
u = urllib.urlopen("http://myservername:1000/myfolder/?%s" % data)
u.read()
Then I get from that read command a lot of lines with HTML tags and one of the strings is of my interest. It looks like this:
...... onClick='doCommand("my_function","51267", $("ttt27222").value); $("ttt27222").value="";' >Apply
This is what I want to do from command line of Python using urllib.
Please let me know how to build the urllib statement in order to call this my_function function, passing it two parameters: 51267 and some number for value.
Thank you
doCommand() looks like a JavaScript function. urllib doesn't execute JavaScript. You could use Selenium WebDriver or ghost.py to emulate a web browser (to execute JavaScript in the context of the web page).
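Although urllib cannot run doCommand(), the HTML it returns can still be searched for the arguments of the call, which you would need either way before driving a browser. The sketch below only extracts them; what doCommand() actually sends to the server is not shown in the question, so nothing here tries to reproduce it. The name `find_do_command_args` is hypothetical and the pattern assumes the exact onClick layout quoted in the question.

```python
import re

# Matches: doCommand("<function>","<id>", $("<field>").value)
DO_COMMAND_RE = re.compile(
    r'doCommand\("([^"]+)","(\d+)",\s*\$\("([^"]+)"\)\.value\)'
)


def find_do_command_args(html, function_name):
    """Return (id, input_field) pairs for every doCommand call to `function_name`."""
    return [
        (record_id, field)
        for name, record_id, field in DO_COMMAND_RE.findall(html)
        if name == function_name
    ]
```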
If the following script.py writes "some text here" to output.txt file, my URL will be http://my_name/script.py. My question is, how can I read the output.txt as soon as (right after) the following function creates it, so that my URL reads like http://my_name/output.txt.
Many thanks in advance.
#------ script.py -------
def write_txt():
    f = open('./output.txt', 'w')
    f.write("some text here")
    f.close()
Try the webbrowser library.
import webbrowser
myurl = "file:///mydir/output.txt"
webbrowser.open(myurl)
However:
Note that on some platforms, trying to open a filename using this function, may work and start the operating system’s associated program.
That is: your file will probably be opened in your default text editor (e.g. Notepad). A possible solution is to give your file a custom extension (e.g. output.url) and to associate that extension with your browser (not tested).
It depends on various factors, like the OS and web server used.
Pipe the output to the browser, specifying a correct Content-Type, or, given your script writes to an accessible location, issue an HTTP redirect pointing to that location.
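To make the two options a bit more concrete, here is a sketch that assumes script.py runs as a CGI script (the question does not say which mechanism the server uses, so treat that as an assumption). Both helpers are hypothetical and simply build the text a CGI script would write to standard output.

```python
def plain_text_response(body):
    """Option 1: send the text itself, declaring its content type."""
    return "Content-Type: text/plain; charset=utf-8\r\n\r\n" + body


def redirect_response(location):
    """Option 2: tell the browser to fetch the already-written file instead."""
    return "Status: 302 Found\r\nLocation: " + location + "\r\n\r\n"
```

In a CGI script, `sys.stdout.write(redirect_response("http://my_name/output.txt"))` after calling write_txt() would send the browser on to the generated file.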
Using urllib2, we can get the http response from a web server. If that server simply holds a list of files, we could parse through the files and download each individually. However, I'm not sure what the easiest, most pythonic way to parse through the files would be.
When you get a whole http response of the generic file server list, through urllib2's urlopen() method, how can we neatly download each file?
Urllib2 might be OK to retrieve the list of files. For downloading large amounts of binary files PycURL http://pycurl.sourceforge.net/ is a better choice. This works for my IIS based file server:
import re
import urllib2
import pycurl
url = "http://server.domain/"
path = "path/"
# Assumed anchor format of the directory listing; adjust it to match your server
pattern = '<A HREF="/%s.*?">(.*?)</A>' % path
response = urllib2.urlopen(url+path).read()
for filename in re.findall(pattern, response):
    with open(filename, "wb") as fp:
        curl = pycurl.Curl()
        curl.setopt(pycurl.URL, url+path+filename)
        curl.setopt(pycurl.WRITEDATA, fp)
        curl.perform()
        curl.close()
You can use urllib.urlretrieve (in Python 3.x: urllib.request.urlretrieve):
import urllib
urllib.urlretrieve('http://site.com/', filename='filez.txt')
This should work :)
And this is a function that can do the same thing (using urllib):
def download(url):
    webFile = urllib.urlopen(url)
    # Open in binary mode so non-text files are not corrupted
    localFile = open(url.split('/')[-1], 'wb')
    localFile.write(webFile.read())
    webFile.close()
    localFile.close()
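For Python 3, a comparable helper might look like the sketch below. It is not a drop-in replacement for the function above: `filename_from_url` is a hypothetical helper that ignores the query string, and the `index.html` fallback for URLs ending in a slash is an assumption.

```python
import posixpath
import shutil
import urllib.request
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="index.html"):
    """Derive a local file name from the last path segment of `url`."""
    path = urlsplit(url).path                 # drops ?query and #fragment
    name = posixpath.basename(unquote(path))  # decode %20 and similar escapes
    return name or default                    # "http://site.com/" has no name


def download(url, destination=None):
    """Stream `url` to disk in binary mode and return the file name used."""
    destination = destination or filename_from_url(url)
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```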
Can you guarantee that the URL you're requesting is a directory listing? If so, can you guarantee the format of the directory listing?
If so, you could use lxml to parse the returned document and find all of the elements that hold the path to a file, then iterate over those elements and download each file.
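As an illustration of that idea, here is a sketch that pulls the links out of a listing page. It uses the standard library's html.parser rather than lxml, purely to avoid a dependency; the filtering rules (skip links ending in "/" and links starting with "?") are assumptions about what a typical directory listing contains.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":  # tag and attribute names arrive lower-cased
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def list_files(listing_html, base_url):
    """Return absolute URLs for the file entries of a directory listing."""
    parser = LinkCollector()
    parser.feed(listing_html)
    return [
        urljoin(base_url, href)
        for href in parser.links
        # Assumed conventions: "/" marks directories, "?" marks sort links
        if not href.endswith("/") and not href.startswith("?")
    ]
```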
1. Download the index file
   If it's really huge, it may be worth reading a chunk at a time;
   otherwise it's probably easier to just grab the whole thing into memory.
2. Extract the list of files to get
   If the list is xml or html, use a proper parser;
   else if there is much string processing to do, use regex;
   else use simple string methods.
   Again, you can parse it all-at-once or incrementally.
   Incrementally is somewhat more efficient and elegant,
   but unless you are processing multiple tens of thousands
   of lines it's probably not critical.
3. For each file, download it and save it to a file.
   If you want to try to speed things up, you could try
   running multiple download threads;
   another (significantly faster) approach might be
   to delegate the work to a dedicated downloader
   program like Aria2 http://aria2.sourceforge.net/ -
   note that Aria2 can be run as a service and controlled
   via XMLRPC, see http://sourceforge.net/apps/trac/aria2/wiki/XmlrpcInterface#InteractWitharia2UsingPython
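The "multiple download threads" idea could be sketched roughly as follows with Python 3's concurrent.futures. The function name `download_all`, the `download_one` callback and the default of four workers are all assumptions; whether threads actually speed things up depends on the server and the network.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, download_one, max_workers=4):
    """Run `download_one(url)` for each URL on a small pool of threads.

    Returns (results, errors): two dicts keyed by URL, so that one failed
    download does not abort the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going, remember what failed
                errors[url] = exc
    return results, errors
```

`download_one` can be any function that takes a URL and saves it, for example a wrapper around urllib.request.urlretrieve.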
My suggestion would be to use BeautifulSoup (which is an HTML/XML parser) to parse the page for a list of files. Then, pycURL would definitely come in handy.
Another method, after you've got the list of files, is to use urllib.urlretrieve in a way similar to wget in order to simply download the file to a location on your filesystem.
This is an unconventional way, although it works:
fPointer = open(picName, 'wb')
self.curl.setopt(self.curl.WRITEFUNCTION, fPointer.write)
The correct way:
urllib.urlretrieve(link, picName)
Here's an untested solution:
import urllib2
response = urllib2.urlopen('http://server.com/file.txt')
urls = response.read().replace('\r', '').split('\n')
for file in urls:
    print 'Downloading ' + file
    response = urllib2.urlopen(file)
    # Save under the last path segment; a full URL is not a usable file name
    handle = open(file.split('/')[-1], 'wb')
    handle.write(response.read())
    handle.close()
It's untested, and it probably won't work. This is assuming you have an actual list of files inside of another file. Good luck!