Web scraping Python Shell Not Responding - python

I am trying to run this basic code, but even after a long wait the Python shell simply gets stuck and I always end up facing 'Python 3.6.5 Shell (Not Responding)'. Please suggest a fix.
import requests
from bs4 import BeautifulSoup
webdump = requests.get("https://www.flipkart.com/").text
soup = BeautifulSoup(webdump,'lxml')
print(soup.prettify())

This page is around 1MB, so spitting more than 974047 bytes (soup.prettify() adds more spaces and newlines) into the terminal at once is probably what makes it stuck.
Try printing this text line by line:
for line in soup.prettify().splitlines(False):
    print(line)
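If printing line by line is still too slow for the shell, another option is to keep the bulk of the output out of the shell entirely. The following is only a sketch: the helper name `dump_text` and the file name `page_dump.html` are made up for illustration, and the `sample` string stands in for `soup.prettify()`.

```python
import os
import tempfile


def dump_text(text, path, preview_lines=20):
    """Write the full text to a file and return only a short preview."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    # Only the first few lines are returned, so the shell never has to
    # render the whole document at once.
    return "\n".join(text.splitlines()[:preview_lines])


# Stand-in for soup.prettify(); the real string would be around 1 MB.
sample = "\n".join("<p>line %d</p>" % i for i in range(1000))
out_path = os.path.join(tempfile.gettempdir(), "page_dump.html")  # hypothetical file name
print(dump_text(sample, out_path, preview_lines=3))
```

The full markup can then be opened in an editor or browser, which usually copes with large files better than an interactive shell window.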

Related

Getting a specific file from requested iframe

I want to get the file link from the anime I'm watching from the site.
`import requests
from bs4 import BeautifulSoup
import re
page = requests.get("http://naruto-tube.org/shippuuden-sub-219")
soup = BeautifulSoup(page.content, "html.parser")
inner_content = requests.get(soup.find("iframe")["src"])
print(inner_content.text)`
The output is the source code from the file hoster's website (ani-stream). However, my problem now is: how do I get just the "file: xxxxxxx" line printed, and not the whole source code?
You can use Beautiful Soup to parse the iframe source code and find the script elements, but from there you're on your own. The file: "xxxxx", line is in JavaScript code, so you'll have to find the function call (to playerInstance.setup() in this case), decide which of the two "file:" lines is the one you want, and strip away the unwanted JS syntax around the URL.
Regular expressions will help with that, and you're probably better off just looking for the lines in the iframe's HTML. You already have re imported, so I just replaced your last line with:
lines = re.findall("file: .*$", inner_content.text, re.MULTILINE)
print( '\n'.join(lines) )
...to get a list of lines with "file:" in them. You can (and should) use a fancier RE that finds just the one containing "http://" and allows only whitespace before "file:" on the line. (Python, Java and my text editor all have different ideas about what's in an RE, so I have to go to the docs every time I write one. You can do that too--it's your problem, after all.)
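One possible shape for such a stricter pattern is sketched below. The `sample_html` string is a hypothetical stand-in for `inner_content.text` (the real ani-stream page may be laid out differently), and `find_file_urls` is an illustrative helper name, not something from the question.

```python
import re

# Hypothetical stand-in for inner_content.text; the real page may differ.
sample_html = '''
<script>
playerInstance.setup({
    file: "http://example.com/video.mp4",
    image: "http://example.com/thumb.jpg",
    tracks: [{
        file: "/subs/en.vtt"
    }]
});
</script>
'''

# Only whitespace may precede "file:", and the quoted value must start
# with http:// or https://. The capture group keeps just the URL.
FILE_URL = re.compile(r'^\s*file:\s*"(https?://[^"]+)"', re.MULTILINE)


def find_file_urls(text):
    """Return every absolute URL that appears on a 'file: "..."' line."""
    return FILE_URL.findall(text)


print(find_file_urls(sample_html))
```

Because the pattern captures only the quoted value, the surrounding JavaScript syntax (quotes, trailing comma) is already stripped from the result.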
The requests.get() function doesn't seem to work to get the bytes. Try Vishnu Kiran's urlretrieve approach--maybe that will work. Using the URL in a browser window does seem to get the right video, though, so there may be a user agent and/or cookie setting that you'll have to spoof.
If the iframe's source is not on the primary domain of the website (naruto-tube.org), its contents cannot be accessed by scraping the page itself.
You will have to use a different website, or get the URL in the iframe and use a library like requests to call that URL.
Note that you must also pass all parameters to the URL, if any, to actually get a result. Like so:
import urllib.request  # Python 3; in Python 2 this was urllib.urlretrieve
urllib.request.urlretrieve("url from the Iframe", "mp4.mp4")
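If the host rejects plain requests, sending browser-like headers is one thing to try. The sketch below uses only the standard library; whether this particular host actually checks `User-Agent` or `Referer` is an assumption, and `build_request` and `download` are illustrative names.

```python
import shutil
import urllib.request


def build_request(url, referer, user_agent="Mozilla/5.0"):
    """Build a request carrying browser-like headers (nothing is sent yet)."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": user_agent, "Referer": referer},
    )


def download(url, referer, dest):
    """Stream the response body to dest; needs network access when called."""
    request = build_request(url, referer)
    with urllib.request.urlopen(request) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
```

A call might look like `download(file_url, iframe_url, "mp4.mp4")`, where `file_url` is the address taken from the "file:" line and `iframe_url` is the iframe's `src`.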

Read url line by line in python

I have a list containing the URLs of images. I want to read the image at each URL, one URL at a time, using Python. I have tried different ways, but could only read one line.
I haven't seen your code, but I would recommend using Requests.
In a shell I did:
pip install --user requests
to get the above module.
If you have a URL, you can run the following in an interactive Python session:
import requests
r = requests.get("http://docs.python-requests.org/en/master/_static/requests-sidebar.png")
And to examine the content of the image:
print(r.content)
Beware: the above prints the binary content to your console.
Hope it helps.
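To handle the whole list rather than a single URL, the same request can go inside a loop. This is a minimal sketch using the standard library instead of Requests; the helper names and the fallback file name are hypothetical, and the `fetch` parameter exists only so the loop can be tried without network access.

```python
import posixpath
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_from_url(url, index):
    """Use the last path segment as the name, else fall back to a numbered one."""
    name = posixpath.basename(urlparse(url).path)
    return name or "image_%d.bin" % index


def fetch_bytes(url):
    """Download one URL and return its body as bytes (needs network access)."""
    with urlopen(url) as resp:
        return resp.read()


def read_images(urls, fetch=fetch_bytes):
    """Return {file name: bytes}, making one request per non-blank URL."""
    images = {}
    for index, url in enumerate(urls):
        url = url.strip()      # tolerate trailing newlines if the list came from a file
        if not url:
            continue           # skip blank lines
        images[filename_from_url(url, index)] = fetch(url)
    return images
```

If the URLs live in a text file, `read_images(open("urls.txt"))` would iterate over it line by line; `urls.txt` is just an example name.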

How do I use PYTHONIOENCODING environment variable to get past a unicode interpretation issue

I am trying to run a very short script in Python
from bs4 import BeautifulSoup
import urllib.request
html = urllib.request.urlopen("http://dictionary.reference.com/browse/word?s=t").read().strip()
dhtml = str(html, "utf-8").strip()
soup = BeautifulSoup(dhtml.strip(), "html.parser")
I asked a similar question earlier, and this question is based on a comment by J Sebastian on his answer to it: "Python program is running in IDLE but not in command line".
Is there a way to set PYTHONIOENCODING beforehand in either GitHub's Atom or Sublime Text 2, so that soup.prettify() is automatically encoded as UTF-8?
I am going to run this program on a server (of course, the current portion is merely a quick test).
s = soup.prettify().encode('utf8') makes it UTF-8 explicitly.
Setting PYTHONIOENCODING=utf8 in the shell and then calling print(soup.prettify()) should use the specified encoding implicitly.
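The explicit route can be sketched as follows. Since PYTHONIOENCODING is read when the interpreter starts, it has to be set before Python is launched; the code below avoids relying on it by writing already-encoded bytes. `write_utf8` is an illustrative helper name, and the in-memory stream stands in for `sys.stdout.buffer`.

```python
import io


def write_utf8(text, binary_stream):
    """Encode text as UTF-8 and write it to a binary stream."""
    data = text.encode("utf-8")
    binary_stream.write(data)
    return len(data)


# Demo with an in-memory stream; on a real console, sys.stdout.buffer
# could be passed instead.
buf = io.BytesIO()
write_utf8("caf\u00e9 na\u00efve\n", buf)
print(buf.getvalue())
```

Writing bytes this way sidesteps whatever encoding the console reports, at the cost of the terminal possibly displaying them incorrectly if it is not UTF-8 itself.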

Call command of web service from command line of Python

I do the following Python commands:
import urllib
data = urllib.urlencode({"contains":"my_function"})
u = urllib.urlopen("http://myservername:1000/myfolder/?%s" % data)
u.read()
Then I get from that read command a lot of lines with HTML tags and one of the strings is of my interest. It looks like this:
...... onClick='doCommand("my_function","51267", $("ttt27222").value); $("ttt27222").value="";' >Apply
This is what I want to do from the Python command line using urllib.
Please let me know how to build a urllib statement that calls this my_function function, passing it two parameters: 51267 and some number for value.
Thank you
doCommand() looks like a JavaScript function, and urllib doesn't execute JavaScript. You could use Selenium WebDriver or ghost.py to emulate a web browser (to execute JavaScript in the context of the web page).
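Although urllib cannot run the function, the arguments in the onClick attribute can at least be pulled out of the HTML that u.read() returns. The sketch below only recovers those arguments from the fragment shown in the question; what request doCommand() eventually sends is not known from the post, and `find_commands` is an illustrative name.

```python
import re

# Fragment copied from the question; real pages may format it differently.
html = '''onClick='doCommand("my_function","51267", $("ttt27222").value); $("ttt27222").value="";' >Apply'''

# Captures: function name, the fixed id argument, and the id of the
# input element whose value is passed as the last argument.
CALL = re.compile(
    r'doCommand\("([^"]+)"\s*,\s*"([^"]+)"\s*,\s*\$\("([^"]+)"\)\.value\)'
)


def find_commands(text):
    """Return (function name, id argument, input element id) per call."""
    return CALL.findall(text)


print(find_commands(html))
```

With these values in hand, the browser's network inspector can show which URL and parameters the page actually sends when "Apply" is clicked.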

Requiring help in figuring out indent error in python code

I get an indentation error when trying to run the code below. I am trying to print out the URLs of a set of html pages recursively.
import urllib2
from BeautifulSoup import *
from urlparse import urljoin
# Create a list of words to ignore
ignorewords=set(['the','of','to','and','a','in','is','it'])
def crawl(self,pages,depth=2):
for i in range(depth):
newpages=set()
for page in pages:
try:
c=urllib2.urlopen(page)
except:
print "Could not open %s" % page
continue
soup=BeautifulSoup(c.read())
self.addtoindex(page,soup)
links=soup('a')
for link in links:
if ('href' in dict(link.attrs)):
url=urljoin(page,link['href'])
if url.find("'")!=-1: continue
url=url.split('#')[0] # remove location portion
if url[0:4]=='http' and not self.isindexed(url):
newpages.add(url)
linkText=self.gettextonly(link)
self.addlinkref(page,url,linkText)
self.dbcommit()
pages=newpages
Well, your code is totally unindented, so Python will cry when you try to run it.
Remember that in Python whitespace is significant. Indenting with 4 spaces rather than tabs avoids a lot of "invisible" indentation errors.
I've down-voted because the code was pasted unformatted and unindented, which means either the poster doesn't understand Python (and hasn't read a basic tutorial) or pasted the code without re-indenting it, which makes it impossible for anyone to answer.
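For reference, this is roughly how nested blocks line up when each level is indented by four spaces. It is a simplified Python 3 sketch of the link-filtering part of the loop only, not the poster's intended code; `collect_links` and its parameters are hypothetical.

```python
from urllib.parse import urljoin


def collect_links(page, hrefs, indexed=()):
    """Resolve hrefs against page and return the new absolute http(s) URLs."""
    newpages = set()                        # level 1: inside the function
    for href in hrefs:
        url = urljoin(page, href)           # level 2: inside the for loop
        if "'" in url:
            continue                        # level 3: inside the if
        url = url.split("#")[0]             # remove the fragment portion
        if url.startswith("http") and url not in indexed:
            newpages.add(url)
    return newpages


links = collect_links(
    "http://example.com/a/",
    ["b.html", "/c#top", "mailto:x@y.z"],
)
print(sorted(links))
```

Each `def`, `for`, `if`, `try`, or `except` line ends with a colon, and everything that belongs to it sits exactly one level deeper.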
