I get an indentation error when trying to run the code below. I am trying to print out the URLs of a set of html pages recursively.
import urllib2
from BeautifulSoup import *
from urlparse import urljoin
# Create a list of words to ignore
ignorewords=set(['the','of','to','and','a','in','is','it'])
def crawl(self,pages,depth=2):
    for i in range(depth):
        newpages=set()
        for page in pages:
            try:
                c=urllib2.urlopen(page)
            except:
                print "Could not open %s" % page
                continue
            soup=BeautifulSoup(c.read())
            self.addtoindex(page,soup)
            links=soup('a')
            for link in links:
                if ('href' in dict(link.attrs)):
                    url=urljoin(page,link['href'])
                    if url.find("'")!=-1: continue
                    url=url.split('#')[0]  # remove location portion
                    if url[0:4]=='http' and not self.isindexed(url):
                        newpages.add(url)
                    linkText=self.gettextonly(link)
                    self.addlinkref(page,url,linkText)
            self.dbcommit()
        pages=newpages
Well, your code is completely unindented, so Python will complain when you try to run it.
Remember that in Python whitespace is significant. Indenting with 4 spaces rather than tabs avoids a lot of "invisible" indentation errors.
I've down-voted because the code was pasted unformatted/unindented, which means either the poster doesn't understand Python (and hasn't read a basic tutorial) or pasted the code without re-indenting, which makes it impossible for anyone to answer.
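The error itself is easy to reproduce without the crawler at all; this minimal sketch shows Python rejecting inconsistently indented source at compile time:

```python
# A two-line function body where the second line is over-indented,
# mimicking what happens when pasted code loses its formatting.
code = "def f():\n  x = 1\n        y = 2\n"

try:
    compile(code, "<pasted>", "exec")
    err_name = None
except IndentationError as e:
    err_name = type(e).__name__  # "unexpected indent"

print(err_name)
```

Running the snippet prints `IndentationError`, which is the same class of error the pasted crawler code produces.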
I am working on an assignment for work where my boss wants me to write a program to verify company domains exist with a certain HR system. This HR system has their name directly in the company's website URL. For example, ropeswings.bamboohr.com. I'm writing a program that checks those URLs and if the HTTP response is 200, my program adds them to a list. Once the program is working I'm going to output that list to a txt file.
My problem is with making the Python request and checking the HTTP response properly. When I run this code with the name of a company that I know for a fact uses this HR system, the list still comes out empty, and I can't figure out where I went wrong. Any ideas?
import sys, re, csv
domains = []
bamboos = []
with open('clist.csv') as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        main = row[14][:-4]
        if main in domains:
            continue
        else:
            domains.append(main)
domains.pop(0)
for domain in domains:
    try:
        request = requests.get('https://' + domain + '.bamboohr.com/login.php')
        if request.status_code == 200:
            bamboos.append(domain)
    except:
        continue
print(bamboos)
This may sound a little bit awkward, but I think you aren't importing the requests package within the module. Remember that requests is not a built-in module. Please import it and try again.
I think the problem is that you have not imported requests. Add import requests to your Python file.
Once execution enters try, any error in the try block sends execution to the except block. Since you use a continue statement in except, no error is raised and the loop simply moves on, resulting in an empty list. Your program is probably failing in requests.get(). You can check this by adding a print at the beginning of the except block.
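The empty-list behaviour can be reproduced without the CSV file or the network at all; this sketch shows the NameError from the missing import being silently swallowed by the bare except:

```python
results = []
try:
    # 'requests' was never imported here, so this line raises NameError
    results.append(requests.get("https://example.com"))
except:
    swallowed = True  # the bare except hides the real error

print(results)
```

The list prints as `[]` even though nothing network-related ever ran, which is why catching specific exceptions (or temporarily removing the try/except) is the quickest way to surface the real problem.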
If except is not executed, try printing the response code. If you would like to see the traceback for any errors, run the file without the try and except. This will give you an idea of whether there is something wrong with your code.
I want to get the file link for the anime I'm watching from the site.
import requests
from bs4 import BeautifulSoup
import re

page = requests.get("http://naruto-tube.org/shippuuden-sub-219")
soup = BeautifulSoup(page.content, "html.parser")
inner_content = requests.get(soup.find("iframe")["src"])
print(inner_content.text)
The output is the source code of the filehoster's website (ani-stream). However, my problem now is: how do I get the "file: xxxxxxx" line printed, and not the whole source code?
You can use Beautiful Soup to parse the iframe source code and find the script elements, but from there you're on your own. The file: "xxxxx", line is in JavaScript code, so you'll have to find the function call (to playerInstance.setup() in this case), decide which of the two such "file:" lines is the one you want, and strip away the unwanted JS syntax around the URL.
Regular expressions will help with that, and you're probably better off just looking for the lines in the iframe's HTML. You already have re imported, so I just replaced your last line with:
lines = re.findall("file: .*$", inner_content.text, re.MULTILINE)
print( '\n'.join(lines) )
...to get a list of lines with "file:" in them. You can (and should) use a fancier RE that finds just the one with "http://" and allows only whitespace before "file:" on the line. (Python, Java and my text editor all have different ideas about what's in an RE, so I have to go to the docs every time I write one. You can do that too; it's your problem, after all.)
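As a sketch of that fancier RE (the JS snippet below is made up; the playerInstance.setup() call and URLs are stand-ins for whatever the real iframe contains):

```python
import re

# Hypothetical player-setup JavaScript, as it might appear in the iframe.
js = '''playerInstance.setup({
    sources: [{ file: "rtmp://bad.example/stream" }],
    file: "http://files.example.com/episode-219.mp4",
});'''

# Only whitespace may precede "file:", and only http(s) URLs are accepted,
# so the rtmp:// line and the nested "sources" line are skipped.
match = re.search(r'^\s*file:\s*"(https?://[^"]+)"', js, re.MULTILINE)
print(match.group(1))
```

This prints only the http URL, with the surrounding JS syntax already stripped by the capturing group.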
The requests.get() function doesn't seem to work to get the bytes. Try Vishnu Kiran's urlretrieve approach--maybe that will work. Using the URL in a browser window does seem to get the right video, though, so there may be a user agent and/or cookie setting that you'll have to spoof.
If the iframe's source is not on the primary domain of the website (naruto-tube.org), its contents cannot be accessed by scraping the original page alone.
You will have to use a different website, or you will need to take the URL from the iframe and request it with a library like requests.
Note that you must also pass along any parameters on the URL to actually get a result. Like so:
from urllib.request import urlretrieve
urlretrieve("url from the Iframe", "mp4.mp4")
I am trying to run this basic code, but even after waiting for a long time the Python shell simply gets stuck and I always find myself facing 'Python 3.6.5 Shell (Not Responding)'. Please suggest.
import requests
from bs4 import BeautifulSoup
webdump = requests.get("https://www.flipkart.com/").text
soup = BeautifulSoup(webdump,'lxml')
print(soup.prettify())
This page is around 1 MB, so spitting more than 974047 bytes (soup.prettify() adds even more spaces and newlines) into the terminal at once is probably what makes it get stuck.
Try printing this text line by line:
for line in soup.prettify().splitlines(False):
    print(line)
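The same idea in miniature, using a literal string in place of the live page so no network or bs4 install is needed:

```python
# Stand-in for soup.prettify() output: one tag or text fragment per line.
sample = "<html>\n <body>\n  <p>\n   Hello\n  </p>\n </body>\n</html>"

# splitlines(False) drops the newline characters; printing line by line
# gives the shell a chance to stay responsive between writes.
lines = sample.splitlines(False)
for line in lines:
    print(line)
```

Each iteration emits one small write instead of a single multi-hundred-kilobyte dump.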
I'm trying to solve a Python Challenge problem.
http://www.pythonchallenge.com/pc/def/ocr.html
OK, I know I could just copy-paste the code from the page source into a txt file and work from that (in fact, I have done so already), but I want to fetch it from the net to improve my skills. I have tried
re.findall(r"<!--(.*?)-->", html)
But it doesn't match anything.
If you want my full code is here:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests,re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall(r"<!--(.*)-->",str(x.content))
print codes
Also I tried making it like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests,re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall("<!--\n(.*)\n-->",str(x.content))
print codes
Now it finds the text, but I still can't make sense of that mess :(
I would use an HTML parser instead. You can find comments in HTML with BeautifulSoup.
Working code:
import requests
from bs4 import BeautifulSoup, Comment
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
response = requests.get(link)
soup = BeautifulSoup(response.content, "html.parser")
code = soup.find_all(text=lambda text: isinstance(text, Comment))[-1]
print(code.strip())
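If installing bs4 isn't an option, the standard library's html.parser can collect comments the same way; a minimal sketch on a made-up document:

```python
from html.parser import HTMLParser

class CommentCollector(HTMLParser):
    """Collects the text of every <!-- comment --> in the document."""
    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)

parser = CommentCollector()
parser.feed("<html><!--banner--><body>\n<!--\nthe hidden code\n-->\n</body></html>")

# The challenge page keeps its payload in the last comment.
code = parser.comments[-1].strip()
print(code)
```

With the real page you would feed response.text to the parser instead of the literal string.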
Not sure what you mean by "that mess". You should include all of the details of the challenge in your post instead of linking users to the Python Challenge page.
Either way, if you put the regex into single-line (DOTALL) mode with the re.S flag, the dot character, ., will match newlines, \n, as well. This obviates the \n(.*)\n construction in your regex, which may solve your problem.
Here's a link to a working regex example.
Here is the modified python 2.7 code:
#!/usr/bin/python
import requests, re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall("<!--(.*?)-->", str(x.content), re.S)
print codes[1]
Note the re.S, (.*?), and codes[1] modifications.
re.S is Python's flag for single-line (DOTALL) mode
(.*?) makes the * quantifier non-greedy
codes[1] prints the second block of content found within HTML comments (since findall(..) matches two and returns a list of both).
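The effect of those three changes can be seen on a small stand-in document (the comment text here is made up):

```python
import re

# Two comments: a one-liner, and a multi-line one like the challenge's payload.
html = '<!--first comment-->\n<body>\n<!--\nsecond\ncomment\n-->\n</body>'

# re.S lets "." cross newlines; the non-greedy (.*?) stops at the first -->,
# so the two comments are captured separately instead of as one giant match.
codes = re.findall("<!--(.*?)-->", html, re.S)
print(codes[1].strip())
```

Without re.S the multi-line comment would not match at all, and without the non-greedy `?` both comments plus the body between them would collapse into a single match.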
You can also solve it by passing the re.S flag, which lets . match whitespace and line breaks:
codes = re.findall("<!--(.*?)-->", str(x.content), re.S)
I'm trying to test out urllib2. Here's my code:
import urllib2
response = urllib2.urlopen('http://pythonforbeginners.com/')
print response.info()
html = response.read()
response.close()
When I run it, I get:
Syntax Error: invalid syntax, with the caret pointing to line 3 (the print line). Any idea what's going on here? I'm just trying to follow a tutorial and this is the first thing they do...
Thanks,
Mariogs
In Python3 print is a function. Therefore it needs parentheses around its argument:
print(response.info())
In Python2, print is a statement, and hence does not require parentheses.
After correcting the SyntaxError, as alecxe points out, you'll probably encounter an ImportError next. That is because the Python2 module called urllib2 was renamed to urllib.request in Python3. So you'll need to change it to
import urllib.request as request
response = request.urlopen('http://pythonforbeginners.com/')
As you can see, the tutorial you are reading is meant for Python2. You might want to find a Python3 tutorial or Python3 urllib HOWTO to avoid running into more of these problems.
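For reference, here is a Python3 version of the whole snippet. The data: URL below is a stand-in so the sketch runs without network access; substitute the tutorial's http:// URL in practice:

```python
import urllib.request

# data: URLs are handled by urllib.request out of the box (Python 3.4+),
# so this exercises the same API without touching the network.
response = urllib.request.urlopen("data:text/plain;charset=utf-8,hello")
print(response.info())   # print is a function in Python 3, hence the parentheses
html = response.read()   # bytes, not str, in Python 3
response.close()
print(html)
```

Note that read() returns bytes in Python 3; call html.decode() if you need a str.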