Does anyone know how I would be able to take the URL as an argument in Python, as page?
That is, instead of hard-coding it in the script, the user would type it into the shell and it would be passed through as an argument, just to make the script more portable.
import sys, re
import webpage_get

def print_links(page):
    ''' find all hyperlinks on a webpage passed in as input and
    print them '''
    print '\n[*] print_links()'
    # match http:// followed by host/path characters (simplified from the
    # original, unworkable pattern)
    links = re.findall(r'http://[\w./-]+', page)
    # sort and print the links
    links.sort()
    print '[+]', str(len(links)), 'HyperLinks Found:'
    for link in links:
        print link
def main():
    # temp testing url argument
    sys.argv.append('http://www.4chan.org')
    # Check args
    if len(sys.argv) != 2:
        print '[-] Usage: webpage_getlinks URL'
        return
    # Get the web page
    page = webpage_get.wget(sys.argv[1])
    # Get the links
    print_links(page)

if __name__ == '__main__':
    main()
It looks like you got started with command-line arguments, but to give you an example for this specific situation, you could do something like this:
def main(url):
    page = webpage_get.wget(url)
    print_links(page)

if __name__ == '__main__':
    url = ''
    if len(sys.argv) >= 2:
        # sys.argv[0] is the script name; the first real argument is at index 1
        url = sys.argv[1]
    main(url)
Then run it from the shell like this:
python test.py http://www.4chan.org
Here is a tutorial on command-line arguments which may help your understanding more than this snippet: http://www.tutorialspoint.com/python/python_command_line_arguments.htm
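If you want something sturdier than indexing sys.argv yourself, the standard library's argparse module validates arguments and generates the usage message for you. A minimal sketch (webpage_get and print_links are from the question's script; everything else is standard library):

import argparse
import webpage_get  # module from the question

def main():
    parser = argparse.ArgumentParser(description='Print all hyperlinks on a web page.')
    parser.add_argument('url', help='URL of the page to fetch, e.g. http://www.4chan.org')
    args = parser.parse_args()
    page = webpage_get.wget(args.url)
    print_links(page)  # from the question's script

if __name__ == '__main__':
    main()

Running it with no arguments then prints a usage message instead of failing silently.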
Can you let me know if I misunderstood your question? I didn't feel too confident about its meaning after I read it.
I have a script that checks the input link: if it's equivalent to one I specified in the code, it runs my code, otherwise it opens the link in Chrome.
I want to register that script as a kind of default browser, to gain speed compared to opening the browser, grabbing the link with the help of an extension, and then sending it to my script using POST.
I used Procmon to check where the process in question queries the registry, and it seems it checks HKCU\Software\Classes\ChromeHTML\shell\open\command, so I added a key there; in command, I edited the value to my script path and arguments (-- %1) (the -- is only there for testing purposes).
Unfortunately, once a program queries this to send a link, Windows prompts me to choose a browser instead of launching my script, which isn't what I want.
Any idea?
In HKEY_CURRENT_USER\Software\Classes\ChromeHTML\Shell\open\command, replace the (Default) value with "C:\Users\samdra.r\AppData\Local\Programs\Python\Python39\pythonw.exe" "[Script_path_here]" %1
When launching a link, you'll be asked to set a default browser only once (it asks for a default browser after each change you make to the key):
I select Chrome in my case.
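If you'd rather set the value from Python than through regedit, a minimal sketch using the standard winreg module (the key path is the one Procmon showed; the executable and script paths are placeholders):

import winreg

# the (Default) value Procmon showed the process querying
KEY_PATH = r"Software\Classes\ChromeHTML\shell\open\command"

# pythonw.exe avoids a console window; Windows substitutes the clicked link for %1
command = r'"C:\path\to\pythonw.exe" "C:\path\to\your_script.pyw" %1'

# HKCU is writable by the current user, so no admin rights are needed
with winreg.CreateKey(winreg.HKEY_CURRENT_USER, KEY_PATH) as key:
    winreg.SetValueEx(key, "", 0, winreg.REG_SZ, command)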
As for the Python script, here it is:
import sys
import browser_cookie3
import requests
from bs4 import BeautifulSoup as BS
import re
import os
import asyncio
import shutil

def Prep_download(args):
    settings = os.path.abspath(__file__.split("NewAltDownload.py")[0]+'/settings.txt')
    if args[1] == "-d" or args[1] == "-disable":
        with open(settings, 'r+') as f:
            f.write(f.read()+"\n"+"False")
        print("Background program disabled, exiting...")
        exit()
    if args[1] == "-e" or args[1] == "-enable":
        with open(settings, 'r+') as f:
            f.write(f.read()+"\n"+"True")
    link = args[-1]
    with open(settings, 'r+') as f:
        try:
            data = f.read()
            osupath = data.split("\n")[0]
            state = data.split("\n")[1]
        except:
            f.write(f.read()+"\n"+"True")
            print("Possible first run, wrote True, exiting...")
            exit()
    if state == "True":
        asyncio.run(Download_map(osupath, link))

async def Download_map(osupath, link):
    # note the parentheses around the two path checks: the original line was
    # missing them, so the "beatmapsets" check bypassed the domain check
    if link.split("/")[2] == "osu.ppy.sh" and (link.split("/")[3] == "b" or link.split("/")[3] == "beatmapsets"):
        with requests.get(link) as r:
            link = r.url.split("#")[0]
        BMID = []
        id = re.sub("[^0-9]", "", link)
        # collect the beatmap ids already present in the Songs folder
        for ids in os.listdir(os.path.abspath(osupath+("/Songs/"))):
            if re.match(r"(^\d*)", ids).group(0).isdigit():
                BMID.append(re.match(r"(^\d*)", ids).group(0))
        if id in BMID:
            print(link+": Map already exists")
            os.system('"'+os.path.abspath("C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")+'" '+link)
            return
        if not id.isdigit():
            print("Invalid id")
            return
        cj = browser_cookie3.load()
        print("Downloading", link, "in", os.path.abspath(osupath+"/Songs/"))
        headers = {"referer": link}
        with requests.get(link) as r:
            t = BS(r.text, 'html.parser').title.text.split("·")[0]
        with requests.get(link+"/download", stream=True, cookies=cj, headers=headers) as r:
            if r.status_code == 200:
                try:
                    id = re.sub("[^0-9]", "", link)
                    with open(os.path.abspath(__file__.split("NewAltDownload.pyw")[0]+id+" "+t+".osz"), "wb") as otp:
                        otp.write(r.content)
                    shutil.copy(os.path.abspath(__file__.split("NewAltDownload.pyw")[0]+id+" "+t+".osz"), os.path.abspath(osupath+"/Songs/"+id+" "+t+".osz"))
                except:
                    print("You either aren't connected on osu!'s website or you're limited by the API, in which case you now have to wait 1h and then try again.")
            else:
                os.system('"'+os.path.abspath("C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")+'" '+link)

args = sys.argv
if len(args) == 1:
    print("No arguments provided, exiting...")
    exit()
Prep_download(args)
You obtain the argument %1 (the link) with sys.argv[-1] (sys.argv is a list), and from there you just check whether the link looks like the one you're expecting (in my case it needs to look like https://osu.ppy.sh/b/ or https://osu.ppy.sh/beatmapsets/).
If that's the case, run your own code; otherwise, just launch Chrome with the Chrome executable and the link as an argument. If the id of the beatmap is already in the Songs folder, I also open the link in Chrome, as the sketch below shows.
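Stripped of the download logic, that dispatch pattern is just this (a minimal sketch; the Chrome path is the one used in the script above, and handle_link is a hypothetical stand-in for your own code):

import os
import sys

CHROME = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"

def handle_link(link):
    # hypothetical placeholder for your own handling (the beatmap download here)
    print("handling", link)

link = sys.argv[-1]  # Windows substitutes the clicked URL for %1
if link.startswith("https://osu.ppy.sh/b/") or link.startswith("https://osu.ppy.sh/beatmapsets/"):
    handle_link(link)
else:
    # not a link we care about: hand it off to the real browser
    os.system('"' + CHROME + '" ' + link)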
To make it work in the background I had to fight with subprocesses and a few more tricks, and in the end it suddenly started working with pythonw and the .pyw extension.
Hello there. I want to code a Python program that opens a website. When you just type a shortcut, e.g. "google", it will open "https://www.google.de/". The problem is that it won't open the right URL.
import webbrowser

# URL list
google = "https://www.google.de"
ebay = "https://www.ebay.de/"

# shortcuts
Websites = ("google", "ebay")

def inputString():
    inputstr = input()
    if inputString(google) = ("https://www.google.de")
    else:
        print("Please look for the right shortcut.")
        return

url = inputString()
webbrowser.open(url)
Using your example you can do:
import webbrowser

google = "https://www.google.de"
ebay = "https://www.ebay.de/"

def inputString():
    return input()

if inputString() == "google":
    url = google
    webbrowser.open(url)
Or you can do it the simple way, as @torxed said:
inputstr = input()
sites = {'google': 'https://google.de', 'ebay': 'https://www.ebay.de/'}
if inputstr in sites:
    webbrowser.open(sites[inputstr])
How about:
import webbrowser
import sys

websites = {
    "google": "https://www.google.com",
    "ebay": "https://www.ebay.com"
}

if __name__ == "__main__":
    try:
        webbrowser.open(websites[sys.argv[1]])
    except (IndexError, KeyError):
        # no argument given, or the shortcut isn't in the dict
        print("Please look for the right shortcut:")
        for website in websites:
            print(website)
Run it like so: python browse.py google
I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.
import requests
import re
import os
import urllib.request

def get_response(url):
    response = requests.get(url).text
    return response

def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S)
    return re.findall(reg, html)

def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg, response)

def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg, response)

def download_book(book_url, path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path)  # my local file path
    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url, path)
        print('ok!!!')
    else:
        print('no!!!')

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0], book_name[0])
            except:
                continue

def main():
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
"I have run the code and get nothing, no tracebacks."
Well, there's no chance you get a traceback in the case of an exception in download_book() since you explicitly silence them:
try:
    download_book(book_url[0], book_name[0])
except:
    continue
So the very first thing you want to do is to at least print out errors:
try:
    download_book(book_url[0], book_name[0])
except Exception as e:
    print("while downloading book {}: got error {}".format(book_url[0], e))
    continue
or just don't catch the exception at all (at least until you know what to expect and how to handle it).
"I don't even know how to fix it"
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the builtin step debugger
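For the third option, the built-in debugger is pdb; a minimal sketch, dropped into the question's get_url_name() (the surrounding functions are the question's own):

import pdb

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    # execution pauses here with a (Pdb) prompt: inspect `content` by name,
    # step to the next line with `n`, resume with `c`
    pdb.set_trace()
    for i in content:
        ...

You can also run the whole script under the debugger with python -m pdb yourscript.py.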
Now there are a few obvious issues with your code:
1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ response as well).
2/ you use regexps to parse html. DON'T - regexps cannot reliably work on html, you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping as it's very tolerant). Also some of your regexps look quite wrong (greedy match-all etc).
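To illustrate both points, here is what the fetch-and-extract step could look like with a status check and BeautifulSoup instead of regexps (a sketch, not a drop-in fix; get_book_links is a hypothetical helper and filtering the links down to actual books is left to you):

import requests
from bs4 import BeautifulSoup

def get_book_links(start_url):
    response = requests.get(start_url)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')
    # (href, link text) pairs instead of fragile regexp captures
    return [(a['href'], a.get_text()) for a in soup.find_all('a', href=True)]

for url, name in get_book_links('http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'):
    print(url, name)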
start_url is not defined in main()
You need to use a global variable. Otherwise, a better (cleaner) approach is to pass in the variable that you are using. In any case, I would expect an error: start_url is not defined.
def main(start_url):
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)
EDIT:
Nevermind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup (from bs4 import BeautifulSoup). For information on why you shouldn't parse HTML with regex, see this answer: RegEx match open tags except XHTML self-contained tags
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial url has got a newline between </h2> and <ul>, try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
When you fix that one, you will face another error, I suggest some debug printouts like this:
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0], book_name[0])
            except:
                continue
I'm currently writing a Python crawler and I want to switch to the next page, but what is the best practice?
Actually it's simple: the end of the URL is .html?page=1, so I can increment the page number, but is there a best practice to do this as cleanly as possible?
I use urllib, urlparse and BeautifulSoup.
#!/usr/bin/env python2
import urllib
import urlparse
from bs4 import BeautifulSoup

def getURL():
    try:
        fo = open("WebsiteToCrawl", "rw")
        print ok() + "Data to crawl a store in : ", fo.name
    except:
        print fail() + "File doesn't exist, please create WebSiteTOCrawl file for store website listing"
    line = fo.readlines()
    print ok() + "Return website : %s" % (line)
    fo.close()
    i = 0
    while i < len(line):
        try:
            returnDATA = urllib.urlopen(line[i]).read()
            print ok() + "Handle :" + line[i]
            handleDATA(returnDATA)
        except:
            print fail() + "Can't open url"
        i += 1

def handleDATA(returnDATA):
    try:
        soup = BeautifulSoup(returnDATA)
        for link in soup.find_all('a'):
            urls = link.get('href')
            try:
                print urls
            except:
                print end() + "EOF: All site crawled"
    except:
        # except clause added here; the original outer try was left unclosed
        # when the code was simplified for posting
        pass

def main():
    useDATA = getURL()
    handleDATA(useDATA)

if __name__ == "__main__":
    main()
NB: I've simplified the code compared to the original.
If it's as straightforward as changing the number in the url, then do that.
However, you should consider how you're going to know when to stop. If the page returns pagination detail at the bottom (e.g. Back 1 2 3 4 5 ... 18 Next) then you could grab the contents of that element and find the 18.
An alternative, albeit slower, would be to parse the pagination links on each page and follow them manually by opening the url directly or using a click method to click next until next no longer appears on the page. I don't use urllib directly but it can be done super easily with Selenium's python bindings (driven by PhantomJS if you need it to be headless). You could also do this whole routine with probably an even smaller amount of code using RoboBrowser if you don't have AJAX to deal with.
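For the first approach, a minimal sketch of the increment-until-empty loop (the URL pattern and the element that marks a result are placeholders for whatever the target site actually uses):

import requests
from bs4 import BeautifulSoup

BASE = "http://example.com/listing.html?page={}"  # placeholder URL pattern

page = 1
while True:
    soup = BeautifulSoup(requests.get(BASE.format(page)).text, "html.parser")
    links = soup.find_all("a")  # placeholder: whatever marks a result on this site
    if not links:
        break  # an empty page means we ran past the last one
    for link in links:
        print(link.get("href"))
    page += 1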
import cgi

form = cgi.FieldStorage()
test = form['name'].value
if test is None:
    print('empty')
else:
    print('Hello ' + test)
... and that doesn't seem to display anything when my URL is something like .../1.py.
If I set it to .../1.py?name=asd it will display Hello asd.
Also, how do I get everything after the question mark and after the domain name? For example, if I try to access http://localhost/thisis/test I want to get /thisis/test.
Edit: I tried to use try: and I couldn't get it working.
To answer the first part of my question, I found what the problem is; here is the correct code:
import cgi

form = cgi.FieldStorage()
try:
    test = form['name'].value
except KeyError:
    print('not found')
else:
    print(test)
For my second question:
import os
print(os.environ["REQUEST_URI"])
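Note that REQUEST_URI is a server extension (Apache sets it) rather than part of the CGI specification itself; if it's missing under your server, a sketch that rebuilds the same string from the standard RFC 3875 variables:

import os

# standard CGI variables; REQUEST_URI is an extension some servers provide
script = os.environ.get("SCRIPT_NAME", "")
path = os.environ.get("PATH_INFO", "")
query = os.environ.get("QUERY_STRING", "")

print(script + path + ("?" + query if query else ""))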