Downloading all Ctrl Alt Del webcomics using the terminal - Python

I've tried using the following commands to download the Ctrl Alt Del comics:
$ for filename in $(seq 20021023 20100503); do wget http://www.ctrlaltdel-online.com/comics/"$filename".jpg; done
I get the following error: bash: syntax error near unexpected token 'do'
I've also tried using cURL, with this command:
curl http://ctrlaltdel-online.com/comics[20021023..20100503].jpg
I get the following error: curl: (3) [globbing] error: bad range specification after pos 37
What's wrong, and how can I fix it?

As msw pointed out, crawling a site can be illegal, unethical, irritating to the author, or perfectly fine. Please use your scripting powers responsibly and for Good (tm). Asking permission first would certainly be a nice thing to do.
Note that the ctrlaltdel-online.com web server seems to return HTTP 403 forbidden to wget with the normal wget User-Agent string. Emulating something Firefox-ish seems to bypass that (although I bet they are just explicitly denying wget, which indicates they most likely forbid this type of access).
USERAGENT='Mozilla/5.0 Firefox/3.6.3'
for DAYS in $(seq 365)
do
    NEXT=$(date -d "${DAYS} days ago" +%Y%m%d)
    wget -U "${USERAGENT}" "http://www.cad-comic.com/comics/cad/${NEXT}.jpg"
done
Replace 365 with a larger number to go back more than a year. The wget output might be annoying; pass it -q to make it quiet.
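Incidentally, the curl error in the question is just syntax: curl's numeric ranges use a dash, as in [20021023-20100503], not "..".

Since the question asks for Python, here is a minimal sketch of the same loop using the requests package (an assumption: requests is installed, and the date-stamped URL pattern above still holds):

import datetime
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0 Firefox/3.6.3'}  # same Firefox-ish UA as above

day = datetime.date(2002, 10, 23)
end = datetime.date(2010, 5, 3)
while day <= end:
    stamp = day.strftime('%Y%m%d')
    resp = requests.get('http://www.cad-comic.com/comics/cad/%s.jpg' % stamp,
                        headers=HEADERS)
    if resp.status_code == 200:  # skip days that have no comic
        with open(stamp + '.jpg', 'wb') as f:
            f.write(resp.content)
    day += datetime.timedelta(days=1)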

I was writing the same script. Here it is.
import sys
import re
from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    # pretend to be Firefox; the site rejects the default urllib user agent
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

def getlinks(add, m, opener):
    # scrape the archive page for comic dates in the requested month
    ufile = opener.open(add)
    html = ufile.read()
    dates = re.findall(r'href="/cad/(\d+)">', html)
    links = []
    for date in dates:
        if date[4:6] == m:
            links.append('http://www.cad-comic.com/cad/' + date)
    links.reverse()
    print 'Total {} comics found.'.format(len(links))
    return links

def getstriplink(link, opener):
    # pull the strip image URL and title from a single comic page
    ufile = opener.open(link)
    html = ufile.read()
    url = re.search('img src="(.+)" alt="(.+)" title=', html)
    date = link[-8:]
    return (url.group(1), url.group(2), date)

def main():
    y = raw_input('Enter year 2002 - current (yyyy) ')
    m = raw_input('Enter month (only months 12, 11 and 10 for 2002) (mm) ')
    add = 'http://www.cad-comic.com/cad/archive/' + y
    opener = MyOpener()
    links = getlinks(add, m, opener)
    f = open('/media/aux1/pythonary/cad' + y + m + '.html', 'w')
    print 'downloading'
    for link in links:
        url = getstriplink(link, opener)
        date = url[2]
        opener.retrieve(url[0], '/media/aux1/pythonary/getcad_files/strip' + date)
        sys.stdout.flush()
        print '.',
        f.write('<h2>' + url[1] + ' ' + date + '</h2>'
                '<p><img src="getcad_files/strip' + date + '"/></p>')
    f.close()

if __name__ == '__main__':
    main()
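A note for anyone running this on Python 3: FancyURLopener is deprecated there, so the opener setup would use urllib.request instead. A rough sketch of the equivalent, against the same archive URL the script above assumes:

import urllib.request

# build an opener with a browser-ish User-Agent, as MyOpener does above
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 Firefox/3.6.3')]
html = opener.open('http://www.cad-comic.com/cad/archive/2009').read().decode('utf-8', 'replace')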

Related

Is it possible to let pexpect output the text it matches?

I am familiar with expect scripts, so I felt a bit odd when I first used pexpect. Take this simple script as an example:
#!/usr/bin/expect
set timeout 10
spawn npm login
expect "Username:"
send "qiulang2000\r"
expect "Password:"
send "xxxxx\r"
expect "Email:"
send "qiulang#gmail.com\r"
expect "Logged in as"
interact
When I run it I get the following output. It feels natural, because that is how I run those commands:
spawn npm login
Username: qiulang2000
Password:
Email: (this IS public) qiulang@gmail.com
Logged in as qiulang2000 on https://registry.npmjs.com/.
But when I use pexpect, no matter how I add print(child.after) or print(child.before), I just can't get output like expect's. For example, when I run the following script,
#! /usr/bin/env python3
import pexpect
child = pexpect.spawn('npm login')
child.timeout = 10
child.expect('Username:')
print(child.after.decode("utf-8"))
child.sendline('qiulang2000')
child.expect('Password:')
child.sendline('xxxx')
child.expect('Email:')
child.sendline('qiulang@gmail.com')
child.expect('Logged in as')
print(child.before.decode("utf-8"))
child.interact()
I got this output; it feels unnatural because it is not what I see when I run those commands:
Username:
(this IS public) qiulang@gmail.com
qiulang2000 on https://registry.npmjs.com/.
So is it possible to achieve the expect script's output?
--- update ---
With the comment I got from @pynexj I finally made it work; check my answer below.
With the comment I got, I finally made it work:
#! /usr/bin/env python3
import pexpect
import sys
child = pexpect.spawn('npm login', timeout=10)
child.logfile_read = sys.stdout.buffer  # use sys.stdout.buffer for Python 3
child.expect('Username:')
child.sendline('qiulang2000')
child.expect('Password:')
child.sendline('xxxx')
child.expect('Email:')
child.sendline('qiulang@gmail.com')
child.expect('Logged in as')
If I need to call child.interact(), then it is important that I call child.logfile_read = None before it, otherwise sys.stdout will echo everything I type.
The answer at How to see the output in pexpect? said I need to pass an encoding for Python 3, but I found that if I use encoding='utf-8' it causes TypeError: a bytes-like object is required, not 'str'. That is because with an encoding set, pexpect logs str rather than bytes, so the log target would have to be sys.stdout instead of sys.stdout.buffer. If I don't set an encoding at all, everything works fine.
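For completeness, a minimal sketch of the encoding-based variant that answer describes, assuming you do want pexpect to deal in str throughout; the key difference is that the log target must then be sys.stdout itself rather than sys.stdout.buffer:

#!/usr/bin/env python3
import pexpect
import sys

# with encoding set, pexpect reads and logs str, not bytes
child = pexpect.spawn('npm login', timeout=10, encoding='utf-8')
child.logfile_read = sys.stdout  # a str target; sys.stdout.buffer would raise TypeError
child.expect('Username:')
child.sendline('qiulang2000')
# ... continue exactly as in the bytes version above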
So a simple ssh login script looks like this
#!/usr/bin/env python3
import pexpect
import sys
child = pexpect.spawn('ssh qiulang@10.0.0.32')
child.logfile_read = sys.stdout.buffer
child.expect('password:')
child.sendline('xxxx')
# child.expect('Last login')  <- don't do that (see the note below)
child.logfile_read = None # important !!!
child.interact()
One problem remains unresolved: I had added one last expect call to match the ssh login output after sending the password, e.g. child.expect('Last login').
But if I added that, that line would show twice in the output. I have given up trying to fix it; as one comment said, "pexpect's behavior is kind of counter intuitive".
Welcome to Ubuntu 16.04 LTS (GNU/Linux 4.4.0-141-generic x86_64)
* Documentation: https://help.ubuntu.com/
33 packages can be updated.
0 updates are security updates.
Last login: Fri Sep 11 11:44:19 2020 from 10.0.0.132
: Fri Sep 11 11:44:19 2020 from 10.0.0.132
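A plausible explanation for the doubled line, inferred from pexpect's documented behaviour rather than stated anywhere in this thread: expect() logs everything it reads through logfile_read, including the text after the match, and that remainder stays in child.buffer; interact() then flushes the leftover buffer to stdout a second time.

child.expect('Last login')  # rest of the line is read, logged via logfile_read, and kept in child.buffer
child.logfile_read = None   # stops future double-logging, but the leftover was already logged
child.interact()            # interact() first writes child.buffer out, so the line appears again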

Why does this work in IDE but not in CMD Prompt?

The code reaches out to a message board and indexes/reports the top topics. Using Wing IDE, it works fine and reports no errors. However, when run via the command prompt, it errors out saying it can't properly encode a character. This is the first time I've seen this, and I haven't found a good resource to fix it.
Given that it runs fine in Wing, I'm unsure what else to add to the code to prevent this issue in the command prompt.
import requests
from bs4 import BeautifulSoup

url = raw_input("Enter the board URL: ")
print "\n"

# send the HTTP request
response = requests.get(url)
if response.status_code == 200:
    # pull the content
    html_content = response.content
    # send the page to BeautifulSoup
    html_doc = BeautifulSoup(html_content, "html.parser")
    # extract topic data
    topic_spider = html_doc.find_all("span", {"class": "subject"})
    data = []
    for topic in topic_spider:
        if topic.text != '':
            data.append(topic.text)
    topiclist = list(dict.fromkeys(data))
    topiclist.sort(reverse=False)
    for item in topiclist:
        print ('[*] ' + item)
Wing runs this just fine with no errors. Via CMD, the following occurs after several successful results:
[*] Parenting (successful result)
Traceback (most recent call last):
File "D:\xxxx\topicindexer.py", line 29, in <module>
print ('[*] ' + item)
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 31: character maps to <undefined>
I note two things.
One, you use print calls like this:
print ('[*] ' + item)
which suggests the code was written for Python 3.x.
Two, however, your CMD traceback shows it running under Python 2.7.
That appears to be your problem. Try python3 filename.py on the command line instead of python filename.py, since python defaults to 2.x when you have both installed.
See if this solves it before anything else.
Make sure the Python environment in CMD and Wing is the same: set the environment variables that Wing IDE uses in CMD as well.
It looks like your code is written for Python 3 but your default is set to Python 2. When running your code in CMD, use python3 myfile.py rather than just python myfile.py.
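A side note that neither answer covers: the traceback names cp437 (the Windows console codec), and the script uses raw_input, which only exists on Python 2. If the script has to stay on Python 2, a hedged workaround is to encode explicitly before printing and replace anything the console codec cannot represent (topiclist is the list built in the question's script):

import sys

for item in topiclist:
    line = u'[*] ' + item
    # replace unmappable characters such as u'\u2019' instead of letting print raise
    print line.encode(sys.stdout.encoding or 'utf-8', 'replace')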

CURL request to Python as command line

Currently, I'm trying to convert a curl request to a Python script:
curl $(curl -u username:password -s https://api.example.com/v1.1/reports/11111?fields=download | jq ".report.download" -r) > "C:\sample.zip"
I have tried pycurl, with no success, due to my limited knowledge.
As a solution, I found that it is possible to run commands through Python:
https://www.raspberrypi.org/forums/viewtopic.php?t=112351
import os
os.system("curl -K.........")
And another solution (more common, based on my searching) using subprocess:
import subprocess
subprocess.call(['command', 'argument'])
Currently, I'm not sure where to go next and how to adapt these solutions to my situation.
import os
os.system("curl $(curl -u username:password -s https://api.example.com/v1.1/reports/11111?fields=download | jq '.report.download' -r) > 'C:\sample.zip'")
'curl' is not recognized as an internal or external command,
operable program or batch file.
255
P.S. - Update v1
Any suggestions?
import requests
response = requests.get('https://api.example.com/v1.1/reports/11111?fields=download | jq ".report.download" -r', auth=('username', 'password'))
This works without the | jq ".report.download" part, but that part is the main one: it extracts the link that is then used to download the file.
Any way around it?
The error 'curl' is not recognized as an internal or external command means that Python couldn't find where curl is installed. If you have already installed curl, try giving the full path to it. For example, if curl.exe is located in C:\System32, then try
import os
os.system("C:\System32\curl $(curl -u username:password -s https://api.example.com/v1.1/reports/11111?fields=download | jq '.report.download' -r) > 'C:\sample.zip'")
But that's definitely not the pythonic way of doing things. I would instead suggest using the requests module.
You need to invoke requests twice: first to download the JSON content from https://api.example.com/v1.1/reports/11111?fields=download, take the new URL pointed to by report.download, and then invoke requests again to download data from that new URL.
Something along these lines should get you going
import requests

url = 'https://api.example.com/v1.1/reports/11111'
response = requests.get(url, params=(('fields', 'download'),),
                        auth=('username', 'password'))
report_url = response.json()['report']['download']
data = requests.get(report_url).content
with open(r'C:\sample.zip', 'wb') as f:  # binary mode: data is bytes
    f.write(data)
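If the report can be large, a variant worth considering streams it to disk instead of holding the whole body in memory; requests supports this via stream=True and iter_content:

import requests

with requests.get(report_url, stream=True) as r:  # report_url as computed above
    r.raise_for_status()
    with open(r'C:\sample.zip', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)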
You can use this site to convert the actual curl part of your command to something that works with requests: https://curl.trillworks.com/
From there, just use the .json() method of the response object to do whatever processing you need.
Finally, you can save like so:
import json

with open(r'C:\sample.zip', 'w') as f:  # 'w', not 'r': we are writing
    json.dump(data, f)

Python equivalent of this Curl command

I am trying to download a file using python, imitating the same behavior as this curl command:
curl ftp://username:password@example.com \
  --retry 999 \
  --retry-max-time 0 \
  -o 'target.txt' -C -
How would this look in python ?
Things I have looked into:
Requests : no ftp support
Python-wget: no download resume support
requests-ftp : no download resume support
fileDownloader : broken(?)
I am guessing one would need to build this from scratch and go low level with pycurl or urllib2 or something similar.
I am trying to create this script in Python and I feel lost. Should I just call curl from a Python subprocess?
Any pointer in the right direction would be much appreciated.
You can use Python's built-in ftplib.
Here is the code:
from ftplib import FTP

ftp = FTP('example.com', 'username', 'password')  # logs in
ftp.retrlines('LIST')  # see the list of files and directories
ftp.cwd('some_directory')  # change to any directory
ftp.retrbinary('RETR filename', open('filename', 'wb').write)  # start downloading
ftp.close()  # close the connection
Auto resume is supported; I even tried turning off my wifi and checked that the download resumed.
You can refer to /Python27/Lib/ftplib.py for the default timeout settings.
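For an explicit resume point, retrbinary() also accepts a rest offset. A sketch, assuming the remote file is target.txt and a partial local copy may already exist:

import os
from ftplib import FTP

local = 'target.txt'
offset = os.path.getsize(local) if os.path.exists(local) else 0  # where to resume from
ftp = FTP('example.com', 'username', 'password')
with open(local, 'ab') as f:  # append to the partial file
    ftp.retrbinary('RETR target.txt', f.write, rest=offset)
ftp.close()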
There is this library for downloading files from an FTP server: fileDownloader.py
To download a file:
downloader = fileDownloader.DownloadFile('http://example.com/file.zip', r'C:\Users\username\Downloads\newfilename.zip', ('username', 'password'))
downloader.download()
To resume a download:
downloader = fileDownloader.DownloadFile('http://example.com/file.zip', r'C:\Users\username\Downloads\newfilename.zip', ('username', 'password'))
downloader.resume()

Is it possible to stream output from a python subprocess to a webpage in real time?

Thanks in advance for any help. I am fairly new to Python and even newer to HTML.
I have been trying the last few days to create a web page with buttons to perform tasks on a home server.
At the moment I have a python script that generates a page with buttons:
(See the simplified example below; code removed to clean up the post.)
Then a python script which runs said command and outputs to an iframe on the page:
(See the simplified example below; code removed to clean up the post.)
This does work, but it outputs everything only after the command has finished. I have also tried adding the -u option to the Python script to run it unbuffered, and I have tried using the Python subprocess module as well. If it helps, the kinds of commands I am running are apt-get update and other Python scripts that move files and fix folder permissions.
When run from a normal Ubuntu server terminal it runs fine and outputs in real time, and from my research it should be outputting as the command runs.
Can anyone tell me where I am going wrong? Should I be using a different language to perform this function?
EDIT Simplified example:
initial page:
#runcmd.html
<head>
<title>Admin Tasks</title>
</head>
<center>
<iframe src="/scripts/python/test/createbutton.py" width="650" height="800" frameborder="0" ALLOWTRANSPARENCY="true"></iframe>
<iframe width="650" height="800" frameborder="0" ALLOWTRANSPARENCY="true" name="display"></iframe>
</center>
script that creates button:
cmd_page = '<form action="/scripts/python/test/runcmd.py" method="post" target="display" >' + '<label for="run_update">run updates</label><br>' + '<input align="Left" type="submit" value="runupdate" name="update" title="run_update">' + "</form><br>" + "\n"
print ("Content-type: text/html")
print ''
print cmd_page
script that should run command:
# runcmd.py:
import os
import pexpect
import cgi
import cgitb
import sys

cgitb.enable()
fs = cgi.FieldStorage()
sc_command = fs.getvalue("update")
if sc_command == "runupdate":
    cmd = "/usr/bin/sudo apt-get update"
    # logfile=sys.stdout makes pexpect echo the child's output as it is read
    pd = pexpect.spawn(cmd, timeout=None, logfile=sys.stdout)
    print ("Content-type: text/html")
    print ''
    print "<pre>"
    # the loop just drains the child; the logfile does the actual printing
    line = pd.readline()
    while line:
        line = pd.readline()
I haven't tested the above simplified example, so I'm unsure if it's functional.
EDIT:
Simplified example should work now.
Edit:
With Imran's code below, if I open a browser to ip:8000 it displays the output just as if it were running in a terminal, which is exactly what I want. Except I am using an Apache server for my website and an iframe to display the output. How do I do that with Apache?
edit:
I now have the output going to the iframe using Imran's example below, but it still seems to buffer. For example:
If I have it (the script behind the web server, fetched with curl ip:8000) run apt-get update, in a terminal it runs fine, but when outputting to the web page it seems to go buffer => output => buffer => output till the command is done.
But running other Python scripts the same way buffers and then outputs everything at once, even with the -u flag, while in a terminal curl ip:8000 outputs like normal.
Is that just how it is supposed to work?
EDIT 19-03-2014:
Any bash / shell command I run using Imran's method seems to output to the iframe in near real time. But if I run any kind of Python script through it, the output is buffered and then sent to the iframe all at once.
Do I possibly need to pipe the output of the Python script that is run by the script that runs the web server?
You need to use HTTP chunked transfer encoding to stream unbuffered command line output. CherryPy's wsgiserver module has built-in support for chunked transfer encoding. WSGI applications can be either functions that return a list of strings, or generators that produce strings. If you use a generator as the WSGI application, CherryPy will use chunked transfer automatically.
Let's assume this is the program whose output will be streamed:
# slowprint.py
import sys
import time

for i in xrange(5):
    print i
    sys.stdout.flush()
    time.sleep(1)
This is our web server.
2014 Version (Older CherryPy Version)
# webserver.py
import subprocess
from cherrypy import wsgiserver

def application(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    proc = subprocess.Popen(['python', 'slowprint.py'], stdout=subprocess.PIPE)
    # yield each line as it arrives; CherryPy streams it with chunked encoding
    line = proc.stdout.readline()
    while line:
        yield line
        line = proc.stdout.readline()

server = wsgiserver.CherryPyWSGIServer(('0.0.0.0', 8000), application)
server.start()
2018 Version
#!/usr/bin/env python2
# webserver.py
import subprocess
import cherrypy

class Root(object):
    def index(self):
        def content():
            proc = subprocess.Popen(['python', 'slowprint.py'], stdout=subprocess.PIPE)
            line = proc.stdout.readline()
            while line:
                yield line
                line = proc.stdout.readline()
        return content()
    index.exposed = True
    index._cp_config = {'response.stream': True}

cherrypy.quickstart(Root())
Start the server with python webserver.py, then in another terminal make a request with curl and watch the output being printed line by line:
curl 'http://localhost:8000'
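If the child process is itself a Python script and still looks buffered through the pipe (the 19-03-2014 edit above), that is ordinary stdio block-buffering: pipes are not ttys, so Python buffers its output in blocks. One fix is to spawn the child unbuffered with the -u flag:

proc = subprocess.Popen(['python', '-u', 'slowprint.py'], stdout=subprocess.PIPE)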

Categories

Resources