Downloading multiple PDF files with bash/Python

I'd like to download all the freely available books from https://hebrewbooks.org/ using a simple script.
Every book (there are 52,000 of them) has a unique numeric ID assigned. For example:
https://hebrewbooks.org/1
https://hebrewbooks.org/3
https://hebrewbooks.org/52000
But many numbers have been skipped or have been removed.
Usually a visitor would click the download button, which returns (for book number 52000):
https://download.hebrewbooks.org/downloadhandler.ashx?req=52000
Or (for book number 1)
https://download.hebrewbooks.org/downloadhandler.ashx?req=1
I would like to download all the files to a local disk without having to request each file individually in a browser.
I know this can be achieved with a simple script (even a bash script).
Could anyone advise me where to look, or point me to a similar problem that has already been solved?
Edit: I forgot an important question. How do I get the script to rename each downloaded file from its ID number (such as 42000) to a name taken from the metadata included in each file?

As mentioned, wget would be a good tool to use. Maybe try using it in a loop?
#!/bin/bash
# Iterate over all 52,000 possible book IDs
for i in {1..52000}; do
    sleep 1s  # small delay to be polite to the server
    # -P sets the directory to save into; $i is the current iteration,
    # therefore collecting all 52,000 files
    wget -P [local path] "https://download.hebrewbooks.org/downloadhandler.ashx?req=${i}"
done
Edit: I just realized someone already suggested this in a comment on the main question, but I'll leave it here for anyone who, like me, doesn't see the comments.
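To address the renaming question from the edit: below is a minimal, hypothetical sketch using the pypdf library (an assumption on my part; it only works if the downloaded PDFs actually carry a Title in their metadata). It reads each file's Title and renames the file accordingly.

import os
import glob
from pypdf import PdfReader  # pip install pypdf

# Assumes the downloads sit in the current directory as 1.pdf, 2.pdf, ...
for path in glob.glob("*.pdf"):
    try:
        title = PdfReader(path).metadata.title
    except Exception:
        continue  # unreadable or corrupt file: keep the numeric name
    if title:
        # Keep only filesystem-safe characters
        safe = "".join(c for c in title if c.isalnum() or c in " -_").strip()
        if safe:
            os.rename(path, safe + ".pdf")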

You can use wget for this task:
wget -O /download/path/to/save/downloaded/file "https://download.hebrewbooks.org/downloadhandler.ashx?req=book_number"
More help: https://askubuntu.com/questions/207265/how-to-download-a-file-from-a-website-via-terminal

Related

Printing Automation using win32api Python [duplicate]

I'm correcting assignments from my students right now and I'd like to automate an annoying step I always have to do.
After annotating their PDF solutions, I need to print them to PDF files in order to bake my annotations into the PDF so that they can be included in LaTeX. Right now I have to manually choose "Microsoft Print to PDF" and enter each PDF's name with a leading underscore (which is what my automatically generated LaTeX files expect). This gets annoying for 30+ files.
So I'd like to do this automatically for all the PDFs in a batch script, reducing my effort to a simple double-click. I have seen that this is possible with e.g. C# (Here), but I'd like a solution with a simple batch script.
Can this be done?
Edit:
The C# code I found actually does not get the job done; you can't print existing PDFs that way. I'd need to use Spire.PDF to do that. The free version, however, messes up the PDF. I can download the "full" version from NuGet, but it adds a disclaimer at the beginning of every PDF, and it still can't handle things I draw in Adobe Reader DC. So C# really is not an option; I need a command-line solution.
You'd be better off installing PDFCreator and using its command-line options.
I assumed it would be quite easy using PowerShell, but I ran into the same problem as described in this post.
The PowerShell solution from here creates only blank PDF files for me.
Better solutions probably exist, but I managed to combine PDFtoPrinter with the approach from this post.
A batch script could look like this:
for /R %%f in (*.pdf) do (
    rem Generate a small VBScript in %temp% that launches PDFtoPrinter and
    rem then fills in the "Save Print Output As" dialog via SendKeys.
    (echo with createobject^("wscript.shell"^)
    echo .run "<path to PDFtoPrinter.exe> ""%%f"""
    echo wscript.sleep 3000
    echo .sendkeys """%%~df%%~pf%%~nf_correction.pdf"""
    echo .sendkeys "{enter}"
    echo wscript.sleep 3000
    echo end with) > %temp%\sk.vbs
    start /w %temp%\sk.vbs
)
This script uses Microsoft Print to PDF to create corresponding files of the format <filename>_correction.pdf.
The batch script creates an sk.vbs script in %temp% and runs it.
The sk.vbs script then handles the file saving dialog of Microsoft Print to PDF.
Additionally, this solution has the drawback that you can't use your computer while the script runs because the sk.vbs script must send keys to the window in focus.

When using Watchman's watchman-make, I want to access the names of the changed files

I am writing a watchman command with watchman-make and I'm at a loss when trying to access exactly what was changed in the directory. I want to run my upload.py script, and inside the script I would like to access the filenames of newly created files in /var/spool/cups-pdf/ANONYMOUS.
So far I have:
$ watchman-make -p '/var/spool/cups-pdf/ANONYMOUS' --run 'python /home/pi/upload.py'
I'd like to add another argument to python upload.py so that I have the exact filepath of the newly created file and can send it over to my database in upload.py.
I've been looking at the watchman docs, and the closest thing I can think of using is a trigger object. Please help!
Solution with watchman-wait:
Assuming project layout like this:
/posts/_SUBDIR_WITH_POST_NAME_/index.md
/Scripts/convert.sh
And a shell script like this:
#!/bin/bash
# File: convert.sh
SrcDirPath=$(cd "$(dirname "$0")/../"; pwd)
cd "$SrcDirPath"
echo "Converting: $SrcDirPath/$1"
Then we can launch watchman-wait like this:
watchman-wait . --max-events 0 -p 'posts/**/*.md' | while read line; do ./Scripts/convert.sh "$line"; done
When we change the file /posts/_SUBDIR_WITH_POST_NAME_/index.md, the output will look like this:
...
Converting: /Users/.../Angular/dartweb_quickstart/posts/swift-on-android-building-toolchain/index.md
Converting: /Users/.../Angular/dartweb_quickstart/posts/swift-on-android-building-toolchain/index.md
...
watchman-make is intended to be used together with tools that will perform a follow-up query of their own to discover what they want to do as a next step. For example, running the make tool will cause make to stat the various deps to bring things up to date.
That means that your upload.py script needs to know how to do this for itself if you want to use it with watchman.
You have a couple of options, depending on how sophisticated you want things to be:
Use pywatchman to issue an ad-hoc query
If you want to be able to run upload.py whenever you want and have it figure out the right thing (just like make would do), then you can have it ask watchman directly. You can have upload.py use pywatchman (the python watchman client) to do this. pywatchman will get installed if the watchman configure script thinks you have a working python installation. You can also pip install pywatchman. Once you have it available and in your PYTHONPATH:
import os
import pywatchman

client = pywatchman.client()
# Make sure the current directory is being watched
client.query('watch-project', os.getcwd())
# Ask for the files that changed since the last query that used this cursor
result = client.query('query', os.getcwd(), {
    "since": "n:pi_upload",
    "fields": ["name"]})
print(result["files"])
This snippet uses the since generator with a named cursor to discover the list of files that changed since the last query was issued using that same named cursor. Watchman will remember the associated clock value for you, so you don't need to complicate your script with state tracking. We're using the name pi_upload for the cursor; the name needs to be unique among the watchman clients that might use named cursors, so naming it after your tool is a good idea to avoid potential conflict.
This is probably the most direct way to extract the information you need without requiring that you make more invasive changes to your upload script.
Use pywatchman to initiate a long running subscription
This approach will transform your upload.py script so that it knows how to directly subscribe to watchman, so instead of using watchman-make you'd just directly run upload.py and it would keep running and performing the uploads. This is a bit more invasive and is a bit too much code to try and paste in here. If you're interested in this approach then I'd suggest that you take the code behind watchman-wait as a starting point. You can find it here:
https://github.com/facebook/watchman/blob/master/python/bin/watchman-wait
The key piece of this that you might want to modify is this line:
https://github.com/facebook/watchman/blob/master/python/bin/watchman-wait#L169
which is where it receives the list of files.
Why not triggers?
You could use triggers for this, but we're steering folks away from triggers because they are hard to manage. A trigger runs in the background and sends its output to the watchman log file. It can be difficult to tell whether it is running, or to stop it running.
Speaking of unix, what about watchman-wait?
We also have a command that emits the list of changed files as they change. You could potentially stream the output from watchman-wait into your upload.py. That would give it some similarities to the subscription approach, but without directly using the pywatchman client. The interface is closer to the unix model and allows you to feed a list of files on stdin.
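For illustration, a hypothetical upload.py consuming that stream might read the changed file names from stdin, one per line:

# Hypothetical sketch: consume watchman-wait output, e.g.
#   watchman-wait /var/spool/cups-pdf/ANONYMOUS --max-events 0 | python upload.py
import sys

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    # Placeholder for the real upload logic (e.g. send the file to a database)
    print("uploading", path)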

--version support in a Python program built with Pants

How can I get Pants to store the output of git describe somewhere in my .pex file so that I can access it from the Python code I'm writing?
Basically I want to be able to clone my project and do this:
1. Run ./pants binary px
2. Distribute the resulting dist/px.pex to somebody
3. That somebody should be able to run px.pex --version and get a printout of whatever git describe said when I built the .pex in step one.
Help!
It turns out pex already runs git describe at build time and stores the result in a PEX-INFO file in the root of the .pex file. So to read it, I did this:
import json
import os
import zipfile

def get_version():
    """Extract the version string from the PEX-INFO file."""
    # Inside a zipped .pex, __file__ points into the archive, so its
    # dirname is the path of the .pex file itself.
    my_pex_name = os.path.dirname(__file__)
    with zipfile.ZipFile(my_pex_name) as pex:
        with pex.open("PEX-INFO") as pex_info:
            return json.load(pex_info)['build_properties']['tag']
This is good enough IMO, but there are also drawbacks. If somebody has an improved answer I'm prepared to switch to that one as the accepted one.
Drawbacks with this one:
It relies on relative paths to locate PEX-INFO; it would be better if there were some kind of API call for this.
There is no way to customize how the version number is computed; I'd like to do git describe --dirty, for example.

How do I modify gitstats to only utilize a specified file extension for its statistics?

The website of the statistics generator in question is:
http://gitstats.sourceforge.net/
Its git repository can be cloned from:
git clone git://repo.or.cz/gitstats.git
What I want to do is something like:
./gitstats --ext=".py" /input/foo /output/bar
Failing an easy way to pass an option like that without heavy modification, I'd settle for hard-coding the file extension I want included.
However, I'm unsure of the relevant section of code to modify and even if I did know, I'm unsure of how to start such modifications.
It seems like it'd be rather simple, but alas...
I found this question today while looking for the same thing. After reading sinelaw's answer I looked into the code and ended up forking the project.
https://github.com/ShawnMilo/GitStats
I added an "exclude_extensions" config option. It doesn't affect all parts of the output, but it's getting there.
I may end up doing a pretty extensive rewrite once I fully understand everything it's doing with the git output. The original project was started almost exactly four years ago today and there's a lot of clean-up that can be done due to many updates to the standard library and the Python language.
EDIT: apparently even the previous solution below only affects the "Files" stats page, which is not the interesting part. I'm trying to find something better. The line we need to fix is 254, this one:
lines = getpipeoutput(['git rev-list --pretty=format:"%%at %%ai %%aN <%%aE>" %s' % getcommitrange('HEAD'), 'grep -v ^commit']).split('\n')
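One possible direction (untested; assumes a git version whose rev-list accepts wildcard pathspecs): adding a pathspec after -- restricts the listed commits to those that touch matching files, e.g.:

# Hypothetical variant of line 254: limit rev-list to commits touching .py files
lines = getpipeoutput(['git rev-list --pretty=format:"%%at %%ai %%aN <%%aE>" %s -- "*.py"' % getcommitrange('HEAD'), 'grep -v ^commit']).split('\n')

Note that this filters which commits are counted, not the line counts within them.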
Previous attempt was:
Unfortunately, it seems like git does not provide options for easily filtering by the files in a commit (in git log and git rev-list). This solution doesn't really filter all the statistics for certain file types (such as the statistics on tags), but it does so for the part that calculates activity by number of lines changed.
So the best I could come up with is at line 499 of gitstats (the main script):
res = int(getpipeoutput(['git ls-tree -r --name-only "%s"' % rev, 'wc -l']).split('\n')[0])
You can change that by adding a pipe through grep to the command, like this:
res = int(getpipeoutput(['git ls-tree -r --name-only "%s"' % rev, 'grep \\.py$', 'wc -l']).split('\n')[0])
Alternatively, you could split out the 'wc -l' part, read the output of git ls-tree into a list of strings, and filter the resulting file names using the fnmatch module (and then count the lines in each file, possibly by using 'wc -l'), but that sounds like overkill for the specific problem you're trying to solve; a minimal sketch follows below.
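For illustration, a self-contained sketch of that fnmatch variant (hypothetical; inside gitstats you would use its getpipeoutput helper rather than subprocess):

import fnmatch
import subprocess

def count_matching_files(rev, pattern='*.py'):
    # List every file present in the given revision, then count
    # the ones whose name matches the pattern.
    out = subprocess.check_output(
        ['git', 'ls-tree', '-r', '--name-only', rev], text=True)
    return len([f for f in out.splitlines() if fnmatch.fnmatch(f, pattern)])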
Still doesn't solve the problem (the rest of the stats will ignore this filter), but hopefully helpful.

Bash alias to Python script -- is it possible?

The particular alias I'm looking to "class up" into a Python script happens to be one that makes use of the curl -o (output to file) option. I suppose I could just as easily turn it into a Bash function, but someone advised me that I could avoid the quirks and pitfalls of the different versions and "flavors" of Bash by turning my ideas into Python scripts.
Coincident with this idea is another notion I had: making a download-related feature of legacy Mac OS (officially known as "OS 9" or "Classic") platform-independent, namely writing the source URL to some part of the file visible from one's file navigator (Konqueror, Dolphin, Nautilus, Finder, or Explorer). I know that only a scant few file types support this kind of thing via other command-line tools (exiv2, wrjpgcom, etc.), which is perfectly fine with me, as I only use this alias to download single-page image files such as JPEGs anyway.
I reckon I might as well take full advantage of the power of Python by having the script pass the source URL of the download (entered by the user and used first by curl) to something like exiv2, which could write it to the Comment block, the EXIF User Comment block, and (taking as a first and worst example) Windows XP's File Description field. Starting small is sometimes a good way to start.
Hope someone has advice or suggestions.
BZT
The relevant section from the Bash manual states:
"Aliases allow a string to be substituted for a word when it is used as the first word of a simple command."
So, there should be nothing preventing you from doing e.g.
$ alias geturl="python /some/cool/script.py"
Then you could use it like any other shell command:
$ geturl http://example.com/excitingstuff.jpg
And this would simply call your Python program.
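For completeness, a minimal, hypothetical sketch of what /some/cool/script.py could look like, using only the standard library to mimic curl -o:

# Hypothetical /some/cool/script.py: fetch each URL given on the
# command line and save it under its basename, like curl -o would.
import os
import sys
import urllib.request

for url in sys.argv[1:]:
    filename = os.path.basename(url) or "download.out"
    urllib.request.urlretrieve(url, filename)
    print("saved", url, "->", filename)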
I thought pycurl might be the answer. Ahh, Daniel Stenberg and his innocent presumption that everybody knows what he does. I asked on the list whether or not pycurl had a "curl -o" analogue, and then asked, 'If so: how would one go about coding it in a Python script?' His reply was the following:
"curl.setopt(pycurl.WRITEDATA, fp)
possibly combined with:
curl.setopt(pycurl.WRITEFUNCITON, callback) "
...along with Sourceforge links to two revisions of retriever.py. I can barely recall where easy_install put the one I've got; how am I supposed to compare them?
It's pretty apparent this gentleman never had a helpdesk or phone tech support job in the Western Hemisphere, where you have to assume the 'customer' just learned how to use their comb yesterday and be prepared to walk them through everything and anything. One-liners (or three-liners with abstruse links as chasers) don't do it for me.
BZT
