I am working on a script to extract text from law cases using https://case.law/docs/site_features/api. I have written methods for searching and for creating an XLSX file, and both work well, but I am struggling with the method that should open an online PDF link, write it ('wb') to a temp file, read and extract the data (core text), then close it. The ultimate goal is to use the content of these cases for NLP.
I have prepared a function (see below) to download the file:
def download_file(file_id):
    http = urllib3.PoolManager()
    folder_path = "path_to_my_desktop"
    file_download = "https://cite.case.law/xxxxxx.pdf"
    file_content = http.request('GET', file_download)
    file_local = open( folder_path + file_id + '.pdf', 'wb' )
    file_local.write(file_content.read())
    file_content.close()
    file_local.close()
The script runs without errors: it downloads the file and creates it on my desktop. But when I try to open the file manually, Acrobat Reader shows this message:
Adobe Acrobat Reader could not open 'file_id.pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded).
I thought the library was the problem, so I tried Requests / xlswriter / urllib3... (example below; I also tried reading the file from the script to check whether Adobe was the issue, but apparently it is not)
# Download the pdf from the search results
URL = "https://cite.case.law/xxxxxx.pdf"
r = requests.get(URL, stream=True)
with open('path_to_desktop + pdf_name + .pdf', 'w') as f:
    f.write(r.text)
# open the downloaded file and remove '<[^<]+?>' for easier reading
with open('C:/Users/amallet/Desktop/r.pdf', 'r') as ff:
    data_read = ff.read()
stripped = re.sub('<[^<]+?>', '', data_read)
print(stripped)
the output is:
document.getElementById('next').value = document.location.toString();
document.getElementById('not-a-bot-form').submit();
With 'wb' and 'rb' instead (and with the stripping step removed), the script is:
r = requests.get(test_case_pdf, stream=True)
with open('C:/Users/amallet/Desktop/r.pdf', 'wb') as f:
    f.write(r.content)
with open('C:/Users/amallet/Desktop/r.pdf', 'rb') as ff:
    data_read = ff.read()
print(data_read)
and the output is :
<html>
<head>
<noscript>
<meta http-equiv="Refresh" content="0;URL=?no_js=1&next=/pdf/7840543/In%20re%20the%20Extradition%20of%20Garcia,%20890%20F.%20Supp.%20914%20(1994).pdf" />
</noscript>
</head>
<body>
<form method="post" id="not-a-bot-form">
<input type="hidden" name="csrfmiddlewaretoken" value="5awGW0F4A1b7Y6bxrYBaA6GIvqx4Tf6DnK0qEMLVoJBLoA3ZqOrpMZdUXDQ7ehOz">
<input type="hidden" name="not_a_bot" value="yes">
<input type="hidden" name="next" value="/pdf/7840543/In%20re%20the%20Extradition%20of%20Garcia,%20890%20F.%20Supp.%20914%20(1994).pdf" id="next">
</form>
<script>
document.getElementById(\'next\').value = document.location.toString();
document.getElementById(\'not-a-bot-form\').submit();
</script>
<a href="?no_js=1&next=/pdf/7840543/In%20re%20the%20Extradition%20of%20Garcia,%20890%20F.%20Supp.%20914%20(1994).pdf">Click here to continue</a>
</body>
</html>
But none of these attempts work. The PDF is not password-protected, and I tried other websites and it doesn't work there either.
Therefore, I am wondering whether I have another issue that is not linked to the code itself.
Please let me know if you need additional information.
Thank you
It looks like the web server is not sending you the PDF but a web page (the "not-a-bot-form" in your output) intended to prevent bots from downloading data from the site.
There is nothing wrong with your code, but if you still want to do this you'll have to work around the bot prevention of the website.
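To catch this situation before a broken file lands on the desktop, one option is to check that the response body really starts with the PDF signature (`%PDF-`) before writing it. The following is only a minimal sketch: `looks_like_pdf` and `download_pdf` are hypothetical helper names, and it uses the standard library's `urllib.request` rather than the libraries from the question. It does not bypass the bot check; it only makes the failure visible.

```python
import urllib.request

PDF_MAGIC = b"%PDF-"  # every PDF file begins with these bytes


def looks_like_pdf(data):
    """Return True if the bytes start with the PDF file signature."""
    return data.startswith(PDF_MAGIC)


def download_pdf(url, dest_path):
    """Download url to dest_path, refusing to save anything that is not a PDF."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    if not looks_like_pdf(data):
        # Typically an HTML page, such as the "not-a-bot" form shown in the question
        raise ValueError("server did not return a PDF; first bytes: %r" % data[:40])
    with open(dest_path, "wb") as f:
        f.write(data)
    return dest_path
```

With the response from the question, `looks_like_pdf` would return False because the body starts with `<html>`, so the script would fail loudly instead of saving an HTML page under a `.pdf` name.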
Related
I am creating a tool where either:
A new XLSX file is generated for the user to download
The user can upload an XLSX file they have, I will read the contents of that file, and use them to generate a new file for the user to download.
I would like to make use of Pandas to read the XLSX file into a dataframe, so I can work with it easily. However, I can't get it working. Can you help me?
Example extract from CGI file:
import pandas as pd
import cgi
from mako.template import Template
from mako.lookup import TemplateLookup
import http.cookies as Cookie
import os
import tempfile
import shutil
import sys
cookie = Cookie.SimpleCookie(os.environ.get("HTTP_COOKIE"))
method = os.environ.get("REQUEST_METHOD", "GET")
templates = TemplateLookup(directories = ['templates'], output_encoding='utf-8')
if method == "GET": # This is for getting the page
    template = templates.get_template("my.html")
    sys.stdout.flush()
    sys.stdout.buffer.write(b"Content-Type: text/html\n\n")
    sys.stdout.buffer.write(
        template.render())
if method == "POST":
    form = cgi.FieldStorage()
    print("Content-Type: application/vnd.ms-excel")
    print("Content-Disposition: attachment; filename=NewFile.xlsx\n")
    output_path = "/tmp/" + next(tempfile._get_candidate_names()) + '.xlsx'
    data = *some pandas dataframe previously created*
    if "editfile" in form:
        myfilename = form['myfile'].filename
        with open(myfilename, 'wb') as f:
            f.write(form['myfile'].file.read())
        data = pd.read_excel(myfilename)
    data.to_excel(output_path)
    with open(path, "rb") as f:
        sys.stdout.flush()
        shutil.copyfileobj(f, sys.stdout.buffer)
Example extract from HTML file:
<p>Press the button below to generate a new version of the xlsx file</p>
<form method=post>
<p><input type=submit value='Generate new version of file' name='newfile'>
<div class="wrapper">
</div>
</form>
<br>
<p>Or upload a file.</p>
<p>In this case, a new file will be created using the contents of this file.</p>
<form method="post" enctype="multipart/form-data">
<input id="fileupload" name="myfile" type="file" />
<input value="Upload and create new file" name='editfile' type="submit" />
</form>
This works without the if "editfile" in form: part, so I know something is going wrong when I try to access the file that the user has uploaded.
The problem is that although a file is created, it has a size of 0 KB and will not open in Excel. Crucially, the file that the user uploaded cannot be found in the location where I wrote it out.
You've passed myfilename to pandas; however, that file doesn't exist on the server yet. You'll have to save the file somewhere locally first before using it.
The following will save the uploaded file to the current directory (the same directory as the CGI script). Of course, you're welcome to save it to a more suitable directory, depending on your setup.
form = cgi.FieldStorage()
myfilename = form['myfile'].filename
with open(myfilename, 'wb') as f: # Save the file locally
    f.write(form['myfile'].file.read())
data = pd.read_excel(myfilename)
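If you do pick a different directory, it may be worth not trusting the client-supplied filename as a path. Below is a rough sketch, assuming a hypothetical helper `save_upload` that is not part of cgi or pandas: it keeps only the last path component of the name and writes the bytes into a directory you choose.

```python
import os


def save_upload(fileobj, client_filename, dest_dir):
    """Write an uploaded file-like object into dest_dir and return the saved path."""
    # Keep only the final path component, so names such as "../x.xlsx" or
    # "C:\\Users\\me\\x.xlsx" cannot point outside dest_dir
    safe_name = os.path.basename(client_filename.replace("\\", "/"))
    if not safe_name:
        raise ValueError("upload has no usable filename")
    dest_path = os.path.join(dest_dir, safe_name)
    with open(dest_path, "wb") as out:  # binary mode: an XLSX file is not text
        out.write(fileobj.read())
    return dest_path
```

In the CGI script this could be called as save_upload(form['myfile'].file, form['myfile'].filename, some_directory), and the returned path passed to pd.read_excel.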
On my Raspberry Pi I've built a script that reads data from a file and shows a live video stream.
The content is then displayed on an html page:
f = open("demofile.txt", "r")
temperature=f.read()
f.close()
print(temperature)
PAGE="""\
<html>
<body>
<center><img src="stream.mjpeg" width="640" height="480"></center>
<center><h2>%s</h2></center>
</body>
</html>
""" %(temperature)
My problem is that the file is updated every 10 minutes. How can I reload or pull the new content from the file each time I access the HTML page?
Since I'm streaming video, the script runs all the time, so I can't stop it and run it again to get the most recent data from the file.
Please advise
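One possible direction, sketched under the assumption that the page is built by a request handler in the same script: instead of reading the file once at startup, read it inside a function that runs for every request. `render_page` and `PAGE_TEMPLATE` are hypothetical names used for illustration only.

```python
PAGE_TEMPLATE = """\
<html>
<body>
<center><img src="stream.mjpeg" width="640" height="480"></center>
<center><h2>%s</h2></center>
</body>
</html>
"""


def render_page(path="demofile.txt"):
    """Build the HTML page, re-reading the data file on every call."""
    with open(path, "r") as f:
        temperature = f.read().strip()
    return PAGE_TEMPLATE % temperature
```

The handler that serves the HTML page would then call render_page() each time the page is requested, so the long-running stream never has to be restarted.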
I have a Python CGI script that creates a text area and fills it with a default value taken from the contents of a file. This used to work, but recently I noticed that after the file's content changed, the HTML is rendered incorrectly: the submit button and some parts of the file contents that should appear in the text area (as default content) are missing, or they break the rest of the page's HTML.
print('<form action="x.cgi" method="post">')
print('<textarea name="textcontent" cols="120" rows="50">')
with open('somefile', 'r') as content_file:
    content = content_file.read()
    content_file.close()
print(content)
print('</textarea>')
print('<HR>')
print('<input type="submit" value="Submit" />')
print('</form>')
What can be done so that the contents of somefile don't interfere with the HTML form? Note that somefile is a configuration file, and I need everything in the file to be shown as-is so the user can make the necessary changes and submit it.
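A likely cause is that the file now contains characters such as `<`, `>` or `&` that the browser interprets as markup. A minimal sketch of one way around this, assuming Python 3 and a hypothetical helper `render_textarea`, is to pass the content through `html.escape` from the standard library before printing it:

```python
import html


def render_textarea(content, name="textcontent"):
    """Return a textarea element whose body cannot break out of the tag."""
    # html.escape turns &, <, > and quotes into entities; the browser shows
    # the original characters and submits them unchanged
    escaped = html.escape(content)
    return '<textarea name="%s" cols="120" rows="50">%s</textarea>' % (name, escaped)
```

The user still sees and edits the file exactly as written, because the browser decodes the entities when it displays the text area.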
I am on a Windows 10 laptop.
When I manually open submit.html on my local computer, click and browse to namo.jpg or namo.png and then submit, the website processes my image and returns a result file within 15 seconds.
But I can't seem to get the same result using Python mechanize. When I run the script, the mechanize_results.html file comes back too quickly, and the page tells me "Uploaded file is not a valid image. Only JPG, PNG and GIF files are allowed.. "
I'm not sure what I have to change to get the site to recognize the file submitted by my Python mechanize script as an image file.
My submit.html file has this:
<form name="myform" id="myform" action="http://deepdreamgenerator.com/upload-im" enctype="multipart/form-data" method="POST" id="upload-form">
<input type="hidden" name="_token" value="pfC1a6HGVdbWO7mCmKVkqVinCkSYOKkQxXZV9NY1">
<input type="file" name="file" id="upload"/>
<input type="submit" />
</form>
My Python mechanize script has this:
import mechanize
filename = 'C:/Users/tintran/Desktop/namo.png'
url = "file:///C:/Users/tintran/Desktop/submit.html"
br = mechanize.Browser()
br.set_handle_robots(False) # ignore robots
br.open(url)
br.select_form('myform')
br.set_all_readonly(False)
br.form.add_file(open(filename,'r'))
res = br.submit()
content = res.read()
with open("mechanize_results.html", "w") as f:
    f.write(content)
https://docs.python.org/2/library/functions.html#open
If mode is omitted, it defaults to 'r'. The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don’t treat binary and text files differently, where it serves as documentation.) See below for more possible values of mode.
This matters on Windows, where text mode translates newline characters. So just use 'rb' when opening the PNG file.
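As a quick way to confirm that a file read in binary mode is intact, you can compare its first eight bytes with the PNG signature. This is only an illustrative sketch; `is_png` is a hypothetical helper, not part of mechanize.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the fixed first 8 bytes of every PNG file


def is_png(path):
    """Return True if the file at path starts with the PNG signature."""
    with open(path, "rb") as f:  # 'rb' so no newline translation is applied
        return f.read(8) == PNG_SIGNATURE
```

The signature contains both a \r\n pair and a lone \n, so a file whose bytes went through newline translation no longer matches it.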
First, I don't need any functionality beyond uploading a file. No progress bar, no file type or size check, no multiple files.
What I want is the simplest possible HTML page that handles the upload and saves the file under a name I specify.
I tried to use this:
<form action="../cgi-bin/upload.py" method="post" enctype="multipart/form-data">
<input type="file" name="upload" />
<input type="submit" /></form>
In upload.py:
#!/usr/bin/python
import os
import commands
import cgi, cgitb
cgitb.enable()
print "Content-Type: text/html"
print
print 'start!'
form = cgi.FieldStorage()
filedata = form['upload']
But I don't know how to save this to a file, such as "Beautiful.mp3".
Can anybody help?
Really, though, I don't want to use any scripts. I just want the most basic HTML pages. Python scripts should only exist where a CGI handler is required. Flash is not preferred.
The filedata object will wrap a file-like object that can be treated like a regular file. Basically you would do this:
if filedata.file: # field really is an upload
    with open("Beautiful.mp3", 'wb') as outfile: # binary mode, since an MP3 is not text
        outfile.write(filedata.file.read())
Or, you could do just about anything else with it, using read(), readlines() or readline().
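For large uploads, reading the whole file into memory with a single read() may not be ideal. Here is a small sketch of copying in fixed-size chunks instead, using a hypothetical helper named `save_stream`; the 64 KB chunk size is an arbitrary assumption.

```python
def save_stream(src, dest_path, chunk_size=64 * 1024):
    """Copy a file-like object to dest_path in chunks; return bytes written."""
    total = 0
    with open(dest_path, "wb") as out:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # an empty result means end of the upload
                break
            out.write(chunk)
            total += len(chunk)
    return total
```

In the handler above it could be called as save_stream(filedata.file, "Beautiful.mp3").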