We have two servers (client-facing, and back-end database) between which we would like to transfer PDFs. Here's the data flow:
1. User requests PDF from website.
2. Site sends request to client-server.
3. Client server requests PDF from back-end server (different IP).
4. Back-end server sends PDF to client server.
5. Client server sends PDF to website.
Steps 1-3 and 5 are all good, but #4 is the issue.
We're currently using Flask with the requests library for our API calls and can transfer text and .csv files easily, but binary files such as PDFs are not working.
And no, I don't have any code, so take it easy on me. Just looking for a suggestion from someone who may have come across this issue.
Since you said you have no code, that's fine, but I can only offer a few suggestions.
I'm not sure how you're sending your files, but I'm assuming you're using Python's open function.
1. Make sure you are reading the file as bytes (e.g. open('<pdf-file>', 'rb')).
2. Cut the file into chunks and stream them so the transfer doesn't freeze or get stuck (see the sketch after this list).
3. Try smaller PDF files; if that works, definitely try suggestion #2.
4. Use threads so other work can continue while the transfer runs.
5. Have a dedicated download server; this can save memory and potentially bandwidth, and it also lets you skip sending the PDF back through Flask.
6. Don't use PDF files if you don't have to.
7. Use a library that handles the transfer for you.
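To illustrate suggestions #1 and #2, here is a minimal sketch of a Flask endpoint that reads a PDF as bytes and streams it back in chunks; the route, file path, and chunk size are placeholders, not taken from the question:

from flask import Flask, Response

app = Flask(__name__)

@app.route('/pdf')
def send_pdf():
    def generate():
        # Read the PDF as bytes (suggestion #1) and yield it in chunks (suggestion #2).
        with open('example.pdf', 'rb') as f:  # 'example.pdf' is a placeholder path
            while True:
                chunk = f.read(8192)
                if not chunk:
                    break
                yield chunk
    return Response(generate(), mimetype='application/pdf')

On the receiving side, requests.get(url, stream=True) combined with iter_content() lets the client write the response to disk chunk by chunk instead of holding the whole PDF in memory.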
Hope this helps!
I wanted to share my solution to this, with credit to @CoolqB for the answer. The key was including 'rb' to properly read the binary file and including the codecs library. Here are the final code snippets:
Client request:
response = requests.get('https://www.mywebsite.com/_api_call')
Server response:
f = codecs.open(file_name, 'rb').read()
return f
Client handle:
with codecs.open(file_to_write, 'wb') as f:
    f.write(response.content)
And all is right with the world.
For my image classification project I need to collect classified images, and a good source for me would be the various webcams around the world that stream video on the internet, like this one:
https://www.skylinewebcams.com/en/webcam/espana/comunidad-valenciana/alicante/benidorm-playa-poniente.html
I don't really have any experience with video streaming or web scraping in general, so after searching for information on the internet, I came up with this naive code in Python:
import requests

url = 'https://www.skylinewebcams.com/a816de08-9805-4cc2-94e6-2daa3495eb99'
r1 = requests.get(url, stream=True)
filename = "stream.avi"
if r1.status_code == 200:
    with open(filename, 'w') as f:
        for chunk in r1.iter_content(chunk_size=1024):
            f.write(chunk)
else:
    print("Received unexpected status code {}".format(r1.status_code))
where the url address was taken from the source of the video block from the website:
<video data-html5-video=""
       poster="//static.skylinewebcams.com/_2933625150.jpg" preload="metadata"
       src="blob:https://www.skylinewebcams.com/a816de08-9805-4cc2-94e6-2daa3495eb99"></video>
but it does not work (the .avi file is empty), even though the video stream plays fine in the browser. Can anybody explain to me how to capture this video stream into a file?
I've made some progress since then. Here is the code:
print ("Recording video...")
url='https://hddn01.skylinewebcams.com/02930601ENXS-1523680721427.ts'
r1 = requests.get(url, stream=True)
filename = "stream.avi"
num=0
if(r1.status_code == 200):
with open(filename,'wb') as f:
for chunk in r1.iter_content(chunk_size=1024):
num += 1
f.write(chunk)
if num>5000:
print('end')
break
else:
print("Received unexpected status code {}".format(r.status_code))
Now I can get a piece of the video written to the file. What I've changed is: 1) in open(filename, 'wb') I changed 'w' to 'wb' to write binary data, but more importantly 2) I changed the URL. I looked in the Chrome DevTools Network tab to see which requests the browser sends to get the live stream, and just copied the most recent one; it requests a .ts file.
Next, I found out how to get the addresses of the .ts video files. One can use the m3u8 module (installable via pip) like this:
import m3u8
m3u8_obj = m3u8.load('https://hddn01.skylinewebcams.com/live.m3u8?a=k2makj8nd279g717kt4d145pd3')
playlist = [el['uri'] for el in m3u8_obj.data['segments']]
The playlist of the video files will then be something like this:
['https://hddn04.skylinewebcams.com/02930601ENXS-1523720836405.ts',
'https://hddn04.skylinewebcams.com/02930601ENXS-1523720844347.ts',
'https://hddn04.skylinewebcams.com/02930601ENXS-1523720852324.ts',
'https://hddn04.skylinewebcams.com/02930601ENXS-1523720860239.ts',
'https://hddn04.skylinewebcams.com/02930601ENXS-1523720868277.ts',
'https://hddn04.skylinewebcams.com/02930601ENXS-1523720876252.ts']
and I can download each of the video files from the list.
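As a rough sketch (reusing the playlist list from above; the output filename is a placeholder), each segment can be fetched with requests and appended to a single .ts file:

import requests

# Download each segment from the playlist and append it to one .ts file.
with open('combined.ts', 'wb') as out:  # 'combined.ts' is a placeholder name
    for segment_url in playlist:
        resp = requests.get(segment_url, stream=True)
        if resp.status_code == 200:
            for chunk in resp.iter_content(chunk_size=1024):
                out.write(chunk)

MPEG-TS segments can simply be concatenated like this, and a tool such as ffmpeg can convert the combined file to another container afterwards if needed.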
The only problem left is that in order to load the playlist, I first need to open the webpage in a browser; otherwise the playlist comes back empty. Opening the webpage probably initiates the stream, which creates the m3u8 file on the server so it can be requested. I still don't know how to initiate the stream from Python without opening the page in a browser.
The list turns out empty because you're making an HTTP request without headers (a sure sign the request is programmatic), and most sites just respond to those with a 403 outright.
You should use a library like Requests or pycurl to add headers to your requests, and then they should work fine. For an example request (complete with headers), you can open your web browser's developer console while the stream is playing, find an HTTP request for the m3u8 URL, right-click on it, and choose "Copy as cURL". Note that some sites require arbitrary, site-specific headers to be sent with each request.
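As a sketch (the header names and values below are typical browser headers, not ones confirmed to be required by this site), adding headers to the playlist request could look like this:

import m3u8
import requests

# Example browser-like headers; which headers are actually required is site-specific.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Referer': 'https://www.skylinewebcams.com/',
}

resp = requests.get(
    'https://hddn01.skylinewebcams.com/live.m3u8?a=k2makj8nd279g717kt4d145pd3',
    headers=headers)
m3u8_obj = m3u8.loads(resp.text)  # parse the playlist text fetched above
playlist = [el['uri'] for el in m3u8_obj.data['segments']]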
If you want to scrape multiple sites with different headers, and/or want to future-proof your code for if they change the headers, addresses or formats, then you probably need something more advanced. Worst-case scenario, you might need to run a headless browser to open the site with WebDriver/Selenium and capture the requests it makes to generate your requests.
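If it comes to that, a minimal sketch (assuming Chrome and chromedriver are installed, and that reusing the browser's cookies is enough for this particular site, which is not guaranteed) could open the page headlessly and then request the playlist:

import requests
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

# Opening the page in a (headless) browser should initiate the stream.
driver.get('https://www.skylinewebcams.com/en/webcam/espana/comunidad-valenciana/alicante/benidorm-playa-poniente.html')

# Reuse the browser's cookies for the playlist request.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

resp = session.get('https://hddn01.skylinewebcams.com/live.m3u8?a=k2makj8nd279g717kt4d145pd3')
driver.quit()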
Keep in mind that you may have to read each site's ToS, or you might otherwise be performing illegal activities. Scraping while breaking the ToS is basically digital trespassing, and I think at least Craigslist has already won lawsuits on those grounds.
My frontend web app is calling my Python Flask API on an endpoint that is cached and returns a JSON response that is about 80,000 lines long and 1.7 megabytes.
It takes my UI about 7.5 seconds to download all of it.
It takes Chrome when calling the path directly about 6.5 seconds.
I know that I can split up this endpoint for performance gains, but out of curiosity, what are some other great options to improve the download speed of all this content?
Options I can think of so far:
1) compressing the content. But then I would have to decompress it on the frontend
2) Use something like gRPC
Further info:
My Flask server is using WSGIServer from gevent, and the endpoint code is below. PROJECT_DATA_CACHE is the already-JSONified data that is returned:
#blueprint_2.route("/projects")
def getInitialProjectsData():
global PROJECT_DATA_CACHE
if PROJECT_DATA_CACHE:
return PROJECT_DATA_CACHE
else:
LOGGER.debug('No cache available for GET /projects')
updateProjectsCache()
return PROJECT_DATA_CACHE
Maybe you could stream the file? I cannot see any way to transfer a file 80,000 lines long without some kind of download or wait.
This would be an opportunity to compress and decompress it, like you suggested. Definitely make sure that the JSON is minified.
One way to minify a JSON: https://www.npmjs.com/package/json-minify
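If the JSON is produced in Python, it can also be minified at serialization time; this is a generic example, not the question's code:

import json

data = {"projects": [{"id": 1, "name": "Example"}]}  # placeholder data
minified = json.dumps(data, separators=(',', ':'))   # no spaces after ',' or ':'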
Streaming a file:
https://blog.al4.co.nz/2016/01/streaming-json-with-flask/
It also really depends on the project; maybe you could have the users download it completely up front?
The best way to do this is to break your JSON into chunks and stream it by passing a generator to the Response. You can then render the data as you receive it, or show a progress bar displaying the percentage that is done. I have an example of how to stream data as a file is being downloaded from AWS S3 here. That should point you in the right direction.
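A minimal sketch of that approach (the /projects_stream route and the way PROJECT_DATA_CACHE is chunked are assumptions for illustration):

from flask import Response

@blueprint_2.route("/projects_stream")
def streamProjectsData():
    def generate():
        # Stream the cached JSON string in fixed-size chunks.
        data = PROJECT_DATA_CACHE or ""
        chunk_size = 64 * 1024
        for i in range(0, len(data), chunk_size):
            yield data[i:i + chunk_size]
    return Response(generate(), mimetype='application/json')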
I'm trying to export a CSV from this page via a Python script. The complicated part is that the page that opens after clicking the export button begins the download and then closes again, rather than hosting the file somewhere static. I've tried using the Requests library, among other things, but the file it returns is empty.
Here's what I've done:
from requests import get

url = 'http://aws.state.ak.us/ApocReports/CampaignDisclosure/CDExpenditures.aspx?exportAll=True&%3bexportFormat=CSV&%3bisExport=True%22+id%3d%22M_C_sCDTransactions_csfFilter_ExportDialog_hlAllCSV?exportAll=True&exportFormat=CSV&isExport=True'
with open('CD_Transactions_02-27-2017.CSV', "wb") as file:
    # get request
    response = get(url)
    # write to file
    file.write(response.content)
I'm sure I'm missing something obvious, but I'm pulling my hair out.
It looks like the file is being generated on demand, and the URL stays valid only as long as the session lasts.
There are multiple requests from the browser to the webserver (including POST requests).
So to get those files via code, you would have to simulate the browser, possibly including session state etc. (and in this case also __VIEWSTATE).
To see the whole communication, you can use the developer tools in the browser (usually F12, then select the Network tab to see the traffic), or use something like Wireshark.
In other words, this won't be an easy task.
If this is open government data, it might be better to just ask that government for the data or ask for possible direct links to the (unfiltered) files (sometimes there is a public ftp server for example) - or sometimes there is an API available.
The file is created on demand but you can download it anyway. Essentially you have to:
Establish a session to save cookies and viewstate
Submit a form in order to click the export button
Grab the link behind the pop-up CSV button
Follow that link and download the file
You can find working code here (if you don't mind that it's written in R): Save response from web-scraping as csv file
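In Python, a rough sketch of those steps could look like the following (the hidden-field handling and the export parameters are assumptions and would need to be matched against the actual page):

import requests
from bs4 import BeautifulSoup

base_url = 'http://aws.state.ak.us/ApocReports/CampaignDisclosure/CDExpenditures.aspx'

with requests.Session() as session:  # step 1: keep cookies across requests
    page = session.get(base_url)
    soup = BeautifulSoup(page.text, 'html.parser')

    # step 2: collect the ASP.NET hidden fields and post the form that corresponds
    # to clicking the export button (exact field names are assumptions)
    form_data = {
        '__VIEWSTATE': soup.find('input', {'name': '__VIEWSTATE'})['value'],
        '__EVENTVALIDATION': soup.find('input', {'name': '__EVENTVALIDATION'})['value'],
    }
    session.post(base_url, data=form_data)

    # steps 3-4: follow the generated CSV link and save the response
    csv_response = session.get(base_url, params={'exportAll': 'True', 'exportFormat': 'CSV', 'isExport': 'True'})
    with open('CD_Transactions.csv', 'wb') as f:
        f.write(csv_response.content)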
I'm trying to program a small HTTP local proxy server to run on my machine and run some tests.
My server currently runs perfectly and serves the requests fine.
However, when I try to analyse the packets, I get a problem.
I'm searching for the "<head>" tag in my packets, and print a message to a log when I find it.
It works on a very limited number of websites, while on others, like Stack Overflow for example, it doesn't.
Do I need to do some sort of decoding before I search for the tag in the received data? If so, which decoding? How do I re-encode the data to serve it to the browser?
Here's my code for the searching and replacing:
data = i.recv(8192)
if data:
    if "<head>" in data:
        print "Found Head Tag."
The above is simple Python code that retrieves the data from the socket, saves it to the data variable, and searches for the wanted tag. As I said, it works on very few websites and not on the others.
Many webservers use compression to lower bandwidth usage.
You will need to check the HTTP headers for Content-Encoding and apply the required operations (e.g. gzip decompression) to get the plain text.
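As a rough sketch (decode_body, body and headers are hypothetical names; it assumes the full response body has already been collected and the headers parsed into a dict):

import zlib

def decode_body(body, headers):
    # Undo the Content-Encoding so the HTML can be searched as plain text.
    encoding = headers.get('Content-Encoding', '').lower()
    if encoding == 'gzip':
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)  # gzip wrapper
    if encoding == 'deflate':
        return zlib.decompress(body)
    return body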
I'm looking for a way to sell someone a card at an event that will have a unique code that they will be able to use later in order to download a file (mp3, pdf, etc.) only one time and mask the true file location so a savvy person downloading the file won't be able to download the file more than once. It would be nice to host the file on Amazon S3 to save on bandwidth where our server is co-located.
My thought for the codes would be to pre-generate the unique codes that will get printed on the cards and store those in a database that could also have a field that stores the number of times the file was downloaded. This way we could set how many attempts we would allow the user for downloading the file.
The part that I need direction on is how do I hide/mask the original file location so people can't steal that url and then download the file as many times as they want. I've done Google searches and I'm either not searching using the right keywords or there aren't very many libraries or snippets out there already for this type of thing.
I'm guessing that I might be able to rig something up using django.views.static.serve that acts as a sort of proxy between the actual file and the user downloading the file. The only drawback to this method I would think is that I would need to use the actual web server and wouldn't be able to store the file on Amazon S3.
Any suggestions or thoughts are greatly appreciated.
Neat idea. However, I would warn against the single-download method, because there is no guarantee that their first download attempt will be successful. Perhaps use a time-expiration method instead?
But it is certainly possible to do this with Django. Here is an outline of the basic approach:
Set up a django url for serving these files
Use a GET parameter which is a unique string to identify which file to get.
Keep a database table which has a FileField for the file to download. This table maps the unique strings to the location of the file on the file system.
To serve the file as a download, set the response headers in the view like this:
(path is the location of the file to serve)
with open(path, 'rb') as f:
    response = HttpResponse(f.read())
response['Content-Type'] = 'application/octet-stream'
response['Content-Disposition'] = 'attachment; filename="%s"' % 'insert_filename_here'
return response
Since we are using this Django page to serve the file, the user cannot find out the original file location.
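Tying the outline together, a hedged sketch of such a view might look like this (DownloadCode is a hypothetical model with code, file, and download_count fields; the allowed number of downloads is also an assumption):

from django.http import HttpResponse, Http404

MAX_DOWNLOADS = 1  # or higher, if you allow retries

def download(request, code):
    # Look up the unique code printed on the card (DownloadCode is hypothetical).
    try:
        entry = DownloadCode.objects.get(code=code)
    except DownloadCode.DoesNotExist:
        raise Http404("Unknown code")

    if entry.download_count >= MAX_DOWNLOADS:
        return HttpResponse("This code has already been used.", status=403)

    entry.download_count += 1
    entry.save()

    with open(entry.file.path, 'rb') as f:
        response = HttpResponse(f.read())
    response['Content-Type'] = 'application/octet-stream'
    response['Content-Disposition'] = 'attachment; filename="%s"' % entry.file.name
    return response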
You can just use something simple such as mod_xsendfile. This functionality is also available in other popular webservers such as lighttpd or nginx.
It works like this: when enabled, your application (e.g. a trivial PHP script) can send a special response header that causes the webserver to serve a static file itself.
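For example, with mod_xsendfile enabled in Apache, a Django view could hand the file off like this (the header name differs per server, e.g. X-Accel-Redirect for nginx; the path is a placeholder):

from django.http import HttpResponse

def download(request):
    response = HttpResponse()
    # Apache (mod_xsendfile) reads this header and streams the file itself,
    # so the real file path is never exposed to the client.
    response['X-Sendfile'] = '/srv/protected/files/example.pdf'  # placeholder path
    response['Content-Disposition'] = 'attachment; filename="example.pdf"'
    return response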
If you want it to work with S3, you will need to handle each and every request this way, meaning the traffic goes through your site to AWS, back to your site, and then back to the client. Does S3 support symbolic links / aliases? If so, you might just redirect a valid user to one of the symbolic URLs and delete that symlink after a couple of hours.
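S3 does not have symlinks, but its pre-signed URLs play a similar role: a time-limited link generated per valid request, so the permanent object location is never exposed. A sketch with boto3 (bucket and key names are placeholders):

import boto3

s3 = boto3.client('s3')

# Generate a time-limited link to the object; after ExpiresIn seconds the URL stops working.
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'my-download-bucket', 'Key': 'files/example.pdf'},  # placeholders
    ExpiresIn=3600,  # one hour
)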