I want to let the user start downloading a file before it's ready. I don't want to send the user to some page saying "Wait 30 seconds, the file is being prepared." I can't generate the file in advance. I need the user to click/submit the form, choose a download location, and start downloading. The generated file will be a zip, so I imagine it should be possible to send the file name along with the first few bytes of the zip (which are always the same), and, until the file is generated, not acknowledge that the TCP packet was received correctly (or something like that), then send the rest once the file is ready.
How do I do that? Is there a tool that can do this, or is there a better way? The more high-level the solution, the better; C isn't my strong suit. Preferably in Python. Thanks.
The file being generated is a zip, and before it's prepared there isn't really anything to send yet. Basically, based on the input I generate a set of files (which takes a few dozen seconds), then I zip them and serve them to the user. My application is in Python on Linux, but which server I'll use isn't really important.
The client would most likely time out (or wait 30 seconds without notification) while the file was prepared. I would use a stream compression algorithm (gzip) to compress the file(s) while in transit. This would not result in the best compression, but it would serve the files in a predictable manner.
Investigate the "Content-Encoding: gzip" HTTP header.
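For instance, here's a minimal sketch of the idea with Flask, assuming a hypothetical generate_files() generator that yields raw chunks as they are produced; compression happens on the fly while the data is still being generated:
import zlib

from flask import Flask, Response

app = Flask(__name__)

def generate_files():
    # placeholder for the real generation step, yielding raw chunks
    for i in range(100):
        yield b"chunk %d\n" % i

@app.route("/download")
def download():
    def gzip_stream():
        # 16 + MAX_WBITS makes zlib emit a gzip wrapper, matching
        # the Content-Encoding: gzip header below
        compressor = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
        for chunk in generate_files():
            data = compressor.compress(chunk)
            if data:
                yield data
        yield compressor.flush()

    return Response(
        gzip_stream(),
        headers={
            "Content-Encoding": "gzip",
            "Content-Disposition": "attachment; filename=output.gz",
        },
    )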
Usually, an application like this is implemented in two parts, like a ticket system.
When the user clicks/submits the form, a request is sent to a service that starts generating the file as a background process; then (without waiting for the file to be generated) it creates a ticket/hash that represents the new file and redirects the user to a new URL, e.g. /files/<random-hash>.
On this new URL, /files/<random-hash>, while the file is not ready, it returns a simple HTML page that shows a message asking the user to wait, with a script on the page that keeps refreshing it every few seconds. As long as the file is not ready, it keeps showing this message; once the file is ready, this URL returns the actual file content in its response with the appropriate MIME header instead.
This solution is quite simple to implement using a database and some programming. Though, if you're looking for a ready-made tool, I'm sorry, I'm not familiar with one. Hope this helps.
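A minimal sketch of that flow, assuming Flask and a hypothetical start_generation() helper that builds the zip in the background:
import os
import threading
import uuid

from flask import Flask, redirect, send_file

app = Flask(__name__)
OUTPUT_DIR = "/tmp/generated"  # hypothetical location for finished zips

def start_generation(ticket):
    # placeholder for the real background job that writes
    # OUTPUT_DIR/<ticket>.zip when it is done
    pass

@app.route("/request-file", methods=["POST"])
def request_file():
    ticket = uuid.uuid4().hex
    threading.Thread(target=start_generation, args=(ticket,)).start()
    return redirect(f"/files/{ticket}")

@app.route("/files/<ticket>")
def files(ticket):
    path = os.path.join(OUTPUT_DIR, f"{ticket}.zip")
    if not os.path.exists(path):
        # not ready yet: show a wait message and have the browser retry
        return '<meta http-equiv="refresh" content="3">Please wait...', 202
    return send_file(path, as_attachment=True, download_name="result.zip")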
Despite claims that it's impossible, I managed to find a way. I learned a bit of Go in the meantime, so I used that, but I guess it won't be too different in other languages.
Basically, the first byte is written to the writer and then flushed. The browser then waits for the rest.
package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
    "net/http"
    "os"
    "strings"
    "time"
)

func Zip(w http.ResponseWriter, r *http.Request) {
    file_name := r.URL.Path
    file_name = strings.TrimPrefix(file_name, "/files/")

    w.Header().Set("Content-Type", "application/zip")

    // Write the first byte of the zip signature ('P' = 0x50) and flush it
    // so the browser starts the download immediately.
    w.Write([]byte{80})
    if f, ok := w.(http.Flusher); ok {
        f.Flush()
    }

    // Poll until the generated file shows up on disk.
    for {
        if _, err := os.Stat("./files/" + file_name); err == nil {
            fmt.Println("file found, breaking")
            break
        }
        time.Sleep(time.Second)
    }

    stream_file_bytes, err := ioutil.ReadFile("./files/" + file_name)
    if err != nil {
        fmt.Println(err)
        return
    }

    // Skip the byte that was already sent and stream the rest.
    b := bytes.NewBuffer(stream_file_bytes)
    b.Next(1)
    b.WriteTo(w)
    fmt.Println("end")
}

func main() {
    http.HandleFunc("/files/", Zip)
    if err := http.ListenAndServe(":8090", nil); err != nil {
        panic(err)
    }
}
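For comparison, here is roughly the same trick sketched in Python with Flask (untested; it assumes Flask's streaming responses flush each yielded chunk to the client):
import os
import time

from flask import Flask, Response

app = Flask(__name__)

@app.route("/files/<name>")
def zip_download(name):
    path = os.path.join("./files", name)

    def generate():
        yield b"P"  # first byte of the zip signature, sent right away
        # poll until the generated file shows up on disk
        while not os.path.exists(path):
            time.sleep(1)
        with open(path, "rb") as f:
            f.seek(1)  # skip the byte that was already sent
            while True:
                chunk = f.read(8192)
                if not chunk:
                    break
                yield chunk

    return Response(generate(), mimetype="application/zip")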
Related
I am trying to use Python 3.10.9 on Windows to create a telnet session using telnetlib, but I have trouble reading the complete response.
I create a telnet session like
session = telnetlib.Telnet(host, port, timeout)
and then I write a command like
session.write(command + b"\n")
and then I wait a really long time (like 5 seconds) before I try to read the response using
session.read_some()
but I only get half of the response back!
The complete response is e.g.
Invalid arguments
Usage: $IMU,START,<SAMPLING_RATE>,<OUTPUT_RATE>
where SAMPLING_RATE = [1 : 1000] in Hz
OUTPUT_RATE = [1 : SAMPLING_RATE] in Hz
but all I read is the following:
b'\x1b[0GInvalid arguments\r\n\r\nUsage: $IMU,START,<'
More than half of the response is missing! How do I read the complete response in a non-blocking way?
Other strange read methods:
read_all: blocking
read_eager: same issue
read_very_eager: sometimes works, sometimes not. Seems to contain a repetition of the message ...
read_lazy: does not read anything
read_very_lazy: does not read anything
I don't have the slightest idea what all these different read methods are for. The documentation is not helping at all.
read_very_eager seems to work sometimes, but sometimes I get a response like
F
FI
FIL
FILT
FILTE
FILTER
and so on. But I am reading only once, not concatenating the output myself!
Maybe there is a simpler-to-use module I can use instead of telnetlib?
Have you tried read_all(), or any of the other read_* options available?
The available functions are listed here.
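If read_all() blocks because the device never closes the connection, one option (a rough sketch; the host, port, and quiet interval below are assumptions) is to keep polling read_very_eager() and accumulate output until the device goes quiet for a while:
import telnetlib
import time

def send_and_read(session, command, quiet_time=0.5, timeout=10.0):
    session.write(command + b"\n")
    buf = b""
    deadline = time.time() + timeout
    last_data = time.time()
    while time.time() < deadline:
        chunk = session.read_very_eager()  # non-blocking, returns b"" if nothing is queued
        if chunk:
            buf += chunk
            last_data = time.time()
        elif buf and time.time() - last_data > quiet_time:
            break  # nothing new for a while, assume the response is complete
        else:
            time.sleep(0.05)
    return buf

session = telnetlib.Telnet("192.0.2.1", 23, timeout=5)  # placeholder host/port
print(send_and_read(session, b"$IMU,START,100,10"))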
I want to use boto3 to run a command on an ECS Fargate container which generates a lot of binary output, and stream that output into a file on my local machine.
My attempt is based on the recommendation here, and looks like this:
import json
import uuid

import boto3
import construct as c
import websocket

# Define Structs
AgentMessageHeader = c.Struct(
    "HeaderLength" / c.Int32ub,
    "MessageType" / c.PaddedString(32, "ascii"),
)

AgentMessagePayload = c.Struct(
    "PayloadLength" / c.Int32ub,
    # This only works with my test command. It won't work with my real
    # command that returns binary data
    "Payload" / c.PaddedString(c.this.PayloadLength, "ascii"),
)

# Define the container you want to talk to
cluster = "..."
task = "..."
container = "..."

# ECS client used for execute_command below
client = boto3.client("ecs")

# Send command with large response (large enough to span multiple messages)
result = client.execute_command(
    cluster=cluster,
    task=task,
    container=container,
    # This is a sample command that returns text. My real command returns
    # hundreds of megabytes of binary data
    command="python -c 'for i in range(1000):\n print(i)'",
    interactive=True,
)

# Get session info
session = result["session"]

# Define initial payload
init_payload = {
    "MessageSchemaVersion": "1.0",
    "RequestId": str(uuid.uuid4()),
    "TokenValue": session["tokenValue"],
}

# Create websocket connection
connection = websocket.create_connection(session["streamUrl"])

try:
    # Send initial response
    connection.send(json.dumps(init_payload))

    while True:
        # Receive data
        response = connection.recv()

        # Decode data
        message = AgentMessageHeader.parse(response)
        payload_message = AgentMessagePayload.parse(response[message.HeaderLength:])

        if 'channel_closed' in message.MessageType:
            raise Exception('Channel closed before command output was received')

        # Print data
        print("Header:", message.MessageType)
        print("Payload Length:", payload_message.PayloadLength)
        print("Payload Message:", payload_message.Payload)
finally:
    connection.close()
This almost works, but has a problem - I can't tell when I should stop reading.
If you read the final message from aws, and call connection.recv() again, aws seems to loop around and send you the initial data - the same data you would have received the first time you called connection.recv().
One semi-hacky way to try to deal with this is to add an end marker to the command. Sort of like:
result = client.execute_command(
    ...
    command="""bash -c "python -c 'for i in range(1000):\n print(i)'; echo -n "=== END MARKER ===""""",
)
This idea works, but to be used properly it becomes really unwieldy. There's always a chance that the end-marker text gets split between two messages, and dealing with that becomes a pain, since you can no longer write a payload to disk immediately until you've verified that the end of that payload, combined with the beginning of the next one, isn't your end marker.
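To illustrate, handling that safely means holding back a tail of bytes from every payload, roughly like this sketch (payloads stands in for the decoded payload bytes of each message):
MARKER = b"=== END MARKER ==="

def stream_until_marker(payloads, out_file):
    tail = b""
    for payload in payloads:
        data = tail + payload
        idx = data.find(MARKER)
        if idx != -1:
            out_file.write(data[:idx])
            return
        # hold back the last len(MARKER) - 1 bytes in case the marker
        # is split across two messages
        keep = len(MARKER) - 1
        out_file.write(data[:-keep])
        tail = data[-keep:]
    out_file.write(tail)  # marker never seen; flush what is left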
Another hacky way is to checksum the first payload and every subsequent payload, comparing each payload's checksum to that of the first payload. That tells you if you've looped around. Unfortunately, this also has a chance of a collision if the binary data in two messages happens to repeat, although in practice the chances of that are probably slim.
Is there a simpler way to determine when to stop reading?
Or better yet, a simpler way to have boto3 give me a stream of binary data from the command I ran?
At the end of my data pipeline, when I finally go to push a Python dict to a JSON file to be pulled on demand by an API, I dump the dict to the file like so:
json.dump(data, out_file)
99.9% of the time this works perfectly and the data is accessible to the end user in the desired format, i.e.:
out_file.json
{
  "good": {
    "JSON": "that", "I": "wanted", "to": ["push", ":)"]
  },
  "more_good": {
    "JSON": "that", "I": "wanted", "to": ["push", ":)"]
  }
}
However, my struggle is with the other 0.1% of the pushes ... I've been noticing that the data gets pushed without completely removing the previous data from the file, and I end up with situations like the following:
out_file.json
{
  "good": {
    "JSON": "that", "I": "wanted", "to": ["push", ":)"]
  },
  "more_good": {
    "JSON": "that", "I": "wanted", "to": ["push", ":)"]
  }
}ed", "to": ["push", ":)"]}}
As of now, I've come up with the following temporary 'solution':
Before pushing the dict, I push an empty string to clear the file:
json.dump('', out_file)
json.dump(data, out_file)
Then, when getting the file contents for the end user, I check to ensure content availability like so:
q = json.load(in_file)
while q == '': # also acts as an if
q = json.load(in_file)
return q
My primary concern is that pushing the string prior to the data will only make the edge cases less likely (if even that), and that I will continue to see these same errors in the future - with the added potential of end-user data access being disrupted by blank strings constantly being written to the data file.
Since the problem occurs only 0.1% of the time and I'm not sure what exactly causes the edge cases, it's been time-consuming to test for, so I can't be sure how my attempted temporary solutions have panned out yet. The inability to test for the edge cases seems like a bug in and of itself, caused by a lack of understanding of what brings the bug about in the first place.
You haven't shown what out_file is or how you open it, but I expect the problem is that two threads/processes try to open and write to the file at roughly the same time. The file is truncated at open, so if the order is open1 - open2 - write1 - write2, you can get results like these. There are two basic choices:
a) use some locking mechanism to signal an error if another thread/process is doing the same thing: a mutex, an exclusive access lock... Then you will have to deal with one of the threads waiting until the file is no longer in use, or giving up on the write.
b) write to a named temporary file on the same filesystem, then use atomic replace.
I recommend the second choice; it is both simpler and safer.
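A minimal sketch of option b), assuming the target path is out_path (the temporary file must live on the same filesystem as the target for the replace to be atomic):
import json
import os
import tempfile

def atomic_dump(data, out_path):
    dir_name = os.path.dirname(out_path) or "."
    with tempfile.NamedTemporaryFile("w", dir=dir_name, suffix=".json",
                                     delete=False) as tmp:
        json.dump(data, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp.name, out_path)  # atomic rename on POSIX filesystems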
I think the above is on point. One other solution you could try, although I'm not sure if it works with your use case and it's a bit of a hack rather than addressing the root cause, would be to create a unique ID for the file.
import json
from uuid import uuid4

file_name = f"{uuid4()}.json"
with open(file_name, 'w') as f:
    json.dump(data, f)
But obviously this only works if you don't need the file to be called 'out_file.json' each time.
@Amardan hit the nail on the head when he diagnosed the problem as being caused by multiple threads writing to the same file simultaneously. To solve the problem in my specific use case I had to diverge slightly from his recommended solution, and I even incidentally ended up incorporating elements of the solution recommended by @osint_alex.
Unfortunately, when trying to use the temporary file recommended by @Amardan, I would receive the following error:
[Errno 18] Invalid cross-device link: '/tmp' -> '/app/data/out_file.json'
This wasn't too big of a problem, since the solution truly lay in the ability to write to my files atomically, not in the use of temp files. So all I had to do was create accessible files of my own to act as temporary holders for the data before writing to the final destination. Ultimately, I used UUID4 to name these temporary files so that no two files would be written to at the same time (at least not any time soon ...). In the end, I was actually able to use this bug as an opportunity to move all my json.dump calls into one function where I can test for edge cases and ensure each file is only written to by one writer at a time. The new function looks something like this:
def update_content(content, dest):
    pth = f'/app/data/{uuid.uuid4()}.json'
    with open(pth, "w") as f:
        json.dump(content, f)
    try:
        with open(pth) as f:
            q = json.load(f)
            # NOTE: edge case testing here ...
        os.replace(pth, dest)
    except:  # add exceptions as you see fit
        os.remove(pth)
I've been searching (without results) for a resumable way to download big files from the internet with Python. I know how to do it directly with urllib2, but if something interrupts the connection, I need some way to reconnect and continue the download where it left off, if that's possible (like a download manager).
For other people whom this answer can help: HTTP has a feature called range requests (byte serving) that allows exactly this, by specifying the 'Range' header of the request with the beginning and end bytes (separated by a dash). So it's possible to just count how many bytes were downloaded previously and send that as the new beginning byte to continue the download. Example with the requests module:
import requests
from os.path import getsize

# get the size of the previously downloaded partial file
beg = getsize(PATH_TO_FILE)

# if we want, we can get the total size before downloading the file
# (without actually downloading it)
end = int(requests.head(URL).headers['content-length'])

# continue the download from the next byte after where it stopped
# (byte offsets are zero-based, so that next byte has index beg)
headers = {'Range': "bytes=%d-%d" % (beg, end - 1)}
download = requests.get(URL, headers=headers)
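The newly fetched bytes can then be appended to the partial file, for example:
# append the resumed chunk to the partial file
with open(PATH_TO_FILE, 'ab') as f:
    f.write(download.content)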
If a would-be HTTP server written in Python 2.6 has local access to a file, what would be the most correct way for that server to return the file to a client on request?
Let's say this is the current situation:
header('Content-Type', file.mimetype)
header('Content-Length', file.size) # file size in bytes
header('Content-MD5', file.hash) # an md5 hash of the entire file
return open(file.path).read()
All the files are .zip or .rar archives no bigger than a couple of megabytes.
With the current situation, browsers handle the incoming download weirdly. No browser knows the file's name, for example, so they use a random or default one. (Firefox even saved the file with a .part extension, even though it was complete and completely usable.)
What would be the best way to fix this and other errors I may not even be aware of, yet?
What headers am I not sending?
Thanks!
This is how I send a ZIP file:
req.send_response(200)
req.send_header('Content-Type', 'application/zip')
req.send_header('Content-Disposition',
                'attachment; filename="%s"' % filename)
Most browsers handle it correctly.
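For context, here is a fuller sketch of the same idea as a BaseHTTPServer handler (Python 2; the archive path is a hypothetical placeholder):
import os
import BaseHTTPServer

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        path = '/srv/archives/example.zip'  # hypothetical path
        self.send_response(200)
        self.send_header('Content-Type', 'application/zip')
        self.send_header('Content-Disposition',
                         'attachment; filename="%s"' % os.path.basename(path))
        self.send_header('Content-Length', str(os.path.getsize(path)))
        self.end_headers()
        with open(path, 'rb') as f:
            self.wfile.write(f.read())

if __name__ == '__main__':
    BaseHTTPServer.HTTPServer(('', 8000), Handler).serve_forever()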
If you don't have to return the response body yourself (that is, if your framework gives you a stream to write the response body to), you can avoid holding the whole file in memory with something like this:
fp = file(path_to_the_file, 'rb')
while True:
    bytes = fp.read(8192)
    if bytes:
        response.write(bytes)
    else:
        return
What web framework are you using?