My Node & Python backend is running just fine, but I now encountered an issue where if a JSON I'm sending from Python back to Node is too long, it gets split into two chunks and my JSON.parse on the Node side fails.
How should I fix this? For example, the first batch clips at
... [1137.6962355826706, -100.78015825640887], [773.3834338399517, -198
and the second one has the remaining few entries
.201506231888], [-87276.575065248, -60597.8827676457], [793.1850250453127,
-192.1674702207991], [1139.4465453979683, -100.56741252031816],
[780.498416769341, -196.04064849430705]]}
Do I have to create some logic on the Node side for long JSONs, or is this a buffering issue on my Python side that I can overcome with proper settings? Here's all I'm doing on the Python side:
outPoints, _ = cv2.projectPoints(inPoints, np.asarray(rvec), np.asarray(tvec),
                                 np.asarray(camera_matrix), np.asarray(dist_coeffs))
# flatten the output to get rid of double brackets per result before JSONifying
flattened = [val for sublist in outPoints for val in sublist]
print(json.dumps({'testdata':np.asarray(flattened).tolist()}))
sys.stdout.flush()
And on the Node side:
// Handle python data from print() function
pythonProcess.stdout.on('data', function (data) {
    try {
        // If JSON handle the data
        console.log(JSON.parse(data.toString()));
    } catch (e) {
        // Otherwise treat as a log entry
        console.log(data.toString());
    }
});
The emitted data arrives in chunks, so if you want to parse the JSON you need to join all the chunks and run JSON.parse on the 'end' event.
By default, pipes for stdin, stdout, and stderr are established between the parent Node.js process and the spawned child. These pipes have limited (and platform-specific) capacity. If the child process writes to stdout in excess of that limit without the output being captured, the child process will block waiting for the pipe buffer to accept more data.
On Linux, each chunk is limited to 65536 bytes:
In Linux versions before 2.6.11, the capacity of a pipe was the same as the system page size (e.g., 4096 bytes on i386). Since Linux 2.6.11, the pipe capacity is 65536 bytes.
let result = '';

pythonProcess.stdout.on('data', data => {
    result += data.toString();
    // Or Buffer.concat if you prefer.
});

pythonProcess.stdout.on('end', () => {
    try {
        // If JSON handle the data
        console.log(JSON.parse(result));
    } catch (e) {
        // Otherwise treat as a log entry
        console.log(result);
    }
});
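If the Python process stays alive and keeps emitting results, the 'end' event won't fire until it exits, so waiting for 'end' alone isn't enough. One common pattern for that case (a sketch, not something the answer above covers; the send helper and the example values are made up) is to frame each message as exactly one line of JSON on the Python side, so the receiving side can accumulate its buffer and split it on newlines:
import json
import sys

def send(obj):
    # one complete JSON document per line; the receiver splits its buffer on "\n"
    sys.stdout.write(json.dumps(obj) + "\n")
    sys.stdout.flush()

# example payload shaped like the question's {'testdata': [...]} message (values made up)
send({"testdata": [[1137.69, -100.78], [773.38, -198.20]]})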
I want to use boto3 to run a command on an ECS Fargate container which generates a lot of binary output, and stream that output into a file on my local machine.
My attempt is based on the recommendation here, and looks like this:
import json
import uuid

import boto3
import construct as c
import websocket

# Define Structs
AgentMessageHeader = c.Struct(
    "HeaderLength" / c.Int32ub,
    "MessageType" / c.PaddedString(32, "ascii"),
)

AgentMessagePayload = c.Struct(
    "PayloadLength" / c.Int32ub,
    # This only works with my test command. It won't work with my real command that returns binary data
    "Payload" / c.PaddedString(c.this.PayloadLength, "ascii"),
)

# Define the container you want to talk to
cluster = "..."
task = "..."
container = "..."

# Create the ECS client used below
client = boto3.client("ecs")

# Send command with large response (large enough to span multiple messages)
result = client.execute_command(
    cluster=cluster,
    task=task,
    container=container,
    # This is a sample command that returns text. My real command returns hundreds of megabytes of binary data
    command="python -c 'for i in range(1000):\n print(i)'",
    interactive=True,
)

# Get session info
session = result["session"]

# Define initial payload
init_payload = {
    "MessageSchemaVersion": "1.0",
    "RequestId": str(uuid.uuid4()),
    "TokenValue": session["tokenValue"],
}

# Create websocket connection
connection = websocket.create_connection(session["streamUrl"])

try:
    # Send initial response
    connection.send(json.dumps(init_payload))

    while True:
        # Receive data
        response = connection.recv()

        # Decode data
        message = AgentMessageHeader.parse(response)
        payload_message = AgentMessagePayload.parse(response[message.HeaderLength:])

        if 'channel_closed' in message.MessageType:
            raise Exception('Channel closed before command output was received')

        # Print data
        print("Header:", message.MessageType)
        print("Payload Length:", payload_message.PayloadLength)
        print("Payload Message:", payload_message.Payload)
finally:
    connection.close()
This almost works, but has a problem - I can't tell when I should stop reading.
If you read the final message from AWS and call connection.recv() again, AWS seems to loop around and send the initial data again - the same data you would have received the first time you called connection.recv().
One semi-hacky way to deal with this is to add an end marker to the command, sort of like:
result = client.execute_command(
    ...
    command="""bash -c "python -c 'for i in range(1000):\n print(i)'; echo -n "=== END MARKER ===""""",
)
This idea works, but making it robust becomes really awkward. There's always a chance that the end-marker text gets split between two messages, and dealing with that is a pain, since you can no longer write a payload to disk immediately: you first have to verify that the end of that payload, combined with the beginning of the next one, isn't your end marker.
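For what it's worth, here is a rough sketch of handling that split-marker case (write_until_marker and the self-test are mine, and it assumes the payloads arrive as bytes objects): hold back the last len(marker) - 1 bytes of each chunk and only flush them once the next chunk proves they are not the start of the marker.
import io

MARKER = b"=== END MARKER ==="

def write_until_marker(payloads, out):
    # Write payload chunks to `out` until MARKER is seen, even if the marker
    # is split across two chunks. `payloads` is any iterable of bytes objects.
    tail = b""
    keep = len(MARKER) - 1
    for payload in payloads:
        data = tail + payload
        idx = data.find(MARKER)
        if idx != -1:
            out.write(data[:idx])
            return
        if len(data) > keep:
            out.write(data[:-keep])  # safe to flush: cannot contain the start of a split marker
            tail = data[-keep:]
        else:
            tail = data

# tiny self-test: the marker is split across the two chunks
buf = io.BytesIO()
write_until_marker([b"abc=== END", b" MARKER ===junk"], buf)
assert buf.getvalue() == b"abc"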
Another hacky way is to checksum the first payload and every subsequent payload, comparing each checksum to the checksum of the first payload. That tells you when you've looped around. Unfortunately, this can also misfire if the binary data in two messages just happens to be identical, although in practice the chances of that are probably slim.
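And a minimal sketch of that checksum idea (the payloads_until_repeat helper is a made-up name; it assumes you collect the raw payload bytes of each message):
import hashlib

def payloads_until_repeat(payloads):
    # Yield payloads until one repeats the digest of the very first payload,
    # which signals that AWS has looped back around to the start.
    first_digest = None
    for payload in payloads:
        digest = hashlib.sha256(payload).digest()
        if first_digest is None:
            first_digest = digest
        elif digest == first_digest:
            return  # looped around: stop reading
        yield payload

# tiny self-test: the third chunk repeats the first, so only two chunks come back
assert list(payloads_until_repeat([b"aa", b"bb", b"aa", b"bb"])) == [b"aa", b"bb"]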
Is there a simpler way to determine when to stop reading?
Or better yet, a simpler way to have boto3 give me a stream of binary data from the command I ran?
In Node.js, I spawn a child Python process and pipe to it. I want to send a UInt8Array through stdin. To tell the child how much buffer data to read, I send its size first. But the child doesn't stop reading after the specified size, so the Python process never terminates. I've checked that it receives bufferSize properly and converts it into an integer. Without size = int(input()) and python.stdin.write(bufferSize.toString() + "\n"), and with the size of the buffer hardcoded instead, it works correctly. I couldn't figure out why it doesn't stop waiting after reading the specified number of bytes.
// Node.JS
const python_command = command.serializeBinary()
const python = spawn('test/production_tests/py_test_scripts/protocolbuffer/venv/bin/python', ['test/production_tests/py_test_scripts/protocolbuffer/command_handler.py']);
const bufferSize = python_command.byteLength
python.stdin.write(bufferSize.toString() + "\n")
python.stdin.write(python_command)
# Python
size = int(input())
data = sys.stdin.buffer.read(size)
In a nutshell, the problem seems to arise from calling the normal stdin input() first and then sys.stdin.buffer.read(). I guess the first call conflicts with the second one and prevents it from working normally.
There are two potential problems here. The first is that the pipe between Node.js and the Python script is block buffered: you won't see any data on the Python side until either a block's worth of data has been filled (system dependent) or the pipe is closed. The second is that there is a decoder between input() and the byte stream coming in on stdin. This decoder is free to read ahead in the stream as it wishes, so a later sys.stdin.buffer read may miss whatever happens to be sitting in the decoder's buffer.
You can solve the second problem by doing all of your reads from sys.stdin.buffer, as shown below. The first problem needs to be solved on the Node.js side, most likely by closing the subprocess's stdin (python.stdin.end()) once everything is written. You may also be better off writing the size as a binary number, say a uint64.
import struct
import sys

# read size - assuming it's coming in as an ascii stream
size_buf = []
while True:
    c = sys.stdin.buffer.read(1)
    if c == b"\n":
        size = int(b"".join(size_buf))
        break
    size_buf.append(c)

fmt = "B"  # read unsigned char
fmtsize = struct.calcsize(fmt)

buf = [struct.unpack(fmt, sys.stdin.buffer.read(fmtsize))[0] for _ in range(size)]
print(buf)
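For the binary-size alternative mentioned above, a minimal sketch of the Python side, assuming the parent writes the length as an 8-byte big-endian unsigned integer (the ">Q" byte order is my assumption) immediately followed by the payload:
import struct
import sys

raw = sys.stdin.buffer.read(8)           # fixed-width length prefix, no text decoding involved
(size,) = struct.unpack(">Q", raw)       # big-endian uint64
payload = sys.stdin.buffer.read(size)    # then exactly `size` bytes of message body
print(len(payload))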
I want to let the user start downloading a file when it's not ready yet. I don't want to send the user to a page saying "Wait 30 seconds, the file is being prepared." I can't generate the file in advance. I need the user to click/submit the form, choose a download location and start downloading. The generated file will be a zip, so I imagine it should be possible to send the file name with the first few bytes of the zip (which are always the same), and, until the file is generated, not confirm that the TCP packet was sent correctly (or something like that), and then send the rest once the file is ready.
How do I do that? Is there any tool that can do this, or is there a better way? The more high-level the solution the better, since C isn't my strong suit. Preferably in Python. Thanks.
The file being generated is a zip, and before it's prepared there isn't really anything to send yet. Basically, according to the input I generate a set of files (which takes a few dozen seconds), then I zip them and serve them to the user. My application is in Python on Linux, but which server I'll use isn't really important.
The client would most likely time out (or wait 30 seconds without any notification) while the file is being prepared. I would use a streaming compression algorithm (gzip) to compress the file(s) while in transit. This would not result in the best compression, but would serve the files in a predictable manner.
Investigate the "Content-Encoding: gzip" HTTP header.
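As a rough illustration of that idea (a sketch of mine, standard library only; generate_chunks, the port, and the file name are made up), a WSGI app can start the download immediately and push gzip-compressed data as it is produced:
import time
import zlib
from wsgiref.simple_server import make_server

def generate_chunks():
    # stand-in for the real, slow file generation
    for _ in range(5):
        time.sleep(1)
        yield b"some generated data\n" * 1000

def app(environ, start_response):
    start_response("200 OK", [
        ("Content-Type", "application/octet-stream"),
        ("Content-Encoding", "gzip"),  # the client decompresses transparently
        ("Content-Disposition", 'attachment; filename="output.txt"'),
    ])
    gz = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS = gzip wrapper

    def body():
        for chunk in generate_chunks():
            data = gz.compress(chunk)
            if data:
                yield data  # the browser starts receiving before generation is finished
        yield gz.flush()

    return body()

if __name__ == "__main__":
    make_server("", 8090, app).serve_forever()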
Usually, an application like this is implemented in two parts, like a ticket system.
When the user clicks/submits the form, the request goes to a service that starts generating the file as a background process. Then, without waiting for the file to be generated, it creates a ticket/hash that represents the new file and redirects the user to a new URL, e.g. /files/<random-hash>.
On this new URL, /files/<random-hash>, while the file is not ready, it returns a simple HTML page telling the user to wait, with a script that keeps refreshing the page every few seconds. As long as the file is not ready, it keeps showing this message; once the file is ready, the URL returns the actual file content with the appropriate MIME header instead.
The solution is quite simple to implement with a database and some programming. If you're looking for a ready-made tool, though, I'm sorry, I'm not familiar with one. Hope this helps.
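A rough sketch of that ticket pattern (my own illustration; Flask, the background thread, the route names, the /tmp/generated path and the 10-second sleep are all made-up details, not part of the answer):
import os
import secrets
import threading
import time
import zipfile

from flask import Flask, redirect, send_file

app = Flask(__name__)
OUTPUT_DIR = "/tmp/generated"

def generate_zip(ticket):
    # stand-in for the real generation step that takes tens of seconds
    time.sleep(10)
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    with zipfile.ZipFile(os.path.join(OUTPUT_DIR, ticket + ".zip"), "w") as z:
        z.writestr("hello.txt", "generated content")

@app.post("/start")
def start():
    ticket = secrets.token_urlsafe(16)
    threading.Thread(target=generate_zip, args=(ticket,), daemon=True).start()
    return redirect("/files/" + ticket)

@app.get("/files/<ticket>")
def files(ticket):
    path = os.path.join(OUTPUT_DIR, ticket + ".zip")
    if os.path.exists(path):
        return send_file(path, as_attachment=True)
    # not ready yet: a simple page that refreshes itself every few seconds
    return '<meta http-equiv="refresh" content="3">The file is being prepared...', 202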
Despite claims that it's impossible, I managed to find a way. I learned a bit of Go in the meantime, so I used that, but I guess it won't be too different in other languages.
Basically, the first byte is written to the writer and flushed; the browser then waits for the rest.
package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
    "net/http"
    "os"
    "strings"
    "time"
)

func Zip(w http.ResponseWriter, r *http.Request) {
    file_name := r.URL.Path
    file_name = strings.TrimPrefix(file_name, "/files/")

    w.Header().Set("Content-type", "application/zip")
    w.Write([]byte{80}) // "P", the first byte of the zip magic number
    if f, ok := w.(http.Flusher); ok {
        f.Flush()
    }

    // wait until the file has been generated
    for {
        if _, err := os.Stat("./files/" + file_name); err == nil {
            fmt.Println("file found, breaking")
            break
        }
        time.Sleep(time.Second)
    }

    stream_file_bytes, err := ioutil.ReadFile("./files/" + file_name)
    if err != nil {
        fmt.Println(err)
        return
    }

    b := bytes.NewBuffer(stream_file_bytes)
    b.Next(1) // skip the first byte, which was already sent
    b.WriteTo(w)
    fmt.Println("end")
}

func main() {
    http.HandleFunc("/files/", Zip)
    if err := http.ListenAndServe(":8090", nil); err != nil {
        panic(err)
    }
}
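Since the question asked for Python, here is the same trick as a rough Python sketch (an illustrative translation of mine, standard library only; zip_handler, FILES_DIR and the port mirror the Go code but are otherwise my assumptions): send the first byte of the zip right away, wait for the file to appear, then stream the rest.
import os
import time
from wsgiref.simple_server import make_server

FILES_DIR = "./files"

def zip_handler(environ, start_response):
    file_name = environ["PATH_INFO"].rsplit("/files/", 1)[-1]
    path = os.path.join(FILES_DIR, file_name)
    start_response("200 OK", [("Content-Type", "application/zip")])

    def body():
        yield b"P"  # first byte of the zip magic number, sent immediately
        while not os.path.exists(path):  # wait until the file has been generated
            time.sleep(1)
        with open(path, "rb") as f:
            f.seek(1)  # the first byte has already been sent
            while True:
                chunk = f.read(64 * 1024)
                if not chunk:
                    break
                yield chunk

    return body()

if __name__ == "__main__":
    make_server("", 8090, zip_handler).serve_forever()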
I have a large text file that I am extracting URLs from. If I run:
import re

with open('file.in', 'r') as fh:
    for match in re.findall(r'http://matchthis\.com', fh.read()):
        print match
it runs in a second or so user time and gets the URLs I was wanting, but if I run either of these:
var regex = /http:\/\/matchthis\.com/g;

fs.readFile('file.in', 'ascii', function(err, data) {
    while (match = regex.exec(data))
        console.log(match);
});
OR
fs.readFile('file.in', 'ascii', function(err, data) {
    var matches = data.match(/http:\/\/matchthis\.com/g);
    for (var i = 0; i < matches.length; ++i) {
        console.log(matches[i]);
    }
});
I get:
FATAL ERROR: CALL_AND_RETRY_0 Allocation failed - process out of memory
What is happening with the node.js regex engine? Is there any way I can modify things such that they work in node?
EDIT: The error appears to be fs-centric, as this also produces the error:
fs.readFile('file.in', 'ascii', function(err, data) {
});
file.in is around 800MB.
You should process the file line by line using the streaming file interface. Something like this:
var fs = require('fs');
var byline = require('byline');

var input = fs.createReadStream('tmp.txt');
var lines = input.pipe(byline.createStream());

lines.on('readable', function() {
    var line;
    // read() can return null, and several lines may be buffered per 'readable' event
    while ((line = lines.read()) !== null) {
        var matches = line.toString('ascii').match(/http:\/\/matchthis\.com/g) || [];
        for (var i = 0; i < matches.length; ++i) {
            console.log(matches[i]);
        }
    }
});
In this example, I'm using the byline module to split the stream into lines so that you won't miss matches by getting partial chunks of lines per .read() call.
To elaborate more, what you were doing is allocating ~800MB of RAM as a Buffer (outside of V8's heap) and then converting that to an ASCII string (and thus transferring it into V8's heap), which will take at least 800MB and likely more depending on V8's internal optimizations. I believe V8 stores strings as UCS2 or UTF16, which means each character will be 2 bytes (given ASCII input) so your string would really be about 1600MB.
Node's max allocated heap space is 1.4GB, so by trying to create such a large string, you cause V8 to throw an exception.
Python does not have this problem because it does not have a maximum heap size and will chew through all of your RAM. As others have pointed out, you should also avoid fh.read() in Python since that will copy all the file data into RAM as a string instead of streaming it line by line with an iterator.
Given that both programs are trying to read the entire 800 MB file into memory, I'd suggest the difference lies in how Node and Python handle large strings. Try doing a line-by-line search and the problem should disappear.
For example, in Python you can do this:
import re

with open('file.in', 'r') as file:
    for line in file:
        for match in re.findall(r'http://matchthis\.com', line):
            print match