Sometimes we can't know the size of the data in advance (when there is no Content-Length in the HTTP response headers).
What is the best way to receive HTTP response data over a plain socket?
The following code can get all the data, but it blocks at buf = sock.recv(1024).
from socket import *
sock = socket(AF_INET, SOCK_STREAM)
sock.connect(('www.google.com', 80))
index = "GET / HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\n\r\n"
sock.send(index)
data = ""
while True:
    buf = sock.recv(1024)
    if not len(buf):
        break
    data += buf
I'm assuming you are writing the sender as well.
A classic approach is to prefix any data sent over the wire with its length. On the receive side, you greedily append all received data to a buffer, then iterate over the buffer each time new data arrives.
So if I send 100 bytes of data, I would prefix an int 100 to the beginning of the packet and then transmit. The receiver then knows exactly what it is looking for. If you want to get fancy, you can use a special end-of-packet sequence like \x00\x01\x02 to indicate the proper end of a packet; this doubles as an easily implemented form of error checking.
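A minimal sketch of that scheme, assuming a 4-byte big-endian length prefix (the helper names send_msg/recv_msg/recv_exact are mine, and the demo uses a local socket pair instead of a real network connection):

```python
import socket
import struct

def send_msg(sock, payload):
    # Prefix the payload with its length as a 4-byte big-endian int.
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_exact(sock, n):
    # recv() may return fewer bytes than requested, so loop until done.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf += chunk
    return buf

def recv_msg(sock):
    # Read the fixed-size length prefix, then exactly that many bytes.
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)

# Demo over a local socket pair.
a, b = socket.socketpair()
send_msg(a, b"hello")
send_msg(a, b"world")
print(recv_msg(b))  # b'hello'
print(recv_msg(b))  # b'world'
```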
Use a bigger size first and run a couple of tests, then check the length of those buffers; that will give you an idea of what the maximum size would be. Then just use that number + 100 or so to be safe.
Testing different scenarios will be your best bet for finding your ideal buf size.
It would also help to know what protocol you are using the sockets for; then we would have a better idea and could give you a better answer.
Today I got the same question again, and I found the simple way is to use httplib.
from httplib import HTTPResponse
r = HTTPResponse(sock)
r.begin()
# now you can use HTTPResponse methods to get what you want
print r.read()
I am new to programming and started with Python about two weeks ago using a course on FCC; I am currently in the networking chapter.
The exercise was about creating a program which counts the number of characters in a document on a website and displays only the first 3000 characters of that document, using the socket library in Python. The next exercise was to do the same with the urllib library. I have noticed that, when using socket, I was sometimes missing some letters from the file when the bufsize parameter of the sock.recv(bufsize[, flags]) method wasn't set to the total length of the received document. For example, when I used 1024 as the value for bufsize, some letters were missing here and there from the retrieved document, but when I set bufsize to 95000 (the exact byte length of that document), I got all the letters and everything worked fine.
Please don't be too harsh on me with the code, I am just starting to write something, but here is my example:
import socket
import re

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
while True:
    userinp = input("Enter a URL: ")
    try:
        if userinp.startswith("http"):
            url = userinp.split("/")[2]
            #print(url)
            sock.connect((url, 80))
            #print("http start connected")
            break
        elif userinp.startswith("www"):
            url = userinp.split("/")[0]
            #print(url)
            sock.connect((url, 80))
            #print("www start connected")
            break
        else:
            url = userinp.split("/")[0]
            #print(url)
            sock.connect((url, 80))
            #print("else start connected")
            break
    except:
        print("Please enter a valid URL")
        continue

if userinp.startswith("http:"):
    cmd0 = "GET " + userinp + " HTTP/1.0\r\n\r\n"
    cmd = cmd0.encode()
    #print("http bytes: ", cmd)
elif userinp.startswith("https:"):
    cmd0 = "GET " + userinp + " HTTP/1.1\r\nHost: " + url + "\r\n\r\n"
    cmd = cmd0.encode()
    #print("https bytes: ", cmd)
else:
    cmd0 = "GET http://" + userinp + " HTTP/1.0\r\n\r\n"
    cmd = cmd0.encode()
sock.send(cmd)
#print("cmd request sent")

count = 0
str = ""
while True:
    data = sock.recv(95000) ##536 magic number in romeo.txt, 95000 in mbox-short.txt
    if len(data) < 1: ##http://data.pr4e.org/mbox-short.txt
        break
    #print("Byte length:", len(data))
    data = data.decode()
    pos = data.find("\r\n\r\n") + 4
    for each in data[pos:]:
        count += 1
        if count <= 3000:
            str += each
print(str, "Total characters:", count, len(str))
sock.shutdown(socket.SHUT_RDWR)
sock.close()
The first if statements are meant for the first exercise in the chapter, which was handling user-input URLs using the socket library. On many websites I have some problems with that too, since it often says
301: Moved Permanently
but the Location specified in the document points to the exact same location.
So my questions are:
Why do I have to set the bufsize parameter to the exact byte length of the retrieved document in order to get all the letters out of it? Is there a way around this using the socket library?
Why do some websites specify that they are moved permanently, but show the exact same location of the website?
With the urllib library it is much easier, since it does "all the stuff" for me, but I would like to know how I need to write the program with the socket library too, just to get a better understanding of it.
I'm sorry for the noob questions, but I've read that beginner questions are welcome as well! I hope you can help me with my problem, thank you in advance! :)
Actually, before HTTP/1.1 the Content-Length header was only a SHOULD in RFC 1945, which means it was not required. So how did an application detect the end of a file? The closing of the TCP connection was treated as the end. That is why, even now, there are files whose size we can't know before downloading. This story is about HTTP: layer 7, the application layer in the OSI model.
The sockets you are using belong to TCP, layer 4, the transport layer. TCP has no way to know the size of a file; it just manages connections and delivers bytes, nothing more. As long as the TCP endpoints work correctly with each other, the layers above are free to define their own rules, and the same is true of HTTP.
How networks work is a topic big enough to fill a thick book. If you are interested, I recommend reading some books about networking.
If anyone is interested in the answer to this question (probably not, because it is a complete beginner question):
I played around with the program a little and added print statements basically everywhere to see what it is doing at each point. With sock.recv(512), every iteration of the receive loop gets up to 512 bytes. Those bytes are decoded to a string, and the for loop iterates through every character of it, but only up to the end of that 512-byte chunk, which in the case of romeo.txt ends with the "s" in the last line of the poem. Then sock.recv receives the rest of the document and the for loop iterates through it again, but this time data[pos:] (originally meant to remove the header) misbehaves: the remaining chunks contain no "\r\n\r\n", so find() returns -1, pos becomes 3, and the slice skips the first three characters of each chunk. That is why I lose three letters on every iteration after the first.
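Based on that diagnosis, here is a sketch of a fixed receive loop: accumulate every chunk first, then strip the header exactly once on the complete response. The helper name receive_body is mine, and the demo feeds a fake HTTP response through a local socket pair rather than contacting a live server:

```python
import socket

def receive_body(sock, bufsize=512):
    # Accumulate every chunk first; recv() returns at most bufsize bytes,
    # and a chunk boundary can fall anywhere in the stream.
    raw = b""
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:
            break
        raw += chunk
    # Strip the HTTP header exactly once, on the complete response.
    pos = raw.find(b"\r\n\r\n") + 4
    return raw[pos:].decode()

# Demo: send a fake HTTP response through a local socket pair.
a, b = socket.socketpair()
a.sendall(b"HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\nBut soft what light")
a.close()
print(receive_body(b))  # But soft what light
```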
I'm having a problem with a block of Python code reading in a string from an Arduino connected over USB. I understand that serial doesn't know what a string is or care. I'm using serial.readline, which from the documentation sounds like the perfect match, but my string isn't always complete. The weird problem is, the string doesn't always have the front of the string, but it always has the end of the string. I'm really lost on this and I'm sure it's just my lack of understanding about the nuances of reading serial data or how Python handles it.
In the code below, I loop through the serial interfaces until I find the one I'm looking for. I flush the input and give it a sleep for a couple seconds to make sure it has time to get a new read.
arduinoTemp = serial.Serial(iface, 9600, timeout=1)
arduinoTemp.flushInput()
arduinoTemp.flushOutput()
arduinoTemp.write("status\r\n".encode())
time.sleep(2)
read = arduinoTemp.readline().strip()
if read != "":
    #check the string to make sure it's what I'm expecting.
I'm sending the string in JSON.
I'm expecting something in line with this:
{"id": "env monitor","distance": {"forward": {"num":"0","unit": "inches"}},"humidity": {"num":"0.00","unit": "%"},"temp": {"num":"0.00","unit": "fahrenheit"},"heatIndex": {"num":"0.00","unit": "fahrenheit"}}
I might get something back like this:
": t": "%"},"temp": {"num":"69.80","unit": "fahrenheit"},"heatIndex": {"num":"68.13","unit": "fahrenheit"}}
or this:
atIndex": {"num":"0.00","unit": "fahrenheit"}}
At first I thought it was the length of the string that might be causing some issues, but the cut off isn't always consistent, and since it has the end of the string, it stands to reason that it should have gotten everything before that.
I've verified that my Arduino is broadcasting correctly by interfacing with it directly and the Arduino IDE and serial monitor. This is definitely an issue with my Python code.
In (serial) communications you should always expect to receive partial answers.
A usual solution in this case is to add whatever you read from the serial to a string/buffer until you can parse it successfully with json.loads.
import serial
import json
import time

ser = serial.Serial('/dev/ttyACM0', 9600)
buffer = ''
while True:
    # read() returns bytes, so decode before appending to the str buffer;
    # reading in_waiting bytes at a time avoids one-byte-per-call reads
    buffer += ser.read(ser.in_waiting or 1).decode()
    try:
        data = json.loads(buffer)
        print(data)
        buffer = ''
    except json.JSONDecodeError:
        time.sleep(1)
Note that if you flush, you will lose data!
Also note that this is a somewhat simplified solution. Ideally the buffer should be reset to whatever remains after a successful parse; json.JSONDecoder.raw_decode, which reports how many characters it consumed, can be used for that.
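In fact, json.JSONDecoder.raw_decode returns both the parsed object and the index where parsing stopped, so the unparsed remainder can be kept. A small sketch with a hard-coded buffer standing in for serial input:

```python
import json

decoder = json.JSONDecoder()
buffer = '{"temp": 68.13}{"temp": 69.'  # one complete object plus a partial one

try:
    # raw_decode parses the first JSON value and reports where it ended
    obj, end = decoder.raw_decode(buffer)
    print(obj)                      # {'temp': 68.13}
    buffer = buffer[end:].lstrip()  # keep only the unparsed remainder
except json.JSONDecodeError:
    pass  # need more data before anything parses

print(buffer)  # {"temp": 69.
```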
I am trying to send multiple strings using the socket.send() and socket.recv() function.
I am building a client/server and, "in a perfect world", would like my server to call the socket.send() function a maximum of 3 times (for a certain server command) while having my client call socket.recv() 3 times. This doesn't seem to work; the client gets hung up waiting for another response.
server:
clientsocket.send(dir)
clientsocket.send(dirS)
clientsocket.send(fileS)
client
response1 = s.recv(1024)
if response1:
    print "\nReceived response: \n", response1
response2 = s.recv(1024)
if response2:
    print "\nReceived response: \n", response2
response3 = s.recv(1024)
if response3:
    print "\nReceived response: \n", response3
I was going through the tedious task of joining all my strings together then reparsing them in the client, and was wondering if there was a more efficient way of doing it?
edit:
My output of response1 gives me unusual results. The first time I print response1, it prints all 3 of the responses in 1 string (all mashed together). The second time I call it, it gives me the first string by itself. The following calls to recv are now glitched and display the 2nd string, then the third string. It then starts to display the other commands as if it were behind in a queue.
Very unusual, but I will likely stick to joining the strings together in the server then parsing them in the client
You wouldn't send bytes/strings over a socket like that in a real-world app.
You would create a messaging protocol on-top of the socket, then you would put your bytes/strings in messages and send messages over the socket.
You probably wouldn't create the messaging protocol from scratch either. You'd use a library like nanomsg or zeromq.
server
from nanomsg import Socket, PAIR
sock = Socket(PAIR)
sock.bind('inproc://bob')
sock.send(dir)
sock.send(dirS)
sock.send(fileS)
client
from nanomsg import Socket, PAIR
sock = Socket(PAIR)
sock.connect('inproc://bob')
response1 = sock.recv()
response2 = sock.recv()
response3 = sock.recv()
In nanomsg, recv() will return exactly what was sent by send() so there is a one-to-one mapping between send() and recv(). This is not the case when using lower-level Python sockets where you may need to call recv() multiple times to get everything that was sent with send().
TCP is a streaming protocol and there are no message boundaries. Whether a blob of data was sent with one or a hundred send calls is unknown to the receiver. You certainly can't assume that 3 sends can be matched with 3 recvs. So, you are left with the tedious job of reassembling fragments at the receiver.
One option is to layer a messaging protocol on top of the pure TCP stream. This is what zeromq does, and it may be an option for reducing the tedium.
The answer to this has been covered elsewhere.
There are two solutions to your problem.
Solution 1:
Mark the end of your strings. send(escape(dir) + MARKER) Your client then keeps calling recv() until it gets the end-of-message marker. If recv() returns multiple strings, you can use the marker to know where they start and end. You need to escape the marker if your strings contain it. Remember to escape on the client too.
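A sketch of Solution 1, assuming a one-byte \x00 marker and no escaping (i.e. the payloads never contain the marker; the helper names send_msg/recv_msgs are mine):

```python
import socket

MARKER = b"\x00"  # assumed end-of-message marker; payloads must not contain it

def send_msg(sock, payload):
    # Append the marker so the receiver can find the message boundary.
    sock.sendall(payload + MARKER)

def recv_msgs(sock):
    # A single recv() may hold several messages, or only part of one,
    # so keep a buffer and split on the marker.
    buffer = b""
    while True:
        chunk = sock.recv(1024)
        if not chunk:
            return
        buffer += chunk
        while MARKER in buffer:
            msg, buffer = buffer.split(MARKER, 1)
            yield msg

# Demo over a local socket pair.
a, b = socket.socketpair()
send_msg(a, b"dir listing")
send_msg(a, b"file sizes")
a.close()
print(list(recv_msgs(b)))  # [b'dir listing', b'file sizes']
```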
Solution 2:
Send the length of your strings before you send the actual string. Your client then keeps calling recv() until it has read all the bytes. If recv() returns multiple strings, you know where they start and end since you know how long they are. When sending the length of your string, make sure you use a fixed number of bytes so you can distinguish the string length from the string in the byte stream. You will find the struct module useful.
I found this code to detect the length of encrypted data in the frame:
header = self.request.recv(5)
if header == '':
    #print 'client disconnected'
    running = False
    break
(content_type, version, length) = struct.unpack('>BHH', header)
data = self.request.recv(length)
Source:
https://github.com/EiNSTeiN-/poodle/blob/master/samples/poodle-sample-1.py
https://gist.github.com/takeshixx/10107280
https://gist.github.com/ixs/10116537
This code listens to the connection between a client and a server. When the client talks to the server, self.request.recv(5) reads the length of the header in the frame. Then we use that length to read the data.
If we print the exchange between the client and the server :
Client --> [proxy] -----> Server
length: 24  # why 24?
Client --> [proxy] -----> Server
length: 80  # length of the data
Client <-- [proxy] <----- Server
We can see that the client sends two packets to the server.
If i change
data = self.request.recv(length)
to
data = self.request.recv(4096)
Only one exchange is made.
Client --> [proxy] -----> Server
length: 109 #length of the data + the header
Client <-- [proxy] <----- Server
My question is: why do we only need to read 5 bytes to get the length and content_type information? Is there understandable documentation about this?
And why are there two requests: one with length 24 and another with the length of our data?
why we only need to read 5 bytes to get the length and content_type
information?
Because obviously that's the way the protocol was designed.
Binary streams only guarantee that when some bytes are put into one end of the stream, they arrive in the same order on the other end of the stream. For message transmission through binary streams the obvious problem is: where are the message boundaries? The classical solution to this problem is to add a prefix to messages, a so-called header. This header has a fixed size, known to both communication partners. That way, the recipient can safely read header, message, header, message (I guess you grasp the concept, it is an alternating fashion). As you see, the header does not contain message data -- it is just communication "overhead". This overhead should be kept small. The most efficient (space-wise) way to store such information is in binary form, using some kind of code that must, again, be known to both sides of the communication. Indeed, 5 bytes of information is quite a lot.
The '>BHH' format string indicates that this 5 byte header is built up like this:
unsigned char (1 Byte)
unsigned short (2 Bytes)
unsigned short (2 Bytes)
Plenty of room for storing information such as length and content type, don't you think? This header can encode 256 different content types, 65536 different versions, and a message length between 0 and 65535 bytes.
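For example, unpacking a hypothetical 5-byte header with this format (content type 22, version 0x0301, a 48-byte payload to follow):

```python
import struct

# A made-up 5-byte header: one unsigned char, then two unsigned shorts,
# all big-endian, matching the '>BHH' format string.
header = b"\x16\x03\x01\x00\x30"
content_type, version, length = struct.unpack(">BHH", header)
print(content_type, hex(version), length)  # 22 0x301 48
```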
Why are there two requests: one with 24 and another with the length of
our data?
If your network forensics / traffic analysis does not correspond to what you have inferred from code, then one of the two analyses is wrong or incomplete. In this case, I guess that your traffic analysis is correct, but that you have not understood all the code relevant to this kind of communication. Note that I did not look at the source code you linked to.
I would like to parse the first two bytes of a packets payload using Scapy. What would be the best way to accomplish this? Are offset calculations required?
First the payload needs to be parsed, though the following will parse the whole PCAP file. Is there a more efficient way to obtain the first two bytes of every payload?
>>> fp = open("payloads.dat","wb")
>>> def handler(packet):
... fp.write(str(packet.payload.payload.payload))
...
>>> sniff(offline="capture1.dump",prn=handler,filter="tcp or udp")
I see. That looks pretty efficient from here.
You might try fp.write(str(packet.payload.payload.payload)[:2]) to get just the first 2 bytes.
You could also do fp.write(str(packet[TCP].payload)[:2]) to skip past all those payloads.
Alternately, you could define an SSL Packet object, bind it to the appropriate port, then print the SSL layer.
class SSL(Packet):
    name = "SSL"
    fields_desc = [ShortField("firstBytes", None)]

bind_layers(TCP, SSL, sport=443)
bind_layers(TCP, SSL, dport=443)

def handler(packet):
    fp.write(str(packet[SSL]))

...but this seems like overkill.