Python socket receive bufsize parameter value - python

I am new to programming and started with Python about 2 weeks ago using a course on FCC. I am currently in the networking chapter.
The exercise was about creating a program which counts the maximum number of characters in a document of a website and only display the first 3000 characters of that document using the socket library in Python. The next exercise was to do the same with the urllib library. I have noticed that, when using socket, I was sometimes missing some letters in the file when the bufsize parameter of the sock.recv(bufsize,[flag]) method wasn't set to the total length of received bytes from the document. For example when I used 1024 as the value for bufsize, there were some letters missing here and there from the retrieved document, but when I put the bufsize to 95000 (exact number of bytelength of that document), I got all the letters and everything worked fine.
Please don't be too harsh on me with the code, I am just starting to write something, but here is my example:
import socket
import re
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
while True:
    userinp = input("Enter a URL: ")
    try:
        if userinp.startswith("http"):
            url = userinp.split("/")[2]
            #print(url)
            sock.connect((url, 80))
            #print("http start connected")
            break
        elif userinp.startswith("www"):
            url = userinp.split("/")[0]
            #print(url)
            sock.connect((url, 80))
            #print("www start connected")
            break
        else:
            url = userinp.split("/")[0]
            #print(url)
            sock.connect((url, 80))
            #print("else start connected")
            break
    except:
        print("Please enter a valid URL")
        continue
if userinp.startswith("http:"):
    cmd0 = "GET " + userinp + " HTTP/1.0\r\n\r\n"
    cmd = cmd0.encode()
    #print("http bytes: ", cmd)
elif userinp.startswith("https:"):
    cmd0 = "GET " + userinp + " HTTP/1.1\r\nHost: " + url + "\r\n\r\n"
    cmd = cmd0.encode()
    #print("https bytes: ", cmd)
else:
    cmd0 = "GET http://" + userinp + " HTTP/1.0\r\n\r\n"
    cmd = cmd0.encode()
sock.send(cmd)
#print("cmd request sent")
count = 0
str = ""
while True:
    data = sock.recv(95000) ##536 magic number in romeo.txt, 95000 in mbox-short.txt
    if len(data) < 1: ##http://data.pr4e.org/mbox-short.txt
        break
    #print("Byte length:", len(data))
    data = data.decode()
    pos = data.find("\r\n\r\n") + 4
    for each in data[pos:]:
        count += 1
        if count <= 3000:
            str += each
print(str, "Total characters:", count, len(str))
sock.shutdown(socket.SHUT_RDWR)
sock.close()
The first if statements are meant for the first exercise in the chapter, which was handling user-input URLs using the socket library. On many websites I have some problems with that too, since it often says
301: Moved Permanently
But the location specified in the document says it moved to the exact same location.
So my questions are:
Why do I have to set the bufsize parameter to the exact bytelength of the retrieved document in order to get all letters out of it? Is there a way around this using the socket library?
Why do some websites specify that they are moved permanently, but show the exact same location of the website?
With the urllib library it is much easier, since it does "all the stuff" for me, but I would like to know how I need to write the program with the socket library too, just to get a better understanding of it.
I'm sorry for the noob questions, but I've read that beginner questions are welcome as well! I hope you can help me with my problem, thank you in advance! :)

Before HTTP/1.1, the Content-Length header was only a SHOULD in RFC 1945, which means the header was not required. So how did an application detect the end of a file? The closing of the TCP connection was treated as the end. As a result, even now there are files whose size we cannot know before downloading them. All of this concerns HTTP, which sits at layer 7, the application layer of the OSI model.
The sockets you are using belong to TCP, at layers 5 and 4. TCP has no way of knowing the size of a file. It only manages connections and sends bytes; it does not care about anything else. As long as the TCP endpoints work correctly with each other, the layers above them can rely on the bytes arriving. The same holds for HTTP.
How networks work is a topic that could fill a thick book by itself. If you are interested, I recommend reading some books about networking.
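To illustrate the point that TCP only delivers bytes and that a closed connection marks the end of the data, here is a minimal sketch of a receive loop that works with any bufsize. The helper name recv_until_closed is made up for this example, and it assumes the server closes the connection when it is done (as an HTTP/1.0 server normally does).

```python
import socket

def recv_until_closed(sock, bufsize=1024):
    """Collect everything the peer sends until it closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)  # returns at most bufsize bytes, often fewer
        if not chunk:               # b"" means the peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)

if __name__ == "__main__":
    # Local demonstration with a connected socket pair instead of a real server.
    sender, receiver = socket.socketpair()
    sender.sendall(b"x" * 5000)
    sender.close()
    print(len(recv_until_closed(receiver, bufsize=512)))
    receiver.close()
```

The value of bufsize only changes how many loop iterations are needed, not how much data arrives in total.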

If anyone is interested in the answer to this question (probably not, because it is a complete beginner question):
I played around with the program a little and added print statements basically everywhere to see exactly what it is doing at each point. With bufsize set to 512, each call to sock.recv returns at most 512 bytes, so the loop processes the document in chunks of up to 512 bytes per iteration. After decoding a chunk to a string, the for loop iterates through every character, but only up to the end of that chunk, which in this case (romeo.txt) ends with the "s" in the last line of the poem. Then sock.recv receives the rest of the document and the for loop iterates through it, but this time "data[pos:]" (originally meant to remove the header) skips characters it should not: the chunk contains no "\r\n\r\n", so find returns -1 and pos becomes -1 + 4 = 3. So I lose the first 3 letters of every chunk after the first one.
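The effect described above can be reproduced without a network connection. The sketch below uses made-up chunk contents and hypothetical function names: one function mimics the per-chunk header search from the question, the other joins all chunks first and strips the header exactly once.

```python
def body_chars_buggy(chunks):
    """Mimics the question's loop: searches for the blank line in every chunk."""
    text = ""
    for data in chunks:
        decoded = data.decode()
        # find() returns -1 when "\r\n\r\n" is absent, so pos becomes 3
        pos = decoded.find("\r\n\r\n") + 4
        text += decoded[pos:]
    return text

def body_chars_fixed(chunks):
    """Joins all chunks first, then removes the header exactly once."""
    raw = b"".join(chunks)
    _header, _sep, body = raw.partition(b"\r\n\r\n")
    return body.decode()

# Made-up chunks standing in for three successive sock.recv() results.
chunks = [b"HTTP/1.0 200 OK\r\n\r\nBut soft ", b"what light ", b"through yonder"]
print(body_chars_buggy(chunks))  # But soft t light ough yonder
print(body_chars_fixed(chunks))  # But soft what light through yonder
```

Joining the bytes before decoding also avoids trouble when a chunk boundary falls in the middle of a multi-byte character or in the middle of the "\r\n\r\n" separator.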

Related

Inconsistent sockets output

I've tried asking before and got a snarky response so I thought I'd try again, with a new problem that I've more recently run into. Basically, the same code is used for all 4 clients, and the same thing is being sent to each of them using a for loop. However, sometimes the output is different on certain clients and this changes every time I run the code. The client where the error(s) occur is also different. In the client script:
def receieve_message():
    while True:
        command = client.recv(2048).decode(FORMAT)
        if command == "VOTE":
            round_one_vote = input("Who would you like to remove: ")
            message = ("VOTE1 "+round_one_vote)
            send(message)
        else:
            print(command)
I am sending in either the word VOTE from the server or a string of text. If it is a string of text it should print it out to the client. If it is VOTE it will take an input. I am sending the data from the server like this:
for client in clients:
    client.send("VOTE".encode(FORMAT))
When I've just run it, 3 out of 4 clients begin the input check; however, the first client prints out VOTE, which should not be happening. When I run it a second time, the first two print out VOTE. There doesn't seem to be a pattern.
Additionally, there are sometimes random line breaks, also arbitrarily.
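One possible cause, assuming the server sends several messages in quick succession, is that TCP delivers a byte stream without message boundaries: a single recv may return "VOTE" glued to other text, or only part of a message, so the comparison command == "VOTE" fails. A minimal sketch of one way around this is to end every message with a newline and split on it when receiving. The function name iter_messages and the newline convention are assumptions for this example, not part of the original code.

```python
import socket

def iter_messages(sock, bufsize=2048, encoding="utf-8"):
    """Yield one newline-terminated message at a time from a TCP socket."""
    buffer = b""
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:  # peer closed the connection
            break
        buffer += chunk
        # A chunk may hold several messages, or only part of one.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line.decode(encoding)
```

With this approach the server would send "VOTE\n".encode(FORMAT), and the client loop would become for command in iter_messages(client): followed by the existing if/else.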

Separate outlook getproperty into variables like message id, in-reply and so on

I am working on some analytics for our email help line. I can see the headers and everything that is in them, but I need to separate each header component into its own field/variable. What is the best way to accomplish this?
Here is the code I currently have.
import win32com.client
import win32com
import pandas as pd
M_date = []
M_sender = []
M_sub = []
M_flag = []
M_cat = []
M_folder = []
outlook = win32com.client.Dispatch("outlook.application").GetNamespace("MAPI")
for i in range(0, 20):
    try:
        inbox = outlook.getdefaultfolder(6).folders[i]
        try:
            for message in inbox.items:
                try:
                    Folder = str(inbox) + " " + str(i)
                    Sender= message.sendername
                    Subject= message.subject
                    Dates= message.ReceivedTime
                    M_import = message.Importance
                    if message.FlagRequest == None :
                        Flag = ""
                    else:
                        Flag = message.FlagRequest
                    if message.Categories == None:
                        cat = ""
                    else:
                        cat = message.Categories
                    msg = message.PropertyAccessor.GetProperty("http://schemas.microsoft.com/mapi/proptag/0x007D001F")
                    print(msg) #debug header
                    M_folder.append(Folder)
                    M_date.append(Dates.strftime("%b %d %Y %H:%M:"))
                    M_sender.append(Sender)
                    M_sub.append(Subject)
                    M_flag.append(Flag)
                    M_cat.append(cat)
                except:
                    pass
        except:
            pass
    except:
        pass
df = pd.DataFrame({
    'In folder': M_folder,
    'Date': M_date,
    'Sender': M_sender,
    'Subject': M_sub,
    'flags': M_flag,
    'Categories': M_cat})
df.to_csv('email_data.csv', index=False)
Thanks
Transport headers is a string which contains properties and their values separated by ":". Basically you need to loop through all lines backwards. If the line starts with space or tab, append it to the previous line and delete the current line. Then loop through all lines and separate them into the header name (left of the first ":") and the header value (right of the first ":").
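The steps described above can be sketched in Python roughly as follows. This is only an illustration: the function name parse_transport_headers is made up, and it walks the lines forwards (joining each continuation line onto the previous header) instead of backwards, which gives the same result. It returns a list of pairs because some headers, such as Received, appear more than once.

```python
def parse_transport_headers(raw):
    """Return a list of (name, value) pairs from a raw header string."""
    unfolded = []
    for line in raw.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in " \t" and unfolded:
            # Continuation line: join it to the previous header with one space.
            unfolded[-1] += " " + line.strip()
        else:
            unfolded.append(line)

    headers = []
    for line in unfolded:
        # Split at the first ":" only; values may contain further colons.
        name, sep, value = line.partition(":")
        if sep:  # ignore lines without a colon
            headers.append((name.strip(), value.strip()))
    return headers
```

In the question's loop this could be called as parse_transport_headers(msg), and a single value such as the message ID looked up with dict(headers).get("Message-ID"). The standard library's email.parser.HeaderParser is an alternative that handles the same unfolding.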
I do not know Python so I cannot provide any code, but I can tell you about the format of the Transport Message Headers. (I must learn Python, my son-in-law swears by it.)
The Transport Message Headers contain an indefinite number of lines separated by carriage return linefeed. In VBA to access the individual lines, you would have something like:
Dim msgParts() As String
msgParts = Split(msg, vbCrLf)
If a line starts with one or more spaces and/or horizontal tabs, it is a continuation of the previous line. Replace all the spaces and tabs at the beginning of a continuation line with one space and append it to the previous line.
A line, together with any continuation lines, starts “Xxxx: ”. “Xxxx” will be “To” or “From” or any of the other specified identifiers or a private identifier.
The lines are specified in RFCs (Requests for Comments). I would start with RFC 5321 and follow the references to the related RFCs, such as RFC 5322, which defines the header format. Or perhaps I would not.
I have not looked at the RFCs for SMTP (Simple Mail Transfer Protocol) for many years. My recollection is that they were once much simpler. For example, my recollection is that the specification dealt with the continuations and then dealt with the combined line; this would have been standard practice when I was young. I was looking at the specification for email addresses which seemed overly complicated with lots of CRLFs that I did not remember as being allowed within a line. I finally realised that the specification for an email address allowed for a continuation line break between any two elements. In my humble opinion, this made for an unnecessarily complex specification. I would also expect the processing code to be slower since it would be attempting to solve two separate problems at the same time.
In the end, I gave up on the SMTP RFCs. Partly because of the continuation line issue but mainly because they now handle a lot of specialised situations that are quite outside the needs of the simple emails I send and receive. I decided it was easier to analyse the emails I had sent or received than attempt to simplify the specification down to my requirements.
My interest in looking at the Transport Message Headers was because I wanted to identify the other party of every email. For every email in my Outlook folders, I was either the sender or I was one of the recipients. If I was the sender, I wanted the first or only recipient. If I was a recipient, I wanted the sender. This proved difficult or impossible from the properties such as To and From because they usually contain display names. The display names for myself, were every possible variation of my name. If this issue is relevant to you, I am happy to share how I handled it.

Python's serial.readline is not receiving my entire line

I'm having a problem with a block of Python code reading in a string from an Arduino connected over USB. I understand that serial doesn't know what a string is or care. I'm using serial.readline, which from the documentation sounds like the perfect match, but my string isn't always complete. The weird problem is, the string doesn't always have the front of the string, but it always has the end of the string. I'm really lost on this and I'm sure it's just my lack of understanding about the nuances of reading serial data or how Python handles it.
In the code below, I loop through the serial interfaces until I find the one I'm looking for. I flush the input and give it a sleep for a couple seconds to make sure it has time to get a new read.
arduinoTemp = serial.Serial(iface, 9600, timeout=1)
arduinoTemp.flushInput()
arduinoTemp.flushOutput()
arduinoTemp.write("status\r\n".encode())
time.sleep(2)
read = arduinoTemp.readline().strip()
if read != "":
    #check the string to make sure it's what I'm expecting.
I'm sending the string in JSON.
I'm expecting something in line with this:
{"id": "env monitor","distance": {"forward": {"num":"0","unit": "inches"}},"humidity": {"num":"0.00","unit": "%"},"temp": {"num":"0.00","unit": "fahrenheit"},"heatIndex": {"num":"0.00","unit": "fahrenheit"}}
I might get something back like this:
": t": "%"},"temp": {"num":"69.80","unit": "fahrenheit"},"heatIndex": {"num":"68.13","unit": "fahrenheit"}}
or this:
atIndex": {"num":"0.00","unit": "fahrenheit"}}
At first I thought it was the length of the string that might be causing some issues, but the cut off isn't always consistent, and since it has the end of the string, it stands to reason that it should have gotten everything before that.
I've verified that my Arduino is broadcasting correctly by interfacing with it directly and the Arduino IDE and serial monitor. This is definitely an issue with my Python code.
In (serial) communications you should always expect to receive partial answers.
A usual solution in this case is to add whatever you read from the serial to a string/buffer until you can parse it successfully with json.loads.
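import serial
import json
ser = serial.Serial('/dev/ttyACM0', 9600)
buffer = ''
while True:
    # ser.read() blocks until one byte arrives and returns bytes,
    # so decode it before appending (the expected JSON is plain ASCII)
    buffer += ser.read().decode('ascii', errors='ignore')
    try:
        data = json.loads(buffer)
        print(data)
        buffer = ''
    except json.JSONDecodeError:
        # Not a complete JSON document yet; keep reading.
        # No sleep is needed because ser.read() already waits for data.
        continue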
(From this answer).
Note that if you flush, you will lose data!
Also note that this is a somewhat simplified solution. Ideally the buffer should be reset to whatever remains after the successful parse. json.loads does not offer that, but json.JSONDecoder.raw_decode returns the parsed object together with the index where parsing stopped, which makes it possible.
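The idea of keeping whatever remains after a successful parse can be sketched with json.JSONDecoder.raw_decode from the standard library. The helper name extract_json_objects is made up, and the sketch assumes every message is a JSON object or array (a bare number could be cut off mid-way and still parse).

```python
import json

_decoder = json.JSONDecoder()

def extract_json_objects(buffer):
    """Parse as many complete JSON values as possible from the front of buffer.

    Returns (objects, remainder); remainder is the unparsed tail, which
    should be kept and extended with the next bytes read.
    """
    objects = []
    while True:
        buffer = buffer.lstrip()  # raw_decode does not skip leading whitespace
        if not buffer:
            break
        try:
            obj, end = _decoder.raw_decode(buffer)
        except json.JSONDecodeError:
            break  # incomplete value; wait for more data
        objects.append(obj)
        buffer = buffer[end:]
    return objects, buffer
```

In the read loop this would be used as objects, buffer = extract_json_objects(buffer) after each append, so the start of the next message is never thrown away.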

Python detect the length of the data with socket

I found this code to detect the length of encrypted data in the frame :
header = self.request.recv(5)
if header == '':
    #print 'client disconnected'
    running = False
    break
(content_type, version, length) = struct.unpack('>BHH', header)
data = self.request.recv(length)
Source:
https://github.com/EiNSTeiN-/poodle/blob/master/samples/poodle-sample-1.py
https://gist.github.com/takeshixx/10107280
https://gist.github.com/ixs/10116537
This code listens to the connection between a client and a server. When the client talks to the server, self.request.recv(5) reads the 5-byte header of the frame, which contains the length of the data. Then we use that length to read the data.
If we print the exchange between the client and the server :
Client --> [proxy] -----> Server
length : 24 #why 24 ?
Client --> [proxy] -----> Server
length: 80 #length of the data
Client <-- [proxy] <----- Server
We can see that the client sends two packets to the server.
If I change
data = self.request.recv(length)
to
data = self.request.recv(4096)
Only one exchange is made.
Client --> [proxy] -----> Server
length: 109 #length of the data + the header
Client <-- [proxy] <----- Server
My question is: why do we only need to read 5 bytes to get the length and content_type information? Is there understandable documentation about this?
Why are there two requests: one with 24 and another with the length of our data?
"Why do we only need to read 5 bytes to get the length and content_type information?"
Because obviously that's the way the protocol was designed.
Binary streams only guarantee that when some bytes are put into one end of the stream, they arrive in the same order on the other end of the stream. For message transmission through binary streams the obvious problem is: where are the message boundaries? The classical solution to this problem is to add a prefix to messages, a so-called header. This header has a fixed size, known to both communication partners. That way, the recipient can safely read header, message, header, message (I guess you grasp the concept, it is an alternating fashion). As you see, the header does not contain message data -- it is just communication "overhead". This overhead should be kept small. The most efficient (space-wise) way to store such information is in binary form, using some kind of code that must, again, be known to both sides of the communication. Indeed, 5 bytes of information is quite a lot.
The '>BHH' format string indicates that this 5 byte header is built up like this:
unsigned char (1 Byte)
unsigned short (2 Bytes)
unsigned short (2 Bytes)
Plenty of room for storing information such as length and content type, don't you think? This header can encode 256 different content types, 65536 different versions, and a message length between 0 and 65535 bytes.
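As a rough illustration of the header layout described above, the sketch below unpacks a 5-byte '>BHH' header and then reads exactly as many payload bytes as the header announces. The names recv_exact and read_record are made up for this example. The loop in recv_exact is there because a single recv(n) call may return fewer than n bytes.

```python
import socket
import struct

HEADER_FORMAT = ">BHH"                        # big-endian: 1 + 2 + 2 bytes
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 5

def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv may return fewer."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("connection closed mid-message")
        data += chunk
    return data

def read_record(sock):
    """Read one header-plus-payload record and return its parts."""
    header = recv_exact(sock, HEADER_SIZE)
    content_type, version, length = struct.unpack(HEADER_FORMAT, header)
    payload = recv_exact(sock, length)
    return content_type, version, payload
```

Calling read_record repeatedly yields the alternating header, message, header, message pattern described above.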
"Why are there two requests: one with 24 and another with the length of our data?"
If your network forensics / traffic analysis does not correspond to what you have inferred from the code, one of the two analyses is wrong or incomplete. In this case, I guess that your traffic analysis is correct, but that you have not understood all the relevant code for this kind of communication. Note that I did not look at the source code you linked to.

How to receive HTTP response data using a socket?

As you know, sometimes we can't know the size of the data (if there is no Content-Length in the HTTP response header).
What is the best way to receive HTTP response data (using a socket)?
The following code can get all the data, but it blocks at buf = sock.recv(1024).
from socket import *
import sys
sock = socket(AF_INET, SOCK_STREAM)
sock.connect(('www.google.com', 80))
# The request must be sent as bytes in Python 3
index = b"GET / HTTP/1.1\r\nHOST:www.google.com\r\nConnection:keep-alive\r\n\r\n"
sock.send(index)
data = b""
while True:
    # Blocks here once the server has sent everything but keeps the connection open
    buf = sock.recv(1024)
    if not len(buf):
        break
    data += buf
I'm assuming you are writing the sender as well.
A classic approach is to prefix any data sent over the wire with the length of the data. On the receive side, you just greedily append all data received to a buffer, then iterate over the buffer each time new data is received.
So if I send 100 bytes of data, I would prefix an int 100 to the beginning of the packet, and then transmit. Then, the receiver knows exactly what it is looking for. If you want to get fancy, you can use a special end-of-message sequence like \x00\x01\x02 to indicate the proper end of the packet. This is an easily implemented form of error checking.
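The length-prefix approach could look roughly like the sketch below. The 4-byte big-endian prefix and the names frame and FrameBuffer are assumptions chosen for this example. The receiver appends whatever recv returns to a buffer and then pulls out every complete message.

```python
import struct

PREFIX = struct.Struct(">I")  # 4-byte big-endian unsigned length (an assumption)

def frame(payload):
    """Sender side: prepend the payload length to the payload."""
    return PREFIX.pack(len(payload)) + payload

class FrameBuffer:
    """Receiver side: accumulates bytes and extracts complete messages."""

    def __init__(self):
        self._buffer = b""

    def feed(self, data):
        """Append received bytes and return all messages that are now complete."""
        self._buffer += data
        messages = []
        while len(self._buffer) >= PREFIX.size:
            (length,) = PREFIX.unpack_from(self._buffer)
            end = PREFIX.size + length
            if len(self._buffer) < end:
                break  # message not fully received yet
            messages.append(self._buffer[PREFIX.size:end])
            self._buffer = self._buffer[end:]
        return messages
```

On the sending side this would be sock.sendall(frame(b"...")), and on the receiving side each sock.recv(1024) result is passed to feed(). This only helps when you control both ends of the connection, which is not the case for a plain HTTP server.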
Use a bigger size first, do a couple of tests, then see what the length of those buffers is; you will then have an idea of what the maximum size would be. Then just use that number + 100 or so to be sure.
Testing different scenarios will be your best bet on finding your ideal buf size.
It would also help to know what protocol you are using the sockets for, then we would have a better idea and response for you.
Today I got the same question again.
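And I found that the simple way is to use httplib (http.client in Python 3).
from http.client import HTTPResponse
r = HTTPResponse(sock)
r.begin()  # reads the status line and the headers
# now you can use HTTPResponse methods to get what you want.
print(r.read())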
