python retrieving web data

I am new at Python and I have been trying to figure out the following exercise.
Exercise 5: (Advanced) Change the socket program so that it only shows data after the headers and a blank line have been received. Remember that recv is receiving characters (newlines and all), not lines.
Below is the code I came up with; unfortunately, I don't think it is working:
import socket
mysocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysocket.connect(('data.pr4e.org', 80))
mysocket.send('GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode())
count = 0
while True:
    data = mysocket.recv(200)
    if (len(data) < 1): break
    count = count + len(data.decode().strip())
    print(len(data), count)
    if count >= 399:
        print(data.decode(), end="")
mysocket.close()

Instead of counting the number of lines received, just grab all the data you get and then split on the first double CRLF you find.
resp = []
while True:
    data = mysocket.recv(200)
    if not data: break
    resp.append(data.decode())
mysocket.close()
resp = "".join(resp)
body = resp.partition('\r\n\r\n')[2]
print(body)
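The same split can be tried without a network connection. The following is only a sketch: the helper name split_http_response and the sample response are made up for illustration, and it works on the raw bytes rather than on decoded text.

```python
def split_http_response(raw: bytes) -> tuple[bytes, bytes]:
    """Return (headers, body), split on the first blank line (CRLF CRLF)."""
    headers, separator, body = raw.partition(b"\r\n\r\n")
    if not separator:
        # No blank line was found: treat everything as headers.
        return raw, b""
    return headers, body


# Hypothetical response, used only to exercise the helper.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"But soft what light through yonder window breaks\n"
)
headers, body = split_http_response(sample)
print(body.decode(), end="")
```

Splitting the bytes first and decoding afterwards also avoids a decoding error when a multi-byte character happens to straddle two recv() calls.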

Related

Using a custom socket recvall function works only, if thread is put to sleep

I have the following socket listening on my local network:
import socket
import time
import open3d as o3d

def recvall(sock):
    BUFF_SIZE = 4096 # 4 KiB
    fragments = []
    while True:
        chunk = sock.recv(BUFF_SIZE)
        fragments.append(chunk)
        # if the following line is removed, data is omitted
        time.sleep(0.005)
        if len(chunk) < BUFF_SIZE:
            break
    data = b''.join(fragments)
    return data

def main():
    pcd = o3d.geometry.PointCloud()
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(('192.168.0.22', 2525))
    print("starting listening...")
    s.listen(1)
    counter = 0
    while True:
        clientsocket, address = s.accept()
        print(f"Connection from {address} has been established!")
        received_data = recvall(clientsocket)
        clientsocket.send(bytes(f"response nr {counter}!", "utf-8"))
        counter += 1
        print(len(received_data))

if __name__ == "__main__":
    main()
To this port, I'm sending 172800 bytes of data from an app on my mobile phone.
As one can see, I'm printing the amount of data received. The amount is only correct if I use time.sleep() as shown in the code above. Without it, only part of the data is received.
Obviously this is a timing issue. The question is: how can I be sure to receive all the data every time without using time.sleep() (which is not 100% certain to work either, depending on the sleep time set)?

sock.recv() returns the data that is available. The relevant piece from the man page of recv(2) is:
The receive calls normally return any data available, up to the requested amount,
rather than waiting for receipt of the full amount requested.
In your case, time.sleep(0.005) seems to allow for all the remaining data of the message to arrive and be stored in the buffer.
There are some options to eliminate the need for time.sleep(0.005). Which one is the most appropriate depends on your needs.
If the sender sends data but does not expect a response, you can have the sender close the socket after it sends the data, i.e., call sock.close() after sock.sendall(). recv() will then return an empty bytes object (b''), which the receiver can use to break out of the while loop.
def recvall(sock):
    BUFF_SIZE = 4096
    fragments = []
    while True:
        chunk = sock.recv(BUFF_SIZE)
        if not chunk:
            break
        fragments.append(chunk)
    return b''.join(fragments)
If the sender sends messages of fixed length, e.g., 172800 bytes, you can use recv() in a loop until the receiver receives an entire message.
def recvall(sock, length=172800):
    fragments = []
    while length:
        chunk = sock.recv(length)
        if not chunk:
            raise EOFError('socket closed')
        length -= len(chunk)
        fragments.append(chunk)
    return b''.join(fragments)
Other options include (a) adding a delimiter, e.g., a special character that cannot be part of the data, at the end of each message the sender sends, so that the receiver can run recv() in a loop until it detects the delimiter; and (b) prefixing each message on the sender with its length, so that the receiver knows how many bytes to expect for each message.
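The length-prefix option could look roughly like the sketch below. The 4-byte big-endian header and the helper names send_msg, recv_exact and recv_msg are assumptions made for this example; the sending app would have to use the same framing.

```python
import socket
import struct


def send_msg(sock, payload: bytes) -> None:
    # Prefix the payload with its length as a 4-byte big-endian unsigned int.
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, length: int) -> bytes:
    # Keep calling recv() until exactly `length` bytes have arrived.
    fragments = []
    while length:
        chunk = sock.recv(length)
        if not chunk:
            raise EOFError("socket closed before the full message arrived")
        fragments.append(chunk)
        length -= len(chunk)
    return b"".join(fragments)


def recv_msg(sock) -> bytes:
    # Read the 4-byte header first, then the announced number of bytes.
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


# Local demonstration over a connected socket pair (no network needed).
a, b = socket.socketpair()
send_msg(a, b"hello")
send_msg(a, b"x" * 1000)
first = recv_msg(b)
second = recv_msg(b)
a.close()
b.close()
print(len(first), len(second))
```

Because the receiver always knows how many bytes belong to the current message, two messages sent back to back are not merged, and no sleep is needed.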

Why is socket.sendall() not working?

With my program, I am attempting to connect to an IP address using socket.socket(), capture a bit of Morse code once connected, decode it, and then push the answer back through the socket with socket.sendall(). I can connect to the IP address, decode the message, and even send back my answer, but when I send it back the server says it's wrong, even though I know for a fact it isn't. I'm wondering whether I'm sending an additional set of quotation marks around the answer or something similar. Any help would be appreciated.
import socket

def morse(code):
    decoded = []
    CODE = [['.-', 'A'],['-...', 'B'],['-.-.', 'C'],['-..', 'D'],['.', 'E'],['..-.', 'F'],['--.', 'G'],['....', 'H'],['..', 'I'],['.---', 'J'],['-.-', 'K'],['.-..', 'L'],['--', 'M'],['-.', 'N'],['---', 'O'],['.--.', 'P'],['--.-', 'Q'],['.-.', 'R'],['...', 'S'],['-', 'T'],['..-', 'U'],['...-', 'V'],['.--', 'W'],['-..-', 'X'],['-.--', 'Y'],['--..', 'Z']]
    for i in CODE:
        if i[0] == code:
            decoded.append(i[1].lower())
    if code == '':
        decoded.append('.')
    return decoded

def netcat(hostname, port, content):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((hostname, port))
    while 1:
        data = s.recv(1024)
        if data == "":
            break
        if "text:" in repr(data):
            s.sendall(content)
        print("Received:", repr(data))
        if "-" in repr(data):
            splitMorse = repr(data).split(' ')
            splitMorse = splitMorse[8:len(splitMorse)-2]
            decoded = []
            for i in splitMorse:
                decoded.extend(morse(i))
            strDecoded = ''.join(decoded)
            strDecoded = strDecoded.replace("....................................................", " ")
            print("{}\n".format(strDecoded))
            #HERE IS WHERE I AM SENDING THE STRING BACK
            print(s.sendall("{}\n".format(strDecoded)))
    print("Connection closed.")
    s.shutdown(socket.SHUT_WR)
    s.close()

content = "GET\n"
netcat('146.148.102.236', 24069, content)
At the end of the string I send through the socket, I added a "\n" because otherwise the server won't accept my string and will sit there forever (you have to press Enter after typing). Here is my output:
('Received:', "'------------------------------------------\\nWelcome to
The Neverending Crypto!\\nQuick, find Falkor and get through this!\\nThis
is level 1, the Bookstore\\nRound 1. Give me some text:'")
None
('Received:', "'GET encrypted is --. . - \\nWhat is ..-. .-. .- --. -- .
-. - .- - .. --- -. decrypted?\\n:'")
fragmentation
None
('Received:', "'No... I am leaving.\\n'")
Connection closed.

I think your logic is flawed. The first message contains text: and it also contains -. I think you want elif for your final if.
For your sequence of if statements in netcat(), try this:
if data == "":
    break
print("Received:", repr(data))
if "text:" in repr(data):
    ...
elif "-" in repr(data):
    ...
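To see why elif matters here, the branching can be tried on its own, without the server. This is a sketch under Python 3 (where recv() returns bytes, so the data is decoded first); the function name classify and the shortened sample messages are made up for illustration.

```python
def classify(data: bytes) -> str:
    """Decide how to react to one message; at most one branch runs."""
    text = data.decode()
    if not text:
        return "closed"
    if "text:" in text:
        return "send content"
    elif "-" in text:
        return "decode morse"
    return "ignore"


# Shortened versions of the two messages shown in the question's output.
banner = b"------\nWelcome to The Neverending Crypto!\nRound 1. Give me some text:"
challenge = b"GET encrypted is --. . - \nWhat is ..-. .-. .- decrypted?\n:"
print(classify(banner))
print(classify(challenge))
```

The banner contains both "text:" and "-", but with elif only the first matching branch runs, so the decoded-Morse reply is no longer sent in response to the banner.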

"IndexError: string index out of range" when loop is already ended! - Python 3

I've recently been studying sockets, trying to make them work inside a Python script (Python 3) on Windows.
Here is the server-side Python script.
import socket
import time
MSGLEN = 2048
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('localhost', 8000))
server.listen(1)
while 1:
    #accept connections from outside
    (clientsocket, address) = server.accept()
    chunks = []
    bytes_recd = 0
    while bytes_recd < MSGLEN:
        chunk = clientsocket.recv(min(MSGLEN - bytes_recd, 2048)) #should enough this row without checks if transmission guaranteed inside buffer dimension
        #print(chunk)
        #i=0
        chunk = chunk.decode()
        bytes_recd = bytes_recd + len(chunk)
        chunks.append(chunk)
        for i in range(bytes_recd):
            if(chunk[i] == "_"):
                print("Breaking(_Bad?)")
                break
        buff_str = chunk[:i]
        print(buff_str)
        if chunk == '':
            print("Server notification: connection broken")
            break
    mex = ''.join(chunks)
    print("Server notification: \n\tIncoming data: " + mex)
    i=1;
    while i==1:
        chunk = clientsocket.recv(128)
        chunk = chunk.decode()
        if chunk == '':
            i = 0
    totalsent = 0
    msg = "Server notification: data received"
    while totalsent < MSGLEN:
        sent = clientsocket.send(bytes(msg[totalsent:], 'UTF-8'))
        if sent == 0 :
            print ("Server notification: end transmitting")
            break
        totalsent = totalsent + sent
I'm checking for when a "_" is received and making a decision based on it, because I'm using blocking sockets. Ignore the very last part and the overall program functionality, since I'm still working on it; the offending part is here:
for i in range(bytes_recd):
    if(chunk[i] == "_"):
        print("Breaking(_Bad?)")
        break
buff_str = chunk[:i]
Something weird happens: the check works fine and breaks out of the loop, printing the text up to the right index value. But then this apparently nonsensical error appears:
>>>
Breaking(_Bad?), i: 2
13
Traceback (most recent call last):
File "C:\Users\TheXeno\Dropbox\Firmwares\Altri\server.py", line 24, in <module>
if(chunk[i] == "_"):
IndexError: string index out of range
As you can see from the console output, it finds the number before the "_" (in this case the string "13", with the "_" located at i = 2), which matches the format of the string received from the socket: "charNumber_String". But it seems to keep counting until it runs out of bounds.
EDIT: I will not rename the variables, but next time, better use improved names, and not "chunk" and "chunks".

Let's look at this block of code:
while bytes_recd < MSGLEN:
    chunk = clientsocket.recv(min(MSGLEN - bytes_recd, 2048))
    chunk = chunk.decode()
    bytes_recd = bytes_recd + len(chunk)
    chunks.append(chunk)
    for i in range(bytes_recd):
        if(chunk[i] == "_"):
            print("Breaking(_Bad?)")
            break
Let's say you read 100 bytes, and let's assume that the decoded chunk is the same length as the encoded chunk. bytes_recd will be 100, your for loop goes from 0 to 99, and all is well.
Now you read another 100 bytes. chunk is again 100 bytes long, and chunks (with an "s") is 200 bytes. bytes_recd is now 200. Your for loop now goes from 0 to 199, and you're checking chunk[i]. chunk is only 100 bytes long, so when i gets past 99, you get the error.
Maybe you meant to compare chunks[i] (with an "s")?

Try:
for i, char in enumerate(chunk):
    if char == "_":
        print("Breaking(_Bad?)")
        break
This way the index never goes out of bounds, because the loop only visits characters that actually exist in chunk. So one error less :)
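Another way to avoid manual indexing altogether is to search the text received so far with str.find(). The sketch below assumes the "charNumber_String" format described in the question; the helper name split_at_underscore and the sample chunks are made up for illustration.

```python
def split_at_underscore(message: str):
    """Return (prefix, rest), or None if no "_" has arrived yet."""
    index = message.find("_")
    if index == -1:
        return None
    return message[:index], message[index + 1:]


# Hypothetical chunks, as recv() might deliver them in pieces.
chunks = ["1", "3_he", "llo"]
message = "".join(chunks)
print(split_at_underscore(message))
```

Searching the joined text rather than the latest chunk also handles the case where the "_" arrives in a later chunk than the digits before it.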

Reconstruct HTTP Webpage from libpcap python script

I am trying to reconstruct a webpage from a libpcap file with a Python script. I have all the packets, so the goal is to take a libpcap file as input, find all the necessary packets, and produce a webpage file as output with all pictures and data from that page. Can anyone get me started in the right direction? I think I will need dpkt and/or Scapy.
Update 1: Code is below! Here is the code I have come up with so far in Python. It is supposed to grab the first set of packets from a single HTTP session, beginning with a packet that has the SYN and ACK flags set to 1 and ending with a packet that has the FIN flag set to 1.
Assuming there is only one website visited during the packet capture does this code append all the necessary packets needed to reconstruct the visited webpage?
Assuming I have all the necessary packets how do I reconstruct the webpage?
from scapy.all import * #provides rdpcap and the TCP layer
pktList = list() #create a list to store the packets we want to keep
pcap = rdpcap('myCapture.pcap') #returns a packet list with every packet in the pcap
count = 0 #will store the index of the syn-ack packet in pcap
for pkt in pcap: #loops through packet list named pcap one packet at a time
    count = count + 1 #increments by 1
    if pkt[TCP].flags == 0x12 and pkt[TCP].sport == 80: #if it is a SYN-ACK packet session has been initiated as http
        break #breaks out of the for loop
currentPkt = count #loop from here
while pcap[currentPkt].flags&0x01 != 0x01: #while the FIN bit is set to 0 keep looping, stop when it is a 1
    if pcap[currentPkt].sport == 80 and pcap[currentPkt].dport == pcap[count].dport and pcap[currentPkt].src == pcap[count].src and pcap[currentPkt].dst == pcap[count].dst:
        #if the src, dst ports and IP's are the same as the SYN-ACK packet then the http packets belong to this session and we want to keep them
        pktList.append(pcap[currentPkt])
    #once the loop exits we have hit the packet with the FIN flag set and now we need to reconstruct the packets from this list.
    currentPkt = currentPkt + 1
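The flag tests in this code are bit masks on the TCP flags byte. As a small standalone sketch (plain Python, no Scapy needed; the names TCP_FLAGS and flag_names are made up for this example), the standard bit values can be decoded like this:

```python
# Standard TCP flag bits, from least to most significant.
TCP_FLAGS = {
    0x01: "FIN",
    0x02: "SYN",
    0x04: "RST",
    0x08: "PSH",
    0x10: "ACK",
    0x20: "URG",
}


def flag_names(flags: int) -> list[str]:
    """List the names of all flag bits that are set in `flags`."""
    return [name for bit, name in TCP_FLAGS.items() if flags & bit]


print(flag_names(0x12))  # the SYN-ACK value tested in the for loop
print(flag_names(0x11))  # a FIN combined with ACK
```

A value such as 0x11 shows why the FIN test uses `flags & 0x01` rather than a comparison with 0x01: FIN is usually set together with other bits such as ACK.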

Perhaps something like tcpick -r your.pcap -wRS does the job for you.
http://tcpick.sourceforge.net/?t=1&p=OPTIONS

This Python script will extract all unencrypted HTTP webpages in a PCAP file and output them as HTML files. It uses Scapy to work with the individual packets (another good Python module is dpkt).
from scapy.all import *
from operator import *
import sys

def sorting(pcap):
    newerList = list()
    #remove everything not HTTP (anything not TCP or anything TCP and not HTTP (port 80))
    #count = 0 #dont need this it was for testing
    for x in pcap:
        if x.haslayer(TCP) and x.sport == 80 and bin(x[TCP].flags)!="0b10100":
            newerList.append(x);
    newerList = sorted(newerList, key=itemgetter("IP.src","TCP.dport"))
    wrpcap("sorted.pcap", newerList)
    return newerList

def extract(pcap,num, count):
    listCounter = count
    counter = 0
    #print listCounter
    #Exit if we have reached the end of the list of packets
    if count >= len(pcap):
        sys.exit()
    #Create a new file and find the packet with the payload containing the beginning HTML code and write it to file
    while listCounter != len(pcap):
        thisFile = "file" + str(num) + ".html"
        file = open(thisFile,"a")
        s = str(pcap[listCounter][TCP].payload)
        #print "S is: ", s
        x,y,z = s.partition("<")
        s = x + y + z #before was y+z
        if s.find("<html") != -1:
            file.write(s)
            listCounter = listCounter + 1
            break
        listCounter = listCounter + 1
    #Continue to loop through packets and write their contents until we find the close HTML tag and
    #include that packet as well
    counter = listCounter
    while counter != len(pcap):
        s = str(pcap[counter][TCP].payload)
        if s.find("</html>") != -1:
            file.write(s)
            file.close()
            break
        else:
            file.write(s)
        counter = counter + 1
    #Recursively call the function incrementing the file name by 1
    #and giving it the last spot in the PCAP we were in so we continue
    #at the next PCAP
    extract(pcap, num+1, counter)

if __name__ == "__main__":
    #Read in file from user
    f = raw_input("Please enter the name of your pcap file in this directory. Example: myFile.pcap")
    pcapFile = rdpcap(f)
    print "Filtering Pcap File of non HTTP Packets and then sorting packets"
    #Sort and Filter the PCAP
    pcapFile = sorting(pcapFile)
    print "Sorting Complete"
    print "Extracting Data"
    #Extract the Data
    extract(pcapFile,1,0)
    print "Extracting Complete"

Python checking for serial string in If statement

As a newbie to Python, I'm trying to use it to read a file and write each line of the file to the RS-232 port. My code below seems to work for the most part, except for my listen-and-react segments. From poking around, it seems that my if statements can't tell whether I've received a "Start\r" or "End\r" string from my device (RS-232). Can anyone provide feedback on what is missing?
import serial
import time
port = "/dev/ttyS0"
speed = 9600
print("\n\n\n\nScript Starting\n\n\n")
ser = serial.Serial(port, speed, timeout=0)
ser.flushInput() #flush input buffer, discarding all its contents
ser.flushOutput()#flush output buffer, aborting current output and discard all that is in buffer
text_file = open("my.file", "r")
lines = text_file.read().split('\n')
i = 0
counter = 0
while i<len(lines):
    response = ser.readline()
    if (counter == 0):
        print("\n\nProbing With Off Data\n")
        ser.write('FFF')
        ser.write('\r')
        counter+=1
    if (response == 'Start'):
        ser.write('FFF')
        ser.write('\r')
    if (response == 'End'):
        print("\nString Transmitted:")
        print lines
        make_list_a_string = ''.join(map(str, lines))
        ser.write(make_list_a_string)
        ser.write('\r')
        print("\n")
        i+=1
text_file.close()
exit(0)

Try using strip() to get rid of any trailing or preceding '\r's:
if (response.strip() == 'Start'):
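The effect of strip() can be checked without any hardware. This is a sketch that assumes Python 3, where pySerial's readline() returns bytes rather than str; the helper name matches and the sample lines are made up for illustration.

```python
def matches(response: bytes, keyword: str) -> bool:
    """Compare a raw serial line with a keyword, ignoring surrounding whitespace."""
    return response.strip() == keyword.encode("ascii")


# Hypothetical lines, as they might come back from the device.
print(matches(b"Start\r", "Start"))
print(matches(b"End\r\n", "End"))
print(matches(b"Started\r", "Start"))
```

Without the strip() call, b"Start\r" would never compare equal to the keyword, which matches the behaviour described in the question.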
