Python: read website data line by line when available

I am using urllib2 to read data from a URL; below is the code snippet:
data = urllib2.urlopen(urllink)
for lines in data.readlines():
    print lines
The URL I am opening is actually a CGI script that does some processing and prints data as it goes. The CGI script takes around 30 minutes to complete, so with the above code I see the output only after 30 minutes, when the CGI script has finished.
How can I read the data from the URL as soon as it is available and print it?

Just loop directly over the file object:
for line in data:
    print line
This reads the incoming data stream line by line (internally, the socket file object calls .readline() every time you iterate). This does assume that your server is sending data as soon as possible.
Calling .readlines() (plural) guarantees that you read the whole response before you start looping; don't do that.
Alternatively, use the requests library, which has more explicit support for request streaming:
import requests
r = requests.get(url, stream=True)
for line in r.iter_lines():
    if line:
        print line
Note that this will only work if your server starts streaming data immediately. If your CGI script doesn't produce data until the process is complete, there is no point in trying to stream it.

Related

Python Serial Readline() vs Readlines()

I need to send some AT commands and read the response, which arrives in multiple lines. For that I am using the commands below:
serialPort.write(b"AT+CMD\r\n")
time.sleep(1)
response = serialPort.readlines()
If I use only readline() I don't get the full expected response, but if I use readlines() I do get the full data, although some lines are sometimes skipped. I need to know the difference between these two methods, and also how the timeout flag affects their behaviour.
readline(): reads one line; if you call it multiple times, each call returns the next line until it reaches end of file or the file is closed.
readlines(): returns a list of lines from a file.
If you want the contents of a whole file, you can use read() instead.
About the timeout flag: there is no timeout flag for ordinary file I/O in Python, but the one in pySerial specifies the maximum time to wait for serial data.
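For illustration, here is a minimal sketch of how the timeout interacts with both methods (the port name, baud rate, and AT command are placeholders):
import serial  # pySerial

# Placeholder port and baud rate; timeout is in seconds.
ser = serial.Serial('/dev/ttyUSB0', baudrate=115200, timeout=2)
ser.write(b"AT+CMD\r\n")

# readline() blocks until a '\n'-terminated line arrives, or until the
# timeout expires, in which case it returns whatever bytes it has so far.
first_line = ser.readline()

# readlines() keeps calling readline() until a read times out, so it only
# returns because of the timeout; with timeout=None it would block forever.
remaining_lines = ser.readlines()
print(first_line, remaining_lines)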

Python - read huge online csv through proxy

I have a huge CSV online and I want to read it line by line without downloading it. But this file is behind a proxy.
I wrote this code:
import io
import requests
import pandas as pd
from requests_ntlm import HttpNtlmAuth  # needed for HttpNtlmAuth below

cafile = 'mycert.crt'
proxies = {"http": "http://ipproxy:port", "https": "http://ipproxy:port"}
auth = HttpNtlmAuth('Username', 'Password')
url = 'http://myurl/ressources.csv'
content = requests.get(url, proxies=proxies, auth=auth, verify=cafile).content
csv_read = pd.read_csv(io.StringIO(content.decode('utf-8')))
pattern = 'mypattern'
for row in csv_read:
    if row[0] == pattern:
        print(row)
        break
The code above works, but the line content = requests.get(...) takes a very long time because of the size of the CSV file.
So my question is: is it possible to read an online CSV line by line through a proxy?
Ideally, I would read the first row, check whether it matches my pattern, break if it does, and otherwise read the next line, and so on.
Thanks for your help.
You can pass stream=True to requests.get to avoid fetching the entire result immediately. In that case you can access a pseudo-file object through response.raw and build your CSV reader on top of that (alternatively, the response object has iter_content and iter_lines methods, but I don't know how easy it is to feed those to a CSV parser).
However, while the stdlib's csv module simply yields a sequence of lists or dicts and can therefore easily be lazy, pandas returns a DataFrame, which is not lazy: you need to pass read_csv's chunksize parameter, and you then get one DataFrame per chunk.
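A minimal sketch of the stdlib approach, assuming a semicolon-delimited UTF-8 file (the URL and pattern are placeholders from the question; the proxy and auth arguments are omitted for brevity):
import csv
import io
import requests

url = 'http://myurl/ressources.csv'
pattern = 'mypattern'

with requests.get(url, stream=True) as r:
    r.raw.decode_content = True  # let urllib3 undo any gzip/deflate encoding
    reader = csv.reader(io.TextIOWrapper(r.raw, encoding='utf-8', newline=''),
                        delimiter=';')
    for row in reader:
        if row and row[0] == pattern:
            print(row)
            break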
The requests.get call will get you the whole file anyway; you'd need to implement your own HTTP code, down to the socket level, to be able to process the content as it comes in with a plain HTTP GET.
The only way of getting partial results and slicing the download is to add HTTP Range request headers, if the server providing the file supports them (requests lets you set these headers).
Enter requests' advanced usage: the good news is that requests can do this for you under the hood. You can set the stream=True parameter when calling requests, and it will even let you iterate over the contents line by line; check the documentation on that part.
Here is more or less what requests does under the hood so that you can get your contents line by line: it fetches reasonably sized chunks of data, but certainly does not request one line at a time (think ~80 bytes versus 100,000 bytes), because otherwise it would need a new HTTP request for each line, and the overhead of each request is not trivial, even when made over the same TCP connection.
Anyway, CSV being a text format, neither requests nor any other software could know the size of the lines, much less the exact size of the "next" line to be read, before setting the Range headers accordingly.
So, for this to work, there has to be Python code to (sketched below):
- accept a request for a "new line" of the CSV: if there are buffered text lines, yield the next line;
- otherwise make an HTTP request for the next 100KB or so;
- concatenate the downloaded data to the remainder of the last downloaded line;
- split the downloaded data at the last line feed in the binary data, saving the remainder of the last line;
- convert the binary buffer to text (you'd have to take care of multi-byte character boundaries in a multi-byte encoding like UTF-8, but cutting at newlines may save you that);
- yield the next text line.
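A rough sketch of that buffering scheme, assuming the server honors Range requests (the function name and chunk size are illustrative):
import requests

def iter_remote_lines(url, chunk_size=100000, encoding='utf-8'):
    offset = 0
    remainder = b''
    while True:
        headers = {'Range': 'bytes=%d-%d' % (offset, offset + chunk_size - 1)}
        resp = requests.get(url, headers=headers)
        if resp.status_code not in (200, 206) or not resp.content:
            break  # e.g. 416 Range Not Satisfiable: nothing left to fetch
        offset += len(resp.content)
        # Split the buffered bytes at the last line feed; the trailing
        # partial line (possibly mid-character in UTF-8) stays in remainder.
        head, newline, remainder = (remainder + resp.content).rpartition(b'\n')
        if newline:
            for line in head.split(b'\n'):
                yield line.decode(encoding)
        if resp.status_code == 200:
            break  # server ignored Range and sent the whole file at once
    if remainder:
        yield remainder.decode(encoding)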
According to Masklinn's answer, my code now looks like this:
import requests
from requests_ntlm import HttpNtlmAuth  # needed for HttpNtlmAuth below

cafile = 'mycert.crt'
proxies = {"http": "http://ipproxy:port", "https": "http://ipproxy:port"}
auth = HttpNtlmAuth('Username', 'Password')
url = 'http://myurl/ressources.csv'
pattern = 'mypattern'
r = requests.get(url, stream=True, proxies=proxies, auth=auth, verify=cafile)
if r.encoding is None:
    r.encoding = 'ISO-8859-1'
for line in r.iter_lines(decode_unicode=True):
    if line.split(';')[0] == pattern:
        print(line)
        break

Read lines of file over HTTP on demand

What I need to do is read a file over HTTP in chunks (iterating over lines, to be specific). I don't want to read the entire file (or a large part of it) and then split it into lines; rather, I want to read a small (<=8kB) chunk and split that into lines. When all the lines in a chunk are consumed, the next chunk should be received.
I have tried the following:
with urllib.request.urlopen(url) as f:
    yield from f
Which didn't work: in Wireshark I see that about 140kB of the total ~220kB are received just by calling urlopen(url).
The next thing I tried was to use requests:
with requests.get(url, stream=True) as req:
    yield from req.iter_lines()
Which also reads about 140kB just by calling get(url, stream=True). According to the documentation this should not happen. Other than that, I did not find any information about this behavior or how to control it. I'm using Requests 2.21.0, CPython 3.7.3, on Windows 10.
According to the documentation (and given that the source is actually read in chunks), I think you should use iter_content, which accepts a chunk_size parameter that you can set to None:
with requests.get(url, stream=True) as req:
    yield from req.iter_content(chunk_size=None)
I haven't tried it, but it seems that somewhere in your code something accesses req.content before iter_lines, therefore loading the entire payload.
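If you want lines rather than raw chunks, a small wrapper around iter_content can do the splitting itself; a sketch, with url and chunk_size as placeholders:
import requests

def iter_lines_chunked(url, chunk_size=8192):
    pending = b''
    with requests.get(url, stream=True) as req:
        for chunk in req.iter_content(chunk_size=chunk_size):
            pending += chunk
            complete_lines = pending.split(b'\n')
            pending = complete_lines.pop()  # last piece may be a partial line
            yield from complete_lines
    if pending:
        yield pending  # the file's final, unterminated line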

Read Big text online using python

I have to read a 50 GB text file and do some processing with it. I cannot download the file, as I am doing the processing on a remote server. Is it possible to use Python to stream the content of the file from its URL and read it line by line?
Actually, the simplest way is:
import urllib2  # the lib that handles the url stuff

data = urllib2.urlopen(target_url)  # it's a file-like object and works just like a file
for line in data:  # files are iterable
    print line
You could even shorten it to
import urllib2

for line in urllib2.urlopen(target_url):
    print line
But remember in Python, readability matters.
However, while this is the simplest way, it is not the safe way, because most of the time with network programming you don't know whether the amount of data to expect will be respected. So you'd generally better read a fixed and reasonable amount of data, something you know to be enough for the data you expect but that will prevent your script from being flooded:
import urllib2

data = urllib2.urlopen(target_url).read(20000)  # read only 20,000 chars
data = data.split("\n")  # then split it into lines
for line in data:
    print line
In Python 3 and up, use urllib.request instead of urllib2.
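For reference, a Python 3 sketch of the same idiom (the URL is a placeholder; urlopen yields bytes in Python 3, so each line is decoded):
from urllib.request import urlopen

target_url = "http://www.myhost.com/SomeFile.txt"  # placeholder

with urlopen(target_url) as data:
    for line in data:  # the response object is iterable, like a file
        print(line.decode('utf-8'), end='')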
You can do it with urllib2: urlopen works like a file, and files are iterators yielding one line at a time until there are no more lines to yield.
import urllib2

for line in urllib2.urlopen("http://www.myhost.com/SomeFile.txt"):
    print line

How to read large file (socket programming and python)?

I'm a beginner in socket programming and Python. I would like to learn how to send a large text file (e.g., > 5 MB) from the server to the client. I keep getting an error that says:
Traceback (most recent call last):
  File "fserver.py", line 50, in <module>
    reply = f.read()
ValueError: Mixing iteration and read methods would lose data
Below is part of my code. Can someone take a look and give me some hints on how to resolve this issue? Thank you for your time.
myserver.py
# validate filename
if os.path.exists(filename):
    with open(filename) as f:
        for line in f:
            reply = f.read()
            client.send(reply)
    #f = open(filename, 'r')
    #reply = f.read()
    #client.send(piece)
else:
    reply = 'File not found'
    client.send(reply)
myclient.py
while True:
    print 'Enter a command: list or get <filename>'
    command = raw_input()
    if command.strip() == 'quit':
        break
    client_socket.send(command)
    data = client_socket.recv(socksize)
    print data
The problem here has nothing to do with sockets, or with how big the file is. When you do this:
for line in f:
    reply = f.read()
The for line in f is trying to read one line of the file at a time, and then for each line you're trying to read the entire file. That won't work.
If you didn't get this error (which you won't in many cases), the first time through the loop you would read and ignore the first line, and then read and send everything but the first line (or, possibly, everything but the first, say, 4KB) as one giant reply, and then the loop would be done.
What you want is either one or the other:
for line in f:
    reply = line
… or …
# no for loop
reply = f.read()
Meanwhile, on your client side, you're only doing one recv. That's going to get the first 4K (or whatever socksize is) or less, and then you never receive anything else.
What you need is a loop. Like this:
while True:
    data = client_socket.recv(socksize)
    print data
But now you have a new problem. Once the file is done, the client will sit there waiting forever for the next chunk of data, which will never come. So the client needs to know when it's done. And the only way it can know that is if the server puts that information into the data stream.
One way to do this is to send the length before the file. One standardized way to do this is to use the netstring protocol. You can find libraries that do this for you, but it's simple enough to do by hand. Or maybe do something more like HTTP, where the headers are just separated by newlines, and separated from the body by a blank line; then you can use socket.makefile as your protocol implementation. Or even a binary protocol, where you just send the length as four bytes.
There's another problem we might as well fix while we're here: send(reply) doesn't necessarily send the whole reply; it sends anywhere from 1 byte to the whole thing, and returns a number telling you what got sent. The simple fix to that is to use sendall(reply), which guarantees to send all of it.
And finally: Your server is expecting that each recv will get exactly one command, as sent by send. But sockets don't work that way. Sockets are byte streams, not message streams; there's nothing preventing recv from getting, say, just half a command, and then your server will break. So, you need some kind of protocol in that direction as well. Again, you could use netstring, or newline-separated messages, or a binary length prefix, but you have to do something.
(The blog post linked above has very simple example code for using binary length prefixes as a protocol.)
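For concreteness, here is a minimal sketch of the binary length-prefix option (the helper names send_msg/recv_msg are illustrative, not from the question): every message is preceded by its length packed into four big-endian bytes.
import struct

def send_msg(sock, payload):
    # Prefix the payload with its length as a 4-byte big-endian integer.
    sock.sendall(struct.pack('!I', len(payload)) + payload)

def recv_exactly(sock, n):
    # recv() may return fewer bytes than asked for, so loop until we have n.
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError('socket closed mid-message')
        buf += chunk
    return buf

def recv_msg(sock):
    (length,) = struct.unpack('!I', recv_exactly(sock, 4))
    return recv_exactly(sock, length)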
You can do for line in file.readlines().
