Python - read huge online csv through proxy

I have a huge CSV online and I want to read it line by line without downloading it entirely. But this file is behind a proxy.
I wrote this code:
import io
import requests
import pandas as pd
from requests_ntlm import HttpNtlmAuth  # provides HttpNtlmAuth

cafile = 'mycert.crt'
proxies = {"http": "http://ipproxy:port", "https": "http://ipproxy:port"}
auth = HttpNtlmAuth('Username', 'Password')
url = 'http://myurl/ressources.csv'

# downloads the whole file before parsing it
content = requests.get(url, proxies=proxies, auth=auth, verify=cafile).content
csv_read = pd.read_csv(io.StringIO(content.decode('utf-8')))

pattern = 'mypattern'
# itertuples() yields the data rows; iterating the DataFrame directly
# would only yield the column names
for row in csv_read.itertuples(index=False):
    if row[0] == pattern:
        print(row)
        break
The code above works, but the line content = requests.get(...) takes a very long time because of the size of the CSV file.
So my question is:
Is it possible to read an online CSV line by line through a proxy?
Ideally, I would like to read the first row, check whether it matches my pattern, and if it does, break; if not, read the second line, and so on.
Thanks for your help.

You can pass stream=True to requests.get to avoid fetching the entire result immediately. In that case you can access a pseudo-file object through response.raw and build your CSV reader on top of that (alternatively, the response object has iter_content and iter_lines methods, but I don't know how easy it is to feed those to a CSV parser).
However, while the stdlib's csv module simply yields a sequence of lists or dicts and can therefore easily be lazy, pandas returns a DataFrame, which is not lazy; you need to pass the chunksize parameter to read_csv, in which case you get an iterator of DataFrames, one per chunk.
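For illustration, here is a minimal sketch combining stream=True with pandas' chunksize, reusing the placeholder URL, proxies and pattern from the question. The ';' separator and the chunk size of 1000 rows are assumptions (the separator is taken from the split(';') in the follow-up code further down):

import requests
import pandas as pd

url = 'http://myurl/ressources.csv'
proxies = {"http": "http://ipproxy:port", "https": "http://ipproxy:port"}
pattern = 'mypattern'

response = requests.get(url, stream=True, proxies=proxies)
response.raw.decode_content = True  # let urllib3 transparently undo gzip/deflate

# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is held in memory at a time
for chunk in pd.read_csv(response.raw, sep=';', chunksize=1000):
    matches = chunk[chunk.iloc[:, 0] == pattern]
    if not matches.empty:
        print(matches.iloc[0])
        break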

A plain requests.get call will fetch the whole file anyway. You'd need to implement your own HTTP code, down to the socket level, to be able to process the content as it comes in with a plain HTTP GET.
The only way of getting partial results and slicing the download is to add HTTP "Range" request headers, if the server providing the file supports them (requests lets you set these headers).
Enter requests' advanced usage:
The good news is that requests can do that for you under the hood: you can set the stream=True parameter when calling requests, and it will even let you iterate over the contents line by line. Check the documentation on that part.
Here is more or less what requests does under the hood so that you can get your contents line by line:
It will fetch reasonably sized chunks of your data, but certainly not request one line at a time (think ~80 bytes versus ~100,000 bytes), because otherwise it would need a new HTTP request for each line, and the overhead of each request is not trivial, even when made over the same TCP connection.
Anyway, since CSV is a text format, neither requests nor any other software can know the size of the lines, let alone the exact size of the "next" line to be read, before setting the Range headers accordingly.
So, for this to work, there has to be Python code to:
- accept a request for a "new line" of the CSV: if there are buffered text lines, yield the next line;
- otherwise, make an HTTP request for the next 100 KB or so;
- concatenate the downloaded data to the remainder of the last downloaded line;
- split the downloaded data at the last line-feed in the binary data and save the remainder of the last line;
- convert your binary buffer to text (you'd have to take care of multi-byte character boundaries in a multi-byte encoding like UTF-8, but cutting at newlines may save you that);
- yield the next text line.
A rough sketch of this logic follows below.
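For concreteness, here is a hedged sketch of that buffering logic as a generator built on Range requests. It is not part of the original answer: the name iter_remote_lines and the 100 KB chunk size are illustrative, and it assumes the server answers Range requests with 206 Partial Content:

import requests

def iter_remote_lines(url, chunk_size=100_000, encoding='utf-8', **request_kwargs):
    # Yield text lines from a remote file using HTTP Range requests.
    # request_kwargs can carry proxies=..., auth=..., verify=... as in the question.
    pos = 0
    remainder = b''
    while True:
        headers = {'Range': 'bytes=%d-%d' % (pos, pos + chunk_size - 1)}
        resp = requests.get(url, headers=headers, **request_kwargs)
        if resp.status_code not in (200, 206) or not resp.content:
            break
        data = remainder + resp.content
        # split at line feeds; keep the (possibly incomplete) last line for the next round
        lines = data.split(b'\n')
        remainder = lines.pop()
        for line in lines:
            yield line.decode(encoding)
        if resp.status_code == 200 or len(resp.content) < chunk_size:
            break  # the server ignored Range (sent everything) or we reached the end of the file
        pos += chunk_size
    if remainder:
        yield remainder.decode(encoding)

Usage would then be something like: for line in iter_remote_lines(url, proxies=proxies, verify=cafile): check the first field against the pattern and break on a match. Cutting at newlines keeps UTF-8 multi-byte sequences intact, since a line-feed byte never occurs inside one.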

Following Masklinn's answer, my code now looks like this:
import requests
from requests_ntlm import HttpNtlmAuth  # provides HttpNtlmAuth

cafile = 'mycert.crt'
proxies = {"http": "http://ipproxy:port", "https": "http://ipproxy:port"}
auth = HttpNtlmAuth('Username', 'Password')
url = 'http://myurl/ressources.csv'
pattern = 'mypattern'

# stream=True: the body is fetched lazily as we iterate over it
r = requests.get(url, stream=True, proxies=proxies, auth=auth, verify=cafile)
if r.encoding is None:
    r.encoding = 'ISO-8859-1'

for line in r.iter_lines(decode_unicode=True):
    if line.split(';')[0] == pattern:
        print(line)
        break

Related

Read lines of file over HTTP on demand

What I need to do is read a file over HTTP in chunks (iterate over its lines, to be specific). I don't want to read the entire file (or a large part of it) and then split it into lines, but rather read a small (<=8 kB) chunk and then split that into lines. When all the lines in the chunk are consumed, fetch the next chunk.
I have tried the following:
with urllib.request.urlopen(url) as f:
    yield from f
Which didn't work. In Wireshark I see that about 140kB of total ~220kB are received just by calling urlopen(url).
The next thing I tried was to use requests:
with requests.get(url, stream=True) as req:
    yield from req.iter_lines()
Which also reads about 140kB just by calling get(url, stream=True). According to the documentation this should not happen. Other than that, I did not find any information about this behavior or how to control it. I'm using Requests 2.21.0, CPython 3.7.3, on Windows 10.
According to the docs and docs 2 (and given that the source is actually working in chunks), I think you should use iter_content, which accepts a chunk_size parameter that you have to set to None:
with requests.get(url, stream=True) as req:
    yield from req.iter_content(chunk_size=None)
I haven't tried it, but it seems that somewhere in your code something accesses req.content before iter_lines, thereby loading the entire payload.
Edit: added the example above.
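As an aside, if line iteration is still the goal, here is a hedged sketch (not from the original answer) of splitting iter_content chunks into lines by hand; lines_from_url is a made-up helper name:

import requests

def lines_from_url(url, chunk_size=8192):
    # Buffer raw chunks and emit complete lines as soon as they are available.
    buf = b''
    with requests.get(url, stream=True) as req:
        req.raise_for_status()
        for chunk in req.iter_content(chunk_size=chunk_size):
            buf += chunk
            while b'\n' in buf:
                line, buf = buf.split(b'\n', 1)
                yield line
    if buf:
        yield buf  # last line without a trailing newline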

Python - split files

I currently have a script that requests a file via requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to use stream=True in my requests.post() call and write it in chunks.
I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?
----Adding current code----
if not os.path.exists(output_path):
    os.makedirs(output_path)

memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)
outFile = open('output/tempfile', 'wb')
for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        outFile.write(chunk)
outFile.close()  # make sure everything is flushed before re-reading the file

f = open('output/tempfile', 'rb').read().split(b'\r\n\r\n')
arf = open('output/recording.arf', 'wb')
arf.write(f[3])
os.remove('output/tempfile')
Okay, I was bored and wanted to figure out the best way to do this. It turns out that my initial way in the comments above was overly complicated (unless you consider some scenario where time is absolutely critical or memory is severely constrained). A buffer is a much simpler way to achieve this, as long as you take two or more blocks at a time. This code emulates the question's scenario for demonstration.
Note: depending on the regex engine implementation, this is more efficient and requires significantly fewer str/bytes conversions, since using a regex requires casting each block of bytes to a string. The approach below requires no string conversions, instead operating solely on the bytes returned from requests.post(), and in turn writing those same bytes to file, without conversions.
from pprint import pprint

someString = '''I currently have a script that requests a file via a requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to stream=True in my requests.post() statement and write it in chunks.
I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?'''

n = 16
# emulate a stream by creating 37 blocks of 16 bytes
byteBlocks = [bytearray(someString[i:i+n], 'utf-8') for i in range(0, len(someString), n)]
pprint(byteBlocks)

# this string is present twice, but both times it is split across two bytearrays
matchBytes = bytearray('requests.post()', 'utf-8')

# our buffer
buff = bytearray()
count = 0
for bb in byteBlocks:
    buff += bb
    count += 1
    # every two blocks
    if (count % 2) == 0:
        if count == 2:
            start = 0
        else:
            start = len(matchBytes)
        # check the bytes starting from block ((count-2)*n - start) to (len(buff) - len(matchBytes));
        # this way each byte is checked only once
        if matchBytes in buff[((count-2)*n)-start : len(buff)-len(matchBytes)]:
            print('Match starting at index:', buff.index(matchBytes),
                  'ending at:', buff.index(matchBytes)+len(matchBytes))
Update:
So, given the updated question, this code may remove the need to create a temporary file. I haven't been able to test it exactly, as I don't have a similar response, but you should be able to figure out any bugs yourself.
Since you aren't actually working with a stream directly, i.e. you're given the finished response object from requests.post(), you don't have to worry about using chunks in the networking sense. The "chunks" that requests refers to are really its way of dishing out the bytes, all of which it already has. You can access the bytes directly using r.raw.read(n), but as far as I can tell the response object doesn't let you see how many bytes there are in r.raw, so you're more or less forced to use the iter_content method.
Anyway, this code should copy all the bytes from the request object into a string, then you can search and split that string as before.
memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)

match = b'\r\n\r\n'
data = b''
for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        data += chunk  # iter_content yields bytes, so accumulate bytes

f = data.split(match)
arf = open('output/recording.arf', 'wb')
arf.write(f[3])
arf.close()
# no temporary file is created here, so there is nothing to os.remove()

Read Big text online using python

I have to read a 50 GB text file and do some processing with it. I cannot download the file, as I am doing the processing on a remote server. Is it possible, using Python, to stream the content of the file from its URL and read it line by line?
Actually, the simplest way is:
import urllib2  # the lib that handles the url stuff

data = urllib2.urlopen(target_url)  # it's a file-like object and works just like a file
for line in data:  # files are iterable
    print line
You could even shorten it to
import urllib2
for line in urllib2.urlopen(target_url):
    print line
But remember in Python, readability matters.
However, while this is the simplest way, it is not the safest, because with network programming you often don't know whether the amount of data you expect will be respected. So you'd generally better read a fixed and reasonable amount of data, something you know is enough for the data you expect but that will prevent your script from being flooded:
import urllib2

data = urllib2.urlopen(target_url).read(20000)  # read only 20 000 chars
data = data.split("\n")  # then split it into lines
for line in data:
    print line
In Python 3 and up, use urllib.request instead of urllib2.
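For reference, a minimal Python 3 version of the snippet above might look like this (target_url is a placeholder):

from urllib.request import urlopen

target_url = "http://www.myhost.com/SomeFile.txt"  # placeholder URL

with urlopen(target_url) as data:      # file-like object, iterable line by line
    for line in data:                  # each line arrives as bytes
        print(line.decode('utf-8'), end='')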
You can do it with urllib2:
urlopen works like a file, and file objects are iterators, yielding one line at a time until there are no more lines to yield.
import urllib2

for line in urllib2.urlopen("http://www.myhost.com/SomeFile.txt"):
    print line

How can I take the html file of a website

I am trying to take the HTML of my website and see if it is the same as what I have in an offline version.
I have been researching this, and all I can find is either about parsing or deals only with http://
So far I have this:
import urllib

url = "https://www.mywebsite.com/"
onlinepage = urllib.urlopen(url)
print(onlinepage.read())
offlinepage = open("offline.txt", "w+")
print(offlinepage.read())
if onlinepage.read() == offlinepage.read():
    print("same")  # for debugging
else:
    print("different")
This always says that they are the same, even when I put in a different website entirely.
When you first print your online and offline pages with these lines:
print(onlinepage.read())
print(offlinepage.read())
...you have now consumed all of the text in each file object. Subsequent reads on either object will return an empty string. Two empty strings are equal, therefore your if condition will always evaluate to True.
If you were purely working with files, you could seek to the beginning of both files and read again. Since there is no seek method on the file object from urlopen, you'll need to either re-fetch the page with a new urlopen command or, better, save the original text in a variable and use that for your subsequent comparisons:
online = onlinepage.read()
print(online)
offline = offlinepage.read()
print(offline)
...
if online == offline:
...
As others have noted, you can't read the request object twice (and can't read the file twice without seeking); once read, the data you got back is no longer available, so you need to store it.
But they missed another problem: You opened the file with mode w+. w+ allows both reading and writing, but, just like mode w, it truncates the file on open. So your local file is always empty when you read it, which means you're both corrupting the local file and never getting a match (unless the online file is empty too).
You need to use mode r+ or a+ to get a read/write handle that doesn't truncate the existing file (r+ requires that the file already exist, a+ does not, but puts the write position at end of file, and on some systems, all writes are put at the end of the file).
So fixing both bugs, you get:
import urllib

url = "https://www.mywebsite.com/"

# Using with statements properly for safe resource cleanup
with urllib.urlopen(url) as onlinepage:
    onlinedata = onlinepage.read()
    print(onlinedata)

with open("offline.txt", "r+") as offlinepage:  # DOES NOT TRUNCATE EXISTING FILE!
    offlinedata = offlinepage.read()
    print(offlinedata)
    if onlinedata == offlinedata:
        print("same")  # for debugging
    else:
        print("different")
    # I assume you want to rewrite the local page, or you wouldn't open with +,
    # so this is what you'd do to ensure you replace the existing data correctly
    offlinepage.seek(0)        # ensure you're seeked to the beginning of the file for the write
    offlinepage.write(onlinedata)
    offlinepage.truncate()     # if the online data is smaller, don't keep extra offline data
You use .read() twice on each file.
>>> f.read()
'This is the entire file.\n'
>>> f.read()
''
"If the end of the file has been reached, f.read() will return an empty string ("")." (7.2.1 Docs).
Therefore, when two results are compared, they are equal because each is an empty string.

Read file object as string in python

I'm using urllib2 to read in a page. I need to do a quick regex on the source and pull out a few variables, but urllib2 presents it as a file object rather than a string.
I'm new to python so I'm struggling to see how I use a file object to do this. Is there a quick way to convert this into a string?
You can use Python in interactive mode to search for solutions.
If f is your object, you can enter dir(f) to see all of its methods and attributes. There's one called read. Enter help(f.read) and it tells you that f.read() is the way to retrieve a string from a file object.
From the doc file.read() (my emphasis):
file.read([size])
Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
Be aware that a regexp search on a large string object may not be efficient, and consider doing the search line-by-line, using file.next() (a file object is its own iterator).
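As an illustration (this sketch is not from the original answer), a line-by-line search could look like this, reusing the URL and pattern from the sample code below:

import re
import urllib2

pattern = re.compile('(V.+space)', re.IGNORECASE)
response = urllib2.urlopen("http://www.voidspace.org.uk/python/articles/urllib2.shtml")
for line in response:        # the response object is its own iterator
    result = pattern.search(line)
    if result:
        print result.groups()
        break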
Michael Foord, aka Voidspace has an excellent tutorial on urllib2 which you can find here:
urllib2 - The Missing Manual
What you are doing should be pretty straightforward; observe this sample code:
import urllib2
import re
response = urllib2.urlopen("http://www.voidspace.org.uk/python/articles/urllib2.shtml")
html = response.read()
pattern = '(V.+space)'
wordPattern = re.compile(pattern, re.IGNORECASE)
results = wordPattern.search(html)
print results.groups()
