Counting lines in Azure Data Lake - python

I have some files in Azure Data Lake and I need to count how many lines they have to make sure they are complete. What would be the best way to do it?
I am using Python:
from azure.datalake.store import core, lib

adl_creds = lib.auth(tenant_id='fake_value', client_secret='fake_another value', client_id='fake key', resource='https://my_web.azure.net/')
adl = core.AzureDLFileSystem(adl_creds, store_name='fake account')

file_path_in_azure = "my/path/to/file.txt"
if adl.exists(file_path_in_azure):
    # 1 meg = 1048576, 5 megs = 5242880, 100 megs = 104857600, 500 megs = 524288000
    counter = 0
    with adl.open(file_path_in_azure, mode="rb", blocksize=5242880) as f:
        # I tried a comprehension, but memory increased, since it builds a list of 1s and then sums them:
        # counter1 = sum(1 for line in f)
        for line in f:
            counter = counter + 1
    print(counter)
This works, but it takes hours for files that are 1 or 2 gigabytes. Shouldn't this be faster? Might there be a better way?

Do you really need to count lines? Maybe it is enough to get the size of the file?
You have AzureDLFileSystem.stat to get the file size; if you know the average line length, you can calculate the expected line count.
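For example, a rough sketch of that estimate, reusing adl and file_path_in_azure from the question. It assumes the metadata dict returned by adl.info (stat is mentioned above; adjust to whichever your library version provides) has a 'length' field in bytes, and that the average line length is a guess you measure on a small sample:

# Sketch: estimate the line count from the file size instead of reading the file.
# AVG_LINE_BYTES is a hypothetical value; measure it on a sample of your data.
AVG_LINE_BYTES = 120

meta = adl.info(file_path_in_azure)      # metadata dict; 'length' is the size in bytes
estimated_lines = meta['length'] // AVG_LINE_BYTES
print("approximately %d lines" % estimated_lines)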

You could try:
counter = 0
for file in adl.walk('path/to/folder'):
    counter += len(adl.cat(file).decode().split('\n'))
I'm not sure if this is actually faster, but it uses cat to fetch the file contents in one call, which might be quicker than explicit line-by-line I/O.
EDIT: The one pitfall of this method is the case where file sizes exceed the RAM of the machine you run it on, as cat loads the entire contents into memory.

The only faster way I found was to actually download the file to the machine where the script is running with
adl.get(remote_file, local_path)
and then count it line by line without putting the whole file into memory. Downloading 500 MB takes around 30 seconds and reading 1 million lines takes around 4 seconds =)
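A minimal sketch of that approach, reusing adl and file_path_in_azure from the question. It assumes AzureDLFileSystem.get(remote, local) streams the remote file to a local path (adjust the call if your library version names it differently); the 1 MB chunk size is an arbitrary choice:

# Sketch: download once, then count newlines chunk by chunk so the whole
# file is never held in memory.
local_path = "file_local.txt"
adl.get(file_path_in_azure, local_path)

line_count = 0
with open(local_path, "rb") as f:
    while True:
        chunk = f.read(1024 * 1024)  # 1 MB at a time
        if not chunk:
            break
        # counts newline characters; add 1 if the last line has no trailing newline
        line_count += chunk.count(b"\n")
print(line_count)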

Related

fastest method to read big data files in python

I have got some (about 60) huge (>2 GB) CSV files which I want to loop through to make subselections (e.g. each file contains data for 1 month of various financial products; I want to build 60-month time series for each product).
Reading an entire file into memory (e.g. by loading the file in Excel or Matlab) is unworkable, so my initial search on Stack Overflow made me try Python. My strategy was to loop through each line iteratively and write it away to some folder. This strategy works fine, but it is extremely slow.
From my understanding there is a trade-off between memory usage and computation speed. Loading the entire file into memory is one end of the spectrum (the computer crashes), and loading a single line into memory each time is obviously the other end (computation time is about 5 hours).
So my main question is: is there a way to load multiple lines into memory, so as to make this process (maybe 100 times?) faster, while not losing functionality? And if so, how would I implement it? Or am I going about this all wrong? Mind you, below is just a simplified version of what I am trying to do (I might want to make subselections in other dimensions than time). Assume that the original data files have no meaningful ordering (other than being split into 60 files, one per month).
The method in particular I am trying is:
# Creates a time series per bond
import csv
import linecache

# 'allBonds.txt' has a row of comma-separated bond identifiers for each month
# I have 60 large files financialData_&month&year
filedoc = []
months = ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']
years = ['08','09','10','11','12']
bonds = []
for j in range(0, 5):
    for i in range(0, 12):
        filedoc.append('financialData_' + str(months[i]) + str(years[j]) + '.txt')

for x in range(0, 60):
    line = linecache.getline('allBonds.txt', x)
    bonds = line.split(',')  # generate the identifiers for this particular month
    with open(filedoc[x]) as text_file:
        for line in text_file:
            temp = line.split(';')
            if temp[2] in bonds:  # checks if the bond of this iteration is among those we search for
                output_file = open('monthOutput' + str(temp[2]) + str(filedoc[x]) + '.txt', 'a')
                datawriter = csv.writer(output_file, dialect='excel', delimiter='^', quoting=csv.QUOTE_MINIMAL)
                datawriter.writerow(temp)
                output_file.close()
Thanks in advance.
P.s. Just to make sure: the code works at the moment (though any suggestions are welcome of course), but the issue is speed.
I would test pandas.read_csv, mentioned in https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file . It supports reading the file in chunks (the iterator=True or chunksize option).
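For example, a hedged sketch of chunked reading. The file name, the semicolon delimiter, the column index matching temp[2], the bond identifiers, and the chunk size are all assumptions based on the question's code:

import pandas as pd

bonds = {'BOND_A', 'BOND_B'}                 # hypothetical identifiers for one month
selected = []
for chunk in pd.read_csv('financialData_jan08.txt', sep=';', header=None,
                         chunksize=100000):
    # column 2 corresponds to temp[2] in the original code
    selected.append(chunk[chunk[2].isin(bonds)])
result = pd.concat(selected, ignore_index=True)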
I think this part of your code may cause serious performance problems if the condition is matched frequently.
if temp[2] in bonds:  # checks if the bond of this iteration is among those we search for
    output_file = open('monthOutput' + str(temp[2]) + str(filedoc[x]) + '.txt', 'a')
    datawriter = csv.writer(output_file, dialect='excel', delimiter='^',
                            quoting=csv.QUOTE_MINIMAL)
    datawriter.writerow(temp)
    output_file.close()
It would be better to avoid opening a file, creating a csv.writer() object and then closing the file inside a loop.
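One way to do that, sketched below under the question's assumptions (semicolon-delimited lines, the bond id in column 2, hypothetical file and bond names), is to open each output file once, keep its csv.writer in a dict, and close everything after the loop:

import csv

bonds = {'BOND_A', 'BOND_B'}             # hypothetical identifiers for one month
input_name = 'financialData_jan08.txt'   # one of the 60 monthly files

writers = {}  # bond id -> (file object, csv.writer)
with open(input_name) as text_file:
    for line in text_file:
        temp = line.split(';')
        if temp[2] in bonds:
            if temp[2] not in writers:
                out = open('monthOutput' + temp[2] + input_name + '.txt', 'a')
                writers[temp[2]] = (out, csv.writer(out, dialect='excel',
                                                    delimiter='^',
                                                    quoting=csv.QUOTE_MINIMAL))
            writers[temp[2]][1].writerow(temp)

for out, _ in writers.values():
    out.close()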

MPI in Python: load data from a file by line concurrently

I'm new to python as well as MPI.
I have a huge data file (10 GB) and I want to load it into a list or whatever is more efficient; please suggest.
Here is the way I load the file content into a list
def load(source, size):
    data = [[] for _ in range(size)]
    ln = 0
    with open(source, 'r') as input:
        for line in input:
            ln += 1
            data[ln % size].append(line)  # append the line (sanitize it here first if needed) to this process's sublist
    return data
Note:
source is the file name
size is the number of concurrent processes; I divide the data into [size] sublists for parallel computing using MPI in Python.
Please advise how to load the data more efficiently and faster. I've been searching for days but couldn't find anything that matches my purpose; if something exists, please comment with a link.
Regards
If I have understood the question, your bottleneck is not Python data structures. It is the I/O speed that limits the efficiency of your program.
If the file is written in contiguous blocks on the HDD, then I don't know of a way to read it faster than reading the file from the first byte to the end.
But if the file is fragmented, you could create multiple threads, each reading a part of the file. This might seem to slow down reading, but modern HDDs implement a technique called NCQ (Native Command Queueing), which gives higher priority to read operations on sectors with addresses near the current position of the HDD head, so reading with multiple threads can improve the overall read speed.
To suggest an efficient data structure in Python for your program, you need to say which operations you will perform on the data (delete, add, insert, search, append and so on) and how often.
By the way, if you use commodity hardware, 10 GB of RAM is expensive. Try to reduce the need for that much RAM by loading only the data necessary for the current computation, then replacing it with new data for the next operation. You can overlap the computation with the I/O operations to improve performance.
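As an illustration of loading only a share of the data per process, here is a minimal mpi4py sketch; the file name and the round-robin split are assumptions, not the questioner's actual setup:

# Sketch: every rank scans the file line by line but keeps only the lines
# assigned to it round-robin, so each process stores roughly 1/size of the data.
# Note that each rank still reads the whole file from disk; only memory is split.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

my_lines = []
with open('huge_data.txt') as f:         # hypothetical file name
    for i, line in enumerate(f):
        if i % size == rank:
            my_lines.append(line.rstrip('\n'))

print('rank %d loaded %d lines' % (rank, len(my_lines)))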
(original) Solution using pickling
The strategy for your task can go this way:
split the large file into smaller ones, making sure they are divided on line boundaries
have Python code which can convert the smaller files into the resulting lists of records and save them as pickled files
run the Python code for all the smaller files in parallel (using Python or other means)
run integrating code, taking the pickled files one by one, loading the list from each and appending it to the final result
To gain anything, you have to be careful, as the overhead can outweigh all possible gains from parallel runs:
as Python uses the Global Interpreter Lock (GIL), do not use threads for parallel processing; use processes. As processes cannot simply pass data around, you have to pickle the data and let the other (final integrating) part read the result from it.
try to minimize the number of loops. For this reason it is better to:
not split the large file into too many smaller parts. To use the power of your cores, it is best to match the number of parts to the number of cores (or possibly twice as many, but going higher will spend too much time switching between processes).
pickling allows saving individual items, but it is better to create a list of items (records) and pickle the list as one item. Pickling one list of 1000 items will be faster than pickling 1000 small items one by one.
some tasks (splitting the file, calling the conversion tasks in parallel) can often be done faster by existing tools in the system. If you have that option, use it.
In my small test, I created a file with 100 thousand lines with content like "98-BBBBBBBBBBBBBB", "99-BBBBBBBBBBB" etc. and tested converting it to a list of numbers [..., 98, 99, ...].
For splitting I used the Linux command split, asking it to create 4 parts while preserving line boundaries:
$ split -n l/4 long.txt
This created smaller files xaa, xab, xac, xad.
To convert each smaller file I used the following script, which converts the content into a file with the extension .pickled containing the pickled list.
# chunk2pickle.py
import pickle
import sys

def process_line(line):
    return int(line.split("-", 1)[0])

def main(fname, pick_fname):
    with open(pick_fname, "wb") as fo:
        with open(fname) as f:
            pickle.dump([process_line(line) for line in f], fo)

if __name__ == "__main__":
    fname = sys.argv[1]
    pick_fname = fname + ".pickled"
    main(fname, pick_fname)
To convert one chunk of lines into pickled list of records:
$ python chunk2pickle.py xaa
and it creates the file xaa.pickled.
But as we need to do this in parallel, I used the parallel tool (which has to be installed on the system):
$ parallel -j 4 python chunk2pickle.py {} ::: xaa xab xac xad
and found new files with the extension .pickled on the disk.
-j 4 asks it to run 4 processes in parallel; adjust it to your system, or leave it out and it will default to the number of cores you have.
parallel can also get the list of parameters (input file names in our case) by other means, like the ls command:
$ ls x?? |parallel -j 4 python chunk2pickle.py {}
To integrate the results, use script integrate.py:
# integrate.py
import pickle

def main(file_names):
    res = []
    for fname in file_names:
        with open(fname, "rb") as f:
            res.extend(pickle.load(f))
    return res

if __name__ == "__main__":
    file_names = ["xaa.pickled", "xab.pickled", "xac.pickled", "xad.pickled"]
    # here you have the list of records you asked for
    records = main(file_names)
    print records
In my answer I have used a couple of external tools (split and parallel). You could do a similar task with Python too. My answer focuses only on giving you an option to keep the Python code for converting lines into the required data structures. A complete pure-Python answer is not covered here (it would get much longer and probably slower).
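If split and parallel are not available, a rough pure-Python sketch of just the splitting step could look like this. The file names and the number of parts are assumptions, and note that it distributes lines round-robin rather than in contiguous blocks like split does:

# Sketch: split a large text file into n_parts smaller files on line boundaries.
def split_on_lines(src, n_parts=4):
    outs = [open('%s.part%d' % (src, i), 'w') for i in range(n_parts)]
    try:
        with open(src) as f:
            for i, line in enumerate(f):
                outs[i % n_parts].write(line)
    finally:
        for fo in outs:
            fo.close()

split_on_lines('long.txt', 4)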
Solution using process Pool (no explicit pickling needed)
The following solution uses multiprocessing from Python. In this case there is no need to pickle the results explicitly: multiprocessing serializes the arguments and return values with pickle behind the scenes to pass them between processes.
# direct_integrate.py
from multiprocessing import Pool

def process_line(line):
    return int(line.split("-", 1)[0])

def process_chunkfile(fname):
    with open(fname) as f:
        return [process_line(line) for line in f]

def main(file_names, cores=4):
    p = Pool(cores)
    return p.map(process_chunkfile, file_names)

if __name__ == "__main__":
    file_names = ["xaa", "xab", "xac", "xad"]
    # here you have the list of records you asked for
    # warning: records are in groups.
    record_groups = main(file_names)
    for rec_group in record_groups:
        print(rec_group)
This updated solution still assumes that the large file is available in the form of four smaller files.

Extracting a random line in a file without loading the file into RAM in python

I have big svmlight files that I'm using for machine learning purposes. I'm trying to see if subsampling those files would lead to good enough results.
I want to extract random lines from my files to feed into my models, but I want to load as little as possible into RAM.
I saw here (Read a number of random lines from a file in Python) that I could use linecache, but all the solutions end up loading everything into memory.
Could someone give me some hints? Thank you.
EDIT: I forgot to say that I know the number of lines in my files beforehand.
You can use heapq to select n records based on a random key, e.g.:
import heapq
import random

SIZE = 10

with open('yourfile') as fin:
    sample = heapq.nlargest(SIZE, fin, key=lambda L: random.random())
This is remarkably efficient: the heap stays at a fixed size, it doesn't require a pre-scan of the data, and elements get swapped out as others are chosen instead, so at most you'll end up with SIZE lines in memory at once.
One option is to do a random seek into the file then look backwards for a newline (or the start of the file) before reading a line. Here's a program that prints a random line of each of the Python programs it finds in the current directory.
import random
import os
import glob

for name in glob.glob("*.py"):
    mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime = os.stat(name)
    inf = open(name, "r")
    location = random.randint(0, size)
    inf.seek(location)
    while location > 0:
        char = inf.read(1)
        if char == "\n":
            break
        location -= 1
        inf.seek(location)
    line = inf.readline()
    print name, ":", line[:-1]
As long as the lines aren't huge this shouldn't be unduly burdensome.
You could scan the file once, counting the number of lines. Once you know that, you can generate the random line number, re-read the file and emit that line when you see it.
Actually since you're interested in multiple lines, you should look at Efficiently selecting a set of random elements from a linked list.
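Since the edit to the question says the total line count is already known, a small sketch of that idea can skip the counting pass: draw the line numbers first, then collect the matching lines in a single pass. The file name, total count and sample size below are placeholders:

import random

TOTAL_LINES = 1000000   # known beforehand, per the question's edit
SAMPLE_SIZE = 10

wanted = set(random.sample(range(TOTAL_LINES), SAMPLE_SIZE))
sample = []
with open('yourfile') as fin:
    for i, line in enumerate(fin):
        if i in wanted:
            sample.append(line)
            if len(sample) == SAMPLE_SIZE:
                break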

Python - find the unique values from a large json file efficiently

I have a JSON file data_large of size 150.1 MB. The content of the file is of the form [{"score": 68},{"score": 78}]. I need to find the unique scores across all the items.
This is what I'm doing:-
import ijson  # since the json file is large, making use of ijson
f = open('data_large')
content = ijson.items(f, 'item')  # loads quickly here compared to json.load(f)
print set(i['score'] for i in content)  # this line is taking a long time to process
Can I make the set(i['score'] for i in content) line more efficient? It's currently taking 201 seconds to execute.
This will give you the set of unique score values (only) as ints. You'll need about 150 MB of free memory. It uses re.finditer() to parse, which is about three times faster than the json parser (on my computer).
import re
import time

t = time.time()
obj = re.compile(r'{.*?: (\d*?)}')
with open('datafile.txt', 'r') as f:
    data = f.read()
s = set(m.group(1) for m in obj.finditer(data))
s = set(map(int, s))
print time.time() - t
Using re.findall() also seems to be about three times faster than the json parser; it consumes about 260 MB:
import re

obj = re.compile(r'{.*?: (\d*?)}')
with open('datafile.txt', 'r') as f:
    data = f.read()
s = set(obj.findall(data))
I don't think there is any way to improve things by much. The slow part is probably just the fact that at some point you need to parse the whole JSON file. Whether you do it all up front (with json.load) or little by little (when consuming the generator from ijson.items), the whole file needs to be processed eventually.
The advantage to using ijson is that you only need to have a small amount of data in memory at any given time. This probably doesn't matter too much for a file with a hundred or so megabytes of data, but would be a very big deal if your data file grew to gigabytes or more. Of course, this may also depend on the hardware you're running on. If your code is going to run on an embedded system with limited RAM, limiting your memory use is much more important. On the other hand, if it is going to be running on a high performance server or workstation with lots of RAM available, there may not be any reason to hold back.
So, if you don't expect your data to get too big (relative to your system's RAM capacity), you might try testing to see if using json.load to read the whole file at the start, then getting the unique values with a set is faster. I don't think there are any other obvious shortcuts.
On my system, the straightforward code below handles 10,000,000 scores (139 megabytes) in 18 seconds. Is that too slow?
#!/usr/local/cpython-2.7/bin/python

from __future__ import print_function

import json

with open('data_large', 'r') as file_:
    content = json.load(file_)
    print(set(element['score'] for element in content))
Try using a set
set([x['score'] for x in scores])
For example
>>> scores = [{"score" : 78}, {"score": 65} , {"score" : 65}]
>>> set([x['score'] for x in scores])
set([65, 78])

Loading 15GB file in Python

I have a 15GB text file containing 25000 lines.
I am creating a multi level dictionary in Python of the form :
dict1 = {'':int},
dict2 = {'':dict1}.
I have to use this entire dict2 multiple times (about 1000...in a for loop) in my program.
Can anyone please suggest a good way to do this?
The same type of information is stored in the file
(count of different RGB values of 25000 images. 1 image per line)
eg : 1 line of the file would be like :
image1 : 255,255,255-70 ; 234,221,231-40 ; 112,13,19-28 ;
image2 : 5,25,25-30 ; 34,15,61-20 ; 102,103,109-228 ;
and so on.
The best way to do this is to use chunking.
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open('really_big_file.dat')
for piece in read_in_chunks(f):
    process_data(piece)
As a note, as you start to process large files, moving to a map-reduce idiom may help, since you'll be able to work on separate chunked files independently without pulling the complete data set into memory.
In Python, if you use a file object as an iterator, you can read a file line by line without loading the whole thing into memory.
for line in open("huge_file.txt"):
    do_something_with(line)
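Applied to the file format described in the question above, a hedged sketch of building the nested dictionary while iterating line by line might look like this; the separators ':', ';' and '-' are assumed from the example lines:

# Sketch: parse lines like
#   image1 : 255,255,255-70 ; 234,221,231-40 ; 112,13,19-28 ;
# into dict2 = {image_name: {rgb_string: count}}, one line at a time.
dict2 = {}
with open("huge_file.txt") as f:
    for line in f:
        if ':' not in line:
            continue
        image, rest = line.split(':', 1)
        counts = {}
        for entry in rest.split(';'):
            entry = entry.strip()
            if not entry:
                continue
            rgb, count = entry.rsplit('-', 1)
            counts[rgb.strip()] = int(count)
        dict2[image.strip()] = counts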
