How to parse very big files in Python?

I have a very big TSV file (1.5 GB) that I want to parse. I'm using the following function:
def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = {}
    for row in eval:
        ids = row.split("\t")
        if ids[0] not in evalIDs.keys():
            evalIDs[ids[0]] = []
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs
It has been running for more than 10 hours and is still going. I don't know how to speed up this step, or whether there is another way to parse such a file.

Several issues here:
testing for keys with if ids[0] not in evalIDs.keys() takes forever in Python 2, because keys() returns a list. .keys() is rarely useful anyway. A better test is already if ids[0] not in evalIDs, but...
why not use a collections.defaultdict instead?
why not use the csv module?
shadowing the eval built-in (well, not really an issue here, seeing how dangerous eval is anyway)
my proposal:
import csv, collections

def readEvalFileAsDictInverse(evalFile):
    evalIDs = collections.defaultdict(list)
    with open(evalFile, "r") as handle:
        cr = csv.reader(handle, delimiter='\t')
        for ids in cr:
            evalIDs[ids[0]].append(ids[1])
    return evalIDs
The magic of evalIDs[ids[0]].append(ids[1]) is that the defaultdict creates the list if it doesn't already exist. It's also portable and very fast whatever the Python version, and it saves an if.
I don't think it could be faster with the standard library alone, but a pandas solution probably would be.
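For illustration, a rough pandas sketch might look like this; the tab delimiter and the first-two-columns layout come from the question, everything else (including reading the whole file in one go, which trades RAM for speed) is an assumption:
import pandas as pd

def readEvalFileAsDictInverse_pandas(evalFile):
    # Read only the first two tab-separated columns as strings.
    df = pd.read_csv(evalFile, sep='\t', header=None, usecols=[0, 1], dtype=str)
    # Group column 1 by column 0 and collect the values into lists.
    return df.groupby(0)[1].apply(list).to_dict()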

Some suggestions:
Use a defaultdict(list) instead of creating inner lists yourself or using dict.setdefault().
dict.setdefault() will create the default value every time, which is a time burner; defaultdict(list) does not, it is optimized:
from collections import defaultdict

def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = defaultdict(list)
    for row in eval:
        ids = row.split("\t")
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs
If your keys are valid file names, you might want to investigate awk for much more performance than doing this in Python.
Something along the lines of
awk -F $'\t' '{print > $1}' file1
will create your split files much faster, and you can simply use the latter part of the following code to read from each file (assuming your keys are valid filenames) to construct your lists. (Attribution: here) - You would need to grab your created files with os.walk or similar means. Each line inside the files will still be tab-separated and contain the ID in front.
If your keys are not valid filenames in their own right, consider storing your different lines in different files and only keeping a dictionary of key -> filename around.
After splitting the data, load the files as lists again:
Create testfile:
with open ("file.txt","w") as w:
w.write("""
1\ttata\ti
2\tyipp\ti
3\turks\ti
1\tTTtata\ti
2\tYYyipp\ti
3\tUUurks\ti
1\ttttttttata\ti
2\tyyyyyyyipp\ti
3\tuuuuuuurks\ti
""")
Code:
# f.e. https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename
def make_filename(k):
    """In case your keys contain non-filename-characters, make it a valid name"""
    return k  # assuming k is a valid file name, else modify it

evalFile = "file.txt"
files = {}

with open(evalFile, "r") as eval_file:
    for line in eval_file:
        if not line.strip():
            continue
        key, value, *rest = line.split("\t")  # omit *rest if you only have 2 values
        fn = files.setdefault(key, make_filename(key))
        # this will open and close files _a lot_; you might want to keep file handles
        # in your dict instead - but that depends on the key/data/lines ratio in
        # your data - if you have few keys, file handles ought to be better, if you
        # have many it does not matter
        with open(fn, "a") as f:
            f.write(value + "\n")

# create your list data from your files:
data = {}
for key, fn in files.items():
    with open(fn) as r:
        data[key] = [x.strip() for x in r]

print(data)
Output:
# for my data: loaded from files called '1', '2' and '3'
{'1': ['tata', 'TTtata', 'tttttttata'],
'2': ['yipp', 'YYyipp', 'yyyyyyyipp'],
'3': ['urks', 'UUurks', 'uuuuuuurks']}

Change evalIDs to a collections.defaultdict(list). You can then avoid the if that checks whether a key is already there.
Consider splitting the file externally using split(1), or even inside Python using a read offset. Then use a multiprocessing.Pool to parallelise the loading, as sketched below.
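A minimal sketch of that idea, assuming the file has already been split into pieces named chunk_* (for example with split -l 1000000 bigfile.tsv chunk_); the chunk naming and the two-column layout are assumptions:
import glob
from collections import defaultdict
from multiprocessing import Pool

def parse_chunk(path):
    """Parse one chunk file into a plain dict of key -> list of values."""
    ids = defaultdict(list)
    with open(path, "r") as handle:
        for row in handle:
            parts = row.rstrip("\n").split("\t")
            if len(parts) >= 2:
                ids[parts[0]].append(parts[1])
    return dict(ids)

def merge(dicts):
    """Merge the per-chunk dicts back into one key -> list mapping."""
    merged = defaultdict(list)
    for d in dicts:
        for key, values in d.items():
            merged[key].extend(values)
    return merged

if __name__ == "__main__":
    chunk_files = sorted(glob.glob("chunk_*"))
    pool = Pool()                      # one worker per CPU core by default
    evalIDs = merge(pool.map(parse_chunk, chunk_files))
    pool.close()
    pool.join()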

Maybe you can make it somewhat faster; change this:
if ids[0] not in evalIDs.keys():
    evalIDs[ids[0]] = []
evalIDs[ids[0]].append(ids[1])
to
evalIDs.setdefault(ids[0], []).append(ids[1])
The first solution searches the evalIDs dictionary three times.

Related

Python: dictionary to collection

I have a file with 2 columns:
Anzegem Anzegem
Gijzelbrechtegem Anzegem
Ingooigem Anzegem
Aalst Sint-Truiden
Aalter Aalter
The first column is a town and the second column is the district of that town.
I made a dictionary of that file like this:
def readTowns(text):
    input = open(text, 'r')
    file = input.readlines()
    dict = {}
    verzameling = set()
    for line in file:
        tmp = line.split()
        dict[tmp[0]] = tmp[1]
    return dict
If I set a variable writeTowns equal to readTowns(text) and do writeTowns['Anzegem'], I want to get a collection of {'Anzegem', 'Gijzelbrechtegem', 'Ingooigem'}.
Does anybody know how to do this?
I think you can just create another function that builds the appropriate data structure for what you need. Otherwise you end up writing code that manipulates the dictionary returned by readTowns to generate the data you want anyway, so why not keep the code clean and write another function for that? You just create a name-to-list dictionary and you are all set.
def writeTowns(text):
    input = open(text, 'r')
    file = input.readlines()
    dict = {}
    for line in file:
        tmp = line.split()
        dict[tmp[1]] = dict.get(tmp[1]) or []
        dict.get(tmp[1]).append(tmp[0])
    return dict

writeTown = writeTowns('file.txt')
print writeTown['Anzegem']
And if you are concerned about reading the same file twice, you can do something like this as well,
def readTowns(text):
    input = open(text, 'r')
    file = input.readlines()
    dict2town = {}
    town2dict = {}
    for line in file:
        tmp = line.split()
        dict2town[tmp[0]] = tmp[1]
        town2dict[tmp[1]] = town2dict.get(tmp[1]) or []
        town2dict.get(tmp[1]).append(tmp[0])
    return dict2town, town2dict

dict2town, town2dict = readTowns('file.txt')
print town2dict['Anzegem']
You could do something like this, although please have a look at @ubadub's answer; there are better ways to organise your data.
[town for town, region in dic.items() if region == 'Anzegem']
It sounds like you want to make a dictionary where the keys are the districts and the values are a list of towns.
A basic way to do this is:
def readTowns(text):
    with open(text, 'r') as f:
        lines = f.readlines()
    my_dict = {}
    for line in lines:
        tmp = line.split()
        if tmp[1] in my_dict:
            my_dict[tmp[1]].append(tmp[0])
        else:
            my_dict[tmp[1]] = [tmp[0]]
    return my_dict
The if/else blocks can also be achieved using python's defaultdict subclass (docs here) but I've used the if/else statements here for readability.
Also, some other points: dict and file are Python built-in names, so it is bad practice to shadow them with your own local variables (notice I've changed dict to my_dict and file to lines in the code above).
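For reference, a short sketch of the defaultdict variant mentioned above; it is the same logic, just without the if/else:
from collections import defaultdict

def readTowns(text):
    by_district = defaultdict(list)      # district -> list of towns
    with open(text, 'r') as f:
        for line in f:
            tmp = line.split()
            if len(tmp) >= 2:            # skip blank or malformed lines
                by_district[tmp[1]].append(tmp[0])
    return by_district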
If you build your dictionary as {town: district}, so the town is the key and the district is the value, you can't do this easily*, because a dictionary is not meant to be used that way. Dictionaries let you easily find the values associated with a given key. So if you want to find all the towns in a district, you are better off building your dictionary as:
{district: [list_of_towns]}
So for example the district Anzegem would appear as {'Anzegem': ['Anzegem', 'Gijzelbrechtegem', 'Ingooigem']}
And of course the value is your collection.
*you could probably do it by iterating through the entire dict and checking where your matches occur, but this isn't very efficient.

Comparing csv files in python to see what is in both

I have 2 csv files that I want to compare, one of which is a master file of all the countries and another that has only a few countries. This is an attempt I made at some rudimentary testing:
chars = {}
with open('all.csv', 'rb') as lookupfile:
    for number, line in enumerate(lookupfile):
        chars[line.strip()] = number

with open('locations.csv') as textfile:
    text = textfile.read()
    print text
    for char in text:
        if char in chars:
            print("Country found {0} found in row {1}".format(char, chars[char]))
I am trying to get a final output of the master file of countries with a secondary column indicating whether each one came up in the other list.
Thanks!
Try this:
Write a function to turn the CSV into a Python dictionary containing as keys each of the countries found in the CSV. It can just look like this:
{'US':True, 'UK':True}
Do this for both CSV files.
Now, iterate over the dictionary.keys() for the csv you're comparing against, and just check to see if the other dictionary has the same key.
This will be an extremely fast algorithm because dictionaries give us constant time lookup, and you have a data structure which you can easily use to see which countries you found.
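A small sketch of that idea; the filenames come from the question, and the assumption that the country name is the first (or only) comma-separated field on each line is mine:
def load_countries(path):
    """Return a dict mapping each country found in the file to True."""
    countries = {}
    with open(path) as f:
        for line in f:
            name = line.strip().split(',')[0]   # assumed: country is the first field
            if name:
                countries[name] = True
    return countries

master = load_countries('all.csv')
subset = load_countries('locations.csv')

for country in master.keys():
    if country in subset:
        print("{0} appears in both files".format(country))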
As Eric mentioned in comments, you can also use set membership to handle this. This may actually be the simpler, better way to do it:
set1 = set()  # A new empty set
set1.add("country")
if "country" in set1:
    # do something
You could use exactly the same logic as the original loop:
with open('locations.csv') as textfile:
    for line in textfile:
        country = line.strip()
        if country in chars:
            print("Country found {0} found in row {1}".format(country, chars[country]))

Fastest way to find a re.compile match in list

I have a text file that has 120000 lines, where every line has exactly this format:
ean_code;plu;name;price;state
I tried various operations, including working with the file straight away, but the best results came from loading the file into memory line by line with readlines() and writing it to a list (at the start of the program).
So I have these 2 lines:
matcher = re.compile('^(?:'+eanic.strip()+'(?:;|$)|[^;]*;'+eanic.strip()+'(?:;|$))').match
line = [next(l.split(';') for l in list if matcher(l))]
do sth with line....
What these lines are trying to accomplish is to find (as fast as possible) a plu/ean that was given by user input in the fields ean_code or plu.
I am particularly interested in the second line, as it impacts my performance on a WinCE device (PyCE port of Python 2.5).
I tried every possible solution I could think of to make it faster, but this is the fastest way I found to iterate through the list and find a match with the compiled regex.
Is there any faster way than the for-in list comprehension to iterate over a big list (120000 lines in my case)?
I am looking for any kind of approach with any kind of data structure (supported up to Python 2.5) that will give me a faster result than the above two lines...
Just to mention, this is performed on a handheld device (630 MHz ARM) with 256 MB RAM, and no connection besides USB is present. Sadly, a database is not an option.
I made a test file and tested a few variations. The fastest way of searching for a static string (as you appear to be doing) by iterating over the file is by using string in line.
However, if you'll be using the loaded data for more than one search (actually more than about 30, according to the test numbers below), it's worth your (computational) time to build lookup tables for the PLUs and EANs in the form of dicts and use those for future searches.
loaded 120000 lines
question regex 0.114868402481
simpler regex 0.417045307159
other regex 0.386662817001
startswith 0.236350297928
string in 0.020356798172 <-- iteration winner
dict construction 0.611148500443
dict lookup 0.000002503395 <-- best if you are doing many lookups
Test code follows:
import re
import timeit

def timefunc(function, times, *args):
    def wrap():
        function(*args)
    t = timeit.Timer(wrap)
    return t.timeit(times) / times

def question(lines):
    eanic = "D41RP9"
    matcher = re.compile('^(?:'+eanic.strip()+'(?:;|$)|[^;]*;'+eanic.strip()+'(?:;|$))').match
    line = [next(l.split(';') for l in lines if matcher(l))]
    return line

def splitstart(lines):
    eanic = "D41RP9"
    ret = []
    for l in lines:
        s = l.split(';')
        if s[0].startswith(eanic) or s[1].startswith(eanic):
            ret.append(l)
    return ret

def simpler(lines):
    eanic = "D41RP9"
    matcher = re.compile('(^|;)' + eanic)
    return [l for l in lines if matcher.search(l)]

def better(lines):
    eanic = "D41RP9"
    matcher = re.compile('^(?:' + eanic + '|[^;]*;' + eanic + ')')
    return [l for l in lines if matcher.match(l)]

def strin(lines):
    eanic = "D41RP9"
    return [l for l in lines if eanic in l]

def mkdicts(lines):
    ean = {}
    plu = {}
    for l in lines:
        s = l.split(';')
        ean[s[0]] = s
        plu[s[1]] = s
    return (ean, plu)

def searchdicts(ean, plu):
    eanic = "D41RP9"
    return (ean.get(eanic, None), plu.get(eanic, None))

with open('test.txt', 'r') as f:
    lines = f.readlines()

print "loaded", len(lines), "lines"
print "question regex\t", timefunc(question, 10, lines)
print "simpler regex\t", timefunc(simpler, 10, lines)
print "other regex\t", timefunc(better, 10, lines)
print "startswith\t", timefunc(splitstart, 10, lines)
print "string in\t", timefunc(strin, 10, lines)
print "dict construction\t", timefunc(mkdicts, 10, lines)
ean, plu = mkdicts(lines)
print "dict lookup\t", timefunc(searchdicts, 10, ean, plu)
At first I looked up some modules that are available for Python 2.5:
You can use the csv module to read your data. It could be faster.
You can store your data via the pickle or cPickle module, so you can store Python objects (like dicts, tuples, ints and so on). Comparing ints is faster than searching in strings.
You iterate through a list, but you say your data is in a text file. Do not load your whole text file into a list. Maybe the following is fast enough and there is no need for the modules mentioned above.
f = open('source.txt', 'r')  # note: python 2.5, no with-statement yet
stripped_eanic = eanic.strip()
for line in f:
    if stripped_eanic in line:  # the IDs have a maximum length, don't they? So maybe just search in line[:20]
        # run further tests, if you think it is necessary
        if further_tests:
            print line
            break
else:
    print "No match"
Edit
I thought about what I mentioned above: do not load the whole file into a list. I think that is only true if your search is a one-time procedure and your script exits afterwards. But if you want to search several times, I suggest using a dict (like beerbajay suggested) and cPickle files instead of the text file, as sketched below.
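A rough sketch of that cPickle idea, reusing the mkdicts() lookup tables from the answer above; the cache filename lookup.pkl and the test.txt input name are assumptions (Python 2.5 style, so no with-statement):
import os
import cPickle

CACHE = "lookup.pkl"

if os.path.exists(CACHE):
    # Later runs: load the prebuilt lookup tables instead of
    # re-reading and re-splitting the 120000-line text file.
    f = open(CACHE, "rb")
    ean, plu = cPickle.load(f)
    f.close()
else:
    # First run: build the dicts once and cache them.
    f = open("test.txt", "r")
    ean, plu = mkdicts(f.readlines())
    f.close()
    f = open(CACHE, "wb")
    cPickle.dump((ean, plu), f, 2)   # protocol 2 is the binary format available in 2.5
    f.close()

print ean.get("D41RP9"), plu.get("D41RP9")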

Combining files in python using

I am attempting to combine a collection of 600 text files, each line looks like
Measurement title Measurement #1
ebv-miR-BART1-3p 4.60618701
....
evb-miR-BART1-200 12.8327289
with 250 or so rows in each file. Each file is formatted that way, with the same data headers. What I would like to do is combine the files such that it looks like this
Measurement title Measurement #1 Measurement #2
ebv-miR-BART1-3p 4.60618701 4.110878867
....
evb-miR-BART1-200 12.8327289 6.813287556
I was wondering if there is an easy way in Python to strip out the second column of each file and append it to a master file? I was planning on pulling each line out, then using regular expressions to find the second column and appending it to the corresponding line in the master file. Is there something more efficient?
It is a small amount of data for today's desktop computers (around 150000 measurements), so keeping everything in memory and dumping it to a single file will be easier than any other strategy. If it would not fit in RAM, maybe using SQL would be a nice approach, but as it is, you can create a single defaultdict whose elements are lists, read all your files collecting the measurements into this dictionary, and dump it to disk:
# create default list dictionary:
>>> from collections import defaultdict
>>> data = defaultdict(list)
# Read your data into it:
>>> from glob import glob
>>> import csv
>>> for filename in glob("my_directory/*csv"):
...     reader = csv.reader(open(filename))
...     # throw away the header row:
...     next(reader)
...     for name, value in reader:
...         data[name].append(value)
...
>>> # and record everything down in another file:
...
>>> mydata = open("mydata.csv", "wt")
>>> writer = csv.writer(mydata)
>>> for name, values in sorted(data.items()):
...     writer.writerow([name] + values)
...
>>> mydata.close()
>>>
Use the csv module to read the files in, create a dictionary of the measurement names, and make the values in the dictionary a list of the values from the file.
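A possible sketch of that approach; the directory name, the *.txt pattern and the tab delimiter are assumptions:
import csv
import glob
from collections import defaultdict

data = defaultdict(list)                         # measurement name -> list of values
for filename in sorted(glob.glob("my_directory/*.txt")):
    with open(filename) as f:
        reader = csv.reader(f, delimiter='\t')   # assumed tab-separated columns
        next(reader)                             # skip the header row
        for row in reader:
            if len(row) >= 2:
                data[row[0]].append(row[1])

with open("combined.txt", "w") as out:
    writer = csv.writer(out, delimiter='\t')
    for name, values in sorted(data.items()):
        writer.writerow([name] + values)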
I don't have comment privileges yet, therefore a separate answer.
jsbueno's answer works really well as long as you're sure that the same measurement IDs occur in every file (order is not important, but the sets should be equal!).
In the following situation:
file1:
measID,meas1
a,1
b,2
file2:
measID,meas1
a,3
b,4
c,5
you would get:
outfile:
measID,meas1,meas2
a,1,3
b,2,4
c,5
instead of the desired:
outfile:
measID,meas1,meas2
a,1,3
b,2,4
c,,5 # measurement c was missing in file1!
I'm using commas instead of spaces as delimiters for better visibility.
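One way to get the padded output is to give every measurement a fixed-width row up front, so an ID missing from some file simply leaves its slot empty. A sketch, sticking with the comma-delimited example above (the file names are made up):
import csv
from collections import defaultdict

paths = ["file1.csv", "file2.csv"]                 # made-up input names
rows = defaultdict(lambda: [""] * len(paths))      # one empty slot per input file

for idx, path in enumerate(paths):
    with open(path) as f:
        reader = csv.reader(f)                     # comma-delimited, as above
        next(reader)                               # skip the header
        for name, value in reader:
            rows[name][idx] = value

with open("outfile.csv", "w") as out:
    writer = csv.writer(out)
    writer.writerow(["measID"] + ["meas%d" % (i + 1) for i in range(len(paths))])
    for name in sorted(rows):
        writer.writerow([name] + rows[name])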

Grouping items by match set

I am trying to parse a large amount of configuration files and group the results into separate groups based by content - I just do not know how to approach this. For example, say I have the following data in 3 files:
config1.txt
ntp 1.1.1.1
ntp 2.2.2.2
config2.txt
ntp 1.1.1.1
config3.txt
ntp 2.2.2.2
ntp 1.1.1.1
config4.txt
ntp 2.2.2.2
The results would be:
Sets of unique data 3:
Set 1 (1.1.1.1, 2.2.2.2): config1.txt, config3.txt
Set 2 (1.1.1.1): config2.txt
Set 3 (2.2.2.2): config4.txt
I understand how to glob the directory of files, loop over the glob results and open each file in turn, and use a regex to match each line. The part I do not understand is how I could store these results and compare each file against a set of results, even if the entries are out of order but match entry-wise. Any help would be appreciated.
Thanks!
filenames = [ r'config1.txt',
              r'config2.txt',
              r'config3.txt',
              r'config4.txt' ]

results = {}
for filename in filenames:
    with open(filename, 'r') as f:
        contents = ( line.split()[1] for line in f )
        key = frozenset(contents)
        results.setdefault(key, []).append(filename)
from collections import defaultdict

# Load the data.
paths = ["config1.txt", "config2.txt", "config3.txt", "config4.txt"]
files = {}
for path in paths:
    with open(path) as file:
        for line in file.readlines():
            ...  # Get data from files
        files[path] = frozenset(data)

# Example data.
files = {
    "config1.txt": frozenset(["1.1.1.1", "2.2.2.2"]),
    "config2.txt": frozenset(["1.1.1.1"]),
    "config3.txt": frozenset(["2.2.2.2", "1.1.1.1"]),
    "config4.txt": frozenset(["2.2.2.2"]),
}

sets = defaultdict(list)
for key, value in files.items():
    sets[value].append(key)
Note you need to use frozensets as they are immutable, and hence can be used as dictionary keys. As they are not going to change, this is fine.
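For instance, a plain set would raise TypeError: unhashable type: 'set' if used as a key, while a frozenset works fine:
d = {}
d[frozenset(["1.1.1.1", "2.2.2.2"])] = ["config1.txt"]   # fine: frozenset is hashable
# d[set(["1.1.1.1"])] = ["config2.txt"]                  # TypeError: unhashable type: 'set'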
This alternative is more verbose than others, but it may be more efficient depending on a couple of factors (see my notes at the end). Unless you're processing a large number of files with a large number of configuration items, I wouldn't even consider using this over some of the other suggestions, but if performance is an issue this algorithm might help.
Start with a dictionary from the configuration strings to the file set (call it c2f), and from the file to the configuration strings set (f2c). Both can be built as you glob the files.
To be clear, c2f is a dictionary where the keys are strings and the values are sets of files. f2c is a dictionary where the keys are files and the values are sets of strings.
Loop over the file keys of f2c and pick one data item per file. Use c2f to find all files that contain that item. Those are the only files you need to compare.
Here's the working code:
# this structure simulates the file system and contents.
cfg_data = {
    "config1.txt": ["1.1.1.1", "2.2.2.2"],
    "config2.txt": ["1.1.1.1"],
    "config3.txt": ["2.2.2.2", "1.1.1.1"],
    "config4.txt": ["2.2.2.2"]
}

# Build the dictionaries (this is O(n) over the lines of configuration data)
f2c = dict()
c2f = dict()
for file, data in cfg_data.iteritems():
    data_set = set()
    for item in data:
        data_set.add(item)
        if not item in c2f:
            c2f[item] = set()
        c2f[item].add(file)
    f2c[file] = data_set

# build the results as a list of pairs of lists:
results = []
# track the processed files
processed = set()
for file, data in f2c.iteritems():
    if file in processed:
        continue
    size = len(data)
    equivalence_list = []
    # get one item from data, preferably the one used by the smallest list of
    # files.
    item = None
    item_files = 0
    for i in data:
        if item == None:
            item = i
            item_files = len(c2f[item])
        elif len(c2f[i]) < item_files:
            item = i
            item_files = len(c2f[i])
    # All files with the same data as f must have at least the first item of
    # data, just look at those files.
    for other_file in c2f[item]:
        other_data = f2c[other_file]
        if other_data == data:
            equivalence_list.append(other_file)
            # No need to visit these files again
            processed.add(other_file)
    results.append((data, equivalence_list))

# Display the results
for data, files in results:
    print data, ':', files
Adding a note on computational complexity: This is technically O((K log N)*(L log M)) where N is the number of files, M is the number of unique configuration items, K (<= N) is the number of groups of files with the same content and L (<= M) is the average number of files that have to be compared pairwise for each of the L processed files. This should be efficient if K << N and L << M.
I'd approach this like this:
First, get a dictionary like this:
{(1.1.1.1) : (file1, file2, file3), (2.2.2.2) : (file1, file3, file4) }
Then loop over the file generating the sets:
{(file1) : ((1.1.1.1), (2.2.2.2)), etc }
The compare the values of the sets.
if val(file1) == val(file3):
    Set1 = {(1.1.1.1), (2.2.2.2) : (file1, file2), etc }
This is probably not the fastest and most elegant solution, but it should work.
You need a dictionary mapping the contents of the files to the filename. So you have to read each file, sort the entries, build a tuple from them and use this as the key.
If you can have duplicate entries in a file: read the contents into a set first.
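A short sketch of that, with the example config files as made-up input; the set removes duplicate lines and sorted() makes the key order-independent:
groups = {}
for path in ["config1.txt", "config2.txt", "config3.txt", "config4.txt"]:
    with open(path) as f:
        entries = set(line.strip() for line in f if line.strip())
    key = tuple(sorted(entries))              # hashable, order-independent key
    groups.setdefault(key, []).append(path)

for contents, filenames in groups.items():
    print(contents, ":", filenames)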
