Grouping items by match set - python

I am trying to parse a large number of configuration files and sort the results into groups based on their content, but I do not know how to approach this. For example, say I have the following data in four files:
config1.txt
ntp 1.1.1.1
ntp 2.2.2.2
config2.txt
ntp 1.1.1.1
config3.txt
ntp 2.2.2.2
ntp 1.1.1.1
config4.txt
ntp 2.2.2.2
The results would be:
Sets of unique data: 3
Set 1 (1.1.1.1, 2.2.2.2): config1.txt, config3.txt
Set 2 (1.1.1.1): config2.txt
Set 3 (2.2.2.2): config4.txt
I understand how to glob the directory of files, loop over the glob results and open each file in turn, and use regex to match each line. The part I do not understand is how I could store these results and compare each file against a set of results, so that two files match when they contain the same entries even if those entries appear in a different order. Any help would be appreciated.
Thanks!

filenames = [ r'config1.txt',
              r'config2.txt',
              r'config3.txt',
              r'config4.txt' ]

results = {}
for filename in filenames:
    with open(filename, 'r') as f:
        contents = ( line.split()[1] for line in f )
        key = frozenset(contents)
        results.setdefault(key, []).append(filename)

from collections import defaultdict

# Load the data.
paths = ["config1.txt", "config2.txt", "config3.txt", "config4.txt"]
files = {}
for path in paths:
    with open(path) as file:
        for line in file.readlines():
            ...  # Get data from files
        files[path] = frozenset(data)

# Example data.
files = {
    "config1.txt": frozenset(["1.1.1.1", "2.2.2.2"]),
    "config2.txt": frozenset(["1.1.1.1"]),
    "config3.txt": frozenset(["2.2.2.2", "1.1.1.1"]),
    "config4.txt": frozenset(["2.2.2.2"]),
}

sets = defaultdict(list)
for key, value in files.items():
    sets[value].append(key)
Note you need to use frozensets as they are immutable, and hence can be used as dictionary keys. As they are not going to change, this is fine.

This alternative is more verbose than others, but it may be more efficient depending on a couple of factors (see my notes at the end). Unless you're processing a large number of files with a large number of configuration items, I wouldn't even consider using this over some of the other suggestions, but if performance is an issue this algorithm might help.
Start with a dictionary from the configuration strings to the file set (call it c2f), and from the file to the configuration strings set (f2c). Both can be built as you glob the files.
To be clear, c2f is a dictionary where the keys are strings and the values are sets of files. f2c is a dictionary where the keys are files, and the values are sets of strings.
Loop over the file keys of f2c and pick one data item from each file. Use c2f to find all files that contain that item. Those are the only files you need to compare.
Here's the working code:
# this structure simulates the file system and contents.
cfg_data = {
    "config1.txt": ["1.1.1.1", "2.2.2.2"],
    "config2.txt": ["1.1.1.1"],
    "config3.txt": ["2.2.2.2", "1.1.1.1"],
    "config4.txt": ["2.2.2.2"]
}

# Build the dictionaries (this is O(n) over the lines of configuration data)
f2c = dict()
c2f = dict()

for file, data in cfg_data.items():
    data_set = set()
    for item in data:
        data_set.add(item)
        if item not in c2f:
            c2f[item] = set()
        c2f[item].add(file)
    f2c[file] = data_set

# build the results as a list of pairs of lists:
results = []

# track the processed files
processed = set()

for file, data in f2c.items():
    if file in processed:
        continue
    size = len(data)
    equivalence_list = []
    # get one item from data, preferably the one used by the smallest list of
    # files.
    item = None
    item_files = 0
    for i in data:
        if item is None:
            item = i
            item_files = len(c2f[item])
        elif len(c2f[i]) < item_files:
            item = i
            item_files = len(c2f[i])
    # All files with the same data as f must have at least the first item of
    # data, so just look at those files.
    for other_file in c2f[item]:
        other_data = f2c[other_file]
        if other_data == data:
            equivalence_list.append(other_file)
            # No need to visit these files again
            processed.add(other_file)
    results.append((data, equivalence_list))

# Display the results
for data, files in results:
    print(data, ':', files)
Adding a note on computational complexity: This is technically O((K log N)*(L log M)) where N is the number of files, M is the number of unique configuration items, K (<= N) is the number of groups of files with the same content and L (<= M) is the average number of files that have to be compared pairwise for each of the K groups processed. This should be efficient if K << N and L << M.

I'd approach this like this:
First, get a dictionary like this:
{(1.1.1.1) : (file1, file2, file3), (2.2.2.2) : (file1, file3, file4) }
Then loop over the file generating the sets:
{(file1) : ((1.1.1.1), (2.2.2.2)), etc }
Then compare the values of the sets.
if val(file1) == val(file3):
Set1 = {(1.1.1.1), (2.2.2.2) : (file1, file2), etc }
This is probably not the fastest or most elegant solution, but it should work.
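A rough sketch of that idea, assuming the config lines look like "ntp <address>" as in the question; the final grouping step reuses the frozenset-as-key trick from the other answers instead of comparing files pairwise:
from collections import defaultdict

# Rough sketch, assuming "ntp <address>" lines as in the question.
item_to_files = defaultdict(set)   # {"1.1.1.1": {"config1.txt", ...}, ...}
file_to_items = defaultdict(set)   # {"config1.txt": {"1.1.1.1", "2.2.2.2"}, ...}
for filename in ["config1.txt", "config2.txt", "config3.txt", "config4.txt"]:
    with open(filename) as f:
        for line in f:
            item = line.split()[1]
            item_to_files[item].add(filename)
            file_to_items[filename].add(item)

# Group the files whose item sets compare equal.
groups = defaultdict(list)
for filename, items in file_to_items.items():
    groups[frozenset(items)].append(filename)

for i, (items, names) in enumerate(groups.items(), 1):
    print("Set", i, tuple(sorted(items)), ":", ", ".join(names))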

You need a dictionary mapping the contents of the files to the filename. So you have to read each file,
sort the entries, build a tuple from them and use this as a key.
If you can have duplicate entries in a file: read the contents into a set first.
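A minimal sketch of that idea, again assuming "ntp <address>" lines as in the question:
# Minimal sketch of the sorted-tuple-as-key idea; the set drops duplicate entries.
results = {}
for filename in ["config1.txt", "config2.txt", "config3.txt", "config4.txt"]:
    with open(filename) as f:
        entries = {line.split()[1] for line in f}
    key = tuple(sorted(entries))     # a sorted tuple is hashable and order-independent
    results.setdefault(key, []).append(filename)

for entries, names in results.items():
    print(entries, "->", names)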

Related

Converting JSON to CSV with part of JSON value as row headers

I have just started to learn Python and I have the task of converting a JSON file to a CSV file with a semicolon as the delimiter and with three constraints.
My JSON is:
{"_id": "5cfffc2dd866fc32fcfe9fcc",
"tuple5": ["system1/folder", "system3/folder"],
"tuple4": ["system1/folder/text3.txt", "system2/folder/text3.txt"],
"tuple3": ["system2/folder/text2.txt"],
"tuple2": ["system2/folder"],
"tuple1": ["system1/folder/text1.txt", "system2/folder/text1.txt"],
"tupleSize": 3}
The output CSV should be in a form:
system1 ; system2 ; system3
system1/folder ; ~ ; system3/folder
system1/folder/text3.txt ; system2/folder/text3.txt ; ~
~ ; system2/folder/text2.txt ; ~
~ ; system2/folder ; ~
system1/folder/text1.txt ; system2/folder/text1.txt ; ~
So the three constraints are: the tupleSize will indicate the number of rows, the first part of the array elements, i.e. system1, system2 and system3, will be the row headers, and finally only those elements belonging to a particular system will have values in the CSV file (the rest is ~).
I found a few posts regarding the conversion in Python, like this and this. None of them had constraints anything like these, and I am unable to figure out how to approach this.
Can someone help?
EDIT: I should mention that the array elements are dynamic and thus the row headers may vary in the CSV file.
What you want to do is fairly substantial, so if it's just a Python learning exercise, I suggest you begin with more elementary tasks.
I also think you've got what most folks call rows and columns reversed — so be warned that everything below, including the code, is using them in the opposite sense to the way you used them in your question.
Anyway, the code below first preprocesses the data to determine what the columns or fieldnames of the CSV file are going to be and to make sure there are the right number of them as specified by the 'tupleSize' key.
Assuming that constraint is met, it then iterates through the data a second time and extracts the column/field values from each key value, putting them into a dictionary whose contents represents a row to be written to the output file — and then does that when finished.
Updated
Modified to remove all keys that start with "_id" in the JSON object dictionary.
import csv
import json
import re

SEP = '/'  # Value sub-component separator.

id_regex = re.compile(r"_id\d*")

json_string = '''
{"_id1": "5cfffc2dd866fc32fcfe9fc1",
 "_id2": "5cfffc2dd866fc32fcfe9fc2",
 "_id3": "5cfffc2dd866fc32fcfe9fc3",
 "tuple5": ["system1/folder", "system3/folder"],
 "tuple4": ["system1/folder/text3.txt", "system2/folder/text3.txt"],
 "tuple3": ["system2/folder/text2.txt"],
 "tuple2": ["system2/folder"],
 "tuple1": ["system1/folder/text1.txt", "system2/folder/text1.txt"],
 "tupleSize": 3}
'''

data = json.loads(json_string)  # Convert JSON string into a dictionary.

# Remove non-path items from dictionary.
tupleSize = data.pop('tupleSize')
_ids = {key: data.pop(key)
        for key in tuple(data.keys()) if id_regex.search(key)}
#print(f'_ids: {_ids}')

max_columns = int(tupleSize)  # Used to check a constraint.

# Determine how many columns are present and what they are.
columns = set()
for key in data:
    paths = data[key]
    if not paths:
        raise RuntimeError('key with no paths')
    for path in paths:
        comps = path.split(SEP)
        if len(comps) < 2:
            raise RuntimeError('component with no subcomponents')
        columns.add(comps[0])

if len(columns) > max_columns:
    raise RuntimeError('too many columns - conversion aborted')

# Create CSV file.
with open('converted_json.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, delimiter=';', restval='~',
                            fieldnames=sorted(columns))
    writer.writeheader()
    for key in data:
        row = {}
        for path in data[key]:
            column, *_ = path.split(SEP, maxsplit=1)
            row[column] = path
        writer.writerow(row)

print('Conversion complete')

Comparing 2 huge (5-6 GB) csv files and count the number of matching and unmatched no. of rows

There are 2 huge (5-6 GB each) csv files. Now the objective is to compare both these files: how many rows are matching and how many rows are not?
Let's say file1.csv contains 5 identical lines; we need to count that as 1, not 5.
Similarly, for file2.csv if there are redundant data, we need to count it as 1.
I expect the output to display the number of rows that are matching and the no. of rows that are different.
I have written a file comparer in Python that can efficiently compare huge files and report the counts of matching and differing lines. Replace input_file1 and input_file2 with your 2 large files and run it. Let me know the results.
input_file1 = r'input_file.txt'
input_file2 = r'input_file.1.txt'

__author__ = 'https://github.com/praveen-kumar-rr'

# Simple, memory-efficient, high-performance file comparer.
# Can be used to efficiently compare large files.
# Algorithm:
#   The lines are hashed and the hashes are compared first.
#   Non-matching hashes are counted as different lines.
#   For the matching hashes, the exact lines are read back from the files
#   and undergo the same comparison process on the strings themselves.


def accumulate_index(values):
    '''
    Returns dict like key: [indexes]
    '''
    result = {}
    for i, v in enumerate(values):
        indexes = result.get(v, [])
        result[v] = indexes + [i]
    return result


def get_lines(fp, line_numbers):
    '''
    Reads lines from the file pointer based on the line_numbers list of indexes
    '''
    return (v for i, v in enumerate(fp) if i in line_numbers)


def get_match_diff(left, right):
    '''
    Compares the left and right iterables and returns the different and matching items
    '''
    left_set = set(left)
    right_set = set(right)
    return left_set ^ right_set, left_set & right_set


if __name__ == '__main__':
    # Gets hashes of all lines for both files
    dict1 = accumulate_index(map(hash, open(input_file1)))
    dict2 = accumulate_index(map(hash, open(input_file2)))

    diff_hashes, matching_hashes = get_match_diff(
        dict1.keys(), dict2.keys())

    diff_lines_count = len(diff_hashes)
    matching_lines_count = 0

    for h in matching_hashes:
        with open(input_file1) as fp1, open(input_file2) as fp2:
            left_lines = get_lines(fp1, dict1[h])
            right_lines = get_lines(fp2, dict2[h])
            d, m = get_match_diff(left_lines, right_lines)
            diff_lines_count += len(d)
            matching_lines_count += len(m)

    print('Total number of matching lines is : ', matching_lines_count)
    print('Total number of different lines is : ', diff_lines_count)
I hope this algorithm works:
create a hash of every line in both files
now create a set of those hashes
take the difference and intersection of those sets.
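A minimal sketch of that idea, with hypothetical file names; duplicate lines within a file count once, as the question asks, and hash collisions are theoretically possible (the longer answer above rechecks the actual lines for that reason):
# Minimal sketch of the hash-set idea; duplicate lines within a file count once.
def line_hashes(path):
    with open(path) as f:
        return {hash(line) for line in f}

hashes1 = line_hashes("file1.csv")   # hypothetical file names
hashes2 = line_hashes("file2.csv")
print("matching rows:", len(hashes1 & hashes2))
print("different rows:", len(hashes1 ^ hashes2))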

How to parse very big files in python?

I have a very big tsv file: 1.5 GB. I want to parse this file. I'm using the following function:
def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = {}
    for row in eval:
        ids = row.split("\t")
        if ids[0] not in evalIDs.keys():
            evalIDs[ids[0]] = []
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs
It has been running for more than 10 hours and it is still working. I don't know how to accelerate this step, or whether there is another method to parse such a file.
several issues here:
testing for keys with if ids[0] not in evalIDs.keys() takes forever in Python 2, because keys() is a list. .keys() is rarely useful anyway. A better way already is if ids[0] not in evalIDs, but there's more:
why not use a collections.defaultdict instead?
why not use csv module?
overriding eval built-in (well, not really an issue seeing how dangerous it is)
my proposal:
import csv, collections

def readEvalFileAsDictInverse(evalFile):
    with open(evalFile, "r") as handle:
        evalIDs = collections.defaultdict(list)
        cr = csv.reader(handle, delimiter='\t')
        for ids in cr:
            evalIDs[ids[0]].append(ids[1])
        return evalIDs
The magic evalIDs[ids[0]].append(ids[1]) creates a list if one doesn't already exist. It's also portable and very fast whatever the Python version, and it saves an if.
I don't think it could get much faster with the standard libraries alone, but a pandas solution probably would be.
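For instance, a rough sketch of that pandas route, assuming evalFile is the path from the question and the file is a headerless TSV (so the columns get the integer labels 0 and 1):
import pandas as pd

# Rough sketch of the pandas route, assuming a headerless TSV with at least two columns.
df = pd.read_csv(evalFile, sep="\t", header=None, dtype=str)
evalIDs = df.groupby(0)[1].apply(list).to_dict()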
Some suggestions:
Use a defaultdict(list) instead of creating inner lists yourself or using dict.setdefault().
dict.setdefault() will create the default value every time, that's a time burner - defaultdict(list) does not - it is optimized:
from collections import defaultdict

def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = defaultdict(list)
    for row in eval:
        ids = row.split("\t")
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs
If your keys are valid file names, you might want to investigate awk for much more performance than doing this in Python.
Something along the lines of
awk -F $'\t' '{print > $1}' file1
will create your split files much faster and you can simply use the latter part of the following code to read from each file (assuming your keys are valid filenames) to construct your lists. (Attribution: here) - You would need to grab your created files with os.walk or similar means. Each line inside the files will still be tab-separated and contain the ID in front.
If your keys are not filenames in their own right, consider storing your different lines into different files and only keep a dictionary of key,filename around.
After splitting the data, load the files as lists again:
Create testfile:
with open ("file.txt","w") as w:
w.write("""
1\ttata\ti
2\tyipp\ti
3\turks\ti
1\tTTtata\ti
2\tYYyipp\ti
3\tUUurks\ti
1\ttttttttata\ti
2\tyyyyyyyipp\ti
3\tuuuuuuurks\ti
""")
Code:
# f.e. https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename
def make_filename(k):
    """In case your keys contain non-filename-characters, make it a valid name"""
    return k  # assuming k is a valid file name, else modify it

evalFile = "file.txt"
files = {}

with open(evalFile, "r") as eval_file:
    for line in eval_file:
        if not line.strip():
            continue
        key, value, *rest = line.split("\t")  # omit ,*rest if you only have 2 values
        fn = files.setdefault(key, make_filename(key))
        # this will open and close files _a lot_ - you might want to keep file handles
        # instead in your dict - but that depends on the key/data/lines ratio in
        # your data - if you have few keys, file handles ought to be better, if you
        # have many it does not matter
        with open(fn, "a") as f:
            f.write(value + "\n")

# create your list data from your files:
data = {}
for key, fn in files.items():
    with open(fn) as r:
        data[key] = [x.strip() for x in r]

print(data)
Output:
# for my data: loaded from files called '1', '2' and '3'
{'1': ['tata', 'TTtata', 'tttttttata'],
'2': ['yipp', 'YYyipp', 'yyyyyyyipp'],
'3': ['urks', 'UUurks', 'uuuuuuurks']}
Change evalIDs to a collections.defaultdict(list). You can avoid the if to check if a key is there.
Consider splitting the file externally using split(1), or even inside Python using a read offset. Then use multiprocessing.Pool to parallelise the loading.
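A rough sketch of that idea, assuming the big file has already been split into chunks named chunk_aa, chunk_ab, ... (the names split(1) produces by default):
import glob
from collections import defaultdict
from multiprocessing import Pool

def load_chunk(path):
    # Build a partial dictionary for one chunk.
    d = defaultdict(list)
    with open(path) as f:
        for line in f:
            ids = line.rstrip("\n").split("\t")
            d[ids[0]].append(ids[1])
    return d

if __name__ == "__main__":
    with Pool() as pool:
        partials = pool.map(load_chunk, sorted(glob.glob("chunk_*")))
    # Merge the partial dictionaries.
    evalIDs = defaultdict(list)
    for partial in partials:
        for key, values in partial.items():
            evalIDs[key].extend(values)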
Maybe you can make it somewhat faster; change this:
if ids[0] not in evalIDs.keys():
    evalIDs[ids[0]] = []
evalIDs[ids[0]].append(ids[1])
to
evalIDs.setdefault(ids[0],[]).append(ids[1])
The first solution searches the "evalIDs" dictionary 3 times.

Combining files in python using

I am attempting to combine a collection of 600 text files, each line looks like
Measurement title Measurement #1
ebv-miR-BART1-3p 4.60618701
....
evb-miR-BART1-200 12.8327289
with 250 or so rows in each file. Each file is formatted that way, with the same data headers. What I would like to do is combine the files such that it looks like this
Measurement title Measurement #1 Measurement #2
ebv-miR-BART1-3p 4.60618701 4.110878867
....
evb-miR-BART1-200 12.8327289 6.813287556
I was wondering if there is an easy way in python to strip out the second column of each file, then append it to a master file? I was planning on pulling each line out, then using regular expressions to look for the second column, and appending it to the corresponding line in the master file. Is there something more efficient?
It is a small amount of data for today's desktop computers (around 150000 measurements) - so keeping everything in memory and dumping it to a single file will be easier than any other strategy. If it would not fit in RAM, maybe using SQL would be a nice approach there -
but as it is, you can create a single default dictionary where each element is a list -
read all your files, collect the measurements into this dictionary, and dump it to disk -
# create default list dictionary:
>>> from collections import defaultdict
>>> data = defaultdict(list)
# Read your data into it:
>>> from glob import glob
>>> import csv
>>> for filename in glob("my_directory/*csv"):
...     reader = csv.reader(open(filename))
...     # throw away header row:
...     next(reader)
...     for name, value in reader:
...         data[name].append(value)
...
>>> # and record everything down in another file:
...
>>> mydata = open("mydata.csv", "wt")
>>> writer = csv.writer(mydata)
>>> for name, values in sorted(data.items()):
...     writer.writerow([name] + values)
...
>>> mydata.close()
>>>
Use the csv module to read the files in, create a dictionary of the measurement names, and make the values in the dictionary a list of the values from the file.
I don't have comment privileges yet, therefore a separate answer.
jsbueno's answer works really well as long as you're sure that the same measurement IDs occur in every file (order is not important, but the sets should be equal!).
In the following situation:
file1:
measID,meas1
a,1
b,2
file2:
measID,meas1
a,3
b,4
c,5
you would get:
outfile:
measID,meas1,meas2
a,1,3
b,2,4
c,5
instead of the desired:
outfile:
measID,meas1,meas2
a,1,3
b,2,4
c,,5 # measurement c was missing in file1!
I'm using commas instead of spaces as delimiters for better visibility.
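If that situation can occur, one workaround (a rough sketch with hypothetical file names and directory) is to remember which file each value came from and pad the gaps when writing:
import csv
from glob import glob

filenames = sorted(glob("my_directory/*.csv"))   # hypothetical location
table = {}                                       # measID -> {filename: value}
for filename in filenames:
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        next(reader)                             # skip the header row
        for meas_id, value in reader:
            table.setdefault(meas_id, {})[filename] = value

with open("mydata.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["measID"] + filenames)
    for meas_id in sorted(table):
        row = table[meas_id]
        writer.writerow([meas_id] + [row.get(fn, "") for fn in filenames])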

Shuffle the records of a list of text files in one single file

I have a list of text files file1.txt, file2.txt, file3.txt .. filen.txt that I need to shuffle, creating one single big file as a result*.
Requirements:
1. The records of a given file need to be reversed before being shuffled
2. The records of a given file should keep the reversed order in the destination file
3. I don't know how many files I need to shuffle so the code should be generic as possible (allowing to declare the file names in a list for example)
4. Files could have different sizes
Example:
File1.txt
---------
File1Record1
File1Record2
File1Record3
File1Record4
File2.txt
---------
File2Record1
File2Record2
File3.txt
---------
File3Record1
File3Record2
File3Record3
File3Record4
File3Record5
the output should be something like this:
ResultFile.txt
--------------
File3Record5 -|
File2Record2 |
File1Record4 |
File3Record4 -|
File2Record1 |
File1Record3 |-->File3 records are shuffled with the other records and
File3Record3 -| are correctly "reversed" and they kept the correct
File1Record2 | ordering
File3Record2 -|
File1Record1 |
File3Record1 -|
* I'm not crazy; I have to import these files (blog posts) using the resultfile.txt as input
EDIT:
the result can have any order you want: completely or partially shuffled, uniformly interleaved, it does not matter. What does matter is that points 1. and 2. are both honoured.
What about this:
>>> l = [["1a","1b","1c","1d"], ["2a","2b"], ["3a","3b","3c","3d","3e"]]
>>> while l:
... x = random.choice(l)
... print x.pop(-1)
... if not x:
... l.remove(x)
1d
1c
2b
3e
2a
3d
1b
3c
3b
3a
1a
You could optimize it in various ways, but that's the general idea. This also works if you cannot read the files at once but need to iterate them because of memory restrictions. In that case
read a line from the file instead of popping from a list
check for EOF instead of empty lists, as in the sketch below
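A rough sketch of that streaming variant, assuming each file has already been reversed on disk (for example with tac file1.txt > file1_rev.txt), since a plain text file can't cheaply be read backwards:
import random

paths = ["file1_rev.txt", "file2_rev.txt", "file3_rev.txt"]   # hypothetical names
handles = [open(p) for p in paths]
with open("ResultFile.txt", "w") as out:
    while handles:
        f = random.choice(handles)
        line = f.readline()
        if line:                      # still data left in this file
            out.write(line)
        else:                         # EOF: stop considering this file
            f.close()
            handles.remove(f)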
you could try the following: in a first step you zip the reversed() items of the lists with itertools.zip_longest() (plain zip() would truncate to the shortest file):
zipped = itertools.zip_longest(reversed(lines1), reversed(lines2), reversed(lines3))
then you can concatenate the items in zipped again:
lst = []
for triple in zipped:
    lst.extend(triple)
finally you have to remove all the Nones that zip_longest() inserted as padding:
lst = [line for line in lst if line is not None]
import random

filenames = [ 'filename0', ... , 'filenameN' ]
files = [ open(fn, 'r') for fn in filenames ]
lines = [ f.readlines() for f in files ]
output = open('output', 'w')
while len(lines) > 0:
    l = random.choice( lines )
    if len(l) == 0:
        lines.remove(l)
    else:
        output.write( l.pop() )
output.close()
One bit may seem magical here: the lines read from the files don't need reversing, because when we write them to the output file we use list.pop(), which takes items from the end of the list (here, the end of the file's contents).
A simple solution might be to create a list of lists, and then pop a line off a random list until they're all exhausted:
>>> import random
>>> filerecords = [['File{0}Record{1}'.format(i, j) for j in range(5)] for i in range(5)]
>>> concatenation = []
>>> while any(filerecords):
...     selection = random.choice(filerecords)
...     if selection:
...         concatenation.append(selection.pop())
...     else:
...         filerecords.remove(selection)
...
>>> concatenation
['File1Record4', 'File3Record4', 'File0Record4', 'File0Record3', 'File0Record2',
'File4Record4', 'File0Record1', 'File3Record3', 'File4Record3', 'File0Record0',
'File4Record2', 'File2Record4', 'File4Record1', 'File3Record2', 'File4Record0',
'File2Record3', 'File1Record3', 'File2Record2', 'File2Record1', 'File3Record1',
'File3Record0', 'File1Record2', 'File2Record0', 'File1Record1', 'File1Record0']
I strongly recommend investing some time to read Generator Tricks for Systems Programmers (PDF). It's from a presentation at PyCon 08 and it deals specifically with processing arbitrarily large log files. The reversal aspect is an interesting wrinkle, but the rest of the presentation should speak directly to your problem.
filelist = (
    'file1.txt',
    'file2.txt',
    'file3.txt',
)

all_records = []
max_records = 0
for f in filelist:
    fp = open(f, 'r')
    records = fp.readlines()
    if len(records) > max_records:
        max_records = len(records)
    records.reverse()
    all_records.append(records)
    fp.close()
all_records.reverse()

res_fp = open('result.txt', 'w')
for i in range(max_records):
    for records in all_records:
        try:
            res_fp.write(records[i])
        except IndexError:
            pass
res_fp.close()
I'm not a python zen master, but here's my take.
import random

# You have to read everything into a list from at least one of the files.
fin = open("filename1", "r").readlines()
# tuple of all of the other files.
fls = ( open("filename2", "r"),
        open("filename3", "r"), )

for fl in fls:  # iterate through the tuple
    curr = 0
    clen = len(fin)
    for line in fl:  # iterate through a file.
        # If we're at the end or 1 is randomly chosen, insert at the current position.
        if curr > clen or round(random.random()):
            fin.insert(curr, line)
            clen = len(fin)
        curr += 1  # increment current index.

# when you're *done*, reverse. It's easier.
fin.reverse()
Unfortunately with this it becomes obvious that this is a weighted distribution. This can be fixed by calculating the length of each of the files and weighting the call to random by a probability based on that. I'll see if I can't provide that at some later point.
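For what it's worth, a rough sketch of that weighting idea using random.choices (Python 3.6+): picking the next source with probability proportional to its remaining length makes every interleaving equally likely, while pop() keeps each list's own (reversed) order:
import random

lists = [["1a", "1b", "1c", "1d"], ["2a", "2b"], ["3a", "3b", "3c", "3d", "3e"]]
merged = []
while any(lists):
    weights = [len(l) for l in lists]            # empty lists get weight 0
    chosen = random.choices(lists, weights=weights, k=1)[0]
    merged.append(chosen.pop())                  # pop from the end, as above
print(merged)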
A possible merging function is available in the standard library. It's intended to merge sorted lists to make sorted combined lists; garbage in, garbage out, but it does have the desired property of maintaining sublist order no matter what.
def merge_files(output, *inputs):
    # all parameters are opened files with appropriate modes.
    from heapq import merge
    for line in merge(*(reversed(tuple(input)) for input in inputs)):
        output.write(line)
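A possible usage, with the example file names from the question:
# Hypothetical usage with the example files from the question.
inputs = [open(name) for name in ("File1.txt", "File2.txt", "File3.txt")]
with open("ResultFile.txt", "w") as out:
    merge_files(out, *inputs)
for f in inputs:
    f.close()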
