I have a folder with many text files (ex: 164400). Each file has several lines (ex: x,y,z) in numeric floating-point format. My code reads a group of 3000 files at a time and stores the values from the lines in a dictionary (see example).
The code is quite slow when opening 3000 files.
[[points_dict[os.path.split(x)[1]].append(p) for p in open(x,"r")] for x in lf]
I wish to know if someone has a more efficient and faster approach to reading the files.
file_folder = "C:\\junk" #where i stored my file
points_dict = defaultdict(list)
groups = groupby(file_folder, key=lambda k, line=count(): next(line) // 3000)
for k, group in groups:
lf = [p for p in group]
[[points_dict[os.path.split(x)[1]].append(p) for p in open(x,"r")] for x in lf]
# do other
where **os.path.split(x)[1]** gives the file name (id) used as the dictionary key, so lines from files with the same name end up together, and **lf** is the list of files to open.
What about using numpy? Something along these lines (edited answer, tested code):
import numpy

[points_dict[os.path.split(x)[1]].append(numpy.loadtxt(x, delimiter=",")) for x in lf]
for x, np_arrays in points_dict.items():
    points_dict[x] = numpy.vstack(np_arrays)
At the end you get the points in a nice numpy array.
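For reference, here is a minimal end-to-end sketch of that approach (the folder, file pattern and chunk size are placeholders; it assumes each file holds comma-separated x,y,z floats with no header):

import os
from collections import defaultdict
from glob import glob

import numpy as np

file_folder = "C:\\junk"                          # placeholder folder
all_files = glob(os.path.join(file_folder, "*"))  # the files to read
points_dict = defaultdict(list)

for start in range(0, len(all_files), 3000):      # process 3000 files at a time
    for path in all_files[start:start + 3000]:
        # one (n_lines, 3) float array per file
        points_dict[os.path.split(path)[1]].append(np.loadtxt(path, delimiter=","))

for name, arrays in points_dict.items():
    points_dict[name] = np.vstack(arrays)         # one stacked array per file name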
Related
I'm writing a program that takes .dat files from a directory one at a time, verifies some condition, and if the verification passes copies the file to another directory.
The code below shows how I import the files and create a list of lists. I'm having trouble with the verification step: I tried a for loop, but once I set the if condition I can't seem to operate on the elements of the list of lists.
In particular I need the difference between consecutive elements matrix[i][3] and matrix[i+1][3] to be less than 5.
for filename in glob.glob(os.path.join(folder_path, '*.dat')):
    with open(filename, 'r') as f:
        matrix = []
        data = f.readlines()
        for raw_line in data:
            split_line1 = raw_line.replace(":", ";")
            split_line2 = split_line1.replace("\n", "")
            split_line3 = split_line2.strip().split(";")
            matrix.append(split_line3)
Hello and welcome to Stack Overflow.
You did not provide a sample of your data files. After looking at your code, I assume your data looks like this:
9;9;7;5;0;9;5;8;4;2
9;1;1;5;1;3;4;1;8;7
2;8;4;5;5;2;1;4;6;4
6;4;1;5;5;8;1;4;6;1
0;1;0;5;7;1;7;4;1;9
4;9;6;5;3;2;6;2;9;6
8;0;6;0;8;9;3;1;6;6
A few general remarks:
For parsing a csv file, use the csv module. It is easy to use and less error-prone than writing your own parser.
If you do a lot of data-processing and matrix calculations, you want to have a look at the pandas and numpy libraries. Processing matrices line by line in plain Python is slower by some orders of magnitude.
I understand your description of the verification step as follows:
A matrix matches if all consecutive elements
matrix[i][3] and matrix[i+1][3] differ by less than 5.
My suggested code looks like this:
import csv
from glob import glob
from pathlib import Path

def read_matrix(fn):
    with open(fn) as f:
        c = csv.reader(f, delimiter=";")
        m = [[float(c) for c in row] for row in c]
    return m

def verify_condition(matrix):
    col = 3
    pairs_of_consecutive_rows = zip(matrix[:-1], matrix[1:])
    for row_i, row_j in pairs_of_consecutive_rows:
        if abs(row_i[col] - row_j[col]) >= 5:
            return False
    return True

if __name__ == '__main__':
    folder_path = Path("../data")
    for filename in glob(str(folder_path / '*.dat')):
        print(f"processing {filename}")
        matrix = read_matrix(filename)
        matches = verify_condition(matrix)
        if matches:
            print("match")
            # copy_file(filename, target_folder)
I am not going into detail about the function read_matrix. Just note that I convert the strings to float with the statement float(c) in order to be able to do numerical calculations later on.
I iterate over all consecutive rows by iterating over matrix[:-1] and matrix[1:] at the same time using zip. See the effect of zip in this example:
>>> list(zip("ABC", "XYZ"))
[('A', 'X'), ('B', 'Y'), ('C', 'Z')]
And the effect of the [:-1] and [1:] indices here:
>>> "ABC"[:-1], "ABC"[1:]
('AB', 'BC')
When verify_condition finds the first pair of consecutive rows whose column 3 values differ by at least 5, it returns False.
I am confident that this code should help you get going.
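As a follow-up to the numpy remark above: the same check can be done without an explicit Python loop. A minimal sketch, assuming the rows are purely numeric and semicolon-separated as in the sample above, and that the file has at least two rows:

import numpy as np

def verify_condition_np(filename):
    m = np.loadtxt(filename, delimiter=";")  # whole matrix as a 2-D float array
    diffs = np.abs(np.diff(m[:, 3]))         # |matrix[i+1][3] - matrix[i][3]| for every i
    return bool(np.all(diffs < 5))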
PS: I could not resist using the pathlib library because I really prefer to see code like folder / subfolder / "filename.txt" instead of os.path.join(folder, subfolder, "filename.txt") in my scripts.
I have a list of files; I split each file into a list of words and counted how many times each word occurs in the file. My tuple now looks like:
[('fileName1', [('word1', n), ('word2', n), ('word3', n), ('word4', n)]),
 ('fileName2', [('word1', n), ('word2', n), ('word3', n), ('word4', n)]),
 ...
 ('fileNameM', [('word1', n), ('word2', n), ('word3', n), ('word4', n)])]
(n - the number of occurrences of the word in the file)
Now I need to divide each n (for each fileName) by the total number of words in that file (tn - total number), like:
[('fileNameM', [('word1', n/tn), ('word2', n/tn), ('word3', n/tn), ('word4', n/tn)])]
without changing the structure of the tuple.
I managed to create new RDD like:
[(('fileNameM', [('word1', n), ('word2', n), ('word3', n), ('word4', n)]), tn), ...]
by
newRDD = oldRDD.map(lambda f: (f, sum(n for w, n in f[1])))
But if you look closely, the structure of the tuple has changed significantly, so it doesn't solve my issue.
I tried this one but still have troubles:
newRDD = oldRDD.map(lambda f: (f[0], [w,n/sum(n for w,n in f[1]) for w,n in f[1]]))
Many thanks
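For what it's worth, a hedged sketch of the kind of fix this is aiming at (untested; it assumes the elements really are (fileName, [(word, n), ...]) pairs and computes the per-file total once instead of re-summing it for every word):

def normalize(record):
    name, counts = record
    total = float(sum(n for _, n in counts))  # tn: total number of words in this file
    return (name, [(w, n / total) for w, n in counts])

newRDD = oldRDD.map(normalize)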
I have a device that stores three data sets in a .DAT file, they always have the same heading and number of columns, but the number of rows vary.
They are (n x 4), (m x 4), (L x 3).
I need to extract the three data sets into separate arrays for plotting.
I have been trying to use numpy.genfromtxt and numpy.loadtxt, but the only way I can get them to work for this format is to manually define the row at which each data set starts.
As I will regularly need to deal with this format I have been trying to automate it.
If someone could suggest a method which might work I would greatly appreciate it. I have attached an example file.
example file
Just a quick and dirty solution. With files of your size you might run into performance issues; if you know m, n and L, initialize the output vectors with the respective lengths.
Here is the strategy: load the whole file into a variable and read it line by line. As soon as you discover a keyword, raise a flag that you are in that specific block. From the next line on, read the values out into the correct variables.
isblock1 = isblock2 = isblock3 = False
fout = []  # construct also all the other variables that you want to collect

with open(file, 'r') as fh:
    lines = fh.readlines()  # read all the lines

for line in lines:
    if not line.strip():
        continue  # skip empty lines between blocks
    # keyword lines switch the current block; they are not parsed as data
    if 'Frequency' in line:
        isblock1 = True
        isblock2 = isblock3 = False
        continue
    if 'Phasor' in line:
        isblock2 = True
        isblock1 = isblock3 = False
        continue
    if 'Voltage' in line:
        isblock3 = True
        isblock1 = isblock2 = False
        continue
    if isblock1:
        (f, psd, ipj, itj) = line.split()
        fout.append(f)  # do this also with the other variables
    elif isblock2:
        (t1, p1, p2, p12) = line.split()
    elif isblock3:
        (t2, v1, v2) = line.split()
Hope that helps.
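If the end goal is separate numpy arrays for plotting, a small follow-up sketch (assuming fout and any other lists were collected as above; psd_list is a hypothetical name for one of them):

import numpy as np

f_arr = np.array(fout, dtype=float)          # block 1 frequency column as a float array
# psd_arr = np.array(psd_list, dtype=float)  # same pattern for the other collected columns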
I have multiple CSV files in a folder, which I want to compare and print the matching rows (where the number of columns could be different). I know how to get duplicates within a file but this case is a little different. Let's say there are two files in a folder and I want to compare them.
CSV1:
H1,H2,H4
C01,23,F
C2,45,M
CSV2:
H1,H2,H3,H4
C01,23,data,F
C01,23,some other data,M
C4,34,data,M
I need my output to check if all the available data (from the one with the least number of columns) matches exactly in another file in the same folder. My output could be like
CSV1,CSV2 (H1:C01,H2:23,H4:F(H3:data))
What about something like:
import csv

def duplines(csv_least_cols, csv_most_cols):
    rowset = set()
    with open(csv_least_cols) as csv1:
        r = csv.reader(csv1)
        csv1_cols = next(r)  # header row of the narrower file
        for row in r:
            rowset.add(tuple(row))
    with open(csv_most_cols) as csv2:
        dr = csv.DictReader(csv2)
        for drow in dr:
            refcols = tuple(drow[c] for c in csv1_cols)
            if refcols in rowset:
                yield csv1_cols, refcols, drow
You can call this in a loop and perform whatever formatting you want -- this generator deals with the underlying logic, separating out the formatting task to its caller.
So for example to get your peculiar desired CSV1,CSV2 (H1:C01,H2:23,H4:F(H3:data)) style output you could have...:
def formatit(csv_least, csv_most):
    out_start = '{},{} ('.format(csv_least, csv_most)
    for c1cols, refvals, c2dict in duplines(csv_least, csv_most):
        out_middle = []
        for c, v in zip(c1cols, refvals):
            out_middle.append('{}:{}'.format(c, v))
        out_end = []
        for c in c2dict:
            if c in c1cols:
                continue
            out_end.append('{}:{}'.format(c, c2dict[c]))
        out = '{}{}({}))'.format(out_start, ','.join(out_middle), ','.join(out_end))
        print(out)
You'll notice that the formatting work is substantially more complex than the actual logic (and hence more likely to hide bugs:-) which is why I call your desired format "peculiar".
But I hope this can at least get you started (and you can try out each function separately, making sure the logic is as you desire it before worrying about the formatting:-).
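As a usage sketch (the folder name and file pattern are placeholders): compare every pair of CSV files in a folder, always passing the file with fewer columns first:

import csv
import glob
import itertools
import os

def column_count(path):
    with open(path) as f:
        return len(next(csv.reader(f)))  # number of columns in the header row

files = glob.glob(os.path.join("some_folder", "*.csv"))  # placeholder folder
for a, b in itertools.combinations(files, 2):
    least, most = sorted((a, b), key=column_count)
    formatit(least, most)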
I have an interesting problem.
I have a very large (larger than 300MB, more than 10,000,000 lines/rows in the file) CSV file with time series data points inside. Every month I get a new CSV file that is almost the same as the previous file, except for a few new lines have been added and/or removed and perhaps a couple of lines have been modified.
I want to use Python to compare the 2 files and identify which lines have been added, removed and modified.
The issue is that the file is very large, so I need a solution that can handle the large file size and execute efficiently within a reasonable time, the faster the better.
Example of what a file and its new file might look like:
Old file
A,2008-01-01,23
A,2008-02-01,45
B,2008-01-01,56
B,2008-02-01,60
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,9
etc...
New file
A,2008-01-01,23
A,2008-02-01,45
A,2008-03-01,67 (added)
B,2008-01-01,56
B,2008-03-01,33 (removed and added)
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,22 (modified)
etc...
Basically the 2 files can be seen as matrices that need to be compared, and I have begun thinking of using PyTable. Any ideas on how to solve this problem would be greatly appreciated.
Like this.
Step 1. Sort.
Step 2. Read each file, doing line-by-line comparison. Write differences to another file.
You can easily write this yourself. Or you can use difflib. http://docs.python.org/library/difflib.html
Note that the general solution is quite slow as it searches for matching lines near a difference. Writing your own solution can run faster because you know things about how the files are supposed to match. You can optimize that "resynch-after-a-diff" algorithm.
And 10,000,000 lines hardly matters. It's not that big. Two 300 MB files easily fit into memory.
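For reference, a minimal sketch of the difflib route (the file names are placeholders; as noted above, it assumes both files fit in memory):

import difflib

with open("old.csv") as f_old, open("new.csv") as f_new:
    old_lines = f_old.readlines()
    new_lines = f_new.readlines()

# n=0 suppresses context lines: '-' lines were removed, '+' lines were added;
# a modified row shows up as a '-' line followed by a '+' line.
for line in difflib.unified_diff(old_lines, new_lines,
                                 fromfile="old.csv", tofile="new.csv", n=0):
    print(line, end="")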
This is a little bit of a naive implementation but will deal with unsorted data:
import csv

file1_dict = {}
file2_dict = {}
with open('file1.csv') as handle:
    for row in csv.reader(handle):
        file1_dict[tuple(row[:2])] = tuple(row[2:])  # store values as a tuple so they concatenate with the key
with open('file2.csv') as handle:
    for row in csv.reader(handle):
        file2_dict[tuple(row[:2])] = tuple(row[2:])
with open('outfile.csv', 'w') as handle:
    writer = csv.writer(handle)
    for key, val in file1_dict.items():
        if key in file2_dict:
            # deal with keys that are in both
            if file2_dict[key] == val:
                writer.writerow(key + val + ('Same',))
            else:
                writer.writerow(key + file2_dict[key] + ('Modified',))
            file2_dict.pop(key)
        else:
            writer.writerow(key + val + ('Removed',))
    # deal with added keys!
    for key, val in file2_dict.items():
        writer.writerow(key + val + ('Added',))
You probably won't be able to "drop in" this solution, but it should get you ~95% of the way there. @S.Lott is right, two 300 MB files will easily fit in memory ... if your files get into the 1-2 GB range then this may have to be modified with the assumption of sorted data.
Something like this is close ... although you may have to change the comparisons around for the added and modified cases to make sense:
# assuming both files are sorted by columns 1 and 2
import csv
import datetime

def str2date(s):
    return datetime.date(*map(int, s.split('-')))

def convert_tups(row):
    key = (row[0], str2date(row[1]))
    val = tuple(row[2:])
    return key, val

with open('file1.csv') as handle1, \
     open('file2.csv') as handle2, \
     open('outfile.csv', 'w') as outhandle:
    writer = csv.writer(outhandle)
    gen1 = map(convert_tups, csv.reader(handle1))
    gen2 = map(convert_tups, csv.reader(handle2))
    gen2key, gen2val = next(gen2)
    for gen1key, gen1val in gen1:
        if gen1key == gen2key and gen1val == gen2val:
            writer.writerow(gen1key + gen1val + ('Same',))
            gen2key, gen2val = next(gen2)
        elif gen1key == gen2key and gen1val != gen2val:
            writer.writerow(gen2key + gen2val + ('Modified',))
            gen2key, gen2val = next(gen2)
        elif gen1key > gen2key:
            while gen1key > gen2key:
                writer.writerow(gen2key + gen2val + ('Added',))
                gen2key, gen2val = next(gen2)
        else:
            writer.writerow(gen1key + gen1val + ('Removed',))