What's the fastest way to merge multiple csv files by column? - python

I have about 50 CSV files with 60,000 rows in each, and a varying number of columns. I want to merge all the CSV files by column. I've tried doing this in MATLAB by transposing each csv file and re-saving to disk, and then using the command line to concatenate them. This took my computer over a week and the final result still needs to be transposed once again! I have to do this again, and I'm looking for a solution that won't take another week. Any help would be appreciated.

[...] transposing each csv file and re-saving to disk, and then using the command line to concatenate them [...]
Sounds like Transpose-Cat-Transpose. Use paste for joining files horizontally.
paste -d ',' a.csv b.csv c.csv ... > result.csv

The Python csv module can be set up so that each record is a dictionary with the column names as keys. That way you should be able to read all the files in as dictionaries and write them out to a file that has all the columns.
Python is easy to use, so this should be fairly trivial for a programmer of any language.
If your CSV files don't have column headings, this will be quite a lot of manual work, though, so then it's perhaps not the best solution.
Since these files are fairly big, it's best not to read all of them into memory at once. I'd recommend that you first open them only to collect all column names into a list, and use that list to create the output file. Then you can append each input file's rows to the output file without having to have all of the files in memory.
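A minimal sketch of that two-pass idea with csv.DictReader and csv.DictWriter (the file names are placeholders, and "merge" is read here, as in this answer, as stacking rows under the union of all columns):
import csv

csv_names = ["a.csv", "b.csv", "c.csv"]  # placeholder input files

# First pass: collect the union of all column names without loading any rows.
all_columns = []
for fn in csv_names:
    with open(fn, newline="") as f:
        for name in csv.DictReader(f).fieldnames or []:
            if name not in all_columns:
                all_columns.append(name)

# Second pass: append each file's rows to the output, one row in memory at a time.
with open("merged.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=all_columns, restval="")
    writer.writeheader()
    for fn in csv_names:
        with open(fn, newline="") as f:
            for row in csv.DictReader(f):
                writer.writerow(row)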

Horizontal concatenation really is trivial. Considering you know C++, I'm surprised you used MATLAB. Processing a GB or so of data the way you're doing it should take on the order of seconds, not days.
By your description, no CSV processing is actually required. The easiest approach is to just do it in RAM.
vector< vector<string> > data( num_files );
for( int i = 0; i < num_files; i++ ) {
    ifstream input( filename[i] );
    string line;
    while( getline(input, line) ) data[i].push_back(line);
}
(Do obvious sanity checks, such as making sure all vectors are the same length...)
Now you have everything, dump it:
ofstream output("concatenated.csv");
size_t num_rows = data[0].size();
for( size_t row = 0; row < num_rows; row++ ) {
    for( int f = 0; f < num_files; f++ ) {
        if( f != 0 ) output << ",";  // comma between files, not before the first column
        output << data[f][row];
    }
    output << "\n";
}
If you don't want to use all that RAM, you can do it one line at a time. You should be able to keep all files open at once, and just store the ifstream objects in a vector/array/list. In that case, you just read one line at a time from each file and write it to the output.

import csv
import itertools
# put files in the order you want concatenated
csv_names = [...whatever...]
readers = [csv.reader(open(fn, 'rb')) for fn in csv_names]
writer = csv.writer(open('result.csv', 'wb'))
for row_chunks in itertools.izip(*readers):
    writer.writerow(list(itertools.chain.from_iterable(row_chunks)))
Concatenates horizontally. Assumes all files have the same length. Has low memory overhead and is speedy.
Answer applies to Python 2. In Python 3, opening the csv files is slightly different (newline='' goes to open(), and itertools.izip becomes the built-in zip):
readers = [csv.reader(open(fn, 'r', newline='')) for fn in csv_names]
writer = csv.writer(open('result.csv', 'w', newline=''))

Use Go: https://github.com/chrislusf/gleam
Assume file "a.csv" has fields "a1, a2, a3, a4, a5" and file "b.csv" has fields "b1, b2, b3".
We want to join the rows where a1 = b2, and the output format should be "a1, a4, b3".
package main

import (
    "os"

    "github.com/chrislusf/gleam"
    "github.com/chrislusf/gleam/source/csv"
)

func main() {
    f := gleam.New()
    a := f.Input(csv.New("a.csv")).Select(1, 4)      // a1, a4
    b := f.Input(csv.New("b.csv")).Select(2, 3)      // b2, b3
    a.Join(b).Fprintf(os.Stdout, "%s,%s,%s\n").Run() // a1, a4, b3
}

Related

Operations between elements in the same list of lists generated from imported .dat files

I'm writing a program that takes .dat files from a directory one at a time, verifies some condition, and if the verification is okay copies the files to another directory.
The code below shows how I import the files and create a list of lists. I'm having trouble with the verification step: I tried a for loop, but once I set the if condition, operating on the elements of the list of lists seems impossible.
In particular I need the difference between consecutive elements matrix[i][3] and matrix[i+1][3] to be less than 5.
for filename in glob.glob(os.path.join(folder_path, '*.dat')):
    with open(filename, 'r') as f:
        matrix = []
        data = f.readlines()
        for raw_line in data:
            split_line1 = raw_line.replace(":", ";")
            split_line2 = split_line1.replace("\n", "")
            split_line3 = split_line2.strip().split(";")
            matrix.append(split_line3)
Hello and welcome to Stack Overflow.
You did not provide a sample of your data files. After looking at your code, I assume your data looks like this:
9;9;7;5;0;9;5;8;4;2
9;1;1;5;1;3;4;1;8;7
2;8;4;5;5;2;1;4;6;4
6;4;1;5;5;8;1;4;6;1
0;1;0;5;7;1;7;4;1;9
4;9;6;5;3;2;6;2;9;6
8;0;6;0;8;9;3;1;6;6
A few general remarks:
For parsing a csv file, use the csv module. It is easy to use and less error-prone than writing your own parser.
If you do a lot of data-processing and matrix calculations, you want to have a look at the pandas and numpy libraries. Processing matrices line by line in plain Python is slower by some orders of magnitude.
I understand your description of the verification step as follows:
A matrix matches if all consecutive elements
matrix[i][3] and matrix[i+1][3] differ by less than 5.
My suggested code looks like this:
import csv
from glob import glob
from pathlib import Path

def read_matrix(fn):
    with open(fn) as f:
        reader = csv.reader(f, delimiter=";")
        m = [[float(c) for c in row] for row in reader]
    return m

def verify_condition(matrix):
    col = 3
    pairs_of_consecutive_rows = zip(matrix[:-1], matrix[1:])
    for row_i, row_j in pairs_of_consecutive_rows:
        if abs(row_i[col] - row_j[col]) >= 5:
            return False
    return True

if __name__ == '__main__':
    folder_path = Path("../data")
    for filename in glob(str(folder_path / '*.dat')):
        print(f"processing {filename}")
        matrix = read_matrix(filename)
        matches = verify_condition(matrix)
        if matches:
            print("match")
            # copy_file(filename, target_folder)
I am not going into detail about the function read_matrix. Just note that I convert the strings to float with the statement float(c) in order to be able to do numerical calculations later on.
I iterate over all consecutive rows by iterating over matrix[:-1] and matrix[1:] at the same time using zip. See the effect of zip in this example:
>>> list(zip("ABC", "XYZ"))
[('A', 'X'), ('B', 'Y'), ('C', 'Z')]
And the effect of the [:-1] and [1:] indices here:
>>> "ABC"[:-1], "ABC"[1:]
('AB', 'BC')
When verify_condition finds the first two consecutive rows that differ by at least 5, it returns False.
I am confident that this code should help you get going.
PS: I could not resist using the pathlib library because I really prefer to see code like folder / subfolder / "filename.txt" instead of path.join(folder, subfolder, "filename.txt") in my scripts.
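Following up on the pandas remark above, a minimal sketch of the same check with pandas (assuming the ';'-separated sample shown earlier, with no header row):
import pandas as pd

def verify_condition_pandas(fn):
    # read the ';'-separated file; header=None because the sample data has no column names
    m = pd.read_csv(fn, sep=";", header=None)
    # all absolute differences between consecutive values of column 3 must be below 5
    return m[3].diff().abs().dropna().lt(5).all()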

Clean text files - remove unwanted content in LOOP (R/python)

I want to clean all the "waste" (content that makes the files unsuitable for analysis) out of unstructured text files.
In this specific situation, one option for retaining only the wanted information is to keep only the numbers above 250 (the text is a mix of strings, numbers, ...).
For a large number of text files, I want to do the following in R:
x <- x[which(x >= "250"),]
The code above works perfectly for one text file, but when I try to do the same in a loop over the large number of text files, it fails (error: incorrect number of dimensions).
for(i in 1:length(files)){
    i <- i[which(i >= "250"),]
}
Does anyone have any idea how to solve this in R (or Python)?
Picture: a very simplified example of a text file; I want to retain everything between (START) and (END).
If it is 10K files, why are you even trying to do this in R or Python? Why not just a simple awk or bash command? Moreover, your image shows parsing the info between START and END from the text files; I'm not sure it is a data frame with columns across (try to provide a simple dput rather than images).
All you are trying to do is a grep between START and END across 10K files. I would do that in bash.
Something like this in bash should work:
for i in *.txt
do
    sed -n '/START/,/END/{//!p}' "$i" > "$i.edited.txt"
done
If the columns are standard across files, in R you can do the following (but I would not read 10K files into R memory):
Read the files as a list of data frames, then simply do an lapply:
a = data.frame(col1 = c(100,250,300))
b = data.frame(col1 = c(250,450,100,346))
c = data.frame(col1 = c(250,123,122,340))
df_list <- list(a = a, b = b, c = c)
lapply(df_list, subset, col1 >= 250)
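Since the question also asks about Python, a minimal sketch of the same START/END extraction in plain Python (the file pattern and output naming mirror the bash version above and are assumptions):
import glob

# keep only the lines strictly between the START and END marker lines, per file
for filename in glob.glob("*.txt"):
    keep, inside = [], False
    with open(filename) as f:
        for line in f:
            if "START" in line:
                inside = True
            elif "END" in line:
                inside = False
            elif inside:
                keep.append(line)
    with open(filename + ".edited.txt", "w") as out:
        out.writelines(keep)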

fast approach to read several files in a list comprehension

I have a folder with many text files (e.g. 164,400). Each file has several lines (e.g. x,y,z) of floating-point numbers. My code reads a group of 3,000 files at a time and stores the values of their lines in a dictionary (see example).
The code is quite slow when opening 3000 files.
[[points_dict[os.path.split(x)[1]].append(p) for p in open(x,"r")] for x in lf]
I would like to know if someone has a more efficient and faster approach to reading the files.
import os
from collections import defaultdict
from itertools import count, groupby

file_folder = "C:\\junk"  # where I stored my files
points_dict = defaultdict(list)
# group the file paths into chunks of 3000 (file_folder here stands for the listing of file paths)
groups = groupby(file_folder, key=lambda k, line=count(): next(line) // 3000)
for k, group in groups:
    lf = [p for p in group]
    [[points_dict[os.path.split(x)[1]].append(p) for p in open(x, "r")] for x in lf]
    # do other
where os.path.split(x)[1] stores the lines with the same file name (id) in the dictionary and lf is the list of the files to open.
What about using numpy? Something along these lines (edited answer, tested code):
[points_dict[os.path.split(x)[1]].append(numpy.loadtxt(x, delimiter=",")) for x in lf]
for x, np_arrays in points_dict.iteritems():
    points_dict[x] = numpy.vstack(np_arrays)
At the end you get the points in a nice numpy array.

What's wrong with my Python for loop?

I have two files open, EQE_data and Refl_data. I want to take each line of EQE_data, which will have eight tab-delimited columns, and find the line in Refl_data which corresponds to it, then do the data analysis and write the results to output. So for each line in EQE_data, I need to search the entire Refl_data until I find the right one. This code is successful the first time, but it is outputting the same results for the Refl_data every subsequent time. I.e., I get the correct columns for Wav1 and QE, but it seems to only be executing the nested for loop once, so I get the same R, Abs, IQE, which is correct for the first row, but incorrect thereafter.
for line in EQE_data:
    try:
        EQE = line.split("\t")
        Wav1, v2, v3, QE, v5, v6, v7, v8 = EQE
        for line in Refl_data:
            Refl = line.split("\t")
            Wav2, R = Refl
            if float(Wav2) == float(Wav1):
                Abs = 1 - (float(R) / 100)
                IQE = float(QE) / Abs
                output.write("%d\t%f\t%f\t%f\t%f\n" % (int(float(Wav1)), float(QE), float(R) / 100, Abs, IQE))
    except:
        pass
If Refl_data is a file, you need to put the read pointer back to the beginning in each loop (using Refl_data.seek(0)), or just re-open the file.
Alternatively, read all of Refl_data into a list first and loop over that list instead.
Further advice: use the csv module for tab-separated data, and don't ever use a bare try/except; always catch only specific exceptions.
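A minimal sketch of the second suggestion, reading Refl_data into a list once and reusing it (variable names follow the question; the try/except and any handling of malformed lines are omitted here):
# read Refl_data once into a list so it can be re-scanned for every EQE line
refl_rows = [line.split("\t") for line in Refl_data]

for line in EQE_data:
    Wav1, v2, v3, QE, v5, v6, v7, v8 = line.split("\t")
    for Wav2, R in refl_rows:
        if float(Wav2) == float(Wav1):
            Abs = 1 - (float(R) / 100)
            IQE = float(QE) / Abs
            output.write("%d\t%f\t%f\t%f\t%f\n" % (int(float(Wav1)), float(QE), float(R) / 100, Abs, IQE))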

How to Compare 2 very large matrices using Python

I have an interesting problem.
I have a very large (larger than 300 MB, more than 10,000,000 lines/rows in the file) CSV file with time series data points inside. Every month I get a new CSV file that is almost the same as the previous file, except that a few new lines have been added and/or removed and perhaps a couple of lines have been modified.
I want to use Python to compare the 2 files and identify which lines have been added, removed and modified.
The issue is that the file is very large, so I need a solution that can handle the large file size and execute efficiently within a reasonable time, the faster the better.
Example of what a file and its new file might look like:
Old file
A,2008-01-01,23
A,2008-02-01,45
B,2008-01-01,56
B,2008-02-01,60
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,9
etc...
New file
A,2008-01-01,23
A,2008-02-01,45
A,2008-03-01,67 (added)
B,2008-01-01,56
B,2008-03-01,33 (removed and added)
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,22 (modified)
etc...
Basically the 2 files can be seen as matrices that need to be compared, and I have begun thinking of using PyTables. Any ideas on how to solve this problem would be greatly appreciated.
Like this.
Step 1. Sort.
Step 2. Read each file, doing line-by-line comparison. Write differences to another file.
You can easily write this yourself. Or you can use difflib. http://docs.python.org/library/difflib.html
Note that the general solution is quite slow as it searches for matching lines near a difference. Writing your own solution can run faster because you know things about how the files are supposed to match. You can optimize that "resynch-after-a-diff" algorithm.
And 10,000,000 lines hardly matters. It's not that big. Two 300 MB files easily fit into memory.
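A minimal sketch of the difflib route mentioned above (file names are placeholders; for files of this size the hand-rolled merge shown below may be preferable):
import difflib

# line-by-line diff of the old and new files
with open("old.csv") as f1, open("new.csv") as f2:
    old_lines = f1.readlines()
    new_lines = f2.readlines()

for line in difflib.unified_diff(old_lines, new_lines, fromfile="old.csv", tofile="new.csv"):
    if line.startswith("+") and not line.startswith("+++"):
        print("added or modified:", line[1:].rstrip())
    elif line.startswith("-") and not line.startswith("---"):
        print("removed or modified:", line[1:].rstrip())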
This is a little bit of a naive implementation but will deal with unsorted data:
import csv

file1_dict = {}
file2_dict = {}
with open('file1.csv') as handle:
    for row in csv.reader(handle):
        file1_dict[tuple(row[:2])] = tuple(row[2:])
with open('file2.csv') as handle:
    for row in csv.reader(handle):
        file2_dict[tuple(row[:2])] = tuple(row[2:])
with open('outfile.csv', 'w') as handle:
    writer = csv.writer(handle)
    for key, val in file1_dict.iteritems():
        if key in file2_dict:
            # deal with keys that are in both
            if file2_dict[key] == val:
                writer.writerow(key + val + ('Same',))
            else:
                writer.writerow(key + file2_dict[key] + ('Modified',))
            file2_dict.pop(key)
        else:
            writer.writerow(key + val + ('Removed',))
    # deal with added keys!
    for key, val in file2_dict.iteritems():
        writer.writerow(key + val + ('Added',))
You probably won't be able to "drop in" this solution, but it should get you ~95% of the way there. @S.Lott is right, two 300 MB files will easily fit in memory ... if your files get into the 1-2 GB range then this may have to be modified with the assumption of sorted data.
Something like this is close ... although you may have to change the comparisons around for the added and modified cases to make sense:
# assuming both files are sorted by columns 1 and 2 (Python 2 code)
import csv
import datetime
from itertools import imap

def str2date(s):
    # parse 'YYYY-MM-DD' into a datetime.date
    return datetime.date(*map(int, s.split('-')))

def convert_tups(row):
    key = (row[0], str2date(row[1]))
    val = tuple(row[2:])
    return key, val

with open('file1.csv') as handle1:
    with open('file2.csv') as handle2:
        with open('outfile.csv', 'w') as outhandle:
            writer = csv.writer(outhandle)
            gen1 = imap(convert_tups, csv.reader(handle1))
            gen2 = imap(convert_tups, csv.reader(handle2))
            gen2key, gen2val = gen2.next()
            for gen1key, gen1val in gen1:
                if gen1key == gen2key and gen1val == gen2val:
                    writer.writerow(gen1key + gen1val + ('Same',))
                    gen2key, gen2val = gen2.next()
                elif gen1key == gen2key and gen1val != gen2val:
                    writer.writerow(gen2key + gen2val + ('Modified',))
                    gen2key, gen2val = gen2.next()
                elif gen1key > gen2key:
                    while gen1key > gen2key:
                        writer.writerow(gen2key + gen2val + ('Added',))
                        gen2key, gen2val = gen2.next()
                else:
                    writer.writerow(gen1key + gen1val + ('Removed',))
