Combine data from csv files - python

I have 100 CSV files with the same number of columns (but different numbers of rows) in the following pattern:
File 1:
A1,B1,C1
A2,B2,C2
A3,B3,C3
A4,B4,C4
File 2:
*A1*,*B1*,*C1*
*A2*,*B2*,*C2*
*A3*,*B3*,*C3*
File ...
Output:
A1+*A1*+...,B1+*B1*+...,C1+*C1*+...
A2+*A2*+...,B2+*B2*+...,C2+*C2*+...
A3+*A3*+...,B3+*B3*+...,C3+*C3*+...
A4+... ,B4+... ,C4+...
For example:
File 1:
1,0,0
1,0,1
1,0,0
0,1,0
File 2:
1,1,0
1,1,1
0,1,0
Output:
2,1,0
2,1,2
1,1,0
0,1,0
I am really racking my brain over how to solve this... Could anybody give me some advice?
Thanks a lot and best regards,
Julian
Edit:
I want to thank 'pepr' a lot for his very elaborate answer, but I would like to find a solution using pandas as suggested by 'furas'.
I have found a way to create the variables for all my files like this:
dic = {}
for i in range(14253, 14352):
    try:
        dic['df_{0}'.format(i)] = pandas.read_csv('output_' + str(i) + '.csv')
    except:
        pass
but if I try the suggested
df1['column_A'] += df2['column_*A*']
then, because I have 100 files, in my case it would have to be something like
for residue in residues:
    for number in range(14254, 14255):
        df = dic['df_14253'][residue]
        df += dic['df_' + str(number)][residue]
I have the problem that my files have different numbers of rows and the values are only summed up until the last row of df1. How could I solve this? I think pandas' groupby().sum() could be an option, but I don't understand how to use it.
PS: residues is a list which contains all the column headers.

The solution with standard modules can be like this:
#!python3
import csv
import itertools

fname1 = 'file1.csv'
fname2 = 'file2.csv'
fname_out = 'output.csv'

with open(fname1, newline='') as f1, \
     open(fname2, newline='') as f2, \
     open(fname_out, 'w', newline='') as fout:

    reader1 = csv.reader(f1)
    reader2 = csv.reader(f2)
    writer = csv.writer(fout)

    for row1, row2 in itertools.zip_longest(reader1, reader2, fillvalue=['0', '0', '0']):
        row_out = [int(a) + int(b) for a, b in zip(row1, row2)]
        writer.writerow(row_out)
The itertools module implements zip_longest(), which is similar to the built-in zip(); however, it can process sequences of different lengths. Here the keyword argument fillvalue is a quick hack -- the 3 columns are hardwired. Actually, it can be set to [0, 0, 0] (integers instead of strings), because int(0) is also zero.
On each iteration, zip_longest() produces a tuple of two rows -- the elements are assigned to row1 and row2. Inside the loop, the normal zip() can be used, as you will always have either the row from the file or the fillvalue with zeros. You always get a tuple with one element from the first row and the second element from the second row. They have to be converted from string to int, and then they are added to form a single element of row_out.
A better version of the loop, which does not rely on a fixed number of columns, uses the default None as the fillvalue. If one of the rows is None, it is set to a list with the same number of zeros as the other row. This means you can even have rows of different lengths within the same file (but the lengths must match between the two files; the opposite case could also be solved easily by using zip_longest() in the body of the loop as well):
    for row1, row2 in itertools.zip_longest(reader1, reader2):
        if row1 is None:
            row1 = [0] * len(row2)
        elif row2 is None:
            row2 = [0] * len(row1)
        row_out = [int(a) + int(b) for a, b in zip(row1, row2)]
        writer.writerow(row_out)
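The two-file approach above can be extended to all 100 files by opening one reader per file and zipping all of them together. This is only a sketch, under the assumption that every file has the same number of columns and uses the output_<number>.csv naming from the question's edit:
import csv
import itertools

# file names mirror the loop in the question's edit; adjust as needed
fnames = ['output_{0}.csv'.format(i) for i in range(14253, 14352)]
files = [open(fname, newline='') for fname in fnames]
readers = [csv.reader(f) for f in files]

with open('combined.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    # zip_longest pads exhausted readers with None once their file runs out of rows
    for rows in itertools.zip_longest(*readers):
        present = [row for row in rows if row is not None]
        # sum column by column across the files that still have a row here
        row_out = [sum(int(cell) for cell in col) for col in zip(*present)]
        writer.writerow(row_out)

for f in files:
    f.close()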

Use pandas.
It can read CSV files and it can add two columns.
import pandas as pd
df1 = pd.read_csv(filename_1)
df2 = pd.read_csv(filename_2)
df1['column_A'] += df2['column_*A*']
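If the files have different numbers of rows (as in the question's edit), += stops aligning at the shorter index; DataFrame.add with fill_value=0 pads the missing rows with zeros instead. A minimal sketch, assuming all files share the same column headers and the naming from the question:
import pandas as pd

frames = [pd.read_csv('output_{0}.csv'.format(i)) for i in range(14253, 14352)]

total = frames[0]
for df in frames[1:]:
    # fill_value=0 treats rows missing from the shorter frame as zeros
    total = total.add(df, fill_value=0)

total.to_csv('combined.csv', index=False)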


Compare 2 large CSVs using python - output the differences

I am writing a program to compare all files and directories between two filepaths (basically the files' metadata, content, and internal directories should match).
File content comparison is done row by row. The dimensions of the CSVs may or may not be the same, but the approaches below generally manage scenarios where the dimensions differ.
The problem is that processing time is too slow.
Some context:
The two files are identified to be different using filecmp
This particular problematic CSV has ~11k columns and 800 rows.
My program will not know the data types within the CSV beforehand, so defining the dtype for pandas is not an option.
difflib does an excellent job if the CSV file is small, but not for this particular use case.
I've looked at all the related questions on SO and tried these approaches, but the processing time was terrible, and Approach 3 gives weird results.
Approach 1 (Pandas) - Terrible wait and I keep getting this error
UserWarning: You are merging on int and float columns where the float values are not equal to their int representation.
import pandas as pd
import numpy as np
df1 = pd.read_csv(f1)
df2 = pd.read_csv(f2)
diff = df1.merge(df2, how='outer', indicator='exists').query("exists!='both'")
print(diff)
Approach 2 (Difflib) - Terrible wait for this huge csv
import difflib

def CompareUsingDiffLib(f1, f2):
    h = difflib.HtmlDiff()
    html = h.make_file(f1, f2, context=True, numlines=0)
    htmlfilepath = filePath + "\\htmlFiles"  # filePath is defined elsewhere in the program
    with open(htmlfilepath, 'w') as fh:
        fh.write(html)

with open(file1) as f, open(file2) as z:
    f1 = f.readlines()
    f2 = z.readlines()

CompareUsingDiffLib(f1, f2)
Approach 3 (Pure python) - Incorrect results
with open(f1) as f, open(f2) as z:
    file1 = f.readlines()
    file2 = z.readlines()

# check the row numbers of diffs in file 1
for line in file1:
    if line not in file2:
        print(file1.index(line))
# it reports that every row from row number 278 to the last row
# is not in file 2, which is incorrect
# I checked using difflib, and using Excel as well
# no idea why the results are like that

# running the code below shows the same result as the first block
for line in file2:
    if line not in file1:
        print(file2.index(line))
Approach 4 (csv-diff) - Terrible wait
from csv_diff import load_csv, compare

diff = compare(
    load_csv(open("one.csv")),
    load_csv(open("two.csv"))
)
Can anybody please help on either:
An approach with less processing time
Debugging Approach 3
Comparing the files with readlines() and just testing for membership ("this in that?") does not equal diff'ing the lines.
with open(f1) as f, open(f2) as z:
    file1 = f.readlines()
    file2 = z.readlines()

for line in file1:
    if line not in file2:
        print(file1.index(line))
Consider these two CSVs:
file1.csv      file2.csv
-----------    -----------
a,b,c,d        a,b,c,d
1,2,3,4        1,2,3,4
A,B,C,D        i,ii,iii,iv
i,ii,iii,iv    A,B,C,D
That script will produce nothing (and give the false impression there's no diff) because every line in file 1 is in file 2, even though the files differ line-for-line. (I cannot say why you think you were getting false positives, though, without seeing the files.)
I recommend using the CSV module and iterating the files row by row, and then even column by column:
import csv
path1 = "file1.csv"
path2 = "file2.csv"
with open(path1) as f1, open(path2) as f2:
reader1 = csv.reader(f1)
reader2 = csv.reader(f2)
for i, row1 in enumerate(reader1):
try:
row2 = next(reader2)
except StopIteration:
print(f"Row {i+1}, f1 has this extra row compared to f2")
continue
if row1 == row2:
continue
if len(row1) != len(row2):
print(f"Row {i+1} of f1 has {len(row1)} cols, f2 has {len(row2)} cols")
continue
for j, cell1 in enumerate(row1):
cell2 = row2[j]
if cell1 != cell2:
print(f'Row {i+1}, Col {j+1} of f1 is "{cell1}", f2 is "{cell2}"')
for row2 in reader2:
i += 1
print(f"Row {i+1}, f2 has this extra row compared to f1")
This uses an iterator over file1 to drive an iterator over file2. It accounts for any difference in row counts between the two files by catching the StopIteration exception when file2 runs out of rows before file1, and by printing a note for any rows left to read in file2 (reader2) at the very bottom.
When I run that against these files:
file1          file2
-----------    ----------
a,b,c,d        a,b,c
1,2,3,4        1,2,3,4
A,B,C,D        A,B,C,Z
i,ii,iii,iv    i,ii,iii,iv
               x,xo,xox,xoxo
I get:
Row 1 of f1 has 4 cols, f2 has 3 cols
Row 3, Col 4 of f1 is "D", f2 is "Z"
Row 5, f2 has an extra row compared to f1
If I swap path1 and path2, I get this:
Row 1 of f1 has 3 cols, f2 has 4 cols
Row 3, Col 4 of f1 is "Z", f2 is "D"
Row 5, f1 has this extra row compared to f2
And it does this fast. I mocked up two 800 x 11_000 CSVs with very, very small differences between rows (if any) and it processed all diffs in under a second of user time (not counting printing).
You can use filecmp to compare two files byte by byte; see the docs.
Implementation:
>>> import filecmp
>>> filecmp.cmp('somepath/file1.csv', 'otherpath/file1.csv')
True
>>> filecmp.cmp('somepath/file1.csv', 'otherpath/file2.csv')
True
Note: the file name doesn't matter.
speed comparison against hashing: https://stackoverflow.com/a/1072576/16239119
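One caveat worth noting: filecmp.cmp() defaults to shallow=True, which treats files as equal when their os.stat() signatures (type, size, modification time) match; passing shallow=False forces an actual content comparison. A small sketch, reusing the paths from the example above:
import filecmp

# shallow=False compares file contents, not just the os.stat() signatures
same = filecmp.cmp('somepath/file1.csv', 'otherpath/file2.csv', shallow=False)
print(same)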

Print 2D array in txt file saves only last row - incorrect use of .join?

I have a 2D array (yieldpergroup) and I am trying to save each array vertically with something like
with open('txt/All_numbers.txt', 'w') as f:
    lines = [' \t'.join([str(x[i]) for x in yieldpergroup]) for i in range(0, len(max(yieldpergroup)))]
i.e. my array is something like this:
yieldpergroup([a,b,c,d][1,2,3,4][!,#,#,$])
and I want to have it in the format
a 1 !
b 2 #
c 3 #
d 4 $
However, although the txt correctly includes all the columns, only the last one is shown correctly; all other columns are filled with zeroes (like when I first initialized with yieldpergroup = [[0 for i in range(cols)] for j in range(rows)]). What am I doing wrong while using join?
You can use the zip transposition pattern to get the desired result:
yieldpergroup = (['A', 'B', 'C', 'D'], [1, 2, 3, 4], ['!', '#', '#', '$'])

with open('txt/All_numbers.txt', 'w') as f:
    for t in zip(*yieldpergroup):
        f.write("\t".join(map(str, t)) + "\n")
After executing the above code your .txt file should look like:
A 1 !
B 2 #
C 3 #
D 4 $
You could try something like this:
with open('All_numbers.txt', 'a') as f:
    f.write("\n".join(["\t".join([str(group[index]) for group in yieldpergroup]) for index, entry in enumerate(range(0, len(max(yieldpergroup))))]))
But if you want to be able to comprehend your code again in a couple of days, there's also no shame in this:
with open('All_numbers.txt', 'a') as f:
    row = ""
    rows = []
    biggest_group = range(0, len(max(yieldpergroup)))
    for index, entry in enumerate(biggest_group):
        for group in yieldpergroup:
            row += f"{group[index]}\t"
        rows.append(row)
        row = ""
    f.write("\n".join(rows))
You probably also want to account for the case where your groups don't share an equal length, since you are only indexing up to the length of the largest group; handling that would make the list comprehension even harder to read.
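A minimal sketch of handling unequal group lengths with itertools.zip_longest, assuming missing entries should simply become empty cells:
import itertools

yieldpergroup = (['A', 'B', 'C', 'D'], [1, 2, 3], ['!', '#'])  # groups of unequal length

with open('All_numbers.txt', 'w') as f:
    # zip_longest pads the shorter groups with fillvalue instead of truncating
    for t in itertools.zip_longest(*yieldpergroup, fillvalue=''):
        f.write("\t".join(map(str, t)) + "\n")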

For basic maths calculations on very large csv files how can I do this faster when I have mixed datatypes in my csv - with python

I have some very large CSV files (15+ GB) that contain 4 initial rows of metadata/header info and then the data. The first 3 columns are 3D Cartesian coordinates and are the values I need to change with basic maths operations, e.g. add, subtract, multiply, divide. I need to do this en masse to each of the coordinate columns. The first 3 columns are float-type values.
The rest of the columns in the CSV could be of any type, e.g. string, int, etc....
I currently use a script where I read in each row of the CSV, make the modification, then write to a new file, and it seems to work fine. But the problem is it takes days on a large file. The machine I'm running on has plenty of memory (120 GB), but my current method doesn't utilise that.
I know I can update a column on mass using a numpy 2D array if I skip the 4 metadata rows.
e.g.
arr = np.genfromtxt(input_file_path, delimiter=',', skip_header=4)
arr[:,0]=np.add(arr[:,0],300)
this will update the first column by adding 300 to each value. But the issues I have with trying to use numpy are:
Numpy arrays don't support mixed data types for the rest of the columns that will be imported (I don't know what the other columns will hold, so I can't use structured arrays -- or rather, I want it to be a universal tool, so I don't have to know what they will hold).
I can export the numpy array to CSV (providing it's not mixed types), and just using regular text functions I can create a separate CSV for the 4 rows of metadata, but then I need to somehow concatenate them, and I don't want to have to read through all the lines of the data CSV just to append it to the bottom of the metadata CSV.
I know if I can make this work with numpy it will greatly increase the speed by utilizing the machine's large amount of memory and holding the entire CSV in memory while I do the operations. I've never used pandas but would also consider it for a solution. I've had a bit of a look into pandas, thinking I may be able to do it with dataframes, but I still need to figure out how to have 4 rows as my column header instead of one. Additionally, I haven't seen a way to apply a mass update to a whole column (like I can with numpy) without using a Python loop -- I'm not sure whether that would be slow if the data is already in memory.
The metadata can be empty for rows 2,3,4 but in most cases row 4 will have the data type recorded. There could be up to 200 data columns in addition to the initial 3 coordinate columns.
My current (slow) code looks like this:
import os
import subprocess
import csv
import numpy as np


def move_txt_coords_to(move_by_coords, input_file_path, output_file_path):
    # create new empty output file
    open(output_file_path, 'a').close()

    with open(input_file_path, newline='') as f:
        reader = csv.reader(f)
        for idx, row in enumerate(reader):
            if idx < 4:
                append_row(output_file_path, row)
            else:
                new_x = round(float(row[0]) + move_by_coords['x'], 3)
                new_y = round(float(row[1]) + move_by_coords['y'], 3)
                new_z = round(float(row[2]) + move_by_coords['z'], 3)
                row[0] = new_x
                row[1] = new_y
                row[2] = new_z
                append_row(output_file_path, row)


def append_row(output_file, row):
    f = open(output_file, 'a', newline='')
    writer = csv.writer(f, delimiter=',')
    writer.writerow(row)
    f.close()


if __name__ == '__main__':
    move_by_coords = {
        'x': -338802.5,
        'y': -1714752.5,
        'z': 0
    }

    input_file_path = r'D:\incoming_data\large_data_set1.csv'
    output_file_path = r'D:\outgoing_data\large_data_set_relocated.csv'

    move_txt_coords_to(move_by_coords, input_file_path, output_file_path)
Okay so I've got an almost complete answer and it was so much easier than trying to use numpy.
import pandas as pd

input_file_path = r'D:\input\large_data.csv'
output_file_path = r'D:\output\large_data_relocated.csv'

move_by_coords = {
    'x': -338802.5,
    'y': -1714752.5,
    'z': 0
}

df = pd.read_csv(input_file_path, header=[0, 1, 2, 3])
df.centroid_x += move_by_coords['x']
df.centroid_y += move_by_coords['y']
df.centroid_z += move_by_coords['z']
df.to_csv(output_file_path, sep=',')
But I have one remaining issue (possibly 2). The blank cells in my header are being populated with 'Unnamed'. I somehow need to substitute a blank string for those in the header row.
Also, #FBruzzesi has warned me I may need to use a batch size to make it more efficient, which I'll need to check out.
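For reference, pandas calls this chunksize rather than batch size: read_csv(..., chunksize=N) returns an iterator of DataFrames. A rough sketch of processing the file piecewise (the column positions and shift values mirror the code above; the 4 metadata rows would still need to be copied separately):
import pandas as pd

first = True
# process 100000 data rows at a time instead of loading the whole file
for chunk in pd.read_csv(input_file_path, skiprows=4, header=None, chunksize=100000):
    chunk.iloc[:, 0] += move_by_coords['x']
    chunk.iloc[:, 1] += move_by_coords['y']
    chunk.iloc[:, 2] += move_by_coords['z']
    # write the first chunk fresh, then append the rest
    chunk.to_csv(output_file_path, mode='w' if first else 'a', header=False, index=False)
    first = False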
---------------------Update-------------
Okay, I resolved the multiline header issue. I just use the regular csv reader module to read the first 4 rows into a list of rows, then I transpose this into a list of columns, converting each column list to a tuple at the same time. Once I have a list of column header tuples (where each tuple consists of the rows within that column header), I can use the list to name the headers. I therefore skip the header rows when reading the CSV into the data frame, and then update each column by its index. I also drop the index column on export back to CSV once done.
It seems to work very well.
import csv
import itertools
import pandas as pd


def make_first_4rows_list_of_tuples(input_csv_file_path):
    f = open(input_csv_file_path, newline='')
    reader = csv.reader(f)
    header_rows = []
    for row in itertools.islice(reader, 0, 4):
        header_rows.append(row)
    header_col_tuples = list(map(tuple, zip(*header_rows)))
    print("Header columns: \n", header_col_tuples)
    return header_col_tuples


if __name__ == '__main__':
    move_by_coords = {
        'x': 1695381.5,
        'y': 5376792.5,
        'z': 100
    }

    input_file_path = r'D:\temp\mydata.csv'
    output_file_path = r'D:\temp\my_updated_data.csv'

    column_headers = make_first_4rows_list_of_tuples(input_file_path)
    df = pd.read_csv(input_file_path, skiprows=4, names=column_headers)
    df.iloc[:, 0] += move_by_coords['x']
    df.iloc[:, 1] += move_by_coords['y']
    df.iloc[:, 2] += move_by_coords['z']
    df.to_csv(output_file_path, sep=',', index=False)

Smart way to read big input file with multiple unmarked variables (assorted in columns) in python

I have the following code that runs for over a million lines. But this takes a lot of time. Is there a better way to read in such files? The current code looks like this:
for line in lines:
    line = line.strip()     # strips extra characters from the line
    columns = line.split()  # splits the line into individual strings
    x = columns[0]          # reads in x position
    x = float(x)            # converts the string to float
    y = columns[1]          # reads in y
    y = float(y)            # converts the string to float
    z = columns[2]          # reads in z
    z = float(z)            # converts the string to float
The file data looks like this:
347.528218024 354.824474847 223.554247185 -47.3141937738 -18.7595743981
317.843928028 652.710791858 795.452586986 -177.876355361 7.77755408015
789.419369714 557.566066378 338.090799912 -238.803813301 -209.784710166
449.259334688 639.283337249 304.600907059 26.9716202117 -167.461497735
739.302109761 532.139588049 635.08307865 -24.5716064556 -91.5271790951
I want to extract each number from the different columns. Every element in a column is the same variable. How do I do that? For example, I want a list, say l, to store the floats of the first column.
It would be helpful to know what you are planning on doing with the data, but you might try:
data = [list(map(float, line.split())) for line in lines]
This will give you a list of lists with your data.
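To pull out individual columns (for example a list l with the floats of the first column), you can transpose the list of lists with zip; a small sketch:
cols = list(zip(*data))  # transpose rows into columns
l = list(cols[0])        # all floats from the first column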
Pandas is built for this (among many other things)!
It uses numpy, which uses C under the hood and is very fast. (Actually, depending on what you're doing with the data, you may want to use numpy directly instead of pandas. However, I'd only do that after you've tried pandas; numpy is lower level and pandas will make your life easier.)
Here's how you could read in your data:
import pandas as pd

with open('testfile', 'r') as f:
    d = pd.read_csv(f, delim_whitespace=True, header=None,
                    names=['delete me', 'col1', 'col2', 'col3', 'col4', 'col5'])

d = d.drop('delete me', axis=1)  # the first column is all spaces and gets interpreted
                                 # as an empty column, so delete it
print(d)
This outputs:
col1 col2 col3 col4 col5
0 347.528218 354.824475 223.554247 -47.314194 -18.759574
1 317.843928 652.710792 795.452587 -177.876355 7.777554
2 789.419370 557.566066 338.090800 -238.803813 -209.784710
3 449.259335 639.283337 304.600907 26.971620 -167.461498
4 739.302110 532.139588 635.083079 -24.571606 -91.527179
The result d in this case is a powerful data structure called a dataframe that gives you a lot of options for manipulating the data very quickly.
As a simple example, this adds the two first columns and gets the mean of the result:
(d['col1'] + d['col2']).mean() # 1075.97544372
Pandas also handles missing data very nicely; if there are missing/bad values in the data file, pandas will simply replace them with NaN or None as appropriate when it reads them in.
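If you then want to handle those missing values explicitly, a couple of common one-liners (just a sketch; pick whichever fits the analysis):
d_clean = d.dropna()      # drop any row containing a missing value
d_filled = d.fillna(0.0)  # or replace missing values with a default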
Anyway, for fast, easy data analysis, I highly recommend this library.

Python: General CSV file parsing and manipulation

The purpose of my Python script is to compare the data present in multiple CSV files, looking for discrepancies. The data are ordered, but the ordering differs between files. The files contain about 70K lines, weighing around 15MB. Nothing fancy or hardcore here. Here's part of the code:
def getCSV(fpath):
    with open(fpath, "rb") as f:
        csvfile = csv.reader(f)
        for row in csvfile:
            allRows.append(row)

allCols = map(list, zip(*allRows))
Am I properly reading from my CSV files? I'm using csv.reader, but would I benefit from using csv.DictReader?
How can I create a list containing whole rows which have a certain value in a precise column?
Are you sure you want to be keeping all rows around? This creates a list with matching values only... fname could also come from glob.glob() or os.listdir() or whatever other data source you so choose. Just to note, you mention the 20th column, but row[20] will be the 21st column...
import csv

matching20 = []
for fname in ('file1.csv', 'file2.csv', 'file3.csv'):
    with open(fname) as fin:
        csvin = csv.reader(fin)
        next(csvin)  # <--- if you want to skip the header row
        for row in csvin:
            if row[20] == 'value':
                matching20.append(row)  # or do something with it here
You only want csv.DictReader if you have a header row and want to access your columns by name.
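For completeness, a minimal DictReader sketch; the 'status' column name and the target value are hypothetical and assume the files have a header row:
import csv

matching = []
with open('file1.csv', newline='') as fin:
    reader = csv.DictReader(fin)  # uses the first row as field names
    for row in reader:
        if row['status'] == 'value':  # access columns by name instead of index
            matching.append(row)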
This should work, you don't need to make another list to have access to the columns.
import csv
import sys

def getCSV(fpath):
    with open(fpath) as ifile:
        csvfile = csv.reader(ifile)
        rows = list(csvfile)
        value_20 = [x for x in rows if x[20] == 'value']
If I understand the question correctly, you want to include a row if value is in the row, but you don't know which column value is in, correct?
If your rows are lists, then this should work:
testlist = [row for row in allRows if 'value' in row]
post-edit:
If, as you say, you want a list of rows where value is in a specified column (specified by an integer pos), then:
testlist = []
pos = 20
for row in allRows:
    if row[pos] == 'value':
        testlist.append(row)
(I haven't tested this, but let me know if that works.)
