Python - merging CSV files with one axis in common

I need to merge two CSV files, A.csv and B.csv, which have one axis in common. An extract of A.csv:
9.358,3.0
9.388,2.0
and of B.csv:
8.551,2.0
8.638,2.0
I want the final file C.csv to have the following pattern:
8.551,0.0,2.0
8.638,0.0,2.0
9.358,3.0,0.0
9.388,2.0,0.0
How would you suggest doing it? Should I go for a for loop?

Just read from each file, writing out to the output file and adding in the 'missing' column:
import csv
with open('c.csv', 'wb') as outcsv:
    # Python 3: use open('c.csv', 'w', newline='') instead
    writer = csv.writer(outcsv)
    # copy a.csv across, adding a 3rd column
    with open('a.csv', 'rb') as incsv:
        # Python 3: use open('a.csv', newline='') instead
        reader = csv.reader(incsv)
        writer.writerows(row + [0.0] for row in reader)
    # copy b.csv across, inserting a 2nd column
    with open('b.csv', 'rb') as incsv:
        # Python 3: use open('b.csv', newline='') instead
        reader = csv.reader(incsv)
        writer.writerows(row[:1] + [0.0] + row[1:] for row in reader)
The writer.writerows() lines do all the work: a generator expression loops over the rows in each reader, either appending a column or inserting one in the middle.
This works whatever the size of the input CSVs, as only small read and write buffers are held in memory; rows are processed iteratively, never holding entire input or output files in memory.
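For reference, a single Python 3 version of the same code (a sketch that simply folds in the mode and newline changes from the comments above):
import csv

with open('c.csv', 'w', newline='') as outcsv:
    writer = csv.writer(outcsv)
    # copy a.csv across, adding a 3rd column
    with open('a.csv', newline='') as incsv:
        writer.writerows(row + [0.0] for row in csv.reader(incsv))
    # copy b.csv across, inserting a 2nd column
    with open('b.csv', newline='') as incsv:
        writer.writerows(row[:1] + [0.0] + row[1:] for row in csv.reader(incsv))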

import numpy as np

# read the two two-column files
dat1 = np.genfromtxt('dat1.txt', delimiter=',')
dat2 = np.genfromtxt('dat2.txt', delimiter=',')
# insert a zero column: third column for dat1, second for dat2
dat1 = np.insert(dat1, 2, 0, axis=1)
dat2 = np.insert(dat2, 1, 0, axis=1)
# stack the rows and write the merged result
dat = np.vstack((dat1, dat2))
np.savetxt('dat.txt', dat, delimiter=',', fmt='%.3f')

Here's a simple solution using a dictionary, which will work for any number of files:
from __future__ import print_function

def process(*filenames):
    lines = {}
    index = 0
    for filename in filenames:
        with open(filename, 'rU') as f:
            for line in f:
                v1, v2 = line.rstrip('\n').split(',')
                lines.setdefault(v1, {})[index] = v2
        index += 1
    for line in sorted(lines):
        print(line, end=',')
        for i in range(index):
            print(lines[line].get(i, 0.0), end=',' if i < index - 1 else '\n')

process('A.csv', 'B.csv')
prints
8.551,0.0,2.0
8.638,0.0,2.0
9.358,3.0,0.0
9.388,2.0,0.0
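To write the result to C.csv instead of stdout, a minimal variant (a sketch: it replaces the final print loop inside process, reusing the same lines and index variables):
with open('C.csv', 'w') as out:
    for key in sorted(lines):
        values = [str(lines[key].get(i, 0.0)) for i in range(index)]
        out.write(key + ',' + ','.join(values) + '\n')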

Related

Create single CSV from two CSV files with X and Y values in python

I have two CSV files: one containing X (longitude) values and the other Y (latitude) values (they are 'float' data type).
I am trying to create a single CSV with all possible combinations (e.g. X1,Y1; X1,Y2; X1,Y3; X2,Y1; X2,Y2; X2,Y3; etc.).
I have written the following, which partly works. However, the CSV file created has blank lines between values, and I also get the values stored with their list brackets, like ['20.7599'] ['135.9028']. What I need is 20.7599, 135.9028.
import csv

inLatCSV = r"C:\data\Lat.csv"
inLongCSV = r"C:\data\Long.csv"
outCSV = r"C:\data\LatLong.csv"

with open(inLatCSV, 'r') as f:
    reader = csv.reader(f)
    list_Lat = list(reader)

with open(inLongCSV, 'r') as f:
    reader = csv.reader(f)
    list_Long = list(reader)

with open(outCSV, 'w') as myfile:
    for y in list_Lat:
        for x in list_Long:
            combVal = (y, x)
            #print (combVal)
            wr = csv.writer(myfile)
            wr.writerow(combVal)
Adding an argument to the open function made the difference:
with open(my_csv, 'w', newline="") as myfile:
    combinations = [[y, x] for y in list_Lat for x in list_Long]
    wr = csv.writer(myfile)
    wr.writerows(combinations)
Any time you're doing something with csv files, Pandas is a great tool:
import pandas as pd

lats = pd.read_csv(r"C:\data\Lat.csv", header=None)
lons = pd.read_csv(r"C:\data\Long.csv", header=None)
lats['_tmp'] = 1
lons['_tmp'] = 1
df = pd.merge(lats, lons, on='_tmp').drop('_tmp', axis=1)
df.to_csv(r"C:\data\LatLong.csv", header=False, index=False)
We create a dataframe for each file, and merge them on a temporary column, which produces the cartesian product. https://pandas.pydata.org/pandas-docs/version/0.20/merging.html
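On pandas 1.2 or newer the temporary column is unnecessary: a cross merge produces the cartesian product directly (a sketch, assuming the same file paths):
import pandas as pd

lats = pd.read_csv(r"C:\data\Lat.csv", header=None)
lons = pd.read_csv(r"C:\data\Long.csv", header=None)
df = lats.merge(lons, how='cross')  # cartesian product; requires pandas >= 1.2
df.to_csv(r"C:\data\LatLong.csv", header=False, index=False)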

Python: Combine two lists, one line per each line of the other list

I'm trying to combine two lists into a CSV, and have it output one line for each line of the second list.
a.csv
1
2
3
b.csv
a,x
b,y
c,z
Output:
c.csv
1|a|x
2|a|x
3|a|x
1|b|y
2|b|y
3|b|y
1|c|z
2|c|z
3|c|z
So for each line of "a", combine it with each line of "b", and get the result in "c".
Note, I have no need to separate "b" to reorder the columns, keeping the original order is fine.
A loop seems needed, but I'm having zero luck doing it.
Answered (the output is not perfect, but OK for what I was needing):
import csv
from itertools import product

def main():
    with open('a.csv', 'rb') as f1, open('b.csv', 'rb') as f2:
        reader1 = csv.reader(f1, dialect=csv.excel_tab)
        reader2 = csv.reader(f2, dialect=csv.excel_tab)
        with open('output.csv', 'wb') as output:
            writer = csv.writer(output, delimiter='|', dialect=csv.excel_tab)
            writer.writerows(row1 + row2 for row1, row2 in product(reader1, reader2))

if __name__ == "__main__":
    main()
Output file:
1|a,x
1|b,y
1|c,z
2|a,x
2|b,y
2|c,z
3|a,x
3|b,y
3|c,z
Yes, the "|" appears as only one of the separators; the comma from b.csv is carried through unchanged.
It would be nice to know how to get "1|a|x" and so on.
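The stray comma comes from reading b.csv with the excel_tab dialect, which leaves "a,x" as a single field. A sketch (Python 3 style) that reads b.csv with the default comma delimiter splits it into two fields, so every value gets its own "|"-separated column:
import csv
from itertools import product

with open('a.csv', newline='') as f1, open('b.csv', newline='') as f2:
    rows1 = list(csv.reader(f1))  # [['1'], ['2'], ['3']]
    rows2 = list(csv.reader(f2))  # [['a', 'x'], ['b', 'y'], ['c', 'z']]

with open('output.csv', 'w', newline='') as output:
    writer = csv.writer(output, delimiter='|')
    writer.writerows(row1 + row2 for row1, row2 in product(rows1, rows2))

This yields 1|a|x, 1|b|y, 1|c|z, and so on.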
One way is to use pandas to join the two files side by side, row for row:
import pandas as pd
df = pd.concat([pd.read_csv(f, header=None) for f in ('a.csv', 'b.csv')], axis=1)
df.to_csv('out.csv', sep='|', index=False, header=False)
A native Python approach, using itertools.product:
from itertools import product

# read file a, strip newlines, replace commas with the new delimiter, skip empty lines
a = [line.strip().replace(",", "|") for line in open("a.csv") if line.strip()]
# same for file b
b = [line.strip().replace(",", "|") for line in open("b.csv") if line.strip()]
# combine the two lists
c = ["|".join([i, j]) for i, j in product(a, b)]
# write into a new file
with open("c.csv", "w") as f:
    for item in c:
        f.write(item + "\n")
Output:
1|a|x
1|b|y
1|c|z
2|a|x
2|b|y
2|c|z
3|a|x
3|b|y
3|c|z

Merge CSVs in Python with different columns

I have hundreds of large CSV files that I would like to merge into one. However, not all CSV files contain all columns. Therefore, I need to merge files based on column name, not column position.
Just to be clear: in the merged CSV, values should be empty for a cell coming from a line which did not have the column of that cell.
I cannot use the pandas module, because it makes me run out of memory.
Is there a module that can do that, or some easy code?
The csv.DictReader and csv.DictWriter classes should work well (see Python docs). Something like this:
import csv

inputs = ["in1.csv", "in2.csv"]  # etc

# First determine the field names from the top line of each input file
# Comment 1 below
fieldnames = []
for filename in inputs:
    with open(filename, "r", newline="") as f_in:
        reader = csv.reader(f_in)
        headers = next(reader)
        for h in headers:
            if h not in fieldnames:
                fieldnames.append(h)

# Then copy the data
with open("out.csv", "w", newline="") as f_out:  # Comment 2 below
    writer = csv.DictWriter(f_out, fieldnames=fieldnames)
    for filename in inputs:
        with open(filename, "r", newline="") as f_in:
            reader = csv.DictReader(f_in)  # Uses the field names in this file
            for line in reader:
                # Comment 3 below
                writer.writerow(line)
Comments from above:
1. You need to specify all the possible field names in advance to DictWriter, so you need to loop through all your CSV files twice: once to find all the headers, and once to read the data. There is no better solution, because all the headers need to be known before DictWriter can write the first line. This part would be more efficient using a set instead of a list (the in operator on a list is comparatively slow), but it won't make much difference for a few hundred headers. A set would also lose the deterministic ordering of a list - your columns could come out in a different order each time you ran the code (a sketch combining both appears below).
2. The above code is for Python 3, where weird things happen in the CSV module without newline="". Remove this for Python 2.
3. At this point, line is a dict with the field names as keys, and the column data as values. You can specify what to do with blank or unknown values in the DictReader and DictWriter constructors.
This method should not run out of memory, because it never has the whole file loaded at once.
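A minimal sketch of the set-plus-list idea from comment 1 (same assumed input files): the list preserves first-seen column order, while the set makes the membership test fast.
import csv

inputs = ["in1.csv", "in2.csv"]  # etc

fieldnames = []  # preserves first-seen column order
seen = set()     # gives an O(1) membership test
for filename in inputs:
    with open(filename, "r", newline="") as f_in:
        for h in next(csv.reader(f_in)):
            if h not in seen:
                seen.add(h)
                fieldnames.append(h)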
For those of us using 2.7, this adds an extra linefeed between records in "out.csv". To resolve this, just change the file mode from "w" to "wb".
The solution by @Aaron Lockey, which is the accepted answer, has worked well for me, except that there were no headers in the file: the output had only row data, and each column was without a heading (key). So I inserted the following:
writer.writeheader()
and it worked perfectly fine for me! So now the entire code appears like this:
import csv

inputs = ["in1.csv", "in2.csv"]  # etc

# First determine the field names from the top line of each input file
fieldnames = []
for filename in inputs:
    with open(filename, "r", newline="") as f_in:
        reader = csv.reader(f_in)
        headers = next(reader)
        for h in headers:
            if h not in fieldnames:
                fieldnames.append(h)

# Then copy the data
with open("out.csv", "w", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=fieldnames)
    writer.writeheader()  # this is the addition
    for filename in inputs:
        with open(filename, "r", newline="") as f_in:
            reader = csv.DictReader(f_in)  # Uses the field names in this file
            for line in reader:
                writer.writerow(line)
You can use the pandas module to do this pretty easily. This snippet assumes all your csv files are in the current folder.
import pandas as pd
import os

all_csv = [file_name for file_name in os.listdir(os.getcwd()) if file_name.endswith('.csv')]
li = []
for filename in all_csv:
    df = pd.read_csv(filename, index_col=None, header=0, parse_dates=True, infer_datetime_format=True)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
frame.to_csv('melted_csv.csv', index=False)
I've faced a situation where not only the number of columns differed, but some column names were missing as well. For this kind of situation, and obviously for your case, this code snippet can help you :)
The tricky part is naming the columns which have no names and adding them to the dictionary. The read_csv_file function plays the main role here.
import csv

def read_csv_file(csv_file_path):
    headers = []
    data = []
    with open(csv_file_path, 'r') as f:
        csv_reader = csv.reader(f)
        rows = []
        for i, row in enumerate(csv_reader):
            if i == 0:
                # name unnamed columns col-1, col-2, ... and collect the headers
                for j in range(len(row)):
                    if row[j].strip() == "":
                        col_name = f"col-{j+1}"
                    else:
                        col_name = row[j]
                    if col_name not in headers:
                        headers.append(col_name)
            else:
                rows.append(row)
                # a data row longer than the header row contributes extra columns
                if len(row) > len(headers):
                    for j in range(len(row)):
                        if j + 1 > len(headers):
                            col_name = f"col-{j+1}"
                            if col_name not in headers:
                                headers.append(col_name)
        for i, row in enumerate(rows):
            row_data = {}
            for j in range(len(headers)):
                if len(row) > j:
                    row_data[headers[j]] = row[j]
                else:
                    row_data[headers[j]] = ''
            data.append(row_data)
    return headers, data

def write_csv_file(file_path, rows):
    if len(rows) > 0:
        headers = list(rows[0].keys())
        with open(file_path, 'w', newline='', encoding='UTF8') as f:
            writer = csv.DictWriter(f, fieldnames=headers)
            writer.writeheader()
            writer.writerows(rows)

# The list of the csv file paths which will be merged
files_to_be_merged = [
    'file-1.csv',
    'file-2.csv',
    'file-3.csv',
]

# Read and store all the file data in new_file_data
final_headers = []
new_file_data = []
for f1 in files_to_be_merged:
    single_file_data = read_csv_file(f1)
    for h in single_file_data[0]:
        if h not in final_headers:
            final_headers.append(h)
    new_file_data += single_file_data[1]

# Add the missing keys to the dictionaries
for d in new_file_data:
    for h in final_headers:
        if d.get(h) is None:
            d[h] = ""

# Write a new file
target_file_name = 'merged_file.csv'
write_csv_file(target_file_name, new_file_data)

Merging CSV files with a single column into one CSV file with 14 columns

I currently have 14 CSV files, each containing one column of data for a day (14 because it goes back two weeks).
What I want to do is make one CSV file containing the data from all 14 of these CSVs.
e.g. if each CSV contains this:
1
2
3
4
I would want the outcome to be a csv file with
1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,
4,4,4,4,4,4,4,4,4,4,4,4,4,4,
(the actual CSVs have 288 rows)
I'm currently using some code I found in another question. It worked fine for 2 or 3 CSVs, but when I added more it didn't handle more than the first 3, and the code now looks extremely messy.
Apologies for the large chunk of code, but this is what I have so far.
def csvappend():
    with open('C:\dev\OTQtxt\\result1.csv', 'rb') as csv1:
        with open('C:\dev\OTQtxt\\result2.csv', 'rb') as csv2:
            with open('C:\dev\OTQtxt\\result3.csv', 'rb') as csv3:
                with open('C:\dev\OTQtxt\\result4.csv', 'rb') as csv4:
                    with open('C:\dev\OTQtxt\\result5.csv', 'rb') as csv5:
                        with open('C:\dev\OTQtxt\\result6.csv', 'rb') as csv6:
                            with open('C:\dev\OTQtxt\\result7.csv', 'rb') as csv7:
                                with open('C:\dev\OTQtxt\\result8.csv', 'rb') as csv8:
                                    with open('C:\dev\OTQtxt\\result9.csv', 'rb') as csv9:
                                        with open('C:\dev\OTQtxt\\result10.csv', 'rb') as csv10:
                                            with open('C:\dev\OTQtxt\\result11.csv', 'rb') as csv11:
                                                with open('C:\dev\OTQtxt\\result12.csv', 'rb') as csv12:
                                                    with open('C:\dev\OTQtxt\\result13.csv', 'rb') as csv13:
                                                        with open('C:\dev\OTQtxt\\result14.csv', 'rb') as csv14:
                                                            reader1 = csv.reader(csv1, delimiter=',')
                                                            reader2 = csv.reader(csv2, delimiter=',')
                                                            reader3 = csv.reader(csv3, delimiter=',')
                                                            reader4 = csv.reader(csv4, delimiter=',')
                                                            reader5 = csv.reader(csv5, delimiter=',')
                                                            reader6 = csv.reader(csv6, delimiter=',')
                                                            reader7 = csv.reader(csv7, delimiter=',')
                                                            reader8 = csv.reader(csv8, delimiter=',')
                                                            reader9 = csv.reader(csv9, delimiter=',')
                                                            reader10 = csv.reader(csv10, delimiter=',')
                                                            reader11 = csv.reader(csv11, delimiter=',')
                                                            reader12 = csv.reader(csv12, delimiter=',')
                                                            reader13 = csv.reader(csv13, delimiter=',')
                                                            reader14 = csv.reader(csv14, delimiter=',')
                                                            all = []
                                                            for row1, row2, row3, row4, row5, row6, row7, row8, row9, \
                                                                    row10, row11, row12, row13, row14 in zip(reader1,
                                                                        reader2, reader3, reader4, reader5, reader7,
                                                                        reader8, reader9, reader10, reader11,
                                                                        reader12, reader13, reader14):
                                                                row14.append(row1[0])
                                                                row14.append(row2[0])
                                                                row14.append(row3[0])
                                                                row14.append(row4[0])
                                                                row14.append(row5[0])
                                                                row14.append(row6[0])
                                                                row14.append(row7[0])
                                                                row14.append(row8[0])
                                                                row14.append(row9[0])
                                                                row14.append(row10[0])
                                                                row14.append(row11[0])
                                                                row14.append(row12[0])
                                                                row14.append(row13[0])
                                                                all.append(row14)
                                                            with open('C:\dev\OTQtxt\TODAY.csv', 'wb') as output:
                                                                writer = csv.writer(output, delimiter=',')
                                                                writer.writerows(all)
I think some of my indenting got messed up when copying it in, but you should get the idea. And I don't expect you to read through all of that; it's very repetitive.
I have seen a few similar/related questions recommending unix tools. In case anybody was going to suggest those, I'd better point out that this will be running on Windows.
If anybody has any ideas of how I could clean this up and actually get it working, I'd be hugely grateful!
Creating files:
xxxx#xxxx:/tmp/files$ for i in {1..15}; do echo -e "1\n2\n3\n4" > "my_csv_$i.csv"; done
xxxx#xxxx:/tmp/files$ more my_csv_1.csv
1
2
3
4
xxxx#xxxx:/tmp/files$ ls
my_csv_10.csv my_csv_11.csv my_csv_12.csv my_csv_13.csv my_csv_14.csv my_csv_15.csv my_csv_1.csv my_csv_2.csv my_csv_3.csv my_csv_4.csv my_csv_5.csv my_csv_6.csv my_csv_7.csv my_csv_8.csv my_csv_9.csv
Using itertools.izip_longest:
import os
from itertools import izip_longest  # Python 3: use itertools.zip_longest

with open('result.csv', 'w') as f_obj:
    rows = []
    files = os.listdir('.')
    for f in files:
        rows.append(open(f).readlines())
    transposed = izip_longest(*rows)
    for row in transposed:
        f_obj.write(','.join([field.strip() for field in row if field is not None]) + '\n')
Output:
xxxxx#xxxx:/tmp/files$ more result.csv
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
That's not the best solution, since it puts all your data in memory, but it should give you an idea of how to do this. By the way, if all your data is numeric, I would stay with numpy and play with multidimensional arrays.
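For purely numeric data that might look like this (a sketch; the file names are assumed, and each file is taken to hold one equal-length column):
import numpy as np

filenames = ['result%d.csv' % i for i in range(1, 15)]
# read each single-column file into a 1-D array
columns = [np.genfromtxt(name, delimiter=',') for name in filenames]
merged = np.column_stack(columns)  # one input file per output column
np.savetxt('TODAY.csv', merged, delimiter=',', fmt='%g')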
You can use this; the file names can also be built in a loop:
import numpy as np

filenames = ['file1', 'file2', 'file3']  # all the files to be read in
data = []  # saves data from the files
for filename in filenames:
    # append a list of all numbers in the current file
    data.append(open(filename, 'r').readlines())
# transpose the list of lists using numpy
data = np.matrix(data).T
# build a string: inner elements separated by ',', rows by '\n'
data_string = '\n'.join([','.join([k.strip() for k in j]) for j in data.tolist()])
with open('newfile', 'w') as fp:
    fp.write(data_string)
Have just tested:
import csv
import glob

files = glob.glob1("C:\\dev\\OTQtxt", "*.csv")
rows = []
with open('C:\\dev\\OTQtxt\\one.csv', 'a') as oneFile:
    writer = csv.writer(oneFile)
    for file in files:
        rows.append(open("C:\\dev\\OTQtxt\\" + file, 'r').read().splitlines())
    for row in rows:
        writer.writerow(row)
This will produce a file one.csv in your directory containing all the merged *.csv files, one input file per row.

reading and parsing a TSV file, then manipulating it for saving as CSV (*efficiently*)

My source data is in a TSV file, 6 columns and greater than 2 million rows.
Here's what I'm trying to accomplish:
1. I need to read the data in 3 of the columns (3, 4, 5) in this source file.
2. The fifth column is an integer. I need to use this integer value to duplicate a row entry, using the data in the third and fourth columns, that integer number of times.
3. I want to write the output of #2 to an output file in CSV format.
Below is what I came up with.
My question: is this an efficient way to do it? It seems like it might be intensive when attempted on 2 million rows.
First, I made a sample tab-separated file to work with, and called it 'sample.txt'. It's basic and only has four rows:
Row1_Column1 Row1-Column2 Row1-Column3 Row1-Column4 2 Row1-Column6
Row2_Column1 Row2-Column2 Row2-Column3 Row2-Column4 3 Row2-Column6
Row3_Column1 Row3-Column2 Row3-Column3 Row3-Column4 1 Row3-Column6
Row4_Column1 Row4-Column2 Row4-Column3 Row4-Column4 2 Row4-Column6
Then I have this code:
import csv

with open('sample.txt', 'r') as tsv:
    AoA = [line.strip().split('\t') for line in tsv]

for a in AoA:
    count = int(a[4])
    while count > 0:
        with open('sample_new.csv', 'a', newline='') as csvfile:
            csvwriter = csv.writer(csvfile, delimiter=',')
            csvwriter.writerow([a[2], a[3]])
        count = count - 1
You should use the csv module to read the tab-separated value file. Do not read it into memory in one go. Each row you read has all the information you need to write rows to the output CSV file, after all. Keep the output file open throughout.
import csv

with open('sample.txt', newline='') as tsvin, open('new.csv', 'w', newline='') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)
    for row in tsvin:
        count = int(row[4])
        if count > 0:
            csvout.writerows([row[2:4] for _ in range(count)])
or, using the itertools module to do the repeating with itertools.repeat():
from itertools import repeat
import csv

with open('sample.txt', newline='') as tsvin, open('new.csv', 'w', newline='') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)
    for row in tsvin:
        count = int(row[4])
        if count > 0:
            csvout.writerows(repeat(row[2:4], count))
