This question already has answers here:
Merging two CSV files using Python
(2 answers)
Closed 7 years ago.
I want to merge two CSV files using some scripting language (like a bash script or Python).
1st.csv (this data is from a MySQL query)
member_id,name,email,desc
03141,ej,ej#domain.com,cool
00002,jes,jes#domain.com,good
00002,charmie,charm#domain.com,sweet
2nd.csv (from a MongoDB query)
id,address,create_date
00002,someCity,20150825
00003,newCity,20140102
11111,,20150808
The examples are not the actual data, though I know that some of the member_id values from MySQL and the id values from MongoDB are the same.
(I would like my output to be something like this)
desiredoutput.csv
member_id,name,email,desc,address,create_date
03141,ej,ej#domain.com,cool,,
00002,jes,jes#domain.com,good,someCity,20150825
00002,charmie,charm#domain.com,sweet,,
11111,,,,,20150808
Any help will be much appreciated. Thanks in advance.
#########################################################################
#!/usr/bin/python
import csv
import itertools as IT

filenames = ['1st.csv', '2nd.csv']
handles = [open(filename, 'rb') for filename in filenames]
readers = [csv.reader(f, delimiter=',') for f in handles]

with open('desiredoutput.csv', 'wb') as h:
    writer = csv.writer(h, delimiter=',', lineterminator='\n')
    for rows in IT.izip_longest(*readers, fillvalue=['']*2):
        combined_row = []
        for row in rows:
            row = row[:1]  # column where I know there is identical data
            if len(row) == 1:
                combined_row.extend(row)
            else:
                combined_row.extend(['']*1)
        writer.writerow(combined_row)

for f in handles:
    f.close()
#########################################################################
I just read this code on this site too and tried it (with some manipulation).
Since you haven't posted an attempt, I'll give you a general answer (using Python) to get you started.
Create a dict, d
Iterate over all the rows of the first file, convert each row into a list and store it in d using member_id as the key and the list as the value.
Iterate over all the rows of the second file, convert each row into a list leaving out the id column and update the list under d[id] with the new list if d[id] exists, otherwise store the new list under d[id].
Finally, iterate over the values in d and print them out comma separated to a file.
Edit
In your attempt, you are trying to use izip_longest to iterate over the rows of both files at the same time. But this would work only if there were an equal number of rows in both files and they were in the same order.
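To see the problem, here is a small demonstration (a sketch using Python 3's itertools.zip_longest; the input lists are mocked up from your sample data):

from itertools import zip_longest

mysql_rows = [['03141', 'ej'], ['00002', 'jes'], ['00002', 'charmie']]
mongo_rows = [['00002', 'someCity'], ['00003', 'newCity']]

# Rows are paired strictly by position, not by id:
for pair in zip_longest(mysql_rows, mongo_rows, fillvalue=['', '']):
    print(pair)
# (['03141', 'ej'], ['00002', 'someCity'])   <-- ids do not match
# (['00002', 'jes'], ['00003', 'newCity'])
# (['00002', 'charmie'], ['', ''])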
Anyhow, here is one way of doing it.
Note: This is using the Python 3.4+ csv module. For 2.7 it might look a little different.
import csv

d = {}

with open("file1.csv", newline="") as f:
    for row in csv.reader(f):
        # pad each row with empty address and create_date columns
        d.setdefault(row[0], []).append(row + [""] * 2)

with open("file2.csv", newline="") as f:
    for row in csv.reader(f):
        # update every row stored under this id, or create a new padded one
        rows = d.setdefault(row[0], [[row[0], "", "", ""]])
        for old_row in rows:
            old_row[4:] = row[1:]

with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for rows in d.values():
        writer.writerows(rows)
Here is a suggestion using pandas, which I got from this answer and the pandas docs about merging.
import pandas as pd
first = pd.read_csv('1st.csv')
second = pd.read_csv('2nd.csv')
merged = pd.concat([first, second], axis=1)
This will output:
member_id     name             email   desc     id   address  create_date
     3141       ej     ej#domain.com   cool      2  someCity     20150825
        2      jes    jes#domain.com   good      3   newCity     20140102
        2  charmie  charm#domain.com  sweet  11111       NaN     20150808
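Note that concat pairs the rows up by position rather than joining them on the id. If you want the rows aligned by key, here is a sketch using pandas.merge (dtype=str is an assumption here, to preserve the leading zeros in the ids):

import pandas as pd

first = pd.read_csv('1st.csv', dtype=str)
second = pd.read_csv('2nd.csv', dtype=str)

# outer join keeps ids that appear in only one of the two files
merged = first.merge(second, left_on='member_id', right_on='id', how='outer')
merged.to_csv('desiredoutput.csv', index=False)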
Related
Analysis software I'm using outputs many groups of results in 1 csv file and separates the groups with 2 empty lines.
I would like to break the results in groups so that I can then analyse them separately.
I'm sure there is a built-in function in Python (or one of its libraries) that does this. I tried this piece of code that I found somewhere, but it doesn't seem to work.
import csv
results = open('03_12_velocity_y.csv').read().split("\n\n")
# Feed first csv.reader
first_csv = csv.reader(results[0], delimiter=',')
# Feed second csv.reader
second_csv = csv.reader(results[1], delimiter=',')
Update:
The original code actually works, but my Python skills are pretty limited and I did not implement it properly.
The .split('\n\n\n') method does work, but csv.reader returns an iterator object; to get the data into a list (or something similar), you need to iterate through all the rows and append them to the list.
I then used Pandas to remove the header and convert the scientifically notated values to float. The code is below. Thanks everyone for the help.
import csv
import pandas as pd

# Open the csv file, read it and split it when it encounters 2 empty lines (\n\n\n)
results = open('03_12_velocity_y.csv').read().split('\n\n\n')

# Define the output - create an empty multi-dimensional list
output1 = [[], []]

# Create a csv.reader object that is used to iterate over rows in a csv file
# Feed it the first group
csv_reader1 = csv.reader(results[0].splitlines(), delimiter=',')

# Iterate through the rows in the csv file and append the data to the empty list
for row in csv_reader1:
    output1.append(row)

df = pd.DataFrame(output1)
# Remove the first 7 rows of data (the start position of the slice is always included)
df = df.iloc[7:]
# Convert all data from string to float
df = df.astype(float)
If your row counts are inconsistent across groups, you'll need a little state machine to check when you're between groups and do something with the last group.
#!/usr/bin/env python3
import csv

def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as out_f:
        csv.writer(out_f).writerows(group)

with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    group_i = 1
    group = []
    last_row = []
    for row in reader:
        if row == [] and last_row == [] and group != []:
            write_group(group, group_i)
            group = []
            group_i += 1
            continue
        if row == []:
            last_row = row
            continue
        group.append(row)
        last_row = row

    # flush remaining group
    if group != []:
        write_group(group, group_i)
I mocked up this sample CSV:
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3


g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3


g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
And when I run the program above I get three CSV files:
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
group_2.csv
g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3
group_3.csv
g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
If your row counts are consistent, you can do this with fairly vanilla Python or using the Pandas library.
Vanilla Python
Define your group size and the size of the break (in "rows") between groups.
Loop over all the rows adding each row to a group accumulator.
When the group accumulator reaches the pre-defined group size, do something with it, reset the accumulator, and then skip break-size rows.
Here, I'm writing each group to its own numbered file:
import csv

group_sz = 5
break_sz = 2

def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as f_out:
        csv.writer(f_out).writerows(group)

with open("input.csv", newline="") as f_in:
    reader = csv.reader(f_in)
    group_i = 1
    group = []
    for row in reader:
        group.append(row)
        if len(group) == group_sz:
            write_group(group, group_i)
            group_i += 1
            group = []
            # skip the break rows between groups
            for _ in range(break_sz):
                try:
                    next(reader)
                except StopIteration:  # gracefully ignore an expected StopIteration (at the end of the file)
                    break
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
g1r4c1,g1r4c2,g1r4c3
g1r5c1,g1r5c2,g1r5c3
With Pandas
I'm new to Pandas, and learning this as I go, but it looks like Pandas will automatically trim blank rows/records from a chunk of data.
With that in mind, all you need to do is specify the size of your group, and tell Pandas to read your CSV file in "iterator mode", where you can ask for a chunk (your group size) of records at a time:
import pandas as pd

group_sz = 5

with pd.read_csv("input.csv", header=None, iterator=True) as reader:
    i = 1
    while True:
        try:
            df = reader.get_chunk(group_sz)
        except StopIteration:
            break
        df.to_csv(f"group_{i}.csv")
        i += 1
Pandas adds an index column and a default header when it writes out the CSV:
group_1.csv
,0,1,2
0,g1r1c1,g1r1c2,g1r1c3
1,g1r2c1,g1r2c2,g1r2c3
2,g1r3c1,g1r3c2,g1r3c3
3,g1r4c1,g1r4c2,g1r4c3
4,g1r5c1,g1r5c2,g1r5c3
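If you don't want the extra index column or the generated header, a sketch of the same call with both suppressed (this would replace the df.to_csv line in the loop above):

# write each chunk without pandas' index column or generated header
df.to_csv(f"group_{i}.csv", index=False, header=False)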
Try this out with your output:
import pandas as pd

# csv file name to be read in
in_csv = 'input.csv'

# get the number of lines of the csv file to be read
number_lines = sum(1 for row in open(in_csv))

# size of rows of data to write to the csv,
# you can change the row size according to your need
rowsize = 500

# start looping through data, writing it to a new file for each set
for i in range(1, number_lines, rowsize):
    df = pd.read_csv(in_csv,
                     header=None,
                     nrows=rowsize,   # number of rows to read at each loop
                     skiprows=i)      # skip rows that have been read

    # csv to write data to a new file with indexed name. input_1.csv etc.
    out_csv = 'input' + str(i) + '.csv'
    df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a')  # append data to csv file
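Note that this re-reads the file from the top on every iteration because of skiprows. A possible alternative (a sketch assuming the same in_csv and rowsize; the output files are numbered 1, 2, 3... rather than by row offset) is pandas' chunksize option, which reads the file in a single pass:

import pandas as pd

in_csv = 'input.csv'
rowsize = 500

# each chunk is a DataFrame of up to `rowsize` rows, read in one pass
for i, chunk in enumerate(pd.read_csv(in_csv, header=None, chunksize=rowsize), start=1):
    chunk.to_csv('input' + str(i) + '.csv', index=False, header=False)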
I updated the question with the final details that answered my question.
I have two csv files. I am trying to look up a value from the first column of one file (file 1) in the first column of the other file (file 2). If they match, then print the row from file 2.
Pseudo code:
read file1.csv
read file2.csv
loop through file1
    compare each row with each row of file 2 in turn
    if file1[0] == file2[0]:
        print row of file 2
file1:
45,John
46,Fred
47,Bill
File2:
46,Roger
48,Pete
49,Bob
I want it to print :
46 Roger
EDIT - these are examples; the actual file is much bigger (5,000 rows, 7 columns)
I have the following:
import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1reader = csv.reader(csvfile1)
    csv2reader = csv.reader(csvfile2)
    for rowcsv1 in csv1reader:
        for rowcsv2 in csv2reader:
            if rowcsv1[0] == rowcsv2[0]:
                print(rowcsv1)
However I am getting no output.
I am aware there are other ways of doing it (with dict, pandas) but I am keen to know why my approach is not working.
EDIT: I now see that it is only iterating through the first row of file 1 and then closing, but I am unclear how to stop it closing (I also understand that this is not the best way to do it).
You open csv2reader = csv.reader(csvfile2), then iterate through it against the first row of csv1reader - it has now reached end of file and will not produce any more data.
So for the second through last rows of csv1reader, you are comparing against the items of an empty list, i.e. no comparison takes place.
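For completeness, the nested-loop version can be made to work by rewinding the second file before each pass over it; a sketch (still quadratic, so slow on big files):

import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1reader = csv.reader(csvfile1)
    for rowcsv1 in csv1reader:
        csvfile2.seek(0)  # rewind file 2 so its reader produces rows again
        for rowcsv2 in csv.reader(csvfile2):
            if rowcsv1[0] == rowcsv2[0]:
                print(rowcsv2)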
In any case, this is a very inefficient method; unless you are working on very large files, it would be much better to do
import csv

# load second file as lookup table
data = {}
with open("csv2file.csv") as inf2:
    for row in csv.reader(inf2):
        data[row[0]] = row

# now process first file against it
with open("csv1file.csv") as inf1:
    for row in csv.reader(inf1):
        if row[0] in data:
            print(data[row[0]])
See Hugh Bothwell's answer for why your code isn't working. For a fast way of doing what you stated you want to do in your question, try this:
import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1 = list(csv.reader(csvfile1))
    csv2 = list(csv.reader(csvfile2))

duplicates = {a[0] for a in csv1} & {a[0] for a in csv2}

for row in csv2:
    if row[0] in duplicates:
        print(row)
It gets the duplicate numbers from the two csv files, then loops through the second csv file, printing the row if the number at index 0 is in the first csv file. This is a much faster algorithm than what you were attempting to do.
If order matters, as @hugh-bothwell mentioned in @will-da-silva's answer, you could do:
import csv
from collections import OrderedDict

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1 = list(csv.reader(csvfile1))
    csv2 = list(csv.reader(csvfile2))

d = {row[0]: row for row in csv2}
keys = OrderedDict.fromkeys([a[0] for a in csv1]).keys()
duplicate_keys = [k for k in keys if k in d]

for k in duplicate_keys:
    print(d[k])
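As a side note, on Python 3.7+ plain dicts preserve insertion order, so the OrderedDict import can be dropped; a sketch of the same idea:

# Python 3.7+: dict preserves insertion order, so OrderedDict is unnecessary
d = {row[0]: row for row in csv2}
for key in dict.fromkeys(a[0] for a in csv1):  # unique keys, in file-1 order
    if key in d:
        print(d[key])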
I'm pretty sure there's a better way to do this, but try out this solution, it should work.
import csv

counter = 0

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1reader = csv.reader(csvfile1)
    csv2reader = csv.reader(csvfile2)
    for rowcsv1 in csv1reader:
        for rowcsv2 in csv2reader:
            if rowcsv1[counter] == rowcsv2[counter]:
                print(rowcsv1)
        counter += 1  # increment it outside of the if statement
Ahoy, I'm writing a Python script to filter some large CSV files.
I only want to keep rows which meet my criteria.
My input is a CSV file in the following format
Locus         Total_Depth  Average_Depth_sample  Depth_for_17
chr1:6484996  1030         1030                  1030
chr1:6484997  14           14                    14
chr1:6484998  0            0                     0
I want to return lines where the Total_Depth is 0.
I've been following this answer to read the data, but I am stuck trying to iterate over the rows and pull out the lines that meet my condition.
Here is the code I have so far:
import csv

f = open("file path", 'rb')
reader = csv.reader(f)   # reader object which iterates over the csv file (f)
headers = reader.next()  # assign the first row to the headers variable

column = {}              # dict of columns
for h in headers:        # for each header
    column[h] = []

for row in reader:                  # for each row in the reader object
    for h, v in zip(headers, row):  # combine header names with row values (v) in a series of tuples
        column[h].append(v)         # append each value to the relevant column
I understand that my data is now in a dictionary format, and I want to filter it based on the "Total_Depth" key, but I am unsure how to do this. I'm aiming to use an 'if' statement to select the relevant rows, but not sure how to do this with the dictionary structure.
Any advice would be greatly appreciated. SB :)
Use a list comprehension.
import csv

with open("filepath", 'rb') as f:
    reader = csv.DictReader(f)
    rows = [row for row in reader if row['Total_Depth'] != '0']

for row in rows:
    print row
See the csv.DictReader documentation.
If you store the full result of the zip, you can check the appropriate header before assigning:
...
for row in reader:              # for each row in the reader object
    r = dict(zip(headers, row))
    if r['Total_Depth'] == '0':  # csv values are strings, so compare to '0'
        for h, v in r.items():
            column[h].append(v)
The dictionary of lists that you are using makes row operations quite difficult because you have to mess with parallel lists. namedtuples are a much more convenient way to collect and operate on tabular data.
The other answers satisfy the exact problem you have. Using a more friendly data structure will help with the problems you have tomorrow.
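For example, a minimal namedtuple sketch (assuming the same header names as in your sample file):

import csv
from collections import namedtuple

with open("file path", 'rb') as f:
    reader = csv.reader(f)
    Row = namedtuple("Row", reader.next())  # field names taken from the header row
    rows = [Row(*line) for line in reader]

# rows with zero total depth, accessed by column name
zero_depth = [r for r in rows if r.Total_Depth == '0']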
The purpose of my Python script is to compare the data present in multiple CSV files, looking for discrepancies. The data are ordered, but the ordering differs between files. The files contain about 70K lines, weighing around 15MB. Nothing fancy or hardcore here. Here's part of the code:
import csv

allRows = []  # accumulates rows from every file read

def getCSV(fpath):
    with open(fpath, "rb") as f:
        csvfile = csv.reader(f)
        for row in csvfile:
            allRows.append(row)

allCols = map(list, zip(*allRows))
Am I properly reading from my CSV files? I'm using csv.reader, but would I benefit from using csv.DictReader?
How can I create a list containing whole rows which have a certain value in a precise column?
Are you sure you want to be keeping all rows around? This creates a list with matching values only... fname could also come from glob.glob() or os.listdir() or whatever other data source you choose. Just to note: you mention the 20th column, but row[20] will be the 21st column...
import csv

matching20 = []
for fname in ('file1.csv', 'file2.csv', 'file3.csv'):
    with open(fname) as fin:
        csvin = csv.reader(fin)
        next(csvin)  # <--- if you want to skip the header row
        for row in csvin:
            if row[20] == 'value':
                matching20.append(row)  # or do something with it here
You only want csv.DictReader if you have a header row and want to access your columns by name.
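For instance, a short sketch (the file and column names here are made up):

import csv

with open('data.csv') as fin:
    for record in csv.DictReader(fin):  # the header row becomes the dict keys
        print(record['some_column'])    # 'some_column' is a placeholder name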
This should work, you don't need to make another list to have access to the columns.
import csv

def getCSV(fpath):
    with open(fpath) as ifile:
        csvfile = csv.reader(ifile)
        rows = list(csvfile)
    value_20 = [x for x in rows if x[20] == 'value']
If I understand the question correctly, you want to include a row if value is in the row, but you don't know which column value is, correct?
If your rows are lists, then this should work:
testlist = [row for row in allRows if 'value' in row]
post-edit:
If, as you say, you want a list of rows where value is in a specified column (specified by an integer pos), then:
testlist = []
pos = 20
for row in allRows:
    if row[pos] == 'value':
        testlist.append(row)
(I haven't tested this, but let me know if that works.)
I have 125 data files containing two columns and 21 rows of data and I'd like to import them into a single .csv file (as 125 pairs of columns and only 21 rows).
I am fairly new to python but I have come up with the following code:
import glob

Results = glob.glob('./*.data')

fout = 'c:/Results/res.csv'
fout = open("res.csv", 'w')
for file in Results:
    g = open(file, "r")
    fout.write(g.read())
    g.close()
fout.close()
The problem with the above code is that all the data are copied into only two columns with 125*21 rows.
Any help is very much appreciated!
This should work:
import glob
files = [open(f) for f in glob.glob('./*.data')] #Make list of open files
fout = open("res.csv", 'w')
for row in range(21):
for f in files:
fout.write( f.readline().strip() ) # strip removes trailing newline
fout.write(',')
fout.write('\n')
fout.close()
Note that this method will probably fail if you try a large number of files; operating systems limit how many files a process can have open at once (the default is around 256 on some platforms).
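If you do hit that limit, one workaround (a sketch, assuming every file has exactly the same number of rows) is to read each file fully and close it before opening the next, then transpose:

import glob

columns = []
for fname in glob.glob('./*.data'):
    with open(fname) as f:  # each file is closed before the next one opens
        columns.append([line.strip() for line in f])

with open('res.csv', 'w') as fout:
    for row in zip(*columns):  # transpose: one output row per input line
        fout.write(','.join(row) + '\n')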
You may want to try the python CSV module (http://docs.python.org/library/csv.html), which provides very useful methods for reading and writing CSV files. Since you stated that you want only 21 rows with 250 columns of data, I would suggest creating 21 python lists as your rows and then appending data to each row as you loop through your files.
something like:
import csv

rows = []
for i in range(0, 21):
    row = []
    rows.append(row)

# Not sure of the structure of your input files or how they are delimited,
# but for each one, as you have it open and iterate through its rows, you
# would want to append the values in each row to the end of the
# corresponding list contained within the rows list.

# Then, write each row to the new csv:
writer = csv.writer(open('output.csv', 'wb'), delimiter=',')
for row in rows:
    writer.writerow(row)
(Sorry, I cannot add comments, yet.)
[Edited later - the following statement is wrong!] "davesnitty's row-generating loop can be replaced by rows = [[]] * 21." It is wrong because this would create a list of 21 elements that all refer to one single shared empty list.
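A quick demonstration of that pitfall:

rows = [[]] * 3                  # three references to ONE shared list
rows[0].append('x')
print(rows)                      # [['x'], ['x'], ['x']]

rows = [[] for _ in range(3)]    # three independent lists
rows[0].append('x')
print(rows)                      # [['x'], [], []]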
My +1 to using the standard csv module. But the files should always be closed - especially when you open that many of them. Also, there is a bug: the code that actually combines the rows is missing from the earlier solution. Basically, each row read from a file should be appended to the sublist corresponding to its line number; the line number can be obtained via enumerate(reader), where reader is csv.reader(fin, ...).
[added later] Try the following code, fixing the paths for your purpose:
import csv
import glob
import os

datapath = './data'
resultpath = './result'
if not os.path.isdir(resultpath):
    os.makedirs(resultpath)

# Initialize the empty rows. It does not check how many rows are
# in the file.
rows = []

# Read data from the files to the above matrix.
for fname in glob.glob(os.path.join(datapath, '*.data')):
    with open(fname, 'rb') as f:
        reader = csv.reader(f)
        for n, row in enumerate(reader):
            if len(rows) < n + 1:
                rows.append([])   # add another row
            rows[n].extend(row)   # append the elements from the file

# Write the data from memory to the result file.
fname = os.path.join(resultpath, 'result.csv')
with open(fname, 'wb') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)