Merging two files by one common set of identifiers with Python

I would like to merge two tab-delimited text files that share one common column. I have an 'identifier file' that looks like this (2 columns by 1050 rows):
module 1 gene 1
module 1 gene 2
..
module x gene y
I also have a tab-delimited 'target' text file that looks like this (36 columns by 12000 rows):
gene 1 sample 1 sample 2 etc
gene 2 sample 1 sample 2 etc
..
gene z sample 1 sample 2 etc
I would like to merge the two files based on the gene identifier and have both the matching expression values and module affiliations from the identifier and target files. Essentially, I want to take the genes from the identifier file, find them in the target file, and create a new file with module #, gene #, and expression values all in one file. Any suggestions would be welcome.
My desired output is gene ID tab module affiliation tab sample values separated by tabs.
Here is the script I came up with. It does not produce any error messages, but it gives me an empty file.
import csv

expression_values = {}
matches = []
with open("identifiers.txt") as ids, open("target.txt") as target:
    for line in target:
        expression_values = {line.split()[0]: line.split()}
    for line in ids:
        block_idents = line.split()
    for gene in expression_values.iterkeys():
        if gene == block_idents[1]:
            matches.append(block_idents[0] + block_idents[1] + expression_values)

csvfile = "modules.csv"
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in matches:
        writer.writerow([val])
Thanks!

These lines of code are not doing what you are expecting them to do:
for line in target:
    expression_values = {line.split()[0]: line.split()}
for line in ids:
    block_idents = line.split()
for gene in expression_values.iterkeys():
    if gene == block_idents[1]:
        matches.append(block_idents[0] + block_idents[1] + expression_values)
expression_values and block_idents only ever hold the values from the current line of the file you are updating them with; each assignment replaces the previous one, so the dictionary and the list are not "growing" as more lines are read. Also, TSV files can be parsed with less effort using the csv module.
There are a few assumptions I am making with this rough solution I am suggesting:
The "genes" in the first file are the only "genes" that will appear in the second file.
There could be duplicate "genes" in the first file.
First, construct a map of the data in the first file:
import csv
from collections import defaultdict

gene_map = defaultdict(list)
with open(first_file, 'rb') as file_one:
    csv_reader = csv.reader(file_one, delimiter='\t')
    for row in csv_reader:
        gene_map[row[1]].append(row[0])
Read the second file and write to the output file simultaneously.
with open(sec_file, 'rb') as file_two, open(op_file, 'w') as out_file:
    csv_reader = csv.reader(file_two, delimiter='\t')
    csv_writer = csv.writer(out_file, delimiter='\t')
    for row in csv_reader:
        values = gene_map.get(row[0], [])
        op_list = [row[0]]        # gene id
        op_list.extend(values)    # module affiliation(s)
        op_list.extend(row[1:])   # sample/expression values
        csv_writer.writerow(op_list)

There are a number of problems with the existing approach, not least of which is that you are throwing away all data from the files except for the last line in each. The assignment under each "for line in" replaces the contents of the variable, so only the last assignment, for the last line, will take effect.
Assuming each gene appears in only one module, I suggest instead you read the "ids" into a dictionary, saving the module for each geneid:
geneMod = {}
for line in ids:
    fields = line.split()
    geneMod[fields[1]] = fields[0]  # key by gene id, value is the module
Then you can just go through the target lines, and for each line, split it, get the gene id (gene = targetsplit[0]) and save (or output) the same split fields with the module value inserted, e.g.: print gene, geneMod[gene], targetsplit[1:]
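Putting that together, here is a minimal sketch of the whole merge, assuming Python 3, tab-delimited input, one module per gene, and the file names used in the question (the output name merged.txt is an assumption):

import csv

# build a gene id -> module lookup from the identifier file
gene_mod = {}
with open("identifiers.txt") as ids:
    for row in csv.reader(ids, delimiter='\t'):
        gene_mod[row[1]] = row[0]   # row = [module, gene]

# stream the target file, keeping only genes present in the lookup
with open("target.txt") as target, open("merged.txt", "w", newline='') as out:
    writer = csv.writer(out, delimiter='\t')
    for row in csv.reader(target, delimiter='\t'):
        gene = row[0]
        if gene in gene_mod:
            # gene id, module affiliation, then the sample values
            writer.writerow([gene, gene_mod[gene]] + row[1:])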

In Python, how to compare two csv files based on values in one column and output records from first file that do not match second

Pretty new to python and coding in general. I've been searching for several csv comparison questions and answers and couldn't find anything that helped with this specific comparison problem.
I have two files that contain network asset info. Some devices have multiple IP addresses in one file and only one address in the other. The hostnames also don't use a consistent upper/lowercase format between the files. I'm interested in their hostname values.
(files don't have headers)
file 1:
HOSTNAME1,10.0.0.1
HOSTNAME2,10.0.0.2
HOSTNAME3,10.19.0.3
hostname4,10.19.0.4,10.19.17.31,10.19.17.32,10.19.17.33,10.19.17.34
hostname5,10.19.0.40,10.19.17.51,10.19.17.52,10.19.17.53,10.19.17.54
hostname6,10.19.0.55,10.19.17.56,10.19.17.57,10.19.17.58,10.19.17.59
File 2
HOSTNAME4,10.19.0.4
HOSTNAME5,10.19.0.40
HOSTNAME6,10.19.0.55
hostname7,192.168.0.1
hostname8,192.168.0.2
hostname9,192.168.0.3
I'd like to compare these files based on hostname (column 0) and output to a third file the rows in file1 that are NOT in file2, ignoring case, and regardless of whether they have multiple IPs in file1 or file2.
desired output:
file3:
HOSTNAME1,10.0.0.1
HOSTNAME2,10.0.0.2
HOSTNAME3,10.19.0.3
I tried a simple comm command in bash to see if I could generate the desired result, and had no luck, so I decided to try this in Python.
comm -23 --nocheck-order file1.csv file2.csv > file3.csv
Here's what I've tried in Python:
with open('file1.csv', 'r') as f1, open('file2.csv', 'r') as f2:
    fileone = f1.readlines()
    filetwo = f2.readlines()

with open('file3.csv', 'w') as outFile:
    for line in fileone:
        if line not in filetwo:
            outFile.write(line)
The problem is that it compares entire rows, so when the IP lists don't match exactly a row is treated as different, even if the two files share a hostname in column 0. I'm also not sure my code above ignores case, and it seems to match the whole string of a row rather than just the hostname. I'm willing to try the pandas package if that makes more sense for this kind of comparison.
Your own code is not too far away from what you need to do.
Step 1: Create a set from the list of hostnames in file2.csv. Here the hostnames are changed to uppercase.
with open('file2.csv') as check_file:
    check_set = set(row.split(',')[0].strip().upper() for row in check_file)
Step 2: Iterate through the lines of file1.csv and check if the hostname is in the set.
with open('file1.csv', 'r') as in_file, open('file3.csv', 'w') as out_file:
    for line in in_file:
        if line.split(',')[0].strip().upper() not in check_set:
            out_file.write(line)
Generated file file3.csv contents:
HOSTNAME1,10.0.0.1
HOSTNAME2,10.0.0.2
HOSTNAME3,10.19.0.3
Since you are interested in using pandas, I would suggest this: use read_csv to read both csv files and merge to join them and identify the mismatches. For that, the number of columns in both files should be the same (or use names to define the columns). That said, if you are fine with comparing only the first column, you can try this.
import pandas as pd
#Read the 2 csv files and take only the first column
file1_df = pd.read_csv('filename1.csv',usecols=[0],names=['Name'])
file2_df = pd.read_csv('filename2.csv',usecols=[0],names=['Name'])
#Converting both the files first column to uppercase to make it case insensitive
file1_df['Name'] = file1_df['Name'].str.upper()
file2_df['Name'] = file2_df['Name'].str.upper()
#Merging both the Dataframe using left join
comparison_result = pd.merge(file1_df,file2_df,on='Name',how='left',indicator=True)
#Filtering only the rows that are available in left(file1)
comparison_result = comparison_result.loc[comparison_result['_merge'] == 'left_only']
print(comparison_result)
As mentioned, since the number of comma-separated columns differs between the two csv files, I'm reading only the first column. Hence the output will also contain only one column, as shown below.
HOSTNAME1
HOSTNAME2
HOSTNAME3
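If you need the complete rows from file1 rather than just the hostnames, a variation on the above (a sketch, assuming no row has more than six comma-separated fields; short rows are padded, so they gain trailing commas in the output):

import pandas as pd

cols = list(range(6))  # assumed upper bound on fields per row
file1_df = pd.read_csv('filename1.csv', header=None, names=cols)
file2_df = pd.read_csv('filename2.csv', header=None, names=cols)

# case-insensitive membership test on the hostname column
mask = file1_df[0].str.upper().isin(file2_df[0].str.upper())
file1_df.loc[~mask].to_csv('file3.csv', index=False, header=False)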
You need to compare the first column only; try something like this:
filetwo = [val.split(',')[0].strip().lower() for val in filetwo]
for line in fileone:
    if line.split(',')[0].strip().lower() not in filetwo:
        print(line)

Loop within loop when comparing csv files in Python

I have two csv files. I am trying to look up a value from the first column of one file (file 1) in the first column of the other file (file 2). If they match, then print the row from file 2.
Pseudo code:
read file1.csv
read file2.csv
loop through file1
    compare each row with each row of file2 in turn
    if file1[0] == file2[0]:
        print row of file2
file1:
45,John
46,Fred
47,Bill
File2:
46,Roger
48,Pete
49,Bob
I want it to print :
46 Roger
EDIT - these are examples, the actual file is much bigger (5,000 rows, 7 columns)
I have the following:
import csv
with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1reader = csv.reader(csvfile1)
    csv2reader = csv.reader(csvfile2)
    for rowcsv1 in csv1reader:
        for rowcsv2 in csv2reader:
            if rowcsv1[0] == rowcsv2[0]:
                print(rowcsv1)
However I am getting no output.
I am aware there are other ways of doing it (with dict, pandas) but I am keen to know why my approach is not working.
EDIT: I now see that it only iterates through the first row of file 1 and then stops, but I am unclear how to stop that happening (I also understand that this is not the best way to do it).
You open csv2reader = csv.reader(csvfile2), then iterate all the way through it against the first row of csv1reader; at that point it has reached end of file and will not produce any more data. So for the second through last rows of csv1reader you are comparing against an exhausted reader, i.e. no comparison takes place.
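If you do want to keep the nested loops, one fix (a sketch) is to rewind the underlying file before each inner pass so a fresh reader can produce the rows again, and to print the file-2 row as the question asks:

import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    for rowcsv1 in csv.reader(csvfile1):
        csvfile2.seek(0)                      # rewind file 2 for every row of file 1
        for rowcsv2 in csv.reader(csvfile2):
            if rowcsv1[0] == rowcsv2[0]:
                print(rowcsv2)                # the matching row from file 2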
In any case, this is a very inefficient method; unless you are working on very large files, it would be much better to do
import csv

# load second file as a lookup table
data = {}
with open("csv2file.csv") as inf2:
    for row in csv.reader(inf2):
        data[row[0]] = row

# now process the first file against it
with open("csv1file.csv") as inf1:
    for row in csv.reader(inf1):
        if row[0] in data:
            print(data[row[0]])
See Hugh Bothwell's answer for why your code isn't working. For a fast way of doing what you stated you want to do in your question, try this:
import csv
with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1 = list(csv.reader(csvfile1))
    csv2 = list(csv.reader(csvfile2))

duplicates = {a[0] for a in csv1} & {a[0] for a in csv2}
for row in csv2:
    if row[0] in duplicates:
        print(row)
It gets the duplicate numbers from the two csv files, then loops through the second csv file, printing the row if the number at index 0 is in the first csv file. This is a much faster algorithm than what you were attempting to do.
If order matters, as Hugh Bothwell mentioned in Will Da Silva's answer, you could do:
import csv
from collections import OrderedDict
with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1 = list(csv.reader(csvfile1))
    csv2 = list(csv.reader(csvfile2))

d = {row[0]: row for row in csv2}
keys = OrderedDict.fromkeys([a[0] for a in csv1]).keys()  # preserves file-1 order
duplicate_keys = [k for k in keys if k in d]
for k in duplicate_keys:
    print(d[k])
I'm pretty sure there's a better way to do this, but try out this solution, it should work.
counter = 0
import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1reader = csv.reader(csvfile1)
    csv2reader = csv.reader(csvfile2)
    for rowcsv1 in csv1reader:
        for rowcsv2 in csv2reader:
            if rowcsv1[counter] == rowcsv2[counter]:
                print(rowcsv1)
        counter += 1  # increment it outside of the IF statement

Remove columns + keep certain rows in multiple large .csv files using python

Hello, I'm really new here as well as in the world of Python.
I have some (~1000) .csv files, each containing ~1,800,000 rows of information. The files are in the following form:
5302730,131841,-0.29999999999999999,NULL,2013-12-31 22:00:46.773
5303072,188420,28.199999999999999,NULL,2013-12-31 22:27:46.863
5350066,131841,0.29999999999999999,NULL,2014-01-01 00:37:21.023
5385220,-268368577,4.5,NULL,2014-01-01 03:12:14.163
5305752,-268368587,5.1900000000000004,NULL,2014-01-01 03:11:55.207
So, I would like, for all of the files:
(1) to remove the 4th (NULL) column
(2) to keep in every file only certain rows (depending on the value of the first column, e.g. 5302730: keep only the rows containing that value)
I don't know if this is even possible, so any answer is appreciated!
Thanks in advance.
Have a look at the csv module.
You can use the csv.reader function to generate an iterator of lines, with each line's cells as a list.
import csv

for line in csv.reader(open("filename.csv")):
    # remove the 4th column; remember Python starts counting at 0
    line = line[:3] + line[4:]
    if line[0] == "thevalueforthefirstcolumn":
        dosomethingwith(line)
If you wish to do this sort of operation with CSV files more than once and want to use different parameters regarding column to skip, column to use as key and what to filter on, you can use something like this:
import csv

def read_csv(filename, column_to_skip=None, key_column=0, key_filter=None):
    data_from_csv = []
    with open(filename) as csvfile:
        csv_reader = csv.reader(csvfile)
        for row in csv_reader:
            # skip data in a specific column
            if column_to_skip is not None:
                del row[column_to_skip]
            # filter out rows where the key doesn't match
            if key_filter is not None:
                key = row[key_column]
                if key_filter != key:
                    continue
            data_from_csv.append(row)
    return data_from_csv

def write_csv(filename, data_to_write):
    with open(filename, 'w', newline='') as csvfile:  # newline='' avoids blank rows on Windows
        csv_writer = csv.writer(csvfile)
        for row in data_to_write:
            csv_writer.writerow(row)

data = read_csv('data.csv', column_to_skip=3, key_filter='5302730')
write_csv('data2.csv', data)
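Since the question mentions ~1000 files, you could apply this to every .csv in a directory by wrapping the two calls above in a glob loop. A sketch, where the input directory and the _filtered output naming are assumptions:

import glob
import os

for in_path in glob.glob('data/*.csv'):        # hypothetical input directory
    rows = read_csv(in_path, column_to_skip=3, key_filter='5302730')
    root, ext = os.path.splitext(in_path)
    write_csv(root + '_filtered' + ext, rows)  # e.g. data/foo_filtered.csv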

Joining files by corresponding columns in outside table

I have a .csv file matching sample names to categories. I want to use it to merge (as with cat) any files in a folder whose names correspond to the Sample_Name column of the .csv, grouped according to Category, naming each final merged file after its Category.
The to-be merged files in the folder are not .csv; they're a kind of .fasta file.
The .csv is something like the following (it will have more columns that will be ignored for this):
Sample_Name Category
1 a
2 a
3 a
4 b
5 b
After merging, the output should be two files: a (samples 1,2,3 merged) and b (samples 4 and 5).
The idea is to make this work for a large number of files and categories.
Thanks for any help!
Assuming that the files are in order in the input CSV file, this is about as simple as you could get:
from operator import itemgetter

fields = itemgetter(0, 1)  # zero-based field numbers of the fields of interest

with open('sample_categories.csv') as csvfile:
    next(csvfile)  # skip over the header line
    for line in csvfile:
        filename, category = fields(line.split())
        with open(filename) as infile, open(category, 'a') as outfile:
            outfile.write(infile.read())
One downside to this is that the output file is reopened for every input file. This might be a problem if there are a lot of files per category. If that works out to be an actual problem then you could try this, which holds the output file open for as long as there are input files in that category.
from operator import itemgetter

fields = itemgetter(0, 1)  # zero-based field numbers of the fields of interest

with open('sample_categories.csv') as csvfile:
    next(csvfile)  # skip over the header line
    current_category = None
    outfile = None
    for line in csvfile:
        filename, category = fields(line.split())
        if category != current_category:
            if outfile is not None:
                outfile.close()
            outfile = open(category, 'w')
            current_category = category
        with open(filename) as infile:
            outfile.write(infile.read())
    if outfile is not None:
        outfile.close()  # close the last category's file
I would build a dictionary with keys of categories and values of lists of corresponding sample names.
d = {'a':['1','2','3'], 'b':['4','5']}
You can achieve this in a straightforward way by reading the csv file and building the dictionary line by line, i.e.
d = {}
with open('myfile.csv') as myfile:
    next(myfile)  # skip the header line
    for line in myfile:
        samp, cat = line.split()
        try:
            d[cat].append(samp)
        except KeyError:  # if there is no entry for cat yet, we get a KeyError
            d[cat] = [samp]
For a more sophisticated way of doing this, have a look at the collections module (e.g. defaultdict).
Once this database is ready, you can create your new files from category to category:
for cat in d:
    with open(cat, 'w') as outfile:
        for sample in d[cat]:
            # copy the sample file's content to outfile
            with open(sample) as infile:
                outfile.write(infile.read())
Copying one file's content to the other can be done in several ways, see this thread.
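For instance, shutil.copyfileobj copies in chunks, which avoids reading a whole sample file into memory at once; a minimal sketch using the dictionary d built above:

import shutil

for cat in d:
    with open(cat, 'wb') as outfile:
        for sample in d[cat]:
            with open(sample, 'rb') as infile:
                shutil.copyfileobj(infile, outfile)  # chunked copy, constant memory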

How to extract specific data from a downloaded csv file and transpose into a new csv file?

I'm working with an online survey application that allows me to download survey results into a csv file. However, the format of the downloaded csv puts each survey question and answer in a new column, whereas, I need the csv file to be formatted with each survey question and answer on a new row. There is also a lot of data in the downloaded csv file that I want to ignore completely.
How can I parse out the desired rows and columns of the downloaded csv file and write them to a new csv file in a specific format?
For example, I download data and it looks like this:
V1,V2,V3,Q1,Q2,Q3,Q4....
null,null,null,item,item,item,item....
0,0,0,4,5,4,5....
0,0,0,2,3,2,3....
The first row contains the 'keys' that I will need, except that V1-V3 must be excluded. Row 2 must be excluded altogether. Row 3 is my first subject, so I need the values 4,5,4,5 to be paired with the keys Q1,Q2,Q3,Q4. And row 4 is a new subject, which needs to be excluded as well, since my program only handles one subject at a time.
The csv file that I need to create in order for my script to function properly looks like this:
Q1,4
Q2,5
Q3,4
Q4,5
I've tried using this izip to pivot the data, but I don't know how to specifically select the rows and columns I need:
import csv
from itertools import izip

a = izip(*csv.reader(open("CDI.csv", "rb")))
csv.writer(open("CDI_test.csv", "wb")).writerows(a)
Here is a simple Python script that should do the job for you. It takes command-line arguments that designate the number of entries to skip at the beginning of each line, the position to stop at near the end of each line (in the CSV version), the input file, and the output file. So, for example, the command would look like:
python question.py 3:7 input.txt output.txt
You can also substitute 3 for sys.argv[1], "input.txt" for sys.argv[2], and so on within the script if you don't want to state the arguments every time.
Text file version:
import sys

inputFile = open(sys.argv[2], "r")
outputFile = open(sys.argv[3], "w")
leadingRemoved = int(sys.argv[1])

# strip extra whitespace from each line in the file, then split by ","
lines = [x.strip().split(",") for x in inputFile.readlines()]

# zip all but the first x elements of the first and third rows
zipped = zip(lines[0][leadingRemoved:], lines[2][leadingRemoved:])
for tuples in zipped:
    # write each question/number pair to the file on its own line
    outputFile.write(",".join(tuples) + "\n")

inputFile.close()
outputFile.close()
# input from command line: python questions.py leadingRemoved pathToInput pathToOutput
CSV file version:
import sys
import csv

with open(sys.argv[2], "rb") as inputFile:
    # remove null bytes
    reader = csv.reader((line.replace('\0', '') for line in inputFile), delimiter="\t")
    outputFile = open(sys.argv[3], "wb")
    leadingRemoved, endingremoved = [int(x) for x in sys.argv[1].split(":")]
    # create a 2D array of all the elements for each row
    lines = [x for x in reader]
    print lines
    # zip all but the first x elements of the first and third rows
    zipped = zip(lines[0][leadingRemoved:endingremoved], lines[2][leadingRemoved:endingremoved])
    writer = csv.writer(outputFile)
    writer.writerows(zipped)
    print zipped
    outputFile.close()
Here is something similar I did, using multiple values; it could be changed to single values.
#!/usr/bin/env python
import csv

def dict_from_csv(filename):
    '''
    (file) -> list of dictionaries

    Function to read a csv file and format it into a list of dictionaries.
    The headers are the keys, with all other data becoming values.
    The format of the csv file and the headers included need to be known
    to extract the email addresses.
    '''
    # open the file and read it using csv.reader()
    # read the file; for each row that has content, add it to list mf
    # the keys for our user dict are the first content line of the file, mf[0]
    # the values for our user dict are the other lines in the file, mf[1:]
    mf = []
    with open(filename, 'r') as f:
        my_file = csv.reader(f)
        for row in my_file:
            if any(row):
                mf.append(row)
    file_keys = mf[0]
    file_values = mf[1:]  # choose the row/rows you want

    # combine the two lists into a list of dictionaries, using the keys
    # list as the keys and each data row as the values
    my_list = []
    for value in file_values:
        my_list.append(dict(zip(file_keys, value)))  # zip each row with the keys

    # return the list of dictionaries
    return my_list
I suggest you read up on pandas for this type of activity:
http://pandas.pydata.org/pandas-docs/stable/io.html
import pandas
input_dataframe = pandas.read_csv("input.csv")
transposed_df = input_dataframe.transpose()
# delete rows and edit data easily using pandas dataframe
# this is a good library to get some experience working with
transposed_df.to_csv("output.csv")
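For the exact output shown in the question, a short pandas sketch (assuming, per the sample data, that the header row gives the column names, the next row is the 'item' row, and the row after that is the first subject):

import pandas as pd

df = pd.read_csv("CDI.csv")                   # columns: V1..V3, Q1, Q2, ...
subject = df.iloc[1]                          # skip the 'item' row at index 0
subject = subject.drop(["V1", "V2", "V3"])    # exclude the V columns
subject.to_csv("CDI_test.csv", header=False)  # writes Q1,4 / Q2,5 / ...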
