I have a .csv file matching table names to categories, and I want to use it to merge (as in cat) any files in a folder whose names correspond to the Sample_Name column of the .csv, grouped by Category, naming each final file after its Category.
The to-be merged files in the folder are not .csv; they're a kind of .fasta file.
The .csv looks something like the following (it will have more columns, which are ignored here):
Sample_Name Category
1 a
2 a
3 a
4 b
5 b
After merging, the output should be two files: a (samples 1,2,3 merged) and b (samples 4 and 5).
The idea is to make this work for a large number of files and categories.
Thanks for any help!
Assuming that the files are in order in the input CSV file, this is about as simple as you could get:
from operator import itemgetter

fields = itemgetter(0, 1)  # zero-based field numbers of the fields of interest

with open('sample_categories.csv') as csvfile:
    next(csvfile)  # skip over header line
    for line in csvfile:
        filename, category = fields(line.split())
        with open(filename) as infile, open(category, 'a') as outfile:
            outfile.write(infile.read())
One downside to this is that the output file is reopened for every input file. This might be a problem if there are a lot of files per category. If that turns out to be an actual problem, you could try this instead, which holds the output file open for as long as there are input files in that category.
from operator import itemgetter

fields = itemgetter(0, 1)  # zero-based field numbers of the fields of interest

with open('sample_categories.csv') as csvfile:
    next(csvfile)  # skip over header line
    current_category = None
    outfile = None
    for line in csvfile:
        filename, category = fields(line.split())
        if category != current_category:
            if outfile is not None:
                outfile.close()
            outfile = open(category, 'w')
            current_category = category
        with open(filename) as infile:
            outfile.write(infile.read())
    if outfile is not None:
        outfile.close()  # don't forget to close the last category's file
I would build a dictionary with keys of categories and values of lists of corresponding sample names.
d = {'a':['1','2','3'], 'b':['4','5']}
You can achieve this in a straightforward way by reading the csv file and building the dictionary line by line, i.e.
d = {}
with open('myfile.csv') as myfile:
    next(myfile)  # skip the header line
    for line in myfile:
        samp, cat = line.split()[:2]  # extra columns, if any, are ignored
        try:
            d[cat].append(samp)
        except KeyError:  # if there is no entry for cat yet, we get a KeyError
            d[cat] = [samp]
For a more sophisticated way of doing this, have a look at the collections module.
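For instance, collections.defaultdict removes the need for the try/except; a minimal sketch of the same loop:

from collections import defaultdict

d = defaultdict(list)
with open('myfile.csv') as myfile:
    next(myfile)  # skip the header line
    for line in myfile:
        samp, cat = line.split()[:2]
        d[cat].append(samp)  # a missing key is created as an empty list automatically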
Once this database is ready, you can create your new files category by category:
for cat in d:
    with open(cat, 'w') as outfile:
        for sample in d[cat]:
            # copy sample file content to outfile
            with open(sample) as infile:
                outfile.write(infile.read())
Copying one file's content to the other can be done in several ways, see this thread.
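One memory-friendly option is shutil.copyfileobj, which streams in chunks instead of reading each whole file at once; a sketch of the inner copy step under the same names as above:

import shutil

with open(sample) as infile:
    shutil.copyfileobj(infile, outfile)  # copies in chunks rather than all at once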
Hello, I'm really new here as well as in the world of Python.
I have some (~1000) .csv files, each containing ~1,800,000 rows of information. The files are in the following form:
5302730,131841,-0.29999999999999999,NULL,2013-12-31 22:00:46.773
5303072,188420,28.199999999999999,NULL,2013-12-31 22:27:46.863
5350066,131841,0.29999999999999999,NULL,2014-01-01 00:37:21.023
5385220,-268368577,4.5,NULL,2014-01-01 03:12:14.163
5305752,-268368587,5.1900000000000004,NULL,2014-01-01 03:11:55.207
So, for all of the files, I would like:
(1) to remove the 4th (NULL) column
(2) to keep in every file only certain rows, depending on the value of the first column (e.g. 5302730: keep only the rows containing that value)
I don't know if this is even possible, so any answer is appreciated!
Thanks in advance.
Have a look at the csv module.
You can use the csv.reader function to generate an iterator of lines, with each line's cells as a list.
import csv

with open("filename.csv") as f:
    for line in csv.reader(f):
        # Remove the 4th column; remember Python starts counting at 0
        line = line[:3] + line[4:]
        if line[0] == "thevalueforthefirstcolumn":
            dosomethingwith(line)
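To actually write the kept rows back out, here is a minimal sketch (the filenames and the key value are placeholders for your own):

import csv

with open("filename.csv") as fin, open("filtered.csv", "w") as fout:
    writer = csv.writer(fout)
    for line in csv.reader(fin):
        if line[0] == "5302730":  # keep only rows matching the chosen first-column value
            writer.writerow(line[:3] + line[4:])  # drop the 4th (NULL) column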
If you wish to do this sort of operation on CSV files more than once, with different parameters for which column to skip, which column to use as the key, and what to filter on, you can use something like this:
import csv

def read_csv(filename, column_to_skip=None, key_column=0, key_filter=None):
    data_from_csv = []
    with open(filename) as csvfile:
        csv_reader = csv.reader(csvfile)
        for row in csv_reader:
            # Skip data in specific column
            if column_to_skip is not None:
                del row[column_to_skip]
            # Filter out rows where the key doesn't match
            if key_filter is not None:
                key = row[key_column]
                if key_filter != key:
                    continue
            data_from_csv.append(row)
    return data_from_csv

def write_csv(filename, data_to_write):
    with open(filename, 'w') as csvfile:
        csv_writer = csv.writer(csvfile)
        for row in data_to_write:
            csv_writer.writerow(row)

data = read_csv('data.csv', column_to_skip=3, key_filter='5302730')
write_csv('data2.csv', data)
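Since there are around 1000 files, the same pair of calls can be wrapped in a glob loop; a sketch that assumes the inputs match *.csv and writes each result under a filtered_ prefix:

import glob

for path in glob.glob('*.csv'):
    data = read_csv(path, column_to_skip=3, key_filter='5302730')
    write_csv('filtered_' + path, data)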
Even though this might sound like a repeated question, I have not found a solution. Well, I have a large .csv file that looks like:
prot_hit_num,prot_acc,prot_desc,pep_res_before,pep_seq,pep_res_after,ident,country
1,gi|21909,21 kDa seed protein [Theobroma cacao],A,ANSPV,L,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],A,ANSPVL,D,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],L,SSISGAGGGGLA,L,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],D,NYDNSAGKW,W,F40,EB
....
The aim is to slice this .csv file into multiple smaller .csv files according to the last two columns ('ident' and 'country').
I have used code from an answer to a previous post; it is the following:
import csv
import itertools as it
import operator as op

csv_contents = []
with open(outfile_path4, 'rb') as fin:
    dict_reader = csv.DictReader(fin)  # default delimiter is comma
    fieldnames = dict_reader.fieldnames  # save for writing
    for line in dict_reader:  # read in all of your data
        csv_contents.append(line)  # gather data into a list (of dicts)

# input to itertools.groupby must be sorted by the grouping value
sorted_csv_contents = sorted(csv_contents, key=op.itemgetter('prot_desc', 'ident', 'country'))

for groupkey, groupdata in it.groupby(sorted_csv_contents,
                                      key=op.itemgetter('prot_desc', 'ident', 'country')):
    with open(outfile_path5 + 'slice_{:s}.csv'.format(groupkey), 'wb') as fou:
        dict_writer = csv.DictWriter(fou, fieldnames=fieldnames)
        dict_writer.writerows(groupdata)
However, I need my output .csv's to contain just the column 'pep_seq'; the desired output looks like:
pep_seq
ANSPV
ANSPVL
SSISGAGGGGLA
NYDNSAGKW
What can I do?
Your code was almost correct; it just needed the fieldnames to be set correctly and extrasaction='ignore' to be set. This tells the DictWriter to write only the fields you specify:
import itertools
import operator
import csv

outfile_path4 = 'input.csv'
outfile_path5 = 'my_output_folder/'  # prefix for the sliced output files

csv_contents = []
with open(outfile_path4, 'rb') as fin:
    dict_reader = csv.DictReader(fin)  # default delimiter is comma
    fieldnames = dict_reader.fieldnames  # save for writing
    for line in dict_reader:  # read in all of your data
        csv_contents.append(line)  # gather data into a list (of dicts)

group = ['prot_desc', 'ident', 'country']

# input to itertools.groupby must be sorted by the grouping value
sorted_csv_contents = sorted(csv_contents, key=operator.itemgetter(*group))

for groupkey, groupdata in itertools.groupby(sorted_csv_contents, key=operator.itemgetter(*group)):
    # groupkey is a tuple of the three grouping values; join them to build a usable filename
    with open(outfile_path5 + 'slice_{}.csv'.format('_'.join(groupkey)), 'wb') as fou:
        dict_writer = csv.DictWriter(fou, fieldnames=['pep_seq'], extrasaction='ignore')
        dict_writer.writeheader()
        dict_writer.writerows(groupdata)
This will give you output csv files, each containing:
pep_seq
ANSPV
ANSPVL
SSISGAGGGGLA
NYDNSAGKW
The following would output a csv file per country containing only the field you need.
You could always add another step to group by the second field as well; see the sketch after the code below.
import csv

# use a dict so you can store the list of pep_seqs found for each country;
# the country value will be the dict key
csv_rows_by_country = {}

with open('in.csv', 'rb') as csv_in:
    csv_reader = csv.reader(csv_in)
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        if row[7] in csv_rows_by_country:
            # add this pep_seq to the list we already found for this country
            csv_rows_by_country[row[7]].append(row[4])
        else:
            # start a new list for this country - we haven't seen it before
            csv_rows_by_country[row[7]] = [row[4]]

for country in csv_rows_by_country:
    # create a csv output file for each country and write the pep_seqs into it
    with open('out_%s.csv' % (country,), 'wb') as csv_out:
        csv_writer = csv.writer(csv_out)
        for pep_seq in csv_rows_by_country[country]:
            csv_writer.writerow([pep_seq])
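To group by both fields, the same pattern works with an (ident, country) tuple as the key; a minimal sketch assuming the same column layout (ident is column 6, country is column 7):

import csv

csv_rows_by_key = {}

with open('in.csv', 'rb') as csv_in:
    csv_reader = csv.reader(csv_in)
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        # key on the (ident, country) pair instead of country alone
        csv_rows_by_key.setdefault((row[6], row[7]), []).append(row[4])

for (ident, country), pep_seqs in csv_rows_by_key.items():
    with open('out_%s_%s.csv' % (ident, country), 'wb') as csv_out:
        csv_writer = csv.writer(csv_out)
        for pep_seq in pep_seqs:
            csv_writer.writerow([pep_seq])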
I'm working with an online survey application that allows me to download survey results into a csv file. However, the format of the downloaded csv puts each survey question and answer in a new column, whereas I need the csv file to be formatted with each survey question and answer on a new row. There is also a lot of data in the downloaded csv file that I want to ignore completely.
How can I parse out the desired rows and columns of the downloaded csv file and write them to a new csv file in a specific format?
For example, I download data and it looks like this:
V1,V2,V3,Q1,Q2,Q3,Q4....
null,null,null,item,item,item,item....
0,0,0,4,5,4,5....
0,0,0,2,3,2,3....
The first row contains the 'keys' that I will need except V1-V3 must be excluded. Row 2 must be excluded altogether. Row 3 is my first subject so I need the values 4,5,4,5 to be paired with the keys Q1,Q2,Q3,Q4. And row 4 is a new subject which needs to be excluded as well since my program only handles one subject at a time.
The csv file that I need to create in order for my script to function properly looks like this:
Q1,4
Q2,5
Q3,4
Q4,5
I've tried using izip to pivot the data, but I don't know how to select only the specific rows and columns I need:
import csv
from itertools import izip

a = izip(*csv.reader(open("CDI.csv", "rb")))
csv.writer(open("CDI_test.csv", "wb")).writerows(a)
Here is a simple Python script that should do the job for you. It takes command-line arguments that designate the number of entries to skip at the beginning of each line (and, in the CSV version, the position to stop at near the end of the line), the input file, and the output file. So, for example, the command for the CSV version would look like
python question.py 3:7 input.txt output.txt
while the text-file version takes a single number, e.g. python question.py 3 input.txt output.txt.
You can also substitute literal values for sys.argv[1], sys.argv[2], and so on within the script if you don't want to state the arguments every time.
Text file version:
import sys

inputFile = open(sys.argv[2], "r")
outputFile = open(sys.argv[3], "w")
leadingRemoved = int(sys.argv[1])

# strips extra whitespace from each line in the file, then splits by ","
lines = [x.strip().split(",") for x in inputFile.readlines()]

# zips all but the first x elements of the first and third rows
zipped = zip(lines[0][leadingRemoved:], lines[2][leadingRemoved:])

for tuples in zipped:
    # writes each question/number pair to the file, one pair per line
    outputFile.write(",".join(tuples) + "\n")

inputFile.close()
outputFile.close()

# input from command line: python questions.py leadingRemoved pathToInput pathToOutput
CSV file version:
import sys
import csv

with open(sys.argv[2], "rb") as inputFile:
    # removes null bytes
    reader = csv.reader((line.replace('\0', '') for line in inputFile), delimiter="\t")
    outputFile = open(sys.argv[3], "wb")
    leadingRemoved, endingRemoved = [int(x) for x in sys.argv[1].split(":")]
    # creates a 2d array of all the elements for each row
    lines = [x for x in reader]
    print lines
    # zips all but the first x elements of the first and third rows
    zipped = zip(lines[0][leadingRemoved:endingRemoved], lines[2][leadingRemoved:endingRemoved])
    writer = csv.writer(outputFile)
    writer.writerows(zipped)
    print zipped
    outputFile.close()
Here is something similar I did using multiple values, but it could be changed to single values.
#!/usr/bin/env python

import csv

def dict_from_csv(filename):
    '''
    (file) -> list of dictionaries

    Function to read a csv file and format it into a list of dictionaries.
    The headers are the keys, with all other data becoming values.
    The format of the csv file and the headers included need to be known
    to extract the email addresses.
    '''
    # open the file and read it using csv.reader()
    # read the file; for each row that has content, add it to list mf
    # the keys for our user dict are the first content line of the file, mf[0]
    # the values for our user dict are the other lines in the file, mf[1:]
    mf = []
    with open(filename, 'r') as f:
        my_file = csv.reader(f)
        for row in my_file:
            if any(row):
                mf.append(row)
    file_keys = mf[0]
    file_values = mf[1:]  # choose the row/rows you want

    # Combine the two lists into a list of dictionaries, using the keys
    # list as the keys and each data row as the values
    my_list = []
    for value in file_values:
        my_list.append(dict(zip(file_keys, value)))

    # return the list of dictionaries
    return my_list
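A minimal, hypothetical usage with the survey file from this thread:

subjects = dict_from_csv('CDI.csv')
print subjects[1]  # the first subject row in the sample data (index 0 is the 'item' row)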
I suggest you read up on pandas for this type of activity:
http://pandas.pydata.org/pandas-docs/stable/io.html
import pandas
input_dataframe = pandas.read_csv("input.csv")
transposed_df = input_dataframe.transpose()
# delete rows and edit data easily using pandas dataframe
# this is a good library to get some experience working with
transposed_df.to_csv("output.csv")
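As a rough sketch of the specific task in this question (the file names, the skipped row, and the subject row index are assumptions based on the sample data):

import pandas

df = pandas.read_csv("CDI.csv", skiprows=[1])  # skip the all-"item"/null second row
subject = df.iloc[0, 3:]                       # first subject's answers, excluding V1-V3
subject.to_csv("CDI_test.csv", header=False)   # writes Q1,4 / Q2,5 / Q3,4 / Q4,5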
I would like to merge two tab-delimited text files that share one common column. I have an 'identifier file' that looks like this (2 columns by 1050 rows):
module 1 gene 1
module 1 gene 2
..
module x gene y
I also have a tab-delimited 'target' text file that looks like this (36 columns by 12000 rows):
gene 1 sample 1 sample 2 etc
gene 2 sample 1 sample 2 etc
..
gene z sample 1 sample 2 etc
I would like to merge the two files based on the gene identifier and have both the matching expression values and module affiliations from the identifier and target files. Essentially, I want to take the genes from the identifier file, find them in the target file, and create a new file with module #, gene #, and expression values all in one file. Any suggestions would be welcome.
My desired output is gene ID tab module affiliation tab sample values separated by tabs.
Here is the script I came up with. It does not produce any error messages, but it gives me an empty file.
import csv

expression_values = {}
matches = []

with open("identifiers.txt") as ids, open("target.txt") as target:
    for line in target:
        expression_values = {line.split()[0]: line.split()}
    for line in ids:
        block_idents = line.split()
        for gene in expression_values.iterkeys():
            if gene == block_idents[1]:
                matches.append(block_idents[0] + block_idents[1] + expression_values)

csvfile = "modules.csv"
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in matches:
        writer.writerow([val])
Thanks!
These lines of code are not doing what you are expecting them to do:
for line in target:
    expression_values = {line.split()[0]: line.split()}
for line in ids:
    block_idents = line.split()
    for gene in expression_values.iterkeys():
        if gene == block_idents[1]:
            matches.append(block_idents[0] + block_idents[1] + expression_values)
expression_values and block_idents only ever hold the values from the current line of the file being read; each pass through the loop replaces the previous assignment, so the dictionary and the list are not "growing" as more lines are read. Also, TSV files can be parsed with less effort using the csv module.
There are a few assumptions I am making in this rough solution:
(1) the "genes" in the first file are the only "genes" that will appear in the second file;
(2) there could be duplicate "genes" in the first file.
First, construct a map of the data in the first file:
import csv
from collections import defaultdict

gene_map = defaultdict(list)

with open(first_file, 'rb') as file_one:
    csv_reader = csv.reader(file_one, delimiter='\t')
    for row in csv_reader:
        gene_map[row[1]].append(row[0])
Read the second file and write to the output file simultaneously.
with open(sec_file, 'rb') as file_two, open(op_file, 'w') as out_file:
    csv_reader = csv.reader(file_two, delimiter='\t')
    csv_writer = csv.writer(out_file, delimiter='\t')
    for row in csv_reader:
        modules = gene_map.get(row[0], [])
        op_list = [row[0]]       # gene id
        op_list.extend(modules)  # module affiliation(s)
        op_list.extend(row[1:])  # expression values
        csv_writer.writerow(op_list)
There are a number of problems with the existing approach, not least of which is that you are throwing away all data from the files except for the last line in each: the assignment inside each "for line in" loop replaces the contents of the variable, so only the last assignment, for the last line, takes effect.
Assuming each gene appears in only one module, I suggest instead you read the "ids" into a dictionary, saving the module for each geneid:
geneMod = {}
for line in ids:
    module, gene = line.split()[:2]  # identifier file columns: module, gene
    geneMod[gene] = module
Then you can just go through the target lines; for each line, split it, get the gene id (gene = targetsplit[0]) and save (or output) the same split fields with the module value inserted, e.g.: print gene, geneMod[gene], targetsplit[1:]
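A minimal sketch of that loop, reusing the target file handle from the question:

for line in target:
    targetsplit = line.split()
    gene = targetsplit[0]
    if gene in geneMod:  # skip genes that have no module affiliation
        print '\t'.join([gene, geneMod[gene]] + targetsplit[1:])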
I have 125 data files containing two columns and 21 rows of data each, and I'd like to merge them into a single .csv file (as 125 pairs of columns and only 21 rows).
This is what my data files look like:
I am fairly new to python but I have come up with the following code:
import glob

Results = glob.glob('./*.data')

fout = 'c:/Results/res.csv'
fout = open("res.csv", 'w')

for file in Results:
    g = open(file, "r")
    fout.write(g.read())
    g.close()

fout.close()
The problem with the above code is that all the data are copied into only two columns with 125*21 rows.
Any help is very much appreciated!
This should work:
import glob

files = [open(f) for f in glob.glob('./*.data')]  # make a list of open file objects
fout = open("res.csv", 'w')

for row in range(21):
    for f in files:
        fout.write(f.readline().strip())  # strip removes the trailing newline
        fout.write(',')
    fout.write('\n')

fout.close()
Note that this method will probably fail if you try a very large number of files; the number of files you can hold open at once is limited by the operating system (often around 256 by default on some platforms).
You may want to try the python CSV module (http://docs.python.org/library/csv.html), which provides very useful methods for reading and writing CSV files. Since you stated that you want only 21 rows with 250 columns of data, I would suggest creating 21 python lists as your rows and then appending data to each row as you loop through your files.
something like:
import csv

rows = []
for i in range(0, 21):
    row = []
    rows.append(row)

# Not sure of the structure of your input files or how they are delimited,
# but for each one, as you have it open and iterate through its rows, you
# would want to append the values in each row to the end of the
# corresponding list contained within the rows list.

# Then, write each row to the new csv:
writer = csv.writer(open('output.csv', 'wb'), delimiter=',')
for row in rows:
    writer.writerow(row)
(Sorry, I cannot add comments yet.)
[Edited later: the following statement is wrong!] "davesnitty's row-generating loop can be replaced by rows = [[]] * 21." It is wrong because [[]] * 21 creates a list whose 21 elements are all references to one and the same empty list, not 21 independent empty lists.
My +1 to using the standard csv module. But the files should always be closed, especially when you open that many of them. Also, there is a bug: the rows read from the files are never appended anywhere, so the solution is actually missing its key step. Basically, each row read from a file should be appended to the sublist for its line number, with the line number obtained via enumerate(reader), where reader is csv.reader(fin, ...).
[Added later] Try the following code; fix the paths for your purpose:
import csv
import glob
import os

datapath = './data'
resultpath = './result'

if not os.path.isdir(resultpath):
    os.makedirs(resultpath)

# Initialize the empty rows. It does not check how many rows are
# in the file.
rows = []

# Read data from the files into the above matrix.
for fname in glob.glob(os.path.join(datapath, '*.data')):
    with open(fname, 'rb') as f:
        reader = csv.reader(f)
        for n, row in enumerate(reader):
            if len(rows) < n + 1:
                rows.append([])  # add another row
            rows[n].extend(row)  # append the elements from the file

# Write the data from memory to the result file.
fname = os.path.join(resultpath, 'result.csv')
with open(fname, 'wb') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)