Python reading files in a directory - python

I have a .csv with 3000 rows of data in 2 columns like this:
uc007ayl.1 ENSMUSG00000041439
uc009mkn.1 ENSMUSG00000031708
uc009mkn.1 ENSMUSG00000035491
In another folder I have graph files with names like this:
uc007csg.1_nt_counts.txt
uc007gjg.1_nt_counts.txt
Notice that the graph file names start with identifiers in the same format as my first column.
I am trying to use Python to identify the rows that have a graph and write the value of the second column to a new .txt file.
This is the code I have:
import csv
with open("C:/*my dir*/UCSC to Ensembl.csv", "r") as f:
    reader = csv.reader(f, delimiter = ',')
    for row in reader:
        print row[0]
But this is as far as I can get, and I am stuck.

You're almost there:
import csv
import os.path
with open("C:/*my dir*/UCSC to Ensembl.csv", "rb") as f:
    reader = csv.reader(f, delimiter = ',')
    for row in reader:
        graph_filename = os.path.join("C:/folder", row[0] + "_nt_counts.txt")
        if os.path.exists(graph_filename):
            print (row[1])
Note that the repeated calls to os.path.exists may slow down the process, especially if the directory lies on a remote filesystem and does not contain significantly more files than the CSV file has lines. You may want to use os.listdir instead:
import csv
import os
graphs = set(os.listdir("C:/graph folder"))
with open("C:/*my dir*/UCSC to Ensembl.csv", "rb") as f:
    reader = csv.reader(f, delimiter = ',')
    for row in reader:
        if row[0] + "_nt_counts.txt" in graphs:
            print (row[1])

First, check that print row[0] really gives the correct file identifier.
Second, concatenate the path to the files with row[0] and the _nt_counts.txt suffix, and check whether the file at that full path exists with os.path.exists(path) (see http://docs.python.org/library/os.path.html#os.path.exists ).
If it exists, you can write row[1] (the second column) to a new file with f2.write("%s\n" % row[1]) (first you have to open f2 for writing, of course).
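The three steps above could be combined roughly as follows. This is only a sketch for Python 3: the function name write_matching_names and its parameters are made up for illustration, and it assumes a comma-separated file with the identifier in the first column.

```python
import csv
import os


def write_matching_names(csv_path, graph_dir, out_path, suffix="_nt_counts.txt"):
    """Write column 2 of every CSV row whose column 1 has a graph file.

    Returns the number of names written. All names here are illustrative.
    """
    written = 0
    with open(csv_path, newline="") as f, open(out_path, "w") as f2:
        for row in csv.reader(f):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            graph_path = os.path.join(graph_dir, row[0] + suffix)
            if os.path.exists(graph_path):
                f2.write("%s\n" % row[1])
                written += 1
    return written
```

Calling write_matching_names with the CSV path, the graph folder and an output path then leaves one name per line in the output file.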

Well, the next step would be to check if the file exists? There are a few ways, but I like the EAFP approach.
try:
    # the graph files are named <first column>_nt_counts.txt
    with open(os.path.join(the_dir, row[0] + '_nt_counts.txt')) as f: pass
except IOError:
    print 'Oops no file'
the_dir is the directory where the files are.
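Wrapped in a small helper, the EAFP check might look like this on Python 3. The name graph_exists is hypothetical. Note that any failure to open the file (including a permissions error) is reported as "missing" here, which may or may not be what you want.

```python
import os


def graph_exists(the_dir, name):
    """EAFP check: try to open the file and treat failure as 'missing'."""
    try:
        with open(os.path.join(the_dir, name)):
            return True
    except OSError:  # IOError is an alias of OSError on Python 3
        return False
```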

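result = open('result.txt', 'w')
for line in open('C:/*my dir*/UCSC to Ensembl.csv', 'r'):
    # strip the trailing newline so it is not written twice
    line = line.strip().split(',')
    try:
        # close the graph file again straight away; we only test that it opens
        open('/path/to/dir/' + line[0] + '_nt_counts.txt', 'r').close()
    except IOError:
        continue
    else:
        result.write(line[1] + '\n')
result.close()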

import csv
import os
# get prefixes of all graphs in another directory
suff = '_nt_counts.txt'
graphs = set(fn[:-len(suff)] for fn in os.listdir('another dir') if fn.endswith(suff))
with open(r'c:\path to\file.csv', 'rb') as f:
    # extract 2nd column if the 1st one is a known graph prefix
    names = (row[1] for row in csv.reader(f, delimiter='\t') if row[0] in graphs)
    # write one name per line
    with open('output.txt', 'w') as output_file:
        for name in names:
            print >>output_file, name

Related

getting a specific row of multiple csv's and writing to a new csv

Have had a good search but can't quite find what I'm looking for. I have a number of csv files printed by a CFD simulation. The goal of my python script was to:
get the final row of each csv and
add the rows to a new file with the filename added to the start of each row
Currently I have
if file.endswith(".csv"):
    with open(file, 'r') as f:
        tsrNum = file.translate(None, '.csv')
        print(tsrNum + ', ' + ', '.join(list(csv.reader(f))[-1]))
This prints the correct values to the terminal, but I have to manually copy and paste them into a new file.
Can somebody help with the last step? I'm not familiar enough with Python's syntax; learning it properly is on my to-do list once I finish this CFD project, as it has been fantastic whenever I've managed to implement it correctly. I tried using loops and csv.DictWriter, but with little success.
EDIT
I couldn't get the posted solution working. Here's the code a guy helped me make
import csv
import os
import glob
# get a list of the input files, all CSVs in current folder
infiles = glob.glob("*.csv")
# open output file
ofile = open('outputFile.csv', "w")
# column names
fieldnames = ['parameter','time', 'cp', 'cd']
# access it as a dictionary for easy access
writer = csv.DictWriter(ofile, fieldnames=fieldnames)
# output file header
writer.writeheader()
# iterate through list of input files
for ifilename in infiles:
    # open input file
    ifile = open(ifilename, "rb+")
    # access it as a dictionary for easy access
    reader = csv.DictReader(ifile)
    # get the rows in reverse order
    rows = list(reader)
    rows.reverse()
    # get the last row
    row = rows[0]
    # output row to output csv
    writer.writerow({'parameter': ifilename.translate(None, '.csv'), 'time': row['time'], 'cp': row['cp'], 'cd': row['cd']})
    # close input file
    ifile.close()
# close output file
ofile.close()
Split your problem into smaller pieces:
looping over the directory
getting the last line
writing to your new csv.
I have tried to be very verbose; you could do something like this:
import os

def get_last_line_of_this_file(filename):
    last_line = ''  # stays empty if the file has no lines
    with open(filename) as f:
        for last_line in f:
            pass
    return last_line

def get_all_csv_filenames(directory):
    for filename in os.listdir(directory):
        if filename.endswith('.csv'):
            yield filename

def write_your_new_csv_file(new_filename, directory):
    with open(new_filename, 'w') as writer:
        for filename in get_all_csv_filenames(directory):
            # join with the directory so this works from any working directory
            last_line = get_last_line_of_this_file(os.path.join(directory, filename))
            writer.write(filename + ',' + last_line)

if __name__ == '__main__':
    write_your_new_csv_file('your_created_filename.csv', 'now the path to your dir')
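For comparison, here is a sketch of the same idea on Python 3 using the csv module for both reading and writing. The function name collect_last_rows is hypothetical. It uses os.path.splitext to drop the extension, because in Python 2 file.translate(None, '.csv') deletes every '.', 'c', 's' and 'v' character from the name rather than just the suffix.

```python
import csv
import glob
import os


def collect_last_rows(directory, out_path):
    """Write '<file name without .csv>, <last row...>' for every CSV in directory."""
    out_abs = os.path.abspath(out_path)
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
            if os.path.abspath(path) == out_abs:
                continue  # never read the file we are writing
            last_row = None
            with open(path, newline="") as f:
                for row in csv.reader(f):
                    if row:  # ignore blank lines
                        last_row = row
            if last_row is not None:
                stem = os.path.splitext(os.path.basename(path))[0]
                writer.writerow([stem] + last_row)
```

Reading each file row by row keeps only one row in memory at a time, which matters if the simulation output files are large.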

update file in a subdirectory from a list- filename gotten from csv

I've got a python script that pulls a filename from a csv file, and updates that file by adding a value to a field within the file. My problem is the file I need to update is actually in a subdirectory, with a folder name completely unrelated to the file I need to update.
my .csv list is like this:
file1, fieldx, value
file2, fieldx, value
and the files are in folders like this:
abcd/file1
efgh/file2
How can I update my code to find the file within the folder? I'm really new to Python, and I know it involves either glob, glob2, or os.walk, but I'm not sure how to nest / loop since I'm pulling the filename value from the .csv.
Here's my code:
import csv
startfile = raw_input("Please enter the name of the csv file: ")
with open(startfile, 'r') as f:
    reader = csv.reader(f)
    changelist = list(reader)
for x in changelist:
    linnum = 0
    fname=x[0]+".xml"
    fieldlookup = x[1]
    with open(fname) as f:
        for num, line in enumerate(f, 1):
            if fieldlookup in line:
                linnum = num
    f.close()
    f = open(fname, 'r')
    lines = f.readlines()
    if linnum > 0:
        lines[linnum-1] = " <"+fieldlookup+">"+str(x[2])+"</"+fieldlookup+">\n"
    f.close()
    f = open(fname, 'w')
    f.writelines(lines)
    f.close()
    print "success!"+str(x[0])+"\n"
If there is only one level of subdirectories, you should be able to do it with:
import glob
...
fname=x[0]+".xml"
real_fname=glob.glob("./*/"+fname)[0]
Note that glob.glob returns a list of matching paths, so the [0] picks the first match (and raises IndexError if nothing matches). Use real_fname afterwards. If you need cross-platform compatibility, you could also use os.path.join instead of +.
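If the files may sit more than one level deep, os.walk is an alternative to glob. The sketch below is for Python 3, and the helper name find_file is made up for illustration. It returns the first match it finds, or None when there is none.

```python
import os


def find_file(root, filename):
    """Return the first path under root whose base name equals filename, else None."""
    for dirpath, dirnames, filenames in os.walk(root):
        if filename in filenames:
            return os.path.join(dirpath, filename)
    return None
```

In the loop over the CSV rows you could then call find_file(".", x[0] + ".xml") and skip the row when the result is None.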

Reading data from one CSV and displaying parsed data on to another CSV file

I am very new to Python. I am trying to read a CSV file and write selected rows from it to another CSV file. Below is the code I have written so far. It reads every single row from the input file, 1.csv, and writes it to an output file, out.csv.
How can I tweak this code so that my output file contains only those rows which start with READ in column 8 and are not equal to 0000 in column 10? Both of these conditions need to be met.
Also, this block of code is for a single CSV file. Can anyone please tell me how I can do this for, say, 10000 CSV files?
Finally, when I execute the code, I can see spaces between the lines in my output CSV. How can I remove those spaces?
import csv
f1 = open("1.csv", "r")
reader = csv.reader(f1)
header = reader.next()
f2 = open("out.csv", "w")
writer = csv.writer(f2)
writer.writerow(header)
for row in reader:
    writer.writerow(row)
f1.close()
f2.close()
Something like:
import os
import csv
import glob

class CSVReadWriter(object):
    def munge(self, filename, suffix):
        # splitext keeps the dot in ext, e.g. ('1', '.csv')
        name, ext = os.path.splitext(filename)
        return '{0}{1}{2}'.format(name, suffix, ext)

    def is_valid(self, row):
        # keep rows that start with READ and are not 0000
        return row[8].startswith('READ') and row[10] != '0000'

    def filter_csv(self, fin, fout):
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        writer.writerow(next(reader))  # header
        for row in reader:
            if self.is_valid(row):
                writer.writerow(row)

    def read_write(self, iname, suffix):
        with open(iname, 'rb') as fin:
            oname = self.munge(iname, suffix)
            with open(oname, 'wb') as fout:
                self.filter_csv(fin, fout)

work_directory = r"C:\Temp\Data"
# match every CSV file inside the work directory
for filename in glob.glob(os.path.join(work_directory, '*.csv')):
    csvrw = CSVReadWriter()
    csvrw.read_write(filename, '_out')
I've made it a class so that you can override the munge and is_valid methods to suit different cases. Being a class also means that you can store state more easily, for example if you wanted to output lines between certain criteria.
The extra spaces between lines that you mention are caused by \r\n (carriage return and line feed) line endings. On Python 2, opening the output file with 'wb' should resolve it; on Python 3, open it with newline='' instead.
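A function-based sketch of the same filter for Python 3, where the files are opened in text mode with newline='' so that the csv module controls the line endings. The name filter_rows is hypothetical, and the indices 8 and 10 are zero-based, following the code above; adjust them if your "column 8" and "column 10" are counted from 1.

```python
import csv


def filter_rows(in_path, out_path):
    """Copy the header plus rows where row[8] starts with READ and row[10] is not 0000.

    Returns the number of data rows kept.
    """
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)
        if header is None:
            return 0  # empty input file
        writer.writerow(header)
        kept = 0
        for row in reader:
            # the length check guards against short or blank rows
            if len(row) > 10 and row[8].startswith("READ") and row[10] != "0000":
                writer.writerow(row)
                kept += 1
        return kept
```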

CSV Writing to File Difficulties

I am supposed to add a specific label to my CSV file based off conditions. The CSV file has 10 columns and the third, fourth, and fifth columns are the ones that affect the conditions the most and I add my label on the tenth column. I have code here which ended in an infinite loop:
import csv
import sys
w = open(sys.argv[1], 'w')
r = open(sys.argv[1], 'r')
reader = csv.reader(r)
writer = csv.writer(w)
for row in reader:
    if row[2] or row[3] or row[4] == '0':
        row[9] == 'Label'
    writer.writerow(row)
w.close()
r.close()
I do not know why it would end in an infinite loop.
EDIT: I made a mistake and my original infinite loop program had this line:
w = open(sys.argv[1], 'a')
I changed 'a' to 'w' but this ended up erasing the entire CSV file itself. So now I have a different problem.
You have two problems: the condition if row[2] or row[3] or row[4] == '0': and the statement row[9] == 'Label'. The condition only compares row[4] with '0'; row[2] and row[3] are merely tested for truthiness, so use any to check whether any of several values equals '0'. The second line is a comparison, not an assignment; use = to assign a value. I would also recommend using with open.
Additionally, you can't safely read from and write to the same CSV file at the same time, so save your changes to a new CSV file. Afterwards you can remove the original and rename the new one using os.remove and os.rename:
import csv
import sys
import os
with open('some_new_file.csv', 'w') as w, open(sys.argv[1], 'r') as r:
    reader, writer = csv.reader(r), csv.writer(w)
    for row in reader:
        if any(x == '0' for x in (row[2], row[3], row[4])):
            row[9] = 'Label'
        writer.writerow(row)
os.remove('{}'.format(sys.argv[1]))
os.rename('some_new_file.csv', '{}'.format(sys.argv[1]))
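To see why the original condition misbehaves, it helps to evaluate it on a sample row. The row below is invented for illustration; none of its three tested fields is '0'.

```python
row = ["a", "b", "1", "2", "3"]

# Parsed as: row[2] or row[3] or (row[4] == "0").
# A non-empty string is truthy, so the result is the string "1",
# not the outcome of a comparison.
chained = row[2] or row[3] or row[4] == "0"

# Compares each of the three fields with "0".
explicit = any(x == "0" for x in (row[2], row[3], row[4]))

# Equivalent membership test on the slice.
membership = "0" in row[2:5]

print(repr(chained), explicit, membership)
```

Because chained is a non-empty string, the if branch would run for this row even though no field equals '0'.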
You can write to a tempfile.NamedTemporaryFile and use in to test for "0". Because you are matching a full string rather than a substring, any gains you nothing: it builds a tuple of three elements anyway, so you may as well slice the row and test for membership. Then replace the original file with shutil.move:
import csv
import sys
from shutil import move
from tempfile import NamedTemporaryFile
with NamedTemporaryFile("w", dir=".", delete=False) as w, open(sys.argv[1]) as r:
    reader, writer = csv.reader(r), csv.writer(w)
    writer.writerows(row[:-1] + ['Label'] if "0" in row[2:5] else row
                     for row in reader)
move(w.name, sys.argv[1])
sys.argv[1] is already your file name and a string, so that is all you need to pass.
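The same temporary-file idea can be packaged as a function. This is a sketch for Python 3 only; the name label_rows_in_place and its label parameter are hypothetical. It swaps in os.replace for shutil.move and creates the temporary file next to the original so the final replacement stays on one filesystem.

```python
import csv
import os
import tempfile


def label_rows_in_place(path, label="Label"):
    """Rewrite path, setting the last column to label when columns 3-5 contain '0'."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, newline="") as r, tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False, newline=""
    ) as w:
        writer = csv.writer(w)
        for row in csv.reader(r):
            if "0" in row[2:5]:
                row = row[:-1] + [label]
            writer.writerow(row)
    # both files are closed here, so the original can be replaced
    os.replace(w.name, path)
```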
I think the problem is in these lines:
w = open(sys.argv[1], 'w')
r = open(sys.argv[1], 'r')
You are opening the same file for reading and writing. Try using a different file name for the output.

Python Hash not being updated in csv file output

I have working code that takes a directory of csv files and hashes one column of each line, then aggregates all files together. The issue is the output only displays the first hash value, not re-running the hash for each line. Here is the code:
import glob
import hashlib
files = glob.glob( '*.csv' )
output="combined.csv"
with open(output, 'w' ) as result:
    for thefile in files:
        f = open(thefile)
        m = f.readlines()
        for line in m[1:]:
            fields = line.split()
            hash_object = hashlib.md5(b'(fields[2])')
            newline = fields[0],fields[1],hash_object.hexdigest(),fields[3]
            joined_line = ','.join(newline)
            result.write(joined_line+ '\n')
        f.close()
You are creating a hash of a fixed bytestring b'(fields[2])'. That value has no relationship to your CSV data, even though it uses the same characters as are used in your row variable name.
You need to pass in bytes from your actual row:
hash_object = hashlib.md5(fields[2].encode('utf8'))
I am assuming your fields[2] column is a string, so you need to encode it first to get bytes. The UTF-8 encoding can handle all codepoints that could possibly be contained in a string.
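A quick way to convince yourself of the difference is to hash two different rows both ways. The field values below are invented for illustration.

```python
import hashlib

fields_a = ["x", "y", "alice", "z"]
fields_b = ["x", "y", "bob", "z"]

# The bytes literal never changes, so both digests are identical.
literal_a = hashlib.md5(b'(fields[2])').hexdigest()
literal_b = hashlib.md5(b'(fields[2])').hexdigest()

# Encoding the actual field value gives one digest per distinct value.
real_a = hashlib.md5(fields_a[2].encode('utf8')).hexdigest()
real_b = hashlib.md5(fields_b[2].encode('utf8')).hexdigest()

print(literal_a == literal_b, real_a == real_b)
```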
You also appear to be re-inventing the CSV reading and writing wheel; you probably should use the csv module instead:
import csv
# ...
with open(output, 'w', newline='') as result:
    writer = csv.writer(result)
    for thefile in files:
        with open(thefile, newline='') as f:
            reader = csv.reader(f)
            next(reader, None)  # skip first row
            for fields in reader:
                hash_object = hashlib.md5(fields[2].encode('utf8'))
                newrow = fields[:2] + [hash_object.hexdigest()] + fields[3:]
                writer.writerow(newrow)
