I have a CSV file that looks like this
a,b,c
d1,g4,4m
t,35,6y
mm,5,m
I'm trying to replace all the m's and y's preceded by a number with 'month' and 'year' respectively. I'm using the following script.
import re,csv
out = open ("out.csv", "wb")
file = "in.csv"
with open(file, 'r') as f:
    reader = csv.reader(f)
    for ss in reader:
        s = str(ss)
month_pair = (re.compile('(\d\s*)m'), 'months')
year_pair = (re.compile('(\d\s*)y'), 'years')
def substitute(s, pairs):
    for (pattern, substitution) in pairs:
        match = pattern.search(s)
        if match:
            s = pattern.sub(match.group(1)+substitution, s)
    return s
pairs = [month_pair, year_pair]
print (substitute(s, pairs))
It does replace but it does that only on the last row, ignoring the ones before it. How can I have it iterate over all the rows and write to another csv file?
You can use a positive look-behind (here s holds the whole file content as one string):
>>> re.sub(r'(?<=\d)m','months',s)
'a,b,c\nd1,g4,4months\nt,35,6y\nmm,5,m'
>>> re.sub(r'(?<=\d)y','years',s)
'a,b,c\nd1,g4,4m\nt,35,6years\nmm,5,m'
In this line
print (substitute(s, pairs))
your variable s holds only the last line of your file. The reading loop overwrites s with the current row on every iteration, and the substitution runs only once, after the loop has finished.
Solutions (choose one):
You could add another for-loop that iterates over all the lines.
Or move the substitution into the for-loop where you read the lines of the file. This is definitely the better solution!
You can easily look up how to write a new file or change the file you are working on.
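To make the second option concrete, here is a minimal sketch (Python 3) that does the substitution inside the reading loop and writes every row out again. The names substitute and convert are made up for illustration, and io.StringIO stands in for the real in.csv and out.csv so the snippet runs on its own. It uses the look-behind patterns from the other answer, so unlike the original `(\d\s*)m` it does not allow whitespace between the digit and the letter.

```python
import csv
import io
import re

# Look-behind patterns: an 'm' or 'y' that directly follows a digit.
MONTH = re.compile(r'(?<=\d)m')
YEAR = re.compile(r'(?<=\d)y')


def substitute(cell):
    """Replace digit+m with digit+months and digit+y with digit+years."""
    return YEAR.sub('years', MONTH.sub('months', cell))


def convert(src, dst):
    """Read CSV rows from src, substitute in every cell, write rows to dst."""
    writer = csv.writer(dst, lineterminator='\n')
    for row in csv.reader(src):
        # The substitution happens here, once per row, inside the loop.
        writer.writerow([substitute(cell) for cell in row])


# In-memory stand-ins for in.csv and out.csv (assumption for the demo).
src = io.StringIO("a,b,c\nd1,g4,4m\nt,35,6y\nmm,5,m\n")
dst = io.StringIO()
convert(src, dst)
print(dst.getvalue())
```

With real files you would pass `open('in.csv', newline='')` and `open('out.csv', 'w', newline='')` instead of the StringIO objects.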
I am trying to search through a list of files, find the word 'type' and the word that follows it, and then put those words into a list together with the file name. For example, this is what I am looking for:
File Name, Type
[1.txt, [a, b, c]]
[2.txt, [a,b]]
My current code returns a list for every type.
[1.txt, [a]]
[1.txt, [b]]
[1.txt, [c]]
[2.txt, [a]]
[2.txt, [b]]
Here is my code. I know my logic appends a single value to the list each time, but I'm not sure how to edit it so that each entry is just the file name with a list of types.
output = []
for file_name in find_files(d):
    with open(file_name, 'r') as f:
        for line in f:
            line = line.lower().strip()
            match = re.findall('type ([a-z]+)', line)
            if match:
                output.append([file_name, match])
Learn to categorize your actions at the proper loop level.
In this case, you say that you want to accumulate all of the references into a single list, but then your code creates one output line per reference, rather than one per file. Change that focus:
with open(file_name, 'r') as f:
    ref_list = []
    for line in f:
        line = line.lower().strip()
        match = re.findall('type ([a-z]+)', line)
        if match:
            # findall returns a list, so extend keeps ref_list flat
            ref_list.extend(match)
    # Once you've been through the entire file,
    # THEN you add a line for that file,
    # with the entire reference list
    output.append([file_name, ref_list])
You might find it useful to use a dict here instead:
output = {}
for file_name in find_files(d):
    with open(file_name, 'r') as f:
        output[file_name] = []
        for line in f:
            line = line.lower().strip()
            match = re.findall('type ([a-z]+)', line)
            if match:
                # extend copes with several matches on one line;
                # append(*match) would raise a TypeError in that case
                output[file_name].extend(match)
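For anyone who wants to try the "one entry per file" idea without a directory of files, here is a small self-contained sketch. find_files is not shown in the question, so this version takes a dict that maps a file name to its lines instead; collect_types and the sample data are made up for illustration.

```python
import re

TYPE_WORD = re.compile(r'type ([a-z]+)')


def collect_types(files):
    """Return [[file_name, [types...]], ...] with one entry per file.

    files maps a file name to an iterable of its lines; this is a
    stand-in for opening real files found by find_files (assumption).
    """
    output = []
    for file_name, lines in files.items():
        types = []                      # reset once per file
        for line in lines:
            types.extend(TYPE_WORD.findall(line.lower().strip()))
        if types:
            output.append([file_name, types])   # append once per file
    return output


files = {
    '1.txt': ['Type a\n', 'something type b and type c\n'],
    '2.txt': ['type a\n', 'no match here\n', 'TYPE b\n'],
}
result = collect_types(files)
print(result)
```

The list is created and appended at the file level, while the matching happens at the line level, which is the loop-level split described above.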
I use the code below to combine all my CSV files. Each of the files listed below has 10,000 rows:
billing_report_2014-02-01.csv
billing_report_2014-02-02.csv
:
fout=open("out.csv","a")
for num in range(1,10):
    print num
    for line in open("billing_report_2014-02-0"+str(num)+".csv"):
        fout.write(line)
for num in range(10,29):
    print num
    for line in open("billing_report_2014-02-"+str(num)+".csv"):
        fout.write(line)
fout.close()
Now I want to add a new date column to out.csv. Every row appended from billing_report_2014-02-01.csv should get the value "2014-02-01", every row appended from billing_report_2014-02-02.csv should get "2014-02-02", and so on. How can I approach this?
List the filenames you want to work on, take the date from each name, then build a generator over the input file that removes trailing newlines and adds a new field with the date, e.g.:
filenames = [
    'billing_report_2014-02-01.csv',
    'billing_report_2014-02-02.csv'
]
with open('out.csv', 'w') as fout:
    for filename in filenames:
        to_append = filename.rpartition('_')[2].partition('.')[0]
        with open(filename) as fin:
            fout.writelines('{},{}\n'.format(line.rstrip(),to_append) for line in fin)
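I think you can just add the date at the end. Strip the trailing newline first, otherwise the date lands on the next line:
for line in open("billing_report_2014-02-0"+str(num)+".csv"):
    fout.write(line.rstrip('\n')+',DATE INFORMATION\n')
I am presuming your CSV is really comma separated; if it is tab separated, the separator should be \t instead.
You could also use an intermediate step by changing line:
line = line.rstrip('\n') + ',DATE INFORMATION\n'
As you are trying to add the date from the file name, just build it from the loop variable (this is for the single-digit days; the second loop uses '2014-02-' without the extra 0):
line = line.rstrip('\n') + ',2014-02-0' + str(num) + '\n'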
You could use the replace function if it is always the ",LLC" string expression; see the example below:
>>> string = "100, 90101, California, Example company,LLC, other data"
>>> string.replace(',LLC',';LLC')
'100, 90101, California, Example company;LLC, other data'
Putting it all together, and trying to bring in some of the inspiration from @Jon Clements as well (KUDOS!):
def combine_and_add_date(year, month, startday, endday, replace_dict):
    fout = open("out.csv", "a")
    for num in range(startday, endday+1):
        # zero-pad single-digit days: 1 -> '01'
        daynum = str(num).zfill(2)
        date_info = year + '-' + month + '-' + daynum
        source_name = 'billing_report_' + date_info + '.csv'
        for line in open(source_name):
            # drop the trailing newline so the date stays on the same row
            line = line.rstrip('\n')
            for key in replace_dict:
                # str.replace returns a new string, so keep the result
                line = line.replace(key, replace_dict[key])
            fout.write(line + ',' + date_info + '\n')
    fout.close()
I hope this works. You should (hopefully; I am a newb...) be able to use it like this. Note that the dictionary is designed to allow you to make all kinds of replacements:
combine_and_add_date("2014","02",1,28, {',LLC': ';LLC', ',PLC':';PLC'})
fingers crossed
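If the embedded commas (the ",LLC" case) are properly quoted in the source files, the csv module might save you the replace step entirely. This is only a sketch under that assumption: date_from_filename and append_date_column are hypothetical helper names, and io.StringIO stands in for the real report and out.csv.

```python
import csv
import io
import re

# Matches the date part of names like billing_report_2014-02-01.csv
DATE_IN_NAME = re.compile(r'(\d{4}-\d{2}-\d{2})\.csv$')


def date_from_filename(filename):
    """Pull 'YYYY-MM-DD' out of the file name, or raise if it is missing."""
    match = DATE_IN_NAME.search(filename)
    if match is None:
        raise ValueError('no date found in %r' % filename)
    return match.group(1)


def append_date_column(src, dst, date):
    """Copy CSV rows from src to dst, adding date as the last field."""
    writer = csv.writer(dst, lineterminator='\n')
    for row in csv.reader(src):
        writer.writerow(row + [date])


name = 'billing_report_2014-02-01.csv'
# In-memory stand-in for the report; the quoted field contains a comma.
src = io.StringIO('100,90101,"Example company,LLC"\n200,90102,Other\n')
dst = io.StringIO()
append_date_column(src, dst, date_from_filename(name))
print(dst.getvalue())
```

The writer re-quotes any field that contains a comma, so the company name stays in one column and the date always ends up in the last one.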
I am new to Python, so please bear with me.
I can't get this little script to work properly:
genome = open('refT.txt','r')
datafile - a reference genome with a bunch (2 million) of contigs:
Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA
The file is opened:
cont_list = open('dataT.txt','r')
a list of contigs that I want to extract from the dataset listed above:
Contig_01
Contig_02
Contig_03
Contig_05
My hopeless script:
for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a=genome.readline()
        s=line+a
        data_out = open ('output.txt','a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')
The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "contig_04", which is not in the list, and move on to "Contig_05".
I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-
I would first try to generate an iterable which gives you a tuple (contig, genome):
def pair(file_obj):
    for line in file_obj:
        yield line, next(file_obj)
Now, I would use that to get the desired elements:
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}
with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs)
        name = p[0].strip()  # drop the trailing newline before comparing
        if name in wanted:
            # write to output file, store in a list, or dict, ...
            wanted.remove(name)
I would recommend several things:
Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.
Try to read in all the contigs you want into a list or other structure. It is a pain to have many files open at once. Read all the lines at once and store them.
Here's some example code that might do what you're looking for
from itertools import izip_longest  # Python 2; on Python 3 use itertools.zip_longest
# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip()) #rstrip() removes '\n' from EOL
# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to sort through fasta files 2 lines at a time
    for name, seq in izip_longest(*[genomefile]*2):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name) #optional. remove if you only want the seq
            outfile.write(seq)
Here's a pretty compact approach to get the sequences you'd like.
def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                # next() works on Python 2 and 3; .next() is Python 2 only
                sequence = next(cont_list).strip()
                sequences.append(sequence)
    return sequences
if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    # refT.txt is the genome file from the question
    sequences = get_sequences('refT.txt', valid_contigs)
    print(sequences)
This utilizes the ability of startswith() to accept a tuple as a parameter and check for any matches. If the line matches what you want (a desired contig), it will grab the next line and append it to sequences after stripping out the unwanted whitespace characters. Keep in mind that startswith() is a prefix test, so 'Contig_01' would also match a name such as 'Contig_010'.
From there, writing the sequences grabbed to an output file is pretty straightforward.
Example output:
['TGCAGGTAAAAAACTGTCACCTGCTGGT',
'TGCAGGTCTTCCCACTTTATGATCCCTTA',
'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
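As far as I can tell, the original script gets stuck because it reads one genome line for every line of the contig list, so the two files fall out of step as soon as a contig has to be skipped. A minimal Python 3 sketch that avoids this walks the genome two lines at a time and checks each name against a set. extract_contigs is a made-up name, and io.StringIO stands in for refT.txt here.

```python
import io


def extract_contigs(genome, wanted):
    """Yield (name, sequence) for every contig whose name is in wanted.

    genome is any iterable of lines that alternates name, sequence.
    """
    lines = iter(genome)
    for name_line in lines:
        name = name_line.strip()
        # The sequence is always the line right after the name.
        seq = next(lines, '').strip()
        if name in wanted:          # exact match, O(1) with a set
            yield name, seq


# In-memory stand-in for the genome file (assumption for the demo).
genome = io.StringIO(
    "Contig_01\nTGCAGGTAAAAAACTGTCACCTGCTGGT\n"
    "Contig_02\nTGCAGGTCTTCCCACTTTATGATCCCTTA\n"
    "Contig_04\nTGCAGTGAGCAGACCCCAAAGGGAACCAT\n"
    "Contig_05\nTGCAGTAAGGGTAAGATTTGCTTGACCTA\n"
)
wanted = {'Contig_01', 'Contig_05'}
result = list(extract_contigs(genome, wanted))
for name, seq in result:
    print(name)
    print(seq)
```

With the real files you would build wanted from dataT.txt (one stripped name per line) and pass the open refT.txt file as genome. With 2 million contigs the set lookup should also be noticeably faster than scanning a list.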
def regexread():
    import re
    result = ''
    savefileagain = open('sliceeverfile3.txt','w')
    #text=open('emeverslicefile4.txt','r')
    text='09,11,14,34,44,10,11, 27886637, 0\n561, Tue, 5,Feb,2013, 06,25,31,40,45,06,07, 19070109, 0\n560, Fri, 1,Feb,2013, 05,21,34,37,38,01,06, 13063500, 0\n559, Tue,29,Jan,2013,'
    pattern='\d\d,\d\d,\d\d,\d\d,\d\d,\d\d,\d\d'
    #with open('emeverslicefile4.txt') as text:
    f = re.findall(pattern,text)
    for item in f:
        print(item)
    savefileagain.write(item)
    #savefileagain.close()
The above function, as written, parses the text and returns sets of seven numbers. I have three problems.
Firstly, reading from the 'read' file, which contains exactly the same text as text='09,...etc', raises "TypeError: expected string or buffer", which I cannot solve even after reading some of the posts.
Secondly, when I try to write the results to the 'write' file, nothing is written.
Thirdly, I am not sure how to get the same output in the file that I get with the print statement, which is three lines of seven numbers each. That is the output I want.
This should do the trick:
import re
filename = 'sliceeverfile3.txt'
pattern = r'\d\d,\d\d,\d\d,\d\d,\d\d,\d\d,\d\d'
new_file = []
# Make sure file gets closed after being iterated
with open(filename, 'r') as f:
    # Read the file contents and generate a list with each line
    lines = f.readlines()
# Iterate each line
for line in lines:
    # Regex applied to each line
    match = re.search(pattern, line)
    if match:
        # Make sure to add \n to display correctly when we write it back
        new_line = match.group() + '\n'
        print(new_line)
        new_file.append(new_line)
# 'w' truncates the file, so writing starts at the beginning
with open(filename, 'w') as f:
    # actually write the lines
    f.writelines(new_file)
You're sort of on the right track...
You'll iterate over the file:
How to iterate over the file in python
and apply the regex to each line. The link above should really answer all 3 of your questions when you realize you're trying to write 'item', which doesn't exist outside of that loop.
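To tie the three problems together, here is a small sketch. As far as I can tell the TypeError comes from handing the file object itself to re.findall, which wants a string, so the file has to be read first. find_draws and write_draws are made-up names, and io.StringIO stands in for the output file so the snippet runs without touching the disk.

```python
import io
import re

PATTERN = re.compile(r'\d\d,\d\d,\d\d,\d\d,\d\d,\d\d,\d\d')


def find_draws(text):
    """Return every run of seven two-digit numbers found in text (a str)."""
    return PATTERN.findall(text)


def write_draws(draws, out):
    """Write one draw per line to the open file-like object out."""
    # The write happens for every item, and each one gets its own newline.
    out.writelines(draw + '\n' for draw in draws)


sample = ('09,11,14,34,44,10,11, 27886637, 0\n'
          '561, Tue, 5,Feb,2013, 06,25,31,40,45,06,07, 19070109, 0\n'
          '560, Fri, 1,Feb,2013, 05,21,34,37,38,01,06, 13063500, 0\n'
          '559, Tue,29,Jan,2013,')
draws = find_draws(sample)
out = io.StringIO()          # stand-in for the real output file
write_draws(draws, out)
print(out.getvalue())
```

With real files, read the input via `with open('emeverslicefile4.txt') as f: draws = find_draws(f.read())` and write inside `with open('sliceeverfile3.txt', 'w') as out:`, which also takes care of closing (and therefore flushing) the output file.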
Clarification:
So if my file has 10 lines:
The first line is a heading, so I want to append some text at the end of the first line.
Then I have a list which contains 9 elements.
I want to read that list and append the corresponding element to the end of each line.
So basically list[0] goes to the second line, list[1] to the third line, and so on.
I have a file which is delimited by commas,
something like this:
A,B,C
0.123,222,942
......
Now I want to do something like this:
A,B,C,D #append "D" just once
0.123,222,942,99293
............
This "D" column is actually saved in a list, so yes, I already have the values.
How do I do this? I mean, I know the naive way:
go through each line and do something like
string += str(list[i])
Basically, how do I append something at the end of each line in the file in a Pythonic way? :)
Just create a new file:
data = ['header', 1, 2, 3, 4]
with open("infile", 'r') as inf, open("infile.2", 'w') as outf:
    outf.writelines('%s,%s\n' % (s.strip(), n) for s, n in zip(inf, data))
If you want to "update" the input file, just rename the new one afterwards:
import os
os.unlink("infile")
os.rename("infile.2", "infile")
Short answer: Use the csv module.
Long answer:
import csv
newvalues = [...]
with open("path/to/input.csv") as file:
    data = list(csv.reader(file))
with open("path/to/input.csv", "w") as file:
    writer = csv.writer(file)
    for row, newvalue in zip(data, newvalues):
        row.append(newvalue)
        writer.writerow(row)
Naturally, this depends on the lines in the file and newvalues being the same length. If this isn't the case, you could use something like zip_longest to fill in the excess lines with a given value.
If you are writing to a different file, we can do it even more easily:
import csv
newvalues = [...]
# 'from' is a reserved word in Python, so the files are named src and dst
with open("path/to/input.csv") as src, open("path/to/output.csv", "w") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row, newvalue in zip(reader, newvalues):
        row.append(newvalue)
        writer.writerow(row)
This also has the advantage of not reading the entire file into memory, so for very large files, this is a better solution.
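For the case where the file has more rows than newvalues, here is one possible sketch (Python 3). Instead of zip_longest it pulls each value with next() and a default, which pads the extra rows without ever inventing extra rows when newvalues is the longer one. append_column and the fill parameter are made-up names, and io.StringIO stands in for the real files.

```python
import csv
import io


def append_column(src, dst, newvalues, fill=''):
    """Copy CSV rows from src to dst, appending one value per row.

    Rows beyond the end of newvalues get fill instead.
    """
    values = iter(newvalues)
    writer = csv.writer(dst, lineterminator='\n')
    for row in csv.reader(src):
        # next() with a default never raises when the values run out.
        writer.writerow(row + [next(values, fill)])


# In-memory stand-ins for the input and output files (assumption).
src = io.StringIO("A,B,C\n0.123,222,942\n0.5,1,2\n")
dst = io.StringIO()
append_column(src, dst, ['D', 99293])
print(dst.getvalue())
```

Like the streaming version above, this only holds one row in memory at a time.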