Why is the for loop iterated only once? - python

I'm rather new to coding. Since I have to write a few letters, I wanted a script that swaps in each recipient's name automatically.
I've got a text file with placeholders for the name and a CSV file where the names are stored in the following format:
Surname;Firstname
Doe;John
Norris;Chuck
...
Now I've conjured up this script:
import csv
import re

letterPATH = "Brief.txt"
tablePATH = "Liste.csv"

with open(letterPATH, "r") as letter, open(tablePATH, "r") as table:
    table = csv.reader(table, delimiter=";")
    rows = list(table)
    rows = rows[1::]
    print(rows)
    for (surname, firstname) in rows:
        #Check if first- and surname have correct output
        #print(firstname)
        #print(surname)
        new_content = ""
        for lines in letter:
            print(lines)
            lines = re.sub(r"\<Nachname\>", surname, lines)
            print(lines)
            lines = re.sub(r"\<Vorname\>", firstname, lines)
            print(lines)
            new_content += lines
        with open(surname + firstname + ".txt", "w") as new_letter:
            new_letter.writelines(new_content)
I've got the following problem now:
A text file is created for each entry, as it should be (JohnDoe.txt, ChuckNorris.txt and so on); however, only the first file has the correct content, while the others are empty.
While debugging I've seen that the inner for-loop (for lines in letter) is only iterated once, whereas the with statement runs multiple times as it should.
I simply do not understand why the for-loop isn't iterating.
Cheers and thanks for your help! :)

letter is a file object. A file object keeps track of how much you've read and where the next read should start, so if you've read two lines, the next read will begin at the third line, and so on.
Since you read through the whole file on the first pass, the following iterations won't read any more lines from it, because you've already consumed them all.
The solution could be to reset the file pointer (the thing pointing to where in the file you've currently read to) to the beginning with the letter.seek(0) method. Or, you could simply store the file content in a list and iterate over the list instead.
import csv
import re

letterPATH = "Brief.txt"
tablePATH = "Liste.csv"

with open(letterPATH, "r") as letter_file, open(tablePATH, "r") as table:
    table = csv.reader(table, delimiter=";")
    letter = list(letter_file)  # Store the file's lines in a list instead.
    rows = list(table)
    rows = rows[1::]
    print(rows)
    for (surname, firstname) in rows:
        #Check if first- and surname have correct output
        #print(firstname)
        #print(surname)
        new_content = ""  # Reset before each pass over the letter.
        for lines in letter:  # Iterating over a list works any number of times.
            print(lines)
            lines = re.sub(r"\<Nachname\>", surname, lines)
            print(lines)
            lines = re.sub(r"\<Vorname\>", firstname, lines)
            print(lines)
            new_content += lines
        with open(surname + firstname + ".txt", "w") as new_letter:
            new_letter.writelines(new_content)
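For completeness, here is a minimal sketch of the letter.seek(0) variant mentioned above (same file and placeholder names assumed, debug prints dropped):
import csv
import re

letterPATH = "Brief.txt"
tablePATH = "Liste.csv"

with open(letterPATH, "r") as letter, open(tablePATH, "r") as table:
    rows = list(csv.reader(table, delimiter=";"))[1:]  # skip the header row
    for (surname, firstname) in rows:
        letter.seek(0)  # rewind the letter file so it can be read again
        new_content = ""
        for lines in letter:
            lines = re.sub(r"\<Nachname\>", surname, lines)
            lines = re.sub(r"\<Vorname\>", firstname, lines)
            new_content += lines
        with open(surname + firstname + ".txt", "w") as new_letter:
            new_letter.write(new_content)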

Related

Code returns same line multiple times instead of multiple lines

What I'm trying to do is to open two CSV files and print only the lines in which the content of a column in file 1 and file 2 match. I already know that I should end up with 14 results, but instead the first line of the CSV file I'm working with gets printed 14 times. Where did I go wrong?
file1 = open("../dir/file1.csv", "r")
for line in file1:
    file1splitted = line.strip().split(",")
    file2 = open("../dir/file2.csv", "r")
    for line in file2:
        file2splitted = line.strip().split(",")
        for line in file1:
            if file1splitted[0] == file2splitted[2]:
                print(file1splitted[0], file1splitted[1], file2splitted[6], file2splitted[10], file2splitted[12])
file1.close()
file2.close()
You should be using the csv module for reading these files because splitting on commas is not reliable; it's fine for a single CSV column to contain values that themselves include commas.
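For example, the following single row has three fields, but a plain split(",") would cut the quoted middle field into two pieces:
id,"Doe, John",42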
I've added a couple of things to try to make this cleaner and to help you move forward in your learning:
1. I've used the with context manager, which automatically closes a file once you're done reading it. No need for .close().
2. I've packaged the csv reading code into a function. Now we only need to write that part once, and we can call the function with any file.
3. I've used the csv module to read the file. This returns a nested list of rows, each inner list representing a single row.
4. I've used a list comprehension, which is a neater way of writing a for loop that creates a list. In this case, it's a list of all the items in the first column of file_1.
5. I've converted the list from Point 4 into a set. When we iterate through file_2, we can very quickly check whether a row value has been seen in file_1 (set lookup is O(1), rather than having to iterate through file_1 every single time).
The indices I print are from my own test files; you will need to adapt them to your own use-case.
import csv

def read_csv(file_name):
    with open(file_name) as infile:  # Context manager to auto-close files at end
        reader = csv.reader(infile)
        # next(reader)  # remove the hash if you want to drop the headers
        return list(reader)

file_1 = read_csv('file_1.csv')
file_2 = read_csv('file_2.csv')

# Make a set of file_1 column 0 with a list comprehension
file_1_vals = set([item[0] for item in file_1])

# Now iterate through file_2
for row in file_2:
    if row[2] in file_1_vals:
        print(row[1])
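As a side note, the set(...) wrapped around the list comprehension could also be written as a set comprehension, which skips building the intermediate list:
file_1_vals = {item[0] for item in file_1}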
file1 = open("../dir/file1.csv", "r")
file2 = open("../dir/file2.csv", "r")
file2lines = file2.readlines()  # read file2 once so the inner loop can run on every pass

for line in file1:
    file1splitted = line.strip().split(",")
    for line2 in file2lines:
        file2splitted = line2.strip().split(",")
        if file1splitted[0] == file2splitted[2]:
            print(file1splitted[0], file1splitted[1], file2splitted[6], file2splitted[10], file2splitted[12])

file1.close()
file2.close()
If you provide your csv files, then I can help you more.

Extract specific fasta sequences from a big fasta file

I want to extract specific fasta sequences from a big fasta file using the following script, but the output is empty.
The transcripts.txt file contains the list of transcript IDs that I want to export (both the IDs and the sequences) from assembly.fasta to selected_transcripts.fasta.
For example:
transcripts.txt:
Transcript_00004|5601
Transcript_00005|5352
assembly.fasta:
>Transcript_00004|5601
GATCTGGCGCTGAGCTGGGTGCTGATCGACCCGGCGTCCGGCCGCTCCGTGAACGCCTCGAGTCGGCGCCCGGTGTGCGTTGACCGGAGATCGCGATCTGGGGAGACCGTCGTGCGGTT
>Transcript_00004|5360
CGATCTGGCGCTGAGCTGGGTGCTGATCGACCCGGCGTCCGGCCGCTCCGTGAACGCCTCGAGTCGGCGCCCGGTGTGCGTTGACCGGAGATCGCGATCTGGGGAGACCGTCGTGCGGTT
The IDs are preceded by the > symbol: >Transcript_00004|5601.
I have to read the assembly.fasta file; if a transcript ID in assembly.fasta is the same as one written in transcripts.txt, I have to write that transcript ID and its sequence to selected_transcripts.fasta. So, in the example above, I have to write only the first transcript.
Any suggestions?
Thanks.
from Bio import SeqIO

my_list = [line.split(',') for line in open("/home/universita/transcripts.txt")]

fin = open('/home/universita/assembly.fasta', 'r')
fout = open('/home/universita/selected_transcripts.fasta', 'w')

for record in SeqIO.parse(fin, 'fasta'):
    for item in my_list:
        if item == record.id:
            fout.write(">" + record.id + "\n")
            fout.write(record.seq + "\n")

fin.close()
fout.close()
Based on your examples there are a couple of small problems which may explain why you don't get anything. Your transcripts.txt has multiple entries in one line, so my_list will have all the items of the first line in my_list[0]; in your loop you iterate through my_list line by line, so your first item will be
['Transcript_00004|5601', 'Transcript_00005|5352']
Also, if assembly.fasta has no > in the header line, you won't get back any records with IDs and sequences. The following code should take care of those problems, assuming you added > to the headers; the split now uses a space instead of a comma.
from Bio import SeqIO

my_list = []
with open("transcripts.txt") as transcripts:
    for line in transcripts:
        my_list.extend(line.split(' '))

fin = open('assembly.fasta', 'r')
fout = open('selected_transcripts.fasta', 'w')

for record in SeqIO.parse(fin, 'fasta'):
    for item in my_list:
        if item.strip() == record.id:
            fout.write(">" + record.id + "\n")
            fout.write(str(record.seq) + "\n")  # str() so a plain string, not a Seq object, is written

fin.close()
fout.close()
The reading of transcripts.txt was changed so that all IDs are appended to my_list separately. Also, each item is stripped of whitespace to avoid having a line break in the string when it is compared to record.id.
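As a side note, collecting the IDs in a set would avoid re-scanning my_list for every record, which matters for a big fasta file; a minimal sketch under the same file-name assumptions:
from Bio import SeqIO

wanted = set()
with open("transcripts.txt") as transcripts:
    for line in transcripts:
        wanted.update(line.split())  # split() handles the spaces and strips the newline

with open("assembly.fasta") as fin, open("selected_transcripts.fasta", "w") as fout:
    for record in SeqIO.parse(fin, "fasta"):
        if record.id in wanted:  # set membership test is O(1)
            fout.write(">" + record.id + "\n")
            fout.write(str(record.seq) + "\n")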

Python removing substrings from strings

I'm trying to remove some substrings from a string in a csv file.
import csv
import string

input_file = open('in.csv', 'r')
output_file = open('out.csv', 'w')

data = csv.reader(input_file)
writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)  # dialect='excel')

specials = ("i'm", "hello", "bye")

for line in data:
    line = str(line)
    new_line = str.replace(line, specials, '')
    writer.writerow(new_line.split(','))

input_file.close()
output_file.close()
So for this example:
hello. I'm obviously over the moon. If I am being honest I didn't think I'd get picked, so to get picked is obviously a big thing. bye.
I'd want the output to be:
obviously over the moon. If I am being honest I didn't think I'd get picked, so to get picked is obviously a big thing.
This however only works when I'm searching for a single word, so that specials = "I'm", for example. Do I need to add my words to a list or an array?
It looks like the problem is that str.replace expects a single string, not a tuple of strings, so you can't hand it all your special words at once; you need to loop through specials and replace them one at a time. Try this:
specials = ["i'm, "hello", "bye"]
for line in data:
new_line = str(line)
for word in specials:
new_line = str.replace(new_line, word, '')
writer.writerow(new_line.split(','))
It seems like you're already splitting the input via the csv.reader, but then you're throwing away all that goodness by turning the split line back into a string. It's best not to do this, but to keep working with the lists that are yielded from the csv reader. So, it becomes something like this:
for row in data:
    new_row = []  # A place to hold the processed row data.
    # look at each field in the row.
    for field in row:
        # remove all the special words.
        new_field = field
        for s in specials:
            new_field = new_field.replace(s, '')
        # add the sanitized field to the new "processed" row.
        new_row.append(new_field)
    # after all fields are processed, write it with the csv writer.
    writer.writerow(new_row)
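For reference, a complete version of the script with that loop dropped in might look like the following sketch (assuming the in.csv/out.csv names from the question; newline='' is the usual Python 3 idiom for the csv module):
import csv

specials = ("i'm", "hello", "bye")

with open('in.csv', 'r', newline='') as input_file, \
     open('out.csv', 'w', newline='') as output_file:
    reader = csv.reader(input_file)
    writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
    for row in reader:
        new_row = []
        for field in row:
            for s in specials:
                field = field.replace(s, '')
            new_row.append(field)
        writer.writerow(new_row)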

Python compare bombs if files not sorted

I have written some code to compare two files via a search string.
The file = master data file
The checkfile = list of states & regions
When I have more than one state in the file that is not in sorted order, it bombs out.
How can I get this to work without having to sort my "file"?
The error message:
Traceback (most recent call last):
  File "./gangnamstyle.py", line 27, in <module>
    csvLineList_2 = csv2[lineCount].split(",")
IndexError: list index out of range
My code:
#!/usr/bin/python
import csv

file = raw_input("Please enter the file name to search: ")  # file name
checkfile = raw_input("Please enter the file with the search data: ")  # data file
save_file = raw_input("Please enter the file name to save: ")  # save name
search_string = raw_input("Please type string to search for: ")  # search string
#row = raw_input("Please enter column text is in: ")  # column number - starts at 0
#ID_INDEX = row
#ID_INDEX = int(ID_INDEX)

f = open(file)
f1 = open(save_file, 'a')

csv1 = open(file, "r").readlines()
csv2 = open(checkfile, "r").readlines()

# what looks for the string in the file
copyline = False
for line in f.readlines():
    if search_string in line:
        copyline = True
    if copyline:
        f1.write(line)

for lineCount in range(len(csv1)):
    csvLineList_1 = csv1[lineCount].split(",")
    csvLineList_2 = csv2[lineCount].split(",")
    if search_string == csvLineList_2[0]:
        f1.write(csvLineList_2[2])

f1.close()  # close saved file
f.close()  # close source file
#csv1.close()
#csv2.close()
OK, so that error message is an IndexError: list index out of range in the line csvLineList_2 = csv2[lineCount].split(","). There's only one indexing happening there, so apparently lineCount is too big for csv2.
lineCount is one of the values of range(len(csv1)). That makes it automatically in range for csv1. Apparently csv1 and csv2 are not the same length, causing the IndexError.
Now that's quite possible, because they contain lines from different files. Apparently the files don't have an equal number of lines.
To be honest I have no clue why you are reading the lines into csv1 at all. You loop over those lines and split them (into the variable csvLineList_1), but you never use that variable.
I think your loop should just be:
for line in csv2:
    parts = line.strip().split(",")  # line.strip() removes whitespace and the newline
                                     # at the end of the line
    if search_string == parts[0]:
        f1.write(parts[2] + "\n")  # Add a newline, you probably want it
I hope this helps.
The error you're getting is probably due to the file lengths not being equal.
It's not exactly clear from what you've written, what you're hoping to do. It looks to me like (maybe) you want to find a search term in "master file", and if you find it, write the line you find to the "save file". It also looks to me like you want to find that same search term in the very first field of the "check file", and if you find it, write the contents of the third field into the "save file". If that's wrong, it's because your code has bugs.
Either way, there's a bunch of issues in the code you've posted, and you're probably going to get at least some mileage out of using the csv module to do what you're trying to do.
Maybe post a fuller problem description.
Edit:
import csv
import sys

def build_state_lookup(fn):
    with open(fn) as infile:
        reader = csv.reader(infile)
        # throw away the header line
        reader.next()
        # now build a dictionary mapping state to region
        lookup = {state: region for (state, _, region) in reader}
        return lookup

def process_big_file(in_fn, checkfile, out_fn):
    lookup = build_state_lookup(checkfile)  # pass the check file in here
    with open(in_fn) as infile:
        with open(out_fn, 'w') as outfile:
            reader = csv.reader(infile)
            writer = csv.writer(outfile)
            # output the header row
            writer.writerow(reader.next() + ['Region'])
            for row in reader:
                state = row[0]
                region = lookup.get(state, "No Region Found")
                row.append(region)
                writer.writerow(row)

def main():
    process_big_file(*sys.argv[1:])

if __name__ == '__main__':
    main()
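Assuming that structure, the script takes the three file names as command-line arguments in that order, for example:
./gangnamstyle.py masterfile.csv checkfile.csv output.csv
(The file names here are placeholders for your own.)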

Find a specific word in a file, get the content of the row, and save it in an array

I have a .xls file that I convert to .csv, and then I read this .csv until a specific line that contains the word clientegen, take that row, and put it into an array.
This is my code so far:
import xlrd
import csv

def main():
    print "Converts xls to csv and reads csv"
    wb = xlrd.open_workbook('ejemplo.xls')
    sh = wb.sheet_by_name('Hoja1')
    archivo_csv = open('fichero_csv.csv', 'wb')
    wr = csv.writer(archivo_csv, quoting=csv.QUOTE_ALL)
    for rownum in xrange(sh.nrows):
        wr.writerow(sh.row_values(rownum))
    archivo_csv.close()
    f = open('fichero_csv.csv', 'r')
    for lines in f:
        print lines

if __name__ == '__main__':
    main()
This prints:
[... a lot more stuff ...]
"marco 4","","","","","","","","","","","","","","",""
"","","","","","","","","","","","","","","",""
"","","","","","","","","","","","","","","",""
"clientegen","maier","embega","Jegan ","tapa pure","cil HUF","carcHUF","tecla NSS","M1 NSS","M2 nss","M3 nss","doble nss","tapon","sagola","clip volvo","pillar"
"pz/bast","33.0","40.0","34.0","26.0","80.0","88.0","18.0","16.0","8.0","6.0","34.0","252.0","6.0","28.0","20.0"
"bast/Barra","5.0","3.0","6.0","8.0","10.0","4.0","10.0","10.0","10.0","10.0","8.0","4.0","6.0","10.0","6.0"
[... a lot more stuff ...]
The thing I want to do is take that clientegen line and save the content of the row in a new string array named finalarray, for example:
finalarray = ["maier", "embega", "Jegan", "tapa pure", "cil HUF", "carcHUF", "tecla NSS", "M1 NSS", "M2 nss", "M3 nss", "doble nss", "tapon", "sagola", "clip volvo", "pillar"]
I'm not very familiar with reading files in Python, so I would like to know if someone could give me a hand to find that line, get those values, and put them into an array. Thanks in advance.
If you swap this for loop out for your for loop, it should do the trick:
for rownum in xrange(sh.nrows):
    row = sh.row_values(rownum)
    if row[0] == "clientegen":  # Check if "clientegen" is the first element of the row
        finalarray = row[1:]    # If so, save everything after the label as finalarray
    wr.writerow(row)
If there will ever be more than one "clientegen" line, we can adjust this code to save all of them.
If you are just looking for the line that contains clientegen, then you could try:
finalarray = list()
with open("fichero_csv.csv") as f:
    for line in f:  # loop through all the lines
        words = [w.strip().strip('"') for w in line.split(",")]  # split on commas and drop the surrounding quotes
        if "clientegen" in words:  # check to see if your word is in the list
            finalarray = words[1:]  # if so, put the values after the label in your finalarray
            break  # stop checking any further
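Since fichero_csv.csv was written with csv.writer and QUOTE_ALL, the csv module can also read it back and handle the quoting for you; a minimal sketch:
import csv

finalarray = []
with open("fichero_csv.csv") as f:
    for row in csv.reader(f):  # csv undoes the quoting, including embedded spaces
        if row and row[0] == "clientegen":
            finalarray = row[1:]  # everything after the "clientegen" label
            break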
