Python script not looping correctly - python

I am using this Python code to look through a CSV file, which has dates in one column and values in the other. I want to record the minimum value for each year, but my code is not looping through correctly. What's my stupid mistake? Cheers
import csv
refMin = 40
with open('data.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',',quotechar='|', quoting=csv.QUOTE_ALL)
    for i in range(1968,2014):
        for row in reader:
            if str(row[0])[:4] == str(i):
                if float(row[1]) <= refMin:
                    refMin = float(row[1])
        print 'The minimum value for ' + str(i) + ' is: ' + str(refMin)

The reader can only be iterated once. The first time around the for i in range(1968,2014) loop, you consume every item in the reader. So the second time around that loop, there are no items left.
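A minimal sketch of that exhaustion. It uses an in-memory file (io.StringIO) with made-up rows in place of data.csv, so the sample data is an assumption, not the asker's file:

```python
import csv
import io

# Made-up sample standing in for data.csv (an assumption for illustration)
sample = io.StringIO("1968-01-01,5\n1968-06-01,3\n1969-01-01,7\n")
reader = csv.reader(sample)

first_pass = [row for row in reader]   # consumes every row
second_pass = [row for row in reader]  # the reader is already exhausted

print(len(first_pass))   # 3
print(len(second_pass))  # 0
```

The second loop body never runs, which is why only the first year in the question's outer loop ever sees any rows.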
If you want to compare every value of i against every row in the file, you could swap your loops around, so that the loop for row in reader is on the outside and only runs once, with multiple runs of the i loop instead. Or you could create a new reader each time round, although that might be slower.
If you want to process the entire file in one pass, you'll need to create a dictionary of values to replace refMin. When processing each row, either iterate through the dictionary keys, or look it up based on the current row. On the other hand, if you're happy to read the file multiple times, just move the line reader = csv.reader(...) inside the outer loop.
Here's an untested idea for doing it in one pass:
import csv
import collections

# Default of 40 mirrors the starting refMin in the question
refMin = collections.defaultdict(lambda: 40)
allowed_years = set(str(i) for i in range(1968, 2014))
with open('data.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_ALL)
    for row in reader:
        year = str(row[0])[:4]
        if year not in allowed_years:
            continue  # skip header lines and years outside the range
        if float(row[1]) <= refMin[year]:
            refMin[year] = float(row[1])
for year in range(1968, 2014):
    print('The minimum value for ' + str(year) + ' is: ' + str(refMin[str(year)]))
defaultdict is just like a regular dictionary except that it has a default value for keys that haven't previously been set.

I would refactor that to read the file only once:
import csv
from collections import defaultdict

refByYear = defaultdict(list)
with open('data.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_ALL)
    for row in reader:
        refByYear[str(row[0])[:4]].append(float(row[1]))
for year in range(1968, 2014):
    values = refByYear.get(str(year))
    if values:  # min() raises ValueError on an empty list, so skip years with no rows
        print('The minimum value for ' + str(year) + ' is: ' + str(min(values)))
Here I store all values for each year, which may be useful for other purposes, or totally useless.

Related

How to dynamically join two CSV files?

I have two csv files and I was thinking about combining them via python - to practice my skill, and it turns out much more difficult than I ever imagined...
A simple conclusion of my problem: I feel like my code should be correct but the edited csv file turns out not to be what I thought.
One file, which I named as chrM_location.csv is the file that I want to edit.
The first file looks like this
The other file, named chrM_genes.csv is the file that I take reference at.
The second file looks like this:
There are a few other columns but I'm not using them at the moment. The first few rows have subject "CDS", then there is a blank row, followed by a few other rows with subject "exon", then another blank row, followed by some "genes" rows (and a few others).
What I want to do is read the first file row by row, take the number in the second column (42 for row 1, not counting the header), and check whether it falls within the range given by columns 4 and 5 of the second file (also read row by row). If it does, I record the information from that matching row and paste it at the end of the row in the first file; if not, I skip it.
Below is my code. I set out to run everything through the CDS section first, so I wrote a function refcds(). It returns:
whether or not the value is in range;
if in range, a list of the information I want to paste into the file I am editing.
Everything works fine for the main part of the code: the list final[] contains all the information for that row, so supposedly I only need to paste it onto that row and overwrite what was there before. I used print(final) to check the info and it looks like just what I want.
but this is what the result looks like:
I have no idea why a new row is inserted and why some rows are pasted together here, when column 2 is supposed to be sorted from small to large.
Similar things happen in other places as well.
Thank you so much for your help! I'm running out of ideas... No error messages are given and I can't figure out what went wrong.
import csv
from csv import reader
from csv import writer
mylist=[]
a=0
final=[]
def refcds(value):
    mylist=[]
    with open("chrM_genes.csv", "r") as infile:
        r = csv.reader(infile)
        for rows in r:
            for i in range(0,12):
                if value >= rows[3] and value <= rows[4]:
                    mylist = ["CDS",rows[3],rows[4],int(int(value)-int(rows[3])+1)]
                    return 0, mylist
                else:
                    return 1,[]
with open('chrM_location.csv','r+') as myfile:
    csv_reader = csv.reader(myfile)
    csv_writer = csv.writer(myfile)
    for row in csv_reader:
        if (row[1]) != 'POS':
            final=[]
            a,mylist = refcds(row[1])
            if a==0:
                lista=[row[0],row[1],row[2],row[3],row[4],row[5]]
                final.extend(lista)
                final.extend(mylist),
                csv_writer.writerow(final)
            if a==1:
                pass
        if (row[1]) == 'END':
            break
    myfile.close()
If I understand correctly - your code is trying to read and write to the same file at the same time.
csv_reader = csv.reader(myfile)
csv_writer = csv.writer(myfile)
I haven't tried your code: but I'm pretty sure this is going to cause weird stuff to happen... (If you refactor and output to a third file - do you still see the same issue?)
I think the problem is that you have your reader and writer set to the same file—I have no idea what that does. A much cleaner solution is to accumulate your modified rows in the read loop, then once you're out of the read loop (and have closed the file), open the same file for writing (not appending) and write your accumulated rows.
I've made the one big change that fixes the problem.
You also said you were trying to improve your Python, so I made some other changes that are more pythonic.
import csv

# Return a matched list, or return None
def refcds(value):
    pos = int(value)
    with open('chrM_genes.csv', 'r', newline='') as infile:
        reader = csv.reader(infile)
        for row in reader:
            if len(row) < 5:
                continue  # skip the blank rows between sections
            # Compare as integers (assumes chrM_genes.csv has no header row);
            # comparing the raw strings would order them character by character
            if int(row[3]) <= pos <= int(row[4]):
                computed = pos - int(row[3]) + 1
                mylist = ['CDS', row[3], row[4], computed]
                return mylist
    return None  # if we get here, every row was evaluated and none matched

# Accumulate rows here
final_rows = []
with open('chrM_location.csv', 'r', newline='') as myfile:
    reader = csv.reader(myfile)
    # next(reader)  ## if you know your file has a header
    for row in reader:
        # Show unusual conditions first...
        if row[1] == 'POS':
            continue  # skip header??
        if row[1] == 'END':
            break
        # ...and if not met, do desired work
        mylist = refcds(row[1])
        if mylist is not None:
            # no need to declare an empty list and then extend it
            # just create it with initial items...
            final = row[0:6]  # slice notation: indices 0 through 5 (the end index is exclusive)
            final.extend(mylist)
            final_rows.append(final)

# Write accumulated rows here
with open('final.csv', 'w', newline='') as finalfile:
    writer = csv.writer(finalfile)
    writer.writerows(final_rows)
I also tried to figure out the whole thing, and came up with the following...
I think you want to look up rows of chrM_genes by Subject and compare a POS (from chrM_location) against the Start and End bounds for each gene. If POS is within the range of Start and End, return the chrM_genes data and fill in some empty cells already in chrM_location.
My first step would be to create a data structure from chrM_genes, since we'll be reading from that over and over again. Reading a bit into your problem, I can see the need to "filter" the results by subject ('CDS','exon', etc...), but I'm not sure of this. Still, I'm going to index this data structure by subject:
import csv
from collections import defaultdict

# This will create a dictionary, where subject will be the key
# and the value will be a list (of chrM (gene) rows)
chrM_rows_by_subject = defaultdict(list)

# Fill the data structure
with open('chrM_genes.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # read (skip) header
    subject_col = 2
    for row in reader:
        # you mentioned empty rows, that divide subjects, so skip empty rows
        if row == []:
            continue
        subject = row[subject_col]
        chrM_rows_by_subject[subject].append(row)
I mocked up chrM_genes.csv (and added a header, to try and clarify the structure):
Col1,Col2,Subject,Start,End
chrM,ENSEMBL,CDS,3307,4262
chrM,ENSEMBL,CDS,4470,5511
chrM,ENSEMBL,CDS,5904,7445
chrM,ENSEMBL,CDS,7586,8266
chrM,ENSEMBL,exon,100,200
chrM,ENSEMBL,exon,300,400
chrM,ENSEMBL,exon,700,750
Just printing the data structure to get an idea of what it's doing:
import pprint
pprint.pprint(chrM_rows_by_subject)
yields:
defaultdict(<class 'list'>,
{'CDS': [['chrM', 'ENSEMBL', 'CDS', '3307', '4262'],
['chrM', 'ENSEMBL', 'CDS', '4470', '5511'],
...
],
'exon': [['chrM', 'ENSEMBL', 'exon', '100', '200'],
['chrM', 'ENSEMBL', 'exon', '300', '400'],
...
],
})
Next, I want a function to match a row by subject and POS:
# Return a row that matches `subject` with `pos` between Start and End; or return None.
def match_gene_row(subject, pos):
    rows = chrM_rows_by_subject[subject]
    pos = int(pos)
    start_col = 3
    end_col = 4
    for row in rows:
        # convert to int so the comparison is numeric, not character by character
        start = int(row[start_col])
        end = int(row[end_col])
        if pos >= start and pos <= end:
            # return just the data we want...
            return row
    # or return nothing at all
    return None
If I run these commands to test:
print(match_gene_row('CDS', '42'))
print(match_gene_row('CDS', '4200'))
print(match_gene_row('CDS', '7586'))
print(match_gene_row('exon', '500'))
print(match_gene_row('exon', '399'))
With the integer comparison above, I get:
None # CDS: 42
['chrM', 'ENSEMBL', 'CDS', '3307', '4262']
['chrM', 'ENSEMBL', 'CDS', '7586', '8266']
None # exon: 500
['chrM', 'ENSEMBL', 'exon', '300', '400']
Read chrM_location.csv, and build a list of rows with matching gene data.
final_rows = []  # accumulate all rows here, for writing later
with open('chrM_location.csv', newline='') as f:
    reader = csv.reader(f)
    # Modify header
    header = next(reader)
    header.extend(['CDS','Start','End','cc'])
    final_rows.append(header)
    # Read rows and match to genes
    pos_column = 1
    for row in reader:
        pos = row[pos_column]
        matched_row = match_gene_row('CDS', pos)  # hard-coded to CDS
        if matched_row is not None:
            subj, start, end = matched_row[2:5]
            computed = str(int(pos)-int(start)+1)  # this is coming out negative??
            row.extend([subj, start, end, computed])
            final_rows.append(row)
Finally, write.
with open('final.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(final_rows)
I mocked up chrM_location.csv:
name,POS,id,Ref,ALT,Frequency
chrM,41,.,C,T,0.002498
chrM,42,rs377245343,T,TC,0.001562
chrM,55,.,TA,T,0.00406
chrM,55,.,T,C,0.001874
When I run the whole thing, I get a final.csv that looks like this:
name,POS,id,Ref,ALT,Frequency,CDS,Start,End,sequence_cc
chrM,41,.,C,T,0.002498,CDS,3307,4262,-3265
chrM,42,rs377245343,T,TC,0.001562,CDS,3307,4262,-3264
chrM,55,.,TA,T,0.00406,CDS,4470,5511,-4414
chrM,55,.,T,C,0.001874,CDS,4470,5511,-4414
I put this all together in a Gist.
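The negative values in the last column are what results when POS is compared with Start and End as the strings csv.reader returns, rather than as integers: position 42 then appears to fall inside 3307 to 4262. A small sketch using values from the mocked-up files above (the variable names are only for illustration):

```python
# Values read by csv.reader are strings, so >= and <= compare character by character
pos, start, end = '42', '3307', '4262'

# '3' < '4' makes '3307' <= '42' true, and '42' is a prefix of '4262'
as_strings = start <= pos <= end
# Numerically, 42 is far below 3307
as_numbers = int(start) <= int(pos) <= int(end)

print(as_strings)                 # True
print(as_numbers)                 # False
print(int(pos) - int(start) + 1)  # -3264
```

Converting both sides with int() before comparing avoids the false matches, at the cost of having to skip any header or blank rows first.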

Get readings from duplicate names in CSV file Python

I am fairly new to Python and am having some issues reading in my CSV file. Each row has a sensor name, a datestamp and a reading. The same sensor name appears on many rows, so I have already built a list of the distinct names, called OPTIONS, shown below
OPTIONS = []
with open('sensor_data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter = ',')
    for row in reader:
        if row[0] not in OPTIONS:
            OPTIONS.append(row[0])
        sensor_name = row[0]
        datastamp = row[1]
        readings = float(row[2])
print(OPTIONS)
OPTIONS prints fine,
But now I am having issues retrieving any readings, and using them to calculate average and maximum readings for each unique sensor name.
here are a few lines of sensor_data.csv, which goes from 2018-01-01 to 2018-12-31 for sensor_1 to sensor_25.
Any help would be appreciated.
What you have for the readings variable is just the reading of each row. One way to get the average readings is to keep track of the sum and count of readings (sum_readings and count_readings respectively) and then after the for loop you can get the average by dividing the sum with the count. You can get the maximum by initializing a max_readings variable with a reading minimum value (I assume to be 0) and then update the variable whenever the current reading is larger than max_readings (max_readings < readings)
import csv
OPTIONS = []
OPTIONS_READINGS = {}
# text mode: in Python 3 the csv module needs str rows, not bytes
with open('sensor_data.csv', 'r') as f:
    reader = csv.reader(f, delimiter = ',')
    for row in reader:
        if row[0] not in OPTIONS:
            OPTIONS.append(row[0])
            OPTIONS_READINGS[row[0]] = []
        sensor_name = row[0]
        datastamp = row[1]
        readings = float(row[2])
        OPTIONS_READINGS[row[0]].append(readings)
print(OPTIONS)
for option in OPTIONS_READINGS:
    print(option)
    readings = OPTIONS_READINGS[option]
    print('Max readings:', max(readings))
    print('Average readings:', sum(readings) / len(readings))
Edit: Sorry, I misread the question. If you want the maximum and average for each unique option, a more straightforward way is to use an additional dictionary, OPTIONS_READINGS, whose keys are the option names and whose values are the lists of readings. You can then find the maximum and average reading of an option with the expressions max(OPTIONS_READINGS[option]) and sum(OPTIONS_READINGS[option]) / len(OPTIONS_READINGS[option]) respectively.
A shorter version below
import csv
from collections import defaultdict

readings = defaultdict(list)
with open('sensor_data.csv', 'r') as f:
    reader = csv.reader(f, delimiter = ',')
    for row in reader:
        readings[row[0]].append(float(row[2]))
for sensor_name, values in readings.items():
    print('Sensor: {}, Max readings: {}, Avg: {}'.format(sensor_name, max(values), sum(values) / len(values)))
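The first answer also describes keeping a running sum, count and maximum instead of storing every reading. A minimal sketch of that idea per sensor, assuming the sensor,datestamp,reading column order from the question and using made-up rows held in memory (the sample data and the name stats are illustrative only):

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the sensor,datestamp,reading layout described in the question
sample = io.StringIO(
    "sensor_1,2018-01-01,10.0\n"
    "sensor_2,2018-01-01,4.0\n"
    "sensor_1,2018-01-02,20.0\n"
)

# Per-sensor running [sum, count, max]; max starts at -inf so negative readings work
stats = defaultdict(lambda: [0.0, 0, float('-inf')])
for name, _datestamp, value in csv.reader(sample):
    reading = float(value)
    entry = stats[name]
    entry[0] += reading
    entry[1] += 1
    entry[2] = max(entry[2], reading)

for name, (total, count, maximum) in sorted(stats.items()):
    print(name, 'max:', maximum, 'avg:', total / count)
```

This uses constant memory per sensor, which may matter for a year of readings, but it gives up the ability to compute other statistics later from the stored lists.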

Difficulties Iterating over CSV file in Python

I'm trying to add up all the values in a given row in a CSV file with Python but have had a number of difficulties doing so.
Here is the closest I've come:
from csv import reader
with open("simpleData.csv") as file:
    csv_reader = reader(file)
    for row in csv_reader:
        total = 0
        total = total + int(row[1])
    print(total)
Instead of yielding the sum of all the values in row[1], the final print statement is yielding only the last number in the row. What am I doing incorrect?
I've also stumbled with bypassing the header (the next() that I've seen widely used in other examples on SO seem to be from Python 2, and this method no longer plays nice in P3), so I just manually, temporarily changed the header for that column to 0.
Any help would be much appreciated.
it seems you are resetting the total variable to zero on every iteration.
To fix it, move the variable initialization to outside the for loop, so that it only happens once:
total = 0
for row in csv_reader:
    total = total + int(row[1])
from csv import reader
with open("simpleData.csv") as file:
    csv_reader = reader(file)
    total = 0
    for row in csv_reader:
        total = total + int(row[1])
    print(total)
total should be moved outside the for loop.
Indentation is important in Python; e.g. the import line should start at the left-most column.
You are resetting your total, try this:
from csv import reader
with open("simpleData.csv") as file:
    csv_reader = reader(file)
    total = 0
    for row in csv_reader:
        total = total + int(row[1])
    print(total)
As others have already stated, you are setting the value of total on every iteration. You can move total = 0 outside of the loop or, alternatively, use sum:
from csv import reader
with open("simpleData.csv") as file:
    csv_reader = reader(file)
    # row[1] is the column being summed in the question
    total = sum(int(row[1]) for row in csv_reader)
print(total)
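On skipping the header: the built-in next() function works on a csv reader in Python 3; it is the reader.next() method spelling that only exists in Python 2. A small sketch with made-up contents standing in for simpleData.csv (the sample rows are an assumption):

```python
import csv
import io

# Made-up contents standing in for simpleData.csv
sample = io.StringIO("name,value\na,1\nb,2\nc,3\n")
csv_reader = csv.reader(sample)

header = next(csv_reader)  # built-in next(); reader.next() is the Python 2-only spelling
total = sum(int(row[1]) for row in csv_reader)

print(header)  # ['name', 'value']
print(total)   # 6
```

With the header consumed this way there is no need to edit the file and replace the column title with 0.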

Make a new list from CSV

So, I've searched for a way to show a certain CSV field based on input, and I've tried to apply the code to my program. The problem is that I want to take certain items from the matching CSV row and build a new list from them, starting at a certain index.
I have csv file like this:
code,place,name1,name2,name3,name4
001,Home,Laura,Susan,Ernest,Toby
002,Office,Jack,Rachel,Victor,Wanda
003,Shop,Paulo,Roman,Brad,Natali
004,Other,Charles,Matthew,Justin,Bono
At first I had this code, and it works, showing the whole row:
import csv
number = input('Enter number to find\n')
csv_file = csv.reader(open('residence.csv', 'r'), delimiter=",")
for row in csv_file:
    if number == row[0]:
        print (row)
**input : 001
**result : [001, Home, Laura, Susan, Ernest, Toby]
Then I tried to add certain items from the matching row to a new list, but it didn't work. Here's the code:
import csv
res = []
y = 2
number = input('Enter number to find\n')
csv_file = csv.reader(open('residence.csv', 'r'), delimiter=",")
for row in csv_file:
    if number == row[0]:
        while y <= 5:
            res.append(str(row[y]))
            y = y+1
print (res)
**input : 001
**expected result : [Laura, Susan, Ernest, Toby]
I want to make a new list that contains row name1, name2, name3, and name4, and then I want to print the list. But I guess the loop is wrongly placed or I missed something.
There are a couple of things you could fix in your code.
1. You are not skipping the header line when iterating through the rows. This means you will not always match an actual row number.
2. Your y variable is not re-initialized. It would be more idiomatic to use a for loop instead of a while anyhow.
3. If more than one row matches, it will break (see 2.). If you know you will never have more than one match, you should break after you append the values to the list.
4. Your file is never closed. Also, it should be opened with newline='' (see the csv module docs).
5. Lastly, you match the actual string ('001') vs. an integer (1), which could be the source of confusion when entering the input.
An updated version:
import csv
res = []
number = input('Enter number to find\n')
with open('residence.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=",")
    next(csv_reader)  # Skip header line
    for row in csv_reader:
        if number == row[0]:
            for i in range(2, 6):
                res.append(str(row[i]))
            break
print(res)
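As an alternative to the index loop, a list slice picks out the same four fields in one step. A sketch using the sample rows from the question held in memory; the helper name names_for is hypothetical, not part of the original code:

```python
import csv
import io

# The sample rows from the question, held in memory instead of residence.csv
sample = io.StringIO(
    "code,place,name1,name2,name3,name4\n"
    "001,Home,Laura,Susan,Ernest,Toby\n"
    "002,Office,Jack,Rachel,Victor,Wanda\n"
)

def names_for(code, csv_file):
    """Return the name1..name4 fields of the first row whose code matches."""
    csv_reader = csv.reader(csv_file)
    next(csv_reader)  # skip header
    for row in csv_reader:
        if row[0] == code:
            return row[2:6]  # slice replaces the index loop; end index is exclusive
    return []

print(names_for('001', sample))  # ['Laura', 'Susan', 'Ernest', 'Toby']
```

Returning from the function on the first match plays the same role as the break in the updated version above.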

How to create a new column in a CSV file using Python by shifting one row

I have CSV file like below. It is huge file with thousands of records.
input.csv
No;Val;Rec;CSR
0;10;1;1200
0;100;2;1300
0;100;3;1300
0;100;4;1400
0;10;5;1200
0;11;6;1200
I want to create output.csv by adding a new column "PSR" after the first column "No". The value of this column depends on the "CSR" value. For the first row, "PSR" shall be zero. From the next record onwards, it depends on the "CSR" value in the previous row: if the present and previous records have the same CSR value, "PSR" shall be zero; if not, PSR shall take the previous CSR value. For example, the CSR value in the 2nd row is 1300, which differs from the value in the 1st record (1200), so the PSR value for the 2nd row shall be 1200. In the 2nd and 3rd rows the CSR value is the same, so the PSR value for the 3rd row shall be zero. So the new PSR value depends on the CSR value in the present and previous rows.
Output.csv
No;PCR;Val;Rec;CSR
0;0;10;1;1200
0;1200;100;2;1300
0;0;100;3;1300
0;1300;100;4;1400
0;1400;10;5;1200
0;0;11;6;1200
My Approach:
Use csv.reader and iterate over the rows as a list. Copy the 5th column to the 2nd column in the list, then shift it one row down.
Then check the values in the 2nd and 5th columns (PCR and CSR); if both values are the same, replace the PCR value with zero.
I have a problem getting the 1st step coded: I am able to duplicate the column but not able to shift it. The 2nd step is quite straightforward.
Also, I am not sure whether this approach is correct. Any pointers or recommendations would be really helpful.
Note: I am not able to install pandas on CentOS, so help without this module would be better.
My Code:
with open('input.csv', 'r') as input, open('output.csv', 'w') as output:
    reader = csv.reader(input, delimiter = ';')
    writer = csv.writer(output, delimiter = ';')
    mylist = []
    header = next(reader)
    mylist.append(header)
    for rec in reader:
        mylist.append(rec)
        rec.insert(1, rec[3])
        mylist.append(rec)
    writer.writerows(mylist)
If you're open to non-Python solutions, then awk could be a good option:
awk 'NR==1{$2="PSR;"$2}NR>1{$2=($4==a?0";"$2:+a";"$2);a=$4}1' FS=';' OFS=';' file
No;PSR;Val;Rec;CSR
0;0;10;1;1200
0;1200;100;2;1300
0;0;100;3;1300
0;1300;100;4;1400
0;1400;10;5;1200
0;0;11;6;1200
Awk is distributed with pretty much all Linux distributions and was designed exactly for this kind of task. It will blaze through your file. Add a redirection to the end > output.csv to save the output in a file.
A simple python approach using the same logic:
#!/usr/bin/env python
last = "0"
with open('input.csv') as csv:
    # insert the new column name after the first field of the header
    print(next(csv).strip().replace(';', ';PSR;', 1))
    for line in csv:
        field = line.strip().split(';')
        if field[3] == last:
            field.insert(1, "0")
        else:
            field.insert(1, last)
        last = field[4]  # CSR moved to index 4 after the insert
        print(';'.join(field))
Produces the same output:
$ python parse.py
No;PSR;Val;Rec;CSR
0;0;10;1;1200
0;1200;100;2;1300
0;0;100;3;1300
0;1300;100;4;1400
0;1400;10;5;1200
0;0;11;6;1200
Again just redirect the output to save it:
$ python parse.py > output.csv
Just code it as you explained it. Store the previous CSR and refer to it on the next loop through; just be sure to update it.
import csv
with open('input.csv', 'r') as input, open('output.csv', 'w') as output:
    reader = csv.reader(input, delimiter = ';')
    writer = csv.writer(output, delimiter = ';')
    mylist = []
    header = next(reader)
    header.insert(1, 'PCR')  # the new column goes into the header row itself
    mylist.append(header)
    prev_csr = None
    for rec in reader:
        csr = rec[3]
        # 0 for the first row or an unchanged CSR, otherwise the previous CSR
        rec.insert(1, '0' if prev_csr in (None, csr) else prev_csr)
        mylist.append(rec)
        prev_csr = csr
    writer.writerows(mylist)
import csv
with open('input.csv', 'r') as input, open('output.csv', 'w') as output:
    reader = csv.reader(input, delimiter = ';')
    writer = csv.writer(output, delimiter = ';')
    header = next(reader)
    header.insert(1, 'PCR')
    writer.writerow(header)
    prevRow = next(reader)
    prevRow.insert(1, '0')
    writer.writerow(prevRow)
    for row in reader:
        if prevRow[-1] == row[-1]:
            val = '0'
        else:
            val = prevRow[-1]
        row.insert(1,val)
        prevRow = row
        writer.writerow(row)
Or, even easier using the DictReader and DictWriter capabilities of csv:
from csv import DictReader, DictWriter

input_header = ['No', 'Val', 'Rec', 'CSR']
output_header = ['No', 'PCR', 'Val', 'Rec', 'CSR']
# Python 3: open in text mode with newline='' as the csv docs recommend
with open('input.csv', newline='') as in_file, open('output.csv', 'w', newline='') as out_file:
    in_reader = DictReader(in_file, input_header, delimiter=';')
    out_writer = DictWriter(out_file, output_header, delimiter=';')
    next(in_reader)  # skip the header
    out_writer.writeheader()  # place the output header
    last_csr = None
    for row in in_reader:
        current_csr = row['CSR']
        # 0 for the first row or an unchanged CSR, otherwise the previous CSR
        row['PCR'] = 0 if last_csr in (None, current_csr) else last_csr
        last_csr = current_csr
        out_writer.writerow(row)
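The rule from the question can also be checked in isolation, without any file handling. A minimal sketch, where the function name previous_csr_column is hypothetical and the CSR values are the six from the sample input:

```python
def previous_csr_column(csr_values):
    """Return '0' for the first row or an unchanged CSR, else the previous CSR."""
    result = []
    previous = None
    for current in csr_values:
        if previous is None or current == previous:
            result.append('0')
        else:
            result.append(previous)
        previous = current
    return result

csr = ['1200', '1300', '1300', '1400', '1200', '1200']
print(previous_csr_column(csr))
# ['0', '1200', '0', '1300', '1400', '0']
```

The result matches the second column of the expected Output.csv in the question, so the same comparison can be dropped into any of the reader/writer loops above.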
