Difficulty reading data from a text file and converting to a float - python

UPDATE:
My problem was due to the input file having an odd encoding. Changing my opening statement to "open(os.path.join(root, 'Report.TXT'), 'r', encoding='utf-16')" fixed my problem
ORIGINAL TEXT
I'm trying to make a program that will allow me to more easily organize data from some lab equipment. This program recursively moves through folders, locates a file named Report.TXT, grabs a few numbers from it, and correctly organizes them in an excel file. There's a lot of irrelevant information from this file, so I need to grab only a specific part of it (e.g. line 56, characters 72-95).
Here's an example of a part of one of these Report.TXT files containing information I want to grab (under the ng/uL column):
RetTime Type Area Amt/Area Amount Grp Name
[min] [nRIU*s] [ng/ul]
-------|------|----------|----------|----------|--|------------------
4.232 BB 6164.18262 1.13680e-5 7.00746e-1 Compound1
5.046 BV 2.73487e5 1.34197e-5 36.70109 Compound2
5.391 VB 3.10324e5 1.34678e-5 41.79371 Compound3
6.145 - - - Compound4
7.258 - - - Compound5
8.159 - - - Compound6
11.092 BB 3447.12158 2.94609e-5 1.01555 Compound7
Totals : 80.21110
This is only a portion of the Report.TXT, the actual "Compound1" is on line 54 of the real file.
I've managed to form something that will grab these and insert it into an excel file as a string:
for rootdir in range(1,tdirs+1):
flask = 0
for root, subFolders, files in os.walk(str(rootdir)):
if 'Report.TXT' in files:
flask += 1
with open(os.path.join(root, 'Report.TXT'), 'r') as fin:
print(root)
for x in range(0,67):
line = fin.readline()
if x == 54:
if "-" in line[75:94]:
compound1 = 0
else:
compound1 = str(line[75:94].strip())
print(compound1)
datasheet.write(int(rootdir)+2,int(flask),compound1)
if x == 56:
if "-" in line[75:94]:
compound2 = 0
else:
compound2 = str(line[75:94].strip())
print(compound2)
datasheet.write(int(tdirs)+int(rootdir)+6,int(flask),compound2)
However, if I replace the str(line[75:94].strip()) with a float(line[75:94].strip()), then I get a cannot convert string to float error. The printing was just for my own troubleshooting but isn't seeming to give me any extra information.
Any ideas on what I can do to fix this?

converting to float is not such a good idea in this case.
since you are copying it to a delimited file, it doesnt matter if you convert to float or not. more over(the floating point issue in python suggest not to convert to float using the standard library float() method.
you will be better of writing the sting values since you would want to have you lab results accurate.
use numpy to convert the complex numbers to decimal if it is necessary.

Related

Clean text files - remove unwanted content in LOOP (R/python)

I want to clean all the "waste" (making the files unsuitable for analysis) in unstructured text-files.
In this specific situation, one option to only retain the wanted information, is to only retain all numbers above 250 (the text is a combination of string, numbers, ...)
For a large number of text files, I want to do follow action in R:
x <- x[which(x >= "250"),]
The code for 1 text file works perfectly (above), when I try to do the same in a loop (for the large N of text files, it fails (error: incorrect number of dimensions o)).
for(i in 1:length(files)){
i<- i[which(i >= "250"),]
}
Anyone any idea how to solve this in R (or python) ?
picture: very simplified example of a text file, I want to retain everything between (START) and (END)
This makes no sense if it is 10 K files, why are you even trying to do in R or python? Why not just a simple awk or bash command? Moreover, your images is parsing info between START and END from the text files, not sure if it is data frame with columns across ( try to put in a simple dput rather than images.)
All you are trying to do is a grep between start and end across 10 k files. I would do that in bash.
something like this in bash should work.
for i in *.txt
do
sed -n '/START/,/END/{//!p}' i > i.edited.txt
done
If the columns are standard across in R you can do the following ( But, I would not read 10 K files in R memory).
read the files as a list of dataframe Then simply do an lapply
a = data.frame(col1 = c(100,250,300))
b = data.frame(col1 = c(250,450,100,346))
c = data.frame(col1 = c(250,123,122,340))
df_list <- list(a = a ,b = b,c = c)
lapply(df_list, subset, col1 >= 250)

Compare all the CSV files in a folder and print duplicate rows

I have multiple CSV files in a folder, which I want to compare and print the matching rows (where the number of columns could be different). I know how to get duplicates within a file but this case is a little different. Let's say there are two files in a folder and I want to compare them.
CSV1:
H1,H2,H4
C01,23,F
C2,45,M
CSV2:
H1,H2,H3,H4
C01,23,data,F
C01,23,some other data,M
C4,34,data,M
I need my output to check if all the available data (from the one with the least number of columns) matches exactly in another file in the same folder. My output could be like
CSV1,CSV2 (H1:C01,H2:23,H4:F(H3:data))
What about something like:
def duplines(csv_least_cols, csv_most_cols):
rowset = set()
with open(csv_least_cols) as csv1:
r = csv.reader(csv1)
csv1_cols = next(r)
for row in r:
rowset.add(tuple(row))
with open(csv_most_cols) as csv2:
dr = csv.DictReader(csv2)
for drow in dr:
refcols = tuple(drow[c] for c in csv1_cols)
if refcols in rowset: yield csv1_cols, refcols, drow
You can call this in a loop and perform whatever formatting you want -- this generator deals with the underlying logic, separating out the formatting task to its caller.
So for example to get your peculiar desired CSV1,CSV2 (H1:C01,H2:23,H4:F(H3:data)) style output you could have...:
def formatit(csv_least, csv_most):
out_start = '{},{} ('.format(csv_least, csv_most)
for c1cols, refvals, c2dict in duplines(csv_least, csv_most):
out_middle = []
for c, v in zip(c1cols, refvals):
out_middle.append('{}:{}'.format(c, v))
out_end = []
for c in c2dict:
if c in c1cols: continue
out_end.append('{}:{}'.format(c, c2dict[c]))
out = '{}{}({}))'.format(out_start, ','.join(out_middle), ','.join(out_end))
print(out)
You'll notice that the formatting work is substantially more complex than the actual logic (and hence more likely to hide bugs:-) which is why I call your desired format "peculiar".
But I hope this can at least get you started (and you can try out each function separately, making sure the logic is as you desire it before worrying about the formatting:-).

Finding exon/ intron borders in a gene

I would like to go through a gene and get a list of 10bp long sequences containing the exon/intron borders from each feature.type =='mRNA'. It seems like I need to use compoundLocation, and the locations used in 'join' but I can not figure out how to do it, or find a tutorial.
Could anyone please give me an example or point me to a tutorial?
Assuming all the info in the exact format you show in the comment, and that you're looking for 20 bp on either side of each intro/exon boundary, something like this might be a start:
Edit: If you're actually starting from a GenBank record, then it's not much harder. Assuming that the full junction string you're looking for is in the CDS feature info, then:
for f in record.features:
if f.type == 'CDS':
jct_info = str(f.location)
converts the "location" information into a string and you can continue as below.
(There are ways to work directly with the location information without converting to a string - in particular you can use "extract" to pull the spliced sequence directly out of the parent sequence -- but the steps involved in what you want to do are faster and more easily done by converting to str and then int.)
import re
jct_info = "join{[0:229](+), [11680:11768](+), [11871:12135](+), [15277:15339](+), [16136:16416](+), [17220:17471](+), [17547:17671](+)"
jctP = re.compile("\[\d+\:\d+\]")
jcts = jctP.findall(jct_info)
jcts
['[0:229]', '[11680:11768]', '[11871:12135]', '[15277:15339]', '[16136:16416]', '[17220:17471]', '[17547:17671]']
Now you can loop through the list of start:end values, pull them out of the text and convert them to ints so that you can use them as sequence indexes. Something like this:
for jct in jcts:
(start,end) = jct.replace('[', '').replace(']', '').split(':')
try: # You need to account for going out of index, e.g. where start = 0
start_20_20 = seq[int(start)-20:int(start)+20]
except IndexError:
# do your alternatives e.g. start = int(start)

Re-writing a python program into VB, how to sort CSV?

About a year back, I wrote a little program in python that basically automates a part of my job (with quite a bit of assistance from you guys!) However, I ran into a problem. As I kept making the program better and better, I realized that Python did not want to play nice with excel, and (without boring you with the details suffice to say xlutils will not copy formulas) I NEED to have more access to excel for my intentions.
So I am starting back at square one with VB (2010 Express if it helps.) The only programming course I ever took in my life was on it, and it was pretty straight forward so I decided I'd go back to it for this. Unfortunately, I've forgotten much of what I had learned, and we never really got this far down the rabbit hole in the first place. So, long story short I am trying to:
1) Read data from a .csv structured as so:
41,332.568825,22.221759,-0.489714,eow
42,347.142926,-2.488763,-0.19358,eow
46,414.9969,19.932693,1.306851,r
47,450.626074,21.878299,1.841957,r
48,468.909171,21.362568,1.741944,r
49,506.227269,15.441723,1.40972,r
50,566.199838,17.656284,1.719818,r
51,359.069935,-11.773073,2.443772,l
52,396.321911,-8.711589,1.83507,l
53,423.766684,-4.238343,1.85591,l
2) Sort that data alphabetically by column 5
3) Then selecting only the ones with an "l" in column 5, sort THOSE numerically by column 2 (ascending order) AND copy them to a new file called coil.csv
4) Then selecting only the ones that have an "r" in column 5, sort those numerically by column 2 (descending order) and copy them to the SAME file coil.csv (appended after the others obviously)
After all of that hoopla I wish to get out:
51,359.069935,-11.773073,2.443772,l
52,396.321911,-8.711589,1.83507,l
53,423.766684,-4.238343,1.85591,l
50,566.199838,17.656284,1.719818,r
49,506.227269,15.441723,1.40972,r
48,468.909171,21.362568,1.741944,r
47,450.626074,21.878299,1.841957,r
46,414.9969,19.932693,1.306851,r
I realize that this may be a pretty involved question, and I certainly understand if no one wants to deal with all this bs, lol. Anyway, some full on code, snippets, ideas or even relevant links would be GREATLY appreciated. I've been, and still am googling, but it's harder than expected to find good reliable information pertaining to this.
P.S. Here is the piece of python code that did what I am talking about (although it created two seperate files for the lefts and rights which I don't really need) - if it helps you at all.
msgbox(msg="Please locate your survey file in the next window.")
mainfile = fileopenbox(title="Open survey file")
toponame = boolbox(msg="What is the name of the shots I should use for topography? Note: TOPO is used automatically",choices=("Left","Right"))
fieldnames = ["A","B","C","D","E"]
surveyfile = open(mainfile, "r")
left_file = open("left.csv",'wb')
right_file = open("right.csv",'wb')
coil_file = open("coil1.csv","wb")
reader = csv.DictReader(surveyfile, fieldnames=fieldnames, delimiter=",")
left_writer = csv.DictWriter(left_file, fieldnames + ["F"], delimiter=",")
sortedlefts = sorted(reader,key=lambda x:float(x["B"]))
surveyfile.seek(0,0)
right_writer = csv.DictWriter(right_file, fieldnames + ["F"], delimiter=",")
sortedrights = sorted(reader,key=lambda x:float(x["B"]), reverse=True)
coil_writer = csv.DictWriter(coil_file, fieldnames, delimiter=",",extrasaction='ignore')
for row in sortedlefts:
if row["E"] == "l" or row["E"] == "cl+l":
row['F'] = '%s,%s' % (row['B'], row['D'])
left_writer.writerow(row)
coil_writer.writerow(row)
for row in sortedrights:
if row["E"] == "r":
row['F'] = '%s,%s' % (row['B'], row['D'])
right_writer.writerow(row)
coil_writer.writerow(row)
One option you have is to start with a class to hold the fields. This allows you to override the ToString method to facilitate the output. Then, it's a fairly simple matter of reading each line and assigning the values to a list of the class. In your case you'll want the extra step of making 2 lists sorting one descending and combining them:
Class Fields
Property A As Double = 0
Property B As Double = 0
Property C As Double = 0
Property D As Double = 0
Property E As String = ""
Public Overrides Function ToString() As String
Return Join({A.ToString, B.ToString, C.ToString, D.ToString, E}, ",")
End Function
End Class
Function SortedFields(filename As String) As List(Of Fields)
SortedFields = New List(Of Fields)
Dim test As New List(Of Fields)
Dim sr As New IO.StreamReader(filename)
Using sr As New IO.StreamReader(filename)
Do Until sr.EndOfStream
Dim fieldarray() As String = sr.ReadLine.Split(","c)
If fieldarray.Length = 5 AndAlso Not fieldarray(4)(0) = "e"c Then
If fieldarray(4) = "r" Then
test.Add(New Fields With {.A = Double.Parse(fieldarray(0)), .B = Double.Parse(fieldarray(1)), .C = Double.Parse(fieldarray(2)), .D = Double.Parse(fieldarray(3)), .E = fieldarray(4)})
Else
SortedFields.Add(New Fields With {.A = Double.Parse(fieldarray(0)), .B = Double.Parse(fieldarray(1)), .C = Double.Parse(fieldarray(2)), .D = Double.Parse(fieldarray(3)), .E = fieldarray(4)})
End If
End If
Loop
End Using
SortedFields = SortedFields.OrderBy(Function(x) x.B).Concat(test.OrderByDescending(Function(x) x.B)).ToList
End Function
One simple way of writing the data to a csv file is to use the IO.File.WriteAllLines methods and the ConvertAll method of the List:
IO.File.WriteAllLines(" coil.csv", SortedFields("textfile1.txt").ConvertAll(New Converter(Of Fields, String)(Function(x As Fields) x.ToString)))
You'll notice how the ToString method facilitates this quite easily.
If the class will only be used for this you do have the option to make all the fields string.

What's wrong with my Python for loop?

I have two files open, EQE_data and Refl_data. I want to take each line of EQE_data, which will have eight tab-delimited columns, and find the line in Refl_data which corresponds to it, then do the data analysis and write the results to output. So for each line in EQE_data, I need to search the entire Refl_data until I find the right one. This code is successful the first time, but it is outputting the same results for the Refl_data every subsequent time. I.e., I get the correct columns for Wav1 and QE, but it seems to only be executing the nested for loop once, so I get the same R, Abs, IQE, which is correct for the first row, but incorrect thereafter.
for line in EQE_data:
try:
EQE = line.split("\t")
Wav1, v2, v3, QE, v5, v6, v7, v8 = EQE
for line in Refl_data:
Refl = line.split("\t")
Wav2, R = Refl
if float(Wav2) == float(Wav1):
Abs = 1 - (float(R) / 100)
IQE = float(QE) / Abs
output.write("%d\t%f\t%f\t%f\t%f\n" % (int(float(Wav1)), float(QE), float(R) / 100, Abs, IQE))
except:
pass
If Refl_data is a file, you need to put the read pointer back to the beginning in each loop (using Refl_data.seek(0)), or just re-open the file.
Alternatively, read all of Refl_data into a list first and loop over that list instead.
Further advice: use the csv module for tab-separated data, and don't ever use blank try:-except:; always only catch specific exceptions.

Categories

Resources