I have to create a program for my class that reads a file, converts the lists of numbers within to floats, adds them all together, and prints only the answer to the screen.
The farthest I've gotten is:
import sys

fname = sys.argv[1]
handle = open(fname, "r")
total = 0
for line in handle:
    linearr = line.split()
    for item in linearr:
        item = float(item)
One of the files looks like:
0.13 10.2 15.8193
0.09 99.6
100.1
100.2 17.8 56.33 12
19e-2 7.5
Trying to add the converted list to the total (total += item) has not worked. I'm really lost and would greatly appreciate any assistance.
You are almost there. total += item is the correct approach; add that line inside your inner for loop, right after the conversion to float.
Make sure to print the result at the end with print(total) as well, you probably forgot that too.
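Putting it together, a minimal sketch of the finished program (your code plus those two lines):

import sys

fname = sys.argv[1]
handle = open(fname, "r")
total = 0
for line in handle:
    linearr = line.split()
    for item in linearr:
        item = float(item)
        total += item  # accumulate each converted value
handle.close()
print(total)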
For your test file this gives me the result 419.9593.
You can use a generator expression with sum, splitting the lines into lists and casting each subelement to float:
In [9]: cat test.txt
0.13 10.2 15.8193
0.09 99.6
100.1
100.2 17.8 56.33 12
19e-2 7.5
In [10]: with open("test.txt") as f:
   ....:     sm = sum(float(s) for row in map(str.split, f) for s in row)
   ....:
In [11]: sm
Out[11]: 419.9593
You can also combine with itertools.chain to flatten the rows:
In [1]: from itertools import chain
In [2]: with open("test.txt") as f:
   ...:     sm = sum(map(float, chain(*map(str.split, f))))
   ...:
In [3]: sm
Out[3]: 419.9593
On a side note, you should always use with to open your files; it will automatically close them for you.
I have multiple text files that contain multiple lines of floats, and each line has two floats separated by white space, like this: 1.123 456.789123. My task is to sum the floats that come after the white space (the second one on each line), line by line across all the text files. This has to be done for every line position. For example, if I have 3 text files:
1.213 1.1
23.33 1

0.123 2.2
23139 0

30.3123 3.3
44.4444 444
Now the sum of the numbers on the first lines should be 1.1 + 2.2 + 3.3 = 6.6. And the sum of the numbers on the second lines should be 1 + 0 + 444 = 445. I tried something like this:
def foo(folder_path):
    contents = os.listdir(folder_path)
    for file in contents:
        path = os.path.join(folder_path, file)
        with open(path, "r") as data:
            rows = data.readlines()
            for row in rows:
                value = row.split()
                second_float = float(value[1])
    return sum(second_float)
When I run my code I get this error: TypeError: 'float' object is not iterable. I've been pulling my hair out over this and don't know what to do. Can anyone help?
Here is how I would do it:
def open_file(file_name):
    with open(file_name) as f:
        for line in f:
            yield line.strip().split()  # Remove the newline and split on spaces
files = ('text1.txt', 'text2.txt', 'text3.txt')
result = list(zip(*(open_file(f) for f in files)))
print(*result, sep='\n')
# result is now equal to:
# [
# (['1.213', '1.1'], ['0.123', '2.2'], ['30.3123', '3.3']),
# (['23.33', '1'], ['23139', '0'], ['44.4444', '444'])
# ]
for lst in result:
    print(sum(float(x[1]) for x in lst))  # 6.6 and 445.0
It may be more logical to type cast the values to float inside open_file such as:
yield [float(x) for x in line.strip().split()]
but that is up to you and how you want to change it.
-- Edit --
Note that the above solution loads all the files into memory before doing the math (I did this so I could print the result). Because of how the open_file generator works you don't actually need to do that; here is a more memory-friendly version:
# More memory friendly solution:
# Note that the `result` iterator will be consumed by the `for` loop.
files = ('text1.txt', 'text2.txt', 'text3.txt')
result = zip(*(open_file(f) for f in files))
for lst in result:
    print(sum(float(x[1]) for x in lst))
The background:
Table$Gene=Gene1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.928 0.00484 0.918 0.937
1 2664 304 0.822 0.00714 0.808 0.836
2 2360 104 0.786 0.00766 0.771 0.801
3 2256 48 0.769 0.00787 0.754 0.784
4 2208 40 0.755 0.00803 0.739 0.771
5 2256 48 0.769 0.00787 0.754 0.784
6 2208 40 0.755 0.00803 0.739 0.771
Table$Gene=Gene2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.938 0.00484 0.918 0.937
1 2664 304 0.822 0.00714 0.808 0.836
2 2360 104 0.786 0.00766 0.771 0.801
3 2256 48 0.769 0.00787 0.754 0.784
4 1000 40 0.744 0.00803 0.739 0.774
#There is a new line ("\n") here too, it just doesn't come out in the code.
What I want seems simple. I want to turn the above file into an output that looks like this:
Gene1 0.755
Gene2 0.744
i.e. each gene, and the last number in the survival column from each section.
I have tried multiple ways, using regular expressions and reading the file in as a list and calling .next(). One example of code that I have tried:
fileopen = open(sys.argv[1]).readlines()  # Read in the file as a list.
for index, line in enumerate(fileopen):   # Enumerate items in list
    if "Table" in line:                   # Find the items with "Table" (this will have my gene name)
        line2 = line.split("=")[1]        # Parse line to get my gene name
        if "\n" in fileopen[index+1]:     # This is the problem section.
            print fileopen[index]
        else:
            fileopen[index+1]
So as you can see in the problem section, I was trying to say in this attempt:
if the next item in the list is just a new line, print the current item; else, make the next line the current line (and then I can split the line to pull out the particular number I want).
If anyone could correct the code so I can see what I did wrong I'd appreciate it.
This is a bit of overkill, but instead of manually writing a parser for each data item, use an existing package like pandas to read in the file. You just need to write a bit of code to specify the relevant lines in the file. Un-optimized code (it reads the file twice):
import pandas as pd

def genetable(gene):
    l = open('gene.txt').readlines()
    l += "\n"  # add newline to end of file in case last line is not a newline
    lines = len(l)
    skiprows = -1
    for (i, line) in enumerate(l):
        if "Table$Gene=Gene" + str(gene) in line:
            skiprows = i + 1
        if skiprows >= 0 and line == "\n":
            skipfooter = lines - i - 1
            df = pd.read_csv('gene.txt', sep='\t', engine='python', skiprows=skiprows, skipfooter=skipfooter)
            # assuming tab separated data given your inputs. change as needed
            # assert df.columns.....
            return df
    return "Not Found"
This will read in a DataFrame with all the relevant data in that file. You can then do:
genetable(2).survival            # series with all survival rates
genetable(2).survival.iloc[-1]   # last item in survival
The advantage of this is that you have access to all the items, and any malformatting of the file will probably be picked up and prevented from producing incorrect values. If it were my own code, I would add assertions on the column names before returning the pandas DataFrame, to pick up any parsing errors early so they do not propagate.
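For example, a check along these lines could go right before the return df (a sketch; the exact column names depend on how pandas parses your header row):

expected = ['time', 'n.risk', 'n.event', 'survival']
# fail fast if the parsed header is missing a column we rely on
assert all(col in df.columns for col in expected), "unexpected columns: %s" % list(df.columns)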
This worked when I tried it:
gene = 1
# (assumes filelines already holds the lines of the input file, e.g. from readlines())
for i in range(len(filelines)):
    if filelines[i].strip() == "":
        print("Gene" + str(gene) + " " + filelines[i-1].split()[3])
        gene += 1
You could try something like this (I copied your data into foo.dat):
In [1]: with open('foo.dat') as input:
   ...:     lines = input.readlines()
   ...:
Using with makes sure the file is closed after reading.
In [3]: lines = [ln.strip() for ln in lines]
This gets rid of extra whitespace.
In [5]: startgenes = [n for n, ln in enumerate(lines) if ln.startswith("Table")]
In [6]: startgenes
Out[6]: [0, 10]
In [7]: emptylines = [n for n, ln in enumerate(lines) if len(ln) == 0]
In [8]: emptylines
Out[8]: [9, 17]
Using emptylines relies on the fact that the records are separated by lines containing only whitespace.
In [9]: lastlines = [n-1 for n, ln in enumerate(lines) if len(ln) == 0]
In [10]: for first, last in zip(startgenes, lastlines):
   ....:     gene = lines[first].split("=")[1]
   ....:     num = lines[last].split()[3]  # survival is the 4th column
   ....:     print gene, num
   ....:
Gene1 0.755
Gene2 0.744
Here is my solution:
>>> with open('t.txt', 'r') as f:
...     for l in f:
...         if "Table" in l:
...             gene = l.split("=")[1][:-1]
...         elif l not in ['\n', '\r\n']:
...             surv = l.split()[3]
...         else:
...             print gene, surv
...
Gene1 0.755
Gene2 0.744
Instead of checking for a new line, simply print when you are done reading the file:
lines = open("testgenes.txt").readlines()
table = ""
finalsurvival = 0.0
for line in lines:
if "Table" in line:
if table != "": # print previous survival
print table, finalsurvival
table = line.strip().split('=')[1]
else:
try:
finalsurvival = line.split('\t')[4]
except IndexError:
continue
print table, finalsurvival
I have quite a big text file to parse.
The main pattern is as follows:
step 1
[n1 lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
step 2
[n2 != n1 lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
step 3
[(n3 != n1) and (n3 !=n2) lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
In other words:
A separator: step #
Headers of known length (line numbers, not bytes)
Data 3-dimensional shape: nz, ny, nx
Data: Fortran formatting, ~10 floats/line in the original dataset
I just want to extract the data, convert them to floats, put them in a numpy array and ndarray.reshape them to the shapes given.
I've already done a bit of programming... The main idea is
to get the offsets of each separator first ("step X")
skip nX (n1, n2...) lines + 1 to reach the data
read bytes from there all the way to the next separator.
I wanted to avoid regex at first since they would slow things down a lot. It already takes 3-4 minutes just to get the first step done (browsing the file to get the offset of each part).
The problem is that I'm basically using file.tell() method to get the separator positions:
[file.tell() - len(sep) for line in file if sep in line]
The problem is two-fold:
For smaller files, file.tell() gives the right separator positions; for longer files, it does not. I suspect that file.tell() should not be used in loops, whether with an explicit file.readline() or with the implicit for line in file (I tried both). I don't know, but the result is there: with big files, [file.tell() for line in file if sep in line] does not systematically give the position of the line right after a separator.
len(sep) does not give the right offset correction to go back to the beginning of the "separator" line. sep is a string (bytes) containing the first line of the file (the first separator).
Does anyone know how I should parse this?
NB: I find the offsets first because I want to be able to browse inside the file: I might just want the 10th dataset or the 50000th one...
1- Finding the offsets
sep = "step "
with open("myfile") as f_in:
offsets = [fin.tell() for line in fin if sep in line]
As I said, this is working in the simple example, but not on the big file.
New test:
sep = "step "
offsets = []
with open("myfile") as f_in:
    for line in f_in:
        if sep in line:
            print line
            offsets.append(f_in.tell())
The line printed corresponds to the separators, no doubt about it. But the offsets obtained with f_in.tell() do not correspond to the next line. I guess the file is buffered in memory and as I try to use f_in.tell() in the implicit loop, I do not get the current position but the end of the buffer. This is just a wild guess.
I got the answer: for loops over a file and tell() do not get along very well, just like mixing for i in file and file.readline() raises an error.
So, use file.tell() only together with file.readline() or file.read().
Never ever use:
for line in file:
    [do stuff]
    offset = file.tell()
This is really a shame but that's the way it is.
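For reference, a minimal sketch of the readline()-based scan (the offsets here point at the start of each separator line, so you can later seek() straight to, say, the 10th dataset):

sep = "step "
offsets = []
with open("myfile") as f_in:
    while True:
        offset = f_in.tell()   # position of the line we are about to read
        line = f_in.readline()
        if not line:           # end of file
            break
        if sep in line:
            offsets.append(offset)
    # with the file still open, f_in.seek(offsets[n]) jumps to the n-th separator;
    # from there, skip the header lines and read the data block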
Okay, the problem I have here is that I've got this .txt file Python imports into a dictionary.
What it does is take the values from the text file, which are in this specific format:
a 0.01
b 0.11
c 1.11
d 0.02
^ (shown in code format because it wouldn't stack like in the .txt otherwise; it's not actually code)
and then puts them into a dictionary like this:
d = {'a': '0.01', 'b': '0.11', etc....}
Well, I'm using this so that it will change whatever value the user inputs (Later in the script) into whatever is defined inside the dictionary.
The problem is if I try and make it incorporate a space, it just doesn't work.
Like, I finished the letters, and their corresponding values in the .txt and began going onto symbols:
For example:
& &
* *
(so that when entered into the dictionary, the corresponding values are printed when I have it print the translated message) (I could change them up, but I decided to leave them as they are)
The problem arises when I try and have it make a space in the user input correspond to a space or another value in the translated message.
I tried leaving a row blank in my .txt, so that (space) maps to (space).
But later, when it tried to load the .txt, it gave me an error, saying that: "need more than one value to unpack"
Can someone help me out?
EDIT: Adding code as requested.
TRvalues = {}
with open(r"C:\Users\Owatch\Documents\Python\Unisung Net Send\nsed.txt") as f:
    for line in f:
        (key, val) = line.split()
        TRvalues[key] = val

if TRvalues == False:
    print("\n\tError encountered while attempting to load the .txt file")
    print("\n\t The file does not contain any values")
else:
    print("Dictionary Loaded-")
Sample text file:
a 0.01
b 0.11
c 1.11
d 0.02
e 0.22
f 2.22
g 0.03
h 0.33
i 3.33
j 0.04
k 0.44
l 4.44
m 0.05
n 0.55
o 5.55
p 0.06
q 0.66
r 6.66
s 0.07
t 0.77
u 7.77
v 0.08
w 0.88
x 8.88
y 0.09
z 0.99
I get this error when I attempt to run the script:
Traceback (most recent call last):
File "C:/Users/Owatch/Documents/Python/Unisung Net Send/Input Encryption 0.02.py", line 17, in <module>
(key, val) = line.split()
ValueError: need more than 0 values to unpack
EDIT: Thanks for downvoting everybody! I'm now no longer able to ask questions.
Believe it or not I DO know the rules for this website, and DID research this before asking on Stack Overflow. It IS helpful for other people as well. What a nice community. Hopefully the people who answered did not downvote it. I appreciate what they did.
If I understand correctly, you're trying to use a space both as an unquoted value and as a delimiter, which won't work. I'd use the csv module and its quoting rules. For example (assuming you're using Python 3 from your print functions):
import csv
with open('nsed.txt', newline='') as f:
    reader = csv.reader((line.strip() for line in f), delimiter=' ')
    TRvalues = dict(reader)

print(TRvalues)
with an input file of
a 0.01
b 0.11
c 1.11
d 0.02
" " " "
gives
{' ': ' ', 'a': '0.01', 'b': '0.11', 'c': '1.11', 'd': '0.02'}
It seems that what you're in effect trying to do is:
In [177]: "a b".split()
Out[177]: ['a', 'b']
In [178]: " ".split()
Out[178]: []
i.e. have a blank line and expect the spaces to be preserved when doing a split(), but that won't work.
Therefore k, v = line.split() won't work.
EDIT Presuming I understand the problem.
Perhaps you need to encode the values when you put them in the file and then decode on the way out.
A naive approach might be to use urllib.quote and urllib.unquote.
On the write.
In [188]: urllib.quote(' ')
Out[188]: '%20'
Which makes it a bit trickier for the special case of space, unless you quote all values when writing to the file.
fd.write("%s %s\n" % (urllib.encode(val1), urllib.encode(val2)))
Then on the read from the file.
k, v = map(urllib.unquote, line.split())
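Putting the two halves together, a small round-trip sketch (Python 2, matching the snippets above; in Python 3 these functions live in urllib.parse):

import urllib

# write: quote each field so a literal space survives as '%20'
with open("nsed.txt", "w") as fd:
    for key, val in [("a", "0.01"), (" ", " ")]:
        fd.write("%s %s\n" % (urllib.quote(key), urllib.quote(val)))

# read: split on whitespace, then unquote each field back to its original value
TRvalues = {}
with open("nsed.txt") as fd:
    for line in fd:
        k, v = map(urllib.unquote, line.split())
        TRvalues[k] = v
# TRvalues is now {'a': '0.01', ' ': ' '}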
I need to process files with data segments separated by a blank line, for example:
93.18 15.21 36.69 33.85 16.41 16.81 29.17
21.69 23.71 26.38 63.70 66.69 0.89 39.91
86.55 56.34 57.80 98.38 0.24 17.19 75.46
[...]
1.30 73.02 56.79 39.28 96.39 18.77 55.03
99.95 28.88 90.90 26.70 62.37 86.58 65.05
25.16 32.61 17.47 4.23 34.82 26.63 57.24
36.72 83.30 97.29 73.31 31.79 80.03 25.71
[...]
2.74 75.92 40.19 54.57 87.41 75.59 22.79
.
.
.
For this I am using the following function.
In every call I get the necessary data, but I need to speed up the code.
Is there a more efficient way?
EDIT: I will be updating the code with the changes that achieve improvements
ORIGINAL:
def get_pos_nextvalues(pos_file, indices):
    result = []
    for line in pos_file:
        line = line.strip()
        if not line:
            break
        values = [float(value) for value in line.split()]
        result.append([float(values[i]) for i in indices])
    return np.array(result)
NEW:
def get_pos_nextvalues(pos_file, indices):
    result = ''
    for line in pos_file:
        if len(line) > 1:
            s = line.split()
            result += ' '.join([s[i] for i in indices]) + ' '
        else:
            break
    else:
        return np.array([])
    result = np.fromstring(result, dtype=float, sep=' ')
    result = result.reshape(result.size / len(indices), len(indices))
    return result
.
pos_file = open(filename, 'r', buffering=1024*10)
[...]
while(some_condition):
    vs = get_pos_nextvalues(pos_file, (4, 5, 6))
    [...]
speedup = 2.36
Not converting floats to floats would be the first step. I would suggest, however, first profiling your code and then trying to optimize the bottleneck parts.
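For example, a quick way to see where the time goes (a sketch using the standard-library cProfile module; the driver line is hypothetical and reuses the function and filename from your question):

import cProfile

# profile a single call to the parsing function and sort by cumulative time
pos_file = open(filename, 'r', 1024 * 10)
cProfile.run('get_pos_nextvalues(pos_file, (4, 5, 6))', sort='cumulative')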
I understand that you've changed your code from the original, but
values = [value for value in line.split()]
is not a good thing either. Just write values = line.split() if that is what you mean.
Seeing how you're using NumPy, I'd suggest some methods of file reading that are demonstrated in their docs.
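For instance, np.loadtxt can read whitespace-separated floats and pick out columns in one call (a sketch; note that loadtxt skips blank lines, so this reads the whole file rather than a single segment):

import numpy as np

# every non-blank line of the file, keeping only columns 4, 5 and 6
data = np.loadtxt(filename, usecols=(4, 5, 6))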
You are only reading every character exactly once, so there isn't any real performance to gain.
You could combine strip and split if the empty lines contain a lot of whitespace.
You could also save some time by initializing the numpy array from the start, instead of first creating a Python list and then converting.
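For instance, if you can put an upper bound on the number of rows in a segment, something like this fills a preallocated array directly (a sketch; max_rows is an assumed bound, not something from your code):

import numpy as np

def get_pos_nextvalues_prealloc(pos_file, indices, max_rows):
    out = np.empty((max_rows, len(indices)), dtype=float)  # allocated once, up front
    n = 0
    for line in pos_file:
        values = line.split()
        if not values:  # blank line ends the segment
            break
        for j, i in enumerate(indices):
            out[n, j] = float(values[i])
        n += 1
    return out[:n]  # trim to the rows actually read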
Try increasing the read buffer; IO is probably the bottleneck of your code:
open('file.txt', 'r', 1024 * 10)
Also, if the data is fully sequential, you can try to skip the line-by-line code and convert a bunch of lines at once.
Instead of:
if len(line) <= 1:  # only '\n' in «empty» lines
    break
values = line.split()
try this:
values = line.split()
if not values:  # line is wholly whitespace, end of segment
    break
Doesn't numpy.fromfile work for you?
arr = np.fromfile('tmp.txt', sep=' ', dtype=float)
Here's a variant that might be faster for few indices. It builds a string of only the desired values so that np.fromstring does less work.
def get_pos_nextvalues_fewindices(pos_file, indices):
    result = ''
    for line in pos_file:
        if len(line) > 1:
            s = line.split()
            for i in indices:
                result += s[i] + ' '
        else:
            break
    else:
        return np.array([])
    result = np.fromstring(result, dtype=float, sep=' ')
    result = result.reshape(result.size / len(indices), len(indices))
    return result
This trades off the overhead of split() and an added loop for less parsing. Or perhaps there's some clever regex trick you can do to extract the desired substrings directly?
Old Answer
np.mat('1.23 2.34 3.45 6\n1.32 2.43 7 3.54') converts the string to a numpy matrix of floating point values. This might be a faster kernel for you to use. For instance:
import numpy as np
def ReadFileChunk(pos_file):
    chunktxt = ""
    for line in pos_file:
        if len(line) > 1:
            chunktxt = chunktxt + line
        else:
            break
    return np.mat(chunktxt).tolist()
    # or alternatively
    # return np.array(np.mat(chunktxt))
Then you can move your indexing stuff to another function. Hopefully having numpy parse the string internally is faster than calling float() repetitively.
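Usage could then look something like this (a sketch; indices stands for the columns you want, e.g. (4, 5, 6), and it assumes every row in the chunk has the same number of columns, which np.mat requires anyway):

rows = ReadFileChunk(pos_file)
if rows:
    subset = np.array(rows)[:, list(indices)]  # column selection done by numpy slicing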