I'm trying to find the maximum rainfall from a data file. Precipitation is the element at index [4] of each line, and the file is 1340 lines long.
Here's an example of a line of data from the file:
Date,Day,High T,Low T,Precip,Snow,Snow Depth
1/1/10,1,41,19,0,0,5
Here's the loop I'm using to try to find max_precip:
for line in fo:
    max_precip = max(line.split(",")[4])
Any help or guidance here would be greatly appreciated. Thanks guys!
You'll need to apply this to all lines, and you'll need to convert the precipitation value to integer first:
max_precip = max(fo, key=lambda line: int(line.split(',')[4]))
This returns the whole line containing the maximum precipitation. I'm assuming you already removed the header line.
Note that you may want to look at the csv module to handle the comma-splitting for you.
To get just the precipitation maximum and ignore everything else, use a generator expression:
max_precip = max(int(line.split(',')[4]) for line in fo)
Demo:
>>> fo = '''\
... 1/1/10,1,41,19,0,0,5
... 1/2/10,1,38,18,2,0,6
... 1/3/10,1,43,17,1,0,6
... '''.splitlines()
>>> max(fo, key=lambda line: int(line.split(',')[4]))
'1/2/10,1,38,18,2,0,6'
>>> max(int(line.split(',')[4]) for line in fo)
2
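The csv module mentioned above can take care of both the comma-splitting and the header row. A minimal sketch, assuming the column is named Precip as in the sample header; the io.StringIO object only stands in for the real open file, and float() is used in case some values are not whole numbers:

```python
import csv
import io

# Stand-in for the real file; with real data use open(path, newline="") instead.
sample = io.StringIO(
    "Date,Day,High T,Low T,Precip,Snow,Snow Depth\n"
    "1/1/10,1,41,19,0,0,5\n"
    "1/2/10,1,38,18,2,0,6\n"
    "1/3/10,1,43,17,1,0,6\n"
)

# DictReader consumes the header line and keys every row by column name.
reader = csv.DictReader(sample)
max_precip = max(float(row["Precip"]) for row in reader)
print(max_precip)
```

Looking the column up by name rather than by position means the code keeps working if the column order changes.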
I'm trying to find the maximum rainfall from a data file.
If that is what you want, you may want to pre-process your data, before passing to the built-in max function.
max_precip = max(int(line.split(',')[4]) for line in fo)
Related
I have a file of paths called test.txt
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz
Notice that the number of lines is even and always even, my final goal is to parse this file and create a new one looping through these paths on a two by two basis. I am trying enumerate function but this will not parse two by two. Furthermore, I'm going out of range because indexing the way I'm doing is wrong. It would also be great if someone could tell me how to index properly with enumerate.
with open('./src/test.txt') as f:
    for index,line in enumerate(f):
        sample = re.search(r'pfg[\dGT]+',line)
        sample_string = sample.group(0)
        #print(sample_string)
        print('{{"name":"{0}","readgroup":"{0}","platform_unit":"{0}","fastq_1":"{1}","fastq_2":"{2}","library":"{0}"}},'.format(sample_string,line,line[index+1]))
The result is something like this:
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
","fastq_2":"g","library":"pfg001G"},
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","fastq_2":"r","library":"pfg001G"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","fastq_2":"o","library":"pfg001T"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","fastq_2":"u","library":"pfg001T"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","fastq_2":"p","library":"pfg002G"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","fastq_2":"s","library":"pfg002G"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","fastq_2":"/","library":"pfg002T"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz","fastq_2":"c","library":"pfg002T"},
Clearly the indexation is wrong since it's going through every element of my path that is g r etc instead of printing the next path. For the first iteration the next path printed should be: "fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz".
I believe the problem itself can be tackled with itertools more elegantly I just don't know how to do it. Would also be great if someone could tell me if an indexation with enumerate could also work.
One problem is that you are trying to access the data from the second line of the pair before you have read it. Additionally you can not access the second line with line[index + 1] because that refers to a character in the current line, not the next line which hasn't yet been read.
So you need to keep track of pairs of lines. You can use the index provided by enumerate() to determine whether the current line is the first (because it is an even number) or the second (because it's odd). Store the name and path for fastq_1 when you read the first line. Only write the output on the second line. Like this:
import re
with open('test.txt') as f:
    for index, line in enumerate(f):
        if index % 2 == 0:    # even, so this is the first line of a pair
            name = re.search(r'pfg[\dGT]+',line).group(0)
            fastq_1 = line.rstrip()
        else:    # odd, so second line. Emit result
            fastq_2 = line.rstrip()
            print('{{"name":"{0}","readgroup":"{0}","platform_unit":"{0}","fastq_1":"{1}","fastq_2":"{2}","library":"{0}"}},'.format(name, fastq_1, fastq_2))
line.rstrip() is required to remove the trailing new line character at the end of each line.
@mhawke already provided a good solution, but to give another approach, "looping through these ... on a two by two basis" can be done with the more_itertools.chunked function from the more_itertools library or with the grouper() recipe from the itertools documentation.
This also gives options for what should happen when the last line is an odd one; whether that should raise an error or pair it with a default value.
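A minimal sketch of the grouper() approach, using only the standard library. The shortened /data/Clean/... paths are hypothetical stand-ins for the lines of test.txt, and raising an error on a leftover line is just one of the possible choices:

```python
import json
import re
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect items into fixed-length chunks (itertools recipe)."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


# Stand-in for the lines of test.txt (paths shortened for the example).
lines = [
    "/data/Clean/pfg001G_1_Clean.fastq.gz\n",
    "/data/Clean/pfg001G_2_Clean.fastq.gz\n",
    "/data/Clean/pfg001T_1_Clean.fastq.gz\n",
    "/data/Clean/pfg001T_2_Clean.fastq.gz\n",
]

records = []
for first, second in grouper(lines, 2):
    if second is None:
        # The last chunk was padded, so the input had an odd number of lines.
        raise ValueError("odd number of lines: " + first.strip())
    name = re.search(r"pfg[\dGT]+", first).group(0)
    records.append({
        "name": name, "readgroup": name, "platform_unit": name,
        "fastq_1": first.strip(), "fastq_2": second.strip(), "library": name,
    })

for record in records:
    # json.dumps takes care of quoting, so no hand-escaped braces are needed.
    print(json.dumps(record) + ",")
```

With a real file, the open file object can be passed to grouper() directly in place of the list.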
Note that line[index + 1] indexes into the current string, so you get a single character of that line rather than the next line.
What you can do is read the file into a list and then use the index to reach whichever line you want.
I still don't fully understand the goal: do you want consecutive lines paired as fastq_1 and fastq_2, or should each path go under its own key?
Code Syntax
import re
with open(path) as f:  # path: location of test.txt
    lis = list(f)
for index, line in enumerate(lis):
    try:
        sample = re.search(r'pfg[\dGT]+',line)
        sample_string = sample.group(0)
        print(f'{{"name":"{sample_string}","readgroup":"{sample_string}","platform_unit":"{sample_string}","fastq_1":"{line}","fastq_2":"{lis[index+1]}","library":"{sample_string}"}},')
    except IndexError:  # no next line after the last one
        break
Output
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pf
g001G","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Ta
rgeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","library":"pfg001G"},
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pf
g001G","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","library":"pfg001G"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pf
g001T","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","library":"pfg001T"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pf
g001T","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","library":"pfg001T"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pf
g002G","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","library":"pfg002G"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pf
g002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","library":"pfg002G"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz
","library":"pfg002T"},
[Program finished]
I am working with a .txt file. It has 100 rows and 5 columns. I need to divide it into five vectors of length 100, one for each column. I am trying to follow this: Reading specific columns from a text file in python.
However, when I implement it as:
token = open('token_data.txt','r')
linestoken=token.readlines()
resulttoken=[]
for x in linestoken:
    resulttoken.append(x.split(' ')[1])
token.close()
I don't know how this is stored. If I write print('resulttoken'), nothing appears on my screen.
Can someone please tell me what I am doing wrong?
Thanks.
part of my text file
x.split(' ') is not useful here, because the columns of your text file are separated by more than one space. Use x.split() with no argument to split on any run of whitespace:
token = open('token_data.txt','r')
linestoken=token.readlines()
tokens_column_number = 1
resulttoken=[]
for x in linestoken:
    resulttoken.append(x.split()[tokens_column_number])
token.close()
print(resulttoken)
Well, the file looks like it is separated by tabs rather than spaces, so try this:
token = open('token_data.txt','r')
linestoken=token.readlines()
tokens_column_number = 1
resulttoken=[]
for x in linestoken:
    # split on tabs and keep only the wanted column
    resulttoken.append(x.rstrip('\n').split('\t')[tokens_column_number])
token.close()
print(resulttoken)
You want a list of five distinct lists, and append to each in turn.
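# [[]] * 5 would repeat one and the same list five times, so build five separate lists
columns = [[] for _ in range(5)]
with open('token_data.txt','r') as token:
    for line in token:
        for field, value in enumerate(line.split()):
            columns[field].append(value)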
Now, you will find the first value from the first line in columns[0][0], the second value from the first line in columns[1][0], the first value from the second line in columns[0][1], etc.
To print the value of a variable, don't put quotes around it. Quotes create a literal string.
print(columns[0][0])
prints the value of columns[0][0] whereas
print('columns[0][0]')
simply prints the literal text "columns[0][0]".
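The same row-to-column split can also be written as a transpose with zip(). A minimal sketch, assuming whitespace-separated fields and the same number of fields on every line; the sample values are made up, and io.StringIO stands in for the open file:

```python
import io

# Stand-in for token_data.txt: whitespace-separated, 5 columns (made-up values).
sample = io.StringIO(
    "1  a  10  x  0.1\n"
    "2  b  20  y  0.2\n"
    "3  c  30  z  0.3\n"
)

# One list of fields per non-blank line.
rows = [line.split() for line in sample if line.strip()]

# zip(*rows) pairs up the i-th field of every row, i.e. it yields the columns.
columns = [list(col) for col in zip(*rows)]
print(columns[1])
```

Note that zip() stops at the shortest row, so a line with missing fields would silently truncate every column.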
You can use data_py package to read column wise data in FORTRAN style.
Install this package using
pip install data-py
Usage Example
from data_py import datafile
NoOfLines=0
lineNumber=2 # Line number to read (Excluding lines starting with '#')
df1=datafile("C:/Folder/SubFolder/data-file-name.txt")
df1.separator="," # No need to specify if separator is space(" ") and for 'tab' separated values use '\t'
NoOfLines=df1.lines # Total number of lines in the data file (Excluding lines starting with '#')
[Col1,Col2,Col3,Col4,Col5]=["","","","",""] # Initial values
[Col1,Col2,Col3,Col4,Col5]=df1.read([Col1,Col2,Col3,Col4,Col5],lineNumber)
print(Col1,Col2,Col3,Col4,Col5) # In str format
For details please follow the link https://www.respt.in/p/python-package-datapy.html
I have a file named ping.txt which has the values that shows the time taken to ping an ip for n number of times.
I have my ping.txt contains:
time=35.9
time=32.4
I have written Python code that uses a regular expression to extract just the floating-point numbers and add them up. But I feel that the code below is an indirect way of completing my task. The findall regex I am using here returns a list, which is then joined, converted and added.
import re
add,tmp=0,0
with open("ping.txt","r+") as pingfile:
    for i in pingfile.readlines():
        tmp=re.findall(r'\d+\.\d+',i)
        add=add+float("".join(tmp))
print("The sum of the times is :",add)
My question is how to solve this problem without using regex or any other way to reduce the number of lines in my code to make it more efficient?
In other words, can I use different regex or some other method to do this operation?
You can use the following:
with open('ping.txt', 'r') as f:
    s = sum(float(line.split('=')[1]) for line in f)
Output:
>>> with open('ping.txt', 'r') as f:
...     s = sum(float(line.split('=')[1]) for line in f)
...
>>> s
68.3
Note: I assume each line of your file contains time=some_float_number
You could do it like this:
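import re
# (?:...) is non-capturing; with a capturing group findall would return only the '.9' part
with open("ping.txt") as f:
    total = sum(float(s) for s in re.findall(r'\d+(?:\.\d+)?', f.read()))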
If you have the string:
>>> s='time=35.9'
Then to get the value, you just need:
>>> float(s.split('=')[1])
35.9
You don't need regular expressions for something with a simple delimiter.
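A minimal sketch of the delimiter-based approach that also tolerates blank or malformed lines, assuming every useful line has the form time=<number>. The io.StringIO object stands in for ping.txt:

```python
import io

# Stand-in for ping.txt, with a trailing blank line to show it is skipped.
sample = io.StringIO("time=35.9\ntime=32.4\n\n")

total = 0.0
for line in sample:
    # partition() always returns three parts; sep is '' when there is no '='.
    key, sep, value = line.strip().partition("=")
    if sep:
        total += float(value)

print("The sum of the times is :", round(total, 1))
```

Unlike split('=')[1], partition() does not raise an IndexError on a line that has no '=' in it.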
You can use the string split to split each line at '=' and append them to a list. At the end, you can simply call the sum function to print the sum of elements in the list
temp = []
with open("ping.txt","r") as pingfile:
    for i in pingfile.readlines():
        temp.append(float(str.split(i,'=')[1]))
print("The sum of the times is :",sum(temp))
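Use this regular expression (with the dot escaped, so that it only matches a literal decimal point):
tmp = re.findall(r"[0-9]+\.[0-9]+", i)
After that, loop over the matches:
total = 0  # not named sum, which would shadow the built-in
for each in tmp:
    total = total + float(each)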
How can I get specific lines from a file in Python? I know how to read files and get it in a list etc, but this is a bit harder for me. Let me explain what I need:
I have a file that looks like this:
lcl|AF033819.3_cds_AAC82593.1_1 [gene=gag] [protein=Gag] [protein_id=AAC82593.1] [location=336..1838]
ATGGGTGCGAGAGCGTCAGTATTAAGCGGGGGAGAATTAGATCGATGGGAAAAAATTCGGTTAAGGCCAG
GGGGAAAGAAAAAATATAAATTAAAACATATAGTATGGGCAAGCAGGGAGCTAGAACGATTCGCAGTTAA
TCACTCTTTGGCAACGACCCCTCGTCACAATAA
lcl|AF033819.3_cds_AAC82598.2_2 [gene=pol] [protein=Pol] [partial=5'] [protein_id=AAC82598.2] [location=<1631..4642]
TTTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGACCAGAGCCA
ACAGCCCCACCAGAAGAGAGCTTCAGGTCTGGGGTAGAGACAACAACTCCCCCTCAGAAGCAGGAGCCGA
lcl|AF033819.3_cds_AAC82594.1_3 [gene=vif] [protein=Vif] [protein_id=AAC82594.1] [location=4587..5165]
ATGGAAAACAGATGGCAGGTGATGATTGTGTGGCAAGTAGACAGGATGAGGATTAGAACATGGAAAAGTT
TAGTAAAACACCATATGTATGTTTCAGGGAAAGCTAGGGGATGGTTTTATAGACATCACTATGAAAGCCC
I need to remove every line that contains:
lcl|AF033819.3_cds_AAC82594.1_3 [gene=vif] [protein=Vif] [protein_id=AAC82594.1] [location=4587..5165]
All the letters I need to store in a list, file, etc. I know how that works. Can anyone help me with the code in Python? How do I only delete lines that contain:
lcl
The answer is to use regular expressions. It will be something like this:
>>> import re
>>> a = 'beginlcl|AF033819.3_cds_AAC82593.1_1 [gene=gag] [protein=Gag] [protein_id=AAC82593.1] [location=336..1838]end'
>>> re.sub(r'lcl.*?location.*?\]', '', a)
'beginend'
Why not use startswith()?
with open('lcl.txt', 'r') as f:
    for line in f.readlines():
        if line.startswith("lcl|"):
            print("lcl line dropping it")
            continue
        else:
            print(line)
Result:
lcl line dropping it
ATGGGTGCGAGAGCGTCAGTATTAAGCGGGGGAGAATTAGATCGATGGGAAAAAATTCGGTTAAGGCCAG GGGGAAAGAAAAAATATAAATTAAAACATATAGTATGGGCAAGCAGGGAGCTAGAACGATTCGCAGTTAATCACTCTTTGGCAACGACCCCTCGTCACAATAA
lcl line dropping it
TTTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGACCAGAGCCA ACAGCCCCACCAGAAGAGAGCTTCAGGTCTGGGGTAGAGACAACAACTCCCCCTCAGAAGCAGGAGCCGA
lcl line dropping it
ATGGAAAACAGATGGCAGGTGATGATTGTGTGGCAAGTAGACAGGATGAGGATTAGAACATGGAAAAGTT TAGTAAAACACCATATGTATGTTTCAGGGAAAGCTAGGGGATGGTTTTATAGACATCACTATGAAAGCCC
Note: I am assuming that there are newlines in the right places here!
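To store the remaining lines in a list instead of printing them, the same startswith() test can go into a list comprehension. A minimal sketch, assuming every header line begins with lcl|; the sample data is shortened and io.StringIO stands in for the open file:

```python
import io

# Shortened stand-in for the sequence file.
sample = io.StringIO(
    "lcl|AF033819.3_cds_AAC82593.1_1 [gene=gag]\n"
    "ATGGGTGCGAGAGCG\n"
    "TCACTCTTTGGCAAC\n"
    "lcl|AF033819.3_cds_AAC82598.2_2 [gene=pol]\n"
    "TTTTTTAGGGAAGAT\n"
)

# Keep every non-blank line that is not a header, without its newline.
sequence_lines = [
    line.strip()
    for line in sample
    if line.strip() and not line.startswith("lcl|")
]
print(sequence_lines)
```

"".join(sequence_lines) would give all the letters as one string, if that is more convenient than a list.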
I'm new to the world of python and I'm trying to extract values from multiple text files. I can open up the files fine with a loop, but I'm looking for a straight forward way to search for a string and then return the value after it.
My results text files look like this
SUMMARY OF RESULTS
Max tip rotation =,-18.1921,degrees
Min tip rotation =,-0.3258,degrees
Mean tip rotation =,-7.4164,degrees
Max tip displacement =,6.9956,mm
Min tip displacement =,0.7467,mm
Mean tip displacement = ,2.4321,mm
Max Tsai-Wu FC =,0.6850
Max Tsai-Hill FC =,0.6877
So I want to be able to search for, say, 'Max Tsai-Wu FC =,' and have it return 0.6850.
I want to be able to search for the string as the position of each variable might change at a later date.
Sorry for posting such an easy question, just can't seem to find a straight forward robust way of finding it.
Any help would be greatly appreciated!
Matt
You can make use of regex:
import re
regexp = re.compile(r'Max Tsai-Wu.*?([0-9.-]+)')
with open('input.txt') as f:
    for line in f:
        match = regexp.match(line)
        if match:
            print(match.group(1))
prints:
0.6850
UPD: getting results into the list
import re
regexp = re.compile(r'Max Tsai-Wu.*?([0-9.-]+)')
result = []
with open('input.txt') as f:
    for line in f:
        match = regexp.match(line)
        if match:
            result.append(match.group(1))
My favorite way is to test if the line starts with the desired text:
keyword = 'Max Tsai-Wu'
if line.startswith(keyword):
    ...  # this line holds the value
And then split the line on the commas and return the value (inside a function):
try:
    return float(line.split(',')[1])
except ValueError:
    pass  # treat the error
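Put together, the two fragments could look like the sketch below. The function name find_value and the choice to return None when nothing usable is found are assumptions made for illustration; the list of lines stands in for an open results file:

```python
def find_value(lines, keyword):
    """Return the number after the first comma on the first line starting with keyword."""
    for line in lines:
        if line.startswith(keyword):
            try:
                return float(line.split(",")[1])
            except (IndexError, ValueError):
                # No comma on the line, or the field is not a number.
                return None
    return None  # keyword not found


# Stand-in for the lines of one results file.
sample = [
    "SUMMARY OF RESULTS\n",
    "Max tip rotation =,-18.1921,degrees\n",
    "Max Tsai-Wu FC =,0.6850\n",
    "Max Tsai-Hill FC =,0.6877\n",
]

print(find_value(sample, "Max Tsai-Wu"))
print(find_value(sample, "Max tip rotation"))
```

Because the search is by keyword, the result does not depend on the position of the line in the file.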
You can use regular expression to find both name and value:
import re
RE_VALUE = re.compile(r'(.*?)\s*=,(.*?),')
def test():
    line = 'Max tip rotation =,-18.1921,degrees'
    rx = RE_VALUE.search(line)
    if rx:
        print('[%s] value: [%s]' % (rx.group(1), rx.group(2)))
test()
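This way, reading the file line by line, you can fill a dictionary.
My regex relies on the value sitting between two commas, so a line with no comma after the value (such as 'Max Tsai-Wu FC =,0.6850') will not match.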
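A minimal sketch of filling such a dictionary. It uses a variant pattern, not the one above: the comma after the value is optional and whitespace is allowed between '=' and the comma, which is an assumption based on the sample lines in the question. The names RE_NAME_VALUE and results are made up for the example:

```python
import re

# name, optional spaces, '=', optional spaces, ',', then the value up to the next comma
RE_NAME_VALUE = re.compile(r"(.*?)\s*=\s*,([^,]*)")

# Stand-in for the lines of one results file.
sample = [
    "SUMMARY OF RESULTS\n",
    "Max tip rotation =,-18.1921,degrees\n",
    "Mean tip displacement = ,2.4321,mm\n",
    "Max Tsai-Wu FC =,0.6850\n",
]

results = {}
for line in sample:
    rx = RE_NAME_VALUE.match(line.strip())
    if rx:
        # Lines without '=,' (such as the title) simply do not match.
        results[rx.group(1)] = float(rx.group(2))

print(results["Max Tsai-Wu FC"])
```

Once the dictionary is filled, any value can be looked up by name regardless of the order of the lines in the file.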
If the files aren't that big, you could simply do:
import re
files = ['results1.txt', 'results2.txt']  # placeholder list of file names
for f in files:
    with open(f) as myfile:
        print(re.search(r'Max Tsai-Wu.*?=,(.+)', myfile.read()).group(1))