Python - Get specific lines from file

How can I get specific lines from a file in Python? I know how to read a file and get its contents into a list, etc., but this is a bit harder for me. Let me explain what I need:
I have a file that looks like this:
lcl|AF033819.3_cds_AAC82593.1_1 [gene=gag] [protein=Gag] [protein_id=AAC82593.1] [location=336..1838]
ATGGGTGCGAGAGCGTCAGTATTAAGCGGGGGAGAATTAGATCGATGGGAAAAAATTCGGTTAAGGCCAG
GGGGAAAGAAAAAATATAAATTAAAACATATAGTATGGGCAAGCAGGGAGCTAGAACGATTCGCAGTTAA
TCACTCTTTGGCAACGACCCCTCGTCACAATAA
lcl|AF033819.3_cds_AAC82598.2_2 [gene=pol] [protein=Pol] [partial=5'] [protein_id=AAC82598.2] [location=<1631..4642]
TTTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGACCAGAGCCA
ACAGCCCCACCAGAAGAGAGCTTCAGGTCTGGGGTAGAGACAACAACTCCCCCTCAGAAGCAGGAGCCGA
lcl|AF033819.3_cds_AAC82594.1_3 [gene=vif] [protein=Vif] [protein_id=AAC82594.1] [location=4587..5165]
ATGGAAAACAGATGGCAGGTGATGATTGTGTGGCAAGTAGACAGGATGAGGATTAGAACATGGAAAAGTT
TAGTAAAACACCATATGTATGTTTCAGGGAAAGCTAGGGGATGGTTTTATAGACATCACTATGAAAGCCC
I need to remove every line that contains:
lcl|AF033819.3_cds_AAC82594.1_3 [gene=vif] [protein=Vif] [protein_id=AAC82594.1] [location=4587..5165]
I need to store all the sequence letters in a list, file, etc.; I know how that part works. Can anyone help me with the Python code? How do I delete only the lines that contain:
lcl

The answer is to use regular expressions. It will be something like this:
>>> import re
>>> a = 'beginlcl|AF033819.3_cds_AAC82593.1_1 [gene=gag] [protein=Gag] [protein_id=AAC82593.1] [location=336..1838]end'
>>> re.sub(r'lcl.*?location.*?\]', '', a)
'beginend'

Why not use startswith()?
with open('lcl.txt', 'r') as f:
    for line in f.readlines():
        if line.startswith("lcl|"):
            print("lcl line dropping it")
            continue
        else:
            print(line)
Result:
lcl line dropping it
ATGGGTGCGAGAGCGTCAGTATTAAGCGGGGGAGAATTAGATCGATGGGAAAAAATTCGGTTAAGGCCAG GGGGAAAGAAAAAATATAAATTAAAACATATAGTATGGGCAAGCAGGGAGCTAGAACGATTCGCAGTTAATCACTCTTTGGCAACGACCCCTCGTCACAATAA
lcl line dropping it
TTTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGACCAGAGCCA ACAGCCCCACCAGAAGAGAGCTTCAGGTCTGGGGTAGAGACAACAACTCCCCCTCAGAAGCAGGAGCCGA
lcl line dropping it
ATGGAAAACAGATGGCAGGTGATGATTGTGTGGCAAGTAGACAGGATGAGGATTAGAACATGGAAAAGTT TAGTAAAACACCATATGTATGTTTCAGGGAAAGCTAGGGGATGGTTTTATAGACATCACTATGAAAGCCC
Note: I am assuming that there are newlines in the right places here!

Related

python enumerate out of range when looping through a file

I have a file of paths called test.txt
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz
Notice that the number of lines is always even. My final goal is to parse this file and create a new one, looping through these paths two by two. I am trying the enumerate function, but it does not pair the lines two by two. Furthermore, I'm going out of range because the way I'm indexing is wrong. It would also be great if someone could tell me how to index properly with enumerate.
import re

with open('./src/test.txt') as f:
    for index, line in enumerate(f):
        sample = re.search(r'pfg[\dGT]+', line)
        sample_string = sample.group(0)
        #print(sample_string)
        print('{{"name":"{0}","readgroup":"{0}","platform_unit":"{0}","fastq_1":"{1}","fastq_2":"{2}","library":"{0}"}},'.format(sample_string, line, line[index+1]))
The result is something like this:
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
","fastq_2":"g","library":"pfg001G"},
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","fastq_2":"r","library":"pfg001G"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","fastq_2":"o","library":"pfg001T"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","fastq_2":"u","library":"pfg001T"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","fastq_2":"p","library":"pfg002G"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","fastq_2":"s","library":"pfg002G"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","fastq_2":"/","library":"pfg002T"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz","fastq_2":"c","library":"pfg002T"},
Clearly the indexing is wrong, since it's going through every character of the path (g, r, etc.) instead of printing the next path. For the first iteration, the next path printed should be: "fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz".
I believe the problem itself can be tackled more elegantly with itertools; I just don't know how to do it. It would also be great if someone could tell me whether indexing with enumerate could also work.
One problem is that you are trying to access the data from the second line of the pair before you have read it. Additionally, you cannot access the second line with line[index + 1], because that refers to a character in the current line, not to the next line, which hasn't been read yet.
So you need to keep track of pairs of lines. You can use the index provided by enumerate() to determine whether the current line is the first (because it is an even number) or the second (because it's odd). Store the name and path for fastq_1 when you read the first line. Only write the output on the second line. Like this:
import re
with open('test.txt') as f:
    for index, line in enumerate(f):
        if index % 2 == 0:  # even, so this is the first line of a pair
            name = re.search(r'pfg[\dGT]+', line).group(0)
            fastq_1 = line.rstrip()
        else:  # odd, so second line. Emit result
            fastq_2 = line.rstrip()
            print('{{"name":"{0}","readgroup":"{0}","platform_unit":"{0}","fastq_1":"{1}","fastq_2":"{2}","library":"{0}"}},'.format(name, fastq_1, fastq_2))
line.rstrip() is required to remove the trailing new line character at the end of each line.
@mhawke already provided a good solution, but to give another approach: "looping through these ... on a two by two basis" can be done with the more_itertools.chunked function from the more_itertools library, or with the grouper() recipe from the Python manual.
This also gives options for what should happen when the last line is an odd one; whether that should raise an error or pair it with a default value.
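For illustration, the two-by-two grouping can be sketched with nothing but the stdlib, using the zip-over-one-iterator trick that the grouper() recipe is built on (the shortened paths below are hypothetical stand-ins for the ones in test.txt):

```python
import re

# hypothetical shortened paths standing in for the real file contents
paths = [
    '/Clean/pfg001G_1_Clean.fastq.gz',
    '/Clean/pfg001G_2_Clean.fastq.gz',
    '/Clean/pfg001T_1_Clean.fastq.gz',
    '/Clean/pfg001T_2_Clean.fastq.gz',
]

it = iter(paths)
records = []
# zip() pulls two items per step from the same iterator, yielding (line1, line2) pairs
for fastq_1, fastq_2 in zip(it, it):
    name = re.search(r'pfg[\dGT]+', fastq_1).group(0)
    records.append({'name': name, 'fastq_1': fastq_1, 'fastq_2': fastq_2})
```

Because zip() stops at the shorter input, a trailing odd line is silently dropped; itertools.zip_longest (or more_itertools.chunked) gives the other behaviours mentioned above.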
You may want to consider that when you index with line[index + 1], you're getting a character of the current string, not the next line of the file.
What you can do is read the file into a list first, then use the index to reach the following line.
I still don't understand one point, though: do you want overlapping pairs for fastq_1 and fastq_2, or should each path belong to exactly one pair?
Code Syntax
import re

with open(path) as f:
    lis = list(f)
    for index, line in enumerate(lis):
        try:
            sample = re.search(r'pfg[\dGT]+', line)
            sample_string = sample.group(0)
            print(f'{{"name":"{sample_string}","readgroup":"{sample_string}","platform_unit":"{sample_string}","fastq_1":"{line}","fastq_2":"{lis[index+1]}","library":"{sample_string}"}},')
        except IndexError:
            break
Output
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","library":"pfg001G"},
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","library":"pfg001G"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","library":"pfg001T"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","library":"pfg001T"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","library":"pfg002G"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","library":"pfg002G"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz
","library":"pfg002T"},

trying to convert txt data to csv columns

I am having a small issue. The code below works, but when I put /test1 and /test2 into the file and change the pattern to '/test1 (.*?) /test2', it doesn't find them in the file and the except branch runs.
import re

with open('test.txt') as f:
    fin = f.read()
try:
    print(re.search('test1 (.*?) test2', fin).group(1))
except:
    print('Didnt find test')
My goal is to extract data from a list of text files and push it into CSV columns. The text looks like the sample below, where I would extract the values between /J6 and /K6 as a range. There are multiple different runs from /J6 to /K6, and each value should go into a separate column in the CSV.
/J60000,0000,0819,0016,0356,-13,0363/K60013
,0012,0013,0875,-0021,00465,0120/L60089,0002,
I just want to understand whether there is a syntax problem in detecting the /. I am trying to extract the values between one marker and another. Thank you.
You can use the re.split function.
Something like this...
In [84]: import re
In [85]: inp = "/J60000,0000,0819,0016,0356,-13,0363/K60013 ,0012,0013,0875,-0021,00465,0120/L60089,0002,"
In [86]: re.split("/[J,K,L]\d", inp)
Out[86]:
['',
 '0000,0000,0819,0016,0356,-13,0363',
 '0013 ,0012,0013,0875,-0021,00465,0120',
 '0089,0002,']
Disclaimer: I'm not good with regex at all. I used this link as a reference.
https://www.dataquest.io/blog/regex-cheatsheet/
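On the original syntax question: the / is not special in a regular expression, so it needs no escaping. A more likely reason a '/test1 (.*?) /test2' pattern finds nothing is that . does not match newline characters by default, so a pattern whose markers sit on different lines of the file fails; if that is the layout (an assumption here), re.DOTALL changes it:

```python
import re

# hypothetical file content where the markers span two lines
text = '/test1 some\nvalue /test2'

no_match = re.search(r'/test1 (.*?) /test2', text)            # None: '.' stops at '\n'
m = re.search(r'/test1 (.*?) /test2', text, re.DOTALL)        # matches across the newline
print(m.group(1))
```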

Python How to find the max element in a list

I'm trying to find the maximum rainfall in a data file. The precipitation column is element [4] of each line in a 1340-line file.
Here's an example of a line of data from the file:
Date,Day,High T,Low T,Precip,Snow,Snow Depth
1/1/10,1,41,19,0,0,5
Here's the loop I'm using to try to find max_precip:
for line in fo:
    max_precip = max(line.split(",")[4])
Any help or guidance here would be greatly appreciated. Thanks guys!
You'll need to apply this to all lines, and you'll need to convert the precipitation value to integer first:
max_precip = max(fo, key=lambda line: int(line.split(',')[4]))
This returns the whole line containing the maximum precipitation. I'm assuming you already removed the header line.
Note that you may want to look at the csv module to handle the comma-splitting for you.
To get just the precipitation maximum and ignore everything else, use a generator expression:
max_precip = max(int(line.split(',')[4]) for line in fo)
Demo:
>>> fo = '''\
... 1/1/10,1,41,19,0,0,5
... 1/2/10,1,38,18,2,0,6
... 1/3/10,1,43,17,1,0,6
... '''.splitlines()
>>> max(fo, key=lambda line: int(line.split(',')[4]))
'1/2/10,1,38,18,2,0,6'
>>> max(int(line.split(',')[4]) for line in fo)
2
I'm trying to find the maximum rainfall from a data file.
If that is what you want, you may want to pre-process your data before passing it to the built-in max function.
max_precip = max(int(line.split(',')[4]) for line in fo)
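Following up on the csv-module suggestion above, the same maximum can be computed with csv.reader, which handles the comma-splitting; the sample rows are taken from the question, and skipping the header row is an assumption about the file:

```python
import csv
import io

# in-memory stand-in for the open data file
data = io.StringIO(
    'Date,Day,High T,Low T,Precip,Snow,Snow Depth\n'
    '1/1/10,1,41,19,0,0,5\n'
    '1/2/10,1,38,18,2,0,6\n'
)
reader = csv.reader(data)
next(reader)                                   # skip the header row
max_precip = max(int(row[4]) for row in reader)
```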

IndexError: list index out of range; split is causing the issue probably

I am just two days into Python and also new to this forum. Please bear with me if my question looks silly. I tried searching Stack Overflow but couldn't correlate the information given there with my problem.
Please help me with this:
>>> import re
>>> file = open("shortcuts", "r")
>>> for i in file:
...     i = i.split(' ', 2)
...     if i[1] == '^/giftfordad$':
...         print i[1]
...
^/giftfordad$
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
IndexError: list index out of range
I am receiving my desired output, but along with the list index out of range error.
My file shortcuts contains four columns delimited with spaces.
Also, please help me with how to find the pattern when the value resides in a variable.
For example
var1=giftcard
How do I find ^/var$ in my file? Thanks in advance for the help.
From Jon's point I arrived at this, but I'm still looking for an answer: the code below gives me an empty result. The regex needs some tweak for the ^/ and $ parts.
>>> import re
>>> with open('shortcut', 'r') as fin:
...     lines = (line for line in fin if line.strip() and not line.strip().startswith('#'))
...     for line in lines:
...         if re.match("^/giftfordad$", line):
...             print line
From my shell I arrive at the answer easily. Can somebody please write/correct this piece of code to achieve the result I'm looking for? Many thanks.
$grep "\^\/giftfordad\\$" shortcut
RewriteRule ^/giftfordad$ /home/sathya/?id=456 [R,L]
In this:
    i = i.split(' ', 2)
    if i[1] == '^/giftfordad$':
the IndexError implies that i is sometimes a list of length 1 (i.e., there was no ' ' character to split on).
Also, it looks like you might be trying to use a regular expression, and if i[1] == '^/giftfordad$' is not how Python does those. As a plain string comparison, that would be written as:
    if i[1] == '/giftfordad':
However, that's a completely valid string if you're grabbing it from a file of a list of regular expressions ;)
Just seen your example:
If you're processing an .htaccess like file, you'll want to ignore blank lines and presumably commented lines...
with open('shortcuts') as fin:
    lines = (line for line in fin if line.strip() and not line.strip().startswith('#'))
    for line in lines:
        stuff = line.split(' ', 2)
        # etc...
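Putting that together, here is a sketch (written in Python 3, unlike the Python 2 original) that filters blanks and comments and also guards against lines with no second column, tested on inline sample lines rather than the real shortcuts file:

```python
# stand-in lines for the shortcuts file, including the kinds that used to crash the loop
lines = [
    'RewriteRule ^/giftfordad$ /home/sathya/?id=456 [R,L]\n',
    '\n',                       # blank line: split() would yield a 1-element list
    '# a comment line\n',
]

matches = []
for line in lines:
    stripped = line.strip()
    if not stripped or stripped.startswith('#'):
        continue                # skip blanks and comments
    parts = stripped.split(' ', 2)
    if len(parts) > 1 and parts[1] == '^/giftfordad$':
        matches.append(parts[1])
```

The len(parts) > 1 check is what prevents the IndexError on lines that have no space to split on.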
I took the line you gave as an example, and here is what I found, hoping it fits your expectations (I understood that you wanted to replace ^/tv$ with ^/giftfordad$ in column #2):
>>> s = 'RewriteRule ^/tv$ /home/sathya?id=123 [R=301,L]'
>>> parts = s.split()
>>> parts
['RewriteRule', '^/tv$', '/home/sathya?id=123', '[R=301,L]']
>>> if len(parts) > 1:
...     part = parts[1]
...     if not "^/giftfordad$" in part:
...         print ' '.join([parts[0]] + ["^/giftfordad$"] + parts[2:])
...     else:
...         print s
...
RewriteRule ^/giftfordad$ /home/sathya?id=123 [R=301,L]
The line with join is the most complex: I recreate a list by concatenating:
the 1st column unchanged
the 2nd column replaced by ^/giftfordad$
the rest of the columns unchanged
join is then used to join all these elements as a string.
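On the remaining question, matching a pattern whose name sits in a variable: the literal can be built with string concatenation, and re.escape() protects characters like ^ and $ that are special in a regex. A sketch using the example line from above (the variable's value is chosen here to match that line):

```python
import re

var1 = 'giftfordad'                         # value chosen to match the sample line
# ^ and $ are literal characters in the file's column, so escape them too
pattern = re.escape('^/' + var1 + '$')

s = 'RewriteRule ^/giftfordad$ /home/sathya?id=123 [R=301,L]'
parts = s.split()
found = len(parts) > 1 and re.fullmatch(pattern, parts[1]) is not None
```

Without re.escape(), the ^ and $ would act as anchors and the match against the literal column text would fail.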

How do I handle closing double quotes in CSV column with python?

This is the python script:
f = open('csvdata.csv', 'rb')
fo = open('out6.csv', 'wb')

for line in f:
    bits = line.split(',')
    bits[1] = '"input"'
    fo.write(','.join(bits))

f.close()
fo.close()
I have a CSV file and I'm replacing the content of the 2nd column with the string "input". However, I need to grab some information from that column content first.
The content might look like this:
failurelog_wl","inputfile/source/XXXXXXXX"; "X_CORD2"; "Invoice_2M";
"Y_CORD42"; "SIZE_ID37""
It has an odd format, as you can see; in particular, there are two double quotes at the end of the line instead of the single one you would expect.
I need to extract the X_CORD and Y_CORD information, like X_CORD = 2 and Y_CORD = 42, before replacing the column value. I then want to insert an extra column, named X_Y, which represents (2_42).
How can I modify my script to do that?
If I understand your question correctly, you can use a simple regular expression to pull out the numbers you want:
import re

f = open('csvdata.csv', 'rb')
fo = open('out6.csv', 'wb')

for line in f:
    bits = line.split(',')
    x_y_matches = re.match(r'.*X_CORD(\d+).*Y_CORD(\d+).*', bits[1])
    assert x_y_matches is not None, 'Line had unexpected format: {0}'.format(bits[1])
    x_y = '({0}_{1})'.format(x_y_matches.group(1), x_y_matches.group(2))
    bits[1] = '"input"'
    bits.append(x_y)
    fo.write(','.join(bits))

f.close()
fo.close()
Note that this will only work if column 2 always says 'X_CORD' and 'Y_CORD' immediately before the numbers. If it is sometimes a slightly different format, you'll need to adjust the regular expression to allow for that. I added the assert to give a more useful error message if that happens.
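To illustrate on (an abridged version of) the sample column content from the question, the pattern pulls the two numbers out as intended:

```python
import re

# abridged column content from the question's example
col = '"inputfile/source/XXXXXXXX"; "X_CORD2"; "Invoice_2M"; "Y_CORD42"; "SIZE_ID37""'

m = re.match(r'.*X_CORD(\d+).*Y_CORD(\d+).*', col)
x_y = '({0}_{1})'.format(m.group(1), m.group(2))
```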
You mentioned wanting the column to be named X_Y. Your script appears to assume that there is no header, and my modified version definitely makes this assumption. Again, you'd need to adjust for that if there is a header line.
And, yes, I agree with the other commenters that using the csv module would be cleaner, in general, for reading and writing csv files.
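For completeness, the usual csv-module shape for replacing a column looks like the sketch below, shown on a well-formed stand-in row; note that the stray closing quotes in the question's actual data may still defeat the parser, which is why the regex route can be the pragmatic choice here:

```python
import csv
import io

src = io.StringIO('a,"old value",c\n')   # well-formed stand-in row, not the question's data
out = io.StringIO()

writer = csv.writer(out)
for row in csv.reader(src):
    row[1] = 'input'                     # replace the second column
    writer.writerow(row)
```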
