Python get String behind multiple comma - python

I need to read a text.txt file and extract only very specific data from it.
The entries in text.txt look like this:
b(1,4,8,1,4,TEST,0,3,AAAA,Test,2-150,000)
a(1,1,3,1,3,BBBB,0,3,BBBB,Test,2-150,000)
a(1,0,2,1,4,TEST,0,3,CCCC,Test,2-150,000)
b(1,1,0,1,4,TEST,0,3,DDDD,Test,2-150,000)
So now I just want the lines starting with "a(", and from those I just need the strings after the 5th and 8th comma, so for line 2 it would be BBBB, BBBB.
My code so far is:
infile = open("text.txt","r")
numlines = 0
found = []
for line in infile:
    numlines += 1
    if "a" in line:
        line=line[line.find("(")+1:line.find(")")]
        found.append(line.split(','))
wordLed=len(found)
for i in range(0,wordLed):
    print found[i]
infile.close()
This just gives me the full lines separated at the "," but how can I index into them?

The quick, short and dirty way:
with open('text.txt') as f:
    result = [line.split(',')[5:9:3] for line in f if line.startswith("a(")]
    #                        ^^^^^^^
    #                  "5 to 9 (excl.) by step of 3"
    #                  that is items 5 and 5+3
    #
    # replace by [5] if you only want the item at index 5
    # replace by [5:9] if you want items from 5 to 9 (excl.)
from pprint import pprint
pprint(result)
Dirty because of the lack of error handling...
... anyway, given your sample data, this produces:
[['BBBB', 'BBBB'], ['TEST', 'CCCC']]
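As for the missing error handling, here is a minimal Python 3 sketch of the same idea that skips lines that are too short instead of raising IndexError. The function name extract_fields and its parameters are made up for this illustration, and it works on any iterable of lines (an open file included):

```python
def extract_fields(lines, prefix="a(", indexes=(5, 8)):
    """Return the comma-separated fields at `indexes` for lines starting with `prefix`."""
    rows = []
    for line in lines:
        if not line.startswith(prefix):
            continue
        # Drop the trailing newline and split on commas.
        fields = line.rstrip("\n").split(",")
        # Skip malformed lines that are too short instead of raising IndexError.
        if len(fields) <= max(indexes):
            continue
        rows.append([fields[i] for i in indexes])
    return rows


sample = [
    "b(1,4,8,1,4,TEST,0,3,AAAA,Test,2-150,000)\n",
    "a(1,1,3,1,3,BBBB,0,3,BBBB,Test,2-150,000)\n",
    "a(1,0,2,1,4,TEST,0,3,CCCC,Test,2-150,000)\n",
    "a(too,short)\n",
]
print(extract_fields(sample))
```

With a real file you would pass the open file object instead of the sample list.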

I would use the readlines function:
with open("data.txt","r") as f:
    lines = f.readlines()
# no f.close() needed: the with block closes the file
for line in lines:
    if line[0:2] == 'a(':
        data1 = line.split(',')[5]
        data2 = line.split(',')[8]
        print(data1, data2)

You should check the full condition at the start of the line, i.e. a( instead of a. Also, you can use split to create a list out of your string, splitting on ",":
infile = open("text.txt","r")
for line in infile:
    if line.startswith("a("): # Starts with a(
        data = line.split(',')
        print data[5] # Print data at index 5
        print data[8] # Print data at index 8
infile.close()

for line in [l for l in infile if l.startswith('a(')]:
    line = line[line.find('('):].strip('()\n').split(',')
    a_field, other_field = line[5], line[8]
You split the string already, so just index into the list to get the fields you want.

Related

How to convert multicharacter single line into string in Python

Hello, I have lines like the ones below in a file.
I want to convert everything from Text:0 to 8978 into a single string, and the same for the other part, i.e. Text:1 to 8978.
Text:0
6786993cc89 70hgsksgoop 869368
7897909086h fhsi799hjdkdh 099h
Gsjdh768hhsj dg9978hhjh98 8978
Text:1
8786993cc89 70hgsksgoop 869368
7897909086h fhsi799hjdkdh 099h
Gsjdh768hhsj dg9978hhjh98 8978
I am getting output as
6
7
G
8
7
G
But I want the output to be the first character of string one and of string two:
6
8
The code is:
file = open ('tem.txt','r')
lines = file.readlines()
print(lines)
for line in lines:
    line=line.strip()
    linex=line.replace(' ','')
    print(linex)
    print (linex[0])
I'm not sure exactly what you need, so:
#1. If you only need to print the first character (6), I think your code is right.
#2. If you need to print the first part of the string (before the space), this can help you:
line="6786993cc8970hgsksgoop869368 7897909086hfhsi799hjdkdh099h Gsjdh768hhsjdg9978hhjh988978"
print(line[0])
print(line.split(' ')[0])
EDIT
To read a file:
file = open('file.txt', 'r')
Lines = file.readlines()
file.close()
for line in Lines:
    print(line.split(' ')[0])
New EDIT
First you need to join the lines of each section, and after that get the first element. Try this, please:
file = open ('tem.txt','r')
lines = file.readlines()
file.close()
linesArray = []
lineTemp = ""
for line in lines:
    if 'Text' in line:
        if lineTemp:
            linesArray.append(lineTemp)
            lineTemp = ""
    else:
        lineTemp += line.strip()
linesArray.append(lineTemp)
for newline in linesArray:
    print(newline.split(' ')[0][0])
This should work only if you want to view the first character. Essentially, this code will read your text file, convert multiple lines in the text file to one single string and print out the required first character.
with open(r'tem.txt', 'r') as f:
    data = f.readlines()
line = ''.join(data)
print(line[0])
EDITED RESPONSE
Try using regex. Hope this helps.
import re
pattern = re.compile(r'(Text:[0-9]+\s)+')
with open(r'tem.txt', 'r') as f:
    data = f.readlines()
data = [i for i in data if len(i.strip())>0]
line = ' '.join([i.strip() for i in data if len(i)>0]).strip()
occurences = re.findall(pattern, line)
for i in occurences:
    match_i = re.search(i, line)
    start = match_i.end()
    print(line[start])
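The grouping idea used in the answers above (start a new string at every "Text:" line and append the following lines to it) can also be sketched as a small helper. This is only an illustration in Python 3; the name group_sections and the marker parameter are made up here, and it assumes every section starts with a line beginning with "Text:":

```python
def group_sections(lines, marker="Text:"):
    """Join the lines that follow each marker line into one string per section."""
    sections = []
    current = None  # None until the first marker line is seen
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(marker):
            if current is not None:
                sections.append(current)
            current = ""
        elif current is not None:
            # Remove inner spaces so the section becomes one unbroken string.
            current += stripped.replace(" ", "")
    if current is not None:
        sections.append(current)
    return sections


sample = [
    "Text:0\n",
    "6786993cc89 70hgsksgoop 869368\n",
    "7897909086h fhsi799hjdkdh 099h\n",
    "Gsjdh768hhsj dg9978hhjh98 8978\n",
    "Text:1\n",
    "8786993cc89 70hgsksgoop 869368\n",
    "7897909086h fhsi799hjdkdh 099h\n",
    "Gsjdh768hhsj dg9978hhjh98 8978\n",
]
for section in group_sections(sample):
    print(section[0])
```

For the sample data this prints 6 and 8, the first character of each joined section.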

Changing the contents of a text file and making a new file with same format

I have a big text file with a lot of parts. Every part has 4 lines and next part starts immediately after the last part.
The first line of each part starts with #, the 2nd line is a sequence of characters, the 3rd line is a + and the 4th line is again a sequence of characters.
Small example:
#M00872:462:000000000-D47VR:1:1101:15294:1338 1:N:0:ACATCG
TGCTCGGTGTATGTAAACTTCCGACTTCAACTGTATAGGGATCCAATTTTGACAAAATATTAACGCTTATCGATAAAATTTTGAATTTTGTAACTTGTTTTTGTAATTCTTTAGTTTGTATGTCTGTTGCTATTATGTCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAAGTTAACCGATACCGTCGAGATCCGTTCACTAATCGAACGGATCTGTCTCTGTCTCTCTC
+
BAABBADBBBFFGGGGGGGGGGGGGGGHHGHHGH55FB3A3GGH3ADG5FAAFEGHHFFEFHD5AEG1EF511F1?GFH3#BFADGD55F?#GFHFGGFCGG/GHGHHHHHHHDBG4E?FB?BGHHHHHHHHHHHHHHHHHFHHHHHHHHHGHGHGHHHHHFHHHHHGGGGHHHHGGGGHHHHHHHGHGHHHHHHFGHCFGGGHGGGGGGGGFGGEGBFGGGGGGGGGFGGGGFFB9/BFFFFFFFFFF/
I want to change the 2nd and the 4th line of each part and make a new file with similar structure (4 lines for each part). In fact I want to keep the 1st 65 characters (in lines 2 and 4) and remove the rest of characters. The expected output for the small example would look like this:
#M00872:462:000000000-D47VR:1:1101:15294:1338 1:N:0:ACATCG
TGCTCGGTGTATGTAAACTTCCGACTTCAACTGTATAGGGATCCAATTTTGACAAAATATTAACG
+
BAABBADBBBFFGGGGGGGGGGGGGGGHHGHHGH55FB3A3GGH3ADG5FAAFEGHHFFEFHD5A
I wrote the following code:
infile = open("file.fastq", "r")
new_line=[]
for line_number in len(infile.readlines()):
    if line_number ==2 or line_number ==4:
        new_line.append(infile[line_number])
with open('out_file.fastq', 'w') as f:
    for item in new_line:
        f.write("%s\n" % item)
but it does not return what I want. How can I fix it to get the expected output?
This code will achieve what you want:
from itertools import islice
with open('bio.txt', 'r') as infile:
    while True:
        lines_gen = list(islice(infile, 4))
        if not lines_gen:
            break
        a,b,c,d = lines_gen
        b = b[0:65]+'\n'
        d = d[0:65]+'\n'
        with open('mod_bio.txt', 'a+') as f:
            f.write(a+b+c+d)
How does it work?
We first use islice to read 4 lines at a time from the file, as you mention.
Then we unpack those lines into the individual lines a, b, c, d and slice b and d. Finally we join the four lines and append them to a new file.
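The same "4 lines at a time" idea can be written as a generator that works on any iterable of lines, which makes it easy to try without a file. This is a Python 3 sketch under the assumption that every record has exactly 4 lines; trim_records and keep are names invented for this example:

```python
from itertools import islice


def trim_records(lines, keep=65):
    """Yield the lines of each 4-line record, truncating lines 2 and 4 to `keep` characters."""
    iterator = iter(lines)
    while True:
        record = list(islice(iterator, 4))
        if len(record) < 4:
            break  # end of input (an incomplete trailing record is dropped)
        header, sequence, separator, quality = (part.rstrip("\n") for part in record)
        yield header + "\n"
        yield sequence[:keep] + "\n"
        yield separator + "\n"
        yield quality[:keep] + "\n"


sample = ["#read1\n", "ACGT" * 30 + "\n", "+\n", "F" * 120 + "\n"]
trimmed = list(trim_records(sample))
print([len(line.rstrip("\n")) for line in trimmed])
```

To process a real file, you could open the input and the output with one with statement each and call writelines(trim_records(infile)) on the output, so the output file is opened only once.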
I think some itertools.cycle could be nice here:
import itertools
with open("transformed.file.fastq", "w+") as output_file:
    with open("file.fastq", "r") as input_file:
        for i in itertools.cycle((1,2,3,4)):
            line = input_file.readline().strip()
            if not line:
                break
            if i in (2,4):
                line = line[:65]
            output_file.write("{}\n".format(line))
readlines() returns a list of the lines in your file. You don't need to build a separate new_line list. Iterate directly over the index-value pairs of the list and modify the values at the desired positions.
Modifying your code, try this:
with open("file.fastq", "r") as infile:
    new_lines = infile.readlines()
for i, t in enumerate(new_lines):
    # lines 2 and 4 of every 4-line part have index 1 and 3 modulo 4
    if i % 4 in (1, 3):
        # trim to 65 characters and restore the newline that slicing cuts off
        new_lines[i] = t.rstrip('\n')[:65] + '\n'
with open('out_file.fastq', 'w') as f:
    for item in new_lines:
        f.write("%s" % item)

how to count empty lines in python file

I would like to print the total number of empty lines using Python. I have been trying with:
f = open('file.txt','r')
for line in f:
    if (line.split()) == 0:
but I am not able to get the proper output.
I have been trying to print it, but it prints the value 0. I am not sure what is wrong with the code:
print "\nblank lines are",(sum(line.isspace() for line in fname))
It prints:
blank lines are 0
There are 7 lines in the file.
There are 46 characters in the file.
There are 8 words in the file.
Since the empty string is a falsy value, you may use .strip():
for line in f:
    if not line.strip():
        ....
The above also treats lines containing only whitespace as empty.
If you want completely empty lines you may want to use this instead:
    if line in ['\r\n', '\n']:
        ...
Please use a context manager (with statement) to open files:
with open('file.txt') as f:
    print(sum(line.isspace() for line in f))
line.isspace() returns True (== 1) if line doesn't have any non-whitespace characters, and False (== 0) otherwise. Therefore, sum(line.isspace() for line in f) returns the number of lines that are considered empty.
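To see the counting trick in isolation, here is a tiny Python 3 sketch that runs on a list of strings instead of a file. The helper name count_blank_lines and the sample data are assumptions made for this example:

```python
def count_blank_lines(lines):
    """Count the lines that consist only of whitespace (including the newline)."""
    # bool is a subclass of int, so each True adds 1 and each False adds 0.
    return sum(line.isspace() for line in lines)


sample = ["first\n", "\n", "   \n", "second\n", "\t\n", "third"]
print(count_blank_lines(sample))
```

Passing an open file object instead of the list counts the blank lines of that file.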
line.split() always returns a list. Both
if line.split() == []:
and
if not line.split():
would work.
FILE_NAME = 'file.txt'
empty_line_count = 0
with open(FILE_NAME,'r') as fh:
    for line in fh:
        # The split method splits the line into a list of words. If the
        # line is empty, split returns an empty list. ' == [] ' checks
        # whether the list is empty.
        if line.split() == []:
            empty_line_count += 1
print('Empty Line Count : ' , empty_line_count)

How to print next line in python

I am trying to print next 3 lines after a match
for example input is :
Testing
Result
test1 : 12345
test2 : 23453
test3 : 2345454
So I am trying to search for the "Result" string in the file and print the next 3 lines after it.
The output should be:
test1 : 12345
test2 : 23453
test3 : 2345454
My code is:
with open(filename, 'r+') as f:
    for line in f:
        print line
        if "Benchmark Results" in f:
            print f
            print next(f)
It's only giving me the output:
testing
How do I get my desired output? Help, please.
First you need to check that the text is in the line (not in the file object f), and then you can use islice to take the next 3 lines from f and print them, e.g.:
from itertools import islice
with open(filename) as f:
    for line in f:
        if 'Result' in line:
            print(''.join(islice(f, 3)))
The loop will continue from the line after the three printed. If you don't want that - put a break inside the if.
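A variant of this that stops after the first match (the "break" case) can be sketched as a function returning the lines instead of printing them. This is Python 3; lines_after, needle and count are hypothetical names chosen for the example:

```python
from itertools import islice


def lines_after(lines, needle, count=3):
    """Return up to `count` lines that follow the first line containing `needle`."""
    iterator = iter(lines)
    for line in iterator:
        if needle in line:
            # islice consumes the following lines from the same iterator.
            return list(islice(iterator, count))
    return []


sample = [
    "Testing\n",
    "Result\n",
    "test1 : 12345\n",
    "test2 : 23453\n",
    "test3 : 2345454\n",
]
print("".join(lines_after(sample, "Result")), end="")
```

Because islice simply stops when the input runs out, a match near the end of the file returns fewer than 3 lines rather than raising an error.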
I would suggest opening the file and splitting its content into lines, assigning the outcome to a variable so you can manipulate the data more comfortably:
file = open("test.txt").read().splitlines()
Then you can just check which line contains the string "Result", and print the three following lines:
for index, line in enumerate(file):
    if "Result" in line:
        print(file[index+1:index+4])
You are testing (and printing) "f" instead of "line". Be careful about that. 'f' is the file object; line has your data.
with open(filename, 'r+') as f:
    line = f.readline()
    while(line):
        if "Benchmark Results" in line:
            # Current line matches, print next 3 lines
            print(f.readline(),end="")
            print(f.readline(),end="")
            print(f.readline(),end="")
        line = f.readline()
It waits for the first "Result" in the file and then prints the rest of the input:
import re, sys
bool = False
with open("input.txt", 'r+') as f:
    for line in f:
        if bool == True:
            sys.stdout.write(line)
        if re.search("Result",line): # to match the whole line, you could also use: if "Result\n" == line:
            bool = True
If you want to stop after the first 3 printed lines, you may add a variable cnt = 0 and change this part of the code (for example this way):
        if bool == True:
            sys.stdout.write(line)
            cnt = cnt+1
            if cnt == 3:
                break
with open('data', 'r') as f:
    lines = [ line.strip() for line in f]
# get "Result" index
ind = lines.index("Result")
# get slice, add 4 since upper bound is non inclusive
li = lines[ind:ind+4]
print(li)
['Result', 'test1 : 12345', 'test2 : 23453', 'test3 : 2345454']
or as exercise with regex:
import re
with open('data', 'r') as f:
    text = f.read()
# regex assumes that data as shown, ie, no blank lines between 'Result'
# and the last needed line.
mo = re.search(r'Result(.*?\n){4}', text, re.MULTILINE|re.DOTALL)
print(mo.group(0))
Result
test1 : 12345
test2 : 23453
test3 : 2345454

Python line.split between two delimeters

I have a text file that contains the following data:
Schema:
Column Name Localized Name Type MaxLength
---------------------------- ---------------------------- ------ ---------
Raw Binary Binary 16384
Row 1:
Binary:
-----BEGIN-----
fdsfdsfdasadsad
fsdfafsdafsadfa
fsdafadsfadsfdsa
-----END-----
Row 2:
Binary:
-----BEGIN-----
fsdfdssd
fdsfadsfasd
fsdafdsa
-----END-----
Row 3:
Binary:
-----BEGIN-----
fsdafadsds
fsdafasdsda
fdsafadssad
-----END-----
I need to extract the data between the "-----BEGIN-----" and "-----END-----" delimiters into an array.
This is what I've tried:
data = open("test_data.txt", 'r')
result = [line.split('-----BEGIN-----') for line in data.readlines()]
print data
However, this obviously gets all of the data after the '-----BEGIN-----' delimiter.
How can I add the end delimiter?
Note that the file is quite large, around 1 GB.
If there are multiple lines between the delimiters and you want the data separated into sections, just catch each block beginning with -----BEGIN----- and keep adding lines until you reach -----END-----:
with open("file.txt") as f:
    out = []
    for line in f:
        if line.rstrip() == "-----BEGIN-----":
            tmp = []
            for line in f:
                if line.rstrip() == "-----END-----":
                    out.append(tmp)
                    break
                tmp.append(line)
The sections will be split into sublists:
[['fdsfdsfdasadsad\n', 'fsdfafsdafsadfa\n', 'fsdafadsfadsfdsa\n'], ['fsdfdssd\n', 'fdsfadsfasd\n', 'fsdafdsa \n'], ['fsdafadsds\n', 'fsdafasdsda\n', 'fdsafadssad\n']]
Use with to open your files, and don't call readlines unless you want a list; you can iterate over the file object as above without storing all the content in memory.
Or using itertools.takewhile to get the sections :
from itertools import takewhile, imap
with open("file.txt") as f:
    f = imap(str.rstrip,f) # use map for python3
    out = [list(takewhile(lambda x: x != "-----END-----",f)) for line in f if line == "-----BEGIN-----"]
print(out)
[['fdsfdsfdasadsad', 'fsdfafsdafsadfa', 'fsdafadsfadsfdsa'],
['fsdfdssd', 'fdsfadsfasd', 'fsdafdsa'],
['fsdafadsds', 'fsdafasdsda', 'fdsafadssad']]
If you want a single list of all the words you can chain:
from itertools import takewhile,chain, imap
with open("file.txt") as f:
    f = imap(str.rstrip,f)
    out = chain.from_iterable(takewhile(lambda x: x != "-----END-----",f) for line in f if line == "-----BEGIN-----")
    print(list(out))
['fdsfdsfdasadsad', 'fsdfafsdafsadfa', 'fsdafadsfadsfdsa',
'fsdfdssd', 'fdsfadsfasd', 'fsdafdsa', 'fsdafadsds', 'fsdafasdsda', 'fdsafadssad']
A file object is its own iterator, so every time we iterate over it or call takewhile we consume lines. takewhile keeps taking lines until we hit -----END-----, then we continue iterating until we hit another -----BEGIN----- line. If the delimiter lines always start with - and no other lines do, you can check just that condition, i.e. if line[0] == "-" and x[0] != "-", instead of comparing the full line.
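The shared-iterator behaviour described here can be tried without a file. The following Python 3 sketch (where map replaces imap) uses a generator over a list of lines; extract_blocks and its parameters are names invented for the illustration:

```python
from itertools import takewhile


def extract_blocks(lines, start="-----BEGIN-----", end="-----END-----"):
    """Return one list of stripped lines per start/end delimited block."""
    stripped = (line.rstrip() for line in lines)  # a single shared iterator
    blocks = []
    for line in stripped:
        if line == start:
            # takewhile pulls from the same iterator and also consumes the end line.
            blocks.append(list(takewhile(lambda x: x != end, stripped)))
    return blocks


sample = """Row 1:
Binary:
-----BEGIN-----
fdsfdsfdasadsad
fsdfafsdafsadfa
-----END-----
Row 2:
Binary:
-----BEGIN-----
fsdfdssd
-----END-----
""".splitlines(keepends=True)
print(extract_blocks(sample))
```

Since only one block is held in memory while it is being collected, the same function can be fed an open file object; for a very large file you would probably process each block as it is found instead of appending all of them to a list.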
If you wanted to process each section you could use a generator expression and work on the lines from each section:
with open("file.txt") as f:
    f = imap(str.rstrip,f)
    out = ((takewhile(lambda x: x != "-----END-----",f)) for line in f if line == "-----BEGIN-----")
    for sec in out:
        print(list(sec))
['fdsfdsfdasadsad', 'fsdfafsdafsadfa', 'fsdafadsfadsfdsa']
['fsdfdssd', 'fdsfadsfasd', 'fsdafdsa']
['fsdafadsds', 'fsdafasdsda', 'fdsafadssad']
If you want a single string call join:
with open("file.txt") as f:
    f = imap(str.rstrip,f)
    st, end = "-----BEGIN-----", "-----END-----"
    out = "".join(chain.from_iterable(takewhile(lambda x: x != end,f)
                                      for line in f if line == st))
print(out)
Output:
fdsfdsfdasadsadfsdfafsdafsadfafsdafadsfadsfdsafsdfdssdfdsfadsfasdfsdafdsafsdafadsdsfsdafasdsdafdsafadssad
To get a single string keeping -----BEGIN----- and -----END-----:
with open("out.txt") as f:
    f = imap(str.rstrip,f)
    st, end = "-----BEGIN-----", "-----END-----"
    out = "".join(["{}{}{}".format(st, "".join(takewhile(lambda x: x != end, f)), end)
                   for line in f if line == st])
Output:
-----BEGIN-----fdsfdsfdasadsadfsdfafsdafsadfafsdafadsfadsfdsa-----END----------BEGIN-----fsdfdssdfdsfadsfasdfsdafdsa-----END----------BEGIN-----fsdafadsdsfsdafasdsdafdsafadssad-----END-----
Try this:
array1 =[]
with open('test_data.txt','r') as infile:
    copy = False
    for line in infile:
        if line.strip() == "-----BEGIN-----":
            copy = True
        elif line.strip() == "-----END-----":
            copy = False
        elif copy:
            array1.append(line)
This should do what you need.
If your file is small enough to load the whole thing into memory, then using a Regular Expression (aka regex) is probably the best approach.
import re
beginstr = '\n-----BEGIN-----\n'
endstr = '-----END-----\n'
pat = re.compile(beginstr + '(.*?\n)' + endstr, re.DOTALL)
with open('test_data.txt', 'r') as f:
    data = f.read()
result = pat.findall(data)
for row in result:
    print repr(row)
Output:
'fdsfdsfdasadsad\nfsdfafsdafsadfa\nfsdafadsfadsfdsa\n'
'fsdfdssd\nfdsfadsfasd\nfsdafdsa \n'
'fsdafadsds\nfsdafasdsda\nfdsafadssad\n'
That code creates a compiled regex pattern; it's not strictly necessary in this case, since we're only using the pattern once, but it does make the code look neater, IMHO.
That regex looks for substrings delimited by 'beginstr' and '\n' + endstr. The findall call only captures the stuff between those delimiters, due to use of the grouping parentheses. I've put a '\n' inside those parentheses so that the captured substrings will always have a trailing newline.
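For reference, the same capture-group idea looks like this in Python 3, run against an in-memory string. It is only a sketch: BLOCK_RE and find_blocks are invented names, the pattern anchors the delimiters to line starts with re.MULTILINE (a small variation on the pattern above), and the whole text still has to fit in memory, which may rule it out for a file of about 1 GB:

```python
import re

# (.*?) lazily captures everything between the two delimiter lines;
# re.DOTALL lets "." match newlines, re.MULTILINE lets ^ and $ match at line boundaries.
BLOCK_RE = re.compile(r"^-----BEGIN-----\n(.*?)^-----END-----$", re.DOTALL | re.MULTILINE)


def find_blocks(text):
    """Return the text between each BEGIN/END pair, with its trailing newline."""
    return BLOCK_RE.findall(text)


sample = (
    "Row 1:\nBinary:\n-----BEGIN-----\nabc\ndef\n-----END-----\n"
    "Row 2:\nBinary:\n-----BEGIN-----\nghi\n-----END-----\n"
)
for block in find_blocks(sample):
    print(repr(block))
```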
You can use itertools.ifilter :
from itertools import ifilter
with open('a1.txt') as f,open('a1.txt') as g :
    f.next()
    it=f
    print [i.strip() for i in ifilter(lambda x:next(f).strip()=='-----END-----',g)]
result :
['fdsfdsfdasadsad', 'fsdfdssd', 'fsdafadsds']
If the file is not huge use re.findall :
>>> re.findall('-----BEGIN-----\n(.*?)\n-----END-----',open('file_name').read(),re.M|re.DOTALL)
['fdsfdsfdasadsad', 'fsdfdssd', 'fsdafadsds']
Or without itertools you can use following recipe :
with open('a1.txt') as f,open('a1.txt') as g :
    f.next()
    it=f
    for line in g :
        try :
            # next(f) is the call that can raise StopIteration
            n=next(f)
        except StopIteration:
            break
        if n.strip()=='-----END-----':
            print line
result :
fdsfdsfdasadsad
fsdfdssd
fsdafadsds
Note that a file object is an iterator, so you can get its next item with the next function in each iteration. The file is opened twice so that f always runs one line ahead of g: if the line after the current one (stripped) equals '-----END-----', we print the current line.
split alone is just fine, no need for other tools. Just also split off the end marker and everything after it:
with open("file.txt") as f:
    blocks = [part.split('-----END-----')[0].strip()
              for part in f.read().split('-----BEGIN-----')[1:]]
