How to read from file after some 'mark'? - python

For example, say I have a text / log file with a very simple structure: it consists of a few parts, each with its own format, separated by a marker line, e.g.:
0x23499 0x234234 0x234234
...
0x34534 0x353454 0x345464
$$$NEW_SECTION$$$
4345-34534-345-345345-3453
3453-34534-346-766788-3534
...
So, how can I read the file by these parts? E.g. read everything before the $$$NEW_SECTION$$$ mark into one variable, and everything after it into another (without using regexps, etc.). Are there any simple solutions for that?

Here is a solution that does not read the whole file into memory:
data1 = []
pos = 0
with open('data.txt', 'r') as f:
    line = f.readline()
    while line and not line.startswith('$$$'):
        data1.append(line)
        line = f.readline()
    # position just after the marker line
    pos = f.tell()
data2 = []
with open('data.txt', 'r') as f:
    f.seek(pos)
    for line in f:
        data2.append(line)
print(data1)
print(data2)
The first loop cannot be written as for line in f, because file iteration reads ahead and f.tell() would then no longer give the accurate position in the file.
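As a variation on the same idea, the file does not have to be reopened: once the marker line has been consumed, the same file object simply continues with the second part. A minimal sketch, assuming the marker line starts with `$$$`; `split_at_marker` is a hypothetical helper name and `io.StringIO` stands in for the real file:

```python
import io
from itertools import takewhile

def split_at_marker(f, prefix='$$$'):
    """Return (lines before the marker line, lines after it)."""
    # takewhile stops at the first line that starts with the prefix;
    # that marker line is consumed from f and dropped.
    before = list(takewhile(lambda line: not line.startswith(prefix), f))
    # Whatever is still unread in f belongs to the second part.
    after = list(f)
    return before, after

# io.StringIO stands in for open('data.txt') in this sketch
sample = io.StringIO(
    "0x23499 0x234234 0x234234\n"
    "0x34534 0x353454 0x345464\n"
    "$$$NEW_SECTION$$$\n"
    "4345-34534-345-345345-3453\n"
    "3453-34534-346-766788-3534\n"
)
data1, data2 = split_at_marker(sample)
print(data1)
print(data2)
```

If the file has no marker line at all, everything ends up in the first list and the second one is empty.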

The simplest solution is str.split
>>> s = filecontents.split("$$$NEW_SECTION$$$")
>>> s[0]
'0x23499 0x234234 0x234234\n\n0x34534 0x353454 0x345464\n'
>>> s[1]
'\n4345-34534-345-345345-3453\n3453-34534-346-766788-3534'
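When there is only one marker, `str.partition` is a close relative of `split`: it splits at the first occurrence only and also reports whether the marker was found. A small sketch, with a sample string standing in for the file contents:

```python
MARK = "$$$NEW_SECTION$$$"

# Sample text standing in for the contents of the file
filecontents = (
    "0x23499 0x234234 0x234234\n"
    "0x34534 0x353454 0x345464\n"
    "$$$NEW_SECTION$$$\n"
    "4345-34534-345-345345-3453\n"
    "3453-34534-346-766788-3534\n"
)

# partition splits at the first occurrence only; `found` is '' when
# the marker is missing, in which case `after` is '' as well
before, found, after = filecontents.partition(MARK)

# splitlines drops the line endings; strip removes the newline that
# followed the marker
first_part = before.splitlines()
second_part = after.strip().splitlines()
print(first_part)
print(second_part)
```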

Solution 1:
If the file is not very big, then:
with open('your_log.txt') as f:
    parts = f.read().split('$$$NEW_SECTION$$$')
    if len(parts) > 0:
        part1 = parts[0]
        ...
Solution 2:
def FileParser(filepath):
    with open(filepath) as f:
        part = ''
        for line in f:
            if line.strip() == '$$$NEW_SECTION$$$':
                # marker reached: hand back the finished part and start a new one
                yield part
                part = ''
            else:
                part += line
        # the last part has no marker after it
        yield part
for segment in FileParser('your_log.txt'):
    print(segment)
Note: it is untested code so please validate before using it

Solution:
def sec(file_, sentinel):
    with open(file_) as f:
        section = []
        for i in iter(f.readline, ''):
            if i.rstrip() == sentinel:
                yield section
                section = []
            else:
                section.append(i)
        yield section
and use:
>>> from pprint import pprint
>>> pprint(list(sec('file.txt', '$$$NEW_SECTION$$$')))
[['0x23499 0x234234 0x234234\n', '0x34534 0x353454 0x345464\n'],
 ['4345-34534-345-345345-3453\n',
  '3453-34534-346-766788-3534\n',
  '3453-34534-346-746788-3534\n']]
>>>
To keep the sections in variables, or better, in a dict:
>>> sections = {}
>>> for n, section in enumerate(sec('file.txt', '$$$NEW_SECTION$$$')):
...     sections[n] = section
>>>
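The same grouping can also be sketched with `itertools.groupby`, which works on any iterable of lines rather than on a file path. This is only an illustration: `sections_of` is a hypothetical name, `io.StringIO` stands in for the file, and unlike the generator above this version silently drops empty sections (for example when two marker lines follow each other).

```python
import io
from itertools import groupby

def sections_of(lines, sentinel):
    """Yield lists of consecutive lines, split at lines equal to sentinel."""
    def is_marker(line):
        return line.rstrip() == sentinel
    # groupby bundles consecutive lines that share the same key value,
    # so marker lines and ordinary lines end up in alternating groups.
    for marker, group in groupby(lines, key=is_marker):
        if not marker:
            yield list(group)

sample = io.StringIO(
    "0x23499 0x234234 0x234234\n"
    "0x34534 0x353454 0x345464\n"
    "$$$NEW_SECTION$$$\n"
    "4345-34534-345-345345-3453\n"
    "3453-34534-346-766788-3534\n"
)
result = list(sections_of(sample, "$$$NEW_SECTION$$$"))
print(result)
```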

Related

Python: Separating txt file to multiple files using a reoccuring symbol

I have a .txt file of amino acids separated by ">node" like this:
Filename.txt :
>NODE_1
MSETLVLTRPDDWHVHLRDGAALQSVVPYTARQFARAIAMPNLKPPITTAEQAQAYRERI
KFFLGTDSAPHASVMKENSVCGAGCFTALSALELYAEAFEAAGALDKLEAFASFHGADFY
GLPRNTTQVTLRKTEWTLPESVPFGEAAQLKPLRGGEALRWKLD*
>NODE_2
MSTWHKVQGRPKAQARRPGRKSKDDFVTRVEHDAKNDALLQLVRAEWAMLRSDIATFRGD
MVERFGKVEGEITGIKGQIDGLKGEMQGVKGEVEGLRGSLTTTQWVVGTAMALLAVVTQV
PSIISAYRFPPAGSSAFPAPGSLPTVPGSPASAASAP*
I want to separate this file into two (or as many as there are nodes) files;
Filename1.txt :
>NODE
MSETLVLTRPDDWHVHLRDGAALQSVVPYTARQFARAIAMPNLKPPITTAEQAQAYRERI
KFFLGTDSAPHASVMKENSVCGAGCFTALSALELYAEAFEAAGALDKLEAFASFHGADFY
GLPRNTTQVTLRKTEWTLPESVPFGEAAQLKPLRGGEALRWKLD*
Filename2.txt :
>NODE
MSTWHKVQGRPKAQARRPGRKSKDDFVTRVEHDAKNDALLQLVRAEWAMLRSDIATFRGD
MVERFGKVEGEITGIKGQIDGLKGEMQGVKGEVEGLRGSLTTTQWVVGTAMALLAVVTQV
PSIISAYRFPPAGSSAFPAPGSLPTVPGSPASAASAP*
With a number after the filename
This code works, however it deletes the ">NODE" line and does not create a file for the last node (the one without a '>' afterwards).
with open('FilePathway') as fo:
    op = ''
    start = 0
    cntr = 1
    for x in fo.read().split("\n"):
        if x.startswith('>'):
            if start == 1:
                with open (str(cntr) + '.fasta','w') as opf:
                    opf.write(op)
                    opf.close()
                    op = ''
                    cntr += 1
            else:
                start = 1
        else:
            if op == '':
                op = x
            else:
                op = op + '\n' + x
fo.close()
I can't seem to find the mistake. Would be thankful if you could point it out to me.
Thank you for your help!
Hi again! Thank you for all the comments. With your help, I managed to get it to work perfectly. For anyone with similar problems, this is my final code:
import os
import glob
folder_path = 'FilePathway'
def writefasta(prefix, fileno, content):
    # write one node (header line plus sequence lines) to its own file
    with open(f'{prefix}{fileno}.fasta', 'w') as fout:
        fout.write(''.join(content))
for filename in glob.glob(os.path.join(folder_path, '*.fasta')):
    prefix = filename.replace(".fasta", "_")
    content = []
    fileno = 1
    with open(filename) as fin:
        for line in fin:
            # a new header closes the node collected so far
            if line.startswith('>') and content:
                writefasta(prefix, fileno, content)
                content = []
                fileno += 1
            content.append(line)
    # the last node has no header after it, so write it here
    if content:
        writefasta(prefix, fileno, content)
You could do it like this:
def writefasta(d):
    if len(d['content']) > 1:
        with open(f'Filename{d["fileno"]}.fasta', 'w') as fout:
            fout.write(''.join(d['content']))
        d['content'] = ['>NODE\n']
        d['fileno'] += 1
with open('test.fasta') as fin:
    D = {'content': ['>NODE\n'], 'fileno': 1}
    for line in fin:
        if line.startswith('>NODE'):
            writefasta(D)
        else:
            D['content'].append(line)
    writefasta(D)
Another way is to write only on odd iterations, so that the ">NODE" lines are skipped and files are created only for the real content. Note that this only works when every record is exactly two lines: a header followed by a sequence on a single line.
with open('filename.txt') as fo:
    cntr = 1
    for i, content in enumerate(fo.read().split("\n")):
        if i % 2 == 1:
            with open(str(cntr) + '.txt', 'w') as opf:
                opf.write(content)
            cntr += 1
By the way, since you are using a context manager, you don't need to close the file.
Context managers allow you to allocate and release resources precisely
when you want to. It opens the file, writes some data to it and then
closes it.
Please check: https://book.pythontips.com/en/latest/context_managers.html
with open('FileName') as fo:
    cntr = 1
    for line in fo.readlines():
        # the with block closes each output file, no close() needed
        with open(f'{cntr}.fasta', 'w') as opf:
            opf.write(line)
        cntr += 1
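The two problems from the question (the lost ">NODE" header and the missing last file) can both be avoided by collecting each record, header included, and flushing once more after the loop. A minimal sketch under a few assumptions: `split_fasta` is a hypothetical helper, the sequences are shortened sample data, and the records are only printed here rather than written to disk.

```python
import io

def split_fasta(lines):
    """Yield one list of lines per record; each record starts with a '>' header."""
    record = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith('>') and record:
            yield record
            record = []
        record.append(line)
    # The last record is not followed by a header, so emit it here.
    if record:
        yield record

# Shortened sample standing in for Filename.txt
sample = io.StringIO(
    ">NODE_1\n"
    "MSETLV\n"
    "KFFLGT\n"
    ">NODE_2\n"
    "MSTWHK\n"
)
records = list(split_fasta(sample))
for number, record in enumerate(records, start=1):
    print(f"Filename{number}.txt would contain:")
    print(''.join(record), end='')
```

To create the files, the `print` calls could be replaced by opening `f'Filename{number}.txt'` for writing and writing `''.join(record)` to it.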

Python 3.X combining similar lines in .txt files together

A question regarding combining values from a text file into a single variable and printing it.
An example I can give is a .txt file such as this:
School, 234
School, 543
I want to know the necessary steps to combine both "School" lines into a single variable "school" with a value of 777.
I know that we will need to open the .txt file for reading and then splitting it apart with the .split(",") method.
Code Example:
schoolPopulation = open("SchoolPopulation.txt", "r")
for line in schoolPopulation:
    line = line.split(",")
Could anyone please advise me on how to tackle this problem?
Python has a rich standard library, where you can find classes for many typical tasks. Counter is what you need in this situation:
from collections import Counter
c = Counter()
with open('SchoolPopulation.txt', 'r') as fh:
    for line in fh:
        name, val = line.split(',')
        c[name] += int(val)
print(c)
Something like this?
schoolPopulation = open("SchoolPopulation.txt", "r")
results = {}
for line in schoolPopulation:
    parts = line.split(",")
    name = parts[0].lower()
    val = int(parts[1])
    if name in results:
        results[name] += val
    else:
        results[name] = val
print(results)
schoolPopulation.close()
You could also use defaultdict and the with keyword.
from collections import defaultdict
with open("SchoolPopulation.txt", "r") as schoolPopulation:
    results = defaultdict(int)
    for line in schoolPopulation:
        parts = line.split(",")
        name = parts[0].lower()
        val = int(parts[1])
        results[name] += val
print(results)
If you'd like to display your results nicely you can do something like
for key in results:
    print("%s: %d" % (key, results[key]))
school = population = prev = ''
pop_count = 0
with open('SchoolPopulation.txt', 'r') as infile:
    for line in infile:
        line = line.split(',')
        school = line[0]
        population = int(line[1])
        if school == prev or prev == '':
            # same school as the previous line: add to the running total
            pop_count += population
        else:
            pass #do something else here
        prev = school
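Since the file is comma-separated, the `csv` module can take care of the splitting, including the space after the comma. A small sketch combining it with `Counter`; `io.StringIO` stands in for SchoolPopulation.txt, and the blank line in the sample is an assumption added to show that such lines are skipped:

```python
import csv
import io
from collections import Counter

# io.StringIO stands in for open('SchoolPopulation.txt', newline='')
sample = io.StringIO("School, 234\n\nSchool, 543\n")

totals = Counter()
# skipinitialspace drops the blank that follows each comma
for row in csv.reader(sample, skipinitialspace=True):
    if len(row) != 2:
        # blank or malformed line: ignore it
        continue
    name, value = row
    totals[name.strip()] += int(value)

print(totals["School"])
```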

How do I print the number of lines from a File that contains a specific word using Python?

This prints out the number of all the lines:
def links(htmlfile):
    infile = open('twolinks.html', 'r')
    content = infile.readlines()
    infile.close()
    return len(content)
    print("# of lines: " + str(content.count('</a>')))
But I only need the number of lines which contain </a> at the end.
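The loop way (each line read from a file still carries its trailing newline, so strip it before testing):
with open('twolinks.html') as f:
    count = 0
    for line in f:
        if line.rstrip().endswith('</a>'):
            count += 1
Using comprehension:
with open('twolinks.html') as f:
    count = sum(1 for line in f if line.rstrip().endswith('</a>'))
Or even shorter (summing booleans, treating them as 0s and 1s):
with open('twolinks.html') as f:
    count = sum(line.rstrip().endswith('</a>') for line in f)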
import re
with open('data') as f:
    print(sum( 1 for line in f if re.search('</a>',line) ))
num_lines = sum(1 for line in open('file') if '</a>' in line)
print(num_lines)
I guess that my answer is a bit longer in terms of code lines, but why not use an HTML parser, since you know that you are parsing HTML?
For instance:
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser
# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.count = 0
    def handle_endtag(self, tag):
        if tag == "a":
            self.count += 1
        print("Encountered an end tag :", tag)
        print(self.count)
# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1><a></a></body></html>')
This is modified code from the Python documentation. It is easier to adapt if you later need to collect other tags, or data within tags, etc.
Or you could do something like this:
count = 0
with open("file.txt", "r") as f:
    for line in f:
        if line.rstrip('\n').endswith('</a>'):
            count += 1
Worked great for me.
In general, you go through the file one line at a time
and see if the last characters (without the \n) match </a>.
See if the \n stripping gives you any trouble.
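The trailing newline is the usual stumbling block here, so a small comparison may help. A sketch with made-up sample lines (the HTML and the function names are assumptions for illustration): the naive test only matches a last line that has no newline, while stripping first matches every line that really ends with the tag.

```python
def count_naive(lines, suffix='</a>'):
    # Compares against the raw line, newline included
    return sum(line.endswith(suffix) for line in lines)

def count_stripped(lines, suffix='</a>'):
    # Removes trailing whitespace (including '\n') before comparing
    return sum(line.rstrip().endswith(suffix) for line in lines)

text = (
    '<a href="one.html">one</a>\n'
    '<p>no link here</p>\n'
    '<a href="two.html">two</a> and more text\n'
    '<a href="three.html">three</a>'
)
# keepends=True keeps the '\n' on each line, as file iteration does
lines = text.splitlines(keepends=True)

naive = count_naive(lines)
stripped = count_stripped(lines)
print(naive, stripped)
```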

trying to create a dictionary from a text file but

So, I have a text file (a paragraph) and I need to read it and create a dictionary that has each distinct word from the file as a key; the value for each key is an integer showing how often that word occurs in the file.
an example of what the dictionary should look like:
{'and':2, 'all':1, 'be':1, 'is':3} etc.
so far I have this,
def create_word_frequency_dictionary () :
    filename = 'dictionary.txt'
    infile = open(filename, 'r')
    line = infile.readline()
    my_dictionary = {}
    frequency = 0
    while line != '' :
        row = line.lower()
        word_list = row.split()
        print(word_list)
        print (word_list[0])
        words = word_list[0]
        my_dictionary[words] = frequency+1
        line = infile.readline()
    infile.close()
    print (my_dictionary)
create_word_frequency_dictionary()
any help would be appreciated thanks.
The documentation describes the collections module as "High-performance container datatypes". Consider using collections.Counter instead of re-inventing the wheel.
from collections import Counter
filename = 'dictionary.txt'
with open(filename, 'r') as infile:
    text = infile.read()
print(Counter(text.split()))
Update:
Okay, I fixed your code and now it works, but Counter is still a better option:
def create_word_frequency_dictionary () :
    filename = 'dictionary.txt'
    infile = open(filename, 'r')
    lines = infile.readlines()
    my_dictionary = {}
    for line in lines:
        row = str(line.lower())
        for word in row.split():
            if word in my_dictionary:
                my_dictionary[word] = my_dictionary[word] + 1
            else:
                my_dictionary[word] = 1
    infile.close()
    print (my_dictionary)
create_word_frequency_dictionary()
If you are not using version of python which has Counter:
>>> import collections
>>> words = ["a", "b", "a", "c"]
>>> word_frequency = collections.defaultdict(int)
>>> for w in words:
...     word_frequency[w] += 1
...
>>> print word_frequency
defaultdict(<type 'int'>, {'a': 2, 'c': 1, 'b': 1})
Just replace my_dictionary[words] = frequency+1 with my_dictionary[words] = my_dictionary.get(words, 0) + 1. (A plain my_dictionary[words]+1 would raise a KeyError the first time a word is seen.)
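For comparison, here is the plain-dict approach as a compact sketch that counts every word on every line rather than only the first one. The sample sentence is made up, `word_frequencies` is a hypothetical name, and `io.StringIO` stands in for dictionary.txt:

```python
import io

def word_frequencies(lines):
    """Map each lowercased word to the number of times it occurs."""
    counts = {}
    for line in lines:
        for word in line.lower().split():
            # get() supplies 0 the first time a word is seen
            counts[word] = counts.get(word, 0) + 1
    return counts

# io.StringIO stands in for open('dictionary.txt')
sample = io.StringIO("All is well and all is good\nand that is that\n")
frequencies = word_frequencies(sample)
print(frequencies)
```

Punctuation is not handled here: "good." and "good" would be counted as different words.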

Parsing specific contents in a file

I have a file that looks like this
!--------------------------------------------------------------------------DISK
[DISK]
DIRECTION = 'OK'
TYPE = 'normal'
!------------------------------------------------------------------------CAPACITY
[CAPACITY]
code = 0
ID = 110
I want to read sections [DISK] and [CAPACITY].. there will be more sections like these. I want to read the parameters defined under those sections.
I wrote a following code:
file_open = open(myFile,"r")
all_lines = file_open.readlines()
count = len(all_lines)
file_open.close()
my_data = {}
section = None
data = ""
for line in all_lines:
    line = line.strip() #remove whitespace
    line = line.replace(" ", "")
    if len(line) != 0: # remove white spaces between data
        if line[0] == "[":
            section = line.strip()[1:]
            data = ""
        if line[0] !="[":
            data += line + ","
            my_data[section] = [bit for bit in data.split(",") if bit != ""]
print my_data
key = my_data.keys()
print key
Unfortunately I am unable to get those sections and the data under that. Any ideas on this would be helpful.
As others already pointed out, you should be able to use the ConfigParser module.
Nonetheless, if you want to implement the reading/parsing yourself, you should split it up into two parts.
Part 1 would be the parsing at file level: splitting the file up into blocks (in your example you have two blocks: DISK and CAPACITY).
Part 2 would be parsing the blocks themselves to get the values.
You know you can ignore the lines starting with !, so let's skip those:
with open('myfile.txt', 'r') as f:
    content = [l for l in f.readlines() if not l.startswith('!')]
Next, read the lines into blocks:
def partition_by(l, f):
    t = []
    for e in l:
        if f(e):
            if t: yield t
            t = []
        t.append(e)
    yield t
blocks = partition_by(content, lambda l: l.startswith('['))
and finally read in the values for each block:
def parse_block(block):
    gen = iter(block)
    block_name = next(gen).strip()[1:-1]
    splitted = [e.split('=') for e in gen]
    values = {t[0].strip(): t[1].strip() for t in splitted if len(t) == 2}
    return block_name, values
result = [parse_block(b) for b in blocks]
That's it. Let's have a look at the result:
for section, values in result:
    print(section, ':')
    for k, v in values.items():
        print('\t', k, '=', v)
output:
DISK :
    DIRECTION = 'OK'
    TYPE = 'normal'
CAPACITY :
    code = 0
    ID = 110
Are you able to make a small change to the text file? If you can make it look like this (only changed the comment character):
#--------------------------------------------------------------------------DISK
[DISK]
DIRECTION = 'OK'
TYPE = 'normal'
#------------------------------------------------------------------------CAPACITY
[CAPACITY]
code = 0
ID = 110
Then parsing it is trivial:
from ConfigParser import SafeConfigParser
parser = SafeConfigParser()
parser.read('filename')
And getting data looks like this:
(Pdb) parser
<ConfigParser.SafeConfigParser instance at 0x100468dd0>
(Pdb) parser.get('DISK', 'DIRECTION')
"'OK'"
Edit based on comments:
If you're using <= 2.7, then you're a little SOL.. The only way really would be to subclass ConfigParser and implement a custom _read method. Really, you'd just have to copy/paste everything in Lib/ConfigParser.py and edit the values in line 477 (2.7.3):
if line.strip() == '' or line[0] in '#;': # add new comment characters in the string
However, if you're running Python 3 you are in luck: since version 3.2, ConfigParser (from the configparser module) accepts the comment_prefixes __init__ parameter, which lets you customize the prefixes:
parser = ConfigParser(comment_prefixes=('#', ';', '!'))
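Putting the Python 3 variant together on the file from the question, as a sketch: `read_string` is used here only so the example is self-contained, with the `SAMPLE` string standing in for the real file (which would be loaded with `parser.read('filename')`).

```python
import configparser

# Stand-in for the file from the question, '!' comment lines included
SAMPLE = """\
!--------------------------------------------------------------------------DISK
[DISK]
DIRECTION = 'OK'
TYPE = 'normal'
!------------------------------------------------------------------------CAPACITY
[CAPACITY]
code = 0
ID = 110
"""

# Treat '!' as a comment prefix in addition to the defaults '#' and ';'
parser = configparser.ConfigParser(comment_prefixes=('#', ';', '!'))
parser.read_string(SAMPLE)

for section in parser.sections():
    # Option names are lowercased by default; values stay strings,
    # so the quotes around 'OK' are part of the value.
    print(section, dict(parser[section]))

direction = parser.get('DISK', 'DIRECTION')
disk_id = parser.getint('CAPACITY', 'ID')
```

Option lookups are case-insensitive by default, which is why `'DIRECTION'` and `'ID'` still resolve even though the stored keys are lowercase.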
If the file is not big, you can load it and use Regexes to find parts that are of interest to you.
