Parsing block in file to Python list without newlines - python

I have a particular block of text inside a larger file of many contents. The file is arbitrarily long, can contain any character, and each line begins with a blank space. The block has the following form in the text file:
1\1\GINC-NODE9999\Scan\...
... ... ... ... ... ... ...
... ... ... ... ...\HF=-568
.8880019,-568.2343213, -568
.2343432, ... , -586.328492
1\RMSD=...
I'm interested in the particular sequence which lies between \HF= and \RMSD=, and I want to put these numbers into a Python list. The sequence is simply a series of comma-separated numbers; however, these numbers can roll over onto a second line. ALSO, \HF= and \RMSD= may themselves be broken by rolling over onto a new line.
Current Efforts
I currently have the following:
with open(infile) as data:
    d1 = []
    start = '\\HF'
    end = 'RMSD'
    should_append = False
    for line in data:
        if start in line:
            data = line[len(start):]
            d1.append(data)
            should_append = True
        elif end in line:
            should_append = False
            break
        elif should_append:
            d1.append(line)
which spits out the following list
['.6184082129,7.5129238742\\\\Version=EM64L-G09RevC.01\\
State=1-A\\HF=-568\n', ' .8880019,-568.8879907,-568.8879686,
-568.887937,-\n']
The problem is that not only do I have newlines throughout, I'm also keeping more data than I should. Furthermore, numbers that roll over onto other lines are given their own place in the list. I need it to look like:
['-568.8880019', '-568.8879907', ... ]

A multiline non-greedy regular expression can be used to extract the text that lies between \HF= and \RMSD=. Once the text is extracted, it should be trivially easy to tokenize into its constituent numbers:
import re
import os

pattern = r'\\HF=(.*?)\\RMSD='
pat = re.compile(pattern, re.DOTALL)
for number in pat.finditer(open('file.txt').read()):
    print number.group(1).replace(os.linesep, '').replace(' ', '').strip('\\')
...
-568 .8880019,-568.2343213, -568 .2343432, ... , -586.328492 1\

For a fast solution, you can implement naive string concatenation based on regular expressions. I implemented a short solution for your data format:
import re

def naiveDecimalExtractor(data):
    p = re.compile(r"(-?\d+)[\n\s]*(\d+\.\d+)[\n\s]*(\d+)")
    brokenNumbers = p.findall(data)
    return ["".join(n) for n in brokenNumbers]
data = """
1\1\GINC-NODE9999\Scan\...
... ... ... ... ... ... ...
... ... ... ... ...\HF=-568
.8880019,-568.2343213, -568
.2343432, ... , -586.328492
1\RMSD=...
"""
print naiveDecimalExtractor(data)
Regards,
And Past

Use something like this to join everything in one line:
with open(infile) as data:
    joined = ''.join(data.read().splitlines())
And then parse that without worrying about newlines.
If your file is really large you may want to consider another approach to avoid having it all in memory.
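A minimal sketch of that idea, using a shortened stand-in for the question's data (the values here are abbreviated), might look like:

```python
import re

# Abbreviated stand-in for the block described in the question.
text = (" ... ... ...\\HF=-568\n"
        " .8880019,-568.2343213, -568\n"
        " .2343432\\RMSD=...\n")

# Join the lines into one string, then parse without worrying about newlines.
joined = ''.join(text.splitlines())
match = re.search(r'\\HF=(.*?)\\RMSD=', joined)
numbers = match.group(1).replace(' ', '').split(',')
print(numbers)  # ['-568.8880019', '-568.2343213', '-568.2343432']
```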

How about something like this:
# open the file to read
f = open("test.txt")
# read the whole file and concatenate the lines into one big string
joined = " ".join(f.readlines())
# get the substring between \HF= and \RMSD=, then remove any newlines or spaces
values = joined[joined.find("\HF=") + 4:joined.find("\RMSD")].translate(None, "\n ")
# the string is now just numbers separated by commas, so split it into a list
# using the ',' delimiter
numbers = values.split(',')
Now numbers has:
['-568.8880019', '-568.2343213', '-568.2343432', '...', '-586.3284921']

I had something like this open and forgot to post - a "slightly" different answer that uses mmap'd files and re.finditer:
This has the advantage of dealing with larger files relatively efficiently as it allows the regex engine to see the file as one long string without it being in memory at once.
import mmap
import re

with open('/home/jon/blah.txt') as fin:
    mfin = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for match in re.finditer(r'\\HF=(.*?)\\RMSD=', mfin, re.DOTALL):
        print match.group(1).translate(None, '\n ').split(',')
        # ['-568.8880019', '-568.2343213', '-568.2343432', '...', '-586.3284921']


How to read strings as integers when reading from a file in python

I have the following code reading in a specific part of a text file. The problem is that these are numbers, not strings, so I want to convert them to ints and read them into a list of some sort.
This sample is not wholly representative; I have uploaded the full set of data as a text file here: http://s000.tinyupload.com/?file_id=08754130146692169643. A sample of the data is as follows:
*NSET, NSET=Nodes_Pushed_Back_IB
99915527, 99915529, 99915530, 99915532, 99915533, 99915548, 99915549, 99915550,
99915551, 99915552, 99915553, 99915554, 99915555, 99915556, 99915557, 99915558,
99915562, 99915563, 99915564, 99915656, 99915657, 99915658, 99915659, 99915660,
99915661, 99915662, 99915663, 99915664, 99915665, 99915666, 99915667, 99915668,
99915669, 99915670, 99915885, 99915886, 99915887, 99915888, 99915889, 99915890,
99915891, 99915892, 99915893, 99915894, 99915895, 99915896, 99915897, 99915898,
99915899, 99915900, 99916042, 99916043, 99916044, 99916045, 99916046, 99916047,
99916048, 99916049, 99916050
*NSET, NSET=Nodes_Pushed_Back_OB
Any help would be much appreciated.
Hi, I am still stuck with this issue - any more suggestions? The latest code and error message are below. Thanks!
import tkinter as tk
from tkinter import filedialog

file_path = filedialog.askopenfilename()
print(file_path)

data = []
data2 = []
data3 = []
flag = False
with open(file_path, 'r') as f:
    for line in f:
        if line.strip().startswith('*NSET, NSET=Nodes_Pushed_Back_IB'):
            flag = True
        elif line.strip().endswith('*NSET, NSET=Nodes_Pushed_Back_OB'):
            flag = False  # loop stops when condition is false, i.e. if false do nothing
        elif flag:  # as long as flag is true, append
            data.append([int(x) for x in line.strip().split(',')])
result is the following error:
ValueError: invalid literal for int() with base 10: ''
Instead of reading these as strings I would like each to be a number in a list, i.e [98932850 98932852 98932853 98932855 98932856 98932871 98932872 98932873]
In such cases I use regular expressions together with string methods. I would solve this problem like so:
import re

with open(filepath) as f:
    txt = f.read()
g = re.search(r'NSET=Nodes_Pushed_Back_IB(.*)', txt, re.S)
snums = g.group(1).replace(',', ' ').split()
numbers = [int(num) for num in snums]
I read the entire text into txt.
Next I use a regular expression, using the last portion of your header in the text as an anchor, and capture all the rest with capturing parentheses (the re.S flag means that a dot should also match newlines). I access all the numbers as one unit of text via g.group(1).
Next, I remove all the commas (actually, I replace them with spaces), because on the resulting text I use split(), which is an excellent function to use on text items that are separated by spaces - it doesn't matter how many spaces there are, it just splits the text as you would intend.
The rest is just converting the text to numbers using a list comprehension.
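As a quick illustration of the replace-then-split step, using a few numbers from the sample data:

```python
# Commas become spaces, and split() then ignores any amount of whitespace.
snums = ' 99915527,  99915529,\n99915530 '.replace(',', ' ').split()
numbers = [int(num) for num in snums]
print(numbers)  # [99915527, 99915529, 99915530]
```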
Your line contains more than one number, and some separating characters. You could parse that format by judicious application of split and perhaps strip, or you could minimize string handling by having re extract specifically the fields you care about:
ints = list(map(int, re.findall(r'-?\d+', line)))
This regular expression will find each group of digits, optionally prefixed by a minus sign, and then map will apply int to each such group found.
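For instance, applied to one line of the sample data:

```python
import re

line = '99915527, 99915529, 99915530, 99915532,\n'
# Find every digit group (optionally signed) and convert each to int.
ints = list(map(int, re.findall(r'-?\d+', line)))
print(ints)  # [99915527, 99915529, 99915530, 99915532]
```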
Using a sample of your string:
strings = ' 98932850, 98932852, 98932853, 98932855, 98932856, 98932871, 98932872, 98932873,\n'
I'd just split the string, strip the commas, and return a list of numbers:
numbers = [ int(s.strip(',')) for s in strings.split() ]
Based on your comment and the larger context of your code, I'd suggest a few things:
from itertools import groupby

number_groups = []
with open('data.txt', 'r') as f:
    for k, g in groupby(f, key=lambda x: x.startswith('*NSET')):
        if k:
            pass
        else:
            number_groups += list(filter('\n'.__ne__, list(g)))  # remove newlines in list

data = []
for group in number_groups:
    for str_num in group.strip('\n').split(','):
        if str_num.strip():  # skip empty strings left by trailing commas
            data.append(int(str_num))

Extract the lines between 2 specific tags

For a routine programming question, I need to extract some lines of text that are between 2 tags (delimiters, if I need to be more specific).
The file is something like this:
*some random text*
...
...
...
tag/delimiter 1
text 1 #extract
text 2 #extract
... #extract
... #extract
text n #extract
tag/ending_delimiter
*some random text*
...
...
...
tag/delimiter 2
text 1 #extract
text 2 #extract
... #extract
... #extract
text n #extract
tag/ending_delimiter
*some random text*
...
...
...
tag/delimiter n
text 1 #extract
text 2 #extract
... #extract
... #extract
text n #extract
tag/ending_delimiter
*some random text until the file ends*
The ending_delimiter is the same everywhere.
The starting delimiter, i.e delimiter 1, delimiter 2 upto n is taken from a list.
The catch is that, in the file, there are a few (fewer than 3) characters after each starting delimiter which, combined with the starting delimiter, work as an identifier for the lines of text until the ending_delimiter - a kind of "uid", technically.
So far, what I've tried is this:
data_file = open("file_name")
block = []
found = False

for elem in list_of_starting_delimiters:
    for line in data_file:
        if found:
            block.append(line)
            if re.match(attribute_end, line.strip()):
                break
        else:
            if re.match(elem, line.strip()):
                found = True
                block = elem
data_file.close()
I have also tried to implement the answers suggested in:
python - Read file from and to specific lines of text
but with no success.
The implementation I'm currently trying is one of the answers of the link above.
Any help is appreciated.
P.S: Using Python 2.7, on PyCharm, on Windows 10.
I suggest fixing your code the following way:
block = []
found = False
list_of_starting_delimiters = ['tag/delimiter']
attribute_end = 'tag/ending_delimiter'
curr = []

for elem in list_of_starting_delimiters:
    for line in data_file:
        if found:
            curr.append(line)
            if line.strip().startswith(attribute_end):
                found = False
                block.append("\n".join(curr))  # Add merged list to final list
                curr = []                      # Zero out current list
        else:
            if line.strip().startswith(elem):  # If line starts with start delimiter
                found = True
                curr.append(line.strip())      # Append line to current list

if len(curr) > 0:       # If there are still lines in the current list
    block.append(curr)  # Add them to the final list
See the Python demo
There are quite a lot of issues with your current code:
block = elem made block a byte string, and the subsequent .append call caused an exception
You only grabbed one occurrence of the block, because upon finding one you had a break statement
All the lines were added as separate items, while you needed to collect them into a list and then join them with \n to get the strings to put into the resulting list
You need no regex to check if a string appears at the start of another string; use the str.startswith method.
By the time I figured this out there were already a fair number of good responses, but my approach would be to resolve this with:
import re
pattern = re.compile(r"(^tag\/delimiter) (.{0,3})\n\n((^[\w\d #\.]*$\n)+)^(tag\/ending_delimiter)", re.M)
You could then find all matches in your text by doing:
for i in pattern.finditer(<target_text>):
    # do something with each match
or with pattern.findall(<target_text>), which returns a list of all matches.
This of course bears the stipulation that you need to specify different delimiters and compile a different regex pattern (re.compile) for each different delimiter, using variables and string concatenation as #SpghttCd shows in his answer
For more info see the python re module
What about
import re

with open(file, 'r') as f:
    txt = f.read()
losd = '|'.join(list_of_starting_delimiters)
enddel = 'attribute_end'
block = re.findall('(?:' + losd + r')([\s\S]*?)' + enddel, txt)
I would do that in the following way: for example purposes, let <d1>, <d2> and <d3> be our starting delimiters, <d> the ending delimiter, and string the text you are processing. Then the following line of code:
re.findall('(<d1>|<d2>|<d3>)(.+?)(<d>)', string, re.DOTALL)
will give a list of tuples, with each tuple containing the starting delimiter, the body and the ending delimiter. This code uses grouping inside the regular expression (the brackets); the pipe (|) acts similarly to or; the dot (.) combined with the DOTALL flag matches any character, including newlines; the plus (+) means 1 or more; and the question mark (?) makes the match non-greedy (this is important here, as otherwise you would get a single match beginning at the first starting delimiter and ending at the last ending delimiter).
My re-less solution would be the following:
list_of_starting_delimiters = ['tag/delimiter 1', 'tag/delimiter 2', 'tag/delimiter n']
enddel = 'tag/ending_delimiter'

block = {}
section = ''
with open(file, 'r') as f:
    for line in f:
        if line.strip() == enddel:
            section = ''
        if section:
            block[section] = block.get(section, '') + line
        if line.strip() in list_of_starting_delimiters:
            section = line.strip()

print(block)
It extracts the blocks into a dictionary with start delimiter tags as keys and according sections as values.
It requires the start and end tags to be the only content of their respective lines.
Output:
{'tag/delimiter 1':
'\ntext 1 #extract\n\ntext 2 #extract\n\n... #extract\n\n... #extract\n\ntext n #extract\n\n',
'tag/delimiter 2':
'\ntext 1 #extract\n\ntext 2 #extract\n\n... #extract\n\n... #extract\n\ntext n #extract\n\n',
'tag/delimiter n':
'\ntext 1 #extract\n\ntext 2 #extract\n\n... #extract\n\n... #extract\n\ntext n #extract\n\n'}

Match all names with exactly 5 digits at the end

I have a text file like this:
john123:
1
2
coconut_rum.zip
bob234513253:
0
jackdaniels.zip
nowater.zip
3
judy88009:
dontdrink.zip
9
tommi54321:
dontdrinkalso.zip
92
...
I have millions of entries like this.
I want to pick up the name and number where the number is exactly 5 digits long. I tried this:
matches = re.findall(r'\w*\d{5}:',filetext2)
but it's giving me results which have at least 5 digits.
['bob234513253:', 'judy88009:', 'tommi54321:']
Q1: How to find the names with exactly 5 digits?
Q2: I want to append the zip files which is associated with these names with 5 digits. How do I do that using regular expressions?
That's because \w includes digit characters:
>>> import re
>>> re.match('\w*', '12345')
<_sre.SRE_Match object at 0x021241E0>
>>> re.match('\w*', '12345').group()
'12345'
>>>
You need to be more specific and tell Python that you only want letters:
matches = re.findall(r'[A-Za-z]*\d{5}:',filetext2)
Regarding your second question, you can use something like the following:
import re

# Dictionary to hold the results
results = {}

# Break up the file text to get the names and their associated data.
# filetext2.split('\n\n') breaks it up into individual data blocks (one per person).
# Mapping to str.splitlines breaks each data block into single lines.
for name, *data in map(str.splitlines, filetext2.split('\n\n')):
    # See if the name matches our pattern.
    if re.match('[A-Za-z]*\d{5}:', name):
        # Add the name and the relevant data to the dictionary.
        # [:-1] gets rid of the colon on the end of the name.
        # The list comprehension gets only the file names from the data.
        results[name[:-1]] = [x for x in data if x.endswith('.zip')]
Or, without all the comments:
import re

results = {}
for name, *data in map(str.splitlines, filetext2.split('\n\n')):
    if re.match('[A-Za-z]*\d{5}:', name):
        results[name[:-1]] = [x for x in data if x.endswith('.zip')]
Below is a demonstration:
>>> import re
>>> filetext2 = '''\
... john123:
... 1
... 2
... coconut_rum.zip
...
... bob234513253:
... 0
... jackdaniels.zip
... nowater.zip
... 3
...
... judy88009:
... dontdrink.zip
... 9
...
... tommi54321:
... dontdrinkalso.zip
... 92
... '''
>>> results = {}
>>> for name, *data in map(str.splitlines, filetext2.split('\n\n')):
...     if re.match('[A-Za-z]*\d{5}:', name):
...         results[name[:-1]] = [x for x in data if x.endswith('.zip')]
...
>>> results
{'tommi54321': ['dontdrinkalso.zip'], 'judy88009': ['dontdrink.zip']}
>>>
Keep in mind though that it is not very efficient to read in all of the file's contents at once. Instead, you should consider making a generator function to yield the data blocks one at a time. Also, you can increase performance by pre-compiling your Regex patterns.
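One possible shape for such a generator, shown here on an in-memory sample rather than a real file (a sketch under those assumptions, not a drop-in implementation):

```python
import io
import re

# Pre-compiled pattern: letters, then exactly five digits, then the colon.
name_re = re.compile(r'[A-Za-z]*\d{5}:')

def blocks(f):
    """Yield one blank-line-separated record at a time instead of reading it all."""
    block = []
    for line in f:
        if line.strip():
            block.append(line.strip())
        elif block:
            yield block
            block = []
    if block:
        yield block

sample = u"john123:\ncoconut_rum.zip\n\njudy88009:\ndontdrink.zip\n9\n"
hits = [b[0] for b in blocks(io.StringIO(sample)) if name_re.match(b[0])]
print(hits)  # ['judy88009:']
```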
import re

results = {}
with open('datazip') as f:
    records = f.read().split('\n\n')
for record in records:
    lines = record.split()
    header = lines[0]
    # note that you need a raw string
    if re.match(r"[^\d]\d{5}:", header[-7:]):
        # in general multiple hits are possible, so put them into a list
        results[header] = [l for l in lines[1:] if l[-3:] == "zip"]
print results
Output
{'tommi54321:': ['dontdrinkalso.zip'], 'judy88009:': ['dontdrink.zip']}
Comment
I tried to keep it very simple; if your input is very long you should, as suggested by iCodez, implement a generator that yields one record at a time. For the regexp match I tried a little optimization, searching only the last 7 characters of the header.
Addendum: a simplistic implementation of a record generator
import re

def records(f):
    record = []
    for l in f:
        l = l.strip()
        if l:
            record.append(l)
        else:
            yield record
            record = []
    yield record

results = {}
for record in records(open('datazip')):
    head = record[0]
    if re.match(r"[^\d]\d{5}:", head[-7:]):
        results[head] = [r for r in record[1:] if r[-3:] == "zip"]
print results
You need to anchor the regex at the end of the name so that it won't match any further digits, using \b:
[a-zA-Z]+\d{5}\b
see for example http://regex101.com/r/oC1yO6/1
The regex would match
judy88009:
tommi54321:
python code would be like
>>> re.findall(r'[a-zA-Z]+\d{5}\b', x)
['judy88009', 'tommi54321']

Python trimming a non-standard segment in a string

How does one remove a header from a long string of text?
I have a program that displays a FASTA file as
...TCGATCATCGATCG>IonTorrenttrimmedcontig1$CCGTAGGTGAACCTGCGGAAG...
The string is large and contains multiple headers like this
So the headers that need to be trimmed start with a > and end with a $
There's multiple headers, ranging from IonTorrenttrimmedcontig1 to IonTorrenttrimmedcontig25
How can I cut on the > and the $, remove everything in between, and separate the code before and after into separate list elements?
The file is read from a standard FASTA file, so I'd be very happy to hear possible solutions for the input step as well.
As it is part of a FASTA file, you can split it like this:
>>> import re
>>> a = "TCGATCATCGATCG>IonTorrenttrimmedcontig1$CCGTAGGTGAACCTGCGGAAG"
>>> re.split(">[^$]*\$", a)
['TCGATCATCGATCG', 'CCGTAGGTGAACCTGCGGAAG']
Also, some people are answering with slicing on '>ion1'. That's totally wrong!
I believe your problem is solved! I am also adding the bioinformatics tag to this question!
I would use the re module for that:
>>> s = "blablabla>ion1$foobar>ion2$etc>ion3$..."
>>> import re
>>> re.split(">[^$]*\$",s)
['blablabla', 'foobar', 'etc', '...']
And if you have 1 string on each line:
>>> with open("foo.txt", "r") as f:
...     for line in f:
...         re.split(">[^$]*\$", line[:-1])
...
['blablabla', 'foobar', 'etc', '...']
['fofofofofo', 'barbarbar', 'blablabla']
If you are reading over every line there are a few ways to do this. You could use partition (partition returns a 3-tuple containing the text before the specified string, the specified string itself, and the text after):
for line in file:
    stripped_header = line.partition(">")[2].partition("$")[0]
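For example, on the string from the question, the partition chain pulls out just the header between > and $:

```python
line = "TCGATCATCGATCG>IonTorrenttrimmedcontig1$CCGTAGGTGAACCTGCGGAAG"
# First partition keeps everything after '>', second keeps everything before '$'.
header = line.partition(">")[2].partition("$")[0]
print(header)  # IonTorrenttrimmedcontig1
```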
You could use split:
for line in file:
    stripped_header = line.split(">")[1].split("$")[0]
You could loop over all the characters in the string and only append after you pass ">" but before "$" (however, this will not be nearly as efficient):
for line in file:
    in_header = False
    stripped_header = ""
    for char in line:
        if char == ">":
            in_header = True
        elif in_header:
            if char != "$":
                stripped_header += char
            else:
                in_header = False
Or alternatively use a regular expression, but it seems like my peers have already beat me to it!

Splitting lines in a file into string and hex and do operations on the hex values

I have a large file with several lines as given below. I want to read in only those lines which have the _INIT pattern in them, strip off the _INIT from the name, and save only the OSD_MODE_15_H part in a variable. Then I need to read the corresponding hex value, 8'h00 in this case, strip off the 8'h from it, replace it with 0x, and save that in a variable.
I have been trying to strip off the _INIT, the spaces and the =, and the code is becoming really messy.
localparam OSD_MODE_15_H_ADDR = 16'h038d;
localparam OSD_MODE_15_H_INIT = 8'h00
Can you suggest a lean and clean method to do this?
Thanks!
The following solution uses a regular expression (compiled to speed up repeated searching) to match the relevant lines and extract the needed information. The expression uses the named groups "id" and "hexValue" to identify the data we want to extract from the matching line.
import re

expression = r"(?P<id>\w+?)_INIT\s*?=.*?'h(?P<hexValue>[0-9a-fA-F]*)"
regex = re.compile(expression)

def getIdAndValueFromInitLine(line):
    mm = regex.search(line)
    if mm is None:
        return None  # Not the ..._INIT parameter, or the line was empty, or some other mismatch
    else:
        return (mm.groupdict()["id"], "0x" + mm.groupdict()["hexValue"])
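A quick check of this on the sample line from the question might look like the following (re-stating the pattern so the snippet is self-contained):

```python
import re

# Same pattern as above, applied to one sample line from the question.
regex = re.compile(r"(?P<id>\w+?)_INIT\s*?=.*?'h(?P<hexValue>[0-9a-fA-F]*)")
mm = regex.search("localparam OSD_MODE_15_H_INIT = 8'h00")
result = (mm.groupdict()["id"], "0x" + mm.groupdict()["hexValue"])
print(result)  # ('OSD_MODE_15_H', '0x00')
```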
EDIT: If I understood the next task correctly, you need to find the hex values of those INIT and ADDR lines whose IDs match, and make a dictionary mapping each INIT hex value to the corresponding ADDR hex value.
regex = r"(?P<init_id>\w+?)_INIT\s*?=.*?'h(?P<initValue>[0-9a-fA-F]*)"
init_dict = {}
for x in re.finditer(regex, lines):
    init_dict[x.groupdict()["init_id"]] = "0x" + x.groupdict()["initValue"]

regex = r"(?P<addr_id>\w+?)_ADDR\s*?=.*?'h(?P<addrValue>[0-9a-fA-F]*)"
addr_dict = {}
for y in re.finditer(regex, lines):
    addr_dict[y.groupdict()["addr_id"]] = "0x" + y.groupdict()["addrValue"]

init_to_addr_hexvalue_dict = {init_dict[x]: addr_dict[x] for x in init_dict if x in addr_dict}
Even if this is not exactly what you need, having the init and addr dictionaries might help you achieve your goal more easily. If there are several _INIT (or _ADDR) lines with the same ID but different hex values, then the above dict approach will not work in a straightforward way.
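A compact finditer-based sketch of the same idea, run on the two sample lines from the question:

```python
import re

# The two sample lines from the question.
lines = """localparam OSD_MODE_15_H_ADDR = 16'h038d;
localparam OSD_MODE_15_H_INIT = 8'h00"""

init_dict = {m.group('id'): '0x' + m.group('val')
             for m in re.finditer(r"(?P<id>\w+?)_INIT\s*?=.*?'h(?P<val>[0-9a-fA-F]*)", lines)}
addr_dict = {m.group('id'): '0x' + m.group('val')
             for m in re.finditer(r"(?P<id>\w+?)_ADDR\s*?=.*?'h(?P<val>[0-9a-fA-F]*)", lines)}

# Map each INIT hex value to the ADDR hex value with the matching ID.
init_to_addr = {init_dict[k]: addr_dict[k] for k in init_dict if k in addr_dict}
print(init_to_addr)  # {'0x00': '0x038d'}
```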
Try something like this - I'm not sure what all your requirements are, but this should get you close:
with open(someFile, 'r') as infile:
    for line in infile:
        if '_INIT' in line:
            apostropheIndex = line.find("'h")
            clean_hex = '0x' + line[apostropheIndex + 2:]
In the case of "16'h038d;", clean_hex would be "0x038d;" (need to remove the ";" somehow) and in the case of "8'h00", clean_hex would be "0x00"
Edit: if you want to guard against characters like ";" you could do this and test if a character is alphanumeric:
clean_hex = '0x' + ''.join([s for s in line[apostropheIndex + 2:] if s.isalnum()])
You can use a regular expression and the re.findall() function. For example, to generate a list of tuples with the data you want just try:
import re

lines = open("your_file").read()
regex = r"([\w]+?)_INIT\s*=\s*\d+'h([\da-fA-F]*)"
res = [(x[0], "0x" + x[1]) for x in re.findall(regex, lines)]
print res
The regular expression is very specific for your input example. If the other lines in the file are slightly different you may need to change it a bit.
