Python next substring search - python

I am transmitting a message with a pre/postamble multiple times. I want to be able to extract the message between two valid pre/postambles. My current code is
print(msgfile[msgfile.find(preamble) + len(preamble):msgfile.find(postamble, msgfile.find(preamble))])
The problem is that if the postamble is corrupt, it will print all data between the first valid preamble and the next valid postamble. An example received text file would be:
garbagePREAMBLEmessagePOSTcMBLEgarbage
garbagePRdAMBLEmessagePOSTAMBLEgarbage
garbagePREAMBLEmessagePOSTAMBLEgarbage
and it will print
messagePOSTcMBLEgarbage
garbagePRdAMBLEmessage
but what I really want it to print is the message from the third line, since it has both a valid preamble and a valid postamble. So I guess what I want is to be able to find and index from the next instance of a substring. Is there an easy way to do this?
Edit: I don't expect my data to be in nice discrete lines. I just formatted it that way so it would be easier to see.

Process it line by line:
>>> preamble, postamble = "PREAMBLE", "POSTAMBLE"
>>> test = "garbagePREAMBLEmessagePOSTcMBLEgarbage\n"
>>> test += "garbagePRdAMBLEmessagePOSTAMBLEgarbage\n"
>>> test += "garbagePREAMBLEmessagePOSTAMBLEgarbage\n"
>>> for line in test.splitlines():
...     # only lines that contain both markers are considered valid
...     if line.find(preamble) != -1 and line.find(postamble) != -1:
...         print(line[line.find(preamble) + len(preamble):line.find(postamble)])
...
message

Are all messages on single lines?
Then you can use regular expressions to identify lines with a valid pre- and postamble:
import re
pat = re.compile('PREAMBLE(.+)POSTAMBLE')
with open(yourfilename) as input_file:
    # keep the captured message of every line that matches
    messages = [pat.search(line).group(1) for line in input_file
                if pat.search(line)]
print(messages)

import re
lines = ["garbagePREAMBLEmessagePOSTcMBLEgarbage",
         "garbagePRdAMBLEmessagePOSTAMBLEgarbage",
         "garbagePREAMBLEmessagePOSTAMBLEgarbage"]
# you can use regex
my_regex = re.compile("garbagePREAMBLE(.*?)POSTAMBLEgarbage")
# get the match found between the preambles and print it
for line in lines:
    found = my_regex.match(line)
    # if there is a match print it
    if found:
        print(found.group(1))
# you can use string slicing
def validate(pre, post, messages):
    for line in messages:
        # method would break on a string smaller than both preambles
        if len(line) < len(pre) + len(post):
            print("error line is too small")
            continue
        # see if the message fits the pattern
        if line.startswith(pre) and line.endswith(post):
            # print message
            print(line[len(pre):-len(post)])
validate("garbagePREAMBLE", "POSTAMBLEgarbage", lines)
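Since the question notes that the data will not arrive in discrete lines, the "next instance of a substring" part can also be handled directly with the optional start argument of str.find. The following is only a sketch: the helper names find_all and extract_messages and the max_len limit are assumptions made for illustration and do not appear in the original code. It pairs every postamble with the closest preamble before it and drops payloads longer than a plausible message, which is one possible way to reject the span that straddles a corrupted marker.

```python
def find_all(text, sub):
    """Yield the index of each non-overlapping occurrence of sub in text."""
    start = text.find(sub)
    while start != -1:
        yield start
        # continue the search just past the occurrence that was found
        start = text.find(sub, start + len(sub))


def extract_messages(data, preamble, postamble, max_len):
    """Return payloads framed by a preamble and a following postamble.

    max_len is a hypothetical upper bound on the message length; it is
    used to discard spans that run across a corrupted marker.
    """
    messages = []
    for end in find_all(data, postamble):
        # closest preamble that lies completely before this postamble
        begin = data.rfind(preamble, 0, end)
        if begin == -1:
            continue
        payload = data[begin + len(preamble):end]
        if len(payload) <= max_len:
            messages.append(payload)
    return messages


# the sample from the question, joined without line breaks
data = ("garbagePREAMBLEmessagePOSTcMBLEgarbage"
        "garbagePRdAMBLEmessagePOSTAMBLEgarbage"
        "garbagePREAMBLEmessagePOSTAMBLEgarbage")
print(extract_messages(data, "PREAMBLE", "POSTAMBLE", max_len=16))
```

Without some bound such as max_len, the span from the first valid preamble to the second line's postamble is indistinguishable from a long message, so the right limit depends on what the transmitter actually sends.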

Related

My code is missing some of the lines im trying to get out of a file

The basic task is to write a function, get_words_from_file(filename), that returns a list of lower case words that are within the region of interest. They share with you a regular expression: "[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", that finds all words that meet this definition. My code works well on some of the tests but fails when the line that indicates the region of interest is repeated.
Here is my code:
import re
def get_words_from_file(filename):
    """Returns a list of lower case words that are with the region of
    interest, every word in the text file, but, not any of the punctuation."""
    with open(filename,'r', encoding='utf-8') as file:
        flag = False
        words = []
        count = 0
        for line in file:
            if line.startswith("*** START OF"):
                while count < 1:
                    flag=True
                    count += 1
            elif line.startswith("*** END"):
                flag=False
                break
            elif(flag):
                new_line = line.lower()
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+",
                                           new_line)
                words.extend(words_on_line)
    return words
#test code:
filename = "bee.txt"
words = get_words_from_file(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(words)))
print("Valid word list:")
for word in words:
    print(word)
The issue is the string "*** START OF" is repeated and isn't included when it is inside the region of interest.
The test code should result in:
bee.txt loaded ok.↩
16 valid words found.↩
Valid word list:↩
yes↩
really↩
this↩
time↩
start↩
of↩
synthetic↩
test↩
case↩
end↩
synthetic↩
test↩
case↩
i'm↩
in↩
too
But I'm getting:
bee.txt loaded ok.↩
11 valid words found.↩
Valid word list:↩
yes↩
really↩
this↩
time↩
end↩
synthetic↩
test↩
case↩
i'm↩
in↩
too
Any help would be great!
Attached is a screenshot of the file
The specific problem in your code is the if .. elif .. elif chain: you're ignoring every line that looks like the line that signals the start or end of a block, even when it appears inside the test block.
You wanted something like this for your function:
import re
def get_words_from_file(filename):
    """Returns a list of lower case words that are with the region of
    interest, every word in the text file, but, not any of the punctuation."""
    with open(filename, 'r', encoding='utf-8') as file:
        in_block = False
        words = []
        for line in file:
            # the start marker only counts while we are outside the block
            if not in_block and line == "*** START OF A SYNTHETIC TEST CASE ***\n":
                in_block = True
            elif in_block and line == "*** END TEST CASE ***\n":
                break
            elif in_block:
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line.lower())
                words.extend(words_on_line)
    return words
This is assuming you are actually looking for the whole line as a marker, but of course you can still use .startswith() if you actually accept that as the start or end of the block, as long as it's sufficiently unambiguous.
Your idea of using a flag is fine, although naming a flag to whatever it represents is always a good idea.
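The state handling described above can be exercised without a file by passing any iterable of lines. This is a sketch under assumptions: the helper name words_in_region and the sample text are made up for illustration (the contents of bee.txt are not shown in the question), and the markers are the ones used in the function above.

```python
import re

WORD_RE = re.compile(r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+")
START = "*** START OF A SYNTHETIC TEST CASE ***"
END = "*** END TEST CASE ***"


def words_in_region(lines):
    """Collect lower-case words between the first START line and the END line."""
    in_block = False
    words = []
    for line in lines:
        text = line.rstrip("\n")
        if not in_block:
            # outside the region only the start marker matters
            in_block = (text == START)
        elif text == END:
            break
        else:
            # inside the region every line counts, even a repeated start marker
            words.extend(WORD_RE.findall(text.lower()))
    return words


# hypothetical sample with the start marker repeated inside the region
sample = [
    "ignored header\n",
    "*** START OF A SYNTHETIC TEST CASE ***\n",
    "Yes, really this time\n",
    "*** START OF A SYNTHETIC TEST CASE ***\n",
    "I'm in too\n",
    "*** END TEST CASE ***\n",
    "ignored trailer\n",
]
print(words_in_region(sample))
```

The repeated marker contributes its own words (start, of, a, ...) because the start test is only applied while in_block is still False.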

Get the full word(s) by knowing only just a part of it

I am searching through a text file line by line and I want to get back all strings that contain the prefix AAAXX1234. For example, in my text file I have these lines:
Hello my ID is [123423819::AAAXX1234_3412] #I want that(AAAXX1234_3412)
Hello my ID is [738281937::AAAXX1234_3413:AAAXX1234_4212] #I want both of them(AAAXX1234_3413, AAAXX1234_4212)
Hello my ID is [123423819::XXWWF1234_3098] #I don't care about that
The code I have just checks whether the line starts with "Hello my ID is":
with open(file_hrd,'r',encoding='utf-8') as hrd:
    hrd=hrd.readlines()
    for line in hrd:
        if line.startswith("Hello my ID is"):
            #do something
Try this:
import re
with open(file_hrd,'r',encoding='utf-8') as hrd:
    res = []
    for line in hrd:
        # raw string so that \d reaches the regex engine unchanged
        res += re.findall(r'AAAXX1234_\d+', line)
print(res)
Output:
['AAAXX1234_3412', 'AAAXX1234_3413', 'AAAXX1234_4212']
I’d suggest parsing your lines and extracting the information into meaningful parts. That way, you can then use a simple startswith on the ID part of your line. In addition, this lets you control where you look for these prefixes, e.g. in case a line contains additional data that could theoretically also contain something that looks like an ID.
Something like this:
if line.startswith('Hello my ID is '):
    idx_start = line.index('[')
    idx_end = line.index(']', idx_start)
    idx_separator = line.index(':', idx_start, idx_end)
    num = line[idx_start + 1:idx_separator]
    ids = line[idx_separator + 2:idx_end].split(':')
    print(num, ids)
This would give you the following output for your three example lines:
123423819 ['AAAXX1234_3412']
738281937 ['AAAXX1234_3413', 'AAAXX1234_4212']
123423819 ['XXWWF1234_3098']
With that information, you can then check the ids for a prefix:
if any(x.startswith('AAAXX1234') for x in ids):
    print('do something')
Using regular expressions through the re module and its findall() function should be enough:
import re
with open('file.txt') as file:
    prefix = 'AAAXX1234'
    lines = file.read().splitlines()
    output = list()
    for line in lines:
        # raw f-string: {prefix} is interpolated, \d is left for the regex engine
        output.extend(re.findall(rf'{prefix}_\d+', line))
You can do it with findall and the regex r'AAAXX1234_[0-9]+'. It finds all parts of the string that start with AAAXX1234_ and then grabs all of the digits after it. Change + to * if you want it to match 'AAAXX1234_' on its own as well.
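The difference between + and * mentioned above can be seen in a short sketch. The sample line is an assumption: it is the second line from the question with a bare AAAXX1234_ appended purely for illustration.

```python
import re

# hypothetical line: the last ID has no digits after the underscore
line = "Hello my ID is [738281937::AAAXX1234_3413:AAAXX1234_4212:AAAXX1234_]"

# '+' requires at least one digit after the underscore
with_digits = re.findall(r'AAAXX1234_[0-9]+', line)
# '*' also accepts the bare prefix with no digits at all
bare_allowed = re.findall(r'AAAXX1234_[0-9]*', line)

print(with_digits)
print(bare_allowed)
```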

Placeholder for integer with unknown value

I am making a webscraping tool which gets the amount of players on a game server.
At the moment the most efficient method of doing this is to use Requests and BS4, to write the HTML source to a txt file, then search that file for
" / "
Unfortunately my HTML contains two forward slashes with spaces on either side, so I need to be able to do something like
"%d / %d"
so that it only gets the one with the integers. I do not know the values on either side; I just need it to pick only the one with integers in it.
prange = list(range(0, 65))
searchfile = open("data.txt", "r")
for line in searchfile:
    if " / " in line:
        print (line)
searchfile.close()
Thanks in advance!
You can try using re to find the required pattern:
>>> import re
>>> re.search(r'(\d+)\s+/\s+(\d+)', 'dsdsd 111 / 222 dsdsds').groups()
('111', '222')
What you want is to use a regex to search for a specific pattern in your document.
re.search(r'(\d) / (\d)', your_text) returns the first occurrence of X / Y where X and Y are single digits (use re.findall or re.finditer if you need every occurrence). If you want more than one digit, take a look at the regex syntax and write something like r'(\d+) / (\d+)'.
With your example, you should have:
import re
prange = list(range(0, 65))
searchfile = open("data.txt", "r")
for line in searchfile:
    # only lines with digits on both sides of the slash match
    m = re.search(r'(\d+ / \d+)', line)
    if m:
        print (line)
searchfile.close()
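If the two numbers themselves are needed rather than the whole line, the capture groups can be converted to integers. A minimal sketch, assuming a made-up HTML fragment and a hypothetical helper name player_counts; the real page markup is not shown in the question.

```python
import re

COUNT_RE = re.compile(r'(\d+)\s*/\s*(\d+)')


def player_counts(text):
    """Return (current, maximum) from the first 'N / M' pair, or None."""
    m = COUNT_RE.search(text)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


# made-up fragment: the first ' / ' has no digits around it and is skipped
html = '<div>Home / Servers</div><span class="players">12 / 64</span>'
print(player_counts(html))
```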

python randomizer - get random text between curly braces with double nesting level

Hey, I need to create a simple Python randomizer. Example input:
{{hey|hello|hi}|{privet|zdravstvuy|kak dela}|{bonjour|salut}}, can {you|u} give me advice?
and output should be:
hello, can you give me advice
I have a script which can do this, but only for one nesting level:
import random, re
with open('text.txt', 'r') as text:
    matches = re.findall('([^{}]+)', text.read())
words = []
for match in matches:
    parts = match.split('|')
    if parts[0]:
        words.append(parts[random.randint(0, len(parts)-1)])
message = ''.join(words)
This is not enough for me.
Python regex does not support nested structures, so you'll have to find some other way to parse the string.
Here's my quick kludge:
import random
def randomize(text):
    start= text.find('{')
    if start==-1: #if there are no curly braces, there's nothing to randomize
        return text
    # parse the choices we have
    end= start
    word_start= start+1
    nesting_level= 0
    choices= [] # list of |-separated values
    while True:
        end+= 1
        try:
            char= text[end]
        except IndexError:
            break # if there's no matching closing brace, we'll pretend there is.
        if char=='{':
            nesting_level+= 1
        elif char=='}':
            if nesting_level==0: # matching closing brace found - stop parsing.
                break
            nesting_level-= 1
        elif char=='|' and nesting_level==0:
            # put all text up to this pipe into the list
            choices.append(text[word_start:end])
            word_start= end+1
    # there's no pipe character after the last choice, so we have to add it to the list now
    choices.append(text[word_start:end])
    # recursively call this function on each choice
    choices= [randomize(t) for t in choices]
    # return the text up to the opening brace, a randomly chosen string, and
    # don't forget to randomize the text after the closing brace
    return text[:start] + random.choice(choices) + randomize(text[end+1:])
As I said above, nesting is essentially useless here, but if you want to keep your current syntax, one way to handle it is to replace braces in a loop until there are no more:
import re, random
msg = '{{hey|hello|hi}|{privet|zdravstvuy|kak dela}|{bonjour|salut}}, can {you|u} give me advice?'
while re.search(r'{.*}', msg):
    # each pass resolves the innermost groups, i.e. those without nested braces
    msg = re.sub(
        r'{([^{}]*)}',
        lambda m: random.choice(m.group(1).split('|')),
        msg)
print(msg)
# e.g. zdravstvuy, can u give me advice?
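To make the inside-out passes easier to follow, the loop can be wrapped in a function that takes the chooser as a parameter. This is a sketch, not the answer's exact code: the name expand, the choose parameter and the use of re.subn to stop when nothing was replaced are assumptions added for illustration.

```python
import random
import re

# a brace pair with no braces inside it, i.e. an innermost group
INNERMOST = re.compile(r'\{([^{}]*)\}')


def expand(msg, choose=random.choice):
    """Resolve {a|b|c} groups from the inside out until none are left."""
    while True:
        msg, count = INNERMOST.subn(
            lambda m: choose(m.group(1).split('|')), msg)
        if count == 0:  # nothing replaced: done (or only unbalanced braces remain)
            return msg


template = ('{{hey|hello|hi}|{privet|zdravstvuy|kak dela}|{bonjour|salut}}, '
            'can {you|u} give me advice?')
# deterministic chooser, used only to make the two passes predictable
print(expand(template, choose=lambda options: options[0]))
# default behaviour: a random pick at every level
print(expand(template))
```

With the deterministic chooser the first pass yields '{hey|privet|bonjour}, can you give me advice?' and the second pass reduces it to 'hey, can you give me advice?'.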

Help parsing text file in python

Really been struggling with this one for some time now. I have many text files with a specific format from which I need to extract all the data and file it into different fields of a database. The struggle is tweaking the parameters for parsing and ensuring I get all the info correctly.
The format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentace as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) getting rid of whitespace, and 2) separating the fields from each other. See my best attempt below:
dict = {}
XX = (open("XX.txt", "r")).readlines()
for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',(tmp_PA,tmp_K,tmp_DETAILS))
At the minute, I can pull the K & PA fields no problem using this; however, my DETAILS is only pulling one word. I need the entire sentence, or at least 25 chars of it.
Thanks very much for reading and I hope you can help! :)
K
You are splitting the whole line into words. You need to split into first word, second word and the rest. Like line.split(None, 2).
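A minimal sketch of the split(None, 2) suggestion, using one of the sample records from the question. The leading spaces and trailing newline in the sample string are assumptions about how the line looks when read from the file.

```python
line = "   2 4565434 i need this sentace as one DB record\n"

# strip first: with a maxsplit, the last piece keeps any trailing whitespace
k, pa, details = line.strip().split(None, 2)

print(k, pa)
print(details)
```

Splitting on None treats any run of whitespace as one separator, and the limit of 2 leaves the rest of the line intact as the third field.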
I would probably use regular expressions, and use the opposite logic: if the line starts with a number 1 through 5, use it, otherwise skip it. Like:
import re
pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')
f = open('XX.txt', 'r') # No calling readlines; lazy iteration is better
for line in f:
    m = pattern.match(line)
    if m:
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using a prepared statement. Parsing SQL is orders of magnitude slower than executing it.
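One way to reuse a single parameterized statement for all rows is executemany. The sketch below assumes that cu is a sqlite3 cursor (the question does not say which database driver is used, although the ? placeholders match sqlite3's style); the in-memory database, the column types and the sample rows are made up for the demonstration.

```python
import sqlite3

# in-memory database and schema are assumptions for this demo only
conn = sqlite3.connect(":memory:")
cu = conn.cursor()
cu.execute("CREATE TABLE bugInfo2 (pa INTEGER, k INTEGER, details TEXT)")

# (pa, k, details) tuples as they might come out of the parsing loop
rows = [
    (4565434, 2, "i need this sentace as one DB record"),
    (4456788, 2, "and this one"),
]
# one SQL string, executed once per parameter tuple
cu.executemany("INSERT INTO bugInfo2 (pa, k, details) VALUES (?,?,?)", rows)
conn.commit()

stored = cu.execute(
    "SELECT k, pa, details FROM bugInfo2 ORDER BY pa").fetchall()
print(stored)
conn.close()
```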
If I understand your file format correctly, you can try this script:
filename = 'bug.txt'
f = open(filename,'r')
foundHeaders = False
records = []
for rawline in f:
    line = rawline.strip()
    if not foundHeaders:
        # skip everything up to and including the header line
        tokens = line.split()
        if tokens == ['K','PA','DETAILS']:
            foundHeaders = True
        continue
    else:
        tokens = line.split(None,2)
        if len(tokens) != 3:
            break
        try:
            K = int(tokens[0])
            PA = int(tokens[1])
        except ValueError:
            break
        records.append((K,PA,tokens[2]))
f.close()
for r in records:
    print(r) # replace this by your DB insertion code
This will start reading the records when it encounters the header line, and stop as soon as the format of the line is no longer (K,PA,description).
Hope this helps.
Here is my attempt using re
import re
stuff = open("source", "r").readlines()
whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    if whitey.match(line):
        pass
    elif header.match(line):
        pass
    elif juicy_info.match(line):
        result = juicy_info.search(line)
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))
Using re I can pull the data out and manipulate it on a whim. If you only need the juicy info lines, you can actually take out all the other checks, making this a REALLY concise script.
import re
stuff = open("source", "r").readlines()
#create a regular expression using subpatterns.
#'first', 'second' and 'third' are our own tags,
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    result = juicy_info.search(line)
    if result:#do stuff with data here just use the tag we declared earlier.
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))
import re
# header line followed by exactly three record lines
reg = re.compile(r'K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'
                 + 3*r'([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')
with open('XX.txt') as f:
    mat = reg.search(f.read())
for tripl in ((2,1,3),(5,4,6),(8,7,9)):
    cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
               mat.group(*tripl))
I prefer to use [ \t] instead of \s because \s matches all of the following characters:
blank, '\f', '\n', '\r', '\t', '\v'
and I don't see any reason to use a symbol that matches more than what is to be matched, at the risk of matching stray newlines in places where they shouldn't be.
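The risk described above can be shown with a small sketch; the input string is a made-up example in which two fields have been split across a line break.

```python
import re

# hypothetical input: the K value and the PA value ended up on different lines
text = "2\n4565434 details"

# \s+ happily crosses the newline and glues the two lines together
loose = re.search(r'([1-5])\s+(\d+)', text)
# [ \t]+ only accepts spaces and tabs, so nothing matches here
strict = re.search(r'([1-5])[ \t]+(\d+)', text)

print(loose.groups())
print(strict)
```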
Edit
It may be sufficient to do:
import re
reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$',re.MULTILINE)
with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2,1,3))
