python randomizer - get random text between curly braces with double nesting level

Hey, I need to create a simple Python randomizer. Example input:
{{hey|hello|hi}|{privet|zdravstvuy|kak dela}|{bonjour|salut}}, can {you|u} give me advice?
and the output should be something like:
hello, can you give me advice?
I have a script that can do this, but only for one level of nesting:
import random
import re
with open('text.txt', 'r') as text:
    matches = re.findall('([^{}]+)', text.read())
words = []
for match in matches:
    parts = match.split('|')
    if parts[0]:
        words.append(parts[random.randint(0, len(parts)-1)])
message = ''.join(words)
This is not enough for me. :)

Python regex does not support nested structures, so you'll have to find some other way to parse the string.
Here's my quick kludge:
import random
def randomize(text):
    start = text.find('{')
    if start == -1:  # if there are no curly braces, there's nothing to randomize
        return text
    # parse the choices we have
    end = start
    word_start = start + 1
    nesting_level = 0
    choices = []  # list of |-separated values
    while True:
        end += 1
        try:
            char = text[end]
        except IndexError:
            break  # if there's no matching closing brace, we'll pretend there is.
        if char == '{':
            nesting_level += 1
        elif char == '}':
            if nesting_level == 0:  # matching closing brace found - stop parsing.
                break
            nesting_level -= 1
        elif char == '|' and nesting_level == 0:
            # put all text up to this pipe into the list
            choices.append(text[word_start:end])
            word_start = end + 1
    # there's no pipe character after the last choice, so we have to add it to the list now
    choices.append(text[word_start:end])
    # recursively call this function on each choice
    choices = [randomize(t) for t in choices]
    # return the text up to the opening brace, a randomly chosen string, and
    # don't forget to randomize the text after the closing brace
    return text[:start] + random.choice(choices) + randomize(text[end+1:])

As I said above, nesting is essentially useless here, but if you want to keep your current syntax, one way to handle it is to replace braces in a loop until there are no more:
import re, random
msg = '{{hey|hello|hi}|{privet|zdravstvuy|kak dela}|{bonjour|salut}}, can {you|u} give me advice?'
while re.search(r'{.*}', msg):
    # replace every innermost {a|b|c} group with one randomly chosen alternative
    msg = re.sub(
        r'{([^{}]*)}',
        lambda m: random.choice(m.group(1).split('|')),
        msg)
print(msg)
# e.g. zdravstvuy, can u give me advice?
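The same inside-out replacement can be wrapped in a reusable function. This is only a sketch: the names `spin`, `INNERMOST` and the `rng` parameter are made up here and are not part of the answer above. Passing a seeded `random.Random` makes the result repeatable, which helps when testing.

```python
import random
import re

# Matches an innermost group: a pair of braces with no other braces inside.
INNERMOST = re.compile(r'\{([^{}]*)\}')

def spin(text, rng=random):
    """Resolve {a|b|c} groups from the inside out until none remain."""
    while True:
        # subn returns the new string and the number of replacements made
        text, count = INNERMOST.subn(
            lambda m: rng.choice(m.group(1).split('|')), text)
        if count == 0:
            return text

template = ('{{hey|hello|hi}|{privet|zdravstvuy|kak dela}|{bonjour|salut}}, '
            'can {you|u} give me advice?')
print(spin(template, random.Random(42)))
```

Each pass removes at least one pair of braces, so the loop ends once no complete group is left.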

Related

How to remove dash/ hyphen from each line in .txt file

I wrote a little program to turn pages from book scans into a .txt file. On some lines, a word is split by a hyphen and continues on the next line. Is there any way to remove the hyphens and merge each fragment with the rest of the word on the line below?
E.g.:
effects on the skin is fully under-
stood one fights
to:
effects on the skin is fully understood
one fights
or:
effects on the skin is fully
understood one fights
Or something like that, as long as the word ends up joined. Python is my third language and so far I can't think of anything, so maybe someone can give me a hint.
Edit:
The point is that if the last character of a line is a dash, the dash is removed and the fragment is merged with the rest of the word on the line below.

This is a generator which takes the input line-by-line. If it ends with a - it extracts the last word and holds it over for the next line. It then yields any held-over word from the previous line combined with the current line.
To combine the results back into a single block of text, you can join it against the line separator of your choice:
source = """effects on the skin is fully under-
stood one fights
check-out Daft Punk's new sing-
le "Get Lucky" if you hav-
e the chance. Sound of the sum-
mer."""
def reflow(text):
    holdover = ""
    for line in text.splitlines():
        if line.endswith("-"):
            # split off the last (hyphenated) word of the line
            lin, _, e = line.rpartition(" ")
        else:
            lin, e = line, ""
        yield f"{holdover}{lin}"
        holdover = e[:-1]  # drop the trailing hyphen
print("\n".join(reflow(source)))
""" which is:
effects on the skin is fully
understood one fights
check-out Daft Punk's new
single "Get Lucky" if you
have the chance. Sound of the
summer.
"""
To read one file line-by-line and write directly to a new file:
def reflow(infile, outfile):
    with open(infile) as source, open(outfile, "w") as dest:
        holdover = ""
        for line in source.readlines():
            line = line.rstrip("\n")
            if line.endswith("-"):
                lin, _, e = line.rpartition(" ")
            else:
                lin, e = line, ""
            dest.write(f"{holdover}{lin}\n")
            holdover = e[:-1]
if __name__ == "__main__":
    reflow("source.txt", "dest.txt")
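One edge case worth noting: if the very last line ends with a hyphen, the held-over fragment is never emitted by the loops above. A minimal sketch of a variant that flushes it at the end (the name `reflow_flush` is made up for this example):

```python
def reflow_flush(text):
    """Like the generator above, but also emits a fragment left over at the end."""
    holdover = ""
    for line in text.splitlines():
        if line.endswith("-"):
            # split off the hyphenated last word and drop its hyphen
            head, _, tail = line.rpartition(" ")
            tail = tail[:-1]
        else:
            head, tail = line, ""
        yield f"{holdover}{head}"
        holdover = tail
    if holdover:
        # the input ended on a hyphenated word; emit what is left
        yield holdover

print("\n".join(reflow_flush("fully under-\nstood one sum-")))
```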

Here is one way to do it
with open('test.txt') as file:
    combined_strings = []
    merge_line = False
    for item in file:
        item = item.replace('\n', '')  # remove new line character at end of line
        if item.endswith('-'):  # the last character is a hyphen (also safe for empty lines)
            merge_line = True
            combined_strings.append(item[:-1])
        elif merge_line:
            merge_line = False
            combined_strings[-1] = combined_strings[-1] + item
        else:
            combined_strings.append(item)

If you just parse the line as a string then you can utilize the .split() function to move around these kinds of items
words = "effects on the skin is fully under-\nstood one fights"
# split on the newlines
wordsSplit = words.split("\n")
# split each line on spaces
for i in range(len(wordsSplit)):
    wordsSplit[i] = wordsSplit[i].split(" ")
# check the last word of every line (except the final line) for a trailing hyphen
for i in range(len(wordsSplit) - 1):
    if wordsSplit[i][-1].endswith("-"):
        # set the joined word in the list and remove the hyphen
        wordsSplit[i][-1] = wordsSplit[i][-1][:-1] + wordsSplit[i+1][0]
        wordsSplit[i+1][0] = ""
# recreate the string
msg = ""
for i in range(len(wordsSplit)):
    for g in range(len(wordsSplit[i])):
        if wordsSplit[i][g] != "":
            msg += wordsSplit[i][g] + " "
This splits the text on newlines, which is where the hyphens occur, and then splits each line into a list of words. It then checks for end-of-line hyphens; when it finds one, it joins the fragment with the first word of the next line and sets that word to an empty string. Finally, it rebuilds the text in a variable called msg, skipping any empty strings so that no extra spaces are added.

What about
import re
a = '''effects on the skin is fully under-
stood one fights'''
re.sub(r'-~([a-zA-Z0-9]*) ', r'\1\n', a.replace('\n', '~')).replace('~','\n')
Explanation:
a.replace('\n', '~') joins the input into a single line, using ~ in place of \n. (Choose a different placeholder if ~ can occur in the text.)
The regex -~([a-zA-Z0-9]*) followed by a space then matches each hyphenated break and captures the word fragment after it in group 1. The replacement '\1\n' puts the fragment back and moves the line break after it.
.replace('~','\n') finally turns all remaining ~ characters back into newlines.
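The placeholder character can be avoided by matching the newline directly. This is a sketch of that variation, not the code from the answer above; it assumes the continuation of the word is followed by at most one space or newline.

```python
import re

text = "effects on the skin is fully under-\nstood one fights"

# "-\n" is the hyphenated line break, (\S+) is the rest of the word,
# and \s? consumes the single space or newline that followed it.
fixed = re.sub(r'-\n(\S+)\s?', r'\1\n', text)
print(fixed)
```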

Remove everything but #number in brackets

I have a file where the lines have the form #nr = name(#nr, (#nr), different vars, and names).
I would like to only have the #nr in the brackets to get the form #nr = name(#nr, #nr)
I have tried to solve this in different ways like using regex, startswith() and lists but nothing has worked so far.
Any help is much appreciated.
Edit: Code
for line in f.split():
    start = line.find( '(' )
    end = line.find( ')' )
    if start != -1 and end != -1:
        line = ''.join(i for i in x if not i.startswith('#'))
    print(line)
Edit 2:
As example I have:
#304= IFCRELDEFINESBYPROPERTIES('0FZ0hKNanFNAQpJ_Iqh4zM',#42,$,$,(#142),#301);
Afterwards I want to have:
#304= IFCRELDEFINESBYPROPERTIES(#42,#142,#301);

This can be solved using regex, though trying to do it with a single find/replace would be more complicated. Instead, you can do it in two steps:
import re
def sub_func(match):
    # group(2) is everything between the outer parentheses
    nums = re.findall(r'#\d+', match.group(2))
    return match.group(1) + '(' + ','.join(nums) + ');'
text = "#304= IFCRELDEFINESBYPROPERTIES('0FZ0hKNanFNAQpJ_Iqh4zM',#42,$,$,(#142),#301);"
result = re.sub(r'(^[^(]+)\((.*)\);', sub_func, text)
print(result)
# #304= IFCRELDEFINESBYPROPERTIES(#42,#142,#301);
So instead of passing a string as the second argument to re.sub, we pass a function. Inside it we can process the match with some more regex and reformat the result before returning it as the replacement.
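To run the same idea over many lines, the replacement function can be wrapped in a helper and applied line by line. A small sketch: the name `keep_refs` and the second sample line are made up for illustration.

```python
import re

def keep_refs(line):
    """Keep only the #number references inside the outer parentheses."""
    def rebuild(match):
        refs = re.findall(r'#\d+', match.group(2))
        return match.group(1) + '(' + ','.join(refs) + ');'
    return re.sub(r'^([^(]+)\((.*)\);', rebuild, line)

lines = [
    "#304= IFCRELDEFINESBYPROPERTIES('0FZ0hKNanFNAQpJ_Iqh4zM',#42,$,$,(#142),#301);",
    "#305= IFCEXAMPLE('abc',$,#7);",  # made-up line, not from the question
]
for line in lines:
    print(keep_refs(line))
```

Lines that do not match the pattern are returned unchanged, since re.sub leaves the input as it is when nothing matches.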

Comparing multiple file items using re

Currently I have a script that finds all the lines across multiple input files that have something in the format of
Matches: 500 (54.3 %) and prints out the top 10 highest matches by percentage.
I want it to also output the top 10 lines by score, e.g. Score: 4000
import re
def get_values_from_file(filename):
    f = open(filename)
    winpat = re.compile("([\d\.]+)\%")
    xinpat = re.compile("[\d]") #ISSUE, is this the right regex for it? Score: 500****
    values = []
    scores = []
    for line in f.readlines():
        if line.find("Matches") >=0:
            percn = float(winpat.findall(line)[0])
            values.append(percn)
        elif line.find("Score") >=0:
            hey = float(xinpat.findall(line)[0])
            scores.append(hey)
    return (scores,values)
all_values = []
all_scores = []
for filename in ["out0.txt", "out1.txt"]:#and so on
    values = get_values_from_file(filename)
    all_values += values
    all_scores += scores
all_values.sort()
all_values.reverse()
all_scores.sort() #also for scores
all_scores.reverse()
print(all_values[0:10])
print(all_scores[0:10])
Is my regex for the score format correct? I believe that's where I am having the issue, as it doesn't output both correctly.
Any thoughts? Should I split it into two functions?
Thank you.

Is my regex for the score format correct?
No, it should be r"\d+".
You don't need []. Those brackets establish a character class representing all of the characters inside the brackets. Since you only have one character type inside the bracket, they do nothing.
You only match a single character. You need a * or a + to match a sequence of characters.
You have an unescaped backslash in your string. Use the r prefix to allow the regular expression engine to see the backslash.
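A quick way to see the difference between the two patterns is to run both on a sample line (the sample text is taken from the question's description):

```python
import re

line = "Score: 4000"
single_digits = re.findall(r"[\d]", line)  # one match per digit
whole_number = re.findall(r"\d+", line)    # one match for the whole run of digits
print(single_digits)
print(whole_number)
```

With the original pattern, `findall(line)[0]` is only the first digit, which is why the scores came out wrong.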
Commentary:
If it were me, I'd let the regular expression do all of the work, and skip line.find() altogether:
#UNTESTED
def get_values_from_file(filename):
    # \s* before % allows for the space in "(54.3 %)"
    winpat = re.compile(r"Matches:\s*\d+\s*\(([\d.]+)\s*%\)")
    xinpat = re.compile(r"Score:\s*(\d+)")
    values = []
    scores = []
    # Note: "with open() as f" automatically closes f
    with open(filename) as f:
        # Note: "for line in f" more memory efficient
        # than "for line in f.readlines()"
        for line in f:
            # search() finds the pattern anywhere in the line
            win = winpat.search(line)
            xin = xinpat.search(line)
            # group(1) is the captured number; group(0) would be the whole match
            if win: values.append(float(win.group(1)))
            if xin: scores.append(float(xin.group(1)))
    return (scores,values)
Just for fun, here is a version of the routine which calls re.findall exactly once per file:
# TESTED
# Compile this only once to save time
# Flags are passed as arguments: recent Python versions reject inline
# global flags that are not at the very start of the pattern.
pat = re.compile(r'''
    (?:Matches:\s*\d+\s*\(([\d\.]+)\s*%\)) # "Matches: 300 (43.2%)"
    |
    (?:Score:\s*(\d+)) # "Score: 4000"
''', re.MULTILINE | re.VERBOSE)
def get_values_from_file(filename):
    with open(filename) as f:
        values, scores = zip(*pat.findall(f.read()))
    values = [float(value) for value in values if value]
    scores = [float(score) for score in scores if score]
    return scores, values

No. xinpat will only match single digits, so findall() will return a list of single digits, which is a bit messy. Change it to
xinpat = re.compile(r"[\d]+")
Actually, you don't need the square brackets here, so you could simplify it to
xinpat = re.compile(r"\d+")
BTW, the names winpat and xinpat are a bit opaque. The pat bit is OK, but win and xin? And hey isn't great either. But I guess xin and hey are just temporary names you made up when you decided to expand the program.
Another thing I just noticed, you don't need to do
all_values.sort()
all_values.reverse()
You can (and should) do that in one hit:
all_values.sort(reverse=True)
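Separately from the regex, the calling loop in the question assigns the returned (scores, values) tuple to a single name and then uses an undefined scores variable. A minimal sketch of unpacking both lists follows; it parses in-memory lines instead of files so it can run on its own, and the sample lines and the name `get_values_from_lines` are made up.

```python
import re

MATCH_PAT = re.compile(r"Matches:\s*\d+\s*\(([\d.]+)\s*%\)")
SCORE_PAT = re.compile(r"Score:\s*(\d+)")

def get_values_from_lines(lines):
    """Return (scores, values) parsed from an iterable of lines."""
    scores, values = [], []
    for line in lines:
        m = MATCH_PAT.search(line)
        if m:
            values.append(float(m.group(1)))
        s = SCORE_PAT.search(line)
        if s:
            scores.append(float(s.group(1)))
    return scores, values

# Stand-ins for the contents of out0.txt, out1.txt, and so on
sample_files = [
    ["Matches: 500 (54.3 %)", "Score: 4000"],
    ["Matches: 120 (12.0 %)", "Score: 950"],
]
all_scores, all_values = [], []
for lines in sample_files:
    scores, values = get_values_from_lines(lines)  # unpack both lists
    all_scores += scores
    all_values += values

all_values.sort(reverse=True)
all_scores.sort(reverse=True)
print(all_values[:10])
print(all_scores[:10])
```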

Python next substring search

I am transmitting a message with a pre/postamble multiple times. I want to be able to extract the message between two valid pre/postambles. My current code is
print(msgfile[msgfile.find(preamble) + len(preamble):msgfile.find(postamble, msgfile.find(preamble))])
The problem is that if the postamble is corrupt, it will print all data between the first valid preamble and the next valid postamble. An example received text file would be:
garbagePREAMBLEmessagePOSTcMBLEgarbage
garbagePRdAMBLEmessagePOSTAMBLEgarbage
garbagePREAMBLEmessagePOSTAMBLEgarbage
and it will print
messagePOSTcMBLEgarbage
garbagePRdAMBLEmessage
but what I really want it to print is the message from the third line, since it has both a valid preamble and a valid postamble. So I guess what I want is to be able to find and index from the next instance of a substring. Is there an easy way to do this?
Edit: I don't expect my data to be in nice discrete lines. I just formatted it that way so it would be easier to see.

Process it line by line:
>>> test = "garbagePREAMBLEmessagePOSTcMBLEgarbage\n"
>>> test += "garbagePRdAMBLEmessagePOSTAMBLEgarbage\n"
>>> test += "garbagePREAMBLEmessagePOSTAMBLEgarbage\n"
>>> for line in test.splitlines():
...     if line.find(preamble) != -1 and line.find(postamble) != -1:
...         print(line[line.find(preamble) + len(preamble):line.find(postamble)])

Are all messages on single lines?
Then you can use regular expressions to identify lines with a valid pre- and postamble:
import re
pat = re.compile('PREAMBLE(.+)POSTAMBLE')
with open(yourfilename) as input_file:
    # search each line once and keep only the lines that matched
    matches = (pat.search(line) for line in input_file)
    messages = [m.group(1) for m in matches if m]
print(messages)

import re
lines = ["garbagePREAMBLEmessagePOSTcMBLEgarbage",
         "garbagePRdAMBLEmessagePOSTAMBLEgarbage",
         "garbagePREAMBLEmessagePOSTAMBLEgarbage"]
# you can use regex
my_regex = re.compile("garbagePREAMBLE(.*?)POSTAMBLEgarbage")
# get the match found between the preambles and print it
for line in lines:
    found = my_regex.match(line)
    # if there is a match print it
    if found:
        print(found.group(1))
# you can use string slicing
def validate(pre, post, messages):
    for line in messages:
        # slicing would break on a string shorter than both ambles combined
        if len(line) < len(pre) + len(post):
            print("error line is too small")
            continue
        # see if the message fits the pattern
        if line.startswith(pre) and line.endswith(post):
            # print message
            print(line[len(pre):-len(post)])
validate("garbagePREAMBLE", "POSTAMBLEgarbage", lines)
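If the data is not split into lines, as the question's edit notes, one option is to use the start argument of str.find to continue searching from the next preamble. The sketch below additionally assumes that an upper bound on the message length is known (`max_len`, a made-up parameter); without some such assumption a corrupt postamble cannot be told apart from a very long message.

```python
def extract_messages(data, preamble, postamble, max_len):
    """Return messages framed by a valid preamble/postamble pair.

    max_len is an assumed upper bound on message length; a longer span is
    taken to mean that the postamble for this preamble was corrupt.
    """
    messages = []
    pos = data.find(preamble)
    while pos != -1:
        start = pos + len(preamble)
        end = data.find(postamble, start)
        if end == -1:
            break  # no valid postamble left anywhere
        if end - start <= max_len:
            messages.append(data[start:end])
        # the second argument of find() resumes the search after this preamble
        pos = data.find(preamble, start)
    return messages

data = ("garbagePREAMBLEmessagePOSTcMBLEgarbage"
        "garbagePRdAMBLEmessagePOSTAMBLEgarbage"
        "garbagePREAMBLEmessagePOSTAMBLEgarbage")
print(extract_messages(data, "PREAMBLE", "POSTAMBLE", max_len=20))
```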

Help parsing text file in python

I've been struggling with this one for some time now. I have many text files with a specific format, from which I need to extract all the data and file it into different fields of a database. The struggle is tweaking the parameters for parsing and ensuring I get all the info correctly.
the format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentace as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) getting rid of the whitespace, and 2) separating the fields from each other. See my best attempt below:
dict = {}
XX = (open("XX.txt", "r")).readlines()
for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',(tmp_PA,tmp_K,tmp_DETAILS))
At the moment I can pull the K and PA fields with no problem using this, but my DETAILS field only gets one word. I need the entire sentence, or at least 25 characters of it.
Thanks very much for reading and I hope you can help! :)
K

You are splitting the whole line into words. You need to split into first word, second word and the rest. Like line.split(None, 2).
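A short illustration of that call on one of the sample lines (the variable names are chosen for this example only). With a maxsplit of 2 the third piece keeps its trailing newline, so it is stripped separately:

```python
line = "2 4565434 i need this sentence as one DB record\n"
# split on runs of whitespace, at most twice: K, PA, and the rest of the line
k, pa, details = line.split(None, 2)
details = details.rstrip()
print(k, pa, details)
```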
I would probably use regular expressions, and use the opposite logic: if the line starts with a number 1 through 5, use it; otherwise skip it. Like:
import re
pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')
f = open('XX.txt', 'r') # No calling readlines; lazy iteration is better
for line in f:
    m = pattern.match(line)
    if m:
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using a prepared statement. Parsing SQL is orders of magnitude slower than executing it.
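As a rough sketch of keeping the SQL text constant and binding the values separately, here is a version using sqlite3 with executemany. That cu is a sqlite3 cursor is an assumption (the ? placeholders suggest it, but the question does not say), and the table schema below is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
cu = conn.cursor()
# hypothetical schema; the real bugInfo2 table is not shown in the question
cu.execute("CREATE TABLE bugInfo2 (pa TEXT, k TEXT, details TEXT)")

rows = [
    ("4565434", "2", "i need this sentence as one DB record"),
    ("4456788", "2", "and this one"),
]
# The SQL text stays the same; only the bound parameters change per row.
cu.executemany("INSERT INTO bugInfo2 (pa, k, details) VALUES (?,?,?)", rows)
conn.commit()

count = cu.execute("SELECT COUNT(*) FROM bugInfo2").fetchone()[0]
print(count)
```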

If I understand correctly your file format, you can try this script
filename = 'bug.txt'
f = open(filename, 'r')
foundHeaders = False
records = []
for rawline in f:
    line = rawline.strip()
    if not foundHeaders:
        # skip everything up to and including the header line
        tokens = line.split()
        if tokens == ['K','PA','DETAILS']:
            foundHeaders = True
        continue
    else:
        tokens = line.split(None,2)
        if len(tokens) != 3:
            break
        try:
            K = int(tokens[0])
            PA = int(tokens[1])
        except ValueError:
            break
        records.append((K,PA,tokens[2]))
f.close()
for r in records:
    print(r) # replace this by your DB insertion code
This will start reading the records when it encounters the header line, and stop as soon as the format of the line is no longer (K,PA,description).
Hope this helps.

Here is my attempt using re
import re
stuff = open("source", "r").readlines()
whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    if whitey.match(line):
        pass
    elif header.match(line):
        pass
    elif juicy_info.match(line):
        result = juicy_info.search(line)
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))
Using re I can pull the data out and manipulate it on a whim. If you only need the juicy info lines, you can actually take out all the other checks, making this a REALLY concise script.
import re
stuff = open("source", "r").readlines()
# create a regular expression using named groups.
# 'first', 'second' and 'third' are our own tags,
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    result = juicy_info.search(line)
    if result:  # do stuff with data here, just use the tags we declared earlier.
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))

import re
# header line followed by exactly three data lines
reg = re.compile(r'K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'
                 + 3 * r'([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')
with open('XX.txt') as f:
    mat = reg.search(f.read())
for tripl in ((2,1,3),(5,4,6),(8,7,9)):
    cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
               mat.group(*tripl))
I prefer to use [ \t] instead of \s because \s matches all of the following characters:
blank, '\f', '\n', '\r', '\t', '\v'
and I don't see any reason to use a symbol that represents more than what is to be matched, with the risk of matching stray newlines in places where they shouldn't be.
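The difference is easy to demonstrate on a record whose details have slipped onto the next line (the sample text is made up for this illustration):

```python
import re

text = "2 4565434\ndetails on the next line"
# \s+ happily crosses the newline, so this broken record still matches
loose = re.match(r'(\d)\s+(\d+)\s+(.+)', text) is not None
# [ \t]+ only accepts spaces and tabs, so the same text is rejected
strict = re.match(r'(\d)[ \t]+(\d+)[ \t]+(.+)', text) is not None
print(loose, strict)
```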
Edit
It may be sufficient to do:
import re
reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$',re.MULTILINE)
with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2,1,3))
