My issue is that I would like to take input text with formatting like you would use when creating a Stack Overflow post and reformat it into the required text string. The best way I can think of to explain is with an example:
# This is the input string
Hello **there**, how are **you**
# This is the intended output string
Hello [font=Nunito-Black.ttf]there[/font], how are [font=Nunito-Black.ttf]you[/font]
So the ** is replaced by a different string that has an opening and a closing part, and this needs to work as many times as needed for any string (twice in the example).
I have tried to use a variable to record if the ** in need of replacing is an opening or a closing part, but haven't managed to get a function to work yet, hence it being incomplete
I think replacing the correct ** is hard because I have been trying to use index which will only return the position of the 1st occurrence in the string
My attempt as of now
def formatting_text(input_text):
    if input_text:
        if '**' in input_text:
            d = '**'
            for line in input_text:
                s = [e+d for e in line.split(d) if e]
                count = 0
                for y in s:
                    if y == '**' and count == 0:
                        s.index(y)
                        # replace with required part
            return output_text
    return input_text
I have tried to find this answer, so I'm sorry if it has already been asked, but I have had no luck finding it and don't know what to search for.
Of course thank you for any help
A general solution for your case,
Using re
import re
def formatting_text(input_text, special_char, left_repl, right_repl):
    # Escape the marker so that characters such as * are matched literally.
    marker = re.escape(special_char)
    # Capture the shortest run of text between two markers.
    pattern = f"{marker}(.+?){marker}"
    # Wrap each captured run in the replacement parts.
    return re.sub(pattern, lambda m: left_repl + m.group(1) + right_repl, input_text)
print(formatting_text("Hello **there**, how are **you**", "**", "[font=Nunito-Black.ttf]", "[/font]"))
Without using re
def formatting_text(input_text, special_char, left_repl, right_repl):
    while True:
        # Replace the left part.
        input_text = input_text.replace(special_char, left_repl, 1)
        # Replace the right part.
        input_text = input_text.replace(special_char, right_repl, 1)
        if input_text.find(special_char) == -1:
            # Nothing found, time to stop.
            break
    return input_text
print(formatting_text("Hello **there**, how are **you**", "**", "[font=Nunito-Black.ttf]", "[/font]"))
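Another way to do this without re is to split on the marker and wrap every second piece. This is only a sketch: the name wrap_marked is made up for illustration, and it assumes an unmatched trailing marker should be left in the text untouched.

```python
def wrap_marked(text, marker, left, right):
    # Splitting on the marker puts the text that sat between a pair of
    # markers at the odd indexes of the resulting list.
    parts = text.split(marker)
    if len(parts) % 2 == 0:
        # An even number of parts means an odd number of markers:
        # glue the last, unmatched marker back in literally.
        parts[-2:] = [parts[-2] + marker + parts[-1]]
    return "".join(
        left + part + right if index % 2 else part
        for index, part in enumerate(parts)
    )


formatted = wrap_marked(
    "Hello **there**, how are **you**",
    "**",
    "[font=Nunito-Black.ttf]",
    "[/font]",
)
print(formatted)
```

Because the text is split only once, the replacement strings are never scanned again, so they may safely contain the marker themselves.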
The solutions above should also work for other markers such as __, * or <. If you only want bold text, though, you may prefer Kivy's own bold markup for labels: an opening [b] and a closing [/b].
The formatting Stack Overflow uses is Markdown, implemented in JavaScript. If you only need this single case, you can see an implementation here, where a regex finds the matches and the code then iterates through them.
STRONG_RE = r'(\*{2})(.+?)\1'
I would recommend against re-implementing an entire markdown solution yourself when you can just import one.
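If you do only need the single bold case, the STRONG_RE pattern quoted above can be applied with re.sub. A minimal sketch, where the function name bold_to_font and its font parameter are assumptions made for illustration:

```python
import re

# Same pattern as quoted above: group 1 is the ** marker, group 2 the
# text between the markers, and \1 requires the same marker to close.
STRONG_RE = r'(\*{2})(.+?)\1'


def bold_to_font(text, font="Nunito-Black.ttf"):
    # Rebuild each match from group 2, wrapped in the font tags.
    return re.sub(
        STRONG_RE,
        lambda match: "[font={}]{}[/font]".format(font, match.group(2)),
        text,
    )


converted = bold_to_font("Hello **there**, how are **you**")
print(converted)
```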
Related
I'm trying to extract financial data from a wall of text. Basically, I have a function that splits the text three times; I know there is a more efficient way of doing so, but I cannot figure it out. Some curly braces really throw a wrench into my plan, because I'm trying to format a string.
I want to pass my function a string such as:
"totalCashflowsFromInvestingActivities"
and extract the following raw number:
"-2478000"
This is my current function, which works but is not efficient at all:
def splitting(value, text):
    x = text.split('"{}":'.format(value))[1]
    y = x.split(',"fmt":')[0]
    z = y.split(':')[1]
    return z
Any help would be greatly appreciated!
sample text:
"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}
Here is a solution using regex. It assumes the format is always the same, having the raw value always immediately after the title and separated by ":{.
import re
def get_value(value_name, text):
    """ finds all the occurrences of the passed `value_name`
    and returns the `raw` values"""
    pattern = value_name + r'":{"raw":(-?\d*)'
    return re.findall(pattern, text)
text = '"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}'
val = get_value('totalCashflowsFromInvestingActivities', text)
print(val)
['-2478000']
You can cast that result to a numeric type with map by replacing the return line.
return list(map(int, re.findall(pattern, text)))
If Buran is right and your source is JSON, you might find this helpful:
import json
s = '{"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}}]}}'
j = json.loads(s)
for i in j["cashflowStatementHistory"]["cashflowStatements"]:
    if "totalCashflowsFromInvestingActivities" in i:
        print(i["totalCashflowsFromInvestingActivities"]["raw"])
In this way you can find anything in the wall of text.
Take a look at this too: https://www.w3schools.com/python/python_json.asp
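If you do not know in advance where a key sits in the structure, a small recursive search can look for it at any depth. This is a sketch only: find_key is a made-up helper, and the sample string is a trimmed version of the one above.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key`, at any nesting depth."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            # Keep descending: the key may also appear deeper down.
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


sample = (
    '{"cashflowStatementHistory":{"cashflowStatements":['
    '{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M"},'
    '"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M"}}]}}'
)
data = json.loads(sample)
raw_values = [
    hit["raw"]
    for hit in find_key(data, "totalCashflowsFromInvestingActivities")
]
print(raw_values)
```

Unlike string splitting, this returns the number as an int rather than as text.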
I'm writing a function that needs to return an abbreviated version of a str. The returned str must contain the first letter, the number of characters removed, and the last letter. It must be abbreviated per word, not per sentence, and afterwards I need to join the words again in the same format, including the special characters. I tried using re.findall(), but it drops the special characters, so I can't use " ".join() because it would leave them out.
Here's my code:
import re
def abbreviate(wrd):
    return " ".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.findall(r"[\w']+", wrd)])
print(abbreviate("elephant-rides are really fun!"))
The output would be:
e6t r3s are r4y fun
But the output should be:
e6t-r3s are r4y fun!
No need for str.join. Might as well take full advantage of what the re module has to offer.
re.sub accepts a string or a callable object (like a function or lambda), which takes the current match as an input and must return a string with which to replace the current match.
import re
pattern = "\\b[a-z]([a-z]{2,})[a-z]\\b"
string = "elephant-rides are really fun!"
def replace(match):
    return f"{match.group(0)[0]}{len(match.group(1))}{match.group(0)[-1]}"
abbreviated = re.sub(pattern, replace, string)
print(abbreviated)
Output:
e6t-r3s are r4y fun!
Maybe someone else can improve upon this answer with a cuter pattern, or any other suggestions. The way the pattern is written now, it assumes that you're only dealing with lowercase letters, so that's something to keep in mind - but it should be pretty straightforward to modify it to suit your needs. I'm not really a fan of the repetition of [a-z], but that's just the quickest way I could think of for capturing the "inner" characters of a word in a separate capturing group. You may also want to consider what should happen with words/contractions like "don't" or "shouldn't".
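One possible way to lift the lowercase-only restriction is to match runs of four or more ASCII letters in either case. This is a sketch rather than a drop-in replacement: the names WORD and shorten are made up, and it assumes that a contraction such as "shouldn't" should have only its leading run of letters abbreviated. Unlike the pattern above it does not use \b, so letter runs inside mixed tokens are abbreviated too.

```python
import re

# Any run of four or more ASCII letters, upper or lower case.
WORD = re.compile(r"[A-Za-z]{4,}")


def abbreviate(text):
    def shorten(match):
        word = match.group(0)
        # First letter, count of the letters in between, last letter.
        return f"{word[0]}{len(word) - 2}{word[-1]}"

    return WORD.sub(shorten, text)


print(abbreviate("elephant-rides are really fun!"))
print(abbreviate("Elephant rides"))
print(abbreviate("shouldn't"))
```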
Thank you for viewing my question. After a few more searches, trial, and error I finally found a way to execute my code properly without changing it too much. I simply substituted re.findall(r"[\w']+", wrd) with re.split(r'([\W\d\_])', wrd) and also removed the whitespace in "".join() for they were simply not needed anymore.
import re
def abbreviate(wrd):
    return "".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.split(r'([\W\d\_])', wrd)])
print(abbreviate("elephant-rides are not fun!"))
Output:
e6t-r3s are not fun!
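The reason re.split works here where re.findall did not is the capturing group: when the pattern passed to re.split contains a group, the matched delimiters are kept in the result list. A short illustration, using the simpler pattern (\W) as a stand-in:

```python
import re

text = "elephant-rides are not fun!"

# findall returns only the matches, so the separators are lost.
words = re.findall(r"[\w']+", text)

# split with a capturing group returns the separators as list items too.
pieces = re.split(r"(\W)", text)

print(words)
print(pieces)
print("".join(pieces) == text)
```

Since nothing is dropped, joining the pieces with an empty string reproduces the original text exactly.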
I want to find and extract all the variables in a string that contains Python code. I only want to extract the variables (and variables with subscripts) but not function calls.
For example, from the following string:
code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
I want to extract: foo, bar[1], baz[1:10:var1[2+1]], var1[2+1], qux[[1,2,int(var2)]], var2, bob[len("foobar")], var3[0]. Please note that some variables may be "nested". For example, from baz[1:10:var1[2+1]] I want to extract baz[1:10:var1[2+1]] and var1[2+1].
The first two ideas that come to mind are to use either a regex or an AST. I have tried both but with no success.
When using a regex, in order to make things simpler, I thought it would be a good idea to first extract the "top level" variables, and then recursively the nested ones. Unfortunately, I can't even do that.
This is what I have so far:
regex = r'[_a-zA-Z]\w*\s*(\[.*\])?'
for match in re.finditer(regex, code):
    print(match)
Here is a demo: https://regex101.com/r/INPRdN/2
The other solution is to use an AST, extend ast.NodeVisitor, and implement the visit_Name and visit_Subscript methods. However, this doesn't work either because visit_Name is also called for functions.
I would appreciate if someone could provide me with a solution (regex or AST) to this problem.
Thank you.
I find your question an interesting challenge, so here is some code that does what you want. Doing this with a single regex is not possible because the expressions are nested, so this solution combines regexes with string manipulation to handle the nesting:
# -*- coding: utf-8 -*-
import re

RE_IDENTIFIER = r'\b[a-z]\w*\b(?!\s*[\[\("\'])'
RE_INDEX_ONLY = re.compile(r'(##)(\d+)(##)')
RE_INDEX = re.compile(r'##\d+##')


def extract_expression(string):
    """ extract all identifier and getitem expression in the given order."""

    def remove_brackets(text):
        # 1. handle `[...]` expression replace them with #{#...#}#
        # so we don't confuse them with word[...]
        pattern = r'(?<!\w)(\s*)(\[)([^\[]+?)(\])'
        # keep extracting expression until there is no expression
        while re.search(pattern, text):
            # substitute in `text` (not `string`) so each pass makes progress
            text = re.sub(pattern, r'\1#{#\3#}#', text)
        return text

    def get_ordered_subexp(exp):
        """ get index of nested expression."""
        index = int(exp.replace('#', ''))
        subexp = RE_INDEX.findall(expressions[index])
        if not subexp:
            return exp
        return exp + ''.join(get_ordered_subexp(i) for i in subexp)

    def replace_expression(match):
        """ save the expression in the list, replace it with special key and its index in the list."""
        match_exp = match.group(0)
        current_index = len(expressions)
        expressions.append(None)  # just to make sure the expression is inserted before its inner identifier
        # if the expression contains identifier extract too.
        if re.search(RE_IDENTIFIER, match_exp) and '[' in match_exp:
            match_exp = re.sub(RE_IDENTIFIER, replace_expression, match_exp)
        expressions[current_index] = match_exp
        return '##{}##'.format(current_index)

    def fix_expression(match):
        """ replace the match by the corresponding expression using the index"""
        return expressions[int(match.group(2))]

    # list that will contain the extracted expressions
    expressions = []
    string = remove_brackets(string)
    # 2. extract all expressions and keep track of their place in the original code
    pattern = r'\w+\s*\[[^\[]+?\]|{}'.format(RE_IDENTIFIER)
    # keep extracting expression until there is no expression
    while re.search(pattern, string):
        # every expression that is extracted is replaced by a special key
        string = re.sub(pattern, replace_expression, string)
        # sometimes the inside of brackets can contain a getitem expression
        # so when we extract that expression we handle the brackets
        string = remove_brackets(string)
    # 3. build the correct result with extracted expressions
    result = [None] * len(expressions)
    for index, exp in enumerate(expressions):
        # keep replacing special keys with the correct expression
        while RE_INDEX_ONLY.search(exp):
            exp = RE_INDEX_ONLY.sub(fix_expression, exp)
        # finally we don't forget about the brackets
        result[index] = exp.replace('#{#', '[').replace('#}#', ']')
    # 4. Order the indexes that were extracted
    ordered_index = ''.join(get_ordered_subexp(exp) for exp in RE_INDEX.findall(string))
    # convert it to integer
    ordered_index = [int(index[1]) for index in RE_INDEX_ONLY.findall(ordered_index)]
    # 5. fix the order of expressions using the ordered indexes
    final_result = []
    for exp_index in ordered_index:
        final_result.append(result[exp_index])
    # for debug:
    # print('final string:', string)
    # print('expression :', expressions)
    # print('order_of_expresion: ', ordered_index)
    return final_result
code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
code2 = 'baz[1:10:var1[2+1]]'
code3 = 'baz[[1]:10:var1[2+1]:[var3[3+1*x]]]'
print(extract_expression(code))
print(extract_expression(code2))
print(extract_expression(code3))
Output:
['foo', 'bar[1]', 'baz[1:10:var1[2+1]]', 'var1[2+1]', 'qux[[1,2,int(var2)]]', 'var2', 'bob[len("foobar")]', 'var3[0]']
['baz[1:10:var1[2+1]]', 'var1[2+1]']
['baz[[1]:10:var1[2+1]:[var3[3+1*x]]]', 'var1[2+1]', 'var3[3+1*x]', 'x']
I tested this code on very complicated examples and it worked perfectly. Notice that the order of extraction is the same as you wanted. I hope this is what you needed.
This answer might be too late, but it is possible to do this with the third-party Python regex package.
import regex
code= '''foo + bar[1] + baz[1:10:var1[2+1]] +
qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2
(var3[0])'''
p=r'(\b[a-z]\w*\b(?!\s*[\(\"])(\[(?:[^\[\]]|(?2))*\])?)'
result=regex.findall(p,code,overlapped=True) #overlapped=True is needed to capture something inside a group like 'var1[2+1]'
[x[0] for x in result] #result variable is list of tuple of two,as each pattern capture two groups ,see below.
output:
['foo','bar[1]','baz[1:10:var1[2+1]]','var1[2+1]','qux[[1,2,int(var2)]]','var2','bob[len("foobar")]','var3[0]']
Pattern explanation:
( # 1st capturing group start
\b[a-z]\w*\b #variable name ,eg 'bar'
(?!\s*[\(\"]) #negative lookahead. so to ignore something like foobar"
(\[(?:[^\[\]]|(?2))*\]) #2nd capture group,capture nested groups in '[ ]'
#eg '[1:10:var1[2+1]]'.
#'?2' refer to 2nd capturing group recursively
? #2nd capturing group is optional so to capture something like 'foo'
) #end of 1st group.
Regex is not a powerful enough tool to do this. If your nesting has a finite depth, there is a hacky workaround that would let you write a complicated regex to do what you are looking for, but I would not recommend it.
This question is asked a lot, and the linked response is famous for demonstrating the difficulty of what you are trying to do.
If you really must parse a string of code, an AST would technically work, but I am not aware of a library to help with this. You would be best off trying to build a recursive function to do the parsing.
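For what it's worth, Python's standard library includes an ast module, and the problem of function names showing up as plain names can be sidestepped by first recording which nodes are call targets. The following is a minimal sketch, not a vetted solution: the function name extract_variables is made up for illustration, and it assumes Python 3.8 or later (for ast.get_source_segment) and input that parses as valid Python.

```python
import ast


def extract_variables(code):
    nodes = list(ast.walk(ast.parse(code)))
    skip = set()
    for node in nodes:
        if isinstance(node, ast.Call):
            # `func` in `func(...)` is a function, not a variable.
            skip.add(id(node.func))
        elif isinstance(node, ast.Subscript):
            # `bar` in `bar[1]` is reported as part of `bar[1]` instead.
            skip.add(id(node.value))
    wanted = [
        node for node in nodes
        if isinstance(node, (ast.Name, ast.Subscript)) and id(node) not in skip
    ]
    # ast.walk is breadth-first, so restore source order explicitly.
    wanted.sort(key=lambda node: (node.lineno, node.col_offset))
    return [ast.get_source_segment(code, node) for node in wanted]


code = ('foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + '
        'bob[len("foobar")] + func() + func2 (var3[0])')
variables = extract_variables(code)
print(variables)
```

The design choice here is to let the parser deal with nesting and only filter the resulting nodes, so no bracket matching has to be done by hand.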
Basically, I have a list of special characters. I need to split a string by a character if it belongs to this list and exists in the string. Something on the lines of:
def find_char(string):
    if "some_char" in string:
        pass  # do xyz with some_char
    elif "another_char" in string:
        pass  # do xyz with another_char
    else:
        return False
and so on. The way I think of doing it is:
def find_char_split(string):
    char_list = [",","*",";","/"]
    for my_char in char_list:
        if string.find(my_char) != -1:
            my_strings = string.split(my_char)
            break
        else:
            my_strings = False
    return my_strings
Is there a more pythonic way of doing this? Or the above procedure would be fine? Please help, I'm not very proficient in python.
(EDIT): I want it to split on the first occurrence of the character, which is encountered first. That is to say, if the string contains multiple commas, and multiple stars, then I want it to split by the first occurrence of the comma. Please note, if the star comes first, then it will be broken by the star.
I would favor using the re module for this because the expression for splitting on multiple arbitrary characters is very simple:
r'[,*;/]'
The brackets create a character class that matches anything inside of them. The code is like this:
import re
results = re.split(r'[,*;/]', my_string, maxsplit=1)
The maxsplit argument makes it so that the split only occurs once.
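Put together, this covers the behaviour asked for in the edit to the question: the split happens at whichever delimiter occurs first in the string. A small sketch, where the name split_on_first_delimiter is made up and returning False when nothing matches mirrors the question's own function:

```python
import re

DELIMITERS = re.compile(r'[,*;/]')


def split_on_first_delimiter(text):
    # maxsplit=1 stops after the first delimiter, whichever one it is.
    parts = DELIMITERS.split(text, maxsplit=1)
    # Without any delimiter, split returns the whole string as one item.
    return parts if len(parts) == 2 else False


print(split_on_first_delimiter("a*b,c"))
print(split_on_first_delimiter("a,b*c"))
print(split_on_first_delimiter("abc"))
```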
If you are doing the same split many times, you can compile the regex and search on that same expression a little bit faster (but see Jon Clements' comment below):
c = re.compile(r'[,*;/]')
results = c.split(my_string)
If this speed-up is important (it probably isn't), you can use the compiled version in a function instead of having it recompiled every time. The function below compiles the expression once and returns a second function that holds on to the compiled expression:
def split_chars(chars, maxsplit=0, flags=0, string=None):
    # see note about the + symbol below
    c = re.compile('[{}]+'.format(''.join(chars)), flags=flags)
    def f(string, maxsplit=maxsplit):
        return c.split(string, maxsplit=maxsplit)
    return f if string is None else f(string)
Then:
special_split = split_chars(',*;/', maxsplit=1)
result = special_split(my_string)
But also:
result = split_chars(',*;/', string=my_string, maxsplit=1)
The purpose of the + character is to treat a run of consecutive delimiters as one if that is desired (thank you Jon Clements). If this is not desired, you can just use re.compile('[{}]'.format(''.join(chars))) above. Note that this matters even with maxsplit=1 when delimiters are adjacent: splitting 'a,,b' once gives ['a', 'b'] with the + and ['a', ',b'] without it.
Finally: have a look at this talk for a quick introduction to regular expressions in Python, and this one for a much more information packed journey.
I have a custom script that I want to extract data from with Python, but the only way I can think of is to take out the marked bits and leave the unmarked bits, like "go up" and "go down" in this example.
string_a = '[start]go up[wait time=500]go down[p]'
string_b = '#onclick go up[wait time=500]go down active="False"'
In trying to do so, all I managed to do was extract the marked bits, but I can't figure out a way to save the data that isn't marked! It always gets lost when I extract the other bits!
This is the function I'm using to extract them. I call it multiple times in order to whittle away the markers, but I can't choose the order they get extracted in!
class Parsers:
    #staticmethod
    def extract(line, filters='[]'):
        ##retval list
        substring=line[:]
        contents=[]
        for bracket in range(line.count(str(filters[0]))):
            startend =[]
            for f in filters:
                now= substring.find(f)
                startend.append(now)
            contents.append(substring[startend[0]+1:startend[1]])
            substring=substring[startend[1]+1:]
        return contents, substring
By the way, the order I'm calling it in at the moment is like this. I think I should put the order back to the # being first, but I don't want to break it again.
star_string, first = Parsers.extract(string_a, filters='* ')
bracket_string, substring = Parsers.extract(string_a, filters='[]')
at_string, final = Parsers.extract(substring, filters='# ')
please excuse my bad python, I learnt this all on my own and im still figuring this out.
You are doing some mighty acrobatics with Python string methods above, but if all you want is to extract the content within brackets and get the remainder of the string, that is easier with regular expressions (in Python, the "re" module):
import re
string_a = "[start]go up[wait time=500]go down[p]"
expr = re.compile(r"\[.*?\]")
contents = expr.findall(string_a)
substring = expr.sub("", string_a)
This simply tells the regexp engine to match a literal [, then whatever characters follow (.*) up to the next ] (the ? makes the match non-greedy, so it stops at the next ] and not the last one). The findall call gets all such matches as a list of strings, and the sub call replaces every match with an empty string.
As nice as regular expressions are, they are less Python than a sub-programming language of their own. Check the documentation on them: https://docs.python.org/2/library/re.html
Still, a simpler way of doing what you were doing is to check character by character, keeping a few variables to "know" where you are in the string (inside a tag or not, for example), just as we would approach the problem if we could look at only one character at a time. I have written the code with Python 3.x in mind; if you are still using Python 2.x, please convert your strings to unicode objects before trying something like this:
def extract(line, filters='[]'):
    substring = ""
    contents = []
    inside_tag = False
    partial_tag = ""
    for char in line:
        if char == filters[0] and not inside_tag:
            inside_tag = True
        elif char == filters[1] and inside_tag:
            contents.append(partial_tag)
            partial_tag = ""
            inside_tag = False
        elif inside_tag:
            partial_tag += char
        else:
            # Outside any tag: keep the character in the remaining text.
            substring += char
    if partial_tag:
        print("Warning: unclosed tag '{}' ".format(partial_tag))
    return contents, substring
Note that there is no need for complicated calculations of where each bracket falls in the line, and so on: you just get them all.
Not sure I understand this fully - you want to get [stuff in brackets] and everything else? If you are just parsing flat strings - no recursive brackets-in-brackets - you can do
import re
parse = re.compile(r"\[.*?\]|[^\[]+").findall
then
>>> parse('[start]go up[wait time=500]go down[p]')
['[start]', 'go up', '[wait time=500]', 'go down', '[p]']
>>> parse('#onclick go up[wait time=500]go down active="False"')
['#onclick go up', '[wait time=500]', 'go down active="False"']
The regex translates as "everything between two square brackets OR anything up to but not including an opening square bracket".
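If you then want the tags and the plain text in separate lists, the tokens can be sorted by whether they are wrapped in brackets. A sketch under the same flat-string assumption; the names TOKEN_RE and split_tags are made up for illustration:

```python
import re

# Same pattern as above: a bracketed tag, or a run of non-[ characters.
TOKEN_RE = re.compile(r"\[.*?\]|[^\[]+")


def split_tags(line):
    tags, text = [], []
    for token in TOKEN_RE.findall(line):
        if token.startswith("[") and token.endswith("]"):
            # Store the tag without its surrounding brackets.
            tags.append(token[1:-1])
        else:
            text.append(token)
    return tags, text


tags, text = split_tags('[start]go up[wait time=500]go down[p]')
print(tags)
print(text)
```

Both lists keep the order in which the pieces appear in the line, so nothing unmarked is lost.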
If this isn't what you wanted - do you want #word to be a separate chunk? - please show what string_a and string_b should be parsed as!