need help extracting string from text - python

I'm trying to extract financial data from a wall of text. Basically, I have a function that splits the text three times. I know there must be a more efficient way of doing this, but I cannot figure it out. The curly braces really throw a wrench into my plan, because I'm trying to format a string.
I want to pass my function a string such as:
"totalCashflowsFromInvestingActivities"
and extract the following raw number:
"-2478000"
this is my current function, which works, but not efficient at all
def splitting(value, text):
    x = text.split('"{}":'.format(value))[1]
    y = x.split(',"fmt":')[0]
    z = y.split(':')[1]
    return z
any help would be greatly appreciated!
sample text:
"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}

Here is a solution using regex. It assumes the format is always the same, having the raw value always immediately after the title and separated by ":{.
import re
def get_value(value_name, text):
    """ finds all the occurrences of the passed `value_name`
    and returns the `raw` values"""
    pattern = value_name + r'":{"raw":(-?\d*)'
    return re.findall(pattern, text)
text = '"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}'
val = get_value('totalCashflowsFromInvestingActivities', text)
print(val)
['-2478000']
You can cast that result to a numeric type with map by replacing the return line.
return list(map(int, re.findall(pattern, text)))

If Buran is right and your source is JSON, you might find this helpful:
import json
s = '{"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}}]}}'
j = json.loads(s)
for i in j["cashflowStatementHistory"]["cashflowStatements"]:
    if "totalCashflowsFromInvestingActivities" in i:
        print(i["totalCashflowsFromInvestingActivities"]["raw"])
In this way you can find anything in the wall of text.
Take a look at this too: https://www.w3schools.com/python/python_json.asp
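If you do not know in advance where a key sits in the structure, a small recursive search over the parsed JSON can find it. This is only a sketch: the helper name find_raw_values is made up for illustration, and it assumes every entry of interest is an object with a "raw" field, as in the sample text.

```python
import json

def find_raw_values(node, key):
    """Yield the "raw" value of every `key` entry found anywhere in `node`."""
    if isinstance(node, dict):
        for name, child in node.items():
            if name == key and isinstance(child, dict) and "raw" in child:
                yield child["raw"]
            else:
                # Keep descending into nested objects and arrays.
                yield from find_raw_values(child, key)
    elif isinstance(node, list):
        for child in node:
            yield from find_raw_values(child, key)

sample = '{"cashflowStatementHistory":{"cashflowStatements":[{"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}}]}}'
data = json.loads(sample)
print(list(find_raw_values(data, "totalCashflowsFromInvestingActivities")))
```

Because json.loads already converts numbers, the values come back as int rather than as strings.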

Related

Formatting in python(Kivy) like in Stack overflow

My issue is that I would like to take input text with formatting like you would use when creating a Stackoverflow post and reformat it into the required text string. The best way I can think is to give an example....
# This is the input string
Hello **there**, how are **you**
# This is the intended output string
Hello [font=Nunito-Black.ttf]there[/font], how are [font=Nunito-Black.ttf]you[/font]
So each ** is replaced by a different string that has an opening and a closing part, and this needs to work as many times as needed for any string (twice in the example).
I have tried to use a variable to record if the ** in need of replacing is an opening or a closing part, but haven't managed to get a function to work yet, hence it being incomplete
I think replacing the correct ** is hard because I have been trying to use index which will only return the position of the 1st occurrence in the string
My attempt as of now
def formatting_text(input_text):
    if input_text:
        if '**' in input_text:
            d = '**'
            for line in input_text:
                s = [e+d for e in line.split(d) if e]
                count = 0
                for y in s:
                    if y == '**' and count == 0:
                        s.index(y)
                        # replace with required part
            return output_text
    return input_text
I have tried to find this answer, so I'm sorry if it has already been asked, but I have had no luck finding it and don't know what to search for.
Of course thank you for any help
A general solution for your case,
Using re
import re
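def formatting_text(input_text, special_char, left_repl, right_repl):
    # Escape the delimiter so that characters such as * are matched literally.
    delim = re.escape(special_char)
    # Match the delimiter, the shortest possible text, then the delimiter again.
    re_pattern = delim + r".+?" + delim
    for word in re.findall(re_pattern, input_text):
        # Drop the delimiter on both sides and wrap the text in the replacement parts.
        new_word = left_repl + word[len(special_char):-len(special_char)] + right_repl
        input_text = input_text.replace(word, new_word)
    return input_text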
print(formatting_text("Hello **there**, how are **you**", "**", "[font=Nunito-Black.ttf]", "[/font]"))
Without using re
def formatting_text(input_text, special_char, left_repl, right_repl):
    while True:
        # Replace the left part.
        input_text = input_text.replace(special_char, left_repl, 1)
        # Replace the right part.
        input_text = input_text.replace(special_char, right_repl, 1)
        if input_text.find(special_char) == -1:
            # Nothing found, time to stop.
            break
    return input_text
print(formatting_text("Hello **there**, how are **you**", "**", "[font=Nunito-Black.ttf]", "[/font]"))
The solutions above should also work for other values of special_char such as __, * or <. If you only want bold text, though, you may prefer Kivy's own label markup, i.e. [b] to open and [/b] to close.
The formatting Stack Overflow uses is Markdown, implemented in JavaScript. If you just want this single case to be formatted, you can see an implementation here where they use a regex to find the matches and then just iterate through them.
STRONG_RE = r'(\*{2})(.+?)\1'
I would recommend against re-implementing an entire markdown solution yourself when you can just import one.
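For the single bold case, the STRONG_RE pattern can be applied directly with re.sub. The sketch below is an illustration only: the function name bold_to_kivy_font and the default font are assumptions taken from the question's example, not part of any Markdown library.

```python
import re

# Group 1 is the ** delimiter, group 2 the text, and \1 requires the same delimiter again.
STRONG_RE = r'(\*{2})(.+?)\1'

def bold_to_kivy_font(text, font="Nunito-Black.ttf"):
    # A function is used as the replacement so the font name is inserted literally.
    return re.sub(
        STRONG_RE,
        lambda m: "[font={}]{}[/font]".format(font, m.group(2)),
        text,
    )

print(bold_to_kivy_font("Hello **there**, how are **you**"))
```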

Remove brackets if the content inside is a number

Is there any way to remove the brackets () if the content inside them is numeric, i.e. passes .isnumeric()?
I do know a little bit of RegEx but I'm unable to find a way to do it using RegEx.
Example:
input = '((1)+(1))+2+(1+2)+((2))'
output = somefunction(input)
Here the output should look like
(1+1)+2+(1+2)+2
import re
x = '((1)+(1))+2+(1+2)+((2))'
re.sub(r'(\()([\d*\.]+)(\))', r"\2", x)
"""
or
re.sub(r'\(([\d*\.]+)\)', r"\1", x) # #deceze
"""
But this will give you
(1+1)+2+(1+2)+(2)
You could use re.subn to repeat the substitution until the number of replacements is 0.
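A minimal sketch of that re.subn loop follows. The function name strip_numeric_parens is hypothetical, and the pattern is an assumption that "number" means digits with an optional decimal part.

```python
import re

def strip_numeric_parens(expr):
    # Parentheses whose whole content is a number such as 1 or 2.5.
    pattern = re.compile(r'\((\d+(?:\.\d+)?)\)')
    while True:
        # subn returns the new string and the number of replacements made.
        expr, count = pattern.subn(r'\1', expr)
        if count == 0:
            return expr

print(strip_numeric_parens('((1)+(1))+2+(1+2)+((2))'))
```

Each pass removes one layer of brackets around a bare number, so ((2)) needs two passes before the loop stops.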

Replace Items with regex (kind of)

I am handed a bunch of data and trying to get rid of certain characters. The data contains multiple instances of "^{number}" → "^0", "^1", "^2", etc.
I am trying to set all of these instances to an empty string, "", is there a better way to do this than
string.replace("^0", "").replace("^1", "").replace("^2", "")
I understand you can use a dictionary, but it seems a little overkill considering each item will be replaced with "".
Assuming the digits are always at the end of the string, have a look at the solutions below.
With regex:
import re
text = 'xyz125'
s = re.sub(r"\d+$", '', text)
print(s)
It should print:
xyz
Without regex (keep in mind that this solution removes all digits, not only the ones at the end of the string):
text = 'xyz125'
result = ''.join(i for i in text if not i.isdigit())
print(result)
It should print:
xyz
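If the markers really are a literal caret followed by a number, as in the question's "^0", "^1", "^2", a single re.sub can remove them wherever they appear. This is a sketch under that assumption; the function name and the sample string are invented for illustration.

```python
import re

def strip_caret_codes(text):
    # \^ matches a literal caret; \d+ matches the digits that follow it.
    return re.sub(r"\^\d+", "", text)

print(strip_caret_codes("^1Player^7 joined ^2team"))
```

If each marker is always exactly one digit, use \d instead of \d+ so that a digit belonging to the following text is kept.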

I want to split a string at the first occurrence of a character that belongs to a list of characters. How to do this in Python?

Basically, I have a list of special characters. I need to split a string by a character if it belongs to this list and exists in the string. Something on the lines of:
def find_char(string):
    if string.find("some_char") != -1:
        pass  # do xyz with some_char
    elif string.find("another_char") != -1:
        pass  # do xyz with another_char
    else:
        return False
and so on. The way I think of doing it is:
def find_char_split(string):
    char_list = [",","*",";","/"]
    for my_char in char_list:
        if string.find(my_char) != -1:
            my_strings = string.split(my_char)
            break
    else:
        my_strings = False
    return my_strings
Is there a more Pythonic way of doing this? Or would the above procedure be fine? Please help, I'm not very proficient in Python.
(EDIT): I want it to split on the first occurrence of the character, which is encountered first. That is to say, if the string contains multiple commas, and multiple stars, then I want it to split by the first occurrence of the comma. Please note, if the star comes first, then it will be broken by the star.
I would favor using the re module for this because the expression for splitting on multiple arbitrary characters is very simple:
r'[,*;/]'
The brackets create a character class that matches anything inside of them. The code is like this:
import re
results = re.split(r'[,*;/]', my_string, maxsplit=1)
The maxsplit argument makes it so that the split only occurs once.
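To make the effect of maxsplit=1 concrete, here is a small self-contained sketch. The wrapper name split_on_first_delimiter is hypothetical, and returning False when nothing matches is an assumption that mirrors the function in the question.

```python
import re

def split_on_first_delimiter(text, delimiters=",*;/"):
    # re.escape keeps characters such as * literal inside the character class.
    pattern = "[" + re.escape(delimiters) + "]"
    parts = re.split(pattern, text, maxsplit=1)
    # With no delimiter present, re.split returns the whole string as one item.
    return parts if len(parts) == 2 else False

print(split_on_first_delimiter("a*b,c*d"))
print(split_on_first_delimiter("a,b*c,d"))
print(split_on_first_delimiter("abcd"))
```

The split happens at whichever delimiter appears first in the string, regardless of its position in the delimiter list.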
If you are doing the same split many times, you can compile the regex and search on that same expression a little bit faster (but see Jon Clements' comment below):
c = re.compile(r'[,*;/]')
results = c.split(my_string)
If this speed-up is important (it probably isn't), you can use the compiled version in a function instead of having it recompiled every time. The function below compiles the expression once and returns an inner function that reuses it:
def split_chars(chars, maxsplit=0, flags=0, string=None):
    # see note about the + symbol below
    c = re.compile('[{}]+'.format(''.join(chars)), flags=flags)
    def f(string, maxsplit=maxsplit):
        return c.split(string, maxsplit=maxsplit)
    return f if string is None else f(string)
Then:
special_split = split_chars(',*;/', maxsplit=1)
result = special_split(my_string)
But also:
result = split_chars(',*;/', string=my_string, maxsplit=1)
The purpose of the + character is to treat multiple delimiters as one if that is desired (thank you Jon Clements). If this is not desired, you can just use re.compile('[{}]'.format(''.join(chars))) above. Note that with maxsplit=1, this will not have any effect.
Finally: have a look at this talk for a quick introduction to regular expressions in Python, and this one for a much more information-packed journey.

Search for string in file while ignoring id and replacing only a substring

I’ve got a master .xml file generated by an external application and want to create several new .xmls by adapting and deleting some rows with python. The search strings and replace strings for these adaptions are stored within an array, e.g.:
replaceArray = [
[u'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"',
u'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'],
[u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="false"/>',
u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="true"/>'],
[u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="false"/>',
u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="true"/>']]
So I'd like to iterate through my file and replace all occurrences of 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"' with 'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"' and so on.
Unfortunately, the ID values of "RowID", "id_tool_base" and "ref_layerid_mapping" might change occasionally. So what I need is to search for matches of the whole string in the master file regardless of which ID value is in between the quotation marks, and to replace only the substring that differs between the two strings of the replaceArray (e.g. use="true" instead of use="false"). I'm not very familiar with regular expressions, but I think I need something like this for my search?
re.sub(r'<TOOL_SELECT_LINE RowID="\d+" id_tool_base="\d+" use="false"/>', "", sentence)
I'm happy about any hint that points me in the right direction! If you need any further information or if something is not clear in my question, please let me know.
One way to do this is to pass re.sub a function as the replacement. The function receives the match object and reinserts the ID captured from the string being replaced.
import re
s = 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"'
# Capture the quoted ID so that it can be reused in the replacement.
pat = re.compile(r'ref_layerid_mapping=("[^"]*") lyvis="off" toc_visible="off"')
def replacer(m):
    # Keep the captured ID and switch both flags to "on".
    return 'ref_layerid_mapping=' + m.group(1) + ' lyvis="on" toc_visible="on"'
re.sub(pat, replacer, s)
Output:
'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'
Another way is to use back-references in replacement pattern. (see http://www.regular-expressions.info/replacebackref.html)
For example:
import re
s = "Ab ab"
re.sub(r"(\w)b (\w)b", r"\1d \2d", s)
Output:
'Ad ad'
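Applied to the TOOL_SELECT_LINE rows from the question, the back-reference approach might look like the sketch below. The function name enable_select_line is made up, and the pattern assumes both IDs are purely numeric and the attributes always appear in this order.

```python
import re

# Group 1 keeps everything up to use=, whatever the IDs are; group 2 keeps the tag end.
PATTERN = re.compile(
    r'(<TOOL_SELECT_LINE RowID="\d+" id_tool_base="\d+" use=)"false"(/>)'
)

def enable_select_line(xml_text):
    # \1 and \2 reinsert the captured parts unchanged; only the flag is rewritten.
    return PATTERN.sub(r'\1"true"\2', xml_text)

line = '<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="false"/>'
print(enable_select_line(line))
```

Because the IDs are matched with \d+ rather than written out, the same pattern keeps working when RowID or id_tool_base change in a later export.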
