I am trying to strip a line of code so that only the comment at the end is kept. Because # signs can appear inside "" marks, I am trying to cycle through the line catching pairs of " marks, so that any # marks within "" marks are ignored. When I use a code visualiser on my code below, after the second for loop it seems to go back to processing s as if it had just stripped the first " mark. I can't see what I'm doing wrong here: the print statement I have included inside the inner loop shows that s has been stripped to after the second ", but when the code returns to the top, it starts cycling again from after the first ". Any idea what I am doing wrong here?
s = '("8# " + str" #9 " + line) #lots of hash(#) symbols here'
quoteCount = 0
for char in s:
    if quoteCount%2 == 0:
        if char == '#':
            s = s[s.index('#'):]
            break
    if char == '"':
        quoteCount = quoteCount + 1
        s = s[s.index('"'):]
        s = s.lstrip('"')
        for char in s:
            if char == '"':
                quoteCount = quoteCount + 1
                s = s[s.index('"'):]
                s = s.lstrip('"')
                print(s)
                break
print(s)
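If I understand your question correctly, you only want to keep the last comment (#lots of hash(#) symbols here).
To do this you don't need the nested for loop. The outer loop keeps iterating over the string that s referred to when the loop started, so reassigning s inside the loop neither restarts nor shortens the iteration. A single pass that counts the quotes is enough.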
s = '("8# " + str" #9 " + line) #lots of hash(#) symbols here'
quoteCount = 0
for char in s:
    if quoteCount%2 == 0:
        if char == '#':
            s = s[s.index('#'):]
            break
    if char == '"':
        quoteCount = quoteCount + 1
        s = s[s.index('"'):]
        s = s.lstrip('"')
print(s)
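The "jumping back" behaviour described in the question can be reproduced in isolation. The following is only a minimal sketch (the names `text` and `seen` are made up for illustration): a for loop keeps walking the string object it started with, even when the name is rebound inside the loop.

```python
text = "abc"
seen = []
for char in text:
    # Rebinding the name creates a new string; the loop still walks "abc".
    text = text[1:]
    seen.append(char)

print(seen)        # ['a', 'b', 'c']
print(repr(text))  # ''
```

This is why the nested loop in the question cannot "fast-forward" the outer loop: the outer loop never looks at the new value of the variable.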
Easier to remove the quoted strings with a regular expression:
import re
s = '("8# " + str" #9 " + line) #lots of hash(#) symbols here'
pattern = r'"[^"]*"'
s = re.sub(pattern, '', s)
print(s[s.index('#'):])
Output:
#lots of hash(#) symbols here
Your code is overly complicated, so I suggest you use an alternative method of finding the comment, such as the regex approach already mentioned or the one I came up with. Note that mine assumes the comment itself contains no quotation marks.
s = '("8# " + str" #9 " + line) #lots of hash(#) symbols here'
s = s[s.rfind('"') + 1:] # Get to the last quotation mark
if s.find('#') >= 0: # The first # sign should start the comment if there is one
    s = s[s.find('#'):]
else:
    s = '' # No comment was found
print(s)
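For comparison, the quote-tracking idea from the question can also be written as a single pass with a boolean flag instead of a counter. This is only a sketch: the function name `extract_comment` is hypothetical, and it assumes that strings are delimited by double quotes only and contain no escaped quotes.

```python
def extract_comment(line):
    """Return the trailing comment of `line` (including '#'), or '' if none."""
    in_quotes = False
    for index, char in enumerate(line):
        if char == '"':
            # Every double quote toggles between "inside" and "outside" a string.
            in_quotes = not in_quotes
        elif char == '#' and not in_quotes:
            # First '#' outside of quotes starts the comment.
            return line[index:]
    return ''


s = '("8# " + str" #9 " + line) #lots of hash(#) symbols here'
print(extract_comment(s))
```

Because the function slices the original line by index rather than reassigning the loop variable's source, there is no confusion about which string is being iterated.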
The following works well:
from textwrap import fill
print(fill("hello-there", 8))
Outputs:
hello-
there
However, I am using a lot of text where words are separated with underscores, not hyphens. The option break_on_hyphens is great but there seems to be no way to specify other separators.
I looked around and was really surprised to not find anything on this. Does anyone have any idea of the best way to proceed?
So the quick-and-dirty way I found was to simply write a wrapper around the function that replaces the desired character with a hyphen to make that function work. It's not very Pythonic, but would get the job done. I'm hoping someone else can come up with the "actual" answer (if it exists...).
Note that I wrote a "fill" version as I am looking for a string, not a list.
Python 3:
from textwrap import wrap
def fill_custom_sep(source_text, separator_char, width=70, **kwargs):
    chunk_start_index = 0
    replaced_with_hyphens = source_text.replace(separator_char, '-')
    returned = ""
    for chunk in wrap(replaced_with_hyphens, width, **kwargs):
        if len(returned):
            returned += '\n' # Todo: Modifiable
        chunk_length = len(chunk)
        if chunk[-1] == "-" and source_text[chunk_start_index + (chunk_length - 1)] == separator_char:
            chunk = chunk[:-1] + separator_char
        returned += chunk
        chunk_start_index += chunk_length
    return returned
print(fill_custom_sep("hello_there", "_", 8))
Outputs:
hello_
there
While we're at it, here is a simple function to wrap around a list of separators (not just one):
from textwrap import wrap
def wrap_custom(source_text, separator_chars, width=70, keep_separators=True):
    current_length = 0
    latest_separator = -1
    current_chunk_start = 0
    output = ""
    char_index = 0
    while char_index < len(source_text):
        if source_text[char_index] in separator_chars:
            latest_separator = char_index
        output += source_text[char_index]
        current_length += 1
        if current_length == width:
            if latest_separator >= current_chunk_start:
                # Valid earlier separator, cut there
                cutting_length = char_index - latest_separator
                if not keep_separators:
                    cutting_length += 1
                if cutting_length:
                    output = output[:-cutting_length]
                output += "\n"
                current_chunk_start = latest_separator + 1
                char_index = current_chunk_start
            else:
                # No separator found, hard cut
                output += "\n"
                current_chunk_start = char_index + 1
                latest_separator = current_chunk_start - 1
                char_index += 1
            current_length = 0
        else:
            char_index += 1
    return output
wrapped = wrap_custom("Split This-This_Text", [" ","_","-"], 7, False)
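Another way to get the same effect, without going through textwrap at all, is to split the text into pieces that each end on a separator and then pack those pieces greedily into lines. The following is only a sketch under some assumptions: the name `fill_on_separators` is hypothetical, `separators` must be a non-empty string of single characters, separators are always kept, and a piece longer than `width` is left unbroken rather than hard-cut.

```python
import re


def fill_on_separators(text, separators="_", width=70):
    """Greedily pack `text` into lines, breaking only right after a separator."""
    char_class = re.escape(separators)
    # Each piece is a run of non-separators plus the separator that ends it,
    # or a trailing run that has no separator at all.
    pieces = re.findall("[^{0}]*[{0}]|[^{0}]+".format(char_class), text)
    lines = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > width:
            # The piece does not fit any more: close the current line.
            lines.append(current)
            current = piece
        else:
            current += piece
    if current:
        lines.append(current)
    return "\n".join(lines)


print(fill_on_separators("hello_there", "_", 8))
```

Unlike the hyphen-substitution wrapper, this never replaces characters in the text, so separators in the middle of a line come out unchanged.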
I have a file called test.txt that contains HTML with a lot of duplicate spaces. I want to remove all the unnecessary whitespace to reduce the size of the file's contents. How can I remove the duplicate spaces and put the entire string on one line?
test.txt
<center>
<b class="test" >My name
is
fred</ b> <center>
What I want to print
<center><b class="test">My name is fred</b><center>
What gets printed
<center><b class="test" >Mynameisfred</b> <center>
program.py
def is_white_space(before, curr, after):
    # remove duplicate spaces
    if (curr == " " and (before == " " or after == " ")):
        return True
    # Remove all \n
    elif (curr == "\n"):
        return True
    return False
f = open('test.txt', 'r')
contents = f.read()
f.close()
new = "";
i = 0
while (i < len(contents)):
    if (i != 0 and
            i != (len(contents) - 1) and
            not is_white_space(contents[i - 1], contents[i], contents[i + 1])):
        new += contents[i]
    i += 1
print(new)
This will leave a space between digits or letters.
from string import ascii_letters, digits
def main():
    with open('test.txt', 'r') as f:
        parts = f.read().split()
    keep_separated = set(ascii_letters) | set(digits)
    for i in range(len(parts) - 1):
        if parts[i][-1] in keep_separated and parts[i + 1][0] in keep_separated:
            parts[i] = parts[i] + " "
    print(''.join(parts))
if __name__ == '__main__':
    main()
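A regex-based variant of the same idea is sketched below: first collapse every run of whitespace to a single space, then drop the spaces that sit next to tag brackets. The function name `minify_html_whitespace` is hypothetical, and the approach is a plain text transformation rather than real HTML parsing, so it assumes there is no `<pre>` content or literal `>` in the text, and it will also remove a space between two adjacent inline tags.

```python
import re


def minify_html_whitespace(html):
    """Collapse whitespace runs and remove spaces around tag brackets."""
    collapsed = re.sub(r"\s+", " ", html).strip()    # newlines/tabs/runs -> one space
    collapsed = re.sub(r">\s+<", "><", collapsed)    # space between two tags
    collapsed = re.sub(r"\s+>", ">", collapsed)      # space before a closing bracket
    collapsed = re.sub(r"<(/?)\s+", r"<\1", collapsed)  # space after "<" or "</"
    return collapsed


sample = '<center>\n    <b   class="test"  >My   name\n   is\n  fred</  b>    <center>'
print(minify_html_whitespace(sample))
```

For the sample above this prints `<center><b class="test">My name is fred</b><center>`, which is the output asked for in the question.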
The client's name comes after the word "for" and before the opening parenthesis "(" that starts the proposal number. I need to extract the client name so that I can use it to look up the deal in a later step. What would be the easiest way to set this up: Zapier's Extract Pattern, or Zapier Code in Python?
I have tried this and it did not work. It seemed promising though.
input_data
client = Reminder: Leruths has sent you a proposal for Business Name (#642931)
import regex
rgx = regex.compile(r'(?si)(?|{0}(.*?){1}|{1}(.*?){0})'.format('for', '('))
s1 = 'client'
for s in [s1]:
    m = rgx.findall
    for x in m:
        print x.strip()
I have also tried this and it did not work.
start = mystring.find( 'for' )
end = mystring.find( '(' )
if start != -1 and end != -1:
    result = mystring[start+1:end]
I am looking for Business Name to be returned in my example.
Fastest way:
start = client.find('for')
end = client.find('(')
result = client[start+4:end-1]
print(result)
With regex:
import re
result = re.search(r' for (.*) [(]', client)
print(result.group(1))
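Both snippets above assume the message always has the expected shape. If the text might not match (for example in a Zapier Code step that receives other emails too), a slightly more defensive version could look like the sketch below. The names `CLIENT_PATTERN` and `extract_client_name` are hypothetical, and the "(#digits)" form of the proposal number is an assumption based on the single example in the question.

```python
import re

# "for", the client name (captured lazily), then the "(#123456)" proposal number.
CLIENT_PATTERN = re.compile(r"\bfor\s+(.+?)\s*\(#\d+\)")


def extract_client_name(message):
    """Return the client name, or None if the message has a different shape."""
    match = CLIENT_PATTERN.search(message)
    return match.group(1) if match else None


client = "Reminder: Leruths has sent you a proposal for Business Name (#642931)"
print(extract_client_name(client))
```

Returning None instead of raising makes it easy to skip messages that do not contain a proposal reference.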
There is probably a cleaner way to do this, but here is another solution without regex
client = "Reminder: Leruths has sent you a proposal for Business Name (#642931)"
cs = client.split(" ")
name = ""
append = False
for word in cs:
    if "for" == word:
        append = True
    elif word.startswith("("):
        append = False
    if append is True and word != "for":
        name += (word + " ")
name = name.strip()
print(name)
Another method:
client = "Reminder: Leruths has sent you a proposal for Business Name (#642931)"
cs = client.split(" ")
name = ""
forindex = cs.index("for")
for i in range(forindex+1, len(cs)):
    if cs[i].startswith("("):
        break
    name += cs[i] + " "
name = name.strip()
print(name)
Running the code below gives:
Regex method took 2.3912417888641357 seconds
Search word by word method took 4.78193998336792 seconds
Search with list index method took 3.1756017208099365 seconds
String indexing method took 0.8496286869049072 seconds
Code to check the fastest to get the name over a million tries:
import re
import time
client = "Reminder: Leruths has sent you a proposal for Business Name (#642931)"
def withRegex(client):
    result = re.search(r' for (.*) [(]', client)
    return(result.group(1))
def searchWordbyWord(client):
    cs = client.split(" ")
    name = ""
    append = False
    for word in cs:
        if "for" == word:
            append = True
        elif word.startswith("("):
            append = False
        if append is True and word != "for":
            name += (word + " ")
    name = name.strip()
    return name
def searchWithListIndex(client):
    cs = client.split(" ")
    name = ""
    forindex = cs.index("for")
    for i in range(forindex+1, len(cs)):
        if cs[i].startswith("("):
            break
        name += cs[i] + " "
    name = name.strip()
    return name
def stringIndexing(client):
    start = client.find('for')
    end = client.find('(')
    result = client[start+4:end-1]
    return result
wr = time.time()
for x in range(1,1000000):
    withRegex(client)
wr = time.time() - wr
print("Regex method took " + str(wr) + " seconds")
sw = time.time()
for x in range(1,1000000):
    searchWordbyWord(client)
sw = time.time() - sw
print("Search word by word method took " + str(sw) + " seconds")
wl = time.time()
for x in range(1,1000000):
    searchWithListIndex(client)
wl = time.time() - wl
print("Search with list index method took " + str(wl) + " seconds")
si = time.time()
for x in range(1,1000000):
    stringIndexing(client)
si = time.time() - si
print("String indexing method took " + str(si) + " seconds")
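The same kind of comparison can be written more compactly with the standard library's timeit module, which takes care of the timing loop. This is only a sketch: the helper name `benchmark` and the snake_case function names are made up here, only two of the four methods are included, and the repetition count is kept small, so the numbers will not match the ones above.

```python
import re
import timeit

client = "Reminder: Leruths has sent you a proposal for Business Name (#642931)"


def with_regex(text):
    return re.search(r' for (.*) [(]', text).group(1)


def with_indexing(text):
    start = text.find('for')
    end = text.find('(')
    return text[start + 4:end - 1]


def benchmark(func, repeats=3, number=10000):
    # Best of several runs, to reduce noise from other processes.
    return min(timeit.repeat(lambda: func(client), repeat=repeats, number=number))


timings = {func.__name__: benchmark(func) for func in (with_regex, with_indexing)}
for name, seconds in timings.items():
    print("{}: {:.4f} seconds".format(name, seconds))
```

Taking the minimum of several repeats is the usual recommendation for timeit, since the slower runs mostly measure interference from the rest of the system.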
The program should decode user input such as "8#15#23###23#1#19###9#20".
The output should be "HOW WAS IT".
However, it does not work: the spaces (###) are not decoded.
ABSTRACT ={"A":"1","B":"2","C":"3","D":"4","E":"5","F":"6","G":"7","H":"8","I":"9", "J":"10","K":"11","L":"12","M":"13","N":"14","O":"15","P":"16","Q":"17","R":"18","S":"19","T":"20","U":"21","V":"22","W":"23", "X":"24","Y":"25","Z":"26",
" ":"###","":"#" }
ABSTRACT_SHIFTED = {value:key for key,value in ABSTRACT.items()}
def from_abstract(s):
    result = ''
    for word in s.split('*'):
        result = result +ABSTRACT_SHIFTED.get(word)
    return result
This would do the trick:
#!/usr/bin/env python
InputString = "8#15#23###23#1#19###9#20"
InputString = InputString.replace("###", "##")
InputString = InputString.split("#")
DecodedMessage = ""
for NumericRepresentation in InputString:
    if NumericRepresentation == "":
        NumericRepresentation = " "
        DecodedMessage += NumericRepresentation
        continue
    else:
        DecodedMessage += chr(int(NumericRepresentation) + 64)
print(DecodedMessage)
Prints:
HOW WAS IT
You can also use a regex:
import re
replacer ={"A":"1","B":"2","C":"3","D":"4","E":"5","F":"6","G":"7","H":"8","I":"9", "J":"10","K":"11","L":"12","M":"13","N":"14","O":"15","P":"16","Q":"17","R":"18","S":"19","T":"20","U":"21","V":"22","W":"23", "X":"24","Y":"25","Z":"26",
" ":"###","":"#" }
reversed = {value:key for key,value in replacer.items()}
# Reversed because regex is greedy and it will match 1 before 15
target = '8#15#23###23#1#19###9#20'
pattern = '|'.join(map(lambda x: x + '+', list(reversed.keys())[::-1]))
repl = lambda x: reversed[x.group(0)]
print(re.sub(pattern, string=target, repl=repl))
And prints:
HOW WAS IT
With a couple minimal changes to your code it works.
1) split on '#', not '*'
2) retrieve ' ' by default if a match isn't found
3) use '##' instead of '###'
def from_abstract(s):
    result = ''
    for word in s.replace('###','##').split('#'):
        result = result +ABSTRACT_SHIFTED.get(word," ")
    return result
Swap the key-value pairs of ABSTRACT and use simple split + join on input
ip = "8#15#23###23#1#19###9#20"
ABSTRACT = dict((v,k) for k,v in ABSTRACT.items())
''.join(ABSTRACT.get(i,' ') for i in ip.split('#')).replace('  ', ' ')
#'HOW WAS IT'
The biggest challenge here is that "#" is used both as a token separator and as the space character. You have to know the context to tell which one you've got at any given time, and that makes it difficult to simply split the string. So write a simple parser. This one will accept anything as the first character in a token and then grab everything until it sees the next "#".
ABSTRACT ={"A":"1","B":"2","C":"3","D":"4","E":"5","F":"6","G":"7","H":"8","I":"9", "J":"10","K":"11","L":"12","M":"13","N":"14","O":"15","P":"16","Q":"17","R":"18","S":"19","T":"20","U":"21","V":"22","W":"23", "X":"24","Y":"25","Z":"26",
" ":"###","":"#" }
ABSTRACT_SHIFTED = {value:key for key,value in ABSTRACT.items()}
user_input = "8#15#23###23#1#19###9#20"
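def from_abstract(s):
    result = []
    while s:
        print('try', s)
        # tokens are terminated with "#"; the search starts at index 1 so that
        # a "#" in first position is read as a token of its own
        idx = s.find("#", 1)
        # ...except at end of line, where the token is everything that is left
        if idx == -1:
            idx = len(s)
        token = s[:idx]
        s = s[idx+1:]
        # a lone "#" token is the encoded space
        result.append(' ' if token == '#' else ABSTRACT_SHIFTED.get(token, ' '))
    return ''.join(result)
print(from_abstract(user_input))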
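Since the letter codes are just alphabet positions, the lookup table can also be replaced by arithmetic, and a matching encoder makes it easy to check the decoder with a round trip. The sketch below is an illustration only: the names `decode` and `encode` are hypothetical, and it assumes the input contains only the numbers 1 to 26, single "#" separators, and "###" for a space.

```python
import re


def decode(message):
    """Turn '8#15#23###23#1#19' into 'HOW WAS'."""
    def replace(match):
        token = match.group(0)
        if token == "###":
            return " "          # word separator
        if token == "#":
            return ""           # letter separator
        return chr(int(token) + 64)  # 1 -> 'A', 2 -> 'B', ...
    # "###" must come before "#" in the alternation so it is tried first.
    return re.sub(r"###|#|\d+", replace, message)


def encode(text):
    """Inverse of decode() for upper-case letters and single spaces."""
    words = text.upper().split(" ")
    return "###".join("#".join(str(ord(c) - 64) for c in word) for word in words)


print(decode("8#15#23###23#1#19###9#20"))
print(encode("HOW WAS IT"))
```

Putting "###" first in the pattern is what resolves the ambiguity between the separator and the space: the regex engine tries alternatives left to right at each position.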
I've written a function that surrounds a search term with an HTML element with given attributes. The idea is that the resulting string is written to a log file later on, with the search term highlighted.
def inject_html(needle, haystack, html_element="span", html_attrs={"class":"matched"}):
    # Find all occurrences of a given string in some text
    # Surround the occurrences with a HTML element and given HTML attributes
    new_str = haystack
    start_index = 0
    while True:
        try:
            # Get the bounds
            start = new_str.lower().index(needle.lower(), start_index)
            end = start + len(needle)
            # Needle is present, compose the HTML to inject
            html_open = "<" + html_element + " " + " ".join(["%s=\"%s\""%(k,html_attrs[k]) for k in html_attrs]) + ">"
            html_close = "</" + html_element + ">"
            new_str = new_str[0:start] + html_open + new_str[start:end] + html_close + new_str[end:len(new_str)]
            start_index = end + len(html_close) + len(html_open)
        except ValueError as ex:
            # String doesn't occur in text after index, break loop
            break
    return new_str
I want to open this up to accept an array of needles, locating and surrounding each of them with HTML in the haystack. I could easily do this by wrapping the code in another loop that iterates through the needles, locating and surrounding instances of each search term. The problem is that this doesn't protect against accidentally surrounding previously injected HTML code, e.g.:
def inject_html(needles, haystack, html_element="span", html_attrs={"class":"matched"}):
    # Find all occurrences of a given string in some text
    # Surround the occurrences with a HTML element and given HTML attributes
    new_str = haystack
    for needle in needles:
        start_index = 0
        while True:
            try:
                # Get the bounds
                start = new_str.lower().index(needle.lower(), start_index)
                end = start + len(needle)
                # Needle is present, compose the HTML to inject
                html_open = "<" + html_element + " " + " ".join(["%s=\"%s\""%(k,html_attrs[k]) for k in html_attrs]) + ">"
                html_close = "</" + html_element + ">"
                new_str = new_str[0:start] + html_open + new_str[start:end] + html_close + new_str[end:len(new_str)]
                start_index = end + len(html_close) + len(html_open)
            except ValueError as ex:
                # String doesn't occur in text after index, break loop
                break
    return new_str
search_strings = ["foo", "pan", "test"]
haystack = "Foobar"
print(inject_html(search_strings,haystack))
<s<span class="matched">pan</span> class="matched">Foo</span>bar
On the second iteration, the code searches for and surrounds the "pan" text from the "span" that was inserted in the previous iteration.
How would you recommend I change my original function to look for a list of needles without the risk of injecting HTML into undesired locations (such as within existing tags)?
--- UPDATE ---
I got around this by maintaining a list of "immune" ranges (ones which have already been surrounded with HTML and therefore do not need to be checked again).
def inject_html(needles, haystack, html_element="span", html_attrs={"class":"matched"}):
    # Find all occurrences of a given string in some text
    # Surround the occurrences with a HTML element and given HTML attributes
    immune = []
    new_str = haystack
    for needle in needles:
        next_index = 0
        while True:
            try:
                # Get the bounds
                start = new_str.lower().index(needle.lower(), next_index)
                end = start + len(needle)
                if not any([(x[0] > start and x[0] < end) or (x[1] > start and x[1] < end) for x in immune]):
                    # Needle is present, compose the HTML to inject
                    html_open = "<" + html_element + " " + " ".join(["%s=\"%s\""%(k,html_attrs[k]) for k in html_attrs]) + ">"
                    html_close = "</" + html_element + ">"
                    new_str = new_str[0:start] + html_open + new_str[start:end] + html_close + new_str[end:len(new_str)]
                    next_index = end + len(html_close) + len(html_open)
                    # Add the highlighted range (and HTML code) to the list of immune ranges
                    immune.append([start, next_index])
            except ValueError as ex:
                # String doesn't occur in text after index, break loop
                break
    return new_str
It's not particularly Pythonic though, I'd be interested to see if anyone can come up with something cleaner.
I'd use something like this:
def inject_html(phrases, text_body, html_element_name="span", html_attrs={"class":"matched"}):
    new_text_body = []
    # Note the " " between the element name and its attributes
    html_start_tag = "<" + html_element_name + " " + " ".join(["%s=\"%s\""%(k,html_attrs[k]) for k in html_attrs]) + ">"
    html_end_tag = "</" + html_element_name + ">"
    text_body_lines = text_body.split("\n")
    for line in text_body_lines:
        for p in phrases:
            if line.lower() == p.lower():
                line = html_start_tag + p + html_end_tag
                break
        new_text_body.append(line)
    return "\n".join(new_text_body)
It goes through line by line and replaces each line if the line is an exact match (case-insensitive).
ROUND TWO:
With the requirement that the match needs to be (1) case-insensitive and (2) matches multiple words/phrases on each line, I would use:
import re
def inject_html(phrases, text_body, html_element_name="span", html_attrs={"class": "matched"}):
    html_start_tag = "<" + html_element_name + " " + " ".join(["%s=\"%s\"" % (k, html_attrs[k]) for k in html_attrs]) + ">"
    html_end_tag = "</" + html_element_name + ">"
    for p in phrases:
        # re.escape() keeps characters such as "(" or "." in a phrase from being read as regex syntax
        text_body = re.sub(r"({})".format(re.escape(p)), r"{}\1{}".format(html_start_tag, html_end_tag), text_body, flags=re.IGNORECASE)
    return text_body
For each provided phrase p, this uses a case-insensitive re.sub() replacement to replace all instances of that phrase in the provided text. (p) matches the phrase via a regular expression group. \1 is a backreference that inserts the text matched by that group, so the found phrase is enclosed in HTML tags with its original capitalization.
text = """
Somewhat more than forty years ago, Mr Baillie Fraser published a
lively and instructive volume under the title _A Winter’s Journey
(Tatar) from Constantinople to Teheran. Political complications
had arisen between Russia and Turkey - an old story, of which we are
witnessing a new version at the present time. The English government
deemed it urgently necessary to send out instructions to our
representatives at Constantinople and Teheran.
"""
new = inject_html(["TEHERAN", "Constantinople"], text)
print(new)
> Somewhat more than forty years ago, Mr Baillie Fraser published a lively and instructive volume under the title _A Winter’s Journey (Tatar) from <span class="matched">Constantinople</span> to <span class="matched">Teheran</span>. Political complications had arisen between Russia and Turkey - an old story, of which we are witnessing a new version at the present time. The English government deemed it urgently necessary to send out instructions to our representatives at <span class="matched">Constantinople</span> and <span class="matched">Teheran</span>.
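Looping over the phrases one at a time still leaves the original problem open: a later phrase such as "pan" can match inside a `<span ...>` tag inserted for an earlier phrase. One way around that is to combine all needles into a single alternation and substitute in one pass, so inserted markup is never scanned again. The following is a sketch, not a drop-in replacement: the name `highlight` and its parameters are hypothetical, needles are treated as literal text, and the haystack is assumed to be plain text without existing markup.

```python
import re
from html import escape


def highlight(needles, haystack, element="span", attrs=None):
    """Wrap every case-insensitive occurrence of any needle in an HTML element."""
    attrs = {"class": "matched"} if attrs is None else attrs
    attr_text = "".join(
        ' {}="{}"'.format(key, escape(str(value), quote=True))
        for key, value in attrs.items()
    )
    open_tag = "<{}{}>".format(element, attr_text)
    close_tag = "</{}>".format(element)
    # Longest needles first, so "foobar" wins over "foo" at the same position.
    ordered = sorted((n for n in needles if n), key=len, reverse=True)
    if not ordered:
        return haystack
    pattern = re.compile("|".join(re.escape(n) for n in ordered), re.IGNORECASE)
    # A function replacement avoids backslash handling in the replacement text.
    return pattern.sub(lambda m: open_tag + m.group(0) + close_tag, haystack)


print(highlight(["foo", "pan", "test"], "Foobar"))
```

For the "Foobar" example from the question this prints `<span class="matched">Foo</span>bar`: the "pan" inside the inserted tag is never touched, because the substitution only ever looks at the original text.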