Here is my problem: I have a text variable that contains commas, and I want to delete only the commas located between two delimiters (in fact [ and ]). For example, given the following string:
input = "The sun shines, that's fine [not, for, everyone] and if it rains, it Will Be better."
output = "The sun shines, that's fine [not for everyone] and if it rains, it Will Be better."
I know how to use .replace for the whole variable, but I cannot do it for just a part of it.
There are some related questions on this site, but I did not manage to adapt them to my own problem, e.g.:
Repeatedly extract a line between two delimiters in a text file, Python
Python finding substring between certain characters using regex and replace()
replace string between two quotes
import re
Variable = "The sun shines, that's fine [not, for, everyone] and if it rains, it Will Be better."
Variable1 = re.sub(r"\[[^]]*\]", lambda x: x.group(0).replace(',', ''), Variable)
First you need to find the parts of the string that need to be rewritten (re.sub does this for you). Then you rewrite those parts.
The call var1 = re.sub("re", fun, var) means: find all substrings in the variable var that match "re"; process each of them with the function fun; return the result, which is saved in the var1 variable.
The regular expression "\[[^]]*\]" means: find substrings that start with [ (\[ in re), contain anything except ] ([^]]* in re) and end with ] (\] in re).
For every occurrence found, re.sub runs a function that converts this occurrence into something new.
The function is:
lambda x: x.group(0).replace(',', '')
That means: take the string that was found (x.group(0)), replace ',' with '' (in other words, remove the commas) and return the result.
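The steps above can be sketched with a named function in place of the lambda. The name strip_commas is made up here for illustration; it is not part of the original answer.

```python
import re

def strip_commas(match):
    # match.group(0) is the whole bracketed chunk, e.g. "[not, for, everyone]"
    return match.group(0).replace(",", "")

text = "The sun shines, that's fine [not, for, everyone] and if it rains, it Will Be better."
# \[ and \] match literal brackets; [^]]* matches any run of characters other than ]
result = re.sub(r"\[[^]]*\]", strip_commas, text)
print(result)
```

Commas outside the brackets never reach strip_commas, so they are left alone.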
You can use an expression like this to match them (if the brackets are balanced):
,(?=[^][]*\])
Used something like:
re.sub(r",(?=[^][]*\])", "", str)
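A small sketch of how that lookahead behaves, including what happens when the brackets are not balanced. The names COMMA_BEFORE_CLOSE and remove_bracketed_commas are hypothetical, chosen only for this example.

```python
import re

# A comma that is followed, without crossing another bracket, by a closing ]
COMMA_BEFORE_CLOSE = re.compile(r",(?=[^][]*\])")

def remove_bracketed_commas(text):
    return COMMA_BEFORE_CLOSE.sub("", text)

print(remove_bracketed_commas("a, b [c, d] e, f [g, h]"))  # a, b [c d] e, f [g h]
# With a stray "]" the lookahead still succeeds, so this comma is removed too:
print(remove_bracketed_commas("a, b] c"))  # a b] c
```

The second call shows why the "balanced brackets" condition matters: the pattern only looks forward for a ], it never checks that a [ was opened.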
Here is a non-regex method. You can replace your [] delimiters with say [/ and /], and then split on the / delimiter. Then every odd string in the split list needs to be processed for comma removal, which can be done while rebuilding the string in a list comprehension:
>>> Variable = "The sun shines, that's fine [not, for, everyone] and if it rains, it Will Be better."
>>> chunks = Variable.replace('[','[/').replace(']','/]').split('/')
>>> ''.join(sen.replace(',','') if i%2 else sen for i, sen in enumerate(chunks))
"The sun shines, that's fine [not for everyone] and if it rains, it Will Be better."
If you don't fancy learning regular expressions (see other responses on this page), you can use the partition command.
sentence = "the quick, brown [fox, jumped , over] the lazy dog"
left, bracket, rest = sentence.partition("[")
block, bracket, right = rest.partition("]")
"block" is now the part of the string in between the brackets, "left" is what was to the left of the opening bracket and "right" is what was to the right of the closing bracket.
You can then recover the full sentence with:
new_sentence = left + "[" + block.replace(",","") + "]" + right
print(new_sentence)  # the quick, brown [fox jumped  over] the lazy dog
If you have more than one block, you can put this all in a for loop, applying the partition command to "right" at every step.
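One possible way to write that loop, as a rough sketch. The function name remove_commas_in_brackets and the choice to leave an unmatched "[" untouched are assumptions made for this example.

```python
def remove_commas_in_brackets(sentence):
    result = ""
    rest = sentence
    while True:
        # Everything before the next "[" is copied unchanged
        left, opener, rest = rest.partition("[")
        result += left
        if not opener:
            break  # no more opening brackets
        block, closer, rest = rest.partition("]")
        if not closer:
            # Unmatched "[": keep the remaining text as it is (assumption)
            result += opener + block
            break
        result += opener + block.replace(",", "") + closer
    return result

print(remove_commas_in_brackets("the quick, brown [fox, jumped, over] the [lazy, dog]"))
```

Each pass consumes one bracketed block, so the loop ends as soon as partition finds no further "[".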
Or you could learn regular expressions! It will be worth it in the long run.
I scraped a few pdfs and some thick fonts get scraped as in this example:
text='and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
instead of
"and assesses our reformed teaching in the classroom"
How to fix this? I am trying with regex
pattern=r'([a-z])(?=\1)'
re.sub(pattern,'',text)
#"and aseses our reformed teaching in the clasrom"
I am thinking of grouping the two groups above and add word boundaries
EDIT: this one fixes words with even number of letters:
pattern=r'([a-z])\1([a-z])\2'
re.sub(pattern, r'\1\2', text)
#"and assesses ourr reformed teaching in the classroom"
If letters are duplicated, you can try something like this
for w in text.split():
    if len(w) % 2 != 0:
        print(w)
        continue
    if w[0::2] == w[1::2]:
        print(w[0::2])
        continue
    print(w)
I am using a mixed approach: build the pattern and the substitution in a for loop, then apply the regex. The regexes applied go from e.g. words of 8x2=16 letters down to 3.
import re
text = 'and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
wrd_len = [9,8,7,6,5,4,3,2]
for l in wrd_len:
    sub = '\\' + '\\'.join(map(str,range(1,l+1)))
    pattern = '([a-z])\\' + '([a-z])\\'.join(map(str,range(1,l+1)))
    text = re.sub(pattern, sub , text)
text
#and assesses our reformed teaching in the classroom
For example, the regex for 3-letter words becomes:
re.sub(r'([a-z])\1([a-z])\2([a-z])\3', r'\1\2\3', text)
As a side note, I could not get those backslashes right with raw strings, and I am actually going to use [a-zA-Z].
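For what it's worth, here is one way the same loop might be written with raw strings and [a-zA-Z]. This is only a sketch: the function name dedouble and the max_len parameter are invented for the example, and it keeps the original idea of going from the longest word down to 2 letters.

```python
import re

def dedouble(text, max_len=9):
    # Try the longest doubled words first, then shorter ones
    for n in range(max_len, 1, -1):
        # For n=3 this builds ([a-zA-Z])\1([a-zA-Z])\2([a-zA-Z])\3
        pattern = "".join(r"([a-zA-Z])\%d" % i for i in range(1, n + 1))
        # ... and the replacement \1\2\3
        repl = "".join(r"\%d" % i for i in range(1, n + 1))
        text = re.sub(pattern, repl, text)
    return text

text = 'and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
print(dedouble(text))
```

Using %-formatting inside a raw string keeps the backslash literal, so no double escaping is needed.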
I found a solution in JavaScript that works fine:
([a-z])\1(?:(?=([a-z])\2)|(?<=\3([a-z])\1\1))
but it does not work in Python, because a lookbehind there cannot take references to groups. So I came up with another pattern that works for this example:
([a-z])\1(?:(?=([a-z])\2)|(?=[^a-z]))
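A quick sketch of how the Python pattern could be applied, with the unbalanced trailing parenthesis dropped. The constant name DOUBLED is just an illustrative choice.

```python
import re

# A doubled letter that is followed either by another doubled letter
# or by a non-letter character
DOUBLED = re.compile(r"([a-z])\1(?:(?=([a-z])\2)|(?=[^a-z]))")

text = 'and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
fixed = DOUBLED.sub(r"\1", text)
print(fixed)  # and assesses our reformed teaching in the classroom
```

One thing to keep in mind: (?=[^a-z]) needs an actual character after the pair, so a doubled word at the very end of the string keeps its last pair ("tteesstt" becomes "testt"). A negative lookahead such as (?![a-z]) is one possible way around that.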
I'm writing a function that needs to return an abbreviated version of a str. The returned str must contain the first letter, the number of characters removed, and the last letter; it must be abbreviated per word, not per sentence. After that I need to join the words again in the same format, including the special characters. I tried using re.findall(), but it drops the special characters, so " ".join() leaves them out.
Here's my code:
import re
def abbreviate(wrd):
    return " ".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.findall(r"[\w']+", wrd)])
print(abbreviate("elephant-rides are really fun!"))
The output would be:
e6t r3s are r4y fun
But the output should be:
e6t-r3s are r4y fun!
No need for str.join. Might as well take full advantage of what the re module has to offer.
re.sub accepts a string or a callable object (like a function or lambda), which takes the current match as an input and must return a string with which to replace the current match.
import re
pattern = "\\b[a-z]([a-z]{2,})[a-z]\\b"
string = "elephant-rides are really fun!"
def replace(match):
    return f"{match.group(0)[0]}{len(match.group(1))}{match.group(0)[-1]}"
abbreviated = re.sub(pattern, replace, string)
print(abbreviated)
Output:
e6t-r3s are r4y fun!
Maybe someone else can improve upon this answer with a cuter pattern, or any other suggestions. The way the pattern is written now, it assumes that you're only dealing with lowercase letters, so that's something to keep in mind - but it should be pretty straightforward to modify it to suit your needs. I'm not really a fan of the repetition of [a-z], but that's just the quickest way I could think of for capturing the "inner" characters of a word in a separate capturing group. You may also want to consider what should happen with words/contractions like "don't" or "shouldn't".
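As one possible direction for those open points, here is a sketch that accepts upper-case and non-ASCII letters and treats a contraction such as "don't" as a single word. That treatment of contractions is purely an assumption for the example (the question does not say what should happen), and the names WORD and shorten are made up.

```python
import re

# One or more letters, optionally followed by apostrophe + letters ("don't").
# [^\W\d_] means "a word character that is neither a digit nor an underscore".
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def abbreviate(text):
    def shorten(match):
        word = match.group(0)
        if len(word) < 4:
            return word  # short words are kept as they are
        return f"{word[0]}{len(word) - 2}{word[-1]}"
    return WORD.sub(shorten, text)

print(abbreviate("elephant-rides are really fun!"))  # e6t-r3s are r4y fun!
print(abbreviate("Don't Worry!"))                    # D3t W3y!
```

Because re.sub only touches the matched words, hyphens, spaces and punctuation stay exactly where they were.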
Thank you for viewing my question. After a few more searches, trial, and error I finally found a way to execute my code properly without changing it too much. I simply substituted re.findall(r"[\w']+", wrd) with re.split(r'([\W\d\_])', wrd) and also removed the whitespace in "".join() for they were simply not needed anymore.
import re
def abbreviate(wrd):
    return "".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.split(r'([\W\d\_])', wrd)])
print(abbreviate("elephant-rides are not fun!"))
Output:
e6t-r3s are not fun!
for example
input:
text = 'wish you “happy” every (day)'
expected output:
text = 'wish you (happy) every “day”'
I'd like to swap parentheses with quotations everywhere in an unknown text.
I'm doing a school assignment and I'm NOT allowed to use a list!!!
Strings are immutable, so I'm not sure what to do. Please help!
Here's a silly answer using str.replace():
text = 'wish you “happy” every (day)'
text = text\
.replace('”', '*')\
.replace('“', "^")\
.replace("(", '“')\
.replace(")", '”')\
.replace("*", ')')\
.replace("^", '(')
Here's a serious answer:
Since you have a list of characters to replace, build a dictionary of what character should be swapped with another character, then iterate through the string and create a new string which contains the original string with replaced letters. This way, you can have as many or as few characters in the switch dictionary as you want, and it will work without modification. Factored up to a larger scale, you might want to do something like store this dictionary elsewhere, for example if you're creating it based on user input that you get from an interface or a webapp. It's often good to separate the specification of what you are doing from how you are doing it, because one can be treated as data (which characters to swap) whereas the other is logic (actually swapping them).
text = 'wish you “happy” every (day)'
newtext = ''
switch = {
    '“': '(',
    '”': ')',
    '(': '“',
    ')': '”',
}
for letter in text:
    if letter in switch:
        letter = switch[letter]
    newtext += letter
The reason I took an iterative approach is that we are swapping characters: if you replace all instances of one character at a time with replace(), characters get swapped back when you replace the next one, unless you include an intermediate step such as * in my silly answer or ### in the other answer. That opens the possibility of collisions (if your text already contained ### or *, it would be replaced incorrectly).
The immutability of strings is why you have to build the newtext string instead of replacing the characters in the original string while iterating.
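A related option, not used in the answer above, is the built-in pair str.maketrans and str.translate, which does the same one-pass swap from a mapping and does not involve a list. A minimal sketch:

```python
# Map each character to its replacement; positions in the two strings correspond:
# “ -> (    ” -> )    ( -> “    ) -> ”
swap = str.maketrans('“”()', '()“”')

text = 'wish you “happy” every (day)'
swapped = text.translate(swap)
print(swapped)  # wish you (happy) every “day”
```

Every character is looked up once in the original text, so nothing gets swapped back and no placeholder such as * or ### is needed.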
An intermediate replace is usually vulnerable, as text could contain the intermediate representation: ###...### in Roopak A Nelliat's answer, * and ^ in Ryan's "silly" one (EDIT: as noted in Ryan's more serious text below).
Here's a regex solution without going through the intermediate replace:
import re
text = 'wish you "happy" every (day)'
def replacer(match):
    quote, content = match.groups()
    if quote:
        return f'({content})'
    else:
        return f'"{content}"'
re.sub(r'(?:(")|\()(.*?)(?(1)"|\))', replacer, text)
# => 'wish you (happy) every "day"'
The regexp will match a starting delimiter (either " or (), then content, and a matching final delimiter (i.e. if the starting delimiter was " then ", if not then )). Then the replacer function constructs a replacement with the other set of delimiters.
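The question's text uses typographic quotes (“ and ”) rather than straight ones. Since the opening and closing characters differ there, a plain alternation is enough and the conditional group is not needed. A hedged sketch, with PAIR and swap_delimiters as made-up names:

```python
import re

# Either “content” (group 1) or (content) (group 2)
PAIR = re.compile(r'“(.*?)”|\((.*?)\)')

def swap_delimiters(match):
    quoted, parenthesized = match.groups()
    if quoted is not None:
        # The match was “...”, so rewrite it with parentheses
        return f'({quoted})'
    # Otherwise the match was (...), so rewrite it with curly quotes
    return f'“{parenthesized}”'

text = 'wish you “happy” every (day)'
swapped = PAIR.sub(swap_delimiters, text)
print(swapped)  # wish you (happy) every “day”
```

The check uses "is not None" rather than truthiness so that an empty pair such as “” is still recognised as the quoted case.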
What you need is a regex replace that retains the captured group.
NOTE: I had to use 3 replaces because this is a swap.
import re
a = 'wish you "happy" every (day)'
b = re.sub(r'"(\w+)"', r'###\1###', a)
# replaces words inside double quotes with words inside ###
# b = 'wish you ###happy### every (day)'
c = re.sub(r'\((\w+)\)', r'"\1"', b)
# replaces words inside parentheses with words inside double quotes
# c = 'wish you ###happy### every "day"'
d = re.sub(r'###(\w+)###', r'(\1)', c)
# replaces the words inside ### with words inside parentheses
# d = 'wish you (happy) every "day"'
Basically, I have a list of special characters. I need to split a string by a character if it belongs to this list and exists in the string. Something on the lines of:
def find_char(string):
    if string.find("some_char") != -1:
        pass  # do xyz with some_char
    elif string.find("another_char") != -1:
        pass  # do xyz with another_char
    else:
        return False
and so on. The way I think of doing it is:
def find_char_split(string):
    char_list = [",","*",";","/"]
    for my_char in char_list:
        if string.find(my_char) != -1:
            my_strings = string.split(my_char)
            break
        else:
            my_strings = False
    return my_strings
Is there a more pythonic way of doing this? Or the above procedure would be fine? Please help, I'm not very proficient in python.
(EDIT): I want it to split on the first occurrence of the character, which is encountered first. That is to say, if the string contains multiple commas, and multiple stars, then I want it to split by the first occurrence of the comma. Please note, if the star comes first, then it will be broken by the star.
I would favor using the re module for this because the expression for splitting on multiple arbitrary characters is very simple:
r'[,*;/]'
The brackets create a character class that matches anything inside of them. The code is like this:
import re
results = re.split(r'[,*;/]', my_string, maxsplit=1)
The maxsplit argument makes it so that the split only occurs once.
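To mirror the behaviour asked for in the question (return False when none of the characters is present), the call could be wrapped roughly like this. The function name split_on_first_delimiter and its default delimiters are assumptions for the sketch.

```python
import re

def split_on_first_delimiter(text, delimiters=",*;/"):
    # re.escape keeps characters such as * or ] literal inside the class
    pattern = "[" + re.escape(delimiters) + "]"
    parts = re.split(pattern, text, maxsplit=1)
    # With maxsplit=1 a successful split always yields exactly two parts
    return parts if len(parts) == 2 else False

print(split_on_first_delimiter("a*b,c"))  # ['a', 'b,c']
print(split_on_first_delimiter("abc"))    # False
```

Whichever of the listed characters appears first in the string is the one the split happens on, which matches the EDIT in the question.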
If you are doing the same split many times, you can compile the regex and search on that same expression a little bit faster (but see Jon Clements' comment below):
c = re.compile(r'[,*;/]')
results = c.split(my_string)
If this speed-up is important (it probably isn't), you can avoid recompiling on every call by writing a function that compiles the expression once and returns an inner function that reuses the compiled pattern:
def split_chars(chars, maxsplit=0, flags=0, string=None):
    # see note about the + symbol below
    c = re.compile('[{}]+'.format(''.join(chars)), flags=flags)
    def f(string, maxsplit=maxsplit):
        return c.split(string, maxsplit=maxsplit)
    return f if string is None else f(string)
Then:
special_split = split_chars(',*;/', maxsplit=1)
result = special_split(my_string)
But also:
result = split_chars(',*;/', string=my_string, maxsplit=1)
The purpose of the + character is to treat multiple delimiters as one if that is desired (thank you Jon Clements). If this is not desired, you can just use re.compile('[{}]'.format(''.join(chars))) above. Note that even with maxsplit=1 this can matter when delimiters are adjacent: 'a,,b' splits into ['a', 'b'] with the + and into ['a', ',b'] without it.
Finally: have a look at this talk for a quick introduction to regular expressions in Python, and this one for a much more information packed journey.
I have a custom script I want to extract data from with Python, but the only way I can think of is to take out the marked bits and leave the unmarked bits, like "go up" and "go down" in this example.
string_a = "[start]go up[wait time=500]go down[p]"
string_b = '#onclick go up[wait time=500]go down active="False"'
In trying to do so, all I managed to do was extract the marked bits, but I can't figure out a way to save the data that isn't marked! It always gets lost when I extract the other bits!
This is the function I'm using to extract them. I call it multiple times in order to whittle away the markers, but I can't choose the order they get extracted in!
class Parsers:
    @staticmethod
    def extract(line, filters='[]'):
        # returns a list of the marked contents, plus the remaining substring
        substring = line[:]
        contents = []
        for bracket in range(line.count(str(filters[0]))):
            startend = []
            for f in filters:
                now = substring.find(f)
                startend.append(now)
            contents.append(substring[startend[0]+1:startend[1]])
            substring = substring[startend[1]+1:]
        return contents, substring
By the way, this is the order I'm calling it in at the moment. I think I should put the order back to the # being first, but I don't want to break it again.
star_string, first = Parsers.extract(string_a, filters='* ')
bracket_string, substring = Parsers.extract(string_a, filters='[]')
at_string, final = Parsers.extract(substring, filters='# ')
Please excuse my bad Python; I learnt this all on my own and I'm still figuring it out.
You are doing some mighty juggling with Python string methods above, but if all you want is to extract the content within brackets and get the remainder of the string, that is easier with regular expressions (in Python, the "re" module):
import re
string_a = "[start]go up[wait time=500]go down[p]"
expr = re.compile(r"\[.*?\]")
contents = expr.findall(string_a)
substring = expr.sub("", string_a)
This simply tells the regexp engine to match a literal [, and whatever characters follow (.*) up to the next ] (the ? makes the match stop at the next ], not the last one). The findall call gets all such matches as a list of strings, and the sub call replaces all the matches with an empty string.
As nice as regular expressions are, they are less Python than a sub-programming language of their own. Check the documentation on them: https://docs.python.org/2/library/re.html
Still, a simpler way of doing what you had done is to check character by character, and have some variables to "know" where you are in the string (inside a tag or not, for example), just as we would think about the problem if we could look at only one character at a time. I will write the code with Python 3.x in mind; if you are still using Python 2.x, please convert your strings to unicode objects before trying something like this:
def extract(line, filters='[]'):
    substring = ""
    contents = []
    inside_tag = False
    partial_tag = ""
    for char in line:
        if char == filters[0] and not inside_tag:
            inside_tag = True
        elif char == filters[1] and inside_tag:
            contents.append(partial_tag)
            partial_tag = ""
            inside_tag = False
        elif inside_tag:
            partial_tag += char
        else:
            # text outside any tag is kept instead of being lost
            substring += char
    if partial_tag:
        print("Warning: unclosed tag '{}' ".format(partial_tag))
    return contents, substring
Notice that there is no need for complicated calculations of where each bracket falls in the line, and so on: you just get them all.
Not sure I understand this fully - you want to get [stuff in brackets] and everything else? If you are just parsing flat strings - no recursive brackets-in-brackets - you can do
import re
parse = re.compile(r"\[.*?\]|[^\[]+").findall
then
>>> parse('[start]go up[wait time=500]go down[p]')
['[start]', 'go up', '[wait time=500]', 'go down', '[p]']
>>> parse('#onclick go up[wait time=500]go down active="False"')
['#onclick go up', '[wait time=500]', 'go down active="False"']
The regex translates as "everything between two square brackets OR anything up to but not including an opening square bracket".
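If the bracketed bits and the plain text need to end up in separate lists, the same pattern can be given capturing groups. This is only a sketch of one way to do it; the names TOKEN and split_tags_and_text are invented for the example.

```python
import re

# Group 1: the inside of [ ... ]; group 2: a run of text with no opening bracket
TOKEN = re.compile(r"\[(.*?)\]|([^\[]+)")

def split_tags_and_text(line):
    tags, text = [], []
    for match in TOKEN.finditer(line):
        if match.group(1) is not None:
            tags.append(match.group(1))   # marked bit, brackets stripped
        else:
            text.append(match.group(2))   # unmarked bit, kept as-is
    return tags, text

print(split_tags_and_text('[start]go up[wait time=500]go down[p]'))
# (['start', 'wait time=500', 'p'], ['go up', 'go down'])
```

Both lists keep the original left-to-right order, so nothing is lost while the markers are pulled out.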
If this isn't what you wanted - do you want #word to be a separate chunk? - please show what string_a and string_b should be parsed as!