Lets say I have a bunch of strings, and they can only be in the following formats:
format1 = 'substring1#substring2'
format2 = 'substring1$substring2'
format3 = 'substring1'
Let me explain. The strings are sometimes divided using the # or $ character. However other times, they are not.
I want to remove the part that appears after the # or $, if it exists. If it was just one special character, that is #, I could have done this:
string = string.split('#')[0]
But how can I do it for the 2 special characters in a quick and elegant way? Also assume the following things:
Only one special character can appear in the string.
The special characters will not appear in any other part of the string.
Thanks.
If you want to avoid regex, one possibility would be:
string = min(string.split('#')[0], string.split('$')[0])
Regular expressions.
import re
re.sub('[$#].*', '', string_to_modify)
Use regex!
import re
new_string = re.sub(r"(#|$).*$", "", string)
Use re.split() for it.
import re
print(re.split("#|$","STRING#OTHER_STRING#OTHER_STRING_2"))
Use replace before split:
format1 = 'substring1#substring2'
format2 = 'substring1$substring2'
format3 = 'substring1'
print(format1.replace('#', '$').split('$')[0])
print(format2.replace('#', '$').split('$')[0])
print(format3.replace('#', '$').split('$')[0])
Output
substring1
substring1
substring1
You can use a for loop to split by an arbitrary number of delimiters. Regular expression is typically less efficient than Python str methods.
def converter(x, delims='#$'):
for delim in delims:
x = x.split(delim, maxsplit=1)[0]
return x
format1 = 'substring1#substring2'
format2 = 'substring1$substring2'
format3 = 'substring1'
for value in [format1, format2, format3]:
print(converter(value))
# substring1
# substring1
# substring1
I think you can use a list to maintain the special characters that can be used and for every of them check if is present in the string, when you find one execute the splitting process and retrieve only the left part like so:
delimters = ["#","$"]
for symbol in delimters:
if symbol in string1:
left_part = string1.split(symbol)[0]
Now this approach has some disadvantages but is the simplest in my opinion. The problem is that if you have more than one string you need to nested loops.
Related
I have a string as follows where I tried to remove similar consecutive characters.
import re
input = "abccbcbbb";
for i in input :
input = re.sub("(.)\\1+", "",input);
print(input)
Now I need to let the user specify the value of k.
I am using the following python code to do it, but I got the error message TypeError: can only concatenate str (not "int") to str
import re
input = "abccbcbbb";
k=3
for i in input :
input= re.sub("(.)\\1+{"+(k-1)+"}", "",input)
print(input)
The for i in input : does not do what you need. i is each character in the input string, and your re.sub is supposed to take the whole input as a char sequence.
If you plan to match a specific amount of chars you should get rid of the + quantifier after \1. The limiting {min,} / {min,max} quantifier should be placed right after the pattern it modifies.
Also, it is more convenient to use raw string literals when defining regexps.
You can use
import re
input_text = "abccbcbbb";
k=3
input_text = re.sub(fr"(.)\1{{{k-1}}}", "", input_text)
print(input_text)
# => abccbc
See this Python demo.
The fr"(.)\1{{{k-1}}}" raw f-string literal will translate into (.)\1{2} pattern. In f-strings, you need to double curly braces to denote a literal curly brace and you needn't escape \1 again since it is a raw string literal.
If I were you, I would prefer to do it like suggested before. But since I've already spend time on answering this question here is my handmade solution.
The pattern described below creates a named group named "letter". This group updates iterative, so firstly it is a, then b, etc. Then it looks ahead for all the repetitions of the group "letter" (which updates for each letter).
So it finds all groups of repeated letters and replaces them with empty string.
import re
input = 'abccbcbbb'
result = 'abcbcb'
pattern = r'(?P<letter>[a-z])(?=(?P=letter)+)'
substituted = re.sub(pattern, '', input)
assert substituted == result
Just to make sure I have the question correct you mean to turn "abccbcbbb" into "abcbcb" only removing sequential duplicate characters. Is there a reason you need to use regex? you could likely do a simple list comprehension. I mean this is a really cut and dirty way to do it but you could just put
input = "abccbcbbb"
input = list(input)
previous = input.pop(0)
result = [previous]
for letter in input:
if letter != previous : result += letter
previous = letter
result = "".join(result)
and with a method like this, you could make it easier to read and faster with a bit of modification id assume.
So I have the following strings and I have been trying to figure out how to manipulate them in such a way that I get a specific format.
string1-itd_jan2021-internal
string2itd_mar2021-space
string3itd_feb2021-internal
string4-itd_mar2021-moon
string5itd_jun2021-internal
string6-itd_feb2021-apollo
I want to be able to get rid of any of the last string so I am just left with the month and year, like below:
string1-itd_jan2021
string2itd_mar2021
string3itd_feb2021
string4-itd_mar2021
string5itd_jun2021
string6-itd_feb2021
I thought about using string.split on the - but then realized that for some strings this wouldn't work. I also thought about getting rid of a set amount of characters by putting it into a list and slicing but the end is varying characters length?
Is there anything I can do it with regex or any other python module?
Use str.rsplit with the appropriate maxsplit parameter:
s = s.rsplit("-", 1)[0]
You could also use str.split (even though this is clearly the worse choice):
s = "-".join(s.split("-")[:-1])
Or using regular expressions:
s = re.sub(r'-[^-]*$', '', s)
# "-[^-]*" a "-" followed by any number of non-"-"
With a regex:
import re
re.sub(r'([0-9]{4}).*$', r'\1', s)
Use re.sub like so:
import re
lines = '''string1-itd_jan2021-internal
string2itd_mar2021-space
string3itd_feb2021-internal
string4-itd_mar2021-moon
string5itd_jun2021-internal
string6-itd_feb2021-apollo'''
for old in lines.split('\n'):
new = re.sub(r'[-][^-]+$', '', old)
print('\t'.join([old, new]))
Prints:
string1-itd_jan2021-internal string1-itd_jan2021
string2itd_mar2021-space string2itd_mar2021
string3itd_feb2021-internal string3itd_feb2021
string4-itd_mar2021-moon string4-itd_mar2021
string5itd_jun2021-internal string5itd_jun2021
string6-itd_feb2021-apollo string6-itd_feb2021
Explanation:
r'[-][^-]+$' : Literal dash (-), followed by any character other than a dash ([^-]) repeated 1 or more times, followed by the end of the string ($).
I've seen a lot of examples on how to remove brackets from a string in Python, but I've not seen any that allow me to remove the brackets and a number inside of the brackets from that string.
For example, suppose I've got a string such as "abc[1]". How can I remove the "[1]" from the string to return just "abc"?
I've tried the following:
stringTest = "abc[1]"
stringTestWithoutBrackets = str(stringTest).strip('[]')
but this only outputs the string without the final bracket
abc[1
I've also tried with a wildcard option:
stringTest = "abc[1]"
stringTestWithoutBrackets = str(stringTest).strip('[\w+\]')
but this also outputs the string without the final bracket
abc[1
You could use regular expressions for that, but I think the easiest way would be to use split:
>>> stringTest = "abc[1][2][3]"
>>> stringTest.split('[', maxsplit=1)[0]
'abc'
You can use regex but you need to use it with the re module:
re.sub(r'\[\d+\]', '', stringTest)
If the [<number>] part is always at the end of the string you can also strip via:
stringTest.rstrip('[0123456789]')
Though the latter version might strip beyond the [ if the previous character is in the strip list too. For example in "abc1[5]" the "1" would be stripped as well.
Assuming your string has the format "text[number]" and you only want to keep the "text", then you could do:
stringTest = "abc[1]"
bracketBegin = stringTest.find('[')
stringTestWithoutBrackets = stringTest[:bracketBegin]
I have a String from which I want to take the values within the parenthesis. Then, get the values that are separated from a comma.
Example: x(142,1,23ERWA31)
I would like to get:
142
1
23ERWA31
Is it possible to get everything with one regex?
I have found a method to do so, but it is ugly.
This is how I did it in python:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search("\((.*?)\)", string)
secondResult = re.search("(?<=\()(.*?)(?=\))", firstResult.group(0))
finalResult = [x.strip() for x in secondResult.group(0).split(',')]
for i in finalResult:
print(i)
142
1
23ERWA31
This works for your example string:
import re
string = "x(142,1,23ERWA31)"
l = re.findall (r'([^(,)]+)(?!.*\()', string)
print (l)
Result: a plain list
['142', '1', '23ERWA31']
The expression matches a sequence of characters not in (,,,) and – to prevent the first x being picked up – may not be followed by a ( anywhere further in the string. This makes it also work if your preamble x consists of more than a single character.
findall rather than search makes sure all items are found, and as a bonus it returns a plain list of the results.
You can make this a lot simpler. You are running your first Regex but then not taking the result. You want .group(1) (inside the brackets), not .group(0) (the whole match). Once you have that you can just split it on ,:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search("\((.*?)\)", string)
for e in firstResult.group(1).split(','):
print(e)
A little wonky looking, and also assuming there's always going to be a grouping of 3 values in the parenthesis - but try this regex
\((.*?),(.*?),(.*?)\)
To extract all the group matches to a single object - your code would then look like
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search("\((.*?),(.*?),(.*?)\)", string).groups()
You can then call the firstResult object like a list
>> print(firstResult[2])
23ERWA31
Basically, I have a list of special characters. I need to split a string by a character if it belongs to this list and exists in the string. Something on the lines of:
def find_char(string):
if string.find("some_char"):
#do xyz with some_char
elif string.find("another_char"):
#do xyz with another_char
else:
return False
and so on. The way I think of doing it is:
def find_char_split(string):
char_list = [",","*",";","/"]
for my_char in char_list:
if string.find(my_char) != -1:
my_strings = string.split(my_char)
break
else:
my_strings = False
return my_strings
Is there a more pythonic way of doing this? Or the above procedure would be fine? Please help, I'm not very proficient in python.
(EDIT): I want it to split on the first occurrence of the character, which is encountered first. That is to say, if the string contains multiple commas, and multiple stars, then I want it to split by the first occurrence of the comma. Please note, if the star comes first, then it will be broken by the star.
I would favor using the re module for this because the expression for splitting on multiple arbitrary characters is very simple:
r'[,*;/]'
The brackets create a character class that matches anything inside of them. The code is like this:
import re
results = re.split(r'[,*;/]', my_string, maxsplit=1)
The maxsplit argument makes it so that the split only occurs once.
If you are doing the same split many times, you can compile the regex and search on that same expression a little bit faster (but see Jon Clements' comment below):
c = re.compile(r'[,*;/]')
results = c.split(my_string)
If this speed up is important (it probably isn't) you can use the compiled version in a function instead of having it re compile every time. Then make a separate function that stores the actual compiled expression:
def split_chars(chars, maxsplit=0, flags=0, string=None):
# see note about the + symbol below
c = re.compile('[{}]+'.format(''.join(chars)), flags=flags)
def f(string, maxsplit=maxsplit):
return c.split(string, maxsplit=maxsplit)
return f if string is None else f(string)
Then:
special_split = split_chars(',*;/', maxsplit=1)
result = special_split(my_string)
But also:
result = split_chars(',*;/', my_string, maxsplit=1)
The purpose of the + character is to treat multiple delimiters as one if that is desired (thank you Jon Clements). If this is not desired, you can just use re.compile('[{}]'.format(''.join(chars))) above. Note that with maxsplit=1, this will not have any effect.
Finally: have a look at this talk for a quick introduction to regular expressions in Python, and this one for a much more information packed journey.