How to find part of string? - python

I am working with a string. I could find the part of string I need but not all of it. Which part of my code needs to change?
s = "3D(filters:!!(),refreshInterval:(pause:!!t,value:0),time:(from:!%272019-10-01T20:28:50.088Z!%27,to:now))%26_a%3D(description:!%27!%27,filters:!!(),fullScreenMode:!!"
report_time = s[s.find("time:(") + 1:s.find("))")]
Output I need:
>>> report_time
'time:(from:!%272019-10-01T20:28:50.088Z!%27,to:now))'
Output I have:
>>> report_time
'ime:(from:!%272019-10-01T20:28:50.088Z!%27,to:now)'

You put the "+1" on the wrong index. The slice should start exactly at the first find location and end past the second one: s.find("))") returns the index of the first of the two closing parentheses, so you need + 2 to include both of them (thanks to @smac89 for catching that).
report_time = s[s.find("time:("):s.find("))") + 2]
Output:
'time:(from:!%272019-10-01T20:28:50.088Z!%27,to:now))'
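As a minimal sketch of the same slicing idea, the helper below (slice_between is a hypothetical name, not from the original answer) also guards against str.find returning -1 when a marker is missing, which would otherwise produce a silently wrong slice:

```python
def slice_between(text, start_marker, end_marker):
    """Return text from start_marker through end_marker, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start)
    if end == -1:
        return None
    # Add the marker length so the end marker itself is included.
    return text[start:end + len(end_marker)]


s = "3D(filters:!!(),refreshInterval:(pause:!!t,value:0),time:(from:!%272019-10-01T20:28:50.088Z!%27,to:now))%26_a%3D(description:!%27!%27,filters:!!(),fullScreenMode:!!"
report_time = slice_between(s, "time:(", "))")
print(report_time)
```

Using len(end_marker) instead of a hard-coded 2 keeps the offset correct if the closing marker changes.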

Alternatively use a regular expression, e.g:
import re
re.search(r'(time:\(.*\)\))', s).group(1)
Explanation: group(1) returns the matching content of the 1st set of parentheses. .* matches any characters in between. The literal parentheses in your search term need to be escaped.
Output:
'time:(from:!%272019-10-01T20:28:50.088Z!%27,to:now))'
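One caveat, shown here as a sketch on a made-up sample string: .* is greedy, so if the text contained a later "))", the match would run up to that last one. A non-greedy .*? stops at the first "))". The function name find_time_clause is an assumption for illustration, and the None check covers the case where re.search finds nothing:

```python
import re


def find_time_clause(text):
    """Return the 'time:(...))' clause, or None when it is absent."""
    # .*? is non-greedy: it stops at the first '))' after 'time:('.
    match = re.search(r'time:\(.*?\)\)', text)
    return match.group(0) if match else None


sample = "x,time:(from:a,to:now))&other:(k:(v))"  # hypothetical input with two '))'
print(find_time_clause(sample))
print(find_time_clause("no clause here"))
```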

Related

How do you find all instances of a substring, followed by a certain number of dynamic characters?

I'm trying to find all instances of a specific substring(a!b2 as an example) and return them with the 4 characters that follow after the substring match. These 4 following characters are always dynamic and can be any letter/digit/symbol.
I've tried searching, but it seems like the similar questions that are asked are requesting help with certain characters that can easily split a substring, but since the characters I'm looking for are dynamic, I'm not sure how to write the regex.
When using regex, you can use "." to dynamically match any character. Use {number} to specify how many characters to match, and use parentheses as in (.{number}) to specify that the match should be captured for later use.
>>> import re
>>> s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
>>> print(re.findall("a!b2(.{4})", s))
['foob', 'bazq', 'spam']
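If the substring can contain characters that are special in a regex (such as . or parentheses), one possible approach is to escape it with re.escape before building the pattern. This is a sketch; following_chars and its parameters are hypothetical names:

```python
import re


def following_chars(text, prefix, n=4):
    """Return the n characters that follow each occurrence of prefix."""
    # re.escape makes the prefix match literally, whatever it contains.
    pattern = re.escape(prefix) + "(.{%d})" % n
    return re.findall(pattern, text)


print(following_chars("a!b2foobar a!b2bazqux a!b2spam and eggs", "a!b2"))
print(following_chars("a.b(1)WXYZ a.b(1)12", "a.b(1)"))
```

Note that an occurrence followed by fewer than n characters is not returned at all, because .{4} needs exactly four characters to match.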
import re
print(re.search(r'a!b2(.{4})', s).group(1))
.{4} matches any 4 characters except newlines (unless the re.DOTALL flag is set).
group(0) is the complete match of the searched string; group(1) is the part captured by the first set of parentheses. You can read about group id here.
If you're only looking for how to grab the following 4 characters using Regex, what you are probably looking to use is the curly brace indicator for quantity to match: '{}'.
They go into more detail in the post here, but essentially you would do [a-zA-Z0-9]{X,Y} or (.{X,Y}), where X to Y is the number of characters you're looking for (in your case, you would only need {4}).
However, a more Pythonic way to solve this problem would be to use string slicing and the index method.
Eg. given an input_string, when you find the substring at index i using index, then you could use input_string[i+len(sub_str):i+len(sub_str)+4] to grab those special characters.
As an example,
input_string = 'abcdefg'
sub_str = 'abcd'
found_index = input_string.index(sub_str)
start_index = found_index + len(sub_str)
symbol = input_string[start_index: start_index + 4]
Outputs (to show it works with <4 as well): efg
Index also allows you to give start and end indexes for the search, so you could also use it in a loop if you wanted to find it for every sub string, with the start of the search index being the previous found index + 1.
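The loop described above might look like the sketch below. It swaps in str.find for str.index, because find returns -1 instead of raising ValueError when nothing more is found; chars_after_each is a hypothetical name:

```python
def chars_after_each(text, sub_str, n=4):
    """Collect up to n characters following every occurrence of sub_str."""
    results = []
    start = text.find(sub_str)
    while start != -1:
        begin = start + len(sub_str)
        # Slicing never raises, so a short tail simply yields fewer characters.
        results.append(text[begin:begin + n])
        # Resume the search one position after the previous hit.
        start = text.find(sub_str, start + 1)
    return results


print(chars_after_each("a!b2foobar a!b2bazqux a!b2spam and eggs", "a!b2"))
print(chars_after_each("abcdefg", "abcd"))
```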

Python re.sub() is not replacing every match

I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:
abbcabb should give c and abca should give bc.
I've tried the following regex (here):
(.)(.*?)\1
But, it gives wrong output for first string. Also, when I tried another one (here):
(.)(.*?)*?\1
But, this one again gives wrong output. What's going wrong here?
The python code is a print statement:
print(re.sub(r'(.)(.*?)\1', '\g<2>', s)) # s is the string
It can be solved without regular expression, like below
>>>''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>>''.join([i for i in s if s.count(i) == 1])
'c'
re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on
abbcabb
it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
If you want that, you need to write your own loop.
while True:
    newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
    if newS == s:
        break
    s = newS
print(newS)
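The same repeat-until-stable idea can be wrapped in a function. This sketch uses re.subn, which returns the new string together with the number of replacements made, so the loop can stop when that number is zero (strip_pairs is a hypothetical name):

```python
import re


def strip_pairs(s):
    """Repeatedly drop a character and its next repeat until nothing matches."""
    pattern = re.compile(r'(.)(.*?)\1')
    while True:
        # subn returns (new_string, number_of_replacements).
        s, count = pattern.subn(r'\g<2>', s)
        if count == 0:
            return s


print(strip_pairs("abbcabb"))
print(strip_pairs("abca"))
```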
Regular expressions don't seem to be the ideal solution:
they don't handle overlapping matches, so a loop is needed (like in this answer), and new strings are created over and over (performance suffers)
they're overkill here, we just need to count the characters
I like this answer, but using count repeatedly in a list comprehension loops over all elements each time.
It can be solved without regular expression and without O(n**2) complexity, only O(n) using collections.Counter
first count the characters of the string very easily & quickly
then filter the string testing if the count matches using the counter we just created.
like this:
import collections
s = "abbcabb"
cnt = collections.Counter(s)
s = "".join([c for c in s if cnt[c]==1])
(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)
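As a small sketch of that bonus, the wanted count can be made a parameter (keep_with_count and wanted are hypothetical names):

```python
from collections import Counter


def keep_with_count(s, wanted=1):
    """Keep only the characters that occur exactly `wanted` times in s."""
    counts = Counter(s)  # one pass to count every character
    return "".join(c for c in s if counts[c] == wanted)


print(keep_with_count("abbcabb"))
print(keep_with_count("abca"))
print(keep_with_count("abbcabb", 2))
```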
EDIT: based on the comment exchange - if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like @jon's recommendation. (If you don't care about order, then a more performant approach with very long strings might use something like collections.Counter instead.)
My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".
You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:
>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub('\g<2>', 'abbcabbabca')
'bbcbaca'
Hmm. That didn't work. Why? Because with the first pattern not being greedy, our "subpattern A" can just match the first a in the string - it does appear later, after all. So if we use a greedy match, Python will backtrack until it finds the longest match for subpattern A that still allows the A-B-A pattern to appear:
>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub('\g<2>', 'abbcabbabca')
'cbc'
Looks good to me.
The site explains it well, hover and use the explanation section.
(.)(.*?)\1 does not remove or match every double occurrence. It matches one character, followed by whatever is sandwiched in the middle, until that same character is encountered again.
So, for abbcabb the "sandwiched" portion is bbc, between the two a's.
EDIT:
You can try something like this instead without regexes:
string = "abbcabb"
result = []
for i in string:
    if i not in result:
        result.append(i)
    else:
        result.remove(i)
print(''.join(result))
Note that this keeps the "last" odd occurrence of each character, not the first.
For the "first" occurrence, you should use a counter as suggested in this answer. Just change the condition to check for odd counts. Pseudocode: count[letter] % 2 == 1
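One way to read that pseudocode, as a sketch: count the characters first, then keep the first occurrence of every character whose count is odd. The function name keep_odd_first and the seen set are assumptions added for illustration:

```python
from collections import Counter


def keep_odd_first(s):
    """Keep the first occurrence of each character that appears an odd number of times."""
    counts = Counter(s)
    seen = set()
    out = []
    for ch in s:
        # Odd total count, and not emitted yet.
        if counts[ch] % 2 == 1 and ch not in seen:
            seen.add(ch)
            out.append(ch)
    return "".join(out)


print(keep_odd_first("abbcabb"))
print(keep_odd_first("abaca"))
```

For "abaca" this keeps the characters in first-occurrence order (abc), whereas the append/remove loop above would give bca.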

Python - Regex to avoid matching duplicates

My string looks like this:
bo_1
bo_1
bo_2
bo_2
bo_3
bo_3
bo_4
bo_4
bo_5
bo_5
bo_6
bo_6
bo_7
bo_7
bo_8
bo_8
bo_9
bo_9
bo_10
bo_10
I want to match the first instance of each digit and ignore the next duplicate line. My regex is as follows:
(bo_\d)(?![\s\S]*\1)
which returns the following:
'bo_2'
'bo_3'
'bo_4'
'bo_5'
'bo_6'
'bo_7'
'bo_8'
'bo_9'
'bo_1'
How would I modify the regex to return a result like this instead (to include 'bo_1' at the start and 'bo_10' at the end):
'bo_1'
'bo_2'
'bo_3'
'bo_4'
'bo_5'
'bo_6'
'bo_7'
'bo_8'
'bo_9'
'bo_10'
Technically you don't need regex for that (you can use set() for instance):
>>> # Assume your string is in the variable called "text"
>>> result = set(text.split('\n'))
>>> result
{'bo_7', 'bo_3', 'bo_1', 'bo_6', 'bo_5', 'bo_8', 'bo_9', 'bo_2', 'bo_4', 'bo_10'}
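A set does not keep the original order, as the output above shows. If order matters, one option (a sketch, relying on dicts preserving insertion order in Python 3.7+) is dict.fromkeys; the shortened sample text is made up for illustration:

```python
text = "bo_1\nbo_1\nbo_2\nbo_2\nbo_10\nbo_10"  # shortened sample input

# dict keys are unique and keep insertion order, so this dedupes in order.
unique_in_order = list(dict.fromkeys(text.splitlines()))
print(unique_in_order)
```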
Anyway, the issue with your regex is that bo_1 is also matching bo_10, so it will be seen as a duplicate by the regex. You can solve it using word boundaries to ensure that the full 'word' is tested for a match:
\b(bo_\d+)\b(?![\s\S]*\b\1\b)
Use
(bo_\d+$)(?![\s\S]*^\1$)
Since you want to include bo_10, you should use \d+ and not just \d in the initial group. Then, in your negative lookahead, put the backreference between start-of-line and end-of-line anchors, so that, for example, bo_1 does not get excluded because it's followed by a bo_10. Note that ^ and $ only match at line boundaries when the multiline flag is enabled (re.MULTILINE in Python).
https://regex101.com/r/8khbcc/1
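To try both patterns from Python, a sketch along these lines should work. The test string is rebuilt here from the question's description (each of bo_1 to bo_10 twice), and the anchored pattern needs re.MULTILINE so that ^ and $ match at every line:

```python
import re

# Rebuild the sample: bo_1, bo_1, bo_2, bo_2, ..., bo_10, bo_10 on separate lines.
text = "\n".join(f"bo_{n}" for n in range(1, 11) for _ in range(2))

# Line-anchored variant: requires MULTILINE for ^ and $ to work per line.
anchored = re.findall(r'(bo_\d+$)(?![\s\S]*^\1$)', text, flags=re.MULTILINE)

# Word-boundary variant: no flag needed.
bounded = re.findall(r'\b(bo_\d+)\b(?![\s\S]*\b\1\b)', text)

print(anchored)
print(bounded)
```

Both patterns keep the last copy of each line rather than the first, which gives the same list here because the duplicates are adjacent.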

The difference between ( [^,]*) and (.*,) in regular expression? Using python

When I tried to transform the string into a dict-like form, I met this problem
s = '&a: 12, &b:13, &c:14, &d: 15' # the string I want to convert
Before converting it, I tried to find all the matched results at first so I used
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*),')
result = dict_form.findall(s)
print(result) # [('&a:', ' 12, &b:13, &c:14')]
It's quite unexpected, and a little bit messy
But when I tried another way to match the string:
dict_form1 = re.compile(r'(&[a-zA-Z]*:)([^,]*)')
result = dict_form1.findall(s)
print(result) # [('&a:', ' 12'), ('&b:', '13'), ('&c:', '14'), ('&d:', ' 15')]
This time, I get a better one with key and item separately stored in a tuple.
The only change I made was (.*), into ([^,]*)
The first one I thought was to find anything until it matches a comma
The second one I thought was to find anything but comma
What's the difference?
In the first instance:
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*),')
the (.*) operator is greedy. This means it will match everything up to the last comma, which is why you see the match extend up to &c:14.
In the second instance, by excluding the comma, you are forcing the match to be bound by a comma-- it's like saying "match everything until we hit a comma". This will cause the matching behavior you were expecting in the first place.
As has been said, .* is greedy and will try to match as much as possible. To make it non-greedy, add a question mark (?), as in .*?. In your code:
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*?),')
result = dict_form.findall(s)
print(result)
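For comparison, here is a sketch that runs the three patterns side by side on the string from the question. One thing to watch: the non-greedy version still requires a trailing comma, so it misses the last pair, which has none:

```python
import re

s = '&a: 12, &b:13, &c:14, &d: 15'

greedy = re.findall(r'(&[a-zA-Z]*:)(.*),', s)      # .* runs to the last comma
lazy = re.findall(r'(&[a-zA-Z]*:)(.*?),', s)       # .*? stops at the next comma
negated = re.findall(r'(&[a-zA-Z]*:)([^,]*)', s)   # no comma needed at the end

print(greedy)
print(lazy)
print(negated)
```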
Another maybe easier solution is to just use string splits instead of regex:
result = [_s.split(':') for _s in s.split(',')]
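Building on the split idea, a sketch that goes all the way to a dict could look like this. Stripping the leading & and converting the values to int are assumptions about what the final dict should contain:

```python
s = '&a: 12, &b:13, &c:14, &d: 15'

# Split into 'key:value' items, then split each item once on the colon.
pairs = (item.split(':', 1) for item in s.split(','))

# strip() removes stray spaces; lstrip('&') drops the marker; int() ignores padding.
as_dict = {key.strip().lstrip('&'): int(value) for key, value in pairs}
print(as_dict)
```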

Regular expression help to find space after a long string

My code is as follow:
list = re.findall("PROGRAM S\d\d", contents)
If I print the list, I only get the part up to S51, but I want the whole name.
I want findall to return everything, like "PROGRAM S51_Mix_Station". I know how to match the digits, but I don't know how to match everything up to the next space; there is usually a space after the last character.
Thanks in advance.
You can also use \w+:
import re
s = "PROGRAM S51_Mix_Station"
new_data = re.findall(r'^PROGRAM\s\w+_\w+_\w+', s)
final_data = new_data[0] if new_data else new_data
Output:
'PROGRAM S51_Mix_Station'
OK, thanks. I found another solution:
lista = re.findall(r"PROGRAM S\d\d\S+", contents)
\S+ matches the run of non-whitespace characters that follows the digits.
You could use this:
list = re.findall(r"PROGRAM S\d\d[^ ]*", contents)
This would match PROGRAM S followed by two digits, then followed by any number of non-space characters. If you wanted to stop at any whitespace character rather than only at spaces, then @Wiktor's comment would be better, i.e. use PROGRAM S\d\d\S*.
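A quick sketch of the \S* variant on made-up input (the contents string below is hypothetical, not from the question):

```python
import re

# Hypothetical file contents with three candidate lines.
contents = "PROGRAM S51_Mix_Station starts\nPROGRAM S07_Pump\nPROGRAM Sxx_bad"

# \S* takes everything up to the next whitespace (space, tab or newline).
names = re.findall(r"PROGRAM S\d\d\S*", contents)
print(names)
```

The last line is skipped because S is not followed by two digits.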
