grep prefix python strings - python

I'm having problems using findall in python.
I have a text such as:
the name of 33e4853h45y45 is one of the 33e445a64b65 and we want all the 33e5c44598e46 to be matched
So I'm trying to find all occurrences of those alphanumeric strings in the text. The thing is, I know they all have the "33e" prefix.
Right now, I have strings = re.findall(r"(33e+)+", stdout_value) but it doesn't work.
I want to be able to return 33e445a64b65, 33e5c44598e46

Try this:
>>> import re
>>> x="the name of 33e4853h45y45 is one of the 33e445a64b65 and we want all the 33e5c44598e46 to be matched"
>>> re.findall(r"33e\w+", x)
['33e4853h45y45', '33e445a64b65', '33e5c44598e46']

Here's a slight variation:
>>> string = '''the name of 33e4853h45y45 is one of the 33e445a64b65 and we want all the 33e5c44598e46 to be matched'''
>>> re.findall(r"(33e[a-z0-9]+)", string)
['33e4853h45y45', '33e445a64b65', '33e5c44598e46']
Instead of matching any word characters, it will only match digits and lowercase letters after the 33e -- that's what the [a-z0-9]+ means.
If you wanted to also match capital letters, you could replace that part with [a-zA-Z0-9]+ instead.
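To see why the pattern in the question fails, here is a small sketch comparing it with the fix. The variable names are only for illustration, and the \b word boundary is an optional extra assumption (it keeps "33e" from matching in the middle of a longer token):

```python
import re

text = ("the name of 33e4853h45y45 is one of the 33e445a64b65 "
        "and we want all the 33e5c44598e46 to be matched")

# In the original attempt, "e+" only repeats the letter "e",
# so every match stops right after the "33e" prefix.
broken = re.findall(r"(33e+)+", text)
print(broken)  # ['33e', '33e', '33e']

# Following the prefix with a character class picks up the rest of the token.
fixed = re.findall(r"\b33e[a-z0-9]+", text)
print(fixed)  # ['33e4853h45y45', '33e445a64b65', '33e5c44598e46']
```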

Related

Regex to get all occurrences of a pattern followed by a value in a comma separate string

This is in python
Input string:
Str = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'
Expected output
'FU=_COG-GAB-CANE-,FU=FARE,FU=#-_MAP.com-'
Here 'FU=' is the occurrence we are looking for, together with the value that follows it.
I want to return all occurrences of FU= (with their associated values) as a comma-separated string. They can occur anywhere within the string, and special characters are allowed.
Here is one approach.
>>> import re
>>> str_ = 'Y=DAT,X=ZANG,FU=FAT,T=TART,FU=GEM,RO=TOP,FU=MAP,Z=TRY'
>>> re.findall.__doc__[:58]
'Return a list of all non-overlapping matches in the string'
>>> re.findall(r'FU=\w+', str_)
['FU=FAT', 'FU=GEM', 'FU=MAP']
>>> ','.join(re.findall(r'FU=\w+', str_))
'FU=FAT,FU=GEM,FU=MAP'
Got it working
Python Code
import re
str_ = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'
str2='FU='+',FU='.join(re.findall(r'FU=(.*?),', str_))
print(str2)
Gives the desired output:
'FU=_COG-GAB-CANE-,FU=FARE,FU=#-_MAP.com-'
This successfully gives me all the occurrences of FU= followed by their values, irrespective of order and of the number of special characters.
It is a bit unclean, though, as I am manually adding FU= for the first occurrence.
It gets the work done, but please suggest a cleaner way of doing it if there is one.
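One possibly cleaner variant, offered only as a sketch: match "FU=" together with everything up to the next comma, so the prefix does not have to be added back by hand. Unlike the FU=(.*?), pattern, this also picks up an FU= entry at the very end of the string, where no comma follows. It assumes "FU=" never appears inside another key or value. The variable names are illustrative.

```python
import re

str_ = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'

# [^,]* matches any run of characters that are not commas,
# so each match is "FU=" plus its whole value.
via_regex = ','.join(re.findall(r'FU=[^,]*', str_))
print(via_regex)

# The same result without a regex: split on commas and keep the FU= fields.
via_split = ','.join(field for field in str_.split(',') if field.startswith('FU='))
print(via_split)
```

Both lines print 'FU=_COG-GAB-CANE-,FU=FARE,FU=#-_MAP.com-' for this input.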

How do you find all instances of a substring, followed by a certain number of dynamic characters?

I'm trying to find all instances of a specific substring(a!b2 as an example) and return them with the 4 characters that follow after the substring match. These 4 following characters are always dynamic and can be any letter/digit/symbol.
I've tried searching, but it seems like the similar questions that are asked are requesting help with certain characters that can easily split a substring, but since the characters I'm looking for are dynamic, I'm not sure how to write the regex.
When using regex, you can use "." to dynamically match any character. Use {number} to specify how many characters to match, and use parentheses as in (.{number}) to specify that the match should be captured for later use.
>>> import re
>>> s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
>>> print(re.findall("a!b2(.{4})", s))
['foob', 'bazq', 'spam']
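Two details may matter if the substring or the following characters are less tame than in this example. The sketch below assumes a made-up substring "a.b$" that contains regex metacharacters, so it is passed through re.escape, and it uses re.DOTALL so that "." also matches a newline (by default it does not):

```python
import re

sub = "a.b$"                          # hypothetical substring with regex metacharacters
text = "a.b$1234 axb$zzzz a.b$\nxyz"

# re.escape makes "." and "$" match literally instead of acting as regex syntax.
pattern = re.escape(sub) + r"(.{4})"

default_matches = re.findall(pattern, text)
dotall_matches = re.findall(pattern, text, flags=re.DOTALL)

print(default_matches)  # ['1234']
print(dotall_matches)   # ['1234', '\nxyz']
```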
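import re
s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
print(re.search(r'a!b2(.{4})', s).group(1))  # first match only
.{4} matches any 4 characters (by default, any character except a newline).
group(0) is the complete match of the searched string, and group(1) is the part captured by the first pair of parentheses. You can read about group id here.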
If you're only looking for how to grab the following 4 characters using Regex, what you are probably looking to use is the curly brace indicator for quantity to match: '{}'.
They go into more detail in the post here, but essentially you would use [a-zA-Z0-9]{X,Y} or (.{X,Y}), where X to Y is the number of characters you're looking for (in your case, you would only need {4}).
However, a more Pythonic way to solve this problem would be to use string slicing and the index method.
E.g., given an input_string, when you find the substring at index i using index, you can use input_string[i+len(sub_str):i+len(sub_str)+4] to grab the characters that follow it.
As an example,
input_string = 'abcdefg'
sub_str = 'abcd'
found_index = input_string.index(sub_str)
start_index = found_index + len(sub_str)
symbol = input_string[start_index:start_index + 4]
print(symbol)
Outputs (to show it also works when fewer than 4 characters remain): efg
Index also allows you to give start and end indexes for the search, so you could also use it in a loop if you wanted to find it for every sub string, with the start of the search index being the previous found index + 1.
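The loop described above could look roughly like this. It is a sketch that uses str.find rather than str.index, because find returns -1 instead of raising ValueError when nothing more is found. The function name and parameters are illustrative.

```python
def following_chars(text, sub, n=4):
    """Return the n characters after each occurrence of sub in text."""
    results = []
    start = 0
    while True:
        i = text.find(sub, start)
        if i == -1:          # no further occurrence
            break
        after = i + len(sub)
        results.append(text[after:after + n])
        start = i + 1        # resume the search just past the previous hit
    return results

s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
found = following_chars(s, "a!b2")
print(found)  # ['foob', 'bazq', 'spam']
```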

Parsing Korean text into a list using regex

I have some data stored as pandas data frame and one of the columns contains text strings in Korean. I would like to process each of these text strings as follows:
my_string = '모질상태불량(피부상태불량, 심하게 야윔), 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성(활력저하)'
Into a list like this:
parsed_text = '모질상태불량, 피부상태불량, 심하게 야윔, 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성, 활력저하'
So the problem is to identify cases where a word (or several words) is followed by parentheses containing text only (one word, or several words separated by commas) and to replace them with all the words (those before and those inside the parentheses), separated by commas, for later processing. If a word is followed by parentheses containing numbers (as with 7/22 in this case), it should be kept as it is. If a word is not followed by any parentheses, it should also be kept as it is. Furthermore, I would like to preserve the order of the words as they appeared in the original string.
I can extract text in parentheses by using regex as follows:
corrected_string = re.findall(r'(\w+)\((\D.*?)\)', my_string)
which yields this:
[('모질상태불량', '피부상태불량, 심하게 야윔'), ('코로나음성', '활력저하')]
But I'm having difficulty creating my resulting string, i.e. replacing my original text with the pattern I've matched. Any suggestions? Thank you.
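You can use re.findall with a pattern that matches each item, optionally followed by a parenthesized group that contains a digit (the matches keep their leading whitespace, so strip them afterwards):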
corrected_string = re.findall(r'[^,()]+(?:\([^)]*\d[^)]*\))?', my_string)
It's a little bit clumsy, but you can try:
my_string_list = [x.strip() for x in re.split(r"\((?!\d)|(?<!\d)\)|,", my_string) if x]
# You can then join the list back into a string, e.g. ', '.join(my_string_list).
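Putting the re.findall answer together end to end, a sketch might look like this. The stripping and joining steps are additions for illustration, not part of the original answer:

```python
import re

my_string = ('모질상태불량(피부상태불량, 심하게 야윔), 치석심함, 양측 수정체 백탁, '
             '좌측 화농성 눈곱심함(7/22), 코로나음성(활력저하)')

# An item is a run of characters other than commas and parentheses,
# optionally followed by a parenthesized group that contains a digit.
pattern = r'[^,()]+(?:\([^)]*\d[^)]*\))?'

# Matches keep the whitespace that followed the previous comma, so strip each one
# and drop anything that ends up empty.
items = [m.strip() for m in re.findall(pattern, my_string)]
items = [m for m in items if m]

parsed_text = ', '.join(items)
print(parsed_text)
```

For a pandas column, the same steps could be wrapped in a function and applied to each string with Series.apply.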

Python - Regex to avoid matching duplicates

My string looks like this:
bo_1
bo_1
bo_2
bo_2
bo_3
bo_3
bo_4
bo_4
bo_5
bo_5
bo_6
bo_6
bo_7
bo_7
bo_8
bo_8
bo_9
bo_9
bo_10
bo_10
I want to match the first instance of each digit and ignore the next duplicate line. My regex is as follows:
(bo_\d)(?![\s\S]*\1)
which returns the following:
'bo_2'
'bo_3'
'bo_4'
'bo_5'
'bo_6'
'bo_7'
'bo_8'
'bo_9'
'bo_1'
How would I modify the regex to return a result like this instead (to include 'bo_1' at the start and 'bo_10' at the end):
'bo_1'
'bo_2'
'bo_3'
'bo_4'
'bo_5'
'bo_6'
'bo_7'
'bo_8'
'bo_9'
'bo_10'
Technically you don't need regex for that (you can use set() for instance):
>>> # Assume your string is in the variable called "text"
>>> result = set(text.split('\n'))
>>> result
{'bo_7', 'bo_3', 'bo_1', 'bo_6', 'bo_5', 'bo_8', 'bo_9', 'bo_2', 'bo_4', 'bo_10'}
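Note that a set does not keep the original order, as the output above shows. If order matters, one option is dict.fromkeys, since dictionaries preserve insertion order in Python 3.7+. In this sketch the sample text is rebuilt programmatically just to keep it short:

```python
# Rebuild the sample input: each of bo_1 .. bo_10 appears twice, one per line.
text = "\n".join(f"bo_{n}" for n in range(1, 11) for _ in range(2))

# dict.fromkeys keeps the first occurrence of each key, in order of appearance.
unique_lines = list(dict.fromkeys(text.split("\n")))
print(unique_lines)
```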
Anyway, the issue with your regex is that bo_1 is also matching bo_10, so it will be seen as a duplicate by the regex. You can solve it using word boundaries to ensure that the full 'word' is tested for a match:
\b(bo_\d+)\b(?![\s\S]*\b\1\b)
regex101 demo
Use
(bo_\d+$)(?![\s\S]*^\1$)
Since you want to include bo_10, you should use \d+ and not just \d in the initial group. Then, in your negative lookahead, put the backreference between start-of-line and end-of-line anchors, so that, for example, bo_1 does not get excluded because it's followed by a bo_10. Note that ^ and $ only match at line boundaries when multiline mode is on (re.MULTILINE in Python).
https://regex101.com/r/8khbcc/1
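For reference, here is how the two patterns above might be run from Python. This is a sketch: the sample text is rebuilt programmatically, and the variable names are illustrative. Because of the negative lookahead, both patterns keep the last occurrence of each line, which gives the same list for this input.

```python
import re

# Rebuild the sample input: each of bo_1 .. bo_10 appears twice, one per line.
text = "\n".join(f"bo_{n}" for n in range(1, 11) for _ in range(2))

# Word-boundary version: \b stops bo_1 from being treated as a prefix of bo_10.
with_boundaries = re.findall(r"\b(bo_\d+)\b(?![\s\S]*\b\1\b)", text)

# Line-anchor version: needs re.MULTILINE so ^ and $ match at every line break.
with_anchors = re.findall(r"(bo_\d+$)(?![\s\S]*^\1$)", text, flags=re.MULTILINE)

print(with_boundaries)
print(with_anchors)
```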

How to clean the tweets having a specific but varying length pattern?

I pulled out some tweets for analysis. When I separate the words in the tweets, I can see a lot of expressions like the following in my output:
\xe3\x81\x86\xe3\x81\xa1
I want to use regular expressions to replace these patterns with nothing. I am not very good with regex. I tried the solutions from some similar questions, but nothing worked for me: they replace characters like "xt" in "extra".
I am looking for something that will replace \x?? with nothing, where ?? can be either a-f or 0-9, but the sequence must be exactly 4 characters long and start with \x.
I would also like to remove anything other than letters. For example:
"Hi!! my number is (7097868709809)."
after replacement should yield
"Hi my number is."
Input:
\xe3\x81\x86\xe3Extra
Output required:
Extra
What you are seeing are Unicode characters that can't be printed directly, expressed as pairs of hexadecimal digits. Here is a more printable example:
>>> ord('a')
97
>>> hex(97)
'0x61'
>>> "\x61"
'a'
Note that what appears to be a sequence of four characters '\x61' evaluates to a single character, 'a'. Therefore:
?? can't "be anything" - they can be '0'-'9' or 'a'-'f'; and
Although e.g. r'\\x[0-9a-f]{2}' would match the sequence you see, that's not what the regex would parse - each "word" is really a single character.
You can remove the characters "other than alphabets" using e.g. string.printable:
>>> s = "foo\xe3\x81"
>>> s
'foo\xe3\x81'
>>> import string
>>> valid_chars = set(string.printable)
>>> "".join([c for c in s if c in valid_chars])
'foo'
Note that e.g. '\xe3' can be directly printed in Python 3 (it's 'ã'), but isn't included in string.printable. For more on Unicode in Python, see the docs.
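If the goal is to keep only letters and spaces, a regex-based sketch could look like the following. It assumes ASCII letters are the only characters worth keeping, and it drops the final period as well, unlike the sample output in the question. The function name is illustrative. The last two lines cover the other possibility, where the text really contains a literal backslash, an "x" and two hex digits.

```python
import re

def keep_letters(text):
    # Remove everything that is not an ASCII letter or a space,
    # then collapse the runs of spaces left behind.
    letters_only = re.sub(r"[^A-Za-z ]+", "", text)
    return re.sub(r" +", " ", letters_only).strip()

cleaned_tweet = keep_letters("\xe3\x81\x86\xe3Extra")
cleaned_sentence = keep_letters("Hi!! my number is (7097868709809).")
print(cleaned_tweet)     # Extra
print(cleaned_sentence)  # Hi my number is

# Raw string: here the backslashes are real characters, not escapes.
literal = r"\xe3\x81\x86\xe3Extra"
cleaned_literal = re.sub(r"\\x[0-9a-f]{2}", "", literal)
print(cleaned_literal)   # Extra
```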
