Python : Extract substring if exist from another string using regex

I want to extract a value, if it exists, from a URL using a regex.
My string :
string = "utm_source=google&utm_campaign=replay&utm_medium=display&ctm_account=4&ctm_country=fr&ctm_bu=b2c&ctm_adchannel=im&esl-k=gdn|nd|c427558773026|m|k|pwww.ldpeople.com|t|dm|a100313514420|g9711440090"
From this string, I want to extract c427558773026. The value to extract always starts with c and has the pattern |c*|.
import re
pattern = re.compile('|c\w|')
pattern.findall(string)
The result is none in my case, I am using python 2.7

You could assert a pipe on the left and right using lookarounds (note that the pipe is escaped as \|), and match a c character followed by 1+ digits with \d+:
(?<=\|)c\d+(?=\|)
import re
string = "utm_source=google&utm_campaign=replay&utm_medium=display&ctm_account=4&ctm_country=fr&ctm_bu=b2c&ctm_adchannel=im&esl-k=gdn|nd|c427558773026|m|k|pwww.ldpeople.com|t|dm|a100313514420|g9711440090"
print(re.findall(r"(?<=\|)c\d+(?=\|)", string))
Or use a capturing group and leave out the lookbehind, as @Wiktor Stribiżew suggests:
\|(c\d+)(?=\|)
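A minimal sketch of the capturing-group variant, wrapped so that a missing value yields None instead of raising. The helper name extract_c_value is hypothetical and not part of the original answer; the sample strings are shortened versions of the one in the question.

```python
import re

# Hypothetical helper: returns the value between pipes that starts with "c"
# followed by digits, or None when the string contains no such value.
def extract_c_value(text):
    m = re.search(r"\|(c\d+)(?=\|)", text)
    return m.group(1) if m else None

print(extract_c_value("esl-k=gdn|nd|c427558773026|m|k"))        # c427558773026
print(extract_c_value("utm_source=google&utm_medium=display"))  # None
```

Using re.search with a None check covers the "if it exists" part of the question without relying on an empty list from findall.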

The problem with your approach is that | is the alternation (or) operator, which must be escaped to match the literal character. Additionally, you could use lookahead/lookbehind to ensure that | surrounds the value without being captured by findall.
Here is a code snippet that should solve the problem:
>>> import re
>>> string = "utm_source=google&...&esl-k=gdn|nd|c427558773026|m|k|..."
>>> pattern = re.compile(r'(?<=\|)c\d+(?=\|)')
>>> pattern.findall(string)
['c427558773026']
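To make the escaping point concrete, here is a small sketch comparing the unescaped and escaped pipe on a shortened copy of the string. The exact list produced by the unescaped pattern differs between Python versions, so only the escaped result is shown as output.

```python
import re

s = "esl-k=gdn|nd|c427558773026|m|k"

# Unescaped: each | separates alternatives, and the empty alternative
# matches at every position, so the result contains empty strings.
unescaped = re.findall(r"|c\w|", s)
print("" in unescaped)  # True

# Escaped: \| matches a literal pipe on both sides of the value.
escaped = re.findall(r"\|c\w+\|", s)
print(escaped)  # ['|c427558773026|']
```

The escaped version still includes the pipes in the match, which is why the answers above move them into lookarounds or put the value in a capturing group.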

Related

Replace a substring between two substrings

How can I replace a substring between page1/ and _type-A with 222.6 in the below-provided l string?
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
Expected result:
https://homepage.com/home/page1/222.6_type-A/go
I tried:
import re
re.sub('page1/.*?_type-A','',l, flags=re.DOTALL)
But it also removes page1/ and _type-A.

You may use re.sub like this:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub(r'(?<=page1/).*?(?=_type-A)', replace_with, l))
Output:
https://homepage.com/home/page1/222.6_type-A/go
RegEx Breakup:
(?<=page1/): Lookbehind to assert that we have page1/ at previous position
.*?: Match 0 or more of any character (lazy)
(?=_type-A): Lookahead to assert that we have _type-A at next position
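The breakdown above can be generalized into a small helper, sketched here under the assumption that both delimiters are literal text. The name replace_between and its parameters are hypothetical. re.escape keeps characters such as - or . in the delimiters from being read as regex syntax.

```python
import re

# Hypothetical helper: replace whatever sits between two literal delimiters.
# Assumes `new` contains no backslashes, since re.sub would interpret them.
def replace_between(text, left, right, new):
    pattern = rf"(?<={re.escape(left)}).*?(?={re.escape(right)})"
    # count=1 limits the change to the first delimited span
    return re.sub(pattern, new, text, count=1, flags=re.DOTALL)

l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
print(replace_between(l, 'page1/', '_type-A', '222.6'))
# https://homepage.com/home/page1/222.6_type-A/go
```

An escaped literal always has a fixed width, which matters because Python's lookbehind only accepts fixed-width patterns.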

You can use
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub('(page1/).*?(_type-A)',fr'\g<1>{replace_with}\2',l, flags=re.DOTALL))
Output: https://homepage.com/home/page1/222.6_type-A/go
Note you used an empty string as the replacement argument. In the above snippet, the parts before and after .*? are captured and \g<1> refers to the first group value, and \2 refers to the second group value from the replacement pattern. The unambiguous backreference form (\g<X>) is used to avoid backreference issues since there is a digit right after the backreference.
Since replace_with contains no backslashes, there is no need to preprocess (escape) anything in it.
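If the replacement text might contain backslashes or start with a digit, one alternative (a sketch, not what the answer above uses) is to pass a function as the replacement. re.sub inserts the function's return value literally, so no backreference syntax is parsed at all.

```python
import re

l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'

# The function receives the match object; groups 1 and 2 are the
# captured delimiters, which are put back around the new text.
result = re.sub(
    '(page1/).*?(_type-A)',
    lambda m: m.group(1) + replace_with + m.group(2),
    l,
    flags=re.DOTALL,
)
print(result)  # https://homepage.com/home/page1/222.6_type-A/go
```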

This works:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
pattern = r"(?<=page1/).*?(?=_type)"
replace_with = '222.6'
s = re.sub(pattern, replace_with, l)
print(s)
The pattern uses positive lookbehind and lookahead assertions, (?<=...) and (?=...). A match only occurs if the text is preceded and followed by what the assertions specify, but the assertions themselves are not consumed. This means re.sub looks for text with page1/ in front of it and _type after it, but only replaces the part in between.
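A quick way to see that the assertions are not consumed is to inspect the match itself. This sketch reuses the pattern and string from above and prints the matched text along with what lies immediately before and after it.

```python
import re

l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
m = re.search(r"(?<=page1/).*?(?=_type)", l)

print(repr(m.group()))     # '222.6 a'  (only the part in between)
print(l[:m.start()][-6:])  # page1/     (left of the match, untouched)
print(l[m.end():][:5])     # _type      (right of the match, untouched)
```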

How to split string at any number followed by a period instead of a fixed delimiter

input:
string="1.Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking2.Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering3.Harrison-Lu-Java-9223391910-HarrisonLu#gmail.com-Mumbai-INDIA-Samsung-Engineering4.Andrew-Joseph-Javascript-9200091910-AndrewJoseph#gmail.com-Toronto-CANADA-Dell-Engineering5.Larry-Ken-SQL-8880091910-LarryKen#gmail.com-Newyork-USA-HP-Management"
expected output:
[
"1.Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking",
"2.Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering",
...
]
Attempt: I have tried using a string.split(range(0,5)+"."). What would be the best way to do this?

I don't usually reach for regular expressions first, but this cries out for re.split.
parts = re.split(r'(\d\.)', string)
This does need a bit of post-processing. It creates:
['', '1.', 'Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking', '2.', 'Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering', ...
So you'll need to combine every other element.
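One way to do that post-processing, sketched on a shortened made-up string: the captured delimiters land at the odd indices and the record bodies at the even indices after the leading empty string, so the two slices can be zipped back together.

```python
import re

# Shortened sample in the same shape as the question's input
string = "1.Adam-Lee-Banking2.Peter-Smith-Engineering3.Harrison-Lu-Engineering"

parts = re.split(r'(\d\.)', string)
# parts == ['', '1.', 'Adam-Lee-Banking', '2.', 'Peter-Smith-Engineering', ...]

# Pair each delimiter (odd index) with the body that follows it (even index)
records = [num + body for num, body in zip(parts[1::2], parts[2::2])]
print(records)
# ['1.Adam-Lee-Banking', '2.Peter-Smith-Engineering', '3.Harrison-Lu-Engineering']
```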

You could split using a regex with lookaround assertions: (?=\d+\.) asserts 1+ digits followed by a dot to the right, and (?<!^) asserts that the position is not the start of the string. Note that splitting on an empty match requires Python 3.7 or later.
(?<!^)(?=\d+\.)
import re
pattern = r"(?<!^)(?=\d+\.)"
string="1.Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking2.Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering3.Harrison-Lu-Java-9223391910-HarrisonLu#gmail.com-Mumbai-INDIA-Samsung-Engineering4.Andrew-Joseph-Javascript-9200091910-AndrewJoseph#gmail.com-Toronto-CANADA-Dell-Engineering5.Larry-Ken-SQL-8880091910-LarryKen#gmail.com-Newyork-USA-HP-Management"
res = re.split(pattern, string)
print(res)
Output
[
'1.Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking',
'2.Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering',
'3.Harrison-Lu-Java-9223391910-HarrisonLu#gmail.com-Mumbai-INDIA-Samsung-Engineering',
'4.Andrew-Joseph-Javascript-9200091910-AndrewJoseph#gmail.com-Toronto-CANADA-Dell-Engineering',
'5.Larry-Ken-SQL-8880091910-LarryKen#gmail.com-Newyork-USA-HP-Management'
]
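One caveat, offered as a sketch rather than as part of the original answer: with record numbers of two or more digits, the lookahead also succeeds in the middle of the number. Adding a (?<!\d) lookbehind avoids that, on the assumption that a record never ends in a digit right before the next number. The sample data below is made up, and splitting on empty matches needs Python 3.7+.

```python
import re

sample = "9.Ann-Lee10.Bob-Ray11.Cy-Fox"  # made-up data with two-digit numbers

# Original pattern: also splits between the digits of "10" and "11"
naive = re.split(r"(?<!^)(?=\d+\.)", sample)
print(naive)   # ['9.Ann-Lee', '1', '0.Bob-Ray', '1', '1.Cy-Fox']

# Extra lookbehind: only split where the number actually starts
strict = re.split(r"(?<!^)(?<!\d)(?=\d+\.)", sample)
print(strict)  # ['9.Ann-Lee', '10.Bob-Ray', '11.Cy-Fox']
```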
Or instead of splitting, you could also use a pattern to match 1 or more digits followed by a dot, and then match until the first occurrence of the same pattern or the end of the string.
\d+\..*?(?=\d+\.|$)
import re
pattern = r"\d+\..*?(?=\d+\.|$)"
string="1.Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking2.Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering3.Harrison-Lu-Java-9223391910-HarrisonLu#gmail.com-Mumbai-INDIA-Samsung-Engineering4.Andrew-Joseph-Javascript-9200091910-AndrewJoseph#gmail.com-Toronto-CANADA-Dell-Engineering5.Larry-Ken-SQL-8880091910-LarryKen#gmail.com-Newyork-USA-HP-Management"
res = re.findall(pattern, string)
print(res)

Extract a string between two set of patterns in Python

I am trying to extract a substring between two sets of patterns using re.search().
On the left, there can be either 0x or 0X, and on the right there can be either U, a space, or \n. The result should not contain the boundary patterns. For example, 0x1234U should result in 1234.
I tried with the following search pattern: (0x|0X)(.*)(U| |\n), but it includes the left and right patterns in the result.
What would be the correct search pattern?

You could also use a single capturing group and read it with .group(1):
0[xX](.*?)[U\s]
The pattern matches:
0[xX] Match either 0x or 0X
(.*?) Capture in group 1 any characters except a newline, as few as possible
[U\s] Match either U or a whitespace character (which could also match a newline)
import re
s = r"0x1234U"
pattern = r"0[xX](.*?)[U\s]"
m = re.search(pattern, s)
if m:
    print(m.group(1))
Output
1234
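The same pattern can be wrapped in a small function, sketched here with the hypothetical names HEX_BODY and hex_body, so callers get None when no prefix or no terminator is present.

```python
import re

# Same pattern as above, compiled once
HEX_BODY = re.compile(r"0[xX](.*?)[U\s]")

# Hypothetical helper: text between 0x/0X and the first U or whitespace
def hex_body(text):
    m = HEX_BODY.search(text)
    return m.group(1) if m else None

print(hex_body("0x1234U"))           # 1234
print(hex_body("value = 0Xbeef\n"))  # beef
print(hex_body("0x1234"))            # None (no U or whitespace after it)
```

The last call shows a limit of this pattern: a value at the very end of the string with no terminator does not match.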

You could use a combination of lookbehind and lookahead with a non-greedy match pattern in between:
import re
pattern = r"(?<=0[xX])(.*?)(?=[U\s\n])"
re.findall(pattern,"---0x1234U...0X456a ")
['1234', '456a']

how to make a list in python from a string and using regular expression [duplicate]

I have a sample string <alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>, created=1324336085, description='Customer for My Test App', livemode=False>
I only want the value cus_Y4o9qMEZAugtnW and NOT card (which is inside another [])
How could I do it in easiest possible way in Python?
Maybe by using RegEx (which I am not good at)?

How about:
import re
s = "alpha.Customer[cus_Y4o9qMEZAugtnW] ..."
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print(m.group(1))
For me this prints:
cus_Y4o9qMEZAugtnW
Note that the call to re.search(...) finds the first match to the regular expression, so it doesn't find the [card] unless you repeat the search a second time.
Edit: The regular expression here is a python raw string literal, which basically means the backslashes are not treated as special characters and are passed through to the re.search() method unchanged. The parts of the regular expression are:
\[ matches a literal [ character
( begins a new group
[A-Za-z0-9_] is a character set matching any letter (capital or lower case), digit or underscore
+ matches the preceding element (the character set) one or more times.
) ends the group
\] matches a literal ] character
Edit: As D K has pointed out, the regular expression could be simplified to:
m = re.search(r"\[(\w+)\]", s)
since \w is a special sequence that is equivalent to [a-zA-Z0-9_] for ASCII matching; with re.UNICODE (the default for str patterns in Python 3) or re.LOCALE it matches additional word characters.
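The part-by-part explanation above can also live inside the pattern itself. This is a sketch using re.VERBOSE, which ignores whitespace in the pattern and allows # comments; the sample string is a shortened copy of the one in the question.

```python
import re

pattern = re.compile(r"""
    \[                 # a literal [
    ( [A-Za-z0-9_]+ )  # group 1: one or more letters, digits or underscores
    \]                 # a literal ]
""", re.VERBOSE)

s = "<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>"
m = pattern.search(s)
print(m.group(1))  # cus_Y4o9qMEZAugtnW
```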

You could use str.split to do this.
s = "<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card]\
...>, created=1324336085, description='Customer for My Test App',\
livemode=False>"
val = s.split('[', 1)[1].split(']')[0]
Then we have:
>>> val
'cus_Y4o9qMEZAugtnW'

This should do the job:
re.match(r"[^[]*\[([^]]*)\]", yourstring).groups()[0]

your_string = "lnfgbdgfi343456dsfidf[my data] ljfbgns47647jfbgfjbgskj"
your_string[your_string.find("[")+1 : your_string.find("]")]
courtesy: Regular expression to return text between parenthesis
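Note that str.find returns -1 when the character is missing, in which case the slice above silently returns the wrong text. A guarded version might look like the sketch below; the function name between_brackets is hypothetical.

```python
# Hypothetical helper: text inside the first [...] pair, or None if absent
def between_brackets(text):
    start = text.find("[")
    if start == -1:
        return None
    # Search for the closing bracket only after the opening one
    end = text.find("]", start + 1)
    if end == -1:
        return None
    return text[start + 1:end]

print(between_brackets("lnfgbdgfi343456dsfidf[my data] ljfbgns47647jfbgfjbgskj"))  # my data
print(between_brackets("no brackets here"))  # None
```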

You can also use
re.findall(r"\[([A-Za-z0-9_]+)\]", string)
if there are many occurrences that you would like to find.
See also for more info:
How can I find all matches to a regular expression in Python?

You can use
import re
s = re.search(r"\[.*?]", string)
if s:
    print(s.group(0))

How about this? An example illustrated using a file:
import re
with open('abc.log') as f:
    for line in f:
        m = re.search(r"\[(.*?)\]", line)
        if m:  # skip lines without brackets
            print(m.group(1))
Hope this helps:
Magic regex : \[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : matches a literal ]. Outside a character class ] is not special in Python, so the escape is optional but harmless.
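To show why the non-greedy quantifier matters here, this short sketch runs the greedy and lazy forms against a made-up string with two bracketed values.

```python
import re

s = "a[one] b[two]"  # made-up sample with two bracket pairs

# Greedy: .* runs to the last ] in the string
print(re.findall(r"\[(.*)\]", s))   # ['one] b[two']

# Non-greedy: .*? stops at the first ] after each [
print(re.findall(r"\[(.*?)\]", s))  # ['one', 'two']
```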

This snippet should work too, but it will return any text enclosed within "[]"
re.findall(r"\[([a-zA-Z0-9 ._]*)\]", your_text)

Add [] around numbers in strings

I like to add [] around any sequence of numbers in a string e.g
"pixel1blue pin10off output2high foo9182bar"
should convert to
"pixel[1]blue pin[10]off output[2]high foo[9182]bar"
I feel there must be a simple way but it's eluding me :(

Yes, there is a simple way, using re.sub():
result = re.sub(r'(\d+)', r'[\1]', inputstring)
Here \d matches a digit, \d+ matches 1 or more digits. The (...) around that pattern groups the match so we can refer to it in the second argument, the replacement pattern. That pattern simply replaces the matched digits with [...] around the group.
Note that I used r'..' raw string literals; if you don't you'd have to double all the \ backslashes; see the Backslash Plague section of the Python Regex HOWTO.
Demo:
>>> import re
>>> inputstring = "pixel1blue pin10off output2high foo9182bar"
>>> re.sub(r'(\d+)', r'[\1]', inputstring)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
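As a variation (a sketch, not part of the answer above), the capturing group can be dropped by referring to the whole match with \g<0> in the replacement.

```python
import re

inputstring = "pixel1blue pin10off output2high foo9182bar"

# \g<0> stands for the entire matched text, so no group is needed
result = re.sub(r'\d+', r'[\g<0>]', inputstring)
print(result)  # pixel[1]blue pin[10]off output[2]high foo[9182]bar
```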

You can use re.sub :
>>> s="pixel1blue pin10off output2high foo9182bar"
>>> import re
>>> re.sub(r'(\d+)',r'[\1]',s)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
Here (\d+) matches any run of digits, and re.sub replaces it with the first group's match wrapped in brackets, r'[\1]'.
You can start here to learn regular expression http://www.regular-expressions.info/
