Find string, extract value [duplicate] - python

This question already has answers here:
Extract part of a regex match
(11 answers)
Closed 3 years ago.
I'm trying to parse HTML in Python that has an inline script in it. I need to find a string inside of the script, then extract the value. I've been trying to do this in regex for the past few hours, but I'm still not convinced this is the correct approach.
Here is a sample:
['key_to_search_for']['post_date'] = '10 days ago';
The result I want to extract is: 10 days ago
This regex gets me part of the way, but I can't figure out the full match:
^\[\'key_to_search_for\'\]\[\'post_date\'\] = '(\d{1,2})+( \w)
However, even once I can match with regex, I'm not sure of the best way to get only the value. I was thinking of just replacing the keys with blanks, like .replace("['key_to_search_for']['post_date'] = '", ''), but that seems inefficient.
Should I be matching the regex then replacing? Is there a better way to handle this?

You can extract the value with a single capturing group, matching each of the two words with \w+.
The value is in capture group 1.
^\['key_to_search_for'\]\['post_date'\] = '(\d{1,2} \w+ \w+)';$
Or use a negated character class that matches any character except a ':
^\['key_to_search_for'\]\['post_date'\] = '([^']+)';$
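A minimal sketch of how the negated character class pattern could be applied in Python. The script_text value is a made-up stand-in for the inline script extracted from the HTML, and re.MULTILINE is an assumption added here so that ^ and $ anchor on each line of a multi-line script rather than on the whole string.

```python
import re

# Hypothetical stand-in for the inline script text pulled out of the HTML.
script_text = """
['other_key']['post_date'] = '3 days ago';
['key_to_search_for']['post_date'] = '10 days ago';
"""

# re.MULTILINE makes ^ and $ match at the start and end of every line.
pattern = re.compile(
    r"^\['key_to_search_for'\]\['post_date'\] = '([^']+)';$",
    re.MULTILINE,
)

match = pattern.search(script_text)
# Group 1 holds only the quoted value, so no follow-up replace is needed.
post_date = match.group(1) if match else None
print(post_date)
```

Reading group 1 directly avoids the match-then-replace step the question was considering.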

Related

regular expression in python to extract words that contain # but not dot [duplicate]

This question already has an answer here:
Regex: I want to match only words without a dot at the end
(1 answer)
Closed last year.
I am using Python to extract words from a text. I want to extract words that contain # but are not followed by a dot.
Regular expression should match following: #bob, cat#bob
Regular expression should not match: xyz#bob.com.
I tried the following: (?:\w+)?#\w+(?!\.) - but it extracts #bob, cat#bob and xyz#bo.
To elaborate, given the text "hi #bob and cat#bob my email is xyz#bob.com", I want to extract only #bob and cat#bob. My regular expression above also extracts part of xyz#bob.com (precisely, xyz#bo), because the engine backtracks one character so that the lookahead no longer sees the dot. How can I avoid matching xyz#bob.com at all?
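I was finally able to find a solution. The following expression worked for me: (?:\w+)?#\w+\b(?!\.)
The \b forces the match to end at a word boundary, so the engine can no longer back off into the middle of the word. Note that the dot in the lookahead must be escaped; an unescaped (?!.) would reject a match followed by any character at all.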
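Here is a small sketch of that lookahead approach run against the sample sentence from the question, with the dot escaped as \. inside the lookahead. The variable names are only illustrative.

```python
import re

# Optional leading word, '#', a word, a word boundary,
# and a negative lookahead that rejects a following literal dot.
HASH_WORD_RE = re.compile(r'(?:\w+)?#\w+\b(?!\.)')

text = "hi #bob and cat#bob my email is xyz#bob.com"

# No capturing groups are used, so findall returns the full matches.
found = HASH_WORD_RE.findall(text)
print(found)
```

One caveat of this design: a tag at the very end of a sentence, such as "#bob.", is rejected as well, since it is also followed by a dot.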

How can I re-write my Regex Expression to begin the search at the occurrence of a separate pattern? [duplicate]

This question already has answers here:
Python extract pattern matches
(10 answers)
Closed 2 years ago.
Apologies if this is a duplicate - I wasn't exactly sure what to search for and everything I found came up short.
I'm using Python, and if anybody's interested I drafted up a quick example here:
Regex101 Example I created
I'm trying to use regex to grab the first part of a string that might be formatted like so:
**This is a Location** 8:20
or it could be formatted like...
Irrelevant information - **Relevant Information** 6:90
I wrote the following expression which does the job almost perfectly, pulling the relevant part of the string (words) out but it also pulls in the second part of the string (numbers). This is annoying as I then need to do a second regex/python expression to split that out.
r'(\w* ){1,5}\d+:\d+'
I'm using Python so I know I can quite easily separate the info manually with a slice etc but I feel like there must be a more elegant solution to my Regex that would negate the need for this step. Essentially I think the solution would be to match '\d+:\d+' and look back from there.
Ok - perhaps this isn't the most elegant solution but I've just realised I think I can use capturing groups like so:
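import re

# Pattern with groups: group 1 is the text, group 3 is the time
pattern = r'((\w* ){1,5})(\d+:\d+)'
string = "useless something else - useful 2:2"
r = re.search(pattern, string)
if r:
    useful_info = r.group(1)  # words before the time, may carry surrounding spaces
    boundary = r.group(3)     # the time/number value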
Theoretically, I'm always going to have the same number of groups with group 1 containing the relevant string I'm trying to grab and group 3 the time/number value. I'll test this now and update/close this thread.
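A possible tidier variant of the same idea, offered as a sketch rather than the definitive answer: making the inner group non-capturing leaves only two groups, and \w+ instead of \w* stops a lone space from counting as a word. The split_location helper name is hypothetical.

```python
import re

# Group 1: one to five space-terminated words; group 2: the time.
# (?:...) keeps the repeated word group from adding a group number.
LOCATION_RE = re.compile(r'((?:\w+ ){1,5})(\d+:\d+)')

def split_location(line):
    """Return (words, time) from a line, or None if there is no match."""
    m = LOCATION_RE.search(line)
    if m is None:
        return None
    # strip() removes the trailing space left by the last '\w+ ' repeat.
    return m.group(1).strip(), m.group(2)

print(split_location("Irrelevant information - Relevant Information 6:90"))
```

Because the words must run right up to the time, anything before a non-word separator such as " - " is left out of the match.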

Python regex to extract number of processors [duplicate]

This question already has answers here:
What is the difference between re.search and re.match?
(9 answers)
Closed 3 years ago.
I have a string which contains the number of processors:
SQLDB_GP_Gen5_2
The number is after _Gen and before _ (the number 5). How can I extract this using python and regular expressions?
I am trying to do it like this but don't get a match:
re.match('_Gen(.*?)_', 'SQLDB_GP_Gen5_2')
I was also trying this using pandas:
x['SLO'].extract(pat = '(?<=_Gen).*?(?:(?!_).)')
But this also wasn't working. (x is a Series)
Can someone please also point me to a book/tutorial site where I can learn regex and how to use with Pandas.
Thanks,
Mick
re.match only matches at the beginning of the string. Use re.search instead, which scans the whole string, and retrieve the first capturing group:
>>> re.search(r'_Gen(\d+)_', 'SQLDB_GP_Gen5_2').group(1)
'5'
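To make the difference concrete, here is a short sketch comparing the two calls on the string from the question. The processor_count helper is a hypothetical name added for illustration, and converting the result to int is an assumption about how the value will be used.

```python
import re

slo = 'SQLDB_GP_Gen5_2'

# re.match only tries position 0, where the string starts with 'SQLDB'.
match_result = re.match(r'_Gen(\d+)_', slo)

# re.search scans forward until the pattern fits somewhere in the string.
search_result = re.search(r'_Gen(\d+)_', slo)

def processor_count(name):
    """Return the number after '_Gen' as an int, or None if absent."""
    m = re.search(r'_Gen(\d+)_', name)
    return int(m.group(1)) if m else None

print(match_result, search_result.group(1), processor_count(slo))
```

Checking for None before calling group avoids an AttributeError on strings that do not contain the pattern.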
You need to use Series.str.extract with a pattern containing a capturing group:
x['SLO'].str.extract(r'_Gen(.*?)_', expand=False)
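Note the .str accessor and the expand=False argument, which makes the method return a Series rather than a one-column DataFrame.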
To only match a number, use r'_Gen(\d+)_'.
NOTES:
With Series.str.extract, you need a capturing group: the method returns only what the group captures.
r'_Gen(.*?)_' will match _Gen, then will capture any 0+ chars other than line break chars as few as possible, and then match _. If you use \d+, it will only match 1+ digits.
Using re.findall (text here is the input string):
re.findall(r'Gen(.*)_', text)[0]

Remove everything after regex pattern match but keep pattern [duplicate]

This question already has answers here:
Using regex to remove all text after the last number in a string
(2 answers)
Closed 4 years ago.
I was searching for a way to remove all characters after a certain pattern match. I know that there are many similar questions here on SO, but I was unable to find one that works for me. Basically I have a fixed pattern (\w\w\d\d\d\d), and I want to remove everything after it but keep the pattern itself.
I've tried using:
test = 'PP1909dfgdfgd'
done = re.sub ('(\w\w\d\d\d\d/w*)', '\w\w\d\d\d\d/', test)
but I still get the same string.
example:
dirty = 'AA1001dirtydata'
dirty2 = 'AA1001222%^&*'
Desired output:
clean = 'AA1001'
You can use re.match() instead of re.sub():
re.match(r'\w\w\d\d\d\d', dirty).group(0) # returns 'AA1001'
Note: match will look for the regular expression at the beginning of the string you provide and only "match" the characters corresponding to the pattern. If you want to find the pattern partway through the string you can use re.search().
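As a sketch of both options mentioned above, the first helper below uses re.search so the code may sit partway through the string, and the second shows how re.sub could be made to work with a backreference. Both helper names are hypothetical, and re.DOTALL is an added assumption so that text after a newline is dropped too.

```python
import re

CODE_RE = re.compile(r'\w\w\d\d\d\d')

def keep_code(text):
    """Return only the first code found, or None if there is none."""
    m = CODE_RE.search(text)
    return m.group(0) if m else None

def truncate_after_code(text):
    """Drop everything after the first code, keeping any leading text."""
    # \1 re-inserts the captured code; '.*' swallows the remainder.
    return re.sub(r'(\w\w\d\d\d\d).*', r'\1', text, count=1, flags=re.DOTALL)

print(keep_code('AA1001dirtydata'), truncate_after_code('AA1001222%^&*'))
```

The original re.sub attempt left the string unchanged because /w is not a regex class, so the pattern never matched, and a replacement string cannot contain pattern syntax in any case; a backreference such as \1 is what puts the matched code back.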

Make part of a regex match in python optional [duplicate]

This question already has answers here:
How to use regex with optional characters in python?
(5 answers)
Closed 5 years ago.
I'm trying to match a URL using re but am having trouble in regards to making part of the match optional.
import re
x = input('Link: ')
reg = r'(http|https)://(iski|www\.iskis|iskis)\.(in|com)/[A-Za-z0-9?&=/?_]+'
if re.match(reg, x):
    print('True')
Currently, the above code would match something like:
https://iskis.com/?loc=shop_view_item&item=220503032
I would like to alter the regular expression to make the trailing part, /[A-Za-z0-9?&=/?_]+, optional, so that nothing is required after the domain and the following also matches:
https://iskis.com
I'm sure there is a simple solution but I don't know how to go about solving this.
reg = r'(http|https)://(iski|www\.iskis|iskis)\.(in|com)(/[A-Za-z0-9?&=/?_]+)?$'
That should do it. Wrap the slash and the character class in () so they form a group, put a ? after the group so it matches zero or one times, and put a $ at the end so the regex has to match through to the end of the string.
EDIT:
Come to think of it, you could use the optional match elsewhere in your regex.
reg = r'(https?)://(www\.)?(iskis?)\.(in|com)(/[A-Za-z0-9?&=/?_]+)?$'
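A quick sketch for checking the optional group against the URLs from the question, written for Python 3. The is_iskis_url helper is a hypothetical name, and the pattern is the one from the edit above with a raw-string prefix.

```python
import re

# Optional 's', optional 'www.', optional trailing 's' on the name,
# and an optional path group anchored to the end of the string.
URL_RE = re.compile(
    r'(https?)://(www\.)?(iskis?)\.(in|com)(/[A-Za-z0-9?&=/?_]+)?$'
)

def is_iskis_url(url):
    """Return True if the whole URL fits the pattern."""
    return URL_RE.match(url) is not None

for candidate in (
    'https://iskis.com',
    'https://iskis.com/?loc=shop_view_item&item=220503032',
    'https://example.com',
):
    print(candidate, is_iskis_url(candidate))
```

Without the trailing $, re.match would accept any string that merely starts with a valid domain, because the optional group is free to match nothing.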
