Extracting numbers from a string using regex in Python

I have a list of urls that I would like to parse:
['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']
I would like to use a regular expression to create a new list containing the numbers at the end of each string, plus any letters that follow them before the punctuation (some strings contain numbers in two positions, as the first string in the list above shows). So the new list would look like:
['20170303', '20160929a', '20161005a']
This is what I've tried with no luck:
code = re.search(r'?[0-9a-z]*', urls)
Update:
Running -
[re.search(r'(\d+)\D+$', url).group(1) for url in urls]
I get the following error -
AttributeError: 'NoneType' object has no attribute 'group'
Also, it doesn't seem like this will pick up a letter after the numbers if a letter is there..!

# python3
import re
from os.path import basename
from urllib.parse import urlparse

def extract_id(url):
    # Work on the last path component only, e.g. 'powell20160929a.htm'
    path = urlparse(url).path
    resource = basename(path)
    # First digit, then everything up to (not including) the next dot
    _id = re.search(r'\d[^.]*', resource)
    if _id:
        return _id.group(0)

urls = ['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']
# /!\ the ids list contains None for any URL where the pattern is not found
ids = [extract_id(url) for url in urls]
print(ids)
Output:
['20170303', '20160929a', '20161005a']

You can use this regex (\d+[a-z]*)\. :
Outputs
20170303
20160929a
20161005a
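As a minimal sketch, the pattern above could be applied to the whole list with re.search, skipping any URL where nothing matches (which avoids the AttributeError on None from the question). The variable names are illustrative and not taken from the original answer.

```python
import re

urls = [
    'https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf',
    'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm',
    'http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm',
]

# Digits, then optional lowercase letters, immediately followed by a literal dot.
pattern = re.compile(r'(\d+[a-z]*)\.')

ids = []
for url in urls:
    match = pattern.search(url)
    if match:  # re.search returns None when nothing matches
        ids.append(match.group(1))

print(ids)
```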

Given:
>>> lios=['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']
You can do:
import re

for s in lios:
    m = re.search(r'(\d+\w*)\D+$', s)
    if m:
        print(m.group(1))
Prints:
20170303
20160929a
20161005a
Which is based on this regex:
(\d+\w*)\D+$
 ^          one or more digits
    ^       any trailing word characters (letters, digits, underscore)
        ^   one or more non-digits
           ^ end of string
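The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace and allows comments inside the regex. This is only a sketch of that idea; the sample URL is one of those from the question.

```python
import re

pattern = re.compile(r'''
    (\d+\w*)   # group 1: one or more digits, then any trailing word characters
    \D+        # one or more non-digits (the file extension here)
    $          # end of string
''', re.VERBOSE)

url = 'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm'
match = pattern.search(url)
print(match.group(1) if match else None)
```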

import re

patterns = {
    'url_refs': re.compile(r"(\d+[a-z]*)\."),  # YCF_L
}

def scan(iterable, pattern=None):
    """Scan for matches in an iterable."""
    for item in iterable:
        # if you want only one, add a comma:
        #     reference, = pattern.findall(item)
        # but it's less reusable.
        matches = pattern.findall(item)
        yield matches
You can then do:
hits = scan(urls, pattern=patterns['url_refs'])
references = (item[0] for item in hits)
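Feed references to your other functions. Because both scan and references are generators, matches are produced lazily, one item at a time, so you can go through larger inputs this way without building intermediate lists.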

Related

Replace characters with particular format with a variable value in python

I have filenames with the particular format as given
II.NIL.10.BHZ.M.2058.190.160877
II.NIL.10.BHA.M.2008.190.168857
II.NIL.10.BHB.M.2078.198.160857
.
.
.
I want to replace the ?.M part after BH (where ? is any letter) with the corresponding value from a list of strings called name.
name=['T','D','FG'.....]
expected output
II.NIL.10.BHT.2058.190.160877
II.NIL.10.BHD.2008.190.168857
II.NIL.10.BHFG.2078.198.160857
.
.
.
Is it possible with str.replace()?
You could use the built-in regex module (re) alongside the following pattern to effectively replace the content in your strings.
Pattern
'(?<=BH)[A-Z]+\.M'
This pattern uses a lookbehind (which is not part of the match) to check for the substring 'BH', then matches one or more uppercase characters ([A-Z]+) followed by the substring '.M'.
Solution
The below solution uses re.sub() alongside the pattern outlined above to return a string with the substring matched by the pattern replaced with that defined here as replacement.
import re
original = 'II.NIL.10.BHB.M.2078.198.160857'
replacement = 'FG'
output = re.sub(r'(?<=BH)[A-Z]+\.M', replacement, original)
print(output)
Output
II.NIL.10.BHFG.2078.198.160857
Processing multiple files
To repeat this process for multiple files you could apply the above logic within a loop/comprehension, running the re.sub() function on each original/replacement pairing and storing/processing appropriately.
The example below uses the data from your question: it builds a dictionary mapping each original filename to the substring to insert, then runs re.sub() on each pair and collects the results in a list.
import re
originals = [
    'II.NIL.10.BHZ.M.2058.190.160877',
    'II.NIL.10.BHA.M.2008.190.168857',
    'II.NIL.10.BHB.M.2078.198.160857'
]
replacements = ['T','D','FG']
mapping = {originals[i]: replacements[i] for i, _ in enumerate(originals)}
results = [re.sub(r'(?<=BH)[A-Z]+\.M', v, k) for k, v in mapping.items()]
for r in results:
    print(r)
Output
II.NIL.10.BHT.2058.190.160877
II.NIL.10.BHD.2008.190.168857
II.NIL.10.BHFG.2078.198.160857
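A possible simplification, offered as a sketch rather than as part of the original answer: zip pairs each filename with its replacement directly, so no dictionary is needed, and two identical filenames would not overwrite each other as dictionary keys would.

```python
import re

originals = [
    'II.NIL.10.BHZ.M.2058.190.160877',
    'II.NIL.10.BHA.M.2008.190.168857',
    'II.NIL.10.BHB.M.2078.198.160857',
]
replacements = ['T', 'D', 'FG']

# Compile once; the lookbehind keeps 'BH' and replaces only the letter(s) plus '.M'.
pattern = re.compile(r'(?<=BH)[A-Z]+\.M')

# zip stops at the shorter list, so extra filenames or names are silently ignored.
results = [pattern.sub(new, old) for old, new in zip(originals, replacements)]

for r in results:
    print(r)
```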
Nope, you cannot use str.replace with a wildcard. You will have to use a regex, with something such as the following:
import re
filenames = ['II.NIL.10.BHA.M.2008.190.168857', 'II.NIL.10.BHB.M.2078.198.160857',
             'II.NIL.10.BHC.M.2078.198.160857']
name = ['T', 'D', 'FG']
newfilenames = []
for i in range(len(filenames)):
    newfilenames.append(re.sub(r'BH.?\.M', 'BH' + name[i], filenames[i]))
print(' '.join(newfilenames))  # outputs II.NIL.10.BHT.2008.190.168857 II.NIL.10.BHD.2078.198.160857 II.NIL.10.BHFG.2078.198.160857
You can use iter with next in the replacement lambda of re.sub:
import re
name = iter(['T','D','FG'])
s = """
II.NIL.10.BHZ.M.2058.190.160877
II.NIL.10.BHA.M.2008.190.168857
II.NIL.10.BHB.M.2078.198.160857
"""
# Each match consumes the next replacement from the iterator
result = re.sub(r'(?<=BH)\w\.\w', lambda m: next(name), s)
print(result.strip())
Output:
II.NIL.10.BHT.2058.190.160877
II.NIL.10.BHD.2008.190.168857
II.NIL.10.BHFG.2078.198.160857

Regular expression to retrieve string parts within parentheses separated by commas

I have a string from which I want to take the values within the parentheses, and then get the individual values that are separated by commas.
Example: x(142,1,23ERWA31)
I would like to get:
142
1
23ERWA31
Is it possible to get everything with one regex?
I have found a method to do so, but it is ugly.
This is how I did it in python:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search("\((.*?)\)", string)
secondResult = re.search("(?<=\()(.*?)(?=\))", firstResult.group(0))
finalResult = [x.strip() for x in secondResult.group(0).split(',')]
for i in finalResult:
    print(i)
142
1
23ERWA31
This works for your example string:
import re
string = "x(142,1,23ERWA31)"
l = re.findall (r'([^(,)]+)(?!.*\()', string)
print (l)
Result: a plain list
['142', '1', '23ERWA31']
The expression matches a sequence of characters that are not (, , or ) and, to prevent the first x from being picked up, must not be followed by a ( anywhere further in the string. This also makes it work if your preamble x consists of more than a single character.
findall rather than search makes sure all items are found, and as a bonus it returns a plain list of the results.
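To illustrate the claim about longer preambles, here is a small sketch that runs the same pattern on the original string and on a variant with a multi-character prefix. The second input, func(...), is a made-up example and not from the question.

```python
import re

# Same pattern as above: runs of characters other than '(', ',' and ')',
# rejected if a '(' still appears later in the string.
pattern = re.compile(r'([^(,)]+)(?!.*\()')

results = {}
for text in ['x(142,1,23ERWA31)', 'func(142,1,23ERWA31)']:
    results[text] = pattern.findall(text)
    print(text, '->', results[text])
```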
You can make this a lot simpler. Your first regex already captures what you need, but you are not using the captured group. You want .group(1) (the part inside the parentheses), not .group(0) (the whole match). Once you have that, you can just split it on ,:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
for e in firstResult.group(1).split(','):
    print(e)
It looks a little wonky, and it assumes there are always exactly three values in the parentheses, but try this regex:
\((.*?),(.*?),(.*?)\)
To extract all the group matches into a single object, your code would then look like this:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?),(.*?),(.*?)\)", string).groups()
You can then index the firstResult tuple like a list:
>>> print(firstResult[2])
23ERWA31

Python Regex: Remove the parts of the string that does not match regex pattern

I want to remove the parts of the string that do not match the format that I want. Example:
import re
string = 'remove2017abcdremove'
pattern = re.compile("((20[0-9]{2})([a-zA-Z]{4}))")
result = pattern.search(string)
if result:
    print('1')
else:
    print('0')
It returns "1", so I can find the matching format inside the string. However, I also want to remove the parts that say "remove".
I want it to return:
desired_output = '2017abcd'
You need to get the matched text from the search result, which is done by calling group() on it:
import re
string = 'remove2017abcdremove'
pattern = re.compile("(20[0-9]{2}[a-zA-Z]{4})")
string = pattern.search(string).group()
# 2017abcd
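Note that search() returns None when the pattern is absent, so calling .group() directly would raise AttributeError in that case. A minimal sketch with a guard follows; the helper name extract and the second test string are made up for illustration.

```python
import re

pattern = re.compile(r'20[0-9]{2}[a-zA-Z]{4}')

def extract(text):
    """Return the part of text that matches the pattern, or None if absent."""
    match = pattern.search(text)
    return match.group() if match else None

print(extract('remove2017abcdremove'))
print(extract('nothing to see here'))
```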

Delete substring not matching regex in Python

I have a string like:
'class="a", class="b", class="ab", class="body", class="etc"'
I want to delete everything except class="a" and class="b".
How can I do it? I think the problem is easy but I'm stuck.
Here is one of my attempts, but it didn't solve my problem:
re.sub(r'class="also"|class="etc"', '', a)
My string is a very long HTML code with a lot of classes and I want to only keep two of them and drop all the others.
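Sometimes it's good to take a break. I found a solution that works for me using bleach:
import bleach

def filter_class(name, value):
    # Keep only class="aaa". Note: newer bleach releases (2.0+) call
    # attribute filters with three arguments: (tag, name, value).
    return name == 'class' and value == 'aaa'

attrs = {
    'div': filter_class,
}
bleach.clean(html, tags=['div'], attributes=attrs, strip_comments=True)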
You tried to explicitly enumerate those substrings you wanted to delete. Rather than writing such long patterns, you can just use negative lookaheads that provide a means to add exclusions to some more generic pattern.
Here is a regex you can use to remove those substrings in a clean way and disregarding order:
,? ?\bclass="(?![ab]")[^"]+"
Here, with (?![ab]")[^"]+, we match 1 or more characters other than " ([^"]+), but not those equal to a or b ((?![ab]")).
Here is a sample code:
import re
p = re.compile(r',? ?\bclass="(?![ab]")[^"]+"')
test_str = "class=\"a\", class=\"b\", class=\"ab\", class=\"body\", class=\"etc\"\nclass=\"b\", class=\"ab\", class=\"body\", class=\"etc\", class=\"a\"\nclass=\"b\", class=\"ab\", class=\"body\", class=\"a\", class=\"etc\""
result = re.sub(p, '', test_str)
print(result)
NOTE: If you have longer sequences instead of a and b, use a non-capturing group such as (?:a|b) in the lookahead instead of a character class:
,? ?\bclass="(?!(?:arbuz|baklazhan)")[^"]+"
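When the list of class names to keep is not fixed, the alternation inside the lookahead could be built at runtime. The sketch below assumes a hypothetical helper, remove_other_classes, and reuses the arbuz/baklazhan names from the pattern above; re.escape guards against names that contain regex metacharacters.

```python
import re

def remove_other_classes(text, keep):
    """Remove every class="..." entry whose value is not listed in keep."""
    # Escape each name so characters such as '.' or '+' are treated literally.
    alternatives = '|'.join(re.escape(name) for name in keep)
    pattern = re.compile(r',? ?\bclass="(?!(?:%s)")[^"]+"' % alternatives)
    return pattern.sub('', text)

text = 'class="arbuz", class="baklazhan", class="ab", class="body", class="etc"'
cleaned = remove_other_classes(text, ['arbuz', 'baklazhan'])
print(cleaned)
```

As with the original pattern, a leftover leading comma is possible when an unwanted entry comes first in the string, so the result may need a final cleanup in that case.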
Another pretty simple solution:
import re
st = 'class="a", class="b", class="ab", class="body", class="etc"'
res = re.findall(r'class="[a-b]"', st)
print(res)
['class="a"', 'class="b"']
You can also use re.sub, provided the unwanted classes all come after the ones you want to keep:
res = re.sub(r',\s*class="[a-zA-Z][a-zA-Z].*"', "", st)
print(res)
class="a", class="b"
If you only wanted to keep the first two entries, one approach would be to use the split() function. This will split your string into a list at given separator points. In your case, this could be a comma. The first two list elements can then be joined back together with commas.
text = 'class="a", class="b", class="ab", class="body", class="etc"'
print(",".join(text.split(",")[:2]))
This gives class="a", class="b"
If the entries can be anywhere, and for an arbitrary list of wanted classes:
import re

def keep(text, keep_list):
    # Find every class value; dict.fromkeys drops duplicates but keeps order
    found = re.findall(r"class\s*=\s*[\"'](.*?)[\"']", text)
    output_list = ['class="%s"' % c for c in dict.fromkeys(found) if c in keep_list]
    return ', '.join(output_list)

print(keep('class="a", class="b", class="ab", class="body", class="etc"', ["a", "b"]))
print(keep('class="a", class="b", class="ab", class="body", class="etc"', ["body", "header"]))
This would print:
class="a", class="b"
class="body"

Number of regex matches

I'm using the finditer function in the re module to match some things and everything is working.
Now I need to find out how many matches I've got. Is it possible without looping through the iterator twice? (one to find out the count and then the real iteration)
Some code:
imageMatches = re.finditer("<img src\=\"(?P<path>[-/\w\.]+)\"", response[2])
# <Here I need to get the number of matches>
for imageMatch in imageMatches:
    doStuff
Everything works, I just need to get the number of matches before the loop.
If you know you will want all the matches, you could use the re.findall function. It will return a list of all the matches. Then you can just do len(result) for the number of matches.
If you always need to know the length, and you just need the content of the match rather than the other info, you might as well use re.findall. Otherwise, if you only need the length sometimes, you can use e.g.
matches = re.finditer(...)
...
matches = tuple(matches)
to store the iteration of the matches in a reusable tuple. Then just do len(matches).
Another option, if you just need to know the total count after doing whatever with the match objects, is to use
matches = enumerate(re.finditer(...))
which will return an (index, match) pair for each of the original matches. If you keep the index from each pair in a variable, then after the loop the last index plus one is the total count.
But if you need the length first of all, and you need match objects as opposed to just the strings, you should just do
matches = tuple(re.finditer(...))
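The enumerate approach can be sketched as follows, assuming a sample string and pattern chosen only for illustration. Passing start=1 makes the index equal to the number of matches seen so far, and initialising count to 0 covers the case of no matches at all.

```python
import re

text = 'i like python jython and dython'

count = 0  # stays 0 if the loop body never runs
for count, match in enumerate(re.finditer(r'.ython', text), start=1):
    # The match object is still available here for the real work.
    print(count, match.group())

print('total:', count)
```

This gives the total only after the loop; if the count is needed beforehand, materialising the matches with tuple() is still the way to go.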
# An example for counting matched groups
import re
pattern = re.compile(r'(\w+).(\d+).(\w+).(\w+)', re.IGNORECASE)
search_str = "My 11 Char String"
res = re.match(pattern, search_str)
print(len(res.groups()))  # len = 4
print(res.group(1))  # My
print(res.group(2))  # 11
print(res.group(3))  # Char
print(res.group(4))  # String
If you find you need to stick with finditer(), you can simply use a counter while you iterate through the iterator.
Example:
>>> from re import *
>>> pattern = compile(r'.ython')
>>> string = 'i like python jython and dython (whatever that is)'
>>> iterator = finditer(pattern, string)
>>> count = 0
>>> for match in iterator:
...     count += 1
...
>>> count
3
If you need the features of finditer() (a match object for each non-overlapping match), use this method.
I know this is a little old, but here is a concise function for counting regex patterns.
import re

def regex_cnt(string, pattern):
    return len(re.findall(pattern, string))

string = 'abc123'
regex_cnt(string, '[0-9]')  # returns 3
For those moments when you really want to avoid building lists:
import re
import operator
from functools import reduce
# The initial value 0 avoids a TypeError when there are no matches
count = reduce(operator.add, (1 for _ in re.finditer(my_pattern, my_string)), 0)
Sometimes you might need to operate on huge strings. This might help.
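A shorter variant of the same idea, sketched here with placeholder values for my_pattern and my_string, is to pass the generator to the built-in sum. It also avoids building a list, and it returns 0 when nothing matches.

```python
import re

my_pattern = r'[0-9]'   # placeholder pattern, for illustration only
my_string = 'abc123'    # placeholder input, for illustration only

# One 1 per match; sum consumes the generator without storing the matches.
count = sum(1 for _ in re.finditer(my_pattern, my_string))
print(count)
```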
If you are using the finditer method, the best way to count the matches is to initialize a counter and increment it with each match.
