Delete substring not matching regex in Python - python

I have a string like:
'class="a", class="b", class="ab", class="body", class="etc"'
I want to delete everything except class="a" and class="b".
How can I do it? I think the problem is easy but I'm stuck.
Here is one of my attempts, but it didn't solve my problem:
re.sub(r'class="also"|class="etc"', '', a)
My string is a very long HTML code with a lot of classes and I want to only keep two of them and drop all the others.
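Sometimes it's good to take a break. I found a solution that works for me with bleach:
import bleach

def filter_class(name, value):
    # Older bleach versions call this with (name, value); bleach 2.0+ passes (tag, name, value)
    if name == 'class' and value == 'aaa':
        return True

attrs = {
    'div': filter_class,
}
# ('div') is just the string 'div', not a tuple, so pass a list instead
bleach.clean(html, tags=['div'], attributes=attrs, strip_comments=True)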

You tried to explicitly enumerate those substrings you wanted to delete. Rather than writing such long patterns, you can just use negative lookaheads that provide a means to add exclusions to some more generic pattern.
Here is a regex you can use to remove those substrings in a clean way and disregarding order:
,? ?\bclass="(?![ab]")[^"]+"
Here, with (?![ab]")[^"]+, we match one or more characters other than " ([^"]+), but the negative lookahead (?![ab]") rejects the match when the whole value is exactly a or b.
Here is a sample code:
import re
p = re.compile(r',? ?\bclass="(?![ab]")[^"]+"')
test_str = "class=\"a\", class=\"b\", class=\"ab\", class=\"body\", class=\"etc\"\nclass=\"b\", class=\"ab\", class=\"body\", class=\"etc\", class=\"a\"\nclass=\"b\", class=\"ab\", class=\"body\", class=\"a\", class=\"etc\""
result = re.sub(p, '', test_str)
print(result)
NOTE: If instead of a and b you have longer sequences, use a non-capturing group, (?:a|b), in the lookahead instead of a character class:
,? ?\bclass="(?!(?:arbuz|baklazhan)")[^"]+"
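A minimal sketch of building that exclusion pattern from a list of class names to keep. The helper name strip_other_classes and the sample input are illustrative assumptions, not part of the answer above; re.escape guards against names that contain regex metacharacters.

```python
import re

def strip_other_classes(text, keep):
    # Build the alternation used inside the negative lookahead, e.g. a|b
    allowed = "|".join(re.escape(name) for name in keep)
    # Same pattern as above, with the allowed names filled in
    pattern = re.compile(r',? ?\bclass="(?!(?:%s)")[^"]+"' % allowed)
    return pattern.sub("", text)

sample = 'class="a", class="b", class="ab", class="body", class="etc"'
print(strip_other_classes(sample, ["a", "b"]))
```

If the first kept entry is not at the start of the string, a leading separator can remain and may need stripping afterwards.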

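Another pretty simple solution. Good luck.
import re
st = 'class="a", class="b", class="ab", class="body", class="etc"'
res = re.findall(r'class="[a-b]"', st)
print(res)
['class="a"', 'class="b"']
You can also use re.sub, as long as the unwanted entries all come after the ones you keep (the greedy .* removes everything up to the last quote):
# rstrip removes the ", " separator left behind at the end
res = re.sub(r'class="[a-zA-Z][a-zA-Z].*"', "", st).rstrip(', ')
print(res)
class="a", class="b"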

If you only wanted to keep the first two entries, one approach would be to use the split() function. This will split your string into a list at given separator points. In your case, this could be a comma. The first two list elements can then be joined back together with commas.
text = 'class="a", class="b", class="ab", class="body", class="etc"'
print(",".join(text.split(",")[:2]))
This gives class="a", class="b"
If the entries can be anywhere, and for an arbitrary list of wanted classes:
import re

def keep(text, keep_list):
    # Collect class values in order of appearance, then keep only the wanted ones
    wanted = set(keep_list)
    found = re.findall(r"class\s*=\s*[\"'](.*?)[\"']", text)
    output_list = ['class="%s"' % a_class for a_class in found if a_class in wanted]
    return ', '.join(output_list)

print(keep('class="a", class="b", class="ab", class="body", class="etc"', ["a", "b"]))
print(keep('class="a", class="b", class="ab", class="body", class="etc"', ["body", "header"]))
This would print:
class="a", class="b"
class="body"

Related

Regular expression to retrieve string parts within parentheses separated by commas

I have a string from which I want to take the values within the parentheses, and then get the individual values that are separated by commas.
Example: x(142,1,23ERWA31)
I would like to get:
142
1
23ERWA31
Is it possible to get everything with one regex?
I have found a method to do so, but it is ugly.
This is how I did it in python:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
secondResult = re.search(r"(?<=\()(.*?)(?=\))", firstResult.group(0))
finalResult = [x.strip() for x in secondResult.group(0).split(',')]
for i in finalResult:
    print(i)
142
1
23ERWA31
This works for your example string:
import re
string = "x(142,1,23ERWA31)"
l = re.findall (r'([^(,)]+)(?!.*\()', string)
print (l)
Result: a plain list
['142', '1', '23ERWA31']
The expression matches a sequence of characters that are not (, a comma, or ) and, to prevent the first x from being picked up, may not be followed by a ( anywhere further in the string. This also makes it work if your preamble x consists of more than a single character.
findall rather than search makes sure all items are found, and as a bonus it returns a plain list of the results.
You can make this a lot simpler. You are running your first Regex but then not taking the result. You want .group(1) (inside the brackets), not .group(0) (the whole match). Once you have that you can just split it on ,:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
for e in firstResult.group(1).split(','):
    print(e)
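The same idea can be wrapped in a small function that also copes with input that has no parentheses. This is only a sketch; the name values_in_parens is a hypothetical helper, not something from the answers above.

```python
import re

def values_in_parens(text):
    # Capture whatever sits between the first "(" and the next ")"
    match = re.search(r"\((.*?)\)", text)
    if match is None:
        return []
    # group(1) is the captured content without the parentheses
    return [part.strip() for part in match.group(1).split(",")]

print(values_in_parens("x(142,1,23ERWA31)"))
```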
This looks a little wonky, and it assumes there will always be exactly 3 values inside the parentheses, but try this regex:
\((.*?),(.*?),(.*?)\)
To extract all the group matches into a single object, your code would then look like this:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?),(.*?),(.*?)\)", string).groups()
You can then index the firstResult tuple like a list:
>>> print(firstResult[2])
23ERWA31

Regex in python: combining 2 regex expressions into one

Suppose I have the following list:
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','persons']
I want to remove all elements that contain numbers and all elements that end with a dot.
So I want to delete '35', '7,000', '10,000', 'mr.' and 'rev.'.
I can do it separately using the following regex:
regex = re.compile('[a-zA-Z\.]')
regex2 = re.compile('[0-9]')
But when I try to combine them I delete either all elements or nothing.
How can I combine two regex correctly?
This should work:
reg = re.compile('[a-zA-Z]+\.|[0-9,]+')
Note that your first regex is wrong because it deletes any string with a dot inside it.
To avoid this, I included [a-zA-Z]+\. in the combined regex.
Your second regex is also wrong as it misses a "+" and a ",", which I included in the above solution.
Also, if you assume that elements which end with a dot might contain some numbers the complete solution should be:
reg = re.compile('[a-zA-Z0-9]+\.|[0-9,]+')
If you don't need to capture the result, this matches any string with a dot at the end, or any with a number in it.
\.$|\d
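As a rough illustration of that pattern, the list can be filtered with re.search. The shortened word list and the variable names here are assumptions made for the example.

```python
import re

words = ['35', 'years', 'opened', '7,000', 'churches', 'rev.', 'mr.', 'brandt']
# Matches a trailing dot or any digit anywhere in the string
unwanted = re.compile(r'\.$|\d')
kept = [w for w in words if not unwanted.search(w)]
print(kept)
```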
You could use:
(?:[^\d\n]*\d)|.*\.$
Here is a way to do the job:
import re
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','per.sons']
b = []
for s in a:
    # Skip elements made only of digits and commas, and elements ending with a dot
    if not re.search(r'^(?:[\d,]+|.*\.)$', s):
        b.append(s)
print(b)
Output:
['years', 'opened', 'churches', 'brandt', 'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church', 'every', 'per.sons']

How do I extract some string from a long string in Python?

I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
SomeDomainName doesn't contain any numbers, and anyNumber is different in each long string. The code should extract the desired string from any string possible and should take into account spaces and any other weird thing that might appear in the long string. That should be possible with regex, right? Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same, and so is someDomainName. But there's another http://www. in the string.
import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
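A quick check of that pattern against a made-up input; long_string below is an invented sample, including the second http://www. that the question's update mentions.

```python
import re

# Invented sample text with an unrelated URL before the wanted one
long_string = ("junk http://www.other.com/x more text "
               "http://www.someDomainName.com/2134, trailing")
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
print(results)
```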
>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(0))
...     print(matches.group(1))
...     print(matches.group(2))
...     print(matches.group(3))
...     print(matches.group(4))
...
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
In the above pattern there are five groups to read:
Group 0 is the complete string that is matched.
The rest are the four capturing groups, numbered in the order of the opening brackets you see. (So, you are looking for the second one: (\\w*).)
If you want, you can capture only the part of the string you are interested in. Remove the brackets from the rest of the pattern and just keep (\\w*):
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(1))
...
someDomainName
In the above example, you won't have groups 2, 3 and 4 as in the previous example, because we have captured only one group. And yes, group 0 is always available: it is the complete string that matches.
Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
# Dots are escaped so they match literally
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1, str2 = matches.groups()
If you are sure that there are no dots in SomeDomainName, you can just take the first occurrence of the string ".com/" and take everything after it.
This avoids regular expressions, which are harder to maintain:
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/')+5:])
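One caveat with this approach is that str.find returns -1 when the marker is missing, which would silently produce a wrong slice. A hedged sketch with a guard follows; the function name after_dot_com is hypothetical.

```python
def after_dot_com(url):
    marker = '.com/'
    index = url.find(marker)  # str.find returns -1 when the marker is absent
    if index == -1:
        return None
    # Skip past the marker itself and return the remainder
    return url[index + len(marker):]

print(after_dot_com('http://www.aejlidjaelidjl.com/alieilael'))
```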

Python : How to ignore a delimited part of a sentence?

I have the following line:
CommonSettingsMandatory = #<Import Project="[\\.]*Shared(\\vc10\\|\\)CommonSettings\.targets," />#,true
and I want the following output:
['commonsettingsmandatory', '<Import Project="[\\\\.]*Shared(\\\\vc10\\\\|\\\\)CommonSettings\\.targets," />', 'true']
If I do a simple regex split on the comma, it will also split inside the value whenever the value contains a comma; for example, the comma I wrote after targets causes a split there.
So I want to ignore the text between the # signs to make sure no splitting happens there.
I really don't know how to do it!
http://docs.python.org/library/re.html#re.split
import re
string = 'CommonSettingsMandatory = #toto,tata#, true'
splitlist = re.split(r'\s?=\s?#(.*?)#,\s?', string)
Then splitlist contains ['CommonSettingsMandatory', 'toto,tata', 'true'].
While you might be able to use split with a lookbehind, I would use the groups captured by this expression.
(\S+)\s*=\s*#([^#]+)#,\s*(.*)
Call m = re.search(expression, myString), then use m.group(1) for the first string, m.group(2) for the second, etc.
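A small runnable sketch of the capture-group approach, assuming the value is delimited by single # characters as in the question. The sample line is shortened, and lowercasing the key is an assumption based on the output the question asks for.

```python
import re

line = 'CommonSettingsMandatory = #toto,tata#, true'
# key, then the text between the # signs, then whatever follows the comma
m = re.search(r'(\S+)\s*=\s*#([^#]+)#,\s*(.*)', line)
if m:
    parts = [m.group(1).lower(), m.group(2), m.group(3)]
    print(parts)
```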
If I understand you correctly, you're trying to split the string using spaces as delimiters, but you want to also remove any text between pound signs?
If that's correct, why not simply remove the pound sign-delimited text before splitting the string?
import re
myString = re.sub(r'#.*?#', '', myString)
myArray = myString.split(' ')
EDIT: (based on revised question)
import re
myArray = re.findall(r'^(.*?) = #(.*?)#,(.*?)$', myString)
That will actually return an array of tuples including your matches, in the form of:
[
(
'commonsettingsmandatory',
'<Import Project="[\\\\.]*Shared(\\\\vc10\\\\|\\\\)CommonSettings\\.targets," />',
'true'
)
]
(spacing added to illustrate the format better)

Number of regex matches

I'm using the finditer function in the re module to match some things and everything is working.
Now I need to find out how many matches I've got. Is it possible without looping through the iterator twice? (one to find out the count and then the real iteration)
Some code:
imageMatches = re.finditer("<img src\=\"(?P<path>[-/\w\.]+)\"", response[2])
# <Here I need to get the number of matches>
for imageMatch in imageMatches:
    doStuff
Everything works, I just need to get the number of matches before the loop.
If you know you will want all the matches, you could use the re.findall function. It will return a list of all the matches. Then you can just do len(result) for the number of matches.
If you always need to know the length, and you just need the content of the match rather than the other info, you might as well use re.findall. Otherwise, if you only need the length sometimes, you can use e.g.
matches = re.finditer(...)
...
matches = tuple(matches)
to store the iteration of the matches in a reusable tuple. Then just do len(matches).
Another option, if you just need to know the total count after doing whatever with the match objects, is to use
matches = enumerate(re.finditer(...))
which will return an (index, match) pair for each of the original matches. The index is zero-based, so if you keep the index of the last pair in some variable, the total count is that index plus one.
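A minimal sketch of the enumerate approach; the sample text and pattern are made up for illustration. Passing start=1 is a small variation that lets the index serve directly as the running count.

```python
import re

text = 'i like python jython and dython'
count = 0
# start=1 makes the index double as the number of matches seen so far
for index, match in enumerate(re.finditer(r'.ython', text), start=1):
    count = index  # process match here as needed
print(count)
```

Starting count at 0 keeps the result correct when there are no matches at all.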
But if you need the length first of all, and you need match objects as opposed to just the strings, you should just do
matches = tuple(re.finditer(...))
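For instance, with a pattern similar to the one in the question, storing the matches in a tuple gives the count up front while keeping the match objects. The HTML snippet here is an invented sample.

```python
import re

text = '<img src="/a.png"> <img src="/img/b.jpg">'
# Materialize the iterator once so it can be measured and then looped over
matches = tuple(re.finditer(r'<img src="(?P<path>[-/\w.]+)"', text))
print(len(matches))
for m in matches:
    print(m.group('path'))
```

This holds every match object in memory at once, which is usually fine unless the input is very large.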
#An example for counting matched groups
import re
pattern = re.compile(r'(\w+).(\d+).(\w+).(\w+)', re.IGNORECASE)
search_str = "My 11 Char String"
res = re.match(pattern, search_str)
print(len(res.groups()))  # 4
print(res.group(1))  # My
print(res.group(2))  # 11
print(res.group(3))  # Char
print(res.group(4))  # String
If you find you need to stick with finditer(), you can simply use a counter while you iterate through the iterator.
Example:
>>> from re import *
>>> pattern = compile(r'.ython')
>>> string = 'i like python jython and dython (whatever that is)'
>>> iterator = finditer(pattern, string)
>>> count = 0
>>> for match in iterator:
...     count += 1
...
>>> count
3
If you need the features of finditer() (it does not match overlapping instances), use this method.
I know this is a little old, but here is a concise function for counting regex matches.
import re

def regex_cnt(string, pattern):
    return len(re.findall(pattern, string))

string = 'abc123'
regex_cnt(string, '[0-9]')  # 3
For those moments when you really want to avoid building lists:
import re
import operator
from functools import reduce
# The initial value 0 keeps reduce from raising TypeError when there are no matches
count = reduce(operator.add, (1 for _ in re.finditer(my_pattern, my_string)), 0)
Sometimes you might need to operate on huge strings. This might help.
If you are using the finditer method, the best way to count the matches is to initialize a counter and increment it with each match.
