Regex in Python: removing lines that do NOT match

I'll get straight to the point: I have a string like this (but with thousands of lines)
Ach-emos_2
Ach. emos_54
Achėmos_18
Ąžuolas_4
Somtehing else_2
and I need to remove the lines that do not match a-z and ąčęėįšųūž, plus _, plus an integer (the 3rd and 4th lines match this). The match should be case insensitive. I think the regex should be
[a-ząčęėįšųūž]+_\d+ #don't know where to put case insensitive modifier
But what should a regex look like that matches lines that are NOT letters (including the Lithuanian ones) plus underscore plus integer? I tried
re.sub(r'[^a-ząčęėįšųūž]+_\d+\n', '', words)
but it didn't work.
Thanks in advance, and sorry if my English is not quite good.

As to making the matching case insensitive, you can use the I (or IGNORECASE) flag from the re module, for example when compiling your regex:
regex = re.compile(r"^[a-ząčęėįšųūž]+_\d+$", re.I)
As to removing the lines not matching this regex, you can simply construct a new string consisting of the lines that do match:
new_s = "\n".join(line for line in s.split("\n") if re.match(regex, line))
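For what it's worth, a minimal end-to-end sketch of that approach, assuming s holds the sample lines from the question (expected output in the comments):
import re

# Minimal sketch: s stands in for the question's multi-line string.
s = "Ach-emos_2\nAch. emos_54\nAchėmos_18\nĄžuolas_4\nSomtehing else_2"
regex = re.compile(r"^[a-ząčęėįšųūž]+_\d+$", re.I)
new_s = "\n".join(line for line in s.split("\n") if re.match(regex, line))
print(new_s)
# Achėmos_18
# Ąžuolas_4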

First of all, given your example inputs, every line ends with underscore + integers, so all you really need to do is invert the original match. If the example wasn't really representative, then inverting the match could land you results like this:
abcdefg_nodigitshere
But you can subfilter that this way:
import re

mydigre = re.compile(r'_\d+$')
myreg = re.compile(r'^[a-ząčęėįšųūž]+_\d+$', re.I)

for line in inputs.splitlines():  # inputs is your multi-line string
    if re.match(myreg, line):
        pass  # do x: letters + underscore + digits
    elif re.search(mydigre, line):
        pass  # do y: ends with _\d+ but has other characters before it
    else:
        pass  # line doesn't end with _\d+
Another option would be to use Python sets. This approach only makes sense if all your lines are unique (or if you don't mind eliminating duplicate lines) and you don't care about order. It probably has a high memory cost, too, but is likely to be fast.
all_lines = set(inputs.splitlines())
alpha_lines = {line for line in all_lines if re.match(myreg, line)}
nonalpha_lines = all_lines - alpha_lines
nonalpha_digi_lines = {line for line in nonalpha_lines if re.search(mydigre, line)}

Not sure how Python does modifiers, but to edit in place, use something like this (case insensitive):
Edit: note that some of these characters are UTF-8. To use the literal representation, your editor and language must support it; otherwise use the \u.. codes in the character class (recommended).
s/(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)//mg;
where the regex is: r'(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)'
the replacement is ''
and the modifiers are multiline and global.
Breakdown:
(?i)                              // case-insensitive flag
^                                 // start of line
(?![a-ząčęėįšųūž]+_\d+(?:\n|$))   // negative lookahead: the line is not of the wanted form
.*                                // then match the rest of the line (everything except the newline)
(?:\n|$)                          // and consume the trailing newline or end of string
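For reference, a hedged Python sketch of the same substitution: re.sub is global by default, and re.M supplies the /m (multiline) behaviour so ^ matches at the start of every line.
import re

# Python sketch of the substitution above; words stands in for the question's data.
pattern = r'(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)'
words = "Ach-emos_2\nAch. emos_54\nAchėmos_18\nĄžuolas_4\nSomtehing else_2"
print(re.sub(pattern, '', words, flags=re.M))
# Achėmos_18
# Ąžuolas_4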

Related

How to write a regular expression to match a small part of a repeating pattern?

I have the following pattern to match:
(10,'more random stuff 21325','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page')
For some context, it's part of a larger file, which contains many similar patterns separated by commas:
(10,'more random stuff 21325','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page'),
(11,'more random stuff 1nyny5','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','subcat'),
(14,'more random stuff 21dd5','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page')
My goal is to ditch all patterns ending with 'page' and to keep the rest. For that, I'm trying to use regular expressions to identify those patterns. Here is the one I have come up with so far:
"\(.*?,\'page\'\)"
However, it's not working as expected.
In the following Python code, I use this regex and replace every match with an empty string:
import re
txt = "(10,'Redirects_from_moves','*..2NN:,#2.FBHRP:D6ܽ�','2014-10-26 04:50:23','','uca-default-u-kn','page'),"
txt += "(11,'Redirects_with_old_history','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','page'),"
txt += "(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),"
txt += "(13,'Anarchism','random_stuff','2020-01-23 13:27:44',' ','uca-default-u-kn','page'),"
txt += "(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),"
txt += "(15,'Anti-fascism','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),"
txt += "(16,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page'),"
txt += "(17,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page')"
new_txt = re.sub("\(.*?,\'page\'\)", "",txt)
I was expecting that new_txt would contain all patterns ending with 'subcat' and drop all patterns ending with 'page'; however, I obtain:
new_txt = ,,,,
What's happening here? How can I change my regex to obtain the desired result?
We might be tempted to do a regex replacement here, but that would basically always leave open edge cases, as @Wiktor has correctly pointed out in a comment. Instead, a more foolproof approach is to use re.findall and simply extract every tuple which does not end in 'page'. Here is an example:
parts = re.findall(r"\(\d+,'[^']*?'(?:,'[^']*?'){4},'(?!page')[^']*?'\),?", txt)
print(''.join(parts))
This prints:
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),(15,'Anti-fascism','DL.8:NB�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
The regex pattern used above matches a leading number, followed by five singly quoted terms, and then a sixth singly quoted term which is not 'page'. Then we join the tuples in the output list to form a string.
What happens is that you concatenate the strings into one long line; each match of \(.*?,\'page\'\) then runs from a ( to the next occurrence of ,'page'), swallowing any 'subcat' tuples in between and leaving only the trailing commas.
Another workaround might be to put the strings in a list and join them with a newline instead of concatenating them.
Then use your pattern with an optional comma and newline at the end to remove each matching line, leaving the ones that end with subcat:
import re
lines = [
"(10,'Redirects_from_moves','*..2NN:,#2.FBHRP:D6ܽ�','2014-10-26 04:50:23','','uca-default-u-kn','page'),",
"(11,'Redirects_with_old_history','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','page'),",
"(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),",
"(13,'Anarchism','random_stuff','2020-01-23 13:27:44',' ','uca-default-u-kn','page'),",
"(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),",
"(15,'Anti-fascism','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),",
"(16,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page'),",
"(17,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page')"
]
new_txt = re.sub(r"\(.*,'page'\)(?:,\n)?", "", '\n'.join(lines))
print(new_txt)
Output
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),
(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
Or you can use a list comprehension to keep the lines that do not match the pattern.
result = [line for line in lines if not re.match(r"\(.*,'page'\),?$", line)]
print('\n'.join(result))
Output
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),
(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
Another option to match the parts that end with 'page') for the example data:
\(\d+,[^)]*(?:\)(?!,\s*\(\d+,)[^)]*)*,'page'\),?
The pattern matches:
\(\d+, Match ( followed by 1+ digits and a comma
[^)]* Optionally match any char except )
(?: Non capture group
\)(?!,\s*\(\d+,)[^)]* Only match a ) when not directly followed by the pattern ,\s*\(\d+, which matches the start of the parts in the example data
)* Close group and optionally repeat
,'page'\),? Match ,'page') with an optional comma
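A hedged sketch applying that pattern with re.sub; the sample string here is a shortened stand-in for the txt built in the question:
import re

# Shortened stand-in: one tuple ending with 'page') and one ending with 'subcat')
txt = ("(13,'Anarchism','random_stuff','2020-01-23 13:27:44',' ','uca-default-u-kn','page'),"
       "(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),")
pattern = r"\(\d+,[^)]*(?:\)(?!,\s*\(\d+,)[^)]*)*,'page'\),?"
print(re.sub(pattern, "", txt))
# (14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),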

How can I find all paths in a JavaScript file with regex in Python?

Sample JavaScript (content):
t.appendChild(u),t}},{10:10}],16:[function(e,t,r){e(10);t.exports=function(e){var t=document.createDocumentFragment(),r=document.createElement("img");r.setAttribute("alt",e.empty),r.id="trk_recaptcha",r.setAttribute("src","/cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray="+e.ray),t.appendChild(r);var n=document.createTextNode(" ");t.appendChild(n);var a=document.createElement("input");a.id="id",a.setAttribute("name","id"),a.setAttribute("type","hidden"),a.setAttribute("value",e.ray),t.appendChild(a);var i=document.createTextNode(" ");t.appendChild(i);
t.appendChild(u),t}},{10:10}],16:[function(e,t,r){e(10);t.exports=function(e){var t=document.createDocumentFragment(),r=document.createElement("img");r.setAttribute("alt",e.empty),r.id="trk_recaptcha",r.setAttribute("sdfdsfsfds",'/test/path'),t.appendChild(r);var n=document.createTextNode(" ");t.appendChild(n);var a=document.createElement("input");a.id="id",a.setAttribute("name","id"),a.setAttribute("type","hidden"),a.setAttribute("value",e.ray),t.appendChild(a);var i=document.createTextNode(" ");t.appendChild(i);
regex = ""
endpoints = re.findall(regex, content)
Output I want:
> /cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray=
> /test/path
I want to find all fields starting with "/ or '/ using a regex. I've tried many URL regexes, but they didn't work for me.
This should do it:
regex = r"""["']\/[^"']*"""
Note that you will need to trim the first character from the match. This also assumes that there are no quotation marks in the path.
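For example, a small usage sketch under those assumptions; the content string below is a shortened stand-in for the JavaScript source, and the leading quote is trimmed from each match:
import re

# content stands in for the JavaScript source from the question.
content = """r.setAttribute("src","/cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray="+e.ray);r.setAttribute("sdfdsfsfds",'/test/path')"""
regex = r"""["']\/[^"']*"""
endpoints = [m[1:] for m in re.findall(regex, content)]  # drop the leading quote
print(endpoints)
# ['/cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray=', '/test/path']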
Consider:
import re

txt = ...  # your code
pat = r"(\"|\')(\/.*?)\1"
for el in re.findall(pat, txt):
    print(el[1])
Each el will be a match of the pattern: it starts with a single or double quote, then a minimal number of characters, then the same character as at the beginning (the same type of quote).
.* stands for any number of any characters; the following ? makes it non-greedy, i.e. it matches as few characters as possible. Then \1 refers to the first group, so it will match whatever type of quote was matched at the beginning. By taking el[1] we return the second matched group, i.e. whatever was matched within the quotes.

Using regular expressions to find a pattern

If I have a file that consists of sentences like this:
1001 apple
1003 banana
1004 grapes
1005
1007 orange
Now I want to detect and print all such sentences where there is a number but no corresponding text (e.g. 1005). How can I design the regular expression to find such sentences? I find them a bit confusing to construct.
res = []
with open("fruits.txt", "r") as f:
    for fruit in f:
        res.append(fruit.strip().split())
Would it be something like this: re.sub("10**"/.")
Well, you don't need a regular expression for this:
with open("fruits.txt", "r") as f:
    res = [int(line.strip()) for line in f if len(line.split()) == 1]
A regex that would detect a number, then a space, then a word is ([0-9])+[ ]\w+.
A good resource for trying that stuff out is http://regexr.com/
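A hedged usage sketch: since that regex matches the lines which do have text after the number, keeping the lines where it does not match leaves the number-only ones:
import re

# lines stands in for the sample file contents from the question.
lines = ["1001 apple", "1003 banana", "1004 grapes", "1005", "1007 orange"]
pattern = r"([0-9])+[ ]\w+"
for line in lines:
    if not re.match(pattern, line):
        print(line)
# 1005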
The re pattern for this would be "^[0-9][0-9][0-9][0-9]$". It matches only when there are four digits and nothing else on the line, so it will find your 1005.
Hope this helps!
There are two ways to go about this: search() and findall(). The former will find the first instance of a match, and the latter will give a list of every match.
In any case, the regex you want to use is "^\d{4}$". It's a simple regex which matches a 4-digit number that takes up the entirety of a string, or, in multiline mode, a line. So, to find 'only number' sections, you will use the following code:
# assume 'func' is set to be either re.search or re.findall, whichever you prefer
with open("fruits.txt", "r") as f:
    solo = func(r"^\d{4}$", f.read(), re.MULTILINE)
# 'solo' now has either the first 'non-labeled' number,
# or a list of all such numbers in the file, depending on
# the function you used. search() will return None if there
# are no such numbers, and findall() will return an empty list.
# if you prefer brevity, re.MULTILINE is equivalent to re.M
Additional explanation of the regex:
^ matches at the beginning of the line.
\d is a special sequence which matches any numeric digit.
{4} matches the prior element (\d) exactly four times.
$ matches at the end of the line.
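A concrete hedged sketch with findall; the sample string stands in for f.read() on fruits.txt:
import re

sample = "1001 apple\n1003 banana\n1004 grapes\n1005\n1007 orange"
print(re.findall(r"^\d{4}$", sample, re.MULTILINE))
# ['1005']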
Please try:
(?:^|\s+)(\d{4}\b)(?!\s.*\w+)

Match string between special characters

I've messed around with regex a little bit but am pretty unfamiliar with it for the most part. The string will be in the format:
\n\n*text here, can be any spaces, etc. etc.*
The string that I will get will have two line breaks, followed by an asterisk, followed by text, and then ending with another asterisk.
I want to exclude the beginning \n\n from the returned text. This is the pattern that I've come up with so far and it seems to work:
pattern = "(?<=\\n\\n)\*(.*)(\*)"
match = re.search(pattern, string)
if match:
    text = match.group()
    print(text)
else:
    print("Nothing")
I'm wondering if there is a better way to go about matching this pattern or if the way I'm handling it is okay.
Thanks.
You can avoid capturing groups and have the whole match as result using:
pattern = r'(?<=\n\n\*)[^*]*(?=\*)'
Example:
import re

print(re.findall(r'(?<=\n\n\*)[^*]*(?=\*)', '\n\n*text here, can be any spaces, etc. etc.*'))
If you want to include the asterisk in the result you can use instead:
pattern = r'(?<=\n\n)\*[^*]*\*'
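A quick sketch of that variant, assuming the same sample string as above:
import re

s = '\n\n*text here, can be any spaces, etc. etc.*'
m = re.search(r'(?<=\n\n)\*[^*]*\*', s)  # keeps the surrounding asterisks
if m:
    print(m.group())
# *text here, can be any spaces, etc. etc.*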
Regular expressions are overkill in a case like this -- if the delimiters are always static and at the head/tail of the string:
>>> s = "\n\n*text here, can be any spaces, etc. etc.*"
>>> def CheckString(s):
...     if s.startswith("\n\n*") and s.endswith("*"):
...         return s[3:-1]
...     else:
...         return "(nothing)"
...
>>> CheckString(s)
'text here, can be any spaces, etc. etc.'
>>> CheckString("no delimiters")
'(nothing)'
(Adjust the slice indexes as needed -- it wasn't clear to me whether you want to keep the leading/trailing '*' characters. If you want to keep them, change the slice to return s[2:].)

Identifying lines with consecutive upper case letters

I'm looking for logic that searches for capitalized words in a line in Python. For example, I have a *.txt:
aaa
adadad
DDD_AAA
Dasdf Daa
I would like to search only for the lines which have 2 or more capital words after each other (in the above case DDD_AAA).
Regexes are the way to go:
import re

pattern = "([A-Z]+_[A-Z]+)"  # matches CAPITALS_CAPITALS only
match = re.search(pattern, text)  # text holds one line (or the whole file contents)
if match:
    print(match.group(0))
You have to figure out what exactly you are looking for though.
Presuming your definition of a "capital word" is a string of two or more uppercase alphabet (non-numeric) characters, i.e. [A-Z], and assuming that what separates one "capital word" from another is not quite the complementary set ([^A-Z]) but rather the complementary set to the alphanumeric characters, i.e. [^a-zA-Z0-9], you're looking for a regex like
\b[A-Z]{2,}\b.*\b[A-Z]{2,}\b
I say like because the above is not exactly correct: \b counts the underscore _ as a word character. Replace the \bs with [^a-zA-Z0-9]s wrapped in lookaround assertions (to make them zero-width, like \b), and you have the correct regex:
(?<=[^a-zA-Z0-9]|^)[A-Z]{2,}(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]{2,}(?=[^a-zA-Z0-9]|$)
Finally, if you consider a one-character word, a "word", then simply do away with the {2,} quantifiers:
(?<=[^a-zA-Z0-9]|^)[A-Z]+(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]+(?=[^a-zA-Z0-9]|$)
print(re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]", search_text))
should work to match 2 words that both start with a capital letter
for your specific example
lines = []
for line in file:
    if re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]", line):
        lines.append(line)
print(lines)
basically look into regexes!
Here you go:
import re

lines = open("r1.txt").readlines()
for line in lines:
    if re.match(r'[^\w]*[A-Z]+[ _][A-Z]+[^\w]*', line) is not None:
        print(line.strip("\n"))
Output:
DDD_AAA
