Regular expression for pandoc-markdown citations - python

I'm trying to search and replace citations from pandoc-markdown.
They have the following syntax:
[prenote #autorkey, postnote]
Or for more than one Author
[prenote1 #authorekey1, postnote1; prenote2 #authorkey2, postnote2]
The pre-notes, the author-keys and the post-notes should each be in their own capture group.
For a citation with only one author I used this regex:
\[((.*) )?#(.*?)(, (.*))?\]
But I can't figure out how to match a citation with multiple authors.
Ideally it would be possible to match citations with one or more author keys.
The pre-note and the post-note should be optional.
Is this possible?

We need more context with code (full sample code) to be able to answer fully, so I can only answer in the same general way in which you asked the question.
I do not believe you can do it in one operation with one regular expression.
So the overall technique I would use is:
1. First match the entire citation (with one or more authors) using a simple regex with only one group, namely for everything between [ and ].
2. Then, when a match is found, split what is in that match (i.e. everything between the square brackets) by ; to get a list of "prenote #authorkey, postnote" strings.
3. Do the wanted replacements on each element in that resulting list of single-author strings.
4. Stitch together the final citation by joining the resulting list with semicolons again and adding [ and ] around it.
5. Put that final citation in the original in place of the matched string.
You can put steps 2 to 4 in a function f(match_object), and then use re.sub(pattern, f, string) to do the replacement. It will call function f for each match it finds, and replace that match with the return value of f.
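The steps above can be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: the names CITATION, SINGLE, format_single and rewrite_citation are made up for illustration, and the replacement itself (turning #key into @key) is a hypothetical stand-in for whatever replacement is actually wanted.

```python
import re

# Step 1: one group for everything between [ and ]; requiring a "#" skips
# bracketed text that is not a citation.
CITATION = re.compile(r"\[([^\[\]]*#[^\[\]]*)\]")

# A single "prenote #authorkey, postnote" entry; both notes are optional.
SINGLE = re.compile(r"^\s*(?:(.*?)\s+)?#([^,\s]+)(?:,\s*(.*?))?\s*$")


def format_single(prenote, key, postnote):
    """Hypothetical replacement: turn '#key' into '@key' and keep the notes."""
    out = f"@{key}"
    if prenote:
        out = f"{prenote} {out}"
    if postnote:
        out = f"{out}, {postnote}"
    return out


def rewrite_citation(match):
    """Steps 2 to 4: split on ';', rewrite each entry, join again."""
    entries = []
    for chunk in match.group(1).split(";"):
        m = SINGLE.match(chunk)
        # Leave an entry untouched if it does not look like a citation.
        entries.append(format_single(*m.groups()) if m else chunk.strip())
    return "[" + "; ".join(entries) + "]"


text = "See [prenote #autorkey, postnote] and [pre1 #key1, post1; #key2]."
# Step 5: re.sub calls rewrite_citation for every match it finds.
result = CITATION.sub(rewrite_citation, text)
print(result)
```

Because re.sub passes each match object to the function, the per-author logic stays in one small regex (SINGLE) instead of one large pattern that has to cope with any number of authors.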

You might make use of the PyPI regex module to get the 3 capturing groups.
(?:\G(?!^)|\[(?=[^][\r\n]*\]))[^\S\r\n]*(.*?) #(.*?), ([^][,\r\n]*)[\];]
Regex demo | Python demo
Explanation
(?: Non capture group
\G(?!^) Assert the position at the end of the previous match, not at the start
| Or
\[(?=[^][\r\n]*\]) Match [ and assert that there is a closing ]
) Close non capture group
[^\S\r\n]* Match 0+ occurrences of a whitespace char except a newline
(.*?) Capture group 1, match any char except a newline, as few as possible
# Match literally
(.*?) Capture group 2, match any char except a newline, as few as possible
, Match literally
([^][,\r\n]*) Capture group 3, match any char except ] [ , or a newline
[\];] Match either ] or ;
Example code using regex.finditer
import regex

pattern = r"(?:\G(?!^)|\[(?=[^][\r\n]*\]))[^\S\r\n]*(.*?) #(.*?), ([^][,\r\n]*)[\];]"
test_str = ("[prenote #autorkey, postnote]\n"
            "[prenote1 #authorekey1, postnote1; prenote2 #authorkey2, postnote2]\n")

for match in regex.finditer(pattern, test_str):
    # Print capture groups 1 to 3 of each match.
    for group_num in range(1, len(match.groups()) + 1):
        print(match.group(group_num))
Output
prenote
autorkey
postnote
prenote1
authorekey1
postnote1
prenote2
authorkey2
postnote2

Related

regex substitution [duplicate]

I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?
Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
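As a quick check, here is a minimal Python sketch of that pattern run against the three sentences from the question. The name DOUBLED is only illustrative, and the substitution with \1 is one possible use, not something the question asked for.

```python
import re

# \b(\w+) captures a word, \s+\1\b requires the same word again right after it.
DOUBLED = re.compile(r"\b(\w+)\s+\1\b")

samples = [
    "Paris in the the spring.",
    "Not that that is related.",
    "Why are you laughing? Are my my regular expressions THAT bad??",
]

for sentence in samples:
    match = DOUBLED.search(sentence)
    # Show the doubled word that was found and the sentence with it collapsed.
    print(repr(match.group(0)), "->", DOUBLED.sub(r"\1", sentence))
```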
Regex101 example here
I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
    input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : Start of a word boundary
\w+ : Any number of word characters
(\s+\1\b)+ : One or more whitespace characters followed by a word that matches the previous word and ends at a word boundary. Wrapping the whole thing in + finds one or more repetitions.
Grouping :
m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe
m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye
Replace method shall replace all consecutive matched words with the first instance of the word.
Try this with the RE below:
\b start of word (word boundary)
\w+ one or more word characters
\W+ one or more non-word characters between the words
\1 the same word matched already
\b end of word
()* repeat the group zero or more times
public static void main(String[] args) {
    String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";
    Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    Scanner in = new Scanner(System.in);
    int numSentences = Integer.parseInt(in.nextLine());
    while (numSentences-- > 0) {
        String input = in.nextLine();
        Matcher m = p.matcher(input);
        // Check for subsequences of input that match the compiled pattern
        while (m.find()) {
            input = input.replaceAll(m.group(0), m.group(1));
        }
        // Prints the modified sentence.
        System.out.println(input);
    }
    in.close();
}
Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
The widely-used PCRE library can handle such situations (you won't achieve the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1
Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+
No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.
This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any string of characters that isn't whitespace, followed by optional whitespace.
\1{2,} then looks for at least 2 further repetitions of that phrase. So if there are 3 or more identical phrases in a row, it matches.
Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
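A small Python sketch of this pattern and replacement, under the assumption that the input has no punctuation attached to the repeated words. Python writes the replacement as \1 rather than $1; the name REPEATED and the sample string are only illustrative.

```python
import re

# (\b\S+) captures one non-whitespace chunk; (?:\s+\1\b)+ absorbs every
# immediate repetition of it without capturing.
REPEATED = re.compile(r"(\b\S+)(?:\s+\1\b)+")

s = "here here here here is the result result result"
collapsed = REPEATED.sub(r"\1", s)
print(collapsed)
```

Strings without any repetition are returned unchanged, because the + on the non-capturing group requires at least one copy.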
The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
s.replace(/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
print( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
--> here is ahi-ahi joe's the result
Try this regular expression it fits for all repeated words cases:
\b(\w+)\s+\1(?:\s+\1)*\b
I think another solution would be to use named capture groups and backreferences like this:
.* (?<mytoken>\w+)\s+\k<mytoken> .*
OR
.*(?<mytoken>\w{3,}).+\k<mytoken>.*
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const input = "This is a test test data";
const result = [...input.matchAll(regex)];
console.log(result[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.
You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample util function written in java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
var matcher = pattern.matcher(input);
while (matcher.find()) {
input = input.replaceAll(matcher.group(), matcher.group(1));
}
return input;
}
As far as I can see, none of these would match:
London in the
the winter (with the winter on a new line )
Although matching duplicates on the same line is fairly straightforward,
I haven't been able to come up with a solution for the situation in which they
stretch over two lines. ( with Perl )
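For what it is worth, \s also matches a newline, so the \b(\w+)\s+\1\b pattern from the earlier answers does find this case as long as the whole text is searched as one string rather than line by line. The sketch below uses Python rather than Perl and assumes the text has already been read into a single string.

```python
import re

# The duplicate "the" is split across two lines.
text = "London in the\nthe winter"

DOUBLED = re.compile(r"\b(\w+)\s+\1\b")

match = DOUBLED.search(text)
print(repr(match.group(0)))
# Replacing with \1 also removes the line break between the two words.
print(DOUBLED.sub(r"\1", text))
```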
To find duplicate words that have no leading or trailing non whitespace character(s) other than a word character(s), you can use whitespace boundaries on the left and on the right making use of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
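A short Python sketch that checks the three strings above and contrasts the whitespace boundaries with plain \b word boundaries. The names STRICT and LOOSE are only illustrative.

```python
import re

# Whitespace boundaries: the doubled word must be delimited by whitespace
# or by the start/end of the string.
STRICT = re.compile(r"(?<!\S)(\w+)\s+\1(?!\S)")
# Plain word boundaries, for comparison.
LOOSE = re.compile(r"\b(\w+)\s+\1\b")

tests = [
    "Paris in the the spring.",
    "Not that that is related.",
    "This is $word word",
]

for s in tests:
    strict = STRICT.search(s)
    loose = LOOSE.search(s)
    print(s, "| strict:", strict.group(0) if strict else None,
          "| loose:", loose.group(0) if loose else None)
```

The last string is where the two differ: \b sees a boundary between $ and word, while the lookbehind (?<!\S) rejects it because $ is a non-whitespace character.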
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.
Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b

Where is such a regex wrong?

I am using python.
The pattern is:
re.compile(r'^(.+?)-?.*?\(.+?\)')
The text looks like:
text1 = 'TVTP-S2(xxxx123123)'
text2 = 'TVTP(xxxx123123)'
I expect to get TVTP
Another option to match those formats is:
^([^-()]+)(?:-[^()]*)?\([^()]*\)
Explanation
^ Start of string
([^-()]+) Capture group 1, match 1+ times any character other than - ( and )
(?:-[^()]*)? As the - is excluded from the first part, optionally match - followed by any char other than ( and )
\([^()]*\) Match from ( till ) without matching any parenthesis between them
Regex demo | Python demo
Example
import re
regex = r"^([^-()]+)(?:-[^()]*)?\([^()]*\)"
s = ("TVTP-S2(xxxx123123)\n"
"TVTP(xxxx123123)\n")
print(re.findall(regex, s, re.MULTILINE))
Output
['TVTP', 'TVTP']
This regex works:
pattern = r'^([^-]+).*\(.+?\)'
>>> re.findall(pattern, 'TVTP-S2(xxxx123123)')
['TVTP']
>>> re.findall(pattern, 'TVTP(xxxx123123)')
['TVTP']
A quick answer would be:
^(\w+)(-.*?)?\((.*?)\)$
https://regex101.com/r/wL4jKe/2/
It is because the first plus is lazy, and the subsequent dash is optional, followed by a pattern that allows any character.
This allows the regex engine to choose the single letter T for the first group (because it is lazy), choose to interpret the dash as just not being there, which is allowed because it is followed by a question mark, and then have the next .* match "VTP-S2".
You can just grab non-dashes to capture, followed by nonparentheses up to the parentheses.
p = re.compile(r'^([^-(]*)[^(]*\(.+?\)')
p.search('TVTP-S2(xxxx123123) blah()').group(1)  # 'TVTP'
The nonparentheses part prevents the second portion from matching 'S2(xxxx123123) blah(' in my modified example above.
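The explanation above is easy to verify. The sketch below runs the pattern from the question next to the pattern from the first answer (with the character classes [^-()] and [^()]); the variable names are only illustrative.

```python
import re

# Pattern from the question: the lazy (.+?) settles for a single character.
original = re.compile(r'^(.+?)-?.*?\(.+?\)')
# Pattern from the first answer: group 1 cannot run past "-" or "(".
fixed = re.compile(r'^([^-()]+)(?:-[^()]*)?\([^()]*\)')

for text in ('TVTP-S2(xxxx123123)', 'TVTP(xxxx123123)'):
    print(text,
          '| original:', original.match(text).group(1),
          '| fixed:', fixed.match(text).group(1))
```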

Searching for a pattern in a sentence with regex in python

I want to capture the digits that follow a certain phrase and also the start and end index of the number of interest.
Here is an example:
text = The special code is 034567 in this particular case and not 98675
In this example, I am interested in capturing the number 034567, which comes after the phrase special code, and also the start and end index of the number 034567.
My code is:
p = re.compile('special code \s\w.\s (\d+)')
re.search(p, text)
But this does not match anything. Could you explain why and how I should correct it?
Your expression matches a space and any whitespace with \s pattern, then \w. matches any word char and any character other than a line break char, and then again \s requires two whitespaces, any whitespace and a space.
You may simply match any 1+ whitespaces using \s+ between words, and to match any chunk of non-whitespaces, instead of \w., you may use \S+.
Use
import re
text = 'The special code is 034567 in this particular case and not 98675'
p = re.compile(r'special code\s+\S+\s+(\d+)')
m = p.search(text)
if m:
    print(m.group(1)) # 034567
    print(m.span(1)) # (20, 26)
See the Python demo and the regex demo.
Use re.findall with a capture group:
text = "The special code is 034567 in this particular case and not 98675"
matches = re.findall(r'\bspecial code (?:\S+\s+)?(\d+)', text)
print(matches)
This prints:
['034567']

Regex similar to a Hearst Pattern in Python

I'm trying to come up with a regex similar to the ones listed here for Hearst Patterns in order to get the following results:
NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).
NP_The_Eleventh_Air_Force (NP_11_AF) is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).
Doing re.search(regex, sentence) for each of these sentences, I want to match these 2 groups: NP_The_Eleventh_Air_Force and NP_a_Numbered_Air_Force
This is my attempt but it doesn't get any matches:
(NP_\\w+ (, )?is (NP_\\w+ ?))
In both sentences I think (, )? is not present, but the part in parentheses before it can be, so you could make that part optional instead.
Also move the last parenthesis from )) to (NP_\w+) to create the first group.
The pattern including the optional comma and space could be:
(NP_\w+)(?: \([^()]+\))? (?:, )?is (NP_\w+ ?)
Regex demo
If you don't need the space at the end and the comma space is not present, your pattern could be:
(NP_\w+)(?: \([^()]+\))? is (NP_\w+)
(NP_\w+) Capture group 1 Match NP_ and 1+ word chars
(?: \([^()]+\))? Optionally match a space and a part with parenthesis
is Match literally
(NP_\w+) Capture group 2 Match NP_ and 1+ word chars
See a regex demo | Python demo
For example
import re
regex = r"(NP_\w+)(?: \([^()]+\))? is (NP_\w+)"
test_str = "NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF)."
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
    print(matches.group(2))
Output
NP_The_Eleventh_Air_Force
NP_a_Numbered_Air_Force
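The example above only tests the first sentence. A minimal sketch that runs the same pattern over both sentences from the question (the name HEARST is only illustrative):

```python
import re

HEARST = re.compile(r"(NP_\w+)(?: \([^()]+\))? is (NP_\w+)")

sentences = [
    "NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of "
    "NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).",
    "NP_The_Eleventh_Air_Force (NP_11_AF) is NP_a_Numbered_Air_Force of "
    "NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).",
]

# The optional (?: \([^()]+\))? part skips " (NP_11_AF)" in the second sentence.
results = [HEARST.search(sentence).groups() for sentence in sentences]
for groups in results:
    print(groups)
```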
I got one, quite simple:
regex = r"NP.\w+ ?Forces?\b"
You can see how it works out here; it's an online tool to write and test regex for multiple languages:
https://regex101.com/r/KKH3D3/1/

python regex get text among two tag with new line

I'm new to regex. Here is my data.
<p>[tag]y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38[/tag]</p>
I want to get this.
y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38
Here is my regex.
(<p>\[tag(.*)\])(.+)(\[\/tag\]<\/p>)
But it doesn't work because of the newlines (\n). If I use re.DOTALL, it works, but if my data has multiple records like
<p>[tag]y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38[/tag]</p>
<p>[tag]y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38[/tag]</p>
re.findall() returns only one match. In short, I want this:
[data1, data2, data3, ...]. What can I do?
Simple as this:
\](.*?)\[
reobj = re.compile(r"\](.*?)\[", re.IGNORECASE | re.DOTALL | re.MULTILINE)
result = reobj.findall(YOURSTRING)
Output:
y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38
DEMO
Regex Explanation:
\] matches the character ] literally
1st Capturing group (.*?)
.*? matches any character
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
\[ matches the character [ literally
s modifier: single line. Dot matches newline characters
You can use this regex:
\[tag\]([\s\S]*?)\[\/tag\]
Working demo
Match information:
MATCH 1
1. [8-44] `y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38`
Update: what the regex does
\[tag\]
([\s\S]*?) --> the [\s\S]*? is used to match everything, since \S will match
all non-blanks and \s will match blanks. This is just a trick; you can
also use [\D\d] or [\W\w]. By the way, *? is just a non-greedy (lazy) quantifier
\[\/tag\]
On the other hand, if you want to allow attributes in the tag you can use:
\[tag.*?\]([\s\S]*?)\[\/tag\]
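To tie this back to the multi-record case from the question, here is a minimal Python sketch. It assumes the two records shown in the question are held in one string; the variable names are only illustrative. It also shows why the greedy attempt with re.DOTALL returned a single match.

```python
import re

record = "<p>[tag]y,m,m,l\n1997,f,e,2.34g\n2000,m,c,2.38[/tag]</p>\n"
data = record * 2  # two records, as in the question

# Greedy .+ with DOTALL runs to the LAST [/tag], so everything is one match.
greedy = re.findall(r"\[tag\](.+)\[/tag\]", data, re.DOTALL)

# Lazy .*? stops at the first [/tag] after each [tag].
lazy = re.findall(r"\[tag\](.*?)\[/tag\]", data, re.DOTALL)

print(len(greedy), len(lazy))
for item in lazy:
    print(item)
```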
