I have a project where I'm trying to analyze a database of tweets. I need to write a python regex expression that pulls tweets mentioning specific twitter users. Here is an example tweet I'd like to capture.
"That #A_Person is a real jerk."
The regex that I've been trying is
([^.?!]*)(\b([#]A_Person)\b)([^.?!]*)
But it's not working and I've tried lots of variations. Any advice would be appreciated!
\b matches a word boundary, but # is not a word character, so if it occurs after a space, the match will fail. Try removing the word boundary there, and removing the extra groups, and add a character set at the end for [.?!] to include the final punctuation, and you get:
[^.?!]*#A_Person\b.*?[^.?!]*[.?!]
You also might consider including a check for the start of the string or the end of the last sentence, otherwise the engine will go through a lot of steps while going through areas without any matches. Perhaps use
(?:^|(?<=[.?!])\s*)
which will match the start of the string, or will lookbehind for [.?!] possibly followed by spaces. Put those together and you get
(?:^|(?<=[.?!])\s*)([^.?!]*#A_Person\b.*?[^.?!]*[.?!])
where the string you want is in the first group (no leading spaces).
https://regex101.com/r/447KsF/3
Related
I am parsing a database and extracting entries to a new database. For this I use keywords which should and keywords which should not be included. For a keyword I want excluded, it should be "-anyletter-fv", I wonder if -anyletter- is possible to program. If there is no letter, a space, a comma, or anything but a letter, I don't want to exclude it, only if there is specifically a letter in front of it.
If I understand you correctly, you try to exclude those cases in which your keyword starts with some letter.
Use library re for it (https://docs.python.org/3/library/re.html)
print(re.match("^\w.*", " keyword"))
will return a match object if a pattern that you look for is found, otherwise None.
You can use it for if-expressions.
the "^" marks the beginning of the sequence, "\w" matches all [a-zA-Z0-9], while ".*" matches all other sequences of varying length.
Therefore you get matches for keywords that do not start with ascii character.
I hope this helps you.
Let's say I want to convert the word center to centre, theater to theatre, etc. In order to do so, I have written a regex like the one below:
s = "center ce..nnnnnnnnteeeerrrr mmmmeeeeet.eeerrr liiiiIIiter l1t3r"
regex = re.compile(r'(?:((?:(?:[l1]+\W*[i!1]+|m+\W*[e3]+|c+\W*[e3]+\W*n+)\W*t+|t+\W*h+\W*[e3]+\W*a+\W*t+|m+\W*a+\W*n+\W*[e3]+\W*u+\W*v+)\W*)([e3]+)(\W*)(r+))', re.I)
print(regex.sub(r'\1\4\3\2',s)
#prints "centre ce..nnnnnnnntrrrreeee mmmmeeeeet.rrreee liiiiIIitre l1tr3"
In order to account for loopholes like c.e.nn.ttteee,/rr (basically repeated characters and added punctuation), I have been forced to add \W* between each character.
However, people are still able to use strings like c.c.e.e.n.n.t.t.e.e.r.r, which don't match as there is punctuation between each letter, not just different letters.
I was wondering whether there is a smarter method of doing this, where I can use re.sub without removing whitespace/punctuation but nonetheless have it match.
I am looking to find all matches in a string and print all substrings until I match these strings to a new line.
e.g.
"123ABC97edfABCaaabbdd1234ABC0009ui50ABC_1234"
should print:
ABC97edf
ABCaaabbdd1234
ABC0009ui50
ABC_1234
where "ABC" is the pattern match which is recurring.
Is there an efficient way I can do so using findall?
New to Python here, using python version 2.4.3
Edit just an F.Y.I:
What I am trying to do is basically I have a 250+Gb file which has control characters showing start and end of line but these Ctrl Characters (because of issues.. mostly network) are embedded within these lines i.e. in between the start/end indicating control characters.
With that, there is no specific distinction between the start/end control chars and the ones that come in between these messages.
So I am basically removing these control chars, and have I wish to have a complete message per line pertaining to some specific regex.
The regex here is not necessarily ABC or in order for all of these messages.
I have tried using findall and am able to find all the matches, just I did not know how to get the strings following these until i find the next match. (the regex here can be either -ABC=35nga|DEF=64325:dfaf:1234| or **ABC=35632|DEF=61 and many different forms.
And I have to break for each line and for the ones which have multiple lines embededed within a line.
Using re.findall:
See the regex in action on regex101.
s = "123ABC97edfABCaaabbdd1234ABC0009ui50ABC_1234"
re.findall("ABC.*?(?=ABC|$)",s)
which gives a list:
['ABC97edf', 'ABCaaabbdd1234', 'ABC0009ui50', 'ABC_1234']
And if you wanted to print the elements in this list, you could simply do:
for sub in re.findall("ABC.*?(?=ABC|$)",s):
print(sub)
which would output:
ABC97edf
ABCaaabbdd1234
ABC0009ui50
ABC_1234
I'm working with a corpus linguistics tool called AntConc, where you have a document where every word is tagged as a part of speech (noun, adjective, etc), and you use specific commands to pull out matches. For example, if I was looking for a noun (which is tagged NN), I would use *_NN and it would find every noun in the document.
I need to translate my *_TAG syntax into python regex, and I have no idea how to do that. For example, I have a phrase: *_PP$ *_NN *_DT *_JJ *_NN (this translates to possessive pronoun, noun, determiner, adjective, noun; it would find things like "her voice an exact duplicate") in TAG format.
How does one go about changing things like that to regex? For now, I'll take just that basic stuff. Later I'll worry about figuring out how to do "or" and "if this then this" and whatnot.
If you need more info about the tags, try searching for POS tags CLAWS, which should give you a list.
Thanks so much for your help!
So I did some research and found this PDF file describing the notion of embedded tags and non-embedded tags. You are looking to find the embedded tags. So if I'm correct the input would be like this right?
her_PP$ voice_NN an_DT exact_JJ duplicate_NN
Only then in a larger body of text and you don't know the actual words, you just know the _XX tags.
In a regex, you have to be more specific then *. What you want in the place of the * is 1 or more of any character that is part of a word (letters, but could also contain hyphens maybe?). That makes this for the noun:
[\w-]+_NN
This means a character class [...] of word characters \w and the hyphen -, repeated one or more times +, followed by _NN.
For the possessive pronoun, it has a $ in there which has a special meaning in regexes, if you want the character $ and not its special meaning, you need to escape it with a preceding \ like so:
[\w-]+_PP\$
Lastly you want to consider which characters are allowed in between the words. Could be just white-space like spaces, tabs and enters, which would be \s+. Could also be "any character that isn't a word character" to allow periods, commas, quotes, colons, etc. That would be \W+ (note the upper case W to be the opposite of the lowercase \w).
Combined this would amount to this:
[\w-]+_PP\$\W+[\w-]+_NN\W+[\w-]+_DT\W+[\w-]+_JJ\W+[\w-]+_NN
Debuggex Demo
To do "an undetermined amount of unknown words" you would do this:
(?:[\w-]+\W+)*?
So the part that matches the word [\w-]+ and the part that goes in between \W+ are wrapped into a non-capturing group (?:...) and that group is said to occur 0 or more times with the * but as few times as possible with ? to avoid greediness. You can see it here and remove or add an X to see it will still match.
How do I add the tag NEG_ to all words that follow not, no and never until the next punctuation mark in a string(used for sentiment analysis)? I assume that regular expressions could be used, but I'm not sure how.
Input:It was never going to work, he thought. He did not play so well, so he had to practice some more.
Desired output:It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.
Any idea how to solve this?
To make up for Python's re regex engine's lack of some Perl abilities, you can use a lambda expression in a re.sub function to create a dynamic replacement:
import re
string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"
transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
string,
flags=re.IGNORECASE)
Will print (demo here)
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !
Explanation
The first step is to select the parts of your string you're interested in. This is done with
\b(?:not|never|no)\b[\w\s]+[^\w\s]
Your negative keyword (\b is a word boundary, (?:...) a non capturing group), followed by alpahnum and spaces (\w is [0-9a-zA-Z_], \s is all kind of whitespaces), up until something that's neither an alphanum nor a space (acting as punctuation).
Note that the punctuation is mandatory here, but you could safely remove [^\w\s] to match end of string as well.
Now you're dealing with never going to work, kind of strings. Just select the words preceded by spaces with
(\s+)(\w+)
And replace them with what you want
\1NEG_\2
I would not do this with regexp. Rather I would;
Split the input on punctuation characters.
For each fragment do
Set negation counter to 0
Split input into words
For each word
Add negation counter number of NEG_ to the word. (Or mod 2, or 1 if greater than 0)
If original word is in {No,Never,Not} increase negation counter by one.
You will need to do this in several steps (at least in Python - .NET languages can use a regex engine that has more capabilities):
First, match a part of a string starting with not, no or never. The regex \b(?:not?|never)\b([^.,:;!?]+) would be a good starting point. You might need to add more punctuation characters to that list if they occur in your texts.
Then, use the match result's group 1 as the target of your second step: Find all words (for example by splitting on whitespace and/or punctuation) and prepend NEG_ to them.
Join the string together again and insert the result in your original string in the place of the first regex's match.