I'm trying to write some code using regex and a text file. My file contains these words, one per line:
each
expressions
flags
in
from
given
line
of
once
lines
no
My goal is to display the words that can be created by removing letters from a given string.
For example, if my string is "flamingoes", my output should be:
flags
in
line
lines
no
Because they can be created from my string by removing letters, and they are also in my text file.
I have done a lot of work with regex, and this challenge interests me. Is there a regex solution for this?
Build one regex for each word you are looking for, placing .*? between consecutive letters. The expression .*? is a non-greedy pattern, which avoids at least some backtracking and makes the search faster.
For example, the regex for the word "given" would be g.*?i.*?v.*?e.*?n
import re
def hidden_words(needles, haystack):
    for needle in needles:
        # Join the escaped letters with a lazy wildcard, e.g. g.*?i.*?v.*?e.*?n
        regex = re.compile('.*?'.join(map(re.escape, needle)))
        if regex.search(haystack):
            yield needle
needles = ['each', 'expressions', 'flags', 'in', 'from',
           'given', 'line', 'of', 'once', 'lines', 'no']
print(*hidden_words(needles, 'flamingoes'), sep='\n')
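As a cross-check on the per-word regex, the same "can this word be formed by deleting letters" test can be sketched without regex, as a plain subsequence check. This is an alternative technique, not the method of the answer above; `is_subsequence` is a hypothetical helper name, and the word list is copied from the question.

```python
def is_subsequence(needle, haystack):
    """Return True if needle can be formed by deleting letters from haystack."""
    remaining = iter(haystack)
    # `c in remaining` advances the iterator until it finds c, so the letters
    # of needle must appear in haystack in the same order.
    return all(c in remaining for c in needle)

words = ['each', 'expressions', 'flags', 'in', 'from',
         'given', 'line', 'of', 'once', 'lines', 'no']
found = [w for w in words if is_subsequence(w, 'flamingoes')]
print(*found, sep='\n')
```

If the regex version and this check ever disagree on a word, that is a hint that one of them has a bug.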
Essentially, each character is optional. A simple
import re
word = 'flamingoes'
pattern = ''.join(c + '?' for c in word)  # ? marks the letter as optional
with open('file') as f:
    for line in f:
        line = line.strip()
        # fullmatch: the whole line must be consumed (match alone accepts any line)
        if line and re.fullmatch(pattern, line):
            print(line)
should suffice.
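One detail worth checking in the optional-character approach: because every letter is optional, the pattern also matches the empty string, so `re.match` (which only anchors at the start) succeeds on any line. A minimal sketch, assuming the word list from the question is held in memory rather than read from a file, compares `re.match` with `re.fullmatch`:

```python
import re

word = 'flamingoes'
pattern = ''.join(re.escape(c) + '?' for c in word)  # f?l?a?m?i?n?g?o?e?s?

words = ['each', 'expressions', 'flags', 'in', 'from',
         'given', 'line', 'of', 'once', 'lines', 'no']

# re.match only anchors at the start, and the pattern can match nothing at
# all, so every word "matches".
loose = [w for w in words if re.match(pattern, w)]

# re.fullmatch requires the whole word to be consumed by the pattern.
strict = [w for w in words if re.fullmatch(pattern, w)]

print(loose == words)
print(strict)
```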
I've made this Python program for printing the words from a text, but I got stuck: when Python reaches the next 'tab' index, it returns to the initial one when it checks the conditional, and I don't know why. Can anyone explain why it doesn't take the new 'tab' index?
import re
initial_text = '# Traditionally, a text is understood to be a piece of written or spoken material in its primary form (as opposed to a paraphrase or summary). A text is any stretch of language that can be understood in context. It may be as simple as 1-2 words (such as a stop sign) or as complex as a novel. Any sequence of sentences that belong together can be considered a text.'
text = re.sub(r'\W+', ' ', initial_text)
t = -1
for i in text:
    n = text.find(i)
    if i == ' ':
        print(text[t+1:n])
        t = n
This is because you are using the find() function, which returns the index of the first occurrence of the character you are searching for. That's why it keeps going back to the first index.
You can refer to the find() function documentation.
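To illustrate the point about find(), here is a small sketch with a made-up sample string: `str.find` always reports the first occurrence of the character, while `enumerate` gives the position the loop is actually visiting.

```python
text = "a text is a text"

# str.find returns the index of the FIRST occurrence, whatever position
# the loop is currently looking at.
found = [text.find(ch) for ch in text if ch == ' ']

# enumerate yields the real index of each character as the loop advances.
actual = [i for i, ch in enumerate(text) if ch == ' ']

print(found)
print(actual)
```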
Use this approach:
import re
initial_text = "whatever your text is"
text = re.sub(r'[^\w\s]', '', initial_text)
words_list = text.split()
for word in words_list:
    print(word)
Explanation using an example :
import re
initial_text = "Hello : David welcome to Stack ! overflow"
text = re.sub(r'[^\w\s]', '', initial_text)
The code above removes the punctuation.
words_list = text.split()
After this step, words_list will be: ['Hello', 'David', 'welcome', 'to', 'Stack', 'overflow']
for word in words_list:
    print(word)
The code above takes each element from the list and prints it.
Looks like you can use
import re
initial_text = '# Traditionally, a text is understood to be a piece of written or spoken material in its primary form (as opposed to a paraphrase or summary). A text is any stretch of language that can be understood in context. It may be as simple as 1-2 words (such as a stop sign) or as complex as a novel. Any sequence of sentences that belong together can be considered a text.'
words = re.findall(r'[^\W_]+', initial_text)
for word in words:
    print(word)
See Python proof.
re.findall extracts all non-overlapping matches from the given text.
[^\W_]+ is a regular expression that matches one or more characters that are neither non-word characters nor underscores. In other words, it matches substrings consisting only of letters and/or digits (ASCII and other Unicode alike).
See regex proof.
EXPLANATION
[^\W_]+    any character except non-word characters (all but a-z, A-Z, 0-9, _) and '_' (1 or more times (matching the most amount possible))
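A quick sketch of how [^\W_]+ differs from the more common \w+. The sample string is an assumption, chosen because it contains an underscore:

```python
import re

sample = "snake_case and 1-2 words"

with_underscore = re.findall(r'\w+', sample)         # \w includes '_'
without_underscore = re.findall(r'[^\W_]+', sample)  # letters and digits only

print(with_underscore)
print(without_underscore)
```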
I want to split a string on all spaces and punctuation except the apostrophe. Preferably, a single quote should still be treated as a delimiter except when it is an apostrophe. I also want to keep the delimiters.
Example string:
words = """hello my name is 'joe.' what's your's"""
Here is my re pattern so far: splitted = re.split(r"[^'-\w]",words.lower())
I tried putting the single quote after the ^ character, but it is not working.
My desired output is: splitted = [hello,my,name,is,joe,.,what's,your's]
It might be simpler to process your list after splitting, without accounting for the quotes at first:
>>> import re
>>> words = """hello my name is 'joe.' what's your's"""
>>> split_words = re.split(r"[ ,.!?]", words.lower()) # add punctuation you want to split on
>>> split_words
['hello', 'my', 'name', 'is', "'joe", "'", "what's", "your's"]
>>> [word.strip("'") for word in split_words if word.strip("'")]
['hello', 'my', 'name', 'is', 'joe', "what's", "your's"]
One option is to use lookarounds to split at the desired positions, with a capture group for what you want to keep in the split.
After the split, you can remove the empty entries from the resulting list.
\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])
The pattern matches
\s+ Match 1 or more whitespace chars
| Or
(?<=\s)' Match ' preceded by a whitespace char
| Or
'(?=\s) Match ' when followed by a whitespace char
| Or
(?<=\w)([,.!?]) Capture one of , . ! ? in group 1, when preceded by a word character
See a regex demo and a Python demo.
Example
import re
pattern = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
words = """hello my name is 'joe.' what's your's"""
result = [s for s in re.split(pattern, words) if s]
print(result)
Output
['hello', 'my', 'name', 'is', 'joe', '.', "what's", "your's"]
I love regex golf!
words = """hello my name is 'joe.' what's your's"""
splitted = re.findall(r"\b(?:\w'\w|\w)+\b", words)
The part in the parentheses is a group that matches either an apostrophe surrounded by word characters or a single word character.
EDIT:
This is more flexible:
re.findall(r"\b(?:(?<=\w)'(?=\w)|\w)+\b", words)
It's getting a bit unreadable at this point, though. In practice you should probably use Woodford's answer.
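To see where the two findall patterns differ, here is a small sketch. The string "rock'n'roll" is an assumed test case, not from the thread; `simple` and `flexible` are hypothetical names for the first and second pattern.

```python
import re

words = """hello my name is 'joe.' what's your's"""
simple = r"\b(?:\w'\w|\w)+\b"
flexible = r"\b(?:(?<=\w)'(?=\w)|\w)+\b"

# On the question's string the two patterns agree.
print(re.findall(simple, words))
print(re.findall(flexible, words))

# They differ when apostrophes sit one letter apart: \w'\w consumes the
# letter after the apostrophe, so that letter cannot start another \w'\w.
tricky = "rock'n'roll"
simple_result = re.findall(simple, tricky)
flexible_result = re.findall(flexible, tricky)
print(simple_result)
print(flexible_result)
```

The lookaround version only checks the neighbouring characters without consuming them, which is why it can chain several apostrophes in one word.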
I'm currently having a hard time separating the words in a txt document into a list with regex. I have tried ".split" and ".readlines". My document consists of words like "HelloPleaseHelpMeUnderstand": the words are capitalized but not separated by spaces, so I'm at a loss as to how to get them into a list.
This is what I have currently, but it only returns a single word.
import re
file1 = open("file.txt","r")
strData = file1.readline()
listWords = re.findall(r"[A-Za-z]+", strData)
print(listWords)
One of my goals in doing this is to search for another word within the elements of the list, but for now I just want to know how to build the list so I can continue my work.
If anyone can guide me to a solution, I would be grateful.
A common lookaround-based regex for inserting spaces between glued words is
import re
text = "HelloPleaseHelpMeUnderstand"
print( re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[a-z])(?=[A-Z])", " ", text) )
# => Hello Please Help Me Understand
See the regex demo. Note adjustments will be necessary to account for numbers, or single letter uppercase words like I, A, etc.
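The note about adjustments can be made concrete with a short sketch. The inputs other than the question's string are assumed examples, and `unglue` is a hypothetical helper name:

```python
import re

SPLIT = r"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[a-z])(?=[A-Z])"

def unglue(text):
    """Insert a space at every position matched by the lookaround pattern."""
    return re.sub(SPLIT, " ", text)

print(unglue("HelloPleaseHelpMeUnderstand"))
# An acronym followed by a capitalized word is separated by the first branch.
print(unglue("ParseHTTPResponse"))
# Digits are neither [a-z] nor [A-Z], so no split position is found here.
print(unglue("Version2Update"))
```

Handling digits would need the character classes in the lookarounds extended, which is the kind of adjustment the note refers to.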
Regarding your current code, you need to read the whole file into a variable with file1.read() (readline() reads just the first line) and use a [A-Z]+[a-z]* regex to match all the words glued the way you show:
import re
with open("file.txt","r") as file1:
    strData = file1.read()
listWords = re.findall(r"[A-Z]+[a-z]*", strData)
print(listWords)
See the Python demo
Pattern details
[A-Z]+ - one or more uppercase letters
[a-z]* - zero or more lowercase letters.
How about this:
import re
strData = """HelloPleaseHelpMeUnderstand
And here not in
HereIn"""
listWords = re.findall(r"(([A-Z][a-z]+){2,})", strData)
result = [i[0] for i in listWords]
print(result)
# ['HelloPleaseHelpMeUnderstand', 'HereIn']
print(re.sub(r"\B([A-Z])", r" \1", "DoIThinkThisIsABetterAnswer?"))
# Do I Think This Is A Better Answer?
So I have a list of strings, let's say: my_list = ['hope', 'faith', 'help']
Now I open a text file named infile and separate the words with
for line in infile:
    line_list = line.split()
Now I want to make a regex that I can change using a for loop, like this:
for word in line_list:
    match = re.findall(word$, line_list)
    print(match)
I've tried several ways to get 'word' into that regex, but none seems to work.
Any ideas?
You don't need a regular expression. The standard str type in Python has an endswith method.
with open('path/name.ext') as infile:
    line_list = infile.readlines()
for line in line_list:
    # strip the trailing newline first, otherwise endswith never matches
    match = [word for word in my_list if line.rstrip().endswith(word)]
    print(match)
This prints, for every line in the file, a list of the matching words (empty if there are none).
But you can do it with a regular expression if you absolutely want to:
pattern = r'({0})$'.format('|'.join(my_list))
for line in line_list:
    match = re.findall(pattern, line)
    print(match)
The search pattern consists of a group containing all elements of my_list, joined by the alternation operator |.
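A self-contained sketch of the alternation approach, assuming a few in-memory sample lines in place of the file. Wrapping each word in `re.escape` is an added precaution (not part of the answer) in case a list element contains regex metacharacters.

```python
import re

my_list = ['hope', 'faith', 'help']
# Hypothetical sample lines standing in for the file contents.
lines = ["there is always hope\n", "no match here\n", "call for help\n"]

# One pattern for the whole list: (?:hope|faith|help)$
pattern = re.compile(r'(?:{0})$'.format('|'.join(map(re.escape, my_list))))

results = []
for line in lines:
    # $ also matches just before a trailing newline, so no strip is needed.
    match = pattern.search(line)
    results.append(match.group(0) if match else None)

print(results)
```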
A regex is just a string, which may or may not contain wildcards or special characters. So the best way to "make elements of a list part of a regex" is to 'write' the regex:
my_list = ['hope', 'faith', 'help']
for regex_el in my_list:
    regex = "{0:s}".format(regex_el)
    print(regex)
Of course, that is overly simplistic: it just uses a plain string as a regex. You could have small regular expressions to bolt into the larger regex, or you could surround the element from the list with other parts of a regex:
regex = "^ *{0:s} ".format(regex_el)
This would construct a regex that finds your word only if it is the first word in a string, preceded by zero or more spaces and followed by a space.
Then in your code, replace the 'word' in your call to findall with the 'regex' constructed above.
You will need to replace the line_list in your call to findall as well, since findall expects a pattern (be that a simple string or a genuine regex) and a string in which to search (that could be word in your loop, or line from the loop over lines in the file).
Also note that print(match) will print an empty list if no match was found. You may wish to replace that with
if match:
    print(match)
to print only the words from the line that match your constructed regex.
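Putting those steps together, a minimal sketch might look like this. The sample lines are hypothetical, the `$` anchor follows the question's `word$` intent, and `re.escape` is an added precaution rather than something the answer prescribes.

```python
import re

my_list = ['hope', 'faith', 'help']
# Hypothetical sample lines standing in for the contents of infile.
lines = ["keep the faith\n", "hope springs eternal\n", "I need help\n"]

hits = []
for line in lines:
    for word in my_list:
        # Build the regex as a string: the escaped word anchored to the line end.
        regex = re.escape(word) + r"$"
        match = re.findall(regex, line)  # pattern first, then the string to search
        if match:  # skip the empty lists produced by non-matching words
            hits.append((line.strip(), match))

print(hits)
```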
I recommend checking out https://regex101.com/ to experiment with regexes and the strings you're applying them to.
I'm looking for logic in Python that finds capitalized words in a line. For example, I have a *.txt file:
aaa
adadad
DDD_AAA
Dasdf Daa
I would like to find only the lines that have 2 or more capital words in a row (in the above case, DDD_AAA).
Regexes are the way to go:
import re
pattern = "([A-Z]+_[A-Z]+)" # matches CAPITALS_CAPITALS only
match = re.search(pattern, text)  # text is the line being searched
if match:
    print(match.group(0))
You have to figure out what exactly you are looking for, though.
Presuming your definition of a "capital word" is a string of two or more uppercase alphabet (non-numeric) characters, i.e. [A-Z], and assuming that what separates one "capital word" from another is not quite the complementary set ([^A-Z]) but rather the complementary set to the alphanumeric characters, i.e. [^a-zA-Z0-9], you're looking for a regex like
\b[A-Z]{2,}\b.*\b[A-Z]{2,}\b
I say like because the above is not exactly correct: \b counts the underscore _ as a word character. Replace the \bs with [^a-zA-Z0-9]s wrapped in lookaround assertions (to make them zero-width, like \b), and you have the correct regex:
(?<=[^a-zA-Z0-9]|^)[A-Z]{2,}(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]{2,}(?=[^a-zA-Z0-9]|$)
Here's a Rubular demo.
Finally, if you also count a one-character word as a "word", simply replace the {2,} quantifiers with +:
(?<=[^a-zA-Z0-9]|^)[A-Z]+(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]+(?=[^a-zA-Z0-9]|$)
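One caveat for Python users: the built-in re module rejects a lookbehind whose alternatives differ in width, such as (?<=[^a-zA-Z0-9]|^), so the patterns above will not compile there as written. A sketch of an equivalent formulation using negative lookarounds follows; this is my adaptation rather than the answer's regex, and `CAPITAL_WORD` is a hypothetical name.

```python
import re

# Negative lookarounds are zero-width and fixed-width, and they also succeed
# at the start and end of the string, so no "|^" or "|$" alternative is needed.
CAPITAL_WORD = r"(?<![a-zA-Z0-9])[A-Z]{2,}(?![a-zA-Z0-9])"
pattern = re.compile(CAPITAL_WORD + r".*" + CAPITAL_WORD)

# The sample lines from the question.
lines = ["aaa", "adadad", "DDD_AAA", "Dasdf Daa"]
matching = [line for line in lines if pattern.search(line)]
print(matching)
```

Replacing {2,} with + in `CAPITAL_WORD` gives the one-character variant.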
print(re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]", search_text))
should work to match 2 words that both start with a capital letter.
For your specific example:
lines = []
for line in file:
    if re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]", line):
        lines.append(line)
print(lines)
Basically, look into regexes!
Here you go:
import re
with open("r1.txt") as f:
    lines = f.readlines()
for line in lines:
    if re.match(r'[^\w]*[A-Z]+[ _][A-Z]+[^\w]*', line) is not None:
        print(line.strip("\n"))
Output:
DDD_AAA