I want to split strings only by suffixes. For example, I would like to be able to split dord word to [dor,wor].
I thought that \wd would match words that end with d. However, this does not produce the expected result:
import re
re.split(r'\wd',"dord word")
['do', ' wo', '']
How can I split by suffixes?
x='dord word'
import re
print(re.split(r"d\b", x))
or
print([i for i in re.split(r"d\b", x) if i])  # if you don't want empty strings
Try this.
A better way is to use re.findall with r'\b(\w+)d\b' as the regex, which captures the part of each word before the final d:
>>> re.findall(r'\b(\w+)d\b',s)
['dor', 'wor']
Since \w also matches digits and the underscore, I would define a word as consisting of letters only, using an [a-zA-Z] character class:
print([x.group(1) for x in re.finditer(r"\b([a-zA-Z]+)d\b", "dord word")])
If you're wondering why your original approach didn't work,
re.split(r'\wd',"dord word")
It finds every instance of a letter, digit, or underscore followed by a "d" and splits on those two-character matches. So it did this:
do[rd] wo[rd]
and split on the strings in brackets, removing them.
Also note that this could split in the middle of words, so:
re.split(r'\wd', "said tendentious")
would split the second word in two.
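To make the difference concrete, here is a small sketch that runs the patterns discussed above on the question's input. The helper name stems_before_suffix is made up for illustration and is not part of the re module.

```python
import re

text = "dord word"

# \wd consumes the character before the d as well, so that character is lost.
split_wd = re.split(r"\wd", text)

# d\b only consumes a d that ends a word; the separating space is kept.
split_d_boundary = [part for part in re.split(r"d\b", text) if part]


def stems_before_suffix(s, suffix):
    """Return each word of s that ends in suffix, minus the suffix (hypothetical helper)."""
    return re.findall(r"\b(\w+)" + re.escape(suffix) + r"\b", s)


stems = stems_before_suffix(text, "d")

# \wd is not anchored to the end of a word, so it can also split mid-word.
mid_word = re.split(r"\wd", "said tendentious")

print(split_wd)          # ['do', ' wo', '']
print(split_d_boundary)  # ['dor', ' wor']
print(stems)             # ['dor', 'wor']
print(mid_word)          # ['sa', ' te', 'entious']
```

Note that the d\b split keeps the leading space on the second piece, which is one reason the findall approach tends to be cleaner here.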
I have a long string and I need to find the words that contain the character 'd' followed, somewhere later, by the character 'e'.
l=[" xkn59438","yhdck2","eihd39d9","chdsye847","hedle3455","xjhd53e","45da","de37dp"]
b=' '.join(l)
runs1=re.findall(r"\b\w?d.*e\w?\b",b)
print(runs1)
\b is a word boundary, followed by an optional word character (\w?), and so on.
I get an empty list.
You can massively simplify your solution by applying a regex based search on each string individually.
>>> p = re.compile('d.*e')
>>> list(filter(p.search, l))
Or,
>>> [x for x in l if p.search(x)]
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
Why didn't re.findall work? You were searching one large string, and your greedy match in the middle was searching across strings. The fix would've been
>>> re.findall(r"\b\S*d\S*e\S*", ' '.join(l))
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
Using \S to match anything that is not a space.
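As a rough illustration of the point about greediness, the sketch below runs a plain greedy d.*e and the \S-based pattern on the same joined string. The variable names are only for this example.

```python
import re

l = [" xkn59438", "yhdck2", "eihd39d9", "chdsye847", "hedle3455", "xjhd53e", "45da", "de37dp"]
joined = " ".join(l)

# Greedy .* is free to cross spaces, so a single match swallows several words.
greedy = re.findall(r"d.*e", joined)

# \S* cannot cross a space, so every match stays inside one word.
per_token = re.findall(r"\b\S*d\S*e\S*", joined)

print(len(greedy))  # 1
print(per_token)    # ['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
```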
You can filter the result :
import re
l=[" xkn59438","yhdck2","eihd39d9","chdsye847","hedle3455","xjhd53e","45da","de37dp"]
pattern = r'd.*?e'
print(list(filter(lambda x:re.search(pattern,x),l)))
output:
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
Something like this maybe
\b\w*d\w*e\w*
Note that you can probably drop the word boundary here, because the leading greedy \w* already starts each match at the beginning of a word.
That leaves \w*d\w*e\w*
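A quick way to check that claim is to run both variants on the question's data. This is only a sketch on this one input, not a proof that the two patterns always agree.

```python
import re

l = [" xkn59438", "yhdck2", "eihd39d9", "chdsye847", "hedle3455", "xjhd53e", "45da", "de37dp"]
joined = " ".join(l)

# Same pattern with and without the leading word boundary.
with_boundary = re.findall(r"\b\w*d\w*e\w*", joined)
without_boundary = re.findall(r"\w*d\w*e\w*", joined)

print(with_boundary == without_boundary)  # True
print(without_boundary)  # ['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
```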
My string is of the form my_str = "2a1^ath67e22^yz2p0". I would like to split on the pattern '^(any characters)' and get ["2a1", "67e22", "2p0"]. The pattern could also appear at the front or the back of the string, such as ^abc27e4 or 27c2^abc. I tried re.split("(.*)\^[a-z]{1,100}(.*)", my_str) but it only splits on one occurrence of the pattern. I am assuming here that the number of characters appearing after ^ will not be larger than 100.
You don't need regex for simple string operations; you can use
my_list = my_str.split('^')
EDIT: Sorry, I just saw that you don't want to split only on the ^ character but also on the letters following it. For that you will need a regex:
my_list = re.split(r'\^[a-z]+', my_str)
If the pattern is at the front or the end of the string, this will create an empty list element. You can remove such elements with
my_list = list(filter(None, my_list))
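Putting the split and the cleanup together, a minimal sketch might look like this. The function name split_on_caret_tags is hypothetical, and the pattern assumes the characters after ^ are lowercase letters, as in the question's attempt.

```python
import re


def split_on_caret_tags(s):
    """Split s on '^' plus the lowercase letters after it, dropping empty pieces."""
    return [piece for piece in re.split(r"\^[a-z]+", s) if piece]


middle = split_on_caret_tags("2a1^ath67e22^yz2p0")
front = split_on_caret_tags("^abc27e4")
back = split_on_caret_tags("27c2^abc")

print(middle)  # ['2a1', '67e22', '2p0']
print(front)   # ['27e4']
print(back)    # ['27c2']
```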
If you want to use the regex library, you can just split on r'\^':
re.split(r'\^', my_str)
# output : ['2a1', 'ath67e22', 'yz2p0']
I want to define a function that takes a sentence and returns the words that are at least 4 characters long, in lowercase. The problem is, I'm pretty new to Python and I'm not quite certain how to write code that deals with words instead of integers. My current code is as follows:
def my_function(s):
    sentence = []
    for word in s.split():
        if len(word) >= 4:
            return (word.lower())
If I call my_function("Bill's dog was born in 2010") I expect ["bill","born"], whereas my code outputs "bill's".
From what I've seen on StackOverflow and in the Python tutorial, regular expression would help me but I do not fully understand what is going on in the module. Can you guys explain how regex could help, if it can at all?
Your requirements are slightly inconsistent, so I'll go with your example as the reference.
In [27]: import re
In [28]: s = "Bill's dog was born in 2010"
In [29]: [w.lower() for w in re.findall(r'\b[A-Za-z]{4,}\b', s)]
Out[29]: ['bill', 'born']
Let's take a look at the regular expression, r'\b[A-Za-z]{4,}\b'.
The r'...' is not part of the regular expression. It's a Python construct called a raw string. It's like a normal string literal except backslash sequences like \b don't have their usual meaning.
The two \b look for a word boundary (that is, the start or the end of a word).
The [A-Za-z]{4,} looks for a sequence of four or more letters. The [A-Za-z] is called a character class and consists of letters A through Z and a through z. The {4,} is a repetition operator that requires that the character class is matched at least four times.
Finally, the list comprehension, [w.lower() for w in ...], converts the words to lowercase.
Yes, Regex would be the simplest and easiest approach to achieve what you want.
Try this regex:
matches = re.findall(r"\b[a-zA-Z]{4,}\b", "Put Your String Here")  # matches ['Your', 'String', 'Here']
You return the first word that is 4 chars or longer, instead of all such words. Append to sentence and return that instead:
def my_function(s):
    sentence = []
    for word in s.split():
        if len(word) >= 4:
            sentence.append(word.lower())
    return sentence
You can simplify that with a list comprehension:
def my_function(s):
    return [word.lower() for word in s.split() if len(word) >= 4]
Yes, a regular expression could do this too, but for your case that may be overkill.
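Whether the regex is overkill depends on which output you want. As a side-by-side sketch on the question's own sentence, the whitespace split keeps the apostrophe and the year, while the letters-only regex gives the list the asker expected:

```python
import re

s = "Bill's dog was born in 2010"

# Whitespace split keeps punctuation and digits attached to each token.
by_split = [word.lower() for word in s.split() if len(word) >= 4]

# The regex only accepts runs of four or more letters between word boundaries.
by_regex = [word.lower() for word in re.findall(r"\b[A-Za-z]{4,}\b", s)]

print(by_split)  # ["bill's", 'born', '2010']
print(by_regex)  # ['bill', 'born']
```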
You forgot to accumulate the long words in 'sentence';) You're instead returning the first one
Using re.split
>>> import re
>>> a='Hi, how are you today?'
>>> [x for x in re.split('[^a-z]', a.lower()) if len(x)>=4]
['today']
I'm looking for a way to find capitalized words in a line in Python. For example, I have a *.txt file containing:
aaa
adadad
DDD_AAA
Dasdf Daa
I would like to search only for the lines which have 2 or more capital words after each other (in the above case DDD_AAA).
Regexes are the way to go:
import re
pattern = r"([A-Z]+_[A-Z]+)"  # matches CAPITALS_CAPITALS only
match = re.search(pattern, text)  # text is the string to search
if match:
    print(match.group(0))
You have to figure out what exactly you are looking for though.
Presuming your definition of a "capital word" is a string of two or more uppercase alphabet (non-numeric) characters, i.e. [A-Z], and assuming that what separates one "capital word" from another is not quite the complementary set ([^A-Z]) but rather the complementary set to the alphanumeric characters, i.e. [^a-zA-Z0-9], you're looking for a regex like
\b[A-Z]{2,}\b.*\b[A-Z]{2,}\b
I say like because the above is not exactly correct: \b counts the underscore _ as a word character. Replace the \bs with [^a-zA-Z0-9]s wrapped in lookaround assertions (to make them zero-width, like \b), and you have the correct regex:
(?<=[^a-zA-Z0-9]|^)[A-Z]{2,}(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]{2,}(?=[^a-zA-Z0-9]|$)
Here's a Rubular demo.
Finally, if you consider a one-character word, a "word", then simply do away with the {2,} quantifiers:
(?<=[^a-zA-Z0-9]|^)[A-Z]+(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]+(?=[^a-zA-Z0-9]|$)
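One caveat for Python users: the patterns above were demonstrated in a Ruby regex tester, and Python's re module requires a lookbehind to have a fixed width, so a lookbehind such as (?<=[^a-zA-Z0-9]|^) is rejected when the pattern is compiled. A possible workaround, offered here as a sketch rather than the answer's own method, is to use negative lookarounds, which also cover the start and end of the line. The names CAPITAL_WORD and TWO_CAPITAL_WORDS are made up for this example.

```python
import re

# "Not preceded / not followed by an alphanumeric" is also true at the start
# and end of the line, so no alternation is needed inside the lookbehind.
CAPITAL_WORD = r"(?<![a-zA-Z0-9])[A-Z]{2,}(?![a-zA-Z0-9])"
TWO_CAPITAL_WORDS = re.compile(CAPITAL_WORD + r".*" + CAPITAL_WORD)

sample_lines = ["aaa", "adadad", "DDD_AAA", "Dasdf Daa"]
matching = [line for line in sample_lines if TWO_CAPITAL_WORDS.search(line)]

print(matching)  # ['DDD_AAA']
```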
print(re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]", search_text))
should work to match 2 words that both start with a capital letter.
For your specific example:
lines = []
for line in file:
    if re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]", line):
        lines.append(line)
print(lines)
basically look into regexes!
Here you go:
import re
with open("r1.txt") as f:
    for line in f:
        if re.match(r'[^\w]*[A-Z]+[ _][A-Z]+[^\w]*', line) is not None:
            print(line.strip("\n"))
Output:
DDD_AAA
Let's say I want to remove all duplicate chars (of a particular char) in a string using regular expressions. This is simple -
import re
re.sub("a*", "a", "aaaa") # gives 'a'
What if I want to replace all duplicate chars (i.e. a,z) with that respective char? How do I do this?
import re
re.sub('[a-z]*', <what_to_put_here>, 'aabb') # should give 'ab'
re.sub('[a-z]*', <what_to_put_here>, 'abbccddeeffgg') # should give 'abcdefg'
NOTE: I know this remove duplicate approach can be better tackled with a hashtable or some O(n^2) algo, but I want to explore this using regexes
>>> import re
>>> re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
'fbq'
The () around the [a-z] specify a capture group, and then the \1 (a backreference) in both the pattern and the replacement refer to the contents of the first capture group.
Thus, the regex reads "find a letter, followed by one or more occurrences of that same letter", and then the entire matched portion is replaced with a single occurrence of that letter.
On a side note...
Your example code for just a is actually buggy:
>>> re.sub('a*', 'a', 'aaabbbccc')
'abababacacaca'
You really would want to use 'a+' for your regex instead of 'a*', since the * operator matches "0 or more" occurrences, and thus will match empty strings in between two non-a characters, whereas the + operator matches "1 or more".
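As a small sketch of the suggested fix, here is the 'a+' version next to the backreference version applied to one of the question's inputs. No output is shown for the 'a*' variant because the way re.sub handles empty matches changed in Python 3.7, so its exact result depends on the interpreter version.

```python
import re

# a+ needs at least one a, so it never matches the empty string between other characters.
collapsed_a = re.sub(r"a+", "a", "aaabbbccc")

# The backreference version collapses runs of any lowercase letter.
collapsed_all = re.sub(r"([a-z])\1+", r"\1", "abbccddeeffgg")

print(collapsed_a)    # abbbccc
print(collapsed_all)  # abcdefg
```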
In case you are also interested in removing duplicates of non-contiguous occurrences you have to wrap things in a loop, e.g. like this
s = "ababacbdefefbcdefde"
while re.search(r'([a-z])(.*)\1', s):
    s = re.sub(r'([a-z])(.*)\1', r'\1\2', s)
print(s)  # prints 'abcdef'
A solution that covers every category of character:
re.sub(r'(.)\1+', r'\1', 'aaaaabbbbbb[[[[[')
gives:
'ab['