Given a single word (x), return all the n-grams that can be found in that word.
You can change the n-gram size as you like;
it is the number in the curly braces in the pat variable.
The default n-gram size is 4.
For example, for the word (x):
x = 'abcdef'
The possible 4-grams are:
['abcd', 'bcde', 'cdef']
import re

def ngram_finder(x):
    pat = r'(?=(\S{4}))'
    xx = re.findall(pat, x)
    return xx
The question is:
How can I combine an f-string with a raw string (r-string) in the regex, given that the regex itself uses curly braces?
You can use this string to combine the n value into your regexp, using double curly brackets to create a single one in the output:
fr'(?=(\S{{{n}}}))'
The regex needs {} to make a quantifier (as in the {4} of your original regex). However, f-strings use {} to mark an expression to substitute, so you need to "escape" the braces required by the regex inside the f-string. That is done by writing {{ and }}, which produce { and } in the output. So {{{n}}} (where n=4) generates '{' + '4' + '}' = '{4}' as required.
Complete code:
import re
def ngram_finder(x, n):
    pat = fr'(?=(\S{{{n}}}))'
    return re.findall(pat, x)
x = 'abcdef'
print(ngram_finder(x, 4))
print(ngram_finder(x, 5))
Output:
['abcd', 'bcde', 'cdef']
['abcde', 'bcdef']
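As a quick sanity check on the brace escaping, the sketch below prints the pattern that the f-string produces and compares the regex result with a plain slicing approach. The helper names build_pattern and ngrams_by_slicing are made up for this illustration, and the two approaches only agree when the word contains no whitespace (since \S excludes it).

```python
import re

def build_pattern(n):
    # {{ and }} become literal braces; {n} is replaced by the value of n
    return fr'(?=(\S{{{n}}}))'

def ngrams_by_slicing(word, n):
    # Slide a window of width n over the word
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(build_pattern(4))                        # (?=(\S{4}))
print(re.findall(build_pattern(3), 'abcdef'))  # ['abc', 'bcd', 'cde', 'def']
print(ngrams_by_slicing('abcdef', 3))          # ['abc', 'bcd', 'cde', 'def']
```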
Related
I've a list ['test_x', 'text', 'x']. Is there any way to use regex in python to find if the string inside the list contains either '_x' or 'x'?
'x' should be a single string and not be part of a word without _.
The output should result in ['test_x', 'x'].
Thanks.
Using one line comprehension:
l = ['test_x', 'text', 'x']
result = [i for i in l if '_x' in i or 'x' == i]
You can use regexp this way:
import re
print(list(filter(lambda x: re.findall(r'_x|^x$',x),l)))
The regexp matches '_x' anywhere in an element, or an element that is exactly 'x'. filter applies the function to each element of the iterable and keeps those for which it returns a truthy value (here, a non-empty list of matches).
You can make your expression more generic this way:
print(list(filter(lambda x: re.findall(r'[^A-Za-z]x|^\W*x\W*$',x),l)))
Here I am telling Python to match either an x that is preceded by any character other than A to Z or a to z, or a string that consists of x with zero or more non-word characters before and after it. You can refer to this quick cheat sheet on regular expressions: https://www.debuggex.com/cheatsheet/regex/python
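The same filter can also be sketched with a compiled pattern and re.search, which stops at the first match instead of collecting all of them. The names PATTERN and keep_x_items are invented for this example.

```python
import re

# '_x' anywhere in the string, or the whole string is exactly 'x'
PATTERN = re.compile(r'_x|^x$')

def keep_x_items(items):
    # Keep only the items in which the pattern is found
    return [item for item in items if PATTERN.search(item)]

print(keep_x_items(['test_x', 'text', 'x']))  # ['test_x', 'x']
```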
[s for s in your_list if re.search(r'_x|^x$', s)]
Is there a simple way in Python to replace multiple characters with another one?
For instance, I would like to change:
name1_22:3-3(+):Pos_bos
to
name1_22_3-3_+__Pos_bos
So basically replace all "(",")",":" with "_".
I only know how to do it with:
str.replace(":","_")
str.replace(")","_")
str.replace("(","_")
You could use re.sub to replace multiple characters with one pattern:
import re
s = 'name1_22:3-3(+):Pos_bos '
re.sub(r'[():]', '_', s)
Output
'name1_22_3-3_+__Pos_bos '
Use a translation table. In Python 2, maketrans is defined in the string module.
>>> import string
>>> table = string.maketrans("():", "___")
In Python 3, it is a str class method.
>>> table = str.maketrans("():", "___")
In both, the table is passed as the argument to str.translate.
>>> 'name1_22:3-3(+):Pos_bos'.translate(table)
'name1_22_3-3_+__Pos_bos'
In Python 3, you can also pass a single dict mapping input characters to output characters to maketrans:
table = str.maketrans({"(": "_", ")": "_", ":": "_"})
Sticking to your current approach of using replace():
s = "name1_22:3-3(+):Pos_bos"
for e in ((":", "_"), ("(", "_"), (")", "_")):
    s = s.replace(*e)
print(s)
OUTPUT:
name1_22_3-3_+__Pos_bos
EDIT: (for readability)
s = "name1_22:3-3(+):Pos_bos"
replaceList = [(":", "_"), ("(", "_"), (")", "_")]
for elem in replaceList:
    print(*elem)  # : _, ( _, ) _ (one pair per iteration)
    s = s.replace(*elem)
print(s)
OR
repList = [':', '(', ')']  # list of all the chars to replace
rChar = '_'  # the char to replace with
for elem in repList:
    s = s.replace(elem, rChar)
print(s)
Another possibility is to use a list comprehension combined with the ternary conditional operator, in the following way:
text = 'name1_22:3-3(+):Pos_bos '
out = ''.join(['_' if i in ':)(' else i for i in text])
print(out) #name1_22_3-3_+__Pos_bos
As the comprehension gives a list, I use ''.join to turn the list of characters (strs of length 1) into a single str.
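To check that the approaches from the answers above agree, here is a small Python 3 sketch that runs all four on the sample string. The variable names are chosen for this example only.

```python
import re

s = 'name1_22:3-3(+):Pos_bos'
chars = '():'  # characters to replace with an underscore

# 1. Regex character class
via_regex = re.sub('[' + re.escape(chars) + ']', '_', s)

# 2. Translation table (Python 3 form)
via_translate = s.translate(str.maketrans(chars, '_' * len(chars)))

# 3. Chained str.replace calls
via_replace = s
for ch in chars:
    via_replace = via_replace.replace(ch, '_')

# 4. Character-by-character join
via_join = ''.join('_' if ch in chars else ch for ch in s)

print(via_regex)  # name1_22_3-3_+__Pos_bos
print(via_regex == via_translate == via_replace == via_join)  # True
```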
I have 2 scenarios for splitting a string:
"##$hello?? getting good.<li>hii"
I want it to be split as:
'hello', 'getting', 'good.<li>hii' (Scenario 1)
'hello', 'getting', 'good', 'li', 'hi' (Scenario 2)
Any ideas, please?
Something like this should work:
>>> re.split(r"[^\w<>.]+", s) # or re.split(r"[##$? ]+", s)
['', 'hello', 'getting', 'good.<li>hii']
>>> re.split(r"[^\w]+", s)
['', 'hello', 'getting', 'good', 'li', 'hii']
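The leading empty string in both results comes from the delimiters at the start of the input. A minimal sketch of dropping it, assuming empty fragments are never wanted (the names scenario_1 and scenario_2 are just for illustration):

```python
import re

s = "##$hello?? getting good.<li>hii"

# Split as above, then discard empty fragments
scenario_1 = [part for part in re.split(r"[^\w<>.]+", s) if part]
scenario_2 = [part for part in re.split(r"\W+", s) if part]

print(scenario_1)  # ['hello', 'getting', 'good.<li>hii']
print(scenario_2)  # ['hello', 'getting', 'good', 'li', 'hii']
```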
This might be what you're looking for: \w+ matches one or more word characters (letters, digits, or underscore), as many as possible. Here is a working JavaScript example:
var value = "##$hello?? getting good.<li>hii";
var matches = value.match(
new RegExp("\\w+", "gi")
);
console.log(matches)
It works by using \w+, which matches word characters as many times as possible. You could also use [A-Za-z] to match only letters and not digits, as shown here:
var value = "##$hello?? getting good.<li>hii777bloop";
var matches = value.match(
new RegExp("[A-Za-z]+", "gi")
);
console.log(matches)
It matches what is in the brackets one or more times, as many as possible. In this case that is the range a-z of lowercase characters and the range A-Z of uppercase characters. Hope this is what you want.
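Since the question is about Python, here is a rough Python equivalent of the two JavaScript snippets above, using re.findall in place of match with the g flag:

```python
import re

value = "##$hello?? getting good.<li>hii777bloop"

# Word characters: letters, digits and underscore
word_runs = re.findall(r"\w+", value)
# Letters only, so the digits split 'hii777bloop' in two
letter_runs = re.findall(r"[A-Za-z]+", value)

print(word_runs)    # ['hello', 'getting', 'good', 'li', 'hii777bloop']
print(letter_runs)  # ['hello', 'getting', 'good', 'li', 'hii', 'bloop']
```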
For the first scenario, just use a regex to find all runs of word characters plus <, > and .:
In [60]: re.findall(r'[\w<>.]+', s)
Out[60]: ['hello', 'getting', 'good.<li>hii']
For the second one you need to remove the repeated characters only if the words are not valid English words. You can do this using the nltk words corpus and re.sub:
In [61]: import nltk
In [62]: english_vocab = set(w.lower() for w in nltk.corpus.words.words())
In [63]: repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
In [64]: [repeat_regexp.sub(r'\1\2\3', word) if word not in english_vocab else word for word in re.findall(r'[^\W]+', s)]
Out[64]: ['hello', 'getting', 'good', 'li', 'hi']
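To try the idea without downloading the nltk corpus, the sketch below swaps in a tiny hand-written vocabulary. The set known_words and the helper squeeze_unknown are hypothetical stand-ins, not part of nltk. Note that each call removes only one doubled character from a word.

```python
import re

# Hypothetical stand-in for the nltk word list
known_words = {'hello', 'getting', 'good'}
repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')

def squeeze_unknown(word):
    # Leave known words alone; drop one doubled character from the rest
    if word in known_words:
        return word
    return repeat_regexp.sub(r'\1\2\3', word)

s = "##$hello?? getting good.<li>hii"
print([squeeze_unknown(w) for w in re.findall(r'\w+', s)])
# ['hello', 'getting', 'good', 'li', 'hi']
```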
In case you are looking for a solution without regex: string.punctuation gives you a string of all the special (punctuation) characters.
Use it with a list comprehension to achieve your desired result:
>>> import string
>>> my_string = '##$hello?? getting good.<li>hii'
>>> ''.join([(' ' if s in string.punctuation else s) for s in my_string]).split()
['hello', 'getting', 'good', 'li', 'hii'] # desired output
Explanation: Below is a step-by-step walk-through of how it works:
import string # Importing the 'string' module
special_char_string = string.punctuation
# Value of 'special_char_string': '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
my_string = '##$hello?? getting good.<li>hii'
# Generating list of character in sample string with
# special character replaced with whitespace
my_list = [(' ' if item in special_char_string else item) for item in my_string]
# Join the list to form string
my_string = ''.join(my_list)
# Split it based on space
my_desired_list = my_string.strip().split()
The value of my_desired_list will be:
['hello', 'getting', 'good', 'li', 'hii']
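A variation on the same idea, sketched for Python 3: build a translation table that maps every punctuation character to a space, so the per-character comprehension is not needed.

```python
import string

my_string = '##$hello?? getting good.<li>hii'

# Map each punctuation character to a single space
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
words = my_string.translate(table).split()

print(words)  # ['hello', 'getting', 'good', 'li', 'hii']
```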
I have 2 related questions/ issues.
def remove_delimiters(delimiters, s):
    for d in delimiters:
        ind = s.find(d)
        while ind != -1:
            s = s[:ind] + s[ind+1:]
            ind = s.find(d)
    return ' '.join(s.split())

delimiters = [",", ".", "!", "?", "/", "&", "-", ":", ";", "#", "'", "..."]
d_dataset_list = ['hey-you...are you ok?']
d_list = []
for d in d_dataset_list:
    d_list.append(remove_delimiters(delimiters, d))
print(d_list)
Output = 'heyyouare you ok'
What is the best way of avoiding strings being combined together when a delimiter is removed? For example, so that the output is hey you are you ok ?
There may be a number of different sequences of ..., for example .. or .......... etc. How does one go about implementing some form of rule where, if more than one . appears in a row, the whole sequence is removed? I want to avoid hard-coding all sequences in my delimiters list. Thank you
You could try something like this:
Given delimiters d, join them to a regular expression
>>> d = ",.!?/&-:;#'..."
>>> "["+"\\".join(d)+"]"
"[,\\.\\!\\?\\/\\&\\-\\:\\;\\#\\'\\.\\.\\.]"
Split the string using this regex with re.split
>>> s = 'hey-you...are you ok?'
>>> re.split("["+"\\".join(d)+"]", s)
['hey', 'you', '', '', 'are you ok', '']
Join all the non-empty fragments back together
>>> ' '.join(w for w in re.split("["+"\\".join(d)+"]", s) if w)
'hey you are you ok'
Also, if you just want to remove all non-word characters, you can just use the character group \W instead of manually enumerating all the delimiters:
>>> ' '.join(w for w in re.split(r"\W", s) if w)
'hey you are you ok'
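A variation on the approach above, as a sketch: re.escape can build the character class, and a + quantifier lets one match cover a whole run of delimiters such as "...". The function name strip_delimiters is made up for this example.

```python
import re

def strip_delimiters(text, delimiters):
    # One character class for all delimiters; '+' swallows runs like '...'
    char_class = '[' + re.escape(''.join(delimiters)) + ']+'
    # Replace each run with a space, then normalise the whitespace
    return ' '.join(re.sub(char_class, ' ', text).split())

delimiters = [",", ".", "!", "?", "/", "&", "-", ":", ";", "#", "'"]
print(strip_delimiters('hey-you...are you ok?', delimiters))
# hey you are you ok
```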
So first of all, your function for removing delimiters could be simplified greatly by using the replace function (http://www.tutorialspoint.com/python/string_replace.htm)
This would help solve your first question. Instead of just removing the delimiters, replace each one with a space, then collapse the extra spaces using the pattern you already used (split() with no arguments treats consecutive whitespace as a single separator).
A better function, which does this, would be:
def remove_delimiters(delimiters, s):
    new_s = s
    for i in delimiters:  # replace each delimiter in turn with a space
        new_s = new_s.replace(i, ' ')
    return ' '.join(new_s.split())
To answer your second question, I'd say it's time for regular expressions:
>>> import re
>>> ss = 'hey ... you are ....... what?'
>>> print(re.sub(r'\s*\.+\s*', ' ', ss))
hey you are what?
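If single dots should survive and only runs of two or more should go, one possible sketch is the following. The helper name collapse_dot_runs is hypothetical, and the {2,} threshold is an assumption about what "more than one" means here.

```python
import re

def collapse_dot_runs(text):
    # Replace any run of two or more dots with a space
    without_runs = re.sub(r'\.{2,}', ' ', text)
    # Collapse the leftover whitespace
    return ' '.join(without_runs.split())

print(collapse_dot_runs('hey ... you are ....... what?'))  # hey you are what?
print(collapse_dot_runs('hey-you...are you ok.'))          # hey-you are you ok.
```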
I want to replace my string based on the values in my dictionary. I want to try this with regular expression.
d = { 't':'ch' , 'r' : 'gh'}
s = ' Text to replace '
m = re.search('#a pattern to just get each character ',s)
m.group() # this should get me 'T' 'e' 'x' 't' .....
# How can I replace each character in the string s with its corresponding value in my dictionary? I looked at re.sub() but couldn't figure out how it can be used here.
I want to generate an output -> Texch cho gheplace
Using re.sub:
>>> d = { 't':'ch' , 'r' : 'gh'}
>>> s = ' Text to replace '
>>> import re
>>> pattern = '|'.join(map(re.escape, d))
>>> re.sub(pattern, lambda m: d[m.group()], s)
' Texch cho gheplace '
The second argument to the re.sub can be a function. The return value of the function is used as a replacement string.
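One caveat if the keys are longer than one character: regex alternation tries the alternatives left to right, so a short key can shadow a longer one that starts with it. A sketch that sorts the keys longest first, using a hypothetical mapping that is not from the question:

```python
import re

d = {'t': 'ch', 'th': 'z'}  # hypothetical mapping with overlapping keys

# Longest keys first, so 'th' is tried before 't'
keys = sorted(d, key=len, reverse=True)
pattern = re.compile('|'.join(map(re.escape, keys)))

print(pattern.sub(lambda m: d[m.group()], 'the tot'))  # ze choch
```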
If no character in the values of the dictionary also appears as a key in the dictionary, then it's fairly simple. You can use str.replace straight away, like this:
for char in d:
    s = s.replace(char, d[char])
print(s)  # Texch cho gheplace
Even simpler, you can use the following, which works even if the keys appear in any of the values in the dictionary.
s, d = ' Text to replace ', { 't':'ch' , 'r' : 'gh'}
print("".join(d.get(char, char) for char in s))  # Texch cho gheplace
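To see why the distinction matters, here is a small sketch with a hypothetical mapping in which a value is also a key. It assumes Python 3.7+, where dictionaries iterate in insertion order; with a different order the chained result would differ.

```python
d = {'t': 'r', 'r': 'gh'}  # hypothetical: the value 'r' is also a key
s = 'tr'

# Chained replace: the 'r' produced from 't' is replaced again
chained = s
for char in d:
    chained = chained.replace(char, d[char])

# Single pass: each original character is looked up exactly once
single_pass = ''.join(d.get(char, char) for char in s)

print(chained)      # ghgh
print(single_pass)  # rgh
```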