how to remove non-alphanumeric characters except \ / # + -:, | # - python

I have:
import re
cadena = re.sub('[^0-9a-zA-Z]+', '', cadena)
How can I remove all non-alphanumeric characters except these?
\ / # + - : , | #

Make sure to escape the backslash inside the character class, and escape the hyphen so that it is not read as a range. Use a raw string for the regular expression so that no further escaping is needed at the Python string level.
cadena = re.sub(r'[^0-9a-zA-Z\\/#+\-:,|#]+', '', cadena)
Please note that \w also matches the underscore (and, for str patterns in Python 3, non-ASCII letters and digits). If you don't need to strip those, you can simplify the expression:
cadena = re.sub(r'[^\w\\/#+\-:,|#]+', '', cadena)
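To see what the first pattern keeps and drops, here is a minimal sketch. The names KEEP_PATTERN and strip_unwanted and the sample string are made up for illustration; only the character class comes from the answer above.

```python
import re

# Everything that is NOT an ASCII letter, a digit, or one of \ / # + - : , |
# The backslash and the hyphen are escaped inside the character class.
KEEP_PATTERN = re.compile(r'[^0-9a-zA-Z\\/#+\-:,|]+')

def strip_unwanted(text):
    """Remove every character outside the allowed set (hypothetical helper)."""
    return KEEP_PATTERN.sub('', text)

# Made-up sample: the space, '!', '@', '_', '(' and ')' should disappear.
sample = r'a b\c/d#e+f-g:h,i|j!k@l_m(n)'
print(strip_unwanted(sample))
```

The underscore is removed here because the class lists 0-9a-zA-Z explicitly; with \w in its place the underscore would survive.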

Example 1:
(The easiest way, I think.)
import re
# String
string = 'ABC#790#'
# Remove every '#' and '7' from the string. (Put whichever characters you want to take out inside the brackets.)
line = re.sub('[#7]', '', string)
# Print output
print(line)
Example 2:
You can also walk through the string one character at a time and check whether each character is one you don't want in your string.
If the character is one you would like to keep, you append it to a new string; otherwise you skip it.
For example:
String 's' is your string, and string 'd' contains the characters you want to filter out.
s = 'abc/geG#gDvs#h81'
d = '##/'
e = ''
i = 0
j = 0
l = len(s)
g = len(d)
a = False
while i < l:
    # Grab a single letter
    letter1 = s[i:i+1]
    # Add +1
    i = i + 1
    # Look if the letter is in string d
    while j < g:
        # Grab a single letter
        letter2 = d[j:j+1]
        # Add +1
        j = j + 1
        # Look if letter is not the same.
        if letter1 != letter2:
            a = True
        else:
            a = False
            break
    # Reset j
    j = 0
    # Look if a is True or False.
    # True: letter1 is not one of the letters in string d.
    # False: letter1 is one of the letters in string d.
    if a:
        e = e + letter1
# Print the result
print(e)
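The same filtering can be written much more compactly. This is only a sketch of the idea in Example 2; the function name remove_chars is made up, and the unwanted characters '#/' are the distinct characters of string d.

```python
def remove_chars(text, unwanted):
    """Return text without any character that appears in unwanted."""
    unwanted_set = set(unwanted)  # set membership avoids rescanning the string
    return ''.join(ch for ch in text if ch not in unwanted_set)

s = 'abc/geG#gDvs#h81'
print(remove_chars(s, '#/'))

# Built-in alternative: the third argument of str.maketrans lists
# characters that translate() should delete.
print(s.translate(str.maketrans('', '', '#/')))
```

Both calls print the same string; the translate version is probably the better choice when the set of characters is fixed.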

Related

remove consecutive substrings from a string without importing any packages

I want to remove consecutive "a" substrings and replace them with one "a" from a string without importing any packages.
For example, I want to get abbccca from aaabbcccaaaa.
Any suggestions?
Thanks.

This function collapses each run of a given repeated character into a single occurrence:
def remove_duplicated_char(string, char):
    new_s = ""
    prev = ""
    for c in string:
        # Skip c only when it repeats the target character
        if c == prev and c == char:
            continue
        new_s += c
        prev = c
    return new_s
print(remove_duplicated_char("aaabbcccaaaa", "a"))

What's wrong with using a loop?
oldstring = 'aaabbcccaaaa'
# Initialise the new string with the first character of the original,
# as this will always be kept.
newstring = oldstring[0]
# Loop through each character starting at the second character.
# If the preceding character isn't an a, add the current character to
# the new string. If it is an a, add the current character only if
# it isn't an a too.
for i in range(1, len(oldstring)):
    if oldstring[i-1] != 'a':
        newstring += oldstring[i]
    else:
        if oldstring[i] != 'a':
            newstring += oldstring[i]
print(newstring)

Python regular expressions will do it, although that means importing the re module.
If you don't know regular expressions, they are extremely powerful for
this kind of matching.
import re
s = 'aaabbcccaaaa'
print(re.sub('a+', 'a', s))

You can use a function that recursively replaces each doubled 'aa' with a single 'a' until no doubled occurrence remains:
val = 'aaabbcccaaaaaaaaaaa'
def remove_doubles(v):
    v = v.replace('aa', 'a')
    # Recurse while any doubled 'a' is left
    if 'aa' in v:
        return remove_doubles(v)
    return v
print(remove_doubles(val))
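The replace-until-stable idea also works as a plain loop, and it extends to substrings longer than one character, still without importing anything. A rough sketch, where the name collapse_runs and the second sample string are made up:

```python
def collapse_runs(text, sub):
    """Collapse consecutive repeats of sub in text into a single sub."""
    if not sub:
        return text  # nothing sensible to collapse for an empty substring
    doubled = sub + sub
    # Each pass shrinks every doubled occurrence; stop when none is left.
    while doubled in text:
        text = text.replace(doubled, sub)
    return text

print(collapse_runs('aaabbcccaaaa', 'a'))
print(collapse_runs('abababxab', 'ab'))
```

The loop avoids hitting the recursion limit on very long runs, at the cost of rescanning the string on every pass.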

There are many ways to do this. Here's another one:
def remove_duplicates(s, x):
    t = [s[0]]
    for c in s[1:]:
        if c != x or t[-1] != x:
            t.append(c)
    return ''.join(t)
print(remove_duplicates('aaabbcccaaaa', 'a'))

Split string by comma, but ignore commas within brackets

I'm trying to split a string by commas using python:
s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
But I want to ignore any commas within brackets []. So the result for above would be:
["year:2020", "concepts:[ab553,cd779]", "publisher:elsevier"]
Anybody have advice on how to do this? I tried to use re.split like so:
params = re.split(",(?![\w\d\s])", param)
But it is not working properly.

result = re.split(r",(?!(?:[^,\[\]]+,)*[^,\[\]]+])", subject, 0)
,                 # Match the character “,” literally
(?!               # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
  (?:             # Match the regular expression below
    [^,\[\]]      # Match any single character NOT present in the list below
                  #   The literal character “,”
                  #   The literal character “[”
                  #   The literal character “]”
    +             # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    ,             # Match the character “,” literally
  )
  *               # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  [^,\[\]]        # Match any single character NOT present in the list below
                  #   The literal character “,”
                  #   The literal character “[”
                  #   The literal character “]”
  +               # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
  ]               # Match the character “]” literally
)
Updated to support more than 2 items in brackets. E.g.
year:2020,concepts:[ab553,cd779],publisher:elsevier,year:2020,concepts:[ab553,cd779,xx345],publisher:elsevier
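A quick way to try the pattern on the longer sample. The variable names are illustrative; the regex is the one from this answer. It assumes brackets are not nested and that bracketed items are non-empty.

```python
import re

# Split on commas that are NOT followed by "item,item,...,item]",
# i.e. commas that are not inside a bracketed list.
SPLIT_OUTSIDE_BRACKETS = re.compile(r",(?!(?:[^,\[\]]+,)*[^,\[\]]+])")

subject = ("year:2020,concepts:[ab553,cd779],publisher:elsevier,"
           "year:2020,concepts:[ab553,cd779,xx345],publisher:elsevier")

for part in SPLIT_OUTSIDE_BRACKETS.split(subject):
    print(part)
```

Each key:value pair ends up on its own line, with both bracketed lists left intact.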

This regex works on your example:
,(?=[^,]+?:)
Here, we use a positive lookahead to find commas that are followed by one or more non-comma characters and then a colon. This correctly finds the <comma><key> pattern you are searching for. Of course, if the keys are allowed to contain commas, or the bracketed values can contain colons, this would have to be adapted a little further.
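A small sketch of how this behaves, including one input where it would go wrong. The second input string is made up to show the limitation.

```python
import re

KEY_COMMA = re.compile(r",(?=[^,]+?:)")

s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
# Only commas that are followed by "key:" are used as split points.
print(KEY_COMMA.split(s))

# Caveat: a colon inside the brackets makes an inner comma look like
# the start of a new key, so the bracketed value is split as well.
print(KEY_COMMA.split("k:[a:1,b:2]"))
```

So the pattern relies on the data format (keys before colons, no colons inside brackets) rather than on the brackets themselves.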

You can work this out using a user-defined function instead of split:
s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
def split_by_commas(s):
    lst = list()
    last_bracket = ''
    word = ""
    for c in s:
        if c == '[' or c == ']':
            last_bracket = c
        if c == ',' and last_bracket == ']':
            lst.append(word)
            word = ""
            continue
        elif c == ',' and last_bracket == '[':
            word += c
            continue
        elif c == ',':
            lst.append(word)
            word = ""
            continue
        word += c
    lst.append(word)
    return lst
main_lst = split_by_commas(s)
print(main_lst)
The result of running the above code:
['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier']

A pattern that uses only a lookahead to assert a closing bracket to the right does not check that there is a matching opening bracket to the left.
Instead of using split, you could match either one or more bracketed values (together with the non-comma text around them), or any run of characters other than a comma.
(?:[^,]*\[[^][]*])+[^,]*|[^,]+
import re
s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
params = re.findall(r"(?:[^,]*\[[^][]*])+[^,]*|[^,]+", s)
print(params)
Output
['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier']

I adapted @Bemwa's solution (which didn't work for my use case) so that it counts the bracket depth:
def split_by_commas(s):
    lst = list()
    brackets = 0
    word = ""
    for c in s:
        if c == "[":
            brackets += 1
        elif c == "]":
            if brackets > 0:
                brackets -= 1
        elif c == "," and not brackets:
            lst.append(word)
            word = ""
            continue
        word += c
    lst.append(word)
    return lst
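Counting the depth matters once brackets are nested, because a comma that follows an inner "]" can still be inside an outer "[". A minimal sketch of the same idea; the name split_top_level, its parameters, and the nested sample are made up for illustration.

```python
def split_top_level(text, sep=",", open_ch="[", close_ch="]"):
    """Split text on sep, ignoring separators nested inside brackets."""
    parts = []
    depth = 0       # current bracket nesting level
    current = []    # characters of the piece being built
    for ch in text:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
        elif ch == sep and depth == 0:
            # Top-level separator: finish the current piece.
            parts.append("".join(current))
            current = []
            continue
        current.append(ch)
    parts.append("".join(current))
    return parts

print(split_top_level("year:2020,concepts:[ab553,cd779],publisher:elsevier"))
print(split_top_level("a:[x,[y,z],w],b:1"))
```

In the second call the comma before "w" is kept because the depth is still 1 at that point.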

split a string to have chunks containing the maximum number of possible characters

e.g. string = 'bananaban'
=> ['ban', 'anab', 'an']
My attempt:
def apart(string):
    letters = []
    for i in string:
        if i not in letters:
            letters.append(i)
    print("The letters are:" + str(letters))
    x = []
    result = []
    return result
string = str(input("Enter string: "))
print(apart(string))
Basically, once I know all the letters that occur in the word/string, I want to keep adding characters to x until x contains every one of those letters. Then I want to add x to result.
In my example "bananaban", this means [ban] is one x, because "ban" contains the letters "b", "a" and "n". The same goes for [anab]. [an] only contains "a" and "n" because it is the end of the word.
It would be great if somebody could help me.

If I understand correctly, you want to start a new chunk as soon as the current chunk contains every distinct character.
You could use a set to keep track of the seen characters:
s = 'bananaban'
seen = set()
letters = set(s)
out = ['']
for c in s:
    if seen != letters:
        out[-1] += c
        seen.add(c)
    else:
        seen = set(c)
        out.append(c)
print(out)
Output: ['ban', 'anab', 'an']

The logical way seems to be to first create a set with all the letters in your string, then go over the original string, collecting each character and starting a new collection each time the set of letters in the collection matches the original set.
def apart(string):
    target = set(string)
    result = []
    component = ""
    for char in string:
        component += char
        if set(component) == target:
            result.append(component)
            component = ""
    if component:
        result.append(component)
    return result

Using a set of the characters in the string, you can loop through the string and add or extend the last group in your resulting list:
S = "bananaban"
chars = set(S)   # distinct characters of string
groups = [""]    # start with an empty group
for c in S:
    if chars.issubset(groups[-1]):   # group contains all characters
        groups.append(c)             # start a new group
    else:
        groups[-1] += c              # append character to last group
print(groups)
['ban', 'anab', 'an']
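The answers above all follow the same pattern, which could also be packaged as a generator. This is just a sketch and iter_chunks is a made-up name; it keeps a running set for the current chunk instead of rebuilding it for every character.

```python
def iter_chunks(text):
    """Yield chunks of text, closing a chunk once it holds every distinct character."""
    needed = set(text)   # all distinct characters of the whole string
    seen = set()         # distinct characters in the current chunk
    start = 0            # index where the current chunk begins
    for i, ch in enumerate(text):
        seen.add(ch)
        if seen == needed:
            yield text[start:i + 1]
            start = i + 1
            seen = set()
    if start < len(text):
        yield text[start:]   # trailing chunk that never became complete

print(list(iter_chunks("bananaban")))
```

For "bananaban" this gives the same three chunks as the expected output in the question.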

Print the substring that is present between two uppercase letters in a string

Here is my code:
x = str(input())
w = ' '
w2 = ''
for i in x:
    if i.isupper() == True:
        w2 += w
    else:
        w2 += i
print(w2)
I have converted the uppercase letters to spaces. Can anyone suggest what to do next?
Input: abCdefGh
Expected output: def

To print the substring between two uppercase letters in the string:
Step 1: Find the index positions of the uppercase letters.
Step 2: Slice the substring from the first index + 1 up to the second index in the list (string[start:end]).
word = "abCdefGh"
# Get indices of the capital letters in a string:
def getindices(word):
    indices = []
    for index, letter in enumerate(word):
        if letter.isupper():
            indices.append(index)
    return indices
if __name__ == "__main__":
    caps_indices = getindices(word)
    start = caps_indices[0] + 1
    end = caps_indices[1]
    print(word[start:end])
Output:
def

I use a flag to track whether the program should add the character i to w2.
Note the condition on the flag; it is the key to this code.
x = "abCdefGh"
w2 = ''
inBetween = False
for i in x:
    if i.isupper():
        # Upper case denotes either start or end of substring
        if not inBetween:
            # start of substring
            inBetween = True
            continue
        else:
            # end of substring
            inBetween = False
            break
    if inBetween:
        w2 += i
print(w2)
Result is:
def

You can use regex to extract text easily.
import re
# pattern to accept everything between the capital letters.
pattern = '[A-Z](.*?)[A-Z]'
# input string
input_string= 'thisiSmyString'
matched_string = re.findall(pattern, input_string)
print(matched_string[0])
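If the string has more than two capitals and you want every piece between neighbouring capitals, the closing capital must not be consumed by the match. A possible sketch; the function name between_capitals and the second sample are made up, and [A-Z] only covers ASCII capitals, unlike str.isupper().

```python
import re

def between_capitals(text):
    """Return every run of non-capital characters that sits between two capitals."""
    # The closing capital is only looked at (lookahead), not consumed, so it
    # can also serve as the opening capital of the next match.
    return re.findall(r'[A-Z]([^A-Z]*)(?=[A-Z])', text)

print(between_capitals('abCdefGh'))
print(between_capitals('abCdefGhiJk'))
```

For the question's input this returns a list with the single item 'def'.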

Find out word at specific index

I have a string with multiple words separated by underscores like this:
string = 'this_is_my_string'
For example, string[n] returns a single letter.
For a given index, I want to get the whole word between the underscores that contains it.
So for string[12] I'd want to get back the word 'string', and for string[1] I'd get back 'this'.

A very simple approach using string slicing is to:
Step 1: Slice the string into two parts at the position.
Step 2: split() each part on _.
Step 3: Concatenate the last item from part 1 and the first item from part 2.
Sample code:
>>> my_string = 'this_is_my_sample_string'
#                              ^ index 14
>>> pos = 14
>>> my_string[:pos].split('_')[-1] + my_string[pos:].split('_')[0]
'sample'
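The same lookup can be done without building the split lists by searching for the nearest separators on either side of the position. A sketch only; word_at and its sep parameter are made-up names.

```python
def word_at(text, pos, sep="_"):
    """Return the sep-delimited word of text that covers index pos."""
    if not 0 <= pos < len(text):
        raise IndexError("pos is outside the string")
    start = text.rfind(sep, 0, pos) + 1   # rfind gives -1 when no separator precedes pos
    end = text.find(sep, pos)             # first separator at or after pos
    if end == -1:
        end = len(text)
    return text[start:end]

print(word_at('this_is_my_string', 12))
print(word_at('this_is_my_string', 1))
```

Like the slicing version, this returns the preceding word when pos points at an underscore.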

This should work:
string = 'this_is_my_string'
words = string.split('_')
idx = 0
indexes = {}
for word in words:
    for _ in range(len(word)):
        indexes[idx] = word  # map each character position to its word
        idx += 1
    idx += 1  # skip the underscore that follows the word
print(indexes[1]) # this
print(indexes[12]) # string

The following code works. You can change the index and string variables to adapt it to new strings. You can also wrap the code in a function to generalize it.
string = 'this_is_my_string'
sp = string.split('_')
index = 12
total_len = 0
for word in sp:
    total_len += (len(word) + 1)  # The '+1' accounts for the underscore
    if index < total_len:
        result = word
        break
print(result)

A little bit of regular expression magic does the job:
import re
def wordAtIndex(text, pos):
    p = re.compile(r'(_|$)')
    beg = 0
    for m in p.finditer(text):
        #(end, sym) = (m.start(), m.group())
        #print (end, sym)
        end = m.start()
        if pos < end:    # 'pos' is within current split piece
            break
        beg = end+1      # advance to next split piece
    if pos == beg-1:     # handle case where 'pos' is index of split character
        return ""
    else:
        return text[beg:end]
text = 'this_is_my_string'
for i in range(0, len(text)+1):
    print("Text["+str(i)+"]: ", wordAtIndex(text, i))
It splits the input string at '_' characters or at end-of-string, and then iteratively compares the given position index with the actual split position.
