string substitution using regex - python

I have this string in python
a = "haha"
result = "hh"
What I would like to achieve is to use a regex to replace all occurrences of "aha" with "h", all "oho" with "h", and all "ehe" with "h".
"h" is just an example. Basically, I would like to retain the centre character. In other words, if it's 'eae', I would like it to be changed to 'a'.
My regex would be this
"aha|oho|ehe"
I thought of doing this
import re
reg = re.compile('aha|oho|ehe')
However, I am stuck on how to achieve this kind of substitution without using loops to iterate through all the possible combinations.

You can use re.sub:
import re
print(re.sub('aha|oho|ehe', 'h', 'haha'))  # hh
print(re.sub('aha|oho|ehe', 'h', 'hoho'))  # hh
print(re.sub('aha|oho|ehe', 'h', 'hehe'))  # hh
print(re.sub('aha|oho|ehe', 'h', 'hehehahoho'))  # hhhahh

What about re.sub(r'[aeo]h[aeo]','h',a) ?
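Both suggestions above hard-code "h" as the replacement. Since the question asks to keep whatever the centre character is, one option is a capture group plus a backreference. The following is a minimal sketch, assuming the two outer letters must be the same vowel from the set a, e, o and the centre is a single word character; the helper name keep_centre is made up for illustration.

```python
import re

# ([aeo]) captures the outer vowel, (\w) captures the centre character,
# and \1 requires the same vowel again on the right-hand side.
PATTERN = re.compile(r'([aeo])(\w)\1')

def keep_centre(text):
    """Replace each vowel-X-vowel triple with X (hypothetical helper)."""
    return PATTERN.sub(r'\2', text)

print(keep_centre('haha'))        # hh
print(keep_centre('eae'))         # a
print(keep_centre('hehehahoho'))  # hhhahh
```

Unlike [aeo]h[aeo], this pattern does not match mixed triples such as 'ahe', because \1 must repeat the vowel that was captured first.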

Related

If a special character exists, then display?

I'm learning python. I'm trying to identify rows of data where the string value includes a special character.
import pandas as pd
cn = pd.read_excel(f"../Files/df.xlsx", sheet_name='Values')
cn = cn[['DestinationName']]
special_characters = "!##$%^&*()-+?_=,<>/"
cn['Special Characters'] = ["Y" if any(c in special_characters for c in cn) else "N"]
Basically, I'd like to either only display rows that include any of the special characters, or create a separate column to show whether Yes (it includes a special character) or No. For example, Red & Blue has the "&" character so it should be flagged as Yes, while RedBlue shouldn't.
I'm a little stuck, and any help would be appreciated
I would recommend using sets for this specific task:
Create a set from your string of special characters.
Create a new column containing the following boolean: "the intersection of special_characters and the string in column "DestinationName" is non-empty".
It should look like this:
special_characters_set = set(special_characters)
cn["Special Characters"] = cn["DestinationName"].apply(lambda x: len(set(x).intersection(special_characters_set)) != 0)
Where
# list('hello') = ['h', 'e', 'l', 'l', 'o'] # ordered and repetitions
# set(list('hello')) = {'h', 'e', 'l', 'o'} # non ordered and no repetitions
Keep in mind that .apply() is not the most computationally efficient way to manipulate dataframes.
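As an alternative to the set intersection, the check can be expressed as a regular-expression character class. This is only a sketch: the helper name has_special is hypothetical, and it returns "Y"/"N" because that is what the question asked for.

```python
import re

special_characters = "!##$%^&*()-+?_=,<>/"

# re.escape makes characters such as ^, - and ) safe inside the [...] class.
pattern = re.compile("[" + re.escape(special_characters) + "]")

def has_special(text):
    """Return "Y" if text contains any special character, else "N" (hypothetical helper)."""
    return "Y" if pattern.search(text) else "N"

print(has_special("Red & Blue"))  # Y
print(has_special("RedBlue"))     # N
```

With pandas, the same pattern string can be passed to Series.str.contains, for example cn["DestinationName"].str.contains(pattern.pattern, regex=True, na=False), which yields a boolean column without going through .apply().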

splitting txt based on ':' but excluding the timestamp in python

05-23 14:14:53.275 A:B:C
In the above case I am trying to split the text on ':' using line.split(':'), and the output should be
['05-23 14:14:53.275','A','B','C']
but instead the output is
['05-23 14','14','53.275','A','B','C']
It is also splitting the timestamp.
How do I exclude the timestamp from splitting?
Your expected output is also split at the last space. An easy solution is to split on the last space first and then split the second part on ':':
s = '05-23 14:14:53.275 A:B:C'
front, back = s.rsplit(maxsplit=1)
[front] + back.split(':')
# ['05-23 14:14:53.275', 'A', 'B', 'C']
Split the line on whitespaces once, starting from the right:
parts = line.rsplit(maxsplit=1)
Combine the first part with the last part split on the colons:
parts[:1] + parts[-1].rsplit(":")
['05-23 14:14:53.275', 'A', 'B', 'C']
Just for fun of using walrus:
>>> s = '05-23 14:14:53.275 A:B:C'
>>> [(temp := s.rsplit(maxsplit=1))[0], *temp[1].split(':')]
['05-23 14:14:53.275', 'A', 'B', 'C']
I would suggest you use regex to split this.
([-:\s\d.]*)\s([\w:]*)
Try it in an online regex tester to see how it splits. Once you get your regex right, you can use the groups to select the part you want and work on that.
import re
line = '05-23 14:14:53.275 A:B:C'  # avoid naming this str, which would shadow the built-in
regex = r'([-:\s\d.]*)\s([\w:]*)'  # raw string so the backslashes reach the regex engine
groups = re.compile(regex).match(line).groups()
timestamp = groups[0]
restofthestring = groups[1]
# Now you can split the second part using split
splits = restofthestring.split(':')
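The rsplit approach from the earlier answers can be wrapped in a small function for processing many lines. This is a sketch only: the function name split_log_line and the assumption that the colon-separated payload is the last whitespace-separated token are illustrative, not taken from the question.

```python
def split_log_line(line):
    """Split 'timestamp payload' into [timestamp, field, field, ...].

    Assumes the payload is the last whitespace-separated token.
    """
    parts = line.rsplit(maxsplit=1)
    if len(parts) < 2:
        # No whitespace separator found: return what there is unchanged.
        return parts
    timestamp, payload = parts
    return [timestamp] + payload.split(':')

print(split_log_line('05-23 14:14:53.275 A:B:C'))
# ['05-23 14:14:53.275', 'A', 'B', 'C']
```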

I want to write code to replace some arguments

My code starts like this:
complementDNA = originalDNA.replace('a' , 't' , 't' , 'a')
and when I run it, it says
complementDNA = originalDNA.replace('a' , 't' , 't' , 'a')
TypeError: replace() takes at most 3 arguments (4 given)
Assuming originalDNA is a string, I think you don't want to replace, you want to translate, i.e.:
originalDNA = 'atgta' # Know nothing about DNA btw
complement_table = str.maketrans('at', 'ta')
complementDNA = originalDNA.translate(complement_table)
# complementDNA is now 'tagat'
To give a brief explanation, when called with strings, maketrans takes two or three arguments (it also accepts a single dict instead). The first two arguments are strings of equal length, where each character of the first argument will be replaced by the character at the same position in the second argument. The optional third argument is another string containing the characters you want to delete.
So, for example, str.maketrans('ac', 'ca', 'b') will replace 'a' with 'c', 'c' with 'a' and delete all 'b'.
'abccba'.translate(str.maketrans('ac', 'ca', 'b')) will then be 'caac'
replace takes two required arguments, replace(before, after), plus an optional count.
You would have to call it once for 'a' to 't' and once for 't' to 'a'. That would not give the right answer, because the second call would also change the 't' characters produced by the first. One way to do it is to convert the DNA to a list of characters and iterate over them, manually converting 'a' to 't' and 't' to 'a', like so:
DNAlist = []
for character in originalDNA:
    DNAlist.append(character)
for i in range(0, len(DNAlist)):
    if DNAlist[i] == 'a':
        DNAlist[i] = 't'
    elif DNAlist[i] == 't':
        DNAlist[i] = 'a'
# Convert the list back to string
DNAstring = ''.join(DNAlist)
That said, I would suggest working with lists until you have to convert the DNA to a string. Strings are immutable in Python, i.e. they can't be changed in place, only created anew each time. Therefore, repeated string operations can be expensive.
If you read the documentation of str.replace, you will see that it replaces all occurrences of the first argument with the second argument.
To compute the complementary DNA strand of a given DNA strand with str.replace, you have to go through a temporary placeholder character ('x' here, which must not occur in the input):
dna = "atgcgctagctcattt"
# Replace A by T and T by A.
cdna = dna.replace('a', 'x')
cdna = cdna.replace('t', 'a')
cdna = cdna.replace('x', 't')
# Replace G by C and C by G.
cdna = cdna.replace('g', 'x')
cdna = cdna.replace('c', 'g')
cdna = cdna.replace('x', 'c')
However it is probably more efficient to use str.translate:
dna = "atgcgctagctcattt"
table = str.maketrans("atgc", "tacg")  # named table to avoid shadowing the built-in map
cdna = dna.translate(table)
which is similar to Jose's answer. In both cases the result will be:
cdna = "tacgcgatcgagtaaa"
I hope this will help you.
The method str.replace() takes at most three arguments: the substring to replace, its replacement, and optionally how many times to replace (omit it to replace all). You can't swap both characters in a single call. Try:
complementDNA = originalDNA.replace('a' , 'x').replace('t', 'a').replace('x', 't')
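To see why the answers above either use a placeholder or switch to str.translate, it may help to compare naive chaining with a translation table on the same input. A minimal sketch, with variable names chosen only for illustration:

```python
original = 'atgta'

# Naive chaining: the second call also rewrites the 't's created by the first.
naive = original.replace('a', 't').replace('t', 'a')

# A translation table maps every character in a single pass.
table = str.maketrans('at', 'ta')
swapped = original.translate(table)

print(naive)    # aagaa
print(swapped)  # tagat
```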

Regex for parameter of a function

I'm fairly inexperienced with regex, but I need one to match the parameter of a function. This function will appear multiple times in the string, and I would like to return a list of all parameters.
The regex must match:
Alphanumeric and underscore
Inside quotes directly inside parenthesis
After a specific function name
Here's an example string:
Generic3(p, [Generic3(g, [Atom('_xyx'), Atom('y'), Atom('z_')]), Atom('x_1'), Generic2(f, [Atom('x'), Atom('y')])])
and I would like this as output:
['_xyx', 'y', 'z_', 'x_1', 'x', 'y']
What I have so far:
(?<=Atom\(')[\w|_]*
I'm calling this with:
import re
s = "Generic3(p, [Generic3(g, [Atom('x'), Atom('y'), Atom('z')]), Atom('x'), Generic2(f, [Atom('x'), Atom('y')])])"
print(re.match(r"(?<=Atom\(')[\w|_]*", s))
But this just prints None. I feel like I'm nearly there, but I'm missing something, maybe on the Python side to actually return the matches.
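Your regex is close. \w already matches the underscore, so the [\w|_] class can simply be \w; the actual problem is that re.match only matches at the start of the string. Use re.findall to collect every match:
import re
s = "Generic3(p, [Generic3(g, [Atom('_xyx'), Atom('y'), Atom('z_')]), Atom('x_1'), Generic2(f, [Atom('x'), Atom('y')])])"
r = r"(?<=Atom\(')\w+"
final_data = re.findall(r, s)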
You can also try this:
import re
s = "Generic3(p, [Generic3(g, [Atom('_xyx'), Atom('y'), Atom('z_')]), Atom('x_1'), Generic2(f, [Atom('x'), Atom('y')])])"
new_data = re.findall(r"Atom\('(.*?)'\)", s)
Output:
['_xyx', 'y', 'z_', 'x_1', 'x', 'y']
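The None in the question comes from how re.match works rather than from the pattern itself. The sketch below contrasts re.match, re.search and re.findall on a shortened version of the example string (the shortened string is an assumption made to keep the output readable):

```python
import re

s = "Generic3(p, [Atom('_xyx'), Atom('x_1')])"
pattern = r"(?<=Atom\(')\w+"

# re.match only tries position 0, where the string starts with "Generic3".
print(re.match(pattern, s))            # None
# re.search scans forward and returns the first match only.
print(re.search(pattern, s).group())   # _xyx
# re.findall returns every non-overlapping match as a list of strings.
print(re.findall(pattern, s))          # ['_xyx', 'x_1']
```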

Python: replace french letters with english

I would like to replace all the French letters within words with their ASCII equivalents.
letters = [['é', 'à'], ['è', 'ù'], ['â', 'ê'], ['î', 'ô'], ['û', 'ç']]
for x in letters:
    for a in x:
        a = a.replace('é', 'e')
        a = a.replace('à', 'a')
        a = a.replace('è', 'e')
        a = a.replace('ù', 'u')
        a = a.replace('â', 'a')
        a = a.replace('ê', 'e')
        a = a.replace('î', 'i')
        a = a.replace('ô', 'o')
        a = a.replace('û', 'u')
        a = a.replace('ç', 'c')
print(letters[0][0])
This code prints é however. How can I make this work?
May I suggest you consider using translation tables.
translationTable = str.maketrans("éàèùâêîôûç", "eaeuaeiouc")
test = "Héllô Càèùverâêt Jîôûç"
test = test.translate(translationTable)
print(test)
will print Hello Caeuveraet Jiouc. Pardon my French.
You can also use unidecode. Install it: pip install unidecode.
Then, do:
from unidecode import unidecode
s = "Héllô Càèùverâêt Jîôûç ïîäüë"
s = unidecode(s)
print(s) # Hello Caeuveraet Jiouc iiaue
The result is the same string with the French characters converted to their ASCII equivalents: Hello Caeuveraet Jiouc iiaue
The replace function returns a new string with the character replaced; it does not modify the original.
In your code this return value is only assigned to the loop variable a, which does not change the element stored in letters.
You need to store the result back into the list (or into a new one) so you can print it at the end.
This post explains how variables within loops are accessed.
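One way to actually keep the converted values is to build a new nested list rather than rebinding the loop variable. This is a sketch that borrows the translation table from the earlier answer; the name ascii_letters is made up for illustration.

```python
letters = [['é', 'à'], ['è', 'ù'], ['â', 'ê'], ['î', 'ô'], ['û', 'ç']]
table = str.maketrans("éàèùâêîôûç", "eaeuaeiouc")

# Build a new nested list; each element is translated and stored.
ascii_letters = [[item.translate(table) for item in row] for row in letters]

print(ascii_letters[0][0])  # e
print(ascii_letters)
# [['e', 'a'], ['e', 'u'], ['a', 'e'], ['i', 'o'], ['u', 'c']]
```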
Although I am new to Python, I would approach it this way:
letterXchange = {'à':'a', 'â':'a', 'ä':'a', 'é':'e', 'è':'e', 'ê':'e', 'ë':'e',
                 'î':'i', 'ï':'i', 'ô':'o', 'ö':'o', 'ù':'u', 'û':'u', 'ü':'u', 'ç':'c'}
text = input()  # Replace it with the string in your code.
for item in list(text):
    if item in letterXchange:
        # Replace every occurrence of the accented character with its ASCII counterpart.
        text = text.replace(item, letterXchange[item])
print(text)
Here is another solution, using the low level unicode package called unicodedata.
In Unicode, a character like 'ù' is a precomposed character that decomposes into the character 'u' and another character called 'COMBINING GRAVE ACCENT', which is basically the '̀'. Using the decomposition function in unicodedata, one can obtain the code points (in hex) of these two parts.
>>> import unicodedata as ud
>>> ud.decomposition('ù')
'0075 0300'
>>> chr(0x0075)
'u'
>>> chr(0x0300)
'̀'
Therefore, to retrieve 'u' from 'ù', we can first split the string, then use the built-in int function for the conversion (see this thread for converting a hex string to an integer), and then get the character using the chr function.
import unicodedata as ud

def get_ascii_char(c):
    s = ud.decomposition(c)
    if s == '':  # for an indecomposable character, it returns ''
        return c
    # The first field is the base character's code point in hex.
    code = int(s.split()[0], 16)
    return chr(code)
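For whole strings, a common alternative to decomposing one character at a time is to normalize to NFD and drop the combining marks. The sketch below uses only unicodedata.normalize and unicodedata.combining from the standard library; the function name strip_accents is hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after canonical decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFD', text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('Héllô Càèùverâêt Jîôûç'))  # Hello Caeuveraet Jiouc
```

Characters without a canonical decomposition pass through unchanged, so the result is not guaranteed to be pure ASCII for arbitrary input.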
