I'm learning Python. I'm trying to identify rows of data where the string value includes a special character.
import pandas as pd
cn = pd.read_excel(f"../Files/df.xlsx", sheet_name='Values')
cn = cn[['DestinationName']]
special_characters = "!##$%^&*()-+?_=,<>/"
cn['Special Characters'] = ["Y" if any(c in special_characters for c in cn) else "N"]
Basically, I'd like to either only display rows that include any of the special characters, or create a separate column to show whether Yes (it includes a special character) or No. For example, Red & Blue has the "&" character so it should be flagged as Yes, while RedBlue shouldn't.
I'm a little stuck, and any help would be appreciated
I would recommend using sets for this specific task:
Create a set from your string of special characters.
Create a new column containing the following boolean: "the intersection of the special-character set and the set of characters in the "DestinationName" value is non-empty".
It should look like this:
special_characters_set = set(list(special_characters))
cn["Special Characters"] = cn["DestinationName"].apply(lambda x : len(set(list(x)).intersection(special_characters_set)) != 0)
Where
# list('hello') = ['h', 'e', 'l', 'l', 'o'] # ordered and repetitions
# set(list('hello')) = {'h', 'e', 'l', 'o'} # non ordered and no repetitions
Keep in mind that .apply() is not the most computationally efficient way to manipulate DataFrames.
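A minimal sketch of the same set-based check on plain strings, without pandas, using the two sample values from the question. The helper name has_special is hypothetical, and set.isdisjoint is swapped in for computing the full intersection:

```python
# Characters to flag, copied from the question.
special_characters = "!##$%^&*()-+?_=,<>/"
special_characters_set = set(special_characters)

def has_special(text):
    """Return True if text contains at least one special character."""
    # isdisjoint() is True when the two collections share no element.
    return not special_characters_set.isdisjoint(text)

names = ["Red & Blue", "RedBlue"]
flags = ["Y" if has_special(name) else "N" for name in names]
print(flags)  # ['Y', 'N']
```

Assuming the column holds only strings, the same function could be passed to cn["DestinationName"].apply(has_special) to get a boolean column.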
I need to take the string 'AEN' as input and split it into ('A', 'E', 'N').
I've tried several different splits, but none of them produces what I need.
The image shows the code I have done.
What I'm trying to do is make x produce a result like y, but I'm having trouble achieving it.
x=input('Letras: ')
y=input('Letras: ')
print(x.split())
print(y.split())
Letras: AEN
Letras: A E N
['AEN']
['A', 'E', 'N']
You just want list, which will take an arbitrary iterable and produce a new list, one item per element. A string is considered to be an iterable of individual characters.
>>> list('AEN')
['A', 'E', 'N']
str.split is for splitting a string based on a given delimiter (or arbitrary whitespace, when no delimiter is given). For example,
>>> 'AEN'.split('E')
['A', 'N']
When the given delimiter is not found in the string, the result is a list containing a single string, identical to the original.
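A short sketch contrasting the three calls. It assumes a literal string in place of input(), and uses tuple() because the question asks for ('A', 'E', 'N'):

```python
x = "AEN"  # stands in for input('Letras: ')

letters = list(x)     # one list item per character
as_tuple = tuple(x)   # same idea, but produces a tuple
no_delim = x.split()  # no whitespace in x, so nothing is split

print(letters)   # ['A', 'E', 'N']
print(as_tuple)  # ('A', 'E', 'N')
print(no_delim)  # ['AEN']
```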
def getUniqueWords(wordsList) :
    """
    Returns the subset of unique words containing all of the words that are presented in the
    text file and will only contain each word once. This function is case sensitive
    """
    uniqueWords = {}
    for word in speech :
        if word not in speech:
            uniqueWords[word] = []
        uniqueWords[word].append(word)
Assuming you are passing a clean list of words to getUniqueWords(), you can always return a set of the list which will, because of the properties of a set, remove duplicates.
Try:
def getUniqueWords(wordsList):
    return set(wordsList)
Note: Questions are written in Markdown, so enclosing your code in backticks formats it nicely in a grey box. Single backticks make the box inline, like this, and three backticks with the language name on the first line give a full block.
Edit: To help with your comment
You can do what calling set() on a list does, but manually:
wordList = ['b', 'c', 'b', 'a', 'd', 'd', 'f']
def getUniqueWords(wordList):
    unique = set()
    for word in wordList:
        unique.add(word)
    return unique
print(getUniqueWords(wordList))
This is what calling set() on a list does. Also, forbidding built-in functions in an open-ended question (without specifying a method) is a silly addition to any question, especially when you're using Python.
text = 'a, a, b, b, b, a'
u = set(text.split(', '))
# u={'a', 'b'}
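One caveat with the set-based answers is that a set does not keep the original word order. If order matters, a possible alternative (not part of the answers above) is dict.fromkeys, which relies on dictionaries preserving insertion order in Python 3.7 and later. The sample list is reused from the earlier answer:

```python
words = ['b', 'c', 'b', 'a', 'd', 'd', 'f']

# dict keys are unique and keep first-seen order (Python 3.7+).
unique_in_order = list(dict.fromkeys(words))

print(unique_in_order)  # ['b', 'c', 'a', 'd', 'f']
```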
My code starts like this:
complementDNA = originalDNA.replace('a' , 't' , 't' , 'a')
and when I run it, it says:
complementDNA = originalDNA.replace('a' , 't' , 't' , 'a')
TypeError: replace() takes at most 3 arguments (4 given)
Assuming originalDNA is a string, I think you don't want to replace, you want to translate, i.e.:
originalDNA = 'atgta' # Know nothing about DNA btw
complement_table = str.maketrans('at', 'ta')
complementDNA = originalDNA.translate(complement_table)
# complementDNA is now 'tagat'
To give a brief explanation, str.maketrans takes between one and three arguments; the single-argument form expects a dictionary and is not used here. With two or three arguments, the first two are strings of equal length, and each character of the first argument will be replaced by the character at the same position in the second argument. The optional third argument is another string containing the characters you want to delete.
So, for example, str.maketrans('ac', 'ca', 'b') will replace 'a' with 'c', 'c' with 'a' and delete all 'b'.
'abccba'.translate(str.maketrans('ac', 'ca', 'b')) will then be 'caac'
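For comparison, a small sketch of the dictionary form of str.maketrans, where a value of None deletes the character. The input string 'atgnta' and the choice to delete 'n' are made up for illustration:

```python
# Keys are single characters; a None value deletes that character.
table = str.maketrans({'a': 't', 't': 'a', 'n': None})

result = 'atgnta'.translate(table)
print(result)  # tagat
```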
str.replace takes two required arguments, replace(old, new), plus an optional count.
You would have to replace 'a' with 't' and 't' with 'a' in separate calls, and that would not give the right answer, because the second call would also change the characters produced by the first. One way to do it is to convert the DNA to a list of characters and iterate over them, manually converting 'a' to 't' and 't' to 'a', like so:
DNAlist = []
for character in originalDNA:
    DNAlist.append(character)
for i in range(0, len(DNAlist)):
    if DNAlist[i] == 'a':
        DNAlist[i] = 't'
    elif DNAlist[i] == 't':
        DNAlist[i] = 'a'
# Convert the list back to string
DNAstring = ''.join(DNAlist)
That said, I would suggest working with lists until you actually need the DNA as a string. Strings are immutable in Python, i.e. they can't be changed in place; a new one is created every time. Therefore, repeated string operations can be expensive.
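To illustrate why chaining the two replacements directly does not give the right answer, here is a small sketch using the made-up strand 'atgta':

```python
original = "atgta"

# After the first call every 'a' is already a 't', so the second call
# turns both the original and the newly created 't' characters into 'a'.
naive = original.replace('a', 't').replace('t', 'a')

print(naive)  # aagaa, not the expected tagat
```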
If you read the documentation of str.replace, you will see that it replaces all occurrences of the first argument with the second argument.
To compute the complementary DNA strand of a given DNA strand with str.replace you have to do the following:
dna = "atgcgctagctcattt"
# Replace A by T and T by A.
cdna = dna.replace('a', 'x')
cdna = cdna.replace('t', 'a')
cdna = cdna.replace('x', 't')
# Replace G by C and C by G.
cdna = cdna.replace('g', 'x')
cdna = cdna.replace('c', 'g')
cdna = cdna.replace('x', 'c')
However, it is probably more efficient to use str.translate:
dna = "atgcgctagctcattt"
# Named "table" rather than "map" to avoid shadowing the built-in map().
table = str.maketrans("atgc", "tacg")
cdna = dna.translate(table)
which is similar to Jose's answer. In both cases the result will be:
cdna = "tacgcgatcgagtaaa"
I hope this will help you.
The method str.replace() takes at most three arguments: the substring to replace, its replacement, and optionally how many occurrences to replace (omit it to replace all). You can't swap both characters in a single call. Try:
complementDNA = originalDNA.replace('a' , 'x').replace('t', 'a').replace('x', 't')
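A brief sketch of the optional third argument of str.replace, using the made-up string 'atata':

```python
dna = "atata"

first_only = dna.replace('a', 'x', 1)  # count=1: only the first match
all_of_them = dna.replace('a', 'x')    # no count: every match

print(first_only)   # xtata
print(all_of_them)  # xtxtx
```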
I want to look for permutations that match a given word, and arrange my data based on column position.
I.e., I created a CSV with data I scraped from several websites. Say it looks something like this:
Name1      OtherVars  Name2      MoreVars
Stanford   23451      Mamford    No
MIT        yes        stanfor1d  12
BeachBoys  pie        Beatles    Sweeden
I want to (1) find permutations of each word from Name1 in Name2, and then (2) print a table with that word from Name1 + its matching value in OtherVars + the permutation of that word in Name2 + its matching value in MoreVars.
(If no match is found, just drop the word.)
The outcome will be in this case:
Name1      OtherVars  Name2      MoreVars
Stanford   23451      stanford   12
So, how do I:
Find matching permutations for a word in other column?
Print the 2 words and the values they are mapped to in other columns?
PS - here's a similar question; however, it's about Java and only gives pseudocode.
How to find all permutations of a given word in a given text?
Difflib seems not to be suitable for CSVs based on this: How to find the most similar word in a list in python
PS2 - I was advised to use Fuzzymatch; however, I suspect that it's overkill in this case.
If you're looking for a function which returns the same output for "Stanford" and "stanf1ord", you could:
use lowercase
only keep letters
sort the letters
import re
def signature(word):
    return sorted(re.findall('[a-z]', word.lower()))
print(signature("Stanford"))
# ['a', 'd', 'f', 'n', 'o', 'r', 's', 't']
print(signature("Stanford") == signature("stanfo1rd"))
# True
You could create a set or dict of signatures from the first column (converting each signature to a tuple or string first, since lists are not hashable), and see if there's any match within the second column.
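A minimal sketch of that lookup, assuming the rows have already been read from the CSV into (name, value) pairs. The variable names rows1, rows2 and by_signature are hypothetical, and the signature is joined into a string so it can serve as a dict key:

```python
import re

def signature(word):
    # Lowercase, keep letters only, sort, then join so the result is hashable.
    return ''.join(sorted(re.findall('[a-z]', word.lower())))

# Sample rows modelled on the question's table: (name, value) pairs.
rows1 = [("Stanford", "23451"), ("MIT", "yes"), ("BeachBoys", "pie")]
rows2 = [("Mamford", "No"), ("stanfor1d", "12"), ("Beatles", "Sweeden")]

# Map each first-column signature to its row.
by_signature = {signature(name): (name, value) for name, value in rows1}

matches = []
for name2, value2 in rows2:
    hit = by_signature.get(signature(name2))
    if hit is not None:
        matches.append((hit[0], hit[1], name2, value2))

print(matches)  # [('Stanford', '23451', 'stanfor1d', '12')]
```

Note that two first-column names with the same signature would overwrite each other in this dict; a dict of lists would be needed to keep both.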
You seem to want fuzzy matching, not "permutations". There are a few Python fuzzy matching libraries, but I think people like fuzzywuzzy.
Alternatively, you can roll your own. Something like
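def ismatch(s1, s2):
    # implement the matching logic here
    # and return True if s1 and s2 match
    pass
def group(rows1, rows2):
    # rows1 and rows2 are lists of (name, value) pairs, one list per name column
    pairs = [(n1, v1, n2, v2) for n1, v1 in rows1 for n2, v2 in rows2 if ismatch(n1, n2)]
    return pairs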
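One possible way to fill in ismatch, sketched with the standard library's difflib.SequenceMatcher rather than a third-party package. The 0.8 threshold is an arbitrary assumption and would need tuning on real data:

```python
from difflib import SequenceMatcher

def ismatch(s1, s2, threshold=0.8):
    # ratio() is 2 * matches / total length, between 0.0 and 1.0.
    # Comparison is case-insensitive; the threshold is an assumed value.
    return SequenceMatcher(None, s1.lower(), s2.lower()).ratio() >= threshold

print(ismatch("Stanford", "stanfor1d"))  # True
print(ismatch("Stanford", "Mamford"))    # False
```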
I have this string in python
a = "haha"
result = "hh"
What I would like to achieve is to use a regex to replace all occurrences of "aha" with "h", all "oho" with "h" and all "ehe" with "h".
"h" is just an example. Basically, I would like to retain the centre character. In other words, if it's 'eae' I would like it to be changed to 'a'.
My regex would be this
"aha|oho|ehe"
I thought of doing this
import re
reg = re.compile('aha|oho|ehe')
However, I am stuck on how to achieve this kind of substitution without using loops to iterate through all the possible combinations.
You can use re.sub:
import re
print(re.sub('aha|oho|ehe', 'h', 'haha')) # hh
print(re.sub('aha|oho|ehe', 'h', 'hoho')) # hh
print(re.sub('aha|oho|ehe', 'h', 'hehe')) # hh
print(re.sub('aha|oho|ehe', 'h', 'hehehahoho')) # hhhahh
What about re.sub(r'[aeo]h[aeo]', 'h', a)?
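Note that [aeo]h[aeo] also matches mixed pairs such as 'aho', and it hard-codes the 'h'. Since the question asks to keep whatever centre character is there, one possible sketch uses a capture group and a backreference. The helper name keep_centre is hypothetical, and the outer vowels are assumed to be limited to a, e and o as in the question:

```python
import re

# ([aeo]) captures the outer vowel, (.) the centre character,
# and \1 requires the same outer vowel again on the other side.
pattern = re.compile(r'([aeo])(.)\1')

def keep_centre(text):
    # Replace each match with its second group, the centre character.
    return pattern.sub(r'\2', text)

print(keep_centre('haha'))        # hh
print(keep_centre('eae'))         # a
print(keep_centre('hehehahoho'))  # hhhahh
print(keep_centre('aho'))         # aho (outer vowels differ, no match)
```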