Removing all non-alphabet characters from a list

I am building a very simple translator program that uses wordreference.com to look up the meanings of words.
I am not very good at Python (3.4), but I was able to write this.
(Also, I know the n = n + 1 counter I have isn't currently working; I did this on purpose to test other things!)
import webbrowser
import sys
trans = True
print('What language will you be translating FROM?')
lang = input()
n = 1
print('Ok, ' + lang + ', what word would you like to translate from ' + (lang) + ' to English?')
while trans == True:
    if n > 99:
        print('Another one: ')
    word = input()
    word = (word.lower())
    list = word.split()
    if lang == 'French':
        lang = 'fren'
    if lang == 'french':
        lang = 'fren'
    for word in list:
        webbrowser.open('http://www.wordreference.com/' + (lang) + '/' + (str(word)))
    n = n + 1
My question is: how would I remove things such as commas and exclamation points from the list, but NOT apostrophes?
My test sentence is 'Je vais bien, merci!'. I want it to open as many tabs as there are words (which it does), but instead of it being
Je vais bien, merci!
I want it to be
Je vais bien merci
I know how to use
word.isalpha()
But this only makes it so I cannot use the program at all if the words are not alphabetical.
Thanks in advance!

This will remove non-alphabet characters excepting apostrophes and spaces.
>>> s = "Je vais bien, merci!"
>>> "".join(c for c in s if c.isalpha() or c in " '")
'Je vais bien merci'
Hope it helps!
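To connect this back to the translator loop in the question, the same filter can be applied to the whole sentence before it is split into words. This is only a sketch; the helper name strip_punctuation is made up for illustration and is not part of the answer above.

```python
def strip_punctuation(sentence):
    """Keep letters, spaces and apostrophes; drop every other character."""
    return "".join(c for c in sentence if c.isalpha() or c in " '")


# Hypothetical helper name; the filter itself is the one shown above.
words = strip_punctuation("Je vais bien, merci!".lower()).split()
print(words)  # ['je', 'vais', 'bien', 'merci']
```

Each element of words can then be passed to webbrowser.open() as in the original loop.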

Related

Is there a way to simplify my deep string of "if" statements? None of them actually repeat they are all just similar

I have written some code to help with my GCSE revision (exams in the UK taken at age 16) which converts a string into just the first letter of every word but leaves everything else intact (i.e. special characters at the ends of words, capitalisation, etc.).
For example:
If I input >>> "These are some words (now they're in brackets!)"
I would want it to output >>> "T a s w (n t i b!)"
I feel as though there must be an easier way to do this than my string of similar "if" statements. For reference, I am reasonably new to Python, but I can't seem to find an answer online. Thanks in advance!
Code:
line = input("What text would you like to memorise?\n")
words = line.split()
letters=''
spec_chars=[
'(',')',',','.','“','”','"',"‘","’","'",'!','¡','?','¿','…'
]
for word in words:
    if word[0] in spec_chars:
        if word[-1] in spec_chars:
            if word[-2] in spec_chars:
                if word[1] in spec_chars:
                    letters += word[0] + word[1] + word[2] + word[-2] + word[-1] + " "
                else:
                    letters += word[0] + word[1] + word[-2] + word[-1] + " "
            else:
                if word[1] in spec_chars:
                    letters += word[0] + word[1] + word[2] + word[-1] + " "
                else:
                    letters += word[0] + word[1] + word[-1] + " "
        else:
            if word[1] in spec_chars:
                letters += word[0] + word[1] + word[2] + " "
            else:
                letters += word[0] + word[1] + " "
    else:
        if word[-1] in spec_chars:
            if word[-2] in spec_chars:
                letters += word[0] + word[-2] + word[-1] + " "
            else:
                letters += word[0] + word[-1] + " "
        else:
            letters += word[0] + " "
output=("".join(letters))
print(output)

Here's one alternative. We keep every punctuation except apostrophe, and we only keep the first letter encountered.
words = "These are some words (now they're in brackets!)"
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzé'"
output = []
for word in words.split():
    output.append( '' )
    found = False
    for i in word:
        if i in alphabet:
            if not found:
                found = True
                output[-1] += i
        else:
            output[-1] += i
print(' '.join(output))
Output:
T a s w (n t i b!)

This might be somewhat overwhelming for now, but I'd still like to point out an approach that allows for a much more concise solution using regular expressions, because it's quite instructive in terms of how to approach problems like this.
TL;DR: It can be done in one line
import re
' '.join(re.sub(r"(\w)[\w']*\w", r'\1', word) for word in text.split())
If you look at the words individually after using .split(), it appears that what you need to do is basically remove all letters (and word-internal apostrophe) after the first letter occurring in each word.
[
'"These', # remove 'hese'
'are', # 're'
'some', # 'ome'
'words', # 'ords'
'(now', # 'ow'
"they're", # "hey're"
'in', # 'n'
'brackets!)"' # 'rackets'
]
Another way to think about it is to find sequences consisting of
A letter x
A sequence of 1 or more letters
and replace the sequence with x. E.g., in '"These', replace 'These' with 'T' to arrive at '"T'; in 'brackets!)"', replace 'brackets' with 'b', etc.
In regular expression syntax, this becomes:
(\w): A letter is matched by \w, but we want to refer to it later, so we need to put it in a group - hence the parentheses.
A sequence of 1 or more (indicated by +) letters is \w+. We also want to include apostrophe, so we want a class indicated by [], i.e., [\w']+, which means "match one or more instances of a letter or apostrophe".
To replace/substitute substrings matched by the pattern we use re.sub(pattern, replacement, string). In the replacement string we can tell it to insert the group we defined before by using the reference \1.
Putting it all together:
# import the re module
import re
# define the regular expression
pattern = r"(\w)[\w']+"
# some test data
texts = ["\"These are some words (now they're in brackets!)\"",
         "¿Qué es lo mejor asignatura? '(¡No es dibujo!!)'",
         "The kids' favourite teacher"]
# testing the pattern
for text in texts:
    words = text.split()
    print(text)
    print(' '.join(re.sub(pattern, r'\1', word) for word in words))
    print()
Result:
"These are some words (now they're in brackets!)"
"T a s w (n t i b!)"
¿Qué es lo mejor asignatura? '(¡No es dibujo!!)'
¿Q e l m a? '(¡N e d!!)'
The kids' favourite teacher
T k f t
To keep a word-final apostrophe in the output, modify the pattern to
pattern = r"(\w)[\w']*\w"
so that the letter-apostrophe sequence must end with a letter.
In other words, we now match
a group consisting of a letter (\w), followed by
zero or more (indicated by *) instances of letter or apostrophe, and
a letter \w.
The result is exactly the same as above, except the last sentence becomes "T k' f t".
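For reuse, the two patterns can be wrapped in a small function. This is a sketch that assumes the hypothetical names first_letters and keep_final_apostrophe, neither of which appears in the answer above:

```python
import re

# The two patterns discussed above.
STRIP_FINAL_APOSTROPHE = r"(\w)[\w']+"
KEEP_FINAL_APOSTROPHE = r"(\w)[\w']*\w"


def first_letters(text, keep_final_apostrophe=True):
    """Reduce each whitespace-separated word to its first letter, keeping surrounding punctuation."""
    pattern = KEEP_FINAL_APOSTROPHE if keep_final_apostrophe else STRIP_FINAL_APOSTROPHE
    return " ".join(re.sub(pattern, r"\1", word) for word in text.split())


print(first_letters("The kids' favourite teacher"))         # T k' f t
print(first_letters("The kids' favourite teacher", False))  # T k f t
```

Both patterns need at least two characters to match, so a one-letter word such as "a" is simply left as it is.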

The code below works fine for me.
Here, I just check the left and right ends of each word of the given sentence.
Let me know if anything needs clarification.
words = "¿Qué es lo mejor asignatura? '(¡No es dibujo!!)'"
spec_chars = ['(', ')', ',', '.', '“', '”', '"', "‘",
"’", "'", '!', '¡', '?', '¿', '…']
s_lst = words.split(' ')
tmp, rev_tmp = '', ''
for i in range(len(s_lst)):
    for l in s_lst[i]:
        if l in spec_chars:
            tmp += l
        else:
            tmp += l
            for j in s_lst[i][::-1]:
                if j in spec_chars:
                    rev_tmp += j
                else:
                    tmp += rev_tmp[::-1]
                    break
            s_lst[i] = tmp
            tmp = ''
            rev_tmp = ''
            break
print(' '.join(s_lst))

Since you mentioned that you are a beginner, you can use a for loop to simplify your if statements. It is not perfect, but it could solve the problem you have raised.
line = input("What text would you like to memorise?\n")
words = line.split()
spec_chars=['(',')',',','.','“','”','"',"‘","’","'",'!','¡','?','¿','…']
letters=''
for word in words:
    letters+=word[0]
    if word[0] in spec_chars:
        letters+=word[1]
    elif word[-2] in spec_chars:
        letters+=word[-2]+word[-1]
    elif word[-1] in spec_chars:
        letters+=word[-1]
print(letters)

Unexpected outcome, manipulating string on Python

I am writing some code in Python, trying to clean a string so that it is all lower case with no special characters.
string_salada_russa = ' !! LeTRas PeqUEnAS & GraNdeS'
clean_string = string_salada_russa.lower().strip()
print(clean_string)
i = 0
for c in clean_string:
    if(c.isalpha() == False and c != " "):
        clean_string = clean_string.replace(c, "").strip()
print(clean_string)
for c in clean_string:
    if(i >= 1 and i <= len(clean_string)-1):
        if(clean_string[i] == " " and clean_string[i-1] == " " and clean_string[i+1] == " "):
            clean_string = clean_string.replace(clean_string[i], "")
    i += 1
print(clean_string)
Expected outcome would be:
#original string
' !! LeTRas PeqUEnAS & GraNdeS'
#expected
'letras pequenas grandes'
#actual outcome
'letraspequenasgrandes'
I am trying to remove the extra spaces, but unsuccessfully. I end up removing ALL spaces.
Could anyone help me figure it out? What is wrong in my code?

How about using re?
import re
s = ' !! LeTRas PeqUEnAS & GraNdeS'
s = re.sub(r"[^a-zA-Z]+", " ", s.lower()).strip()
print(s) # letras pequenas grandes
This first converts the letters to lower case (lower), then replaces each run of non-alphabetical characters with a single blank (re.sub), and finally removes blanks around the string (strip).
Btw, your code does not output 'letraspequenasgrandes'. Instead, it outputs 'letrasZpequenasZZZZZgrandes'.

You could get away with a combination of str.lower(), str.split(), str.join() and str.isalpha():
def clean(s):
    return ' '.join(x for x in s.lower().split(' ') if x.isalpha())
s = ' !! LeTRas PeqUEnAS & GraNdeS'
print(clean(s))
# letras pequenas grandes
Basically, you first convert to lower case and then split by ' '. After that, you filter out the non-alpha tokens and join the rest back together.

There's no need to strip your string at each iteration of the first for loop; but, other than that, you could keep the first piece of your code:
for c in clean_string:
    if (c.isalpha() == False and c != " "):
        clean_string = clean_string.replace(c, "")
Then split your string, effectively removing all the spaces, and re-join the words into a single string, with a single space between each word:
clean_string = " ".join(clean_string.split())
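Putting the two steps together, a minimal sketch might look like this. The function name clean_text is only illustrative, and a generator expression stands in for the replace() loop:

```python
def clean_text(text):
    """Lower-case text, drop everything except letters and spaces, collapse runs of spaces."""
    # Keep only letters and spaces.
    kept = "".join(c for c in text.lower() if c.isalpha() or c == " ")
    # split() with no argument discards every run of whitespace;
    # join() then puts a single space between the remaining words.
    return " ".join(kept.split())


print(clean_text("  !!  LeTRas   PeqUEnAS  &   GraNdeS"))  # letras pequenas grandes
```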

How can I improve this guessing algorithm?

I am trying to write a program that guesses the word the user is thinking of, but right now the program is based only on elimination. Does anyone have an idea on how to make it better?
Here is a brief explanation of how it works now:
I have a list of words stored in "palavras.txt"; these words are then loaded into a regular list.
The first question is: "How many letters does your word have?". Based on that, the program eliminates all the words that do not have that number of letters. After that, it creates a list of the letters ordered by the number of times they appear in a given position.
Then comes the second question: "Is the letter "x" the first letter of your word?". If the answer is "no", it deletes all the words that contain that letter in that position, then moves on to the second most used letter in that position, and so on. If the answer is "yes", it deletes all the words that don't contain that letter in that position and moves on to the next letter of the word, and so on until the word is finished.
It always works, but sometimes it takes a long time. Is there a better way to do it? AI? Machine learning, maybe?
The code is not important since I'm just looking for ideas, but if anyone is curious, here is how I did it:
import os
from unicodedata import normalize
import random
import string
# Helper functions that remove punctuation and accents from words
def remover_pont(txt):
    import string
    return txt.translate(str.maketrans('', '', string.punctuation))
def remover_acentos(txt):
    return normalize('NFKD', txt).encode('ASCII', 'ignore').decode('ASCII')
# Returns a list of the most used letters at the given position, in order
def letramusada(lista, pletra):
    pletraordem = []
    pletraordem2 = []
    pl = []
    for n in lista:
        try:
            pl.append(n[pletra - 1])
        except:
            pass
    dict = {}
    for k in pl:
        if k in dict:
            dict[k] += 1
        else:
            dict[k] = 1
    pletraordem2 = (sorted(dict.items(), key=lambda t: t[1], reverse=True))
    for c in pletraordem2:
        pletraordem.append(c[0])
    return pletraordem
# Reads the "database" containing the words and stores them in the variable "palavras", without accents
file = open('palavras.txt')
palavras = file.read().split("\n")
# Stores the number of letters in the word the user thought of
# (prompt: "Enter the number of letters in the word you thought of, counting the hyphen if any, maximum 8")
nletras = int(input('Digite o número de letras da palavra (considerando hífen, caso haja) que você pensou, com máximo de 8: '))
# Declares the lists that will be used next
npalavras = []
palavras2 = []
palavras3 = []
# Stores every word with the chosen number of letters in a new list called "npalavras", ignoring punctuation
for n in palavras:
    if nletras == len(n):
        npalavras.append(remover_acentos(n).lower())
c = 0
n = 0
for k in range(1, nletras + 1):
    ordem = letramusada(npalavras, k)
    cond = 0
    try:
        while cond == 0:
            if len(npalavras) < 20 and c == 0:
                # "Hmmm, I'm getting close!"
                print("\nHmmm, estou chegando perto!\n")
                c += 1
            if len(npalavras) < 3:
                break
            for c in ordem:
                if c != 0:
                    # Prompt: 'Is letter number {} of your word the letter "{}"? [S/N]' (S = yes)
                    r = str(input("A {} letra da sua palavra é a letra \"{}\"? [S/N] ".format(k, c))).lower()
                    r = r[0]
                    if r == "s":
                        for n in npalavras:
                            if n[k-1] == c:
                                palavras2.append(n)
                        npalavras.clear()
                        npalavras = palavras2[:]
                        palavras2.clear()
                        ordem.clear()
                        cond += 1
                        break
                    else:
                        for n in npalavras:
                            if n[k-1] != c:
                                palavras2.append(n)
                        npalavras.clear()
                        npalavras = palavras2[:]
                        palavras2.clear()
                        r = 0
                        pass
    except:
        n = 1
        # "Sorry, I couldn't find any word :("
        print("\nDesculpe, não achei nenhuma palavra :(")
escolha = random.choice(npalavras)
if n != 0:
    # "The word you thought of is: ..."
    print("\nA palavra que você pensou é: \"{}\"".format(escolha))

The Brute Force Be Your Friend
People may think "machine learning" is a silver bullet, but what is there to learn, especially when so little information is provided? What can you optimize? Your description sounds like pure brute-force, dictionary-based password cracking, and attackers today use the power of GPUs for that.
This may be a little off topic, but even with a GPU the search can be hard. If you are not constrained to a specific language or platform, hashcat is a useful reference. The famous 133 MB dictionary can be enumerated in 5 minutes on a MacBook Pro, which is far faster than guessing in Python.
The Search Space And Word Patterns
Also, the average length of English words is about 8, so this situation is really similar to a typical password, i.e. your search space is large (the upper bound is 26^8 = 208827064576 words!), except that the player can only use a limited word list in the game.
The actual search space can be a little smaller, since there are patterns in English words (for example, s is a very frequent letter, and ae or as appear more often than combinations like az), but you are using a dictionary, so I don't think this helps.
The Non Dictionary Approach
Another idea is that the process is quite close to recovering a DNA sequence, which also has some patterns, although the given information may vary. Think of it as word suggestion. Bioinformatics uses the probabilistic patterns in DNA sequences for imputation.
This method can help when you can guess the word / sequence progressively. Otherwise, you can only use a brute-force approach (as when your word can only be recovered from a hash).
A classic method used for search engines, input methods and DNA imputation is the hidden Markov model. It guesses the next character based on your previous input, and the probability is a statistic pre-calculated from real words.
This can be combined with a dictionary to sort your suggestions (guesses) and provide more accurate guessing.
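As a rough illustration of guessing the next character from the previous one, here is a sketch of a first-order Markov chain over characters. It is deliberately simpler than the hidden Markov model mentioned above, and the function names and the tiny word list are invented for the example:

```python
from collections import Counter, defaultdict


def train_bigrams(words):
    """Count how often each character follows the previous one ('^' marks the word start)."""
    counts = defaultdict(Counter)
    for word in words:
        padded = "^" + word
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1
    return counts


def rank_next_letters(counts, prefix):
    """Return candidate next letters, most frequent first, given the letters guessed so far."""
    prev = prefix[-1] if prefix else "^"
    return [letter for letter, _ in counts[prev].most_common()]


# Invented sample data standing in for the real word list.
counts = train_bigrams(["carro", "casa", "cama", "copo"])
print(rank_next_letters(counts, "c"))  # ['a', 'o']
```

The ranking could be used to decide which letter to ask about next, with the dictionary still doing the elimination.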

You could store the words that have already been used. Say the first user thought of the word 'carro': you could add that to a file, and after a few letters the program could check that list for words matching the description given so far (e.g. "has a c as first letter") and ask the next user if "carro" is their word. You could improve this further by adding a counter to each word, so that words that are used more often are tried before words that are used less.
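A minimal sketch of that counter idea, using collections.Counter from the standard library. The function name and the sample counts are made up, and saving the counts to a file is left out:

```python
from collections import Counter


def rank_by_popularity(candidates, usage_counts):
    """Order candidates so that words chosen more often by earlier players come first."""
    # Counter returns 0 for words it has never seen, so unseen words sort last.
    return sorted(candidates, key=lambda word: (-usage_counts[word], word))


usage_counts = Counter({"carro": 3, "casa": 1})  # hypothetical history of earlier games
print(rank_by_popularity(["cabra", "carro", "casa"], usage_counts))
# ['carro', 'casa', 'cabra']
```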

There's another post that discusses word-suggestion algorithms; it even has Python code for it.
Here's the link: What algorithm gives suggestions in a spell checker?

Python Case Matching Input and Output

I'm doing the pig latin exercise that I'm sure everyone here is familiar with. The only thing I can't seem to get right is matching the case of the input and output. For example, when the user enters Latin, my code produces atinLay. I want it to produce Atinlay.
import string
punct = string.punctuation
punct += ' '
vowel = 'aeiouyAEIOUY'
consonant = 'bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ'
final_word = input("Please enter a single word. ")
first_letter = final_word[:1]
index = 0
if any((p in punct) for p in final_word):
    print("You did not enter a single word!")
else:
    while index < len(final_word) and (not final_word[index] in vowel):
        index = index+1
    if any((f in vowel) for f in first_letter):
        print(final_word + 'yay')
    elif index < len(final_word):
        print(final_word[index:]+final_word[:index]+'ay')

What you need is str.title(). Once you have done your pig latin conversion, you can use the title() string method to produce the desired output, like so:
>>> "atinLay".title()
'Atinlay'
To check if a string is lower case, you can use str.islower(). Take a peek at the docs.
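If the output should only be capitalised when the input was, one option is to check the first character of the original word. A sketch, with a hypothetical helper name match_case:

```python
def match_case(original, converted):
    """Capitalise converted only if original started with an upper-case letter."""
    if original[:1].isupper():
        return converted.capitalize()
    return converted.lower()


print(match_case("Latin", "atinLay"))  # Atinlay
print(match_case("latin", "atinlay"))  # atinlay
```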

Simply use the built-in string methods.
s = "Hello".lower()
s == "hello"
s = "hello".upper()
s == "HELLO"
s = "elloHay".title()
s == "Ellohay"

Going character by character in a string and swapping whitespaces with python

Okay so I have to switch ' ' to *s. I came up with the following
def characterSwitch(ch,ca1,ca2,start = 0, end = len(ch)):
    while start < end:
        if ch[start] == ca1:
            ch[end] == ca2
        start = start + 1
sentence = "Ceci est une toute petite phrase."
print characterSwitch(sentence, ' ', '*')
print characterSwitch(sentence, ' ', '*', 8, 12)
print characterSwitch(sentence, ' ', '*', 12)
print characterSwitch(sentence, ' ', '*', end = 12)
Assigning len(ch) doesn't seem to work and also I'm pretty sure this isn't the most efficient way of doing this. The following is the output I'm aiming for:
Ceci*est*une*toute*petite*phrase.
Ceci est*une*toute petite phrase.
Ceci est une*toute*petite*phrase.
Ceci*est*une*toute petite phrase.

Are you looking for replace()?
sentence = "Ceci est une toute petite phrase."
sentence = sentence.replace(' ', '*')
print sentence
# Ceci*est*une*toute*petite*phrase.
See a demo on ideone.com additionally.
For your second requirement (to replace only from the 8th to the 12th character), you could do:
sentence = sentence[:8] + sentence[8:12].replace(' ', '*') + sentence[12:]
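On the len(ch) problem from the question: default argument values are evaluated once, when the function is defined, so they cannot refer to another parameter. A common workaround is a None default that is resolved inside the function. The sketch below uses Python 3 and a hypothetical name character_switch. It treats end as exclusive, like a slice, so its results for explicit end values differ slightly from the expected output listed in the question:

```python
def character_switch(text, old, new, start=0, end=None):
    """Replace old with new in text[start:end], leaving the rest untouched."""
    if end is None:
        # Resolve the default here, where text is actually known.
        end = len(text)
    return text[:start] + text[start:end].replace(old, new) + text[end:]


sentence = "Ceci est une toute petite phrase."
print(character_switch(sentence, " ", "*"))      # Ceci*est*une*toute*petite*phrase.
print(character_switch(sentence, " ", "*", 12))  # Ceci est une*toute*petite*phrase.
```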

Assuming you have to do it character by character you could do it this way:
sentence = "this is a sentence."
replaced = ""
for c in sentence:
    if c == " ":
        replaced += "*"
    else:
        replaced += c
print replaced
