How can I find words from a file inside a string - Python

I have a file containing every English word, and I want to find which of those words appear in a string of random letters that a generator gives me. I would then like to add the found words to a list, but I am having trouble getting this result. Could you help me, please?
This is what I have tried:
import string
import random
def gen():
    b = []
    for i in range(100):
        a = random.choice(string.ascii_lowercase)
        b.append(a)
    with open('allEnglishWords.txt') as f:
        words = f.read().splitlines()
        joined = ''.join([str(elem) for elem in b])
        if joined in words:
            print(joined)
        f.close()
    print(joined)
gen()
If you are wondering where I got the txt file, it is from http://www.gwicks.net/dictionaries.htm. I downloaded the text file labeled ENGLISH - 84,000 words.

import string
import random
b = []
for i in range(100):
    a = random.choice(string.ascii_lowercase)
    b.append(a)
b = ''.join(b)
with open('engmix.txt', 'r') as f:
    words = [x.replace('\n', '') for x in f.readlines()]
output=[]
for word in words:
    if word in b:
        output.append(word)
print(output)
Output:
['a', 'ad', 'am', 'an', 'ape', 'au', 'b', 'bi', 'bim', 'c', 'cb', 'd', 'e',
'ed', 'em', 'eo', 'f', 'fa', 'fy', 'g', 'gam', 'gem', 'go', 'gov', 'h',
'i', 'j', 'k', 'kg', 'ko', 'l', 'le', 'lei', 'm', 'mg', 'ml', 'mr', 'n',
'no', 'o', 'om', 'os', 'p', 'pe', 'pea', 'pew', 'q', 'ql', 'r', 's', 'si',
't', 'ta', 'tap', 'tape', 'te', 'u', 'uht', 'uk', 'v', 'w', 'wan', 'x', 'y',
'yo', 'yom', 'z', 'zed']

To read the words from the file one at a time, assuming they are separated by a single space:
with open("allEnglishWords.txt") as f:
    for line in f:
        for word in line.split(" "):
            print(word)
Also, you don't need f.close() inside a with block.
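The point about f.close() can be checked directly. The following is only a minimal sketch: it writes a tiny stand-in word list to a temporary file (the three words and the temporary path are assumptions, not the real dictionary file) and inspects f.closed inside and after the with block.

```python
import os
import tempfile

# Hypothetical stand-in for the word list, written to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('tape\npea\nzed\n')
    path = tmp.name

with open(path) as f:
    words = f.read().splitlines()
    closed_inside = f.closed   # still open while inside the block

closed_after = f.closed        # the with statement has closed the file

os.remove(path)                # clean up the temporary file
print(words, closed_inside, closed_after)
```

Because the file is closed as soon as the block ends, an explicit f.close() adds nothing.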

Related

Code can't find keys in dictionary when prompted with string sequence

Newbie at coding, doing this for university. I have written a dictionary which translates codons into single letter amino acids. However, my function can't find the keys in the dict and just adds an X to the list I've made. See code below:
codon_table = {('TTT', 'TTC'): 'F',
('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'): 'L',
('ATT', 'ATC', 'ATA'): 'I',
('ATG'): 'M',
('GTT', 'GTC', 'GTA', 'GTG'): 'V',
('TCT', 'TCC', 'TCA', 'TCG'): 'S',
('CCT', 'CCC', 'CCA', 'CCG'): 'P',
('ACT', 'ACC', 'ACA', 'ACG'): 'T',
('GCT', 'GCC', 'GCA', 'GCG'): 'A',
('TAT', 'TAC'): 'Y',
('CAT', 'CAC'): 'H',
('CAA', 'CAG'): 'Q',
('AAT', 'AAC'): 'N',
('AAA', 'AAG'): 'K',
('GAT', 'GAC'): 'D',
('GAA', 'GAG'): 'E',
('TGT', 'TGC'): 'C',
('TGG'): 'W',
('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'): 'R',
('AGT', 'AGC'): 'S',
('GGT', 'GGC', 'GGA', 'GGG'): 'G',
('TAA', 'TAG', 'TGA'): '*',
}
AA_seq = []
input_DNA = str(input('Please input a DNA string: '))
def translate_dna():
    list(input_DNA)
    global AA_seq
    for codon in range(0, len(input_DNA), 3):
        if codon in codon_table:
            AA_seq = codon_table[codon]
            AA_seq.append(codon_table[codon])
        else:
            AA_seq.append('X')
    print(str(' '.join(AA_seq)).strip('[]').replace("'", ""))
translate_dna()
When I input a DNA sequence, e.g. TGCATGCTACGTAGCGGACCTGG, it only returns XXXXXXX. What I would expect is a string of single letters corresponding to the amino acids in the dict.
I've been staring at it for the best part of an hour, so I figured it's time to ask the experts. Thanks in advance.
You need a way to look up a single codon: either a dictionary keyed on single codons, or a helper function that searches the tuple keys.
Then you need to iterate over the input sequence in groups of 3.
You also need to decide what the output should look like if a triplet is not found in your lookup dictionary.
For example:
from functools import cache
codon_table = {('TTT', 'TTC'): 'F',
('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'): 'L',
('ATT', 'ATC', 'ATA'): 'I',
('ATG',): 'M',
('GTT', 'GTC', 'GTA', 'GTG'): 'V',
('TCT', 'TCC', 'TCA', 'TCG'): 'S',
('CCT', 'CCC', 'CCA', 'CCG'): 'P',
('ACT', 'ACC', 'ACA', 'ACG'): 'T',
('GCT', 'GCC', 'GCA', 'GCG'): 'A',
('TAT', 'TAC'): 'Y',
('CAT', 'CAC'): 'H',
('CAA', 'CAG'): 'Q',
('AAT', 'AAC'): 'N',
('AAA', 'AAG'): 'K',
('GAT', 'GAC'): 'D',
('GAA', 'GAG'): 'E',
('TGT', 'TGC'): 'C',
('TGG',): 'W',
('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'): 'R',
('AGT', 'AGC'): 'S',
('GGT', 'GGC', 'GGA', 'GGG'): 'G',
('TAA', 'TAG', 'TGA'): '*',
}
@cache
def lookup(codon):
    # return the amino acid whose key tuple contains this codon
    for k, v in codon_table.items():
        if codon in k:
            return v
    return '?'
sequence = 'TGCATGCTACGTAGCGGACCTGG'
AA_Seq = []
for i in range(0, len(sequence), 3):
    AA_Seq.append(lookup(sequence[i:i+3]))
print(AA_Seq)
Output:
['C', 'M', 'L', 'R', 'S', 'G', 'P', '?']
Note:
The ? appears because the last item extracted from the input sequence is 'GG', which is not a valid codon.
Also note that in your codon_table, ('ATG'): 'M' is not a tuple/string pair. ('ATG') is just a string (the parentheses are irrelevant), so codon in k becomes a substring test and 'GG' would match 'TGG'. Write the key as ('ATG',) to make it a 1-tuple, as done above for ('ATG',) and ('TGG',).
Your for loop goes through the input, can't find any matches, and appends "X" to your AA_seq every time.
This is because:
your loop variable is an integer index from range(), not a three-letter slice of the input string
your dictionary keys are tuples, which means "TTT" is not the same thing as ("TTT",)
To fix this:
You have to restructure your dictionary to use a single codon as each key instead of a tuple.
You have to slice your input, such as input_DNA[i:i+3], to get a string of length three.
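The two fixes above can be sketched as follows. This is only an illustration under stated assumptions: the grouped table is abbreviated to a handful of entries (a real one would cover all 64 codons), and the names grouped, flat_table and translate are made up for this example.

```python
# Grouped table in the style of the question, abbreviated for brevity.
# Every key is a proper tuple, including the 1-tuples.
grouped = {
    ('TTT', 'TTC'): 'F',
    ('ATG',): 'M',
    ('TGT', 'TGC'): 'C',
    ('TGG',): 'W',
    ('TAA', 'TAG', 'TGA'): '*',
}

# Flatten so that each single codon is its own key.
flat_table = {codon: aa for codons, aa in grouped.items() for codon in codons}

def translate(dna, unknown='X'):
    """Translate dna three letters at a time; unknown triplets become `unknown`."""
    return ''.join(flat_table.get(dna[i:i + 3], unknown)
                   for i in range(0, len(dna), 3))

print(translate('TGCATGTGGTAAGG'))  # the trailing 'GG' is not a full codon
```

With a flat table, each lookup is a single dictionary access, and dict.get supplies the placeholder for anything that is not a known codon.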

Check if a string is in a list of letters - Python3

I have this list which contains letters, and I need to check if a pre-determined word located in another list is horizontally inside this list of letters.
i.e.:
mat_input = [['v', 'e', 'd', 'j', 'n', 'a', 'e', 'o'], ['i', 'p', 'y', 't', 'h', 'o', 'n', 'u'], ['s', 'u', 'e', 'w', 'e', 't', 'a', 'e']]
words_to_search = ['python', 'fox']
I don't need to tell if a word was not found, but if it was I need to tell which one.
My problem is that so far I've tried to compare letter by letter, in a loop similar to this:
for i in range(n): # n = number of words
    for j in range(len(word_to_search[i])): # size of the word I'm searching
        for k in range(h): # h = height of crossword
            for m in range(l): # l = length of crossword
But it's not working. Inside the last loop I tried several if/else conditions to tell if the whole word was found. How can I solve this?
You can use str.join:
mat_input = [['v', 'e', 'd', 'j', 'n', 'a', 'e', 'o'], ['i', 'p', 'y', 't', 'h', 'o', 'n', 'u'], ['s', 'u', 'e', 'w', 'e', 't', 'a', 'e']]
words_to_search = ['python', 'fox']
joined_input = list(map(''.join, mat_input))
results = {i:any(i in b or i in b[::-1] for b in joined_input) for i in words_to_search}
print(results)
Output:
{'python': True, 'fox': False}
I'd start by joining each sublist in mat_input into one string:
mat_input_joined = [''.join(x) for x in mat_input]
Then loop over your words to search and simply use the in operator to see if the word is contained in each string:
for word_to_search in words_to_search:
    result = [word_to_search in x for x in mat_input_joined]
    print('Word:',word_to_search,'found in indices:',[i for i, x in enumerate(result) if x])
Result:
Word: python found in indices: [1]
Word: fox found in indices: []
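If the column is wanted as well as the row, the same joining idea can be combined with str.find. This is a minimal sketch; the helper name find_horizontal is made up here, and it only looks for left-to-right matches.

```python
mat_input = [['v', 'e', 'd', 'j', 'n', 'a', 'e', 'o'],
             ['i', 'p', 'y', 't', 'h', 'o', 'n', 'u'],
             ['s', 'u', 'e', 'w', 'e', 't', 'a', 'e']]
words_to_search = ['python', 'fox']

# One string per row of the grid.
rows = [''.join(row) for row in mat_input]

def find_horizontal(word):
    """Return (row, column) of the first left-to-right match, or None."""
    for r, text in enumerate(rows):
        c = text.find(word)   # str.find returns -1 when there is no match
        if c != -1:
            return (r, c)
    return None

found = {w: find_horizontal(w) for w in words_to_search}
print(found)
```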

Python: using indices and str

I am attempting to learn Python and am working on an assignment for fun that involves translating "encrypted" messages (it's just the alphabet in reverse). My function is supposed to be able to read in an encoded string and then print out its decoded string equivalent. However, as I am new to Python, I find myself continually running into a type error with trying to use the indices of my lists to give the values. If anyone has any pointers on a better approach or if there is something that I just plain missed, that would be awesome.
def answer(s):
    '''
    All lowercase letters [a-z] have been swapped with their corresponding values
    (e.g. a=z, b=y, c=x, etc.) Uppercase and punctuation characters are unchanged.
    Write a program that can take in encrypted input and give the decrypted output
    correctly.
    '''
    word = ""
    capsETC = 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',\
              'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',\
              ' ', '?', '\'', '\"', '#', '!', '#', '$', '%', '&', '*', '(', \
              ') ', '-', '_', '+', '=', '<', '>', '/', '\\'
    alphF = 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',\
            'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'
    alphB = 'z', 'y', 'x', 'w', 'v', 'u', 't', 's', 'r', 'q', 'p', 'o', 'n', 'm',\
            'l', 'k', 'j', 'i', 'h', 'g', 'f', 'e', 'd', 'c', 'b', 'a'
    for i in s:
        if i in capsETC: # if letter is uppercase or punctuation
            word = word + i # do nothing
        elif i in alphB: # else, do check
            for x in alphB: # for each index in alphB
                if i == alphB[x]: # if i and index are equal (same letter)
                    if alphB[x] == alphF[x]: # if indices are equal
                        newLetter = alphF[x] # new letter equals alpf at index x
                        str(newLetter) # convert to str?
                        word = word + newLetter # add to word
    print(word)
s = "Yvzs!"
answer(s)
Your code is fine, it just needs a few changes (I left your old lines as comments):
def answer(s):
    word = ""
    capsETC = 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',\
              'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',\
              ' ', '?', '\'', '\"', '#', '!', '#', '$', '%', '&', '*', '(', \
              ') ', '-', '_', '+', '=', '<', '>', '/', '\\'
    alphF = 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',\
            'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'
    alphB = 'z', 'y', 'x', 'w', 'v', 'u', 't', 's', 'r', 'q', 'p', 'o', 'n', 'm',\
            'l', 'k', 'j', 'i', 'h', 'g', 'f', 'e', 'd', 'c', 'b', 'a'
    for i in s:
        if i in capsETC: # if letter is uppercase or punctuation
            word = word + i # do nothing
        elif i in alphB: # else, do check
            for x in range(len(alphB)): # for each index in alphB
                if i == alphB[x]: # if i and index are equal (same letter)
                    # if alphB[x] == alphF[x]: # if indices are equal
                    newLetter = alphF[x] # new letter equals alpf at index x
                    # str(newLetter) # convert to str?
                    word = word + newLetter # add to word
    return word
s = "Yvzs!"
print(s)
print(answer(s))
Output:
Yvzs!
Yeah!
Of course you can make it a lot simpler and more Pythonic, but I wanted to change your code as little as possible.
Your current issue is that you are trying to use letters as indices. To fix your current approach, you could use enumerate while looping through each of your strings.
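As a rough sketch of the enumerate idea (the two alphabets are written as plain strings here for brevity, which is an assumption rather than the original tuples):

```python
alphF = 'abcdefghijklmnopqrstuvwxyz'
alphB = alphF[::-1]  # the alphabet in reverse

def answer(s):
    word = ''
    for ch in s:
        # enumerate yields an integer index together with each letter
        for idx, letter in enumerate(alphB):
            if ch == letter:
                word += alphF[idx]   # index with the integer, not the letter
                break
        else:
            word += ch               # no match: uppercase, punctuation, etc.
    return word

print(answer("Yvzs!"))
```

The for/else branch runs only when the inner loop finishes without a break, which is what keeps uppercase letters and punctuation unchanged.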
If you want a much simpler approach, you can make use of str.maketrans and str.translate. These two built-in str methods solve this problem easily:
import string
unenc = string.ascii_lowercase # abcdefghijklmnopqrstuvwxyz
decd = unenc[::-1] # zyxwvutsrqponmlkjihgfedcba
secrets = str.maketrans(unenc, decd)
s = "Yvzs!"
print(s.translate(secrets))
Output:
Yeah!
If you want a looping approach, you can use try and except along with str.index() to achieve a much simpler loop:
import string
unenc = string.ascii_lowercase # abcdefghijklmnopqrstuvwxyz
decd = unenc[::-1] # zyxwvutsrqponmlkjihgfedcba
s = "Yvzs!"
word = ''
for i in s:
    try:
        idx = unenc.index(i)
    except ValueError: # i is not a lowercase letter
        idx = -1
    word += decd[idx] if idx != -1 else i
print(word)
Output:
Yeah!

Replace some accented letters from word in python

I'm trying to replace some accented letters from Portuguese words in Python.
accentedLetters = ['à', 'á', 'â', 'ã', 'é', 'ê', 'í', 'ó', 'ô', 'õ', 'ú', 'ü']
letters = ['a', 'a', 'a', 'a', 'e', 'e', 'i', 'o', 'o', 'o', 'u', 'u']
So the accentedLetters will be replaced by the letter in the letters array.
In this way, my expected results are for example:
ação => açao
frações => fraçoes
How can I do that?
A simple translation dictionary should do the trick. For each letter, if the letter is in the dictionary, use its translation. Otherwise, use the original. Join the individual characters back into a word.
def removeAccents(word):
    repl = {'à': 'a', 'á': 'a', 'â': 'a', 'ã': 'a',
            'é': 'e', 'ê': 'e',
            'í': 'i',
            'ó': 'o', 'ô': 'o', 'õ': 'o',
            'ú': 'u', 'ü': 'u'}
    new_word = ''.join([repl[c] if c in repl else c for c in word])
    return new_word
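The same dictionary idea can also be expressed with str.maketrans and str.translate in Python 3. This is a sketch built from the two lists in the question; it assumes the text uses precomposed characters (each accented letter is a single code point), and the name remove_accents is made up for this example.

```python
accentedLetters = ['à', 'á', 'â', 'ã', 'é', 'ê', 'í', 'ó', 'ô', 'õ', 'ú', 'ü']
letters = ['a', 'a', 'a', 'a', 'e', 'e', 'i', 'o', 'o', 'o', 'u', 'u']

# str.maketrans accepts a dict that maps single characters to replacements.
table = str.maketrans(dict(zip(accentedLetters, letters)))

def remove_accents(word):
    # Characters that are not in the table (such as ç) are left untouched.
    return word.translate(table)

print(remove_accents('ação'))
print(remove_accents('frações'))
```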
You could look at the Unidecode library for Python 3. Note that it transliterates every non-ASCII character, so ç would become c as well.
For example:
from unidecode import unidecode
a = ['à', 'á', 'â', 'ã', 'é', 'ê', 'í', 'ó', 'ô', 'õ', 'ú', 'ü']
for k in a:
    print(unidecode(k))
Result:
a
a
a
a
e
e
i
o
o
o
u
u
I have finally solved my problem (this is Python 2 code):
#! /usr/bin/python
# -*- coding: utf-8 -*-
import sys
def removeAccents(word):
    replaceDict = {'à'.decode('utf-8'): 'a',
                   'á'.decode('utf-8'): 'a',
                   'â'.decode('utf-8'): 'a',
                   'ã'.decode('utf-8'): 'a',
                   'é'.decode('utf-8'): 'e',
                   'ê'.decode('utf-8'): 'e',
                   'í'.decode('utf-8'): 'i',
                   'ó'.decode('utf-8'): 'o',
                   'ô'.decode('utf-8'): 'o',
                   'õ'.decode('utf-8'): 'o',
                   'ú'.decode('utf-8'): 'u',
                   'ü'.decode('utf-8'): 'u'}
    finalWord = ''
    for letter in word:
        if letter in replaceDict:
            finalWord += replaceDict[letter]
        else:
            finalWord += letter
    return finalWord
word = (sys.argv[1]).decode('utf-8')
print removeAccents(word)
This just works as I expected.
Another simple option using regex (Python 2, which has the unicode type):
import re
def remove_accents(string):
    if type(string) is not unicode:
        string = unicode(string, encoding='utf-8')
    string = re.sub(u"[àáâãäå]", 'a', string)
    string = re.sub(u"[èéêë]", 'e', string)
    string = re.sub(u"[ìíîï]", 'i', string)
    string = re.sub(u"[òóôõö]", 'o', string)
    string = re.sub(u"[ùúûü]", 'u', string)
    string = re.sub(u"[ýÿ]", 'y', string)
    return string
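For Python 3, where str is already Unicode and the unicode type no longer exists, a port of the regex approach might look like the sketch below. The character classes are copied from the function above; the name ACCENT_GROUPS and the bytes handling are assumptions made for this example.

```python
import re

# (pattern, replacement) pairs, using the same character classes as above.
ACCENT_GROUPS = [
    ("[àáâãäå]", "a"),
    ("[èéêë]", "e"),
    ("[ìíîï]", "i"),
    ("[òóôõö]", "o"),
    ("[ùúûü]", "u"),
    ("[ýÿ]", "y"),
]

def remove_accents(text):
    # Accept UTF-8 encoded bytes as well as str.
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    for pattern, replacement in ACCENT_GROUPS:
        text = re.sub(pattern, replacement, text)
    return text

print(remove_accents("ação"))
```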

From text file to dictionary

I'm reading a txt file, taking the strings on each line, and making the first one the key for a dictionary I'm creating; the rest will be the value, as a tuple. There is a header beforehand, and I've already made my code "ignore" it at the start.
Example of txt values:
"Ronald Reagan","1981","8","69","California","Republican"
"George Bush","1989","4","64","Texas","Republican"
"Bill Clinton","1993","8","46","Arkansas","Democrat"
I want to create a dictionary that gives the following output:
{"Ronald Reagan": (1981,8,69,"California", "Republican") etc.}
This is what I currently have as my code:
def read_file(filename):
    d={}
    f= open(filename,"r")
    first_line = f.readline()
    for line in f:
        #line=line.strip('"')
        #line=line.rstrip()
        data=line.split('"')
        data=line.replace('"', "")
        print(data)
        key_data=data[0]
        values_data= data[1:]
        valuesindata=tuple(values_data)
        d[key_data]=valuesindata
    print(d)
read_file(filename)
The first print statement (I put it there just to see what the output at that point was) gave me the following:
Ronald Reagan,1981,8,69,California,Republican
George Bush,1989,4,64,Texas,Republican
etc. By the time it gets to the second print statement it does the following:
{'R': ('o', 'n', 'a', 'l', 'd', ' ', 'R', 'e', 'a', 'g', 'a', 'n', ',', '1', '9', '8', '1', ',', '8', ',', '6', '9', ',', 'C', 'a', 'l', 'i', 'f', 'o', 'r', 'n', 'i', 'a', ',', 'R', 'e', 'p', 'u', 'b', 'l', 'i', 'c', 'a', 'n', '\n'), 'G': ('e', 'o', 'r', 'g', 'e', ' ', 'B', 'u', 's', 'h', ',', '1', '9', '8', '9', ',', '4', ',', '6', '4', ',', 'T', 'e', 'x', 'a', 's', ',', 'R', 'e', 'p', 'u', 'b', 'l', 'i', 'c', 'a', 'n', '\n')}
Also, I'm splitting it at the quotes because some of my strings contain a comma as part of the name, for example: "Carl, Jr."
I don't want to import the csv module, so is there a way to do this without it?
You can use the csv module like alecxe suggested or you can do it "manually" like so:
csv_dict = {}
with open(csv_file, 'r') as f:
    for line in f:
        line = line.strip().replace('"', '').split(',')
        csv_dict[line[0]] = tuple(int(x) if x.isdigit() else str(x) for x in line[1:])
This will remove the double quotes, cast numerical values to int and create a dictionary of tuples.
The major problem in your code, and the cause of this weird result, is that the data variable is a string: data[0] gives you the first character and data[1:] the rest. You need to call split(",") first to split the string into a list.
I have a limitation to not import any modules.
The idea is to use split(",") to split each line into individual items and strip() to remove the quotes around the item values:
d = {}
with open(filename) as f:
    for line in f:
        items = [item.strip('"').strip() for item in line.split(",")]
        d[items[0]] = items[1:]
print(d)
Prints:
{'Bill Clinton': ['1993', '8', '46', 'Arkansas', 'Democrat'],
'George Bush': ['1989', '4', '64', 'Texas', 'Republican'],
'Ronald Reagan': ['1981', '8', '69', 'California', 'Republican']}
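Splitting on a bare comma breaks names such as "Carl, Jr.", which the question mentions. One way around that without importing anything is to split on the full '","' separator. This is only a sketch: it assumes every field is wrapped in double quotes with no spaces between fields and no escaped quotes, the helper name parse_line is made up, and the second sample row is invented purely to show a comma inside a name.

```python
def parse_line(line):
    """Split one '"a","b","c"' line into (key, values) without the csv module."""
    # Drop the newline and the outermost quotes, then split on the separator.
    fields = line.strip().strip('"').split('","')
    key = fields[0]
    # Convert purely numeric fields to int and keep the rest as strings.
    values = tuple(int(x) if x.isdigit() else x for x in fields[1:])
    return key, values

sample = [
    '"Ronald Reagan","1981","8","69","California","Republican"\n',
    '"Carl, Jr.","1989","4","64","Texas","Republican"\n',  # invented row
]
d = dict(parse_line(line) for line in sample)
print(d)
```

For anything less regular than this (escaped quotes, empty fields, optional quoting), the csv module handles the edge cases more reliably.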
FYI, using csv module from the standard library would make things much easier:
import csv
from pprint import pprint
d = {}
with open(filename) as f:
    reader = csv.reader(f)
    for row in reader:
        d[row[0]] = row[1:]
pprint(d)
You can also use a dictionary comprehension:
d = {row[0]: row[1:] for row in reader}
