find all words in a certain alphabet with multi character letters

find all words in a certain alphabet with multi character letters - python

I want to find out what words can be formed using the names of musical notes.
This question is very similar: Python code that will find words made out of specific letters. Any subset of the letters could be used
But my alphabet also contains "fis","cis" and so on.
letters = ["c","d","e","f","g","a","h","c","fis","cis","dis"]
I have a really long word list with one word per list and want to use
with open(...) as f:
for line in f:
if
to check if each word is part of that "language" and then save it to another file.
my problem is how to alter
>>> import re
>>> m = re.compile('^[abilrstu]+$')
>>> m.match('australia') is not None
True
>>> m.match('dummy') is not None
False
>>> m.match('australian') is not None
False
so it also matches with "fis","cis" and so on.
e.g. "fish" is a match but "ifsh" is not a match.

I believe ^(fis|cis|dis|[abcfhg])+$ will do the job.
Some deconstruction of what's going on here:
| workds like OR conjunction
[...] denotes "any symbol from what's inside the brackets"
^ and $ stand for beginning and end of line, respectively
+ stands for "1 or more time"
( ... ) stands for grouping, needed to apply +/*/{} modifiers. Without grouping such modifiers applies to closest left expression
Alltogether this "reads" as "whole string is one or more repetition of fis/cis/dis or one of abcfhg"

This function works, it doesn't use any external libraries:
def func(word, letters):
for l in sorted(letters, key=lambda x: x.length, reverse=True):
word = word.replace(l, "")
return not s
it works because if s=="", then it has been decomposed into your letters.
Update:
It seems that my explanation wasn't clear. WORD.replace(LETTER, "") will replace the note/LETTER in WORD by nothing, here is an example :
func("banana", {'na'})
it will replace every 'na' in "banana" by nothing ('')
the result after this is "ba", which is not a note
not "" means True and not "ba" is false, this is syntactic sugar.
here is another example :
func("banana", {'na', 'chicken', 'b', 'ba'})
it will replace every 'chicken' in "banana" by nothing ('')
the result after this is "banana"
it will replace every 'ba' in "banana" by nothing ('')
the result after this is "nana"
it will replace every 'na' in "nana" by nothing ('')
the result after this is ""
it will replace every 'b' in "" by nothing ('')
the result after this is ""
not "" is True ==> HURRAY IT IS A MELODY !
note: The reason for the sorted by length is because otherwise, the second example would not have worked. The result after deleting "b" would be "a", which can't be decomposed in notes.

You can calculate the number of letters of all units (names of musical notes), which are in the word, and compare this number to the length of the word.
from collections import Counter
units = {"c","d","e","f","g","a","h", "fis","cis","dis"}
def func(word, units=units):
letters_count = Counter()
for unit in units:
num_of_units = word.count(unit)
letters_count[unit] += num_of_units * len(unit)
if len(unit) == 1:
continue
# if the unit consists of more than 1 letter (e.g. dis)
# check if these letters are in one letter units
# if yes, substruct the number of repeating letters
for letter in unit:
if letter in units:
letters_count[letter] -= num_of_units
return len(word) == sum(letters_count.values())
print(func('disc'))
print(func('disco'))
# True
# False

A solution with tkinter window opening to choose file:
import re
from tkinter import filedialog as fd
m = re.compile('^(fis|ges|gis|as|ais|cis|des|es|dis|[abcfhg])+$')
matches = list()
filename = fd.askopenfilename()
with open(filename) as f:
for line in f:
if m.match(str(line).lower()) is not None:
matches.append(line[:-1])
print(matches)
This answer was posted as an edit to the question find all words in a certain alphabet with multi character letters by the OP Nivatius under CC BY-SA 4.0.

Related

How can I remove duplicate string with Python?

I am to learn python, I have a problem to deal with
Following example:
string1 = "Galaxy S S10 Lite"
string2 = "Galaxy Note Note 10 Plus"
How can I remove the second two duplicates "S" and "S" or "Note" and "Note"?
The result should look like
string1a = "Galaxy S10 Lite"
string2a = "Galaxy Note 10 Plus"
how to only the second duplicates should be removed and the sequence of the words should not be changed!

string1a = string1.split()
del string1a[1]
string1a = " ".join(string1a)
This does what you want for the 2 strings provided.
It will only work in all the strings you want, if you know for sure that the second and third words of the string are the duplicates, giving preference to the third one.

The accepted answer only removes the second word in the sentence manually.
If you have a very long string that will be very tedious to clean.
I assume there were only two cases:
Skip the letter if it's the same with the following word's first letter
Skip the word if it's the same as the its following word
This function can clean them automatically
def clean_string(string):
"""Clean if there were sequentially duplicated word or letter"""
following_word = ''
complete_words = []
# loop through the string in reverse to be able to skip the earlier word/letter
# string.split() splits your string by each space and make it as a list
for word in reversed(string.split()):
# to skip duplicated letter, in your case is to skip "S" and retain "S10"
if (len(word) == 1) and (following_word[0] == word):
following_word = word
continue
# to skip duplicated word, in your case is to skip "Note" and retain latter "Note"
elif word == following_word:
following_word = word
continue
following_word = word
complete_words.append(word)
# join all appended word and re-reverse it to be the expected sequence
return ' '.join(reversed(complete_words))

Python 3.8 dictionary value sorting alphabetically

This code is meant to read a text file and add every word to a dictionary where the key is the first letter and the values are all the words in the file that start with that letter. It kinda works but for
two problems I run into:
the dictionary keys contain apostrophes and periods (how to exclude?)
the values aren't sorted alphabetically and are all jumbled up. the code ends up outputting something like this:
' - {"don't", "i'm", "let's"}
. - {'below.', 'farm.', 'them.'}
a - {'take', 'masters', 'can', 'fallow'}
b - {'barnacle', 'labyrinth', 'pebble'}
...
...
y - {'they', 'very', 'yellow', 'pastry'}
when it should be more like:
a - {'ape', 'army','arrow', 'arson',}
b - {'bank', 'blast', 'blaze', 'breathe'}
etc
# make empty dictionary
dic = {}
# read file
infile = open('file.txt', "r")
# read first line
lines = infile.readline()
while lines != "":
# split the words up and remove "\n" from the end of the line
lines = lines.rstrip()
lines = lines.split()
for word in lines:
for char in word:
# add if not in dictionary
if char not in dic:
dic[char.lower()] = set([word.lower()])
# Else, add word to set
else:
dic[char.lower()].add(word.lower())
# Continue reading
lines = infile.readline()
# Close file
infile.close()
# Print
for letter in sorted(dic):
print(letter + " - " + str(dic[letter]))
I'm guessing I need to remove the punctuation and apostrophes from the whole file when I'm first iterating through it but before adding anything to the dictionary? Totally lost on getting the values in the right order though.

Use defaultdict(set) and dic[word[0]].add(word), after removing any starting punctuation. No need for the inner loop.

from collections import defaultdict
def process_file(fn):
my_dict = defaultdict(set)
for word in open(fn, 'r').read().split():
if word[0].isalpha():
my_dict[word[0].lower()].add(word)
return(my_dict)
word_dict = process_file('file.txt')
for letter in sorted(word_dict):
print(letter + " - " + ', '.join(sorted(word_dict[letter])))

You have a number of problems
splitting words on spaces AND punctuation
adding words to a set that could not exist at the time of the first addition
sorting the output
Here a short program that tries to solve the above issues
import re, string
# instead of using "text = open(filename).read()" we exploit a piece
# of text contained in one of the imported modules
text = re.__doc__
# 1. how to split at once the text contained in the file
#
# credit to https://stackoverflow.com/a/13184791/2749397
p_ws = string.punctuation + string.whitespace
words = re.split('|'.join(re.escape(c) for c in p_ws), text)
# 2. how to instantiate a set when we do the first addition to a key,
# that is, using the .setdefault method of every dictionary
d = {}
# Note: words regularized by lowercasing, we skip the empty tokens
for word in (w.lower() for w in words if w):
d.setdefault(word[0], set()).add(word)
# 3. how to print the sorted entries corresponding to each letter
for letter in sorted(d.keys()):
print(letter, *sorted(d[letter]))
My text contains numbers, so numbers are found in the output (see below) of the above program; if you don't want numbers filter them, if letter not in '0123456789': print(...).
And here it is the output...
0 0
1 1
8 8
9 9
a a above accessible after ailmsux all alphanumeric alphanumerics also an and any are as ascii at available
b b backslash be before beginning behaviour being below bit both but by bytes
c cache can case categories character characters clear comment comments compatibility compile complement complementing concatenate consist consume contain contents corresponding creates current
d d decimal default defined defines dependent digit digits doesn dotall
e each earlier either empty end equivalent error escape escapes except exception exports expression expressions
f f find findall finditer first fixed flag flags following for forbidden found from fullmatch functions
g greedy group grouping
i i id if ignore ignorecase ignored in including indicates insensitive inside into is it iterator
j just
l l last later length letters like lines list literal locale looking
m m made make many match matched matches matching means module more most multiline must
n n name named needn newline next nicer no non not null number
o object occurrences of on only operations optional or ordinary otherwise outside
p p parameters parentheses pattern patterns perform perl plus possible preceded preceding presence previous processed provides purge
r r range rather re regular repetitions resulting retrieved return
s s same search second see sequence sequences set signals similar simplest simply so some special specified split start string strings sub subn substitute substitutions substring support supports
t t takes text than that the themselves then they this those three to
u u underscore unicode us
v v verbose version versions
w w well which whitespace whole will with without word
x x
y yes yielding you
z z z0 za
Without comments and a little obfuscation it's just 3 lines of code...
import re, string
text = re.__doc__
p_ws = string.punctuation + string.whitespace
words = re.split('|'.join(re.escape(c) for c in p_ws), text)
d, add2d = {}, lambda w: d.setdefault(w[0],set()).add(w) #1
for word in (w.lower() for w in words if w): add2d(word) #2
for abc in sorted(d.keys()): print(abc, *sorted(d[abc])) #3

Python Challenge #3: Loop stops way too early

I'm working on PythonChallenge #3. I've got a huge block of text that I have to sort through. I am trying to find a sequence in which the first and last three letters are caps, and the middle one is lowercase.
My function loops through the text. The variable block stores the seven letters that are currently being looped through. There's a variable, toPrint, which gets turned on and off based on whether the letters in block correspond to my pattern (AAAaAAA). Based on the last block printed according to my function, my loop stops early in my text. I have no idea why this is happening and if you could help me figure this out, that would be great.
text = """kAewtloYgcFQaJNhHVGxXDiQmzjfcpYbzxlWrVcqsmUbCunkfxZWDZjUZMiGqhRRiUvGmYmvnJ"""
words = []
for i in text:
toPrint = True
block = text[text.index(i):text.index(i)+7]
for b in block[:3]:
if b.isupper() == False:
toPrint = False
for b in block[3]:
if b.islower() == False:
toPrint = False
for b in block[4:]:
if b.isupper() == False:
toPrint = False
if toPrint == True and block not in words:
words.append(block)
print (block)
print (words)

With Regex:
This is a really good time to use regex, it's super fast, more clear, and doesn't require a bunch of nested if statements.
import re
text = """kAewtloYgcFQaJNhHVGxXDiQmzjfcpYbzxlWrVcqsmUbCunkfxZWDZjUZMiGqhRRiUvGmYmvnJ"""
print(re.search(r"[A-Z]{3}[a-z][A-Z]{3}", text).group(0))
Explanation of regex:
[A-Z]{3] ---> matches any 3 uppercase letters
[a-z] -------> matches a single lowercase letter
[A-Z]{3] ---> matches 3 more uppercase letters
Without Regex:
If you really don't want to use regex this is how you could do it:
text = """kAewtloYgcFQaJNhHVGxXDiQmzjfcpYbzxlWrVcqsmUbCunkfxZWDZjUZMiGqhRRiUvGmYmvnJ"""
for i, _ in enumerate(text[:-6]): #loop through index of each char (not including last 6)
sevenCharacters = text[i:i+7] #create chunk of seven characters
shouldBeCapital = sevenCharacters[0:3] + sevenCharacters[4:7] #combine all the chars that should be cap into list
if (all(char.isupper() for char in shouldBeCapital)): #make sure all those characters are indeeed capital
if(sevenCharacters[3].islower()): #make sure middle character is lowercase
print(sevenCharacters)

I think your first problem is that you are using str.index(). Like find(), the .index() method of a string returns the index of the first match that is found.
Thus, in your example, whenever you search for 'x' you will get the index of the first 'x' found, etc. You cannot successfully work with any character that is not unique in the string, or that is not the first occurrence of a repeated character.
In order to keep the same structure (which isn't necessary- there is an answer posted using enumerate that I prefer myself) I implemented a queuing approach with your block variable. Each iteration, a character is dropped from the front of block, while the new character is appended to the end.
I also cleaned up some of your needless comparisons with False. You will find that this is not only inefficient, it is frequently wrong, because many of the "boolean" activities you perform will not be on actual boolean values. Get out of the habit of spelling out True/False. Just use if c or if not c.
Here's the result:
text = """kAewtloYgcFQaJNhHVGxXDiQmzjfcpYbzxlWrVcqsmUbCunkfxZWDZjUZMiGqhRRiUvGmYmvnJ"""
words = []
block = '.' + text[0:6]
for i in text[6:]:
block = block[1:] + i # Drop 1st char, append 'i'
toPrint = True
for b in block[:3]:
if not b.isupper():
toPrint = False
if not block[3].islower():
toPrint = False
for b in block[4:]:
if not b.isupper():
toPrint = False
if toPrint and block not in words:
words.append(block)
print (words)

If I understood your question, then according to my opinion there is no need of loop. My this simple code can find required sequence.
# Use this code
text = """kAewtloYgcFQaJNhHVGxXDiQmzjfcpYbzxlWrVcqsmUbCunkfxZWDZjUZMiGqhRRiUvGmYmvnJ"""
import re
print(re.findall("[A-Z]{3}[a-z][A-Z]{3}", text))

Python: Remove First Character of each Word in String

I am trying to figure out how to remove the first character of a words in a string.
My program reads in a string.
Suppose the input is :
this is demo
My intention is to remove the first character of each word of the string, that is
tid, leaving his s emo.
I have tried
Using a for loop and traversing the string
Checking for space in the string using isspace() function.
Storing the index of the letter which is encountered after the
space, i = char + 1, where char is the index of space.
Then, trying to remove the empty space using str_replaced = str[i:].
But it removed the entire string except the last one.

List comprehensions is your friend. This is the most basic version, in just one line
str = "this is demo";
print " ".join([x[1:] for x in str.split(" ")]);
output:
his s emo

In case the input string can have not only spaces, but also newlines or tabs, I'd use regex.
In [1]: inp = '''Suppose we have a
...: multiline input...'''
In [2]: import re
In [3]: print re.sub(r'(?<=\b)\w', '', inp)
uppose e ave
ultiline nput...

You can simply using python comprehension
str = 'this is demo'
mstr = ' '.join([s[1:] for s in str.split(' ')])
then mstr variable will contains these values 'his s emo'

This method is a bit long, but easy to understand. The flag variable stores if the character is a space. If it is, the next letter must be removed
s = "alpha beta charlie"
t = ""
flag = 0
for x in range(1,len(s)):
if(flag==0):
t+=s[x]
else:
flag = 0
if(s[x]==" "):
flag = 1
print(t)
output
lpha eta harlie

Parse 4th capital letter of line in Python?

How can I parse lines of text from the 4th occurrence of a capital letter onward? For example given the lines:
adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj
oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ
I would like to capture:
`ZsdalkjgalsdkjTlaksdjfgasdkgj`
`PlsdakjfsldgjQ`
I'm sure there is probably a better way than regular expressions, but I was attempted to do a non-greedy match; something like this:
match = re.search(r'[A-Z].*?$', line).group()

I present two approaches.
Approach 1: all-out regex
In [1]: import re
In [2]: s = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
In [3]: re.match(r'(?:.*?[A-Z]){3}.*?([A-Z].*)', s).group(1)
Out[3]: 'ZsdalkjgalsdkjTlaksdjfgasdkgj'
The .*?[A-Z] consumes characters up to, and including, the first uppercase letter.
The (?:...){3} repeats the above three times without creating any capture groups.
The following .*? matches the remaining characters before the fourth uppercase letter.
Finally, the ([A-Z].*) captures the fourth uppercase letter and everything that follows into a capture group.
Approach 2: simpler regex
In [1]: import re
In [2]: s = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
In [3]: ''.join(re.findall(r'[A-Z][^A-Z]*', s)[3:])
Out[3]: 'ZsdalkjgalsdkjTlaksdjfgasdkgj'
This attacks the problem directly, and I think is easier to read.

Anyway not using regular expressions will seen to be too verbose -
although at the bytcodelevel it is a very simple algorithm running, and therefore lightweight.
It may be that regexpsare faster, since they are implemented in native code, but the "one obvious way to do it", though boring, certainly beats any suitable regexp in readability hands down:
def find_capital(string, n=4):
count = 0
for index, letter in enumerate(string):
# The boolean value counts as 0 for False or 1 for True
count += letter.isupper()
if count == n:
return string[index:]
return ""

Found this simpler to deal with by using a regular expression to split the string, then slicing the resulting list:
import re
text = ["adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj",
"oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ"]
for t in text:
print "".join(re.split("([A-Z])", t, maxsplit=4)[7:])
Conveniently, this gives you an empty string if there aren't enough capital letters.

A nice, one-line solution could be:
>>> s1 = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
>>> s2 = 'oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ'
>>> s1[list(re.finditer('[A-Z]', s1))[3].start():]
'ZsdalkjgalsdkjTlaksdjfgasdkgj'
>>> s2[list(re.finditer('[A-Z]', s2))[3].start():]
'PlsdakjfsldgjQ'
Why this works (in just one line)?
Searches for all capital letters in the string: re.finditer('[A-Z]', s1)
Gets the 4th capital letter found: [3]
Returns the position from the 4th capital letter: .start()
Using slicing notation, we get the part we need from the string s1[position:]

I believe that this will work for you, and be fairly easy to extend in the future:
check = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
print re.match('([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', check ).group(2)
The first part of the regex ([^A-Z]*[A-Z]){3} is the real key, this finds the first three upper case letters and stores them along with the characters between them in group 1, then we skip any number of non-upper case letters after the third upper case letter, and finally, we capture the rest of the string.

Testing a variety of methods. I original wrote string_after_Nth_upper and didn't post it; seeing that jsbueno's method was similar; except by doing additions/count comparisons for every character (even lowercase letters) his method is slightly slower.
s='adsasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
import re
def string_after_Nth_upper(your_str, N=4):
upper_count = 0
for i, c in enumerate(your_str):
if c.isupper():
upper_count += 1
if upper_count == N:
return your_str[i:]
return ""
def find_capital(string, n=4):
count = 0
for index, letter in enumerate(string):
# The boolean value counts as 0 for False or 1 for True
count += letter.isupper()
if count == n:
return string[index:]
return ""
def regex1(s):
return re.match(r'(?:.*?[A-Z]){3}.*?([A-Z].*)', s).group(1)
def regex2(s):
return re.match(r'([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', s).group(2)
def regex3(s):
return s[list(re.finditer('[A-Z]', s))[3].start():]
if __name__ == '__main__':
from timeit import Timer
t_simple = Timer("string_after_Nth_upper(s)", "from __main__ import s, string_after_Nth_upper")
print 'simple:', t_simple.timeit()
t_jsbueno = Timer("find_capital(s)", "from __main__ import s, find_capital")
print 'jsbueno:', t_jsbueno.timeit()
t_regex1 = Timer("regex1(s)", "from __main__ import s, regex1; import re")
print "Regex1:",t_regex1.timeit()
t_regex2 = Timer("regex2(s)", "from __main__ import s, regex2; import re")
print "Regex2:", t_regex2.timeit()
t_regex3 = Timer("regex3(s)", "from __main__ import s, regex3; import re")
print "Regex3:", t_regex3.timeit()
Results:
Simple: 4.80558681488
jsbueno: 5.92122507095
Regex1: 3.21153497696
Regex2: 2.80767202377
Regex3: 6.64155721664
So regex2 wins for time.

import re
strings = [
'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj',
'oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ',
]
for s in strings:
m = re.match('[a-z]*[A-Z][a-z]*[A-Z][a-z]*[A-Z][a-z]*([A-Z].+)', s)
if m:
print m.group(1)

It's not the prettiest approach, but:
re.match(r'([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', line).group(2)

caps = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
temp = ''
for char in inputStr:
if char in caps:
temp += char
if len(temp) == 4:
print temp[-1] # this is the answer that you are looking for
break
Alternatively, you could use re.sub to get rid of anything that's not a capital letter and get the 4th character of what's left

Another version... not that pretty, but gets the job done.
def stringafter4thupper(s):
i,r = 0,''
for c in s:
if c.isupper() and i < 4:
i+=1
if i==4:
r+=c
return r
Examples:
stringafter4thupper('adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj')
stringafter4thupper('oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ')
stringafter4thupper('')
stringafter4thupper('abcdef')
stringafter4thupper('ABCDEFGH')
Respectively results:
'ZsdalkjgalsdkjTlaksdjfgasdkgj'
'PlsdakjfsldgjQ'
''
''
'DEFGH'

Parsing almost always involves regular expressions. However, a regex by itself does not make a parser. In the most simple sense, a parser consists of:
text input stream -> tokenizer
Usually it has an additional step:
text input stream -> tokenizer -> parser
The tokenizer handles opening the input stream and collecting text in a proper manner, so that the programmer doesn't have to think about it. It consumes text elements until there is only one match available to it. Then it runs the code associated with this "token". If you don't have a tokenizer, you have to roll it yourself(in pseudocode):
while stuffInStream:
currChars + getNextCharFromString
if regex('firstCase'):
do stuff
elif regex('other stuff'):
do more stuff
This loop code is full of gotchas, unless you build them all the time. It is also easy to have a computer produce it from a set of rules. That's how Lex/flex works. You can have the rules associated with a token pass the token to yacc/bison as your parser, which adds structure.
Notice that the lexer is just a state machine. It can do anything when it migrates from state to state. I've written lexers that used would strip characters from the input stream, open files, print text, send email and so on.
So, if all you want is to collect the text after the fourth capital letter, a regex is not only appropriate, it is the correct solution. BUT if you want to do parsing of textual input, with different rules for what to do and an unknown amount of input, then you need a lexer/parser. I suggest PLY since you are using python.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.