This code is meant to read a text file and add every word to a dictionary, where the key is the first letter and the values are all the words in the file that start with that letter. It mostly works, but I run into two problems:
the dictionary keys contain apostrophes and periods (how do I exclude them?)
the values aren't sorted alphabetically and are all jumbled up. The code ends up outputting something like this:
' - {"don't", "i'm", "let's"}
. - {'below.', 'farm.', 'them.'}
a - {'take', 'masters', 'can', 'fallow'}
b - {'barnacle', 'labyrinth', 'pebble'}
...
...
y - {'they', 'very', 'yellow', 'pastry'}
when it should be more like:
a - {'ape', 'army', 'arrow', 'arson'}
b - {'bank', 'blast', 'blaze', 'breathe'}
etc
# make empty dictionary
dic = {}
# read file
infile = open('file.txt', "r")
# read first line
lines = infile.readline()
while lines != "":
    # split the words up and remove "\n" from the end of the line
    lines = lines.rstrip()
    lines = lines.split()
    for word in lines:
        for char in word:
            # add if not in dictionary
            if char not in dic:
                dic[char.lower()] = set([word.lower()])
            # Else, add word to set
            else:
                dic[char.lower()].add(word.lower())
    # Continue reading
    lines = infile.readline()
# Close file
infile.close()
# Print
for letter in sorted(dic):
    print(letter + " - " + str(dic[letter]))
I'm guessing I need to remove the punctuation and apostrophes while I'm first iterating through the file, before adding anything to the dictionary? I'm totally lost on getting the values in the right order, though.
Use defaultdict(set) and dic[word[0]].add(word), after removing any starting punctuation. No need for the inner loop.
from collections import defaultdict

def process_file(fn):
    my_dict = defaultdict(set)
    for word in open(fn, 'r').read().split():
        if word[0].isalpha():
            my_dict[word[0].lower()].add(word.lower())
    return my_dict

word_dict = process_file('file.txt')
for letter in sorted(word_dict):
    print(letter + " - " + ', '.join(sorted(word_dict[letter])))
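The sample output in the question also shows trailing periods inside the values (e.g. 'below.'). A hedged variant of the function above, stripping punctuation from both ends of each word with str.strip(string.punctuation) (my addition, not part of the original answer):

import string
from collections import defaultdict

def process_file(fn):
    my_dict = defaultdict(set)
    with open(fn, 'r') as f:
        for word in f.read().split():
            # strip leading/trailing punctuation, then lowercase
            word = word.strip(string.punctuation).lower()
            if word and word[0].isalpha():
                my_dict[word[0]].add(word)
    return my_dict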
You have a number of problems:
splitting words on spaces AND punctuation
adding words to a set that may not exist at the time of the first addition
sorting the output
Here is a short program that tries to solve the above issues:
import re, string

# instead of using "text = open(filename).read()" we exploit a piece
# of text contained in one of the imported modules
text = re.__doc__

# 1. how to split the text contained in the file all at once
#
# credit to https://stackoverflow.com/a/13184791/2749397
p_ws = string.punctuation + string.whitespace
words = re.split('|'.join(re.escape(c) for c in p_ws), text)

# 2. how to instantiate a set when we do the first addition to a key,
#    that is, using the .setdefault method of every dictionary
d = {}
# Note: words regularized by lowercasing; we skip the empty tokens
for word in (w.lower() for w in words if w):
    d.setdefault(word[0], set()).add(word)

# 3. how to print the sorted entries corresponding to each letter
for letter in sorted(d.keys()):
    print(letter, *sorted(d[letter]))
My text contains numbers, so numbers appear in the output of the above program (see below); if you don't want numbers, filter them out: if letter not in '0123456789': print(...).
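A minimal sketch of that filter, reusing the d built above:

for letter in sorted(d.keys()):
    if letter not in '0123456789':
        print(letter, *sorted(d[letter]))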
And here is the output...
0 0
1 1
8 8
9 9
a a above accessible after ailmsux all alphanumeric alphanumerics also an and any are as ascii at available
b b backslash be before beginning behaviour being below bit both but by bytes
c cache can case categories character characters clear comment comments compatibility compile complement complementing concatenate consist consume contain contents corresponding creates current
d d decimal default defined defines dependent digit digits doesn dotall
e each earlier either empty end equivalent error escape escapes except exception exports expression expressions
f f find findall finditer first fixed flag flags following for forbidden found from fullmatch functions
g greedy group grouping
i i id if ignore ignorecase ignored in including indicates insensitive inside into is it iterator
j just
l l last later length letters like lines list literal locale looking
m m made make many match matched matches matching means module more most multiline must
n n name named needn newline next nicer no non not null number
o object occurrences of on only operations optional or ordinary otherwise outside
p p parameters parentheses pattern patterns perform perl plus possible preceded preceding presence previous processed provides purge
r r range rather re regular repetitions resulting retrieved return
s s same search second see sequence sequences set signals similar simplest simply so some special specified split start string strings sub subn substitute substitutions substring support supports
t t takes text than that the themselves then they this those three to
u u underscore unicode us
v v verbose version versions
w w well which whitespace whole will with without word
x x
y yes yielding you
z z z0 za
Without comments, and with a little obfuscation, it's just 3 lines of code...
import re, string
text = re.__doc__
p_ws = string.punctuation + string.whitespace
words = re.split('|'.join(re.escape(c) for c in p_ws), text)
d, add2d = {}, lambda w: d.setdefault(w[0],set()).add(w) #1
for word in (w.lower() for w in words if w): add2d(word) #2
for abc in sorted(d.keys()): print(abc, *sorted(d[abc])) #3
I'm trying to take the last two letters of a string, swap them, make them lowercase, and leave a space in the middle. For some reason the output gives me whitespace before the word.
For example, if the input was APPLE then the output should be e l
It would also be nice to ignore non-letter characters, so if the word was App3e then the output would be e p
def last_Letters(word):
    last_two = word[-2:]
    swap = last_two[-1:] + last_two[:1]
    for i in swap:
        if i.isupper():
            swap = swap.lower()
    return swap[0] + " " + swap[1]

word = input(" ")
print(last_Letters(word))
You can try the following function:
import re

def last_Letters(word):
    letters = re.sub(r'\d', '', word)
    if len(letters) > 1:
        return letters[-1].lower() + ' ' + letters[-2].lower()
    return None
It follows these steps:
removes all the digits
if there are at least two remaining characters:
lowercases the last and second-to-last characters
builds the required string by concatenating the last letter, a space, and the second-to-last letter
and returns the string
otherwise, returns None
Since I said there was a simpler way, here's what I would write:
text = input()
result = ' '.join(reversed([ch.lower() for ch in text if ch.isalpha()][-2:]))
print(result)
How this works:
[ch.lower() for ch in text] creates a list of lowercase characters from some iterable text
adding if ch.isalpha() filters out anything that isn't an alphabetical character
adding [-2:] selects the last two from the preceding sequence
and reversed() takes the sequence and returns an iterable with the elements in reverse
' '.join(some_iterable) will join the characters in the iterable together with spaces in between.
So, result is set to be the last two characters of all of the alphabetical characters in text, in reverse order, separated by a space.
Part of what makes Python so powerful and popular is that once you learn to read the syntax, the code very naturally tells you exactly what it is doing. If you read out the statement, it is self-describing.
So, what I want to do is convert some words in the string into their respective words from a dictionary, and leave the rest as is. For example, giving input as:
standarisationn("well-2-34 2 #$%23beach bend com")
I want output as:
"well-2-34 2 #$%23bch bnd com"
The code I was using is:
import re

def standarisationn(addr):
    a = re.sub(',', ' ', addr)
    lookp_dict = {"allee":"ale","alley":"ale","ally":"ale","aly":"ale",
                  "arcade":"arc",
                  "apartment":"apt","aprtmnt":"apt","aptmnt":"apt",
                  "av":"ave","aven":"ave","avenu":"ave","avenue":"ave","avn":"ave","avnue":"ave",
                  "beach":"bch",
                  "bend":"bnd",
                  "blfs":"blf","bluf":"blf","bluff":"blf","bluffs":"blf",
                  "boul":"blvd","boulevard":"blvd","boulv":"blvd",
                  "bottm":"bot","bottom":"bot",
                  "branch":"br","brnch":"br",
                  "brdge":"brg","bridge":"brg",
                  "bypa":"byp","bypas":"byp","bypass":"byp","byps":"byp",
                  "camp":"cmp",
                  "canyn":"cny","canyon":"cny","cnyn":"cny",
                  "southwest":"sw","northwest":"nw"}
    temp = re.findall(r"[A-Za-z0-9]+|\S", a)
    print(temp)
    res = []
    for wrd in temp:
        res.append(lookp_dict.get(wrd, wrd))
    res = ' '.join(res)
    return str(res)
but it's giving the wrong output:
'well - 2 - 34 2 # $ % 23beach bnd com'
that is, with too many spaces, and it isn't even converting "beach" to "bch". So that's the issue. What I thought was to first split the string by spaces, then split the resulting elements by special characters and numbers, then apply the dictionary, then rejoin the strings separated by special characters without spaces, and finally join the whole list with spaces. Can anyone suggest how to go about this, or any better method?
You can build your regular expression from the keys of your dictionary, ensuring they're not enclosed in another word (i.e. not directly preceded nor followed by a letter):
import re
def standarisationn(addr):
    addr = re.sub(r'(,|\s+)', " ", addr)
    lookp_dict = {"allee":"ale","alley":"ale","ally":"ale","aly":"ale",
                  "arcade":"arc",
                  "apartment":"apt","aprtmnt":"apt","aptmnt":"apt",
                  "av":"ave","aven":"ave","avenu":"ave","avenue":"ave","avn":"ave","avnue":"ave",
                  "beach":"bch",
                  "bend":"bnd",
                  "blfs":"blf","bluf":"blf","bluff":"blf","bluffs":"blf",
                  "boul":"blvd","boulevard":"blvd","boulv":"blvd",
                  "bottm":"bot","bottom":"bot",
                  "branch":"br","brnch":"br",
                  "brdge":"brg","bridge":"brg",
                  "bypa":"byp","bypas":"byp","bypass":"byp","byps":"byp",
                  "camp":"cmp",
                  "canyn":"cny","canyon":"cny","cnyn":"cny",
                  "southwest":"sw","northwest":"nw"}
    for wrd in lookp_dict:
        addr = re.sub(rf'(?:^|(?<=[^a-zA-Z])){wrd}(?=[^a-zA-Z]|$)', lookp_dict[wrd], addr)
    return addr

print(standarisationn("well-2-34 2 #$%23beach bend com"))
The expression is built from these parts:
^ matches the beginning of the string
(?<=[^a-zA-Z]) is a lookbehind (i.e. a non-capturing assertion), checking that the preceding character is not a letter
{wrd} is the key of your dictionary
(?=[^a-zA-Z]|$) is a lookahead (i.e. a non-capturing assertion), checking that the following character is not a letter, or is the end of the string
Output:
well-2-34 2 #$%23bch bnd com
Edit: you can compile a whole expression and use re.sub only once if you replace the loop with:
repl_pattern = re.compile(rf"(?:^|(?<=[^a-zA-Z]))({'|'.join(lookp_dict.keys())})(?=([^a-zA-Z]|$))")
addr = re.sub(repl_pattern, lambda x: lookp_dict[x.group(1)], addr)
This should be much faster if your dictionary grows because we build a single expression with all your dictionary keys:
({'|'.join(lookp_dict.keys())}) is interpreted as (allee|alley|...
a lambda function in re.sub replaces each matched element with the corresponding value in lookp_dict
I want to find out what words can be formed using the names of musical notes.
This question is very similar: Python code that will find words made out of specific letters. Any subset of the letters could be used
But my alphabet also contains "fis","cis" and so on.
letters = ["c","d","e","f","g","a","h","c","fis","cis","dis"]
I have a really long word list with one word per line and want to use
with open(...) as f:
    for line in f:
        if
to check if each word is part of that "language" and then save it to another file.
my problem is how to alter
>>> import re
>>> m = re.compile('^[abilrstu]+$')
>>> m.match('australia') is not None
True
>>> m.match('dummy') is not None
False
>>> m.match('australian') is not None
False
so it also matches with "fis","cis" and so on.
e.g. "fish" is a match but "ifsh" is not a match.
I believe ^(fis|cis|dis|[abcfhg])+$ will do the job.
Some deconstruction of what's going on here:
| works like an OR conjunction
[...] denotes "any symbol from what's inside the brackets"
^ and $ stand for the beginning and end of the line, respectively
+ stands for "1 or more times"
( ... ) stands for grouping, needed to apply the +/*/{} modifiers. Without grouping, such modifiers apply to the closest expression on the left
Altogether this "reads" as: the whole string is one or more repetitions of fis/cis/dis or one of abcfhg
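For example, checked against the test cases from the question ("fish" should match, "ifsh" should not):

>>> import re
>>> m = re.compile('^(fis|cis|dis|[abcfhg])+$')
>>> m.match('fish') is not None
True
>>> m.match('ifsh') is not None
False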
This function works, and it doesn't use any external libraries:
def func(word, letters):
    for l in sorted(letters, key=len, reverse=True):
        word = word.replace(l, "")
    return not word
it works because if word == "", then it has been decomposed into your letters.
Update:
It seems that my explanation wasn't clear. WORD.replace(LETTER, "") replaces every occurrence of the note/LETTER in WORD with nothing; here is an example:
func("banana", {'na'})
it will replace every 'na' in "banana" by nothing ('')
the result after this is "ba", which is not a note
not "" means True and not "ba" is false, this is syntactic sugar.
here is another example:
func("banana", {'na', 'chicken', 'b', 'ba'})
it will replace every 'chicken' in "banana" by nothing ('')
the result after this is "banana"
it will replace every 'ba' in "banana" by nothing ('')
the result after this is "nana"
it will replace every 'na' in "nana" by nothing ('')
the result after this is ""
it will replace every 'b' in "" by nothing ('')
the result after this is ""
not "" is True ==> HURRAY IT IS A MELODY !
note: the reason for sorting by length is that otherwise the second example would not have worked: if "b" were removed first, "banana" would become "anana", and after removing every "na" we would be left with "a", which can't be decomposed into these letters.
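A quick sketch of that failure mode, with the replacements deliberately applied in an unsorted order:

word = "banana"
for l in ['b', 'ba', 'na']:  # 'b' first, i.e. not sorted by length
    word = word.replace(l, "")
print(repr(word))  # 'a', a truthy leftover, so the check fails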
You can calculate the number of letters of all the units (names of musical notes) that occur in the word, and compare this number to the length of the word.
from collections import Counter
units = {"c","d","e","f","g","a","h", "fis","cis","dis"}
def func(word, units=units):
    letters_count = Counter()
    for unit in units:
        num_of_units = word.count(unit)
        letters_count[unit] += num_of_units * len(unit)
        if len(unit) == 1:
            continue
        # if the unit consists of more than 1 letter (e.g. dis),
        # check if these letters are also one-letter units;
        # if yes, subtract the number of repeated letters
        for letter in unit:
            if letter in units:
                letters_count[letter] -= num_of_units
    return len(word) == sum(letters_count.values())
print(func('disc'))
print(func('disco'))
# True
# False
A solution with tkinter window opening to choose file:
import re
from tkinter import filedialog as fd

m = re.compile('^(fis|ges|gis|as|ais|cis|des|es|dis|[abcfhg])+$')
matches = list()

filename = fd.askopenfilename()
with open(filename) as f:
    for line in f:
        if m.match(str(line).lower()) is not None:
            matches.append(line[:-1])
print(matches)
This answer was posted as an edit to the question find all words in a certain alphabet with multi character letters by the OP Nivatius under CC BY-SA 4.0.
I've got this function that checks all the words in the 1st sequence;
if a word ends with one of the words in the 2nd sequence, it removes that end substring.
I'm trying to achieve all that in one simple lambda function that is supposed to go into pipeline processing, and I can't find a way to do it.
I'll be grateful if you could help me with this:
str_test = "Thiship is a test string testing slowly i'm helpless"
stem_rules = ('less', 'ship', 'ing', 'es', 'ly', 's')
str_test2 = str_test.split()
for i in str_test2:
    for j in stem_rules:
        if i.endswith(j):
            str_test2[str_test2.index(i)] = i[:-len(j)]
            break
This is a one-liner that applies a (simple?) lambda that does it.
(lambda words, rules: sum([[word[:-len(rule)]] if word.endswith(rule) else [] for word in words for rule in rules], []))(str_test.split(), stem_rules)
It's not clear how it's working, and it's not good practice to do it.
What it generally does is create a list with a single string out of matches, or an empty list out of misses, and then aggregates everything to single list, containing only the matches.
Currently it will produce output for every match, not just the longest match; but once you figure out how it works, maybe you can select the shortest result from the list of matches for each word in the input, as in the sketch below.
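For illustration, a hedged sketch of that idea (my addition, not part of the answer), reusing str_test and stem_rules from the question and keeping the shortest remainder per word:

strip_stem = lambda w: min(
    [w[:-len(r)] for r in stem_rules if w.endswith(r)] or [w],  # [w] if no rule matches
    key=len,
)
print([strip_stem(w) for w in str_test.split()])
# ['Thi', 'i', 'a', 'test', 'str', 'test', 'slow', "i'm", 'help']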
May god be with you.
The first thing I'd do is toss your i.endswith(j) for j in stem_rules loop and replace it with a regex that matches and captures the prefix, and matches (but doesn't capture) any of the suffixes:
import re
match_end = re.compile("(.*?)(?:" + "|".join(stem + "$" for stem in stem_rules) + ")")
# This is the same as:
re.compile(r"""
    (.*?)    # Capturing group matching the prefix
    (?:      # Begins a non-capturing group...
        stem1$|
        stem2$|
        stem3$  # ...which matches an alternation of the stems, asserting end of string
    )        # ends the non-capturing group""", re.X)
Then you can use that regex to sub each item in the list.
f = lambda word: match_end.sub(r"\1", word)
Use that wrapped in a list comprehension and you should have your result
words = [f(word) for word in str_test.split()]
# or map(f, str_test.split())
To convert your current code into a single lambda, each step in the pipeline needs to behave in a very functional manner: receive some data, then emit some data. You need to avoid anything that deviates from that paradigm, in particular the use of things like break. Here's one way to rewrite the steps in that manner:
text = ("Thiship is a test string testing slowly i'm helpless")
stems = ('less', 'ship', 'ing', 'es', 'ly','s')
# The steps:
# - get words from the text
# - pair each word with its matching stems
# - create a list of cleaned words (stems removed)
# - make the new text
words = text.split()
wstems = [ (w, [s for s in stems if w.endswith(s)]) for w in words ]
cwords = [ w[0:-len(ss[0])] if ss else w for w, ss in wstems ]
text2 = ' '.join(cwords)
print(text2)
With those parts in hands, a single lambda can be created using ordinary substitution. Here's the monstrosity:
f = lambda txt: [
    w[0:-len(ss[0])] if ss else w
    for w, ss in [(w, [s for s in stems if w.endswith(s)]) for w in txt.split()]
]
text3 = ' '.join(f(text))
print(text3)
I wasn't sure whether you want the lambda to return the new words or the new text -- adjust as needed.
I have a large list of domain names (around six thousand), and I would like to see which words trend the highest for a rough overview of our portfolio.
The problem I have is the list is formatted as domain names, for example:
examplecartrading.com
examplepensions.co.uk
exampledeals.org
examplesummeroffers.com
+5996
Just running a word count brings up garbage. So I guess the simplest way to go about this would be to insert spaces between whole words then run a word count.
For my sanity I would prefer to script this.
I know (very) little Python 2.7, but I am open to any recommendations in approaching this; an example of code would really help. I have been told that using a simple string trie data structure would be the simplest way of achieving this, but I have no idea how to implement it in Python.
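Since the question mentions a string trie, here is a minimal sketch of one built from plain dicts, for reference (the _end marker and the sample words are illustrative assumptions, not taken from any answer below):

def make_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["_end"] = True  # marks the end of a complete word
    return root

def in_trie(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "_end" in node

trie = make_trie(["car", "cart", "trading"])
print(in_trie(trie, "cart"))  # True
print(in_trie(trie, "ca"))    # False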
We try to split the domain name (s) into any number of words (not just 2) from a set of known words (words). Recursion ftw!
def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest
This iterator function first yields the string it is called with, if it is in words. Then it splits the string in two in every possible way. If the first part is not in words, it tries the next split. If it is, the first part is prepended to all the results of calling itself on the second part (of which there may be none, as with ["example", "cart", ...] where the remainder cannot be split further).
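A quick usage sketch (the toy word set is an assumption for illustration):

toy_words = {"example", "car", "cart", "trading"}
print(list(substrings_in_set("examplecartrading", toy_words)))
# -> [['example', 'car', 'trading']]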
Then we build the English dictionary:
# Assuming Linux. Word list may also be at /usr/dict/words.
# If not on Linux, grab yourself an English word list and insert here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())
# The above english dictionary for some reason lists all single letters as words.
# Remove all except "i" and "u" (remember a string is an iterable, which means
# that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")
# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))
# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast"))
Now we can put things together:
count = {}
no_match = []

domains = ["examplecartrading.com", "examplepensions.co.uk",
           "exampledeals.org", "examplesummeroffers.com"]
# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()
    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)

print count
print "No match found for:", no_match
Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}
Using a set to contain the english dictionary makes for fast membership checks. -= removes items from the set, |= adds to it.
(From an earlier revision of this answer: using the all function together with a generator expression improves efficiency, since all returns on the first False.)
Some substrings may be a valid word both as either a whole or split, such as "example" / "ex" + "ample". For some cases we can solve the problem by excluding unwanted words, such as "ex" in the above code example. For others, like "pensions" / "pens" + "ions", it may be unavoidable, and when this happens, we need to prevent all the other words in the string from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by keeping track of the found words of each domain name in a set -- sets ignore duplicates -- and then count the words once all have been found.
EDIT: Restructured and added lots of comments. Forced strings to lower case to avoid misses because of capitalization. Also added a list to keep track of domain names where no combination of words matched.
NECROMANCY EDIT: Changed substring function so that it scales better. The old version got ridiculously slow for domain names longer than 16 characters or so. Using just the four domain names above, I've improved my own running time from 3.6 seconds to 0.2 seconds!
Assuming you only have a few thousand standard domains, you should be able to do this all in memory.
domains = open(domainfile)
dictionary = set(DictionaryFileOfEnglishLanguage.readlines())
found = []
for domain in domains.readlines():
    for substring in all_sub_strings(domain):
        if substring in dictionary:
            found.append(substring)

from collections import Counter
c = Counter(found)  # this is what you want
print c
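Note that all_sub_strings is left undefined in this sketch (as are domainfile and DictionaryFileOfEnglishLanguage). A hypothetical definition, my assumption rather than part of the original answer, could enumerate every contiguous substring:

def all_sub_strings(s):
    # every contiguous substring of s
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]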
with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f.readlines()]

def guess_split(word):
    result = []
    for n in xrange(len(word)):
        if word[:n] in words and word[n:] in words:
            result = [word[:n], word[n:]]
    return result

from collections import defaultdict
word_counts = defaultdict(int)
with open('blah.txt') as f:
    for line in f.readlines():
        for word in line.strip().split('.'):
            if len(word) > 3:
                # junks the com, org, stuff
                for x in guess_split(word):
                    word_counts[x] += 1

for spam in word_counts.items():
    print '{word}: {count}'.format(word=spam[0], count=spam[1])
Here's a brute-force method which only tries to split the domains into 2 English words. If the domain doesn't split into 2 English words, it gets junked. It should be straightforward to extend this to attempt more splits, but it will probably not scale well with the number of splits unless you're clever. Fortunately, I guess you'll only need 3 or 4 splits max.
output:
deals: 1
example: 2
pensions: 1