comparing occurrence of strings in list in python


I'm super new to Python and I'm kind of stuck on one of my class exercises.
The question goes something like this: you have a file that contains characters, i.e. words. (I'm still at the stage where all the terms get mixed up, so I apologize if that is not the correct term.)
Example of the file.txt content: accbd
The question asks me to read the file in Python and make sure that no letter occurs more often than a letter that comes later in the alphabet, e.g. a cannot occur more frequently than b, b cannot occur more frequently than c, and so on. In the example file, c occurs more frequently than d, so I need to raise an error message.
Here's my attempt:
def main():
    f = open('.txt', 'r')  # 1st: open the file
    data = f.read()  # 2nd: read the file
    words = list(data)  # 3rd: create a list that contains every letter
    newwords = sorted(words)  # sort in alphabetical order
I'm stuck at the last part which is to count that the former word doesn't occur more than the later word, and so on. I tried two ways but neither is working. Here's trial 1:
from collections import counter
for i in newwords:
    try:
        if counter(i) <=counter(i+1):
            print 'ok'
        else:
            print 'not ok between indexes %d and %d' % (i, i+1)
    except:
        pass
The 2nd trial is similar
for i in newwords:
    try:
        if newwords.count(i) <= newwords.count(i+1):
            print 'ok'
        else:
            print 'ok between indexes %d and %d' % (i, i+1)
    except:
        pass
What is the correct way to compare the count for each word in sequential order?

I had posted an answer, but I see it's for an assignment, so I'll try to explain instead of just splatting a solution here.
My suggestion would be to solve it in three steps:
1) First, create a sorted list of the characters that appear in the string:
from the data string you can use set(data) to pick out every unique character;
if you pass this set to sorted() you get a list of those characters, sorted alphabetically (a set has no sort() method of its own).
2) Then use this list in a for loop (or list comprehension) to create a second list holding each letter's number of occurrences in data, using data.count(<letter in the list>). Note that the elements of this second list follow the alphabetical order of the letters in the first list (because of the for loop).
3) Compare this second list of counts with a sorted version of itself (now sorted by value) and see whether they match. If they don't, some earlier letter appears more often than a later one.
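The three steps can be sketched roughly as follows. This is only an illustration, not a finished assignment solution: the function name counts_are_non_decreasing is made up here, and it assumes data holds just the letters (a trailing newline read from the file would need to be stripped first, e.g. with data.strip()).

```python
def counts_are_non_decreasing(data):
    # Step 1: unique characters, sorted alphabetically.
    letters = sorted(set(data))
    # Step 2: number of occurrences of each letter, in alphabetical order.
    counts = [data.count(letter) for letter in letters]
    # Step 3: the counts must already be in ascending order.
    return counts == sorted(counts)


print(counts_are_non_decreasing("accbd"))   # False: c occurs more often than d
print(counts_are_non_decreasing("abbccc"))  # True
```

If the check returns False, the exercise asks for an error message, so the caller could raise or print one at that point.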

To be a little more clear:
In [2]: string = 'accbd'
In [3]: import collections
In [4]: collections.Counter(string)
Out[4]: Counter({'c': 2, 'a': 1, 'b': 1, 'd': 1})
Then it's just a for loop with enumerate(list_).
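One possible shape for that loop is sketched below. The helper name first_violation is hypothetical, and the sketch assumes that only letters actually present in the string need to be compared with each other.

```python
from collections import Counter


def first_violation(data):
    """Return the first (letter, next_letter) pair that breaks the rule, or None."""
    counts = Counter(data)
    letters = sorted(counts)  # distinct letters in alphabetical order
    for index, letter in enumerate(letters[:-1]):
        next_letter = letters[index + 1]
        if counts[letter] > counts[next_letter]:
            return letter, next_letter
    return None


print(first_violation("accbd"))  # ('c', 'd')
```

Checking only neighbouring letters is enough here: if every letter occurs no more often than the next one, the same holds for any later letter as well.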

Related

Trying to sort two combined strings alphabetically without duplicates

Challenge: Take 2 strings s1 and s2 including only letters from a to z. Return a new sorted string, the longest possible, containing distinct letters - each taken only once - coming from s1 or s2.
# Examples
a = "xyaabbbccccdefww"
b = "xxxxyyyyabklmopq"
assert longest(a, b) == "abcdefklmopqwxy"
a = "abcdefghijklmnopqrstuvwxyz"
assert longest(a, a) == "abcdefghijklmnopqrstuvwxyz"
So I am just starting to learn, but so far I have this:
def longest(a1, a2):
    for letter in max(a1, a2):
        return ''.join(sorted(a1+a2))
which returns all the letters but I am trying to filter out the duplicates.
This is my first time on stack overflow so please forgive anything I did wrong. I am trying to figure all this out.
I also do not know how to indent in the code section if anyone could help with that.

You have two options here. The first is the answer you asked for, and the second is an alternative method.
To filter out duplicates, you can start with an empty string and then go through the returned string. For each character, if it is already in the new string, move on to the next one; otherwise add it.
out = ""
for i in returned_string:
    if i not in out:
        out += i
return out
This would be embedded inside a function.
The second option is to use Python's sets. For what you want to do, you can think of them as unordered collections with no duplicate elements. You could simplify your function to
def longest(a: str, b: str):
    return "".join(sorted(set(a).union(set(b))))
This makes a set from all the characters in a, and another one from all the characters in b. It then combines them (union) into a single set. Because sets are unordered, sorted() is needed to put the characters in alphabetical order before you join them into the final string. Hope this helps.
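Putting the first option inside the original function might look like the sketch below. It keeps the asker's function name and the example strings from the question; the loop variable names are arbitrary.

```python
def longest(a1, a2):
    out = ""
    # Walk the combined letters in alphabetical order and keep each one once.
    for ch in sorted(a1 + a2):
        if ch not in out:
            out += ch
    return out


a = "xyaabbbccccdefww"
b = "xxxxyyyyabklmopq"
print(longest(a, b))  # abcdefklmopqwxy
```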

In Python: How can i get my code to print out all the possible words I can spell based on my input?

I think I was close to figuring out how to print out all the possible words based on user input from my set dictionary. It's based on the assumption that the user input is 'ART', so the possible words I have in my dictionary are ART, RAT, TART, and TAR, but only the three-letter combinations are printing out. Can anyone tell me where I am going wrong? Thanks!
Dictionary = ["tar","art","tart","rat"] #creates dictionary of set words
StoredLetters = input('input your word here: ') #allows the user to input any word
list(StoredLetters)
def characters(word):
    Dictionary = {}
    for i in word:
        Dictionary[i] = Dictionary.get(i, 0) + 1
    return Dictionary
def all_words(StoredLetters, wordSet):
    for word in StoredLetters:
        flag = 1
        words = characters(word)
        for key in words:
            if key not in wordSet:
                flag = 0
            else:
                if wordSet.count(key) != words[key]:
                    flag = 0
        if flag == 1:
            print(word)
if __name__ == "__main__":
    print(all_words(Dictionary, StoredLetters))

It appears a few things could be contributing to this.
The arguments are swapped relative to the parameter names: the function is defined as def all_words(StoredLetters, wordSet), but in main you call all_words(Dictionary, StoredLetters). Arguments are matched by position unless you pass them by name (look up keyword arguments in Python), so inside the function StoredLetters is actually the word list and wordSet is the user's input.
The characters function reuses the name Dictionary for a new local variable. This does not empty the word list defined at the top (the assignment only creates a local name that shadows it inside the function), but it makes the code hard to follow. Try using unique names when creating new variables.

First off, you are confusing things by having the name of your variable StoredLetters also being the name of one of the arguments to your all_words function.
Second, you are actually passing in StoredLetters, which is art, as the 2nd argument to the function, so it is wordSet in the function, not StoredLetters!
You should really keep things more clear by using different variable names, and making it obvious what are you using for which argument. words isn't really words, it's a dictionary with letters as keys, and how many times they appear as the values! Making code clear and understandable goes a long way to making it easy to debug. You have word, StoredLetters, wordSet, another StoredLetters argument, words = characters(word) which doesn't do what is expected. This could all use a good cleanup.
As for the functionality, with art, each letter only appears once, so for tart, which has t twice, if wordSet.count(key) != words[key] will evaluate as True, and flag will be set to 0, and the word will not be printed.
Hope that helps, and happy coding!

Based on the follow-up comments, the rule is that we must use all characters in the target word, but we can use each character as many times as we want.
I'd set up the lookup "dictionary" data structure as a Python dict which maps sorted, unique characters as tuples in each dictionary word to a list of the actual words that can be formed from those characters.
Next, I'd handle the lookups as follows:
Sort the unique characters of the user input (target word) and index into the dictionary to get the list of words it could make. Using a set means that we allow repetition and sorting the characters means we normalize for all of the possible permutations of those letters.
The above alone can give false positives, so we filter the resulting word list to remove any actual result words that are shorter than the target word. This ensures that we handle a target word like "artt" correctly and prevent it from matching "art".
Code:
from collections import defaultdict
class Dictionary:
    def __init__(self, words):
        self.dictionary = defaultdict(list)
        for word in words:
            self.dictionary[tuple(sorted(set(word)))].append(word)
    def search(self, target):
        candidates = self.dictionary[tuple(sorted(set(target)))]
        return [x for x in candidates if len(x) >= len(target)]
if __name__ == "__main__":
    dictionary = Dictionary(["tar", "art", "tart", "rat"])
    tests = ["art", "artt", "ar", "arttt", "aret"]
    for test in tests:
        print(f"{test}\t=> {dictionary.search(test)}")
Output:
art => ['tar', 'art', 'tart', 'rat']
artt => ['tart']
ar => []
arttt => []
aret => []
The issues in the original code have been addressed nicely in the other answers. The logic doesn't seem clear since it's comparing characters to words and variable names often don't match the logic represented by the code.
It's fine to use a frequency counter, but you'll be stuck iterating over the whole dictionary, and you'll need to check that the count of each character in a dictionary word is at least the corresponding count in the target word. I doubt the code I'm offering is optimal, but I think it should be much faster than the counter approach.
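For comparison, the frequency-counter approach could be sketched roughly like this. It assumes the rule from the follow-up comments (every character of the target must be used, and each may be repeated), and the function names matches and search are made up for illustration.

```python
from collections import Counter


def matches(word, target):
    word_counts = Counter(word)
    target_counts = Counter(target)
    # The word must use exactly the target's letters, no others.
    if set(word_counts) != set(target_counts):
        return False
    # Each letter must appear at least as often as it does in the target.
    return all(word_counts[ch] >= n for ch, n in target_counts.items())


def search(words, target):
    # Scans every dictionary word, which is the cost mentioned above.
    return [word for word in words if matches(word, target)]


words = ["tar", "art", "tart", "rat"]
print(search(words, "art"))   # ['tar', 'art', 'tart', 'rat']
print(search(words, "artt"))  # ['tart']
```

Unlike the length filter in the class above, this compares counts letter by letter, at the price of a full pass over the word list for every lookup.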

Most frequents words in Python

I was trying to implement a code that would allow me to find the 10 most frequent words in a text. I'm new at python, and am more used to languages like C#, java or even C++. Here is what I did:
f = open("bigtext.txt","r")
word_count = {}
Basically, my idea is to create a dictionary that contains the number of times each word is present in my text. If the word is not present, I add it to the dictionary with the value 1. If the word is already present in the dictionary, I increment its value by 1.
for x in f.read().split():
    if x not in word_count:
        word_count[x] = 1
    else:
        word_count[x] += 1
sorted(word_count.values)
Here, I sort my dictionary by values (since I'm looking for the 10 most frequent words, I need the 10 words with the biggest values).
for keys,values in word_count.items():
    values = values + 1
    print(word_count[-values])
    if values == 10:
        break
Here is the part where it all fails. I now know for sure (since I sorted my dictionary by value) that my 10 most frequent words are the last 10 elements of my dictionary. I want to display those. So I decided to initialize values at 1 and to display my dictionary backward until values = 10, so that I won't display more than I need. But unfortunately, I get the following error:
File "<ipython-input-19-f5241b4c239c>", line 13
for keys,values in word_count.items()
^
SyntaxError: invalid syntax
I do know that my mistake is that I didn't display my dictionary backwards correctly. But I don't know how to proceed elsewhere. So if someone can tell me how to properly display my last 10 elements in my dictionary, I would very much appreciate it. Thank You.

If you didn’t want to use collections.Counter, you could do something like this:
for word, count in sorted(word_count.items(), key=lambda x: -x[1])[:10]:
    print(word, count)
This gets all the words in the dictionary, along with their counts, into a list of tuples, sorts that list by the 2nd item in each tuple (the count) in descending order, and then prints only the first (i.e. highest) ten of those.
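For reference, here is a small sketch that shows the collections.Counter route next to the sorted() route. The sample text is invented for the example, and it takes the top 2 words instead of the top 10 to keep the output short.

```python
from collections import Counter

text = "the cat and the dog and the bird"  # stand-in for f.read()

# Counter builds the word -> count mapping in one step.
word_count = Counter(text.split())
top = word_count.most_common(2)

# The same result using only sorted() on the (word, count) pairs.
by_sorting = sorted(word_count.items(), key=lambda pair: -pair[1])[:2]

print(top)         # [('the', 3), ('and', 2)]
print(by_sorting)  # [('the', 3), ('and', 2)]
```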

I would like to address a big thank you to Ben who told me that I can't sort a dictionary like that.
So this is my final solution (hoping it will help someone else):
my_words = []
for keys, values in word_count.items():
    my_words.append((values,keys))
I created a list and added to it each count from my dictionary, paired with its word, as (count, word) tuples.
my_words.sort(reverse = True)
I then sorted my list by count in reverse (so that my 10 most frequent words would be the first 10 elements of my list).
print("The 10 most frequent words in this text are:")
print()
for count, word in my_words[:10]:
    print(count, word)
I then simply displayed the first 10 elements of my list.
I would also like to thank all of you who told me about NLTK. I will try it later to have a more optimal and accurate solution.
Thank You so much for your help.

how can i search for common elements in two integers with while loop

In my code I'm having a problem because I cannot compare two lists the way I want. What I'm trying to do is compare the first elements of the two inputs first, and if they are not the same, move on to the next index of the longer input (GUESS1). After finishing the comparisons for the first element, I want to compare the second element, and so on. What I mean is first checking (A-C)(A-A)(A-T), then (C-A)(C-T), and then (T-T)...
I want the resulting list to be (A,T) because of the ATT part of GUESS1.
However, I'm stuck because I always get ACT, not A and T.
Where am I going wrong? I would be very glad if you could enlighten me.
Edit:
What I'm trying to do is look for the best match within the longer list GUESS1, and find the most similar part, which is ATT.
GUESS1="CATTCG"
GUESS2="ACT"
if len(str(GUESS1))>len(str(GUESS2)):
    DNA_input_list=list((GUESS1))
    DNA_input1_list=list((GUESS2))
common_elements=[]
i=0
while i<len(DNA_input1_list)-1:
    j=0
    while j<len(DNA_input_list)-len(DNA_input1_list):
        if DNA_input_list[i] == DNA_input1_list[j]:
            common_elements.append(DNA_input1_list[j])
            i+=1
        j+=1
        if j>len(DNA_input1_list)-1:
            break
print(common_elements)

As far as I understand, you want to find a shorter string inside a longer string, and if it is not found, remove an element from the shorter string and repeat the search.
You can use Python's string find method for that, i.e. "CATTCG".find('ACT'). This returns -1 because there is no substring ACT. What you can then do is remove an element from the shorter string using the slice operator [x:] and repeat the search like this:
>>> for x in range(len('ACT')):
...     if "CATTCG".find('ACT'[x:]) > -1:
...         print("CATTCG".find('ACT'[x:]))
...         print("Match found for " + 'ACT'[x:])
In this code, first a range of offsets is generated, i.e. [0, 1, 2]; this is the number of characters we're going to slice off from the beginning.
In the second line we do the slicing with 'ACT'[x:] (for x == 0 we get 'ACT', for x == 1 we get 'CT', and for x == 2 we get 'T').
The last two lines print out the position and the string that matched.

If I have understood everything correctly, you want to return the longest substring of GUESS2 which is included in GUESS1.
I would use something like this (it only checks substrings that start at the beginning of GUESS2).
common_elements = ""
for count in range(1, len(GUESS2) + 1):
    if GUESS2[:count] in GUESS1:
        common_elements = GUESS2[:count]
print(common_elements) #if a function, return common_elements
The loop runs once for each possible length of the search string.
Then check if the substring is included in the other.
If so, save it to a variable and print/return it after the loop has finished.
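If pieces from the middle or end of the shorter string should count as well, a brute-force search over all of its contiguous substrings is one way to go. This is a sketch of a different technique from the prefix check above, written under the assumption that "best match" means the longest contiguous piece of the shorter string found in the longer one; the function name is hypothetical.

```python
def longest_common_substring(longer, shorter):
    # Try the longest candidates first so the first hit is the answer.
    for length in range(len(shorter), 0, -1):
        for start in range(len(shorter) - length + 1):
            candidate = shorter[start:start + length]
            if candidate in longer:
                return candidate
    return ""  # nothing in common


print(longest_common_substring("CATTCG", "ATT"))  # ATT
print(longest_common_substring("CATTCG", "ACT"))  # A
```

For short inputs like these the nested loops are cheap; for long sequences a dedicated algorithm would be a better fit.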

Sorting a concordance?

For my homework, I need to isolate the most frequent 50 words in a text. I have tried a whole lot of things, and in my most recent attempt, I have done a concordance using this:
concordance = {}
lineno = 0
for line in vocab:
    lineno = lineno + 1
    words = re.findall(r'[A-Za-z][A-Za-z\'\-]*', line)
    for word in words:
        word = word.title()
        if word in concordance:
            concordance[word].append(lineno)
        else:
            concordance[word] = [lineno]
listing = []
for key in sorted(concordance.keys()):
    listing.append( [key, concordance[key] ])
What I would like to know is whether I can sort the subsequent concordance in order of most frequently used word to least frequently used word, and then isolate and print the top 50? I am not permitted to import any modules other than re and sys, and I'm struggling to come up with a solution.

sorted is a builtin which does not require an import. Try something like:
sorted(concordance.items(), key=lambda kv: len(kv[1]), reverse=True)[:50]
Not tested, but you get the idea.
The key function ranks each word by how many line numbers were recorded for it, and reverse=True puts the most frequent words first. sorted returns a new list, so you can slice it directly.
There are probably slightly more efficient ways to take the first 50, but I doubt it matters here.

A few hints:
Use enumerate(list) in your for loop to get the line number and the line at once.
Consider \w for word characters in your regular expression instead of listing [A-Za-z...]; note that \w also matches digits and underscores, and does not match apostrophes or hyphens.
Read about the dict.items() method. It gives you the (key, value) pairs (a list in Python 2, a view you can pass to list() or sorted() in Python 3).
Sort that list with list.sort(key=function_returning_the_sort_key); the key function takes a single item and returns the value to sort by.
You can define that function with a lambda, but it is not necessary.
Use the len(list) function to get the length of the list. You can use it to get the number of matches of a word (which are stored in a list).
UPDATE: Oh yeah, and use slices to get a part of the resulting list: list[:50] to get the first 50 items (equivalent to list[0:50]), and list[5:10] to get the items from index 5 inclusive to index 10 exclusive.
To print them, loop through the resulting list, then print every word. Alternatively, you can use something like print('[separator]'.join(list)) to print a string with all the items separated by '[separator]'.
Good luck.
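Taken together, the hints might lead to something like the sketch below. It imports only re, as the assignment requires; the function name top_words, the limit parameter, and the two sample lines are invented for the example.

```python
import re


def top_words(lines, limit=50):
    concordance = {}
    # enumerate gives the line number and the line at once.
    for lineno, line in enumerate(lines, start=1):
        for word in re.findall(r"[A-Za-z][A-Za-z'\-]*", line):
            concordance.setdefault(word.title(), []).append(lineno)
    # Rank words by how many times they were recorded, most frequent first.
    ranked = sorted(concordance.items(), key=lambda item: len(item[1]), reverse=True)
    return ranked[:limit]


sample = ["the cat and the dog", "the dog barked"]
for word, linenos in top_words(sample, limit=2):
    print(word, len(linenos), linenos)
```

Words with the same count keep the order in which they were first seen, because Python's sort is stable.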
