Using split after a set statement in Python

I have a list of words (equivalent to about two full sentences) and I want to split it into two parts: one part containing 90% of the words and another part containing 10% of them. After that, I want to print a list of the unique words within the 10% list, lexicographically sorted. This is what I have so far:
pos_90 = (90*len(words)) // 100 #list with 90% of the words
pos_90 = pos_90 + 1 #I incremented the number by 1 in order to use it as an index
pos_10 = (10*len(words)) // 100 #list with 10% of the words
list_90 = words[:pos_90] #Creation of the 90% list
list_10 = words[pos_10:] #Creation of the 10% list
uniq_10 = set(list_10) #List of unique words out of the 10% list
split_10 = uniq_10.split()
sorted_10 = split_10.sort()
print(sorted_10)
I get an error saying that split cannot be applied to set, so I assume my mistake must be in the last lines of code. Any idea about what I'm missing here?

split only makes sense when converting from one long str to a list of the components of said str. If the input was in the form 'word1 word2 word3', yes, split would convert that str to ['word1', 'word2', 'word3'], but your input is a set, and there is no sane way to "split" a set like you seem to want; it's already a bag of separated items.
All you really need to do is convert your set back to a sorted list. Replace:
split_10 = uniq_10.split()
sorted_10 = split_10.sort()
with either:
sorted_10 = list(uniq_10)
sorted_10.sort() # NEVER assign the result of .sort(); it's always going to be None
or the simpler one-liner that encompasses both listifying and sorting:
sorted_10 = sorted(uniq_10) # sorted, unlike list.sort, returns a new list
The final option is generally the most Pythonic approach to converting an arbitrary iterable to list and sorting that new list, returning the result. It doesn't mutate the input, doesn't rely on the input being a specific type (set, tuple, list, it doesn't matter), and it's simpler to boot. You only use list.sort() when you already have a known list, and don't mind mutating it.
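For reference, here is a minimal end-to-end sketch of the 90/10 split followed by the sorted unique words. It assumes a single split index used for both slices (the question's code uses pos_10 as the start of the second slice, which would return roughly the last 90% rather than the last 10%). The sample words list and the name split_at are made up for illustration.

```python
# Hypothetical sample data: 30 words, so the last 10% is exactly 3 words.
words = ["w%d" % i for i in range(27)] + ["pear", "apple", "pear"]

# One index marks the boundary between the 90% part and the 10% part.
split_at = (90 * len(words)) // 100

list_90 = words[:split_at]   # first 90% of the words
list_10 = words[split_at:]   # remaining 10% of the words

# set() removes duplicates; sorted() returns a new, lexicographically sorted list.
sorted_10 = sorted(set(list_10))
print(sorted_10)
```

Using the same index on both sides of the slice guarantees that the two parts never overlap and that together they cover the whole list.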

Related

Multiple set unions along with list comprehension

I am trying to understand this code:
edit_two_set = set()
edit_two_set = set.union(*[edit_two_set.union(edit_one_letter(w, allow_switches)) for w in one])
Here one is a set of strings. allow_switches is True.
edit_one_letter takes in one word and makes either one character insertion, deletion or one switch of corresponding characters.
I understand:
[edit_two_set.union(edit_one_letter(w, allow_switches)) for w in one]
is performing a list comprehension in which for every word in one we make one character edit and then take the union of the resulting set with the previous set.
I am mainly stuck at trying to understand what:
set.union(*[])
is doing?
Thanks!
You can refer to this:
https://docs.python.org/3/library/stdtypes.html#frozenset.union
The list comprehension returns a list of sets.
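The * unpacks that list, so each set is passed to set.union as a separate argument. set.union(*list_of_sets) then performs a union of all the sets within the list and returns a new set.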
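A small sketch of the unpacking may help. The edit_one_letter below is only a hypothetical stand-in (it generates deletions only), since the real function is not shown in the question.

```python
def edit_one_letter(word, allow_switches=True):
    # Hypothetical stand-in for the real function: deletions only.
    return {word[:i] + word[i + 1:] for i in range(len(word))}

one = {"cat", "dog"}

# The list comprehension produces a list of sets, one per word.
list_of_sets = [edit_one_letter(w) for w in one]

# The * unpacks the list, so this is the same as set.union(set_a, set_b):
# the first set is the receiver and the others are unioned into a new set.
edit_two_set = set.union(*list_of_sets)
print(sorted(edit_two_set))

# With an empty list there is no receiver and set.union(*[]) raises TypeError.
# Starting from an explicit empty set avoids that.
safe_result = set().union(*[])
```

Note that in the original code edit_two_set is still empty while the comprehension runs, so edit_two_set.union(...) inside it simply returns a copy of each result set.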

the code is giving me a number from the list instead of the mode

For one of my assignments I need to find the mode of a list called "dataset" without using any module or function that would find the mode by itself.
I tried to make it output the mode or the list of modes, depending on the list of numbers. I used 2 for loops so that each number of the list is compared with every number of the list, including itself, to count how many times it occurs. For example, if my list was 123415 it would say there are 2 ones, and it does this for all the numbers of the list. The number with the highest count would be the mode. The bottom section of the code, where the if, elif and else are, checks whether the number has the highest count by comparing it with the previous highest count, to see if it is larger or the same.
I've tried changing the order of the code but I'm still confused about why it gives this wrong result.
pop_number = []
pop_amount = 0
amount = 0
for i in range(len(dataset)):
for x in dataset:
if dataset[i] == x:
amount += 1
if amount>pop_amount:
pop_amount = amount
pop_number = []
pop_number.append(x)
amount = 0
elif amount==pop_amount:
pop_amount = amount
if x not in pop_number:
pop_number.append(x)
pop_amount = amount
amount = 0
else:
continue
print(pop_number)
I expected the output to be the mode of the list or the list of modes, but it came up with the last number from the list.
As this is apparently homework, I will present a sketch, not working code.
Observe that a dict in Python can hold key-value mappings.
Let the numbers in the input list be the keys, and the values the number of times they occur. Going over the list, use each item as the key for the dict, and add one to the value (starting at 0 -- defaultdict(int) is good for this). If the result is bigger than any previous maximum, remember this key.
Since you want to allow for more than one mode value, the variable which remembers the maximum key should be a list; but since you have a new maximum, replace the old list with a list containing just this key. If another value also reaches the maximum, add it to the list. (That's the append method.)
(See how this is if bigger than maximum so far and then else if equal to maximum so far and then otherwise there is no need to do anything.)
When you have looped over all items in the input list, the list of remembered keys is your result.
Go back and think about what variables you need already before the loop. The maximum so far should be defined but guaranteed to be smaller than any value you will see -- it makes sense to start this at 0 because as soon as you see one key, it will have a bigger count than zero. And the keys you want to remember can start out as an empty list.
Now think about how you would test this. What happens if the input list is empty? What happens if the input list contains just the same number over and over? What happens if every item on the input list is unique? Can you think of other corner cases?
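One possible reading of the sketch above, for comparison once you have tried it yourself. The function name find_modes is made up, and a plain dict with .get() stands in for defaultdict(int) so that nothing needs to be imported.

```python
def find_modes(dataset):
    """Return a list of the most frequent value(s) in dataset."""
    counts = {}      # value -> number of times seen so far
    best = 0         # highest count seen so far
    modes = []       # values whose count equals best
    for item in dataset:
        counts[item] = counts.get(item, 0) + 1
        if counts[item] > best:
            # New maximum: replace the old list with just this value.
            best = counts[item]
            modes = [item]
        elif counts[item] == best:
            # Another value has reached the current maximum.
            modes.append(item)
        # Otherwise there is nothing to do.
    return modes

print(find_modes([1, 2, 3, 4, 1, 5]))
print(find_modes([1, 1, 2, 2]))
```

An empty input returns an empty list, and a list where every item is unique returns all of its items, which covers two of the corner cases mentioned above.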
Without using any module or function that finds the mode directly, you can do this with much less code. Your code can work with a little more effort, and I strongly suggest you try to solve the problem with your own logic first. Meanwhile, let me show you how to use Python's built-in data structures (lists, tuples, dictionaries and sets) to do it in 7-8 lines. There is also unpacking at the end (*). I suggest you look these up when you get time.
lst = [1,1,1,1,2,2,2,3,3,3,3,3,3,4,2,2,2,5,5,6]
# finds the unique elements
unique_elems = set(lst)
# creates a dictionary with the unique elems as keys and initializes the values to 0
count = dict.fromkeys(unique_elems,0)
# gets the frequency of each element in the lst
for elem in unique_elems:
    count[elem] = lst.count(elem)
# finds max frequency
max_freq = max(count.values())
# stores list of mode(s)
modes = [i for i in count if count[i] == max_freq]
# prints mode(s); I have used unpacking (*) here so that in case there is one mode,
# you don't have to print ugly [x]
print(*modes)
Or if you want to go for the shortest (I really shouldn't be making such bold claims on Stack Overflow), then I guess this will be it (even though writing short code for its own sake is discouraged):
lst = [1,1,1,1,2,2,2,3,3,3,3,3,3,4,2,2,2,5,5,6]
freq_dist = [(i, lst.count(i)) for i in set(lst)]
[print(i,end=' ') for i,j in freq_dist if j==max(freq_dist, key=lambda x:x[1])[1]]
And if you just want to go bonkers and say goodbye to loops (Goes without saying, this is ugly, really ugly):
lst = [1,1,1,1,2,2,2,3,3,3,3,3,3,4,2,2,2,5,5,6]
unique_elems = set(lst)
freq_dist = list(map(lambda x:(x, lst.count(x)), unique_elems))
print(*list(map(lambda x:x[0] if x[1] == max(freq_dist,key = lambda y: y[1])[1] else '', freq_dist)))

finding first item in a list whose first item in a tuple is matched

I have a list of several thousand unordered tuples that are of the format
(mainValue, (value, value, value, value))
Given a main value (which may or may not be present), is there a 'nice' way, other than iterating through every item looking and incrementing a value, where I can produce a list of indexes of tuples that match like this:
index = 0
for destEntry in destList:
    if destEntry[0] == sourceMatch:
        destMatches.append(index)
    index = index + 1
So I can compare the sub values against another set, and remove the best match from the list if necessary.
This works fine, but just seems like python would have a better way!
Edit:
As per the question, when writing the original question, I realised that I could use a dictionary instead of the first value (in fact this list is within another dictionary), but after removing the question, I still wanted to know how to do it as a tuple.
With list comprehension your for loop can be reduced to this expression:
destMatches = [i for i,destEntry in enumerate(destList) if destEntry[0] == sourceMatch]
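You can also use the built-in filter() function [1] to filter your data. Note that this yields the matching tuples themselves rather than their indexes:
destMatches = filter(lambda destEntry:destEntry[0] == sourceMatch, destList)
[1]: In Python 3, filter is a class and returns a lazy filter object; wrap it in list() if you need a list.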
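A short runnable sketch contrasting the two approaches. The sample destList and the name matching_entries are made up for illustration.

```python
# Hypothetical sample data in the (mainValue, (value, value, value, value)) format.
destList = [
    ("a", (1, 2, 3, 4)),
    ("b", (5, 6, 7, 8)),
    ("a", (9, 10, 11, 12)),
]
sourceMatch = "a"

# enumerate() yields (index, item) pairs, so the comprehension collects indexes.
destMatches = [i for i, destEntry in enumerate(destList)
               if destEntry[0] == sourceMatch]
print(destMatches)

# filter() yields the matching entries themselves; list() materialises
# the lazy filter object in Python 3.
matching_entries = list(filter(lambda destEntry: destEntry[0] == sourceMatch,
                               destList))
print(matching_entries)
```

If the indexes are only needed in order to delete entries later, the enumerate version is the one to use, since filter discards the positions.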

Sorting a concordance?

For my homework, I need to isolate the most frequent 50 words in a text. I have tried a whole lot of things, and in my most recent attempt, I have done a concordance using this:
concordance = {}
lineno = 0
for line in vocab:
    lineno = lineno + 1
    words = re.findall(r'[A-Za-z][A-Za-z\'\-]*', line)
    for word in words:
        word = word.title()
        if word in concordance:
            concordance[word].append(lineno)
        else:
            concordance[word] = [lineno]
listing = []
for key in sorted(concordance.keys()):
    listing.append( [key, concordance[key] ])
What I would like to know is whether I can sort the subsequent concordance in order of most frequently used word to least frequently used word, and then isolate and print the top 50? I am not permitted to import any modules other than re and sys, and I'm struggling to come up with a solution.
sorted is a builtin which does not require an import. Try something like:
sorted(concordance.items(), key=lambda kv: len(kv[1]), reverse=True)[:50]
Each value is the list of line numbers on which the word occurs, so its length is the word's frequency; reverse=True puts the most frequent words first.
sorted returns a new list, so you can slice it directly with [:50]; no list() wrapper or itertools is needed.
There are probably slightly more efficient ways to take the first 50, but I doubt it matters here.
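Putting the pieces together, a minimal sketch that builds the concordance and ranks it by frequency using only re. The function name top_words, its limit parameter and the sample lines are assumptions made for illustration.

```python
import re

def top_words(lines, limit=50):
    """Return up to `limit` (word, line_numbers) pairs, most frequent first."""
    concordance = {}
    # enumerate(..., start=1) gives the line number and the line together.
    for lineno, line in enumerate(lines, start=1):
        for word in re.findall(r"[A-Za-z][A-Za-z'\-]*", line):
            concordance.setdefault(word.title(), []).append(lineno)
    # The length of each line-number list is the word's frequency.
    ranked = sorted(concordance.items(),
                    key=lambda kv: len(kv[1]),
                    reverse=True)
    return ranked[:limit]

sample = ["the cat and the dog", "the end"]
for word, line_numbers in top_words(sample, limit=3):
    print(word, len(line_numbers))
```

Words with the same frequency keep the order in which they were first seen, because sorted is stable.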
A few hints:
Use enumerate(list) in your for loop to get the line number and the line at once.
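Try using \w for word characters in your regular expression instead of listing [A-Za-z...]. Note that \w also matches digits and underscores.
Read about the dict.items() method. It returns the (key, value) pairs (a list in Python 2, a view object in Python 3; wrap it in list() if you need a list).
Sort that list with list.sort(key=function_returning_sort_key). The key function takes one item and returns the value to sort by.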
You can define that function with a lambda, but it is not necessary.
Use the len(list) function to get the length of the list. You can use it to get the number of matches of a word (which are stored in a list).
UPDATE: Oh yeah, and use slices to get a part of the resulting list. list[:50] to get the first 50 items (equivalent to list[0:50]), and list[5:10] to get the items from index 5 inclusive to index 10 exclusive.
To print them, loop through the resulting list, then print every word. Alternatively, you can use something similar to print('[separator]'.join(list)) to print a string with all the items separated by '[separator]' (the items must be strings).
Good luck.

Check if string in strings

I have a huge list containing many strings like:
['xxxx','xx','xy','yy','x',......]
Now I am looking for an efficient way that removes all strings that are present within another string. For example 'xx' 'x' fit in 'xxxx'.
As the dataset is huge, I was wondering if there is an efficient method for this beside
if a in b:
The complete code: With maybe some optimization parts:
for x in range(len(taxlistcomplete)):
    if delete == True:
        x = x - 1
        delete = False
    for y in range(len(taxlistcomplete)):
        if taxlistcomplete[x] in taxlistcomplete[y]:
            if x != y:
                print x,y
                print taxlistcomplete[x]
                del taxlistcomplete[x]
                delete = True
                break
    print x, len(taxlistcomplete)
An updated version of the code:
for x in enumerate(taxlistcomplete):
    if delete == True:
        #If element is removed, I need to step 1 back and continue looping.....
        delete = False
    for y in enumerate(taxlistcomplete):
        if x[1] in y[1]:
            if x[1] != y[1]:
                print x[1],y[1]
                print taxlistcomplete[x]
                del taxlistcomplete[x[0]]
                delete = True
                break
    print x, len(taxlistcomplete)
This is now implemented with enumerate, but I am wondering whether it is more efficient, and how to implement the delete step so that I have less to search through as well.
Just a short thought...
Basically what I would like to see...
If an element does not occur within any other element in the list, write it to a file.
Thus if 'xxxxx' not in 'xx','xy','wfirfj',etc... print/save
A new simple version, as I don't think I can optimize it much further anyway...
print 'comparison'
file = open('output.txt','a')
for x in enumerate(taxlistcomplete):
    delete = False
    for y in enumerate(taxlistcomplete):
        if x[1] in y[1]:
            if x[1] != y[1]:
                taxlistcomplete[x[0]] = ''
                delete = True
                break
    if delete == False:
        file.write(str(x))
x in <string> is fast, but checking each string against all other strings in the list will take O(n^2) time. Instead of shaving a few cycles by optimizing the comparison, you can achieve huge savings by using a different data structure so that you can check each string in just one lookup: For two thousand strings, that's two thousand checks instead of four million.
There's a data structure called a "prefix tree" (or trie) that allows you to very quickly check whether a string is a prefix of some string you've seen before. Google it. Since you're also interested in strings that occur in the middle of another string x, index all substrings of the form x, x[1:], x[2:], x[3:], etc. (So: only n substrings for a string of length n). That is, you index substrings that start in position 0, 1, 2, etc. and continue to the end of the string. That way you can just check if a new string is an initial part of something in your index.
You can then solve your problem in O(n) time like this:
Order your strings in order of decreasing length. This ensures that no string could be a substring of something you haven't seen yet. Since you only care about length, you can do a bucket sort in O(n) time.
Start with an empty prefix tree and loop over your ordered list of strings. For each string x, use your prefix tree to check whether it is a substring of a string you've seen before. If not, add its substrings x, x[1:], x[2:] etc. to the prefix tree.
Deleting in the middle of a long list is very expensive, so you'll get a further speedup if you collect the strings you want to keep into a new list (the actual string is not copied, just the reference). When you're done, delete the original list and the prefix tree.
If that's too complicated for you, at least don't compare everything with everything. Sort your strings by size (in decreasing order), and only check each string against the ones that have come before it. This will give you a 50% speedup with very little effort. And do make a new list (or write to a file immediately) instead of deleting in place.
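The suffix-indexing idea can be sketched with nested dicts standing in for the prefix tree. This is only an illustration under some assumptions: the names remove_contained and _is_indexed are made up, the strings are assumed to be non-empty, exact duplicates are collapsed with set(), and a plain sort is used instead of the bucket sort mentioned above.

```python
def _is_indexed(trie, s):
    """True if s is a prefix of some suffix stored in the trie."""
    node = trie
    for ch in s:
        if ch not in node:
            return False
        node = node[ch]
    return True

def remove_contained(strings):
    """Return the strings that do not occur inside any other string."""
    trie = {}    # nested dicts: one level per character
    kept = []    # new list instead of deleting from the original
    # Longest first, ties broken alphabetically for a repeatable order.
    for s in sorted(set(strings), key=lambda t: (-len(t), t)):
        if _is_indexed(trie, s):
            continue             # s occurs inside a string we already kept
        kept.append(s)
        # Index the suffixes s, s[1:], s[2:], ... of the kept string.
        for start in range(len(s)):
            node = trie
            for ch in s[start:]:
                node = node.setdefault(ch, {})
    return kept

print(remove_contained(['xxxx', 'xx', 'xy', 'yy', 'x']))
```

Each lookup walks at most len(s) levels of the trie, regardless of how many strings have been indexed. The price is paid when indexing: inserting all suffixes of a string of length n touches n(n+1)/2 characters, so this sketch suits many short strings better than a few very long ones.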
Here is a simple approach, assuming you can identify a character (I will use '$' in my example) that is guaranteed not to be in any of the original strings:
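result = ''
# Longest first, so a string can only be contained in one that was already added.
for substring in sorted(taxlistcomplete, key=len, reverse=True):
    if substring not in result:
        result += '$' + substring
taxlistcomplete = result.split('$')[1:]  # [1:] drops the empty piece before the first '$'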
This leverages Python's internal optimizations for substring searching by just making one big string to substring-search :)
Here is my suggestion. First I sort the elements by length, because the shorter a string is, the more likely it is to be a substring of another string. Then I have two for loops, where I run through the list and drop every element el that is a substring of another element. The kept elements go into a new list, because removing items from a list while iterating over it skips elements. Note that the first for loop only passes each element once.
By sorting the list first, we destroy the order of elements in the list. So if the order is important, then you can't use this solution.
Edit. I assume there are no identical elements in the list, so that when el == el2, it's because it's the same element.
a = ["xyy", "xx", "zy", "yy", "x"]
a.sort(key=len)
result = []
for el in a:
    for el2 in a:
        if el in el2 and el != el2:
            break  # el is contained in another element, so drop it
    else:
        result.append(el)  # no other element contains el
a = result
Using a list comprehension -- note the in operator -- is a fast and Pythonic way of selecting the elements that contain a given substring (here 'xx'):
[element for element in arr if 'xx' in element]
