Python intersection of 2 UNICODE tuple/List - python

I am trying to search for words in a file and append the matching words from each line to a list. I then want to find the words common to the two lists, list_1 and list_2, but I get this error:
TypeError: unhashable type: 'list'
# -*- coding: utf-8 -*-
import re
list_1 = []
list_2 = []
datafile = open(filename)
for line1 in datafile:
    if '1st word to be searched' in line1:
        s = line1
        left, right = re.findall(r'(\S+\s+\S+)\s+1stWordToBeSearched\s+(\S+\s+\S+)', s)[0]
        set1 = {left, right}
        list_1.extend([left,right])
list_1 = list(list_1)
datafile1 = open(filename)
for line2 in datafile1:
    if ' 2nd word to be searched' in line2:
        s = line2
        left, right = re.findall(r'(\S+\s+\S+)\s+2ndWordTbeSearched\s+(\S+\s+\S+)', s)[0]
        set2 = {left, right}
        list_2.extend([left,right])
list_2 = list(list_2)
result = set1.intersection(set2)
print(result)
In the first for loop, findall searches for sentences containing the word "number", then finds the words to the left and right of "number" and creates a list:
list_1 = [of, a, of, elements]
In the second for loop, findall searches for the word "modern", finds the words to its left and right, and creates a second list:
list_2 = [of, all, elements, are]
The file: Essays can consist of a number of elements, including literary criticism, political manifestos, learned arguments, observations of daily life, recollections, and reflections of the author of all modern elements are written in prose, but works in verse have been dubbed essays.
Once list_1 and list_2 are obtained, I want the words common to both.
Please note that the file is NOT an English file. It is in a different language.

You have a list inside a list. Flatten it first, then build the set:
result = set(list_1).intersection(list_2)
set([]) works.
set([[], []]) fails because a list cannot be hashed.
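To make the difference concrete, here is a minimal sketch. The sample lists are hypothetical stand-ins for list_1 and list_2 from the question; only the built-in set() behaviour is being shown.

```python
# Hypothetical data mirroring the lists described in the question.
nested = [['of', 'a'], ['of', 'elements']]   # what append([left, right]) produces
flat = ['of', 'a', 'of', 'elements']         # what extend([left, right]) produces
other = ['of', 'all', 'elements', 'are']

try:
    set(nested)              # every element is a list, and lists are unhashable
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Strings are hashable, so a flat list of words converts to a set without trouble.
common = set(flat).intersection(other)

print(error_message)
print(common)
```

The exact wording of the TypeError message can vary between Python versions, but it mentions that the list type is unhashable.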

You append one list object to each of your lists:
list_1.append([left,right])
and
list_2.append([left,right])
This gives you [[left, right]] for both lists, so you are trying to put the nested [left, right] list into a set as one element.
Normally, if you wanted to add multiple elements to an existing list, you'd use list.extend():
list_1.extend([left, right])
However, since your lists were empty in the first place and all you wanted to do was create a set intersection, you could just produce sets from those two elements in one step:
left, right = re.findall(r'(\S+\s+\S+)\s+1stWordToBeSearched\s+(\S+\s+\S+)', s)[0]
set1 = {left, right}
left, right = re.findall(r'(\S+\s+\S+)\s+2ndWordToBeSearched\s+(\S+\s+\S+)', s)[0]
set2 = {left, right}
result = set1.intersection(set2)
Note that you are ignoring all but the first two words! You are using [0] to take the first result of the findall() list here.
If you wanted to create an intersection of all the words, you could use a set comprehension to extract all the words into a set:
set1 = {word for matched in re.findall(r'(\S+\s+\S+)\s+1stWordToBeSearched\s+(\S+\s+\S+)', s)
        for word in matched}
set2 = {word for matched in re.findall(r'(\S+\s+\S+)\s+2ndWordToBeSearched\s+(\S+\s+\S+)', s)
        for word in matched}
result = set1.intersection(set2)
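As a rough end-to-end sketch of the set-comprehension approach, the following runs against a shortened version of the sample sentence from the question. The helper name neighbour_words and the four-group pattern are assumptions made for illustration: the pattern captures each neighbouring word in its own group (using \w+, which is Unicode-aware for str patterns in Python 3) so that the result contains single words rather than two-word strings.

```python
import re

# Shortened version of the sample text from the question.
text = ("Essays can consist of a number of elements, including literary "
        "criticism, and reflections of the author of all modern elements "
        "are written in prose.")


def neighbour_words(keyword, text):
    """Hypothetical helper: the two words on each side of every keyword match."""
    pattern = r'(\w+)\s+(\w+)\s+' + re.escape(keyword) + r'\s+(\w+)\s+(\w+)'
    return {word for matched in re.findall(pattern, text) for word in matched}


set1 = neighbour_words("number", text)   # words around "number"
set2 = neighbour_words("modern", text)   # words around "modern"
result = set1.intersection(set2)
print(result)
```

Because the words go straight into sets, duplicates such as the repeated "of" collapse automatically, and no intermediate list is needed.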

Related

Using two lists of indices to search through a third list

Suppose I have two lists of indices
letters = ['a', 'c']
numbers = ['1','2','6']
These lists are generated from an interactive web interface element, which is a necessary part of this scenario.
What is the most computationally efficient way in Python to use these two lists to search the third list below for items?
list3 = ['pa1','pa2','pa3','pa4','pa5','pa6',
'pb1','pb2','pb3','pb4','pb5','pb6',
'pc1','pc2','pc3','pc4','pc5','pc6',
'pd1','pd2','pd3','pd4','pd5','pd6']
Using the letters and numbers lists, I want to search through list3 and return this list
sublist = ['pa1', 'pa2', 'pa6', 'pc1', 'pc2', 'pc6']
I could do something like this:
sublist = []
for tag in list3:
    for l in letters:
        for n in numbers:
            if l in tag and n in tag:
                sublist.append(tag)
But I'm wondering if there's a better or more recommended way?
Above all, do not iterate through the character lists. Instead, use simple in operations, with any or all where necessary. In this case, since your tags are all of the form p[a-d][0-9], you can check the appropriate character directly:
sublist = []
for tag in list3:
    if tag[1] in "ac" and tag[2] in "126":
        sublist.append(tag)
For repeated use or a more general case, replace the strings with sets for O(1) membership tests:
letter = {'a', 'c'}
number = {'1', '2', '6'}
sublist = []
for tag in list3:
    if tag[1] in letter and tag[2] in number:
        sublist.append(tag)
Next, replace the explicit append loop with a list comprehension, which is more concise and usually a little faster:
sublist = [tag for tag in list3
           if tag[1] in letter and tag[2] in number]
If the letters and numbers can appear anywhere in the tag, then you need a general search for each: look for an overlap in character sets:
sublist = [tag for tag in list3
           if any(char in letter for char in tag) and
              any(char in number for char in tag)
          ]
or with sets:
sublist = [tag for tag in list3
           if set(tag).intersection(letter) and
              set(tag).intersection(number)
          ]
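Putting the pieces together, a minimal runnable sketch of the positional check might look like this. It assumes, as in the question, that every tag has the letter at index 1 and the digit at index 2; the names letter_set and number_set are chosen here for illustration.

```python
letters = ['a', 'c']
numbers = ['1', '2', '6']

# Rebuild list3 from the question: 'pa1' ... 'pd6'.
list3 = ['p' + l + n for l in 'abcd' for n in '123456']

# Convert the selections to sets once, so each membership test is O(1).
letter_set = set(letters)
number_set = set(numbers)

# Assumes the letter sits at index 1 and the digit at index 2 of every tag.
sublist = [tag for tag in list3
           if tag[1] in letter_set and tag[2] in number_set]
print(sublist)
```

The result keeps the order of list3, which the nested-loop version also does.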
Try this simple approach:
letters = ['a', 'c']
numbers = ['1','2','6']
sublist = []
result = []
list3 = ['pa1','pa2','pa3','pa4','pa5','pa6',
'pb1','pb2','pb3','pb4','pb5','pb6',
'pc1','pc2','pc3','pc4','pc5','pc6',
'pd1','pd2','pd3','pd4','pd5','pd6']
for letter in letters:
    sublist.extend(list(filter(lambda tag: letter in tag, list3)))
for number in numbers:
    result.extend(list(filter(lambda tag: number in tag, sublist)))
print(result)  # ['pa1', 'pc1', 'pa2', 'pc2', 'pa6', 'pc6'] (grouped by number, not in list3 order)

Python removing word if subset of other word in list

A simple puzzle but I cannot wrap my head around it:
In words:
I have a list of words. If a word in my list is a "subset" of another value in the list, then remove it.
Input: ['car', 'car-10', 'truck-20']
Output: ['car-10', 'truck-20']
We have removed 'car' because it is a subset of 'car-10'. 'car-10' is not a subset of 'car'
Input: ['car', 'car-10', 'car-100']
Output: ['car-100']
We have removed 'car' and 'car-10' because they are subsets of 'car-100'.
The case I am really trying to solve doesn't use numbers:
Input: ['car-strong', 'car', 'truck-weak']
Output: ['car-strong', 'truck-weak']
We might have 'truck', 'bananas', 'apple', and the longer entries would look like 'apple-10'.
Note that the "type" (car, truck, apple etc.) is always the beginning of the word.
The typical list to parse is around 5-10 elements long (brute-forceable, I guess?).
But there are around 200,000 of these short lists to "clean", which is also an issue.
Brute force:
l = ['car', 'car-10', 'truck-20']
remove_me = [x for x in l
             if any(y.startswith(x) for y in l if x != y)]
result = [x for x in l if x not in remove_me]
For better performance, sort the list alphabetically to find candidate 'supersets' faster, e.g. along the lines of the question
"Python: Remove elements from the list which are prefix of other".
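One way the sorting idea could be sketched: after sorting, every word that starts with a given prefix sits directly after that prefix, so it is enough to compare each word with its immediate successor. The function name drop_prefixes is made up for this example, and this is an illustration of the idea rather than the code from the linked question.

```python
def drop_prefixes(words):
    """Hypothetical helper: remove every word that is a prefix of another word."""
    ordered = sorted(set(words))
    keep = set()
    for i, word in enumerate(ordered):
        is_last = i + 1 == len(ordered)
        # In sorted order, a word extended by anything follows it immediately,
        # so only the next neighbour has to be checked.
        if is_last or not ordered[i + 1].startswith(word):
            keep.add(word)
    # Preserve the original order of the input list.
    return [word for word in words if word in keep]


print(drop_prefixes(['car', 'car-10', 'truck-20']))
print(drop_prefixes(['car', 'car-10', 'car-100']))
print(drop_prefixes(['car-strong', 'car', 'truck-weak']))
```

This costs one sort plus one pass per list instead of comparing every pair, although for lists of 5-10 elements the brute-force version is likely fast enough.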
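This is a solution that should work for all kinds of input formats (it checks for substrings anywhere, not just prefixes):
words = ['car-strong', 'car', 'truck-weak']
delete = set()  # a set, so a word contained in several others is recorded only once
for idx, word in enumerate(words):
    for idx2, other in enumerate(words):
        if word in other and idx != idx2:
            delete.add(word)
# build a new list instead of calling remove() repeatedly
result = [word for word in words if word not in delete]
print(result)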

Extracting all words starting with a certain character

I have a list of lists, in which I store sentences as strings. What I want to do is get only the words starting with #. To do that, I split the sentences into words and am now trying to pick only the words that start with # and exclude all the other words.
# to create the empty list:
lst = []
# to iterate through the columns:
for i in range(0,len(df)):
    lst.append(df['col1'][i].split())
If I am not mistaken, you just need a flat list containing all words starting with a particular character. For that I would use list flattening (via itertools):
import itertools
first = 'f' #look for words starting with f letter
nested_list = [['This is first sentence'],['This is following sentence']]
flat_list = list(itertools.chain.from_iterable(nested_list))
nested_words = [i.split(' ') for i in flat_list]
words = list(itertools.chain.from_iterable(nested_words))
lst = [i for i in words if i.startswith(first)] #startswith is safe for empty strings
print(lst) #output: ['first', 'following']
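Applied to the '#' case from the question, the same idea can be sketched without itertools. The sample sentences are invented for illustration and stand in for the values of df['col1'].

```python
# Hypothetical sentences standing in for the df['col1'] column.
sentences = [
    "Loving the #sunset at the #beach",
    "no tags here",
    "#python is fun",
]

# Split each sentence on whitespace and keep only the words starting with '#'.
hashtags = [word
            for sentence in sentences
            for word in sentence.split()
            if word.startswith('#')]
print(hashtags)
```

Calling split() with no argument splits on any run of whitespace, so repeated spaces do not produce empty strings.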

Python looping through lists

I have a list called:
word_list_pet_image = [['beagle', '01125.jpg'], ['saint', 'bernard', '08010.jpg']]
There is more data in this list, but I kept it short. I am trying to iterate through this list and check whether each word consists only of alphabetical characters; if so, I want to append the word to a new list called
pet_labels = []
So far I have:
word_list_pet_image = []
for word in low_pet_image:
    word_list_pet_image.append(word.split("_"))
for word in word_list_pet_image:
    if word.isalpha():
        pet_labels.append(word)
print(pet_labels)
For example, I am trying to put the word beagle into the list pet_labels but skip 01125.jpg, as shown below.
pet_labels = ['beagles', 'Saint Bernard']
I am getting an AttributeError:
AttributeError: 'list' object has no attribute 'isalpha'
I am sure it has to do with me not iterating through the list properly.
It looks like you are trying to join alphabetical words in each sublist. A list comprehension would be effective here.
word_list = [['beagle', '01125.jpg'], ['saint', 'bernard', '08010.jpg']]
pet_labels = [' '.join(w for w in l if w.isalpha()) for l in word_list]
print(pet_labels)  # ['beagle', 'saint bernard']
You have a list of lists, so the brute-force method would be to nest loops, like this:
for pair in word_list_pet_image:
    for word in pair:
        if word.isalpha():
            pet_labels.append(word)  # append to list
Another option might be a single for loop that indexes into each sublist:
for word in word_list_pet_image:
    if word[0].isalpha():
        pet_labels.append(word[0])  # append to list
word_list = [['beagle', '01125.jpg'], ['saint', 'bernard', '08010.jpg']]
Why not use a list comprehension (this works only if the non-alphabetical element is always last):
pet_labels = [' '.join(l[:-1]) for l in word_list]
word_list_pet_image.append(word.split("_"))
.split() returns a list, so word_list_pet_image itself contains lists, not plain words. That is why calling .isalpha() on each of its elements raises the AttributeError.
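For a complete picture, here is a small sketch that starts from the file names. The contents of low_pet_image are an assumption, reconstructed from the split results shown in the question.

```python
# Assumed input: lower-cased file names, as implied by the split("_") output above.
low_pet_image = ['beagle_01125.jpg', 'saint_bernard_08010.jpg']

pet_labels = []
for name in low_pet_image:
    parts = name.split("_")                  # e.g. ['saint', 'bernard', '08010.jpg']
    words = [part for part in parts if part.isalpha()]
    pet_labels.append(" ".join(words))       # rejoin multi-word breeds
print(pet_labels)
```

Here isalpha() is called on each string inside the sublist rather than on the sublist itself, which avoids the AttributeError.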

How do I organize lower and upper case words in Python?

Here is the code I have.
All I need to do is make sure the list is organized with upper-case words first and lower-case words second. I looked around but had no luck with .sort() or sorted().
string = input("Please type in a string? ")
words = string.strip().split()
for word in words:
    print(word)
The sorted() function compares strings by character code, so words starting with an upper-case letter sort before words starting with a lower-case letter:
>>> string = "Don't touch that, Zaphod Beeblebox!"
>>> words = string.split()
>>> print( sorted(words) )
['Beeblebox!', "Don't", 'Zaphod', 'that,', 'touch']
But if for some reason sorted() ignored caps, then you could do it manually with a sort of list comprehension if you wanted:
words = sorted([i for i in words if i[0].isupper()]) + sorted([i for i in words if i[0].islower()])
This creates two separate lists, the first with capitalized words and the second without, then sorts both individually and concatenates them to give the same result.
But in the end you should definitely just use sorted(); it's much more efficient and concise.
EDIT: Sorry, I might have misinterpreted your question; if you want to group the capitalized words first without sorting alphabetically, then this works:
>>> string = "ONE TWO one THREE two three FOUR"
>>> words = string.split()
>>> l = []
>>> print([i for i in [i if i[0].isupper() else l.append(i) for i in words] if i is not None] + l)
['ONE', 'TWO', 'THREE', 'FOUR', 'one', 'two', 'three']
I can't find a method that's more efficient than that, so there you go.
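As an alternative sketch, a stable sort with a boolean key produces the same grouping without relying on side effects inside a comprehension. This is not from the answer above; it relies on the documented guarantee that sorted() is stable, so words with equal keys keep their original relative order.

```python
string = "ONE TWO one THREE two three FOUR"
words = string.split()

# The key is False (sorts first) for words starting with an upper-case letter
# and True for everything else; stability preserves the order within each group.
grouped = sorted(words, key=lambda word: not word[:1].isupper())
print(grouped)
```

Using word[:1] instead of word[0] simply avoids an IndexError if an empty string ever ends up in the list.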
string = input("Please type in a string? ")
words = string.strip().split()
words.sort()
As to how to separate upper and lower case words into separate columns:
string = input("Please type in a string? ")
words = string.split()
column1 = []  # all-lower-case words
column2 = []  # everything else
for word in words:
    if word.islower():
        column1.append(word)
    else:
        column2.append(word)
The .islower() method evaluates to True if all the cased letters in the word are lower case. If this doesn't work for your problem's definition of upper and lower case, look into the .isupper() and .istitle() methods.
