I am new to this forum, hence apologies if this is a very long question.
I am trying to create a generic keyword parser that accepts a keyword list and a list of text lines (which could have been generated from a DB or a free-format text file). I am trying to extract the entities from the text lines based on the keyword list so that I can generate three key outputs:
the keyword that was mentioned,
the text line where this keyword was mentioned, and
the number of times this keyword was mentioned in the text line.
The following is a sample of the Python code I have written to do this. As you can see, I am trying to accomplish this in three stages:
Stage 1 - accept a reject sequence so that I can remove all known unwanted lines from the text lines list
Stage 2 (Pass 1 parsing) - carry out an index-type search on the keywords to reduce the list of lines on which I need to do a full looped search
Stage 3 (Pass 2 parsing) - carry out a full looped search.
Problem: Stage 3 (Pass 2 in the code) is extremely inefficient; as an example, with a keyword list of 4,500 elements and nearly 2 million text lines, the code runs for more than 24 hours.
Can anyone suggest a better method of doing the pass 2?
or
If there is a better method of writing the whole function?
I am a Python beginner, so if I have missed something obvious, apologies in advance.
##########################################################################################
# The keyWord parser conducts a 2 pass keyword lookup and parsing.
# Inputs:
# keywordIDsList - Is a list of the IDs of the keywords (Standard declaration: keywordIDsList[] = hash value of the keyWords)
# KeywordDict - is the Dict of all the keywords and the associated ID.
# (Standard declaration: keywordDict[keywordID]=(keywordID, keyWord) where keywordID is hash value in keywordIDsList)
# valueIDsList - Is a list of the IDs of all the values that need to be parsed (Standard declaration: valueIDsList[]= Unique reference number of the values)
# valuesDict - Is the Dict of all the value lines and the associated IDs.
# (Standard declaration: valuesDict[uniqueValueKey]=(uniqueValueKey, valueText) where uniqueValueKey is the unique key in valueIDsList)
# rejectPattern - A regular expression based pattern for rejecting columns with certain types of patterns. This is an optional field.
# Outputs:
# parsedHashIDsList - Is a list of the hash values generated for every successful parse result
# parsedResultsDict - Is the actual parsed values, as parsedResultsDict[parsedHashID]=(uniqueValueKey, keywordID, frequencyResult)
# successResultIDsList - list of all unique value references that were parsed successfully
# rejectResultIDsList - list of all unique value references that were rejected
##########################################################################################
def keywordParser(keywordIDsList, keywordDict, valueIDsList, valuesDict, rejectPattern):
    parsedResultsDict = {}
    parsedHashIDsList = []
    successResultIDsList = []
    rejectResultIDsList = []
    processListPass1 = []
    processListPass2 = []
    idxkeyWordDict = {}
    for keyID in keywordIDsList:
        keywordID, keyWord = keywordDict[keyID]
        idxkeyWordDict[keyWord] = (keywordID, keyWord)
    percCount = 1
    # optional: if rejectPattern is provided then reject lines
    # ## Some python code for processing the reject patterns - this works fine
    # Pass 1: Index based matching - partial code for index based search
    for valueID in processListPass1:
        valKey, valText = valuesDict[valueID]
        try:
            # exact match: the whole line is a known keyword
            # (unpacking order follows the documented tuple layout (keywordID, keyWord))
            keywordID, keyWordVal = idxkeyWordDict[valText]
        except KeyError:
            processListPass2.append(valueID)
    percCount = 0
    # Pass 2: Text based search and lookup - this part of the code is extremely inefficient
    for valueID in processListPass2:
        percCount += 1
        valKey, valText = valuesDict[valueID]
        valSuccess = 'N'
        for keyID in keywordIDsList:
            keywordID, keyWordVal = keywordDict[keyID]
            keySearch = re.findall(keyWordVal, valText, re.DOTALL)
            if keySearch:
                parsedHashID = hash(str(valueID) + str(keyID))
                parsedHashIDsList.append(parsedHashID)
                parsedResultsDict[parsedHashID] = (valueID, keywordID, len(keySearch))
                valSuccess = 'Y'
        if valSuccess == 'Y':
            successResultIDsList.append(valueID)
        else:
            rejectResultIDsList.append(valueID)
    return (parsedResultsDict, parsedHashIDsList, successResultIDsList, rejectResultIDsList)
This is a perfect use case for the Aho-Corasick string matching algorithm. There is an explanation of a similar use case, with Python code examples, in this blog post.
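As a rough illustration, here is a minimal sketch of what Pass 2 could look like with the pyahocorasick package (an assumption about your environment; it also treats keywords as literal strings rather than regular expressions). The automaton is built once from all 4,500 keywords, and each text line is then scanned in a single pass instead of running one regex search per keyword per line:

import ahocorasick
from collections import Counter

def keyword_parser_pass2(keywordIDsList, keywordDict, valuesDict, processListPass2):
    # Build the automaton once; cost is proportional to the total length of the keywords.
    automaton = ahocorasick.Automaton()
    for keyID in keywordIDsList:
        keywordID, keyWord = keywordDict[keyID]
        automaton.add_word(keyWord, keywordID)
    automaton.make_automaton()

    parsedResultsDict = {}
    for valueID in processListPass2:
        valKey, valText = valuesDict[valueID]
        # One scan of the line yields every keyword occurrence (including overlaps).
        counts = Counter(keywordID for _, keywordID in automaton.iter(valText))
        for keywordID, freq in counts.items():
            parsedHashID = hash(str(valueID) + str(keywordID))
            parsedResultsDict[parsedHashID] = (valueID, keywordID, freq)
    return parsedResultsDict

The success/reject bookkeeping from your function would wrap around this in the same way as the current Pass 2 loop.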
I've followed a tutorial to write a Flask REST API and have a specific question about a piece of Python code.
The code in question is the following:
# data list is where my objects are stored
def put_one(name):
    list_by_id = [list for list in data_list if list['name'] == name]
    list_by_id[0]['name'] = [new_name]
    print({'list_by_id' : list_by_id[0]})
It works, which is nice, and even though I understand what line 2 is doing, I would like to rewrite it in a way that makes it clear how the function iterates over the different lists. I already have an approach, but it returns KeyError: 0.
def put(name):
    list_by_id = []
    list = []
    for list in data_list:
        if(list['name'] == name):
            list_by_id = list
    list_by_id[0]['name'] = request.json['name']
    return jsonify({'list_by_id' : list_by_id[0]})
My goal with this is also to be able to PUT other elements that don't necessarily have the key 'name'. If I can rewrite the function in another way, I'll be more likely to adapt it to my needs.
I've looked for tools to convert one way of coding into the other, and for answers in forums, before coming here, but couldn't find anything.
It may not be beautiful code, but it gets the job done:
def put(value):
    for i in range(len(data_list)):
        key_list = list(data_list[i].keys())
        if data_list[i][key_list[0]] == value:
            print(f"old value: {key_list[0], data_list[i][key_list[0]]}")
            data_list[i][key_list[0]] = request.json[test_key]
            print(f"new value: {key_list[0], data_list[i][key_list[0]]}")
            break
Now it doesn't matter what the key is; with this iteration the method will only change the value when it finds a match in data_list. Before, the code broke at every iteration because the keys were different and they mattered.
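If you want something in between the list comprehension and the explicit loop, a sketch along these lines might read more clearly. It assumes the usual Flask imports, a hypothetical route and example data_list (not from the original post); next() returns the first matching dict, or None if nothing matches:

from flask import Flask, request, jsonify

app = Flask(__name__)
data_list = [{'name': 'groceries'}, {'name': 'chores'}]  # example data, assumed shape

@app.route('/lists/<name>', methods=['PUT'])
def put(name):
    # Find the first dict whose 'name' matches, or None if there is no match.
    match = next((item for item in data_list if item.get('name') == name), None)
    if match is None:
        return jsonify({'error': 'not found'}), 404
    match['name'] = request.json['name']
    return jsonify({'list_by_id': match})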
My load_library function is working well, but the other two failed.
How do I invert the dictionary in another function (index_by_author), and how do I count the inverted values when they are not unique, for the report (report_author_counts)?
def load_library(f):
    with open(f, 'rt') as x:
        return dict(map(str.strip, line.split("|")) for line in x)

def index_by_author(f):
    return {value: key for key, value in load_library(f).items()}

def count_authors(file_name):
    invert = {}
    for k, v in load_library(file_name).items():
        invert[v] = invert.get(v, 0) + 1
    return invert

def write_authors_counts(counts, file_name):
    with open(file_name, 'w') as fobj:
        for name, count in counts.items():
            fobj.write('{}: {}\n'.format(name, count))

def report_author_counts(lib_fpath, rep_filepath):
    counts = count_authors(lib_fpath)
    write_authors_counts(counts, rep_filepath)
Load library
In module library.py, implement function load_library().
Inputs:
Path to a text file (with contents similar to those above) containing the individual books.
Outputs:
The function shall produce a dictionary where the book titles are used as keys and the authors' names are stored as values.
You can expect that the input text file will always exist, but it may be empty. In that case, the function shall return an empty dictionary.
It can then be used as follows:
>>> from library import load_library
>>> book_author = load_library('books.txt')
>>> print(book_author['RUR'])
Capek, Karel
>>> print(book_author['Dune'])
Herbert, Frank
Index by author
In module library.py, create function index_by_author(), which - in a sense - inverts the dictionary of books produced by load_library().
Inputs:
A dictionary with book titles as keys and book authors as values (the same structure as produced by load_library() function).
Outputs:
A dictionary containing book authors as keys and a list of all books of the respective author as values.
If the input dictionary is empty, the function shall produce an empty dictionary as well.
For example, running the function on the following book dictionary (with reduced contents for the sake of brevity) would produce results shown below in the code:
>>> book_author = {'RUR': 'Capek, Karel', 'Dune': 'Herbert, Frank', 'Children of Dune': 'Herbert, Frank'}
>>> books_by = index_by_author(book_author)
>>> print(books_by)
{'Herbert, Frank': ['Dune', 'Children of Dune'], 'Capek, Karel': ['RUR']}
>>> books_by['Capek, Karel']
['RUR']
>>> books_by['Herbert, Frank']
['Dune', 'Children of Dune']
Report author counts
In module library.py, create function report_author_counts(lib_fpath, rep_filepath) which shall compute the number of books of each author and the total number of books, and shall store this information in another text file.
Inputs:
Path to a library text file (containing records for individual books).
Path to report text file that shall be created by this function.
Outputs: None
Assuming the file books.txt has the same contents as above, running the function like this:
>>> report_author_counts('books.txt', 'report.txt')
shall create a new text file report.txt with the following contents:
Clarke, Arthur C.: 2
Herbert, Frank: 2
Capek, Karel: 1
Asimov, Isaac: 3
TOTAL BOOKS: 8
The order of the lines is irrelevant. Do not forget the TOTAL BOOKS line! If the input file is empty, the output file shall contain just the line TOTAL BOOKS: 0.
Suggestion: There are basically 2 ways how to implement this function. You can either
use the 2 above functions to load the library, transform it using index_by_author() and then easily iterate over the dictionary, or
you can work directly with the source text file, extract the author names, and count their occurrences.
Both options are possible, provided the function will accept the specified arguments and will produce the right file contents. The choice is up to you.
The index_by_author function needs to be a little more complex than the dict comprehension you suggested. dict.setdefault() comes in handy here, as described in "Efficient way to either create a list, or append to it if one already exists?". Notice too that your assignment says a dictionary should be the parameter, not a file. Here is what I recommend:
def index_by_author(book_author):
    dict_by_author = {}
    for key, value in book_author.items():
        dict_by_author.setdefault(value, []).append(key)
    return dict_by_author
Then in your report_author_counts(), you can use index_by_author() to invert the dictionary. Then loop through the inverted dictionary. For each item, the count will be the length of the value, which will be a list of titles. The length of a list is determined with len(list).
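A minimal sketch of that approach, assuming the load_library() from the question and the index_by_author() above (one possible implementation, not the only one):

def report_author_counts(lib_fpath, rep_filepath):
    # Invert the title->author mapping into author->[titles], then count by list length.
    books_by = index_by_author(load_library(lib_fpath))
    total = 0
    with open(rep_filepath, 'w') as fobj:
        for author, titles in books_by.items():
            fobj.write('{}: {}\n'.format(author, len(titles)))
            total += len(titles)
        # An empty library still produces the required TOTAL BOOKS line.
        fobj.write('TOTAL BOOKS: {}\n'.format(total))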
I have the following problem:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
I need to find which elements of list2 are in list1. In actual fact, the elements of list1 correspond to a numerical value which I need to obtain and then change. The problem is that 'xyz2' contains 'xyz' and therefore also matches with a regular expression.
My code so far (where 'data' is a Python dictionary and 'specie_name_and_initial_values' is a list of lists where each sublist contains two elements, the first being a species name and the second being a numerical value that goes with it):
all_keys = list(data.keys())
for i in range(len(all_keys)):
    if all_keys[i] != 'Time':
        #print all_keys[i]
        pattern = re.compile(all_keys[i])
        for j in range(len(specie_name_and_initial_values)):
            print re.findall(pattern, specie_name_and_initial_values[j][0])
Variations of the regular expression I have tried include:
pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')
And I've also tried using 'in' as a qualifier (i.e. within a for loop)
Any help would be greatly appreciated. Thanks
Ciaran
----------EDIT------------
To clarify: my current code is below. It's used within a class/method-like structure.
def calculate_relative_data_based_on_initial_values(self, copasi_file, xlsx_data_file, data_type='fold_change', time='seconds'):
    copasi_tool = MineParamEstTools()
    data = pandas.io.excel.read_excel(xlsx_data_file, header=0)
    # uses custom class and method to get the list of lists from a file
    specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
    if time == 'minutes':
        data['Time'] = data['Time'] * 60
    elif time == 'hour':
        data['Time'] = data['Time'] * 3600
    elif time == 'seconds':
        print 'Time is already in seconds.'
    else:
        print 'Not a valid time unit'
    all_keys = list(data.keys())
    species = []
    for i in range(len(specie_name_and_initial_values)):
        species.append(specie_name_and_initial_values[i][0])
    for i in range(len(all_keys)):
        for j in range(len(specie_name_and_initial_values)):
            if all_keys[i] in species[j]:
                print all_keys[i]
The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (i.e. the all_keys bit), then look up the name of the header in the specie_name_and_initial_values variable and obtain the corresponding value (the second element within the specie_name_and_initial_value variable). After this, I multiply all values of my data table by the value obtained for each of the matched elements.
I'm most likely overcomplicating this. Do you have a better solution?
thanks
----------edit 2 ---------------
Okay, below are my variables
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
You don't need a regex to find common elements; set.intersection will find all elements in list2 that are also in list1:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
print(set(list2).intersection(list1))
set(['xyz'])
Also, if you wanted to compare 'xyz' to 'xyz2' you would use == rather than in, and then it would correctly return False.
You can also rewrite your own code a lot more succinctly:
for key in data:
    if key != 'Time':
        pattern = re.compile(key)
        for name, _ in specie_name_and_initial_values:
            print re.findall(pattern, name)
Based on your edit you have somehow managed to turn lists into strings, one option is to strip the []:
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)
print(all_keys.intersection(specie_name_and_initial_values))
Which outputs:
set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
FYI, if you had lists inside the set you would have gotten an error as lists are mutable so are not hashable.
I have been trying to clean a field in a CSV file. The field is populated with numbers and characters, which I read into a pandas DataFrame and convert to a string.
The goal is to extract the following variables from the long string: stopId, stopCode (there may be multiple per record), rte, and routeId. Here is what I have attempted so far.
After extracting the variables listed above, I need to merge the variable/codes with another file with location data per each stop/route/rte.
Sample records for the FIELD:
'Web Log: Page generated Query [cid=SM&rte=50183&dir=S&day=5761&dayid=5761&fst=0%2c&tst=0%2c]'
'Web Log: Page generated Query: [_=1407744540393&agencyId=SM&stopCode=361096&rte=7878%7eBus%7e251&dir=W]'
Web Log: Page generated Query: [_=1407744956001&agencyId=AC&stopCode=55451&stopCode=55452stopCode=55489&&rte=43783%7eBus%7e88&dir=S]
My attempted solutions are below, but I am stuck! Advice and recommendations are appreciated.
# Idea 1: Splits field above in a loop by '&' into a list. This is useful but I'll
# have to write additional code to pull out relevant variables
i = 0
for t in data['EVENT_DESCRIPTION']:
    s = list(t.split('&'))
    data['STOPS'][i] = [x for x in s if "Web Log" not in x]
    i += 1
# Idea 1 next step help - how to pull out necessary variables from the list in data['STOPS']
# Idea 2: loop through the field's string to find the start and end positions of the
# variable names. The output for stopcode_pl (and the other variables) is a list of
# tuples, one tuple per occurrence in the string.
for i in data['EVENT_DESCRIPTION']:
    stopcode_pl = [(a.start(), a.end()) for a in list(re.finditer('stopCode=', i))]
    stopid_pl = [(a.start(), a.end()) for a in list(re.finditer('stopId=', i))]
    rte_pl = [(a.start(), a.end()) for a in list(re.finditer('rte=', i))]
    routeid_pl = [(a.start(), a.end()) for a in list(re.finditer('routeId=', i))]
# Idea 2, next step (help needed): how do I use the string locations of the variable names to pull out the value of the relevant variable? Is there a trick to grab the characters between the '=' after the variable name and the next '&'?
This function
def qdata(rec):
    return [tuple(item.split('=')) for item in rec[rec.find('[')+1:rec.find(']')].split('&')]
yields, for instance, on the first record:
[('cid', 'SM'), ('rte', '50183'), ('dir', 'S'), ('day', '5761'), ('dayid', '5761'), ('fst', '0%2c'), ('tst', '0%2c')]
You can then step across that list searching for your specific items.
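Since the bracketed part is effectively a URL query string, the standard library can also do the splitting and decoding for you, and it collects repeated keys such as stopCode into lists. A sketch (parse_qs lives in urllib.parse on Python 3 and in urlparse on Python 2; it assumes each record is well formed between the brackets):

from urllib.parse import parse_qs  # Python 2: from urlparse import parse_qs

def extract_fields(rec, wanted=('stopId', 'stopCode', 'rte', 'routeId')):
    # Keep only the text between the square brackets, then parse it as a query string.
    query = rec[rec.find('[') + 1:rec.find(']')]
    parsed = parse_qs(query)  # values come back URL-decoded, grouped into lists per key
    return {key: parsed.get(key, []) for key in wanted}

For example, on the second sample record this returns {'stopId': [], 'stopCode': ['361096'], 'rte': ['7878~Bus~251'], 'routeId': []}.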
I'm quite new to programming so I'm sure there's a terser way to pose this, but I'm trying to create a personal bookmarking program. Given multiple URLs, each with a list of tags ordered by relevance, I want to be able to create a search consisting of a list of tags that returns a list of the most relevant URLs. My first solution, below, is to give the first tag a value of 1, the second 2, and so on, and let the Python list sort function do the rest. Two questions:
1) Is there a much more elegant/efficient way of doing this (embarrass me!)
2) Any other general approaches to the sorting by relevance given the inputs above problem?
Much obliged.
# Given a list of saved urls each with a corresponding user-generated taglist
# (ordered by relevance), the user enters a "search" list-of-tags, and is
# returned a sorted list of urls.
# Generate sample "content" linked-list-dictionary. The rationale is to
# be able to add things like 'title' etc at later stages and to
# treat each url/note as in independent entity. But a single dictionary
# approach like "note['url1']=['b','a','c','d']" might work better?
content = []
note = {'url':'url1', 'taglist':['b','a','c','d']}
content.append(note)
note = {'url':'url2', 'taglist':['c','a','b','d']}
content.append(note)
note = {'url':'url3', 'taglist':['a','b','c','d']}
content.append(note)
note = {'url':'url4', 'taglist':['a','b','d','c']}
content.append(note)
note = {'url':'url5', 'taglist':['d','a','c','b']}
content.append(note)
# An example search term of tags, ordered by importance
# I'm using a dictionary with an ordinal number system
# This seems clumsy
search = {'d':1,'a':2,'b':3}
# Create a tagCloud with one entry for each tag that occurs
tagCloud = []
for note in content:
    for tag in note['taglist']:
        if tagCloud.count(tag) == 0:
            tagCloud.append(tag)

# Create a dictionary that associates an integer value denoting
# relevance (1 is most relevant etc) for each existing tag
d = {}
for tag in tagCloud:
    try:
        d[tag] = search[tag]
    except KeyError:
        d[tag] = 100

# Create a [[relevance, tag],[],[],...] result list & sort
result = []
for note in content:
    resultNote = []
    for tag in note['taglist']:
        resultNote.append([d[tag], tag])
    resultNote.append(note['url'])
    result.append(resultNote)
result.sort()

# Remove the relevance values & recreate a list containing
# the url string followed by corresponding tags.
# It's so hacky I've forgotten how it works!
# It's mostly for display, but suggestions on "best-practice"
# intermediate-form data storage?
finalResult = []
for note in result:
    temp = []
    temp.append(note.pop())
    for tag in note:
        temp.append(tag[1])
    finalResult.append(temp)
print "Content: ", content
print "Search: ", search
print "Final Result: ", finalResult
1) Is there a much more elegant/efficient way of doing this (embarrass me!)
Sure thing. The basic idea: quit trying to tell Python what to do, and just ask it for what you want.
content = [
    {'url': 'url1', 'taglist': ['b', 'a', 'c', 'd']},
    {'url': 'url2', 'taglist': ['c', 'a', 'b', 'd']},
    {'url': 'url3', 'taglist': ['a', 'b', 'c', 'd']},
    {'url': 'url4', 'taglist': ['a', 'b', 'd', 'c']},
    {'url': 'url5', 'taglist': ['d', 'a', 'c', 'b']}
]

search = {'d': 1, 'a': 2, 'b': 3}

# We can create the tag cloud like this:
# tagCloud = set(sum((note['taglist'] for note in content), []))
# But we don't actually need it: instead, we'll just use a default value
# when looking things up in the 'search' dict.

# Create a [[relevance, tag],[],[],...] result list & sort
result = sorted(
    [
        [search.get(tag, 100), tag]
        for tag in note['taglist']
    ] + [[note['url']]]
    # The result will look like [ [relevance, tag], ..., [url] ].
    # Note that the url is wrapped in a list too. This makes the
    # last processing step easier: we just take the last element of
    # each nested list.
    for note in content
)

# Remove the relevance values & recreate a list containing
# the url string followed by corresponding tags.
finalResult = [
    [x[-1] for x in note]
    for note in result
]

print "Content: ", content
print "Search: ", search
print "Final Result: ", finalResult
I suggest you also give a weight to each tag, depending on how rare it is (e.g. a “tarantula” tag would weigh more than a “nature” tag¹). For a given URL, rare tags that are common with other URLs should mark a stronger relevance, while frequently used tags of the given URL not existing in another URL should mark down the relevance.
It's easy to convert the rules I describe above into a calculation of a numerical relevance score for every URL.
¹ unless all your URLs are related to “tarantulas”, of course :)
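For what it's worth, here is a rough sketch of one such calculation: an inverse-frequency weighting over the content list from the question. The exact scoring rule is a design choice, not a prescription.

import math

# Count how many URLs use each tag, across the whole collection.
tag_frequency = {}
for note in content:
    for tag in set(note['taglist']):
        tag_frequency[tag] = tag_frequency.get(tag, 0) + 1

def relevance(note, search_tags, total=len(content)):
    # Rare tags shared with the search contribute more than common ones.
    score = 0.0
    for tag in search_tags:
        if tag in note['taglist']:
            score += 1.0 + math.log(float(total) / tag_frequency[tag])
    return score

# Most relevant URLs first for the example search ['d', 'a', 'b'].
ranked = sorted(content, key=lambda note: relevance(note, ['d', 'a', 'b']), reverse=True)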