Python extract substring with location of field and symbols

I have been trying to clean a field in a CSV file. The field is populated with numbers and characters; I read it into a pandas DataFrame and convert it to a string.
The goal is to extract the following variables from the long string: StopId, StopCode (a record may have several), rte, and routeId. My attempts so far are below.
After extracting these variables, I need to merge them with another file that holds location data for each stop/route/rte.
Sample records for the FIELD:
'Web Log: Page generated Query [cid=SM&rte=50183&dir=S&day=5761&dayid=5761&fst=0%2c&tst=0%2c]'
'Web Log: Page generated Query: [_=1407744540393&agencyId=SM&stopCode=361096&rte=7878%7eBus%7e251&dir=W]'
Web Log: Page generated Query: [_=1407744956001&agencyId=AC&stopCode=55451&stopCode=55452stopCode=55489&&rte=43783%7eBus%7e88&dir=S]
The solutions I tried are below, but I am stuck. Advice and recommendations are appreciated.
# Idea 1: Splits field above in a loop by '&' into a list. This is useful but I'll
# have to write additional code to pull out relevant variables
i = 0
for t in data['EVENT_DESCRIPTION']:
    s = list(t.split('&'))
    data['STOPS'][i] = [ x for x in s if "Web Log" not in x ]
    i+=1
# Idea 1 next step help - how to pull out necessary variables from the list in data['STOPS']
# Idea 2: Loop through field with string to find the start and end of variable names. The output for stopcode_pl (and the other variables) is a tuple or a list of tuples (if there is more than one in the string)
for i in data['EVENT_DESCRIPTION']:
    stopcode_pl = [(a.start(), a.end() ) for a in list(re.finditer('stopCode=', i))]
    stopid_pl = [(a.start(), a.end() ) for a in list(re.finditer('stopId=', i))]
    rte_pl = [(a.start(), a.end() ) for a in list(re.finditer('rte=', i))]
    routeid_pl = [(a.start(), a.end() ) for a in list(re.finditer('routeId=', i))]
#Idea2: Next Step Help - how to use the string location for variable names to pull the number of the relevant variable. Is there a trick to grab the characters in between the variable name last place (i.e. after the '=' of the variable name) and the next '&'?

This function
def qdata(rec):
    return [tuple(item.split('=')) for item in rec[rec.find('[')+1:rec.find(']')].split('&')]
yields, for instance, on the first record:
[('cid', 'SM'), ('rte', '50183'), ('dir', 'S'), ('day', '5761'), ('dayid', '5761'), ('fst', '0%2c'), ('tst', '0%2c')]
You can then step across that list searching for your specific items.
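As a minimal sketch of that last step (not part of the original answer), the standard library's urllib.parse.parse_qs can do the splitting and also collects repeated keys such as stopCode into a list. The helper name extract_fields and its wanted parameter are hypothetical, and the code assumes Python 3 and that the query sits between the first '[' and the last ']':

```python
from urllib.parse import parse_qs

def extract_fields(rec, wanted=("stopId", "stopCode", "rte", "routeId")):
    # Hypothetical helper: keep only the text between the square brackets.
    query = rec[rec.find('[') + 1:rec.rfind(']')]
    # parse_qs returns {name: [value, ...]}, so repeated keys are preserved.
    params = parse_qs(query)
    # Fields that are absent from the record come back as empty lists.
    return {name: params.get(name, []) for name in wanted}

rec = ('Web Log: Page generated Query: '
       '[_=1407744540393&agencyId=SM&stopCode=361096&rte=7878%7eBus%7e251&dir=W]')
fields = extract_fields(rec)
print(fields)
```

Note that parse_qs also percent-decodes the values, so %7e in the rte field comes back as ~.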

Related

Parsing and arranging text in python

I'm having some trouble figuring out the best implementation
I have data in file in this format:
|serial #|machine_name|machine_owner|
If a machine_owner has multiple machines, I'd like the machines displayed in a comma-separated list in the field, so that this:
|1234|Fred Flinstone|mach1|
|5678|Barney Rubble|mach2|
|1313|Barney Rubble|mach3|
|3838|Barney Rubble|mach4|
|1212|Betty Rubble|mach5|
Looks like this:
|Fred Flinstone|mach1|
|Barney Rubble|mach2,mach3,mach4|
|Betty Rubble|mach5|
Any hints on how to approach this would be appreciated.
You can use a dict as a temporary container to group by name and then print it in the desired format:
import re
s = """|1234|Fred Flinstone|mach1|
|5678|Barney Rubble|mach2|
|1313|Barney Rubble||mach3|
|3838|Barney Rubble||mach4|
|1212|Betty Rubble|mach5|"""
results = {}
for line in s.splitlines():
    _, name, mach = re.split(r"\|+", line.strip("|"))
    if name in results:
        results[name].append(mach)
    else:
        results[name] = [mach]
for name, mach in results.items():
    print(f"|{name}|{','.join(mach)}|")
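The same grouping can also be sketched with collections.defaultdict, which removes the explicit "is the name already a key" check. This is only an illustrative variant under the assumption that every line has exactly three non-empty fields; the variable names are made up for the example:

```python
from collections import defaultdict

lines = [
    "|1234|Fred Flinstone|mach1|",
    "|5678|Barney Rubble|mach2|",
    "|1313|Barney Rubble|mach3|",
    "|3838|Barney Rubble|mach4|",
    "|1212|Betty Rubble|mach5|",
]

# Missing keys start out as an empty list automatically.
machines_by_name = defaultdict(list)
for line in lines:
    # Drop the empty strings produced by leading, trailing or doubled pipes.
    _, name, mach = [field for field in line.split("|") if field]
    machines_by_name[name].append(mach)

output = [f"|{name}|{','.join(machs)}|" for name, machs in machines_by_name.items()]
print("\n".join(output))
```

Because dicts keep insertion order in Python 3.7+, the names come out in the order they first appear in the input.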
You need to store all the machine names in a list. Every time you want to append a machine name, first check that the name is not already in the list, so that it is not added twice.
After storing them in an array called data, iterate over the names and use this call:
data[i].append([])
to add a list after each machine name stored in the i'th place.
Once you're done, iterate over the names, find them in the file, and append the owner.
All of this can be done in 2 steps.

Combine items in List based on values in another List using Python

I'm learning Python programming and I wrote a python script to extract all tags and values from a XML and convert it to excel. The problem I'm having is that I need the tags in a specific format. I have the below tags and values in two separate lists.
If a tag has no value in its row, it should be prepended to the tags that follow, until we reach a tag that has a value.
So if the input is
Tags Values
OAuth
UserAuth
Vendor ABC
Time 21.04
AppName
Request Temporary
and the output required is like
Tags Values
OAuth
OAuth.UserAuth
OAuth.UserAuth.Vendor ABC
OAuth.UserAuth.Time 21.04
AppName
AppName.Request Temporary
Since this doesn't follow a fixed, repeatable pattern, I am completely unsure how to do it in Python. I know this can be done with a list comprehension, but I'm not sure how.
tags = ['OAuth', 'UserAuth', 'Vendor', 'Time', 'AppName', 'Request']
values = ['', '', 'ABC', '21.04', '', 'Temporary']
def parse_tree(tags, values):
    value_found = False
    path = []
    for tag, value in zip(tags, values):
        if value:
            value_found = True
            yield ((".".join(path + [tag])), value)
        else:
            if value_found:
                # new branch from root
                path = []
                value_found = False
            path.append(tag)
list(parse_tree(tags, values))
gives
[
('OAuth.UserAuth.Vendor', 'ABC'),
('OAuth.UserAuth.Time', '21.04'),
('AppName.Request', 'Temporary')
]
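The generator above only yields rows that carry a value. If the valueless rows from the requested output (OAuth, OAuth.UserAuth, AppName) are wanted as well, one possible variant is sketched below. The function name qualified_rows is hypothetical, and it keeps the same assumption as the answer: a valueless tag that follows a valued one starts a new branch from the root.

```python
def qualified_rows(tags, values):
    # Hypothetical variant of parse_tree that also emits rows without a value.
    path = []
    value_seen = False
    rows = []
    for tag, value in zip(tags, values):
        if value:
            value_seen = True
            rows.append((".".join(path + [tag]), value))
        else:
            if value_seen:
                # A valueless tag after valued ones starts a new branch.
                path = []
                value_seen = False
            path.append(tag)
            rows.append((".".join(path), ""))
    return rows

tags = ['OAuth', 'UserAuth', 'Vendor', 'Time', 'AppName', 'Request']
values = ['', '', 'ABC', '21.04', '', 'Temporary']
rows = qualified_rows(tags, values)
for name, value in rows:
    print(name, value)
```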

Regular expressions matching words which contain the pattern but also the pattern plus something else

I have the following problem:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
I need to find which elements of list2 are in list1. In actual fact the elements of list1 correspond to a numerical value which I need to obtain then change. The problem is that 'xyz2' contains 'xyz' and therefore matches also with a regular expression.
My code so far (where 'data' is a python dictionary and 'specie_name_and_initial_values' is a list of lists where each sublist contains two elements, the first being specie name and the second being a numerical value that goes with it):
all_keys = list(data.keys())
for i in range(len(all_keys)):
    if all_keys[i]!='Time':
        #print all_keys[i]
        pattern = re.compile(all_keys[i])
        for j in range(len(specie_name_and_initial_values)):
            print re.findall(pattern,specie_name_and_initial_values[j][0])
Variations of the regular expression I have tried include:
pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')
And I've also tried using 'in' as a qualifier (i.e. within a for loop)
Any help would be greatly appreciated. Thanks
Ciaran
----------EDIT------------
To clarify, my current code is below. It's used within a class/method-like structure.
def calculate_relative_data_based_on_initial_values(self,copasi_file,xlsx_data_file,data_type='fold_change',time='seconds'):
    copasi_tool = MineParamEstTools()
    data=pandas.io.excel.read_excel(xlsx_data_file,header=0)
    #uses custom class and method to get the list of lists from a file
    specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
    if time=='minutes':
        data['Time']=data['Time']*60
    elif time=='hour':
        data['Time']=data['Time']*3600
    elif time=='seconds':
        print 'Time is already in seconds.'
    else:
        print 'Not a valid time unit'
    all_keys = list(data.keys())
    species=[]
    for i in range(len(specie_name_and_initial_values)):
        species.append(specie_name_and_initial_values[i][0])
    for i in range(len(all_keys)):
        for j in range(len(specie_name_and_initial_values)):
            if all_keys[i] in species[j]:
                print all_keys[i]
The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (i.e. the all_keys bit), then look up the name of the header in the specie_name_and_initial_values variable and obtain the corresponding value (the second element within the specie_name_and_initial_value variable). After this, I multiply all values of my data table by the value obtained for each of the matched elements.
I'm most likely over complicating this. Do you have a better solution?
thanks
----------edit 2 ---------------
Okay, below are my variables
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
You don't need a regex to find common elements, set.intersection will find all elements in list2 that are also in list1:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
print(set(list2).intersection(list1))
set(['xyz'])
Also, if you wanted to compare 'xyz' to 'xyz2' you would use == rather than in, and it would then correctly return False.
You can also rewrite your own code a lot more succinctly:
for key in data:
    if key != 'Time':
        pattern = re.compile(key)
        for name, _ in specie_name_and_initial_values:
            print re.findall(pattern, name)
Based on your edit, you have somehow managed to turn lists into strings; one option is to strip the []:
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)
print(all_keys.intersection(specie_name_and_initial_values))
Which outputs:
set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
FYI, if you had lists inside the set you would have gotten an error, as lists are mutable and therefore not hashable.
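Since the question ultimately needs the numerical value that goes with each matched name, an exact-match dictionary lookup may be more direct than either regexes or set intersection. The sketch below is only an illustration: the numbers in specie_name_and_initial_values are invented, and the names initial_values and matched are hypothetical.

```python
# Invented sample data in the shape described in the question:
# a list of [name, initial value] pairs with bracketed names.
specie_name_and_initial_values = [
    ['[Cyp26_G_R1]', 2.0],
    ['[Cyp26_G_R1R2]', 5.0],
    ['[Cyp26_G_rep1]', 0.5],
]
all_keys = ['Cyp26_G_R1', 'Cyp26_G_rep1', 'Time']

# Strip the brackets once and index the values by the bare name.
initial_values = {name.strip('[]'): value
                  for name, value in specie_name_and_initial_values}

# Dictionary membership is an exact match, so 'Cyp26_G_R1' does not
# pick up 'Cyp26_G_R1R2' the way a substring or regex search would.
matched = {key: initial_values[key] for key in all_keys if key in initial_values}
print(matched)
```

Each column of the data table could then be multiplied by matched[column_name] for the columns that appear in matched.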

Search list items in another longer list in python

I am new to this forum, hence apologies if this is a very long question.
I am trying to create a generic keyword parser that accepts a keyword list and a list of text lines (which could have been generated from either a DB or a free-format text file). I want to extract the entities from the text lines based on the keyword list so that I can generate three key outputs:
Keyword that was mentioned
The text line where this keyword was mentioned and,
the number of times this keyword was mentioned in the text line
The following is a sample of the Python code I have written to do this. As you can see, I am trying to accomplish this in three stages:
Stage 1 - accept a reject sequence so that I can remove all known unwanted lines from the Text lines list
Stage 2 (Pass 1 parsing) - Carry out a index-type search on the keywords to reduce the list of lines I need to do a full looped search
Stage 3 - Carry out a full looped search.
Problem: Stage 3 (pass 2 in the code) is extremely inefficient. For example, with a keyword list of 4500 elements and nearly 2 million text lines, the code runs for more than 24 hours.
Can anyone suggest a better method of doing pass 2?
Or is there a better way of writing the whole function?
I am a Python beginner, so apologies in advance if I have missed something obvious.
##########################################################################################
# The keyWord parser conducts a 2 pass keyword lookup and parsing.
# Inputs:
# keywordIDsList - Is a list of the IDs of the keyword (Standard declaration: keywordIDsList[]= Hash value of the keyWords)
# KeywordDict - is the Dict of all the keywords and the associated ID.
# (Standard declaration: keywordDict[keywordID]=(keywordID, keyWord) where keywordID is hash value in keywordIDsList)
# valueIDsList - Is a list of the IDs of all the values that need to be parsed (Standard declaration: valueIDsList[]= Unique reference number of the values)
# valuesDict - Is the Dict of all the value lines and the associated IDs.
# (Standard declaration: valuesDict[uniqueValueKey]=(uniqueValueKey, valueText) where uniqueValueKey is the unique key in valueIDsList)
# rejectPattern - A regular expression based pattern for rejecting columns with certain types of patterns. This is an optional field.
# Outputs:
# parsedHashIDsList - Is the a hash value that is generated for every successful parse results
# parsedResultsDict - Is actual parsed value as parsedResultsDict[parsedHashID]=(uniqueValueKey, keywordID, frequencyResult)
# successResultIDsList - list of all unique value references that were parsed successfully
# rejectResultIDsList - list of all unique value references that were rejected
##########################################################################################
import re

def keywordParser(keywordIDsList, keywordDict, valueIDsList, valuesDict, rejectPattern):
    parsedResultsDict = {}
    parsedHashIDsList = []
    successResultIDsList = []
    rejectResultIDsList = []
    processListPass1 = []
    processListPass2 = []
    idxkeyWordDict = {}
    for keyID in keywordIDsList:
        keywordID, keyWord = keywordDict[keyID]
        idxkeyWordDict[keyWord] = (keywordID, keyWord)
    percCount = 1
    # optional: if rejectPattern is provided then reject lines
    # ## Some python code for processing the reject patterns - this works fine
    # Pass 1: Index based matching - partial code for index based search
    for valueID in processListPass1:
        valKey, valText = valuesDict[valueID]
        try:
            # tuples are stored as (keywordID, keyWord), so unpack in that order
            keywordID, keyWordVal = idxkeyWordDict[valText]
        except KeyError:
            processListPass2.append(valueID)
    percCount = 0
    # Pass 2: Text based search and lookup - this part of the code is extremely inefficient
    for valueID in processListPass2:
        percCount += 1
        valKey, valText = valuesDict[valueID]
        valSuccess = 'N'
        for keyID in keywordIDsList:
            keywordID, keyWordVal = keywordDict[keyID]
            keySearch = re.findall(keyWordVal, valText, re.DOTALL)
            if keySearch:
                parsedHashID = hash(str(valueID) + str(keyID))
                parsedHashIDsList.append(parsedHashID)
                parsedResultsDict[parsedHashID] = (valueID, keywordID, len(keySearch))
                valSuccess = 'Y'
        if valSuccess == 'Y':
            successResultIDsList.append(valueID)
        else:
            rejectResultIDsList.append(valueID)
    return (parsedResultsDict, parsedHashIDsList, successResultIDsList, rejectResultIDsList)
This is a perfect use case for the Aho-Corasick string matching algorithm. There is an explanation of a similar use case using code examples in python in this blog post.
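To make the suggestion concrete, here is a minimal pure-Python sketch of Aho-Corasick. It is not the code from the blog post mentioned above and is not tuned for speed. It assumes the keywords are literal strings rather than regular expressions, and it counts overlapping occurrences, which can differ from len(re.findall(...)). The names build_automaton and count_keywords are hypothetical.

```python
from collections import Counter, deque

def build_automaton(keywords):
    # goto[n]: char -> child node, fail[n]: fallback node,
    # out[n]: keywords that end when the automaton is in node n.
    goto, fail, out = [{}], [0], [[]]
    for word in keywords:
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(word)
    # Breadth-first pass to fill in the failure links.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit keywords that end at the fallback node.
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out

def count_keywords(text, automaton):
    # One pass over the text, whatever the number of keywords.
    goto, fail, out = automaton
    counts = Counter()
    node = 0
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        counts.update(out[node])
    return counts

automaton = build_automaton(["he", "she", "his", "hers"])
counts = count_keywords("ushers", automaton)
print(counts)
```

In the parser above, the automaton would be built once from the keyword list and count_keywords called once per text line, replacing the inner loop over all keywords in pass 2.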

Python sorting question - given list of ['url', 'tag1', 'tag2',..]s and search specification ['tag3', 'tag1',...], return relevant url list

I'm quite new to programming so I'm sure there's a terser way to pose this, but I'm trying to create a personal bookmarking program. Given multiple urls each with a list of tags ordered by relevance, I want to be able to create a search consisting of a list of tags that returns a list of most relevant urls. My first solution, below, is to give the first tag a value of 1, the second 2, and so on & let the python list sort function do the rest. 2 questions:
1) Is there a much more elegant/efficient way of doing this (embarrass me!)
2) Any other general approaches to the sorting by relevance given the inputs above problem?
Much obliged.
# Given a list of saved urls each with a corresponding user-generated taglist
# (ordered by relevance), the user enters a "search" list-of-tags, and is
# returned a sorted list of urls.
# Generate sample "content" linked-list-dictionary. The rationale is to
# be able to add things like 'title' etc at later stages and to
# treat each url/note as in independent entity. But a single dictionary
# approach like "note['url1']=['b','a','c','d']" might work better?
content = []
note = {'url':'url1', 'taglist':['b','a','c','d']}
content.append(note)
note = {'url':'url2', 'taglist':['c','a','b','d']}
content.append(note)
note = {'url':'url3', 'taglist':['a','b','c','d']}
content.append(note)
note = {'url':'url4', 'taglist':['a','b','d','c']}
content.append(note)
note = {'url':'url5', 'taglist':['d','a','c','b']}
content.append(note)
# An example search term of tags, ordered by importance
# I'm using a dictionary with an ordinal number system
# This seems clumsy
search = {'d':1,'a':2,'b':3}
# Create a tagCloud with one entry for each tag that occurs
tagCloud = []
for note in content:
    for tag in note['taglist']:
        if tagCloud.count(tag) == 0:
            tagCloud.append(tag)
# Create a dictionary that associates an integer value denoting
# relevance (1 is most relevant etc) for each existing tag
d={}
for tag in tagCloud:
    try:
        d[tag]=search[tag]
    except KeyError:
        d[tag]=100
# Create a [[relevance, tag],[],[],...] result list & sort
result=[]
for note in content:
    resultNote=[]
    for tag in note['taglist']:
        resultNote.append([d[tag],tag])
    resultNote.append(note['url'])
    result.append(resultNote)
result.sort()
# Remove the relevance values & recreate a list containing
# the url string followed by corresponding tags.
# It's so hacky I've forgotten how it works!
# It's mostly for display, but suggestions on "best-practice"
# intermediate-form data storage?
finalResult=[]
for note in result:
    temp=[]
    temp.append(note.pop())
    for tag in note:
        temp.append(tag[1])
    finalResult.append(temp)
print "Content: ", content
print "Search: ", search
print "Final Result: ", finalResult
1) Is there a much more elegant/efficient way of doing this (embarrass me!)
Sure thing. The basic idea: quit trying to tell Python what to do, and just ask it for what you want.
content = [
    {'url':'url1', 'taglist':['b','a','c','d']},
    {'url':'url2', 'taglist':['c','a','b','d']},
    {'url':'url3', 'taglist':['a','b','c','d']},
    {'url':'url4', 'taglist':['a','b','d','c']},
    {'url':'url5', 'taglist':['d','a','c','b']}
]
search = {'d' : 1, 'a' : 2, 'b' : 3}
# We can create the tag cloud like this:
# tagCloud = set(sum((note['taglist'] for note in content), []))
# But we don't actually need it: instead, we'll just use a default value
# when looking things up in the 'search' dict.
# Create a [[relevance, tag],[],[],...] result list & sort
result = sorted(
    [
        [search.get(tag, 100), tag]
        for tag in note['taglist']
    ] + [[note['url']]]
    # The result will look like [ [relevance, tag],... , [url] ]
    # Note that the url is wrapped in a list too. This makes the
    # last processing step easier: we just take the last element of
    # each nested list.
    for note in content
)
# Remove the relevance values & recreate a list containing
# the url string followed by corresponding tags.
finalResult = [
    [x[-1] for x in note]
    for note in result
]
print "Content: ", content
print "Search: ", search
print "Final Result: ", finalResult
I suggest you also give a weight to each tag, depending on how rare it is (e.g. a “tarantula” tag would weigh more than a “nature” tag¹). For a given URL, rare tags that are common with other URLs should mark a stronger relevance, while frequently used tags of the given URL not existing in another URL should mark down the relevance.
It's easy to convert the rules I describe above as calculations of a numerical relevance for every other URL.
¹ unless all your URLs are related to “tarantulas”, of course :)
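One way such a rarity weight could be turned into a number is sketched below. This is an assumption-laden illustration rather than the scheme the answer has in mind: it applies the weight to a tag search instead of URL-to-URL similarity, uses an inverse-document-frequency style formula, and discounts tags that sit further down a URL's relevance-ordered tag list. The function name rank_by_rarity and the sample data are hypothetical. Python 3.

```python
import math
from collections import Counter

def rank_by_rarity(content, search_tags):
    n = len(content)
    # Number of urls on which each tag appears.
    doc_freq = Counter(tag for note in content for tag in set(note['taglist']))

    def score(note):
        total = 0.0
        for position, tag in enumerate(note['taglist']):
            if tag in search_tags:
                # Rare tags get a larger weight (smoothed IDF-style formula).
                rarity = math.log((1 + n) / (1 + doc_freq[tag])) + 1.0
                # Tags listed earlier on the url count for more.
                total += rarity / (1 + position)
        return total

    return [note['url'] for note in sorted(content, key=score, reverse=True)]

content = [
    {'url': 'url1', 'taglist': ['nature', 'forest']},
    {'url': 'url2', 'taglist': ['nature', 'tarantula']},
    {'url': 'url3', 'taglist': ['nature']},
]
ranked = rank_by_rarity(content, {'tarantula', 'nature'})
print(ranked)
```

Here url2 ranks first because it is the only one carrying the rare "tarantula" tag, while the "nature" tag shared by every url contributes the same small amount to each.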
