Elegant parsing of text-based key-value list - python

I'm writing a parser for text-based sequence alignment/map (SAM) files. One of the fields is a concatenated list of key-value pairs comprising a single alphabet character and an integer (the integer comes first). I have working code, but it just feels a bit clunky. What's an elegant pattern for parsing a format such as this? Thanks.
Input:
record['cigar_str'] = '6M1I69M1D34M'
Desired output:
record['cigar'] = [
{'type':'M', 'length':6},
{'type':'I', 'length':1},
{'type':'M', 'length':69},
{'type':'D', 'length':1},
{'type':'M', 'length':34}
]
EDIT: My current approach
cigarettes = re.findall(r'[\d]{0,}[A-Z]{1}', record['cigar_str'])
for cigarette in cigarettes:
    if cigarette[-1] == 'I':
        errors['ins'] += int(cigarette[:-1])
    ...

Here's what I'd do:
>>> import re
>>> s = '6M1I69M1D34M'
>>> matches = re.findall(r'(\d+)([A-Z]{1})', s)
>>> import pprint
>>> pprint.pprint([{'type':m[1], 'length':int(m[0])} for m in matches])
[{'length': 6, 'type': 'M'},
 {'length': 1, 'type': 'I'},
 {'length': 69, 'type': 'M'},
 {'length': 1, 'type': 'D'},
 {'length': 34, 'type': 'M'}]
It's pretty similar to what you have, but it uses regex groups to tease out the individual components of the match.
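The same idea can be wrapped in a small function that also rejects malformed input. This is only a sketch: the name `parse_cigar` and the strict validation are assumptions added for illustration, not part of the answer above. The operator set `MIDNSHP=X` follows the SAM specification, which also allows `=` and `X` (characters that `[A-Z]` alone would not match).

```python
import re

# One CIGAR operation: a run length followed by a single operator character.
CIGAR_OP = re.compile(r'(\d+)([MIDNSHP=X])')

def parse_cigar(cigar_str):
    """Return a list of {'type', 'length'} dicts for a CIGAR string."""
    ops = []
    pos = 0
    for match in CIGAR_OP.finditer(cigar_str):
        # finditer silently skips text it cannot match, so check for gaps.
        if match.start() != pos:
            raise ValueError('unexpected text at offset %d' % pos)
        ops.append({'type': match.group(2), 'length': int(match.group(1))})
        pos = match.end()
    if pos != len(cigar_str):
        raise ValueError('unexpected text at offset %d' % pos)
    return ops

record = {'cigar_str': '6M1I69M1D34M'}
record['cigar'] = parse_cigar(record['cigar_str'])
```

Whether to raise on bad input or skip it quietly is a design choice; for a file parser, failing loudly usually makes corrupted records easier to find.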

Related

How to create dictionary with list with regex and defaultdict

A list of dictionaries is below
my = [{'Name':'Super', 'Gender':'Male', 'UNNO':111234},
{'Name':'Spider', 'Gender':'Male', 'UNNO':11123},
{'Name':'Bat', 'Gender':'Female', 'UNNO':113456},
{'Name':'pand', 'Gender':'Female', 'UNNO':13456}]
The unique number is the value of the key "UNNO" in each dictionary.
All UNNO numbers must contain 6 digits.
Only UNNO numbers starting with 11 are valid.
Expected output
my_dict_list = {'Male':['Super'], 'Female':['Bat']}
Original code without regex
d = {}
for i in my:
    if str(i['UNNO']).startswith('11') and len(str(i['UNNO'])) == 6:
        # To get {'Male':['Super'], 'Female':['Bat']}
        d[i['Gender']] = [i['Name']]
How can I write this with the help of a regex? I wrote a regular expression; how do I complete it with the help of defaultdict?
import re
from collections import defaultdict
# regular expression
rx = re.compile(r'^(?=\d{6}$)(?P<Male>11\d+)|(?P<Female>11\d+)')
# output dict
output = defaultdict(list)
To use regex matching here, take the following approach:
import re
from collections import defaultdict
my_list = [{'Name': 'Super', 'Gender': 'Male', 'UNNO': 111234},
           {'Name': 'Spider', 'Gender': 'Male', 'UNNO': 11123},
           {'Name': 'Bat', 'Gender': 'Female', 'UNNO': 113456},
           {'Name': 'pand', 'Gender': 'Female', 'UNNO': 13456}]
genders = defaultdict(list)
pat = re.compile(r'^11\d{4}$') # crucial pattern to validate `UNNO` number
for d in my_list:
    if pat.search(str(d['UNNO'])):
        genders[d['Gender']].append(d['Name'])
print(dict(genders)) # {'Male': ['Super'], 'Female': ['Bat']}
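For reuse, the same logic might be packaged as a function. This is a sketch under assumptions: the names `UNNO_PATTERN` and `names_by_gender` are made up for illustration, and `fullmatch` is used in place of the `^...$` anchors, which should be equivalent for these inputs.

```python
import re
from collections import defaultdict

# Exactly six digits, the first two being "11".
UNNO_PATTERN = re.compile(r'11\d{4}')

def names_by_gender(records):
    """Group the names of records with a valid UNNO by gender."""
    grouped = defaultdict(list)
    for rec in records:
        # fullmatch only succeeds when the whole string fits the pattern.
        if UNNO_PATTERN.fullmatch(str(rec['UNNO'])):
            grouped[rec['Gender']].append(rec['Name'])
    return dict(grouped)

my_list = [{'Name': 'Super', 'Gender': 'Male', 'UNNO': 111234},
           {'Name': 'Spider', 'Gender': 'Male', 'UNNO': 11123},
           {'Name': 'Bat', 'Gender': 'Female', 'UNNO': 113456},
           {'Name': 'pand', 'Gender': 'Female', 'UNNO': 13456}]
result = names_by_gender(my_list)
```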

How to determine number of elements with a non-empty property in Python in a collection?

I have the following collection.
[
{'propertyA': {},
'propertyB': 12345,
'id': 1},
{'propertyA': {},
'propertyB': 12345,
'id': 2},
{'propertyA': {},
'propertyB': 12345,
'id': 3},
{'propertyA': {'subProperty1': 'x',
'subProperty2': 'y'},
'propertyB': 67890,
'id': 4},
{'propertyA': {'subProperty1': 'x',
'subProperty2': 'y'},
'propertyB': 67890,
'id': 5}
]
As you can observe, the first three items have the same 'propertyA' and 'propertyB', but they all have unique IDs. So it is safe to assume that 'propertyA' and 'propertyB' form a bundle: the combination of the two stays consistent.
I want to determine the number of UNIQUE items (unique in this case is defined as a unique combination of 'propertyA' and 'propertyB') in this array whose 'propertyA' field is empty ({}). In this case, it's 1.
To make myself clearer, let's add another item
{'propertyA': {},
'propertyB': 13579,
'id': 6},
The number of unique items is now two. I understand it is a little confusing, please feel free to ask me to clarify further.
Use a generator expression to filter collections, and use set to get unique elements.
print(len(set(item['propertyB'] for item in a if item['propertyA'] == {})))
>>> len(set(map(lambda x: x["propertyB"], filter(lambda x: x["propertyA"] == {}, l))))
1
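A self-contained sketch of that count, using the data from the question. The function name `count_unique_empty` is an assumption for illustration; it relies on the question's premise that 'propertyB' alone distinguishes items whose 'propertyA' is empty.

```python
def count_unique_empty(items):
    """Count distinct 'propertyB' values among items with an empty 'propertyA'."""
    return len({item['propertyB'] for item in items if item['propertyA'] == {}})

items = [
    {'propertyA': {}, 'propertyB': 12345, 'id': 1},
    {'propertyA': {}, 'propertyB': 12345, 'id': 2},
    {'propertyA': {}, 'propertyB': 12345, 'id': 3},
    {'propertyA': {'subProperty1': 'x', 'subProperty2': 'y'},
     'propertyB': 67890, 'id': 4},
    {'propertyA': {'subProperty1': 'x', 'subProperty2': 'y'},
     'propertyB': 67890, 'id': 5},
]
before = count_unique_empty(items)

# Adding the sixth item from the question introduces a second distinct value.
items.append({'propertyA': {}, 'propertyB': 13579, 'id': 6})
after = count_unique_empty(items)
```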

Find matching phrases and words in a string python

Using Python, what would be the most efficient way to extract common phrases or words from two given strings?
For example,
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
Would return:
["a","time","there","was a very","called Jack"]
How would one go about in doing this efficiently (in my case I would need to do this over thousands of 1000 word documents)?
You can split each string, then intersect the sets.
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
set(string1.split()).intersection(set(string2.split()))
Result
{'a', 'very', 'Jack', 'time', 'was', 'called'}
Note this only matches individual words. You have to be more specific on what you would consider a "phrase". Longest consecutive matching substring? That could get more complicated.
In natural language processing, you usually extract common patterns and sequences from sentences using n-grams.
In python, you can use the excellent NLTK module for that.
For counting and finding the most common, you can use collections.Counter.
Here's an example for 2-grams:
from nltk.util import ngrams
from collections import Counter
from itertools import chain
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
n = 2
ngrams1= ngrams(string1.split(" "), n)
ngrams2= ngrams(string2.split(" "), n)
counter= Counter(chain(ngrams1,ngrams2)) #count occurrences of each n-gram
print([k for k, v in counter.items() if v > 1]) # print all n-grams that come up more than once (use k[0] to get bare words when n=1)
output:
[('called', 'Jack'), ('was', 'a'), ('a', 'very')]
output with n=3:
[('was', 'a', 'very')]
output with n=1 (without tuples):
['Jack', 'a', 'was', 'time', 'called', 'very']
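One caveat with counting over the chained n-grams: an n-gram that appears twice within a single string is also reported. A sketch that avoids this, and needs no NLTK, is to build a set of n-grams per string and intersect them. The helper name `word_ngrams` is an assumption for illustration.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) found in text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

string1 = "once upon a time there was a very large giant called Jack"
string2 = "a very long time ago was a very brave young man called Jack"

# Only n-grams present in both strings survive the intersection.
common_2 = word_ngrams(string1, 2) & word_ngrams(string2, 2)
common_3 = word_ngrams(string1, 3) & word_ngrams(string2, 3)
```

Sets discard order and multiplicity, so keep the `Counter` approach if the frequencies matter.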
This is a classic dynamic programming problem. All you need to do is build a suffix tree for string1, with words instead of letters (which is the usual formulation). Here is an illustrative example of a suffix tree.
1. Label all nodes in your tree as s1.
2. Insert all suffixes of string2 one by one.
3. All nodes that the suffixes in step 2 pass through are labeled s2.
4. Any new nodes created in step 2 are also labeled s2.
5. In the final tree, the path label of every node labeled both s1 and s2 is a common substring.
This algorithm is succinctly explained in this lecture note.
For two strings of lengths n and m, the suffix tree construction takes O(max(n,m)), and all the matching substrings (in your case, words or phrases) can be searched in O(#matches).
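A suffix tree is fairly involved to implement, so here is a simpler sketch that swaps in a different technique: a plain dynamic-programming table over words, which takes O(n*m) time rather than the linear time quoted above. The function name `common_phrases` is an assumption for illustration. It reports every common run of words that cannot be extended further to the right.

```python
def common_phrases(text1, text2):
    """Return the maximal runs of consecutive words shared by both texts."""
    a, b = text1.split(), text2.split()
    # run[i][j] = length of the common word run ending at a[i-1] and b[j-1]
    run = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1

    phrases = set()
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            length = run[i][j]
            # Skip runs that continue with the next word in both texts.
            extendable = i < len(a) and j < len(b) and a[i] == b[j]
            if length and not extendable:
                phrases.add(' '.join(a[i - length:i]))
    return phrases

string1 = "once upon a time there was a very large giant called Jack"
string2 = "a very long time ago was a very brave young man called Jack"
phrases = common_phrases(string1, string2)
```

For the example strings this yields 'a', 'time', 'a very', 'was a very' and 'called Jack'. For thousands of 1000-word documents the quadratic table may still be acceptable per pair, but the suffix-tree route scales better.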
A couple of years later, but I tried this way using 'Counter' below:
Input[ ]:
from collections import Counter
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
string1 += ' ' + string2
string1 = string1.split()
count = Counter(string1)
tag_count = []
for n, c in count.most_common(10):
    dics = {'tag': n, 'count': c}
    tag_count.append(dics)
Output[ ]:
[{'tag': 'a', 'count': 4},
{'tag': 'very', 'count': 3},
{'tag': 'time', 'count': 2},
{'tag': 'was', 'count': 2},
{'tag': 'called', 'count': 2},
{'tag': 'Jack', 'count': 2},
{'tag': 'once', 'count': 1},
{'tag': 'upon', 'count': 1},
{'tag': 'there', 'count': 1},
{'tag': 'large', 'count': 1}]
Hopefully, it would be useful for someone :)

Use of dictionary in Python

I'm writing a concept learning program, where I need to convert from an index to the name of a category.
For example:
# binary concept learning
# candidate elimination learning algorithm
import numpy as np
import csv
def main():
    d1={0:'0', 1:'Japan', 2: 'USA', 3: 'Korea', 4: 'Germany', 5:'?'}
    d2={0:'0', 1:'Honda', 2: 'Chrysler', 3: 'Toyota', 4:'?'}
    d3={0:'0', 1:'Blue', 2:'Green', 3: 'Red', 4:'White', 5:'?'}
    d4={0:'0', 1:1970,2:1980, 3:1990, 4:2000, 5:'?'}
    d5={0:'0', 1:'Economy', 2:'Sports', 3:'SUV', 4:'?'}
    a=[0,1,2,3,4]
    print(a)
if __name__=="__main__":
    main()
So [0,1,2,3,4] should convert to ['0', 'Honda', 'Green', '1990', '?']. What is the most pythonic way to do this?
I think you need a basic dictionary crash course:
this is a proper dictionary:
>>>d1 = { 'tires' : 'yoko', 'manufacturer': 'honda', 'vtec' : 'no' }
You can access individual things in the dictionary easily:
>>>d1['tires']
'yoko'
>>>d1['vtec'] = 'yes' #mad vtec yo
>>>d1['vtec']
'yes'
Dictionaries are broken up into two different sections, the key and the value
testDict = {'key':'value'}
You were using a dictionary the exact same way as a list:
>>>test = {0:"thing0", 1:"thing1"} #dictionary
>>>test[0]
'thing0'
which is pretty much the exact same as saying
>>>test = ['thing0','thing1'] #list
>>>test[0]
'thing0'
In your particular case, you may want to format your dictionaries properly. I would suggest something like masterdictionary = {'country': ['germany','france','USA','japan'], 'manufacturer': ['honda','ferrarri','hoopty']} and so on, because you could then call each individual item you wanted a lot more easily.
With that same dictionary:
>>>masterdictionary['country'][1]
'france'
which is
dictionaryName['key'][iteminlistindex]
of course there is nothing preventing you from putting dictionaries as values inside of dictionaries.... inside values of other dictionaries...
You can do:
data = [d1,d2,d3,d4,d5]
print([d[key] for key, d in zip(a, data)])
The function zip() can be used to combine two iterables; lists in this case.
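Put together with the dictionaries from the question, the zip() approach might look like the sketch below. The variable name `converted` is an assumption for illustration. Note that the year comes back as the integer 1990, because that is how it is stored in d4.

```python
d1 = {0: '0', 1: 'Japan', 2: 'USA', 3: 'Korea', 4: 'Germany', 5: '?'}
d2 = {0: '0', 1: 'Honda', 2: 'Chrysler', 3: 'Toyota', 4: '?'}
d3 = {0: '0', 1: 'Blue', 2: 'Green', 3: 'Red', 4: 'White', 5: '?'}
d4 = {0: '0', 1: 1970, 2: 1980, 3: 1990, 4: 2000, 5: '?'}
d5 = {0: '0', 1: 'Economy', 2: 'Sports', 3: 'SUV', 4: '?'}

a = [0, 1, 2, 3, 4]
data = [d1, d2, d3, d4, d5]

# zip pairs the first index with d1, the second with d2, and so on.
converted = [d[key] for key, d in zip(a, data)]
print(converted)
```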
You've already got the answer to your direct question, but you may wish to consider re-structuring the data. To me, the following makes a lot more sense, and will enable you to more easily index into it for what you asked, and for any possible later queries:
from pprint import pprint
items = [[el.get(i, '?') for el in (d1,d2,d3,d4,d5)] for i in range(6)]
pprint(items)
[['0', '0', '0', '0', '0'],
['Japan', 'Honda', 'Blue', 1970, 'Economy'],
['USA', 'Chrysler', 'Green', 1980, 'Sports'],
['Korea', 'Toyota', 'Red', 1990, 'SUV'],
['Germany', '?', 'White', 2000, '?'],
['?', '?', '?', '?', '?']]
I would use a list of dicts d = [d1, d2, d3, d4, d5], and then a list comprehension:
[d[i][key] for i, key in enumerate(a)]
To make the whole thing more readable, use nested dictionaries - each of your dictionaries seems to represent something you could give a more descriptive name than d1 or d2:
data = {'country': {0: 'Japan', 1: 'USA' ... }, 'brand': {0: 'Honda', ...}, ...}
car = {'country': 1, 'brand': 2 ... }
[data[attribute][key] for attribute, key in car.items()]
Note that this would not necessarily be in order, if that is important, though collections.OrderedDict preserves insertion order (as do plain dictionaries from Python 3.7 on).
As suggested by the comment, a dictionary with contiguous integers as keys can be replaced by a list:
data = {'country': ['Japan', 'USA', ...], 'brand': ['Honda', ...], ...}
If you need to keep d1, d2, etc. as is:
newA = [locals()["d%d"%(i+1)][a_value] for i,a_value in enumerate(a)]
Pretty ugly, and fragile, but it should work with your existing code.
You don't need a dictionary for this at all. Lists in python automatically support indexing.
def main():
    d1=['0','Japan','USA','Korea','Germany',"?"]
    d2=['0','Honda','Chrysler','Toyota','?']
    d3=['0','Blue','Green','Red','White','?']
    d4=['0', 1970,1980,1990,2000,'?']
    d5=['0','Economy','Sports','SUV','?']
    ds = [d1, d2, d3, d4, d5] #This holds all your lists
    #This is what range is for
    a=range(5)
    #Find the nth index from the nth list, seems to be what you want
    print([ds[n][n] for n in a]) #This is a list comprehension, look it up.

How to slice a string in Python using a dictionary containing character positions?

I have a dictionary containing the character positions of different fields in a string. I'd like to use that information to slice the string. I'm not really sure how to best explain this, but the example should make it clear:
input:
mappings = {'name': (0,4), 'job': (4,11), 'color': (11, 15)}
data = "JohnChemistBlue"
desired output:
{'name': 'John', 'job': 'Chemist', 'color': 'Blue'}
Please disregard the fact that jobs, colors and names obviously vary in character length. I'm parsing fixed-length fields but simplified it here for illustrative purposes.
>>> dict((f, data[slice(*p)]) for f, p in mappings.items())
{'name': 'John', 'job': 'Chemist', 'color': 'Blue'}
dict([(name, data[start:end]) for name, (start, end) in mappings.items()])
>>> dict([(k, data[ mappings[k][0]:mappings[k][1] ]) for k in mappings])
{'name': 'John', 'job': 'Chemist', 'color': 'Blue'}
or with a generator instead of a list (probably more efficient):
>>> dict(((k, data[ mappings[k][0]:mappings[k][1] ]) for k in mappings))
{'name': 'John', 'job': 'Chemist', 'color': 'Blue'}
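For parsing many fixed-length records, the same slicing can be written as a dict comprehension inside a small function. This is only a sketch; the name `slice_fields` and the second sample record are assumptions for illustration.

```python
def slice_fields(line, layout):
    """Cut a fixed-width line into named fields using (start, end) positions."""
    return {field: line[start:end] for field, (start, end) in layout.items()}

mappings = {'name': (0, 4), 'job': (4, 11), 'color': (11, 15)}
data = "JohnChemistBlue"
parsed = slice_fields(data, mappings)

# The same layout can be reused for every line of a fixed-width file.
rows = [slice_fields(line, mappings) for line in [data, "JaneDentistPink"]]
```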
