Clean up irregularities in dictionary values using regex - python

I need to create a dictionary from a text file that contains coordinates for named polygons. The output needs to be a dictionary where the polygon name is the key and corresponding x and y coordinates are the values. Most of the entries in the file follow a standard layout as follows:
Name of polygon
(12.345, 1.2567)
(5.6789, 2.9876)
(9.0345, 3.7654)
(3.4556, 2.3445)
Name of next polygon
(x, y values)
However, some entries have irregularities: all the values may be on one line, or there may be extra characters between the parentheses. I need to loop over the values and split out each set of values contained in parentheses.
So far I have created the dictionary in an initial pass over the file and am trying to use regex to split the values based on contents of parentheses:
with open(fpath, 'r') as infile:
    d = {}
    # split the data into keys and values
    for group in infile.read().split('\n\n'):
        entry = group.split('\n')
        key, *val = entry
        d[key] = val
for value in d.values():
    value = re.split("*[\(.+$\)]*", str(value))
print(d)
I was hoping that this would clean up the values and create individual values for each set of coordinates contained in the parentheses, however I am getting the following error:
re.error: nothing to repeat at position 0

I think I've found a solution to my problem. I needed to account for multiple values per key in the loop and use re.findall() instead of re.split(). So my final loop looks like:
for key, value in d.items():
    # non-greedy ".+?" so that each parenthesized pair becomes its own match
    d[key] = re.findall(r"\(.+?\)", str(value))
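For reference, the original error occurs because the pattern begins with a * quantifier, which has nothing before it to repeat. Below is a minimal sketch of the whole parse, assuming the irregular entries still contain well-formed "(x, y)" pairs somewhere on the line. The sample text, the polygon names, and the parse_polygons helper are hypothetical and only illustrate the idea:

```python
import re

# Hypothetical sample; the real file layout may differ.
sample = """Square
(0.0, 0.0)
(1.0, 0.0)
(1.0, 1.0)

Triangle
(0.0, 0.0) junk (2.0, 0.0), (1.0, 1.5)"""

# One "(x, y)" pair, capturing x and y as separate groups.
NUMBER = r"[-+]?\d*\.?\d+"
PAIR = re.compile(r"\(\s*(" + NUMBER + r")\s*,\s*(" + NUMBER + r")\s*\)")


def parse_polygons(text):
    """Map each polygon name to a list of (x, y) float tuples."""
    polygons = {}
    for group in text.strip().split("\n\n"):
        name, *rest = group.split("\n")
        # Join the remaining lines so one-line and multi-line entries
        # are handled the same way.
        joined = " ".join(rest)
        polygons[name.strip()] = [
            (float(x), float(y)) for x, y in PAIR.findall(joined)
        ]
    return polygons


result = parse_polygons(sample)
```

Because findall only returns what matches the pair pattern, stray characters between pairs are ignored rather than having to be split away.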

Related

How to assign a dictionary with empty or null value

I have a code which uses a dict to create a bunch of key-value pairs.
The number of key-value pairs is not fixed: one iteration can have 2 pairs, another can have 4 or 5.
I currently start with an empty dict, like this:
cost_dict = {}
When a regular expression pattern is found in a text string, I extract parts of that text as key-value pairs and populate the dict above with them.
However, when the pattern is not found, I catch the resulting AttributeError, and in that specific case I want the dict to be assigned a null or blank value.
Something like:
cost_dict = {}
try:
    cost_breakdown = re.search(regex, output).group()
except AttributeError:
    cost_dict = ' '  # this part I am not sure how to do
... (if pattern matches extract the text and populate the above dict as key-value)
But I am not sure how to assign a null or blank value to this dict, since the code above obviously rebinds cost_dict to a string instead of changing the empty dict defined earlier.
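One possible approach, sketched under assumptions: re.search returns None when nothing matches, so the result can be checked directly instead of catching AttributeError, and the dict simply stays empty in that case. The extract_costs helper, the pattern, and the "key=value" text format below are hypothetical:

```python
import re


def extract_costs(regex, output):
    """Return a dict of key/value pairs, or an empty dict if nothing matches."""
    cost_dict = {}
    match = re.search(regex, output)
    if match is None:
        # No match: leave the dict empty. For a dict that already has
        # entries, cost_dict.clear() would empty it in place.
        return cost_dict
    # Hypothetical format: comma-separated "key=value" pairs.
    for pair in match.group().split(","):
        key, _, value = pair.partition("=")
        cost_dict[key.strip()] = value.strip()
    return cost_dict


pattern = r"\w+=\d+(?:,\s*\w+=\d+)*"
found = extract_costs(pattern, "cost: labor=40, parts=15")
missing = extract_costs(pattern, "no cost data here")
```

An empty dict is falsy, so later code can test it with a plain "if not cost_dict:". If a distinct "no data" marker is needed, returning None instead is another option.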

How to add the value pairs to excel without including the brackets from a python dictionary?

I want to append the key value pairs in my python dictionary without including the brackets... I'm not really sure how to do that.
I've tried looking at similar questions but it isn't working for me.
# this creates a new workbook called difference
file = xlrd.open_workbook('/Users/im/Documents/Exception_Cases/Orders.xls')
wb = xl_copy(file)
Sheet1 = wb.add_sheet('differences')
# this creates headers for two columns
Sheet1.write(0,0,"S_Numbers")
Sheet1.write(0,1," Values")
# this stores every key, value pair of my dictionary in the respective SO_Numbers, Booking Values columns
print(len(diff_so_keyval))
rowplacement = 1
while rowplacement < len(diff_so_keyval):
    for k, v in diff_so_keyval.items():
        Sheet1.write(rowplacement,0,k)
        Sheet1.write(rowplacement,1,str(v))
        rowplacement = rowplacement + 1
#This is what I have in my diff_so_keyval dictionary
diff_so_keyval = {104370541: [31203.7],
                  106813775: [187500.0],
                  106842625: [60349.8],
                  106843037: [492410.5],
                  106918995: [7501.25],
                  106919025: [427090.0],
                  106925184: [30676.4],
                  106941476: [203.58],
                  106941482: [203.58],
                  106941514: [407.16],
                  106962317: [61396.36]}
#this is the output
S_numbers Values
104370541 [31203.7]
106813775 [187500.0]
106842625 [60349.8]
I want the values without the brackets
Looks to me like the 'values' in the dictionary are actually single-element lists.
If you simply extract the 0th element out of the list, then that should work for 'removing the brackets':
Sheet1.write(rowplacement, 1, v[0])
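A small sketch of the same idea without the spreadsheet library, so it can be run anywhere: it builds (row, key, value) tuples with the single-element lists unwrapped. The rows list is a hypothetical stand-in for the Sheet1.write calls, and only the first three entries of the dictionary are used:

```python
diff_so_keyval = {
    104370541: [31203.7],
    106813775: [187500.0],
    106842625: [60349.8],
}

rows = []
# enumerate(..., start=1) replaces the manual rowplacement counter;
# row 0 is assumed to hold the headers.
for rowplacement, (k, v) in enumerate(diff_so_keyval.items(), start=1):
    # Unwrap single-element lists; leave anything else untouched.
    value = v[0] if isinstance(v, list) and len(v) == 1 else v
    rows.append((rowplacement, k, value))
```

Each tuple corresponds to one pair of write calls in the question: the key in column 0 and the unwrapped number in column 1.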

Error in sorting operation on dictionary

I am trying to sort a file of sequences according to a certain parameter. The data looks as follows:
ID1 ID2 32
MVKVYAPASSANMSVGFDVLGAAVTP ...
ID1 ID2 18
MKLYNLKDHNEQVSFAQAVTQGLGKN ...
....
There are about 3000 sequences like this, i.e. the first line contains two ID fields and one rank field (the sorting key), while the second line contains the sequence. My approach is to open the file, convert the file object to a list, separate the annotation lines (ID1, ID2, rank) from the actual sequences (annotation lines always occur at even indices, sequence lines at odd indices), merge them into a dictionary, and sort the dictionary using the rank field. The code reads like so:
#!/usr/bin/python
with open("unsorted.out","rb") as f:
    f = f.readlines()
    assert type(f) == list, "ERROR: file object not converted to list"
annot=[]
seq=[]
for i in range(len(f)):
    # IDs
    if i%2 == 0:
        annot.append(f[i])
    # Sequences
    elif i%2 != 0:
        seq.append(f[i])
# Make dictionary
ids_seqs = {}
ids_seqs = dict(zip(annot,seq))
# Solub rankings are the third field of the annot list, i.e. annot[i].split()[2]
# Use this index notation to rank sequences according to solubility measurements
sorted_niwa = sorted(ids_seqs.items(), key = lambda val: val[0].split()[2], reverse=False)
# Save to file
with open("sorted.out","wb") as out:
    out.write("".join("%s %s" % i for i in sorted_niwa))
The problem I have encountered is that when I open the sorted file to inspect manually, as I scroll down I notice that some sequences have been wrongly sorted. For example, I see the rank 9 placed after rank 89. Up until a certain point the sorting is correct, but I don't understand why it hasn't worked throughout.
Many thanks for any help!
Sounds like you're comparing strings instead of numbers. "9" > "89" because the character '9' comes lexicographically after the character '8'. Try converting to integers in your key.
sorted_niwa = sorted(ids_seqs.items(), key = lambda val: int(val[0].split()[2]), reverse=False)
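To see the difference in isolation, here is a short sketch with made-up annotation lines (the IDs and ranks are hypothetical). The only change between the two calls is the int() conversion in the key:

```python
# Hypothetical annotation lines in the "ID1 ID2 rank" layout.
records = ["ID1 ID2 32", "ID3 ID4 9", "ID5 ID6 89"]

# String keys compare character by character, so "9" sorts after "89".
by_string = sorted(records, key=lambda line: line.split()[2])

# Integer keys compare numerically.
by_number = sorted(records, key=lambda line: int(line.split()[2]))
```

With string keys the rank 9 ends up last, which matches the behaviour described in the question; with integer keys it comes first.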

Pandas Dataframe to Dictionary with Multiple Keys

I am currently working with a dataframe consisting of a column of 13 letter strings ('13mer') paired with ID codes ('Accession') as such:
However, I would like to create a dictionary in which the Accession codes are the keys with values being the 13mers associated with the accession so that it looks as follows:
{'JO2176': ['IGY....', 'QLG...', 'ESS...', ...],
'CYO21709': ['IGY...', 'TVL...',.............],
...}
Which I've accomplished using this code:
Accession_13mers = {}
for group in grouped:
    Accession_13mers[group[0]] = []
    for item in group[1].iteritems():
        Accession_13mers[group[0]].append(item[1])
However, now I would like to go back and iterate through the keys for each Accession code and run a function I've defined as find_match_position(reference_sequence, 13mer), which finds the 13mer in a reference sequence and returns its position. I would then like to store that position as the value, with the 13mer as the key.
If anyone has any ideas for how I can expedite this process that would be extremely helpful.
Thanks,
Justin
I would suggest creating a new dictionary, whose values are another dictionary. Essentially a nested dictionary.
position_nmers = {}
for key in Accession_13mers:
    position_nmers[key] = {}  # replicate each key in the new dictionary, with a dictionary as its value
    for value in Accession_13mers[key]:
        position_nmers[key][value] = find_match_position(reference_sequence, value)
To introspect the dictionary and make sure it's okay:
print(position_nmers)
You can iterate over the groupby more cleanly by unpacking:
d = {}
for key, s in df.groupby('Accession')['13mer']:
    d[key] = list(s)
This also makes it much clearer where you should put your function!
... However, I think that it might be better suited to an enumerate:
d2 = {}
for pos, val in enumerate(df['13mer']):
    d2[val] = pos
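The nested-dictionary idea can be sketched without pandas as follows. The find_match_position body here is only a hypothetical stand-in for the asker's function (it uses str.find, which returns -1 when the substring is absent), and the accession names and 13mers are made up:

```python
def find_match_position(reference_sequence, kmer):
    # Hypothetical stand-in: index of the first occurrence, or -1 if absent.
    return reference_sequence.find(kmer)


reference_sequence = "MKLYNLKDHNEQVSFAQAVTQGLGKN"

# Hypothetical accession -> list of 13mers mapping, as built in the question.
accession_13mers = {
    "ACC1": ["MKLYNLKDHNEQV", "VSFAQAVTQGLGK"],
    "ACC2": ["KDHNEQVSFAQAV"],
}

# Outer keys are accessions; inner dicts map each 13mer to its position.
positions = {
    acc: {kmer: find_match_position(reference_sequence, kmer) for kmer in kmers}
    for acc, kmers in accession_13mers.items()
}
```

A dict comprehension keeps the structure visible in one expression; an explicit pair of loops would work just as well if the function needs extra arguments per accession.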

Adding Multiple Values to a Single Key in Python Dictionary

Python dictionaries really have me today. I've been poring over Stack, trying to find a way to do a simple append of a new value to an existing key in a Python dictionary, and I'm failing at every attempt even though I'm using the same syntax I see on here.
This is what I am trying to do:
# cursor search an xls file
definitionQuery_Dict = {}
for row in arcpy.SearchCursor(xls):
    # set some source paths from strings in the xls file
    dataSourcePath = str(row.getValue("workspace_path")) + "\\" + str(row.getValue("dataSource"))
    dataSource = row.getValue("dataSource")
    # add items to dictionary. The keys are the datasource table and the values will be definition (SQL) queries. First test is to see if a definition query exists in the row and if it does, we want to add the key, value pair to a dictionary.
    if row.getValue("Definition_Query") is not None:
        # if key already exists, then append a new value to the value list
        if row.getValue("dataSource") in definitionQuery_Dict:
            definitionQuery_Dict[row.getValue("dataSource")].append(row.getValue("Definition_Query"))
        else:
            # otherwise, add a new key, value pair
            definitionQuery_Dict[row.getValue("dataSource")] = row.getValue("Definition_Query")
I get an attribute error:
AttributeError: 'unicode' object has no attribute 'append'
But I believe I am doing the same as the answer provided here
I've tried various other methods with no luck, getting various other error messages. I know this is probably simple and maybe I just couldn't find the right source on the web, but I'm stuck. Anyone care to help?
Thanks,
Mike
The issue is that you're originally setting the value to be a string (i.e. the result of row.getValue) but then trying to append to it if the key already exists. You need to set the original value to a list containing a single string. Change the last line to this:
definitionQuery_Dict[row.getValue("dataSource")] = [row.getValue("Definition_Query")]
(Notice the brackets around the value.)
ndpu has a good point with the use of defaultdict, but if you're using that, you should always append, i.e. replace the whole if/else statement with the append you're currently doing in the if clause.
Your dictionary has keys and values. If you want to add to the values as you go, then each value has to be a type that can be extended/expanded, like a list or another dictionary. Currently each value in your dictionary is a string, where what you want instead is a list containing strings. If you use lists, you can do something like:
mydict = {}
records = [('a', 2), ('b', 3), ('a', 4)]
for key, data in records:
    # If this is a new key, create a list to store
    # the values
    if key not in mydict:
        mydict[key] = []
    mydict[key].append(data)
Output:
mydict
Out[4]: {'a': [2, 4], 'b': [3]}
Note that even though 'b' only has one value, that single value still has to be put in a list, so that it can be added to later on.
Use collections.defaultdict:
from collections import defaultdict
definitionQuery_Dict = defaultdict(list)
# ...
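As a self-contained sketch of the defaultdict approach: the rows list below is a hypothetical stand-in for the cursor results, holding (dataSource, Definition_Query) pairs. With defaultdict(list), a missing key is created with an empty list on first access, so the append works unconditionally:

```python
from collections import defaultdict

# Hypothetical (dataSource, Definition_Query) pairs standing in for cursor rows.
rows = [
    ("roads", "TYPE = 'A'"),
    ("parcels", None),
    ("roads", "TYPE = 'B'"),
]

definition_queries = defaultdict(list)
for data_source, query in rows:
    # Skip rows without a definition query, as in the question.
    if query is not None:
        # No if/else needed: a new key starts out as an empty list.
        definition_queries[data_source].append(query)
```

Note that merely reading a missing key from a defaultdict also creates it, so membership checks should use "in" rather than indexing.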
