Error handling numpy.float? - python

I am working on a csv file using python.
I wrote the following script to treat the file:
import pickle
import numpy as np
from csv import reader, writer

dic1 = {'a': 2, 'b': 2, 'c': 2}
dic2 = {'a': 2, 'b': 2, 'c': 0}

number = dict()
for k in dic1:
    number[k] = dic1[k] + dic2[k]

ctVar = {'a': [0.093323751331788565, -1.0872670058072453, '', 8.3574590513050264], 'b': [0.053169909627947334, -1.0825742255395172, '', 8.0033788558001984], 'c': [-0.44681777279768059, 2.2380488442495348]}

Var = {}
for k in number:
    Var[k] = number[k]

def findIndex(myList, number):
    n = str(number)
    m = len(n)
    for elt in myList:
        e = str(elt)
        l = len(e)
        mi = min(m, l)
        if e[:mi-1] == n[:mi-1]:
            return myList.index(elt)

def sortContent(myList):
    if '' in myList:
        result = ['']
        myList.remove('')
    else:
        result = []
    myList.sort()
    result = myList + result
    return result
An extract of the csv file follows. (Note: the blanks are important. For readability I wrote them as BL, but they should really just be empty cells.)
The columns contain a few elements (including '') repeated many times.
First column:
a
0.0933237513
-1.0872670058
0.0933237513
BL
BL
0.0933237513
0.0933237513
0.0933237513
BL
Second column:
b
0.0531699096
-1.0825742255
0.0531699096
BL
BL
0.0531699096
0.0531699096
0.0531699096
BL
Third column:
c
-0.4468177728
2.2380488443
-0.4468177728
-0.4468177728
-0.4468177728
-0.4468177728
-0.4468177728
2.2380488443
2.2380488443
I have only posted an extract of the code (where I am facing the problem), so its purpose may not be obvious. Basically, it is part of a larger program that I use to modify this csv file and encode it differently.
In this extract, I am trying at some point (line 68) to sort the elements of a list that contains numbers and ''.
When I remove the line that does this, the elements printed are those of each column (without any repetition).
The problem is that when I try to sort them, the '' values are no longer taken into account. Yet when I tested my function sortContent on lists that contain '', it worked perfectly.
I thought this problem was related to the use of numpy.float64 elements in my list. So I converted all these elements into floats, but the problem remains.
Any help would be greatly appreciated!

I assume you mean to use sortContent on something else (as obviously if you want the values in your predefined lists in ctVar in a certain order, you can just put them in order in your code rather than sorting them at runtime).
Let's go through your sortContent piece by piece.
if '' in myList:
    result = ['']
    myList.remove('')
If the list object passed in (let's call it list 1) contains '', create a new list object (let's call it list 2) containing just '', and remove the first instance of '' from list 1.
myList.sort()
Now, sort the contents of list 1.
result = myList + result
Now create a new list object (call it list 3) with the contents of list 1 and list 2.
return result
Keep in mind that list 1 (the list object that was passed in) still has the '' removed.
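In other words, sortContent mutates the list you pass in: the '' is removed from the caller's list, and only the new list returned by the function gets it appended back, so if you keep working with the original list the blanks seem to disappear. A minimal non-mutating sketch (sortContentCopy is my own name, and it assumes you want any '' entries kept at the end):
def sortContentCopy(myList):
    # build new lists instead of modifying the caller's list;
    # filtering out '' first also avoids mixed-type comparisons when sorting
    blanks = [x for x in myList if x == '']
    values = sorted(x for x in myList if x != '')
    return values + blanks
Then always keep and use the return value, e.g. column = sortContentCopy(column).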

Algorithm to split the values of a list into a specific format

Can you help me with my algorithm in Python to parse a list, please?
List = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU', 'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_TITI_TUTU', 'PPPP_TOTO_EHEH_OOOO_AAAAA', 'PPPP_TOTO_EHEH_IIII_SSSS_RRRR']
In this list, I have to get the last two words (PARENT_CHILD). For example, from PPPP_TOTO_TATA_TITI_TUTU I only keep TITI_TUTU.
In the case where there are duplicates, that is to say my list contains both PPPP_TOTO_TATA_TITI_TUTU and PPPP_TOTO_EHEH_TITI_TUTU, I would get TITI_TUTU twice; I then want to recover the GRANDPARENT for each of them, that is, TATA_TITI_TUTU and EHEH_TITI_TUTU.
As long as the names are duplicated, we take the level above.
But in that case, since I added the GRANDPARENT for EHEH_TITI_TUTU, I also want it to be added for all entries that have EHEH in the name, so instead of OOOO_AAAAA and IIII_SSSS_RRRR I would like to have EHEH_OOOO_AAAAA and EHEH_IIII_SSSS_RRRR.
My final list =
['ZZZZ_XXXX', 'TATA_TITI_TUTU', 'MMMM_TITI_TUTU', 'EHEH_TITI_TUTU', 'EHEH_OOOO_AAAAA', 'EHEH_IIII_SSSS_RRRR']
Thank you in advance.
Here is the code I started to write:
json_paths = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU',
              'PPPP_TOTO_EHEH_TITI_TUTU', 'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_OOOO_AAAAA']
cols_name = []
for path in json_paths:
    acc = 2
    col_name = '_'.join(path.split('_')[-acc:])
    tmp = cols_name
    while col_name in tmp:
        acc += 1
        idx = tmp.index(col_name)
        cols_name[idx] = '_'.join(json_paths[idx].split('_')[-acc:])
        col_name = '_'.join(path.split('_')[-acc:])
        tmp = ['_'.join(item.split('_')[-acc:]) for item in json_paths].pop()
    cols_name.append(col_name)
    print(cols_name.index(col_name), col_name)
cols_name
help ... with ... algorithm
use a dictionary for the initial container while iterating
keys will be PARENT_CHILD's and values will be lists containing grandparents.
>>> import collections
>>> s = 'PPPP_TOTO_TATA_TITI_TUTU'
>>> d = collections.defaultdict(list)
>>> *_,grandparent,parent,child = s.rsplit('_',maxsplit=3)
>>> d['_'.join([parent,child])].append(grandparent)
>>> d
defaultdict(<class 'list'>, {'TITI_TUTU': ['TATA']})
>>> s = 'PPPP_TOTO_EHEH_TITI_TUTU'
>>> *_,grandparent,parent,child = s.rsplit('_',maxsplit=3)
>>> d['_'.join([parent,child])].append(grandparent)
>>> d
defaultdict(<class 'list'>, {'TITI_TUTU': ['TATA', 'EHEH']})
>>>
after iteration, determine if there are multiple grandparents in a value
if there are, join/append the parent_child to each grandparent
additionally, find all the parent_child's with these grandparents and prepend their grandparents. To facilitate this, build a second dictionary during iteration: {grandparent: [list_of_children], ...}
if the parent_child only has one grandparent, use it as-is (a sketch of the whole procedure is given below)
Instead of splitting each string, the info could be extracted with a regular expression.
pattern = r'^.*?_([^_]*)_([^_]*_[^_]*)$'
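Putting those steps together, here is a minimal sketch (build_names is my own name; it promotes names one level up only, not recursively, and assumes every path has at least three '_'-separated tokens):
import collections

def build_names(paths):
    # default name: the last two tokens (PARENT_CHILD)
    groups = collections.defaultdict(list)
    for p in paths:
        groups['_'.join(p.split('_')[-2:])].append(p)
    # for every duplicated PARENT_CHILD, remember the GRANDPARENT token of each duplicate
    promoted = {p.split('_')[-3]
                for members in groups.values() if len(members) > 1
                for p in members}
    # any path containing a promoted token keeps everything from that token onwards
    result = []
    for p in paths:
        parts = p.split('_')
        for i, token in enumerate(parts):
            if token in promoted:
                result.append('_'.join(parts[i:]))
                break
        else:
            result.append('_'.join(parts[-2:]))
    return result

print(build_names(['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU',
                   'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_TITI_TUTU',
                   'PPPP_TOTO_EHEH_OOOO_AAAAA', 'PPPP_TOTO_EHEH_IIII_SSSS_RRRR']))
# ['ZZZZ_XXXX', 'TATA_TITI_TUTU', 'MMMM_TITI_TUTU', 'EHEH_TITI_TUTU',
#  'EHEH_OOOO_AAAAA', 'EHEH_IIII_SSSS_RRRR']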

Loop over each item in a row and compare it with each item from the corresponding row of another column, then save the result in a new column - python

I want to loop in Python over each item in a row and compare it against the items in the corresponding row of another column.
If an item is not present in that row of the second column, it should be appended to a new list that will be converted into another column (appending with if i not in c should also eliminate duplicates).
The goal is to compare the items of each row in one column against the items of the corresponding row in another column, and to save the unique values of the first column in a new column of the same df.
[image: df columns]
This is just an example; I have many more items in each row.
I tried using this code, but nothing happened, and from what I have tested the conversion of the list into a column is not correct:
a = df['final_key_concat'].tolist()
b = df['attributes_tokenize'].tolist()
c = []
for i in df.values:
    for i in a:
        if i in a:
            if i not in b:
                if i not in c:
                    c.append(i)
print(c)
df['new'] = pd.Series(c)
Any help is more than needed, thanks in advance
So, seeing as you have these two variables, one way would be:
a= df['final_key_concat'].tolist()
b = df['attributes_tokenize'].tolist()
Try something like this:
new = {}
for index, items in enumerate(a):
    for thing in items:
        if thing not in b[index]:
            if index in new:
                new[index].append(thing)
            else:
                new[index] = [thing]
Then map the dictionary to the df.
df['new'] = df.index.map(new)
There are better ways to do it but this should work.
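For instance, a minimal end-to-end sketch with made-up two-row data (the DataFrame contents are my own example, and setdefault just condenses the if/else above):
import pandas as pd

df = pd.DataFrame({
    'final_key_concat': [['Camiseta', 'Tecnica', 'hombre'], ['deportivas', 'calcetin', 'shoes']],
    'attributes_tokenize': [['Tecnica', 'manga'], ['calcetin', 'North']],
})

a = df['final_key_concat'].tolist()
b = df['attributes_tokenize'].tolist()

new = {}
for index, items in enumerate(a):
    for thing in items:
        if thing not in b[index]:
            # keep, per row index, the items missing from the second column
            new.setdefault(index, []).append(thing)

df['new'] = df.index.map(new)
print(df['new'].tolist())  # [['Camiseta', 'hombre'], ['deportivas', 'shoes']]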
This should be what you want:
import pandas as pd

data = {'final_key_concat': [['Camiseta', 'Tecnica', 'hombre', 'barate'],
                             ['deportivas', 'calcetin', 'hombres', 'deportivas', 'shoes']],
        'attributes_tokenize': [['The', 'North', 'Face', 'manga'],
                                ['deportivas', 'calcetin', 'shoes', 'North']]}  # recreated from your image
df = pd.DataFrame(data)
a = df['final_key_concat'].tolist()     # this generates a list of lists
b = df['attributes_tokenize'].tolist()  # this also generates a list of lists
# Both list a and list b need to be flattened so as to access their elements the way you want
c = [itm for sblst in a for itm in sblst]  # flatten list a using a list comprehension
d = [itm for sblst in b for itm in sblst]  # flatten list b using a list comprehension
final_list = [itm for itm in c if itm not in d]  # keep the elements of c that are not in d
print(final_list)
Result
['Camiseta', 'Tecnica', 'hombre', 'barate', 'hombres']
def parse_str_into_list(s):
    if s.startswith('[') and s.endswith(']'):
        return ' '.join(s.strip('[]').strip("'").split("', '"))
    return s

def filter_restrict_words(row):
    targets = parse_str_into_list(row[0]).split(' ', -1)
    restricts = parse_str_into_list(row[1]).split(' ', -1)
    print(restricts)
    # loop over each word
    # use a set to save words, or a list if we need to keep the words in order
    words_to_keep = []
    for word in targets:
        # conditions for keeping an eligible word
        if word not in restricts and 3 < len(word) < 45 and word not in words_to_keep:
            words_to_keep.append(word)
    print(words_to_keep)
    return ' '.join(words_to_keep)

df['FINAL_KEYWORDS'] = df[[col_target, col_restrict]].apply(lambda x: filter_restrict_words(x), axis=1)

Issue with Configobj-python and list items

I am trying to read an .ini file whose keywords have either single items or list items. When I try to print single-item strings and float values, they print as h,e,l,l,o and 2, ., 1 respectively, whereas they should have been just hello and 2.1. Also, when I write a new single-item string/float/integer, there is a , at the end. I am new to Python and to configobj. Any help is appreciated, and if this question has been answered previously, please direct me to it. Thanks!
from configobj import ConfigObj
Read
config = ConfigObj('para_file.ini')
para = config['Parameters']
print(", ".join(para['name']))
print(", ".join(para['type']))
print(", ".join(para['value']))
Write
new_names = 'hello1'
para['name'] = [x.strip(' ') for x in new_names.split(",")]
new_types = '3.1'
para['type'] = [x.strip(' ') for x in new_types.split(",")]
new_values = '4'
para['value'] = [x.strip(' ') for x in new_values.split(",")]
config.write()
My para_file.ini looks like this,
[Parameters]
name = hello1
type = 2.1
value = 2
There are two parts to your question.
Options in ConfigObj can be either a string, or a list of strings.
[Parameters]
name = hello1 # This will be a string
pets = Fluffy, Spot # This will be a list with 2 items
town = Bismark, ND # This will also be a list of 2 items!!
alt_town = "Bismark, ND" # This will be a string
opt1 = foo, # This will be a list of 1 item (note the trailing comma)
So, if you want something to appear as a list in ConfigObj, you must make sure it includes a comma. A list of one item must have a trailing comma.
In Python, strings are iterable. So, even though they are not a list, they can be iterated over. That means in an expression like
print(", ".join(para['name']))
The string para['name'] will be iterated over, producing the characters ['h', 'e', 'l', 'l', 'o', '1'], which Python dutifully joins together with ", ", producing
h, e, l, l, o, 1
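So one way to read such values safely is to normalise them yourself. A minimal sketch, assuming the para_file.ini shown in the question:
from configobj import ConfigObj

config = ConfigObj('para_file.ini')
para = config['Parameters']

def as_list(value):
    # ConfigObj gives back a plain string for a single value and a list of
    # strings for comma-separated values; normalise to a list either way
    return value if isinstance(value, list) else [value]

print(", ".join(as_list(para['name'])))   # hello1
print(", ".join(as_list(para['type'])))   # 2.1
On the write side, assigning a plain string (para['name'] = 'hello1') rather than a one-element list should avoid the trailing comma, since ConfigObj only adds it when serialising a list.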

How to standardize the format of element in the list from big data

Trying to count the unique values in the following list without using collections:
('TOILET','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
The output which I require is:
('TOILET': 2, 'AIR CONDITIONINGS': 3)
My code currently is
for i in Data:
    if i in number:
        number[i] += 1
    else:
        number[i] = 1
print number
Is it possible to get the output?
Using difflib.get_close_matches to help determine uniqueness
import difflib

a = ('TOILET', 'TOILETS', 'AIR CONDITIONING', 'AIR-CONDITIONINGS', 'AIR-CONDITIONING')
d = {}
for word in a:
    similar = difflib.get_close_matches(word, d.keys(), cutoff=0.6, n=1)
    #print(similar)
    if similar:
        d[similar[0]] += 1
    else:
        d[word] = 1
The actual keys in the dictionary will depend on the order of the words in the list.
difflib.get_close_matches uses difflib.SequenceMatcher to calculate the closeness (ratio) of the word against all possibilities, even if the first possibility is close, and then sorts by the ratio. This has the advantage of finding the closest key whose ratio is greater than the cutoff. But as the dictionary grows the searches will take longer.
If needed, you might be able to optimize a little by sorting the list first so that similar words appear in sequence and doing something like this (lazy evaluation) - choosing an appropriately large cutoff.
import difflib, collections

z = collections.OrderedDict()
a = sorted(a)
cutoff = 0.6
for word in a:
    for key in z.keys():
        if difflib.SequenceMatcher(None, word, key).ratio() > cutoff:
            z[key] += 1
            break
    else:
        z[word] = 1
Results:
>>> d
{'TOILET': 2, 'AIR CONDITIONING': 3}
>>> z
OrderedDict([('AIR CONDITIONING', 3), ('TOILET', 2)])
>>>
I imagine there are python packages that do this sort of thing and may be optimized.
I don't believe the python list has an easy built-in way to do what you are asking. It does, however, have a count method that can tell you how many of a specific element there are in a list. Example:
some_list = ['a', 'a', 'b', 'c']
some_list.count('a') #=> 2
Usually the way you get what you want is to build a counting dictionary by taking advantage of the dict.get(key, default) method:
some_list = ['a', 'a', 'b', 'c']
counts = {}
for el in some_list:
    counts[el] = counts.get(el, 0) + 1
counts  #=> {'a': 2, 'b': 1, 'c': 1}
You can try this:
import re

data = ('TOILETS', 'TOILETS', 'AIR CONDITIONING', 'AIR-CONDITIONINGS', 'AIR-CONDITIONING')
new_data = [re.sub(r"\W+", ' ', i) for i in data]
print new_data
final_data = {}
for i in new_data:
    s = [b for b in final_data if i.startswith(b)]
    if s:
        new_data = s[0]
        final_data[new_data] += 1
    else:
        final_data[i] = 1
print final_data
Output:
{'TOILETS': 2, 'AIR CONDITIONING': 3}
original = ('TOILETS', 'TOILETS', 'AIR CONDITIONING',
            'AIR-CONDITIONINGS', 'AIR-CONDITIONING')
a_set = set(original)
result_dict = {element: original.count(element) for element in a_set}
First, making a set from the original list (or tuple) gives you all its values, without repetition.
Then you create a dictionary whose keys come from that set and whose values are their numbers of occurrences in the original list (or tuple), using the count() method.
a = ['TOILETS', 'TOILETS', 'AIR CONDITIONING', 'AIR-CONDITIONINGS', 'AIR-CONDITIONING']
b = {}
for i in a:
    b.setdefault(i, 0)
    b[i] += 1
You can use this code, but as Jon Clements pointed out, TOILET and TOILETS are not the same string, so you must normalise them yourself.
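For example, a minimal normalisation sketch along those lines (the rules, replacing hyphens with spaces and stripping a trailing S, are my own assumptions about what should count as "the same"):
data = ('TOILET', 'TOILETS', 'AIR CONDITIONING', 'AIR-CONDITIONINGS', 'AIR-CONDITIONING')

def normalise(s):
    # treat hyphens as spaces and drop a trailing plural S
    s = s.replace('-', ' ')
    return s[:-1] if s.endswith('S') else s

counts = {}
for item in data:
    key = normalise(item)
    counts[key] = counts.get(key, 0) + 1
print(counts)  # {'TOILET': 2, 'AIR CONDITIONING': 3}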

Python: Linking Lists Together

Suppose I have a list where each index is either a name, or a list of rooms the preceding name index reserved.
[["Bob"],["125A, "154B", "643A"],["142C", "192B"], ["653G"],
["Carol"], ["95H", 123C"], ["David"], ["120G"]]
So in this case, Bob has the rooms 125A, 154B, 643A, 142C, 192B, and 653G reserved, etc.
How do I construct a function which would make the above into the following format:
[["Bob", "125A, "154B", "643A", "142C", "192B", "653G"], ["Carol"...
Essentially concatenating [name] with all the [list of room reservations], until the next instance of [name]. I have a function which takes a list, and returns True if a list is a name, and False if it is a list of room reservations, so effectively I have:
[True, False, False, False, True, False, True, False] for the above list, but I am not sure how that would help me, if at all. Assume that if a list contains names, it only has one name.
Given the following method
def is_name(x):
    return  # whether x is a name or not
a simple and short solution is to use a defaultdict
Example:
from collections import defaultdict

def do_it(source):
    dd = defaultdict(lambda: [])
    for item in sum(source, []):  # just use your favourite flattening method here
        if is_name(item):
            name = item
        else:
            dd[name].append(item)
    return [[k] + v for k, v in dd.items()]

for s in do_it(l):
    print s
Output:
['Bob', '125A', '154B', '643A', '142C', '192B', '653G']
['Carol', '95H', '123C']
['David', '120G']
Bonus:
This one uses a generator for laziness
import itertools

def do_it(source):
    name, items = None, []
    for item in itertools.chain.from_iterable(source):
        if is_name(item):
            if name:
                yield [name] + items
                name, items = None, []
            name = item
        else:
            items.append(item)
    yield [name] + items
I'll preface this by saying that I strongly agree with #uʍopǝpısdn's suggestion. However if your setup precludes changing it for some reason, this seems to work (although it isn't pretty):
# Original list
l = [["Bob"], ["125A", "154B", "643A"], ["142C", "192B"], ["653G"], ["Carol"], ["95H", "123C"], ["David"], ["120G"]]
# This is the result of your checking function
mapper = [True, False, False, False, True, False, True, False]
# Final list
combined = []
# Generic counters
# Position in arrays
i = 0
# Position in combined list
k = 0
# Loop through the main list until the end.
# We don't use a for loop here because we want to be able to control the
# position of i.
while i < len(l):
    # If the corresponding value is True, start building the list
    if mapper[i]:
        # This is an example of how the code gets messy quickly
        combined.append([l[i][0]])
        i += 1
        # Now that we've hit a name, loop until we hit another, adding the
        # non-name information to the combined list (each room list may hold
        # several rooms, so extend rather than append a single element)
        while i < len(mapper) and not mapper[i]:
            combined[k].extend(l[i])
            i += 1
        # increment the position in our combined list
        k += 1
print combined
Assume that the function which takes a list and returns True or False, based on whether the list contains a name or rooms, is called containsName() ...
def process(items):
    results = []
    name_and_rooms = []
    for item in items:
        if containsName(item):
            if name_and_rooms:
                results.append(name_and_rooms[:])
                name_and_rooms = []
            name_and_rooms.append(item[0])
        else:
            name_and_rooms.extend(item)
    if name_and_rooms:
        results.append(name_and_rooms[:])
    return results
This will include a name even if no list of rooms follows it, e.g. [['bob'], ['susan']].
Also, this will not merge repeated names, e.g. [['bob'], ['123'], ['bob'], ['456']]. If that is desired, you will need to collect the names in a temporary dict instead, with each room list appended to its value, and then spit out the key-value pairs of the dict at the end. That on its own will not preserve the order of the names; if you care about order, keep another list recording the order in which names first appear and use it when spitting out the values of the dict (a sketch follows below).
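A minimal sketch of that dict-plus-order idea (merge_by_name is my own name; it reuses the containsName helper assumed above):
def merge_by_name(items):
    # collect rooms per name, remembering the order in which names first appear
    rooms_by_name = {}
    order = []
    current = None
    for item in items:
        if containsName(item):
            current = item[0]
            if current not in rooms_by_name:
                rooms_by_name[current] = []
                order.append(current)
        else:
            rooms_by_name[current].extend(item)
    return [[name] + rooms_by_name[name] for name in order]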
Really, you should be using a dict for this. This assumes that the order of lists doesn't change (the name is always first).
As others suggested you should re-evaluate your data structure.
>>> from itertools import chain
>>> li_combo = list(chain.from_iterable(lst))
>>> d = {}
>>> for i in li_combo:
...     if is_name(i):
...         k = i
...         if k not in d:
...             d[k] = []
...     else:
...         d[k].append(i)
...
>>> final_list = [[k]+d[k] for k in d]
>>> final_list
[['Bob', '125A', '154B', '643A', '142C', '192B', '653G'], ['Carol', '95H', '123C'], ['David', '120G']]
reduce is your answer. Your data is this:
l=[['Bob'], ['125A', '154B', '643A'], ['142C', '192B'], ['653G'], ['Carol'], ['95H', '123C'], ['David'], ['120G']]
You say you've already got a function that determines if an element is a name. Here is mine:
import re

def is_name(s):
    return re.match("[A-Za-z]+$", s) and True or False
Then, using reduce, it is a one-liner:
reduce(lambda c, n: is_name(n[0]) and c+[n] or c[:-1]+[c[-1]+n], l, [])
Result is:
[['Bob', '125A', '154B', '643A', '142C', '192B', '653G'], ['Carol', '95H', '123C'], ['David', '120G']]
