Python np.where, variable as array index, tuple - python

I want to search for a value in a 2D array and get the value of the corresponding "pair".
In this example I want to search for 'd' and get '14'.
I tried np.where without success and ended up with this crap code. Does anyone have a smarter solution?
`
import numpy as np
ar=[[11,'a'],[12,'b'],[13,'c'],[14,'d']]
arr = np.array(ar)
x = np.where(arr == 'd')
print(x)
print("x[0]:"+str(x[0]))
print("x[1]:"+str(x[1]))
a = str(x[0]).replace("[", "")
a = a.replace("]", "")
a = int (a)
print(a)
b = str(x[1]).replace("[", "")
b = b.replace("]", "")
b = int (b) -1
print(b)
print(ar[a][b])
#got 14
`

So you want to lookup a key and get a value?
It feels like you need to use dict!
>>> ar=[[11,'a'],[12,'b'],[13,'c'],[14,'d']]
>>> d = dict([(k,v) for v,k in ar])
>>> d
{'a': 11, 'b': 12, 'c': 13, 'd': 14}
>>> d['d']
14

Use a dict, simple and straightforward:
dct = {k:v for v,k in ar}
dct['d']
If you are hell bent on using np.where, then you can use this:
import numpy as np
ar = np.array([[11,'a'],[12,'b'],[13,'c'],[14,'d']])
i = np.where(ar[:,1] == 'd')[0][0]
result = ar[i, 0]

I didn't know about np.where! Its docstring mentions using nonzero directly, so here's a code snippet that uses that to print the rows that match your requirement. Note that I add another row with 'd' to show it works for the general case where multiple rows match the condition:
ar=[[11,'a'],[12,'b'],[13,'c'],[14,'d'],[15,'e'],[16,'d']]
arr = np.array(ar)
rows = arr[(arr=='d').nonzero()[0], :]
# array([['14', 'd'],
# ['16', 'd']], dtype='<U21')
This works because nonzero (or where) returns a tuple of row/column indexes of the match. So we just use the first entry in the tuple (an array of row indexes) to index the array row-wise and ask Numpy for all columns (:). This makes the code a bit fragile if you move to 3D or higher dimensions, so beware.
This is assuming you really do intend to use Numpy! Dict is better for many reasons.
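A minimal sketch tying the answers above together, assuming the same ar list as in the question. It shows what the tuple returned by np.where looks like, plus one caveat the snippets above gloss over: NumPy coerces the mixed int/str list to a string dtype, so the value read back from the array is a string and has to be converted explicitly, whereas the dict keeps the original int.

```python
import numpy as np

ar = [[11, 'a'], [12, 'b'], [13, 'c'], [14, 'd']]
arr = np.array(ar)  # mixed int/str input is coerced to a string dtype

# np.where on a 2D boolean array returns a tuple: (row indices, column indices)
rows, cols = np.where(arr == 'd')

# Restricting the comparison to column 1 gives a 1D boolean mask over the rows
mask = arr[:, 1] == 'd'
value = int(arr[mask, 0][0])  # first match, converted back from str to int

# The dict approach keeps the original Python ints, so no conversion is needed
lookup = {key: number for number, key in ar}
print(rows, cols, value, lookup['d'])
```

If several rows can match, arr[mask, 0] holds all of them; taking [0] as done here silently keeps only the first.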

Related

"Boolean value of Tensor with more than one value is ambiguous" when broadcasting torch Tensor

My objective is to extract the dimensions of a pytorch Tensor, whose indices are not in a given list. I want to use broadcasting to do that like follows:
Sim = torch.rand((5, 5))
samples_idx = [0] # the index of dim that I don't want to extract
a = torch.arange(Sim.size(0)) not in samples_idx
result = Sim[a]
I assume a would be a boolean Tensor (True/False) of length 5. But I get the error RuntimeError: Boolean value of Tensor with more than one value is ambiguous. Can anyone help point out where it goes wrong? Thanks.
There is a misunderstanding between the concepts of "dimension" and "indices". What you want is to filter Sim and keep only the rows (the 0th dimension) whose indices match a given rule.
Here is how you could do that:
Sim = torch.rand((5, 5))
samples_idx = [0] # the index of dim that I don't want to extract
a = [v for v in range(Sim.size(0)) if v not in samples_idx]
result = Sim[a]
a is not a boolean Tensor but a list of indices to keep. You then use it to index Sim on the 0th dimension (the rows).
not in is not an operation that can be broadcast; use a regular Python list comprehension for it.
You could create a set containing the desired indices by subtracting samples_idx from a set containing all indices:
>>> Sim = torch.rand(5, 5)
tensor([[0.9069, 0.3323, 0.8358, 0.3738, 0.3516],
[0.1894, 0.5747, 0.0763, 0.8526, 0.2351],
[0.0304, 0.7631, 0.3799, 0.9968, 0.6143],
[0.0647, 0.2307, 0.4061, 0.9648, 0.0212],
[0.8479, 0.6400, 0.0195, 0.2901, 0.4026]])
>>> samples_idx = [0]
The following essentially acts as your torch.arange not in samples_idx:
>>> idx = set(range(len(Sim))) - set(samples_idx)
{1, 2, 3, 4}
Then perform the indexing with idx:
>>> Sim[tuple(idx),:]
tensor([[0.1894, 0.5747, 0.0763, 0.8526, 0.2351],
[0.0304, 0.7631, 0.3799, 0.9968, 0.6143],
[0.0647, 0.2307, 0.4061, 0.9648, 0.0212],
[0.8479, 0.6400, 0.0195, 0.2901, 0.4026]])
Maybe this is a bit out of focus, but you could also try using boolean indexing.
>>> Sim = torch.rand((5, 5))
tensor([[0.8128, 0.2024, 0.3673, 0.2038, 0.3549],
[0.4652, 0.4304, 0.4987, 0.2378, 0.2803],
[0.2227, 0.1466, 0.6736, 0.0929, 0.3635],
[0.2218, 0.9078, 0.2633, 0.3935, 0.2199],
[0.7007, 0.9650, 0.4192, 0.4781, 0.9864]])
>>> samples_idx = [0]
>>> a = torch.ones(Sim.size(0))
>>> a[samples_idx] = 0
>>> result = Sim[a.bool(), :]
tensor([[0.4652, 0.4304, 0.4987, 0.2378, 0.2803],
[0.2227, 0.1466, 0.6736, 0.0929, 0.3635],
[0.2218, 0.9078, 0.2633, 0.3935, 0.2199],
[0.7007, 0.9650, 0.4192, 0.4781, 0.9864]])
This way you don't have to iterate over the whole samples_idx list to check for inclusion.
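As a sketch of why the original line fails and what an elementwise membership test could look like: Python's in / not in operators need a single True/False, so they end up calling bool() on an elementwise comparison, which is ambiguous for a tensor with more than one element. Assuming a PyTorch version that provides torch.isin (recent releases do; older ones may not), the mask the question was after can be built directly. Variable names other than Sim and samples_idx are illustrative.

```python
import torch

Sim = torch.rand(5, 5)
samples_idx = [0]  # row indices to drop

all_idx = torch.arange(Sim.size(0))

# torch.isin tests membership elementwise and returns a boolean tensor,
# so negating it marks the rows to keep
keep = ~torch.isin(all_idx, torch.tensor(samples_idx))

result = Sim[keep]  # boolean indexing along the 0th dimension (rows)
print(keep)
print(result.shape)
```

Unlike the set-based variant, this keeps the rows in their original order by construction.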

How to make a numpy matrix from the values of a dictionary of tuples?

I have a dictionary with tuples consisting of pairs of words, and probabilities as values, for example
d = {('a','b'): 0.5, ('b', 'c'): 0.5, ('a', 'd'): 0.25 ...} and so forth where every word in a tuple has a pair with another one. So for example if there are 4 words total, the length of the dictionary would be 16.
I am trying to put the values in a numpy array, in the format
/// a b c d
a
b
c
d
However, I am having a hard time doing this. Any help would be appreciated. Thanks in advance!
The easiest way to think about this is that your letters/words are indices. You want to convert the letter a to the index 0 and the letter b to the index 1.
With that in mind, a simple way to do this is to use the index method of a list. For example, if we have a list with unique values like x = ['cat', 'dog', 'panda'] we can do x.index('dog') to get the index 1 where 'dog' occurs in the list x.
With that in mind, let's generate some data similar to what you described:
import numpy as np
# I'm going to cheat a bit and use some numpy tricks to generate the data
x = np.random.random((5, 5))
values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
your_data = {}
for (i, j), prob in np.ndenumerate(x):
    your_data[(values[i], values[j])] = prob
print(your_data)
This gives something like:
{('alpha', 'beta'): 0.8066925762434737, ('alpha', 'gamma'): 0.7778280007471104, ...}
So far, we've just generated some example data. Now let's do the inverse to solve your problem:
values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
output = np.zeros((len(values), len(values)), dtype=float)
for key in your_data:
    i = values.index(key[0])
    j = values.index(key[1])
    output[i, j] = your_data[key]
print(output)
That will give us a numpy array with values similar to what you described.
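A possible variation on the loop above, offered as a sketch rather than a required change: values.index(...) scans the list on every call, while a dict that maps each word to its position does the same lookup directly. The example assumes the row/column labels can be derived from the dictionary keys themselves (sorted alphabetically here); pairs missing from the dictionary are left at 0.

```python
import numpy as np

# Small stand-in for the dictionary described in the question
d = {('a', 'b'): 0.5, ('b', 'c'): 0.5, ('a', 'd'): 0.25}

# Collect every word that appears in any pair and fix an ordering
words = sorted({word for pair in d for word in pair})

# Map each word to its row/column position
index = {word: i for i, word in enumerate(words)}

matrix = np.zeros((len(words), len(words)), dtype=float)
for (row_word, col_word), prob in d.items():
    matrix[index[row_word], index[col_word]] = prob

print(words)
print(matrix)
```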

Using element of list both as string and integer

Is there a way to use the elements of a string-list first as strings and then as int?
l = ['A','B','C']
for n in l:
    use n as string (do some operations)
    convert n to int (element of NG)
    use n as int
I tried to play around with range/len but I didn't arrive at a solution.
Edit2:
This is what I have:
import pandas as pd
import matplotlib.pyplot as plt
NG = ['A','B','C']
l = [1,2,3,4,5,6]
b = [6,5,4,3,2,1]
for n in NG:
    print(n)
    dflist = []
    df = pd.DataFrame(l)
    dflist.append(df)
    df2 = pd.DataFrame(b)
    dflist.append(df2)
    df = pd.concat(dflist, axis = 1)
    df.plot()
The output is 3 separate figures.
But I want them to be in one figure:
import pandas as pd
import matplotlib.pyplot as plt
NG = ['A','B','C']
l = [1,2,3,4,5,6]
b = [6,5,4,3,2,1]
for n in NG:
    print(n)
    dflist = []
    df = pd.DataFrame(l)
    dflist.append(df)
    df2 = pd.DataFrame(b)
    dflist.append(df2)
    df = pd.concat(dflist, axis = 1)
    ax = plt.subplot(6, 2, n + 1)
    df.plot(ax = ax)
This code works, but only if the list NG is made of integers [1,2,3]. Mine contains strings, and I need them in the loop.
How to access both the element of list and its index?
That is the real question as I understood mainly from this comment. And it is a pretty common and straightforward piece of code:
NG = ['A', 'B', 'C']
for i in range(len(NG)):
    print(i)
    print(NG[i])
Here is my 2 cents:
>>> for n in l:
...     print(ord(n))
...     print(n)
...
65
A
66
B
67
C
To convert back to char
>>> chr(65)
'A'
I think "integer" here means the ASCII value of the character,
so you can use that ASCII value to work with the characters.
My solution is to handle your variables like this,
using the ord() function to get the ASCII values:
l = ['A','B','C']
for i in range(0,len(l)):
    print("string value is ",l[i])
    # now for integer values
    if not isinstance(l[i], int):
        print("ascii value of this char is ",ord(l[i]))
    else:
        print("already int type go on..")
A character has no inherent int value; the int value of a character generally refers to its ASCII value, though other encodings are possible.
Use enumerate to iterate both on the indices and the letters.
NG = ['A','B','C']
for i, n in enumerate(NG, 1):
print(i, n)
Will output (on Python 3):
1 A
2 B
3 C
In your case, because you don't need the letters at all in your loop, you can use the underscore _ to signal to future readers what your code does: it uses NG only for its indices.
for i, _ in enumerate(NG, 1):
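To connect this back to the question, here is a small sketch of a loop that uses each element both as a string and as an integer position. The label text and the positions dict are made up for illustration; in the plotting code from the question, i would take the place of the integer passed to plt.subplot.

```python
NG = ['A', 'B', 'C']

positions = {}
for i, n in enumerate(NG, 1):
    label = "group " + n   # n is still the string, usable for titles or labels
    positions[label] = i   # i is the 1-based integer, e.g. a subplot index

print(positions)
```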

How to standardize the format of element in the list from big data

Trying to count unique values from the following list without using collections:
('TOILET','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
The output which I require is :
('TOILET':2,'AIR CONDITIONiNGS':3)
My code currently is
for i in Data:
    if i in number:
        number[i] += 1
    else:
        number[i] = 1
print(number)
Is it possible to get the output?
Using difflib.get_close_matches to help determine uniqueness
import difflib
a = ('TOILET','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
d = {}
for word in a:
    similar = difflib.get_close_matches(word, d.keys(), cutoff = 0.6, n = 1)
    #print(similar)
    if similar:
        d[similar[0]] += 1
    else:
        d[word] = 1
The actual keys in the dictionary will depend on the order of the words in the list.
difflib.get_close_matches uses difflib.SequenceMatcher to calculate the closeness (ratio) of the word against all possibilities even if the first possibility is close - then sorts by the ratio. This has the advantage of finding the closest key that has a ratio greater than the cutoff. But as the dictionary grows the searches will take longer.
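To make the role of the cutoff concrete, here is a short sketch that prints the ratios involved for one word from the question. The known list is an illustrative stand-in for the dictionary keys collected so far; no exact ratio values are claimed here, only that similar spellings land above the 0.6 cutoff and unrelated ones well below it.

```python
import difflib

known = ['TOILET', 'AIR CONDITIONING']  # stand-in for the keys seen so far

# ratio() is a similarity score between 0 and 1
ratio_close = difflib.SequenceMatcher(None, 'TOILETS', 'TOILET').ratio()
ratio_far = difflib.SequenceMatcher(None, 'TOILETS', 'AIR CONDITIONING').ratio()

# get_close_matches keeps only candidates scoring at least the cutoff, best first
best = difflib.get_close_matches('TOILETS', known, n=1, cutoff=0.6)

print(ratio_close, ratio_far, best)
```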
If needed, you might be able to optimize a little by sorting the list first so that similar words appear in sequence and doing something like this (lazy evaluation) - choosing an appropriately large cutoff.
import difflib, collections
z = collections.OrderedDict()
a = sorted(a)
cutoff = 0.6
for word in a:
    for key in z.keys():
        if difflib.SequenceMatcher(None, word, key).ratio() > cutoff:
            z[key] += 1
            break
    else:
        z[word] = 1
Results:
>>> d
{'TOILET': 2, 'AIR CONDITIONING': 3}
>>> z
OrderedDict([('AIR CONDITIONING', 3), ('TOILET', 2)])
>>>
I imagine there are python packages that do this sort of thing and may be optimized.
I don't believe the python list has an easy built-in way to do what you are asking. It does, however, have a count method that can tell you how many of a specific element there are in a list. Example:
some_list = ['a', 'a', 'b', 'c']
some_list.count('a') #=> 2
Usually the way you get what you want is to build up a dictionary of counts by taking advantage of the dict.get(key, default) method:
some_list = ['a', 'a', 'b', 'c']
counts = {}
for el in some_list:
    counts[el] = counts.get(el, 0) + 1
counts #=> {'a' : 2, 'b' : 1, 'c' : 1}
You can try this:
import re
data = ('TOILETS','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
new_data = [re.sub(r"\W+", ' ', i) for i in data]
print(new_data)
final_data = {}
for i in new_data:
    # keys already seen that are a prefix of the current item
    s = [b for b in final_data if i.startswith(b)]
    if s:
        final_data[s[0]] += 1
    else:
        final_data[i] = 1
print(final_data)
Output:
{'TOILETS': 2, 'AIR CONDITIONING': 3}
original = ('TOILETS', 'TOILETS', 'AIR CONDITIONING',
'AIR-CONDITIONINGS', 'AIR-CONDITIONING')
a_set = set(original)
result_dict = {element: original.count(element) for element in a_set}
First, making a set from original list (or tuple) gives you all values from it, but without repeating.
Then you create a dictionary with keys from that set and values as occurrences of them in the original list (or tuple), employing the count() method.
a = ['TOILETS', 'TOILETS', 'AIR CONDITIONING', 'AIR-CONDITIONINGS', 'AIR-CONDITIONING']
b = {}
for i in a:
    b.setdefault(i,0)
    b[i] += 1
You can use this code, but as Jon Clements pointed out, TOILET and TOILETS aren't the same string, so you must normalize them yourself first.
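One way to do that normalization before counting, sketched under a loud assumption: the helper normalize below is hypothetical and uses a deliberately naive rule (turn punctuation into spaces, then drop a single trailing "S"). That rule happens to fit the five labels in the question but would mangle words that legitimately end in S, so treat it as a starting point only. No collections import is needed.

```python
import re

data = ('TOILET', 'TOILETS', 'AIR CONDITIONING',
        'AIR-CONDITIONINGS', 'AIR-CONDITIONING')

def normalize(label):
    """Hypothetical helper: naive canonical form for a label."""
    label = re.sub(r"\W+", " ", label).strip()  # '-' and similar become spaces
    if label.endswith("S"):                     # naive plural stripping (assumption)
        label = label[:-1]
    return label

counts = {}
for label in data:
    key = normalize(label)
    counts[key] = counts.get(key, 0) + 1

print(counts)
```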

python pandas beginner: multi-dimensional data-analysis workflow (groupby+agg+plot)

I'm new to pandas and am trying to learn how to process my multi-dimensional data.
My data
Let's assume my data is a big CSV with the columns ['A', 'B', 'C', 'D', 'E', 'F', 'G']. This data describes some simulation results, where ['A', 'B', ..., 'F'] are simulation parameters and 'G' is one of the outputs (the only existing output in this example!).
EDIT / UPDATE:
As Boud suggested in the comments, let's generate some data which is compatible to mine:
import pandas as pd
import itertools
import numpy as np
npData = np.zeros(5000, dtype=[('A','i4'),('B','f4'),('C','i4'), ('D', 'i4'), ('E', 'f4'), ('F', 'i4'), ('G', 'f4')])
A = [0,1,2,3,6] # param A: int
B = [1000.0, 10.000] # param B: float
C = [100,150,200,250,300] # param C: int
D = [10,15,20,25,30] # param D: int
E = [0.1, 0.3] # param E: float
F = [0,1,2,3,4,5,6,7,8,9] # param F = random-seed = int -> 10 runs per scenario
# some beta-distribution parameters for randomizing the results in column "G"
aDistParams = [(6, 1),
               (5, 2),
               (4, 3),
               (3, 4),
               (2, 5),
               (1, 6),
               (1, 7)]
counter = 0
for i in itertools.product(A,B,C,D,E,F):
    npData[counter]['A'] = i[0]
    npData[counter]['B'] = i[1]
    npData[counter]['C'] = i[2]
    npData[counter]['D'] = i[3]
    npData[counter]['E'] = i[4]
    npData[counter]['F'] = i[5]
    # call seed(); assigning to np.random.seed would overwrite the function instead of seeding
    np.random.seed(i[5])
    npData[counter]['G'] = np.random.beta(a=aDistParams[i[0]][0], b=aDistParams[i[0]][1])
    counter += 1
data = pd.DataFrame(npData)
data = data.reindex(np.random.permutation(data.index)) # shuffle rows because my original data doesn't give any guarantees
Because the parameters ['A', 'B', ..., 'F'] are generated as a cartesian-product (meaning: nested for-loops; a priori), i want to use groupby for obtaining each possible 'simulation scenario' before analysing the output.
The parameter 'F' describe multiple runs for each scenario (each scenario defined by 'A', 'B', ..., 'E' ; let's assume, that 'F' is the random-seed), so my code becomes:
grouped = data.groupby(['A','B','C','D','E'])
# -> every group defines one simulation scenario
grouped_agg = grouped.agg(({'G' : np.mean}))
# -> the mean of the simulation output in 'G' over 'F' is calculated for each group/scenario
What do i want to do now?
I: display all the (unique) values of each scenario-parameter within these groups -> grouped_agg gives me an iterable of tuples, where for example all the entries at each position 0 give me all the values for 'A' (so with a few lines of python i would get my unique values, but maybe there is a function for that)
Update: my approach
list(set(grouped_agg.index.get_level_values('A'))) (when interested in 'A'; using set for obtaining unique values; probably not the stuff you want to do, if you need high performance)
=> [0, 1, 2, 3, 6]
II: generate some plots (of lower dimension) -> i need to make some variables constant and filter/select my data before plotting (therefore step I needed) =>
'B' const
'C', const
'E' const
'D' = x-axis
'G' = y-axis / output from my aggregation
'A' = one more dimension = multiple colors within 2d-plot -> one G/y-axis for each value of 'A'
How would i generate a plot like that?
I think, that reshaping my data is the key step and pandas plotting capabilities will handle it then. Maybe achieving a shape, where there are 5 columns (one for each value of parameter A) and the corresponding G-values for each index-selection + param-A-selection is enough, but i wasn't able to achieve that form yet.
Thanks for your input!
(i'm using pandas 0.12 within enthought canopy)
Sascha
I: If I understand your example and desired output, I don't see why grouping is necessary.
data.A.unique()
II: Updated....
I will implement the example you sketch above. Assume that we have averaged 'G' over the random seed ('F') like so:
data = data.groupby(['A','B','C','D','E']).agg(({'G' : np.mean})).reset_index()
Start by selecting the rows where B, C, and E have some constant values that you specify.
df1 = data[(data['B'] == const1) & (data['C'] == const2) & (data['E'] == const3)]
Now we want to plot 'G' as a function of 'D', with a different color for every value of 'A'.
df1.set_index('D').groupby('A')['G'].plot(legend=True)
I tested the above on some dummy data, and it works as you describe. The range of 'G' corresponding to each 'A' is plotted in a distinct color on the same axes.
III: I don't know how to answer that broad question.
IV: No, I don't think that's an issue for you here.
I suggest playing with simpler, small data sets and getting more familiar with pandas.
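For the reshaping the question asks about (one column per value of 'A'), a pivot is one option. The sketch below runs on a small made-up data set with fewer parameters than the original; the parameter values and the constant chosen for 'B' are assumptions for illustration only. It stops before plotting so that it runs without matplotlib.

```python
import itertools

import numpy as np
import pandas as pd

# Small synthetic stand-in: parameters A, B, D, seed F, output G
rng = np.random.default_rng(0)
rows = [(a, b, d, f, rng.random())
        for a, b, d, f in itertools.product([0, 1, 2], [10.0, 1000.0],
                                            [10, 15, 20], range(4))]
data = pd.DataFrame(rows, columns=['A', 'B', 'D', 'F', 'G'])

# Step I: unique values of a parameter, no grouping needed
unique_a = sorted(data['A'].unique())

# Average G over the seeds F for each scenario
means = data.groupby(['A', 'B', 'D'], as_index=False)['G'].mean()

# Step II: hold B constant, then reshape so each value of A becomes a column
fixed = means[means['B'] == 10.0]
wide = fixed.pivot(index='D', columns='A', values='G')

print(unique_a)
print(wide)
# wide.plot() would then draw G against D with one line per value of A
```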
