Getting a vector of dictionary values in an array - python

I am trying to get a vector of specific dictionary values which are in a numpy array. Here is what the array looks like:
import numpy as np
edge_array = np.array(
[[1001, 7005, {'lanes': 9, 'length': 0.35, 'type': '99', 'modes': 'cw'}],
[1001, 8259, {'lanes': 10, 'length': 0.46, 'type': '99', 'modes': 'cw'}],
[1001, 14007, {'lanes': 7, 'length': 0.49, 'type': '99', 'modes': 'cw'}]])
I have a vector for the first two values of each row (i.e. 1001 and 7005), but I need another vector for the values associated with 'lanes'.
Here is my code so far:
row_idx = edge_array[:, 0]
col_idx = edge_array[:, 1]
lane_values = edge_array[:, 2['lanes']]
The error I get is as follows:
lane_values = edge_array[:, 2['lanes']]
TypeError: 'int' object has no attribute '__getitem__'
Please let me know if you need any further clarification, thanks!

The subexpression 2['lanes'] does not make sense: you are indexing into the number 2.
Instead, try:
[rec['lanes'] for rec in edge_array[:, 2]]
Or:
import operator
map(operator.itemgetter('lanes'), edge_array[:,2])
The above will give you a regular Python list; if you want a NumPy array you'll have to call np.array() on the list.
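For instance, a minimal sketch combining the comprehension above with np.array():
lane_values = np.array([rec['lanes'] for rec in edge_array[:, 2]])
# array([ 9, 10,  7])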
But the better solution here is to transform your data into a "structured array" which has named columns and then you can index efficiently by name. If your array has many rows, this will have a big impact on efficiency.

This is not a fully working example, which makes it hard to work with; the types are unclear. I suspect you are working with numpy, but it's hard to tell.
In any case, the indexing 2['something'] is incorrect, and the error tells you why: you are trying to index into an integer with a key. Look up how indexing is done in Python / numpy.
But this is how you could extract your 'lanes':
list(map(lambda x: x['lanes'], edge_array[:, 2]))
# OR (if you want a vector/np-array)
vec_of_lanes = np.array(list(map(lambda x: x['lanes'], edge_array[:, 2])))
More in numpy-style:
vec_of_lanes = np.apply_along_axis(lambda x: x[2]['lanes'], 1, edge_array)

@Zwinck suggested a structured array. Here's one way of doing that.
Define a dtype for the dictionary part; it has fields with different dtypes:
dt1 = np.dtype([('lanes',int), ('length',float), ('type','S2'),('modes','S2')])
Embed that dtype in a larger one. I used a sub-array format for the first 2 values:
dt = np.dtype([('f0',int,(2,)), ('f1',dt1)])
Now create the array. I edited your expression to fit dt; the mix of tuples and lists is important. I could have transferred the data from your object array instead (see the sketch after the output below).
edge_array1 = np.array(
[([1001, 7005], ( 9, 0.35, '99','cw')),
([1001, 8259], ( 10, 0.46, '99','cw')),
([1001, 14007], (7, 0.49, '99', 'cw'))], dtype=dt)
Now the 2 int values can be accessed by the 'f0' field name:
In [513]: edge_array1['f0']
Out[513]:
array([[ 1001, 7005],
[ 1001, 8259],
[ 1001, 14007]])
while 'lanes' are accessed by a double application of field name indexing (since they are a field within the field):
In [514]: edge_array1['f1']['lanes']
Out[514]: array([ 9, 10, 7])
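As a hedged sketch of that transfer idea (assuming edge_array from the question and dt as defined above), the structured array can be built from the original object array instead of retyping the data:
rows = [(list(r[:2]), tuple(r[2][k] for k in ('lanes', 'length', 'type', 'modes')))
        for r in edge_array]
edge_array1 = np.array(rows, dtype=dt)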

Related

Convert dataframe with two array columns into list of arrays

I have a dataframe with two columns containing arrays in each cell. Here's some code to create a small example dataframe with the same features as mine.
import numpy as np
import pandas as pd
data = {'time': [
np.array(['2017-06-28T22:47:51.213500000', '2017-06-28T22:48:37.570900000', '2017-06-28T22:49:46.736800000']),
np.array(['2017-06-28T22:46:27.321600000', '2017-06-28T22:46:27.321600000', '2017-06-28T22:47:07.220500000', '2017-06-28T22:47:04.293000000']),
np.array(['2017-06-28T23:10:20.125000000', '2017-06-28T23:10:09.885000000', '2017-06-28T23:11:31.902000000'])
],
'depth': [
np.array([215.91168091, 222.89173789, 215.21367521]),
np.array([188.68945869, 208.23361823, 217.30769231, 229.87179487]),
np.array([169.84330484, 189.38746439, 178.91737892])
]
}
df = pd.DataFrame(data)
df
I want to plot the data as three individual shapes, one for each row, where the time values are treated as the x coordinates and the depth values are treated as the y coordinates. To do this, I want to make a list of arrays that looks something like this.
[array([['2017-06-28T22:47:51.213500000', 215.91168091],
        ['2017-06-28T22:48:37.570900000', 222.89173789],
        ['2017-06-28T22:49:46.736800000', 215.21367521]], dtype=object),
 array([['2017-06-28T22:46:27.321600000', 188.68945869],
        ['2017-06-28T22:46:27.321600000', 208.23361823],
        ['2017-06-28T22:47:07.220500000', 217.30769231],
        ['2017-06-28T22:47:04.293000000', 229.87179487]], dtype=object),
 array([['2017-06-28T23:10:20.125000000', 169.84330484],
        ['2017-06-28T23:10:09.885000000', 189.38746439],
        ['2017-06-28T23:11:31.902000000', 178.91737892]], dtype=object)]
Try zip inside a list comprehension:
l = [np.array(list(zip(x,y))) for x, y in zip(df.time,df.depth)]
Out[385]:
[array([['2017-06-28T22:47:51.213500000', '215.91168091'],
['2017-06-28T22:48:37.570900000', '222.89173789'],
['2017-06-28T22:49:46.736800000', '215.21367521']], dtype='<U29'),
array([['2017-06-28T22:46:27.321600000', '188.68945869'],
['2017-06-28T22:46:27.321600000', '208.23361823'],
['2017-06-28T22:47:07.220500000', '217.30769231'],
['2017-06-28T22:47:04.293000000', '229.87179487']], dtype='<U29'),
array([['2017-06-28T23:10:20.125000000', '169.84330484'],
['2017-06-28T23:10:09.885000000', '189.38746439'],
['2017-06-28T23:11:31.902000000', '178.91737892']], dtype='<U29')]
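Note that zipping the datetime strings with floats makes NumPy coerce everything to strings (hence dtype='<U29'), while the desired output above uses dtype=object. A hedged variant that preserves the mixed types is to pass dtype=object explicitly:
l = [np.array(list(zip(x, y)), dtype=object) for x, y in zip(df.time, df.depth)]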

IndexingError: Too many indexers while using iloc

I have a dataframe from which I am trying to add attributes to my graph edges.
(The dataframe has a mean_travel_time column, which is going to be the attribute for my edges.)
Plus, I have a data list which consists of source and destination nodes as tuples, like this:
[(1160, 2399),
(47, 1005)]
Now, while using set_edge_attribute to add attributes, I need my data in a dictionary:
{(1160, 2399):1434.67,
(47, 1005):2286.10,
}
I did something like this:
data_dict = {}  # empty dictionary
for i in data:
    data_dict[i] = df1['mean_travel_time'].iloc[i]  # adding values
But I am getting an error saying too many indexers.
Can anyone help me out with the error?
Please provide your data in a format easy to copy:
df = pd.DataFrame({
'index': [1, 9, 12, 18, 26],
'sourceid': [1160, 70, 1190, 620, 1791],
'dstid': [2399, 1005, 4, 103, 1944],
'month': [1] * 5,
'mean_travel_time': [1434.67, 2286.10, 532.69, 593.20, 779.05]
})
If you are trying to iterate through a list of edges such as (1,2) you need to set an index for your DataFrame first:
df.set_index(['sourceid', 'dstid'])
You could then access specific edges:
df.set_index(['sourceid', 'dstid']).loc[(1160, 2399)]
Or use a list of edges:
edges = list(zip(df['sourceid'], df['dstid']))
df.set_index(['sourceid', 'dstid']).loc[edges]
But you don't need to do any of this because, in fact, you can get your entire dict all in one go:
df.set_index(['sourceid', 'dstid'])['mean_travel_time'].to_dict()
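For completeness, a hedged sketch of feeding that dict into networkx (assumes a graph G whose edges match the (sourceid, dstid) pairs; set_edge_attributes is the plural form of the method the question mentions):
import networkx as nx
attrs = df.set_index(['sourceid', 'dstid'])['mean_travel_time'].to_dict()
nx.set_edge_attributes(G, attrs, name='mean_travel_time')  # G is assumed to exist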

Python converting nested dictionary with list containing float elements as individual elements

I'm collecting values from different arrays and a nested dictionary containing list values, like below. The lists contain millions of rows; I tried pandas dataframe concatenation but ran out of memory, so I resorted to a for loop.
array1_str = ['user_1', 'user_2', 'user_3','user_4' , 'user_5']
array2_int = [3,3,1,2,4]
nested_dict_w_list = {'outer_dict': {'inner_dict': [[1.0001], [2.0033], [1.3434], [2.3434], [0.44224]]}}
final_out = [[array1_str[i], array2_int[i], nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]]] for i in range(len(array2_int))]
I'm getting the output as
user_1, 3, [2.3434]
user_2, 3, [2.3434]
user_3, 1, [1.0001]
user_4, 2, [1.3434]
user_5, 4, [0.44224]
But I want the output as
user_1, 3, 2.3434
user_2, 3, 2.3434
user_3, 1, 1.0001
user_4, 2, 1.3434
user_5, 4, 0.44224
I need to eventually convert this to parquet file, I'm using spark dataframe to convert this to parquet, but the schema is appearing as array(double)). But I need it as just double. Any input is appreciated.
The for loop below works, but is there a more efficient and elegant solution?
final_output = []
for i in range(len(array2_int)):
    index = nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]]
    final_output.append((array1_str[i], array2_int[i], index[0]))
You can modify your original list comprehension by indexing to item zero:
final_out = [
(array1_str[i], array2_int[i], nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]][0])
for i in range(len(array2_int))
]
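Since the stated goal is a parquet file via Spark, here is a hedged sketch of the final step (assumes an active SparkSession named spark; the column names are made up for illustration):
sdf = spark.createDataFrame(final_out, schema='user string, idx int, value double')
sdf.write.parquet('output.parquet')  # 'value' is now a plain double, not array<double>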

Fitting beta distribution over each item in a dictionary of items

I have a dictionary with unique ID and [sample distribution of scores] pairs, e.g.: '100': [0.5, 0.6, 0.2, 0.7, 0.3]. The arrays are not all the same length.
For each item/'scores' array in my dictionary, I want to fit a beta distribution like scipy.stats.beta.fit() over the distribution of scores and get the alpha/beta parameters for each sample. And then I want this in a new dictionary — so it'd be like, '101': (1.5, 1.8).
I know I could do this by iterating over my dictionary with a for-loop, but the dictionary is pretty massive/I'd like to know if there's a more computationally efficient way of doing it.
For context, the way I get this dictionary is from a pandas dataframe, where I do:
my_dictionary = df.groupby('unique_id')['score'].apply(list).to_dict()
The df looks like this:
For example:
df = pd.DataFrame({
'id': ['100', '100', '100', '101', '101', '102'],
'score' : [0.5, 0.3, 0.2, 1, 0.2, 0.9]
})
And then the resulting dictionary looks like:
{'100': [0.5, 0.3, 0.2], '101': [0.2, 0.1], '102': [0.9]}
Is there maybe also a way of fitting the beta distribution straight from the df.groupby level/without having to convert it into a dictionary first and then looping over the dictionary with scipy? Like is there something where I could do:
df.groupby('unique_id')['score'].apply(stats.beta.fit()).to_dict()
...or something like that?
Try this (beta here is scipy.stats.beta):
from scipy.stats import beta
df = df.groupby('id').apply(lambda x: list(beta.fit(x.score)))
dc = df.to_dict()
Output:
df
id
100 [0.2626434905176847, 0.37866242902872393, 0.18...
101 [1.253982875508286, 0.8832540117966552, -0.093...
102 [1.044551187075241, 1.0167687597781938, 0.8999...
dtype: object
dc
{'100': [0.2626434905176847, 0.37866242902872393, 0.18487097639113187, 0.3151290236088682],
'101': [1.253982875508286, 0.8832540117966552, -0.09383386122371801, 1.0938338612237182],
'102': [1.044551187075241, 1.0167687597781938, 0.8999999999999999, 1.1272504901983386e-16]}
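Note that beta.fit returns four values (a, b, loc, scale); if you only want the (alpha, beta) pair as in the question, a hedged one-liner on the original df (before the reassignment above) is:
dc = df.groupby('id')['score'].apply(lambda s: tuple(beta.fit(s)[:2])).to_dict()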
If instead you need a beta.fit per row of the dataframe df:
from scipy import stats
df['beta_fit'] = df['score'].apply(lambda x: stats.beta.fit(x))
Now the result is stored in df['beta_fit']:
0 (0.5158954356434775, 0.4824876600627905, 0.154...
1 (0.18219650169013427, 0.18228236200252418, 0.1...
2 (2.874609362944296, 0.8497751096020354, -0.341...
3 (1.313976940871222, 0.5956397575363881, -0.093...
Name: beta_fit, dtype: object
If you want to keep the location (loc) and scale (scale) fixed, you need to indicate this in scipy.stats.beta.fit. You can use functools.partial for this:
>>> import pandas as pd
>>> import scipy.stats
>>> from functools import partial
>>> df = pd.DataFrame({
... 'id': ['100', '100', '100', '101', '101', '102'],
... 'score' : [0.5, 0.3, 0.2, 0.1, 0.2, 0.9]
... })
>>> beta = partial(scipy.stats.beta.fit, floc=0, fscale=1)
>>> df.groupby('id')['score'].apply(beta)
id
100 (4.82261025047374, 9.616623800842953, 0, 1)
101 (0.7079910251948778, 0.910200073771759, 0, 1)
Name: score, dtype: object
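To land back in the dictionary form the question asks for, finish with to_dict() (slicing [:2] keeps only the (alpha, beta) pair, since loc and scale are fixed here):
>>> df.groupby('id')['score'].apply(beta).apply(lambda t: t[:2]).to_dict()
{'100': (4.82261025047374, 9.616623800842953), '101': (0.7079910251948778, 0.910200073771759)}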
Note that I have adjusted your input example, since it contained an incorrect value (1.0) and, in some cases, too few values for the fit to succeed.

python - create new sub-matrix by filtering columns from matrix/bidimensional list

For instance, given the following matrix:
matrix = [
['month','val1','val2','valn'],
['jan','100','200','300'],
['feb','101','201','302'],
['march','102','202','303'],
['april','103','203','303'],
['march','104','204','304']
]
I'd like to create a new matrix based on a list of columns indexes or names (filter in), so
filter_col_indx = {0,2}
filter_col_name = {'month','val2'}
would produce the same output:
matrix2 = [
['month','val2'],
['jan','200'],
['feb','201'],
['march','202'],
['april','203'],
['march','204']
]
For large matrices, what would be the most efficient way to do this? The list of columns can vary.
Thanks
This can be done using operator.itemgetter:
import operator
matrix = [
['month','val1','val2','valn'],
['jan','100','200','300'],
['feb','101','201','302'],
['march','102','202','303'],
['april','103','203','303'],
['march','104','204','304']
]
filter_col_indx = [0,2]
getter = operator.itemgetter(*filter_col_indx)
matrix2 = [list(getter(row)) for row in matrix]
print(matrix2)
yields
[['month', 'val2'],
['jan', '200'],
['feb', '201'],
['march', '202'],
['april', '203'],
['march', '204']]
operator.itemgetter(*filter_col_indx) returns a function which takes a sequence as its argument and returns the 0th and 2nd items from the sequence. Thus, you can apply this function to each row to select the desired values from matrix.
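One caveat: with a single index, itemgetter returns the bare item rather than a tuple, so list(getter(row)) would splay a string cell into characters instead of producing a one-element row. A small defensive sketch (the helper name is made up):
def select_columns(matrix, col_indices):
    getter = operator.itemgetter(*col_indices)
    if len(col_indices) == 1:
        return [[getter(row)] for row in matrix]
    return [list(getter(row)) for row in matrix]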
If you install pandas, then you could make matrix a DataFrame and select the desired columns like this:
import pandas as pd
matrix = [
['month','val1','val2','valn'],
['jan','100','200','300'],
['feb','101','201','302'],
['march','102','202','303'],
['april','103','203','303'],
['march','104','204','304']
]
df = pd.DataFrame(matrix[1:], columns=matrix[0])
print(df[['month', 'val2']])
yields
month val2
0 jan 200
1 feb 201
2 march 202
3 april 203
4 march 204
You might enjoy using pandas, since it makes a lot of data-munging operations very easy.
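If you need the result back as a plain list-of-lists matrix rather than a DataFrame, a small sketch using the df above:
matrix2 = [['month', 'val2']] + df[['month', 'val2']].values.tolist()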
If you're always interested in whole columns, I think it would be appropriate to store the data using a dictionary containing the columns as lists:
data = {'month': ['jan', 'feb', 'march', 'april', 'march'],
'val1': [100, 101, 102, 103, 104],
'val2': [200, 201, 202, 203, 204],
...
}
To retrieve columns (which I have now written horizontally...), you do:
{key: data[key] for key in ['month', 'val2']}
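A hedged sketch of building that column-oriented dict from the original row-based matrix (the header row becomes the keys):
header, *rows = matrix
data = {name: [row[i] for row in rows] for i, name in enumerate(header)}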
Here is a numpy version:
import numpy as np
matrix = np.array([
['month','val1','val2','valn'],
['jan','100','200','300'],
['feb','101','201','302'],
['march','102','202','303'],
['april','103','203','303'],
['march','104','204','304']
])
search = ['month', 'val2']
indexes = matrix[0,:].searchsorted(search) #search only the first row
# or indexes = [0, 2]
print(matrix[:,indexes])
>>> [['month' 'val2']
['jan' '200']
['feb' '201']
['march' '202']
['april' '203']
['march' '204']]
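Note that searchsorted assumes the row being searched is sorted; the header ['month','val1','val2','valn'] happens to be in alphabetical order here. A hedged, order-independent alternative:
indexes = [list(matrix[0]).index(name) for name in search]
print(matrix[:,indexes])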
