I have a pandas dataframe like this:
order_id latitude longitude
0 519 19.119677 72.905081
1 520 19.138250 72.913190
2 521 19.138245 72.913183
3 523 19.117662 72.905484
4 524 19.137793 72.913088
5 525 19.119372 72.893768
6 526 19.116275 72.892951
7 527 19.133430 72.913268
8 528 19.136800 72.917185
9 529 19.118284 72.901114
10 530 19.127193 72.914269
11 531 19.114269 72.904039
12 532 19.136292 72.913941
13 533 19.119075 72.895115
14 534 19.119677 72.905081
15 535 19.119677 72.905081
And one list
DB
Out[658]:
[['523'],
['526', '533'],
['527', '528', '532', '535'],
['530', '519'],
['529', '531', '525', '534'],
['520', '521', '524']]
Now I want to subset the dataframe on the list elements. There are 6 elements in the list, and each element is a sublist of order_id values. So, for every sublist I want the corresponding latitudes and longitudes, and then I want to calculate the haversine distance between each pair of order_id locations:
DB[2]
['527', '528', '532', '535']
Then I want to subset the main dataframe for the latitude and longitude pairs, so it should return an array like this:
array([[ 19.11824057, 72.8939447 ],
[ 19.1355074 , 72.9147978 ],
[ 19.11917348, 72.90518167],
[ 19.127193 , 72.914269 ]])
(Just an example, not the correct lat/long pairs.)
I am doing following:
db_lat = []
db_long = []
for i in range(len(DB)):
    l = len(DB[i])
    for j in range(l):
        db_lat.append(tsp_data_unique.latitude[tsp_data_unique['order_id'] ==
                      ''.join(DB[i][j])])
        db_long.append(tsp_data_unique.longitude[tsp_data_unique['order_id'] ==
                       ''.join(DB[i][j])])
But this gives me one flat list of all the lats and longs present in DB, so I am not able to distinguish which lat and long belong to which DB element. Instead, for the 6 elements of DB, I want 6 arrays of lat/long pairs. Please help.
First of all, I would convert your int column to str so that the dataframe values can be compared with the strings in the list:
df['order_id'] = df['order_id'].apply(str)
and then set the index on order_id:
df = df.set_index('order_id')
Then you can do something like:
pairs = df.loc[DB[2]].values
obtaining:
array([[ 19.13343 , 72.913268],
[ 19.1368 , 72.917185],
[ 19.136292, 72.913941],
[ 19.119677, 72.905081]])
EDIT:
Iterating over your list you can then do:
In [93]: for i in range(len(DB)):
    ...:     p = df.loc[DB[i]].values
    ...:     print(p)
    ...:
[[ 19.117662 72.905484]]
[[ 19.116275 72.892951]
[ 19.119075 72.895115]]
[[ 19.13343 72.913268]
[ 19.1368 72.917185]
[ 19.136292 72.913941]
[ 19.119677 72.905081]]
[[ 19.127193 72.914269]
[ 19.119677 72.905081]]
[[ 19.118284 72.901114]
[ 19.114269 72.904039]
[ 19.119372 72.893768]
[ 19.119677 72.905081]]
[[ 19.13825 72.91319 ]
[ 19.138245 72.913183]
[ 19.137793 72.913088]]
This is how I solved it. It is similar to what @Fabio posted.
new_DB = []
for i in range(len(DB)):
    new_DB.append(tsp_data_unique[tsp_data_unique['order_id'].isin(DB[i])]
                  [['latitude', 'longitude']].values)
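As for the original goal of computing haversine distances within each group, here is a minimal sketch (the haversine helper is my own addition, not part of the question's code; it returns kilometres and assumes the new_DB arrays built above):
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    # great-circle distance in km between points given in decimal degrees
    R = 6371.0  # mean Earth radius, km
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * R * np.arcsin(np.sqrt(a))

# pairwise distance matrix for one group, e.g. new_DB[2]
lat, lon = new_DB[2][:, 0], new_DB[2][:, 1]
dist_matrix = haversine(lat[:, None], lon[:, None], lat[None, :], lon[None, :])
dist_matrix[i, j] then holds the distance between the i-th and j-th order of that group.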
I have a file f which holds N (unknown) events. Each event carries an unknown number of reconstructed tracks (different for each event; call it i, j, etc.). Then, each track has properties like energy E and likelihood lik. So,
>>> print(f.events.tracks.lik)
[[lik1, lik2, ..., likX], [lik1, lik2, ..., likj], ..., [lik1, lik2, ..., likz]]
prints an array holding N subarrays (one per event), each listing the lik values for all of that event's tracks.
GOAL: call f.events.tracks[:, Inds].E to get the energies for the tracks with max likelihood.
Minimal code example
>>> import numpy as np
>>> lik = np.random.randint(low=0, high=100, size=50).reshape(5, 10)
>>> print(lik)
[[ 3 49 27 3 80 59 96 99 84 34]
[88 62 61 83 90 9 62 30 92 80]
[ 5 21 69 40 2 40 13 63 42 46]
[ 0 55 71 67 63 49 29 7 21 7]
[40 7 68 46 95 34 74 88 79 15]]
>>> energy = np.random.randint(low=100, high=2000, size=50).reshape(5, 10)
>>> print(energy)
[[1324 1812 917 553 185 743 358 877 1041 905]
[1407 663 359 383 339 1403 1511 1964 1797 1096]
[ 315 1431 565 786 544 1370 919 1617 1442 925]
[1710 698 246 1631 1374 1844 595 465 908 953]
[ 305 384 668 952 458 793 303 153 661 791]]
>>> Inds = np.argmax(lik, axis=1)
>>> print(Inds)
[2 1 8 6 7]
PROBLEM:
>>> # call energy[Inds] to get
# [917, 663, 1442, 1844, 153]
What is the correct way of accessing these energies?
You can select the value indexed by Inds in each row using 2D indexing, pairing Inds with a temporary array containing [0, 1, 2, ...] (generated with np.arange).
Here is an example:
energy[np.arange(len(Inds)), Inds]
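Spelled out against the arrays above, this picks element (0, Inds[0]), (1, Inds[1]), and so on; np.take_along_axis is an equivalent I would suggest if you prefer naming the axis explicitly (assuming NumPy >= 1.15):
rows = np.arange(len(Inds))  # [0, 1, 2, 3, 4], one row index per event
picked = energy[rows, Inds]
# equivalently:
picked = np.take_along_axis(energy, Inds[:, None], axis=1).ravel()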
Alright, so I have 3 numpy matrices:
m1 = [[ 3 2 2 ... 2 2 3]
[ 3 2 2 ... 3 3 2]
[500 501 502 ... 625 626 627]
...
[623 624 625 ... 748 749 750]
[624 625 626 ... 749 750 751]
[625 626 627 ... 750 751 752]]
m2 = [[ 3 2 500 ... 623 624 625]
[ 3 2 500 ... 623 624 625]
[ 2 3 500 ... 623 624 625]
...
[ 2 2 500 ... 623 624 625]
[ 2 2 500 ... 623 624 625]
[ 3 2 500 ... 623 624 625]]
m3 = [[ 813 827 160500 ... 199983 200304 200625]
[ 830 843 164000 ... 204344 204672 205000]
[ 181317 185400 36064000 ... 44935744 45007872 45080000]
...
[ 221046 225867 43936000 ... 54744256 54832128 54920000]
[ 221369 226196 44000000 ... 54824000 54912000 55000000]
[ 221692 226525 44064000 ... 54903744 54991872 55080000]]
m1, m2 and m3 are very large square matrices (those examples are 128x128, but they can go up to 2048x2048). Also m1*m2=m3.
My goal is to obtain m2 by using only m1 and m3. Someone told me this was possible, as m1*m2=m3 implies that (m1**-1) * m3 = m2 (I believe it was that; please correct me if I'm wrong), so I calculated the inverse of m1:
m1**-1 = [[ 7.70884284e-01 -8.13188394e-01 -1.65131146e+13 ... -2.49697170e+12
-7.70160676e+12 -4.13395320e+13]
[-3.38144598e-01 2.54532610e-01 1.01286404e+13 ... -3.64296085e+11
2.60327813e+12 2.41783491e+13]
[ 1.77721050e-01 -3.54566231e-01 -5.00564604e+12 ... 5.82415184e+10
-5.98354744e+11 -1.29817153e+13]
...
[-6.56772812e-02 1.54498025e-01 3.21826474e+12 ... 2.61432526e+11
1.14203762e+12 3.61036457e+12]
[ 5.82732587e-03 -3.44252762e-02 -4.79430664e+11 ... 5.10855381e+11
-1.07679881e+11 -1.71485373e+12]
[ 6.55360708e-02 -8.24446025e-02 -1.19618881e+12 ... 4.45713678e+11
-3.48073716e+11 -4.89344092e+12]]
The result looked rather messy, so I ran a test and multiplied m1**-1 and m1 to see if it worked:
(m1**-1)*m1 = [[-125.296875 , -117.34375 , -117.390625 , ..., -139.15625 ,
-155.203125 , -147.25 ],
[ 483.1640625 , 483.953125 , 482.7421875 , ..., 603.796875 ,
590.5859375 , 593.375 ],
[-523.22851562, -522.36328125, -523.49804688, ..., -633.07421875,
-635.20898438, -637.34375 ],
...,
[ 10.58691406, 11.68945312, 10.29199219, ..., 14.40429688,
13.00683594, 11.609375 ],
[ -5.32177734, -5.47949219, -4.63720703, ..., -5.28613281,
-5.31884766, -5.6015625 ],
[ -4.93554688, -3.58984375, -3.24414062, ..., -8.72265625,
-5.37695312, -8.03125 ]]
The result is different from the expected one (the identity matrix). My guess is that m1 is too big, causing numerical imprecision. But if that previous calculation to get an identity matrix doesn't work properly, then (m1**-1)*m3 surely won't (and it doesn't).
But I really can't decrease the matrix sizes for m1, m2 and m3; in fact I'd like it to work with even bigger sizes (as said before, the max size would be 2048x2048).
Would there be any way to be more precise with such calculations? Is there an alternative that could work for bigger matrices?
You are right: inverting a large matrix can be inefficient and numerically unstable. Luckily, there are methods in linear algebra that solve this problem without requiring an inverse.
In this case, m2 = np.linalg.solve(m1, m3) works.
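A minimal sketch, with a residual check bolted on (my addition, just to gauge the numerical error):
import numpy as np

# solve m1 @ m2 = m3 for m2 without ever forming the explicit inverse
m2 = np.linalg.solve(m1, m3)

# sanity check: relative size of the residual m1 @ m2 - m3
print(np.linalg.norm(m1 @ m2 - m3) / np.linalg.norm(m3))
If m1 is badly conditioned, which the failed identity test hints at, even solve will lose digits; np.linalg.cond(m1) quantifies how bad it is.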
I am running a for loop to store data in a numpy array. The problem is that after every iteration the previous data is overwritten by the latest one. I want to be able to store all the data by using some "extend" function, as used for simple lists. I tried append, but it's not storing all the values of all the arrays.
code
data1 = np.empty((3, 3), dtype=np.int8)
top_model_weights_path = '/home/ethnicity.071217.23-0.28.hdf5'
df = pd.read_csv('/home/instaurls.csv')
for row in df.itertuples():
    data = io.imread(row[1])
    data1 = np.append(data1, data)
print(data1)
expected output
[[[ 34 34 34]
[ 35 35 35]
[ 40 40 40]
...,
[ 8 8 8]
[ 12 12 12]
[ 12 12 12]]
[[ 39 39 39]
[ 30 30 30]
[ 25 25 25]
...,
[ 11 11 11]
[ 1 1 1]
[ 5 5 5]]
[[ 54 54 54]
[ 44 44 44]
[ 34 34 34]
...,
[ 32 32 32]
[ 9 9 9]
[ 0 0 0]]
...,
[[212 212 210]
[167 167 165]
[118 118 116]
...,
[185 186 181]
[176 177 172]
[170 171 166]]
[[220 220 218]
[165 165 163]
[116 116 114]
...,
[158 159 154]
[156 157 152]
[170 171 166]]
[[220 220 218]
[154 154 152]
[106 106 104]
...,
[144 145 140]
[136 137 132]
[158 159 154]]]
top_model_weights_path = '/home/ethnicity.071217.23-0.28.hdf5'
df = pd.read_csv('/home/instaurls.csv')
data1 = np.array([io.imread(row[1]) for row in df.itertuples()])
If your dataset is not too big, there is no problem with using a standard list first and then converting it to a numpy array, I guess.
If you're not familiar with list comprehensions like the one-liner above, here is the explicit version:
data1 = []
for row in df.itertuples():
    data1.append(io.imread(row[1]))
data1 = np.array(data1)
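One caveat as an aside: np.array only yields a clean 3D array when every image has the same shape. np.stack does the same stacking but raises a clear ValueError on a shape mismatch, instead of silently producing an object array as older NumPy versions would:
data1 = np.stack([io.imread(row[1]) for row in df.itertuples()])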
I have a dataframe X. I want to convert it into a 1D array with only 5 elements. One way of doing that is converting the inner arrays to lists. How can I do that?
0 1 2 3 4 5
0 1622 95 1717 85.278544 1138.964373 1053.685830
1 62 328 390 75.613900 722.588235 646.974336
2 102 708 810 75.613900 800.916667 725.302767
3 102 862 964 75.613900 725.870370 650.256471
4 129 1380 1509 75.613900 783.711111 708.097211
val = X.values gives a numpy array. I want to convert the inner elements of the array to lists. How can I do that?
I tried this, but it failed:
M = val.values.tolist()
A = np.array(M,dtype=list)
N = np.array(M,dtype=object)
Here's one approach to have each row as one list to give us a 1D array of lists -
In [231]: df
Out[231]:
0 1 2 3 4 5
0 1622 95 1717 85.278544 1138.964373 1053.685830
1 62 328 390 75.613900 722.588235 646.974336
2 102 708 810 75.613900 800.916667 725.302767
3 102 862 964 75.613900 725.870370 650.256471
4 129 1380 1509 75.613900 783.711111 708.097211
In [232]: out = np.empty(df.shape[0], dtype=object)
In [233]: out[:] = df.values.tolist()
In [234]: out
Out[234]:
array([list([1622.0, 95.0, 1717.0, 85.278544, 1138.964373, 1053.6858300000001]),
list([62.0, 328.0, 390.0, 75.6139, 722.5882349999999, 646.974336]),
list([102.0, 708.0, 810.0, 75.6139, 800.916667, 725.302767]),
list([102.0, 862.0, 964.0, 75.6139, 725.87037, 650.256471]),
list([129.0, 1380.0, 1509.0, 75.6139, 783.7111110000001, 708.097211])], dtype=object)
In [235]: out.shape
Out[235]: (5,)
In [236]: out.ndim
Out[236]: 1
Have you tried using df.values (formerly df.as_matrix(), which newer pandas versions have removed) and then joining the rows?
EDIT:
Example:
L = []
for m in df.values.tolist():
    L += m
If it has only one column, you can try this:
op_col = []
for i in df_name['Column_name']:
    op_col.append(i)
print(op_col)
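For what it's worth, the same single-column result comes from pandas' own tolist(), using the same (hypothetical) names:
op_col = df_name['Column_name'].tolist()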
My data file is Tab separated and looks like this:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
... ... .. .........
I imported them into Python using numpy; here is my script:
from numpy import loadtxt
np_data = loadtxt('u.data', delimiter='\t', skiprows=0)
print(np_data)
I just want to print it to see the result, but it gives me a different format:
[[ 1.96000000e+02 2.42000000e+02 3.00000000e+00 8.81250949e+08]
[ 1.86000000e+02 3.02000000e+02 3.00000000e+00 8.91717742e+08]
[ 2.20000000e+01 3.77000000e+02 1.00000000e+00 8.78887116e+08]
...,
[ 2.76000000e+02 1.09000000e+03 1.00000000e+00 8.74795795e+08]
[ 1.30000000e+01 2.25000000e+02 2.00000000e+00 8.82399156e+08]
[ 1.20000000e+01 2.03000000e+02 3.00000000e+00 8.79959583e+08]]
There is a decimal point in every number printed by print(np_data). How do I format them to look like my original data file?
I've solved this; it turns out I missed the dtype argument, so the script should look like this:
from numpy import loadtxt
np_data = loadtxt('u.data', dtype=int, delimiter='\t', skiprows=0)
print(np_data)
and done
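With dtype=int the values now print without the trailing point; for the sample rows above the output would look along these lines (a sketch, not verbatim):
[[      196       242         3 881250949]
 [      186       302         3 891717742]
 [       22       377         1 878887116]
 ...]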