Extract left and right limit from a Series of pandas Intervals - python

I want to extract the margins of a column of pandas Intervals and write them into columns 'left' and 'right'. iterrows does not work (the documentation says it should not be used for writing data) and, in any case, it would not be the best solution.
import pandas as pd
i1 = pd.Interval(left=85, right=94)
i2 = pd.Interval(left=95, right=104)
i3 = pd.Interval(left=105, right=114)
i4 = pd.Interval(left=115, right=124)
i5 = pd.Interval(left=125, right=134)
i6 = pd.Interval(left=135, right=144)
i7 = pd.Interval(left=145, right=154)
i8 = pd.Interval(left=155, right=164)
i9 = pd.Interval(left=165, right=174)
data = pd.DataFrame(
    {
        "intervals": [i1, i2, i3, i4, i5, i6, i7, i8, i9],
        "left": [0, 0, 0, 0, 0, 0, 0, 0, 0],
        "right": [0, 0, 0, 0, 0, 0, 0, 0, 0]
    },
    index=[0, 1, 2, 3, 4, 5, 6, 7, 8]
)
# this is not working (has no effect):
for index, row in data.iterrows():
    print(row.intervals.left, row.intervals.right)
    row.left = row.intervals.left
    row.right = row.intervals.right
How can we do something like:
data['left']=data['intervals'].left
data['right']=data['intervals'].right
Thanks!

Create a pandas.IntervalIndex from your intervals. You can then access the .left and .right attributes.
import pandas as pd
idx = pd.IntervalIndex([i1, i2, i3, i4, i5, i6, i7, i8, i9])
pd.DataFrame({'intervals': idx, 'left': idx.left, 'right': idx.right})
intervals left right
0 (85, 94] 85 94
1 (95, 104] 95 104
2 (105, 114] 105 114
3 (115, 124] 115 124
4 (125, 134] 125 134
5 (135, 144] 135 144
6 (145, 154] 145 154
7 (155, 164] 155 164
8 (165, 174] 165 174
Another option is using map and operator.attrgetter (look ma, no lambda...):
from operator import attrgetter
df['left'] = df['intervals'].map(attrgetter('left'))
df['right'] = df['intervals'].map(attrgetter('right'))
df
intervals left right
0 (85, 94] 85 94
1 (95, 104] 95 104
2 (105, 114] 105 114
3 (115, 124] 115 124
4 (125, 134] 125 134
5 (135, 144] 135 144
6 (145, 154] 145 154
7 (155, 164] 155 164
8 (165, 174] 165 174

A pandas.arrays.IntervalArray is the preferred way to store interval data in Series-like structures.
For @coldspeed's first example, IntervalArray is basically a drop-in replacement:
In [2]: pd.__version__
Out[2]: '1.1.3'
In [3]: ia = pd.arrays.IntervalArray([i1, i2, i3, i4, i5, i6, i7, i8, i9])
In [4]: df = pd.DataFrame({'intervals': ia, 'left': ia.left, 'right': ia.right})
In [5]: df
Out[5]:
intervals left right
0 (85, 94] 85 94
1 (95, 104] 95 104
2 (105, 114] 105 114
3 (115, 124] 115 124
4 (125, 134] 125 134
5 (135, 144] 135 144
6 (145, 154] 145 154
7 (155, 164] 155 164
8 (165, 174] 165 174
If you already have interval data in a Series or DataFrame, @coldspeed's second example becomes a bit simpler by accessing the array attribute:
In [6]: df = pd.DataFrame({'intervals': ia})
In [7]: df['left'] = df['intervals'].array.left
In [8]: df['right'] = df['intervals'].array.right
In [9]: df
Out[9]:
intervals left right
0 (85, 94] 85 94
1 (95, 104] 95 104
2 (105, 114] 105 114
3 (115, 124] 115 124
4 (125, 134] 125 134
5 (135, 144] 135 144
6 (145, 154] 145 154
7 (155, 164] 155 164
8 (165, 174] 165 174

A simple way is to use the apply() method:
data['left'] = data['intervals'].apply(lambda x: x.left)
data['right'] = data['intervals'].apply(lambda x: x.right)
data
intervals left right
0 (85, 94] 85 94
1 (95, 104] 95 104
...
8 (165, 174] 165 174

Related

Find indices for max values subarrays and applying it on that subarray

I have a file f which holds N (unknown) events. Each event carries an unknown number of reconstructed tracks (different for each event; call it i, j, etc.). Each track has properties like energy E and likelihood lik. So,
>>> print(f.events.tracks.lik)
[[lik1, lik2, ..., likX], [lik1, lik2, ..., likj], ..., [lik1, lik2, ..., likz]]
prints an array holding N subarrays (1 per event), each presenting the lik for all its tracks.
GOAL: call f.events.tracks[:, Inds].E to get the energies for the tracks with max likelihood.
Minimal code example
>>> import numpy as np
>>> lik = np.random.randint(low=0, high=100, size=50).reshape(5, 10)
>>> print(lik)
[[ 3 49 27 3 80 59 96 99 84 34]
[88 62 61 83 90 9 62 30 92 80]
[ 5 21 69 40 2 40 13 63 42 46]
[ 0 55 71 67 63 49 29 7 21 7]
[40 7 68 46 95 34 74 88 79 15]]
>>> energy = np.random.randint(low=100, high=2000, size=50).reshape(5, 10)
>>> print(energy)
[[1324 1812 917 553 185 743 358 877 1041 905]
[1407 663 359 383 339 1403 1511 1964 1797 1096]
[ 315 1431 565 786 544 1370 919 1617 1442 925]
[1710 698 246 1631 1374 1844 595 465 908 953]
[ 305 384 668 952 458 793 303 153 661 791]]
>>> Inds = np.argmax(lik, axis=1)
>>> print(Inds)
[2 1 8 6 7]
PROBLEM:
>>> # call energy[Inds] to get
# [917, 663, 1442, 1844, 153]
What is the correct way of accessing these energies?
You can select the value indexed by Inds in each row using 2-D integer indexing, pairing Inds with a temporary array of row positions [0, 1, 2, ...] (generated with np.arange).
Here is an example:
energy[np.arange(len(Inds)), Inds]
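An equivalent approach, sketched below with small fixed arrays (the values are made up for illustration), uses np.take_along_axis, available since NumPy 1.15:

```python
import numpy as np

lik = np.array([[3, 49, 27], [88, 62, 61], [5, 21, 69]])
energy = np.array([[100, 200, 300], [400, 500, 600], [700, 800, 900]])

inds = np.argmax(lik, axis=1)  # index of the max likelihood in each row
# take_along_axis expects the index array to have the same ndim as energy,
# so add a trailing axis, then flatten the (3, 1) result back to 1-D
picked = np.take_along_axis(energy, inds[:, None], axis=1).ravel()
print(picked)  # -> [200 400 900]
```

This avoids building the np.arange helper by hand, at the cost of the explicit reshape of the index array.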

Bad display with pyplot, image too dark

I am working on an image processing problem.
I created a function that applies salt-and-pepper noise to an image.
Here is the function:
def sp_noise(image, prob):
    res = np.zeros(image.shape, np.uint8)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            rdn = random.random()
            if rdn < prob:
                rdn2 = random.random()
                if rdn2 < 0.5:
                    res[i][j] = 0
                else:
                    res[i][j] = 255
            else:
                res[i][j] = image[i][j]
    return res
Problems happen when I want to display the result.
wood = loadPNGFile('wood.jpeg', rgb=False)
woodSP = sp_noise(wood, 0.01)
plt.subplot(1, 2, 1)
plt.imshow(wood, 'gray')
plt.title("Wood")
plt.subplot(1, 2, 2)
plt.imshow(woodSP, 'gray')
plt.title("Wood SP")
I can not post the image directly but here is the link:
The second picture is darker. But when I display the pixel values of the 2 images, the values are the same:
[[ 99 97 96 ... 118 90 70]
[110 110 103 ... 116 115 101]
[ 79 73 65 ... 96 121 121]
...
[ 79 62 46 ... 105 124 113]
[ 86 98 100 ... 114 119 99]
[ 96 95 95 ... 116 111 90]]
[[255 97 96 ... 118 90 70]
[110 110 103 ... 116 115 101]
[ 79 73 65 ... 96 121 121]
...
[ 79 62 46 ... 105 124 113]
[ 86 98 100 ... 114 119 99]
[ 96 95 95 ... 116 111 90]]
I also check the mean value:
117.79877369007804
117.81332616658703
Apparently the problem comes from the plt.imshow display, but I cannot find a solution.
Looking at the documentation of imshow, there are two optional parameters, vmin and vmax, which:
When using scalar data and no explicit norm, vmin and vmax define the
data range that the colormap covers. By default, the colormap covers
the complete value range of the supplied data. vmin, vmax are ignored
if the norm parameter is used.
So if no values are specified for these parameters, the luminosity range is based on the actual data values: the minimum value is mapped to black and the maximum to white. This is useful for visualization, but not for comparisons, as you found out. Just set vmin and vmax to appropriate values (probably 0 and 255).
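Applied to the situation above, a minimal self-contained sketch (using a small synthetic grayscale array in place of the loaded image, since loadPNGFile is not shown):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# small stand-in grayscale image, plus a copy with one white "salt" pixel,
# mimicking the salt-and-pepper output in the question
img = np.full((4, 4), 100, dtype=np.uint8)
noisy = img.copy()
noisy[0, 0] = 255

fig, (ax1, ax2) = plt.subplots(1, 2)
# pinning vmin/vmax puts both panels on the same luminosity scale,
# so identical gray values render identically in both images
im1 = ax1.imshow(img, cmap='gray', vmin=0, vmax=255)
im2 = ax2.imshow(noisy, cmap='gray', vmin=0, vmax=255)
```

Without vmin/vmax, the left panel would stretch its single value 100 across the whole colormap, while the right panel's 255 pixel would push all the 100s toward black, which is exactly the darkening observed.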

train_test_split converts string labels to np.array. Is there any way to get back the original label names?

I have an image dataset with string label names. When I split the data using sklearn's train_test_split, it seems to convert the labels to np.array type. Is there a way to get back the original string label names?
The code below splits the data into train and test sets:
imgs, y = load_images()
train_img,ytrain_img,test_img,ytest_img = train_test_split(imgs,y, test_size=0.2, random_state=1)
If I print y, it gives me the label name, but if I print the split label value it gives an array:
for k in y:
    print(k)
    break
for k in ytrain_img:
    print(k)
    break
Output:
001.Affenpinscher
[[[ 97 180 165]
[ 93 174 159]
[ 91 169 152]
...
[[ 88 171 156]
[ 88 170 152]
[ 84 162 145]
...
[130 209 222]
[142 220 233]
[152 230 243]]
[[ 99 181 163]
[ 98 178 161]
[ 92 167 151]
...
[130 212 224]
[137 216 229]
[143 222 235]]
...
[[ 85 147 158]
[ 85 147 158]
[111 173 184]
...
[227 237 244]
[236 248 250]
[234 248 247]]
[[ 94 154 166]
[ 96 156 168]
[133 194 204]
...
[226 238 244]
[237 249 253]
[237 252 254]]
...
[228 240 246]
[238 252 255]
[241 255 255]]]
Is there a way to convert back the array to the original label name?
No, you are interpreting the output of train_test_split incorrectly.
train_test_split works in this way:
A_train, A_test, B_train, B_test, C_train, C_test ...
= train_test_split(A, B, C ..., test_size=0.2)
You can give as many arrays to split as you like. For each given array, it returns the train split first, then the test split, before moving on to the next array, and so on.
So in your case actually it is:
train_img, test_img, ytrain_img, ytest_img = train_test_split(imgs, y,
test_size=0.2,
random_state=1)
But you are mixing up the names of the outputs and therefore using them incorrectly.
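A small sketch of the correct unpacking order, with tiny made-up stand-ins for the image arrays and string labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-ins: 5 "images" of 2 pixels each, and 5 string labels
imgs = np.arange(10).reshape(5, 2)
y = np.array(['a', 'b', 'c', 'd', 'e'])

# for each input array the train split comes first, then the test split:
# (imgs_train, imgs_test, y_train, y_test)
train_img, test_img, ytrain_img, ytest_img = train_test_split(
    imgs, y, test_size=0.2, random_state=1)

print(ytrain_img)  # string labels come back unchanged
```

With the names in this order, ytrain_img really does contain the string labels; in the question's code it silently received the test images instead.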

storing data in numpy array without overwriting previous data

I am running a for loop to store data in a NumPy array. The problem is that after every iteration, the previous data is overwritten by the latest one. I want to store all the data, using something like the "extend" method available for plain lists. I tried append, but it is not storing all the values of all the arrays.
Code:
data1 = np.empty((3, 3), dtype=np.int8)
top_model_weights_path = '/home/ethnicity.071217.23-0.28.hdf5'
df = pd.read_csv('/home/instaurls.csv')
for row in df.itertuples():
    data = io.imread(row[1])
    data1 = np.append(data1, data)
print(data1)
Expected output:
[[[ 34 34 34]
[ 35 35 35]
[ 40 40 40]
...,
[ 8 8 8]
[ 12 12 12]
[ 12 12 12]]
[[ 39 39 39]
[ 30 30 30]
[ 25 25 25]
...,
[ 11 11 11]
[ 1 1 1]
[ 5 5 5]]
[[ 54 54 54]
[ 44 44 44]
[ 34 34 34]
...,
[ 32 32 32]
[ 9 9 9]
[ 0 0 0]]
...,
[[212 212 210]
[167 167 165]
[118 118 116]
...,
[185 186 181]
[176 177 172]
[170 171 166]]
[[220 220 218]
[165 165 163]
[116 116 114]
...,
[158 159 154]
[156 157 152]
[170 171 166]]
[[220 220 218]
[154 154 152]
[106 106 104]
...,
[144 145 140]
[136 137 132]
[158 159 154]]]
top_model_weights_path = '/home/ethnicity.071217.23-0.28.hdf5'
df = pd.read_csv('/home/instaurls.csv')
data1 = np.array([io.imread(row[1]) for row in df.itertuples()])
If your dataset is not too big, there is no problem with using a standard list first and then converting it to a NumPy array, I guess.
If you're not familiar with list comprehensions, here is the equivalent explicit loop:
data1 = []
for row in df.itertuples():
    data1.append(io.imread(row[1]))
data1 = np.array(data1)
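To see why the original loop loses data (it actually flattens it rather than overwriting it), here is a minimal demonstration with tiny made-up arrays; it also shows np.stack as an alternative to the list-then-convert approach:

```python
import numpy as np

# np.append without an axis argument flattens everything to 1-D, so the
# image shapes are lost; worse, np.empty starts with 9 uninitialized values
data1 = np.empty((3, 3), dtype=np.int8)
img = np.ones((2, 2), dtype=np.int8)
appended = np.append(data1, img)
print(appended.shape)   # (13,) -- 9 garbage values followed by the image

# collecting images in a list and stacking keeps each image intact
imgs = [np.ones((2, 2), dtype=np.int8) for _ in range(3)]
stacked = np.stack(imgs)
print(stacked.shape)    # (3, 2, 2)
```

np.stack requires all images to have the same shape, which is also what the list-comprehension answer above implicitly assumes.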

Can't reshape to right shape size

My original dataset is 7049 images (96x96) in the following format:
train_x.shape= (7049,)
train_x[:3]
0 238 236 237 238 240 240 239 241 241 243 240 23...
1 219 215 204 196 204 211 212 200 180 168 178 19...
2 144 142 159 180 188 188 184 180 167 132 84 59 ...
Name: Image, dtype: object
I want to split each image string into a 96x96 array and get a (7049, 96, 96) array.
I tried this method:
def split_reshape(row):
    return np.array(row.split(' ')).reshape(96, 96)

result = train_x.apply(split_reshape)
Then I still got result.shape=(7049,)
How to reshape to (7049,96,96) ?
Demo:
Source Series:
In [129]: train_X
Out[129]:
0 238 236 237 238 240 240 239 241 241
1 219 215 204 196 204 211 212 200 180
2 144 142 159 180 188 188 184 180 167
Name: 1, dtype: object
In [130]: type(train_X)
Out[130]: pandas.core.series.Series
In [131]: train_X.shape
Out[131]: (3,)
Solution:
In [132]: X = train_X.str \
.split(expand=True) \
.astype(np.int16) \
.values.reshape(len(train_X), 3, 3)
In [133]: X
Out[133]:
array([[[238, 236, 237],
[238, 240, 240],
[239, 241, 241]],
[[219, 215, 204],
[196, 204, 211],
[212, 200, 180]],
[[144, 142, 159],
[180, 188, 188],
[184, 180, 167]]], dtype=int16)
In [134]: X.shape
Out[134]: (3, 3, 3)
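The original apply approach can also be made to work: apply returns a Series of 2-D arrays (hence shape (7049,)), so stacking those arrays yields the 3-D result. A sketch with 3x3 stand-ins for the 96x96 image strings:

```python
import numpy as np
import pandas as pd

# two 3x3 stand-ins for the 96x96 image strings
train_x = pd.Series([
    '238 236 237 238 240 240 239 241 241',
    '219 215 204 196 204 211 212 200 180',
])

def split_reshape(row):
    # row.split(' ') yields strings, so convert the dtype explicitly
    return np.array(row.split(' '), dtype=np.int16).reshape(3, 3)

# apply() gives a Series of 2-D arrays; np.stack combines them into
# a single (n_rows, 3, 3) array
X = np.stack(train_x.apply(split_reshape).to_list())
print(X.shape)  # (2, 3, 3)
```

The str.split(expand=True) solution above is more idiomatic for large data, but this shows where the missing step in the question's attempt was.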
