I have a dataframe df that contains the distances between all the points (IDs) in my system. So the df looks like the following:
df
radius ID1 ID2 x1 y1 x2 y2
0 0.454244 100 103 103.668919 1.335309 103.671812 1.332424
1 1.016734 100 123 103.668919 1.335309 103.677598 1.332424
2 0.643200 103 123 103.671812 1.332424 103.677598 1.332424
3 1.605608 100 124 103.668919 1.335309 103.677598 1.346851
4 1.728349 103 124 103.671812 1.332424 103.677598 1.346851
I want to compute the circle for each pair of points and then check which points are inside that circle. The coordinates of each point are in a separate dataframe, coordinates.
coordinates
ID x y
0 100 103.668919 1.335309
1 103 103.671812 1.332424
2 124 103.677598 1.346851
3 125 103.677598 1.349737
4 134 103.680491 1.341080
5 135 103.680491 1.343966
6 136 103.680491 1.346851
7 137 103.680491 1.349737
8 138 103.680491 1.352622
9 146 103.683384 1.341080
Here is the code:
from matplotlib.patches import Circle
for i in df.index:
    x = df.x1[i]
    y = df.y1[i]
    circ = Circle((x, y), radius=df.radius[i])
    ## it works until here: from now I need to understand what to do
    ## and in particular I need to find which points are inside the circle
    points = circ.contains_point([coordinates.x, coordinates.y])
which returns the error
ValueError: setting an array element with a sequence.
When I have issues like this, I always do a small sanity test:
from matplotlib.patches import Circle
circ = Circle((0, 0), radius = 1)
print(circ.contains_point([0.5,0.5]))
print(circ.contains_point([2,2]))
I get (as expected)
True
False
So coordinates.x and coordinates.y are whole columns rather than scalars, which explains the message.
contains_point works on a single point: a tuple or list of 2 scalars.
To generate your list, you could loop within a list comprehension and pass each point as one tuple:
points = [(x, y) for x, y in zip(coordinates.x, coordinates.y) if circ.contains_point((x, y))]
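If only the inside/outside test is needed, the loop can also be avoided by comparing squared distances with NumPy. This is a minimal sketch rather than the matplotlib approach above: the helper name ids_inside_circle and the toy coordinates are hypothetical, and it assumes the radius is expressed in the same units as x and y (points exactly on the boundary count as inside here).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same columns as the coordinates dataframe.
coordinates = pd.DataFrame({
    "ID": [100, 103, 124],
    "x": [0.5, 2.0, -0.3],
    "y": [0.5, 2.0, 0.1],
})

def ids_inside_circle(coordinates, cx, cy, radius):
    """Return the IDs whose (x, y) lies within `radius` of the centre (cx, cy)."""
    dx = coordinates["x"].to_numpy() - cx
    dy = coordinates["y"].to_numpy() - cy
    # Compare squared distances to avoid taking a square root.
    inside = dx**2 + dy**2 <= radius**2
    return coordinates.loc[inside, "ID"].tolist()

inside_ids = ids_inside_circle(coordinates, 0.0, 0.0, 1.0)
print(inside_ids)
```

Calling the helper once per row of df (with x1, y1 and radius as arguments) would give the list of IDs inside each circle.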
The x data frame is information about departure and arrival, and the y data frame is latitude and longitude data for each location.
I try to calculate the distance between the origin and destination using the latitude and longitude data of start and end (e.g., start_x, start_y, end_x, end_y).
How can I connect x and y to bring the latitude data that fits each code into the x data frame?
The notation is somewhat confusing, but I kept the question's notation.
One way to do this is to merge your dataframes into a new one, like so:
Dummy dataframes:
import pandas as pd
x=[300,500,300,600,700]
y=[400,400,700,700,400]
code=[300,400,500,600,700]
start=[100,101,102,103,104]
end=[110,111,112,113,114]
x={"x":x, "y":y}
y={"code":code, "start":start, "end":end}
x=pd.DataFrame(x)
y=pd.DataFrame(y)
This gives:
x
     x    y
0  300  400
1  500  400
2  300  700
3  600  700
4  700  400
y
   code  start  end
0   300    100  110
1   400    101  111
2   500    102  112
3   600    103  113
4   700    104  114
Solution:
df = pd.merge(x,y,left_on="x",right_on="code").drop("code",axis=1)
df
     x    y  start  end
0  300  400    100  110
1  300  700    100  110
2  500  400    102  112
3  600  700    103  113
4  700  400    104  114
df = df.merge(y,left_on="y",right_on="code").drop("code",axis=1)
df
     x    y  start_x  end_x  start_y  end_y
0  300  400      100    110      101    111
1  500  400      102    112      101    111
2  700  400      104    114      101    111
3  300  700      100    110      104    114
4  600  700      103    113      104    114
Quick explanation:
The line df = pd.merge(...) creates the new dataframe by merging the left dataframe (x) on its "x" column with the right dataframe (y) on its "code" column. The second line, df = df.merge(...), takes the existing df as the left dataframe and matches its "y" column against the "code" column of the y dataframe.
The .drop("code",axis=1) drops the unwanted "code" column that results from each merge.
The _x and _y suffixes are added automatically when the merged dataframes share column names (here start and end in the second merge). To control them, pass the suffixes=(...) option to the second merge. In this case the defaults work well: the _x columns come from the lookup on column x and the _y columns from the lookup on column y.
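As an illustration of the suffixes option, here is a sketch of the same two merges with explicit suffixes. The suffix names "_origin" and "_dest" are hypothetical choices for this example, not something taken from the question.

```python
import pandas as pd

# Same dummy dataframes as above.
x = pd.DataFrame({"x": [300, 500, 300, 600, 700],
                  "y": [400, 400, 700, 700, 400]})
y = pd.DataFrame({"code": [300, 400, 500, 600, 700],
                  "start": [100, 101, 102, 103, 104],
                  "end": [110, 111, 112, 113, 114]})

# First lookup: match column "x" against "code".
first = x.merge(y, left_on="x", right_on="code").drop("code", axis=1)

# Second lookup: match column "y" against "code"; the overlapping
# "start" and "end" columns get the explicit suffixes instead of _x/_y.
merged = first.merge(
    y, left_on="y", right_on="code", suffixes=("_origin", "_dest")
).drop("code", axis=1)

print(merged.columns.tolist())
```

Explicit suffixes mainly help readability when the column names x and y already mean something else, as they do here.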
I have a labeled image of detected particles and a dataframe with the corresponding area of each labeled particle. What I want to do is filter out every particle on the image with an area smaller than a specified value.
I got it working with the example below, but I know there must be a smarter and especially faster way.
For example skipping the loop by comparing the image with the array.
Thanks for your help!
Example:
labels = df["label"][df.area > 5000].to_numpy()
mask = np.zeros(labeled_image.shape)
for label in labels:
    mask[labeled_image == label] = 1
Dataframe:
label centroid-0 centroid-1 area
0 1 15 3681 191
1 2 13 1345 390
2 3 43 3746 885
3 4 32 3616 817
4 5 20 4250 137
... ... ... ...
3827 3828 4149 1620 130
3828 3829 4151 852 62
3829 3830 4155 330 236
3830 3831 4157 530 377
3831 3832 4159 3975 81
You can use isin to check equality to several labels. The resulting boolean array can be directly used as the mask after casting to the required type (e.g. int):
labels = df.loc[df.area.gt(5000), 'label']
mask = np.isin(labeled_image, labels).astype(int)
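To see the effect on a small case, here is a self-contained sketch with a hypothetical 3x3 labeled image and made-up areas; only labels 1 and 3 exceed the threshold, so only their pixels end up set in the mask.

```python
import numpy as np
import pandas as pd

# Hypothetical toy inputs: 0 is background, 1..3 are particle labels.
labeled_image = np.array([[0, 1, 1],
                          [2, 2, 0],
                          [3, 0, 3]])
df = pd.DataFrame({"label": [1, 2, 3], "area": [6000, 100, 9000]})

# Labels of the particles that are large enough to keep.
labels = df.loc[df["area"].gt(5000), "label"].to_numpy()

# True wherever the pixel's label is one of the kept labels.
mask = np.isin(labeled_image, labels).astype(int)
print(mask)
```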
Let's say I have data like this and I want to group it by feature and type.
feature type size
Alabama 1 100
Alabama 2 50
Alabama 3 40
Wyoming 1 180
Wyoming 2 150
Wyoming 3 56
When I apply df=df.groupby(['feature','type']).sum()[['size']], I get this as expected.
size
(Alabama,1) 100
(Alabama,2) 50
(Alabama,3) 40
(Wyoming,1) 180
(Wyoming,2) 150
(Wyoming,3) 56
However, I want to sum the sizes over rows with the same type only, not over both type and feature. While doing this, I want to keep the (feature, type) tuples as the index. I mean I want to get something like this:
size
(Alabama,1) 280
(Alabama,2) 200
(Alabama,3) 96
(Wyoming,1) 280
(Wyoming,2) 200
(Wyoming,3) 96
I am stuck trying to find a way to do this. I need some help thanks
Use set_index to create a MultiIndex, then use transform with sum, which returns a Series of the same length as the original, filled with the aggregated values:
df = df.set_index(['feature','type'])
df['size'] = df.groupby(['type'])['size'].transform('sum')
print (df)
size
feature type
Alabama 1 280
2 200
3 96
Wyoming 1 280
2 200
3 96
EDIT: First aggregate both columns and then use transform
df = df.groupby(['feature','type']).sum()
df['size'] = df.groupby(['type'])['size'].transform('sum')
print (df)
size
feature type
Alabama 1 280
2 200
3 96
Wyoming 1 280
2 200
3 96
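For reference, the aggregate-then-transform idea can be checked end to end on the question's data. This is a sketch; the variable names per_pair and totals are only illustrative, and it groups on the index level "type" rather than on a column.

```python
import pandas as pd

df = pd.DataFrame({
    "feature": ["Alabama"] * 3 + ["Wyoming"] * 3,
    "type": [1, 2, 3, 1, 2, 3],
    "size": [100, 50, 40, 180, 150, 56],
})

# Step 1: one value per (feature, type) pair, indexed by a MultiIndex.
per_pair = df.groupby(["feature", "type"])["size"].sum()

# Step 2: replace each value by the total of its type across all features,
# keeping the (feature, type) index unchanged.
totals = per_pair.groupby(level="type").transform("sum")
print(totals)
```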
Here is one way:
df['size_type'] = df['type'].map(df.groupby('type')['size'].sum())
df.groupby(['feature', 'type'])['size_type'].sum()
# feature type
# Alabama 1 280
# 2 200
# 3 96
# Wyoming 1 280
# 2 200
# 3 96
# Name: size_type, dtype: int64
I am trying to translate the input dataframe (inp_df) into the output dataframe (out_df) using the data from the cell-based intermediate dataframe (matrix_df), as shown below.
There are several files, one per cell number, with the distance values shown in matrix_df.
The program iterates by cell and fetches data from the appropriate file, so each time matrix_df holds the data for all rows of inp_df that belong to the cell currently being processed.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell's data is:
use an apply function to go row by row
then join inp_df with matrix_df on column B, where matrix_df is somehow translated into tuples of column name, distance and average distance.
But I am looking for an idiomatic pandas way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside an iteration to fetch the matches, since the number of columns in matrix_df varies from cell to cell.
If it is any help, the matrix files are the distance outputs from sklearn.metrics.pairwise.pairwise_distances.
NB: In inp_df the values of column B are unique, and the values of column A may or may not be unique.
Also, the first column of matrix_df had no name, and I renamed it with the following code to make it easier to understand, since it was a header-less matrix output file.
dist_df = pd.read_csv(mypath,index_col=False)
dist_df.rename(columns={'Unnamed: 0':'B'}, inplace=True)
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47
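Since df.apply with axis=1 runs a Python function per row, it may become slow for millions of rows. One possible alternative, sketched below, is to reshape each matrix to long form with melt and then merge on both A and B. The names matrix_1, matrix_2 and to_long are hypothetical, and the sketch assumes the matrix column labels are strings that parse as integers, as in the sample files.

```python
import pandas as pd

inp_df = pd.DataFrame({"A": [100, 115, 145, 115],
                       "B": [200, 270, 255, 266],
                       "cell": [1, 1, 2, 1]})

# Stand-ins for cell_1.csv and cell_2.csv after renaming the first column to "B".
matrix_1 = pd.DataFrame({"B": [200, 270, 266],
                         "100": [7.5, 6.8, 58.0],
                         "115": [80.7, 53.0, 84.0],
                         "199": [67.8, 92.0, 31.0],
                         "avg_distance": [52.0, 50.0, 57.0]})
matrix_2 = pd.DataFrame({"B": [255], "145": [74.9], "121": [77.53],
                         "166": [8.0], "avg_distance": [53.47]})

def to_long(matrix_df):
    """Turn one wide distance matrix into rows of (B, avg_distance, A, distance)."""
    long_df = matrix_df.melt(id_vars=["B", "avg_distance"],
                             var_name="A", value_name="distance")
    long_df["A"] = long_df["A"].astype(int)  # column labels arrive as strings
    return long_df

long_df = pd.concat([to_long(matrix_1), to_long(matrix_2)], ignore_index=True)

# A left merge keeps the row order of inp_df.
out_df = inp_df.merge(long_df, on=["A", "B"], how="left")
out_df = out_df[["A", "B", "cell", "distance", "avg_distance"]]
print(out_df)
```

The long form has one row per matrix entry, so it uses more memory than the wide matrices; whether that trade-off pays off depends on the matrix sizes.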
I have 100 different graphs; they look like this:
I need to superimpose all of them and then smooth the result.
I tried this:
from PIL import Image
first = Image.open("test1.png")
second = Image.open("test2.png")
first.paste(second, (0, 0), second)
first.show()
But how can I do this for 100 graphs? And how can I smooth the result?
The first 10 steps in the dataframe look like this:
       active nodes
graph
0              1024
1               598
2               349
3               706
4               541
5               623
6               576
7               614
8               578
9               613
10              595
Do you have it only as an image, or do you also have the data behind the graph?
If you have the data, the easiest way to smooth it is a moving average computed by convolution:
import numpy as np
n = 100  # window length
smoothed_data = np.convolve(data, np.ones(n) / n, 'same')
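As a small worked example, here is the same moving average applied to the eleven values shown in the question, with a window of 3 instead of 100 because the sample is so short. The window size is an arbitrary choice for illustration.

```python
import numpy as np

# "active nodes" values for steps 0..10 from the question.
data = np.array([1024, 598, 349, 706, 541, 623, 576, 614, 578, 613, 595])

n = 3  # window length, chosen small for this short sample
kernel = np.ones(n) / n  # equal weights that sum to 1

# mode='same' returns an array as long as the input; near both ends the
# window hangs over the edge and the missing values are treated as zeros.
smoothed = np.convolve(data, kernel, mode="same")
print(smoothed.round(1))
```

Because of that zero padding, the first and last few smoothed values are biased low; mode='valid' avoids this at the cost of a shorter output.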