Extract data-set with a given range from a larger dataset - python

I have a 2D data set of (X, Y) values like this:

X        Y
99.96    2
99.76    4
100.15   6
100.28   0
100.66   11
101.17   14
102.36   4
I wish to extract the part of the above data set where 100.00 <= X <= 100.99, together with the corresponding Y values.
So the output generated would be:

X        Y
100.15   6
100.28   0
100.66   11
Can anybody please let me know how to go about doing this in Python?

You can create a data frame from your data using pandas and filter it using between.
You can use pd.read_csv, pd.read_excel, pd.DataFrame.from_dict, etc. to easily load your source data.
import pandas as pd
# example pd read csv
# df = pd.read_csv('somefile.csv', header=0)
df = pd.DataFrame([[1,2],[3,4],[5,6],[2,3],[4,5]], columns=['a','b'])
print(df[df['a'].between(2, 4)])
#    a  b
# 1  3  4
# 3  2  3
# 4  4  5
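Applied to the data from the question (a small sketch; between includes both endpoints by default, matching 100.00 <= X <= 100.99):
import pandas as pd

# the question's data as a DataFrame
df = pd.DataFrame([[99.96, 2], [99.76, 4], [100.15, 6], [100.28, 0],
                   [100.66, 11], [101.17, 14], [102.36, 4]], columns=['X', 'Y'])

# between(left, right) is inclusive of both endpoints by default
print(df[df['X'].between(100.00, 100.99)])
#         X   Y
# 2  100.15   6
# 3  100.28   0
# 4  100.66  11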

Maybe just a simple loop, without any 3rd party package?
If you need to save the result, substitute the print call with result.append(), as sketched after the code.
data = [[99.96, 2],
        [97, 4],
        [100.15, 6],
        [100.28, 0],
        [101.17, 14],
        [102.36, 11]]

for x, y in data:
    if 100.00 <= x <= 100.99:
        print(x, y)
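A minimal sketch of the result.append() variant mentioned above, using the same data:
result = []
for x, y in data:
    if 100.00 <= x <= 100.99:
        result.append([x, y])  # collect matching pairs instead of printing

print(result)  # [[100.15, 6], [100.28, 0]]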

If the given data is of type numpy.ndarray, then we can use np.where as follows:
import numpy as np

# Original data
data = np.array([[99.96, 2], [99.76, 4], [100.15, 6], [100.28, 0],
                 [100.66, 11], [101.17, 14], [102.36, 4]])
print("\n", "Original data=\n", data)

# Extracted data: rows with 100.00 <= X <= 100.99
data_extracted = data[np.where((data[:, 0] >= 100.00) & (data[:, 0] <= 100.99))]
print("\n", "Extracted data=\n", data_extracted)


How to select multiple rows from a (geo)pandas dataframe based on an array or propagate metadata of a clustering algorithm result?

I have a geopandas data frame that contains a polygon, a region_id, and a center point given as lat and lon in radians.
I then wanted to cluster the regions by their center points and did the following:
#Set up
kms_per_radian = 6371.0088
eps = 0.1 / kms_per_radian
coords = blocks_meta.as_matrix(columns=['lat', 'lon'])

#Cluster
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=eps, algorithm='ball_tree', metric='haversine', min_samples=1).fit(coords)
labels = db.labels_
clusters = pd.Series([coords[labels == n] for n in range(len(set(labels)))])
which yields an array of clusters of center points that looks like this:
array([[ 0.0703843 ,  0.170845  ],
       [ 0.07037922,  0.17084981],
       [ 0.07036705,  0.17085678],
       [ 0.0703715 ,  0.17083775]])
What I am struggling to figure out is how to get the region_ids associated with each cluster, so that I can merge the polygons into one bigger region without looping through each cluster and querying the dataframe for every lat, lon pair.
Is there a way of propagating the ids or querying the dataframe per cluster?
Any help here would be appreciated.
Thanks!
EDIT
What I want to avoid is doing this:
clusters_of_regions = []
for cluster in clusters:
    cluster_of_regions_ids = []
    for entry in cluster:
        print(cluster[0][0])
        region_id = blocks_meta.loc[blocks_meta['lat'] == cluster[0][0]]['region_id'][1]
        cluster_of_regions_ids.append(region_id)
    clusters_of_regions.append(cluster_of_regions_ids)
I want to avoid the nested for loop, and whenever I try this I keep getting a KeyError.
Is there a way to cluster on the regions themselves, using the center points as properties?
Thanks
Check the example from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html). I modified it here to use a dataframe and resemble your example.
from sklearn.cluster import DBSCAN
import pandas as pd
import numpy as np
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
df = pd.DataFrame(X, index=list(range(len(X))), columns = ['col1', 'col2'])
clustering = DBSCAN(eps = 3, min_samples = 2).fit(df)
labels = clustering.labels_
df = df.merge(pd.Series(labels).to_frame().rename(columns={0:'clusters'}), left_index = True, right_index = True, how = 'outer')
df
Gives you:
col1 col2 clusters
0 1 2 0
1 2 2 0
2 2 3 0
3 8 7 1
4 8 8 1
5 25 80 -1
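Since labels_ is row-aligned with the dataframe passed to fit(), a simpler way to attach the labels (a sketch of an alternative to the merge) is direct assignment:
# equivalent to the merge above: labels_ lines up with df's rows
df['clusters'] = labels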
According to the description:
labels_ : array, shape = [n_samples] Cluster labels for each point in
the dataset given to fit(). Noisy samples are given the label -1.
In the example, you get two groups (labels 0 and 1). The -1 is a 'noisy' sample, here that sample is clearly larger than the others.
If you do something similar to this, you can have your region_id and the cluster label next to each other and compare whether there is a 1:1 relation or not.
I think your groups are in your labels.
I think what you want is this (I am using labels = [1, 2, 3, 4]):
df1 = pd.DataFrame(ar)  # ar is your array of cluster center points from above
df1.loc[:, 'labels'] = pd.Series(labels)
df1
That will create a df like this one :
0 1 labels
0 0.070384 0.170845 1
1 0.070379 0.170850 2
2 0.070367 0.170857 3
3 0.070372 0.170838 4
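If the end goal is to merge the polygons per cluster, here is a minimal sketch, assuming blocks_meta is a GeoDataFrame with a 'geometry' column whose row order matches coords (so db.labels_ from the question aligns with its rows):
# db.labels_ is aligned with the rows of coords, which came from blocks_meta,
# so the labels can be attached directly as a new column
blocks_meta = blocks_meta.copy()
blocks_meta['cluster'] = db.labels_

# region_ids per cluster, without any nested loop
ids_per_cluster = blocks_meta.groupby('cluster')['region_id'].apply(list)

# geopandas dissolve unions the polygons of each cluster into one bigger region
merged_regions = blocks_meta.dissolve(by='cluster')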

How to plot data after groupby

I have a data frame similar to this
import pandas as pd
df = pd.DataFrame([['1','3','1','2','3','1','2','2','1','1'],
                   ['ONE','TWO','ONE','ONE','ONE','TWO','ONE','TWO','ONE','THREE']]).T
df.columns = ['age', 'data']
print(df) #printing dataframe.
I performed the groupby function on it to get the required output.
df['COUNTER'] =1 #initially, set that counter to 1.
group_data = df.groupby(['age','data'])['COUNTER'].sum() #sum function
print(group_data)
Now I want to plot the output using matplotlib. Please help me with it; I am not able to figure out how to start and what to do.
I want to plot using the counter value, something similar to a bar graph.
Try:
group_data = group_data.reset_index()
in order to get rid of the MultiIndex that groupby() has created for you.
Your print(group_data) will give you this:
In [24]: group_data = df.groupby(['age','data'])['COUNTER'].sum() #sum function
In [25]: print(group_data)
age data
1 ONE 3
THREE 1
TWO 1
2 ONE 2
TWO 1
3 ONE 1
TWO 1
Name: COUNTER, dtype: int64
Whereas resetting will 'simplify' the new index:
In [26]: group_data = group_data.reset_index()
In [27]: group_data
Out[27]:
age data COUNTER
0 1 ONE 3
1 1 THREE 1
2 1 TWO 1
3 2 ONE 2
4 2 TWO 1
5 3 ONE 1
6 3 TWO 1
Then depending on what it is exactly that you want to plot, you might want to take a look at the Matplotlib docs
EDIT
I now read more carefully that you want to create a 'bar' chart.
If that is the case then I would take a step back and not use reset_index() on the groupby result. Instead, try this:
In [46]: fig = group_data.plot.bar()
In [47]: fig.figure.show()
I hope this helps
Try with this:
# This is a great tool to add plots to a jupyter notebook
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

# Params to get a bigger plot
plt.rcParams["axes.labelsize"] = 16
plt.rcParams["xtick.labelsize"] = 14
plt.rcParams["ytick.labelsize"] = 14
plt.rcParams["legend.fontsize"] = 12
plt.rcParams["figure.figsize"] = [15, 7]

df = pd.DataFrame([['1','3','1','2','3','1','2','2','1','1'],
                   ['ONE','TWO','ONE','ONE','ONE','TWO','ONE','TWO','ONE','THREE']]).T
df.columns = ['age', 'data']
df['COUNTER'] = 1

ax = df.groupby(['age','data']).sum()[['COUNTER']].plot.bar(rot=90)  # rot rotates the x-axis labels
_ = ax.set(xlabel='xlabel', ylabel='ylabel'), ax.legend(['Legend'])  # add axis labels and a legend
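As a variation (a sketch, not from the original answers): if you want one group of bars per age with one colored bar per data value, unstack the groupby result before plotting:
# unstack moves 'data' from the row index to the columns, so plot.bar
# draws one bar group per age with one colored bar per data value
counts = df.groupby(['age', 'data'])['COUNTER'].sum().unstack(fill_value=0)
ax = counts.plot.bar(rot=0)
_ = ax.set(xlabel='age', ylabel='count')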

Using data from pythons pandas dataframes to sample from normal distributions

I'm trying to sample from a normal distribution using means and standard deviations that are stored in pandas DataFrames.
For example:
means= numpy.arange(10)
means=means.reshape(5,2)
produces:
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
and:
sts=numpy.arange(10,20)
sts=sts.reshape(5,2)
produces:
0 1
0 10 11
1 12 13
2 14 15
3 16 17
4 18 19
How would I produce another pandas dataframe with the same shape but with values sampled from the normal distribution using the corresponding means and standard deviations.
i.e. position 0,0 of this new dataframe would sample from a normal distribution with mean=0 and standard deviation=10, and so on.
My function so far:
def make_distributions(self):
    num_data_points, num_species = self.means.shape
    samples = []
    for i, j in zip(self.means, self.stds):
        for k, l in zip(self.means[i], self.stds[j]):
            samples.append(numpy.random.normal(k, l, self.n))
will sample from the distributions for me but I'm having difficulty putting the data back into the same shaped dataframe as the mean and standard deviation dfs. Does anybody have any suggestions as to how to do this?
Thanks in advance.
You can use numpy.random.normal to sample from a random normal distribution.
IIUC, then this might be easiest, taking advantage of broadcasting:
import numpy as np
np.random.seed(1) # only for demonstration
np.random.normal(means,sts)
array([[ 16.24345364, -5.72932055],
[ -4.33806103, -10.94859209],
[ 16.11570681, -29.52308045],
[ 33.91698823, -5.94051732],
[ 13.74270373, 4.26196287]])
Check that it works:
np.random.seed(1)
print(np.random.normal(0, 10))
print(np.random.normal(1, 11))
# 16.2434536366
# -5.72932055015
If you need a pandas DataFrame:
import pandas as pd
pd.DataFrame(np.random.normal(means,sts))
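If means and sts are themselves DataFrames rather than plain arrays, a small sketch (assuming they share the same index and columns) that keeps those on the result:
import numpy as np
import pandas as pd

means_df = pd.DataFrame(np.arange(10).reshape(5, 2))
sts_df = pd.DataFrame(np.arange(10, 20).reshape(5, 2))

# broadcasting works on the underlying arrays; reattach index/columns afterwards
sampled = pd.DataFrame(np.random.normal(means_df.values, sts_df.values),
                       index=means_df.index, columns=means_df.columns)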
I will use a dictionary to construct this dataframe. Suppose the indices and columns are the same for means and stds:
import numpy
import pandas as pd

means = numpy.arange(10)
means = pd.DataFrame(means.reshape(5, 2))
stds = numpy.arange(10, 20)
stds = pd.DataFrame(stds.reshape(5, 2))

samples = {}
for i in means.columns:
    col = {}
    for j in means.index:
        # .loc replaces the deprecated .ix
        col[j] = numpy.random.normal(means.loc[j, i], stds.loc[j, i], 2)
    samples[i] = col
print(pd.DataFrame(samples))
# 0 1
#0 [0.0760974520154, 3.29439282825] [11.1292510583, 0.318246201796]
#1 [-25.4518020981, 19.2176263823] [17.0826945017, 9.36179435872]
#2 [14.5402484325, 8.33808246538] [6.96459947914, 26.5552235093]
#3 [0.775891790613, -2.09168601369] [2.38723023677, 15.8099942902]
#4 [-0.828518484847, 45.4592922652] [26.8088977308, 16.0818556353]
Or reset the dtype of a DataFrame and reassign values:
import itertools

samples = means * 0
samples = samples.astype(object)
for i, j in itertools.product(means.index, means.columns):
    # .at replaces the removed set_value; .loc replaces the deprecated .ix
    samples.at[i, j] = numpy.random.normal(means.loc[i, j], stds.loc[i, j], 2)

Filling in missing data in Python

I was hoping you would be able to help me solve a small problem.
I am using a small device that prints out two properties that I save to a file. The device rasters in X and Y direction to form a grid. I am interested in plotting the relative intensity of these two properties as a function of the X and Y dimensions. I record the data in 4 columns that are comma separated (X, Y, property 1, property 2).
The grid is examined in lines, so for each Y value, it will move from X1 to X2 which are separated several millimeters apart. Then it will move to the next line and over again.
I am able to process the data in python with pandas/numpy but it doesn't work too well when there are any missing rows (which unfortunately does happen).
I have attached a sample of the output (and annotated the problems):
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
44,12,50,5
45,12,100,6
46,12,1500,7
47,12,2500,8
Sometimes, however a line or a few will be missing making it not possible to process and plot. Currently I have not been able to automatically fix it and have to do it manually. The bad output looks like this:
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
45,12,100,5 << missing 44,12...
46,12,1500,6
47,12,2500,7
I know the number of lines I expect since I know my range of X and Y.
What would be the best way to deal with this? Currently I manually enter the missing X and Y values and populate property 1 and 2 with values of 0. This can be time consuming and I would like to automate it. I have two questions.
Question 1: How can I automatically fill in my missing data with the corresponding values of X and Y and two zeros? This could be obtained from a pre-generated array of X and Y values that correspond to the experimental range.
Question 2: Is there a better way to split the file into separate arrays for plotting (rather than using the 'New' line?) For instance, by having a 'if' function that will output each line between X(start) and X(end) to a separate array? I've tried doing that but with no success.
I've attached my current (crude) code:
df = pd.read_csv('FileName.csv', delimiter=',', skiprows=0)
rows = [-1] + np.where(df['X'] == 'New')[0].tolist() + [len(df.index)]
dff = {}
for i, r in enumerate(rows[:-1]):
    dff[i] = df[r+1: rows[i+1]]
maxY = len(dff)
data = []
data2 = []
for yaxes in range(0, maxY):
    data2.append(dff[yaxes].iloc[:, 2])
<data2 is then used for plotting using matplotlib>
To answer my Question 1, I was thinking about using the 'reindex' and 'reset_index' functions; however, I haven't managed to make them work.
I would appreciate any suggestions.
Does this meet what you want?
Q1: fill X using reindex, and the other columns using fillna
Q2: Passing each separated chunk to read_csv via StringIO is easier (on Python 3, use io.StringIO and drop the unicode() call)
import numpy as np
import pandas as pd
from StringIO import StringIO  # Python 3: from io import StringIO

# read file and split the input
f = open('temp.csv', 'r')
chunks = f.read().split('New')

# read csv as separated dataframes, using the first column as index
dfs = [pd.read_csv(StringIO(unicode(chunk)), header=None, index_col=0) for chunk in chunks]

def pad(df):
    # reindex; you should know the range of x
    df = df.reindex(np.arange(44, 48))
    # pad y forward / backward, assuming y has a single value per chunk
    df[1] = df[1].fillna(method='bfill')
    df[1] = df[1].fillna(method='ffill')
    # pad the remaining columns with zeros
    df = df.fillna(0)
    # revert index to values
    return df.reset_index(drop=False)

dfs = [pad(df) for df in dfs]
dfs[0]
# 0 1 2 3
# 0 44 11 500 1
# 1 45 11 120 2
# 2 46 11 320 3
# 3 47 11 700 4
# dfs[1]
# 0 1 2 3
# 0 44 12 0 0
# 1 45 12 100 5
# 2 46 12 1500 6
# 3 47 12 2500 7
First Question
I've included (commented-out) print statements inside the function to explain how it works.
In [89]:
def replace_missing(df, Ids):
    # check which values are missing
    missing = np.setdiff1d(Ids, df[0])
    if len(missing) > 0:
        missing_df = pd.DataFrame(data=np.zeros((len(missing), 4)))
        #print('---missing df---')
        #print(missing_df)
        missing_df[0] = missing
        #print('---missing df---')
        #print(missing_df)
        missing_df[1].replace(0, df[1].iloc[0], inplace=True)
        #print('---missing df---')
        #print(missing_df)
        df = pd.concat([df, missing_df])
        #print('---final df---')
        #print(df)
    return df
In [91]:
Ids = np.arange(44, 48)
# df1 is the question's data read with pd.read_csv(..., header=None)
final_df = df1.groupby(df1[1], as_index=False).apply(replace_missing, Ids).reset_index(drop=True)
final_df
Out[91]:
    0   1     2  3
0  44  11   500  1
1  45  11   120  2
2  46  11   320  3
3  47  11   700  4
4  45  12   100  5
5  46  12  1500  6
6  47  12  2500  7
7  44  12     0  0
Second question
In [92]:
group = final_df.groupby(final_df[1])
In [99]:
separate = [group.get_group(key) for key in group.groups.keys()]
separate[0]
Out[104]:
    0   1    2  3
0  44  11  500  1
1  45  11  120  2
2  46  11  320  3
3  47  11  700  4
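As a further option for Question 1, a sketch (not from the original answers) that builds the full grid of expected (X, Y) combinations with a MultiIndex and reindexes against it; the column names here are assumptions:
import numpy as np
import pandas as pd

# the sample data with the row for X=44, Y=12 missing (column names assumed)
df = pd.DataFrame({'X': [44, 45, 46, 47, 45, 46, 47],
                   'Y': [11, 11, 11, 11, 12, 12, 12],
                   'p1': [500, 120, 320, 700, 100, 1500, 2500],
                   'p2': [1, 2, 3, 4, 5, 6, 7]})

# full grid of expected (X, Y) combinations from the known ranges
full = pd.MultiIndex.from_product([np.arange(44, 48), np.arange(11, 13)],
                                  names=['X', 'Y'])

# missing rows appear with the correct X and Y and zeros for both properties
filled = df.set_index(['X', 'Y']).reindex(full, fill_value=0).reset_index()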

Read lists into columns of pandas DataFrame

I want to load lists into columns of a pandas DataFrame but cannot seem to do this simply. This is an example of what I want using transpose() but I would think that is unnecessary:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: x = np.linspace(0,np.pi,10)
In [4]: y = np.sin(x)
In [5]: data = pd.DataFrame(data=[x,y]).transpose()
In [6]: data.columns = ['x', 'sin(x)']
In [7]: data
Out[7]:
x sin(x)
0 0.000000 0.000000e+00
1 0.349066 3.420201e-01
2 0.698132 6.427876e-01
3 1.047198 8.660254e-01
4 1.396263 9.848078e-01
5 1.745329 9.848078e-01
6 2.094395 8.660254e-01
7 2.443461 6.427876e-01
8 2.792527 3.420201e-01
9 3.141593 1.224647e-16
[10 rows x 2 columns]
Is there a way to directly load each list into a column to eliminate the transpose and insert the column labels when creating the DataFrame?
Someone just recommended creating a dictionary from the data then loading that into the DataFrame like this:
In [8]: data = pd.DataFrame({'x': x, 'sin(x)': y})
In [9]: data
Out[9]:
x sin(x)
0 0.000000 0.000000e+00
1 0.349066 3.420201e-01
2 0.698132 6.427876e-01
3 1.047198 8.660254e-01
4 1.396263 9.848078e-01
5 1.745329 9.848078e-01
6 2.094395 8.660254e-01
7 2.443461 6.427876e-01
8 2.792527 3.420201e-01
9 3.141593 1.224647e-16
[10 rows x 2 columns]
Note that a dictionary is an unordered collection of key-value pairs (prior to Python 3.7). If you care about the column order, you should pass a list of the ordered keys to be used (you can also use this list to include only some of the dict entries):
data = pd.DataFrame({'x': x, 'sin(x)': y}, columns=['x', 'sin(x)'])
Here's another 1-line solution preserving the specified order, without typing x and sin(x) twice:
data = pd.concat([pd.Series(x,name='x'),pd.Series(y,name='sin(x)')], axis=1)
If you don't care about the column names, you can use this:
pd.DataFrame(zip(*[x, y]))  # wrap the zip in list(...) if your pandas version complains
Run-time-wise it is as fast as the dict option, and both are much faster than using transpose.
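Another option (a sketch, not from the original answers) is numpy.column_stack, which also avoids the transpose and names the columns at construction time:
import numpy as np
import pandas as pd

x = np.linspace(0, np.pi, 10)
y = np.sin(x)

# column_stack pairs the two 1-D arrays as the columns of one 2-D array
data = pd.DataFrame(np.column_stack([x, y]), columns=['x', 'sin(x)'])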
