Pandas subsetting returning different results from numpy - python

I am trying to subset a pandas dataframe using two conditions. However, I am not getting the same results as when done with numpy. What am I doing wrong?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(20,120,101)
y = np.linspace(-45,25,101)
xs,ys = np.meshgrid(x,y)
idx = (xs >=100) & (ys >= 0)
plt.scatter(xs,ys,s=2,c='b')
plt.scatter(xs[idx],ys[idx],s=2,c='r')
I need to remove the red block from my dataset, which I can do with numpy by using:
plt.scatter(xs[~idx],ys[~idx],s=2,c='b')
How do I replicate this with a pandas dataframe?
I've tried using the same logic as I used above:
data = {'x':x,'y':y}
df = pd.DataFrame(data)
mask = (df.x >=100) & (df.y >= 0)
df2 = df[~mask]
I've also tried using loc:
df.loc[(df.x >=100) & (df.y >= 0),['x','y']] = np.nan
Both of these methods give the following result:
How do I replicate the results from numpy?
Many thanks.

You don't obtain the same result because you didn't create all the coordinate pairs before passing them to pandas. Here is a quick solution:
data = {'x':xs.flatten(),'y':ys.flatten()}
df = pd.DataFrame(data)
mask = (df.x >=100) & (df.y >= 0)
df2 = df[~mask]
plt.scatter(df2.x,df2.y,s=2,c='b')
flatten reshapes your arrays to one dimension so that they can be used to construct a DataFrame of coordinate pairs rather than lists.
Output:
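For reference on the flatten step above: the meshgrid arrays are 2-D, and flatten turns each into a 1-D vector with one entry per (x, y) pair (ravel would be an equivalent, usually copy-free alternative):
xs.shape             # (101, 101)
xs.flatten().shape   # (10201,) -- one entry per coordinate pair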
Edit: same result, but starting from a dataframe containing x and y.
Split the df into chunks:
data_x = np.linspace(20,120,101)
data_y = np.linspace(-45,25,101)
dataframe = pd.DataFrame({'x':data_x,'y':data_y})
chunk_size = 25
dfs = [dataframe[i:i+chunk_size] for i in range(0,dataframe.shape[0],chunk_size)]
Define the function that will give you the points you are interested in. Two loops are needed because you must cover every combination of x and y values:
def generatorPoints(dfs):
    for i in range(len(dfs)):
        x = dfs[i].x
        for j in range(len(dfs)):
            y = dfs[j].y
            xs, ys = np.meshgrid(x, y)
            idx = (xs >= 100) & (ys >= 0)
            yield xs[~idx], ys[~idx]
x, y = [], []
for xs, ys in generatorPoints(dfs):
    x.extend(xs), y.extend(ys)
plt.scatter(x,y,s=2,c='b')
This gives the same result as the previous code. There is certainly room for optimization, but this is a start for your request :).

Related

Python's `.loc` is really slow on selecting subsets of Data

I have a large multi-indexed (y,t) single-valued DataFrame df. Currently, I select a subset via df.loc[(Y,T), :] and create a dictionary out of it. The following MWE works, but the selection is very slow for large subsets.
import numpy as np
import pandas as pd
# Full DataFrame
y_max = 50
Y_max = range(1, y_max+1)
t_max = 100
T_max = range(1, t_max+1)
idx_max = tuple((y,t) for y in Y_max for t in T_max)
df = pd.DataFrame(np.random.sample(y_max*t_max), index=idx_max, columns=['Value'])
# Create Dictionary of Subset of Data
y1 = 4
yN = 10
Y = range(y1, yN+1)
t1 = 5
tN = 9
T = range(t1, tN+1)
idx_sub = tuple((y,t) for y in Y for t in T)
data_sub = df.loc[(Y,T), :] #This is really slow
dict_sub = dict(zip(idx_sub, data_sub['Value']))
# result, e.g. (y,t) = (5,7)
dict_sub[5,7] == df.loc[(5,7), 'Value']
I was thinking of using df.loc[(y1,t1):(yN,tN), :], but it does not work properly, as the second index is only bounded in the final year yN.
One idea is to use Index.isin with itertools.product for boolean indexing:
from itertools import product
idx_sub = tuple(product(Y, T))
dict_sub = df.loc[df.index.isin(idx_sub),'Value'].to_dict()
print (dict_sub)
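Another option, assuming the index really is a sorted MultiIndex, is label slicing with pd.IndexSlice, which avoids materializing the full product of labels:
idx = pd.IndexSlice
# sort_index() makes the MultiIndex sliceable by label ranges on both levels
dict_sub = df.sort_index().loc[idx[y1:yN, t1:tN], 'Value'].to_dict()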

Clustering dataframe after concatenation of x and y

I have x and y arrays; x consists of three arrays and y consists of three arrays of seven values each.
x= [np.array([6.03437288]), np.array([6.39850922]), np.array([6.07835145])]
y= [np.array([[-1.06565856, -0.16222044, 7.85850477, -2.62498475, -0.46315498,
-0.33087472, -0.1394244 ]]),
np.array([[-1.41487104e+00, 5.81421750e-03, 7.92917001e+00,
-3.37987517e+00, 1.14685839e-01, -2.91779263e-01,
2.51753851e-01]]),
np.array([[-1.56496814, 0.2612637 , 7.60577761, -3.55727614, 0.18844392,
-0.75112678, -0.48055978]])]
I concatenate x and y into one dataframe
df = pd.DataFrame({'x': x,'y': y})
then I tried to cluster this dataframe by k-medoids
kmedoids = KMedoids(n_clusters=3, random_state=0).fit(df)
cluster_labels = kmedoids.predict(df)
but I faced this error
ValueError: setting an array element with a sequence.
I tried to search for a solution to this problem but haven't found a concrete one. Any suggestions are welcome, even ones that modify the code.
Given arrays x and y as provided in the question:
import pandas as pd
from sklearn_extra.cluster import KMedoids
df = pd.DataFrame({'x': x,'y': y})
First, concatenate the x and y of the dataframe into one array per row:
df2 = df.apply(lambda r: np.append(r.x, r.y), axis = 1)
Then create one X array:
X = np.array(df2.values.tolist())
that can be passed to the clustering method:
kmedoids = KMedoids(n_clusters=3, random_state=0).fit(X)
cluster_labels = kmedoids.predict(X)
result of clustering:
array([2, 0, 1], dtype=int64)
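An equivalent way to build X without the DataFrame detour (a sketch reusing the x and y from the question):
import numpy as np

# one row per sample: the scalar x value followed by the seven y values
X = np.column_stack([np.ravel(x), np.vstack(y)])   # shape (3, 8)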

Plot distribution of differences between two pandas dataframe columns

I have a pandas dataframe which has columns A & B.
I just want to plot a distribution graph of the percentage difference between columns A & B.
A B
1 1.051990e+10 1.051990e+04
2 1.051990e+10 1.051990e+04
5 4.841800e+10 1.200000e+10
8 2.327700e+10 2.716000e+10
9 1.204900e+10 2.100000e+08
The distribution graph would show, for example, how many records have a 10% difference, how many have a 20% difference, and so on.
I tried the following:
def percCal(x,y):
    return (x-y)*100/x
df['perc'] = df.apply(lambda x: percCal(df['A'], df['B']), axis=1)
This is not working; as I'm a newbie, please help.
You don't need the lambda operation.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df1 = pd.DataFrame(np.random.randint(1, 10, (20, 2)), columns=['A', 'B'])
def percCal(x, y):
    return (x - y) * 100 / x
Alternatively, just manipulate the columns directly:
df1['diff'] = (df1['A'] - df1['B']) * 100 / df1['A']
Apply the function and plot:
df1['diff'] = percCal(df1['A'], df1['B'])
df1['diff'].plot(kind='density')
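If the goal is literally to count how many records fall in each 10% band rather than to draw a smooth density, a hedged sketch with a plain histogram (the bin edges below are an assumption; widen them to match your data's range):
bins = np.arange(-100, 110, 10)           # assumed range, 10-point-wide bands
df1['diff'].plot(kind='hist', bins=bins)
plt.show()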
df['perc'] = (df['A'] - df['B']) *100/df['A']
def percCal(x, y):
    return (x - y) * 100 / x
df['perc'] = df.apply(lambda x: percCal(x['A'], x['B']), axis=1)
Change df inside the lambda to x. In this case you are giving the function the row x, so percCal receives what is in that row of the data frame; when you use df you are actually passing the whole data frame, and the function returns a data frame rather than a single value. But please check your code: if x inside the function can be 0, the division is a problem.
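On that zero check, a minimal guard (a sketch; same column names as in the question):
import numpy as np

# rows where A == 0 get NaN instead of an infinite/invalid percentage
df['perc'] = np.where(df['A'] != 0, (df['A'] - df['B']) * 100 / df['A'], np.nan)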
Think this is what you are looking for:
# Dummy df
data = [
[1.051990e+10, 1.051990e+04],
[1.051990e+10, 1.051990e+04],
[4.841800e+10, 1.200000e+10],
[2.327700e+10, 2.716000e+10],
[1.204900e+10, 2.100000e+08],
]
cols = ['A', 'B']
df2 = pd.DataFrame(data, columns=cols)
# Solution
import seaborn as sns
df2['pct_diff'] = (df2['A'] - df2['B']) / df2['A']
sns.distplot(df2['pct_diff']);
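Note that sns.distplot is deprecated in recent seaborn releases; assuming seaborn >= 0.11, a rough equivalent is:
sns.histplot(df2['pct_diff'], kde=True)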

python: increase performance of finding the best timeshift for a correlation between each X column and y

I have a dataframe X with several columns and a dataframe y with only one column (series). The rows in X represent timesteps and I want to find the interval I need to shift each column of X to obtain the highest correlation with y. I wrote a function that loops over all columns and then loops over all timesteps and correlates the X column with y. If the R² is better than before I store the timestep. However, with over 300 columns this routine is really taking some time and I need to increase the performance. Is there a nice way to simplify this code?
(In the example I used the iris data set which is of course not a timeseries...)
from sklearn import datasets
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
from copy import deepcopy
def get_best_shift(dfX, dfy, ti=60, maxt=1440):
    """
    determines the best correlation for the last maxt minutes based on a
    timestep of ti minutes. Creates a dataframe with the shifted variables
    based on the best match (strongest correlation).
    """
    df_out = deepcopy(dfX)
    for xcol in dfX:
        bestshift = 0
        Rmax = 0
        for ishift in range(0, int(maxt / ti)):
            xvals = dfX[xcol].iloc[0:(dfX.shape[0] - ishift)].values
            yvals = np.array([val[0] for val in dfy.iloc[ishift:dfy.shape[0]].values])
            selector = np.array([str(val) != "nan" for val in (xvals * yvals)], dtype=bool)
            xvals = xvals[selector]
            yvals = yvals[selector]
            R = np.corrcoef(xvals, yvals)[0][1]
            # plt.figure()
            # plt.plot(xvals, yvals, 'k.')
            # plt.show()
            if R ** 2 > Rmax:
                Rmax = R ** 2
                # print(Rmax)
                bestshift = ishift
        df_out[xcol] = list(np.zeros(bestshift)) + list(dfX[xcol].iloc[0:dfX.shape[0] - bestshift].values)
        df_out = df_out.rename(columns={xcol: ''.join([str(xcol), '_t-', str(bestshift)])})
    return df_out
iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
y = pd.DataFrame(iris.target)
df = get_best_shift(X,y)
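One possible direction for speeding this up (a sketch, not from the original post): for a given shift, pandas can correlate every column of X with y in a single corrwith call, which removes the inner Python loop over columns and the manual NaN filtering:
def best_shifts(dfX, dfy, ti=60, maxt=1440):
    """Return, for each column of dfX, the shift (in steps of ti minutes)
    that maximizes R^2 with dfy. A sketch of the idea, not a drop-in
    replacement for get_best_shift."""
    yser = dfy.iloc[:, 0]
    shifts = range(int(maxt / ti))
    # one corrwith call per shift instead of one corrcoef call per column per shift
    corr = pd.DataFrame({s: dfX.corrwith(yser.shift(-s)) for s in shifts})
    return (corr ** 2).idxmax(axis=1)   # best shift per column

print(best_shifts(X, y))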

dictionary or sub df from df

I am totally new to programming in general, so please explain.
The general aim: I am dealing with x, y, z data. I want to reduce the number of points in each cell (cells could have variable sizes depending on the project) to, let's say, 50 without affecting the mean value.
The problem: I have a df with x, y, z, binnumber and I want to produce either a dictionary (e.g. binnumber: [x,y,z], [x,y,z], ... for the points inside that bin) or somehow sub-datasets that I can work with as dataframes.
what I did:
# import the data
import pandas as pd
import numpy as np
from scipy.stats import binned_statistic_2d

inputpath = input("write the file path:")
# file name; index_col=False means no index column; names are the column names
Data = pd.read_csv(inputpath, index_col=False, header=None,
                   names=['X', 'Y', 'Z'], skip_blank_lines=True)
Data = pd.DataFrame(Data)

# creating the grid cells
min_x = int(min(Data['X']))
max_x = int(max(Data['X']) + 1)
min_y = int(min(Data['Y']))
max_y = int(max(Data['Y']) + 1)
bin_size = float(input('write the cell size:'))
bx = int(((max_x - min_x) // bin_size) + 1)
by = int(((max_y - min_y) // bin_size) + 1)
xedges = np.linspace(min_x, max_x, bx, dtype=int)
yedges = np.linspace(min_y, max_y, by, dtype=int)

# assign the data to the cells
count, x_edge, y_edge, binnumber = binned_statistic_2d(
    Data['X'], Data['Y'], Data['Z'], bins=(xedges, yedges))
Data['binnumber'] = binnumber

# sub sets
subsets = dict(Data.groupby('binnumber'))
print(subsets)
This did not work.
Another attempt was to deal with the cells themselves, but it did not work either.
cells = {}
for i in xedges:
    for j in yedges:
        cells[str(i), str(j)] = []
print(cells.keys())
for x in Data.X:
    for y in Data.Y:
        for z in Data.Z:
            for k, v in cells.keys():
                if x >= int(k[0]) and x < int(k[0]) + 1 and y >= int(k[1]) and y < int(k[1]) + 1:
                    k = (x, y, z)
                else:
                    cells = ('0')
print(cells)
Thanks for any attempt to help.
# import the data
import pandas as pd
import numpy as np
from collections import defaultdict
from scipy.stats import binned_statistic_2d

inputpath = input("write the file path:")
# file name; index_col=False means no index column; names are the column names
Data = pd.read_csv(inputpath, index_col=False, header=None,
                   names=['X', 'Y', 'Z'], skip_blank_lines=True)
Data = pd.DataFrame(Data)

# creating the grid cells
min_x = int(min(Data['X']))
max_x = int(max(Data['X']) + 1)
min_y = int(min(Data['Y']))
max_y = int(max(Data['Y']) + 1)
bin_size = float(input('write the cell size:'))
bx = int(((max_x - min_x) // bin_size) + 1)
by = int(((max_y - min_y) // bin_size) + 1)
xedges = np.linspace(min_x, max_x, bx, dtype=int)
yedges = np.linspace(min_y, max_y, by, dtype=int)

# assign the data to the cells
count, x_edge, y_edge, binnumber = binned_statistic_2d(
    Data['X'], Data['Y'], Data['Z'], bins=(xedges, yedges))
Data['binnumber'] = binnumber

# making dictionary with binnumber: all associated points
Data['value'] = list(zip(Data['X'], Data['Y'], Data['Z']))
d = defaultdict(list)
for idx, row in Data.iterrows():
    d[row['binnumber']].append(row['value'])
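The iterrows loop works, but an equivalent and usually faster way to build the same dictionary is to let groupby do the grouping (a sketch with the same column names):
# binnumber -> list of (x, y, z) tuples falling in that bin
d = Data.groupby('binnumber')['value'].apply(list).to_dict()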
