Print specific values in python dataframe for which lambda is applied - python

I have some code which reads a json file and applies a lambda that removes values.
Code -
import pandas as pd
data = pd.read_json('filename.json',dtype='int64')
data = data[data['ColumnA'].apply(lambda x: x == None or (x.isnumeric() and len(x) <= 2))]
The last statement filters out records from the dataframe where ColumnA contains anything other than None or a number of at most 2 digits (please correct me if I'm wrong).
Objective - Before applying the lambda, I want to print the records from the dataframe, so that I can know what kind of values are getting removed.
P.S. I am new to Python and working on some predesigned code.

What you are doing here is filtering a DataFrame based on the values. Doing this involves two steps:
Creating a boolean array of True/False for those that you want/don't want
Indexing the original DataFrame with that boolean array to select only the values that you want.
In your code, you are doing both steps at once (and that's perfectly fine). But if you want to look at the values, it might be helpful to get the boolean array and look at that one.
data = data[data['ColumnA'].apply(lambda x: x == None or (x.isnumeric() and len(x) <= 2))]
# can become
# step 1
my_values = data['ColumnA'].apply(lambda x: x == None or (x.isnumeric() and len(x) <= 2))
# you can use my_values to inspect what is happening (do this before filtering)
print(my_values)        # this will be a series of just True/False
print(data[my_values])  # this will print the kept values
print(data[~my_values]) # this will print the removed values
# step 2
data = data[my_values]

Related

How to use .apply(lambda x: function) over all the columns of a dataframe

I'm trying to pass every column of a dataframe through a custom function by using apply(lambda x: function) in Python.
The custom function I have created works on its own, but when I put it into the apply(lambda x: ...) structure it only returns NaN values into the selected dataframe.
First is the custom function -
def snr_pd(wavenumber_arr):
    intensity_arr = Zhangfit_output
    signal_low = 1650
    signal_high = 1750
    noise_low = 1750
    noise_high = 1850
    signal_mask = np.logical_and((wavenumber_arr >= signal_low), (wavenumber_arr < signal_high))
    noise_mask = np.logical_and((wavenumber_arr >= noise_low), (wavenumber_arr < noise_high))
    signal = np.max(intensity_arr[signal_mask])
    noise = np.std(intensity_arr[noise_mask])
    return signal / noise
And this is the setup of the lambda function -
sd['s/n'] = df.apply(lambda x: snr_pd(x), axis =0,)
Currently I believe this is taking the columns from df, passing them to snr_pd() and appending the results to sd under the column ['s/n'], but the only answer produced is NaN.
I have also tried a couple structure changes like using applymap() instead of apply().
sd['s/n'] = fd.applymap(lambda x: snr_pd(x), na_action = 'ignore')
However this returns this error instead:
ValueError: zero-size array to reduction operation maximum which has no identity
Which I have even less understanding of.
Any help would be much appreciated.
It looks as though your function snr_pd() expects an entire array as an argument.
Without seeing your data it's hard to say, but you should be able to apply the function directly to the DataFrame using np.apply_along_axis():
np.apply_along_axis(snr_pd, axis=0, arr=df)
Note that this assumes that every column in df is numeric. If not, then simply select the columns of the df on which you'd like to apply the function.
Note also that np.apply_along_axis() will return a numpy array.
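For example (a rough sketch, assuming snr_pd and Zhangfit_output are defined as in the question and that the non-numeric columns should simply be skipped), you could select the numeric columns first:
import numpy as np

# keep only the numeric columns of the (hypothetical) mixed-type frame
numeric = df.select_dtypes(include='number')
# apply snr_pd down each column (axis=0); gives one S/N value per numeric column
snr_per_column = np.apply_along_axis(snr_pd, axis=0, arr=numeric.to_numpy())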

Are the outcomes of the numpy.where method on a pandas dataframe calculated on the full array or the filtered array?

I want to use a numpy.where on a pandas dataframe to check for the existence of a certain string in a column. If the string is present, apply a split function and take the second list element; if not, just take the first character. However the following code doesn't work, it throws an IndexError: list index out of range because the first entry contains no underscore:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['a','a_1','b_','b_2_3']})
df["B"] = np.where(df.A.str.contains('_'),df.A.apply(lambda x: x.split('_')[1]),df.A.str[0])
Only calling np.where returns an array of indices for which the condition holds true, so I was under the impression that the split-command would only be used on that subset of the data:
np.where(df.A.str.contains('_'))
Out[14]: (array([1, 2, 3], dtype=int64),)
But apparently the split command is used on the entire unfiltered array, which seems odd to me, since that is a potentially large number of unnecessary operations that would slow down the calculation.
I'm not asking for an alternative solution, coming up with that isn't hard.
I'm merely wondering if this is an expected outcome or an issue with either pandas or numpy.
Python isn't a "lazy" language, so code is evaluated immediately. Generators/iterators do introduce some laziness, but that doesn't apply here.
If we split your line of code, we get the following statements:
df.A.str.contains('_')
df.A.apply(lambda x: x.split('_')[1])
df.A.str[0]
Python has to evaluate these statements before it can pass them as arguments to np.where.
To see all this happening, we can rewrite the above as little functions that display some output:
def fn_contains(x):
    print('contains', x)
    return '_' in x

def fn_split(x):
    s = x.split('_')
    print('split', x, s)
    # check for errors here
    if len(s) > 1:
        return s[1]

def fn_first(x):
    print('first', x)
    return x[0]
and then you can run them on your data with:
s = pd.Series(['a','a_1','b_','b_2_3'])
np.where(
    s.apply(fn_contains),
    s.apply(fn_split),
    s.apply(fn_first)
)
and you'll see everything being executed in order. This is basically what's happening "inside" numpy/pandas when you execute things.
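For the example series, the prints should come out roughly in this order (all of fn_contains, then all of fn_split, then all of fn_first), which shows that each argument is evaluated in full before np.where is ever called:
contains a
contains a_1
contains b_
contains b_2_3
split a ['a']
split a_1 ['a', '1']
split b_ ['b', '']
split b_2_3 ['b', '2', '3']
first a
first a_1
first b_
first b_2_3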
In my opinion numpy.where only sets values by condition, so the second and third arrays are computed for all the data - filtered and also non-filtered.
If you need to apply some function only to the filtered values:
mask = df.A.str.contains('_')
df.loc[mask, "B"] = df.loc[mask, "A"].str.split('_').str[1]
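If column B should also be filled for the rows without an underscore (taking the first character, as in the np.where version), a follow-up assignment on the inverted mask works:
df.loc[~mask, "B"] = df.loc[~mask, "A"].str[0]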
There is an error in your solution, but the problem is not connected with np.where. After splitting by _, if the separator does not exist you get a one-element list, so selecting the second value of the list with [1] raises the error:
print (df.A.apply(lambda x: x.split('_')))
0 [a]
1 [a, 1]
2 [b, ]
3 [b, 2, 3]
Name: A, dtype: object
print (df.A.apply(lambda x: x.split('_')[1]))
IndexError: list index out of range
So a pandas solution is possible here, if performance is not important, because string functions are slow:
df["B"] = np.where(df.A.str.contains('_'),
df.A.str.split('_').str[1],
df.A.str[0])
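For the sample dataframe, this gives roughly the following (row 2 ends up with an empty string in B, since there is nothing after the underscore):
       A  B
0      a  a
1    a_1  1
2     b_
3  b_2_3  2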

Finding euclidean distance from multiple mean vectors

This is what I am trying to do - I was able to do steps 1 to 4. Need help with steps 5 onward
Basically for each data point I would like to find euclidean distance from all mean vectors based upon column y
1. take data
2. separate out non numerical columns
3. find mean vectors by y column
4. save means
5. subtract each mean vector from each row based upon y value
6. square each column
7. add all columns
8. join back to numerical dataset and then join non numerical columns
import pandas as pd
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
print (df)
df_numeric=df.select_dtypes(include='number')#, exclude=None)[source]
df_non_numeric=df.select_dtypes(exclude='number')
means=df_numeric.groupby('class').mean()
For each row of means, subtract that row from each row of df_numeric. Then take the square of each column in the output, and for each row add up all the columns. Then join this data back to df_numeric and df_non_numeric.
-------------- Update 1
Added code as below. My questions have changed and the updated questions are at the end.
def calculate_distance(row):
    return (np.sum(np.square(row-means.head(1)),1))

def calculate_distance2(row):
    return (np.sum(np.square(row-means.tail(1)),1))

df_numeric2=df_numeric.drop("class",1)
#np.sum(np.square(df_numeric2.head(1)-means.head(1)),1)
df_numeric2['distance0']= df_numeric.apply(calculate_distance, axis=1)
df_numeric2['distance1']= df_numeric.apply(calculate_distance2, axis=1)
print(df_numeric2)
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]
Could anyone confirm that this is a correct way to achieve the results? I am mainly concerned about the last two statements. Would the second-to-last statement do a correct join? Would the final statement assign the original class? I would like to confirm that Python won't do the concat and class assignment in a random order, and that it maintains the order in which the rows appear:
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]
I think this is what you want
import pandas as pd
import numpy as np
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
print (df)
df_numeric=df.select_dtypes(include='number')#, exclude=None)[source]
# Make df_non_numeric a copy and not a view
df_non_numeric=df.select_dtypes(exclude='number').copy()
# Subtract mean (calculated using the transform function which preserves the
# number of rows) for each class to create distance to mean
df_dist_to_mean = df_numeric[['Age', 'weight']] - df_numeric[['Age', 'weight', 'class']].groupby('class').transform('mean')
# Finally calculate the euclidean distance (hypotenuse)
df_non_numeric['euc_dist'] = np.hypot(df_dist_to_mean['Age'], df_dist_to_mean['weight'])
df_non_numeric['class'] = df_numeric['class']
# If you want a separate dataframe named 'final' with the end result
df_final = df_non_numeric.copy()
print(df_final)
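For the sample data above, print(df_final) should give roughly:
     Name  euc_dist  class
0    Alex  2.848001    0.0
1     Bob  0.000000    1.0
2  Clarke  2.027588    0.0
3    brke  3.800585    0.0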
It is probably possible to write this even more densely, but this way you'll see what's going on.
I'm sure there is a better way to do this, but I iterated through depending on the class and followed the exact steps:
Assigned the 'class' as the index.
Rotated so that the 'class' was in the columns.
Performed the operation with the means that corresponded with df_numeric.
Squared the values.
Summed the rows.
Concatenated the dataframes back together.
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
#print (df)
df_numeric=df.select_dtypes(include='number')#, exclude=None)[source]
df_non_numeric=df.select_dtypes(exclude='number')
means=df_numeric.groupby('class').mean().T
import numpy as np
# Changed index
df_numeric.index = df_numeric['class']
df_numeric.drop('class' , axis = 1 , inplace = True)
# Rotated the Numeric data sideways so the class was in the columns
df_numeric = df_numeric.T
#Iterated through the values in means and seen which df_Numeric values matched
store = [] # Assigned an empty array
for j in means:
    sto = df_numeric[j]
    if type(sto) == type(pd.Series()): # If there is a single value it comes out as a pd.Series type
        sto = sto.to_frame()           # Need to convert to dataframe type
    store.append(sto-j)                # append the various values to the array
values = np.array(store)**2 # Squaring the values
# Summing the rows
summed = []
for i in values:
    summed.append((i.sum(axis = 1)))
df_new = pd.concat(summed , axis = 1)
df_new.T

new dataframe within if statement. Python

Here is the part of the code I am having issues with:
for x in range(len(df['Days'])):
    if df['Days'][x]>0 and df['Days'][x]<=30:
        b = df['Days'][x]
b
The output I get is b = 14, which is the last value in the column of the dataframe for which the if statement holds. I am trying to get ALL the values of the column for which the if statement holds to be held in "b", rather than just the last value alone.
What you want to do is make a list instead and append b to it.
my_vals = []
for x in range(len(df['Days'])):
    if df['Days'][x]>0 and df['Days'][x]<=30:
        b = df['Days'][x]
        my_vals.append(b)
my_vals
In your code, you are changing b in every iteration, so it only stores the most recent value. In the future, when you are trying to store multiple values, do so in a different data type such as a list.
You can also use the filtering functionality of pandas and use
values = df.loc[(df['Days'] >= 0) & (df['Days'] <= 30)]
If you want the values as a Series instead of a DataFrame use
values_series = values['Days']
If you want the values as a list instead of a Series use
values_list = list(values_series)

assigning values to first three rows of every group

I'm trying to code the following logic in pandas: for the first three rows of every group I want to create a variable which should have the value 1 (1st row), 2 (2nd row), 3 (3rd row). I'm doing it like below. In the code below I'm not creating a new variable, because I don't know how to do that, so I'm replacing a variable that's already present in the data set. Though my code doesn't throw an error, it's giving me very strange results.
def func (i):
    data.loc[data.groupby('ID').nth(i).index,'date'] = i

func(1)
Any suggestions?
Thanks in Advance.
If you don't have a duplicated index, you can create a row id within each group, keep only the ids that are at most 3, and then assign the result back to the data frame:
data['date'] = (data.groupby('ID').cumcount() + 1)[lambda x: x <= 3]
This gives the first three rows of each ID the values 1, 2, 3; rows beyond the third will have NaN values.
data = pd.DataFrame({"ID":[1,1,1,1,2,2,3,3,3]})
data['date'] = (data.groupby('ID').cumcount() + 1)[lambda x: x <= 3]
data
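With that sample frame, data should look roughly like this (the date column becomes float because of the NaN):
   ID  date
0   1   1.0
1   1   2.0
2   1   3.0
3   1   NaN
4   2   1.0
5   2   2.0
6   3   1.0
7   3   2.0
8   3   3.0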
