I have the following code that masks values equal to ten, and then the next closest value. However, I need to apply it only if 10 occurs exactly once in the column ending in '_ans'. So the mask should only be applied to the column 'a_ans', because there are two 10s in 'b_ans'.
Any comments are welcome, thanks.
import pandas as pd

df = pd.DataFrame(data={'a_ans': [0, 1, 1, 10, 11],
                        'a_num': [1, 8, 90, 2, 8],
                        'b_ans': [0, 10, 139, 10, 18],
                        'b_num': [15, 43, 90, 14, 87]}).astype(float)

out = []
for i in ['a_', 'b_']:
    pairs = df.loc[:, df.columns.str.startswith(i)]  # pair columns
    mask1 = pairs[i+'ans'] == 10  # mask values equal to 10
    mask2 = pairs[i+'ans'].eq(pairs[i+'ans'].mask(mask1).max())  # get the next highest value
    pairs = pairs.mask(mask1, 1001).mask(mask2, 1002)  # replace values
    out.append(pairs)
You can use value_counts() to get the number of occurrences of each value within the column:
if pairs[i+'ans'].value_counts()[10] == 1:
    # apply mask logic
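A minimal sketch (not from either post) of how that single-occurrence check could slot into the question's loop; a boolean sum is used instead of indexing value_counts()[10] so that a column without any 10 does not raise a KeyError:
import pandas as pd

df = pd.DataFrame(data={'a_ans': [0, 1, 1, 10, 11],
                        'a_num': [1, 8, 90, 2, 8],
                        'b_ans': [0, 10, 139, 10, 18],
                        'b_num': [15, 43, 90, 14, 87]}).astype(float)

out = []
for i in ['a_', 'b_']:
    pairs = df.loc[:, df.columns.str.startswith(i)]  # pair columns
    if (pairs[i+'ans'] == 10).sum() == 1:            # 10 occurs exactly once
        mask1 = pairs[i+'ans'] == 10
        mask2 = pairs[i+'ans'].eq(pairs[i+'ans'].mask(mask1).max())
        pairs = pairs.mask(mask1, 1001).mask(mask2, 1002)
    out.append(pairs)

result = pd.concat(out, axis=1)  # only the 'a_' pair is masked here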
The following modifications could be useful, but it is not clear whether the next value should be the closest or the highest:
df = pd.DataFrame(data={'a_ans': [0, 1, 1, 10, 11],
                        'a_num': [1, 8, 90, 2, 8],
                        'b_ans': [0, 10, 139, 10, 18],
                        'b_num': [15, 43, 90, 14, 87]}).astype(float)

out = []
for i in ['a_', 'b_']:
    pairs = df.loc[:, df.columns.str.startswith(i + "ans")]  # only the _ans column
    if len(pairs[pairs[i+'ans'] == 10]) == 1:  # only one ten
        mask1 = pairs[i+'ans'] == 10  # mask values equal to 10
        mask2 = pairs[i+'ans'].eq(pairs[i+'ans'].mask(mask1).max())
        pairs = pairs.mask(mask1, 1001).mask(mask2, 1002)
    out.append(pairs)
Let's suppose that I have an array of (2,2) dimension:
matrix = np.zeros([2, 2])
I'd like to add the value 1 at the following position (3, 1).
Of course the matrix is too small.
How can I check whether the row and column indices exist in this array, and automatically extend it if they do not, for any new position? Thanks.
This simple code can detect whether the row and column exist:
- If they exist: change the value.
- If they don't:
  - create the missing rows with only zeros,
  - update the dimensions of matrix m,
  - create the missing columns with only zeros,
  - replace the value at the correct index.
import numpy as np

def add_value(matrix, value, row, column):
    # last valid row/column index of the current matrix
    nbcol = matrix.shape[1] - 1
    nbrow = len(matrix) - 1
    if nbcol >= column and nbrow >= row:
        matrix[row, column] = value
        return matrix
    else:
        m = matrix
        # append missing rows of zeros
        for i in range(nbrow, row):
            m = np.append(m, np.zeros([1, nbcol + 1]), axis=0)
        # update the dimensions of m
        nbrow = len(m) - 1
        nbcol = m.shape[1] - 1
        # append missing columns of zeros
        for i in range(nbcol, column):
            m = np.append(m, np.zeros([nbrow + 1, 1]), axis=1)
        m[row, column] = value
        return m
if __name__ == "__main__":
    m1 = np.zeros([2, 2])
    print(m1)
    m1 = add_value(m1, 6, 2, 2)
    print(m1)
Don't forget that the index starts at 0! So
m[0,0]
gives the value of the first row and first column!
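For reference, here is a shorter variant (my own sketch, assuming a 2-D array; add_value_pad is a made-up name) that grows the array with np.pad before assigning:
import numpy as np

def add_value_pad(matrix, value, row, column):
    # pad with zeros on the bottom/right only if the requested index is out of bounds
    pad_rows = max(0, row + 1 - matrix.shape[0])
    pad_cols = max(0, column + 1 - matrix.shape[1])
    if pad_rows or pad_cols:
        matrix = np.pad(matrix, ((0, pad_rows), (0, pad_cols)), mode='constant')
    matrix[row, column] = value
    return matrix

m = add_value_pad(np.zeros([2, 2]), 1, 3, 1)  # the (3, 1) position from the question
print(m)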
In an .xlsx file there is machine data logged in a way that is not suitable for further calculations. I've got a file that contains depth data of a cutting tool, and each depth increment comes with several further parameters such as pressure, rotational speed, forces and many more.
As you can see, in some datapoints the resolution of the depth parameter (0.01) is insufficient, as other parameters are updated more often. So I want to interpolate between two consecutive depth datapoints.
What is important to know: this effect doesn't occur at every depth. When the cutting tool moves fast, everything is fine.
Here is also an example file.
So I just need to interpolate the depth values when the difference between two consecutive depth datapoints is 0.01.
I've tried the following approach:
1. Open as DataFrame, rename columns, drop NaN, convert to list
2. Count identical depths in the list and transfer them to a DataFrame
3. Calculate the delta between depth i and depth i-1 (i.e. to the predecessor), replace NaN with 0
4. Divide delta depth by the number of time steps if 0.009 < delta depth < 0.011 --> interpolated depth
5. Build an empty list of lists where the number of elements in each sublist corresponds to the duration
6. Pass values from the interpolated depth to the respective sublists --> List 1
7. Transfer elements from delta_depth to the sublists --> List 2
8. Merge List 1 and List 2
9. Flatten the lists
10. Replace the original depth values with the interpolated values in the DataFrame
It looks like this, but at point 8 (merging) I don't get what I need:
import pandas as pd
from itertools import groupby
from itertools import zip_longest
import matplotlib.pyplot as plt
import numpy as np
#open and rename of some columns
df_raw=pd.read_excel(open('---.xlsx', 'rb'), sheet_name='---')
df_raw = df_raw.rename(columns={"---": "depth"})  # rename the logged depth column to 'depth'
#drop NaN
df_1=df_raw.dropna(subset=['depth'])
#convert to list
li = df_1['depth'].tolist()
#count identical depths in list and transfer them to dataframe
df_count = pd.DataFrame.from_records([[i, len([*group])] for i, group in groupby(li)])
df_count = df_count.rename(columns={0: "depth", 1: "duration"})
#calculate Delta between depth i and depth i-1 (i.e. to the predecessor), replace NaN with "0".
df_count["delta_depth"] = df_count["depth"].diff()
df_count=df_count.fillna(0)
#Divide delta depth by number of time steps if 0.009 < delta depth < 0.011
df_count["inter_depth"] = np.where(np.logical_and(df_count['delta_depth'] > 0.009, df_count['delta_depth'] < 0.011),df_count["delta_depth"] / df_count["duration"],0)
li2=df_count.values.tolist()
li_depth = df_count['depth'].tolist()
li_delta = df_count['delta_depth'].tolist()
li_duration = df_count['duration'].tolist()
li_inter = df_count['inter_depth'].tolist()
#empty List of Lists with the number of elements of the sublist corresponding to the duration
out=[]
for number in li_duration:
    out.append(li_inter[:number])
#Pass values from interpolated depth to the respective sublists --> Liste 1
out = [[i]*j for i, j in zip(li_inter, [len(j) for j in out])]
#Transfer elements from delta_depth to sublists --> Liste 2
def extractDigits(lst):
    return list(map(lambda el: [el], lst))
lst=extractDigits(li_delta)
#Merge list 1 and list 2
list1 = out
list2 = lst
new_list = []
for l1, l2 in zip_longest(list1, list2, fillvalue=[]):
    new_list.append([y if y else x for x, y in zip_longest(l1, l2)])
new_list
After merging, the first element of each sublist is the original depth value, followed by the interpolated values. But the sublists should contain only interpolated values.
Now I have the following questions:
- Is there, in general, a better approach to this problem?
- How could I solve the problem with merging, or ...
- ... find a way to override the wrong first elements in the sublists?
The desired result would look something like this.
Any help would be much appreciated, as I'm very inexperienced in Python and totally stuck.
I am sure someone could write something prettier, but I think this will work just fine:
Edit: it turned into some kind of messy scripting, but I think it will do what you need.
_list_helper1 = df["Depth [m]"].to_list()  # depth values shifted down by two rows
_list_helper1.insert(0, 0)                 # pad twice at the start to create the offset
_list_helper1.insert(0, 0)
_list_helper1 = _list_helper1[:-2]         # drop the last two items to keep the length
df["helper1"] = _list_helper1              # helper col offset by two rows
_list = df["Depth [m]"].to_list() # grab all depth values
_list.insert(0, 0) # insert a value at the beginning to offset from original col
_list = _list[0:-1] # Delete the very last item
df["helper"] = _list # add the list to a helper col which is now offset
df["delta depth"] = df["Depth [m]"] - df["helper"] # subtract helper col from original
_id = 0
for i in range(len(df)):
    if df.loc[i, "Depth [m]"] == df.loc[i, "helper"]:
        break_val = df.loc[i, "Depth [m]"]
        break_val_2 = df.loc[i+1, "Depth [m]"]
        if break_val_2 == break_val:
            df.loc[i, "IDcol"] = _id
            df.loc[i+1, "IDcol"] = _id
        else:
            _id += 1
depth = df["IDcol"].to_list()
depth = list(dict.fromkeys(depth))
depth = [x for x in depth if str(x) != 'nan']
increments = []
for i in depth:
    _df = df.copy()
    _df = _df[_df["IDcol"] == i]
    _df.reset_index(inplace=True, drop=True)
    div_by = len(_df)
    increment = _df.loc[0, "helper"] - _df.loc[0, "helper1"]
    _df["delta depth"] = increment / div_by
    _increment = increment / div_by
    base_value = _df.loc[0, "Depth [m]"]
    for y in range(div_by):
        _df.loc[y, "Depth [m]"] = base_value + ((y + 1) * _increment)
    increments.append(_df)
df["IDcol"] = df["IDcol"].fillna("KEEP")
df = df[df["IDcol"] == "KEEP"]
increments.append(df)
df = pd.concat(increments)
df = df.fillna(0)
df = df[["index", "Depth [m]", "delta depth", "IDcol"]] # and whatever other cols u want
Problem Statement:
I have a DataFrame that has to be filtered with multiple conditions.
Each condition is optional, which means that if an invalid value is entered by the user for a certain condition, that condition is skipped completely and the original DataFrame (without that specific condition applied) is returned.
While I can implement this quite easily with multiple if-conditions, modifying the DataFrame sequentially, I am looking for something more elegant and scalable (as the number of input parameters grows), preferably using built-in pandas functionality.
Reproducible Example
Dummy dataframe -
import pandas as pd

df = pd.DataFrame({'One': ['a','a','a','b'],
                   'Two': ['x','y','y','y'],
                   'Three': ['l','m','m','l']})
print(df)
One Two Three
0 a x l
1 a y m
2 a y m
3 b y l
Let's say that invalid values are values that don't belong to the respective column. So, for column 'One' all values other than 'a' and 'b' are invalid. If the user inputs 'a' then I should be able to filter the DataFrame with df[df['One']=='a']; however, if the user inputs any invalid value, no such filter should be applied and the original dataframe df is returned.
My attempt (with multiple parameters):
def valid_filtering(df, inp):
    if inp[0] in df['One'].values:
        df = df[df['One'] == inp[0]]
    if inp[1] in df['Two'].values:
        df = df[df['Two'] == inp[1]]
    if inp[2] in df['Three'].values:
        df = df[df['Three'] == inp[2]]
    return df
With all valid inputs -
inp = ['a','y','m'] #<- all filters valid so df is filtered before returning
print(valid_filtering(df, inp))
One Two Three
1 a y m
2 a y m
With few invalid inputs -
inp = ['a','NA','NA'] #<- only first filter is valid, so other 2 filters are ignored
print(valid_filtering(df, inp))
One Two Three
0 a x l
1 a y m
2 a y m
P.S. Additional question - is there a way to get DataFrame indexing to behave as -
df[df['One']=='valid'] -> returns filtered df
df[df['One']=='invalid'] -> returns original df
Because this would help me rewrite my filtering -
df[(df['One']=='valid') & (df['Two']=='invalid') & (df['Three']=='valid')] -> Filtered by col One and Three
EDIT: Solution -
An updated solution inspired by the code and logic provided by @corralien and @Ben.T:
df.loc[(df.eq(inp) | ~df.eq(inp).any(0)).all(1)]
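A quick check of that one-liner on the dummy DataFrame above (not from the original post):
inp = ['a', 'NA', 'NA']
print(df.loc[(df.eq(inp) | ~df.eq(inp).any(0)).all(1)])
  One Two Three
0   a   x     l
1   a   y     m
2   a   y     m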
Here is one way: create a Boolean DataFrame by comparing each value of inp with its column. Then use any along the rows to find the columns with at least one True, and, once those columns are selected, use all along the columns to keep the rows that are True in every one of them.
def valid_filtering(df, inp):
    # check where inp values are the same as in df
    m = (df == pd.DataFrame(data=[inp], index=df.index, columns=df.columns))
    # select the columns with at least one True
    cols = m.columns[m.any()]
    # select the rows that are all True amongst the wanted columns
    rows = m[cols].all(axis=1)
    # return df with the selected rows
    return df.loc[rows]
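For example, calling it with the partially invalid input from the question (a quick check, not part of the original answer):
print(valid_filtering(df, ['a', 'NA', 'NA']))
  One Two Three
0   a   x     l
1   a   y     m
2   a   y     m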
Note that if you don't have the same number of filters as columns in your original df, you can use a dictionary instead; it works too, as in the example below where the column Three is ignored because it is all False.
d = {'One': 'a', 'Two': 'y'}
m = (df==pd.DataFrame(d, index=df.index).reindex(columns=df.columns))
The key is: if a column returns all False (~b.any(), i.e. an invalid filter), then return True to accept all values of that column:
mask = df.eq(inp).apply(lambda b: np.where(~b.any(), True, b))
out = df.loc[mask.all(axis="columns")]
Case 1: inp = ['a','y','m'] (with all valid inputs)
>>> out
One Two Three
1 a y m
2 a y m
Case 2: inp = ['a','NA','NA'] (with few invalid inputs)
>>> out
One Two Three
0 a x l
1 a y m
2 a y m
Case 3: inp = ['NA','NA','NA'] (with all invalid inputs)
>>> out
One Two Three
0 a x l
1 a y m
2 a y m
3 b y l
Case 4: inp = ['b','x','m'] (with all valid inputs but no results)
>>> out
Empty DataFrame
Columns: [One, Two, Three]
Index: []
Of course, you can increase input parameters:
df["Four"] = ['i','j','k','k']
inp = ['a','NA','m','k']
>>> out
One Two Three Four
2 a y m k
Another way with list comprehension:
def valid_filtering(df, inp):
    series = [df[column] == inp[i]
              for i, column in enumerate(df.columns) if len(df[df[column] == inp[i]].values) > 0]
    for s in series:
        df = df[s]
    return df
Output of print(valid_filtering(df, ['a','NA','NA'])):
One Two Three
0 a x l
1 a y m
2 a y m
I have a dataset with a column 'Self_Employed'. In this column are the values 'Yes', 'No' and NaN. I want to replace the NaN values with a value that is calculated in calc(). I've tried some methods I found on here, but I couldn't find one that was applicable to me.
Here is my code; I put the things I've tried in comments:
# Handling missing data - Self_Employed
from random import randint

SEyes = (df['Self_Employed']=='Yes').sum()
SEno = (df['Self_Employed']=='No').sum()

def calc():
    rand_SE = randint(0, (SEno + SEyes))
    if rand_SE > 81:
        return 'No'
    else:
        return 'Yes'
# df['Self_Employed'] = df['Self_Employed'].fillna(randint(0,100))
# df['Self_Employed'].isnull().apply(lambda v: calc())

# df[df['Self_Employed'].isnull()] = df[df['Self_Employed'].isnull()].apply(lambda v: calc())
# df[df['Self_Employed']]

# df_nan['Self_Employed'] = df_nan['Self_Employed'].isnull().apply(lambda v: calc())
# df_nan

# for i in range(df['Self_Employed'].isnull().sum()):
#     print(df.Self_Employed[i])
df[df['Self_Employed'].isnull()] = df[df['Self_Employed'].isnull()].apply(lambda v: calc())
df
Now the line where I tried it with df_nan seems to work, but then I have a separate set with only the former missing values, whereas I want to fill the missing values in the whole dataset. For the last row I'm getting an error; I linked to a screenshot of it.
Do you understand my problem and if so, can you help?
This is the set with only the rows where Self_Employed is NaN
This is the original dataset
This is the error
Make sure that SEno + SEyes is not null, and use the .loc method to set the value of Self_Employed where it is empty:
import numpy as np

SEyes = (df['Self_Employed']=='Yes').sum() + 1
SEno = (df['Self_Employed']=='No').sum()

def calc():
    rand_SE = np.random.randint(0, (SEno + SEyes))
    if rand_SE >= 81:
        return 'No'
    else:
        return 'Yes'

df.loc[df['Self_Employed'].isna(), 'Self_Employed'] = df.loc[df['Self_Employed'].isna(), 'Self_Employed'].apply(lambda x: calc())
What about df['Self_Employed'] = df['Self_Employed'].fillna(calc())?
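One caveat (my own note): fillna(calc()) evaluates calc() only once, so every missing value receives the same label. A sketch that draws a fresh value per missing row could look like this:
mask = df['Self_Employed'].isna()
df.loc[mask, 'Self_Employed'] = [calc() for _ in range(mask.sum())]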
You could first identify the locations of your NaNs like
na_loc = df.index[df['Self_Employed'].isnull()]
Count the number of NaNs in your column like
num_nas = len(na_loc)
Then generate a corresponding number of random values, readily indexed and set up
import random

fill_values = pd.DataFrame({'Self_Employed': [random.randint(0, 100) for i in range(num_nas)]}, index=na_loc)
And finally substitute those values in your dataframe
df.loc[na_loc, 'Self_Employed'] = fill_values['Self_Employed']
Having a dataframe, 'df':
import pandas as pd

l = [['a',1,3,3,1,1,3,3,3], ['b',1,1,3,1,3,3,1,3], ['c',1,1,1,1,3,1,1,1]]
col = ['id','x1','x2','x3','x4','y1','y2','y3','y4']
df = pd.DataFrame(l, columns=col)
I want to count the number of rows (ids) with value "1" in each subset of subsets of X= {x1,x2,x3,x4} and Y = {y1,y2,y3,y4} columns.
For an example subset s1={ [x1,x3] , [y2,y3,y4] }, the code does:
df[(df['x1']==1) & (df['x3']==1) & (df['y2'] == 1) & (df['y3'] == 1) & (df['y4'] == 1)].count()['id']
which returns "1" as the count. I want to repeat this for all subsets in {subsets of X columns} x {subsets of Y columns}.
I need to first construct all subsets of subsets (using for example the function suggested here), and then perform the counts for each subset. What is the best way to perform this?
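A minimal sketch (my own, assuming that "count" means the number of rows where every selected column equals 1) that enumerates all non-empty subsets with itertools and performs the counts:
from itertools import chain, combinations

def subsets(cols):
    # all non-empty subsets of a list of column names
    return chain.from_iterable(combinations(cols, r) for r in range(1, len(cols) + 1))

X = ['x1', 'x2', 'x3', 'x4']
Y = ['y1', 'y2', 'y3', 'y4']

counts = {}
for xs in subsets(X):
    for ys in subsets(Y):
        cols = list(xs) + list(ys)
        counts[(xs, ys)] = int(df[cols].eq(1).all(axis=1).sum())

print(counts[(('x1', 'x3'), ('y2', 'y3', 'y4'))])  # 1, matching the example above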