I would like to interpolate missing values within groups in a dataframe, using the values of the preceding and following rows. Here is the df (there are more records within each group, but for this example I left 3 per group):
import numpy as np
import pandas as pd
df = pd.DataFrame({'Group': ['a','a','a','b','b','b','c','c','c'],'Yval': [1,np.nan,5,2,np.nan,8,5,np.nan,10],'Xval': [0,3,2,4,5,8,3,1,9],'PTC': [0,1,0,0,1,0,0,1,0]})
df:
Group Yval Xval PTC
0 a 1.0 0 0
1 a NaN 3 1
2 a 5.0 2 0
3 b 2.0 4 0
4 b NaN 5 1
5 b 8.0 8 0
6 c 5.0 3 0
7 c NaN 1 1
8 c 10.0 9 0
For PTC (point to calculate) I need to interpolate Yval using the Xval and Yval of the -1 and +1 rows.
I.e. for group 'a' I would like:
df.iloc[1,1]=np.interp(3, [0,2], [1,5])
Here is what I tried, using loc, the shift method, and the interp function found in this post:
df.loc[(df['PTC'] == 1), ['Yval']]= \
np.interp(df['Xval'], (df['Xval'].shift(+1),df['Xval'].shift(-1)),(df['Yval'].shift(+1),df['Yval'].shift(-1)))
Error I get:
ValueError: object too deep for desired array
One working approach is to build the neighbour columns with shift and interpolate row-wise:
df['Xval-1'] = df['Xval'].shift(+1)  # previous row's Xval
df['Xval+1'] = df['Xval'].shift(-1)  # next row's Xval
df['Yval-1'] = df['Yval'].shift(+1)  # previous row's Yval
df['Yval+1'] = df['Yval'].shift(-1)  # next row's Yval
# np.interp expects increasing x-coordinates, so pass (previous, next) in that order
df["PTC_interpol"] = df.apply(lambda x: np.interp(x['Xval'], [x['Xval-1'], x['Xval+1']], [x['Yval-1'], x['Yval+1']]), axis=1)
df['Yval'] = np.where(df['Yval'].isnull(), df["PTC_interpol"], df['Yval'])
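Note that a bare shift crosses group boundaries, so a PTC row sitting at the edge of a group would pick up neighbours from the adjacent group. A minimal grouped sketch (interp_group is just an illustrative name; it assumes, as in the sample data, that the previous Xval is smaller than the next one, since np.interp requires increasing x-coordinates):
def interp_group(g):
    g = g.copy()
    # neighbours taken within this group only
    x_prev, x_next = g['Xval'].shift(1), g['Xval'].shift(-1)
    y_prev, y_next = g['Yval'].shift(1), g['Yval'].shift(-1)
    mask = g['PTC'] == 1
    g.loc[mask, 'Yval'] = [
        np.interp(x, [x0, x1], [y0, y1])
        for x, x0, x1, y0, y1 in zip(g.loc[mask, 'Xval'], x_prev[mask],
                                     x_next[mask], y_prev[mask], y_next[mask])
    ]
    return g

df = df.groupby('Group', group_keys=False).apply(interp_group)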
I'm importing data from Excel where some rows have notes in a column and are not truly part of the dataframe. Dummy example below:
H1 H2 H3
*highlighted cols are PII
sam red 5
pam blue 3
rod green 11
* this is the end of the data
When the above file is imported into dfPA it looks like:
dfPA:
Index H1 H2 H3
1 *highlighted cols are PII
2 sam red 5
3 pam blue 3
4 rod green 11
5 * this is the end of the data
I want to delete the first and last row. This is what I've done.
#get count of cols in df
input: cntcols = dfPA.shape[1]
output: 3
#get count of NaN cells in each row
input: a = dfPA.shape[1] - dfPA.count(axis=1)
output:
1    2
2    0
3    0
4    0
5    2
(where a is a series)
#convert a from series to df
dfa = a.to_frame()
#delete rows where no. of nan's are greater than 'n'
n = 1
for r, row in dfa.iterrows():
    if (cntcols - dfa.iloc[r][0]) > n:
        i = row.name
        dfPA = dfPA.drop(index=i)
This doesn't work. Is there a way to do this?
You should use the pandas.DataFrame.dropna method. It has a thresh parameter that defines the minimum number of non-NaN values a row/column must have to be kept.
Imagine the following dataframe:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([[1,np.nan,1,np.nan], [1,1,1,1], [1,np.nan,1,1], [np.nan,1,1,1]], columns=list('ABCD'))
A B C D
0 1.0 NaN 1 NaN
1 1.0 1.0 1 1.0
2 1.0 NaN 1 1.0
3 NaN 1.0 1 1.0
You can drop columns containing any NaN using:
>>> df.dropna(axis=1)
C
0 1
1 1
2 1
3 1
The thresh parameter defines the minimum number of non-NaN values to keep the column:
>>> df.dropna(thresh=3, axis=1)
A C D
0 1.0 1 NaN
1 1.0 1 1.0
2 1.0 1 1.0
3 NaN 1 1.0
If you want to reason in terms of the number of NaN (a column holds len(df) values in total):
# example for a minimum of 2 NaN to drop the column
>>> df.dropna(thresh=len(df)-(2-1), axis=1)
If the rows rather than the columns need to be filtered, remove the axis parameter or use axis=0:
>>> df.dropna(thresh=3)
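Applied to the dfPA example from the question, a minimal sketch (assuming the note rows are the only ones with more than n NaN):
n = 1
# keep rows with at least (number of columns - n) non-NaN values,
# i.e. drop rows that have more than n NaN
dfPA = dfPA.dropna(thresh=dfPA.shape[1] - n)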
I need to sum up the values of column 'D' for every row with the same combination of values in columns 'A', 'B' and 'C'. Eventually I need to create a DataFrame with the unique combinations of values from columns 'A', 'B' and 'C' and the corresponding sum in column D.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,3,size=(10,4)),columns=list('ABCD'))
df
Out:
A B C D
0 0 2 0 2
1 0 1 2 1
2 0 0 2 0
3 1 2 2 2
4 0 2 2 2
5 0 2 2 2
6 2 2 2 1
7 2 1 1 1
8 1 0 2 0
9 1 2 0 0
I've tried to create a temporary data frame with empty cells:
D = pd.DataFrame([i for i in range(len(df))]).rename(columns = {0:'D'})
D['D'] = ''
D
Out:
D
0
1
2
3
4
5
6
7
8
9
And use apply() to sum up all 'D' column values for each unique row combination of columns 'A', 'B' and 'C'. For example, the line below returns the sum of values from the 'D' column for 'A'=0, 'B'=2, 'C'=2:
df[(df['A']==0) & (df['B']==2) & (df['C']==2)]['D'].sum()
Out:
4
function:
def Sumup(cols):
    A = cols[0]
    B = cols[1]
    C = cols[2]
    D = cols[3]
    sum = df[(df['A']==A) & (df['B']==B) & (df['C']==C)]['D'].sum()
    return sum
Applied to df and saved in the temporary df D['D']:
D['D'] = df[['A','B','C','D']].apply(Sumup)
Later I wanted to use drop_duplicates, but I received a dataframe consisting of NaN's.
D
Out:
D
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
Could anyone give me a hint on how to handle the NaN problem, or suggest another approach to solve the original problem?
df.groupby(['A','B','C']).sum()
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,3,size=(10,4)),columns=list('ABCD'))
df.groupby(["A", "B", "C"])["D"].sum()
Assume that we have the following pandas dataframe:
df = pd.DataFrame({'x':[0,0,1,0,0,0,0],'y':[1,1,1,1,1,1,0],'z':[0,1,1,1,0,0,1]})
x y z
0 0 1 0
1 0 1 1
2 1 1 1
3 0 1 1
4 0 1 0
5 0 1 0
6 0 0 1
The whole dataframe is filled with either 1 or 0. Looking at each column separately, if the current row value differs from the previous value, I need to count the number of preceding consecutive values:
   x  y  z
0
1        1
2  2
3  1
4        3
5
6     6  2
I tried to write a lambda function and apply it to the entire dataframe, but I failed. Any idea?
Let's try this:
def f(col):
    x = (col != col.shift().bfill())
    s = x.cumsum()
    return s.groupby(s).transform('count').shift().where(x)

df.apply(f).fillna('')
Output:
   x  y  z
0
1        1
2  2
3  1
4        3
5
6     6  2
Details:
Use apply to run a custom function on each column of the dataframe.
Find the difference spots in the column, then use cumsum to create groups of consecutive values, then use groupby and transform to attach the group size to each record, and finally mask the column with where so only the difference spots keep their counts.
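For the x column, for instance, the intermediate steps work out as follows (traced by hand on the sample data):
col = df['x']                     # [0, 0, 1, 0, 0, 0, 0]
x = (col != col.shift().bfill())  # [F, F, T, T, F, F, F]  difference spots
s = x.cumsum()                    # [0, 0, 1, 2, 2, 2, 2]  run groups
# transform('count') gives [2, 2, 1, 4, 4, 4, 4]; after shift() and where(x),
# only rows 2 and 3 keep their values: 2 and 1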
You can try the following, where you identify the "runs" first and then get the run lengths. You only get an entry where the value switches, so the output holds the lengths of the runs except the last one.
import pandas as pd
import numpy as np

def func(x, missing=np.nan):
    runs = np.cumsum(np.append(0, np.diff(x) != 0))
    switches = np.where(np.diff(x) != 0)[0] + 1
    out = np.repeat(missing, len(x))
    out[switches] = np.bincount(runs)[:-1]
    # thanks to Scott, see comments below
    ##out[switches] = pd.value_counts(runs, sort=False)[:-1]
    return out
df.apply(func)
x y z
0 NaN NaN NaN
1 NaN NaN 1.0
2 2.0 NaN NaN
3 1.0 NaN NaN
4 NaN NaN 3.0
5 NaN NaN NaN
6 NaN 6.0 2.0
It might be faster with a good run-length-encoding implementation, but I am not too familiar with one in Python.
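For reference, a run-length-encoding sketch in NumPy (rle_counts is just an illustrative name, not a library routine) could look like:
def rle_counts(x, missing=np.nan):
    a = np.asarray(x)
    # indices where a new run starts (including the first run at index 0)
    starts = np.append(0, np.where(np.diff(a) != 0)[0] + 1)
    # run lengths: distance between consecutive run starts
    lengths = np.diff(np.append(starts, len(a)))
    out = np.repeat(missing, len(a))
    # each switch position gets the length of the run that just ended there
    out[starts[1:]] = lengths[:-1]
    return out

df.apply(rle_counts)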
I have a 21840x39 data frame. A few of my columns are numerically valued, and I want to make sure they all share the same data type (which I want to be float).
Instead of naming all the columns out and converting them:
df[['A', 'B', 'C', ...]] = df[['A', 'B', 'C', ...]].astype(float)
Can I do a for loop that will allow me to say something like "convert to float from column 18 to column 35"?
I know how to do one column: df['A'] = df['A'].astype(float)
But how can I do multiple columns? I tried with list slicing within a loop but couldn't get it right.
The first idea is to convert the selected columns; Python counts from 0, so for columns 18 to 35 use:
df.iloc[:, 17:35] = df.iloc[:, 17:35].astype(float)
If that does not work (possibly due to a pandas bug), use another solution:
df = df.astype(dict.fromkeys(df.columns[17:35], float))
Sample - convert the 8th to 15th columns:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(3, 18)),
                  columns=list('abcdefghijklmnopqr')).astype(str)
print (df)
a b c d e f g h i j k l m n o p q r
0 0 8 3 6 3 3 7 8 0 0 8 9 3 7 2 3 6 5
1 0 4 8 6 4 1 1 5 9 5 6 6 6 5 4 6 4 2
2 3 4 7 1 4 9 3 2 0 9 1 2 7 1 0 2 8 8
df = df.astype(dict.fromkeys(df.columns[7:15], float))
print (df)
a b c d e f g h i j k l m n o p q r
0 0 8 3 6 3 3 7 8.0 0.0 0.0 8.0 9.0 3.0 7.0 2.0 3 6 5
1 0 4 8 6 4 1 1 5.0 9.0 5.0 6.0 6.0 6.0 5.0 4.0 6 4 2
2 3 4 7 1 4 9 3 2.0 0.0 9.0 1.0 2.0 7.0 1.0 0.0 2 8 8
Tweaked @jezrael's code, as typing in column names is (I feel) a good option.
import pandas as pd
import numpy as np
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(3, 18)),
                  columns=list('abcdefghijklmnopqr')).astype(str)
print(df)
columns = list(df.columns)
#change the first and last column names below as required
df = df.astype(dict.fromkeys(
    df.columns[columns.index('h'):(columns.index('o')+1)], float))
print (df)
Leaving the original answer below, but note: never loop in pandas if vectorized alternatives exist.
If I had a dataframe and wanted to change columns 'col3' to 'col5' (human readable names) to floats I could...
import pandas as pd
df = pd.read_csv('dummy_data.csv')
df
columns = list(df.columns)
#change the first and last column names below as required
start_column = columns.index('col3')
end_column = columns.index('col5')
for index, col in enumerate(columns):
    if (start_column <= index) & (index <= end_column):
        df[col] = df[col].astype(float)
df
...by just changing the column names. Perhaps it's easier to work with column names and think in terms of 'from this one' to 'that one' (inclusive).
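A vectorized alternative to the loop, using pandas label-based slicing (which, unlike positional slicing, is inclusive on both ends), might look like this, assuming the same col3/col5 names:
cols = df.loc[:, 'col3':'col5'].columns
df[cols] = df[cols].astype(float)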
I have a DataFrame object and I'm grouping by some keys and counting the results. The problem is that I want to replace one of the DataFrame's columns with a ratio between the counts.
df.groupby(['A','B', 'C'])['C'].count().apply(f).reset_index()
I'm looking for an f that replaces column C with the ratio (# times C==1) / (# times C==0) for each combination of A and B.
Is this what you want?
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {'A': [1,2,3,1,2,3],
     'B': [2,0,1,2,0,1],
     'C': [1,1,0,1,1,1]})
print(df)
def f(x):
    if np.count_nonzero(x==0) == 0:
        return np.nan
    else:
        return np.count_nonzero(x==1) / np.count_nonzero(x==0)
result = df.groupby(['A','B'])['C'].apply(f).reset_index()
print(result)
Result:
#df
A B C
0 1 2 1
1 2 0 1
2 3 1 0
3 1 2 1
4 2 0 1
5 3 1 1
#result
A B C
0 1 2 NaN
1 2 0 NaN
2 3 1 1.0
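For larger frames, a sketch of a vectorized equivalent without apply (the inf produced by a group containing no zeros is replaced with NaN to match f):
g = df.groupby(['A', 'B'])['C']
ones = g.sum()               # count of C == 1, since C only holds 0/1
zeros = g.count() - ones     # count of C == 0
result = (ones / zeros).replace(np.inf, np.nan).reset_index()
print(result)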