I have a dataframe as shown below:
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1],
    'val': [5, 6.4, 5.4, 6, 6, 6]
})
I would like to drop the rows where the val column has a nonzero fractional part, i.e. ends with .[1-9]. Basically I would like to retain values like 5.0 and 6.0 and drop values like 5.4, 6.4, etc.
Though I tried the below, it isn't accurate:
df['val'] = df['val'].astype(int)
df.drop_duplicates() # doesn't give the expected output; not accurate
I expect my output to retain only the rows where val is a whole number (rows 0, 3, 4 and 5).
First idea is to compare the original values with the column cast to integer, and also assign the integers back if you want integers in the output:
s = df['val']                      # keep a reference to the original float values
df['val'] = df['val'].astype(int)  # replace the column with its integer cast
df = df[df['val'] == s]            # keep rows where casting did not change the value
print (df)
subject_id val
0 1 5
3 1 6
4 1 6
5 1 6
Another idea is to test with float.is_integer:
mask = df['val'].apply(lambda x: x.is_integer())  # build the mask before casting
df['val'] = df['val'].astype(int)
df = df[mask]
print (df)
subject_id val
0 1 5
3 1 6
4 1 6
5 1 6
If you need floats in the output you can use:
df1 = df[ df['val'].astype(int) == df['val']]
print (df1)
subject_id val
0 1 5.0
3 1 6.0
4 1 6.0
5 1 6.0
Use mod 1 to determine the remainder. If the remainder is 0, the number is an integer. Then use the result as a mask to select only those rows.
df.loc[df.val.mod(1).eq(0)].astype(int)
subject_id val
0 1 5
3 1 6
4 1 6
5 1 6
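One caveat worth adding (my note, not part of the original answers): if the floats are the result of arithmetic, an exact whole-number test like mod(1).eq(0) can be defeated by tiny rounding error. A hedged sketch using numpy.isclose with a tolerance instead:

import numpy as np

# treat values within floating-point tolerance of a whole number as integers
mask = np.isclose(df['val'], df['val'].round())
print(df[mask])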
This might be quite an easy problem, but I can't deal with it properly and didn't find the exact answer here. So, let's say we have a pandas DataFrame as below:
df:
ID a b c d
0 1 3 4 9
1 2 8 8 3
2 1 3 10 12
3 0 1 3 0
I want to remove all the rows that contain repeating values in different columns. In other words, I am only interested in keeping rows with unique values. Referring to the above example, the desired output should be:
ID a b c d
0 1 3 4 9
2 1 3 10 12
(I kept the ID values unchanged on purpose to make the comparison easier.) Please let me know if you have any ideas. Thanks!
You can compare the length of a set of each row's values with the number of columns:
lc = len(df.columns)
df1 = df[df.apply(lambda x: len(set(x)) == lc, axis=1)]
print (df1)
a b c d
ID
0 1 3 4 9
2 1 3 10 12
Or test with Series.duplicated and Series.any:
df1 = df[~df.apply(lambda x: x.duplicated().any(), axis=1)]
Or DataFrame.nunique:
df1 = df[df.nunique(axis=1).eq(lc)]
Or:
df1 = df[[len(set(x)) == lc for x in df.to_numpy()]]
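If performance matters on larger frames, here is a vectorized sketch (an assumption on my part: it requires ID to be the index, as in the output above, and all remaining columns to share a comparable dtype). Sorting each row puts any repeats next to each other, so a row is unique exactly when no two adjacent sorted values are equal:

import numpy as np

a = np.sort(df.to_numpy(), axis=1)
# a row has a repeat iff two adjacent sorted values are equal
mask = (a[:, 1:] != a[:, :-1]).all(axis=1)
df1 = df[mask]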
I am having trouble with pandas.
I am trying to compare each value of a row to another one.
For each date I have the daily variation of some stocks.
I want to compare each stock's variation to the variation of the column labelled 'CAC 40'.
If the value is greater I want to turn it into 1, or into 0 if lower.
This should return a dataframe filled only with 1s and 0s so I can then summarize by column.
I have tried the apply method below, but it doesn't work; it returns a pandas Series instead of a DataFrame.
def compare_to_cac(row):
    for i in row:
        if row[i] >= row['CAC 40']:
            return 1
        else:
            return 0
data2 = data.apply(compare_to_cac, axis=1)
Please can someone help me out?
I worked with this data (column names are not important here, only the CAC 40 one is):
A B CAC 40
0 0 2 9
1 1 3 9
2 2 4 1
3 3 5 2
4 4 7 2
With just a for loop:
import numpy as np

for column in df.columns:
    if column == "CAC 40":
        continue
    condition = [df[column] > df["CAC 40"]]
    value = [1]
    df[column] = np.select(condition, value, default=0)
Which gives me as a result:
A B CAC 40
0 0 0 9
1 0 0 9
2 1 1 1
3 1 1 2
4 1 1 2
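As a side note (my addition, not the answerer's), the loop can be replaced with one vectorized comparison via DataFrame.gt with axis=0, which compares every remaining column to 'CAC 40' at once:

# compare all other columns to 'CAC 40' in one shot, then write back as 0/1
mask = df.drop(columns='CAC 40').gt(df['CAC 40'], axis=0)
df[mask.columns] = mask.astype(int)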
I've got a pd.DataFrame with four columns:
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2],
                   'A': ['H', 'H', 'E', 'E', 'H', 'E', 'E', 'H', 'H'],
                   'B': [4, 5, 2, 7, 6, 1, 3, 1, 0],
                   'C': ['M', 'D', 'M', 'D', 'M', 'M', 'M', 'D', 'D']})
id A B C
0 1 H 4 M
1 1 H 5 D
2 1 E 2 M
3 1 E 7 D
4 1 H 6 M
5 2 E 1 M
6 2 E 3 M
7 2 H 1 D
8 2 H 0 D
I'd like to group by id and get, for each id, the value of B at the nth (let's say second) occurrence of A = 'H' into agg_B1, and the value of B at the nth (let's say first) occurrence of C = 'M' into agg_B2.
desired output:
id agg_B1 agg_B2
0 1 5 4
1 2 0 1
desired_output = df.groupby('id').agg(
    agg_B1=('B', lambda x: x[df.loc[x.index].loc[df.A == 'H'][1]]),
    agg_B2=('B', lambda x: x[df.loc[x.index].loc[df.C == 'M'][0]])
).reset_index()
TypeError: Indexing a Series with DataFrame is not supported, use the appropriate DataFrame column
Obviously, I'm doing something wrong with the indexing.
Edit: if possible, I'd like to use aggregate with lambda function, because there are multiple aggregate outputs of other sorts that I'd like to extract at the same time.
It is possible to change your solution if you need GroupBy.agg:
desired_output = df.groupby('id').agg(
    agg_B1=('B', lambda x: x[df.loc[x.index, 'A'] == 'H'].iat[1]),
    agg_B2=('B', lambda x: x[df.loc[x.index, 'C'] == 'M'].iat[0])
).reset_index()
print (desired_output)
id agg_B1 agg_B2
0 1 5 4
1 2 0 1
But if performance is important, or you are not sure that a second value matching H always exists for the first condition, I suggest processing each condition separately and then joining back to the original aggregated values:
#some sample aggregations
df0 = df.groupby('id').agg({'B':'sum', 'C':'last'})
df1 = df[df['A'].eq('H')].groupby("id")['B'].nth(1).rename('agg_B1')
df2 = df[df['C'].eq('M')].groupby("id")['B'].first().rename('agg_B2')
desired_output = pd.concat([df0, df1, df2], axis=1)
print (desired_output)
B C agg_B1 agg_B2
id
1 24 M 5 4
2 5 D 0 1
EDIT1: If you need GroupBy.agg, it is possible to catch the failed indexing and substitute a missing value:
# for the second value this works nicely on the sample
import numpy as np

def f1(x):
    try:
        return x[df.loc[x.index, 'A'] == 'H'].iat[1]
    except IndexError:
        return np.nan
desired_output = df.groupby('id').agg(
    agg_B1=('B', f1),
    agg_B2=('B', lambda x: x[df.loc[x.index, 'C'] == 'M'].iat[0])
).reset_index()
print (desired_output)
id agg_B1 agg_B2
0 1 5 4
1 2 0 1
# the third value does not exist, so NaN is added as the missing value
def f1(x):
    try:
        return x[df.loc[x.index, 'A'] == 'H'].iat[2]
    except IndexError:
        return np.nan
desired_output = df.groupby('id').agg(
    agg_B1=('B', f1),
    agg_B2=('B', lambda x: x[df.loc[x.index, 'C'] == 'M'].iat[0])
).reset_index()
print (desired_output)
id agg_B1 agg_B2
0 1 6.0 4
1 2 NaN 1
Which works the same as:
df1 = df[df['A'].eq('H')].groupby("id")['B'].nth(2).rename('agg_B1')
df2 = df[df['C'].eq('M')].groupby("id")['B'].first().rename('agg_B2')
desired_output = pd.concat([df1, df2], axis=1)
print (desired_output)
agg_B1 agg_B2
id
1 6.0 4
2 NaN 1
Filter for rows where A equals H, then grab the second row with the nth function:
df.query("A=='H'").groupby("id").nth(1)
A B
id
1 H 5
2 H 0
Python uses zero-based indexing, so the second row is nth(1).
I have the df below:
Cost,Reve
0,3
4,0
0,0
10,10
4,8
len(df['Cost']) = 300
len(df['Reve']) = 300
I need to divide df['Cost'] / df['Reve']
Below is my code
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
I got the error ValueError: Columns must be same length as key
df['C/R'] = df[['Cost']].div(df['Reve'].values, axis=0)
I got the error ValueError: Wrong number of items passed 2, placement implies 1
The problem is duplicated column names; verify with:
#generate duplicates
df = pd.concat([df, df], axis=1)
print (df)
Cost Reve Cost Reve
0 0 3 0 3
1 4 0 4 0
2 0 0 0 0
3 10 10 10 10
4 4 8 4 8
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
print (df)
# ValueError: Columns must be same length as key
You can find these column names with:
print (df.columns[df.columns.duplicated(keep=False)])
Index(['Cost', 'Reve', 'Cost', 'Reve'], dtype='object')
If the duplicated columns contain the same values, it is possible to remove the duplicates with:
df = df.loc[:, ~df.columns.duplicated()]
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
#simplify division
df['C/R'] = df['Cost'].div(df['Reve'])
print (df)
Cost Reve C/R
0 0 3 0.0
1 4 0 inf
2 0 0 NaN
3 10 10 1.0
4 4 8 0.5
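A follow-up sketch (my addition, not part of the original answer): 4/0 produces inf and 0/0 produces NaN, so if you prefer a clean numeric column you can normalize both, assuming 0 is an acceptable placeholder:

import numpy as np

# replace the inf from x/0 with NaN, then fill every NaN with 0
df['C/R'] = df['Cost'].div(df['Reve']).replace(np.inf, np.nan).fillna(0)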
The issue lies in the size of the data that you are trying to assign to the columns. I had an issue with this:
df[['X1','X2', 'X3']] = pd.DataFrame(df.X.tolist(), index= df.index)
I was trying to assign the values of X to the 3 columns X1, X2, X3, assuming that X had 3 values per row, but X had 4 values.
So the revised code in my case was:
df[['X1','X2', 'X3','X4']] = pd.DataFrame(df.X.tolist(), index= df.index)
I had the same error, but it did not come from the above two issues. In my case the columns had the same length. What helped me was transforming my Series to a DataFrame with pd.DataFrame() and then assigning its values to a new column of my existing df, as sketched below.
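A minimal sketch of that workaround (the names s and new_col are hypothetical examples of mine; to_numpy().ravel() flattens the (n, 1) array so it assigns as a single column):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
s = pd.Series([10, 20, 30], index=[7, 8, 9])  # hypothetical Series with a mismatched index

# going through a DataFrame's raw values bypasses index alignment
df['new_col'] = pd.DataFrame(s).to_numpy().ravel()
print(df)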
I can return the index of the last valid item, but I'm hoping to subset a df using the same method. For instance, the code below returns the index of the last time 2 appears in the df, but I want to return the df up to and including this index.
import pandas as pd
df = pd.DataFrame({
'Number' : [2,3,2,4,2,1],
'Code' : ['x','a','b','c','f','y'],
})
df_last = df[df['Number'] == 2].last_valid_index()
print(df_last)
4
Intended Output:
Number Code
0 2 x
1 3 a
2 2 b
3 4 c
4 2 f
You can use loc, but this solution works only if at least one value 2 exists in the column (otherwise last_valid_index returns None and loc[:None] silently returns the whole frame):
df = df.loc[:df[df['Number'] == 2].last_valid_index()]
print (df)
Number Code
0 2 x
1 3 a
2 2 b
3 4 c
4 2 f
A general solution, which also works when there is no match, would be:
df = df[(df['Number'] == 2)[::-1].cumsum().ne(0)[::-1]]
print (df)
Number Code
0 2 x
1 3 a
2 2 b
3 4 c
4 2 f
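Why this works: reversing the boolean mask and taking a cumulative sum gives a count that is nonzero at every position up to and including the last 2, and ne(0) turns that count into a keep-mask. Unlike the loc version, it degrades gracefully; a quick sanity check I added, with no 2 present:

df2 = pd.DataFrame({'Number': [3, 4], 'Code': ['a', 'b']})
print(df2[(df2['Number'] == 2)[::-1].cumsum().ne(0)[::-1]])
# prints an empty DataFrame with columns [Number, Code]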