For example, suppose I'm looking at flower species, number of petals, germination time and user ID; the user ID will contain a hyphen, so I don't want to use it in my data analysis. I'm aware that I can hard-code which columns to drop, but I want it so that when I read in any dataset, columns with non-numeric entries are removed automatically.
Edit: my question was unclear. I'm reading in data from a CSV file using pandas.
Example:
  Species  NPetals  GermTime  UserID
1    R. G        5         4   65-78
2    R. F        5         3   65-81
I want to remove the UserID and Species columns from the dataset.
As shown in the docs, you can select just the numeric data by filtering with select_dtypes:
In [5]:
df = pd.DataFrame({'a': np.random.randn(6).astype('f4'),
                   'b': [True, False] * 3,
                   'c': [1.0, 2.0] * 3})
df
Out[5]:
a b c
0 0.338710 True 1
1 1.530095 False 2
2 -0.048261 True 1
3 -0.505742 False 2
4 0.729667 True 1
5 -0.634482 False 2
In [15]:
df.select_dtypes(include=[np.number])
Out[15]:
a c
0 0.338710 1
1 1.530095 2
2 -0.048261 1
3 -0.505742 2
4 0.729667 1
5 -0.634482 2
You can pass any valid NumPy dtype hierarchy to include or exclude, e.g. np.number for all numeric types.
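Applied to the original question's data, a minimal sketch (the CSV file name here is an assumption):
import numpy as np
import pandas as pd

# 'flowers.csv' is a placeholder name; the file is assumed to hold the
# Species, NPetals, GermTime and UserID columns from the question
df = pd.read_csv('flowers.csv')

# Keep only numeric columns; object columns such as Species and UserID are dropped
numeric_df = df.select_dtypes(include=[np.number])
print(numeric_df.columns.tolist())   # expected: ['NPetals', 'GermTime']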
It's easy to create a vector of bools in pandas for testing values, such as
DF['a'] > 10
but how do you write
DF['a'] in list
to generate a vector of bools based on whether each value in the Series is a member of some list or other? I am getting a ValueError.
I know I can loop through the data pretty simply, but is this possible without having to do that?
Use the isin method:
DF['a'].isin(list)
Example:
DF = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5) * 2})
print(DF)
a b
0 0 0
1 1 2
2 2 4
3 3 6
4 4 8
print(DF['a'].isin([0,2,3]))
0 True
1 False
2 True
3 True
4 False
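If you want the matching rows rather than the boolean mask itself, the mask can be used directly for boolean indexing; a small self-contained sketch:
import numpy as np
import pandas as pd

DF = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5) * 2})

kept = DF[DF['a'].isin([0, 2, 3])]      # rows whose 'a' is in the list
dropped = DF[~DF['a'].isin([0, 2, 3])]  # ~ negates the mask: "not in list"
print(kept)
print(dropped)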
I have a pandas dataframe like this:
a b c
0 1 1 1
1 1 1 0
2 2 4 1
3 3 5 0
4 3 5 0
where the first 2 columns ('a' and 'b') are IDs while the last one ('c') is a validation flag (0 = neg, 1 = pos). I know how to remove duplicates based on the values of the first 2 columns; however, in this case I would also like to get rid of inconsistent data, i.e. duplicated data validated both as positive and negative. For example, the first 2 rows are duplicated but inconsistent, so I should remove the entire record, while the last 2 rows are both duplicated and consistent, so I'd keep one of them. The expected result should be:
a b c
0 2 4 1
1 3 5 0
The real dataframe can have more than two duplicates per group, and as you can see the index has also been reset. Thanks.
First filter the rows with GroupBy.transform and SeriesGroupBy.nunique to keep, via boolean indexing, only the groups with a single unique value of c, then apply DataFrame.drop_duplicates:
df = (df[df.groupby(['a','b'])['c'].transform('nunique').eq(1)]
.drop_duplicates(['a','b'])
.reset_index(drop=True))
print (df)
a b c
0 2 4 1
1 3 5 0
Detail:
print (df.groupby(['a','b'])['c'].transform('nunique'))
0 2
1 2
2 1
3 1
4 1
Name: c, dtype: int64
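For reference, a self-contained version of the above that rebuilds the question's DataFrame:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3, 3],
                   'b': [1, 1, 4, 5, 5],
                   'c': [1, 0, 1, 0, 0]})

# Keep only (a, b) groups whose 'c' is consistent, then drop the duplicates
mask = df.groupby(['a', 'b'])['c'].transform('nunique').eq(1)
out = df[mask].drop_duplicates(['a', 'b']).reset_index(drop=True)
print(out)
#    a  b  c
# 0  2  4  1
# 1  3  5  0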
I have a DataFrame:
User Numbers
A 0
A 4
A 5
B 0
B 0
C 1
C 3
I want to perform an operation on each group of data. For example, if I want to remove all Users whose Numbers are all 0, it should look like:
User Numbers
A 0
A 4
A 5
C 1
C 3
since all of User B's Numbers are 0.
Or, for example, if I want to find the variance of the Numbers for each User, it should look like:
Users Variance
A 7
B 0
C 2
This means only A's Numbers are used to compute the variance for A, and so on.
Is there a general way to do all these computations for matching grouped data?
You want two different operations: filtering per group and aggregation per group.
Filtering:
For better performance, use transform to build a boolean mask and then filter with boolean indexing.
df1 = df[~df['Numbers'].eq(0).groupby(df['User']).transform('all')]
print (df1)
User Numbers
0 A 0
1 A 4
2 A 5
5 C 1
6 C 3
Steps:
1. First create a boolean Series by comparing Numbers with 0 using eq:
print (df['Numbers'].eq(0))
0 True
1 False
2 False
3 True
4 True
5 False
6 False
Name: Numbers, dtype: bool
2. Then use a bit of syntactic sugar: group this mask by the User column and apply the all function via transform, which checks whether every value in each group is True and returns a mask the same length as the original DataFrame:
print (df['Numbers'].eq(0).groupby(df['User']).transform('all'))
0 False
1 False
2 False
3 True
4 True
5 False
6 False
Name: Numbers, dtype: bool
3. Invert the boolean mask with ~:
print (~df['Numbers'].eq(0).groupby(df['User']).transform('all'))
0 True
1 True
2 True
3 False
4 False
5 True
6 True
Name: Numbers, dtype: bool
4. Filter:
print (df[~df['Numbers'].eq(0).groupby(df['User']).transform('all')])
User Numbers
0 A 0
1 A 4
2 A 5
5 C 1
6 C 3
Another, slower solution for large DataFrames uses filter with the same logic as the first solution:
df2 = df.groupby('User').filter(lambda x: ~x['Numbers'].eq(0).all())
print (df2)
User Numbers
0 A 0
1 A 4
2 A 5
5 C 1
6 C 3
Aggregation:
For a simple aggregation of one column with one aggregate function, e.g. GroupBy.var, use:
df3 = df.groupby('User', as_index=False)['Numbers'].var()
print (df3)
User Numbers
0 A 7
1 B 0
2 C 2
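Putting the two operations together, a minimal end-to-end sketch that rebuilds the question's DataFrame:
import pandas as pd

df = pd.DataFrame({'User': ['A', 'A', 'A', 'B', 'B', 'C', 'C'],
                   'Numbers': [0, 4, 5, 0, 0, 1, 3]})

# Filtering: drop users whose Numbers are all zero (removes B)
filtered = df[~df['Numbers'].eq(0).groupby(df['User']).transform('all')]

# Aggregation: per-user variance of Numbers (expected 7, 0 and 2)
variance = df.groupby('User', as_index=False)['Numbers'].var()

print(filtered)
print(variance)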
I have a dataframe with two columns: "Agent" and "Client"
Each row corresponds to an interaction between an Agent and a client.
I want to keep only the rows where an agent had interactions with at least 2 different clients.
How can I do that?
Worth adding that you can now use df.duplicated():
df = df.loc[df.duplicated(subset='Agent', keep=False)]
Use groupby and transform with value_counts:
df[df.Agent.groupby(df.Agent).transform('value_counts') > 1]
Note that you might have one agent interacting with the same client multiple times; those rows would be retained as false positives. If you do not want this, you can add a drop_duplicates call before filtering:
df = df.drop_duplicates()
df = df[df.Agent.groupby(df.Agent).transform('value_counts') > 1]
For illustration, here is how the value_counts mask behaves on a generic DataFrame:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6], 'B': [2, 5, 1, 1, 5, 1]})
print(df)
A B
0 1 2
1 2 5
2 3 1
3 4 1
4 5 5
5 6 1
mask = df.B.groupby(df.B).transform('value_counts') > 1
print(mask)
0 False
1 True
2 True
3 True
4 True
5 True
Name: B, dtype: bool
df = df[mask]
print(df)
A B
1 2 5
2 3 1
3 4 1
4 5 5
5 6 1
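A minimal self-contained sketch of the drop_duplicates approach; the interaction data below is made up for illustration, and the last line shows an alternative (not used in the answers above) that counts distinct clients per agent with transform('nunique'):
import pandas as pd

# Made-up interactions: a1 sees c1 twice and c2, a2 sees only c3, a3 sees c4 and c5
df = pd.DataFrame({'Agent': ['a1', 'a1', 'a1', 'a2', 'a3', 'a3'],
                   'Client': ['c1', 'c1', 'c2', 'c3', 'c4', 'c5']})

# Drop exact duplicates, then keep agents that still appear more than once
dedup = df.drop_duplicates()
out = dedup.loc[dedup.duplicated(subset='Agent', keep=False)]
print(out)   # rows for a1 and a3; a2 has only one distinct client

# Alternative: keep agents with at least 2 distinct clients directly
out2 = df[df.groupby('Agent')['Client'].transform('nunique') >= 2]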
I am trying to clean a dataset with pandas/python and basically get rid of all the features (columns) that have a certain number of empty values, namely 100 or more. I am using the following command
train.isnull().sum()>=100
which gets me:
Id False
Feature 1 False
Feature 2 False
Feature 3 True
Feature 4 False
Feature 5 True
I would like to return a new dataframe without Feature 3 and Feature 5.
Thank you.
In your case, just run:
train[train.columns[train.isnull().sum()<100]]
Full example:
import pandas as pd
df = pd.DataFrame([[1,None,2],[3,4,None],[7,8,9]], columns = ['A','B','C'])
You'll get:
A B C
0 1 NaN 2.0
1 3 4.0 NaN
2 7 8.0 9.0
then running:
df.isnull().sum()
will give the null count per column:
A 0
B 1
C 1
then just select the wanted columns:
df.columns[df.isnull().sum()<100]
and filter your data frame:
df[df.columns[df.isnull().sum() < 100]]
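A compact sketch of the same idea on a made-up stand-in for the train DataFrame, plus an equivalent built-in alternative via DataFrame.dropna with thresh (not used in the answer above):
import numpy as np
import pandas as pd

# Toy stand-in for the question's train DataFrame: Feature3 has 150 missing values
train = pd.DataFrame({'Id': range(200),
                      'Feature1': np.random.randn(200),
                      'Feature3': [np.nan] * 150 + list(range(50))})

# Answer's approach: keep columns with fewer than 100 missing values
kept = train[train.columns[train.isnull().sum() < 100]]

# Equivalent alternative: require at least len(train) - 99 non-null values per column
kept_alt = train.dropna(axis=1, thresh=len(train) - 99)

print(kept.columns.tolist())      # ['Id', 'Feature1']
print(kept_alt.columns.tolist())  # ['Id', 'Feature1']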