I want to make an if statement to show all REF_INT values that are duplicated. I tried this:
(df_picru['REF_INT'].value_counts() == 1)
and it shows me all values as True or False, but I don't want to end up doing something like this:
if (df_picru['REF_INT'].value_counts() == 1):
    print(df_picru['REF_INT'])
Use duplicated with keep=False to flag every occurrence of a repeated value, then map the booleans to labels:
In [28]: df_picru['new'] = \
df_picru['REF_INT'].duplicated(keep=False) \
.map({True:'duplicates',False:'unique'})
In [29]: df_picru
Out[29]:
REF_INT new
0 1 unique
1 2 duplicates
2 3 unique
3 8 duplicates
4 8 duplicates
5 2 duplicates
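With the labels in place, showing only the duplicated values is a plain boolean filter (a small sketch on the same df_picru):
df_picru[df_picru['new'] == 'duplicates']
REF_INT new
1 2 duplicates
3 8 duplicates
4 8 duplicates
5 2 duplicates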
I think you need duplicated for the boolean mask, and numpy.where for the new column:
mask = df_picru['REF_INT'].duplicated(keep=False)
Sample:
import numpy as np
import pandas as pd

df_picru = pd.DataFrame({'REF_INT':[1,2,3,8,8,2]})
mask = df_picru['REF_INT'].duplicated(keep=False)
print(mask)
0 False
1 True
2 False
3 True
4 True
5 True
Name: REF_INT, dtype: bool
df_picru['new'] = np.where(mask, 'duplicates', 'unique')
print(df_picru)
REF_INT new
0 1 unique
1 2 duplicates
2 3 unique
3 8 duplicates
4 8 duplicates
5 2 duplicates
If you need to check whether at least one value is duplicated, use any to reduce the boolean mask (an array) to a scalar True or False:
if mask.any():
    print('at least one duplicated')
at least one duplicated
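To display the duplicated REF_INT values themselves rather than a yes/no answer, index the column with the mask (a sketch built on the same sample):
print(df_picru.loc[mask, 'REF_INT'])
1 2
3 8
4 8
5 2
Name: REF_INT, dtype: int64
print(df_picru.loc[mask, 'REF_INT'].unique())
[2 8]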
Another solution using groupby.
# group by REF_INT, count the occurrences, and label as Duplicated if the count is greater than 1
df_picru.groupby('REF_INT').apply(lambda x: 'Duplicated' if len(x) > 1 else 'Unique')
Out[21]:
REF_INT
1 Unique
2 Duplicated
3 Unique
8 Duplicated
dtype: object
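Note the result above is indexed by REF_INT rather than aligned with the original rows. If you want the label back on each row, groupby.transform('size') gives a per-row group count to compare against 1 (a sketch, assuming numpy is imported as np):
counts = df_picru.groupby('REF_INT')['REF_INT'].transform('size')
df_picru['new'] = np.where(counts > 1, 'Duplicated', 'Unique')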
value_counts can actually work if you make a minor change:
df_picru.REF_INT.value_counts()[lambda x: x>1]
Out[31]:
2 2
8 2
Name: REF_INT, dtype: int64
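If you want the duplicated values as a plain list instead of a Series, take the index of that result (a minor extension of the same idea):
dup_values = df_picru.REF_INT.value_counts()[lambda x: x > 1].index.tolist()
print(dup_values)  # e.g. [2, 8]; the order follows the value_counts ordering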
It's easy to create a vector of bools in pandas for testing values, such as
DF['a'] > 10
but how do you write
DF['a'] in list
to generate a vector of bools based on the membership of each value of the Series in some list? I am getting a ValueError.
I know I can loop through the data pretty simply, but is this possible without having to do that?
Use the isin method:
DF['a'].isin(list)
Example:
import numpy as np
import pandas as pd

DF = pd.DataFrame({'a':np.arange(5),'b':np.arange(5)*2})
print(DF)
a b
0 0 0
1 1 2
2 2 4
3 3 6
4 4 8
print(DF['a'].isin([0,2,3]))
0 True
1 False
2 True
3 True
4 False
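The mask inverts like any other boolean Series, so excluding the listed values only needs ~ (a sketch on the same DF):
print(DF[~DF['a'].isin([0,2,3])])
a b
1 1 2
4 4 8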
I have a Pandas dataframe for which I am checking for duplicates. I get the following output, but I don't know why it's showing these rows as duplicates. Aren't all the column values in a row supposed to be the same for it to be shown as a duplicate? Please correct me if I am wrong; I am a newbie in Python.
By default, duplicated does not flag the first occurrence of a duplicated row.
>>> df = pd.DataFrame([[1,1], [2,2], [1,1], [1,1]])
>>> df
0 1
0 1 1
1 2 2
2 1 1
3 1 1
>>> df.duplicated()
0 False
1 False
2 True
3 True
dtype: bool
This means that your df[dup] will contain rows that look unique if the duplicated rows in df occurred only twice.
You can adjust this behavior with the keep argument.
>>> df.duplicated(keep=False)
0 True
1 False
2 True
3 True
dtype: bool
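The keep argument also accepts 'last', which flags every occurrence except the last one (a quick sketch on the same df):
>>> df.duplicated(keep='last')
0 True
1 False
2 True
3 False
dtype: bool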
The data frame looks like the following:
df = pd.DataFrame({'k1':['one']*3 + ['two']*4,'k2':[1,1,2,3,3,4,4]})
When I am checking for duplicates, I get a boolean index by doing
df.duplicated(), and then I use it as the filter
df[df.duplicated()], which shows a different result compared with df.drop_duplicates().
An additional row appears in the result:
2 one 2
drop_duplicates drops the repeated rows but keeps the first occurrence, while duplicated returns False for the first occurrence and True for the remaining copies of a duplicate, so the two functions target different problems.
df.duplicated()
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
df.drop_duplicates()
k1 k2
0 one 1
2 one 2
3 two 3
5 two 4
If you want to make the outputs the same, keep only the rows that are truly unique:
df[~df.duplicated(keep=False)]
k1 k2
2 one 2
df.drop_duplicates(keep=False)
k1 k2
2 one 2
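As a rule of thumb, drop_duplicates keeps exactly the rows that duplicated marks False, so each variant can also be written as a boolean filter (equivalences on the same df, not new output):
df[~df.duplicated()]  # same rows as df.drop_duplicates()
df[~df.duplicated(keep=False)]  # same rows as df.drop_duplicates(keep=False)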
I have a Dataframe:
User Numbers
A 0
A 4
A 5
B 0
B 0
C 1
C 3
I want to perform an operation on each group's data. For example, if I want to remove all Users whose Numbers are all 0, it should look like:
User Numbers
A 0
A 4
A 5
C 1
C 3
since all Numbers of User B are 0.
Or, for example, if I want to find the variance of each User's Numbers, it should look like:
User Variance
A 7
B 0
C 2
This means that only the Numbers of A are used to compute the variance of A, and so on.
Is there a general way to do all these computations for matching grouped data?
You want two different operations: filtration per group and aggregation per group.
Filtration:
For better performance, use transform to build a boolean mask and filter by boolean indexing:
df1 = df[~df['Numbers'].eq(0).groupby(df['User']).transform('all')]
print(df1)
User Numbers
0 A 0
1 A 4
2 A 5
5 C 1
6 C 3
Steps:
1. First create a boolean Series by comparing Numbers to 0 with eq:
print(df['Numbers'].eq(0))
0 True
1 False
2 False
3 True
4 True
5 False
6 False
Name: Numbers, dtype: bool
2. Then group this mask by the User column (syntactic sugar: groupby by another column) and call transform('all'), which checks whether all values in each group are True and broadcasts the result back as a mask the same size as the original DataFrame:
print(df['Numbers'].eq(0).groupby(df['User']).transform('all'))
0 False
1 False
2 False
3 True
4 True
5 False
6 False
Name: Numbers, dtype: bool
3. Invert the boolean mask with ~:
print(~df['Numbers'].eq(0).groupby(df['User']).transform('all'))
0 True
1 True
2 True
3 False
4 False
5 True
6 True
Name: Numbers, dtype: bool
4. Filter:
print(df[~df['Numbers'].eq(0).groupby(df['User']).transform('all')])
User Numbers
0 A 0
1 A 4
2 A 5
5 C 1
6 C 3
Another solution, slower on a large DataFrame, uses filter with the same logic as the first solution:
df2 = df.groupby('User').filter(lambda x: ~x['Numbers'].eq(0).all())
print(df2)
User Numbers
0 A 0
1 A 4
2 A 5
5 C 1
6 C 3
Aggregation:
For a simple aggregation of one column with one aggregate function, e.g. GroupBy.var, use:
df3 = df.groupby('User', as_index=False)['Numbers'].var()
print(df3)
User Numbers
0 A 7
1 B 0
2 C 2
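If you need several statistics per group at once, agg accepts a list of function names (a sketch on the same df; the exact column layout may vary by pandas version):
df4 = df.groupby('User')['Numbers'].agg(['var', 'mean', 'size'])
print(df4)
var mean size
User
A 7.0 3.0 3
B 0.0 0.0 2
C 2.0 2.0 2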
I understand how to drop rows if a given column is equal to some value, e.g.
df = df.drop(df[<some boolean condition>].index)
but how do you drop a row if the columns are all equal to each other? Is there a way to do this without specifying the column names?
You can use the apply method to loop through the rows, build a boolean Series indicating whether each row contains more than one distinct value, and use that Series to keep the corresponding rows:
df[df.apply(lambda r: r.nunique() != 1, axis=1)]
df = pd.DataFrame({"A": [1,2,3,3,3,4,5], "B": [1,3,4,4,3,5,1]})
In [867]:
df[df.apply(lambda r: r.nunique() != 1, axis=1)]
Out[867]:
A B
1 2 3
2 3 4
3 3 4
5 4 5
6 5 1
You can compare the first column against the entire df using .eq with axis=0, call all(axis=1) on the result, and invert it with ~:
In [158]:
df = pd.DataFrame({'a':np.arange(5), 'b':[0,0,2,2,4]})
df
Out[158]:
a b
0 0 0
1 1 0
2 2 2
3 3 2
4 4 4
In [159]:
df[~df.eq(df['a'], axis=0).all(axis=1)]
Out[159]:
a b
1 1 0
3 3 2
If you look at the boolean mask:
In [160]:
df.eq(df['a'], axis=0)
Out[160]:
a b
0 True True
1 True False
2 True True
3 True False
4 True True
You can see it is True for the rows that meet the condition, so calling all(axis=1) reduces it to a 1-D boolean mask:
In [161]:
df.eq(df['a'], axis=0).all(axis=1)
Out[161]:
0 True
1 False
2 True
3 False
4 True
dtype: bool
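A vectorized equivalent of the apply answer above is DataFrame.nunique with axis=1, which counts the distinct values per row without a Python-level loop (a sketch on the same df; nunique(axis=1) requires a reasonably recent pandas):
df[df.nunique(axis=1) != 1]
a b
1 1 0
3 3 2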