I am stuck with a seemingly easy problem: dropping unique rows in a pandas dataframe. Basically, the opposite of drop_duplicates().
Let's say this is my data:
A B C
0 foo 0 A
1 foo 1 A
2 foo 1 B
3 bar 1 A
I would like to drop the rows where the combination of A and B is unique, i.e. I would like to keep only rows 1 and 2.
I tried the following:
import pandas as pd
# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates()
duplicates = df[~df.index.isin(uniques.index)]
But I only get the row 2, as 0, 1, and 3 are in the uniques!
Solutions for selecting all duplicated rows:
You can use duplicated with the subset parameter and keep=False to select all duplicates:
df = df[df.duplicated(subset=['A','B'], keep=False)]
print (df)
A B C
1 foo 1 A
2 foo 1 B
Solution with transform:
df = df[df.groupby(['A', 'B'])['A'].transform('size') > 1]
print (df)
A B C
1 foo 1 A
2 foo 1 B
Slightly modified solutions for selecting all unique rows:
#invert boolean mask by ~
df = df[~df.duplicated(subset=['A','B'], keep=False)]
print (df)
A B C
0 foo 0 A
3 bar 1 A
df = df[df.groupby(['A', 'B'])['A'].transform('size') == 1]
print (df)
A B C
0 foo 0 A
3 bar 1 A
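To see why the attempt in the question returned only row 2, it can help to compare the two masks directly. This is a minimal sketch on the question's data; the names mask_first and mask_all are mine, not from the posts above.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

# Default keep='first': only the second and later occurrences are flagged,
# so the first row of each repeated (A, B) pair is treated as "not a duplicate".
mask_first = df.duplicated(subset=["A", "B"])

# keep=False: every row whose (A, B) pair occurs more than once is flagged.
mask_all = df.duplicated(subset=["A", "B"], keep=False)

print(mask_first.tolist())  # [False, False, True, False]
print(mask_all.tolist())    # [False, True, True, False]
```

drop_duplicates uses the same keep semantics, which is why the default call keeps row 1 among the "uniques".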
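I came up with a solution using groupby. The grouped frame gets a fresh 0..n-1 index that does not correspond to the row labels of df, so the repeated keys are merged back on the columns rather than compared by index:
groupped = df.groupby(['A', 'B']).size().reset_index(name='count')
# keep only the (A, B) pairs that occur more than once
repeated = groupped.loc[groupped['count'] > 1, ['A', 'B']]
# inner merge on the key columns; the result gets a new index
duplicates = df.merge(repeated, on=['A', 'B'])
Duplicates now has the proper result:
A B C
0 foo 1 A
1 foo 1 B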
Also, my original attempt in the question can be fixed by simply adding keep=False in the drop_duplicates method:
# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates(keep=False)
duplicates = df[~df.index.isin(uniques.index)]
Please see @jezrael's answer; I think it is probably the safest, as I am relying on pandas indexes here.
df1 = df.drop_duplicates(['A', 'B'],keep=False)
df1 = pd.concat([df, df1])
df1 = df1.drop_duplicates(keep=False)
This technique is more suitable when you have two datasets dfX and dfY with millions of records. You may first concatenate dfX and dfY and follow the same steps.
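A rough sketch of what the two-dataset variant could look like. The contents of dfX and dfY are made up for illustration, and ignore_index=True is my addition so that the combined frame has unique row labels.

```python
import pandas as pd

# Hypothetical sample data; only the names dfX and dfY come from the answer above.
dfX = pd.DataFrame({"A": ["foo", "bar"], "B": [1, 2], "C": ["x1", "x2"]})
dfY = pd.DataFrame({"A": ["foo", "baz"], "B": [1, 3], "C": ["y1", "y2"]})

# Step 0: stack both datasets into one frame
combined = pd.concat([dfX, dfY], ignore_index=True)

# Step 1: rows whose (A, B) pair occurs exactly once
unique_rows = combined.drop_duplicates(["A", "B"], keep=False)

# Step 2: append them again, so each of those rows now appears twice
doubled = pd.concat([combined, unique_rows])

# Step 3: drop every fully identical row; what remains had a repeated (A, B) pair
repeated = doubled.drop_duplicates(keep=False)
print(repeated)
```

One caveat of this approach: step 3 compares whole rows, so rows that were already identical in every column would be dropped as well.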
I have a dataframe like the one below
stu_id,Mat_grade,sci_grade,eng_grade
1,A,C,A
1,A,C,A
1,B,C,A
1,C,C,A
2,D,B,B
2,D,C,B
2,D,D,C
2,D,A,C
tf = pd.read_clipboard(sep=',')
My objective is to
a) Find out how many distinct grades each student got under Mat_grade, sci_grade and eng_grade
So I tried the following:
tf['mat_cnt'] = tf.groupby(['stu_id'])['Mat_grade'].nunique()
tf['sci_cnt'] = tf.groupby(['stu_id'])['sci_grade'].nunique()
tf['eng_cnt'] = tf.groupby(['stu_id'])['eng_grade'].nunique()
But this doesn't provide the expected output. Since I have more than 100K unique ids, any efficient and elegant solution would be really helpful.
I expect my output to look like the one below
You can specify the column names in a list, call DataFrameGroupBy.nunique on those columns, and then rename:
cols = ['Mat_grade','sci_grade', 'eng_grade']
new = ['mat_cnt','sci_cnt','eng_cnt']
d = dict(zip(cols, new))
df = tf.groupby(['stu_id'], as_index=False)[cols].nunique().rename(columns=d)
print (df)
stu_id mat_cnt sci_cnt eng_cnt
0 1 3 1 1
1 2 1 4 2
Another idea is to use named aggregation:
cols = ['Mat_grade','sci_grade', 'eng_grade']
new = ['mat_cnt','sci_cnt','eng_cnt']
d = {v: (k,'nunique') for k, v in zip(cols, new)}
print (d)
{'mat_cnt': ('Mat_grade', 'nunique'),
'sci_cnt': ('sci_grade', 'nunique'),
'eng_cnt': ('eng_grade', 'nunique')}
df = tf.groupby(['stu_id'], as_index=False).agg(**d)
print (df)
stu_id mat_cnt sci_cnt eng_cnt
0 1 3 1 1
1 2 1 4 2
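As a side note on the attempt in the question: groupby(...).nunique() returns one value per stu_id, indexed by stu_id, so assigning it to a column of tf aligns on the wrong labels. If the counts are wanted on every original row rather than one row per student, transform can broadcast them back. This is a sketch under that assumption; the io.StringIO loading is only there to make the example self-contained.

```python
import io
import pandas as pd

csv_text = """stu_id,Mat_grade,sci_grade,eng_grade
1,A,C,A
1,A,C,A
1,B,C,A
1,C,C,A
2,D,B,B
2,D,C,B
2,D,D,C
2,D,A,C
"""
tf = pd.read_csv(io.StringIO(csv_text))

# source column -> name of the new per-row count column
cols = {"Mat_grade": "mat_cnt", "sci_grade": "sci_cnt", "eng_grade": "eng_cnt"}

for src, dst in cols.items():
    # transform returns a Series with the same index as tf, so it aligns row by row
    tf[dst] = tf.groupby("stu_id")[src].transform("nunique")

print(tf)
```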
I have many DataFrames that I need to merge.
Let's say:
base: id constraint
1 'a'
2 'b'
3 'c'
df_1: id value constraint
1 1 'a'
2 2 'a'
3 3 'a'
df_2: id value constraint
1 1 'b'
2 2 'b'
3 3 'b'
df_3: id value constraint
1 1 'c'
2 2 'c'
3 3 'c'
If I try and merge all of them (it'll be in a loop), I get:
a = pd.merge(base, df_1, on=['id', 'constraint'], how='left')
b = pd.merge(a, df_2, on=['id', 'constraint'], how='left')
c = pd.merge(b, df_3, on=['id', 'constraint'], how='left')
id constraint value value_x value_y
1 'a' 1 NaN NaN
2 'b' NaN 2 NaN
3 'c' NaN NaN 3
The desired output would be:
id constraint value
1 'a' 1
2 'b' 2
3 'c' 3
I know about combine_first and it works, but I can't use this approach because it is thousands of times slower.
Is there a merge that can replace values in case of columns overlap?
It's somewhat similar to this question, with no answers.
Given your MCVE:
import pandas as pd
base = pd.DataFrame([1,2,3], columns=['id'])
df1 = pd.DataFrame([[1,1]], columns=['id', 'value'])
df2 = pd.DataFrame([[2,2]], columns=['id', 'value'])
df3 = pd.DataFrame([[3,3]], columns=['id', 'value'])
I would suggest concatenating your dataframes first (using a loop if needed):
df = pd.concat([df1, df2, df3])
And then merge:
pd.merge(base, df, on='id')
It yields:
id value
0 1 1
1 2 2
2 3 3
Update
Running the code with the new version of your question and the input provided by @Celius Stingher:
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df1 = pd.DataFrame(b)
df2 = pd.DataFrame(c)
df3 = pd.DataFrame(d)
We get:
id constrains value
0 1 a 1
1 2 b 2
2 3 c 3
Which seems to be compliant with your expected output.
You can use ffill() for the purpose:
df_1 = pd.DataFrame({'val':[1]}, index=[1])
df_2 = pd.DataFrame({'val':[2]}, index=[2])
df_3 = pd.DataFrame({'val':[3]}, index=[3])
(pd.concat((df_1,df_2,df_3), axis=1)
   .ffill(axis=1)   # pass axis by keyword; newer pandas rejects the positional form
   .iloc[:,-1]
)
Output:
1 1.0
2 2.0
3 3.0
Name: val, dtype: float64
For your new data:
base.merge(pd.concat((df1,df2,df3)),
on=['id','constraint'],
how='left')
output:
id constraint value
0 1 'a' 1
1 2 'b' 2
2 3 'c' 3
Conclusion: you are actually looking for the option how='left' in merge
If you must only merge all dataframes with base:
Based on edit
import pandas as pd
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df_1 = pd.DataFrame(b)
df_2 = pd.DataFrame(c)
df_3 = pd.DataFrame(d)
dataframes = [df_1,df_2,df_3]
for i in dataframes:
    base = base.merge(i,how='left',on=['id','constrains'])
summation = [col for col in base if col.startswith('value')]
base['value'] = base[summation].sum(axis=1)
base = base.dropna(how='any',axis=1)
print(base)
Output:
id constrains value
0 1 a 1.0
1 2 b 2.0
2 3 c 3.0
Those who simply want to do a merge that overrides the values (which is my case) can achieve that with this method, which is really similar to Celius Stingher's answer.
Documented version is on the original gist.
import pandas as pa
def rmerge(left,right,**kwargs):
    # Function to flatten lists from http://rosettacode.org/wiki/Flatten_a_list#Python
    def flatten(lst):
        return sum( ([x] if not isinstance(x, list) else flatten(x) for x in lst), [] )
    # Set default for removing overlapping columns in "left" to be true
    myargs = {'replace':'left'}
    myargs.update(kwargs)
    # Remove the replace key from the argument dict to be sent to
    # pandas merge command (compare with !=, not "is not", for strings)
    kwargs = {k:v for k,v in myargs.items() if k != 'replace'}
    if myargs['replace'] is not None:
        # Generate a list of overlapping column names not associated with the join
        skipcols = set(flatten([v for k, v in myargs.items() if k in ['on','left_on','right_on']]))
        leftcols = set(left.columns)
        rightcols = set(right.columns)
        dropcols = list((leftcols & rightcols).difference(skipcols))
        # Remove the overlapping column names from the appropriate DataFrame
        if myargs['replace'].lower() == 'left':
            left = left.copy().drop(dropcols,axis=1)
        elif myargs['replace'].lower() == 'right':
            right = right.copy().drop(dropcols,axis=1)
    df = pa.merge(left,right,**kwargs)
    return df
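The core idea of the function above (drop the overlapping non-key columns from one side, then merge) can be sketched in a few lines. merge_override is a hypothetical name of mine, not part of pandas or of the gist, and the sample frames are invented.

```python
import pandas as pd

def merge_override(left, right, on, how="left"):
    """Merge so that overlapping non-key columns come from `right` only."""
    keys = [on] if isinstance(on, str) else list(on)
    # Columns present on both sides that are not join keys
    overlap = [c for c in left.columns if c in right.columns and c not in keys]
    # Dropping them from the left frame means merge never has to add suffixes
    return left.drop(columns=overlap).merge(right, on=keys, how=how)

left = pd.DataFrame({"id": [1, 2], "value": [10, 20], "other": ["p", "q"]})
right = pd.DataFrame({"id": [1, 2], "value": [100, 200]})

result = merge_override(left, right, on="id")
print(result)
```

Note that with how="left", rows of left that have no match in right end up with NaN in the replaced columns rather than keeping their old values.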
Let's say I have a data frame with such column names:
['a','b','c','d','e','f','g']
And I would like to change names from 'c' to 'f' (actually add string to the name of column), so the whole data frame column names would look like this:
['a','b','var_c_equal','var_d_equal','var_e_equal','var_f_equal','g']
Well, first I wrote a call that renames every column with the string I want:
df.rename(columns=lambda x: 'or_'+x+'_no', inplace=True)
But now I really want to understand how to implement something like this:
df.loc[:,'c':'f'].rename(columns=lambda x: 'var_'+x+'_equal', inplace=True)
You can use a list comprehension for that:
Code:
new_columns = ['var_{}_equal'.format(c) if c in 'cdef' else c for c in df.columns]
Test Code:
import pandas as pd
df = pd.DataFrame({'a':(1,2), 'b':(1,2), 'c':(1,2), 'd':(1,2)})
print(df)
df.columns = ['var_{}_equal'.format(c) if c in 'cdef' else c
for c in df.columns]
print(df)
Results:
a b c d
0 1 1 1 1
1 2 2 2 2
a b var_c_equal var_d_equal
0 1 1 1 1
1 2 2 2 2
One way is to use a dictionary instead of an anonymous function. Both the below variations assume the columns you need to rename are contiguous.
Contiguous columns by position
d = {k: 'var_'+k+'_equal' for k in df.columns[2:6]}
df = df.rename(columns=d)
Contiguous columns by name
If you need to calculate the numerical indices:
cols = df.columns.get_loc
d = {k: 'var_'+k+'_equal' for k in df.columns[cols('c'):cols('f')+1]}
df = df.rename(columns=d)
Specifically identified columns
If you want to provide the columns explicitly:
d = {k: 'var_'+k+'_equal' for k in 'cdef'}
df = df.rename(columns=d)
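For completeness, the label slice from the question can still be used to pick the columns. The attempt in the question has no visible effect because df.loc[:, 'c':'f'] returns a new object and the in-place rename is applied to that temporary. A small sketch, where the one-row frame is just placeholder data:

```python
import pandas as pd

df = pd.DataFrame([list(range(7))], columns=list("abcdefg"))

# Label slices with .loc include both endpoints, so this selects c, d, e and f
to_rename = df.loc[:, "c":"f"].columns

# Build the mapping from the selected labels and rename on the original frame
df = df.rename(columns={c: f"var_{c}_equal" for c in to_rename})

print(list(df.columns))
# ['a', 'b', 'var_c_equal', 'var_d_equal', 'var_e_equal', 'var_f_equal', 'g']
```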
I have two DataFrames and I want to subset df2 based on the column names that intersect with the column names of df1. In R this is easy.
R code:
df1 <- data.frame(a=rnorm(5), b=rnorm(5))
df2 <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5))
df2[names(df2) %in% names(df1)]
a b
1 -0.8173361 0.6450052
2 -0.8046676 0.6441492
3 -0.3545996 -1.6545289
4 1.3364769 -0.4340254
5 -0.6013046 1.6118360
However, I'm not sure how to do this in pandas.
pandas attempt:
df1 = pd.DataFrame({'a': np.random.standard_normal((5,)), 'b': np.random.standard_normal((5,))})
df2 = pd.DataFrame({'a': np.random.standard_normal((5,)), 'b': np.random.standard_normal((5,)), 'c': np.random.standard_normal((5,))})
df2[df2.columns in df1.columns]
This results in TypeError: unhashable type: 'Index'. What's the right way to do this?
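If you need a true intersection, note that .columns yields an Index object, which supports basic set operations, so you can use Index.intersection:
df2[df1.columns.intersection(df2.columns)]
Older pandas versions also accepted & between two Index objects as a set intersection, but that usage was deprecated and & now acts as an element-wise logical operator, so the method is the safer spelling:
df2[df1.columns & df2.columns]  # older pandas only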
However if you are guaranteed that df1 is just a column subset of df2 you can directly use
df2[df1.columns]
or if assigning,
df2.loc[:, df1.columns]
Demo
>>> df2[df1.columns.intersection(df2.columns)]
a b
0 1.952230 -0.641574
1 0.804606 -1.509773
2 -0.360106 0.939992
3 0.471858 -0.025248
4 -0.663493 2.031343
>>> df2.loc[:, df1.columns]
a b
0 1.952230 -0.641574
1 0.804606 -1.509773
2 -0.360106 0.939992
3 0.471858 -0.025248
4 -0.663493 2.031343
The equivalent would be:
df2[df1.columns.intersection(df2.columns)]
Out:
a b
0 -0.019703 0.379820
1 0.040658 0.243309
2 1.103032 0.066454
3 -0.921378 1.016017
4 0.188666 -0.626612
With this, you will not get a KeyError if a column in df1 does not exist in df2.
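A small sketch of that difference, using made-up frames where df1 has a column z that df2 lacks:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4], "z": [5, 6]})
df2 = pd.DataFrame({"a": [7, 8], "b": [9, 10], "c": [11, 12]})

# Only the labels present in both frames survive the intersection
common = df1.columns.intersection(df2.columns)
subset = df2[common]
print(list(subset.columns))

# Indexing with all of df1's columns fails because 'z' is missing from df2
try:
    df2[df1.columns]
    raised = False
except KeyError:
    raised = True
print(raised)
```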
If I have data like
Col1
A
B
A
B
A
C
I need output like
Col_value Count
A 3
B 2
C 1
I need Col_value and Count to be the column names.
So I can access it like a['col_value']
Use value_counts:
df = df['Col1'].value_counts().to_frame().reset_index()
df
A 3
B 2
C 1
then rename your columns if needed:
df.columns = ['Col_value','Count']
df
Col_value Count
0 A 3
1 B 2
2 C 1
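A variant that names both columns in one chain, so it does not depend on the default column names that value_counts and reset_index produce (as far as I know, those defaults differ between pandas versions). The sample frame mirrors the question's data.

```python
import pandas as pd

df = pd.DataFrame({"Col1": ["A", "B", "A", "B", "A", "C"]})

counts = (
    df["Col1"]
    .value_counts()                 # Series of counts, most frequent first
    .rename_axis("Col_value")       # name the index that holds the distinct values
    .reset_index(name="Count")      # turn it into columns, naming the counts
)
print(counts)
print(counts["Col_value"].tolist())  # ['A', 'B', 'C']
```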
Another solution is groupby with aggregating size:
df = (df.groupby('Col1')
        .size()
        .reset_index(name='Count')
        .rename(columns={'Col1':'Col_value'}))
print (df)
Col_value Count
0 A 3
1 B 2
2 C 1
Use pd.crosstab as another alternative:
import pandas as pd
help(pd.crosstab)
Help on function crosstab in module pandas.core.reshape.pivot:
crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
Example:
df_freq = pd.crosstab(df['Col1'], columns='count')
df_freq.head()
def frequencyTable(alist):
    '''
    list -> chart
    Returns None. Side effect is printing two columns showing each number that
    is in the list, and then a column indicating how many times it was in the list.
    Example:
    >>> frequencyTable([1, 3, 3, 2])
    ITEM FREQUENCY
    1 1
    2 1
    3 2
    '''
    countdict = {}
    for item in alist:
        if item in countdict:
            countdict[item] = countdict[item] + 1
        else:
            countdict[item] = 1
    itemlist = list(countdict.keys())
    itemlist.sort()
    print("ITEM", "FREQUENCY")
    for item in itemlist:
        print(item, " ", countdict[item])
    return None
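The same counting can be done with collections.Counter from the standard library. A minimal sketch; frequency_rows is a name I made up, and it returns the rows instead of printing them so the result can be reused.

```python
from collections import Counter

def frequency_rows(items):
    """Return (item, count) pairs sorted by item."""
    counts = Counter(items)          # maps each distinct item to its number of occurrences
    return sorted(counts.items())    # sort by item, like itemlist.sort() above

rows = frequency_rows([1, 3, 3, 2])
print("ITEM", "FREQUENCY")
for item, count in rows:
    print(item, " ", count)
```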