I want to combine two dataframes:
import pandas as pd

df1 = pd.DataFrame({'A': ['a', 'a'], 'B': ['b', 'b']})
df2 = pd.DataFrame({'B': ['b', 'b'], 'A': ['a', 'a']})

pd.concat([df1, df2], ignore_index=True)
result:
   A  B
0  a  b
1  a  b
2  a  b
3  a  b
But I want the output to be like this, i.e. the same behaviour as SQL's UNION / UNION ALL, where the columns are matched by position rather than by name:
   A  B
0  a  b
1  a  b
2  b  a
3  b  a
Another way is to use numpy to stack the two dataframes vertically and then rebuild the result with the pd.DataFrame constructor:
import numpy as np

pd.DataFrame(np.vstack([df1.values, df2.values]), columns=df1.columns)
Output:
A B
0 a b
1 a b
2 b a
3 b a
Here is a way to do an SQL UNION ALL with pandas using pandas.concat:
list_dfs = [df1, df2]

out = (
    pd.concat([pd.DataFrame(sub_df.to_numpy()) for sub_df in list_dfs],
              ignore_index=True)
      .set_axis(df1.columns, axis=1)
)

# Output:
print(out)
A B
0 a b
1 a b
2 b a
3 b a
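Both answers go through numpy, which can coerce everything to a single object dtype. A variant of the same positional UNION ALL idea that keeps the original dtypes is to rename each frame's columns positionally before concatenating (a small sketch):
out = pd.concat(
    [d.set_axis(df1.columns, axis=1) for d in [df1, df2]],  # relabel columns by position
    ignore_index=True
)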
I have a pandas DataFrame.
Say I want to sample two persons of each group, I use the following code to get a new dataframe:
sample_df = df.groupby("category").apply(lambda group_df: group_df.sample(2, random_state=1234))
I would like to create a dataframe where the non-sampled persons are stored.
The sample_df still has the indices of the original df, so I probably have to do something with that, but I'm not sure what...
Thanks in advance!
First, add group_keys=False to groupby to avoid category being added as an extra index level (a MultiIndex):
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'category':list('aaabbb')
})
sample_df = (df.groupby("category", group_keys=False)
.apply(lambda group_df: group_df.sample(2, random_state=1234)))
print(sample_df)
A B category
0 a 4 a
1 b 5 a
3 d 5 b
4 e 5 b
Then you can filter the original rows with boolean indexing, using Index.isin on the original index and inverting the mask with ~:
non_sample_df = df[~df.index.isin(sample_df.index)]
print(non_sample_df)
A B category
2 c 4 a
5 f 4 b
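As a side note, on pandas 1.1 or newer you can sample directly on the GroupBy object, which also keeps the original index, so apply is not needed at all (a minimal sketch; the exact rows drawn may differ from the apply version even with the same seed):
sample_df = df.groupby("category").sample(2, random_state=1234)
non_sample_df = df.drop(sample_df.index)   # everything that was not sampled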
I am stuck with a seemingly easy problem: dropping unique rows in a pandas dataframe. Basically, the opposite of drop_duplicates().
Let's say this is my data:
A B C
0 foo 0 A
1 foo 1 A
2 foo 1 B
3 bar 1 A
I would like to drop the rows where the combination of A and B is unique, i.e. I would like to keep only rows 1 and 2.
I tried the following:
# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates()
duplicates = df[~df.index.isin(uniques.index)]
But this only gives me row 2, since indices 0, 1, and 3 all appear in uniques!
Solutions to select all duplicated rows:
You can use duplicated with subset and the parameter keep=False to select all duplicates:
df = df[df.duplicated(subset=['A','B'], keep=False)]
print (df)
A B C
1 foo 1 A
2 foo 1 B
Solution with transform:
df = df[df.groupby(['A', 'B'])['A'].transform('size') > 1]
print (df)
A B C
1 foo 1 A
2 foo 1 B
Slightly modified solutions to select all unique rows:
# invert the boolean mask with ~
df = df[~df.duplicated(subset=['A','B'], keep=False)]
print (df)
A B C
0 foo 0 A
3 bar 1 A
df = df[df.groupby(['A', 'B'])['A'].transform('size') == 1]
print (df)
A B C
0 foo 0 A
3 bar 1 A
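If you need both splits at once, you can build the mask a single time and reuse it; a small sketch:
mask = df.duplicated(subset=['A', 'B'], keep=False)
duplicated_rows = df[mask]     # rows 1 and 2
unique_rows = df[~mask]        # rows 0 and 3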
I came up with a solution using groupby. The group sizes live in a separate frame whose index no longer matches df, so they have to be merged back onto the original rows:
counts = df.groupby(['A', 'B']).size().reset_index().rename(columns={0: 'count'})
merged = df.merge(counts, on=['A', 'B'])
duplicates = merged[merged['count'] > 1].drop(columns='count')
duplicates now has the proper result:
     A  B  C
1  foo  1  A
2  foo  1  B
Also, my original attempt in the question can be fixed by simply adding keep=False in the drop_duplicates method:
# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates(keep=False)
duplicates = df[~df.index.isin(uniques.index)]
Please prefer @jezrael's answer, which I think is the safest, as I am relying on pandas indexes here.
df1 = df.drop_duplicates(['A', 'B'], keep=False)   # rows whose (A, B) combination is unique
df1 = pd.concat([df, df1])                          # those unique rows now appear twice
df1 = df1.drop_duplicates(keep=False)               # ...so they are dropped, leaving only the duplicated rows
This technique is more suitable when you have two datasets, dfX and dfY, with millions of records: first concatenate dfX and dfY, then follow the same steps (see the sketch below).
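A minimal sketch of that two-dataset case, assuming hypothetical frames dfX and dfY with the same columns as df above:
combined = pd.concat([dfX, dfY], ignore_index=True)
step1 = combined.drop_duplicates(['A', 'B'], keep=False)               # rows whose (A, B) is unique
duplicates = pd.concat([combined, step1]).drop_duplicates(keep=False)  # rows whose (A, B) is duplicated
# note: rows that are exact full-row duplicates in the input are removed here as well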
I have the following dataframes.
import pandas as pd
d={'P':['A','B','C'],
'Q':[5,6,7]
}
df=pd.DataFrame(data=d)
print(df)
d={'P':['A','C','D'],
'Q':[5,7,8]
}
df1=pd.DataFrame(data=d)
print(df1)
d={'P':['B','E','F'],
'Q':[5,7,8]
}
df3=pd.DataFrame(data=d)
print(df3)
The code to check which values of one dataframe's P column are not present in another is:
df.loc[~df['P'].isin(df1['P'])]
How can I do the same check against multiple dataframes?
That is, how do I find the values in df3's P column that appear in neither df's nor df1's P column?
Expected Output:
P Q
0 E 7
1 F 8
You can chain 2 conditions with & for bitwise AND:
cond1 = ~df3['P'].isin(df1['P'])
cond2 = ~df3['P'].isin(df['P'])
df = df3.loc[cond1 & cond2]
print (df)
P Q
1 E 7
2 F 8
Or combine the values of both P columns first, either by concatenating the arrays or by joining lists with +:
import numpy as np

df = df3.loc[~df3['P'].isin(np.concatenate([df1['P'], df['P']]))]
# another solution
# df = df3.loc[~df3['P'].isin(df1['P'].tolist() + df['P'].tolist())]
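The same idea extends to any number of other dataframes by collecting all of their P values first; a small sketch:
others = [df, df1]                                       # frames whose P values should be excluded
excluded = pd.concat([o['P'] for o in others]).unique()
out = df3[~df3['P'].isin(excluded)]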
Another option, although jezrael has already given an expert answer :)
You can simply define the conditions, and then combine them logically, like:
con1 = df3['P'].isin(df['P'])
con2 = df3['P'].isin(df1['P'])
df = df3[~ (con1 | con2)]
>>> df
P Q
1 E 7
2 F 8
I have two DataFrames and I want to subset df2 based on the column names that intersect with the column names of df1. In R this is easy.
R code:
df1 <- data.frame(a=rnorm(5), b=rnorm(5))
df2 <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5))
df2[names(df2) %in% names(df1)]
a b
1 -0.8173361 0.6450052
2 -0.8046676 0.6441492
3 -0.3545996 -1.6545289
4 1.3364769 -0.4340254
5 -0.6013046 1.6118360
However, I'm not sure how to do this in pandas.
pandas attempt:
df1 = pd.DataFrame({'a': np.random.standard_normal((5,)), 'b': np.random.standard_normal((5,))})
df2 = pd.DataFrame({'a': np.random.standard_normal((5,)), 'b': np.random.standard_normal((5,)), 'c': np.random.standard_normal((5,))})
df2[df2.columns in df1.columns]
This results in TypeError: unhashable type: 'Index'. What's the right way to do this?
If you need a true intersection, since .columns yields an Index object which supports basic set operations, you can use &, e.g.
df2[df1.columns & df2.columns]
though note that this use of & as a set operation on Index objects is deprecated in newer pandas versions; the equivalent, future-proof spelling is Index.intersection
df2[df1.columns.intersection(df2.columns)]
However if you are guaranteed that df1 is just a column subset of df2 you can directly use
df2[df1.columns]
or if assigning,
df2.loc[:, df1.columns]
Demo
>>> df2[df1.columns & df2.columns]
a b
0 1.952230 -0.641574
1 0.804606 -1.509773
2 -0.360106 0.939992
3 0.471858 -0.025248
4 -0.663493 2.031343
>>> df2.loc[:, df1.columns]
a b
0 1.952230 -0.641574
1 0.804606 -1.509773
2 -0.360106 0.939992
3 0.471858 -0.025248
4 -0.663493 2.031343
The equivalent would be:
df2[df1.columns.intersection(df2.columns)]
Out:
a b
0 -0.019703 0.379820
1 0.040658 0.243309
2 1.103032 0.066454
3 -0.921378 1.016017
4 0.188666 -0.626612
With this, you will not get a KeyError if a column in df1 does not exist in df2.
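A related variant that keeps df2's own column order and likewise avoids a KeyError is boolean selection with Index.isin; a small sketch:
df2.loc[:, df2.columns.isin(df1.columns)]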
Data:
Multiple dataframes of the same format (same columns, an equal number of rows, and no points missing).
How do I create a "summary" dataframe that contains an element-wise mean for every element? How about a dataframe that contains an element-wise standard deviation?
A B C
0 -1.624722 -1.160731 0.016726
1 -1.565694 0.989333 1.040820
2 -0.484945 0.718596 -0.180779
3 0.388798 -0.997036 1.211787
4 -0.249211 1.604280 -1.100980
5 0.062425 0.925813 -1.810696
6 0.793244 -1.860442 -1.196797
A B C
0 1.016386 1.766780 0.648333
1 -1.101329 -1.021171 0.830281
2 -1.133889 -2.793579 0.839298
3 1.134425 0.611480 -1.482724
4 -0.066601 -2.123353 1.136564
5 -0.167580 -0.991550 0.660508
6 0.528789 -0.483008 1.472787
You can create a Panel of your DataFrames and then compute the mean and standard deviation along the items axis (note that Panel was removed in pandas 0.25, so this only works on older versions; a concat-based equivalent is sketched at the end of this answer):
df1 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df3 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
p = pd.Panel({n: df for n, df in enumerate([df1, df2, df3])})
>>> p.mean(axis=0)
A B C
0 -0.024284 -0.622337 0.581292
1 0.186271 0.596634 -0.498755
2 0.084591 -0.760567 -0.334429
3 -0.833688 0.403628 0.013497
4 0.402502 -0.017670 -0.369559
5 0.733305 -1.311827 0.463770
6 -0.941334 0.843020 -1.366963
7 0.134700 0.626846 0.994085
8 -0.783517 0.703030 -1.187082
9 -0.954325 0.514671 -0.370741
>>> p.std(axis=0)
A B C
0 0.196526 1.870115 0.503855
1 0.719534 0.264991 1.232129
2 0.315741 0.773699 1.328869
3 1.169213 1.488852 1.149105
4 1.416236 1.157386 0.414532
5 0.554604 1.022169 1.324711
6 0.178940 1.107710 0.885941
7 1.270448 1.023748 1.102772
8 0.957550 0.355523 1.284814
9 0.582288 0.997909 1.566383
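Since Panel no longer exists in current pandas, a rough equivalent (a sketch, not a drop-in replacement) is to stack the frames with concat using keys and then group on the original row index:
stacked = pd.concat([df1, df2, df3], keys=range(3))   # outer index level = frame number
elementwise_mean = stacked.groupby(level=1).mean()    # mean across the three frames, element-wise
elementwise_std = stacked.groupby(level=1).std()      # standard deviation, element-wise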
One simple solution is to concatenate the existing dataframes into a single dataframe, adding an id column to track the original source:
import numpy as np

dfa = pd.DataFrame(np.random.randn(2, 2), columns=['a', 'b']).assign(id='a')
dfb = pd.DataFrame(np.random.randn(2, 2), columns=['a', 'b']).assign(id='b')
df = pd.concat([dfa, dfb])
a b id
0 -0.542652 1.609213 a
1 -0.192136 0.458564 a
0 -0.231949 -0.000573 b
1 0.245715 -0.083786 b
So now you have two 2x2 dataframes combined into a single 4x2 dataframe. The 'id' column identifies the source dataframe, so you haven't lost any generality, and you can select on 'id' to do the same thing you would to any single dataframe, e.g. df[df['id'] == 'a'].
But now you can also use groupby to apply any pandas method such as mean() or std() on an element-by-element basis, by grouping on the row index (which both source frames share) rather than on 'id':
df.groupby(level=0).mean(numeric_only=True)   # numeric_only skips the string 'id' column

          a         b
0  0.198164 -0.811475
1  0.639529  0.812810
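The element-wise standard deviation follows the same pattern; numeric_only again keeps the non-numeric 'id' column out of the calculation:
df.groupby(level=0).std(numeric_only=True)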
The following solution worked for me.
average_data_frame = (dataframe1 + dataframe2) / 2
Or, if you have more than two dataframes, put them in a list, say dataframes with n entries, then:
average_data_frame = dataframes[0]
for next_df in dataframes[1:]:
    average_data_frame = average_data_frame + next_df
average_data_frame = average_data_frame / len(dataframes)
Once you have the average, you can go for the standard deviation in the same arithmetic style (see the sketch below). If you are looking for a "true Pythonic" approach, you should follow the other answers. But if you are looking for a quick, working solution, this is it.
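A rough sketch of that standard-deviation step, assuming your frames are collected in the list dataframes used above:
n = len(dataframes)
mean_df = sum(dataframes) / n                                        # element-wise mean
std_df = (sum((d - mean_df) ** 2 for d in dataframes) / n) ** 0.5    # element-wise population std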