I am new to data frames, so I apologize if the question is obvious. Assume I have a data frame that looks like this:
1 2 3
4 5 6
7 8 9
and I would like to check if it contains the following data frame:
5 6
8 9
Is there any built-in function in pandas.DataFrame that does this?
Assuming the two dataframes share the same relative columns and index labels (I assume so, since they are dataframes and not just value arrays), here is a quick solution (not the most elegant or efficient) where you compare the two dataframes after combine_first:
DataFrame.combine_first(other)
Combine two DataFrame objects and
default to non-null values in frame calling the method. Result index
columns will be the union of the respective indexes and columns
Example:
df
a b c
0 1 2 3
1 4 5 6
2 7 8 9
df1
a b
1 4 5
2 7 8
(df1.combine_first(df) == df.combine_first(df1)).all().all()
True
or, if you want to check that df1 (the smaller one) is contained in df (assuming you already know their sizes):
(df == df1.combine_first(df)).all().all()
True
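For completeness, here is a minimal, self-contained sketch of the check above, using the same example frames (df, df1) and column names; combine_first fills in df's values wherever df1 has no data, so if df1 is a sub-frame of df the filled result equals df itself:
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})
df1 = pd.DataFrame({'a': [4, 7], 'b': [5, 8]}, index=[1, 2])

# True only if every value of df1 matches df at the same index/column
contained = (df == df1.combine_first(df)).all().all()
print(contained)  # True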
Say I have the following dataframes (suppose they are Dask dataframes):
df A =
1
1
2
2
2
2
3
4
5
5
5
5
5
5
df B =
1
2
2
3
3
3
4
5
5
5
and I would like to merge the two so that the resulting dataframe keeps the most information from the two (so, for instance, for observation 1 I would like to preserve the info from df A, for observation number 3 I would like to preserve the info from df B, and so on...).
In other words the resulting DataFrame should be like this:
df C=
1
1
2
2
2
2
3
3
3
4
5
5
5
5
5
5
Is there a way to do that in Dask?
Thank you
Notes:
There are various ways to merge Dask dataframes. Dask provides several built-in functions and methods, such as dask.dataframe.DataFrame.join, dask.dataframe.multi.concat, dask.dataframe.DataFrame.merge, dask.dataframe.multi.merge, and dask.dataframe.multi.merge_asof. Depending on one's requirements, one might want to use a specific one.
This thread has really valuable information on merges. Even though its focus is on Pandas, it will allow one to understand left, right, outer, and inner merges.
If one wants to do it with Pandas dataframes, there are various ways to achieve that.
One approach would be to create a dataframe that stores which of the dataframes has the highest number of rows per sample_id, and then apply a custom-made function. Let's invest a bit more time in that approach.
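For a reproducible example, the frames from the question can be built as below; the single column is assumed to be named sample_id, as in the rest of this answer:
import pandas as pd

# Values taken from the question; the column name sample_id is an assumption
df_a = pd.DataFrame({'sample_id': [1, 1, 2, 2, 2, 2, 3, 4, 5, 5, 5, 5, 5, 5]})
df_b = pd.DataFrame({'sample_id': [1, 2, 2, 3, 3, 3, 4, 5, 5, 5]})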
We will first create a dataframe that stores the number of rows each dataframe has per sample_id, as follows:
df_count = pd.DataFrame({'sample_id': df_a['sample_id'].unique()})
df_count['df_a'] = df_count['sample_id'].map(df_a.groupby('sample_id').size())
df_count['df_b'] = df_count['sample_id'].map(df_b.groupby('sample_id').size())
As it will be helpful, let us also create a column df_max that stores which dataframe has more rows per sample_id:
df_count['df_max'] = df_count[['df_a', 'df_b']].idxmax(axis=1)
[Out]:
sample_id df_a df_b df_max
0 1 2 1 df_a
1 2 4 2 df_a
2 3 1 3 df_b
3 4 1 1 df_a
4 5 6 3 df_a
A one-liner to create the desired df_count would look like the following
df_count = pd.DataFrame({'sample_id': df_a['sample_id'].unique()}).assign(df_a=lambda x: x['sample_id'].map(df_a.groupby('sample_id').size()), df_b=lambda x: x['sample_id'].map(df_b.groupby('sample_id').size()), df_max=lambda x: x[['df_a', 'df_b']].idxmax(axis=1))
Now, given df_a, df_b, and df_count, one will want a function to merge the dataframes based on a specific condition:
If df_max is df_a, then take the rows from df_a.
If df_max is df_b, then take the rows from df_b.
One can create a function merge_df that takes df_a, df_b, and df_count and returns the merged dataframe
def merge_df(df_a, df_b, df_count):
    # Create a list to store the dataframes
    df_list = []

    # Iterate over the rows in df_count
    for index, row in df_count.iterrows():
        # If df_max is df_a, then take the rows from df_a
        if row['df_max'] == 'df_a':
            df_list.append(df_a[df_a['sample_id'] == row['sample_id']])
        # If df_max is df_b, then take the rows from df_b
        elif row['df_max'] == 'df_b':
            df_list.append(df_b[df_b['sample_id'] == row['sample_id']])
        # If df_max is neither df_a nor df_b, then use the first dataframe
        else:
            df_list.append(df_a[df_a['sample_id'] == row['sample_id']])

    # Concatenate the dataframes in df_list and return the result. Also, reset the index.
    return pd.concat(df_list).reset_index(drop=True)
Then one can apply the function
df_merged = merge_df(df_a, df_b, df_count)
[Out]:
sample_id
0 1
1 1
2 2
3 2
4 2
5 2
6 3
7 3
8 3
9 4
10 5
11 5
12 5
13 5
14 5
15 5
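If iterating over df_count with iterrows ever becomes a bottleneck, a more vectorized sketch of the same idea (reusing df_a, df_b, and df_count from above) could look like this:
# For each sample_id, look up which dataframe should supply its rows
winner = df_count.set_index('sample_id')['df_max']

df_merged = pd.concat([
    df_a[df_a['sample_id'].map(winner) == 'df_a'],
    df_b[df_b['sample_id'].map(winner) == 'df_b'],
]).sort_values('sample_id').reset_index(drop=True)
This produces the same 16-row result as merge_df for the data above.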
I have a dataframe with column headings (and for my real data multi-level row indexes). I want to add a second level index to the columns based on a list I have.
import pandas as pd
data = {"apple": [7,5,6,4,7,5,8,6],
"strawberry": [3,5,2,1,3,0,4,2],
"banana": [1,2,1,2,2,2,1,3],
"chocolate" : [5,8,4,2,1,6,4,5],
"cake":[4,4,5,1,3,0,0,3]
}
df = pd.DataFrame(data)
food_cat = ["fv","fv","fv","j","j"]
I am wanting something that looks like this: a second level of column headers grouping the existing ones (e.g. fv above apple, strawberry and banana, and j above chocolate and cake).
I tried to use "How to add a second level column header/index to dataframe by matching to dictionary values?" - however, I couldn't get it working (and it's not ideal, as I'd need to figure out how to automate the dictionary, which I don't have).
I also tried adding the list as a row in the dataframe and converting that row to a second level index as in this answer using
df.loc[len(df)] = food_cat
df = pd.MultiIndex.from_arrays(df.columns, df.iloc[len(df)-1])
but got the error
Check if lengths of all arrays are equal or not,
TypeError: Input must be a list / sequence of array-likes.
I also tried using df = pd.MultiIndex.from_arrays(df.columns, np.array(food_cat)) with import numpy as np but got the same error.
I feel like this should be a simple task (it is for rows), and there are a lot of questions asked, but I was struggling to find something I could duplicate to adapt to my data.
pd.MultiIndex.from_arrays requires a list (or list-like) of arrays passed as a single argument:
df.columns = pd.MultiIndex.from_arrays([food_cat, df.columns])
df
fv j
apple strawberry banana chocolate cake
0 7 3 1 5 4
1 5 5 2 8 4
2 6 2 1 4 5
3 4 1 2 2 1
4 7 3 2 1 3
5 5 0 2 6 0
6 8 4 1 4 0
7 6 2 3 5 3
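With the second level in place, whole food categories can be selected by indexing on the first column level, for example:
# All "fv" columns at once, via the first column level
print(df['fv'])

# A single column addressed through both levels
print(df.loc[:, ('j', 'cake')])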
My data is contained within two dataframes. Within each dataframe, the entries are sorted in each column. I want to now merge the two dataframes while preserving row order. For example, suppose I have this:
The first dataframe "A1" looks like this:
index a b c
0 1 4 1
3 2 7 3
5 5 8 4
6 6 10 8
...
and the second dataframe "A2" looks like this (A1 and A2 are the same size):
index a b c
1 3 1 2
2 4 2 5
4 7 3 6
7 8 5 7
...
I want to merge both of these dataframes to get the final dataframe "data":
index a b c
0 1 4 1
1 3 1 2
2 4 2 5
3 2 7 3
...
Here is what I have tried:
data = A1.merge(A2, how='outer', left_index=True, right_index=True)
But I keep getting strange results, and I don't even know whether this approach works when there are multiple columns whose row order needs to be preserved. I find that some of the entries become NaNs for some reason, and I don't know how to fix it. I also tried data.join(A1, A2), but that raised an error saying the two dataframes couldn't be joined.
import pandas as pd

# Create DataFrames df and df1
df = pd.DataFrame({'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,0,11,12]}, index=[0,3,5,6])
df1 = pd.DataFrame({'a':[13,14,15,16],'b':[17,18,19,20],'c':[21,22,23,24]}, index=[1,2,4,7])

# Concatenate df and df1 and sort by index
# (DataFrame.append is removed in pandas 2.0, so pd.concat is used instead)
df2 = pd.concat([df, df1])
print(df2.sort_index())
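For reference, the sorted result interleaves the rows of the two frames according to their original indexes:
    a   b   c
0   1   5   9
1  13  17  21
2  14  18  22
3   2   6   0
4  15  19  23
5   3   7  11
6  16  20  24
7   4   8  12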
I have N dataframes with different numbers of columns, and I want to get one dataframe with two columns, x and y, where x is the data from the columns of the input dataframes and y is the corresponding column name. I have many such dataframes that I need to concat (N is of the order of 10^2), so efficiency is a priority. A numpy way rather than a pandas way is also welcome.
For example,
df1:
one two
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
df2:
  three  four
0   NaN
1  None     f
2            g
3     6     7
Final Output Dataframe:
x y
0 1 one
1 2 one
2 3 one
3 4 one
4 5 one
5 a two
6 b two
7 c two
8 d two
9 e two
10 6 three
11 f four
12 g four
13 7 four
Note: I'm ignoring empty strings, NaNs and Nones in the final dataframe.
IIUC you can use melt() before concatenating:
final = (pd.concat([df1.melt(), df2.dropna().melt()])
           .rename(columns={'variable': 'y', 'value': 'x'})
           .reindex(['x', 'y'], axis=1))
print(final)
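Note that dropna() alone does not remove empty strings, and it drops whole rows, which can also discard wanted values (such as 'f' in the example, which sits next to a None). A variant that melts first and then filters out NaNs, Nones, and empty strings could look like this:
final = (pd.concat([df1.melt(), df2.melt()])
           .rename(columns={'variable': 'y', 'value': 'x'})
           .reindex(['x', 'y'], axis=1))
# keep only real values: drop NaN/None and empty strings, then renumber the rows
final = final[final['x'].notna() & (final['x'] != '')].reset_index(drop=True)
print(final)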
I have a question: how does one count the number of unique values that occur within each column of a pandas dataframe?
Say I have a data frame named df that looks like this:
1 2 3 4
a yes f c
b no f e
c yes d h
I want output that shows the number of unique values within each of the four columns. The output would be something similar to this:
Column # of Unique Values
1 3
2 2
3 2
4 3
I don't need to know what the unique values are, just how many there are within each column.
I have played around with something like this:
df[all_cols].value_counts()
where all_cols is a list of all the columns within the data frame. But this counts how many times each value appears within the columns.
Any advice/suggestions would be a great help. Thanks
You could apply Series.nunique:
>>> df.apply(pd.Series.nunique)
1 3
2 2
3 2
4 3
dtype: int64
Or you could do a groupby/nunique on the unstacked version of the frame:
>>> df.unstack().groupby(level=0).nunique()
1 3
2 2
3 2
4 3
dtype: int64
Both of these produce a Series, which you could then use to build a frame with whatever column names you wanted.
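For example, one way to turn that Series into the two-column layout from the question (the column names here are only illustrative) is:
counts = df.apply(pd.Series.nunique).rename_axis('Column').reset_index(name='# of Unique Values')
print(counts)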
You could try df.nunique()
>>> df.nunique()
1 3
2 2
3 2
4 3
dtype: int64