I have a dictionary of dataframes (Di_1). Each dataframe has the same number of columns, the same column names, the same number of rows, and the same row indexes. I also have a list of the names of the dataframes (dfs). I would like to compare the contents of one of the columns (A) in each dataframe with that column in the last dataframe of the list, to see whether the values are the same. For example:
df_A = pd.DataFrame({'A': [1,0,1,0]})
df_B = pd.DataFrame({'A': [1,1,0,0]})
Di_1 = {'X': df_A, 'Y': df_B}
dfs = ['X','Y']
I tried:
for df in dfs:
    Di_1[str(df)]['True'] = Di_1[str(df)]['A'].equals(Di_1[str(dfs[-1])]['A'])
I got:
[0,0,0,0]
I would like to get:
[1,0,0,1]
My attempt checks whether the whole column is the same, but I would instead like the comparison to go through each dataframe row by row.
I think you are making things too complicated here. You can write:
series_last = Di_1[dfs[-1]]['A']
for df in map(Di_1.get, dfs):
    df['True'] = df['A'] == series_last
This produces the following result:
>>> df_A
   A   True
0  1   True
1  0  False
2  1  False
3  0   True
>>> df_B
   A  True
0  1  True
1  1  True
2  0  True
3  0  True
So each dataframe gets an extra column named 'True' (you may want to use a different name) that says, for each row, whether the value is the same as the corresponding value in series_last.
In case dfs contains something other than strings, we can first convert the items to strings:
series_last = Di_1[str(dfs[-1])]['A']
for df in map(Di_1.get, map(str, dfs)):
    df['True'] = df['A'] == series_last
Create a list:
l = [Di_1[i] for i in dfs]
Then, using isin(), you can compare the first and the last dataframe element-wise (when isin() is given a DataFrame, it matches values position-wise on the aligned index and columns):
l[0].isin(l[-1]).astype(int)
   A
0  1
1  0
2  0
3  1
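If you want this flag added to every dataframe in Di_1 (not just a comparison of the first against the last), a minimal sketch building on the same isin() idea (the column name v is my choice, not from the question):
import pandas as pd

df_A = pd.DataFrame({'A': [1, 0, 1, 0]})
df_B = pd.DataFrame({'A': [1, 1, 0, 0]})
Di_1 = {'X': df_A, 'Y': df_B}
dfs = ['X', 'Y']

last = Di_1[dfs[-1]][['A']]  # column A of the last dataframe, kept as a frame
for name in dfs:
    # DataFrame.isin aligns on index and columns, giving an element-wise check
    Di_1[name]['v'] = Di_1[name][['A']].isin(last)['A'].astype(int)

print(Di_1['X']['v'].tolist())  # [1, 0, 0, 1]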
I have a pandas dataframe [posted as an image in the original question].
I want to drop the rows that have only one non-zero value. What's the most efficient way to do this?
Try boolean indexing:
import numpy as np
import pandas as pd

# sample data: a 10x10 frame of zeros with a few non-zero entries
df = pd.DataFrame(np.zeros((10, 10)), columns=list('abcdefghij'))
df.iloc[2:5, 3] = 1
df.iloc[4:5, 4] = 1

# keep only the rows whose count of non-zero values is not exactly 1
df[df.ne(0).sum(axis=1).ne(1)]
Only rows 2 and 3 are removed: they are the only rows with exactly one non-zero value (row 4 has two, and every other row has none). The per-row counts of non-zero values look like this:
df.ne(0).sum(axis=1)
0    0
1    0
2    1
3    1
4    2
5    0
6    0
7    0
8    0
9    0
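Putting it together, the filtered frame keeps every row except 2 and 3:
out = df[df.ne(0).sum(axis=1).ne(1)]
print(out.index.tolist())  # [0, 1, 4, 5, 6, 7, 8, 9]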
Not sure if this is the most efficient, but I'll try:
df[[col for col in df.columns if (df[col] != 0).sum() == 1]]
That's two passes per column here: one to check != 0 and another to sum the booleans up (it could stop early once a second non-zero value is found).
Alternatively, you can define a custom function that checks without looping twice per column:
def check(column):
    already_has_one = False
    for value in column:
        if value != 0:
            if already_has_one:
                return False
            already_has_one = True
    return already_has_one
then:
df[[col for col in df.columns if check(df[col])]]
This is much faster than the first approach.
Or like this:
df[(df.applymap(bool).sum(axis=1) > 1).values]
(Note that applymap is deprecated since pandas 2.1 in favour of DataFrame.map; df.ne(0) would do the same job here.)
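To check the speed claims on your own data, a rough sketch using the standard timeit module (this assumes the sample df and the check function defined above; absolute timings will vary by machine):
import timeit

# run each column-filtering variant 1000 times and print elapsed seconds
print(timeit.timeit(lambda: df[[c for c in df.columns if (df[c] != 0).sum() == 1]], number=1000))
print(timeit.timeit(lambda: df[[c for c in df.columns if check(df[c])]], number=1000))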
I have a dataframe like this
name   skill_1  skill_2
john   2        0
james  0        1
I would like to have a count of the values above zero in each column starting with "skill".
Expected output for the new dataframe:
skills   count
skill_1  1
skill_2  1
How can I do it with pandas?
You can try:
df.filter(like="skill").gt(0).sum(axis=0).to_frame("count")
filter for the columns whose names include "skill";
mark the entries greater than 0 as True and the others as False;
sum along axis=0 (i.e. down the rows of each column), where True counts as 1 and False as 0, to get the counts;
convert the resulting Series to a dataframe.
The result is:
         count
skill_1      1
skill_2      1
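If you want the exact layout from the question, with "skills" as a regular column, a small extension of the same chain (the skills/count names match the expected output above):
out = (df.filter(like="skill")
         .gt(0)
         .sum(axis=0)
         .rename_axis("skills")
         .reset_index(name="count"))
print(out)
#     skills  count
# 0  skill_1      1
# 1  skill_2      1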
Just filter by the condition on the "skill" columns:
import pandas as pd

df = pd.DataFrame(columns=['name', 'skill_1', 'skill_2'],
                  data=[['john', 2, 0],
                        ['james', 0, 1]])
skill_cols = [x for x in df.columns if 'skill' in x]
# entries that fail the > 0 condition become NaN, and count() skips NaN
subset_df = df[df[skill_cols] > 0][skill_cols]
column_count = subset_df.count()
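For the sample data above, column_count comes out as a Series of per-column counts:
print(column_count)
# skill_1    1
# skill_2    1
# dtype: int64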
At the beginning, I'd like to add a multilevel column to an empty dataframe.
df = pd.DataFrame({"nodes": list(range(1, 5, 2))})
df.set_index("nodes", inplace=True)
So this is the dataframe to start with (still empty):
>>> df
nodes
1
3
Now I'd like to add a first multilevel column.
I tried the following:
new_df = pd.DataFrame.from_dict(dict(zip(df.index, [1,2])), orient="index",
                                columns=["value"])
df = pd.concat([new_df], axis=1, keys=["test"])
Now the dataframe df looks like this:
>>> df
    test
   value
1      1
3      2
To add another column, I've done something similar.
new_df2 = pd.DataFrame.from_dict(dict(zip(df.index, [3,4])), orient="index",
                                 columns=[("test2", "value2")])
df = pd.concat([df, new_df2], axis=1)
df.index.name = "nodes"
So the desired dataframe looks like this:
>>> df
       test  test2
      value value2
nodes
1         1      3
3         2      4
This way of adding multilevel columns seems a bit strange. Is there a better way of doing so?
Create a MultiIndex on the columns by storing your DataFrames in a dict and concatenating along axis=1. The keys of the dict become the extra levels of the column MultiIndex (tuple keys add multiple levels, one per element, while scalar keys add a single level), and the original DataFrame columns stay as they are. Alignment is enforced on the row index.
import pandas as pd
d = {}
d[('foo', 'bar')] = pd.DataFrame({'val': [1,2,3]}).rename_axis(index='nodes')
d[('foo2', 'bar2')] = pd.DataFrame({'val2': [4,5,6]}).rename_axis(index='nodes')
d[('foo2', 'bar1')] = pd.DataFrame({'val2': [7,8,9]}).rename_axis(index='nodes')
pd.concat(d, axis=1)
      foo  foo2
      bar  bar2 bar1
      val  val2 val2
nodes
0       1     4    7
1       2     5    8
2       3     6    9
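If you only need to append columns one at a time, direct assignment with tuple keys is a lighter alternative to concat. A minimal sketch (with one caveat: assigning tuple keys to a frame whose column index is still flat creates tuple-named columns, so they are promoted to a real MultiIndex at the end):
import pandas as pd

df = pd.DataFrame(index=pd.Index([1, 3], name="nodes"))
df[("test", "value")] = [1, 2]
df[("test2", "value2")] = [3, 4]
# promote the tuple column names to an actual MultiIndex
df.columns = pd.MultiIndex.from_tuples(df.columns)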
I have two dataframes, df1 (35k records) and df2 (100k records). In df1['col1'] and df2['col3'] I have unique ids. I want to match df1['col1'] with df2['col3']: if they match, I want to add a column df1['Match'] with the value True, and if they don't, with the value False. I want to map these True and False values to the matching and non-matching records respectively.
I am using the .isin() function; I am getting the correct match and non-match counts but am not able to map them correctly.
Match = df1['col1'].isin(df2['col3'])
df1['match'] = Match
I have also used the merge function, passing the parameter how='right', but did not get the desired results.
You can simply do as follows:
df1['Match'] = df1['col1'].isin(df2['col3'])
For instance:
import pandas as pd

data1 = [1, 2, 3, 4, 5]
data2 = [2, 3, 5]
df1 = pd.DataFrame(data1, columns=['a'])
df2 = pd.DataFrame(data2, columns=['c'])
print(df1)
print(df2)
df1['Match'] = df1['a'].isin(df2['c'])  # True if the value matches, else False
print(df1)
Output:
   a
0  1
1  2
2  3
3  4
4  5
   c
0  2
1  3
2  5
   a  Match
0  1  False
1  2   True
2  3   True
3  4  False
4  5   True
Use df.loc indexing:
df1['Match'] = False
df1.loc[df1['col1'].isin(df2['col3']), 'Match'] = True
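Since the question mentions trying merge: that can work as well, but you want a left join from df1's side together with the indicator flag, not how='right'. A sketch with toy data (the col1/col3 names are taken from the question):
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col3': [2, 3, 5]})

merged = df1.merge(df2[['col3']].drop_duplicates(), how='left',
                   left_on='col1', right_on='col3', indicator=True)
# '_merge' is 'both' for matched rows and 'left_only' for unmatched ones
df1['Match'] = merged['_merge'].eq('both').to_numpy()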
I have 2 dataframes, df1 & df2, as given below:
df1:
a
T11552
T11559
T11566
T11567
T11569
T11594
T11604
T11625
df2:
a b
T11552 T11555
T11560 T11559
T11566 T11562
T11568 T11565
T11569 T11560
T11590 T11594
T11604 T11610
T11621 T11625
T11633 T11631
T11635 T11634
T13149 T13140
I want to have a new dataframe df3 where I search for each value of df1 in df2. If the value is present anywhere in df2, I want the new column to contain True, otherwise False, as shown below.
df3:
a v
T11552 TRUE
T11559 TRUE
T11566 TRUE
T11567 FALSE
T11569 TRUE
T11594 TRUE
T11604 TRUE
T11625 TRUE
T11633 TRUE
T11634 TRUE
Use assign to build the new DataFrame with isin, converting all of df2's values to a flattened array with ravel. To improve performance, it is possible to check only the unique values, and also to check with numpy.in1d:
import numpy as np

df3 = df1.assign(v = lambda x: x['a'].isin(np.unique(df2.values.ravel())))
# alternative solution
# df3 = df1.assign(v = lambda x: np.in1d(x['a'], np.unique(df2[['a','b']].values.ravel())))
# if you need to check only specific columns of df2
df3 = df1.assign(v = lambda x: x['a'].isin(np.unique(df2[['a','b']].values.ravel())))
print(df3)
        a      v
0  T11552   True
1  T11559   True
2  T11566   True
3  T11567  False
4  T11569   True
5  T11594   True
6  T11604   True
7  T11625   True
Try this:
df3 = df1[['a']].copy()
df3['v'] = df3['a'].isin(set(df2.values.ravel()))
The above code will:
Create a new dataframe using column 'a' from df1.
Create a Boolean column 'v' testing whether each value of column 'a' appears anywhere in df2, flattening df2's values with numpy.ravel and using a set for fast membership lookups.
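If df2 ever gains extra columns that should not take part in the check, the same approach can be restricted to specific columns first (a sketch, assuming the a/b column names from the question):
values = set(df2[['a', 'b']].to_numpy().ravel())
df3['v'] = df3['a'].isin(values)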