Compare values from two pandas data frames, order-independent - python

I am new to data science. I want to check which elements from one data frame exist in another data frame, e.g.
df1 = [1,2,8,6]
df2 = [5,2,6,9]
# for 1 the output should be False
# for 2 the output should be True
# for 6 the output should be True
etc.
Note: I have a matrix, not a vector.
I have tried using the following code:
import pandas as pd
import numpy as np

priority_dataframe = pd.read_excel(prioritylist_file_path, sheet_name='Sheet1', index=None)
priority_dict = {column: np.array(priority_dataframe[column].dropna(axis=0, how='all').str.lower())
                 for column in priority_dataframe.columns}
keys_found_per_sheet = []
if file_path.lower().endswith(('.csv')):
    file_dataframe = pd.read_csv(file_path)
else:
    file_dataframe = pd.read_excel(file_path, sheet_name=sheet, index=None)
file_cell_array = list()
for column in file_dataframe.columns:
    for file_cell in np.array(file_dataframe[column].dropna(axis=0, how='all')):
        if isinstance(file_cell, str) == 'str':
            file_cell_array.append(file_cell)
        else:
            file_cell_array.append(str(file_cell))
converted_file_cell_array = np.array(file_cell_array)
for key, values in priority_dict.items():
    for priority_cell in values:
        if priority_cell in converted_file_cell_array[:]:
            keys_found_per_sheet.append(key)
            break
Am I doing something wrong in if priority_cell in converted_file_cell_array[:]?
Is there any other, more efficient way to do this?

You can take the .values from each dataframe, flatten them, convert them to a set(), and take the set intersection.
set1 = set(df1.values.reshape(-1).tolist())
set2 = set(df2.values.reshape(-1).tolist())
common = set1 & set2
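Checking individual values against that intersection then reproduces the expected outputs from the question; a minimal sketch, building the question's example data as single-column DataFrames:
import pandas as pd

df1 = pd.DataFrame([1, 2, 8, 6])
df2 = pd.DataFrame([5, 2, 6, 9])

set1 = set(df1.values.reshape(-1).tolist())
set2 = set(df2.values.reshape(-1).tolist())
common = set1 & set2

print(1 in common)  # False
print(2 in common)  # True
print(6 in common)  # True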

You can flatten all values of both DataFrames with numpy.ravel and then use set.intersection():
df1 = pd.DataFrame({'A':list('abcdef'),
                    'B':[4,5,4,5,5,4],
                    'C':[7,8,9,4,2,3],
                    'D':[1,3,5,7,1,0],
                    'E':[5,3,6,9,2,4],
                    'F':list('aaabbb')})
print (df1)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df2 = pd.DataFrame({'A':[2,3,13,4], 'Z':list('abfr')})
print (df2)
A Z
0 2 a
1 3 b
2 13 f
3 4 r
L = list(set(df1.values.ravel()).intersection(df2.values.ravel()))
print (L)
['f', 2, 3, 4, 'a', 'b']
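If you need an element-wise True/False answer in the shape of the original frame (the expected output in the question, which also covers the "matrix, not vector" case), numpy.isin can test every cell of df1 against all values of df2. A minimal sketch using the df1 and df2 defined above (np.isin requires NumPy >= 1.13):
import numpy as np

# True where the cell value in df1 also occurs anywhere in df2
mask = np.isin(df1.values, df2.values.ravel())
print(pd.DataFrame(mask, index=df1.index, columns=df1.columns))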

Related

Pandas Dataframe - select columns with a specific value in a specific row

I want to select columns with a specific value (say 1) in a specific row (say first row) for Pandas Dataframe
You can use this:
df['a'][df['a']==0]
Use iloc with boolean indexing; for performance it is better to filter the index rather than the whole DataFrame, and then select from the index:
df = pd.DataFrame({
    'A':list('abcdef'),
    'B':[4,5,4,5,5,4],
    'C':[7,8,9,4,2,3],
    'D':[1,3,5,7,1,0],
    'E':[5,3,6,9,2,4],
    'F':list('aaabbb')
})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
s = df.iloc[0]
a = s.index[s == 1]
print (a)
Index(['D'], dtype='object')
a = s.index.values[(s == 1)]
print (a)
['D']
You can use iloc to extract a row as a Series, then apply your condition:
row = df.iloc[0]           # extract first row as a Series
res = row[row == 1].index  # filter for values equal to 1 and get the matching columns via the index
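Equivalently, the boolean mask can be applied to df.columns directly; a minimal one-liner variant of the same idea, using the df defined above:
print(df.columns[df.iloc[0] == 1])  # Index(['D'], dtype='object')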

Assigning dataframe data to dataframe label using a loop

I have a list of dataframe names that I would like to assign different dataframe data to.
filenames = []
for i in np.arange(1,7):
    a = "C:\Users\...........\Python code\Cp error for MPE MR%s.csv" % (i)
    filenames.append(a)
dfs = [df1,df2,df3,df4,df5,df6]
for i, j in enumerate(filenames):
    dfs[j] = pd.DataFrame.from_csv(i, header=0, index_col=None)
However, the following error occurs:
NameError: name 'df1' is not defined
Is there something wrong with the way I am defining the list of values? Why can't a value in a list be assigned as a variable?
How can I put the following code in a loop?
df1 = pd.DataFrame.from_csv(filenames[0],header=0, index_col=None)
df2 = pd.DataFrame.from_csv(filenames[1],header=0, index_col=None)
df3 = pd.DataFrame.from_csv(filenames[2],header=0, index_col=None)
df4 = pd.DataFrame.from_csv(filenames[3],header=0, index_col=None)
df5 = pd.DataFrame.from_csv(filenames[4],header=0, index_col=None)
df6 = pd.DataFrame.from_csv(filenames[5],header=0, index_col=None)
It seems you need a dict comprehension; one possible way to get the list of files is to use glob:
Sample files:
a.csv, b.csv, c.csv.
import glob
import os

files = glob.glob('files/*.csv')
# extract the file name without extension (works on Windows paths too) - os.path.splitext(os.path.split(fp)[1])
dfs = {os.path.splitext(os.path.split(fp)[1])[0]: pd.read_csv(fp) for fp in files}
print (dfs)
{'b': a b c d
0 0 9 6 5
1 1 6 4 2, 'a': a b c d
0 0 1 2 5
1 1 5 8 3, 'c': a b c d
0 0 7 1 7
1 1 3 2 6}
print (dfs['a'])
a b c d
0 0 1 2 5
1 1 5 8 3
If the files share the same columns, it is possible to create one big DataFrame with concat:
df = pd.concat(dfs)
print (df)
a b c d
a 0 0 1 2 5
1 1 5 8 3
b 0 0 9 6 5
1 1 6 4 2
c 0 0 7 1 7
1 1 3 2 6
EDIT:
It is better to use read_csv instead of pd.DataFrame.from_csv:
Solution with global variables:
#for df0, df1, df2...
for i, fp in enumerate(files):
    print (fp)
    df = pd.read_csv(fp, header=0, index_col=None)
    globals()['df' + str(i)] = df
print (df1)
a b c d
0 0 9 6 5
1 1 6 4 2
A better solution is a list of DataFrames, selecting by position:
#for dfs[0], dfs[1], dfs[2]...
dfs = [pd.read_csv(fp, header=0, index_col=None) for fp in files]
print (dfs[1])
a b c d
0 0 9 6 5
1 1 6 4 2
dfs = [df1,df2,df3,df4,df5,df6]?
Why this line? Shouldn't it be:
dfs = []
And yes, I think you swapped i and j; it should be something like:
dfs.append(pd.DataFrame.from_csv(j, header=0, index_col=None))
And enumerate is redundant:
for f in filenames:
    dfs.append(pd.DataFrame.from_csv(f, header=0, index_col=None))
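Note that pd.DataFrame.from_csv was deprecated and later removed from pandas, so on current versions the same loop should use read_csv; a minimal sketch under that assumption:
dfs = []
for f in filenames:
    dfs.append(pd.read_csv(f, header=0, index_col=None))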

How to extract rows in a pandas dataframe NOT in a subset dataframe

I have two dataframes. DF and SubDF. SubDF is a subset of DF. I want to extract the rows in DF that are NOT in SubDF.
I tried the following:
DF2 = DF[~DF.isin(SubDF)]
The number of rows is correct and most rows are correct,
i.e. number of rows in SubDF + number of rows in DF2 = number of rows in DF,
but I get rows with NaN values that do not exist in the original DF.
Not sure what I'm doing wrong.
Note: the original DF does not have any NaN values, and to double-check I ran DF.dropna() beforehand and the result still contained NaN.
You need merge with an outer join and boolean indexing, because DataFrame.isin needs values and index to match:
DF = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})
print (DF)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
SubDF = pd.DataFrame({'A':[3],
                      'B':[6],
                      'C':[9],
                      'D':[5],
                      'E':[6],
                      'F':[3]})
print (SubDF)
A B C D E F
0 3 6 9 5 6 3
#returns no match, because isin aligns by index
DF2 = DF[~DF.isin(SubDF)]
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
DF2 = pd.merge(DF, SubDF, how='outer', indicator=True)
DF2 = DF2[DF2['_merge'] == 'left_only'].drop('_merge', axis=1)
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
Another way, borrowing the setup from @jezrael:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})
sub = pd.DataFrame({'A':[3],
                    'B':[6],
                    'C':[9],
                    'D':[5],
                    'E':[6],
                    'F':[3]})
extract_idx = list(set(df.index) - set(sub.index))
df_extract = df.loc[extract_idx]
The rows may not be sorted in the original df order. If matching order is required:
extract_idx = list(set(df.index) - set(sub.index))
idx_dict = dict(enumerate(df.index))
order_dict = dict(zip(idx_dict.values(), idx_dict.keys()))
df_extract = df.loc[sorted(extract_idx, key=order_dict.get)]
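If the goal is simply "rows of df whose index is not in sub", a shorter variant that also preserves the original row order is boolean indexing on the index itself; a small sketch using the same df and sub as above:
df_extract = df[~df.index.isin(sub.index)]
print(df_extract)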

How to simply add a column level to a pandas dataframe

Let's say I have a dataframe that looks like this:
df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df
Out[92]:
A B
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Assuming that this dataframe already exists, how can I simply add a level 'C' to the column index so I get this:
df
Out[92]:
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I saw SO answers like this one: python/pandas: how to combine two dataframes into one with hierarchical column index? But that concats different dataframes instead of adding a column level to an already existing dataframe.
As suggested by @StevenG himself, a better answer:
df.columns = pd.MultiIndex.from_product([df.columns, ['C']])
print(df)
# A B
# C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
option 1
set_index and T
df.T.set_index(np.repeat('C', df.shape[1]), append=True).T
option 2
pd.concat, keys, and swaplevel
pd.concat([df], axis=1, keys=['C']).swaplevel(0, 1, 1)
A solution which adds a name to the new level and is easier on the eyes than other answers already presented:
df['newlevel'] = 'C'
df = df.set_index('newlevel', append=True).unstack('newlevel')
print(df)
# A B
# newlevel C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
You could just assign the columns like:
>>> df.columns = [df.columns, ['C', 'C']]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Or for unknown length of columns:
>>> df.columns = [df.columns.get_level_values(0), np.repeat('C', df.shape[1])]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Another way for MultiIndex (appending 'E'):
df.columns = pd.MultiIndex.from_tuples(map(lambda x: (x[0], 'E', x[1]), df.columns))
A B
E E
C D
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I like it explicit (using MultiIndex) and chain-friendly (.set_axis):
df.set_axis(pd.MultiIndex.from_product([df.columns, ['C']]), axis=1)
This is particularly convenient when merging DataFrames with different column level numbers, where Pandas (1.4.2) raises a FutureWarning (FutureWarning: merging between different levels is deprecated and will be removed ... ):
import pandas as pd
df1 = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df2 = pd.DataFrame(index=list('abcde'), data=range(10, 15), columns=pd.MultiIndex.from_tuples([("C", "x")]))
# df1:
A B
a 0 0
b 1 1
# df2:
C
x
a 10
b 11
# merge while giving df1 another column level:
pd.merge(df1.set_axis(pd.MultiIndex.from_product([df1.columns, ['']]), axis=1),
         df2,
         left_index=True, right_index=True)
# result:
A B C
x
a 0 0 10
b 1 1 11
Another method, but using a list comprehension of tuples as the arg to pandas.MultiIndex.from_tuples():
df.columns = pd.MultiIndex.from_tuples([(col, 'C') for col in df.columns])
df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
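from_product also accepts a names argument if you want the new level labelled, and the operation is easy to reverse with droplevel. A minimal sketch (not from any of the answers above), putting 'C' on top this time:
import pandas as pd

df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
# add 'C' as the top level and name both levels
df.columns = pd.MultiIndex.from_product([['C'], df.columns], names=['group', 'col'])
# undo it again: drop the added level
df.columns = df.columns.droplevel('group')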

How to append several data frame into one

I have written code to append several dummy DataFrames into one. After appending, the expected DataFrame.shape would be (9, 3), but my code produces an unexpected output of (6, 3). How can I rectify the error in my code?
import pandas as pd

a = [[1,2,4],[1,3,4],[2,3,4]]
b = [[1,1,1],[1,6,4],[2,9,4]]
c = [[1,3,4],[1,1,4],[2,0,4]]
d = [[1,1,4],[1,3,4],[2,0,4]]
df1 = pd.DataFrame(a,columns=["a","b","c"])
df2 = pd.DataFrame(b,columns=["a","b","c"])
df3 = pd.DataFrame(c,columns=["a","b","c"])
for df in (df1, df2, df3):
    df = df.append(df, ignore_index=True)
print df
I don't want to use pd.concat because then I would have to store all the data frames in memory, and my real data set contains hundreds of data frames with huge shapes. I just want code that can open one CSV file at a time inside the loop and update the final DF as the loop progresses.
Thanks.
Firstly, use concat to concatenate a bunch of dfs, as it's quicker:
In [308]:
df = pd.concat([df1,df2,df3], ignore_index=True)
df
Out[308]:
a b c
0 1 2 4
1 1 3 4
2 2 3 4
3 1 1 1
4 1 6 4
5 2 9 4
6 1 3 4
7 1 1 4
8 2 0 4
Secondly, you're reusing the name df as the loop variable, so each iteration appends a frame to itself and then overwrites it; if you did this it would work:
In [307]:
a = [[1,2,4],[1,3,4],[2,3,4]]
b = [[1,1,1],[1,6,4],[2,9,4]]
c = [[1,3,4],[1,1,4],[2,0,4]]
d = [[1,1,4],[1,3,4],[2,0,4]]

df1 = pd.DataFrame(a,columns=["a","b","c"])
df2 = pd.DataFrame(b,columns=["a","b","c"])
df3 = pd.DataFrame(c,columns=["a","b","c"])

df = pd.DataFrame()

for d in (df1, df2, df3):
    df = df.append(d, ignore_index=True)
df
Out[307]:
a b c
0 1 2 4
1 1 3 4
2 2 3 4
3 1 1 1
4 1 6 4
5 2 9 4
6 1 3 4
7 1 1 4
8 2 0 4
Here I changed the loop variable to d and declared an empty df outside the loop:
df = pd.DataFrame()

for d in (df1, df2, df3):
    df = df.append(d, ignore_index=True)
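Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the loop above no longer runs. The idiomatic replacement is to collect the pieces and call pd.concat once. concat still holds all pieces in memory, so for the one-CSV-at-a-time concern in the question, appending each file to a single CSV on disk avoids that. A sketch, where filenames is a hypothetical list of CSV paths:
import pandas as pd

filenames = ['file1.csv', 'file2.csv']  # hypothetical paths

# idiomatic replacement for the removed DataFrame.append
df = pd.concat([pd.read_csv(f) for f in filenames], ignore_index=True)

# memory-light alternative: stream each file into one output CSV on disk
for i, f in enumerate(filenames):
    pd.read_csv(f).to_csv('combined.csv', mode='a', header=(i == 0), index=False)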
