How to find the set difference between two Pandas DataFrames - python

I'd like to check the difference between two DataFrame columns. I tried using the command:
np.setdiff1d(train.columns, train_1.columns)
which results in an empty array:
array([], dtype=object)
However, the dataframes have different numbers of columns:
len(train.columns), len(train_1.columns) = (51, 56)
which means that the two DataFrames are obviously different.
What is wrong here?

The result is correct; however, setdiff1d is order-dependent. It only returns the elements of the first input array that do not occur in the second array.
If you do not care which of the dataframes has the unique columns, you can use setxor1d. It returns "the unique values that are in only one (not both) of the input arrays", see the documentation.
import numpy
colsA = ['a', 'b', 'c', 'd']
colsB = ['b','c']
c = numpy.setxor1d(colsA, colsB)
c will be an array containing 'a' and 'd'.
If you want to use setdiff1d you need to check for differences both ways:
# columns in train.columns that are not in train_1.columns
c1 = np.setdiff1d(train.columns, train_1.columns)
# columns in train_1.columns that are not in train.columns
c2 = np.setdiff1d(train_1.columns, train.columns)
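If you want the symmetric difference in one call on the DataFrames themselves, the pandas Index API can do it directly. A minimal sketch, building on the c1/c2 arrays above and assuming train and train_1 are the DataFrames from the question:
import numpy as np

# all columns that appear in exactly one of the two DataFrames
both_ways = np.concatenate([c1, c2])

# the same set, using the pandas Index method directly
both_ways_idx = train.columns.symmetric_difference(train_1.columns)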

Use something like this:
data_3 = data1[~data1.isin(data2)]
where data1 and data2 are the columns (Series) to compare and data_3 holds the values of data1 that are not in data2, i.e. data_3 = data1 - data2.
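A minimal runnable sketch of that idea (the values in data1 and data2 here are made up for illustration):
import pandas as pd

data1 = pd.Series(['a', 'b', 'c', 'd'])
data2 = pd.Series(['b', 'c'])

# keep only the values of data1 that do not appear in data2
data_3 = data1[~data1.isin(data2)]
print(data_3.tolist())  # ['a', 'd']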

Related

pandas value_counts(): directly compare two instances

I have run .value_counts() on a similar column in two DataFrames and would like to compare the two results.
I also tried converting the resulting Series to DataFrames (.to_frame('counts'), as suggested in this thread), but it didn't help.
first = df1['company'].value_counts()
second = df2['company'].value_counts()
I tried to merge them, but I think the main problem is that the company name is not a column but the index. Is there a way to resolve this, or a different way to get the comparison?
GOAL: The end goal is to be able to see which companies occur more in df2 than in df1, and the value_counts() themselves (or the difference between them).
You might use collections.Counter's ability to subtract, as follows:
import collections
import pandas as pd
df1 = pd.DataFrame({'company':['A','A','A','B','B','C','Y']})
df2 = pd.DataFrame({'company':['A','B','B','C','C','C','Z']})
c1 = collections.Counter(df1['company'])
c2 = collections.Counter(df2['company'])
c1.subtract(c2)
print(c1)
which gives the output:
Counter({'A': 2, 'Y': 1, 'B': 0, 'Z': -1, 'C': -2})
Explanation: a positive value means more instances are in df1, zero means the counts are equal, and a negative value means more instances are in df2.
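If you would rather stay in pandas, the two value_counts results can be subtracted directly. A minimal sketch reusing the df1/df2 defined just above (fill_value=0 covers companies that appear in only one DataFrame; positive values mean a company occurs more often in df2):
first = df1['company'].value_counts()
second = df2['company'].value_counts()

diff = second.subtract(first, fill_value=0).sort_values(ascending=False)
print(diff[diff > 0])  # companies that occur more often in df2 than in df1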
Use this code:
df2['x'] = '2'
df1['x'] = '1'
df = pd.concat([df1[['company', 'x']], df2[['company', 'x']]])
df = pd.pivot_table(df, index='company', columns='x', aggfunc='size', fill_value=0).reset_index()
Now filter df for the rows you are interested in.
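A minimal runnable sketch of that pivot approach, using the same sample df1/df2 as in the Counter answer above (the final comparison line is an assumption about which rows count as "related data"):
import pandas as pd

df1 = pd.DataFrame({'company': ['A', 'A', 'A', 'B', 'B', 'C', 'Y']})
df2 = pd.DataFrame({'company': ['A', 'B', 'B', 'C', 'C', 'C', 'Z']})

df1['x'] = '1'
df2['x'] = '2'
df = pd.concat([df1[['company', 'x']], df2[['company', 'x']]])

# one row per company, one column of counts per source DataFrame
counts = pd.pivot_table(df, index='company', columns='x', aggfunc='size', fill_value=0).reset_index()

# companies that occur more often in df2 than in df1
print(counts[counts['2'] > counts['1']])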

Multi-slice pandas dataframe

I have a dataframe:
import pandas as pd
df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
that I would like to slice into two new dataframes such that the first contains every nth value, while the second contains the remaining values not in the first.
For example, in the case of n=3, the second dataframe would keep two values from the original dataframe, skip one, keep two, skip one, etc. This slice is illustrated in the following image where the original dataframe values are blue, and these are split into a green set and a red set:
I have achieved this successfully using a combination of iloc and isin:
df1 = df.iloc[::3]
df2 = df[~df.val.isin(df1.val)]
but what I would like to know is:
Is this the most Pythonic way to achieve this? It seems inefficient and not particularly elegant to take what I want out of a dataframe and then get the rest by checking which of the original values are not in the new dataframe. Instead, is there an iloc expression, like the one used to generate df1, which could do the second part of the slicing and replace the isin line? Even better, is there a single expression that could execute the entire two-step slice in one step?
Take the index modulo 3 and compare it for inequality with 0 (0 marks the rows that were already sliced into df1):
import numpy as np

# for default RangeIndex
df2 = df[df.index % 3 != 0]
# for any Index
df2 = df[np.arange(len(df)) % 3 != 0]
print (df2)
val
1 b
2 c
4 e
5 f
7 h
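If you want both halves from a single expression rather than two separate steps, a minimal sketch using one boolean mask (the mask name is just for illustration):
import numpy as np

mask = np.arange(len(df)) % 3 == 0
df1, df2 = df[mask], df[~mask]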

Use results from two queries with common key to create a dataframe without having to use merge

Data set:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=['A', 'B', 'C', 'D'], index=['abcd', 'efgh', 'abcd', 'abc123', 'efgh']).reset_index()
s = pd.Series(data=[True, True, False], index=['abcd', 'efgh', 'abc123'], name='availability').reset_index()
(Feel free to remove the reset_index calls above; they are only there to conveniently show a different angle on the problem. However, the result sets from the queries I'm actually running most closely resemble the above.)
I have two separate queries that return data similar to the above. One query pulls a field from the database that does not exist in the other query's result. The 'index' column is the common key across both tables.
My result set needs to have the 2nd query's result series injected into the first query's resulting dataframe at a specific column index.
I know that I can simply run:
df = df.merge(s, how='left', on='index')
Then to enforce column order:
df = df[['index', 'A', 'B', 'availability', 'C', 'D']]
I saw that you can do df.insert, but that requires that the series be the same length as the df.
I'm wondering if there is a way to do this without having to run merge and then enforce column order. With my actual dataset, the number of columns is significantly longer. I'd imagine the best solution likely relies on list manipulation, but I'd much rather do something clever with how the dataframe is created in the first place.
df.set_index(['index','id']).index.map(s['availability'])
is returning:
TypeError: 'Series' object is not callable
Here s is a dataframe with a multi-index and a single boolean column, and df is a dataframe whose columns make up s's multi-index.
IIUC:
In [260]: df.insert(3, 'availability',
df['index'].map(s.set_index('index')['availability']))
In [261]: df
Out[261]:
index A B availability C D
0 abcd 1.867270 0.517894 True 0.584115 -0.162361
1 efgh -0.036696 1.155110 True -1.112075 2.005678
2 abcd 0.693795 -0.843335 True -1.003202 1.001791
3 abc123 -1.466148 -0.848055 False -0.373293 0.360091
4 efgh -0.436618 -0.625454 True -0.285795 -0.220717
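If the merge itself is acceptable and only spelling out the full column order is the pain point, the order can also be built with a little list manipulation. A minimal sketch, assuming 'availability' should end up at position 3:
df = df.merge(s, how='left', on='index')

# move 'availability' to position 3 without typing out every column name
cols = list(df.columns)
cols.insert(3, cols.pop(cols.index('availability')))
df = df[cols]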

Pandas dataframe column selection

I am using Pandas to select columns from a dataframe, olddf. Let's say the variable names are 'a', 'b', 'c', 'startswith1', 'startswith2', 'startswith3', ..., 'startswith10'.
My approach was to create a list of all variables with a common starting value.
filter_col = [col for col in list(olddf) if col.startswith('startswith')]
I'd like to then select columns within that list as well as others, by name, so I don't have to type them all out. However, this doesn't work:
newdf = olddf['a','b',filter_col]
And this doesn't either:
newdf = olddf[['a','b'],filter_col]
I'm a newbie so this is probably pretty simple. Is the reason this doesn't work because I'm mixing a list improperly?
Thanks.
Use
newdf = olddf[['a','b']+filter_col]
since adding lists concatenates them:
In [264]: ['a', 'b'] + ['startswith1']
Out[264]: ['a', 'b', 'startswith1']
Alternatively, you could use the filter method:
newdf = olddf.filter(regex=r'^(startswith|[ab])')
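A minimal runnable sketch of both options (olddf here is a made-up frame with the column names from the question):
import pandas as pd

olddf = pd.DataFrame(columns=['a', 'b', 'c', 'startswith1', 'startswith2', 'startswith3'])

filter_col = [col for col in list(olddf) if col.startswith('startswith')]

newdf = olddf[['a', 'b'] + filter_col]                 # list concatenation
newdf_alt = olddf.filter(regex=r'^(startswith|[ab])')  # regex on the column names

print(list(newdf.columns))      # ['a', 'b', 'startswith1', 'startswith2', 'startswith3']
print(list(newdf_alt.columns))  # same selection, in the original column order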

Filter numpy ndarray (matrix) according to column values

This question is about filtering a NumPy ndarray according to some column values.
I have a fairly large NumPy ndarray (300000, 50) and I am filtering it according to values in some specific columns. It has a structured dtype, so I can access each column by name.
The first column is named category_code and I need to filter the matrix to return only rows where category_code is in ("A", "B", "C").
The result would need to be another NumPy ndarray whose columns are still accessible by the dtype names.
Here is what I do now:
index = numpy.asarray([row['category_code'] in ('A', 'B', 'C') for row in data])
filtered_data = data[index]
List comprehension like:
rows = [row for row in data if row['category_code'] in ('A', 'B', 'C')]
filtered_data = numpy.asarray(rows)
wouldn't work because the dtypes I originally had are no longer accessible.
Is there any better / more Pythonic way of achieving the same result?
Something that could look like:
filtered_data = data.where({'category_code': ('A', 'B', 'C')})
Thanks!
You can use the NumPy-based library, Pandas, which has a more generally useful implementation of ndarrays:
>>> # import the library
>>> import pandas as PD
Create some sample data as a Python dictionary whose keys are the column names and whose values are the column values as Python lists, one key/value pair per column:
>>> data = {'category_code': ['D', 'A', 'B', 'C', 'D', 'A', 'C', 'A'],
'value':[4, 2, 6, 3, 8, 4, 3, 9]}
>>> # convert to a Pandas 'DataFrame'
>>> D = PD.DataFrame(data)
To return just the rows in which category_code is either B or C, there are conceptually two steps, which can easily be combined into a single line:
>>> # step 1: create the index
>>> idx = (D.category_code == 'B') | (D.category_code == 'C')
>>> # then filter the data against that index:
>>> D.loc[idx]
category_code value
2 B 6
3 C 3
6 C 3
Note the difference between indexing in Pandas versus NumPy, the library upon which Pandas is built. In NumPy, you would just place the index inside the brackets, indicating which dimension you are indexing with a ",", and using ":" to indicate that you want all of the values (columns) in the other dimension:
>>> D[idx,:]
In Pandas, you use the DataFrame's loc indexer and place only the row index inside the brackets:
>>> D.loc[idx]
If you can choose, I strongly recommend pandas: it has "column indexing" built-in plus a lot of other features. It is built on numpy.
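If converting to pandas is not an option, the filter can also be done directly on the structured array with np.isin. A minimal sketch with made-up sample data (the field names follow the question):
import numpy as np

data = np.array([('A', 1.0), ('D', 2.0), ('B', 3.0), ('C', 4.0)],
                dtype=[('category_code', 'U1'), ('value', 'f8')])

# vectorized membership test on the named column
mask = np.isin(data['category_code'], ['A', 'B', 'C'])
filtered_data = data[mask]              # still a structured array, dtype names intact
print(filtered_data['category_code'])   # ['A' 'B' 'C']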
