How to check all the columns of a dataframe are the same? - python

What is a quick way to check if all the columns in a pandas dataframe are the same?
E.g. I have a dataframe with the columns a,b,c below, and I need to check that the columns are all the same, i.e. that a = b = c
+---+---+---+
| a | b | c |
+---+---+---+
| 5 | 5 | 5 |
| 7 | 7 | 7 |
| 9 | 9 | 9 |
+---+---+---+
I had thought of using apply to iterate over all the rows, but I am afraid it might be inefficient as it would be a non-vectorised loop.
I suspect looping over the columns would be quicker because I always have fewer columns than rows (a few dozen columns but hundreds of thousands of rows).
I have come up with the contraption below. I need to tidy it up and make it into a function but it works - the question is if there is a more elegant / faster way of doing it?
As I've set it up, np.where returns 0 when the items match and 1 when they don't (deliberately not the other way round), so summing the output gives me the number of mismatches.
I iterate over all the columns (excluding the first), comparing them to the first.
The first output counts the matches/mismatches by column, the second by row.
If you add something like
df.iloc[3,2] = 100
after defining df, the output tells you that row 3 of column c doesn't match
import numpy as np
import pandas as pd

df = pd.DataFrame()
x = np.arange(0, 20)
df['a'] = x
df['b'] = x
df['c'] = x
df['d'] = x
#df.iloc[3,2] = 100

cols = df.columns
out = pd.DataFrame()
# compare every column (excluding the first) against the first column:
# 0 where the values match, 1 where they don't
for c in np.arange(1, len(cols)):
    out[cols[c]] = np.where(df[cols[0]] == df[cols[c]], 0, 1)
print(out.sum(axis=0))  # mismatches per column
print(out.sum(axis=1))  # mismatches per row
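Tidied up into a function, the same approach could look like this (a sketch reusing the df, np and pd defined above, not one of the answers below):
def count_mismatches(df):
    # compare every column against the first: 0 where values match, 1 where they don't
    cols = df.columns
    out = pd.DataFrame(index=df.index)
    for c in cols[1:]:
        out[c] = np.where(df[cols[0]] == df[c], 0, 1)
    return out.sum(axis=0), out.sum(axis=1)

by_column, by_row = count_mismatches(df)
print(by_column)  # mismatches per column
print(by_row)     # mismatches per row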

Let's try duplicated:
(~df.T.duplicated()).sum()==1
# True

You can use df.T.drop_duplicates() and check whether its length is 1, which would mean that all the columns are the same:
In [1254]: len(df.T.drop_duplicates()) == 1
Out[1254]: True

Use DataFrame.duplicated + DataFrame.all:
df.T.duplicated(keep=False).all()
#True
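As a quick sanity check, re-creating the df from the question and flipping one value (as suggested there) makes both transpose-based checks report a mismatch; a DataFrame.eq comparison against the first column (a sketch, not taken from the answers above) also locates where the mismatch is:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(20), 'b': np.arange(20),
                   'c': np.arange(20), 'd': np.arange(20)})
df.iloc[3, 2] = 100  # introduce a mismatch in column c

print(df.T.duplicated(keep=False).all())   # False: not all columns are identical
print(len(df.T.drop_duplicates()) == 1)    # False

# element-wise comparison of every column against the first column
matches = df.eq(df.iloc[:, 0], axis=0)
print(matches.all())     # which columns fully match column a
print((~matches).sum())  # number of mismatching rows per column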

Related

Get all distinct values from a column where another column has at least two distinct values for each value in the initial column

I have a very large dataset (20GB+) and I need to select all distinct values from column A for which there are at least two distinct values in column B.
For the following dataframe:
| A | B |
|---|---|
| x | 1 |
| x | 2 |
| y | 1 |
| y | 1 |
This should return only x, because it has two distinct values in column B, while y has only one distinct value.
The following code does the trick, but it takes a very long time (as in hours) since the dataset is very large:
def get_values(list_of_distinct_values, dataframe):
    valid_values = []
    for value in list_of_distinct_values:
        value_df = dataframe.loc[dataframe['A'] == value]
        if len(value_df.groupby('B')) > 1:
            valid_values.append(value)
    return valid_values
Can anybody suggest a faster way of doing this?
I think you can solve your problem with the drop_duplicates() method of the dataframe. You need to use the parameters subset and keep (to remove all the rows that have duplicates):
import pandas as pd

df = pd.DataFrame({
    'A': ["x", "x", "y", "y"],
    'B': [1, 2, 1, 1],
})
df.drop_duplicates(subset=['A', 'B'], keep=False).drop_duplicates(subset=['A'])['A']
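Another vectorized option worth trying on a dataset this size (a sketch of a groupby/nunique approach, not part of the answer above):
import pandas as pd

df = pd.DataFrame({
    'A': ["x", "x", "y", "y"],
    'B': [1, 2, 1, 1],
})

# count distinct B values per A, then keep the A values with at least two of them
counts = df.groupby('A')['B'].nunique()
print(counts[counts >= 2].index.tolist())  # ['x']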

Pandas: Filter DataFrameGroupBy (df.groupby) based on group aggregates

df
| a | b |
|----|---|
| 10 | 1 |
| 10 | 5 |
| 11 | 1 |
Straight grouping it using
grouped = df.groupby('a')
Let's get only the groups where
selector = grouped.b.max() - grouped.b.min() >= 3
yields the boolean Series selector:
| a  |       |
|----|-------|
| 10 | True  |
| 11 | False |
My question is: what is the equivalent of df = df.loc[<filter condition>] when working with DataFrameGroupBy elements?
grouped.filter(..) returns a DataFrame.
Is there a way to preserve the groups while filtering based on .aggregate() functions? Thanks!
You can use np.ptp (peak-to-peak)
df.groupby('a').b.agg(np.ptp) > 3
a
10 True
11 False
Name: b, dtype: bool
For the df.loc[] equivalent question, you can just do:
df=df.set_index('a')\
.loc[df.groupby('a').b.agg(np.ptp).gt(3)]\
.reset_index()
Alternatively (inner join solution):
selector=df.groupby('a').b.agg(np.ptp).gt(3)
selector=selector.loc[selector]
df=df.merge(selector, on='a', suffixes=["", "_dropme"])
df=df.loc[:, filter(lambda col: "_dropme" not in col, df.columns)]
Outputs:
a b
0 10 1
1 10 5
PS: +1 @rafaelc for the .ptp thing
Sadly I did not come across a direct solution, so I worked around it like this, using two groupbys:
# Build True/False Series for filter criteria
selector = df.groupby('a').b.agg(np.ptp) > 3
# Only select those 'a' which have True in filter criteria
selector = selector.loc[selector == True]
# Re-Create groups of 'a' with the filter criteria in place
# Only those groups for 'a' will be created, where the MAX-MIN of 'b' are > 3.
grouped = df.loc[df['a'].isin(selector.index)].groupby('a')
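For completeness, a minimal sketch (my own, based on the groupby().filter() route mentioned in the question rather than on the answers above): filter keeps only the rows of qualifying groups, and the result can simply be grouped again.
import pandas as pd

df = pd.DataFrame({'a': [10, 10, 11], 'b': [1, 5, 1]})

# keep only the rows of groups whose b-range is at least 3, then re-group
filtered = df.groupby('a').filter(lambda g: g['b'].max() - g['b'].min() >= 3)
grouped = filtered.groupby('a')
print(list(grouped.groups))  # [10]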

How to select a dataframe column based on a list?

I want to make a condition based on each unique value of a dataframe column using Python.
I tried to put the unique values in a list and iterate over them:
f=(df['Technical family'].unique())
for i in f:
    df_2 = df[(df['Technical family'] = f[i])]
    S=pd.crosstab(df_2['PFG | ID'],df_2['Comp. | Family'])
but apparently the line df_2 = df[(df['Technical family'] = f[i])] doesn't work!
Does anyone have an idea how to do it?
You need to use == instead of =. == is for comparing values while = is for assigning values.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Technical family': np.random.choice(['1','2','3'], 100),
                   'PFG | ID': np.random.choice(['A','B','C'], 100),
                   'Comp. | Family': np.random.choice(['a','b','c'], 100)})
f = (df['Technical family'].unique())
df_2 = df[df['Technical family'] == f[0]]
pd.crosstab(df_2['PFG | ID'], df_2['Comp. | Family'])
Comp. | Family  a  b  c
PFG | ID
A               5  5  3
B               3  5  3
C               4  3  4
Also as a further suggestion, you can directly crosstab:
res = pd.crosstab([df['Technical family'],df['PFG | ID']],df['Comp. | Family'])
res.loc['1']
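If the goal is one crosstab per unique value of 'Technical family', iterating over a groupby is another way to write the loop (a sketch, not part of the answer above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Technical family': np.random.choice(['1','2','3'], 100),
                   'PFG | ID': np.random.choice(['A','B','C'], 100),
                   'Comp. | Family': np.random.choice(['a','b','c'], 100)})

# groupby yields (value, sub-dataframe) pairs, avoiding manual boolean filtering
for family, df_2 in df.groupby('Technical family'):
    S = pd.crosstab(df_2['PFG | ID'], df_2['Comp. | Family'])
    print(family)
    print(S)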

Pandas.DataFrame: find the index of the row whose value in a given column is closest to (but below) a specified value

In a Pandas.DataFrame, I would like to find the index of the row whose value in a given column is closest to (but below) a specified value. Specifically, say I am given the number 40 and the DataFrame df:
| | x |
|---:|----:|
| 0 | 11 |
| 1 | 15 |
| 2 | 17 |
| 3 | 25 |
| 4 | 54 |
I want to find the index of the row such that df["x"] is lower but as close as possible to 40. Here, the answer would be 3 because df.loc[3, 'x'] = 25 is smaller than the given number 40 but closest to it.
My dataframe has other columns, but I can assume that the column "x" is increasing.
For an exact match, I did (correct me if there is a better method):
list = df[(df.x == number)].index.tolist()
if list:
    result = list[0]
But for the general case, I do not know how to do it in a "vectorized" way.
Filter rows below 40 by Series.lt with boolean indexing and get the maximal index value by Series.idxmax:
a = df.loc[df['x'].lt(40), 'x'].idxmax()
print (a)
3
To improve performance, it is possible to use numpy.where with np.max; this solution works with the default index:
a = np.max(np.where(df['x'].lt(40))[0])
print (a)
3
If the index is not the default RangeIndex:
df = pd.DataFrame({'x':[11,15,17,25,54]}, index=list('abcde'))
a = np.max(np.where(df['x'].lt(40))[0])
print (a)
3
print (df.index[a])
d
How about this:
import pandas as pd
data = {'x':[0,1,2,3,4,20,50]}
df = pd.DataFrame(data)
#get df with selected condition
sub_df = df[df['x'] <= 40]
#get the idx of the maximum
idx = sub_df.idxmax()
print(idx)
Use Series.where to mask values greater than or equal to n, then use Series.idxmax to obtain
the closest one:
n=40
val = df['x'].where(df['x'].lt(n)).idxmax()
print(val)
3
We could also use Series.mask:
df['x'].mask(df['x'].ge(40)).idxmax()
or callable with loc[]
df['x'].loc[lambda x: x.lt(40)].idxmax()
#alternative
#df.loc[lambda col: col['x'].lt(40),'x'].idxmax()
If the index is not the default RangeIndex:
i = df.loc[lambda col: col['x'].lt(40),'x'].reset_index(drop=True).idxmax()
df.index[i]
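Since the question says column "x" can be assumed to be increasing, a searchsorted-based sketch (my own, under that assumption) also avoids scanning the whole column with a boolean mask:
import pandas as pd

df = pd.DataFrame({'x': [11, 15, 17, 25, 54]})
number = 40

# for a sorted column, searchsorted returns the insertion position of `number`;
# the position just before it holds the largest value strictly below it
pos = df['x'].searchsorted(number, side='left') - 1
if pos >= 0:
    print(df.index[pos])  # 3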

Producing every combination of columns from one pandas dataframe in python

I'd like to take a dataframe and visualize how useful each column is in a k-neighbors analysis, so I was wondering if there is a way to loop through dropping columns and analyzing the dataframe in order to produce an accuracy for every single combination of columns. I'm not sure whether there are pandas functions I'm unaware of that could make this easier, or how to loop through the dataframe to produce every combination of the original columns. If I have not explained it well, I will try to create a diagram.
| a | b | c | labels |
|---|---|---|--------|
| 1 | 2 | 3 | 0      |
| 5 | 6 | 7 | 1      |
The dataframe above would produce something like this after being run through the splitting and k-neighbors function:
a & b = 43%
a & c = 56%
b & c = 78%
a & b & c = 95%
import itertools

min_size = 2
max_size = df.shape[1]
column_subsets = itertools.chain(
    *map(lambda x: itertools.combinations(df.columns, x), range(min_size, max_size + 1)))
for column_subset in column_subsets:
    foo(df[list(column_subset)])
where df is your dataframe and foo is whatever k-neighbors analysis you're doing. Although you said "all combinations", I set min_size to 2 since your example only has subsets of size >= 2. And these are more precisely referred to as "subsets" rather than "combinations".
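A runnable end-to-end sketch of the same idea, using made-up data and a hypothetical knn_accuracy helper in place of foo (scikit-learn's KNeighborsClassifier stands in for the splitting and k-neighbors step mentioned in the question; adjust to your own pipeline):
import itertools

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(features, labels):
    # hypothetical helper standing in for foo: split, fit, and score a kNN model
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.3, random_state=0)
    model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    return model.score(X_test, y_test)

# made-up data in the shape of the example above
df = pd.DataFrame({'a': [1, 5, 2, 6, 3, 7, 4, 8],
                   'b': [2, 6, 3, 7, 4, 8, 5, 9],
                   'c': [3, 7, 4, 8, 5, 9, 6, 10],
                   'labels': [0, 1, 0, 1, 0, 1, 0, 1]})
feature_cols = df.columns.drop('labels')

min_size = 2
column_subsets = itertools.chain(
    *map(lambda x: itertools.combinations(feature_cols, x),
         range(min_size, len(feature_cols) + 1)))
for column_subset in column_subsets:
    cols = list(column_subset)
    print(cols, knn_accuracy(df[cols], df['labels']))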
