How to select a dataframe column based on a list? - python

I want to apply a condition based on each unique value of a dataframe column, using Python.
I tried to put the unique values in a list and iterate over them to cover them all:
f = df['Technical family'].unique()
for i in f:
    df_2 = df[(df['Technical family'] = f[i])]
    S = pd.crosstab(df_2['PFG | ID'], df_2['Comp. | Family'])
but apparently the line df_2 = df[(df['Technical family'] = f[i])] doesn't work!
Does anyone have an idea how to do it?

You need to use == instead of =. == is for comparing values while = is for assigning values.
For example:
import numpy as np
import pandas as pd

# Sample data with the three columns from the question
df = pd.DataFrame({'Technical family': np.random.choice(['1', '2', '3'], 100),
                   'PFG | ID': np.random.choice(['A', 'B', 'C'], 100),
                   'Comp. | Family': np.random.choice(['a', 'b', 'c'], 100)})

f = df['Technical family'].unique()
df_2 = df[df['Technical family'] == f[0]]
pd.crosstab(df_2['PFG | ID'], df_2['Comp. | Family'])
Comp. | Family  a  b  c
PFG | ID
A               5  5  3
B               3  5  3
C               4  3  4
As a further suggestion, you can also compute the crosstab directly with 'Technical family' as an extra row level and then select a single family:
res = pd.crosstab([df['Technical family'],df['PFG | ID']],df['Comp. | Family'])
res.loc['1']
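If you want a crosstab for every technical family (which is what the loop in the question was aiming for), a minimal sketch, assuming the same column names as above, is to collect one table per unique value in a dict:
# Build one crosstab per unique 'Technical family' value,
# keyed by that value (note the ==, and comparing to the value itself)
tables = {}
for fam in df['Technical family'].unique():
    df_2 = df[df['Technical family'] == fam]
    tables[fam] = pd.crosstab(df_2['PFG | ID'], df_2['Comp. | Family'])

tables['1']   # crosstab for technical family '1'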

Related

Header for sub-headers in Pandas

In Pandas I have a table with the following columns:
Number of words | 1 | 2 | 4 |
...and I want to make it look like the following:
                | worker/node |
Number of words | 1 | 2 | 4   |
So how do I "create" this header for the sub-headers?
And how do I merge the empty cell (in row 1, where the feature header is located) with the "Index" cell in row 2? In other words, I want to make the table headers look like the layout above.
Use MultiIndex.from_product to add your string as the first level of a column MultiIndex:
#if necessary convert some columns to index first
df = df.set_index(['Number of words'])
df.columns = pd.MultiIndex.from_product([['Worker/node'], df.columns])
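A minimal sketch of the effect on a toy frame (the data and column names are made up for illustration):
import pandas as pd

df = pd.DataFrame({'Number of words': [10, 20],
                   '1': [0.1, 0.2],
                   '2': [0.3, 0.4],
                   '4': [0.5, 0.6]})

df = df.set_index(['Number of words'])
df.columns = pd.MultiIndex.from_product([['Worker/node'], df.columns])
# df.columns is now a MultiIndex with 'Worker/node' on top and 1/2/4 underneath,
# so printing df shows 'Worker/node' as a single top header row above the sub-columns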

How to check all the columns of a dataframe are the same?

What is a quick way to check if all the columns in a pandas dataframe are the same?
E.g. I have a dataframe with the columns a,b,c below, and I need to check that the columns are all the same, i.e. that a = b = c
+---+---+---+
| a | b | c |
+---+---+---+
| 5 | 5 | 5 |
| 7 | 7 | 7 |
| 9 | 9 | 9 |
+---+---+---+
I had thought of using apply to iterate over all the rows, but I am afraid it might be inefficient as it would be a non-vectorised loop.
I suspect looping over the columns would be quicker because I always have fewer columns than rows (a few dozen columns but hundreds of thousands of rows).
I have come up with the contraption below. I need to tidy it up and make it into a function, but it works; the question is whether there is a more elegant/faster way of doing it.
As I use it, np.where returns 0 where the items match and 1 where they don't, so summing the output gives me the number of mismatches.
I iterate over all the columns (excluding the first), comparing them to the first.
The first output counts the mismatches by column, the second by row.
If you add something like
df.iloc[3, 2] = 100
after defining df (it is included below, commented out), the output tells you that the row at position 3 of column c doesn't match.
import numpy as np
import pandas as pd

df = pd.DataFrame()
x = np.arange(0, 20)
df['a'] = x
df['b'] = x
df['c'] = x
df['d'] = x
#df.iloc[3, 2] = 100   # uncomment to introduce a mismatch in column c

cols = df.columns
out = pd.DataFrame()
# compare every column (excluding the first) against the first column
for c in np.arange(1, len(cols)):
    out[cols[c]] = np.where(df[cols[0]] == df[cols[c]], 0, 1)

print(out.sum(axis=0))   # mismatches per column
print(out.sum(axis=1))   # mismatches per row
Let's try duplicated:
(~df.T.duplicated()).sum()==1
# True
You can transpose, use drop_duplicates() and check that the length is 1. That would mean all the columns are the same:
In [1254]: len(df.T.drop_duplicates()) == 1
Out[1254]: True
Use DataFrame.duplicated + DataFrame.all:
df.T.duplicated(keep=False).all()
#True
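For completeness, a quick sketch running the three checks above on the sample frame from the question, and again after altering one value:
import pandas as pd

df = pd.DataFrame({'a': [5, 7, 9], 'b': [5, 7, 9], 'c': [5, 7, 9]})

print((~df.T.duplicated()).sum() == 1)      # True
print(len(df.T.drop_duplicates()) == 1)     # True
print(df.T.duplicated(keep=False).all())    # True

df.iloc[1, 2] = 100                         # break column c
print(df.T.duplicated(keep=False).all())    # False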

Pandas: Filter DataFrameGroupBy (df.groupby) based on group aggregates

df
| a | b |
|----|---|
| 10 | 1 |
| 10 | 5 |
| 11 | 1 |
Grouping it straightforwardly using
grouped = df.groupby('a')
and then keeping only the groups where
selector = grouped.b.max() - grouped.b.min() >= 3
yields
| a  |       |
|----|-------|
| 10 | True  |
| 11 | False |
My question is: what is the equivalent of df = df.loc[<filter condition>] when working with DataFrameGroupBy objects?
grouped.filter(..) returns a DataFrame.
Is there a way to preserve the groups while filtering based on .aggregate() functions? Thanks!
You can use np.ptp (peak-to-peak, i.e. max minus min):
df.groupby('a').b.agg(np.ptp) > 3
a
10 True
11 False
Name: b, dtype: bool
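If you want the matching rows back rather than the boolean Series, the grouped.filter(..) mentioned in the question already does the df.loc-style selection. A minimal sketch on the sample data:
import pandas as pd

df = pd.DataFrame({'a': [10, 10, 11], 'b': [1, 5, 1]})

# keep only the groups whose b-range (max - min) is at least 3
filtered = df.groupby('a').filter(lambda g: g['b'].max() - g['b'].min() >= 3)
print(filtered)
#     a  b
# 0  10  1
# 1  10  5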
For the df.loc[] equivalent question, you can just do:
df = df.set_index('a')\
       .loc[df.groupby('a').b.agg(np.ptp).gt(3)]\
       .reset_index()
Alternatively (an inner-join solution):
selector = df.groupby('a').b.agg(np.ptp).gt(3)
selector = selector.loc[selector]
df = df.merge(selector, on='a', suffixes=["", "_dropme"])
df = df.loc[:, [col for col in df.columns if "_dropme" not in col]]
Outputs:
    a  b
0  10  1
1  10  5
PS: +1 to @rafaelc for the .ptp suggestion.
Sadly I did not come across a direct solution, so I worked around it like this, using groupby twice:
# Build True/False Series for filter criteria
selector = df.groupby('a').b.agg(np.ptp) > 3
# Only select those 'a' which have True in filter criteria
selector = selector.loc[selector == True]
# Re-Create groups of 'a' with the filter criteria in place
# Only those groups for 'a' will be created, where the MAX-MIN of 'b' are > 3.
grouped = df.loc[df['a'].isin(selector.index)].groupby('a')
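Once the filtered DataFrameGroupBy is rebuilt like this, it behaves like any other grouped object. A small self-contained sketch on the sample data from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, 10, 11], 'b': [1, 5, 1]})
selector = df.groupby('a').b.agg(np.ptp) > 3
grouped = df.loc[df['a'].isin(selector[selector].index)].groupby('a')

print(grouped.b.mean())   # per-group aggregate over the surviving groups only
# a
# 10    3.0
# Name: b, dtype: float64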

Pandas.DataFrame: find the index of the row whose value in a given column is closest to (but below) a specified value

In a Pandas.DataFrame, I would like to find the index of the row whose value in a given column is closest to (but below) a specified value. Specifically, say I am given the number 40 and the DataFrame df:
| | x |
|---:|----:|
| 0 | 11 |
| 1 | 15 |
| 2 | 17 |
| 3 | 25 |
| 4 | 54 |
I want to find the index of the row such that df["x"] is lower than but as close as possible to 40. Here, the answer would be 3 because df.loc[3, 'x'] = 25 is smaller than the given number 40 but closest to it.
My dataframe has other columns, but I can assume that the column "x" is increasing.
For an exact match, I did the following (correct me if there is a better method):
list = df[(df.x == number)].index.tolist()
if list:
    result = list[0]
But for the general case, I do not know how to do it in a "vectorized" way.
Filter the rows below 40 with Series.lt in boolean indexing and get the maximal index value with Series.idxmax:
a = df.loc[df['x'].lt(40), 'x'].idxmax()
print (a)
3
To improve performance, it is possible to use numpy.where with np.max; this solution works with the default index:
a = np.max(np.where(df['x'].lt(40))[0])
print (a)
3
If the index is not the default RangeIndex:
df = pd.DataFrame({'x':[11,15,17,25,54]}, index=list('abcde'))
a = np.max(np.where(df['x'].lt(40))[0])
print (a)
3
print (df.index[a])
d
How about this:
import pandas as pd

data = {'x': [0, 1, 2, 3, 4, 20, 50]}
df = pd.DataFrame(data)

# get the rows satisfying the condition
sub_df = df[df['x'] <= 40]
# get the index of the maximum (per column; use sub_df['x'].idxmax() for a scalar)
idx = sub_df.idxmax()
print(idx)
Use Series.where to mask values greater than or equal to n, then use Series.idxmax to obtain the closest one:
n=40
val = df['x'].where(df['x'].lt(n)).idxmax()
print(val)
3
We could also use Series.mask:
df['x'].mask(df['x'].ge(40)).idxmax()
or use a callable with loc[]:
df['x'].loc[lambda x: x.lt(40)].idxmax()
#alternative
#df.loc[lambda col: col['x'].lt(40),'x'].idxmax()
If the index is not the default RangeIndex:
i = df.loc[lambda col: col['x'].lt(40),'x'].reset_index(drop=True).idxmax()
df.index[i]
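Since the question notes that the 'x' column is increasing, another option (not used in the answers above, just a sketch) is numpy.searchsorted, which does a binary search on the sorted values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [11, 15, 17, 25, 54]})
n = 40

# position where n would be inserted to keep the column sorted;
# the element just before it is the largest value strictly below n
pos = np.searchsorted(df['x'].to_numpy(), n, side='left')
if pos == 0:
    raise ValueError('no value below n')
print(df.index[pos - 1])   # 3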

Producing every combination of columns from one pandas dataframe in python

I'd like to take a dataframe and visualize how useful each column is in a k-neighbors analysis, so I was wondering whether there is a way to loop over dropped columns and analyze the resulting dataframes in order to produce an accuracy score for every single combination of columns. I'm really not sure if there are functions in pandas that I'm not aware of that could make this easier, or how to loop through the dataframe to produce every combination of the original dataframe. If I have not explained it well, the example below should illustrate it.
| a | b | c | labels |
|---|---|---|--------|
| 1 | 2 | 3 | 0      |
| 5 | 6 | 7 | 1      |
The dataframe above would produce something like this after being run through the splitting and k-neighbors function:
a & b = 43%
a & c = 56%
b & c = 78%
a & b & c = 95%
import itertools

min_size = 2
max_size = df.shape[1]
column_subsets = itertools.chain(*map(lambda x: itertools.combinations(df.columns, x),
                                      range(min_size, max_size + 1)))
for column_subset in column_subsets:
    foo(df[list(column_subset)])
where df is your dataframe and foo is whatever k-neighbors analysis you're doing. Although you said "all combinations", I set min_size to 2 since your example only has subsets of size >= 2. And these are more precisely referred to as "subsets" rather than "combinations".
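A minimal sketch of what foo could look like with scikit-learn's KNeighborsClassifier; the toy data, the 'labels' column name and the feature names are assumptions made for illustration:
import itertools

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# toy data standing in for your dataframe: feature columns plus a 'labels' column
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 3)), columns=['a', 'b', 'c'])
df['labels'] = rng.integers(0, 2, 100)

def knn_accuracy(features, labels):
    # split, fit and score a k-neighbors classifier on one column subset
    X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

feature_cols = [c for c in df.columns if c != 'labels']
for size in range(2, len(feature_cols) + 1):
    for subset in itertools.combinations(feature_cols, size):
        acc = knn_accuracy(df[list(subset)], df['labels'])
        print(' & '.join(subset), '=', f'{acc:.0%}')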
