I am trying to set up a way to cycle through a 100-element dictionary, where each element is a 42,000-row DataFrame, and check whether each value is above a threshold.
I am having problems working out how to store the row data so that it is not overwritten.
I've made a simple example of what I would like to do:
I have a three-element dictionary (my_dic) where each element is a DataFrame.
I want to cycle through each row of every DataFrame and check if any of the columns are above a threshold number.
I have been trying to use .any() and .where, but I don't know how to capture the data for each df separately.
I would like to end up with three separate new DataFrames that hold boolean values, with True in each column that is above the threshold.
Any help would be great!
import numpy as np
import pandas as pd

d1 = np.random.rand(3, 3)
d2 = np.random.rand(3, 3)
d3 = np.random.rand(3, 3)
df1 = pd.DataFrame(d1, index=['a', 'b', 'c'])
df2 = pd.DataFrame(d2, index=['a', 'b', 'c'])
df3 = pd.DataFrame(d3, index=['a', 'b', 'c'])
my_dic = {}
my_dic['a'] = df1
my_dic['b'] = df2
my_dic['c'] = df3
threshold = 0.5
I want to cycle through every row of each of the keys in my_dic and change the value to booleans if it is greater than a threshold number
for k in my_dic:
    print(k)
    data = my_dic[k]
    for row in range(len(data)):
        print(row)
        # the result of np.where is computed here but never stored
        np.where(data.iloc[row, :] > threshold)
This is the point I am struggling with. I'm not sure how to keep this data without it being overwritten.
I assume you're looking for a dict comprehension, if you just want a boolean mask.
result = {k : v > threshold for k, v in my_dic.items()}
for v in result.values():
    print(v, '\n')
       0      1      2
a  False  False  False
b   True  False   True
c  False   True   True

       0     1     2
a  False  True  True
b   True  True  True
c  False  True  True

       0      1      2
a  False  False  False
b  False  False  False
c   True   True  False
If you want the result as 0/1, use astype:
result = {k : v.gt(threshold).astype(int) for k, v in my_dic.items()}
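To keep one result per DataFrame without anything being overwritten, the same comprehension can be paired with any(axis=1), which answers the "is any column above the threshold in this row" part of the question. This is only a minimal sketch: the fixed seed and the names masks and row_hits are assumptions made for illustration, not part of the original post.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed (assumption) so the sketch is repeatable
my_dic = {
    key: pd.DataFrame(rng.rand(3, 3), index=['a', 'b', 'c'])
    for key in ['a', 'b', 'c']
}
threshold = 0.5

# One boolean DataFrame per key: True where the value exceeds the threshold.
masks = {k: v > threshold for k, v in my_dic.items()}

# One boolean Series per key: True for rows with at least one value above it.
row_hits = {k: m.any(axis=1) for k, m in masks.items()}

for k in my_dic:
    print(k, masks[k].shape, int(row_hits[k].sum()))
```

Because every key gets its own entry in the result dictionaries, nothing is overwritten between iterations, and the same pattern should scale to 100 large DataFrames since the comparison is vectorised.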
I have a DataFrame and a list like this:
df = pd.DataFrame({'bool':[True,False,True,False, False]})
lst = ["aa","bb"]
Now I want to add the list as a column to the DataFrame based on boolean values like this:
df = pd.DataFrame({'bool':[True,False,True,False, False], 'lst':['aa','','bb','','']})
My solution is
df1 = df[df['bool'] == True].copy()
df2 = df[df['bool'] == False].copy()
df1['lst'] = lst
df2['lst'] = ''
df = pd.concat([df1, df2])
But it created so many DataFrames. Is there a better way to do this?
If the length of the list equals the number of True values, use:
df.loc[df['bool'], 'lst'] = lst
df['lst'] = df['lst'].fillna('')
print (df)
    bool lst
0   True  aa
1  False
2   True  bb
3  False
4  False
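The assignment above relies on the list having exactly as many items as there are True rows. A small guard can make that assumption explicit. The helper name fill_where_true and its parameters are hypothetical, introduced only for this sketch.

```python
import pandas as pd

def fill_where_true(df, mask_col, values, new_col, default=''):
    """Return a copy of df with `values` placed on the rows where mask_col is True."""
    mask = df[mask_col].astype(bool)
    if int(mask.sum()) != len(values):
        raise ValueError('number of True rows must equal len(values)')
    out = df.copy()
    out[new_col] = default           # start with the default everywhere
    out.loc[mask, new_col] = values  # overwrite only the True rows, in order
    return out

df = pd.DataFrame({'bool': [True, False, True, False, False]})
result = fill_where_true(df, 'bool', ['aa', 'bb'], 'lst')
print(result)
```

Working on a copy leaves the original df untouched, which may or may not be what you want; assigning in place as in the answer above is equally valid.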
I have two dataframes say df and thresh_df. The shape of df is say 1000*200 and thresh_df is 1*200.
I need to compare the thresh_df row with each row of df element wise respectively and I have to fetch the corresponding column number whose values are less than the values of thresh_df.
I tried the following
compared_df = df.apply(lambda x : np.where(x < thresh_df.values))
But I get an empty DataFrame! If the question is unclear or needs any explanation, please let me know in the comments.
I think apply is not necessary. Just compare df with the one-row DataFrame converted to a Series by selecting its first row:
df = pd.DataFrame({
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
})
thresh_df = pd.DataFrame({
    'B': [4],
    'C': [7],
    'D': [4],
    'E': [5],
})
compared_df = df < thresh_df.iloc[0]
print (compared_df)
       B      C      D      E
0  False  False   True  False
1  False  False   True   True
2  False  False  False  False
3  False   True  False  False
4  False   True   True   True
5  False   True   True   True
Then use DataFrame.any to flag the rows with at least one True, and use that mask to filter the index values:
idx = df.index[compared_df.any(axis=1)]
print (idx)
Int64Index([0, 1, 3, 4, 5], dtype='int64')
Detail:
print (compared_df.any(axis=1))
0 True
1 True
2 False
3 True
4 True
5 True
dtype: bool
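The question also asks for the column numbers that fall below the threshold in each row. One possible way to read them off the boolean mask is sketched below; the names col_positions and col_labels are assumptions for illustration, and np.flatnonzero is used here as one option among several.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
})
thresh_df = pd.DataFrame({'B': [4], 'C': [7], 'D': [4], 'E': [5]})

compared_df = df < thresh_df.iloc[0]

# For each row, the integer positions of the columns below the threshold.
col_positions = {
    idx: np.flatnonzero(row).tolist()
    for idx, row in zip(compared_df.index, compared_df.to_numpy())
}

# The same information expressed as column labels.
col_labels = {
    idx: [compared_df.columns[p] for p in positions]
    for idx, positions in col_positions.items()
}
print(col_positions)
print(col_labels)
```

Rows with no value below the threshold simply map to an empty list, so every row index is still present in the result.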
I want to check a big DataFrame for constant columns and make two lists: the first with the names of columns that contain only zeros, the second with the names of columns that hold any other constant value (excluding 0).
I found a solution (A in the code) at Link, but I don't understand it. A does what I want, but I don't know how it works or how I can get the lists from it.
import numpy as np
import pandas as pd
data = [[0,1,1],[0,1,2],[0,1,3]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
A =df.loc[:, (df != df.iloc[0]).any()]
Use:
m1 = (df == 0).all()
m2 = (df == df.iloc[0]).all()
a = df.columns[m1].tolist()
b = df.columns[~m1 & m2].tolist()
print (a)
['A']
print (b)
['B']
Explanation:
First compare all values with 0:
print (df == 0)
      A      B      C
0  True  False  False
1  True  False  False
2  True  False  False
Then test whether all values in each column are True with DataFrame.all:
print ((df == 0).all())
A True
B False
C False
dtype: bool
Then compare every row with the first row, selected by DataFrame.iloc:
print (df == df.iloc[0])
      A     B      C
0  True  True   True
1  True  True  False
2  True  True  False
And test again by all:
print ((df == df.iloc[0]).all())
A True
B True
C False
dtype: bool
Because 0 must be excluded, invert the first mask with ~ and chain it to the second mask with & (bitwise AND):
print (~m1 & m2)
A False
B True
C False
dtype: bool
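The expression A from the question can be unpacked in the same step-by-step way. The sketch below splits it into intermediate variables; the names differs, non_constant and constant_cols are hypothetical and only serve the explanation.

```python
import pandas as pd

data = [[0, 1, 1], [0, 1, 2], [0, 1, 3]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

# Step 1: compare every row with the first row, column by column.
differs = df != df.iloc[0]

# Step 2: a column is non-constant if any of its values differs from row 0.
non_constant = differs.any()

# Step 3: keep only the non-constant columns (this is the question's A).
A = df.loc[:, non_constant]

# The constant columns are the complement of that mask.
constant_cols = df.columns[~non_constant].tolist()
print(A.columns.tolist(), constant_cols)
```

So A keeps the columns that vary, and inverting its mask gives the constant columns, which still need the zero test from m1 to be split into the two requested lists.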
This seems like a clean way to do what you want:
m1 = df.eq(0).all()
m2 = df.nunique().eq(1) & ~m1
m1[m1].index, m2[m2].index
# (Index(['A'], dtype='object'), Index(['B'], dtype='object'))
m1 gives you a boolean of columns that all have zeros:
m1
A True
B False
C False
dtype: bool
m2 gives you all columns that hold a single unique value, excluding the all-zero ones (the second condition reuses the first mask):
m2
A False
B True
C False
dtype: bool
Deriving your lists is trivial from these masks.
My dataframe looks like this
df = pd.DataFrame({ 'a': ["10001", "10001", "10002", "10002" , "10002"], 'b': ['hello', 'hello', 'hola', 'hello', 'hola']})
I want to create a new column 'c' of boolean values with the following condition:
If the values of 'a' are the same (i.e. 1st and 2nd row, 3rd and 4th and 5th row), check if the values of 'b' in those rows are the same. (2nd row returns True. 4th row returns False).
If the values of 'a' are not the same, skip.
My current code is the following:
def check_consistency(col1,col2):
    df['match'] = df[col1].eq(df[col1].shift())
    t = []
    for i in df['match']:
        if i == True:
            t.append(df[col2].eq(df[col2].shift()))
check_consistency('a','b')
And it returns an error.
I think this is a job for groupby:
df.groupby('a', group_keys=False).b.apply(lambda x : x==x.shift())  # group_keys=False keeps the original flat index on recent pandas
Out[431]:
0 False
1 True
2 False
3 False
4 False
Name: b, dtype: bool
A bitwise & should do it, checking that both conditions are satisfied:
df['c'] = (df.a == df.a.shift()) & (df.b == df.b.shift())
df.c
#0 False
#1 True
#2 False
#3 False
#4 False
#Name: c, dtype: bool
Alternatively, if you want to make your current code work, you can do something like (essentially doing the same check as above):
def check_consistency(col1,col2):
    df['match'] = df[col1].eq(df[col1].shift())
    for i in range(len(df['match'])):
        if (df['match'][i] == True):
            df.loc[i,'match'] = (df.loc[i, col2] == df.loc[i-1, col2])
check_consistency('a','b')
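A variation on the groupby idea avoids both apply and the explicit loop by shifting 'b' within each group of 'a'. This is a sketch under the assumption that rows with the same 'a' should be compared with the previous row of the same group; the name prev_b is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['10001', '10001', '10002', '10002', '10002'],
    'b': ['hello', 'hello', 'hola', 'hello', 'hola'],
})

# Previous value of 'b' within the same 'a' group (missing for each group's first row).
prev_b = df.groupby('a')['b'].shift()

# True only when a previous row exists in the group and its 'b' matches.
df['c'] = df['b'].eq(prev_b)
print(df)
```

Unlike the plain shift comparison, this version does not require rows with the same 'a' to be adjacent, because the shift happens inside each group.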
Working with data in Python 3+ with pandas. It seems like there should be an easy way to check if two columns have a one-to-one relationship (regardless of column type), but I'm struggling to think of the best way to do this.
Example of expected output:
A    B      C
0    'a'    'apple'
1    'b'    'banana'
2    'c'    'apple'
A & B are one-to-one? TRUE
A & C are one-to-one? FALSE
B & C are one-to-one? FALSE
Well, you can create your own function to check it:
def isOneToOne(df, col1, col2):
    first = df.groupby(col1)[col2].count().max()
    second = df.groupby(col2)[col1].count().max()
    return first + second == 2
isOneToOne(df, 'A', 'B')
#True
isOneToOne(df, 'A', 'C')
#False
isOneToOne(df, 'B', 'C')
#False
In case your data is more like this:
df = pd.DataFrame({'A': [0, 1, 2, 0],
'C': ["'apple'", "'banana'", "'apple'", "'apple'"],
'B': ["'a'", "'b'", "'c'", "'a'"]})
df
# A B C
#0 0 'a' 'apple'
#1 1 'b' 'banana'
#2 2 'c' 'apple'
#3 0 'a' 'apple'
Then you can use:
def isOneToOne(df, col1, col2):
    first = df.drop_duplicates([col1, col2]).groupby(col1)[col2].count().max()
    second = df.drop_duplicates([col1, col2]).groupby(col2)[col1].count().max()
    return first + second == 2
df.groupby(col1)[col2]\
.apply(lambda x: x.nunique() == 1)\
.all()
should work fine if you want a true or false answer.
A nice way to visualize the relationship between two columns with discrete / categorical values (in case you are using Jupyter notebook) is :
df.groupby([col1, col2])\
.apply(lambda x : x.count())\
.iloc[:,0]\
.unstack()\
.fillna(0)
This matrix will tell you the correspondence between the column values in the two columns.
In case of a one-to-one relationship there will be only one non-zero value per row in the matrix.
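A similar correspondence matrix can also be built with pd.crosstab, shown here as an alternative rather than as the answer's own method. The helper one_nonzero_everywhere is hypothetical; it checks columns as well as rows, on the assumption that a strict one-to-one relationship needs exactly one non-zero cell in both directions.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2],
                   'B': ['a', 'b', 'c'],
                   'C': ['apple', 'banana', 'apple']})

# Rows are values of the first column, columns are values of the second,
# and each cell counts how often that pair occurs.
table_ab = pd.crosstab(df['A'], df['B'])
table_ac = pd.crosstab(df['A'], df['C'])

def one_nonzero_everywhere(table):
    """True if every row and every column has exactly one non-zero cell."""
    nonzero = table.ne(0)
    rows_ok = nonzero.sum(axis=1).eq(1).all()
    cols_ok = nonzero.sum(axis=0).eq(1).all()
    return bool(rows_ok and cols_ok)

print(table_ac)
print(one_nonzero_everywhere(table_ab), one_nonzero_everywhere(table_ac))
```

In table_ac the 'apple' column has two non-zero cells, which is what reveals that A and C are not one-to-one even though each row has a single entry.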
df.groupby('A').B.nunique().max()==1 #Output: True
df.groupby('B').C.nunique().max()==1 #Output: False
Within each value in [groupby column], count the number of unique values in [other column], then check that the maximum for all such counts is one
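Each line above checks one direction only, so a full one-to-one test would run the check both ways. A minimal sketch, assuming no missing values (nunique ignores NaN by default); the function name one_to_one_pair is hypothetical.

```python
import pandas as pd

def one_to_one_pair(df, col1, col2):
    """Check that each col1 value maps to one col2 value and vice versa."""
    forward = df.groupby(col1)[col2].nunique().max() == 1
    backward = df.groupby(col2)[col1].nunique().max() == 1
    return bool(forward and backward)

df = pd.DataFrame({'A': [0, 1, 2],
                   'B': ['a', 'b', 'c'],
                   'C': ['apple', 'banana', 'apple']})

results = {pair: one_to_one_pair(df, *pair)
           for pair in [('A', 'B'), ('A', 'C'), ('B', 'C')]}
print(results)
```

For A and C the forward direction passes (each A has one C) while the backward direction fails (apple maps to two A values), so the combined result is False.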
One way to solve this:
df['A to B']=df.groupby('B')['A'].transform(lambda x:x.nunique()==1)
df['A to C']=df.groupby('C')['A'].transform(lambda x:x.nunique()==1)
df['B to C']=df.groupby('C')['B'].transform(lambda x:x.nunique()==1)
Output:
   A  B       C  A to B  A to C  B to C
0  0  a   apple    True   False   False
1  1  b  banana    True    True    True
2  2  c   apple    True   False   False
To check column by column:
print((df['A to B'] == True).all())
print((df['A to C'] == True).all())
print((df['B to C'] == True).all())
True
False
False
Here is my solution (only two or three lines of code) to check any number of columns for a one-to-one match (duplicated matches are allowed, see the example below).
cols = ['A', 'B'] # or any number of columns ['A', 'B', 'C']
res = df.groupby(cols).count()
uniqueness = [res.index.get_level_values(i).is_unique
for i in range(res.index.nlevels)]
all(uniqueness)
Let's make it a function and add some docs:
def is_one_to_one(df, cols):
    """Check whether any number of columns are a one-to-one match.

    df: a pandas.DataFrame
    cols: must be a list of column names

    Duplicated matches are allowed:
        a - 1
        b - 2
        b - 2
        c - 3
    (These two columns will return True)
    """
    if len(cols) == 1:
        # You can define your own rules for a single-column check, or forbid it
        return True

    # MAIN THING: the check for 2 or more columns!
    res = df.groupby(cols).count()
    # The counts themselves are irrelevant.
    # What matters here is the grouped *MultiIndex*
    # and its uniqueness in each level
    uniqueness = [res.index.get_level_values(i).is_unique
                  for i in range(res.index.nlevels)]
    return all(uniqueness)
By using this function, you can do the one-to-one match check:
df = pd.DataFrame({'A': [0, 1, 2, 0],
'B': ["'a'", "'b'", "'c'", "'a'"],
'C': ["'apple'", "'banana'", "'apple'", "'apple'"],})
is_one_to_one(df, ['A', 'B'])
is_one_to_one(df, ['A', 'C'])
is_one_to_one(df, ['A', 'B', 'C'])
# Outputs:
# True
# False
# False