I have an issue where I want to compare values across multiple dataframes. Here is an example snippet:
data0 = [[1,'01-01'],[2,'01-02']]
data1 = [[11,'02-30'],[12,'02-25']]
data2 = [[8,'02-30'],[22,'02-25']]
data3 = [[7,'02-30'],[5,'02-25']]
df0 = pd.DataFrame(data0,columns=['Data',"date"])
df1 = pd.DataFrame(data1,columns=['Data',"date"])
df2 = pd.DataFrame(data2,columns=['Data',"date"])
df3 = pd.DataFrame(data3,columns=['Data',"date"])
result = (df0['Data'] | df1['Data']) > (df2['Data'] | df3['Data'])
What I would like to do, as I hope can be seen, is: if a value in row X of df0 or df1 is greater than both of the values in row X of df2 and df3, return True; otherwise return False. In the code above, 11 in df1 is greater than both 8 and 7 (df2 and df3 respectively), so the first result should be True; for the second row, neither 2 nor 12 is greater than 22 (df2), so it should be False. However, result gives me
False,False
instead of
True,False
any thoughts or help?
Problem
For your data:
>>> df0['Data']
0 1
1 2
Name: Data, dtype: int64
>>> df1['Data']
0 11
1 12
Name: Data, dtype: int64
you are doing a bitwise OR with |:
>>> df0['Data']| df1['Data']
0 11
1 14
Name: Data, dtype: int64
>>> df2['Data']| df3['Data']
0 15
1 23
Name: Data, dtype: int64
Do this for the single numbers:
>>> 1 | 11
11
>>> 2 | 12
14
This is not what you want.
Solution
You can use np.maximum (with numpy imported as np) to find the element-wise maximum of each pair of series:
>>> np.maximum(df0['Data'], df1['Data']) > np.maximum(df2['Data'], df3['Data'])
0 True
1 False
Name: Data, dtype: bool
Your existing solution does not work because the | operator performs a bitwise OR operation on the elements.
df0.Data | df1.Data
0 11
1 14
Name: Data, dtype: int64
As a result, you end up comparing values that differ from the ones in your dataframe columns, so the comparison does not behave as you'd expect.
You can make this easy by finding -
- the max per row of df0 and df1, and
- the max per row of df2 and df3
and then comparing these two to retrieve your result -
i = np.max([df0.Data, df1.Data], axis=0)
j = np.max([df2.Data, df3.Data], axis=0)
i > j
array([ True, False], dtype=bool)
This approach also scales to any number of dataframes.
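For example, a minimal sketch of the scaled-up version (reusing df0–df3 from the question; the group lists are hypothetical):
import numpy as np

# any number of frames can go in each group
group1 = [df0, df1]  # left-hand side of the comparison
group2 = [df2, df3]  # right-hand side

left = np.max([d['Data'] for d in group1], axis=0)   # row-wise max across group1
right = np.max([d['Data'] for d in group2], axis=0)  # row-wise max across group2
result = left > right  # array([ True, False])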
Related
I have a pandas df (shown as an image in the original post).
I want to delete any column where more than half of the values are the same, and I don't know how to do this.
I tried using pandas.Series.value_counts, but with no luck.
You can iterate over the columns, count the occurrences of values with value_counts as you tried, and check whether the most common value accounts for more than half of the column's data.
n = len(df)
cols_to_drop = []
for e in df.columns:
    max_occ = df[e].value_counts().iloc[0]  # occurrences of the most common value
    if 2 * max_occ > n:  # check if it is more than half the length of the dataset
        cols_to_drop.append(e)
df = df.drop(cols_to_drop, axis=1)
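As a quick check, running this loop on a small sample frame (hypothetical data, the same used in the answers below) keeps only col2:
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 0, 0, 0, 1],   # 0 appears 4/6 times -> dropped
                   'col2': [0, 1, 0, 1, 2, 3],   # no value appears more than half -> kept
                   'col3': [0, 0, 0, 0, 0, 0]})  # 0 appears 6/6 times -> dropped
# after running the loop above: list(df.columns) == ['col2']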
You can use apply + value_counts, taking the first value to get the max count:
count = df.apply(lambda s: s.value_counts().iat[0])
col1 4
col2 2
col3 6
dtype: int64
Then simply turn it into a mask, keeping the columns whose greatest count is at most half of len(df), and slice:
count = df.apply(lambda s: s.value_counts().iat[0])
df.loc[:, count.le(len(df)/2)] # use 'lt' if needed to drop if exactly half
output:
col2
0 0
1 1
2 0
3 1
4 2
5 3
Used input:
df = pd.DataFrame({'col1': [0,1,0,0,0,1],
                   'col2': [0,1,0,1,2,3],
                   'col3': [0,0,0,0,0,0],
                   })
Boolean slicing with a comprehension
df.loc[:, [
    df.shape[0] // s.value_counts().max() >= 2
    for _, s in df.items()  # iteritems was removed in pandas 2.0; items is equivalent
]]
col2
0 0
1 1
2 0
3 1
4 2
5 3
Credit to @mozway for the input data.
I have a dataframe like this:
df = pd.DataFrame({"a": [2.22, 3.444, 4.3726], "b": [3.44, 5.96, 7.218]})
I need to compute another column c by the following operation on column a:
c = len(str(a)) - len(str(int(a))) - 1
I have tried different methods but have not been able to achieve this.
If there can be a different number of digits after the decimal point, it is possible to use Series.str.len with Series.astype:
df = pd.DataFrame({"a":[2.22, 3.444, 4.3726],"b":[3.44, 5.96, 7.218] })
print (df.a.astype(str).str.len())
0 4
1 5
2 6
Name: a, dtype: int64
df['c'] = df.a.astype(str).str.len() - df.a.astype(int).astype(str).str.len() - 1
But counting this way is problematic with general data because of float precision (simulating the problem):
df = pd.DataFrame({"a":[2.220000000236, 3.444, 4.3726],"b":[3.44, 5.96, 7.218] })
print (df.a.astype(str).str.len())
0 14
1 5
2 6
Name: a, dtype: int64
This solution creates column c with the desired result.
df['c'] = df['a'].astype(str).str.len() - df['a'].astype(int).astype(str).str.len() - 1
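An equivalent formulation splits on the decimal point instead (a sketch with the same float-precision caveat, and assuming every value renders with a decimal point rather than in scientific notation):
# assuming df as constructed above
df['c'] = df['a'].astype(str).str.split('.').str[1].str.len()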
Imagine that I have a DataFrame whose columns are [A, B, C], with various values in each of these columns. I want to produce one more column D, which can be obtained with the following function:
def produce_column(i):
    # Extract current row by index
    raw = df.loc[i]
    # Extract the last 3 rows of the same sub-df up to and including i
    df_same = df[
        (df['A'] == raw.A)
        & (df['B'] == raw.B)
    ].loc[:i].tail(3)
    # Check that we have enough values
    if df_same.shape[0] != 3:
        return False
    # It doesn't matter which function is in use, I just need to apply it on the column / columns
    diffs = df_same['C'].map(lambda x: x <= 10 and x > 0)
    return all(diffs)

df['D'] = df.index.map(lambda x: produce_column(x))
So at each step, I need to get the sub-DataFrame with the same set of properties as the current row and perform some operations on its columns. I have a few hundred thousand rows, so this code takes a lot of time to execute. I think vectorizing the operation would be a good idea, but I don't know how to do that. Maybe there's another way to perform this?
Thanks in advance!
UPD: Here's an example
df = pd.DataFrame([(1,2,3), (4,5,6), (7,8,9)], columns=['A','B','C'])
A B C
0 1 2 3
1 4 5 6
2 7 8 9
df['D'] = df.index.map(lambda x: produce_column(x))
A B C D
0 1 2 3 True
1 4 5 6 True
2 7 8 9 False
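A minimal sketch of one possible vectorization (an assumption on my part, not code from the question; it treats the window of three as including the current row, which matches .loc[:i].tail(3)): flag the rows whose C lies in the range, then take a rolling minimum of the flags within each (A, B) group.
# assuming df has columns 'A', 'B', 'C' as in the question
in_range = (df['C'].gt(0) & df['C'].le(10)).astype(int)  # 1 where 0 < C <= 10

# Rolling min of the flags per (A, B) group: NaN until a group has three
# rows, and 1.0 only when the last three flags are all 1.
df['D'] = (in_range.groupby([df['A'], df['B']])
                   .rolling(3)
                   .min()
                   .reset_index(level=[0, 1], drop=True)
                   .eq(1))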
I want to sum up all values that I select based on some function of column and row.
Another way of putting it is that I want to use a function of the row index and column index to determine if a value should be included in a sum along an axis.
Is there an easy way of doing this?
Columns can be selected using the syntax dataframe[&lt;list of columns&gt;]. The index (rows) can be used for filtering via the dataframe.index attribute.
import pandas as pd
df = pd.DataFrame({'a': [0.1, 0.2], 'b': [0.2, 0.1]})
odd_a = df['a'][df.index % 2 == 1]
even_b = df['b'][df.index % 2 == 0]
# odd_a:
# 1 0.2
# Name: a, dtype: float64
# even_b:
# 0 0.2
# Name: b, dtype: float64
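From there, the selections can be summed directly, e.g.:
total = odd_a.sum() + even_b.sum()  # 0.2 + 0.2 = 0.4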
If df is your dataframe:
In [477]: df
Out[477]:
A s2 B
0 1 5 5
1 2 3 5
2 4 5 5
You can access the odd rows like this:
In [478]: df.loc[1::2]
Out[478]:
A s2 B
1 2 3 5
and the even ones like this:
In [479]: df.loc[::2]
Out[479]:
A s2 B
0 1 5 5
2 4 5 5
To answer your question, getting the even rows and column B would be:
In [480]: df.loc[::2,'B']
Out[480]:
0 5
2 5
Name: B, dtype: int64
and odd rows and column A can be done as:
In [481]: df.loc[1::2,'A']
Out[481]:
1 2
Name: A, dtype: int64
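And since the goal is a sum, the two selections can be combined, e.g.:
df.loc[::2, 'B'].sum() + df.loc[1::2, 'A'].sum()  # 10 + 2 = 12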
I think this is fairly general, if not the cleanest implementation. It allows applying separate functions to rows and columns depending on conditions (defined here in dictionaries).
import numpy as np
import pandas as pd
ran = np.random.randint(0,10,size=(5,5))
df = pd.DataFrame(ran,columns = ["a","b","c","d","e"])
# A dictionary to define what function is passed
d_col = {"high":["a","c","e"], "low":["b","d"]}
d_row = {"high":[1,2,3], "low":[0,4]}
# Generate list of Pandas boolean Series
i_col = [df[i].apply(lambda x: x>5) if i in d_col["high"] else df[i].apply(lambda x: x<5) for i in df.columns]
# Pass the series as a matrix
df = df[pd.concat(i_col,axis=1)]
# Now do this again for rows
i_row = [df.T[i].apply(lambda x: x>5) if i in d_row["high"] else df.T[i].apply(lambda x: x<5) for i in df.T.columns]
# Return back the DataFrame in original shape
df = df.T[pd.concat(i_row,axis=1)].T
# Perform the final operation such as sum on the returned DataFrame
print(df.sum().sum())
I am attempting to add a Series to an empty DataFrame and cannot find an answer either in the docs or in other questions. Since you can append two DataFrames by row or by column, it would seem there must be an "axis marker" missing from a Series. Can anyone explain why this does not work?
import pandas as pd
df1 = pd.DataFrame()
s1 = pd.Series(['a',5,6])
df1 = pd.concat([df1,s1],axis = 1)
#go run some process return s2, s3, sn ...
s2 = pd.Series(['b',8,9])
df1 = pd.concat([df1,s2],axis = 1)
s3 = pd.Series(['c',10,11])
df1 = pd.concat([df1,s3],axis = 1)
If my example above is somehow misleading, perhaps using the example from the docs will help.
Quoting: Appending rows to a DataFrame.
While not especially efficient (since a new object must be created), you can append a single row to a DataFrame by passing a Series or dict to append, which returns a new DataFrame as above. End quote.
The example from the docs appends "s", which is a row from a DataFrame, while "s1" is a Series, and attempting to append "s1" produces an error. My question is WHY appending "s1" will not work. The assumption behind the question is that a DataFrame must contain axis information for two axes, whereas a Series must contain information for only one axis.
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.xs(3)  # third row of the DataFrame
s1 = pd.Series(np.random.randn(4))  # new Series of equal length
df = df.append(s, ignore_index=True)
Result
0 1
0 a b
1 5 8
2 6 9
Desired
0 1 2
0 a 5 6
1 b 8 9
You were close; you just need to transpose the result from concat:
In [14]: s1
Out[14]:
0 a
1 5
2 6
dtype: object
In [15]: s2
Out[15]:
0 b
1 8
2 9
dtype: object
In [16]: pd.concat([s1, s2], axis=1).T
Out[16]:
0 1 2
0 a 5 6
1 b 8 9
[2 rows x 3 columns]
You also don't need to create the empty DataFrame.
The best way is to use the DataFrame constructor to build the frame from a sequence of Series, rather than using concat:
import pandas as pd
s1 = pd.Series(['a',5,6])
s2 = pd.Series(['b',8,9])
pd.DataFrame([s1, s2])
Output:
In [4]: pd.DataFrame([s1, s2])
Out[4]:
0 1 2
0 a 5 6
1 b 8 9
A method of accomplishing the same objective as appending a Series to a DataFrame is to convert the data to an array of lists and append the array(s) to the DataFrame.
# data as an array of lists
def get_example(idx):
    list1 = (idx + 1, idx + 2, chr(idx + 97))
    data = [list1]
    return data

df1 = pd.DataFrame()
for idx in range(4):
    data = get_example(idx)
    df1 = df1.append(data, ignore_index=True)
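Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a sketch of the same loop built with pd.concat instead:
import pandas as pd

def get_example(idx):
    # data as an array of lists
    return [(idx + 1, idx + 2, chr(idx + 97))]

frames = [pd.DataFrame(get_example(idx)) for idx in range(4)]
df1 = pd.concat(frames, ignore_index=True)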