Issues comparing lists in a pandas DataFrame - python

I have a DataFrame in pandas with one of the column types being a list of int, like so:
df = pandas.DataFrame([[1,2,3,[4,5]],[6,7,8,[9,10]]], columns=['a','b','c','d'])
>>> df
a b c d
0 1 2 3 [4, 5]
1 6 7 8 [9, 10]
I'd like to build a filter using d, but the normal comparison operations don't seem to work:
>>> df['d'] == [4,5]
0 False
1 False
Name: d, dtype: bool
However, when I inspect row by row, I get what I would expect:
>>> df.loc[0,'d'] == [4,5]
True
What's going on here? How can I do list comparisons?

It is a curious issue; it probably has to do with the fact that lists are not hashable.
I would go for apply:
df['d'].apply(lambda x: x == [4,5])
Of course as suggested by DSM, the following works:
df = pd.DataFrame([[1,2,3,(4,5)],[6,7,8,(9,10)]], columns=['a','b','c','d'])
df['d'] == (4,5)
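The resulting boolean Series can then be used directly as a filter; a small usage sketch based on the tuple-based frame above:
# keep only the rows whose d equals (4, 5)
mask = df['d'] == (4, 5)
print(df[mask])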
Another solution is to use a list comprehension:
df[[v == [4, 5] for v in df['d']]]

As an alternative, if you wish to keep your "series of lists" structure, you can convert your series to tuples for comparison purposes only. This is possible via pd.Series.apply:
>>> df['d'].apply(tuple) == (4, 5)
0 True
1 False
Name: d, dtype: bool
However, note that none of the options available for a series of lists are vectorised. You are advised to split your data into numeric series before performing comparisons.
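As a sketch of that advice (the column names d0/d1 are made up here, not from the original post), the list column can be expanded into plain numeric columns so the comparison stays vectorised:
import pandas as pd

df = pd.DataFrame([[1, 2, 3, [4, 5]], [6, 7, 8, [9, 10]]], columns=['a', 'b', 'c', 'd'])

# expand the list column into ordinary numeric columns
d_cols = pd.DataFrame(df['d'].tolist(), index=df.index, columns=['d0', 'd1'])

# vectorised comparison, no per-row Python call
mask = (d_cols['d0'] == 4) & (d_cols['d1'] == 5)
print(df[mask])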

How to evaluate conditions after each other in Pandas .loc?

I have a Pandas DataFrame where column B contains mixed types
A B C
0 1 1 False
1 2 abc False
2 3 2 False
3 4 3 False
4 5 b False
I want to modify column C to be True when the value in column B is of type int and also has a value greater than or equal to 3. So in this example, df['B'][3] should match this condition.
I tried to do this:
df.loc[(df['B'].astype(str).str.isdigit()) & (df['B'] >= 3)] = True
However, I get the following error because of the str values inside column B:
TypeError: '>' not supported between instances of 'str' and 'int'
If I'm able to only test the second condition on the subset provided after the first condition this would solve my problem I think. What can I do to achieve this?
A good way without the use of apply would be to use pd.to_numeric with errors='coerce', which will turn the str values into NaN without modifying column B itself:
df['C'] = pd.to_numeric(df.B, errors='coerce') >= 3
>>> print(df)
A B C
0 1 1 False
1 2 abc False
2 3 2 False
3 4 3 True
4 5 b False
One solution could be:
df["B"].apply(lambda x: str(x).isdigit() and int(x) >= 3)
If x is not a digit, then the evaluation will stop and won't try to parse x to int - which throws a ValueError if the argument is not parseable into an int.
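A quick check of what that yields on the example frame when assigned back to C (a sketch; the output layout is approximate):
df['C'] = df['B'].apply(lambda x: str(x).isdigit() and int(x) >= 3)
print(df)
#    A    B      C
# 0  1    1  False
# 1  2  abc  False
# 2  3    2  False
# 3  4    3   True
# 4  5    b  False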
There are many ways around this (e.g. use a custom (lambda) function with df.apply, use df.replace() first), but I think the easiest way might be just to use an intermediate column.
First, create a new column that does the first check, then do the second check on this new column.
This works (although nikeros' answer is more elegant).
def check_maybe_int(n):
    return int(n) >= 3 if str(n).isdigit() else False

df.B.apply(check_maybe_int)
But the real answer is, don't do this! Mixed columns prevent a lot of Pandas' optimisations. apply is not vectorised, so it's a lot slower than vector int comparison should be.
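If the mixed column cannot be avoided, a one-off conversion keeps the later comparisons vectorised; a minimal sketch (not the original poster's code):
# non-numeric entries become NaN, and NaN >= 3 evaluates to False
b_num = pd.to_numeric(df['B'], errors='coerce')
df['C'] = b_num >= 3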
You can use apply(type), as the example below illustrates:
d = {'col1': [1, 2,1, 2], 'col2': [3, 4,1, 2],'col3': [1, 2,1, 2],'col4': [1, 'e',True, 2.345]}
df = pd.DataFrame(data=d)
a = df.col4.apply(type)
b = [ i==str for i in a ]
df['col5'] = b
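For illustration, with the data above col4 holds an int, a str, a bool and a float, so col5 becomes [False, True, False, False] and can be used as a mask:
# keep only the rows whose col4 value is a string
print(df[df['col5']])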

Is there a way to loop through a python data frame, compare column value (nested list) and update another column conditionally?

I have a python data frame as below:
A B C
2 [4,3,9] 1
6 [4,8] 2
3 [3,9,4] 3
My goal is to loop through the data frame and compare the column B lists; if two rows have the same column B values, then update column C to the same number, such as below:
A B C
2 [4,3,9] 1
6 [4,8] 2
3 [3,9,4] 1
I tried with the code below:
for i, j in df.iterrows():
    if len(df['B'][i]) == len(df['B'][j]) & collections.Counter(df['B'][i]) == collections.Counter(df['B'][j]):
        df['C'][j] == df['C'][i]
    else:
        df['C'][j] == df['C'][j]
I got the error message: unhashable type: 'list'.
Anyone knows what cause this error and better way to do this? Thank you for your help!
Because lists are not hashable, convert the lists to sorted tuples and get the first value per group with GroupBy.transform('first'):
df['C'] = df.groupby(df.B.apply(lambda x: tuple(sorted(x)))).C.transform('first')
print (df)
A B C
0 2 [4, 3, 9] 1
1 6 [4, 8] 2
2 3 [3, 9, 4] 1
Detail:
print (df.B.apply(lambda x: tuple(sorted(x))))
0 (3, 4, 9)
1 (4, 8)
2 (3, 4, 9)
Name: B, dtype: object
Not quite sure about the efficiency of the code, but it gets the job done:
uniqueRows = {}
for index, row in df.iterrows():
    duplicateFound = False
    for c_value, uniqueRow in uniqueRows.items():
        if duplicateFound:
            continue
        if len(row['B']) == len(uniqueRow):
            if len(list(set(row['B']) - set(uniqueRow))) == 0:
                print(c_value)
                df.at[index, 'C'] = c_value
                duplicateFound = True
    if not duplicateFound:
        uniqueRows[row['C']] = row['B']
print(df)
print(uniqueRows)
This code first loops over your dataframe. It keeps a duplicateFound boolean for each row that is used later.
It then loops over the uniqueRows dict and first checks whether a duplicate has already been found; in that case it continues and skips the remaining calculations, because they are no longer needed.
Afterwards it compares the lengths of the two lists to skip some comparisons, and if they are equal it takes the set difference, which is empty when there are no differences.
So if that difference is empty, it sets the value of the C column at this position using the pandas DataFrame at method (which should be used when assigning values while iterating over a dataframe) and sets the duplicateFound variable to True to prevent further comparisons. If no duplicate was found, duplicateFound is still False, which triggers the addition to the uniqueRows dict at the end of the loop body before moving on to the next row.
In case you have any comments or improvements to my code, feel free to discuss. I hope this code helps you with your project!
Create a temporary column by applying sorted to each entry in the B column; group by the temporary column to get your matches and get rid of the temporary column.
df1['B_temp'] = df1.B.apply(lambda x: ''.join(map(str, sorted(x))))
df1['C'] = df1.groupby('B_temp').C.transform('min')
df1 = df1.drop('B_temp', axis = 1)
df1
A B C
0 2 [4, 3, 9] 1
1 6 [4, 8] 2
2 3 [3, 9, 4] 1
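A related sketch, if you prefer to skip the temporary column: frozenset is hashable and order-insensitive, so it can serve directly as a groupby key (note that, unlike the sorted-string key, it also ignores duplicate elements within a list):
df1['C'] = df1.groupby(df1.B.map(frozenset)).C.transform('min')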

Convert True/False to 1/0 Python [duplicate]

I have a column in python pandas DataFrame that has boolean True/False values, but for further calculations I need 1/0 representation. Is there a quick pandas/numpy way to do that?
A succinct way to convert a single column of boolean values to a column of integers 1 or 0:
df["somecolumn"] = df["somecolumn"].astype(int)
Just multiply your Dataframe by 1 (int)
[1]: data = pd.DataFrame([[True, False, True], [False, False, True]])
[2]: print data
0 1 2
0 True False True
1 False False True
[3]: print data*1
0 1 2
0 1 0 1
1 0 0 1
True is 1 in Python, and likewise False is 0*:
>>> True == 1
True
>>> False == 0
True
You should be able to perform any operations you want on them by just treating them as though they were numbers, as they are numbers:
>>> issubclass(bool, int)
True
>>> True * 5
5
So to answer your question, no work necessary - you already have what you are looking for.
* Note I use is as an English word, not the Python keyword is - True will not be the same object as any random 1.
This question specifically mentions a single column, so the currently accepted answer works. However, it doesn't generalize to multiple columns. For those interested in a general solution, use the following:
df.replace({False: 0, True: 1}, inplace=True)
This works for a DataFrame that contains columns of many different types, regardless of how many are boolean.
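A small sketch with made-up data showing the multi-column case; only the boolean columns are affected:
df = pd.DataFrame({"x": [True, False], "y": ["a", "b"], "z": [False, True]})
df.replace({False: 0, True: 1}, inplace=True)
# x and z become 0/1, the string column y is untouched
# (newer pandas versions may emit a downcasting FutureWarning here)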
You also can do this directly on Frames
In [104]: df = DataFrame(dict(A = True, B = False),index=range(3))
In [105]: df
Out[105]:
A B
0 True False
1 True False
2 True False
In [106]: df.dtypes
Out[106]:
A bool
B bool
dtype: object
In [107]: df.astype(int)
Out[107]:
A B
0 1 0
1 1 0
2 1 0
In [108]: df.astype(int).dtypes
Out[108]:
A int64
B int64
dtype: object
Use Series.view to convert booleans to integers:
df["somecolumn"] = df["somecolumn"].view('i1')
You can use a transformation for your data frame:
df = pd.DataFrame(my_data condition)
Transforming True/False into 1/0:
df = df*1
I had to map FAKE/REAL to 0/1 but couldn't find a proper answer.
Please find below how to map the column 'type', which has values FAKE/REAL, to 0/1 (note: the same approach can be applied to any column name and values):
df.loc[df['type'] == 'FAKE', 'type'] = 0
df.loc[df['type'] == 'REAL', 'type'] = 1
This is a reproducible example based on some of the existing answers:
import pandas as pd
def bool_to_int(s: pd.Series) -> pd.Series:
    """Convert the boolean to binary representation, maintain NaN values."""
    return s.replace({True: 1, False: 0})

# generate a random dataframe
df = pd.DataFrame({"a": range(10), "b": range(10, 0, -1)}).assign(
    a_bool=lambda df: df["a"] > 5,
    b_bool=lambda df: df["b"] % 2 == 0,
)
# select all bool columns (or specify which cols to use)
bool_cols = [c for c, d in df.dtypes.items() if d == "bool"]
# apply the new coding to a new dataframe (or can replace the existing one)
df_new = df.assign(**{c: lambda df: df[c].pipe(bool_to_int) for c in bool_cols})
Tried and tested:
df[col] = df[col].map({True: 1, False: 0})
If there is more than one column with True/False values, use the following:
for col in bool_cols:
    df[col] = df[col].map({True: 1, False: 0})
#AMC wrote this in a comment
If the column is of type object:
df["somecolumn"] = df["somecolumn"].astype(bool).astype(int)

how to preserve pandas dataframe identity when extracting a single row

I am extracting a subset of my dataframe by index using either .xs or .loc (they seem to behave the same). When my condition retrieves multiple rows, the result stays a dataframe. When only a single row is retrieved, it is automatically converted to a series. I don't want that behavior, since that means I need to handle multiple cases downstream (different method sets available for series vs dataframe).
In [1]: df = pd.DataFrame({'a':range(7), 'b':['one']*4 + ['two'] + ['three']*2,
'c':range(10,17)})
In [2]: df.set_index('b', inplace=True)
In [3]: df.xs('one')
Out[3]:
a c
b
one 0 10
one 1 11
one 2 12
one 3 13
In [4]: df.xs('two')
Out[4]:
a 4
c 14
Name: two, dtype: int64
In [5]: type(df.xs('two'))
Out[5]: pandas.core.series.Series
I can manually convert that series back to a dataframe, but it seems cumbersome and will also require case testing to see if I should do that. Is there a cleaner way to just get a dataframe back to begin with?
IIUC, you can simply add brackets, [], and use .loc:
>>> df.loc["two"]
a 4
c 14
Name: two, dtype: int64
>>> type(_)
<class 'pandas.core.series.Series'>
>>> df.loc[["two"]]
a c
b
two 4 14
[1 rows x 2 columns]
>>> type(_)
<class 'pandas.core.frame.DataFrame'>
This may remind you of how numpy advanced indexing works:
>>> a = np.arange(9).reshape(3,3)
>>> a[1]
array([3, 4, 5])
>>> a[[1]]
array([[3, 4, 5]])
Now this will probably require some refactoring of code so that you're always accessing with a list, even if the list only has one element, but it works well for me in practice.
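If that refactoring is spread across a codebase, one option is a tiny helper that always wraps the key in a list (purely illustrative, the name is made up):
def rows_for(df, key):
    # always return a DataFrame, even when only one index label matches
    return df.loc[[key]]

rows_for(df, "two")   # 1 row x 2 column DataFrame, not a Series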

Mapping a few numerical columns into a new column of tuples in Pandas

For object data I can map two columns into a third (object) column of tuples:
>>> import pandas as pd
>>> df = pd.DataFrame([["A","b"], ["A", "a"],["B","b"]])
>>> df
0 1
0 A b
1 A a
2 B b
>>> df.apply(lambda row: (row[0], row[1]), axis=1)
0 (A, b)
1 (A, a)
2 (B, b)
dtype: object
(see also Pandas: How to use apply function to multiple columns).
However, when I try to do the same thing with numerical columns
>>> df2 = pd.DataFrame([[10,2], [10, 1],[20,2]])
>>> df2.apply(lambda row: (row[0], row[1]), axis=1)
0 1
0 10 2
1 10 1
2 20 2
so instead of a series of pairs (i.e. [(10,2), (10,1), (20,2)]) I get a DataFrame.
How can I force pandas to actually get a series of pairs? (Preferably, doing it nicer than converting to string and then parsing.)
I don't recommend this, but you can force it:
In [11]: df2.apply(lambda row: pd.Series([(row[0], row[1])]), axis=1)
Out[11]:
0
0 (10, 2)
1 (10, 1)
2 (20, 2)
Please don't do this.
Two columns will give you much better performance, flexibility and ease of later analysis.
Just to update with the OP's experience:
What was wanted was to count the occurrences of each [0, 1] pair.
On a Series they could use the value_counts method (on the column from the result above). However, the same result can be achieved using groupby, which the OP found to be about 300 times faster:
df2.groupby([0, 1]).size()
It's worth emphasising (again) that the approach in [11] has to create a Series object and a tuple instance for each row, which is a huge overhead compared to groupby.
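If a Series of tuples really is needed despite the advice above, it can also be built without apply by zipping the columns; a sketch based on df2:
pairs = pd.Series(list(zip(df2[0], df2[1])), index=df2.index)
# 0    (10, 2)
# 1    (10, 1)
# 2    (20, 2)
# dtype: object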
