I have a Pandas DataFrame where column B contains mixed types
A B C
0 1 1 False
1 2 abc False
2 3 2 False
3 4 3 False
4 5 b False
I want to modify column C to be True when the value in column B is of type int and also has a value greater than or equal to 3. So in this example df['B'][3] should match this condition
I tried to do this:
df.loc[(df['B'].astype(str).str.isdigit()) & (df['B'] >= 3)] = True
However I get the following error because of the str values inside column B:
TypeError: '>' not supported between instances of 'str' and 'int'
If I'm able to only test the second condition on the subset provided after the first condition this would solve my problem I think. What can I do to achieve this?
A good way without the use of apply would be to use pd.to_numeric with errors='coerce' which will change the str type to NaN, without changing the type of column B:
df['C'] = pd.to_numeric(df.B, 'coerce') >= 3
>>> print(df)
A B C
0 1 1 False
1 2 abc False
2 3 2 False
3 4 3 True
4 5 b False
One solution could be:
df["B"].apply(lambda x: str(x).isdigit() and int(x) >= 3)
If x is not a digit, then the evaluation will stop and won't try to parse x to int - which throws a ValueError if the argument is not parseable into an int.
There are many ways around this (e.g. use a custom (lambda) function with df.apply, use df.replace() first), but I think the easiest way might be just to use an intermediate column.
First, create a new column that does the first check, then do the second check on this new column.
This works (although nikeros' answer is more elegant).
def check_maybe_int(n):
return int(n) >= 3 if n.isdigit() else False
df.B.apply(check_maybe_int)
But the real answer is, don't do this! Mixed columns prevent a lot of Pandas' optimisations. apply is not vectorised, so it's a lot slower than vector int comparison should be.
you can use apply(type) as picture illustrate
d = {'col1': [1, 2,1, 2], 'col2': [3, 4,1, 2],'col3': [1, 2,1, 2],'col4': [1, 'e',True, 2.345]}
df = pd.DataFrame(data=d)
a = df.col4.apply(type)
b = [ i==str for i in a ]
df['col5'] = b
Related
I want to use numpy.where() to add a column to a pandas.DataFrame. I'd like to use NaN values for the rows where the condition is false (to indicate that these values are "missing").
Consider:
>>> import numpy; import pandas
>>> df = pandas.DataFrame({'A':[1,2,3,4]}); print(df)
A
0 1
1 2
2 3
3 4
>>> df['B'] = numpy.nan
>>> df['C'] = numpy.where(df['A'] < 3, 'yes', numpy.nan)
>>> print(df)
A B C
0 1 NaN yes
1 2 NaN yes
2 3 NaN nan
3 4 NaN nan
>>> df.isna()
A B C
0 False True False
1 False True False
2 False True False
3 False True False
Why does B show "NaN" but C shows "nan"? And why does DataFrame.isna() fail to detect the NaN values in C?
Should I use something other than numpy.nan inside where? None and pandas.NA both seem to work and can be detected by DataFrame.isna(), but I'm not sure these are the best choice.
Thank you!
Edit: As per #Tim Roberts and #DYZ, numpy.where returns an array of type string, so the str constructor is called on numpy.NaN. The values in column C are actually strings "nan". The question remains, however: what is the most elegant thing to do here? Should I use None? Or something else?
np.where coerces the second and the third parameter to the same datatype. Since the second parameter is a string, the third one is converted to a string, too, by calling function str():
str(numpy.nan)
# 'nan'
As the result, the values in column C are all strings.
You can first fill the NaN rows with None and then convert them to np.nan with fillna():
df['C'] = numpy.where(df['A'] < 3, 'yes', None)
df['C'].fillna(np.nan, inplace=True)
B is a pure numeric column. C has a mixture of strings and numerics, so the column has type "object", and it prints differently.
I have a column in python pandas DataFrame that has boolean True/False values, but for further calculations I need 1/0 representation. Is there a quick pandas/numpy way to do that?
A succinct way to convert a single column of boolean values to a column of integers 1 or 0:
df["somecolumn"] = df["somecolumn"].astype(int)
Just multiply your Dataframe by 1 (int)
[1]: data = pd.DataFrame([[True, False, True], [False, False, True]])
[2]: print data
0 1 2
0 True False True
1 False False True
[3]: print data*1
0 1 2
0 1 0 1
1 0 0 1
True is 1 in Python, and likewise False is 0*:
>>> True == 1
True
>>> False == 0
True
You should be able to perform any operations you want on them by just treating them as though they were numbers, as they are numbers:
>>> issubclass(bool, int)
True
>>> True * 5
5
So to answer your question, no work necessary - you already have what you are looking for.
* Note I use is as an English word, not the Python keyword is - True will not be the same object as any random 1.
This question specifically mentions a single column, so the currently accepted answer works. However, it doesn't generalize to multiple columns. For those interested in a general solution, use the following:
df.replace({False: 0, True: 1}, inplace=True)
This works for a DataFrame that contains columns of many different types, regardless of how many are boolean.
You also can do this directly on Frames
In [104]: df = DataFrame(dict(A = True, B = False),index=range(3))
In [105]: df
Out[105]:
A B
0 True False
1 True False
2 True False
In [106]: df.dtypes
Out[106]:
A bool
B bool
dtype: object
In [107]: df.astype(int)
Out[107]:
A B
0 1 0
1 1 0
2 1 0
In [108]: df.astype(int).dtypes
Out[108]:
A int64
B int64
dtype: object
Use Series.view for convert boolean to integers:
df["somecolumn"] = df["somecolumn"].view('i1')
You can use a transformation for your data frame:
df = pd.DataFrame(my_data condition)
transforming True/False in 1/0
df = df*1
I had to map FAKE/REAL to 0/1 but couldn't find proper answer.
Please find below how to map column name 'type' which has values FAKE/REAL to 0/1 (Note: similar can be applied to any column name and values)
df.loc[df['type'] == 'FAKE', 'type'] = 0
df.loc[df['type'] == 'REAL', 'type'] = 1
This is a reproducible example based on some of the existing answers:
import pandas as pd
def bool_to_int(s: pd.Series) -> pd.Series:
"""Convert the boolean to binary representation, maintain NaN values."""
return s.replace({True: 1, False: 0})
# generate a random dataframe
df = pd.DataFrame({"a": range(10), "b": range(10, 0, -1)}).assign(
a_bool=lambda df: df["a"] > 5,
b_bool=lambda df: df["b"] % 2 == 0,
)
# select all bool columns (or specify which cols to use)
bool_cols = [c for c, d in df.dtypes.items() if d == "bool"]
# apply the new coding to a new dataframe (or can replace the existing one)
df_new = df.assign(**{c: lambda df: df[c].pipe(bool_to_int) for c in bool_cols})
Tries and tested:
df[col] = df[col].map({'True': 1,'False' :0 })
If there are more than one columns with True/False, use the following.
for col in bool_cols:
df[col] = df[col].map({'True': 1,'False' :0 })
#AMC wrote this in a comment
If the column is of the type object
df["somecolumn"] = df["somecolumn"].astype(bool).astype(int)
I have a dataframe with several columns (the features).
>>> print(df)
col1 col2
a 1 1
b 2 2
c 3 3
d 3 2
I would like to compute the mode of one of them. This is what happens:
>>> print(df['col1'].mode())
0 3
dtype: int64
I would like to output simply the value 3.
This behavoiur is quite strange, if you consider that the following very similar code is working:
>>> print(df['col1'].mean())
2.25
So two questions: why does this happen? How can I obtain the pure mode value as it happens for the mean?
Because Series.mode() can return multiple values:
consider the following DF:
In [77]: df
Out[77]:
col1 col2
a 1 1
b 2 2
c 3 3
d 3 2
e 2 3
In [78]: df['col1'].mode()
Out[78]:
0 2
1 3
dtype: int64
From docstring:
Empty if nothing occurs at least 2 times. Always returns Series
even if only one value.
If you want to chose the first value:
In [83]: df['col1'].mode().iloc[0]
Out[83]: 2
In [84]: df['col1'].mode()[0]
Out[84]: 2
I agree that it's too cumbersome
df['col1'].mode().iloc[0].values[0]
a series can have one mean(), but a series can have more than one mode()
like
<2,2,2,3,3,3,4,4,4,5,6,7,8> its mode 2,3,4.
the output must be indexed
mode() will return all values that tie for the most frequent value.
In order to support that functionality, it must return a collection, which takes the form of a dataFrame or Series.
For example, if you had a series:
[2, 2, 3, 3, 5, 5, 6]
Then the most frequent values occur twice. The result would then be the series [2, 3, 5] since each of these occur twice.
If you want to collapse this into a single value, you can access the first value, compute the max(), min(), or whatever makes most sense for your application.
I have a dataframe looking generated by:
df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2], [101, ' test1', 3 ], [101,' ', 4]])
It looks like
0 1 2
0 100 tes t 3
1 100 NaN 2
2 101 test1 3
3 101 4
I would like to a fill column 1 "forward" with test and test1. I believe one approach would be to work with replacing whitespace by np.nan, but it is difficult since the words contain whitespace as well. I could also groupby column 0 and then use the first element of each group to fill forward. Could you provide me with some code for both alternatives I do not get it coded?
Additionally, I would like to add a column that contains the group means that is
the final dataframe should look like this
0 1 2 3
0 100 tes t 3 2.5
1 100 tes t 2 2.5
2 101 test1 3 3.5
3 101 test1 4 3.5
Could you also please advice how to accomplish something like this?
Many thanks please let me know in case you need further information.
IIUC, you could use str.strip and then check if the stripped string is empty.
Then, perform groupby operations and filling the Nans by the method ffill and calculating the means using groupby.transform function as shown:
df[1] = df[1].str.strip().dropna().apply(lambda x: np.NaN if len(x) == 0 else x)
df[1] = df.groupby(0)[1].fillna(method='ffill')
df[3] = df.groupby(0)[2].transform(lambda x: x.mean())
df
Note: If you must forward fill NaN values with first element of that group, then you must do this:
df.groupby(0)[1].apply(lambda x: x.fillna(x.iloc[0]))
Breakup of steps:
Since we want to apply the function only on strings, we drop all the NaN values present before, else we would be getting the TypeError due to both floats and string elements present in the column and complains of float having no method as len.
df[1].str.strip().dropna()
0 tes t # operates only on indices where strings are present(empty strings included)
2 test1
3
Name: 1, dtype: object
The reindexing part isn't a necessary step as it only computes on the indices where strings are present.
Also, the reset_index(drop=True) part was indeed unwanted as the groupby object returns a series after fillna which could be assigned back to column 1.
I am using pandas and want to select subsets of data and apply it to other columns.
e.g.
if there is data in column A; &
if there is NO data in column B;
then, apply the data in column A to column D
I have this working fine for now using .isnull() and .notnull().
e.g.
df = pd.DataFrame({'A' : pd.Series(np.random.randn(4)),
'B' : pd.Series(np.nan),
'C' : pd.Series(['yes','yes','no','maybe'])})
df['D']=''
df
Out[44]:
A B C D
0 0.516752 NaN yes
1 -0.513194 NaN yes
2 0.861617 NaN no
3 -0.026287 NaN maybe
# Now try the first conditional expression
df['D'][df['A'].notnull() & df['B'].isnull()] \
= df['A'][df['A'].notnull() & df['B'].isnull()]
df
Out[46]:
A B C D
0 0.516752 NaN yes 0.516752
1 -0.513194 NaN yes -0.513194
2 0.861617 NaN no 0.861617
3 -0.026287 NaN maybe -0.0262874
When one adds a third condition, to also check whether data in column C matches a particular string, we get the error:
df['D'][df['A'].notnull() & df['B'].isnull() & df['C']=='yes'] \
= df['A'][df['A'].notnull() & df['B'].isnull() & df['C']=='yes']
File "C:\Anaconda2\Lib\site-packages\pandas\core\ops.py", line 763, in wrapper
res = na_op(values, other)
File "C:\Anaconda2\Lib\site-packages\pandas\core\ops.py", line 718, in na_op
raise TypeError("invalid type comparison")
TypeError: invalid type comparison
I have read that this occurs due to the different datatypes. And I can get it working if I change all the strings in column C for integers or booleans. We also know that string on its own would work, e.g. df['A'][df['B']=='yes'] gives a boolean list.
So any ideas how/why this is not working when combining these datatypes in this conditional expression? What are the more pythonic ways to do what appears to be quite long-winded?
Thanks
In case this solution doesn't work for anyone, another situation that happened to me was that even though I was reading all data in as dtype=str (and therefore doing any string comparison should be OK [ie df[col] == "some string"]), I had a column of all nulls, which becomes type float, which will give an error when comparing to a string.
To get around that, you can use .astype(str) to ensure a string to string comparison will be performed.
I think you need add parentheses () to conditions, also better is use ix for selecting column with boolean mask which can be assigned to variable mask:
mask = (df['A'].notnull()) & (df['B'].isnull()) & (df['C']=='yes')
print (mask)
0 True
1 True
2 False
3 False
dtype: bool
df.ix[mask, 'D'] = df.ix[mask, 'A']
print (df)
A B C D
0 -0.681771 NaN yes -0.681771
1 -0.871787 NaN yes -0.871787
2 -0.805301 NaN no
3 1.264103 NaN maybe