Same comparison over two DataFrame columns to form a mask - python

I have a pandas DataFrame with columns col1 and col2. I am trying to build col3 as:
df["col3"] = (df["col1"] == 1) | (df["col2"] ==1)
and it works. I tried to rewrite it as:
df["col3"] = any([df[c] == 1 for c in ["col1", "col2"]])
but I get the infamous ValueError: The truth value of a Series is ambiguous ...
I even tried to rewrite any( .. ) as pd.Series( .. ).any(), but it did not work.
How would you do it?

Simplest is to compare all the filtered columns at once to get a boolean DataFrame, then add DataFrame.any:
(df[["col1", "col2"]] == 1).any(axis=1)
Your solution can be rewritten with np.logical_or.reduce:
np.logical_or.reduce([df[c] == 1 for c in ["col1", "col2"]])
Or a bit overcomplicated:
pd.concat([df[c] == 1 for c in ["col1", "col2"]], axis=1).any(axis=1)

As was already explained in the comments, the built-in any function implicitly tries (and fails) to convert each Series to a bool.
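To see that failure in isolation, here is a minimal sketch (toy data assumed) of what the built-in any ends up doing:
import pandas as pd
s = pd.Series([1, 0])
# any() calls bool() on each element of the list; bool() on a
# multi-element Series is exactly what raises the error.
try:
    any([s == 1, s == 1])
except ValueError as err:
    print(err)  # The truth value of a Series is ambiguous. ...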
If you want something similar to your second code snippet, you can use NumPy's any function, which accepts an axis argument.
import numpy as np
np.any([df[c] == 1 for c in ["col1", "col2"]], axis=0)
Alternatively, you could also extend your first code snippet to more columns by using reduce
In [6]: import functools
In [7]: functools.reduce(lambda a, b: a | b, [(df[c] == 1) for c in ['col1', 'col2']])
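As a quick sanity check across the answers above, a small sketch (data made up) confirming that the filtered comparison, np.logical_or.reduce, and the functools.reduce version all match the original | mask:
import functools
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 2], 'col2': [0, 1, 3]})

mask_or = (df['col1'] == 1) | (df['col2'] == 1)
mask_any = (df[['col1', 'col2']] == 1).any(axis=1)
mask_np = pd.Series(np.logical_or.reduce([df[c] == 1 for c in ['col1', 'col2']]))
mask_reduce = functools.reduce(lambda a, b: a | b, [(df[c] == 1) for c in ['col1', 'col2']])

print(mask_or.equals(mask_any), mask_or.equals(mask_np), mask_or.equals(mask_reduce))  # True True True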

Related

Apply Pandas series string function to the whole dataframe

I want to apply the method pd.Series.str.join() to my whole dataframe
A B
[foo,bar] [1,2]
[bar,foo] [3,4]
Desired output:
A B
foobar 12
barfoo 34
For now I used a quite slow method:
a = [df[x].str.join('') for x in df.columns]
I tried
df.apply(pd.Series.str.join)
and
df.agg(pd.Series.str.join)
and
df.applymap(str.join)
but none of them seem to work. As an extension of the question, how can I efficiently apply a Series method to the whole DataFrame?
Thank you.
There will always be a problem when trying to join lists that contain numeric values, which is why I suggest we first turn them into strings. Afterwards, we can solve it with a nested list comprehension:
df = pd.DataFrame({'A':[['Foo','Bar'],['Bar','Foo']],'B':[[1,2],[3,4]]})
df['B'] = df['B'].map(lambda x: [str(i) for i in x])
df_new = pd.DataFrame([[''.join(x) for x in df[i]] for i in df],index=df.columns).T
Which correctly outputs:
A B
FooBar 12
BarFoo 34
import pandas as pd
df=pd.DataFrame({'A':[['foo','bar'],['bar','foo']],'B':[[1,2],[3,4]]})
# If 'B' contains lists of integers; otherwise the step below can be ignored
df['B']=df['B'].transform(lambda value: [str(x) for x in value])
df=df.applymap(lambda value:''.join(value))
Explanation: applymap() applies a function to each value of your DataFrame.
I came up with this solution:
df_sum = df_sum.stack().str.join('').unstack()
I have quite a big DataFrame, so a for loop is not really scalable.
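For reference, a minimal sketch (using the example frame from the question, and assuming column B is first converted to lists of strings) of how the stack/str.join/unstack idea plays out:
import pandas as pd

df = pd.DataFrame({'A': [['foo', 'bar'], ['bar', 'foo']], 'B': [[1, 2], [3, 4]]})
df['B'] = df['B'].map(lambda x: [str(i) for i in x])  # join needs strings

out = df.stack().str.join('').unstack()
print(out)
#         A   B
# 0  foobar  12
# 1  barfoo  34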

Fastest method of finding and replacing row-specific data in a pandas DataFrame

Given an example pandas DataFrame:
Index | sometext | a   | ff
0     | 'asdff'  | 'b' | 'g'
1     | 'asdff'  | 'c' | 'hh'
2     | 'aaf'    | 'd' | 'i'
What would be the fastest way to replace all instances of the column names in the [sometext] field with the data in that column, where the values to replace are row-specific?
i.e. the desired result from the above input would be:
Index | sometext | a   | ff
0     | 'bsdg'   | 'b' | 'g'
1     | 'csdhh'  | 'c' | 'hh'
2     | 'ddf'    | 'd' | 'i'
note: there is no chance the replacement values would include column names.
I have tried iterating over the rows, but the execution time blows out as the length of the DataFrame and the number of replacement columns increase.
The Series.str.replace method also takes only a single replacement value, so it would need to be run over each row.
We can do this ..
df.apply(lambda x : pd.Series(x['sometext']).replace({'a':x['a'],'ff':x['ff']},regex=True),1)
Out[773]:
0
0 bsdg
1 csdhh
2 ddf
This way seems quite fast. See below for a brief discussion.
import re
df['new'] = df['sometext']
for v in ['a','ff']:
    df['new'] = df.apply(lambda x: re.sub(v, x[v], x['new']), axis=1)
Results:
  sometext  a  ff    new
0    asdff  b   g   bsdg
1    asdff  c  hh  csdhh
2      aaf  d   i    ddf
Discussion:
I expanded the sample to 15,000 rows and this was the fastest approach by around 10x or more compared to the existing answers (although I suspect there might be even faster ways).
The fact that you want to use the columns to make row-specific substitutions is what complicates this answer (otherwise you would just do a simpler version of #wen's df.replace). As it is, that simple and fast approach requires further code in both my approach and #wen's, although I think they are working in more or less the same way.
I have the following:
d = {'sometext': ['asdff', 'asdff', 'aaf'], 'a': ['b', 'c', 'd'], 'ff':['g', 'hh', 'i']}
df = pd.DataFrame(data=d)
import timeit

start = timeit.default_timer()

def replace_single_string(row_label, original_column, final_column):
    result_1 = df.at[row_label, original_column]
    result_2 = df.at[row_label, final_column]
    if 'a' in result_1:
        df.at[row_label, original_column] = result_1.replace('a', result_2)
    return df

for i in df.index.values:
    df = replace_single_string(i, 'sometext', 'a')

print(df)
end = timeit.default_timer()
print(end - start)
This ran in 0.000404119491577 seconds in Terminal.
The fastest method I found was to use the apply function in tandem with a replacer function that uses the basic str.replace() method. It's very fast, even with a for loop inside it, and it also allows for a dynamic number of columns:
def value_replacement(df_to_replace, replace_col):
    """ replace the <replace_col> column of a dataframe with the values in all other columns """
    cols = [col for col in df_to_replace.columns if col != replace_col]

    def replacer(rep_df):
        """ function to be used in the apply function """
        for col in cols:
            rep_df[replace_col] = \
                str(rep_df[replace_col]).replace(col.lower(), str(rep_df[col]))
        return rep_df[replace_col]

    df_to_replace[replace_col] = df_to_replace.apply(replacer, axis=1)
    return df_to_replace
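For reference, a short usage sketch (sample data taken from the question) of the value_replacement function above:
import pandas as pd

df = pd.DataFrame({'sometext': ['asdff', 'asdff', 'aaf'],
                   'a': ['b', 'c', 'd'],
                   'ff': ['g', 'hh', 'i']})

print(value_replacement(df, 'sometext'))
#   sometext  a  ff
# 0     bsdg  b   g
# 1    csdhh  c  hh
# 2      ddf  d   i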

How to row-wise concatenate several columns containing strings?

I have a specific series of datasets which come in the following general form:
import pandas as pd
import random
df = pd.DataFrame({'n': random.sample(range(1000), 3), 't0':['a', 'b', 'c'], 't1':['d','e','f'], 't2':['g','h','i'], 't3':['i','j', 'k']})
The number of tn columns (t0, t1, t2 ... tn) varies depending on the dataset, but is always <30.
My aim is to merge the content of the tn columns for each row so that I achieve this result (note that for readability I need to keep the whitespace between elements):
df['result'] = df.t0 +' '+df.t1+' '+df.t2+' '+ df.t3
So far so good. This code may be simple but it becomes clumsy and inflexible as soon as I receive another dataset, where the number of tn columns goes up. This is where my question comes in:
Is there any other syntax to merge the content across multiple columns? Something agnostic to the number of columns, akin to:
df['result'] = ' '.join(df.ix[:,1:])
Basically, I want to achieve the same as the OP in the link below, but with whitespace between the strings:
Concatenate row-wise across specific columns of dataframe
The key to operating on columns (Series) of strings en masse is the Series.str accessor.
I can think of two .str methods to do what you want.
str.cat()
The first is str.cat. You have to start from a series, but you can pass a list of series (unfortunately you can't pass a dataframe) to concatenate with an optional separator. Using your example:
column_names = df.columns[1:]  # skipping the first, numeric, column
series_list = [df[c] for c in column_names]
# concatenate:
df['result'] = series_list[0].str.cat(series_list[1:], sep=' ')
Or, in one line:
df['result'] = df[df.columns[1]].str.cat([df[c] for c in df.columns[2:]], sep=' ')
str.join()
The second is the .str.join() method, which works like the standard Python str.join(), but for which you need a column (Series) of iterables, for example a column of tuples, which we can get by applying tuple row-wise to a sub-dataframe of the columns you're interested in:
tuple_series = df[column_names].apply(tuple, axis=1)
df['result'] = tuple_series.str.join(' ')
Or, in one line:
df['result'] = df[df.columns[1:]].apply(tuple, axis=1).str.join(' ')
BTW, don't try the above with list instead of tuple. As of pandas-0.20.1, if the function passed into the DataFrame.apply() method returns a list and the returned list has the same number of entries as the columns of the original (sub)dataframe, DataFrame.apply() returns a DataFrame instead of a Series.
Other than using apply to concatenate the strings, you can also use agg to do so.
df[df.columns[1:]].agg(' '.join, axis=1)
Out[118]:
0 a d g i
1 b e h j
2 c f i k
dtype: object
Here is a slightly alternative solution:
In [57]: df['result'] = df.filter(regex=r'^t').apply(lambda x: x.add(' ')).sum(axis=1).str.strip()
In [58]: df
Out[58]:
n t0 t1 t2 t3 result
0 92 a d g i a d g i
1 916 b e h j b e h j
2 363 c f i k c f i k

Select everything but a list of columns from pandas dataframe

Is it possible to select the negation of a given list of columns from a pandas DataFrame? For instance, say I have the following DataFrame:
T1_V2  T1_V3  T1_V4  T1_V5  T1_V6  T1_V7  T1_V8
    1     15      3      2      N      B      N
    4     16     14      5      H      B      N
    1     10     10      5      N      K      N
and I want to get all columns except column T1_V6. I would normally do that this way:
df = df[["T1_V2","T1_V3","T1_V4","T1_V5","T1_V7","T1_V8"]]
My question is whether there is a way to do this the other way around, something like this:
df = df[!["T1_V6"]]
Do:
df[df.columns.difference(["T1_V6"])]
Notes from comments:
This will sort the columns. If you don't want to sort, call difference with sort=False.
difference won't raise an error if the dropped column name doesn't exist. If you want an error when the column doesn't exist, use drop as suggested in other answers: df.drop(["T1_V6"], axis=1)
For completeness, you can also easily use drop for this:
df.drop(["T1_V6"], axis=1)
Another way to exclude columns that you don't want:
df[df.columns[~df.columns.isin(['T1_V6'])]]
I would suggest using DataFrame.drop():
columns_to_exclude = ['T1_V6']
old_dataframe = ...  # the DataFrame with all columns
new_dataframe = old_dataframe.drop(columns_to_exclude, axis=1)
You could use inplace to make changes to the original dataframe itself
old_dataframe.drop(columns_to_exclude, axis = 1, inplace = True)
#old_dataframe is changed
You can use a list comprehension to keep everything except that column:
df[[col for col in df.columns if col != 'T1_V6']]
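For a quick comparison, here is a small sketch (with made-up values) applying the main options above to a tiny frame; note that difference sorts the columns by default, while drop and the list comprehension preserve the original order:
import pandas as pd

df = pd.DataFrame({'T1_V2': [1, 4], 'T1_V6': ['N', 'H'], 'T1_V7': ['B', 'B']})

print(df[df.columns.difference(['T1_V6'])].columns.tolist())               # ['T1_V2', 'T1_V7']
print(df.drop(['T1_V6'], axis=1).columns.tolist())                         # ['T1_V2', 'T1_V7']
print(df[[col for col in df.columns if col != 'T1_V6']].columns.tolist())  # ['T1_V2', 'T1_V7']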

Filter string data based on its string length

I would like to filter out data whose string length is not equal to 10.
To filter out any row whose column A's or B's string length is not equal to 10, I tried this:
df=pd.read_csv('filex.csv')
df.A=df.A.apply(lambda x: x if len(x)== 10 else np.nan)
df.B=df.B.apply(lambda x: x if len(x)== 10 else np.nan)
df=df.dropna(subset=['A','B'], how='any')
This is slow, but it works.
However, it sometimes produces an error when the data in A is not a string but a number (interpreted as a number when read_csv reads the input file):
File "<stdin>", line 1, in <lambda>
TypeError: object of type 'float' has no len()
I believe there should be a more efficient and elegant way to write this.
Based on the answers and comments below, the simplest solutions I found are:
df=df[df.A.apply(lambda x: len(str(x))==10)]
df=df[df.B.apply(lambda x: len(str(x))==10)]
or
df=df[(df.A.apply(lambda x: len(str(x))==10)) & (df.B.apply(lambda x: len(str(x))==10))]
or
df=df[(df.A.astype(str).str.len()==10) & (df.B.astype(str).str.len()==10)]
import pandas as pd
df = pd.read_csv('filex.csv')
df['A'] = df['A'].astype('str')
df['B'] = df['B'].astype('str')
mask = (df['A'].str.len() == 10) & (df['B'].str.len() == 10)
df = df.loc[mask]
print(df)
Applied to filex.csv:
A,B
123,abc
1234,abcd
1234567890,abcdefghij
the code above prints
A B
2 1234567890 abcdefghij
A more Pythonic way of filtering out rows based on given conditions of other columns and their values:
Assuming a df of:
data = {
"names": ["Alice", "Zac", "Anna", "O"],
"cars": ["Civic", "BMW", "Mitsubishi", "Benz"],
"age": ["1", "4", "2", "0"],
}
df=pd.DataFrame(data)
df:
age cars names
0 1 Civic Alice
1 4 BMW Zac
2 2 Mitsubishi Anna
3 0 Benz O
Then:
df[
df["names"].apply(lambda x: len(x) > 1)
& df["cars"].apply(lambda x: "i" in x)
& df["age"].apply(lambda x: int(x) < 2)
]
We will have :
age cars names
0 1 Civic Alice
In the conditions above, we first look at the length of the strings, then check whether the letter "i" exists in the strings, and finally check the integer value in the age column.
I personally found this way to be the easiest:
df = df[df['column_name'].str.len() == 10]
You can also use query:
df.query('A.str.len() == 10 and B.str.len() == 10')
If you have numbers in rows, they will be converted to floats.
Convert all the rows to strings after importing from csv. For better performance, split those lambdas across multiple threads.
You can use df.apply(len); it will give you the result.
For string operations such as this, vanilla Python using built-in methods (without lambda) is much faster than apply() or str.len().
Building a boolean mask by mapping len to each string inside a list comprehension is approx. 40-70% faster than apply() and str.len() respectively.
For multiple columns, zip() allows you to evaluate values from different columns concurrently.
col_A_len = map(len, df['A'].astype(str))
col_B_len = map(len, df['B'].astype(str))
m = [a==3 and b==3 for a,b in zip(col_A_len, col_B_len)]
df1 = df[m]
For a single column, drop zip() and loop over the column and check if the length is equal to 3:
df2 = df[[a==3 for a in map(len, df['A'].astype(str))]]
This code can be written a little concisely using the Series.map() method (but a little slower than list comprehension due to pandas overhead):
df2 = df[df['A'].astype(str).map(len)==3]
Filter out values other than length 10 from columns A and B; here I pass a lambda expression to the map() function. map() always applies element-wise over a Series object.
df = df[df['A'].map(lambda x: len(str(x)) == 10)]
df = df[df['B'].map(lambda x: len(str(x)) == 10)]
You could use applymap to filter all columns you want at once, followed by the .all() method to filter only the rows where both columns are True.
#The *mask* variable is a dataframe of booleans, giving you True or False for the selected condition
mask = df[['A','B']].applymap(lambda x: len(str(x)) == 10)
#Here you can just use the mask to filter your rows, using the method *.all()* to filter only rows that are all True, but you could also use the *.any()* method for other needs
df = df[mask.all(axis=1)]
