Apply a pandas Series string function to the whole dataframe - python

I want to apply the method pd.Series.str.join() to my whole dataframe:
A          B
[foo,bar]  [1,2]
[bar,foo]  [3,4]
Desired output:
A       B
foobar  12
barfoo  34
For now I have used a rather slow method:
a = [df[x].str.join('') for x in df.columns]
I tried
df.apply(pd.Series.str.join)
and
df.agg(pd.Series.str.join)
and
df.applymap(str.join)
but none of them seem to work. As an extension of the question: how can I efficiently apply a Series method to a whole dataframe?
Thank you.

There will always be a problem when trying to join lists that contain numeric values, which is why I suggest we first turn them into strings. Afterwards, we can solve it with a nested list comprehension:
df = pd.DataFrame({'A':[['Foo','Bar'],['Bar','Foo']],'B':[[1,2],[3,4]]})
df['B'] = df['B'].map(lambda x: [str(i) for i in x])
df_new = pd.DataFrame([[''.join(x) for x in df[i]] for i in df], index=df.columns).T
Which correctly outputs:
A       B
FooBar  12
BarFoo  34

import pandas as pd

df = pd.DataFrame({'A': [['foo','bar'], ['bar','foo']], 'B': [[1,2], [3,4]]})
# If 'B' holds lists of integers; otherwise this step can be skipped
df['B'] = df['B'].transform(lambda value: [str(x) for x in value])
df = df.applymap(lambda value: ''.join(value))
Explanation: applymap() applies a given function to every individual value of the dataframe.
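The conversion and the join can also be folded into a single pass; a minimal sketch, assuming every cell of the original frame is a list:
# Stringify and join each cell in one applymap over the raw frame
df = df.applymap(lambda cell: ''.join(str(item) for item in cell))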

I came up with this solution:
df_sum = df_sum.stack().str.join('').unstack()
I have a fairly big dataframe, so a for loop is not really scalable.
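An end-to-end sketch of this approach, assuming the numeric lists are first converted to strings as in the answers above:
# stack() flattens the frame into a Series, str.join('') joins each list,
# and unstack() restores the original row/column layout
df = df.applymap(lambda cell: [str(item) for item in cell])
df = df.stack().str.join('').unstack()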

Related

Is there a faster way to search every column of a dataframe for a String than with .apply and str.contains?

So basically I have a bunch of dataframes with about 100 columns and 500-3000 rows, filled with different string values. Now I want to search the entire dataframe for, let's say, the string "Airbag" and delete every row that doesn't contain this string. I was able to do this with the following code:
df = df[df.apply(lambda row: row.astype(str).str.contains('Airbag', regex=False).any(), axis=1)]
This works exactly as I want, but it is way too slow. So I tried to find a way to do it with vectorization or a list comprehension, but I wasn't able to, nor could I find any example code on the internet. So my question is whether it is possible to speed this process up or not.
Example Dataframe:
df = pd.DataFrame({'col1': ['Airbag_101', 'Distance_xy', 'Sensor_2'], 'col2': ['String1', 'String2', 'String3'], 'col3': ['Tires', 'Wheel_Airbag', 'Antenna']})
Let's start from this dataframe with random strings and numbers in COLUMN:
import numpy as np
import pandas as pd

np.random.seed(0)
strings = np.apply_along_axis(''.join, 1, np.random.choice(list('ABCD'), size=(100, 5)))
junk = list(range(10))
col = list(strings) + junk
np.random.shuffle(col)
df = pd.DataFrame({'COLUMN': col})
>>> df.head()
  COLUMN
0  BBCAA
1      6
2  ADDDA
3  DCABB
4  ADABC
You can simply apply pandas.Series.str.contains. You need to use fillna to account for the non-string elements:
>>> df[df['COLUMN'].str.contains('ABC').fillna(False)]
    COLUMN
4    ADABC
31   BDABC
40   BABCB
88   AABCA
101  ABCBB
Testing all columns:
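One way to extend this to every column (a sketch, reusing the original question's dataframe and search term) is to build a per-column mask and keep the rows where any column matches:
# str.contains runs vectorized once per column; astype(str) guards
# against non-string cells, and any(axis=1) collapses to a row mask
mask = df.apply(lambda col: col.astype(str).str.contains('Airbag', regex=False))
df[mask.any(axis=1)]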
Here is an alternative using a good old custom function. One might expect it to be slower than apply/transform, but it is actually faster when you have a lot of columns and a decent frequency of the searched term (tested on the example dataframe, a 3x3 frame with no match, and 3x3000 frames with and without matches):
def has_match(series):
    for s in series:
        if 'Airbag' in s:
            return True
    return False

df[df.apply(has_match, axis=1)]
Update (exact match)
Since it looks like you actually want an exact match, test with eq() instead of str.contains(). Then use boolean indexing with loc:
df.loc[df.eq('Airbag').any(axis=1)]
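For instance, on a small hypothetical frame (the data here is made up, to show the difference from substring matching):
# Only cells equal to 'Airbag' exactly count; 'Wheel_Airbag' does not
df2 = pd.DataFrame({'col1': ['Airbag', 'Distance_xy', 'Sensor_2'],
                    'col2': ['Tires', 'Wheel_Airbag', 'Airbag']})
df2.loc[df2.eq('Airbag').any(axis=1)]  # keeps rows 0 and 2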
Original (substring)
Test for the string with applymap() and turn it into a row mask using any(axis=1):
df[df.applymap(lambda x: 'Airbag' in x).any(axis=1)]
# col1 col2 col3
# 0 Airbag_101 String1 Tires
# 1 Distance_xy String2 Wheel_Airbag
As mozway said, "optimal" depends on the data. These are some timing plots for reference.
[Plot: timings vs. number of rows, with the column count fixed at 3]
[Plot: timings vs. number of columns, with the row count fixed at 3,000]
OK, I was able to speed it up with the help of NumPy arrays, but thanks for the help :D
master_index = []
for column in df.columns:
    np_array = df[column].values
    index = np.where(np_array == 'Airbag')  # row positions of exact matches
    master_index.append(index)
print(df.iloc[master_index[1][0]])
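A fully vectorized variant of the same idea (a sketch, assuming exact matches are wanted) compares all values in one shot:
# Elementwise comparison over the whole values array, then a row mask
mask = (df.values == 'Airbag').any(axis=1)
df[mask]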

Create new column with reverse order using pandas dataframe

I need to create a new column X containing the value of x with its two halves swapped, as shown below.
   x     X
aa01  01aa
bb02  02bb
cc03  03cc
I sliced and concatenated them manually and it worked, but I am looking for a "smarter" way of doing this.
df["X1"] = df["x"].str.slice(0, 2)
df["X2"] = df["x"].str.slice(2, 4)
df["X"] = df["X2"] + df["X1"].map(str)
A faster way would be to use a list comprehension instead of the pandas str functions:
df['X'] = [s[-2:]+s[:2] for s in df.x]
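For comparison, the same swap with vectorized string indexing (a sketch, assuming every value is exactly four characters long):
# .str slices each value element-wise; + concatenates the two halves
df['X'] = df['x'].str[-2:] + df['x'].str[:2]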

List comprehension pandas assignment

How do I use a list comprehension, or any other technique, to refactor the code I have? I'm working on a DataFrame, modifying values in the first example and adding new columns in the second.
Example 1
df['start_d'] = pd.to_datetime(df['start_d'], errors='coerce').dt.strftime('%Y-%b-%d')
df['end_d'] = pd.to_datetime(df['end_d'], errors='coerce').dt.strftime('%Y-%b-%d')
Example 2
df['col1'] = 'NA'
df['col2'] = 'NA'
I'd prefer to avoid using apply, just because it'll increase the number of lines.
I think you simply need a loop, especially if you want to avoid apply and have many columns:
cols = ['start_d', 'end_d']
for c in cols:
    df[c] = pd.to_datetime(df[c], errors='coerce').dt.strftime('%Y-%b-%d')
If you need a list comprehension, concat is necessary because the result is a list of Series:
comp = [pd.to_datetime(df[c], errors='coerce').dt.strftime('%Y-%b-%d') for c in cols]
df = pd.concat(comp, axis=1)
But a solution with apply is still possible:
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x, errors='coerce').dt.strftime('%Y-%b-%d'))
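Example 2 from the question is simpler; a minimal sketch, assuming the constant columns named there, assigns them in one statement:
# assign() with a dict comprehension adds several constant columns at once
df = df.assign(**{c: 'NA' for c in ['col1', 'col2']})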

Create a new pandas dataframe from a python list of lists with a column with lists

I have a python list of lists, e.g. [["chunky","bacon","foxes"],["dr_cham"],["organ","instructor"],...] and would like to create a pandas dataframe with one column containing the lists:
0 ["chunky","bacon","foxes"]
1 ["dr_cham"]
2 ["organ","instructor"] .
. .
The standard constructor (l is the list here)
pd.DataFrame(l)
returns a dataframe with 3 columns in that case.
How would this work? I'm sure it's very simple, but I've been searching for a solution for a while and can't figure it out for some obscure reason.
Thanks,
B.
The following code should achieve what you want:
import pandas as pd
l = [["hello", "goodbye"], ["cat", "dog"]]
# Replace "lists" with whatever you want to name the column
df = pd.DataFrame({"lists": l})
After printing df, we get
lists
0 [hello, goodbye]
1 [cat, dog]
Hope this helps -- let me know if I can clarify anything!
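An equivalent one-liner goes through a Series (a sketch, keeping the same hypothetical column name):
# A Series built from a list of lists keeps each inner list as one cell
df = pd.Series(l, name='lists').to_frame()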
This pattern is rarely, if ever, a good idea and defeats the purpose of pandas, but if you have some unusual use case that requires it, you can achieve it by nesting your inner lists one additional level inside the main list and then creating a DataFrame from that:
l = [[x] for x in l]
df = pd.DataFrame(l)
0
0 [chunky, bacon, foxes]
1 [dr_cham]
2 [organ, instructor]

Select everything but a list of columns from pandas dataframe

Is it possible to select the negation of a given list from a pandas dataframe? For instance, say I have the following dataframe
T1_V2 T1_V3 T1_V4 T1_V5 T1_V6 T1_V7 T1_V8
1 15 3 2 N B N
4 16 14 5 H B N
1 10 10 5 N K N
and I want to get all columns except column T1_V6. I would normally do that this way:
df = df[["T1_V2","T1_V3","T1_V4","T1_V5","T1_V7","T1_V8"]]
My question is whether there is a way to do this the other way around, something like this:
df = df[!["T1_V6"]]
Do:
df[df.columns.difference(["T1_V6"])]
Notes from comments:
This will sort the columns. If you don't want them sorted, call difference with sort=False.
difference won't raise an error if the dropped column name doesn't exist. If you want an error when the column doesn't exist, use drop as suggested in other answers: df.drop(columns=["T1_V6"])
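A quick sketch of the first note, keeping the original column order:
# sort=False leaves the surviving columns in their original order
df[df.columns.difference(['T1_V6'], sort=False)]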
For completeness, you can also easily use drop for this:
df.drop(["T1_V6"], axis=1)
Another way to exclude columns that you don't want:
df[df.columns[~df.columns.isin(['T1_V6'])]]
I would suggest using DataFrame.drop():
columns_to_exclude = ['T1_V6']
# old_dataframe holds all the columns
new_dataframe = old_dataframe.drop(columns_to_exclude, axis=1)
You could use inplace to make the changes on the original dataframe itself:
old_dataframe.drop(columns_to_exclude, axis=1, inplace=True)
# old_dataframe is changed
You can use a list comprehension to build the column selection:
df[[col for col in df.columns if col != 'T1_V6']]
