Run functions for each row in dataframe - python

I have a dataframe df1 like this:
date        sentence
29/03/2019  I like you
30/03/2019  You eat cake
and I want to run the functions getVerb and getObj on df1, so that the output looks like this:
date        sentence      verb  object
29/03/2019  I like you    like  you
30/03/2019  You eat cake  eat   cake
I want those functions (getVerb and getObj) to run for each row in df1. Could someone help me solve this efficiently?
Thank you so much.

Each column of a pandas DataFrame is a Series. You can use the Series.apply or Series.map functions to get the result you want.
df1['verb'] = df1['sentence'].apply(getVerb)
df1['object'] = df1['sentence'].apply(getObj)
# OR
df1['verb'] = df1['sentence'].map(getVerb)
df1['object'] = df1['sentence'].map(getObj)
See the pandas documentation for more details on Series.apply or Series.map.
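For a self-contained illustration, here is a minimal sketch; getVerb and getObj below are hypothetical stand-ins that simply split the sentence, so substitute your real implementations:
import pandas as pd

df1 = pd.DataFrame({'date': ['29/03/2019', '30/03/2019'],
                    'sentence': ['I like you', 'You eat cake']})

# Hypothetical helpers: take the second word as the verb and the rest as the object.
def getVerb(sentence):
    return sentence.split()[1]

def getObj(sentence):
    return ' '.join(sentence.split()[2:])

df1['verb'] = df1['sentence'].apply(getVerb)
df1['object'] = df1['sentence'].apply(getObj)
print(df1)
Series.apply calls the function once per element, which is exactly the per-row behaviour you want here.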

Assume you have a pandas dataframe such as:
import pandas as pd
import numpy as np

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
>>> df
   A  B
0  4  9
1  4  9
2  4  9
Let's say we want the sum of columns A and B, row-wise and column-wise. To accomplish that, we write:
df.apply(np.sum, axis=1)  # row-wise sum
Output:
0    13
1    13
2    13
dtype: int64
df.apply(np.sum, axis=0)  # column-wise sum
Output:
A    12
B    27
dtype: int64
Now, if you want to apply a function to a specific set of columns, select a subset of the DataFrame first.
For example, to compute the sum over column A only:
df['A'].sum()
# or, sticking with apply:
df[['A']].apply(np.sum)
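If you need a custom row-wise computation over several specific columns, the same pattern works: subset first, then apply with axis=1. A small sketch using the same frame:
# Row-wise sum over just A and B (same as the full frame here, since there are only two columns).
df[['A', 'B']].apply(np.sum, axis=1)

# A custom row-wise function over the subset.
df[['A', 'B']].apply(lambda row: row['A'] * row['B'], axis=1)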
DataFrame.apply
You may refer to the DataFrame.apply documentation linked above as well. Other than that, Series.map and Series.apply can be handy too, as mentioned in the answer above.
Cheers!

Using a simple loop (this assumes the 'verb' and 'object' columns already exist in the data frame):
for index, row in df1.iterrows():
    df1.loc[index, 'verb'] = getVerb(row['sentence'])
    df1.loc[index, 'object'] = getObj(row['sentence'])

Related

Pandas split DataFrame according to indices

I've been working on a pandas DataFrame,
df = pd.DataFrame({'col':[-0.217514, -0.217834, 0.844116, 0.800125, 0.824554]}, index=[49082, 49083, 49853, 49854, 49855])
and the data looks like this:
            col
49082 -0.217514
49083 -0.217834
49853  0.844116
49854  0.800125
49855  0.824554
As you can see, the index suddenly jumps by 770 (due to a sort I did earlier).
Now I would like to split this DataFrame into several smaller ones, each made of rows whose indices are consecutive (here the first two rows would go into one DataFrame and the last three into another).
Does anyone have an idea as to how to do this?
Thanks!
Group by the index minus a sequence that increases by 1, then collect each group as a separate DataFrame in a list:
import numpy as np

all_dfs = [g for _, g in df.groupby(df.index - np.arange(len(df.index)))]
all_dfs
output:
[             col
 49082 -0.217514
 49083 -0.217834,
              col
 49853  0.844116
 49854  0.800125
 49855  0.824554]
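To see why this works, look at the grouping key itself: subtracting a 0, 1, 2, ... counter from consecutive index labels gives the same constant within each run, so each run of consecutive labels becomes its own group:
>>> list(df.index - np.arange(len(df.index)))
[49082, 49082, 49851, 49851, 49851]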

Is there a faster way to search every column of a dataframe for a String than with .apply and str.contains?

So basically I have a bunch of dataframes with about 100 columns and 500-3000 rows, filled with different string values. Now I want to search the entire dataframe for, let's say, the string "Airbag" and delete every row that doesn't contain this string. I was able to do this with the following code:
df = df[df.apply(lambda row: row.astype(str).str.contains('Airbag', regex=False).any(), axis=1)]
This works exactly like I want, but it is way too slow. So I tried finding a way to do it with vectorization or a list comprehension, but I wasn't able to, nor could I find any example code online. So my question is whether it is possible to speed this process up.
Example Dataframe:
df = pd.DataFrame({'col1': ['Airbag_101', 'Distance_xy', 'Sensor_2'], 'col2': ['String1', 'String2', 'String3'], 'col3': ['Tires', 'Wheel_Airbag', 'Antenna']})
Let's start from this dataframe with random strings and numbers in COLUMN:
import pandas as pd
import numpy as np
np.random.seed(0)
strings = np.apply_along_axis(''.join, 1, np.random.choice(list('ABCD'), size=(100, 5)))
junk = list(range(10))
col = list(strings)+junk
np.random.shuffle(col)
df = pd.DataFrame({'COLUMN': col})
>>> df.head()
  COLUMN
0  BBCAA
1      6
2  ADDDA
3  DCABB
4  ADABC
You can simply apply pandas.Series.str.contains. You need to use fillna to account for the non-string elements:
>>> df[df['COLUMN'].str.contains('ABC').fillna(False)]
    COLUMN
4    ADABC
31   BDABC
40   BABCB
88   AABCA
101  ABCBB
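Alternatively, str.contains accepts an na argument, so the fillna step can be folded in:
df[df['COLUMN'].str.contains('ABC', na=False)]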
Testing all columns:
Here is an alternative using a good old custom function. One might think it would be slower than apply/transform, but it is actually faster when you have a lot of columns and a decent frequency of the searched term (tested on the example dataframe, a 3x3 with no match, and 3x3000 dataframes with and without matches):
def has_match(series):
    for s in series:
        if 'Airbag' in s:
            return True
    return False

df[df.apply(has_match, axis=1)]
Update (exact match)
Since it looks like you actually want an exact match, test with eq() instead of str.contains(). Then use boolean indexing with loc:
df.loc[df.eq('Airbag').any(axis=1)]
Original (substring)
Test for the string with applymap() and turn it into a row mask using any(axis=1):
df[df.applymap(lambda x: 'Airbag' in x).any(axis=1)]

#           col1     col2          col3
# 0   Airbag_101  String1         Tries
# 1  Distance_xy  String2  Wheel_Airbag
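On pandas 2.1 or newer, applymap has been renamed to DataFrame.map; the same idea reads:
df[df.map(lambda x: 'Airbag' in x).any(axis=1)]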
As mozway said, "optimal" depends on the data. These are some timing plots for reference.
[Timing plot: runtime vs. number of rows (fixed at 3 columns)]
[Timing plot: runtime vs. number of columns (fixed at 3,000 rows)]
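If you want to reproduce rough timings yourself, here is a minimal sketch with timeit; the frame shape and repetition count are arbitrary assumptions, not the setup behind the plots above:
import timeit
import pandas as pd

# Hypothetical benchmark frame: 3,000 rows x 3 columns of strings.
bench = pd.DataFrame({'col1': ['Airbag_101', 'Distance_xy', 'Sensor_2'] * 1000,
                      'col2': ['String1', 'String2', 'String3'] * 1000,
                      'col3': ['Tires', 'Wheel_Airbag', 'Antenna'] * 1000})

def with_row_apply():
    mask = bench.apply(lambda row: row.astype(str).str.contains('Airbag', regex=False).any(), axis=1)
    return bench[mask]

def with_applymap():
    return bench[bench.applymap(lambda x: 'Airbag' in x).any(axis=1)]

for fn in (with_row_apply, with_applymap):
    print(fn.__name__, timeit.timeit(fn, number=10))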
OK, I was able to speed it up with the help of numpy arrays, but thanks for the help :D
master_index = []
for column in df.columns:
    np_array = df[column].values
    index = np.where(np_array == 'Airbag')
    master_index.append(index)

print(df.iloc[master_index[1][0]])
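If you need a substring match rather than the exact match above, a hedged alternative is to vectorize over the whole frame with numpy's string routines, assuming every cell is (or can be cast to) a string:
import numpy as np

# Cast the frame to a string array; np.char.find returns -1 where the substring is absent.
cells = df.to_numpy().astype(str)
mask = np.char.find(cells, 'Airbag') >= 0

# Keep the rows where any column matched.
print(df[mask.any(axis=1)])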

Why does referencing a concatenated pandas dataframe return multiple entries?

When I create a dataframe using concat like this:
import pandas as pd
dfa = pd.DataFrame({'a':[1],'b':[2]})
dfb = pd.DataFrame({'a':[3],'b':[4]})
dfc = pd.concat([dfa,dfb])
And when I try to reference it like I would any other DataFrame, I get the following result:
>>> dfc['a'][0]
0    1
0    3
Name: a, dtype: int64
I would expect my concatenated DataFrame to behave like a normal DataFrame and return the integer that I want like this simple DataFrame does:
>>> dfa['a'][0]
1
I am just a beginner, is there a simple explanation for why the same call is returning an entire DataFrame and not the single entry that I want? Or, even better, an easy way to get my concatenated DataFrame to respond like a normal DataFrame when I try to reference it? Or should I be using something other than concat?
You've mistaken what the normal behavior is. dfc['a'][0] is a label lookup: it matches everything with an index label of 0, and there are two such rows because you concatenated two dataframes whose indexes both include 0.
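You can see the duplicated labels directly (the exact repr varies a bit by pandas version):
>>> dfc.index
Index([0, 0], dtype='int64')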
To select position 0 instead, use iloc:
dfc['a'].iloc[0]
Or you could have constructed dfc like this:
dfc = pd.concat([dfa,dfb], ignore_index=True)
dfc['a'][0]
Both return
1
EDITED (thx piRSquared's comment)
Use append() instead of pd.concat() (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat with ignore_index=True is the longer-term option):
dfc = dfa.append(dfb, ignore_index=True)
dfc['a'][0]
1

pandas appending df1 to df2 get 0s/NaNs in result

I have 2 dataframes. df1 comprises a Series of values.
df1 = pd.DataFrame({'winnings': cumsums_winnings_s, 'returns':cumsums_returns_s, 'spent': cumsums_spent_s, 'runs': cumsums_runs_s, 'wins': cumsums_wins_s, 'expected': cumsums_expected_s}, columns=["winnings", "returns", "runs", "wins", "expected"])
df2 runs each row through a function that takes 3 columns and produces a result for each row, specialSauce:
df2= pd.DataFrame(list(map(lambda w,r,e: doStuff(w,r,e), df1['wins'], df1['runs'], df1['expected'])), columns=["specialSauce"])
print(df2.append(df1))
produces all the df1 columns but NaN for the df2 column (and vice versa if df1 and df2 are switched in the append).
So my problem is how to append these 2 dataframes correctly.
As I understand things, your issue seems to be related to the fact that you get NaN's in the result DataFrame.
The reason for this is that you are trying to .append() one dataframe to the other while they don't have the same columns.
df2 has one extra column, the one created with apply() and doStuff, while df1 does not have that column. When trying to append one pd.DataFrame to the other, the result will have all columns of both pd.DataFrame objects. Naturally, you will have some NaN's for ['specialSauce'], since this column does not exist in df1.
This would be the same if you were to use pd.concat(), both methods do the same thing in this case. The one thing that you could do to bring the result closer to your desired result is use the ignore_index flag like this:
>> df2.append(df1, ignore_index=True)
This would at least give you a 'fresh' index for the result pd.DataFrame.
EDIT
If what you're looking for is to "append" the result of doStuff to the end of your existing df, in the form of a new column (['specialSauce']), then what you'll have to do is use pd.concat() like this:
>> pd.concat([df1, df2], axis=1)
This will return the result pd.DataFrame as you want it.
If you had a pd.Series to add to the columns of df1 then you'd need to add it like this:
>> df1['specialSauce'] = <'specialSauce values'>
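As a minimal sketch of both routes, with hypothetical toy values standing in for your real data:
import pandas as pd

df1 = pd.DataFrame({'wins': [1, 2], 'runs': [3, 4], 'expected': [5, 6]})
df2 = pd.DataFrame({'specialSauce': [9, 12]})

# Route 1: side-by-side concatenation on the column axis.
print(pd.concat([df1, df2], axis=1))

# Route 2: assign the Series as a new column directly (indexes must align).
df1['specialSauce'] = df2['specialSauce']
print(df1)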
I hope that helps, if not please rephrase the description of what you're after.
Ok, there are a couple of things going on here. You've left code out and I had to fill in the gaps. For example you did not define doStuff, so I had to.
doStuff = lambda w, r, e: w + r + e
With that defined, your code does not run. I had to guess what you were trying to do. I'm guessing that you want to have an additional column called 'specialSauce' adjacent to your other columns.
So, this is how I set it up and solved the problem.
Setup and Solution
import pandas as pd
import numpy as np
np.random.seed(314)
df = pd.DataFrame(np.random.randn(100, 6),
                  columns=["winnings", "returns",
                           "spent", "runs",
                           "wins", "expected"]).cumsum()
doStuff = lambda w, r, e: w + r + e
df['specialSauce'] = df[['wins', 'runs', 'expected']].apply(lambda x: doStuff(*x), axis=1)
print(df.head())
   winnings   returns     spent      runs      wins  expected  specialSauce
0  0.166085  0.781964  0.852285 -0.707071 -0.931657  0.886661     -0.752067
1 -0.055704  1.163688  0.079710  0.155916 -1.212917 -0.045265     -1.102266
2 -0.554241  1.928014  0.271214 -0.462848  0.452802  1.692924      1.682878
3  0.627985  3.047389 -1.594841 -1.099262 -0.308115  4.356977      2.949601
4  0.796156  3.228755 -0.273482 -0.661442 -0.111355  2.827409      2.054611
Also
You tried to use pd.DataFrame.append(). Per the linked documentation, it attaches the DataFrame specified as the argument to the end of the DataFrame being appended to. What you wanted here is pd.concat() with axis=1.

Select everything but a list of columns from pandas dataframe

Is it possible to select the negation of a given list of columns from a pandas dataframe? For instance, say I have the following dataframe:
T1_V2  T1_V3  T1_V4  T1_V5  T1_V6  T1_V7  T1_V8
    1     15      3      2      N      B      N
    4     16     14      5      H      B      N
    1     10     10      5      N      K      N
and I want to get all columns except T1_V6. I would normally do it this way:
df = df[["T1_V2","T1_V3","T1_V4","T1_V5","T1_V7","T1_V8"]]
My question is whether there is a way to do this the other way around, something like this:
df = df[!["T1_V6"]]
Do:
df[df.columns.difference(["T1_V6"])]
Notes from comments:
This will sort the columns. If you don't want that, call difference with sort=False (see the sketch after these notes).
difference won't raise an error if the dropped column name doesn't exist. If you do want an error when the column doesn't exist, use drop as suggested in the other answers: df.drop(columns=["T1_V6"])
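For example, a minimal sketch that keeps the original column order (assuming a pandas version where Index.difference accepts the sort keyword):
cols = df.columns.difference(["T1_V6"], sort=False)
df[cols]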
For completeness, you can also easily use drop for this:
df.drop(["T1_V6"], axis=1)
Another way to exclude columns that you don't want:
df[df.columns[~df.columns.isin(['T1_V6'])]]
I would suggest using DataFrame.drop():
columns_to_exclude = ['T1_V6']
old_dataframe = ...  # has all the columns
new_dataframe = old_dataframe.drop(columns_to_exclude, axis=1)
You could use inplace to modify the original dataframe itself:
old_dataframe.drop(columns_to_exclude, axis=1, inplace=True)
# old_dataframe is changed
You can use a list comprehension to build the list of columns to keep:
df[[col for col in df.columns if col != 'T1_V6']]
