How to apply a function to selected rows of a dataframe - python

I want to apply a regex function to selected rows in a dataframe. My solution works, but the code is terribly long and I wonder whether there is a better, faster, and more elegant way to solve this problem.
In words: I want the regex to be applied to the elements of the source_value column, but only in rows where source_type == 'rhombus' AND rhombus_refer_to_odk_type is 'integer' OR 'decimal'.
The code:
df_arrows.loc[(df_arrows['source_type']=='rhombus') & ((df_arrows['rhombus_refer_to_odk_type']=='integer') | (df_arrows['rhombus_refer_to_odk_type']=='decimal')),'source_value'] = \
    df_arrows.loc[(df_arrows['source_type']=='rhombus') & ((df_arrows['rhombus_refer_to_odk_type']=='integer') | (df_arrows['rhombus_refer_to_odk_type']=='decimal')),'source_value'].apply(lambda x: re.sub(r'^[^<=>]+','', str(x)))

Use Series.isin with condition in variable m and for replace use Series.str.replace:
m = ((df_arrows['source_type']=='rhombus') &
     df_arrows['rhombus_refer_to_odk_type'].isin(['integer','decimal']))
df_arrows.loc[m,'source_value'] = df_arrows.loc[m,'source_value'].astype(str).str.replace(r'^[^<=>]+','', regex=True)
EDIT: If the mask is 2-dimensional, the likely problem is duplicated column names; you can test it:
print(df_arrows['source_type']=='rhombus')
print(df_arrows['rhombus_refer_to_odk_type'].isin(['integer','decimal']))
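For completeness, here is a minimal, self-contained sketch of that approach on made-up data (the column values below are assumptions, not taken from the question); regex=True is needed on pandas 2.0+, where str.replace stopped treating the pattern as a regex by default:
import pandas as pd

df_arrows = pd.DataFrame({
    'source_type': ['rhombus', 'rhombus', 'oval'],
    'rhombus_refer_to_odk_type': ['integer', 'text', 'decimal'],
    'source_value': ['age >= 18', 'name', 'weight < 3.5'],
})

m = ((df_arrows['source_type']=='rhombus') &
     df_arrows['rhombus_refer_to_odk_type'].isin(['integer','decimal']))

df_arrows.loc[m,'source_value'] = df_arrows.loc[m,'source_value'].astype(str).str.replace(r'^[^<=>]+','', regex=True)
print(df_arrows['source_value'].tolist())
# only the first row matches both conditions, so it becomes '>= 18'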

Related

How to filter on a pandas dataframe using contains against a list of columns, if I don't know which columns are present?

I want to filter my dataframe to look for columns containing a known string.
I know you can do something like this:
summ_proc = summ_proc[
summ_proc['data.process.name'].str.contains(indicator) |
summ_proc['data.win.eventdata.processName'].str.contains(indicator) |
summ_proc['data.win.eventdata.logonProcessName'].str.contains(indicator) |
summ_proc['syscheck.audit.process.name'].str.contains(indicator)
]
where I'm using the | operator to check against multiple columns. But there are cases where a certain column name isn't present. So 'data.process.name' might not be present every time.
I tried the following implementation:
summ_proc[summ_proc.apply(lambda x: summ_proc['data.process.name'].str.contains(indicator) if 'data.process.name' in summ_proc.columns else summ_proc)]
And that works. But I'm not sure how I can apply the OR operator to this lambda function.
I want all the rows where either data.process.name or data.win.eventdata.processName or data.win.eventdata.logonProcessName or syscheck.audit.process.name contains the indicator.
EDIT:
I tried the following approach, where I created individual frames and concated all the frames.
summ_proc1 = summ_proc[summ_proc.apply(lambda x: summ_proc['data.process.name'].str.contains(indicator) if 'data.process.name' in summ_proc.columns else summ_proc)]
summ_proc2 = summ_proc[summ_proc.apply(lambda x: summ_proc['data.win.eventdata.processName'].str.contains(indicator) if 'data.win.eventdata.processName' in summ_proc.columns else summ_proc)]
summ_proc3 = summ_proc[summ_proc.apply(lambda x: summ_proc['data.win.eventdata.logonProcessName'].str.contains(indicator) if 'data.win.eventdata.logonProcessName' in summ_proc.columns else summ_proc)]
frames = [summ_proc1, summ_proc2, summ_proc3]
result = pd.concat(frames)
This works, but I'm curious if there's a better, more pythonic approach? Or if this current method will cause more downstream issues?
Something like this should work:
import numpy as np
columns = ['data.process.name', 'data.win.eventdata.processName']
# filter columns that are in summ_proc
available_columns = [c for c in columns if c in summ_proc.columns]
# array of Boolean values indicating if c contains indicator
ss = [summ_proc[c].str.contains(indicator) for c in available_columns]
# reduce without '|' by using 'np.logical_or'
indexer = np.logical_or.reduce(ss)
result = summ_proc[indexer]
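As a quick check, here is the same idea on a tiny made-up frame in which two of the listed columns are absent (the data and the indicator value are assumptions); passing na=False to str.contains keeps missing values from breaking the boolean indexer:
import numpy as np
import pandas as pd

indicator = 'evil'
summ_proc = pd.DataFrame({
    'data.process.name': ['evil.exe', 'calc.exe', None],
    'syscheck.audit.process.name': ['svchost.exe', 'evil.dll', 'notepad.exe'],
})

columns = ['data.process.name', 'data.win.eventdata.processName',
           'data.win.eventdata.logonProcessName', 'syscheck.audit.process.name']
available_columns = [c for c in columns if c in summ_proc.columns]

ss = [summ_proc[c].str.contains(indicator, na=False) for c in available_columns]
indexer = np.logical_or.reduce(ss)
print(summ_proc[indexer])   # rows 0 and 1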

Numpy where matching two specific columns

I have a six column matrix. I want to find the row(s) where BOTH columns match the query.
I've been trying to use numpy.where, but I can't specify it to match just two columns.
#Example of the array
x = np.array([[860259, 860328, 861277, 861393, 865534, 865716],
              [860259, 860328, 861301, 861393, 865534, 865716],
              [860259, 860328, 861301, 861393, 871151, 871173]])
print(x)
#Match first column of interest
A = np.where(x[:,2] == 861301)
#Match second column on interest
B = np.where(x[:,3] == 861393)
#rows in both A and B
np.intersect1d(A, B)
#This approach works, but is not column specific for the intersect, leaving me with extra rows I don't want.
#This is the only way I can get Numpy to match the two columns, but
#when I query I will not actually know the values of columns 0,1,4,5.
#So this approach will not work.
#Specify what row should exactly look like
np.where(all([860259, 860328, 861277, 861393, 865534, 865716]))
#I want something like this:
#Where * could be any number. But I think that this approach may be
#inefficient. It would be best to just match column 2 and 3.
np.where(all([*, *, 861277, 861393, *, *]))
I'm looking for an efficient answer, because I am looking through a 150GB HDF5 file.
Thanks for your help!
If I understand you correctly, you can use slightly more advanced slicing, like this:
np.where(np.all(x[:,2:4] == [861277, 861393], axis=1))
This will give you only the rows where those two columns equal [861277, 861393].
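Applied to the example array above, using the query values from the asker's A/B snippet:
import numpy as np

x = np.array([[860259, 860328, 861277, 861393, 865534, 865716],
              [860259, 860328, 861301, 861393, 865534, 865716],
              [860259, 860328, 861301, 861393, 871151, 871173]])

# rows where column 2 == 861301 and column 3 == 861393
rows = np.where(np.all(x[:, 2:4] == [861301, 861393], axis=1))[0]
print(rows)      # [1 2]
print(x[rows])   # the two matching rows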

Pandas column selection with many conditions becomes unwieldy

Quick Pandas question:
I am cleaning up the values in individual columns of a dataframe by using an apply on a series:
# For all values in col 'Rate' over 1, divide by 100
df['rate'][df['rate']>1] = df['rate'][df['rate']>1].apply(lambda x: x/100)
This is fine when the selection criteria is simple, such as df['rate']>1. This however gets very long when you start adding multiple selection criteria:
df['rate'][(df['rate']>1) & (~df['rate'].isnull()) & (df['rate_type']=='fixed') & (df['something']<= 'nothing')] = df['rate'][(df['rate']>1) & (~df['rate'].isnull()) & (df['rate_type']=='fixed') & (df['something']<= 'nothing')].apply(lambda x: x/100)
What's the most concise way to:
1. Split a column off (as a Series) from a DataFrame
2. Apply a function to the items of the Series
3. Update the DataFrame with the modified series
I've tried using df.update(), but that didn't seem to work. I've also tried using the Series as a selector, e.g. isin(Series), but I wasn't able to get that to work either.
Thank you!
When there are multiple conditions, you can keep things simple using eval:
mask = df.eval("rate > 1 & rate_type == 'fixed' & something <= 'nothing'")
df.loc[mask, 'rate'] = df['rate'].apply(function)
Read more about evaluating expressions dynamically in the pandas documentation for eval. Of course, this particular function can be vectorized as
df.loc[mask, 'rate'] /= 100
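For illustration, a small self-contained example of the eval-based mask (the data is made up, and only two of the three conditions are shown for brevity):
import pandas as pd

df = pd.DataFrame({'rate': [2.5, 150.0, 0.03],
                   'rate_type': ['fixed', 'variable', 'fixed']})

mask = df.eval("rate > 1 & rate_type == 'fixed'")
df.loc[mask, 'rate'] /= 100
print(df['rate'].tolist())   # [0.025, 150.0, 0.03] -- only row 0 met both conditions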
It will also work with update:
con=(df['rate']>1) & (df['rate_type']=='fixed') & (df['something']<= 'nothing')
df.update(df.loc[con,['rate']].apply(lambda x: x/100))
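A quick sanity check of the update approach on similar made-up data (the column names follow the question, the values are assumptions); update aligns on the index, so only the rows selected by con are overwritten:
import pandas as pd

df = pd.DataFrame({'rate': [2.5, 150.0, 0.03],
                   'rate_type': ['fixed', 'variable', 'fixed'],
                   'something': ['nothing', 'nothing', 'nothing']})

con = (df['rate']>1) & (df['rate_type']=='fixed') & (df['something']<= 'nothing')
df.update(df.loc[con,['rate']].apply(lambda x: x/100))
print(df['rate'].tolist())   # [0.025, 150.0, 0.03]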

Filtering data frame using str.contains('') but with an exception

I am trying to filter rows in my data frame by the column 'PRODUCT' using str.contains('DE'). DE ranges from DE001 up to DE999.
How do I filter out DE998 and DE999? I have been trying this code but I can't seem to figure out a way to remove DE998 and DE999 without having to do it manually on another line.
I am using df2[df2['PRODUCT'].str.contains("DE")]. Can anyone suggest code for this or a more efficient way to do it? Thank you for answering. Sorry, still a newbie programmer.
You can create 2 masks: one testing the first 2 characters and the other testing the entire string. For the second condition, we can use ~ to indicate a negative condition. Then combine the 2 Boolean masks with the & operator.
mask1 = df2['PRODUCT'].str[:2] == 'DE'
mask2 = ~df2['PRODUCT'].isin(['DE998', 'DE999'])
res = df2[mask1 & mask2]
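For example, on a toy 'PRODUCT' column (the values are made up):
import pandas as pd

df2 = pd.DataFrame({'PRODUCT': ['DE001', 'DE500', 'DE998', 'DE999', 'FR010']})

mask1 = df2['PRODUCT'].str[:2] == 'DE'
mask2 = ~df2['PRODUCT'].isin(['DE998', 'DE999'])
res = df2[mask1 & mask2]
print(res['PRODUCT'].tolist())   # ['DE001', 'DE500']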

Concatenate Using Lambda And Conditions

I am trying to use lambda and map to create a new column within my dataframe. Essentially, the new column should take column A if a criterion is met and column B if it is not. Please see my code below.
df['LS'] = df['Long'].map(lambda y: df.Currency if y>0 else df.StartDate)
However, when I do this, the function returns the entire column for each item in my new column.
In English: I am going through each item y in column Long. If the item is > 0, take the yth value in column "Currency". Otherwise take the yth value in column "StartDate".
Iterating over the rows to do this is extremely slow. Are there any other options?
Thanks!
James
Just do
df['LS']=np.where(df.Long>0,df.Currency,df.StartDate)
which is the vectorized approach.
df.Long.map applies the lambda to each element, but the lambda actually returns df.Currency or df.StartDate, which are whole Series; that is why the entire column ends up in every cell.
Another approach to consider is:
df.apply(lambda row : row[1] if row[0]>0 else row[2],1)
which will also work when df.columns = Index(['Long', 'Currency', 'StartDate', ...]),
but it is not a vectorized approach, so it is slow (about 200x slower for 1000 rows in this case).
You can do the same using where:
df['LS'] = df['Currency'].where(df['Long']>0,df['StartDate'])
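Both forms give the same result; here is a small made-up frame to illustrate (column names follow the question, values are assumptions):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Long': [10, -5, 0],
                   'Currency': ['USD', 'EUR', 'JPY'],
                   'StartDate': ['2020-01-01', '2020-02-01', '2020-03-01']})

df['LS'] = np.where(df.Long > 0, df.Currency, df.StartDate)
# equivalent, staying within pandas:
# df['LS'] = df['Currency'].where(df['Long'] > 0, df['StartDate'])
print(df['LS'].tolist())   # ['USD', '2020-02-01', '2020-03-01']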
