I am struggling to understand how df.apply() exactly works.
My problem is as follows: I have a dataframe df, and I want to search several columns for certain strings. If a string is found in any of those columns, I want to add a "label" (in a new column) for each row where it is found.
I am able to solve the problem with map and applymap (see below).
However, I would expect that the better solution would be to use apply, as it applies a function to an entire column.
Question: Is this not possible using apply? Where is my mistake?
Here are my solutions for using map and applymap.
import pandas as pd
import re

df = pd.DataFrame([list("ABCDZ"), list("EAGHY"), list("IJKLA")], columns=["h1", "h2", "h3", "h4", "h5"])
Solution using map
def setlabel_func(column):
    return df[column].str.contains("A")

mask = sum(map(setlabel_func, ["h1", "h5"]))
df.loc[mask == 1, "New Column"] = "Label"  # .ix is deprecated; .loc does the same job here
Solution using applymap
mask = df[["h1", "h5"]].applymap(lambda el: True if re.match("A", el) else False).T.any()
df.loc[mask, "New Column"] = "Label"
For apply I don't know how to pass the two columns into the function, or maybe I don't understand the mechanics at all ;-)
def setlabel_func(column):
    return df[column].str.contains("A")

df.apply(setlabel_func(["h1","h5"]), axis=1)
The above gives me this error:
'DataFrame' object has no attribute 'str'
Any advice? Please note that the search function in my real application is more complex and requires a regex, which is why I use .str.contains in the first place.
Another solution is to use DataFrame.any to get at least one True per row:
print (df[['h1', 'h5']].apply(lambda x: x.str.contains('A')))
h1 h5
0 True False
1 False False
2 False True
print (df[['h1', 'h5']].apply(lambda x: x.str.contains('A')).any(axis=1))
0 True
1 False
2 True
dtype: bool
import numpy as np

df['new'] = np.where(df[['h1','h5']].apply(lambda x: x.str.contains('A')).any(axis=1),
                     'Label', '')
print (df)
h1 h2 h3 h4 h5 new
0 A B C D Z Label
1 E A G H Y
2 I J K L A Label
mask = df[['h1', 'h5']].apply(lambda x: x.str.contains('A')).any(axis=1)
df.loc[mask, 'New'] = 'Label'
print (df)
h1 h2 h3 h4 h5 New
0 A B C D Z Label
1 E A G H Y NaN
2 I J K L A Label
pd.DataFrame.apply iterates over each column, passing the column as a pd.Series to the function being applied. In your case, the function you're trying to apply doesn't lend itself to being used in apply.
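To see what apply actually hands to the function, here is a quick sketch (my own illustration, using the df defined above):

# each call receives one whole column as a pd.Series
out = df[['h1', 'h5']].apply(lambda col: '{}: {}'.format(col.name, type(col).__name__))
print(out)
# h1    h1: Series
# h5    h5: Series
# dtype: object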
Do this instead to get your idea to work
mask = df[['h1', 'h5']].apply(lambda x: x.str.contains('A').any(), axis=1)
df.loc[mask, 'New Column'] = 'Label'
h1 h2 h3 h4 h5 New Column
0 A B C D Z Label
1 E A G H Y NaN
2 I J K L A Label
IIUC you can do it this way:
In [23]: df['new'] = np.where(df[['h1','h5']].apply(lambda x: x.str.contains('A'))
                                .sum(axis=1) > 0,
                              'Label', '')
In [24]: df
Out[24]:
h1 h2 h3 h4 h5 new
0 A B C D Z Label
1 E A G H Y
2 I J K L A Label
Others have given good alternative methods. Here is a way to use apply row-wise (axis=1) to get your new column indicating the presence of "A" across a bunch of columns.
When apply passes you a row, you can just join the strings together into one big string and then use a string comparison ("in"); see below. Here I am combining all columns, but you can do it with just h1 and h5 easily.
df = pd.DataFrame([list("ABCDZ"),list("EAGHY"), list("IJKLA")], columns = ["h1","h2","h3","h4", "h5"])
def dothat(row):
    sep = ""
    return "A" in sep.join(row['h1':'h5'])

df['NewColumn'] = df.apply(dothat, axis=1)
This just squashes each row into one string (e.g. ABCDZ) and looks for "A". It is not that efficient, though: if you want to quit the first time you find the string, combining all the columns could be a waste of time. You could easily change the function to look column by column and quit (return True) as soon as it finds a hit, as in the sketch below.
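A minimal sketch of that early-exit variant (the function name is mine, and it only checks h1 and h5):

def find_a_early_exit(row):
    # stop at the first cell that contains "A"
    for col in ["h1", "h5"]:
        if "A" in row[col]:
            return True
    return False

df['NewColumn'] = df.apply(find_a_early_exit, axis=1)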
Related
I'm having difficulties counting the number of elements in a list within a DataFrame's column. My problem comes from the fact that, after importing my input csv file, the rows that are supposed to contain an empty list [] are actually parsed as lists containing the empty string [""]. Here's a reproducible example to make things clearer:
import pandas as pd
df = pd.DataFrame({"ID": [1, 2, 3], "NETWORK": [[""], ["OPE", "GSR", "REP"], ["MER"]]})
print(df)
ID NETWORK
0 1 []
1 2 [OPE, GSR, REP]
2 3 [MER]
Even though one might think that the list for the row where ID = 1 is empty, it's not. It actually contains the empty string [""] which took me a long time to figure out.
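A quick way to see this, using the example df above:

# the "empty" list actually holds a single empty string
print(df.loc[0, "NETWORK"])       # ['']
print(len(df.loc[0, "NETWORK"]))  # 1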
So whatever standard method I try to use to calculate the number of elements within each list, I get a wrong value of 1 for the rows that are supposed to be empty:
df["COUNT"] = df["NETWORK"].str.len()
print(df)
ID NETWORK COUNT
0 1 [] 1
1 2 [OPE, GSR, REP] 3
2 3 [MER] 1
I searched and tried a lot of things before posting here but I couldn't find a solution to what seems to be a very simple problem. I should also note that I'm looking for a solution that doesn't require me to modify my original input file nor modify the way I'm importing it.
You just need to write a custom apply function that ignores the ''
df['COUNT'] = df['NETWORK'].apply(lambda x: sum(1 for w in x if w!=''))
Another way:
df['NETWORK'].apply(lambda x: len([y for y in x if y]))
Using apply is probably more straightforward. Alternatively, explode, filter, then group by count.
_s = df['NETWORK'].explode()
_s = _s[_s != '']
df['count'] = _s.groupby(level=0).count()
This yields:
NETWORK count
ID
1 [] NaN
2 [OPE, GSR, REP] 3.0
3 [MER] 1.0
Fill NA with zeroes if needed.
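For example, a small sketch assuming the 'count' column created above:

df['count'] = df['count'].fillna(0).astype(int)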
df["COUNT"] = df["NETWORK"].apply(lambda x: len(x))
Use a lambda function on each row and, in the lambda, return the length of the list.
I am working with a DataFrame of about 200,000 rows. In one column some of the values are empty lists; most of them are lists with several values.
What I want to do is replace the empty lists with this list:
[[close*0.95, close*0.94]]
where close is the close value in that row. The for loop that I use is this one:
for i in range(1, len(data3.index)):
    close = data3.close[data3.index == data3.index[i]].values[0]
    sell_list = data3.sell[data3.index == data3.index[i]].values[0]
    buy_list = data3.buy[data3.index == data3.index[i]].values[0]
    if len(sell_list) == 0:
        data3.loc[data3.index[i], "sell"].append([[close*1.05, close*1.06]])
    if len(buy_list) == 0:
        data3.loc[data3.index[i], "buy"].append([[close*0.95, close*0.94]])
I tried to make it work with multithreading, but since I need to read the whole table to do the next step I can't split the data. I hope you can help me write some kind of lambda function to apply to the df, or something similar; I am not very skilled at this. Thanks for reading!
The expected output in the "buy" column for a row with an empty list should be something like [[[11554, 11566]]].
Example data:
import pandas as pd
df = pd.DataFrame({'close': [11763, 21763, 31763], 'buy': [[], [[[21763, 21767]]], []]})
close buy
0 11763 []
1 21763 [[[21763, 21767]]]
2 31763 []
You could do it like this:
# Create mask (a bit faster than df['buy'].apply(len) == 0).
# Assumes there are no NaNs in the column; if there are, handle them before building the mask.
m = [len(l) == 0 for l in df['buy'].tolist()]

# Create triple-nested lists and assign.
df.loc[m, 'buy'] = list(df.loc[m, ['close', 'close']].mul([0.95, 0.94]).to_numpy()[:, None][:, None])
print(df)
Result:
close buy
0 11763 [[[11174.85, 11057.22]]]
1 21763 [[[21763, 21767]]]
2 31763 [[[30174.85, 29857.219999999998]]]
Some explanation:
m is a boolean mask that selects the rows of the DataFrame with an empty list in the 'buy' column:
m = [len(l) == 0 for l in df['buy'].tolist()]
# Or (a bit slower): apply the len() function to every list in the column.
m = df['buy'].apply(len) == 0
print(m)
0 True
1 False
2 True
Name: buy, dtype: bool
We can use this mask to select where to calculate the values.
df.loc[m, ['close', 'close']].mul([0.95, 0.94]) duplicates the 'close' column and calculates the vectorised product of all the (close, close) pairs with (0.95, 0.94) to obtain (close*0.95, close*0.94) in each row of the resulting array.
[:, None][:, None] is just a trick to create two additional axes on the resulting array. This is required since you want triple nested lists ([[[]]]).
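To see what the two extra axes do, here is a quick sketch of the intermediate shapes (my own illustration, reusing m and df from above):

sub = df.loc[m, ['close', 'close']].mul([0.95, 0.94])
arr = sub.to_numpy()
print(arr.shape)                    # (2, 2): two masked rows, two factors each
print(arr[:, None][:, None].shape)  # (2, 1, 1, 2): two extra length-1 axes
print(arr[:, None][:, None][0])     # [[[11174.85 11057.22]]] -> one triple-nested value per row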
How do I check a Pandas column for "any" row that matches a condition? (in my case, I want to test for type string).
Background: I was using the df.columnName.dtype.kind == 'O' to check for strings. But then I encountered the issue where some of my columns had decimal values. So I am looking for a different way to check and what I have come up with is:
display(df.col1.apply(lambda x: isinstance(x,str)).any()) #true
But the above code causes isinstance to be evaluated on every row, and that seems inefficient if I have a very large number of rows. How can I implement the above check so that it stops evaluating after encountering the first true value?
Here is a more complete example:
from decimal import Decimal
import pandas as pd
data = {
'c1': [None,'a','b'],
'c2': [None,1,2],
'c3': [None,Decimal(1),Decimal(2)]
}
dx = pd.DataFrame(data)
print(dx) #displays the dataframe
print('dx.dtypes')
print(dx.dtypes) #displays the datatypes in the dataframe
print('dx.c1.dtype:',dx.c1.dtype) #'O'
print('dx.c2.dtype:',dx.c2.dtype) #'float64'
print('dx.c3.dtype:',dx.c3.dtype) #'O'!
print('dx.c1.apply(lambda x: isinstance(x,str)).any()')
print(dx.c1.apply(lambda x: isinstance(x,str)).any())#true
print('dx.c2.apply(lambda x: isinstance(x,str)).any()')
print(dx.c2.apply(lambda x: isinstance(x,str)).any())#false
#the following line shows that the apply function applies it to every row
print('dx.c1.apply(lambda x: isinstance(x,str))')
print(dx.c1.apply(lambda x: isinstance(x,str))) #False, True, True
#and only after that is the any function applied
print('dx.c1.apply(lambda x: isinstance(x,str)).any()')
print(dx.c1.apply(lambda x: isinstance(x,str)).any())#true
The above code outputs:
c1 c2 c3
0 None NaN None
1 a 1.0 1
2 b 2.0 2
dx.dtypes
c1 object
c2 float64
c3 object
dtype: object
dx.c1.dtype: object
dx.c2.dtype: float64
dx.c3.dtype: object
dx.c1.apply(lambda x: isinstance(x,str)).any()
True
dx.c2.apply(lambda x: isinstance(x,str)).any()
False
dx.c1.apply(lambda x: isinstance(x,str))
0 False
1 True
2 True
Name: c1, dtype: bool
dx.c1.apply(lambda x: isinstance(x,str)).any()
True
Is there a better way?
More detail: I am trying to fix this line, which breaks when the column has "decimal" values: https://github.com/capitalone/datacompy/blob/8a74e60d26990e3e05d5b15eb6fb82fef62f4776/datacompy/core.py#L273
Copying my comment as an answer:
It seems what you needed was the built-in function any:
any(isinstance(x,str) for x in df['col1'])
That way rows are only evaluated until an instance of string is found.
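A small sketch of the early exit, reusing the dx frame from the question:

# any() consumes the generator lazily, so it stops at the first string it finds
print(any(isinstance(x, str) for x in dx['c1']))  # True: stops after 'a', the second value
print(any(isinstance(x, str) for x in dx['c2']))  # False: has to scan the whole column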
I need to merge two pandas data frames using a column which contains numerical values.
For example, the two data frames could be like the following ones:
data frame "a"
a1 b1
0 "x" 13560
1 "y" 193309
2 "z" 38090
3 "k" 37212
data frame "b"
a2 b2
0 "x" 13,56
1 "y" 193309
2 "z" 38,09
3 "k" 37212
What I need to do is merge a with b on column b1/b2.
The problem is that, as you can see, some values of data frame b are a little bit different. First of all, the b2 values are not integers but strings, and furthermore, the values which end with 0 are "rounded" (13560 --> 13,56).
What I've tried to do is replace the comma and then cast them to int, but it doesn't work; more specifically, this procedure doesn't add the missing zero.
This is the code that i've tried:
b['b2'] = b['b2'].str.replace(",", "")
b['b2'] = b['b2'].astype(np.int64) # np is numpy
Is there any procedure that i can use to fix this problem?
I believe you need to create a boolean mask to specify which values have to be multiplied:
# or pass thousands=',' to read_csv, as suggested by @Inder
b['b2'] = b['b2'].str.replace(",", "", regex=True).astype(np.int64)
mask = b['b2'] < 10000
b['b2'] = np.where(mask, b['b2'] * 10, b['b2'])
print (b)
a2 b2
0 x 13560
1 y 193309
2 z 38090
3 k 37212
Correcting the column first with an apply and a lambda function:
b.b2 = b.b2.apply(lambda x: int(x.replace(',','')) * 10 if ',' in x else int(x))
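Once b2 is numeric, the merge itself should be straightforward; a minimal sketch, assuming the frames are named a and b as in the question:

merged = a.merge(b, left_on='b1', right_on='b2')
print(merged)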
I have a function like this:
def highlight_otls(df):
    return ['background-color: yellow']
And a DataFrame like this:
price outlier
1.99 F,C
1.49 L,C
1.99 F
1.39 N
What I want to do is highlight a certain column in my df based on this condition on another column:
data['outlier'].str.split(',').str.len() >= 2
So if df['outlier'] contains two or more comma-separated values, I want to highlight the corresponding df['price'] cell. (So the first two prices should be highlighted in my dataframe above.)
I attempted to do this by doing the following which gives me an error:
data['price'].apply(lambda x: highlight_otls(x) if (x['outlier'].str.split(',').str.len()) >= 2, axis=1)
Any idea on how to do this the proper way?
Use Styler.apply. (To output to xlsx format, use the to_excel function.)
Suppose one's dataset is
other price outlier
0 X 1.99 F,C
1 X 1.49 L,C
2 X 1.99 F
3 X 1.39 N
def highlight_price(row):
    ret = ["" for _ in row.index]
    if len(row.outlier.split(",")) >= 2:
        ret[row.index.get_loc("price")] = "background-color: yellow"
    return ret

df.style.\
    apply(highlight_price, axis=1).\
    to_excel('styled.xlsx', engine='openpyxl')
From the documentation, "DataFrame.style attribute is a property that returns a Styler object."
We pass our styling function, highlight_price, into Styler.apply and demand a row-wise application with axis=1. (Recall that we want to color the price cell in each row based on the outlier information in the same row.)
Our function highlight_price generates the visual styling for each row. For each row, we first generate the styling for the other, price, and outlier columns as ["", "", ""]. We can obtain the right index to modify only the price part of the list with row.index.get_loc("price"), as in
ret[row.index.get_loc("price")] = "background-color: yellow"
# ret becomes ["", "background-color: yellow", ""]
Results
Key points
You need to access values in multiple columns for your lambda function, so apply to the whole dataframe instead of to the price column only.
The above also solves the issue that apply for a Series has no axis argument.
Add else x to fix the syntax error in the conditional logic of your lambda.
When you index x in the lambda it is a value, no longer a Series, so drop the .str accessor calls and just call len on it.
So try:
data.apply(lambda x: highlight_otls(x) if len(x['outlier'].split(',')) >= 2 else x, axis=1)
Output
0 [background-color: yellow]
1 [background-color: yellow]
2 [None, None]
3 [None, None]
dtype: object
One way to deal with null outlier values as per your comment is to refactor the highlighting conditional logic into the highlight_otls function:
def highlight_otls(x):
    if len(x['outlier'].split(',')) >= 2:
        return ['background-color: yellow']
    else:
        return x

data.apply(lambda x: highlight_otls(x) if pd.notnull(x['outlier']) else x, axis=1)
By the way, you may want to return something like ['background-color: white'] instead of x when you don't want to apply highlighting.
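A minimal sketch of that variant, folding the null check in as well (hypothetical, just following the suggestion above):

def highlight_otls(x):
    # fall back to a white background instead of returning the row itself
    if pd.notnull(x['outlier']) and len(x['outlier'].split(',')) >= 2:
        return ['background-color: yellow']
    return ['background-color: white']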
I suggest using a custom function to return a styled DataFrame by condition, and finally exporting to an Excel file:
def highlight_otls(x):
    c1 = 'background-color: yellow'
    c2 = ''
    mask = x['outlier'].str.split(',').str.len() >= 2
    df1 = pd.DataFrame(c2, index=df.index, columns=df.columns)
    #modify values of df1 column by boolean mask
    df1.loc[mask, 'price'] = c1
    #check styled DataFrame
    print (df1)
    #                       price outlier
    # 0  background-color: yellow
    # 1  background-color: yellow
    # 2
    # 3
    return df1

df.style.apply(highlight_otls, axis=None).to_excel('styled.xlsx', engine='openpyxl')