Hi all,
I would like to know whether it is possible to apply a function to the result of pandas .loc, or whether there is a better way to do this.
What I'm trying to do is:
If the value in this series is != 0, take the values from other columns of that row and pass them as parameters to a function (in this case, get_working_days_delta), then put the result back in the same series.
df.loc[(df["SERIES"] != 0), 'SERIES'] = df.apply(cal.get_working_days_delta(df["DATE_1"],df["DATE_2"]))
The output is: datetime64[ns] is of unsupported type (<class 'pandas.core.series.Series'>)
In this case, the parameters (df["DATE_1"], df["DATE_2"]) are passed as entire series rather than as cell values.
I don't want to use .apply or .at because this df has over 4 million rows.
Hard to give a good answer without an example.
But I think the problem is that you need to filter before using apply.
Try
df[df["SERIES"] != 0].apply(...)
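If plain weekday counting is acceptable, the loop can be avoided entirely. The sketch below swaps cal.get_working_days_delta for NumPy's np.busday_count, which is an assumption on my part: it counts Monday to Friday from the start date up to (but not including) the end date, and ignores holidays unless you pass them through its holidays argument, so the numbers may differ from what your calendar object returns. The sample data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "SERIES": [0, 1, 5],
    "DATE_1": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-05"]),
    "DATE_2": pd.to_datetime(["2024-01-10", "2024-01-08", "2024-01-09"]),
})

# Boolean mask selecting the rows to update.
mask = df["SERIES"] != 0

# np.busday_count expects day-resolution dates, so convert from ns to D.
start = df.loc[mask, "DATE_1"].to_numpy().astype("datetime64[D]")
end = df.loc[mask, "DATE_2"].to_numpy().astype("datetime64[D]")

# One vectorized call for all selected rows instead of one call per row.
df.loc[mask, "SERIES"] = np.busday_count(start, end)
```

Because the right-hand side is a plain NumPy array with one entry per selected row, the values are written by position and no index alignment is involved.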
I'm learning Python and want to use the "apply" function. Reading the manual, I found that if I have a simple dataframe like this:
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
A B
0 4 9
1 4 9
2 4 9
and then I use something like this:
df.apply(lambda x:x.sum(),axis=0)
the output works because, according to the theory, x receives each column in turn and the sum is applied to each, so the result is correctly this:
A 12
B 27
dtype: int64
When instead I issue something like:
df['A'].apply(lambda x:x.sum())
the result is: 'int' object has no attribute 'sum'
My question is: why does this work on a dataframe column by column, but not on a single column? In the end the logic should be the same: x should receive one column as input instead of two.
I know that for this simple example I should use other functions like df.agg or even df['A'].sum() but the question is to understand the logic of apply.
If you look at a specific column of a pandas.DataFrame object, you are working with a pandas.Series with (in your case) integers as values. Series.apply passes each value to the function one at a time, and integers don't have a sum() method.
(Run type(df['A']) to see that you are working with a series and not a data frame anymore when selecting a single column.)
The confusing part is that if you work with an actual pandas.DataFrame object, apply passes each column as a pandas.Series object, and those do have a sum() method.
So there are two ways to fix your problem:
Work with a pandas.DataFrame and not with a pandas.Series: df[['A']]. The additional brackets force pandas to return a pandas.DataFrame object (verify with type(df[['A']])), and you can use the lambda function just as you did before.
Use a function rather than a method in the lambda: df['A'].apply(lambda x: np.sum(x)) (assuming that you have imported numpy as np). Note that this runs without error but is still applied element by element, so it returns each value unchanged rather than the column total.
I would recommend the second option, as it seems to me the more generic and clearer way.
However, this is only relevant if you want to apply a certain function to every element in a pandas.Series or pandas.DataFrame. In your specific case, there is no need to take the detour that you are currently using. Just use df.sum(axis=0).
The approach with apply overcomplicates things. The reason it works is that every column of a pandas.DataFrame is a pandas.Series, which has a sum method. But a pandas.DataFrame has one too, so you can use it right away.
The only case where you actually need to go through apply is if you had arrays in every cell of the pandas.DataFrame.
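To make the difference concrete, here is a small sketch using the dataframe from the question. The variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=["A", "B"])

# DataFrame.apply with axis=0: the function receives each column as a Series.
col_sums = df.apply(lambda col: col.sum(), axis=0)

# Series.apply: the function receives one scalar value at a time,
# so only scalar operations make sense here.
plus_one = df["A"].apply(lambda value: value + 1)

# Double brackets keep a one-column DataFrame, so apply sees a Series again.
single_col_sum = df[["A"]].apply(lambda col: col.sum())

# The direct way, without apply.
total = df["A"].sum()
```

Here col_sums holds one total per column, plus_one has the same length as the column, and both single_col_sum["A"] and total give the sum of column A.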
This question is related to the question I posted yesterday, which can be found here.
So, I went ahead and applied the solution provided by Jan to the entire data set. The solution is as follows:
import re

def is_probably_english(row, threshold=0.90):
    regular_expression = re.compile(r'[-a-zA-Z0-9_ ]')
    ascii = [character for character in row['App'] if regular_expression.search(character)]
    quotient = len(ascii) / len(row['App'])
    passed = True if quotient >= threshold else False
    return passed

google_play_store_is_probably_english = google_play_store_no_duplicates.apply(is_probably_english, axis=1)
google_play_store_english = google_play_store_no_duplicates[google_play_store_is_probably_english]
So, from what I understand, we are filtering the google_play_store_no_duplicates DataFrame using the is_probably_english function and storing the result, which is a boolean, into another DataFrame (google_play_store_is_probably_english). The google_play_store_is_probably_english is then used to filter out the non-English apps in the google_play_store_no_duplicates DataFrame, with the end result being stored in a new DataFrame.
Does this make sense and does it seem like a sound way to approach the problem? Is there a better way to do this?
This makes sense, and I think it is a sound way to do it. The result of the function is a boolean, as you said, so when you apply it to every row you end up with a pd.Series of booleans, which is usually called a boolean mask. This concept is very useful in pandas whenever you want to filter rows by some condition.
Here is an article about boolean masks in pandas.
I am fairly new to python and pandas and I am trying to do the following:
Here is my dataset:
df5
Out[52]:
NAME
0 JIMMcdonald
1 TomDickson
2 SamHarper
I am trying to extract the first three characters using apply with a lambda.
Here is what I have tried:
df5["FirstName"] = df5.apply(lambda x: x[0:3],axis=1)
Here is the result:
df5
Out[54]:
NAME FirstName
0 JIMMcdonald JIMMcdonald
1 TomDickson TomDickson
2 SamHarper SamHarper
I don't understand why it didn't work. Can someone help me?
Thank you
This is due to the difference between DataFrame.apply (which is what you're using) and Series.apply (which is what you want to use). The easiest way to fix this is to select the series you want from your dataframe, and use .apply on that:
df5["FirstName"] = df5["NAME"].apply(lambda x: x[0:3])
Your current code, with axis=1, runs the function once per row, so x is the whole row (a Series) and x[0:3] selects its first three entries rather than the first three characters of the name. Since there is only one column, the full NAME value comes back. The fixed code runs the function on each value in the selected column.
Better yet, as @Erfan pointed out in his comment, simple one-liner string operations like this can often be simplified using pandas' .str accessor, which allows you to operate on an entire series of strings in much the same way you'd operate on a single string:
df5["FirstName"] = df5["NAME"].str[:3]
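A quick check, using the three names from the question, that both spellings give the same result:

```python
import pandas as pd

df5 = pd.DataFrame({"NAME": ["JIMMcdonald", "TomDickson", "SamHarper"]})

# Series.apply: the lambda receives one string at a time.
via_apply = df5["NAME"].apply(lambda name: name[0:3])

# .str accessor: the slice is applied to every string in the column at once.
via_str = df5["NAME"].str[:3]

df5["FirstName"] = via_str
```

The .str version is usually the one to prefer, since it states the intent directly and avoids a Python-level function call per row.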
I need to read and update the values of my data cells using dataframe.iat[row, column].
My data has about 338,000 rows, so I need to use the fastest way (iat) for this.
I have to refer to the column by its name because it changes dynamically in another loop.
When I execute my code, I get the following error:
for i in range(30000):
    b = data_jeux.iat[i, 'skill_id_%s' % k]
ValueError: iAt based indexing can only have integer indexers
PS: I already use df.get_value() and it works correctly, but I am required to find a solution with .iat.
pd.Index.get_loc
With Pandas, you should generally avoid Python-level for loops. However, assuming you must iterate explicitly, you can use get_loc to turn the column name into an integer position:
col_loc = data_jeux.columns.get_loc
for i in range(30000):
    b = data_jeux.iat[i, col_loc(f'skill_id_{k}')]
Given you have a large loop, assigning data_jeux.columns.get_loc to a variable outside your loop and using f-strings may offer some marginal performance improvements.
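Here is a self-contained sketch of the same idea on a tiny made-up frame (the data and the value of k are assumptions for illustration). It also shows pulling the whole column out as an array, which may be simpler than reading it cell by cell if you only need the values.

```python
import pandas as pd

# Hypothetical stand-in for data_jeux.
data_jeux = pd.DataFrame({"skill_id_1": [10, 20, 30], "skill_id_2": [1, 2, 3]})
k = 2

# Resolve the column name to its integer position once, outside the loop.
col = data_jeux.columns.get_loc(f"skill_id_{k}")

# Read every cell of that column with purely integer-based .iat indexing.
values = [data_jeux.iat[i, col] for i in range(len(data_jeux))]

# .iat can also write a single cell by position.
data_jeux.iat[0, col] = 99

# Loop-free alternative for reading: take the column as a NumPy array.
column_values = data_jeux[f"skill_id_{k}"].to_numpy()
```

Looking the position up once matters mostly because the name-to-position lookup would otherwise be repeated on every iteration.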
I want to set a cell of pandas dataframe equal to another. For example:
station_dim.loc[station_dim.nlc==573,'longitude']=station_dim.loc[station_dim.nlc==5152,'longitude']
However, when I checked
station_dim.loc[station_dim.nlc==573,'longitude']
it returns NaN.
Besides directly setting station_dim.loc[station_dim.nlc==573,'longitude'] to a number, what other choices do I have? And why can't I use this method?
Take a look at get_value (deprecated and removed in pandas 1.0; .at and .iat are the replacements), or use .values:
station_dim.loc[station_dim.nlc==573,'longitude']=station_dim.loc[station_dim.nlc==5152,'longitude'].values[0]
For the assignment to work, the index of the right-hand side would need to align with your df. .loc[] returns a pd.Series, and its index (the row where nlc == 5152) does not match the target row, so the assigned value is NaN. So either extract the value directly using .get_value(), where you need to get the index position first, or use .values, which returns a np.array, and take the first value of that array.
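The alignment behaviour is easy to reproduce on a two-row frame. The data below is made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for station_dim.
station_dim = pd.DataFrame({"nlc": [573, 5152], "longitude": [0.0, -0.12]})

# Series-to-Series assignment aligns on index labels: the source row has
# label 1, the target row has label 0, so nothing matches and NaN is written.
broken = station_dim.copy()
broken.loc[broken.nlc == 573, "longitude"] = broken.loc[broken.nlc == 5152, "longitude"]

# Extracting the scalar first sidesteps alignment entirely.
value = station_dim.loc[station_dim.nlc == 5152, "longitude"].to_numpy()[0]
station_dim.loc[station_dim.nlc == 573, "longitude"] = value
```

Taking element [0] assumes exactly one row matches nlc == 5152; if there could be none or several, that case needs handling separately.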