I'm trying to apply multiple functions to my pandas dataframes; the functions are imported from other internal scripts. However, when applying them the code gets really long, which I don't think is good engineering practice in general:
df['data_udo_product_brand_mode'] = df['data_udo_product_brand'].fillna(
'[]').apply(think_preprocessing.create_list).apply(think_math.get_mode).apply(lambda x: x.strip('"').lower())
Is there a more efficient and better way to apply multiple functions to a dataframe/dataframe columns than the above way?
Thanks in advance!
There may well be a better way, but one thing you can do here is separate out the lambda so the line doesn't grow so long. For example:
# Strip surrounding double quotes and lowercase
strip_lower = lambda x: x.strip('"').lower()

df['data_udo_product_brand_mode'] = (
    df['data_udo_product_brand']
    .fillna('[]')
    .apply(think_preprocessing.create_list)
    .apply(think_math.get_mode)
    .apply(strip_lower)
)
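If the chain keeps growing, another option is to compose the per-element steps into a single function and apply it once, so the Series is traversed one time instead of once per step. A minimal sketch, reusing the helpers from your question:
def clean_brand(value):
    # one pass per element instead of one .apply pass per step
    value = think_preprocessing.create_list(value)
    value = think_math.get_mode(value)
    return value.strip('"').lower()

df['data_udo_product_brand_mode'] = df['data_udo_product_brand'].fillna('[]').apply(clean_brand)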
I have data in a tabular format with an id per row. One column holds a flag with one or more categorical values, e.g. condition_one, condition_two.
I'm generating summary statistics using the below:
function_count_certain_condition = lambda x: x.str.count("condition_two").sum()
function_count_certain_condition.__name__ = 'number_of_two_conditions'
# ---
aggregations = {
'column_one': ['count','first','last','nunique'],
'conditions_column': [function_count_certain_condition]
}
df_aggregate_stats = df.groupby(['id_column']).agg(aggregations)
This works but doesn't seem particularly pythonic or performant. I tried using value_counts() but got a key error.
particularly pythonic
Yes: you are using a lambda, storing it in a variable (the whole point of a lambda, being nameless, is lost), and then shoving a name onto it afterwards. Just use def to define the function. That is, replace
function_count_certain_condition = lambda x: x.str.count("condition_two").sum()
function_count_certain_condition.__name__ = 'number_of_two_conditions'
with
def number_of_two_conditions(x):
return x.str.count("condition_two").sum()
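The named function then drops straight into the aggregation dict, and pandas uses the def's name to label the output column, so the separate __name__ assignment is no longer needed:
aggregations = {
    'column_one': ['count', 'first', 'last', 'nunique'],
    # agg picks up the function's own name for the result column
    'conditions_column': [number_of_two_conditions]
}
df_aggregate_stats = df.groupby(['id_column']).agg(aggregations)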
performant
Firstly, be warned against premature optimization: if that code runs fast enough for your use case, do not try to force it to be faster. Regarding that particular function, I do not see anything that would cause excessive execution time, as both substring counting and summation are generally fast operations.
I quite often find myself in a situation where I take several steps to get from my input data to the output I want, e.g. inside functions/loops. To avoid making my lines very long, I sometimes overwrite the variable name I am using between these operations.
One example would be:
df_2 = df_1.loc[df_1['id'] == val]
df_2 = df_2[['c1', 'c2']]
df_2 = df_2.merge(df3, left_on='c1', right_on='c1')
The only alternative I can come up with is:
df_2 = df_1.loc[df_1['id'] == val][['c1', 'c2']]\
    .merge(df3, left_on='c1', right_on='c1')
But neither of these options feels really clean. How should these situations be handled?
You can refer to this article which discusses exactly your question.
The pandas core team now encourages the use of "method chaining".
This is a style of programming in which you chain together multiple
method calls into a single statement. This allows you to pass
intermediate results from one method to the next rather than storing
the intermediate results using variables.
In addition to prettifying the chained code with brackets and indentation, as in #perl's answer, you might also find methods like .query() and .assign() useful for coding in a "method chaining" style.
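Applied to your example, a chained version using .query() might look like this (a sketch: .query() requires the column names to be valid identifiers, and @val pulls in the local variable):
df_2 = (
    df_1
    .query('id == @val')    # row filter, same as .loc[df_1['id'] == val]
    [['c1', 'c2']]          # column selection
    .merge(df3, on='c1')    # same join as left_on='c1', right_on='c1'
)
.assign() plays the same role for adding derived columns mid-chain without breaking out of the expression.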
Of course, there are some drawbacks for method chaining, especially when excessive:
"One drawback to excessively long chains is that debugging can be
harder. If something looks wrong at the end, you don't have
intermediate values to inspect."
Just as another option, you can put everything in brackets and then break the lines, like this:
df_2 = (df_1
        .loc[df_1['id'] == val][['c1', 'c2']]
        .merge(df3, left_on='c1', right_on='c1'))
It's generally pretty readable even if you have a lot of lines, and if you want to change the name of the output variable, you only need to change it in one place. So, it's a bit less verbose and a bit easier to change than overwriting the variables.
I am trying to use the function getKmers from this post
def getKmers(sequence, size=6):
return [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]
human_data['words'] = human_data.apply(lambda x: getKmers(x['sequence']), axis=1)
However, my data set has over 50,000 observations, and when I ran this script my notebook crashed every time. How should I optimize?
One solution I can think of is to divide the data set into pieces and run this code iteratively. But I do not think this will help, since apply() really just iterates through each row anyway. I am not sure what the problem is here.
Try to do it like this. You're unnecessarily going over each column by using apply that way, when you just want to use the sequence column.
human_data['words'] = human_data['sequence'].apply(getKmers)
Edit: while this is faster (you forgo running the lambda function), your original way was not actually going through each column; I mixed up apply with applymap.
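If the crash is from memory rather than speed, processing the column in pieces (as the question suggests) can still help, provided each piece is persisted or released before the next one is built. A rough sketch, keeping the getKmers function above:
import pandas as pd

chunk_size = 5000
parts = []
for start in range(0, len(human_data), chunk_size):
    chunk = human_data['sequence'].iloc[start:start + chunk_size]
    parts.append(chunk.apply(getKmers))
    # if memory is still tight, persist each part instead of accumulating:
    # parts[-1].to_pickle(f'words_{start}.pkl')

human_data['words'] = pd.concat(parts)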
I have 2 datasets from different sources and I'd like to filter out redundancy. So I have a function called "compare_df" that takes one row from each df and compares them; when they match it returns True, else False.
But "compare_df" is fairly complex, because the information is not formatted the same way and I check for time-window overlaps, so simply checking whether elements match is not possible.
Also, there are no matching columns between the two dfs.
My current solution is using apply twice like in this code:
result_df = first_df[first_df.apply(
    lambda x: ~second_df.apply(compare_df, axis=1, args=[x, no_end_time]).any(),
    axis=1)]
Is there an easy way to optimize the code so that it runs faster?
Maybe it is possible to just "break" out of the second apply as soon as a True value is returned.
Using iterrows should be slower, btw, because there shouldn't be that much redundancy, so the benefit of breaking out of the loop early probably won't outweigh the faster apply function from pandas.
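To illustrate the early-exit idea, something like this (an untested sketch, still using my compare_df) would stop scanning second_df at the first match, at the cost of a plain Python loop:
def has_match(row):
    # any() short-circuits: it stops at the first True from compare_df
    return any(
        compare_df(other, row, no_end_time)
        for _, other in second_df.iterrows()
    )

result_df = first_df[~first_df.apply(has_match, axis=1)]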
(I know numba could help, but it seems too complicated for this simple task)
Thanks for suggestions and other hints!
I have a Python package installed that displays Google geography info from a search term. I cannot apply the function across all the rows in my data. I can confirm that it works for a single result:
But how can I apply this across every row in the 7th column?
I understand I need a for loop:
But how do I apply my function from the package to each row?
Thanks a lot
You can use the apply function, which is faster:
data2['geocode_result'] = data2['ghana_city'].apply(lambda x: gmaps.geocode(x))
The results will be JSON-like dictionaries in the geocode_result column, so you may have to define a custom function to extract the information you want from them. At least apply will be a step in the right direction.
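For example, assuming the usual googlemaps client response (a list of result dicts with a 'geometry' -> 'location' entry; worth checking against your actual output), pulling out the coordinates might look like:
import pandas as pd

def extract_lat_lng(results):
    # gmaps.geocode() returns a list of results; it can be empty
    if not results:
        return None, None
    location = results[0]['geometry']['location']
    return location['lat'], location['lng']

data2[['lat', 'lng']] = data2['geocode_result'].apply(
    lambda r: pd.Series(extract_lat_lng(r))
)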
pd.Series.map is one way to vectorise your algorithm:
data2['geocode_result'] = data2['ghana_city'].map(gmaps.geocode)
This is still not truly vectorised, as gmaps.geocode() will be applied to every element in the series. But, similar to #ScratchNPurr's approach, you should see an improvement.