I have data in a tabular format with an id per row. One of the columns holds a flag with one or more categorical values, e.g. condition_one, condition_two.
I'm generating summary statistics using the code below:
function_count_certain_condition = lambda x: x.str.count("condition_two").sum()
function_count_certain_condition.__name__ = 'number_of_two_conditions'
# ---
aggregations = {
    'column_one': ['count', 'first', 'last', 'nunique'],
    'conditions_column': [function_count_certain_condition],
}
df_aggregate_stats = df.groupby(['id_column']).agg(aggregations)
This works, but it doesn't seem particularly pythonic or performant. I tried using value_counts() but got a KeyError.
particularly pythonic
Yes: you are storing a lambda in a variable (the whole point of a lambda is lost if it isn't anonymous) and then shoving a name onto it. Just use def to define the function, that is, replace
function_count_certain_condition = lambda x: x.str.count("condition_two").sum()
function_count_certain_condition.__name__ = 'number_of_two_conditions'
with
def number_of_two_conditions(x):
return x.str.count("condition_two").sum()
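With a named function, the rest of your code stays exactly as it was; the aggregation output column simply picks up the function's name. A sketch reusing the column names from your snippet:
aggregations = {
    'column_one': ['count', 'first', 'last', 'nunique'],
    'conditions_column': [number_of_two_conditions],   # the def from above
}
df_aggregate_stats = df.groupby(['id_column']).agg(aggregations)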
performant
First, be warned against premature optimization: if the code runs fast enough for your use case, don't force it to be faster. As for that particular function, I don't see anything that would cause excessive execution time, since both substring counting and summation are generally fast operations.
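That said, if profiling ever does show the per-group lambda to be the slow part, one possible alternative (a sketch only, using the column names from your question) is to compute the per-row counts with a vectorised string method first and only then group and sum:
# count "condition_two" per row (vectorised), then sum the counts per id,
# instead of calling a Python function once per group
per_row = df['conditions_column'].str.count("condition_two")
number_of_two_conditions_per_id = per_row.groupby(df['id_column']).sum()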
Related
Coming from a SQL development background and currently learning PySpark / Python, I am a bit confused about querying data and chaining methods in Python.
for instance the query below (taken from 'Learning Spark 2nd Edition'):
(fire_ts_df
    .select("CallType")
    .where(col("CallType").isNotNull())
    .groupBy("CallType")
    .count()
    .orderBy("count", ascending=False)
    .show(n=10, truncate=False))
will execute just fine.
What I don't understand, though, is that if I had written the code like this (with the call to count() moved up):
(fire_ts_df
    .select("CallType")
    .count()
    .where(col("CallType").isNotNull())
    .groupBy("CallType")
    .orderBy("count", ascending=False)
    .show(n=10, truncate=False))
this wouldn't work.
The problem is that I don't want to memorize the order, I want to understand it. I feel it has something to do with proper method chaining in Python / PySpark, but I am not sure how to justify it. In other words, in a case like this, where multiple methods should be invoked and chained with (.), what is the right order, and is there any specific rule to follow?
Thanks a lot in advance
The important thing to note here is that chained methods cannot occur in an arbitrary order. The operations these method calls represent are not associative transformations applied flatly to the data from left to right.
Each method call could be written as a separate statement, where each statement produces a result that becomes the input to the next operation, and so on until the final result.
(fire_ts_df
    .select("CallType")                    # selects column CallType into a 1-column DF
    .where(col("CallType").isNotNull())    # filters rows of the 1-column DF from select()
    .groupBy("CallType")                   # groups the filtered DF by that column into a pyspark.sql.group.GroupedData object
    .count()                               # creates a new DF from the GroupedData, with counts
    .orderBy("count", ascending=False)     # sorts the aggregated DF, as a new DF
    .show(n=10, truncate=False))           # prints the last DF
To use your example to explain why the reordered version doesn't work: calling count() on a pyspark.sql.group.GroupedData creates a new data frame with the aggregation results, but count() called on a DataFrame object returns just the number of records. That means the following call, .where(col("CallType").isNotNull()), would be made on a plain integer, which simply doesn't make sense: integers have no such filter method.
As said above, you may visualize it differently by rewriting the code in separate statements:
call_type_df = fire_ts_df.select("CallType")
non_null_call_type = call_type_df.where(col("CallType").isNotNull())
groupings = non_null_call_type.groupBy("CallType")
counts_by_call_type_df = groupings.count()
ordered_counts = counts_by_call_type_df.orderBy("count", ascending=False)
ordered_counts.show(n=10, truncate=False)
As you can see, the ordering is meaningful: each operation must be consistent with the type of output produced by the one before it.
Chained calls form what is referred to as a fluent API, which minimizes verbosity. But that does not change the fact that a chained method must be applicable to the type of the output of the preceding call (and, indeed, that each operation is intended to act on the value produced by the one preceding it).
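If it helps, you can confirm the type difference behind the two count() calls directly in a shell. A small sketch, assuming an active SparkSession, the same fire_ts_df, and from pyspark.sql.functions import col:
# count() on GroupedData yields a DataFrame; count() on a DataFrame yields a plain int
grouped_counts = fire_ts_df.select("CallType").groupBy("CallType").count()
print(type(grouped_counts))                          # pyspark.sql.dataframe.DataFrame
total_rows = fire_ts_df.select("CallType").count()
print(type(total_rows))                              # int, which has no .where() method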
I am trying to use the function getKmers from this post:
def getKmers(sequence, size=6):
return [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]
human_data['words'] = human_data.apply(lambda x: getKmers(x['sequence']), axis=1)
However, my data set has over 50,000 observations, and when I ran this script my notebook crashed every time. How should I optimize it?
One solution I can think of is to divide the data set into pieces and run the code iteratively, but I don't think that will help, since apply() really just iterates through each row anyway. I am not sure what the problem is here.
Try to do it like this. You're unnecessarily going over each column by using apply that way, when you just want to use the sequence column.
human_data['words'] = human_data['sequence'].apply(getKmers)
Edit: While this is faster (you're forgoing the extra lambda call), your original way was not going through each column; I mixed up apply and applymap.
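For what it's worth, a single-column apply like this is roughly equivalent to a plain list comprehension (a sketch, nothing more). Keep in mind that the resulting column of k-mer lists is large either way, so if the notebook still crashes, the result itself may simply not fit in memory:
# roughly equivalent to the single-column apply above
human_data['words'] = [getKmers(seq) for seq in human_data['sequence']]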
I have two datasets from different sources and I would like to filter out redundancy. I have a function called "compare_df" that takes one row from each df and compares them; when they match it returns True, otherwise False.
"compare_df" is fairly complex because the information is not formatted the same way in both sources, and I check for time-window overlaps, so simply checking whether elements match is not possible.
Also, there are no matching columns in the two dfs.
My current solution is using apply twice like in this code:
result_df = first_df[
    first_df.apply(
        lambda x: ~second_df.apply(compare_df, axis=1, args=[x, no_end_time]).any(),
        axis=1)]
Is there an easy way to optimize the code so that it runs faster?
Maybe it is possible to "break" out of the second apply as soon as a True value is returned.
Using iterrows should be slower, by the way, because there shouldn't be much redundancy, so the benefit of breaking out of the loop early probably won't outperform pandas' faster apply.
(I know numba could help, but it seems too complicated for this simple task)
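For reference, the early-exit idea could be sketched roughly like this (hypothetical: it keeps compare_df's argument order from the snippet above and relies on any(), which stops at the first True):
def has_match(row, other_df, no_end_time):
    # any() short-circuits, so other_df is only scanned until the first match
    return any(compare_df(other, row, no_end_time)
               for _, other in other_df.iterrows())

mask = first_df.apply(lambda x: not has_match(x, second_df, no_end_time), axis=1)
result_df = first_df[mask]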
Thanks for suggestions and other hints!
I'm trying to apply multiple functions to my pandas dataframes; the functions are called from other internal scripts. However, when applying them the code gets really long, which I don't think is good engineering practice in general:
df['data_udo_product_brand_mode'] = df['data_udo_product_brand'].fillna(
'[]').apply(think_preprocessing.create_list).apply(think_math.get_mode).apply(lambda x: x.strip('"').lower())
Is there a more efficient and cleaner way to apply multiple functions to a dataframe / dataframe column than the above?
Thanks in advance!
There may well be a better way, but one thing you can do here is separate out the lambda function, which stops the line from getting so long. For example:
# Strip surrounding double quotes and convert to lowercase
strip_lower = lambda x: x.strip('"').lower()
df['data_udo_product_brand_mode'] = (df['data_udo_product_brand'].fillna('[]')
    .apply(think_preprocessing.create_list).apply(think_math.get_mode).apply(strip_lower))
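If you would rather make a single pass over the column and keep the assignment short, the three steps can also be folded into one named function. A sketch that assumes the same think_preprocessing and think_math helpers as in the question:
def brand_mode(raw_value):
    # same three steps as the chained applies, composed per value
    mode = think_math.get_mode(think_preprocessing.create_list(raw_value))
    return mode.strip('"').lower()

df['data_udo_product_brand_mode'] = df['data_udo_product_brand'].fillna('[]').apply(brand_mode)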
I have a Python package installed that displays Google geography info from a search term. I cannot apply the function across all the rows in my data. I can confirm that it works for a single result:
But how can I apply this across every row in the 7th column?
I understand I need a for loop:
But how do I apply my function from the package to each row?
Thanks a lot
You can use the apply function, which is faster than an explicit for loop:
data2['geocode_result'] = data2['ghana_city'].apply(lambda x: gmaps.geocode(x))
The results in the geocode_result column will be JSON-like structures, so you may have to define a custom function to extract the information you want from them. At least apply is a step in the right direction.
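For example, if latitude and longitude are what you need, the extraction could look something like this (a sketch only; it assumes the googlemaps client's usual response shape, a list of dicts with a 'geometry' entry, so check your actual results before relying on these keys):
def extract_lat_lng(results):
    # results is whatever gmaps.geocode() returned for one row; it may be empty
    if not results:
        return None, None
    location = results[0]['geometry']['location']
    return location['lat'], location['lng']

lat_lng = data2['geocode_result'].apply(extract_lat_lng)
data2['lat'] = lat_lng.str[0]
data2['lng'] = lat_lng.str[1]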
pd.Series.map is another, cleaner way to write this:
data2['geocode_result'] = data2['ghana_city'].map(gmaps.geocode)
This is still not truly vectorised, as gmaps.geocode() will be applied to every element in the series. But, similar to @ScratchNPurr's approach, you should see an improvement.