Python / PySpark - Correct method chaining order rules

Coming from a SQL development background, and currently learning PySpark / Python, I am a bit confused about querying data and chaining methods in Python.
For instance, the query below (taken from 'Learning Spark 2nd Edition'):
(fire_ts_df
  .select("CallType")
  .where(col("CallType").isNotNull())
  .groupBy("CallType")
  .count()
  .orderBy("count", ascending=False)
  .show(n=10, truncate=False))
will execute just fine.
What I don't understand, though, is that if I had written the code like this (with the call to 'count()' moved up):
(fire_ts_df
  .select("CallType")
  .count()
  .where(col("CallType").isNotNull())
  .groupBy("CallType")
  .orderBy("count", ascending=False)
  .show(n=10, truncate=False))
this wouldn't work.
The problem is that I don't want to memorize the order; I want to understand it. I feel it has something to do with proper method chaining in Python / PySpark, but I am not sure how to justify it. In other words, in a case like this, where multiple methods are invoked and chained with (.), what is the right order, and is there any specific rule to follow?
Thanks a lot in advance

The important thing to note here is that chained method calls do not occur in an arbitrary order. The operations these calls represent are not associative transformations applied flatly to the data from left to right.
Each method call could be written as a separate statement, where each statement produces a result that makes the input to the next operation, and so on until the result.
(fire_ts_df
  .select("CallType")                  # selects column CallType into a 1-col DF
  .where(col("CallType").isNotNull())  # filters rows of the 1-column DF from select()
  .groupBy("CallType")                 # groups the filtered DF by that column into a pyspark.sql.group.GroupedData object
  .count()                             # creates a new DF off the GroupedData with counts
  .orderBy("count", ascending=False)   # sorts the aggregated DF, as a new DF
  .show(n=10, truncate=False))         # prints the last DF
Just to use your example to explain why this doesn't work: calling count() on a pyspark.sql.group.GroupedData creates a new DataFrame with aggregation results, but count() called on a DataFrame object returns just the number of records. That means the following call, .where(col("CallType").isNotNull()), is made on a plain integer, which simply doesn't make sense: integers have no such filter method.
As said above, you may visualize it differently by rewriting the code in separate statements:
call_type_df = fire_ts_df.select("CallType")
non_null_call_type = call_type_df.where(col("CallType").isNotNull())
groupings = non_null_call_type.groupBy("CallType")
counts_by_call_type_df = groupings.count()
ordered_counts = counts_by_call_type_df.orderBy("count", ascending=False)
ordered_counts.show(n=10, truncate=False)
As you can see, the ordering is meaningful as the succession of operations is consistent with their respective output.
Chained calls make up what is referred to as a fluent API, which minimizes verbosity. But this does not change the fact that each chained method must be applicable to the type of the output of the preceding call (in other words, each operation is applied to the value produced by the one before it).
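To see the rule outside of Spark, here is a minimal, hypothetical sketch (the Table and Grouped classes are invented for illustration): each method's return type determines what may legally be chained next, just as groupBy() yields GroupedData while count() on a DataFrame yields a plain number.

```python
# A minimal, hypothetical fluent API (not Spark itself): each method returns
# a value whose type determines which calls are legal next.
class Table:
    def __init__(self, rows):
        self.rows = rows

    def where(self, pred):
        return Table([r for r in self.rows if pred(r)])  # Table -> Table

    def group_by(self, key):
        return Grouped(self.rows, key)                   # Table -> Grouped

    def count(self):
        return len(self.rows)                            # Table -> int: the chain ends here

class Grouped:
    def __init__(self, rows, key):
        self.rows, self.key = rows, key

    def count(self):
        # Aggregation yields a new Table, so chaining can continue.
        counts = {}
        for row in self.rows:
            counts[row[self.key]] = counts.get(row[self.key], 0) + 1
        return Table([{self.key: k, "count": n} for k, n in counts.items()])

t = Table([{"CallType": "Fire"}, {"CallType": "Fire"}, {"CallType": None}])
result = (t
          .where(lambda r: r["CallType"] is not None)  # Table
          .group_by("CallType")                        # Grouped
          .count())                                    # Table again
print(result.rows)  # [{'CallType': 'Fire', 'count': 2}]
# By contrast, t.count() returns the int 3, and 3 has no .where() method,
# so putting count() earlier breaks the chain -- exactly the error above.
```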

Related

Counting occurrence of string / category pandas groupby aggregate

I have data in a tabular format with one id per row. In one column I have set a flag that takes one or more categorical values, i.e. condition_one, condition_two.
I'm generating summary statistics using the code below:
function_count_certain_condition = lambda x: x.str.count("condition_two").sum()
function_count_certain_condition.__name__ = 'number_of_two_conditions'
# ---
aggregations = {
    'column_one': ['count', 'first', 'last', 'nunique'],
    'conditions_column': [function_count_certain_condition]
}
df_aggregate_stats = df.groupby(['id_column']).agg(aggregations)
This works but doesn't seem particularly pythonic or performant. I tried using value_counts() but got a KeyError.
particularly pythonic
Yes: you are assigning a lambda to a variable (the whole point of a lambda is lost if it is not nameless) and then patching a name onto it afterwards. Just use def to define a named function; that is, replace
function_count_certain_condition = lambda x: x.str.count("condition_two").sum()
function_count_certain_condition.__name__ = 'number_of_two_conditions'
with
def number_of_two_conditions(x):
    return x.str.count("condition_two").sum()
performant
First, be warned against premature optimization: if the code runs fast enough for your use case, do not try to force it to be faster. As for this particular function, I do not see anything that would cause excessive execution time, since both substring counting and summation are generally fast operations.
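Putting the suggestion together, a small runnable sketch (the DataFrame contents are made up to mirror the question's column names):

```python
import pandas as pd

def number_of_two_conditions(x):
    # counts occurrences of the substring in each value of the group, then sums
    return x.str.count("condition_two").sum()

df = pd.DataFrame({
    "id_column": [1, 1, 2],
    "column_one": ["a", "b", "c"],
    "conditions_column": ["condition_one",
                          "condition_two",
                          "condition_two,condition_two"],
})

aggregations = {
    "column_one": ["count", "first", "last", "nunique"],
    "conditions_column": [number_of_two_conditions],
}
df_aggregate_stats = df.groupby("id_column").agg(aggregations)

# The function's __name__ becomes the column label automatically,
# with no manual __name__ patching needed.
print(df_aggregate_stats[("conditions_column", "number_of_two_conditions")].tolist())
```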

Is overwriting variables names for lengthy operations bad style?

I quite often find myself in a situation where I undertake several steps to get from my start data input to the output I want to have, e.g. in functions/loops. To avoid making my lines very long, I sometimes overwrite the variable name I am using in these operations.
One example would be:
df_2 = df_1.loc[df_1['id'] == val]
df_2 = df_2[['c1', 'c2']]
df_2 = df_2.merge(df3, left_on='c1', right_on='c1')
The only alternative I can come up with is:
df_2 = df_1.loc[df_1['id'] == val][['c1', 'c2']]\
    .merge(df3, left_on='c1', right_on='c1')
But none of these options feels really clean. How should these situations be handled?
You can refer to this article which discusses exactly your question.
The pandas core team now encourages the use of "method chaining".
This is a style of programming in which you chain together multiple
method calls into a single statement. This allows you to pass
intermediate results from one method to the next rather than storing
the intermediate results using variables.
In addition to prettifying chained code with brackets and indentation, as in #perl's answer, you might also find functions like .query() and .assign() useful when coding in a "method chaining" style.
Of course, there are some drawbacks for method chaining, especially when excessive:
"One drawback to excessively long chains is that debugging can be
harder. If something looks wrong at the end, you don't have
intermediate values to inspect."
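For instance, a sketch of that style with .query() (toy frames whose column names mirror the question's):

```python
import pandas as pd

df_1 = pd.DataFrame({"id": [1, 1, 2], "c1": ["x", "y", "z"], "c2": [10, 20, 30]})
df3 = pd.DataFrame({"c1": ["x", "y"], "c3": [100, 200]})

# One statement, no intermediate variables: filter, project, then merge.
df_2 = (df_1
        .query("id == 1")        # keep rows where id == 1
        [["c1", "c2"]]           # select the two columns of interest
        .merge(df3, on="c1"))    # join on the shared key
print(df_2)
```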
Just as another option, you can put everything in brackets and then break the lines, like this:
df_2 = (df_1
        .loc[df_1['id'] == val][['c1', 'c2']]
        .merge(df3, left_on='c1', right_on='c1'))
It's generally pretty readable even if you have a lot of lines, and if you want to change the name of the output variable, you only need to change it in one place. So: a bit less verbose, and a bit easier to modify than overwriting the variables.

How to (fast) iterate over two Dataframes (Pandas) with functions applied for comparison

I have two datasets from different sources and I would like to filter out redundancy, so I have a function called "compare_df" that takes one row from each df and compares them; when they match it returns True, else False.
But "compare_df" is more complex than that, because the information is not formatted the same way and I check for time-window overlaps, so simply checking whether elements match is not possible.
Also, there are no matching columns in the two dfs.
My current solution is using apply twice like in this code:
result_df = first_df[first_df.apply(
    lambda x: ~second_df.apply(compare_df, axis=1, args=[x, no_end_time]).any(),
    axis=1)]
Is there an easy way to optimize the code so that it runs faster?
Maybe it is possible to "break" out of the second apply as soon as a True value is returned.
Using iterrows should be slower, by the way, because there shouldn't be much redundancy, so the benefit of easily breaking out of the loop probably won't outweigh the faster apply from pandas.
(I know numba could help, but it seems too complicated for this simple task)
Thanks for suggestions and other hints!
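One way to get the early exit asked about above (a sketch, not from the original post; the compare_df body and the "key" column are hypothetical stand-ins for the real comparison) is to drop apply for the inner loop and use any() over a generator, which short-circuits on the first match:

```python
import pandas as pd

# Hypothetical comparison: rows "match" when their 'key' values are equal.
# The real compare_df would check formats and time-window overlaps instead.
def compare_df(row_a, row_b):
    return row_a["key"] == row_b["key"]

first_df = pd.DataFrame({"key": [1, 2, 3]})
second_df = pd.DataFrame({"key": [2, 9]})

# Keep rows of first_df that match no row of second_df.
# any() over a generator stops as soon as compare_df returns True.
second_rows = second_df.to_dict("records")
keep = [not any(compare_df(row, other) for other in second_rows)
        for row in first_df.to_dict("records")]
result_df = first_df[keep]
print(result_df["key"].tolist())  # rows 1 and 3 have no match in second_df
```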

How do I update value in DataFrame with mask when iterating through rows

With the code below I'm trying to update the column df_test['placed'] to 1 when the if statement is triggered and a prediction is placed. I haven't been able to get this to update correctly, though: the code runs, but never sets the value to 1 for the respective predictions placed.
df_test['placed'] = np.zeros(len(df_test))
for i in set(df_test['id']):
    mask = df_test['id'] == i
    predictions = lm.predict(X_test[mask])
    j = np.argmax(predictions)
    if predictions[j] > 0:
        df_test['placed'][mask][j] = 1
        print(df_test['placed'][mask][j])
Answering your question
Edit: changed suggestion based on comments
The assignment part of your code, df_test['placed'][mask][j] = 1, uses what is called chained indexing. In short, your assignment only changes a temporary copy of the DataFrame that gets immediately thrown away, and never changes the original DataFrame.
To avoid this, the rule of thumb when doing assignment is: use only one set of square brackets on a single DataFrame. For your problem, that should look like:
df_test.loc[mask.nonzero()[0][j], 'placed'] = 1
(I know the mask.nonzero() uses two sets of square brackets; actually nonzero() returns a tuple, and the first element of that tuple is an ndarray. But the dataframe only uses one set, and that's the important part.)
Some other notes
There are a couple notes I have on using pandas (& numpy).
Pandas & NumPy both have a feature called broadcasting. Basically, if you're assigning a single value to an entire array, you don't need to make an array of the same size first; you can just assign the single value, and pandas/NumPy automagically figures out for you how to apply it. So the first line of your code can be replaced with df_test['placed'] = 0, and it accomplishes the same thing.
Generally speaking when working with pandas & numpy objects, loops are bad; usually you can find a way to use some combination of broadcasting, element-wise operations and boolean indexing to do what a loop would do. And because of the way those features are designed, it'll run a lot faster too. Unfortunately I'm not familiar enough with the lm.predict method to say, but you might be able to avoid the whole for-loop entirely for this code.
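A minimal, self-contained demonstration of the difference (toy data; since the model isn't shown, lm.predict is replaced by a precomputed dictionary of hypothetical prediction arrays):

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({"id": [0, 0, 1, 1]})
df_test["placed"] = 0  # broadcasting: no np.zeros(len(df_test)) needed

# Stand-in for lm.predict(X_test[mask]): one prediction array per id.
fake_predictions = {0: np.array([0.2, 0.9]), 1: np.array([-0.5, -0.1])}

for i in set(df_test["id"]):
    mask = (df_test["id"] == i).to_numpy()
    predictions = fake_predictions[i]
    j = np.argmax(predictions)
    if predictions[j] > 0:
        # One set of brackets on the DataFrame: the assignment sticks.
        df_test.loc[mask.nonzero()[0][j], "placed"] = 1

print(df_test["placed"].tolist())  # only the best positive prediction is marked
```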

Use dropna on subset for clean-up

I need to check for completeness on a subset of my pandas.DataFrame.
Currently I am doing this:
special = df[df.kind=='special']
others = df[df.kind!='special']
special = special.dropna(how='any')
all = pd.concat([special, others])
I am wondering if I'm not missing anything of the powerful Pandas API that makes this possible in one line?
I don't have access to Pandas from where I'm writing, however pd.DataFrame.isnull() checks whether things are null, and pd.DataFrame.any() can check conditions by row.
Consequently, if you do
(df.kind != 'special') | ~df.isnull().any(axis=1)
this should give the rows you want to keep. You can just use normal indexing on this expression.
It would be interesting to see if this at all speeds things up (it checks things on more rows than your solution does, but might create smaller DataFrames).
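A quick runnable check of this mask against the original multi-step approach (toy data; the `value` column is invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "kind": ["special", "special", "other"],
    "value": [1.0, np.nan, np.nan],
})

# Original multi-step version: split, drop incomplete 'special' rows, recombine.
special = df[df.kind == "special"].dropna(how="any")
others = df[df.kind != "special"]
combined = pd.concat([special, others])

# One-line boolean mask: keep non-special rows, and special rows with no NaNs.
kept = df[(df.kind != "special") | ~df.isnull().any(axis=1)]

print(sorted(kept.index.tolist()))  # the incomplete 'special' row is dropped
```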
