Use dropna on subset for clean-up - python

I need to check for completeness on a subset of my pandas.DataFrame.
Currently I am doing this:
special = df[df.kind=='special']
others = df[df.kind!='special']
special = special.dropna(how='any')
all = pd.concat([special, others])
I am wondering whether I'm missing something in the powerful pandas API that would make this possible in one line?

I don't have access to pandas from where I'm writing; however, pd.DataFrame.isnull() checks whether values are null, and pd.DataFrame.any() can check conditions row by row.
Consequently, if you do
(df.kind != 'special') | ~df.isnull().any(axis=1)
this should give the rows you want to keep. You can just use normal indexing on this expression.
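In one line, that would look something like this (an untested sketch, using the df from the question):
# keep non-'special' rows, plus 'special' rows that have no missing values
clean = df[(df.kind != 'special') | ~df.isnull().any(axis=1)]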
It would be interesting to see whether this speeds things up at all (it checks more rows than your solution does, but may create smaller intermediate DataFrames).

Related

Splitting a DataFrame into filtered "sub-datasets"

So I have a DataFrame with several columns; some contain objects (strings) and some are numerical.
I'd like to create new DataFrames that are "filtered" to the combinations of the object values available.
To be clear, those are my object type columns:
Index(['OS', 'Device', 'Design', 'Language'], dtype='object')
["Design"] and ["Language"] have 3 options each.
I filtered ["OS"] and ["Device"] manually as I needed to match them.
However, now I want to create multiple variables, each containing a "filtered" DataFrame.
For example:
I have
"android_fltr1_d1" to represent the next filter:
["OS"]=android, ["Device"]=1,["Design"]=1
and "android_fltr3_d2" to represent:
["OS"]=android, ["Device"]=3,["Design"]=2
I tried the following code (which works perfectly fine):
android_fltr1_d1 = android_fltr1[android_fltr1["Design"]==1].drop(["Design"],axis=1)
android_fltr1_d2 = android_fltr1[android_fltr1["Design"]==2].drop(["Design"],axis=1)
android_fltr1_d3 = android_fltr1[android_fltr1["Design"]==3].drop(["Design"],axis=1)
android_fltr3_d1 = android_fltr3[android_fltr3["Design"]==1].drop(["Design"],axis=1)
android_fltr3_d2 = android_fltr3[android_fltr3["Design"]==2].drop(["Design"],axis=1)
android_fltr3_d3 = android_fltr3[android_fltr3["Design"]==3].drop(["Design"],axis=1)
android_fltr5_d1 = android_fltr5[android_fltr5["Design"]==1].drop(["Design"],axis=1)
android_fltr5_d2 = android_fltr5[android_fltr5["Design"]==2].drop(["Design"],axis=1)
android_fltr5_d3 = android_fltr5[android_fltr5["Design"]==3].drop(["Design"],axis=1)
As you can guess, I don't find this efficient and would like to use a for loop to generate those variables (I'd need to match each ["Language"] option to each filter I created, for a total of ~60 variables).
I thought about using something like .format() in the loop as a kind of "placeholder" for the variable names, but couldn't find a way to do it.
It would probably be best to use a nested loop to create all the variables, though I'd be content even with a single loop per column.
I'm finding it difficult to build the for loop and would be grateful for any help or directions.
Thanks!
As suggested, I tried to find my answer in: How do I create variable variables?
Yet I failed to understand how to use the globals() function in my case. I also found that '%'-style formatting no longer works.
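As a sketch of what the linked answer is driving at (not from the original post; the key format and the "english" language value are purely illustrative), you could hold the filtered frames in a dict keyed by the filter values instead of creating ~60 separate variables:
filtered = {}

# one entry per (OS, Device, Design, Language) combination present in the data
for (os_name, device, design, language), group in df.groupby(["OS", "Device", "Design", "Language"]):
    key = f"{os_name}_fltr{device}_d{design}_{language}"
    filtered[key] = group.drop(columns=["OS", "Device", "Design", "Language"])

# then e.g. filtered["android_fltr1_d1_english"] instead of a separate variable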

Python / Pyspark - Correct method chaining order rules

Coming from a SQL development background, and currently learning PySpark / Python, I am a bit confused with querying data and chaining methods in Python.
For instance, the query below (taken from 'Learning Spark, 2nd Edition'):
(fire_ts_df
 .select("CallType")
 .where(col("CallType").isNotNull())
 .groupBy("CallType")
 .count()
 .orderBy("count", ascending=False)
 .show(n=10, truncate=False))
will execute just fine.
What I don't understand, though, is that if I had written the code like this (moving the call to count() higher):
(fire_ts_df
 .select("CallType")
 .count()
 .where(col("CallType").isNotNull())
 .groupBy("CallType")
 .orderBy("count", ascending=False)
 .show(n=10, truncate=False))
this wouldn't work.
The problem is that I don't want to memorize the order, I want to understand it. I feel it has something to do with proper method chaining in Python / PySpark, but I'm not sure how to justify it. In other words, in a case like this where multiple methods are invoked and chained with (.), what is the right order, and is there any specific rule to follow?
Thanks a lot in advance
The important thing to note here is that chained methods do not occur in an arbitrary order. The operations represented by these method calls are not associative transformations applied flatly on the data from left to right.
Each method call could be written as a separate statement, where each statement produces a result that forms the input to the next operation, and so on until the final result.
(fire_ts_df
 .select("CallType")                   # selects column CallType into a 1-column DF
 .where(col("CallType").isNotNull())   # filters rows of the 1-column DF from select()
 .groupBy("CallType")                  # groups the filtered DF by that column into a pyspark.sql.group.GroupedData object
 .count()                              # creates a new DF from the GroupedData with the counts
 .orderBy("count", ascending=False)    # sorts the aggregated DF into a new DF
 .show(n=10, truncate=False))          # prints the last DF
Just to use your example to explain why this doesn't work: calling count() on a pyspark.sql.group.GroupedData creates a new DataFrame with the aggregation results. But count() called on a DataFrame object returns just the number of records, which means that the following call, .where(col("CallType").isNotNull()), is made on a plain number, which simply doesn't make sense: numbers don't have a where method.
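To illustrate with a hedged sketch (not from the book): once count() has run on a DataFrame, the chain has collapsed to a plain number, so there is nothing left for where() to act on.
n = fire_ts_df.select("CallType").count()   # n is a plain Python int, not a DataFrame
print(type(n))                              # <class 'int'>

# chaining further DataFrame methods onto it therefore fails, e.g.:
# n.where(col("CallType").isNotNull())      # AttributeError: 'int' object has no attribute 'where'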
As said above, you may visualize it differently by rewriting the code in separate statements:
call_type_df = fire_ts_df.select("CallType")
non_null_call_type = call_type_df.where(col("CallType").isNotNull())
groupings = non_null_call_type.groupBy("CallType")
counts_by_call_type_df = groupings.count()
ordered_counts = counts_by_call_type_df.orderBy("count", ascending=False)
ordered_counts.show(n=10, truncate=False)
As you can see, the ordering is meaningful as the succession of operations is consistent with their respective output.
Chained calls make up what is referred to as a fluent API, which minimizes verbosity. But that does not change the fact that each chained method must be applicable to the type of the output of the preceding call (and, indeed, that the next operation is intended to apply to the value produced by the one before it).

How do I update value in DataFrame with mask when iterating through rows

With the code below I'm trying to update the column df_test['placed'] to 1 when the if statement is triggered and a prediction is placed. I haven't been able to get this to update correctly though; the code runs, but the value isn't set to 1 for the respective predictions placed.
df_test['placed'] = np.zeros(len(df_test))
for i in set(df_test['id']):
    mask = df_test['id'] == i
    predictions = lm.predict(X_test[mask])
    j = np.argmax(predictions)
    if predictions[j] > 0:
        df_test['placed'][mask][j] = 1
        print(df_test['placed'][mask][j])
Answering your question
Edit: changed suggestion based on comments
The assignment part of your code, df_test['placed'][mask][j] = 1, uses what is called chained indexing. In short, your assignment only changes a temporary copy of the DataFrame that gets immediately thrown away, and never changes the original DataFrame.
To avoid this, the rule of thumb when doing assignment is: use only one set of square brackets on a single DataFrame. For your problem, that should look like:
df_test.loc[mask.nonzero()[0][j], 'placed'] = 1
(I know the mask.nonzero() uses two sets of square brackets; actually nonzero() returns a tuple, and the first element of that tuple is an ndarray. But the dataframe only uses one set, and that's the important part.)
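Putting that together, a sketch of the corrected loop (keeping the lm, X_test, and df_test names from the question and assuming the 'placed' column has already been initialised to zeros; I take the row label via df_test.index[mask][j] rather than mask.nonzero(), which avoids assuming a default integer index):
for i in set(df_test['id']):
    mask = df_test['id'] == i
    predictions = lm.predict(X_test[mask])
    j = np.argmax(predictions)
    if predictions[j] > 0:
        row_label = df_test.index[mask][j]    # label of the j-th row inside the mask
        df_test.loc[row_label, 'placed'] = 1  # one .loc call on the original DataFrame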
Some other notes
There are a couple of notes I have on using pandas (& NumPy).
Pandas & NumPy both have a feature called broadcasting. Basically, if you're assigning a single value to an entire array, you don't need to make an array of the same size first; you can just assign the single value, and pandas/NumPy automagically figures out for you how to apply it. So the first line of your code can be replaced with df_test['placed'] = 0, and it accomplishes the same thing.
Generally speaking, when working with pandas & NumPy objects, loops are bad; usually you can find a way to use some combination of broadcasting, element-wise operations, and boolean indexing to do what a loop would do. And because of the way those features are designed, it'll run a lot faster too. Unfortunately I'm not familiar enough with the lm.predict method to say, but you might be able to avoid the whole for loop entirely for this code.

What is the purpose of the alias method in PySpark?

While learning Spark in Python, I'm having trouble understanding both the purpose of the alias method and its usage. The documentation shows it being used to create copies of an existing DataFrame with new names, then join them together:
>>> from pyspark.sql.functions import *
>>> df_as1 = df.alias("df_as1")
>>> df_as2 = df.alias("df_as2")
>>> joined_df = df_as1.join(df_as2, col("df_as1.name") == col("df_as2.name"), 'inner')
>>> joined_df.select("df_as1.name", "df_as2.name", "df_as2.age").collect()
[Row(name=u'Bob', name=u'Bob', age=5), Row(name=u'Alice', name=u'Alice', age=2)]
My question has two parts:
What is the purpose of the alias input? It seems redundant to give the alias string "df_as1" when we are already assigning the new DataFrame to the variable df_as1. If we were to instead use df_as1 = df.alias("new_df"), where would "new_df" ever appear?
In general, when is the alias function useful? The example above feels a bit artificial, but from exploring tutorials and examples it seems to be used regularly -- I'm just not clear on what value it provides.
Edit: some of my original confusion came from the fact that both DataFrame and Column have alias methods. Nevertheless, I'm still curious about both of the above questions, with question 2 now applying to Column.alias as well.
The variable name is irrelevant and can be whatever you like. It's the alias that will be used in string column identifiers and printouts.
I think the main purpose of aliases is brevity and avoiding confusion when you have conflicting column names. For example, what was simply 'age' could be aliased to 'max_age' after you searched for the biggest value in that column. Or you could have a DataFrame of a company's employees joined with itself and filtered so that you have manager-subordinate pairs; it could be useful to use column names like "manager.name" in such a context.
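For illustration, a small sketch (not from the original answer; the employees data and column names are made up) showing both Column.alias and DataFrame.alias:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max as max_

spark = SparkSession.builder.getOrCreate()
employees = spark.createDataFrame(
    [(1, "Alice", 42, 0), (2, "Bob", 35, 1), (3, "Carol", 29, 1)],  # manager_id 0 = no manager
    ["id", "name", "age", "manager_id"],
)

# Column.alias: name the aggregate something nicer than "max(age)"
employees.agg(max_("age").alias("max_age")).show()

# DataFrame.alias: self-join to pair each employee with their manager
emp = employees.alias("emp")
mgr = employees.alias("mgr")
(emp.join(mgr, col("emp.manager_id") == col("mgr.id"), "inner")
    .select(col("mgr.name").alias("manager"), col("emp.name").alias("subordinate"))
    .show())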

OpenPyXL Using Built-In Conditional Formatting ie: Duplicate and Unique Values

I am writing a Python method that checks a specific column in Excel and highlights duplicate values in red (if any), then copies those rows onto a separate sheet that I will use to check why they have duplicate values. This is just for asset management, where I want to make sure there are no two identical serial numbers or Asset ID numbers, etc.
At the moment I just want to check the column and highlight duplicate values in red. I have this method started and it runs, it just does not highlight any of the cells that have duplicate values. I am using a test sheet with these values in column A:
336, 565, 635, 567, 474, 326, 366, 756, 879, 567, 453, 657, 678, 324, 987, 667, 567, 657, 567. The number 567 repeats a few times.
def check_duplicate_values(self, wb):
    self.wb = wb
    ws = self.wb.active
    dxf = DifferentialStyle(fill=self.red_fill())
    rule = Rule(type="duplicateValues", dxf=dxf, stopIfTrue=None, formula=['COUNTIF($A$1:$A1,A1)>1'])
    ws.conditional_formatting.add('Sheet1!$A:$A', rule)  # Not sure if I need this
    self.wb.save('test.xlsx')
In Excel, I can just create a Conditional Format rule to accomplish this however in OpenPyXL I am not sure if I am using their built-in methods correctly. Also, could my formula be incorrect?
Whose built-in methods are you referring to? openpyxl is a file-format library and hence lets you manage conditional formats as they are stored in Excel worksheets. Unfortunately, the details of the rules are not very clear from the specification, so some reverse engineering from an existing file is generally required, though it's probably worth noting that rules created by Excel are almost always more verbose than actually required.
I would direct further questions to the openpyxl mailing list.
Just remove the formula and you're good to go.
duplicate_rule = Rule(type="duplicateValues", dxf=dxf, stopIfTrue=None)
You can also use a unique rule:
unique_rule = Rule(type="uniqueValues", dxf=dxf, stopIfTrue=None)
Check this out for more info: https://openpyxl.readthedocs.io/en/stable/_modules/openpyxl/formatting/rule.html#RuleType
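Putting that suggestion together, a minimal end-to-end sketch (the red fill colour, the standalone script structure, and the A1:A19 range are illustrative assumptions):
from openpyxl import Workbook
from openpyxl.styles import PatternFill
from openpyxl.styles.differential import DifferentialStyle
from openpyxl.formatting.rule import Rule

wb = Workbook()
ws = wb.active
for value in (336, 565, 635, 567, 474, 326, 366, 756, 879, 567,
              453, 657, 678, 324, 987, 667, 567, 657, 567):
    ws.append([value])

red_fill = PatternFill(start_color="FFC7CE", end_color="FFC7CE", fill_type="solid")
dxf = DifferentialStyle(fill=red_fill)

# Excel's built-in duplicateValues rule needs no formula
duplicate_rule = Rule(type="duplicateValues", dxf=dxf, stopIfTrue=None)
ws.conditional_formatting.add("A1:A19", duplicate_rule)

wb.save("test.xlsx")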
