Pandas dropna not working as expected - python

I have a dataframe and I used dropna() on it successfully as shown:
proc_train.isnull().any()
id False
perc_premium_paid_by_cash_credit False
age_in_days False
Income False
Count_3-6_months_late False
Count_6-12_months_late False
Count_more_than_12_months_late False
application_underwriting_score False
no_of_premiums_paid False
premium False
renewal False
sourcing_channel_B False
sourcing_channel_C False
sourcing_channel_D False
sourcing_channel_E False
Urban/Rural False
prem_to_inc_ratio False
late36_612 False
late36_12more False
late612_12more False
perc_times_prem False
Then I try to take a selection of the data to use as input variables:
X_train = proc_train.loc[:, proc_train.columns != 'renewal']
X_train = X.loc[:, X.columns != 'id']
but then null values show up again:
X_train.isnull().any()
perc_premium_paid_by_cash_credit False
age_in_days False
Income False
Count_3-6_months_late True
Count_6-12_months_late True
Count_more_than_12_months_late True
application_underwriting_score True
no_of_premiums_paid False
premium False
sourcing_channel_B False
sourcing_channel_C False
sourcing_channel_D False
sourcing_channel_E False
Urban/Rural False
prem_to_inc_ratio False
late36_612 True
late36_12more True
late612_12more True
perc_times_prem False
Why does this happen and what would be a better way to run this?

This line:
X_train = X.loc[:, X.columns != 'id']
selects from X, which is presumably a different, not-yet-cleaned DataFrame, so that is where the NaNs come from. It should be
X_train = X_train.loc[:, X_train.columns != 'id']
which produces the same all-False result for isnull().any() as before.
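A slightly more direct way to build the input matrix, assuming proc_train has already been cleaned with dropna(), is to drop both columns in a single call:
# Drop the target and the id column in one step instead of two boolean masks.
X_train = proc_train.drop(columns=['renewal', 'id'])
y_train = proc_train['renewal']

# Sanity check: no nulls should remain after the earlier dropna().
print(X_train.isnull().any().any())  # False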

Related

Editorconfig for python docstring

I am using the PyCharm IDE and I would like to automate the indentation of Python docstrings with the .editorconfig file. I can control almost everything with the following configuration (included in case it is useful):
[{*.py,*.pyw}]
ij_python_align_collections_and_comprehensions = true
ij_python_align_multiline_imports = true
ij_python_align_multiline_parameters = true
ij_python_align_multiline_parameters_in_calls = true
ij_python_blank_line_at_file_end = true
ij_python_blank_lines_after_imports = 1
ij_python_blank_lines_after_local_imports = 0
ij_python_blank_lines_around_class = 1
ij_python_blank_lines_around_method = 1
ij_python_blank_lines_around_top_level_classes_functions = 2
ij_python_blank_lines_before_first_method = 0
ij_python_call_parameters_new_line_after_left_paren = false
ij_python_call_parameters_right_paren_on_new_line = false
ij_python_call_parameters_wrap = normal
ij_python_dict_alignment = 0
ij_python_dict_new_line_after_left_brace = false
ij_python_dict_new_line_before_right_brace = false
ij_python_dict_wrapping = 1
ij_python_from_import_new_line_after_left_parenthesis = false
ij_python_from_import_new_line_before_right_parenthesis = false
ij_python_from_import_parentheses_force_if_multiline = false
ij_python_from_import_trailing_comma_if_multiline = false
ij_python_from_import_wrapping = 1
ij_python_hang_closing_brackets = false
ij_python_keep_blank_lines_in_code = 1
ij_python_keep_blank_lines_in_declarations = 1
ij_python_keep_indents_on_empty_lines = false
ij_python_keep_line_breaks = true
ij_python_method_parameters_new_line_after_left_paren = false
ij_python_method_parameters_right_paren_on_new_line = false
ij_python_method_parameters_wrap = normal
ij_python_new_line_after_colon = false
ij_python_new_line_after_colon_multi_clause = true
ij_python_optimize_imports_always_split_from_imports = false
ij_python_optimize_imports_case_insensitive_order = false
ij_python_optimize_imports_join_from_imports_with_same_source = false
ij_python_optimize_imports_sort_by_type_first = true
ij_python_optimize_imports_sort_imports = true
ij_python_optimize_imports_sort_names_in_from_imports = false
ij_python_space_after_comma = true
ij_python_space_after_number_sign = true
ij_python_space_after_py_colon = true
ij_python_space_before_backslash = true
ij_python_space_before_comma = false
ij_python_space_before_for_semicolon = false
ij_python_space_before_lbracket = false
ij_python_space_before_method_call_parentheses = false
ij_python_space_before_method_parentheses = false
ij_python_space_before_number_sign = true
ij_python_space_before_py_colon = false
ij_python_space_within_empty_method_call_parentheses = false
ij_python_space_within_empty_method_parentheses = false
ij_python_spaces_around_additive_operators = true
ij_python_spaces_around_assignment_operators = true
ij_python_spaces_around_bitwise_operators = true
ij_python_spaces_around_eq_in_keyword_argument = false
ij_python_spaces_around_eq_in_named_parameter = false
ij_python_spaces_around_equality_operators = true
ij_python_spaces_around_multiplicative_operators = true
ij_python_spaces_around_power_operator = true
ij_python_spaces_around_relational_operators = true
ij_python_spaces_around_shift_operators = true
ij_python_spaces_within_braces = false
ij_python_spaces_within_brackets = false
ij_python_spaces_within_method_call_parentheses = false
ij_python_spaces_within_method_parentheses = false
ij_python_use_continuation_indent_for_arguments = false
ij_python_use_continuation_indent_for_collection_and_comprehensions = false
ij_python_use_continuation_indent_for_parameters = true
ij_python_wrap_long_lines = false
It would be rather useful to manage all of the indentation configuration from the same file, that is, to add the docstring configuration to the .editorconfig as well. Does anyone know if it is possible to control the docstring style through .editorconfig?

Sklearn KernelDensity gives identical results for two different models

I'm having trouble with KernelDensity from sklearn. I put in two completely different arrays to create two different models, but the two models have identical results (scores). They should have different results for different models, shouldn't they?
Here's my code:
from sklearn.neighbors import KernelDensity
import numpy as np

kde = KernelDensity(kernel="gaussian", bandwidth=15)

def reproducible_example():
    X1 = np.array([9, 18, 28, 35, 54, 59, 65, 83, 89, 116, 119, 124, 144])
    X2 = np.array([39, 51, 57, 61, 66, 81, 88, 103, 120, 126, 130, 132, 134])

    model1 = kde.fit(X1[:, np.newaxis])
    model2 = kde.fit(X2[:, np.newaxis])

    X_plot = np.linspace(0, 129, 130)
    score1 = model1.score_samples(X_plot[:, np.newaxis])
    score2 = model2.score_samples(X_plot[:, np.newaxis])
    print(np.exp(score1) == np.exp(score2))

reproducible_example()
The expected output is False, as the two different models should return two different scores. Instead, the output is this:
[ True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True]
Indicating that the two results are identical. How is that possible?
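A likely explanation, with a minimal sketch of the fix: KernelDensity.fit returns the estimator itself, so model1 and model2 are just two names for the same kde object, which was last fitted on X2. Fitting two separate estimators keeps the results distinct:
from sklearn.neighbors import KernelDensity
import numpy as np

X1 = np.array([9, 18, 28, 35, 54, 59, 65, 83, 89, 116, 119, 124, 144])
X2 = np.array([39, 51, 57, 61, 66, 81, 88, 103, 120, 126, 130, 132, 134])

# Two independent estimators, so the second fit cannot overwrite the first.
kde1 = KernelDensity(kernel="gaussian", bandwidth=15).fit(X1[:, np.newaxis])
kde2 = KernelDensity(kernel="gaussian", bandwidth=15).fit(X2[:, np.newaxis])

X_plot = np.linspace(0, 129, 130)[:, np.newaxis]
print(np.allclose(kde1.score_samples(X_plot), kde2.score_samples(X_plot)))  # False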

The pandas any() function doesn't return the result I want

I have the following DataFrame
import pandas as pd

df = pd.DataFrame(
    {
        'class': ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'],
        'item': ['1', '1', '2', '2', '2', '3', '3', '3', '3', '3', '4', '4', '5', '5', '5', '5', '5', '5', '5'],
        'last_PO_code': ['103', '103', '103', '104', '103', '103', '104', '105', '106', '103', '103', '104', '103', '103', '104', '105', '105', '106', '1046'],
        'qty': [3, 4, 3, 3, 2, 4, 4, 3, 3, 3, 5, 5, 2, 6, 8, 2, 6, 2, 6],
    }
)
I apply the following rules to each unique item in the item column of this DataFrame:
1. last_PO_code has '103' only.
2. last_PO_code has ('103' & '104') and (qty of '103' > qty of '104').
3. last_PO_code has ('103' & '104' & '105' & '106') and (qty of '105' == qty of '106') and (qty of '103' > qty of '104').
4. last_PO_code doesn't have '103'.
5. last_PO_code has ('103' & '104') and (qty of '103' == qty of '104').
6. last_PO_code has ('103' & '104' & '105' & '106') and (qty of '105' == qty of '106') and (qty of '103' == qty of '104').
I wrote the following code, but the result is not what I want.
regle1 = lambda x: True if x['last_PO_code'].eq('103').all() else False
regle2 = lambda x: True if x['last_PO_code'].eq('103').any() \
    and x['last_PO_code'].eq('104').any() \
    and x['last_PO_code'].eq('103').sum() > x['last_PO_code'].eq('104').sum() \
    else False
regle3 = lambda x: True if x['last_PO_code'].eq('103').any() \
    and x['last_PO_code'].eq('104').any() \
    and x['last_PO_code'].eq('105').any() \
    and x['last_PO_code'].eq('106').any() \
    and x['last_PO_code'].eq('103').sum() > x['last_PO_code'].eq('104').sum() \
    and x['last_PO_code'].eq('105').sum() == x['last_PO_code'].eq('106').sum() \
    else False
regle4 = lambda x: False if x['last_PO_code'].eq('103').any() else True
regle5 = lambda x: True if (x['last_PO_code'].eq('103').any() \
    and x['last_PO_code'].eq('104').any()) \
    and x['last_PO_code'].eq('103').sum() == x['last_PO_code'].eq('104').sum() \
    else False
regle6 = lambda x: True if x['last_PO_code'].eq('103').any() \
    and x['last_PO_code'].eq('104').any() \
    and x['last_PO_code'].eq('105').any() \
    and x['last_PO_code'].eq('106').any() \
    and x['last_PO_code'].eq('103').sum() == x['last_PO_code'].eq('104').sum() \
    and x['last_PO_code'].eq('105').sum() == x['last_PO_code'].eq('106').sum() \
    else False
df2 = df.groupby(['class', 'item']).apply(lambda x: pd.Series({'regle1': regle1(x),
                                                               'regle2': regle2(x),
                                                               'regle3': regle3(x),
                                                               'regle4': regle4(x),
                                                               'regle5': regle5(x),
                                                               'regle6': regle6(x)}))
Only regle1 does what I want for all items. I think the problem comes from the any() function: either I'm using it badly or I don't understand it well.
What I have:
            regle1  regle2  regle3  regle4  regle5  regle6
class item
0     1       True   False   False   False   False   False
      2      False    True   False   False   False   False
      3      False    True    True   False   False   False
      4      False   False   False   False    True   False
      5      False    True    True   False   False   False
What I want:
            regle1  regle2  regle3  regle4  regle5  regle6
class item
0     1       True   False   False   False   False   False
      2      False    True   False   False   False   False
      3      False    True    True   False   False   False
      4      False   False   False   False    True   False
      5      False   False   False   False    True    True
All the mistakes I noticed were on item 5, but I don't understand why.
The problem is that you are summing the number of matching last_PO_code values instead of summing qty. In each lambda, you need:
(x['last_PO_code'].eq('103')*x['qty']).sum()
or as mozway suggested, even better:
x.loc[x['last_PO_code'].eq('103'), 'qty'].sum()
instead of:
x['last_PO_code'].eq('103').sum()
The whole code:
regle1 = lambda x: True if x['last_PO_code'].eq('103').all() else False
regle2 = lambda x: True if x['last_PO_code'].eq('103').any() \
    and x['last_PO_code'].eq('104').any() \
    and (x['last_PO_code'].eq('103') * x['qty']).sum() > (x['last_PO_code'].eq('104') * x['qty']).sum() \
    else False
regle3 = lambda x: True if x['last_PO_code'].eq('103').any() \
    and x['last_PO_code'].eq('104').any() \
    and x['last_PO_code'].eq('105').any() \
    and x['last_PO_code'].eq('106').any() \
    and (x['last_PO_code'].eq('103') * x['qty']).sum() > (x['last_PO_code'].eq('104') * x['qty']).sum() \
    and (x['last_PO_code'].eq('105') * x['qty']).sum() == (x['last_PO_code'].eq('106') * x['qty']).sum() \
    else False
regle4 = lambda x: False if x['last_PO_code'].eq('103').any() else True
regle5 = lambda x: True if (x['last_PO_code'].eq('103').any() \
    and x['last_PO_code'].eq('104').any()) \
    and (x['last_PO_code'].eq('103') * x['qty']).sum() == (x['last_PO_code'].eq('104') * x['qty']).sum() \
    else False
regle6 = lambda x: True if x['last_PO_code'].eq('103').any() \
    and x['last_PO_code'].eq('104').any() \
    and x['last_PO_code'].eq('105').any() \
    and x['last_PO_code'].eq('106').any() \
    and (x['last_PO_code'].eq('103') * x['qty']).sum() == (x['last_PO_code'].eq('104') * x['qty']).sum() \
    and (x['last_PO_code'].eq('105') * x['qty']).sum() == (x['last_PO_code'].eq('106') * x['qty']).sum() \
    else False
df2 = df.groupby(['class', 'item']).apply(lambda x: pd.Series({'regle1': regle1(x),
                                                               'regle2': regle2(x),
                                                               'regle3': regle3(x),
                                                               'regle4': regle4(x),
                                                               'regle5': regle5(x),
                                                               'regle6': regle6(x),
                                                               }))
#             regle1  regle2  regle3  regle4  regle5  regle6
# class item
# 0     1       True   False   False   False   False   False
#       2      False    True   False   False   False   False
#       3      False    True    True   False   False   False
#       4      False   False   False   False    True   False
#       5      False   False   False   False    True    True
PS. At this point it might be time to use normal functions instead of lambdas, to get cleaner code :D. You also have repeated chunks of code in your lambdas that could easily be factored out.
PS2. I assume your example data has a typo (it should be 106 instead of 1046).
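A sketch of the function-based refactoring hinted at in the PS, assuming the same df and pandas import as above; qty_for and has are helper names introduced here for illustration:
def qty_for(x, code):
    # Total qty of the rows whose last_PO_code equals the given code.
    return x.loc[x['last_PO_code'].eq(code), 'qty'].sum()

def has(x, *codes):
    # True if every given code appears at least once in the group.
    return all(x['last_PO_code'].eq(c).any() for c in codes)

def rules(x):
    return pd.Series({
        'regle1': x['last_PO_code'].eq('103').all(),
        'regle2': has(x, '103', '104') and qty_for(x, '103') > qty_for(x, '104'),
        'regle3': has(x, '103', '104', '105', '106')
                  and qty_for(x, '103') > qty_for(x, '104')
                  and qty_for(x, '105') == qty_for(x, '106'),
        'regle4': not x['last_PO_code'].eq('103').any(),
        'regle5': has(x, '103', '104') and qty_for(x, '103') == qty_for(x, '104'),
        'regle6': has(x, '103', '104', '105', '106')
                  and qty_for(x, '103') == qty_for(x, '104')
                  and qty_for(x, '105') == qty_for(x, '106'),
    })

df2 = df.groupby(['class', 'item']).apply(rules)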

How can I hide columns in Openpyxl?

I'm hiding a bunch of columns in an Excel sheet, and I'm getting this error: AttributeError: can't set attribute, from this line: worksheet.column_dimensions['B'].visible = False
Sorry if this is a super simple question. I just updated to a new version of openpyxl/pandas, so I'm now having to go through my code and make changes to fit the new version's documentation.
worksheet.column_dimensions['B'].visible = False
worksheet.column_dimensions['D'].visible = False
worksheet.column_dimensions['E'].visible = False
worksheet.column_dimensions['F'].visible = False
worksheet.column_dimensions['G'].visible = False
worksheet.column_dimensions['H'].visible = False
worksheet.column_dimensions['I'].visible = False
worksheet.column_dimensions['K'].visible = False
worksheet.column_dimensions['L'].visible = False
worksheet.column_dimensions['M'].visible = False
worksheet.column_dimensions['N'].visible = False
worksheet.column_dimensions['O'].visible = False
worksheet.column_dimensions['P'].visible = False
worksheet.column_dimensions['Q'].visible = False
worksheet.column_dimensions['R'].visible = False
worksheet.column_dimensions['S'].visible = False
worksheet.column_dimensions['T'].visible = False
worksheet.column_dimensions['U'].visible = False
worksheet.column_dimensions['V'].visible = False
worksheet.column_dimensions['W'].visible = False
worksheet.column_dimensions['X'].visible = False
worksheet.column_dimensions['Y'].visible = False
worksheet.column_dimensions['Z'].visible = False
worksheet.column_dimensions['AA'].visible = False
worksheet.column_dimensions['AB'].visible = False
worksheet.column_dimensions['AC'].visible = False
worksheet.column_dimensions['AD'].visible = False
worksheet.column_dimensions['AE'].visible = False
worksheet.column_dimensions['AF'].visible = False
worksheet.column_dimensions['AG'].visible = False
worksheet.column_dimensions['AH'].visible = False
worksheet.column_dimensions['AI'].visible = False
worksheet.column_dimensions['AJ'].visible = False
worksheet.column_dimensions['AK'].visible = False
worksheet.column_dimensions['AM'].visible = False
worksheet.column_dimensions['AN'].visible = False
worksheet.column_dimensions['AP'].visible = False
worksheet.column_dimensions['AQ'].visible = False
worksheet.column_dimensions['AR'].visible = False
worksheet.column_dimensions['AS'].visible = False
worksheet.column_dimensions['AT'].visible = False
worksheet.column_dimensions['AU'].visible = False
worksheet.column_dimensions['AV'].visible = False
worksheet.column_dimensions['AW'].visible = False
worksheet.column_dimensions['AX'].visible = False
worksheet.column_dimensions['AY'].visible = False
worksheet.column_dimensions['AZ'].visible = False
worksheet.column_dimensions['BA'].visible = False
worksheet.column_dimensions['BB'].visible = False
worksheet.column_dimensions['BC'].visible = False
worksheet.column_dimensions['BD'].visible = False
worksheet.column_dimensions['BE'].visible = False
worksheet.column_dimensions['BF'].visible = False
worksheet.column_dimensions['BH'].visible = False
worksheet.column_dimensions['BI'].visible = False
worksheet.column_dimensions['BJ'].visible = False
worksheet.column_dimensions['BK'].visible = False
worksheet.column_dimensions['BL'].visible = False
worksheet.column_dimensions['BM'].visible = False
worksheet.column_dimensions['BN'].visible = False
worksheet.column_dimensions['BO'].visible = False
worksheet.column_dimensions['BP'].visible = False
worksheet.column_dimensions['BQ'].visible = False
worksheet.column_dimensions['BR'].visible = False
worksheet.column_dimensions['BS'].visible = False
worksheet.column_dimensions['BT'].visible = False
worksheet.column_dimensions['BU'].visible = False
worksheet.column_dimensions['BV'].visible = False
worksheet.column_dimensions['BW'].visible = False
worksheet.column_dimensions['BX'].visible = False
worksheet.column_dimensions['BY'].visible = False
worksheet.column_dimensions['BZ'].visible = False
worksheet.column_dimensions['CA'].visible = False
worksheet.column_dimensions['CB'].visible = False
worksheet.column_dimensions['CC'].visible = False
worksheet.column_dimensions['CD'].visible = False
worksheet.column_dimensions['CE'].visible = False
worksheet.column_dimensions['CF'].visible = False
worksheet.column_dimensions['CG'].visible = False
worksheet.column_dimensions['CH'].visible = False
worksheet.column_dimensions['CI'].visible = False
worksheet.column_dimensions['CJ'].visible = False
worksheet.column_dimensions['CK'].visible = False
worksheet.column_dimensions['CL'].visible = False
worksheet.column_dimensions['CM'].visible = False
worksheet.column_dimensions['CN'].visible = False
worksheet.column_dimensions['CO'].visible = False
worksheet.column_dimensions['CP'].visible = False
worksheet.column_dimensions['CQ'].visible = False
worksheet.column_dimensions['CR'].visible = False
worksheet.column_dimensions['CS'].visible = False
worksheet.column_dimensions['CU'].visible = False
Also, if someone could tell me a more efficient way to hide the columns, which I'm sure there is, that would be great.
You should set the hidden attribute to True:
worksheet.column_dimensions['A'].hidden = True
In order to hide more than one column:
for col in ['A', 'B', 'C']:
    worksheet.column_dimensions[col].hidden = True
Columns can be grouped:
ws.column_dimensions.group(start='B', end='CU', hidden=True)
You can use a loop for an already-defined workbook wb. In this example I have 10 columns with data and want to hide all of the remaining ones; 16385 is the index of the last Excel column, XFD, plus 1.
import openpyxl as op

ws = wb['Sheet1']
max_column = ws.max_column
last_column = op.utils.cell.column_index_from_string('XFD')

for idx in range(max_column + 1, last_column + 1):
    ws.column_dimensions[op.utils.get_column_letter(idx)].hidden = True
If you know the positions of your columns, then this is easy.
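Putting those pieces together, here's a minimal end-to-end sketch; the file name, sheet name, and column letters are placeholders for illustration:
from openpyxl import load_workbook

wb = load_workbook('report.xlsx')   # placeholder file name
ws = wb['Sheet1']                   # placeholder sheet name

# Hide a handful of individual columns...
for col in ['B', 'D', 'E', 'F']:
    ws.column_dimensions[col].hidden = True

# ...or hide a whole contiguous range in one call.
ws.column_dimensions.group(start='AA', end='CU', hidden=True)

wb.save('report.xlsx')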

Maximum limit in the length of expression evaluated by eval() in python

Consider the example
a = "( False or False ) and not ( False and True and False ) and not ( False and True and False ) "
print eval(a)
b = "( False or False or False or False or False or False or True or False or False or False or False or False or False or False or False or False or False or False ) and not False and not False and not ( False and False ) and not ( False and False ) and not ( False and False ) and not ( False and False ) and not ( False and False ) and not ( False and False ) and not ( False and True ) and not ( False and False ) and not ( False and False ) and not ( False and False ) and not ( False and False ) and not ( False and False) and not False"
print eval(b)
The first one gives the proper output, but the second, even though the syntax is correct, gives
SyntaxError: EOL while scanning string literal
because of its length. I need to evaluate large expressions in my program. Any suggestions?
Try to find the limit empirically:
b = 'False or False'
while True:
    try:
        b = b + b[5:]
        print len(b), eval(b)
    except:
        print len(b)
        break
I stopped it at len(b) == 288 MiB. Interestingly, Python used up to 5.5 GiB of RAM at the 288 MiB level.
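The same probe in Python 3 syntax, with a size cap added here so it stops before exhausting memory (the 64 MiB cap is arbitrary):
b = 'False or False'
while len(b) < 64 * 1024 * 1024:        # stop well before the 288 MiB mark above
    try:
        b = b + b[5:]                   # roughly doubles the expression each pass
        print(len(b), eval(b))
    except Exception as exc:
        print(len(b), exc)
        break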
