Ignoring NaN/null values while looping through data - python

I wasn't able to find any clear answers to what I assume is a simple question. This is for Python 3. What are your tips and tricks for applying functions, loops, etc. to your data when a column has both null and non-null values?
Here is the example I ran into while cleaning some data today. I have a function that takes two columns from my merged dataframe and calculates a ratio showing how similar two strings are.
imports:
from difflib import SequenceMatcher
import pandas as pd
import numpy as np
import pyodbc
import difflib
import os
from functools import partial
import datetime
my function:
def apply_sm(merged, c1, c2):
    return difflib.SequenceMatcher(None, merged[c1], merged[c2]).ratio()
Here is how I call the function:
merged['NameMatchRatio'] = merged.apply(partial(apply_sm, c1='CLIENT NAME', c2='ClientName'), axis=1)
CLIENT NAME has no null values, while ClientName does have null values (which throw errors when I try to apply my function). How can I apply my function while ignoring the NaN values (in either column, just in case)?
Thank you for your time and assistance.

You can use math.isnan to check whether a value is NaN and skip it. Alternatively, you can replace NaN with zero or something else and then apply your function. It really depends on what you want to achieve.
A simple example:
import math
test_variable = math.nan
if math.isnan(test_variable):
    print("it is a nan value")
Just incorporate this logic into your code as you deem fit.
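Since you are applying the function row-wise, another option along the "replace nan with something else" line is to fill the missing values with empty strings first, so SequenceMatcher always receives two strings. This is just a sketch of that route, reusing the column names from the question:
merged_filled = merged.fillna({'CLIENT NAME': '', 'ClientName': ''})
merged['NameMatchRatio'] = merged_filled.apply(
    partial(apply_sm, c1='CLIENT NAME', c2='ClientName'), axis=1)
Rows where ClientName was missing then get the ratio against an empty string (0.0) instead of raising an error.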

def apply_sm(merged, c1, c2):
    if not merged[[c1, c2]].isnull().any():
        return difflib.SequenceMatcher(None, merged[c1], merged[c2]).ratio()
    return 0.0  # <-- you could handle the Null case here

Related

Pandas qcut function duplicates parameter

Maybe I'm missing the point, but why doesn't pandas' qcut function accept "ignore" as an argument for duplicates?
Small datasets with duplicate values raise the error:
"Bin edges must be unique"
along with the advice to use the "drop" option. But if you want a fixed number of bins, there is no way to get one?
Small code example that doesn't work:
import pandas as pd
import numpy as np
data=pd.Series([1,1,2,3])
pd.qcut(data,10,labels=np.arange(0,10),duplicates="raise")
Small code example that works, but doesn't give the requested number of bins:
import pandas as pd
import numpy as np
data=pd.Series([1,1,2,3])
pd.qcut(data, 4, labels=np.arange(0, 3), duplicates="drop")
What could be a possible solution:
Insert a third option "ignore" to https://github.com/pandas-dev/pandas/blob/06d230151e6f18fdb8139d09abf539867a8cd481/pandas/core/reshape/tile.py#L405
Change the if else block in https://github.com/pandas-dev/pandas/blob/06d230151e6f18fdb8139d09abf539867a8cd481/pandas/core/reshape/tile.py#L418-L424
to
if duplicates == "raise":
    raise ValueError(
        f"Bin edges must be unique: {repr(bins)}.\n"
        f"You can drop duplicate edges by setting the 'duplicates' kwarg"
    )
elif duplicates == "drop":
    bins = unique_bins
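In the meantime, a workaround that does not require patching pandas (just a sketch, not part of the proposal above) is to bin the ranks instead of the raw values, so the computed edges are always unique; note that tied values then land in different bins:
import pandas as pd
import numpy as np

data = pd.Series([1, 1, 2, 3])
# rank(method="first") breaks ties, so all quantile edges are distinct
binned = pd.qcut(data.rank(method="first"), 10, labels=np.arange(0, 10))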

Writing a fast (array-operation) function that solves a function in one column to get the next function value

I have a DataFrame in which I want to define each entry of column $b$ by numerically solving an equation that uses the corresponding element from column $a$. For instance, let $a[0]$ and $b[0]$ be the first entries of columns $a$ and $b$. Given $a[0]$ and the function $f(x) = e^x - a[0] x^2$, I want to define $b[0]$ as a zero of $f$. The same form of $f$ is used to define $b[1]$, $b[2]$, and so on.
Currently, I am doing this entry-wise with scipy's fsolve in a for loop over column $b$'s entries. This works, but it is slow, and I have heard it is bad practice to use for loops over DataFrames.
I would appreciate any advice about how to create a function that is faster. Thanks in advance.
Assuming there is only one solution per row, you could start from something like this:
import pandas as pd
import numpy as np
from scipy.optimize import root

def f(x, a):
    return np.exp(x) - a * x**2

n = 100
df = pd.DataFrame({"a": np.arange(1, n + 1)})
df["sol"] = df["a"].apply(lambda a: root(f, x0=0, args=(a,)).x[0])

Mixed data types in data fields

I am writing code designed to detect, index, and report errors found within very large data sets. I am reading the data set (csv) in with pandas, creating a dataframe with dozens of columns. Numerical errors are easy to catch by converting the column of interest to an np array and using basic logical expressions and the np.where function. Bam!
One of the errors I am looking for is an
invalid data type
For example, if the column was supposed to be an array of floats but a string was inadvertently entered smack dab in the middle of all of the floats. When converting to an np array, it then converts all values into strings and the logic expressions fail (as would be expected).
Ideally all non-numeric entries for that data column would be indexed as
invalid data type
with the values logged. It would then replace the value with NaN, convert the array of strings to the originally intended float values, and then continue with the assessment of numerical error checks.
This could simply be solved with for loops and a few try/except statements. But, being new to Python, I am hoping for a more elegant solution.
Any suggestions?
Have a look at great expectations, which aims to solve a similar problem. Note that until they implement their expect_column_values_to_be_parseable_as_type, you can force your column to be a string and use a regex for the checks instead. For example, say you had a column called 'AGE' and wanted to validate it as an integer between 18 and 120:
import great_expectations as ge

gf = ge.read_csv("my_data.csv",
                 dtype={
                     'AGE': str,
                 })
result = gf.expect_column_values_to_match_regex('AGE',
                                                r'^(1[89]|[2-9][0-9]|1[01][0-9]|120)$',
                                                result_format={'result_format': 'COMPLETE'})
Alternatively, using numpy maybe something like this:
import numpy as np
@np.vectorize
def is_num(num):
    try:
        float(num)
        return True
    except (ValueError, TypeError):
        return False
A = np.array([1,2,34,'e',5])
is_num(A)
which returns
array([ True, True, True, False, True])
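For the replace-with-NaN-and-convert step described in the question, plain pandas can also do this directly. This is just a sketch using pd.to_numeric with errors='coerce', which turns non-numeric entries into NaN so the numerical checks can proceed, and whose NaN mask gives you the indexes of the invalid entries:
import pandas as pd

col = pd.Series([1, 2, 34, 'e', 5])
numeric = pd.to_numeric(col, errors='coerce')   # 'e' becomes NaN, the rest become floats
invalid_index = col[numeric.isna()].index       # positions of the invalid entries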

Applying a function many-to-many using pandas

I'm using pandas to do some conditional filtering based on string matching using the fuzzywuzzy module. I've written some code that works, but is painfully slow and goes against every instinct in my body because I'm using a for loop over a pandas Series.
My issue is that I want to compare one array of strings to another, and if a string in one array is similar enough to ANY string in the other array, I want to remove it from the array completely. My current code is this:
from fuzzywuzzy import fuzz
import pandas as pd

for value in new_contacts['StringMatch']:  # this is a pandas column in a dataframe
    previous_contacts['ratio'] = previous_contacts['StringMatch'].apply(lambda x: fuzz.ratio(x, value))
    previous_contacts = previous_contacts[previous_contacts['ratio'] > 97]  # fuzz.ratio outputs an int between 0 and 100
    previous_contacts.drop('ratio', axis=1, inplace=True)
Does anyone have any suggestions / best practices to make this code faster?
There might be a faster way to do what you are asking. If possible, I'd ask you to reevaluate your need for the fuzzywuzzy package. The edit distance computation is very expensive as it constructs a matrix of size n * m (n and m being the sizes of the two strings) for each pair of strings in your arrays.
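If you do stay with fuzzy matching, one option is fuzzywuzzy's process.extractOne with a score cutoff, which keeps the pairwise scoring inside the library instead of rebuilding a throwaway ratio column on every iteration. It does not change the underlying cost of the edit distance, and this is only a sketch of one reading of your goal (drop every new contact that closely matches ANY previous contact); new_contacts and previous_contacts are the frames from your code:
from fuzzywuzzy import fuzz, process

previous_strings = previous_contacts['StringMatch'].tolist()

def has_close_match(value):
    # extractOne returns None when no candidate reaches the cutoff
    return process.extractOne(value, previous_strings,
                              scorer=fuzz.ratio, score_cutoff=97) is not None

new_contacts = new_contacts[~new_contacts['StringMatch'].apply(has_close_match)]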

Using latest panda APIs to compute exponential moving average

I have a Python v3.6 function using pandas to compute the exponential moving average of a list of floating-point numbers. Here is the function, and it is tested to work:
def get_moving_average(values, period):
    import pandas as pd
    import numpy as np
    values = np.array(values)
    moving_average = pd.ewma(values, span=period)[-1]
    return moving_average
However, pd.ewma is a deprecated function, and although it still works, I would like to use the latest API to use pandas the correct way.
Here is the documentation for the latest exponential moving average API.
http://pandas.pydata.org/pandas-docs/stable/api.html#exponentially-weighted-moving-window-functions
I modified the original function into this to use the latest API:
def get_moving_average(values, period, type="exponential"):
    import pandas as pd
    import numpy as np
    values = np.array(values)
    moving_average = 0
    moving_average = pd.ewm.mean(values, span=period)[-1]
    return moving_average
Unfortunately, I got the error AttributeError: module 'pandas' has no attribute 'EWM'
The ewm() method now has a similar API to rolling() and expanding(): you call ewm() and then follow it with a compatible method like mean(). For example:
df = pd.DataFrame({'x': np.random.randn(5)})
df['x'].ewm(halflife=2).mean()
0   -0.442148
1   -0.318170
2    0.099168
3   -0.062827
4   -0.371739
Name: x, dtype: float64
If you try df['x'].ewm() without arguments it will tell you:
Must pass one of com, span, halflife, or alpha
See below for documentation that may be more clear than the link in the OP:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html#pandas.DataFrame.ewm
http://pandas.pydata.org/pandas-docs/stable/computation.html#exponentially-weighted-windows
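Applying that to the function from the question, a minimal rewrite might look like this (a sketch; it assumes values is a plain list of floats, as in the original):
import pandas as pd

def get_moving_average(values, period):
    # ewm() replaces the deprecated pd.ewma(); take the last value of the smoothed series
    return pd.Series(values).ewm(span=period).mean().iloc[-1]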
