I am writing code designed to detect, index, and report errors found within very large data sets. I am reading in the data set (CSV) using pandas, creating a dataframe with dozens of columns. Numerical errors are easy: convert the column of interest to an np array and use basic logical expressions plus the np.where function. Bam!
One of the errors I am looking for is an "invalid data type".
For example, the column was supposed to be an array of floats, but a string was inadvertently entered smack dab in the middle of all of the floats. When converting to an np array, it then converts all values into strings, which causes the logic expressions to fail (as would be expected).
Ideally, all non-numeric entries for that data column would be indexed as an "invalid data type" with the values logged. The code would then replace each offending value with NaN, convert the array of strings back to the originally intended float values, and continue with the numerical error checks.
This could be solved simply with for loops and a few try/except statements, but, being new to Python, I am hoping for a more elegant solution.
Any suggestions?
Have a look at Great Expectations, which aims to solve a similar problem. Note that until they implement their expect_column_values_to_be_parseable_as_type, you can force your column to be a string and use a regex for the checks instead. For example, say you had a column called 'AGE' and wanted to validate it as an integer between 18 and 120:
import great_expectations as ge

gf = ge.read_csv("my_data.csv",
                 dtype={
                     'AGE': str,
                 })
# Anchored regex matching the integers 18-120
result = gf.expect_column_values_to_match_regex('AGE',
                                                r'^(1[89]|[2-9][0-9]|1[01][0-9]|120)$',
                                                result_format={'result_format': 'COMPLETE'})
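With the COMPLETE result format, the returned dictionary should also carry the offending values and their row indices, which covers the "log the values" part of the question. A rough sketch of inspecting it (key names assume the classic Great Expectations result dictionary and may differ between versions):
# Hedged sketch: key names may vary with your Great Expectations version,
# hence the defensive .get() calls.
if not result['success']:
    bad_values = result['result'].get('unexpected_list', [])
    bad_rows = result['result'].get('unexpected_index_list', [])
    print("invalid AGE values:", bad_values)
    print("at row indices:", bad_rows)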
Alternatively, using numpy, maybe something like this:
import numpy as np
@np.vectorize  # applies the scalar check element-wise across an array
def is_num(num):
    try:
        float(num)
        return True
    except (TypeError, ValueError):
        return False

A = np.array([1, 2, 34, 'e', 5])
is_num(A)
which returns
array([ True, True, True, False, True])
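For the exact workflow the question describes (index the bad entries, log their values, replace them with NaN, then continue with the numeric checks), pandas' own pd.to_numeric with errors='coerce' is arguably the most elegant route. A minimal sketch, where the column name 'value' is made up for illustration:
import pandas as pd

df = pd.DataFrame({'value': [1.0, 2.5, 'oops', 4.2, 'seven', 5.0]})

# Coerce anything that cannot be parsed as a number to NaN
numeric = pd.to_numeric(df['value'], errors='coerce')

# Rows that failed to parse: these are the "invalid data type" entries
bad = df.loc[numeric.isna() & df['value'].notna(), 'value']
print("invalid data type at rows", list(bad.index), "values", list(bad.values))

# Replace the column with clean floats and carry on with the numeric checks
df['value'] = numeric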
I have a dataframe with the following columns.
When I compute the correlation matrix, I see only the columns that have int data types. I am new to ML; can someone tell me what mistake I am making here?
As you correctly observe, and as @Kraigolas states, from the docs:
numeric_only : bool, default True
    Include only float, int or boolean data.
Meaning that by default, corr will only compute correlations from numerical columns. You can change this by using:
df.corr(numeric_only=False)
However, this means pandas will try to convert the values to float to perform the correlation, and if the values in the columns are not numerical, it will fail with:
ValueError: could not convert string to float: 'X'
From the docs, by default numeric_only is set to True in the corr function. You need to set it to False so that non-numeric columns are also compared. Observe that the columns in your final result were the only ones with numeric dtypes.
This behaviour is deprecated though: in future versions of pandas, numeric_only will be set to False.
Convert columns holding numbers stored as strings to numeric values using pd.to_numeric.
df = df.apply(pd.to_numeric)
Also, convert all categorical data, such as city names, to dummy variables that can be used to compute correlation, as is done in this thread. Essentially, all the data you want to compute correlation on needs to be either a float or an integer, preferably all one or the other; otherwise, you're likely to have problems.
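As a small sketch of that last point, pd.get_dummies turns a categorical column into 0/1 indicator columns so it can take part in the correlation; the column names here are invented for illustration:
import pandas as pd

df = pd.DataFrame({
    'Sales': [100, 150, 90, 200],
    'City': ['Oslo', 'Bergen', 'Oslo', 'Bergen'],
})

# One-hot encode the categorical column; every column is numeric afterwards
encoded = pd.get_dummies(df, columns=['City'])
print(encoded.corr())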
This question is related to the question I posted yesterday, which can be found here.
So, I went ahead and implemented the solution provided by Jan to the entire data set. The solution is as follows:
import re
def is_probably_english(row, threshold=0.90):
    regular_expression = re.compile(r'[-a-zA-Z0-9_ ]')
    ascii = [character for character in row['App'] if regular_expression.search(character)]
    quotient = len(ascii) / len(row['App'])
    passed = True if quotient >= threshold else False
    return passed
google_play_store_is_probably_english = google_play_store_no_duplicates.apply(is_probably_english, axis=1)
google_play_store_english = google_play_store_no_duplicates[google_play_store_is_probably_english]
So, from what I understand, we are applying the is_probably_english function to each row of the google_play_store_no_duplicates DataFrame and storing the result, which is a boolean per row, in google_play_store_is_probably_english. That boolean result is then used to filter out the non-English apps in the google_play_store_no_duplicates DataFrame, with the end result being stored in a new DataFrame.
Does this make sense and does it seem like a sound way to approach the problem? Is there a better way to do this?
This makes sense, and I think this is the best way to do it. The result of the function is a boolean, as you said, and when you apply it row by row you end up with a pd.Series of booleans, which is usually called a boolean mask. This concept can be very useful in pandas when you want to filter rows by some criterion.
Here is an article about boolean masks in pandas.
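To make the idea concrete, here is a tiny sketch of a boolean mask in action (the data is invented for illustration):
import pandas as pd

df = pd.DataFrame({'App': ['Maps', 'Chess', 'Notes'],
                   'Rating': [4.1, 3.2, 4.8]})

# A Series of booleans, one per row: the boolean mask
mask = df['Rating'] >= 4.0

# Indexing the DataFrame with the mask keeps only the rows where it is True
print(df[mask])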
I am using the numpy rate function in order to mimic the Excel Rate function on loans.
The function returns the correct result when working with a subset of my dataframe (1 million records).
However, when working with the entire dataframe (over 10 million records), it returns null results for all.
Could this be a memory issue? If that is the case, how can it be solved?
I have already tried to chunk the data and use a while/for loop to calculate, but this didn't solve the problem.
This worked (not when I looped through the 10 million records though):
test = df2.iloc[:1000000,:]
test = test.loc[:, ['LoanTerm', 'Instalment', 'LoanAmount']]
test['True_Effective_Rate'] = ((1 + np.rate(test['LoanTerm'], -test['Instalment'], test['LoanAmount'], 0))**12 - 1) * 100
I am trying to get this to work:
df2['True_Effective_Rate'] = ((1+np.rate(df2['LoanTerm'],-df2['Instalment'],df2['LoanAmount'],0))**12-1)*100
I see a similar question has been asked in the past, where all the values returned are nulls when one of the parameter inputs is incorrect:
Using numpy.rate, on numpy array returns nan's unexpectedly
My dataframe doesn't have 0 values though. How can I prevent this from happening?
You can use apply to calculate this value once per row, so only invalid rows will be nan, not the entire result.
import pandas as pd
import numpy_financial as npf  # I get a deprecation warning when using np.rate

i = {
    'LoanAmount': [5_000, 20_000, 15_000, 50_000.0, 14_000, 1_000_000, 10_000],
    'LoanTerm': [72, 12, 60, 36, 72, 12, -1],
    'Instalment': [336.0, 5000.0, 333.0, 0.0, -10, 1000.0, 20],
}
df = pd.DataFrame(i)
df.apply(lambda x: npf.rate(nper=x.LoanTerm, pv=x.LoanAmount, pmt=-1 * x.Instalment, fv=0), axis=1)
This will be slower for large datasets since you cannot take advantage of vectorisation.
You can also filter your dataframe entries down to only the valid values. It is hard to reproduce what counts as invalid, since you are not sharing the inputs, but in my example above both the loan term and the instalment must be > 0.
valid = df.loc[(df.Instalment > 0) & (df.LoanTerm > 0)]
npf.rate(nper=valid.LoanTerm, pv=valid.LoanAmount, pmt=-1 * valid.Instalment, fv=0)
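If you want the result back in the original dataframe, one sketch (continuing the hypothetical df and valid from above) is to compute the vectorised rate on the valid rows only and leave NaN everywhere else:
import numpy as np

# NaN for invalid rows, the annualised effective rate for the valid ones
df['True_Effective_Rate'] = np.nan
df.loc[valid.index, 'True_Effective_Rate'] = (
    (1 + npf.rate(nper=valid.LoanTerm, pv=valid.LoanAmount,
                  pmt=-1 * valid.Instalment, fv=0)) ** 12 - 1
) * 100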
I know using == for float is generally not safe. But does it work for the below scenario?
Read from csv file A.csv, save first half of the data to csv file B.csv without doing anything.
Read from both A.csv and B.csv. Use == to check if data match everywhere in the first half.
These are all done with Pandas. The columns in A.csv have types datetime, string, and float. Obviously == works for datetime and string, so if == works for float as well in this case, it saves a lot of work.
It seems to be working for all my tests, but can I assume it will work all the time?
The same string representation will become the same float representation when put through the same parse routine. The float inaccuracy issue occurs either when mathematical operations are performed on the values or when high-precision representations are used, but equality on low-precision values is no reason to worry.
No, you cannot assume that this will work all the time.
For this to work, you need to know that the text value written out by Pandas when it's writing to a CSV file recovers the exact same value when read back in (again using Pandas). But by default, the Pandas read_csv function sacrifices accuracy for speed, and so the parsing operation does not automatically recover the same float.
To demonstrate this, try the following: we'll create some random values, write them out to a CSV file, and read them back in, all using Pandas. First the necessary imports:
>>> import pandas as pd
>>> import numpy as np
Now create some random values, and put them into a Pandas Series object:
>>> test_values = np.random.rand(10000)
>>> s = pd.Series(test_values, name='test_values')
Now we use the to_csv method to write these values out to a file, and then read the contents of that file back into a DataFrame:
>>> s.to_csv('test.csv', header=True)
>>> df = pd.read_csv('test.csv')
Finally, let's extract the values from the relevant column of df and compare. We'll sum the result of the == operation to find out how many of the 10000 input values were recovered exactly.
>>> sum(test_values == df['test_values'])
7808
So approximately 78% of the values were recovered correctly; the others were not.
This behaviour is considered a feature of Pandas, rather than a bug. However, there's a workaround: Pandas 0.15 added a new float_precision argument to the CSV reader. By supplying float_precision='round_trip' to the read_csv operation, Pandas uses a slower but more accurate parser. Trying that on the example above, we get the values recovered perfectly:
>>> df = pd.read_csv('test.csv', float_precision='round_trip')
>>> sum(test_values == df['test_values'])
10000
Here's a second example, going in the other direction. The previous example showed that writing and then reading doesn't give back the same data. This example shows that reading and then writing doesn't preserve the data, either. The setup closely matches the one you describe in the question. First we'll create A.csv, this time using regularly-spaced values instead of random ones:
>>> import pandas as pd, numpy as np
>>> s = pd.Series(np.arange(10**4) / 1e3, name='test_values')
>>> s.to_csv('A.csv', header=True)
Now we read A.csv, and write the first half of the data back out again to B.csv, as in your Step 1.
>>> recovered_s = pd.read_csv('A.csv').test_values
>>> recovered_s[:5000].to_csv('B.csv', header=True)
Then we read in both A.csv and B.csv, and compare the first half of A with B, as in your Step 2.
>>> a = pd.read_csv('A.csv').test_values
>>> b = pd.read_csv('B.csv').test_values
>>> (a[:5000] == b).all()
False
>>> (a[:5000] == b).sum()
4251
So again, several of the values don't compare equal. Opening up the files, A.csv looks pretty much as I'd expect. Here are the first 16 entries (rows 0 through 15) in A.csv:
,test_values
0,0.0
1,0.001
2,0.002
3,0.003
4,0.004
5,0.005
6,0.006
7,0.007
8,0.008
9,0.009
10,0.01
11,0.011
12,0.012
13,0.013
14,0.014
15,0.015
And here are the corresponding entries in B.csv:
,test_values
0,0.0
1,0.001
2,0.002
3,0.003
4,0.004
5,0.005
6,0.006
7,0.006999999999999999
8,0.008
9,0.009000000000000001
10,0.01
11,0.011000000000000001
12,0.012
13,0.013000000000000001
14,0.013999999999999999
15,0.015
See this bug report for more information on the introduction of the float_precision keyword to read_csv.
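As an aside that is not part of the original answer: if bit-exact equality is not strictly required, a tolerance-based comparison sidesteps the parser issue entirely. A minimal sketch, reusing the A.csv and B.csv files created above:
import numpy as np
import pandas as pd

a = pd.read_csv('A.csv').test_values
b = pd.read_csv('B.csv').test_values

# Compare within floating-point tolerance instead of demanding exact equality
print(np.allclose(a[:5000], b))  # True even with the default (fast) parser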
I am doing some data handling based on a DataFrame with the shape of (135150, 12), so double-checking my results manually is no longer practical.
I encountered some 'strange' behavior when I tried to check if an element is part of the dataframe or a given column.
This behavior is reproducible with even smaller dataframes as follows:
import numpy as np
import pandas as pd
start = 1e-3
end = 2e-3
step = 0.01e-3
arr = np.arange(start, end+step, step)
val = 0.0019
df = pd.DataFrame(arr, columns=['example_value'])
print(val in df) # prints `False`
print(val in df['example_value']) # prints `True`
print(val in df.values) # prints `False`
print(val in df['example_value'].values) # prints `False`
print(df['example_value'].isin([val]).any()) # prints `False`
Since I am very much a beginner in data analysis, I am not able to explain this behavior.
I know that I am using different approaches involving different datatypes (like pd.Series, np.ndarray or np.array) to check whether the given value exists in the dataframe. I am also aware that machine accuracy comes into play when working with np.array or np.ndarray.
However, in the end I need to implement several functions to filter the dataframe and count the occurrences of certain values, which I have done successfully several times before using boolean columns in combination with operations like > and <.
But in this case I need to filter by the exact value and count its occurrences, which is what led me to the issue described above.
So could anyone explain, what's going on here?
The underlying issue, as Divakar suggested, is floating point precision. Because DataFrames/Series are built on top of numpy, there isn't really a penalty for using numpy methods though, so you can just do something like:
df['example_value'].apply(lambda x: np.isclose(x, val)).any()
or
np.isclose(df['example_value'], val).any()
both of which correctly return True.
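Since the end goal in the question is to filter by the value and count its occurrences, the same np.isclose mask can serve both purposes; a short sketch continuing the example above:
# Boolean mask of rows whose value is numerically equal to val
mask = np.isclose(df['example_value'], val)

print(mask.sum())   # number of occurrences
print(df[mask])     # the matching rows themselves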