So I'm currently finishing a tutorial with the Titanic dataset (https://www.kaggle.com/c/titanic/data).
Now I'm trying a couple of new things that might be related.
The .info() output shows 891 entries, and some columns contain NaN values.
When I went to get a quick summary of the missing values, I got confused by .sum() and .count():
With .isnull().sum(), .sum() increments by one for each null value, so the output seems to be the number of missing entries for each column in the data frame (which is what I want).
However, if we use .count(), we get 891 for each column, whether we call .isnull().count() or .notnull().count().
So my questions are:
What does .count() mean in this context?
I thought it would count every instance produced by the preceding method (in this case, every null or not-null entry; basically what .sum() did).
Also, is my "definition" of how .sum() is being used correct?
Just print out train_df.isnull() and you will see what is going on.
# data analysis and wrangling
import pandas as pd
import numpy as np
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
train_df = pd.read_csv('train.csv')
print(train_df.isnull())
result:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket \
0 False False False False False False False False False
1 False False False False False False False False False
2 False False False False False False False False False
3 False False False False False False False False False
4 False False False False False False False False False
.. ... ... ... ... ... ... ... ... ...
886 False False False False False False False False False
887 False False False False False False False False False
888 False False False False False True False False False
889 False False False False False False False False False
890 False False False False False False False False False
It has 891 rows, full of True and False values.
When you use .sum(), it returns the sum of every column, adding True (= 1) and False (= 0) together, like this:
print(False+False+True+True)
2
When you use .count(), it returns the number of non-null entries in each column. Since .isnull() and .notnull() both produce frames of booleans, which are never null themselves, every column counts all 891 rows.
Of course, then, you get 891 for each column whether you use .isnull().count() or .notnull().count().
data.isnull().count() returns the total number of rows irrespective of missing values. You need to use data.isnull().sum().
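To see the difference side by side, here is a minimal sketch, assuming the Titanic train.csv from the question:

import pandas as pd

train_df = pd.read_csv('train.csv')

# one number per column: how many entries are null
print(train_df.isnull().sum())    # e.g. Age 177, Cabin 687, Embarked 2

# one number per column: how many rows there are, null or not
print(train_df.isnull().count())  # 891 for every column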
I got some unexpected behaviour, at least for me, while working with a pandas dataframe. I use a 2d array and check whether its values are smaller than 14.4. If I check the entire dataframe at once, the check is wrong for some values, but if I explicitly check the cell concerned, the result is correct.
print(df.loc[5,397])
out: 14.4  # --> it's actually 14.3999996185
print(df.loc[5,397] < 14.4)
out: True
print(df.loc[4:6,396:398] < 14.4)
out:
396 397 398
4 False False False
5 False False False #[5,397] should be True!
6 False False False
However, if I try to reproduce the error, I get the correct result?!
import numpy as np
import pandas as pd

data = np.array([[15, 15, 15], [15, 14.3999996185, 15], [15, 15, 15]])
df = pd.DataFrame(data)
print(df.loc[1,1])
out: 14.3999996185
print(df.loc[1,1] < 14.4)
out: True
print(df < 14.4)
out:
0 1 2
0 False False False
1 False True False
2 False False False
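One hedged way to dig further on the original dataframe (the labels 5 and 397 are taken from the first snippet; its column dtype is not shown in the question) is to compare what is actually stored, at full precision, against the literal:

# pandas' display rounds; repr shows the stored scalar in full
print(repr(df.loc[5, 397]))
# the column dtype matters when comparing against the float64 literal 14.4
print(df.dtypes[397])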
Thank you
Is there a more efficient way to do the following? Ideally, by using only one if statement?
Suppose there is a dataframe with an "author" series, a "comedy" series (default = True), and a "horror" series (default = False). I want to search the author series for "stephen king" and "lovecraft" and in those cases change the value of "comedy" from True to False and change the value of "horror" from False to True.
for count, text in enumerate(df.loc[0:, "author"]):
    if "stephen king" in str(text):
        df.loc[count, "comedy"] = False
        df.loc[count, "horror"] = True
        continue
    elif "lovecraft" in str(text):
        df.loc[count, "comedy"] = False
        df.loc[count, "horror"] = True
        continue
When I try using str.contains(), I get the error "'str' object has no attribute 'str'".
Don't enumerate a data frame; index and slice it.
# boolean mask: rows whose author mentions 'stephen king'
ix = df.author.str.contains('stephen king')
df.loc[ix, 'comedy'] = False
df.loc[ix, 'horror'] = True

# and again for 'lovecraft'
ix = df.author.str.contains('lovecraft')
df.loc[ix, 'comedy'] = False
df.loc[ix, 'horror'] = True
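One caveat worth noting: if the author column can contain missing values, str.contains returns NaN for those rows and the boolean indexing above will typically raise. Passing na=False (as the example further down does) avoids that; a hedged one-line variant:

ix = df.author.str.contains('stephen king', na=False)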
You can assign these values with df.loc. If the string contains 'stephen king' or 'lovecraft', put False in the 'comedy' column and True in the 'horror' column:
df.loc[df['author'].str.contains('stephen king|lovecraft'),
['comedy', 'horror']] = [False, True]
You can check whether the column contains any of several values by using
df['Author'].str.contains('|'.join(list_of_authors))
and then assign the values using loc.
Ex:
>>> df
Author Comedy Horror
0 stephen king True False
1 lovecraft True False
2 jonathan ames True True
3 stephen king False True
4 oprah True False
>>> df.loc[df['Author'].str.contains('|'.join(['stephen king','lovecraft']),case=False,na=False),('Comedy','Horror')]=False,True
>>> df
Author Comedy Horror
0 stephen king False True
1 lovecraft False True
2 jonathan ames True True
3 stephen king False True
4 oprah True False
I have a pandas dataframe that looks like so:
datetime Online TEST
61 2018-03-03 True False
62 2018-03-04 True False
63 2018-03-05 True False
64 2018-03-06 True False
65 2018-03-07 True False
66 2018-03-08 True False
67 2018-03-09 True False
68 2018-03-10 True False
69 2018-03-11 False False
70 2018-03-12 False False
I need to check, for each False in the TEST column, whether there is a False in the Online column within a range of 7 days (i.e., plus or minus timedelta(days = 7)). For example, on 2018-03-03, since TEST is False, I would check all rows within plus or minus 7 days for False values in the Online column; since there are no False Online values within that 7 day window, we would return False. On the other hand, consider the date 2018-03-09, where Online is True and TEST is False. Since Online is False on 2018-03-11, I need to return a boolean True saying that there was a False within my 7 day range.
I can achieve this using some slow and ugly looping (i.e., go through each row with DataFrame.iterrows(), check if TEST is False, then pull the window of plus or minus 7 days to see if Online has a corresponding False value), but I would ideally like something snazzier and faster. For a visual, this is what I need my final dataframe to look like:
datetime Online TEST Check
61 2018-03-03 True False False
62 2018-03-04 True False True
63 2018-03-05 True False True
64 2018-03-06 True False True
65 2018-03-07 True False True
66 2018-03-08 True False True
67 2018-03-09 True False True
68 2018-03-10 True False True
69 2018-03-11 False False True
70 2018-03-12 False False True
Any ideas out there? Thanks in advance!
Building upon @piRSquared's great comments (I didn't even know about the rolling method; it seems very useful!), you can use
check = ~(df.TEST + df.Online.rolling(15, center=True, min_periods=1).apply(np.prod).eq(1))
The second summand creates a boolean Series in which every element indicates whether there is no False value in a window of size 15 (the current row plus 7 rows on each side); this is achieved by multiplying (NumPy's prod function) all the values inside the rolling window, which gives 1 only when every value is True. Note that this assumes one row per day with no gaps, so a 15-row centered window spans plus or minus 7 days.
Adding two boolean Series acts as an element-wise logical OR, so the inversion with the ~ operator leaves True in the Check series only where TEST is False and the window does contain a False Online value.
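A minimal runnable sketch of the same idea, reconstructing the example frame from the question (the astype(int) cast, raw=True, and | in place of + are small adjustments so it runs cleanly on current pandas; + and | are equivalent for boolean Series here):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.date_range('2018-03-03', periods=10, freq='D'),
    'Online': [True] * 8 + [False] * 2,
    'TEST': [False] * 10,
})

# product over the centered 15-row window: 1 only if every Online value is True
no_false_nearby = (df.Online.astype(int)
                     .rolling(15, center=True, min_periods=1)
                     .apply(np.prod, raw=True)
                     .eq(1))

# True where TEST is False and some Online value within the window is False
df['Check'] = ~(df.TEST | no_false_nearby)
print(df)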
Hope it helps.
I am trying to get the external coordinates of a polygon from a numpy boolean grid. For example, from a (16, 16) ndarray such as the following one
[
[False False False False False False True True True True False False False False False False],
[False False False False False True True True True True True False False False False False],
[False False False False False False False False False False True True False False False False],
[False False False False False False False False False False False True False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False]
]
If we plot that ndarray it will be like the following:
I would like to get the following coordinates, in order, such that we could draw the external ring of such polygon, e.g., [(5 1), (6 0), (7 0), (8 0), (9 0), (10 1), (11 2), (11 3), (10 2), (9 1), (8 1), (7 1), (6 1)]. What I have so far is the following:
# Consider that the boolean ndarray above is called 'prediction'
import itertools
import numpy as np
from shapely.geometry import Polygon, Point
import matplotlib.pyplot as plt

# Get the coordinates that match the boolean polygon
(y, x) = np.where(prediction)

# Iterate on each of the coordinates; my problem is that this is not aware
# of the contour order as it should be :/ (itertools.izip is Python 2;
# use the built-in zip on Python 3)
coordinates = [Point(x_coordinate, y_coordinate) for x_coordinate, y_coordinate in itertools.izip(x, y)]
# Build the polygon out of the points
polygon = Polygon([[coordinate.x, coordinate.y] for coordinate in coordinates])
exterior_x, exterior_y = polygon.exterior.xy
# Plotting
fig = plt.figure(1, figsize=(5, 5))
ax = fig.add_subplot(1, 2, 1)
ax.plot(exterior_x, exterior_y, color='#6699cc')
ax.invert_yaxis()
plt.subplot(1, 2, 2)
plt.imshow(prediction)
plt.show()
The problem is that I am building the polygon without considering the order of the points, so polygon.exterior.xy does not trace the external ring. My approach creates the wrong contour of the polygon, such as:
However, I am unable to come up with a general approach to this problem and welcome any suggestions on how to tackle it. Thanks in advance.
Perhaps you can move the question to the GIS Stack Exchange site; there you will probably get more help on this.
Anyway, a quick search shows this answer, where it is suggested to use the rasterio library, which I understand is what you need.
Adapted to your case, it can be something like this:
import numpy as np
import rasterio.features
# Convert your array to 0-1 integers
myarray = [[1 if t else 0 for t in row] for row in myarray]
# Build a numpy array
myarray = np.array(myarray)
# Convert the type (I don't even know why this was needed on my computer,
# but it raised an exception if not converted)
myarray = myarray.astype(np.int32)
# Let the library do the magic. You should take a look at the rasterio.features.shapes output
mypols = [p[0]['coordinates']
for p in rasterio.features.shapes(myarray)]
mypols is now a list of coordinate rings that you can easily convert to shapely Polygons.
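For instance, a hedged sketch of that conversion (note that rasterio.features.shapes also yields a shape for the zero-valued background, so you may want to filter on the value it returns next to each geometry):

from shapely.geometry import Polygon

# each entry in mypols is a list of rings: the exterior first, any holes after
polygons = [Polygon(shell=rings[0], holes=rings[1:]) for rings in mypols]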
Beware of properly testing stranger cases. I tried to build a multipolygon, and the library returned each connected component as a polygon. Fortunately, it returns the associated value for each polygon, so you can post-process as you like.
Polygons with interior rings seem to be handled OK, though.
I don't know what behavior you would expect in those cases.
I'd use ConvexHull; it tries to find the smallest envelope which contains all your points, and that would be the polygon contour: ConvexHull with Scipy
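A hedged sketch of that idea (keep in mind that a convex hull smooths over any concave parts of the outline, so it only matches the true external ring for convex shapes):

import numpy as np
from scipy.spatial import ConvexHull

# coordinates of the True cells, as (x, y) pairs
(y, x) = np.where(prediction)
points = np.column_stack([x, y])

hull = ConvexHull(points)
# hull.vertices indexes the boundary points in counterclockwise order
ring = points[hull.vertices]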
I am getting wrong results when I do an element-wise comparison on a numpy array of floats.
For example:
import numpy as np

a = np.arange(4, 5 + 0.025, 0.025)
print(a)
mask = a == 5.0
print(mask)
na = a[mask]
print(na)
When I run the above code, a == 5.0 doesn't give me a True value at the index where the value is, in fact, 5.0.
I also tried setting the dtype of the array to numpy.double, thinking it could be a floating point precision issue, but it still returns the wrong result.
I am pretty sure I am missing something here... can anyone point me in the right direction or tell me what's wrong with the code above?
Thanks!
There is imprecision here when using float types; use np.isclose to compare an array against a scalar float value:
In [50]:
mask = np.isclose(a,5.0)
print(mask)
na = a[mask]
na
[False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False True False]
Out[50]:
array([ 5.])
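For completeness, a short sketch of why the exact comparison fails: the np.arange steps accumulate rounding error, so the entry near 5.0 is not exactly 5.0 (the exact stored value may vary by platform):

import numpy as np

a = np.arange(4, 5 + 0.025, 0.025)
print(repr(a[40]))  # something like 4.999999999999998 rather than 5.0
# np.isclose compares within tolerances (defaults: rtol=1e-05, atol=1e-08)
print(np.isclose(a, 5.0).any())  # True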