Pandas .loc doesn't find null values based on .isnull() boolean array - python

I'm trying to get info about the null values in my DF column LotFrontage. As you can see, there are some of them, confirmed in two ways:
hp_lot = houseprices[['LotFrontage', 'LotArea']]
hp_lot.describe()
hp_lot.info()
output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2919 entries, 1 to 2919
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 LotFrontage 2433 non-null float64
1 LotArea 2919 non-null int64
dtypes: float64(1), int64(1)
memory usage: 68.4 KB
houseprices_num['LotFrontage'].isnull().describe()
output:
count 2919
unique 2
top False
freq 2433
But when I'm trying to locate them, I'm just getting this:
lf_null = houseprices_num['LotFrontage'].loc[houseprices_num['LotFrontage'].isnull()]
lf_null.describe()
output:
count 0.0
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Name: LotFrontage, dtype: float64
My question is: where the hell are my null values? And if I messed something up in the syntax, why am I not getting an error message of some kind?
The variables:
traindf = pd.read_csv('.\\train.csv', sep=',', header=1, index_col='Id')
testdf = pd.read_csv('.\\test.csv', sep=',', header=0, index_col='Id')
houseprices = pd.concat([traindf, testdf], axis=0)
houseprices_num = houseprices[['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
'3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'SalePrice']]

The mistake is with this line:
lf_null = houseprices_num['LotFrontage'].loc[houseprices_num['LotFrontage'].isnull()]
lf_null.describe()
You need to use .loc on the full DataFrame in the following way:
lf_null = houseprices_num.loc[houseprices_num['LotFrontage'].isnull(), ['LotFrontage']]
lf_null.describe()
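Note that describe() only reports statistics over non-null values, so an all-NaN selection will always show count 0 even though the rows were found (the info output above implies 2919 - 2433 = 486 of them). A quicker way to confirm the selection worked, as a minimal sketch with hypothetical data standing in for the CSVs:
import pandas as pd
import numpy as np

# hypothetical stand-in for houseprices_num
houseprices_num = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 80.0, np.nan],
    'LotArea': [8450, 9600, 11250, 9550],
})

mask = houseprices_num['LotFrontage'].isnull()
lf_null = houseprices_num.loc[mask, ['LotFrontage']]

print(mask.sum())           # 2 -> number of nulls found
print(len(lf_null))         # 2 -> the null rows really are there
print(lf_null.describe())   # count 0.0, because describe() excludes NaN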

Related

Pandas Data Frame Graphing Issue

I am curious why, when I create a data frame in the manner below using lists for the row values, it does not graph and instead gives me the error "ValueError: x must be a label or position".
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
values = [9.83, 19.72, 7.19, 3.04]
values
[9.83, 19.72, 7.19, 3.04]
cols = ['Condition', 'No-Show']
conditions = ['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism']
df = pd.DataFrame(columns = [cols])
df['Condition'] = conditions
df['No-Show'] = values
df
Condition No-Show
0 Scholarship 9.83
1 Hipertension 19.72
2 Diabetes 7.19
3 Alcoholism 3.04
df.plot(kind='bar', x='Condition', y='No-Show');
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [17], in <cell line: 1>()
----> 1 df.plot(kind='bar', x='Condition', y='No-Show')
File ~\anaconda3\lib\site-packages\pandas\plotting\_core.py:938, in
PlotAccessor.__call__(self, *args, **kwargs)
936 x = data_cols[x]
937 elif not isinstance(data[x], ABCSeries):
--> 938 raise ValueError("x must be a label or position")
939 data = data.set_index(x)
940 if y is not None:
941 # check if we have y as int or list of ints
ValueError: x must be a label or position
Yet if I create the same DataFrame a different way, it graphs just fine:
df2 = pd.DataFrame({'Condition': ['Scholarship', 'Hipertension', 'Diatebes', 'Alcoholism'],
'No-Show': [9.83, 19.72, 7.19, 3.04]})
df2
Condition No-Show
0 Scholarship 9.83
1 Hipertension 19.72
2 Diatebes 7.19
3 Alcoholism 3.04
df2.plot(kind='bar', x='Condition', y='No-Show')
plt.ylim(0, 50)
#graph appears here just fine
Can someone enlighten me why it works the second way and not the first? I am a new student and am confused. I appreciate any insight.
Let's look at pd.DataFrame.info for both dataframes.
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 (Condition,) 4 non-null object
1 (No-Show,) 4 non-null float64
dtypes: float64(1), object(1)
memory usage: 192.0+ bytes
Note that your column headers are one-element tuples rather than strings.
Now, look at info for df2.
df2.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Condition 4 non-null object
1 No-Show 4 non-null float64
dtypes: float64(1), object(1)
memory usage: 192.0+ bytes
Note that your column headers here are plain strings.
As @BigBen states in his comment, you don't need the extra brackets in your DataFrame constructor for df.
FYI, if you want to keep the incorrect constructor as-is, you can still plot by passing the tuple labels:
df.plot(kind='bar', x=('Condition',), y=('No-Show',))
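The cleaner fix is to drop the extra brackets so the headers become plain strings; a minimal sketch reusing the question's lists:
import pandas as pd
import matplotlib.pyplot as plt

values = [9.83, 19.72, 7.19, 3.04]
conditions = ['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism']

# pass the list of names directly, not a list wrapped in another list
df = pd.DataFrame(columns=['Condition', 'No-Show'])
df['Condition'] = conditions
df['No-Show'] = values

df.plot(kind='bar', x='Condition', y='No-Show')
plt.show()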

python pandas | replacing the date and time string with only time

price  quantity  high time
10.4   3         2021-11-08 14:26:00-05:00
The dataframe is named ddg, and the dtype for high time is datetime64[ns, America/New_York].
I want the high time to be only 14:26:00 (getting rid of 2021-11-08 and -05:00), but I got an error when using the code below:
ddg['high_time'] = ddg['high_time'].dt.strftime('%H:%M')
I think it's because it's not the right column name:
# Your code
>>> ddg['high_time'].dt.strftime('%H:%M')
...
KeyError: 'high_time'
# With right column name
>>> ddg['high time'].dt.strftime('%H:%M')
0 14:26
Name: high time, dtype: object
# My dataframe:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 1 non-null float64
1 quantity 1 non-null int64
2 high time 1 non-null datetime64[ns, America/New_York]
dtypes: datetime64[ns, America/New_York](1), float64(1), int64(1)
memory usage: 152.0 bytes
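For completeness, a minimal reproduction of the fix, assuming a one-row frame like the one shown in the df.info() output above:
import pandas as pd

ddg = pd.DataFrame({
    'price': [10.4],
    'quantity': [3],
    'high time': pd.to_datetime(['2021-11-08 14:26:00-05:00']).tz_convert('America/New_York'),
})

# the column name contains a space: 'high time', not 'high_time'
ddg['high time'] = ddg['high time'].dt.strftime('%H:%M:%S')
print(ddg)   # the high time column now holds just '14:26:00'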

Filtering out strings in a Pandas Dataframe

I have the following formulas that I use to compute data in my Dataframe. The Dataframe consists of downloaded data. My index is made of dates, and the first row contains only strings:
cols = df.columns.values.tolist()
weight = pd.DataFrame([df[col] / df.sum(axis=1) for col in df], index=cols).T
std = pd.DataFrame([df.std(axis=1) for col in df], index=cols).T
                     A      B      C      D      E
2006-04-27 00:00:00  'dd'   'de'   'ede'  'wew'  'were'
2006-04-28 00:00:00  69.62  69.62  6.518  65.09  69.62
2006-05-01 00:00:00  71.5   71.5   6.522  65.16  71.5
2006-05-02 00:00:00  72.34  72.34  6.669  66.55  72.34
2006-05-03 00:00:00  70.22  70.22  6.662  66.46  70.22
2006-05-04 00:00:00  68.32  68.32  6.758  67.48  68.32
2006-05-05 00:00:00  68     68     6.805  67.99  68
2006-05-08 00:00:00  67.88  67.88  6.768  67.56  67.88
The issue I am having is that the formulas I use do not seem to ignore the index, or the first indexed row where there are only strings. Thus I get the following error for the weight formula:
TypeError: Cannot compare type 'Timestamp' with type 'str'
and I get the following error for the std formula:
ValueError: No axis named 1 for object type
You could filter the rows so as to compute weight and standard deviation as follows:
df_string = df.iloc[0]                   # first row (the strings)
df_numeric = df.iloc[1:].astype(float)   # remaining rows, cast to float
cols = df_numeric.columns.values.tolist()
Computing:
weight = pd.DataFrame([df_numeric[col] / df_numeric.sum(axis=1) for col in df_numeric], index=cols).T
weight
std = pd.DataFrame([df_numeric.std(axis=1) for col in df_numeric], index=cols).T
std
To reassign, say, the std values back to the original DF, you could do:
df_string_std = df_string.to_frame().T.append(std)
df_string_std
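One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so with a recent pandas the reassembly step would be written with pd.concat instead (same frames as above):
import pandas as pd

# stack the string row back on top of the computed std rows
df_string_std = pd.concat([df_string.to_frame().T, std])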
As the OP had difficulty in reproducing the results, here is the complete summary of the DF used:
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8 entries, 2006-04-27 to 2006-05-08
Data columns (total 5 columns):
A 8 non-null object
B 8 non-null object
C 8 non-null object
D 8 non-null object
E 8 non-null object
dtypes: object(5)
memory usage: 384.0+ bytes
df.index
DatetimeIndex(['2006-04-27', '2006-04-28', '2006-05-01', '2006-05-02',
'2006-05-03', '2006-05-04', '2006-05-05', '2006-05-08'],
dtype='datetime64[ns]', name='Date', freq=None)
Starting DF used:
df
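Since the original print of the starting frame did not survive, here is a reconstruction from the table in the question (first row strings, the rest numeric, all columns held as object):
import pandas as pd

dates = pd.to_datetime(['2006-04-27', '2006-04-28', '2006-05-01', '2006-05-02',
                        '2006-05-03', '2006-05-04', '2006-05-05', '2006-05-08'])
df = pd.DataFrame({
    'A': ['dd',   69.62, 71.5,  72.34, 70.22, 68.32, 68,    67.88],
    'B': ['de',   69.62, 71.5,  72.34, 70.22, 68.32, 68,    67.88],
    'C': ['ede',  6.518, 6.522, 6.669, 6.662, 6.758, 6.805, 6.768],
    'D': ['wew',  65.09, 65.16, 66.55, 66.46, 67.48, 67.99, 67.56],
    'E': ['were', 69.62, 71.5,  72.34, 70.22, 68.32, 68,    67.88],
}, index=pd.Index(dates, name='Date'))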

pandas update a dataframe column based on prefiltered groupby object

Given a dataframe d such as this:
index col1
1 a
2 a
3 b
4 b
Create a prefiltered group object with new values:
g = d[prefilter].groupby(['some cols']).apply( somefunc )
index col1
2 c
4 d
Now I want to update df to this:
index col1
1 a
2 c
3 b
4 d
I've been hacking away with update, ix, filtering, where, etc. I am guessing there is an obvious solution I am not seeing here.
Stuff like this is not working:
d[d.index == db.index]['alert_v'] = db['alert_v']
q90 = g.transform( somefunc )
d.ix[ d['alert_v'] >=q90, 'alert_v'] = 1
d.ix[ d['alert_v'] < q90, 'alert_v'] = 0
d['alert_v'] = np.where( d.index==db.index, db['alert_v'], d['alert_v'] )
Any help is appreciated, thank you.
--edit--
The two dataframes have the same form: one is simply a filtered version of the other, with different values that I want to write back to the original. Trying to do so gives:
ValueError: cannot reindex from a duplicate axis
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2186 entries, 1984-12-12 13:33:00 to 1939-03-19 22:54:00
Data columns (total 9 columns):
source 2186 non-null object
subject_id 2186 non-null float64
alert_t 2186 non-null object
variable 2186 non-null object
timeindex 2186 non-null datetime64[ns]
alert_v 2105 non-null float64
value 2186 non-null float64
tavg 54 non-null timedelta64[ns]
iqt 61 non-null object
dtypes: datetime64[ns](1), float64(3), object(4), timedelta64[ns](1)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1982 entries, 1984-12-12 13:33:00 to 1939-03-19 22:54:00
Data columns (total 9 columns):
source 1982 non-null object
subject_id 1982 non-null float64
alert_t 1982 non-null object
variable 1982 non-null object
timeindex 1982 non-null datetime64[ns]
alert_v 1982 non-null int64
value 1982 non-null float64
tavg 0 non-null timedelta64[ns]
iqt 0 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4), timedelta64[ns](1)
You want the df.update() function.
Try something like this:
import pandas as pd
df1 = pd.DataFrame({'Index':[1,2,3,4],'Col1':['A', 'B', 'C', 'D']}).set_index('Index')
df2 = pd.DataFrame({'Index':[2,4],'Col1':['E', 'F']}).set_index('Index')
print(df1)
Col1
Index
1 A
2 B
3 C
4 D
df1.update(df2)
print(df1)
Col1
Index
1 A
2 E
3 C
4 F
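As for the OP's ValueError: cannot reindex from a duplicate axis: update() aligns the two frames on their index labels, and alignment fails when the DatetimeIndex contains repeated timestamps. A quick check before updating (a sketch, assuming the d and db frames from the question):
# both must be False for d.update(db) to align cleanly
print(d.index.has_duplicates)
print(db.index.has_duplicates)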

missing values using pandas.rolling_mean

I have lots of missing values when calculating rolling_mean with:
import datetime as dt
import pandas as pd
import pandas.io.data as web
stocklist = ['MSFT', 'BELG.BR']
# read historical prices for last 11 years
def get_px(stock, start):
    return web.get_data_yahoo(stock, start)['Adj Close']
today = dt.date.today()
start = str(dt.date(today.year-11, today.month, today.day))
px = pd.DataFrame({n: get_px(n, start) for n in stocklist})
px.ffill()
sma200 = pd.rolling_mean(px, 200)
got following result:
In [14]: px
Out[14]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2836 entries, 2002-01-14 00:00:00 to 2013-01-11 00:00:00
Data columns:
BELG.BR 2270 non-null values
MSFT 2769 non-null values
dtypes: float64(2)
In [15]: sma200
Out[15]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2836 entries, 2002-01-14 00:00:00 to 2013-01-11 00:00:00
Data columns:
BELG.BR 689 non-null values
MSFT 400 non-null values
dtypes: float64(2)
Any idea why most of the sma200 rolling_mean values are missing, and how to get the complete list?
px.ffill() returns a new DataFrame. To modify px itself, use inplace=True.
px.ffill(inplace=True)
sma200 = pd.rolling_mean(px, 200)
print(sma200)
yields
Data columns:
BELG.BR 2085 non-null values
MSFT 2635 non-null values
dtypes: float64(2)
If you print sma200, you will probably find lots of null or missing values. This is because min_periods, the threshold for the number of non-null points required in each window, defaults to the full window size for rolling_mean, so any window containing a missing value yields NaN.
Try using
sma200 = pd.rolling_mean(px, 200, min_periods=2)
From the pandas docs:
min_periods: threshold of non-null data points to require (otherwise result is NA)
You could also try changing the size of the window if your dataset is missing many points.
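Note that pd.rolling_mean (and pandas.io.data) have since been removed from pandas; in current versions the same computation uses the .rolling accessor, as in this sketch assuming px is already loaded:
# modern equivalent of pd.rolling_mean(px, 200, min_periods=2)
px = px.ffill()   # ffill(inplace=True) also still works
sma200 = px.rolling(window=200, min_periods=2).mean()
print(sma200)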
