Why does value_counts not show all values present?

I am using pandas 0.18.1 on a large dataframe. I am confused by the behaviour of value_counts(). This is my code:
print df.phase.value_counts()

def normalise_phase(x):
    print x
    return int(str(x).split('/')[0])

df['phase_normalised'] = df['phase'].apply(normalise_phase)
This prints the following:
2 35092
3 26248
1 24646
4 22189
1/2 8295
2/3 4219
0 1829
dtype: int64
1
nan
Two questions:
1. Why is nan printing as an output of normalise_phase, when nan is not listed as a value in value_counts?
2. Why does value_counts show dtype as int64 if it has string values like 1/2 and nan in it too?

You need to pass dropna=False for NaNs to be tallied (see the docs).
int64 is the dtype of the series (counts of the values). The values themselves are the index. dtype of the index will be object, if you check.
ser = pd.Series([1, '1/2', '1/2', 3, np.nan, 5])
ser.value_counts(dropna=False)
Out:
1/2 2
5 1
3 1
1 1
NaN 1
dtype: int64
ser.value_counts(dropna=False).index
Out: Index(['1/2', 5, 3, 1, nan], dtype='object')
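On the first question: the nan printed by normalise_phase is simply a missing value in df.phase that value_counts hides by default. If you want the function to cope with those rows, a minimal sketch (not part of the original code) could skip them explicitly:

import numpy as np
import pandas as pd

def normalise_phase(x):
    # return NaN for missing values instead of crashing on int('nan')
    if pd.isnull(x):
        return np.nan
    return int(str(x).split('/')[0])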

Related

Adding values of two Pandas series with different column names

I have two pandas series of the same length but with different column names. How can one add the values in them?
series.add(other, fill_value=0, axis=0) does avoid NaN-values, but the values are not added. Instead, the result is a concatenation of the two series.
Is there a way to obtain a new series consisting of the sum of the values in two series?
Mismatched indices
The issue is that your two series have different indices. Here's an example:
s1 = pd.Series([1, np.nan, 3, np.nan, 5], index=np.arange(5))
s2 = pd.Series([np.nan, 7, 8, np.nan, np.nan], index=np.arange(5)+10)
print(s1.add(s2, fill_value=0, axis=0))
0 1.0
1 NaN
2 3.0
3 NaN
4 5.0
10 NaN
11 7.0
12 8.0
13 NaN
14 NaN
dtype: float64
You have two options: reindex (for example, via a dictionary), or disregard the indices and add your series positionally.
Map index of one series to align with the other
You can use a dictionary to realign. The mapping below is arbitrary. NaN values occur where, after reindexing, values in both series are NaN:
index_map = dict(zip(np.arange(5) + 10, [3, 2, 4, 0, 1]))
s2.index = s2.index.map(index_map)
print(s1.add(s2, fill_value=0, axis=0))
0 1.0
1 NaN
2 10.0
3 NaN
4 13.0
dtype: float64
Disregard indices; use positional location only
In this case, you can either construct a new series with the regular pd.RangeIndex as index (i.e. 0, 1, 2, ...), or use an index from one of the input series:
# normalized index
res = pd.Series(s1.values + s2.values)
# take index from s1
res = pd.Series(s1.values + s2.values, index=s1.index)
The values attribute lets you access the underlying raw numpy arrays. You can add those.
raw_sum = series.values + other.values
series2 = pd.Series(raw_sum, index=series.index)
This also works:
series2 = series + other.values
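Another purely positional option, for what it's worth (a sketch with made-up data), is to reset both indices before adding; note the result gets a fresh RangeIndex rather than series.index:

import pandas as pd

series = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
other = pd.Series([10.0, 20.0, 30.0], index=[5, 6, 7])

# dropping both indices first makes the addition purely positional
series2 = series.reset_index(drop=True) + other.reset_index(drop=True)
print(series2)   # 11.0, 22.0, 33.0 with a fresh RangeIndex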

Pandas.read_excel sometimes incorrectly reads Boolean values as 1's/0's

I need to read a very large Excel file into a DataFrame. The file has string, integer, float, and Boolean data, as well as missing data and totally empty rows. It may also be worth noting that some of the cell values are derived from cell formulas and/or VBA - although theoretically that shouldn't affect anything.
As the title says, pandas sometimes reads Boolean values as float or int 1's and 0's, instead of True and False. It appears to have something to do with the amount of empty rows and type of other data. For simplicity's sake, I'm just linking a 2-sheet Excel file where the issue is replicated.
Boolean_1.xlsx
Here's the code:
import pandas as pd
df1 = pd.read_excel('Boolean_1.xlsx','Sheet1')
df2 = pd.read_excel('Boolean_1.xlsx','Sheet2')
print(df1, '\n' *2, df2)
Here's the print. Mainly note row ZBA, which has the same values in both sheets, but different values in the DataFrames:
Name stuff Unnamed: 1 Unnamed: 2 Unnamed: 3
0 AFD a dsf ads
1 DFA 1 2 3
2 DFD 123.3 41.1 13.7
3 IIOP why why why
4 NaN NaN NaN NaN
5 ZBA False False True
Name adslfa Unnamed: 1 Unnamed: 2 Unnamed: 3
0 asdf 6.0 3.0 6.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 ZBA 0.0 0.0 1.0
I was also able to get integer 1's and 0's output in the large file I'm actually working on (yay), but wasn't able to easily replicate it.
What could be causing this inconsistency, and is there a way to force pandas to read Booleans as they should be read?
Pandas type-casting is applied per column / series. In general, pandas doesn't work well with mixed types or object dtype; its internal logic determines the most efficient dtype for each series. In this case, pandas has chosen float dtype for a series containing float and bool values. In my opinion, this is efficient and neat.
However, as you noted, this won't work when you have a transposed input dataset. Let's set up an example from scratch:
import pandas as pd, numpy as np

df = pd.DataFrame({'A': [True, False, True, True],
                   'B': [np.nan, np.nan, np.nan, False],
                   'C': [True, 'hello', np.nan, True]})
df = df.astype({'A': bool, 'B': float, 'C': object})
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True 0.0 True
Option 1: change "row dtype"
You can, without transposing your data, change the dtype for objects in a row. This will force series B to have object dtype, i.e. a series storing pointers to arbitrary types:
df.iloc[3] = df.iloc[3].astype(bool)
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True False True
print(df.dtypes)
A bool
B object
C object
dtype: object
Option 2: transpose and cast to Boolean
In my opinion, this is the better option, as a data type is being attached to a specific category / series of input data.
df = df.T # transpose dataframe
df[3] = df[3].astype(bool) # convert series to Boolean
print(df)
0 1 2 3
A True False True True
B NaN NaN NaN False
C True hello NaN True
print(df.dtypes)
0 object
1 object
2 object
3 bool
dtype: object
read_excel determines the dtype of each column from the first row in that column that has a value. If the first row of the column is empty, read_excel continues to the next row until a value is found.
In Sheet1, your first row with values in columns B, C, and D contains strings. Therefore, all subsequent rows are treated as strings for these columns. In this case, FALSE = False.
In Sheet2, your first row with values in columns B, C, and D contains integers. Therefore, all subsequent rows are treated as integers for these columns. In this case, FALSE = 0.
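As for forcing pandas to read Booleans as Booleans: one possible workaround (a sketch only; 'flag' is a hypothetical column name, and it assumes that column contains nothing but 0/1/TRUE/FALSE or blanks) is to pass a per-column converter to read_excel:

import pandas as pd

def to_bool(x):
    # leave empty cells untouched; coerce everything else to a Python bool
    if pd.isnull(x) or x == '':
        return x
    return bool(x)

# 'flag' is a placeholder for whichever column should be Boolean
df2 = pd.read_excel('Boolean_1.xlsx', 'Sheet2', converters={'flag': to_bool})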

Python/Pandas: Unexpected indices when doing a groupby-apply

I'm using Pandas and Numpy on Python3 with the following versions:
Python 3.5.1 (via Anaconda 2.5.0), 64-bit
Pandas 0.19.1
Numpy 1.11.2 (probably not relevant here)
Here is the minimal code producing the problem:
import pandas as pd
import numpy as np
a = pd.DataFrame({'i': [1, 1, 1, 1, 1], 'a': [1, 2, 5, 6, 100], 'b': [2, 4, 10, np.nan, np.nan]})
a.set_index(keys='a', inplace=True)
v = a.groupby(level=0).apply(lambda x: x.sort_values(by='i')['b'].rolling(2, min_periods=0).mean())
v.index.names
This code is a simple groupby-apply, but I don't understand the outcome:
FrozenList(['a', 'a'])
For some reason, the index of the result is ['a', 'a'], which seems to be a very doubtful choice from pandas. I would have expected a simple ['a'].
Does anyone have some idea about why Pandas chooses to duplicate the column in the index?
Thanks in advance.
This is happening because sort_values returns a DataFrame or Series, so its index gets concatenated to the existing groupby index. The same thing happens if you call shift on the 'b' column:
In [99]:
v = a.groupby(level=0).apply(lambda x: x['b'].shift())
v
Out[99]:
a a
1 1 NaN
2 2 NaN
5 5 NaN
6 6 NaN
100 100 NaN
Name: b, dtype: float64
Even with as_index=False it would still produce a multi-index:
In [102]:
v = a.groupby(level=0, as_index=False).apply(lambda x: x['b'].shift())
v
Out[102]:
a
0 1 NaN
1 2 NaN
2 5 NaN
3 6 NaN
4 100 NaN
Name: b, dtype: float64
If the lambda returns a plain scalar value, then no duplicate index is created:
In [104]:
v = a.groupby(level=0).apply(lambda x: x['b'].max())
v
Out[104]:
a
1 2.0
2 4.0
5 10.0
6 NaN
100 NaN
dtype: float64
I don't think this is a bug; rather, it's a bit of semantics to be aware of: some methods return an object whose index is then aligned with (and concatenated to) the pre-existing groupby index.
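If you simply want the plain ['a'] index back from the rolling-mean example, one way (a sketch, not the only option) is to drop the extra level afterwards:

import numpy as np
import pandas as pd

a = pd.DataFrame({'i': [1, 1, 1, 1, 1], 'a': [1, 2, 5, 6, 100],
                  'b': [2, 4, 10, np.nan, np.nan]}).set_index('a')
v = a.groupby(level=0).apply(
    lambda x: x.sort_values(by='i')['b'].rolling(2, min_periods=0).mean())
v = v.reset_index(level=0, drop=True)  # or: v.index = v.index.droplevel(0)
print(v.index.names)                   # FrozenList(['a'])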

pandas dividing a column by lagged values

I'm trying to divide a Pandas DataFrame column by a lagged value, which is 1 in this example.
Create the dataframe (this example has only 1 column, even though my real data has dozens):
dTest = pd.DataFrame(data={'Open': [0.99355, 0.99398, 0.99534, 0.99419]})
When I try this vector division (I'm a Python newbie coming from R):
dTest.ix[range(1,4),'Open'] / dTest.ix[range(0,3),'Open']
I get this output:
NaN 1 1 NaN
But I'm expecting:
1.0004327915052085
1.0013682367854484
0.9988446159101413
There's clearly something that I don't understand about the data structure. I'm expecting 3 values but it's outputting 4. What am I missing?
What you tried failed because the sliced ranges of the indices only overlap on the middle 2 rows. You should use shift to shift the rows to achieve what you want:
In [166]:
dTest['Open'] / dTest['Open'].shift()
Out[166]:
0 NaN
1 1.000433
2 1.001368
3 0.998845
Name: Open, dtype: float64
You can also use div:
In [159]:
dTest['Open'].div(dTest['Open'].shift(), axis=0)
Out[159]:
0 NaN
1 1.000433
2 1.001368
3 0.998845
Name: Open, dtype: float64
You can see that the indices are different when you slice, so when using / only the common indices are affected:
In [164]:
dTest.ix[range(0,3),'Open']
Out[164]:
0 0.99355
1 0.99398
2 0.99534
Name: Open, dtype: float64
In [165]:
dTest.ix[range(1,4),'Open']
Out[165]:
1 0.99398
2 0.99534
3 0.99419
Name: Open, dtype: float64
Here is the intersection of those two indices:
In [168]:
dTest.ix[range(0,3),'Open'].index.intersection(dTest.ix[range(1,4),'Open'].index)
Out[168]:
Int64Index([1, 2], dtype='int64')
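For completeness, pct_change computes x_t / x_{t-1} - 1 with a default lag of 1, so adding 1 gives the same ratios as the shift-based division (assuming that lag is what you want):

import pandas as pd

dTest = pd.DataFrame(data={'Open': [0.99355, 0.99398, 0.99534, 0.99419]})
print(dTest['Open'].pct_change() + 1)   # NaN, then the three expected ratios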

Concatenating Columns Pandas

I'm trying to concatenate several columns, which mostly contain NaNs, into one. Here is an example with just 2 columns:
2013-06-18 21:46:33.422096-05:00 A NaN
2013-06-18 21:46:35.715770-05:00 A NaN
2013-06-18 21:46:42.669825-05:00 NaN B
2013-06-18 21:46:45.409733-05:00 A NaN
2013-06-18 21:46:47.130747-05:00 NaN B
2013-06-18 21:46:47.131314-05:00 NaN B
This could go on for 3, 4, or 10 columns; in each row exactly one value is pd.notnull() and the rest are NaN.
I want to concatenate these into 1 column the fastest way possible. How can I do this?
Since each row has exactly one string and the other cells are NaN, you can simply take the row-wise max:
df.max(axis=1)
As noted in the comments, if this doesn't work on Python 3, convert the NaNs to strings first:
df.fillna('').max(axis=1)
You could do
In [278]: df = pd.DataFrame([[1, np.nan], [2, np.nan], [np.nan, 3]])
In [279]: df
Out[279]:
0 1
0 1 NaN
1 2 NaN
2 NaN 3
In [280]: df.sum(1)
Out[280]:
0 1
1 2
2 3
dtype: float64
Since NaNs are treated as 0 when summed, they don't show up.
A couple of caveats: you need to be sure that only one of the columns has a non-NaN value in each row for this to work, and it will only work on numeric data.
You can also use
df.fillna(method='ffill', axis=1).iloc[:, -1]
The last column will now contain all the valid observations since the valid ones have been filled ahead. See the documentation here. The second way should be more flexible but slower. I slice off every row and the last column with iloc[:, -1].
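A mirror-image variant (just a sketch, relying on the same assumption that each row has exactly one valid value) is to back-fill along the rows and take the first column instead:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': ['A', 'A', np.nan], 'y': [np.nan, np.nan, 'B']})
print(df.fillna(method='bfill', axis=1).iloc[:, 0])   # A, A, B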
