Let's say I have the following pandas DataFrame:
import pandas as pd
d = [0.0, 1.0, 2.0]
e = pd.Series(d, index = ['a', 'b', 'c'])
df = pd.DataFrame({'A': 1., 'B': e, 'C': pd.Timestamp('20130102')})
Now I have another list:
select = ['c', 'a', 'x']
Clearly, the element 'x' is not present in my original df. How can I select rows of df based on select, keeping only the available labels, without raising an error? That is, in this case I want to select only the rows corresponding to 'c' and 'a', maintaining that order.
Any pointers would be very helpful.
You could use reindex + dropna:
out = df.reindex(select).dropna()
You could also filter select before reindexing, which avoids the intermediate NaN rows:
out = df.reindex([i for i in select if i in df.index])
Output:
A B C
c 1.0 2.0 2013-01-02
a 1.0 0.0 2013-01-02
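One caveat with the dropna approach: if df already contains rows with real NaN values, dropna will discard those as well. A minimal sketch with a hypothetical frame df2:
import numpy as np
import pandas as pd
df2 = pd.DataFrame({'A': [1.0, np.nan, 3.0]}, index=['a', 'b', 'c'])
select = ['c', 'b', 'x']
df2.reindex(select).dropna()                        # drops 'b' (a real row) along with 'x'
df2.reindex([i for i in select if i in df2.index])  # keeps 'b', drops only 'x'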
Related
I have a column that looks like this:
group
A
A
A
B
B
C
The value C exists sometimes, but not always. The following works fine when C is present; however, if C does not occur in the column, it throws a KeyError:
value_counts = df.group.value_counts()
new_df["C"] = value_counts.C
I want to check whether C has a count or not; if it doesn't, I want to assign new_df["C"] a value of 0. I tried this, but I still get a KeyError. What else can I try?
value_counts = df.group.value_counts()
if value_counts['C']:
    new_df["C"] = value_counts['C']
else:
    new_df["C"] = 0
One way of doing it is to convert the Series into a dictionary and look up the key, returning a default value (in your case 0) when the key is not found:
df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'D']})
new_df = {}
character = "C"
new_df[character] = df.group.value_counts().to_dict().get(character, 0)
Output of new_df:
{'C': 0}
However, I am not sure what new_df should be; it seems to be a dictionary here, but it might be meant as a new DataFrame object.
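As a side note, a pandas Series has its own .get method, so the to_dict conversion isn't strictly necessary; a minimal one-line sketch:
new_df[character] = df.group.value_counts().get(character, 0)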
One way could be to convert the group column to a Categorical type with the specified categories, e.g.:
df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B']})
print(df)
# group
# 0 A
# 1 A
# 2 A
# 3 B
# 4 B
categories = ['A', 'B', 'C']
df['group'] = pd.Categorical(df['group'], categories=categories)
df['group'].value_counts()
[out]
A 3
B 2
C 0
Name: group, dtype: int64
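An alternative sketch that avoids the Categorical conversion: reindex the counts against the full list of expected categories, filling missing ones with 0:
counts = df['group'].value_counts().reindex(categories, fill_value=0)
counts['C']  # 0, even though 'C' never occurs in the column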
How can I impute missing values, or values equal to 0, with the average of the two nearest non-zero values in pandas?
One possibility would be to replace the 0s with None and then use .ffill() and .bfill() on the column in question:
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e'], 'b': [1, 2, 0, 0, 5]})
df.loc[df['b'] == 0, 'b'] = None  # only column 'b'; assigning to the whole rows would wipe column 'a' too
df['b'] = (df['b'].ffill() + df['b'].bfill()) * 0.5
a b
0 a 1.0
1 b 2.0
2 c 3.5
3 d 3.5
4 e 5.0
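A related sketch: Series.interpolate also fills the gaps, but with linear interpolation, which differs from the two-neighbour average when several zeros occur in a row:
import numpy as np
import pandas as pd
s = pd.Series([1, 2, 0, 0, 5]).replace(0, np.nan)
(s.ffill() + s.bfill()) * 0.5  # 3.5, 3.5 for the run of zeros (as above)
s.interpolate()                # 3.0, 4.0 instead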
I need to get a subset of a pandas Series starting from the cell before the first non-blank one.
For example, for the Series:
>>> s = pd.Series([np.NaN, np.NaN, 1], index=['a', 'b', 'c'])
>>> s
a NaN
b NaN
c 1.0
dtype: float64
I need to get the subset containing rows 'b' and 'c'. Like this:
b NaN
c 1.0
dtype: float64
I have the following code:
import pandas as pd
import numpy as np
s = pd.Series([np.NaN, np.NaN, 1], index=['a', 'b', 'c'])
lst = s.index.to_list()
s[lst[lst.index(s.first_valid_index())-1:]]
Is there a simpler and/or faster way to do this? Note that the data may contain blanks in place of NAs.
Use get_loc (and you won't have to build lst anymore) together with first_valid_index; this is slightly more readable:
s[s.index.get_loc(s.first_valid_index())-1:]
b NaN
c 1.0
dtype: float64
This will work assuming your index values are unique.
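One edge case worth noting, as a sketch: if the first valid value is already at position 0, get_loc(...) - 1 becomes -1 and the slice silently wraps around to the last element only. Clamping the position and slicing with iloc guards against that:
pos = s.index.get_loc(s.first_valid_index())
s.iloc[max(pos - 1, 0):]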
To handle blanks, use replace:
s2 = pd.Series(['', np.NaN, 1], index=['a', 'b', 'c'])
s2[s2.index.get_loc(s2.replace('', np.nan).first_valid_index())-1:]
b NaN
c 1
dtype: object
I would use idxmax and bfill:
s[s.loc[:s.idxmax()].bfill(limit=1).notna()]
b NaN
c 1.0
dtype: float64
pandas has support for multi-level column names:
>>> x = pd.DataFrame({'instance': ['first', 'first', 'first'], 'foo': ['a', 'b', 'c'], 'bar': np.random.rand(3)})
>>> x = x.set_index(['instance','foo']).transpose()
>>> x.columns
MultiIndex
[(u'first', u'a'), (u'first', u'b'), (u'first', u'c')]
>>> x
instance first
foo a b c
bar 0.102885 0.937838 0.907467
This feature is very useful, since it allows multiple versions of the same dataframe to be appended 'horizontally', with the first level of the column names (in my example, instance) distinguishing the instances.
Imagine I already have a dataframe like this:
a b c
bar 0.102885 0.937838 0.907467
Is there a nice way to add another level to the column names, similar to this for the row index:
x['instance'] = 'first'
x.set_index('instance', append=True)
Try this:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
columns = [('c', 'a'), ('c', 'b')]
df.columns = pd.MultiIndex.from_tuples(columns)
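For reference, the resulting frame looks like this:
   c
   a  b
0  1  4
1  2  5
2  3  6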
No need to create a list of tuples.
Use pd.MultiIndex.from_product(iterables):
import pandas as pd
import numpy as np
df = pd.Series(np.random.rand(3), index=["a","b","c"]).to_frame().T
df.columns = pd.MultiIndex.from_product([["new_label"], df.columns])
Resultant DataFrame:
new_label
a b c
0 0.25999 0.337535 0.333568
(See the pull request from Jan 25, 2014.)
You can use concat. Give it a dictionary of dataframes where the key is the new column level you want to add.
In [46]: d = {}
In [47]: d['first_level'] = pd.DataFrame(columns=['idx', 'a', 'b', 'c'],
data=[[10, 0.89, 0.98, 0.31],
[20, 0.34, 0.78, 0.34]]).set_index('idx')
In [48]: pd.concat(d, axis=1)
Out[48]:
first_level
a b c
idx
10 0.89 0.98 0.31
20 0.34 0.78 0.34
You can use the same technique to create multiple levels.
In [49]: d['second_level'] = pd.DataFrame(columns=['idx', 'a', 'b', 'c'],
data=[[10, 0.29, 0.63, 0.99],
[20, 0.23, 0.26, 0.98]]).set_index('idx')
In [50]: pd.concat(d, axis=1)
Out[50]:
first_level second_level
a b c a b c
idx
10 0.89 0.98 0.31 0.29 0.63 0.99
20 0.34 0.78 0.34 0.23 0.26 0.98
A lot of these solutions seem just a bit more complex than they need to be.
I prefer to make things look as simple and intuitive as possible when speed isn't absolutely necessary. I think this solution accomplishes that.
Tested in versions of pandas as early as 0.22.0.
Simply create a DataFrame (ignoring the columns in the first step) and then set columns equal to your nested list of column names.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2]])
In [3]: df
Out[3]:
0 1 2 3
0 1 1 1 1
1 2 2 2 2
In [4]: df.columns = [['a', 'c', 'e', 'g'], ['b', 'd', 'f', 'h']]
In [5]: df
Out[5]:
a c e g
b d f h
0 1 1 1 1
1 2 2 2 2
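A nice side effect of MultiIndex columns, as a quick sketch: indexing with a first-level label returns the whole group of columns beneath it:
In [6]: df['a']
Out[6]:
   b
0  1
1  2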
x = [('G1', 'a'), ('G1', 'b'), ('G2', 'a'), ('G2', 'b')]
y = [('K1', 'l'), ('K1', 'm'), ('K2', 'l'), ('K2', 'm'), ('K3', 'l'), ('K3', 'm')]
row_list = pd.MultiIndex.from_tuples(x)
col_list = pd.MultiIndex.from_tuples(y)
A = pd.DataFrame(np.random.randint(2, 5, (4, 6)), index=row_list, columns=col_list)
A
This is the simplest and easiest way to create multi-level columns and rows.
Here is a function that builds the list of tuples for pd.MultiIndex.from_tuples() a bit more generically. Got the idea from @user3377361.
def create_tuple_for_for_columns(df_a, multi_level_col):
    """
    Create a list of column tuples that can be passed to pd.MultiIndex.from_tuples()
    to build multi-level columns.
    :param df_a: pandas DataFrame whose columns will form the second level of the MultiIndex
    :param multi_level_col: name of the first-level column
    :return: list of tuples of the form (multi_level_col, original_col)
    """
    temp_columns = []
    for item in df_a.columns:
        temp_columns.append((multi_level_col, item))
    return temp_columns
It can be used like this:
df = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
columns = create_tuple_for_for_columns(df, 'c')
df.columns = pd.MultiIndex.from_tuples(columns)
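For this single-top-level case, the helper is equivalent to the from_product approach shown earlier; starting from the original flat df, a one-line sketch:
df.columns = pd.MultiIndex.from_product([['c'], df.columns])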
I was trying to generate a pivot table with multiple "values" columns. I know I can use aggfunc to aggregate values the way I want to, but what if I don't want to sum or average both columns, and instead want the sum of one column and the mean of the other? Is this possible with pandas?
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['one', 'one', 'two', 'three'] * 6,
    'B': ['A', 'B', 'C'] * 8,
    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
    'D': np.random.randn(24),
    'E': np.random.randn(24)
})
Now this will get a pivot table with sums:
pd.pivot_table(df, values=['D', 'E'], index=['B'], aggfunc=np.sum)
And this one with means:
pd.pivot_table(df, values=['D', 'E'], index=['B'], aggfunc=np.mean)
How can I get sum for D and mean for E?
Hope my question is clear enough.
You can apply a specific function to a specific column by passing in a dict.
pd.pivot_table(df, values=['D', 'E'], index=['B'], aggfunc={'D': np.sum, 'E': np.mean})
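Per the pandas docs, the dict values may also be a single function (or function name) or a list of functions, so you can even mix single and multiple aggregations per column; a sketch assuming a reasonably recent pandas:
pd.pivot_table(df, values=['D', 'E'], index=['B'],
               aggfunc={'D': 'sum', 'E': ['mean', 'sum']})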
You can concat two DataFrames:
>>> df1 = pd.pivot_table(df, values=['D'], index=['B'], aggfunc=np.sum)
>>> df2 = pd.pivot_table(df, values=['E'], index=['B'], aggfunc=np.mean)
>>> pd.concat((df1, df2), axis=1)
D E
B
A 1.810847 -0.524178
B 2.762190 -0.443031
C 0.867519 0.078460
Or you can pass a list of functions as the aggfunc parameter and then reindex:
>>> df3 = pd.pivot_table(df, values=['D', 'E'], index=['B'], aggfunc=[np.sum, np.mean])
>>> df3
sum mean
D E D E
B
A 1.810847 -4.193425 0.226356 -0.524178
B 2.762190 -3.544245 0.345274 -0.443031
C 0.867519 0.627677 0.108440 0.078460
>>> df3 = df3.loc[:, [('sum', 'D'), ('mean', 'E')]]
>>> df3.columns = ['D', 'E']
>>> df3
D E
B
A 1.810847 -0.524178
B 2.762190 -0.443031
C 0.867519 0.078460
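A small variant on the renaming step, as a sketch: instead of assigning the column names by hand, drop the aggregation level from the MultiIndex:
>>> df3.columns = df3.columns.droplevel(0)  # ('sum', 'D'), ('mean', 'E') -> 'D', 'E'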
Although, it would be nice to have an option to define aggfunc for each column individually. I don't know how it could be done; maybe pass a dict-like parameter into aggfunc, like {'D': np.mean, 'E': np.sum}. (As the first answer above shows, pandas now supports exactly this.)
Update: actually, in your case you can pivot by hand with groupby:
>>> df.groupby('B').aggregate({'D':np.sum, 'E':np.mean})
E D
B
A -0.524178 1.810847
B -0.443031 2.762190
C 0.078460 0.867519
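The modern equivalent of this hand-rolled version is named aggregation (available since pandas 0.25), which also gives you the column order you want; a sketch:
>>> df.groupby('B').agg(D=('D', 'sum'), E=('E', 'mean'))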
Note that the output below uses the example frame from the pandas docs (where A holds 'foo'/'bar' and C holds 'small'/'large'), not the question's df:
table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
                       aggfunc={'D': np.mean, 'E': np.sum})
table
D E
mean sum
A C
bar large 5.500000 7.500000
small 5.500000 8.500000
foo large 2.000000 4.500000
small 2.333333 4.333333