How can I get an array that aggregates the grouped column into a single entity (list/array) while also returning NaNs for results that do not match the where clause condition?
# example
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'flag': [1, 1, 0, 0],
                    'part': ['a', 'b', np.nan, np.nan],
                    'id': [1, 1, 2, 3]})
# my try
np.where(df1['flag'] == 1, df1.groupby(['id'])['part'].agg(np.array), df1.groupby(['id'])['part'].agg(np.array))
# operands could not be broadcast together with shapes (4,) (3,) (3,)
# expected
np.array((np.array(('a', 'b')), np.array(('a', 'b')), np.nan, np.nan), dtype=object)
Drop the rows with NaN in the part column, group the remaining rows by id, and aggregate part into a list; finally, map the aggregated Series back onto the id column to get the result.
s = df1.dropna(subset=['part']).groupby('id')['part'].agg(list)
df1['id'].map(s).to_numpy()
array([list(['a', 'b']), list(['a', 'b']), nan, nan], dtype=object)
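If you specifically need numpy arrays inside the result (as in the expected output above) rather than lists, the same idea works by converting each aggregated list to an array; a minimal variation of the answer, not a separate approach:
s = df1.dropna(subset=['part']).groupby('id')['part'].agg(list).map(np.array)
# Map the per-id arrays back onto the id column; ids with no non-NaN parts become NaN
df1['id'].map(s).to_numpy()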
Suppose I have a dataframe where some columns contain a zero value as one of their elements (or potentially more than one zero). I don't specifically want to retrieve these columns or discard them (I know how to do that) - I just want to locate them. For instance: if there are zeros somewhere in the 4th, 6th and 23rd columns, I want a list with the output [4, 6, 23].
You could iterate over the columns, checking whether 0 occurs in each column's values:
[i for i, c in enumerate(df.columns) if 0 in df[c].values]
Use any() for the fastest, vectorized approach.
For instance,
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [0, 100, 200],
                   'col3': ['a', 'b', 'c']})
Then,
>>> s = df.eq(0).any()
>>> s
col1    False
col2     True
col3    False
dtype: bool
From here, it's easy to get the indexes. For example,
>>> s[s].index.tolist()
['col2']
There are many ways to retrieve the indexes from a pd.Series of booleans.
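For example, positional column indexes (closer to the [4, 6, 23] style asked for in the question) can be pulled out with np.flatnonzero, assuming s is the boolean Series above:
import numpy as np

# 0-based positions of the columns that contain at least one zero
np.flatnonzero(s).tolist()
# [1]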
Here is an approach that leverages a couple of lambda functions:
d = {'a': np.random.randint(10, size=100),
     'b': np.random.randint(1, 10, size=100),
     'c': np.random.randint(10, size=100),
     'd': np.random.randint(1, 10, size=100)}
df = pd.DataFrame(d)
df.apply(lambda x: (x==0).any()).reset_index()[lambda x: x[0]].index.to_list()
[0, 2]
Another idea based on @rafaelc's slick answer (but returning relative locations of the columns instead of column names):
df.eq(0).any().reset_index()[lambda x: x[0]].index.to_list()
[0, 2]
Or with the column names instead of locations:
df.apply(lambda x: (x==0).any())[lambda x: x].index.to_list()
['a', 'c']
This is my task:
Write a function that accepts a dataframe, the name of the column with missing values, and a list of grouping columns, and returns the dataframe with the missing values filled in with the median value.
Here is what I tried to do:
def fillnull(set, col):
    val = {col: set[col].sum() / set[col].count()}
    set.fillna(val)
    return set
fillnull(titset,'Age')
My problem is that my function doesn't work; also, I don't know how to compute the median or how to group within this function.
Check whether this code works for you:
import pandas as pd

df = pd.DataFrame({
    'processId': range(100, 900, 100),
    'groupId': [1, 1, 2, 2, 3, 3, 4, 4],
    'other': [1, 2, 3, None, 3, 4, None, 9]
})
print(df)
def fill_na(df, missing_value_col, grouping_col):
    # Per-group medians, indexed by the grouping column
    values = df.groupby(grouping_col)[missing_value_col].median()
    # Align rows to their group so fillna can match each row to its group median
    df.set_index(grouping_col, inplace=True)
    df[missing_value_col] = df[missing_value_col].fillna(values)
    df.reset_index(inplace=True)
    return df
fill_na(df, 'other', 'groupId')
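As a side note, the same fill can be done without touching the index by using groupby().transform, which returns each group's median aligned back to the original rows; a sketch, applied to a fresh copy of the example frame:
# Fill NaNs in 'other' with the median of that row's groupId group
df['other'] = df['other'].fillna(
    df.groupby('groupId')['other'].transform('median')
)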
Suppose we have a dataframe of N rows and m columns. For each column, I want to find the first index for which a condition is satisfied.
df = pd.DataFrame(np.random.random((50,5)), columns=['A', 'B', 'C', 'D', 'E'])
For each column, I can do
df[df['A']<0.5].index[0]
to find the first row for that column. I was wondering if there is a way to get it for all columns without a for loop?
You can use numpy.argmax
>>> (df.to_numpy() < 0.5).argmax(axis=0)
array([0, 0, 1, 0, 1])
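If you want index labels rather than positions, idxmax on the boolean frame gives the first True per column. Note that both argmax and idxmax return the first position/label even when a column never satisfies the condition, so that case needs a separate check; a small sketch:
mask = df < 0.5
first = mask.idxmax()          # first index label where the condition holds, per column
first[~mask.any()] = None      # mark columns where the condition never holds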
I need to subset a dataframe using groups and three conditional rules. If within a group all values in the Value column are none, I need to retain the first row for that group. If within a group all values in the Value column are not none, I need to retain all the values. If within a group some of the values in the Value column are none and others not none, I need to drop all rows where there is a none. Columns Region and ID together define a unique group within the dataframe.
My first approach was to separate the dataframe into two chunks. The first chunk is rows where for a group there are all nulls. The second chunk is everything else. For the chunk of data where rows for a group contained all nulls, I would create a rownumber using a cumulative count of rows by group and query rows where the cumulative count = 1. For the second chunk, I would drop all rows where Value is null. Then I would append the dataframes.
Sample source dataframe
dfInput = pd.DataFrame({
    'Region': [1, 1, 2, 2, 2, 2, 2],
    'ID': ['A', 'A', 'B', 'B', 'B', 'A', 'A'],
    'Value': [0, 1, 1, None, 2, None, None],
})
Desired output dataframe:
dfOutput = pd.DataFrame({
    'Region': [1, 1, 2, 2, 2],
    'ID': ['A', 'A', 'B', 'B', 'A'],
    'Value': [0, 1, 1, 2, None],
})
Just follow your logic, using groupby:
dfInput.groupby(['Region', 'ID']).Value.apply(
    lambda x: x.head(1) if x.isnull().all() else x.dropna()
).reset_index(level=[0, 1]).sort_index()
Out[86]:
   Region ID  Value
0       1  A    0.0
1       1  A    1.0
2       2  B    1.0
4       2  B    2.0
5       2  A    NaN
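The same three rules can also be expressed with boolean masks instead of apply; a rough sketch using the same column names:
grp = dfInput.groupby(['Region', 'ID'])['Value']
all_null = grp.transform(lambda s: s.isnull().all())       # group contains only NaNs
first_in_group = ~dfInput.duplicated(subset=['Region', 'ID'])
# Keep the first row of all-NaN groups, otherwise keep only the non-NaN rows
keep = first_in_group.where(all_null, dfInput['Value'].notnull())
dfInput[keep]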
I am exploring the MultiIndex, but for some reason the very basic indexing does not work for me.
The index:
In [119]: index
Out[119]: MultiIndex(levels=[[u'Group 1', u'Group 2'], [u'A01', u'A02', u'A03', u'A04']],
labels=[[0, 1, 0, 0], [0, 1, 2, 3]],
names=[u'Group', u'Well'])
The dataframe:
df = pd.DataFrame(np.random.randn(4,2), index=index)
The dataframe has the index:
In [124]: df.index
Out[124]:
MultiIndex(levels=[[u'Group 1', u'Group 2'], [u'A01', u'A02', u'A03', u'A04']],
labels=[[0, 1, 0, 0], [0, 1, 2, 3]],
names=[u'Group', u'User'])
However indexing:
df['Group 1']
only results in an error
KeyError: 'Group 1'
How can this be fixed?
To slice by the index, you need loc for data frames, as basic indexing with [] is meant to select columns. Since the data frame doesn't contain a column named Group 1, it raises a KeyError:
df.loc['Group 1']
# 0 1
#Well
#A01 -0.337359 -0.113165
#A03 0.212714 1.619850
#A04 1.411829 -0.892723
Basic indexing table:
# Object Type Selection Return Value Type
# Series series[label] scalar value
# DataFrame frame[colname] Series corresponding to colname
# Panel panel[itemname] DataFrame corresponding to the itemname
loc indexing table:
#Object Type Indexers
# Series s.loc[indexer]
# DataFrame df.loc[row_indexer,column_indexer]
# Panel p.loc[item_indexer,major_indexer,minor_indexer]
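For completeness, a single row at both index levels can be selected by passing a tuple to loc; a small illustrative example with this index:
df.loc[('Group 1', 'A01')]   # returns the row for well A01 in Group 1 as a Series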