Pandas MultiIndex indexing does not work - python

I am exploring the MultiIndex, but for some reason the very basic indexing does not work for me.
The index:
In [119]: index
Out[119]: MultiIndex(levels=[[u'Group 1', u'Group 2'], [u'A01', u'A02', u'A03', u'A04']],
labels=[[0, 1, 0, 0], [0, 1, 2, 3]],
names=[u'Group', u'Well'])
The dataframe:
df = pd.DataFrame(np.random.randn(4,2), index=index)
The dataframe has the index:
In [124]: df.index
Out[124]:
MultiIndex(levels=[[u'Group 1', u'Group 2'], [u'A01', u'A02', u'A03', u'A04']],
labels=[[0, 1, 0, 0], [0, 1, 2, 3]],
names=[u'Group', u'Well'])
However indexing:
df['Group 1']
only results in an error
KeyError: 'Group 1'
How can this be fixed?

To slice by the index, you need loc: on a data frame, basic indexing with [] selects columns. Since the data frame doesn't contain a column named Group 1, it raises a KeyError:
df.loc['Group 1']
#              0         1
# Well
# A01  -0.337359 -0.113165
# A03   0.212714  1.619850
# A04   1.411829 -0.892723
Basic indexing table:
# Object Type   Selection        Return Value Type
# Series        series[label]    scalar value
# DataFrame     frame[colname]   Series corresponding to colname
# Panel         panel[itemname]  DataFrame corresponding to the itemname
loc indexing table:
# Object Type   Indexers
# Series        s.loc[indexer]
# DataFrame     df.loc[row_indexer, column_indexer]
# Panel         p.loc[item_indexer, major_indexer, minor_indexer]
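For a MultiIndex, loc also accepts a tuple with one label per level, plus an optional column indexer as a second argument. A minimal sketch against the dataframe above (its columns are the integers 0 and 1 produced by the np.random.randn constructor):
df.loc[('Group 1', 'A01')]     # one row as a Series: both index levels given
df.loc[('Group 1', 'A01'), 0]  # a single scalar: row by both levels, column 0
df.loc['Group 2']              # all rows under the first-level label 'Group 2'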

Related

Grouping and computing the median in a pandas dataframe

This is my task:
Write a function that accepts a dataframe, the name of the column with missing values, and a list of grouping columns, and returns the dataframe with the missing values filled in with the per-group median.
Here is what I tried to do:
def fillnull(set, col):
    val = {col: set[col].sum() / set[col].count()}
    set.fillna(val)
    return set

fillnull(titset, 'Age')
My problem is that my function doesn't work; also, I don't know how to compute the median or how to do the grouping inside the function.
Here are screenshots of my dataframe and of the missing values in my dataset.
Check whether this code works for you:
import pandas as pd

df = pd.DataFrame({
    'processId': range(100, 900, 100),
    'groupId': [1, 1, 2, 2, 3, 3, 4, 4],
    'other': [1, 2, 3, None, 3, 4, None, 9]
})
print(df)

def fill_na(df, missing_value_col, grouping_col):
    # per-group medians of the target column, indexed by the group key
    values = df.groupby(grouping_col)[missing_value_col].median()
    # index the rows by the group key so fillna can align on it
    df.set_index(grouping_col, inplace=True)
    df[missing_value_col].fillna(values, inplace=True)
    df.reset_index(grouping_col, inplace=True)
    return df

fill_na(df, 'other', 'groupId')
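As a side note, the same fill can be done without touching the index by using groupby().transform, which returns the group medians aligned to the original rows; a minimal sketch on the same example columns:
df['other'] = df['other'].fillna(
    df.groupby('groupId')['other'].transform('median')
)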

Numpy Where and Pandas: How to aggregate groupby values?

How can I get an array that aggregates the grouped column into a single entity (list/array) while also returning NaNs for results that do not match the where clause condition?
# example
df1 = pd.DataFrame({'flag': [1, 1, 0, 0],
                    'part': ['a', 'b', np.nan, np.nan],
                    'id': [1, 1, 2, 3]})
# my try
np.where(df1['flag'] == 1, df1.groupby(['id'])['part'].agg(np.array), df1.groupby(['id'])['part'].agg(np.array))
# operands could not be broadcast together with shapes (4,) (3,) (3,)
# expected
np.array((np.array(('a', 'b')), np.array(('a', 'b')), np.nan, np.nan), dtype=object)
Drop the rows having NaN values in the part column, then group the remaining rows by id and aggregate part into a list; finally, map the aggregated series back onto the id column to get the result:
s = df1.dropna(subset=['part']).groupby('id')['part'].agg(list)
df1['id'].map(s).to_numpy()
array([list(['a', 'b']), list(['a', 'b']), nan, nan], dtype=object)

extract column information from a single df and input to multiple dfs where identifier needs remapping

I need to append the row data from a column in df1 into separate dfs.
The row value from column 'i1' in df1 should correspond to the name of the dataframe that it needs appending to, and there is a common id column across the dataframes.
However, the i1 names and the names of the tables are different. I have created a dictionary below so you can see what I mean.
d_map = {'ab1': 'c30_sab1',
         'cd2': 'kjm_1cd2'}
Example data and the expected output are shown below with df1. Any pointers would be great, thanks so much.
df1
df = pd.DataFrame(data={'id': [1, 1, 2, 2, 3],
                        'i1': ['ab1', 'cd2', 'ab1', 'cd2', 'ab1'],
                        'i2': ['10:25', '10:27', '11:51', '12:01', '13:18']})
Tables that need appending with the i2 column from df1, depending on the id and i1 match:
c30_sab = pd.DataFrame(data={'id': [1, 2, 3]})
kjm_1cd = pd.DataFrame(data={'id': [1, 2]})
expected output
e_ab1 = pd.DataFrame(data={'id': [1, 2, 3], 'i2': ['10:25','11:51','13:18']})
e_cd2 = pd.DataFrame(data={'id': [1, 2], 'i2': ['10:27','12:01']})
A simple way to do it (assuming you accept repetitions when the df ids are duplicated):
df_ab1 = df[df['i1'] == 'ab1']  # select only the rows destined for the 'ab1' table
df_cd2 = df[df['i1'] == 'cd2']  # select only the rows destined for the 'cd2' table
e_ab1 = c30_sab.merge(df_ab1[['id', 'i2']], on='id')
e_cd2 = kjm_1cd.merge(df_cd2[['id', 'i2']], on='id')
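To avoid hardcoding each table, the d_map dictionary from the question can drive a loop over the i1 groups. A sketch, assuming the mapping values are corrected to match the actual dataframe names (c30_sab, kjm_1cd):
frames = {'c30_sab': c30_sab, 'kjm_1cd': kjm_1cd}
d_map = {'ab1': 'c30_sab', 'cd2': 'kjm_1cd'}  # assumed: values fixed to the real frame names

out = {}
for i1_value, sub in df.groupby('i1'):
    name = d_map[i1_value]  # e.g. 'ab1' -> 'c30_sab'
    out[name] = frames[name].merge(sub[['id', 'i2']], on='id')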

Merged Columns in Python Data Frame

How can I make this table into a Pandas data frame? I can't create that Machines column.
You can't really do that in a dataframe, as you can't have a one-level index combined with a multi-level index on the same axis.
One way to get as close as possible to what you want is to concatenate individual pandas Series for the one-level columns with a two-level dataframe for the 'Machines' columns, as follows:
pd.concat({
    'Company name': pd.Series(['a', 'b', 'c']),
    'Number of machines': pd.Series([1, 4, 2]),
    'Machines': pd.DataFrame({
        '2015-2020': pd.Series([3, 1, 0]),
        '2018-2014': pd.Series([1, 8, 3]),
        'Other': pd.Series([5, 0, 4]),
    })
}, axis=1)
You will still get a two-level column index as a result, and the one-level columns will carry an integer second level (0, 1, etc.).
Thank you. My boss asked me to run some processing on a file and show it to him in an Excel file like the one I posted here. (It's just an example, but the columns have to be exactly like that.)
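Since the end goal is an Excel file, the concatenated frame can be written out directly; to_excel preserves the two-level column header. A minimal sketch, assuming an Excel engine such as openpyxl is installed ('machines.xlsx' is a hypothetical file name):
result = pd.concat({
    'Company name': pd.Series(['a', 'b', 'c']),
    'Number of machines': pd.Series([1, 4, 2]),
    'Machines': pd.DataFrame({
        '2015-2020': pd.Series([3, 1, 0]),
        '2018-2014': pd.Series([1, 8, 3]),
        'Other': pd.Series([5, 0, 4]),
    })
}, axis=1)
result.to_excel('machines.xlsx')  # hypothetical output path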

Adding Numpy ndarray into dataframe

I would like to add a numpy array to each row in my dataframe:
I have a dataframe holding some data in each row, and now I'd like to add a new column which contains an n-element array.
For example:
Name, Years
Test, 2
Test2, 4
Now I'd like to add:
testarray1 = [100, 101, 1, 0, 0, 5] as a new column='array' for the row where Name='Test':
Name, Years, array
Test, 2, testarray1
Test2, 4, NaN
How can I do this?
import pandas as pd
import numpy as np

testarray1 = [100, 101, 1, 0, 0, 5]

d = {'Name': ['Test', 'Test2'],
     'Years': [2, 4]}

df = pd.DataFrame(d)                      # create a DataFrame from the data
df.set_index('Name', inplace=True)        # set the 'Name' column as the dataframe index
df['array'] = np.NaN                      # create a new empty 'array' column (filled with NaNs)
df['array'] = df['array'].astype(object)  # convert it to 'object' dtype so a cell can hold a list
df.at['Test', 'array'] = testarray1       # fill the cell where the index is 'Test' and the column is 'array'
df.reset_index(inplace=True)              # if you don't want 'Name' to be the dataframe index
print(df)

    Name  Years                   array
0   Test      2  [100, 101, 1, 0, 0, 5]
1  Test2      4                     NaN
Try this:
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['test', 'test2'], 'year': [1, 2]})
print(df)

x = np.arange(5)
df['array'] = [x, np.nan]  # one value per row: the array for 'test', NaN for 'test2'
print(df)
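Either way, the stored array comes back out as a normal cell lookup; a quick usage check on the second answer's frame:
print(df.loc[0, 'array'])        # [0 1 2 3 4]
print(type(df.loc[0, 'array']))  # <class 'numpy.ndarray'>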
