Select a pandas DataFrame column by indices given in another column - python

Given a dataframe such as
df = pd.DataFrame({1: [10,20,30,40], 2: [50,60,70,80], 3: [90,100,110,120], "select": [2, 3, 1, 1]})
I can get a Series of values, one selected from each row according to the column index given in the select column, like this:
df.apply(lambda r: r[r.select], axis=1) # 50, 100, 30, 40
Is there a better way to do this that doesn't rely on apply?

Use lookup:
df.lookup(df.index, df['select'])
Output:
array([ 50, 100, 30, 40])
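Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, a NumPy-based equivalent works; this is a minimal sketch, assuming the select column holds actual column labels as in the example above:
import numpy as np
# positions of the labels stored in 'select' within df.columns
col_pos = df.columns.get_indexer(df['select'])
# pick one value per row: row i, column col_pos[i]
result = df.to_numpy()[np.arange(len(df)), col_pos]
# array([ 50, 100,  30,  40])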

Related

How do I get the row label of a row of a pandas DataFrame?

A CSV file is given. I am supposed to print the row label of a particular row of the DataFrame as a string. How do I do that?
import pandas as pd
df= pd.read_csv('olympics.csv', index_col=0, skiprows=1)
s= df.loc[df['Gold'].idxmax()]
return s.index
Here 'Gold' is just an example column name. I have been trying this code, but it only prints the column indices; I need the row label as a string.
df = pd.DataFrame({'id': ['1','2','3','4','5','6','7','8'],
                   'A': ['foo','bar','foo','bar','foo','bar','foo','foo'],
                   'C': [10, 10, 10, 30, 50, 60, 50, 8],
                   'D': [9, 8, 7, 6, 5, 4, 3, 2]},
                  index=list('abcdefgh'))
idxmax() returns the index label of the row holding the maximum value:
>>> df['C'].idxmax()
'f'
Selecting that row produces a Series whose name is the index of that row.
>>> df.loc[df['C'].idxmax()]
id      6
A     bar
C      60
D       4
Name: f, dtype: object
>>> df.loc[df['C'].idxmax()].name
'f'
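Applied to the code in the question (a sketch, assuming the same olympics.csv layout), the row label can be returned as a string like this:
import pandas as pd

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
# .name gives the row label of the selected row
label = df.loc[df['Gold'].idxmax()].name
# equivalently, idxmax() itself already returns that label
label = df['Gold'].idxmax()
print(str(label))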

Pick first element of groupby object's group by index without converting to list

In the code below, I am iterating over the groups of a groupby object and printing the first item in column b of each group.
import pandas as pd
d = {
    'a': [1, 2, 3, 4, 5, 6],
    'b': [10, 20, 30, 10, 20, 30],
}
df = pd.DataFrame(d)
groups = df.groupby('b')
for name, group in groups:
    first_item_in_b = group['b'].tolist()[0]
    print(first_item_in_b)
Because each group keeps the DataFrame's original index, in order to pick the first element in b I need to convert b to a list first.
How can I avoid this overhead?
I cannot just remove tolist() like so:
first_item_in_b = group['b'][0]
because it will raise a KeyError (0 is interpreted as an index label, which most groups do not have).
You can use Index.get_loc to get the position of column b and then select with DataFrame.iat, or select by the first index label and the column name with DataFrame.at.
Alternatively, select column b by label first and then pick the first value by position with Series.iat or Series.iloc:
for name, group in groups:
    # first value by position, using the column's position
    first_item_in_b = group.iat[0, group.columns.get_loc('b')]
    # first value by labels, using the first index label
    first_item_in_b = group.at[group.index[0], 'b']
    # fast: select the column, then the first value by position
    first_item_in_b = group['b'].iat[0]
    # alternative
    first_item_in_b = group['b'].iloc[0]
    print(first_item_in_b)
10
20
30
Using iloc:
import pandas as pd
d = {
    'a': [1, 2, 3, 4, 5, 6],
    'b': [10, 20, 30, 10, 20, 30],
}
df = pd.DataFrame(d)
groups = df.groupby('b')
for name, group in groups:
    first_item_in_b = group['b'].iloc[0]
    print(first_item_in_b)
OUTPUT:
10
20
30
EDIT: or use the fast integer-location scalar accessor, Series.iat:
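For the example above, that amounts to the following (same output as the iloc version):
for name, group in groups:
    print(group['b'].iat[0])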

Pivot Table in Python

I am pretty new to Python and hence I need your help on the following:
I have two tables (dataframes):
Table 1 has all the data and looks like this:
GenDate column has the generation day.
Date column has dates.
Columns D and onwards have different values
I also have the following table:
Column I has "keywords" that can be found in the header of Table 1
Column K has dates that should be in column C of table 1
My goal is to produce a table like the following:
I have omitted a few columns for illustration purposes.
Every column in Table 1 should be split based on the Type written in its header.
E.g. A_Weeks: the Weeks type corresponds to 3 splits, Week1, Week2 and Week3.
Each of these splits has a specific Date.
In the new table, 3 columns should be created, using A_ followed by the split name:
A_Week1, A_Week2 and A_Week3.
For each of these columns, the value that corresponds to the Date of that split should be used.
I hope the explanation is good.
Thanks
You can get the desired table with the following code (follow the comments and check the pandas API reference to learn about the functions used):
import numpy as np
import pandas as pd

# initial data
t_1 = pd.DataFrame(
    {'GenDate': [1, 1, 1, 2, 2, 2],
     'Date': [10, 20, 30, 10, 20, 30],
     'A_Days': [11, 12, 13, 14, 15, 16],
     'B_Days': [21, 22, 23, 24, 25, 26],
     'A_Weeks': [110, 120, 130, 140, np.NaN, 160],
     'B_Weeks': [210, 220, 230, 240, np.NaN, 260]})
# initial data
t_2 = pd.DataFrame(
    {'Type': ['Days', 'Days', 'Days', 'Weeks', 'Weeks'],
     'Split': ['Day1', 'Day2', 'Day3', 'Week1', 'Week2'],
     'Date': [10, 20, 30, 10, 30]})
# create a MultiIndex
t_1 = t_1.set_index(['GenDate', 'Date'])
# pivot the 'Date' level of the MultiIndex - unstack it from the index to the columns
# and drop columns that contain NaN values
tt_1 = t_1.unstack().dropna(axis=1)
# tt_1 is what you need, with multi-level column labels
# mapping used to rename the columns
t_2 = t_2.set_index(['Type'])
mapping = {
    type_: dict(zip(
        t_2.loc[type_, :].loc[:, 'Date'],
        t_2.loc[type_, :].loc[:, 'Split']))
    for type_ in t_2.index.unique()}
# new column names
new_columns = list()
for letter_type, date in tt_1.columns.values:
    letter, type_ = letter_type.split('_')
    new_columns.append('{}_{}'.format(letter, mapping[type_][date]))
tt_1.columns = new_columns
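Tracing through this code with the sample frames above, tt_1 should end up with one row per GenDate and flattened column names along these lines (the Date-20 week columns are dropped because of the NaN values):
print(tt_1.columns.tolist())
# expected: ['A_Day1', 'A_Day2', 'A_Day3', 'B_Day1', 'B_Day2', 'B_Day3',
#            'A_Week1', 'A_Week2', 'B_Week1', 'B_Week2']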

Pandas DataFrame to multidimensional NumPy Array

I have a Dataframe which I want to transform into a multidimensional array using one of the columns as the 3rd dimension.
As an example:
df = pd.DataFrame({
    'id': [1, 2, 2, 3, 3, 3],
    'date': np.random.randint(1, 6, 6),
    'value1': [11, 12, 13, 14, 15, 16],
    'value2': [21, 22, 23, 24, 25, 26]
})
I would like to transform it into a 3D array with dimensions (id, date, values) like this:
The problem is that the 'id's do not have the same number of occurrences so I cannot use np.reshape().
For this simplified example, I was able to use:
ra = np.full((3, 3, 3), np.nan)
for i, value in enumerate(df['id'].unique()):
    rows = df.loc[df['id'] == value].shape[0]
    ra[i, :rows, :] = df.loc[df['id'] == value, 'date':'value2']
To produce the needed result:
but the original DataFrame contains millions of rows.
Is there a vectorized way to accomplish the same result?
Approach #1
Here's one vectorized approach, after sorting the id column with df.sort_values('id', inplace=True) as suggested by @Yannis in the comments:
count_id = df.id.value_counts().sort_index().values   # number of rows per id
mask = count_id[:, None] > np.arange(count_id.max())  # valid (non-padded) slots per id
vals = df.loc[:, 'date':'value2'].values
out_shp = mask.shape + (vals.shape[1],)
out = np.full(out_shp, np.nan)
out[mask] = vals
Approach #2
Another with factorize that doesn't require any pre-sorting -
x = df.id.factorize()[0]              # group number for each row
y = df.groupby(x).cumcount().values   # position of each row within its group
vals = df.loc[:, 'date':'value2'].values
out_shp = (x.max() + 1, y.max() + 1, vals.shape[1])
out = np.full(out_shp, np.nan)
out[x, y] = vals
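As a quick check on the sample frame above (3 ids, at most 3 rows per id, and 3 value columns), both approaches should give a NaN-padded array of shape (3, 3, 3):
print(out.shape)  # (3, 3, 3)
# slots for rows an id does not have remain NaN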

Pandas Indexing and Column Creation

I have a dataset, df.
I extracted another dataset from df, df_rec, based on a certain condition.
I can access the indexes of df_rec by df_rec.index.
Now I want to create a column in df that is 1 where the index of df matches an index in df_rec and 0 otherwise.
Any help will be appreciated.
I was thinking of something like the following, which throws an error:
df['reccurences'] = 0
df['reccurences'][df.index in df_rec.index] = 1
You can use map on the index of df to check whether it is in df_res and set the value accordingly, as shown below.
df = pd.DataFrame()
df['X'] = [1, 2, 3, 4, 5, 6]
df['Y'] = [10, 20, 30, 40, 50, 60]
df_res = df.loc[df['X'] > 3]
df['C'] = df.index.map(lambda x : 1 if x in df_res.index else 0)
Or you can do it like this:
df['C'] = [1 if x in df_res.index else 0 for x in df.index]
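A vectorized alternative (not from the original answer, but using only standard pandas methods) is Index.isin, which avoids the Python-level loop:
df['C'] = df.index.isin(df_res.index).astype(int)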
