Pandas pivot using selected values as index - python

I read this excellent guide to pivoting but I can't work out how to apply it to my case. I have tidy data like this:
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'case': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
...     'perf_var': ['num', 'time', 'num', 'time', 'num', 'time', 'num', 'time'],
...     'perf_value': [1, 10, 2, 20, 1, 30, 2, 40]
... })
>>>
>>> df
  case perf_var  perf_value
0    a      num           1
1    a     time          10
2    a      num           2
3    a     time          20
4    b      num           1
5    b     time          30
6    b      num           2
7    b     time          40
What I want is:
to use "case" as the columns,
to use the "num" values as the index,
to use the "time" values as the values,
to give:

case    a   b
1.0    10  30
2.0    20  40

All the pivot examples I can see have the index and values in separate columns, but the above seems like a valid/common "tidy" data case to me (I think?). Is it possible to pivot from this?

You need a bit of preprocessing to get your final result (np is numpy, and pivot's arguments are passed by keyword, which pandas 2.0+ requires):

import numpy as np

(df.assign(num=np.where(df.perf_var == "num", df.perf_value, np.nan),
           time=np.where(df.perf_var == "time", df.perf_value, np.nan))
   .assign(num=lambda x: x.num.ffill(),
           time=lambda x: x.time.bfill())
   .loc[:, ["case", "num", "time"]]
   .drop_duplicates()
   .pivot(index="num", columns="case", values="time"))
case     a     b
num
1.0   10.0  30.0
2.0   20.0  40.0
An alternative route to the same end point:

(df.set_index(["case", "perf_var"], append=True)
   .unstack()
   .droplevel(0, 1)
   .assign(num=lambda x: x.num.ffill(),
           time=lambda x: x.time.bfill())
   .drop_duplicates()
   .droplevel(0)
   .set_index("num", append=True)
   .unstack(0)
   .rename_axis(index=None))

Related

A list and a dataframe mapping to get a column value in python pandas

I have a dataframe with words as the index and a corresponding sentiment score in another column. I have a second dataframe with one column whose rows each hold a list of words (a token list). I want to find the average sentiment score for each list. This has to be done for a huge number of rows, so efficiency is important.
One method I have in mind is given below:

import pandas as pd

a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
'''
df
                      tokens
0                  [a, b, c]
1  [hi, this, is, a, sample]
'''

def find_score(tokenlist, ref_df):
    # ref_df contains two cols, 'tokens' and 'sentiment_score'
    temp_df = pd.DataFrame()
    temp_df['tokens'] = tokenlist
    return temp_df.merge(ref_df, on='tokens', how='inner')['sentiment_score'].mean()
    # this should return the score

df['score'] = df['tokens'].apply(find_score, args=(ref_df,))
# each input for find_score will be a list

Is there any more efficient way to do it without creating a dataframe for each list?
You can create a dictionary for mapping from the reference dataframe ref_df and then use .map() on the token list in each row of dataframe df, as follows (numpy is assumed to be imported as np):

ref_dict = dict(zip(ref_df['tokens'], ref_df['sentiment_score']))
df['score'] = df['tokens'].map(lambda x: np.mean([ref_dict[y] for y in x if y in ref_dict]))
Demo
Test Data Construction
a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
ref_df = pd.DataFrame({'tokens': ['a', 'b', 'c', 'd', 'hi', 'this', 'is', 'sample', 'example'],
                       'sentiment_score': [1, 2, 3, 4, 11, 12, 13, 14, 15]})
print(df)
                      tokens
0                  [a, b, c]
1  [hi, this, is, a, sample]

print(ref_df)
    tokens  sentiment_score
0        a                1
1        b                2
2        c                3
3        d                4
4       hi               11
5     this               12
6       is               13
7   sample               14
8  example               15
Run New Code
ref_dict = dict(zip(ref_df['tokens'], ref_df['sentiment_score']))
df['score'] = df['tokens'].map(lambda x: np.mean([ref_dict[y] for y in x if y in ref_dict]))
Output
print(df)
                      tokens  score
0                  [a, b, c]    2.0
1  [hi, this, is, a, sample]   10.2
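An alternative sketch of the same idea (my variation, not part of the answer above): explode the lists, map through the dictionary, and average per original row. Tokens missing from ref_dict map to NaN, which mean() skips, matching the list-comprehension behavior:

df['score'] = (df['tokens'].explode()
                           .map(ref_dict)
                           .groupby(level=0)
                           .mean())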
Let's try explode, merge, and agg:

import pandas as pd

a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
ref_df = pd.DataFrame({'sentiment_score': {'a': 1, 'b': 2,
                                           'c': 3, 'hi': 4,
                                           'this': 5, 'is': 6,
                                           'sample': 7}})

# Explode tokens into rows (preserving the original index)
new_df = df.explode('tokens').reset_index()

# Merge sentiment scores
new_df = new_df.merge(ref_df, left_on='tokens',
                      right_index=True,
                      how='inner')

# Group by the original index, agg tokens back to lists and take the mean
new_df = new_df.groupby('index') \
               .agg({'tokens': list, 'sentiment_score': 'mean'}) \
               .reset_index(drop=True)
print(new_df)
print(new_df)
Output:

                      tokens  sentiment_score
0                  [a, b, c]              2.0
1  [a, hi, this, is, sample]              4.6

After explode:

   index  tokens
0      0       a
1      0       b
2      0       c
3      1      hi
4      1    this
5      1      is
6      1       a
7      1  sample

After merge:

   index  tokens  sentiment_score
0      0       a                1
1      1       a                1
2      0       b                2
3      0       c                3
4      1      hi                4
5      1    this                5
6      1      is                6
7      1  sample                7
(The one-liner)

new_df = df.explode('tokens') \
           .reset_index() \
           .merge(ref_df, left_on='tokens',
                  right_index=True,
                  how='inner') \
           .groupby('index') \
           .agg({'tokens': list, 'sentiment_score': 'mean'}) \
           .reset_index(drop=True)
If the order of the tokens in the list matters, the scores can be calculated and merged back to the original df instead of using list aggregation (selecting the score column before taking the mean avoids aggregating the string column):

mean_scores = df.explode('tokens') \
                .reset_index() \
                .merge(ref_df, left_on='tokens',
                       right_index=True,
                       how='inner') \
                .groupby('index')[['sentiment_score']].mean() \
                .reset_index(drop=True)
new_df = df.merge(mean_scores,
                  left_index=True,
                  right_index=True)
print(new_df)
print(new_df)
Output:

                      tokens  sentiment_score
0                  [a, b, c]              2.0
1  [hi, this, is, a, sample]              4.6

divide a number by all values of column in pandas

df1 has one column (total) with two values, 5000 and 1000, with ids A and B respectively. df2 has one column (marks) with 10 values: the first 5 (100, 200, 300, 400, 500) have id A and the next 5 (10, 20, 30, 40, 50) have id B.
Now I have to get the expected output as:

id  final_value
A          50
A          25
A          16.6
A          12.5
A          10
B         100
B          50
B          33.3
B          25
B          20
My code is:

new_df = df1['total'] / df2['marks']

But I got output as:

A     50
B    100

with the remaining rows NaN.
pandas division works on the two Series element by element, aligned by index, not by your 'id' values. If you want to divide using 'id' as the link, you have to merge your dataframes first:
df1 = pd.DataFrame([[5000, 'A'], [1000, 'B']], columns=['test', 'id'])
df2 = pd.DataFrame([[100, 'A'], [200, 'A'], [300, 'A'], [400, 'A'], [500, 'A'],
                    [10, 'B'], [20, 'B'], [30, 'B'], [40, 'B'], [50, 'B']],
                   columns=['marks', 'id'])
df3 = df1.merge(df2, on='id')
df3['test'] / df3['marks']
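To match the asker's expected output shape exactly (an id column plus a rounded final value), a small follow-up on df3 along these lines should do; final_value is my own column name:

df3['final_value'] = (df3['test'] / df3['marks']).round(1)
print(df3[['id', 'final_value']])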
Setup:

df1 = pd.DataFrame({'total': [5000, 1000]}, index=['A', 'B'])
df2 = pd.DataFrame({'marks': [100, 200, 300, 400, 500, 10, 20, 30, 40, 50]},
                   index=['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])

Interestingly enough this works, because column assignment aligns on the index: the single 'A' value from df1 is broadcast to every 'A' row of df2, and likewise for 'B':

df2['total'] = df1['total']
df2['final_value'] = df2['total'] / df2['marks']

And then you can just drop the helper columns and copy the answer to a new df if you want it as you stated:

new_df = df2[['final_value']]
df2 = df2.drop(['total', 'final_value'], axis=1)
Assuming your data looks like this:

df1 = pd.DataFrame(dict(id=['A', 'B'], total=[5000, 1000]))
df2 = pd.DataFrame(dict(id=['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                        vals=[100, 200, 300, 400, 500, 10, 20, 30, 40, 50]))

you can get the new column you're interested in by first merging the two dataframes on the id column and then applying a lambda function to divide the total by the value provided in df1. Specifically:

df2['final_result'] = df2.merge(df1, on='id').apply(lambda x: round(x.total / x.vals, 1), axis=1)

And if you only want the id and final_result columns, you can just select those:

df2[['id', 'final_result']]
Your data should now look like you expected:

   id  final_result
0   A          50.0
1   A          25.0
2   A          16.7
3   A          12.5
4   A          10.0
5   B         100.0
6   B          50.0
7   B          33.3
8   B          25.0
9   B          20.0
Note that in the lambda function I also applied some rounding to get just 1 decimal as you indicated.
Try:

>>> df1.set_index("id").rename(columns={"total": "marks"}).div(df2.set_index("id")).round(1).reset_index()
  id  marks
0  A   50.0
1  A   25.0
2  A   16.7
3  A   12.5
4  A   10.0
5  B  100.0
6  B   50.0
7  B   33.3
8  B   25.0
9  B   20.0
It leverages the fact that for any arithmetic operation between two data frames, pandas auto-aligns both operands by index and by columns (so the operation runs index x against index x, and column a against column a).
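A minimal illustration of that alignment rule with plain Series (my example, reusing the numbers from this question):

s1 = pd.Series([5000, 1000], index=['A', 'B'])
s2 = pd.Series([10, 100], index=['B', 'A'])
print(s1 / s2)  # A: 5000/100 = 50.0, B: 1000/10 = 100.0 -- matched by label, not position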

combine DataFrame MultiIndex to string column

I have following DataFrame:
df = pd.DataFrame([[1, 2, 3], [11, 22, 33]], columns=['A', 'B', 'C'])
df.set_index(['A', 'B'], inplace=True)

        C
A  B
1  2    3
11 22  33
How do I make an additional 'text' column that is a string combination of the MultiIndex, without removing my index?
For example:
        C      D
A  B
1  2    3    1_2
11 22  33  11_22
Perhaps a simple list comprehension might help, i.e.

df['new'] = ['_'.join(map(str, i)) for i in df.index.tolist()]

        C    new
A  B
1  2    3    1_2
11 22  33  11_22
Solution for Python 3.6+:

df['new'] = [f'{i}_{j}' for i, j in df.index]
print(df)

        C    new
A  B
1  2    3    1_2
11 22  33  11_22

And below, for earlier versions:

df['new'] = ['{}_{}'.format(i, j) for i, j in df.index]
Use:

df['new'] = df.index.map('{0[0]}_{0[1]}'.format)

Output:

        C    new
A  B
1  2    3    1_2
11 22  33  11_22
With so many elegant methods it is not clear which one to choose. So, here is a performance comparison of the methods provided in the other answers plus an alternative one for two cases: 1) the multi-index is comprised of integers; 2) the multi-index is comprised of strings.
Jezrael's method (f_3) wins in both cases. However, Dark's (f_2) is the slowest one for the second case. Method 1 performs very poorly with integers due to the type conversion step but is as fast as f_3 with strings.
Case 1:

# assumed setup (not shown in the original post): numpy's randint and a row count
import numpy as np
from numpy.random import randint
num_rows = 100_000  # example size

df = pd.DataFrame({'A': randint(1, 10, num_rows),
                   'B': randint(10, 20, num_rows),
                   'C': randint(20, 30, num_rows)})
df.set_index(['A', 'B'], inplace=True)

# Method 1
def f_1(df):
    df['D'] = df.index.get_level_values(0).astype('str') + '_' + df.index.get_level_values(1).astype('str')
    return df

# Method 2
def f_2(df):
    df['D'] = ['_'.join(map(str, i)) for i in df.index.tolist()]
    return df

# Method 3
def f_3(df):
    df['D'] = [f'{i}_{j}' for i, j in df.index]
    return df

# Method 4
def f_4(df):
    df['new'] = df.index.map('{0[0]}_{0[1]}'.format)
    return df
Case 2:

alpha = list("abcdefghijklmnopqrstuvwxyz")
df = pd.DataFrame({'A': np.random.choice(alpha, size=num_rows),
                   'B': np.random.choice(alpha, size=num_rows),
                   'C': randint(20, 30, num_rows)})
df.set_index(['A', 'B'], inplace=True)

# Method 1 (no type conversion needed for strings)
def f_1(df):
    df['D'] = df.index.get_level_values(0) + '_' + df.index.get_level_values(1)
    return df
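The answer reports relative timings without showing the harness; a minimal sketch of one way to produce comparable numbers (my own, not the author's) could look like this:

import timeit

# time each method on a fresh copy so earlier runs don't mutate the shared df
for f in (f_1, f_2, f_3, f_4):
    t = timeit.timeit(lambda: f(df.copy()), number=10)
    print(f'{f.__name__}: {t:.3f}s for 10 runs')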

pandas - Apply mean to a specific row in grouped dataframe [duplicate]

This should be straightforward, but the closest thing I've found is this post:
pandas: Filling missing values within a group, and I still can't solve my problem....
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']})

  name  value
0    A      1
1    A    NaN
2    B    NaN
3    B      2
4    B      3
5    B      1
6    C      3
7    C    NaN
8    C      3
and I'd like to fill in the "NaN"s with the mean value in each "name" group, i.e.

  name  value
0    A      1
1    A      1
2    B      2
3    B      2
4    B      3
5    B      1
6    C      3
7    C      3
8    C      3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
One way would be to use transform:

>>> df
  name  value
0    A      1
1    A    NaN
2    B    NaN
3    B      2
4    B      3
5    B      1
6    C      3
7    C    NaN
8    C      3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
  name  value
0    A      1
1    A      1
2    B      2
3    B      2
4    B      3
5    B      1
6    C      3
7    C      3
8    C      3
fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to #DSM's solution, but avoids the need to define an anonymous lambda function.
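For intuition, this is what the inner transform('mean') hands back on the example data before fillna consumes it (the group means A=1, B=2, C=3, broadcast back to the original nine-row index):

>>> df.groupby('name')['value'].transform('mean')
0    1.0
1    1.0
2    2.0
3    2.0
4    2.0
5    2.0
6    3.0
7    3.0
8    3.0
Name: value, dtype: float64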
#DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: multiple columns to group by, and multiple value columns:
df = pd.DataFrame(
    {
        'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
        'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
        'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
        'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    }
)

... gives ...

  category name  other_value  value
0        X    A         10.0    1.0
1        X    A          NaN    NaN
2        X    B          NaN    NaN
3        X    B         20.0    2.0
4        X    B         30.0    3.0
5        X    B         10.0    1.0
6        Y    C         30.0    3.0
7        Y    C          NaN    NaN
8        Y    C         30.0    3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value'] \
                .transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation run only on that particular column. You could add the column selection at the end, but then you would run the transformation for all columns only to throw out all but one measure column afterwards. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do it.
Performance test, increasing the dataset by doing ...

big_df = None
for _ in range(10000):
    if big_df is None:
        big_df = df.copy()
    else:
        big_df = pd.concat([big_df, df])
df = big_df

... confirms that this increases the speed in proportion to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime

def generate_data():
    ...

t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value'] \
                .transform(lambda x: x.fillna(x.mean()))
print(datetime.now() - t)
# 0:00:00.016012

t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name']) \
                .transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now() - t)
# 0:00:00.030022
On a final note, you can generalize even further if you want to impute more than one column, but not all (note the double brackets: selecting multiple columns with a bare tuple is deprecated and removed in recent pandas):

df[['value', 'other_value']] = df.groupby(['category', 'name'])[['value', 'other_value']] \
                                 .transform(lambda x: x.fillna(x.mean()))
Shortcut:
Groupby + Apply + Lambda + Fillna + Mean
>>> df['value1'] = df.groupby('name')['value'].apply(lambda x: x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0
This solution still works if you want to group by multiple columns to replace missing values.
>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan, np.nan, 4, 3],
...                    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
...                    'class': list('ppqqrrsss')})
>>> df['value'] = df.groupby(['name', 'class'])['value'].apply(lambda x: x.fillna(x.mean()))
>>> df
   value name class
0    1.0    A     p
1    1.0    A     p
2    2.0    B     q
3    2.0    B     q
4    3.0    B     r
5    3.0    B     r
6    3.5    C     s
7    4.0    C     s
8    3.0    C     s
I'd do it this way (the grouping column in this question is 'name'):

df.loc[df.value.isnull(), 'value'] = df.groupby('name').value.transform('mean')
The featured high-ranked answer only works for a pandas DataFrame with only two columns. If you have more columns, use instead:

df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
    lambda x: x.fillna(x.mean()))
To summarize the above concerning the efficiency of the possible solutions: I have a dataset with 97,906 rows and 48 columns. I want to fill 4 columns with the median of each group. The column I group by has 26,200 groups.

The first solution:

start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
# 0.10429811477661133 seconds
The second solution:

start = time.time()
for col in continuous_variables:
    df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
# 0.5098445415496826 seconds

The next solution I only performed on a subset, since it was running too long:

start = time.time()
for col in continuous_variables:
    x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
# 11.685635566711426 seconds

The following solution follows the same logic as above:

start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
# 42.630549907684326 seconds
So it's quite important to choose the right method. Bear in mind that I noticed that once a column was not numeric, the times went up dramatically (which makes sense, as I was computing the median).
def groupMeanValue(group):
    group['value'] = group['value'].fillna(group['value'].mean())
    return group

# the function operates on the whole group frame, so apply (not transform) is the right call
dft = df.groupby("name", group_keys=False).apply(groupMeanValue)
I know this is an old question, but I am quite surprised by the unanimity of apply/lambda answers here. Generally speaking, that is the second worst thing to do after iterating rows, from a timing point of view. What I would do here is:

df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')

Or using fillna:

df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))

I've checked with timeit (because, again, the unanimity of apply/lambda-based solutions made me doubt my instinct). And it is indeed 2.5× faster than the most upvoted solutions.
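For reference, a rough sketch of that kind of timeit check (my own harness, not the answer author's; absolute numbers vary by machine and pandas version):

import timeit

def with_transform():
    d = df.copy()
    d['value'] = d['value'].fillna(d.groupby('name')['value'].transform('mean'))

def with_apply():
    d = df.copy()
    d['value'] = d.groupby('name')['value'].apply(lambda x: x.fillna(x.mean()))

print('transform:', timeit.timeit(with_transform, number=100))
print('apply:    ', timeit.timeit(with_apply, number=100))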
To fill all the numeric null values with the mean grouped by "name" (selecting the columns on the groupby keeps the shapes aligned even if other non-numeric columns exist):

num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name")[num_cols].transform(lambda x: x.fillna(x.mean()))
df.fillna(df.groupby('name').transform('mean'), inplace=True)

(transform broadcasts the group means back to the original row index, which is what fillna needs in order to align.)
You can also use df.apply(lambda x: x.fillna(x.mean())), though note that this fills with the overall column mean rather than the per-group mean.

forward fill specific columns in pandas dataframe

If I have a dataframe with multiple columns ['x', 'y', 'z'], how do I forward fill only one column 'x'? Or a group of columns ['x','y']?
I only know how to do it by axis.
tl;dr:
cols = ['X', 'Y']
df.loc[:,cols] = df.loc[:,cols].ffill()
And I have also added a self-contained example:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> ## create dataframe
... ts1 = [0, 1, np.nan, np.nan, np.nan, np.nan]
>>> ts2 = [0, 2, np.nan, 3, np.nan, np.nan]
>>> d = {'X': ts1, 'Y': ts2, 'Z': ts2}
>>> df = pd.DataFrame(data=d)
>>> print(df.head())
     X    Y    Z
0  0.0  0.0  0.0
1  1.0  2.0  2.0
2  NaN  NaN  NaN
3  NaN  3.0  3.0
4  NaN  NaN  NaN
>>>
>>> ## apply forward fill
... cols = ['X', 'Y']
>>> df.loc[:, cols] = df.loc[:, cols].ffill()
>>> print(df.head())
     X    Y    Z
0  0.0  0.0  0.0
1  1.0  2.0  2.0
2  1.0  2.0  NaN
3  1.0  3.0  3.0
4  1.0  3.0  NaN
for col in ['X', 'Y']:
    df[col] = df[col].ffill()

Alternatively, with the inplace parameter:

df['X'].ffill(inplace=True)
df['Y'].ffill(inplace=True)

And no, you cannot do df[['X','Y']].ffill(inplace=True), as this first creates a slice through the column selection, so the inplace forward fill would trigger a SettingWithCopyWarning. Of course, if you have a list of columns you can do this in a loop:

for col in ['X', 'Y']:
    df[col].ffill(inplace=True)

The point of using inplace is that it avoids copying the column.
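One caution worth adding here (my note, not the answerer's): under copy-on-write, which pandas 2.x offers as an opt-in and pandas 3 enables by default, chained inplace calls like df['X'].ffill(inplace=True) operate on a temporary object and no longer update df. Explicit assignment is the future-proof pattern:

import pandas as pd

pd.set_option('mode.copy_on_write', True)  # opt-in on pandas 2.x; always on in pandas 3
df['X'] = df['X'].ffill()  # assignment works the same with or without copy-on-write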
Two columns can be forward filled simultaneously as given below:

df1 = df[['X', 'Y']].ffill()
I used the code below; here the method for X and Y can also be something other than ffill():

df1 = df.fillna({
    'X': df['X'].ffill(),
    'Y': df['Y'].ffill(),
})
The simplest version I think.
cols = ['X', 'Y']
df[cols] = df[cols].ffill()
