I have a dataframe as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Group1': ['Maintenance', 'Shop', 'Admin', 'Shop'],
    'Hours1': [4, 4, 8, 8],
    'Group2': ['Admin', 'Customer', '0', '0'],
    'Hours2': [4.0, 2.0, 0.0, 0.0],
    'Group3': ['0', 'Admin', '0', '0'],
    'Hours3': [0.0, 2.0, 0.0, 0.0],
})
>>> df
ID Group1 Hours1 Group2 Hours2 Group3 Hours3
0 1 Maintenance 4 Admin 4.0 0 0.0
1 2 Shop 4 Customer 2.0 Admin 2.0
2 3 Admin 8 0 0.0 0 0.0
3 4 Shop 8 0 0.0 0 0.0
I would like to create new columns as follows: one Seg_<group> column per group, containing that group's share of the row's total hours.
This is my code and the current output. I understand why it is not giving me what I want, but I'm not sure how to modify my code for the desired output.
Code:
segment_list = ["Maintenance", "Shop", "Admin", "Customer"]
for i in segment_list:
    df["Seg_" + i] = np.where(
        (df["Group1"] == i) | (df["Group2"] == i) | (df["Group3"] == i),
        (df["Hours1"] + df["Hours2"] + df["Hours3"]) / 8, 0)
Current output: every matching segment gets the whole row total divided by 8 (here always 1.0), so for example ID 1 shows Seg_Maintenance = Seg_Admin = 1.0 instead of the desired 0.5 each.
Probably not the cleanest way, but it does work and I couldn't come up with a more elegant approach.
# First replace the '0' / 0.0 placeholders with NaN so they can be dropped later
# (without this, pivot would fail on the duplicate '0' groups):
df = df.replace({'0': np.nan, 0.0: np.nan})
print(df)
#    ID       Group1  Hours1    Group2  Hours2 Group3  Hours3
# 0   1  Maintenance       4     Admin     4.0    NaN     NaN
# 1   2         Shop       4  Customer     2.0  Admin     2.0
# 2   3        Admin       8       NaN     NaN    NaN     NaN
# 3   4         Shop       8       NaN     NaN    NaN     NaN
df1 = df.melt(id_vars=['ID'], value_vars=['Group1', 'Group2', 'Group3'], value_name='Group')
df2 = df.melt(id_vars=['ID'], value_vars=['Hours1', 'Hours2', 'Hours3'], value_name='Hours')
# We need the Hours column only, so just add it to df1
df1['Hours'] = df2['Hours']
# A lot of IDs will have NaN values for empty groups, so let's remove them.
df1 = df1.sort_values('ID').dropna()
# Now we pivot, where the Groups become the columns.
pvt = df1.pivot(index='ID', columns='Group', values='Hours')
# Calculate the percentage share of each group within a row.
pvt = pvt.apply(lambda r: r / r.sum(), axis=1).reset_index()
# Merge the pivot table with the original df on ID.
result = pd.merge(df, pvt, how='inner', on='ID')
print(result)
# ID Group1 Hours1 Group2 Hours2 Group3 Hours3 Admin Customer \
# 0 1 Maintenance 4 Admin 4.0 NaN NaN 0.50 NaN
# 1 2 Shop 4 Customer 2.0 Admin 2.0 0.25 0.25
# 2 3 Admin 8 NaN NaN NaN NaN 1.00 NaN
# 3 4 Shop 8 NaN NaN NaN NaN NaN NaN
# Maintenance Shop
# 0 0.5 NaN
# 1 NaN 0.5
# 2 NaN NaN
# 3 NaN 1.0
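As an aside, the row-wise apply in the normalization step can be replaced by a vectorized division, which gives the same result here:
pvt = pvt.div(pvt.sum(axis=1), axis=0).reset_index()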
Here is how I would approach this in a fairly generic way. For a problem like this, I find pandas easier to use (because of groupby and its handling of index and multi-index):
First, some cleaning and reshaping:
import re

# set ID as index and clean up the '0' entries
# which really should be NaN (missing data):
df2 = df.set_index('ID').replace({0: np.nan, '0': np.nan})
# then, convert 'Group1', ... into a MultiIndex [(Group, 1), (Hours, 1), ...]
ix = pd.MultiIndex.from_tuples([
    re.match(r'(.*?)(\d+)', k).groups() for k in df2.columns])
# and convert to a long frame with ['ID', 'Group'] as index
z = df2.set_axis(ix, axis=1).stack(level=1).droplevel(1).set_index(
    'Group', append=True)
>>> z
Hours
ID Group
1 Maintenance 4.0
Admin 4.0
2 Shop 4.0
Customer 2.0
Admin 2.0
3 Admin 8.0
4 Shop 8.0
Now, calculate the desired summaries (here, just one: the fraction of hours relative to the ID's total):
# add some summary stats (fraction of total)
z = z.assign(Seg=z.groupby('ID')['Hours'].transform(lambda g: g / g.sum()))
>>> z
Hours Seg
ID Group
1 Maintenance 4.0 0.50
Admin 4.0 0.50
2 Shop 4.0 0.50
Customer 2.0 0.25
Admin 2.0 0.25
3 Admin 8.0 1.00
4 Shop 8.0 1.00
At this point, one could reshape to wide again, with MultiIndex columns:
>>> z.unstack('Group')
Hours Seg
Group Admin Customer Maintenance Shop Admin Customer Maintenance Shop
ID
1 4.0 NaN 4.0 NaN 0.50 NaN 0.5 NaN
2 2.0 2.0 NaN 4.0 0.25 0.25 NaN 0.5
3 8.0 NaN NaN NaN 1.00 NaN NaN NaN
4 NaN NaN NaN 8.0 NaN NaN NaN 1.0
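If flat column names are preferred to the two-level header, the levels can be joined after unstacking, for example:
wide = z.unstack('Group')
wide.columns = [f'{stat}_{group}' for stat, group in wide.columns]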
Or, closer to the original intention, we can concat horizontally just the Seg portion to the (cleaned up) original:
df2 = pd.concat([
df2,
z['Seg'].unstack('Group').rename(columns=lambda s: f'Seg_{s}'),
], axis=1)
>>> df2
Group1 Hours1 Group2 Hours2 Group3 Hours3 Seg_Admin Seg_Customer Seg_Maintenance Seg_Shop
ID
1 Maintenance 4 Admin 4.0 NaN NaN 0.50 NaN 0.5 NaN
2 Shop 4 Customer 2.0 Admin 2.0 0.25 0.25 NaN 0.5
3 Admin 8 NaN NaN NaN NaN 1.00 NaN NaN NaN
4 Shop 8 NaN NaN NaN NaN NaN NaN NaN 1.0
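And if you want 0 instead of NaN for segments a row doesn't have (matching the 0 default of the np.where in the question), append .fillna(0) to the Seg frame, i.e. use this in place of the concat above:
df2 = pd.concat([
    df2,
    z['Seg'].unstack('Group').rename(columns=lambda s: f'Seg_{s}').fillna(0),
], axis=1)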
I am trying to replace NaN values in a pandas DataFrame with a forward fill combined with a discount (decay) rate of 0.9.
I have the following data set:
Column1 Column2 Column3 Column4
0 1.0 5 -9.0 13.0
1 NaN 6 -10.0 15.0
2 3.0 7 NaN NaN
3 NaN 8 NaN NaN
For reproducibility:
df1 = pd.DataFrame({
    'Column1': [1, np.nan, 3, np.nan],
    'Column2': [5, 6, 7, 8],
    'Column3': [-9, -10, np.nan, np.nan],
    'Column4': [13, 15, np.nan, np.nan],
})
I was able to replace the NaN values with a forward fill using the ffill method.
df2 = df1.ffill()
Column1 Column2 Column3 Column4
0 1.0 5 -9.0 13.0
1 1.0 6 -10.0 15.0
2 3.0 7 -10.0 15.0
3 3.0 8 -10.0 15.0
Additionally, I am trying to apply the ratio 0.9 to all forward filled NaN values, which would yield the following data set:
NaN value row 2, column 3: -10 * 0.9 = -9
NaN value row 3, column 3: -9 * 0.9 = -8.1
Column1 Column2 Column3 Column4
0 1.0 5 -9.0 13.00
1 0.9 6 -10.0 15.00
2 3.0 7 -9.0 13.50
3 2.7 8 -8.1 12.15
Is there an easy way to deal with that?
Thanks a lot!
Create an exponent mask that counts each cell's position within its consecutive run of NaNs, using this groupby/cumsum idea:
groups = df1.notna().cumsum()
exp = df1.apply(lambda col: col.isna().groupby(groups[col.name]).cumsum())
# Column1 Column2 Column3 Column4
# 0 0 0 0 0
# 1 1 0 0 0
# 2 0 0 1 1
# 3 1 0 2 2
Then ffill and multiply by 0.9 ** exp:
df2 = df1.ffill().mul(0.9 ** exp)
# Column1 Column2 Column3 Column4
# 0 1.0 5.0 -9.0 13.00
# 1 0.9 6.0 -10.0 15.00
# 2 3.0 7.0 -9.0 13.50
# 3 2.7 8.0 -8.1 12.15
I have a ratings matrix (rows are users, columns are movies), where row 1 shows that user 1 rated movie 1 with 4.0, did not rate movie 2, rated movie 3 with 1.0, and so on:
rating
movieId    1    2    3    4    5  .....
userID
1        4.0  NaN  1.0  4.1  NaN
2        NaN  2.0  5.1  NaN  NaN
3        3.0  2.0  NaN  NaN  NaN
4        5.0  NaN  2.8  NaN  NaN
How could I fill the NaN values with the mode of each movie (column)? For example, movieId 1 has ratings 4.0, NaN, 3.0, 5.0, ....., so its NaNs would be filled with 4.0 (the mode). I tried to use fillna:
rating.apply(lambda x: x.fillna(x.mode().item()))
Try
rating = rating.apply(lambda x: x.fillna(x.mode()[0]), axis=0)
x.mode() returns a Series because there can be ties, so take the first mode with [0]; the .item() in your attempt raises a ValueError whenever a column has more than one mode.
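If a column can be entirely NaN, x.mode() is an empty Series and x.mode()[0] raises; a guarded variant of the same line (a sketch):
rating = rating.apply(
    lambda x: x.fillna(x.mode()[0]) if not x.mode().empty else x,
    axis=0,
)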
Alternatively,
import numpy as np
import pandas as pd

def fillna_mode(df, cols_to_fill):
    for col in cols_to_fill:
        df[col].fillna(df[col].mode()[0], inplace=True)

sample = {1: [4.0, np.nan, 1.0, 4.1, np.nan],
          2: [np.nan, 2, 5.1, np.nan, np.nan]}
rating = pd.DataFrame(sample)
print(rating)
1 2
0 4.0 NaN
1 NaN 2.0
2 1.0 5.1
3 4.1 NaN
4 NaN NaN
fillna_mode(rating, [1, 2])
Output
1 2
0 4.0 2.0
1 1.0 2.0
2 1.0 5.1
3 4.1 2.0
4 1.0 2.0
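A small caveat: in recent pandas versions, calling fillna(..., inplace=True) on a column selection can trigger a chained-assignment FutureWarning; assigning the result back avoids it:
def fillna_mode(df, cols_to_fill):
    for col in cols_to_fill:
        df[col] = df[col].fillna(df[col].mode()[0])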
Let's say I have data like this:
df = pd.DataFrame({'col1': [5, np.nan, 2, 2, 5, np.nan, 4],
                   'col2': [1, 3, np.nan, np.nan, 5, np.nan, 4]})
print(df)
col1 col2
0 5.0 1.0
1 NaN 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 NaN NaN
6 4.0 4.0
How can I use fillna() to replace a NaN value with the average of the prior and the succeeding value, if both of them are not NaN?
The result would look like this:
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
Also, is there a way of calculating the average from the previous n and succeeding n values (if they are all not NaN)?
We can shift the dataframe forward and backward, add the two shifted frames together, and divide by two; because addition propagates NaN, the result is only defined where both neighbours exist, which is exactly where we want to fill:
s1, s2 = df.shift(), df.shift(-1)
df = df.fillna((s1 + s2) / 2)
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
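For the second part of the question (averaging the previous n and succeeding n values only when all of them are present), the same shift idea generalizes. A sketch, where fill_neighbor_mean is a hypothetical helper name:
def fill_neighbor_mean(df, n=1):
    # Collect the n prior and n succeeding rows for every position.
    neighbours = [df.shift(k) for k in range(1, n + 1)]
    neighbours += [df.shift(-k) for k in range(1, n + 1)]
    # Summing propagates NaN, so the total is NaN unless all 2n
    # neighbours are present, i.e. the "all not NaN" condition.
    total = sum(neighbours)
    return df.fillna(total / (2 * n))
With n=1, fill_neighbor_mean(df) reproduces the output above.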
I have a dataframe similar to the one seen below.
In [2]: df = pd.DataFrame({'P1': [1, 2, None, None, None, None],
   ...:                    'P2': [None, None, 3, 4, None, None],
   ...:                    'P3': [None, None, None, None, 5, 6]})
In [3]: df
Out[3]:
P1 P2 P3
0 1.0 NaN NaN
1 2.0 NaN NaN
2 NaN 3.0 NaN
3 NaN 4.0 NaN
4 NaN NaN 5.0
5 NaN NaN 6.0
And I am trying to merge all of the columns into a single P column in a new dataframe (see below).
P
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
In my actual code, I have an arbitrary list of columns that should be merged, not necessarily P1, P2, and P3 (between 1 and 5 columns). I've tried something along the following lines:
new_series = pd.Series()
desired_columns = ['P1', 'P2', 'P3']
for col in desired_columns:
    other_series = df[col]
    new_series = new_series.align(other_series)
However, this results in a tuple of Series objects, and neither of them appears to contain the data I need. I could iterate through every row and check each column, but I feel there is likely an easy pandas solution that I am missing.
If there is at most one non-NaN value per row, forward fill across the columns and select the last column by position:
df['P'] = df[['P1', 'P2', 'P3']].ffill(axis=1).iloc[:, -1]
print(df)
P1 P2 P3 P
0 1.0 NaN NaN 1.0
1 2.0 NaN NaN 2.0
2 NaN 3.0 NaN 3.0
3 NaN 4.0 NaN 4.0
4 NaN NaN 5.0 5.0
5 NaN NaN 6.0 6.0
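Since the question mentions an arbitrary list of columns, the same line works directly with the desired_columns list from the question:
df['P'] = df[desired_columns].ffill(axis=1).iloc[:, -1]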
Another alternative solution:
If we don't need to be column-specific, we can use bfill() across the whole DataFrame: with axis=1 (axis='columns'), each NaN cell is filled from the value in the next column of the same row, so after the fill the first column holds the first non-NaN value of each row.
>>> df['P'] = df.bfill(axis=1).iloc[:, 0]
>>> df
P1 P2 P3 P
0 1.0 NaN NaN 1.0
1 2.0 NaN NaN 2.0
2 NaN 3.0 NaN 3.0
3 NaN 4.0 NaN 4.0
4 NaN NaN 5.0 5.0
5 NaN NaN 6.0 6.0
I have the following dataframe:
df = pd.DataFrame({'Buy': [10, np.nan, 2, np.nan, np.nan, 4],
                   'Sell': [np.nan, 7, np.nan, 9, np.nan, np.nan]})
Out[37]:
Buy Sell
0 10.0 NaN
1 NaN 7.0
2 2.0 NaN
3 NaN 9.0
4 NaN NaN
5 4.0 NaN
I want to create two more columns, Quant and B/S.
For Quant it is working fine as follows:
df['Quant'] = df['Buy'].fillna(df['Sell'])  # take whichever value is available; NaN only if both are NaN
Output is:
df
Out[39]:
Buy Sell Quant
0 10.0 NaN 10.0
1 NaN 7.0 7.0
2 2.0 NaN 2.0
3 NaN 9.0 9.0
4 NaN NaN NaN
5 4.0 NaN 4.0
But I want B/S to indicate which of the two columns the Quant value was taken from.
You can perform an equality test and feed it into numpy.where:
df['B/S'] = np.where(df['Quant'] == df['Buy'], 'B', 'S')
For the case where both values are null, you can use an additional step:
df.loc[df[['Buy', 'Sell']].isnull().all(1), 'B/S'] = np.nan
Example
from io import StringIO
import numpy as np
import pandas as pd
mystr = StringIO("""Buy Sell
10 nan
nan 8
4 nan
nan 5
nan 7
3 nan
2 nan
nan nan""")
df = pd.read_csv(mystr, sep=r'\s+')
df['Quant'] = df['Buy'].fillna(df['Sell'])
df['B/S'] = np.where(df['Quant'] == df['Buy'], 'B', 'S')
df.loc[df[['Buy', 'Sell']].isnull().all(1), 'B/S'] = np.nan
Result
print(df)
Buy Sell Quant B/S
0 10.0 NaN 10.0 B
1 NaN 8.0 8.0 S
2 4.0 NaN 4.0 B
3 NaN 5.0 5.0 S
4 NaN 7.0 7.0 S
5 3.0 NaN 3.0 B
6 2.0 NaN 2.0 B
7 NaN NaN NaN NaN
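Both steps can also be folded into a single numpy.select call; a sketch (using default=None rather than np.nan, since np.select would otherwise coerce the string choices and NaN to a common dtype and the missing entry would become the string 'nan'):
df['B/S'] = np.select(
    [df['Buy'].notna(), df['Sell'].notna()],  # first matching condition wins
    ['B', 'S'],
    default=None,  # both NaN -> stays missing
)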