Adding dataframes only on selected rows - python

I have a dataframe like this
import pandas as pd
df = pd.DataFrame({'id' : [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],\
'crit_1' : [0, 0, 1, 0, 0, 0, 1, 0, 0, 1], \
'crit_2' : ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'a', 'a', 'a'],\
'value' : [3, 4, 3, 5, 1, 2, 4, 6, 2, 3]}, \
columns=['id' , 'crit_1', 'crit_2', 'value' ])
df
Out[41]:
id crit_1 crit_2 value
0 1 0 a 3
1 1 0 a 4
2 1 1 b 3
3 1 0 b 5
4 2 0 a 1
5 2 0 b 2
6 2 1 a 4
7 3 0 a 6
8 3 0 a 2
9 3 1 a 3
I pull a subset out of this frame based on crit_1
df_subset = df[(df['crit_1']==1)]
Then I perform a complex operation (the nature of which is unimportant for this question) on that subset, producing a new column
df_subset['some_new_val'] = [1, 4,2]
df_subset
Out[42]:
id crit_1 crit_2 value some_new_val
2 1 1 b 3 1
6 2 1 a 4 4
9 3 1 a 3 2
Now I want to add some_new_val back into my original dataframe, onto the column value. However, I only want to add it where there is a match on id and crit_2
The result should look like this
id crit_1 crit_2 value new_value
0 1 0 a 3 3
1 1 0 a 4 4
2 1 1 b 3 4
3 1 0 b 5 6
4 2 0 a 1 1
5 2 0 b 2 6
6 2 1 a 4 4
7 3 0 a 6 8
8 3 0 a 2 4
9 3 1 a 3 5

You can use merge with left join and then add:
#filter only columns for join and for append
cols = ['id','crit_2', 'some_new_val']
df = pd.merge(df, df_subset[cols], on=['id','crit_2'], how='left')
print (df)
id crit_1 crit_2 value some_new_val
0 1 0 a 3 NaN
1 1 0 a 4 NaN
2 1 1 b 3 1.0
3 1 0 b 5 1.0
4 2 0 a 1 4.0
5 2 0 b 2 NaN
6 2 1 a 4 4.0
7 3 0 a 6 2.0
8 3 0 a 2 2.0
9 3 1 a 3 2.0
df['some_new_val'] = df['some_new_val'].add(df['value'], fill_value=0)
print (df)
id crit_1 crit_2 value some_new_val
0 1 0 a 3 3.0
1 1 0 a 4 4.0
2 1 1 b 3 4.0
3 1 0 b 5 6.0
4 2 0 a 1 5.0
5 2 0 b 2 2.0
6 2 1 a 4 8.0
7 3 0 a 6 8.0
8 3 0 a 2 4.0
9 3 1 a 3 5.0
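The merge-then-add approach above can be sketched end to end as follows. This is a minimal sketch, not the only way to do it: the name new_value follows the question's expected output, and it assumes each (id, crit_2) pair appears at most once in df_subset (validate='many_to_one' makes the merge raise if that assumption is broken).

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'crit_1': [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
    'crit_2': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'a', 'a', 'a'],
    'value': [3, 4, 3, 5, 1, 2, 4, 6, 2, 3],
})

# .copy() makes the subset an independent frame, so adding a column
# does not trigger a SettingWithCopyWarning
df_subset = df[df['crit_1'] == 1].copy()
df_subset['some_new_val'] = [1, 4, 2]

keys = ['id', 'crit_2']
# Assumption: (id, crit_2) is unique in df_subset; validate checks this
merged = df.merge(df_subset[keys + ['some_new_val']],
                  on=keys, how='left', validate='many_to_one')

# Rows without a match get NaN from the left join; treat them as adding 0
merged['new_value'] = merged['value'] + merged['some_new_val'].fillna(0).astype(int)
result = merged.drop(columns='some_new_val')
print(result)
```

Filling the NaN values before adding keeps new_value as an integer column, whereas the add(..., fill_value=0) form shown above leaves it as float.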

Related

Populating an even distribution of values across multiple axis?

Basic Example:
# Given params such as:
params = {
'cols': 8,
'rows': 4,
'n': 4
}
# I'd like to produce (or equivalent):
col0 col1 col2 col3 col4 col5 col6 col7
row_0 0 1 2 3 0 1 2 3
row_1 1 2 3 0 1 2 3 0
row_2 2 3 0 1 2 3 0 1
row_3 3 0 1 2 3 0 1 2
Axis Value Counts:
Where both axes have an equal distribution of values
df.apply(lambda x: x.value_counts(), axis=1)
0 1 2 3
row_0 2 2 2 2
row_1 2 2 2 2
row_2 2 2 2 2
row_3 2 2 2 2
df.apply(lambda x: x.value_counts())
col0 col1 col2 col3 col4 col5 col6 col7
0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1
My attempt thus far:
import itertools
import numpy as np
import pandas as pd
def create_df(cols, rows, n):
    x = itertools.cycle(list(itertools.permutations(range(n))))
    df = pd.DataFrame(index=range(rows), columns=range(cols))
    df[:] = np.reshape([next(x) for _ in range((rows*cols)//n)], (rows, cols))
    #df = df.T.add_prefix('row_').T
    #df = df.add_prefix('col_')
    return df
params = {
'cols': 8,
'rows': 4,
'n': 4
}
df = create_df(**params)
Output:
0 1 2 3 4 5 6 7
0 0 1 2 3 0 1 3 2
1 0 2 1 3 0 2 3 1
2 0 3 1 2 0 3 2 1
3 1 0 2 3 1 0 3 2
# Correct on this Axis:
>>> df.apply(lambda x: x.value_counts(), axis=1)
0 1 2 3
0 2 2 2 2
1 2 2 2 2
2 2 2 2 2
3 2 2 2 2
# Incorrect on this Axis:
>>> df.apply(lambda x: x.value_counts())
0 1 2 3 4 5 6 7
0 3.0 1 NaN NaN 3.0 1 NaN NaN
1 1.0 1 2.0 NaN 1.0 1 NaN 2.0
2 NaN 1 2.0 1.0 NaN 1 1.0 2.0
3 NaN 1 NaN 3.0 NaN 1 3.0 NaN
So, I have the conditions I need on one axis, but not on the other.
How can I update my method/create a method to meet both conditions?
You can use numpy.roll:
def create_df(cols, rows, n):
    x = itertools.cycle(range(n))
    arr = [np.roll([next(x) for _ in range(cols)], -i) for i in range(rows)]
    return pd.DataFrame(arr)
Output (with given test case):
0 1 2 3 4 5 6 7
0 0 1 2 3 0 1 2 3
1 1 2 3 0 1 2 3 0
2 2 3 0 1 2 3 0 1
3 3 0 1 2 3 0 1 2
Edit: In Python 3.8+ you can use the := operator (which is significantly faster than my answer above):
def create_df(cols, rows, n):
    x = itertools.cycle(range(n))
    n = [next(x) for _ in range(cols)]
    arr = [n := n[1:]+n[:1] for _ in range(rows)]
    return pd.DataFrame(arr)
Output (again with given test case):
0 1 2 3 4 5 6 7
0 1 2 3 0 1 2 3 0
1 2 3 0 1 2 3 0 1
2 3 0 1 2 3 0 1 2
3 0 1 2 3 0 1 2 3
You can tile your input and use a custom roll to shift each row independently:
c = params['cols']
r = params['rows']
n = params['n']
a = np.arange(params['n']) # or any input
b = np.tile(a, (r, c//n))
# array([[0, 1, 2, 3, 0, 1, 2, 3],
# [0, 1, 2, 3, 0, 1, 2, 3],
# [0, 1, 2, 3, 0, 1, 2, 3],
# [0, 1, 2, 3, 0, 1, 2, 3]])
idx = np.arange(r)[:, None]
shift = (np.tile(np.arange(c), (r, 1)) - np.arange(r)[:, None])
df = pd.DataFrame(b[idx, shift])
Output:
0 1 2 3 4 5 6 7
0 0 1 2 3 0 1 2 3
1 3 0 1 2 3 0 1 2
2 2 3 0 1 2 3 0 1
3 1 2 3 0 1 2 3 0
Alternative order:
idx = np.arange(r)[:, None]
shift = (np.tile(np.arange(c), (r, 1)) + np.arange(r)[:, None]) % c
df = pd.DataFrame(b[idx, shift])
Output:
0 1 2 3 4 5 6 7
0 0 1 2 3 0 1 2 3
1 1 2 3 0 1 2 3 0
2 2 3 0 1 2 3 0 1
3 3 0 1 2 3 0 1 2
Other alternative: use a custom strided_indexing_roll function.
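The rolled layouts above all place the value (row + column) mod n, or a shifted variant of it, in each cell. A minimal sketch that builds this directly with broadcasting is shown below; the row_/col labels mirror the question's desired output, and the even value counts on both axes are only assumed to hold when cols and rows are both multiples of n.

```python
import numpy as np
import pandas as pd

def create_df(cols, rows, n):
    # Entry (i, j) is (i + j) mod n, so each row is the previous row
    # shifted left by one position
    i = np.arange(rows)[:, None]   # column vector of row numbers
    j = np.arange(cols)[None, :]   # row vector of column numbers
    return pd.DataFrame(
        (i + j) % n,
        index=[f'row_{r}' for r in range(rows)],
        columns=[f'col{c}' for c in range(cols)],
    )

df = create_df(cols=8, rows=4, n=4)
row_counts = df.apply(lambda x: x.value_counts(), axis=1)
col_counts = df.apply(lambda x: x.value_counts())
print(df)
```

With the question's parameters, row_counts contains only 2s and col_counts only 1s, which is the condition asked for on both axes.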

Add missing rows within groups

Suppose dataframe:
df = pd.DataFrame({
"id": [1, 1, 1, 2, 2, 3],
"day": [1, 2, 3, 1, 3, 2],
"value": [1, 2, 3, 4, 5, 6],
})
I need all ids to have the same set of days. How to add rows with missing days?
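IIUC (if I understand correctly), pivot, fill the gaps, and stack back. Recent pandas versions require keyword arguments for pivot:
df=df.pivot(index='id', columns='day', values='value').fillna(0).stack().reset_index()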
id day 0
0 1 1 1.0
1 1 2 2.0
2 1 3 3.0
3 2 1 4.0
4 2 2 0.0
5 2 3 5.0
6 3 1 0.0
7 3 2 6.0
8 3 3 0.0
If there are several value columns and the new rows should be filled with 0, move id and day into a MultiIndex with DataFrame.set_index, use DataFrame.unstack followed by DataFrame.stack, and finally convert the index back to columns with DataFrame.reset_index:
df = df.set_index(['id','day']).unstack(fill_value=0).stack().reset_index()
print (df)
id day value
0 1 1 1
1 1 2 2
2 1 3 3
3 2 1 4
4 2 2 0
5 2 3 5
6 3 1 0
7 3 2 6
8 3 3 0
Another solution uses DataFrame.reindex with MultiIndex.from_product: set id and day as a MultiIndex with DataFrame.set_index, reindex against the full product of the index levels, and convert back to columns with DataFrame.reset_index:
df = df.set_index(['id','day'])
m = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(m, fill_value=0).reset_index()
print (df)
id day value
0 1 1 1
1 1 2 2
2 1 3 3
3 2 1 4
4 2 2 0
5 2 3 5
6 3 1 0
7 3 2 6
8 3 3 0
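Both solutions above only create days that occur for at least one id, because the set of days comes from the data itself. If the complete set of days is known up front, it can be passed explicitly. The sketch below assumes a hypothetical range of days 1 to 4 (day 4 is not in the original data) and uses the illustrative names all_days, full_index and filled.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3],
    "day": [1, 2, 3, 1, 3, 2],
    "value": [1, 2, 3, 4, 5, 6],
})

# Hypothetical assumption: valid days are 1..4, even though day 4
# never appears in the data
all_days = range(1, 5)

# Every (id, day) combination that should exist in the result
full_index = pd.MultiIndex.from_product(
    [sorted(df["id"].unique()), all_days], names=["id", "day"]
)

filled = (
    df.set_index(["id", "day"])
      .reindex(full_index, fill_value=0)  # missing combinations get value 0
      .reset_index()
)
print(filled)
```

Note that reindex raises an error if the (id, day) pairs in the original frame are not unique, so duplicates would need to be aggregated first.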

I want to insert new columns by adding 2 consecutive columns in my data frame

I have a data frame with a large number of columns that follow a repeating pattern. I would like to insert a column (Diff) after each pair so that it contains the difference of the two preceding columns. An example may describe it better:
Existing DF Example:
A_x_y_z_1 A_x_y_z_2 B_a_b_c_1 B_a_b_c_2 C_3_y_w_1 C_3_y_w_2
2 1 7 1 2 3
5 5 9 5 1 4
1 3 1 3 2 2
3 8 0 2 3 1
Expected DF:
A_x_y_z_1 A_x_y_z_2 diff B_a_b_c_1 B_a_b_c_2 diff C_3_y_w_1 C_3_y_w_2 diff
2 1 -1 7 1 -6 2 3 1
5 5 0 9 5 -4 4 5 1
1 3 2 1 3 2 2 7 5
3 8 5 0 2 2 1 4 3
pd.concat
pd.concat([ # concat all groups
d.assign(**{f'{k}_Diff': d[f'{k}_2'] - d[f'{k}_1']}) # New Col with 'Diff'
for k,d in df.groupby(lambda x: x.split('_', 1)[0], axis=1) # Group w/Callable
], axis=1)
A_1 A_2 A_Diff B_1 B_2 B_Diff C_1 C_2 C_Diff
0 2 1 -1 7 1 -6 2 3 1
1 5 5 0 9 5 -4 1 4 3
2 1 3 2 1 3 2 2 2 0
3 3 8 5 0 2 2 3 1 -2
We can do split with columns then get groupby diff
df1=df.copy()
df1.columns=df1.columns.str.split('_').str[0]
df=pd.concat([df,df1.groupby(level=0,axis=1).diff().dropna(axis=1).add_suffix('_Diff')],axis=1).sort_index(axis=1)
df
Out[115]:
A_1 A_2 A_Diff B_1 B_2 B_Diff C_1 C_2 C_Diff
0 2 1 -1.0 7 1 -6.0 2 3 1.0
1 5 5 0.0 9 5 -4.0 1 4 3.0
2 1 3 2.0 1 3 2.0 2 2 0.0
3 3 8 5.0 0 2 2.0 3 1 -2.0
Assuming that column names are sorted as in your example you can do it using some numpy like below
import numpy as np
import pandas as pd
df = pd.DataFrame([[2, 1, 7, 1, 2, 3], [5, 5, 9, 5, 1, 4], [1, 3, 1, 3, 2, 2], [3, 8, 0, 2, 3, 1]], columns=('A_1', 'A_2', 'B_1', 'B_2', 'C_1', 'C_2'))
columns = np.unique(df.columns.str.split("_").str[0]) + np.array(["_1", "_2", "_diff"]).reshape(-1,1)
df = df.assign(**{key:0 for key in columns[2, :]}).sort_index(axis=1)
df[columns[2,:]] = df[columns[1,:]].values - df[columns[0,:]].values
df
Something a bit easier to read (for me) without pd.concat with the use of f strings
for col in ['A','B','C']:
    df[f'{col}_diff'] = df[f'{col}_2'] - df[f'{col}_1']
A_1 A_2 B_1 B_2 C_1 C_2 A_diff B_diff C_diff
0 2 1 7 1 2 3 -1 -6 1
1 5 5 9 5 1 4 0 -4 3
2 1 3 1 3 2 2 2 2 0
3 3 8 0 2 3 1 5 2 -2
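The loop above appends the diff columns at the end and hard-codes the prefixes. A minimal sketch that derives the prefixes from the question's full column names and places each diff column right after its pair is shown below; it assumes every column ends in _1 or _2 and that each prefix has both, and the names prefixes, pieces and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[2, 1, 7, 1, 2, 3], [5, 5, 9, 5, 1, 4], [1, 3, 1, 3, 2, 2], [3, 8, 0, 2, 3, 1]],
    columns=['A_x_y_z_1', 'A_x_y_z_2', 'B_a_b_c_1', 'B_a_b_c_2', 'C_3_y_w_1', 'C_3_y_w_2'],
)

# Prefix = everything before the trailing "_1" / "_2";
# dict.fromkeys removes duplicates while keeping first-seen order
prefixes = list(dict.fromkeys(c.rsplit('_', 1)[0] for c in df.columns))

pieces = []
for p in prefixes:
    pair = df[[f'{p}_1', f'{p}_2']]
    # Append the difference as a third column next to its pair
    pieces.append(pair.assign(**{f'{p}_diff': pair[f'{p}_2'] - pair[f'{p}_1']}))

result = pd.concat(pieces, axis=1)
print(result)
```

Because the pieces are concatenated in prefix order, no sort_index call is needed and the original column order is preserved.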

How do you add an array to each previous row in pandas?

If I have an array [1, 2, 3, 4, 5] and a Pandas Dataframe
df = pd.DataFrame([[1,1,1,1,1], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0]])
0 1 2 3 4
0 1 1 1 1 1
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
How do I iterate through the Pandas DataFrame adding my array to each previous row?
The expected result would be:
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
The array is added n times to the nth row, which you can create using np.arange(len(df))[:,None] * a and then add the first row:
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 0 0 0 0 0
#2 0 0 0 0 0
#3 0 0 0 0 0
a = np.array([1, 2, 3, 4, 5])
np.arange(len(df))[:,None] * a
#array([[ 0, 0, 0, 0, 0],
# [ 1, 2, 3, 4, 5],
# [ 2, 4, 6, 8, 10],
# [ 3, 6, 9, 12, 15]])
df[:] = df.iloc[0].values + np.arange(len(df))[:,None] * a
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 2 3 4 5 6
#2 3 5 7 9 11
#3 4 7 10 13 16
df = pd.DataFrame([
[1,1,1],
[0,0,0],
[0,0,0],
])
s = pd.Series([1,2,3])
# add to every row except first, then cumulative sum
result = df.add(s, axis=1)
result.iloc[0] = df.iloc[0]
result.cumsum()
Or if you want a one-liner:
pd.concat([df[:1], df[1:].add(s, axis=1)]).cumsum()
Either way, result:
0 1 2
0 1 1 1
1 2 3 4
2 3 5 7
Using cumsum and assignment (here l is assumed to be the array from the question):
l = [1, 2, 3, 4, 5]
df[1:] = (df+l).cumsum()[:-1].values
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
Or using concat:
pd.concat((df[:1], (df+l).cumsum()[:-1]))
0 1 2 3 4
0 1 1 1 1 1
0 2 3 4 5 6
1 3 5 7 9 11
2 4 7 10 13 16
After cumsum, you can shift and add back to the original df:
a = [1,2,3,4,5]
updated = df.add(pd.Series(a), axis=1).cumsum().shift().fillna(0)
df.add(updated)
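All of the answers above compute the same thing: row k equals the first row plus k times the array. The sketch below checks the broadcast closed form against a plain loop; the names steps, closed_form, looped and result are illustrative, and it assumes (as in the question) that only the first row of df carries data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1, 1], [0] * 5, [0] * 5, [0] * 5])
a = np.array([1, 2, 3, 4, 5])

# Closed form: row k = first row + k * a, built with broadcasting
steps = np.arange(len(df))[:, None]          # shape (4, 1)
closed_form = df.iloc[0].to_numpy() + steps * a

# Reference implementation: add the array to the previous row, one row at a time
rows = [df.iloc[0].to_numpy()]
for _ in range(len(df) - 1):
    rows.append(rows[-1] + a)
looped = np.vstack(rows)

result = pd.DataFrame(closed_form, index=df.index, columns=df.columns)
print(result)
```

The broadcast version avoids the Python-level loop, which matters mainly for frames with many rows.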

Pandas dataframe: how to group by values in a column and create new columns out of grouped values

I have a dataframe with two columns:
x y
0 1
1 1
2 2
0 5
1 6
2 8
0 1
1 8
2 4
0 1
1 7
2 3
What I want is:
x val1 val2 val3 val4
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
I know that the values in column x repeat every N rows.
You could use groupby/cumcount to assign column numbers and then call pivot:
import pandas as pd
df = pd.DataFrame({'x': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
'y': [1, 1, 2, 5, 6, 8, 1, 8, 4, 1, 7, 3]})
df['columns'] = df.groupby('x')['y'].cumcount()
# x y columns
# 0 0 1 0
# 1 1 1 0
# 2 2 2 0
# 3 0 5 1
# 4 1 6 1
# 5 2 8 1
# 6 0 1 2
# 7 1 8 2
# 8 2 4 2
# 9 0 1 3
# 10 1 7 3
# 11 2 3 3
result = df.pivot(index='x', columns='columns')
print(result)
yields
y
columns 0 1 2 3
x
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Or, if you can really rely on the values in x being repeated in order N times,
N = 3
result = pd.DataFrame(df['y'].values.reshape(-1, N).T)
yields
0 1 2 3
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Using reshape is quicker than calling groupby/cumcount and pivot, but it
is less robust since it relies on the values in y appearing in the right order.
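To get the val1 ... val4 column names from the question, the groupby/cumcount approach can be extended slightly. This is a sketch under the same data as above; the helper column name n and the result name wide are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x': [0, 1, 2] * 4,
                   'y': [1, 1, 2, 5, 6, 8, 1, 8, 4, 1, 7, 3]})

wide = (
    df.assign(n=df.groupby('x').cumcount() + 1)   # 1-based occurrence number per x
      .pivot(index='x', columns='n', values='y')  # one column per occurrence
      .add_prefix('val')                          # 1 -> val1, 2 -> val2, ...
      .rename_axis(None, axis=1)                  # drop the leftover "n" columns label
      .reset_index()
)
print(wide)
```

Pivoting on the integer occurrence number and adding the prefix afterwards keeps the columns in numeric order even with ten or more occurrences, where string labels such as val10 would sort before val2.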
