I am trying to create a column (is_max) that contains 1 if the value in column B is the maximum within its group (groups are defined by column A), and 0 otherwise.
Example:
[Input]
A B
1 2
2 3
1 4
2 5
[Output]
A B is_max
1 2 0
2 5 1
1 4 1
2 3 0
What I'm trying:
df['is_max'] = 0
df.loc[df.reset_index().groupby('A')['B'].idxmax(),'is_max'] = 1
Fix your code by removing the reset_index:
df['is_max'] = 0
df.loc[df.groupby('A')['B'].idxmax(),'is_max'] = 1
df
Out[39]:
A B is_max
0 1 2 0
1 2 3 0
2 1 4 1
3 2 5 1
I'll assume A is your grouping column, since you didn't state it:
df['is_max']=(df['B']==df.groupby('A')['B'].transform('max')).astype(int)
or
df.groupby('A')['B'].apply(lambda x: x == x.max()).astype(int)
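For clarity, here is a minimal, self-contained sketch of the transform approach on the example data above; transform('max') returns a Series aligned with the original rows, holding each row's group maximum, so the comparison with B is row-by-row:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2], 'B': [2, 3, 4, 5]})
# each row gets its group's max back: [4, 5, 4, 5]
print(df.groupby('A')['B'].transform('max'))
df['is_max'] = (df['B'] == df.groupby('A')['B'].transform('max')).astype(int)
print(df)   # is_max -> [0, 0, 1, 1]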
I want to add a DataFrame a (containing a load profile) to some of the columns of another DataFrame b (also containing one load profile per column), so that some columns (load profiles) of b are overlaid with the load profile of a.
So let's say my DataFrames look like:
a:
P[kW]
0 0
1 0
2 0
3 8
4 8
5 0
b:
P1[kW] P2[kW] ... Pn[kW]
0 2 2 2
1 3 3 3
2 3 3 3
3 4 4 4
4 2 2 2
5 2 2 2
Now I want to overlay some columns of b:
b.iloc[:, [1]] += a.iloc[:, 0]
I would expect this:
b:
P1[kW] P2[kW] ... Pn[kW]
0 2 2 2
1 3 3 3
2 3 3 3
3 4 12 4
4 2 10 2
5 2 2 2
but what I actually get:
b:
P1[kW] P2[kW] ... Pn[kW]
0 2 nan 2
1 3 nan 3
2 3 nan 3
3 4 nan 4
4 2 nan 2
5 2 nan 2
That's not exactly what my code and data look like, but the principle is the same as in this abstract example.
Any guesses, what could be the problem?
Many thanks for any help in advance!
EDIT:
I actually have to overlay more than one column. Another example:
load = [0,0,0,0,0,0,0]
data = pd.DataFrame(load)
for i in range(1, 10):
    data[i] = data[0]
data
overlay = pd.DataFrame([0,0,0,0,6,6,0])
overlay
data.iloc[:, [1,2,4,5,7,8]] += overlay.iloc[:, 0]
data
WHAT??! The result is completely crazy: columns 1 and 2 aren't changed at all, columns 4 and 5 are changed in every row, and columns 7 and 8 are NaN. What am I missing?
That is what I would expect the result to look like:
Don't pass the column index 1 of DataFrame b as a list, but as a scalar. With the list, b.iloc[:, [1]] returns a one-column DataFrame, and DataFrame + Series arithmetic aligns the Series' index with the DataFrame's columns; since 'P2[kW]' never matches the index labels 0..5, the whole column becomes NaN. With the scalar, b.iloc[:, 1] is a Series, and Series + Series aligns on the row index as intended.
Code
b.iloc[:, 1] += a.iloc[:, 0]
b
Output
P1[kW] P2[kW] Pn[kW]
0 2 2 2
1 3 3 3
2 3 3 3
3 4 12 4
4 2 10 2
5 2 2 2
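If you do want to keep the list/DataFrame form on the left-hand side (for example to update several columns at once), one way to sidestep the column alignment is to add a plain NumPy array, which carries no labels; a sketch, assuming the same a and b as above:
b.iloc[:, [1]] += a.values   # positional addition, no label alignment
b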
Edit
It seems what we are looking for is to add certain columns of the data df and the overlay df. (The strange result above is the same alignment issue: DataFrame += Series aligns the Series' index 0..6 with the selected columns, so columns 1 and 2 got +0, columns 4 and 5 got +6 in every row, and columns 7 and 8 had no matching label and became NaN.)
Two Options
Option 1
cols=[1,2,4,5,7,8]
data[cols] = data[cols] + overlay.values
data
Option 2, if we want to use iloc
cols=[1,2,4,5,7,8]
data[cols] = data.iloc[:,cols] + overlay.values
data
Output
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 6 6 0 6 6 0 6 6 0
5 0 6 6 0 6 6 0 6 6 0
6 0 0 0 0 0 0 0 0 0 0
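A third, label-based option is DataFrame.add with axis=0, which aligns the Series with the rows rather than the columns; a sketch, assuming the same data and overlay as above:
cols = [1,2,4,5,7,8]
data[cols] = data[cols].add(overlay[0], axis=0)   # align on the row index
data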
How to filter tolist or values below to get only non-zeros
import pandas as pd
df = pd.DataFrame({'A':[0,2,3],'B':[2,0,4], 'C': [3,4,0]})
df['D']=df[['A','B','C']].values.tolist()
df.explode('D')
Data
A B C
0 0 2 3
1 2 0 4
2 3 4 0
After exploding column D there are now 9 rows, but I only want the 6 rows with non-zero D values in the output.
Expected result
A B C D
0 0 2 3 2
0 0 2 3 3
1 2 0 4 2
1 2 0 4 4
2 3 4 0 3
2 3 4 0 4
I found that list(filter(None, [1,0,2,3,0])) returns only the non-zeros, but I'm not sure how to apply that to the code above.
index.repeat
# run this on the original three-column df, before the list column D is added
m = df.ne(0)                                               # True where a cell is non-zero
df.loc[df.index.repeat(m.sum(1))].assign(D=df.values[m])   # one output row per non-zero cell
A B C D
0 0 2 3 2
0 0 2 3 3
1 2 0 4 2
1 2 0 4 4
2 3 4 0 3
2 3 4 0 4
Simplest is query:
df['D']=df[['A','B','C']].values.tolist()
df.explode('D').query('D != 0')
Output:
A B C D
0 0 2 3 2
0 0 2 3 3
1 2 0 4 2
1 2 0 4 4
2 3 4 0 3
2 3 4 0 4
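One caveat with the explode route: the exploded D column keeps object dtype, so you may want to cast it once the zeros are dropped; a sketch, assuming the same df as above:
out = df.explode('D').query('D != 0')
out['D'] = out['D'].astype(int)   # explode leaves D as object dtype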
I have a dataframe with the following form:
data = pd.DataFrame({'ID':[1,1,1,2,2,2,2,3,3],'Time':[0,1,2,0,1,2,3,0,1],
'sig':[2,3,1,4,2,0,2,3,5],'sig2':[9,2,8,0,4,5,1,1,0],
'group':['A','A','A','B','B','B','B','A','A']})
print(data)
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 2 0 4 0 B
4 2 1 2 4 B
5 2 2 0 5 B
6 2 3 2 1 B
7 3 0 3 1 A
8 3 1 5 0 A
I want to reshape and pad such that each 'ID' has the same number of Time values, sig and sig2 are padded with zeros (or the mean value within the ID), and group carries the same letter value. The output after padding would be:
data_pad = pd.DataFrame({'ID':[1,1,1,1,2,2,2,2,3,3,3,3],'Time':[0,1,2,3,0,1,2,3,0,1,2,3],
'sig':[2,3,1,0,4,2,0,2,3,5,0,0],'sig2':[9,2,8,0,0,4,5,1,1,0,0,0],
'group':['A','A','A','A','B','B','B','B','A','A','A','A']})
print(data_pad)
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 1 3 0 0 A
4 2 0 4 0 B
5 2 1 2 4 B
6 2 2 0 5 B
7 2 3 2 1 B
8 3 0 3 1 A
9 3 1 5 0 A
10 3 2 0 0 A
11 3 3 0 0 A
My end goal is to ultimately reshape this into something with shape (number of ID, number of time points, number of sequences {2 here}).
It seems that if I pivot data, it fills in with nan values, which is fine for the signal values, but not the groups. I am also hoping to avoid looping through data.groupby('ID'), since my actual data has a large number of groups and the looping would likely be very slow.
Here's one approach creating the new index with pd.MultiIndex.from_product and using it to reindex on the Time column:
df = data.set_index(['ID', 'Time'])
# define the new index
ix = pd.MultiIndex.from_product([df.index.levels[0],
df.index.levels[1]],
names=['ID', 'Time'])
# reindex using the above multiindex
df = df.reindex(ix, fill_value=0)
# reindex filled 'group' with 0 as well; mask the zeros and forward fill
df['group'] = df.group.mask(df.group.eq(0)).ffill()
print(df.reset_index())
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 1 3 0 0 A
4 2 0 4 0 B
5 2 1 2 4 B
6 2 2 0 5 B
7 2 3 2 1 B
8 3 0 3 1 A
9 3 1 5 0 A
10 3 2 0 0 A
11 3 3 0 0 A
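A variant that avoids treating the fill value 0 as a sentinel in group is to map group back from the original frame by ID after the reindex; a sketch, assuming each ID belongs to exactly one group:
group_map = data.drop_duplicates('ID').set_index('ID')['group']   # ID -> group
df['group'] = df.index.get_level_values('ID').map(group_map)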
IIUC:
(data.pivot_table(columns='Time', index=['ID','group'], fill_value=0)
.stack('Time')
.sort_index(level=['ID','Time'])
.reset_index()
)
Output:
ID group Time sig sig2
0 1 A 0 2 9
1 1 A 1 3 2
2 1 A 2 1 8
3 1 A 3 0 0
4 2 B 0 4 0
5 2 B 1 2 4
6 2 B 2 0 5
7 2 B 3 2 1
8 3 A 0 3 1
9 3 A 1 5 0
10 3 A 2 0 0
11 3 A 3 0 0
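For the stated end goal, once the frame is padded and sorted by ID then Time, the signal columns reshape directly into (number of IDs, number of time points, number of signals); a sketch, assuming the padded result above is stored in out:
out = out.sort_values(['ID', 'Time'])
n_ids = out['ID'].nunique()       # 3
n_times = out['Time'].nunique()   # 4
arr = out[['sig', 'sig2']].values.reshape(n_ids, n_times, 2)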
I have the following dataset in a pandas DataFrame:
group_id sub_group_id
0 0
0 1
1 0
2 0
2 1
2 2
3 0
3 0
But I want to combine those group ids to form a consolidated group id:
group_id sub_group_id consolidated_group_id
0 0 0
0 1 1
1 0 2
2 0 3
2 1 4
2 2 5
3 0 6
3 0 6
Is there any generic or mathematical way to do it?
cols = ['group_id', 'sub_group_id']
df.assign(
    consolidated_group_id=pd.factorize(
        # zip the two columns into row tuples, then number them by first appearance
        pd.Series(list(zip(*df[cols].values.T.tolist())))
    )[0]
)
group_id sub_group_id consolidated_group_id
0 0 0 0
1 0 1 1
2 1 0 2
3 2 0 3
4 2 1 4
5 2 2 5
6 3 0 6
7 3 0 6
You need to convert the values to tuples and then use factorize:
df['consolidated_group_id'] = pd.factorize(df.apply(tuple,axis=1))[0]
print (df)
group_id sub_group_id consolidated_group_id
0 0 0 0
1 0 1 1
2 1 0 2
3 2 0 3
4 2 1 4
5 2 2 5
6 3 0 6
7 3 0 6
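A groupby-based alternative is GroupBy.ngroup, which numbers each distinct (group_id, sub_group_id) pair. Note that ngroup numbers groups in sorted key order, which coincides with order of first appearance here only because the data is already sorted:
df['consolidated_group_id'] = df.groupby(['group_id', 'sub_group_id']).ngroup()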
The NumPy solutions slightly modify this answer: the trailing [::-1][0] reverses the tuple returned by numpy.unique(..., return_inverse=1) and selects the return_inverse array:
a = df.values

def unique_return_inverse_2D(a):  # a is an array
    # encode each row as a single integer (mixed-radix over the column ranges)
    a1D = a.dot(np.append((a.max(0) + 1)[:0:-1].cumprod()[::-1], 1))
    return np.unique(a1D, return_inverse=1)[::-1][0]

def unique_return_inverse_2D_viewbased(a):  # a is an array
    # view each row as a single opaque void scalar so np.unique works row-wise
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:])))
    return np.unique(a.view(void_dt).ravel(), return_inverse=1)[::-1][0]
df['consolidated_group_id'] = unique_return_inverse_2D(a)
df['consolidated_group_id1'] = unique_return_inverse_2D_viewbased(a)
print (df)
group_id sub_group_id consolidated_group_id consolidated_group_id1
0 0 0 0 0
1 0 1 1 1
2 1 0 2 2
3 2 0 3 3
4 2 1 4 4
5 2 2 5 5
6 3 0 6 6
7 3 0 6 6