Pad dataframe discontinuous column - python

I have the following dataframe:
Name B C D E
1 A 1 2 2 7
2 A 7 1 1 7
3 B 1 1 3 4
4 B 2 1 3 4
5 B 3 1 3 4
What I'm trying to do is to obtain a new dataframe in which, for rows with the same "Name", the values in the "B" column are continuous. In this example, for rows with "Name" = A, the dataframe would have to be padded with B values ranging from 1 to 7, and the values of columns C, D, E for the new rows should be 0.
Name B C D E
1 A 1 2 2 7
2 A 2 0 0 0
3 A 3 0 0 0
4 A 4 0 0 0
5 A 5 0 0 0
6 A 6 0 0 0
7 A 7 1 1 7
8 B 1 1 3 4
9 B 2 1 3 4
10 B 3 1 3 4
What I've done so far is to turn the B column values for the same "Name" into continuous values:
new_idx = df_.groupby('Name').apply(lambda x: np.arange(x.index.min(), x.index.max() + 1)).apply(pd.Series).stack()
and then reindexing the original df (having set B as the index) using this new Series, but I'm having trouble reindexing with duplicate index values. Any help would be appreciated.

You can use:
def f(x):
    a = np.arange(x.index.min(), x.index.max() + 1)
    return x.reindex(a, fill_value=0)

new_idx = (df.set_index('B')
             .groupby('Name')
             .apply(f)
             .drop(columns='Name')
             .reset_index()
             .reindex(columns=df.columns))
print(new_idx)
Name B C D E
0 A 1 2 2 7
1 A 2 0 0 0
2 A 3 0 0 0
3 A 4 0 0 0
4 A 5 0 0 0
5 A 6 0 0 0
6 A 7 1 1 7
7 B 1 1 3 4
8 B 2 1 3 4
9 B 3 1 3 4
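A possible variant that avoids groupby.apply, building the complete (Name, B) index explicitly with a comprehension (a sketch on the sample data above, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'B', 'B', 'B'],
                   'B': [1, 7, 1, 2, 3],
                   'C': [2, 1, 1, 1, 1],
                   'D': [2, 1, 3, 3, 3],
                   'E': [7, 7, 4, 4, 4]})

# one (Name, B) pair for every value in each Name's min..max B range
full = pd.MultiIndex.from_tuples(
    [(name, b)
     for name, g in df.groupby('Name')['B']
     for b in range(g.min(), g.max() + 1)],
    names=['Name', 'B'])

# reindex onto the complete grid, filling the new rows with 0
out = (df.set_index(['Name', 'B'])
         .reindex(full, fill_value=0)
         .reset_index())
print(out)
```

This builds the padded frame in one reindex, so the original rows (including A's B=7 row) keep their values.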

Related

Pandas new column with values from same id, following condition

I have a DataFrame with multiple columns. I'll provide code for an artificial df for reproduction:
import pandas as pd
from itertools import product
df = pd.DataFrame(data=list(product([0,1,2], [0,1,2], [0,1,2])), columns=['A', 'B','C'])
df['D'] = range(len(df))
This results in the following dataframe:
A B C D
0 0 0 0 0
1 0 0 1 1
2 0 0 2 2
3 0 1 0 3
4 0 1 1 4
5 0 1 2 5
6 0 2 0 6
7 0 2 1 7
8 0 2 2 8
9 1 0 0 9
I want to get a new column new_D that takes the D value where C fulfils a condition and spreads it over all rows with matching values in columns A and B.
The following code does exactly that:
new_df = df[['A','B', 'D']].loc[df['C'] == 0]
new_df.columns = ['A', 'B','new_D']
df = df.merge(new_df, on=['A', 'B'], how= 'outer')
However, I strongly believe there is a better solution, one where I do not have to introduce a whole new DataFrame and merge it back together.
Preferably a one-liner.
Thanks in advance.
Desired Output:
A B C D new_D
0 0 0 0 0 0
1 0 0 1 1 0
2 0 0 2 2 0
3 0 1 0 3 3
4 0 1 1 4 3
5 0 1 2 5 3
6 0 2 0 6 6
7 0 2 1 7 6
8 0 2 2 8 6
9 1 0 0 9 9
EDIT:
Adding other example:
A B C D
0 0 4 foo 0
1 0 4 bar 1
2 0 4 baz 2
3 0 5 foo 3
4 0 5 bar 4
5 0 5 baz 5
6 0 6 foo 6
7 0 6 bar 7
8 0 6 baz 8
9 1 4 foo 9
Should be turned into the following, with the condition being df['C'] == 'bar':
A B C D new_D
0 0 4 foo 0 1
1 0 4 bar 1 1
2 0 4 baz 2 1
3 0 5 foo 3 4
4 0 5 bar 4 4
5 0 5 baz 5 4
6 0 6 foo 6 7
7 0 6 bar 7 7
8 0 6 baz 8 7
9 1 4 foo 9 10
Meaning all numbers are arbitrary. The order is also not guaranteed; it just happens to work here to take the first number.
If you want to get a new baseline every time C equals zero, you can use:
df['new_D'] = df['D'].where(df['C'].eq(0)).ffill(downcast='infer')
old answer
What you want is not fully clear, but it looks like you want to repeat the first item per group of A and B. You can easily achieve this with:
df['new_D'] = df.groupby(['A', 'B'])['D'].transform('first')
Even simpler, if your data is really composed of consecutive integers:
df['D'] = df['D']//3*3
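For reference, a runnable version of the baseline approach on the sample frame. It drops the downcast argument (deprecated in recent pandas) in favour of an explicit cast; this is a sketch, not the answer's exact code:

```python
import pandas as pd
from itertools import product

df = pd.DataFrame(data=list(product([0, 1, 2], [0, 1, 2], [0, 1, 2])),
                  columns=['A', 'B', 'C'])
df['D'] = range(len(df))

# take D wherever C == 0, leave NaN elsewhere, then carry forward
df['new_D'] = df['D'].where(df['C'].eq(0)).ffill().astype(int)
print(df.head(10))
```

Each row where C == 0 starts a new baseline that is carried over the following rows, which matches the desired output above.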

pandas : groupby + condition + iterate over a column

I have been stuck for 3 hours on this problem.
I have a DF like this:
p = product
order = number of sales
I don't have the release date of the product so I assume that the release date is the first date with some sales.
Here is my dataframe :
p order
A 0
A 0
A 1
A 1
A 2
B 0
B 0
B 1
B 1
This is what I would like: an incrementing count of days since release in column d_s_r (days since release).
p order d_s_r
A 0 0
A 0 0
A 1 1
A 1 2
A 2 3
B 0 0
B 0 0
B 1 1
B 1 2
What would be your recommendation?
I tried:
for i, row in data[data.order > 0].groupby('p'):
    list_rows = row.index.tolist()
    for m, k in enumerate(list_rows):
        data.loc[k, 'd_s_r'] = m + 1
It seems to be working, but it takes too much time...
I'm sure there is an easy way but can't find it.
Thanks in advance...
Edit :
Here's my df :
df = pd.DataFrame([['A',0,0],['A',0,0],['A',12,1],['A',23,5],['A',25,7],
                   ['B',0,0],['B',2,0],['B',8,5],['B',15,12],['B',0,3],
                   ['B',0,3],['B',5,4]],
                  columns=['prod','order','order_2'])
With df.groupby('prod')['order'].transform(lambda x: x.cumsum().factorize()[0])
I get :
prod order order_2 d_s_r
0 A 0 0 0
1 A 0 0 0
2 A 12 1 1
3 A 23 5 2
4 A 25 7 3
5 B 0 0 0
6 B 2 0 1
7 B 8 5 2
8 B 15 12 3
9 B 0 3 3
10 B 0 3 3
11 B 5 4 4
When I would like :
prod order order_2 d_s_r
0 A 0 0 0
1 A 0 0 0
2 A 12 1 1
3 A 23 5 2
4 A 25 7 3
5 B 0 0 0
6 B 2 0 1
7 B 8 5 2
8 B 15 12 3
9 B 0 3 4
10 B 0 3 5
11 B 5 4 6
Groups generally have 0's at the beginning of each groupby('p'), but they could also start directly with actual values.
A product can also have 0 orders on some days (which here puts the counter back, as above), but I still want the counter to keep running from the release date of the product.
I actually managed to get my result by adding a dummy column of 1's and doing df[df.order > 0].groupby('p').cumsum(), but I don't think it's really a good solution...
groupby on p + cumsum on order with factorize
df['d_s_r'] = df.groupby('p')['order'].cumsum().factorize()[0]
print(df)
p order d_s_r
0 A 0 0
1 A 0 0
2 A 1 1
3 A 1 2
4 A 2 3
5 B 0 0
6 B 0 0
7 B 1 1
8 B 1 2
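The snippet above reproduces the original example. For the edited frame, where the counter must keep running even when order drops back to 0 after release, one possible sketch (not from the original answer) is to count rows since the cumulative order first became positive:

```python
import pandas as pd

df = pd.DataFrame([['A',0,0],['A',0,0],['A',12,1],['A',23,5],['A',25,7],
                   ['B',0,0],['B',2,0],['B',8,5],['B',15,12],['B',0,3],
                   ['B',0,3],['B',5,4]],
                  columns=['prod', 'order', 'order_2'])

# within each product, flag every row from the first sale onward and count them
df['d_s_r'] = (df.groupby('prod')['order']
                 .transform(lambda s: s.cumsum().gt(0).cumsum()))
print(df)
```

Rows before the first sale get 0, and the counter keeps incrementing afterwards regardless of zero-order days, matching the desired d_s_r column in the edit.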

Pandas take the line value below

Here is a model of the real data:
C S E D
1 1 3 0 0
2 1 5 0 0
3 1 6 0 0
4 2 1 0 0
5 2 3 0 0
6 2 7 0 0
C - category, S - start, E - end, D - delta
Using pandas, I need to fill column E with the value of column S from the next row (id = id + 1) within the same category; for the last row of each category, E should equal the value of S from that same row.
It turns out:
C S E D
1 1 3 5 0
2 1 5 6 0
3 1 6 6 0
4 2 1 3 0
5 2 3 7 0
6 2 7 7 0
And then subtract S from E and put the result in D. This, in principle, is easy. The difficulty is filling in column E.
The result is this:
C S E D
1 1 3 5 2
2 1 5 6 1
3 1 6 6 0
4 2 1 3 2
5 2 3 7 4
6 2 7 7 0
Use DataFrameGroupBy.shift, replace the last (missing) value per group with the original via Series.fillna, and then subtract to get column D:
df['E'] = df.groupby('C')['S'].shift(-1).fillna(df['S']).astype(int)
df['D'] = df['E'] - df['S']
Or, if DataFrame.assign is needed, use a lambda function so that D can use the freshly computed E column:
df = df.assign(E=df.groupby('C')['S'].shift(-1).fillna(df['S']).astype(int),
               D=lambda x: x['E'] - x['S'])
print(df)
C S E D
1 1 3 5 2
2 1 5 6 1
3 1 6 6 0
4 2 1 3 2
5 2 3 7 4
6 2 7 7 0
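Put together on the sample table (a minimal reconstruction of the frame above), the first variant runs as:

```python
import pandas as pd

df = pd.DataFrame({'C': [1, 1, 1, 2, 2, 2],
                   'S': [3, 5, 6, 1, 3, 7],
                   'E': 0, 'D': 0})

# next S within each category; the last row of a category falls back to its own S
df['E'] = df.groupby('C')['S'].shift(-1).fillna(df['S']).astype(int)
df['D'] = df['E'] - df['S']
print(df)
```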

Padding and reshaping pandas dataframe

I have a dataframe with the following form:
data = pd.DataFrame({'ID': [1,1,1,2,2,2,2,3,3],
                     'Time': [0,1,2,0,1,2,3,0,1],
                     'sig': [2,3,1,4,2,0,2,3,5],
                     'sig2': [9,2,8,0,4,5,1,1,0],
                     'group': ['A','A','A','B','B','B','B','A','A']})
print(data)
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 2 0 4 0 B
4 2 1 2 4 B
5 2 2 0 5 B
6 2 3 2 1 B
7 3 0 3 1 A
8 3 1 5 0 A
I want to reshape and pad such that each 'ID' has the same number of Time values, the sig and sig2 columns are padded with zeros (or the mean value within ID), and group carries the same letter value. The output after padding would be:
data_pad = pd.DataFrame({'ID': [1,1,1,1,2,2,2,2,3,3,3,3],
                         'Time': [0,1,2,3,0,1,2,3,0,1,2,3],
                         'sig': [2,3,1,0,4,2,0,2,3,5,0,0],
                         'sig2': [9,2,8,0,0,4,5,1,1,0,0,0],
                         'group': ['A','A','A','A','B','B','B','B','A','A','A','A']})
print(data_pad)
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 1 3 0 0 A
4 2 0 4 0 B
5 2 1 2 4 B
6 2 2 0 5 B
7 2 3 2 1 B
8 3 0 3 1 A
9 3 1 5 0 A
10 3 2 0 0 A
11 3 3 0 0 A
My end goal is to ultimately reshape this into something with shape (number of IDs, number of time points, number of signals (2 here)).
It seems that if I pivot data, it fills in with nan values, which is fine for the signal values, but not the groups. I am also hoping to avoid looping through data.groupby('ID'), since my actual data has a large number of groups and the looping would likely be very slow.
Here's one approach creating the new index with pd.MultiIndex.from_product and using it to reindex on the Time column:
df = data.set_index(['ID', 'Time'])

# define the new index
ix = pd.MultiIndex.from_product([df.index.levels[0],
                                 df.index.levels[1]],
                                names=['ID', 'Time'])

# reindex using the above MultiIndex
df = df.reindex(ix, fill_value=0)

# forward-fill the missing values in group
df['group'] = df.group.mask(df.group.eq(0)).ffill()
print(df.reset_index())
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 1 3 0 0 A
4 2 0 4 0 B
5 2 1 2 4 B
6 2 2 0 5 B
7 2 3 2 1 B
8 3 0 3 1 A
9 3 1 5 0 A
10 3 2 0 0 A
11 3 3 0 0 A
IIUC:
(data.pivot_table(columns='Time', index=['ID','group'], fill_value=0)
.stack('Time')
.sort_index(level=['ID','Time'])
.reset_index()
)
Output:
ID group Time sig sig2
0 1 A 0 2 9
1 1 A 1 3 2
2 1 A 2 1 8
3 1 A 3 0 0
4 2 B 0 4 0
5 2 B 1 2 4
6 2 B 2 0 5
7 2 B 3 2 1
8 3 A 0 3 1
9 3 A 1 5 0
10 3 A 2 0 0
11 3 A 3 0 0
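For the stated end goal of a (number of IDs, number of time points, number of signals) array, the padded frame can be reshaped directly once it is sorted by ID and Time. A sketch under the assumption that every ID is padded to the full Time range, as in either answer above:

```python
import pandas as pd

data = pd.DataFrame({'ID': [1,1,1,2,2,2,2,3,3],
                     'Time': [0,1,2,0,1,2,3,0,1],
                     'sig': [2,3,1,4,2,0,2,3,5],
                     'sig2': [9,2,8,0,4,5,1,1,0],
                     'group': ['A','A','A','B','B','B','B','A','A']})

# pad to the full (ID, Time) grid, filling the signal columns with 0
full = pd.MultiIndex.from_product([data['ID'].unique(),
                                   sorted(data['Time'].unique())],
                                  names=['ID', 'Time'])
padded = (data.set_index(['ID', 'Time'])[['sig', 'sig2']]
              .reindex(full, fill_value=0)
              .sort_index())

# (number of IDs, number of time points, number of signals)
arr = padded.to_numpy().reshape(3, 4, 2)
print(arr.shape)
```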

Creating a column that assigns max value of set of rows by condition to all rows in that group

I have a dataframe that looks like this:
data metadata
A 0
A 1
A 2
A 3
A 4
B 0
B 1
B 2
A 0
A 1
B 0
A 0
A 1
B 0
df.data contains two different categories, A and B. df.metadata stores a running count of the number of times a category appears consecutively before the category changes. I want to create a column consecutive_count that assigns the max value of metadata per consecutive group to every row in that group. It should look like this:
data metadata consecutive_count
A 0 4
A 1 4
A 2 4
A 3 4
A 4 4
B 0 2
B 1 2
B 2 2
A 0 1
A 1 1
B 0 0
A 0 1
A 1 1
B 0 0
Please advise. Thank you.
Method 1:
You may try transform 'max' on a groupby over each consecutive group of data:
s = df.data.ne(df.data.shift()).cumsum()
df['consecutive_count'] = df.groupby(s).metadata.transform('max')
Out[96]:
data metadata consecutive_count
0 A 0 4
1 A 1 4
2 A 2 4
3 A 3 4
4 A 4 4
5 B 0 2
6 B 1 2
7 B 2 2
8 A 0 1
9 A 1 1
10 B 0 0
11 A 0 1
12 A 1 1
13 B 0 0
Method 2:
Since metadata is sorted within each group, you may reverse the dataframe and do a groupby cummax:
s = df.data.ne(df.data.shift()).cumsum()
df['consecutive_count'] = df[::-1].groupby(s).metadata.cummax()
Out[101]:
data metadata consecutive_count
0 A 0 4
1 A 1 4
2 A 2 4
3 A 3 4
4 A 4 4
5 B 0 2
6 B 1 2
7 B 2 2
8 A 0 1
9 A 1 1
10 B 0 0
11 A 0 1
12 A 1 1
13 B 0 0
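Both methods hinge on the consecutive-run key df.data.ne(df.data.shift()).cumsum(); a self-contained run of Method 1 on the sample column:

```python
import pandas as pd

df = pd.DataFrame({'data': list('AAAAABBBAABAAB'),
                   'metadata': [0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 0, 0, 1, 0]})

# a new run id every time data differs from the previous row
s = df['data'].ne(df['data'].shift()).cumsum()
df['consecutive_count'] = df.groupby(s)['metadata'].transform('max')
print(df)
```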
