Pandas: Add rows in the groups of a dataframe - python

I have a data frame as follows:
df = pd.DataFrame({"date": [1,2,5,6,2,3,4,5,1,3,4,5,6,1,2,3,4,5,6],
"variable": ["A","A","A","A","B","B","B","B","C","C","C","C","C","D","D","D","D","D","D"]})
date variable
0 1 A
1 2 A
2 5 A
3 6 A
4 2 B
5 3 B
6 4 B
7 5 B
8 1 C
9 3 C
10 4 C
11 5 C
12 6 C
13 1 D
14 2 D
15 3 D
16 4 D
17 5 D
18 6 D
In this data frame, there are 4 values in the variable column: A, B, C, D. My goal is for each variable to contain the dates 1 to 6 in the date column.
But currently, a few values in the date column are missing for some variables. I tried grouping them and filling each gap with a counter, but sometimes more than one date is missing (for example, in variable A, the dates 3 and 4 are missing). Also, the counter made my code terribly slow, as I have a couple of thousand rows.
Is there a faster and smarter way to do this without using a counter?
The desired output should be as follows:
date variable
0 1 A
1 2 A
2 3 A
3 4 A
4 5 A
5 6 A
6 1 B
7 2 B
8 3 B
9 4 B
10 5 B
11 6 B
12 1 C
13 2 C
14 3 C
15 4 C
16 5 C
17 6 C
18 1 D
19 2 D
20 3 D
21 4 D
22 5 D
23 6 D

itertools.product
from itertools import product

pd.DataFrame([*product(
    range(df.date.min(), df.date.max() + 1),
    sorted({*df.variable})
)], columns=df.columns)
date variable
0 1 A
1 1 B
2 1 C
3 1 D
4 2 A
5 2 B
6 2 C
7 2 D
8 3 A
9 3 B
10 3 C
11 3 D
12 4 A
13 4 B
14 4 C
15 4 D
16 5 A
17 5 B
18 5 C
19 5 D
20 6 A
21 6 B
22 6 C
23 6 D
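The grid above comes out date-major, while the desired output is variable-major. One option (a sketch, shown here on a trimmed two-variable frame) is to put the variables in the outer loop of the product and flip each tuple back to (date, variable):

```python
import pandas as pd
from itertools import product

df = pd.DataFrame({"date": [1, 2, 5, 6, 2, 3, 4, 5],
                   "variable": ["A", "A", "A", "A", "B", "B", "B", "B"]})

# Outer loop over variables, inner loop over dates; flip each tuple
# back to (date, variable) so it matches the column order.
out = pd.DataFrame(
    [(d, v) for v, d in product(sorted(df.variable.unique()),
                                range(df.date.min(), df.date.max() + 1))],
    columns=df.columns,
)
```

This produces all six dates for A first, then all six for B, matching the ordering in the question's desired output.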

Using groupby + reindex
(df.groupby('variable', as_index=False)
   .apply(lambda g: g.set_index('date').reindex([1, 2, 3, 4, 5, 6]).ffill().bfill())
   .reset_index(level=1))
Output:
date variable
0 1 A
0 2 A
0 3 A
0 4 A
0 5 A
0 6 A
1 1 B
1 2 B
1 3 B
1 4 B
1 5 B
1 6 B
2 1 C
2 2 C
2 3 C
2 4 C
2 5 C
2 6 C
3 1 D
3 2 D
3 3 D
3 4 D
3 5 D
3 6 D
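Another way to express the same idea, sketched on a trimmed two-variable frame, is to build the full (variable, date) grid with pd.MultiIndex.from_product and reindex onto it in one shot:

```python
import pandas as pd

df = pd.DataFrame({"date": [1, 2, 5, 6, 2, 3, 4, 5],
                   "variable": ["A", "A", "A", "A", "B", "B", "B", "B"]})

# Every (variable, date) pair we want, as a MultiIndex.
full = pd.MultiIndex.from_product([sorted(df.variable.unique()), range(1, 7)],
                                  names=["variable", "date"])

# Reindexing onto the full grid inserts the missing pairs.
out = (df.set_index(["variable", "date"])
         .reindex(full)
         .reset_index()[["date", "variable"]])
```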

This is more of a workaround, but it should work:
df.groupby('variable').agg({'date': lambda s: list(range(1, 7))}).explode('date').reset_index()

Related

Pandas - replicate rows with new column value from a list for each replication

So I have a data frame that has two columns, State and Cost, and a separate list of new "what-if" costs
State Cost
A 2
B 9
C 8
D 4
New_Cost_List = [1, 5, 10]
I'd like to replicate all the rows in my data set for each value in New_Cost_List, adding a new column that holds that value for each replicated set.
State Cost New_Cost
A 2 1
B 9 1
C 8 1
D 4 1
A 2 5
B 9 5
C 8 5
D 4 5
A 2 10
B 9 10
C 8 10
D 4 10
I thought a for loop might be appropriate to iterate through, replicating my dataset for the length of the list and adding the values of the list as a new column:
for v in New_Cost_List:
    df_new = pd.DataFrame(np.repeat(df.values, len(New_Cost_List), axis=0))
    df_new.columns = df.columns
    df_new['New_Cost'] = v
The output of this gives me the correct replication of State and Cost, but the New_Cost value is 10 for every row, since each pass through the loop overwrites the whole column. Clearly I'm not connecting how to get it to run through the list for each replicated set, so any suggestions? Or is there a better way to approach this?
EDIT 1
Reducing the number of values in the New_Cost_List from 4 to 3 so there's a difference in row count and length of the list.
Here is a way using the keys parameter of pd.concat():
(pd.concat([df] * len(New_Cost_List),
           keys=New_Cost_List,
           names=['New_Cost', None])
   .reset_index(level=0))
Output:
New_Cost State Cost
0 1 A 2
1 1 B 9
2 1 C 8
3 1 D 4
0 5 A 2
1 5 B 9
2 5 C 8
3 5 D 4
0 10 A 2
1 10 B 9
2 10 C 8
3 10 D 4
If I understand your question correctly, this should solve your problem (note that assigning the list directly assumes new_cost_list has the same length as df):
df['New Cost'] = new_cost_list
df = pd.concat([df] * len(new_cost_list), ignore_index=True)
Output:
State Cost New Cost
0 A 2 1
1 B 9 5
2 C 8 10
3 D 4 15
4 A 2 1
5 B 9 5
6 C 8 10
7 D 4 15
8 A 2 1
9 B 9 5
10 C 8 10
11 D 4 15
12 A 2 1
13 B 9 5
14 C 8 10
15 D 4 15
You can use index.repeat and numpy.tile:
df2 = (df
       .loc[df.index.repeat(len(New_Cost_List))]
       .assign(New_Cost=np.tile(New_Cost_List, len(df)))
)
or, simply, with a cross merge:
df2 = df.merge(pd.Series(New_Cost_List, name='New_Cost'), how='cross')
output:
State Cost New_Cost
0 A 2 1
0 A 2 5
0 A 2 10
1 B 9 1
1 B 9 5
1 B 9 10
2 C 8 1
2 C 8 5
2 C 8 10
3 D 4 1
3 D 4 5
3 D 4 10
For the provided order:
(df
.merge(pd.Series(New_Cost_List, name='New_Cost'), how='cross')
.sort_values(by='New_Cost', kind='stable')
.reset_index(drop=True)
)
output:
State Cost New_Cost
0 A 2 1
1 B 9 1
2 C 8 1
3 D 4 1
4 A 2 5
5 B 9 5
6 C 8 5
7 D 4 5
8 A 2 10
9 B 9 10
10 C 8 10
11 D 4 10

How to create a new column that increments within a subgroup of a group in Python?

I have a problem where I need to group the data by two groups, and attach a column that sort of counts the subgroup.
Example dataframe looks like this:
colA colB
1 a
1 a
1 c
1 c
1 f
1 z
1 z
1 z
2 a
2 b
2 b
2 b
3 c
3 d
3 k
3 k
3 m
3 m
3 m
Expected output after attaching the new column is as follows:
colA colB colC
1 a 1
1 a 1
1 c 2
1 c 2
1 f 3
1 z 4
1 z 4
1 z 4
2 a 1
2 b 2
2 b 2
2 b 2
3 c 1
3 d 2
3 k 3
3 k 3
3 m 4
3 m 4
3 m 4
I tried the following, but I cannot get this trivial-looking problem solved:
Solution 1, which I tried, does not give what I am looking for:
df['ONES'] = 1
df['colC'] = df.groupby(['colA', 'colB'])['ONES'].cumcount() + 1
df.drop(columns='ONES', inplace=True)
I also played with the transform, cumsum, and apply functions, but I cannot seem to solve this. Any help is appreciated.
Edit: minor error on dataframes.
Edit 2: For simplicity purposes, I showed similar values for column B, but the problem is within a larger group (indicated by colA), colB may be different and therefore, it needs to be grouped by both at the same time.
Edit 3: Updated dataframes to emphasize what I meant by my second edit. Hope this makes it clearer and reproducible.
You could use groupby + ngroup:
df['colC'] = df.groupby('colA').apply(lambda x: x.groupby('colB').ngroup()+1).droplevel(0)
Output:
colA colB colC
0 1 a 1
1 1 a 1
2 1 c 2
3 1 c 2
4 1 f 3
5 1 z 4
6 1 z 4
7 1 z 4
8 2 a 1
9 2 b 2
10 2 b 2
11 2 b 2
12 3 c 1
13 3 d 2
14 3 k 3
15 3 k 3
16 3 m 4
17 3 m 4
18 3 m 4
Categorically, factorize:
df['colC'] = df['colB'].astype('category').cat.codes + 1
(Note this numbers categories globally rather than per colA group, so it matches the desired output only when every group's values happen to line up, as in the frame below.)
colA colB colC
0 1 a 1
1 1 a 1
2 1 b 2
3 1 b 2
4 1 c 3
5 1 d 4
6 1 d 4
7 1 d 4
8 2 a 1
9 2 b 2
10 2 b 2
11 2 b 2
12 3 a 1
13 3 b 2
14 3 c 3
15 3 c 3
16 3 d 4
17 3 d 4
18 3 d 4

On the basis of record in a single column, update a new date column [duplicate]

What I have:
df = pd.DataFrame({'SERIES1':['A','A','A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','C','C'],
'SERIES2':[1,1,1,1,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1],
'SERIES3':[10,12,20,10,12,4,8,8,1,10,12,12,13,13,9,8,7,7,7]})
SERIES1 SERIES2 SERIES3
0 A 1 10
1 A 1 12
2 A 1 20
3 A 1 10
4 A 2 12
5 A 2 4
6 B 1 8
7 B 1 8
8 B 1 1
9 B 1 10
10 B 1 12
11 B 1 12
12 B 1 13
13 B 1 13
14 C 1 9
15 C 1 8
16 C 1 7
17 C 1 7
18 C 1 7
What I need is to group by SERIES1 and SERIES2 and to convert the values in SERIES3 to the minimum of that group. i.e.:
df2 = pd.DataFrame({'SERIES1':['A','A','A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','C','C'],
'SERIES2':[1,1,1,1,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1],
'SERIES3':[10,10,10,10,4,4,1,1,1,1,1,1,1,1,7,7,7,7,7]})
SERIES1 SERIES2 SERIES3
0 A 1 10
1 A 1 10
2 A 1 10
3 A 1 10
4 A 2 4
5 A 2 4
6 B 1 1
7 B 1 1
8 B 1 1
9 B 1 1
10 B 1 1
11 B 1 1
12 B 1 1
13 B 1 1
14 C 1 7
15 C 1 7
16 C 1 7
17 C 1 7
18 C 1 7
I have a feeling this can be done with .groupby(), but I'm not sure how to replace the values in the existing DataFrame, or how to add the result as a new series.
I'm able to get:
df.groupby(['SERIES1', 'SERIES2']).min()
SERIES3
SERIES1 SERIES2
A 1 10
2 4
B 1 1
C 1 7
which are the correct minimums per group, but I can't figure out a simple way to pop that back into the original dataframe.
You can use groupby.transform, which returns a series of the same length that you can assign back to the data frame:
df['SERIES3'] = df.groupby(['SERIES1', 'SERIES2']).SERIES3.transform('min')
df
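A minimal end-to-end check of the transform approach, on a slice of the question's data:

```python
import pandas as pd

df = pd.DataFrame({'SERIES1': ['A', 'A', 'A', 'A', 'A', 'A'],
                   'SERIES2': [1, 1, 1, 1, 2, 2],
                   'SERIES3': [10, 12, 20, 10, 12, 4]})

# transform('min') broadcasts each group's minimum back to the original
# rows, so the result aligns with df and can be assigned in place.
df['SERIES3'] = df.groupby(['SERIES1', 'SERIES2']).SERIES3.transform('min')
```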

How does pandas convert one column of data into another?

I have a dataframe generated by pandas, as follows:
NO CODE
1 a
2 a
3 a
4 a
5 a
6 a
7 b
8 b
9 a
10 a
11 a
12 a
13 b
14 a
15 a
16 a
I want to convert the CODE column data to get the NUM column. The encoding rules are as follows:
NO CODE NUM
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 b b
8 b b
9 a 1
10 a 2
11 a 3
12 a 4
13 b b
14 a 1
15 a 2
16 a 3
thank you!
Try:
a_group = df.CODE.eq('a')
df['NUM'] = np.where(a_group,
                     df.groupby(a_group.ne(a_group.shift()).cumsum())
                       .CODE.cumcount() + 1,
                     df.CODE)
on
df = pd.DataFrame({'CODE':list('baaaaaabbaaaabbaa')})
yields
CODE NUM
-- ------ -----
0 b b
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 b b
8 b b
9 a 1
10 a 2
11 a 3
12 a 4
13 b b
14 b b
15 a 1
16 a 2
IIUC
s = df.CODE.eq('b').cumsum()
df['NUM'] = df.CODE.where(df.CODE.eq('b'),
                          s[~df.CODE.eq('b')].groupby(s).cumcount() + 1)
df
Output:
NO CODE NUM
0 1 a 1
1 2 a 2
2 3 a 3
3 4 a 4
4 5 a 5
5 6 a 6
6 7 b b
7 8 b b
8 9 a 1
9 10 a 2
10 11 a 3
11 12 a 4
12 13 b b
13 14 a 1
14 15 a 2
15 16 a 3
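Both answers hinge on the same run-detection idiom: compare each value with its shifted neighbor and cumsum the changes to get a run id. A minimal sketch of that idea on a tiny frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'CODE': list('aabaa')})

# ne(shift) is True at every value change, so cumsum yields one id per run.
run_id = df.CODE.ne(df.CODE.shift()).cumsum()

# Number rows within each run for 'a' values; keep the letter otherwise.
df['NUM'] = np.where(df.CODE.eq('a'),
                     df.groupby(run_id).cumcount() + 1,
                     df.CODE)
```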

Add a name to pandas dataframe index

How can I add a name to the index of a pandas DataFrame, so that the name appears above the index values as shown in the samples below?
You need to set the index name:
df.index.name = 'code'
Or rename_axis:
df = df.rename_axis('code')
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10,size=(5,5)),columns=list('ABCDE'),index=list('abcde'))
print (df)
A B C D E
a 8 8 3 7 7
b 0 4 2 5 2
c 2 2 1 0 8
d 4 0 9 6 2
e 4 1 5 3 4
df.index.name = 'code'
print (df)
A B C D E
code
a 8 8 3 7 7
b 0 4 2 5 2
c 2 2 1 0 8
d 4 0 9 6 2
e 4 1 5 3 4
df = df.rename_axis('code')
print (df)
A B C D E
code
a 8 8 3 7 7
b 0 4 2 5 2
c 2 2 1 0 8
d 4 0 9 6 2
e 4 1 5 3 4
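rename_axis can also name the columns axis via axis=1; a small sketch (the 'letter' name is just illustrative):

```python
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(3, 3)),
                  columns=list('ABC'), index=list('abc'))

# axis=1 names the columns axis instead of the index.
df = df.rename_axis('code').rename_axis('letter', axis=1)
```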
