I have a dataset where I would like to pivot the entire dataframe, using certain columns as values.
Data
id date sun moon stars total pcp base final status space galaxy
aa Q1 21 5 1 2 8 0 200 41 5 1 1
aa Q2 21 4 1 2 7 1 200 50 6 2 1
Desired
id date type pcp base final final2 status type2 final3
aa Q1 21 sun 0 200 41 5 5 space 1
aa Q1 21 moon 0 200 41 1 5 galaxy 1
aa Q1 21 stars 0 200 41 2 5 space 1
aa Q2 21 sun 1 200 50 4 6 space 2
aa Q2 21 moon 1 200 50 1 6 galaxy 1
aa Q2 21 stars 1 200 50 2 6 space 2
Doing
df.drop(columns='total').melt(['id','date','final','final2','base','ppp'],var_name='type',value_name='ppp')
This works well in pivoting the first set of values (sun, moon etc) however, not sure how to incorporate the second 'set' space and galaxy.
Any suggestion is appreciated
This is a partial answer:
cols = ['id', 'date', 'pcp', 'base', 'final', 'status']
df = df.drop(columns='total')
df1 = df.melt(id_vars=cols, value_vars=['sun', 'moon', 'stars'], var_name='type')
df2 = df.melt(id_vars=cols, value_vars=['galaxy', 'space'], var_name='type2')
out = pd.merge(df1, df2, on=cols)
At this point, your dataframe looks like:
>>> out
id date pcp base final status type value_x type2 value_y
0 aa Q1 21 0 200 41 5 sun 5 galaxy 1
1 aa Q1 21 0 200 41 5 sun 5 space 1
2 aa Q1 21 0 200 41 5 moon 1 galaxy 1
3 aa Q1 21 0 200 41 5 moon 1 space 1
4 aa Q1 21 0 200 41 5 stars 2 galaxy 1
5 aa Q1 21 0 200 41 5 stars 2 space 1
6 aa Q2 21 1 200 50 6 sun 4 galaxy 1
7 aa Q2 21 1 200 50 6 sun 4 space 2
8 aa Q2 21 1 200 50 6 moon 1 galaxy 1
9 aa Q2 21 1 200 50 6 moon 1 space 2
10 aa Q2 21 1 200 50 6 stars 2 galaxy 1
11 aa Q2 21 1 200 50 6 stars 2 space 2
Now the question is how total3 is set to reduce the dataframe?
Related
I have the following pandas dataframe:
temp stage issue_datetime
20 1 2022/11/30 19:20
21 1 2022/11/30 19:21
20 1 None
25 1 2022/11/30 20:10
30 2 None
22 2 2022/12/01 10:00
22 2 2022/12/01 10:01
31 3 2022/12/02 11:00
32 3 2022/12/02 11:01
19 1 None
20 1 None
I want to get the following result:
temp stage num_issues
20 1 3
21 1 3
20 1 3
25 1 3
30 2 2
22 2 2
22 2 2
31 3 2
32 3 2
19 1 0
20 1 0
Basically, I need to calculate the number of non-None per continuous value of stage and create a new column called num_issues.
How can I do it?
You can find the blocks of continuous value with cumsum on the diff, then groupby that and transform the non-null`
blocks = df['stage'].ne(df['stage'].shift()).cumsum()
df['num_issues'] = df['issue_datetime'].notna().groupby(blocks).transform('sum')
# or
# df['num_issues'] = df['issue_datetime'].groupby(blocks).transform('count')
Output:
temp stage issue_datetime num_issues
0 20 1 2022/11/30 19:20 3
1 21 1 2022/11/30 19:21 3
2 20 1 None 3
3 25 1 2022/11/30 20:10 3
4 30 2 None 2
5 22 2 2022/12/01 10:00 2
6 22 2 2022/12/01 10:01 2
7 31 3 2022/12/02 11:00 2
8 32 3 2022/12/02 11:01 2
9 19 1 None 0
10 20 1 None 0
I have a dataset where I would like to sum two columns and then perform a subtraction while displaying a cumulative sum.
Data
id date t1 t2 total start cur_t1 cur_t2 final_o finaldb de_t1 de_t2
a q122 4 1 5 50 25 20 55 21 1 1
a q222 1 1 2 50 25 20 57 22 0 0
a q322 0 0 0 50 25 20 57 22 5 5
b q122 5 5 10 100 30 40 110 27 4 4
b q222 2 2 4 100 30 70 114 29 5 1
b q322 3 4 7 100 30 70 121 33 0 1
Desired
id date t1 t2 total start cur_t1 cur_t2 final_o finaldb de_t1 de_t2 finalt1
a q122 4 1 5 50 25 20 55 21 1 1 28
a q222 1 1 2 50 25 20 57 22 0 0 29
a q322 0 0 0 50 25 20 57 22 5 5 24
b q122 5 5 10 100 30 40 110 27 4 4 31
b q222 2 2 4 100 30 70 114 29 5 1 28
b q322 3 4 7 100 30 70 121 33 0 1 31
Logic
Create 'finalt1' column by summing 't1' and 'cur_t1'
initially and then subtracting 'de_t1' cumulatively and grouping by 'id' and 'date'
Doing
df['finalt1'] = df['cur_t1'].add(df.groupby('id')['t1'].cumsum())
I am still researching on how to subtract the 'de_t1' column cumulatively.
I can't test right now, but logically:
(df['cur_t1'].add(df.groupby('id')['t1'].cumsum())
.sub(df.groupby('id')['de_t1'].cumsum())
)
Of note, there was also this possibility to avoid grouping twice (it is calculating both cumsums at once and computing the difference), but it is actually slower:
df['cur_t1'].add(df.groupby('id')[['de_t1', 't1']].cumsum().diff(axis=1)['t1'])
I having a data csv file containing some data. In which i have some data within semi colons. In these semi colon there is some specific id numbers and i need to replace it with the specific location name.
Available data
24CFA4A-12L - GF Electrical corridor
Replacing data within semicolons of id number
1;1;35;;1/2/1/37 24CFA4A;;;0;;;
Files with data - https://gofile.io/d/bQDppz
Thank you if anyone have solution.
[![Main data to replaced after finding id number and replacing with location ][3]][3]
Supposing you have dataframes:
df1 = pd.read_excel("ID_list.xlsx", header=None)
df2 = pd.read_excel("location.xlsx", header=None)
df1:
0
0 1;1;27;;1/2/1/29 25BAB3D;;;0;;;
1 1;1;27;1;;;;0;;;
2 1;1;28;;1/2/1/30 290E6D2;;;0;;;
3 1;1;28;1;;;;0;;;
4 1;1;29;;1/2/1/31 28BA737;;;0;;;
5 1;1;29;1;;;;0;;;
6 1;1;30;;1/2/1/32 2717823;;;0;;;
7 1;1;30;1;;;;0;;;
8 1;1;31;;1/2/1/33 254DEAA;;;0;;;
9 1;1;31;1;;;;0;;;
10 1;1;32;;1/2/1/34 28AE041;;;0;;;
11 1;1;32;1;;;;0;;;
12 1;1;33;;1/2/1/35 254DE82;;;0;;;
13 1;1;33;1;;;;0;;;
14 1;1;34;;1/2/1/36 2539D70;;;0;;;
15 1;1;34;1;;;;0;;;
16 1;1;35;;1/2/1/37 24CFA4A;;;0;;;
17 1;1;35;1;;;;0;;;
18 1;1;36;;1/2/1/39 28F023E;;;0;;;
19 1;1;36;1;;;;0;;;
20 1;1;37;;1/2/1/40 2717831;;;0;;;
21 1;1;37;1;;;;0;;;
22 1;1;38;;1/2/1/41 2397D75;;;0;;;
23 1;1;38;1;;;;0;;;
24 1;1;39;;1/2/1/42 287844C;;;0;;;
25 1;1;39;1;;;;0;;;
26 1;1;40;;1/2/1/43 28784F0;;;0;;;
27 1;1;40;1;;;;0;;;
28 1;1;41;;1/2/1/44 2865B67;;;0;;;
29 1;1;41;1;;;;0;;;
30 1;1;42;;1/2/1/45 2865998;;;0;;;
31 1;1;42;1;;;;0;;;
32 1;1;43;;1/2/1/46 287852F;;;0;;;
33 1;1;43;1;;;;0;;;
34 1;1;44;;1/2/1/47 287AC43;;;0;;;
35 1;1;44;1;;;;0;;;
36 1;1;45;;1/2/1/48 287ACF8;;;0;;;
37 1;1;45;1;;;;0;;;
38 1;1;46;;1/2/1/49 2878586;;;0;;;
39 1;1;46;1;;;;0;;;
40 1;1;47;;1/2/1/50 2878474;;;0;;;
41 1;1;47;1;;;;0;;;
42 1;1;48;;1/2/1/51 2846315;;;0;;;
df2:
0 1
0 GF General Dining TC 254DEAA-02L
1 GF General Dining TC 2717823-26L
2 GF General Dining FC 28BA737-50L
3 GF Preparation FC 25BAB3D-10L
4 GF Preparation TC 290E6D2-01M
5 GF Hospital Kitchen FC 25BAB2F-10L
6 GF Hospital Kitchen TC 2906F5C-01M
7 GF Food Preparation FC 25F5723-10L
8 GF Food Preparation TC 29070D6-01M
9 GF KITCHEN Corridor 254DF5D-02L
Then:
df1 = df1[0].str.split(";", expand=True)
df1[4] = df1[4].apply(lambda x: v[-1] if (v := x.split()) else "")
df2[1] = df2[1].apply(lambda x: x.split("-")[0])
df1:
0 1 2 3 4 5 6 7 8 9 10
0 1 1 27 25BAB3D 0
1 1 1 27 1 0
2 1 1 28 290E6D2 0
3 1 1 28 1 0
4 1 1 29 28BA737 0
5 1 1 29 1 0
6 1 1 30 2717823 0
7 1 1 30 1 0
8 1 1 31 254DEAA 0
9 1 1 31 1 0
10 1 1 32 28AE041 0
11 1 1 32 1 0
12 1 1 33 254DE82 0
13 1 1 33 1 0
14 1 1 34 2539D70 0
15 1 1 34 1 0
16 1 1 35 24CFA4A 0
17 1 1 35 1 0
18 1 1 36 28F023E 0
19 1 1 36 1 0
20 1 1 37 2717831 0
21 1 1 37 1 0
22 1 1 38 2397D75 0
23 1 1 38 1 0
24 1 1 39 287844C 0
25 1 1 39 1 0
26 1 1 40 28784F0 0
27 1 1 40 1 0
28 1 1 41 2865B67 0
29 1 1 41 1 0
30 1 1 42 2865998 0
31 1 1 42 1 0
32 1 1 43 287852F 0
33 1 1 43 1 0
34 1 1 44 287AC43 0
35 1 1 44 1 0
36 1 1 45 287ACF8 0
37 1 1 45 1 0
38 1 1 46 2878586 0
39 1 1 46 1 0
40 1 1 47 2878474 0
41 1 1 47 1 0
42 1 1 48 2846315 0
df2:
0 1
0 GF General Dining TC 254DEAA
1 GF General Dining TC 2717823
2 GF General Dining FC 28BA737
3 GF Preparation FC 25BAB3D
4 GF Preparation TC 290E6D2
5 GF Hospital Kitchen FC 25BAB2F
6 GF Hospital Kitchen TC 2906F5C
7 GF Food Preparation FC 25F5723
8 GF Food Preparation TC 29070D6
9 GF KITCHEN Corridor 254DF5D
To replace the values:
m = dict(zip(df2[1], df2[0]))
df1[4] = df1[4].replace(m)
df1:
0 1 2 3 4 5 6 7 8 9 10
0 1 1 27 GF Preparation FC 0
1 1 1 27 1 0
2 1 1 28 GF Preparation TC 0
3 1 1 28 1 0
4 1 1 29 GF General Dining FC 0
5 1 1 29 1 0
6 1 1 30 GF General Dining TC 0
7 1 1 30 1 0
8 1 1 31 GF General Dining TC 0
9 1 1 31 1 0
10 1 1 32 28AE041 0
11 1 1 32 1 0
12 1 1 33 254DE82 0
13 1 1 33 1 0
14 1 1 34 2539D70 0
15 1 1 34 1 0
16 1 1 35 24CFA4A 0
17 1 1 35 1 0
18 1 1 36 28F023E 0
19 1 1 36 1 0
20 1 1 37 2717831 0
21 1 1 37 1 0
22 1 1 38 2397D75 0
23 1 1 38 1 0
24 1 1 39 287844C 0
25 1 1 39 1 0
26 1 1 40 28784F0 0
27 1 1 40 1 0
28 1 1 41 2865B67 0
29 1 1 41 1 0
30 1 1 42 2865998 0
31 1 1 42 1 0
32 1 1 43 287852F 0
33 1 1 43 1 0
34 1 1 44 287AC43 0
35 1 1 44 1 0
36 1 1 45 287ACF8 0
37 1 1 45 1 0
38 1 1 46 2878586 0
39 1 1 46 1 0
40 1 1 47 2878474 0
41 1 1 47 1 0
42 1 1 48 2846315 0
I have the following df1:
id period color size rate
1 01 red 12 30
1 02 red 12 30
2 01 blue 12 35
3 03 blue 12 35
4 01 blue 12 35
4 02 blue 12 35
5 01 pink 10 40
6 01 pink 10 40
I need to create an index column that is basically a group of color-size-rate and count the number of unique id(s) that have this combination. My output df after transformation should look like this df1:
id period color size rate index count
1 01 red 12 30 red-12-30 1
1 02 red 12 30 red-12-30 1
2 01 blue 12 35 blue-12-35 3
2 03 blue 12 35 blue-12-35 3
4 01 blue 12 35 blue-12-35 3
4 02 blue 12 35 blue-12-35 3
5 01 pink 10 40 pink-10-40 2
6 01 pink 10 40 pink-10-40 2
I am able to get the count but it is not counting "unique" ids but instead , just the number of occurrences.
1 01 red 12 30 red-12-30 2
1 02 red 12 30 red-12-30 2
2 01 blue 12 35 blue-12-35 4
2 03 blue 12 35 blue-12-35 4
4 01 blue 12 35 blue-12-35 4
4 02 blue 12 35 blue-12-35 4
This is wrong as it is not actually grouping by id to count unique ones.
Appreciate any pointers in this direction.
Adding an edit here as my requirement changed:
The count need to also be grouped by 'period' ie my final df should be:
index period count
red-12-30 01 1
red-12-30 02 1
blue-12-35 01 2
blue-12-35 03 1
blue-12-35 02 1
pink-10-40 01 2
Solution: from #anky:
When I try adding another groupby['period'], I am getting dimension mismatch error.
Thank you in advance.
You can try aggregating a join for creating the index column, then group on the same and get nunique using groupby+transform
idx = df[['color','size','rate']].astype(str).agg('-'.join,1)
out = df.assign(index=idx,count=df.groupby(idx)['id'].transform('nunique'))
print(out)
id period color size rate index count
0 1 1 red 12 30 red-12-30 1
1 1 2 red 12 30 red-12-30 1
2 2 1 blue 12 35 blue-12-35 3
3 3 3 blue 12 35 blue-12-35 3
4 4 1 blue 12 35 blue-12-35 3
5 4 2 blue 12 35 blue-12-35 3
6 5 1 pink 10 40 pink-10-40 2
7 6 1 pink 10 40 pink-10-40 2
I have a dataframe that has values of the different column numbers for another dataframe. Is there a way that I can just return the value from the other dataframe instead of just having the column index.
I basically want to match up the index between the Push and df dataframes. The values in the Push dataframe contain what column I want to return from the df dataframe.
Push dataframe:
0 1
0 1 2
1 0 3
2 0 3
3 1 3
4 0 2
df dataframe:
0 1 2 3 4
0 10 11 22 33 44
1 10 11 22 33 44
2 10 11 22 33 44
3 10 11 22 33 44
4 10 11 22 33 44
return:
0 1
0 11 22
1 10 33
2 10 33
3 11 33
4 10 22
You can do it with np.take ; However this function works on the flattened array. push must be shift like that :
In [285]: push1 = push.values+np.arange(0,25,5)[:,None]
In [229]: pd.DataFrame(df.values.take(push1))
EDIT
No, I just reinvent np.choose :
In [24]: df
Out[24]:
0 1 2 3 4
0 0 1 2 3 4
1 10 11 12 13 14
2 20 21 22 23 24
3 30 31 32 33 34
4 40 41 42 43 44
In [25]: push
Out[25]:
0 1
0 1 2
1 0 3
2 0 3
3 1 3
4 0 2
In [27]: np.choose(push.T,df).T
Out[27]:
0 1
0 1 2
1 10 13
2 20 23
3 31 33
4 40 42
We using melt then replace notice (df1 is your push , df2 is your df)
df1.astype(str).replace(df2.melt().drop_duplicates().set_index('variable').value.to_dict())
Out[31]:
0 1
0 11 22
1 10 33
2 10 33
3 11 33
4 10 22