Rearrange dataframe values - python

Let's say I have the following dataframe:
ID stop x y z
0 202 9 20 27 4
1 202 2 23 24 13
2 1756 5 5 41 73
3 1756 3 7 42 72
4 1756 4 3 50 73
5 2153 14 121 12 6
6 2153 3 122.5 2 6
7 3276 1 54 33 -12
8 5609 9 -2 44 -32
9 5609 2 8 44 -32
10 5609 5 102 -23 16
I would like to renumber the ID values so that the smallest becomes 1, the second smallest becomes 2, and so on. For my example, I would get this:
ID stop x y z
0 1 9 20 27 4
1 1 2 23 24 13
2 2 5 5 41 73
3 2 3 7 42 72
4 2 4 3 50 73
5 3 14 121 12 6
6 3 3 122.5 2 6
7 4 1 54 33 -12
8 5 9 -2 44 -32
9 5 2 8 44 -32
10 5 5 102 -23 16
Any idea please?
Thanks in advance!

You can use pd.Series.rank with method='dense'
df['ID'] = df['ID'].rank(method='dense').astype(int)
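For reference, here is a small self-contained sketch of that one-liner on the ID column from the question. It also shows pd.factorize with sort=True as a possible alternative that yields the same numbering; the variable names dense and codes are only illustrative.

```python
import pandas as pd

# ID column copied from the question
df = pd.DataFrame({"ID": [202, 202, 1756, 1756, 1756, 2153, 2153, 3276, 5609, 5609, 5609]})

# Dense ranking: equal IDs share a rank and ranks have no gaps
dense = df["ID"].rank(method="dense").astype(int)

# Alternative: factorize with sort=True numbers the sorted unique values from 0
codes = pd.factorize(df["ID"], sort=True)[0] + 1

print(dense.tolist())
print(codes.tolist())
```

Both give 1 for 202, 2 for 1756, and so on up to 5 for 5609.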

Related

Categorise hour into four different slots of 15 mins

I am working on a dataframe and I want to group the data for an hour into 4 different slots of 15 mins,
0-15 - 1st slot
15-30 - 2nd slot
30-45 - 3rd slot
45-00(or 60) - 4th slot
I can't work out how to go forward with this.
I tried extracting the hours, minutes and seconds from the time, but what should I do next?
Use integer division by 15 and then add 1:
df = pd.DataFrame({'M': range(60)})
df['slot'] = df['M'] // 15 + 1
print(df)
M slot
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 1
12 12 1
13 13 1
14 14 1
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
20 20 2
21 21 2
22 22 2
23 23 2
24 24 2
25 25 2
26 26 2
27 27 2
28 28 2
29 29 2
30 30 3
31 31 3
32 32 3
33 33 3
34 34 3
35 35 3
36 36 3
37 37 3
38 38 3
39 39 3
40 40 3
41 41 3
42 42 3
43 43 3
44 44 3
45 45 4
46 46 4
47 47 4
48 48 4
49 49 4
50 50 4
51 51 4
52 52 4
53 53 4
54 54 4
55 55 4
56 56 4
57 57 4
58 58 4
59 59 4
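If the minutes are still inside a datetime column, the same integer division can probably be applied directly through the .dt accessor. A minimal sketch, assuming a datetime column named time (the column name and the sample timestamps are made up for illustration):

```python
import pandas as pd

# Hypothetical timestamps; "time" is an assumed column name
df = pd.DataFrame({"time": pd.to_datetime([
    "2021-01-01 10:00:00", "2021-01-01 10:14:59",
    "2021-01-01 10:15:00", "2021-01-01 10:44:10",
    "2021-01-01 10:59:59",
])})

# Minute 0-14 -> slot 1, 15-29 -> slot 2, 30-44 -> slot 3, 45-59 -> slot 4
df["slot"] = df["time"].dt.minute // 15 + 1
print(df)
```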

Replace lines in pandas

I am using pandas in Python to create a new frame from two existing frames.
The first frame (called frame1) consists of the following rows:
A B C D E
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
11 11 11 11 11
12 12 12 12 12
13 13 13 13 13
14 14 14 14 14
15 15 15 15 15
The second frame (called frame2) is:
A B C D E
19 19 19 19 19
24 24 24 24 24
29 29 29 29 29
34 34 34 34 34
39 39 39 39 39
44 44 44 44 44
49 49 49 49 49
54 54 54 54 54
59 59 59 59 59
64 64 64 64 64
69 69 69 69 69
74 74 74 74 74
79 79 79 79 79
84 84 84 84 84
89 89 89 89 89
94 94 94 94 94
99 99 99 99 99
Now I want to create a new dataset with this logic: starting from frame1, replace every 5th row, until the end of frame1, with a random row of frame2 (and remove the used row from frame2). A possible output would be:
A B C D E
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
59 59 59 59 59
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
29 29 29 29 29
11 11 11 11 11
12 12 12 12 12
13 13 13 13 13
14 14 14 14 14
84 84 84 84 84
How can I do this?
It's quite simple:
frame1.loc[4::5] = frame2.sample(frac=1).reset_index(drop=True)
where
df.loc[4::5] selects every fifth row of df, starting with the fifth one, and
df.sample(frac=1).reset_index(drop=True) shuffles the rows of df randomly and gives them a fresh 0..n-1 index
One way is to first obtain the positions to update (we could also slice-assign, but we'd have the problem of the end not being included), and then assign back a sample from df2 of the corresponding size:
import numpy as np
ix = np.flatnonzero(np.diff(np.arange(df1.shape[0]+1)//5))
df1.iloc[ix] = df2.sample(df1.shape[0]//5).to_numpy()
print(df1)
A B C D E
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 4 4 4
4 84 84 84 84 84
5 6 6 6 6 6
6 7 7 7 7 7
7 8 8 8 8 8
8 9 9 9 9 9
9 89 89 89 89 89
10 11 11 11 11 11
11 12 12 12 12 12
12 13 13 13 13 13
13 14 14 14 14 14
14 99 99 99 99 99
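Neither snippet above removes the used rows from frame2, which the question also asked for. A minimal sketch of one way to do both steps, rebuilding the two frames from the question; random_state is only there to make the run repeatable, and positions and picked are illustrative names:

```python
import numpy as np
import pandas as pd

# Frames rebuilt from the question: 1..15 and 19, 24, ..., 99 in every column
frame1 = pd.DataFrame({c: range(1, 16) for c in "ABCDE"})
frame2 = pd.DataFrame({c: range(19, 100, 5) for c in "ABCDE"})

# Positions of every fifth row: 4, 9, 14
positions = np.arange(4, len(frame1), 5)

# Draw that many distinct rows from frame2 (sample draws without replacement by default)
picked = frame2.sample(n=len(positions), random_state=0)

# Assign the raw values so that row labels are not aligned
frame1.iloc[positions] = picked.to_numpy()

# Remove the rows that were used from frame2
frame2 = frame2.drop(picked.index)

print(frame1)
print(len(frame2))
```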

assign a number id for every 4 rows in pandas dataframe

I have a pandas dataframe like this:
df = pd.DataFrame({'week': ['2019-w01', '2019-w02','2019-w03','2019-w04',
                            '2019-w05','2019-w06','2019-w07','2019-w08',
                            '2019-w9','2019-w10','2019-w11','2019-w12'],
                   'value': [11,22,33,34,57,88,2,9,10,1,76,14]})
week value
0 2019-w01 11
1 2019-w02 22
2 2019-w03 33
3 2019-w04 34
4 2019-w05 57
5 2019-w06 88
6 2019-w07 2
7 2019-w08 9
8 2019-w9 10
9 2019-w10 1
10 2019-w11 76
11 2019-w12 14
What I need is shown below: I would like to assign a period ID to every 4-week interval.
week value period
0 2019-w01 11 1
1 2019-w02 22 1
2 2019-w03 33 1
3 2019-w04 34 1
4 2019-w05 57 2
5 2019-w06 88 2
6 2019-w07 2 2
7 2019-w08 9 2
8 2019-w9 10 3
9 2019-w10 1 3
10 2019-w11 76 3
11 2019-w12 14 3
What is the best way to achieve that? Thanks.
Try:
df['period'] = (pd.to_numeric(df['week'].str.split('-').str[-1]
                                .str.replace('w', '')) // 4).shift(fill_value=0).add(1)
print(df)
week value period
0 2019-w01 11 1
1 2019-w02 22 1
2 2019-w03 33 1
3 2019-w04 34 1
4 2019-w05 57 2
5 2019-w06 88 2
6 2019-w07 2 2
7 2019-w08 9 2
8 2019-w9 10 3
9 2019-w10 1 3
10 2019-w11 76 3
11 2019-w12 14 3
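The shift in the snippet above relies on the rows being consecutive weeks in order. As a sketch of two variants that avoid the shift (week_no and period_by_position are illustrative names, not from the original answer): the first computes the period from the week number itself, the second ignores the labels and simply counts rows in blocks of four.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "week": ["2019-w01", "2019-w02", "2019-w03", "2019-w04",
             "2019-w05", "2019-w06", "2019-w07", "2019-w08",
             "2019-w9", "2019-w10", "2019-w11", "2019-w12"],
    "value": [11, 22, 33, 34, 57, 88, 2, 9, 10, 1, 76, 14],
})

# Option 1: weeks 1-4 -> period 1, weeks 5-8 -> period 2, ...
week_no = df["week"].str.extract(r"w(\d+)$", expand=False).astype(int)
df["period"] = (week_no - 1) // 4 + 1

# Option 2: a new period every four rows, regardless of the week labels
df["period_by_position"] = np.arange(len(df)) // 4 + 1

print(df)
```

Option 1 keeps working if rows are missing or reordered; option 2 is only appropriate when every row really is one consecutive week.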

How to do a GroupBy without any mean for other features?

I'm working with pandas DataFrames in Python. I want to group my data by category without computing a mean or median of the other features (PriceBucket, success_rate and products_by_number). My DataFrame looks like this:
PriceBucket success_rate products_by_number category
0 0 6.890 149837 10
1 1 7.240 105447 10
2 2 7.710 145295 10
3 3 8.090 181323 10
4 4 8.930 57187 10
5 5 8.110 133449 10
6 6 7.920 142858 10
7 7 8.230 115109 10
8 8 8.510 121930 10
9 9 8.340 122510 10
10 0 10.520 28105 20
11 1 9.770 27494 20
12 2 10.080 26758 20
13 3 10.180 29973 20
14 4 9.860 29175 20
15 5 9.950 23807 20
16 6 9.550 30520 20
17 7 9.550 23653 20
18 8 8.990 27514 20
19 9 6.710 26152 20
20 0 11.060 39538 60
21 1 10.740 34479 60
22 2 10.700 36133 60
23 3 10.900 34220 60
24 4 11.290 46001 60
25 5 11.130 26705 60
26 6 11.040 37258 60
27 7 11.150 34561 60
28 8 10.845 35495 60
29 9 10.220 35434 60
30 0 8.380 34134 90
31 1 7.920 32160 90
32 2 8.170 29500 90
33 3 8.270 31688 90
34 4 8.395 38977 90
35 5 8.620 27130 90
36 6 8.440 31007 90
37 7 8.570 31005 90
38 8 8.170 32659 90
39 9 7.290 30227 90
And this is exactly what I want:
PriceBucket success_rate products_by_number
category
10 0 6.890 149837
1 7.240 105447
2 7.710 145295
3 8.090 181323
4 8.930 57187
5 8.110 133449
6 7.920 142858
7 8.230 115109
8 8.510 121930
9 8.340 122510
20 0 10.520 28105
1 9.770 27494
2 10.080 26758
3 10.180 29973
4 9.860 29175
5 9.950 23807
6 9.550 30520
7 9.550 23653
8 8.990 27514
9 6.710 26152
60 0 11.060 39538
1 10.740 34479
2 10.700 36133
3 10.900 34220
4 11.290 46001
5 11.130 26705
6 11.040 37258
7 11.150 34561
8 10.845 35495
9 10.220 35434
90 0 8.380 34134
1 7.920 32160
2 8.170 29500
3 8.270 31688
4 8.395 38977
5 8.620 27130
6 8.440 31007
7 8.570 31005
8 8.170 32659
9 7.290 30227
What should I do? Many thanks.
Assuming your dataframe is df, you want:
print(df.set_index(['category', 'PriceBucket']))
success_rate products_by_number
category PriceBucket
10 0 6.890 149837
1 7.240 105447
2 7.710 145295
3 8.090 181323
4 8.930 57187
5 8.110 133449
6 7.920 142858
7 8.230 115109
8 8.510 121930
9 8.340 122510
20 0 10.520 28105
1 9.770 27494
2 10.080 26758
3 10.180 29973
4 9.860 29175
5 9.950 23807
6 9.550 30520
7 9.550 23653
8 8.990 27514
9 6.710 26152
60 0 11.060 39538
1 10.740 34479
2 10.700 36133
3 10.900 34220
4 11.290 46001
5 11.130 26705
6 11.040 37258
7 11.150 34561
8 10.845 35495
9 10.220 35434
90 0 8.380 34134
1 7.920 32160
2 8.170 29500
3 8.270 31688
4 8.395 38977
5 8.620 27130
6 8.440 31007
7 8.570 31005
8 8.170 32659
9 7.290 30227
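To show what the resulting MultiIndex gives you, here is a small sketch on the first two buckets of categories 10 and 20 from the question. The names indexed, cat_10, one_cell and per_group are illustrative; the last line is one possible way to get the per-category pieces without aggregating anything.

```python
import pandas as pd

# A four-row excerpt of the question's data
df = pd.DataFrame({
    "PriceBucket": [0, 1, 0, 1],
    "success_rate": [6.89, 7.24, 10.52, 9.77],
    "products_by_number": [149837, 105447, 28105, 27494],
    "category": [10, 10, 20, 20],
})

# Hierarchical index: category on the outer level, PriceBucket on the inner one
indexed = df.set_index(["category", "PriceBucket"])

cat_10 = indexed.loc[10]                          # all rows of category 10
one_cell = indexed.loc[(20, 1), "success_rate"]   # a single value

# Iterating over a groupby yields (key, sub-frame) pairs, no aggregation involved
per_group = {cat: group for cat, group in df.groupby("category")}

print(indexed)
print(cat_10)
print(one_cell)
```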

Pandas compare 2 dataframes by specific rows in all columns

I have the following Pandas dataframe of some raw numbers:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 10000)
col_raw_headers = ['07_08_19 #1','07_08_19 #2','07_08_19 #2.1','11_31_19 #1','11_31_19 #1.1','11_31_19 #1.3','12_15_20 #1','12_15_20 #2','12_15_20 #2.1','12_15_20 #2.2']
col_raw_trial_info = ['Quantity1','Quantity2','Quantity3','Quantity4','Quantity5','Quantity6','TimeStamp',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
cols_raw = [[1,75,9,7,-4,0.4,'07/08/2019 05:11'],[1,11,20,-17,12,0.8,'07/08/2019 10:54'],[2,0.9,17,102,56,0.6,'07/08/2019 21:04'],[1,70,4,75,0.8,0.4,'11/31/2019 11:15'],[2,60,74,41,-36,0.3,'11/31/2019 16:50'],[3,17,12,-89,30,0.1,'11/31/2019 21:33'],[1,6,34,496,-84,0.5,'12/15/2020 01:36'],[1,3,43,12,-23,0.5,'12/15/2020 07:01'],[2,5,92,17,64,0.5,'12/15/2020 11:15'],[3,7,11,62,-11,0.5,'12/15/2020 21:45']]
both_values = [[1,2,3,4,8,4,3,8,7],[6,5,3,7,3,23,27,3,11],[65,3,6,78,9,2,45,6,7],[4,3,6,8,3,5,66,32,84],[2,3,11,55,3,7,33,65,34],[22,1,6,32,5,6,4,3,898],[1,6,3,2,6,55,22,6,23],[34,37,46,918,0,37,91,12,68],[51,20,1,34,12,59,78,6,101],[12,71,34,94,1,73,46,51,21]]
processed_cols = ['c_1trial','14_1','14_2','8_1','8_2','8_3','28_1','24_1','24_2','24_3']
df_raw = pd.DataFrame(zip(*cols_raw))
df_temp = pd.DataFrame(zip(*both_values))
df_raw = pd.concat([df_raw,df_temp])
df_raw.columns=col_raw_headers
df_raw.insert(0,'Tr_id',col_raw_trial_info)
df_raw.reset_index(drop=True,inplace=True)
It looks like this:
Tr_id 07_08_19 #1 07_08_19 #2 07_08_19 #2.1 11_31_19 #1 11_31_19 #1.1 11_31_19 #1.3 12_15_20 #1 12_15_20 #2 12_15_20 #2.1 12_15_20 #2.2
0 Quantity1 1 1 2 1 2 3 1 1 2 3
1 Quantity2 75 11 0.9 70 60 17 6 3 5 7
2 Quantity3 9 20 17 4 74 12 34 43 92 11
3 Quantity4 7 -17 102 75 41 -89 496 12 17 62
4 Quantity5 -4 12 56 0.8 -36 30 -84 -23 64 -11
5 Quantity6 0.4 0.8 0.6 0.4 0.3 0.1 0.5 0.5 0.5 0.5
6 TimeStamp 07/08/2019 05:11 07/08/2019 10:54 07/08/2019 21:04 11/31/2019 11:15 11/31/2019 16:50 11/31/2019 21:33 12/15/2020 01:36 12/15/2020 07:01 12/15/2020 11:15 12/15/2020 21:45
7 NaN 1 6 65 4 2 22 1 34 51 12
8 NaN 2 5 3 3 3 1 6 37 20 71
9 NaN 3 3 6 6 11 6 3 46 1 34
10 NaN 4 7 78 8 55 32 2 918 34 94
11 NaN 8 3 9 3 3 5 6 0 12 1
12 NaN 4 23 2 5 7 6 55 37 59 73
13 NaN 3 27 45 66 33 4 22 91 78 46
14 NaN 8 3 6 32 65 3 6 12 6 51
15 NaN 7 11 7 84 34 898 23 68 101 21
I have a separate dataframe of a processed version of these numbers where:
some of the header rows from above have been deleted,
the column names have been changed
Here is the second dataframe:
df_processed = pd.DataFrame(zip(*both_values),columns=processed_cols)
df_processed = df_processed.iloc[:, [3,4,9,7,0,2,1,6,8,5]]
8_1 8_2 24_3 24_1 c_1trial 14_2 14_1 28_1 24_2 8_3
0 4 2 12 34 1 65 6 1 51 22
1 3 3 71 37 2 3 5 6 20 1
2 6 11 34 46 3 6 3 3 1 6
3 8 55 94 918 4 78 7 2 34 32
4 3 3 1 0 8 9 3 6 12 5
5 5 7 73 37 4 2 23 55 59 6
6 66 33 46 91 3 45 27 22 78 4
7 32 65 51 12 8 6 3 6 6 3
8 84 34 21 68 7 7 11 23 101 898
Common parts of each dataframe:
For each column, rows 8 onwards of the raw dataframe are the same as row 1 onwards from the processed dataframe. The order of columns in both dataframes is not the same.
Output combination:
I am looking to compare rows 8-16 in columns 1-10 of the raw dataframe df_raw to the processed dataframe df_processed. If the columns match each other, then I would like to extract rows 1-7 of df_raw and the column header from df_processed.
Example:
The values in column c_1trial match only the values in rows 8-16 of column 07_08_19 #1. I need 2 steps: (1) find some way to determine that these 2 columns match each other, and (2) if 2 columns do match each other, select rows from the matching columns for the sample output.
Here is the output I am looking to get:
Tr_id 07_08_19 #1 07_08_19 #2 07_08_19 #2.1 11_31_19 #1 11_31_19 #1.1 11_31_19 #1.3 12_15_20 #1 12_15_20 #2 12_15_20 #2.1 12_15_20 #2.2
Quantity1 1 1 2 1 2 3 1 1 2 3
Quantity2 75 11 0.9 70 60 17 6 3 5 7
Quantity3 9 20 17 4 74 12 34 43 92 11
Proc_Name c_1trial 14_1 14_2 8_1 8_2 8_3 28_1 24_1 24_2 24_3
Quantity4 7 -17 102 75 41 -89 496 12 17 62
Quantity5 -4 12 56 0.8 -36 30 -84 -23 64 -11
Quantity6 0.4 0.8 0.6 0.4 0.3 0.1 0.5 0.5 0.5 0.5
TimeStamp 07/08/2019 05:11 07/08/2019 10:54 07/08/2019 21:04 11/31/2019 11:15 11/31/2019 16:50 11/31/2019 21:33 12/15/2020 01:36 12/15/2020 07:01 12/15/2020 11:15 12/15/2020 21:45
My attempts are giving trouble:
print((df_raw.iloc[7:,1:] == df_processed).all(axis=1))
gives
ValueError: Can only compare identically-labeled DataFrame objects
and
print(df_raw.iloc[7:].values == df_processed.values) #gives False
gives
False
The problem with my second attempt is that I am not selecting .all(axis=1). When I make a comparison I want to do this across all rows of every column, not just one row.
Question:
Is there a way to select out the output I showed above from these 2 dataframes?
Does this look like the output you're looking for?
Raw dataframe df:
Tr_id 07_08_19 07_08_19.1 07_08_19.2 11_31_19 11_31_19.1
0 Quantity1 1 1 2 1 2
1 Quantity2 75 11 0.9 70 60
2 Quantity3 9 20 17 4 74
3 Quantity4 7 -17 102 75 41
4 Quantity5 -4 12 56 0.8 -36
5 Quantity6 0.4 0.8 0.6 0.4 0.3
6 TimeStamp 07/08/2019 07/08/2019 07/08/2019 11/31/2019 11/31/2019
7 NaN 1 6 65 4 2
8 NaN 2 5 3 3 3
9 NaN 3 3 6 6 11
10 NaN 4 7 78 8 55
11 NaN 8 3 9 3 3
12 NaN 4 23 2 5 7
13 NaN 3 27 45 66 33
14 NaN 8 3 6 32 65
15 NaN 7 11 7 84 34
11_31_19.2 12_15_20 12_15_20.1 12_15_20.2 12_15_20.3
0 3 1 1 2 3
1 17 6 3 5 7
2 12 34 43 92 11
3 -89 496 12 17 62
4 30 -84 -23 64 -11
5 0.1 0.5 0.5 0.5 0.5
6 11/31/2019 12/15/2020 12/15/2020 12/15/2020 12/15/2020
7 22 1 34 51 12
8 1 6 37 20 71
9 6 3 46 1 34
10 32 2 918 34 94
11 5 6 0 12 1
12 6 55 37 59 73
13 4 22 91 78 46
14 3 6 12 6 51
15 898 23 68 101 21
Processed dataframe dfp:
8_1 8_2 24_3 24_1 c_1trial 14_2 14_1 28_1 24_2 8_3
0 4 2 12 34 1 65 6 1 51 22
1 3 3 71 37 2 3 5 6 20 1
2 6 11 34 46 3 6 3 3 1 6
3 8 55 94 918 4 78 7 2 34 32
4 3 3 1 0 8 9 3 6 12 5
5 5 7 73 37 4 2 23 55 59 6
6 66 33 46 91 3 45 27 22 78 4
7 32 65 51 12 8 6 3 6 6 3
8 84 34 21 68 7 7 11 23 101 898
Code:
df = pd.read_csv('raw_df.csv')  # raw dataframe
dfp = pd.read_csv('processed_df.csv')  # processed dataframe
dfr = df.drop('Tr_id', axis=1)
tail = dfr.tail(9).astype(int)  # the numeric rows shared with dfp
x = pd.DataFrame()
for col_raw in dfr.columns:
    for col_p in dfp.columns:
        # compare the values only, since the two frames have different row labels
        if (tail[col_raw].to_numpy() == dfp[col_p].to_numpy()).all():
            series = dfr[col_raw].head(7).tolist()
            series.append(col_raw)
            x[col_p] = series
x = pd.concat([df['Tr_id'].head(7), x], axis=1)
Output:
Tr_id c_1trial 14_1 14_2 8_1 8_2
0 Quantity1 1 1 2 1 2
1 Quantity2 75 11 0.9 70 60
2 Quantity3 9 20 17 4 74
3 Quantity4 7 -17 102 75 41
4 Quantity5 -4 12 56 0.8 -36
5 Quantity6 0.4 0.8 0.6 0.4 0.3
6 TimeStamp 07/08/2019 07/08/2019 07/08/2019 11/31/2019 11/31/2019
7 NaN 07_08_19 07_08_19.1 07_08_19.2 11_31_19 11_31_19.1
8_3 28_1 24_1 24_2 24_3
0 3 1 1 2 3
1 17 6 3 5 7
2 12 34 43 92 11
3 -89 496 12 17 62
4 30 -84 -23 64 -11
5 0.1 0.5 0.5 0.5 0.5
6 11/31/2019 12/15/2020 12/15/2020 12/15/2020 12/15/2020
7 11_31_19.2 12_15_20 12_15_20.1 12_15_20.2 12_15_20.3
I think the code could be more concise but maybe this does the job.
Alternative solution, using the DataFrame.isin() method:
In [171]: df1
Out[171]:
a b c
0 1 1 3
1 0 2 4
2 4 2 2
3 0 3 3
4 0 4 4
In [172]: df2
Out[172]:
a b c
0 0 3 3
1 1 1 1
2 0 3 4
3 4 2 3
4 0 4 4
In [173]: common = pd.merge(df1, df2)
In [174]: common
Out[174]:
a b c
0 0 3 3
1 0 4 4
In [175]: df1[df1.isin(common.to_dict('list')).all(axis=1)]
Out[175]:
a b c
3 0 3 3
4 0 4 4
Or, if you want to subtract the second data set from the first one, i.e. the pandas equivalent of SQL's:
select col1, .., colN from tableA
minus
select col1, .., colN from tableB
in Pandas:
In [176]: df1[~df1.isin(common.to_dict('list')).all(axis=1)]
Out[176]:
a b c
0 1 1 3
1 0 2 4
2 4 2 2
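One caveat with the isin() approach: when it is given a dict, isin() checks each column independently, so a row whose values each occur somewhere in common, but never together in one row, would also be selected. A possible row-wise alternative is a left merge with indicator=True, sketched here on the same df1 and df2 (flagged, in_both and only_df1 are illustrative names):

```python
import pandas as pd

# Same frames as in the session above
df1 = pd.DataFrame({"a": [1, 0, 4, 0, 0], "b": [1, 2, 2, 3, 4], "c": [3, 4, 2, 3, 4]})
df2 = pd.DataFrame({"a": [0, 1, 0, 4, 0], "b": [3, 1, 3, 2, 4], "c": [3, 1, 4, 3, 4]})

# Left merge on all shared columns; the _merge column records where each row was found.
# drop_duplicates() keeps repeated rows in df2 from multiplying rows of df1.
flagged = df1.merge(df2.drop_duplicates(), how="left", indicator=True)

in_both = flagged.loc[flagged["_merge"] == "both", ["a", "b", "c"]]
only_df1 = flagged.loc[flagged["_merge"] == "left_only", ["a", "b", "c"]]

print(in_both)
print(only_df1)
```

Note that the merge result gets a fresh row index, so the labels are not guaranteed to match those of df1 in general.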
I came up with this using loops. It is very disappointing:
holder = []
for pp in list(df_processed):
    list1 = df_processed[pp].tolist()
    for rr in list(df_raw):
        list2 = df_raw.loc[7:, rr].tolist()
        if list1 == list2:
            holder.append([rr, pp])
df_intermediate = pd.DataFrame(holder, columns=['A', 'B'])
df_c = df_raw.loc[:6, df_intermediate.iloc[:, 0].tolist()]
df_c.loc[df_c.shape[0]] = df_intermediate.iloc[:, 1].tolist()
df_c.insert(0, list(df_raw)[0], df_raw[list(df_raw)[0]])
df_c.iloc[-1, 0] = 'Proc_Name'
df_c = df_c.reindex([0,1,2]+[7]+[3,4,5,6]).reset_index(drop=True)
Output:
Tr_id 11_31_19 #1 11_31_19 #1.1 12_15_20 #2.2 12_15_20 #2 07_08_19 #1 07_08_19 #2.1 07_08_19 #2 12_15_20 #1 12_15_20 #2.1 11_31_19 #1.3
0 Quantity1 1 2 3 1 1 2 1 1 2 3
1 Quantity2 70 60 7 3 75 0.9 11 6 5 17
2 Quantity3 4 74 11 43 9 17 20 34 92 12
3 Proc_Name 8_1 8_2 24_3 24_1 c_1trial 14_2 14_1 28_1 24_2 8_3
4 Quantity4 75 41 62 12 7 102 -17 496 17 -89
5 Quantity5 0.8 -36 -11 -23 -4 56 12 -84 64 30
6 Quantity6 0.4 0.3 0.5 0.5 0.4 0.6 0.8 0.5 0.5 0.1
7 TimeStamp 11/31/2019 11:15 11/31/2019 16:50 12/15/2020 21:45 12/15/2020 07:01 07/08/2019 05:11 07/08/2019 21:04 07/08/2019 10:54 12/15/2020 01:36 12/15/2020 11:15 11/31/2019 21:33
The order of the columns is different than what I required, but that is a minor problem.
The real problem with this approach is using loops.
I wish there was a better way to do this using some built-in Pandas functionality. If you have a better solution, please post it. thank you.
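One possible way to avoid the nested loops is to turn every processed column into a tuple of its values and look the raw columns up in a dictionary. This is only a sketch on toy stand-ins for df_raw and df_processed (the values, the column names run #1, run #2, a_1, b_1 and the helper names fingerprints, n_header and proc_names are made up for illustration). It also assumes that no two processed columns hold identical values, since the dictionary would keep only the last one.

```python
import pandas as pd

# Toy stand-ins: two header rows followed by three numeric rows
df_raw = pd.DataFrame({
    "Tr_id": ["Quantity1", "TimeStamp", None, None, None],
    "run #1": [1, "07/08/2019 05:11", 1, 2, 3],
    "run #2": [2, "07/08/2019 10:54", 6, 5, 3],
})
df_processed = pd.DataFrame({"b_1": [6, 5, 3], "a_1": [1, 2, 3]})

n_header = 2  # number of header rows at the top of df_raw

# Map each processed column's values to its name
fingerprints = {tuple(df_processed[c]): c for c in df_processed.columns}

# Look up the numeric tail of every raw column in that mapping
tail = df_raw.iloc[n_header:, 1:]
proc_names = {c: fingerprints.get(tuple(tail[c].astype(int))) for c in tail.columns}

# Keep the header rows and append the matched names as a new row
result = df_raw.iloc[:n_header].copy()
result.loc[len(result)] = ["Proc_Name"] + [proc_names[c] for c in tail.columns]

print(result)
```

The column order of df_raw is preserved, and the Proc_Name row can then be moved with reindex as in the loop version.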
