Sort pandas dataframe column based on substring - python
I have a pandas dataframe, as shown below:
Timestamp_Start Event_ID Duration
555.54944 Fix_1 0.42248
559.07281 Fix_10 0.01996
559.14642 Fix_11 0
556.03192 Fix_2 0.16113
556.27985 Fix_3 0.24188
556.56097 Fix_4 0.04987
556.65497 Fix_5 0.10748
556.80859 Fix_6 0.75708
557.57983 Fix_7 0.11329
557.75348 Fix_8 0.65643
558.43665 Fix_9 0.27447
555.97925 Sac_1 0.04577
559.09961 Sac_10 0.0404
559.15302 Sac_11 0.00726
556.19916 Sac_2 0.07403
556.52747 Sac_3 0.02789
556.61865 Sac_4 0.02985
556.76849 Sac_5 0.0337
557.57294 Sac_6 0
557.69965 Sac_7 0.04687
558.41632 Sac_8 0.01325
558.71796 Sac_9 0.34552
I want to sort by the 'Event_ID' column so that Fix_1, Fix_2, Fix_3, ... and Sac_1, Sac_2, Sac_3, ... appear in order, like below:
Timestamp_Start Event_ID Duration
555.54944 Fix_1 0.42248
556.03192 Fix_2 0.16113
556.27985 Fix_3 0.24188
556.56097 Fix_4 0.04987
556.65497 Fix_5 0.10748
556.80859 Fix_6 0.75708
557.57983 Fix_7 0.11329
557.75348 Fix_8 0.65643
558.43665 Fix_9 0.27447
559.07281 Fix_10 0.01996
559.14642 Fix_11 0
555.97925 Sac_1 0.04577
556.19916 Sac_2 0.07403
556.52747 Sac_3 0.02789
556.61865 Sac_4 0.02985
556.76849 Sac_5 0.0337
557.57294 Sac_6 0
557.69965 Sac_7 0.04687
558.41632 Sac_8 0.01325
558.71796 Sac_9 0.34552
559.09961 Sac_10 0.0404
559.15302 Sac_11 0.00726
Any ideas on how to do that? Thanks for your help.
One way using distutils.version:
import numpy as np
from distutils.version import LooseVersion

# LooseVersion splits each string into text and numeric parts, so the
# numeric suffix compares as an integer (Fix_2 < Fix_10).
f = np.vectorize(LooseVersion)
new_df = df.sort_values("Event_ID", key=f)  # the key argument needs pandas >= 1.1
print(new_df)
Output:
Timestamp_Start Event_ID Duration
0 555.54944 Fix_1 0.42248
3 556.03192 Fix_2 0.16113
4 556.27985 Fix_3 0.24188
5 556.56097 Fix_4 0.04987
6 556.65497 Fix_5 0.10748
7 556.80859 Fix_6 0.75708
8 557.57983 Fix_7 0.11329
9 557.75348 Fix_8 0.65643
10 558.43665 Fix_9 0.27447
1 559.07281 Fix_10 0.01996
2 559.14642 Fix_11 0.00000
11 555.97925 Sac_1 0.04577
14 556.19916 Sac_2 0.07403
15 556.52747 Sac_3 0.02789
16 556.61865 Sac_4 0.02985
17 556.76849 Sac_5 0.03370
18 557.57294 Sac_6 0.00000
19 557.69965 Sac_7 0.04687
20 558.41632 Sac_8 0.01325
21 558.71796 Sac_9 0.34552
12 559.09961 Sac_10 0.04040
13 559.15302 Sac_11 0.00726
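Note that distutils is deprecated and was removed from the standard library in Python 3.12, so on newer interpreters the same natural ordering can come from a plain key function. A minimal sketch, assuming pandas >= 1.1 (for the key argument) and that every Event_ID has the form prefix_number:

# Zero-pad the numeric suffix so lexicographic order matches numeric order,
# e.g. "Fix_2" -> "Fix_00002" sorts before "Fix_10" -> "Fix_00010".
new_df = df.sort_values(
    "Event_ID",
    key=lambda s: s.str.replace(r"\d+", lambda m: m.group().zfill(5), regex=True),
)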
Normal sorting on the dataframe will not work, because the integer in the string needs to be treated as an int value.
It can be done with a little extra space, though.
You can make two helper columns like this:
df['event'] = df.Event_ID.str.rsplit("_").str[0]              # text prefix, e.g. "Fix" or "Sac"
df['idx'] = df.Event_ID.str.rsplit("_").str[-1].astype(int)   # numeric suffix as an integer
Now, sort on these two columns,
df.sort_values(['event', 'idx'])
Timestamp_Start Event_ID Duration idx event
0 555.54944 Fix_1 0.42248 1 Fix
3 556.03192 Fix_2 0.16113 2 Fix
4 556.27985 Fix_3 0.24188 3 Fix
5 556.56097 Fix_4 0.04987 4 Fix
6 556.65497 Fix_5 0.10748 5 Fix
7 556.80859 Fix_6 0.75708 6 Fix
8 557.57983 Fix_7 0.11329 7 Fix
9 557.75348 Fix_8 0.65643 8 Fix
10 558.43665 Fix_9 0.27447 9 Fix
1 559.07281 Fix_10 0.01996 10 Fix
2 559.14642 Fix_11 0.00000 11 Fix
11 555.97925 Sac_1 0.04577 1 Sac
14 556.19916 Sac_2 0.07403 2 Sac
15 556.52747 Sac_3 0.02789 3 Sac
16 556.61865 Sac_4 0.02985 4 Sac
17 556.76849 Sac_5 0.03370 5 Sac
18 557.57294 Sac_6 0.00000 6 Sac
19 557.69965 Sac_7 0.04687 7 Sac
20 558.41632 Sac_8 0.01325 8 Sac
21 558.71796 Sac_9 0.34552 9 Sac
12 559.09961 Sac_10 0.04040 10 Sac
13 559.15302 Sac_11 0.00726 11 Sac
You can reset_index and drop the extra columns as needed.
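For example, a minimal cleanup sketch (the result name is arbitrary):

sorted_df = (
    df.sort_values(['event', 'idx'])       # order by prefix, then numeric suffix
      .drop(columns=['event', 'idx'])      # discard the helper columns
      .reset_index(drop=True)              # renumber the rows 0..n-1
)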
Related
How to Group by the mean of specific columns in Python
In the dataframe below:

import pandas as pd
import numpy as np

df = {
    'Gen':['M','M','M','M','F','F','F','F','M','M','M','M','F','F','F','F'],
    'Site':['FRX','FX','FRX','FRX','FRX','FX','FRX','FX','FX','FX','FX','FRX','FRX','FRX','FRX','FRX'],
    'Type':['L','L','L','L','L','L','L','L','R','R','R','R','R','R','R','R'],
    'AIC':['<1','<1','<1','<1',1,1,1,1,2,2,2,2,'>2','>2','>2','>2'],
    'AIC_TRX':[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
    'diff':[-1,-1,-1,-1,0,0,0,0,1,1,1,1,3,3,3,3],
    'series':[1,2,4,8,1,2,4,8,1,2,4,8,1,2,4,8],
    'Grwth_Time1':[150.78,162.34,188.53,197.69,208.07,217.76,229.48,139.51,146.87,182.54,189.57,199.97,229.28,244.73,269.91,249.19],
    'Grwth_Time2':[250.78,262.34,288.53,297.69,308.07,317.7,329.81,339.15,346.87,382.54,369.59,399.97,329.28,347.73,369.91,349.12],
    'Grwth_Time3':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
    'Grwth_Time4':[270.84,282.14,298.53,306.69,318.73,327.47,369.63,389.59,398.75,432.18,449.78,473.55,494.85,509.39,515.52,539.23],
    'Grwth_Time5':[25.78,22.34,28.53,27.69,30.07,17.7,29.81,33.15,34.87,32.54,36.59,39.97,29.28,34.73,36.91,34.12],
    'Grwth_Time6':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
    'Grwth_Time7':[27.84,28.14,29.53,30.69,18.73,27.47,36.63,38.59,38.75,24.18,24.78,21.55,13.85,9.39,15.52,39.23],
}
df = pd.DataFrame(df, columns=['Gen','Site','Type','AIC','AIC_TRX','diff','series','Grwth_Time1','Grwth_Time2','Grwth_Time3','Grwth_Time4','Grwth_Time5','Grwth_Time6','Grwth_Time7'])
df.info()

I want to do the following:

Find the average of each unique series per AIC_TRX for each Grwth_Time (Grwth_Time1, Grwth_Time2, ..., Grwth_Time7)
Export all the outputs as one xlsx file (refer to the figure below)

The desired outputs look like the figure below (note: the numbers in this output are not the actual average values, they were randomly generated).

My attempt:

# Select the columns -> AIC_TRX, series, Grwth_Time1, Grwth_Time2, ..., Grwth_Time7
df1 = df[['AIC_TRX', 'diff', 'series', 'Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3',
          'Grwth_Time4', 'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']]

# Below is where I need help: I want to groupby the 'series' and 'AIC_TRX' for all the Grwth_Time1 to 7
df1.groupby('series').Grwth_Time1.agg(['mean'])

Thanks in advance.
You have to groupby two columns, ['series', 'AIC_TRX'], and find the mean of each Grwth_Time:

df.groupby(['series', 'AIC_TRX'])[
    ['Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3', 'Grwth_Time4',
     'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']
].mean().unstack().to_excel("output.xlsx")

Output:

AIC_TRX       1       2       3       4
series
1        150.78  208.07  146.87  229.28
2        162.34  217.76  182.54  244.73
4        188.53  229.48  189.57  269.91
8        197.69  139.51  199.97  249.19

AIC_TRX       1       2       3       4
series
1        250.78  308.07  346.87  329.28
2        262.34  317.70  382.54  347.73
4        288.53  329.81  369.59  369.91
8        297.69  339.15  399.97  349.12

AIC_TRX       1       2       3       4
series
1        240.18  338.07  365.87  429.08
2        232.14  307.74  392.48  448.39
4        258.53  359.16  399.97  465.15
8        276.69  339.25  410.75  469.33

AIC_TRX       1       2       3       4
series
1        270.84  318.73  398.75  494.85
2        282.14  327.47  432.18  509.39
4        298.53  369.63  449.78  515.52
8        306.69  389.59  473.55  539.23

AIC_TRX       1       2       3       4
series
1         25.78   30.07   34.87   29.28
2         22.34   17.70   32.54   34.73
4         28.53   29.81   36.59   36.91
8         27.69   33.15   39.97   34.12

AIC_TRX       1       2       3       4
series
1        240.18  338.07  365.87  429.08
2        232.14  307.74  392.48  448.39
4        258.53  359.16  399.97  465.15
8        276.69  339.25  410.75  469.33

AIC_TRX       1       2       3       4
series
1         27.84   18.73   38.75   13.85
2         28.14   27.47   24.18    9.39
4         29.53   36.63   24.78   15.52
8         30.69   38.59   21.55   39.23
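If each Grwth_Time is wanted as its own table, as in the desired figure, a hedged variant writes one sheet per column via ExcelWriter; the file name and sheet layout here are assumptions, not from the original answer:

import pandas as pd

grwth_cols = ['Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3', 'Grwth_Time4',
              'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']

# Mean of every Grwth_Time column per (series, AIC_TRX) pair.
means = df.groupby(['series', 'AIC_TRX'])[grwth_cols].mean()

# One sheet per Grwth_Time, each reshaped to series rows x AIC_TRX columns.
with pd.ExcelWriter("grwth_means.xlsx") as writer:
    for col in grwth_cols:
        means[col].unstack('AIC_TRX').to_excel(writer, sheet_name=col)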
Just use the df.apply method to average across each column based on series and AIC_TRX grouping:

result = df1.groupby(['series', 'AIC_TRX']).apply(np.mean, axis=1)

Result:

series  AIC_TRX
1       1        0     120.738
        2        4     156.281
        3        8     170.285
        4        12    196.270
2       1        1     122.358
        2        5     152.758
        3        9     184.494
        4        13    205.175
4       1        2     135.471
        2        6     171.968
        3        10    187.825
        4        14    214.907
8       1        3     142.183
        2        7     162.849
        3        11    196.851
        4        15    216.455
dtype: float64
How to add another category in a DataFrame in python/pandas including only missing values?
I have a dataframe with two columns, 'TotalCharges' and 'Churn', with 7043 rows. In 11 cells of the 'TotalCharges' column I have a missing value. What I want is to create 10 categories of TotalCharges plus one category called "MissingValues", but I can't find a way to do it. My DataFrame looks like this:

    TotalCharges Churn
0          29.85    No
1         1889.5    No
2         108.15   Yes
3        1840.75    No
4         151.65   Yes
5          820.5   Yes
6         1949.4    No
7          301.9    No
8        3046.05   Yes
9        3487.95    No
10        587.45    No
11         326.8    No
12        5681.1    No
13        5036.3   Yes
14       2686.05    No
15       7895.15    No
16       missing    No
17       7382.25    No
18        528.35   Yes
...          ...   ...

and I want to get something like this:

    TotalCharges Churn TotalChargesCategories
0          29.85    No        (18.799, 84.61]
1         1889.5    No      (947.38, 1400.55]
2         108.15   Yes        (84.61, 267.37]
3        1840.75    No     (1400.55, 2065.52]
4         151.65   Yes        (84.61, 267.37]
5          820.5   Yes       (552.82, 947.38]
6         1949.4    No     (1400.55, 2065.52]
7          301.9    No       (267.37, 552.82]
8        3046.05   Yes     (2065.52, 3132.75]
9        3487.95    No     (3132.75, 4471.44]
10        587.45    No       (552.82, 947.38]
11         326.8    No       (267.37, 552.82]
12        5681.1    No     (4471.44, 5973.69]
13        5036.3   Yes     (4471.44, 5973.69]
14       2686.05    No     (2065.52, 3132.75]
15       7895.15    No      (5973.69, 8684.8]
16       missing    No          MissingValues
17       7382.25    No      (5973.69, 8684.8]
18        528.35   Yes       (267.37, 552.82]
...          ...   ...                    ...

If there weren't missing values, it would be easy with this code:

width_bin = (pd.qcut(df.TotalCharges, 10))
df = df.assign(TotalChargesCat=width_bin)
df

but since there are 11 missing values I have problems creating the categories, and this code leads to the error message:

TypeError: unsupported operand type(s) for -: 'str' and 'str'
Simply force the missing values to NaN (either by explicit replacement or by forcing to numeric dtype), and then use cut as you had:

df['TotalChargesCategories'] = pd.cut(pd.to_numeric(df['TotalCharges'], errors='coerce'), 10)

>>> df
    TotalCharges Churn TotalChargesCategories
0          29.85    No       (21.985, 816.38]
1         1889.5    No     (1602.91, 2389.44]
2         108.15   Yes       (21.985, 816.38]
3        1840.75    No     (1602.91, 2389.44]
4         151.65   Yes       (21.985, 816.38]
5          820.5   Yes      (816.38, 1602.91]
6         1949.4    No     (1602.91, 2389.44]
7          301.9    No       (21.985, 816.38]
8        3046.05   Yes     (2389.44, 3175.97]
9        3487.95    No      (3175.97, 3962.5]
10        587.45    No       (21.985, 816.38]
11         326.8    No       (21.985, 816.38]
12        5681.1    No     (5535.56, 6322.09]
13        5036.3   Yes     (4749.03, 5535.56]
14       2686.05    No     (2389.44, 3175.97]
15       7895.15    No     (7108.62, 7895.15]
16       missing    No                    NaN
17       7382.25    No     (7108.62, 7895.15]
18        528.35   Yes       (21.985, 816.38]
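If the missing rows should carry an explicit "MissingValues" label rather than NaN, as the question asks, a minimal sketch keeping the questioner's qcut binning (only the category name comes from the question; the rest is an assumption):

import pandas as pd

charges = pd.to_numeric(df['TotalCharges'], errors='coerce')   # 'missing' -> NaN
bins = pd.qcut(charges, 10)                                    # NaN rows stay NaN
df['TotalChargesCategories'] = (
    bins.cat.add_categories('MissingValues')                   # allow the extra label
        .fillna('MissingValues')                               # assign it to the missing rows
)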
Scale values of a particular column of python dataframe between 1-10
I have a dataframe which contains YouTube video views, and I want to scale these values into the range 1-10. Below is a sample of what the values look like. How do I normalize them to the range of 1-10, or is there a more efficient way to do this?

   rating
  4394029
   274358
   473691
   282858
   703750
   255967
  3298456
   136643
   796896
     2932
   220661
    48688
  4661584
  2526119
   332176
  7189818
   322896
   188162
   157437
  1153128
   788310
  1307902
One possibility is performing a scaling with the max:

1 + df / df.max() * 9

       rating
0    6.500315
1    1.343433
2    1.592952
3    1.354073
4    1.880933
5    1.320412
6    5.128909
7    1.171046
8    1.997531
9    1.003670
10   1.276217
11   1.060946
12   6.835232
13   4.162121
14   1.415808
15  10.000000
16   1.404192
17   1.235536
18   1.197075
19   2.443451
20   1.986783
21   2.637193

A similar solution by Wen (now deleted), using min-max scaling:

1 + (df - df.min()) * 9 / (df.max() - df.min())

       rating
0    6.498887
1    1.339902
2    1.589522
3    1.350546
4    1.877621
5    1.316871
6    5.126922
7    1.167444
8    1.994266
9    1.000000
10   1.272658
11   1.057299
12   6.833941
13   4.159739
14   1.412306
15  10.000000
16   1.400685
17   1.231960
18   1.193484
19   2.440368
20   1.983514
21   2.634189
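Wrapped as a small helper, a sketch of the min-max variant (the function name and default endpoints are illustrative, not from the answer; it assumes the column has more than one distinct value):

def scale_to_range(s, lo=1.0, hi=10.0):
    """Linearly map a Series so its minimum becomes lo and its maximum hi."""
    return lo + (s - s.min()) * (hi - lo) / (s.max() - s.min())

df['rating_scaled'] = scale_to_range(df['rating'])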
Break Existing Dataframe Apart Based on Multi Index
I have an existing dataframe that is sorted like this:

In [3]: result_GB_daily_average
Out[3]:
                 NREL      Avert
Month Day
1     1     14.718417  37.250000
      2     40.381167  45.250000
      3     42.512646  40.666667
      4     12.166896  31.583333
      5     14.583208  50.416667
      6     34.238000  45.333333
      7     45.581229  29.125000
      8     60.548479  27.916667
      9     48.061583  34.041667
      10    20.606958  37.583333
      11     5.418833  70.833333
      12    51.261375  43.208333
      13    21.796771  42.541667
      14    27.118979  41.958333
      15     8.230542  43.625000
      16    14.233958  48.708333
      17    28.345875  51.125000
      18    43.896375  55.500000
      19    95.800542  44.500000
      20    53.763104  39.958333
      21    26.171437  50.958333
      22    20.372688  66.916667
      23    20.594042  42.541667
      24    16.889083  48.083333
      25    16.416479  42.125000
      26    28.459625  40.125000
      27     1.055229  49.833333
      28    36.798792  42.791667
      29    27.260083  47.041667
      30    23.584917  55.750000
...               ...        ...
12    2     34.491604  55.916667
      3     26.444333  53.458333
      4     15.088333  45.000000
      5     10.213500  32.083333
      6     19.087688  17.000000
      7     23.078292  17.375000
      8     41.523667  29.458333
      9     17.173854  37.833333
      10    11.488687  52.541667
      11    15.203479  30.000000
      12     8.390917  37.666667
      13    70.067062  23.458333
      14    24.281729  25.583333
      15    31.826104  33.458333
      16     5.085271  42.916667
      17     3.778229  46.916667
      18    31.276958  57.625000
      19     7.399458  46.916667
      20    18.531958  39.291667
      21    26.831937  35.958333
      22    55.514000  32.375000
      23    24.018875  34.041667
      24    54.454125  43.083333
      25    57.379812  25.250000
      26    94.520833  33.958333
      27    49.693854  27.500000
      28     2.406438  46.916667
      29     7.133833  53.916667
      30     7.829167  51.500000
      31     5.584646  55.791667

I would like to split this dataframe into 12 different dataframes, one for each month, but the problem is that they are all slightly different lengths, because the number of days in a month varies, so attempts at using np.array_split have failed. How can I split this based on the Month index?
One solution:

df = result_GB_daily_average
[df.iloc[df.index.get_level_values('Month') == i + 1] for i in range(12)]

or, shorter:

[df.ix[i] for i in range(12)]
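Since .ix was removed in pandas 1.0, df.loc[i + 1] plays the same role on current versions; alternatively, a groupby over the Month level builds all twelve frames in one pass. A sketch, assuming the ('Month', 'Day') MultiIndex shown in the question:

# Dictionary keyed by month number; each value is that month's frame indexed by Day.
monthly = {
    month: frame.droplevel('Month')
    for month, frame in result_GB_daily_average.groupby(level='Month')
}
# monthly[1] is January, monthly[12] is December.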
Nested if loop with DataFrame is very, very slow
I have 10 million rows to go through and it will take many hours to process; I must be doing something wrong. I converted the names of my df variables for ease in typing:

Close=df['Close']
eqId=df['eqId']
date=df['date']
IntDate=df['IntDate']
expiry=df['expiry']
delta=df['delta']
ivMid=df['ivMid']
conf=df['conf']

The below code works fine, just ungodly slow, any suggestions?

print(datetime.datetime.now().time())
for i in range(2, 1000):
    if delta[i] == 90:
        if delta[i-1] == 50:
            if delta[i-2] == 10:
                if expiry[i] == expiry[i-2]:
                    df.Skew[i] = ivMid[i] - ivMid[i-2]
print(datetime.datetime.now().time())

14:02:11.014396
14:02:13.834275

df.head(100)

        Close  eqId        date  IntDate  expiry  delta   ivMid    conf    Skew
0   37.380005     7  2008-01-02    39447       1     50  0.3850  0.8663
1   37.380005     7  2008-01-02    39447       1     90  0.5053  0.7876
2   36.960007     7  2008-01-03    39448       1     50  0.3915  0.8597
3   36.960007     7  2008-01-03    39448       1     90  0.5119  0.7438
4   35.179993     7  2008-01-04    39449       1     50  0.4055  0.8454
5   35.179993     7  2008-01-04    39449       1     90  0.5183  0.7736
6   33.899994     7  2008-01-07    39452       1     50  0.4464  0.8400
7   33.899994     7  2008-01-07    39452       1     90  0.5230  0.7514
8   31.250000     7  2008-01-08    39453       1     10  0.4453  0.7086
9   31.250000     7  2008-01-08    39453       1     50  0.4826  0.8246
10  31.250000     7  2008-01-08    39453       1     90  0.5668  0.6474  0.1215
11  30.830002     7  2008-01-09    39454       1     10  0.4716  0.7186
12  30.830002     7  2008-01-09    39454       1     50  0.4963  0.8479
13  30.830002     7  2008-01-09    39454       1     90  0.5735  0.6704  0.1019
14  31.460007     7  2008-01-10    39455       1     10  0.4254  0.6737
15  31.460007     7  2008-01-10    39455       1     50  0.4929  0.8218
16  31.460007     7  2008-01-10    39455       1     90  0.5902  0.6411  0.1648
17  30.699997     7  2008-01-11    39456       1     10  0.4868  0.7183
18  30.699997     7  2008-01-11    39456       1     50  0.4965  0.8411
19  30.639999     7  2008-01-14    39459       1     10  0.5117  0.7620
20  30.639999     7  2008-01-14    39459       1     50  0.4989  0.8804
21  30.639999     7  2008-01-14    39459       1     90  0.5887  0.6845  0.077
22  29.309998     7  2008-01-15    39460       1     10  0.4956  0.7363
23  29.309998     7  2008-01-15    39460       1     50  0.5054  0.8643
24  30.080002     7  2008-01-16    39461       1     10  0.4983  0.6646

At this rate it will take 7.77 hrs to process.
Basically, the whole point of numpy & pandas is to avoid loops like the plague and to do things in a vectorized way. As you noticed, without that, speed is gone.

Let's break your problem into steps.

The Conditions

Here, your first condition can be written like this:

df.delta == 90

(Note how this compares the entire column at once. This is much, much faster than your loop!)

The second one can be written like this (using shift):

df.delta.shift(1) == 50

The rest of your conditions are similar. Note that to combine conditions, you need to use parentheses. So the first two conditions, together, should be written as:

(df.delta == 90) & (df.delta.shift(1) == 50)

You should now be able to write an expression combining all your conditions. Let's call it cond, i.e.,

cond = (df.delta == 90) & (df.delta.shift(1) == 50) & ...

The Assignment

To assign things to a new column, use:

df['Skew'] = ...

We just need to figure out what to put on the right-hand side.

The Right-Hand Side

Since we have cond, we can write the right-hand side as:

np.where(cond, df.ivMid - df.ivMid.shift(2), 0)

What this says is: when the condition is true, take the second term; when it's not, take the third term (here I used 0, but use whatever you like).

By combining all of this, you should be able to write a very efficient version of your code, as in the sketch below.
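Put together, a hedged version of the whole replacement might look like the following; the column names are the questioner's, and NaN is used instead of 0 so that non-matching rows stay blank, which is an assumption about the desired output:

import numpy as np

cond = (
    (df.delta == 90)
    & (df.delta.shift(1) == 50)
    & (df.delta.shift(2) == 10)
    & (df.expiry == df.expiry.shift(2))
)
# Skew is the 90-delta ivMid minus the 10-delta ivMid two rows earlier.
df['Skew'] = np.where(cond, df.ivMid - df.ivMid.shift(2), np.nan)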