How to remove specific things from data in Python

I have data like this:
draft_round
0 1st round
1 3rd round
2 1st round
3 16th round
4 2nd round
... ...
4680 1st round
4681 NaN
4682 2nd round
4683 2nd round
4684 1947 BAA Draf
As you can see, each row contains a mix of words and numbers. What matters to me is the number in each row. For example, I want to get "1" from the row "1st round" and "16" from the row "16th round". In other words, I want the output to look like this:
draft_round
0 1
1 3
2 1
3 16
4 2
... ...
4680 1
4681 NaN
4682 2
4683 2
4684 1947 BAA Draf
I hope I was able to explain my problem, thanks in advance.

You can try .str.replace:
df["draft_round"] = df["draft_round"].str.replace(
r"(\d+).*round", r"\1", regex=True
)
print(df)
Prints:
draft_round
0 1
1 3
2 1
3 16
4 2
4680 1
4681 NaN
4682 2
4683 2
4684 1947 BAA Draf
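For anyone who wants to try this without the full dataset, here is a minimal self-contained sketch of the same replace call. The five sample rows are copied from the question; the small frame itself is only an illustration, not the asker's real data.

```python
import numpy as np
import pandas as pd

# Small illustrative frame built from a few rows shown in the question
df = pd.DataFrame(
    {"draft_round": ["1st round", "16th round", np.nan, "2nd round", "1947 BAA Draf"]}
)

# Keep the leading digits and drop everything up to and including "round".
# Rows that do not match the pattern (NaN, "1947 BAA Draf") are left unchanged.
df["draft_round"] = df["draft_round"].str.replace(
    r"(\d+).*round", r"\1", regex=True
)
print(df)
```

The extracted values are still strings. If numbers are needed, pd.to_numeric(df["draft_round"], errors="coerce") converts them and turns the non-numeric rows into NaN.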

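Try str.split:
df['draft_round'] = df['draft_round'].str.split(pat='[a-z]', expand=True)[0]
Note that this splits at the first lowercase letter, so a row like "1947 BAA Draf" is cut down to "1947 BAA D".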

Related

Fastest way to access a DataFrame cell by column values?

I have the following DataFrame:
time bk1_lvl0_id bk2_lvl0_id pr_ss order_upto_level initial_inventory leadtime1 leadtime2 adjusted_leadtime
0 2020 1000 3 16 18 17 3 0.100000 1
1 2020 10043 3 65 78 72 12 0.400000 1
2 2020 1005 3 0 1 1 9 0.300000 1
3 2020 1009 3 325 363 344 21 0.700000 1
4 2020 102 3 0 1 1 7 0.233333 1
I want a function that returns pr_ss for a given pair, for example (bk1_lvl0_id=1000, bk2_lvl0_id=3).
This is the code I've tried, but it is slow:
def get_safety_stock(df,bk1,bk2):
    ##a function that returns the safety stock for any given (bk1,bk2)
    for index,row in df.iterrows():
        if (row["bk1_lvl0_id"]==bk1) and (row["bk2_lvl0_id"]==bk2):
            return int(row["pr_ss"])
            break
If your DataFrame has no duplicate (bk1_lvl0_id, bk2_lvl0_id) pairs, you can write the function as follows:
def get_safety_stock(df, bk1, bk2):
    # The boolean mask selects the matching row; .iloc[0] takes the first match by position
    return df.loc[df.bk1_lvl0_id.eq(bk1) & df.bk2_lvl0_id.eq(bk2), 'pr_ss'].iloc[0]
Note that this accesses the first value in the resulting Series, which shouldn't be an issue if there are no duplicates in the data. It uses .iloc[0] rather than [0] because [0] looks up the index label 0 and fails for any matching row whose label is not 0. If you want all matches, just remove the .iloc[0] from the end and you get the whole Series. The function can be called as follows:
get_safety_stock(df, 1000,3)
>>>16
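If the lookup is called many times, one possible variation (not part of the answer above) is to build an indexed Series once and reuse it, so each call avoids scanning the whole frame. This is a sketch under the same assumption that every (bk1_lvl0_id, bk2_lvl0_id) pair is unique; the name ss_lookup and the reduced three-column frame are made up for illustration.

```python
import pandas as pd

# Reduced illustrative frame: only the three columns the lookup needs
df = pd.DataFrame({
    "bk1_lvl0_id": [1000, 10043, 1005, 1009, 102],
    "bk2_lvl0_id": [3, 3, 3, 3, 3],
    "pr_ss": [16, 65, 0, 325, 0],
})

# Build the lookup once (hypothetical name). Assumes unique (bk1, bk2) pairs.
ss_lookup = df.set_index(["bk1_lvl0_id", "bk2_lvl0_id"])["pr_ss"].sort_index()

def get_safety_stock(lookup, bk1, bk2):
    """Return the safety stock for one (bk1, bk2) pair from the prebuilt lookup."""
    return int(lookup.loc[(bk1, bk2)])

print(get_safety_stock(ss_lookup, 1000, 3))
```

A pair that is not in the index raises a KeyError, which may or may not be the behavior you want.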

How to Group by the mean of specific columns in Python

In the dataframe below:
import pandas as pd
import numpy as np
df= {
'Gen':['M','M','M','M','F','F','F','F','M','M','M','M','F','F','F','F'],
'Site':['FRX','FX','FRX','FRX','FRX','FX','FRX','FX','FX','FX','FX','FRX','FRX','FRX','FRX','FRX'],
'Type':['L','L','L','L','L','L','L','L','R','R','R','R','R','R','R','R'],
'AIC':['<1','<1','<1','<1',1,1,1,1,2,2,2,2,'>2','>2','>2','>2'],
'AIC_TRX':[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
'diff':[-1,-1,-1,-1,0,0,0,0,1,1,1,1,3,3,3,3],
'series':[1,2,4,8,1,2,4,8,1,2,4,8,1,2,4,8],
'Grwth_Time1':[150.78,162.34,188.53,197.69,208.07,217.76,229.48,139.51,146.87,182.54,189.57,199.97,229.28,244.73,269.91,249.19],
'Grwth_Time2':[250.78,262.34,288.53,297.69,308.07,317.7,329.81,339.15,346.87,382.54,369.59,399.97,329.28,347.73,369.91,349.12],
'Grwth_Time3':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
'Grwth_Time4':[270.84,282.14,298.53,306.69,318.73,327.47,369.63,389.59,398.75,432.18,449.78,473.55,494.85,509.39,515.52,539.23],
'Grwth_Time5':[25.78,22.34,28.53,27.69,30.07,17.7,29.81,33.15,34.87,32.54,36.59,39.97,29.28,34.73,36.91,34.12],
'Grwth_Time6':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
'Grwth_Time7':[27.84,28.14,29.53,30.69,18.73,27.47,36.63,38.59,38.75,24.18,24.78,21.55,13.85,9.39,15.52,39.23],
}
df = pd.DataFrame(df,columns = ['Gen','Site','Type','AIC','AIC_TRX','diff','series','Grwth_Time1','Grwth_Time2','Grwth_Time3','Grwth_Time4','Grwth_Time5','Grwth_Time6','Grwth_Time7'])
df.info()
I want to do the following:
Find the average of each unique series per AIC_TRX for each Grwth_Time (Grwth_Time1, Grwth_Time2,....,Grwth_Time7)
Export all the outputs as one xlsx file (refer to the figure below)
The desired outputs look like the figure below (note: the numbers in this output are not the actual average values, they were randomly generated)
My attempt:
# Select the columns -> AIC_TRX, series, Grwth_Time1,Grwth_Time2,....,Grwth_Time7
df1 = df[['AIC_TRX', 'diff', 'series',
'Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3', 'Grwth_Time4',
'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']]
#Below is where I need help, I want to groupby the 'series' and 'AIC_TRX' for all the 'Grwth_Time1_to_7'
df1.groupby('series').Grwth_Time1.agg(['mean'])
Thanks in advance
You have to group by two columns, ['series', 'AIC_TRX'], and take the mean of each Grwth_Time column.
df.groupby(['series', 'AIC_TRX'])[['Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3',
'Grwth_Time4', 'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']].mean().unstack().to_excel("output.xlsx")
Output (one block per Grwth_Time column, Grwth_Time1 through Grwth_Time7):
AIC_TRX 1 2 3 4
series
1 150.78 208.07 146.87 229.28
2 162.34 217.76 182.54 244.73
4 188.53 229.48 189.57 269.91
8 197.69 139.51 199.97 249.19
AIC_TRX 1 2 3 4
series
1 250.78 308.07 346.87 329.28
2 262.34 317.70 382.54 347.73
4 288.53 329.81 369.59 369.91
8 297.69 339.15 399.97 349.12
AIC_TRX 1 2 3 4
series
1 240.18 338.07 365.87 429.08
2 232.14 307.74 392.48 448.39
4 258.53 359.16 399.97 465.15
8 276.69 339.25 410.75 469.33
AIC_TRX 1 2 3 4
series
1 270.84 318.73 398.75 494.85
2 282.14 327.47 432.18 509.39
4 298.53 369.63 449.78 515.52
8 306.69 389.59 473.55 539.23
AIC_TRX 1 2 3 4
series
1 25.78 30.07 34.87 29.28
2 22.34 17.70 32.54 34.73
4 28.53 29.81 36.59 36.91
8 27.69 33.15 39.97 34.12
AIC_TRX 1 2 3 4
series
1 240.18 338.07 365.87 429.08
2 232.14 307.74 392.48 448.39
4 258.53 359.16 399.97 465.15
8 276.69 339.25 410.75 469.33
AIC_TRX 1 2 3 4
series
1 27.84 18.73 38.75 13.85
2 28.14 27.47 24.18 9.39
4 29.53 36.63 24.78 15.52
8 30.69 38.59 21.55 39.23
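In the question's data every (series, AIC_TRX) pair occurs only once, so the means above equal the raw values. The sketch below uses made-up toy numbers with repeated pairs so that the averaging is visible; only the column names are taken from the question.

```python
import pandas as pd

# Toy data (invented values): each (series, AIC_TRX) pair appears twice
df = pd.DataFrame({
    "AIC_TRX":     [1, 1, 1, 1, 2, 2, 2, 2],
    "series":      [1, 2, 1, 2, 1, 2, 1, 2],
    "Grwth_Time1": [10.0, 20.0, 30.0, 40.0, 1.0, 2.0, 3.0, 4.0],
    "Grwth_Time2": [100.0, 200.0, 300.0, 400.0, 5.0, 6.0, 7.0, 8.0],
})

# Mean of every Grwth_Time column per (series, AIC_TRX) pair
means = df.groupby(["series", "AIC_TRX"])[["Grwth_Time1", "Grwth_Time2"]].mean()

# Move AIC_TRX into the columns: rows are series, columns are (Grwth_Time, AIC_TRX)
wide = means.unstack()
print(wide)
```

Calling wide.to_excel("output.xlsx"), as in the answer, writes this table to a file; that step needs an Excel writer engine such as openpyxl to be installed.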
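Alternatively, you can call apply on the groupby with np.mean and axis=1. Note that this is a row-wise mean over all columns of df1 within each (series, AIC_TRX) group (in the result below, AIC_TRX, diff, and series are averaged in as well), so it gives one value per original row rather than one mean per Grwth_Time column.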
result = df1.groupby(['series', 'AIC_TRX']).apply(np.mean, axis=1)
Result:
series AIC_TRX
1 1 0 120.738
2 4 156.281
3 8 170.285
4 12 196.270
2 1 1 122.358
2 5 152.758
3 9 184.494
4 13 205.175
4 1 2 135.471
2 6 171.968
3 10 187.825
4 14 214.907
8 1 3 142.183
2 7 162.849
3 11 196.851
4 15 216.455
dtype: float64

Unwanted white spaces resulting in distorted columns

I am trying to import a list of chemicals from a txt file that is space-separated (not tab-separated).
NO FORMULA NAME CAS No A B C D TMIN TMAX code ngas#TMIN ngas#25 C ngas#TMAX
1 CBrClF2 bromochlorodifluoromethane 353-59-3 -0.0799 4.9660E-01 -6.3021E-05 -9.0961E-09 200 1500 2 96.65 142.14 572.33
2 CBrCl2F bromodichlorofluoromethane 353-58-2 4.0684 4.1343E-01 1.6576E-05 -3.4388E-08 200 1500 2 87.14 127.90 545.46
3 CBrCl3 bromotrichloromethane 75-62-7 7.3767 3.5056E-01 6.9163E-05 -4.9571E-08 200 1500 2 79.86 116.73 521.53
4 CBrF3 bromotrifluoromethane 75-63-8 -9.5253 6.5020E-01 -3.4459E-04 1.0987E-07 230 1500 1,2 123.13 156.61 561.26
5 CBr2F2 dibromodifluoromethane 75-61-6 2.8167 4.9405E-01 -1.2627E-05 -2.8629E-08 200 1500 2 100.89 148.24 618.87
6 CBr4 carbon tetrabromide 558-13-4 10.6812 3.2869E-01 1.0739E-04 -6.0788E-08 200 1500 2 80.23 116.62 540.18
7 CClF3 chlorotrifluoromethane 75-72-9 13.8075 4.7487E-01 -1.3368E-04 2.2485E-08 230 1500 1,2 116.23 144.10 501.22
8 CClN cyanogen chloride 506-77-4 0.8665 3.6619E-01 -2.9975E-05 -1.3191E-08 200 1500 2 72.80 107.03 438.19
When I import with pandas
df = pd.read_csv('trial1.txt', sep='\s')
I get:
For the first 5 compounds (index 0-4) the name lands correctly in the NAME column, but for the 6th (index 5) and 8th (index 7) compounds the name is split at the space and its second word goes into CAS. That pushes the CAS value under the No column, and every subsequent value shifts one column over.
Is there a way to fix this? Thank you.
I would suggest doing some work on the 'trial1.txt' file before loading it into df. The following code should get you close to what you want:
import pandas as pd
with open ('trial1.txt') as f:
    l=f.readlines()
l=[i.split() for i in l]
target=len(l[1]) #number of fields in a row whose name contains no space
for i in range(1,len(l)):
    if len(l[i])>target:
        #the name was split in two: glue the pieces back together
        l[i][2]=l[i][2]+' '+l[i][3]
        l[i].pop(3)
l=['|'.join(k) for k in l] #the header already contains '#' (ngas#TMIN), so join with a symbol that does not occur anywhere in your file, such as '|'
l=[i+'\n' for i in l]
with open ('trial2.txt', 'w') as f:
    f.writelines(l)
df = pd.read_csv('trial2.txt', sep='|', index_col=0)
Try this:
You basically have to strip out the spaces between the words in the NAME column. Here I first read the file and then remove those spaces using re.sub.
In this code, I am assuming that the words on either side of the space are at least 5 lowercase letters long. You can change that number {5} as you see fit.
import io
import re
import pandas as pd
with open('trial1.txt', 'r') as f:
    lines = f.readlines()
# Glue together two lowercase words of 5+ letters separated by one whitespace character
l = [re.sub(r"([a-z]{5,})\s([a-z]{5,})", r"\1\2", line) for line in lines]
# Each line still ends with its own newline, so join without adding another one
df = pd.read_csv(io.StringIO(''.join(l)), sep=r'\s+')
Prints:
NO FORMULA NAME CAS No A B C D TMIN TMAX code ngas#TMIN ngas#25 C.1 ngas#TMAX
1 CBrClF2 bromochlorodifluoromethane 353-59-3 -0.0799 0.49660 -0.000063 -9.096100e-09 200 1500 2 96.65 142.14 572.33 NaN NaN
2 CBrCl2F bromodichlorofluoromethane 353-58-2 4.0684 0.41343 0.000017 -3.438800e-08 200 1500 2 87.14 127.90 545.46 NaN NaN
3 CBrCl3 bromotrichloromethane 75-62-7 7.3767 0.35056 0.000069 -4.957100e-08 200 1500 2 79.86 116.73 521.53 NaN NaN
4 CBrF3 bromotrifluoromethane 75-63-8 -9.5253 0.65020 -0.000345 1.098700e-07 230 1500 1,2 123.13 156.61 561.26 NaN NaN
5 CBr2F2 dibromodifluoromethane 75-61-6 2.8167 0.49405 -0.000013 -2.862900e-08 200 1500 2 100.89 148.24 618.87 NaN NaN
6 CBr4 carbontetrabromide 558-13-4 10.6812 0.32869 0.000107 -6.078800e-08 200 1500 2 80.23 116.62 540.18 NaN NaN
7 CClF3 chlorotrifluoromethane 75-72-9 13.8075 0.47487 -0.000134 2.248500e-08 230 1500 1,2 116.23 144.10 501.22 NaN NaN
8 CClN cyanogenchloride 506-77-4 0.8665 0.36619 -0.000030 -1.319100e-08 200 1500 2 72.80 107.03 438.19 NaN NaN
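The output above loses the space inside the names ("carbontetrabromide"). One possible variation, sketched here on a trimmed in-memory sample rather than the real file, is to replace the space with a placeholder before parsing and restore it afterwards. The four-column header and the underscore placeholder are assumptions made for this illustration; the placeholder must not occur anywhere else in the data.

```python
import io
import re

import pandas as pd

# Trimmed sample: three rows and only the first four columns of the file
raw = """NO FORMULA NAME CAS
5 CBr2F2 dibromodifluoromethane 75-61-6
6 CBr4 carbon tetrabromide 558-13-4
8 CClN cyanogen chloride 506-77-4
"""

# Replace the space between two lowercase words (5+ letters each) with "_"
fixed = re.sub(r"([a-z]{5,})\s([a-z]{5,})", r"\1_\2", raw)

# Parse on runs of whitespace, then put the space back into the names
df = pd.read_csv(io.StringIO(fixed), sep=r"\s+")
df["NAME"] = df["NAME"].str.replace("_", " ", regex=False)
print(df)
```

As with the original regex, names whose words are shorter than five letters would not be caught and would still be split.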

How to compare value in Pandas DataFrame against a value in the previous row AND the previous column?

I have a DataFrame consisting of two columns filled with float values. For each row, I need to calculate the value of 'h' minus the value of 'c' in the previous row.
So for instance, for 'h' in row 1, I need to calculate 1.17322 - 1.17285 (the value of 'c' in the previous row)
I have tried several different methods to accomplish this, including the use of: .iloc, .shift(), .groupby(), and .diff(), but I cannot get exactly what I'm looking for.
If anybody could help, it would be greatly appreciated
c h
0 1.17285 1.17310
1 1.17287 1.17322
2 1.17298 1.17340
3 1.17346 1.17348
4 1.17478 1.17511
5 1.17595 1.17700
6 1.17508 1.17633
7 1.17474 1.17545
8 1.17463 1.17546
9 1.17224 1.17468
10 1.17437 1.17456
11 1.17552 1.17641
12 1.17750 1.17784
13 1.17694 1.17770
Try using shift, for example:
df['c_shift'] = df['c'].shift()
df['diff'] = df['h'] - df['c_shift']
print(df)
Output:
c h c_shift diff
0 1.17285 1.17310 NaN NaN
1 1.17287 1.17322 1.17285 0.00037
2 1.17298 1.17340 1.17287 0.00053
3 1.17346 1.17348 1.17298 0.00050
4 1.17478 1.17511 1.17346 0.00165
5 1.17595 1.17700 1.17478 0.00222
6 1.17508 1.17633 1.17595 0.00038
7 1.17474 1.17545 1.17508 0.00037
8 1.17463 1.17546 1.17474 0.00072
9 1.17224 1.17468 1.17463 0.00005
10 1.17437 1.17456 1.17224 0.00232
11 1.17552 1.17641 1.17437 0.00204
12 1.17750 1.17784 1.17552 0.00232
13 1.17694 1.17770 1.17750 0.00020
Of course, you can do this in one step:
df['diff'] = df['h'] - df['c'].shift()
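A self-contained version of the one-step form, using the first three rows from the question, might look like this. It is only a sketch for trying the idea in isolation.

```python
import pandas as pd

# First three rows of the question's data
df = pd.DataFrame({
    "c": [1.17285, 1.17287, 1.17298],
    "h": [1.17310, 1.17322, 1.17340],
})

# shift() moves 'c' down one row, so each 'h' lines up with the previous 'c'.
# The first row has no previous 'c', so its result is NaN.
df["diff"] = df["h"] - df["c"].shift()
print(df)
```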

Dataframe calculation

I want to do the following calculation and store the result in a new column, "calculation trap".
test["calculation trap"] = (( 0.000164 + 0.000415)/2)
so the outcome of this formula has to be 0.0002895.
I tried the following code to do this calculation for the whole column, but I got the result shown in the last column below.
test["calculation trap"] = ((test["calculation"][0:]+test["calculation"][1:])/2).reset_index(drop=True)
Temp calculation. calculation trap.
0 90.01 0.000164 NaN
1 91.03 0.000415 0.000415
2 95.06 0.001315 0.001315
3 100.07 0.002896 0.002896
4 103.50 NaN NaN
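pandas aligns the two slices on their index labels before adding them, so slicing alone does not offset the rows; each value just gets added to itself. Use Series.shift with -1 instead: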
test["calculation trap"] = ((test["calculation"].shift(-1)+test["calculation"])/2)
print (test)
Temp calculation calculation trap
0 90.01 0.000164 0.000290
1 91.03 0.000415 0.000865
2 95.06 0.001315 0.002106
3 100.07 0.002896 NaN
4 103.50 NaN NaN
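To see why the slicing attempt pairs each row with itself, here is a small sketch that compares the two approaches on the "calculation" values from the question. The variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

calc = pd.Series([0.000164, 0.000415, 0.001315, 0.002896, np.nan])

# Slices keep their original labels, so the addition matches label 1 with
# label 1, label 2 with label 2, and so on. Label 0 has no partner.
aligned = (calc.iloc[0:] + calc.iloc[1:]) / 2

# shift(-1) moves the next row's value up onto the current label,
# so each row is averaged with the row that follows it.
pairwise = (calc.shift(-1) + calc) / 2

print(pd.DataFrame({"aligned": aligned, "pairwise": pairwise}))
```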
