How to perform a calculation on a string column in pandas (Python)

df['concentration']
'4.0±0.0'
'2.5±0.2'
'5.8±0.1'
'45.0'
'23'
'26.07'
'64'
I want the result as:
4.00
2.70
5.90
45.00
23.00
26.07
64.00
which is (4.0+0.0), i.e. taking the highest possible value.
However, my concentration column is not float, so how can I perform a calculation on this type of data?
Any kind of help will be appreciated.
Thank you.

Use str.split with expand=True, convert the resulting parts to float with a lambda, and sum row-wise.
Data
import pandas as pd

df = pd.DataFrame({'concentration': ['4.0±0.0', '2.5±0.2', '5.8±0.1', '45.0', '23', '26.07', '64']})
Solution
df['concentration'] = df.concentration.str.split('±', expand=True).apply(lambda x: x.astype(float)).sum(axis=1)
concentration
0 4.00
1 2.70
2 5.90
3 45.00
4 23.00
5 26.07
6 64.00
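A close variant, in case some entries are malformed (hypothetical input), swaps the explicit astype for pd.to_numeric so unparseable parts become NaN instead of raising:
# split on '±', coerce each part to a number, NaN where parsing fails
parts = df['concentration'].str.split('±', expand=True)
df['concentration'] = parts.apply(pd.to_numeric, errors='coerce').sum(axis=1)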

Related

Pandas Dataframe Comparison and Copying

Below I have two dataframes, the first being dataframe det and the second being orig. I need to compare det['Detection'] with orig['Date/Time']. Once matching values are found during the comparison, I need to copy values from orig and det to some final dataframe (final). The format that I need the final dataframe in is: det['Date/Time'], orig['Lat'], orig['Lon'], orig['Dep'], det['Mag']. I hope that my formatting is adequate for folks; I was not sure how to handle the dataframes, so I just placed them in tables. Some additional information that probably won't matter is that det is 3385 rows by 3 columns and orig is 818 rows by 9 columns.
det:
Date/Time                 Mag     Detection
2008/12/27T01:06:56.37    0.280   2008/12/27T13:50:07.00
2008/12/27T01:17:39.39    0.485   2008/12/27T01:17:39.00
2008/12/27T01:33:23.00   -0.080   2008/12/27T01:17:39.00
orig:
Date/Time                Lat      Lon         Dep   Ml     Mc    N  Dmin  ehz
2008/12/27T01:17:39.00   44.5112  -110.3742   5.07  -9.99  0.51  5  6     3.2
2008/12/27T04:33:30.00   44.4985  -110.3750   4.24  -9.99  1.63  9  8     0.9
2008/12/27T05:38:22.00   44.4912  -110.3743   4.73  -9.99  0.37  8  8     0.8
final:
det['Date/Time']  orig['Lat']  orig['Lon']  orig['Dep']  det['Mag']
You can merge the two dataframes. Since you want to use the Detection column from the first dataframe and the Date/Time column from the second, rename the column of the second dataframe while merging, because the name Date/Time already exists in the first dataframe:
det.merge(orig.rename(columns={'Date/Time': 'Detection'}))
OUTPUT:
Date/Time Mag Detection Lat Lon Dep Ml Mc N Dmin ehz
0 2008/12/27T01:17:39.39 0.485 2008/12/27T01:17:39.00 44.5112 -110.3742 5.07 -9.99 0.51 5 6 3.2
1 2008/12/27T01:33:23.00 -0.080 2008/12/27T01:17:39.00 44.5112 -110.3742 5.07 -9.99 0.51 5 6 3.2
You can then select the columns you want.
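For example, to build the final frame laid out in the question (column names taken from the post), something along these lines should work:
# merge on the renamed key, then keep only the requested columns
merged = det.merge(orig.rename(columns={'Date/Time': 'Detection'}))
final = merged[['Date/Time', 'Lat', 'Lon', 'Dep', 'Mag']]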

Select value from dataframe based on other dataframe

I am trying to calculate the position of an object based on a timestamp. For this I have two dataframes in pandas: one for the measurement data and one for the position. All the movement is straightforward acceleration.
Dataframe 1 contains the measurement data:
   ms  force  ...
1   5     20
2  10     20
3  15     25
4  20     30
5  25     20
..... (~6000 lines)
Dataframe 2 contains "positioning data"
   ms     speed (m/s)
1      0         0.66
2   4500         0.66
3   8000         1.3
4  16000         3.0
5  20000         3.0
..... (~300 lines)
Now I want to calculate the position for the first dataframe using the data from the second dataframe.
In Excel I solved the problem by using an array formula, but now I have to use Python/Pandas and I can't find a way to select the correct row from dataframe 2.
My idea is to make something like this: if
In the end I want to display a graph "force <-> distance" and not "force <-> time".
Thank you in advance
==========================================================================
Update:
In the meantime I have almost solved my issue. Now my data looks like this:
Dataframe 2 (Speed Data):
pos v a t t-end t-start
0 -3.000 0.666667 0.000000 4.500000 4.500000 0.000000
1 0.000 0.666667 0.187037 0.071287 4.571287 4.500000
2 0.048 0.680000 0.650794 0.010244 4.581531 4.571287
3 0.055 0.686667 0.205432 0.064904 4.646435 4.581531
...
15 0.055 0.686667 0.5 0.064904 23.0 20.0
...
28 0.055 0.686667 0.6 0.064904 35.0 34.0
...
30 0.055 0.686667 0.9 0.064904 44.0 39.0
And Dataframe 1 (time based measurement):
Fx Fy Fz abs_t expected output ('a' from DF2)
0 -13.9 170.3 45.0 0.005 0.000000
1 -14.1 151.6 38.2 0.010 0.000000
...
200 -14.1 131.4 30.4 20.015 0.5
...
300 -14.3 111.9 21.1 34.01 0.6
...
400 -14.5 95.6 13.2 40.025
So I want to check the time (abs_t) from DF1 and search for the correct 'a' in DF2.
So something like this (pseudo code):
if DF1['t_abs'] between (DF2['t-start'], DF2['t-end']):
    DF1['a'] = DF2['a']
I could write two for loops, but that looks like the wrong way and is very, very slow.
I hope you understand my problem; providing a running sample is very hard.
In Excel I did it like this:
I found a very slow solution, but at least it's working :(
df1['a'] = 0
for index, row in df2.iterrows():
    start = row['t-start']
    end = row['t-end']
    a = row['a']
    df1.loc[(df1['tabs'] > start) & (df1['tabs'] < end), 'a'] = a
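As a faster, vectorized alternative, here is a sketch (assuming the t-start/t-end intervals in df2 do not overlap) that builds an IntervalIndex and looks every timestamp up at once:
import numpy as np
import pandas as pd

# one interval per df2 row; get_indexer returns -1 where no interval matches
intervals = pd.IntervalIndex.from_arrays(df2['t-start'], df2['t-end'], closed='neither')
idx = intervals.get_indexer(df1['tabs'])
df1['a'] = np.where(idx >= 0, df2['a'].to_numpy()[idx], 0)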

Replace 'D' to 'E' within scientific notation for all values in a dataframe

I have an output file from old Fortran code, which outputs values as double precision; any numbers given in scientific notation are therefore in the form 1.23D+4, for example. I'm saving this csv as a pandas dataframe and wish to do data analysis.
I'm looking for a way to convert the D to an E within every entry of the dataframe, dedx. I've tried:
for c in dedx.columns:
    for i in dedx[c]:
        if isinstance(i, str):
            i = float(i.replace('D', 'E'))
This changes the value within the loop, as can be seen using print(i) but does not make changes to the actual dataframe.
A sample of the dataframe is shown below:
ENERGY(MEV) DE/DX(MEV/MM) DE/DX(MEV.CM2/MG) RANGE(MM) RANGE(MG/CM2)
0 0.01 4.908059D+01 4.811823D-02 0.000477 0.486766
1 0.50 4.917734D+02 4.821308D-01 0.002121 2.162930
2 1.00 5.261802D+02 5.158630D-01 0.003088 3.149690
3 1.50 5.105083D+02 5.004984D-01 0.004050 4.130490
4 2.00 4.842530D+02 4.747579D-01 0.005054 5.155440
5 2.50 4.568363D+02 4.478788D-01 0.006117 6.239750
6 3.00 4.309473D+02 4.224973D-01 0.007245 7.389450
7 3.50 4.072914D+02 3.993053D-01 0.008438 8.607170
8 4.00 3.859186D+02 3.783516D-01 0.009700 9.894000
9 4.50 3.666619D+02 3.594725D-01 0.011030 11.250200
10 5.00 3.492947D+02 3.424458D-01 0.012427 12.675800
11 5.50 3.335896D+02 3.270486D-01 0.013892 14.170300
12 6.00 3.193387D+02 3.130772D-01 0.015425 15.733200
13 6.50 3.063596D+02 3.003526D-01 0.017024 17.364200
14 7.00 2.944946D+02 2.887202D-01 0.018689 19.062500
15 7.50 2.836086D+02 2.780477D-01 0.020419 20.827500
16 8.00 2.735860D+02 2.682215D-01 0.022215 22.658800
17 8.50 2.643277D+02 2.591448D-01 0.024074 24.555600
18 9.00 2.557488D+02 2.507341D-01 0.025998 26.517400
19 9.50 2.477762D+02 2.429178D-01 0.027984 28.543700
20 10.00 2.403466D+02 2.356339D-01 0.030033 30.633900
You can do this kind of conversion while reading the file with pandas.read_csv rather than looping. It is more efficient.
d_to_e = lambda x : float(x.replace('D', 'E'))
df = pd.read_csv('yourfilename.csv', converters={'DE/DX(MEV/MM)' : d_to_e, 'DE/DX(MEV.CM2/MG)' : d_to_e})
The converters parameter allows you to apply a function to the data of each column; it takes a dict mapping column names to the function to apply to that column's data, and the result is stored in the dataframe.
I defined the function d_to_e, which does the letter replacement and returns a float, as you did in your loop.
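If you'd rather not list every column by name, a hypothetical alternative reads everything as strings and converts all columns afterwards in one pass:
# read all columns as strings, then swap D for E and convert numerically
dedx = pd.read_csv('yourfilename.csv', dtype=str)
dedx = dedx.apply(lambda col: pd.to_numeric(col.str.replace('D', 'E'), errors='coerce'))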
Try this one (to replace your whole code):
for c in dedx.columns:
    dedx[c] = dedx[c].apply(lambda x: float(x.replace("D", "E")) if isinstance(x, str) else x)
(assuming dedx[c] is a reference to the dataframe column that you want to modify; the else x branch leaves non-string values untouched)

Preprocessing multivariate time-series

Assume that I have time-indexed observations of a group of variables.
A B C D ... V day
16.50 14.00 53.00 45.70 ... 6.39 1
17.45 16.00 64.00 46.30 ... 6.00 2
18.40 12.00 51.00 47.30 ... 6.57 3
19.35 7.00 42.00 48.40 ... 5.84 4
20.30 9.00 34.00 49.50 ... 6.36 5
20.72 10.00 42.00 50.60 ... 5.78 6
21.14 6.00 45.00 51.90 ... 5.16 7
21.56 9.00 38.00 52.60 ... 5.62 8
21.98 2.00 32.00 53.50 ... 4.94 9
22.78 8.00 29.00 53.80 ... 6.25 10
...
Based on this data frame, I would like to construct a predictive model for a target variable V, with predictors being some of the other variables from previous observations, say with a time horizon $K$. That is, I would like to fit a model of the type
$$V_t = f(A_{t-1}, \dots, A_{t-K}, B_{t-1}, \dots, B_{t-K}, \dots) + \varepsilon_t,$$
where $\varepsilon_t$ is the error, and $f$ is chosen from a certain class of functions (e.g. some kind of a neural network or a tree-based method).
Question Are there packages that allow me a flexible choice of a learning algorithm, possibly from other packages, while also doing necessary preprocessing? Something in the spirit of what Caret does.
Thoughts Of course I could add a bunch of columns to the data frame, say A1, A2, ..., AK, etc., where A1 is the value of A on the previous day, A2 is the value of A from 2 days before, and so on (see the sketch after this question). After that, I could apply any package I'd like. The problem is that if my initial data frame is large, this new data frame is going to be many times larger (about $K$ times, to be precise). This makes such an approach very inefficient if $K$ is large.
The caret package does have a function called createTimeSlices, but it seems to be designed for univariate time-series. This question is similar to this older one from 2009; the packages recommended there do not quite seem to do what I described above. I was wondering whether other tools have appeared since then. I would also appreciate any information on whether something like this exists in Python.
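For reference, the lag-column construction described in the Thoughts above can be written in pandas as a short sketch (df, the horizon K, and the column names here follow the example data and are otherwise assumptions):
import pandas as pd

K = 3  # hypothetical time horizon
# build A1..AK, B1..BK, ... where, e.g., A1 holds A shifted one day back
lags = {f'{col}{i}': df[col].shift(i)
        for col in df.columns if col != 'day'
        for i in range(1, K + 1)}
X = pd.concat(lags, axis=1).dropna()  # predictor matrix
y = df['V'].loc[X.index]              # target aligned with the lagged rows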

Summing rows in Python Dataframe

I just started learning Python so forgive me if this question has already been answered somewhere else. I want to create a new column called "Sum", which will simply be the previous columns added up.
Risk_Parity.tail()
VCIT VCLT PCY RWR IJR XLU EWL
Date
2017-01-31 21.704155 11.733716 9.588649 8.278629 5.061788 7.010918 7.951747
2017-02-28 19.839319 10.748690 9.582891 7.548530 5.066478 7.453951 7.950232
2017-03-31 19.986782 10.754507 9.593623 7.370828 5.024079 7.402774 7.654366
2017-04-30 18.897307 11.102380 10.021139 9.666693 5.901137 7.398604 11.284331
2017-05-31 63.962659 23.670240 46.018698 9.917160 15.234977 12.344524 20.405587
The table columns are a little off, but all I need is (21.70 + 11.73 + ... + 7.95).
I can only get as far as creating the column Risk_Parity['sum'] = , but then I'm lost.
I'd rather not have to do Risk_Parity['sum'] = Risk_Parity['VCIT'] + Risk_Parity['VCLT'] + ...
After creating the sum column, I want to divide each column by the sum column and make that into a new dataframe, which wouldn't include the sum column.
If anyone could help, I'd greatly appreciate it. Please try to dumb your answers down as much as possible lol.
Thanks!
Tom
Use sum with the parameter axis=1 to specify summation over rows
Risk_Parity['Sum'] = Risk_Parity.sum(axis=1)
To create a new copy of Risk_Parity without writing a new column to the original
Risk_Parity.assign(Sum=Risk_Parity.sum(axis=1))
Notice also, that I named the column Sum and not sum. I did this to avoid colliding with the very same method named sum I used to create the column.
To include only numeric columns (though sum skips non-numeric columns anyway):
Risk_Parity.assign(Sum=Risk_Parity.select_dtypes(['number']).sum(axis=1))
# same as
# Risk_Parity.assign(Sum=Risk_Parity.sum(axis=1))
VCIT VCLT PCY RWR IJR XLU EWL Sum
Date
2017-01-31 21.70 11.73 9.59 8.28 5.06 7.01 7.95 71.33
2017-02-28 19.84 10.75 9.58 7.55 5.07 7.45 7.95 68.19
2017-03-31 19.99 10.75 9.59 7.37 5.02 7.40 7.65 67.79
2017-04-30 18.90 11.10 10.02 9.67 5.90 7.40 11.28 74.27
2017-05-31 63.96 23.67 46.02 9.92 15.23 12.34 20.41 191.55
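For the follow-up part of the question (dividing each column by the row sum and getting a dataframe without the sum column), a minimal sketch:
# divide every column by the row total; no Sum column is ever added
weights = Risk_Parity.div(Risk_Parity.sum(axis=1), axis=0)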
l = ['VCIT', 'VCLT', 'PCY', 'RWR', 'IJR', 'XLU', 'EWL']
Risk_Parity['sum'] = 0
for item in l:
    Risk_Parity['sum'] += Risk_Parity[item]
