Modifying values in a dataframe - python

I am trying to iterate through the rows of a dataframe and modify some values as I iterate. The dataframe looks like this:
Time WindSpeed SkyCover Temp DewPt RH Press Precip
3 21:53 11 Light Snow -1.7 -6.1 72% 1003.1 0
4 20:53 N 11 Mostly Cloudy -2.2 -6.1 75% 1002.8 0
5 19:53 Calm Mostly Cloudy -2.8 -6.7 75% 1002.7 0
6 18:53 Calm Overcast -1.7 -6.7 69% 1002.4 0
7 17:53 N 5 Overcast -1.7 -7.2 66% 1002.6 0
8 16:53 NE 8 Overcast -1.1 -7.2 64% 1002.5 0
…
I have written the following loop to go through the dataframe and alter the WindSpeed column. This column holds a direction and a speed (for example 'N 11') when the wind speed is greater than 1 km/h, and the text value 'Calm' when it is below that threshold. I want the loop to look at the column values row by row: if the value is 'Calm', put 1 in its place; otherwise, remove the direction and keep only the scalar speed.
for i in df.index:
    if df.at[i, 2] == 'Calm':
        df.at[i, 2] = 1
    else:
        df.at[i, 2] = re.findall('[0-9]+', df.at[i, 2])[0]
As you can see in the above dataframe, this loop has worked on the first row of data but does not continue past that. I am not receiving any error messages as to why it is stopping after the first row.

Use apply:
import re
df['WindSpeed'] = df['WindSpeed'].apply(lambda x: 1 if x == 'Calm' else re.findall(r'[0-9]+', x)[0])

Adding another way of doing it:
import numpy as np
# expand=False makes str.extract return a Series, so np.where gets a 1-D input
df['WindSpeed'] = np.where(df['WindSpeed'] == 'Calm', '1', df['WindSpeed'].str.extract(r'(\d+)', expand=False))

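import re

def modify(x):
    # 'Calm' becomes 1; otherwise keep the first run of digits
    if x == 'Calm':
        return 1
    return re.findall('[0-9]+', x)[0]

# the function must be defined before it is passed to apply
df['WindSpeed'] = df['WindSpeed'].apply(modify)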
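For reference, here is a minimal self-contained sketch of the apply approach, run on a few WindSpeed values taken from the question. The helper name wind_speed_to_int and the cast to int are assumptions added here so that the column ends up with a single numeric type; the answers above leave the extracted digits as strings.

```python
import re

import pandas as pd

# Sample values copied from the question's WindSpeed column.
df = pd.DataFrame({"WindSpeed": ["N 11", "Calm", "NE 8"]})


def wind_speed_to_int(value):
    """Return 1 for 'Calm', otherwise the first run of digits as an int."""
    if value == "Calm":
        return 1
    return int(re.findall(r"[0-9]+", value)[0])


df["WindSpeed"] = df["WindSpeed"].apply(wind_speed_to_int)
print(df)
```

Addressing the column by name, rather than by a position such as 2, also avoids any ambiguity about which column the loop in the question was actually writing to.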

Related

How to split a dataframe containing voltage over time value, so that it can store values of each waveform/bit separately

I have several CSV files containing voltage-over-time data. Each file has approximately 7000 rows, and the data looks like this:
Time(us) Voltage (V)
0 32.96554106
0.5 32.9149649
1 32.90484966
1.5 32.86438874
2 32.8542735
2.5 32.76323642
3 32.74300595
3.5 32.65196886
4 32.58116224
4.5 32.51035562
5 32.42943376
5.5 32.38897283
6 32.31816621
6.5 32.28782051
7 32.26759005
7.5 32.21701389
8 32.19678342
8.5 32.16643773
9 32.14620726
9.5 32.08551587
10 32.04505495
10.5 31.97424832
11 31.92367216
11.5 31.86298077
12 31.80228938
12.5 31.78205891
13 31.73148275
13.5 31.69102183
14 31.68090659
14.5 31.67079136
15 31.64044567
15.5 31.59998474
16 31.53929335
16.5 31.51906288
I read the CSV file into a pandas dataframe; after plotting the data from one CSV file with matplotlib, the figure looks like the one below.
I would like to split out every single square waveform/bit and store the corresponding voltage values for each bit separately. The resulting voltage values of each bit would be stored in one row and should look like this:
I don't have any idea how to do that. I guess I have to write a function with a threshold: if the voltage values go down for maybe 20 time steps, then capture all the values, or if the voltage level goes up for 20 time steps, then capture all the voltage values. Could someone help?
If you get the gradient of your Voltage (here using diff as the time is regularly spaced), this gives you the following:
You can thus easily use a threshold (I tested with 2) to identify the peak starts. Then pivot your data:
# get threshold of gradient
m = df['Voltage (V)'].diff().gt(2)
# group start = value above threshold preceded by value below threshold
group = (m&~m.shift(fill_value=False)).cumsum().add(1)
df2 = (df
       .assign(id=group,
               # time relative to the start of each group; transform keeps the original index
               t=lambda d: d['Time (us)'].groupby(group).transform(lambda s: s-s.iloc[0])
              )
       .pivot(index='id', columns='t', values='Voltage (V)')
      )
output:
t 0.0 0.5 1.0 1.5 2.0 2.5 \
id
1 32.965541 32.914965 32.904850 32.864389 32.854273 32.763236
2 25.045314 27.543777 29.182444 30.588462 31.114454 31.984364
3 25.166697 27.746081 29.415095 30.719960 31.326873 32.125977
4 25.277965 27.877579 29.536477 30.912149 31.367334 32.206899
5 25.379117 27.978732 29.667975 30.780651 31.670791 32.338397
6 25.631998 27.634814 28.959909 30.173737 30.659268 31.053762
7 23.528030 26.137759 27.948386 29.253251 30.244544 30.649153
8 23.639297 26.380525 28.464263 29.971432 30.902034 31.458371
9 23.740449 26.542369 28.707028 30.295120 30.881803 31.862981
10 23.871948 26.673867 28.889103 30.305235 31.185260 31.873096
11 24.387824 26.694097 28.342880 29.678091 30.315350 31.134684
...
t 748.5 749.0
id
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 21.059913 21.161065
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
11 NaN NaN
[11 rows x 1499 columns]
plot:
df2.T.plot()
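The same steps can be tried end to end on a tiny made-up signal. This is only a sketch: the eleven voltage values below are invented for illustration (they are not the asker's data), and the threshold of 2 is simply carried over from the answer above.

```python
import numpy as np
import pandas as pd

# Invented signal: each "bit" begins with a sharp rise after a drop.
df = pd.DataFrame({
    "Time (us)": np.arange(11) * 0.5,
    "Voltage (V)": [30, 29, 28, 20, 25, 28, 30, 29, 21, 26, 29],
})

# True where the sample-to-sample increase exceeds the threshold.
m = df["Voltage (V)"].diff().gt(2)
# A new group starts at the first True after a False.
group = (m & ~m.shift(fill_value=False)).cumsum().add(1).rename("id")
# Time measured from the first sample of each group.
t_rel = df["Time (us)"] - df.groupby(group)["Time (us)"].transform("first")

df2 = (df.assign(id=group, t=t_rel)
         .pivot(index="id", columns="t", values="Voltage (V)"))
print(df2)
```

Each row of df2 holds one waveform; shorter waveforms are padded with NaN on the right.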

Pandas Dataframe Comparison and Copying

Below I have two dataframes: the first is det and the second is orig. I need to compare det['Detection'] with orig['Date/Time']. Once matching values are found during the comparison, I need to copy values from orig and det into a final dataframe (final). The format that I need for the final dataframe is det['Date/Time'] orig['Lat'] orig['Lon'] orig['Dep'] det['Mag']. I hope my formatting is adequate. I was not sure how to present the dataframes, so I just placed them in tables. Some additional information that probably won't matter: det is 3385 rows by 3 columns and orig is 818 rows by 9 columns.
det:
Date/Time               Mag     Detection
2008/12/27T01:06:56.37   0.280  2008/12/27T13:50:07.00
2008/12/27T01:17:39.39   0.485  2008/12/27T01:17:39.00
2008/12/27T01:33:23.00  -0.080  2008/12/27T01:17:39.00
orig:
Date/Time               Lat      Lon        Dep   Ml     Mc    N  Dmin  ehz
2008/12/27T01:17:39.00  44.5112  -110.3742  5.07  -9.99  0.51  5  6     3.2
2008/12/27T04:33:30.00  44.4985  -110.3750  4.24  -9.99  1.63  9  8     0.9
2008/12/27T05:38:22.00  44.4912  -110.3743  4.73  -9.99  0.37  8  8     0.8
final:
det['Date/Time']  orig['Lat']  orig['Lon']  orig['Dep']  det['Mag']
You can merge the two dataframes. Since you want to match the Detection column of the first dataframe against the Date/Time column of the second, and the name Date/Time already exists in the first dataframe, rename that column of the second dataframe to Detection while merging:
det.merge(orig.rename(columns={'Date/Time': 'Detection'}))
OUTPUT:
Date/Time Mag Detection Lat Lon Dep Ml Mc N Dmin ehz
0 2008/12/27T01:17:39.39 0.485 2008/12/27T01:17:39.00 44.5112 -110.3742 5.07 -9.99 0.51 5 6 3.2
1 2008/12/27T01:33:23.00 -0.080 2008/12/27T01:17:39.00 44.5112 -110.3742 5.07 -9.99 0.51 5 6 3.2
You can then select the columns you want.
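As a sketch of that last step, the merge and the column selection can be combined. The frames below are rebuilt by hand from the sample rows in the question (only the columns needed for final are included), and the explicit on= and how= arguments are choices made here for clarity rather than part of the answer above.

```python
import pandas as pd

det = pd.DataFrame({
    "Date/Time": ["2008/12/27T01:06:56.37",
                  "2008/12/27T01:17:39.39",
                  "2008/12/27T01:33:23.00"],
    "Mag": [0.280, 0.485, -0.080],
    "Detection": ["2008/12/27T13:50:07.00",
                  "2008/12/27T01:17:39.00",
                  "2008/12/27T01:17:39.00"],
})
orig = pd.DataFrame({
    "Date/Time": ["2008/12/27T01:17:39.00", "2008/12/27T04:33:30.00"],
    "Lat": [44.5112, 44.4985],
    "Lon": [-110.3742, -110.3750],
    "Dep": [5.07, 4.24],
})

# Keep only the columns needed from orig and rename its key to match det.
orig_small = (orig[["Date/Time", "Lat", "Lon", "Dep"]]
              .rename(columns={"Date/Time": "Detection"}))

# Inner merge keeps only detections that have a matching origin time.
merged = det.merge(orig_small, on="Detection", how="inner")
final = merged[["Date/Time", "Lat", "Lon", "Dep", "Mag"]]
print(final)
```

Selecting the orig columns before merging keeps the intermediate frame small, which may matter little at 3385 by 818 rows but makes the intent explicit.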

Select value from dataframe based on other dataframe

I am trying to calculate the position of an object based on a timestamp. For this I have two dataframes in pandas: one for the measurement data and one for the position. All the movement is a straightforward acceleration.
Dataframe 1 contains the measurement data:
ms force ... ... ...
1 5 20
2 10 20
3 15 25
4 20 30
5 25 20
..... (~ 6000 lines)
Dataframe 2 contains "positioning data"
ms speed (m/s)
1 0 0.66
2 4500 0.66
3 8000 1.3
4 16000 3.0
5 20000 3.0
.....(~300 lines)
Now I want to calculate the position for each row of the first dataframe using the data from the second dataframe.
In Excel I solved the problem with an array formula, but now I have to use Python/pandas and I can't find a way to select the correct row from dataframe 2.
My idea is to make something like this: if
In the end I want to display a graph "force <-> way" and not "force <-> time"
Thank you in advance
==========================================================================
Update:
In the meantime I have almost solved my issue. My data now looks like this:
Dataframe 2 (Speed Data):
pos v a t t-end t-start
0 -3.000 0.666667 0.000000 4.500000 4.500000 0.000000
1 0.000 0.666667 0.187037 0.071287 4.571287 4.500000
2 0.048 0.680000 0.650794 0.010244 4.581531 4.571287
3 0.055 0.686667 0.205432 0.064904 4.646435 4.581531
...
15 0.055 0.686667 0.5 0.064904 23.0 20.0
...
28 0.055 0.686667 0.6 0.064904 35.0 34.0
...
30 0.055 0.686667 0.9 0.064904 44.0 39.0
And Dataframe 1 (time based measurement):
Fx Fy Fz abs_t expected output ('a' from DF1)
0 -13.9 170.3 45.0 0.005 0.000000
1 -14.1 151.6 38.2 0.010 0.000000
...
200 -14.1 131.4 30.4 20.015 0.5
...
300 -14.3 111.9 21.1 34.01 0.6
...
400 -14.5 95.6 13.2 40.025
So I want to check the time (abs_t) from DF1 and look up the correct 'a' in DF2.
So something like this (pseudo code):
if DF1['abs_t'] between (DF2['t-start'], DF2['t-end']):
    DF1['a'] = DF2['a']
I could write two for loops, but that looks like the wrong way and is very, very slow.
I hope you understand my problem; providing a running sample is very hard.
In Excel I did it like this:
I found a very slow solution, but at least it's working :(
df1['a'] = 0
for index, row in df2.iterrows():
    start = row['t-start']
    end = row['t-end']
    a = row['a']
    df1.loc[(df1['abs_t'] > start) & (df1['abs_t'] < end), 'a'] = a
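One possible way to avoid the Python-level loop is pandas.merge_asof, which matches each measurement time to the most recent interval start. The sketch below assumes the intervals in the speed dataframe do not overlap; the sample values are made up for illustration (loosely following the tables above), and the name matched is introduced here.

```python
import pandas as pd

# Made-up interval table: one acceleration value per [t-start, t-end) window.
df2 = pd.DataFrame({
    "t-start": [0.0, 4.5, 20.0, 34.0],
    "t-end": [4.5, 20.0, 23.0, 35.0],
    "a": [0.0, 0.187037, 0.5, 0.6],
})
# Made-up measurement times.
df1 = pd.DataFrame({"abs_t": [0.005, 0.010, 20.015, 34.01, 40.025]})

# For each abs_t, take the last interval whose start is <= abs_t.
matched = pd.merge_asof(
    df1.sort_values("abs_t"),
    df2.sort_values("t-start"),
    left_on="abs_t",
    right_on="t-start",
    direction="backward",
)

# A time past the end of its matched interval belongs to no interval:
# reset 'a' to 0 there, mirroring the df1['a'] = 0 default of the loop.
matched.loc[matched["abs_t"] >= matched["t-end"], "a"] = 0.0
print(matched[["abs_t", "a"]])
```

Unlike the loop, which uses strict inequalities on both ends, this version treats a time exactly equal to t-start as inside the interval; whether that difference matters depends on the data.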

Interpolate between two columns of a Dataframe

I am attempting to interpolate a value based on a number's position in a different column. Take this column for instance:
Coupon Price
9.5 109.04
9.375 108.79
9.25 108.54
9.125 108.29
9 108.04
8.875 107.79
8.75 107.54
8.625 107.29
8.5 107.04
8.375 106.79
8.25 106.54
Let's say I have a number like 107. I want to find 107's relative distance from both 107.04 and 106.79, and then interpolate the value that has the same relative distance between 8.5 and 8.375, the coupon values at the same indices. Is this possible? I can solve this in Excel using the FORECAST function, but I want to know if it can be done in Python.
Welcome to Stack Overflow.
We need to make a custom function for this, unless there's a standard library function I'm unaware of, which is entirely possible. I'm going to make a function that lets you enter a bond by price and inserts it into the dataframe with the appropriate coupon.
Assuming we are starting with a sorted dataframe:
print(df)
Coupon Price
0 9.500 109.04
1 9.375 108.79
2 9.250 108.54
3 9.125 108.29
4 9.000 108.04
5 8.875 107.79
6 8.750 107.54
7 8.625 107.29
8 8.500 107.04
9 8.375 106.79
10 8.250 106.54
I've inserted comments into the function.
import numpy as np

def add_bond(Price, df):
    # Add row
    df.loc[df.shape[0]] = [np.nan, Price]
    df = df.sort_values('Price', ascending=False).reset_index(drop=True)
    # Get index
    idx = df[df['Price'] == Price].head(1).index.tolist()[0]
    # Get the distance between the Prices of the previous row and the next row
    span = abs(df.iloc[idx-1, 1] - df.iloc[idx+1, 1]).round(4)
    # Get the distance and direction from the previous row's Price to the new value
    terp = (df.iloc[idx, 1] - df.iloc[idx-1, 1]).round(4)
    # Find the movement from the previous row as a fraction of the span.
    moved = terp / span
    # Finally calculate the move from the previous row for Coupon.
    df.iloc[idx, 0] = df.iloc[idx-1, 0] + (abs(df.iloc[idx-1, 0] - df.iloc[idx+1, 0]) * moved)
    return df
Use the function to calculate the Coupon of a new bond from its Price in the DataFrame.
# Add 107
df = add_bond(107, df)
print(df)
Coupon Price
0 9.500 109.04
1 9.375 108.79
2 9.250 108.54
3 9.125 108.29
4 9.000 108.04
5 8.875 107.79
6 8.750 107.54
7 8.625 107.29
8 8.500 107.04
9 8.480 107.00
10 8.375 106.79
11 8.250 106.54
Add one more.
# Add 107.9
df = add_bond(107.9, df)
print(df)
Coupon Price
0 9.500 109.04
1 9.375 108.79
2 9.250 108.54
3 9.125 108.29
4 9.000 108.04
5 8.930 107.90
6 8.875 107.79
7 8.750 107.54
8 8.625 107.29
9 8.500 107.04
10 8.480 107.00
11 8.375 106.79
12 8.250 106.54
If this answer meets your needs, please remember to select correct answer. Thanks.
There's probably a function somewhere that does the work for you, but my advice is to program it yourself; it's not difficult at all and it's a nice programming exercise. Just find the slope in that segment and use the equation of a straight line:
(y-y0) = ((y1-y0)/(x1-x0))*(x-x0) -> y = ((y1-y0)/(x1-x0))*(x-x0) + y0
Where:
x -> Your given value (107)
x1 & x0 -> The values right above and below (107.04 & 106.79)
y1 & y0 -> The corresponding values to x1 & x0 (8.5 & 8.375)
y -> Your target value.
Just basic high-school maths ;-)
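The straight-line formula above can be sketched as a small function. The name interpolate is made up for this example; the second half cross-checks the result with numpy.interp, which is one existing function that does this kind of lookup, provided the x values are passed in increasing order.

```python
import numpy as np


def interpolate(x, x0, y0, x1, y1):
    """Point on the straight line through (x0, y0) and (x1, y1) at x."""
    return (y1 - y0) / (x1 - x0) * (x - x0) + y0


# Price 107 lies between 106.79 (coupon 8.375) and 107.04 (coupon 8.5).
coupon = interpolate(107, 106.79, 8.375, 107.04, 8.5)

# Cross-check against the whole table from the question.
prices = np.array([109.04, 108.79, 108.54, 108.29, 108.04, 107.79,
                   107.54, 107.29, 107.04, 106.79, 106.54])
coupons = np.array([9.5, 9.375, 9.25, 9.125, 9.0, 8.875,
                    8.75, 8.625, 8.5, 8.375, 8.25])
# np.interp expects increasing x values, so both arrays are reversed.
coupon_np = np.interp(107, prices[::-1], coupons[::-1])

print(round(coupon, 4), round(coupon_np, 4))
```

Both calls give approximately 8.48, matching the value inserted by add_bond in the other answer.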

How do I iterate over a DataFrame when apply won't work without a for loop?

I am trying to find the best way to apply my function to each individual row of a pandas DataFrame without using iterrows() or itertuples(). Note that I am pretty sure apply() will not work in this case.
Here are the first 5 rows of the DataFrame that I'm working with:
In [2470]: home_df.head()
Out[2470]:
GameId GameId_real team FTHG FTAG homeElo awayElo homeGame
0 0 -1 Charlton 1.0 2.0 1500.0 1500.0 1
1 1 -1 Derby 2.0 1.0 1500.0 1500.0 1
2 2 -1 Leeds 2.0 0.0 1500.0 1500.0 1
3 3 -1 Leicester 0.0 5.0 1500.0 1500.0 1
4 4 -1 Liverpool 2.0 1.0 1500.0 1500.0 1
Here is my function and the code that I am currently using:
def wt_goals_elo(df, game_id_row, team_row):
    wt_goals = (df[(df.GameId < game_id_row) & (df.team == team_row)]
                .pipe(lambda df:
                      (df.awayElo * df.FTHG).sum() / df.awayElo.sum()))
    return wt_goals

game_id_idx = home_df.columns.get_loc('GameId')
team_idx = home_df.columns.get_loc('team')
wt_goals = [wt_goals_elo(home_df, row[game_id_idx + 1], row[team_idx + 1]) for row in home_df.itertuples()]
FTHG = Full time home goals.
I am basically trying to find the weighted average of full time home goals, weighted by away elo for previous games. I can do this using a for loop but am unable to do it using apply, as I need to refer to the original DataFrame to filter by GameId and team.
Any ideas?
Thanks so much in advance.
I believe you need:
def wt_goals_elo(game_id_row, team_row):
    print(game_id_row)
    wt_goals = (home_df[(home_df.GameId.shift() < game_id_row) &
                        (home_df.team.shift() == team_row)]
                .pipe(lambda df:
                      (df.awayElo * df.FTHG).sum() / df.awayElo.sum()))
    return wt_goals

home_df['w'] = home_df.apply(lambda x: wt_goals_elo(x['GameId'], x['team']), axis=1)
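If GameId is unique and increases over time, the weighted average over previous games could also be computed without any per-row filtering, using cumulative sums within each team. This is a different technique from the apply call above, sketched here on made-up data; the function name wt_goals_vectorized is an assumption, and the result is checked against the question's original definition.

```python
import pandas as pd

# Made-up games for two teams; column names follow the question.
home_df = pd.DataFrame({
    "GameId": [0, 1, 2, 3, 4, 5],
    "team": ["A", "B", "A", "B", "A", "A"],
    "FTHG": [1.0, 2.0, 3.0, 0.0, 2.0, 1.0],
    "awayElo": [1500.0, 1400.0, 1600.0, 1500.0, 1450.0, 1550.0],
})


def wt_goals_vectorized(df):
    """Elo-weighted mean of FTHG over each team's earlier games."""
    df = df.sort_values("GameId")
    weighted = df["awayElo"] * df["FTHG"]
    # Running totals per team, minus the current row, leave earlier games only.
    num = weighted.groupby(df["team"]).cumsum() - weighted
    den = df["awayElo"].groupby(df["team"]).cumsum() - df["awayElo"]
    return num / den  # NaN for a team's first game (0 / 0)


home_df["w"] = wt_goals_vectorized(home_df)


def wt_goals_elo(df, game_id_row, team_row):
    """Reference implementation: the filter-based definition from the question."""
    prev = df[(df.GameId < game_id_row) & (df.team == team_row)]
    if prev.empty:
        return float("nan")
    return (prev.awayElo * prev.FTHG).sum() / prev.awayElo.sum()


reference = [wt_goals_elo(home_df, g, t)
             for g, t in zip(home_df["GameId"], home_df["team"])]
print(home_df)
```

The cumulative-sum version does one pass over the data, whereas the filter-based version rescans the whole frame for every row.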
