Select value from dataframe based on other dataframe - python
I am trying to calculate the position of an object based on a timestamp. For this I have two DataFrames in pandas: one with the measurement data and one with the position data. All the movement is a straightforward acceleration.
Dataframe 1 contains the measurement data:
ms force ... ... ...
1 5 20
2 10 20
3 15 25
4 20 30
5 25 20
..... (~ 6000 lines)
Dataframe 2 contains "positioning data"
ms speed (m/s)
1 0 0.66
2 4500 0.66
3 8000 1.3
4 16000 3.0
5 20000 3.0
.....(~300 lines)
Now I want to calculate, for every row of the first dataframe, the position using the data from the second dataframe.
In Excel I solved the problem with an array formula, but now I have to use Python/pandas and I can't find a way to select the correct row from dataframe 2.
My idea is something like: if the timestamp of a measurement falls into a time range of dataframe 2, take the speed from that row.
In the end I want to display a graph of "force <-> distance" and not "force <-> time".
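Roughly, what I have in mind is something like this (untested sketch; it assumes the two frames are called df1 and df2, are both sorted by 'ms', and that the distance can be obtained by integrating the speed column):

import pandas as pd

# attach the most recent known speed to every measurement timestamp
merged = pd.merge_asof(df1.sort_values('ms'), df2.sort_values('ms'), on='ms')

# integrate speed over time to get the travelled distance
dt_s = merged['ms'].diff().fillna(0) / 1000.0   # time step in seconds
merged['distance'] = (merged['speed (m/s)'] * dt_s).cumsum()

# plot force over distance instead of force over time
merged.plot(x='distance', y='force')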
Thank you in advance.
==========================================================================
Update:
In the meantime I have almost solved my issue. Now my data looks like this:
Dataframe 2 (Speed Data):
pos v a t t-end t-start
0 -3.000 0.666667 0.000000 4.500000 4.500000 0.000000
1 0.000 0.666667 0.187037 0.071287 4.571287 4.500000
2 0.048 0.680000 0.650794 0.010244 4.581531 4.571287
3 0.055 0.686667 0.205432 0.064904 4.646435 4.581531
...
15 0.055 0.686667 0.5 0.064904 23.0 20.0
...
28 0.055 0.686667 0.6 0.064904 35.0 34.0
...
30 0.055 0.686667 0.9 0.064904 44.0 39.0
And Dataframe 1 (time based measurement):
Fx Fy Fz abs_t expected output ('a' from DF2)
0 -13.9 170.3 45.0 0.005 0.000000
1 -14.1 151.6 38.2 0.010 0.000000
...
200 -14.1 131.4 30.4 20.015 0.5
...
300 -14.3 111.9 21.1 34.01 0.6
...
400 -14.5 95.6 13.2 40.025
So I want to take the time (abs_t) from DF1 and look up the correct 'a' in DF2.
Something like this (pseudocode):

if DF1['abs_t'] between (DF2['t-start'], DF2['t-end']):
    DF1['a'] = DF2['a']
I could write two for loops, but that looks like the wrong approach and is very, very slow.
I hope you understand my problem; providing a running sample is very hard.
In Excel I solved it like this (screenshot not reproduced here).
I found a very slow pandas solution, but at least it's working :(
df1['a'] = 0
for index, row in df2.iterrows():
    start = row['t-start']
    end = row['t-end']
    a = row['a']
    # assign this interval's 'a' to every measurement whose time lies inside it
    df1.loc[(df1['abs_t'] > start) & (df1['abs_t'] < end), 'a'] = a
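For reference, the same lookup can usually be vectorised with an IntervalIndex instead of looping over df2. This is only a sketch (untested); it assumes the (t-start, t-end] intervals in df2 do not overlap and that times outside every interval should keep 0, as in the loop above:

import numpy as np
import pandas as pd

# one interval per row of df2; closed='right' keeps adjacent intervals from overlapping
intervals = pd.IntervalIndex.from_arrays(df2['t-start'], df2['t-end'], closed='right')

# for every measurement time, find the position of the interval containing it (-1 = no match)
pos = intervals.get_indexer(df1['abs_t'])

# pick the matching 'a'; fall back to 0 where no interval matched
df1['a'] = np.where(pos >= 0, df2['a'].to_numpy()[pos], 0)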
Related
How to split a dataframe containing voltage over time value, so that it can store values of each waveform/bit separately
I have several csv files with voltage-over-time data; each csv file is approximately 7000 rows and the data looks like this:

Time(us)  Voltage (V)
0         32.96554106
0.5       32.9149649
1         32.90484966
1.5       32.86438874
2         32.8542735
2.5       32.76323642
3         32.74300595
3.5       32.65196886
4         32.58116224
4.5       32.51035562
5         32.42943376
5.5       32.38897283
6         32.31816621
6.5       32.28782051
7         32.26759005
7.5       32.21701389
8         32.19678342
8.5       32.16643773
9         32.14620726
9.5       32.08551587
10        32.04505495
10.5      31.97424832
11        31.92367216
11.5      31.86298077
12        31.80228938
12.5      31.78205891
13        31.73148275
13.5      31.69102183
14        31.68090659
14.5      31.67079136
15        31.64044567
15.5      31.59998474
16        31.53929335
16.5      31.51906288

I read the csv file into a pandas dataframe and, after plotting the data from one csv file with matplotlib, the figure shows a sequence of square waveforms (figure not reproduced here). I would like to split every single square waveform/bit and store the corresponding voltage values for each bit separately, so that the voltage values of each bit end up in their own row (expected layout not reproduced here). I don't have any idea how to do that. I guess I have to write a function with a threshold value: if the voltage values are going down for maybe 20 time steps, capture all the values, or if the voltage level is going up for 20 time steps, capture all the voltage values. Could someone help?
If you get the gradient of your Voltage (here using diff, as the time is regularly spaced), this gives you the following (gradient plot not reproduced here). You can thus easily use a threshold (I tested with 2) to identify the peak starts. Then pivot your data:

# get threshold of gradient
m = df['Voltage (V)'].diff().gt(2)
# group start = value above threshold preceded by value below threshold
group = (m&~m.shift(fill_value=False)).cumsum().add(1)
df2 = (df
       .assign(id=group,
               t=lambda d: d['Time (us)'].groupby(group).apply(lambda s: s-s.iloc[0])
               )
       .pivot(index='id', columns='t', values='Voltage (V)')
      )

output:

t         0.0        0.5        1.0        1.5        2.0        2.5  \
id
1   32.965541  32.914965  32.904850  32.864389  32.854273  32.763236
2   25.045314  27.543777  29.182444  30.588462  31.114454  31.984364
3   25.166697  27.746081  29.415095  30.719960  31.326873  32.125977
4   25.277965  27.877579  29.536477  30.912149  31.367334  32.206899
5   25.379117  27.978732  29.667975  30.780651  31.670791  32.338397
6   25.631998  27.634814  28.959909  30.173737  30.659268  31.053762
7   23.528030  26.137759  27.948386  29.253251  30.244544  30.649153
8   23.639297  26.380525  28.464263  29.971432  30.902034  31.458371
9   23.740449  26.542369  28.707028  30.295120  30.881803  31.862981
10  23.871948  26.673867  28.889103  30.305235  31.185260  31.873096
11  24.387824  26.694097  28.342880  29.678091  30.315350  31.134684
...
t       748.5      749.0
id
1         NaN        NaN
2         NaN        NaN
3         NaN        NaN
4         NaN        NaN
5         NaN        NaN
6   21.059913  21.161065
7         NaN        NaN
8         NaN        NaN
9         NaN        NaN
10        NaN        NaN
11        NaN        NaN

[11 rows x 1499 columns]

plot:

df2.T.plot()
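If the aim is then to handle each bit on its own, one possible follow-up (a sketch, assuming the pivoted df2 from above) is to pull a single row out and drop the NaN padding, or to write every bit to its own file:

# take the samples of a single bit (here id 3) without the NaN padding
wave3 = df2.loc[3].dropna()

# or store every bit separately (one csv per bit; file names are just an example)
for bit_id, row in df2.iterrows():
    row.dropna().to_csv(f'bit_{bit_id}.csv')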
Is there a way to do rolling rank in Pandas?
I am trying to rank some values in one column over a rolling period of N days, instead of having the ranking done over the entire set. I have seen several methods here using rolling_apply, but I have read that this is no longer in pandas. For example, in the following table;

              A
01-01-2013  100
02-01-2013   85
03-01-2013  110
04-01-2013   60
05-01-2013   20
06-01-2013   40

For the column A above, how can I have the rank as below for N = 3;

              A  Ranked_A
01-01-2013  100       NaN
02-01-2013   85       NaN
03-01-2013  110         1
04-01-2013   60         3
05-01-2013   20         3
06-01-2013   40         2
Yes, we have a workaround: still with rolling, but it needs apply.

df.A.rolling(3).apply(lambda x: pd.Series(x).rank(ascending=False)[-1])

01-01-2013    NaN
02-01-2013    NaN
03-01-2013    1.0
04-01-2013    3.0
05-01-2013    3.0
06-01-2013    2.0
Name: A, dtype: float64
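If you are on a recent pandas version (1.4 or newer, if I remember correctly), rolling objects also expose a built-in rank, which gives the rank of the newest value in each window and avoids the per-window apply; a minimal sketch:

# rank of the last value in each 3-row window, highest value = 1
df.A.rolling(3).rank(ascending=False)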
Group By Median, Percentile and Percent of Total
I have a dataframe that looks like this...

ID  Acuity  TOTAL_ED_LOS
1   2       423
2   5       52
3   5       535
4   1       87
...

I would like to produce a table that looks like this:

Acuity  Count  Median  Percentile_25  Percentile_75  % of total
1       234    ...                                   31%
2       65     ...                                   8%
3       56     ...                                   7%
4       345    ...                                   47%
5       35     ...                                   5%

I already have code that will give me everything I need except for the % of total column:

def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

df_grp = df_merged_v1.groupby(['Acuity'])
df_grp['TOTAL_ED_LOS'].agg(['count','median', percentile(25), percentile(75)]).reset_index()

Is there an efficient way I can add the percent of total column? The link below contains code on how to obtain the percent of total, but I'm unsure how to apply it to my code. I know that I could create two tables and then merge them, but I'm curious if there is a cleaner way.

How to calculate count and percentage in groupby in Python
Here's one way to do it using some pandas builtin tools:

# Set random number seed and create a dummy dataframe with two columns
np.random.seed(123)
df = pd.DataFrame({'activity':np.random.choice([*'ABCDE'], 40),
                   'TOTAL_ED_LDS':np.random.randint(50, 500, 40)})

# Reshape dataframe to get activity per column,
# then use the output from describe and transpose
df_out = df.set_index([df.groupby('activity').cumcount(),'activity'])['TOTAL_ED_LDS']\
           .unstack().describe().T

# Calculate percent count of total count
df_out['% of Total'] = df_out['count'] / df_out['count'].sum() * 100.

df_out

Output:

          count        mean         std    min     25%    50%     75%    max  % of Total
activity
A           8.0  213.125000  106.810162   93.0  159.50  200.0  231.75  421.0        20.0
B          10.0  308.200000  116.105125   68.0  240.75  324.5  376.25  461.0        25.0
C           6.0  277.666667  117.188168  114.0  193.25  311.5  352.50  409.0        15.0
D           7.0  370.285714  124.724649  120.0  337.50  407.0  456.00  478.0        17.5
E           9.0  297.000000  160.812002   51.0  233.00  294.0  415.00  488.0        22.5
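Alternatively, sticking with the groupby/agg the question already has, the percent column can be derived from the count column after aggregating. A minimal sketch reusing the question's own names (df_merged_v1, Acuity, TOTAL_ED_LOS and the percentile helper):

# aggregate as in the question, then add '% of total' from the count column
out = (df_merged_v1.groupby('Acuity')['TOTAL_ED_LOS']
       .agg(['count', 'median', percentile(25), percentile(75)])
       .reset_index())
out['% of total'] = out['count'] / out['count'].sum() * 100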
Python: Imported csv not being split into proper columns
I am importing a csv file into python using pandas, but the dataframe ends up in only one column. I copied and pasted data from the comma-separated format of the Player Standing Field table at this link (second one) into an excel file and saved it as a csv (originally as MS-DOS, then both as normal and UTF-8 per recommendation by AllthingsGo42). But it only returned a single-column dataframe.

Examples of what I tried:

dataset=pd.read('MLB2016PlayerStats2.csv')
dataset=pd.read('MLB2016PlayerStats2.csv', delimiter=',')
dataset=pd.read_csv('MLB2016PlayerStats2.csv',encoding='ISO-8859-9', delimiter=',')

Each line of code above returned:

Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary
1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2...
2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1...
3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,...
4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,1...
5,Cristhian Adames\adamecr01,24,COL,NL,69,43,3...

Also tried:

dataset=pd.read_csv('MLB2016PlayerStats2.csv',encoding='ISO-8859-9', delimiter=',',quoting=3)

Which returned:

  "Rk                        Name  Age   Tm  Lg    G   GS   CG     Inn    Ch  \
0  "1      Fernando Abad\abadfe01   30  TOT  AL   57    0    0    46.2     4
1  "2        Jose Abreu\abreujo02   29  CHW  AL  152  152  150  1355.2  1337
2  "3       A.J. Achter\achteaj01   27  LAA  AL   27    0    0    37.2     6
3  "4     Dustin Ackley\ackledu01   28  NYY  AL   23   16   10   140.1    97
4  "5  Cristhian Adames\adamecr01   24  COL  NL   69   43   38   415.0   212

         E   DP   Fld%  Rtot  Rtot/yr  Rdrs  Rdrs/yr  RF/9  RF/G  \
0  ...   0    1  1.000   NaN      NaN   NaN      NaN  0.77  0.07
1  ...  10  131  0.993  -2.0     -2.0  -5.0     -4.0  8.81  8.73
2  ...   0    0  1.000   NaN      NaN   0.0      0.0  1.43  0.22
3  ...   0    8  1.000   1.0      9.0   3.0     27.0  6.22  4.22
4  ...   6   24  0.972  -4.0    -12.0   1.0      3.0  4.47  2.99

  Pos Summary"
0           P"
1          1B"
2           P"
3    1B-OF-2B"
4     SS-2B-3B"

Below is what the data looks like in notepad++:

"Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary"
"1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2,4,0,4,0,1,1.000,,,,,0.77,0.07,P"
"2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1355.2,1337,1243,84,10,131,.993,-2,-2,-5,-4,8.81,8.73,1B"
"3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,6,2,4,0,0,1.000,,,0,0,1.43,0.22,P"
"4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,140.1,97,89,8,0,8,1.000,1,9,3,27,6.22,4.22,1B-OF-2B"
"5,Cristhian Adames\adamecr01,24,COL,NL,69,43,38,415.0,212,68,138,6,24,.972,-4,-12,1,3,4.47,2.99,SS-2B-3B"
"6,Austin Adams\adamsau01,29,CLE,AL,19,0,0,18.1,1,0,0,1,0,.000,,,0,0,0.00,0.00,P"

Sorry for the confusion with my question before. I hope this edit will clear things up. Thank you to those that answered thus far.
Running it quickly myself, I was able to get what I understand to be the desired output. My only thought is that there is no need to specify a delimiter for a csv, because a csv is a comma-separated values file, but that should not matter. I am thinking that there is something incorrect with your actual data file, and I would go and make sure it is saved correctly. I would echo previous comments and make sure that the csv is saved as UTF-8, and not as MS-DOS or Macintosh (both options when saving in excel). Best of luck!
There is no need to specify a delimiter for a csv. You only have to change the separator from ";" to ",". For this you can open your csv file with notepad and change it with the replace tool.
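Judging by the notepad++ snippet in the question, every line is additionally wrapped in double quotes, so the whole row is read as one quoted field. A possible work-around, sketched here without having the file to test against, is to strip those outer quotes before handing the text to read_csv:

import io
import pandas as pd

# remove the surrounding quotes from every line, then parse the cleaned text
with open('MLB2016PlayerStats2.csv', encoding='utf-8') as f:
    cleaned = '\n'.join(line.strip().strip('"') for line in f)

dataset = pd.read_csv(io.StringIO(cleaned))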
Create column from multiple dataframes
I need to create some new columns based on the value of a dataframe field and a look-up dataframe with some rates.

Having df1 as

   zone     hh  hhind
0    14  112.0    3.4
1    15    5.0    4.4
2    16    0.0    1.0

and a look_up df as

    ind   per1   per2   per3   per4
0   1.0  1.000  0.000  0.000  0.000
24  3.4  0.145  0.233  0.165  0.457
34  4.4  0.060  0.114  0.075  0.751

how can I update df1.hh1 by multiplying look_up.per1, matching df1.hhind against look_up.ind?

   zone     hh  hhind     hh1
0    14  112.0    3.4  16.240
1    15    5.0    4.4   0.300
2    16    0.0    1.0   0.000

At the moment I'm getting the result by merging the tables and then doing the arithmetic:

r = pd.merge(df1, look_up, left_on="hhind", right_on="ind")
r["hh1"] = r.hh * r.per1

I'd like to know if there is a more straightforward way to accomplish this without merging the tables.
You could first set hhind and ind as the index axis of the df1 and look_up dataframes respectively. Then, multiply the corresponding elements in hh and per1 element-wise. Map these results to the column hhind and assign them to a new column, as shown:

mapper = df1.set_index('hhind')['hh'].mul(look_up.set_index('ind')['per1'])
df1.assign(hh1=df1['hhind'].map(mapper))
Another solution:

# look up the matching per1 rate for each hhind and multiply by hh
# (.iloc[0] extracts the scalar from the one-row selection)
df1['hh1'] = df1['hhind'].map(lambda x: look_up.loc[look_up["ind"] == x, "per1"].iloc[0]) * df1['hh']
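A shorter variant in the same spirit (a sketch, assuming the values in look_up.ind are unique) maps per1 through a Series keyed by ind:

# build a lookup Series indexed by ind, map it onto hhind, then multiply
rates = look_up.set_index('ind')['per1']
df1['hh1'] = df1['hh'] * df1['hhind'].map(rates)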