How to conditionally create pandas column from other column values - python
I have a dataframe that looks like this:
   word                start   stop  speaker
0  but that's alright   2.72   3.47        2
1  we'll have to        8.43   9.07        1
2  okay sure            9.19  11.01        2
3  what?               11.02  12.00        1
4  I agree             12.01  14.00        2
5  but i disagree      14.01  17.00        2
6  thats fine          17.01  19.00        1
7  however you are     19.01  22.00        1
8  like this           22.01  24.00        1
9  and                 24.01  25.00        1
I want to create two new columns, df.speaker_1 and df.speaker_2. When df.speaker == 2, I want df.speaker_2 to contain the values of df.word. When df.speaker != 2, I want it to contain an empty string. The same will be repeated for the other speaker value. It should look as below:
   word                start   stop  speaker  speaker_2           speaker_1
0  but that's alright   2.72   3.47        2  but that's alright
1  we'll have to        8.43   9.07        1                      we'll have to
2  okay sure            9.19  11.01        2  okay sure
3  what?               11.02  12.00        1                      what?
4  I agree             12.01  14.00        2  I agree
5  but i disagree      14.01  17.00        2  but i disagree
6  thats fine          17.01  19.00        1                      thats fine
7  however you are     19.01  22.00        1                      however you are
8  like this           22.01  24.00        1                      like this
9  and                 24.01  25.00        1                      and
Any advice would be appreciated, thanks.
You can copy the values from your word column, then replace them with empty strings where the speaker doesn't match:
df['speaker_1'] = df['word']
df['speaker_2'] = df['word']
df.loc[df['speaker'] != 1, 'speaker_1'] = ''
df.loc[df['speaker'] != 2, 'speaker_2'] = ''
Alternatively, you could use apply, but I find this more straightforward in your case.
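If you prefer a single-expression form, the same idea can be written with numpy.where (a sketch with a made-up miniature frame, not part of the original answer):

```python
import numpy as np
import pandas as pd

# small sample frame mimicking the question's data
df = pd.DataFrame({
    'word': ["but that's alright", "we'll have to", 'okay sure'],
    'speaker': [2, 1, 2],
})

# np.where keeps the word where the speaker matches, else an empty string
df['speaker_1'] = np.where(df['speaker'] == 1, df['word'], '')
df['speaker_2'] = np.where(df['speaker'] == 2, df['word'], '')
```

This avoids the two-step copy-then-blank pattern, at the cost of one extra import.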
You could use pd.Series.mask():
df['speaker_1'] = df.word.mask(df.speaker!=1, '')
df['speaker_2'] = df.word.mask(df.speaker!=2, '')
#    word                start  ...  speaker_1        speaker_2
# 0  but that's alright   2.72  ...                   but that's alright
# 1  we'll have to        8.43  ...  we'll have to
# 2  okay sure            9.19  ...                   okay sure
# 3  what?               11.02  ...  what?
# 4  I agree             12.01  ...                   I agree
# 5  but i disagree      14.01  ...                   but i disagree
# 6  thats fine          17.01  ...  thats fine
# 7  however you are     19.01  ...  however you are
# 8  like this           22.01  ...  like this
# 9  and                 24.01  ...  and
Related
Create new columns based on previous columns with multiplication
I want to create a list of columns where each new column is based on the previous column times 1.5. It will roll until year 2020. I tried to use previous and current but it didn't work as expected. How can I make it work as expected?

df = pd.DataFrame({
    'us2000': [5, 3, 6, 9, 2, 4],
})

a = []
for i in range(1, 21):
    a.append("us202" + str(i))

for previous, current in zip(a, a[1:]):
    df[current] = df[previous] * 1.5
IIUC you can fix your code with:

a = []
for i in range(0, 21):
    a.append(f'us20{i:02}')

for previous, current in zip(a, a[1:]):
    df[current] = df[previous] * 1.5

Another, vectorial, approach with numpy would be:

df2 = pd.DataFrame(df['us2000'].to_numpy()[:, None] * 1.5**np.arange(21),
                   columns=[f'us20{i:02}' for i in range(21)])

output:

   us2000  us2001  us2002  us2003   us2004    us2005      us2006      us2007  ...
0       5     7.5   11.25  16.875  25.3125  37.96875   56.953125   85.429688
1       3     4.5    6.75  10.125  15.1875  22.78125   34.171875   51.257812
2       6     9.0   13.50  20.250  30.3750  45.56250   68.343750  102.515625
3       9    13.5   20.25  30.375  45.5625  68.34375  102.515625  153.773438
4       2     3.0    4.50   6.750  10.1250  15.18750   22.781250   34.171875
5       4     6.0    9.00  13.500  20.2500  30.37500   45.562500   68.343750
Try:

for i in range(1, 21):
    df[f"us{2000 + i}"] = df[f"us{2000 + i - 1}"].mul(1.5)

>>> df
   us2000  us2001  us2002  ...       us2018       us2019        us2020
0       5     7.5   11.25  ...   7389.45940  11084.18910  16626.283650
1       3     4.5    6.75  ...   4433.67564   6650.51346   9975.770190
2       6     9.0   13.50  ...   8867.35128  13301.02692  19951.540380
3       9    13.5   20.25  ...  13301.02692  19951.54038  29927.310571
4       2     3.0    4.50  ...   2955.78376   4433.67564   6650.513460
5       4     6.0    9.00  ...   5911.56752   8867.35128  13301.026920

[6 rows x 21 columns]
pd.DataFrame(df.to_numpy() * [1.5**i for i in range(0, 21)]) \
  .rename(columns=lambda x: str(x).rjust(2, '0')).add_prefix("us20")

out:

   us2000  us2001  us2002  ...       us2018       us2019        us2020
0       5     7.5   11.25  ...   7389.45940  11084.18910  16626.283650
1       3     4.5    6.75  ...   4433.67564   6650.51346   9975.770190
2       6     9.0   13.50  ...   8867.35128  13301.02692  19951.540380
3       9    13.5   20.25  ...  13301.02692  19951.54038  29927.310571
4       2     3.0    4.50  ...   2955.78376   4433.67564   6650.513460
5       4     6.0    9.00  ...   5911.56752   8867.35128  13301.026920

[6 rows x 21 columns]
Select value from dataframe based on other dataframe
I try to calculate the position of an object based on a timestamp. For this I have two dataframes in pandas: one for the measurement data and one for the position. All the movement is a straightforward acceleration.

Dataframe 1 contains the measurement data (~6000 lines):

   ms  force
1   5     20
2  10     20
3  15     25
4  20     30
5  25     20

Dataframe 2 contains "positioning data" (~300 lines):

      ms  speed (m/s)
1      0         0.66
2   4500         0.66
3   8000         1.30
4  16000         3.00
5  20000         3.00

Now I want to calculate the position for the first dataframe using the data from the second dataframe. In Excel I solved the problem by using an array formula, but now I have to use Python/Pandas and I can't find a way to select the correct row from dataframe 2. In the end I want to display a graph "force <-> way" and not "force <-> time". Thank you in advance.

Update: In the meantime I could almost solve my issue. Now my data look like this:

Dataframe 2 (speed data):

       pos         v         a         t      t-end    t-start
0   -3.000  0.666667  0.000000  4.500000   4.500000   0.000000
1    0.000  0.666667  0.187037  0.071287   4.571287   4.500000
2    0.048  0.680000  0.650794  0.010244   4.581531   4.571287
3    0.055  0.686667  0.205432  0.064904   4.646435   4.581531
...
15   0.055  0.686667  0.500000  0.064904  23.000000  20.000000
...
28   0.055  0.686667  0.600000  0.064904  35.000000  34.000000
...
30   0.055  0.686667  0.900000  0.064904  44.000000  39.000000

And Dataframe 1 (time-based measurement):

        Fx     Fy    Fz   abs_t  expected output ('a' from DF2)
0    -13.9  170.3  45.0   0.005  0.000000
1    -14.1  151.6  38.2   0.010  0.000000
...
200  -14.1  131.4  30.4  20.015  0.5
...
300  -14.3  111.9  21.1  34.010  0.6
...
400  -14.5   95.6  13.2  40.025

So I want to check the time (abs_t) from DF1 and search for the correct 'a' in DF2, something like this (pseudo code):

if DF1['abs_t'] between (DF2['t-start'], DF2['t-end']):
    DF1['a'] = DF2['a']

I could write two for loops, but that looks like the wrong way and is very, very slow.
I hope you understand my problem; providing a running sample is very hard. In Excel I did it like this: (screenshot omitted)
I found a very slow solution, but at least it's working :(

df1['a'] = 0
for index, row in df2.iterrows():
    start = row['t-start']
    end = row['t-end']
    a = row['a']
    df1.loc[(df1['abs_t'] > start) & (df1['abs_t'] < end), 'a'] = a
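A vectorized alternative (a sketch with made-up miniature frames; it assumes the [t-start, t-end] intervals do not overlap) is to build a pd.IntervalIndex from the start/end columns and look every timestamp up in one shot:

```python
import numpy as np
import pandas as pd

# hypothetical miniatures of the question's two frames
df2 = pd.DataFrame({'a': [0.0, 0.5, 0.6],
                    't-start': [0.0, 20.0, 34.0],
                    't-end': [4.5, 23.0, 35.0]})
df1 = pd.DataFrame({'abs_t': [0.005, 20.015, 34.01, 40.025]})

# One interval per df2 row; closed='neither' mirrors the strict > and <
# comparisons in the loop. get_indexer returns, for each timestamp, the
# position of the interval containing it, or -1 if none does.
intervals = pd.IntervalIndex.from_arrays(df2['t-start'], df2['t-end'],
                                         closed='neither')
idx = intervals.get_indexer(df1['abs_t'])
df1['a'] = np.where(idx >= 0, df2['a'].to_numpy()[idx], 0.0)
```

This replaces the per-row iterrows pass with two vectorized operations, which matters at ~6000 × ~300 rows.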
Python: Imported csv not being split into proper columns
I am importing a csv file into python using pandas but the data frame ends up in only one column. I copied and pasted data from the comma-separated format of The Player Standing Field table at this link (second one) into an excel file and saved it as a csv (originally as MS-DOS, then both as normal and UTF-8 per recommendation by AllthingsGo42). But it only returned a single-column data frame.

Examples of what I tried:

dataset = pd.read_csv('MLB2016PlayerStats2.csv')
dataset = pd.read_csv('MLB2016PlayerStats2.csv', delimiter=',')
dataset = pd.read_csv('MLB2016PlayerStats2.csv', encoding='ISO-8859-9', delimiter=',')

Each line of code above returned:

Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary
1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2...
2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1...
3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,...
4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,1...
5,Cristhian Adames\adamecr01,24,COL,NL,69,43,3...

Also tried:

dataset = pd.read_csv('MLB2016PlayerStats2.csv', encoding='ISO-8859-9', delimiter=',', quoting=3)

Which returned:

  "Rk                        Name  Age   Tm  Lg    G   GS   CG     Inn    Ch  \
0  "1      Fernando Abad\abadfe01   30  TOT  AL   57    0    0    46.2     4
1  "2        Jose Abreu\abreujo02   29  CHW  AL  152  152  150  1355.2  1337
2  "3       A.J. Achter\achteaj01   27  LAA  AL   27    0    0    37.2     6
3  "4     Dustin Ackley\ackledu01   28  NYY  AL   23   16   10   140.1    97
4  "5  Cristhian Adames\adamecr01   24  COL  NL   69   43   38   415.0   212

        E   DP   Fld%  Rtot  Rtot/yr  Rdrs  Rdrs/yr  RF/9  RF/G  \
0  ...  0    1  1.000   NaN      NaN   NaN      NaN  0.77  0.07
1  ... 10  131  0.993  -2.0     -2.0  -5.0     -4.0  8.81  8.73
2  ...  0    0  1.000   NaN      NaN   0.0      0.0  1.43  0.22
3  ...  0    8  1.000   1.0      9.0   3.0     27.0  6.22  4.22
4  ...  6   24  0.972  -4.0    -12.0   1.0      3.0  4.47  2.99

  Pos Summary"
0           P"
1          1B"
2           P"
3   1B-OF-2B"
4   SS-2B-3B"

Below is what the data looks like in notepad++:

"Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary"
"1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2,4,0,4,0,1,1.000,,,,,0.77,0.07,P"
"2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1355.2,1337,1243,84,10,131,.993,-2,-2,-5,-4,8.81,8.73,1B"
"3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,6,2,4,0,0,1.000,,,0,0,1.43,0.22,P"
"4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,140.1,97,89,8,0,8,1.000,1,9,3,27,6.22,4.22,1B-OF-2B"
"5,Cristhian Adames\adamecr01,24,COL,NL,69,43,38,415.0,212,68,138,6,24,.972,-4,-12,1,3,4.47,2.99,SS-2B-3B"
"6,Austin Adams\adamsau01,29,CLE,AL,19,0,0,18.1,1,0,0,1,0,.000,,,0,0,0.00,0.00,P"

Sorry for the confusion with my question before. I hope this edit clears things up. Thank you to those who have answered thus far.
Running it quickly myself, I was able to get what I understand is the desired output. My only thought is that there is no need to specify a delimiter for a csv, because a csv is a comma-separated values file, but that should not matter. I suspect there is something wrong with your actual data file, and I would go and make sure it is saved correctly. I would echo previous comments and make sure that the csv is saved as UTF-8, and not as MS-DOS or Macintosh (both options when saving in Excel). Best of luck!
There is no need to specify a delimiter for a csv. You only have to change the separator from ";" to ",". For this you can open your csv file with Notepad and change them with the replace tool.
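Judging from the notepad++ view, each physical line is wrapped in an extra pair of double quotes, so the parser sees every row as one big quoted field. One possible fix (a sketch using an inlined hypothetical sample rather than the real file) is to strip those wrapping quotes before handing the text to pandas:

```python
import io
import pandas as pd

# hypothetical sample mimicking the notepad++ view: every whole line
# is wrapped in double quotes
raw = ('"Rk,Name,Age"\n'
       '"1,Fernando Abad\\abadfe01,30"\n'
       '"2,Jose Abreu\\abreujo02,29"\n')

# strip the wrapping quotes from each line, then parse normally
cleaned = '\n'.join(line.strip('"') for line in raw.splitlines())
df = pd.read_csv(io.StringIO(cleaned))
```

For the real file, the same stripping can be applied to the lines read from disk before passing them to read_csv.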
Nested if loop with DataFrame is very, very slow
I have 10 million rows to go through and it will take many hours to process; I must be doing something wrong. I converted the names of my df variables for ease in typing:

Close = df['Close']
eqId = df['eqId']
date = df['date']
IntDate = df['IntDate']
expiry = df['expiry']
delta = df['delta']
ivMid = df['ivMid']
conf = df['conf']

The below code works fine, just ungodly slow. Any suggestions?

print(datetime.datetime.now().time())
for i in range(2, 1000):
    if delta[i] == 90:
        if delta[i-1] == 50:
            if delta[i-2] == 10:
                if expiry[i] == expiry[i-2]:
                    df.Skew[i] = ivMid[i] - ivMid[i-2]
print(datetime.datetime.now().time())

14:02:11.014396
14:02:13.834275

df.head(100)

        Close  eqId        date  IntDate  expiry  delta   ivMid    conf    Skew
0   37.380005     7  2008-01-02    39447       1     50  0.3850  0.8663
1   37.380005     7  2008-01-02    39447       1     90  0.5053  0.7876
2   36.960007     7  2008-01-03    39448       1     50  0.3915  0.8597
3   36.960007     7  2008-01-03    39448       1     90  0.5119  0.7438
4   35.179993     7  2008-01-04    39449       1     50  0.4055  0.8454
5   35.179993     7  2008-01-04    39449       1     90  0.5183  0.7736
6   33.899994     7  2008-01-07    39452       1     50  0.4464  0.8400
7   33.899994     7  2008-01-07    39452       1     90  0.5230  0.7514
8   31.250000     7  2008-01-08    39453       1     10  0.4453  0.7086
9   31.250000     7  2008-01-08    39453       1     50  0.4826  0.8246
10  31.250000     7  2008-01-08    39453       1     90  0.5668  0.6474  0.1215
11  30.830002     7  2008-01-09    39454       1     10  0.4716  0.7186
12  30.830002     7  2008-01-09    39454       1     50  0.4963  0.8479
13  30.830002     7  2008-01-09    39454       1     90  0.5735  0.6704  0.1019
14  31.460007     7  2008-01-10    39455       1     10  0.4254  0.6737
15  31.460007     7  2008-01-10    39455       1     50  0.4929  0.8218
16  31.460007     7  2008-01-10    39455       1     90  0.5902  0.6411  0.1648
17  30.699997     7  2008-01-11    39456       1     10  0.4868  0.7183
18  30.699997     7  2008-01-11    39456       1     50  0.4965  0.8411
19  30.639999     7  2008-01-14    39459       1     10  0.5117  0.7620
20  30.639999     7  2008-01-14    39459       1     50  0.4989  0.8804
21  30.639999     7  2008-01-14    39459       1     90  0.5887  0.6845  0.0770
22  29.309998     7  2008-01-15    39460       1     10  0.4956  0.7363
23  29.309998     7  2008-01-15    39460       1     50  0.5054  0.8643
24  30.080002     7  2008-01-16    39461       1     10  0.4983  0.6646
...

At this rate it will take 7.77 hrs to process.
Basically, the whole point of numpy & pandas is to avoid loops like the plague and do things in a vectorial way. As you noticed, without that, speed is gone. Let's break your problem into steps.

The conditions

Your first condition can be written like this:

df.delta == 90

(Note how this compares the entire column at once. This is much, much faster than your loop!) The second one can be written like this (using shift):

df.delta.shift(1) == 50

The rest of your conditions are similar. Note that to combine conditions, you need to use parentheses. So the first two conditions, together, should be written as:

(df.delta == 90) & (df.delta.shift(1) == 50)

You should now be able to write an expression combining all your conditions. Let's call it cond, i.e.,

cond = (df.delta == 90) & (df.delta.shift(1) == 50) & ...

The assignment

To assign things to a new column, use:

df['Skew'] = ...

We just need to figure out what to put on the right-hand side.

The right-hand side

Since we have cond, we can write the right-hand side as:

np.where(cond, df.ivMid - df.ivMid.shift(2), 0)

What this says is: where the condition is true, take the second argument; where it's not, take the third (here I used 0, but use whatever you like). By combining all of this, you should be able to write a very efficient version of your code.
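Putting the steps above together on a miniature frame (a sketch; the column values are made up to mirror the head of the question's data, and NaN is used instead of 0 for non-matching rows):

```python
import numpy as np
import pandas as pd

# miniature frame shaped like the question's data
df = pd.DataFrame({'delta':  [10, 50, 90, 10, 50, 90],
                   'expiry': [1, 1, 1, 1, 1, 2],
                   'ivMid':  [0.4453, 0.4826, 0.5668, 0.4716, 0.4963, 0.5735]})

# all four nested-if conditions combined into one boolean Series
cond = ((df.delta == 90)
        & (df.delta.shift(1) == 50)
        & (df.delta.shift(2) == 10)
        & (df.expiry == df.expiry.shift(2)))

# skew where the 10/50/90 pattern (with matching expiry) holds, NaN elsewhere
df['Skew'] = np.where(cond, df.ivMid - df.ivMid.shift(2), np.nan)
```

The last row stays NaN because its expiry differs from the row two back, exactly as the innermost if would require.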
Appending data row from one dataframe to another with respect to date
I am brand new to pandas and working with two dataframes. My goal is to append the non-date values of df_ls (below) column-wise to their nearest respective date in df_1. Is the only way to do this with a traditional for-loop, or is there some more effective built-in method/function? I have googled this extensively without any luck and have only found ways to append blocks of dataframes to other dataframes. I haven't found a way to search through a dataframe and append a row from another dataframe at the nearest respective date. See example below.

Example of the first dataframe (let's call it df_ls):

          DATE  ALBEDO_SUR  B13_RATIO  B23_RATIO    B1_RAW    B2_RAW
0   1999-07-04    0.070771   1.606958   1.292280  0.128069  0.103018
1   1999-07-20    0.030795   2.326290   1.728147  0.099020  0.073595
2   1999-08-21    0.022819   2.492871   1.762536  0.096888  0.068502
3   1999-09-06    0.014613   2.792271   1.894225  0.090590  0.061445
4   1999-10-08    0.004978   2.781847   1.790768  0.089291  0.057521
5   1999-10-24    0.003144   2.818474   1.805257  0.090623  0.058054
6   1999-11-09    0.000859   3.146100   1.993941  0.092787  0.058823
7   1999-12-11    0.000912   2.913604   1.656642  0.097239  0.055357
8   1999-12-27    0.000877   2.974692   1.799949  0.098282  0.059427
9   2000-01-28    0.000758   3.092533   1.782112  0.095153  0.054809
10  2000-03-16    0.002933   2.969185   1.727465  0.083059  0.048322
11  2000-04-01    0.016814   2.366437   1.514110  0.089720  0.057398
12  2000-05-03    0.047370   1.847763   1.401930  0.109767  0.083290
13  2000-05-19    0.089432   1.402798   1.178798  0.137965  0.115936
14  2000-06-04    0.056340   1.807828   1.422489  0.118601  0.093328

Example of the second dataframe (let's call it df_1):

   Sample Date  Value
0   2000-05-09   1.68
1   2000-05-09   1.68
2   2000-05-18   1.75
3   2000-05-18   1.75
4   2000-05-31   1.40
5   2000-05-31   1.40
6   2000-06-13   1.07
7   2000-06-13   1.07
8   2000-06-27   1.49
9   2000-06-27   1.49
10  2000-07-11   2.29
11  2000-07-11   2.29

In the end, my goal is to have something like this (note the appended values are the values closest to the Sample Date, even though they don't match up perfectly):

   Sample Date  Value  ALBEDO_SUR  B13_RATIO  B23_RATIO    B1_RAW    B2_RAW
0   2000-05-09   1.68    0.047370   1.847763   1.401930  0.109767  0.083290
1   2000-05-09   1.68    0.047370   1.847763   1.401930  0.109767  0.083290
2   2000-05-18   1.75    0.089432   1.402798   1.178798  0.137965  0.115936
3   2000-05-18   1.75    0.089432   1.402798   1.178798  0.137965  0.115936
4   2000-05-31   1.40    0.056340   1.807828   1.422489  0.118601  0.093328
5   2000-05-31   1.40    0.056340   1.807828   1.422489  0.118601  0.093328
6   2000-06-13   1.07    etc. ...
7   2000-06-13   1.07    etc. ...
8   2000-06-27   1.49    etc. ...
9   2000-06-27   1.49
10  2000-07-11   2.29
11  2000-07-11   2.29

Thanks for any and all help. As I said, I am new to this; I have experience with this sort of thing in MATLAB, but pandas is new to me. Thanks
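One loop-free way to do this kind of nearest-date join (a sketch on truncated versions of the two frames, not an authoritative answer) is pd.merge_asof with direction='nearest', which matches each Sample Date to the closest DATE:

```python
import pandas as pd

# truncated versions of the two frames from the question
df_ls = pd.DataFrame({
    'DATE': pd.to_datetime(['2000-05-03', '2000-05-19', '2000-06-04']),
    'ALBEDO_SUR': [0.047370, 0.089432, 0.056340],
})
df_1 = pd.DataFrame({
    'Sample Date': pd.to_datetime(['2000-05-09', '2000-05-18', '2000-05-31']),
    'Value': [1.68, 1.75, 1.40],
})

# merge_asof requires both frames sorted on their keys; direction='nearest'
# picks the closest DATE on either side of each Sample Date
out = pd.merge_asof(df_1.sort_values('Sample Date'),
                    df_ls.sort_values('DATE'),
                    left_on='Sample Date', right_on='DATE',
                    direction='nearest')
```

With the full df_ls, the remaining value columns (B13_RATIO, B23_RATIO, B1_RAW, B2_RAW) come along in the same merge.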