iterate over certain columns in data frame - python
Hi, I have a data frame like this:
Ticker P/E P/S P/B P/FCF Dividend
No.
1 NTCT 457.32 3.03 1.44 26.04 -
2 GWRE 416.06 9.80 5.33 45.62 -
3 PEGA 129.02 4.41 9.85 285.10 0.28%
4 BLKB 87.68 4.96 14.36 41.81 0.62%
First, I want to convert the values in the columns that contain numbers (which are currently strings) to floats. Here that would be the four middle columns. Would a simple loop work in this case?
Second, there is a problem with the last column, 'Dividend', which holds a percentage value as a string. I can convert it to a decimal, but I was wondering whether there is a way to keep the % sign while the values remain usable in calculations.
Any ideas for those two issues?
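For the first point, a simple loop over the numeric columns does work if each one is run through pd.to_numeric; a minimal sketch (df being the frame shown above, column names taken from the sample):

import pandas as pd

num_cols = ['P/E', 'P/S', 'P/B', 'P/FCF']   # the four middle columns
for col in num_cols:
    # errors='coerce' turns any non-numeric string into NaN instead of raising
    df[col] = pd.to_numeric(df[col], errors='coerce')

The answers below achieve the same thing without an explicit loop.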
Plan
Take out 'Ticker' because it isn't numeric
Use assign to overwrite Dividend, stripping off the '%'
Use apply with pd.to_numeric to convert all the columns
Use eval to scale Dividend to the proper decimal value
df[['Ticker']].join(
    df.assign(
        Dividend=df.Dividend.str.strip('%')
    ).drop(columns='Ticker').apply(
        pd.to_numeric, errors='coerce'
    )
).eval('Dividend = Dividend / 100', inplace=False)
Ticker P/E P/S P/B P/FCF Dividend
No.
1 NTCT 457.32 3.03 1.44 26.04 NaN
2 GWRE 416.06 9.80 5.33 45.62 NaN
3 PEGA 129.02 4.41 9.85 285.10 0.0028
4 BLKB 87.68 4.96 14.36 41.81 0.0062
More lines, but more readable:
nums = df.drop(columns='Ticker').assign(Dividend=df.Dividend.str.strip('%'))
nums = nums.apply(pd.to_numeric, errors='coerce')
nums = nums.assign(Dividend=nums.Dividend / 100)
df[['Ticker']].join(nums)
Ticker P/E P/S P/B P/FCF Dividend
No.
1 NTCT 457.32 3.03 1.44 26.04 NaN
2 GWRE 416.06 9.80 5.33 45.62 NaN
3 PEGA 129.02 4.41 9.85 285.10 0.0028
4 BLKB 87.68 4.96 14.36 41.81 0.0062
Assuming that all P/... columns contain proper numbers:
In [47]: df.assign(Dividend=pd.to_numeric(df.Dividend.str.replace(r'\%',''), errors='coerce')
...: .div(100)) \
...: .set_index('Ticker', append=True) \
...: .astype('float') \
...: .reset_index('Ticker')
...:
Out[47]:
Ticker P/E P/S P/B P/FCF Dividend
No.
1 NTCT 457.32 3.03 1.44 26.04 NaN
2 GWRE 416.06 9.80 5.33 45.62 NaN
3 PEGA 129.02 4.41 9.85 285.10 0.0028
4 BLKB 87.68 4.96 14.36 41.81 0.0062
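Neither answer keeps the % sign the question also asked about. One option (a sketch, not from the answers above; it reuses the nums frame from the step-by-step version) is to keep Dividend as a plain decimal and add the % only at display time with pandas' Styler, so the stored values stay calculable:

out = df[['Ticker']].join(nums)
# Dividend remains a float (0.0028, 0.0062, ...) in the frame itself;
# the percent sign is applied only when the table is rendered (e.g. in a notebook)
out.style.format({'Dividend': '{:.2%}'}, na_rep='-')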
Related
Slicing pandas dataframe by ordered values into clusters
I have a pandas dataframe with longer gaps in time, and I want to slice it into smaller dataframes so that the time "clusters" stay together:

    Time         Value
0   56610.41341  8.55
1   56587.56394  5.27
2   56590.62965  6.81
3   56598.63790  5.47
4   56606.52203  6.71
5   56980.44206  4.75
6   56592.53327  6.53
7   57335.52837  0.74
8   56942.59094  6.96
9   56921.63669  9.16
10  56599.52053  6.14
11  56605.50235  5.20
12  57343.63828  3.12
13  57337.51641  3.17
14  56593.60374  5.69
15  56882.61571  9.50

I tried sorting this and taking the time difference of two consecutive points with

df = df.sort_values("Time")
df['t_dif'] = df['Time'] - df['Time'].shift(-1)

and it gives

    Time         Value  t_dif
1   56587.56394  5.27   -3.06571
2   56590.62965  6.81   -1.90362
6   56592.53327  6.53   -1.07047
14  56593.60374  5.69   -5.03416
3   56598.63790  5.47   -0.88263
10  56599.52053  6.14   -5.98182
11  56605.50235  5.20   -1.01968
4   56606.52203  6.71   -3.89138
0   56610.41341  8.55   -272.20230
15  56882.61571  9.50   -39.02098
9   56921.63669  9.16   -20.95425
8   56942.59094  6.96   -37.85112
5   56980.44206  4.75   -355.08631
7   57335.52837  0.74   -1.98804
13  57337.51641  3.17   -6.12187
12  57343.63828  3.12   NaN

Let's say I want to slice this dataframe into smaller dataframes where the time difference between two consecutive points is smaller than 40. How would I go about doing this? I could loop over the rows, but that is frowned upon, so is there a smarter solution?

Edit: here is an example:

df1:
    Time         Value  t_dif
1   56587.56394  5.27   -3.06571
2   56590.62965  6.81   -1.90362
6   56592.53327  6.53   -1.07047
14  56593.60374  5.69   -5.03416
3   56598.63790  5.47   -0.88263
10  56599.52053  6.14   -5.98182
11  56605.50235  5.20   -1.01968
4   56606.52203  6.71   -3.89138

df2:
0   56610.41341  8.55   -272.20230

df3:
15  56882.61571  9.50   -39.02098
9   56921.63669  9.16   -20.95425
8   56942.59094  6.96   -37.85112

... etc.
I think you can just

df1 = df[df['t_dif'] < 30]
df2 = df[df['t_dif'] >= 30]
def split_dataframe(df, value):
    df = df.sort_values("Time")
    df = df.reset_index()
    df['t_dif'] = (df['Time'] - df['Time'].shift(-1)).abs()
    indxs = df.index[df['t_dif'] > value].tolist()
    indxs.append(-1)
    indxs.append(len(df))
    indxs.sort()
    frames = []
    for i in range(1, len(indxs)):
        # rows between the previous gap boundary and this one (boundary row included)
        val = df.iloc[indxs[i - 1] + 1: indxs[i] + 1]
        frames.append(val)
    return frames

Returns the correct dataframes as a list.
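An alternative that avoids the explicit index bookkeeping (a sketch, not part of the original answer; the threshold 40 and the column name follow the question) is to label each cluster with a cumulative sum over the large gaps and then split with groupby:

df = df.sort_values("Time").reset_index(drop=True)
gap = df['Time'].diff().abs()            # gap to the previous row (NaN for the first row)
cluster_id = (gap > 40).cumsum()         # a new id starts wherever the gap exceeds 40
frames = [g for _, g in df.groupby(cluster_id)]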
Dataframe split columns value, how to solve error message?
I have a pandas dataframe with the following columns:

  Stock      ROC5   ROC20  ROC63  ROCmean
0 IBGL.SW    -0.59   3.55   6.57     3.18
0 EHYA.SW     0.98   4.00   6.98     3.99
0 HIGH.SW     0.94   4.22   7.18     4.11
0 IHYG.SW     0.56   2.46   6.16     3.06
0 HYGU.SW     1.12   4.56   7.82     4.50
0 IBCI.SW     0.64   3.57   6.04     3.42
0 IAEX.SW     8.34  18.49  14.95    13.93
0 AGED.SW     9.45  24.74  28.13    20.77
0 ISAG.SW     7.97  21.61  34.34    21.31
0 IAPD.SW     0.51   6.62  19.54     8.89
0 IASP.SW     1.08   2.54  12.18     5.27
0 RBOT.SW    10.35  30.53  39.15    26.68
0 RBOD.SW    11.33  30.50  39.69    27.17
0 BRIC.SW     7.24  11.08  75.60    31.31
0 CNYB.SW     1.14   4.78   8.36     4.76
0 FXC.SW      5.68  13.84  19.29    12.94
0 DJSXE.SW    3.11   9.24   6.44     6.26
0 CSSX5E.SW  -0.53   5.29  11.85     5.54

How can I write a new column "Symbol" in the dataframe with the stock ticker without ".SW"? For example, the first row's result should be IBGL (from IBGL.SW) and the last row's result should be CSSX5E (from CSSX5E.SW).

If I run the following command:

new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]

then I receive an error message:

:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]

How can I solve this problem? Thanks a lot for your support.
METHOD 1: You can do a vectorized operation with str.get(0):

df['SYMBOL'] = df['Stock'].str.split('.').str.get(0)

METHOD 2: You can do another vectorized operation by using expand=True in str.split() and then taking the first column:

df['SYMBOL'] = df['Stock'].str.split('.', expand=True)[0]

METHOD 3: Or you can write a custom lambda function with apply (for more complex processes). Note, this is slower, but good if you have your own UDF:

df['SYMBOL'] = df['Stock'].apply(lambda x: x.split('.')[0])
This is not an error but a warning; as you may have noticed, your script finishes its execution.

Edit: given your comments, it seems the issue originates earlier in the code, therefore I suggest you use the following:

new_df = new_df.copy(deep=False)

and then proceed to solve it with:

new_df.loc[:, 'Symbol'] = new_df['Stock'].str.split('.').str[0]
new_df = new_df.copy()
# regex=False so the '.' is matched literally rather than as a regex wildcard
new_df['Symbol'] = new_df.Stock.str.replace('.SW', '', regex=False)
Can't replace string symbol "-" in dataframe on jupyter notebook
I import data from url = ("http://finviz.com/quote.ashx?t=" + symbol.lower()) and got this table:

        P/B    P/E  Forward P/E   PEG  Debt/Eq  EPS (ttm)  Dividend %     ROE  \
AMZN  18.73  92.45        56.23  2.09     1.21      16.25           -  26.70%
GOOG   4.24  38.86            -  2.55        -      26.65           -       -
PG     4.47  22.67        19.47  3.45     0.61       4.05       3.12%  18.80%
KO    11.04  30.26        21.36  4.50     2.45       1.57       3.29%  15.10%
IBM    5.24   9.28         8.17  9.67     2.37      12.25       5.52%  30.90%

         ROI   EPS Q/Q  Insider Own
AMZN   3.50%  1026.20%       16.20%
GOOG       -    36.50%        5.74%
PG    13.10%    15.50%        0.10%
KO    12.50%    56.80%        0.10%
IBM   17.40%     0.70%        0.10%

Then I was trying to convert the strings to floats:

df = df[(df['P/E'].astype(float) < 20) & (df['P/B'].astype(float) < 3)]

and got "ValueError: could not convert string to float:". I think values like 0.70% and the "-" sign are the problem. I tried:

df.replace("-", "0")
df.replace('-', 0)
df.replace('-', nan)

But nothing works.
You may need to assign it back:

df = df.replace("-", "0")

And I recommend to_numeric:

df['P/E'] = pd.to_numeric(df['P/E'], errors='coerce')
df['P/B'] = pd.to_numeric(df['P/B'], errors='coerce')
You should use numpy:

import numpy as np

Then do the replacement:

df = df.replace('-', np.nan)

Next, change the datatype:

df['Forward P/E'] = df['Forward P/E'].astype(float)

Lastly, you can test whether the datatype is float64.
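That last check could look like this (a small sketch, using the same column as the answer above):

# confirm the conversion produced a 64-bit float column
print(df['Forward P/E'].dtype)                # float64
print(df['Forward P/E'].dtype == 'float64')   # True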
Columns error while converting to float
df = pd.read_csv('TestdataNALA.csv', sep=';', na_values='ND')
df.fillna(0, inplace=True)
df = df[['ID','VegType','Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']]

new_list = df[['Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']]
nnew_list = [float(i) for i in new_list]

print(type(df['ID']))        #--> <class 'pandas.core.series.Series'>
print('ID' in df.columns)    #--> True

Some data from the CSV:

ID;VegType;Elevation;C;N;C/N;Soil13C;Soil15N;pH;Na_[mg/g];K_[mg/g];Mg_[mg/g];Ca_[mg/g];P_[µg/g]
A2425WC;Cyt;2425;2,78;0,23;12,14;-24,43;4,85;4,88;0,0005;0,0772;0,0646;0,3588;13,8400
C2489WCG;Cyt_Gen;2489;2,18;0,17;12,50;-24,51;5,53;4,21;0,0010;0,0639;0,0286;0,0601;0,6800
C2591WCG;Cyt_Gen;2591;5,13;0,36;14,29;-25,52;3,41;4,30;0,0046;0,0854;0,1169;0,7753;5,7000
E2695WC;Cyt;2695;1,43;0,14;10,55;-25,71;5,75;4,95;ND;0,0766;0,0441;0,0978;8,8500
A2900WC;Cyt;2900;6,95;0,53;13,11;-25,54;3,38;5,35;0,0032;0,1119;0,2356;1,8050;14,7100
A2800WC;Cyt;2800;1,88;0,16;11,62;-23,68;5,88;5,28;0,0025;0,0983;0,0770;0,3777;5,4200
A3050WC;Cyt;3050;1,50;0,12;12,50;-24,62;2,23;5,97;ND;0,0696;0,0729;0,5736;9,4000

When I run this, I get the error:

ValueError: could not convert string to float: 'Elevation'

It seems as if it's trying to convert the column headings, but I only wanted to convert the values in the selected columns.
To convert columns to numeric, you should use pd.to_numeric:

cols = ['Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

Your code will not work because, like a dictionary and its keys, when you loop through a dataframe you only find the column headings.

Update 1

Try also using the options below:

import pandas as pd
from io import StringIO

mystr = StringIO("""
<bound method NDFrame.head of          ID  VegType  Elevation     C     N    C/N  Soil13C  Soil15N    pH  \
1  C2489WCG  Cyt_Gen       2489  2,18  0,17  12,50   -24,51     5,53  4,21
2  C2591WCG  Cyt_Gen       2591  5,13  0,36  14,29   -25,52     3,41  4,30
3   E2695WC      Cyt       2695  1,43  0,14  10,55   -25,71     5,75  4,95
""")

df = pd.read_csv(mystr, skiprows=1, decimal=',', header=None, delim_whitespace=True)

#    0         1        2     3     4     5      6      7     8     9
# 0  1  C2489WCG  Cyt_Gen  2489  2.18  0.17  12.50 -24.51  5.53  4.21
# 1  2  C2591WCG  Cyt_Gen  2591  5.13  0.36  14.29 -25.52  3.41  4.30
# 2  3   E2695WC      Cyt  2695  1.43  0.14  10.55 -25.71  5.75  4.95

Update 2

import pandas as pd
from io import StringIO

mystr = StringIO("""ID;VegType;Elevation;C;N;C/N;Soil13C;Soil15N;pH;Na_[mg/g];K_[mg/g];Mg_[mg/g];Ca_[mg/g];P_[µg/g]
A2425WC;Cyt;2425;2,78;0,23;12,14;-24,43;4,85;4,88;0,0005;0,0772;0,0646;0,3588;13,8400
C2489WCG;Cyt_Gen;2489;2,18;0,17;12,50;-24,51;5,53;4,21;0,0010;0,0639;0,0286;0,0601;0,6800
C2591WCG;Cyt_Gen;2591;5,13;0,36;14,29;-25,52;3,41;4,30;0,0046;0,0854;0,1169;0,7753;5,7000
E2695WC;Cyt;2695;1,43;0,14;10,55;-25,71;5,75;4,95;ND;0,0766;0,0441;0,0978;8,8500
A2900WC;Cyt;2900;6,95;0,53;13,11;-25,54;3,38;5,35;0,0032;0,1119;0,2356;1,8050;14,7100
A2800WC;Cyt;2800;1,88;0,16;11,62;-23,68;5,88;5,28;0,0025;0,0983;0,0770;0,3777;5,4200
A3050WC;Cyt;3050;1,50;0,12;12,50;-24,62;2,23;5,97;ND;0,0696;0,0729;0,5736;9,4000
""")

df = pd.read_csv(mystr, decimal=',', delimiter=';')

#          ID  VegType  Elevation     C     N    C/N  Soil13C  Soil15N    pH  \
# 0   A2425WC      Cyt       2425  2.78  0.23  12.14   -24.43     4.85  4.88
# 1  C2489WCG  Cyt_Gen       2489  2.18  0.17  12.50   -24.51     5.53  4.21
# 2  C2591WCG  Cyt_Gen       2591  5.13  0.36  14.29   -25.52     3.41  4.30
# 3   E2695WC      Cyt       2695  1.43  0.14  10.55   -25.71     5.75  4.95
# 4   A2900WC      Cyt       2900  6.95  0.53  13.11   -25.54     3.38  5.35
# 5   A2800WC      Cyt       2800  1.88  0.16  11.62   -23.68     5.88  5.28
# 6   A3050WC      Cyt       3050  1.50  0.12  12.50   -24.62     2.23  5.97
#
#   Na_[mg/g]  K_[mg/g]  Mg_[mg/g]  Ca_[mg/g]  P_[µg/g]
# 0    0,0005    0.0772     0.0646     0.3588     13.84
# 1    0,0010    0.0639     0.0286     0.0601      0.68
# 2    0,0046    0.0854     0.1169     0.7753      5.70
# 3        ND    0.0766     0.0441     0.0978      8.85
# 4    0,0032    0.1119     0.2356     1.8050     14.71
# 5    0,0025    0.0983     0.0770     0.3777      5.42
# 6        ND    0.0696     0.0729     0.5736      9.40
Cannot convert input to Timestamp, bday_range(...) - Pandas/Python
Looking to generate a number for the business days between the current date and the end of the month for a pandas dataframe, e.g. 26/06/2017 - 4, 23/06/2017 - 5.

I'm having trouble as I keep getting a TypeError:

TypeError: Cannot convert input to Timestamp

from the line:

result['bdaterange'] = pd.bdate_range(pd.to_datetime(result['dte'], unit='ns').values, pd.to_datetime(result['bdate'], unit='ns').values)

I have a DataFrame result with the column dte in a date format, and I'm trying to create a new column (bdaterange) as a simple integer/float that I can use to see how far from month end in business days each row is.

Sample data:

          bid   ask   spread  dte         day        bdate
01:49:00  2.17  3.83  1.66    2016-12-20  20.858333  2016-12-30
02:38:00  2.2   3.8   1.60    2016-12-20  20.716667  2016-12-30
22:15:00  2.63  3.12  0.49    2016-12-20  21.166667  2016-12-30
03:16:00  1.63  2.38  0.75    2016-12-21  21.391667  2016-12-30
07:11:00  1.46  2.54  1.08    2016-12-21  21.475000  2016-12-30

I've tried BDay() and the rule that the weekday cannot be 6 or 7 in the calculation, but have not got anywhere. I came across bdate_range, which I believe is exactly what I'm looking for, but the closest I've got gives me the error "Cannot convert input to Timestamp". My attempt is:

result['bdate'] = pd.to_datetime(result['dte']) + BMonthEnd(0)
result['bdaterange'] = pd.bdate_range(pd.to_datetime(result['dte'], unit='ns').values, pd.to_datetime(result['bdate'], unit='ns').values)
print(result['bdaterange'])

Not sure how to solve the error though.
I think you need the length of bdate_range for each row, so you need a custom function with apply:

# convert only once to datetime
result['dte'] = pd.to_datetime(result['dte'])

f = lambda x: len(pd.bdate_range(x['dte'], x['dte'] + pd.offsets.BMonthEnd(0)))
result['bdaterange'] = result.apply(f, axis=1)
print (result)

           bid   ask  spread        dte        day  bdaterange
01:49:00  2.17  3.83    1.66 2016-12-20  20.858333           9
02:38:00  2.20  3.80    1.60 2016-12-20  20.716667           9
22:15:00  2.63  3.12    0.49 2016-12-20  21.166667           9
03:16:00  1.63  2.38    0.75 2016-12-21  21.391667           8
07:11:00  1.46  2.54    1.08 2016-12-21  21.475000           8
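For larger frames, a vectorized alternative could use numpy's business-day counter instead of applying bdate_range row by row. A sketch, assuming result['dte'] has already been converted to datetime as above; note that np.busday_count counts a half-open interval, so the month-end day is added back explicitly to match len(bdate_range(...)):

import numpy as np
import pandas as pd

month_end = result['dte'] + pd.offsets.BMonthEnd(0)
start = result['dte'].values.astype('datetime64[D]')
end = month_end.values.astype('datetime64[D]')

# business days in [dte, month_end), plus 1 if the month end itself is a business day
result['bdaterange'] = np.busday_count(start, end) + np.is_busday(end)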