Iterate over certain columns in a data frame - Python

Hi, I have a data frame like this:
Ticker P/E P/S P/B P/FCF Dividend
No.
1 NTCT 457.32 3.03 1.44 26.04 -
2 GWRE 416.06 9.80 5.33 45.62 -
3 PEGA 129.02 4.41 9.85 285.10 0.28%
4 BLKB 87.68 4.96 14.36 41.81 0.62%
Firstly, I want to convert the values in the columns that contain numbers (which are currently strings) to float values. Here that would be the 4 middle columns. Would a simple loop work in this case?
Secondly, there is a problem with the last column 'Dividend', which holds a percentage value as a string. I can convert it to decimals, but I was wondering whether there is a way to still retain the % while keeping the values calculable.
Any ideas for those two issues?

Plan
Take out 'Ticker' because it isn't numeric
Use assign to overwrite Dividend, stripping off the '%'
Use apply with pd.to_numeric to convert all the columns
Use eval to scale Dividend to a proper decimal
df[['Ticker']].join(
    df.assign(
        Dividend=df.Dividend.str.strip('%')
    ).drop('Ticker', axis=1).apply(
        pd.to_numeric, errors='coerce'
    )
).eval('Dividend = Dividend / 100', inplace=False)
Ticker P/E P/S P/B P/FCF Dividend
No.
1 NTCT 457.32 3.03 1.44 26.04 NaN
2 GWRE 416.06 9.80 5.33 45.62 NaN
3 PEGA 129.02 4.41 9.85 285.10 0.0028
4 BLKB 87.68 4.96 14.36 41.81 0.0062
More lines, more readable:
nums = df.drop('Ticker', axis=1).assign(Dividend=df.Dividend.str.strip('%'))
nums = nums.apply(pd.to_numeric, errors='coerce')
nums = nums.assign(Dividend=nums.Dividend / 100)
df[['Ticker']].join(nums)
Ticker P/E P/S P/B P/FCF Dividend
No.
1 NTCT 457.32 3.03 1.44 26.04 NaN
2 GWRE 416.06 9.80 5.33 45.62 NaN
3 PEGA 129.02 4.41 9.85 285.10 0.0028
4 BLKB 87.68 4.96 14.36 41.81 0.0062

Assuming that all P/... columns contain proper numbers:
In [47]: df.assign(Dividend=pd.to_numeric(df.Dividend.str.replace('%', '', regex=False), errors='coerce')
...: .div(100)) \
...: .set_index('Ticker', append=True) \
...: .astype('float') \
...: .reset_index('Ticker')
...:
Out[47]:
Ticker P/E P/S P/B P/FCF Dividend
No.
1 NTCT 457.32 3.03 1.44 26.04 NaN
2 GWRE 416.06 9.80 5.33 45.62 NaN
3 PEGA 129.02 4.41 9.85 285.10 0.0028
4 BLKB 87.68 4.96 14.36 41.81 0.0062
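
To the original question of whether a simple loop would work: yes, a plain loop over the numeric columns does the job as well. A minimal sketch, assuming the column names shown above:

import pandas as pd

# convert the four P/... columns one by one
for col in ['P/E', 'P/S', 'P/B', 'P/FCF']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Dividend needs the '%' stripped first; the result stays calculable as a decimal fraction
df['Dividend'] = pd.to_numeric(df['Dividend'].str.rstrip('%'), errors='coerce') / 100

If you still want to see the % when displaying the frame, one option is to keep the column numeric and only format it for display, e.g. df.style.format({'Dividend': '{:.2%}'}).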

Related

Slicing pandas dataframe by ordered values into clusters

I have a pandas dataframe in which there are longer gaps in time, and I want to slice it into smaller dataframes where the time "clusters" are together:
Time Value
0 56610.41341 8.55
1 56587.56394 5.27
2 56590.62965 6.81
3 56598.63790 5.47
4 56606.52203 6.71
5 56980.44206 4.75
6 56592.53327 6.53
7 57335.52837 0.74
8 56942.59094 6.96
9 56921.63669 9.16
10 56599.52053 6.14
11 56605.50235 5.20
12 57343.63828 3.12
13 57337.51641 3.17
14 56593.60374 5.69
15 56882.61571 9.50
I tried sorting this and taking the time difference of two consecutive points with
df = df.sort_values("Time")
df['t_dif'] = df['Time'] - df['Time'].shift(-1)
And it gives
Time Value t_dif
1 56587.56394 5.27 -3.06571
2 56590.62965 6.81 -1.90362
6 56592.53327 6.53 -1.07047
14 56593.60374 5.69 -5.03416
3 56598.63790 5.47 -0.88263
10 56599.52053 6.14 -5.98182
11 56605.50235 5.20 -1.01968
4 56606.52203 6.71 -3.89138
0 56610.41341 8.55 -272.20230
15 56882.61571 9.50 -39.02098
9 56921.63669 9.16 -20.95425
8 56942.59094 6.96 -37.85112
5 56980.44206 4.75 -355.08631
7 57335.52837 0.74 -1.98804
13 57337.51641 3.17 -6.12187
12 57343.63828 3.12 NaN
Let's say I want to slice this dataframe into smaller dataframes where the time difference between two consecutive points is smaller than 40; how would I go about doing this?
I could loop over the rows, but this is frowned upon, so is there a smarter solution?
Edit: Here is an example:
df1:
Time Value t_dif
1 56587.56394 5.27 -3.06571
2 56590.62965 6.81 -1.90362
6 56592.53327 6.53 -1.07047
14 56593.60374 5.69 -5.03416
3 56598.63790 5.47 -0.88263
10 56599.52053 6.14 -5.98182
11 56605.50235 5.20 -1.01968
4 56606.52203 6.71 -3.89138
df2:
0 56610.41341 8.55 -272.20230
df3:
15 56882.61571 9.50 -39.02098
9 56921.63669 9.16 -20.95425
8 56942.59094 6.96 -37.85112
...
etc.
I think you can just
df1 = df[df['t_dif']<30]
df2 = df[df['t_dif']>=30]
def split_dataframe(df, value):
    df = df.sort_values("Time")
    df = df.reset_index()
    df['t_dif'] = (df['Time'] - df['Time'].shift(-1)).abs()
    # positions where the gap to the next row exceeds the threshold
    indxs = df.index[df['t_dif'] > value].tolist()
    indxs.append(-1)
    indxs.append(len(df))
    indxs.sort()
    frames = []
    for i in range(1, len(indxs)):
        val = df.iloc[indxs[i - 1] + 1: indxs[i] + 1]
        frames.append(val)
    return frames
Returns the correct dataframes as a list
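
A vectorized alternative (a sketch, not the answer's original code) is to flag the rows where the gap from the previous point exceeds the threshold and use the cumulative sum of that flag as a group key:

import pandas as pd

def split_dataframe_vectorized(df, value=40):
    # the group key increments whenever the gap from the previous row exceeds `value`
    df = df.sort_values('Time').reset_index(drop=True)
    group = (df['Time'].diff().abs() > value).cumsum()
    return [g for _, g in df.groupby(group)]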

Dataframe split columns value, how to solve error message?

I have a pandas dataframe with the following columns:
Stock ROC5 ROC20 ROC63 ROCmean
0 IBGL.SW -0.59 3.55 6.57 3.18
0 EHYA.SW 0.98 4.00 6.98 3.99
0 HIGH.SW 0.94 4.22 7.18 4.11
0 IHYG.SW 0.56 2.46 6.16 3.06
0 HYGU.SW 1.12 4.56 7.82 4.50
0 IBCI.SW 0.64 3.57 6.04 3.42
0 IAEX.SW 8.34 18.49 14.95 13.93
0 AGED.SW 9.45 24.74 28.13 20.77
0 ISAG.SW 7.97 21.61 34.34 21.31
0 IAPD.SW 0.51 6.62 19.54 8.89
0 IASP.SW 1.08 2.54 12.18 5.27
0 RBOT.SW 10.35 30.53 39.15 26.68
0 RBOD.SW 11.33 30.50 39.69 27.17
0 BRIC.SW 7.24 11.08 75.60 31.31
0 CNYB.SW 1.14 4.78 8.36 4.76
0 FXC.SW 5.68 13.84 19.29 12.94
0 DJSXE.SW 3.11 9.24 6.44 6.26
0 CSSX5E.SW -0.53 5.29 11.85 5.54
How can I add a new column "Symbol" to the dataframe, containing the stock name without ".SW"?
For example, the first row's result should be IBGL (from the value IBGL.SW).
The last row's result should be CSSX5E (from the value CSSX5E.SW).
If I run the following command:
new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]
Then I receive an error message:
:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]
How can I solve this problem?
Thanks a lot for your support.
METHOD 1:
You can do a vectorized operation by str.get(0) -
df['SYMBOL'] = df['Stock'].str.split('.').str.get(0)
METHOD 2:
You can do another vectorized operation by using expand=True in str.split() and then getting the first column.
df['SYMBOL'] = df['Stock'].str.split('.', expand = True)[0]
METHOD 3:
Or you can write a custom lambda function with apply (for more complex processes). Note, this is slower but good if you have your own UDF.
df['SYMBOL'] = df['Stock'].apply(lambda x:x.split('.')[0])
This is not an error but a warning; as you have probably noticed, your script finishes its execution.
Edit: Given your comments, it seems your issue originates earlier in the code, so I suggest you use the following:
new_df = new_df.copy(deep=False)
And then proceed to solve it with:
new_df.loc[:, 'Symbol'] = new_df['Stock'].str.split('.').str[0]
new_df = new_df.copy()
new_df['Symbol'] = new_df.Stock.str.replace('.SW', '', regex=False)

Can't replace string symbol "-" in dataframe on jupyter notebook

I import data from url = ("http://finviz.com/quote.ashx?t=" + symbol.lower()) and got this table:
P/B P/E Forward P/E PEG Debt/Eq EPS (ttm) Dividend % ROE \
AMZN 18.73 92.45 56.23 2.09 1.21 16.25 - 26.70%
GOOG 4.24 38.86 - 2.55 - 26.65 - -
PG 4.47 22.67 19.47 3.45 0.61 4.05 3.12% 18.80%
KO 11.04 30.26 21.36 4.50 2.45 1.57 3.29% 15.10%
IBM 5.24 9.28 8.17 9.67 2.37 12.25 5.52% 30.90%
ROI EPS Q/Q Insider Own
AMZN 3.50% 1026.20% 16.20%
GOOG - 36.50% 5.74%
PG 13.10% 15.50% 0.10%
KO 12.50% 56.80% 0.10%
IBM 17.40% 0.70% 0.10%
Then I was trying to convert the strings to float:
df = df[(df['P/E'].astype(float)<20) & (df['P/B'].astype(float) < 3)]
and got "ValueError: could not convert string to float:"
I think that values like 0.70% and the "-" sign are the problem.
I tried:
df.replace("-","0")
df.replace('-', 0)
df.replace('-', nan)
But nothing works.
You may need to assign it back
df=df.replace("-","0")
And I recommend to_numeric
df['P/E']=pd.to_numeric(df['P/E'],errors = 'coerce')
df['P/B']=pd.to_numeric(df['P/B'],errors = 'coerce')
You should use numpy:
import numpy as np
then the next replacement:
df = df.replace('-', np.nan)
Next, change the datatype:
df['Forward P/E'] = df['Forward P/E'].astype(float)
Lastly, you can test if the datatype is float64.
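
Putting both pieces together, a minimal sketch assuming the finviz column names shown above and that the percentage columns should become plain decimal fractions:

import numpy as np
import pandas as pd

df = df.replace('-', np.nan)  # '-' stands for "no value"

# strip the '%' from the percentage columns and scale to decimals
pct_cols = ['Dividend %', 'ROE', 'ROI', 'EPS Q/Q', 'Insider Own']
for col in pct_cols:
    df[col] = pd.to_numeric(df[col].str.rstrip('%'), errors='coerce') / 100

# the remaining ratio columns are plain numbers
num_cols = ['P/B', 'P/E', 'Forward P/E', 'PEG', 'Debt/Eq', 'EPS (ttm)']
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')

# the original filter now works
filtered = df[(df['P/E'] < 20) & (df['P/B'] < 3)]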

Columns error while converting to float

df = pd.read_csv('TestdataNALA.csv', sep=';', na_values='ND')
df.fillna(0, inplace=True)
df = df[['ID','VegType','Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']]
new_list = df[['Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']]
nnew_list = [float(i) for i in new_list]
print(type(df['ID']))
#--> <class 'pandas.core.series.Series'>
print('ID' in df.columns)
#-->True
Some data from the CSV:
ID;VegType;Elevation;C;N;C/N;Soil13C;Soil15N;pH;Na_[mg/g];K_[mg/g];Mg_[mg/g];Ca_[mg/g];P_[µg/g]
A2425WC;Cyt;2425;2,78;0,23;12,14;-24,43;4,85;4,88;0,0005;0,0772;0,0646;0,3588;13,8400
C2489WCG;Cyt_Gen;2489;2,18;0,17;12,50;-24,51;5,53;4,21;0,0010;0,0639;0,0286;0,0601;0,6800
C2591WCG;Cyt_Gen;2591;5,13;0,36;14,29;-25,52;3,41;4,30;0,0046;0,0854;0,1169;0,7753;5,7000
E2695WC;Cyt;2695;1,43;0,14;10,55;-25,71;5,75;4,95;ND;0,0766;0,0441;0,0978;8,8500
A2900WC;Cyt;2900;6,95;0,53;13,11;-25,54;3,38;5,35;0,0032;0,1119;0,2356;1,8050;14,7100
A2800WC;Cyt;2800;1,88;0,16;11,62;-23,68;5,88;5,28;0,0025;0,0983;0,0770;0,3777;5,4200
A3050WC;Cyt;3050;1,50;0,12;12,50;-24,62;2,23;5,97;ND;0,0696;0,0729;0,5736;9,4000
When I run this, I get the error:
ValueError: could not convert string to float: 'Elevation'
It seems as if it's trying to convert the column headings, but I only wanted to convert the selected columns.
To convert columns to numeric, you should use pd.to_numeric:
cols = ['Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
Your code will not work because, like a dictionary and its keys, when you loop through a dataframe you only find the column headings.
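To illustrate that point with a small, self-contained snippet (hypothetical data, not the original CSV):

import pandas as pd

df = pd.DataFrame({'Elevation': [2425, 2489], 'pH': [4.88, 4.21]})

for item in df:
    print(item)   # prints 'Elevation', then 'pH' -- the column labels, not the values

# so float(i) in the list comprehension receives the string 'Elevation' and fails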
Update 1
Try also using the options below:
import pandas as pd
from io import StringIO
mystr = StringIO(""" <bound method NDFrame.head of ID VegType Elevation C N C/N Soil13C Soil15N pH \
1 C2489WCG Cyt_Gen 2489 2,18 0,17 12,50 -24,51 5,53 4,21
2 C2591WCG Cyt_Gen 2591 5,13 0,36 14,29 -25,52 3,41 4,30
3 E2695WC Cyt 2695 1,43 0,14 10,55 -25,71 5,75 4,95
""")
df = pd.read_csv(mystr, skiprows=1, decimal=',', header=None, delim_whitespace=True)
# 0 1 2 3 4 5 6 7 8 9
# 0 1 C2489WCG Cyt_Gen 2489 2.18 0.17 12.50 -24.51 5.53 4.21
# 1 2 C2591WCG Cyt_Gen 2591 5.13 0.36 14.29 -25.52 3.41 4.30
# 2 3 E2695WC Cyt 2695 1.43 0.14 10.55 -25.71 5.75 4.95
Update 2
import pandas as pd
from io import StringIO
mystr = StringIO("""ID;VegType;Elevation;C;N;C/N;Soil13C;Soil15N;pH;Na_[mg/g];K_[mg/g];Mg_[mg/g];Ca_[mg/g];P_[µg/g]
A2425WC;Cyt;2425;2,78;0,23;12,14;-24,43;4,85;4,88;0,0005;0,0772;0,0646;0,3588;13,8400
C2489WCG;Cyt_Gen;2489;2,18;0,17;12,50;-24,51;5,53;4,21;0,0010;0,0639;0,0286;0,0601;0,6800
C2591WCG;Cyt_Gen;2591;5,13;0,36;14,29;-25,52;3,41;4,30;0,0046;0,0854;0,1169;0,7753;5,7000
E2695WC;Cyt;2695;1,43;0,14;10,55;-25,71;5,75;4,95;ND;0,0766;0,0441;0,0978;8,8500
A2900WC;Cyt;2900;6,95;0,53;13,11;-25,54;3,38;5,35;0,0032;0,1119;0,2356;1,8050;14,7100
A2800WC;Cyt;2800;1,88;0,16;11,62;-23,68;5,88;5,28;0,0025;0,0983;0,0770;0,3777;5,4200
A3050WC;Cyt;3050;1,50;0,12;12,50;-24,62;2,23;5,97;ND;0,0696;0,0729;0,5736;9,4000
""")
df = pd.read_csv(mystr, decimal=',', delimiter=';')
# ID VegType Elevation C N C/N Soil13C Soil15N pH \
# 0 A2425WC Cyt 2425 2.78 0.23 12.14 -24.43 4.85 4.88
# 1 C2489WCG Cyt_Gen 2489 2.18 0.17 12.50 -24.51 5.53 4.21
# 2 C2591WCG Cyt_Gen 2591 5.13 0.36 14.29 -25.52 3.41 4.30
# 3 E2695WC Cyt 2695 1.43 0.14 10.55 -25.71 5.75 4.95
# 4 A2900WC Cyt 2900 6.95 0.53 13.11 -25.54 3.38 5.35
# 5 A2800WC Cyt 2800 1.88 0.16 11.62 -23.68 5.88 5.28
# 6 A3050WC Cyt 3050 1.50 0.12 12.50 -24.62 2.23 5.97
# Na_[mg/g] K_[mg/g] Mg_[mg/g] Ca_[mg/g] P_[µg/g]
# 0 0,0005 0.0772 0.0646 0.3588 13.84
# 1 0,0010 0.0639 0.0286 0.0601 0.68
# 2 0,0046 0.0854 0.1169 0.7753 5.70
# 3 ND 0.0766 0.0441 0.0978 8.85
# 4 0,0032 0.1119 0.2356 1.8050 14.71
# 5 0,0025 0.0983 0.0770 0.3777 5.42
# 6 ND 0.0696 0.0729 0.5736 9.40
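Note that in the output above the Na_[mg/g] column still holds strings, because the 'ND' entries kept read_csv from applying the decimal-comma conversion to that column. A small follow-up sketch to finish the conversion (treating 'ND' as missing):

df['Na_[mg/g]'] = pd.to_numeric(
    df['Na_[mg/g]'].str.replace(',', '.', regex=False),
    errors='coerce'
)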

Cannot convert input to Timestamp, bday_range(...) - Pandas/Python

Looking to generate, for each row of a pandas dataframe, the number of business days between the current date and the end of the month.
E.g. 26/06/2017 - 4, 23/06/2017 - 5
I'm having trouble as I keep getting a TypeError:
TypeError: Cannot convert input to Timestamp
From line:
result['bdaterange'] = pd.bdate_range(pd.to_datetime(result['dte'], unit='ns').values, pd.to_datetime(result['bdate'], unit='ns').values)
I have a DataFrame result with the column dte in a date format, and I'm trying to create a new column (bdaterange) as a simple integer/float that tells me how far the date is from month end in business days.
Sample data:
bid ask spread dte day bdate
01:49:00 2.17 3.83 1.66 2016-12-20 20.858333 2016-12-30
02:38:00 2.2 3.8 1.60 2016-12-20 20.716667 2016-12-30
22:15:00 2.63 3.12 0.49 2016-12-20 21.166667 2016-12-30
03:16:00 1.63 2.38 0.75 2016-12-21 21.391667 2016-12-30
07:11:00 1.46 2.54 1.08 2016-12-21 21.475000 2016-12-30
I've tried BDay(), with the idea that days 6 and 7 (weekends) should be excluded from the calculation, but have not got anywhere. I came across bdate_range, which I believe is exactly what I'm looking for, but the closest I've got gives me the error Cannot convert input to Timestamp.
My attempt is:
result['bdate'] = pd.to_datetime(result['dte']) + BMonthEnd(0)
result['bdaterange'] = pd.bdate_range(pd.to_datetime(result['dte'], unit='ns').values, pd.to_datetime(result['bdate'], unit='ns').values)
print(result['bdaterange'])
Not sure how to solve the error though.
I think you need the length of bdate_range for each row, so you need a custom function with apply:
#convert only once to datetime
result['dte'] = pd.to_datetime(result['dte'])
f = lambda x: len(pd.bdate_range(x['dte'], x['dte'] + pd.offsets.BMonthEnd(0)))
result['bdaterange'] = result.apply(f, axis=1)
print (result)
bid ask spread dte day bdaterange
01:49:00 2.17 3.83 1.66 2016-12-20 20.858333 9
02:38:00 2.20 3.80 1.60 2016-12-20 20.716667 9
22:15:00 2.63 3.12 0.49 2016-12-20 21.166667 9
03:16:00 1.63 2.38 0.75 2016-12-21 21.391667 8
07:11:00 1.46 2.54 1.08 2016-12-21 21.475000 8
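
If you would rather avoid apply, numpy's busday_count can produce the same counts in a vectorized way. A sketch under the assumption that dte has already been converted with pd.to_datetime (busday_count treats the end date as exclusive, so one day is added to include the month end itself):

import numpy as np
import pandas as pd

start = result['dte']                                           # already datetime64
end = start + pd.offsets.BMonthEnd(0) + pd.Timedelta(days=1)    # exclusive upper bound

result['bdaterange'] = np.busday_count(
    start.values.astype('datetime64[D]'),
    end.values.astype('datetime64[D]')
)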
