Can't replace string symbol "-" in dataframe on jupyter notebook - python

I import data from "url = ("http://finviz.com/quote.ashx?t=" + symbol.lower())"
and got the table:
P/B P/E Forward P/E PEG Debt/Eq EPS (ttm) Dividend % ROE \
AMZN 18.73 92.45 56.23 2.09 1.21 16.25 - 26.70%
GOOG 4.24 38.86 - 2.55 - 26.65 - -
PG 4.47 22.67 19.47 3.45 0.61 4.05 3.12% 18.80%
KO 11.04 30.26 21.36 4.50 2.45 1.57 3.29% 15.10%
IBM 5.24 9.28 8.17 9.67 2.37 12.25 5.52% 30.90%
ROI EPS Q/Q Insider Own
AMZN 3.50% 1026.20% 16.20%
GOOG - 36.50% 5.74%
PG 13.10% 15.50% 0.10%
KO 12.50% 56.80% 0.10%
IBM 17.40% 0.70% 0.10%
Then I was trying to convert string to float:
df = df[(df['P/E'].astype(float)<20) & (df['P/B'].astype(float) < 3)]
and got "ValueError: could not convert string to float:"
I think that values 0.70% and sign "-" is the problem.
I tried:
df.replace("-","0")
df.replace('-', 0)
df.replace('-', nan)
But nothing works.

You may need to assign it back
df=df.replace("-","0")
And I recommend to_numeric
df['P/E']=pd.to_numeric(df['P/E'],errors = 'coerce')
df['P/B']=pd.to_numeric(df['P/B'],errors = 'coerce')

You should use numpy:
import numpy as np
then the next replacement:
df = df.replace('-', np.nan)
Next, change the datatype:
df = df['Forward P/E'].astype(float)
Lastly, you can test if the datatype is float64.

Related

Combine a row with column in dataFrame and show the corresponding values

So I want to show this data in just two columns. For example, I want to turn this data
Year Jan Feb Mar Apr May Jun
1997 3.45 2.15 1.89 2.03 2.25 2.20
1998 2.09 2.23 2.24 2.43 2.14 2.17
1999 1.85 1.77 1.79 2.15 2.26 2.30
2000 2.42 2.66 2.79 3.04 3.59 4.29
into this
Date Price
Jan-1977 3.45
Feb-1977 2.15
Mar-1977 1.89
Apr-1977 2.03
....
Jan-2000 2.42
Feb-2000 2.66
So far, I have read about how to combine two columns into another dataframe using .apply() .agg(), but no info how to combine them as I showed above.
import pandas as pd
df = pd.read_csv('matrix-A.csv', index_col =0 )
matrix_b = ({})
new = pd.DataFrame(matrix_b)
new["Date"] = df['Year'].astype(float) + "-" + df["Dec"]
print(new)
I have tried this way, but it of course does not work. I have also tried using pd.Series() but no success
I want to ask whether there is any site where I can learn how to do this, or does anybody know correct way to solve this?
Another possible solution, which is based on pandas.DataFrame.stack:
out = df.set_index('Year').stack()
out.index = ['{}_{}'.format(j, i) for i, j in out.index]
out = out.reset_index()
out.columns = ['Date', 'Value']
Output:
Date Value
0 Jan_1997 3.45
1 Feb_1997 2.15
2 Mar_1997 1.89
3 Apr_1997 2.03
4 May_1997 2.25
....
19 Feb_2000 2.66
20 Mar_2000 2.79
21 Apr_2000 3.04
22 May_2000 3.59
23 Jun_2000 4.29
You can first convert it to long-form using melt. Then, create a new column for Date by combining two columns.
long_df = pd.melt(df, id_vars=['Year'], var_name='Month', value_name="Price")
long_df['Date'] = long_df['Month'] + "-" + long_df['Year'].astype('str')
long_df[['Date', 'Price']]
If you want to sort your date column, here is a good resource. Follow those instructions after melting and before creating the Date column.
You can use pandas.DataFrame.melt :
out = (
df
.melt(id_vars="Year", var_name="Month", value_name="Price")
.assign(month_num= lambda x: pd.to_datetime(x["Month"] , format="%b").dt.month)
.sort_values(by=["Year", "month_num"])
.assign(Date= lambda x: x.pop("Month") + "-" + x.pop("Year").astype(str))
.loc[:, ["Date", "Price"]]
)
# Output :
print(out)
​
Date Price
0 Jan-1997 3.45
4 Feb-1997 2.15
8 Mar-1997 1.89
12 Apr-1997 2.03
16 May-1997 2.25
.. ... ...
7 Feb-2000 2.66
11 Mar-2000 2.79
15 Apr-2000 3.04
19 May-2000 3.59
23 Jun-2000 4.29
[24 rows x 2 columns]

Index - Match using Pandas

I have the following 2 data frames:
df1 = pd.DataFrame({
'dates': ['02-Jan','03-Jan','30-Jan'],
'currency': ['aud','gbp','eur'],
'amount': [100,330,500]
})
df2 = pd.DataFrame({
'dates': ['01-Jan','02-Jan','03-Jan','30-Jan'],
'aud': [0.72,0.73,0.74,0.71],
'gbp': [1.29,1.30,1.4,1.26],
'eur': [1.15,1.16,1.17,1.18]
})
I want to obtain the intersection of df1.dates & df1.currency. For eg: Looking up the prevalent 'aud' exchange rate on '02-Jan'
It can be solved using the Index + Match functionality of excel. What shall be the best way to replicate it in Pandas.
Desired Output: add a new column 'price'
dates currency amount price
02-Jan aud 100 0.73
03-Jan gbp 330 1.4
30-Jan eur 500 1.18
The best equivalent of INDEX MATCH is DataFrame.lookup:
df2 = df2.set_index('dates')
df1['price'] = df2.lookup(df1['dates'], df1['currency'])
Reshaping your df2 makes it a lot easier to do a straightforward merge:
In [42]: df2.set_index("dates").unstack().to_frame("value")
Out[42]:
value
dates
aud 01-Jan 0.72
02-Jan 0.73
03-Jan 0.74
30-Jan 0.71
gbp 01-Jan 1.29
02-Jan 1.30
03-Jan 1.40
30-Jan 1.26
eur 01-Jan 1.15
02-Jan 1.16
03-Jan 1.17
30-Jan 1.18
In this form, you just need to match the df1 fields with df2's new index as such:
In [43]: df1.merge(df2.set_index("dates").unstack().to_frame("value"), left_on=["currency", "dates"], right_index=True)
Out[43]:
dates currency amount value
0 02-Jan aud 100 0.73
1 03-Jan gbp 330 1.40
You can also left merge it if you don't want to lose missing data (I had to fix your df1 a little for this:
In [44]: df1.merge(df2.set_index("dates").unstack().to_frame("value"), left_on=["currency", "dates"], right_index=True, how="left")
Out[44]:
dates currency amount value
0 02-Jan aud 100 0.73
1 03-Jan gbp 330 1.40
2 04-Jan eur 500 NaN

Columns error while converting to float

df = pd.read_csv('TestdataNALA.csv', sep=';', na_values='ND')
df.fillna(0, inplace=True)
df = df[['ID','VegType','Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']]
new_list = df[['Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']]
nnew_list = [float(i) for i in new_list]
print(type(df['ID']))
#--> <class 'pandas.core.series.Series'>
print('ID' in df.columns)
#-->True
Some data from the CSV:
ID;VegType;Elevation;C;N;C/N;Soil13C;Soil15N;pH;Na_[mg/g];K_[mg/g];Mg_[mg/g];Ca_[mg/g];P_[µg/g]
A2425WC;Cyt;2425;2,78;0,23;12,14;-24,43;4,85;4,88;0,0005;0,0772;0,0646;0,3588;13,8400
C2489WCG;Cyt_Gen;2489;2,18;0,17;12,50;-24,51;5,53;4,21;0,0010;0,0639;0,0286;0,0601;0,6800
C2591WCG;Cyt_Gen;2591;5,13;0,36;14,29;-25,52;3,41;4,30;0,0046;0,0854;0,1169;0,7753;5,7000
E2695WC;Cyt;2695;1,43;0,14;10,55;-25,71;5,75;4,95;ND;0,0766;0,0441;0,0978;8,8500
A2900WC;Cyt;2900;6,95;0,53;13,11;-25,54;3,38;5,35;0,0032;0,1119;0,2356;1,8050;14,7100
A2800WC;Cyt;2800;1,88;0,16;11,62;-23,68;5,88;5,28;0,0025;0,0983;0,0770;0,3777;5,4200
A3050WC;Cyt;3050;1,50;0,12;12,50;-24,62;2,23;5,97;ND;0,0696;0,0729;0,5736;9,4000
When I run this, I get the error:
ValueError: could not convert string to float: 'Elevation'
It seems as if it's trying to convert the headlines, but I only wanted to convert the list.
To convert columns to numeric, you should use pd.to_numeric:
cols = ['Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
Your code will not work because, like a dictionary and its keys, when you loop through a dataframe you only find the column headings.
Update 1
Try also using the options below:
import pandas as pd
from io import StringIO
mystr = StringIO(""" <bound method NDFrame.head of ID VegType Elevation C N C/N Soil13C Soil15N pH \
1 C2489WCG Cyt_Gen 2489 2,18 0,17 12,50 -24,51 5,53 4,21
2 C2591WCG Cyt_Gen 2591 5,13 0,36 14,29 -25,52 3,41 4,30
3 E2695WC Cyt 2695 1,43 0,14 10,55 -25,71 5,75 4,95
""")
df = pd.read_csv(mystr, skiprows=1, decimal=',', header=None, delim_whitespace=True)
# 0 1 2 3 4 5 6 7 8 9
# 0 1 C2489WCG Cyt_Gen 2489 2.18 0.17 12.50 -24.51 5.53 4.21
# 1 2 C2591WCG Cyt_Gen 2591 5.13 0.36 14.29 -25.52 3.41 4.30
# 2 3 E2695WC Cyt 2695 1.43 0.14 10.55 -25.71 5.75 4.95
Update 2
import pandas as pd
from io import StringIO
mystr = StringIO("""ID;VegType;Elevation;C;N;C/N;Soil13C;Soil15N;pH;Na_[mg/g];K_[mg/g];Mg_[mg/g];Ca_[mg/g];P_[µg/g]
A2425WC;Cyt;2425;2,78;0,23;12,14;-24,43;4,85;4,88;0,0005;0,0772;0,0646;0,3588;13,8400
C2489WCG;Cyt_Gen;2489;2,18;0,17;12,50;-24,51;5,53;4,21;0,0010;0,0639;0,0286;0,0601;0,6800
C2591WCG;Cyt_Gen;2591;5,13;0,36;14,29;-25,52;3,41;4,30;0,0046;0,0854;0,1169;0,7753;5,7000
E2695WC;Cyt;2695;1,43;0,14;10,55;-25,71;5,75;4,95;ND;0,0766;0,0441;0,0978;8,8500
A2900WC;Cyt;2900;6,95;0,53;13,11;-25,54;3,38;5,35;0,0032;0,1119;0,2356;1,8050;14,7100
A2800WC;Cyt;2800;1,88;0,16;11,62;-23,68;5,88;5,28;0,0025;0,0983;0,0770;0,3777;5,4200
A3050WC;Cyt;3050;1,50;0,12;12,50;-24,62;2,23;5,97;ND;0,0696;0,0729;0,5736;9,4000
""")
df = pd.read_csv(mystr, decimal=',', delimiter=';')
# ID VegType Elevation C N C/N Soil13C Soil15N pH \
# 0 A2425WC Cyt 2425 2.78 0.23 12.14 -24.43 4.85 4.88
# 1 C2489WCG Cyt_Gen 2489 2.18 0.17 12.50 -24.51 5.53 4.21
# 2 C2591WCG Cyt_Gen 2591 5.13 0.36 14.29 -25.52 3.41 4.30
# 3 E2695WC Cyt 2695 1.43 0.14 10.55 -25.71 5.75 4.95
# 4 A2900WC Cyt 2900 6.95 0.53 13.11 -25.54 3.38 5.35
# 5 A2800WC Cyt 2800 1.88 0.16 11.62 -23.68 5.88 5.28
# 6 A3050WC Cyt 3050 1.50 0.12 12.50 -24.62 2.23 5.97
# Na_[mg/g] K_[mg/g] Mg_[mg/g] Ca_[mg/g] P_[µg/g]
# 0 0,0005 0.0772 0.0646 0.3588 13.84
# 1 0,0010 0.0639 0.0286 0.0601 0.68
# 2 0,0046 0.0854 0.1169 0.7753 5.70
# 3 ND 0.0766 0.0441 0.0978 8.85
# 4 0,0032 0.1119 0.2356 1.8050 14.71
# 5 0,0025 0.0983 0.0770 0.3777 5.42
# 6 ND 0.0696 0.0729 0.5736 9.40

Cannot convert input to Timestamp, bday_range(...) - Pandas/Python

Looking to generate a number for the days in business days between current date and the end of the month of a pandas dataframe.
E.g. 26/06/2017 - 4, 23/06/2017 - 5
I'm having trouble as I keep getting a Type Error:
TypeError: Cannot convert input to Timestamp
From line:
result['bdaterange'] = pd.bdate_range(pd.to_datetime(result['dte'], unit='ns').values, pd.to_datetime(result['bdate'], unit='ns').values)
I have a Data Frame result with the column dte in a date format and I'm trying to create a new column (bdaterange) as a simple integer/float that I can use to see how far from month end in business days it has.
Sample data:
bid ask spread dte day bdate
01:49:00 2.17 3.83 1.66 2016-12-20 20.858333 2016-12-30
02:38:00 2.2 3.8 1.60 2016-12-20 20.716667 2016-12-30
22:15:00 2.63 3.12 0.49 2016-12-20 21.166667 2016-12-30
03:16:00 1.63 2.38 0.75 2016-12-21 21.391667 2016-12-30
07:11:00 1.46 2.54 1.08 2016-12-21 21.475000 2016-12-30
I've tried BDay() and using that the day cannot be 6 & 7 in the calculation but have not got anywhere. I came across bdate_range which I believe will be exactly what I'm looking for, but the closest I've got gives me the error Cannot convert input to Timestamp.
My attempt is:
result['bdate'] = pd.to_datetime(result['dte']) + BMonthEnd(0)
result['bdaterange'] = pd.bdate_range(pd.to_datetime(result['dte'], unit='ns').values, pd.to_datetime(result['bdate'], unit='ns').values)
print(result['bdaterange'])
Not sure how to solve the error though.
I think you need length of bdate_range for each row, so need custom function with apply:
#convert only once to datetime
result['dte'] = pd.to_datetime(result['dte'])
f = lambda x: len(pd.bdate_range(x['dte'], x['dte'] + pd.offsets.BMonthEnd(0)))
result['bdaterange'] = result.apply(f, axis=1)
print (result)
bid ask spread dte day bdaterange
01:49:00 2.17 3.83 1.66 2016-12-20 20.858333 9
02:38:00 2.20 3.80 1.60 2016-12-20 20.716667 9
22:15:00 2.63 3.12 0.49 2016-12-20 21.166667 9
03:16:00 1.63 2.38 0.75 2016-12-21 21.391667 8
07:11:00 1.46 2.54 1.08 2016-12-21 21.475000 8

iterate over certain columns in data frame

Hi I have data frame like this:
Ticker P/E P/S P/B P/FCF Dividend
No.
1 NTCT 457.32 3.03 1.44 26.04 -
2 GWRE 416.06 9.80 5.33 45.62 -
3 PEGA 129.02 4.41 9.85 285.10 0.28%
4 BLKB 87.68 4.96 14.36 41.81 0.62%
Firstly, I want to convert values in columns that contain numbers (which are string currently) to a float values. So here I would have the 4 middle columns that need the conversion to float. Would simple loop work with this case?
Second thing, there is a problem with the last column 'Dividend' where there is a percentage value as string. As a matter of fact I can convert it to decimals, however I was thinking if there is a way to still retaining the % and the values would still be calculable.
Any ideas for those two issues?
plan
Take out 'Ticker' because it isn't numeric
use assign to overwrite Dividend by striping off %
use apply with pd.to_numeric to convert all the columns
use eval to get Dividend to proper decimal space
df[['Ticker']].join(
df.assign(
Dividend=df.Dividend.str.strip('%')
).drop('Ticker', 1).apply(
pd.to_numeric, errors='coerce'
)
).eval('Dividend = Dividend / 100', inplace=False)
Ticker P/E P/S P/B P/FCF Dividend
No.
1 NTCT 457.32 3.03 1.44 26.04 NaN
2 GWRE 416.06 9.80 5.33 45.62 NaN
3 PEGA 129.02 4.41 9.85 285.10 0.0028
4 BLKB 87.68 4.96 14.36 41.81 0.0062
more lines
more readable
nums = df.drop('Ticker', 1).assign(Dividend=df.Dividend.str.strip('%'))
nums = nums.apply(pd.to_numeric, errors='coerce')
nums = nums.assign(Dividend=nums.Dividend / 100)
df[['Ticker']].join(nums)
Ticker P/E P/S P/B P/FCF Dividend
No.
1 NTCT 457.32 3.03 1.44 26.04 NaN
2 GWRE 416.06 9.80 5.33 45.62 NaN
3 PEGA 129.02 4.41 9.85 285.10 0.0028
4 BLKB 87.68 4.96 14.36 41.81 0.0062
Assuming that all P/... columns contain proper numbers:
In [47]: df.assign(Dividend=pd.to_numeric(df.Dividend.str.replace(r'\%',''), errors='coerce')
...: .div(100)) \
...: .set_index('Ticker', append=True) \
...: .astype('float') \
...: .reset_index('Ticker')
...:
Out[47]:
Ticker P/E P/S P/B P/FCF Dividend
No.
1 NTCT 457.32 3.03 1.44 26.04 NaN
2 GWRE 416.06 9.80 5.33 45.62 NaN
3 PEGA 129.02 4.41 9.85 285.10 0.0028
4 BLKB 87.68 4.96 14.36 41.81 0.0062

Categories

Resources