Python: convert str object to DataFrame

I have a string object as follows:
time,exchange,fsym,tsym,close,high,low,open,volumefrom,volumeto
1660003200,NYSE,BTC,USD,100.1,103,99.1,100,30,10000
1660003260,NYSE,BTC,USD,101.3,104,100.1,102,39,12000
1660003320,NYSE,BTC,USD,100.9,103.2,98,100,32,100230
I am trying to convert this string object to a DataFrame. I have tried adding brackets "[]" around the data but that still didn't work. Any suggestions would be greatly appreciated.

Looks like your string is in CSV format. You can convert it into a pandas DataFrame by wrapping it in StringIO from the io module:
from io import StringIO
import pandas as pd
data = StringIO("""time,exchange,fsym,tsym,close,high,low,open,volumefrom,volumeto
1660003200,NYSE,BTC,USD,100.1,103,99.1,100,30,10000
1660003260,NYSE,BTC,USD,101.3,104,100.1,102,39,12000
1660003320,NYSE,BTC,USD,100.9,103.2,98,100,32,100230""")
df = pd.read_csv(data)
print(df)

import pandas as pd
string = """time,exchange,fsym,tsym,close,high,low,open,volumefrom,volumeto
1660003200,NYSE,BTC,USD,100.1,103,99.1,100,30,10000
1660003260,NYSE,BTC,USD,101.3,104,100.1,102,39,12000
1660003320,NYSE,BTC,USD,100.9,103.2,98,100,32,100230"""
str_list_with_comma = string.split("\n")
columns = []
data = []
for idx, item in enumerate(str_list_with_comma):
    if idx == 0:
        columns = item.split(",")
    else:
        data.append(item.split(","))
df = pd.DataFrame(data, columns=columns)
print(df)
Output:
         time exchange fsym tsym  close   high    low open volumefrom volumeto
0  1660003200     NYSE  BTC  USD  100.1    103   99.1  100         30    10000
1  1660003260     NYSE  BTC  USD  101.3    104  100.1  102         39    12000
2  1660003320     NYSE  BTC  USD  100.9  103.2     98  100         32   100230

Related

Add the date from the URL as column in a dataframe formed from scraping html tables from a list of URLs

I need to download currency rates from xe.com for a set of particular dates and store them all in one table, to which end I have written the following code.
import pandas as pd
# list of dates to suffix to the url
dates = ['2020-01-31','2020-02-29','2020-03-31','2020-04-30','2020-05-31','2020-06-30','2020-07-31','2020-08-31','2020-09-30',
'2020-10-31','2020-11-30','2020-12-31','2021-01-31','2021-02-28','2021-03-31','2021-04-30','2021-05-31','2021-06-30',
'2021-07-01',#'2021-08-01','2021-09-01','2021-10-01','2021-11-01','2021-12-01'
]
# the constant part of the url
link = "https://www.xe.com/currencytables/?from=EUR&date="
# append each date from the list to the constant part of the url and then read each table of currency rates from each url and save them in a dataframe
df_list = []
urls = []
for date in dates:
    urls.append(link+date)
urls
for url in urls:
    df_list.append(pd.read_html(url)[0])
df_list
# concatenate all dataframes in one
rates_table = pd.concat(df_list, axis=0, ignore_index=True)
print(rates_table)
What I cannot figure out is how to also add the date from each URL (in other words, each date from my list) to the corresponding dataframe as a column, so that after concatenating I know which date each currency rate belongs to. How can I achieve this?
Also I realize this could probably be done easier with BeautifulSoup but I still haven't researched that library, but I'd be interested in seeing such a solution so I can study it.
Thanks
IIUC, try this.
import pandas as pd
dates = [...]
link = "https://www.xe.com/currencytables/?from=EUR&date="
pd.concat(pd.read_html(link+d)[0].assign(Date=d) for d in dates)
Out[*]:
Currency Name Units per EUR EUR per unit Date
0 USD US Dollar 1.108085 0.902458 2020-01-31
1 EUR Euro 1.000000 1.000000 2020-01-31
2 GBP British Pound 0.839899 1.190619 2020-01-31
3 INR Indian Rupee 79.252091 0.012618 2020-01-31
4 AUD Australian Dollar 1.654370 0.604460 2020-01-31
.. ... ... ... ... ...
166 ZMW Zambian Kwacha 20.419028 0.048974 2020-04-30
167 CLF CLF 0.033268 30.059155 2020-04-30
168 CNH CNH 7.728135 0.129397 2020-04-30
169 MXV MXV 4.119131 0.242770 2020-04-30
170 XBT Bitcoin 0.000124 8062.078999 2020-04-30

Alternatives to pandas.merge because of "array is too big" value error?

I'm creating a function that inputs a dataframe (pulled from a csv file) with the format:
Date        Stock1  Date        Stock2  Date        Stock3
01/01/2000  100     01/01/2000  12.1    01/01/2000  54
03/01/2000  101     02/01/2000  12.5    03/01/2000  50
04/01/2000  104     03/01/2000  12.2    05/01/2000  49
for any number of stocks and any number of dates, and outputs a dataframe as follows:
Date        Stock1  Stock2  Stock3
01/01/2000  100     12.1    54
03/01/2000  101     12.2    50
i.e. all the dates that do not exist in all the stock price time series are discarded.
The code I have so far is as follows:
import pandas as pd
import numpy as np
def collectdata(df):
    n = len(df.columns)
    data = df.iloc[:, 0:2].copy()
    data.columns = ['Date', data.columns[1]]
    for i in range(2, n, 2):
        x = df.iloc[:, i:i+2].copy()
        x.columns = ['Date', x.columns[1]]
        data = pd.merge(data, x)
    return data
# example: we have a file called test.csv:
dataset = collectdata(pd.read_csv('test.csv'))
print(dataset.head())
The code works fine for smaller data sets, but when I attempted it with a data set of 15 stocks and 5000 rows I got:
ValueError: array is too big
I believe this problem is arising because I am using pd.merge (as I can pull the dataset with no problem). Is there an alternative to this that I can use that will eliminate this error? (Or equally a more efficient way around this problem?)
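One possible workaround, not from the original thread: rather than merging pairwise, index each (Date, Stock) pair by its own date column and let a single pd.concat with join='inner' intersect the dates in one pass. A sketch assuming the same two-columns-per-stock layout as in the question:

```python
import pandas as pd

def collectdata(df):
    # Split the frame into (Date, StockN) pairs, index each pair by its
    # own Date column, then intersect all dates with a single concat.
    pairs = []
    for i in range(0, len(df.columns), 2):
        x = df.iloc[:, i:i+2].copy()
        x.columns = ['Date', x.columns[1]]
        pairs.append(x.set_index('Date'))
    # join='inner' keeps only dates present in every stock's series
    return pd.concat(pairs, axis=1, join='inner').reset_index()

# Example mirroring the table in the question (read_csv would produce
# Date, Date.1, Date.2 instead of Date2/Date3; the renaming above handles both)
raw = pd.DataFrame({
    'Date':   ['01/01/2000', '03/01/2000', '04/01/2000'],
    'Stock1': [100, 101, 104],
    'Date2':  ['01/01/2000', '02/01/2000', '03/01/2000'],
    'Stock2': [12.1, 12.5, 12.2],
    'Date3':  ['01/01/2000', '03/01/2000', '05/01/2000'],
    'Stock3': [54, 50, 49],
})
print(collectdata(raw))
```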

Pandas DataFrame Output Format

I need help reformatting my DataFrame output for stock closing prices.
Currently my output has the stock symbols as column headers, where I would like to have them displayed in rows: https://i.stack.imgur.com/u4jEk.png
I would like each row to show the date, symbol, and price instead.
This is my current df_output code (not sure if this is the reason):
prices_df = pd.DataFrame({
    a: {x['formatted_date']: x['adjclose'] for x in data[a]['prices']} for a in assets})
FULL CODE:
import pandas as pd
import numpy as np
import yfinance as yf
from yahoofinancials import YahooFinancials
from datetime import datetime
import time
start_time = time.time()
df = pd.read_excel(r'C:\Users\Ryan\Desktop\Stock Portfolio\\My Portfolio.xlsx', sheet_name=0, skiprows=2)
list1 = list(df['Stock Code'])
assets = list1
yahoo_financials = YahooFinancials(assets)
data = yahoo_financials.get_historical_price_data(
    start_date=str(datetime.now().date().replace(month=1, day=1)),
    end_date=str(datetime.now().date().replace(month=12, day=31)),
    time_interval='daily')
prices_df = pd.DataFrame({
    a: {x['formatted_date']: x['adjclose'] for x in data[a]['prices']} for a in assets})
Check pandas functions such as https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.wide_to_long.html and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html for converting between long and wide formats.
Try this:
prices_df.rename_axis('Date').reset_index().melt('Date', var_name='Symbol', value_name='Price')
Output:
Date Symbol Price
0 2020-01-02 FB 209.779999
1 2020-01-03 FB 208.669998
2 2020-01-06 FB 212.600006
3 2020-01-07 FB 213.059998
4 2020-01-08 FB 215.220001
.. ... ... ...
973 2020-08-18 CDNS 109.150002
974 2020-08-19 CDNS 108.529999
975 2020-08-20 CDNS 111.260002
976 2020-08-21 CDNS 110.570000
977 2020-08-24 CDNS 111.260002
[978 rows x 3 columns]

Trying to create a new dataframe column in pandas based on a dataframe related if statement

I'm learning Python & pandas and practicing with different stock calculations. I've tried to search help with this but just haven't found a response similar enough or then didn't understand how to deduce the correct approach based on the previous responses.
I have read stock data for a given time frame with datareader into dataframe df. In df I have Date, Volume and Adj Close columns, which I want to use to create a new column "OBV" based on given criteria. OBV is a cumulative value that adds or subtracts today's volume to the previous day's OBV depending on the adjusted close price.
The calculation of OBV is simple:
If Adj Close is higher today than Adj Close of yesterday then add the Volume of today to the (cumulative) volume of yesterday.
If Adj Close is lower today than Adj Close of yesterday then subtract the Volume of today from the (cumulative) volume of yesterday.
On day 1 the OBV = 0
This is then repeated along the time frame and OBV gets accumulated.
Here's the basic imports and start
import numpy as np
import pandas as pd
import pandas_datareader
import datetime
from pandas_datareader import data, wb
start = datetime.date(2012, 4, 16)
end = datetime.date(2017, 4, 13)
# Reading in Yahoo Finance data with DataReader
df = data.DataReader('GOOG', 'yahoo', start, end)
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
#This is what I cannot get to work, and I've tried two different ways.
#ATTEMPT1
def obv1(column):
    if column["Adj Close"] > column["Adj close"].shift(-1):
        val = column["Volume"].shift(-1) + column["Volume"]
    else:
        val = column["Volume"].shift(-1) - column["Volume"]
    return val
df["OBV"] = df.apply(obv1, axis=1)
#ATTEMPT 2
def obv1(df):
    if df["Adj Close"] > df["Adj close"].shift(-1):
        val = df["Volume"].shift(-1) + df["Volume"]
    else:
        val = df["Volume"].shift(-1) - df["Volume"]
    return val
df["OBV"] = df.apply(obv1, axis=1)
Both give me an error.
Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(dict(
    Volume=np.random.randint(100, 200, 10),
    AdjClose=np.random.rand(10)
))
print(df)
AdjClose Volume
0 0.951710 111
1 0.346711 198
2 0.289758 174
3 0.662151 190
4 0.171633 115
5 0.018571 155
6 0.182415 113
7 0.332961 111
8 0.150202 113
9 0.810506 126
Multiply the Volume by -1 when the change in AdjClose is negative, then cumsum:
(df.Volume * (~df.AdjClose.diff().le(0) * 2 - 1)).cumsum()
0 111
1 -87
2 -261
3 -71
4 -186
5 -341
6 -228
7 -117
8 -230
9 -104
dtype: int64
Include this alongside the rest of the df:
df.assign(new=(df.Volume * (~df.AdjClose.diff().le(0) * 2 - 1)).cumsum())
AdjClose Volume new
0 0.951710 111 111
1 0.346711 198 -87
2 0.289758 174 -261
3 0.662151 190 -71
4 0.171633 115 -186
5 0.018571 155 -341
6 0.182415 113 -228
7 0.332961 111 -117
8 0.150202 113 -230
9 0.810506 126 -104
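For reference, the rules stated in the question can also be written out literally with np.where, which keeps day-1 OBV at 0 as the asker specified (the accepted trick above instead counts day 1 as +Volume). A sketch using the same seeded data:

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(dict(
    Volume=np.random.randint(100, 200, 10),
    AdjClose=np.random.rand(10)
))

# +Volume when AdjClose rose, -Volume when it fell;
# the first diff() is NaN, so day 1 contributes 0 (OBV = 0 on day 1)
change = df.AdjClose.diff()
signed = np.where(change > 0, df.Volume, np.where(change < 0, -df.Volume, 0))
df['OBV'] = signed.cumsum()
print(df)
```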

Pandas Frequency Conversion

I'm trying to find out whether it is possible to use data.asfreq(MonthEnd()) on data that was not created with date_range.
Here is what I'm trying to achieve. I run a csv query with the following code:
import numpy as np
import pandas as pd
data = pd.read_csv("https://www.quandl.com/api/v3/datasets/FRED/GDPC1.csv?api_key=", parse_dates=True)
data.columns = ["period", "integ"]
data['period'] = pd.to_datetime(data['period'], infer_datetime_format=True)
Then I want to assign frequency to my 'period' column by doing this:
tdelta = data.period[1] - data.period[0]
data.period.freq = tdelta
And some print comands:
print(data)
print(data.period.freq)
print(data.dtypes)
Returns:
..........
270 1948-07-01 2033.2
271 1948-04-01 2021.9
272 1948-01-01 1989.5
273 1947-10-01 1960.7
274 1947-07-01 1930.3
275 1947-04-01 1932.3
276 1947-01-01 1934.5
[277 rows x 2 columns]
-92 days +00:00:00
period datetime64[ns]
integ float64
dtype: object
I can also parse the original 'DATE' column by making it 'index':
data = pd.read_csv("https://www.quandl.com/api/v3/datasets/FRED/GDPC1.csv?api_key=", parse_dates=True, index_col='DATE')
What I want to do is just convert the quarterly data into monthly rows. For example:
270 1948-07-01 2033.2
271 1948-06-01 NaN
272 1948-05-01 NaN
273 1948-04-01 2021.9
274 1948-03-01 NaN
275 1948-02-01 NaN
276 1948-01-01 1989.5
......and so on.......
I'm eventually trying to do this by using ts.asfreq(MonthBegin()) and ts.asfreq(MonthBegin(), method='pad'). So far unsuccessfully. I have the following error:
NameError: name 'MonthBegin' is not defined
My question is can I use asfreq if I don't use date_range to create the frame? Somehow to 'pass' my date column to the function. If this is not the solution is it there any other easy way to convert quarterly to monthly frequency?
Use a time-based Grouper:
import pandas as pd
periods = ['1948-07-01', '1948-04-01', '1948-01-01', '1947-10-01',
'1947-07-01', '1947-04-01', '1947-01-01']
integs = [2033.2, 2021.9, 1989.5, 1960.7, 1930.3, 1932.3, 1934.5]
df = pd.DataFrame({'period': pd.to_datetime(periods), 'integ': integs})
df = df.set_index('period')
df = df.groupby(pd.Grouper(freq='MS')).sum().sort_index(ascending=False)
EDIT: You can also use resample instead of a Grouper:
df.resample('MS').sum().sort_index(ascending=False)
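To answer the original question directly: asfreq does not require an index built with date_range, only a DatetimeIndex, and the NameError occurs because MonthBegin must be imported from pandas.tseries.offsets; the string alias 'MS' avoids the import entirely. A minimal sketch on a subset of the sample data:

```python
import pandas as pd

periods = ['1947-01-01', '1947-04-01', '1947-07-01', '1947-10-01', '1948-01-01']
integs = [1934.5, 1932.3, 1930.3, 1960.7, 1989.5]
df = pd.DataFrame({'integ': integs}, index=pd.to_datetime(periods))

# asfreq works on any sorted DatetimeIndex; 'MS' = month-start frequency.
# Months missing from the quarterly data become NaN rows.
monthly = df.asfreq('MS')
print(monthly)
```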
