Pandas dataframe output formatting - python

I'm importing a trade list and trying to consolidate it into a position file with summed quantities and average prices. I'm grouping based on (ticker, type, expiration and strike). Two questions:
The output has the group index (ticker, type, expiration, strike) as a single tuple in the first column. How can I change this so that each index field outputs to its own column, so the output CSV is formatted the same way as the input data?
I currently force the stock trades to have placeholder values ("1") because leaving the cells blank causes an error, but this adds bad data, since "1" is not meaningful. Is there a way to preserve "" without causing a problem?
Dataframe:
GM stock 1 1 32 100
AAPL call 201612 120 3.5 1000
AAPL call 201612 120 3.25 1000
AAPL call 201611 120 2.5 2000
AAPL put 201612 115 2.5 500
AAPL stock 1 1 117 100
Code:
import pandas as pd
import numpy as np
df = pd.read_csv(input_file, index_col=['ticker', 'type', 'expiration', 'strike'], names=['ticker', 'type', 'expiration', 'strike', 'price', 'quantity'])
df_output = df.groupby(df.index).agg({'price': np.mean, 'quantity': np.sum})
df_output.to_csv(output_file, sep=',')
csv output comes out in this format:
(ticker, type, expiration, strike), price, quantity
desired format:
ticker, type, expiration, strike, price, quantity

For the first question, you should use groupby(df.index_col) (a list of column names) instead of groupby(df.index).
For the second, I am not sure why you couldn't preserve ""; is that column numeric?
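As a minimal sketch of that first fix against your original code (assuming the input file has no header row, since names= supplies the column names): read without index_col so the four key fields stay regular columns, group by their names, and let to_csv write the resulting MultiIndex out as separate columns.
import numpy as np
import pandas as pd

df = pd.read_csv(input_file,
                 names=['ticker', 'type', 'expiration', 'strike', 'price', 'quantity'])
df_output = df.groupby(['ticker', 'type', 'expiration', 'strike']).agg(
    {'price': np.mean, 'quantity': np.sum})
# the group keys form a MultiIndex, and to_csv writes each level as its own column
df_output.to_csv(output_file, sep=',')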
I mock some data like below:
import pandas as pd
import numpy as np

d = [
    {'ticker': 'A', 'type': 'M', 'strike': '', 'price': 32},
    {'ticker': 'B', 'type': 'F', 'strike': 100, 'price': 3.5},
    {'ticker': 'C', 'type': 'F', 'strike': '', 'price': 2.5}
]
df = pd.DataFrame(d)
print(df)
# dgroup = df.groupby(['ticker', 'type']).agg({'price': np.mean})
df.index_col = ['ticker', 'type', 'strike']  # stash the key column names on the frame
dgroup = df.groupby(df.index_col).agg({'price': np.mean})
# dgroup = df.groupby(df.index).agg({'price': np.mean})
print(dgroup)
print(type(dgroup))
dgroup.to_csv('check.csv')
output in check.csv:
ticker,type,strike,price
A,M,,32.0
B,F,100,3.5
C,F,,2.5
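For the empty-string question, one option (a sketch, assuming you want blanks to survive the round trip rather than using "1" placeholders) is to stop read_csv from converting "" to NaN in the first place:
import pandas as pd

# keep_default_na=False leaves empty cells as "" instead of NaN,
# so they group and write back out unchanged
df = pd.read_csv('input.csv',
                 names=['ticker', 'type', 'expiration', 'strike', 'price', 'quantity'],
                 keep_default_na=False)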

Related

Save API filtered data in Excel file

I have a problem with some data from the Binance API.
What I want is to keep a list of the USDT-paired coins from Binance. The problem is Binance gives me a complete list of all paired coins, and I'm unable to filter out only the USDT pairs.
I need to save only the USDT-paired coins in an Excel file.
I have written the code to fetch the full coin list:
import requests, json
data=requests.get('https://api.binance.com' + '/api/v1/ticker/allBookTickers')
data=json.loads(data.content)
print(data)
All you need is the pandas module. You can try the code below:
import requests
import json
import pandas as pd
data=requests.get('https://api.binance.com' + '/api/v1/ticker/allBookTickers')
data=json.loads(data.content)
dataframe = pd.DataFrame(data)
dataframe.to_excel("my_data.xlsx")
Your file will be saved in the same directory as the script, named my_data.xlsx.
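One small note (writing .xlsx also assumes an engine such as openpyxl is installed): to_excel writes the numeric row index as the first column by default, so pass index=False if you don't want it in the sheet.
dataframe.to_excel("my_data.xlsx", index=False)  # omit the 0..n row index column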
Note that the dataframe variable looks something like the following:
    symbol  bidPrice   bidQty  askPrice  askQty
0   ETHBTC  0.068918   1.7195  0.068919  0.0219
1   LTCBTC  0.002926    7.943  0.002927   4.368
2   BNBBTC  0.009438    4.493  0.009439   3.072
3   NEOBTC  0.000499   385.33    0.0005  793.74
4  QTUMETH  0.002231    304.3  0.002235    60.9
As per your comment, you need the pair of coins ending with USDT. Therefore what you need is to filter the dataframe out using a regex statement:
import requests
import json
import pandas as pd
data=requests.get('https://api.binance.com' + '/api/v1/ticker/allBookTickers')
data=json.loads(data.content)
dataframe = pd.DataFrame(data)
dataframe = dataframe[dataframe["symbol"].str.contains("USDT$")]
dataframe.to_excel("my_data.xlsx")
dataframe
which results in an output like the following:
      symbol  bidPrice    bidQty  askPrice    askQty
11   BTCUSDT     44260   0.11608     44260   1.56671
12   ETHUSDT   3116.59    5.0673    3116.6   12.3602
98   BNBUSDT     428.2   124.404     428.3    45.021
125  BCCUSDT         0         0         0         0
Note that I have shown just the first four rows of the dataframe.
Explanation
The regex USDT$ matches strings that end with USDT; the dollar sign anchors the pattern to the end of the string.
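An equivalent, regex-free sketch uses str.endswith, which reads a little more literally:
dataframe = dataframe[dataframe["symbol"].str.endswith("USDT")]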

Column Lost when Concatenating Dataframes with Pandas?

I'm writing a program to scrape through a folder of PDFs, extracting a table from each that contains the same data fields. Sample screenshot of the table in one of the PDFs:
The goal of the program is to produce a spreadsheet with all of the data from each PDF in a single row with the date of the PDF, and the common fields as the column headers. The date in the first column should be the date in the PDF filename. It should look like this:
When I extract out the data into a dataframe and add column headers for "Field" and the date of the report, it looks like this:
Field 2021-12-04
0 Radiation: 5.22 kWh/m2
1 Energy: 116356 kWh
2 PR: 0.79
3 Month to Date NaN
4 Total radiation: 21.33 kWh/m2
5 Total energy: 464478 kWh
6 Max daily energy: 116478 kWh
7 Max daily occurred on: 2021-12-03
Then I set the first column as the index, since those are the common fields that I'll concat on. When I do that, the date column header seems to sit at a different level than the Field header? I'm not sure what happens here:
2021-12-20
Field
Radiation: 3.76 kWh/m2
Energy: 89175 kWh
PR: 0.84
Month to Date NaN
Total radiation: 84.66 kWh/m2
Total energy: 1960868 kWh
Max daily energy: 126309 kWh
Max daily occurred on: 2021-12-17
Then I transpose, and the result looks OK:
Field Radiation: Energy: ... Max daily energy: Max daily occurred on:
2021-12-13 0.79 kWh/m2 19193 kWh ... 124933 kWh 2021-12-12
Then I concatenate, and the result looks good except for some reason the first column with the dates is lost. Any suggestions?
import tabula as tb
import os
import glob
import pandas as pd
import datetime
import re

begin_time = datetime.datetime.now()
User_Profile = os.environ['USERPROFILE']
Canadian_Combined = User_Profile + '\\Combined.csv'
CanadianReportsPDF = User_Profile + '\\Canadian Reports (PDF)'
CanadianDailySummaryTable = (72, 144, 230, 465)
CanadianDailyDF = pd.DataFrame()

def CanadianScrape():
    global CanadianDailyDF
    for pdf in glob.glob(CanadianReportsPDF + '/*Daily*'):
        # try:
        dfs = tb.read_pdf(os.path.abspath(pdf), area=CanadianDailySummaryTable, lattice=True, pages=1)
        df = dfs[0]
        date = re.search(r"([0-9]{4}\-[0-9]{2}\-[0-9]{2})", pdf)
        df.columns = ["Field", date.group(0)]
        df.set_index("Field", inplace=True)
        # print(df.columns)
        print(df)
        df_t = df.transpose()
        # print(df_t)
        CanadianDailyDF = pd.concat([df_t, CanadianDailyDF], ignore_index=False)
        # print(CanadianDailyDF)
        # except:
        #     continue
    # print(CanadianDailyDF)
    CanadianDailyDF.to_csv(Canadian_Combined, index=False)

CanadianScrape()
print(datetime.datetime.now() - begin_time)
EDIT**
Added an insert() line after the transpose to add back in the date column, per Ezer K's suggestion, and that seems to have solved it.
df.columns = ["Field",date.group(0)]
df.set_index("Field",inplace=True)
df_t = df.transpose()
df_t.insert(0, "Date:", date.group(0))
It is hard to say exactly, since your examples are hard to reproduce, but it seems that instead of adding a field you are changing the column names.
Try switching these rows in your function:
df.columns = ["Field",date.group(0)]
df.set_index("Field",inplace=True)
df_t = df.transpose()
with these:
df_t = df.transpose()
df_t.insert(0, "Date:", date.group(0))
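Another possibility (a sketch, not tested against your PDFs): after the transpose the dates live in the index, and to_csv(..., index=False) silently drops them, so keeping the index at write time also preserves the date column.
# keep the transposed index (the report dates) when writing out
CanadianDailyDF.to_csv(Canadian_Combined, index=True, index_label="Date")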

Make a function to return length based on a condition

I have 2 DataFrames: one contains stock tickers and a maximum/minimum price range along with other columns.
The other DataFrame has dates as indexes and is grouped by ticker, with various metrics like open, close, high, low, etc. Now, I want to take a count of days from this DataFrame when, for a given stock, the close price was higher than the minimum price.
I am stuck here: now I want to find, for example, how many days AMZN was trading below the period max price.
I want to take a count of days from the second dataframe based on values from the first dataframe: a count of days where the closing price was lesser/greater than the max/min period price.
I have added the code to reproduce the DataFrames.
Please check the screenshots.
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
import yfinance as yf
start=datetime.datetime.today()-relativedelta(years=2)
end=datetime.datetime.today()
us_stock_list='FB AMZN BABA'
data_metric = yf.download(us_stock_list, start=start, end=end,group_by='column',auto_adjust=True)
data_ticker= yf.download(us_stock_list, start=start, end=end,group_by='ticker',auto_adjust=True)
stock_list=[stock for stock in data_ticker.stack()]
# max_price
max_values=pd.DataFrame(data_ticker.max().unstack()['High'])
# min_price
min_values=pd.DataFrame(data_ticker.min().unstack()['Low'])
# latest_price
latest_day=pd.DataFrame(data_ticker.tail(1).unstack())
latest_day=latest_day.unstack().unstack().unstack().reset_index()
# latest_day=latest_day.unstack().reset_index()
latest_day=latest_day.drop(columns=['level_0','Date'])
latest_day.set_index('level_3',inplace=True)
latest_day.rename(columns={0:'Values'},inplace=True)
latest_day=latest_day.groupby(by=['level_3','level_2']).max().unstack()
latest_day.columns=[ '_'.join(x) for x in latest_day.columns ]
latest_day=latest_day.join(max_values,how='inner')
latest_day=latest_day.join(min_values,how='inner')
latest_day.rename(columns={'High':'Period_High_Max','Low':'Period_Low_Min'},inplace=True)
close_price_data=pd.DataFrame(data_metric['Close'].unstack().reset_index())
close_price_data= close_price_data.rename(columns={'level_0':'Stock',0:'Close_price'})
close_price_data.set_index('Stock',inplace=True)
Use this to reproduce:
{"Values_Close":{"AMZN":2286.0400390625,"BABA":194.4799957275,"FB":202.2700042725},"Values_High":{"AMZN":2362.4399414062,"BABA":197.3800048828,"FB":207.2799987793},"Values_Low":{"AMZN":2258.1899414062,"BABA":192.8600006104,"FB":199.0500030518},"Values_Open":{"AMZN":2336.8000488281,"BABA":195.75,"FB":201.6000061035},"Values_Volume":{"AMZN":9754900.0,"BABA":22268800.0,"FB":30399600.0},"Period_High_Max":{"AMZN":2475.0,"BABA":231.1399993896,"FB":224.1999969482},"Period_Low_Min":{"AMZN":1307.0,"BABA":129.7700042725,"FB":123.0199966431},"%_Position":{"AMZN":0.8382192115,"BABA":0.6383544892,"FB":0.7832576338}}
{"Stock":{
"0":"AMZN",
"1":"AMZN",
"2":"AMZN",
"3":"AMZN",
"4":"AMZN",
"5":"AMZN",
"6":"AMZN",
"7":"AMZN",
"8":"AMZN",
"9":"AMZN",
"10":"AMZN",
"11":"AMZN",
"12":"AMZN",
"13":"AMZN",
"14":"AMZN",
"15":"AMZN",
"16":"AMZN",
"17":"AMZN",
"18":"AMZN",
"19":"AMZN"},
"Date":{
"0":1525305600000,
"1":1525392000000,
"2":1525651200000,
"3":1525737600000,
"4":1525824000000,
"5":1525910400000,
"6":1525996800000,
"7":1526256000000,
"8":1526342400000,
"9":1526428800000,
"10":1526515200000,
"11":1526601600000,
"12":1526860800000,
"13":1526947200000,
"14":1527033600000,
"15":1527120000000,
"16":1527206400000,
"17":1527552000000,
"18":1527638400000,
"19":1527724800000 },
"Close_price":{
"0":1572.0799560547,
"1":1580.9499511719,
"2":1600.1400146484,
"3":1592.3900146484,
"4":1608.0,
"5":1609.0799560547,
"6":1602.9100341797,
"7":1601.5400390625,
"8":1576.1199951172,
"9":1587.2800292969,
"10":1581.7600097656,
"11":1574.3699951172,
"12":1585.4599609375,
"13":1581.4000244141,
"14":1601.8599853516,
"15":1603.0699462891,
"16":1610.1500244141,
"17":1612.8699951172,
"18":1624.8900146484,
"19":1629.6199951172}}
Do a merge between both dataframes, group by company (index level=0), and apply a custom function:
df_merge = close_price_data.merge(
    latest_day[['Period_High_Max', 'Period_Low_Min']],
    left_index=True,
    right_index=True)

def fun(df):
    d = {}
    d['days_above_min'] = (df.Close_price > df.Period_Low_Min).sum()
    d['days_below_max'] = (df.Close_price < df.Period_High_Max).sum()
    return pd.Series(d)

df_merge.groupby(level=0).apply(fun)
Period_Low_Min and Period_High_Max are the min and max respectively, so all closing prices will fall within that range; if this is not what you are trying to accomplish, let me know.
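To see the pattern in isolation, here is a toy run of the same groupby/apply (made-up numbers, just to show the shape of the result):
import pandas as pd

toy = pd.DataFrame({
    'Close_price':     [10, 12,  8, 20, 25, 18],
    'Period_Low_Min':  [ 9,  9,  9, 17, 17, 17],
    'Period_High_Max': [13, 13, 13, 26, 26, 26],
}, index=['AMZN'] * 3 + ['FB'] * 3)

print(toy.groupby(level=0).apply(fun))
#       days_above_min  days_below_max
# AMZN               2               3
# FB                 3               3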

How to append two dataframe objects containing same column data but different column names?

I want to append an expense df to a revenue df but can't do so properly. Can anyone suggest how I may do this?
import pandas as pd
import lxml
from lxml import html
import requests
import numpy as np
symbol = 'MFC'
url = 'https://www.marketwatch.com/investing/stock/'+ symbol +'/financials'
df=pd.read_html(url)
revenue = pd.concat(df[0:1]) # the revenue dataframe obj
revenue = revenue.dropna(axis='columns') # drop naN column
header = revenue.iloc[:0] # revenue df header row
expense = pd.concat(df[1:2]) # the expense dataframe obj
expense = expense.dropna(axis='columns') # drop naN column
statement = revenue.append(expense) #results in a dataframe with an added column (Unnamed:0)
revenue = pd.concat(df[0:1]) =
  Fiscal year is January-December. All values CAD millions.  2015  2016  2017  2018  2019
expense = pd.concat(df[1:2]) =
  Unnamed: 0  2015  2016  2017  2018  2019
How can I append the expense dataframe to the revenue dataframe so that I am left with a single dataframe object?
Thanks,
Rename columns.
df = df.rename(columns={'old_name': 'new_name',})
Then combine them with merge(), join(), or concat().
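A sketch of that for this specific case (assuming the header text is exactly as shown in the two frames above):
# give both frames the same name for their label column, then stack them
revenue = revenue.rename(columns={'Fiscal year is January-December. All values CAD millions.': 'LineItem'})
expense = expense.rename(columns={'Unnamed: 0': 'LineItem'})
statement = pd.concat([revenue, expense], ignore_index=True)
Using pd.concat also sidesteps DataFrame.append, which newer pandas versions have removed.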
I managed to append the dataframes with the following code. Thanks, David, for putting me on the right track. I admit this is not the best way to do this, because in a runtime environment I don't know the value of the text to rename, and I've hard-coded it here. Ideally it would be best to reference a placeholder at df.iloc[:0,0] instead, but I'm having a tough time getting that to work.
df=pd.read_html(url)
revenue = pd.concat(df[0:1])
revenue = revenue.dropna(axis='columns')
revenue.rename({'Fiscal year is January-December. All values CAD millions.':'LineItem'},axis=1,inplace=True)
header = revenue.iloc[:0]
expense = pd.concat(df[1:2])
expense = expense.dropna(axis='columns')
expense.rename({'Unnamed: 0':'LineItem'}, axis=1, inplace=True)
statement = revenue.append(expense,ignore_index=True)
Using the df = pd.read_html(url) construct, several DataFrames are returned when scraping MarketWatch financials. The function below returns a single dataframe of all balance-sheet elements. The same code applies to quarterly and annual income and cash-flow statements.
def getBalanceSheet(url):
    df = pd.read_html(url)
    count = sum([1 for Listitem in df if 'Unnamed: 0' in Listitem])
    statement = pd.concat(df[0:1])
    statement = statement.dropna(axis='columns')
    if 'q' in url:  # quarterly
        statement.rename({'All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    else:
        statement.rename({'Fiscal year is January-December. All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    for rowidx in range(count):
        df_name = 'df_' + str(int(rowidx))
        df_name = pd.concat(df[rowidx+1:rowidx+2])
        df_name = df_name.dropna(axis='columns')
        df_name.rename({'Unnamed: 0': 'LineItem'}, axis=1, inplace=True)
        statement = statement.append(df_name, ignore_index=True)
    return statement
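A hypothetical call (the balance-sheet URL path is an assumption; check the actual MarketWatch URL for your symbol):
symbol = 'MFC'
bs = getBalanceSheet('https://www.marketwatch.com/investing/stock/' + symbol + '/financials/balance-sheet')
print(bs.head())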

How to add automatically data to the missing days in historical stock prices?

I would like to write a Python script that will check if there is any missing day. If there is, it should take the price from the latest available day and create a new entry for the missing day, as shown below. My data is in CSV files. Any ideas how this can be done?
Before:
MSFT,5-Jun-07,259.16
MSFT,3-Jun-07,253.28
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,29-May-07,243.31
After:
MSFT,5-Jun-07,259.16
MSFT,4-Jun-07,253.28
MSFT,3-Jun-07,253.28
MSFT,2-Jun-07,249.95
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,30-May-07,243.31
MSFT,29-May-07,243.31
My solution:
import pandas as pd

df = pd.read_csv("path/to/file/file.csv", names=list("abc"))  # read headerless CSV; columns a, b, c
cols = df.columns                                      # store column order
df.b = pd.to_datetime(df.b)                            # convert col Date to datetime
df.set_index("b", inplace=True)                        # set col Date as index
df = df.resample("D").ffill().reset_index()            # resample days and fill values
df = df[cols]                                          # revert order
df.sort_values(by="b", ascending=False, inplace=True)  # sort by date
df["b"] = df["b"].dt.strftime("%-d-%b-%y")             # revert date format
df.to_csv("data.csv", index=False, header=False)       # specify output file if needed
print(df.to_string())
Using the pandas library, this operation can be done in a single line. But first we need to read your data into the right format:
import io
import pandas as pd
s = u"""name,Date,Close
MSFT,30-Dec-16,771.82
MSFT,29-Dec-16,782.79
MSFT,28-Dec-16,785.05
MSFT,27-Dec-16,791.55
MSFT,23-Dec-16,789.91
MSFT,16-Dec-16,790.8
MSFT,15-Dec-16,797.85
MSFT,14-Dec-16,797.07"""
#df = pd.read_csv("path/to/file.csv") # read from file
df = pd.read_csv(io.StringIO(s)) # read string as file
cols = df.columns # store column order
df.Date = pd.to_datetime(df.Date) # convert col Date to datetime
df.set_index("Date",inplace=True) # set col Date as index
df = df.resample("D").ffill().reset_index() # resample Days and fill values
df
Returns:
Date name Close
0 2016-12-14 MSFT 797.07
1 2016-12-15 MSFT 797.85
2 2016-12-16 MSFT 790.80
3 2016-12-17 MSFT 790.80
4 2016-12-18 MSFT 790.80
5 2016-12-19 MSFT 790.80
6 2016-12-20 MSFT 790.80
7 2016-12-21 MSFT 790.80
8 2016-12-22 MSFT 790.80
9 2016-12-23 MSFT 789.91
10 2016-12-24 MSFT 789.91
11 2016-12-25 MSFT 789.91
12 2016-12-26 MSFT 789.91
13 2016-12-27 MSFT 791.55
14 2016-12-28 MSFT 785.05
15 2016-12-29 MSFT 782.79
16 2016-12-30 MSFT 771.82
Convert back to CSV with:
df = df[cols] # revert order
df.sort_values(by="Date",ascending=False,inplace=True) # sort by date
df["Date"] = df["Date"].dt.strftime("%-d-%b-%y") # revert date format
df.to_csv(index=False,header=False) #specify outputfile if needed
Output:
MSFT,30-Dec-16,771.82
MSFT,29-Dec-16,782.79
MSFT,28-Dec-16,785.05
MSFT,27-Dec-16,791.55
MSFT,26-Dec-16,789.91
MSFT,25-Dec-16,789.91
MSFT,24-Dec-16,789.91
MSFT,23-Dec-16,789.91
...
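One portability caveat on the snippet above: the "%-d" format (unpadded day) is a POSIX/glibc strftime extension; on Windows the equivalent flag is "%#d" (an assumption worth verifying on your build).
import os

# "%-d" (unpadded day) works with POSIX strftime; Windows uses "%#d" instead
fmt = "%#d-%b-%y" if os.name == "nt" else "%-d-%b-%y"
df["Date"] = df["Date"].dt.strftime(fmt)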
To do this, you would need to iterate through your dataframe using nested for loops. That would look something like:
for column in df:
    for row in df:
        do_something()
To give you an idea, the do_something() part of your code would probably be something like checking whether there is a gap between the dates. Then you would copy the other columns from the row above and insert a new row using:
df.loc[row] = [2, 3, 4]   # adding a row
df.index = df.index + 1   # shifting index
df = df.sort_index()      # sorting by index (df.sort() was removed in modern pandas)
Hope this helped give you an idea of how you would solve this. Let me know if you want some more code!
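For the gap check itself, a rough sketch (assuming a datetime column named Date sorted newest-first, as in the question's data):
import pandas as pd

# diff(-1) compares each date with the next (older) row; a gap exists
# wherever more than one calendar day separates them
gaps = df["Date"].diff(periods=-1).dt.days > 1
print(df[gaps])  # rows that are followed by at least one missing day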
This code uses standard routines.
from datetime import datetime, timedelta
Input lines have to be split on commas, and the dates parsed, in two places in the main part of the code. I have therefore put this work in a single function.
def glean(s):
    msft, date_part, amount = s.split(',')
    if date_part.find('-') == 1:
        date_part = '0' + date_part
    date = datetime.strptime(date_part, '%d-%b-%y')
    return date, amount
Similarly, dates have to be formatted for output with other pieces of data in a number of places in the main code.
def out(date, amount):
    date_str = date.strftime('%d-%b-%y')
    print(('%s,%s,%s' % ('MSFT', date_str, amount)).replace('MSFT,0', 'MSFT,'))
with open('before.txt') as before:
I read the initial line of data on its own to establish the first date for comparison with the date in the next line.
    previous_date, previous_amount = glean(before.readline().strip())
    out(previous_date, previous_amount)
    for line in before.readlines():
        date, amount = glean(line.strip())
I calculate the elapsed time between the current line and the previous line, to know how many lines to output in place of missing lines.
        elapsed = previous_date - date
setting_date is decremented from previous_date for the number of days that elapsed without data. One line is output for each missing day, if there were any.
        setting_date = previous_date
        for i in range(-1 + elapsed.days):
            setting_date -= timedelta(days=1)
            out(setting_date, previous_amount)
Now the available line of data is output.
        out(date, amount)
Now previous_date and previous_amount are reset to reflect the new values, for use against the next line of data, if any.
        previous_date, previous_amount = date, amount
Output:
MSFT,5-Jun-07,259.16
MSFT,4-Jun-07,259.16
MSFT,3-Jun-07,253.28
MSFT,2-Jun-07,253.28
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,30-May-07,248.71
MSFT,29-May-07,243.31
