I'm writing a program to scrape through a folder of PDFs, extracting a table from each that contains the same data fields. [Screenshot of the sample table omitted.]
The goal of the program is to produce a spreadsheet with all of the data, one row per PDF, with the date from the PDF filename in the first column and the common fields as the column headers. [Example of the desired layout omitted.]
When I extract out the data into a dataframe and add column headers for "Field" and the date of the report, it looks like this:
Field 2021-12-04
0 Radiation: 5.22 kWh/m2
1 Energy: 116356 kWh
2 PR: 0.79
3 Month to Date NaN
4 Total radiation: 21.33 kWh/m2
5 Total energy: 464478 kWh
6 Max daily energy: 116478 kWh
7 Max daily occurred on: 2021-12-03
Then I set the index to the first column, since those are the common fields that I'll concat on. When I do that, the date column header seems to sit at a different level than the Field header, and I'm not sure what happens here:
2021-12-20
Field
Radiation: 3.76 kWh/m2
Energy: 89175 kWh
PR: 0.84
Month to Date NaN
Total radiation: 84.66 kWh/m2
Total energy: 1960868 kWh
Max daily energy: 126309 kWh
Max daily occurred on: 2021-12-17
Then I transpose, and the result looks OK:
Field Radiation: Energy: ... Max daily energy: Max daily occurred on:
2021-12-13 0.79 kWh/m2 19193 kWh ... 124933 kWh 2021-12-12
Then I concatenate, and the result looks good, except that for some reason the first column with the dates is lost. Any suggestions?
import tabula as tb
import os
import glob
import pandas as pd
import datetime
import re

begin_time = datetime.datetime.now()

User_Profile = os.environ['USERPROFILE']
Canadian_Combined = User_Profile + r'\Combined.csv'
CanadianReportsPDF = User_Profile + r'\Canadian Reports (PDF)'

CanadianDailySummaryTable = (72, 144, 230, 465)
CanadianDailyDF = pd.DataFrame()

def CanadianScrape():
    global CanadianDailyDF
    for pdf in glob.glob(CanadianReportsPDF + '/*Daily*'):
        # try:
        dfs = tb.read_pdf(os.path.abspath(pdf), area=CanadianDailySummaryTable, lattice=True, pages=1)
        df = dfs[0]
        date = re.search(r"([0-9]{4}-[0-9]{2}-[0-9]{2})", pdf)
        df.columns = ["Field", date.group(0)]
        df.set_index("Field", inplace=True)
        # print(df.columns)
        print(df)
        df_t = df.transpose()
        # print(df_t)
        CanadianDailyDF = pd.concat([df_t, CanadianDailyDF], ignore_index=False)
        # print(CanadianDailyDF)
        # except:
        #     continue
    # print(CanadianDailyDF)
    CanadianDailyDF.to_csv(Canadian_Combined, index=False)
CanadianScrape()
print(datetime.datetime.now() - begin_time)
EDIT:
Added an insert() line after the transpose to add the date column back in, per Ezer K's suggestion, and that seems to have solved it.
df.columns = ["Field",date.group(0)]
df.set_index("Field",inplace=True)
df_t = df.transpose()
df_t.insert(0, "Date:", date.group(0))
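For what it's worth, the dates were never truly lost: set_index plus transpose moves them into the frame's index, and the final to_csv(index=False) discards the index. A hedged alternative to the insert() fix, assuming the dates are still sitting in the index at write time, is to keep the index and just give it a name:
CanadianDailyDF.to_csv(Canadian_Combined, index=True, index_label="Date")  # index_label names the index column in the CSV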
It is hard to say exactly since your examples are hard to reproduce, but it seems that instead of adding a field you are changing the column names.
Try replacing these rows in your function:
df.columns = ["Field",date.group(0)]
df.set_index("Field",inplace=True)
df_t = df.transpose()
with these:
df_t = df.transpose()
df_t.insert(0, "Date:", date.group(0))
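A minimal, self-contained sketch of that pattern, with toy data standing in for the parsed PDF table (field names and the date are made up for illustration):

import pandas as pd

# toy stand-in for one parsed PDF table: col 0 = field names, col 1 = values
df = pd.DataFrame({0: ["Radiation:", "Energy:", "PR:"],
                   1: ["5.22 kWh/m2", "116356 kWh", "0.79"]})
date = "2021-12-04"

df.columns = ["Field", date]
df.set_index("Field", inplace=True)
df_t = df.transpose()          # the date now lives in the index
df_t.insert(0, "Date:", date)  # re-insert the date as a real column
print(df_t)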
I am extracting a data frame in pandas and want to extract only the rows where the date is after a given variable.
I can do this in multiple steps, but would like to know if it is possible to apply all the logic in one call, for best practice.
Here is my code
import pandas as pd
self.min_date = "2020-05-01"
#Extract DF from URL
self.df = pd.read_html("https://webgate.ec.europa.eu/rasff-window/portal/index.cfm?event=notificationsList")[0]
#Here is where the error lies, I want to extract the columns ["Subject","Reference","Date of case"] but where the date is after min_date.
self.df = self.df.loc[["Date of case" < self.min_date], ["Subject","Reference","Date of case"]]
return(self.df)
I keep getting the error: "IndexError: Boolean index has wrong length: 1 instead of 100"
I cannot find the solution online because every answer is too specific to the scenario of the person that asked the question.
e.g. this solution only works if you are filtering on one column: How to select rows from a DataFrame based on column values?
I appreciate any help.
Replace this:
["Date of case" < self.min_date]
with this:
self.df["Date of case"] < self.min_date
That is:
self.df = self.df.loc[self.df["Date of case"] < self.min_date,
["Subject","Reference","Date of case"]]
You have a slight syntax issue.
Keep in mind that it's best practice to convert string dates into pandas datetime objects using pd.to_datetime.
min_date = pd.to_datetime("2020-05-01")
#Extract DF from URL
df = pd.read_html("https://webgate.ec.europa.eu/rasff-window/portal/index.cfm?event=notificationsList")[0]
#Here is where the error lies, I want to extract the columns ["Subject","Reference","Date of case"] but where the date is after min_date.
df['Date of case'] = pd.to_datetime(df['Date of case'])
df = df.loc[df["Date of case"] > min_date, ["Subject","Reference","Date of case"]]
Output:
Subject Reference Date of case
0 Salmonella enterica ser. Enteritidis (presence... 2020.2145 2020-05-22
1 migration of primary aromatic amines (0.4737 m... 2020.2131 2020-05-22
2 celery undeclared on green juice drink from Ge... 2020.2118 2020-05-22
3 aflatoxins (B1 = 29.4 µg/kg - ppb) in shelled ... 2020.2146 2020-05-22
4 too high content of E 200 - sorbic acid (1772 ... 2020.2125 2020-05-22
I want to append an expense df to a revenue df but can't properly do so. Can anyone offer how I may do this?
import pandas as pd
import lxml
from lxml import html
import requests
import numpy as np
symbol = 'MFC'
url = 'https://www.marketwatch.com/investing/stock/'+ symbol +'/financials'
df=pd.read_html(url)
revenue = pd.concat(df[0:1]) # the revenue dataframe obj
revenue = revenue.dropna(axis='columns') # drop naN column
header = revenue.iloc[:0] # revenue df header row
expense = pd.concat(df[1:2]) # the expense dataframe obj
expense = expense.dropna(axis='columns') # drop naN column
statement = revenue.append(expense) #results in a dataframe with an added column (Unnamed:0)
revenue = pd.concat(df[0:1]) gives a frame whose columns are:
'Fiscal year is January-December. All values CAD millions.', 2015, 2016, 2017, 2018, 2019
expense = pd.concat(df[1:2]) gives a frame whose columns are:
'Unnamed: 0', 2015, 2016, 2017, 2018, 2019
How can I append the expense dataframe to the revenue dataframe so that I am left with a single dataframe object?
Thanks,
Rename columns.
df = df.rename(columns={'old_name': 'new_name',})
Then append with merge(), join(), or concat().
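A minimal sketch of that idea on toy frames (the data is invented; the column names mirror the ones in the question, and the point is aligning the mismatched first column before combining):

import pandas as pd

revenue = pd.DataFrame({'Fiscal year is January-December. All values CAD millions.': ['Sales'],
                        '2019': [100]})
expense = pd.DataFrame({'Unnamed: 0': ['Interest expense'],
                        '2019': [40]})

# give both frames the same label for their first column...
revenue = revenue.rename(columns={'Fiscal year is January-December. All values CAD millions.': 'LineItem'})
expense = expense.rename(columns={'Unnamed: 0': 'LineItem'})

# ...so they stack cleanly into one frame
statement = pd.concat([revenue, expense], ignore_index=True)
print(statement)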
I managed to append the dataframes with the following code. Thanks, David, for putting me on the right track. I admit this is not the best way to do it, because in a run-time environment I don't know the value of the text to rename and have hard-coded it here. Ideally it would be best to reference a placeholder such as df.iloc[:0,0] instead, but I'm having a tough time getting that to work.
df=pd.read_html(url)
revenue = pd.concat(df[0:1])
revenue = revenue.dropna(axis='columns')
revenue.rename({'Fiscal year is January-December. All values CAD millions.':'LineItem'},axis=1,inplace=True)
header = revenue.iloc[:0]
expense = pd.concat(df[1:2])
expense = expense.dropna(axis='columns')
expense.rename({'Unnamed: 0':'LineItem'}, axis=1, inplace=True)
statement = revenue.append(expense,ignore_index=True)
With the df=pd.read_html(url) construct, read_html returns a list of dataframes when scraping marketwatch financials. The below function returns a single dataframe of all balance sheet elements. The same code applies to quarterly and annual income and cash flow statements.
def getBalanceSheet(url):
    df = pd.read_html(url)
    count = sum([1 for Listitem in df if 'Unnamed: 0' in Listitem])
    statement = pd.concat(df[0:1])
    statement = statement.dropna(axis='columns')
    if 'q' in url:  # quarterly
        statement.rename({'All values CAD millions.':'LineItem'},axis=1,inplace=True)
    else:
        statement.rename({'Fiscal year is January-December. All values CAD millions.':'LineItem'},axis=1,inplace=True)
    for rowidx in range(count):
        df_name = 'df_'+str(int(rowidx))
        df_name = pd.concat(df[rowidx+1:rowidx+2])
        df_name = df_name.dropna(axis='columns')
        df_name.rename({'Unnamed: 0':'LineItem'}, axis=1, inplace=True)
        statement = statement.append(df_name,ignore_index=True)
    return statement
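A side note, not part of the original answer: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A hedged sketch of how the loop inside getBalanceSheet could be rewritten, collecting the pieces in a list and concatenating once:

frames = [statement]
for rowidx in range(count):
    part = pd.concat(df[rowidx+1:rowidx+2])
    part = part.dropna(axis='columns')
    part.rename({'Unnamed: 0': 'LineItem'}, axis=1, inplace=True)
    frames.append(part)
statement = pd.concat(frames, ignore_index=True)  # one concat instead of repeated appends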
I would like to write a Python script that will check if there is any missing day. If there is, it should take the price from the latest available day and create a new day in the data, as shown below. My data is in CSV files. Any ideas how it can be done?
Before:
MSFT,5-Jun-07,259.16
MSFT,3-Jun-07,253.28
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,29-May-07,243.31
After:
MSFT,5-Jun-07,259.16
MSFT,4-Jun-07,253.28
MSFT,3-Jun-07,253.28
MSFT,2-Jun-07,249.95
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,30-May-07,243.31
MSFT,29-May-07,243.31
My solution:
import pandas as pd
df = pd.read_csv("path/to/file/file.csv",names=list("abc")) # read string as file
cols = df.columns # store column order
df.b = pd.to_datetime(df.b) # convert col Date to datetime
df.set_index("b",inplace=True) # set col Date as index
df = df.resample("D").ffill().reset_index() # resample Days and fill values
df = df[cols] # revert order
df.sort_values(by="b",ascending=False,inplace=True) # sort by date
df["b"] = df["b"].dt.strftime("%-d-%b-%y") # revert date format
df.to_csv("data.csv",index=False,header=False) #specify outputfile if needed
print(df.to_string())
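One portability caveat on this solution (my note, not the poster's): the %-d format code is a glibc/macOS extension and fails on Windows, where the equivalent is %#d. A more portable alternative strips the leading zero explicitly:
df["b"] = df["b"].dt.strftime("%d-%b-%y").str.lstrip("0")  # "05-Jun-07" -> "5-Jun-07" on any platform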
Using the pandas library, this operation can be done in a single line. But first we need to read your data in with the right types:
import io
import pandas as pd
s = u"""name,Date,Close
MSFT,30-Dec-16,771.82
MSFT,29-Dec-16,782.79
MSFT,28-Dec-16,785.05
MSFT,27-Dec-16,791.55
MSFT,23-Dec-16,789.91
MSFT,16-Dec-16,790.8
MSFT,15-Dec-16,797.85
MSFT,14-Dec-16,797.07"""
#df = pd.read_csv("path/to/file.csv") # read from file
df = pd.read_csv(io.StringIO(s)) # read string as file
cols = df.columns # store column order
df.Date = pd.to_datetime(df.Date) # convert col Date to datetime
df.set_index("Date",inplace=True) # set col Date as index
df = df.resample("D").ffill().reset_index() # resample Days and fill values
df
Returns:
Date name Close
0 2016-12-14 MSFT 797.07
1 2016-12-15 MSFT 797.85
2 2016-12-16 MSFT 790.80
3 2016-12-17 MSFT 790.80
4 2016-12-18 MSFT 790.80
5 2016-12-19 MSFT 790.80
6 2016-12-20 MSFT 790.80
7 2016-12-21 MSFT 790.80
8 2016-12-22 MSFT 790.80
9 2016-12-23 MSFT 789.91
10 2016-12-24 MSFT 789.91
11 2016-12-25 MSFT 789.91
12 2016-12-26 MSFT 789.91
13 2016-12-27 MSFT 791.55
14 2016-12-28 MSFT 785.05
15 2016-12-29 MSFT 782.79
16 2016-12-30 MSFT 771.82
Convert back to CSV with:
df = df[cols] # revert order
df.sort_values(by="Date",ascending=False,inplace=True) # sort by date
df["Date"] = df["Date"].dt.strftime("%-d-%b-%y") # revert date format
df.to_csv(index=False,header=False) #specify outputfile if needed
Output:
MSFT,30-Dec-16,771.82
MSFT,29-Dec-16,782.79
MSFT,28-Dec-16,785.05
MSFT,27-Dec-16,791.55
MSFT,26-Dec-16,789.91
MSFT,25-Dec-16,789.91
MSFT,24-Dec-16,789.91
MSFT,23-Dec-16,789.91
...
To do this, you could iterate through your dataframe row by row. That would look something like:
for idx, row in df.iterrows():
    do_something(row)
To give you an idea, the do_something() part of your code would probably be something like checking if there was a gap between the dates. Then you would copy the other columns from the row above and insert a new row using:
df.loc[row] = [2, 3, 4]   # adding a row
df.index = df.index + 1   # shifting index
df = df.sort_index()      # sorting by index
Hope this helped give you an idea of how you would solve this. Let me know if you want some more code!
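Here is a fuller, runnable sketch of that row-by-row idea (my own illustration, assuming columns named name, date and close, with dates already parsed):

import pandas as pd
from datetime import timedelta

df = pd.DataFrame({"name": ["MSFT"] * 3,
                   "date": pd.to_datetime(["2007-06-05", "2007-06-03", "2007-06-01"]),
                   "close": [259.16, 253.28, 249.95]})

rows, prev = [], None
for _, row in df.sort_values("date").iterrows():
    if prev is not None:
        day = prev["date"] + timedelta(days=1)
        while day < row["date"]:      # one new row per missing day
            filler = prev.copy()      # carry the last known price forward
            filler["date"] = day
            rows.append(filler)
            day += timedelta(days=1)
    rows.append(row)
    prev = row

filled = pd.DataFrame(rows).sort_values("date", ascending=False)
print(filled)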
This code uses standard routines.
from datetime import datetime, timedelta
Input lines will have to be split on commas, and the dates parsed in two places in the main part of the code. I have therefore put this work in a single function.
def glean(s):
    msft, date_part, amount = s.split(',')
    if date_part.find('-') == 1:
        date_part = '0' + date_part
    date = datetime.strptime(date_part, '%d-%b-%y')
    return date, amount
Similarly, dates will have to be formatted for output with other pieces of data in a number of places in the main code.
def out(date, amount):
    date_str = date.strftime('%d-%b-%y')
    print(('%s,%s,%s' % ('MSFT', date_str, amount)).replace('MSFT,0', 'MSFT,'))
with open('before.txt') as before:
I read the initial line of data on its own to establish the first date for comparison with date in the next line.
    previous_date, previous_amount = glean(before.readline().strip())
    out(previous_date, previous_amount)
    for line in before.readlines():
        date, amount = glean(line.strip())
I calculate the elapsed time between the current line and the previous line, to know how many lines to output in place of missing lines.
        elapsed = previous_date - date
setting_date is decremented from previous_date for each day that elapsed without data, and one line is output for each such day, if there were any.
        setting_date = previous_date
        for i in range(-1 + elapsed.days):
            setting_date -= timedelta(days=1)
            out(setting_date, previous_amount)
Now the available line of data is output.
        out(date, amount)
Now previous_date and previous_amount are reset to reflect the new values, for use against the next line of data, if any.
        previous_date, previous_amount = date, amount
Output:
MSFT,5-Jun-07,259.16
MSFT,4-Jun-07,259.16
MSFT,3-Jun-07,253.28
MSFT,2-Jun-07,253.28
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,30-May-07,248.71
MSFT,29-May-07,243.31
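Pieced together, the fragments above form the following script (input file named before.txt, as assumed in the answer):

from datetime import datetime, timedelta

def glean(s):
    # split "MSFT,5-Jun-07,259.16" into a parsed date and the amount
    msft, date_part, amount = s.split(',')
    if date_part.find('-') == 1:   # pad single-digit days for strptime
        date_part = '0' + date_part
    date = datetime.strptime(date_part, '%d-%b-%y')
    return date, amount

def out(date, amount):
    # format a line, stripping the zero-padding added for parsing
    date_str = date.strftime('%d-%b-%y')
    print(('%s,%s,%s' % ('MSFT', date_str, amount)).replace('MSFT,0', 'MSFT,'))

with open('before.txt') as before:
    previous_date, previous_amount = glean(before.readline().strip())
    out(previous_date, previous_amount)
    for line in before.readlines():
        date, amount = glean(line.strip())
        elapsed = previous_date - date
        setting_date = previous_date
        for i in range(-1 + elapsed.days):
            setting_date -= timedelta(days=1)
            out(setting_date, previous_amount)
        out(date, amount)
        previous_date, previous_amount = date, amount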
I am trying to calculate the diff_chg of S&P sectors for 4 different dates (given in start_return):
start_return = [-30, -91, -182, -365]

for date in start_return:
    diff_chg = closingprices[-1].divide(closingprices[date])
    for i in sectors:  # sectors is XLK, XLY, etc.
        diff_chg[i] = diff_chg[sectordict[i]].mean()  # finds the % chg of all sectors
    diff_df = diff_chg.to_frame
My expected output is to have 4 columns in the df, each with the returns of every sector for one of the given periods (-30, -91, -182, -365).
As of now, when I run this code it returns the sum of the returns of all 4 periods in diff_df. I would like it to create a new column in the df for each period.
my code returns:
XLK 1.859907
XLI 1.477272
XLF 1.603589
XLE 1.415377
XLB 1.526237
but I want it to return:
1mo (-30) 3mo (-91) 6mo (-182) 1yr (-365)
XLK 1.086547 values here etc etc
XLI 1.0334
XLF 1.07342
XLE .97829
XLB 1.0281
Try something like this:
start_return = [-30, -91, -182, -365]
diff_chg = pd.DataFrame()
for date in start_return:
    diff_chg[date] = closingprices[-1].divide(closingprices[date])
What this does is to add columns for each date in start_return to a single DataFrame created at the beginning.
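Building on that, a hedged sketch that also applies the human-readable labels from the question (toy closing prices stand in for the real closingprices frame, and positional row access goes through iloc):

import numpy as np
import pandas as pd

# toy stand-in: 400 days of closes for three sectors named in the question
rng = np.random.default_rng(0)
closingprices = pd.DataFrame(rng.uniform(50, 150, size=(400, 3)),
                             columns=["XLK", "XLI", "XLF"])

labels = {-30: "1mo", -91: "3mo", -182: "6mo", -365: "1yr"}

diff_chg = pd.DataFrame()
for offset in labels:
    # last close divided by the close `offset` trading days back
    diff_chg[f"{labels[offset]} ({offset})"] = (
        closingprices.iloc[-1].divide(closingprices.iloc[offset]))

print(diff_chg)  # rows = sectors, one column per period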
I'm importing a trade list and trying to consolidate it into a position file with summed quantities and average prices. I'm grouping based on (ticker, type, expiration and strike). Two questions:
Output has the index group (ticker, type, expiration and strike) in the first column. How can I change this so that each index column outputs to its own column so the output csv is formatted the same way as the input data?
I currently force the stock trades to have values ("1") because leaving the cells blank will cause an error, but this adds bad data since "1" is not meaningful. Is there a way to preserve "" without causing a problem?
Dataframe (columns are ticker, type, expiration, strike, price, quantity, per the read_csv call below):
GM stock 1 1 32 100
AAPL call 201612 120 3.5 1000
AAPL call 201612 120 3.25 1000
AAPL call 201611 120 2.5 2000
AAPL put 201612 115 2.5 500
AAPL stock 1 1 117 100
Code:
import pandas as pd
import numpy as np
df = pd.read_csv(input_file, index_col=['ticker', 'type', 'expiration', 'strike'], names=['ticker', 'type', 'expiration', 'strike', 'price', 'quantity'])
df_output = df.groupby(df.index).agg({'price':np.mean, 'quantity':np.sum})
df_output.to_csv(output_file, sep=',')
csv output comes out in this format:
(ticker, type, expiration, strike), price, quantity
desired format:
ticker, type, expiration, strike, price, quantity
For the first question, you should use groupby(df.index_col) instead of groupby(df.index)
For the second, I am not sure why you couldn't preserve ""; is that column numeric?
I mocked up some data like below:
import pandas as pd
import numpy as np

d = [
    {'ticker':'A', 'type':'M', 'strike':'', 'price':32},
    {'ticker':'B', 'type':'F', 'strike':100, 'price':3.5},
    {'ticker':'C', 'type':'F', 'strike':'', 'price':2.5}
]
df = pd.DataFrame(d)
print(df)
#dgroup = df.groupby(['ticker', 'type']).agg({'price':np.mean})
df.index_col = ['ticker', 'type', 'strike']
dgroup = df.groupby(df.index_col).agg({'price':np.mean})
#dgroup = df.groupby(df.index).agg({'price':np.mean})
print(dgroup)
print(type(dgroup))
dgroup.to_csv('check.csv')
output in check.csv:
ticker,type,strike,price
A,M,,32.0
B,F,100,3.5
C,F,,2.5
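On the second question, one more detail (my note): pd.read_csv turns empty fields into NaN by default; passing keep_default_na=False preserves them as empty strings, which may remove the need for the placeholder "1" values:

df = pd.read_csv(input_file,
                 names=['ticker', 'type', 'expiration', 'strike', 'price', 'quantity'],
                 keep_default_na=False)  # '' stays '' instead of becoming NaN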