Parsing idx file or string to Pandas DataFrame - python

I would like to parse the following idx file, https://www.sec.gov/Archives/edgar/daily-index/2022/QTR1/company.20220112.idx, into a Pandas DataFrame.
I use the following code to check how it would look as a text file:
import os
import requests

base_path = '/Users/GunardiLin/Desktop/Insider_Ranking/temp/'
current_dirs = os.listdir(path=base_path)
local_filename = '20200102'
local_file_path = '/'.join([base_path, local_filename])

if local_filename in current_dirs:  # check the directory listing, not the path string
    print(f'Skipping index file for {local_filename} because it is already saved.')
else:
    url = 'https://www.sec.gov/Archives/edgar/daily-index/2020/QTR1/company.20200102.idx'
    r = requests.get(url, stream=True, headers={'user-agent': 'MyName myname#outlook.com'})
    with open(local_file_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=10240):
            f.write(chunk)
Next I would like to build a parser that is fault-tolerant, because it will need to parse a new idx file into a pd.DataFrame every day.
My idea was to use string manipulation, but that would be very complicated and not fault-tolerant.
I would be thankful if someone could show the best practice for parsing this and provide some boilerplate code.

Since this is mostly a fixed-width file, you can use pandas read_fwf to read it. You can skip over the leading information (via skiprows=) and get straight to the data. The column names are predefined and assigned when the file is read:
import pandas as pd

idx_path = 'company.20220112.idx'
names = ['Company Name', 'Form Type', 'CIK', 'Date Filed', 'File Name']
df = pd.read_fwf(idx_path, colspecs=[(0,61),(62,74),(74,84),(86,94),(98,146)], names=names, skiprows=11)
df.head(10)
Company Name Form Type CIK Date Filed File Name
0 005 - Series of IPOSharks Venture Master Fund,... D 1888451 20220112 edgar/data/1888451/0001888451-22-000002.txt
1 10X Capital Venture Acquisition Corp. III EFFECT 1848948 20220111 edgar/data/1848948/9999999995-22-000102.txt
2 110 White Partners LLC D 1903845 20220112 edgar/data/1903845/0001884293-22-000001.txt
3 15 Beach, MHC 3 1903509 20220112 edgar/data/1903509/0001567619-22-001073.txt
4 15 Beach, MHC SC 13D 1903509 20220112 edgar/data/1903509/0000943374-22-000014.txt
5 170 Valley LLC D 1903913 20220112 edgar/data/1903913/0001903913-22-000001.txt
6 1st FRANKLIN FINANCIAL CORP 424B3 38723 20220112 edgar/data/38723/0000038723-22-000003.txt
7 1st FRANKLIN FINANCIAL CORP 424B3 38723 20220112 edgar/data/38723/0000038723-22-000004.txt
8 215 BF Associates LLC D 1904145 20220112 edgar/data/1904145/0001904145-22-000001.txt
9 2401 Midpoint Drive REIT, LLC D 1903337 20220112 edgar/data/1903337/0001903337-22-000001.txt
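
For the fault tolerance you asked about, read_fwf can also infer the column breaks from the data itself instead of relying on hard-coded positions (colspecs='infer' is actually the default). A minimal sketch under the assumption that the leading header block is always 11 lines; parse_daily_idx is just a hypothetical helper name:
import pandas as pd

def parse_daily_idx(idx_path):
    # Hypothetical helper: parse one daily EDGAR company .idx file.
    # Assumes the leading header block is always 11 lines.
    names = ['Company Name', 'Form Type', 'CIK', 'Date Filed', 'File Name']
    try:
        # Let read_fwf infer the column breaks from the first 200 data rows
        # instead of hard-coding the character positions.
        return pd.read_fwf(idx_path, colspecs='infer', infer_nrows=200,
                           names=names, skiprows=11)
    except Exception as exc:
        # One bad file should not break the daily pipeline.
        print(f'Failed to parse {idx_path}: {exc}')
        return pd.DataFrame(columns=names)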

Related

Split CSV file which contains multiple tables into different pandas dataFrames (Python)

I have multiple CSV files, each containing multiple tables separated by line breaks.
Example:
Technology C_inv [MCHF/y] C_maint [MCHF/y]
NUCLEAR 70.308020 33.374568
HYDRO_DAM_EXISTING 0.000000 195.051200
HYDRO_DAM 67.717942 1.271600
HYDRO_RIVER_EXISTING 0.000000 204.820000
IND_BOILER_OIL 2.053610 0.532362
IND_BOILER_COAL 4.179935 1.081855
IND_BOILER_WASTE 11.010126 2.849652
DEC_HP_ELEC 554.174644 320.791276
DEC_THERMAL_HP_GAS 77.077291 33.717477
DEC_BOILER_GAS 105.586089 41.161335
DEC_BOILER_OIL 33.514266 25.948450
H2_FROM_GAS 145.185290 59.178082
PYROLYSIS 132.200818 112.392123
Storage technology C_inv [MCHF/y] C_maint [MCHF/y]
HYDRO_STORAGE 0.000000 0.000000
Resource C_op [MCHF/y]
ELECTRICITY 1174.452848
GASOLINE 702.000000
DIESEL 96.390000
OIL 267.787558
NG 1648.527242
WOOD 592.110000
COAL 84.504083
URANIUM 18.277626
WASTE 0.000000
All my CSV files have different subtable names, but there are few enough of them that I could enter them manually to detect them if required.
Another issue is that many titles include spaces (e.g. "Storage technology"), which pandas reads as two columns.
I initially tried to do it directly with pandas, splitting manually, but the on_bad_lines='skip' argument, which allows avoiding errors, also skips useful lines:
Cost_bd = pd.read_csv(f"{Directory}/cost_breakdown.csv", on_bad_lines='skip', delim_whitespace=True).dropna(axis=1, how='all')
colnames = ['Technology', 'C_inv[MCHF/y]', 'C_maint[MCHF/y]']
Cost_bd.columns = colnames
I believe it might be better to scan the .txt file and split it, but I'm unsure of the best way to do this.
I have also tried the solution provided in this thread:
import pandas as pd

table_names = ["Technology", "Storage technology", "Resource"]
df = pd.read_csv(f"{Directory}/cost_breakdown.csv", header=None, names=range(3))
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0, 0]: g.iloc[1:] for k, g in df.groupby(groups)}
but it doesn't work:
tables.keys()=
dict_keys(['Technology\tC_inv [MCHF/y]\tC_maint [MCHF/y]'])
EDIT: Final solution, based on @Rabinzel's answer:
import re
import pandas as pd

def make_df(group, dict_of_dfs):
    header, data = re.split(r'\t', group[0]), list(map(str.split, group[1:]))
    if len(header) != len(data[0]):  # if column names are missing, reuse those of the first table
        header = header + dict_of_dfs[list(dict_of_dfs.keys())[0]].columns.tolist()[1:]
    dict_of_dfs[header[0]] = pd.DataFrame(data, columns=header)
    return dict_of_dfs

def Read_csv_as_df(path, file_name):
    with open(path + file_name) as f:
        dict_of_dfs = {}
        group = []
        for line in f:
            if line != '\n':
                group.append(line.strip())
            else:
                dict_of_dfs = make_df(group, dict_of_dfs)
                group = []
        dict_of_dfs = make_df(group, dict_of_dfs)
    return dict_of_dfs
I would do it the following way: iterate through the rows, append each chunk separated by a newline to a list, and build DataFrames from those lists. To deal with the column names that contain spaces, use re.split and split only where there are two or more consecutive spaces.
Save the different DataFrames in a dictionary whose key is the first element of each DataFrame's header.
import re
import pandas as pd

def make_df(group):
    header, data = re.split(r'\s\s+', group[0]), list(map(str.split, group[1:]))
    dict_of_dfs[header[0]] = pd.DataFrame(data, columns=header)

with open('your_csv_file.csv') as f:
    dict_of_dfs = {}
    group = []
    for line in f:
        if line != '\n':
            group.append(line.strip())
        else:
            make_df(group)
            group = []
    make_df(group)

for key, value in dict_of_dfs.items():
    print(f"{key=}\ndf:\n{value}\n---------------------")
Output:
key='Technology'
df:
Technology C_inv [MCHF/y] C_maint [MCHF/y]
0 NUCLEAR 70.308020 33.374568
1 HYDRO_DAM_EXISTING 0.000000 195.051200
2 HYDRO_DAM 67.717942 1.271600
3 HYDRO_RIVER_EXISTING 0.000000 204.820000
4 IND_BOILER_OIL 2.053610 0.532362
5 IND_BOILER_COAL 4.179935 1.081855
6 IND_BOILER_WASTE 11.010126 2.849652
7 DEC_HP_ELEC 554.174644 320.791276
8 DEC_THERMAL_HP_GAS 77.077291 33.717477
9 DEC_BOILER_GAS 105.586089 41.161335
10 DEC_BOILER_OIL 33.514266 25.948450
11 H2_FROM_GAS 145.185290 59.178082
12 PYROLYSIS 132.200818 112.392123
---------------------
key='Storage technology'
df:
Storage technology C_inv [MCHF/y] C_maint [MCHF/y]
0 HYDRO_STORAGE 0.000000 0.000000
---------------------
key='Resource'
df:
Resource C_op [MCHF/y]
0 ELECTRICITY 1174.452848
1 GASOLINE 702.000000
2 DIESEL 96.390000
3 OIL 267.787558
4 NG 1648.527242
5 WOOD 592.110000
6 COAL 84.504083
7 URANIUM 18.277626
8 WASTE 0.000000
---------------------

How do I modify parameters to exclude newline via camelot?

I am trying to parse a PDF into a dataframe using camelot:
import camelot
import pandas as pd

file = 'foo.pdf'
tables = camelot.read_pdf(file, pages='2', flavor='stream')
v = []
for i, table in enumerate(tables):
    v.append(table.df)
w = pd.concat(v)
print(w)
However, it reads it as below:
7 Customer No. Document Date Customer PO No. External Doc. No.\nPayment Terms
8 126207 28/02/22 STRICTLY 14 DAYS
9 PO No./Docket Unit Price \nAmount \nGST Amount Incl.
10 Description TASK DATE Quantity UOM
11 No. Excl. GST\nExcl. GST\nAmount GST
12 BOC GAS & GEAR
13 11 SNOW STREET
14 SOUTH LISMORE, NSW 2480
15 CLEAR: FL 1.5M3 BIN-CARDBOARD 02/02/22 1\nEA\n9.18\n9.18\n0.92 10.10
16 CLEAR: FL 1.5M3 BIN-CARDBOARD 16/02/22 1\nEA\n9.18\n9.18\n0.92 10.10
How do I avoid the newline \n when reading the PDF?
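One option, assuming the \n characters simply end up embedded inside the cell strings (rather than marking real row boundaries you want to keep), is to strip them from each table's DataFrame after reading; a minimal sketch:
import camelot
import pandas as pd

file = 'foo.pdf'
tables = camelot.read_pdf(file, pages='2', flavor='stream')

# Replace embedded newlines in every cell with a space before concatenating.
v = [table.df.replace(r'\n', ' ', regex=True) for table in tables]
w = pd.concat(v)
print(w)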

Scrape Embedded Google Sheet from HTML in Python

This one has been relatively tricky for me. I am trying to extract an embedded table sourced from Google Sheets in Python.
Here is the link
I do not own the sheet but it is publicly available.
Here is my code so far; when I go to output the headers, it shows me "". Any help would be greatly appreciated. The end goal is to convert this table into a pandas DataFrame. Thanks, guys.
import requests
import lxml.html as lh
import pandas as pd

url = 'https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')
col = []
i = 0
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
Well, if you would like to get the data into a DataFrame, you can load it directly:
import pandas as pd

df = pd.read_html('https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727',
                  header=1)[0]
df.drop(columns='1', inplace=True)  # remove unnecessary index column called "1"
This will give you:
Target Ticker Acquirer \
0 Acacia Communications Inc Com ACIA Cisco Systems Inc Com
1 Advanced Disposal Services Inc Com ADSW Waste Management Inc Com
2 Allergan Plc Com AGN Abbvie Inc Com
3 Ak Steel Holding Corp Com AKS Cleveland Cliffs Inc Com
4 Td Ameritrade Holding Corp Com AMTD Schwab (Charles) Corp Com
Ticker.1 Current Price Take Over Price Price Diff % Diff Date Announced \
0 CSCO $68.79 $70.00 $1.21 1.76% 7/9/2019
1 WM $32.93 $33.15 $0.22 0.67% 4/15/2019
2 ABBV $197.05 $200.22 $3.17 1.61% 6/25/2019
3 CLF $2.98 $3.02 $0.04 1.34% 12/3/2019
4 SCHW $49.31 $51.27 $1.96 3.97% 11/25/2019
Deal Type
0 Cash
1 Cash
2 C&S
3 Stock
4 Stock
Note that read_html returns a list. In this case there is only one DataFrame, so we can refer to the first and only index location, [0].

How to create a table from .txt file?

I am trying to create a dataframe (a table with three columns) from a .txt file.
I prepared the txt file so it has the format:
Car
Audi A4 10000
Audi A6 12000
....
Bus
VW Transporter 15000
...
Camper
VW California 20000
...
This is the whole code:
cars = ""
with open("cars.txt", "r", encoding = "utf-8") as f:
cars = f.read()
print(cars)
def generate_car_table(table):
table = pd.DataFrame(columns = ['category', 'model','price'])
return table
cars_table = generate_car_table(cars)
I expect a table with three columns: category, which will show whether the vehicle is a car/bus/camper; model; and price.
Thank you in advance!
Update:
With your comments in mind, I see that I misunderstood your question.
If your text file (cars.txt) looks as follows:
Car
Audi A4 10000
Audi A6 12000
Bus
VW Transporter 15000
Camper
VW California 20000
so that there is a line break after every category and a tab between the model and the price, you can run the following code:
# Read the file
data = pd.read_csv('cars.txt', names=['Model','Price','Category'], sep='\t')
# Transform the unstructured data
data.loc[data['Price'].isnull(), 'Category'] = data['Model']
data['Category'].fillna(method='ffill', inplace=True)
data.dropna(axis=0, subset=['Price'], inplace = True)
# Clean the dataframe
data.reset_index(drop=True, inplace=True)
data = data[['Category', 'Model', 'Price']]
print(data)
This does result in the following table:
Category Model Price
0 Car Audi A4 10000.0
1 Car Audi A6 12000.0
2 Bus VW Transporter 15000.0
3 Camper VW California 20000.0
Old Answer:
Your text file needs a fixed structure (for example, all values separated by a tab or a line break).
Then you can use the pd.read_csv method and define the separator by hand with pd.read_csv('yourFileName', sep='yourseparator').
Tabs are \t and line breaks are \n, for example.
The following cars.txt (link) for example is structured using tabs and can be read with:
import pandas as pd
pd.read_csv('cars.txt', sep = '\t')
It is likely far easier to create a table from a CSV file than from a text file, as it makes the job of parsing much easier and also provides the benefit of being easily viewed in table format in spreadsheet applications such as Excel.
You create the file so that it looks something like this:
category,model,price
Car,Audi A4,10000
Car,Audi A6,12000
...
And then use the csv package to easily read/write the data in tabular form.
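A minimal sketch of that approach, assuming the data has been saved as cars.csv with the header row shown above:
import csv

# Read the rows back as dictionaries keyed by the header row.
with open('cars.csv', newline='') as f:
    reader = csv.DictReader(f)
    rows = list(reader)

for row in rows:
    print(row['category'], row['model'], row['price'])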

How do I add a blank line between merged files

I have several CSV files that I have managed to merge. However, I need to add a blank row between the files as they merge so I know where a different file starts. I've tried everything. Please help.
import os
import glob
import pandas

def concatenate(indir="C:\\testing", outfile="C:\\done.csv"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    colnames = ["Creation Date", "Author", "Tweet", "Language", "Location", "Country", "Continent"]
    for filename in fileList:
        print(filename)
        df = pandas.read_csv(filename, header=None)
        ins = df.insert(len(df), '\n')
        dfList.append(ins)
    concatDf = pandas.concat(dfList, axis=0)
    concatDf.columns = colnames
    concatDf.to_csv(outfile, index=None)
Here is an example script. The simplest solution seems to be to create a template DataFrame to use as a separator, with its values set as desired, and then just insert it into the list of DataFrames to concatenate at the appropriate positions. You can use the loc method with a non-existent key to enlarge a DataFrame and set the values of the new row.
Lastly, I removed the chdir, since glob can search in any path.
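To illustrate the loc enlargement on its own (a toy example, separate from the script below):
import pandas

df = pandas.DataFrame(columns=["a", "b"])
df.loc[0] = ["x", "y"]      # key 0 does not exist yet, so this appends a row
df.loc[len(df)] = ["", ""]  # enlarge again at the next free index
print(df)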
import glob
import pandas

def concatenate(input_dir, output_file_name):
    file_list = glob.glob(input_dir + "/*.csv")
    column_names = ["Creation Date",
                    "Author",
                    "Tweet",
                    "Language",
                    "Location",
                    "Country",
                    "Continent"]
    # Create a separator template
    separator = pandas.DataFrame(columns=column_names)
    separator.loc[0] = [""] * 7
    dataframes = []
    for file_name in file_list:
        print(file_name)
        if len(dataframes):
            # The list is not empty, so we need to add a separator
            dataframes.append(separator)
        dataframes.append(pandas.read_csv(file_name))
    concatenated = pandas.concat(dataframes, axis=0)
    concatenated.to_csv(output_file_name, index=None)
    print(concatenated)

concatenate("input", "out.csv")
An alternative, even shorter way is to build the concatenated DataFrame iteratively, using the append method. (Note that DataFrame.append was deprecated and later removed in pandas 2.0, so this variant only works with older pandas versions.)
def concatenate(input_dir, output_file_name):
    file_list = glob.glob(input_dir + "/*.csv")
    column_names = ["Creation Date",
                    "Author",
                    "Tweet",
                    "Language",
                    "Location",
                    "Country",
                    "Continent"]
    concatenated = pandas.DataFrame(columns=column_names)
    for file_name in file_list:
        print(file_name)
        if len(concatenated):
            # The frame is not empty, so we need to add a separator
            concatenated.loc[len(concatenated)] = [""] * 7
        concatenated = concatenated.append(pandas.read_csv(file_name))
    concatenated.to_csv(output_file_name, index=None)
    print(concatenated)
I tested the script with 3 input CSV files:
input/1.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2015-12-17,foo,Hello,EN,London,UK,Europe
2015-12-18,bar,Bye,EN,Manchester,UK,Europe
2015-12-28,baz,Hallo,DE,Frankfurt,Germany,Europe
input/2.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2016-01-09,bar,Tweeeeet,EN,New York,USA,America
2016-01-09,cat,Miau,FI,Helsinki,Finland,Europe
input/3.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2018-12-12,who,Hello,EN,Delhi,India,Asia
When I ran it, the following output was written to console:
Console Output (using concat)
input\1.csv
input\2.csv
input\3.csv
Creation Date Author Tweet Language Location Country Continent
0 2015-12-17 foo Hello EN London UK Europe
1 2015-12-18 bar Bye EN Manchester UK Europe
2 2015-12-28 baz Hallo DE Frankfurt Germany Europe
0
0 2016-01-09 bar Tweeeeet EN New York USA America
1 2016-01-09 cat Miau FI Helsinki Finland Europe
0
0 2018-12-12 who Hello EN Delhi India Asia
The console output of the shorter variant is slightly different (note the indices in the first column); however, this has no effect on the generated CSV file.
Console Output (using append)
input\1.csv
input\2.csv
input\3.csv
Creation Date Author Tweet Language Location Country Continent
0 2015-12-17 foo Hello EN London UK Europe
1 2015-12-18 bar Bye EN Manchester UK Europe
2 2015-12-28 baz Hallo DE Frankfurt Germany Europe
3
0 2016-01-09 bar Tweeeeet EN New York USA America
1 2016-01-09 cat Miau FI Helsinki Finland Europe
6
0 2018-12-12 who Hello EN Delhi India Asia
Finally, this is what the generated output CSV file looks like:
out.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2015-12-17,foo,Hello,EN,London,UK,Europe
2015-12-18,bar,Bye,EN,Manchester,UK,Europe
2015-12-28,baz,Hallo,DE,Frankfurt,Germany,Europe
,,,,,,
2016-01-09,bar,Tweeeeet,EN,New York,USA,America
2016-01-09,cat,Miau,FI,Helsinki,Finland,Europe
,,,,,,
2018-12-12,who,Hello,EN,Delhi,India,Asia
