How to convert messy html table to pandas dataframe

How to convert messy html table to pandas dataframe - python

I'm trying to scrape SEC 10-Q and 10-K filings. Though I'm able to extract the tables, the CSV output is a bit messy. Is there any way that I can merge the columns with similar header names with pandas? Or any libraries that can help me export SEC filing data tables as csv?
[user#server sec_parser]$ /usr/bin/python3 /home/user/work_files/sec_parser/parser.py --file 10-Q-cmcsa-3312017x10q.htm
0 1 2 3 4 5 6 7 8 9 10 11
0 (in millions) NaN 2017 2017 2017 NaN 2016 2016 2016 NaN NaN NaN
1 Revenue NaN $ 20463 NaN NaN $ 18790 NaN NaN 8.9 %
2 Costs and Expenses: NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Programming and production NaN 6074 6074 NaN NaN 5431 5431 NaN NaN 11.8 NaN
4 Other operating and administrative NaN 5827 5827 NaN NaN 5526 5526 NaN NaN 5.4 NaN
5 Advertising, marketing and promotion NaN 1530 1530 NaN NaN 1466 1466 NaN NaN 4.4 NaN
6 Depreciation NaN 1915 1915 NaN NaN 1785 1785 NaN NaN 7.3 NaN
7 Amortization NaN 587 587 NaN NaN 493 493 NaN NaN 19.0 NaN
8 Operating income NaN 4530 4530 NaN NaN 4089 4089 NaN NaN 10.8 NaN
9 Other income (expense) items, net NaN (625 (625 ) NaN (554 (554 ) NaN 13.0 NaN
10 Income before income taxes NaN 3905 3905 NaN NaN 3535 3535 NaN NaN 10.4 NaN
11 Income tax expense NaN (1,258 (1,258 ) NaN (1,311 (1,311 ) NaN (4.1 )
12 Net income NaN 2647 2647 NaN NaN 2224 2224 NaN NaN 19.0 NaN
13 Net (income) loss attributable to noncontrolli... NaN (81 (81 ) NaN (90 (90 ) NaN (10.2 )
14 Net income attributable to Comcast Corporation NaN $ 2566 NaN NaN $ 2134 NaN NaN 20.2 %
The sample table that I'm trying to convert as CSV https://edgartable.netlify.app/.
Here's my code
import os
import argparse
import sys
from bs4 import BeautifulSoup
import argparse
import pandas as pd
args = argparse.ArgumentParser()
args.add_argument('--file', type=str)
args.add_argument('--list', type=str)
opts = args.parse_args()
def parse_file(file):
data_map = []
div = []
tables = []
soup = BeautifulSoup(open(file, 'r'), 'html.parser')
for div in soup.find_all('div'):
if 'Consolidated Operating Results' not in str(div.find('font')): continue
table = div.find('table')
dataset = pd.read_html(str(table), skiprows=3)
print(dataset[0])
for i, data in enumerate(dataset):
data.to_csv(f'test{i}.csv', '|', index=False, header=False)
def main():
parse_file(opts.file)
if __name__ == "__main__": main()

Try this:
import pandas as pd
df = pd.read_html('https://edgartable.netlify.app/')
df = df[0]
df.to_csv('test.csv')

Related

extract specific information from unstructured table. How can I fix this error and structure my data and get my desired table?

I've converted my pdf file to excel and now I got table like this
I would like to extract the information from the table, like "name", "Exam ID", "ID", "Test Address", "theory test time" and "Skill test time". And write these information to a new excel file. Does anyone know how I do this with Python?
And I try the dataraframe by pandas
import pandas as pd
df = pd.read_excel (r’Copy of pdf-v1.xlsx')
print (df)
but the dataframe is very unstructured and "Exam ID" and people's name distributed in various places on the table. Like this
0 Exam ID
1 ID
2 Company
3 Job
4 subject of examination
5 Address
6 theory test time
7 Skill test address
8 Skill test time
9 NaN
10 NaN
11 P.S.: \n1. xxxx\n2. dfr\n3.\n
12 examination ...
13 name
14 Exam ID
15 ID
16 Company
17 Job
18 subject of examination
19 Address
20 theory test time
21 Skill test address
22 Skill test time
23 NaN
24 NaN
25 P.S.: \n1. xxxx\n2. dfr\n3.\n
26 examination ...
27 name
28 Exam ID
29 ID
30 Company
31 Job
32 subject of examination
33 Address
34 theory test time
35 Skill test address
36 Skill test time
37 NaN
38 NaN
39 P.S.: \n1. xxxx\n2. dfr\n3.\n
40 examination ...
41 name
42 Exam ID
43 ID
44 Company
45 Job
46 subject of examination
47 Address
48 theory test time
49 Skill test address
50 Skill test time
51 NaN
52 NaN
53 P.S.: \n1. xxxx\n2. dfr\n3.\n
Smith gender male Unnamed: 4 \
0 78610 NaN NaN NaN
1 108579352 NaN NaN NaN
2 NaN NaN NaN NaN
3 Police NaN Level Level Five
4 Theory, Skill NaN Degree high school
5 dffef NaN NaN NaN
6 2021-10-18 10:00~2021-10-18 11:30 NaN NaN NaN
7 cdwgrgtr NaN NaN NaN
8 2021-10-19 09:30 NaN NaN NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
11 NaN NaN NaN NaN
12 NaN NaN NaN NaN
13 Charles gender male NaN
14 74308 NaN NaN NaN
15 733440627 NaN NaN NaN
16 NaN NaN NaN NaN
17 engeneer NaN Level Level Five
18 Theory, Skill NaN Degree Master
19 dffef NaN NaN NaN
20 2021-10-18 10:00~2021-10-18 11:30 NaN NaN NaN
21 cdwgrgtr NaN NaN NaN
22 2021-10-19 09:30 NaN NaN NaN
23 NaN NaN NaN NaN
24 NaN NaN NaN NaN
25 NaN NaN NaN NaN
26 NaN NaN NaN NaN
27 Gary gender male NaN
28 77564 NaN NaN NaN
29 392096759 NaN NaN NaN
30 NaN NaN NaN NaN
31 driver NaN Level Level Five
32 Theory, Skill NaN Degree High school
33 dffef NaN NaN NaN
34 2021-10-18 10:00~2021-10-18 11:30 NaN NaN NaN
35 cdwgrgtr NaN NaN NaN
36 2021-10-19 09:30 NaN NaN NaN
37 NaN NaN NaN NaN
38 NaN NaN NaN NaN
39 NaN NaN NaN NaN
40 NaN NaN NaN NaN
41 Whitney gender female NaN
42 78853 NaN NaN NaN
43 207628593 NaN NaN NaN
44 NaN NaN NaN NaN
45 data scientist NaN Level Level Five
46 Theory, Skill NaN Degree Master
47 dffef NaN NaN NaN
48 2021-10-18 10:00~2021-10-18 11:30 NaN NaN NaN
49 cdwgrgtr NaN NaN NaN
50 2021-10-19 09:30 NaN NaN NaN
51 NaN NaN NaN NaN
52 NaN NaN NaN NaN
53 NaN NaN NaN NaN
Unnamed: 5 Unnamed: 6 Unnamed: 7 name.1 \
0 NaN NaN NaN Exam ID
1 NaN NaN NaN ID
2 NaN NaN NaN Company
3 NaN NaN NaN Job
4 NaN NaN NaN subject of examination
5 NaN NaN NaN Address
6 NaN NaN NaN theory test time
7 NaN NaN NaN Skill test address
8 NaN NaN NaN Skill test time
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
11 NaN NaN NaN P.S.: \n1. xxxx\n2. dfr\n3.\n
12 NaN NaN NaN NaN
13 NaN NaN NaN name
14 NaN NaN NaN Exam ID
15 NaN NaN NaN ID
16 NaN NaN NaN Company
17 NaN NaN NaN Job
18 NaN NaN NaN subject of examination
19 NaN NaN NaN Address
20 NaN NaN NaN theory test time
21 NaN NaN NaN Skill test address
22 NaN NaN NaN Skill test time
23 NaN NaN NaN NaN
24 NaN NaN NaN NaN
25 NaN NaN NaN P.S.: \n1. xxxx\n2. dfr\n3.\n
26 NaN NaN NaN NaN
27 NaN NaN NaN name
28 NaN NaN NaN Exam ID
29 NaN NaN NaN ID
30 NaN NaN NaN Company
31 NaN NaN NaN Job
32 NaN NaN NaN subject of examination
33 NaN NaN NaN Address
34 NaN NaN NaN theory test time
35 NaN NaN NaN Skill test address
36 NaN NaN NaN Skill test time
37 NaN NaN NaN NaN
38 NaN NaN NaN NaN
39 NaN NaN NaN P.S.: \n1. xxxx\n2. dfr\n3.\n
40 NaN NaN NaN NaN
41 NaN NaN NaN name
42 NaN NaN NaN Exam ID
43 NaN NaN NaN ID
44 NaN NaN NaN Company
45 NaN NaN NaN Job
46 NaN NaN NaN subject of examination
47 NaN NaN NaN Address
48 NaN NaN NaN theory test time
49 NaN NaN NaN Skill test address
50 NaN NaN NaN Skill test time
51 NaN NaN NaN NaN
52 NaN NaN NaN NaN
53 NaN NaN NaN P.S.: \n1. xxxx\n2. dfr\n3.\n
Philip gender .1 male.1 Unnamed: 12
0 80425 NaN NaN NaN
1 121246038 NaN NaN NaN
2 NaN NaN NaN NaN
3 driver NaN Level Level Five
4 Theory, Skill NaN Degree Bachelor
5 dffef NaN NaN NaN
6 2021-10-18 10:00~2021-10-18 11:30 NaN NaN NaN
7 cdwgrgtr NaN NaN NaN
8 2021-10-19 09:30 NaN NaN NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
11 NaN NaN NaN NaN
12 NaN NaN NaN NaN
13 Bill gender male NaN
14 74788 NaN NaN NaN
15 332094931 NaN NaN NaN
16 NaN NaN NaN NaN
17 driver NaN Level Level Five
18 Theory, Skill NaN Degree High school
19 dffef NaN NaN NaN
20 2021-10-18 10:00~2021-10-18 11:30 NaN NaN NaN
21 cdwgrgtr NaN NaN NaN
22 2021-10-19 09:30 NaN NaN NaN
23 NaN NaN NaN NaN
24 NaN NaN NaN NaN
25 NaN NaN NaN NaN
26 NaN NaN NaN NaN
27 Betty gender female NaN
28 61311 NaN NaN NaN
Also, I try to read the label and extract the next column cell to get desired results but the loop also gets error which the codes like:
for index, row in df.iterrows():
if row['A2'] == 'Exam Id':
exam_id=row['B2']
Then the python said key error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
~/opt/anaconda3/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
~/opt/anaconda3/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'A2'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-36-03e11799668c> in <module>
1 for index, row in df.iterrows():
----> 2 if row['A2'] == 'Exam Id':
3 exam_id=row['B2']
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
940
941 elif key_is_scalar:
--> 942 return self._get_value(key)
943
944 if is_hashable(key):
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/series.py in _get_value(self, label, takeable)
1049
1050 # Similar to Index.get_value, but we do not fall back to positional
-> 1051 loc = self.index.get_loc(label)
1052 return self.index._get_values_for_loc(self, loc, label)
1053
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 'A2'
How can I fix it?
The new excel table I want is :
Name
ID
Exam Id
Theory Test time
Address
Skill test Time
Skill test Address

Your best bet is probably to read the excel file yourself with openpyxl. You have to go through your lines (jumping 14 lines each time) and get the values in columns B and J. Then you make a Dataframe out of your Dictionary list:
from openpyxl import load_workbook
import pandas as pd
wb = load_workbook(filename = 'Copy of pdf-v1.xlsx.xlsx')
data_sheet = wb['Sheet1'] # enter the name of the Sheet
candidates = []
for i in range(1, data_sheet.max_row, 14):
candidate1 = dict()
candidate1['name'] = data_sheet[f'B{i}'].value
candidate1['ID'] = data_sheet[f'B{i+1}'].value
candidate1['Exam id'] = data_sheet[f'B{i+2}'].value
candidate1['Theory Test Time'] = data_sheet[f'B{i+7}'].value
candidate1['Address'] = data_sheet[f'B{i+6}'].value
candidate1['Skill Test Time'] = data_sheet[f'B{i+9}'].value
candidate1['Skill Test Address'] = data_sheet[f'B{i+8}'].value
candidates.append(candidate1)
candidate2 = dict()
candidate2['name'] = data_sheet[f'J{i}'].value
candidate2['ID'] = data_sheet[f'J{i+1}'].value
candidate2['Exam id'] = data_sheet[f'J{i+2}'].value
candidate2['Theory Test Time'] = data_sheet[f'J{i+7}'].value
candidate2['Address'] = data_sheet[f'J{i+6}'].value
candidate2['Skill Test Time'] = data_sheet[f'J{i+9}'].value
candidate2['Skill Test Address'] = data_sheet[f'J{i+8}'].value
candidates.append(candidate2)
table = pd.DataFrame(candidates)
print(table)

Grouping by year through months: pivot table

I need to group values by Year, from my dataset:
Date Freq Year Month
0 2020-03-19 32 2020 3
1 2020-03-25 31 2020 3
2 2020-03-23 28 2020 3
3 2020-03-04 26 2020 3
4 2020-08-04 26 2020 8
... ... ... ... ...
2516 2011-09-02 1 2011 9
2517 2013-04-25 1 2013 4
2518 2020-09-02 1 2020 9
2519 2013-09-03 1 2013 9
2520 2015-01-01 1 2015 1
The table below was found as follows:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
try_this=pd.pivot_table(df, values = 'Freq', index=['Date','Year'], columns = 'Month')
Month 1 2 3 4 5 6 7 8 9 10 11 12
Date Year
2010-03-04 2010 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2010-03-07 2010 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2010-07-31 2010 NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN
2010-10-07 2010 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN
2010-12-20 2010 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-12-05 2020 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 15.0
2020-12-06 2020 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 10.0
2020-12-08 2020 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 18.0
2020-12-09 2020 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4.0
2020-12-10 2020 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 14.0
I am trying to get something like this:
Year 1 2 3 4 5 6 7 8 9 10 11 12
2020 ... 61.0
2019 ...
2018 ...
...
i.e. a table where group by year the frequency through months.
What I tried (code above) is not giving me this output.
I would appreciated more help on how to figure it out.
References:
Plot through time setting specific filtering
How to pivot a dataframe in Pandas?

Have you tried using aggfunc in pivot_table:
df = df[['Year', 'Month', 'Freq']]
df = df.pivot_table(values=['Freq'], columns=['Month'], index=['Year'], aggfunc='sum')
print(df)
Freq
Month 1 3 4 8 9
Year
2011 NaN NaN NaN NaN 1.0
2013 NaN NaN 1.0 NaN 1.0
2015 1.0 NaN NaN NaN NaN
2020 NaN 117.0 NaN 26.0 1.0

Extract clear data in json from table in a pdf

I have the following PDF file from which I want to get the the data inside it so as i can integrate with my app.
Example i want to get 1 for Monday and 10 and 14 for the columns having white boxes
Here is what I have tried:
import tabula
df = tabula.read_pdf("IT.pdf",multiple_tables=True)
for col in df:
print(col)
The output comes like
07:00 08:00 08:00 09:00 Unnamed: 0 Unnamed: 1 ... Unnamed: 10 07:00 08:00.1 Unnamed: 11 08:00 09:00.1
0 Tutorial Tutorial NaN NaN ... NaN Tutorial NaN NaN
1 G1_MSU G1G2G3_M NaN NaN ... NaN SPU_07410 NaN NaN
2 07201 TU 07203 NaN NaN ... NaN 110 NaN NaN
3 110 110, 115, NaN NaN ... NaN Andaray, N NaN NaN
4 Lema, F (Mr) 117 NaN NaN ... NaN (Mr) NaN NaN
5 BscIRM__1 Farha, M NaN NaN ... NaN BIRM__2PT NaN NaN
6 C (Mrs), NaN NaN ... NaN NaN NaN NaN
7 NaN Mandia, A NaN NaN ... NaN NaN NaN NaN
8 NaN (Ms), NaN NaN ... NaN NaN NaN NaN
9 NaN Wilberth, N NaN NaN ... NaN NaN NaN NaN
10 NaN (Ms) NaN NaN ... NaN NaN NaN NaN
11 NaN BscIRM__1 NaN NaN ... NaN NaN NaN NaN
12 NaN C NaN NaN ... NaN NaN NaN NaN
13 Tutorial Tutorial NaN NaN ... NaN Tutorial NaN Tutorial
14 G4_MSU G3_MTU NaN NaN ... NaN AFT_05204 NaN BFT_05202
15 07201 07203 NaN NaN ... NaN 110 NaN 110

use camelot package. That will help you.

Export a list of tables separately using Pandas

I am trying to extract all tables from a text file like https://www.sec.gov/Archives/edgar/data/1961/0001264931-18-000031.txt. I want to iterate over each table and if it contains a certain string (income tax), I want to export it using pandas dataframes. However, I keep getting the error that I cannot export a list. I know I am overlooking something simple, but how does my code not export every table separately
for filename, text in tqdm(dìctionary.items()):
soup = BeautifulSoup(text, "lxml")
tables = soup.find_all('table')
for i, table in enumerate(tables):
if ('income tax' in str(table)) or ('Income tax' in str(table)):
df = pd.read_html(str(table))
nametxt = filename.strip('.txt')
name = nametxt.replace("/", "")
df.to_csv('mypath\\' + name + '_%s.csv' %i)
else:
pass
0% 0/6547 [00:00<?, ?it/s]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-99-e7794eac8da6> in <module>()
7 nametxt = filename.strip('.txt')
8 name = nametxt.replace("/", "")
----> 9 df.to_csv('mypath\\' + name + '_%s.csv' %i)
10 else:
11 pass
AttributeError: 'list' object has no attribute 'to_csv'
df looks like this:
[ 0 1 2 \
0 Worlds Inc. NaN NaN
1 Statements of Cash Flows NaN NaN
2 Year Ended December 31, 2017 and 2016 NaN NaN
3 NaN NaN Audited
4 NaN NaN 12/31/17
5 Cash flows from operating activities: NaN NaN
6 Net gain/(loss) NaN $
7 Adjustments to reconcile net loss to net cash ... NaN NaN
8 Loss on settlement of convertible notes NaN NaN
9 Fair value of stock options issued NaN NaN
10 Fair value of warrants issued NaN NaN
11 Amortization of discount to note payable NaN NaN
12 Changes in fair value of derivative liabilities NaN NaN
13 Accounts payable and accrued expenses NaN NaN
14 Due from/to related party NaN NaN
15 Net cash (used in) operating activities: NaN NaN
16 NaN NaN NaN
17 NaN NaN NaN
18 Cash flows from financing activities NaN NaN
19 Proceeds from issuance of note payable NaN NaN
20 Proceeds from issuance of convertible note pay... NaN NaN
21 Cash paid to repurchase convertible note payable NaN NaN
22 Proceeds from issuance of common stock NaN NaN
23 Proceeds from exercise of warrants NaN NaN
24 Issuance of common stock as payment for accoun... NaN NaN
25 Net cash provided by financing activities NaN NaN
26 NaN NaN NaN
27 Net increase/(decrease) in cash and cash equiv... NaN NaN
28 NaN NaN NaN
29 Cash and cash equivalents, including restricte... NaN NaN
30 NaN NaN NaN
31 Cash and cash equivalents, including restricte... NaN $
32 NaN NaN NaN
33 Non-cash financing activities NaN NaN
34 Issuance of 54,963,098 shares of common stock ... NaN NaN
35 NaN NaN NaN
36 Supplemental disclosure of cash flow information: NaN NaN
37 Cash paid during the year for: NaN NaN
38 Interest NaN $
39 Income taxes NaN $
40 NaN NaN NaN
41 The accompanying notes are an integral part of... NaN NaN
3 4 5 6 7 8
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN Audited NaN NaN NaN NaN
4 NaN 12/31/16 NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN
6 (2,746,968 ) NaN $ (1,132,906 )
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN 246413 NaN
9 1041264 NaN NaN NaN — NaN
10 1215240 NaN NaN NaN — NaN
11 — NaN NaN NaN 5000 NaN
12 — NaN NaN NaN (6,191 )
13 267983 NaN NaN NaN 237577 NaN
14 (21,051 ) NaN NaN (31,257 )
15 (243,532 ) NaN NaN (681,364 )
16 NaN NaN NaN NaN NaN NaN
17 NaN NaN NaN NaN NaN NaN
18 NaN NaN NaN NaN NaN NaN
19 — NaN NaN NaN 290000 NaN
20 — NaN NaN NaN 156500 NaN
21 NaN NaN NaN NaN (175,257 )
22 NaN NaN NaN NaN 350000 NaN
23 292800 NaN NaN NaN 127200 NaN
24 25582 NaN NaN NaN — NaN
25 318382 NaN NaN NaN 748443 NaN
26 NaN NaN NaN NaN NaN NaN
27 74849 NaN NaN NaN 67079 NaN
28 NaN NaN NaN NaN NaN NaN
29 93378 NaN NaN NaN 26298 NaN
30 NaN NaN NaN NaN NaN NaN
31 168229 NaN NaN $ 93379 NaN
32 NaN NaN NaN NaN NaN NaN
33 NaN NaN NaN NaN NaN NaN
34 — NaN NaN NaN 384159 NaN
35 NaN NaN NaN NaN NaN NaN
36 NaN NaN NaN NaN NaN NaN
37 NaN NaN NaN NaN NaN NaN
38 — NaN NaN $ (34,916 )
39 — NaN NaN $ — NaN
40 NaN NaN NaN NaN NaN NaN
41 NaN NaN NaN NaN NaN NaN ]

Here is a similar question with solution Get HTML table into pandas Dataframe, not list of dataframe objects
Essentially, pd.read_html returns a list of dataframes and you need to select one to call to_csv.

Pandas read text file with special regex

I need to get ad dataframe from files built up like this:
MANDT#|#BWKEY#|#BUKRS#|#BWMOD#|#XBKNG#|#MLBWA#|#MLBWV#|#XVKBW
150#|#2000#|#1001#|##|##|##|##|#
150#|#2001#|#1001#|##|##|##|##|#
150#|#2002#|#1001#|##|##|##|##|#
150#|#4000#|#1000#|##|##|##|##|#
150#|#4001#|#1000#|##|##|##|##|#
150#|#4002#|#1000#|##|##|##|##|#
150#|#4003#|#1000#|##|##|##|##|#
150#|#4005#|#1000#|##|##|##|##|#
What would be the right python regex for separation (#|#) in read_csv?
Thank´s!

Escape the vertical bar, which has a special meaning, with \|
df = pd.read_clipboard(sep=r'#\|#')
print(df)
MANDT BWKEY BUKRS BWMOD XBKNG MLBWA MLBWV XVKBW
0 150 2000 1001 NaN NaN NaN NaN NaN
1 150 2001 1001 NaN NaN NaN NaN NaN
2 150 2002 1001 NaN NaN NaN NaN NaN
3 150 4000 1000 NaN NaN NaN NaN NaN
4 150 4001 1000 NaN NaN NaN NaN NaN
5 150 4002 1000 NaN NaN NaN NaN NaN
6 150 4003 1000 NaN NaN NaN NaN NaN
7 150 4005 1000 NaN NaN NaN NaN NaN

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to convert messy html table to pandas dataframe - python

Try this: import pandas as pd df = pd.read_html('https://edgartable.netlify.app/') df = df[0] df.to_csv('test.csv')

Related

extract specific information from unstructured table. How can I fix this error and structure my data and get my desired table?

Grouping by year through months: pivot table

Extract clear data in json from table in a pdf

Export a list of tables separately using Pandas

Pandas read text file with special regex

Categories

Resources