I am trying to scrape a keyword from an XML document with BeautifulSoup but am unsure how to do so.
The document contains a "Central Index Key," whose value differs for each document scraped. How would I be able to log the central index key for every unique XML document I scrape?
A sample is below. I want to log the string "0001773427" in this example:
<SEC-DOCUMENT>0001104659-22-079974.txt : 20220715
<SEC-HEADER>0001104659-22-079974.hdr.sgml : 20220715
<ACCEPTANCE-DATETIME>20220715060341
ACCESSION NUMBER: 0001104659-22-079974
CONFORMED SUBMISSION TYPE: 8-K
PUBLIC DOCUMENT COUNT: 14
CONFORMED PERIOD OF REPORT: 20220714
ITEM INFORMATION: Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers: Compensatory Arrangements of Certain Officers
ITEM INFORMATION: Financial Statements and Exhibits
FILED AS OF DATE: 20220715
DATE AS OF CHANGE: 20220715
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: SpringWorks Therapeutics, Inc.
CENTRAL INDEX KEY: 0001773427
STANDARD INDUSTRIAL CLASSIFICATION: BIOLOGICAL PRODUCTS (NO DIAGNOSTIC SUBSTANCES) [2836]
IRS NUMBER: 000000000
STATE OF INCORPORATION: DE
FISCAL YEAR END: 1231
FILING VALUES:
FORM TYPE: 8-K
SEC ACT: 1934 Act
SEC FILE NUMBER: 001-39044
FILM NUMBER: 221084206
BUSINESS ADDRESS:
STREET 1: 100 WASHINGTON BOULEVARD
CITY: STAMFORD
STATE: CT
ZIP: 06902
BUSINESS PHONE: 203-883-9490
MAIL ADDRESS:
STREET 1: 100 WASHINGTON BOULEVARD
CITY: STAMFORD
STATE: CT
ZIP: 06902
</SEC-HEADER>
The problem is that your document is SGML, not XML. XPath requires that your document be XML.
Use a different tool or convert your document to XML if you wish to use XPath. An example of an SGML to XML converter is sx by James Clark.
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Sample Company Name AdminContact@<sample company domain>.com'}
r = requests.get('https://www.sec.gov/Archives/edgar/data/1773427/000110465922079974/0001104659-22-079974.txt', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
print([x.strip().replace('\t', ' ') for x in soup.text.splitlines() if 'CENTRAL INDEX KEY:' in x ][0])
This will return:
CENTRAL INDEX KEY: 0001773427
If you only want the key:
print([x.replace('\t', ' ') for x in soup.text.splitlines() if 'CENTRAL INDEX KEY:' in x ][0].split(':')[1].strip())
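If you are scraping many filings and only ever need the key, a plain regular expression over the raw response is a lighter-weight alternative. A minimal sketch, assuming the "CENTRAL INDEX KEY:" label is stable across EDGAR headers:
import re
import requests

headers = {'User-Agent': 'Sample Company Name AdminContact@<sample company domain>.com'}
r = requests.get('https://www.sec.gov/Archives/edgar/data/1773427/000110465922079974/0001104659-22-079974.txt', headers=headers)

# the label is followed by whitespace and a run of digits; capture just the digits
match = re.search(r'CENTRAL INDEX KEY:\s*(\d+)', r.text)
if match:
    print(match.group(1))  # 0001773427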
I am looking to use Beautiful Soup to scrape the Fujitsu news update page: https://www.fujitsu.com/uk/news/pr/2020/
I only want to extract the information under the headings of the current month and previous month.
For a particular month (e.g. November), I am trying to extract into a list
the Title
the URL
the text
for each news briefing (so a list of lists).
My attempt so far is as follows (showing only the previous month for simplicity):
import calendar
import datetime
import requests
from bs4 import BeautifulSoup

today = datetime.datetime.today()
year_str = str(today.year)
current_m = today.month
previous_m = current_m - 1
current_m_str = calendar.month_name[current_m]
previous_m_str = calendar.month_name[previous_m]

URL = 'https://www.fujitsu.com/uk/news/pr/' + year_str + '/'
resp = requests.get(URL)
soup = BeautifulSoup(resp.text, 'lxml')

previous_m_body = soup.find('h3', text=previous_m_str)
if previous_m_body is not None:
    for sib in previous_m_body.find_next_siblings():
        if sib.name == "h3":
            break
        else:
            previous_m_text = str(sib.text)
            print(previous_m_text)
However, this generates one long string with newlines, and no separation between the Title, URL and text:
Fujitsu signs major contract with Scottish Government to deliver election e-Counting solution London, United Kingdom, November 30, 2020 - Fujitsu, a leading digital transformation company, has today announced a major contract with the Scottish Government and Scottish Local...
Fujitsu Introduces Ultra-Compact, 50A PCB Relay for Medium-to-Heavy Automotive Loads Hoofddorp, EMEA, November 11, 2020 - Fujitsu Components Europe has expanded its automotive relay offering with a new 12VDC PCB relay featuring.......
I have attached an image of the page DOM.
Try this:
import requests
from bs4 import BeautifulSoup
html = requests.get("https://www.fujitsu.com/uk/news/pr/2020/").text
all_lists = BeautifulSoup(html, "html.parser").find_all("ul", class_="filterlist")
news = []
for unordered_list in all_lists:
    for list_item in unordered_list.find_all("li"):
        news.append(
            [
                list_item.find("a").getText(),
                f"https://www.fujitsu.com{list_item.find('a')['href']}",
                list_item.getText(strip=True)[len(list_item.find("a").getText()):],
            ]
        )

for news_item in news:
    print("\n".join(news_item))
    print("-" * 80)
Output (shortened for brevity):
Fujitsu signs major contract with Scottish Government to deliver election e-Counting solution
https://www.fujitsu.com/uk/news/pr/2020/fs-20201130.html
London, United Kingdom, November 30, 2020- Fujitsu, a leading digital transformation company, has today announced a major contract with the Scottish Government and Scottish Local Authorities to support the electronic counting (e-Counting) of ballot papers at the Scottish Local Government elections in May 2022.Fujitsu Introduces Ultra-Compact, 50A PCB Relay for Medium-to-Heavy Automotive LoadsHoofddorp, EMEA, November 11, 2020- Fujitsu Components Europe has expanded its automotive relay offering with a new 12VDC PCB relay featuring a switching capacity of 50A at 14VDC. The FBR53-HC offers a higher contact rating than its 40A FBR53-HW counterpart, yet occupies the same 12.1 x 15.5 x 13.7mm footprint and weighs the same 6g.
--------------------------------------------------------------------------------
and more ...
EDIT:
To get just the last two months, all you need is the first two ul items from the soup. So, add [:2] to the first for loop, like this:
for unordered_list in all_lists[:2]:
# the rest of the loop body goes here
Here I modified your code by combining your bs4 code with Selenium. Selenium is very powerful for scraping dynamic or JavaScript-based websites, and you can use it together with BeautifulSoup to make your life easier. Now it will give you output for all months.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.maximize_window()
url = "https://www.fujitsu.com/uk/news/pr/2020/"  # change the url if you want results for a different year
driver.get(url)

# now your bs4 code starts. It will give you output from the current month back through all previous months
soup = BeautifulSoup(driver.page_source, "html.parser")

# here I am getting all month names, from January to November
months = soup.find_all('h3')
for month in months:
    print(f"month_name : {month.text}\n")

# here we are getting all description text, from the current month back through all previous months
description_lists = soup.find_all('ul', class_='filterlist')
for description_list in description_lists:
    description_text = description_list.text.replace('\n', '')
    print(f"description_text: {description_text}")
I've written a script in Python to scrape some disorganized content located within b tags and their next_sibling from a webpage. The thing is, my script fails when line breaks come in between. I'm trying to extract the titles and their corresponding descriptions from that page, starting from CHIEF COMPLAINT: Bright red blood per rectum to just before Keywords:.
Website address
I've tried so far with:
import requests
from bs4 import BeautifulSoup
url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
res = requests.get(url)
soup = BeautifulSoup(res.text,'lxml')
for item in soup.select_one("hr").find_next_siblings('b'):
    print(item.text, item.next_sibling)
The portion of output giving me unwanted results are like:
LABS: <br/>
CBC: <br/>
CHEM 7: <br/>
How can I get the titles and their corresponding descriptions?
Here's a scraper that's more robust compared to yesterday's solutions.
How to loop through scraping multiple documents on multiple web pages using BeautifulSoup?
How can I grab the entire body text from a web page using BeautifulSoup?
It extracts the title, description and all sections properly.
import re
import copy
import requests
from bs4 import BeautifulSoup, Tag, Comment, NavigableString
from urllib.parse import urljoin
from pprint import pprint
import itertools
import concurrent
from concurrent.futures import ThreadPoolExecutor

BASE_URL = 'https://www.mtsamples.com'

def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url)
    res.raise_for_status()
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    return soup

def clean_soup(soup: BeautifulSoup) -> BeautifulSoup:
    soup = copy.copy(soup)
    h1 = soup.select_one('h1')
    kw_re = re.compile('.*Keywords.*', flags=re.IGNORECASE)
    kw = soup.find('b', text=kw_re)
    # drop everything before the <h1> and after the Keywords marker
    for el in (*h1.previous_siblings, *kw.next_siblings):
        el.extract()
    kw.extract()
    for ad in soup.select('[id*="ad"]'):
        ad.extract()
    for script in soup('script'):
        script.extract()
    for c in h1.parent.children:
        if isinstance(c, Comment):
            c.extract()
    return h1.parent

def extract_meta(soup: BeautifulSoup) -> dict:
    h1 = soup.select_one('h1')
    title = h1.text.strip()
    desc_parts = []
    desc_re = re.compile('.*Description.*', flags=re.IGNORECASE)
    desc = soup.find('b', text=desc_re)
    hr = soup.select_one('hr')
    for s in desc.next_siblings:
        if s is hr:
            break
        if isinstance(s, NavigableString):
            desc_parts.append(str(s).strip())
        elif isinstance(s, Tag):
            desc_parts.append(s.text.strip())
    description = '\n'.join(p.strip() for p in desc_parts if p.strip())
    return {
        'title': title,
        'description': description
    }

def extract_sections(soup: BeautifulSoup) -> list:
    titles = [b for b in soup.select('b') if b.text.isupper()]
    parts = []
    for t in titles:
        title = t.text.strip(': ').title()
        text_parts = []
        for s in t.next_siblings:
            # walk forward until we see another title
            if s in titles:
                break
            if isinstance(s, Comment):
                continue
            if isinstance(s, NavigableString):
                text_parts.append(str(s).strip())
            if isinstance(s, Tag):
                text_parts.append(s.text.strip())
        text = '\n'.join(p for p in text_parts if p.strip())
        p = {
            'title': title,
            'text': text
        }
        parts.append(p)
    return parts

def extract_page(url: str) -> dict:
    soup = make_soup(url)
    clean = clean_soup(soup)
    meta = extract_meta(clean)
    sections = extract_sections(clean)
    return {
        **meta,
        'sections': sections
    }

url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
page = extract_page(url)
pprint(page, width=2000)
output:
{'description': 'Status post colonoscopy. After discharge, experienced bloody bowel movements and returned to the emergency department for evaluation.\n(Medical Transcription Sample Report)',
'sections': [{'text': 'Bright red blood per rectum', 'title': 'Chief Complaint'},
# some elements removed for brevity
{'text': '', 'title': 'Labs'},
{'text': 'WBC count: 6,500 per mL\nHemoglobin: 10.3 g/dL\nHematocrit:31.8%\nPlatelet count: 248 per mL\nMean corpuscular volume: 86.5 fL\nRDW: 18%', 'title': 'Cbc'},
{'text': 'Sodium: 131 mmol/L\nPotassium: 3.5 mmol/L\nChloride: 98 mmol/L\nBicarbonate: 23 mmol/L\nBUN: 11 mg/dL\nCreatinine: 1.1 mg/dL\nGlucose: 105 mg/dL', 'title': 'Chem 7'},
{'text': 'PT 15.7 sec\nINR 1.6\nPTT 29.5 sec', 'title': 'Coagulation Studies'},
{'text': 'The patient receive ... ula.', 'title': 'Hospital Course'}],
'title': 'Sample Type / Medical Specialty: Gastroenterology\nSample Name: Blood per Rectum'}
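The imports above already pull in ThreadPoolExecutor, which matches the linked question about looping over multiple documents: extract_page can be fanned out over a list of URLs. A sketch, with a placeholder URL list:
# fan extract_page out over several sample pages concurrently
urls = [
    'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum',
    # ... add more sample URLs here ...
]
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(extract_page, urls))
for result in results:
    pprint(result, width=2000)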
Code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
res = urlopen(url)
html = res.read()
soup = BeautifulSoup(html, 'html.parser')

# Cut out the division containing the required text; used Right Click and Inspect Element in the browser to find the respective div/tag
sampletext_div = soup.find('div', {'id': "sampletext"})
print(sampletext_div.find('h1').text)  # To print the header
Output:
Sample Type / Medical Specialty: Gastroenterology
Sample Name: Blood per Rectum
Code:
# Find all the <b> tags
b_all = sampletext_div.findAll('b')
for b in b_all[4:]:
    print(b.text, b.next_sibling)
Output:
CHIEF COMPLAINT: Bright red blood per rectum
HISTORY OF PRESENT ILLNESS: This 73-year-old woman had a recent medical history significant for renal and bladder cancer, deep venous thrombosis of the right lower extremity, and anticoagulation therapy complicated by lower gastrointestinal bleeding. Colonoscopy during that admission showed internal hemorrhoids and diverticulosis, but a bleeding site was not identified. Five days after discharge to a nursing home, she again experienced bloody bowel movements and returned to the emergency department for evaluation.
REVIEW OF SYMPTOMS: No chest pain, palpitations, abdominal pain or cramping, nausea, vomiting, or lightheadedness. Positive for generalized weakness and diarrhea the day of admission.
PRIOR MEDICAL HISTORY: Long-standing hypertension, intermittent atrial fibrillation, and hypercholesterolemia. Renal cell carcinoma and transitional cell bladder cancer status post left nephrectomy, radical cystectomy, and ileal loop diversion 6 weeks prior to presentation, postoperative course complicated by pneumonia, urinary tract infection, and retroperitoneal bleed. Deep venous thrombosis 2 weeks prior to presentation, management complicated by lower gastrointestinal bleeding, status post inferior vena cava filter placement.
MEDICATIONS: Diltiazem 30 mg tid, pantoprazole 40 mg qd, epoetin alfa 40,000 units weekly, iron 325 mg bid, cholestyramine. Warfarin discontinued approximately 10 days earlier.
ALLERGIES: Celecoxib (rash).
SOCIAL HISTORY: Resided at nursing home. Denied alcohol, tobacco, and drug use.
FAMILY HISTORY: Non-contributory.
PHYSICAL EXAM: <br/>
LABS: <br/>
CBC: <br/>
CHEM 7: <br/>
COAGULATION STUDIES: <br/>
HOSPITAL COURSE: The patient received 1 liter normal saline and diltiazem (a total of 20 mg intravenously and 30 mg orally) in the emergency department. Emergency department personnel made several attempts to place a nasogastric tube for gastric lavage, but were unsuccessful. During her evaluation, the patient was noted to desaturate to 80% on room air, with an increase in her respiratory rate to 34 breaths per minute. She was administered 50% oxygen by nonrebreadier mask, with improvement in her oxygen saturation to 89%. Computed tomographic angiography was negative for pulmonary embolism.
Keywords:
gastroenterology, blood per rectum, bright red, bladder cancer, deep venous thrombosis, colonoscopy, gastrointestinal bleeding, diverticulosis, hospital course, lower gastrointestinal bleeding, nasogastric tube, oxygen saturation, emergency department, rectum, thrombosis, emergency, department, gastrointestinal, blood, bleeding, oxygen,
NOTE : These transcribed medical transcription sample reports and examples are provided by various users and
are for reference purpose only. MTHelpLine does not certify accuracy and quality of sample reports.
These transcribed medical transcription sample reports may include some uncommon or unusual formats;
this would be due to the preference of the dictating physician. All names and dates have been
changed (or removed) to keep confidentiality. Any resemblance of any type of name or date or
place or anything else to real world is purely incidental.
I am trying to scrape data from web pages and am able to scrape it.
After running the script below I get all of the div class data, but I am confused about how to write the data into a CSV file, like:
First name data in the First Name column
Last name data in the Last Name column
...
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'
page = urlopen(html)
data = BeautifulSoup(page, 'html.parser')
name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'})
for i in range(len(name_box)):
    data = name_box[i].text.strip()
Data:
Information Type
Individual
First Name
KACHAM
Middle Name
Last Name
RAJESHWAR
Father Full Name
RAMAIAH
Do you have any Past Experience ?
No
Do you have any registration in other State than registred State?
No
House Number
8-2-293/82/A/446/1
Building Name
SAI KRUPA
Street Name
ROAD NO 20
Locality
JUBILEE HILLS
Landmark
JUBILEE HILLS
State
Telangana
Division
Division 1
District
Hyderabad
Mandal
Shaikpet
Village/City/Town
Pin Code
500033
Office Number
04040151614
Fax Number
Website URL
Authority Name
Plan Approval Number
1/18B/06558/2018
Project Name
SKV S ANANDA VILAS
Project Status
New Project
Proposed Date of Completion
17/04/2024
Litigations related to the project ?
No
Project Type
Residential
Are there any Promoter(Land Owner/ Investor) (as defined by Telangana RERA Order) in the project ?
Yes
Sy.No/TS No.
00
Plot No./House No.
10-2-327
Total Area(In sqmts)
526.74
Area affected in Road widening/FTL of Tanks/Nala Widening(In sqmts)
58.51
Net Area(In sqmts)
1
Total Building Units (as per approved plan)
1
Proposed Building Units(as per agreement)
1
Boundaries East
PLOT NO 213
Boundaries West
PLOT NO 215
Boundaries North
PLOT NO 199
Boundaries South
ROAD NO 8
Approved Built up Area (In Sqmts)
1313.55
Mortgage Area (In Sqmts)
144.28
State
Telangana
District
Hyderabad
Mandal
Maredpally
Village/City/Town
Street
ROAD NO 8
Locality
SECUNDERABAD COURT
Pin Code
500026
Above is the data I get after running the code.
Edit
for i in range(len(name_box)):
    data = name_box[i].text.strip()
    print(data)

fname = 'out.csv'
with open(fname) as f:
    next(f)
    for line in f:
        head = []
        value = []
        for row in line:
            head.append(row)
            print(row)
Expected
Information Type | First | Middle Name | Last Name | ......
Individual | KACHAM | | RAJESHWAR | .....
I have 200 URLs, but the data is not the same for all of them; some fields are missing. If data is not available for a field, I want to write nothing, just a blank.
Please suggest. Thank you in advance.
To write to CSV you need to know which values belong in the head row and which in the body; in this case, a head value is an HTML element that contains a <label> tag.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'
page = urlopen(html)
data = BeautifulSoup(page, 'html.parser')
name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'})

heads = []
values = []
for i in range(len(name_box)):
    data = name_box[i].text.strip()
    dataHTML = str(name_box[i])
    if 'PInfoType' in dataHTML:
        # <div class="col-md-3 col-sm-3" id="PInfoType">
        # empty value, maybe additional data for "Information Type"
        continue
    if 'for="2"' in dataHTML:
        # <label for="2">No</label>
        # looks like a head but is actually a value
        values.append(data)
    elif '<label' in dataHTML:
        # <label for="PersonalInfoModel_InfoTypeValue">Information Type</label>
        # head or top row
        heads.append(data)
    else:
        # <div class="col-md-3 col-sm-3">Individual</div>
        # value for the second row
        values.append(data)

csvData = ', '.join(heads) + '\n' + ', '.join(values)
with open("results.csv", 'w') as f:
    f.write(csvData)
print("finish.")
Question: How to write csv file from scraped data
Read the data into a dict and use csv.DictWriter(...) to write it to a CSV file; a sketch follows the example output below.
Documentation about:
csv.DictWriter
while
next
break
Mapping Types — dict
Skip the first line, as it's the title
Loop Data lines
key = next(data)
value = next(data)
Break loop if no further data
Build dict[key] = value
After finishing the loop, write dict to CSV file
Output:
{'Individual': '', 'Father Full Name': 'RAMAIAH', 'First Name': 'KACHAM', 'Middle Name': '', 'Last Name': 'RAJESHWAR',... (omitted for brevity)
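A minimal sketch of those steps, assuming the scraped strings alternate label/value as in the dump above (fields with missing values will still need the label/value disambiguation shown in the first answer):
import csv

# name_box holds the scraped divs from the earlier snippet
data = iter(x.text.strip() for x in name_box)
next(data)  # skip the first line, as it's the title
record = {}
while True:
    try:
        key = next(data)
        value = next(data)
    except StopIteration:
        break  # no further data
    record[key] = value

with open('out.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=record.keys())
    writer.writeheader()
    writer.writerow(record)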
I am having a hard time scraping this link via Python 3, BeautifulSoup 4:
http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining
I only want to get this section.
When you are in ...
Capitol City Grille
This downtown Lansing restaurant offers ...
Capitol City Grille Lounge
For a glass of wine or a ...
Room Service
If you prefer ...
I have this code
for rest in dining_page_soup.select("div.copy_left p strong"):
    if rest.next_sibling is not None:
        if rest.next_sibling.next_sibling is not None:
            title = rest.text
            desc = rest.next_sibling.next_sibling
            print ("Title: "+title)
            print (desc)
But it gives me TypeError: 'NoneType' object is not callable
on desc = rest.next_sibling.next_sibling, even though I have an if statement to check whether it is None or not.
Here is a very simple solution:
from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining")
data = r.text
soup = BeautifulSoup(data, 'html.parser')
for found_text in soup.select('div.copy_left'):
    print(found_text.text)
UPDATE
Following an improvement to the question, here is a solution using re.
A specific workaround has to be made for the 1st paragraph ("When you...") since it does not follow the structure of the other paragraphs.
import re

for tag in soup.find_all(re.compile("^strong")):
    title = tag.text
    desc = tag.next_sibling.next_sibling
    print("Title: " + title)
    print(desc)
Output
Title: Capitol City Grille
This downtown Lansing restaurant offers delicious, contemporary
American cuisine in an upscale yet relaxed environment. You can enjoy
dishes that range from fluffy pancakes to juicy filet mignon steaks.
Breakfast and lunch buffets are available, as well as an à la carte
menu.
Title: Capitol City Grille Lounge
For a glass of wine or a hand-crafted cocktail and great conversation,
spend an afternoon or evening at Capitol City Grille Lounge with
friends or colleagues.
Title: Room Service
If you prefer to dine in the comfort of your own room, order from the
room service menu.
Title: Menus
Breakfast Menu
Title: Capitol City Grille Hours
Breakfast, 6:30-11 a.m.
Title: Capitol City Grille Lounge Hours
Mon-Thu, 11 a.m.-11 p.m.
Title: Room Service Hours
Daily, 6:30 a.m.-2 p.m. and 5-10 p.m.
If you don't mind using xpath, this should work
import requests
from lxml import html
url = "http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining"
page = requests.get(url).text
tree = html.fromstring(page)
xp_t = "//*[@class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/text()"
xp_d = "//*[@class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/../text()[not(following-sibling::strong)]"

titles = tree.xpath(xp_t)
descriptions = tree.xpath(xp_d)  # still contains garbage like '\r\n'
descriptions = [d.strip() for d in descriptions if d.strip()]

for t, d in zip(titles, descriptions):
    print("{title}: {description}".format(title=t, description=d))
Here descriptions contains 3 elements: "This downtown...", "For a glass...", "If you prefer...".
If you need also "When you are in the mood...", replace with this:
xp_d = "//*[@class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/../text()"
I'm new to python and have been taking on various projects to get up to speed. At the moment, I'm working on a routine that will read through the Code of Federal Regulations and for each paragraph, print the organizational hierarchy for that paragraph. For example, a simplified version of the CFR's XML scheme would look like:
<CHAPTER>
<HD SOURCE="HED">PART 229—NONDISCRIMINATION ON THE BASIS OF SEX IN EDUCATION PROGRAMS OR ACTIVITIES RECEIVING FEDERAL FINANCIAL ASSISTANCE</HD>
<SECTION>
<SECTNO>### 229.120</SECTNO>
<SUBJECT>Transfers of property.</SUBJECT>
<P>If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).</P>
</SECTION>
I'd like to be able to print this to a CSV so that I can run text analysis:
Title 22, Volume 2, Part 229, Section 228.120, If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).
Note that I'm not taking the Title and Volume numbers from the XML, because they are actually included in the file name in a much more standardized format.
Because I'm such a Python newbie, the code is mostly based on the search-engine code from Udacity's computer science course. Here's the Python I've written/adapted so far:
import os
import urllib2
from xml.dom.minidom import parseString

file_path = '/Users/owner1/Downloads/CFR-2012/title-22/CFR-2012-title22-vol1.xml'
file_name = os.path.basename(file_path) #Gets the filename from the path.
doc = open(file_path)
page = doc.read()

def clean_title(file_name): #Gets the title number from the filename.
    start_title = file_name.find('title')
    end_title = file_name.find("-", start_title+1)
    title = file_name[start_title+5:end_title]
    return title

def clean_volume(file_name): #Gets the volume number from the filename.
    start_volume = file_name.find('vol')
    end_volume = file_name.find('.xml', start_volume)
    volume = file_name[start_volume+3:end_volume]
    return volume

def get_next_section(page): #Gets all of the text between <SECTION> tags.
    start_section = page.find('<SECTION')
    if start_section == -1:
        return None, 0
    start_text = page.find('>', start_section)
    end_quote = page.find('</SECTION>', start_text + 1)
    section = page[start_text + 1:end_quote]
    return section, end_quote

def get_section_number(section): #Within the <SECTION> tag, find the section number based on the <SECTNO> tag.
    start_section_number = section.find('<SECTNO>###')
    if start_section_number == -1:
        return None, 0
    end_section_number = section.find('</SECTNO>', start_section_number)
    section_number = section[start_section_number+11:end_section_number]
    return section_number, end_section_number

def get_paragraph(section): #Within the <SECTION> tag, finds <P> paragraphs.
    start_paragraph = section.find('<P>')
    if start_paragraph == -1:
        return None, 0
    end_paragraph = section.find('</P>', start_paragraph)
    paragraph = section[start_paragraph+3:end_paragraph]
    return start_paragraph, paragraph, end_paragraph

def print_all_paragraphs(page): #This is the section that I would *like* to have print each paragraph and the citation hierarchy.
    section, endpos = get_next_section(page)
    for pragraph in section:
        title = clean_title(file_name)
        volume = clean_volume(file_name)
        section, endpos = get_next_section(page)
        section_number, end_section_number = get_section_number(section)
        start_paragraph, paragraph, end_paragraph = get_paragraph(section)
        if paragraph:
            print "Title: "+ title + " Volume: "+ volume +" Section Number: "+ section_number + " Text: "+ paragraph
            page = page[end_paragraph:]
        else:
            break

print print_all_paragraphs(page)
doc.close()
At the moment, this code has the following issues (example output to follow):
It prints the first paragraph multiple times. How can I print each <P> tag with its own title number, volume number, etc.?
The CFR has empty sections that are "Reserved". These sections don't have <P> tags, so the if check fails and the loop breaks. I've tried implementing for/while loops, but for some reason the code then just prints the first paragraph it finds repeatedly.
Here's an example of the output:
Title: 22 Volume: 1 Section Number: 9.10 Text: All requests to the Department by a member
of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number: 9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number: 9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number: 9.11 Text: The Information and Privacy Coordinator shall be responsible for conducting a program for systematic declassification review of historically valuable records that were exempted from the automatic declassification provisions of section 3.3 of the Executive Order. The Information and Privacy Coordinator shall prioritize such review on the basis of researcher interest and the likelihood of declassification upon review.
Title: 22 Volume: 1 Section Number: 9.12 Text: For Department procedures regarding the access to classified information by historical researchers and certain former government personnel, see Sec. 171.24 of this Title.
Title: 22 Volume: 1 Section Number: 9.13 Text: Specific controls on the use, processing, storage, reproduction, and transmittal of classified information within the Department to provide protection for such information and to prevent access by unauthorized persons are contained in Volume 12 of the Department's Foreign Affairs Manual.
Title: 22 Volume: 1 Section Number: 9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
Title: 22 Volume: 1 Section Number: 9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
None
Ideally, each of the entries after the citation information would be different.
What kind of loop should I run to print this properly? Is there a more "pythonic" way of doing this kind of text extraction?
I understand that I am a complete novice, and one of the major problems I'm facing is that I simply don't have the vocabulary or topic knowledge to really find detailed answers about parsing XML with this level of detail. Any recommended reading would also be welcome.
I like to solve problems like this with XPath or XSLT. You can find a great implementation in lxml (not in the standard distribution; it needs to be installed). For instance, the XPath //CHAPTER/SECTION[SECTNO] selects all sections with data, and you can then use relative XPath statements to grab the values you want from each one. Multiple nested for loops disappear. XPath has a bit of a learning curve, but there are many examples out there.
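A sketch of that approach with lxml, using the asker's file path and the simplified sample structure above (the element names are taken from the sample, not verified against the full CFR schema):
from lxml import etree

tree = etree.parse('/Users/owner1/Downloads/CFR-2012/title-22/CFR-2012-title22-vol1.xml')
for section in tree.xpath('//CHAPTER/SECTION[SECTNO]'):
    # relative lookups grab the values from within each section
    sectno = section.findtext('SECTNO').strip()
    for p in section.xpath('P'):
        text = ''.join(p.itertext()).strip()
        print(sectno, text)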