How to arrange html sentences having different structures

How to arrange html sentences having different structures - python

I have few hundreds of html files look like the below.
<nonDerivativeTable>
<nonDerivativeHolding> #First Holding
<securityTitle>
<value>Stock</value>
</securityTitle>
</nonDerivativeHolding>
<nonDerivativeHolding> #Second Holding
<securityTitle>
<footnoteId id="F1"/>
</securityTitle>
</nonDerivativeHolding>
<nonDerivativeHolding> #Third Holding
<securityTitle>
<value>Option</value>
<footnoteId id="F2"/>
<footnoteId id="F3"/>
</securityTitle>
</nonDerivativeHolding>
</nonDerivativeTable>
Two variables that I would like to extract is security ('Stock' in #First holding, '' in #Second holding, and 'Option' in #Third holding) and security_footnote ('' in #First holding, 'F1; F2' in #Second holding, and 'F3' in #Third holding. But securityTitle and securityTitleFootnote do not always exist.
Also, sometimes there are multiple footnote IDs just like in the #third holding.
I want to write each rwo using data in each "Holding" tag allowing for empty values.
import csv
from bs4 import BeautifulSoup
with open('output.csv', 'w', newline='') as outfile:
writer = csv.writer(outfile, )
soup = BeautifulSoup(doc, 'htmparser') #Let's say doc has the html.
try:
securityTitles = soup.select('securityTitle > value').text
except:
securitiyTitles = ''
try:
securityTitleFootnotes = '; 'join(soup.select('securityTitle > footnoteid').get('id')
except:
securityTitleFootnotes = ''
for securityTitle, securityTitleFootnote in zip(securitiyTitles, securityTitleFootnotes):
writer.writerow([securityTitle, securityTitleFootnote])
I want the result to be
Want Table
Note: one of url that I am trying to parse is "https://www.sec.gov/Archives/edgar/data/12927/0001225208-09-018738.txt". sentences that I uploaded are only part of the data.
Now I see that those are XML... rather than HTML.

You can find the contents for each nonDerivativeHolding, and then apply a custom list of handlers for each:
from bs4 import BeautifulSoup as soup
c = [i.securitytitle.contents for i in soup(s, 'html.parser').find_all('nonderivativeholding')]
h = [('value', lambda x:x.text), ('footnoteid', lambda x:x['id'])]
results = [[i for i in b if i != '\n'] for b in c]
r = [{a:(lambda x:'' if not x else x[0] if len(x) == 1 else x)([b(j) for j in i if j.name == a]) for a, b in h} for i in results]
Output:
[{'value': 'Stock', 'footnoteid': ''}, {'value': '', 'footnoteid': 'F1'}, {'value': 'Option', 'footnoteid': ['F2', 'F3']}]

Related

add values to a list from specific part of a text file

I am having this text
/** Goodmorning
Alex
Dog
House
Red
*/
/** Goodnight
Maria
Cat
Office
Green
*/
I would like to have Alex , Dog , House and red in one list and Maria,Cat,office,green in an other list.
I am having this code
with open(filename) as f :
for i in f:
if i.startswith("/** Goodmorning"):
#add files to list
elif i.startswith("/** Goodnight"):
#add files to other list
So, is there any way to write the script so it can understands that Alex belongs in the part of the text that has Goodmorning?

I'd recommend you to use dict, where "section name" will be a key:
with open(filename) as f:
result = {}
current_list = None
for line in f:
if line.startswith("/**"):
current_list = []
result[line[3:].strip()] = current_list
elif line != "*/":
current_list.append(line.strip())
Result:
{'Goodmorning': ['Alex', 'Dog', 'House', 'Red'], 'Goodnight': ['Maria', 'Cat', 'Office', 'Green']}
To search which key one of values belongs you can use next code:
search_value = "Alex"
for key, values in result.items():
if search_value in values:
print(search_value, "belongs to", key)
break

I would recommend to use Regular expressions. In python there is a module for this called re
import re
s = """/** Goodmorning
Alex
Dog
House
Red
*/
/** Goodnight
Maria
Cat
Office
Green
*/"""
pattern = r'/\*\*([\w \n]+)\*/'
word_groups = re.findall(pattern, s, re.MULTILINE)
d = {}
for word_group in word_groups:
words = word_group.strip().split('\n\n')
d[words[0]] = words[1:]
print(d)
Output:
{'Goodmorning': ['Alex', 'Dog', 'House', 'Red'], 'Goodnight':
['Maria', 'Cat', 'Office', 'Green']}

expanding on Olvin Roght (sorry can't comment - not enough reputation) I would keep a second dictionary for the reverse lookup
with open(filename) as f:
key_to_list = {}
name_to_key = {}
current_list = None
current_key = None
for line in f:
if line.startswith("/**"):
current_list = []
current_key = line[3:].strip()
key_to_list[current_key] = current_list
elif line != "*/":
current_name=line.strip()
name_to_key[current_name]=current_key
current_list.append(current_name)
print key_to_list
print name_to_key['Alex']
alternative is to convert the dictionary afterwards:
name_to_key = {n : k for k in key_to_list for n in key_to_list[k]}
(i.e if you want to go with the regex version from ashwani)
Limitation is that this only permits one membership per name.

How can I match strings that may or may not exist in regex but have placeholders if the match does not exist

Suppose I have a big text file in the following form
[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]
There are in total 8 keys which will always be in the order of "Surname","Name","Age","Weight","Height","School","Siblings","Quote" which I know beforehand. As you can see, some profiles do not have the full set of variables. The only thing you can be sure will exist is the name.
I want to create a pandas dataframe with each observation as a row and each column as a key. In the case of James, since he does not have the entries in "School" and "Sibling" I would like the entries of those cells to be the numpy nan object.
My attempt is using something like (?:\[Surname: \"()\"\]) for every variable. But even for the single case of surname I run into problems. If surname does not exist, it returns no place holders just the empty list.
Update:
As an example, I would like the return for monica's profile to be
('','Monica','','33','','','','I am looking forward to christmas')

You can parse the file data, group the results, and pass to a dataframe:
import re
import pandas as pd
def group_results(d):
_group = [d[0]]
for a, b in d[1:]:
if a == 'Name' and not any(c == 'Name' for c, _ in _group):
_group.append([a, b])
elif a == 'Surname' and any(c == 'Name' for c, _ in _group):
yield _group
_group = [[a, b]]
else:
if a == 'Name':
yield _group
_group = [[a, b]]
else:
_group.append([a, b])
yield _group
headers = ["Surname","Name","Age","Weight","Height","School","Siblings","Quote"]
data = list(filter(None, [i.strip('\n') for i in open('filename.txt')]))
parsed = [(lambda x:[x[0], x[-1][1:-1]])(re.findall('(?<=^\[)\w+|".*?"(?=\]$)', i)) for i in data]
_grouped = list(map(dict, group_results(parsed)))
result = pd.DataFrame([[c.get(i, "") for i in headers] for c in _grouped], columns=headers)
Output:
Surname Name ... Siblings Quote
0 Gordon James ... I want to be a pilot
1 Monica ... I am looking forward to christmas
[2 rows x 8 columns]

Building on #WiktorStribiżew comment, you could use groupby (from itertools) to group the lines into empty lines and data lines, for instance like this:
import re
from itertools import groupby
text = '''[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]
[Name: "John"]
[Height: "33"]
[Quote: "I am looking forward to christmas"]
[Surname: "Gordon"]
[Name: "James"]
[Height: "44"]
[Quote: "I am looking forward to christmas"]'''
patterns = [re.compile('(\[Surname: "(?P<surname>\w+?)"\])'),
re.compile('(\[Name: "(?P<name>\w+?)"\])'),
re.compile('(\[Age: "(?P<age>\d+?)"\])'),
re.compile('\[Weight: "(?P<weight>\d+?)"\]'),
re.compile('\[Height: "(?P<height>\d+?)"\]'),
re.compile('\[Quote: "(?P<quote>.+?)"\]')]
records = []
for non_empty, group in groupby(text.splitlines(), key=lambda l: bool(l.strip())):
if non_empty:
lines = list(group)
record = {}
for line in lines:
for pattern in patterns:
match = pattern.search(line)
if match:
record.update(match.groupdict())
break
records.append(record)
for record in records:
print(record)
Output
{'weight': '46', 'quote': 'I want to be a pilot', 'age': '13', 'name': 'James', 'height': '12', 'surname': 'Gordon'}
{'weight': '33', 'quote': 'I am looking forward to christmas', 'name': 'Monica'}
{'height': '33', 'quote': 'I am looking forward to christmas', 'name': 'John'}
{'height': '44', 'surname': 'Gordon', 'quote': 'I am looking forward to christmas', 'name': 'James'}
Note: This creates a dictionary where the keys are the field names and the values are the values of each, this format does not match your intended output, but I believe is more complete that what you requested. In any case you can easily convert from this format into the desired tuple format.
Explanation
The groupby function from itertools groups the input data into contiguous groups of empty lines and record lines. Then you only need to process the groups that are not empty. The processing is simple for each line try to match a pattern if the pattern is matched break, assuming the lines are exclusive for each match update the record dictionary with the value of the field, leveraging named groups.

You can rewrite your data file. The code parses your original file into classes D, then uses csv.DictWriter to write it into a normal style csv that should be readable by pandas:
Create demo file:
fn = "t.txt"
with open (fn,"w") as f:
f.write("""
[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]
""")
Itermediate class:
class D:
fields = ["Surname","Name","Age","Weight","Height","Quote"]
def __init__(self,textlines):
t = [(k.strip(),v.strip()) for k,v in (x.strip().split(":",1) for x in textlines)]
self.data = {k:"" for k in D.fields}
self.data.update(t)
def surname(self): return self.data["Surname"]
def name(self): return self.data["Name"]
def age(self): return self.data["Age"]
def weight(self): return self.data["Weight"]
def height(self): return self.data["Height"]
def quote(self): return self.data["Quote"]
def get_data(self):
return self.data
Parsing and rewriting:
fn = "t.txt"
# list of all collected D-Instances
data = []
with open(fn) as f:
# each dataset contains all lines belonging to one "person"
dataset = []
surname = False
for line in f.readlines():
clean = line.strip().strip("[]")
if clean and (clean.startswith("Surname") or clean.startswith("Name")):
if any(e.startswith("Name") for e in dataset):
data.append(D(dataset))
dataset = []
if clean:
dataset.append(clean)
else:
if clean:
dataset.append(clean)
elif clean:
dataset.append(clean)
if dataset:
data.append(D(dataset))
import csv
with open("other.txt", "w", newline="") as f:
dw = csv.DictWriter(f,fieldnames=D.fields)
dw.writeheader()
for entry in data:
dw.writerow(entry.get_data())
Check what was written:
with open("other.txt","r") as f:
print(f.read())
Output:
Surname,Name,Age,Weight,Height,Quote
"""Gordon""","""James""","""13""","""46""","""12""","""I want to be a pilot"""
,"""Monica""",,"""33""",,"""I am looking forward to christmas"""

Create a list of (key,value) tuples for each info block with re.findall(), and put them in separate dictionaries:
text="""[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]"""
keys=['Surname','Name','Age','Weight','Height','Quote']
rslt=[{}]
for k,v in re.findall(r"(?m)(?:^\s*\[(\w+):\s*\"\s*([^\]\"]+)\"\s*\])+",text):
d=rslt[-1]
if (k=="Surname" and d) or (k=="Name" and "Name" in d):
d={}
rslt.append(d)
d[k]=v
for d in rslt:
print( [d.get(k,'') for k in keys] )
Out:
['Gordon', 'James', '13', '46', '12', 'I want to be a pilot']
['', 'Monica', '', '33', '', 'I am looking forward to christmas']

How to scrape and extract all the subcategories names from all its associated pages for a wikipedia category using python 3.6?

I want to scrape all the subcategories and pages under the category header of the Category page: "Category:Computer science". The link for the same is as follows: http://en.wikipedia.org/wiki/Category:Computer_science.
I have got an idea regarding the above mentioned problem, from the following stack overflow answer which is specified in the following link.
Pythonic beautifulSoup4 : How to get remaining titles from the next page link of a wikipedia category
and
How to scrape Subcategories and pages in categories of a Category wikipedia page using Python
However, the answer do not fully solves the problem. It only scrapes the Pages in category "Computer science". But, I want to extract all the subcategories names and its associated pages. I want the process should report the results in BFS manner with a depth of 10. Is there exist any way to do this?
I found the following code from this linked post:
from pprint import pprint
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests
base_url = 'https://en.wikipedia.org/wiki/Category:Computer science'
def get_next_link(soup):
return soup.find("a", text="next page")
def extract_links(soup):
return [a['title'] for a in soup.select("#mw-pages li a")]
with requests.Session() as session:
content = session.get(base_url).content
soup = BeautifulSoup(content, 'lxml')
links = extract_links(soup)
next_link = get_next_link(soup)
while next_link is not None: # while there is a Next Page link
url = urljoin(base_url, next_link['href'])
content = session.get(url).content
soup = BeautifulSoup(content, 'lxml')
links += extract_links(soup)
next_link = get_next_link(soup)
pprint(links)

To scrape the subcategories, you will have to use selenium to interact with the dropdowns. A simple traversal over the second category of links will yield the pages, however, to find all the subcategories, recursion is needed to properly group the data. The code below utilizes a simple variant of the breadth-first search to determine when to stop looping over the dropdown toggle objects generated at each iteration of the while loop:
from selenium import webdriver
import time
from bs4 import BeautifulSoup as soup
def block_data(_d):
return {_d.find('h3').text:[[i.a.attrs.get('title'), i.a.attrs.get('href')] for i in _d.find('ul').find_all('li')]}
def get_pages(source:str) -> dict:
return [block_data(i) for i in soup(source, 'html.parser').find('div', {'id':'mw-pages'}).find_all('div', {'class':'mw-category-group'})]
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://en.wikipedia.org/wiki/Category:Computer_science')
all_pages = get_pages(d.page_source)
_seen_categories = []
def get_categories(source):
return [[i['href'], i.text] for i in soup(source, 'html.parser').find_all('a', {'class':'CategoryTreeLabel'})]
def total_depth(c):
return sum(1 if len(b) ==1 and not b[0] else sum([total_depth(i) for i in b]) for a, b in c.items())
def group_categories(source) -> dict:
return {i.find('div', {'class':'CategoryTreeItem'}).a.text:(lambda x:None if not x else [group_categories(c) for c in x])(i.find_all('div', {'class':'CategoryTreeChildren'})) for i in source.find_all('div', {'class':'CategoryTreeSection'})}
while True:
full_dict = group_categories(soup(d.page_source, 'html.parser'))
flag = False
for i in d.find_elements_by_class_name('CategoryTreeToggle'):
try:
if i.get_attribute('data-ct-title') not in _seen_categories:
i.click()
flag = True
time.sleep(1)
except:
pass
else:
_seen_categories.append(i.get_attribute('data-ct-title'))
if not flag:
break
Output:
all_pages:
[{'\xa0': [['Computer science', '/wiki/Computer_science'], ['Glossary of computer science', '/wiki/Glossary_of_computer_science'], ['Outline of computer science', '/wiki/Outline_of_computer_science']]},
{'B': [['Patrick Baudisch', '/wiki/Patrick_Baudisch'], ['Boolean', '/wiki/Boolean'], ['Business software', '/wiki/Business_software']]},
{'C': [['Nigel A. L. Clarke', '/wiki/Nigel_A._L._Clarke'], ['CLEVER score', '/wiki/CLEVER_score'], ['Computational human modeling', '/wiki/Computational_human_modeling'], ['Computational social choice', '/wiki/Computational_social_choice'], ['Computer engineering', '/wiki/Computer_engineering'], ['Critical code studies', '/wiki/Critical_code_studies']]},
{'I': [['Information and computer science', '/wiki/Information_and_computer_science'], ['Instance selection', '/wiki/Instance_selection'], ['Internet Research (journal)', '/wiki/Internet_Research_(journal)']]},
{'J': [['Jaro–Winkler distance', '/wiki/Jaro%E2%80%93Winkler_distance'], ['User:JUehV/sandbox', '/wiki/User:JUehV/sandbox']]},
{'K': [['Krauss matching wildcards algorithm', '/wiki/Krauss_matching_wildcards_algorithm']]},
{'L': [['Lempel-Ziv complexity', '/wiki/Lempel-Ziv_complexity'], ['Literal (computer programming)', '/wiki/Literal_(computer_programming)']]},
{'M': [['Machine learning in bioinformatics', '/wiki/Machine_learning_in_bioinformatics'], ['Matching wildcards', '/wiki/Matching_wildcards'], ['Sidney Michaelson', '/wiki/Sidney_Michaelson']]},
{'N': [['Nuclear computation', '/wiki/Nuclear_computation']]}, {'O': [['OpenCV', '/wiki/OpenCV']]},
{'P': [['Philosophy of computer science', '/wiki/Philosophy_of_computer_science'], ['Prefetching', '/wiki/Prefetching'], ['Programmer', '/wiki/Programmer']]},
{'Q': [['Quaject', '/wiki/Quaject'], ['Quantum image processing', '/wiki/Quantum_image_processing']]},
{'R': [['Reduction Operator', '/wiki/Reduction_Operator']]}, {'S': [['Social cloud computing', '/wiki/Social_cloud_computing'], ['Software', '/wiki/Software'], ['Computer science in sport', '/wiki/Computer_science_in_sport'], ['Supnick matrix', '/wiki/Supnick_matrix'], ['Symbolic execution', '/wiki/Symbolic_execution']]},
{'T': [['Technology transfer in computer science', '/wiki/Technology_transfer_in_computer_science'], ['Trace Cache', '/wiki/Trace_Cache'], ['Transition (computer science)', '/wiki/Transition_(computer_science)']]},
{'V': [['Viola–Jones object detection framework', '/wiki/Viola%E2%80%93Jones_object_detection_framework'], ['Virtual environment', '/wiki/Virtual_environment'], ['Visual computing', '/wiki/Visual_computing']]},
{'W': [['Wiener connector', '/wiki/Wiener_connector']]},
{'Z': [['Wojciech Zaremba', '/wiki/Wojciech_Zaremba']]},
{'Ρ': [['Portal:Computer science', '/wiki/Portal:Computer_science']]}]
full_dict is quite large, and due to its size I am unable to post it entirely here, however, below is an implementation of a function to traverse the structure and select all the elements down to a depth of ten:
def trim_data(d, depth, count):
return {a:None if count == depth else [trim_data(i, depth, count+1) for i in b] for a, b in d.items()}
final_subcategories = trim_data(full_dict, 10, 0)
Edit: script to remove leaves from tree:
def remove_empty_children(d):
return {a:None if len(b) == 1 and not b[0] else
[remove_empty_children(i) for i in b if i] for a, b in d.items()}
When running the above:
c = {'Areas of computer science': [{'Algorithms and data structures': [{'Abstract data types': [{'Priority queues': [{'Heaps (data structures)': [{}]}, {}], 'Heaps (data structures)': [{}]}]}]}]}
d = remove_empty_children(c)
Output:
{'Areas of computer science': [{'Algorithms and data structures': [{'Abstract data types': [{'Priority queues': [{'Heaps (data structures)': None}], 'Heaps (data structures)': None}]}]}]}
Edit 2: flattening the entire structure:
def flatten_groups(d):
for a, b in d.items():
yield a
if b is not None:
for i in map(flatten_groups, b):
yield from i
print(list(flatten_groups(remove_empty_children(c))))
Output:
['Areas of computer science', 'Algorithms and data structures', 'Abstract data types', 'Priority queues', 'Heaps (data structures)', 'Heaps (data structures)']
Edit 3:
To access all the pages for every subcategory to a certain level, the original get_pages function can be utilized and a slightly different version of the group_categories method
def _group_categories(source) -> dict:
return {i.find('div', {'class':'CategoryTreeItem'}).find('a')['href']:(lambda x:None if not x else [group_categories(c) for c in x])(i.find_all('div', {'class':'CategoryTreeChildren'})) for i in source.find_all('div', {'class':'CategoryTreeSection'})}
from collections import namedtuple
page = namedtuple('page', ['pages', 'children'])
def subcategory_pages(d, depth, current = 0):
r = {}
for a, b in d.items():
all_pages_listing = get_pages(requests.get(f'https://en.wikipedia.org{a}').text)
print(f'page number for {a}: {len(all_pages_listing)}')
r[a] = page(all_pages_listing, None if current==depth else [subcategory_pages(i, depth, current+1) for i in b])
return r
print(subcategory_pages(full_dict, 2))
Please note that in order to utilize subcategory_pages, _group_categories must be used in place of group_categories.

How to read a file block-wise in python

I am bit stuck in reading a file block-wise, and facing difficulty in getting some selective data in each block :
Here is my file content :
DATA.txt
#-----FILE-----STARTS-----HERE--#
#--COMMENTS CAN BE ADDED HERE--#
BLOCK IMPULSE DATE 01-JAN-2010 6 DEHDUESO203028DJE \
SEQUENCE=ai=0:at=221:ae=3:lu=100:lo=NNU:ei=1021055:lr=1: \
USERID=ID=291821 NO_USERS=3 GROUP=ONE id_info=1021055 \
CREATION_DATE=27-JUNE-2013 SN=1021055 KEY ="22WS \
DE34 43RE ED54 GT65 HY67 AQ12 ES23 54CD 87BG 98VC \
4325 BG56"
BLOCK PASSION DATE 01-JAN-2010 6 DEHDUESO203028DJE \
SEQUENCE=ai=0:at=221:ae=3:lu=100:lo=NNU:ei=324356:lr=1: \
USERID=ID=291821 NO_USERS=1 GROUP=ONE id_info=324356 \
CREATION_DATE=27-MAY-2012 SN=324356 KEY ="22WS \
DE34 43RE 342E WSEW T54R HY67 TFRT 4ER4 WE23 XS21 \
CD32 12QW"
BLOCK VICTOR DATE 01-JAN-2010 6 DEHDUESO203028DJE \
SEQUENCE=ai=0:at=221:ae=3:lu=100:lo=NNU:ei=324356:lr=1: \
USERID=ID=291821 NO_USERS=5 GROUP=ONE id_info=324356 \
CREATION_DATE=27-MAY-2012 SN=324356 KEY ="22WS \
DE34 43RE 342E WSEW T54R HY67 TFRT 4ER4 WE23 XS21 \
CD32 12QW"
#--BLOCK--ENDS--HERE#
#--NEW--BLOCKS--CAN--BE--APPENDED--HERE--#
I am only interested in Block Name , NO_USERS, and id_info of each block .
these three data to be saved to a data-structure(lets say dict), which is further stored in a list :
[{Name: IMPULSE ,NO_USER=3,id_info=1021055},{Name: PASSION ,NO_USER=1,id_info=324356}. . . ]
any other data structure which can hold the info would also be fine.
So far i have tried getting the block names by reading line by line :
fOpen = open('DATA.txt')
unique =[]
for row in fOpen:
if "BLOCK" in row:
unique.append(row.split()[1])
print unique
i am thinking of regular expression approach, but i have no idea where to start with.
Any help would be appreciate.Meanwhile i am also trying , will update if i get something . Please help .

You could use groupy to find each block, use a regex to extract the info and put the values in dicts:
from itertools import groupby
import re
with open("test.txt") as f:
data = []
# find NO_USERS= 1+ digits or id_info= 1_ digits
r = re.compile("NO_USERS=\d+|id_info=\d+")
grps = groupby(f,key=lambda x:x.strip().startswith("BLOCK"))
for k,v in grps:
# if k is True we have a block line
if k:
# get name after BLOCK
name = next(v).split(None,2)[1]
# get lines after BLOCK and get the second of those
t = next(grps)[1]
# we want two lines after BLOCK
_, l = next(t), next(t)
d = dict(s.split("=") for s in r.findall(l))
# add name to dict
d["Name"] = name
# add sict to data list
data.append(d)
print(data)
Output:
[{'NO_USERS': '3', 'id_info': '1021055', 'Name': 'IMPULSE'},
{'NO_USERS': '1', 'id_info': '324356', 'Name': 'PASSION'},
{'NO_USERS': '5', 'id_info': '324356', 'Name': 'VICTOR'}]
Or without groupby as your file follows a format we just need to extract the second line after the BLOCK line:
with open("test.txt") as f:
data = []
r = re.compile("NO_USERS=\d+|id_info=\d+")
for line in f:
# if True we have a new block
if line.startswith("BLOCK"):
# call next twice to get thw second line after BLOCK
_, l = next(f), next(f)
# get name after BLOCK
name = line.split(None,2)[1]
# find our substrings from l
d = dict(s.split("=") for s in r.findall(l))
d["Name"] = name
data.append(d)
print(data)
Output:
[{'NO_USERS': '3', 'id_info': '1021055', 'Name': 'IMPULSE'},
{'NO_USERS': '1', 'id_info': '324356', 'Name': 'PASSION'},
{'NO_USERS': '5', 'id_info': '324356', 'Name': 'VICTOR'}]
To extract values you can iterate:
for dct in data:
print(dct["NO_USERS"])
Output:
3
1
5
If you want a dict of dicts and to access each section from 1-n you can store as nested dicts using from 1-n as tke key:
from itertools import count
import re
with open("test.txt") as f:
data, cn = {}, count(1)
r = re.compile("NO_USERS=\d+|id_info=\d+")
for line in f:
if line.startswith("BLOCK"):
_, l = next(f), next(f)
name = line.split(None,2)[1]
d = dict(s.split("=") for s in r.findall(l))
d["Name"] = name
data[next(cn)] = d
data["num_blocks"] = next(cn) - 1
Output:
from pprint import pprint as pp
pp(data)
{1: {'NO_USERS': '3', 'Name': 'IMPULSE', 'id_info': '1021055'},
2: {'NO_USERS': '1', 'Name': 'PASSION', 'id_info': '324356'},
3: {'NO_USERS': '5', 'Name': 'VICTOR', 'id_info': '324356'},
'num_blocks': 3}
'num_blocks' will tell you exactly how many blocks you extracted.

Generate a table of contents from HTML with Python

I'm trying to generate a table of contents from a block of HTML (not a complete file - just content) based on its <h2> and <h3> tags.
My plan so far was to:
Extract a list of headers using beautifulsoup
Use a regex on the content to place anchor links before/inside the header tags (so the user can click on the table of contents) -- There might be a method for replacing inside beautifulsoup?
Output a nested list of links to the headers in a predefined spot.
It sounds easy when I say it like that, but it's proving to be a bit of a pain in the rear.
Is there something out there that does all this for me in one go so I don't waste the next couple of hours reinventing the wheel?
A example:
<p>This is an introduction</p>
<h2>This is a sub-header</h2>
<p>...</p>
<h3>This is a sub-sub-header</h3>
<p>...</p>
<h2>This is a sub-header</h2>
<p>...</p>

Some quickly hacked ugly piece of code:
soup = BeautifulSoup(html)
toc = []
header_id = 1
current_list = toc
previous_tag = None
for header in soup.findAll(['h2', 'h3']):
header['id'] = header_id
if previous_tag == 'h2' and header.name == 'h3':
current_list = []
elif previous_tag == 'h3' and header.name == 'h2':
toc.append(current_list)
current_list = toc
current_list.append((header_id, header.string))
header_id += 1
previous_tag = header.name
if current_list != toc:
toc.append(current_list)
def list_to_html(lst):
result = ["<ul>"]
for item in lst:
if isinstance(item, list):
result.append(list_to_html(item))
else:
result.append('<li>%s</li>' % item)
result.append("</ul>")
return "\n".join(result)
# Table of contents
print list_to_html(toc)
# Modified HTML
print soup

Use lxml.html.
It can deal with invalid html just fine.
It is very fast.
It allows you to easily create the missing elements and move elements around between the trees.

I have come with an extended version of the solution proposed by Łukasz's.
def list_to_html(lst):
result = ["<ul>"]
for item in lst:
if isinstance(item, list):
result.append(list_to_html(item))
else:
result.append('<li>{}</li>'.format(item[0], item[1]))
result.append("</ul>")
return "\n".join(result)
soup = BeautifulSoup(article, 'html5lib')
toc = []
h2_prev = 0
h3_prev = 0
h4_prev = 0
h5_prev = 0
for header in soup.findAll(['h2', 'h3', 'h4', 'h5', 'h6']):
data = [(slugify(header.string), header.string)]
if header.name == "h2":
toc.append(data)
h3_prev = 0
h4_prev = 0
h5_prev = 0
h2_prev = len(toc) - 1
elif header.name == "h3":
toc[int(h2_prev)].append(data)
h3_prev = len(toc[int(h2_prev)]) - 1
elif header.name == "h4":
toc[int(h2_prev)][int(h3_prev)].append(data)
h4_prev = len(toc[int(h2_prev)][int(h3_prev)]) - 1
elif header.name == "h5":
toc[int(h2_prev)][int(h3_prev)][int(h4_prev)].append(data)
h5_prev = len(toc[int(h2_prev)][int(h3_prev)][int(h4_prev)]) - 1
elif header.name == "h6":
toc[int(h2_prev)][int(h3_prev)][int(h4_prev)][int(h5_prev)].append(data)
toc_html = list_to_html(toc)

How do I generate a table of contents for HTML text in Python?
But I think you are on the right track and reinventing the wheel will be fun.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to arrange html sentences having different structures - python

Related

add values to a list from specific part of a text file

How can I match strings that may or may not exist in regex but have placeholders if the match does not exist

How to scrape and extract all the subcategories names from all its associated pages for a wikipedia category using python 3.6?

How to read a file block-wise in python

Generate a table of contents from HTML with Python

Categories

Resources