I am new to web scraping. I am trying to scrape the "When purchased online" text on Target, but I did not find it in the HTML.
Does anyone know how to locate this element in the HTML? Any help is appreciated. Thanks!
Product URLs:
https://www.target.com/c/allergy-sinus-medicines-treatments-health/-/N-4y5ny?Nao=144
https://www.target.com/p/genexa-dextromethorphan-kids-39-cough-and-chest-congestion-suppressant-4-fl-oz/-/A-80130848#lnk=sametab
I have no idea which element you want to get, but the API sends JSON data, not HTML, so you can simply convert it to a dictionary/list and use keys/indexes to get the value.
But you have to find the correct keys in the JSON data manually.
Or you can write a small script to search the JSON (using for-loops and recursion).
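For example, a small recursive helper (my own sketch, not part of any library) that prints the path to every occurrence of a key:
def find_key(data, key, path=''):
    # walk nested dicts/lists and print where `key` occurs
    if isinstance(data, dict):
        for k, v in data.items():
            if k == key:
                print(f'{path}[{k!r}] = {v!r}')
            find_key(v, key, f'{path}[{k!r}]')
    elif isinstance(data, list):
        for i, v in enumerate(data):
            find_key(v, key, f'{path}[{i}]')

# usage, with `data` from the code below:
# find_key(data, 'current_retail')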
Minimal working code (I found the keys manually):
import requests

url = 'https://redsky.target.com/redsky_aggregations/v1/web/pdp_client_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&tcin=80130848&is_bot=false&member_id=0&store_id=1771&pricing_store_id=1771&has_pricing_store_id=true&scheduled_delivery_store_id=1771&has_financing_options=true&visitor_id=01819D268B380201B177CA755BCE70CC&has_size_context=true&latitude=41.9831&longitude=-91.6686&zip=52404&state=IA' # JSON

response = requests.get(url)
data = response.json()

product = data['data']['product']

print('price:', product['price']['current_retail'])
print('title:', product['item']['product_description']['title'])
print('description:', product['item']['product_description']['downstream_description'])
print('------------')

for bullet in product['item']['product_description']['bullet_descriptions']:
    print(bullet)

print('------------')
print(product['item']['product_description']['soft_bullets']['title'])
for bullet in product['item']['product_description']['soft_bullets']['bullets']:
    print('-', bullet)

print('------------')
for attribute in product['item']['wellness_merchandise_attributes']:
    print('-', attribute['value_name'])
    print(' ', attribute['wellness_description'])
Result:
price: 13.99
title: Genexa Dextromethorphan Kids' Cough and Chest Congestion Suppressant - 4 fl oz
description: Genexa Kids’ Cough & Chest Congestion is real medicine, made clean - a powerful cough suppressant and expectorant that helps control cough, relieves chest congestion and helps thin and loosen mucus. This liquid, non-drowsy medicine has the same active ingredients you need (dextromethorphan HBr and guaifenesin), but without the artificial ones you don’t (dyes, common allergens, parabens). We only use ingredients people deserve to make the first gluten-free, non-GMO, certified vegan medicines to help your little ones feel better. <br /><br />Genexa is the first clean medicine company. Founded by two dads who believe in putting People Over Everything, Genexa makes medicine with the same active ingredients people need, but without the artificial ones they don’t. It’s real medicine, made clean.
------------
<B>Suggested Age:</B> 4 Years and Up
<B>Product Form:</B> Liquid
<B>Primary Active Ingredient:</B> Dextromethorphan
<B>Package Quantity:</B> 1
<B>Net weight:</B> 4 fl oz (US)
------------
highlights
- This is an age restricted item and will require us to take a quick peek at your ID upon pick-up
- Helps relieve kids’ chest congestion and makes coughs more productive by thinning and loosening mucus
- Non-drowsy so your little ones (ages 4+) can get back to playing
- Our medicine is junk-free, with no artificial sweeteners or preservatives, no dyes, no parabens, and no common allergens
- Certified gluten-free, vegan, and non-GMO
- Flavored with real organic blueberries
- Gentle on little tummies
------------
- Dye-Free
A product that either makes an unqualified on-pack statement indicating that it does not contain dye, or carries an unqualified on-pack statement such as "no dyes" or "dye-free."
- Gluten Free
A product that has an unqualified independent third-party certification, or carries an on-pack statement relating to the finished product being gluten-free.
- Non-GMO
A product that has an independent third-party certification, or carries an unqualified on-pack statement relating to the final product being made without genetically engineered ingredients.
- Vegan
A product that carries an unqualified independent, third-party certification, or carries on-pack statement relating to the product being 100% vegan.
- HSA/FSA Eligible
Restrictions apply; contact your insurance provider about plan allowances and requirements
EDIT:
Information "When purchased online" (or "at Cedar Rapids South") are in different url.
For example
Product url:
https://www.target.com/p/genexa-kids-39-diphenhydramine-allergy-liquid-medicine-organic-agave-4-fl-oz/-/A-80130847
API product data:
https://redsky.target.com/redsky_aggregations/v1/web/pdp_client_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&tcin=80130847&is_bot=false&member_id=0&store_id=1771&pricing_store_id=1771&has_pricing_store_id=true&scheduled_delivery_store_id=1771&has_financing_options=true&visitor_id=01819D268B380201B177CA755BCE70CC&has_size_context=true&latitude=41.9831&longitude=-91.6686&zip=52404&state=IA
API "at Cedar Rapids South":
https://redsky.target.com/redsky_aggregations/v1/web_platform/product_fulfillment_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&is_bot=false&tcin=80130847&store_id=1771&zip=52404&state=IA&latitude=41.9831&longitude=-91.6686&scheduled_delivery_store_id=1771&required_store_id=1771&has_required_store_id=true
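A quick way to see what that fulfillment endpoint returns is to dump the whole JSON and search it by eye (a sketch; as before, the keys have to be found manually):
import json
import requests

# the "at Cedar Rapids South" fulfillment URL from above
url = 'https://redsky.target.com/redsky_aggregations/v1/web_platform/product_fulfillment_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&is_bot=false&tcin=80130847&store_id=1771&zip=52404&state=IA&latitude=41.9831&longitude=-91.6686&scheduled_delivery_store_id=1771&required_store_id=1771&has_required_store_id=true'

data = requests.get(url).json()
print(json.dumps(data, indent=2))  # dump everything and look for the store name by eye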
But in some situations it probably uses other information from the product data to display "When purchased online" instead of "at Cedar Rapids South", and this logic may be hardcoded in the JavaScript. For example, the product which displays "When purchased online" has formatted_price "$13.99", while the product which displays "at Cedar Rapids South" has formatted_price "See price in cart":
import requests

url = 'https://redsky.target.com/redsky_aggregations/v1/web/plp_search_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&brand_id=q643lel65ir&channel=WEB&count=24&default_purchasability_filter=true&offset=0&page=%2Fb%2Fq643lel65ir&platform=desktop&pricing_store_id=1771&store_ids=1771%2C1768%2C1113%2C3374%2C1792&useragent=Mozilla%2F5.0+%28X11%3B+Linux+x86_64%3B+rv%3A101.0%29+Gecko%2F20100101+Firefox%2F101.0&visitor_id=01819D268B380201B177CA755BCE70CC' # JSON

response = requests.get(url)
data = response.json()

for product in data['data']['search']['products']:
    print('title:', product['item']['product_description']['title'])
    print('price:', product['price']['current_retail'])
    print('formatted:', product['price']['formatted_current_price'])
    print('---')
Result:
title: Genexa Kids' Diphenhydramine Allergy Liquid Medicine - Organic Agave - 4 fl oz
price: 7.99
formatted: See price in cart
---
title: Genexa Dextromethorphan Kids' Cough and Chest Congestion Suppressant - 4 fl oz
price: 13.99
formatted: $13.99
---
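Based on those two results, one could guess at the display rule like this (purely my inference from the two products above, not confirmed against Target's JavaScript):
def price_label(product):
    # inferred heuristic: a concrete formatted price is shown as an online
    # price; 'See price in cart' suggests a store-specific label instead
    if product['price']['formatted_current_price'] == 'See price in cart':
        return 'at <your store>'  # e.g. "at Cedar Rapids South"
    return 'When purchased online'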
Related
Hate asking this because I've seen a lot of similar issues, but nothing that really guides me to the light unfortunately.
I'm trying to scrape the game story from this link (ultimately I want to build in the ability to do multiple links, but I'm hoping I can handle that part).
It seems like I should be able to take the story div tag, right? I am struggling with how to do that - I've tried various code snippets I've found online and tried to tweak them, but nothing really applies.
I found this amazing source which really taught me a lot about how to do this.
However, I'm still struggling - this is my code so far:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.nba.com/game/bkn-vs-phi-0022100993'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
# print(soup.prettify())
story = soup.find_all('div', {'id': 'story'})
print(story)
I'm really trying to learn this, not just copy and paste. In my mind right now this code says (in English):
Import the packages needed
Get the URL's HTML text data ---- (when I printed the complete page it worked fine)
Narrow down the HTML to only the div tag with the id "story" -- this obviously is the hiccup
Struggling to understand; going to keep playing with this, but I figured I'd turn here for some advice - any thoughts are greatly appreciated. I'm just getting a blank result right now.
The page is rendered by JavaScript, which requests cannot execute, so the info (which is pulled down by the original request) remains in its initial state, inside a script tag.
This is one way to get that story with requests:
import requests
from bs4 import BeautifulSoup as bs
import json
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.nba.com/game/bkn-vs-phi-0022100993'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
page_obj = soup.select_one('script#__NEXT_DATA__')
json_obj = json.loads(page_obj.text)
print('Title:', json_obj['props']['pageProps']['story']['header']['headline'])
print('Date:', json_obj['props']['pageProps']['story']['date'])
print('Content:', json_obj['props']['pageProps']['story']['content'])
Result printed in terminal:
Title: Durant, Nets rout 76ers in Simmons' return to Philadelphia
Date: 2022-03-11T00:30:27
Content: ['PHILADELPHIA (AP) The 76ers fans came to boo Ben Simmons. They left booing their own team.', "Kevin Durant scored 18 of his 25 points in Brooklyn's dominating first half in the Nets' 129-100 blowout victory over the 76ers on Thursday night in Simmons' much-hyped return to Philadelphia.", 'Seth Curry added 24 points, and Kyrie Irving had 22 for the Nets. They entered in eighth place in the East, but looked like a legitimate conference contender while badly outplaying the third-place 76ers.', 'Joel Embiid had 27 points and 12 rebounds for the 76ers, and James Harden finished with just 11 points. It was the first loss for Philadelphia in six games with Harden in the lineup.', "The game was dubbed as ''Boo Ben'' night, but the raucous fans instead turned their displeasure on the home team when the 76ers went to the locker room trailing 72-51 and again when Brooklyn built a stunning 32-point lead in the third quarter.", "''I think all of us look at Ben as our brother,'' Durant said. ''We knew this was a hostile environment. It's hard to chant at Ben Simmons when you're losing by that much.''", 'Simmons, wearing a designer hockey jersey and flashy jewelry, watched from the bench, likely taking delight in the vitriol deflected away from him. The three-time All-Star is continuing to recover from a back injury that has sidelined him since being swapped for Harden in a blockbuster deal at the trade deadline.', "''We definitely felt like Ben was on our heart,'' Irving said. ''If you come at Ben, you come at us.''", "While Simmons hasn't taken the floor yet, Harden had been a boon for the 76ers unlike his time in Brooklyn, where the so-called Big 3 of Harden, Durant and Irving managed to play just 16 games together following Harden's trade to Brooklyn last January that was billed as a potentially championship move. Harden exchanged fist bumps with Nets staff members just before tip before a shockingly poor performance from the 10-time All-Star and former MVP.", 'Harden missed 14 of 17 field-goal attempts.', "''We just didn't have the pop that we needed,'' Harden said.", 'The only shot Simmons took was a dunk during pregame warmups that drew derisive cheers from the Philly fans.', "The boos started early, as Simmons was met with catcalls while boarding the team bus to shootaround from the Nets' downtown hotel. Simmons did oblige one fan for an autograph, with another being heard on a video widely circulated on social media yelling, ''Why the grievance? Why spit in the face of Sixers fans? We did nothing but support you for five years, Ben. You know that.''", "The heckling continued when Simmons was at the arena. He entered the court 55 minutes prior to tip, wearing a sleeveless Nets warmup shirt and sweats and spent 20 minutes passing for Patty Mills' warmup shots. He didn't embrace any of his former teammates, though he did walk the length of the court to hug a 76ers official and then exchanged fist pumps with coach Doc Rivers at halftime.", "''Looked good to me, looked happy to be here,'' Nets coach Steve Nash said. ''I think he was happy to get it out of the way.''", "A large security presence closely watched the crowd and cell phones captured every Simmons move. By the end of the game, though, many 76ers fans had left and the remaining Nets fans were chanting: ''BEN SIM-MONS! 
BEN SIM-MONS!'' in a remarkable turnaround from the start of the evening.", 'WELCOME BACK', 'Former 76ers Curry and Andre Drummond, who also were part of the Simmons for Harden trade, were cheered during introductions, Curry made 10 of 14 shots, including 4 of 8 from 3-point range. Drummond had seven points and seven rebounds.', 'MOVE OVER, REGGIE', "Harden passed Reggie Miller for third on the NBA's 3-point list when he made his 2,561st trey with 6:47 left in the first quarter.", "TRAINER'S ROOM", 'Nets: LaMarcus Aldridge (hip) missed his second straight contest.', "76ers: Danny Green sat out after injuring his left middle finger in the first half of the 76ers' 121-106 win over the Bulls on Monday.", 'TIP-INS', 'Nets: Improved to 21-15 on the road, where Irving is only allowed to play due to his vaccination status. ... Durant also had 14 rebounds and seven assists.', "76ers: Paul Millsap returned after missing Monday's game against Chicago due to personal reasons but didn't play. . Former Sixers and Hall of Famers Allen Iverson and Julus ''Dr. J'' Erving were in attendance. Erving rang the ceremonial Liberty Bell before the contest.", 'UP NEXT', 'Nets: Host New York on Sunday.', '76ers: At Orlando on Sunday.', '---', 'More AP NBA: https://apnews.com/hub/NBA and https://twitter.com/AP-Sports']
Requests documentation: https://requests.readthedocs.io/en/latest/
Also, for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
I have this website and I would like to pull all the company names via Python, such as West Wood Events or Mitchell Event Planning.
But I am stuck on soup.find, since it gives me [].
When I inspect the page, I see, say, this:
<div class="LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd">Mitchell Event Planning<wbr></div>
and for that I would write:
week = soup.find(class_='LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd')
print(week)
And I get 0.
Am I missing something? I'm pretty new to this.
This string is not a single class but many classes separated by spaces.
In some modules you would have to use the original string with all its spaces, but it seems that in BS you have to use the classes separated by a single space.
The code works for me if I use a single space between LinesEllipsis and vendor-name--55315:
week = soup.find_all(class_='LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd')
Or if I use a CSS selector with a dot before every class in the string:
week = soup.select('.LinesEllipsis.vendor-name--55315.primaryBold--a3d1e.body1--24afd')
Minimal working code
import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.theknot.com/marketplace/wedding-planners-acworth-ga?page=2'

r = requests.get(url)
soup = BS(r.text, 'html.parser')

#week = soup.select('.LinesEllipsis.vendor-name--55315.primaryBold--a3d1e.body1--24afd')
week = soup.find_all(class_='LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd')

for item in week:
    print(item.text)
Result:
The Charming Details
Enraptured Events
pearl and sky events - planning, design and florals
Unique Occasions ByTNicole, Inc
Platinum Eventions
RED COMPANY ATLANTA
Pop + Fizz: Event Planning and Design
Patricia Elizabeth, certified wedding planners/producer
Rienza Events
Pollyanna Richter Weddings
Calabash Events, Inc.
Weddings by Carmona LLC
Emily Jordan Events
Perfectly Taylored Events
Lindsey Wise Designs
Elegant Weddings and Affairs
Party PLANit
Wedded Bliss
Above the Fray
Willow Jaymes Events
Coco Red Events
Liz D. Events, LLC
Leslie Cox Events
YSE LLC
Marmaros Productions
PerfectionsID, LLC
All Things Love
West Wood Events
Everlasting Elegance
Prestigious Occasions
I'm trying to use Beautiful Soup to extract information from old classified pages online. I mention this in particular because I can imagine that something may have changed about HTML standards that affects the way to do this. It seems that part of the problem may be that the text is not enclosed in any tags.
Here's an example of what the page HTML looks like:
<h5>REAL ESTATE</h5>
<hr/><b>SANTA FE REALTOR</b> seeks culturally astute clients interested in relocation or second home. Contact Susan: <a href=“EMAIL”>EMAIL</a> or PHONE
<hr/>
<h5>RENTALS</h5>
<hr/><b>NYC. GREENWICH VILLAGE.</b> Bed. Breakfast. Historic building, charming, great location. Short and long stays. PHONE or <a href=“EMAIL”>EMAIL</a>.
<hr/><b>E. 71st & PARK.</b> Quiet, beautiful, light-filled studio apartment. Available Wednesday-Sunday. Long-term. PHONE.
<hr/><b>BERKSHIRES—</b>extraordinary country home on swim pond with beach, 26 acres, 10 min. Tanglewood, large tiled hot tub, 4+BR, 4FPL, writer's cottage, AC, $10K/month, July–August; other months/year-round available. PHONE
<hr/><b>SPECTACULAR VIEW OVER MANHATTAN.</b> Furnished 1-bedroom apartment, quiet and secure, top floor upper East Side high-rise. $2,800 monthly, $800 weekly, minimum 2 weeks. PHONE or PHONE
<hr/><b>DEMOCRATIC CONVENTION—</b>Newly furnished ground floor one-plus bedrooms/one bath apartment on Beacon Hill; all conveniences, sleeps 1–4, easy walk to all central Boston. Photos available. $6K convention week, $9K month of July or best offer. <a href=“EMAIL”> EMAIL </a> or PHONE.
<hr/> <h5>INTERNATIONAL RENTALS</h5>
<hr/><b>SUPERB SABBATICALS</b> and vacation rentals: flats/houses, Paris, French countryside, Riviera, London, Tuscany, more; no exchanges. Two-week minimum. Over twenty years experience. <i> Abroad, Inc., Riverside Drive, New York, NY, tel . website.</i>
<hr/>
<b>CHARMING HOUSE—TODI, ITALY.</b> 4 bedrooms, fireplaces, garden, breathtaking views, parking. Tel:; fax: ; e-mail:
<a href=“EMAIL”>EMAIL</a>.
<hr/><b>PARIS-MARAIS</b> Musée Picasso. Archives Nationales. Very attractive one bedroom, large living room, den, bathroom, kitchen, all appliances. Nonsmokers. Biweekly/monthly/sabbaticals. PHONE
What I want to do is extract the text of each listing in the RENTALS section as separate items in a list.
It seems like this would be done by using some combination of parsing the sibling elements of the header.
However, when I run the code:
soup = BeautifulSoup(contents, 'html')
target = soup.find("h5", text="RENTALS")

listingtext = []
for sib in target.find_next_siblings():
    if sib.name == "h5":
        break
    elif not sib.text:
        pass
    else:
        listingtext.append(sib.text)
All that I get is a list of all of the bold header text for the listings and the email addresses, which is all of the text enclosed in tags.
i.e. I get:
["NYC. GREENWICH VILLAGE.","EMAIL",'E. 71st & PARK.', 'BERKSHIRES—','SPECTACULAR VIEW OVER MANHATTAN.','COLD SPRING, NEW YORK.', 'DEMOCRATIC CONVENTION—','EMAIL']
What I really would like is a list that looks like
['NYC. GREENWICH VILLAGE. Bed. Breakfast. Historic building, charming, great location. Short and long stays. PHONE or EMAIL','E. 71st & PARK. Quiet, beautiful, light-filled studio apartment. Available Wednesday-Sunday. Long-term. PHONE.' ... ]
It seems that the problem I'm having stems from the fact that the text is unenclosed, and that affects how BeautifulSoup parses it. It also seems that I need to figure out how to use the <hr/> tag, which the page uses to draw lines between the listings, to delimit each listing.
You can use this example to parse out the info from just the 'RENTALS' section:
from bs4 import BeautifulSoup, Tag
txt = '''<h5>REAL ESTATE</h5>
<hr/><b>SANTA FE REALTOR</b> seeks culturally astute clients interested in relocation or second home. Contact Susan: <a href=“EMAIL”>EMAIL</a> or PHONE
<hr/>
<h5>RENTALS</h5>
<hr/><b>NYC. GREENWICH VILLAGE.</b> Bed. Breakfast. Historic building, charming, great location. Short and long stays. PHONE or <a href=“EMAIL”>EMAIL</a>.
<hr/><b>E. 71st & PARK.</b> Quiet, beautiful, light-filled studio apartment. Available Wednesday-Sunday. Long-term. PHONE.
<hr/><b>BERKSHIRES—</b>extraordinary country home on swim pond with beach, 26 acres, 10 min. Tanglewood, large tiled hot tub, 4+BR, 4FPL, writer's cottage, AC, $10K/month, July–August; other months/year-round available. PHONE
<hr/><b>SPECTACULAR VIEW OVER MANHATTAN.</b> Furnished 1-bedroom apartment, quiet and secure, top floor upper East Side high-rise. $2,800 monthly, $800 weekly, minimum 2 weeks. PHONE or PHONE
<hr/><b>DEMOCRATIC CONVENTION—</b>Newly furnished ground floor one-plus bedrooms/one bath apartment on Beacon Hill; all conveniences, sleeps 1–4, easy walk to all central Boston. Photos available. $6K convention week, $9K month of July or best offer. <a href=“EMAIL”> EMAIL </a> or PHONE.
<hr/> <h5>INTERNATIONAL RENTALS</h5>
<hr/><b>SUPERB SABBATICALS</b> and vacation rentals: flats/houses, Paris, French countryside, Riviera, London, Tuscany, more; no exchanges. Two-week minimum. Over twenty years experience. <i> Abroad, Inc., Riverside Drive, New York, NY, tel . website.</i>
<hr/>
<b>CHARMING HOUSE—TODI, ITALY.</b> 4 bedrooms, fireplaces, garden, breathtaking views, parking. Tel:; fax: ; e-mail:
<a href=“EMAIL”>EMAIL</a>.
<hr/><b>PARIS-MARAIS</b> Musée Picasso. Archives Nationales. Very attractive one bedroom, large living room, den, bathroom, kitchen, all appliances. Nonsmokers. Biweekly/monthly/sabbaticals. PHONE'''
soup = BeautifulSoup(txt, 'html.parser')

for hr in soup.select('hr'):
    if hr.find_previous('h5') is None or hr.find_previous('h5').text != 'RENTALS':
        continue
    out, s = [], hr.next_sibling
    while s is not None and not (isinstance(s, Tag) and s.name in ('hr', 'h5')):
        if isinstance(s, Tag):
            out.append(s.get_text(strip=True))
        elif s.strip():
            out.append(s.strip())
        s = s.next_sibling
    if out:
        print(' '.join(out))
        print('-' * 80)
Prints:
NYC. GREENWICH VILLAGE. Bed. Breakfast. Historic building, charming, great location. Short and long stays. PHONE or EMAIL .
--------------------------------------------------------------------------------
E. 71st & PARK. Quiet, beautiful, light-filled studio apartment. Available Wednesday-Sunday. Long-term. PHONE.
--------------------------------------------------------------------------------
BERKSHIRES— extraordinary country home on swim pond with beach, 26 acres, 10 min. Tanglewood, large tiled hot tub, 4+BR, 4FPL, writer's cottage, AC, $10K/month, July–August; other months/year-round available. PHONE
--------------------------------------------------------------------------------
SPECTACULAR VIEW OVER MANHATTAN. Furnished 1-bedroom apartment, quiet and secure, top floor upper East Side high-rise. $2,800 monthly, $800 weekly, minimum 2 weeks. PHONE or PHONE
--------------------------------------------------------------------------------
DEMOCRATIC CONVENTION— Newly furnished ground floor one-plus bedrooms/one bath apartment on Beacon Hill; all conveniences, sleeps 1–4, easy walk to all central Boston. Photos available. $6K convention week, $9K month of July or best offer. EMAIL or PHONE.
--------------------------------------------------------------------------------
give the "url" you want to scrape and
i will edit this answer and gave you the correct way with the output also
Check the following text piece:
IN THE HIGH COURT OF GUJARAT AT AHMEDABAD
R/CRIMINAL APPEAL NO. 251 of 2009
FOR APPROVAL AND SIGNATURE:
HONOURABLE MR.JUSTICE R.P.DHOLARIA
==========================================================
1 Whether Reporters of Local Papers may be allowed to see the judgment ?
2 To be referred to the Reporter or not ?
3 Whether their Lordships wish to see the fair copy of the judgment ?
4 Whether this case involves a substantial question of law as to the interpretation of the Constitution of India or any order made thereunder ?
========================================================== STATE OF GUJARAT,S M RAO,FOOD INSPECTOR,OFFICE OF THE Versus DHARMESHBHAI NARHARIBHAI GANDHI ========================================================== Appearance: MS HB PUNANI, APP (2) for the Appellant(s) No. 1 MR DK MODI(1317) for the Opponent(s)/Respondent(s) No. 1 ==========================================================
CORAM: HONOURABLE MR.JUSTICE R.P.DHOLARIA
Date : 12/03/2019
ORAL JUDGMENT
1. The appellant State of Gujarat has
preferred the present appeal under section 378(1)
(3) of the Code of Criminal Procedure, 1973
against the judgment and order of acquittal dated
Page 1 of 12
R/CR.A/251/2009 JUDGMENT
17.11.2008 rendered by learned 2nd Additional
Civil Judge and Judicial Magistrate, First Class,
Nadiad in Food Case No.1 of 2007.
The short facts giving rise to the
present appeal are that on 10.11.2006 at about
18.00 hours, the complainant visited the place of
the respondent accused situated at Juna
Makhanpura, Rabarivad, Nadiad along with panch
witness and the respondent was found dealing in
provisional items. The complainant identified
himself as a Food Inspector and after giving
intimation in Form No.6 has purchased muddamal
sample of mustard seeds in the presence of the
panchas for the purpose of analysis. Thereafter,
the complainant Food Inspector has divided the
said sample in equal three parts and after
completing formalities of packing and sealing
obtained signatures of the vendor and panchas and
out of the said three parts, one part was sent to
the Public Analyst, Vadodara for analysis and
remaining two parts were sent to the Local Health
Authority, Gandhinagar. Thereafter, the Public
Analyst forwarded his report. In the said report,
it is stated that the muddamal sample of mustard
seeds is misbranded which is in breach of the
provisions of the Food Adulteration Act, 1954
(for short “the Act”) and the Rules framed
thereunder. It is alleged that, therefore, the
sample of mustard seeds was misbranded and,
thereby, the accused has committed the offence.
Page 2 of 12
R/CR.A/251/2009 JUDGMENT
Hence, the complaint came to be lodged against
the respondent accused.
I want to write a program that follows the constraints below. Keep in mind that this is only a single file; I have around 40k files and it should run on all of them. The files differ in places, but the basic format of every file is the same.
Constraints:
It should start the text extraction from after the "metadata". The metadata is the information about the file from the start, i.e. "IN THE HIGH COURT OF GUJARAT", up to "ORAL JUDGMENT". In all the files I have, there are various POINTS after this string ends, and I need each of these points as a separate paragraph (the text above has 2 points; I need them in different paragraphs).
Check the page-marker lines ("Page 1 of 12", "R/CR.A/251/2009 JUDGMENT"); these are the page breaks in the text/PDF file. I need to remove these, as they do not add any meaning to the text content I want.
These files are available in both TEXT and PDF format, so I can use either. But I am new to Python, so I don't know how or where to start; I just have basic knowledge of Python.
This data is going to be made into a "corpus" for further processing in building a large expert system, so I hope you can see what needs to be done.
Read the official Python docs!
Start with Python's basic str type and its methods. One of its methods, find, will find substrings in your text.
Use Python's slicing notation to extract the portion of text you need, e.g.
text = """YOUR TEXT HERE..."""
meta_start = 'In the high court of gujarat'
meta_end = 'ORAL JUDGMENT'
pos1 = text.find(meta_start)
pos2 = text.find(meta_end)
if pos2 > pos1 and pos1 > -1:
# text is found, extract it
text1 = text[meta_start + len(meta_start):meta_end - 1]
After that, you can go ahead and save your extracted text to a database.
Of course, a better but more complicated solution would be to use regular expressions, but that's another story - try finding the right way for yourself!
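For instance, a regex version of the same extraction (a sketch; it assumes the metadata always runs from the court header to "ORAL JUDGMENT", as in the sample):
import re

# drop everything from the start of the file through 'ORAL JUDGMENT';
# re.DOTALL lets .*? span multiple lines
text1 = re.sub(r'^.*?ORAL JUDGMENT', '', text, count=1, flags=re.DOTALL)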
As to italics and other text formatting, you won't ever be able to detect them in plain text (unless you have some 'meta' markers, like [i] tags).
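The page markers themselves, however, do follow a recognizable pattern, so they can be filtered out (a sketch based on the two marker forms visible in the sample):
import re

# matches lines like 'Page 1 of 12' and 'R/CR.A/251/2009 JUDGMENT'
page_marker = re.compile(r'^\s*(Page \d+ of \d+|R/CR\.A/\d+/\d+\s+JUDGMENT)\s*$')
text1 = '\n'.join(line for line in text1.splitlines() if not page_marker.match(line))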
Given an input file, e.g.
<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
</p>
</doc>
<srcset>
The desired result is a nested dictionary that stores:
/setid
/docid
/segid
text
I've been using a defaultdict and reading the xml file with BeautifulSoup and nested loops, i.e.
from io import StringIO
from collections import defaultdict
from bs4 import BeautifulSoup
srcfile = """<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
</p>
</doc>
<srcset>"""
#ntok = NISTTokenizer()
eval_docs = defaultdict(lambda: defaultdict(dict))
with StringIO(srcfile) as fin:
    bsoup = BeautifulSoup(fin.read(), 'html5lib')
    setid = bsoup.find('srcset')['setid']
    for doc in bsoup.find_all('doc'):
        docid = doc['docid']
        for seg in doc.find_all('seg'):
            segid = seg['id']
            eval_docs[setid][docid][segid] = seg.text
[out]:
>>> eval_docs
defaultdict(<function __main__.<lambda>>,
{'newstest2015': defaultdict(dict,
{'1012-bbc': {'1': 'India and Japan prime ministers meet in Tokyo',
'2': "India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.",
'3': 'Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.',
'4': 'High on the agenda are plans for greater nuclear co-operation.',
'5': 'India is also reportedly hoping for a deal on defence collaboration between the two nations.'},
'1018-lenta.ru': {'1': 'FANO Russia will hold a final Expert Session',
'2': 'The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.',
'3': 'The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.',
'4': 'At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.'}})})
Is there a simpler way to read the file and get the same eval_docs nested dictionary?
Can it be done easily without using BeautifulSoup?
Note that in the example, there's only one setid and one docid but the actual file has more than one of those.
Since what you have is HTML with an XML-like appearance, you can't go for strict XML-based tools. In most cases your options are:
Implement a SAX parser
Use BS4 (which you are already doing)
Use lxml
In any case you will end up spending more time and effort and writing more code to handle this. What you have is really sleek and easy. I wouldn't look for another solution if I were you.
P.S.: What could be simpler than a 10-line solution!
I don't know if you'll find this simpler, but here's an alternative, using lxml as others have suggested.
Step 1: Convert the XML data into a normalized table (a list of lists)
from lxml import etree

tree = etree.parse('source.xml')
segs = tree.xpath('//seg')

normalized_list = []
for seg in segs:
    srcset = seg.getparent().getparent().getparent().attrib['setid']
    doc = seg.getparent().getparent().attrib['docid']
    normalized_list.append([srcset, doc, seg.attrib['id'], seg.text])
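If the chained getparent() calls feel fragile, an equivalent version of the same loop can use XPath's ancestor axis instead (a variant, same result):
normalized_list = []
for seg in segs:
    # look up the enclosing srcset/doc attributes by axis, not by depth
    srcset = seg.xpath('ancestor::srcset/@setid')[0]
    doc = seg.xpath('ancestor::doc/@docid')[0]
    normalized_list.append([srcset, doc, seg.attrib['id'], seg.text])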
Step 2: Use defaultdict like you did in your original code
from collections import defaultdict

d = defaultdict(lambda: defaultdict(dict))
for i in normalized_list:
    d[i[0]][i[1]][i[2]] = i[3]
Depending on how you're keeping the source file, you'll have to use one of these methods to parse the XML:
tree = etree.parse('source.xml'): when you want to parse a file directly; you won't need StringIO, and the file is closed automatically by etree.
tree = etree.fromstring(source): where source is a string object, like in your question.
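For example, parsing the string from the question directly (a sketch; note that lxml's strict XML parser requires the sample's trailing <srcset> to be corrected to </srcset> first):
from lxml import etree

tree = etree.fromstring(srcfile)  # srcfile is the string from the question
segs = tree.xpath('//seg')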