How to use Python BeautifulSoup to get an image description from HTML? - python

I could not find an answer to this elsewhere, so I am asking for your help.
I have Python code that accesses http://news.yahoo.com/rss/entertainment
to get the titles and descriptions, but some of the description text is in the alt attribute of an image.
This is my code:
for child in body_tag.contents[0].channel.children:
    if (child.__class__ != NavigableString):
        if child.title != None :
            print "------title----------"
            print(child.title.contents[0].encode('ascii','ignore'))
            print "-----description-class------------"
            mchild=child.find_next("description").contents[0]
            print mchild.__class__
            print "-------description---------"
            print mchild.find_next("img")
            print(mchild.encode('ascii','ignore'))
            print "-------end---------"
This is part of the output:
------title----------
University of Connecticut revokes Cosby's honorary degree
-----description-class------------
class 'bs4.element.NavigableString'
-------description---------
None
To display the markup here, I replaced "<" and ">" with "(" and ")":
(p) (a href="http://news.yahoo.com/university-connecticut-revokes-cosbys-honorary-degree-155552959.html")
(img src="http://l.yimg.com/bt/api/res/1.2/cjgCZP4YBj7M6SmdpoGj.Q--/YXBwaWQ9eW5ld3NfbGVnbztmaT1maWxsO2g9ODY7cT03NTt3PTEzMA--/http://media.zenfs.com/en_us/News/ap_webfeeds/7b35f971ec59428491aef6308db4567e.jpg" width="130" height="86" alt="FILE - In this May 24, 2016 file photo, Bill Cosby departs the Montgomery County Courthouse after a preliminary hearing, in Norristown, Pa. A 72-year-old New Hampshire woman who says Bill Cosby raped her in 1965 has withdrawn her civil defamation lawsuit against the comedian after a federal judge had allowed the case to move forward. (AP Photo/Matt Rourke, File)" align="left" title="FILE - In this May 24, 2016 file photo, Bill Cosby departs the Montgomery County Courthouse after a preliminary hearing, in Norristown, Pa. A 72-year-old New Hampshire woman who says Bill Cosby raped her in 1965 has withdrawn her civil defamation lawsuit against the comedian after a federal judge had allowed the case to move forward. (AP Photo/Matt Rourke, File)" border="0" /((/a)STORRS, Conn. (AP) The University of Connecticut on Wednesday revoked an honorary degree awarded to Bill Cosby, saying he engaged in conduct "incongruent" with the university's values.(/p((br) clear="all"/)
-------end---------
-------end---------
How can I get the title inside the img tag:
title="FILE - In this May 24, 2016 file photo,
I tried find_next("img") and other approaches, but I couldn't get it.

So you want all the text from the description plus the title from any img tag. The description content is stored as text rather than as parsed tags (it is a NavigableString in your output), which is why find_next("img") returns None. Find all the description tags, turn each description's text into a new BeautifulSoup object, look for the img in that, and pull either the title or the alt attribute. To find the matching title, look up the title that precedes the description tag:
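# soup is the BeautifulSoup object already built from the feed
for desc in soup.find_all("description"):
    # Parse the description text a second time so the embedded tags become real tags
    d = BeautifulSoup(desc.text, "lxml")
    img = d.find("img")
    print("Title = {}".format(desc.find_previous("title").text))
    # Prefer the title attribute, fall back to alt, and use "" when there is no image
    img_text = (img.get("title") or img.get("alt", "")) if img else ""
    # "or ''" guards against a description with no text at all
    print("Decscription = {}\n".format((d.find(text=True) or "") + img_text))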
Which gives you:
Title = Entertainment News Headlines — Yahoo! News
Decscription = Get the latest entertainment news headlines from Yahoo! News. Find breaking entertainment news, including analysis and opinion on top entertainment stories.
Title = Spotify's Top 10 most streamed tracks
Decscription = The following list represents the most streamed tracks on Spotify, based on the number of people who shared it divided by the number who listened to it, from Monday, Oct. 20 to Sunday Oct. 26 via Facebook, Tumblr, Twitter and Spotify.FILE - In this Sept. 7, 2012 file photo, musician Robin Thicke performs during Macy's Passport presents Glamorama 2012 at The Orpheum Theatre in Los Angeles. Thicke's "Blurred Lines (feat. T.I. & Pharrell)" was the top streamed tracks on Spotify from Monday, June 10, to Sunday, June 16, 2013. (Photo by Matt Sayles/Invision/AP, File)
Title = Who will win at the Tony Awards? AP predicts
Decscription = NEW YORK (AP) — The great comedian W.C. Fields is credited with the line, "Never work with children or animals." He would have had trouble on Broadway this season.This theater image released by The O+M Company shows the cast during a performance of the musical "Kinky Boots." The Cyndi Lauper-scored "Kinky Boots," based on the 2005 British movie about a real-life shoe factory that struggles until it finds new life in fetish footwear, is nominated for 13 Tony Award nominations. The awards will be broadcast on CBS from Radio City Music Hall on June 9. (AP Photo/The O+M Company, Matthew Murphy)
Title = The top iPhone and iPad apps on App Store
Decscription = App Store Official Charts for the week ending November 3, 2014:
Title = Fairey: 'Vindicated' by dismissal of Detroit tagging case
Decscription = DETROIT (AP) — Graffiti artist Shepard Fairey says he feels "relieved and vindicated" now that a malicious destruction of property case in Detroit has been dismissed.
Title = FBI seeks Rockwell painting on 40th anniversary of its theft
Decscription = CHERRY HILL, N.J. (AP) — Federal authorities are seeking the public's help in recovering a 1919 Norman Rockwell painting on the 40th anniversary of its theft from a New Jersey home.
Title = APNewsBreak: Union has deal with 4th Atlantic City casino
Decscription = ATLANTIC CITY, N.J. (AP) — Atlantic City's main casino workers union reached agreement Thursday with four of the five casinos it had been targeting for a strike this weekend.Union members cheer as they discuss preparations for a strike against as many as five of the city's eight casinos in Atlantic City, N.J. on Wednesday June 29, 2016. Local 54 of the Unite-HERE union says it will go on strike Friday if it can't reach new contracts with three casinos owned by Caesars Entertainment (Bally's, Caesars and Harrah's) and two casinos owned by billionaire investor Carl Icahn (the Tropicana and the Trump Taj Mahal). About 6,500 of the union's nearly 10,000 workers are at the five hotels. (AP Photo/Wayne Parry)
Title = The Latest: APNewsBreak: Union has deal with 4th casino
Decscription = The Latest on contract negotiations with casinos (all times local): 4:35 p.m. Atlantic City's main casino workers union has reached agreement with the fourth of five casinos it had been targeting for a ...Union members cheer as they discuss preparations for a strike against as many as five of the city's eight casinos in Atlantic City, N.J. on Wednesday June 29, 2016. Local 54 of the Unite-HERE union says it will go on strike Friday if it can't reach new contracts with three casinos owned by Caesars Entertainment (Bally's, Caesars and Harrah's) and two casinos owned by billionaire investor Carl Icahn (the Tropicana and the Trump Taj Mahal). About 6,500 of the union's nearly 10,000 workers are at the five hotels. (AP Photo/Wayne Parry)
Title = CBS reporter traveling to 59 parks in a year
Decscription = NEW YORK (AP) — Conor Knighton didn't take the easy route when he proposed a "CBS Sunday Morning" story on the National Park Service's centennial.
Title = Here come the virtual reality Olympics ... for Samsung users
Decscription = NEW YORK (AP) — Athletes in Rio will compete to be the fastest sprinter and highest jumper at the Olympics this August. But there's another test underway as well: How well can virtual reality capture sporting events?This photo provided by NBC and HD Studio shows NBC's daytime and late night set for the Rio Olympics located on Copacabana Beach in Rio. NBC says it will provide 85 hours of virtual reality programming during the Rio Olympics in August, but only to users of Samsung Galaxy smartphones and the Samsung Gear VR headset. (HD Studio/Courtesy of NBC via AP)
Title = Oscars timetable for 2017 revealed
Decscription = Movie buffs, mark your calendars: your 2017 Oscars party will be on Sunday, February 26. The Academy of Motion Picture Arts and Sciences announced the timetable for the 89th Oscars on Thursday, one day after it announced that it had invited a record number of artists to join the body, the majority of them women and people of color.A view of the Oscars logo at the 88th Annual Academy Awards nominee luncheon on February 8, 2016 in Beverly Hills, California
Title = Queen marks deadly Somme centenary at Westminster Abbey
Decscription = LONDON (AP) — Queen Elizabeth II attended a service at Westminster Abbey on Thursday, the eve of the centenary of the Battle of the Somme, one of the deadliest chapters of World War I.
Title = Rob Wasserman, accomplished bass player, dead at 64
Decscription = NEW YORK (AP) — Rob Wasserman, a highly respected bass player and composer who performed and recorded with Lou Reed, Neil Young, Brian Wilson and many other musicians, has died. He was 64.
Title = Documents filed by some Prince claimants to become public
Decscription = CHASKA, Minn. (AP) — A Minnesota judge overseeing the legal proceedings about Prince's estate will allow documents filed by some claimants to become public.
Title = The Latest: Oprah Winfrey to appear at Essence Festival
Decscription = NEW ORLEANS (AP) — The Latest on the annual Essence Festival held over the July 4th holiday in New Orleans (all times local):FILE - In this Jan. 20, 2009, file photo, Mariah Carey performs at the Neighborhood Inaugural Ball in Washington. Music is at the heart of the annual Essence Festival in New Orleans, and this year is no different. Fans will get to hear from first-timers Mariah Carey, Puff Daddy and Jeremih as well as from festival veterans Charlie Wilson, Maxwell, New Edition, Tyrese and Lalah Hathaway - all of whom are scheduled to perform inside the Superdome Friday, July 1, 2016, through Sunday. (AP Photo/Alex Brandon, File)
Title = Brad Paisley: West Virginia floods shocking, heartbreaking
Decscription = CHARLESTON, W.Va. (AP) — Brad Paisley said he's shocked and heartbroken by the destruction from deadly flooding in his home state of West Virginia.Principal Mike Kelley walks through a hallway that is filled with slick mud at Herbert Hoover High School in Clendenin, W.Va., Monday, June 27, 2016. The first floor hallways and rooms of the school are caked in 3-5 inches mud, which was left by over six feet of flood water that swamped the building late last week. (Sam Owens/Charleston Gazette-Mail via AP)
Title = Chechen leader Kadyrov seeks apprentice on reality TV show
Decscription = MOSCOW (AP) — Another powerful, controversial man is taking to reality TV to find an assistant — not Donald Trump but the leader of Chechnya.FILE - In this Wednesday March 23, 2016 file photo, Chechen regional leader Ramzan Kadyrov addresses a rally marking the 13th anniversary of the adoption of the Constitution of Russian region of Chechnya, in the regional capital of Grozny, Russia. Russian state television on Thursday is to broadcast the opening episode of "Live - The Team," in which participants compete to become an assistant to leader of Chechnya Ramzan Kadyrov. (AP Photo/Musa Sadulayev, File)
Title = With an eye to Tuscany, Debi Mazar plots culinary future
Decscription = NEW YORK (AP) — Debi Mazar and her brood spend at least a month in Tuscany each year, but if the "Younger" actress had her way, the region would be a far more permanent fixture in her life.FILE - In this Wednesday, Jan. 6, 2016 file photo, Debi Mazar speaks during the "Younger" panel at the TV Land 2016 Winter TCA in Pasadena, Calif. After the success of her award-winning cooking show "Extra Virgin," Mazar's creative juices are still flowing, as the actress talks about the possibility of another show and more of her culinary dreams. (Photo by Richard Shotwell/Invision/AP)
Title = Wisecracking De Niro touts Catskills with NY governor
Decscription = BETHEL, N.Y. (AP) — Robert De Niro is conjuring the legacy — and the stand-up jokes — of comedians like Rodney Dangerfield, Henny Youngman and Milton Berle while praising the natural beauty of New York's Catskills region.
Title = Music Review: Sara Watkins branches out
Decscription = Sara Watkins, "Young in All the Wrong Ways" (New West Records)FILE - In this July 29, 2012 file photo, Sara Watkins performs at the Newport Folk Festival in Newport, R.I. Watkins describes her latest venture as “a breakup album with myself,” but it seems like there might have been someone else involved. The songs on her new album, “Young in All the Wrong Ways,” have bite to them. There is anger here, a jarring departure from Watkins’ previous work. A couple of the songs push into hard-edged rock, her voice straining against a jagged electric guitar. (AP Photo/Joe Giblin)
Title = Disney Animation's 'Wreck-It Ralph 2' set for March 2018
Decscription = LOS ANGELES (AP) — "Wreck-It Ralph" is headed back to the arcade, and theaters, in a sequel planned for release on March 9, 2018. Co-directors Rich Moore and Phil Johnston announced the sequel to the 2012 animated film Thursday morning on Facebook Live.FILE - In this Oct. 29, 2012 file photo, Director Rich Moore arrives at the world premiere of "Wreck-It Ralph" at El Capitan Theatre in Los Angeles. “Wreck-It Ralph” is headed back to the arcade, and theaters, in a sequel planned for release on March 9, 2018. Co-directors Rich Moore and Phil Johnston announced the sequel to the 2012 animated film Thursday, June 30, 2016 on Facebook Live. (Photo by Jordan Strauss/Invision/AP)
Title = Scarlett Johansson ranked Hollywood's top-grossing actress
Decscription = Scarlett Johansson has taken the crown as Hollywood's highest-grossing actress ever.FILE - In this April 21, 2015, file photo, Scarlett Johansson poses for photographers upon arrival at the premiere for the film 'The Avengers Age of Ultron' in London. Box Office Mojo has crowned Johansson as Hollywood's highest grossing actress on a list updated June 29, 2016.(Photo by Joel Ryan/Invision/AP, File)
Title = HLN's Nancy Grace leaving her legal show
Decscription = NEW YORK (AP) — Tough-talking former prosecutor Nancy Grace is leaving her prime-time show on the HLN network in October.FILE - In this Friday, Oct. 21, 2014, file photo, television host Nancy Grace arrives at the 7th annual GLSEN Respect Awards in Beverly Hills, Calif. Grace is leaving her prime-time show on the HLN network in October 2016. The CNN sister station said Grace told her staff on Thursday, June 30, 2016 that her show would be ending after 12 years. An HLN spokeswoman said the network had no immediate announcement on what program would go in its place. (AP Photo/Matt Sayles, File)
Title = Moviegoers to Hollywood: It better be good
Decscription = NEW YORK (AP) — As Hollywood girds for a low-key Fourth of July box office weekend and watches its summer season dip 15 percent below last year's, an even more worrisome trend has taken shape: Moviegoers are growing pickier.FILE - This image released by Warner Bros. Entertainment shows Alexander Skarsgard from "The Legend of Tarzan." For films that aren’t “the movie to see,” moviegoers are increasingly staying home. With word-of-mouth traveling at the speed of Twitter, quality has become a more vital currency. (Jonathan Olley/Warner Bros. Entertainment via AP, File)
Title = 8 rescued after Oklahoma City roller coaster gets stuck
Decscription = OKLAHOMA CITY (AP) — No one was injured when a roller coaster at an Oklahoma City amusement park stalled out and stranded eight people, including seven children.
Title = Smallest national park? Kosciuszko, forgotten son of liberty
Decscription = PHILADELPHIA (AP) — If the hip-hop Broadway smash "Hamilton" can reignite interest in the first U.S. treasury secretary, what will it take to drum up interest in another forgotten hero from America's fight for independence?FILE - In this April 1, 2013 file photo a statue of Poland's General Thaddeus Kosciuszko is enveloped in the early morning fog in Lafayette Park across from the White House in Washington. Kosciuszko was a military engineer from Poland, Kosciuszko came to Philadelphia in August 1776 to offer his services in the fight against the British. (AP Photo/Jacquelyn Martin, File)
Title = Pregnant Alanis Morissette posts nude underwater photo
Decscription = Alanis Morissette has posted a nude photo of herself sporting a large baby bump while floating underwater.FILE - In this Nov. 22, 2015, file photo, Souleye, left, and Alanis Morissette arrive at the American Music Awards in Los Angeles. Morissette posted a nude photo of herself sporting a large baby bump while floating underwater on Instagram on June 28, 2016. (Photo by Jordan Strauss/Invision/AP, File)
Title = New Orleans ready to 'party with a purpose' at Essence Fest
Decscription = NEW ORLEANS (AP) — Music has always been at the heart of the annual Essence Festival, now in its 22nd year, and this year will be no different.FILE - In this Jan. 20, 2009, file photo, Mariah Carey performs at the Neighborhood Inaugural Ball in Washington. Music is at the heart of the annual Essence Festival in New Orleans, and this year is no different. Fans will get to hear from first-timers Mariah Carey, Puff Daddy and Jeremih as well as from festival veterans Charlie Wilson, Maxwell, New Edition, Tyrese and Lalah Hathaway - all of whom are scheduled to perform inside the Superdome Friday, July 1, 2016, through Sunday. (AP Photo/Alex Brandon, File)
Title = Alvin Toffler, author of 'Future Shock,' dead at 87
Decscription = NEW YORK (AP) — Alvin Toffler, a guru of the post-industrial age whose million-selling "Future Shock" and other books anticipated the disruptions and transformations brought about by the rise of digital technology, has died. He was 87.
Title = Theater shows R-rated comedy trailer with "Finding Dory"
Decscription = CONCORD, Calif. (AP) — The owner of a California movie theater is apologizing after a trailer for an R-rated upcoming Seth Rogen comedy was shown ahead of a screening of Disney's "Finding Dory."FILE - This undated file image released by Disney shows the character Dory, voiced by Ellen DeGeneres, in a scene from "Finding Dory." In its second week, “Finding Dory” easily remained on top with an estimated $73.2 million, according to studio estimates Sunday, June 26, 2016. (Pixar/Disney via AP, File)
Title = Christie's to sell contents of Reagans' LA home
Decscription = NEW YORK (AP) — A two-day auction of the contents of Ronald and Nancy Reagan's ranch-style home in California will include everything from personal mementos from heads of state and friends to objects the couple took with them to the White House.This undated photo provided by Christie's shows a needlepoint cushion given to Ronald Reagan for his 70th birthday in 1981. The pillow, which will be sold by Christie's New York during a two-day auction of the contents of Ronald and Nancy Reagan's ranch-style home in California, has a pre-sale estimate of $1,000-1,500. Christie’s announced Thursday, June 30, 2016, highlights of the Sept. 21-22 sale in New York City. (Christie's via AP) MANDATORY CREDIT
Title = Asian actors too busy to fret over Hollywood 'white-washing'
Decscription = TOKYO (AP) — The film world of Asia, known for producing Akira Kurosawa, Satyajit Ray, Brillante Mendoza and other greats, is too busy making movies of its own to fret much about the debate slamming Hollywood — the casting of white people in roles written for Asians.FILE - In this Sept. 5, 2007, file photo, Japanese actress Kaori Momoi poses during the photo call for the movie "Sukiyaki Western Django" at the 64th Venice Film Festival, in Venice, Italy. The film world of Asia is too busy making movies of its own to fret much about the debate slamming Hollywood - the casting of white people in roles written for Asians. Momoi, who appeared in “Memoirs of a Geisha,” as well as Russian filmmaker Aleksandr Sokurov’s “The Sun,” suggested acting was ultimately about individual talent, not skin color or nationality. (AP Photo/Andrew Medichini, File)
Title = Film academy invites 683 new members to join
Decscription = LOS ANGELES (AP) — Six months after announcing intentions to double the number of female and minority members in its ranks by 2020, the Academy of Motion Picture Arts and Sciences has invited 683 new members to join the organization.FILE - In this March 2, 2014 file photo, an Oscar statue is displayed at the Oscars at the Dolby Theatre in Los Angeles. Six months after announcing intentions to double the number of female and minority members in its ranks, the Academy of Motion Picture Arts and Sciences has invited 683 new members to join the organization. The academy says its invitees are 46 percent female, 41 percent minority and represent 59 countries.(Photo by Matt Sayles/Invision/AP, File)
Title = Miss Teen USA pageant replaces swimsuits with athletic wear
Decscription = LAS VEGAS (AP) — The Miss Teen USA pageant is dropping the swimsuit portion of its competition.
Title = YouTube personality charged with making false police report
Decscription = LOS ANGELES (AP) — A gay YouTube personality who said he was assaulted outside a West Hollywood club has been charged with filing a false police report and faking his injuries.This Wednesday, June 29, 2016, photo released by Los Angeles County Sheriff's Department shows Calum McSwiggan. The London-native gay YouTube personality who said he was assaulted outside a West Hollywood club has been charged with filing a false police report and faking his injuries. (Los Angeles County Sheriff's Department via AP) MANDATORY CREDIT
Title = Jesus Christ film coming to virtual reality
Decscription = LOS ANGELES (AP) — The story of Jesus Christ is coming to virtual reality for the first time.This undated photo provided by Autumn VR Inc. and VRWERX, LLC, shows a production still from "Jesus VR - The Story of Christ." The story of Jesus Christ is coming to virtual reality for the first time. Autumn Productions and VRWerx announced plans Wednesday, June 29, 2016, to release the live-action film on all major VR platforms this Christmas. (Autumn VR Inc. and VRWERX, LLC via AP)
Title = The Latest: Celebrities record tribute to nightclub victims
Decscription = ORLANDO, Fla. (AP) — The Latest on the mass shooting at a gay Orlando nightclub that left 49 people dead (all times local):
Title = The Latest: Golfer Bubba Watson plans to help flood victims
Decscription = CHARLESTON, W.Va. (AP) — The Latest on flooding that has devastated parts of West Virginia (all times local):
Title = Twitter dominated by tongue-in-cheek #HeterosexualPrideDay
Decscription = What appears to be a tongue-in-cheek social media movement to mark June 29 as a day to celebrate heterosexual pride has become one of the day's top online trends.
Title = Miss Teen USA axes 'outdated' bikini competition
Decscription = One of America's top beauty pageants has axed its swimsuit competition, ditching bikinis for sportswear to fend off years of complaints that parading in a bikini is sexist and demeaning. The Miss Universe Organization, which operates the pageant, said from now on contestants would be judged on athletic wear, in addition to the evening wear and personality competitions. "Miss Teen USA's transition to athletic wear reads as less exploitative and more focused on the importance of physical fitness for its younger participants," it said.Miss Teen USA 2016 Katherine Haik (R) congratulates Miss District of Columbia USA 2016 Deshauna Barber during the 2016 Miss USA pageant at T-Mobile Arena on June 5, 2016 in Las Vegas, Nevada
Title = Kayne West, Adidas expand partnership for Yeezy line
Decscription = LOS ANGELES (AP) — Rapper Kanye West and Adidas are expanding their partnership that began almost two years ago with retail hubs for his Yeezy products and additional sportswear designs.FILE - In this Aug. 30, 2015, file photo, Kanye West accepts the video vanguard award at the MTV Video Music Awards at the Microsoft Theater in Los Angeles. West and Adidas are expanding their partnership that began almost two years ago with retail hubs for his Yeezy products and additional sportswear designs. The sportswear company announced the collaboration on Wednesday, June 29, 2016, and described it as the most significant partnership between a non-athlete and an athletic brand. (Photo by Matt Sayles/Invision/AP, File)
You cannot find every title first and then take the description that follows it, because not every title has a description, whereas every description has a title.
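The same idea can be tried offline against a small hand-written feed. This is only a sketch: SAMPLE_RSS and extract_items are hypothetical names, the feed content is made up, and it uses the built-in "html.parser" instead of "lxml" so that nothing beyond bs4 has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical two-description feed. The item description holds
# entity-escaped HTML, so it reaches the first parse as plain text.
SAMPLE_RSS = """<rss><channel>
<title>Sample feed</title>
<description>Feed summary.</description>
<item>
<title>Story one</title>
<description>&lt;p&gt;&lt;a href="http://example.com/1"&gt;&lt;img src="http://example.com/1.jpg" alt="Alt caption" title="Title caption" /&gt;&lt;/a&gt;Body text.&lt;/p&gt;</description>
</item>
</channel></rss>"""


def extract_items(rss_text):
    """Return one dict per description: its title, its text, and any image caption."""
    soup = BeautifulSoup(rss_text, "html.parser")
    results = []
    for desc in soup.find_all("description"):
        # Second parse: turn the description's text into real tags.
        inner = BeautifulSoup(desc.get_text(), "html.parser")
        img = inner.find("img")
        # Prefer the title attribute, fall back to alt, "" when there is no image.
        caption = (img.get("title") or img.get("alt", "")) if img else ""
        # Walk backwards from the description to the nearest preceding title.
        title_tag = desc.find_previous("title")
        results.append({
            "title": title_tag.get_text() if title_tag else "",
            "text": inner.get_text(),
            "caption": caption,
        })
    return results


if __name__ == "__main__":
    for entry in extract_items(SAMPLE_RSS):
        print(entry)
```

Keeping the text and the caption in separate fields, rather than concatenating them as in the snippet above, makes it easier to tell which part came from the image.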

Related

Python BeautifulSoup: extract all URLs from a website's search results (new Python beginner)

I am attempting to extract all the URLs from the search results of this website. It has 754 search results across 26 pages. https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/search/searchterm/Integrated%20Data%20Infrastructure%20(IDI)/field/projeb/mode/exact/conn/and
This is the code I wrote, but it didn't return anything. Sorry, I am new to Python. Can anyone give me a clue about how to get there? Many thanks.
import requests
from bs4 import BeautifulSoup
url = 'https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/search/searchterm/Integrated%20Data%20Infrastructure%20(IDI)/field/projeb/mode/exact/conn/and'
reqs = requests.get(url,verify=False)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
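There are 754 books; this example fetches 35. The code below requests the site's search API, which returns JSON, instead of parsing the HTML page.
To get all the books, change the 35 at the end of the URL to 754.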
import pandas as pd
import requests

url = 'https://cdm20045.contentdm.oclc.org/digital/api/search/collection/p20045coll17/searchterm/Integrated%20Data%20Infrastructure%20(IDI)/field/projeb/mode/exact/conn/and/maxRecords/35'
response = requests.get(url)

books = []
for book in response.json()['items']:
    books.append({
        'link': ('https://cdm20045.contentdm.oclc.org' + book['itemLink']).replace('singleitem', 'digital'),
        'title': book['metadataFields'][0]['value'],
        'subjec': book['metadataFields'][1]['value'],
        'date': book['metadataFields'][2]['value'],
        'publis': book['metadataFields'][3]['value']
    })

df = pd.DataFrame(books)
print(df.to_string())
OUTPUT:
link title subjec date publis
0 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1181 The Future of Work in New Zealand - An Empirical Examiniation (MAA2019-95) Business Practices; 2021 Auckland University of Technology;
1 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/993 In Fairness to Our Schools: Better measures for better outcomes\n Education 2019 The New Zealand Initiative
2 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1029 Dementia case-finding and prevalence estimation using routinely collected health data in the Integrated Data Infrastructure (IDI) (MAA2020-12) Health; 2020 University of Auckland
3 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1218 Investigating linkage bias in the IDI using education and census data (MAA2020-69) Meta-research; Education; 2020-12 University of Auckland;
4 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/757 Explaining Ethnic Differences in Student Success at University in New Zealand [MAA2018-09] Education and training 2018 Auckland University of Technology
5 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1357 Health and social harms from alcohol: what does NZ's data tell us? Health; 2022-03 University of Otago;
6 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1268 'What about the Menz?' Low employer attachment and ineligibility for partner parental leave Income and Work; People and Communities; 2021-08 Auckland Council; Social Wellbeing Agency;
7 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/936 Migrant Networks, Brain Waste and its Economic Impacts: Evidence from Immigrants in New Zealand [MAA2019-31] Employment; People and Communities 2019 University of Auckland
8 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1306 Pacific uptake of temporary work visas Employment; Business financials; People and Communities; 2020-05 NZIER; Ministry of Business, Innovation & Employment, MBIE; Ministry of Foreign Affairs and Trade;
9 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/123 Firm Productivity Growth and Skill (MAA2012-16) Income and work; Business practices 2015 Motu Economic and Public Policy Research
10 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/380 The Productivity Costs of Four Health Conditions in New Zealand [MAA2016-59] Health; Income and work 2016 University of Otago
11 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/732 Differential labour supply effects of pension eligibility to beneficiaries and non-beneficiaries [MAA2018-50] Income and Work; Benefits and Social Services; 2018 Auckland University of Technology;
12 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/189 Intergenerational Analyses Using the IDI People and communities 2017 COMPASS (The University of Auckland)
13 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/198 School to work: What matters? Education and Employment of Young People Born in 1991 Education and training 2016 Ministry of Education
14 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/975 Equity Index of socioeconomic disadvantage in education [MAA2019-85] Education; Benefits and Social Services; Income and Work 2019-12 Ministry of Education
15 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1225 Linkage bias in the IDI (MAA2020-58) Meta-research; 2020-11 University of Otago;
16 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/431 The relationship between exposure to the natural environment and children's health at different life stages [MAA2017-11] Health 2017 Massey University
17 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1025 Who are the 1M and 1X? Police engagement with citizens in mental distress (MAA2020-08) Justice; Health; 2020 University of Auckland
18 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1073 Who received the Wage Subsidy and Wage Subsidy Extension? (MAA2018-48) Benefits and Social Services; 2020 Ministry of Social Development, MSD;
19 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/929 Factsheet IDI first stage - Exploring work-related claims difference by Maori and non-Maori Business Practices; Health; Income and Work; People and Communities 2019-02 Worksafe New Zealand
20 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1138 Measuring Commute Patterns Over Time: Using administrative data to identify where employees live and work (MAA2018-55) Transport; Employment; 2020 Motu Economic and Public Policy Research; New Zealand Transport Agency;
21 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/835 Evaluating the Family Start Programme [MAA2018-87] People and Communities; Health; Education; Employment; Benefits and Social Services; Justice 2018-12 Ministry for Vulnerable Children Oranga Tamariki
22 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/980 Access to primary health care services for people in Canterbury with poor access: improving our understanding of people who are unenrolled or tenuously enrolled with a general practice team [MAA2019-51] Health 2019-11 Pegasus Health (Charitable) Limited
23 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/996 Intergenerational Analyses Using the IDI:\nAn update\n [MAA2016-53] Population 2020-03 COMPASS, University of Auckland;
24 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1351 Individualised Funding in Aotearoa Benefits and Social Services; 2020 Nicholson Consulting;
25 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/39 Comparing the Household Economic Survey to administrative records: an analysis of income and benefit receipt (MAA2015-27) Benefits and Social Services 2017 New Zealand Treasury
26 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1320 International students and graduates and their impact on the NZ housing market (MAA2017-31) Housing; Education and Training; 2021 Universities New Zealand;
27 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1002 The expression, experience and transcendence of low-skill in Aotearoa New Zealand (MAA2019-91) Education and Training; 2019 Auckland University of Technology
28 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/394 Comparison of NZDep2013 with Index of Multiple Deprivation (IMD2013) [MAA2017-70] Health; Income and Work; Housing; Justice 2017 University of Otago
29 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1005 Accessibility of Disability Support Services funding in NZ (MAA2019-102) Health; 2019 Nicholson Consulting;
30 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1135 Understanding Parental education and health of Pacific families: Background and study protocol: Parental Education and Pacific Health, study protocol (MAA2018-47) Education; People and Communities; Children; 2020 University of Otago;
31 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/47 Evaluation of the Impact of the Youth Service: Youth Payment and Young Parent Payment (MAA2013-16) Benefits and Social Services 2017 New Zealand Treasury
32 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/63 Using IDI data to estimate fiscal impacts of better social sector performance (MAA2013-16) Benefits and Social Services 2016 New Zealand Treasury
33 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1040 Māori Student Transitions (MAA2020-35) Education and Training; Benefits and Social Services; Income and Work; Employment; 2020 Social Wellbeing Agency;
34 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/64 Financial wellbeing of older workers following injury: research utilising Statistics New Zealand’s Integrated Data Infrastructure Health; Income and work 2017 University of Otago
@bmcculley has already stated that the required data is loaded dynamically via JS, and bs4 can't render JS. So you have two options: use Selenium (more complex) or use the API. Since the site itself uses an API, you can easily grab the required data from it with a GET request in JSON format, which is the more robust way.
I've handled the pagination using a for loop and the range function.
Example:
import requests
import pandas as pd
api_url = 'https://cdm20045.contentdm.oclc.org/digital/api/search/collection/p20045coll17/searchterm/Integrated%20Data%20Infrastructure%20(IDI)/field/projeb/mode/exact/conn/and/page/{page}/maxRecords/30'
data = []
# Request pages 1 to 26, with up to 30 records per page
for page in range(1, 27):
    r = requests.get(api_url.format(page=page))
    for link in r.json()['items']:
        # Turn the API's item link into the public page URL
        url = 'https://cdm20045.contentdm.oclc.org' + link['itemLink'].replace('singleitem', 'digital')
        data.append({
            'URL': url
        })
df = pd.DataFrame(data)
print(df)
Output:
URL
0 https://cdm20045.contentdm.oclc.org/digital/co...
1 https://cdm20045.contentdm.oclc.org/digital/co...
2 https://cdm20045.contentdm.oclc.org/digital/co...
3 https://cdm20045.contentdm.oclc.org/digital/co...
4 https://cdm20045.contentdm.oclc.org/digital/co...
.. ...
749 https://cdm20045.contentdm.oclc.org/digital/co...
750 https://cdm20045.contentdm.oclc.org/digital/co...
751 https://cdm20045.contentdm.oclc.org/digital/co...
752 https://cdm20045.contentdm.oclc.org/digital/co...
753 https://cdm20045.contentdm.oclc.org/digital/co...
[754 rows x 1 columns]

Python/regex - Bypass table of contents when extracting text

I have a dataframe consisting of the following:
identifier
text
34678
0000950123-04-010521.txt : 20040901.....
87902
0000950123-04-010521.txt : 20040901.....
I am trying to extract a portion of text from the "text" variable in Python that follows a line starting with "Item 5.02". I am placing the extracted text in a new variable called "important_text". With the help of fellow Stack Overflow users, I was able to construct the following code to extract the text:
pattern = (r'\bItem\s+5\.02\s*([\w\W]*?)(?=\s*(?:Item\s+[89]\.01|Item\s+5\.03|Item\s+5\.07|Item\s+7\.01|SIGNATURES|SIGNATURE|'
           + r'Pursuant to the requirements of the Securities Exchange Act of 1934)\b)'.replace(' ', r'\s*'))
pd_00['important_text'] = pd_00['text'].str.extract(pattern, flags=re.IGNORECASE, expand=False)
So, this is extracting all text between the first occurrence of "Item 5.02" and the first occurrence of various terms (i.e., "Item 8.01", "Item 9.01", "SIGNATURES", etc.).
In general, this does a really good job of extracting the portion of text I am looking for. However, in some instances, the text variable contains a Table of Contents that will have a line starting with "Item 5.02". In these instances, the regex code does not grab the portion of text I need. Does anyone have any advice for how to bypass the Table of Contents?
Here is an example that includes a Table of Contents (apologies for the large amount of text...I thought it would be best to give a full example):
<SEC-DOCUMENT>0000950137-05-007782.txt : 20050623
<SEC-HEADER>0000950137-05-007782.hdr.sgml : 20050623
<ACCEPTANCE-DATETIME>20050623154401
ACCESSION NUMBER: 0000950137-05-007782
CONFORMED SUBMISSION TYPE: 8-K/A
PUBLIC DOCUMENT COUNT: 3
CONFORMED PERIOD OF REPORT: 20050511
ITEM INFORMATION: Entry into a Material Definitive Agreement
ITEM INFORMATION: Departure of Directors or Principal Officers; Election of
Directors; Appointment of Principal Officers
ITEM INFORMATION: Financial Statements and Exhibits
FILED AS OF DATE: 20050623
DATE AS OF CHANGE: 20050623
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: HILLENBRAND INDUSTRIES INC
CENTRAL INDEX KEY: 0000047518
STANDARD INDUSTRIAL CLASSIFICATION: MISCELLANEOUS FURNITURE & FIXTURES [2590]
IRS NUMBER: 351160484
STATE OF INCORPORATION: IN
FISCAL YEAR END: 0930
FILING VALUES:
FORM TYPE: 8-K/A
SEC ACT: 1934 Act
SEC FILE NUMBER: 001-06651
FILM NUMBER: 05912533
BUSINESS ADDRESS:
STREET 1: 700 STATE ROUTE 46 E
CITY: BATESVILLE
STATE: IN
ZIP: 47006-8835
BUSINESS PHONE: 8129347000
</SEC-HEADER>
<DOCUMENT>
<TYPE>8-K/A
<SEQUENCE>1
<FILENAME>c96192ae8vkza.htm
<DESCRIPTION>AMENDMENT TO CURRENT REPORT
<TEXT>
e8vkza
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 8-K/A
CURRENT REPORT
Pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934
Date of Report (Date of earliest event reported): May 11, 2005
HILLENBRAND INDUSTRIES, INC.
(Exact name of registrant as specified in its charter)
Indiana (State or other jurisdiction of incorporation) 1-6651 (Commission File Number) 35-1160484 (IRS Employer Identification No.)
700 State Route 46 East Batesville, Indiana (Address of principal executive offices) 47006-8835 (Zip Code)
Registrants telephone number, including area code: (812) 934-7000
Not Applicable
(Former name or former address, if changed since last report.)
Check the appropriate box below if the Form 8-K filing is intended to simultaneously satisfy
the filing obligation of the registrant under any of the following provisions:
o Written communications pursuant to Rule 425 under the Securities Act (17 CFR 230.425)
o Soliciting material pursuant to Rule 14a-12 under the Exchange Act (17 CFR 240.14a-12)
o Pre-commencement communications pursuant to Rule 14d-2(b) under the Exchange Act
(17 CFR 240.14d-2(b))
o Pre-commencement communications pursuant to Rule 13e-4(c) under the Exchange
Act
(17 CFR 240.13e-4(c))
1
TABLE OF CONTENTS
Item 1.01 ENTRY INTO A MATERIAL DEFINITIVE AGREEMENT
Item 5.02. DEPARTURE OF DIRECTORS OR PRINCIPAL OFFICERS; ELECTION OF DIRECTORS;
APPOINTMENT OF PRINCIPAL OFFICERS
Item 9.01. FINANCIAL STATEMENTS AND EXHIBITS
SIGNATURES
EXHIBIT INDEX
Employment Agreement
Stock Award
Table of Contents
Item 1.01 ENTRY INTO A MATERIAL DEFINITIVE AGREEMENT.
Item 5.02. DEPARTURE OF DIRECTORS OR PRINCIPAL OFFICERS; ELECTION OF DIRECTORS;
APPOINTMENT OF PRINCIPAL OFFICERS.
As previously disclosed, on May 11, 2005 Hillenbrand Industries, Inc.s Board of
Directors
appointed Rolf A. Classon to serve as President and Chief Executive Officer of
Hillenbrand on an
interim basis. At the time the Form 8-K announcing this appointment was filed,
Mr. Classon did not
have an employment agreement with Hillenbrand.
As in the above example, the Table of Contents will usually start with "Table of Contents" and end with "Table of Contents". To further complicate things, the phrase "Table of Contents" sometimes also appears on its own towards the beginning of the text (this is also shown in the above example).
Here is what I would like to extract:
DEPARTURE OF DIRECTORS OR PRINCIPAL OFFICERS; ELECTION OF DIRECTORS;
APPOINTMENT OF PRINCIPAL OFFICERS.
As previously disclosed, on May 11, 2005 Hillenbrand Industries, Inc.s Board of
Directors
appointed Rolf A. Classon to serve as President and Chief Executive Officer of
Hillenbrand on an
interim basis. At the time the Form 8-K announcing this appointment was filed,
Mr. Classon did not
have an employment agreement with Hillenbrand.
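One possible way to skip the Table of Contents, offered as a rough sketch and not a tested solution for every filing: collect every block that follows an "Item 5.02" heading and keep the longest one. This assumes that a Table of Contents entry contains only the heading text, while the real section also contains body text. The names `ITEM_START`, `SECTION_END` and `extract_item_502` are hypothetical, and the end marker is simplified to "any line starting with an item number or SIGNATURE(S)" instead of the exact list of terms in the pattern above.

```python
import re

# Heading that opens the section of interest, e.g. "Item 5.02." at a line start.
ITEM_START = re.compile(r'^\s*Item\s+5\.02\.?\s*', re.IGNORECASE | re.MULTILINE)
# Assumption: any later line starting a new item or the signature block ends it.
SECTION_END = re.compile(r'^\s*(?:Item\s+\d+\.\d+|SIGNATURES?)\b',
                         re.IGNORECASE | re.MULTILINE)


def extract_item_502(text):
    """Return the longest block that follows an 'Item 5.02' heading, or None."""
    best = None
    for start in ITEM_START.finditer(text):
        end = SECTION_END.search(text, start.end())
        stop = end.start() if end else len(text)
        block = text[start.end():stop].strip()
        # A Table of Contents entry is short, so the real section wins on length.
        if best is None or len(block) > len(best):
            best = block
    return best
```

Under the same assumptions, it could be applied to the dataframe with pd_00['important_text'] = pd_00['text'].apply(extract_item_502).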

Pyspark AnalysisException Py4JJavaError on transformation withColumn()

I am working with Pyspark, using the withColumn() command to do some basic transformation on the dataframe, namely, to update the value of a column. I am looking for some debugging assistance while I also study the problem.
Pyspark is issuing an AnalysisException & Py4JJavaError on the usage of the pyspark.withColumn command.
_c49='EVENT_NARRATIVE' is the data element inside the Spark df (dataframe) that withColumn('EVENT_NARRATIVE') refers to.
from pyspark.sql.functions import *
from pyspark.sql.types import *
df = df.withColumn('EVENT_NARRATIVE', lower(col('EVENT_NARRATIVE')))
Py4JJavaError: An error occurred while calling o100.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '`EVENT_NARRATIVE`' given input columns: [_c3, _c17, _c40, _c21, _c48, _c12, _c39, _c18, _c31, _c10, _c45, _c26, _c5, _c43, _c24, _c33, _c9, _c14, _c1, _c16, _c47, _c20, _c46, _c32, _c22, _c7, _c2, _c42, _c37, _c36, _c30, _c8, _c38, _c23, _c25, _c13, _c29, _c41, _c19, _c44, _c11, _c28, _c6, _c50, _c49, _c0, _c15, _c4, _c34, _c27, _c35];;
'Project [_c0#604, _c1#605, _c2#606, _c3#607, _c4#608, _c5#609, _c6#610, _c7#611, _c8#612, _c9#613, _c10#614, _c11#615, _c12#616, _c13#617, _c14#618, _c15#619, _c16#620, _c17#621, _c18#622, _c19#623, _c20#624, _c21#625, _c22#626, _c23#627, ... 28 more fields]
+- Relation[_c0#604,_c1#605,_c2#606,_c3#607,_c4#608,_c5#609,_c6#610,_c7#611,_c8#612,_c9#613,_c10#614,_c11#615,_c12#616,_c13#617,_c14#618,_c15#619,_c16#620,_c17#621,_c18#622,_c19#623,_c20#624,_c21#625,_c22#626,_c23#627,... 27 more fields] csv
1 row of sample data from df.head():
[Row(_c0='BEGIN_YEARMONTH', _c1='BEGIN_DAY', _c2='BEGIN_TIME', _c3='END_YEARMONTH', _c4='END_DAY', _c5='END_TIME', _c6='EPISODE_ID', _c7='EVENT_ID', _c8='STATE', _c9='STATE_FIPS', _c10='YEAR', _c11='MONTH_NAME', _c12='EVENT_TYPE', _c13='CZ_TYPE', _c14='CZ_FIPS', _c15='CZ_NAME', _c16='WFO', _c17='BEGIN_DATE_TIME', _c18='CZ_TIMEZONE', _c19='END_DATE_TIME', _c20='INJURIES_DIRECT', _c21='INJURIES_INDIRECT', _c22='DEATHS_DIRECT', _c23='DEATHS_INDIRECT', _c24='DAMAGE_PROPERTY', _c25='DAMAGE_CROPS', _c26='SOURCE', _c27='MAGNITUDE', _c28='MAGNITUDE_TYPE', _c29='FLOOD_CAUSE', _c30='CATEGORY', _c31='TOR_F_SCALE', _c32='TOR_LENGTH', _c33='TOR_WIDTH', _c34='TOR_OTHER_WFO', _c35='TOR_OTHER_CZ_STATE', _c36='TOR_OTHER_CZ_FIPS', _c37='TOR_OTHER_CZ_NAME', _c38='BEGIN_RANGE', _c39='BEGIN_AZIMUTH', _c40='BEGIN_LOCATION', _c41='END_RANGE', _c42='END_AZIMUTH', _c43='END_LOCATION', _c44='BEGIN_LAT', _c45='BEGIN_LON', _c46='END_LAT', _c47='END_LON', _c48='EPISODE_NARRATIVE', _c49='EVENT_NARRATIVE', _c50='DATA_SOURCE'),
Row(_c0='201210', _c1='29', _c2='1600', _c3='201210', _c4='29', _c5='1922', _c6='68680', _c7='416744', _c8='NEW HAMPSHIRE', _c9='33', _c10='2012', _c11='October', _c12='High Wind', _c13='Z', _c14='12', _c15='EASTERN HILLSBOROUGH', _c16='BOX', _c17='29-OCT-12 16:00:00', _c18='EST-5', _c19='29-OCT-12 19:22:00', _c20='0', _c21='0', _c22='0', _c23='0', _c24='109.60K', _c25='0.00K', _c26='ASOS', _c27='55.00', _c28='MG', _c29=None, _c30=None, _c31=None, _c32=None, _c33=None, _c34=None, _c35=None, _c36=None, _c37=None, _c38=None, _c39=None, _c40=None, _c41=None, _c42=None, _c43=None, _c44=None, _c45=None, _c46=None, _c47=None, _c48='Sandy, a hybrid storm with both tropical and extra-tropical characteristics, brought high winds and coastal flooding to southern New England. Easterly winds gusted to 50 to 60 mph for interior southern New England; 55 to 65 mph along the eastern Massachusetts coast and along the I-95 corridor in southeast Massachusetts and Rhode Island; and 70 to 80 mph along the southeast Massachusetts and Rhode Island coasts. A few higher higher gusts occurred along the Rhode Island coast. A severe thunderstorm embedded in an outer band associated with Sandy produced wind gusts to 90 mph and concentrated damage in Wareham early Tuesday evening, |a day after the center of Sandy had moved into New Jersey. In general, moderate coastal flooding occurred along the Massachusetts coastline, and major coastal flooding impacted the Rhode Island coastline. The storm surge was generally 2.5 to 4.5 feet along the east coast of Massachusetts, but peaked late Monday afternoon in between high tide cycles. Seas built to between 20 and 25 feet Monday afternoon and evening just off the Massachusetts east coast. Along the south coast, the storm surge was 4 to 6 feet and seas from 30 to a little over 35 feet were observed in the outer coastal waters. 
The very large waves on top of the storm surge caused destructive coastal flooding along stretches of the Rhode Island exposed south coast. ||Sandy grew into a hurricane over the southwest Caribbean and then headed north across Jamaica, Cuba, and the Bahamas. As Sandy headed north of the Bahamas, the storm interacted with a vigorous weather system moving west to east across the United States and began to take on a hybrid structure. Strong high pressure over southeast Canada helped with the expansion of the strong winds well north of the center of Sandy. In essence, Sandy retained the structure of a hurricane near its center (until shortly before landfall) while taking on more of an extra-tropical cyclone configuration well away from the center. Sandy’s track was unusual. The storm headed northeast and then north across the western Atlantic and then sharply turned to the west to make landfall near Atlantic City, NJ during Monday evening. Sandy subsequently weakened and moved west across southern Pennsylvania on Tuesday before turning north and heading across western New York state into Quebec during Tuesday night and Wednesday.', _c49='The Automated Surface Observing System at Manchester-Boston Regional Airport (KMHT) recorded sustained wind speeds of 38 mph and gusts to 63 mph. In Manchester, a tree was downed on Harrison Street. In Hudson, a tree was downed on Lawrence Road, bringing down wires that sparked a fire that damaged a house. In Merrimack, a tree was downed, taking down wires and closing Amherst Road from Meetinghouse Road to Riverside Drive. In Nashua, a tree was downed onto a house on Broad Street, near the Hollils line. No structural damage was found. Numerous trees were downed, blocking roads.', _c50='CSV')
The column names are in the form of _c followed by a number, presumably because you did not specify header=True while reading the input file. You can do
df = spark.read.csv('filepath', header=True)
So that the column names will be BEGIN_YEARMONTH, BEGIN_DAY, ... etc, instead of _c0, _c1, ..., and then your withColumn code should work.
You can also consider adding inferSchema=True to ensure that the data types are suitable.
Of course, you can also stick with your current code, and do
df2 = df.withColumn('_c49', lower(col('_c49')))
But that's not a good long-term solution. Column names should be sensible, and you also don't want the header to be one of the rows in your dataframe.

Pandas : Create a new dataframe from 2 different dataframes using fuzzy matching [duplicate]

I have two data frames, each with a different number of rows. Below are a couple of rows from each data set.
df1 =
Company City State ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102
LACKEY SHEET METAL St. Louis MO 63102
and
df2 =
FDA Company FDA City FDA State FDA ZIP
LACKEY SHEET METAL St. Louis MO 63102
PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
HELGET GAS PRODUCTS INC Omaha NE 68127
ORTHOQUEST LLC La Vista NE 68128
I joined them side by side using combined_data = pandas.concat([df1, df2], axis = 1). My next goal is to compare each string in df1['Company'] to each string in df2['FDA Company'] using several different matching commands from the fuzzywuzzy module and return the value of the best match and its name. I want to store that in a new column. For example, if I ran fuzz.ratio and fuzz.token_sort_ratio on LACKEY SHEET METAL in df1['Company'] against df2['FDA Company'], it would return that the best match was LACKEY SHEET METAL with a score of 100, and this would then be saved in a new column in combined_data. The result would look like
combined_data =
Company City State ZIP FDA Company FDA City FDA State FDA ZIP fuzzy.token_sort_ratio match fuzzy.ratio match
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101 LACKEY SHEET METAL St. Louis MO 63102 LACKEY SHEET METAL 100 LACKEY SHEET METAL 100
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102 PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102 HELGET GAS PRODUCTS INC Omaha NE 68127
LACKEY SHEET METAL St. Louis MO 63102 ORTHOQUEST LLC La Vista NE 68128
I tried doing
combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1)
But got an error because the lengths of the columns are different.
I am stumped. How can I accomplish this?
I couldn't tell what you were doing. This is how I would do it.
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Create a series of tuples to compare:
compare = pd.MultiIndex.from_product([df1['Company'],
                                      df2['FDA Company']]).to_series()
Create a special function to calculate fuzzy metrics and return a series:
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
Apply metrics to the compare series
compare.apply(metrics)
There are a bunch of ways to do this next part:
Get closest matches to each row of df1
compare.apply(metrics).unstack().idxmax().unstack(0)
Get closest matches to each row of df2
compare.apply(metrics).unstack(0).idxmax().unstack(0)
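For readers who want to see the "best match per company" idea in isolation, here is a minimal sketch that uses the standard library's difflib as a stand-in for fuzzywuzzy, so the scores are not guaranteed to equal fuzz.ratio exactly. The helper name best_match is hypothetical, and the company names are copied from the question.

```python
from difflib import SequenceMatcher


def best_match(name, choices):
    """Return (closest choice, similarity score on a 0-100 scale)."""
    scored = [(choice, round(100 * SequenceMatcher(None, name, choice).ratio()))
              for choice in choices]
    return max(scored, key=lambda pair: pair[1])


# Company names taken from the question's df1 and df2.
companies = ['FREDDIE LEES AMERICAN GOURMET SAUCE', 'LACKEY SHEET METAL']
fda_companies = ['LACKEY SHEET METAL', 'PRIMUS STERILIZER COMPANY LLC',
                 'HELGET GAS PRODUCTS INC', 'ORTHOQUEST LLC']

# Map each company in df1 to its closest FDA company and the score.
matches = {name: best_match(name, fda_companies) for name in companies}
```

With fuzzywuzzy installed, process.extractOne(name, fda_companies, scorer=fuzz.token_sort_ratio) should play the same role as best_match and also returns the match together with its score.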

How to convert text article with keywords to pandas data frame

I have about 5,000 text files similar to the one below, and I want to extract the article text into one df column and the keywords, as a list, into another df column. I need this to have more training data.
In the sample below, the article I want to extract is everything from 'Addis Abeba' to 'private bank', and the keywords are all the keywords after 'SUBJECT', without the percentages in brackets.
Sample of the dataset:
Addis Fortune
February 2011
Declaration? AU Action Needed in Favour of Democracy [opinion]
LENGTH: 692 words
Addis Abeba has been hosting delegates and heads of state for the AU Summit. It
is encouraging to see leaders of Africa discussing issues of continental
importance that accelerate the process of integration and thereby put Africa in
a better bargaining position in its relations with the outside world.
Indeed, "United We Stand, Divided We Fall."
It is time that the AU took a bold step to ensure that the leaders of the
continent win the hearts and minds of their citizens. It should ensure the
existence of democratic governments, which, at a minimum, guarantee popular
participation based on an acceptance of political equality among all citizens,
respect for civil liberties, and meaningful checks and balances on the power of
the executive.
This is also indispensable to the realisation of the age-old dream of the
formation of the United States of Africa. Donor countries and organisations also
have moral obligations to extend much needed support in this aspect.
Dawit Haile is a loan officer at a private bank.
SUBJECT: HEADS OF STATE & GOVERNMENT (90%); ELECTIONS (90%); INTERNATIONAL
ASSISTANCE (89%); INTERNATIONAL RELATIONS (73%); GROSS DOMESTIC PRODUCT (70%);
ECONOMIC NEWS (70%); EMBEZZLEMENT (68%); ELECTION FRAUD (68%) Ethiopia;
International Organizations and Africa
GEOGRAPHIC: AFRICA (96%); EGYPT (93%); UNITED STATES (93%); CHINA (92%);
ETHIOPIA (79%); TUNISIA (79%); ISRAEL (79%) Africa
LOAD-DATE: February 8, 2011
LANGUAGE: ENGLISH
PUBLICATION-TYPE: Newspaper
Copyright 2011 AllAfrica Global Media.
All Rights Reserved
2 of 1352 DOCUMENTS
Addis Fortune
February 2011
Gebrekidan Beyene's Prosecutors Repeat Request for 25 Years
BYLINE: Eden Sahle
LENGTH: 815 words
During the appeals hearing last week of Gebrekidan Beyene, a.k.a. Morocco,
general manager and a shareholder of a private limited company by the same name,
prosecutors of the Ethiopian Revenues and Customs Authority (ERCA) requested
almost the same sentence they originally had, in August 2010: a maximum jail
term and confiscation of properties.
However, the lower court's decision to mitigate the sentence was correct and the
Appeals Bench should release Gebrekidan, either as a free man or on parole, the
defence argued. His good behaviour in prison and the investment he had made in
his country should be counted as mitigating circumstances, the lawyer claimed,
also counting the defendant's poor health in mitigation. The case was adjourned
for a verdict until May 2, 2011.
An alleged similar offence involving money laundering and loan sharking against
Ayalew Tesema, board chairman and major shareholder of Ayat Real Estate, is
underway at the Federal High Court.
SUBJECT: LITIGATION (91%); JUSTICE DEPARTMENTS (90%); BANKING & FINANCE (90%);
EXCISE & CUSTOMS (90%); LIMITED LIABILITY COMPANIES (90%); SENTENCING (90%);
APPEALS (89%); LAW COURTS & TRIBUNALS (89%); JAIL SENTENCING (89%); LAWYERS
(89%); VERDICTS (89%); SUPREME COURTS (89%); FINES & PENALTIES (89%);
SETTLEMENTS & DECISIONS (78%); CRIMINAL CONVICTIONS (78%); DECISIONS & RULINGS
(78%); PRISONS (77%); SUITS & CLAIMS (77%); VALUE ADDED TAX (77%); JUDGES (73%);
INCOME TAX (72%); MONEY LAUNDERING (69%); COUNTERFEITING (68%); INTEREST RATES
(55%); ECONOMIC NEWS (55%) Ethiopia; Legal and Judicial Affairs
GEOGRAPHIC: MOROCCO (90%)
LOAD-DATE: March 1, 2011
LANGUAGE: ENGLISH
PUBLICATION-TYPE: Newspaper
My expected result would be:
df
content keywords
1 'string article 1' [HEADS OF STATE & GOVERNMENT, ELECTIONS, ...]
2 'string article 2' [LITIGATION, JUSTICE DEPARTMENTS, ...]
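A rough sketch of one way to get there, assuming every article body sits between a "LENGTH: N words" line and the "SUBJECT:" field, and that the SUBJECT field ends at the next line that starts with an upper-case field name such as "GEOGRAPHIC:". Only terms followed by a percentage are kept, so trailing entries like "Ethiopia" are dropped; that is an assumption about what counts as a keyword. The names ARTICLE, KEYWORD and parse_articles are hypothetical.

```python
import re

# Article body: text after "LENGTH: N words" up to "SUBJECT:".
# Subject field: text after "SUBJECT:" up to the next "FIELD-NAME:" line.
ARTICLE = re.compile(
    r'LENGTH:\s*\d+\s+words\s*(?P<content>.*?)\s*'
    r'SUBJECT:\s*(?P<subject>.*?)(?=\n[A-Z-]+:|\Z)',
    re.DOTALL)
# A keyword is the text directly in front of a "(NN%)" marker.
KEYWORD = re.compile(r'([^;()]+?)\s*\(\d+%\)')


def parse_articles(raw):
    """Return one dict per article with its content and keyword list."""
    rows = []
    for match in ARTICLE.finditer(raw):
        # Collapse the hard line wraps into single spaces.
        content = ' '.join(match.group('content').split())
        subject = ' '.join(match.group('subject').split())
        keywords = [k.strip() for k in KEYWORD.findall(subject)]
        rows.append({'content': content, 'keywords': keywords})
    return rows
```

The resulting list of dicts can be passed to pd.DataFrame(rows) to get the content and keywords columns. An article without a SUBJECT field would run into the next article, so such files would need separate handling.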
