Sum values from specific column in DataFrame in duplicate rows - python

I have a DataFrame of books from which I removed and reworked some information. However, some rows in the column "bookISBN" have duplicate values, and I want to merge all those rows into one.
I plan to make a new DataFrame where I keep the first values for the url, the ISBN, the title and the genre, but I want to sum the values of the column "genreVotes" in order to create the merge. How can I do this?
Original DataFrame:
In [23]: network = data[["bookTitle", "bookISBN", "highestVotedGenre", "genreVotes"]]
network.head().to_dict("list")
Out [23]:
{'bookTitle': ['The Hunger Games',
'Twilight',
'The Book Thief',
'Animal Farm',
'The Chronicles of Narnia'],
'bookISBN': ['9780439023481',
'9780316015844',
'9780375831003',
'9780452284241',
'9780066238500'],
'highestVotedGenre': ['Young Adult',
'Young Adult',
'Historical-Historical Fiction',
'Classics',
'Fantasy'],
'genreVotes': [103407, 80856, 59070, 73590, 26376]}
Duplicates:
In [24]: duplicates = network[network.duplicated(subset=["bookISBN"], keep=False)]
duplicates.loc[(duplicates["bookISBN"] == "9780439023481") | (duplicates["bookISBN"] == "9780375831003")]
Out [24]:
{'bookTitle': ['The Hunger Games',
'The Book Thief',
'The Hunger Games',
'The Book Thief',
'The Book Thief'],
'bookISBN': ['9780439023481',
'9780375831003',
'9780439023481',
'9780375831003',
'9780375831003'],
'highestVotedGenre': ['Young Adult',
'Historical-Historical Fiction',
'Young Adult',
'Historical-Historical Fiction',
'Historical-Historical Fiction'],
'genreVotes': [103407, 59070, 103407, 59070, 59070]}
(In this example the votes happen to be the same, but in some cases the values differ.)
Expected output:
{'bookTitle': ['The Hunger Games',
'Twilight',
'The Book Thief',
'Animal Farm',
'The Chronicles of Narnia'],
'bookISBN': ['9780439023481',
'9780316015844',
'9780375831003',
'9780452284241',
'9780066238500'],
'highestVotedGenre': ['Young Adult',
'Young Adult',
'Historical-Historical Fiction',
'Classics',
'Fantasy'],
'genreVotes': [260814, 80856, 177210, 73590, 26376]}
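One way to get from the original frame to that output (a minimal sketch using groupby/agg on the columns shown above; sort=False preserves the original row order):
merged = network.groupby("bookISBN", as_index=False, sort=False).agg(
    {"bookTitle": "first", "highestVotedGenre": "first", "genreVotes": "sum"}
)
merged = merged[["bookTitle", "bookISBN", "highestVotedGenre", "genreVotes"]]  # restore the column order
Passing a column-to-function dict to agg is what lets "first" and "sum" apply to different columns in a single pass, which matches the expected output.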

Related

how to stop letter repeating itself python

I am making a program which takes a jumbled word and returns the unjumbled word. data.json contains a list of words; I take each word one by one and check whether it contains all the characters of the input, then check whether the lengths are the same. The problem is that when I enter a word such as helol, the l is checked twice, giving me other outputs besides the intended one (hello). I know why this happens, but I can't work out a fix for it.
import json
val = open("data.json")
val1 = json.load(val)  # loads the list
a = input("Enter a Jumbled word ")  # takes a word from the user
a = list(a)  # changes into a list to iterate
for x in val1:  # iterates over words from the list
    for somethin in a:  # iterates over letters from the list
        if somethin in list(x):  # checks if the letter is in the iterated word
            continue
        else:
            break
    else:  # the loop ended without a break (the word has the same letters)
        if len(a) != len(list(x)):  # checks if it has the same number of letters
            continue
        else:
            print(x)  # continues the loop to see if there are more like that
EDIT: many people wanted the JSON file, so here it is:
['Torres Strait Creole', 'good bye', 'agon', "queen's guard", 'animosity', 'price list', 'subjective', 'means', 'severe', 'knockout', 'life-threatening', 'entry into the war', 'dominion', 'damnify', 'packsaddle', 'hallucinate', 'lumpy', 'inception', 'Blankenese', 'cacophonous', 'zeptomole', 'floccinaucinihilipilificate', 'abashed', 'abacterial', 'ableism', 'invade', 'cohabitant', 'handicapped', 'obelus', 'triathlon', 'habitue', 'instigate', 'Gladstone Gander', 'Linked Data', 'seeded player', 'mozzarella', 'gymnast', 'gravitational force', 'Friedelehe', 'open up', 'bundt cake', 'riffraff', 'resourceful', 'wheedle', 'city center', 'gorgonzola', 'oaf', 'auf', 'oafs', 'galoot', 'imbecile', 'lout', 'moron', 'news leak', 'crate', 'aggregator', 'cheating', 'negative growth', 'zero growth', 'defer', 'ride back', 'drive back', 'start back', 'shy back', 'spring back', 'shrink back', 'shy away', 'abderian', 'unable', 'font manager', 'font management software', 'consortium', 'gown', 'inject', 'ISO 639', 'look up', 'cross-eyed', 'squinting', 'health club', 'fitness facility', 'steer', 'sunbathe', 'combatives', 'HTH', 'hope that helps', 'How The Hell', 'distributed', 'plum cake', 'liberalization', 'macchiato', 'caffè macchiato', 'beach volley', 'exult', 'jubilate', 'beach volleyball', 'be beached', 'affogato', 'gigabyte', 'terabyte', 'petabyte', 'undressed', 'decameter', 'sensual', 'boundary marker', 'poor man', 'cohabitee', 'night sleep', 'protruding ears', 'three quarters of an hour', 'spermophilus', 'spermophilus stricto sensu', "devil's advocate", 'sacred king', 'sacral king', 'myr', 'million years', 'obtuse-angled', 'inconsolable', 'neurotic', 'humiliating', 'mortifying', 'theological', 'rematch', 'varıety', 'be short', 'ontological', 'taxonomic', 'taxonomical', 'toxicology testing', 'on the job training', 'boulder', 'unattackable', 'inviolable', 'resinous', 'resiny', 'ionizing radiation', 'citrus grove', 'comic book shop', 'preparatory measure', 'written account', 'brittle', 'locker', 'baozi', 'bao', 'bau', 'humbow', 'nunu', 'bausak', 'pow', 'pau', 'yesteryear', 'fire drill', 'rotted', 'putto', 'overthrow', 'ankle monitor', 'somewhat stupid', 'a little stupid', 'semordnilap', 'pangram', 'emordnilap', 'person with a sunlamp tan', 'tittle', 'incompatible', 'autumn wind', 'dairyman', 'chesty', 'lacustrine', 'chronophotograph', 'chronophoto', 'leg lace', 'ankle lace', 'ankle lock', 'Babelfy', 'ventricular', 'recurrent', 'long-lasting', 'long-standing', 'long standing', 'sea bass', 'reap', 'break wind', 'chase away', 'spark', 'speckle', 'take back', 'Westphalian', 'Aeolic Greek', 'startup', 'abseiling', 'impure', 'bottle cork', 'paralympic', 'work out', 'might', 'ice-cream man', 'ice cream man', 'ice cream maker', 'ice-cream maker', 'traveling', 'special delivery', 'prizefighter', 'abs', 'ab', 'churro', 'pilfer', 'dehumanize', 'fertilize', 'inseminate', 'digitalize', 'fluke', 'stroke of luck', 'decontaminate', 'abandonware', 'manzanita', 'tule', 'jackrabbit', 'system administrator', 'system admin', 'springtime lethargy', 'Palatinean', 'organized religion', 'bearing puller', 'wheel puller', 'gear puller', 'shot', 'normalize', 'palindromic', 'lancet window', 'terminological', 'back of head', 'dragon food', 'barbel', 'Central American Spanish', 'basis', 'birthmark', 'blood vessel', 'ribes', 'dog-rose', 'dreadful', 'freckle', 'free of charge', 'weather verb', 'weather sentence', 'gipsy', 'gypsy', 'glutton', 'hump', 'low voice', 'meek', 'moist', 'river mouth', 'turbid', 'multitude', 'palate', 'peak of 
mountain', 'poetry', 'pure', 'scanty', 'spicy', 'spicey', 'spruce', 'surface', 'infected', 'copulate', 'dilute', 'dislocate', 'grow up', 'hew', 'hinder', 'infringe', 'inhabit', 'marry off', 'offend', 'pass by', 'brother of a man', 'brother of a woman', 'sister of a man', 'sister of a woman', 'agricultural farm', 'result in', 'rebel', 'strew', 'scatter', 'sway', 'tread', 'tremble', 'hog', 'circuit breaker', 'Southern Quechua', 'safety pin', 'baby pin', 'college student', 'university student', 'pinus sibirica', 'Siberian pine', 'have lunch', 'floppy', 'slack', 'sloppy', 'wishi-washi', 'turn around', 'bogeyman', 'selfish', 'Talossan', 'biomembrane', 'biological membrane', 'self-sufficiency', 'underevaluation', 'underestimation', 'opisthenar', 'prosody', 'Kumhar Bhag Paharia', 'psychoneurotic', 'psychoneurosis', 'levant', "couldn't-care-less attitude", 'noctambule', 'acid-free paper', 'decontaminant', 'woven', 'wheaten', 'waste-ridden', 'war-ridden', 'violence-ridden', 'unwritten', 'typewritten', 'spoken', 'abiogenetically', 'rasp', 'abstractly', 'cyclically', 'acyclically', 'acyclic', 'ad hoc', 'spare tire', 'spare wheel', 'spare tyre', 'prefabricated', 'ISO 9000', 'Barquisimeto', 'Maracay', 'Ciudad Guayana', 'San Cristobal', 'Barranquilla', 'Arequipa', 'Trujillo', 'Cusco', 'Callao', 'Cochabamba', 'Goiânia', 'Campinas', 'Fortaleza', 'Florianópolis', 'Rosario', 'Mendoza', 'Bariloche', 'temporality', 'papyrus sedge', 'paper reed', 'Indian matting plant', 'Nile grass', 'softly softly', 'abductive reasoning', 'abductive inference', 'retroduction', 'Salzburgian', 'cymotrichous', 'access point', 'wireless access point', 'dynamic DNS', 'IP address', 'electrolyte', 'helical', 'hydrometer', 'intranet', 'jumper', 'MAC address', 'Media Access Control address', 'nickel–cadmium battery', 'Ni-Cd battery', 'oscillograph', 'overload', 'photovoltaic', 'photovoltaic cell', 'refractor telescope', 'autosome', 'bacterial artificial chromosome', 'plasmid', 'nucleobase', 'base pair', 'base sequence', 'chromosomal deletion', 'deletion', 'deletion mutation', 'gene deletion', 'chromosomal inversion', 'comparative genomics', 'genomics', 'cytogenetics', 'DNA replication', 'DNA repair', 'DNA sequence', 'electrophoresis', 'functional genomics', 'retroviral', 'retroviral infection', 'acceptance criteria', 'batch processing', 'business rule', 'code review', 'configuration management', 'entity–relationship model', 'lifecycle', 'object code', 'prototyping', 'pseudocode', 'referential', 'reusability', 'self-join', 'timestamp', 'accredited', 'accredited translator', 'certify', 'certified translation', 'computer-aided design', 'computer-aided', 'computer-assisted', 'management system', 'computer-aided translation', 'computer-assisted translation', 'machine-aided translation', 'conference interpreter', 'freelance translator', 'literal translation', 'mother-tongue', 'whispered interpreting', 'simultaneous interpreting', 'simultaneous interpretation', 'base anhydride', 'binary compound', 'absorber', 'absorption coefficient', 'attenuation coefficient', 'active solar heater', 'ampacity', 'amorphous semiconductor', 'amorphous silicon', 'flowerpot', 'antireflection coating', 'antireflection', 'armored cable', 'electric arc', 'breakdown voltage','casing', 'facing', 'lining', 'assumption of Mary', 'auscultation']
Just an example; the actual list is full of items.
As I understand it, you are trying to identify all possible matches for the jumbled string in your list. You could sort the letters of the jumbled word and match the resulting list against sorted versions of the words in your data file.
sorted_jumbled_word = sorted(a)
for word in val1:
    if len(sorted_jumbled_word) == len(word) and sorted(word) == sorted_jumbled_word:
        print(word)
Checking by length first reduces unnecessary sorting. If doing this repeatedly, you might want to create a dictionary of the words in the data file with their sorted versions, to avoid having to repeatedly sort them.
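For repeated lookups, a minimal sketch of that idea (assuming val1 is the loaded word list; jumbled is an illustrative name for the raw input string):
from collections import defaultdict

# Precompute: map each word's sorted letters to the words that share them.
lookup = defaultdict(list)
for word in val1:
    lookup["".join(sorted(word))].append(word)

jumbled = input("Enter a Jumbled word ")
print(lookup.get("".join(sorted(jumbled)), []))  # every anagram match, or [] if none
Each query is then a single dictionary lookup instead of a sort-and-compare over the whole list.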
There are spaces and punctuation in some of the terms in your word list. If you want to make the comparison ignoring spaces then remove them from both the jumbled word and the list of unjumbled words, using e.g. word = word.replace(" ", "")

Python Generators and how to iterate over them correctly to drop records based on a key within the dictionary being present in a separate list

I'm new to the concept of generators and I'm struggling with how to apply my changes to the records within the generator object returned from the RISparser module.
I understand that a generator only reads a record at a time and doesn't actually store the data in memory but I'm having a tough time iterating over it effectively and applying my changes.
My changes involve dropping records whose 'doi' values are not contained in a list of DOIs (doi_match).
doi_match = ['10.1002/14651858.CD008259.pub2','10.1002/14651858.CD011552','10.1002/14651858.CD011990']
The generator object returned from RISparser contains the following information; these are just the first 2 of a few hundred records returned. I want to iterate over it and compare the 'doi' key of each record with the list of DOIs.
{'type_of_reference': 'JOUR', 'title': "The CoRe Outcomes in WomeN's health (CROWN) initiative: Journal editors invite researchers to develop core outcomes in women's health", 'secondary_title': 'Neurourology and Urodynamics', 'alternate_title1': 'Neurourol. Urodyn.', 'volume': '33', 'number': '8', 'start_page': '1176', 'end_page': '1177', 'year': '2014', 'doi': '10.1002/nau.22674', 'issn': '07332467 (ISSN)', 'authors': ['Khan, K.'], 'keywords': ['Bias (epidemiology)', 'Clinical trials', 'Consensus', 'Endpoint determination/standards', 'Evidence-based medicine', 'Guidelines', 'Research design/standards', 'Systematic reviews', 'Treatment outcome', 'consensus', 'editor', 'female', 'human', 'medical literature', 'Note', 'outcomes research', 'peer review', 'randomized controlled trial (topic)', 'systematic review (topic)', "women's health", 'outcome assessment', 'personnel', 'publication', 'Female', 'Humans', 'Outcome Assessment (Health Care)', 'Periodicals as Topic', 'Research Personnel', "Women's Health"], 'publisher': 'John Wiley and Sons Inc.', 'notes': ['Export Date: 14 July 2020', 'CODEN: NEURE'], 'type_of_work': 'Note', 'name_of_database': 'Scopus', 'custom2': '25270392', 'language': 'English', 'url': 'https://www.scopus.com/inward/record.uri?eid=2-s2.0-84908368202&doi=10.1002%2fnau.22674&partnerID=40&md5=b220702e005430b637ef9d80a94dadc4'}
{'type_of_reference': 'JOUR', 'title': "The CROWN initiative: Journal editors invite researchers to develop core outcomes in women's health", 'secondary_title': 'Gynecologic Oncology', 'alternate_title1': 'Gynecol. Oncol.', 'volume': '134', 'number': '3', 'start_page': '443', 'end_page': '444', 'year': '2014', 'doi': '10.1016/j.ygyno.2014.05.005', 'issn': '00908258 (ISSN)', 'authors': ['Karlan, B.Y.'], 'author_address': 'Gynecologic Oncology and Gynecologic Oncology Reports, India', 'keywords': ['clinical trial (topic)', 'decision making', 'Editorial', 'evidence based practice', 'female infertility', 'health care personnel', 'human', 'outcome assessment', 'outcomes research', 'peer review', 'practice guideline', 'premature labor', 'priority journal', 'publication', 'systematic review (topic)', "women's health", 'editorial', 'female', 'outcome assessment', 'personnel', 'publication', 'Female', 'Humans', 'Outcome Assessment (Health Care)', 'Periodicals as Topic', 'Research Personnel', "Women's Health"], 'publisher': 'Academic Press Inc.', 'notes': ['Export Date: 14 July 2020', 'CODEN: GYNOA', 'Correspondence Address: Karlan, B.Y.; Gynecologic Oncology and Gynecologic Oncology ReportsIndia'], 'type_of_work': 'Editorial', 'name_of_database': 'Scopus', 'custom2': '25199578', 'language': 'English', 'url': 'https://www.scopus.com/inward/record.uri?eid=2-s2.0-84908351159&doi=10.1016%2fj.ygyno.2014.05.005&partnerID=40&md5=ab5a4d26d52c12d081e38364b0c79678'}
I tried iterating over the generator and applying the changes, but the records that have matches are not being placed in the match list.
match = []
for entry in ris_records:
    if entry['doi'] in doi_match:
        match.append(entry)
    else:
        del entry
Any advice on how to iterate over a generator correctly? Thanks.
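One way to do this (a sketch, assuming ris_records is the generator returned by the parser): filter lazily with a generator expression and use .get('doi'), so records without a 'doi' key are skipped instead of raising a KeyError.
doi_set = set(doi_match)  # set membership tests are O(1)

# Nothing is read or stored until the generator expression is consumed.
matches = (entry for entry in ris_records if entry.get('doi') in doi_set)

match = list(matches)  # materialize only the records you want to keep
Note that del entry in the original loop only unbinds the local name; it removes nothing from the generator, which can only be consumed, not modified in place.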

How to check frequency of every unique value from pandas data-frame?

I have a data-frame of 2000 rows in which, let's say, brand has 142 unique values, and I want to count the frequency of every unique value, from 1 to 142. The values should change dynamically.
brand=clothes_z.brand_name
brand.describe(include="all")
unique_brand=brand.unique()
brand.describe(include="all"),unique_brand
Output:
(count 2613
unique 142
top Mango
freq 54
Name: brand_name, dtype: object,
array(['Jack & Jones', 'TOM TAILOR DENIM', 'YOURTURN', 'Tommy Jeans',
'Alessandro Zavetti', 'adidas Originals', 'Volcom', 'Pier One',
'Superdry', 'G-Star', 'SIKSILK', 'Tommy Hilfiger', 'Karl Kani',
'Alpha Industries', 'Farah', 'Nike Sportswear',
'Calvin Klein Jeans', 'Champion', 'Hollister Co.', 'PULL&BEAR',
'Nike Performance', 'Even&Odd', 'Stradivarius', 'Mango',
'Champion Reverse Weave', 'Massimo Dutti', 'Selected Femme Petite',
'NAF NAF', 'YAS', 'New Look', 'Missguided', 'Miss Selfridge',
'Topshop', 'Miss Selfridge Petite', 'Guess', 'Esprit Collection',
'Vero Moda', 'ONLY Petite', 'Selected Femme', 'ONLY', 'Dr.Denim',
'Bershka', 'Vero Moda Petite', 'PULL & BEAR', 'New Look Petite',
'JDY', 'Even & Odd', 'Vila', 'Lacoste', 'PS Paul Smith',
'Redefined Rebel', 'Selected Homme', 'BOSS', 'Brave Soul', 'Mind',
'Scotch & Soda', 'Only & Sons', 'The North Face',
'Polo Ralph Lauren', 'Gym King', 'Selected Woman', 'Rich & Royal',
'Rooms', 'Glamorous', 'Club L London', 'Zalando Essentials',
'edc by Esprit', 'OYSHO', 'Oasis', 'Gina Tricot',
'Glamorous Petite', 'Cortefiel', 'Missguided Petite',
'Missguided Tall', 'River Island', 'INDICODE JEANS',
'Kings Will Dream', 'Topman', 'Esprit', 'Diesel', 'Key Largo',
'Mennace', 'Lee', "Levi's®", 'adidas Performance', 'jordan',
'Jack & Jones PREMIUM', 'They', 'Springfield', 'Benetton', 'Fila',
'Replay', 'Original Penguin', 'Kronstadt', 'Vans', 'Jordan',
'Apart', 'New look', 'River island', 'Freequent', 'Mads Nørgaard',
'4th & Reckless', 'Morgan', 'Honey punch', 'Anna Field Petite',
'Noisy may', 'Pepe Jeans', 'Mavi', 'mint & berry', 'KIOMI', 'mbyM',
'Escada Sport', 'Lost Ink', 'More & More', 'Coffee', 'GANT',
'TWINTIP', 'MAMALICIOUS', 'Noisy May', 'Pieces', 'Rest',
'Anna Field', 'Pinko', 'Forever New', 'ICHI', 'Seafolly', 'Object',
'Freya', 'Wrangler', 'Cream', 'LTB', 'G-star', 'Dorothy Perkins',
'Carhartt WIP', 'Betty & Co', 'GAP', 'ONLY Tall', 'Next', 'HUGO',
'Violet by Mango', 'WEEKEND MaxMara', 'French Connection'],
dtype=object))
It is showing only the frequency of Mango (54) because that is the top frequency, but I want the frequency of every value: what is the frequency of Jack & Jones, TOM TAILOR DENIM, YOURTURN, and so on? And the values should change dynamically.
You could simply do,
clothes_z.brand_name.value_counts()
This lists the unique values and gives you the frequency of every element in that pandas Series.
Alternatively, using collections.Counter:
from collections import Counter
import pandas as pd

ll = [...your list of brands...]
c = Counter(ll)
# you can do whatever you want with your counted values
df = pd.DataFrame.from_dict(c, orient='index', columns=['counted'])
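If you want the counts back as a DataFrame rather than a Series, a small sketch building on the same clothes_z frame:
freq = (clothes_z['brand_name']
        .value_counts()
        .rename_axis('brand_name')
        .reset_index(name='frequency'))
print(freq.head())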

How to groupby/merge a data frame with various data types

I have a data frame that has different data types (list, dictionary, list of dictionary, strings, etc).
df = pd.DataFrame([{'category': [{'id': 1, 'name': 'House Targaryen'}],
'connection': ['Rhaena Targaryen', 'Aegon Targaryen'],
'description': 'Jon Snow, born Aegon Targaryen, is the son of Lyanna Stark '
'and Rhaegar Targaryen, the late Prince of Dragonstone',
'name': 'Jon Snow'},
{'category': [{'id': 2, 'name': 'House Stark'},
{'id': 3, 'name': 'Nights Watch'}],
'connection': ['Robb Stark', 'Sansa Stark', 'Arya Stark', 'Bran Stark'],
'description': 'After successfully capturing a wight and presenting it to '
'the Lannisters as proof that the Army of the Dead are real, '
'Jon pledges himself and his army to Daenerys Targaryen.',
'name': 'Jon Snow'}])
I want to merge these two rows by Jon Snow and combine all the other fields together so it looks like:
name category description connection
Jon Snow ['House Targaryen','House Stark','Nights Watch'] Jon Snow, born ...... his army to Daenerys Targaryen. ['Rhaena Targaryen',...,'Bran Stark']
It might be a little tricky with the list of dictionaries. Since this is a toy example it only contains two rows, so it's easy to explode it and combine the two rows of category together, but I don't think that's practical for my actual data set.
I also thought about using df.groupby('name').aggregate({'category': func1, 'description': func2, 'connection': func3}), but I'm not sure there's a built-in function for what I need.
Thank y'all for helping!
Looking at your data, it might be possible to first do a simple groupby and sum, then deal with the categories using a list comprehension:
import pandas as pd

df = pd.DataFrame([{'category': [{'id': 1, 'name': 'House Targaryen'}],
                    'name': 'Jon Snow',
                    'description': 'Jon Snow, born Aegon Targaryen, is the son of Lyanna Stark and Rhaegar Targaryen, the late Prince of Dragonstone',
                    'connection': ['Rhaena Targaryen', 'Aegon Targaryen']},
                   {'category': [{'id': 2, 'name': 'House Stark'}, {'id': 3, 'name': 'Nights Watch'}],
                    'name': 'Jon Snow',
                    'description': 'After successfully capturing a wight and presenting it to the Lannisters as proof that the Army of the Dead are real, Jon pledges himself and his army to Daenerys Targaryen.',
                    'connection': ['Robb Stark', 'Sansa Stark', 'Arya Stark', 'Bran Stark']},
                   {'category': [{'id': 4, 'name': 'Some house'}],
                    'name': 'Some name',
                    'description': 'some desc',
                    'connection': ['connection 1']}])

result = df.groupby("name").sum()
result["category"] = [[item.get("name") for item in i] for i in result["category"]]
result.reset_index(inplace=True)
print(result)
Output:
name category description connection
0 Jon Snow [House Targaryen, House Stark, Nights Watch] Jon Snow, born Aegon Targaryen, is the son of ... [Rhaena Targaryen, Aegon Targaryen, Robb Stark...
1 Some name [Some house] some desc [connection 1]
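If you would rather use the explicit groupby/agg route mentioned in the question, a sketch with hand-written functions (itertools.chain.from_iterable flattens the per-row lists, so this does not rely on sum() concatenating objects):
import itertools

result = df.groupby('name', as_index=False).agg(
    {'category': lambda s: [d['name'] for d in itertools.chain.from_iterable(s)],
     'description': ' '.join,
     'connection': lambda s: list(itertools.chain.from_iterable(s))}
)
print(result)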

Scraping data with a lack of classes / ids on elements

I'm trying to scrape data to build an object which looks like:
{
    "artist": "Oasis",
    "albums": {
        "Definitely Maybe": [
            "Rock n Roll Star",
            "Shakermaker",
            ...
        ],
        "(What's The Story) Morning Glory": [
            "Hello",
            "Roll With It",
            ...
        ],
        ...
    }
}
Here is how the HTML on the page looks (the markup is reproduced in the EDIT below). I'm currently scraping the data like so:
data = []
for div in soup.find_all("div", {"id": "listAlbum"}):
    links = div.findAll('a')
    for a in links:
        if a.text.strip() == "":  # '==' rather than 'is': identity checks on strings are unreliable
            pass
        elif a.text.strip():
            data.append(a.text.strip())
Likewise, grabbing the album names is also straightforward:
for div in soup.find_all("div", {"class": "album"}):
    titles = div.findAll('b')
    for t in titles:
        ...
My problem is how to use the above two loops to build an object like the one at the top. How can I ensure the songs from album X go into the correct album object? If each song had an album attribute, it would be clear to me; however, with the HTML structured the way it is, I'm at a bit of a loss.
EDIT: Find the HTML below;
<div id="listAlbum">
<a id="1368"></a>
<div class="album">album: <b>"Definitely Maybe"</b> (1994)</div>
Rock 'n' Roll Star<br>
Shakermaker<br>
Live Forever<br>
Up In The Sky<br>
Columbia<br>
Supersonic<br>
Bring It On Down<br>
Cigarettes & Alcohol<br>
Digsy's Diner<br>
Slide Away<br>
Married With Children<br>
Sad Song<br>
<a id="1366"></a>
<div class="album">album: <b>"(What's The Story) Morning Glory"</b> (1995)</div>
Hello<br>
Roll With It<br>
Wonderwall<br>
Don't Look Back In Anger<br>
Hey Now<br>
Some Might Say<br>
Cast No Shadow<br>
She's Electric<br>
Morning Glory<br>
Champagne Supernova<br>
Bonehead's Bank Holiday<br>
You can do this using find_next_siblings().
Code:
oasis = {
    'artist': 'Oasis',
    'albums': {}
}
soup = BeautifulSoup(html, 'lxml')  # where html is the html you've provided
all_albums = soup.find('div', id='listAlbum')
first_album = all_albums.find('div', class_='album')
album_name = first_album.b.text
songs = []
for tag in first_album.find_next_siblings(['a', 'div']):
    # If tag is a <div>, store the previous album and start a new one.
    if tag.name == 'div':
        oasis['albums'][album_name] = songs
        songs = []
        album_name = tag.b.text
    # If tag is an <a>, append the song to the list.
    else:
        songs.append(tag.text)
# Add the last album
oasis['albums'][album_name] = songs
print(oasis)
Output:
{
'artist': 'Oasis',
'albums': {
'"Definitely Maybe"': ["Rock 'n' Roll Star", 'Shakermaker', 'Live Forever', 'Up In The Sky', 'Columbia', 'Supersonic', 'Bring It On Down', 'Cigarettes & Alcohol', "Digsy's Diner", 'Slide Away', 'Married With Children', 'Sad Song', ''],
'"(What\'s The Story) Morning Glory"': ['Hello', 'Roll With It', 'Wonderwall', "Don't Look Back In Anger", 'Hey Now', 'Some Might Say', 'Cast No Shadow', "She's Electric", 'Morning Glory', 'Champagne Supernova', "Bonehead's Bank Holiday"]
}
}
EDIT:
After checking the website, I've made a few changes to the code.
First, you need to skip this <a id="6910"></a> tag (which is located at the end of each album), as it would otherwise add a song with an empty name. Second, the text other songs: is not located inside a <b> tag, so album_name = tag.b.text would raise an error for it.
Making the following changes will give you exactly what you need.
for tag in first_album.find_next_siblings(['a', 'div']):
    if tag.name == 'div':
        oasis['albums'][album_name] = songs
        songs = []
        album_name = tag.text if tag.text == 'other songs:' else tag.b.text
        continue
    if tag.get('id'):
        continue
    songs.append(tag.text)
Final output:
{
'artist': 'Oasis',
'albums': {
'"Definitely Maybe"': ["Rock 'n' Roll Star", 'Shakermaker', 'Live Forever', 'Up In The Sky', 'Columbia', 'Supersonic', 'Bring It On Down', 'Cigarettes & Alcohol', "Digsy's Diner", 'Slide Away', 'Married With Children', 'Sad Song'],
'"(What\'s The Story) Morning Glory"': ['Hello', 'Roll With It', 'Wonderwall', "Don't Look Back In Anger", 'Hey Now', 'Some Might Say', 'Cast No Shadow', "She's Electric", 'Morning Glory', 'Champagne Supernova', "Bonehead's Bank Holiday"],
'"Be Here Now"': ["D'You Know What I Mean?", 'My Big Mouth', 'Magic Pie', 'Stand By Me', 'I Hope, I Think, I Know', 'The Girl In The Dirty Shirt', 'Fade In-Out', "Don't Go Away", 'Be Here Now', 'All Around The World', "It's Getting Better (Man!!)"],
'"The Masterplan"': ['Acquiesce', 'Underneath The Sky', 'Talk Tonight', 'Going Nowhere', 'Fade Away', 'I Am The Walrus (Live)', 'Listen Up', "Rockin' Chair", 'Half The World Away', "(It's Good) To Be Free", 'Stay Young', 'Headshrinker', 'The Masterplan'],
'"Standing On The Shoulder Of Giants"': ["Fuckin' In The Bushes", 'Go Let It Out', 'Who Feels Love?', 'Put Yer Money Where Yer Mouth Is', 'Little James', 'Gas Panic!', 'Where Did It All Go Wrong?', 'Sunday Morning Call', 'I Can See A Liar', 'Roll It Over'],
'"Heathen Chemistry"': ['The Hindu Times', 'Force Of Nature', 'Hung In A Bad Place', 'Stop Crying Your Heart Out', 'Song Bird', 'Little By Little', '(Probably) All In The Mind', 'She Is Love', 'Born On A Different Cloud', 'Better Man'],
'"Don\'t Believe The Truth"': ['Turn Up The Sun', 'Mucky Fingers', 'Lyla', 'Love Like A Bomb', 'The Importance Of Being Idle', 'The Meaning Of Soul', "Guess God Thinks I'm Abel", 'Part Of The Queue', 'Keep The Dream Alive', 'A Bell Will Ring', 'Let There Be Love'],
'"Dig Out Your Soul"': ['Bag It Up', 'The Turning', 'Waiting For The Rapture', 'The Shock Of The Lightning', "I'm Outta Time", '(Get Off Your) High Horse Lady', 'Falling Down', "To Be Where There's Life", "Ain't Got Nothin'", 'The Nature Of Reality', 'Soldier On', 'I Believe In All'],
'other songs:': ["(As Long As They've Got) Cigarettes In Hell", '(I Got) The Fever', 'Alice', 'Alive', 'Angel Child', 'Boy With The Blues', 'Carry Us All', 'Cloudburst', 'Cum On Feel The Noize', "D'Yer Wanna Be A Spaceman", 'Eyeball Tickler', 'Flashbax', 'Full On', 'Helter Skelter', 'Heroes', 'I Will Believe', "Idler's Dream", 'If We Shadows', "It's Better People", 'Just Getting Older', "Let's All Make Believe", 'My Sister Lover', 'One Way Road', 'Round Are Way', 'Step Out', 'Street Fighting Man', 'Take Me', 'Take Me Away', 'The Fame', 'Whatever', "You've Got To Hide Your Love Away"]
}
}
