Join lists to export as csv - python

I am trying to join two lists and export them as a CSV, but when the CSV is built it's all messed up with spaces whose origin I don't know, and the first line is strangely duplicated, as you can see in the attached image.
import csv
amazondata = [{'amzlink': 'https://www.amazon.com/dp/B084ZZ7VY3', 'asin': 'B084ZZ7VY3', 'url': 'https://www.amazon.com/s?k=712145360504&s=review-rank', 'title': '100% Non-GMO Berberine HCL Complex Supplement - Supports Gut, Heart, and Immune System Health- Harvested in The Himalayas, Helps Regulate Blood Sugar & Cholesterol, 100% Free of Additives & Allergens', 'price': '$14.95', 'image': 'https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T2/images/I/81D1P4QqLfL._AC_SX425_.jpg', 'rank': 'Best Sellers Rank: #194,130 in Health & Household (See Top 100 in Health & Household)\n#6,896 in Blended Vitamin & Mineral Supplements', 'rating': '4.7 out of 5'}, {'amzlink': 'https://www.amazon.com/dp/B000NRTWS6', 'asin': 'B000NRTWS6', 'url': 'https://www.amazon.com/s?k=753950000698&s=review-rank'}, {'amzlink': 'https://www.amazon.com/dp/B07XM9P4C4', 'asin': 'B07XM9P4C4', 'url': 'https://www.amazon.com/s?k=753950005266&s=review-rank'}, {'amzlink': 'https://www.amazon.com/dp/B08KJ1VQJD', 'asin': 'B08KJ1VQJD', 'url': 'https://www.amazon.com/s?k=753950005242&s=review-rank'}, {'amzlink': 'https://www.amazon.com/dp/B005P0VD4W', 'asin': 'B005P0VD4W', 'url': 'https://www.amazon.com/s?k=043292560180&s=review-rank'}, {'amzlink': 'https://www.amazon.com/dp/B008FCJTES', 'asin': 'B008FCJTES', 'url': 'https://www.amazon.com/s?k=311845053213&s=review-rank'}]
amazonPage = [{'title': '100% Non-GMO Berberine HCL Complex Supplement - Supports Gut, Heart, and Immune System Health- Harvested in The Himalayas, Helps Regulate Blood Sugar & Cholesterol, 100% Free of Additives & Allergens', 'price': '$14.95', 'image': 'https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T2/images/I/81D1P4QqLfL._AC_SX425_.jpg', 'rank': 'Best Sellers Rank: #194,130 in Health & Household (See Top 100 in Health & Household)\n#6,896 in Blended Vitamin & Mineral Supplements', 'rating': '4.7 out of 5'}, {'title': "Doctor's Best High Absorption CoQ10 with BioPerine, Vegan, Gluten Free, Naturally Fermented, Heart Health & Energy Production, 100 mg 60 Veggie Caps", 'price': '$14.24 ($0.24 / Count)', 'image': 'https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T2/images/I/71sNP5u1N1S._AC_SX425_.jpg', 'rank': 'Rank not found', 'rating': 'Rating not found'}, {'title': "Doctor's Best CoQ10 Gummies 200 Mg, Coenzyme Q10 (Ubiquinone), Supports Heart Health, Boost Cellular Energy, Potent Antioxidant, 60 Ct (Packaging May Vary)", 'price': '$19.96 ($0.33 / Count)', 'image': 'https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T2/images/I/71NIwP2V2uL._AC_SY450_.jpg', 'rank': 'Best Sellers Rank: #20,634 in Health & Household (See Top 100 in Health & Household)\n#73 in CoQ10 Nutritional Supplements', 'rating': '4.7 out of 5'}, {'title': 'CoQ10 300mg Doctors Best 30 Softgel', 'price': '$19.64', 'image': 'https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T2/images/I/61DttvOf18L._AC_SX425_.jpg', 'rank': 'Manufacturer \u200f : \u200e Doctors Best', 'rating': 'Rating not found'}, {'title': 'Ultra Glandulars Ultra Raw Eye 60 Tab', 'price': '$19.65', 'image': 'https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T2/images/I/71ASBPlEphL._AC_SX425_.jpg', 'rank': 'Best Sellers Rank: #286,790 in Health & Household (See Top 100 in Health & Household)\n#14,261 in Herbal Supplements', 'rating': '4.1 out of 5'}, {'title': 'Mason Natural Garlic Oil 500 mg Odorless Allium Sativum Supplement - Supports Healthy Circulatory Function, 100 Softgels', 'price': '$6.75', 'image': 'https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T2/images/I/71eXVu3zxFL._AC_SX425_.jpg', 'rank': 'Best Sellers Rank: #346,287 in Health & Household (See Top 100 in Health & Household)\n#350 in Garlic Herbal Supplements', 'rating': '4.9 out of 5'}]
result = []
amazonPage.extend(amazondata)
for myDict in amazonPage:
    if myDict not in result:
        result.append(myDict)
print(result)
amazonPage[0].update(amazondata[0])
keys = amazonPage[0].keys()
print(keys)
with open('Test WF.csv', 'w', newline='', encoding="utf-8") as csvfile:
    dict_writer = csv.DictWriter(csvfile, keys)
    dict_writer.writeheader()
    dict_writer.writerows(result)
[csv output image]

You're only merging the first dictionary in each list. You should merge them all.
result = [data | page for data, page in zip(amazondata, amazonPage)]
keys = result[0].keys()
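Note that the dict union operator | requires Python 3.9+; on older versions {**data, **page} performs the same merge. A minimal end-to-end sketch with the same variable names as above (zip pairs the two lists positionally, so they must be in matching order):

import csv

# Merge each pair of dicts; on duplicate keys the page dict wins.
result = [data | page for data, page in zip(amazondata, amazonPage)]
keys = result[0].keys()

with open('Test WF.csv', 'w', newline='', encoding='utf-8') as csvfile:
    dict_writer = csv.DictWriter(csvfile, keys)
    dict_writer.writeheader()
    dict_writer.writerows(result)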

Related

BeautifulSoup: Extracting a Title and adjacent <a> tags

I'm attempting to get data from Wikipedia's sidebar on the 'Current Events' page with the code below. At the moment this produces an array of objects, each with the values title and url.
I would also like to give each object in the array a new value, headline, derived from the <h3> id or text content. This would result in each object having three values: headline, url and title. However, I'm unsure how to iterate through these.
Beautiful Soup Code
soup = BeautifulSoup(response, "html.parser").find('div', {'aria-labelledby': 'Ongoing_events'})
links = soup.findAll('a')
for item in links:
    title = item.text
    url = ("https://en.wikipedia.org" + item['href'])
    eo = CurrentEventsObject(title, url)
    eventsArray.append(eo)
Wikipedia Current Events List
<div class="mw-collapsible-content">
<h3><span class="mw-headline" id="Disasters">Disasters</span></h3>
<ul>
<li>Climate crisis</li>
<li>COVID-19 pandemic</li>
<li>2021–22 European windstorm season</li>
<li>2020–21 H5N8 outbreak</li>
<li>2021 Pacific typhoon season</li>
<li>Madagascar food crisis</li>
<li>Water crisis in Iran</li>
<li>Yemeni famine</li>
<li>2021 La Palma eruption</li>
</ul>
<h3><span class="mw-headline" id="Economic">Economic</span></h3>
<ul>
<li>2020–2021 global chip shortage</li>
<li>2021 global supply chain crisis</li>
<li>COVID-19 recession</li>
<li>Lebanese liquidity crisis</li>
<li>Pandora Papers leak</li>
<li>Sri Lankan economic and food crisis</li>
<li>Turkish currency and debt crisis</li>
<li>United Kingdom natural gas supplier crisis</li>
</ul>
<h3><span class="mw-headline" id="Politics">Politics</span></h3>
<ul>
<li>Belarus−European Union border crisis</li>
<li>Brazilian protests</li>
<li>Colombian tax reform protests</li>
<li>Eswatini protests</li>
<li>Haitian protests</li>
<li>Indian farmers' protests</li>
<li>Insulate Britain protests</li>
<li>Jersey dispute</li>
<li>Libyan peace process</li>
<li>Malaysian political crisis</li>
<li>Myanmar protests</li>
<li>Nicaraguan protests</li>
<li>Nigerian protests</li>
<li>Persian Gulf crisis</li>
<li>Peruvian crisis</li>
<li>Russian election protests</li>
<li>Solomon Islands unrest</li>
<li>Tigrayan peace process</li>
<li>Thai protests</li>
<li>Tunisian political crisis</li>
<li>United States racial unrest</li>
<li>Venezuelan presidential crisis</li>
</ul>
<div class="editlink noprint plainlinks"><a class="external text" href="https://en.wikipedia.org/w/index.php?title=Portal:Current_events/Sidebar&action=edit">edit section</a></div>
</div>
Note: Try to select your elements more specifically so you get all the information in one pass. Defining the list outside your loops keeps it from being overwritten.
The following steps create a list of dicts that can, for example, simply be iterated over or turned into a data frame.
#1 Select all <ul> that immediately follow an <h3>:
soup.select('h3 + ul')
#2 Select the <h3> and get its text:
e.find_previous_sibling('h3').text.strip()
#3 Select all <a> in the <ul> and iterate over the results while creating a list of dicts:
for a in e.select('a'):
    data.append({
        'headline': headline,
        'title': a['title'],
        'url': 'https://en.wikipedia.org' + a['href']
    })
Example
soup = BeautifulSoup(response, "html.parser").find('div', {'aria-labelledby': 'Ongoing_events'})
data = []
for e in soup.select('h3 + ul'):
    headline = e.find_previous_sibling('h3').text.strip()
    for a in e.select('a'):
        data.append({
            'headline': headline,
            'title': a['title'],
            'url': 'https://en.wikipedia.org' + a['href']
        })
data
Output
[{'headline': 'Disasters',
'title': 'Climate crisis',
'url': 'https://en.wikipedia.org/wiki/Climate_crisis'},
{'headline': 'Disasters',
'title': 'COVID-19 pandemic',
'url': 'https://en.wikipedia.org/wiki/COVID-19_pandemic'},
{'headline': 'Disasters',
'title': '2021–22 European windstorm season',
'url': 'https://en.wikipedia.org/wiki/2021%E2%80%9322_European_windstorm_season'},
{'headline': 'Disasters',
'title': '2020–21 H5N8 outbreak',
'url': 'https://en.wikipedia.org/wiki/2020%E2%80%9321_H5N8_outbreak'},
{'headline': 'Disasters',
'title': '2021 Pacific typhoon season',
'url': 'https://en.wikipedia.org/wiki/2021_Pacific_typhoon_season'},
{'headline': 'Disasters',
'title': '2021 Madagascar food crisis',
'url': 'https://en.wikipedia.org/wiki/2021_Madagascar_food_crisis'},
{'headline': 'Disasters',
'title': 'Water scarcity in Iran',
'url': 'https://en.wikipedia.org/wiki/Water_scarcity_in_Iran'},
{'headline': 'Disasters',
'title': 'Famine in Yemen (2016–present)',
'url': 'https://en.wikipedia.org/wiki/Famine_in_Yemen_(2016%E2%80%93present)'},
{'headline': 'Disasters',
'title': '2021 Cumbre Vieja volcanic eruption',
'url': 'https://en.wikipedia.org/wiki/2021_Cumbre_Vieja_volcanic_eruption'},
{'headline': 'Economic',
'title': '2020–2021 global chip shortage',
'url': 'https://en.wikipedia.org/wiki/2020%E2%80%932021_global_chip_shortage'},
{'headline': 'Economic',
'title': '2021 global supply chain crisis',
'url': 'https://en.wikipedia.org/wiki/2021_global_supply_chain_crisis'},
{'headline': 'Economic',
'title': 'COVID-19 recession',
'url': 'https://en.wikipedia.org/wiki/COVID-19_recession'},
{'headline': 'Economic',
'title': 'Lebanese liquidity crisis',
'url': 'https://en.wikipedia.org/wiki/Lebanese_liquidity_crisis'},
{'headline': 'Economic',
'title': 'Pandora Papers',
'url': 'https://en.wikipedia.org/wiki/Pandora_Papers'},
{'headline': 'Economic',
'title': '2021 Sri Lankan economic crisis',
'url': 'https://en.wikipedia.org/wiki/2021_Sri_Lankan_economic_crisis'},
{'headline': 'Economic',
'title': '2018–2021 Turkish currency and debt crisis',
'url': 'https://en.wikipedia.org/wiki/2018%E2%80%932021_Turkish_currency_and_debt_crisis'},
{'headline': 'Economic',
'title': '2021 United Kingdom natural gas supplier crisis',
'url': 'https://en.wikipedia.org/wiki/2021_United_Kingdom_natural_gas_supplier_crisis'},
{'headline': 'Politics',
'title': '2021 Belarus–European Union border crisis',
'url': 'https://en.wikipedia.org/wiki/2021_Belarus%E2%80%93European_Union_border_crisis'},
{'headline': 'Politics',
'title': '2021 Brazilian protests',
'url': 'https://en.wikipedia.org/wiki/2021_Brazilian_protests'},
{'headline': 'Politics',
'title': '2021 Colombian protests',
'url': 'https://en.wikipedia.org/wiki/2021_Colombian_protests'},
{'headline': 'Politics',
'title': '2021 Eswatini protests',
'url': 'https://en.wikipedia.org/wiki/2021_Eswatini_protests'},
{'headline': 'Politics',
'title': '2018–2021 Haitian protests',
'url': 'https://en.wikipedia.org/wiki/2018%E2%80%932021_Haitian_protests'},
{'headline': 'Politics',
'title': "2020–2021 Indian farmers' protest",
'url': 'https://en.wikipedia.org/wiki/2020%E2%80%932021_Indian_farmers%27_protest'},
{'headline': 'Politics',
'title': 'Insulate Britain protests',
'url': 'https://en.wikipedia.org/wiki/Insulate_Britain_protests'},
{'headline': 'Politics',
'title': '2021 Jersey dispute',
'url': 'https://en.wikipedia.org/wiki/2021_Jersey_dispute'},
{'headline': 'Politics',
'title': 'Libyan peace process',
'url': 'https://en.wikipedia.org/wiki/Libyan_peace_process'},
{'headline': 'Politics',
'title': '2020–21 Malaysian political crisis',
'url': 'https://en.wikipedia.org/wiki/2020%E2%80%9321_Malaysian_political_crisis'},
{'headline': 'Politics',
'title': '2021 Myanmar protests',
'url': 'https://en.wikipedia.org/wiki/2021_Myanmar_protests'},
{'headline': 'Politics',
'title': '2018–2021 Nicaraguan protests',
'url': 'https://en.wikipedia.org/wiki/2018%E2%80%932021_Nicaraguan_protests'},
{'headline': 'Politics',
'title': 'End SARS',
'url': 'https://en.wikipedia.org/wiki/End_SARS'},
{'headline': 'Politics',
'title': '2019–2021 Persian Gulf crisis',
'url': 'https://en.wikipedia.org/wiki/2019%E2%80%932021_Persian_Gulf_crisis'},
{'headline': 'Politics',
'title': '2017–present Peruvian political crisis',
'url': 'https://en.wikipedia.org/wiki/2017%E2%80%93present_Peruvian_political_crisis'},
{'headline': 'Politics',
'title': '2021 Russian election protests',
'url': 'https://en.wikipedia.org/wiki/2021_Russian_election_protests'},
{'headline': 'Politics',
'title': '2021 Solomon Islands unrest',
'url': 'https://en.wikipedia.org/wiki/2021_Solomon_Islands_unrest'},
{'headline': 'Politics',
'title': 'Tigrayan peace process',
'url': 'https://en.wikipedia.org/wiki/Tigrayan_peace_process'},
{'headline': 'Politics',
'title': '2020–2021 Thai protests',
'url': 'https://en.wikipedia.org/wiki/2020%E2%80%932021_Thai_protests'},
{'headline': 'Politics',
'title': '2021 Tunisian political crisis',
'url': 'https://en.wikipedia.org/wiki/2021_Tunisian_political_crisis'},
{'headline': 'Politics',
'title': '2020–2021 United States racial unrest',
'url': 'https://en.wikipedia.org/wiki/2020%E2%80%932021_United_States_racial_unrest'},
{'headline': 'Politics',
'title': 'Venezuelan presidential crisis',
'url': 'https://en.wikipedia.org/wiki/Venezuelan_presidential_crisis'}]
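As noted above, the resulting list of dicts can also be turned into a data frame directly; a minimal sketch, assuming pandas is installed:
import pandas as pd

df = pd.DataFrame(data)  # columns: headline, title, url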

Scraping profiles with Python and the "scrape-linkedin" package

I am trying to use the scrape_linkedin package. I followed the section on the GitHub page on how to set up the package and the LinkedIn li_at key (which I paste here for clarity).
Getting LI_AT
Navigate to www.linkedin.com and log in
Open browser developer tools (Ctrl-Shift-I or right click -> inspect element)
Select the appropriate tab for your browser (Application on Chrome, Storage on Firefox)
Click the Cookies dropdown on the left-hand menu, and select the www.linkedin.com option
Find and copy the li_at value
Once I collect the li_at value from my LinkedIn, I run the following code:
from scrape_linkedin import ProfileScraper
with ProfileScraper(cookie='myVeryLong_li_at_Code_which_has_characters_like_AQEDAQNZwYQAC5_etc') as scraper:
    profile = scraper.scrape(url='https://www.linkedin.com/in/justintrudeau/')
    print(profile.to_dict())
I have two questions (I am originally an R user).
How can I input a list of profiles:
https://www.linkedin.com/in/justintrudeau/
https://www.linkedin.com/in/barackobama/
https://www.linkedin.com/in/williamhgates/
https://www.linkedin.com/in/wozniaksteve/
and scrape the profiles? (In R I would use the map function from the purrr package to apply the function to each of the LinkedIn profiles).
The output (from the original GitHub page) is returned in a JSON-style format. My second question is how I can convert this into a pandas data frame (it is returned similar to the following).
{'personal_info': {'name': 'Steve Wozniak', 'headline': 'Fellow at Apple', 'company': None, 'school': None, 'location': 'San Francisco Bay Area', 'summary': '', 'image': '', 'followers': '', 'email': None, 'phone': None, 'connected': None, 'websites': [], 'current_company_link': 'https://www.linkedin.com/company/sandisk/'}, 'experiences': {'jobs': [{'title': 'Chief Scientist', 'company': 'Fusion-io', 'date_range': 'Jul 2014 – Present', 'location': 'Primary Data', 'description': "I'm looking into future technologies applicable to servers and storage, and helping this company, which I love, get noticed and get a lead so that the world can discover the new amazing technology they have developed. My role is principally a marketing one at present but that will change over time.", 'li_company_url': 'https://www.linkedin.com/company/sandisk/'}, {'title': 'Fellow', 'company': 'Apple', 'date_range': 'Mar 1976 – Present', 'location': '1 Infinite Loop, Cupertino, CA 94015', 'description': 'Digital Design engineer.', 'li_company_url': ''}, {'title': 'President & CTO', 'company': 'Wheels of Zeus', 'date_range': '2002 – 2005', 'location': None, 'description': None, 'li_company_url': 'https://www.linkedin.com/company/wheels-of-zeus/'}, {'title': 'diagnostic programmer', 'company': 'TENET Inc.', 'date_range': '1970 – 1971', 'location': None, 'description': None, 'li_company_url': ''}], 'education': [{'name': 'University of California, Berkeley', 'degree': 'BS', 'grades': None, 'field_of_study': 'EE & CS', 'date_range': '1971 – 1986', 'activities': None}, {'name': 'University of Colorado Boulder', 'degree': 'Honorary PhD.', 'grades': None, 'field_of_study': 'Electrical and Electronics Engineering', 'date_range': '1968 – 1969', 'activities': None}], 'volunteering': []}, 'skills': [], 'accomplishments': {'publications': [], 'certifications': [], 'patents': [], 'courses': [], 'projects': [], 'honors': [], 'test_scores': [], 'languages': [], 'organizations': []}, 'interests': ['Western Digital', 'University of Colorado Boulder', 'Western Digital Data Center Solutions', 'NEW Homebrew Computer Club', 'Wheels of Zeus', 'SanDisk®']}
Firstly, you can create a custom function to scrape the data and use Python's map function to apply it over each profile link.
Secondly, to create a pandas dataframe from a dictionary, you can simply pass the dictionary to pd.DataFrame.
Thus, to create a dataframe df from a dictionary d, you can do:
df = pd.DataFrame(d)
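A minimal sketch of both steps, reusing the ProfileScraper calls from the question (the cookie value is a placeholder, and flattening to the personal_info section is just one choice, since the full dict is nested):

import pandas as pd
from scrape_linkedin import ProfileScraper

profile_urls = [
    'https://www.linkedin.com/in/justintrudeau/',
    'https://www.linkedin.com/in/barackobama/',
    'https://www.linkedin.com/in/williamhgates/',
    'https://www.linkedin.com/in/wozniaksteve/',
]

def scrape_profile(url):
    # One scraper per URL keeps the sketch simple; reusing a single
    # scraper session for all URLs would be faster.
    with ProfileScraper(cookie='your_li_at_value') as scraper:
        return scraper.scrape(url=url).to_dict()

profiles = list(map(scrape_profile, profile_urls))

# The scraped dicts are nested, so pick a flat section for the frame.
df = pd.DataFrame([p['personal_info'] for p in profiles])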

Python Generators and how to iterate over them correctly to drop records based on a key within the dictionary being present in a separate list

I'm new to the concept of generators and I'm struggling with how to apply my changes to the records within the generator object returned from the RISparser module.
I understand that a generator only reads a record at a time and doesn't actually store the data in memory but I'm having a tough time iterating over it effectively and applying my changes.
My changes involve dropping records whose ['doi'] values are not contained within a list of DOIs [doi_match].
doi_match = ['10.1002/14651858.CD008259.pub2','10.1002/14651858.CD011552','10.1002/14651858.CD011990']
The generator object returned from RISparser contains the following information; this is just the first 2 records returned of a few hundred. I want to iterate over it and compare the 'doi' key from each record with the list of DOIs.
{'type_of_reference': 'JOUR', 'title': "The CoRe Outcomes in WomeN's health (CROWN) initiative: Journal editors invite researchers to develop core outcomes in women's health", 'secondary_title': 'Neurourology and Urodynamics', 'alternate_title1': 'Neurourol. Urodyn.', 'volume': '33', 'number': '8', 'start_page': '1176', 'end_page': '1177', 'year': '2014', 'doi': '10.1002/nau.22674', 'issn': '07332467 (ISSN)', 'authors': ['Khan, K.'], 'keywords': ['Bias (epidemiology)', 'Clinical trials', 'Consensus', 'Endpoint determination/standards', 'Evidence-based medicine', 'Guidelines', 'Research design/standards', 'Systematic reviews', 'Treatment outcome', 'consensus', 'editor', 'female', 'human', 'medical literature', 'Note', 'outcomes research', 'peer review', 'randomized controlled trial (topic)', 'systematic review (topic)', "women's health", 'outcome assessment', 'personnel', 'publication', 'Female', 'Humans', 'Outcome Assessment (Health Care)', 'Periodicals as Topic', 'Research Personnel', "Women's Health"], 'publisher': 'John Wiley and Sons Inc.', 'notes': ['Export Date: 14 July 2020', 'CODEN: NEURE'], 'type_of_work': 'Note', 'name_of_database': 'Scopus', 'custom2': '25270392', 'language': 'English', 'url': 'https://www.scopus.com/inward/record.uri?eid=2-s2.0-84908368202&doi=10.1002%2fnau.22674&partnerID=40&md5=b220702e005430b637ef9d80a94dadc4'}
{'type_of_reference': 'JOUR', 'title': "The CROWN initiative: Journal editors invite researchers to develop core outcomes in women's health", 'secondary_title': 'Gynecologic Oncology', 'alternate_title1': 'Gynecol. Oncol.', 'volume': '134', 'number': '3', 'start_page': '443', 'end_page': '444', 'year': '2014', 'doi': '10.1016/j.ygyno.2014.05.005', 'issn': '00908258 (ISSN)', 'authors': ['Karlan, B.Y.'], 'author_address': 'Gynecologic Oncology and Gynecologic Oncology Reports, India', 'keywords': ['clinical trial (topic)', 'decision making', 'Editorial', 'evidence based practice', 'female infertility', 'health care personnel', 'human', 'outcome assessment', 'outcomes research', 'peer review', 'practice guideline', 'premature labor', 'priority journal', 'publication', 'systematic review (topic)', "women's health", 'editorial', 'female', 'outcome assessment', 'personnel', 'publication', 'Female', 'Humans', 'Outcome Assessment (Health Care)', 'Periodicals as Topic', 'Research Personnel', "Women's Health"], 'publisher': 'Academic Press Inc.', 'notes': ['Export Date: 14 July 2020', 'CODEN: GYNOA', 'Correspondence Address: Karlan, B.Y.; Gynecologic Oncology and Gynecologic Oncology ReportsIndia'], 'type_of_work': 'Editorial', 'name_of_database': 'Scopus', 'custom2': '25199578', 'language': 'English', 'url': 'https://www.scopus.com/inward/record.uri?eid=2-s2.0-84908351159&doi=10.1016%2fj.ygyno.2014.05.005&partnerID=40&md5=ab5a4d26d52c12d081e38364b0c79678'}
I tried iterating over the generator and applying the changes, but the records that have matches are not being placed in the match list.
match = []
for entry in ris_records:
    if entry['doi'] in doi_match:
        match.append(entry)
    else:
        del entry
Any advice on how to iterate over a generator correctly would be appreciated, thanks.
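For what it's worth, the usual pattern is a single comprehension that consumes the generator once; a minimal sketch, assuming each record is a dict (the del does nothing useful, and .get avoids a KeyError for records without a 'doi'):

doi_match = {'10.1002/14651858.CD008259.pub2', '10.1002/14651858.CD011552', '10.1002/14651858.CD011990'}

# A set makes the membership test O(1); the comprehension drains the
# generator, keeping only records whose DOI is in the set.
match = [entry for entry in ris_records if entry.get('doi') in doi_match]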

Scraping a dynamic website using BeautifulSoup

I am scraping the website nykaa.com and the link is (https://www.nykaa.com/skin/moisturizers/serums-essence/c/8397?root=nav_3&page_no=1). There are 25 pages and the data loads dynamically per page. I am unable to find the source of the data. Moreover, when I scrape the data I only get the same 20 products over and over, so the list grows to 420 redundant entries.
import requests
from bs4 import BeautifulSoup
import unicodecsv as csv
urls = []
l1 = []
for page in range(1,5):
    result = requests.get("https://www.nykaa.com/skin/moisturizers/serums-essence/c/8397?root=nav_3&page_no=" + str(page))
    src = result.content
    soup = BeautifulSoup(src,'lxml')
    for div_tag in soup.find_all("div", class_ = "card-wrapper-container col-xs-12 col-sm-6 col-md-4"):
        for div1_tag in soup.find_all("div", class_ = "product-list-box card desktop-cart"):
            h2_tag = div1_tag.find("h2").find("span")
            price_tag = div1_tag.find("div", class_ = "price-info")
            l1 = [h2_tag.get_text(),price_tag.get_text()]
            urls.append(l1)
#print(urls)
with open('xyz.csv', 'wb') as myfile:
    wr = csv.writer(myfile)
    wr.writerows(urls)
The above code fetches me a list of around 1200 product names and prices, of which only 30 to 40 are unique; the rest are duplicates. I want to fetch the data of all 25 pages uniquely; there are 486 unique products in total. I also used selenium to click the next page link, but that didn't work out either.
This makes the request the page itself makes (as viewed in the network tab) in a loop over all pages (including determining the number of pages). results is a list of lists you can easily write to csv.
import requests, math, csv
page = '1'
def append_new_rows(data):
    for i in data:
        if 'name' in i:
            results.append([i['name'], i['final_price']])

with requests.Session() as s:
    r = s.get(f'https://www.nykaa.com/gludo/products/list?pro=false&filter_format=v2&app_version=null&client=react&root=nav_3&page_no={page}&category_id=8397').json()
    results_per_page = 20
    total_results = r['response']['total_found']
    num_pages = math.ceil(total_results/results_per_page)
    results = []
    append_new_rows(r['response']['products'])
    for page in range(2, num_pages + 1):
        r = s.get(f'https://www.nykaa.com/gludo/products/list?pro=false&filter_format=v2&app_version=null&client=react&root=nav_3&page_no={page}&category_id=8397').json()
        append_new_rows(r['response']['products'])

with open("data.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter = ",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['Name','Price'])
    for row in results:
        w.writerow(row)
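To sanity-check the result for duplicates, a quick count with pandas is one option (reading the data.csv written above):

import pandas as pd

df = pd.read_csv('data.csv')
print(len(df), df['Name'].nunique())  # total rows vs. unique product names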
You can use selenium:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://www.nykaa.com/skin/moisturizers/serums-essence/c/8397')
def get_products(_d):
    return [{'title':(lambda x:x if not x else x.text)(i.find('div', {'class':'m-content__product-list__title'})), 'price':(lambda x:x if not x else x.text)(i.find('span', {'class':'post-card__content-price-offer'}))} for i in _d.find_all('div', {'class':'card-wrapper-container col-xs-12 col-sm-6 col-md-4'})]

s = soup(d.page_source, 'html.parser')
r = [list(filter(None, get_products(s)))]

while 'disable-event' not in s.find('li', {'class':'next'}).attrs['class']:
    d.get(f"https://www.nykaa.com{s.find('li', {'class':'next'}).a['href']}")
    s = soup(d.page_source, 'html.parser')
    r.append(list(filter(None, get_products(s))))
Sample output (first three pages):
[[{'title': 'The Face Shop Calendula Essential Moisture Serum', 'price': '₹1320 '}, {'title': 'Palmers Cocoa Butter Formula Skin Perfecting Ultra Hydrating...', 'price': '₹970 '}, {'title': "Cheryl's Cosmeceuticals Clarifi Acne Anti Blemish Serum", 'price': '₹875 '}, {'title': 'Estee Lauder Advanced Night Repair Synchronized Recovery Com...', 'price': '₹1250 '}, {'title': 'Estee Lauder Advanced Night Repair Synchronized Recovery Com...', 'price': '₹1250 '}, {'title': 'Estee Lauder Advanced Night Repair Synchronized Recovery Com...', 'price': '₹3900 '}, {'title': 'Klairs Freshly Juiced Vitamin Drop', 'price': '₹1492 '}, {'title': 'Innisfree The Green Tea Seed Serum', 'price': '₹1950 '}, {'title': "Kiehl's Midnight Recovery Concentrate", 'price': '₹2100 '}, {'title': 'The Face Shop White Seed Brightening Serum', 'price': '₹1990 '}, {'title': 'Biotique Bio Dandelion Visibly Ageless Serum', 'price': '₹230 '}, {'title': None, 'price': None}, {'title': 'St.Botanica Vitamin C 20% + Vitamin E & Hyaluronic Acid Faci...', 'price': '₹1499 '}, {'title': 'Biotique Bio Coconut Whitening & Brightening Cream', 'price': '₹199 '}, {'title': 'Neutrogena Fine Fairness Brightening Serum', 'price': '₹849 '}, {'title': "Kiehl's Clearly Corrective Dark Spot Solution", 'price': '₹4300 '}, {'title': "Kiehl's Clearly Corrective Dark Spot Solution", 'price': '₹4300 '}, {'title': 'Lakme Absolute Perfect Radiance Skin Lightening Serum', 'price': '₹960 '}, {'title': 'St.Botanica Hyaluronic Acid + Vitamin C, E Facial Serum', 'price': '₹1499 '}, {'title': 'Jeva Vitamin C Serum with Hyaluronic Acid for Anti Aging and...', 'price': '₹350 '}, {'title': 'Lotus Professional Phyto-Rx Whitening & Brightening Serum', 'price': '₹595 '}], [{'title': 'The Face Shop Chia Seed Moisture Recharge Serum', 'price': '₹1890 '}, {'title': 'Lotus Herbals WhiteGlow Skin Whitening & Brightening Gel Cre...', 'price': '₹280 '}, {'title': 'Lakme 9 to 5 Naturale Aloe Aqua Gel', 'price': '₹200 '}, {'title': 'Estee Lauder Advanced Night Repair Synchronized Recovery Com...', 'price': '₹5900 '}, {'title': 'Mixify Unloc Skin Glow Serum', 'price': '₹499 '}, {'title': 'St.Botanica Retinol 2.5% + Vitamin E & Hyaluronic Acid Profe...', 'price': '₹1499 '}, {'title': 'LANEIGE Hydration Combo Set', 'price': '₹3000 '}, {'title': 'Biotique Bio Dandelion Ageless Visiblly Serum', 'price': '₹690 '}, {'title': 'The Moms Co. Natural Vita Rich Face Serum', 'price': '₹699 '}, {'title': "It's Skin Power 10 Formula VC Effector", 'price': '₹950 '}, {'title': "Kiehl's Powerful-Strength Line-Reducing Concentrate", 'price': '₹5100 '}, {'title': 'Olay Natural White Light Instant Glowing Fairness Skin Cream', 'price': '₹99 '}, {'title': 'Plum Green Tea Skin Clarifying Concentrate', 'price': '₹881 '}, {'title': 'Olay Total Effects 7 In One Anti-Ageing Smoothing Serum', 'price': '₹764 '}, {'title': 'Elizabeth Arden Ceramide Daily Youth Restoring Serum 60 Caps...', 'price': '₹5850 '}, {'title': None, 'price': None}, {'title': 'Olay Regenerist Advanced Anti-Ageing Micro-Sculpting Serum', 'price': '₹1699 '}, {'title': 'Lakme Absolute Argan Oil Radiance Overnight Oil-in-Serum', 'price': '₹945 '}, {'title': 'The Face Shop Mango Seed Silk Moisturizing Emulsion', 'price': '₹1890 '}, {'title': 'The Face Shop Calendula Essential Good to Glow Combo', 'price': '₹2557 '}, {'title': 'Garnier Skin Naturals Light Complete Serum Cream', 'price': '₹69 '}], [{'title': 'Clinique Moisture Surge Hydrating Supercharged Concentrate', 'price': '₹2550 '}, {'title': 'LANEIGE Sleeping Mask Combo', 'price': '₹3000 '}, {'title': 'Klairs Rich Moist Soothing Serum', 'price': '₹1492 '}, {'title': 'Estee Lauder Idealist Pore Minimizing Skin Refinisher', 'price': '₹5500 '}, {'title': 'O3+ Whitening & Brightening Serum', 'price': '₹1475 '}, {'title': 'Elizabeth Arden Ceramide Daily Youth Restoring Serum 90 Caps...', 'price': '₹6900 '}, {'title': 'Olay Natural White Light Instant Glowing Fairness Skin Cream', 'price': '₹189 '}, {'title': "L'Oreal Paris White Perfect Clinical Expert Anti-Spot Whiten...", 'price': '₹1480 '}, {'title': 'belif Travel Kit', 'price': '₹1499 '}, {'title': 'Forest Essentials Advanced Soundarya Serum With 24K Gold', 'price': '₹3975 '}, {'title': "L'Occitane Immortelle Reset Serum", 'price': '₹4500 '}, {'title': 'Lakme Absolute Skin Gloss Reflection Serum 30ml', 'price': '₹990 '}, {'title': 'Neutrogena Hydro Boost Emulsion', 'price': '₹999 '}, {'title': 'Innisfree Anti-Aging Set', 'price': '₹2350 '}, {'title': 'Clinique Fresh Pressed 7-Day System With Pure Vitamin C', 'price': '₹2400 '}, {'title': 'The Face Shop The Therapy Premier Serum', 'price': '₹2490 '}, {'title': 'The Body Shop Vitamin E Overnight Serum In Oil', 'price': '₹1695 '}, {'title': 'Jeva Vitamin C Serum with Hyaluronic Acid for Anti Aging and...', 'price': '₹525 '}, {'title': 'Olay Regenerist Micro Sculpting Cream and White Radiance Hyd...', 'price': '₹2698 '}, {'title': 'The Face Shop Yehwadam Pure Brightening Serum', 'price': '₹4350 '}]]

API printing in terminal but not in HTML template

I am using a news API in my Django project. I can print the data in my terminal, but I can't render it through my news.html file. This could be an issue with how the data is passed into the HTML.
from django.shortcuts import render
import requests
def news(request):
    url = ('https://newsapi.org/v2/top-headlines?'
           'sources=bbc-news&'
           'apiKey=647505e4506e425994ac0dc310221d04')
    response = requests.get(url)
    print(response.json())
    news = response.json()
    return render(request, 'new/new.html', {'news': news})
base.html
<html>
<head>
<title></title>
</head>
<body>
{% block content %}
{% endblock %}
</body>
</html>
news.html
{% extends 'base.html' %}
{% block content %}
<h2>news API</h2>
{% if news %}
<p><strong>{{ news.title }}</strong><strong>{{ news.name}}</strong> public repositories.</p>
{% endif %}
{% endblock %}
Terminal and API Output
System check identified no issues (0 silenced).
November 28, 2018 - 12:31:07
Django version 2.1.3, using settings 'map.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.
{'status': 'ok', 'totalResults': 10, 'articles': [{'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': 'Sri Lanka defence chief held over murders', 'description': "The country's top officer is in custody, accused of covering up illegal killings in the civil war.", 'url': 'http://www.bbc.co.uk/news/world-asia-46374111', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/1010/production/_104521140_26571c51-e151-41b9-85a3-d6e441f5262b.jpg', 'publishedAt': '2018-11-28T12:12:05Z', 'content': "Image copyright AFP Image caption Adm Wijeguneratne denies the charges Sri Lanka's top military officer has been remanded in custody, accused of covering up civil war-era murders. Chief of Defence Staff Ravindra Wijeguneratne appeared in court after warrants … [+288 chars]"}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': 'Flash-flooding causes chaos in Sydney', 'description': "Emergency crews respond to hundreds of calls on the city's wettest November day since 1984.", 'url': 'http://www.bbc.co.uk/news/world-australia-46366961', 'urlToImage': 'https://ichef.bbci.co.uk/images/ic/1024x576/p06t1d6h.jpg', 'publishedAt': '2018-11-28T11:58:49Z', 'content': 'Media caption People in vehicles were among those caught up in the floods Sydney has been deluged by the heaviest November rain it has experienced in decades, causing flash-flooding, traffic chaos and power cuts. Heavy rain fell throughout Wednesday, the city… [+2282 chars]'}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': "Rapist 'gets chance to see victim's child'", 'description': 'Sammy Woodhouse calls for a law change after rapist Arshid Hussain is given the chance to see his son.', 'url': 'http://www.bbc.co.uk/news/uk-england-south-yorkshire-46368991', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/12C94/production/_95184967_jessica.jpg', 'publishedAt': '2018-11-28T09:38:07Z', 'content': "Image caption Sammy Woodhouse's son was conceived when she was raped by Arshid Hussain A child exploitation scandal victim has called for a law change amid claims a man who raped her has been invited to play a role in her son's life. Arshid Hussain, who was j… [+2543 chars]"}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': 'China chemical plant explosion kills 22', 'description': 'Initial reports say a vehicle carrying chemicals exploded while waiting to enter the north China plant.', 'url': 'http://www.bbc.co.uk/news/world-asia-46369041', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/2E1A/production/_104520811_mediaitem104520808.jpg', 'publishedAt': '2018-11-28T08:03:12Z', 'content': 'Image copyright AFP Image caption A line of burnt out vehicles could be seen outside the chemical plant At least 22 people have died and 22 more were injured in a blast outside a chemical factory in northern China. A vehicle carrying chemicals exploded while … [+1252 chars]'}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': 'Thousands told to flee Australia bushfire', 'description': 'Queensland\'s fire danger warning has been raised to "catastrophic" for the first time.', 'url': 'http://www.bbc.co.uk/news/world-australia-46366964', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/8977/production/_104519153_1ccd493b-4500-4d8d-9e6c-f32ba036dd3e.jpg', 'publishedAt': '2018-11-28T07:01:41Z', 'content': 'Image copyright EPA Image caption More than 130 bushfires are burning across Queensland, officials say Thousands of Australians have been told to evacuate their homes as a powerful bushfire threatens properties in Queensland. It follows the raising of the sta… [+974 chars]'}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': "Chinese scientist defends 'gene-editing'", 'description': "He Jiankui shocked the world by claiming he had created the world's first genetically edited children.", 'url': 'http://www.bbc.co.uk/news/world-asia-china-46368731', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/7A23/production/_97176213_breaking_news_bigger.png', 'publishedAt': '2018-11-28T06:00:22Z', 'content': 'A Chinese scientist who claims to have created the world\'s first genetically edited babies has defended his work. Speaking at a genome summit in Hong Kong, He Jiankui, an associate professor at a Shenzhen university, said he was "proud" of his work. He said "… [+335 chars]'}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': 'Republican wins Mississippi Senate seat', 'description': "Cindy Hyde-Smith wins Mississippi's Senate race in a vote overshadowed by racial acrimony.", 'url': 'http://www.bbc.co.uk/news/world-us-canada-46361369', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/3A2B/production/_104519841_050855280.jpg', 'publishedAt': '2018-11-28T04:19:15Z', 'content': "Image copyright Reuters Image caption In her victory speech, Cindy Hyde-Smith promised to represent all Mississippians Republican Cindy Hyde-Smith has won Mississippi's racially charged Senate election, beating a challenge from the black Democrat, Mike Espy. … [+4327 chars]"}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': "Lion Air should 'improve safety culture'", 'description': 'Indonesian authorities release a preliminary report into a crash in October that killed 189 people.', 'url': 'http://www.bbc.co.uk/news/world-asia-46121127', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/1762F/production/_104519759_45e74e27-2dc6-45dc-bded-405c057702f5.jpg', 'publishedAt': '2018-11-28T04:10:45Z', 'content': "Image copyright Reuters Image caption The families of the victims visited the site of the crash to pay tribute Indonesian authorities have recommended that budget airline Lion Air improve its safety culture, in a preliminary report into last month's deadly cr… [+1725 chars]"}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': "Trump 'may cancel Putin talks over Ukraine'", 'description': '"I don\'t like the aggression," the US leader says after Russia seizes Ukrainian boats off Crimea.', 'url': 'http://www.bbc.co.uk/news/world-europe-46367191', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/0C77/production/_104519130_050842389.jpg', 'publishedAt': '2018-11-28T01:08:30Z', 'content': 'Image copyright AFP Image caption Some of the detained Ukrainians have appeared in court in Crimea US President Donald Trump says he may cancel a meeting with Russian President Vladimir Putin following a maritime clash between Russia and Ukraine. Mr Trump tol… [+4595 chars]'}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': 'Wandering dog home after 2,200-mile adventure', 'description': "Sinatra the husky was found in Florida 18 months after vanishing in New York. Here's how he got home.", 'url': 'http://www.bbc.co.uk/news/world-us-canada-46353240', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/D49E/production/_104503445_p06t0kn9.jpg', 'publishedAt': '2018-11-27T21:47:59Z', 'content': "Video Sinatra the husky was found in Florida 18 months after vanishing in New York. Here's the remarkable story of how he got home."}]}
[28/Nov/2018 12:31:12] "GET / HTTP/1.1" 200 155
The data you get from that API doesn't have title or name as attributes at the top level. Rather, they are inside the articles element, which itself is a list.
{% for article in news.articles %}
    <p><strong>{{ article.title }}</strong><strong>{{ article.source.name }}</strong> public repositories.</p>
{% endfor %}
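If you only need the first article rather than a loop, Django's template dot-lookup also works with list indices, e.g. {{ news.articles.0.title }}.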
