I created a scraper, but I keep struggling with one part: getting the keywords associated with a movie/tv-show title.
I have a df with the following urls
keyword_link_list = ['https://www.imdb.com/title/tt7315526/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt11723916/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt7844164/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt2034855/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt11215178/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt10941266/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt13210836/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt0913137/keywords?ref_=tt_ql_sm']
df = pd.DataFrame({'keyword_link':keyword_link_list})
print(df)
Then, I'd like to loop through the column keyword_link, get all the keywords, and add them to a dictionary. I managed to get all the keywords, but I cannot manage to add them to a dictionary. It seems like a simple problem, but I'm not seeing what I'm doing wrong (after hours of struggling). Many thanks in advance for your help!
# Import packages
import requests
import re
from bs4 import BeautifulSoup
import bs4 as bs
import pandas as pd
# Loop through column keyword_link and get the keywords for each link
keyword_dicts = []
for index, row in df.iterrows():
    keyword_link = row['keyword_link']
    print(keyword_link)
    headers = {"Accept-Language": "en-US,en;q=0.5"}
    r = requests.get(keyword_link, headers=headers)
    html = r.text
    soup = bs.BeautifulSoup(html, 'html.parser')
    elements = soup.find_all('td', {'class': "soda sodavote"})
    for element in elements:
        for keyword in element.find_all('a'):
            keyword = keyword['href']
            keyword = re.sub(r'\/search/keyword\?keywords=', '', keyword)
            keyword = re.sub(r'\?item=kw\d+', '', keyword)
            print(keyword)

keyword_dict = {}
keyword_dict['keyword'] = keyword
keyword_dicts.append(keyword_dict)
print(keyword_dicts)
Update
After running the definition, I get the following error:
Note: because the expected output is not entirely clear and could be improved, this example only deals with operating on your list. You can use the output to create a dataframe, lists, and so on.
What happens?
Your dictionary is defined after the loop, so nothing is stored while iterating and your list ends up containing just [{'keyword': ''}].
How to fix?
Append your dictionary while iterating over the keywords.
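For example, a minimal sketch of that fix, reusing the imports and selectors from the question (untested against the current IMDb markup):
import re
import requests
import bs4 as bs

headers = {"Accept-Language": "en-US,en;q=0.5"}
keyword_dicts = []
for keyword_link in df['keyword_link']:  # df from the question above
    soup = bs.BeautifulSoup(requests.get(keyword_link, headers=headers).text, 'html.parser')
    for element in soup.find_all('td', {'class': "soda sodavote"}):
        for a_tag in element.find_all('a'):
            keyword = re.sub(r'\/search/keyword\?keywords=', '', a_tag['href'])
            keyword = re.sub(r'\?item=kw\d+', '', keyword)
            # build and append the dict per keyword, while iterating
            keyword_dicts.append({'keyword': keyword})
print(keyword_dicts)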
Alternative approach:
However, you do not need a dataframe, and only one line is needed to get your keywords:
keywords = [e.a.text for e in soup.select('[data-item-keyword]')]
The following example shows some variations on how and what could be collected:
Collect just the keywords separated by whitespace:
[e.a.text for e in soup.select('[data-item-keyword]')]
Collect the same keywords separated by "-" as in the url:
['-'.join(x.split()) for x in keywords]
Collect keywords and votes, which may also be interesting:
[{'keyword':k,'votes':v} for k,v in zip(keywords,votes)]
Example
import requests, time
from bs4 import BeautifulSoup
import pandas as pd
keyword_link_list = ['https://www.imdb.com/title/tt7315526/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt11723916/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt7844164/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt2034855/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt11215178/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt10941266/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt13210836/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt0913137/keywords?ref_=tt_ql_sm'
]
def cook_soup(url):
    # do not harm the website - add some delay
    # time.sleep(2)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.5'
    }
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    return soup

data = []
for url in keyword_link_list:
    soup = cook_soup(url)
    keywords = [e.a.text for e in soup.select('[data-item-keyword]')]
    votes = [e['data-item-votes'] for e in soup.select('[data-item-votes]')]
    data.append({
        'url': url,
        'keywords': keywords,
    })

print(data)
### pd.DataFrame(data)
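If you do want a dataframe at the end, one possible follow-up (continuing from the data list built above, and assuming you want one row per keyword) is:
df = pd.DataFrame(data)            # one row per url, keywords stored as a list
df_long = df.explode('keywords')   # one row per (url, keyword) pair
print(df_long.head())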
The problem with your code is that you're not saving the keywords inside the loop. Also, instead of iterating over dataframe rows, create a function that does what you want and apply it to the keyword_link column.
def get_keywords(row):
    headers = {"Accept-Language": "en-US,en;q=0.5"}
    r = requests.get(row, headers=headers)
    # ^^^ use row instead of keyword_link here
    html = r.text
    soup = bs.BeautifulSoup(html, 'html.parser')
    elements = soup.find_all('td', {'class': "soda sodavote"})
    keyword_dict = {'keyword': []}
    # ^^^ declare the dict here
    for element in elements:
        for keyword in element.find_all('a'):
            keyword = keyword['href']
            keyword = re.sub(r'\/search/keyword\?keywords=', '', keyword)
            keyword = re.sub(r'\?item=kw\d+', '', keyword)
            if keyword:
                keyword_dict['keyword'].append(keyword)
                # ^^^ append inside the loop
    return keyword_dict
However, it might be better to store a plain list of keywords, since the 'keyword' key is really doing nothing here.
Anyway, then you can use it as
df['keywords'] = df['keyword_link'].apply(get_keywords)
Now, if you need a list of the keyword dictionaries, you can do
keyword_dicts = df['keywords'].tolist()
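A sketch of that simpler variant, returning a plain list and reusing the imports from the question (not tested against the live page):
def get_keyword_list(url):
    headers = {"Accept-Language": "en-US,en;q=0.5"}
    soup = bs.BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')
    keywords = []
    for element in soup.find_all('td', {'class': "soda sodavote"}):
        for a_tag in element.find_all('a'):
            keyword = re.sub(r'\/search/keyword\?keywords=', '', a_tag['href'])
            keyword = re.sub(r'\?item=kw\d+', '', keyword)
            if keyword:
                keywords.append(keyword)
    return keywords

df['keywords'] = df['keyword_link'].apply(get_keyword_list)
keyword_lists = df['keywords'].tolist()  # a list of keyword lists, one per title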
I am making Japanese flashcards by scraping this website.
My plan is to format a text file with the kanji, its 3 example words, the hiragana reading on top of each of the words, and the English translation below it.
I want it to look like this:
kanji {word1},{hiragana},{english translation}
{word2},{hiragana},{english translation}
{word3},{hiragana},{english translation}
Example:
福 祝福 祝福,しゅくふ,blessing
幸福,こうふく,happiness; well-being; joy; welfare; blessedness
裕福,ゆうふく,wealthy; rich; affluent; well-off
So far I am trying it just with the website I mentioned, and eventually I will loop over a list of kanji characters I have. However, I am not sure how to extract the text I want from the website. I know BeautifulSoup can be used, but I don't know what to put in the function to get the text I want.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
word1_list = []
word2_list = []
word3_list = []
kanji = '福'
url = f'https://jpdb.io/search?q={kanji}+%23kanji&lang=english#a'
session = HTMLSession()
response = session.get(url)
# uncertain what I should put here
soup = BeautifulSoup(response.html.html, 'html.parser')
words = soup.select('div.jp')  # uncertain what I should put here
word1_list.append(words)  # I want to try putting the data I want here
Here is one way of getting the information you're after:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'https://jpdb.io/search?q=%E7%A6%8F+%23kanji&lang=english#a'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
soup = bs(requests.get(url, headers=headers).text, 'html.parser')
translation = soup.select_one('h6:-soup-contains("Keyword")').find_next('div').get_text(strip=True)
print(translation)
big_list = []
japanese_texts = soup.select('div[class="jp"]')
for j in japanese_texts:
    japanese_text = j.get_text(strip=True)
    translation = j.find_next_sibling('div').get_text(strip=True)
    big_list.append((japanese_text, translation))

df = pd.DataFrame(big_list, columns=['Japanese', 'English'])
print(df)
Result in terminal:
good fortune
Japanese English
0                        祝福                                           blessing
1                        幸福     happiness; well-being; joy; welfare; bless...
2                        裕福                  wealthy; rich; affluent; well-off
3      愛すること、そして愛されること...  To love and to be loved is the greatest happin...
4              彼は幸福であるようだ。                                    He seems happy.
5  幸福な者もあれば、また不幸な者も...                    Some are happy; others unhappy.
BeautifulSoup documentation can be found here. Also, try to avoid using deprecated packages: requests-html was last released on Feb 17, 2019, so it's pretty much unmaintained.
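To get from the dataframe above to the flashcard text format described in the question, a rough sketch could look like the following; note that the hiragana reading is not scraped by the snippet above, so this only writes the Japanese text and its English gloss (an extra selector for the reading would be an assumption about jpdb's markup):
# write one comma-separated flashcard line per scraped (Japanese, English) pair
with open('flashcards.txt', 'w', encoding='utf-8') as f:
    for japanese, english in big_list:
        f.write(f'{japanese},{english}\n')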
I'm trying to create a dataframe from web scraping. Specifically: from a search for a topic on GitHub, the objective is to retrieve the name of the owner of the repo, the link, and the about text.
I have several problems.
1. The search shows that there are, for example, more than 300,000 repos, but my scraping only gets the information for 90. I would like to scrape all available repos.
2. Sometimes the about is empty, which breaks the dataframe creation with
ValueError: All arrays must be of the same length
3. The names my code extracts are completely strange.
My code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 5.1.1; SM-G928X Build/LMY47X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.83 Mobile Safari/537.36'}
search_topics = "https://github.com/search?p="

stock_urls = []
stock_names = []
stock_about = []

for page in range(1, 99):
    req = requests.get(search_topics + str(page) + "&q=" + "nlp" + "&type=Repositories", headers=headers)
    soup = BeautifulSoup(req.text, "html.parser")

    # about
    for about in soup.select("p.mb-1"):
        stock_about.append(about.text)

    # urls
    for url in soup.findAll("a", attrs={"class": "v-align-middle"}):
        link = url['href']
        complete_link = "https://github.com" + link
        stock_urls.append(complete_link)

    # profile name
    for url in soup.findAll("a", attrs={"class": "v-align-middle"}):
        link = url['href']
        names = re.sub(r"\/(.*)\/(.*)", "\1", link)
        stock_names.append(names)

dico = {"name": stock_names, "url": stock_urls, "about": stock_about}
#df = pd.DataFrame({"name": stock_names, "url": stock_urls, "about": stock_about})
df = pd.DataFrame.from_dict(dico)
My output:
ValueError: All arrays must be of the same length
Lazy fix: zip the lists together so that the columns are cropped to the length of the shortest one, and build the DataFrame from a list of dictionaries formed with a list comprehension:
pd.DataFrame([{'name': n, 'url': u, 'about': a} for n, u, a
in zip(stock_names, stock_urls, stock_about)])
But there's a problem that would then be ignored: if the lists don't line up, how do you know that stock_names[i], stock_urls[i], and stock_about[i] belong to the same repo? The lists get out of sync because some repos are missing an "about" section, and since the lists are built separately, you have no way of figuring out which ones.
That's why it's better to merge the loops: iterate over the containers of the individual results and build dico up as a list of dictionaries right from the start, result by result:
dico = []
for page in range(1, 99):
    req = requests.get(search_topics + str(page) + "&q=" + "nlp" + "&type=Repositories", headers=headers)
    soup = BeautifulSoup(req.text, "html.parser")
    # for repo in soup.find('ul', class_="repo-list").find_all('li'):
    for repo in soup.select('ul.repo-list>li:has(a.v-align-middle[href])'):
        link = repo.select_one('a.v-align-middle[href]')
        about = repo.select_one('p.mb-1')
        dico.append({
            # 'name': re.sub(r"\/(.*)\/(.*)", "\1", link.get('href')),
            'name': ' by '.join(link.text.strip().split('/', 1)[::-1]),
            'url': "https://github.com" + link.get('href'),
            'about': about.text.strip() if about else None
        })

df = pd.DataFrame(dico)
df then contains one row per repo, with name, url, and about columns.
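As for problem 3 (the strange names): in re.sub(r"\/(.*)\/(.*)", "\1", link) the replacement "\1" is not a raw string, so Python interprets it as the control character \x01 rather than a backreference. A raw string, or simply splitting the href, avoids that; a tiny illustration with a made-up href:
import re

link = '/example-user/example-repo'
print(re.sub(r"/(.*)/(.*)", "\1", link))   # '\x01' - the backslash escape is consumed by Python
print(re.sub(r"/(.*)/(.*)", r"\1", link))  # 'example-user' - the raw string keeps the backreference
print(link.split('/')[1])                  # 'example-user' - simpler alternative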
I am trying to run this code in IDLE 3.10.6 and I am not seeing any of the data that should be extracted from Indeed. All of this data should be in the output when I run it, but it isn't. Below is the code:
# Indeed data
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract(page):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko"}
    url = "https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}"
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup

def transform(soup):
    divs = soup.find_all("div", class_="jobsearch-SerpJobCard")
    for item in divs:
        title = item.find("a").text.strip()
        company = item.find("span", class_="company").text.strip()
        try:
            salary = item.find("span", class_="salarytext").text.strip()
        finally:
            salary = ""
        summary = item.find("div", {"class": "summary"}).text.strip().replace("\n", "")
        job = {
            "title": title,
            "company": company,
            "salary": salary,
            "summary": summary
        }
        joblist.append(job)

joblist = []
for i in range(0, 40, 10):
    print(f'Getting page, {i}')
    c = extract(10)
    transform(c)

df = pd.DataFrame(joblist)
print(df.head())
df.to_csv('jobs.csv')
Here is the output I get
Getting page, 0
Getting page, 10
Getting page, 20
Getting page, 30
Empty DataFrame
Columns: []
Index: []
Why is this happening, and what should I do to get the extracted data from Indeed? What I am trying to get is the job title, company, salary, and summary information. Any help would be greatly appreciated.
The URL string includes {page}, but it's not an f-string, so it's not being interpolated, and the URL you are fetching is literally:
https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}
That returns an error page.
So you should add an f before the opening quote when you set url.
Also, you are calling extract(10) each time, instead of extract(i).
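Putting those two fixes together (everything else unchanged from the question's code):
def extract(page):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko"}
    # note the f prefix, so {page} is actually interpolated into the URL
    url = f"https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}"
    r = requests.get(url, headers)
    return BeautifulSoup(r.content, "html.parser")

for i in range(0, 40, 10):
    print(f'Getting page, {i}')
    transform(extract(i))  # pass the loop variable i, not the constant 10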
This is the correct way of building the url:
url = "https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}".format(page=page)
r = requests.get(url,headers)
Here r.status_code gives an error 403, which means the request is forbidden: the site is blocking your request. Consider using the Indeed job search API instead.
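One related detail worth checking: in the code above the headers dict is passed positionally, so requests treats it as query parameters rather than headers. Passing it as a keyword argument at least sends the intended User-Agent, although Indeed may still block plain requests:
r = requests.get(url, headers=headers)  # keyword argument; passed positionally it becomes params
print(r.status_code)                    # confirm whether you still get 403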
I am new to scraping. I have been asked to get a list of store numbers, cities, and states from the website https://www.lowes.com/Lowes-Stores.
Below is what I have tried so far. Since the elements I need don't have a distinctive attribute, I am not sure how to continue my code. Please guide me!
import requests
from bs4 import BeautifulSoup
import json
from pandas import DataFrame as df
url = "https://www.lowes.com/Lowes-Stores"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
page = requests.get(url, headers=headers)
page.encoding = 'ISO-885901'
soup = BeautifulSoup(page.text, 'html.parser')
lowes_list = soup.find_all(class_="list unstyled")

for i in lowes_list[:2]:
    print(i)
example = lowes_list[0]
example_content = example.contents
example_content
You've found the list elements that contain the links you need for the state store lookups in your for loop. You now need to get the href attribute from the "a" tag inside each "li" element.
This is only the first step since you'll need to follow those links to get the store results for each state.
Since you know the structure of these state link lists, you can simply do:
for i in lowes_list:
list_items = i.find_all('li')
for x in list_items:
for link in x.find_all('a'):
print(link['href'])
There are definitely more efficient ways of doing this, but the list is very small and this works.
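For instance, a more compact variant of the same idea, assuming the same "list unstyled" containers found above, is a single CSS selector:
# collect every state link href in one pass (same soup object as above)
state_stores_links = [a['href'] for a in soup.select('.list.unstyled li a[href]')]
print(len(state_stores_links), state_stores_links[:5])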
Once you have the links for each state, you can create another request for each one to visit those store results pages, and then obtain the href attribute from the search result links on each state's page. The link for each store (for example, the Anchorage Lowe's) contains the city and the store number.
Here is a full example. I included lots of comments to illustrate the points.
You pretty much had everything up to the point where you collected lowes_list, but you needed to follow the links for each state. A good technique for approaching these problems is to test the path out in your web browser first with the dev tools open, watching the HTML, so you have a good idea of where to start with the code.
This script will obtain the data you need, but doesn't provide any data presentation.
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.lowes.com/Lowes-Stores"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}

page = requests.get(url, headers=headers, timeout=5)
page.encoding = "ISO-8859-1"
soup = bs(page.text, "html.parser")

lowes_state_lists = soup.find_all(class_="list unstyled")

# we will store the links for each state in this array
state_stores_links = []

# now we populate the state_stores_links array by finding the href in each li tag
for ul in lowes_state_lists:
    list_items = ul.find_all("li")
    # now we have all the list items from the page, we have to extract the href
    for li in list_items:
        for link in li.find_all("a"):
            state_stores_links.append(link["href"])

# This next part is what the original question was missing: following the state links to their respective search result pages.
# at this point we have to request a new page for each state and store the results
# you can use pandas, but a dict works too.
states_stores = {}

for link in state_stores_links:
    # splitting up the link on the / gives us the parts of the URLs.
    # by inspecting with Chrome DevTools, we can see that each state follows the same pattern (state name and state abbreviation)
    link_components = link.split("/")
    state_name = link_components[2]
    state_abbreviation = link_components[3]

    # let's use the state_abbreviation as the dict's key, and we will have a stores array that we can do reporting on
    # the type and shape of this dict is irrelevant at this point. This example illustrates how to obtain the info you're after
    # in the end the states_stores[state_abbreviation]['stores'] array will hold dicts, each with a store_number and a city key
    states_stores[state_abbreviation] = {"state_name": state_name, "stores": []}

    try:
        # simple error catching in case something goes wrong, since we are sending many requests.
        # our link is just the second half of the URL, so we have to craft the new one.
        new_link = "https://www.lowes.com" + link
        state_search_results = requests.get(new_link, headers=headers, timeout=5)
        stores = []
        if state_search_results.status_code == 200:
            store_directory = bs(state_search_results.content, "html.parser")
            store_directory_div = store_directory.find("div", class_="storedirectory")
            # now we get the links inside the storedirectory div
            individual_store_links = store_directory_div.find_all("a")
            # we now have all the stores for this state! Let's parse and save them into our store dict
            # the store's city is after the state's abbreviation followed by a dash, the store number is the last thing in the link
            # example: "/store/AK-Wasilla/2512"
            for store in individual_store_links:
                href = store["href"]
                try:
                    # by splitting the href, which looks to be consistent throughout the site, we can get the info we need
                    split_href = href.split("/")
                    store_number = split_href[3]
                    # the store city is after the -, so we have to split that element up into its two parts and access the second part.
                    store_city = split_href[2].split("-")[1]
                    # creating our store dict
                    store_object = {"city": store_city, "store_number": store_number}
                    # adding the dict to our state's dict
                    states_stores[state_abbreviation]["stores"].append(store_object)
                except Exception as e:
                    print(
                        "Error getting store info from {0}. Exception: {1}".format(
                            split_href, e
                        )
                    )
            # let's print something so we can confirm our script is working
            print(
                "State store count for {0} is: {1}".format(
                    states_stores[state_abbreviation]["state_name"],
                    len(states_stores[state_abbreviation]["stores"]),
                )
            )
        else:
            print(
                "Error fetching: {0}, error code: {1}".format(
                    link, state_search_results.status_code
                )
            )
    except Exception as e:
        print("Error fetching: {0}. Exception: {1}".format(state_abbreviation, e))
I'm trying to scrape a website for job postings data, and the output currently looks like this:
[{'job_title': 'Junior Data Scientist',
  'company': '\n\n BBC',
  'summary': "\n We're now seeking a Junior Data Scientist to come and work with our Marketing & Audiences team in London. The Data Science team are responsible for designing...",
  'link': 'www.jobsite.com',
  'summary_text': "Job Introduction\nImagine if Netflix, The Huffington Post, ESPN, and Spotify were all rolled into one....etc
I want to create a dataframe, or a CSV, that looks like this:
right now, this is the loop I'm using:
for page in pages:
    source = requests.get('https://www.jobsite.co.uk/jobs?q=data+scientist&start='.format()).text
    soup = BeautifulSoup(source, 'lxml')

    results = []
    for jobs in soup.findAll(class_='result'):
        result = {
            'job_title': '',
            'company': '',
            'summary': '',
            'link': '',
            'summary_text': ''
        }
and after using the loop, I just print the results.
What would be a good way to get the output in a dataframe? Thanks!
Look at the pandas DataFrame API. There are several ways you can initialize a dataframe:
list of dictionaries
list of lists
You just need to append either a list or a dictionary to a global variable, and you should be good to go.
results = []
for page in pages:
    source = requests.get('https://www.jobsite.co.uk/jobs?q=data+scientist&start='.format()).text
    soup = BeautifulSoup(source, 'lxml')
    for jobs in soup.findAll(class_='result'):
        result = {
            'job_title': '',  # assuming this has a value like in the example in your question
            'company': '',
            'summary': '',
            'link': '',
            'summary_text': ''
        }
        results.append(result)

# results is now a list of dictionaries
df = pandas.DataFrame(results)
One other suggestion: don't dump this into a dataframe within the same program. Dump all your HTML files into a folder first, and then parse them again. This way, if you need more information from the page which you hadn't considered before, or if the program terminates due to some parsing error or timeout, the work is not lost. Keep parsing separate from crawling logic.
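A rough sketch of that crawl/parse split (the folder name, page range, and URL pattern here are just an illustration):
import os
import requests
from bs4 import BeautifulSoup

os.makedirs('pages', exist_ok=True)

# crawl: save the raw HTML first, so nothing is lost if parsing fails later
for page in range(1, 4):
    url = 'https://www.jobsite.co.uk/jobs?q=data+scientist&page={}'.format(page)
    with open('pages/page_{}.html'.format(page), 'w', encoding='utf-8') as f:
        f.write(requests.get(url).text)

# parse: work from the saved files, re-runnable without hitting the site again
for name in sorted(os.listdir('pages')):
    with open(os.path.join('pages', name), encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'lxml')
    print(name, soup.title.text if soup.title else 'no title')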
I think you need to define the number of pages and add that into your url (ensure you have a placeholder for the value, which I don't think your code, nor the other answer, has). I have done this by extending your url to include a page parameter in the querystring, which incorporates a placeholder.
Is your selector of class result correct? You could certainly also use for job in soup.select('.job'):. You then need to define appropriate selectors to populate the values. I think it is easier to grab all the job links for each page, then visit each page and extract the values from a JSON-like string in the page. Add a Session to re-use the connection.
Explicit waits are required to prevent being blocked.
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
results = []
links = []
pages = 3

with requests.Session() as s:
    for page in range(1, pages + 1):
        try:
            url = 'https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page={}'.format(page)
            source = s.get(url, headers=headers).text
            soup = bs(source, 'lxml')
            links.append([link['href'] for link in soup.select('.job-title a')])
        except Exception as e:
            print(e, url)
        finally:
            time.sleep(2)

    final_list = [item for sublist in links for item in sublist]

    for link in final_list:
        source = s.get(link, headers=headers).text
        soup = bs(source, 'lxml')
        data = soup.select_one('#jobPostingSchema').text  # json-like string containing all info
        item = json.loads(data)
        result = {
            'Title': item['title'],
            'Company': item['hiringOrganization']['name'],
            'Url': link,
            'Summary': bs(item['description'], 'lxml').text
        }
        results.append(result)
        time.sleep(1)

df = pd.DataFrame(results, columns=['Title', 'Company', 'Url', 'Summary'])
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig', index=False)
Sample of results:
I can't imagine you want all pages, but you could use something similar to:
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
results = []
links = []
pages = 0

def get_links(url, page):
    page_links = []  # initialise so the return works even if the request fails
    try:
        source = s.get(url, headers=headers).text
        soup = bs(source, 'lxml')
        page_links = [link['href'] for link in soup.select('.job-title a')]
        if page == 1:
            global pages
            pages = int(soup.select_one('.page-title span').text.replace(',', ''))
    except Exception as e:
        print(e, url)
    finally:
        time.sleep(1)
    return page_links

with requests.Session() as s:
    links.append(get_links('https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page=1', 1))
    for page in range(2, pages + 1):
        url = 'https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page={}'.format(page)
        links.append(get_links(url, page))

    final_list = [item for sublist in links for item in sublist]

    for link in final_list:
        source = s.get(link, headers=headers).text
        soup = bs(source, 'lxml')
        data = soup.select_one('#jobPostingSchema').text  # json-like string containing all info
        item = json.loads(data)
        result = {
            'Title': item['title'],
            'Company': item['hiringOrganization']['name'],
            'Url': link,
            'Summary': bs(item['description'], 'lxml').text
        }
        results.append(result)
        time.sleep(1)

df = pd.DataFrame(results, columns=['Title', 'Company', 'Url', 'Summary'])
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig', index=False)