I am trying to read a LinkedIn company page, for example https://www.linkedin.com/company/facebook, and extract the company name, location, industry type, etc.
This is my code:
library(RCurl)
library(rvest)

urlCreate1 <- "https://www.linkedin.com/company/facebook"
parse_rvest <- getURL(urlCreate1, useragent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36")
# parse the fetched page before querying it; 'content' was never defined
nameRest <- read_html(parse_rvest) %>% html_nodes(".industry") %>% html_text()
nameRest
The output I get is character(0), which from previous posts I understand means the .industry node is not found in the HTML I fetched.
I have also tried this:
parse_rvest <- content(GET(urlCreate1), encoding = 'UTF-8')
but it doesn't help.
I have Python code that works, but I need this done in R. This is part of the Python code I found online:
import json
import requests
from lxml import html

url = "https://www.linkedin.com/company/facebook"
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
response = requests.get(url, headers=headers)
# the company data is embedded as JSON inside HTML comments, so strip the comment markers
# (response.text is a str, so the str.replace calls work on Python 3 as well)
formatted_response = response.text.replace('<!--', '').replace('-->', '')
doc = html.fromstring(formatted_response)
datafrom_xpath = doc.xpath('//code[@id="stream-promo-top-bar-embed-id-content"]//text()')
if datafrom_xpath:
    try:
        json_formatted_data = json.loads(datafrom_xpath[0])
        company_name = json_formatted_data['companyName'] if 'companyName' in json_formatted_data.keys() else None
        size = json_formatted_data['size'] if 'size' in json_formatted_data.keys() else None
    except ValueError:  # guard added so this excerpt parses; the original script continues beyond it
        pass
Please help me read the page. I am using SelectorGadget to get the selector (.industry).
Have a look at the LinkedIn API via the Rlinkedin package:
https://cran.r-project.org/web/packages/Rlinkedin/Rlinkedin.pdf
Then, you should be able to easily, and legally, do whatever you want to do.
Here are some ideas to get you started.
http://thinktostart.com/analyze-linkedin-with-r/
https://github.com/hadley/httr/issues/200
https://www.reddit.com/r/datascience/comments/3rufk5/pulling_data_from_linkedin_api/
I am trying to learn web scraping using Python. First I tried to scrape an Amazon page, to find the Best Sellers in Women's Fashion Sneakers.
My code:
import requests
from bs4 import BeautifulSoup

no_pages = 2

def get_data(pageNo):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36",
               "Accept-Encoding": "gzip, deflate",
               "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
               "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}
    r = requests.get('https://www.amazon.com/Best-Sellers-Womens-Fashion-Sneakers/zgbs/fashion/679394011'
                     + str(pageNo) + '?ie=UTF8&pg=' + str(pageNo), headers=headers)  # , proxies=proxies)
    content = r.content
    soup = BeautifulSoup(content, "html.parser")  # pass a parser explicitly
    print(soup)
I did not get any output for this portion. Then I found this, and thought that web scraping is not possible for Amazon. So I changed my source page to monster.com, but still did not get any output.
no_pages = 2

def get_data(pageNo):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36",
               "Accept-Encoding": "gzip, deflate",
               "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
               "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}
    r = requests.get('https://www.monster.com/jobs/search/?q=Software-developer&where=Texas-City__2C-TX'
                     + str(pageNo) + '?ie=UTF8&pg=' + str(pageNo), headers=headers)  # , proxies=proxies)
    content = r.content
    soup = BeautifulSoup(content, "html.parser")
    print(soup)
How can I solve this problem? Thank you.
It seems that your code runs correctly; however, you need to call the function in order to run it. For instance, after your function definition, write the following:
get_data(no_pages)
This will trigger the function to run.
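If the idea behind no_pages is to fetch several pages, a loop over the page numbers would do it. Here is a minimal sketch; the exact page-numbering scheme in those URLs is an assumption:

# Sketch: run get_data once per page, assuming pages are numbered 1..no_pages
for pageNo in range(1, no_pages + 1):
    get_data(pageNo)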
I'm trying to make a program that will log in to a website using this code:
import requests
url = "https://website/cours/login/index.php"
headers = {
"User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Mobile Safari/537.36"
}
brack = requests.post(url, headers=headers, data={"username": "e19****", "password": "0****"})
content = brack.content
print(content)
The thing is that when I POST, nothing happens, as if I had just done a GET. There's no error when I run it (200 OK).
Depending on the API definition, a POST request may not return any content. Since you're getting a 200 response, most probably a JSON object was returned with your POST request.
Try this:
import requests
url = "https://website/cours/login/index.php"
headers = {
"User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Mobile Safari/537.36"
}
response = requests.post(url,headers=headers,
data={"username":"e19****","password":"0****"})
print(response.json())
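If the endpoint actually returns HTML (for example, the login page again), response.json() will raise a ValueError, so it can help to inspect the raw response first. A small sketch:

# Inspect the raw response before assuming it is JSON
print(response.status_code)
print(response.headers.get('Content-Type'))
print(response.text[:500])  # first 500 characters of the body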
Usually, if the POST request cannot give a real-time/streaming ("instant") response with the requested data or content, you might instead get back a session/job ID that you can use to poll for and fetch the completed job at a later time.
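For illustration, that submit-then-poll pattern might look like this with requests (the endpoints and field names here are hypothetical; a real API will define its own):

import time
import requests

# Hypothetical endpoints and field names, for illustration only
submit = requests.post("https://api.example.com/jobs", json={"query": "report"})
job_id = submit.json()["job_id"]  # the server hands back a job/session ID

while True:
    status = requests.get("https://api.example.com/jobs/" + job_id)
    if status.json()["state"] == "done":  # poll until the job completes
        print(status.json()["result"])
        break
    time.sleep(2)  # wait between polls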
These tutorials might be helpful:
https://realpython.com/flask-connexion-rest-api/
https://rapidapi.com/blog/how-to-use-an-api-with-python/
https://www.w3schools.com/tags/ref_httpmethods.asp
I have been trying to find the XPath for the "customers who viewed this product also viewed" section, but I cannot seem to get the code correct. I have very little experience with XPath, but I have been using an online scraper to get the info and learn from it.
What I've been doing is:
import requests
from lxml import html

def AmzonParser(url):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    page = requests.get(url, headers=headers)
    return html.fromstring(page.content)  # parse the HTML so it can be queried with XPath

def scraper(doc):  # doc is the parsed page returned by AmzonParser
    while True:  # 'While' is not valid Python; the keyword is lowercase
        # XPath attributes use @, and the link is the @href attribute of the <a>
        XPATH_RECOMMENDED = '//a[@id="anonCarousel3"]/ol/li[1]/div/a/@href'
        RAW_RECOMMENDED = doc.xpath(XPATH_RECOMMENDED)
        RECOMMENDED = ' '.join(RAW_RECOMMENDED).strip() if RAW_RECOMMENDED else None
My main goal is to get the "customers also viewed" link so I can pass it to the scraper. This is just a snippet of my code.
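For what it's worth, a standalone check of the attribute selection could look like this (a sketch; the product URL is a placeholder and the carousel id is taken from the question, so both may differ on a live page):

import requests
from lxml import html

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
page = requests.get('https://www.amazon.com/dp/EXAMPLE', headers=headers)  # placeholder URL
doc = html.fromstring(page.content)

# @href returns the attribute values themselves, i.e. a list of link strings
links = doc.xpath('//*[@id="anonCarousel3"]//a/@href')
print(links)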
I am trying to scan the HTML from two requests. The first one works, but with the second one the HTML I am trying to locate is not visible; the request gets it wrong.
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
page = requests.get('https://www.amazon.com/gp/aw/ol/B00DZKQSRQ/ref=mw_dp_olp?ie=UTF8&condition=new', headers=headers)

newHeader = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36'}
pagePrice = requests.get('https://www.amazon.com/gp/aw/ol/B01EQJU8AW/ref=mw_dp_olp?ie=UTF8&condition=new', headers=newHeader)
The first request works fine and returns the correct HTML. The second request returns the wrong HTML.
I tried this package, but without success:
https://pypi.python.org/pypi/fake-useragent
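For reference, the basic use of that package looks like this (a sketch; fake-useragent only generates the User-Agent string, which you still pass via headers=):

import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}  # a randomly chosen real-browser UA string
pagePrice = requests.get('https://www.amazon.com/gp/aw/ol/B01EQJU8AW/ref=mw_dp_olp?ie=UTF8&condition=new',
                         headers=headers)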
And I saw this topic, which went unanswered:
Double user-agent tag, "user-agent: user-agent: Mozilla/"
Thank you very much!
I'm using the requests module in Python to try to make a search on the website http://musicpleer.audio/. However, this website appears to be blocking me: it issues nothing but a 403 when I attempt to access it. I'm wondering how I can get around this. I've tried sending it the user agent of my web browser (Chrome) and it still returns error 403. Any suggestions on how I could get around this, and an example of downloading a song from the site, would be very helpful. Thanks in advance.
My code:
import requests, os

def funGetList():  # 'def funGetList:' is a syntax error; a def needs parentheses
    start_path = 'C:/Users/Jordan/Music/'  # current directory
    list = []
    for path, dirs, files in os.walk(start_path):
        for filename in files:
            temp = (os.path.join(path, filename))
            tempLen = len(temp)
            # print(tempLen)
            iterate = 0
            list.append(temp[22:(len(temp)) - 4])

def funDownloadMP3():
    for i in list:  # note: this is not the list built inside funGetList; it is local to that function
        print(i)
        payload = {'searchQuery': 'meme', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
        url = 'http://musicpleer.audio/'
        print(requests.post(url, data=payload))
Putting the User-Agent in the headers seems to work:
In []:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
url = 'http://musicpleer.audio/'
r = requests.get('{}#!{}'.format(url, 'meme'), headers=headers)
r.status_code
Out[]:
200
Note: it looks like the search URL is simply '#!<search-term>'.
HTTP 403 Forbidden error code.
The server might be expecting some more request headers, like Host or Cookie. You might want to use Postman to debug it with ease.
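For example, extra headers can be attached to every request by using a session, which also keeps any cookies the server sets (a sketch; which headers this particular server actually checks is unknown):

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
})
r = session.get('http://musicpleer.audio/')  # cookies from this response are reused on later requests
print(r.status_code)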