I'm trying to scrape a Twitter account image. I've tried multiple ways, but the output keeps giving me an empty list!
My Code:
import requests
from bs4 import BeautifulSoup
url = requests.get('https://twitter.com/jack/photo')
soup = BeautifulSoup(url.text, 'lxml')
image = soup.find_all('img')
print(image)
Output:
[]
That's part of my project. I tried lxml and finding by class, but I still get nothing. Maybe I'm missing something, but I don't know what it is.
If anyone can help me with it, I would really appreciate it.
Thanks in advance
I can see some React being used on the page. If you open the page and inspect the elements, you will see that as soon as you click on the photo to enlarge it, a new div appears as if from thin air, which implies that it gets created by React.
To address this, you will need to use Selenium to open the page in a real (or headless) browser, let the JavaScript do its magic, and then look for the img tag.
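A minimal sketch of that approach, assuming chromedriver is installed and on your PATH (Twitter's markup changes often, so this may need adjusting):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get('https://twitter.com/jack/photo')
# Wait until the JavaScript has rendered at least one <img>, then collect the sources.
images = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.TAG_NAME, 'img')))
print([img.get_attribute('src') for img in images])
driver.quit()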
You're trying to scrape the path for the JavaScript version of Twitter. If you examine the response of your page, you will see the following snippet:
<form action="https://mobile.twitter.com/i/nojs_router?path=%2Fjack%2Fphoto" method="POST" style="background-color: #fff; position: fixed; top: 0; left: 0; right: 0; bottom: 0; z-index: 9999;">
<div style="font-size: 18px; font-family: Helvetica,sans-serif; line-height: 24px; margin: 10%; width: 80%;">
<p>We've detected that JavaScript is disabled in your browser. Would you like to proceed to legacy Twitter?</p>
<p style="margin: 20px 0;">
<button type="submit" style="background-color: #1da1f2; border-radius: 100px; border: none; box-shadow: none; color: #fff; cursor: pointer; font-size: 14px; font-weight: bold; line-height: 20px; padding: 6px 16px;">Yes</button>
</p>
</div>
</form>
I would recommend disabling JavaScript in your browser and then figuring out how to view the photos like that. Then you could mimic those requests using requests.
What worked for me was sending a request to this path:
https://mobile.twitter.com/jack
Then using the CSS selector for class="avatar". There should be one child, an image tag; grab the src of that image tag, and that should be the link to your photo.
As requested, here is the python code I used:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://mobile.twitter.com/jack')
soup = BeautifulSoup(response.text, 'lxml')
avatars = soup.find_all("td", {"class": "avatar"})  # the profile photo sits inside a td.avatar cell
print(avatars[0].find("img").get("src"))
Note: Twitter changes their layout frequently, so this may not work for long.
I have several browsers open and each has a webpage already open that has some div and canvas elements.
<div class="webgl-content">
<div id="game" class="game" style="padding: 0px; margin: 0px; border: 0px; background: rgb(25, 39, 54);">
<canvas id="#canvas" style="width: 100%; height: 100%; cursor: default;" width="1400" height="800"></canvas>
</div>
</div>
How can I change the height: 100% to a different value (say 77%) in all open browser windows using Python?
I am thinking of something that can identify the open browsers, then find the canvas element and change the style's height percentage to a new value.
I've been playing around with the pywinauto library, but not much progress so far: I get errors indicating that the Chrome application was not found.
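One hedged sketch of that idea, assuming each Chrome window was launched with a remote-debugging port (Selenium can't attach to ordinary already-open windows, which would also explain the "Chrome application was not found" errors):
from selenium import webdriver

# Attach to a Chrome instance started with --remote-debugging-port=9222.
options = webdriver.ChromeOptions()
options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
driver = webdriver.Chrome(options=options)

# Note the canvas id is literally "#canvas" in the markup above.
driver.execute_script("document.getElementById('#canvas').style.height = '77%';")
Repeat with a different debugging port for each open browser window.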
I have several icons that I want to line up at the end of the application, so that when I click on an image, I am taken to a link. How should I do it?
So far, I have only managed to add this through st.markdown, but the icons are arranged vertically, because I added a new item every time I wrote markdown.
You can create custom components in streamlit using HTML. Maybe you can create a social media component.
Create a file my_component.html
<html>
<head>
<style>
body {
height: 64px;
}
.parent {
width: 100%;
height: auto;
display: flex;
justify-content: center;
}
.child {
margin: 5px;
height: 32px;
width: 32px;
}
</style>
</head>
<body>
<div class="parent">
<a class = "child" href="https://www.google.com"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Google_%22G%22_Logo.svg/1200px-Google_%22G%22_Logo.svg.png" alt="alt" style="width:32px;height:32px;"></a>
<a class = "child" href="https://wwww.reddit.com"><img src="https://www.redditinc.com/assets/images/site/reddit-logo.png" alt="alt" style="width:32px;height:32px;"></a>
<a class = "child" href="https://wwww.facebook.com"><img src="https://facebookbrand.com/wp-content/uploads/2019/04/f_logo_RGB-Hex-Blue_512.png?w=512&h=512" alt="alt" style="width:32px;height:32px;"></a>
</div>
</body>
</html>
I've added 3 links, to Google, Reddit, and Facebook respectively. Add to or edit these to something custom.
In the streamlit file, you can import HTML files as components using the components library. The implementation I'm sharing is a very simplified version.
import streamlit as st
import streamlit.components.v1 as components
# Read the component's HTML so it can be embedded below.
with open("my_component.html", 'r', encoding='utf-8') as f:
    source_code = f.read()
st.text("Navbar Component")
components.html(source_code)
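components.html renders the snippet inside an iframe with a fixed default height, so if the icons get clipped you can pass the height parameter explicitly (height is part of the components.html signature):
components.html(source_code, height=64)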
It's a bit basic, but it yields a horizontal row of clickable icons.
I am able to click on the dropdown heading, but not able to click on the options. Also, I am not able to identify the id or XPath for the options that become visible after clicking the dropdown.
Find the HTML below:
<div id="object260310" style="position: absolute; overflow: hidden; background: transparent; z-index: 50; left: 154px; top: 5px; width: 74px; height: 20px;">
<div id="object351" style="position: absolute; z-index: 11; left: 0px; top: 0px; width: 74px; height: 20px; background-color: rgb(255, 255, 255);">
<div style="position: absolute; width: 42px;">
<div role="menu" aria-label="1" class="font89" style="padding-left: 0px; cursor: pointer; position: absolute; left: 0px; color: rgb(126, 126, 126); width: 43px; height: 20px; line-height: 20px; background: transparent;" onclick="plw.menu.click(this,351,0,"1",true);" onmouseenter="plw.menu.over(event,this,351,0,"1");this.style.color="rgb(174,174,174)";this.style.backgroundColor="rgb(255,255,255)"" onmouseleave="plw.menu.out(351);this.style.background="transparent";this.style.color="rgb(126,126,126)";">
<div style="position:absolute;left:0px;top:2px" class="image347 "></div>
<span style="position:relative;left:21px;top:0px">New</span>
</div>
</div>
</div>
</div>
Below is my Selenium code:
new_create = WebDriverWait(driver, 40).until(
    EC.presence_of_all_elements_located((By.XPATH, '/html/body/div[2]/div[2]/div[5]/div/div/div/div')))
driver.find_element_by_id("object260310").click()
# It's working fine till here.
driver.find_element_by_xpath(".//*[contains(@onclick, '231')]").click()
# This line doesn't seem to work.
Clicking dropdown options has been iffy for me before. This is a thing I do when there aren't any other solutions:
Click the dropdown, then send keys depending on the first letter of the option you need. So if an option is "action", you press "a".
This depends heavily on what you have in the dropdown, though.
EDIT: I would highly recommend looking at all the related questions there... They have some stuff you might be able to use. As I said, this is only if those don't work!
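A minimal sketch of that keyboard approach (hypothetical; a custom div-based menu like this one may ignore keystrokes entirely):
from selenium.webdriver.common.keys import Keys

dropdown = driver.find_element_by_id("object260310")
dropdown.click()
dropdown.send_keys("n")  # jump to the option starting with "n" ("New")
dropdown.send_keys(Keys.ENTER)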
You can try to click on the drop-down first ("New" in your case).
If the dropdown option is not selected, then use the code below to select the value.
driver.find_element_by_xpath("//div[text()='Product Variation']").click()
driver.find_element_by_xpath("//div[contains(text(), 'Product Variation')]").click()
I think the XPath should be written that way. Will you send me the URL, and I'll make sure I can click before I make another edit on this post?
Thank you all for your help. I was able to find the XPath using the Chrome XPath Helper extension. The extension gave the below XPath:
/html/body/div[@id='m235e0-SUB-1']/table[@class='oldmenu']/tbody/tr[@id='235-0-SUB-1-1']/td[@class='oldmenu']
Using this I was able to identify the id for the sub-menu, i.e. '235-0-SUB-1-1'.
So I modified the code accordingly, as below:
driver.find_element_by_xpath('//*[@id="235-0-SUB-1-1"]').click()
I am trying to scrape data from Yellow Pages, the website being this one.
I want the div with class="search-results listing-group".
I tried this:
parent = soup.find('div', {'class': 'search-results listing-group'})
But I am not getting any results.
This URL has anti-scraping protections in place, which resist programmatic HTML extraction. That's the main reason why you're not getting any output. You can see this by examining the raw data returned from a request:
from bs4 import BeautifulSoup
import requests
url = "https://www.yellowpages.com.au/find/boat-yacht-sales/melbourne-vic"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
print(soup)
Excerpt:
This page appears when online data protection services detect requests coming from your computer network which appear to be in violation of our website's terms of use.
Are you using requests?
It appears the webpage does not allow automated scraping, at least using Beautiful Soup.
I tried scraping it for you and this is what I see in the content.
<p style="font-weight: bold;">Why did this happen?</p>
<p style="margin-top: 20px;">This page appears when online data protection services detect requests coming from your computer network which appear to be in violation of our website's terms of use.</p>
</div>, <div style="border-bottom: 1px #E7E7E7 solid;
margin-top: 20px;
margin-bottom: 20px;
height: 1px;
width: 100%;">
</div>, <div style="margin-left: auto;
margin-right: auto;
font-size: 20px;
max-width: 460px;
text-align: center;">
We value the quality of content provided to our customers, and to maintain this, we would like to ensure real humans are accessing our information.</div>, <div style="margin-left: auto;
margin-right: auto;
margin-top: 30px;
max-width: 305px;">
You might have to try other (legitimate) methods of scraping it.
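For instance, a common first step (a sketch only, and not guaranteed to get past protections like this) is to send browser-like headers with requests:
import requests

# Present a browser-like User-Agent instead of the default python-requests one.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.get('https://www.yellowpages.com.au/find/boat-yacht-sales/melbourne-vic',
                        headers=headers)
print(response.status_code)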
It seems the page you are accessing doesn't allow static scraping; you need to use advanced scraping with Selenium, like this:
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.yellowpages.com.au/find/boat-yacht-sales/melbourne-vic"
driver = webdriver.Chrome(executable_path="{location}/chromedriver")
driver.get(url)
# The class attribute contains a space, so match it with XPath rather than by class name.
content_element = driver.find_elements_by_xpath("//div[@class='search-results listing-group']")
content_html = content_element[0].get_attribute("innerHTML")
soup = BeautifulSoup(content_html, "html.parser")
print(soup)
Since the class name contains a space, you need to use something like XPath or an id to fetch the data.
For more info on advanced scraping, read this one:
https://medium.com/dualcores-studio/advanced-web-scraping-in-python-d19dfccba235
This is my HTML code, and I need to select the font size and bgcolor that are inside the noscript tag.
<iframe src="javascript:''" id="__gwt_historyFrame" tabIndex='-1' style="position:absolute;width:0;height:0;border:0"></iframe>
<!-- RECOMMENDED if your web app will not function without JavaScript enabled -->
<noscript>
<div style="width: 22em; position: absolute; left: 50%; margin-left: -11em; color: red; background-color: white; border: 1px solid red; padding: 4px; font-family: sans-serif">
Your web browser must have JavaScript enabled
in order for this application to display correctly.
</div>
</noscript>
Can anybody help with this?
I am using Python as my scripting language.
You just want to grab the text? You can look at something like Beautiful Soup (which I'm not familiar with), or use a simple regex:
import re

# `text` is assumed to hold the HTML source shown above.
m = re.compile(r'background-color: (\w+);', re.I)
result = m.search(text)
if result:
    bgc = result.group(1)  # e.g. 'white'
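If you'd rather go the Beautiful Soup route, here's a minimal sketch, again assuming text holds the HTML source shown above; it pulls the inline style off the div inside <noscript> and splits it into property/value pairs:
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'html.parser')
div = soup.find('noscript').find('div')
# The style attribute is one long string; split it into a dict of properties.
styles = {}
for part in div['style'].split(';'):
    if ':' in part:
        key, value = part.split(':', 1)
        styles[key.strip()] = value.strip()
print(styles.get('background-color'))  # -> 'white'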