How can I scrape a business contact email with Python?

This morning I wanted to create a little script in Python. It was 6am when I started, and now I'm about to go crazy because it's 10pm and I have nothing that works.
So basically, I want to do this: Given an Instagram Username, scrape the Name, Number of followers and the business contact email.
I found out that going to the page source will give me this info (let's consider only the email for now): https://imgur.com/a/jYQ2FtR
Any idea how I can do that? I've tried many different things and nothing works. I tried downloading the page and parsing the text looking for "business_email", but I have no idea how to actually extract the data I'm looking for. I know it's probably a simple task, but I'm a total noob and I haven't coded in years.
Can someone tell me how to do it? Or at least point me in the right direction.

There are different ways to approach this problem. If the data you want is visible on the page, then you could scrape that info using Beautiful Soup. If not, then it's a little trickier, but you could extract the info from the page source using regular expressions with the re module.
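For example, here's a minimal sketch of the regex route, assuming the profile page embeds a JSON blob with a "business_email" key (the key name and URL pattern are guesses based on the screenshot in the question, and Instagram changes its page source often and may require login, so treat this as a starting point):

import re
import requests

def get_business_email(username):
    # Fetch the profile page source; a browser-like User-Agent helps
    # avoid being served a stripped-down page.
    html = requests.get(
        "https://www.instagram.com/" + username + "/",
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    ).text
    # Look for "business_email":"..." anywhere in the embedded JSON.
    match = re.search(r'"business_email"\s*:\s*"([^"]*)"', html)
    return match.group(1) if match else None

print(get_business_email("instagram"))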

Related

How to web scrape to find new updates on a website

I know it's a broad question, but I'm looking for ideas on how to go about doing this. I'm not looking for the exact coded answer, just a rough game plan!
I'm trying to scrape a blog site to check for new blog posts, and if so, to return the URL of that particular blog post.
There are two parts to this question, namely:
1. Finding out if the website has been updated
2. Finding what the difference is (the new content)
I'm wondering what approaches I could take. I have been using Selenium for quite a while, and I'm aware that with the Selenium driver I could check (1) with driver.page_source.
Is there a better way to do both (1) and (2) together, and if possible even across various different blog sites (i.e., is it possible to write more general code that applies to various blogs at once, rather than a custom script for each site)?
Bonus: Is there a way to do a "diff" on the before and after of the code to see the difference, and extract necessary information from there?
Thanks so much in advance!
If you're looking for a way to know whether pages have been added or deleted, you can either look at the site's sitemap.xml directly or build up your own copy of one. If the site does not have a sitemap.xml, you can crawl its menu and navigation and build your own from that. Sitemap files have a 'last modified' entry. If you know the interval you are scraping on, you can quickly work out whether a change occurred within that interval. This is good for site-wide changes.
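Here's a rough sketch of the sitemap idea, assuming the site serves a standard sitemap.xml following the sitemap protocol with <lastmod> entries (the URL and cutoff date below are placeholders):

import requests
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (sitemaps.org).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_modified_since(sitemap_url, since_iso):
    root = ET.fromstring(requests.get(sitemap_url, timeout=10).content)
    changed = []
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        # ISO-8601 dates compare correctly as plain strings
        # as long as both use the same format.
        if lastmod and lastmod > since_iso:
            changed.append(loc)
    return changed

print(urls_modified_since("https://example.com/sitemap.xml", "2023-01-01"))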
Alternatively, you can check the response headers to determine the last modified time for the page. Apply the same interval check as with the sitemap and go from there.
You can always check the Last-Modified value in the website's response headers:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Last-Modified
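For instance, a quick sketch with requests (note that not every server sends this header):

import requests

# A HEAD request fetches the headers without downloading the body.
resp = requests.head("https://example.com/blog/", timeout=10)
print(resp.headers.get("Last-Modified"))  # e.g. 'Wed, 21 Oct 2015 07:28:00 GMT'

Compare the value between runs to decide whether the page needs re-scraping.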

Filling forms on different websites using Selenium and Python

I'm a beginner to Python and trying to start my very first project, which revolves around creating a program to automatically fill in pre-defined values in forms on various websites.
Currently, I'm struggling to find a way to identify web elements using the text shown on the website. For example, website A's email field shows "Email:" while another website might show "Fill in your email". In such cases, finding the element using ID or name would not be possible (unless I write a different set of code for each website) as they vary from website to website.
So my question is: is it possible to write code that will scan all the fields -> check the text -> then fill in the values based on the text associated with each field?
It is possible if you know the markup of the page and can write code to parse it. In that case you could use XPath, lxml, Beautiful Soup, Selenium, etc. There are plenty of tutorials on Google and YouTube; just search for "python scraping".
But if you want to write a program that can understand a random page on a random site and figure out what it should do, that is very difficult: it's a complex task that involves machine learning. I'd say that task is definitely not for beginners.
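For the simpler label-matching idea in your question, though, a rough sketch with Selenium might look like this (the hint-to-value mapping and URL are made up, and it assumes fields are wired to labels via the "for" attribute, which many pages aren't):

from selenium import webdriver
from selenium.webdriver.common.by import By

FIELD_HINTS = {
    "email": "user@example.com",  # hypothetical pre-defined values
    "name": "Jane Doe",
}

driver = webdriver.Chrome()
driver.get("https://example.com/signup")  # placeholder URL

for hint, value in FIELD_HINTS.items():
    # Find labels whose text mentions the hint, case-insensitively.
    labels = driver.find_elements(
        By.XPATH,
        "//label[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',"
        " 'abcdefghijklmnopqrstuvwxyz'), '%s')]" % hint,
    )
    for label in labels:
        # Follow the label's "for" attribute to the matching field.
        target_id = label.get_attribute("for")
        if target_id:
            driver.find_element(By.ID, target_id).send_keys(value)

You'd still need fallbacks (placeholder text, aria-label, nearby text) for forms that don't use labels this way.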

Django image scraping then using sorl-thumbnail

I'm creating a Reddit clone, and this is the last piece I need to finish the job. I want to scrape an image from the URL the user provides. I have no idea how to start this, though. I checked out Reddit's code, and it seems like there are different functions for different sites. Any guidance to get me started? Any tutorial I can take a look at?
Check out Beautiful Soup. It should get the job done.
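As a rough sketch of the idea (the function name is made up, and error handling and thumbnailing are left out):

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_image_urls(page_url):
    # Fetch the submitted page and parse it.
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Collect <img> sources, resolving relative paths against the page URL.
    return [urljoin(page_url, img["src"])
            for img in soup.find_all("img") if img.get("src")]

From there, pick a suitable image, download it, and hand it to your thumbnailing library.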

Scraping information from a Flash object on a website using Python or any other method

I was just wondering whether it is possible to scrape information from this website that is contained in a Flash file (http://www.tomtom.com/lib/doc/licensing/coverage/).
I am trying to get all the text from the different components of this website.
Can anyone suggest a good starting point in Python, or any simpler method?
I believe the following blog post answers your question well. The author had the same need: to scrape Flash content using Python, and ran into the same problem. He realized that he just needed to instantiate a browser (even an in-memory one that never displays anything to the screen) and then scrape its output. I think this could be a successful approach for what you need, and he makes it easy to understand.
http://blog.motane.lu/2009/06/18/pywebkitgtk-execute-javascript-from-python/
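The same render-then-scrape idea, sketched with a modern headless browser (this only illustrates the general approach from the post, not its exact pywebkitgtk setup; note that current browsers no longer run Flash plugins):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # render in memory, nothing on screen
driver = webdriver.Chrome(options=options)
driver.get("http://www.tomtom.com/lib/doc/licensing/coverage/")
# Scrape whatever the browser actually rendered, scripts and all.
print(driver.page_source)
driver.quit()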

How to get names of first 200 Facebook users using Python?

I am trying to get the names of the first 200 Facebook users.
I am using Python and BeautifulSoup.
The approach I'm using is that, instead of using the Graph API, I'm trying to get the names from the title of the profile page. (The title of a profile page is the person's name.)
The first user is Zuckerberg (id: 4). I want the names up to id 200.
Here's what I've tried.
import urllib2
from BeautifulSoup import BeautifulSoup

x = 4
while x <= 200:
    print BeautifulSoup(urllib2.urlopen("https://www.facebook.com/" + str(x))).title.string
    x += 1
Can anyone help?
Well, I concur with the other commenters that there is barely enough information to figure out what the problem is. But reading between the lines a bit, I imagine the OP is expecting that results for pages such as
https://www.facebook.com/4
which gets redirected to https://www.facebook.com/zuck, Mark Zuckerberg's page, and https://www.facebook.com/5
which gets redirected to https://www.facebook.com/ChrisHughes, another early Facebook employee, will continue to work for whatever further user IDs the OP plugs in. In fact, I believe this trick used to work in the past... until someone posted a spreadsheet of the first 2000 Facebook users somewhere, and Facebook clamped down on this hole (this is from memory; I bet there's a news story out there if someone feels like digging).
Anyway, trying further user IDs in the URL such as:
https://www.facebook.com/7 now gives a "Sorry, this page isn't available" response. To the OP, I don't think there's any easy way you can code around this -- Zuck obviously doesn't care that you're harvesting his own page, but I guess he's not keen on letting you scrape the entire Facebook user list. Sorry.
Update: you might still be able to get away with such harvesting using Facebook's Graph API -- it appears that GETs of pages like https://graph.facebook.com/100 will work for most User IDs. You should be able to script up what you need from there (if I were Facebook, I would have rate-limiting in place to prevent mass harvesting, but you'll have to try and see what you get for yourself.) Here's a script similar to what you're trying to do.
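For illustration, a minimal sketch of that Graph API loop might look like the following (this is not the linked script, and modern Graph API versions require an access token, so the unauthenticated call may no longer work):

import requests

for user_id in range(4, 201):
    resp = requests.get("https://graph.facebook.com/%d" % user_id, timeout=10)
    if resp.ok:
        # Public profile objects include a "name" field.
        print(resp.json().get("name"))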
