Scraping: run for loop n number of times - python

I am using instaloader to scrape instagram posts as part of a study project.
To avoid getting shut down by Instagram, I sleep for a random 1-20 seconds between each round. This works well.
I don't want to go through all of a profile's posts each time I scrape, so I want the loop to run 5 times, which would give me 5 posts. But I can't seem to get it to do that.
I had written the following function to try to scrape the profile and return the first 5 posts:
## importing and creating instance
from instaloader import Instaloader
from instaloader import Profile
import instaloader
import time
from random import randint
L = instaloader.Instaloader()
#random time for sleep
vent = randint(1,20)
# function:
def get2posts(profile_name):
    profile = Profile.from_username(L.context, profile_name)
    POSTS = profile.get_posts()
    for post in POSTS:
        for i in range(2):
            L.download_post(post, profile_name)
            time.sleep(vent)
        break
    print('scrape done')
This code returns 5 of the same posts though, and I simply can't figure out a way to get it to return the first 5 posts of an account.
The working function, which harvests all posts of a profile is:
# the original function (without range)
def get_posts(profile_name):
    profile = Profile.from_username(L.context, profile_name)
    POSTS = profile.get_posts()
    for post in POSTS:
        L.download_post(post, profile_name)
        time.sleep(vent)
    print('I am done')
Hope you can help :)

The problem is that the inner for loop runs download_post twice (range(2)) on the same post, and then the outer loop breaks after the first post. If POSTS is a list, you can use slicing to loop over only the first 5 items, like so: for post in POSTS[:5]:. A safer method, though, is to count the posts as you go, which works for most types of iterables (not just lists), like so:
def get2posts(profile_name):
    profile = Profile.from_username(L.context, profile_name)
    POSTS = profile.get_posts()
    for i, post in enumerate(POSTS):
        L.download_post(post, profile_name)
        if i == 4:
            break
        time.sleep(vent)
    print('scrape done')
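An equivalent way to cap the loop is itertools.islice, which stops any iterator after n items. Note also that vent is drawn from randint once at import time, so every pause in the code above is the same length; drawing it inside the loop restores the random 1-20 second delay. A minimal sketch, assuming the same L Instaloader instance created above (the function name is just illustrative):
from itertools import islice
from random import randint
import time

def get_first_posts(profile_name, n=5):
    profile = Profile.from_username(L.context, profile_name)
    # islice stops the post iterator after n items without touching the rest
    for post in islice(profile.get_posts(), n):
        L.download_post(post, profile_name)
        # re-draw the delay each round so it actually varies between 1 and 20 s
        time.sleep(randint(1, 20))
    print('scrape done')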

Related

Tweepy- Get all friends of self and unfollow the 50 oldest friends

I am trying to gather a list of the people I follow. Next, I would reverse the list and unfollow the first 50. I have seen similar answers to the question of how to gather the list of all friends, but I am still getting stuck, and I'm unsure whether the documentation has changed, since those questions are a little old. I am getting Twitter error response: status code = 431.
Below is the current relevant code,
import os
import logging
import time
import tweepy
from time import sleep
from config import *

api = initialize_api()
ids = []
for page in tweepy.Cursor(api.friends_ids, screen_name="xxxxxxx").pages():
    ids.extend(page)
    time.sleep(60)
screen_names = [user.screen_name for user in api.lookup_users(user_ids=ids)]
screen_names.reverse()
logger.info("Starting Unfollow")
i = 0
while i < 50:
    api.destroy_friendship(screen_names[i])
    i += 1
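One likely cause of the 431 (Request Header Fields Too Large) is passing the entire ids list to api.lookup_users in a single call; the endpoint behind it accepts at most 100 user IDs per request, so with thousands of friends the request becomes far too large. A hedged sketch of batching the lookup (the chunks helper is illustrative, not part of tweepy):
def chunks(seq, size=100):
    # yield successive slices of at most `size` items
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

screen_names = []
for batch in chunks(ids, 100):
    # lookup_users is limited to 100 user IDs per call
    screen_names.extend(user.screen_name for user in api.lookup_users(user_ids=batch))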

How can I quickly get the follower count for a large list of Instagram users?

I have the following program in Python that reads in a list of 1,390,680 URLs of Instagram accounts and gets the follower count for each user. It uses instaloader. Here's the code:
import pandas as pd
from instaloader import Instaloader, Profile

# 1. Loading in the data
# Reading the data from the csv
data = pd.read_csv('IG_Audience.csv')
# Getting the profile urls
urls = data['Profile URL']

def getFollowerCount(PROFILE):
    # using the instaloader module to get follower counts from this programmer
    # https://stackoverflow.com/questions/52225334/webscraping-instagram-follower-count-beautifulsoup
    try:
        L = Instaloader()
        profile = Profile.from_username(L.context, PROFILE)
        print(PROFILE, 'has', profile.followers, 'followers')
        return profile.followers
    except Exception as exception:
        print(exception, False)
        return 0

# Follower count List
followerCounts = []
# This loop will fetch the follower count for each user
for url in urls:
    # Getting the profile username from the URL by removing the instagram.com
    # portion and the slash at the end of the url
    url_dirty = url.replace('https://www.instagram.com/', '')
    url_clean = url_dirty[:-1]
    followerCounts.append(getFollowerCount(url_clean))

# Converting the list to a series, adding it to the dataframe, and writing it to
# a csv
data['Follower Count'] = pd.Series(followerCounts)
data.to_csv('IG_Audience.csv')
The main issue I have is that it takes a very long time to work through the entire list. It took 14 hours just to get the follower counts for 3,035 users. Is there any way to speed up this process?
First I want to say I'm sorry for being VERY late, but hopefully this can help someone in the future. I'm having a similar issue and I believe I found out why: when you request the follower count, instaloader doesn't just go to the profile page and read the number; it fetches the URL and profile ID for each account, and it can only get so many at a time. The best way I can think of to get around this would be to request the profile page directly and just read the follower count from it. The issue with that, however, is that after about 9,999 followers the page starts saying "10k" or "10.1k", so you'll be off by up to 100, and it gets worse if the person has over a million followers because then it's off by even more.
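As a rough sketch of that page-scraping idea, assuming the profile is public and Instagram still exposes the rounded count in the page's meta text (this changes often and may require being logged in, so treat it as a starting point only):
import re
import requests

def approx_followers(username):
    # fetch the public profile page and pull the rounded count
    # (e.g. "10.1K Followers") out of the page source
    resp = requests.get('https://www.instagram.com/%s/' % username,
                        headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    match = re.search(r'([\d.,]+[KkMm]?) Followers', resp.text)
    return match.group(1) if match else None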

Why won't my python program print the list?

import praw
import time
from selenium import webdriver

driver = webdriver.Chrome()
r = praw.Reddit(client_id='XXXXXXXXXX',
                client_secret='XXXXXXXXXXXXX', password='XXXXXXXXX',
                user_agent='Grand Street Tech', username='grandstreetsupreme')
subreddit = r.subreddit('supremeclothing')
submissions = []
users = []
for submission in r.subreddit('supremeclothing').new(limit=999):
    for comment in submission.comments:
        author = comment.author
        users.append(author)
It takes like 10 minutes to complete and when it does it doesn't do anything.
There is no print statement for users, right? Put a statement like the one below.
print(users)
This is because you just created the list users; you need to tell Python to print it. After your for loop, put print(users).
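If you want readable output rather than raw Redditor objects, a small sketch (deleted accounts come back as None, so they are skipped here):
# print one username per line, skipping deleted authors (None)
for author in users:
    if author is not None:
        print(author.name)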

Facebook API using Python. How to print out comments from ALL posts?

I am new to the Facebook API. Currently, I am trying to print out ALL the comments that have been posted on this Facebook page called 'leehsienloong'. However, I could only print out a total of 700+ comments. I'm sure there are more than 700+ comments in total.
I found out that the problem is that I did not request the next page to print out the comments. I read about paging in the Facebook API, but I still do not understand how to write the code for paging.
Is there anyone out there who will be able to help/assist me? I really need help. Thank you.
Here is my code, without paging:
import facebook  # sudo pip install facebook-sdk
import itertools
import json
import re
import requests

access_token = "XXX"
user = 'leehsienloong'
graph = facebook.GraphAPI(access_token)
profile = graph.get_object(user)
posts = graph.get_connections(profile['id'], 'posts')
Jstr = json.dumps(posts)
JDict = json.loads(Jstr)
count = 0
for i in JDict['data']:
    allID = i['id']
    try:
        allComments = i['comments']
        for a in allComments['data']:
            count += 1
            print a['message']
    except (UnicodeEncodeError):
        pass
print count
You can use the limit parameter to increase the number of comments to be fetched. The default is 25. You can increase it like this:
posts = graph.get_connections(profile['id'], 'posts', limit=100)
But a more convenient way would be to get the previous and next pages from paging and make multiple requests.
To get all the comments of a post, the logic should be something like:
comments = []
for post in posts["data"]:
    post_comments = graph.get_connections(id=post["id"], connection_name="comments")
    comments.extend(post_comments["data"])
    while True:
        try:
            # follow the "next" link until the paging key disappears
            post_comments = requests.get(post_comments["paging"]["next"]).json()
            comments.extend(post_comments["data"])
        except KeyError:
            break

Error with Python and Instagram API when searching for hashtag

I would like to find out the ten Instagram users who posted the most pictures with a certain hashtag.
I am using Python 2.7, and I wrote this:
import urllib, json
from collections import Counter

def GetNumberPics():
    urlInstagram = "https://api.instagram.com/v1/tags/HASHTAG?access_token=ACCESSTOKEN"
    response = urllib.urlopen(urlInstagram)
    return json.loads(response.read())['data']['media_count']

def GetPics(url):
    urlInstagram = url
    response = urllib.urlopen(urlInstagram)
    pics = json.loads(response.read())
    return pics
In this next piece I find out how many pictures are on Instagram with that hashtag and divide that number by 20, because, as far as I understood, that is the number of pictures whose data I receive on each API call. By doing this I should know how many times I have to make the API call to get the data of all pictures.
nPics = GetNumberPics()
print nPics
times = nPics / 20
print times
FirstUrl = 'https://api.instagram.com/v1/tags/HASHTAG/media/recent?client_id=CLIENTID'
pics = GetPics(FirstUrl)
making a list of all users:
users = []
for i in range(20):
    users.append(pics['data'][i]['user']['username'])
getting the next url, as received in the first api call:
nextUrl = pics['pagination']['next_url']
making the API call the number of times calculated before - I'm printing i just to see how many times I make the API call:
for i in range(times):
    print i
    pics = GetPics(nextUrl)
    for l in range(len(pics['data'])):
        users.append(pics['data'][l]['user']['username'])
    nextUrl = pics['pagination']['next_url']
counting the users and printing out the ten users that used that hashtag the most:
counts = Counter(users)
print counts.most_common(10)
I get an error which I can't understand when I reach the 89th call, using the hashtag "inerasmus":
Traceback (most recent call last):
File "C:\Users\Michele\Desktop\programming\EIE\tweetNumber.py", line 55, in <module>
nextUrl = pics['pagination']['next_url']
KeyError: 'next_url'
I hope it is a useful question also for someone else. Thank you very much!
Why don't you use their Python API? From what I see, you can do all of this with the Python library. Also, there are already some people on GitHub who have messed around with the API. Here is one and another.
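On the KeyError itself: the last page of results simply has no next_url, so the loop runs off the end of the pagination. A minimal sketch of a more defensive loop, reusing the GetPics helper and FirstUrl from the question, that stops when the API stops returning a next page:
users = []
nextUrl = FirstUrl
while nextUrl:
    pics = GetPics(nextUrl)
    for pic in pics['data']:
        users.append(pic['user']['username'])
    # stop when the response no longer contains a next page
    nextUrl = pics.get('pagination', {}).get('next_url')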
