I am scraping Instagram and have already achieved my goal: the result I get is correct, but I want it stored as a list of lists.
Code:
comment = []
post_links = ['https://www.instagram.com/p/BesW08pHfUt', 'https://www.instagram.com/p/BQZyTtej4yj']
for post_link in post_links:
    _ = API.getMediaComments(get_media_id(post_link), max_id=100)
    for c in reversed(API.LastJson['comments']):
        comment.append(c["user"]["username"])
The comments I get for each post link from Instagram:
'https://www.instagram.com/p/BesW08pHfUt':- 'headhotel', 'famegalore', 'motivationpoem', 'malicioussatan'
'https://www.instagram.com/p/BQZyTtej4yj':- 'monarch_motivation', 'headhotel', 'motivationpoem'
The output I get
['headhotel', 'famegalore', 'motivationpoem', 'malicioussatan', 'monarch_motivation', 'headhotel', 'motivationpoem']
The output I want
[['headhotel', 'famegalore', 'motivationpoem', 'malicioussatan'], ['monarch_motivation', 'headhotel', 'motivationpoem']]
I know this is kind of easy, but I have been coding this scraper for two days, so I am a bit confused!
I'm not familiar with that API, but I think you want to do something like this:
for post_link in post_links:
    _ = API.getMediaComments(get_media_id(post_link), max_id=100)
    sublist = []
    for c in reversed(API.LastJson['comments']):
        sublist.append(c["user"]["username"])
    comment.append(sublist)
That creates a new sublist on each iteration of the outer loop, which the inner loop fills, and then we append the sublist to the main comment list.
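To see the pattern in isolation, here is a runnable sketch in which a hypothetical `fetch_usernames` function stands in for the `API.getMediaComments` / `API.LastJson` calls:

```python
# Hypothetical stand-in for the Instagram API calls in the question.
def fetch_usernames(post_link):
    fake_comments = {
        'https://www.instagram.com/p/BesW08pHfUt':
            ['headhotel', 'famegalore', 'motivationpoem', 'malicioussatan'],
        'https://www.instagram.com/p/BQZyTtej4yj':
            ['monarch_motivation', 'headhotel', 'motivationpoem'],
    }
    return fake_comments[post_link]

post_links = ['https://www.instagram.com/p/BesW08pHfUt',
              'https://www.instagram.com/p/BQZyTtej4yj']

comment = []
for post_link in post_links:
    sublist = []                              # fresh sublist per post
    for username in fetch_usernames(post_link):
        sublist.append(username)
    comment.append(sublist)                   # one sublist per post link

print(comment)
# [['headhotel', 'famegalore', 'motivationpoem', 'malicioussatan'],
#  ['monarch_motivation', 'headhotel', 'motivationpoem']]
```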
Related
I am struggling with the following; any help would be highly appreciated. The path I chose to solve the problem might be clunky, even outdated, but it is the best I could do. I am trying to get recent tweets based on a query and only from the people I follow on Twitter. So I ran two different queries:
1)
followers = client.get_users_following(id = '', max_results = 100)
and 2)
tweets = client.search_recent_tweets(query=query, tweet_fields=['author_id', 'created_at'], max_results=100)
I managed to get the responses into json objects, then normalise and at the end I get two dataframes:
A) a dataframe df['id'], where 'id' is the unique username of the Twitter user, the result of the first query ("get_users_following"); here I converted the 'id' type from "object" to "int"
B) a dataframe with the columns ['author_id'], ['text'], ['created_at'], ['id'], where 'author_id' is the unique username of the Twitter user, the same as the 'id' from the previous dataframe
All good until the point where I try to iterate through 'author_id' to see if it matches my list of 'id's of the people I follow; whenever it does, I would like to add the text of that particular tweet to a list and start analysing the data.
The code I am struggling with is below; the problem is that the result I get is, in fact, an empty list.
all = []
x = len(df['id'])
for number in twe['author_id']:
    for j in range(x):
        if number == df['id'][j]:
            all.append(twe['text'][number])
        else:
            j += 1
print(all)
print(len(all))
I checked and there were people that I follow that were tweeting on a particular topic or another.
Any thoughts would be highly appreciated.
LATER EDIT:
In the meantime I worked a bit more on the for loop, but I still get the same empty list as a result.
al = []
print(f1.shape)
x = len(f1['id'])
print(x)
y = len(twe['text'])
print(y)
i = 0
j = 0
for (i, j) in [(i, j) for i in range(x) for j in range(y)]:
    if f1['id'][i] == twe['author_id'][j]:
        al.append(['text'][j])
    else:
        if j < y:
            j += 1
        else:
            i += 1
print(len(al))
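For what it's worth, if the two objects really are pandas DataFrames as described, the whole double loop can be replaced by a vectorised membership test. A sketch with made-up ids and texts standing in for the real data:

```python
import pandas as pd

# Made-up stand-ins for the two dataframes described above.
df = pd.DataFrame({'id': [111, 222]})                 # ids of people I follow
twe = pd.DataFrame({'author_id': [111, 333, 222],
                    'text': ['tweet a', 'tweet b', 'tweet c']})

# Keep only the tweets whose author_id appears in df['id'].
matched = twe.loc[twe['author_id'].isin(df['id']), 'text'].tolist()
print(matched)  # ['tweet a', 'tweet c']
```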
I want to scrape every page on the following website: https://www.top40.nl/top40/2020/week-34 (for each year and week number) by clicking on the song, then moving to 'songinfo' and scraping all the data in the table listed there. For this question, I have only scraped the title so far.
This is the url I use:
url = 'https://www.top40.nl/top40/'
However, when I print songs_list, it only returns the last title on the website. As such, I believe I am overwriting it.
Hopefully someone can explain to me which mistake(s) I am making; and if there is an easier way to scrape the table on each page, I'd be very happy to hear it.
Please find my python code below:
for year in range(2015,2016):
    for week in range(1,2):
        page_url = url + str(year) + '/' + 'week-' + str(week)
        driver.get(page_url)
        lists = driver.find_elements_by_xpath("//a[@data-linktype='title']")
        links = []
        for l in lists:
            print(l.get_attribute('href'))
            links.append(l.get_attribute('href'))
        for link in links:
            driver.get(link)
            driver.find_element_by_xpath("//a[@href='#songinfo']").click()
            songs = driver.find_elements_by_xpath(""".//*[@id="songinfo"]/table/tbody/tr[2]/td""")
            songs_list = []
            for s in songs:
                print(s.get_attribute('innerHTML'))
                songs_list.append(s.get_attribute('innerHTML'))
The line songs_list = [] is inside the for link in links loop, so on each new iteration it gets reset to an empty list (and you then append to this new, empty list). Once all your loops end, you only see the songs_list created in the last iteration.
The simplest fix is to place the songs_list = [] line outside all the for loops, e.g.:
songs_list = []
for year in range(2015,2016):
    for week in range(1,2):
        # etc
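The effect of where songs_list = [] sits can be demonstrated without Selenium at all:

```python
# Two "pages" of scraped titles, standing in for the Selenium results.
batches = [['song1', 'song2'], ['song3']]

# Reinitialising inside the loop keeps only the last batch:
for batch in batches:
    songs_list = []
    for s in batch:
        songs_list.append(s)
print(songs_list)  # ['song3']

# Initialising once, before the loops, keeps everything:
songs_list = []
for batch in batches:
    for s in batch:
        songs_list.append(s)
print(songs_list)  # ['song1', 'song2', 'song3']
```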
I'm building a Google Sheet to keep track of stock prices for the stocks I own. I have an API running that's connected to Google Sheets and my own Python application.
My Google Sheet looks like this:
Stock | Previous close
AAPL | 316.73
NVDA | 348.71
SPOT | 191.00
I currently have the code running as follows:
import requests
import gspread
from oauth2client.service_account import ServiceAccountCredentials

# (credentials / client setup omitted)
sheet = client.open("Stock").sheet1
AAPL = sheet.cell(2,1).value
url = ('https://ca.finance.yahoo.com/quote/'+AAPL+'?p='+AAPL+'&.tsrc=fin-srch')
response = requests.get(url)
htmltext = response.text
splitlist = htmltext.split("Previous Close")
afterfirstsplit = splitlist[1].split("\">")[2]
aftersecondsplit = afterfirstsplit.split("</span>")
datavalue = aftersecondsplit[0]
sheet.update_cell(2,2,datavalue)
# this updates the value within my Google Sheet to the previous close price
For each individual stock, I would copy and paste this block and change the stock symbol to find the value of the next quote.
I know there's a way to use for loops to automate this process. I tried that with the following, but it wouldn't update as needed. I've hit a wall at this point and would appreciate any help or insight on how I could automate this.
tickers = {sheet.cell(2,1).value: [],
           sheet.cell(3,1).value: [],
           sheet.cell(4,1).value: [],
           sheet.cell(5,1).value: []}

for symbols in tickers:
    url = ('https://ca.finance.yahoo.com/quote/'+symbols+'?p='+symbols+'&.tsrc=fin-srch')
    response = requests.get(url)
    htmltext = response.text
    splitlist = htmltext.split("Previous Close")
    afterfirstsplit = splitlist[1].split("\">")[2]
    aftersecondsplit = afterfirstsplit.split("</span>")
    datavalue = aftersecondsplit[0]
    sheet.update_cell(2, 1, datavalue)
    print(datavalue)
Doing this gathers all the values of the current stock prices, and it does import them into the sheet, but only to one coordinate. I don't know how to increase the '1' within sheet.update_cell(2,1,datavalue) on each pass of the for loop. I believe that is the way to solve this, but if anyone has other suggestions, I'm all ears.
In regards to answering this part of your question:
"I don't know how to increase the '1' within sheet.update_cell(2,1,datavalue) on each pass of the for loop."
This is how you typically increment a counter inside a for loop:
counter = 1
for symbol in tickers:
    # your code
    sheet.update_cell(2, counter, datavalue)
    counter = counter + 1
While counter variables are a very common pattern in most programming languages (see Akib Rhast's answer), the more Pythonic way to do it is with the enumerate builtin function:
for column, symbol in enumerate(tickers, start=1):
    # do stuff
    sheet.update_cell(2, column, datavalue)
what is enumerate?
As the documentation states, enumerate takes something you can iterate over (like a list) and returns tuples with the counter as the first element and the elements from the iterable as the second element:
seasons = ['Spring', 'Summer', 'Fall', 'Winter']
list(enumerate(seasons, start=1))
# outputs [(1, 'Spring'), (2, 'Summer'), (3, 'Fall'), (4, 'Winter')]
It also has the advantage of doing so in a memory-efficient manner and is directly tied to your loop.
why is there a comma in my for loop?
This is just syntactic sugar in python that allows you to unpack a tuple or list:
alist = [1, 2, 3]
first, second, third = alist
print(third) # outputs 3
print(second) # outputs 2
print(first) # outputs 1
As enumerate returns a tuple, you are basically assigning each element on that tuple to a different variable at the same time.
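One caveat worth noting: given the sheet layout in the question (tickers down column 1, prices in column 2), the counter arguably belongs on the row argument rather than the column. A runnable sketch, using a minimal stand-in for the gspread worksheet so nothing here touches a real spreadsheet:

```python
class FakeSheet:
    """Minimal stand-in for a gspread worksheet, for illustration only."""
    def __init__(self):
        self.cells = {}

    def update_cell(self, row, col, value):
        self.cells[(row, col)] = value

sheet = FakeSheet()
tickers = ['AAPL', 'NVDA', 'SPOT']
prices = {'AAPL': '316.73', 'NVDA': '348.71', 'SPOT': '191.00'}  # scraped in the real code

# Tickers start on row 2; previous-close prices go in column 2.
for row, symbol in enumerate(tickers, start=2):
    sheet.update_cell(row, 2, prices[symbol])

print(sheet.cells)
# {(2, 2): '316.73', (3, 2): '348.71', (4, 2): '191.00'}
```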
I'm downloading the followers from 2 Twitter accounts and putting them into a list of dictionaries. I downloaded 10 followers from account1 and 10 from account2. With the following code I take the first 4 followers of account1 and the first 4 of account2 and display them:
twitter_accounts = ["account1", "account2"]
res = {}
follower = []
pbar = tqdm_notebook(total=len(twitter_accounts))
for twitter_account in twitter_accounts:
    inner_structure = []
    for page in tweepy.Cursor(api.followers, screen_name=twitter_account,
                              skip_status=True, include_user_entities=False).items(10):
        val = page._json
        inner_dict = {}
        inner_dict["name"] = val["name"]
        inner_dict["screen_name"] = val["screen_name"]
        if inner_dict not in inner_structure:
            inner_structure.append(inner_dict)
    res[twitter_account] = inner_structure
    pbar.update(1)
pbar.close()
for twitter_account in twitter_accounts:
    for i in range(4):
        display(res[twitter_account][i]['screen_name'])
So the final result is a display of the first 4 followers of account1 and the first 4 of account2.
But what I really need to do is take those 8 strings and, instead of displaying them, store them in an array.
I tried this way, but I get an index-out-of-range error.
for twitter_account in twitter_accounts:
    for i in range(4):
        for j in range(8):
            follower[j] = res[twitter_account][i]['screen_name']
How can I store them in an array without getting the error?
You can either declare the list first, and append elements to it:
followers = []
for twitter_account in twitter_accounts:
    for i in range(4):
        followers.append(res[twitter_account][i]['screen_name'])
Or use a list comprehension directly (note the clause order: the account loop comes first so it matches the nested loops above):
followers = [res[twitter_account][i]['screen_name'] for twitter_account in twitter_accounts for i in range(4)]
Whichever you find clearer and more readable (-:
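Both forms can be checked against mocked-up data (a stand-in res built without the Twitter API):

```python
# Mocked-up follower data, standing in for the tweepy results.
twitter_accounts = ['account1', 'account2']
res = {
    'account1': [{'screen_name': f'a{i}'} for i in range(10)],
    'account2': [{'screen_name': f'b{i}'} for i in range(10)],
}

# Loop version: append into a list declared up front.
followers = []
for twitter_account in twitter_accounts:
    for i in range(4):
        followers.append(res[twitter_account][i]['screen_name'])

# Comprehension version, with the account loop as the outer clause.
comp = [res[acc][i]['screen_name'] for acc in twitter_accounts for i in range(4)]

print(followers)          # ['a0', 'a1', 'a2', 'a3', 'b0', 'b1', 'b2', 'b3']
print(comp == followers)  # True
```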
(Code below)
I'm scraping a website and the data I get back is in 2 multi-dimensional arrays. I want everything in JSON format, because I want to save this and load it in again later when I add "tags".
So, less vague: I'm writing a program which takes in data like which characters you have and what the missions require you to do (you can complete multiple at once if the attributes align), then checks that against a list of attributes each character fulfills and returns a sorted list of the best characters for the context.
Right now I'm only scraping character data, but I've already "got" the attribute data per character. The problem there was that it wasn't sorted by name, so it was just a randomly repeating list that I needed to be able to look up. I still haven't quite figured out how to do that one.
Right now I have 2 arrays: one for the headers of the table and one for the rows of the table. The rows contain the "answers" to the headers' "questions" / "titles"; e.g. Maximum Level, 50.
This is true for everything but the first entry, which is the Name, Pronunciation (and I just want to store the name, of course).
So:
Iterations = 0
While loop based on RowArray length / 9 (while Iterations <= that):
    HeaderArray[0] gives me the name
    RowArray[Iterations + 1] gives me data type 2
    RowArray[Iterations + 2] gives me data type 3
    ... repeat until RowArray[Iterations + 8]
    Iterations += 9
So I'm going through and appending these to separate lists - single arrays like CharName[] and CharMaxLevel[] and so on.
But I'm actually not sure if that's going to make this easier or not? Because my end goal here is to send "CharacterName" and get stuff back based on that AND be able to send in "DesiredTraits" and get "CharacterNames who fit that trait" back. Which means I also need to figure out how to store that category data semi-efficiently. There's over 80 possible categories and most only fit into about 10. I don't know how I'm going to store or load that data.
I'm assuming JSON is the best way? And I'm trying to keep it all in one file for performance and code readability reasons - don't want a file for each character.
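As a sketch of what that single JSON file could look like (the field names and values here are hypothetical, not taken from the wiki), a dict keyed by character name supports both lookup directions described above:

```python
import json

# Hypothetical structure: one entry per character, keyed by name,
# with the applicable categories stored as a list of strings.
characters = {
    'Mickey': {
        'Series': 'Mickey & Friends',
        'Categories': ['Black', 'White Gloves', 'Mickey & Friends'],
    },
    'Sully': {
        'Series': "Monster's Inc",
        'Categories': ['Blue', 'Male', "Monster's Inc"],
    },
}

# Round-trip through JSON (one blob for everything, not a file per character).
blob = json.dumps(characters, indent=1)
loaded = json.loads(blob)

# Lookup by name:
print(loaded['Mickey']['Series'])  # Mickey & Friends

# Lookup by trait:
print([n for n, c in loaded.items() if 'Blue' in c['Categories']])  # ['Sully']
```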
CODE: (Forgive me, I've never scraped anything before + I'm actually somewhat new to Python - just got it 4? days ago)
https://pastebin.com/yh3Z535h
^ In the event anyone wants to run this and this somehow makes it easier to grab the raw code (:
import time
import requests, bs4, re
from urllib.parse import urljoin
import json
import os

target_dir = r"D:\00Coding\Js\WebScraper" #Yes, I do know that storing this in my Javascript folder is filthy
fullname = os.path.join(target_dir, 'TsumData.txt')
StartURL = 'http://disneytsumtsum.wikia.com/wiki/Skill_Upgrade_Chart'
URLPrefix = 'http://disneytsumtsum.wikia.com'

def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/wiki/"))
    links = [urljoin(URLPrefix, a['href']) for a in a_tags] # convert relative url to absolute url
    return links
def get_tds(link):
    soup = make_soup(link)
    #tds = soup.find_all('li', class_="category normal") #This will give me the attributes / tags of each character
    tds = soup.find_all('table', class_="wikia-infobox")
    RowArray = []
    HeaderArray = []
    if tds:
        for td in tds:
            #print(td.text.strip()) #This is everything
            rows = td.findChildren('tr')#[0]
            headers = td.findChildren('th')#[0]
            for row in rows:
                cells = row.findChildren('td')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub(r'\s+', ' ', cell_content).strip()
                    if clean_content:
                        RowArray.append(clean_content)
            for row in rows:
                cells = row.findChildren('th')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub(r'\s+', ' ', cell_content).strip()
                    if clean_content:
                        HeaderArray.append(clean_content)
    print(HeaderArray)
    print(RowArray)
    return (RowArray, HeaderArray)
#Output = json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1)
#print(json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1))
#TempFile = open(fullname, 'w') #Read only, Write Only, Append
#TempFile.write("EHLLO")
#TempFile.close()
#print(td.tbody.Series)
#print(td.tbody[Series])
#print(td.tbody["Series"])
#print(td.data-name)
#time.sleep(1)
if __name__ == '__main__':
    links = get_links(StartURL)
    MainHeaderArray = []
    MainRowArray = []
    MaxIterations = 60
    Iterations = 0
    for link in links: #Specifically I'll need to return and append the arrays here because they're being cleared repeatedly.
        #print("Getting tds calling")
        if Iterations > 38: #There are this many webpages it'll first look at that don't have the data I need
            TempRA, TempHA = get_tds(link)
            MainHeaderArray.append(TempHA)
            MainRowArray.append(TempRA)
            MaxIterations -= 1
        Iterations += 1
        #print(MaxIterations)
        if MaxIterations <= 0: #I don't want to scrape the entire website for a prototype
            break
        #print("This is the end ??")
        #time.sleep(3)

    #jsonized = map(lambda item: {'Name':item[0], 'Series':item[1]}, zip())
    print(MainHeaderArray)
    #time.sleep(2.5)
    #print(MainRowArray)
    #time.sleep(2.5)
    #print(zip())
    TsumName = []
    TsumSeries = []
    TsumBoxType = []
    TsumSkillDescription = []
    TsumFullCharge = []
    TsumMinScore = []
    TsumScoreIncreasePerLevel = []
    TsumMaxScore = []
    TsumFullUpgrade = []

    Iterations = 0
    MaxIterations = len(MainRowArray)
    while Iterations <= MaxIterations: #This will fire 1 time per Tsum
        print(Iterations)
        print(MainHeaderArray[Iterations][0]) #Holy this gives us Mickey ;
        print(MainHeaderArray[Iterations+1][0])
        print(MainHeaderArray[Iterations+2][0])
        print(MainHeaderArray[Iterations+3][0])
        TsumName.append(MainHeaderArray[Iterations][0])
        print(MainRowArray[Iterations][1])
        #At this point it will, of course, crash - that's because I only just realized I needed to append AND I just realized that everything
        #isn't stored in a list as I thought, but rather a multi-dimensional array (as you can see below, I didn't know this)
        TsumSeries[Iterations] = MainRowArray[Iterations+1]
        TsumBoxType[Iterations] = MainRowArray[Iterations+2]
        TsumSkillDescription[Iterations] = MainRowArray[Iterations+3]
        TsumFullCharge[Iterations] = MainRowArray[Iterations+4]
        TsumMinScore[Iterations] = MainRowArray[Iterations+5]
        TsumScoreIncreasePerLevel[Iterations] = MainRowArray[Iterations+6]
        TsumMaxScore[Iterations] = MainRowArray[Iterations+7]
        TsumFullUpgrade[Iterations] = MainRowArray[Iterations+8]
        Iterations += 9
        print(Iterations)

    print("It's Over")
    time.sleep(3)
    print(TsumName)
    print(TsumSkillDescription)
Edit:
tl;dr my goal here is to be like
"For this Mission Card I need a Blue Tsum with high score potential, a Monster's Inc Tsum for a bunch of games, and a Male Tsum for a long chain.. what's the best Tsum given those?" and it'll be like "SULLY!" and automatically select it or at the very least give you a list of Tsums. Like "These ones match all of them, these ones match 2, and these match 1"
Edit 2:
Here's the command Line Output for the code above:
https://pastebin.com/vpRsX8ni
Edit 3: Alright, just got back from a short break. With some looking over, I see what happened: my append code says "append this list to the array", meaning I've got a list of lists for both the Header and Row arrays that I'm storing. So I can confirm (for myself at least) that these aren't nested lists per se, but they are definitely 2 lists, each containing a single list at every entry. Definitely not a dictionary or anything "special case", at least. This should help me quickly find an answer now that I'm not throwing "multi-dimensional list" around my Google searches or wondering why the list operations aren't working (as they expect 1 value and get a list instead).
Edit 4:
I need to simply add another list! But super nested.
It'll just store the categories that the Tsum has as a string.
so Array[10] = ArrayOfCategories[Tsum] (which contains every attribute in string form that the Tsum has)
So that'll be ie TsumArray[10] = ["Black", "White Gloves", "Mickey & Friends"]
And then I can just use the "Switch" that I've already made in order to check them. Possibly. Not feeling too well and haven't gotten that far yet.
Just use with open(...) as json_file and write/read (super easy).
Ultimately I stored 3 JSON files. No big deal. Much easier than appending into one big file.
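For reference, the with open pattern mentioned above, sketched with a temporary file path (the filename and sample values here are made up):

```python
import json
import os
import tempfile

# Made-up sample data in the shape of the lists built above.
data = {'TsumName': ['Mickey'], 'TsumSeries': ['Mickey & Friends']}

path = os.path.join(tempfile.gettempdir(), 'TsumData.json')

# Write...
with open(path, 'w') as json_file:
    json.dump(data, json_file)

# ...and read back.
with open(path) as json_file:
    loaded = json.load(json_file)

print(loaded == data)  # True
```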