I am doing some web scraping with Selenium and can return a phone number and email, but I am unable to append them to my dataframe.
I have tried running the functions directly, and they print the correct information. I have also tried saving each function's result to a variable and then putting that into the dataframe, but it just won't save the way I expect.
df = pd.DataFrame(columns=['Phone', 'EmailAddress'])

def phonenumber():
    for element in browser.find_elements_by_xpath('.//span[@class = "phone ng-binding ng-scope"]'):
        return element.text

def email():
    for element in browser.find_elements_by_xpath('.//span[@class = "email ng-scope"]'):
        return element.text

df = df.append({'Phone': phonenumber(), 'EmailAddress': email()}, ignore_index=True)
Right now, the code puts None in the dataframe.
You can append each element inside the for loop to an empty list in each function, return the lists from the functions, and then use them to create the dataframe:
def phonenumber():
    ph = []
    for element in browser.find_elements_by_xpath('.//span[@class = "phone ng-binding ng-scope"]'):
        ph.append(element.text)
    return ph

def email():
    mail = []
    for element in browser.find_elements_by_xpath('.//span[@class = "email ng-scope"]'):
        mail.append(element.text)
    return mail

ph = phonenumber()
mail = email()
Now use the returned lists to create the dataframe. This assumes the two lists are of equal length.
df = pd.DataFrame({'Phone':ph, 'EmailAddress':mail})
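If the two lists can end up with different lengths (say, a listing with a phone number but no email), the DataFrame constructor will raise a ValueError; padding the shorter list avoids that. A small sketch with made-up values:

```python
from itertools import zip_longest
import pandas as pd

ph = ['555-0100', '555-0101', '555-0102']
mail = ['a@example.com', 'b@example.com']  # one email missing

# zip_longest pads the shorter list with None so row lengths match
rows = list(zip_longest(ph, mail))
df = pd.DataFrame(rows, columns=['Phone', 'EmailAddress'])
```

The missing entry shows up as a null in the EmailAddress column, which you can later fill or drop.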
Edit: I realize the media list can also contain a video URL.
My question is: how can I get only the photo URL in the following loop?
I want to add an attribute called photourl containing the full URL from the media.
import snscrape.modules.twitter as sntwitter
import pandas as pd

# Creating list to append tweet data to
attributes_container = []

# Using TwitterSearchScraper to scrape data and append tweets to list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('sex for grades since:2021-07-05 until:2022-07-06').get_items()):
    if i > 150:
        break
    attributes_container.append([tweet.user.username, tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content, tweet.media])

# Creating a dataframe to load the list
tweets_df = pd.DataFrame(attributes_container, columns=["User", "Date Created", "Number of Likes", "Source of Tweet", "Tweet", "media"])
When I use snscrape to scrape tweets from Twitter, I want to filter the photo out of the media. I get a media object like the following:
media=[Photo(previewUrl='https://pbs.twimg.com/media/FePrYL7WQAQDKEB?format=jpg', fullUrl='https://pbs.twimg.com/media/FePrYL7WQAQDKEB?format=jpg&name=large')]
So how can I get the previewUrl and the fullUrl separately in Python?
Thanks
You can change your for loop to:
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('sex for grades since:2021-07-05 until:2022-07-06').get_items()):
    if i > 150:
        break
    try:
        tweetMedia = tweet.media[0].fullUrl  # .previewUrl if you want previewUrl
    except (AttributeError, IndexError, TypeError):
        tweetMedia = tweet.media  # or None or '' or any default value
    attributes_container.append([tweet.user.username, tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content, tweetMedia])
and then you'll have the URLs (if there are any) for each tweet row.
If you want it all inside the append statement, you can change it to:
attributes_container.append([
    tweet.user.username, tweet.date, tweet.likeCount,
    tweet.sourceLabel, tweet.content,
    (tweet.media[0].fullUrl if tweet.media
     and hasattr(tweet.media[0], 'fullUrl')
     else tweet.media)
])
instead of adding the try...except.
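If you want both URLs separately, as the edited question asks, a pair of list comprehensions over tweet.media works. In this sketch, Photo is a hypothetical stand-in class mimicking only the attributes visible in the question's printed repr, so it runs without snscrape:

```python
from dataclasses import dataclass

# Hypothetical stand-in for snscrape's Photo object, with just the
# two attributes shown in the question's output
@dataclass
class Photo:
    previewUrl: str
    fullUrl: str

media = [Photo(previewUrl='https://pbs.twimg.com/media/FePrYL7WQAQDKEB?format=jpg',
               fullUrl='https://pbs.twimg.com/media/FePrYL7WQAQDKEB?format=jpg&name=large')]

# Video entries lack these attributes, so hasattr filters them out
previews = [m.previewUrl for m in media if hasattr(m, 'previewUrl')]
fulls = [m.fullUrl for m in media if hasattr(m, 'fullUrl')]
```

Against a real tweet you would replace `media` with `tweet.media or []`.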
I have a page that shows company financials based on a customer's input (they enter which company they're looking for). Once they submit, in the view function I want to build 5 API URLs, fetch the JSON data from each, create a list of the dates from one API result (they all contain the same dates, so I will reuse that list for all), then create new dictionaries with specific data from each API call plus the list of dates. I then want to create a dataframe for each dictionary and render each as HTML to the template.
In my first attempt I made all the requests.get and json.loads calls one after another in a try block inside the "if request.method == 'POST'" block. This worked well when grabbing data from one API, but not with 5: I got a "local variable referenced before assignment" error, which makes me think either the multiple requests.get or json.loads calls were causing it.
My current attempt (written out of curiosity to see if it worked this way) does work as expected, but is obviously not correct, since it calls the API multiple times inside the for loop, as shown. (I have taken out some code for simplicity.)
def get_financials(request, *args, **kwargs):
    pd.options.display.float_format = '{:,.0f}'.format
    IS_financials = {}  # Income statement dictionary
    BS_financials = {}  # Balance sheet dictionary
    dates = []
    if request.method == 'POST':
        ticker = request.POST['ticker']
        IS_Url = APIURL1
        BS_URL = APIURL2
        try:
            IS_r = requests.get(IS_Url)
            IS = json.loads(IS_r.content)
            for year in IS:
                y = year['date']
                dates.append(y)
            for item in range(len(dates)):
                IS_financials[dates[item]] = {}
                IS_financials[dates[item]]['Revenue'] = IS[item]['revenue'] / thousands
                IS_financials[dates[item]]["Cost of Revenue"] = IS[item]['costOfRevenue'] / thousands
            IS_fundementals = pd.DataFrame.from_dict(IS_financials, orient="columns")
            for item in range(len(dates)):
                BS_r = requests.get(BS_URL)
                BS = json.loads(BS_r.content)
                BS_financials[dates[item]] = {}
                BS_financials[dates[item]]['Cash and Equivalents'] = BS[item]['cashAndCashEquivalents'] / thousands
                BS_financials[dates[item]]['Short Term Investments'] = BS[item]['shortTermInvestments'] / thousands
            BS_fundementals = pd.DataFrame.from_dict(BS_financials, orient="columns")
        except Exception as e:
            apiList = "Error..."
        return render(request, 'financials.html', {'IS': IS_fundementals.to_html(), 'BS': BS_fundementals.to_html()})
    else:
        return render(request, 'financials.html', {})
I'm trying to think of the proper way to do this. I'm new to Django/Python and not quite sure what best practice for a problem like this would be. I thought about making separate functions for each API, but then I would be unable to render them all on the same page. Can I use nested functions, where only the outer function renders the template and all inner functions simply return their dataframe to it? Would class-based views be better for something like this? I have never worked with class-based views yet, so there would be a bit of a learning curve.
Another question I have is how to change the HTML of the table rendered from a dataframe? The table/font currently rendered is quite large.
Thanks for any tips/advice!
It's not common to use pandas only for its .to_html() method, but I have invoked pandas in a Django view for less.
A more common approach is to loop over the IS and BS objects with the Django template language's loop tags to generate the HTML tables.
To make this method more efficient, move the BS API call out of the date loop, as long as the API response does not change with the date.
Reasonable timeouts on the API calls would help as well.
def get_financials(request, *args, **kwargs):
    pd.options.display.float_format = '{:,.0f}'.format
    IS_financials = {}  # Income statement dictionary
    BS_financials = {}  # Balance sheet dictionary
    dates = []
    if request.method == 'POST':
        ticker = request.POST['ticker']
        IS_Url = APIURL1
        BS_URL = APIURL2
        try:
            IS_r = requests.get(IS_Url, timeout=10)
            IS = json.loads(IS_r.content)
            BS_r = requests.get(BS_URL, timeout=10)
            BS = json.loads(BS_r.content)
            for year in IS:
                y = year['date']
                dates.append(y)
            for item in range(len(dates)):
                IS_financials[dates[item]] = {}
                IS_financials[dates[item]]['Revenue'] = IS[item]['revenue'] / thousands
                IS_financials[dates[item]]["Cost of Revenue"] = IS[item]['costOfRevenue'] / thousands
            IS_fundementals = pd.DataFrame.from_dict(IS_financials, orient="columns")
            for item in range(len(dates)):
                BS_financials[dates[item]] = {}
                BS_financials[dates[item]]['Cash and Equivalents'] = BS[item]['cashAndCashEquivalents'] / thousands
                BS_financials[dates[item]]['Short Term Investments'] = BS[item]['shortTermInvestments'] / thousands
            BS_fundementals = pd.DataFrame.from_dict(BS_financials, orient="columns")
        except Exception as e:
            apiList = "Error..."
        return render(request, 'financials.html', {'IS': IS_fundementals.to_html(), 'BS': BS_fundementals.to_html()})
    else:
        return render(request, 'financials.html', {})
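On the follow-up about the table being too large: pandas' DataFrame.to_html accepts a classes argument, so you can attach CSS classes (e.g. Bootstrap's compact-table classes, assuming Bootstrap is loaded in your template) instead of post-processing the HTML. A small sketch:

```python
import pandas as pd

# Made-up stand-in for one of the financial dataframes
df = pd.DataFrame({'Revenue': [1000, 1200]}, index=['2021', '2022'])

# classes are appended after the default "dataframe" class on the <table> tag;
# border=0 suppresses the default border attribute value
html = df.to_html(classes=['table', 'table-sm', 'table-striped'], border=0)
```

You can then shrink the font further with a CSS rule scoped to those classes in your stylesheet.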
I need to loop through commits and get name, date, and message info from the GitHub API:
https://api.github.com/repos/droptable461/Project-Project-Management/commits
I have tried many different things, but I keep getting stuck on a "string indices must be integers" error:
def git():
    # name, date, message
    # https://api.github.com/repos/droptable461/Project-Project-Management/commits
    # commit { author { name and date
    # commit { message
    # with urlopen('https://api.github.com/repos/droptable461/Project Project-Management/commits') as response:
    #     source = response.read()
    #     data = json.loads(source)
    #     state = []
    #     for state in data['committer']:
    #         state.append(state['name'])
    #     print(state)
    link = 'https://api.github.com/repos/droptable461/Project-Project-Management/events'
    r = requests.get('https://api.github.com/repos/droptable461/Project-Project-Management/commits')
    # print(r)
    # one = r['commit']
    # print(one)
    for item in r.json():
        for c in item['commit']['committer']:
            print(c['name'], c['date'])
    return 'suc'
I need to get the person who made the commit, the date, and their message.
item['commit']['committer'] is a dictionary, so the line
for c in item['commit']['committer']:
iterates over the dictionary's keys, which are strings.
Since you are then calling [] on a string (each key), you get the error.
Instead, the code should look more like:
def git():
    r = requests.get('https://api.github.com/repos/droptable461/Project-Project-Management/commits')
    for item in r.json():
        committer = item['commit']['committer']
        print(committer['name'])
        print(committer['date'])
        print(item['commit']['message'])
    return 'suc'
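To collect all three fields per commit rather than just printing them, the same nested indexing works in a comprehension. A sketch against a minimal, made-up slice of the commits API response shape:

```python
# Minimal made-up slice of the GitHub "list commits" response structure
commits = [
    {"commit": {"committer": {"name": "Alice", "date": "2020-01-01T00:00:00Z"},
                "message": "Initial commit"}},
    {"commit": {"committer": {"name": "Bob", "date": "2020-01-02T00:00:00Z"},
                "message": "Fix typo"}},
]

# One (name, date, message) tuple per commit
rows = [
    (c["commit"]["committer"]["name"],
     c["commit"]["committer"]["date"],
     c["commit"]["message"])
    for c in commits
]
```

With the real API you would pass `r.json()` in place of the `commits` literal.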
I wrote 2 functions so I can get a champion's ID from the champion's name, but then I wanted the reverse: get the champion's name knowing only the ID. I cannot figure out how to extract the name because of how the data is structured:
"data":{"Aatrox":{"version":"8.23.1","id":"Aatrox","key":"266","name":"Aatrox"
So in my code I wrote ['data']['championName' (in this case 'Aatrox')]['key'] to get the champion's ID/key. But how can I reverse it if I don't know the champion's name, only the ID? After writing ['data'] I need to supply the champion's name to go deeper and get the champion's info like ID, title, etc.
link: http://ddragon.leagueoflegends.com/cdn/8.23.1/data/en_US/champion.json
Code:
def requestChampionData(championName):
    name = championName.lower()
    name = name.title()
    URL = "http://ddragon.leagueoflegends.com/cdn/8.23.1/data/en_US/champion/" + name + ".json"
    response = requests.get(URL)
    return response.json()

def championID(championName):
    championData = requestChampionData(championName)
    championID = str(championData['data'][championName]['key'])
    return championID
Since Python dicts hold references to their values, you can make a new dict whose keys are the champion IDs and whose values point at the same entries as the previous dict; that way you don't duplicate much data. But be careful: if you change the data through one dict, it changes in the other one too.
def new_dict(d):
    return {val["id"]: val for val in d.values()}
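Applied to the champion.json shape from the question, the same dict-comprehension trick can index by the numeric "key" field for a direct reverse lookup (the entries below are a minimal made-up slice of that file):

```python
# Minimal slice of champion.json's "data" mapping
champions = {
    "Aatrox": {"version": "8.23.1", "id": "Aatrox", "key": "266", "name": "Aatrox"},
    "Ahri": {"version": "8.23.1", "id": "Ahri", "key": "103", "name": "Ahri"},
}

# Index the same champion dicts by their numeric key for O(1) reverse lookup
by_key = {val["key"]: val for val in champions.values()}

name = by_key["266"]["name"]
```

Note the values are shared, not copied: mutating `champions["Aatrox"]` is visible through `by_key["266"]` as well.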
I solved my problem with this code:
def championNameByID(id):
    championData = requestChampionData()
    allChampions = championData['data']
    for champion in allChampions:
        if id == allChampions[champion]['key']:
            championName = allChampions[champion]['name']
            return championName
How can I split my big dataframe into smaller dataframes and print each of them separately on the web? Any idea how to edit the code, e.g. by placing a loop in context?
here is my code:
def read_raw_data(request):
    Wb = pd.read_excel(r"LookAhead.xlsm", sheetname="Step")
    Step1 = Wb.replace(np.nan, '', regex=True)
    drop_column = Step1.drop(['facility', 'volume', 'indicator_product'], 1)
    uniquevaluesproduct = np.unique(drop_column[['Product']].values)
    total_count = drop_column['Product'].nunique()
    row_array = []
    for name, group in drop_column.groupby('Product'):
        group = group.values.tolist()
        row_array.append(group)
    i = 1
    temp = row_array[0]
    while i < total_count:
        newb = temp + row_array[i]
        temp = newb
        i = i + 1
    b = ['indicator', 'Product']
    test = pd.DataFrame.from_records(temp, columns=b)
    table = test.style.set_table_attributes('border="" class="dataframe table table-hover table-bordered"').set_precision(10).render()
    context = {"result": table}
    return render(request, 'result.html', context)
If you want to show a big dataframe across different pages, I recommend using a Paginator. The documentation has a good example of how to implement it:
https://docs.djangoproject.com/en/1.10/topics/pagination/#using-paginator-in-a-view
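Alternatively, since the goal is one table per Product, pandas' groupby can do the splitting directly instead of the manual while loop, and each group can be rendered as its own HTML table. A sketch with made-up data:

```python
import pandas as pd

# Made-up stand-in for the columns left after the drop() call
df = pd.DataFrame({
    'indicator': ['a', 'b', 'c', 'd'],
    'Product': ['X', 'X', 'Y', 'Y'],
})

# One small dataframe per product, each rendered as its own HTML table;
# in a Django view you would put this dict in the template context
tables = {name: group.to_html(index=False) for name, group in df.groupby('Product')}
```

In the template you could then loop over `tables` and emit each table in turn.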