Module 'pylab' has no attribute 'scatter' - python

I am working on a linear regression model for stock ticker data, but I can't get Pylab working properly. I have successfully plotted the data, but I want to get a line of best fit for the data I have. (Not for any particular purpose, just a random set of data to use linear regression on.)
import pylab
import urllib.request
from matplotlib import pyplot as plt
from bs4 import BeautifulSoup
import requests

def chartStocks(*tickers):
    # Run loop for each ticker passed in as an argument
    for ticker in tickers:
        # Convert URL into text for parsing
        url = "http://finance.yahoo.com/q/hp?s=" + str(ticker) + "+Historical+Prices"
        sourceCode = requests.get(url)
        plainText = sourceCode.text
        soup = BeautifulSoup(plainText, "html.parser")
        # Find all links on the page
        for link in soup.findAll('a'):
            href = link.get('href')
            link = []
            for c in href[:48]:
                link.append(c)
            link = ''.join(link)
            # Find the URL for the stock ticker CSV file and convert the data to text
            if link == "http://real-chart.finance.yahoo.com/table.csv?s=":
                csv_url = href
                res = urllib.request.urlopen(csv_url)
                csv = res.read()
                csv_str = str(csv)
                # Parse the CSV to create a list of data points
                point = []
                points = []
                curDay = 0
                day = []
                commas = 0
                lines = csv_str.split("\\n")
                lineOne = True
                for line in lines:
                    commas = 0
                    if lineOne == True:
                        lineOne = False
                    else:
                        for c in line:
                            if c == ",":
                                commas += 1
                            if commas == 4:
                                point.append(c)
                            elif commas == 5:
                                for x in point:
                                    if x == ",":
                                        point.remove(x)
                                point = ''.join(point)
                                point = float(point)
                                points.append(point)
                                day.append(curDay)
                                curDay += 1
                                point = []
                                commas = 0
                points = list(reversed(points))
                # Plot the data
                pylab.scatter(day, points)
                pylab.xlabel('x')
                pylab.ylabel('y')
                pylab.title('title')
                k, b = pylab.polyfit(day, points, 1)
                yVals = k * day + b
                pylab.plot(day, yVals, c='r', linewidth=2)
                pylab.title('title')
                pylab.show()

chartStocks('AAPL')
For some reason I get an attribute error, and I'm not sure why. Am I improperly passing data into pylab.scatter()? I'm not totally sure whether passing a list for the x and y values is the correct approach. I haven't been able to find anyone else who has run into this issue, and .scatter is definitely part of Pylab, so I'm not sure what's going on.

I think that there is a version clash. Since pyplot is already imported as plt, try:
plt.scatter(day, points)

When you use pylab it imports some other packages for you. When you do import pylab you get numpy under the prefix np, so you would call np.polyfit. As this question shows, I think it is clearer to readers of the code if you just import numpy directly and do this yourself.
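For reference, here is a minimal sketch of the same scatter-plus-fit step using numpy and matplotlib.pyplot directly; the day and points lists below are placeholders standing in for the parsed CSV data:

import numpy as np
from matplotlib import pyplot as plt

day = [0, 1, 2, 3, 4]                      # placeholder x values (trading days)
points = [10.0, 10.5, 11.2, 10.9, 11.8]    # placeholder y values (closing prices)

plt.scatter(day, points)
k, b = np.polyfit(day, points, 1)          # slope and intercept of the least-squares line
y_vals = k * np.asarray(day) + b           # use an array so the arithmetic is vectorized
plt.plot(day, y_vals, c='r', linewidth=2)
plt.xlabel('day')
plt.ylabel('price')
plt.show()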

Related

how to stop repeating same text in loops python

from re import I
from requests import get

res = get("https://subsplease.org/api/?f=latest&tz=canada/central").json()
kek = []
for x in res:
    kek.append(x)
lnk = res[kek[0]]['downloads']
anime_name = res[kek[0]]['show']
for x in lnk:
    quality = x['res']
    links = x['magnet']
    data = f"{anime_name}:\n\n{quality}: {links}\n\n"
    print(data)
In this code, how can I prevent the anime name from repeating?
If I put it outside of the loop, only one link gets printed.
You can split your string: print the first half outside the loop and the second half inside the loop:
print(f"{anime_name}:\n\n")
for x in lnk:
quality = x['res']
links = x['magnet']
data = f"{quality}: {links}\n\n"
print(data)
Rewrote it a bit. Make sure you look at a 'pretty' version of the JSON response, using pprint or something similar, to understand where the elements are and where you can loop (remembering to iterate through the dict).
from requests import get

data = get("https://subsplease.org/api/?f=latest&tz=canada/central").json()
for show, info in data.items():
    print(show, '\n')
    for download in info['downloads']:
        print(download['magnet'])
        print(download['res'])
    print('\n')
Also, you won't usually be able to just copy these links to get to the download; you usually need to use a torrent website.
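As an aside, a quick way to get that 'pretty' view of the JSON mentioned above is the standard-library pprint module:

from pprint import pprint
from requests import get

data = get("https://subsplease.org/api/?f=latest&tz=canada/central").json()
pprint(data)  # pretty-prints the nested dict so the 'show' and 'downloads' keys are easy to spot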

Webscraping NSE Options prices using Python BeautifulSoup, regarding encoding correction

Dec 2020 update:
I have:
Achieved full automation: minute-level data collection for the entire FnO universe.
Auto-adapts to the changing FnO universe, including exits and new entries.
Shuts down during non-market hours.
Shuts down on holidays, including newly declared holidays.
Starts automatically for the yearly Muhurat Trading session.
I am a bit new to web scraping and not used to the 'tr' & 'td' stuff, hence this doubt. I am trying to replicate, in Python 3, the Python 2.7 code from this thread: 'https://www.quantinsti.com/blog/option-chain-extraction-for-nse-stocks-using-python'.
The old code uses .ix for indexing, which I can easily correct with .iloc. However, the line tr = tr.replace(',' , '') raises the error 'a bytes-like object is required, not 'str'' even if I write it before tr = utf_string.encode('utf8').
I have checked this other link from Stack Overflow and couldn't solve my problem.
I think I have spotted why this is happening: it's because of the earlier for loop that defines the variable tr. If I omit this line, I get a DataFrame of numbers with some text attached to them. I could filter that out with a loop over the entire DataFrame, but a better way must be to use the replace() function properly. I can't figure this bit out.
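For reference, the error itself can be reproduced in isolation; this tiny snippet (separate from the scraping code) triggers exactly the same message:

s = "1,234"
b = s.encode('utf8')   # now a bytes object
b.replace(',', '')     # TypeError: a bytes-like object is required, not 'str'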
Here is my full code. I have marked the critical sections I am referring to with ######################### on its own line, so they can be found quickly (even with Ctrl + F):
import requests
import pandas as pd
from bs4 import BeautifulSoup

Base_url = ("https://nseindia.com/live_market/dynaContent/"+
            "live_watch/option_chain/optionKeys.jsp?symbolCode=2772&symbol=UBL&"+
            "symbol=UBL&instrument=OPTSTK&date=-&segmentLink=17&segmentLink=17")
page = requests.get(Base_url)
#page.status_code
#page.content
soup = BeautifulSoup(page.content, 'html.parser')
#print(soup.prettify())

table_it = soup.find_all(class_="opttbldata")
table_cls_1 = soup.find_all(id="octable")

col_list = []

# Pulling heading out of the Option Chain Table
#########################
for mytable in table_cls_1:
    table_head = mytable.find('thead')
    try:
        rows = table_head.find_all('tr')
        for tr in rows:
            cols = tr.find_all('th')
            for th in cols:
                er = th.text
                #########################
                ee = er.encode('utf8')
                col_list.append(ee)
    except:
        print('no thread')

col_list_fnl = [e for e in col_list if e not in ('CALLS', 'PUTS', 'Chart', '\xc2\xa0')]
#print(col_list_fnl)

table_cls_2 = soup.find(id="octable")
all_trs = table_cls_2.find_all('tr')
req_row = table_cls_2.find_all('tr')
new_table = pd.DataFrame(index=range(0, len(req_row)-3), columns=col_list_fnl)

row_marker = 0
for row_number, tr_nos in enumerate(req_row):
    if row_number <= 1 or row_number == len(req_row)-1:
        continue  # To insure we only choose non empty rows
    td_columns = tr_nos.find_all('td')
    # Removing the graph column
    select_cols = td_columns[1:22]
    cols_horizontal = range(0, len(select_cols))
    for nu, column in enumerate(select_cols):
        utf_string = column.get_text()
        utf_string = utf_string.strip('\n\r\t": ')
        #########################
        tr = tr.replace(',' , '')  # Commenting this out makes code partially work, getting numbers + text attached to the numbers in the table
        tr = utf_string.encode('utf8')
        new_table.iloc[row_marker, [nu]] = tr
    row_marker += 1
print(new_table)
For the first section:
er = th.text should be er = th.get_text()
Link to get_text documentation
For the latter section:
Looking at it, your tr variable at this point is the last tr tag found in the soup by for tr in rows. This means the tr you are trying to call replace on is a BeautifulSoup element, not a plain string.
tr = tr.get_text().replace(',' , '') would work for the first iteration; however, since it overwrites tr on that first iteration, it will break on the next one.
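A minimal sketch of that idea, keeping the cell text in its own variable instead of reusing tr (names such as select_cols, new_table and row_marker come from the question's code):

for nu, column in enumerate(select_cols):
    cell_text = column.get_text().strip('\n\r\t": ')   # plain str, safe to call replace on
    cell_text = cell_text.replace(',', '')             # drop thousands separators
    new_table.iloc[row_marker, [nu]] = cell_text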
Additionally, thank you for the depth of your question. While you did not pose it as a question, the length you went to describe the trouble you are having as well as the code you have tried is greatly appreciated.
If you replace the below lines of codes
tr = tr.replace(',' , '')
tr = utf_string.encode('utf8')
new_table.iloc[row_marker,[nu]] = tr
with the following line, it should work:
new_table.iloc[row_marker,[nu]] = utf_string.replace(',' , '')
This calls replace on the plain string before any encoding, so it avoids the 'a bytes-like object is required' error. You can also consider using the code below to decode the column names:
col_list_fnl = [e.decode('utf8') for e in col_list if e not in ('CALLS', 'PUTS', 'Chart', '\xc2\xa0')]
col_list_fnl
I hope this helps.

getting specific part from a page source python

I am trying to extract a specific part from a page using regex but it isn't working.
This is the part I want to be extracted from the page:
{"clickTrackingParams":"CPcBEJhNIhMIwrDVo4qw3gIVTBnVCh28iAtzKPgd","commandMetadata":{"webCommandMetadata":{"url":"/service_ajax","sendPost":true}},"performCommentActionEndpoint":{"action":"CAUQAhoaVWd4MEdWUGNadTdvclcwT09WdDRBYUFCQWcqC1pNZlAzaERwdjlBMAA4AEoVMTA1MTc3MTgyMDc5MDg5MzQ1ODM4UACKAVQSC1pNZlAzaERwdjlBMixlaHBWWjNnd1IxWlFZMXAxTjI5eVZ6QlBUMVowTkVGaFFVSkJadyUzRCUzRMABAMgBAOABAaICDSj___________8BQAA%3D","clientActions":[{"updateCommentVoteAction":{"voteCount":{"accessibility":{"accessibilityData":{"label":"80 likes"}},"simpleText":"80"},"voteStatus":"LIKE"}}]}}
So far I've tried this:
import requests
import re
r = requests.get('http://rophoto.es/ash.txt')
html_source = r.text
mystrx = re.search(r'^{"clickTrackingParams".*"voteStatus":"LIKE"}}]}}', html_source)
but it didn't work out for me.
Try this:
import requests
import re
r = requests.get('http://rophoto.es/ash.txt')
html_source = r.text
fst, snd = '{"clickTrackingParams":', '"voteStatus":"LIKE"}}]}}'
# Find the first occurrence
end = html_source.find(snd)
# Get closest index
start = max(idx.start() for idx in re.finditer(fst, html_source) if idx.start() < end)
print(html_source[start:end+len(snd)])
Which Outputs:
{"clickTrackingParams":"CPcBEJhNIhMIwrDVo4qw3gIVTBnVCh28iAtzKPgd","commandMetadata":{"webCommandMetadata":{"url":"/service_ajax","sendPost":true}},"performCommentActionEndpoint":{"action":"CAUQAhoaVWd4MEdWUGNadTdvclcwT09WdDRBYUFCQWcqC1pNZlAzaERwdjlBMAA4AEoVMTA1MTc3MTgyMDc5MDg5MzQ1ODM4UACKAVQSC1pNZlAzaERwdjlBMixlaHBWWjNnd1IxWlFZMXAxTjI5eVZ6QlBUMVowTkVGaFFVSkJadyUzRCUzRMABAMgBAOABAaICDSj___________8BQAA%3D","clientActions":[{"updateCommentVoteAction":{"voteCount":{"accessibility":{"accessibilityData":{"label":"80 likes"}},"simpleText":"80"},"voteStatus":"LIKE"}}]}}
If you want to get the second occurrence, you can try something along the lines of:
import requests
import re
r = requests.get('http://rophoto.es/ash.txt')
html_source = r.text
fst, snd = '{"clickTrackingParams":', '"voteStatus":"LIKE"}}]}}'
def find_nth(string, to_find, n):
    """
    Finds the nth match in string
    """
    # find all occurrences
    matches = [idx.start() for idx in re.finditer(to_find, string)]
    # return the nth match
    return matches[n]
# finds second match
end = find_nth(html_source, snd, 1)
# Gets closest index to end
start = max(idx.start() for idx in re.finditer(fst, html_source) if idx.start() < end)
print(html_source[start:end+len(snd)])
Note: in the second example you can run into IndexErrors if you request an occurrence outside of the found matches. You will need to handle this behaviour yourself.
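One way to guard against that, as a rough sketch building on the snippet above (the index 5 is just an arbitrary out-of-range example):

try:
    end = find_nth(html_source, snd, 5)  # raises IndexError if there are fewer than 6 matches
except IndexError:
    end = -1  # no such occurrence; handle however makes sense (skip the page, log a warning, etc.)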

Extract specific data from an embedded javascript in webpage

I want to extract only the latitudes from the link: "http://hdfc.com/branch-locator" using the method given below.
The latitudes are given inside a javascript variable called 'location'.
The code is:
from lxml import html
import re
URL = "http://hdfc.com/branch-locator"
var_lat = re.compile('(?<="latitude":).+(?=")')
main_page = html.parse(URL).getroot()
lat = main_page.xpath("//script[@type='text/javascript']")[1]
ans = re.search(var_lat,str(lat))
print ans
But the output comes as "None". What changes should I make to the code without changing the approach to the problem?
I think a few small changes are required.
In the line
lat = main_page.xpath("//script[@type='text/javascript']")[1] # This should be 10
The line
ans = re.search(var_lat,str(lat))
should be
ans = re.search(var_lat, lat.text)
str(lat) is going to call the __str__ method of the lat object, which is not the same as lat.text.
In general it is a good idea to go through all the script elements first and then search for the desired string. So this should be:
lat = main_page.xpath("//script[@type='text/javascript']")
for l in lat:
    if l.text is None:
        continue
    # print l.text
    ans = re.search(var_lat, l.text)
    if ans is not None:
        break
print ans
Sorry, edited to fix the issue. Note: This may not be the exact solution you want - but should give you the first instance where the required regex is matched. You might want to process ans further.
The code that I have written below works for an embedded javascript in a webpage.
from lxml import html
from json import dump
import re

dumped_data = []

class theAddress:
    latude = ""

URL = "http://hdfc.com/branch-locator"
var_lat = re.compile('(?<="latitude":").+?(?=")')
main_page = html.parse(URL).getroot()
residue = main_page.xpath("//script[@type='text/javascript']/text()")[1]
all_latude = re.findall(var_lat, residue)
for i in range(len(all_latude)):
    obj = theAddress()
    obj.latude = all_latude[i]
    dumped_data.append(obj.__dict__)
f = open('hdfc_add.json', 'w')
dump(dumped_data, f, indent=1)
It also makes use of the json module to store the scraped data in a proper format.
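As a quick illustration of what that lookbehind/lookahead pattern pulls out (the JSON fragment below is made up for the example):

import re

var_lat = re.compile('(?<="latitude":").+?(?=")')
sample = '{"latitude":"19.0760","longitude":"72.8777"}'  # made-up fragment
print(re.findall(var_lat, sample))  # ['19.0760']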

How to Crawl Multiple Websites to find common Words (BeautifulSoup,Requests,Python3)

I'm wondering how to crawl multiple different websites using beautiful soup/requests without having to repeat my code over and over.
Here is my code right now:
import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd
Website1 = requests.get("http://www.nerdwallet.com/the-best-credit-cards")
soup = BeautifulSoup(Website1.content)
texts = soup.findAll(text=True)
a = Counter([x.lower() for y in texts for x in y.split()])
b = (a.most_common())
makeaframe = pd.DataFrame(b)
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)
What I am trying to do is ideally crawl 5 different websites, find all of the individual words on these websites, find the frequency of each word on each website, ADD all the frequencies together for each particular word, then combine all of this data into one dataframe that can be exported using Pandas.
Hopefully the output would look like this
Word Frequency
the 200
man 300
is 400
tired 300
My code can only do this for ONE website at a time right now and I'm trying to avoid repeating my code.
Now, I can do this manually by repeating my code over and over and crawling each individual website and then concatenating my results for each of these dataframes together but that seems very unpythonic. I was wondering if anyone had a faster way or any advice? Thank you!
Make a function:
import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd

cnt = Counter()

def GetData(url):
    Website1 = requests.get(url)
    soup = BeautifulSoup(Website1.content)
    texts = soup.findAll(text=True)
    a = Counter([x.lower() for y in texts for x in y.split()])
    cnt.update(a)  # pass the Counter itself so the per-word counts are summed

websites = ['http://www.nerdwallet.com/the-best-credit-cards', 'http://www.other.com']
for url in websites:
    GetData(url)

makeaframe = pd.DataFrame(cnt.most_common())
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)
Just loop and update a main Counter dict:
main_c = Counter()  # keep all results here
urls = ["http://www.nerdwallet.com/the-best-credit-cards", "http://stackoverflow.com/questions/tagged/python"]
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content)
    texts = soup.findAll(text=True)
    a = Counter([x.lower() for y in texts for x in y.split()])
    main_c.update(a)  # update with the Counter so counts are added per word

make_a_frame = pd.DataFrame(main_c.most_common())
make_a_frame.columns = ['Words', 'Frequency']
print(make_a_frame)
Counter.update, unlike a normal dict.update, adds to the existing values rather than replacing them.
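A tiny illustration of that behaviour, with made-up counts:

from collections import Counter

main_c = Counter({'the': 3, 'man': 1})
main_c.update(Counter({'the': 2, 'tired': 5}))
print(main_c['the'], main_c['man'], main_c['tired'])  # 5 1 5 -- counts are summed, not replaced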
On a style note, use lowercase variable names with underscores, e.g. make_a_frame.
To sort the frame by frequency, try:
comm = [[k, v] for k, v in main_c.items()]
make_a_frame = pd.DataFrame(comm)
make_a_frame.columns = ['Words', 'Frequency']
print(make_a_frame.sort_values("Frequency", ascending=False))
