I have an array of strings (stock ticker symbols) that I have scraped from Twitter. I scrape the symbols from one person's feed, but sometimes the feed has multiple tweets about the same ticker, so that ticker ends up repeated multiple times in my array. How do I stop the stock ticker from repeating in my array?
Here is my code
import csv
import urllib.request
from bs4 import BeautifulSoup
twiturl = "https://twitter.com/ACInvestorBlog"
twitpage = urllib.request.urlopen(twiturl)
soup = BeautifulSoup(twitpage,"html.parser")
tweets = [i.text for i in soup.select('a.twitter-cashtag.pretty-link.js-nav b')]
print(tweets)
Here is what prints out:
['AYTU', 'AYTU', 'AYTU', 'AYTU', 'INDU', 'JPM', 'BAC', 'INPX', 'MSFT', 'SPX', 'HMNY', 'YTEN', 'INPX', 'MACK', 'KDMN', 'AMBA', 'KDMN', 'KDMN', 'MACK']
Use a set comprehension instead of the list comprehension that you're using:
tweets = {i.text for i in soup.select('a.twitter-cashtag.pretty-link.js-nav b')}
You can transform your set into a list using the code below, if you need to:
tweets = list(tweets)
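For example, applying the same idea to the sample output above gives each ticker once, though in arbitrary order since sets are unordered:

tweets = ['AYTU', 'AYTU', 'AYTU', 'AYTU', 'INDU', 'JPM', 'BAC', 'INPX', 'MSFT', 'SPX',
          'HMNY', 'YTEN', 'INPX', 'MACK', 'KDMN', 'AMBA', 'KDMN', 'KDMN', 'MACK']
unique = list(set(tweets))  # same deduplication, applied to the sample list
print(unique)               # each ticker appears once, order not guaranteed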
You can use an empty dictionary.
In the loop, you can perform a check:
if the dictionary does not contain the current element as a key, append the element to tweets and add it to the dictionary.
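A minimal sketch of that idea, using the dictionary purely as a "seen" lookup (the seen name is just for illustration):

tweets = []
seen = {}
for i in soup.select('a.twitter-cashtag.pretty-link.js-nav b'):
    if i.text not in seen:       # the key is not in the dictionary yet
        seen[i.text] = True      # remember it
        tweets.append(i.text)    # keep only the first occurrence
print(tweets)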
You can do a simple check on each iteration of the for loop:
tweets = []
for i in soup.select('a.twitter-cashtag.pretty-link.js-nav b'):
    if i.text not in tweets:
        tweets.append(i.text)
print(tweets)
I have a list of stock indexes (indizies) and several lists of stock tickers (e.g. gdaxi, mdaxi).
I want to download the stocks from Yahoo in two loops.
Background: in the real program the user can choose which index or indexes to download.
The problem is that index_name is a string, but for the second loop index_name has to be a list. Instead, the second loop iterates over index_name as a string.
Result: it tries to download a CSV for g, d, a, x, i.
Question: how can I transform index_name from a string into a list?
from pandas_datareader import data as pdr

indizies = ['GDAXI', 'MDAXI']
gdaxi = ["ADS.DE", "AIR.DE", "ALV.DE"]
mdaxi = ["AIXA.DE", "AT1.DE"]

for index_name in indizies:
    for ticker in index_name:
        df = pdr.get_data_yahoo(ticker)
        df.to_csv(f'{ticker}.csv')
With for ticker in index_name you are iterating over the letters of the given strings.
I guess you need to change your code to something like:
from pandas_datareader import data as pdr

gdaxi = ["ADS.DE", "AIR.DE", "ALV.DE"]
mdaxi = ["AIXA.DE", "AT1.DE"]
indizies = [gdaxi, mdaxi]

for index_name in indizies:
    for ticker in index_name:
        df = pdr.get_data_yahoo(ticker)
        df.to_csv(f'{ticker}.csv')
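Since the background says the user can choose which indexes to download, another option (a sketch using the same ticker lists, not part of the original answer) is to keep the lists in a dictionary keyed by index name, so the user's choices can stay strings:

from pandas_datareader import data as pdr

# map each index name to its list of tickers
indizies = {
    'GDAXI': ["ADS.DE", "AIR.DE", "ALV.DE"],
    'MDAXI': ["AIXA.DE", "AT1.DE"],
}

chosen = ['GDAXI', 'MDAXI']  # e.g. whatever the user selected
for index_name in chosen:
    for ticker in indizies[index_name]:  # look up the ticker list for this index
        df = pdr.get_data_yahoo(ticker)
        df.to_csv(f'{ticker}.csv')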
I'm collecting some market data from Binance's API. My goal is to collect the list of all markets and use the 'status' key included in each row to detect if the market is active or not. If it's not active, I must search the last trade to collect the date of the market's shutdown.
I wrote this code
import requests
import pandas as pd
import json
import csv

url = 'https://api.binance.com/api/v3/exchangeInfo'
trade_url = 'https://api.binance.com/api/v3/trades?symbol='

response = requests.get(url)
data = response.json()
df = data['symbols']  # list of dicts

json_data = []
with open(r'C:\Users\Utilisateur\Desktop\json.csv', 'a', encoding='utf-8', newline='') as j:
    wr = csv.writer(j)
    wr.writerow(["symbol", "last_trade"])
    for i in data['symbols']:
        if data[i]['status'] != "TRADING":
            trades_req = requests.get(trade_url + i)
            print(trades_req)
but I got this error
TypeError: unhashable type: 'dict'
How can I avoid it?
That's because i is a dictionary. If data['symbols'] is a list of dictionaries, when you write this in the loop:
for i in data['symbols']:
    if data[i]['status'] ...
you are trying to hash i to use it as a key of data. I think you want to know the status of each dictionary in the list. That is:
for i in data['symbols']:
    if i['status'] ...
In such a case, it would be better to use more descriptive variable names, e.g. d, s, or symbol, instead of i.
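Putting that together, a minimal sketch of the corrected loop (assuming, as in the original code, that each entry in data['symbols'] has 'symbol' and 'status' keys) might look like:

import csv
import requests

url = 'https://api.binance.com/api/v3/exchangeInfo'
trade_url = 'https://api.binance.com/api/v3/trades?symbol='

data = requests.get(url).json()

with open('json.csv', 'a', encoding='utf-8', newline='') as j:
    wr = csv.writer(j)
    wr.writerow(["symbol", "last_trade"])
    for symbol in data['symbols']:            # each entry is a dict
        if symbol['status'] != "TRADING":
            # fetch the last trades for this symbol's ticker string
            trades_req = requests.get(trade_url + symbol['symbol'])
            print(trades_req)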
I've managed to expose the right data (some of it is calculated on the fly in the page, so it was a bit more complex than I thought), but I now need to get it into a JSON string, and despite many attempts I'm stuck!
This Python script is as follows (using Selenium & BeautifulSoup):
from bs4 import BeautifulSoup
from selenium import webdriver
import datetime
from dateutil import parser
import requests
import json
url = 'https://www.braintree.gov.uk/bins-waste-recycling/route-3-collection-dates/1'
browser = webdriver.Chrome(executable_path = r'C:/Users/user/Downloads/chromedriver.exe')
browser.get(url)
html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("div", {"class": "date_display"})
#print(data)
#out = {}
for data in data:
    bin_colour = data.find('h3').text
    bin_date = parser.parse(data.find('p').text).strftime('%Y-%m-%d')
    print(bin_colour)
    print(bin_date)
    print()
browser.quit()
This results in:
Grey Bin
2021-06-30
Green Bin
2021-06-23
Clear Sack
2021-06-23
Food Bin
2021-06-23
It might (probably) not be the best code/approach so am open to your suggestions. The main goal is to end up with:
{"Grey Bin": "2021-06-30", "Green Bin": "2021-06-23", "Clear Sack": "2021-06-23", "Food Bin": "2021-06-23"}
Hope this makes sense. I've tried various ways of getting the data into the right format but just seem to lose it all, so after many hours of trying I'm hoping you guys can help.
Update:
Both of MendelG's solutions worked perfectly. Vitalis's solution gave four outputs, the last being the required output - so thank you both for very quick and working solutions - I was close, but couldn't see the wood for the trees!
To get the data in a dictionary format, you can try:
out = {}
for data in tag:  # tag is the result of soup.find_all("div", {"class": "date_display"})
    out[data.find("h3").text] = parser.parse(data.find("p").text).strftime("%Y-%m-%d")
print(out)
Or, use a dictionary comprehension:
print(
    {
        data.find("h3").text: parser.parse(data.find("p").text).strftime("%Y-%m-%d")
        for data in tag
    }
)
Output:
{'Grey Bin': '2021-06-30', 'Green Bin': '2021-06-23', 'Clear Sack': '2021-06-23', 'Food Bin': '2021-06-23'}
You can create an empty dictionary, add values there and print it.
Solution
from bs4 import BeautifulSoup
from selenium import webdriver
import datetime
from dateutil import parser
import requests
import json
url = 'https://www.braintree.gov.uk/bins-waste-recycling/route-3-collection-dates/1'
browser = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
browser.get(url)
html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("div", {"class":"date_display"})
result = {}
for item in data:
    bin_colour = item.find('h3').text
    bin_date = parser.parse(item.find('p').text).strftime('%Y-%m-%d')
    result[bin_colour] = bin_date
print(result)
OUTPUT
{'Grey Bin': '2021-06-30', 'Green Bin': '2021-06-23', 'Clear Sack': '2021-06-23', 'Food Bin': '2021-06-23'}
You can do it in a similar way if you need the output as a list, but then you'll need to .append the values, similar to what I did in Trouble retrieving elements and looping pages using next page button.
If you need double quotes, use this print:
print(json.dumps(result))
It will print:
{"Grey Bin": "2021-06-30", "Green Bin": "2021-06-23", "Clear Sack": "2021-06-23", "Food Bin": "2021-06-23"}
You could collect all the listed dates using requests and re. You regex out the various JavaScript objects containing the dates for each collection type. You then need to add 1 to each month value to get the month into the range 1-12, which can be done with regex named groups. These can be converted to actual dates for later filtering.
Initially storing all dates in a dictionary with key as collection type and values as a list of collection dates, you can use zip_longest to create a DataFrame. You can then use filtering to find the next collection date for a given collection.
I use a couple of helper functions to achieve this.
import re
import requests
from dateutil import parser
from datetime import datetime
from pandas import to_datetime, DataFrame
from itertools import zip_longest

def get_dates(dates):
    # pull the year,month,day groups out of each JavaScript Date(...) and add 1 to the month
    dates = [re.sub(r'(?P<g1>\d+),(?P<g2>\d+),(?P<g3>\d+)$',
                    lambda d: parser.parse('-'.join([d.group('g1'), str(int(d.group('g2')) + 1), d.group('g3')])).strftime('%Y-%m-%d'),
                    i)
             for i in re.findall(r'Date\((\d{4},\d{1,2},\d{1,2}),', dates)]
    dates = [datetime.strptime(i, '%Y-%m-%d').date() for i in dates]
    return dates

def get_next_collection(collection, df):
    # first collection date that is today or later for the given column
    return df[df[collection] >= to_datetime('today')][collection].iloc[0]

collection_types = ['grey', 'green', 'clear', 'food']
r = requests.get('https://www.braintree.gov.uk/bins-waste-recycling/route-3-collection-dates/1')

collections = {}
for collection in collection_types:
    dates = re.search(r'var {0}(?:(?:bin)|(?:sack)) = (\[.*?\])'.format(collection), r.text, re.S).group(1)
    collections[collection] = get_dates(dates)

df = DataFrame(zip_longest(collections['grey'], collections['green'],
                           collections['clear'], collections['food']),
               columns=collection_types)

get_next_collection('grey', df)
You could also use a generator and islice, as detailed by @Martijn Pieters, to work directly off the dictionary entries (holding the collection dates) and limit how many future dates you are interested in, e.g.
filtered = (i for i in collections['grey'] if i >= date.today())
list(islice(filtered, 3))
Altered import lines are:
from itertools import zip_longest, islice
from datetime import datetime, date
You then don't need the pandas imports or creation of a DataFrame.
How do I get the resulting URL: https://www.sec.gov/Archives/edgar/data/1633917/000163391718000094/0001633917-18-000094-index.htm
...from this page ...
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001633917&owner=exclude&count=40
... by specifying date = '2018-04-25' and Filing type 8-K? Do I loop through, or is there a one-liner that will get me the result?
from bs4 import BeautifulSoup
from bs4.element import Comment
import requests

date = '2018-04-25'
CIK = '1633917'

url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=' + CIK + '&owner=exclude&count=100'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

a = soup.find('table', class_='tableFile2').findAll('tr')
for i in a:
    print(i)
There is no one-liner to get what you want. You'll have to loop through the rows and then check whether the values match.
But, there is a slightly better approach which narrows down the rows. You can directly select the rows which match one of the values. For example, you can select all the rows which have date = '2018-04-25' and then check if the Filing matches.
Code:
for date in soup.find_all('td', text='2018-04-25'):
    row = date.find_parent('tr')
    if row.td.text == '8-K':
        link = row.a['href']
        print(link)
Output:
/Archives/edgar/data/1633917/000163391718000094/0001633917-18-000094-index.htm
So, here, instead of looping over all the rows, you simply loop over the rows having the date you want. In this case, there is only one such row, and hence we loop only once.
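If you want the absolute URL from the question rather than the relative path, you can prepend the host (a small addition, not part of the original answer):

full_link = 'https://www.sec.gov' + row.a['href']
print(full_link)
# https://www.sec.gov/Archives/edgar/data/1633917/000163391718000094/0001633917-18-000094-index.htm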
I have a list of several stock tickers:
ticker = ('GE', 'IBM', 'GM', 'F', 'PG', 'CSCO')
That I want to pass to the URL in my python program.
url = "https://www.quandl.com/api/v3/datasets/WIKI/FB.json"
I'm trying to pass a new ticker into the URL on each subsequent pass through my program. I'm struggling with how to pass each ticker from the list into the URL as the program loops. The program needs to grab a new ticker from the list and replace the one in the URL.
Example: after the first pass, the program should grab GE from the list, replace FB in the URL, and continue looping until all tickers have been passed to the URL. Not sure how best to handle this part of the program. Any help would be appreciated.
import requests

url_tpl = "https://www.quandl.com/api/v3/datasets/WIKI/{ticker}.json"

# Here your results will be stored
jsons = {}

for ticker in ('FB', 'GE', 'IBM', 'GM', 'F', 'PG', 'CSCO'):
    res = requests.get(url_tpl.format(ticker=ticker))
    if res.status_code == 200:
        jsons[ticker] = res.json()
    else:
        print('error while fetching {ticker}, response code: '
              '{status}'.format(ticker=ticker, status=res.status_code))
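After the loop, jsons maps each successfully fetched ticker to its parsed JSON response, so you can work with the results afterwards. A minimal usage sketch (a hypothetical follow-up that makes no assumptions about the response structure):

import json

# e.g. persist each response to its own file
for ticker, payload in jsons.items():
    with open(f'{ticker}.json', 'w') as f:
        json.dump(payload, f)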