Problem with loop in Python - multiple entries of the same data

I am appending to a .csv file with Python. The data is scraped from the web, and I am through with almost everything related to the scraping itself.
The problem comes when I try to append to the file: it writes hundreds of entries of the same data, so I am sure there is a problem with the for/if logic that I am not able to identify and solve.
The condition checks whether the data scraped from the web matches data already in the file.
If the data doesn't match, the program writes a new row; otherwise it breaks or continues.
Note: csvFileArray is a list holding the rows already read from the existing file. For example, print(csvFileArray[0]) gives:
{'Date': '19/05/21', 'Time': '14:51:00', 'Status': 'Waitlisted', 'School': 'MIT Sloan', 'Details': 'GPA: 3.4 Round: Round 2 | Texas'}
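For context, a list of dicts like that is typically produced with csv.DictReader; a minimal sketch of how it might be loaded (the filename and the assumption that the file has no header row are mine, not from the question):

import csv

with open('file.csv', newline='') as f:
    # the scraper appends raw rows without a header, so the field names are supplied explicitly
    csvFileArray = list(csv.DictReader(
        f, fieldnames=['Date', 'Time', 'Status', 'School', 'Details']))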
Below is the code that has the problem:
file = open('file.csv', 'a')
writer = csv.writer(file)

#loop for page numbers
for page in range(15, 17):
    print("Getting page {}..".format(page))
    params["paged"] = page
    data = requests.post(url, data=params).json()
    soup = BeautifulSoup(data["markup"], "html.parser")

    for entry in soup.select(".livewire-entry"):
        datime = entry.select_one(".adate")
        status = entry.select_one(".status")
        name = status.find_next("strong")
        details = entry.select_one(".lw-details")

        datime = datime.get_text(strip=True)
        datime = datetime.datetime.strptime(datime, '%B %d, %Y %I:%M%p')
        time = datime.time()  #returns time
        date = datime.date()  #returns date

        for firstentry in csvFileArray:
            condition = (((firstentry['Date']) == date) and ((firstentry['Time']) == time)
                         and ((firstentry['Status']) == (status.get_text(strip=True))) and ((firstentry['School']) == (name.get_text(strip=True)))
                         and ((firstentry['Details']) == details.get_text(strip=True)))

            if condition:
                continue
            else:
                writer.writerow([date, time, status.get_text(strip=True), name.get_text(strip=True), details.get_text(strip=True)])
                #print('ok')

    print("-" * 80)

file.close()

I'm guessing you want to write the row only if it matches NONE of the csvFileArray entries. Right now, you write it once for EVERY csvFileArray entry that doesn't match, so a single new row gets written many times over.
for entry in soup.select(".livewire-entry"):
    datime = entry.select_one(".adate")
    status = entry.select_one(".status")
    name = status.find_next("strong")
    details = entry.select_one(".lw-details")

    datime = datime.get_text(strip=True)
    datime = datetime.datetime.strptime(datime, '%B %d, %Y %I:%M%p')
    time = datime.time()  #returns time
    date = datime.date()  #returns date

    should_write = True
    for firstentry in csvFileArray:
        if (((firstentry['Date']) == date) and ((firstentry['Time']) == time)
                and ((firstentry['Status']) == (status.get_text(strip=True))) and ((firstentry['School']) == (name.get_text(strip=True)))
                and ((firstentry['Details']) == details.get_text(strip=True))):
            should_write = False
            break

    if should_write:
        writer.writerow([date, time, status.get_text(strip=True), name.get_text(strip=True), details.get_text(strip=True)])
        #print('ok')
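An equivalent pattern, if you would rather avoid the flag, is Python's for/else: the else block runs only when the loop finishes without hitting break. A sketch of just the inner loop:

    for firstentry in csvFileArray:
        if (firstentry['Date'] == date and firstentry['Time'] == time
                and firstentry['Status'] == status.get_text(strip=True)
                and firstentry['School'] == name.get_text(strip=True)
                and firstentry['Details'] == details.get_text(strip=True)):
            break  # a matching row already exists, so don't write it
    else:
        # no break happened: nothing matched, so this entry is new
        writer.writerow([date, time, status.get_text(strip=True),
                         name.get_text(strip=True), details.get_text(strip=True)])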
You could also collapse the inner loop into any() with a generator expression, but because your condition is large, that gets hard to read:
if not any(
        (((firstentry['Date']) == date) and ((firstentry['Time']) == time)
         and ((firstentry['Status']) == (status.get_text(strip=True))) and ((firstentry['School']) == (name.get_text(strip=True)))
         and ((firstentry['Details']) == details.get_text(strip=True)))
        for firstentry in csvFileArray):
    writer.writerow([date, time, status.get_text(strip=True), name.get_text(strip=True), details.get_text(strip=True)])
    #print('ok')
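If csvFileArray grows large, another option is to build a set of the rows already in the file once, before the page loop, and test membership in constant time. A rough sketch, assuming the stored and scraped values actually compare equal (in practice you may need to format date and time as strings to match what is in the file):

# build once, before the page loop
seen = {
    (e['Date'], e['Time'], e['Status'], e['School'], e['Details'])
    for e in csvFileArray
}

# inside the entry loop
row = (date, time, status.get_text(strip=True),
       name.get_text(strip=True), details.get_text(strip=True))
if row not in seen:
    writer.writerow(row)
    seen.add(row)  # also guards against duplicates within the same run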

Related

Issues Scraping multiple webpages with BeautifulSoup

I am scraping a URL (example: https://bitinfocharts.com/top-100-richest-dogecoin-addresses-4.html) and the number on the end of the URL is the page number. I am trying to scrape multiple pages, so I used the following code to loop through the multiple pages:
for page in range(4, 7): #Range designates the page numbers for the URL
    r = s.get(f'https://bitinfocharts.com/top-100-richest-dogecoin-addresses-{page}.html') #Format the page number into url
    print(page)
When I run the code in my script and print the page, it returns 4, 5 and 6, so the loop itself should be working. However, when I run the full code, it only gives me the results for the 6th page.
What I think may be happening is that the code settles on the last number and formats only that into the URL, when it should be formatting each number into the URL in turn.
I have tried looking at other people with similar issues but haven't been able to find a solution. I believe this may be a code formatting error, but I am not exactly sure. Any advice is greatly appreciated. Thank you.
Here is the remainder of my code:
import csv
import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime
import os
import pandas as pd
import openpyxl

# define 1-1-2021 as a datetime object
after_date = datetime(2021, 1, 1)

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    for page in range(4, 7): #Range designates the page numbers for the URL
        r = s.get(f'https://bitinfocharts.com/top-100-richest-dogecoin-addresses-{page}.html') #Format the page number into url
        print(page)
    soup = bs(r.content, 'lxml')
    # select all tr elements (minus the first one, which is the header)
    table_elements = soup.select('tr')[1:]
    address_links = []
    for element in table_elements:
        children = element.contents # get children of table element
        url = children[1].a['href']
        last_out_str = children[8].text
        if last_out_str != "": # check to make sure the date field isn't empty
            last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z") # load date into datetime object for comparison
            if last_out > after_date: # if check to see if the date is after last_out
                address_links.append(url + '-full') #add address_links to the list, -full makes the link show all data
    print(address_links)
    for url in address_links: #loop through the urls in address_links list
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        ad2 = (soup.title.string) #grab the web title which is used for the filename
        ad2 = ad2.replace('Dogecoin', '')
        ad2 = ad2.replace('Address', '')
        ad2 = ad2.replace('-', '')
        filename = ad2.replace(' ', '')
        sections = soup.find_all(class_='table-striped')
        for section in sections: #This contains the data which is imported into the 'gf' dataframe or the 'info' xlsx sheet
            oldprofit = section.find_all('td')[11].text #Get the profit
            removetext = oldprofit.replace('USD', '')
            removetext = removetext.replace(' ', '')
            removetext = removetext.replace(',', '')
            profit = float(removetext)
            balance = section.find_all('td')[0].text #Get the wallet balance
            amount_recieved = section.find_all('td')[3].text #Get amount recieved
            ins = amount_recieved[amount_recieved.find('(') + 1:amount_recieved.find(')')] #Filter out text from amount recieved
            ins = ins.replace('ins', '')
            ins = ins.replace(' ', '')
            ins = float(ins)
            first_recieved = section.find_all('td')[4].text #Get the date of the first incoming transaction
            fr = first_recieved.replace('first', '')
            fr = fr.replace(':', '')
            fr = fr.replace(' ', '')
            last_recieved = section.find_all('td')[5].text #Get the date of the last incoming transaction
            lr = last_recieved.replace('last', '')
            lr = lr.replace(':', '')
            lr = lr.replace(' ', '')
            amount_sent = section.find_all('td')[7].text #Get the amount sent
            outs = amount_sent[amount_sent.find('(') + 1:amount_sent.find(')')] #Filter out the text
            outs = outs.replace('outs', '')
            outs = outs.replace(' ', '')
            outs = float(outs)
            first_sent = section.find_all('td')[8].text #Get the first outgoing transaction date
            fs = first_sent.replace('first', '') #clean up first outgoing transaction date
            fs = fs.replace(':', '')
            fs = fs.replace(' ', '')
            last_sent = section.find_all('td')[9].text #Get the last outgoing transaction date
            ls = last_sent.replace('last', '') #Clean up last outgoing transaction date
            ls = ls.replace(':', '')
            ls = ls.replace(' ', '')
            dbalance = section.find_all('td')[0].select('b') #get the balance of doge
            dusd = section.find_all('td')[0].select('span')[1] #get balance of USD
            for data in dbalance: #used to clean the text up
                balance = data.text
            for data1 in dusd: #used to clean the text up
                usd = data1.text
            # Compare profit to goal, if profit doesn't meet the goal, the URL is not scraped
            goal = float(30000)
            if profit < goal:
                continue
            #Select wallets with under 2000 transactions
            trans = float(ins + outs) #adds the amount of incoming and outgoing transactions
            trans_limit = float(2000)
            if trans > trans_limit:
                continue
            # Create Info Dataframe using the data from above
            info = {
                'Balance': [balance],
                'USD Value': [usd],
                'Wallet Profit': [profit],
                'Amount Recieved': [amount_recieved],
                'First Recieved': [fr],
                'Last Recieved': [lr],
                'Amount Sent': [amount_sent],
                'First Sent': [fs],
                'Last Sent': [ls],
            }
            gf = pd.DataFrame(info)
            a = 'a'
            if a:
                df = \
                    pd.read_html(requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text, attrs={"id": "table_maina"},
                                 index_col=None, header=[0])[0] #uses pandas to read the dataframe and save it
                directory = '/Users/chris/Desktop/Files' #directory for the file to go to
                file = f'{filename}.xlsx'
                writer = pd.ExcelWriter(os.path.join(directory, file), engine='xlsxwriter')
                with pd.ExcelWriter(writer) as writer:
                    df.to_excel(writer, sheet_name='transactions')
                    gf.to_excel(writer, sheet_name='info')
Check your indentation. In your question the loops are at the same level, so the loop that makes the requests iterates over all the pages, but the results are never processed until that iteration is done. That is why you only get results for the last page.
Move the loops that handle the response and extract the elements into your first loop:
...
for page in range(4, 7): #Range designates the page numbers for the URL
    r = s.get(f'https://bitinfocharts.com/top-100-richest-dogecoin-addresses-{page}.html') #Format the page number into url
    print(page)
    soup = bs(r.content, 'lxml')
    table_elements = soup.select('tr')[1:]
    address_links = []
    for element in table_elements:
        ...
    for url in address_links:
        ...

AWS DynamoDB BOTO3 Confusing Scan

Basically, if I loop over dates, performing a scan with a per-day date range like this:
table_hook = dynamodb_resource.Table('table1')
date_filter = Key('date_column').between('2021-01-01T00:00:00+00:00', '2021-01-01T23:59:59+00:00')
response = table_hook.scan(FilterExpression=date_filter)
incoming_data = response['Items']
if (response['Count']) == 0:
    return

_counter = 1
while 'LastEvaluatedKey' in response:
    response = table_hook.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    if (
        parser.parse(response['Items'][0]['date_column']).replace(tzinfo=None) < parser.parse('2021-01-01T00:00:00+00:00').replace(tzinfo=None)
        or
        parser.parse(response['Items'][0]['date_column']).replace(tzinfo=None) > parser.parse('2021-06-07T23:59:59+00:00').replace(tzinfo=None)
    ):
        break
    incoming_data.extend(response['Items'])
    _counter += 1
    print("|-> Getting page %s" % _counter)
At the end of the Day1-to-Day2 loop, it retrieves X rows.
But if I perform the same scan the same way (paginating), with the same range (Day1 to Day2), without the loop, it retrieves Y rows.
And to top it off, when I call table.describe_table(TableName='table1'), the row_count field reports Z rows. I literally don't understand what is going on!
Based on the help from the folks above, I found my error: I was not passing the filter again when paginating. The fixed code is:
table_hook = dynamodb_resource.Table('table1')
date_filter = Key('date_column').between('2021-01-01T00:00:00+00:00', '2021-01-01T23:59:59+00:00')
response = table_hook.scan(FilterExpression=date_filter)
incoming_data = response['Items']

_counter = 1
while 'LastEvaluatedKey' in response:
    response = table_hook.scan(FilterExpression=date_filter,
                               ExclusiveStartKey=response['LastEvaluatedKey'])
    incoming_data.extend(response['Items'])
    _counter += 1
    print("|-> Getting page %s" % _counter)
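As a side note, the low-level boto3 client can take care of LastEvaluatedKey for you through a paginator. A rough sketch of that alternative (an assumption on my part, not part of the original fix; note that the client API takes a filter expression string plus attribute values, and returns items in DynamoDB's typed format rather than plain Python values):

import boto3

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')

incoming_data = []
for page in paginator.paginate(
        TableName='table1',
        FilterExpression='#d BETWEEN :start AND :end',
        ExpressionAttributeNames={'#d': 'date_column'},
        ExpressionAttributeValues={
            ':start': {'S': '2021-01-01T00:00:00+00:00'},
            ':end': {'S': '2021-01-01T23:59:59+00:00'},
        }):
    # each page corresponds to one Scan call; the filter is applied on every call
    incoming_data.extend(page['Items'])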

Code efficiency/performance improvement in Pushshift Reddit web scraping loop

I am extracting Reddit data via the Pushshift API. More precisely, I am interested in comments and posts (submissions) in subreddit X with search word Y, made from now until datetime Z (e.g. all comments mentioning "GME" in the subreddit r/wallstreetbets). All these parameters can be specified. So far, I have it working with the following code:
import pandas as pd
import requests
from datetime import datetime
import traceback
import time
import json
import sys
import numpy as np

username = ""  # put the username you want to download in the quotes
subreddit = "gme"  # put the subreddit you want to download in the quotes
search_query = "gamestop"  # put the word you want to search for (present in comment or post) in the quotes
# leave either one blank to download an entire user's, subreddit's, or search word's history
# or fill in all to download a specific user's history from a specific subreddit mentioning a specific word

filter_string = None
if username == "" and subreddit == "" and search_query == "":
    print("Fill in either username or subreddit")
    sys.exit(0)
elif username == "" and subreddit != "" and search_query == "":
    filter_string = f"subreddit={subreddit}"
elif username != "" and subreddit == "" and search_query == "":
    filter_string = f"author={username}"
elif username == "" and subreddit != "" and search_query != "":
    filter_string = f"subreddit={subreddit}&q={search_query}"
elif username == "" and subreddit == "" and search_query != "":
    filter_string = f"q={search_query}"
else:
    filter_string = f"author={username}&subreddit={subreddit}&q={search_query}"

url = "https://api.pushshift.io/reddit/search/{}/?size=500&sort=desc&{}&before="

start_time = datetime.utcnow()

def redditAPI(object_type):
    global df_comments
    df_comments = pd.DataFrame(columns=["date", "comment", "score", "id"])
    global df_posts
    df_posts = pd.DataFrame(columns=["date", "post", "score", "id"])

    print(f"\nLooping through {object_type}s and append to dataframe...")

    count = 0
    previous_epoch = int(start_time.timestamp())
    while True:
        # Ensures that loop breaks at March 16 2021 for testing purposes
        if previous_epoch <= 1615849200:
            break

        new_url = url.format(object_type, filter_string) + str(previous_epoch)
        json_text = requests.get(new_url)
        time.sleep(1)  # pushshift has a rate limit, if we send requests too fast it will start returning error messages
        try:
            json_data = json.loads(json_text.text)
        except json.decoder.JSONDecodeError:
            time.sleep(1)
            continue

        if 'data' not in json_data:
            break
        objects = json_data['data']
        if len(objects) == 0:
            break

        df2 = pd.DataFrame.from_dict(objects)
        for object in objects:
            previous_epoch = object['created_utc'] - 1
            count += 1

        if object_type == "comment":
            df2.rename(columns={'created_utc': 'date', 'body': 'comment'}, inplace=True)
            df_comments = df_comments.append(df2[['date', 'comment', 'score']])
        elif object_type == "submission":
            df2.rename(columns={'created_utc': 'date', 'selftext': 'post'}, inplace=True)
            df_posts = df_posts.append(df2[['date', 'post', 'score']])

    # Convert UNIX to datetime
    df_comments["date"] = pd.to_datetime(df_comments["date"], unit='s')
    df_posts["date"] = pd.to_datetime(df_posts["date"], unit='s')

    # Drop blank rows (the case when posts only consist of an image)
    df_posts['post'].replace('', np.nan, inplace=True)
    df_posts.dropna(subset=['post'], inplace=True)

    # Drop duplicates (see last comment on https://www.reddit.com/r/pushshift/comments/b7onr6/max_number_of_results_returned_per_query/)
    df_comments = df_comments.drop_duplicates()
    df_posts = df_posts.drop_duplicates()

    print("\nDone. Saved to dataframe.")
Unfortunately, I do have some performance issues. Because I paginate based on created_utc - 1 (and since I do not want to miss any comments/posts), the initial dataframe will contain duplicates (there won't be 100 (= the API limit) new comments/posts every second). If I run the code over a long time frame (e.g. from the current time back to 1 March 2021), this results in a huge dataframe which takes a long time to process.
As the code stands, the duplicates are added to the dataframe and only removed after the loop. Is there a way to make this more efficient, e.g. by checking within the loop whether the object already exists in the dataframe? Would that make a difference, performance-wise? Any input would be very much appreciated.
It is possible to query the data so that there are no duplicates in the first place.
You are using the API's before parameter, which returns only records strictly before the given timestamp. This means that on each iteration we can pass, as before, the timestamp of the earliest record we already have. The response will then contain only records we haven't seen yet, so there are no duplicates.
In code that would look something like this:
import pandas as pd
import requests
import urllib
import time
import json

def get_data(object_type, username='', subreddit='', search_query='', max_time=None, min_time=1615849200):
    # start from current time if not specified
    if max_time is None:
        max_time = int(time.time())

    # generate filter string
    filter_string = urllib.parse.urlencode(
        {k: v for k, v in zip(
            ['author', 'subreddit', 'q'],
            [username, subreddit, search_query]) if v != ""})

    url_format = "https://api.pushshift.io/reddit/search/{}/?size=500&sort=desc&{}&before={}"

    before = max_time
    df = pd.DataFrame()
    while before > min_time:
        url = url_format.format(object_type, filter_string, before)
        resp = requests.get(url)

        # convert records to dataframe
        dfi = pd.json_normalize(json.loads(resp.text)['data'])

        if object_type == 'comment':
            dfi = dfi.rename(columns={'created_utc': 'date', 'body': 'comment'})
            df = pd.concat([df, dfi[['id', 'date', 'comment', 'score']]])
        elif object_type == 'submission':
            dfi = dfi.rename(columns={'created_utc': 'date', 'selftext': 'post'})
            dfi = dfi[dfi['post'].ne('')]
            df = pd.concat([df, dfi[['id', 'date', 'post', 'score']]])

        # set `before` to the earliest comment/post in the results
        # next time we call requests.get(...) we will only get comments/posts before
        # the earliest that we already have, thus not fetching any duplicates
        before = dfi['date'].min()

        # if needed
        # time.sleep(1)

    return df
Testing by getting the comments and checking for duplicate values (by id):
username = ""
subreddit = "gme"
search_query = "gamestop"

df_comments = get_data(
    object_type='comment',
    username=username,
    subreddit=subreddit,
    search_query=search_query)

df_comments['id'].duplicated().any()  # False
df_comments['id'].nunique()  # 2200
I would suggest a Bloom filter to check whether values have already been seen.
There is a package on PyPI which implements this very easily. To use the Bloom filter you just add a "key" to the filter; this can be a combination of the username and the comment. That way you can check whether you have already added a comment to your dataframe. I suggest you use the Bloom filter as early as possible in your method, i.e. right after you get a response from the API.
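A rough sketch of that idea, assuming the pybloom-live package (the capacity, error rate, and exact key are illustrative, not part of the original suggestion):

from pybloom_live import BloomFilter

# capacity and error_rate are illustrative; size them for your expected volume
seen = BloomFilter(capacity=1_000_000, error_rate=0.001)

fresh = []
for obj in objects:  # objects as returned by the Pushshift API
    key = f"{obj.get('author', '')}|{obj.get('body', '')}"  # username + comment
    if key in seen:
        continue  # almost certainly a duplicate, skip it
    seen.add(key)
    fresh.append(obj)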

File size is disproportionate

A web scraper written in Python extracts water level data, one reading per hour.
When written to a .txt file using the code below, each line is prefixed with a datetime, so each line takes up something like 20 characters.
Example: "01/01-2010 11:10,-32"
Using the code below results in a file containing data from 01/01-2010 00:10 to 28/02-2010 23:50, which is roughly 60 days. 60 days with a reading per hour should give 1440 lines and approx. 30,000 characters. Microsoft Word, however, tells me the file contains 830,000 characters on 42,210 lines, which fits very well with the observed file size of 893 kB.
Apparently some lines and characters are hiding somewhere. I can't seem to find them anywhere.
import requests
import time

totaldata = []
filnavn = 'Vandstandsdata_Esbjerg_KDI_TEST_recode.txt'
file = open(filnavn, 'w')
file.write("")
file.close()

from datetime import timedelta, date
from bs4 import BeautifulSoup

def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

start_date = date(2010, 1, 1)
end_date = date(2010, 3, 1)

values = []
datoer = []
for single_date in daterange(start_date, end_date):
    valuesTemp = []
    datoerTemp = []
    dato = single_date.strftime("%d-%m-%y")
    url = "http://kysterne.kyst.dk/pages/10852/waves/showData.asp?targetDay=" + dato + "&ident=6401&subGroupGuid=16410"
    page = requests.get(url)
    if page.status_code == 200:
        soup = BeautifulSoup(page.content, 'html.parser')
        dataliste = list(soup.find_all(class_="dataTable"))
        #dataliste = list(dataliste.find_all('td'))
        #dataliste = dataliste[0].getText()
        print(url)
        dataliste = str(dataliste)
        dataliste = dataliste.splitlines()
        dataliste = dataliste[6:] #18
        #print(dataliste[0])
        #print(dataliste[1])
        for e in range(0, len(dataliste), 4): #4
            #print(dataliste[e])
            datoerTemp.append(dataliste[e])
            #print(" -------- \n")
        for e in range(1, len(dataliste), 4): #4
            valuesTemp.append(dataliste[e])
        for e in valuesTemp:
            #print(e)
            e = e[4:]
            e = e[:-5]
            values.append(e)
        for e in datoerTemp:
            #print(e)
            e = e[4:]
            e = e[:-5]
            datoer.append(e)
        file = open(filnavn, 'a')
        for i in range(0, len(datoer), 6):
            file.write(datoer[i] + "," + values[i] + "\n")
        print("- skrevet til fil\n")
        file.close()
print("done")
Ah, eureka.
Seconds before posting this question I realized I had forgotten to reset the list.
I added:
datoer = []
inside the loop, and everything now works as intended.
The old code wrote the data for the given day plus all data from every previous day, on each iteration of the loop.
I hope someone can learn from this newbie experience.
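For anyone hitting the same thing, a minimal sketch of where the reset goes (my reading of the fix, not shown verbatim in the answer). Re-writing every earlier day on each iteration is also what inflates the file to roughly 24·(1+2+…+59) ≈ 42,000 lines instead of 1440, in line with what Word reports. The answer mentions only datoer, but values presumably needs the same reset so that datoer[i] and values[i] stay paired:

for single_date in daterange(start_date, end_date):
    valuesTemp = []
    datoerTemp = []
    datoer = []  # reset, so earlier days are not written to the file again
    values = []  # likewise, so datoer[i] and values[i] come from the same day
    # ... the rest of the per-day scraping and writing stays as before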

BeautifulSoup could not get everything

2 weeks ago, I could read everything in the source code of this url: http://camelcamelcamel.com/Jaybird-Sport-Wireless-Bluetooth-Headphones/product/B013HSW4SM?active=price_amazon
However, today, when I run the same code again, none of the historical prices appear in soup. Do you know how to fix this problem?
Here's my Python code (it worked well before!):
import datetime
import re
from bs4 import BeautifulSoup
from urllib2 import urlopen

url = 'http://camelcamelcamel.com/Jaybird-Sport-Wireless-Bluetooth-Headphones/product/B013HSW4SM?active=price_amazon'
soup = BeautifulSoup(urlopen(url), 'html.parser')

lst = soup.find_all('tbody')
for tbody in lst:
    trs = tbody.find_all('tr')
    for elem in trs:
        tr_class = elem.get('class')
        if tr_class != None:
            if tr_class[0] == 'highest_price' or tr_class[0] == 'lowest_price':
                tds = elem.find_all('td')
                td_label = tds[0].get_text().split(' ')[0]
                td_price = tds[1].get_text()
                td_date = tds[2].get_text()
                print td_label, td_price, td_date
            else:
                tds = elem.find_all('td')
                td_label = tds[0].get_text().split(' ')[0]
                if td_label == 'Average':
                    td_price = tds[1].get_text()
                    print td_label, td_price

ps = soup.find_all('p')
for p in ps:
    p_class = p.get('class')
    if p_class != None and len(p_class) == 2 and p_class[0] == 'smalltext' and p_class[1] == 'grey':
        p_text = p.get_text()
        m = re.search('since([\w\d,\s]+)\.', p_text)
        if m:
            date = m.group(1)
            dt = datetime.datetime.strptime(date, ' %b %d, %Y')
            print datetime.date.strftime(dt, '%Y-%m-%d')
            break
From reading the source code, it seems like the historical price data is accessed via JavaScript. As such, you'll need to find a way to emulate a real browser. Personally, I use Selenium for these kinds of tasks.
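A minimal sketch of that approach (assuming a local Chrome/chromedriver setup; the parsing would still need to be adapted to whatever the rendered page actually contains):

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://camelcamelcamel.com/Jaybird-Sport-Wireless-Bluetooth-Headphones/product/B013HSW4SM?active=price_amazon'

driver = webdriver.Chrome()      # requires chromedriver on your PATH
driver.get(url)                  # lets the page's JavaScript run
html = driver.page_source        # the DOM after JavaScript execution
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
# ...parse soup the same way as in the original script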
I am not really sure about the solution, but you should generally avoid so much list indexing and so many find_all calls. The reason is that the position or number of elements changes much more easily than things like classes, ids and so on. So I would recommend using CSS selectors instead.
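For example, something along these lines (the selectors are illustrative and assume the class names from the original script; the real markup may differ):

# Select only the rows of interest by class, instead of walking every
# tbody/tr and checking classes by hand.
for row in soup.select('tbody tr.highest_price, tbody tr.lowest_price'):
    cells = row.select('td')
    label = cells[0].get_text().split(' ')[0]
    price = cells[1].get_text()
    date_text = cells[2].get_text()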
