I've written a parser in Python which does its job perfectly, except that some duplicates come along. Moreover, when I open the CSV file I can see that every result is surrounded by square brackets. Is there any workaround to get rid of the duplicate data and the square brackets on the fly? Here is what I tried:
import csv
import requests
from lxml import html
def parsingdata(mpg):
    data = set()
    outfile = open('RealYP.csv', 'w', newline='')
    writer = csv.writer(outfile)
    writer.writerow(["Name", "Address", "Phone"])
    pg = 1
    while pg <= mpg:
        url = "https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page=" + str(pg)
        page = requests.get(url)
        tree = html.fromstring(page.text)
        titles = tree.xpath('//div[@class="info"]')
        items = []
        for title in titles:
            comb = []
            Name = title.xpath('.//span[@itemprop="name"]/text()')
            Address = title.xpath('.//span[@itemprop="streetAddress" and @class="street-address"]/text()')
            Phone = title.xpath('.//div[@itemprop="telephone" and @class="phones phone primary"]/text()')
            try:
                comb.append(Name[0])
                comb.append(Address[0])
                comb.append(Phone[0])
            except:
                continue
            items.append(comb)
        pg += 1
        for item in items:
            writer.writerow(item)

parsingdata(3)
Now it is working fine.
Edit: Rectified portion taken from bjpreisler
This script removes dups when I am working with a .csv file. Check if this works for you :)
with open(file_out, 'w') as f_out, open(file_in, 'r') as f_in:
    # write rows from in-file to out-file until all the data is written
    checkDups = set()  # set for removing duplicates
    for line in f_in:
        if line in checkDups:
            continue  # skip duplicate
        checkDups.add(line)
        f_out.write(line)
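If you would rather drop the duplicates while scraping instead of post-processing the file, here is a minimal sketch of the same idea applied to the items list and writer from your parsingdata function (row order is preserved):
seen = set()
for item in items:
    key = tuple(item)          # lists are not hashable, so turn each row into a tuple
    if key in seen:
        continue               # duplicate row, skip it
    seen.add(key)
    writer.writerow(item)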
You are currently writing a list (items) to the csv which is why it is in brackets. To avoid this, use another for loop that could look like this:
for title in titles:
    comb = []
    Name = title.xpath('.//span[@itemprop="name"]/text()')
    Address = title.xpath('.//span[@itemprop="streetAddress" and @class="street-address"]/text()')
    Phone = title.xpath('.//div[@itemprop="telephone" and @class="phones phone primary"]/text()')
    if Name:
        Name = Name[0]
    if Address:
        Address = Address[0]
    if Phone:
        Phone = Phone[0]
    comb.append(Name)
    comb.append(Address)
    comb.append(Phone)
    print(comb)
    items.append(comb)
pg += 1
for item in items:
    writer.writerow(item)

parsingdata(3)
This should write each item separately to your csv. It turns out the items you were appending to comb were lists themselves, so this extracts them.
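For reference, the square brackets in the original output come from csv.writer converting any non-string cell with str(), so a list ends up written literally. A minimal sketch with made-up values:
import csv, io

buf = io.StringIO()
w = csv.writer(buf)
w.writerow([['Coffee Shop'], ['90001']])   # list cells -> ['Coffee Shop'],['90001']
w.writerow(['Coffee Shop', '90001'])       # string cells -> Coffee Shop,90001
print(buf.getvalue())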
And the concise version of this scraper I found lately is:
import csv
import requests
from lxml import html

url = "https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page={0}"

def parsingdata(link):
    outfile = open('YellowPage.csv', 'w', newline='')
    writer = csv.writer(outfile)
    writer.writerow(["Name", "Address", "Phone"])
    for page_link in [link.format(i) for i in range(1, 4)]:
        page = requests.get(page_link).text
        tree = html.fromstring(page)
        for title in tree.xpath('//div[@class="info"]'):
            Name = title.findtext('.//span[@itemprop="name"]')
            Address = title.findtext('.//span[@itemprop="streetAddress"]')
            Phone = title.findtext('.//div[@itemprop="telephone"]')
            print([Name, Address, Phone])
            writer.writerow([Name, Address, Phone])

parsingdata(url)
Related
I am scraping a website for the course number and the course name. But if a course number does not have a name or vice versa, the data should be skipped from the final output. I do not know how to do that.
from bs4 import BeautifulSoup
from urllib import urlopen
import csv
source = urlopen('https://www.rit.edu/study/computing-security-bs')
csv_file1 = open('scrape.csv', 'w')
csv_writer = csv.writer(csv_file1)
csv_writer.writerow(['Course Number', 'Course Name'])
soup = BeautifulSoup(source, 'lxml')
table = soup.find('div', class_='processed-table')
#print(table)
curriculum = table.find('curriculum')
#print(curriculum.prettify())
next = curriculum.find('table', class_='table-curriculum')
#print(next.prettify())
for course_num in next.find_all('tr', class_='hidden-row rows-1'):
    num = course_num.find_all('td')[0]
    real = num.get_text()
    # print(real)
    realstr = real.encode('utf-8')
    name = course_num.find('div', class_='course-name')
    realname = name.get_text()
    # print(realname)
    realnamestr = realname.encode('utf-8')
    csv_writer.writerow([realstr, realnamestr])

csv_file1.close()
This is my csv
I want to get rid of the last 4 rows.
As @zvone suggested, a continue will do the job here. I am writing this answer since you mentioned you are not aware of the keyword.
Before csv_writer.writerow([realstr, realnamestr]), just put an if to check realstr and continue:
if realstr.strip() == "":
    continue
I think you should still go through the continue, break and else keywords and how they can be helpful in controlling your loops.
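For illustration, here is a tiny sketch of all three in one loop (the course numbers are made up, just to show the control flow):
for course in ['CSEC-101', '', 'CSEC-202']:
    if course == '':
        continue                 # skip this iteration and move on to the next course
    if course == 'STOP':
        break                    # leave the loop entirely
    print(course)
else:
    print('loop finished without break')   # runs only if no break happened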
Another approach would be to put data into csv_writer only when realstr has some value. So:
if realstr.strip != "":
csv_writer.writerow([realstr, realnamestr])
Hi, I am trying to make a filter for a Pirate Bay movie RSS feed which filters out the movies I have already acquired and keeps the ones I do not currently have. It will then later download the torrent from the magnet link provided. The problem is that I can't figure out how to separate the movies I have from the ones I don't, as I am trying to filter a list against a string and do not know a way around it. Here is a runnable example, with the code I want to add in comments:
import feedparser
import ssl
if hasattr(ssl, '_create_unverified_context'):
    ssl._create_default_https_context = ssl._create_unverified_context
feed = feedparser.parse('https://thepiratebay.org/rss/top100/207')
feed_title = feed['feed']['title']
feed_entries = feed.entries
f = open("movies.txt", "r+")
fr = f.readlines()
print(fr)
for entry in feed.entries[:25]:
    el = entry.title.lower()
    # if fr in el:
    #     remove_from_titles()
    # else:
    article_title = el
    article_link = entry.link
    print(article_title)
    print(article_link)
movies.txt file:
aquaman
spiderman
Try to use a set instead of a list. If the feed titles form set A and the file titles form set B, then the titles in A that are not in B are A.difference(B).
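A minimal sketch of that idea, reusing feed and fr from your script and normalising the titles first:
feed_titles = {entry.title.lower() for entry in feed.entries[:25]}    # A: titles in the feed
owned_titles = {line.strip().lower() for line in fr}                  # B: titles already in movies.txt
new_titles = feed_titles.difference(owned_titles)                     # movies you don't have yet
for title in new_titles:
    print(title)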
Can you try the following:
with open("movies.txt", "r+") as f:
fr = f.readlines()
if article_title.lower() not in movies_list:
print(article_title)
# do your downloading stuff here
# update your movies.txt file
with open("movies.txt", "a") as f:
f.write('\n' + 'article_title')
I'm a relative novice at Python but, somehow, I have managed to build a scraper for Instagram. I now want to take this one step further and output the 5 most commonly used hashtags from an IG profile into my CSV output file.
Current output:
I've managed to isolate the 5 most commonly used hashtags, but I get this result in my csv:
[('#striveforgreatness', 3), ('#jamesgang', 3), ('#thekidfromakron', 2), ('#togetherwecanchangetheworld', 1), ('#halloweenchronicles', 1)]
Desired output:
What I'm looking to end up with is 5 columns at the end of my .CSV, each outputting the X-th most commonly used value.
So something along the lines of this:
I've Googled for a while and managed to isolate them separately, but I always end up with '('#thekidfromakron', 2)' as an output. I seem to be missing some part of the puzzle :(.
Here is what I'm working with at the moment:
import csv
import requests
from bs4 import BeautifulSoup
import json
import re
import time
from collections import Counter

ts = time.gmtime()

def get_csv_header(top_numb):
    fieldnames = ['USER', 'MEDIA COUNT', 'FOLLOWERCOUNT', 'TOTAL LIKES', 'TOTAL COMMENTS', 'ER', 'ER IN %', 'BIO', 'ALL CAPTION TEXT', 'HASHTAGS COUNTED', 'MOST COMMON HASHTAGS']
    return fieldnames

def write_csv_header(filename, headers):
    with open(filename, 'w', newline='') as f_out:
        writer = csv.DictWriter(f_out, fieldnames=headers)
        writer.writeheader()
    return

def read_user_name(t_file):
    with open(t_file) as f:
        user_list = f.read().splitlines()
    return user_list

if __name__ == '__main__':
    # HERE YOU CAN SPECIFY YOUR USERLIST FILE NAME,
    # which contains a list of usernames, BY DEFAULT <current working directory>/userlist.txt
    USER_FILE = 'userlist.txt'
    # HERE YOU CAN SPECIFY YOUR DATA FILE NAME, BY DEFAULT (data.csv), where your final result stays
    DATA_FILE = 'users_with_er.csv'
    MAX_POST = 12  # MAX POST

    print('Starting the engagement calculations... Please wait until it finishes!')

    users = read_user_name(USER_FILE)

    """ Writing data to csv file """
    csv_headers = get_csv_header(MAX_POST)
    write_csv_header(DATA_FILE, csv_headers)

    for user in users:
        post_info = {'USER': user}
        url = 'https://www.instagram.com/' + user + '/'
        # for troubleshooting, un-comment the next two lines:
        # print(user)
        # print(url)
        try:
            r = requests.get(url)
            if r.status_code != 200:
                print(timestamp, ' user {0} not found or page unavailable! Skipping...'.format(user))
                continue
            soup = BeautifulSoup(r.content, "html.parser")
            scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
            stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]
            j = json.loads(stringified_json)['entry_data']['ProfilePage'][0]
            timestamp = time.strftime("%d-%m-%Y %H:%M:%S", ts)
        except ValueError:
            print(timestamp, 'ValueError for username {0}...Skipping...'.format(user))
            continue
        except IndexError as error:
            # Output expected IndexErrors.
            print(timestamp, error)
            continue

        if j['graphql']['user']['edge_followed_by']['count'] <= 0:
            print(timestamp, 'user {0} has no followers! Skipping...'.format(user))
            continue
        if j['graphql']['user']['edge_owner_to_timeline_media']['count'] < 12:
            print(timestamp, 'user {0} has less than 12 posts! Skipping...'.format(user))
            continue
        if j['graphql']['user']['is_private'] is True:
            print(timestamp, 'user {0} has a private profile! Skipping...'.format(user))
            continue

        media_count = j['graphql']['user']['edge_owner_to_timeline_media']['count']
        accountname = j['graphql']['user']['username']
        followercount = j['graphql']['user']['edge_followed_by']['count']
        bio = j['graphql']['user']['biography']

        i = 0
        total_likes = 0
        total_comments = 0
        all_captiontext = ''
        while i <= 11:
            total_likes += j['graphql']['user']['edge_owner_to_timeline_media']['edges'][i]['node']['edge_liked_by']['count']
            total_comments += j['graphql']['user']['edge_owner_to_timeline_media']['edges'][i]['node']['edge_media_to_comment']['count']
            captions = j['graphql']['user']['edge_owner_to_timeline_media']['edges'][i]['node']['edge_media_to_caption']
            caption_detail = captions['edges'][0]['node']['text']
            all_captiontext += caption_detail
            i += 1

        engagement_rate_percentage = '{0:.4f}'.format((((total_likes + total_comments) / followercount) / 12) * 100) + '%'
        engagement_rate = (((total_likes + total_comments) / followercount) / 12 * 100)

        # isolate and count hashtags
        hashtags = re.findall(r'#\w*', all_captiontext)
        hashtags_counted = Counter(hashtags)
        most_common = hashtags_counted.most_common(5)

        with open('users_with_er.csv', 'a', newline='', encoding='utf-8') as data_out:
            print(timestamp, 'Writing Data for user {0}...'.format(user))
            post_info["USER"] = accountname
            post_info["FOLLOWERCOUNT"] = followercount
            post_info["MEDIA COUNT"] = media_count
            post_info["TOTAL LIKES"] = total_likes
            post_info["TOTAL COMMENTS"] = total_comments
            post_info["ER"] = engagement_rate
            post_info["ER IN %"] = engagement_rate_percentage
            post_info["BIO"] = bio
            post_info["ALL CAPTION TEXT"] = all_captiontext
            post_info["HASHTAGS COUNTED"] = hashtags_counted
            csv_writer = csv.DictWriter(data_out, fieldnames=csv_headers)
            csv_writer.writerow(post_info)

    """ Done with the script """
    print('ALL DONE !!!! ')
The code that goes before this simply scrapes the webpage, and compiles all the captions from the last 12 posts into "all_captiontext".
Any help to solve this (probably simple) issue would be greatly appreciated as I've been struggling with this for days (again, I'm a noob :') ).
Replace line
post_info["MOST COMMON HASHTAGS"] = most_common
with:
for i, counter_tuple in enumerate(most_common):
    tag_name = counter_tuple[0].replace('#', '')
    label = "Top %d" % (i + 1)
    post_info[label] = tag_name
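Note that if you keep the csv.DictWriter from your script, these new 'Top 1' ... 'Top 5' labels would also have to be added to the fieldnames list you pass it (or you could create the writer with extrasaction='ignore'); otherwise DictWriter raises a ValueError for keys it doesn't recognise.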
There's also a bit of code missing. For example, your code doesn't include csv_headers variable, which I suppose would be
csv_headers = post_info.keys()
It also seems that you're opening a file to write just one row. I don't think that's intended, so what you would like to do is to collect the results into a list of dictionaries. A cleaner solution would be to use pandas' dataframe, which you can output straight into a csv file.
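A rough sketch of that pandas approach, assuming you collect every post_info dict into a list as the loop runs (all_rows is a name I'm introducing, not something in your script):
import pandas as pd

all_rows = []                       # one dict per user, filled inside the main for-user loop
# ... inside the loop, after building post_info:
#         all_rows.append(dict(post_info))

# after the loop, write everything in one go:
df = pd.DataFrame(all_rows)
df.to_csv('users_with_er.csv', index=False, encoding='utf-8')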
most_common being the output of the call to hashtags_counted.most_common, I had a look at the doc here: https://docs.python.org/2/library/collections.html#collections.Counter.most_common
The output is formatted as follows: [(key, value), (key, value), ...], ordered by decreasing number of occurrences.
Hence, to get only the name and not the number of occurrences, you should replace:
post_info["MOST COMMON HASHTAGS"] = most_common
by
post_info["MOST COMMON HASHTAGS"] = [x[0] for x in most_common]
You have a list of tuples. This statement builds, on the fly, the list of the first element of each tuple, keeping the sort order.
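For example, with the values from the question:
most_common = [('#striveforgreatness', 3), ('#jamesgang', 3), ('#thekidfromakron', 2)]
print([x[0] for x in most_common])   # ['#striveforgreatness', '#jamesgang', '#thekidfromakron']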
Why do I only get the results from the last url?
The idea is that I get a list of results from both urls.
Also, when printing to the csv I get an empty row each time. How do I remove this row?
import csv
import requests
from lxml import html
import urllib
TV_category = ["_108-tot-127-cm-43-tot-50-,98952,501090","_128-tot-150-cm-51-tot-59-,98952,501091"]
url_pattern = 'http://www.mediamarkt.be/mcs/productlist/{}.html?langId=-17'
for item in TV_category:
    url = url_pattern.format(item)
    page = requests.get(url)
    tree = html.fromstring(page.content)
    outfile = open("./tv_test1.csv", "wb")
    writer = csv.writer(outfile)
    rows = tree.xpath('//*[@id="category"]/ul[2]/li')

for row in rows:
    price = row.xpath('normalize-space(div/aside[2]/div[1]/div[1]/div/text())')
    product_ref = row.xpath('normalize-space(div/div/h2/a/text())')
    writer.writerow([product_ref, price])
As I explained in the question's comments, you need to put the second for loop inside (at the end of) the first one. Otherwise, only the rows from the last URL will be written to the CSV file.
You also don't need to open the file on each iteration of the loop (a with statement will close it automatically). It is also important to highlight that opening a file with write flags overwrites it, and if that happens inside a loop the file gets overwritten every time it is opened.
I'd refactor your code as follows:
import csv
import requests
from lxml import html
import urllib
TV_category = ["_108-tot-127-cm-43-tot-50-,98952,501090","_128-tot-150-cm-51-tot-59-,98952,501091"]
url_pattern = 'http://www.mediamarkt.be/mcs/productlist/{}.html?langId=-17'
with open("./tv_test1.csv", "wb") as outfile:
writer = csv.writer(outfile)
for item in TV_category:
url = url_pattern.format(item)
page = requests.get(url)
tree = html.fromstring(page.content)
rows = tree.xpath('//*[#id="category"]/ul[2]/li')
for row in rows:
price = row.xpath('normalize-space(div/aside[2]/div[1]/div[1]/div/text())')
product_ref = row.xpath('normalize-space(div/div/h2/a/text())')
writer.writerow([product_ref,price])
I'm trying to extract the data on the crime rate across states from this web page: http://www.disastercenter.com/crime/uscrime.htm
I am able to get this into a text file, but I would like to get the response in JSON format. How can I do this in Python?
Here is my code:
import urllib
import re
from bs4 import BeautifulSoup
link = "http://www.disastercenter.com/crime/uscrime.htm"
f = urllib.urlopen(link)
myfile = f.read()
soup = BeautifulSoup(myfile)
soup1=soup.find('table', width="100%")
soup3=str(soup1)
result = re.sub("<.*?>", "", soup3)
print(result)
output=open("output.txt","w")
output.write(result)
output.close()
The following code will get the data from the two tables and output all of it as a json formatted string.
Working Example (Python 2.7.9):
from lxml import html
import requests
import re as regular_expression
import json
page = requests.get("http://www.disastercenter.com/crime/uscrime.htm")
tree = html.fromstring(page.text)
tables = [tree.xpath('//table/tbody/tr[2]/td/center/center/font/table/tbody'),
          tree.xpath('//table/tbody/tr[5]/td/center/center/font/table/tbody')]

tabs = []
for table in tables:
    tab = []
    for row in table:
        for col in row:
            var = col.text_content()
            var = var.strip().replace(" ", "")
            var = var.split('\n')
            if regular_expression.match('^\d{4}$', var[0].strip()):
                tab_row = {}
                tab_row["Year"] = var[0].strip()
                tab_row["Population"] = var[1].strip()
                tab_row["Total"] = var[2].strip()
                tab_row["Violent"] = var[3].strip()
                tab_row["Property"] = var[4].strip()
                tab_row["Murder"] = var[5].strip()
                tab_row["Forcible_Rape"] = var[6].strip()
                tab_row["Robbery"] = var[7].strip()
                tab_row["Aggravated_Assault"] = var[8].strip()
                tab_row["Burglary"] = var[9].strip()
                tab_row["Larceny_Theft"] = var[10].strip()
                tab_row["Vehicle_Theft"] = var[11].strip()
                tab.append(tab_row)
    tabs.append(tab)
json_data = json.dumps(tabs)
output = open("output.txt", "w")
output.write(json_data)
output.close()
This might be what you want, if you can use the requests and lxml modules. The data structure presented here is very simple, adjust this to your needs.
First, get a response from your requested URL and parse the result into an HTML tree:
import requests
from lxml import etree
import json
response = requests.get("http://www.disastercenter.com/crime/uscrime.htm")
tree = etree.HTML(response.text)
Assuming you want to extract both tables, create this XPath and unpack the results. totals is "Number of Crimes" and rates is "Rate of Crime per 100,000 People":
xpath = './/table[@width="100%"][@style="background-color: rgb(255, 255, 255);"]//tbody'
totals, rates = tree.findall(xpath)
Extract the raw data (td.find('./') means first child item, whatever tag it has) and clean the strings (r'' raw strings are needed for Python 2.x):
raw_data = []
for tbody in totals, rates:
    rows = []
    for tr in tbody.getchildren():
        row = []
        for td in tr.getchildren():
            child = td.find('./')
            if child is not None and child.tag != 'br':
                row.append(child.text.strip(r'\xa0').strip(r'\n').strip())
            else:
                row.append('')
        rows.append(row)
    raw_data.append(rows)
Zip together the table headers from the first two rows, then delete the redundant rows, removed with steps of 12 and 11 in slice notation:
data = {}
data['tags'] = [tag0 + tag1 for tag0, tag1 in zip(raw_data[0][0], raw_data[0][1])]
for raw in raw_data:
    del raw[::12]
    del raw[::11]
Store the rest of the raw data and create a JSON file (optional: eliminate whitespace with separators=(',', ':')):
data['totals'], data['rates'] = raw_data[0], raw_data[1]

with open('data.json', 'w') as f:
    json.dump(data, f, separators=(',', ':'))
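To check the result, the file can be read back the same way (a quick sketch):
with open('data.json') as f:
    data = json.load(f)
print(data['tags'])
print(data['totals'][0])   # first remaining row of the "Number of Crimes" table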