I am trying to scrape data from this link
https://www.seloger.com/
and I get this error. I don't understand what's wrong, because I already ran this code before and it worked.
import re
import requests
import csv
import json
with open("selog.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(["id", "Type", "Prix", "Code_postal", "Ville", "Departement", "Nombre_pieces", "Nbr_chambres", "Type_cuisine", "Surface"])
for i in range(1, 500):
url = str('https://www.seloger.com/list.htm?tri=initial&idtypebien=1,2&pxMax=3000000&div=2238&idtt=2,5&naturebien=1,2,4&LISTING-LISTpg=' + str(i))
r = requests.get(url, headers = {'User-Agent' : 'Mozilla/5.0'})
p = re.compile('var ava_data =(.*);\r\n\s+ava_data\.logged = logged;', re.DOTALL)
x = p.findall(r.text)[0].strip().replace('\r\n ','').replace('\xa0',' ').replace('\\','\\\\')
x = re.sub(r'\s{2,}|\\r\\n', '', x)
data = json.loads(x)
f = csv.writer(open("Seloger.csv", "wb+"))
for product in data['products']:
ID = product['idannonce']
prix = product['prix']
surface = product['surface']
code_postal = product['codepostal']
nombre_pieces = product['nb_pieces']
nbr_chambres = product['nb_chambres']
Type = product['typedebien']
type_cuisine = product['idtypecuisine']
ville = product['ville']
departement = product['departement']
etage = product['etage']
writer.writerow([ID, Type, prix, code_postal, ville, departement, nombre_pieces, nbr_chambres, type_cuisine, surface])
This is the error:
Traceback (most recent call last):
File "Seloger.py", line 20, in <module>
x = p.findall(r.text)[0].strip().replace('\r\n ','').replace('\xa0',' ').replace('\\','\\\\')
IndexError: list index out of range
This line is wrong:
x = p.findall(r.text)[0].strip().replace('\r\n ','').replace('\xa0',' ').replace('\\','\\\\')
What do you need to find in the text? To work on the scraped text, change the line above to:
x = r.text.strip().replace('\r\n ','').replace('\xa0',' ').replace('\\','\\\\')
and then extract whatever you need from it.
The error occurs because sometimes there is no match, and you are trying to access a non-existing item in an empty list. The same result can be reproduced with print(re.findall("s", "d")[0]).
To fix the issue, replace the x = p.findall(r.text)[0].strip().replace('\r\n ','').replace('\xa0',' ').replace('\\','\\\\') line with:
x = ''
xm = p.search(r.text)
if xm:
    x = xm.group(1).strip().replace('\r\n ','').replace('\xa0',' ').replace('\\','\\\\')
NOTES
When you use p.findall(r.text)[0], you want to get the first match in the input, so re.search is best here, as it only returns the first match.
To obtain the substring captured in the first capturing group, you need to use matchObject.group(1).
if xm: is important: if there is no match, x will remain an empty string; otherwise, it will be assigned the modified value of Group 1.
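Putting this together, a minimal, self-contained sketch of the search-then-group pattern (the html string here is a dummy stand-in for r.text, not the real page):
import re

html = 'var ava_data = {"products": []};\r\n    ava_data.logged = logged;'
p = re.compile(r'var ava_data =(.*);\r\n\s+ava_data\.logged = logged;', re.DOTALL)

x = ''
xm = p.search(html)
if xm:
    x = xm.group(1).strip()   # '{"products": []}'
else:
    print('no match on this page; skip it instead of crashing')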
I wrote some code to convert text in a PDF file into a pandas dataframe. The code works fine normally, but when I try to fit it into a class and define a function for it, it returns an error.
import pdfplumber
import pandas as pd
import re
cols = ["Declaration Number", "Declaration Date", "Warehouse", "Quantity", "Number of boxes", "Product name", "Invoice Number"]
dataset = []
quant = []
date = []
decl_date = []
decl = re.compile(r'\d{8}AN\d{6}')
decd = re.compile(r'\d{2}\.\d{2}\.\d{4}')
whse = re.compile(r'ANTREPO | LİMAN')
qty = re.compile(r'\d.KAP')
prod = re.compile(r'Ticari')
invNo = re.compile(r'Fatura')
class pdf():
    def __init__(self):
        self.kap = None
        self.kg = None

    def FirstPage():
        with pdfplumber.open("44550500AN087999.pdf") as pdf:
            page = pdf.pages[0]
            text = page.extract_text()
            for line in text.split('\n'):
                if decl.search(line):
                    decl_num = line.split()[-1]
                if decd.search(line):
                    decl_date = []
                    date = []
                    decl_date.append(line.split())
                    date = decl_date[1][-1]
                if whse.search(line):
                    warehouse = line.split()
                if qty.search(line):
                    quant = line.split()
                    kap = quant[0] + " " + quant[1]
                    kg = quant[2] + " " + quant[3]
When I run it, it returns several errors. For instance:
<ipython-input-26-bc082b4afef0> in FirstPage()
20 date = []
21 decl_date.append(line.split())
---> 22 date = decl_date[1][-1]
23 if whse.search(line):
24 warehouse = line.split()
IndexError: list index out of range
I am probably defining the variables wrong, but I am a newbie, so does anyone have any idea what I am doing wrong?
You are only putting one element into decl_date, and then trying to access the second element inside that list, which does not exist.
Your use of line.split() also seems off: decl_date.append(line.split()) appends the whole list returned by line.split() as a single element, so decl_date becomes a one-element list of lists and decl_date[1] does not exist.
I assume you want to pull the matched value out of each line; in that case, change line.split() to pattern.split(line)[index], swapping out pattern and index.
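Alternatively, since the compiled patterns already match the values you want, re.search gives you the value directly. A small sketch on a made-up line:
import re

decd = re.compile(r'\d{2}\.\d{2}\.\d{4}')
line = "Tescil Tarihi 01.02.2023"   # hypothetical input line

m = decd.search(line)
if m:
    date = m.group(0)   # '01.02.2023'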
I'm a relative novice at Python, but I somehow managed to build a scraper for Instagram. I now want to take this one step further and output the 5 most commonly used hashtags from an IG profile into my CSV output file.
Current output:
I've managed to isolate the 5 most commonly used hashtags, but I get this result in my csv:
[('#striveforgreatness', 3), ('#jamesgang', 3), ('#thekidfromakron', 2), ('#togetherwecanchangetheworld', 1), ('#halloweenchronicles', 1)]
Desired output:
What I'm looking to end up with is 5 columns at the end of my .CSV, outputting the X-th most commonly used value.
So something along the lines of this:
I've Googled for a while and managed to isolate them separately, but I always end up with '('#thekidfromakron', 2)' as an output. I seem to be missing some part of the puzzle :(.
Here is what I'm working with at the moment:
import csv
import requests
from bs4 import BeautifulSoup
import json
import re
import time
from collections import Counter

ts = time.gmtime()

def get_csv_header(top_numb):
    fieldnames = ['USER','MEDIA COUNT','FOLLOWERCOUNT','TOTAL LIKES','TOTAL COMMENTS','ER','ER IN %', 'BIO', 'ALL CAPTION TEXT','HASHTAGS COUNTED','MOST COMMON HASHTAGS']
    return fieldnames

def write_csv_header(filename, headers):
    with open(filename, 'w', newline='') as f_out:
        writer = csv.DictWriter(f_out, fieldnames=headers)
        writer.writeheader()
    return

def read_user_name(t_file):
    with open(t_file) as f:
        user_list = f.read().splitlines()
    return user_list

if __name__ == '__main__':
    # HERE YOU CAN SPECIFY YOUR USERLIST FILE NAME,
    # which contains a list of usernames. BY DEFAULT: <current working directory>/userlist.txt
    USER_FILE = 'userlist.txt'
    # HERE YOU CAN SPECIFY YOUR DATA FILE NAME, BY DEFAULT (data.csv), where your final result stays
    DATA_FILE = 'users_with_er.csv'
    MAX_POST = 12  # MAX POST
    print('Starting the engagement calculations... Please wait until it finishes!')
    users = read_user_name(USER_FILE)
    """ Writing data to csv file """
    csv_headers = get_csv_header(MAX_POST)
    write_csv_header(DATA_FILE, csv_headers)
    for user in users:
        post_info = {'USER': user}
        url = 'https://www.instagram.com/' + user + '/'
        # for troubleshooting, un-comment the next two lines:
        # print(user)
        # print(url)
        try:
            r = requests.get(url)
            if r.status_code != 200:
                print(timestamp, ' user {0} not found or page unavailable! Skipping...'.format(user))
                continue
            soup = BeautifulSoup(r.content, "html.parser")
            scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
            stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]
            j = json.loads(stringified_json)['entry_data']['ProfilePage'][0]
            timestamp = time.strftime("%d-%m-%Y %H:%M:%S", ts)
        except ValueError:
            print(timestamp, 'ValueError for username {0}...Skipping...'.format(user))
            continue
        except IndexError as error:
            # Output expected IndexErrors.
            print(timestamp, error)
            continue
        if j['graphql']['user']['edge_followed_by']['count'] <= 0:
            print(timestamp, 'user {0} has no followers! Skipping...'.format(user))
            continue
        if j['graphql']['user']['edge_owner_to_timeline_media']['count'] < 12:
            print(timestamp, 'user {0} has less than 12 posts! Skipping...'.format(user))
            continue
        if j['graphql']['user']['is_private'] is True:
            print(timestamp, 'user {0} has a private profile! Skipping...'.format(user))
            continue
        media_count = j['graphql']['user']['edge_owner_to_timeline_media']['count']
        accountname = j['graphql']['user']['username']
        followercount = j['graphql']['user']['edge_followed_by']['count']
        bio = j['graphql']['user']['biography']
        i = 0
        total_likes = 0
        total_comments = 0
        all_captiontext = ''
        while i <= 11:
            total_likes += j['graphql']['user']['edge_owner_to_timeline_media']['edges'][i]['node']['edge_liked_by']['count']
            total_comments += j['graphql']['user']['edge_owner_to_timeline_media']['edges'][i]['node']['edge_media_to_comment']['count']
            captions = j['graphql']['user']['edge_owner_to_timeline_media']['edges'][i]['node']['edge_media_to_caption']
            caption_detail = captions['edges'][0]['node']['text']
            all_captiontext += caption_detail
            i += 1
        engagement_rate_percentage = '{0:.4f}'.format((((total_likes + total_comments) / followercount) / 12) * 100) + '%'
        engagement_rate = (((total_likes + total_comments) / followercount) / 12 * 100)
        # isolate and count hashtags
        hashtags = re.findall(r'#\w*', all_captiontext)
        hashtags_counted = Counter(hashtags)
        most_common = hashtags_counted.most_common(5)
        with open('users_with_er.csv', 'a', newline='', encoding='utf-8') as data_out:
            print(timestamp, 'Writing Data for user {0}...'.format(user))
            post_info["USER"] = accountname
            post_info["FOLLOWERCOUNT"] = followercount
            post_info["MEDIA COUNT"] = media_count
            post_info["TOTAL LIKES"] = total_likes
            post_info["TOTAL COMMENTS"] = total_comments
            post_info["ER"] = engagement_rate
            post_info["ER IN %"] = engagement_rate_percentage
            post_info["BIO"] = bio
            post_info["ALL CAPTION TEXT"] = all_captiontext
            post_info["HASHTAGS COUNTED"] = hashtags_counted
            csv_writer = csv.DictWriter(data_out, fieldnames=csv_headers)
            csv_writer.writerow(post_info)
    """ Done with the script """
    print('ALL DONE !!!! ')
The code that goes before this simply scrapes the webpage, and compiles all the captions from the last 12 posts into "all_captiontext".
Any help to solve this (probably simple) issue would be greatly appreciated as I've been struggling with this for days (again, I'm a noob :') ).
Replace the line
post_info["MOST COMMON HASHTAGS"] = most_common
with:
for i, counter_tuple in enumerate(most_common):
    tag_name = counter_tuple[0].replace('#','')
    label = "Top %d" % (i + 1)
    post_info[label] = tag_name
There's also a bit of code missing. For example, your snippet doesn't show where the csv_headers variable comes from, which I suppose would be
csv_headers = post_info.keys()
It also seems that you're opening a file to write just one row. I don't think that's intended, so what you would like to do is collect the results into a list of dictionaries. A cleaner solution would be to use a pandas DataFrame, which you can output straight to a CSV file.
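A rough sketch of that idea (assuming you append one post_info dict per user inside your loop, instead of writing each row immediately):
import pandas as pd

rows = []                       # inside the user loop: rows.append(dict(post_info))
df = pd.DataFrame(rows)
df.to_csv('users_with_er.csv', index=False)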
most_common is the output of the call to hashtags_counted.most_common; I had a look at the docs here: https://docs.python.org/2/library/collections.html#collections.Counter.most_common
Its output is formatted as follows: [(key, value), (key, value), ...], ordered by decreasing number of occurrences.
Hence, to get only the name and not the number of occurrences, you should replace:
post_info["MOST COMMON HASHTAGS"] = most_common
by
post_info["MOST COMMON HASHTAGS"] = [x[0] for x in most_common]
You have a list of tuples. This list comprehension builds, on the fly, the list of the first element of each tuple, keeping the sort order.
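For example, with the data from the question:
most_common = [('#striveforgreatness', 3), ('#jamesgang', 3), ('#thekidfromakron', 2)]
print([x[0] for x in most_common])
# ['#striveforgreatness', '#jamesgang', '#thekidfromakron']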
I'm writing a script where one of its functions is to read a CSV file that contains URLs in one of its columns. Unfortunately the system that creates those CSVs doesn't put double quotes around values in the URL column, so when a URL contains commas it breaks all my CSV parsing.
This is the code I'm using:
with open(accesslog, 'r') as csvfile, open('results.csv', 'w') as enhancedcsv:
    reader = csv.DictReader(csvfile)
    for row in reader:
        self.uri = (row['URL'])
        self.OriCat = (row['Category'])
        self.query(self.uri)
        print self.URL + "," + self.ServerIP + "," + self.OriCat + "," + self.NewCat
This is a sample URL that is breaking the parsing; it comes from the column named "URL" (note the commas at the end):
ams1-ib.adnxs.com/ww=1238&wh=705&ft=2&sv=43&tv=view5-1&ua=chrome&pl=mac&x=1468251839064740641,439999,v,mac,webkit_chrome,view5-1,0,,2,
The field that follows the URL always comes with a numeric value between parentheses, e.g. (9999), so this could be used to determine where the URL with commas ends.
How can I deal with a situation like this using the csv module?
You will have to do it a little more manually. Try this
def process(lines, delimiter=','):
    header = None
    url_index_from_start = None
    url_index_from_end = None
    for line in lines:
        if not header:
            header = [l.strip() for l in line.split(delimiter)]
            url_index_from_start = header.index('URL')
            url_index_from_end = len(header)-url_index_from_start
        else:
            data = [l.strip() for l in line.split(delimiter)]
            url_from_start = url_index_from_start
            url_from_end = len(data)-url_index_from_end
            values = data[:url_from_start] + data[url_from_end+1:] + [delimiter.join(data[url_from_start:url_from_end+1])]
            keys = header[:url_index_from_start] + header[url_index_from_end+1:] + [header[url_index_from_start]]
            yield dict(zip(keys, values))
Usage:
lines = ['Header1, Header2, URL, Header3',
         'Content1, "Content2", abc,abc,,abc, Content3']
result = list(process(lines))
assert result[0]['Header1'] == 'Content1'
assert result[0]['Header2'] == '"Content2"'
assert result[0]['Header3'] == 'Content3'
assert result[0]['URL'] == 'abc,abc,,abc'
print(result)
Result:
>>> [{'URL': 'abc,abc,,abc', 'Header2': '"Content2"', 'Header3': 'Content3', 'Header1': 'Content1'}]
Have you considered using Pandas to read your data in?
Another possible solution would be to use regular expressions to pre-process the data...
import re

# read the file in once
f = open(filein, 'r')
filedata = f.read()
f.close()
# make a list of everything you want to change
old = re.findall(regex, filedata)
# append quotes and create a new list
new = []
for url in old:
    url2 = "\"" + url + "\""
    new.append(url2)
# combine the lists
old_new = list(zip(old, new))
# then use the list to update the file contents, accumulating each replacement
for old, new in old_new:
    filedata = filedata.replace(old, new)
# and write the result back
f = open(filein, 'w')
f.write(filedata)
f.close()
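Here regex and filein are placeholders you would have to fill in yourself. For the sample log above, one possible pattern (an assumption about this log's format, not a general URL matcher) keys off the hostname:
import re

regex = r'\S+\.com/\S+'   # hypothetical: a host ending in .com/ followed by everything up to the next whitespace
sample = 'GET, ams1-ib.adnxs.com/ww=1238&x=1,439999,v,mac,,2, (9999)'
print(re.findall(regex, sample))
# ['ams1-ib.adnxs.com/ww=1238&x=1,439999,v,mac,,2,']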
I have a list, abbreviations, filled with string objects, and I am trying to call .index() with a string from my list. When I do, I get ValueError: 'LING' is not in list, even though it clearly is in the list.
My code:
for item in abbreviations:
    print item
print abbreviations.index("LING")
Why does 'LING' not exist when it clearly does? I have added my code below, which searches 'abbreviations' for the index of a string. I am baffled -- "LING" is clearly in my abbreviations list.
EDIT (Additional Code):
import csv

myfile = open("/Users/it/Desktop/Classbook/classAbrevs.csv", "rU")
lines = [tuple(row) for row in csv.reader(myfile)]
longSubjectNames = []
abbreviations = []
masterAbrevs = []
for item in lines:
    longSubjectNames.append(item[0])
    abbreviations.append(item[1])
with open ("/Users/it/Desktop/Classbook/masterClassList.txt", "r") as myfile:
    masterSchedule = tuple(open("/Users/it/Desktop/Classbook/masterClassList.txt", 'r'))
    for masterline in masterSchedule:
        masterline.strip()
        masterSplitLine = masterline.split("|")
        subjectAbrev = ""
        if masterSplitLine[0] != "STATUS":
            subjectAbrev = ''.join([i for i in masterSplitLine[2] if not i.isdigit()])
            masterAbrevs.append(subjectAbrev)
finalAbrevs = []
for subject in masterAbrevs:
    if (subject[-1] == 'W') and (subject[-2:] != 'UW'):
        subject = subject[:-1]
    finalAbrevs.append(subject)
x = 0
for item in abbreviations:
    print item
print abbreviations.index("LING")
for item in finalAbrevs:
    if masterSplitLine[0] != "STATUS":
        concat = abbreviations.index(str(finalAbrevs[x]).strip())
        print "The abbreviation for " + str(item) + " is: " + longSubjectNames[concat]
    x = x + 1
The output of:
masterAbrevs = []
for item in lines:
    longSubjectNames.append(item[0])
    abbreviations.append(item[1])
print '-'.join(abbreviations)
is:
ACA-ACCY-AFST-AMST-ANAT-ANTH-APSC-ARAB-AH-FA-ASTR-BIOC-BISC-BME-BMSC-BIOS-BADM-CHEM-CHIN-CE-CLAS-CCAS-COMM-CSCI-CFA-CNSL-CPED-DNSC-EALL-ECON-EDUC-ECE-EHS-ENGL-EAP-EMSE-ENRP-EPID-EXSC-FILM-FINA-FORS-FREN-GEOG-GEOL-GER-GREK-HCS-HSCI-HLWL-HSML-HEBR-HIST-HOMP-HONR-HDEV-HOL-HSSJ-ISTM-IDIS-IAD-INTD-IAFF-IBUS-ITAL-JAPN-JSTD-KOR-LATN-LAW-LSPA-LING -MGT-MKTG-MBAD-MATH-MAE-MED-MICR-MMED-MSTD-MUS-NSC-ORSC-PSTD-PERS-PHAR-PHIL-PT-PA-PHYS-PMGT-PPSY-PSC-PORT-PSMB-PSYD-PSYC-PUBH-PPPA-REL-SEAS-SMPA-SLAV-SOC-SPAN-SPED-SPHR-STAT-SMPP-SUST-TRDA-TSTD-TURK-UW-WLP-WSTU
Traceback (most recent call last):
File "/Users/it/Desktop/Classbook/sortClasses.py", line 25, in <module>
with open ("/Users/it/Desktop/Classbook/masterClassList.txt", "r") as anything:
IOError: [Errno 2] No such file or directory: '/Users/it/Desktop/Classbook/masterClassList.txt'
myfile = open("/Users/it/Desktop/Classbook/classAbrevs.csv", "rU")
lines = [tuple(row) for row in csv.reader(myfile)]
longSubjectNames = []
abbreviations = []
masterAbrevs = []
for item in lines:
    longSubjectNames.append(item[0])
    abbreviations.append(item[1])
with open ("/Users/it/Desktop/Classbook/masterClassList.txt", "r") as myfile:
The problem is here:
with open ("/Users/it/Desktop/Classbook/masterClassList.txt", "r") as myfile:
You defined myfile earlier, here:
myfile = open("/Users/it/Desktop/Classbook/classAbrevs.csv", "rU")
So abbreviations is not actually taking its data from classAbrevs.csv; it's taking data from masterClassList.txt, because you rebound myfile with this line:
with open ("/Users/it/Desktop/Classbook/masterClassList.txt", "r") as myfile
That's why your string is not in that list. Also, look at these lines:
for item in lines:
    longSubjectNames.append(item[0])
    abbreviations.append(item[1])
Are you sure item[1] has all of the strings that you want?
I also tried your code, copy-pasting it as-is, and here is the result:
The problem, judging from the result you posted, is that "LING\t" is in your list, not "LING".
Running this, I get the desired index:
abbreviations.index("LING\t")
71
To correct this, there are many ways to strip the \t; I'm showing one of them:
abbreviations.append(item[1].strip())
With this correction, the \t is stripped from item[1] before it is appended to your abbreviations list.
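A minimal illustration, with made-up data, of why the trailing whitespace breaks .index:
abbrevs = ["HIST\t", "LING\t", "MGT\t"]
# abbrevs.index("LING") would raise ValueError: 'LING' is not in list
abbrevs = [a.strip() for a in abbrevs]
print(abbrevs.index("LING"))   # 1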
I got advice from Jamie Bull and PM 2Ring to use the csv module for the output of my web scraper. I'm nearly done, but I have an issue with some parsed items that are separated by a colon or hyphen: I want those items split into two items in the current list.
Current output:
GB,16,19,255,1,26:40,19,13,4,2,6-12,0-1,255,57,4.5,80,21,3.8,175,23-33,4.9,3,14,1,4,38.3,8,65,1,0
Sea,36,25,398,1,33:20,25,8,13,4,4-11,1-1,398,66,6.0,207,37,5.6,191,19-28,6.6,1,0,0,2,33.0,4,69,2,1
Desired output (the differences are where the colons and hyphens used to be):
GB,16,19,255,1,26,40,19,13,4,2,6,12,0,1,255,57,4.5,80,21,3.8,175,23,33,4.9,3,14,1,4,38.3,8,65,1,0
Sea,36,25,398,1,33,20,25,8,13,4,4,11,1,1,398,66,6,207,37,5.6,191,19,28,6.6,1,0,0,2,33,4,69,2,1
I am unsure where or how to make these changes. I also don't know if regex is needed. Obviously I could handle this in Notepad or Excel, but my goal is to handle all of this in Python.
If you run the program, the above results are from the 2014 season, week 1.
import requests
import re
from bs4 import BeautifulSoup
import csv

year_entry = raw_input("Enter year: ")
week_entry = raw_input("Enter week number: ")
week_link = requests.get("http://sports.yahoo.com/nfl/scoreboard/?week=" + week_entry + "&phase=2&season=" + year_entry)
page_content = BeautifulSoup(week_link.content)
a_links = page_content.find_all('tr', {'class': 'game link'})
csvfile = open('NFL_2014.csv', 'a')
writer = csv.writer(csvfile)
for link in a_links:
    r = 'http://www.sports.yahoo.com' + str(link.attrs['data-url'])
    r_get = requests.get(r)
    soup = BeautifulSoup(r_get.content)
    stats = soup.find_all("td", {'class': 'stat-value'})
    teams = soup.find_all("th", {'class': 'stat-value'})
    scores = soup.find_all('dd', {"class": 'score'})
    try:
        away_game_stats = []
        home_game_stats = []
        statistic = []
        game_score = scores[-1]
        game_score = game_score.text
        x = game_score.split(" ")
        away_score = x[1]
        home_score = x[4]
        home_team = teams[1]
        away_team = teams[0]
        away_team_stats = stats[0::2]
        home_team_stats = stats[1::2]
        away_game_stats.append(away_team.text)
        away_game_stats.append(away_score)
        home_game_stats.append(home_team.text)
        home_game_stats.append(home_score)
        for stats in away_team_stats:
            text = stats.text.strip("").encode('utf-8')
            away_game_stats.append(text)
        writer.writerow(away_game_stats)
        for stats in home_team_stats:
            text = stats.text.strip("").encode('utf-8')
            home_game_stats.append(text)
        writer.writerow(home_game_stats)
    except:
        pass
csvfile.close()
Any help is greatly appreciated. This is my first program and searching this board has been a great resource.
Thanks,
JT
You can use regular expressions to split the strings and then "flatten" the list in order to avoid the grouping by quotation marks like this:
Substitute
writer.writerow(away_game_stats)
with
away_game_stats = [re.split(r"-|:",x) for x in away_game_stats]
writer.writerow([x for y in away_game_stats for x in y])
(and same for writer.writerow(home_game_stats))
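To see what those two lines do, here is the transformation on a small made-up sample:
import re

away_game_stats = ['GB', '26:40', '6-12', '4.5']
split_stats = [re.split(r"-|:", x) for x in away_game_stats]
# [['GB'], ['26', '40'], ['6', '12'], ['4.5']]
print([x for y in split_stats for x in y])
# ['GB', '26', '40', '6', '12', '4.5']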
You can also turn the separators into commas directly with re.sub (where test_string is one of your parsed items):
import re
print re.sub(r"-|:", ",", test_string)
See the demo here: https://regex101.com/r/aQ3zJ3/2