I've been working on a Python script that scrapes certain webpages. The beginning of the script looks like this:
# -*- coding: UTF-8 -*-
import urllib2
import re

database = ''
contents = open('contents.html', 'r')
for line in contents:
    entry = ''
    f = re.search('(?<=a href=")(.+?)(?=\.htm)', line)
    if f:
        entry = f.group(0)
    page = urllib2.urlopen('https://indo-european.info/pokorny-etymological-dictionary/' + entry + '.htm').read()
    m = re.search('English meaning( )+\s+(.+?)</font>', page)
    if m:
        title = m.group(2)
    else:
        title = 'N/A'
This accesses each page and grabs a title from it. Then I have a number of blocks of code that test whether certain text is present in each page; here is an example of one:
    abg = re.findall('\babg\b', page)
    if len(abg) == 0:
        abg = 'N'
    else:
        abg = 'Y'
Then, finally, still in the for loop, I add this information to the variable database:
    database += '\n' + str('<F>') + str(entry) + '<TITLE="' + str(title) + '"><FQ="N"><SQ="N"><ABG="' + str(abg) + '"></F>'
Note that I have used str() for each variable because I was getting a "can't concatenate strings and lists" error for some reason.
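For what it's worth, re.findall always returns a list, so any flag block that skips the 'Y'/'N' reassignment would leave a list behind and break the concatenation. A minimal sketch of the effect (the page value here is a hypothetical stand-in, not real data):

import re

page = '<html>abg xyz</html>'         # hypothetical stand-in for a fetched page
abg = re.findall(r'\babg\b', page)    # raw string so \b is a word boundary, not a backspace
# 'x' + abg                           # -> TypeError: cannot concatenate 'str' and 'list' objects
safe = 'x' + str(abg)                 # str() sidesteps the error; 'Y' if abg else 'N' is cleaner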
Once the for loop is completed, I write the database variable to a file:
f = open('database.txt', 'wb')
f.write(database)
f.close()
When I run this from the command line, it times out or never finishes. Any ideas as to what might be causing the issue?
EDIT: I fixed it. It seems the program was being slowed down by accumulating every record in the database variable across iterations of the loop. All I had to do to fix the issue was move the write call inside the for loop and write each record as it is produced.
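A minimal sketch of that arrangement, trimmed to the relevant lines (the record format is unchanged from the code above):

f = open('database.txt', 'w')
for line in contents:
    # ... build entry, title, and abg exactly as above ...
    f.write('\n<F>' + entry + '<TITLE="' + title + '"><FQ="N"><SQ="N"><ABG="' + abg + '"></F>')
f.close()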
Here is some code I'm working on that involves strings and the opening and closing of files.
#Importing required Packages---------------------------------------------
import string
# Importing Datasets-----------------------------------------------------
allNames = open("allNames.csv", "r")
onlyNames = open("onlyNames.csv", "r")
#=========Tasks==========================================================
# [1] findName(name, outputFile)-----------------------------------------
# Works ####
def findName(name, outputFile):
    outfile = open(outputFile + ".csv", "w")    # Output file
    outfile.write("Artist \tSong \tYear\n")     # Initial title lines
    alreadyAdded = []                           # List of lines already added, to remove duplicates
    for aline in allNames:                      # Looping through allNames.csv
        fields = aline.split("\t")              # Splitting the elements of a line into a list
        if fields[-1] == name + "\n":           # Selecting lines with only the specified name (last element)
            dataline = fields[0] + "\t" + fields[1] + "\t" + fields[3]    # Each line in the .csv file
            if dataline not in alreadyAdded:    # Removing duplicates
                outfile.write(dataline + "\n")  # Writing the file
                alreadyAdded.append(dataline)   # Adding lines already added
    outfile.close()
# findName("Mary Anne", "mary anne")
# findName("Jack", "jack")
# findName("Mary", "mary")
# findName("Peter", "peter")
The code serves its intended purpose: I get an exported file. However, it only works for one call at a time; for example, if I try to run both findName("Mary Anne", "mary anne") and findName("Jack", "jack") in the same script, the second call produces nothing. Moreover, all subsequent functions in the project file do not work unless I comment out this code.
Let me know what the issue is, thank you!
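For what it's worth, the symptom matches an exhausted file iterator: allNames is opened once at module level, so the first call reads it to the end and every later call sees nothing. A minimal sketch of one fix, rewinding the shared handle at the top of the function (reopening the file inside the function would work just as well):

def findName(name, outputFile):
    allNames.seek(0)                            # rewind so every call re-reads the file
    outfile = open(outputFile + ".csv", "w")
    outfile.write("Artist \tSong \tYear\n")
    alreadyAdded = []
    for aline in allNames:
        fields = aline.split("\t")
        if fields[-1] == name + "\n":
            dataline = fields[0] + "\t" + fields[1] + "\t" + fields[3]
            if dataline not in alreadyAdded:
                outfile.write(dataline + "\n")
                alreadyAdded.append(dataline)
    outfile.close()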
I'm writing a script to search through multiple text files containing MAC addresses to find which port each one is associated with. I need to do this for several hundred MAC addresses. The function runs fine the first time through. After that, though, the new MAC address doesn't get passed to the function; it stays the same as the one already used, and the function's for loop only seems to run once.
import re
import csv
f = open('all_switches.csv','U')
source_file = csv.reader(f)
m = open('macaddress.csv','wb')
macaddress = csv.writer(m)
s = open('test.txt','r')
source_mac = s.read().splitlines()
count = 0
countMac = 0
countFor = 0
def find_mac(sneaky):
    global count
    global countFor
    count = count + 1
    for switches in source_file:
        countFor = countFor + 1
        # print sneaky only goes through the loop once
        switch = switches[4]
        source_switch = open(switch + '.txt', 'r')
        switch_read = source_switch.readlines()
        for mac in switch_read:
            # print mac does search through all the switches
            found_mac = re.search(sneaky, mac)
            if found_mac is not None:
                interface = re.search("(Gi|Eth|Te)(\S+)", mac)
                if interface is not None:
                    port = interface.group()
                    macaddress.writerow([sneaky, switch, port])
                    print sneaky + ' ' + switch + ' ' + port
        source_switch.close()

for macs in source_mac:
    match = re.search(r'[a-fA-F0-9]{4}[.][a-fA-F0-9]{4}[.][a-fA-F0-9]{4}', macs)
    if match is not None:
        sneaky = match.group()
        find_mac(sneaky)
        countMac = countMac + 1

print count
print countMac
print countFor
I've added count, countFor, and countMac to see how many times the loops and the function run. Here is the output:
549f.3507.7674 the name of the switch Eth100/1/11
677
677
353
Any insight would be appreciated.
source_file is opened globally only once, so the first time you call find_mac(), the for switches in source_file: loop exhausts the file. Since the file is never closed and reopened, on the next call to find_mac() the file pointer is at the end of the file and nothing is read.
Moving the following to the beginning of find_mac should fix it:
f = open('all_switches.csv','U')
source_file = csv.reader(f)
Consider using with statements to ensure your files are closed as well.
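Applied to this script, a minimal sketch might look like the following (keeping the global csv writer and counters from the question; with statements need Python 2.6+):

def find_mac(sneaky):
    global count, countFor
    count = count + 1
    # Reopen the switch list on every call so the reader starts at the top.
    with open('all_switches.csv', 'U') as f:
        for switches in csv.reader(f):
            countFor = countFor + 1
            switch = switches[4]
            with open(switch + '.txt', 'r') as source_switch:
                for mac in source_switch:
                    if re.search(sneaky, mac) is not None:
                        interface = re.search(r"(Gi|Eth|Te)(\S+)", mac)
                        if interface is not None:
                            port = interface.group()
                            macaddress.writerow([sneaky, switch, port])
                            print sneaky + ' ' + switch + ' ' + port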
I am using Python 2.4.4 (old machine, can't do anything about it) on a UNIX machine. I am extremely new to Python and programming and have never used a UNIX machine before. This is what I am trying to do:
Extract a single sequence from a FASTA file (proteins + nucleotides) into a temporary text file.
Give this temporary file to a program called 'threader'
Append the output from threader (called tempresult.out) to a file called results.out
Remove the temporary file.
Remove the tempresult.out file.
Repeat using the next FASTA sequence.
Here is my code so far:
import os
from itertools import groupby

input_file = open('controls.txt', 'r')
output_file = open('results.out', 'a')

def fasta_parser(fasta_name):
    input = fasta_name
    parse = (x[1] for x in groupby(input, lambda line: line[0] == ">"))
    for header in parse:
        header = header.next()[0:].strip()
        seq = "\n".join(s.strip() for s in parse.next())
        yield (header, '\n', seq)

parsedfile = fasta_parser(input_file)
mylist = list(parsedfile)

index = 0
while index < len(mylist):
    temp_file = open('temp.txt', 'a+')
    temp_file.write(' '.join(mylist[index]))
    os.system('threader' + ' temp.txt' + ' tempresult.out' + ' structures.txt')
    os.remove('temp.txt')
    f = open('tempresult.out', 'r')
    data = str(f.read())
    output_file.write(data)
    os.remove('tempresult.out')
    index += 1

output_file.close()
temp_file.close()
input_file.close()
When I run this script I get a 'Segmentation fault' error. From what I gather, this means something is touching memory it shouldn't be (???). I assume it has something to do with the temporary files, but I have no idea how to get around it.
Any help would be much appreciated!
Thanks!
Update 1:
Threader works fine when I give it the same sequence multiple times like this:
import os

input_file = open('control.txt', 'r')
output_file = open('results.out', 'a')

x = 0
while x < 3:
    os.system('threader' + ' control.txt' + ' tempresult.out' + ' structures.txt')
    f = open('tempresult.out', 'r')
    data = str(f.read())
    output_file.write(data)
    os.remove('result.out')
    x += 1

output_file.close()
input_file.close()
Update 2: In case someone else runs into this error: I forgot to close temp.txt before invoking the threader program.
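A sketch of the loop body with that fix applied (Python 2.4-compatible, so no with statements; 'w' mode is assumed here so each sequence starts a fresh temp file, where the original used 'a+'):

while index < len(mylist):
    temp_file = open('temp.txt', 'w')
    temp_file.write(' '.join(mylist[index]))
    temp_file.close()                       # flush to disk before threader reads it
    os.system('threader temp.txt tempresult.out structures.txt')
    os.remove('temp.txt')
    f = open('tempresult.out', 'r')
    output_file.write(f.read())
    f.close()
    os.remove('tempresult.out')
    index += 1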
I've never used Python and have copied some script (with permission) from someone online, so I'm not sure why the code is failing. I'm hoping someone can understand it and put it right for me!
from os import walk
from os.path import join
#First some options here.
!RootDir = "C:\\Users\\***\\Documents\\GoGames"
!OutputFile = "C:\\Users\\***\\Documents\\GoGames\\protable.csv"
Properties = !!['pb', 'pw', 'br', 'wr', 'dt', 'ev', 're']
print """
SGF Database Maker
==================
Use this program to create a CSV file with sgf info.
"""
def getInfo(filename):
    """Read out file info here and return a dictionary with all the
    properties needed."""
    result = !![]
    file = open(filename, 'r')
    data = file.read(1024)  # read at most 1kb since we assume all relevant info is in the beginning
    file.close()
    for prop in Properties:
        try:
            i = data.lower().index(prop)
        except !ValueError:
            result.append((prop, ''))
            continue
        try:
            value = data![data.index('![', i)+1 : data.index(']', i)]
        except !ValueError:
            value = ''
        result.append((prop, value))
    return dict(result)
!ProgressCounter = 0
file = open(!OutputFile, "w")
file.write('^Filename^;^PB^;^BR^;^PW^;^WR^;^RE^;^EV^;^DT^\n')
for root, dirs, files in walk(!RootDir):
    for name in files:
        if name![-3:].lower() != "sgf":
            continue
        info = getInfo(join(root, name))
        file.write('^'+join(root, name)+'^;^'+info!['pb']+'^;^'+info!['br']+'^;^'+info!['pw']+'^;^'+info!['wr']+'^;^'+info!['re']+'^;^'+info!['ev']+'^;^'+info!['dt']+'^\n')
        !ProgressCounter += 1
        if (!ProgressCounter) % 100 == 0:
            print str(!ProgressCounter) + " games processed."
file.close()
print "A total of " + str(!ProgressCounter) + " have been processed."
Using Netbeans IDE I get the following error:
!RootDir = "C:\\Users\\***\\Documents\\GoGames"
^
SyntaxError: mismatched input '' expecting EOF
I have previously been able to step through the code as far as file.close(), where I got an error "does not match outer indentation level".
Anyone able to put the syntax of this code right for me?
Remove the exclamation marks in front of variable names, in list declarations (!![]), and in except clauses (except !ValueError); they are not valid Python syntax.
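For example, the option lines at the top would become (paths kept exactly as posted):

RootDir = "C:\\Users\\***\\Documents\\GoGames"
OutputFile = "C:\\Users\\***\\Documents\\GoGames\\protable.csv"
Properties = ['pb', 'pw', 'br', 'wr', 'dt', 'ev', 're']

and, inside getInfo(), the affected lines become:

        try:
            i = data.lower().index(prop)
        except ValueError:
            result.append((prop, ''))
            continue
        try:
            value = data[data.index('[', i)+1 : data.index(']', i)]
        except ValueError:
            value = ''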
What I am trying to do:
I am trying to use open() in Python, and this is the script I am trying to execute. I want to give a restaurant name as input and have a file saved (reviews.txt).
Script: (in short, the script goes to a page and scrapes the reviews)
from bs4 import BeautifulSoup
from urllib import urlopen

queries = 0
while queries < 201:
    stringQ = str(queries)
    page = urlopen('http://www.yelp.com/biz/madison-square-park-new-york?start=' + stringQ)
    soup = BeautifulSoup(page)
    reviews = soup.findAll('p', attrs={'itemprop':'description'})
    authors = soup.findAll('span', attrs={'itemprop':'author'})
    flag = True
    indexOf = 1
    for review in reviews:
        dirtyEntry = str(review)
        while dirtyEntry.index('<') != -1:
            indexOf = dirtyEntry.index('<')
            endOf = dirtyEntry.index('>')
            if flag:
                dirtyEntry = dirtyEntry[endOf+1:]
                flag = False
            else:
                if(endOf+1 == len(dirtyEntry)):
                    cleanEntry = dirtyEntry[0:indexOf]
                    break
                else:
                    dirtyEntry = dirtyEntry[0:indexOf]+dirtyEntry[endOf+1:]
        f = open("reviews.txt", "a")
        f.write(cleanEntry)
        f.write("\n")
        f.close
    queries = queries + 40
Problem:
It's using append mode, 'a'; according to the documentation, 'w' is the write mode, which overwrites. When I change it to 'w', nothing happens.
f=open("reviews.txt", "w") #does not work!
Actual Question:
EDIT: Let me clear up the confusion.
I just want ONE reviews.txt file with all the reviews. Every time I run the script, I want it to overwrite the existing reviews.txt with new reviews according to my input.
Thank you,
If I understand properly what behavior you want, then this should be the right code:
with open("reviews.txt", "w") as f:
    for review in reviews:
        dirtyEntry = str(review)
        while dirtyEntry.index('<') != -1:
            indexOf = dirtyEntry.index('<')
            endOf = dirtyEntry.index('>')
            if flag:
                dirtyEntry = dirtyEntry[endOf+1:]
                flag = False
            else:
                if(endOf+1 == len(dirtyEntry)):
                    cleanEntry = dirtyEntry[0:indexOf]
                    break
                else:
                    dirtyEntry = dirtyEntry[0:indexOf]+dirtyEntry[endOf+1:]
        f.write(cleanEntry)
        f.write("\n")
This opens the file for writing only once and writes all the entries to it. Otherwise, if the open call is nested in the for loop, the file is reopened for each review and its contents are overwritten by the next review.
The with statement ensures that the file is closed when the program leaves the block. It also makes the code easier to read.
I'd also suggest avoiding brackets around the condition of an if statement, so instead of
if(endOf+1 == len(dirtyEntry)):
it's better to use just
if endOf + 1 == len(dirtyEntry):
If you want to write every record to a different new file, you must name each file differently; otherwise you are always overwriting your old data with new data, and you are left with only the latest record.
You could increment your filename like this:
# at the beginning, above the loop:
i = 1

# inside the loop:
f = open("reviews_{0}.txt".format(i), "a")
f.write(cleanEntry)
f.write("\n")
f.close()
i += 1
UPDATE
According to your recent update, I see that this is not what you want. To achieve what you want, you just need to move f=open("reviews.txt", "w") and f.close() outside of the for loop. That way, you won't be opening the file multiple times inside the loop, each time overwriting your previous entries:
f = open("reviews.txt", "w")
for review in reviews:
    # ... other code here ... #
    f.write(cleanEntry)
    f.write("\n")
f.close()
But, I encourage you to use with open("reviews.txt", "w") as described in Alexey's answer.