Parse a 90MB data archive in Python - python

I'm new to Python and am trying to build a program that will let me parse several hundred documents by speaker and their speech (the data is hearing transcripts with a semi-regular structure). After parsing, I write the results to a .csv file, then write another file that splits the speech into paragraphs and produces a second .csv. Here is the code (acknowledgements to my colleague for his part in developing this, which was substantial):
import os
import re
import csv
from bs4 import BeautifulSoup

path = "path in computer"
os.chdir(path)

with open('hearing_name.htm', 'r') as f:
    hearing = f.read()

Hearing = BeautifulSoup(hearing)
Hearing = Hearing.get_text()
Hearing = Hearing.split("erroneous text")

speakers = re.findall("\\n Mr. [A-Z][a-z]+\.|\\n Ms. [A-Z][a-z]+\.|\\n Congressman [A-Z][a-z]+\.|\\n Congresswoman [A-Z][a-z]+\.|\\n Chairwoman [A-Z][a-z]+\.|\\n Chairman [A-Z][a-z]+\.", hearing)
speakers = list(set(speakers))
print speakers

position = []
for speaker in speakers:
    x = hearing.find(speaker)
    position.append(x)

def find_speaker(hearing, speakers):
    position = []
    for speaker in speakers:
        x = hearing.find(speaker)
        if x==-1:
            x += 1000000
        position.append(x)
    first = min(position)
    name = speakers[position.index(min(position))]
    name_length = len(name)
    chunk = [name, hearing[0:first], hearing[first+name_length:]]
    return chunk

chunks = []
print hearing
names = []
while len(hearing)>10:
    chunk_try = find_speaker(hearing, speakers)
    hearing = chunk_try[2]
    chunks.append(chunk_try[1])
    names.append(chunk_try[0].strip())
print len(hearing) #0
#print dialogue[0:5]
chunks.append(hearing)
chunks = chunks[1:]
print len(names) #138
print len(chunks) #138

data = zip(names, chunks)
with open('filename.csv','wb') as f:
    w=csv.writer(f)
    w.writerow(['Speaker','Speach'])
    for row in data:
        w.writerow(row)

paragraphs = str(chunks)
print (paragraphs)
Paragraphs = paragraphs.split("\\n")
data1 = zip(Paragraphs)
with open('Paragraphs.csv','wb') as f:
    w=csv.writer(f)
    w.writerow(['Paragraphs'])
    for row in data1:
        w.writerow(row)
Obviously, the code above can do what I need one hearing at a time, but my question is: how can I automate this to the point where I can do either large batches or all of the files at once (578 hearings in total)? I've tried the code below (which has worked for me in the past when compiling large sets of data), but this time I get no results (memory leak?).
Tested Compiling Code:
hearing = [filename for filename in os.listdir(path)]
hearings = []
#compile hearings
for file in hearing:
    input = open(file, 'r')
    hearings.append(input.read())
Thanks in advance for your help.

First you need to take the first set of code, generalize it, and make it into a giant function. This will involve replacing any hardcoded path and file names in it with appropriately named variables.
Give the new driver function arguments that correspond to each of the path(s) and file name(s) you replaced. Calling this function will perform all the steps needed to process one input file and produce all the output files that result from doing that.
You can test whether you've done this correctly by calling the driver function, passing it the file names that used to be hardcoded, and seeing if it produces the same output as before.
Once that is done, import the file the function is in (which is now called a module) into your batch-processing script and invoke the new driver function multiple times, passing different input and output file names to it each time (a sketch of such a batch script follows the driver code below).
I've done the first step for you (and fixed the mixed indenting). Note however that it's untested since that's impossible for me to actually do:
import os
import re
import csv
from bs4 import BeautifulSoup

def driver(folder, input_filename, output_filename1, output_filename2):
    os.chdir(folder)
    with open(input_filename, 'r') as f:
        hearing = f.read()

    Hearing = BeautifulSoup(hearing)
    Hearing = Hearing.get_text()
    Hearing = Hearing.split("erroneous text")

    speakers = re.findall("\\n Mr. [A-Z][a-z]+\.|\\n Ms. [A-Z][a-z]+\.|\\n Congressman [A-Z][a-z]+\.|\\n Congresswoman [A-Z][a-z]+\.|\\n Chairwoman [A-Z][a-z]+\.|\\n Chairman [A-Z][a-z]+\.", hearing)
    speakers = list(set(speakers))
    print speakers

    position = []
    for speaker in speakers:
        x = hearing.find(speaker)
        position.append(x)

    def find_speaker(hearing, speakers):
        position = []
        for speaker in speakers:
            x = hearing.find(speaker)
            if x==-1:
                x += 1000000
            position.append(x)
        first = min(position)
        name = speakers[position.index(min(position))]
        name_length = len(name)
        chunk = [name, hearing[0:first], hearing[first+name_length:]]
        return chunk

    chunks = []
    print hearing
    names = []
    while len(hearing)>10:
        chunk_try = find_speaker(hearing, speakers)
        hearing = chunk_try[2]
        chunks.append(chunk_try[1])
        names.append(chunk_try[0].strip())
    print len(hearing) #0
    #print dialogue[0:5]
    chunks.append(hearing)
    chunks = chunks[1:]
    print len(names) #138
    print len(chunks) #138

    data = zip(names, chunks)
    with open(output_filename1,'wb') as f:
        w=csv.writer(f)
        w.writerow(['Speaker','Speach'])
        for row in data:
            w.writerow(row)

    paragraphs = str(chunks)
    print (paragraphs)
    Paragraphs = paragraphs.split("\\n")
    data1 = zip(Paragraphs)
    with open(output_filename2,'wb') as f:
        w=csv.writer(f)
        w.writerow(['Paragraphs'])
        for row in data1:
            w.writerow(row)

    return True # success

if __name__ == '__main__':
    driver('path in computer', 'hearing_name.htm', 'filename.csv', 'Paragraphs.csv')
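Here is a minimal sketch of the kind of batch script described above. It assumes the driver code was saved as a module named hearing_parser.py (the module name and the output-file naming scheme are just assumptions) and that every .htm file in the folder should be processed:

import os
from hearing_parser import driver  # the module containing the driver() function above

folder = 'path in computer'

for filename in os.listdir(folder):
    if not filename.endswith('.htm'):
        continue
    base = os.path.splitext(filename)[0]
    # derive per-hearing output names so each run does not overwrite the last
    driver(folder, filename, base + '_speakers.csv', base + '_paragraphs.csv')

Because each call to driver() reads one hearing, writes its two CSVs, and then discards the text, memory use stays roughly constant no matter how many of the 578 files you feed it.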

You can use a lot less memory, with no real downside, if you process these files individually. Rather than reading every file and adding its contents to a list for future processing, process one file, then move on to the next.
As for getting no results, I'm not totally sure. Are you not getting any errors?
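As a rough sketch of that pattern (process_hearing here is a hypothetical placeholder for whatever work you do on one transcript), the compiling loop from the question could become:

import os

for filename in os.listdir(path):
    with open(filename, 'r') as f:
        text = f.read()
    process_hearing(text)  # hypothetical per-file work; nothing is kept in memory afterwards

Only one hearing is ever held in memory at a time, instead of all 578.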

Related

Match Domains to DNS Resolver Name - Python

I am still new to Python; I've been using it for work and for a few side projects automating my Plex media management tasks.
I am trying to write a Python script that takes a set list of domains from a CSV file and matches each one to its DNS name server. For example, Plex.tv looked up with 'NS' would return jeremy.ns.cloudflare.com.
My main goal is to read in the list of domains from a CSV, run my code to match those domains to a DNS resolver name, and write those to a new CSV file, then zip the two together, which is what I have in my code.
I am having a few problems along the way.
Visual Studio Code doesn't allow import dns.resolver (not a huge issue, but if you know the fix for that it would save me from having to run it from the command line).
Matching domains to their DNS resolver throws the error "AttributeError: 'list' object has no attribute 'is_absolute'".
import csv
import socket
import dns.resolver
import os
from os.path import dirname, abspath

# Setting Variables
current_path = dirname(abspath(__file__))
domainFName = '{0}/domains.csv'.format(current_path)
outputFile = '{0}/output.csv'.format(current_path)
dnsList = '{0}/list2.csv'.format(current_path)
case_list = []
fields = ['Domains', 'DNS Resolvers']
caseList = []
dnsResolve = []

# Read in all domains from csv into list
with open(domainFName, 'r') as file:
    for line in csv.reader(file):
        case_list.append(line)
print(case_list)

# Match domains to the DNS Resolver Name
for domains in case_list:
    answer = dns.resolver.resolve(domains, 'NS')
    server = answer.target
    dnsResolve.append(server)

# Write the dns Resolver names into a new csv file
with open(dnsList,'w', newline="") as r:
    writers = csv.writer(r)
    writers.writerows(caseList)

# Write the domains and dns resolvers to new output csv
with open(outputFile,'w', newline="") as f:
    writer = csv.writer(f)
    writer.writerow(fields)
    writer.writerow(zip(case_list,caseList))
exit()
Thanks for any help
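For what it's worth, the AttributeError above usually means dns.resolver.resolve() was handed a whole CSV row (a list) rather than a single domain string, since csv.reader yields each row as a list. A minimal sketch of that part of the loop with the row unpacked first (it assumes the domain sits in the first column):

# each row from csv.reader is a list; resolve() expects one domain name string
for row in case_list:
    domain = row[0]  # assumption: the domain is the first column of the row
    answer = dns.resolver.resolve(domain, 'NS')
    for server in answer:
        dnsResolve.append(server.target)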
After a discussion with a co-worker, I was able to resolve my issue, and just for the sake of it, if anyone wants to use this code for a similar need (we use it for DMARC), I will post the whole code:
import dns.resolver
import csv
import os
from os.path import dirname, abspath

# Setting Variables
current_path = dirname(abspath(__file__))
domainFName = '{0}/domains.csv'.format(current_path)
outputFile = '{0}/output.csv'.format(current_path)
dnsList = '{0}/dnslist.csv'.format(current_path)
backupCSV = '{0}/backup-output.csv'.format(current_path)
case_list = []
dns_list = []
fields = ['Domains', 'DNS Resolvers']
csv_output = zip(case_list, dns_list)
domainAmount = 0
rd = 00
dnresolve = 00
part = 0
percentL = []
percents = [10,20,30,40,50,60,70,80,90,95,96,97,98,99]
percentList = []
floatingList = []
floatPart = []
x = 00
keyAzure = 'azure'
keyCSC = 'csc'

while x < .99:
    x += .01
    floatingList.append(x)

# THIS IS THE CODE FOR WRITING CSV FILES INTO LISTS - LABELED AS #1
print("FILES MUST BE CSV, WILL RETURN AN ERROR IF NOT. LEAVE OFF .CSV")
# Here we will gather the input of which csv file to use. If none are entered, it will use domains.csv
print("Enter your output file name (if blank will use default):")
UserFile = str(input("Enter your filename: ") or "domains")
fullFile = UserFile + '.csv'
domainFName = fullFile.format(current_path)

# Here we will specify the output file name. If the file is not created, it will create it
# If the user enters no data, the default will be used, output.csv
print("Enter your output file name (if blank will use default):")
UserOutput = str(input("Enter your filename: ") or "output")
fullOutput = UserOutput + '.csv'
outputFIle = fullOutput.format(current_path)

# Read in all domains from csv into list
with open(domainFName, 'r') as file:
    for line in csv.reader(file):
        case_list.append(line)
        domainAmount += 1
print("Starting the resolver:")
print("You have " + str(domainAmount) + " Domains to resolve:")
# THIS IS THE END OF THE CODE FOR WRITING CSV FILES INTO LISTS - LABELED AS #1

# THE CODE BELOW IS WORKING FOR FINDING THE DNS RESOLVERS - LABELED AS #2
# Function for matching domains to DNS resolvers
def dnsResolver (domain):
    try:
        answers = dns.resolver.resolve(domain, 'NS')
        for server in answers:
            dns_list.append(server.target)
    except:
        dns_list.append("Did Not Resolve")

print("Now resolving domains to their DNS name:")
print("This will take a few minutes. Check out the progress bar for your status:")
print("I have resolved 0% Domains:")

# This code is for finding the percentages for the total amount of domains to find progress status
def percentageFinder(percent, whole):
    return (percent * whole) / 100

def percentGetter(part, whole):
    return (100 * int(part)/int(whole))

for x in percents:
    percentList.append(int(percentageFinder(x,domainAmount)))
percentL = percentList
# End code for percentage finding

for firstdomain in case_list:
    for domain in firstdomain:
        dnsResolver(domain)
        if dnsResolver != "Did Not Resolve":
            rd += 1
        else:
            dnresolve += 1
        # Using w+ to overwrite all Domain Names &
        with open(dnsList,'w+', newline="") as r:
            writers = csv.writer(r)
            writers.writerows(dns_list)
        # This is used for showing the percentage of the matching you have done
        part += 1
        if part in percentL:
            total = int(percentGetter(part, domainAmount))
            print("I Have Resolved {}".format(total) + "%" + " Domains:")
        else:
            pass

print("Resolving has completed. Statistics Below:")
print("------------------------------------------")
print("You had " + str(rd) + " domains that resolved.")
print("You had " + str(dnresolve) + " domains that did NOT resolve")
# THIS IS THE END OF THE WORKING CODE - LABELED AS #2

# Write the dns Resolver names into a new csv file
print("Now writing your domains & their DNS Name to an Output File:")
with open(outputFile,'w+', newline="\n") as f:
    writer = csv.writer(f, dialect='excel')
    writer.writerow(fields)
    for row in csv_output:
        writer.writerow(row)

print("Writing a backup CSV File")
# Using this to create a backup in case to contain all domains, and all resolvers
# If someone runs the script with a small list of domains, still want to keep a
# running list of everything in case any questions arise.
# This is done by using 'a' instead of 'w' or 'w+' done above.
with open(backupCSV,'w', newline="") as f:
    writer = csv.writer(f, dialect='excel')
    writer.writerow(fields)
    for row in csv_output:
        writer.writerow(row)
print("Your backup is now done processing. Exiting program")

# Sort the files by keyword, in this case the domain being azure or csc
for r in dns_list:
    if keyAzure in r:
        for x in keyAzure:
            FileName = x
            print(FileName)
exit()

Combine two python scripts for web search

I'm trying to download files from a site and, due to search result limitations (max 300), I need to search each item individually. I have a CSV file with the complete list, and I've written some basic code to return the ID# column.
With some help, I've got another script that iterates through each search result and downloads a file. What I need to do now is combine the two so that it searches each individual ID# and downloads the file.
I know my loop is messed up here; I just can't figure out where, or whether I'm even looping in the right order.
import requests, json, csv

faciltiyList = []
with open('Facility List.csv', 'r') as f:
    csv_reader = csv.reader(f, delimiter=',')
    for searchterm in csv_reader:
        faciltiyList.append(searchterm[0])

url = "https://siera.oshpd.ca.gov/FindFacility.aspx"
r = requests.get(url+"?term="+str(searchterm))
searchresults = json.loads(r.content.decode('utf-8'))
for report in searchresults:
    rpt_id = report['RPT_ID']
    reporturl = f"https://siera.oshpd.ca.gov/DownloadPublicFile.aspx?archrptsegid={rpt_id}&reporttype=58&exportformatid=8&versionid=1&pageid=1"
    r = requests.get(reporturl)
    a = r.headers['Content-Disposition']
    filename = a[a.find("filename=")+9:len(a)]
    file = open(filename, "wb")
    file.write(r.content)
    r.close()
The original code I have is here:
import requests, json

searchterm="ALAMEDA (COUNTY)"
url="https://siera.oshpd.ca.gov/FindFacility.aspx"
r=requests.get(url+"?term="+searchterm)
searchresults=json.loads(r.content.decode('utf-8'))
for report in searchresults:
    rpt_id=report['RPT_ID']
    reporturl=f"https://siera.oshpd.ca.gov/DownloadPublicFile.aspx?archrptsegid={rpt_id}&reporttype=58&exportformatid=8&versionid=1&pageid=1"
    r=requests.get(reporturl)
    a=r.headers['Content-Disposition']
    filename=a[a.find("filename=")+9:len(a)]
    file = open(filename, "wb")
    file.write(r.content)
    r.close()
The searchterm ="ALAMEDA (COUNTY)" returns more than 300 results, so I'm trying to replace "ALAMEDA (COUNTY)" with a list that runs through each name (ID# in this case) so that I get just one result, then run again for the next one on the list.
CSV - just 1 line
Tested with a CSV file with just 1 line:
406014324,"HOLISTIC PALLIATIVE CARE, INC.",550004188,Parent Facility,5707 REDWOOD RD,OAKLAND,94619,1,ALAMEDA,Not Applicable,,Open,1/1/2018,Home Health Agency/Hospice,Hospice,37.79996,-122.17075
Python code
This script reads the IDs from the CSV file, then fetches the results from the URL and finally writes the desired contents to disk.
import requests, json, csv

# read Ids from csv
facilityIds = []
with open('Facility List.csv', 'r') as f:
    csv_reader = csv.reader(f, delimiter=',')
    for searchterm in csv_reader:
        facilityIds.append(searchterm[0])

# fetch and write file contents
url = "https://siera.oshpd.ca.gov/FindFacility.aspx"
for facilityId in facilityIds:
    r = requests.get(url+"?term="+str(facilityId))
    reports = json.loads(r.content.decode('utf-8'))
    # print(f"reports = {reports}")
    for report in reports:
        rpt_id = report['RPT_ID']
        reporturl = f"https://siera.oshpd.ca.gov/DownloadPublicFile.aspx?archrptsegid={rpt_id}&reporttype=58&exportformatid=8&versionid=1&pageid=1"
        r = requests.get(reporturl)
        a = r.headers['Content-Disposition']
        filename = a[a.find("filename=")+9:len(a)]
        # print(f"filename = {filename}")
        with open(filename, "wb") as o:
            o.write(r.content)
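One thing to watch for: servers sometimes wrap the Content-Disposition filename in quotes, or omit the header entirely. A small defensive variation on the filename handling (the fallback name built from rpt_id, and its extension, are just assumptions) might look like:

a = r.headers.get('Content-Disposition', '')
if 'filename=' in a:
    filename = a.split('filename=', 1)[1].strip('"')
else:
    filename = f"report_{rpt_id}.xls"  # hypothetical fallback name; the extension is a guess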

Output from function to text/CSV file?

I am counting the number of contractions in a certain set of presidential speeches, and want to output these contractions to a CSV or text file. Here's my code:
import urllib2,sys,os,csv
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
from multiprocessing import Pool
import re, nltk
import requests
import math, functools
import summarize

reload(sys)

def processURL_short(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div',{'id':'transcript'},{'class':'displaytext'})
    item_str = item_div.text.lower()
    return item_str

every_link_test = ['http://www.millercenter.org/president/obama/speeches/speech-4427',
                   'http://www.millercenter.org/president/obama/speeches/speech-4424',
                   'http://www.millercenter.org/president/obama/speeches/speech-4453',
                   'http://www.millercenter.org/president/obama/speeches/speech-4612',
                   'http://www.millercenter.org/president/obama/speeches/speech-5502']

data = {}
count = 0
for l in every_link_test:
    content_1 = processURL_short(l)
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    filename = "{0}_{1}".format(president,speech_num)
    data[filename] = count
    print count, filename
    with open('contraction_counts.csv','w',newline='') as fp:
        a = csv.writer(fp,delimiter = ',')
        a.writerows(data)
Running that for loop prints out
79 obama_speech-4427
101 obama_speech-4424
101 obama_speech-4453
182 obama_speech-4612
224 obama_speech-5502
I want to export that to a text file, where the numbers on the left are one column and the president/speech number is the second column. My with statement just rewrites the file on each pass through the loop, which is definitely suboptimal.
You can try something like this; it's a generic method, so modify as you see fit:
import csv

with open('somepath/file.txt', 'wb+') as outfile:
    w = csv.writer(outfile)
    w.writerow(['header1', 'header2'])
    for i in you_data_structure: # eg list or dictionary i'm assuming a list structure
        w.writerow([
            i[0],
            i[1],
        ])
or if a dictionary
import csv

with open('somepath/file.txt', 'wb+') as outfile:
    w = csv.writer(outfile)
    w.writerow(['header1', 'header2'])
    for k, v in your_dictionary.items(): # iterate the dictionary's (key, value) pairs
        w.writerow([
            k,
            v,
        ])
Your problem is that you open the output file inside the loop in 'w' mode, meaning that it is erased on each iteration. You can easily solve it in two ways:
Move the open outside of the loop (the normal way). You open the file only once, add rows on each iteration, and close it when exiting the with block:
with open('contraction_counts.csv','w',newline='') as fp:
    a = csv.writer(fp,delimiter = ',')
    for l in every_link_test:
        content_1 = processURL_short(l)
        for word in content_1.split():
            word = word.strip(p)
            if word in contractions:
                count = count + 1
        splitlink = l.split("/")
        president = splitlink[4]
        speech_num = splitlink[-1]
        filename = "{0}_{1}".format(president,speech_num)
        data[filename] = count
        print count, filename
    a.writerows(data)
Open the file in 'a' (append) mode. On each iteration you reopen the file and write at the end instead of erasing it. This uses more I/O resources because of the repeated open/close, and should be used only if the program might crash part-way through and you want to be sure that everything written before the crash has actually been saved to disk:
for l in every_link_test:
    content_1 = processURL_short(l)
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    filename = "{0}_{1}".format(president,speech_num)
    data[filename] = count
    print count, filename
    with open('contraction_counts.csv','a',newline='') as fp:
        a = csv.writer(fp,delimiter = ',')
        a.writerows(data)
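One caveat with both versions: csv.writer.writerows() iterates whatever it is given, and iterating a dict yields only its keys, so the counts would be dropped (and each filename string would be split character by character). Passing data.items() writes (filename, count) pairs instead, which matches the two-column output described in the question:

# write one row per speech: the filename key and its contraction count
a.writerows(data.items())

If you want the count in the first column, write the pairs the other way round, e.g. a.writerows((c, n) for n, c in data.items()).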

extracting data from CSV file using a reference

I have a csv file with several hundred organism IDs and a second csv file with several thousand organism IDs and additional characteristics (taxonomic information, abundances per sample, etc)
I am trying to write code that will extract the information from the larger CSV using the smaller CSV file as a reference. Meaning it will look at both the smaller and larger files, and if an ID is in both files, it will extract all the information from the larger file and write it to a new file (basically write the entire row for that ID).
So far I have written the following, and while the code does not error out on me, I get a blank file in the end and I don't exactly know why. I am a graduate student who knows some simple coding, but I'm still very much a novice.
Thank you
import sys
import csv
import os.path

SparCCnames=open(sys.argv[1],"rU")
OTU_table=open(sys.argv[2],"rU")
new_file=open(sys.argv[3],"w")

Sparcc_OTUs=csv.writer(new_file)

d=csv.DictReader(SparCCnames)
ids=csv.DictReader(OTU_table)

for record in ids:
    idstopull=record["OTUid"]
    if idstopull[0]=="OTUid":
        continue
    if idstopull[0] in d:
        new_id.writerow[idstopull[0]]

SparCCnames.close()
OTU_table.close()
new_file.close()
I'm not sure what you're trying to do in your code, but you can try this:
import csv

def csv_to_dict(csv_file_path):
    csv_file = open(csv_file_path, 'rb')
    csv_file.seek(0)
    sniffdialect = csv.Sniffer().sniff(csv_file.read(10000), delimiters='\t,;')
    csv_file.seek(0)
    dict_reader = csv.DictReader(csv_file, dialect=sniffdialect)
    csv_file.seek(0)

    dict_data = []
    for record in dict_reader:
        dict_data.append(record)

    csv_file.close()
    return dict_data

def dict_to_csv(csv_file_path, dict_data):
    csv_file = open(csv_file_path, 'wb')
    writer = csv.writer(csv_file, dialect='excel')

    headers = dict_data[0].keys()
    writer.writerow(headers)

    # headers must be the same with dat.keys()
    for dat in dict_data:
        line = []
        for field in headers:
            line.append(dat[field])
        writer.writerow(line)

    csv_file.close()

if __name__ == "__main__":
    big_csv = csv_to_dict('/path/to/big_csv_file.csv')
    small_csv = csv_to_dict('/path/to/small_csv_file.csv')

    output = []
    for s in small_csv:
        for b in big_csv:
            if s['id'] == b['id']:
                output.append(b)

    if output:
        dict_to_csv('/path/to/output.csv', output)
    else:
        print "Nothing."
Hope that will help.
You need to read the data into a data structure. Assuming OTUid is unique, you can store it in a dictionary for fast lookup:
with open(sys.argv[1],"rU") as SparCCnames:
    d = csv.DictReader(SparCCnames)
    fieldnames = d.fieldnames
    data = {i['OTUid']: i for i in d}

with open(sys.argv[2],"rU") as OTU_table, open(sys.argv[3],"w") as new_file:
    Sparcc_OTUs = csv.DictWriter(new_file, fieldnames)
    ids = csv.DictReader(OTU_table)
    for record in ids:
        if record['OTUid'] in data:
            Sparcc_OTUs.writerow(data[record['OTUid']])
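If you also want a header row in the output (the working script posted below does this), csv.DictWriter exposes writeheader() for exactly that purpose:

Sparcc_OTUs = csv.DictWriter(new_file, fieldnames)
Sparcc_OTUs.writeheader()  # emit the column names before the matched rows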
Thank you everyone for your help. I played with things and consulted with an advisor, and finally got a working script. I am posting it in case it helps someone else in the future.
Thanks!
import sys
import csv

input_file = csv.DictReader(open(sys.argv[1], "rU")) #has all info
ref_list = csv.DictReader(open(sys.argv[2], "rU")) #reference list
output_file = csv.DictWriter(
    open(sys.argv[3], "w"), input_file.fieldnames) #to write output file with headers
output_file.writeheader() #write headers in output file

white_list={} #create empty dictionary
for record in ref_list: #for every line in my reference list
    white_list[record["Sample_ID"]] = None #store into the dictionary the ID's as keys

for record in input_file: #for every line in my input file
    record_id = record["Sample_ID"] #store ID's into variable record_id
    if (record_id in white_list): #if the ID is in the reference list
        output_file.writerow(record) #write the entire row into a new file
    else: #if it is not in my reference list
        continue #ignore it and continue iterating through the file

Large text file to csv, can't open text file

I'm trying to convert this 3.1 GB text file from https://snap.stanford.edu/data/
into a CSV file. All the data is structured like:
name: something
age: something
gender: something
which makes it a pretty large text file with several million lines.
I have tried to write a Python script to convert it, but for some reason it won't read the lines in my for-each loop.
Here is the code:
import csv

def trycast(x):
    try:
        return float(x)
    except:
        try:
            return int(x)
        except:
            return x

cols = ['product_productId', 'review_userId', 'review_profileName', 'review_helpfulness', 'review_score', 'review_time', 'review_summary', 'review_text']

f = open("movies.txt", "wb")
w = csv.writer(f)
w.writerow(cols)

doc = {}
with open('movies.txt') as infile:
    for line in infile:
        line = line.strip()
        if line=="":
            w.writerow([doc.get(col) for col in cols])
            doc = {}
        else:
            idx = line.find(":")
            key, value = tuple([line[:idx], line[idx+1:]])
            key = key.strip().replace("/", "_").lower()
            value = value.strip()
            doc[key] = trycast(value)

f.close()
I'm not sure if it is because the document is too large; a regular notepad program isn't able to open it.
Thanks up front! :-)
In the line f = open("movies.txt", "wb") you're opening the file for writing, and thereby deleting all its content. Later on, you're trying to read from that same file. It probably works fine if you change the output filename. (I am not going to download 3.1 GB to test it. ;) )
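In other words, a version along these lines (keeping the question's Python 2 style, with movies.csv as an example output name and the trycast conversion omitted for brevity) should behave as intended:

import csv

cols = ['product_productId', 'review_userId', 'review_profileName', 'review_helpfulness',
        'review_score', 'review_time', 'review_summary', 'review_text']

# open a *different* file for the CSV output so movies.txt is only ever read, never truncated
with open('movies.csv', 'wb') as out, open('movies.txt') as infile:
    w = csv.writer(out)
    w.writerow(cols)
    doc = {}
    for line in infile:
        line = line.strip()
        if line == "":
            w.writerow([doc.get(col) for col in cols])  # one record per blank line
            doc = {}
        else:
            key, value = line.split(":", 1)
            doc[key.strip().replace("/", "_").lower()] = value.strip()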
