I'm trying to save files from URLs into a folder on my computer, but I have 732 URLs (each of which, when fetched, returns a set of experimental data) in a list. I'm trying to run a for loop over all those URLs to save each data set into its own file. This is what I'm doing right now:
for i in ExperimentURLs:
    myurl123 = str(i)
    myreq = urllib.request.urlopen(myurl123)
    mydata = myreq.read()

    with open('/Users/lauren/Desktop/IDData/file', 'wb') as ofile:
        ofile.write(mydata)
ExperimentURLs is my list of URLs, but I don't know how to handle the for loop to save each data set into a new file. Currently, this code only writes a single experiment's data into a file and stops there. If I try to save it to a different file name, it takes a different experiment's data and saves that to the file. Help?
First, you need to automatically generate a new output file name every time through the loop. I'll give you the trivial version below. Also, note that the URLs are already strings; you don't have to convert them.
pos = 0
for myurl123 in ExperimentURLs:
    myreq = urllib.request.urlopen(myurl123)
    mydata = myreq.read()

    out_file = '/Users/lauren/Desktop/IDData/file' + str(pos)
    with open(out_file, 'wb') as ofile:
        ofile.write(mydata)

    pos += 1
Does that solve your problem?
BTW, you can do the two iterations in parallel with
for i, myurl123 in enumerate(ExperimentURLs):
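Putting it together, a minimal sketch of the loop using enumerate (assuming ExperimentURLs is already your list of URL strings) could look like this:

import urllib.request

# a minimal sketch: pos is the running index, myurl123 the URL itself
for pos, myurl123 in enumerate(ExperimentURLs):
    mydata = urllib.request.urlopen(myurl123).read()
    out_file = '/Users/lauren/Desktop/IDData/file' + str(pos)
    with open(out_file, 'wb') as ofile:
        ofile.write(mydata)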
Your mistake is simply at the point where you write the files; it's not that the for loop isn't working. You are writing to the same file again and again. Here is a modified version using requests. All you need to do is change the file name each time you save.
import requests

ExperimentURLs = [
    "https://www.google.com",
    "https://www.yahoo.com"
]

counter = 0
for i in ExperimentURLs:
    myurl123 = str(i)
    r = requests.get(myurl123)
    mydata = r.text.encode('utf-8').strip()

    fileName = counter
    # open in binary mode, since mydata is bytes after encode()
    with open("results/" + str(fileName) + ".html", 'wb') as ofile:
        ofile.write(mydata)
    counter += 1
I want to extract the text between {textblock_content} and {/textblock_content}.
With the script below, only the 1st line of the introtext.txt file gets extracted and written to a newly created text file. I don't know why the script does not also extract the other lines of introtext.txt.
f = open("introtext.txt")
r = open("textcontent.txt", "w")
for l in f.readlines():
if "{textblock_content}" in l:
pos_text_begin = l.find("{textblock_content}") + 19
pos_text_end = l.find("{/textblock_content}")
text = l[pos_text_begin:pos_text_end]
r.write(text)
f.close()
r.close()
How to solve this problem?
Your code is actually working fine, assuming each line contains both the begin and end tag. But I think this is not what you want: it can't read multiple blocks from one line, and it can't read a block that starts and ends on different lines.
First of all, take a look at the object returned by the open function. Its read method gives you access to the whole text at once. Also take a look at with statements; they make working with files easier and safer. To rewrite your code so that it reads everything between {textblock_content} and {/textblock_content}, we could write something like this:
def get_all_tags_content(
    text: str,
    tag_begin: str = "{textblock_content}",
    tag_end: str = "{/textblock_content}"
) -> list[str]:
    useful_text = text
    ans = []
    # Heavy loop, needs some optimization
    # Works in O(len(text) ** 2); we can do better
    while tag_begin in useful_text:
        useful_text = useful_text.split(tag_begin, 1)[1]
        if tag_end not in useful_text:
            break
        block_content, useful_text = useful_text.split(tag_end, 1)
        ans.append(block_content)
    return ans

with open("introtext.txt", "r") as f:
    with open("textcontent.txt", "w+") as r:
        r.write(str(get_all_tags_content(f.read())))
Try to rewrite this function efficiently, so that it can handle really big files. In this implementation the remaining text is copied every time a content block is found; that isn't necessary and it slows the program down (imagine millions of lines, each containing {textblock_content}"hello world"{/textblock_content}: for every line we copy the whole remaining text just to continue). A single pass over the text, without copying, avoids that. Try to solve it yourself.
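For reference, one possible single-pass sketch uses str.find with a moving start index instead of repeated splitting (the function name here is just illustrative):

def get_all_tags_content_fast(text, tag_begin="{textblock_content}", tag_end="{/textblock_content}"):
    ans = []
    pos = 0
    while True:
        # find the next begin tag starting from the current position
        start = text.find(tag_begin, pos)
        if start == -1:
            break
        start += len(tag_begin)
        # find the matching end tag after it
        end = text.find(tag_end, start)
        if end == -1:
            break
        ans.append(text[start:end])
        # continue scanning after the end tag, without copying the text
        pos = end + len(tag_end)
    return ans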
When you call file.readlines(), the file pointer moves to the end of the file, so any further call returns an empty list. If you change your code to something like one of the snippets below, it should work properly:
f = open("introtext.txt")
r = open("textcontent.txt", "w")
f_lines = f.readlines()
for l in f_lines:
if "{textblock_content}" in l:
pos_text_begin = l.find("{textblock_content}") + 19
pos_text_end = l.find("{/textblock_content}")
text = l[pos_text_begin:pos_text_end]
r.write(text)
f.close()
r.close()
Also, you can implement it with a with context manager, as in the snippet below:
with open("textcontent.txt", "w") as r:
with open("introtext.txt") as f:
for line in f:
if "{textblock_content}" in l:
pos_text_begin = l.find("{textblock_content}") + 19
pos_text_end = l.find("{/textblock_content}")
text = l[pos_text_begin:pos_text_end]
r.write(text)
I'm trying to implement a nested "for" loop to search CSV files, so that a 'name' found in one CSV file is searched for in the other file. Here is a code example:
import csv
import re

# Open the input file
with open("Authentication.csv", "r") as citiAuthen:
    with open("Authorization.csv", "r") as citiAuthor:
        # Set up CSV reader and process the header
        csvAuthen = csv.reader(citiAuthen, quoting=csv.QUOTE_ALL, skipinitialspace=True)
        headerAuthen = next(csvAuthen)
        userIndex = headerAuthen.index("'USERNAME'")
        statusIndex = headerAuthen.index("'STATUS'")

        csvAuthor = csv.reader(citiAuthor)
        headerAuthor = next(csvAuthor)
        userAuthorIndex = headerAuthor.index("'USERNAME'")
        iseAuthorIndex = headerAuthor.index("'ISE_NODE'")

        # Make an empty list
        userList = []
        usrNumber = 0

        # Loop through the authen file and build a list of
        for row in csvAuthen:
            user = row[userIndex]
            #status = row[statusIndex]
            #if status == "'Pass'" :
            for rowAuthor in csvAuthor:
                userAuthor = rowAuthor[userAuthorIndex]
                print userAuthor
What is happening is that "print userAuthor" makes just one pass, while it should make as many passes as there are rows in csvAuthen.
What am I doing wrong? Any help is really appreciated.
You're reading both files line by line from storage. When you search csvAuthor the first time, the file pointer is left at the end of the file after the search, so the next search starts at the end of the file and returns immediately. You would need to reset the file pointer to the beginning of the file before each search. It's probably better just to read both files into memory before you start searching them.
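A minimal sketch of that idea, showing only the nested-loop part of your code (the equality check is an assumption about what you want to do with a found user):

# read all authorization rows into memory once, so they can be scanned repeatedly
authorRows = list(csvAuthor)

for row in csvAuthen:
    user = row[userIndex]
    for rowAuthor in authorRows:
        userAuthor = rowAuthor[userAuthorIndex]
        if userAuthor == user:
            print(userAuthor)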
I am a beginner in the programming world and would like some tips on how to solve a challenge.
Right now I have ~10 000 .dat files each with a single line following this structure:
Attribute1=Value&Attribute2=Value&Attribute3=Value...AttibuteN=Value
I have been trying to use python and the CSV library to convert these .dat files into a single .csv file.
So far I was able to write something that reads all the files, stores the contents of each file on a new line, and substitutes the "&" with ",". But since Attribute1, Attribute2...AttributeN are exactly the same for every file, I would like to turn them into column headers and remove them from every other line.
Any tips on how to go about that?
Thank you!
Since you are a beginner, I prepared some code that works, and is at the same time very easy to understand.
I assume that you have all the files in the folder called 'input'. The code beneath should be in a script file next to the folder.
Keep in mind that this code should be used to understand how a problem like this can be solved. Optimisations and sanity checks have been left out intentionally.
You might want to check additionally what happens when a value is missing in some line, what happens when an attribute is missing, what happens with a corrupted input etc.. :)
Good luck!
import os

# this function splits the attribute=value pairs into two lists
# the first list holds all the attributes
# the second list holds all the values
def getAttributesAndValues(line):
    attributes = []
    values = []
    # first we split the input over the &
    attributeValues = line.split('&')
    for attrVal in attributeValues:
        # we split the attribute=value over the '=' sign
        # the left part goes to split[0], the value goes to split[1]
        split = attrVal.split('=')
        attributes.append(split[0])
        values.append(split[1])
    # return the attributes list and values list
    return attributes, values

# test the function using the line beneath so you understand how it works
# line = "Attribute1=Value&Attribute2=Value&Attribute3=Value&AttibuteN=Value"
# print(getAttributesAndValues(line))

# this function appends a single input file to the output csv file
def writeToCsv(inFile='', wfile="outFile.csv", delim=","):
    with open(inFile, 'r') as f_in, open(wfile, 'a+') as f_out:
        # write the header only if the output file is still empty
        writeHeader = os.path.getsize(wfile) == 0
        # loop through every line in the file and write its values
        for line in f_in:
            header, values = getAttributesAndValues(line.strip())
            # we write the header only if the file is empty
            if writeHeader:
                f_out.write(delim.join(header) + "\n")
                writeHeader = False
            # we write the values
            f_out.write(delim.join(values) + "\n")

# Read all the files in the input folder
allInputFiles = os.listdir('input/')

# loop through all the files and write values to the csv file
for singleFile in allInputFiles:
    writeToCsv('input/' + singleFile)
but since the Attribute1,Attribute2...AttributeN are exactly the same for every file, I would like to make them into column headers and remove them from every other line.
input = 'Attribute1=Value1&Attribute2=Value2&Attribute3=Value3'
once, for the first file:
','.join(k for (k,v) in map(lambda s: s.split('='), input.split('&')))
for each file's content:
','.join(v for (k,v) in map(lambda s: s.split('='), input.split('&')))
Maybe you need to trim the strings additionally; don't know how clean your input is.
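For instance, a minimal sketch combining the two expressions into a full script (assuming dat_paths is a list of your .dat file paths; the name is hypothetical) could look like this:

with open('output.csv', 'w') as out:
    for i, path in enumerate(dat_paths):
        with open(path) as f:
            content = f.read().strip()
        # split into (attribute, value) pairs
        pairs = [s.split('=') for s in content.split('&')]
        if i == 0:
            # write the attribute names once, as the header row
            out.write(','.join(k for k, v in pairs) + '\n')
        # write only the values for every file
        out.write(','.join(v for k, v in pairs) + '\n')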
Put the dat files in a folder called myDats. Put this script next to the myDats folder along with a file called temp.txt. You will also need your output.csv. [That is, you will have output.csv, myDats, and mergeDats.py in the same folder]
mergeDats.py
import csv
import os

# collect every attribute=value pair from all .dat files into temp.txt,
# one pair per line
g = open("temp.txt", "w")
for file in os.listdir('myDats'):
    f = open("myDats/" + file, "r")
    tempData = f.readlines()[0].strip()
    tempData = tempData.replace("&", "\n")
    g.write(tempData + "\n")
    f.close()
g.close()

# read the pairs back and build a dictionary of attribute -> value
h = open("temp.txt", "r")
arr = h.read().split("\n")
h.close()

my_dict = {}
for x in arr:
    if not x:
        continue  # skip empty lines
    temp2 = x.split("=")
    my_dict[temp2[0]] = temp2[1]

# write the attributes as the header row and the values as a row beneath it
with open('output.csv', 'w') as output:  # use 'wb' in Python 2.x
    w = csv.DictWriter(output, my_dict.keys())
    w.writeheader()
    w.writerow(my_dict)
I am currently in some trouble regarding Python and reading files. I have to open a file in a while loop and do some things with the values from the file. The results are written into a new file. This new file is then read in the next run of the while loop. But in this second run I get no values out of this file... Here is a code snippet that hopefully clarifies what I mean.
while convergence == 0:
    run += 1
    prevrun = run - 1
    if os.path.isfile("./Output/temp/EmissionMat%d.txt" %prevrun) == True:
        matfile = open("./Output/temp/EmissionMat%d.txt" %prevrun, "r")
        EmissionMat = Aux_Functions.EmissionMat(matfile)
        matfile.close()
    else:
        matfile = open("./Input/EmissionMat.txt", "r")
        EmissionMat = Aux_Functions.EmissionMat(matfile)
        matfile.close()
    # now some valid operations, which produce a matrix
    emissionmat_file = open("./output/temp/EmissionMat%d.txt" %run, "w")
    emissionmat_file.flush()
    emissionmat_file.write(str(matrix))
    emissionmat_file.close()
Solved it!
matfile.seek(0)
This resets the pointer to the beginning of the file and allows me to read the file correctly in the next run.
Why write to a file and then read it back? Moreover, you use flush, so you are doing potentially long IO. I would do
with open(originalpath) as f:
    mat = f.read()

while condition:
    run += 1
    write_mat_run(mat, run)
    mat = func(mat)
write_mat_run may be done in another thread. You should check io exceptions.
BTW this will probably solve your bug, or at least make it clear.
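Here write_mat_run and func are placeholders for your own code; a minimal sketch of write_mat_run, assuming mat is the string form of your matrix and using the path from your question, might be:

def write_mat_run(mat, run):
    # write the matrix of the current run to its own file
    with open("./Output/temp/EmissionMat%d.txt" % run, "w") as f:
        f.write(str(mat))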
I can see nothing wrong with your code. The following concrete example worked on my Linux machine:
import os

run = 0
while run < 10:
    run += 1
    prevrun = run - 1
    if os.path.isfile("output%d.txt" %prevrun):
        matfile = open("output%d.txt" %prevrun, "r")
        data = matfile.readlines()
        matfile.close()
    else:
        matfile = open("input.txt", "r")
        data = matfile.readlines()
        matfile.close()
    data = [ s[:-1] + "!\n" for s in data ]
    emissionmat_file = open("output%d.txt" %run, "w")
    emissionmat_file.writelines(data)
    emissionmat_file.close()
It adds an exclamation mark to each line in the file input.txt.
I solved it.
Before closing the file I do
matfile.seek(0)
This solved my problem. This method sets the pointer of the reader back to the beginning of the file.
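For illustration, a short sketch of what seek(0) does with a file object (example.txt is just an illustrative file name):

with open("example.txt") as f:
    first_pass = f.read()    # the file pointer is now at the end of the file
    f.seek(0)                # move the pointer back to the beginning
    second_pass = f.read()   # reads the full contents again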
I am writing a Python script and I just need the second line of a series of very small text files. I would like to extract this without saving each file to my hard drive, as I currently do.
I have found a few threads that reference the TempFile and StringIO modules but I was unable to make much sense of them.
Currently I download all of the files, name them sequentially like 1.txt, 2.txt, etc., then go through all of them and extract the second line. I would like to open a file, grab the line, then move on to finding, opening, and reading the next file.
Here is what I do currently with writing it to my HDD:
while (count4 <= num_files):
    file_p = [directory, str(count4), '.txt']
    file_path = ''.join(file_p)

    cand_summary = string.strip(linecache.getline(file_path, 2))

    linkFile = open('Summary.txt', 'a')
    linkFile.write(cand_summary)
    linkFile.write("\n")
    count4 = count4 + 1
    linkFile.close()
Just replace the file writing with a call to append() on a list. For example:
summary = []
while (count4 <= num_files):
    file_p = [directory, str(count4), '.txt']
    file_path = ''.join(file_p)
    cand_summary = string.strip(linecache.getline(file_path, 2))
    summary.append(cand_summary)
    count4 = count4 + 1
As an aside you would normally write count += 1. Also it looks like count4 uses 1-based indexing. That seems pretty unusual for Python.
You open and close the output file in every iteration.
Why not simply do
with open("Summary.txt", "w") as linkfile:
while (count4 <= num_files):
file_p = [directory,str(count4),'.txt']
file_path = ''.join(file_p)
cand_summary = linecache.getline(file_path, 2).strip() # string module is deprecated
linkFile.write(cand_summary)
linkFile.write("\n")
count4 = count4 + 1
Also, linecache is probably not the right tool here since it's optimized for reading multiple lines from the same file, not the same line from multiple files.
Instead, better do
with open(file_path, "r") as infile:
dummy = infile.readline()
cand_summary = infile.readline.strip()
Also, if you drop the strip() method, you don't have to re-add the \n, but who knows why you have that in there. Perhaps .lstrip() would be better?
Finally, what's with the manual while loop? Why not use a for loop?
Lastly, after your comment, I understand you want to put the result in a list instead of a file. OK.
All in all:
summary = []
for count in xrange(num_files):
    file_p = [directory, str(count), '.txt']  # or count+1, if you start at 1
    file_path = ''.join(file_p)
    with open(file_path, "r") as infile:
        dummy = infile.readline()
        cand_summary = infile.readline().strip()
    summary.append(cand_summary)