I have a folder with lots of .txt files. I want to merge all the .txt files into a single .csv file, line by line/row by row.
I have tried the following Python code. It works fine, but I have to change the .txt file name each time to add its content to the .csv as a row.
import re
import csv
from bs4 import BeautifulSoup

raw_html = open('/home/erdal/Dropbox/Marburg/LA/LT_CORPUS/fsdl.txt')
cleantext = BeautifulSoup(raw_html, "lxml").text
#print(cleantext)
print(re.sub('\s+', ' ', cleantext))

# appending to csv as row
row = [re.sub('\s+', ' ', cleantext)]
with open('LT_Corpus.csv', 'a') as csvFile:
    writer = csv.writer(csvFile)
    writer.writerow(row)
csvFile.close()
I'm hoping for a better and faster solution that automates the process without changing file names. Any recommendation is welcome.
Accessing a list of filenames
The following should get you closer to what you want.
import os will give you access to the os.listdir() function that lists all the files in a directory. You may need to provide the path to your data folder, if the data files are not in the same folder as your script.
This should look something like:
os.listdir('/home/erdal/Dropbox/Marburg/LA/LT_CORPUS/')
Using the filenames in that directory, you can then open each file individually by iterating over them with a for loop.
import re
import csv
from bs4 import BeautifulSoup
import os

filenames = os.listdir('/home/erdal/Dropbox/Marburg/LA/LT_CORPUS/')

for file in filenames:
    raw_html = open('/home/erdal/Dropbox/Marburg/LA/LT_CORPUS/' + file)
    cleantext = BeautifulSoup(raw_html, "lxml").text
    raw_html.close()                          # close the input file handle
    output = re.sub(r'\s+', ' ', cleantext)   # saved the result using a variable
    print(output)                             # the variable can be reused
    row = [output]                            # as needed, in different contexts
    with open('LT_Corpus.csv', 'a') as csvFile:
        writer = csv.writer(csvFile)
        writer.writerow(row)
A few other nuances: I removed the csvFile.close() call at the end. When you use a with context manager, the file is closed automatically as soon as you leave the indented block under the with statement. Having said that, there may be merit to opening the csv file once, keeping it open while you open and write each txt file in turn, and only closing the csv at the very end.
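A minimal sketch of that alternative, assuming the same folder path and output filename as above:

import re
import csv
import os
from bs4 import BeautifulSoup

folder = '/home/erdal/Dropbox/Marburg/LA/LT_CORPUS/'

# Open the csv once and keep it open while processing every txt file.
with open('LT_Corpus.csv', 'a') as csvFile:
    writer = csv.writer(csvFile)
    for filename in os.listdir(folder):
        if not filename.endswith('.txt'):
            continue  # skip non-txt files
        with open(os.path.join(folder, filename)) as raw_html:
            cleantext = BeautifulSoup(raw_html, "lxml").text
        writer.writerow([re.sub(r'\s+', ' ', cleantext)])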
Related
I've written a script in Python which fetches the titles of different posts from a webpage and writes them to a csv file. As the site updates its content very frequently, I'd like to prepend the new results to that csv file, which already contains a list of old titles.
I've tried with:
import csv
import time
import requests
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/questions/tagged/python"

def get_information(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    for title in soup.select(".summary .question-hyperlink"):
        yield title.text

if __name__ == '__main__':
    while True:
        with open("output.csv", "a", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(['posts'])
            for items in get_information(url):
                writer.writerow([items])
                print(items)
        time.sleep(300)
When run again, the above script appends the new results after the old ones.
The old data look like:
A
F
G
T
The new data are W, Q, U.
The csv file should look like this when I rerun the script:
W
Q
U
A
F
G
T
How can I prepend the new results to an existing csv file that already contains old data?
Inserting data anywhere in a file except at the end requires rewriting the whole thing. To do this without reading its entire contents into memory first, you could create a temporary csv file with the new data in it, append the data from the existing file to that, delete the old file and rename the new one.
Here's an example of what I mean (using a dummy get_information() function to simplify testing).
import csv
import os
from tempfile import NamedTemporaryFile

url = 'https://stackoverflow.com/questions/tagged/python'
csv_filepath = 'updated.csv'

# For testing, create an existing file.
if not os.path.exists(csv_filepath):
    with open(csv_filepath, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows([item] for item in 'AFGT')

# Dummy for testing.
def get_information(url):
    for item in 'WQU':
        yield item

if __name__ == '__main__':
    folder = os.path.abspath(os.path.dirname(csv_filepath))  # Get dir of existing file.
    with NamedTemporaryFile(mode='w', newline='', suffix='.csv',
                            dir=folder, delete=False) as newf:
        temp_filename = newf.name  # Save filename.
        # Put new data into the temporary file.
        writer = csv.writer(newf)
        for item in get_information(url):
            writer.writerow([item])
            print([item])
        # Append contents of existing file to new one.
        with open(csv_filepath, 'r', newline='') as oldf:
            reader = csv.reader(oldf)
            for row in reader:
                writer.writerow(row)
                print(row)

    os.remove(csv_filepath)                  # Delete old file.
    os.rename(temp_filename, csv_filepath)   # Rename temporary file.
Since you intend to change the position of every element of the table, you need to read the table into memory and rewrite the entire file, starting with the new elements.
You may find it easier to (1) write the new element to a new file, (2) open the old file and append its contents to the new file, and (3) move the new file to the original (old) file name.
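A rough sketch of the read-into-memory variant, using the same kind of dummy get_information() for testing and assuming the existing file is named output.csv:

import csv

csv_filepath = 'output.csv'  # assumed name of the existing file holding the old rows

# Dummy stand-in for the question's get_information(url) generator.
def get_information(url):
    for item in 'WQU':
        yield item

url = 'https://stackoverflow.com/questions/tagged/python'

# Read the old rows into memory, then rewrite the file with the new rows first.
with open(csv_filepath, 'r', newline='') as f:
    old_rows = list(csv.reader(f))

new_rows = [[item] for item in get_information(url)]

with open(csv_filepath, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(new_rows + old_rows)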
I have a python script for scraping some URLs. The URLs are in a list in a txt file.
The Python script (only the relevant parts) is as follows:
import urllib2
from bs4 import BeautifulSoup
quote_page = 'https://www.example.com/post/1245'
# rest of the code is here
print quote_page
print url
print title
print description
print actors
print director
I would like to run this script for multiple URLs in a txt file and output to a single txt file.
Any ideas how I can run this for the URLs in my txt file?
You will likely want to use the Python with statement (introduced in PEP 343) and the built-in open() function:
# Python 2
import urllib2
import BeautifulSoup

# Python 3
# import urllib.request
# from bs4 import BeautifulSoup

# Python 2.7+ and Python 3
with open('urls.txt', 'r') as url_file, open('output.txt', 'w') as output_file:
    url_list = url_file.readlines()
    for url_item in url_list:
        # quote_page = 'https://www.example.com/post/1245'
        quote_page = url_item

        # rest of the code is here

        # Python 2 and 3
        output_file.write(quote_page)
        output_file.write(url)
        output_file.write(title)
        output_file.write(description)
        output_file.write(actors)
        output_file.write(director)
        output_file.write('\n')
In this instance, we:
1. open() file handles (url_file, output_file) to our input and output text files ('urls.txt', 'output.txt') at the same time (using 'r' for reading and 'w' for writing, respectively).
2. Use the with statement to close these files automatically once we are done processing our URLs. Normally, we would need to issue separate url_file.close() and output_file.close() calls (see Step 5).
3. Put our URLs into a list (url_list = url_file.readlines()).
4. Loop through our URL list and write() the data we want to our output_file.
5. close() both of our files automatically (see Step 2).
Note that to simply add data to an existing output_file, you will probably wish to use 'a' (append mode) rather than 'w' (write mode). So e.g. open('output.txt', 'w') as output_file would become open('output.txt', 'a') as output_file. This is important because 'w' (write mode) will truncate the file if the file already exists (i.e. you will lose your original data).
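For example, a hypothetical sketch of the difference, with output.txt standing in for your real output file:

# 'a' keeps whatever an earlier run wrote to output.txt; 'w' would truncate it first.
with open('output.txt', 'a') as output_file:
    output_file.write('another line of scraped data\n')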
I'm having trouble storing data minus the header into a new file. I don't understand Python enough to debug.
Ultimately, I'd like to extract the data from each file and store it in one main csv file, rather than opening each file individually and copying and pasting everything into the main csv file by hand.
My code is as follows:
import csv, os

# os.makedirs() will create a folder with the given name
os.makedirs('HeaderRemoved', exist_ok=True)

# Loop through every file in the current working directory.
for csvFilename in os.listdir('directory'):
    if not csvFilename.endswith('.csv'):
        continue  # skips non-csv files
    print('Removing header from ' + csvFilename + '...')

    ### Read the CSV file in (skipping first row) ###
    csvRows = []
    csvFileObj = open(csvFilename)
    readerObj = csv.reader(csvFileObj)
    for row in readerObj:
        if readerObj.line_num == 1:
            continue  # skips first row
        csvRows.append(row)
    print(csvRows)  # -----------> Check to see if it has anything stored in the list
    csvFileObj.close()

    # Todo: Write out the CSV file
    csvFileObj = open(os.path.join('HeaderRemoved', 'directory/mainfile.csv'), 'w',
                      newline='')
    csvWriter = csv.writer(csvFileObj)
    for row in csvRows:
        csvWriter.writerow(row)
    csvFileObj.close()
The csv files that are being "scanned" or "read" have text and numbers. I do not know if this might be preventing the script from properly "reading" and storing the data into the csvRow array.
The problem comes from reusing the same variable when you loop over your file names. See the documentation for os.listdir(): it returns a list of filenames, so inside the loop newfile no longer points to the open file object but to a string filename from the directory.
https://docs.python.org/3/library/os.html#os.listdir
with open(scancsvFile, 'w') as newfile:
    array = []
    # for row in scancsvFile
    for newfile in os.listdir('directory'):  # <---- you're reassigning the variable newfile here
        if newfile.line_num == 1:
            continue
        array.append(lines)
    newfile.close()
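A minimal sketch of one way to avoid the clash, assuming the input CSVs live in 'directory' and the merged output should go to 'mainfile.csv' (both names are just placeholders here):

import csv
import os

source_dir = 'directory'      # placeholder for your folder of csv files
output_path = 'mainfile.csv'  # placeholder for the combined output file

with open(output_path, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for csv_name in os.listdir(source_dir):   # distinct loop variable
        if not csv_name.endswith('.csv'):
            continue
        with open(os.path.join(source_dir, csv_name), newline='') as infile:
            reader = csv.reader(infile)
            next(reader, None)                # skip the header row
            writer.writerows(reader)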
I want to combine several text files into one output file.
My original code downloads 100 text files, then each time filters the text by several words and writes it to the output file.
Here is the part of my code that is supposed to combine the new text with the output text. Each time, the result overwrites the output file, deleting the previous content and adding the new text.
import fileinput
import glob

urls = ['f1.txt', 'f2.txt', 'f3.txt']
N = 0
print "read files"
for url in urls:
    read_files = glob.glob(urls[N])
    with open("result.txt", "wb") as outfile:
        for f in read_files:
            with open(f, "rb") as infile:
                outfile.write(infile.read())
    N += 1
I also tried this:
import fileinput
import glob

urls = ['f1.txt', 'f2.txt', 'f3.txt']
N = 0
print "read files"
for url in urls:
    file_list = glob.glob(urls[N])
    with open('result-1.txt', 'w') as file:
        input_lines = fileinput.input(file_list)
        file.writelines(input_lines)
    N += 1
Are there any suggestions?
I need to concatenate/combine approximately 100 text files into one .txt file sequentially (each time, I read one file and append it to result.txt).
The problem is that you are re-opening the output file on each loop iteration which will cause it to overwrite -- unless you explicitly open it in append mode.
The glob logic is also unnecessary when you already know the filename.
Try this instead:
with open("result.txt", "wb") as outfile:
for url in urls:
with open(url, "rb") as infile:
outfile.write(infile.read())
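If you do prefer to open the output inside the loop, a sketch of the append-mode variant mentioned above (reusing the urls list from your question) would look like this:

for url in urls:
    with open(url, "rb") as infile:
        # "ab" appends instead of truncating, so earlier iterations survive.
        with open("result.txt", "ab") as outfile:
            outfile.write(infile.read())

Note that with append mode the file also keeps growing across separate runs of the script, so the single-open version above is usually cleaner.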
I am trying to create text files in bulk based on a list. A text file contains a number of lines/titles, and the aim is to create one text file per title. Below is what my titles.txt looks like, along with my non-working code and the expected output.
titles = open("C:\\Dropbox\\Python\\titles.txt", 'r')
for lines in titles.readlines():
    d_path = 'C:\\titles'
    output = open((d_path.lines.strip()) + '.txt', 'a')
    output.close()
titles.close()
titles.txt
Title-A
Title-B
Title-C
New blank files to be created under the directory c:\\titles\\:
Title-A.txt
Title-B.txt
Title-C.txt
It's a little difficult to tell what you're attempting here, but hopefully this will be helpful:
import os.path

with open('titles.txt') as f:
    for line in f:
        newfile = os.path.join('C:\\titles', line.strip()) + '.txt'
        ff = open(newfile, 'a')
        ff.close()
If you want to replace existing files with blank files, you can open your files with mode 'w' instead of 'a'.
The following should work.
import os

titles = 'C:/Dropbox/Python/titles.txt'
d_path = 'c:/titles'

with open(titles, 'r') as f:
    for l in f:
        # add the .txt extension so the files match the expected output
        with open(os.path.join(d_path, l.strip() + '.txt'), 'w') as _:
            pass