Reading all the text files from a directory - Python

I am new to Python and I am using the following code to run sentiment analysis:
import json
from watson_developer_cloud import ToneAnalyzerV3Beta
import urllib.request
import codecs
import csv
import os
import re
import sys
import collections
import glob

ipath = 'C:/TEMP/'  # input folder
opath = 'C:/TEMP/matrix/'  # output folder
reader = codecs.getreader("utf-8")

tone_analyzer = ToneAnalyzerV3Beta(
    url='https://gateway.watsonplatform.net/tone-analyzer/api',
    username='ABCID',
    password='ABCPASS',
    version='2016-02-11')

path = 'C:/TEMP/*.txt'
file = glob.glob(path)
text = file.read()
data = tone_analyzer.tone(text='text')

for cat in data['document_tone']['tone_categories']:
    print('Category:', cat['category_name'])
    for tone in cat['tones']:
        print('-', tone['tone_name'], tone['score'])
#create file
In the above code, all I am trying to do is read every text file stored in the C:/TEMP folder and run sentiment analysis on each one, but I keep getting the error: 'list' object has no attribute 'read'.
Not sure where I am going wrong, and I would really appreciate any help with this one. Also, is there a way I can write the output to a CSV file? So if I am reading the file ABC.txt, I would create an output CSV file called ABC.csv with the output values.
Thank you

glob returns a list of file names; you need to iterate over the list, open each file, and then call .read() on the file object:
files = glob.glob(path)

# iterate over the list getting each file
for fle in files:
    # open the file and then call .read() to get the text
    with open(fle) as f:
        text = f.read()
Not sure what exactly you want to write, but the csv lib will do it:
from csv import writer

files = glob.glob(path)

# iterate over the list getting each file
for fle in files:
    # open the input file, and open a csv named after it (ABC.txt -> ABC.csv)
    with open(fle) as f, open("{}.csv".format(fle.rsplit(".", 1)[0]), "w") as out:
        text = f.read()
        wr = writer(out)
        data = tone_analyzer.tone(text=text)  # pass the variable, not the string 'text'
        wr.writerow(["some", "column", "names"])  # write the col names
Then call writerow passing a list of whatever you want to write for each row.
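For example, a minimal sketch that writes one row per tone, using the same response structure as the print loop in your question (the column names are just placeholders):

wr.writerow(["category", "tone", "score"])  # header row
for cat in data['document_tone']['tone_categories']:
    for tone in cat['tones']:
        wr.writerow([cat['category_name'], tone['tone_name'], tone['score']])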

Related

Getting "Xref table not zero-indexed. ID numbers for objects will be corrected" warning

I have the following code (comments explain what is occurring):
import os
from io import StringIO
from PyPDF2 import PdfFileReader

# Path to the directory containing the PDF files
pdf_dir = '/path/to/pdf/files'

# Iterate over the files in the directory
for filename in os.listdir(pdf_dir):
    # Check if the file is a PDF file
    if filename.endswith('.pdf'):
        # Construct the full path to the file
        filepath = os.path.join(pdf_dir, filename)
        # Open the PDF file and read its contents
        with open(filepath, 'rb') as f:
            pdf = PdfFileReader(f)
            # Extract the text from the PDF file
            text = ''
            for page in pdf.pages:
                text += page.extractText()
        # Construct the name of the output text file
        txt_filename = filename[:-4] + '.txt'
        # Write the text to the output file
        with open(txt_filename, 'w') as f:
            f.write(text)
When I run the code, it produces an "Xref table not zero-indexed. ID numbers for objects will be corrected" warning. It is not a hard error, but it makes me wonder if there's a different way I should be doing this.
Thanks for any suggestions.
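If the extracted text looks correct, the warning is usually safe to ignore: it means PyPDF2 is repairing a slightly malformed cross-reference table inside the PDF itself, not that your code is wrong. A minimal sketch for silencing just that message, assuming it is issued through Python's standard warnings machinery (as it is in PyPDF2 1.x):

import warnings

# matches the start of the warning text quoted above;
# all other warnings still come through
warnings.filterwarnings("ignore", message="Xref table not zero-indexed")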

Merging multiple .txt files into a .csv file line by line

I have a folder with lots of .txt files. I want to merge all the .txt files into a single .csv file, line by line / row by row.
I have tried the following Python code; it works fine, but I have to change the .txt file name by hand to add each file's content as a .csv row.
import re
import csv
from bs4 import BeautifulSoup

raw_html = open('/home/erdal/Dropbox/Marburg/LA/LT_CORPUS/fsdl.txt')
cleantext = BeautifulSoup(raw_html, "lxml").text
#print(cleantext)
print(re.sub('\s+', ' ', cleantext))

#appending to csv as row
row = [re.sub('\s+', ' ', cleantext)]
with open('LT_Corpus.csv', 'a') as csvFile:
    writer = csv.writer(csvFile)
    writer.writerow(row)
csvFile.close()
I expect to see better and faster solutions for automating the process without having to change file names. Any recommendation is welcome.
The following should get you closer to what you want.
import os will give you access to the os.listdir() function that lists all the files in a directory. You may need to provide the path to your data folder, if the data files are not in the same folder as your script.
This should look something like:
os.listdir('/home/erdal/Dropbox/Marburg/LA/LT_CORPUS/')
Using all the filenames in that directory, you can then open each one individually, by parsing through them with a for loop.
import re
import csv
from bs4 import BeautifulSoup
import os

filenames = os.listdir('/home/erdal/Dropbox/Marburg/LA/LT_CORPUS/')

for file in filenames:
    raw_html = open('/home/erdal/Dropbox/Marburg/LA/LT_CORPUS/' + file)
    cleantext = BeautifulSoup(raw_html, "lxml").text
    output = re.sub('\s+', ' ', cleantext)  # saved the result using a variable
    print(output)                           # the variable can be reused
    row = [output]                          # as needed, in different contexts
    with open('LT_Corpus.csv', 'a') as csvFile:
        writer = csv.writer(csvFile)
        writer.writerow(row)
Several other nuances: I removed the csvFile.close() call at the end. When you use a with context manager, the file is closed automatically as soon as you leave the scope of the with block (i.e. the indented section below the with statement). Having said this, there may be merit to opening the csv file once, keeping it open while you open and write the txt files one by one, and only closing the csv at the very end.
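A minimal sketch of that variant, reusing the paths from the code above:

import re
import csv
import os
from bs4 import BeautifulSoup

src_dir = '/home/erdal/Dropbox/Marburg/LA/LT_CORPUS/'

# open the csv once and keep it open while looping over the txt files
with open('LT_Corpus.csv', 'a') as csvFile:
    writer = csv.writer(csvFile)
    for file in os.listdir(src_dir):
        with open(src_dir + file) as raw_html:
            cleantext = BeautifulSoup(raw_html, "lxml").text
        writer.writerow([re.sub(r'\s+', ' ', cleantext)])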

Loop through multiple json input files

I have over 200 scraped files in JSON format and I want to analyse them. I can open them individually, but I would like to loop through them to save time, as I will be doing this a lot.
I can open each file, but I want to be able to do it in a loop of some form,
e.g.:

with codecs.open('c:\\project\\input*.json', 'r', 'utf-8') as f:

where '*' is a number.

import codecs, json, csv, re

#read a json file downloaded with twitterscraper
with codecs.open('c:\\project\\input1.json', 'r', 'utf-8') as f:
    tweets = json.load(f, encoding='utf-b')
Just put your files into a folder and then loop through the files in the folder like so:

import codecs
import json
import csv
import re
import os

files = []
for file in os.listdir("/mydir"):
    if file.endswith(".json"):
        files.append(os.path.join("/mydir", file))

for file in files:
    with codecs.open(file, 'r', 'utf-8') as f:
        tweets = json.load(f)  # codecs.open already decodes; 'utf-b' in your code looks like a typo for 'utf-8'
Add, and use, glob to iterate over files matching a certain file pattern:

import glob
import codecs
import json
# ... more packages here

for file in glob.glob('c:\\project\\input*.json'):
    with codecs.open(file, 'r', 'utf-8') as f:
        tweets = json.load(f)
        #... whatever you do next with `tweets`

BTW: utf-b instead of utf-8? That looks like a typo. Since codecs.open already decodes the file as utf-8, the encoding argument to json.load is unnecessary anyway (and newer Python 3 versions reject it), so it is dropped here.
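If the goal is to analyse all 200 files together, a small extension of the same loop collects everything into one list first (assuming each file contains a JSON list of tweets, as twitterscraper writes):

import glob
import codecs
import json

all_tweets = []
for file in glob.glob('c:\\project\\input*.json'):
    with codecs.open(file, 'r', 'utf-8') as f:
        all_tweets.extend(json.load(f))  # combine the lists across files

print(len(all_tweets))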

Writing to a docx file from a txt file in Python

I've been trying to make my Python code fill a form in Word with data that I scraped off the Internet. I wrote the data to a txt file and am now trying to fill the Word file with this code:
import zipfile
import os
import tempfile
import shutil
import codecs

def getXml(docxFilename, ReplaceText):
    zip = zipfile.ZipFile(open(docxFilename, "rb"))
    xmlString = zip.read("word/document.xml")
    for key in ReplaceText.keys():
        xmlString = xmlString.replace(str(key), str(ReplaceText.get(key)))
    return xmlString

def createNewDocx(originalDocx, xmlString, newFilename):
    tmpDir = tempfile.mkdtemp()
    zip = zipfile.ZipFile(open(originalDocx, "rb"))
    zip.extractall(tmpDir)
    #3tmpDir=tmpDir.decode("utf-8")
    with open(os.path.join(tmpDir, "word/document.xml"), "w") as f:
        f.write(xmlString)
    filenames = zip.namelist()
    zipCopyFilename = newFilename
    with zipfile.ZipFile(zipCopyFilename, "w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmpDir, filename), filename)
    shutil.rmtree(tmpDir)

f = open('test.txt', 'r')
text = f.read().split("\n")
print text[1]
Pavarde = text[1]
Replace = {"PAVARDE1": Pavarde}
createNewDocx("test.docx", getXml("test.docx", Replace), "test2.docx")
The file is created but I can't open it.
I get the following error:
Illegal xml character
My guess would be that there's something wrong with the encoding, but I can't find a solution.
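A hedged sketch of one likely fix, assuming the problem is that test.txt contains non-ASCII characters (this looks like Python 2, where f.read() returns raw bytes): if those bytes are not valid UTF-8, splicing them into document.xml, which declares UTF-8, produces exactly this kind of illegal-character complaint. Decoding the text file explicitly and re-encoding the replacement as UTF-8 keeps the XML consistent; the 'utf-8' source encoding below is an assumption, use whatever your scraper actually wrote:

import codecs

# decode the scraped data explicitly instead of reading raw bytes
with codecs.open('test.txt', 'r', 'utf-8') as f:
    text = f.read().split("\n")

Pavarde = text[1].encode('utf-8')  # valid UTF-8 bytes for the XML

Replace = {"PAVARDE1": Pavarde}
createNewDocx("test.docx", getXml("test.docx", Replace), "test2.docx")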

How to create a csv file in Python, and export (put) it to some local directory

This problem may be tricky.
I want to create a csv file from a list in Python. The csv file does not exist beforehand, and there is no such file in the local directory either; I just want to create a new csv file and put it in some local directory.
I found that StringIO.StringIO can generate csv content from a list in Python, but what are the next steps?
Thank you.
And I found the following code can do it:
import os
import os.path
import StringIO
import csv

dir = r"C:\Python27"
if not os.path.exists(dir):
    os.mkdir(dir)

my_list = [[1, 2, 3], [4, 5, 6]]

with open(os.path.join(dir, "filename" + '.csv'), "w") as f:
    csvfile = StringIO.StringIO()
    csvwriter = csv.writer(csvfile)
    for l in my_list:
        csvwriter.writerow(l)
    for a in csvfile.getvalue():
        f.writelines(a)
Did you read the docs?
https://docs.python.org/2/library/csv.html
Lots of examples on that page of how to read / write CSV files.
One of them:
import csv
with open('some.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(someiterable)
import csv
with open('/path/to/location', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(youriterable)
https://docs.python.org/2/library/csv.html#examples
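Putting it together for the list in the question, a minimal sketch (Python 2, to match the snippets above): the StringIO detour is unnecessary, since csv.writer can write straight to the file.

import os
import csv

out_dir = r"C:\Python27"          # target directory
if not os.path.exists(out_dir):   # create it if it does not exist yet
    os.mkdir(out_dir)

my_list = [[1, 2, 3], [4, 5, 6]]

with open(os.path.join(out_dir, "filename.csv"), "wb") as f:
    csv.writer(f).writerows(my_list)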
