This is likely a fundamental Python question, but I'm stumped (still learning). My script uses Pandas to create txt files from csv cells, and it works properly. However, I'd like to write the files to a specific directory, listed as save_path below, and my efforts to put this together keep running into errors.
Here's my (not) working code:
import os
import pandas as pd
save_path = r"C:\users\name\folder\txts"
df = pd.read_csv(r"C:\users\name\folder\test.csv", sep=",")
df2 = df.fillna('')
for index in range(len(df)):
    with open(df2["text_number"][index] + '.txt', 'w') as output:
        output2 = os.path.join(save_path, output)  # I'm uncertain how to structure or place the os.path.join command.
        output2.write(df2["text"][index])
The resulting error is below:
TypeError: join() argument must be str, bytes, or os.PathLike object, not 'TextIOWrapper'
Thoughts? Any assistance is greatly appreciated.
You need to generate the file name first, and then open that full path in write mode to write the contents.
for index in range(len(df)):
    # create the file name
    filename = df2["text_number"][index] + '.txt'
    # then generate the full path using the os lib
    full_path = os.path.join(save_path, filename)
    # now open that file; 'w+' opens it for reading and writing, creating it if it doesn't exist
    with open(full_path, 'w+') as output_file_handler:
        # and write the contents
        output_file_handler.write(df2["text"][index])
This should work.
(But you might want to check out this answer)
for index in range(len(df)):
    filename = df2["text_number"][index] + '.txt'
    fp = os.path.join(save_path, filename)
    with open(fp, 'w') as output:
        output.write(df2["text"][index])
I have a script which does the following: If there is a file that ends with 'Kobo.xlsx' in the same directory as the script, it reads the file, makes some changes to it, defines a 'new_filename' from the name of the file, and spits out a new .xlsx with the new filename. Here is the code:
## Get path of Script ##
abspath = os.path.abspath(__file__)
dname = os.path.dirname(abspath)
os.chdir(dname)

files = os.listdir(dname)
for file in files:
    if file.endswith('Kobo.xlsx'):
        ## Define New Filename ##
        filename = os.path.basename(file)
        size = len(filename)
        new_filename = filename[:size - 9] + "Sorted.xlsx"
        ## Pandas Things ... ##
    #### Exporting ####
    writer = pd.ExcelWriter(new_filename, engine='xlsxwriter')
    writer.save()
Sometimes when I run the script, I get the following error:
writer = pd.ExcelWriter(new_filename, engine='xlsxwriter')
NameError: name 'new_filename' is not defined
Sometimes when I run the script, I don't. It seems the error only comes up for certain filenames (different directories don't matter). However, If I go through the command prompt to get the new filename, it works in both cases (for the filename that the script worked with, and that the script didn't work with).
How do I avoid the error indefinitely? Thanks in advance - sorry if it's a silly question.
You don't want to do the write unless you already did the read, so those last two lines need to be indented to be part of the if statement. As written, if the very first file the loop examines does not end in Kobo.xlsx, you'll still try to do the write, but the variable new_filename was never created, hence the NameError.
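A minimal sketch of the indentation fix, using a hypothetical file list in place of os.listdir (the pandas export lines are shown as comments where they would go):

```python
files = ['survey Kobo.xlsx', 'notes.txt']  # hypothetical directory listing

for file in files:
    if file.endswith('Kobo.xlsx'):
        # everything that depends on this file stays inside the if
        new_filename = file[:-len('Kobo.xlsx')] + 'Sorted.xlsx'
        # writer = pd.ExcelWriter(new_filename, engine='xlsxwriter')
        # writer.save()
        print(new_filename)  # prints survey Sorted.xlsx
```

Because the export sits inside the if block, it can never run before new_filename has been assigned.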
My first post on StackOverflow, so please be nice. In other words, a super beginner to Python.
So I want to read multiple files from a folder, divide the text and save the output as a new file. I currently have figured out this part of the code, but it only works on one file at a time. I have tried googling but can't figure out a way to use this code on multiple text files in a folder and save it as "output" + a number, for each file in the folder. Is this something that's doable?
with open("file_path") as fReader:
    corpus = fReader.read()
    loc = corpus.find("\n\n")
    print(corpus[:loc], file=open("output.txt", "a"))
Possibly work with a list, like:
from pathlib import Path

source_dir = Path("./")  # path to the directory
files = [x for x in source_dir.iterdir() if x.is_file()]

for i in range(len(files)):
    file = Path(files[i])
    outfile = "output_" + str(i) + file.suffix
    with open(file) as fReader, open(outfile, "w") as fOut:
        corpus = fReader.read()
        loc = corpus.find("\n\n")
        fOut.write(corpus[:loc])
Welcome to the site. Yes, what you are asking above is completely doable, and you are on the right track. You will need to do a little research/practice with the os module, which is highly useful when working with files. The two commands you will want to research a bit are:
os.path.join()
os.listdir()
I would suggest you put two folders alongside your Python file, one called data and the other called output to catch the results. Start by seeing if you can just make the code list all the files in your data directory, and keep building up that loop. Something like this should list all the files:
# folder file lister/test writer
import os

source_folder_name = 'data'    # the folder to be read, in the SAME directory as this file
output_folder_name = 'output'  # will be used later...

files = os.listdir(source_folder_name)

# get this working first
for f in files:
    print(f)

# make output file names and just write a 1-liner into each file...
for f in files:
    output_filename = f.split('.')[0]  # the part before the period
    output_filename += '_output.csv'
    output_path = os.path.join(output_folder_name, output_filename)
    with open(output_path, 'w') as writer:
        writer.write('some data')
I am making what I suspect to be a very silly error here, but the vast majority of what I've found online talks about reading multiple files into a single dataframe or outputting results into a single file, which is not my goal here.
Aim: read hundreds of CSV files one by one, filter each one, and output the result to a file that carries the original file's name (e.g. "Processed_<original_file>.csv"), then move on to the next file in the loop, read and filter that, put its results in a new output file, and so on.
Problem: I either run into a problem where only a single result file is produced (from the last file read in the loop), or, if I use the code below after reading various SO pages, I get an invalid argument error.
Error: OSError: [Errno 22] Invalid argument: 'c:/users/my Directory/sourceFiles\Processed_c:/users/my Directory/sourceFiles\files1.csv'
I know I'm getting my loop and renaming wrong at the moment, but I can't figure out how to do this without loading ALL my csvs into a single dataframe using a list and concat and outputting everything into a single result file (which is not my aim) --- I want to output each result into an individual file that shares the name of the original file.
Ideally, given the size and number of files involved (700+, each 400 MB), I'd rather use Pandas, as that seems to be more efficient from what I've learnt so far.
import pandas as pd
import glob
import os

path = "c:/users/my Directory/"
csvFiles = glob.glob(path + "/sourceFiles/files*")
for files in csvFiles:
    df = pd.read_csv(files, index_col=None, encoding='Latin-1', engine='python',
                     error_bad_lines=False)
    df_f = df[df.iloc[:, 2] == "Office"]
    filepath = os.path.join(path, 'Processed_' + str(files) + '.csv')
    df_f.to_csv(filepath)
The error message is nice because it shows you exactly what is wrong: the output filename is invalid because the c:/users/... prefix appears twice, concatenated together. That happens because files already holds the full path, not just the file name.
Try something with os.path.basename() to strip the path and os.path.splitext() to strip the extension:
fileout = path + '\\' + 'Processed_' + os.path.splitext(os.path.basename(files))[0] + '.csv'
And most importantly, test it with a couple of print statements to see whether your ins and outs are what you expect. Just comment out the analysis lines:
import pandas as pd
import glob
import os

path = "c:/users/my Directory/"
csvFiles = glob.glob(path + "/sourceFiles/files*")
for files in csvFiles:
    print(files)
    # df = pd.read_csv(files, index_col=None, encoding='Latin-1', engine='python',
    #                  error_bad_lines=False)
    # df_f = df[df.iloc[:, 2] == "Office"]
    filepath = path + '\\' + 'Processed_' + os.path.splitext(os.path.basename(files))[0] + '.csv'
    print(filepath)
    # df_f.to_csv(filepath)
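For what it's worth, the same filename surgery can be written with os.path.join so the separators stay consistent; the sample path below is hypothetical, standing in for one entry of the glob result:

```python
import os

path = "c:/users/my Directory/"
files = "c:/users/my Directory/sourceFiles/files1.csv"  # hypothetical glob entry

name = os.path.splitext(os.path.basename(files))[0]    # just 'files1'
filepath = os.path.join(path, 'Processed_' + name + '.csv')
print(filepath)
```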
I am trying to move multiple images from one folder to another using shutil.move(). I have saved the image names in a CSV file,
ex: [img1, img25, img55, ...]
I have tried the code below:
import pandas as pd
import shutil

cop_folder = 'path to folder containing images'
destination_folder = 'path where I want to move the images'

df = pd.read_csv('', header=None)
for i in df:
    if i in cop_folder:
        shutil.move(i, dest_folder)
    else:
        print('fail')
TypeError: 'in <string>' requires string as left operand, not int
Try this approach:
import pandas as pd
import os

def move_file(old_file_path, new_directory):
    if not os.path.isdir(new_directory):
        os.mkdir(new_directory)
    base_name = os.path.basename(old_file_path)
    new_file_path = os.path.join(new_directory, base_name)
    # Deletes a file if that file already exists there; you can change this behavior
    if os.path.exists(new_file_path):
        os.remove(new_file_path)
    os.rename(old_file_path, new_file_path)

cop_folder = 'origin-folder\\'
destination_folder = 'dest_folder\\'

df = pd.read_csv('files.csv', header=None)
for i in df[0]:
    filename = os.path.join(cop_folder, i)
    move_file(filename, destination_folder)
The file names inside the csv must have an extension. If they don't, then you should use filename = os.path.join(cop_folder, i + '.jpg')
There are a few issues here. Firstly, you are iterating over a DataFrame, which yields the column labels, not the values; that's what's causing the error you posted. If you really want to use pandas just to import a CSV, you could change it to for i in df.iterrows(), but even then it won't simply return the file name; it will return an (index, Series) tuple. You'd probably be better off using the standard csv module to read the CSV. That way your filenames will be read in as a list of plain strings and will behave as you intended.
Secondly, unless there is something else going on in your code, you can't look for files in a folder using the in keyword; you'll need to construct a full file path by joining the folder path and the filename.
I'm trying to come up with a way for the filenames that I'm reading to have the same filename as what I'm writing. The code is currently reading the images and doing some processing. My output will be extracting the data from that process into a csv file. I want both the filenames to be the same. I've come across fname for matching, but that's for existing files.
So if your input file name is infile = 'myfile.jpg', do this:
my_outfile = "".join(infile.split('.')[:-1]) + '.csv'
This splits infile into a list of parts that are separated by '.'. It then puts them back together minus the last part, and adds '.csv'.
Your my_outfile will be myfile.csv
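For what it's worth, os.path.splitext does the same swap a little more robustly, since it strips only the final extension even when the name contains extra dots (the sample name is hypothetical):

```python
import os

infile = 'my.file.jpg'  # hypothetical name with an extra dot
my_outfile = os.path.splitext(infile)[0] + '.csv'
print(my_outfile)  # my.file.csv
```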
Well, in Python it's possible to do that, but the original file might be corrupted: writing the same exact file name, i.e. BibleKJV.pdf, back to the path of BibleKJV.pdf will clobber the first file. Take a look at this script to verify that I'm on the right track (if I'm totally off, disregard my answer):
import os
from PyPDF2 import PdfFileReader, PdfFileWriter

path = "C:/Users/Catrell Washington/Pride"
input_file_name = os.path.join(path, "BibleKJV.pdf")
input_file = PdfFileReader(open(input_file_name, "rb"))
output_PDF = PdfFileWriter()

total_pages = input_file.getNumPages()
for page_num in range(1, total_pages):
    output_PDF.addPage(input_file.getPage(page_num))

output_file_name = os.path.join(path, "BibleKJV.pdf")
output_file = open(output_file_name, "wb")
output_PDF.write(output_file)
output_file.close()
When I ran the above script, I lost all data from the original "BibleKJV.pdf" at that path, showing that if the file name and the file extension (.pdf, .cs, .docx, etc.) are the same, then the data will be corrupted unless the changes are very minimal.
If this doesn't give you any help, please edit your question with a script of what you're trying to achieve.