How can I use file names as an input? - python

I wrote a python script (with pandas library) to create txt files. I also use a txt file as an input. It works well but I want to make it more automated.
My code starts like;
girdi = input("Lütfen gir: ")
input2 = girdi+".txt"
veriCNR = pd.read_table(
input2, decimal=",",
usecols=[
"Chromosome",
"Name",
.
.
.
I am entering the name of the files one by one and getting outputs like this:
.
.
.
outputCNR = girdi+".cnr"
sonTabloCNR.to_csv(outputCNR, sep="\t", index=False)
outputCNS = girdi+".cns"
sonTabloCNS.to_csv(outputCNS, sep="\t", index=False)
outputCNG = girdi+".genemetrics.cns"
sonTabloCNG.to_csv(outputCNG, sep="\t", index=False)
As you see I am using input name also for outputs. They are tab seperated txt files with different file extensions.
I want to use all txt files in a folder as an input and run this script for every one of them.
I hope I explained it clearly.
ps. I am not a programmer. Please be explanatory with codes :)

You can put your code creating the files in a function which takes the filepath as input:
def generate_results(input_filepath):
# your code generating the files
then you can use this function in a loop to generate all the result files.
Here's an example using glob to get all the text files in directory
import os
import glob
directory = "path to my directory"
for filepath in glob.glob(os.path.join(directory, "*.txt")):
generate_results(filepath)

If you want to iterate over the files in folder use
os.listdir(path). It returns list of filenames on path (by default it's current directory).
So for example:
import os
supported_extensions = ['.cns', '.cnr']
for filename in os.listdir():
# Make sure file has desired extension
if os.path.splitext(filename)[1] in supported_extensions:
function(filename)

Related

python: set file path to only point to files with a specific ending

I am trying to run a program with requires pVCF files alone as inputs. Due to the size of the data, I am unable to create a separate directory containing the particular files that I need.
The directory contains multiple files with 'vcf.gz.tbi' and 'vcf.gz' endings. Using the following code:
file_url = "file:///mnt/projects/samples/vcf_format/*.vcf.gz"
I tried to create a file path that only grabs the '.vcf.gz' files while excluding the '.vcf.gz.tbi' but I have been unsuccesful.
The code you have, as written, is just assigning your file path to the variable file_url. For something like this, glob is popular but isn't the only option:
import glob, os
file_url = "file:///mnt/projects/samples/vcf_format/"
os.chdir(file_url)
for file in glob.glob("*.vcf.gz"):
print(file)
Note that the file path doesn't contain the kind of file you want (in this case, a gzipped VCF), the glob for loop does that.
Check out this answer for more options.
It took some digging but it looks like you're trying to use the import_vcf function of Hail. To put the files in a list so that it can be passed as input:
import glob, os
file_url = "file:///mnt/projects/samples/vcf_format/"
def get_vcf_list(path):
vcf_list = []
os.chdir(path)
for file in glob.glob("*.vcf.gz"):
vcf_list.append(path + "/" + file)
return vcf_list
get_vcf_list(file_url)
# Now you pass 'get_vcf_list(file_url)' as your input instead of 'file_url'
mt = hl.import_vcf(get_vcf_list(file_url), force_bgz=True, reference_genome="GRCh38", array_elements_required=False)

Access data from particular directory with less line of codes

Assume, I have a csv file data.csv located in the following directory 'C:\\Users\\rp603\\OneDrive\\Documents\\Python Scripts\\Basics\\tutorials\\Revision\\datasets'. Using this code, I can access my csv file:
## read the csv file from a particular folder
import pandas as pd
import glob
files = glob.glob(r"C:\\Users\\rp603\\OneDrive\\Documents\\Python Scripts\\Basics\\tutorials\\Revision\\datasets*.csv")
df = pd.DataFrame()
for f in files:
csv = pd.read_csv(f)
df = df.append(csv)
But as you can see the csv file path is long. So, is there is any way to do the same operation where I can reduce the path location of my data as well as codes line.
use the "dot" notation for a relative path (it does not depend on the programming language)
# example for a "shorter" version of the path
import os
my_current_position = '.' # where you launch the program
files = '' # from above
print(os.path.relpath(files, my_current_position)
Remark relpath is order sensitive
You can use a context manager to open the file, not shorter but more elegant
with open(file, 'r') as fd:
data_table = pd.read_csv(fd)
If you put your script in the same directory as the datasets, you can simply do:
import glob
files = glob.glob("datasets*.csv")

Edit multiple text files, and save as new files

My first post on StackOverflow, so please be nice. In other words, a super beginner to Python.
So I want to read multiple files from a folder, divide the text and save the output as a new file. I currently have figured out this part of the code, but it only works on one file at a time. I have tried googling but can't figure out a way to use this code on multiple text files in a folder and save it as "output" + a number, for each file in the folder. Is this something that's doable?
with open("file_path") as fReader:
corpus = fReader.read()
loc = corpus.find("\n\n")
print(corpus[:loc], file=open("output.txt","a"))
Possibly work with a list, like:
from pathlib import Path
source_dir = Path("./") # path to the directory
files = list(x for x in filePath.iterdir() if x.is_file())
for i in range(len(files)):
file = Path(files[i])
outfile = "output_" + str(i) + file.suffix
with open(file) as fReader, open(outfile, "w") as fOut:
corpus = fReader.read()
loc = corpus.find("\n\n")
fOut.write(corpus[:loc])
** sorry for multiple editting....
welcome to the site. Yes, what you are asking above is completely doable and you are on the right track. You will need to do a little research/practice with the os module which is highly useful when working with files. The two commands that you will want to research a bit are:
os.path.join()
os.listdir()
I would suggest you put two folders within your python file, one called data and the other called output to catch the results. Start and see if you can just make the code to list all the files in your data directory, and just keep building that loop. Something like this should list all the files:
# folder file lister/test writer
import os
source_folder_name = 'data' # the folder to be read that is in the SAME directory as this file
output_folder_name = 'output' # will be used later...
files = os.listdir(source_folder_name)
# get this working first
for f in files:
print(f)
# make output folder names and just write a 1-liner into each file...
for f in files:
output_filename = f.split('.')[0] # the part before the period
output_filename += '_output.csv'
output_path = os.path.join(output_folder_name, output_filename)
with open(output_path, 'w') as writer:
writer.write('some data')

Write multiple text files to the directory in Python

I was working on saving text to different files. so, now I already created several files and each text file has some texts/paragraph in it. Now, I just want to save these files to a directory. I already created a self-defined directory, but now it is empty. I want to save these text files into my directory.
The partial code is below:
for doc in root:
docID = doc.find('DOCID').text.strip()
text = doc.find('TEXT').text,strip()
f = open("%s" %docID, 'w')
f.write(str(text))
Now, I created all the files with text in it. and I also have a blank folder/directory now. I just don't know how to put these files into the directory.
I would be appreciate it.
========================================================================
[Solved] Thank you guys for your all helping! I figured it out. I just edit my summary here. I got a few problems.
1. my docID was saved as tuple. I need to convert to string without any extra symbol. here is the reference i used: https://stackoverflow.com/a/17426417/9387211
2. I just created a new path and write the text to it. i used this method: https://stackoverflow.com/a/8024254/9387211
Now, I can share my updated code and there is no more problem here. Thanks everyone again!
for doc in root:
docID = doc.find('DOCID').text.strip()
did = ''.join(map(str,docID))
text = doc.find('TEXT').text,strip()
txt = ''.join(map(str,docID))
filename = os.path.join(dst_folder_path, did)
f = open(filename, 'w')
f.write(str(text))
Suppose you have all the text files in home directory (~/) and you want to move them to /path/to/dir folder.
from shutil import copyfile
import os
docid_list = ['docid-1', 'docid-2']
for did in docid_list:
copyfile(did, /path/to/folder)
os.remove(did)
It will copy the docid files in /path/to/folder path and remove the files from the home directory (assuming you run this operation from home dir)
You can frame the file path for open like
doc_file = open(<file path>, 'w')

How can I run a python script on many files to get many output files?

I am new at programming and I have written a script to extract text from a vcf file. I am using a Linux virtual machine and running Ubuntu. I have run this script through the command line by changing my directory to the file with the vcf file in and then entering python script.py.
My script knows which file to process because the beginning of my script is:
my_file = open("inputfile1.vcf", "r+")
outputfile = open("outputfile.txt", "w")
The script puts the information I need into a list and then I write it to outputfile. However, I have many input files (all .vcf) and want to write them to different output files with a similar name to the input (such as input_processed.txt).
Do I need to run a shell script to iterate over the files in the folder? If so how would I change the python script to accommodate this? I.e writing the list to an outputfile?
I would integrate it within the Python script, which will allow you to easily run it on other platforms too and doesn't add much code anyway.
import glob
import os
# Find all files ending in 'vcf'
for vcf_filename in glob.glob('*.vcf'):
vcf_file = open(vcf_filename, 'r+')
# Similar name with a different extension
output_filename = os.path.splitext(vcf_filename)[0] + '.txt'
outputfile = open(output_filename, 'w')
# Process the data
...
To output the resulting files in a separate directory I would:
import glob
import os
output_dir = 'processed'
os.makedirs(output_dir, exist_ok=True)
# Find all files ending in 'vcf'
for vcf_filename in glob.glob('*.vcf'):
vcf_file = open(vcf_filename, 'r+')
# Similar name with a different extension
output_filename = os.path.splitext(vcf_filename)[0] + '.txt'
outputfile = open(os.path.join(output_dir, output_filename), 'w')
# Process the data
...
You don't need write shell script,
maybe this question will help you?
How to list all files of a directory?
It depends on how you implement the iteration logic.
If you want to implement it in python, just do it;
If you want to implement it in a shell script, just change your python script to accept parameters, and then use shell script to call the python script with your suitable parameters.
I have a script I frequently use which includes using PyQt5 to pop up a window that prompts the user to select a file... then it walks the directory to find all of the files in the directory:
pathname = first_fname[:(first_fname.rfind('/') + 1)] #figures out the pathname by finding the last '/'
new_pathname = pathname + 'for release/' #makes a new pathname to be added to the names of new files so that they're put in another directory...but their names will be altered
file_list = [f for f in os.listdir(pathname) if f.lower().endswith('.xls') and not 'map' in f.lower() and not 'check' in f.lower()] #makes a list of the files in the directory that end in .xls and don't have key words in the names that would indicate they're not the kind of file I want
You need to import os to use the os.listdir command.
You can use listdir(you need to write condition to filter the particular extension) or glob. I generally prefer glob. For example
import os
import glob
for file in glob.glob('*.py'):
data = open(file, 'r+')
output_name = os.path.splitext(file)[0]
output = open(output_name+'.txt', 'w')
output.write(data.read())
This code will read the content from input and store it in outputfile.

Categories

Resources