Calling a part of file.txt name in Python - python

I want to call files in a folder, for example the folder contains:
abc-123.txt
def-456.txt
ghi-789.txt
I want to get "abc", "def", "ghi" without renaming the actual .txt file.
My files are like:
1789-Washington.txt
1793-Washington.txt
1797-Adams.txt
1801-Jefferson.txt
1805-Jefferson.txt
I want to put this information into a table with pandas that looks like this:
President Year
Washington 1789
Washington 1793
Adams 1797
Jefferson 1801
Jefferson 1805

A solution for this can be found here: https://stackoverflow.com/a/678242/14191679. Apply os.path.splitext to the path of your file and, instead of printing the result, assign it to a variable, such as
path = os.path.splitext("/path/to/some/file.txt")[0]
and use Python string manipulation to get the parts of the string you need. Note that path still contains the directories; os.path.basename(path) gives just the file name ("file" here).
So if you want to get the first 3 letters of the file name,
print(os.path.basename(path)[0:3])
The code above prints the first 3 characters of the file name stored in the string variable path.

Let's break down the problem a bit:
How do we get the names of files in a folder?
We don't care about extracting part of the name yet; getting the full name is fine for now. Just make sure you know how to list the files in a folder.
With a little research on how to list file names in a folder in Python, we get this:
# This line is crucial, as we will use some code from the os module to help us
import os
# This gives us a list of the names of all the files in the folder provided to the os.listdir() function
my_files = os.listdir("/path/to/folder")
# Now we can loop through and print each file name.
for file in my_files:
    print(file)
How do we extract part of the name of the file?
This problem has nothing to do with files really, but is rather about how to get a part of a string, so I'd recommend you read through the Python 3 string documentation here. In particular, if the part you need from the name always comes before a -, then str.split() should help.
You can then apply the same logic to each file in the loop above.
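As a minimal sketch of that idea, the loop below splits each name on the "-" character. The list of names is copied from the question so the example runs without a real folder; the variable names stem and prefixes are only illustrative.

```python
import os

# Sample names from the question; a real script would use os.listdir("/path/to/folder").
my_files = ["abc-123.txt", "def-456.txt", "ghi-789.txt"]

prefixes = []
for file in my_files:
    # Drop the extension first: "abc-123.txt" becomes "abc-123"
    stem = os.path.splitext(file)[0]
    # Keep only the part before the first "-"
    prefixes.append(stem.split("-")[0])

print(prefixes)
```

The files on disk are never renamed; only the strings held in memory are split.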

Here is a simple one-liner list comprehension solution for the above task:
[file.split(".")[0].split("-")[0] for file in os.listdir("path_to_dir")]
There are three main components in the above solution which you can read more about from the links below:
os.listdir - lists files in a given directory
split - splits the string into one or more parts based on some character in the string
List comprehensions - quick way to construct lists on the fly rather than appending to a list in a loop

I hope the implementation below answers your question:
import os
import pandas as pd
president = []
year = []
for filename in os.listdir("path_to_dir"):
    # Split the file name from its extension (.txt)
    stem, extension = os.path.splitext(filename)
    # Split "1789-Washington" into the year and the president's name
    parts = stem.split('-')
    year.append(parts[0])
    president.append(parts[1])
df = pd.DataFrame(list(zip(president, year)),
                  columns=['President', 'Year'])
df.head()

Related

Modifying the order of reading CSV Files in Python according to the name

I have 1000 CSV files with names Radius_x where x stands for 0,1,2...11,12,13...999. But when I read the files and try to analyse the results, I wish to read them in the same order of whole numbers as listed above. But the code reads as follows (for example): ....145,146,147,148,149,15,150,150...159,16,160,161,...... and so on.
I know that if we rename the CSV files as Radius_xyz where xyz = 000,001,002,003,....010,011,012.....999, the problem could be resolved. Kindly help me as to how I can proceed.
To sort a list of paths numerically in python, first find all the files you are looking to open, then sort that iterable with a key which extracts the number.
With pathlib:
from pathlib import Path
files = list(Path("/tmp/so/").glob("Radius_*.csv")) # Path.glob returns a generator which needs to be put in a list
files.sort(key=lambda p: int(p.stem[7:])) # `Radius_` length is 7
files contains
[PosixPath('/tmp/so/Radius_1.csv'),
PosixPath('/tmp/so/Radius_2.csv'),
PosixPath('/tmp/so/Radius_3.csv'),
PosixPath('/tmp/so/Radius_4.csv'),
PosixPath('/tmp/so/Radius_5.csv'),
PosixPath('/tmp/so/Radius_6.csv'),
PosixPath('/tmp/so/Radius_7.csv'),
PosixPath('/tmp/so/Radius_8.csv'),
PosixPath('/tmp/so/Radius_9.csv'),
PosixPath('/tmp/so/Radius_10.csv'),
PosixPath('/tmp/so/Radius_11.csv'),
PosixPath('/tmp/so/Radius_12.csv'),
PosixPath('/tmp/so/Radius_13.csv'),
PosixPath('/tmp/so/Radius_14.csv'),
PosixPath('/tmp/so/Radius_15.csv'),
PosixPath('/tmp/so/Radius_16.csv'),
PosixPath('/tmp/so/Radius_17.csv'),
PosixPath('/tmp/so/Radius_18.csv'),
PosixPath('/tmp/so/Radius_19.csv'),
PosixPath('/tmp/so/Radius_20.csv')]
NB. files is a list of paths not strings, but most functions which deal with files accept both types.
A similar approach could be done with glob, which would give a list of strings not paths.
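A minimal sketch of that glob variant follows. The temporary folder, the handful of sample files, and the helper name radius_index are assumptions made only so the example is self-contained; in practice the pattern would point at the real folder.

```python
import glob
import os
import tempfile

# Hypothetical folder with a few Radius_x.csv files, created so the sketch runs anywhere.
folder = tempfile.mkdtemp()
for x in (0, 2, 10, 15, 150):
    open(os.path.join(folder, f"Radius_{x}.csv"), "w").close()

def radius_index(path):
    # Take the file name without folder or extension, e.g. "Radius_150",
    # and return the number after the underscore as an int.
    stem = os.path.splitext(os.path.basename(path))[0]
    return int(stem.split("_")[1])

# glob.glob returns plain strings in arbitrary order; sort them by the numeric key.
files = sorted(glob.glob(os.path.join(folder, "Radius_*.csv")), key=radius_index)
print([os.path.basename(f) for f in files])
```

Sorting on an integer key avoids renaming the files with zero-padded numbers.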

Rename directory with constantly changing name

I created a script that is supposed to download some data, then run a few processes. The data source (ArcGIS Online) always downloads the data as a zip file, and when extracted the folder name will be a series of letters and numbers. I noticed that these occasionally change (not entirely sure why). My thought is to run an os.listdir to get the folder name, then rename it. Where I run into issues is that the list returns the folder name with brackets and quotes. It returns ['f29a52b8908242f5b1f32c58b74c063b.gdb'] as the folder name, while the folder in the file explorer does not have the brackets and quotes. Below is my code and the error I receive.
import os
from zipfile import ZipFile
file_name = "THDNuclearFacilitiesBaseSandboxData.zip"
with ZipFile(file_name) as zip:
    # unzipping all the files
    print("Unzipping "+ file_name)
    zip.extractall("C:/NAPSG/PROJECTS/DHS/THD_Nuclear_Facilities/SCRIPT/CountyDownload/Data")
    print('Unzip Complete')
# removes old zip file
os.remove(file_name)
x = os.listdir("C:/NAPSG/PROJECTS/DHS/THD_Nuclear_Facilities/SCRIPT/CountyDownload/Data")
os.renames(str(x), "Test.gdb")
Output:
FileNotFoundError: [WinError 2] The system cannot find the file specified: "['f29a52b8908242f5b1f32c58b74c063b.gdb']" -> 'Test.gdb'
I'm relatively new to python scripting, so if there is an easier alternative, that would be great as well. Thanks!
os.listdir() returns a list of the files/objects that are in a folder.
Lists are represented, when printed to the screen, using a set of brackets.
The name of each file is a string of characters and strings are represented, when printed to the screen, using quotes.
So we are seeing a list with a single filename:
['f29a52b8908242f5b1f32c58b74c063b.gdb']
To access an item within a list in Python, you can use index notation (which also happens to use brackets, to tell Python which item in the list to use by referencing the index, or number, of the item).
Python list indexes start at zero, so to get the first (and in this case only) item in the list, you can use x[0].
x = os.listdir("C:/NAPSG/PROJECTS/DHS/THD_Nuclear_Facilities/SCRIPT/CountyDownload/Data")
os.renames(x[0], "Test.gdb")
Having said that, I would generally not use x as a variable name in this case... I might write the code a bit differently:
files = os.listdir("C:/NAPSG/PROJECTS/DHS/THD_Nuclear_Facilities/SCRIPT/CountyDownload/Data")
os.renames(files[0], "Test.gdb")
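One caveat with the snippets above: os.listdir returns bare names, not full paths, so renaming files[0] directly only works when the script runs from inside the data folder. A minimal sketch of joining the folder onto both names is shown below; the temporary folder stands in for the real Data directory, and the .gdb filter is an assumption about what else might be in that folder.

```python
import os
import tempfile

# Hypothetical stand-in for the real Data folder, so the sketch runs anywhere.
data_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(data_dir, "f29a52b8908242f5b1f32c58b74c063b.gdb"))

# Keep only entries ending in .gdb, in case the folder holds other items.
gdb_names = [name for name in os.listdir(data_dir) if name.endswith(".gdb")]

# Build full paths for both the source and the target before renaming.
old_path = os.path.join(data_dir, gdb_names[0])
new_path = os.path.join(data_dir, "Test.gdb")
os.rename(old_path, new_path)

print(os.listdir(data_dir))
```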
Square brackets indicate a list. Try x[0]; that should get rid of the brackets and give you just the data.
The return value from listdir may be a list with only one value or a whole bunch.

How to import "dat" file including only certain strings in python

I am trying to import several "dat" files to python spyder.
[image: list of the dat files]
The dat files are listed above; there are two types in the list, ending with _1 and _2. I want to import only the dat files ending with "_1".
Is there any way to import them at once with one single loop?
After I import them, I would like to aggregate all to one single matrix.
import os
# folder_path is the path to the folder that holds the dat files
files_to_import = [f for f in os.listdir(folder_path)
                   if f.endswith("1")]
Make sure that you know whether the files have a .dat-extension or not - in Windows Explorer, the default setting is to hide file endings, and this will make your code fail if the files have a different ending.
What this code does is called list comprehension - os.listdir() provides all the files in the folder, and you create a list with only the ones that end with "1".
Use str.endswith(): it returns True if the string ends with the given suffix.
According to this website
Syntax: str.endswith(suffix[, start[, end]])
For your case:
You will need a loop that gets each filename as a string and checks whether it ends with "_1":
yourFilename = "yourfilename_1"
if yourFilename.endswith("_1"):
    # do your job here
    pass
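If the files do carry a .dat extension, the check has to be made on the name without that extension. A minimal sketch using pathlib is shown below; the temporary folder and the run_a/run_b file names are hypothetical and exist only to make the example runnable.

```python
import tempfile
from pathlib import Path

# Hypothetical folder and file names, created only so the sketch is runnable.
folder = Path(tempfile.mkdtemp())
for name in ["run_a_1.dat", "run_a_2.dat", "run_b_1.dat", "run_b_2.dat"]:
    (folder / name).touch()

# Path.stem drops the final extension, so "run_a_1.dat" is compared as "run_a_1".
files_to_import = sorted(p for p in folder.iterdir() if p.stem.endswith("_1"))

print([p.name for p in files_to_import])
```

Each path in files_to_import can then be read in a single loop and the results stacked into one matrix with whatever reader suits the file format.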

Looping through files using lists

I have a folder with pseudo directory (/usr/folder/) of files that look like this:
target_07750_20181128.tsv.gz
target_07750_20181129.tsv.gz
target_07751_20181130.tsv.gz
target_07751_20181203.tsv.gz
target_07751_20181204.tsv.gz
target_27103_20181128.tsv.gz
target_27103_20181129.tsv.gz
target_27103_20181130.tsv.gz
I am trying to join the above tsv files to one xlsx file on store code (found in the file names above).
I am reading a file, say file.xlsx, into a pandas dataframe.
I have extracted store codes from file.xlsx so I have the following:
stores = instore.store_code.astype(str).unique()
output:
07750
07751
27103
So my end goal is to loop through each store in stores and find which filename that corresponds to in directory. Here is what I have so far but I can't seem to get the proper filename to print:
import os
for store in stores:
    print(store)
    if store in os.listdir('/usr/folder/'):
        print(os.listdir('/usr/folder/'))
The output I'm expecting to see for say store_code in loop = '07750' would be:
07750
target_07750_20181128.tsv.gz
target_07750_20181129.tsv.gz
Instead I'm only seeing the store codes returned:
07750
07751
27103
What am I doing wrong here?
The reason your if statement fails is that it checks if "07750" etc is one of the filenames in the directory, which it is not. What you want is to see if "07750" is contained in one of the filenames.
I'd go about it like this:
import os
from collections import defaultdict
store_files = defaultdict(list)
for filename in os.listdir('/usr/folder/'):
    # "target_07750_20181128.tsv.gz" -> "07750" (assumes the store number is the second "_"-separated field)
    store_number = filename.split('_')[1]
    store_files[store_number].append(filename)
Now store_files will be a dictionary with a list of filenames for each store number.
The problem is that you're assuming a substring search -- that's not how in works on a list. For instance, on the first iteration, your if looks like this:
if "07750" in ["target_07750_20181128.tsv.gz",
"target_07750_20181129.tsv.gz",
"target_07751_20181130.tsv.gz",
... ]:
The string "07750" is not an element of that list. It does appear as a substring of some elements, but in doesn't work that way on a list. Instead, try this:
for filename in os.listdir('/usr/folder/'):
    if '_' + store + '_' in filename:
        print(filename)
Does that help?
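To make the difference concrete, here is a small self-contained sketch. The file names and store codes are copied from the question in place of a real os.listdir call, and the dictionary name matches is only illustrative.

```python
# File names copied from the question; a real script would use os.listdir('/usr/folder/').
filenames = [
    "target_07750_20181128.tsv.gz",
    "target_07750_20181129.tsv.gz",
    "target_07751_20181130.tsv.gz",
    "target_07751_20181203.tsv.gz",
    "target_07751_20181204.tsv.gz",
    "target_27103_20181128.tsv.gz",
    "target_27103_20181129.tsv.gz",
    "target_27103_20181130.tsv.gz",
]
stores = ["07750", "07751", "27103"]

# List membership compares whole elements, so a store code never equals a file name.
is_element = "07750" in filenames

# Substring test against each file name instead, one list of matches per store.
matches = {
    store: [f for f in filenames if "_" + store + "_" in f]
    for store in stores
}

for store in stores:
    print(store)
    for f in matches[store]:
        print(f)
```

Wrapping the store code in underscores keeps a code from accidentally matching inside the date part of a file name.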

Automatically find files that start with similar strings (and find these strings) using Python

I have a directory with a number of files in a format similar to this:
"ABC_01.dat", "ABC_02.dat", "ABC_03-08.dat", "DEF_13.dat", "DEF_14.dat", "DEF_16.dat", "GHI_09.dat", "GHI_12-14.dat"
etc., you get the idea. Essentially, what I want to do is merge all files whose names start with a similar string. At the moment, I do this by manually setting a variable names = ["ABC", "DEF", "GHI"], iterating over it (for name in names) and getting the respective filenames using glob glob.glob(name + "*.dat"). The merging step is later done using pandas. I don't just need the names/prefixes for finding the files; they are used later in my script to set the output files' names.
Is there a way I can automatically generate the variable names if I know that the files are all in the format name_*.dat?
Consider this:
from glob import glob
names = set([name.rpartition('_')[0] for name in glob('*_*.dat')])
This will get all unique prefixes before the last '_'. You will also want to set a correct path in glob() before matching.
You can do this:
import glob
file_names = sorted(glob.glob("*_*.dat"))
result = [[f for f in file_names if f.startswith(sn + '_')] for sn in sorted({f.split('_')[0] for f in file_names})]
print(result)
output:
[['ABC_01.dat', 'ABC_02.dat', 'ABC_03-08.dat'], ['DEF_13.dat', 'DEF_14.dat', 'DEF_16.dat'], ['GHI_09.dat', 'GHI_12-14.dat']]
Now, all files from result[0] are to be merged; similarly for result[1], ...
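Since the prefixes are also needed later for naming the output files, it may be more convenient to keep them as dictionary keys. The sketch below is one way to do that, assuming every file follows the name_*.dat pattern; the file list is copied from the question in place of a real glob call, and the "_merged.dat" output name is purely hypothetical.

```python
from collections import defaultdict

# File names copied from the question; a real script would get them from glob.glob("*_*.dat").
file_names = [
    "ABC_01.dat", "ABC_02.dat", "ABC_03-08.dat",
    "DEF_13.dat", "DEF_14.dat", "DEF_16.dat",
    "GHI_09.dat", "GHI_12-14.dat",
]

# Map each prefix (the text before the first "_") to the files that share it.
groups = defaultdict(list)
for file_name in file_names:
    prefix = file_name.split("_")[0]
    groups[prefix].append(file_name)

for prefix, members in sorted(groups.items()):
    # "<prefix>_merged.dat" is a hypothetical output name, not one from the question.
    print(prefix + "_merged.dat", "<-", members)
```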
