Looping through files using lists - python

I have a folder with pseudo directory (/usr/folder/) of files that look like this:
target_07750_20181128.tsv.gz
target_07750_20181129.tsv.gz
target_07751_20181130.tsv.gz
target_07751_20181203.tsv.gz
target_07751_20181204.tsv.gz
target_27103_20181128.tsv.gz
target_27103_20181129.tsv.gz
target_27103_20181130.tsv.gz
I am trying to join the above tsv files to one xlsx file on store code (found in the file names above).
I am reading in a file, say file.xlsx, as a pandas dataframe.
I have extracted store codes from file.xlsx so I have the following:
stores = instore.store_code.astype(str).unique()
output:
07750
07751
27103
So my end goal is to loop through each store in stores and find which filename that corresponds to in directory. Here is what I have so far but I can't seem to get the proper filename to print:
import os
for store in stores:
    print(store)
    if store in os.listdir('/usr/folder/'):
        print(os.listdir('/usr/folder/'))
The output I'm expecting to see for say store_code in loop = '07750' would be:
07750
target_07750_20181128.tsv.gz
target_07750_20181129.tsv.gz
Instead I'm only seeing the store codes returned:
07750
07751
27103
What am I doing wrong here?

The reason your if statement fails is that it checks if "07750" etc is one of the filenames in the directory, which it is not. What you want is to see if "07750" is contained in one of the filenames.
I'd go about it like this:
import os
from collections import defaultdict

store_files = defaultdict(list)
for filename in os.listdir('/usr/folder/'):
    # e.g. "target_07750_20181128.tsv.gz" -> "07750"
    # (assumes the target_<store>_<date> pattern from the question)
    store_number = filename.split('_')[1]
    store_files[store_number].append(filename)
Now store_files will be a dictionary with a list of filenames for each store number.
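As a runnable sketch of that approach, with os.listdir() swapped for a hardcoded list of the filenames from the question, and the store number pulled out by splitting on underscores (an assumption based on the target_<store>_<date> pattern):

```python
from collections import defaultdict

# Stand-in for os.listdir('/usr/folder/')
filenames = [
    "target_07750_20181128.tsv.gz",
    "target_07750_20181129.tsv.gz",
    "target_07751_20181130.tsv.gz",
]

store_files = defaultdict(list)
for filename in filenames:
    # "target_07750_20181128.tsv.gz" -> "07750"
    store_number = filename.split('_')[1]
    store_files[store_number].append(filename)

print(store_files['07750'])
# ['target_07750_20181128.tsv.gz', 'target_07750_20181129.tsv.gz']
```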

The problem is that you're assuming a substring search -- that's not how in works on a list. For instance, on the first iteration, your if looks like this:
if "07750" in ["target_07750_20181128.tsv.gz",
               "target_07750_20181129.tsv.gz",
               "target_07751_20181130.tsv.gz",
               ... ]:
The string "07750" is not an element of that list. It does appear as a substring, but in doesn't work that way on a list. Instead, try this:
for filename in os.listdir('/usr/folder/'):
    if '_' + store + '_' in filename:
        print(filename)
Does that help?
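For the stated end goal of joining on store code, here is a minimal sketch of the pandas side using tiny stand-in DataFrames (the column names region and sales are assumptions; only store_code comes from the question):

```python
import pandas as pd

# Stand-in for the dataframe read from file.xlsx
instore = pd.DataFrame({'store_code': ['07750', '07751'],
                        'region': ['north', 'south']})
# Stand-in for one parsed target_*.tsv.gz file, tagged with its store code
tsv_data = pd.DataFrame({'store_code': ['07750', '07750'],
                         'sales': [10, 20]})

# Inner join on the shared store_code column
merged = instore.merge(tsv_data, on='store_code')
print(merged)
```

In practice you would build tsv_data by reading each matched .tsv.gz with pd.read_csv(..., sep='\t', compression='gzip') and concatenating the pieces before merging.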

Related

Calling a part of file.txt name in Python

I want to call files in a folder, for example the folder contains:
abc-123.txt
def-456.txt
ghi-789.txt
I want to get "abc", "def", "ghi" without renaming the actual .txt file.
My files are like:
1789-Whasington.txt
1793-Whasington.txt
1797-Adams.txt
1801-Jefferson.txt
1805-Jefferson.txt
I want to put the information into a table with pandas, to look like this:
President   Year
Washington  1789
Washington  1793
Adams       1797
Jefferson   1801
Jefferson   1805
A solution for this can be found here: https://stackoverflow.com/a/678242/14191679. Apply os.path.splitext to the filepath of your file and, instead of printing it, assign it to a variable, such as
path = os.path.splitext("/path/to/some/file.txt")[0]
and use python string manipulation to get the necessary parts of the string you require.
So if you want to get the first 3 letters,
print(path[0:3])
The code above will return the first 3 letters of the string variable, path.
Let's break down the problem a bit:
How do we get the names of files in a folder?
We're not going to care about the name, it's fine to get the full thing, just make sure you know how to list the files in a folder.
With a little research on how to list file names in a folder in Python, we get this:
# This line is crucial, as we will use some code from the os library to help us
import os

# This gives us a list of the names of all the files in the folder provided to os.listdir()
my_files = os.listdir("/path/to/folder")

# Now we can loop through and print each file name.
for file in my_files:
    print(file)
How do we extract part of the name of the file?
This problem has nothing to do with files really, but is rather about how to get part of a string, so I'd recommend you read through the Python 3 string documentation. In particular, if the part you need from the name always comes before a -, then str.split() should help.
You can then apply the same logic to each file in the loop above.
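For example, with one of the filenames above (a short sketch):

```python
name = "1789-Whasington.txt"
stem = name.rsplit('.', 1)[0]      # drop the extension -> "1789-Whasington"
year, president = stem.split('-')  # split on the dash
print(year, president)
# 1789 Whasington
```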
Here is a simple one-liner list comprehension solution for the above task:
[file.split(".")[0].split("-")[0] for file in os.listdir("path_to_dir")]
There are three main components in the above solution which you can read more about from the links below:
os.listdir - lists files in a given directory
split - splits the string into one or more parts based on some character in the string
List comprehensions - a quick way to construct lists on the fly rather than appending to a list in a loop
I hope the below implementation answers your question
import os
import pandas as pd

president = []
year = []
for filename in os.listdir("path_to_dir"):
    # Split off the extension (.txt)
    name = os.path.splitext(filename)[0]
    # Filenames look like "1789-Whasington", so the year comes first
    parts = name.split('-')
    year.append(parts[0])
    president.append(parts[1])

df = pd.DataFrame(list(zip(president, year)),
                  columns=['President', 'Year'])
df.head()

How can i read specific files in a folder (files within a range)in Python

For example, I have some 43000 txt files in my folder, however, I want to read not all the files but just some of them, by giving a range, like from 1.txt till 14400.txt. How can I achieve this in Python? For now, I'm reading all the files in a directory like
for each in glob.glob("data/*.txt"):
    with open(each, 'r') as file:
        content = file.readlines()
    with open('{}.csv'.format(each[0:-4]), 'w') as file:
        file.writelines(content)
Any way I can achieve the desired results?
Since glob.glob() returns an iterable, you can simply iterate over a certain section of the list using something like:
import glob

for each in glob.glob("*")[:5]:
    print(each)
Just use variable list boundaries and I think this achieves the results you are looking for.
Edit: Also, note that slicing past the end of a list is safe in Python; a slice like [:5] simply returns fewer items if the list has fewer than five elements.
If the files have numerically consecutive names starting with 1.txt, you can use range() to help construct the filenames:
for num in range(1, 14401):  # range() excludes the end, so go one past 14400
    filename = "data/%d.txt" % num
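A fuller sketch along those lines, skipping any numbers that have no file on disk (the os.path.exists guard is an addition, not part of the original answer):

```python
import os

# Build the candidate filenames 1.txt through 14400.txt
filenames = ["data/%d.txt" % num for num in range(1, 14401)]
print(filenames[0], filenames[-1])  # data/1.txt data/14400.txt

for filename in filenames:
    if not os.path.exists(filename):  # skip gaps in the numbering
        continue
    with open(filename, 'r') as f:
        content = f.readlines()
```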
I found a solution here: How to extract numbers from a string in Python?
import os
import re

filepath = './'
for filename in os.listdir(filepath):
    numbers_in_name = re.findall(r'\d+', filename)
    if numbers_in_name and int(numbers_in_name[0]) < 5:
        print(os.path.join(filepath, filename))
        # do other stuff with the filenames
You can use re to get the numbers in the filename. This prints all filenames where the first number is smaller than 5 for example.

Extract subset from several file names using python

I have a lot of files in a directory with name like:
'data_2000151_avg.txt', 'data_2000251_avg.txt', 'data_2003051_avg.txt'...
Assume that one of them is called fname. I would like to extract a subset from each like so:
fname.split('_')[1][:4]
This will give as a result, 2000. I want to collect these from all the files in the directory and create a unique list. How do I do that?
You should use os.
import os

dirname = 'PathToFile'
myuniquelist = []
for d in os.listdir(dirname):
    if d.startswith('data'):  # match the data_..._avg.txt files
        myuniquelist.append(d.split('_')[1][:4])
EDIT: Just saw your comment on wanting a set. After the for loop add this line.
myuniquelist = list(set(myuniquelist))
If unique list means a list of unique values, then a combination of glob (in case the folder contains files that do not match the desired name format) and set should do the trick:
from glob import glob
uniques = {fname.split('_')[1][:4] for fname in glob('data_*_avg.txt')}
# In case you really do want a list
unique_list = list(uniques)
This assumes the files reside in the current working directory. Append path as necessary to glob('path/to/data_*_avg.txt').
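The same extraction run on a hardcoded list of the filenames from the question, for illustration:

```python
fnames = ['data_2000151_avg.txt', 'data_2000251_avg.txt', 'data_2003051_avg.txt']
# Take the second underscore-separated field, keep the first four characters
uniques = {fname.split('_')[1][:4] for fname in fnames}
print(uniques)
# {'2000', '2003'} (set order may vary)
```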
For listing files in a directory you can use os.listdir(). For generating the list of unique values, a set comprehension is best suited.
import os
data = {f.split('_')[1][:4] for f in os.listdir(dir_path)}
list(data) #if you really need a list

Attempting to read data from multiple files to multiple arrays

I would like to be able to read data from multiple files in one folder to multiple arrays and then perform analysis on these arrays such as plot graphs etc. I am currently having trouble reading the data from these files into multiple arrays.
My solution process so far is as follows;
import numpy as np
import os

# Create an empty list to read filenames into
filenames = []
for file in os.listdir('C:\\folderwherefileslive'):
    filenames.append(file)
This works so far, what I'd like to do next is to iterate over the filenames in the list using numpy.genfromtxt.
I'm trying to use os.path join to put the individual list entry at the end of the path specified in listdir earlier. This is some example code:
for i in filenames:
    file_name = os.path.join('C:\\entryfromabove','i')
    'data_'+[i] = np.genfromtxt('file_name',skiprows=2,delimiter=',')
This piece of code returns "Invalid syntax".
To sum up the solution process I'm trying to use so far:
1. Use os.listdir to get all the filenames in the folder I'm looking at.
2. Use os.path.join to direct np.genfromtxt to open and read data from each file to a numpy array named after that file.
I'm not experienced with python by any means - any tips or questions on what I'm trying to achieve are welcome.
For this kind of task you'd want to use a dictionary.
import os
import numpy as np

data = {}
for file in os.listdir('C:\\folderwherefileslive'):
    path = os.path.join('C:\\folderwherefileslive', file)
    # note: np.genfromtxt takes skip_header, not skiprows
    data[file] = np.genfromtxt(path, skip_header=2, delimiter=',')

# now you could for example access
data['foo.txt']
Notice that everything you put within single or double quotes ends up being a character string, so 'file_name' will just be those characters, whereas using file_name would use the value stored in the variable by that name.
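A quick illustration of that quoting point (the filename value here is made up):

```python
file_name = "measurements.csv"  # hypothetical value

as_literal = 'file_name'   # just the nine characters f, i, l, e, ...
as_variable = file_name    # the string stored in the variable
print(as_literal, as_variable)
# file_name measurements.csv
```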

Automatically find files that start with similar strings (and find these strings) using Python

I have a directory with a number of files in a format similar to this:
"ABC_01.dat", "ABC_02.dat", "ABC_03-08.dat", "DEF_13.dat", "DEF_14.dat", "DEF_16.dat", "GHI_09.dat", "GHI_12-14.dat"
etc., you get the idea. Essentially, what I want to do is merge all files whose names start with a similar string. At the moment, I do this by manually setting a variable names = ["ABC", "DEF", "GHI"], iterating over it (for name in names) and getting the respective filenames using glob glob.glob(name + "*.dat"). The merging step is later done using pandas. I don't just need the names/prefixes for finding the files; they are used later in my script to set the output files' names.
Is there a way I can automatically generate the variable names if I know that the files are all in the format name_*.dat?
Consider this:
from glob import glob

names = set(name.rpartition('_')[0] for name in glob('*_*.dat'))
This will get all unique prefixes before '_'. You will also want to set a correct path in glob() before matching.
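A runnable sketch with the filenames from the question hardcoded in place of the glob() call:

```python
fnames = ["ABC_01.dat", "ABC_02.dat", "ABC_03-08.dat",
          "DEF_13.dat", "GHI_09.dat", "GHI_12-14.dat"]
# rpartition('_') splits on the last underscore; [0] is everything before it
names = set(name.rpartition('_')[0] for name in fnames)
print(sorted(names))
# ['ABC', 'DEF', 'GHI']
```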
You can do this:
result = [[f for f in fileNames if f.startswith(sn)]
          for sn in set(i.split('_')[0] for i in glob.glob("*.*"))]
print(result)
output:
[['ABC_01.dat', 'ABC_02.dat', 'ABC_03-08.dat'], ['GHI_09.dat', 'GHI_12-14.dat'], ['DEF_13.dat', 'DEF_14.dat', 'DEF_16.dat']]
Now, all files from result[0] are to be merged; similarly for result[1],...
