Open many txt files and sort into two dfs

I need to open and process hundreds of .txt files into two data frames, based on their names and the names of the folders that contain them.
The folder structure:
I have a single folder containing a number of sub-folders, each named with the date on which the data was recorded, in the format YYYY-MM-DD (example: 2019-0-14).
The file structure:
In each of the above folders there are 576 files: two sets of measurements (one per location), taken every 5 minutes over each 24-hour period (12*24*2 = 576). The files are named as below:
hhmmssILAC3octf.txt for the indoor location
hhmmssOLAC3octf.txt for the outdoor location
Here hhmmss is the hour, minute and second of each 5-minute file, IL is indoors and OL is outdoors.
File contents:
Every file contains 5 rows of data, one for every minute. Each row holds the same type and number of values, separated by commas.
What I am trying to achieve:
I need to create two data frames, one for each location, each with a datetime index built from the date (folder name) and the time (file name plus the line number, 1 to 5, within the .txt file).
I also need to rename all the columns once imported, using the same names for both locations but prefixed with indoor or outdoor according to location. Example: indoor_20hz.
I use Python and pandas, but have never tried to solve a problem like this. Could someone please point me in the right direction?
Thank you.

You could start with the following code:
import os
import fnmatch
import pandas as pd

start_directory = '.'  # change this
df_result = None
for path, dirs, files in os.walk(start_directory):
    for file in fnmatch.filter(files, '*.txt'):
        full_name = os.path.join(path, file)
        df_tmp = pd.read_csv(full_name)
        # add the line number
        df_tmp['line_number'] = range(df_tmp.shape[0])
        # add the code here that generates the additional infos
        # you need and adds them to the df
        # then concatenate the files together
        if df_result is None:
            df_result = df_tmp
        else:
            df_result = pd.concat([df_result, df_tmp], axis='index', ignore_index=True)
As a result, you should have the content of all files in df_result. You need to make sure that the files all have the same column structure, otherwise you need to fix that in the loop above. You also need to add the code that generates the additional infos at the place marked by the comment inside the loop.
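The "additional infos" step could be sketched as a small helper that turns a folder name and a file name into a location and a start time. This is only an illustration: the name parse_measurement_name is made up here, the date 2019-01-14 is an arbitrary example, and the helper assumes the folder is named YYYY-MM-DD and the file hhmmssILAC3octf.txt or hhmmssOLAC3octf.txt as described in the question.

```python
from datetime import datetime


def parse_measurement_name(folder_name, file_name):
    """Hypothetical helper: return (location, start_time) for one file."""
    # Characters 0-5 of the file name are hhmmss, characters 6-7 are 'IL' or 'OL'.
    time_part = file_name[:6]
    location = {"IL": "indoor", "OL": "outdoor"}[file_name[6:8]]
    start_time = datetime.strptime(folder_name + " " + time_part, "%Y-%m-%d %H%M%S")
    return location, start_time


print(parse_measurement_name("2019-01-14", "000500ILAC3octf.txt"))
```

Inside the loop, the two returned values could be stored in new columns of df_tmp before the frames are concatenated.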

My final solution, although I'm sure this isn't the most elegant way to get the final result:
import os
import fnmatch
import pandas as pd

start_directory = 'DIR'  # change this
df_result = None
for path, dirs, files in os.walk(start_directory):
    for file in fnmatch.filter(files, '*.txt'):
        full_name = os.path.join(path, file)
        df_tmp = pd.read_csv(full_name, header=None)
        df_tmp['date'] = os.path.basename(path)
        df_tmp['file'] = os.path.basename(file)
        # add the line number
        df_tmp['line_number'] = range(df_tmp.shape[0])
        # then concatenate the files together
        if df_result is None:
            df_result = df_tmp
        else:
            df_result = pd.concat([df_result, df_tmp], axis='index', ignore_index=True)

# Slice filename from 6 to 7 to get location ('I' or 'O')
df_result['location'] = df_result['file'].str.slice(6, 7)
# Slice filename from 0 to 6 to get time
df_result['time'] = df_result['file'].str.slice(0, 6)
# Combine date and time and format as datetime
df_result['date'] = pd.to_datetime(df_result['date'] + ' ' + df_result['time'], errors='raise', dayfirst=False)
# Round all the datetimes to the nearest 5 min
df_result['date'] = df_result['date'].dt.round('5min')
# Add line number as minutes to the date
df_result['date'] = df_result['date'] + pd.to_timedelta(df_result['line_number'], unit='m')
del df_result['file']
del df_result['line_number']
del df_result['time']
# Make the date the index in df (this also removes the date column)
df_result = df_result.set_index('date')
# Change columns and rename df_result
df_result.columns = [
    '10hz', '12.5hz', '16hz', '20hz', '25hz', '31.5hz', '40hz', '50hz',
    '63hz', '80hz', '100hz', '125hz', '160hz', '200hz', '250hz', '315hz',
    '400hz', '500hz', '630hz', '800hz', '1000hz', '1250hz', '1600hz',
    '2000hz', '2500hz', '3150hz', '4000hz', '5000hz', '6300hz', '8000hz',
    '10000hz']
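The question also asks for two separate data frames with prefixed column names, which the code above does not show. A minimal sketch of that last step follows, using a tiny stand-in for df_result with made-up values. It assumes the band columns are already named and that a 'location' column holding 'I' or 'O' is still present; the names frames, df_indoor and df_outdoor are illustrative only.

```python
import pandas as pd

# Tiny stand-in for df_result: two band columns plus the 'location' letter.
index = pd.to_datetime(["2019-01-14 00:00", "2019-01-14 00:01",
                        "2019-01-14 00:00", "2019-01-14 00:01"])
df_result = pd.DataFrame(
    {"10hz": [30.1, 30.4, 41.0, 41.3],
     "12.5hz": [28.7, 28.9, 39.5, 39.8],
     "location": ["I", "I", "O", "O"]},
    index=index,
)

frames = {}
for letter, prefix in (("I", "indoor_"), ("O", "outdoor_")):
    # Keep the rows of one location, drop the helper column, prefix the rest.
    part = df_result[df_result["location"] == letter]
    frames[prefix.rstrip("_")] = (
        part.drop(columns="location").add_prefix(prefix).sort_index()
    )

df_indoor = frames["indoor"]
df_outdoor = frames["outdoor"]
print(df_indoor.columns.tolist())
```

Splitting before renaming keeps one shared list of band names, and add_prefix then produces names such as indoor_10hz without writing the list twice.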

Related

How to extract a specific value from multiple csv of a directory, and append them in a dataframe?

I have a directory with hundreds of csv files, each representing the pixels of a thermal camera image (288x383). I want to get the center value of each file (e.g. row 144, column 191) and collect those values in a dataframe alongside the name of each file.
Here is my code, where I created the dataframe with the list of csv files:
import os
import glob
import numpy as np
import pandas as pd
os.chdir("/Programming/Proj1/Code/Image_Data")
!ls
Out:
2021-09-13_13-42-16.csv
2021-09-13_13-42-22.csv
2021-09-13_13-42-29.csv
2021-09-13_13-42-35.csv
2021-09-13_13-42-47.csv
2021-09-13_13-42-53.csv
...
file_extension = '.csv'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
files = glob.glob('*.csv')
all_df = pd.DataFrame(all_filenames, columns=['Full_name'])
all_df.head()
Full_name
0 2021-09-13_13-42-16.csv
1 2021-09-13_13-42-22.csv
2 2021-09-13_13-42-29.csv
3 2021-09-13_13-42-35.csv
4 2021-09-13_13-42-47.csv
5 2021-09-13_13-42-53.csv
6 2021-09-13_13-43-00.csv

You can loop through your files one by one, reading them in as a dataframe and taking the center value that you want. Then save this value along with the file name. This list of results can then be read in to a new dataframe ready for you to use.
result = []
for file in files:
    # read in the file; header=None assumes the csv holds only pixel values
    # (check the pandas docs for read_csv if your files differ)
    df = pd.read_csv(file, header=None)
    # now select the value you want by position;
    # 144, 191 is the center pixel mentioned in the question
    row, col = 144, 191
    value = df.iloc[row, col]
    # append to the list
    result.append((file, value))
# you should now have a list in the format:
# [('2021-09-13_13-42-16.csv', 100), ('2021-09-13_13-42-22.csv', 255), ...
# load the list of tuples as a dataframe for further processing or analysis...
result_df = pd.DataFrame(result, columns=['Full_name', 'center_value'])
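To try the idea without the real camera files, here is a self-contained sketch that writes two small fake images to a temporary folder and picks the center pixel by integer division of the shape. The 5 x 7 size, the file names a.csv and b.csv and the column name center are made up for the demo, and the files are assumed to have no header row. For a 288 x 383 image the same arithmetic gives row 144 and column 191.

```python
import glob
import os
import tempfile

import numpy as np
import pandas as pd

# Build two small fake "images" (5 x 7 instead of 288 x 383) for the demo.
tmp_dir = tempfile.mkdtemp()
for name, fill in (("a.csv", 1.0), ("b.csv", 2.0)):
    frame = np.full((5, 7), fill)
    frame[2, 3] = fill * 100          # mark the center pixel
    np.savetxt(os.path.join(tmp_dir, name), frame, delimiter=",")

rows = []
for path in sorted(glob.glob(os.path.join(tmp_dir, "*.csv"))):
    pixels = pd.read_csv(path, header=None).to_numpy()
    # Center pixel: half the number of rows, half the number of columns.
    center = pixels[pixels.shape[0] // 2, pixels.shape[1] // 2]
    rows.append({"Full_name": os.path.basename(path), "center": center})

result_df = pd.DataFrame(rows)
print(result_df)
```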

Manipulating the values of each file in a folder using a dictionary and loop

How do I go about manipulating each file of a folder based on values pulled from a dictionary? Basically, say I have x files in a folder. I use pandas to reformat the dataframe, add a column which includes the date of the report, and save the new file under the same name and the date.
import pandas as pd
from pathlib2 import Path
import os

source = Path("Users/Yay/AlotofFiles/April")
items = os.listdir(source)
d_dates = {'0401': '04/1/2019', '0402': '4/2/2019', '0403': '04/03/2019'}
for item in items:
    for key, value in d_dates.items():
        df = pd.read_excel(item, header=None)
        df.columns = ['A', 'B', 'C']
        df[df['A'].str.contains("Awesome")]
        df['Date'] = value
        file_basic = "retrofile"
        short_date = key
        xlsx = ".xlsx"
        file_name = file_basic + short_date + xlsx
        df.to_excel(file_name)
I want each file to be unique and categorized by the date. In this case, I would want to have three files, for example "retrofile0401.xlsx" that has a column that contains "04/01/2019" and only has data relevant to the original file.
What actually happens is that the code loops over each item, creates three files (one per dictionary entry), moves on to the next item and overwrites them, until I am left with three files that are all copies of the last input file. Each file has a different date and a different name, which is what I want, but the data in all of them is duplicated from the last file.
If I remove the second loop, it works the way I want it but there's no way of categorizing it based on the value I made in the dictionary.

Try the following. I'm only making input filenames explicit to make clear what's going on. You can continue to use yours from the source.
import pandas as pd

input_filenames = [
    'retrofile0401_raw.xlsx',
    'retrofile0402_raw.xlsx',
    'retrofile0403_raw.xlsx',
]
date_dict = {
    '0401': '04/1/2019',
    '0402': '4/2/2019',
    '0403': '04/03/2019',
}
for filename in input_filenames:
    date_key = filename[9:13]
    df = pd.read_excel(filename, header=None)
    df.columns = ['A', 'B', 'C']  # column names as in the question
    # keep only the matching rows (the result has to be assigned to take effect)
    df = df[df['A'].str.contains("Awesome")].copy()
    df['Date'] = date_dict[date_key]
    df.to_excel('retrofile{date_key}.xlsx'.format(date_key=date_key))
filename[9:13] takes the characters at positions 9 to 12 of the filename (counting from 0). Those are the ones that correspond to your date codes.
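Fixed slice positions only work while every file name starts with the same nine-character prefix. If the real names vary, one option is to search for the four-digit code instead. The sketch below is an assumption-laden illustration: date_key_from_name is a made-up helper, the names April_0402.xlsx and notes.xlsx are invented, and it assumes the date code is the first run of four digits in the name.

```python
import re

date_dict = {"0401": "04/1/2019", "0402": "4/2/2019", "0403": "04/03/2019"}


def date_key_from_name(filename):
    """Hypothetical helper: return the first run of four digits, or None."""
    match = re.search(r"\d{4}", filename)
    return match.group(0) if match else None


for name in ["retrofile0401_raw.xlsx", "April_0402.xlsx", "notes.xlsx"]:
    key = date_key_from_name(name)
    # .get() avoids a KeyError for files that carry no known date code.
    print(name, "->", date_dict.get(key, "no date code found"))
```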

How to generate a dataframe from .txt files in multiple directories?

I have a directory ".../data" containing multiple subdirectories whose names are a serial number plus some irrelevant information, e.g. "17448_2017_Jul_2017_Oct", where the leading number is the serial number. Inside each subdirectory there are four ".txt" files, named the same way in every subdirectory. Each line holds a date and time followed by the value of an attribute of a certain type, say humidity, e.g. "2019-01-29 03:11:26 54.7". The first eight lines of each .txt file should be dropped.
What I am trying to write: code that generates a data frame for each serial number, with the serial number taken from the subdirectory name in a column called 'Machine', the date/time as the data frame index, and one column per attribute type (atr1, atr2, atr3 and atr4).
My first attempt was something like:
path = "/home/marlon/Shift One/Projeto Philips/Consolidação de Arquivos/dados"
for i in os.listdir(path):
    if os.path.isfile(os.path.join(path,i)) and '17884' in i:
        with open(path + i, 'r') as f:
But, as you can see, I'm completely lost... :/
Thank you so much for your help!

If I understand correctly, you could try something like this (note that it is intended as a starting point for testing and feedback, because I can't test it on my mobile at the moment):
import os
import pandas as pd

path = "/home/marlon/Shift One/Projeto Philips/Consolidação de Arquivos/dados/"
frames = []
for fld in os.listdir(path):
    subfld = os.path.join(path, fld)
    if os.path.isdir(subfld):
        aux = pd.DataFrame()
        sn = fld.split('_')[0]
        for file in os.listdir(subfld):
            filepath = os.path.join(subfld, file)
            if os.path.isfile(filepath):
                new_col = pd.read_fwf(filepath, colspecs=[(0, 19), (20, -1)], skiprows=8, header=None, parse_dates=[0], index_col=0)
                aux = pd.concat([aux, new_col], axis=1)
        aux['Machine'] = sn
        frames.append(aux)
# DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate once
df = pd.concat(frames) if frames else pd.DataFrame()
However, I wonder if your 4 measurement files per folder all have the same index time values, otherwise there will be a problem concatenating them.
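To see what happens when the timestamps differ between files, here is a small sketch with two made-up series standing in for two attribute files. pd.concat with axis=1 aligns on the index: the default outer join keeps every timestamp and fills the gaps with NaN, while join="inner" keeps only the timestamps present in both. The values and the names atr1 and atr2 are illustrative, and duplicate timestamps within one file would be a separate problem not covered here.

```python
import pandas as pd

atr1 = pd.Series(
    [54.7, 55.0],
    index=pd.to_datetime(["2019-01-29 03:11:26", "2019-01-29 03:12:26"]),
    name="atr1",
)
atr2 = pd.Series(
    [21.3, 21.4],
    index=pd.to_datetime(["2019-01-29 03:12:26", "2019-01-29 03:13:26"]),
    name="atr2",
)

# Default join="outer": keep every timestamp, fill missing values with NaN.
outer = pd.concat([atr1, atr2], axis=1)
# join="inner": keep only timestamps that appear in every input.
inner = pd.concat([atr1, atr2], axis=1, join="inner")
print(outer)
print(inner)
```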

Grab files with current year and last 5 years in the name and concatenate into 1 dataframe

I am trying to create a function that will concatenate files going back a certain number of full years and also include the current year file. I have all of the files named the same except for the year at the end (e.g. Data2010, Data2011...Data2018)
Right now I have it set up to pull all the files and concatenate them into one dataframe, but I'm not sure how to write the function that pulls only certain years based on the current year and a number I provide.
*Edit: is it possible to write the function so that this will always work without making edits to the file as the year changes? So the function would read the current year through datetime or something and know what the last 5 years are?
import pandas as pd
import datetime
import os
import glob
qms = os.path.join('X:', 'JY', 'Analyst', 'Data')
today = datetime.datetime.today()
#Pulling all files and concatenating, needs to pull only last 5 + current
warranty_files = glob.glob(os.path.join(qms, '*.csv'))
warranty_list = []
for file_ in warranty_files:
    df = pd.read_csv(file_, index_col=None, header=0)
    warranty_list.append(df)
warranty = pd.concat(warranty_list)
# def get_warranty(years): #want this to be the start of function

If you need to make specific selections, glob also allows you to do this.
I made a folder with 3 text files labeled, Data2010, Data2011, Data2013, and I can pick all the files after 2010 like so:
files = glob.glob("/path/to/folder/"+"Data201[1-9].txt")
for file in files:
    print(file)
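In other words, you should be able to use glob's wildcard patterns to further customize file selection. Character ranges such as [1-9] look like regex, but glob uses shell-style wildcards rather than full regular expressions. Once you select the right files you can then concatenate them into a pd.DataFrame.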
Grabbing the current and last five years in my example above would look like this, "Data201[3-8].txt". If there's some text before that portion of the filename, add an asterisk *: "*Data201[3-8].txt". Let me know if something isn't clear!
EDIT: OP asked for an automated way to select their files based on current year. Here's a method that does so. Test it out!
import datetime
import glob

path = "C:\\Users\\David\\Desktop\\test\\"

def get_files(path, n=5):
    files = []  # list to append to
    current_year = datetime.datetime.today().year  # current year
    # current year plus the n years before it, as strings
    last_n_years = [str(current_year - i) for i in range(0, n + 1)]
    for year in last_n_years:
        files_ = glob.glob(path + "*Data%s.csv" % year)  # grab csv files per year
        files.extend(files_)  # add every match (nothing happens if there are none)
    return files

files = get_files(path, n=5)
print(files)
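The question's end goal is one concatenated data frame, so here is a sketch that combines the year selection with the loading step. Everything in it is illustrative: load_recent_years is a made-up name, the current year is passed in as a parameter so the demo gives the same result whenever it is run, and the throwaway files with a single claims column exist only for the example. In real use, current_year could be datetime.datetime.today().year.

```python
import glob
import os
import tempfile

import pandas as pd


def load_recent_years(folder, current_year, n=5):
    """Hypothetical helper: concatenate Data<year>.csv for current_year and the n years before it."""
    frames = []
    for year in range(current_year - n, current_year + 1):
        for path in glob.glob(os.path.join(folder, "*Data%d.csv" % year)):
            # Record which year each row came from.
            frames.append(pd.read_csv(path).assign(year=year))
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Demo with throwaway files for 2010, 2015 and 2018.
folder = tempfile.mkdtemp()
for year in (2010, 2015, 2018):
    pd.DataFrame({"claims": [year % 100]}).to_csv(
        os.path.join(folder, "Data%d.csv" % year), index=False)

warranty = load_recent_years(folder, current_year=2018, n=5)
print(warranty)
```

With current_year=2018 and n=5 the window is 2013 to 2018, so the 2010 file is skipped and years without a file are simply ignored.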

Python: Batch rename files in a directory using a predefined list, sorting by date created

I am downloading a number of PDF documents from an online repository, but they are not coming through with the proper naming conventions. The files align with a list of names that I have located in an Excel spreadsheet.
What I would like to do is import the Excel spreadsheet, assign the names to a variable, and then use os.rename() to rename the files I have downloaded as a batch in order to match my list.
When I download the PDFs, each is given a random name rather than being named after the URL. The names are randomly generated each time the link is chosen. This is a problem because I cannot sort the documents into the proper order for naming them.
What I would like to do is sort the documents by "date created". By using sleep() I have the documents downloaded in the correct order, matching the instrument numbers, but I cannot figure out how to line them up properly to iterate through the names I would like to change.
Here is a sample of my code:
#Import packages
import pandas as pd
from selenium import webdriver
import os
#Designate file locations / destinations
file = '/Users/username/Desktop/test.xlsx'
directory = '/Users/username/Downloads'
#Obtain instrument names
xl = pd.ExcelFile(file)
df1 = xl.parse('Sheet1', parse_cols=[2], names=['instrument'])
names = df1.instrument
prefix = xyz
#Obtain file location
imported_files = os.listdir(directory)
imported_files.remove('.DS_Store')
df1['importedFiles'] = imported_files
print(df1)
instrument importedFiles
0 146169-1975 2461030_123.PDF
1 147235-1975 2461030_2027.PDF
2 148367-1975 2461030_348.PDF
3 149563-1975 2461030_5327.PDF
4 171413-1977 2461030_555.PDF
5 186305-1977 2461030_5969.PDF
6 186726-1977 2461030_7610.PDF
7 186727-1978 2461030_7878.PDF
8 187748-1978 2461030_8733.PDF
#Set working directory
os.chdir('/Users/username/Downloads')
#Set a loop to rename
for x, y in zip(names, os.listdir('/Users/username/Downloads')):
    file_name, file_ext = os.path.splitext(y)
    new_names = ('{}_{}{}'.format(prefix, x, file_ext))
    print(new_names)
    os.rename(y, new_names)
    sleep(0.5)
When I print "new_names", the names come out in the correct order in my console. However, when I take the next step and actually rename the files, the renaming doesn't work because the randomly generated names of the imported files are not in that order.
How can I make sure that the file names change in the same order that the files come in? Or how can I change the order of the files so that when I name them, they match the incoming instrument strings?
Thank you!

I was able to find an answer to my own question! In order to rename the files based on the instrument numbers I was pulling in from the Excel spreadsheet, I first had to reorganize the files I was downloading, which were being given random numbers.
I followed this video https://www.youtube.com/watch?v=hZP3y-gxyJg and used os.path.getatime on the files in my directory to get a timestamp for each one, and then used a renaming loop to name them by that timestamp. Strictly speaking, getatime returns the last access time rather than the creation time, but this organized the files the way that I wanted, and I was able to rename them in the order I wanted. Here is the code I used:
import os
import time
from time import sleep

# instrument_numbers and county are defined elsewhere in my script (not shown here)
iterfiles = iter(os.listdir('/Users/username/Desktop'))
next(iterfiles)  # skip the first entry
for file_time in iterfiles:
    time_stamp = os.path.getatime(file_time)
    local_time = time.ctime(time_stamp)
    ext = 'PDF'
    print(local_time)
    time_name = ('{}.{}'.format(local_time, ext))
    os.rename(file_time, time_name)
    sleep(0.5)
#--------RENAME FILES BASED ON NAME----------#
iterinstrument = iter(os.listdir('/Users/username/Desktop'))
next(iterinstrument)  # skip the first entry
for x, y in zip(instrument_numbers, iterinstrument):
    file_name, file_ext = os.path.splitext(y)
    number, year = x.split('-')
    number = number.zfill(7)
    new_names = ('{}-{}{}{}'.format(county, year, number, file_ext))
    print(new_names)
    os.rename(y, new_names)
    sleep(0.5)
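A possible alternative to the two-pass rename is to sort the file names by a timestamp and rename them once. The sketch below is not the method from the video: it uses the modification time (os.path.getmtime) as a stand-in for download order, which is an assumption that often holds for freshly downloaded files but is not guaranteed. The helper files_oldest_first, the temporary directory and the explicit timestamps set with os.utime exist only for the demo.

```python
import os
import tempfile
import time


def files_oldest_first(directory, extension=".pdf"):
    """Hypothetical helper: matching file names, sorted by modification time."""
    names = [n for n in os.listdir(directory) if n.lower().endswith(extension)]
    return sorted(names, key=lambda n: os.path.getmtime(os.path.join(directory, n)))


# Demo: create three empty files and give them explicit modification times.
directory = tempfile.mkdtemp()
now = time.time()
for offset, name in enumerate(["2461030_555.PDF", "2461030_123.PDF", "2461030_348.PDF"]):
    path = os.path.join(directory, name)
    with open(path, "w"):
        pass
    os.utime(path, (now + offset, now + offset))

instruments = ["146169-1975", "147235-1975", "148367-1975"]
renamed = {}
for instrument, old_name in zip(instruments, files_oldest_first(directory)):
    new_name = instrument + os.path.splitext(old_name)[1]
    # Use full paths so the result does not depend on the working directory.
    os.rename(os.path.join(directory, old_name), os.path.join(directory, new_name))
    renamed[old_name] = new_name

print(sorted(os.listdir(directory)))
```

Filtering by extension also skips entries such as .DS_Store, so there is no need to drop the first directory entry by hand.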
