How to generate a dataframe from .txt files in multiple directories? - python

I have a directory ".../data" which contains multiple subdirectories whose names consist of a serial number plus some useless information - e.g. "17448_2017_Jul_2017_Oct", where the first number is the serial number. Inside each subdirectory there are four ".txt" files, named the same way in every subdirectory, whose lines contain a date/time and an attribute of a certain type, say humidity - e.g. "2019-01-29 03:11:26 54.7". The first eight lines at the top of each .txt file should be dropped as well.
What I am trying to program: code that generates a data frame for each serial number, with the serial number taken from the subdirectory name in a column called 'Machine', the date/time as the data frame index, and each type of attribute as a column, such as atr1, atr2, atr3, and atr4.
My first trial was something as:
path = "/home/marlon/Shift One/Projeto Philips/Consolidação de Arquivos/dados"
for i in os.listdir(path):
if os.path.isfile(os.path.join(path,i)) and '17884' in i:
with open(path + i, 'r') as f:
But, as you can see, I'm completely lost... :/
Thank you so much for your help!

IIUC, you could try something like this (note this is intended as a starting point for testing and feedback, because I can't test it on my mobile at the moment):
import os
import pandas as pd

path = "/home/marlon/Shift One/Projeto Philips/Consolidação de Arquivos/dados/"
df = pd.DataFrame()
for fld in os.listdir(path):
    subfld = path + fld
    if os.path.isdir(subfld):
        aux = pd.DataFrame()
        sn = fld.split('_')[0]
        for file in os.listdir(subfld):
            filepath = os.path.join(subfld, file)
            if os.path.isfile(filepath):
                new_col = pd.read_fwf(filepath, colspecs=[(0, 19), (20, -1)], skiprows=8, header=None, parse_dates=[0], index_col=0)
                aux = pd.concat([aux, new_col], axis=1)
        aux['Machine'] = sn
        df = pd.concat([df, aux])
However, I wonder whether your 4 measurement files per folder all have the same index time values; otherwise there will be a problem concatenating them.
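If the four files do not share exactly the same timestamps, one way around it (a minimal sketch; txt_paths and the atr1..atr4 column names are hypothetical placeholders for the four file paths and attributes) is to name each single-value column and rely on an outer join when concatenating, so missing timestamps simply become NaN:
import pandas as pd

# hypothetical list of the four .txt paths inside one subdirectory
txt_paths = ['atr1.txt', 'atr2.txt', 'atr3.txt', 'atr4.txt']

frames = []
for i, filepath in enumerate(sorted(txt_paths), start=1):
    col = pd.read_fwf(filepath, colspecs=[(0, 19), (20, -1)], skiprows=8,
                      header=None, parse_dates=[0], index_col=0)
    col.columns = [f'atr{i}']   # one value column per file
    frames.append(col)

# join='outer' keeps the union of all timestamps; gaps become NaN
aux = pd.concat(frames, axis=1, join='outer').sort_index()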

Related

Merging csv files into one (columnwise) in Python

I have many .csv files like this (with one column):
[picture]
I'd like to merge them into one .csv file, so that each column will contain the data of one of the csv files. The headings should be like this (when converted to spreadsheet):
[picture] (the first number is the number of minutes extracted from the file name, the second is the first word in the file name after "export_", and the third is the whole name of the file).
I'd like to work in Python.
Can someone please help me with this? I am new to Python.
Thank you very much.
I tried to join only 2 files, but I have no idea how to do it with more files without writing them all down manually. Also, I don't know how to extract headings from the file names:
import pandas as pd

file_list = ['export_Control 37C 4h_Single Cells_Single Cells_Single Cells.csv', 'export_Control 37C 0 min_Single Cells_Single Cells_Single Cells.csv']
df = pd.DataFrame()
for file in file_list:
    temp_df = pd.read_csv(file)
    df = pd.concat([df, temp_df], axis=1)
print(df)
df.to_csv('output2.csv', index=False)
Assuming that your .csv files all have a header and the same number of rows, you can use the code below to put all the (single-column) .csv files one beside the other in a single Excel worksheet.
import os
import pandas as pd

csv_path = r'path_to_the_folder_containing_the_csvs'
csv_files = [file for file in os.listdir(csv_path)]

list_of_dfs = []
for file in csv_files:
    temp = pd.read_csv(csv_path + '\\' + file, header=0, names=['Header'])
    time_number = pd.DataFrame([[file.split('_')[1].split()[2]]], columns=['Header'])
    file_title = pd.DataFrame([[file.split('_')[1].split()[0]]], columns=['Header'])
    file_name = pd.DataFrame([[file]], columns=['Header'])
    out = pd.concat([time_number, file_title, file_name, temp]).reset_index(drop=True)
    list_of_dfs.append(out)

final = pd.concat(list_of_dfs, axis=1, ignore_index=True)
final.columns = ['Column' + str(col+1) for col in final.columns]
final.to_csv(csv_path + '\\output.csv', index=False)
final
For example, considering three .csv files, running the code above yields:
[Output in Jupyter]
[Output in Excel]

create single dataframe from multiple csv sources and create single excel file from dataframe

I need help with my Python code.
The goal is:
read in between 100 and 200 CSV files that are in a folder
copy a variable in each CSV file from position (2,2)
create the sum of all values of column 17 in every CSV
transfer the values in the form of a dataframe
create a new Excel file
transfer the dataframe in the Excel file
My attempt was the following code:
# import necessary libraries
import pandas as pd
import os
import glob

# use glob to get all the csv files
# in the folder
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))

# loop over the list of csv files
for f in csv_files:
    # read the csv file
    df = pd.read_csv(f, sep=';', skiprows=2, usecols=[2,16], header=None)
    # ID
    ID = (df.loc[2][2])
    # sum of col. 16
    dat_Verbr = df[16].sum()
    # data in single dataframe
    df4 = pd.DataFrame({'SIM-Karte': ID, 'Datenverbrauch': dat_Verbr}, index=[0,1,2,3,4,5])

# Specify the name of the excel file
file_name = 'Auswertung.xlsx'
# saving the excelsheet
concatenated.to_excel(file_name)
print(' record successfully exported into Excel File')
Unfortunately, it doesn't work.
The problem is that only the first ID and first sum end up in the Excel file.
How can I work with the index to create a single dataframe? I don't know the exact number of csv files, only that it is somewhere between 100 and 200.
I'm a beginner with python.
Can someone help me please?
You can use the updated code below. One assumption I made is that there is data in all rows 1 through 16. If your file has just ;;;;... in the first row, read_csv sometimes makes a mistake. Also, as you are using skiprows=1, it will not add the value in row 1, column 17 if present; you may need to change the code if that needs to be included. The rest I have corrected/changed so the code works. Note that in to_excel I have used index=False as I didn't think you need the index to be added. Remove it if you want to see the index as well.
# use glob to get all the csv files
# in the folder
import os, glob
import pandas as pd

path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))

# data in single dataframe
df4 = pd.DataFrame(columns=['SIM-Karte', 'Datenverbrauch'])

# loop over the list of csv files
for f in csv_files:
    # read the csv file
    df = pd.read_csv(f, sep=';', skiprows=1, usecols=[1,16], header=None)
    # ID
    ID = df.iloc[0][1]
    # sum of col. 16
    dat_Verbr = df[16].sum()
    df4.loc[len(df4.index)] = [ID, dat_Verbr]

# Specify the name of the excel file
file_name = 'Auswertung.xlsx'
# saving the excelsheet
df4.to_excel(file_name, index=False)
print(' record successfully exported into Excel File')
Output excel - I had 3 files in the folder
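If the first row of your files does contain a value in column 17 that should be counted, a minimal variation of the loop body (just a sketch; I'm assuming the ID sits in the second row, second column of the raw file, as in your description of position (2,2)) could be:
for f in csv_files:
    # read the whole file, including the first row
    df = pd.read_csv(f, sep=';', header=None)
    # ID from the second row, second column of the raw file
    ID = df.iloc[1, 1]
    # coerce column 17 to numeric so any non-numeric cells become NaN, then sum
    dat_Verbr = pd.to_numeric(df[16], errors='coerce').sum()
    df4.loc[len(df4.index)] = [ID, dat_Verbr]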

Merge files with a for loop

I have over two thousands csv files in a folder as follows:
University_2010_USA.csv, University_2011_USA.csv, Education_2012_USA.csv, Education_2012_Mexico.csv, Education_2012_Argentina.csv,
and
Results_2010_USA.csv, Results_2011_USA.csv, Results_2012_USA.csv, Results_2012_Mexico.csv, Results_2012_Argentina.csv,
I would like to match the first set of csv files with the second set based on the "year" (2012, etc.) and "country" (Mexico, etc.) in the file name. Is there a way to do so quickly? Both sets of csv files have the same column names, and I'm looking at the following code:
df0 = pd.read_csv('University_2010_USA.csv')
df1 = pd.read_csv('Results_2010_USA.csv')
new_df = pd.merge(df0, df1, on=['year','country','region','sociodemographics'])
So basically, I'd need help to write a for-loop that iterates over the datasets... Thanks!
Try this:
from pathlib import Path
import pandas as pd

university = []
results = []

for file in Path('/path/to/data/folder').glob('*.csv'):
    # Determine the properties from the file's name
    file_type, year, country = file.stem.split('_')
    if file_type not in ['University', 'Results']:
        continue
    # Make the data frame, with 2 extra columns using properties
    # we extracted from the file's name
    tmp = pd.read_csv(file).assign(
        year=int(year),
        country=country
    )
    if file_type == 'University':
        university.append(tmp)
    else:
        results.append(tmp)

df = pd.merge(
    pd.concat(university),
    pd.concat(results),
    on=['year', 'country', 'region', 'sociodemographics']
)

How to read several xlsx-files in a folder into a pandas dataframe

I have a folder. In this folder there are 48 xlsx files, but only 22 of them are relevant. The names of these 22 files have no structure; the only thing they have in common is that the filenames start with data. I would love to access these files and read them all into a dataframe. Doing this manually with the code line
df = pd.read_excel(filename, engine='openpyxl')
takes too long.
The table structure is similar but not always exactly the same. How can I solve this problem?
import os
import pandas as pd

dfs = {}

def get_files(extension, location):
    xlsx_list = []
    for root, dirs, files in os.walk(location):
        for t in files:
            if t.endswith(extension):
                # keep the full path so read_excel also works for files in subfolders
                xlsx_list.append(os.path.join(root, t))
    return xlsx_list

file_list = get_files('.xlsx', '.')

for filename in file_list:
    df = pd.read_excel(filename, engine='openpyxl')
    dfs[filename] = df

print(dfs)
Each element in dfs, like dfs['file_name_here.xlsx'], accesses the data frame returned by read_excel.
EDIT: You can add additional criteria to filter the xlsx files at the line if t.endswith(extension):. You can also check the beginning of the filename, e.g. if t.startswith('data'):. Maybe combine them: if t.startswith('data') and t.endswith(extension):
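If you then want all 22 files in a single dataframe despite the slightly different table structures, one option (a minimal sketch built on the dfs dict above; the source_file column is my own addition) is an outer-style concatenation, which fills columns missing from a given file with NaN:
import pandas as pd

for name, frame in dfs.items():
    # remember which file each row came from (optional)
    frame['source_file'] = name

# sort=False keeps the original column order where possible;
# columns missing in some files are filled with NaN
combined = pd.concat(dfs.values(), ignore_index=True, sort=False)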

Open many txt files and sort into two dfs

I need to open and process hundreds of .txt files based on their names and the folder names that they are contained within, into two data frames.
The folder structure:
I have a single folder, containing a number of sub-folders, each named with the date that the data was recorded, in this format: YYYY-MM-DD, for example: 2019-01-14
The file structure:
In each of the above folders, there are 576 files. There are two sets of measurements (based on 2 locations), taken every 5 minutes over each 24-hour period (12*24*2 = 576). The files are named as below:
hhmmssILAC3octf.txt for the indoor location
hhmmssOLAC3octf.txt for the outdoor location
where hhmmss is the hour, minute and second of each 5-minute file, IL is indoors and OL is outdoors.
File contents:
Every file contains 5 rows of data, one for every minute. The data is of the same type and the same length in every file, separated by commas.
What I am trying to achieve:
I need to create two data frames, one for each location, with the date (folder name) and time (file name and line position [lines 1:5]) as a datetime index, based on the folder the file is contained within, the name of the file and the line number in the .txt.
I also need to rename all the columns/variables once imported with the same names, but prefixed with indoor_ or outdoor_, based on the location. Example: indoor_20hz.
I use Python and Pandas myself, but have never tried to solve a problem like this. Please can someone point me in the right direction...
Thank you.
You could start with the following code:
import os
import fnmatch
import pandas as pd

start_directory = '.'  # change this

df_result = None
for path, dirs, files in os.walk(start_directory):
    for file in fnmatch.filter(files, '*.txt'):
        full_name = os.path.join(path, file)
        df_tmp = pd.read_csv(full_name)
        # add the line number
        df_tmp['line_number'] = range(df_tmp.shape[0])
        # add the code here that generates the infos
        # you additionally need here to the df
        # then concatenate the files together
        if df_result is None:
            df_result = df_tmp
        else:
            df_result = pd.concat([df_result, df_tmp], axis='index', ignore_index=True)
As a result, you should have the content of all files in df_result. But you need to make sure that the files have the same column structure; otherwise you need to fix that above. You also need to add the additional info you need in place of the "# add the code here that generates the infos" comment.
My final solution, albeit I'm sure this isn't the most elegant way to get the final result:
import os
import fnmatch
import pandas as pd

start_directory = 'DIR'  # change this

df_result = None
for path, dirs, files in os.walk(start_directory):
    for file in fnmatch.filter(files, '*.txt'):
        full_name = os.path.join(path, file)
        df_tmp = pd.read_csv(full_name, header=None)
        df_tmp['date'] = os.path.basename(path)
        df_tmp['file'] = os.path.basename(file)
        # df_tmp.set_index([df_tmp['date'], df_tmp['time']], inplace=True)
        # add the line number
        df_tmp['line_number'] = range(df_tmp.shape[0])
        # then concatenate the files together
        if df_result is None:
            df_result = df_tmp
        else:
            df_result = pd.concat([df_result, df_tmp], axis='index', ignore_index=True)

# Slice filename from 6 to 7 to get location
df_result['location'] = df_result['file'].str.slice(6, 7)
# Slice filename from 0 to 6 to get time
df_result['time'] = df_result['file'].str.slice(0, 6)
# Combine date and time and format as datetime
df_result['date'] = pd.to_datetime(df_result['date'] + ' ' + df_result['time'], errors='raise', dayfirst=False)
# Round all the datetimes to the nearest 5 min
df_result['date'] = df_result['date'].dt.round('5min')
# Add line number as minutes to the date
df_result['date'] = df_result['date'] + pd.to_timedelta(df_result['line_number'], unit='m')
del df_result['file']
del df_result['line_number']
del df_result['time']
# Make the date the index in df
df_result = df_result.set_index(df_result['date'])
# Delete date in df
del df_result['date']
# Change columns and rename df_result
df_result.columns = ['10hz', '12.5hz', '16hz', '20hz', '25hz', '31.5hz', '40hz', '50hz', '63hz', '80hz', '100hz', '125hz', '160hz', '200hz', '250hz', '315hz', '400hz', '500hz', '630hz', '800hz', '1000hz', '1250hz', '1600hz', '2000hz', '2500hz', '3150hz', '4000hz', '5000hz', '6300hz', '8000hz', '10000hz']
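The question also asks for two separate dataframes with indoor_/outdoor_ prefixed column names. One further step (a minimal sketch, assuming the 'location' column with values 'I' and 'O' is still present at this point rather than being renamed away above) could be:
# split by the location flag, drop it, and prefix the measurement columns
indoor = df_result[df_result['location'] == 'I'].drop(columns='location').add_prefix('indoor_')
outdoor = df_result[df_result['location'] == 'O'].drop(columns='location').add_prefix('outdoor_')
This yields column names such as indoor_20hz and outdoor_20hz.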
