Creating continuous arrays through a for loop iteration in Python

I am attempting to create an array that is continually added onto as I loop through multiple CSVs. The code I currently have creates multiple arrays within an array, but I would like it to be one continuous array. In the end, I would like to have DT_temp with its corresponding type_id in a dataframe. Any suggestions as to how I can accomplish this?
import glob
import numpy as np
import pandas as pd

path = ** my folder path
all_files = glob.glob(path + "\*.csv")

type_id = []
DT_temp = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    type_id.append(df['type_id'])  # type_id = a distinct number
    df_temp = df[['unix_time_stamp', 'temperature']].dropna()
    DT_temp.append(df_temp['unix_time_stamp'].diff())

DT_frame = pd.DataFrame(type_id)
DT_frame['TimeDiff_Temp'] = DT_temp
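One possible approach (a sketch, assuming each file's type_id column holds a single distinct value, as the comment suggests): build one small frame per file inside the loop and concatenate once at the end, which gives one continuous dataframe rather than a list of per-file arrays. The path here is hypothetical:
import glob
import pandas as pd

path = "my_folder"  # hypothetical path, replace with your own
frames = []
for filename in glob.glob(path + "/*.csv"):
    df = pd.read_csv(filename, header=0)
    df_temp = df[['unix_time_stamp', 'temperature']].dropna()
    frames.append(pd.DataFrame({
        'type_id': df['type_id'].iloc[0],  # one distinct id per file
        'TimeDiff_Temp': df_temp['unix_time_stamp'].diff(),
    }))
# One continuous dataframe instead of a list of per-file arrays
DT_frame = pd.concat(frames, ignore_index=True)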

Related

Read all the Excel files in a folder, split each file name, and add the split name to the dataframe

All files follow a naming convention such as NPS_Platform_FirstLabel_Session_Language_Version.xlsx.
I want to have additional columns like Platform, FirstLabel, Session, Language, and Version; these will be the column names, with their values determined by the filenames. I coded the following, and it works, but the values of the added columns come only from the last file. For example, assume that the last filename is NPS_MEM_GAIT_Science_EN_10.xlsx. Then all of the added columns' values are MEM, GAIT, Science, etc., not the values from each file's own name.
import glob
import os
import pandas as pd

path = "C:/Users/User/blabla"
all_files = glob.glob(os.path.join(path, "*.xlsx"))  # make list of paths

df = pd.DataFrame()
for f in all_files:
    data = pd.read_excel(f)
    df = df.append(data)
    file_name = os.path.splitext(os.path.basename(f))[0]
    nameList = file_name.rsplit('_')
    df['Platform'] = nameList[1]
    df['First label'] = nameList[2]
    df['Session'] = nameList[3]
    df['Language'] = nameList[4]
    df['Version'] = nameList[5]
df
I started with nameList[1] since I don't want NPS.
Any suggestions or feedback?
I have found a solution; I leave it here since the question got more views than I expected.
import glob
import os
import pandas as pd

path = "C:/Users/User/....."
all_files = glob.glob(os.path.join(path, "*.xlsx"))  # make list of paths

df_files = [pd.read_excel(filename) for filename in all_files]
for dataframe, filename in zip(df_files, all_files):
    filename = os.path.splitext(os.path.basename(filename))[0]
    filename = filename.rsplit('_')
    dataframe['Platform'] = filename[1]
    dataframe['First label'] = filename[2]
    dataframe['Session'] = filename[3]
    dataframe['Language'] = filename[4]
    dataframe['Version'] = filename[5]
df = pd.concat(df_files, ignore_index=True)
I think the reason is that I was iterating over the files only, not over the dataframes I was trying to build. With this approach, I can iterate over the dataframes and file names at the same time. I found this solution at https://jonathansoma.com/lede/foundations-2017/classes/working-with-many-files/class/
Still, if anyone can give an explicit answer as to why the first code does not work as I wanted, that would be great.
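The likely explanation: assigning a scalar to a dataframe column sets that value on every row, including the rows appended in earlier iterations, so each pass through the loop overwrites the name columns for everything accumulated so far. A minimal sketch of the failure mode, with hypothetical filenames and data:
import pandas as pd

df = pd.DataFrame()
for name in ['NPS_MEM_GAIT_Science_EN_10', 'NPS_VIS_RUN_Math_DE_2']:
    data = pd.DataFrame({'score': [1, 2]})
    df = pd.concat([df, data], ignore_index=True)
    parts = name.rsplit('_')
    # Overwrites the column for ALL rows accumulated so far,
    # not just the rows that came from this file.
    df['Platform'] = parts[1]

print(df['Platform'].unique())  # ['VIS'] -- only the last file's value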

Create a loop to process multiple files

I have written the code below, but currently I need to retype the same conditions for each file, and as there are over 100 files this is not ideal.
I couldn't come up with a way to implement this using a loop that reads all of these files and filters out the values in MP. Adding two new columns to each filtered file, as in the code below, is the only method I know so far.
I am trying to obtain a new combined data frame from all the filtered files with their conditions.
Please suggest ways of implementing this using a loop:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal

df1 = pd.read_csv(r'E:\Unmanned Cars\Unmanned Cars\2017040810_052.csv')
df2 = pd.read_csv(r'E:\Unmanned Cars\Unmanned Cars\2017040901_052.csv')
df3 = pd.read_csv(r'E:\Unmanned Cars\Unmanned Cars\2017040902_052.csv')

df1 = df1["MP"].unique()
df1 = pd.DataFrame(df1, columns=['MP'])
df1["Dates"] = "2017-04-08"
df1["Inspection"] = "10"
##
df2 = df2["MP"].unique()
df2 = pd.DataFrame(df2, columns=['MP'])
df2["Dates"] = "2017-04-09"
df2["Inspection"] = "01"
##
df3 = df3["MP"].unique()
df3 = pd.DataFrame(df3, columns=['MP'])
df3["Dates"] = "2017-04-09"
df3["Inspection"] = "02"

Final = pd.concat([df1, df2, df3], axis=0, sort=False)
Maybe this sample code will help you.
#!/usr/bin/env python3
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal
from os import path
import glob
import re

def process_file(file_path):
    result = None
    file_path = file_path.replace("\\", "/")
    filename = path.basename(file_path)
    regex = re.compile("^(\\d{4})(\\d{2})(\\d{2})(\\d{2})")
    match = regex.match(filename)
    if match:
        date = "%s-%s-%s" % (match[1], match[2], match[3])
        inspection = match[4]
        df1 = pd.read_csv(file_path)
        df1 = df1["MP"].unique()
        df1 = pd.DataFrame(df1, columns=['MP'])
        df1["Dates"] = date
        df1["Inspection"] = inspection
        result = df1
    return result

def main():
    # files_list = [
    #     r'E:\Unmanned Cars\Unmanned Cars\2017040810_052.csv',
    #     r'E:\Unmanned Cars\Unmanned Cars\2017040901_052.csv',
    #     r'E:\Unmanned Cars\Unmanned Cars\2017040902_052.csv'
    # ]
    directory = 'E:\\Unmanned Cars\\Unmanned Cars\\'
    files_list = [f for f in glob.glob(directory + "*_052.csv")]
    result_list = [process_file(filename) for filename in files_list]
    Final = pd.concat(result_list, axis=0, sort=False)

if __name__ == "__main__":
    main()
I've created a process_file function for processing each file. A regular expression extracts the date and inspection number from the filename, and the glob module reads the files from a directory with pattern matching and expansion.
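As a quick sanity check of the regular expression, applied to one of the filenames from the question, the leading digits split into date parts plus the inspection number:
import re

regex = re.compile(r"^(\d{4})(\d{2})(\d{2})(\d{2})")
m = regex.match("2017040810_052.csv")
print("%s-%s-%s" % (m[1], m[2], m[3]))  # 2017-04-08
print(m[4])                             # 10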

How do I add a list of values for a parameter to a group of dataframes, where that parameter has a different value for each dataframe?

I have 15 samples that each have a unique value of a parameter L.
Each sample was tested and provided data which I have placed into separate DataFrames in Pandas.
Each of the DataFrames has a different number of rows, and I want to place the corresponding value of L in each row, i.e. create a column for parameter L.
Note that L is constant in its respective DataFrame.
Is there a way to write a loop that will take a value of L from a list containing all of its values, and create a column in its corresponding sample data DataFrame?
I have so far been copying and pasting each line, and then updating the values and DataFrame names manually, but I suppose that this is not the most effective way of using python/pandas!
Most of the code I have used so far has been based on what I have found online, and my actual understanding of it is quite limited but I have tried to comment where possible.
UPDATED based on first suggested answer.
import pandas as pd
from pandas import DataFrame
import numpy as np
from pathlib import Path
from glob import glob
from os.path import join

path = r'file-directory/'
data_files = glob(join(path + '*.txt'))

def main():
    from contextlib import ExitStack
    with ExitStack() as context_manager:  # Allows python to access different data folders
        files = [context_manager.enter_context(open(f, "r")) for f in data_files]
        # Define an empty list and start reading data files
        df1 = []
        for file in files:
            df = pd.read_csv(file,
                             encoding='utf-8',
                             skiprows=114,
                             header=0,
                             # names=heads,
                             skipinitialspace=True,
                             sep='\t')
            # Process the dataframe to remove unwanted rows and columns, and rename the headers
            df = df[df.columns[[1, 2, 4, 6, 8, 10, 28]]]
            df = df.drop(0, axis=0)
            df = df.reset_index(drop=True)
            # 'heads' is a list of column names defined elsewhere
            df.rename(columns=dict(zip(df, heads)), inplace=True)
            for columns in df:
                df[columns] = pd.to_numeric(df[columns], errors='coerce')
            # Append each new dataframe to the list of dataframes
            df1.append(df)
    # Extract dataframes from list
    data1_0 = df1[0]
    data1_1 = df1[1]
    data1_2 = df1[2]
    data1_3 = df1[3]
    data1_4 = df1[4]
    data1_5 = df1[5]
    data1_6 = df1[6]
    data1_7 = df1[7]
    data1_8 = df1[8]
    data1_9 = df1[9]
    data1_10 = df1[10]
    data1_11 = df1[11]
    data1_12 = df1[12]
    data1_13 = df1[13]
    data1_14 = df1[14]
    # Add in a new column for values of 'L'
    L = ['L0', 'L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'L7', 'L8', 'L9', 'L10', 'L11', 'L12', 'L13', 'L14']
    data1_0['L'] = L[0]
    data1_1['L'] = L[1]
    data1_2['L'] = L[2]
    data1_3['L'] = L[3]
    data1_4['L'] = L[4]
    data1_5['L'] = L[5]
    data1_6['L'] = L[6]
    data1_7['L'] = L[7]
    data1_8['L'] = L[8]
    data1_9['L'] = L[9]
    data1_10['L'] = L[10]
    data1_11['L'] = L[11]
    data1_12['L'] = L[12]
    data1_13['L'] = L[13]
    data1_14['L'] = L[14]
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
The method I am using (copying and pasting lines) works so far; it's just that it doesn't seem to be the most efficient use of my time or the tools I have, and I don't really know how to approach this with my limited experience of Python.
I also have several other parameters and datasets that I need to do this for, so any help would be greatly appreciated!
You can do just data1_0['L'] = L[0] and so on for the rest of the DataFrames. Assigning a single value like this fills the whole column with that value automatically, so there is no need to compute a length or index.
Untested code:
import os
import pandas as pd
from pandas import DataFrame
import numpy as np
from pathlib import Path
from glob import glob
from os.path import join

path = r'file-directory/'
data_files = glob(join(path + '*.txt'))

def main():
    from contextlib import ExitStack
    with ExitStack() as context_manager:  # Allows python to access different data folders
        files = [context_manager.enter_context(open(f, "r")) for f in data_files]
        # Define an empty list and start reading data files
        df1 = []
        for file in files:
            df = pd.read_csv(file,
                             encoding='utf-8',
                             skiprows=114,
                             header=0,
                             # names=heads,
                             skipinitialspace=True,
                             sep='\t')
            # Process the dataframe to remove unwanted rows and columns, and rename the headers
            df = df[df.columns[[1, 2, 4, 6, 8, 10, 28]]]
            df = df.drop(0, axis=0)
            df = df.reset_index(drop=True)
            df.rename(columns=dict(zip(df, heads)), inplace=True)
            for columns in df:
                df[columns] = pd.to_numeric(df[columns], errors='coerce')
            # Add file name as identifier
            df['FNAME'] = os.path.basename(file.name)
            # Append each new dataframe to the list of dataframes
            df1.append(df)
        # Concatenate the results into a single dataframe
        data = pd.concat(df1)
        L = ['L0', 'L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'L7', 'L8', 'L9', 'L10', 'L11', 'L12', 'L13', 'L14']
        # Supposing the number of files and the length of L are the same
        repl_dict = {k: v for k, v in zip([os.path.basename(file.name) for file in files], L)}
        # Add the new column
        data['L'] = data.FNAME.map(repl_dict)
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
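If the files are read in the same order as the values in L, a simpler variant (a sketch, equally untested against the real data) is to zip the list of per-file frames with L directly and skip the filename mapping:
# Assumes df1 (the list of per-file frames) and L line up one-to-one.
for frame, l_value in zip(df1, L):
    frame['L'] = l_value
data = pd.concat(df1, ignore_index=True)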

adding columns to dataframe based on file name in python

I have several txt files with different file names. I would like to do three things:
1.) Load all the data at once
2.) Use parts of each file name and add them to the dedicated dataframe as additional columns
3.) Add the files together
Below is a really, really manual example, but I want to automate it somehow. How is that possible?
The code looks like the following:
import pandas as pd

# load data files
data1 = pd.read_csv('C:/file1_USA_Car_1d.txt')
data2 = pd.read_csv('C:/file2_USA_Car_2d.txt')
data3 = pd.read_csv('C:/file3_USA_Car_1m.txt')
data4 = pd.read_csv('C:/file3_USA_Car_6m.txt')
data5 = pd.read_csv('C:file3_USA_Car_1Y.txt')

df = pd.DataFrame()
print(df)
df = data1
# --> The input for the columns below should be taken from the name of the file
df['country'] = 'USA'
df['Type'] = 'Car'
df['duration'] = '1d'
print(df)
Iterate over your files with glob and do some simple splitting on the filenames.
import glob
import pandas as pd

df_list = []
for file in glob.glob('C:/file1_*_*_*.txt'):
    # Tweak this to work for your actual filepaths, if needed.
    country, typ, dur = file.split('.')[0].split('_')[1:]
    df = (pd.read_csv(file)
            .assign(Country=country, Type=typ, duration=dur))
    df_list.append(df)
df = pd.concat(df_list)
One way of doing it would be to do this:
all_res = pd.DataFrame()
file_list = ['C:/file1_USA_Car_1d.txt', 'C:/file3_USA_Car_1m.txt', 'etc']
for file_name in file_list:
    tmp = pd.read_csv(file_name)
    tmp['file_name'] = file_name
    all_res = all_res.append(tmp)
all_res = all_res.reset_index()
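Note that DataFrame.append has since been deprecated (and removed in pandas 2.0); collecting the frames in a list and calling pd.concat once, as in the other answers here, is the modern replacement.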
I'd do something like the following:
from pathlib import Path
from operator import itemgetter
import pandas as pd

file_paths = [
    Path(path_str)
    for path_str in (
        'C:/file1_USA_Car_1d.txt', 'C:/file2_USA_Car_2d.txt',
        'C:/file3_USA_Car_1m.txt', 'C:/file3_USA_Car_6m.txt',
        'C:file3_USA_Car_1Y.txt')
]

def import_csv(csv_path):
    df = pd.read_csv(csv_path)
    df['country'], df['Type'], df['duration'] = itemgetter(1, 2, 3)(csv_path.stem.split('_'))
    return df

dfs = [import_csv(csv_path) for csv_path in file_paths]
This helps encapsulate your desired behavior in a helper function and reduces the things you need to think about.
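From there, the per-file frames can be combined into a single dataframe, for example:
combined = pd.concat(dfs, ignore_index=True)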

How can I use pandas to find only rows with differing column values?

I am comparing a single column ('label') from several nearly identical csv files.
I've written some code that creates a new data frame from the files that I am comparing:
def main(argv):
    dirs = sys.argv[1:]
    print("Directories to process: " + str(dirs))
    files = glob.glob(dirs[0] + "/*.csv")
    files = [f.replace(dirs[0] + "/", "") for f in files]
    print("files to process: " + str(files))
    dfList = [dirs]
    dfLabel = pd.DataFrame()
    resultdf = pd.DataFrame()
    for file in range(0, len(files)):
        filename = files[file]
        for index in range(0, len(dirs)):
            dirname = dirs[index]
            dfItem = pd.read_csv(dirname + "/" + filename)
            resultdf[dirname] = dfItem['label']
        resultdf.fillna(value=0, inplace=True)
        resultdf['mode_average'] = resultdf.mode(axis=1)
        # new step to remove rows where all values are equal
        resultdf.to_csv("Comparison_of_" + filename, index=False)

if __name__ == "__main__":
    main(sys.argv[1:])
This works the way I want it to, but I am really only interested in seeing the rows where one of my input files is different. I am expecting them to be the same in most cases, and there are hundreds or thousands of rows. Is there a built in way to evaluate and return only the rows where one or more of the values in that row are different? The number of files and directories that I run comparisons on may fluctuate.
I solved this with the help of pandasql.
This report shows the row number and the comparison of the results of all of the labels where one did not match the mode average.
import pandas as pd
import os, sys, glob
import getopt
import pandasql
from pandas import *
from pandasql import sqldf

dirs = sys.argv[1:]
print("Directories to process: " + str(dirs))
files = glob.glob(dirs[0] + "/*.csv")
files = [f.replace(dirs[0] + "/", "") for f in files]
print("files to process: " + str(files))
dfList = [dirs]
resultdf = pd.DataFrame()
for file in range(0, len(files)):
    filename = files[file]
    for index in range(0, len(dirs)):
        dirname = dirs[index]
        dfItem = pd.read_csv(dirname + "/" + filename)
        resultdf[dirname] = dfItem['label']
    resultdf.fillna(value=0, inplace=True)
    resultdf['mode_average'] = resultdf.mode(axis=1)
    pysqldf = lambda q: sqldf(q, globals())
    for index in range(0, len(dirs)):
        dirname = dirs[index]
        q = "select _ROWID_,* from resultdf where " + dirname + " != mode_average"
        diffs = pysqldf(q)
        if len(diffs) > 0:
            print("Advisor " + dirname + " had deviations in " + filename)
            diffs.to_csv(dirname + "_" + filename + "_deviation.csv", index=False)
            print(diffs)
    resultdf.to_csv("Comparison_of_" + filename, index=False)
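For reference, a pure-pandas alternative (a sketch with hypothetical directory names and labels) that keeps only the rows where the per-directory columns are not all equal, without pandasql:
import pandas as pd

resultdf = pd.DataFrame({
    'dirA': ['cat', 'dog', 'bird'],
    'dirB': ['cat', 'dog', 'fish'],
    'dirC': ['cat', 'cow', 'bird'],
})
# nunique(axis=1) counts distinct values per row; > 1 means at least one column disagrees
diffs = resultdf[resultdf.nunique(axis=1) > 1]
print(diffs)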
