Create a loop to process multiple files - Python

I have written the code below, but currently I have to retype the same conditions for each file and, as there are over 100 files, this is not ideal.
I couldn't come up with a way to use a loop that reads all of these files, filters the unique values of MP out of each one, and adds two new columns to each filtered result; the code below is the only method I know so far.
I am trying to obtain a single combined data frame from all of the filtered files with their conditions.
Please suggest ways of implementing this with a loop:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal
df1 = pd.read_csv(r'E:\Unmanned Cars\Unmanned Cars\2017040810_052.csv')
df2 = pd.read_csv(r'E:\Unmanned Cars\Unmanned Cars\2017040901_052.csv')
df3 = pd.read_csv(r'E:\Unmanned Cars\Unmanned Cars\2017040902_052.csv')
df1 = df1["MP"].unique()
df1 = pd.DataFrame(df1, columns=['MP'])
df1["Dates"] = "2017-04-08"
df1["Inspection"] = "10"
##
df2 = df2["MP"].unique()
df2 = pd.DataFrame(df2, columns=['MP'])
df2["Dates"] = "2017-04-09"
df2["Inspection"] = "01"
##
df3 = df3["MP"].unique()
df3 = pd.DataFrame(df3, columns=['MP'])
df3["Dates"] = "2017-04-09"
df3["Inspection"] = "02"
Final = pd.concat([df1, df2, df3], axis=0, sort=False)

Maybe this sample code will help you.
#!/usr/bin/env python3
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal
from os import path
import glob
import re
def process_file(file_path):
    result = None
    file_path = file_path.replace("\\", "/")
    filename = path.basename(file_path)
    regex = re.compile(r"^(\d{4})(\d{2})(\d{2})(\d{2})")
    match = regex.match(filename)
    if match:
        date = "%s-%s-%s" % (match[1], match[2], match[3])
        inspection = match[4]
        df1 = pd.read_csv(file_path)
        df1 = df1["MP"].unique()
        df1 = pd.DataFrame(df1, columns=['MP'])
        df1["Dates"] = date
        df1["Inspection"] = inspection
        result = df1
    return result

def main():
    # files_list = [
    #     r'E:\Unmanned Cars\Unmanned Cars\2017040810_052.csv',
    #     r'E:\Unmanned Cars\Unmanned Cars\2017040901_052.csv',
    #     r'E:\Unmanned Cars\Unmanned Cars\2017040902_052.csv'
    # ]
    directory = 'E:\\Unmanned Cars\\Unmanned Cars\\'
    files_list = glob.glob(directory + "*_052.csv")
    result_list = [process_file(filename) for filename in files_list]
    Final = pd.concat(result_list, axis=0, sort=False)

if __name__ == "__main__":
    main()
I've created a process_file function for processing each file.
A regular expression extracts the date and inspection number from the filename, and the glob module reads the files from a directory with pattern matching and expansion.
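As a quick sanity check, the same pattern can be tried on one of the sample file names from the question:

```python
import re

# Same pattern as in process_file: year, month, day, inspection number
# at the start of the filename
regex = re.compile(r"^(\d{4})(\d{2})(\d{2})(\d{2})")

match = regex.match("2017040810_052.csv")
date = "%s-%s-%s" % (match[1], match[2], match[3])
inspection = match[4]
print(date, inspection)  # 2017-04-08 10
```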

Related

Creating continuous arrays through a for loop iteration in python

I am attempting to create an array that is continually added onto as I loop through multiple CSVs. The current code that I have creates multiple arrays within an array, but I would like it to be one continuous array. In the end, I would like to have DT_temp with its corresponding type_id in a dataframe. Any suggestions as to how I can accomplish this?
import glob
import numpy as np
import pandas as pd
path = ** my folder path
all_files = glob.glob(path + "\*.csv")
type_id = []
DT_temp = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    type_id.append(df['type_id'])  # type_id = a distinct number
    df_temp = df[['unix_time_stamp', 'temperature']].dropna()
    DT_temp.append(df_temp['unix_time_stamp'].diff())
DT_frame = pd.DataFrame(type_id)
DT_frame['TimeDiff_Temp'] = DT_temp
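One possible approach (a sketch only, using two small in-memory frames in place of the real CSVs, and assuming each file carries a single distinct type_id) is to build one small DataFrame per file and concatenate them once at the end, which yields a single continuous result:

```python
import pandas as pd

# Two small frames standing in for two CSV files (hypothetical data)
file_frames = [
    pd.DataFrame({'type_id': [1, 1], 'unix_time_stamp': [10, 15], 'temperature': [20.0, 21.0]}),
    pd.DataFrame({'type_id': [2, 2], 'unix_time_stamp': [30, 36], 'temperature': [19.0, 18.5]}),
]

pieces = []
for df in file_frames:  # with real files: df = pd.read_csv(filename) inside the glob loop
    df_temp = df[['unix_time_stamp', 'temperature']].dropna()
    pieces.append(pd.DataFrame({
        'type_id': df['type_id'].iloc[0],          # assuming one distinct id per file
        'TimeDiff_Temp': df_temp['unix_time_stamp'].diff(),
    }))

# One continuous DataFrame instead of a list of per-file arrays
DT_frame = pd.concat(pieces, ignore_index=True)
print(DT_frame)
```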

How do I add a list of values for a parameter to a group of dataframes, where that parameter has a different value for each dataframe?

I have 15 samples that each have a unique value of a parameter L.
Each sample was tested and provided data which I have placed into separate DataFrames in Pandas.
Each of the DataFrames has a different number of rows, and I want to place the corresponding value of L in each row, i.e. create a column for parameter L.
Note that L is constant in its respective DataFrame.
Is there a way to write a loop that will take a value of L from a list containing all of its values, and create a column in its corresponding sample data DataFrame?
I have so far been copying and pasting each line, and then updating the values and DataFrame names manually, but I suppose that this is not the most effective way of using python/pandas!
Most of the code I have used so far has been based on what I have found online, and my actual understanding of it is quite limited but I have tried to comment where possible.
UPDATED based on the first suggested answer.
import pandas as pd
from pandas import DataFrame
import numpy as np
from pathlib import Path
from glob import glob
from os.path import join
path = r'file-directory/'
data_files = glob(join(path + '*.txt'))
def main():
    from contextlib import ExitStack
    with ExitStack() as context_manager:  # Allows python to access different data folders
        files = [context_manager.enter_context(open(f, "r")) for f in data_files]
        # Define an empty list and start reading data files
        df1 = []
        for file in files:
            df = pd.read_csv(file,
                             encoding='utf-8',
                             skiprows=114,
                             header=0,
                             # names=heads,
                             skipinitialspace=True,
                             sep='\t'
                             )
            # Process the dataframe to remove unwanted rows and columns, and rename the headers
            df = df[df.columns[[1, 2, 4, 6, 8, 10, 28]]]
            df = df.drop(0, axis=0)
            df = df.reset_index(drop=True)
            df.rename(columns=dict(zip(df, heads)), inplace=True)
            for columns in df:
                df[columns] = pd.to_numeric(df[columns], errors='coerce')
            # Append each new dataframe to the list
            df1.append(df)
        # Extract dataframes from list
        data1_0 = df1[0]
        data1_1 = df1[1]
        data1_2 = df1[2]
        data1_3 = df1[3]
        data1_4 = df1[4]
        data1_5 = df1[5]
        data1_6 = df1[6]
        data1_7 = df1[7]
        data1_8 = df1[8]
        data1_9 = df1[9]
        data1_10 = df1[10]
        data1_11 = df1[11]
        data1_12 = df1[12]
        data1_13 = df1[13]
        data1_14 = df1[14]
        # Add in a new column for values of 'L'
        L = ['L0', 'L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'L7', 'L8', 'L9', 'L10', 'L11', 'L12', 'L13', 'L14']
        data1_0['L'] = L[0]
        data1_1['L'] = L[1]
        data1_2['L'] = L[2]
        data1_3['L'] = L[3]
        data1_4['L'] = L[4]
        data1_5['L'] = L[5]
        data1_6['L'] = L[6]
        data1_7['L'] = L[7]
        data1_8['L'] = L[8]
        data1_9['L'] = L[9]
        data1_10['L'] = L[10]
        data1_11['L'] = L[11]
        data1_12['L'] = L[12]
        data1_13['L'] = L[13]
        data1_14['L'] = L[14]
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
The method I am using (copy and paste lines) works so far, it's just that it doesn't seem to be the most efficient use of my time or the tools I have, and I don't really know how to approach this one with my limited experience of python so far.
I also have several other parameters and datasets that I need to do this for, so any help would be greatly appreciated!
You can do just data1_0['L'] = L[0] and so on for the rest of the DataFrames. Assigning a single value like this fills the whole column with that value automatically, so there is no need to compute the length or index.
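A tiny illustration of that broadcasting behaviour:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
df['L'] = 'L0'           # a scalar assignment fills every row
print(df['L'].tolist())  # ['L0', 'L0', 'L0']
```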
Untested code:
import os
import pandas as pd
from pandas import DataFrame
import numpy as np
from pathlib import Path
from glob import glob
from os.path import join
path = r'file-directory/'
data_files = glob(join(path + '*.txt'))

def main():
    from contextlib import ExitStack
    with ExitStack() as context_manager:  # Allows python to access different data folders
        files = [context_manager.enter_context(open(f, "r")) for f in data_files]
        # Define an empty list and start reading data files
        df1 = []
        for file in files:
            df = pd.read_csv(file,
                             encoding='utf-8',
                             skiprows=114,
                             header=0,
                             # names=heads,
                             skipinitialspace=True,
                             sep='\t'
                             )
            # Process the dataframe to remove unwanted rows and columns,
            # and rename the headers ('heads' is the name list from the question)
            df = df[df.columns[[1, 2, 4, 6, 8, 10, 28]]]
            df = df.drop(0, axis=0)
            df = df.reset_index(drop=True)
            df.rename(columns=dict(zip(df, heads)), inplace=True)
            for columns in df:
                df[columns] = pd.to_numeric(df[columns], errors='coerce')
            # Add file name as identifier
            df['FNAME'] = os.path.basename(file.name)
            # Append each new dataframe to the list
            df1.append(df)
        # Concatenate the results into a single dataframe
        data = pd.concat(df1)
        L = ['L0', 'L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'L7', 'L8', 'L9', 'L10', 'L11', 'L12', 'L13', 'L14']
        # Supposing the number of files and the length of L are the same
        repl_dict = {k: v for k, v in zip([os.path.basename(file.name) for file in files], L)}
        # Add the new column
        data['L'] = data.FNAME.map(repl_dict)
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())

adding columns to dataframe based on file name in python

I have several txt files with different file names. I would like to do three things:
1.) load the data all at once,
2.) take parts of each file name and add them to the dedicated dataframe as additional columns, and
3.) add the files together.
Below is a really manual example, but I want to automate it somehow. How is that possible?
The code looks like the following:
import pandas as pd
#load data files
data1 = pd.read_csv('C:/file1_USA_Car_1d.txt')
data2 = pd.read_csv('C:/file2_USA_Car_2d.txt')
data3 = pd.read_csv('C:/file3_USA_Car_1m.txt')
data4 = pd.read_csv('C:/file3_USA_Car_6m.txt')
data5 = pd.read_csv('C:/file3_USA_Car_1Y.txt')
df = pd.DataFrame()
print(df)
df = data1
#--> The input for the column below should be taken from the name of the file
df['country'] = 'USA'
df['Type'] = 'Car'
df['duration'] = '1d'
print(df)
Iterate over your files with glob and do some simple splitting on the filenames.
import glob
import pandas as pd

df_list = []
for file in glob.glob('C:/file*_*_*_*.txt'):
    # Tweak this to work for your actual filepaths, if needed.
    country, typ, dur = file.split('.')[0].split('_')[1:]
    df = (pd.read_csv(file)
          .assign(Country=country, Type=typ, duration=dur))
    df_list.append(df)
df = pd.concat(df_list)
One way of doing it would be to do this (collecting the pieces in a list, since DataFrame.append was removed in recent pandas versions):
all_res_list = []
file_list = ['C:/file1_USA_Car_1d.txt', 'C:/file3_USA_Car_1m.txt', 'etc']
for file_name in file_list:
    tmp = pd.read_csv(file_name)
    tmp['file_name'] = file_name
    all_res_list.append(tmp)
all_res = pd.concat(all_res_list).reset_index(drop=True)
I'd do something like the following:
from pathlib import Path
from operator import itemgetter
import pandas as pd
file_paths = [
    Path(path_str)
    for path_str in (
        'C:/file1_USA_Car_1d.txt', 'C:/file2_USA_Car_2d.txt',
        'C:/file3_USA_Car_1m.txt', 'C:/file3_USA_Car_6m.txt',
        'C:/file3_USA_Car_1Y.txt')
]

def import_csv(csv_path):
    df = pd.read_csv(csv_path)
    df['country'], df['Type'], df['duration'] = itemgetter(1, 2, 3)(csv_path.stem.split('_'))
    return df

dfs = [import_csv(csv_path) for csv_path in file_paths]
This helps encapsulate your desired behavior in a helper function and reduces the things you need to think about.
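For reference, the stem-splitting step behaves like this on one of the sample paths:

```python
from pathlib import Path
from operator import itemgetter

# .stem drops the '.txt' suffix, leaving 'file1_USA_Car_1d'
parts = Path('C:/file1_USA_Car_1d.txt').stem.split('_')
country, typ, dur = itemgetter(1, 2, 3)(parts)
print(country, typ, dur)  # USA Car 1d
```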

Changing Column Heading CSV File

I am currently trying to change the headings of the file I am creating. The code I am using is as follows:
import pandas as pd
import os, sys
import glob
path = "C:\\Users\\cam19\\Desktop\\Test1\\*.csv"
list_ = []
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, low_memory=False)
    output = df['logid'].value_counts()
    list_.append(output)
df2 = pd.concat(list_, axis=1)
df2.to_csv('final.csv')
Basically I am looping through a file directory and extracting data from each file. Using this outputs the following image;
http://imgur.com/a/LE7OS
All I want to do is change the column name from 'logid' to the name of the file currently being processed, but I am not sure how to do this. Any help is great! Thanks.
Instead of appending the raw counts, build a DataFrame from them and set its column name, i.e.
output = pd.DataFrame(df['logid'].value_counts())
output.columns = [os.path.basename(fname).split('.')[0]]
list_.append(output)
Changes in the code in the question:
import pandas as pd
import os, sys
import glob
path = "C:\\Users\\cam19\\Desktop\\Test1\\*.csv"
list_ = []
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    output = pd.DataFrame(df['logid'].value_counts())
    output.columns = [os.path.basename(fname).split('.')[0]]
    list_.append(output)
df2 = pd.concat(list_, axis=1)
df2.to_csv('final.csv')
Hope it helps
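A minimal, self-contained illustration of the renaming trick ('example.csv' here is a hypothetical file name standing in for one of the globbed paths):

```python
import os
import pandas as pd

df = pd.DataFrame({'logid': ['a', 'b', 'a', 'a']})
fname = 'C:/Users/cam19/Desktop/Test1/example.csv'  # hypothetical path

# Wrap the counts in a DataFrame and name the column after the file
output = pd.DataFrame(df['logid'].value_counts())
output.columns = [os.path.basename(fname).split('.')[0]]
print(output)  # single column named 'example', counts a: 3, b: 1
```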

Python / Pandas abbreviating my numbers.

Probably a very easy fix, but my English isn't good enough to search for the right answer.
Python/pandas is changing the numbers that I'm writing from 6570631401430749 to something like 6.57063140143e+15.
I'm merging hundreds of csv files, and this one column comes out all wrong. The name of the column is "serialnumber" and it's the 3rd column.
import pandas as pd
import glob
import os
interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    frame = pd.read_csv(filename)
    print(os.path.basename(filename))
    frame['filename'] = os.path.basename(filename)
    df_list.append(frame)
full_df = pd.concat(df_list)
full_df.to_csv('output.csv', encoding='utf-8-sig')
You can use dtype=object when you read the csv if you want to preserve the data in its original form. You can change your code to:
import pandas as pd
import glob
import os
interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    frame = pd.read_csv(filename, dtype=object)
    print(os.path.basename(filename))
    frame['filename'] = os.path.basename(filename)
    df_list.append(frame)
full_df = pd.concat(df_list)
full_df.to_csv('output.csv', encoding='utf-8-sig')
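A small demonstration of the difference, using an in-memory CSV in place of real files (a missing value is included in the second row, since that is what typically forces the column to float and triggers the abbreviated display):

```python
import io
import pandas as pd

csv_data = "id,serialnumber\n1,6570631401430749\n2,\n"

as_default = pd.read_csv(io.StringIO(csv_data))
as_object = pd.read_csv(io.StringIO(csv_data), dtype=object)

print(as_default['serialnumber'].dtype)   # float64 -> shown abbreviated in DataFrame output
print(as_object['serialnumber'].iloc[0])  # '6570631401430749', kept exactly as text
```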
