I have a problem which might be easy for the majority of people here.
I have four folders: SA1, SA2, SA3, SA4.
Each folder has around 60 csv files. I have defined the path like this:
my_string = "{0}/folder1/{1}/{1} something/{2}/AUST"
analytics_path = "C:/Users/...../SharePoint - Documents/folder2"
year = "2016" # User should define this
level = "SA3" # User should define this
path = my_string.format(analytics_path, year, level)
Once the user defines the year and level in the path above, I want to combine all the CSV files under the "level" folder based on the "index_col=" parameter.
For example, for SA1, I want to combine the CSV files based on the "SA1_code" column. For SA2, I want to combine the CSV files based on
the "SA2_MAIN_DIGIT_CODE" column. For SA3 and SA4, the index_col should be "SA3_MULTI" and "SA4_REGIONS" respectively. As you can see, the column names for the CSV files under these four folders are all different.
So far, I have attempted the following.
I have defined the function as:
import glob
import os

import pandas as pd

def combine_csv(path):
    """
    Concatenates the csv files and creates one large dataframe combining
    the information in all the csv files in a given folder.

    Parameters
    ----------
    path (string): Location of the folder and the files therein

    Returns
    -------
    A dataframe of all the concatenated csv files
    """
    # Build the glob pattern for all csv files in the folder
    files = os.path.join(path, "*.csv")
    # List of matching file paths
    files = glob.glob(files)
    # Read each file and collect the dataframes
    list_csv = []
    for filename in files:
        # Can't pass "index_col" here: the index column name differs per folder
        list_df = pd.read_csv(filename)
        list_csv.append(list_df)
    df = pd.concat(list_csv, axis=1, ignore_index=False)
    return df
data_df = combine_csv(path)
gives me the combined dataframe. But I want to combine the files based on "SA1_code" if the user chooses the SA1 folder, "SA2_MAIN_DIGIT_CODE" if they choose to combine CSV files from the SA2 folder, and so on and so forth.
How do I do this?
You don't have four separate index columns; you just have one whose name changes depending on user input. Therefore, the solution to your problem is relatively simple. First, modify your combine_csv method:
def combine_csv(path, index):
    """
    Concatenates the csv files and creates one large dataframe combining
    the information in all the csv files in a given folder.

    Parameters
    ----------
    path (string): Location of the folder and the files therein
    index (string): Name of the column to use as the index when reading each csv

    Returns
    -------
    A dataframe of all the concatenated csv files
    """
    # Build the glob pattern for all csv files in the folder
    files = os.path.join(path, "*.csv")
    # List of matching file paths
    files = glob.glob(files)
    # Read each file, indexed on the given column, and collect the dataframes
    list_csv = []
    for filename in files:
        list_df = pd.read_csv(filename, index_col=index)
        list_csv.append(list_df)
    df = pd.concat(list_csv, axis=1, ignore_index=False)
    return df
All I did was inject a value, index, that is used for the index_col argument to read_csv.
Next, we need to determine the value for index based on the value for level, as input by the user. According to your question, it seems that there should be a one-to-one relationship between these values. So, we can use a dictionary for this:
LevelIndexMapping = {
"SA1": "SA1_code",
"SA2": "SA2_MAIN_DIGIT_CODE",
"SA3": "SA3_MULTI",
"SA4": "SA4_REGIONS"
}
my_string = "{0}/folder1/{1}/{1} something/{2}/AUST"
analytics_path = "C:/Users/...../SharePoint - Documents/folder2"
year = "2016" # User should define this
level = "SA3" # User should define this
path = my_string.format(analytics_path, year, level)
combine_csv(path, LevelIndexMapping[level])
Here, I created a dictionary that maps your level variable to its associated index column value, and then accesses that mapping when calling combine_csv.
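If you want to guard against a level value that isn't in the mapping, you could look the key up defensively first. A small sketch (not part of the original answer):

# Hypothetical guard: fail early with a clear message for an unknown level
try:
    index_col = LevelIndexMapping[level]
except KeyError:
    raise ValueError(f"Unknown level {level!r}; expected one of {sorted(LevelIndexMapping)}")

data_df = combine_csv(path, index_col)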
Related
I have a folder, and inside the folder suppose there are 1000 .csv files stored. Now I have to create a dataframe based on 50 of these files, so instead of loading them one by one, is there any fast approach available?
I also want each file's name to be the name of its dataframe.
I tried the method below, but it is not working.
# List of files that I want to load out of 1000
path = "..."
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
for i in range(0, len(file_names)):
    col_names[i] = pd.read_csv(path + col_name[i])
But when I tried to read the variable name, it did not display any result.
Is there any way I can achieve the desired result?
I have checked various articles, but in each of them all the data is concatenated at the end, and I want each file to be loaded individually.
IIUC, what you want to do is create multiple dfs and not concatenate them.
In that case you can read each file using read_csv and stack the returned df objects in a list:
your_paths = [...]  # paths to all of the csvs you want
l = [pd.read_csv(p) for p in your_paths]  # this will give you a list of your dfs
l[0]  # one of your dfs
If you want them named, you can build a dict with the names as keys instead.
You can then access them individually, through index slicing or key lookup, depending on the data structure you use.
I would not recommend this, though, as it is counter-intuitive, and multiple df objects use a little more memory than a single one.
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
data_frames = {}
for file_name in file_names:
    df = pd.read_csv(file_name)
    data_frames[file_name.split('.')[0]] = df
Now you can reach any dataframe from the data_frames dictionary, e.g. data_frames['a'] to access a.csv.
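For example, to loop over everything that was loaded:

# Iterate over the loaded dataframes by name
for name, frame in data_frames.items():
    print(name, frame.shape)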
Try:
import glob
p = glob.glob('folder_path_where_csv_files_stored/*.csv')  # 1. returns a list of all csv files in this folder, no need to type them one by one
d = [pd.read_csv(i) for i in p]  # 2. creates a list of dataframes: one dataframe from each csv file
df = pd.concat(d, axis=0, ignore_index=True)  # 3. creates one dataframe `df` from the dataframes in the list `d`
I have a folder that contains a few Excel files. I want to read through the folder and save each Excel file in a separate dataframe with a similar name using pandas. Currently, I have extracted the file names and made a dictionary with dataframe names as keys and Excel file paths as values, but I'm not sure how to do the assignments in a loop. For example, in the first step all file names and paths are read and saved in two lists:
path = 'c:/files/*.xlsx'
path_1 = 'c:/files/'
file_names = [os.path.basename(i) for i in glob.glob(path)]
names = ['df_' + i.split('.')[0] for i in file_names]
d_names = {names[i]: path_1 + file_names[i] for i in range(len(names))}
So now there is a dictionary as below:
d_names={'df_a':'c:/files/a.xlsx','df_b':'c:/files/b.xlsx'}
Now, how do I assign the dataframes iteratively so that at the end of the iteration we obtain two dataframes, df_a and df_b?
path = 'c:/files/'
dfs = {}
for file in glob.glob(path + '*.xlsx'):
    file_name = os.path.basename(file).split('.')[0]
    dfs['df_' + file_name] = pd.read_excel(file)
dfs would now contain:
{'df_a': <pd.DataFrame>, 'df_b': <pd.DataFrame>}
# And each could be accessed like:
dfs['df_a']
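If you really need standalone variables rather than dictionary entries, you can unpack them explicitly. A small sketch, assuming the dictionary built above:

# Explicit unpacking; only practical for a small, known set of files
df_a = dfs['df_a']
df_b = dfs['df_b']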
I can read one .ann file into a pandas dataframe as follows:
df = pd.read_csv('something/something.ann', sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
df.head()
But I don't know how to read multiple .ann files into one pandas dataframe. I tried to use concat, but the result is not what I expected.
How can I read many .ann files into one pandas dataframe?
It sounds like you need to use glob to pull in all the .ann files from a folder and add them to a list of dataframes. After that you probably want to join/merge/concat etc. as required.
I don't know your exact requirements, but the code below should get you close. As it stands, the script assumes that, relative to where you run it, there is a subfolder called files from which you want to pull in all the .ann files (it will not look at anything else). Review and change as required; it's commented per line.
import glob

import pandas as pd

path = r'./files'  # use your path
all_files = glob.glob(path + "/*.ann")

# create an empty list to hold the dataframes from the files found
dfs = []

# for each file in the path above ending in .ann
for file in all_files:
    # open the file
    df = pd.read_csv(file, sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
    # add this new (temporary, during the loop) frame to the end of the list
    dfs.append(df)

# at this point you have a list of frames, each list item being one .ann file,
# like [annFile1, annFile2, etc.] - just not those names

# handle an empty list
if len(dfs) == 0:
    print('No files found.')
    # create a dummy frame
    df = pd.DataFrame()
# or, with only one item/frame, take it out of the list
elif len(dfs) == 1:
    df = dfs[0]
# or concatenate more than one frame together
else:  # modify this join as required
    df = pd.concat(dfs, ignore_index=True)
    df = df.reset_index(drop=True)

# check what you've got
print(df.head())
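If you also want to know which .ann file each row came from, one option (an addition, not something the question asked for) is to pass keys to concat:

# Sketch: tag each row with its source file via a MultiIndex
# (assumes dfs is non-empty and lines up one-to-one with all_files)
df_tagged = pd.concat(dfs, keys=all_files, names=['source_file', None])
print(df_tagged.head())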
excel_TSMS_df = pd.read_excel(r'C:/TMP/TSMS.xls', sheet_name='TSMS', dayfirst=True)
excel_Ratecard_df = pd.read_excel(r'C:/TMP/Ratecard.xls', sheet_name='Ratecard')
excel_ONCall_df = pd.read_excel(r'C:/TMP/On Call Report.xls', sheet_name='On Call Report')
excel_OverTime_df = pd.read_excel(r'C:/TMP/Over Time.xls', sheet_name='OT')
Here I have 4 files and I want to create 4 dataframes: N files, N dataframes, each named uniquely. Kindly help.
Suppose in one scenario the source contains only two files, e.g. TSMS and Ratecard; in that case only two dataframes should be created. However many files are in the source location, that many dataframes should be created when reading the Excel data. Kindly help in this case.
Create two lists, one for file names and one for dataframe names, and an empty dictionary:
dataframe_names = ["TSMS","Ratecard","ONCall","OverTime"]
file_names = ['C:/TMP/TSMS.xls', 'C:/TMP/Ratecard.xls','C:/TMP/On Call Report.xls','C:/TMP/Over Time.xls']
dataframes = {}
Then iterate over the lists together (make sure they are the same length), and every time you open an Excel file as a dataframe, add it to the dictionary:
for index, name in enumerate(file_names):
    df = pd.read_excel(name)
    df_name = dataframe_names[index]
    dataframes[df_name] = df
This should work for any number of Excel files, as long as you keep the two original lists the same length and in matching order. For a source folder with an unknown number of files, see the sketch below.
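To handle the follow-up (however many files are in the source folder, that many dataframes), you could discover the files with glob instead of hard-coding the two lists. A sketch, assuming every .xls file under C:/TMP is wanted and the default (first) sheet is the right one:

import glob
import os

import pandas as pd

dataframes = {}
for file_path in glob.glob('C:/TMP/*.xls'):
    # Derive the dataframe name from the file name, e.g. 'TSMS' from 'TSMS.xls'
    df_name = os.path.splitext(os.path.basename(file_path))[0]
    dataframes[df_name] = pd.read_excel(file_path)  # reads the first sheet by default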
Three questions within a single piece of code.
I have quite a lot of Excel files which follow a similar naming pattern, e.g. Design__Tolerance_1.xlsx, Design_Tolerance_2.xlsx, etc., kept in one folder.
Let's consider the first file, Design__Tolerance_1.xlsx, as an example.
I am reading the first three columns of the Excel file in my Python program using the pandas module as follows:
fields = ['Time', 'Test 1', 'Test 2']
df = pd.read_excel('Design_Tolerance_1.xlsx', skipinitialspace=True,
                   usecols=fields)
Next, I am finding the mean of the column Test 1 and the maximum value of the column Test 2:
mean_value = df['Test 1'].mean()
max_value = df['Test 2'].max()
And I am printing the output to a separate .csv file:
columns = ["MEAN", "MAX"]
data_under_columns = {"MEAN": [mean_value], "MAX": [max_value]}
df1 = pd.DataFrame(data_under_columns, columns=columns)
The file output_file.csv will contain the output
df1.to_csv('output_file.csv', sep=",", index = False)
Could you help me do the following:
I have kept all my files in the same folder and I want the program to read all the files matching the naming pattern mentioned above (Design__Tolerance_1.xlsx, Design_Tolerance_2.xlsx, etc.); the program should run as many times as there are files.
Let's say I have four Excel files following the same naming pattern (Design__Tolerance_XXX.xlsx) present in the folder. I want the program to run four times and calculate the mean of the column Test 1 and the maximum value of the column Test 2 for each file, one after another; and
print only one .csv file as output_file.csv which contains the output from all the Excel files.
Use of functions is acceptable too.
You can do something like this. The solution below goes through your folder and, for every xlsx file, creates a record with the mean and the max value; once done, it creates a dataframe with one line per file and stores it as a csv.
# std
import glob

# 3rd party
import pandas as pd

# Select all the xlsx files in your directory
directory = r'path/to/your/directory'
files = glob.glob(directory + "/*.xlsx")

fields = ["Test 1", "Test 2"]
records = []
for f in files:
    # Every file is put in a temp dataframe, and the operations are performed on it
    temp_df = pd.read_excel(f, usecols=fields)
    mean_value = temp_df['Test 1'].mean()
    max_value = temp_df['Test 2'].max()
    records.append((mean_value, max_value))

# Finally, we create a dataframe from our records and store it
df = pd.DataFrame.from_records(records, columns=['MEAN', 'MAX'])
df.to_csv('output_file.csv', sep=',', index=False)
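If you also want output_file.csv to show which file each row came from, one variant (an assumption, not something the question asked for) is to record the file name alongside the statistics:

import os

# Same loop as above, but each record also carries the source file's base name
records = []
for f in files:
    temp_df = pd.read_excel(f, usecols=fields)
    records.append((os.path.basename(f),
                    temp_df['Test 1'].mean(),
                    temp_df['Test 2'].max()))

df = pd.DataFrame.from_records(records, columns=['FILE', 'MEAN', 'MAX'])
df.to_csv('output_file.csv', sep=',', index=False)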