I have a folder, and suppose there are 1000 .csv files stored inside it. Now I have to create a data frame from 50 of these files, so instead of loading them one by one, is there any faster approach available?
I also want the file name to be the name of my data frame.
I tried the method below, but it is not working.
# List of files that I want to load out of the 1000
path = "..."
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
for i in range(0, len(file_names)):
    col_names[i] = pd.read_csv(path + col_name[i])
But when I try to read the variable afterwards, it does not display any result.
Is there any way I can achieve the desired result?
I have checked various articles, but in each of them all the data is concatenated at the end, and I want each file to be loaded individually.
IIUC, what you want to do is create multiple dfs and not concatenate them.
In that case you can read each file using read_csv and stack the returned df objects in a list:
your_paths = [...]  # paths to all of the CSVs you want to load
l = [pd.read_csv(i) for i in your_paths] # This will give you a list of your dfs
l[0] # One of your dfs
If you want them named, you can build a dict instead, keyed by file name.
You can then access them individually, by index or by key, depending on the data structure you use.
I would not recommend this approach, though: it is counter-intuitive, and multiple df objects use a bit more memory than a single one.
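For example, a minimal sketch of the dict variant, assuming your_paths holds the CSV paths from above and that keying each frame by its file name (without the extension) is what you want:
import os
import pandas as pd

# key each DataFrame by its file name without the .csv extension
dfs_by_name = {os.path.splitext(os.path.basename(p))[0]: pd.read_csv(p) for p in your_paths}
dfs_by_name['a']  # the DataFrame read from a.csv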
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
data_frames = {}
for file_name in file_names:
    df = pd.read_csv(file_name)
    data_frames[file_name.split('.')[0]] = df
Now you can reach any data frame from the data_frames dictionary; for example, data_frames['a'] gives you the frame loaded from a.csv.
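A small usage sketch, in case you also want to loop over everything you loaded (assuming the data_frames dict built above):
for name, df in data_frames.items():
    print(name, df.shape)  # e.g. 'a' followed by the shape of the frame read from a.csv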
Try:
import glob
p = glob.glob('folder_path_where_csv_files_stored/*.csv') #1. will return a list of all csv files in this folder, no need to type them one by one.
d = [pd.read_csv(i) for i in p] #2. will create a list of dataframes: one dataframe from each csv file
df = pd.concat(d, axis=0, ignore_index=True) #3. will create one dataframe `df` from those dataframes in the list `d`
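Since you only want 50 specific files out of the 1000, you could also filter the glob result before reading; a minimal sketch, assuming file_names holds the base names listed in the question:
import os
import glob
import pandas as pd

file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']  # the files you actually want
wanted = set(file_names)
p = glob.glob('folder_path_where_csv_files_stored/*.csv')
d = [pd.read_csv(i) for i in p if os.path.basename(i) in wanted]  # read only the listed files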
I have a problem which might be easy for the majority of people here.
I have four folders: SA1, SA2, SA3, SA4.
Each folder has around 60 csv files. I have defined the path like this:
my_string = "{0}/folder1/{1}/{1} something/{2}/AUST"
analytics_path = "C:/Users/...../SharePoint - Documents/folder2"
year = "2016" # User should define this
level = "SA3" # User should define this
path = my_string.format(analytics_path, year, level)
Once the user defines the year and level in the path above, I want to combine all the csv files under the "level" folder based on the "index_col=" parameter.
For example, for SA1, I want to combine the CSV files based on the "SA1_code" column. For SA2, I want to combine the CSV files based on the "SA2_MAIN_DIGIT_CODE" column. For SA3 and SA4, the index_col should be "SA3_MULTI" and "SA4_REGIONS" respectively. As you can see, the column names for the CSV files under these four folders are all different.
So far, I have attempted the following things.
I have defined the function as
def combine_csv(path):
    """
    Concatenates the csv files and creates one huge dataframe combining the information in all the csv files in a given folder
    Parameters
    ----------
    path (string): Location of the folder and the files therein
    Returns
    ----------
    A dataframe of all the concatenated csv files
    """
    # setting the path for joining multiple files
    files = os.path.join(path + "/*.csv")
    # list of matched files returned
    files = glob.glob(files)
    # joining files with concat and read_csv
    list_csv = []
    for filename in files:
        list_df = pd.read_csv(filename)  # Can't give "index_col", as there are four different column names for the csv files in each folder
        list_csv.append(list_df)
    df = pd.concat(list_csv, axis=1, ignore_index=False)
    return df
data_df = combine_csv(path)
gives me the combined dataframe. But I want to combine it based on "SA1_code" if the user chooses the SA1 folder, or "SA2_MAIN_DIGIT_CODE" if they choose to combine CSV files from the SA2 folder, and so on and so forth.
How do I do this?
You don't have four separate index columns, you just have one that changes depending on user input. Therefore, the solution to your problem is relatively simple. First, modify your combine_csv method:
def combine_csv(path, index):
    """
    Concatenates the csv files and creates one huge dataframe combining the information in all the csv files in a given folder
    Parameters
    ----------
    path (string): Location of the folder and the files therein
    index (string): Column name to pass to read_csv as index_col
    Returns
    ----------
    A dataframe of all the concatenated csv files
    """
    # setting the path for joining multiple files
    files = os.path.join(path + "/*.csv")
    # list of matched files returned
    files = glob.glob(files)
    # joining files with concat and read_csv
    list_csv = []
    for filename in files:
        list_df = pd.read_csv(filename, index_col=index)
        list_csv.append(list_df)
    df = pd.concat(list_csv, axis=1, ignore_index=False)
    return df
All I did was inject a value, index, that will be used for the index_col argument to read_csv.
Next, we need to determine the value for index based on the value for level, as input by the user. According to your question, it seems that there should be a one-to-one relationship between these values. So, we can use a dictionary for this:
LevelIndexMapping = {
"SA1": "SA1_code",
"SA2": "SA2_MAIN_DIGIT_CODE",
"SA3": "SA3_MULTI",
"SA4": "SA4_REGIONS"
}
my_string = "{0}/folder1/{1}/{1} something/{2}/AUST"
analytics_path = "C:/Users/...../SharePoint - Documents/folder2"
year = "2016" # User should define this
level = "SA3" # User should define this
path = my_string.format(analytics_path, year, level)
combine_csv(path, LevelIndexMapping[level])
Here, I created a dictionary that maps your level variable to its associated index column value, and then accesses that mapping when calling combine_csv.
I can read one ann file into a pandas dataframe as follows:
df = pd.read_csv('something/something.ann', sep='^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
df.head()
But I don't know how to read multiple ann files into one pandas dataframe. I tried to use concat, but the result is not what I expected.
How can I read many ann files into one pandas dataframe?
It sounds like you need to use glob to pull in all the .ann files from a folder and add them to a list of dataframes. After that you probably want to join/merge/concat etc. as required.
I don't know your exact requirements, but the code below should get you close. As it stands, the script assumes that, relative to where you run it, there is a subfolder called files from which you want to pull in all the .ann files (it will not look at anything else). Review and change as required; it is commented per line.
import pandas as pd
import glob
path = r'./files' # use your path
all_files = glob.glob(path + "/*.ann")
# create empty list to hold dataframes from files found
dfs = []
# for each file in the path above ending .ann
for file in all_files:
    # open the file
    df = pd.read_csv(file, sep='^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
    # add this new (temp during the looping) frame to the end of the list
    dfs.append(df)

# at this point you have a list of frames, with each list item holding one .ann file. Like [annFile1, annFile2, etc.] - just not those names.
# handle a list that is empty
if len(dfs) == 0:
    print('No files found.')
    # create a dummy frame
    df = pd.DataFrame()
# or have only one item/frame and get it out
elif len(dfs) == 1:
    df = dfs[0]
# or concatenate more than one frame together
else:  # modify this join as required
    df = pd.concat(dfs, ignore_index=True)
    df = df.reset_index(drop=True)
#check what you've got
print(df.head())
excel_TSMS_df = pd.read_excel(r'C:/TMP/TSMS.xls', sheet_name='TSMS', dayfirst = True)
excel_Ratecard_df = pd.read_excel(r'C:/TMP/Ratecard.xls', sheet_name='Ratecard')
excel_ONCall_df = pd.read_excel(r'C:/TMP/On Call Report.xls', sheet_name='On Call Report')
excel_OverTime_df = pd.read_excel(r'C:/TMP/Over Time.xls', sheet_name='OT')
Here I have 4 files, and I want to create 4 dataframes; in general, for N files, N dataframes, each named uniquely. Kindly help.
Suppose in one scenario the source contains only two files, like TSMS and Ratecard; in that case only two data frames should be created. However many files are in the source location, that many data frames should be created when reading the Excel data. Kindly help with this case.
Create two lists, one for dataframe names and one for file names, and an empty dictionary:
dataframe_names = ["TSMS","Ratecard","ONCall","OverTime"]
file_names = ['C:/TMP/TSMS.xls', 'C:/TMP/Ratecard.xls','C:/TMP/On Call Report.xls','C:/TMP/Over Time.xls']
dataframes = {}
Then iterate over the lists together (make sure they are the same length), and every time you open an Excel file as a dataframe, add it to the dictionary:
for index,name in enumerate(file_names):
    df = pd.read_excel(name)
    df_name = dataframe_names[index]
    dataframes[df_name] = df
This should work for any number of Excel files, as long as you keep the two original lists the same length and with matching positions.
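If the set of workbooks changes between runs, a variant is to discover the files with glob and derive the dataframe names from the file names, so there is only one list to maintain. A minimal sketch, assuming the .xls files live in C:/TMP and that reading each file's first sheet is acceptable:
import os
import glob
import pandas as pd

dataframes = {}
for file_path in glob.glob('C:/TMP/*.xls'):
    name = os.path.splitext(os.path.basename(file_path))[0]  # e.g. 'TSMS' from C:/TMP/TSMS.xls
    dataframes[name] = pd.read_excel(file_path)  # defaults to the first sheet

dataframes['TSMS'].head()  # one dataframe per file, keyed by file name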
I would like to create scalable code to import multiple CSV files, standardize the order of the columns based on the column names, and re-write the CSV files.
import glob
import pandas as pd
# Get a list of all the csv files
csv_files = glob.glob('*.csv')
# List comprehension that loads of all the files
dfs = [pd.read_csv(x,delimiter=";") for x in csv_files]
A=pd.DataFrame(dfs[0])
B=pd.DataFrame(dfs[1])
alpha=A.columns.values.tolist()
print([pd.DataFrame(x[alpha]) for x in dfs])
I would like to be able to split this object, write a CSV for each of the files, and name them with the original file names. Is that easily possible with Python? Thanks for your help.
If you want to reorder the columns into a consistent order, assuming that all the csv's have the same column names but in a different order, you can sort one file's column name list and then order all the other files by that list. Using your example:
csv_files = glob.glob('*.csv')
sorted_columns = []
for e, x in enumerate(csv_files):
    df = pd.read_csv(x, delimiter=";")
    if e == 0:
        sorted_columns = sorted(df.columns.values.tolist())
    df[sorted_columns].to_csv(x, sep=";", index=False)  # overwrite each file in place; index=False avoids adding an extra index column
I am having some issues with my code below. The purpose of the code is to take a list of lists, where each inner list carries a series of csv file names. I want to loop through these lists (one at a time) and pull in only the csv files found in the respective list.
My current code accumulates all the data instead of starting from scratch each time it loops. On the first loop it should use all the csv files at index 0, on the second loop all the csv files at index 1, and so on - but it should not accumulate.
path = "C:/DataFolder/"
allFiles = glob.glob(path + "/*.csv")
fileChunks = [['2003.csv','2004.csv','2005.csv'],['2006.csv','2007.csv','2008.csv']]
for i in range(len(fileChunks)):
"""move empty dataframe here"""
df = pd.DataFrame()
for file_ in fileChunks[i]:
df_temp = pd.read_csv(file_, index_col = None, names = names, parse_dates=True)
df = df.append(df_temp)
Note: fileChunks is derived from a function, and it spits out a list of lists like the example above.
Any pointers to documentation, or to my error, would be great - I want to learn from this. Thank you.
EDIT
It seems that moving the empty dataframe to within the first for loop works.
This should unnest your files and read each separately using a list comprehension, and then join them all using concat. This is much more efficient than appending each read to a growing dataframe.
df = pd.concat(
    [pd.read_csv(file_, index_col=None, names=names, parse_dates=True)
     for chunk in fileChunks for file_ in chunk],
    ignore_index=True)
>>> [file_ for chunk in fileChunks for file_ in chunk]
['2003.csv', '2004.csv', '2005.csv', '2006.csv', '2007.csv', '2008.csv']
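If you do want to keep one DataFrame per chunk rather than a single combined frame, a variant is to concat each inner list separately. A minimal sketch, assuming fileChunks and names are defined as in the question:
# one concatenated DataFrame per inner list, nothing accumulates across chunks
chunk_dfs = [
    pd.concat(
        [pd.read_csv(file_, index_col=None, names=names, parse_dates=True) for file_ in chunk],
        ignore_index=True,
    )
    for chunk in fileChunks
]
chunk_dfs[0]  # 2003-2005 in one frame; chunk_dfs[1] holds 2006-2008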