Looping through lists of CSV file names - Python

I am having some issues with the code below. Its purpose is to take a list of lists, where each inner list carries a series of CSV file names. I want to loop through each of these lists (one at a time) and pull in only the CSV files found in the respective list.
My current code accumulates all the data instead of starting from scratch on each pass. On the first loop it should use all the CSV files at the 0th index, on the second loop all the CSV files at the 1st index, and so on, without accumulating.
path = "C:/DataFolder/"
allFiles = glob.glob(path + "/*.csv")
fileChunks = [['2003.csv','2004.csv','2005.csv'],['2006.csv','2007.csv','2008.csv']]
for i in range(len(fileChunks)):
"""move empty dataframe here"""
df = pd.DataFrame()
for file_ in fileChunks[i]:
df_temp = pd.read_csv(file_, index_col = None, names = names, parse_dates=True)
df = df.append(df_temp)
Note: fileChunks is derived from a function that spits out a list of lists like the example above.
Any pointers to documentation, or to my error, would be great; I want to learn from this. Thank you.
EDIT
It seems that moving the empty dataframe inside the outer for loop works.

This should unnest your files and read each separately using a list comprehension, and then join them all using concat. This is much more efficient than appending each read to a growing dataframe.
df = pd.concat([pd.read_csv(file_, index_col=None, names=names, parse_dates=True)
                for chunk in fileChunks for file_ in chunk],
               ignore_index=True)
>>> [file_ for chunk in fileChunks for file_ in chunk]
['2003.csv', '2004.csv', '2005.csv', '2006.csv', '2007.csv', '2008.csv']
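If you actually want one dataframe per chunk, as the question describes, rather than a single flattened frame, a minimal variation of the same idea (assuming the same names list) would be:

chunk_dfs = [pd.concat([pd.read_csv(file_, index_col=None, names=names, parse_dates=True)
                        for file_ in chunk],
                       ignore_index=True)
             for chunk in fileChunks]

chunk_dfs[0]  # 2003-2005 in one frame; chunk_dfs[1] holds 2006-2008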

Related

How to read multiple CSV files from a folder without concatenating each file

I have a folder, and suppose there are 1000 .csv files stored inside it. Now I have to create data frames based on 50 of these files, so instead of loading them one by one, is there any fast approach available?
I also want each file_name to be the name of its data frame.
I tried the method below, but it is not working.
# List of files that I want to load out of 1000
path = "..."
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']

for i in range(0, len(file_names)):
    col_names[i] = pd.read_csv(path + col_name[i])
But when I try to read the variable names, nothing is displayed.
Is there any way I can achieve the desired result?
I have checked various articles, but in each of them all the data ends up concatenated at the end, and I want each file to be loaded individually.
IIUC, what you want to do is create multiple dfs and not concatenate them.
In that case you can read each file with read_csv and stack the returned df objects in a list:

your_paths = [...]  # paths to all of the CSVs you want
l = [pd.read_csv(p) for p in your_paths]  # this will give you a list of your dfs
l[0]  # one of your dfs
If you want them named, you can build a dict with a key per file instead, as shown in the sketch below.
You can then access them individually, by index or by key, depending on the data structure you use.
I would not recommend this, though: it is counterintuitive, and multiple df objects use a little more memory than a single one.
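A one-line sketch of that dict idea, reusing your_paths from above (the keys here are just the paths with the extension stripped, which assumes simple names like 'a.csv'):

dfs = {p.split('.')[0]: pd.read_csv(p) for p in your_paths}
dfs['a']  # the frame read from a.csv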
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
data_frames = {}

for file_name in file_names:
    df = pd.read_csv(file_name)
    data_frames[file_name.split('.')[0]] = df
Now you can reach any data frame through the data_frames dictionary; for example, data_frames['a'] accesses the frame loaded from a.csv.
Try:

import glob

p = glob.glob('folder_path_where_csv_files_stored/*.csv')  # 1. returns a list of all CSV files in this folder; no need to type them one by one
d = [pd.read_csv(i) for i in p]  # 2. creates a list of dataframes: one dataframe from each CSV file
df = pd.concat(d, axis=0, ignore_index=True)  # 3. creates one dataframe `df` from the dataframes in the list `d`
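The same pattern works with pathlib instead of glob, if you prefer; an equivalent sketch:

from pathlib import Path
import pandas as pd

files = Path('folder_path_where_csv_files_stored').glob('*.csv')
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)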

How to read multiple ann files (from brat annotation) within a folder into one pandas dataframe?

I can read one ann file into a pandas dataframe as follows:
df = pd.read_csv('something/something.ann', sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
df.head()
But I don't know how to read multiple ann files into one pandas dataframe. I tried to use concat, but the result is not what I expected.
How can I read many ann files into one pandas dataframe?
It sounds like you need to use glob to pull in all the .ann files from a folder and add them to a list of dataframes. After that you probably want to join/merge/concat etc. as required.
I don't know your exact requirements, but the code below should get you close. As it stands, the script assumes that relative to where you run it there is a subfolder called files, and that you want to pull in all the .ann files in it (it will not look at anything else). Review and change as required; it's commented per line.
import pandas as pd
import glob

path = r'./files'  # use your path
all_files = glob.glob(path + "/*.ann")

# create an empty list to hold dataframes from the files found
dfs = []

# for each file in the path above ending in .ann
for file in all_files:
    # open the file
    df = pd.read_csv(file, sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
    # add this new (temporary, during the looping) frame to the end of the list
    dfs.append(df)

# at this point you have a list of frames, each list item holding one .ann file,
# like [annFile1, annFile2, etc.] - just not those names

# handle a list that is empty
if len(dfs) == 0:
    print('No files found.')
    # create a dummy frame
    df = pd.DataFrame()
# or have only one item/frame and get it out
elif len(dfs) == 1:
    df = dfs[0]
# or concatenate more than one frame together
else:  # modify this join as required
    df = pd.concat(dfs, ignore_index=True)
    df = df.reset_index(drop=True)

# check what you've got
print(df.head())

Stop overwriting when creating new df from looping through original df

I have a large df where the last column is a filename. I want to make a new CSV containing the rows of all files that have an 'M' in the filename. I have managed to do the majority of this, but the resulting CSV has only one row, containing the last file found in the large CSV. I want each row to be transferred to the CSV on a new line.
I have tried df.append in many ways but haven't had any luck. I have seen some very different approaches, but they required changing all my code when it feels like only a minor adjustment is needed.
path = '.../files/'
big_data = pd.read_csv('landmark_coordinates.csv', sep=',', skipinitialspace=True)  # open big CSV as a DF

# put photos into a male array based on the M character that appears in the filename
male_files = [f for f in glob.glob(path + "**/*[M]*.??g", recursive=True)]

for each_male in male_files:  # for all male files
    male_data = big_data.loc[big_data['photo_name'] == each_male]  # extract their row of data from the CSV and put it in a new dataframe
    # NEEDED: ON A NEW LINE! MUST APPEND. right now it just overwrites
    male_data.to_csv('male_landmark_coordinates.csv', index=False, sep=',')  # transport new df to csv format
Like I said, I need to make sure each file starts on a new row. Would really appreciate any help as it feels like I am so close!
Every time you call df.to_csv you are overwriting the CSV. Build the frame up inside the loop (note that append returns a new frame, so assign the result back), then write once at the end:

male_data = pd.DataFrame()

for each_male in male_files:  # for all male files
    male_data = male_data.append(big_data.loc[big_data['photo_name'] == each_male])

male_data.to_csv('male_landmark_coordinates.csv', index=False, sep=',')  # transport new df to csv format
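Since every row comes out of the same big_data frame, you could also skip the loop entirely and filter once with isin; a sketch, assuming photo_name stores the same paths that glob returns:

male_data = big_data[big_data['photo_name'].isin(male_files)]
male_data.to_csv('male_landmark_coordinates.csv', index=False, sep=',')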

Python / glob.glob - change data type during import

I'm looping through all Excel files in a folder and appending them to a dataframe. One column (column C) has an ID number. In some of the sheets the ID is formatted as text, and in others it's formatted as a number. What's the best way to change the data type during or after the import so that it is consistent? I could always change it in each Excel file before importing, but there are 40+ sheets.
for f in glob.glob(path):
    dftemp = pd.read_excel(f, sheet_name=0, skiprows=13)
    dftemp['file_name'] = os.path.basename(f)
    df = df.append(dftemp, ignore_index=True)
Don't append to a dataframe in a loop: every append relocates the whole dataframe to a new location in memory, which is very slow. Do one single concat after reading all your dataframes:
dfs = []

for f in glob.glob(path):
    df = pd.read_excel(f, sheet_name=0, skiprows=13)
    df['file_name'] = os.path.basename(f)
    df['c'] = df['c'].astype(str)
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
It sounds like your ID, the c column, is really a string that sometimes happens to contain only digits. Ideally it should always be treated as a string.
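Alternatively, pandas can coerce the type at read time: read_excel accepts a dtype mapping, so something like this (column name 'c' assumed, as above) avoids the mixed types in the first place:

df = pd.read_excel(f, sheet_name=0, skiprows=13, dtype={'c': str})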

Is there a way to parallelize Pandas' Append method?

I have 100 XLS files that I would like to combine into a single CSV file. Is there a way to improve the speed of combining them all together?
The issue with using concat is that it lacks the arguments that to_csv affords me:
listOfFiles = glob.glob(file_location)
frame = pd.DataFrame()

for idx, a_file in enumerate(listOfFiles):
    print(a_file)
    data = pd.read_excel(a_file, sheet_name=0, skiprows=range(1, 2), header=1)
    frame = frame.append(data)

# Save to CSV..
print(frame.info())
frame.to_csv(output_dir, index=False, encoding='utf-8', date_format="%Y-%m-%d")
Using multiprocessing, you could read them in parallel using something like:
import multiprocessing
import pandas as pd
dfs = multiprocessing.Pool().map(pd.read_excel, f_names)
and then concatenate them to a single one:
df = pd.concat(dfs)
You should probably check whether the first part is at all faster than
dfs = list(map(pd.read_excel, f_names))
YMMV - it depends on the files, the disks, etc.
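Putting the two parts together, a minimal sketch (the folder name is made up, and the __main__ guard is needed for multiprocessing on platforms that spawn workers). Note that concat does not stop you from passing any of those arguments to to_csv afterwards:

import glob
import multiprocessing

import pandas as pd

if __name__ == '__main__':
    f_names = glob.glob('xls_folder/*.xls')  # hypothetical folder
    with multiprocessing.Pool() as pool:
        dfs = pool.map(pd.read_excel, f_names)  # read files in parallel
    frame = pd.concat(dfs, ignore_index=True)   # combine into one frame
    frame.to_csv('combined.csv', index=False, encoding='utf-8',
                 date_format="%Y-%m-%d")        # to_csv arguments still available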
It'd be more performant to read them into a list and then call concat:
merged = pd.concat(df_list)
so something like
df_list = []

for f in xl_list:
    df_list.append(pd.read_csv(f))  # or read_excel

merged = pd.concat(df_list)
The problem with repeatedly appending to a dataframe is that memory has to be allocated to fit the new size and the contents copied each time, and really you only want to do this once.
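If you're curious about the cost, here is a self-contained sketch (synthetic data, timings will vary) comparing growing a frame one piece at a time against a single concat:

import timeit

import numpy as np
import pandas as pd

parts = [pd.DataFrame(np.random.rand(100, 4)) for _ in range(200)]

def grow_one_by_one():
    out = pd.DataFrame()
    for p in parts:
        out = pd.concat([out, p], ignore_index=True)  # reallocates and copies every pass
    return out

def concat_once():
    return pd.concat(parts, ignore_index=True)  # allocates and copies once

print(timeit.timeit(grow_one_by_one, number=5))
print(timeit.timeit(concat_once, number=5))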
