Loading multiple CSVs into a single pandas dataframe - python

I am trying to load multiple CSVs into a single pandas dataframe. They are all in one folder and all have the same column structure. I have tried a few different methods from a few different threads, and they all return the error ValueError: No objects to concatenate. I'm sure the problem is something dumb like my file path? This is what I've tried:
temps = pd.concat(map(pd.read_csv, glob.glob(os.path.join('./Resources/temps', "*.csv"))))
Also this:
path = r'./Resources/temps'
temps_csvs = glob.glob(os.path.join(path, "*.csv"))
df_for_each_csv = (pd.read_csv(f) for f in temps_csvs)
temps_df = pd.concat(df_for_each_csv, ignore_index=True)
Thanks for any help!

It might not be as helpful as other answers, but when I tried running your code, it worked perfectly fine. The only difference was that I changed the path to be like this:
temps_csvs = glob.glob(os.path.join(os.getcwd(), "*.csv"))
df_for_each_csv = (pd.read_csv(f) for f in temps_csvs)
temps_df = pd.concat(df_for_each_csv, ignore_index = True)
and put the script in the same folder as the csv files.
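If you want to confirm whether the path is the problem, one quick sanity check (my own addition, using the path from your question) is to print what glob actually finds; an empty list is exactly what produces 'No objects to concatenate':
import os
import glob
print(os.getcwd())  # the directory the script is actually running from
print(glob.glob(os.path.join('./Resources/temps', "*.csv")))  # an empty list means the relative path doesn't resolve from here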
EDIT: I saw your comment saying you are now getting the error ParserError: Error tokenizing data. C error: Expected 5 fields in line 1394, saw 6.
This means that the csv files don't all have the same number of columns. Here is a question that deals with a similar issue, maybe it will help:
Reading a CSV file with irregular number of columns using Pandas
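If only a handful of rows are malformed, one possible workaround (my own sketch, not from the linked question, and it requires pandas 1.3 or newer) is to tell read_csv to skip the bad lines rather than raise; whether dropping those rows is acceptable depends on your data:
df = pd.read_csv(f, on_bad_lines='skip')  # silently drops rows with too many fields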

Change the generator expression to a list comprehension on the third line:
[pd.read_csv(f) for f in temps_csvs]
or wrap it in tuple(): tuple(pd.read_csv(f) for f in temps_csvs)
A tuple comprehension doesn't work this way; parentheses around a comprehension give you a generator.
See Why is there no tuple comprehension in Python?
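Applied to the snippet from the question, the list-comprehension version would look like this:
temps_csvs = glob.glob(os.path.join(path, "*.csv"))
df_for_each_csv = [pd.read_csv(f) for f in temps_csvs]  # list comprehension instead of a generator expression
temps_df = pd.concat(df_for_each_csv, ignore_index=True)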

Updated CSV Values across Multiple CSV Files [duplicate]

This question already has answers here: How exactly does a generator comprehension work? (8 answers) and Apply function to each element of a list (4 answers). Closed 6 months ago.
This is my code below, and whenever I run my program I receive an error stating AttributeError: 'generator' object has no attribute 'loc'.
I'm currently trying to change specified values in a specified column to different values, in every one of the csv files.
I'm not sure why this is happening.
# Get CSV files list from a folder
csv_files = glob.glob(dest_dir + "/*.csv")
# Read each CSV file into DataFrame
# This creates a list of dataframes
df = (pd.read_csv(file) for file in csv_files)
df.loc[df['Plan_Code'].str.contains('NABVCI'), 'Plan_Code'] = 'CLEAR_BV'
df.loc[df['Plan_Code'].str.contains('NAMVCI'), 'Plan_Code'] = 'CLEAR_MV'
df.loc[df['Plan_Code'].str.contains('NA_NRF'), 'Plan_Code'] = 'FA_GUAR'
df.to_csv(csv_files, index=False)
Thanks!
You wrote this:
df = (pd.read_csv(file) for file in csv_files)
Rather than that generator expression, you probably intended to write a list comprehension:
df = [pd.read_csv(file) for file in csv_files]
Additionally, you likely want to call pd.concat() so that the multiple CSVs get incorporated into a single dataframe.
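Putting those two together, a minimal sketch would be:
df = pd.concat([pd.read_csv(file) for file in csv_files], ignore_index=True)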
Alternatively, you might prefer to build up a list of dicts pulled from csv.DictReader and then call pd.DataFrame() on that list. Multiple .csv files could contribute rows to the list, one dict per row, without regard to which file the row appears in.
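A rough sketch of that alternative (my own illustration; note that DictReader gives every value back as a string):
import csv
import glob
import pandas as pd

rows = []
for file in glob.glob(dest_dir + "/*.csv"):
    with open(file, newline='') as f:
        rows.extend(csv.DictReader(f))  # one dict per row, regardless of which file it came from

df = pd.DataFrame(rows)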
Because you use round brackets and not square brackets when creating df, df becomes a generator object and not a list of dataframes. But even if you switch to square brackets you will still have a problem: df will now be a list, but lists don't have a loc attribute either, only dataframes -- individual elements of that list -- have it. So df.loc still wouldn't work.
If I understand your intent correctly, you want something like this instead:
csv_files = glob.glob(dest_dir + "/*.csv")
for file in csv_files:
    df = pd.read_csv(file)  # now df is a dataframe, so df.loc makes sense
    # do your df.loc manipulations, then save each df to its own file
    df.to_csv(file, index=False)

reading columns of csv file using pandas not working

I am trying to read the following .csv file, but I want to read each column of it. However, usecols is not working as it is giving the following error:
ValueError: Usecols do not match columns, columns expected but not found: ['sources', 'RMS']
this is how I am reading it:
train=pd.read_csv("parameters.csv", usecols = ['sources','RMS'])
And this is my csv file:
how can I read each column of this file?
edit: I had an unclosed " in my snippet, but the problem persists
If you want to read each column, just remove the usecols part like so: train=pd.read_csv("parameters.csv"). I'm guessing the reason it doesn't work is that your column names have spaces after them, so the actual name of one of your columns is something like 'sources ' with a trailing space.
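One way to check and work around that (a sketch, assuming the header really does contain stray spaces):
train = pd.read_csv("parameters.csv")
print(train.columns.tolist())              # inspect the exact header names
train.columns = train.columns.str.strip()  # remove leading/trailing spaces
train = train[['sources', 'RMS']]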
It might be an error on your part, but you have unclosed quotation marks in your code snippet.
If the order of the columns will always be the same, you can also use an integer list with usecols:
df = pd.read_csv('file.csv', usecols=[0, 4])  # this selects just columns 0 and 4

Load many feather files in a folder into dask

With a folder with many .feather files, I would like to load all of them into dask in python.
So far, I have tried the following, sourced from a similar question on GitHub: https://github.com/dask/dask/issues/1277
files = [...]
dfs = [dask.delayed(feather.read_dataframe)(f) for f in files]
df = dd.concat(dfs)
Unfortunately, this gives me the error TypeError: Truth of Delayed objects is not supported which is mentioned there, but a workaround is not clear.
Is it possible to do the above in dask?
Instead of concat, which operates on dataframes, you want to use from_delayed, which turns a list of delayed objects, each of which represents a dataframe, into a single logical dataframe:
dfs = [dask.delayed(feather.read_dataframe)(f) for f in files]
df = dd.from_delayed(dfs)
If possible, you should also supply the meta= (a zero-length dataframe, describing the columns, index and dtypes) and divisions= (the boundary values of the index along the partitions) kwargs.
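A sketch of what that might look like; the column names and dtypes in meta below are placeholders for whatever your feather files actually contain:
import dask
import dask.dataframe as dd
import pandas as pd
import feather

# zero-length frame describing the expected columns and dtypes (adjust to your data)
meta = pd.DataFrame({'a': pd.Series(dtype='int64'), 'b': pd.Series(dtype='float64')})

dfs = [dask.delayed(feather.read_dataframe)(f) for f in files]
df = dd.from_delayed(dfs, meta=meta)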

Applying the same operations on multiple .csv files in pandas

I have six .csv files. Their overall size is approximately 4 GB. I need to clean each one and do some data analysis tasks on each. These operations are the same for all the frames.
This is my code for reading them.
#df = pd.read_csv(r"yellow_tripdata_2018-01.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-02.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-03.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-04.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-05.csv")
df = pd.read_csv(r"yellow_tripdata_2018-06.csv")
Each time I run the kernel, I uncomment one of the files to be read.
I am looking for a more elegant way to do this. I thought about doing a for-loop: making a list of file names and then reading them one after the other. But I don't want to merge them together, so I think another approach must exist. I have been searching for it, but it seems all the questions lead to concatenating the files read at the end.
Use a for loop and format, like this. I use this every single day:
number_of_files = 6
for i in range(1, number_of_files + 1):
    df = pd.read_csv("yellow_tripdata_2018-0{}.csv".format(i))
    # your code here; do the analysis, then the loop will return and read the next dataframe
You could use a list to hold all of the dataframes:
number_of_files = 6
dfs = []
for file_num in range(1, number_of_files + 1):
    dfs.append(pd.read_csv(f"yellow_tripdata_2018-0{file_num}.csv"))  # I use Python 3.6, so I'm used to f-strings now. If you're using Python <3.6, use .format()
Then to get a certain dataframe use:
df1 = dfs[0]
Edit:
As you are trying to keep from loading all of these in memory, I'd resort to streaming them. Try changing the for loop to something like this:
for file_num in range(1, number_of_files + 1):
    f = open(f"yellow_tripdata_2018-0{file_num}.csv", 'r', newline='')  # open for reading and keep the handle open so the reader stays usable
    dfs.append(csv.reader(f))
Then just use a for loop over dfs[n] or next(dfs[n]) to read each line into memory.
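For example (a quick sketch; remember each element of dfs is now a csv reader, not a dataframe):
header = next(dfs[0])   # first row of the first file
for row in dfs[0]:      # then iterate the remaining rows one at a time
    pass                # process each row here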
P.S.
You may need multi-threading to iterate through each one at the same time.
Loading/Editing/Saving - using the csv module
OK, so I've done a lot of research: Python's csv module does load one line at a time; it's most likely down to the mode we are opening the file in (explained here).
If you don't want to use Pandas (chunking may honestly be the answer, just implement that into seralouk's answer if so), then yes! The below is, in my mind, the best approach; we just need to change a couple of things.
number_of_files = 6
filename = "yellow_tripdata_2018-{}.csv"
for file_num in range(1, number_of_files + 1):
    # notice I'm opening the original file as f in mode 'r' for read only
    # and the new file as nf in mode 'a' for append
    with open(filename.format(str(file_num).zfill(2)), 'r', newline='') as f, \
         open(filename.format(str(file_num).zfill(2) + "-new"), 'a', newline='') as nf:
        # initialize the writer before looping over every line
        w = csv.writer(nf)
        for row in csv.reader(f):
            # do your "data cleaning" (THIS IS PER-LINE, REMEMBER)
            # save to file
            w.writerow(row)
Note:
You may want to consider using a DictReader and/or DictWriter; I prefer them over the regular reader/writer as I find them easier to understand.
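A rough sketch of the same loop with DictReader/DictWriter (my own illustration; in_path and out_path stand for the original and new file names from the loop above):
with open(in_path, 'r', newline='') as f, \
     open(out_path, 'w', newline='') as nf:
    reader = csv.DictReader(f)
    writer = csv.DictWriter(nf, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # row is a dict keyed by column name; clean it here
        writer.writerow(row)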
Pandas Approach - using chunks
PLEASE READ this answer if you'd like to steer away from my csv approach and stick with Pandas :) It literally seems like the same issue as yours, and the answer is what you're asking for.
Basically, Pandas allows you to load a file partially as chunks, execute any alterations, then write those chunks to a new file. The below is largely from that answer, but I did some more reading up myself in the docs.
number_of_files = 6
chunksize = 500  # find the chunksize that works best for you
filename = "yellow_tripdata_2018-{}.csv"
for file_num in range(1, number_of_files + 1):
    for chunk in pd.read_csv(filename.format(str(file_num).zfill(2)), chunksize=chunksize):
        # Do your data cleaning
        chunk.to_csv(filename.format(str(file_num).zfill(2) + "-new"), mode='a')  # again we're in append mode, so the new file is built up chunk by chunk
For more info on chunking the data see here as well it's good reading for those such as yourself getting headaches over these memory issues.
Use glob.glob to get all files with similar names:
import glob
files = glob.glob("yellow_tripdata_2018-0?.csv")
for f in files:
    df = pd.read_csv(f)
    # manipulate df
    df.to_csv(f)
This will match yellow_tripdata_2018-0<any one character>.csv. You can also use yellow_tripdata_2018-0*.csv to match yellow_tripdata_2018-0<anything>.csv or even yellow_tripdata_*.csv to match all csv files that start with yellow_tripdata.
Note that this also only loads one file at a time.
Use os.listdir() to make a list of files you can loop through?
samplefiles = os.listdir(filepath)
for filename in samplefiles:
    df = pd.read_csv(os.path.join(filepath, filename))  # join the directory, since os.listdir returns bare file names
where filepath is the directory containing multiple csv's?
Or a loop that changes the filename:
for i in range(1, 7):
    df = pd.read_csv("yellow_tripdata_2018-0%s.csv" % i)
# import libraries
import pandas as pd
import glob
# store the folder path in a variable
project_folder = r"C:\file_path"
# save all csv file paths in a variable
all_files_paths = glob.glob(project_folder + "/*.csv")
# read each file and collect the dataframes in a list
li = [pd.read_csv(filename, index_col=None, header=0) for filename in all_files_paths]
# convert the list to a single pandas dataframe
df = pd.concat(li, axis=0, ignore_index=True)

How to tag records with filename, imported to pandas dataframe from multiple csv files?

I have a set of csv files I need to import into a pandas dataframe.
I have imported the filepaths as a list, FP, and I am using the following code to read the data:
for i in FP:
    df = pd.read_csv(i, index_col=None, header=0).append(df)
This is working great, but unfortunately there are no datetimestamps or file identifying attributes in the files. I need to know which file each record came from.
I tried adding this line, but this just returned the filename of the final file read:
for i in FP:
    df = pd.read_csv(i, index_col=None, header=0).append(df)
    df['filename'] = i
I can imagine some messy multi-step solutions, but wondered if there was something more elegant I could do within my existing loop.
I'd do it this way:
df = pd.concat([pd.read_csv(f, header=None).assign(filename=f) for f in FP],
ignore_index=True)
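If you would rather store just the bare file name instead of the full path, a small variation (my own addition) is:
import os
df = pd.concat(
    [pd.read_csv(f, index_col=None, header=0).assign(filename=os.path.basename(f)) for f in FP],
    ignore_index=True,
)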
