I want to open multiple csv files in python, collate them and have python create a new file with the data from the multiple files reorganised...
Is there a way for me to read all the files from a single directory on my desktop and read them in python like this?
Thanks a lot
If you have a directory containing your csv files, and they all have the extension .csv, then you could use, for example, glob and pandas to read them all in and concatenate them into one csv file. Say you have a directory laid out like this:
csvfiles/one.csv
csvfiles/two.csv
where one.csv contains:
name,age
Keith,23
Jane,25
and two.csv contains:
name,age
Kylie,35
Jake,42
Then you could do the following in Python (you will need to install pandas with, e.g., pip install pandas):
import glob
import os
import pandas as pd
# the path to your csv file directory
mycsvdir = 'csvfiles'
# get all the csv files in that directory (assuming they have the extension .csv)
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))
# loop through the files and read them in with pandas
dataframes = [] # a list to hold all the individual pandas DataFrames
for csvfile in csvfiles:
    df = pd.read_csv(csvfile)
    dataframes.append(df)
# concatenate them all together
result = pd.concat(dataframes, ignore_index=True)
# print out to a new csv file
result.to_csv('all.csv')
Note that the output csv file will have an additional column at the front containing the index of the row. To avoid this you could instead use:
result.to_csv('all.csv', index=False)
You can see the documentation for the to_csv() method here.
Hope that helps.
Here is a very simple way to do what you want to do.
import glob
import os
import pandas as pd

os.chdir("C:\\your_path\\")
# collect each matching file, then concatenate once at the end
# (DataFrame.append was removed in pandas 2.0)
frames = []
for file in glob.glob("1*"):
    frames.append(pd.read_csv(file, skiprows=0, usecols=[1, 2, 3]))
results = pd.concat(frames, ignore_index=True)
results.to_csv('C:\\your_path\\combinedfile.csv')
Notice this part: glob("1*")
This will look only for files that start with '1' in the name (1, 10, 100, etc). If you want everything, change it to this: glob("*")
Sometimes it's necessary to merge all CSV files into a single CSV file, and sometimes you just want to merge some files that match a certain naming convention. It's nice to have this feature!
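For instance, a few pattern variations (the file names in the comments are just made up for illustration):
glob.glob("1*")      # only files whose name starts with '1', e.g. ['1.csv', '10.csv', '100.csv']
glob.glob("*.csv")   # every csv file in the current directory
glob.glob("*")       # every file, regardless of extension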
I know that the post is a little bit old, but using glob can be quite expensive in terms of memory if you are trying to read large csv files, because you will store all that data in a list and then you'll still need enough memory to concatenate the dataframes in that list into a single dataframe with all the data. Sometimes this is not possible.
import pandas as pd

csv_dir = 'directory path'
df = pd.DataFrame()
for i in range(24):
    # files are named "file name0.csv" ... "file name23.csv"
    csvfile = pd.read_csv(csv_dir + '/file name{}.csv'.format(i), encoding='utf8')
    df = pd.concat([df, csvfile], ignore_index=True)
    del csvfile
So, in case your csv files have the same name except for some kind of number or string that differentiates them, you could just loop through the files, concatenate each one onto the dataframe with pd.concat, and delete the temporary variable afterwards! In this case all my csv files have the same name except that they are numbered in a range that goes from 0 to 23.
Related
I have multiple csv files present in a hadoop folder. Each csv file has a header, and the header is the same in every file.
I am writing these csv files with the help of a Spark dataset, like this, in Java:
df.write().csv(somePath)
I was also thinking of using coalesce(1), but it is not memory efficient in my case.
I know that this write will also create some redundant files in the folder, so I need to handle that as well.
I want to merge all these csv files into one big csv file, but I don't want to repeat the header in the combined file. I just want one line of header on top of the data in my csv file.
I am merging these files with Python. I know I can use the hadoop getmerge command, but it will also merge the headers that are present in each csv file,
so I am not able to figure out how I should merge all the csv files without merging the headers.
coalesce(1) is exactly what you want.
Speed/memory usage is the tradeoff you get for wanting exactly one file.
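For reference, a minimal PySpark sketch of that approach (the Java API exposes the same call chain); the SparkSession, the DataFrame name, and the output path below are illustrative assumptions, not taken from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true").csv("somePath")   # however your DataFrame was produced

(df.coalesce(1)                   # collapse to a single partition -> a single part file
   .write
   .option("header", "true")      # the header is written once, at the top of that file
   .mode("overwrite")
   .csv("mergedPath"))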
It seems this will do it for you:
# importing libraries
import pandas as pd
import glob
import os
# build the pattern for the csv files to merge
joined_files = os.path.join("/hadoop", "*.csv")
# a list of all matching files is returned
joined_list = glob.glob(joined_files)
# finally, the files are read and joined
df = pd.concat(map(pd.read_csv, joined_list), ignore_index=True)
Edit: I don't know much about Hadoop, but maybe the same logic applies.
I've been searching for a way to merge all the csv files in a folder. They all have the same headers, but different names. I've found some videos on YouTube about merging and some questions here on Stack Overflow that touch on the matter. The problem is that these tutorials are focused on files that share a base name, such as sales1, sales2, etc.
In my case, all files in the directory are CSVs and are located in 'D:\XXXX\XXXX\output'
The code I have used is:
import pandas as pd
# set files path
amazon = r'D:\XXXX\XXXX\output\amazonbooks.csv'
bookcrossing = r'D:\XXXX\XXXX\output\bookcrossing.csv'
# merge files
dataFrame = pd.concat(
map(pd.read_csv, [amazon, bookcrossing]), ignore_index=True)
print(dataFrame)
If the code could merge all the files in the output folder (since all of them are .csv), instead of naming each one of them, that would be better.
I'd be glad if anyone can help me with this problem, or can guide me on how to solve this.
If the goal is to append the files into a single result, you don't really need any CSV processing at all. Just write the file contents minus the header line (except the first one). glob will return file names with path that match the pattern, "*.csv".
from glob import glob
import os
import shutil
csv_dir = r'D:\XXXX\XXXX\output'
result_csv = r'd:\XXXX\XXXX\combined.csv'
first_hdr = True
# all .csv files in the directory have the same header
with open(result_csv, "w", newline="") as result_file:
    for filename in glob(os.path.join(csv_dir, "*.csv")):
        with open(filename) as in_file:
            header = in_file.readline()
            if first_hdr:
                result_file.write(header)
                first_hdr = False
            shutil.copyfileobj(in_file, result_file)
(assuming all csvs have equal number of columns)
Try something like this:
import os
import pandas as pd
out_dir = r'D:\XXXX\XXXX\output'
csvs = [file for file in os.listdir(out_dir) if file.endswith('.csv')]
result_df = pd.concat([pd.read_csv(os.path.join(out_dir, file)) for file in csvs])
I have multiple zip files in a folder and within the zip files are multiple csv files.
Not all of the csv files have all the columns, but a few have all of them.
How can I use the file that has all the columns as an example and then loop it to extract all the data into one dataframe and save it into one csv for further use?
The code I am following right now is as below:
import glob
import zipfile
import pandas as pd
dfs = []
for zip_file in glob.glob(r"C:\Users\harsh\Desktop\Temp\*.zip"):
    zf = zipfile.ZipFile(zip_file)
    dfs += [pd.read_csv(zf.open(f), sep=";", encoding='latin1') for f in zf.namelist()]
df = pd.concat(dfs,ignore_index=True)
print(df)
However, I am not getting the columns and headers at all. I am stuck at this stage.
If you'd like to know the file structure,
Please find the output of the code here and
The example csv file here.
If you would like to see my project files for this code, Please find the shared google drive link here
Also, at the risk of sounding redundant, why am I required to use the sep=";", encoding='latin1' part? The code gives me an error without it.
In one of my directories, I have multiple CSV files. I want to read the content of all the CSV files through Python code and print the data, but so far I have not been able to do so.
All the CSV files have the same number of columns and the same column names as well.
I know a way to list all the CSV files in the directory and iterate over them using the "os" module and a "for" loop.
for files in os.listdir("C:\\Users\\AmiteshSahay\\Desktop\\test_csv"):
Now use the "csv" module to read the file names
reader = csv.reader(files)
till here I expect the output to be the names of the CSV files. which happens to be sorted. for example, names are 1.csv, 2.csv so on. But the output is as below
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
if I add next() function after the csv.reader(), I get below output
['1']
['2']
['3']
['4']
['5']
['6']
These happen to be the first characters of my CSV file names, which is partially correct but not fully.
Apart from this once I have the files iterated, how to see the contents of the CSV files on the screen? Today I have 6 files. Later on, I could have 100 files. So, it's not possible to use the file handling method in my scenario.
Any suggestions?
The easiest way I found while developing my project is to use DataFrame, read_csv, and glob.
import glob
import os
import pandas as pd
folder_name = 'train_dataset'
file_type = 'csv'
separator = ','
dataframe = pd.concat([pd.read_csv(f, sep=separator) for f in glob.glob(folder_name + "/*." + file_type)], ignore_index=True)
Here, all the csv files are loaded into 1 big dataframe.
I would recommend reading your CSVs using the pandas library.
Check this answer here: Import multiple csv files into pandas and concatenate into one DataFrame
Although you asked for python in general, pandas does a great job at data I/O and would help you here in my opinion.
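For example, a minimal sketch of that approach (the directory path is the one from your question; the files are assumed to share the same columns):
import glob
import os
import pandas as pd

csv_dir = "C:\\Users\\AmiteshSahay\\Desktop\\test_csv"
# read every csv in the directory and stack them into one DataFrame
frames = [pd.read_csv(path) for path in glob.glob(os.path.join(csv_dir, "*.csv"))]
combined = pd.concat(frames, ignore_index=True)
print(combined)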
till here I expect the output to be the names of the CSV files
This is the problem. csv.reader objects do not represent filenames. They represent lazy objects which may be iterated to yield rows from a CSV file. Note that csv.reader must be given an open file object, not just the file's name. If you wish to print an entire CSV file, you can open it and call list on the csv.reader object:
for name in os.listdir("C:\\Users\\AmiteshSahay\\Desktop\\test_csv"):
    path = os.path.join("C:\\Users\\AmiteshSahay\\Desktop\\test_csv", name)
    with open(path, newline="") as f:
        reader = csv.reader(f)
        print(list(reader))
if I add next() function after the csv.reader(), I get below output
Yes, this is what you should expect. Calling next on an iterator will give you the next value which comes out of that iterator. This would be the first line of each file. For example:
from io import StringIO
import csv
some_file = StringIO("""1
2
3""")
with some_file as fin:
    reader = csv.reader(fin)
    print(next(reader))
['1']
which happens to be sorted. for example, names are 1.csv, 2.csv so on.
This is not a coincidence. In your code, files is just the filename string (for example "1.csv"), so csv.reader(files) iterates over the characters of that string, and next(reader) returns the first character of the filename as a one-element row. To get at the rows of the file itself, open the file first and pass the file object to csv.reader.
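A quick way to see this (a minimal illustration, not code from the original post):
import csv

reader = csv.reader("1.csv")   # iterating a string yields its characters, one per "row"
print(next(reader))            # ['1'] : the first character of the filename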
Apart from this once I have the files iterated, how to see the
contents of the csv files on the screen?
Use the print command, as in the examples above.
Today I have 6 files. Later on, I could have 100 files. So, it's not
possible to use the file handling method in my scenario.
This is not true. You can define a function that prints all or part of a csv file, then call that function in a for loop with each filename as input, as sketched below.
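For instance, a minimal sketch of such a function (the directory path is the one from your question; the function name and the max_rows parameter are just illustrative):
import csv
import os

def print_csv(path, max_rows=None):
    # print every row of one csv file (or only the first max_rows rows)
    with open(path, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if max_rows is not None and i >= max_rows:
                break
            print(row)

csv_dir = "C:\\Users\\AmiteshSahay\\Desktop\\test_csv"
for name in os.listdir(csv_dir):
    if name.endswith(".csv"):
        print_csv(os.path.join(csv_dir, name))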
If you want to import your files as separate dataframes, you can try this:
import pandas as pd
import os
filenames = os.listdir("../data/")  # lists all files in your directory (assumed to contain only csv files)

def extract_name_files(text):  # removes .csv from the name of each file
    # str.strip('.csv') would strip any of those characters from both ends,
    # so use os.path.splitext to drop just the extension
    name_file = os.path.splitext(text)[0].lower()
    return name_file

names_of_files = list(map(extract_name_files, filenames))  # creates a list that will be used to name your dataframes

for i in range(len(names_of_files)):  # saves each csv in a dataframe structure
    exec(names_of_files[i] + " = pd.read_csv('../data/' + filenames[i])")
You can read and store several dataframes into separate variables using two lines of code.
import pandas as pd
datasets_list = ['users', 'calls', 'messages', 'internet', 'plans']
users, calls, messages, internet, plans = [pd.read_csv(f'datasets/{dataset_name}.csv') for dataset_name in datasets_list]
I have a zipfile of several CSV documents. I have extracted the CSVs into a folder called "staging". These documents are encoded in Windows CP1252. What I would like to do is read each CSV file individually into a separate dataframe and then overwrite the old files with utf8 encoding after I have removed all of the null values. Or, instead of rewriting the CSVs as utf8, I can encode the database directly from the pandas dataframes that are produced. Any help would be greatly appreciated. I have browsed the Stack Overflow forums and the main topic seems to be concatenating multiple CSVs into a single dataframe, whereas what I need is a separate dataframe for each CSV. Also, I have to remove N/A values; however, in the CSVs they have random numbers attached to them (i.e. N/A (3) or N/A(1), etc.).
Here is the code I am working with:
# Create the staging directory
staging_dir = "staging"
os.mkdir(staging_dir)
# Confirm the staging directory path
os.path.isdir(staging_dir)
# Machine independent path to create files
zip_file = os.path.join(staging_dir, "Hospital_Revised_Flatfiles.zip")
# Write the files to the computer
zf = open(zip_file,"wb")
zf.write(r.content)
zf.close()
# Program to unzip the files
import zipfile
z = zipfile.ZipFile(zip_file,"r")
z.extractall(staging_dir)
z.close()
#Create the dataframes
import io
import glob
import pandas as pd
files = glob.glob(os.path.join("staging" + "/*.csv"))
# OS independent reading of files
for file in files:
    dfs = pd.read_csv(file, header = 0, encoding = 'cp1252')
I believe P.Tillmann's solution should've worked. Alternatively, you can load all your dataframes first and then write them back.
files = glob.glob(os.path.join("staging" + "/*.csv"))
dict_ = {}
for file in files:
    dict_[file] = pd.read_csv(file, header=0, encoding='cp1252').dropna()

for file in dict_:
    dict_[file].to_csv(file, encoding='utf-8')
Just add
dfs.dropna().to_csv(file, encoding='utf-8')
to your last loop. It will drop all rows with null values and then save the dataframe by overwriting the old version.
And remove the first bracket in your last line; you open two but only close one. That's where the EOF error is coming from.
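For reference, a sketch of what that last loop would look like with the change applied (assuming the same staging directory and cp1252 encoding as in your code):
files = glob.glob(os.path.join("staging", "*.csv"))
for file in files:
    dfs = pd.read_csv(file, header=0, encoding='cp1252')
    # drop rows containing null values, then overwrite the file as utf-8
    dfs.dropna().to_csv(file, encoding='utf-8')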