I have a folder with many zip files and within those zip files are multiple csv files.
Is there any way to get all of the .csv files in one dataframe in python?
Or any way I can pass a list of zip files?
The code I am currently trying is:
import glob
import zipfile
import pandas as pd

for zip_file in glob.glob(r"C:\Users\harsh\Desktop\Temp\data_00-01.zip"):
    # This is just one file. There are multiple zip files in the folder
    zf = zipfile.ZipFile(zip_file)
    dfs = [pd.read_csv(zf.open(f), header=None, sep=";", encoding='latin1') for f in zf.namelist()]
    df = pd.concat(dfs, ignore_index=True)
    print(df)
This code works for one zipfile but I have about 50 zip files in the folder and I would like to read and concatenate all csv files in those zip files in one dataframe.
Thanks
The following code should satisfy your requirements (just edit dir_name according to what you need):
import os
import zipfile
import pandas as pd

dfs = []
for filename in os.listdir(dir_name):
    if filename.endswith('.zip'):
        zip_file = os.path.join(dir_name, filename)
        zf = zipfile.ZipFile(zip_file)
        dfs += [pd.read_csv(zf.open(f), header=None, sep=";", encoding='latin1') for f in zf.namelist()]
df = pd.concat(dfs, ignore_index=True)
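As an aside, the asker's original glob approach also works once the pattern is widened to match every zip in the folder. A minimal self-contained sketch (using a temporary folder with hypothetical data in place of the real 50 zips):

```python
import glob
import os
import tempfile
import zipfile

import pandas as pd

# Build two small zip files with one CSV each so the example runs anywhere;
# in practice dir_name would point at the real folder of zip files.
dir_name = tempfile.mkdtemp()
for i in range(2):
    with zipfile.ZipFile(os.path.join(dir_name, f"data_{i}.zip"), "w") as zf:
        zf.writestr(f"part_{i}.csv", "1;a\n2;b\n")

# glob.glob with a wildcard returns every matching zip path at once
dfs = []
for zip_path in glob.glob(os.path.join(dir_name, "*.zip")):
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            dfs.append(pd.read_csv(zf.open(name), header=None, sep=";", encoding="latin1"))

df = pd.concat(dfs, ignore_index=True)
print(df.shape)  # (4, 2): two rows from each of the two zips
```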
The code below combines csv files together and creates a new csv file that is 'utf-8-sig'; however, I want it to be ".CSV" for excel. Any suggestions?
import os
import glob
import pandas as pd
# Change File Path to personal directory folder
os.chdir("C:/Users")
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# Using Pandas to combine all files in the list
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
#export to csv
combined_csv.to_csv( "Combined_File.csv", index=False, encoding='utf-8-sig')
Replace encoding='utf-8-sig' with encoding='utf-8'
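To illustrate the difference, here is a small sketch (writing to a temporary file rather than the real path) showing that plain utf-8 omits the byte-order mark that 'utf-8-sig' prepends:

```python
import os
import tempfile

import pandas as pd

combined_csv = pd.DataFrame({"a": [1, 2]})
path = os.path.join(tempfile.mkdtemp(), "Combined_File.csv")

# encoding='utf-8' writes no byte-order mark; 'utf-8-sig' would prepend one
combined_csv.to_csv(path, index=False, encoding="utf-8")

with open(path, "rb") as f:
    first_bytes = f.read(3)
print(first_bytes == b"\xef\xbb\xbf")  # False: no BOM with plain utf-8
```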
I have tens of tab-delimited text files in my local directory. When I copy and paste a text file into an Excel sheet, it becomes a file having hundreds of columns. Now, I would like to read all the text files and convert them to corresponding Excel files.
If there was a single file, I would have done the following way:
import pandas as pd
df = pd.read_csv("H:\\Yugeen\\text1.txt", sep='\t')
df.to_excel('H:\\Yugeen\\output1.xlsx', 'Sheet1', index = False)
Is there any way to achieve the solution I am looking for?
I use this function to list all files in a directory, along with their file path:
import os

def list_files_in_directory(path):
    '''docstring for list_files_in_directory'''
    x = []
    for root, dirs, files in os.walk('.' + path):
        for file in files:
            x.append(root + '/' + file)
    return x
Selecting for only text files:
files = list_files_in_directory('.')
filtered_files = [i for i in files if '.txt' in i]
Like Sophia demonstrated, you can use pandas to create a dataframe. I'm assuming you want to merge these files as well.
import pandas as pd

dfs = []
for file in filtered_files:
    df = pd.read_csv(file, sep='\t')
    dfs.append(df)

df_master = pd.concat(dfs, axis=1)
filename = 'master_dataframe.csv'
df_master.to_csv(filename, index=False)
The saved file can then be opened in Excel.
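One design note on the axis=1 above: it lines the frames up side by side on the row index, whereas axis=0 would stack them vertically. A tiny sketch of the difference:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"y": [3, 4]})

# axis=1 places frames side by side, matching rows on the index;
# axis=0 would stack them one after another instead
side_by_side = pd.concat([a, b], axis=1)
print(list(side_by_side.columns))  # ['x', 'y']
print(side_by_side.shape)  # (2, 2)
```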
Are you talking about how to get the filenames? You can use the glob library.
import glob
import os
import pandas as pd

file_paths = glob.glob('your-directory\\*.txt')
for file in file_paths:
    df = pd.read_csv(file, sep='\t')
    # build a distinct output name per input file so each one is kept
    out = os.path.join('output-directory', os.path.splitext(os.path.basename(file))[0] + '.xlsx')
    df.to_excel(out, index=False)
Does this answer your question?
I'm almost done with merging Excel files with pandas in Python, but when I give the path it won't work. I get the error "No such file or directory: 'file1.xlsx'". When I leave the path empty it works, but I want to decide which folder it should take files from. And I saved the files in the folder 'excel'.
cwd = os.path.abspath('/Users/Viktor/downloads/excel') #If i leave it empty and have files in /Viktor it works but I have the desired excel files in /excel
print(cwd)
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel(file), ignore_index=True)
df.head()
df.to_excel(r'/Users/Viktor/Downloads/excel/resultat/merged.xlsx')
pd.read_excel(file) looks for the file relative to the path where the script is executed. If you execute it in '/Users/Viktor/', try:
import os
import pandas as pd
cwd = os.path.abspath('/Users/Viktor/downloads/excel') #If i leave it empty and have files in /Viktor it works but I have the desired excel files in /excel
#print(cwd)
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel('downloads/excel/' + file), ignore_index=True)
df.head()
df.to_excel(r'/Users/Viktor/downloads/excel/resultat/merged.xlsx')
How about actually changing the current working directory with
os.chdir(cwd)
Just printing the path doesn't help.
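A minimal sketch of that idea, using a temporary folder and a CSV stand-in for the .xlsx files (the resolution rule is the same): once os.chdir has run, bare filenames resolve against the new working directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical folder standing in for /Users/Viktor/downloads/excel
cwd = tempfile.mkdtemp()
pd.DataFrame({"a": [1]}).to_csv(os.path.join(cwd, "demo.csv"), index=False)

os.chdir(cwd)  # after this, bare filenames resolve inside cwd
df = pd.read_csv("demo.csv")  # no directory prefix needed
print(df.shape)  # (1, 1)
```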
Use pathlib
Path.glob() to find all the files
Use Path.rglob() if you want to include subdirectories
Use pandas.concat to combine the dataframes created with the pd.read_excel in the list comprehension
from pathlib import Path
import pandas as pd
# path to files
p = Path('/Users/Viktor/downloads/excel')
# find the xlsx files
files = p.glob('*.xlsx')
# create the dataframe
df = pd.concat([pd.read_excel(file) for file in files], ignore_index=True)
# save the file
df.to_excel(r'/Users/Viktor/Downloads/excel/resultat/merged.xlsx')
I have a folder JanuaryDataSentToResourcePro that contain multiple .xlsx files.
I want to iterate through folder and convert all of them into .csv and keep the same file name.
For that I'm trying to implement glob, but getting an error: TypeError: 'module' object is not callable
import glob
excel_files = glob('*xlsx*')
for excel in excel_files:
    out = excel.split('.')[0]+'.csv'
    df = pd.read_excel(r'''C:\Users\username\Documents\TestFolder\JanuaryDataSentToResourcePro\ResourceProDailyDataset_01_01_2018.xlsx''', 'ResourceProDailyDataset')
    df.to_csv(out)
I am new to python. Does it look right?
UPDATE:
import pandas as pd
import glob
excel_files = glob.glob("*.xlsx")
for excel in excel_files:
    out = excel.split('.')[0]+'.csv'
    df = pd.read_excel(excel, 'ResourceProDailyDataset')
    df.to_csv(out)
But it still does not convert the .xlsx files to .csv.
The glob package should be used like:
import glob
f = glob.glob("*.xlsx")
glob itself is a module, not a callable; glob.glob is the function.
========================================
import glob
import pandas as pd

excel_files = glob.glob('C:/Users/username/Documents/TestFolder/JanuaryDataSentToResourcePro/*.xlsx') # assume the path
for excel in excel_files:
    out = excel.split('.')[0]+'.csv'
    df = pd.read_excel(excel) # if only the first sheet is needed.
    df.to_csv(out)
I have a folder with lots of .txt files. How can I read all the files in the folder and get their contents with pandas? I tried the following:
import pandas as pd
list_=pd.read_csv("/path/of/the/directory/*.txt",header=None)
print list_
Something like this:
import glob
import pandas as pd

l = [pd.read_csv(filename) for filename in glob.glob("/path/*.txt")]
df = pd.concat(l, axis=0)
You have to take the header into account; for example, if you want to ignore it, take a look at the skiprows option in read_csv.
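For instance, a small sketch of the two header-handling options, using in-memory data:

```python
import io

import pandas as pd

text = "col_a,col_b\n1,2\n3,4\n"

# the default uses the first line as column names;
# skiprows=1 with header=None discards that line instead
with_header = pd.read_csv(io.StringIO(text))
no_header = pd.read_csv(io.StringIO(text), skiprows=1, header=None)

print(list(with_header.columns))  # ['col_a', 'col_b']
print(no_header.shape)  # (2, 2)
```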
I used this in my project for merging the csv files
import pandas as pd
import os

path = "path of the file"
files = [file for file in os.listdir(path) if not file.startswith('.')]
all_data = pd.DataFrame()
for file in files:
    current_data = pd.read_csv(path + "/" + file, encoding="ISO-8859-1")
    all_data = pd.concat([all_data, current_data])
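A variant of the same merge, assuming the same folder layout: collecting the frames in a list and calling pd.concat once avoids re-copying the accumulated data on every iteration (sketched here with a temporary folder in place of the real path):

```python
import os
import tempfile

import pandas as pd

# Hypothetical folder with two small CSVs standing in for the real path
path = tempfile.mkdtemp()
for i in range(2):
    pd.DataFrame({"v": [i]}).to_csv(os.path.join(path, f"f{i}.csv"), index=False)

files = [f for f in os.listdir(path) if not f.startswith(".")]

# read every file, then concatenate a single time at the end
frames = [pd.read_csv(os.path.join(path, f), encoding="ISO-8859-1") for f in files]
all_data = pd.concat(frames, ignore_index=True)
print(len(all_data))  # 2
```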