Iterating over Excel files in a folder - python

I am interested in getting this script to open an Excel file, and save it again as a .csv or .txt file. I'm pretty sure the problem is the iteration: I haven't coded it correctly to iterate over the contents of the folder. I am new to Python, and I managed to get this code to successfully print the contents of the folder via the commented-out part. Can someone please advise what needs to be fixed?
My error is: raise XLRDError('Unsupported format, or corrupt file: ' + msg)
from xlrd import open_workbook
from xlutils.copy import copy  # copy() lives in xlutils, not xlrd
import csv
import glob
import os
import openpyxl

cwd = os.getcwd()
print(cwd)
FileList = glob.glob('*.xlsx')
#print(FileList)
for i in FileList:
    rb = open_workbook(i)
    wb = copy(rb)
    wb.save('new_document.csv')

I would just use:
import pandas as pd
import glob
import os

file_list = glob.glob('*.xlsx')
for file in file_list:
    filename = os.path.split(file)[1]
    pd.read_excel(file).to_csv(filename.replace('.xlsx', '.csv'), index=False)

It appears that your error is related to the Excel files themselves, not to your code.
Check that your files aren't also open in Excel at the same time.
Check that your files aren't encrypted.
Check that your version of xlrd supports the files you are reading.
Check in that order; any of the above could have caused your error.
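One way to narrow this down (a sketch, not part of the original answer): a real .xlsx file is a zip archive, so a quick stdlib check can flag files that are corrupt, encrypted with a non-zip container, or really an old-format .xls renamed to .xlsx. Note also that xlrd 2.0+ dropped .xlsx support entirely, so "Unsupported format" can simply mean your xlrd version no longer reads .xlsx at all.

```python
import glob
import zipfile

def check_xlsx_files(paths):
    """Return the paths that are not valid zip archives.

    A genuine .xlsx file is a zip archive, so anything failing this
    check is likely corrupt, encrypted, or an .xls renamed to .xlsx.
    """
    return [p for p in paths if not zipfile.is_zipfile(p)]

# e.g. print(check_xlsx_files(glob.glob('*.xlsx')))
```

Running this before the conversion loop tells you which specific file is triggering the XLRDError.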


How do I see if the contents of a csv file exists as a file in another directory?

EDIT:
To better explain my dilemma: I have a csv file that lists a number of applications numbered XXXXXX. Each of these applications has a corresponding xml file that exists in another directory. I'm essentially attempting to write a script that:
1. Unzips the folder that contains the xml files and the csv file.
2. Parses the entries in the csv file and checks that each application listed has a corresponding xml file.
3. Outputs another CSV file that marks an application as true if its xml file exists.
So far I've written the script to unzip, but I'm having a hard time wrapping my head around steps 2 and 3.
from tkinter import Tk
from tkinter.filedialog import askdirectory
import zipfile
import os
import xml.etree.ElementTree as ET
import pandas as pd
from datetime import datetime

def unzipXML(root):
    print(f'({datetime.now().strftime("%b. %d - %H:%M:%S")}) Stage 1 of 5: Unzipping folder(s)...')
    # Get filepaths of .zip files
    zipPaths = []
    for filename in os.listdir(root):
        if filename.endswith(".zip"):
            zipPaths.append(root + "/" + filename)
    # Unzip all .zip files
    for path in zipPaths:
        with zipfile.ZipFile(path, 'r') as zipRef:
            zipRef.extractall(root)
    print(f'({datetime.now().strftime("%b. %d - %H:%M:%S")}) {len(zipPaths)} folder(s) unzipped successfully.')
Loop through the names in the csv, calling os.path.exists() on each one.
import csv
import os

# directory_path should point at the folder that holds the xml files
with open("filenames.csv") as inf, open("apps.csv", "w", newline="") as outf:
    in_csv = csv.reader(inf)
    out_csv = csv.writer(outf)
    for row in in_csv:
        app_name = row[0]  # replace [0] with the correct field number for your CSV
        if os.path.exists(os.path.join(directory_path, app_name + ".xml")):
            out_csv.writerow([app_name, 'exists'])
        else:
            out_csv.writerow([app_name, 'notexists'])
I don't know if I understand your problem, but maybe this will help:
import glob

# Get files from path
List_Of_Files = glob.glob('./*.csv')
for file_name in List_Of_Files:
    if file_name == your_var:
        ...

Reading in txt file as pandas dataframe from a folder within a zipped folder

I want to read in a txt file that sits in a folder within a zipped folder as a pandas data frame.
I've looked at how to read in a txt file and how to access a file from within a zipped folder: Load data from txt with pandas and Download Returned Zip file from URL, respectively.
The problem is I get a KeyError message with my code.
I think it's because my txt file sits in a folder within a folder?
Thanks for any help!
# MWE
import requests
import pandas as pd
from zipfile import ZipFile
from io import BytesIO
txt_raw = 'hcc-data.txt'
zip_raw = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00423/hcc-survival.zip'
r = requests.get(zip_raw)
files = ZipFile(BytesIO(r.content))
df_raw = pd.read_csv(files.open(txt_raw), sep=",", header=None)
# ERROR
KeyError: "There is no item named 'hcc-data.txt' in the archive"
You need to give the full path of the file within the archive:
txt_raw = 'hcc-survival/hcc-data.txt'
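When in doubt, you can list the archive's members to find the exact path; ZipFile.namelist() returns every path inside it. A small in-memory sketch (the tiny archive here just mirrors the layout of hcc-survival.zip, with the data file inside a top-level folder):

```python
from io import BytesIO
from zipfile import ZipFile

# Build a small in-memory archive with a nested member
buf = BytesIO()
with ZipFile(buf, 'w') as zf:
    zf.writestr('hcc-survival/hcc-data.txt', '1,2,3')

archive = ZipFile(buf)
print(archive.namelist())  # member paths include the folder prefix
```

In the question's code, `files.namelist()` would have shown that the member is 'hcc-survival/hcc-data.txt', not 'hcc-data.txt'.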

How to merge multiple CSV files in a folder to a single file on Azure?

I have written this code and it's showing no error, but I am not able to see the output file. Any help will be appreciated.
from os import chdir
from glob import glob
import pandas as pd

def produceOneCSV(list_of_files, file_out):
    result_obj = pd.concat([pd.read_csv(file, encoding='utf-8') for file in list_of_files])
    result_obj.to_csv(file_out, index=False, encoding='utf-8')

root = "FOLDER PATH"
chdir(root)
file_pattern = ".csv"
list_of_files = [file for file in glob(root + '*.csv')]
file_out = "ConsolidateOutput.csv"
produceOneCSV(list_of_files, file_out)
Check your folder path, as well as your app path and permissions.
I ran this code with no problems. I started with 3 CSVs and ended up with one. However, there were a few configuration issues which caused me to not see the CSV at first. The full code is below with real paths for reference.
Here are the things I had to fix:
As it stands, this code stores in the folder where this code is located. Is that intentional? It often isn't, so go check where your file.py is to see if your CSV is there.
Check that you have the proper permissions to write to that folder. It wasn't a problem here, but it has been an issue for projects in the past.
Check that your root folder is correct and there are actually CSVs there. While the code throws an error when it doesn't find any CSVs on my local, maybe your setup does it differently.
Here is my full working code:
from os import chdir
from glob import glob
import pandas as pd

def produceOneCSV(list_of_files, file_out):
    result_obj = pd.concat([pd.read_csv(file, encoding='utf-8') for file in list_of_files])
    result_obj.to_csv(file_out, index=False, encoding='utf-8')

root = "C:\\Users\\Matthew\\PycharmProjects\\stackoverflow\\"
chdir(root)
file_pattern = ".csv"
list_of_files = [file for file in glob(root + '*.csv')]
file_out = "ConsolidateOutput.csv"
produceOneCSV(list_of_files, file_out)
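To convince yourself the merge step works independently of the Azure paths, the same logic can be exercised on throwaway files in a temp folder (a self-contained sketch; the snake_case name just avoids clashing with the function above):

```python
import os
import tempfile
import pandas as pd

def produce_one_csv(list_of_files, file_out):
    # Same logic as produceOneCSV above
    result = pd.concat([pd.read_csv(f, encoding='utf-8') for f in list_of_files])
    result.to_csv(file_out, index=False, encoding='utf-8')

# Three throwaway one-row CSVs in a temp folder
d = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(d, f'{i}.csv')
    pd.DataFrame({'x': [i]}).to_csv(p, index=False)
    paths.append(p)

out = os.path.join(d, 'ConsolidateOutput.csv')
produce_one_csv(paths, out)
print(pd.read_csv(out).shape)  # (3, 1)
```

If this runs but your real output never appears, the problem is the folder path or permissions, not the merge.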

How can I open multiple json files in Python with a for loop?

For a data challenge at school we need to open a lot of json files with python. There are too many to open manually. Is there a way to open them with a for loop?
This is the way I open one of the json files and make it a dataframe (it works).
import pandas as pd

file_2016091718 = '/Users/thijseekelaar/Downloads/airlines_complete/airlines-1474121577751.json'
json_2016091718 = pd.read_json(file_2016091718, lines=True)
Here is a screenshot of the folder the data is in (click here).
Yes, you can use os.listdir to list all the json files in your directory, construct each one's full path with os.path.join, and open it:
import os
import pandas as pd

base_dir = '/Users/thijseekelaar/Downloads/airlines_complete'

# Get all files in the directory
data_list = []
for file in os.listdir(base_dir):
    # If the file is a json, construct its full path, open it, and append the data to the list
    if 'json' in file:
        json_path = os.path.join(base_dir, file)
        json_data = pd.read_json(json_path, lines=True)
        data_list.append(json_data)
print(data_list)
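Once collected, the per-file frames can be combined into a single DataFrame with pd.concat; ignore_index=True renumbers the rows so the original per-file indices don't repeat. A sketch with two tiny frames standing in for the real data:

```python
import pandas as pd

# data_list stands in for the per-file frames collected in the loop above
data_list = [pd.DataFrame({'a': [1, 2]}), pd.DataFrame({'a': [3]})]

combined = pd.concat(data_list, ignore_index=True)
print(combined.shape)  # (3, 1)
```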
Try this:
import os

# os.walk does not guarantee any particular order
for root, subdirs, files in os.walk('your/json/dir/'):
    for file in files:
        with open(os.path.join(root, file), 'r') as f:
            # your stuff here
            ...

Pandas DataFrame Not Saving To File

I'm learning Python and can't seem to get pandas dataframes to save. I'm not getting any errors; the file just doesn't appear in the folder.
I'm using a Windows 10 machine, Python 3, Jupyter Notebook, and saving to a local Google Drive folder.
Any ideas?
import feedparser
import pandas as pd

rawrss = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
]
posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))
df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])  # pass data to init
df.to_csv('c:\\Users\\username\\Documents\\myfilename.csv', index=False)
The file should be saved in the current working directory.
import os
cwd = os.getcwd()
print(cwd)
In the last line of your code, change this:
df.to_csv('C://myfilename.csv', index=False)
Now your file is saved in the C drive.
You can change the path as you wish, e.g.
df.to_csv('C://Folder//myfilename.csv', index=False)
Alternatively, if you want to locate where your file is stored:
import os
print(os.getcwd())
This gives you the directory where the files are stored.
You can also change your working directory as you wish, just at the beginning of your code:
import os
os.chdir("path_to_folder")
In that case there is no need to specify the path when saving to CSV.
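To make that concrete, here is a sketch of the chdir-then-save pattern (a temp folder stands in for "path_to_folder" so the example is self-contained):

```python
import os
import tempfile
import pandas as pd

# Change the working directory once, then save with a bare filename;
# the CSV lands in that directory (a temp folder here for illustration)
os.chdir(tempfile.mkdtemp())
pd.DataFrame({'title': ['hello']}).to_csv('myfilename.csv', index=False)
print(os.path.exists('myfilename.csv'))  # True
```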
You can write a function that saves the file and returns a boolean value, for example:
import os

def save_data(path, file, df):
    df.to_csv(os.path.join(path, file + '.csv'), index=False)
    # to_csv returns None when given a path, so check for the file instead
    return os.path.isfile(os.path.join(path, file + '.csv'))

But you have to provide the right path though.
Add this code to the bottom of your file:
import os
print(os.getcwd())
That's where your file is.
Try writing a simple file with a new script.
with open(your_path_with_filename, 'w') as F:
    F.write("hello")
