I want to access a zipped Excel sheet online using Python without downloading it to my PC. The link is https://www.richmondfed.org/-/media/richmondfedorg/research/regional_economy/surveys_of_business_conditions/manufacturing/zipfile/mfg_historicaldata.zip,
which points to a zipped Excel file. Does anyone know how to handle this in Python? For example, I want to print the first row of the spreadsheet without unzipping and saving the file on my PC.
I found a similar question, "Downloading and unzipping a .zip file without writing to disk", but I cannot use its code to read the Excel file.
You can use pandas to read the excel file.
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
import pandas as pd

# Download the archive into memory; nothing is written to disk
resp = urlopen("https://www.richmondfed.org/-/media/richmondfedorg/research/regional_economy/surveys_of_business_conditions/manufacturing/zipfile/mfg_historicaldata.zip")
archive = ZipFile(BytesIO(resp.read()))
# Copy the first member into a seekable in-memory buffer for pandas
extracted_file = BytesIO(archive.read(archive.namelist()[0]))
print(pd.read_excel(extracted_file))
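If the archive holds more than one member, `namelist()[0]` is not guaranteed to be the spreadsheet. The following is a minimal sketch of picking the first Excel member out of a zip held in memory. `first_excel_member` is a hypothetical helper, not part of `zipfile` or pandas, and the demo archive contains placeholder bytes rather than a real workbook.

```python
from io import BytesIO
from zipfile import ZipFile


def first_excel_member(zip_bytes):
    """Return (name, BytesIO) for the first .xls/.xlsx member of a zip held in memory."""
    with ZipFile(BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xlsx", ".xls")):
                # read() returns the decompressed bytes; nothing is written to disk
                return name, BytesIO(archive.read(name))
    raise ValueError("no Excel file found in the archive")


# Demo with a small archive built in memory (placeholder bytes, not a real workbook)
demo = BytesIO()
with ZipFile(demo, "w") as archive:
    archive.writestr("readme.txt", "notes")
    archive.writestr("data.xlsx", b"placeholder")

name, member = first_excel_member(demo.getvalue())
print(name)
```

With a real download, the returned buffer could be passed to `pd.read_excel` in place of `extracted_file`.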
I am trying to import data into Power BI using a Python script so that I can schedule a regular data refresh.
I am having trouble reading the data from an Excel file; the import fails with the error 'KeyError: "There is no item named 'xl/sharedStrings.xml' in the archive"'.
When I look inside the xlsx archive, the xl folder contains no sharedStrings.xml, since the workbook contains no strings. The file opens in Excel without any issues, but not with Python.
import openpyxl
import pandas
import xlrd
import os
globaltrackerdf = pandas.read_excel(r'C:\Users\Documents\Trackers\Tracker-Global Tracker_V2-2022-06-13.xlsx', sheet_name="Sheet1", engine="openpyxl")
Solution that worked for me: re-save the file using Excel. My file also opened fine in Excel, but after unzipping it and looking inside there was no sharedStrings.xml. There seems to be a bug where saving an xlsx might not produce the sharedStrings.xml file. I found various ideas about why this might happen, but since I don't have access to the client's Excel I am not sure what caused it.
For extra context on what an XLSX file is, I found this to be helpful: https://www.adimian.com/blog/fast-xlsx-parsing-with-python/
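Because an xlsx file is a zip archive, the missing part can be checked for without opening Excel. This is a minimal sketch using only the standard library; `has_shared_strings` is a hypothetical helper name, and the demo archives mimic the xlsx layout only (they are not valid workbooks).

```python
from io import BytesIO
from zipfile import ZipFile

SHARED_STRINGS = "xl/sharedStrings.xml"


def has_shared_strings(xlsx_file):
    """Report whether an .xlsx archive (path or file-like object) contains xl/sharedStrings.xml."""
    with ZipFile(xlsx_file) as archive:
        return SHARED_STRINGS in archive.namelist()


# Demo archives built in memory; they mimic the layout only and are not valid workbooks
with_strings = BytesIO()
with ZipFile(with_strings, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")
    archive.writestr(SHARED_STRINGS, "<sst/>")

without_strings = BytesIO()
with ZipFile(without_strings, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")

print(has_shared_strings(with_strings), has_shared_strings(without_strings))
```

Running a check like this before the scheduled refresh would make it possible to flag files that need re-saving instead of failing with the KeyError.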
In the documentation for pd.ExcelWriter we see the following code snippet:
You can store Excel file in RAM:
import io
import pandas as pd

df = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"])
buffer = io.BytesIO()
with pd.ExcelWriter(buffer) as writer:
    df.to_excel(writer)
My question is how to access the Excel file again. I want the base64-encoded version of the same Excel file without saving it to the file system, which is why I am keeping it in RAM. Can someone please help with this?
Thanks for your time
Solution: Was able to access the file using buffer.getvalue().
In the snippet you provided, the Excel file has been written to the buffer just as it would have been written to disk.
You can therefore read it back much as you would read from a file:
pd.read_excel(io.BytesIO(buffer.getvalue()))
More on how BytesIO behaves:
Create an excel file from BytesIO using python
Difference between `open` and `io.BytesIO` in binary streams
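For the base64 part of the question, a minimal sketch might look like the following. `buffer_to_base64` is a hypothetical helper, and the bytes in the demo are a stand-in; in the thread's case the buffer would be the one filled by `pd.ExcelWriter`, read only after the `with` block has exited so that the workbook is complete.

```python
import base64
import io


def buffer_to_base64(buffer):
    """Return the full contents of a BytesIO buffer as a base64 ASCII string."""
    # getvalue() returns every byte regardless of the current stream position
    return base64.b64encode(buffer.getvalue()).decode("ascii")


# Stand-in bytes; in the thread's case the buffer would be filled by pd.ExcelWriter
buffer = io.BytesIO()
buffer.write(b"PK\x03\x04 stand-in for workbook bytes")

encoded = buffer_to_base64(buffer)
print(encoded[:8])
```

Decoding the string with `base64.b64decode` gives back the original bytes, which can be wrapped in `io.BytesIO` again for reading.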
I have a repetitive task where I download multiple Excel files (I'm forced to download them in xlsx format), take column G from each file, and concatenate the columns into "final.xlsx". Then "final.xlsx" is compared to another Excel workbook to check that all number instances match in each workbook.
I'm now working on a cross-platform Python app to solve this. However, pandas won't allow xlsx files anymore, and manually opening and saving them as xls files just adds more repetitive manual labour.
Is there a cross-platform way for Python to convert xlsx files to xls?
Or should I abandon pandas and go with openpyxl, since I'm forced to handle the xlsx format?
I tried using this, without success:
from pathlib import Path
import openpyxl
import os

# get files
os.chdir(os.path.abspath(os.path.dirname(__file__)))
pdir = Path('.')
filelist = [filename for filename in pdir.iterdir() if filename.suffix == '.xlsx']
for filename in filelist:
    print(filename.name)

for infile in filelist:
    workbook = openpyxl.load_workbook(infile)
    outfile = f"{infile.name.split('.')[0]}.xls"
    # note: openpyxl writes the xlsx format whatever the file extension says
    workbook.save(outfile)
You can still use pandas, but you need openpyxl. Since you already import it in your code, I assume that is fine for you.
Otherwise, you can install it with: pip install openpyxl.
The following illustrates how this can work.
import pandas as pd
fpath = r".\test.xlsx"
df = pd.read_excel(fpath, engine='openpyxl')
print(df)
   A  B
0  1  2
1  1  2
Previously, the default argument engine=None to read_excel() would result in using the xlrd engine in many cases, including new Excel 2007+ (.xlsx) files. If openpyxl is installed, many of these cases will now default to using the openpyxl engine. See the read_excel() documentation for more details.
Thus, it is strongly encouraged to install openpyxl to read Excel 2007+ (.xlsx) files. Please do not report issues when using xlrd to read .xlsx files. This is no longer supported, switch to using openpyxl instead.
https://pandas.pydata.org/docs/whatsnew/v1.2.0.html
As the question title states, openpyxl reads files the way I need, but I don't know how to download the file from a SharePoint site and read it using openpyxl.
The url is something like this http://teamsites.teamworks.net/sites/efit-eitecs-005/SiteAssets/Lists/Apr19/AllItems/Gluster-2019.15-OS.xlsm
I'm currently using the following code.
import requests

# `a` holds the file URL and `auth` the credentials; both are defined elsewhere
resp = requests.get(a, auth=auth).content
with open(r'C:\Users\Me\temp.xlsx', 'wb') as output:
    output.write(resp)
Anyone know the answer? Should I be saving as an xlsm file instead? I don't know what to do.
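One possible approach, sketched under assumptions: `openpyxl.load_workbook` accepts a file-like object, so the downloaded bytes can be wrapped in `BytesIO` instead of being saved first. `workbook_from_bytes` is a hypothetical helper, the demo builds a tiny workbook in memory in place of the real download, and SharePoint authentication is not covered here.

```python
from io import BytesIO

import openpyxl


def workbook_from_bytes(content, keep_vba=False):
    """Open workbook bytes with openpyxl without saving them to disk."""
    # load_workbook accepts a file-like object as well as a path
    return openpyxl.load_workbook(BytesIO(content), keep_vba=keep_vba)


# Demo: build a tiny workbook in memory to stand in for the downloaded bytes
source = openpyxl.Workbook()
source.active["A1"] = "hello"
stream = BytesIO()
source.save(stream)

workbook = workbook_from_bytes(stream.getvalue())
print(workbook.active["A1"].value)
```

With the code from the question, the call would be `workbook_from_bytes(resp)`, since `resp` already holds the `.content` bytes. For an .xlsm file, passing `keep_vba=True` keeps the macros if the workbook is saved again later.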
I'm having some trouble reading a csv file
import pandas as pd
df = pd.read_csv('Data_Matches_tekha.csv', skiprows=2)
I get
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 526, saw 5
and when I add sep=None to read_csv I get another error
Error: line contains NULL byte
I tried adding unicode='utf-8', and I even tried the csv reader, but nothing works with this file.
The csv file is totally fine; I checked it and I see nothing wrong with it.
Here are the errors I get:
In your actual code, the line is:
>>> pandas.read_csv("Data_Matches_tekha.xlsx", sep=None)
You are trying to read an Excel file, not a plain-text CSV, which is why things are not working.
Excel files (xlsx) are stored in a binary format (a zip archive of XML parts) and cannot be read as simple text files the way CSV files can.
You need to either convert the Excel file to CSV (note: if you have multiple sheets, each sheet should be converted to its own csv file) and then read those files, or read the Excel file directly.
To read it directly, you can use read_excel, or a library designed to parse Excel files: openpyxl for .xlsx, or xlrd for the legacy .xls format (xlrd 2.0 and later read only .xls); see Reading/parsing Excel (xls) files with Python for more information on that.
Use read_excel instead of read_csv if the file is an Excel file:
import pandas as pd
df = pd.read_excel("Data_Matches_tekha.xlsx")
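When a file's extension cannot be trusted, its contents can be inspected before choosing between read_csv and read_excel. The following is a heuristic sketch only; `looks_like_xlsx` is a hypothetical helper, and the demo data is built in memory rather than taken from the file in the question.

```python
import zipfile
from io import BytesIO


def looks_like_xlsx(file_obj):
    """Heuristic: True if the stream is a zip archive containing xl/workbook.xml."""
    if not zipfile.is_zipfile(file_obj):
        return False
    file_obj.seek(0)
    with zipfile.ZipFile(file_obj) as archive:
        return "xl/workbook.xml" in archive.namelist()


# Plain CSV text is not a zip archive
csv_like = BytesIO(b"home,away,score\nA,B,1-0\nC,D,2-2\n")

# A zip with the xlsx layout (placeholder content, not a valid workbook)
xlsx_like = BytesIO()
with zipfile.ZipFile(xlsx_like, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")

print(looks_like_xlsx(csv_like), looks_like_xlsx(xlsx_like))
```

For a file on disk, the same function can be called with a handle opened in binary mode, for example `open(path, "rb")`.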
I encountered the same error when I used to_csv to write some data and then read it in another script. I found an easy solution that bypasses pandas' read functions: the pickle module.
pickle is part of Python's standard library, so there is nothing to install.
First, write your data with the code below:
import pickle
with open(path, 'wb') as output:
    pickle.dump(variable_to_save, output)
Finally, load your data in another script using:
import pickle
with open(path, 'rb') as infile:
    data = pickle.load(infile)
Note that if you want to read the saved data with a different Python version than the one that wrote it, you can specify this when writing by passing protocol=x to pickle.dump, where x is a pickle protocol that the reading version supports (for example, protocol=2 is the highest protocol that Python 2 can read).
I hope this can be of any use.
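The protocol argument mentioned above can be sketched as follows. The data and the in-memory stream are illustrative assumptions; with a real file, the stream would be the handle returned by `open(path, 'wb')`.

```python
import io
import pickle

data = {"team": "A", "scores": [1, 2, 3]}

# protocol=2 is the newest protocol that Python 2 can read
stream = io.BytesIO()
pickle.dump(data, stream, protocol=2)

# Rewind and load, as a second script would do after opening the file
stream.seek(0)
restored = pickle.load(stream)
print(restored == data)
```

Leaving out `protocol` uses the default of the running Python version, which older interpreters may not be able to read.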