Python: load Excel header without loading remaining data

I am working with very big Excel files, which take a long time to load with pandas in Python. Before processing the data, the user has to select quite a few options related to it, which only require the names of each column in each dataset. It is very inconvenient for the user to have to wait, sometimes for minutes, until the data is loaded just to be able to select the necessary options, and then let the program do the actual processing for another few minutes.
So, my question is: is there a way to load only the data header from an Excel file with Python? I think of it as an alternative to the skiprows parameter in the read_excel pandas function: instead of skipping rows at the beginning of the data, I would like to skip rows at the end. I want to emphasize that my goal is to reduce the time Python takes to load the files. I also know there are ways to do this with csv files, but unfortunately that doesn't help me here.
Thank you for the help!

You can try to use the sxl module (https://pypi.org/project/sxl/). Here is the code I tried for a large Excel file (around 75,000 rows) and the timing results:
from datetime import datetime
import pandas as pd
import sxl

# Time a full load with pandas
startTime = datetime.now()
df = pd.read_excel('\\Big_Excel.xlsx')
print("Time taken to load whole data with pandas read excel is {}".format((datetime.now() - startTime)))

# Time loading only the top 5 rows with pandas
startTime = datetime.now()
df = pd.read_excel('\\Big_Excel.xlsx', nrows=5)
print("Time taken with top 5 rows with pandas read excel is {}".format((datetime.now() - startTime)))

# Time loading only the top 5 rows with sxl
startTime = datetime.now()
wb = sxl.Workbook('\\Big_Excel.xlsx')
ws = wb.sheets[1]
data = ws.head(5)
print("Time taken to load top 5 rows using sxl is {}".format((datetime.now() - startTime)))
Pandas read_excel loads the whole dataset into memory either way, so there is not much difference in timing between the first two runs. Here are the outputs from the above:
Time taken to load whole data with pandas read excel is 0:00:49.174538
Time taken with top 5 rows with pandas read excel is 0:00:44.478523
Time taken to load top 5 rows using sxl is 0:00:00.671717
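If all you need are the column names, a minimal sketch along these lines should work with sxl, assuming the header is the first row of the first sheet and that ws.head(1) returns a list of rows as in the timing snippet above:

wb = sxl.Workbook('\\Big_Excel.xlsx')
ws = wb.sheets[1]                 # sheets are 1-indexed in sxl
header_row = ws.head(1)[0]        # first row only; the rest of the file is not read
column_names = [str(cell) for cell in header_row]
print(column_names)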
I hope this helps!!

You can use the skipfooter parameter or the nrows parameter with both .xlsx and .csv files; however, they cannot be used together.
path = r'c:\users\abc\def\stack.xlsx'
df = pd.read_excel(path, skipfooter = 99999)
which means the last 99,999 rows will be skipped and the remaining rows from the top, including the header, will load.
path = r'c:\users\abc\def\stack.xlsx'
df = pd.read_excel(path, nrows= 5)
which means only the first 5 data rows will be loaded, along with the header.
Also refer to this Stack Overflow question.

from dask import dataframe as dd
df = dd.read_csv("filename")
Trust me, it's fast; I am reading an 800 MB file.
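Note that dask builds the dataframe lazily; a sketch like this (keeping the "filename" placeholder above) exposes the column names without materializing the full dataset, since dask only samples the start of the file to infer the schema:

from dask import dataframe as dd

ddf = dd.read_csv("filename")   # lazy: only a small sample is read to infer dtypes
print(list(ddf.columns))        # column names are available without loading the data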

Related

PySimpleGUI | Add progress bar for Pandas' pd.read_excel

I am using pd.read_excel to read large Excel files. How do I add a GUI progress bar to inform users of the progress while the system is reading the file?
I came across tqdm and PySimpleGUI as the most commonly used progress bars. PySimpleGUI's OneLineProgressMeter seems to be the solution to my need, but I'm too much of a novice to work pd.read_excel into it. Can anyone show me how to work this out? Thank you
import PySimpleGUI as sg

for i in range(1000):  # this is your "work loop" that you want to monitor
    sg.OneLineProgressMeter('One Line Meter Example', i + 1, 1000, 'key')
It's not really easy to do with read_excel because the function lacks any means of tracking progress, unlike read_csv, where you can use the chunksize parameter to return an iterator and then update your progress based on how many chunks you've loaded out of the total.
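For comparison, a rough sketch of how that chunked CSV approach could look (the file name 'data.csv' and the chunk size are just placeholders):

import pandas as pd
import PySimpleGUI as sg

chunksize = 1000
total_rows = sum(1 for _ in open('data.csv')) - 1   # quick line count, minus header
total_chunks = total_rows // chunksize + 1

chunk_list = []
for i, chunk in enumerate(pd.read_csv('data.csv', chunksize=chunksize)):
    chunk_list.append(chunk)
    sg.OneLineProgressMeter('Loading csv file', i + 1, total_chunks, 'key')

df = pd.concat(chunk_list, axis=0)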
This won't work in the case of Excel files, so your only option is to use a combination of nrows and skiprows. To do this we first need to find out how many rows the file has. There are some neat ways of doing this with little overhead (see answers here), but these won't work for an Excel file, so what you can do is load one entire column of the sheet and find out the length that way:
# Get the total number of rows ('blabla.xlsx' and the sheet name are placeholders)
df_temp = pd.read_excel('blabla.xlsx', sheet_name='blabla', usecols=[0])
rows = df_temp.shape[0]

# Now load the file in chunks of 1000 rows at a time
chunks = rows // 1000 + 1
chunk_list = []
for i in range(chunks):
    # skip the rows already read, but keep row 0 so the header is preserved
    tmp = pd.read_excel('blabla.xlsx', sheet_name='blabla', nrows=1000,
                        skiprows=range(1, i * 1000 + 1))
    chunk_list.append(tmp)
    # Update progress
    sg.OneLineProgressMeter('Loading excel file', i + 1, chunks, 'key')

df = pd.concat(chunk_list, axis=0)
This should work, but it comes with significant overhead. Unfortunately, there's no easy way to do this for Excel files, and you're better off using some other data format.

Using PySpark to efficiently combine many small csv files (130,000 with 2 columns in each) into one large frame

This is another follow-up to an earlier question I posted: How can I merge these many csv files (around 130,000) using PySpark into one large dataset efficiently?
I have the following dataset https://fred.stlouisfed.org/categories/32263/downloaddata/INTRNTL_csv_2.zip
In it, there's a list of around 130,000 files in the main directory, with their sub-directories listed; so the first cell might be A/AAAAA, and the corresponding file would be located at /data/A/AAAAA.csv.
The files all have a similar format: the first column is called DATE and the second column is a series that is always named VALUE. So, first of all, the VALUE column needs to be renamed to the file name in each csv file. Second, the frames need to be full outer joined with each other with DATE as the main index. Third, I want to save the file and be able to load and manipulate it. The result should be roughly N rows (the number of dates) by 130,001 columns.
I am trying to full outer join all the files into a single dataframe. I previously tried pandas but ran out of memory when trying to concat the list of files, and someone recommended that I try PySpark instead.
In a previous post I was told that I could do this:
df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv", "date DATE, value DOUBLE")
But all the columns are named VALUE and the frame just becomes two columns: the first column is DATE and the second column is VALUE. It loads quite fast, around 38 seconds, and produces around 3.8 million rows by 2 columns, so I know that it's not doing the full outer join; it's appending the files row-wise.
So I tried the following code:
import pandas as pd
import time
import os
from pyspark.sql import SparkSession
from pyspark.sql import *
from pyspark.sql.functions import col
from pyspark.sql import DataFrame
from pyspark.sql.types import *

spark = SparkSession.builder.appName('spark-dataframe-demo').getOrCreate()

filelist = pd.read_excel("/kaggle/input/list/BF_csv_2.xlsx")  # list of filenames
firstname = min(filelist.File)
length = len(filelist.File)

# read the first file and rename its VALUE column to the file name
dff = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/" + firstname,
                     inferSchema=True, header=True).withColumnRenamed("VALUE", firstname)

for row in filelist.File.items():   # row is an (index, filename) pair
    if row[1] == firstname:
        continue
    print(row[1], length, end='', flush=True)
    df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/" + row[1],
                        inferSchema=True, header=True).withColumnRenamed("VALUE", row[1][:-4])
    #df = df.select(col("DATE").alias("DATE"), col("VALUE").alias(row[1][:-4]))
    dff = dff.join(df, ['DATE'], how='full')
    length -= 1

dff.write.save('/kaggle/working/whatever', format='parquet', mode='overwrite')
So to test it, I try calling df.show() after 3 columns are merged, and it's quite fast. But when I try around 25 columns, it takes around 2 minutes. When I try 500 columns it's next to impossible.
I don't think I'm doing it right. The formatting and everything is correct. But why is it taking so long? How can I use PySpark properly? Are there any better libraries to achieve what I need?
Spark doesn't do anything magical compared to other software. The strength of Spark is parallel processing; most of the time that means you can use multiple machines to do the work. If you are running Spark locally you may have the same issues you did when using pandas.
That being said, there might be a way for you to run it locally using Spark because it can spill to disk under certain conditions and does not need to have everything in memory.
I'm not well versed in PySpark, but the approach I'd take is (a rough sketch follows the steps):
load all the files using a glob, like you did: /kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv
use input_file_name from pyspark.sql.functions, which gives you the source file path for each record in your DF (df.select("date", "value", input_file_name().alias("filename")) or similar)
parse the path into the format you'd like to have as a column (e.g. extract the file name)
the schema should look like date, value, filename at this step
use the PySpark equivalent of df.groupBy("date").pivot("filename").agg(first("value")). Note: I used first() because I think each date has at most one record per file
also try setting the number of partitions equal to the number of dates you have
If you want the output as a single file, do not forget to repartition(1) before df.write. This step might be problematic depending on data size. You do not need to do this if you plan to keep using Spark for your work, since you could load the data using the same approach as in step 1 (/new_result_data/*.csv)
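Here is a rough PySpark sketch of those steps, under the assumption that the file name without the .csv extension is what you want as each column name; pivoting to roughly 130,000 columns may also require raising spark.sql.pivotMaxValues, as shown:

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract, first

spark = SparkSession.builder.appName('wide-join-demo').getOrCreate()
spark.conf.set("spark.sql.pivotMaxValues", 200000)   # default limit is 10,000 distinct pivot values

# 1. read everything at once and tag each record with its source file
df = (spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv",
                     schema="DATE DATE, VALUE DOUBLE", header=True)
        .withColumn("filename",
                    regexp_extract(input_file_name(), r"([^/]+)\.csv$", 1)))

# 2. one row per DATE, one column per file
wide = df.groupBy("DATE").pivot("filename").agg(first("VALUE"))

wide.write.save('/kaggle/working/whatever', format='parquet', mode='overwrite')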

Streamlit loading column data takes too much time

I use a simple code snippet in Streamlit which shows a dataframe loaded from an Excel file I have. The problem is that it takes too much time to load the data inside Streamlit's filter column. In that filter area I search on material name, but it takes up to 30 seconds to load and show the data I am going to select. How can I solve this and make selecting data as fast as possible?
The code is:
import streamlit as st
import pandas as pd
@st.cache
def load_data(nrows):
    df = pd.read_excel('materials.xlsx', nrows=nrows)
    return df

df = load_data(100000)
species = st.multiselect('SELECT THE MATERIAL', df['Name'])
new_df = df[(df['Name'].isin(species))]
st.write(new_df)
and to show how it is too slow selecting data, look at this: https://streamable.com/kjis2
Ultimately, without knowing which part of your code is running slowly (and how much data you're actually loading), it's hard to give concrete advice. It looks like you're loading 100,000 rows from a spreadsheet. That's potentially a lot of data, especially if those rows are themselves large.
Some things to try:
Add some performance instrumentation around chunks of code, to figure out what is running slowly. This can be as simple as just calling time.time() before and after a chunk of code, and outputting the difference between the two values:
import contextlib
import time
import pandas as pd
import streamlit as st

@contextlib.contextmanager
def profile(name):
    start_time = time.time()
    yield  # <-- your code will execute here
    total_time = time.time() - start_time
    print("%s: %.4f ms" % (name, total_time * 1000.0))

with profile("load_data"):
    df = pd.read_excel('materials.xlsx', nrows=100000)

with profile("create_multiselect"):
    species = st.multiselect('SELECT THE MATERIAL', df['Name'])

with profile("filter_on_name"):
    new_df = df[(df['Name'].isin(species))]
Change the @st.cache annotation to @st.cache(allow_output_mutation=True). This will let Streamlit avoid hashing the output of the function (which is a gigantic dataframe that may take a while to hash).
Is it possible to load fewer rows at a time?
Alternately, can you avoid loading all the data that's within each of those rows? (Do you need the "Synonyms" column, for example?) pandas.read_excel takes a usecols parameter that lets you limit which columns get parsed.
If you do need all this data loaded in, could you store it in a way that allows for faster filtering? For example, pandas can interface with SQLite. Could you load your Excel data into a SQLite database once, when the app first runs, and then perform queries against that db rather than against the DataFrame? A rough sketch of that idea follows.
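A minimal sketch of the SQLite idea, assuming the spreadsheet has a Name column as in the question (the database file, table name, and the queried material are placeholders):

import sqlite3
import pandas as pd

# One-time (slow) load: read the spreadsheet and store it in SQLite
df = pd.read_excel('materials.xlsx')
with sqlite3.connect('materials.db') as conn:
    df.to_sql('materials', conn, if_exists='replace', index=False)

# Afterwards, query only what is needed instead of filtering the full DataFrame
with sqlite3.connect('materials.db') as conn:
    names = pd.read_sql_query('SELECT DISTINCT Name FROM materials', conn)
    subset = pd.read_sql_query('SELECT * FROM materials WHERE Name = ?',
                               conn, params=('some material',))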

Is there a way to filter when reading data with pandas.read_table?

I'm using python 3.5 and pandas 0.19.2.
Using pandas.read_table, is there a way to filter when reading data?
In my example below, I read in my initial data frame and then subset the rows I want based on a condition. Is there a way to do this while reading, or any way to dramatically speed up the example below? I couldn't see anything in the pandas.read_table docs (link) that showed how to speed this up.
Currently it takes around 3 minutes.
import pandas as pd
from datetime import datetime
start_time = datetime.now()
# reading table
df = pd.read_table('https://download.bls.gov/pub/time.series/ce/ce.data.0.AllCESSeries', sep='\t', header=0)
# subsetting
df = df[df['series_id'].str.contains("CEU0000000001")]
end_time = datetime.now()
run_time = end_time-start_time
print(run_time)
Consider using an alternative storage format if you want to speed up reading from disk significantly.
I'd consider the HDF5 or Feather formats.
PS: an HDF store allows us to index data and read it by index, so we read from disk only the data we actually need; there is no need to read everything into memory and filter it there. A rough sketch follows.
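A minimal sketch of the HDF5 approach, assuming PyTables is installed and that an exact match on series_id is enough for your filter (the output file and key names are placeholders):

import pandas as pd

# One-time conversion: store the table in HDF5 with series_id as a queryable column
df = pd.read_table('https://download.bls.gov/pub/time.series/ce/ce.data.0.AllCESSeries',
                   sep='\t', header=0)
df.to_hdf('ces.h5', key='ces', format='table', data_columns=['series_id'])

# Later reads pull only the matching rows from disk
subset = pd.read_hdf('ces.h5', key='ces', where="series_id == 'CEU0000000001'")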

Faster way to read Excel files to pandas dataframe

I have a 14MB Excel file with five worksheets that I'm reading into a Pandas dataframe, and although the code below works, it takes 9 minutes!
Does anyone have suggestions for speeding it up?
import pandas as pd
def OTT_read(xl, site_name):
    df = pd.read_excel(xl.io, site_name, skiprows=2, parse_dates=0, index_col=0,
                       usecols=[0, 1, 2], header=None,
                       names=['date_time', '%s_depth' % site_name, '%s_temp' % site_name])
    return df

def make_OTT_df(FILEDIR, OTT_FILE):
    xl = pd.ExcelFile(FILEDIR + OTT_FILE)
    site_names = xl.sheet_names
    df_list = [OTT_read(xl, site_name) for site_name in site_names]
    return site_names, df_list

FILEDIR = 'c:/downloads/'
OTT_FILE = 'OTT_Data_All_stations.xlsx'
site_names_OTT, df_list_OTT = make_OTT_df(FILEDIR, OTT_FILE)
As others have suggested, csv reading is faster. So if you are on Windows and have Excel, you could call a VBScript to convert the Excel file to csv and then read the csv. I tried the script below and it took about 30 seconds.
import pandas as pd
from subprocess import call

# create a list with sheet numbers you want to process
sheets = map(str, range(1, 6))

# convert each sheet to csv and then read it using read_csv
df = {}
excel = 'C:\\Users\\rsignell\\OTT_Data_All_stations.xlsx'
for sheet in sheets:
    csv = 'C:\\Users\\rsignell\\test' + sheet + '.csv'
    call(['cscript.exe', 'C:\\Users\\rsignell\\ExcelToCsv.vbs', excel, csv, sheet])
    df[sheet] = pd.read_csv(csv)
Here's a little snippet of Python to create the ExcelToCsv.vbs script:
# write vbscript to file
vbscript = """if WScript.Arguments.Count < 3 Then
WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
Wscript.Quit
End If
csv_format = 6
Set objFSO = CreateObject("Scripting.FileSystemObject")
src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))
Dim oExcel
Set oExcel = CreateObject("Excel.Application")
Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate
oBook.SaveAs dest_file, csv_format
oBook.Close False
oExcel.Quit
"""

# write the script as plain text (the script is ASCII, so no explicit encoding is needed)
with open('ExcelToCsv.vbs', 'w') as f:
    f.write(vbscript)
This answer benefited from Convert XLS to CSV on command line and csv & xlsx files import to pandas data frame: speed issue
I used xlsx2csv to convert the Excel file to csv in memory, and this helped cut the read time to about half.
from xlsx2csv import Xlsx2csv
from io import StringIO
import pandas as pd

def read_excel(path: str, sheet_name: str) -> pd.DataFrame:
    buffer = StringIO()
    Xlsx2csv(path, outputencoding="utf-8", sheet_name=sheet_name).convert(buffer)
    buffer.seek(0)
    df = pd.read_csv(buffer)
    return df
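You would then call it with the workbook path and sheet name, for example (both names here are placeholders):

df = read_excel('Big_Excel.xlsx', 'Sheet1')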
If you have fewer than 65,536 rows (in each sheet) you can try xls instead of xlsx. In my experience xls is faster than xlsx. It is difficult to compare to csv because it depends on the number of sheets.
Although this is not an ideal solution (xls is an old proprietary binary format), I have found it useful if you are working with a lot of sheets, internal formulas with values that are often updated, or if for whatever reason you would like to keep the Excel multi-sheet functionality (instead of separate csv files).
In my experience, Pandas read_excel() works fine with Excel files that have multiple sheets. As suggested in Using Pandas to read multiple worksheets, if you assign sheet_name to None it will automatically read every sheet into a DataFrame and output a dictionary of DataFrames keyed by sheet name.
But the reason it takes time is the text parsing in your code. A 14MB Excel file with 5 sheets is not that much. I have a 20.1MB Excel file with 46 sheets, each with more than 6000 rows and 17 columns, and using read_excel it took as shown below:
import time
import datetime as dt
import pandas as pd

def parse(datestr):
    y, m, d = datestr.split("/")
    return dt.date(int(y), int(m), int(d))

t0 = time.time()
data = pd.read_excel("DATA (1).xlsx", sheet_name=None, encoding="utf-8",
                     skiprows=1, header=0, parse_dates=[1], date_parser=parse)
t1 = time.time()
print(t1 - t0)
## result: 37.54169297218323 seconds
In the code above, data is a dictionary of 46 DataFrames.
As others suggested, using read_csv() can help because reading a .csv file is faster. But consider that, because .xlsx files use compression, the .csv files might be larger and hence slower to read. If you want to convert your file to comma-separated using Python (VB code is offered by Rich Signell), you can use: Convert xlsx to csv
I know this is old, but in case anyone else is looking for an answer that doesn't involve VB: Pandas read_csv() is faster, but you don't need a VB script to get a csv file.
Open your Excel file and save as *.csv (comma separated value) format.
Under Tools you can select Web Options, and under the Encoding tab you can change the encoding to whatever works for your data. I ended up using Windows, Western European because Windows UTF encoding is "special", but there are lots of ways to accomplish the same thing. Then use the encoding argument in pd.read_csv() to specify your encoding.
Encoding options are listed here
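For example, if the file was saved as Windows, Western European, a call like this should read it (the file name is a placeholder; cp1252 is the usual codec name for that encoding):

import pandas as pd

df = pd.read_csv('exported_from_excel.csv', encoding='cp1252')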
I encourage you to do the comparison yourself and see which approach is appropriate in your situation.
For instance, if you are processing a lot of XLSX files and are only going to ever read each one once, you may not want to worry about the CSV conversion. However, if you are going to read the CSVs over and over again, then I would highly recommend saving each of the worksheets in the workbook to a csv once, then read them repeatedly using pd.read_csv().
Below is a simple script that will let you compare Importing XLSX Directly, Converting XLSX to CSV in memory, and Importing CSV. It is based on Jing Xue's answer.
Spoiler alert: If you are going to read the file(s) multiple times, it's going to be faster to convert the XLSX to CSV.
I did some testing with some files I'm working on, and here are my results:
5,874 KB xlsx file (29,415 rows, 58 columns)
Elapsed time for [Import XLSX with Pandas]: 0:00:31.75
Elapsed time for [Convert XLSX to CSV in mem]: 0:00:22.19
Elapsed time for [Import CSV file]: 0:00:00.21
********************
202,782 KB xlsx file (990,832 rows, 58 columns)
Elapsed time for [Import XLSX with Pandas]: 0:17:04.31
Elapsed time for [Convert XLSX to CSV in mem]: 0:12:11.74
Elapsed time for [Import CSV file]: 0:00:07.11
YES! the 202MB file really did take only 7 seconds compared to 17 minutes for the XLSX!!!
If you're ready to set up your own test, just open your XLSX in Excel and save one of the worksheets to CSV. For a final solution, you would obviously need to loop through the worksheets to process each one.
You will also need to pip install rich pandas xlsx2csv.
from rich import print
import pandas as pd
from datetime import datetime
from xlsx2csv import Xlsx2csv
from io import StringIO

def timer(name, startTime=None):
    if startTime:
        print(f"Timer: Elapsed time for [{name}]: {datetime.now() - startTime}")
    else:
        startTime = datetime.now()
        print(f"Timer: Starting [{name}] at {startTime}")
    return startTime

def read_excel(path: str, sheet_name: str) -> pd.DataFrame:
    buffer = StringIO()
    Xlsx2csv(path, outputencoding="utf-8", sheet_name=sheet_name).convert(buffer)
    buffer.seek(0)
    df = pd.read_csv(buffer)
    return df

xlsxFileName = "MyBig.xlsx"
sheetName = "Sheet1"
csvFileName = "MyBig.csv"

startTime = timer(name="Import XLSX with Pandas")
df = pd.read_excel(xlsxFileName, sheet_name=sheetName)
timer("Import XLSX with Pandas", startTime)

startTime = timer(name="Convert XLSX to CSV first")
df = read_excel(path=xlsxFileName, sheet_name=sheetName)
timer("Convert XLSX to CSV first", startTime)

startTime = timer(name="Import CSV")
df = pd.read_csv(csvFileName)
timer("Import CSV", startTime)
There's no reason to open Excel if you're willing to deal with the slow conversion once:
Read the data into a dataframe with pd.read_excel()
Dump it into a csv right away with df.to_csv()
Avoid both Excel and Windows specific calls. In my case the one-time time hit was worth the hassle. I got a ☕.
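Putting those two steps together, a minimal sketch (file and sheet names are placeholders):

import pandas as pd

# one-time, slow: convert the workbook sheet to csv
df = pd.read_excel('MyBig.xlsx', sheet_name='Sheet1')
df.to_csv('MyBig.csv', index=False)

# every run after that, fast:
df = pd.read_csv('MyBig.csv')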
