I use a simple code snippet in Streamlit that shows a dataframe built from an Excel file I have. The problem is that it takes too much time to load the data inside Streamlit's filter widget. In that filter area I search by material name, but it can take up to 30 seconds to load and show the data I want to select. How can I solve this and make selecting data as fast as possible?
The code is:
import streamlit as st
import pandas as pd

@st.cache
def load_data(nrows):
    df = pd.read_excel('materials.xlsx', nrows=nrows)
    return df

df = load_data(100000)
species = st.multiselect('SELECT THE MATERIAL', df['Name'])
new_df = df[df['Name'].isin(species)]
st.write(new_df)
and to show how it is too slow selecting data, look at this: https://streamable.com/kjis2
Ultimately, without knowing which part of your code is running slowly (and how much data you're actually loading), it's hard to give concrete advice. It looks like you're loading 100,000 rows from a spreadsheet. That's potentially a lot of data, especially if those rows are themselves large.
Some things to try:
Add some performance instrumentation around chunks of code, to figure out what is running slowly. This can be as simple as just calling time.time() before and after a chunk of code, and outputting the difference between the two values:
import contextlib
import time
import pandas as pd
import streamlit as st

@contextlib.contextmanager
def profile(name):
    start_time = time.time()
    yield  # <-- your code will execute here
    total_time = time.time() - start_time
    print("%s: %.4f ms" % (name, total_time * 1000.0))

with profile("load_data"):
    df = pd.read_excel('materials.xlsx', nrows=100000)

with profile("create_multiselect"):
    species = st.multiselect('SELECT THE MATERIAL', df['Name'])

with profile("filter_on_name"):
    new_df = df[df['Name'].isin(species)]
Change the @st.cache annotation to @st.cache(allow_output_mutation=True). This lets Streamlit avoid hashing the output of the function (a gigantic dataframe that may take a while to hash).
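For reference, a minimal sketch of the cached loader with that flag, reusing the file and function from your snippet (this applies to the st.cache API, not the newer caching primitives):
import streamlit as st
import pandas as pd

@st.cache(allow_output_mutation=True)  # skip hashing the returned DataFrame
def load_data(nrows):
    # nrows limits how many rows pandas parses from the workbook
    return pd.read_excel('materials.xlsx', nrows=nrows)

df = load_data(100000)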
Is it possible to load fewer rows at a time?
Alternately, can you avoid loading all the data that's within each of those rows? (Do you need the "Synonyms" column, for example?) pandas.read_excel takes a usecols parameter that lets you limit which columns get parsed.
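As an illustration, a hedged sketch of that idea; 'Name' comes from your snippet, while 'Price' is a made-up stand-in for whichever other columns you actually need:
import pandas as pd

# Parse only the columns the app uses; everything else is skipped at read time.
df = pd.read_excel('materials.xlsx', nrows=100000, usecols=['Name', 'Price'])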
If you do need all this data loaded in, could you store it in a way that allows for faster filtering? For example, it looks like pandas can interface with SQLite. Could you load your excel data into a SQLite database once, when the app first runs, and then perform queries against that db, rather than against the DataFrame?
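A rough sketch of that idea, assuming a single-table layout; the database file name, table name, and the 'Steel' filter value are made up for illustration:
import sqlite3
import pandas as pd

# One-time conversion: read the spreadsheet and dump it into SQLite,
# with an index on the column you filter by.
df = pd.read_excel('materials.xlsx')
with sqlite3.connect('materials.db') as conn:
    df.to_sql('materials', conn, if_exists='replace', index=False)
    conn.execute('CREATE INDEX IF NOT EXISTS idx_name ON materials (Name)')

# On later runs, filter in SQL so only the matching rows come back.
with sqlite3.connect('materials.db') as conn:
    new_df = pd.read_sql_query(
        'SELECT * FROM materials WHERE Name = ?', conn, params=('Steel',))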
Related
I have a file bigger than 7GB. I am trying to place it into a dataframe using pandas, like this:
df = pd.read_csv('data.csv')
But it takes too long. Is there a better way to speed up the dataframe creation? I was considering changing the parameter engine='c', since it says in the documentation:
"engine{‘c’, ‘python’}, optional
Parser engine to use. The C engine is faster while the python engine is currently more feature-complete."
But I don't see much gain in speed.
If the problem is that you are not able to create the dataframe at all because the file's size makes the operation fail, you can check how to chunk it in this answer.
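In case that link goes stale, the chunking idea is roughly the following; the filter condition and the 'value' column name are only placeholders for whatever subset you actually need:
import pandas as pd

chunks = []
# Read the file in pieces so the full 7 GB never has to sit in memory at once,
# keeping only the rows you need from each piece.
for chunk in pd.read_csv('data.csv', chunksize=1_000_000):
    chunks.append(chunk[chunk['value'] > 0])  # placeholder filter

df = pd.concat(chunks, ignore_index=True)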
If it is created at some point but you consider it too slow, you can use datatable to read the file, convert it to pandas, and continue with your operations:
import pandas as pd
import datatable as dt
# Read with datatable
datatable_df = dt.fread('myfile.csv')
# Then convert the frame into pandas
pandas_df = datatable_df.to_pandas()
I am working with very big Excel files, which take a long time to load with Pandas in Python. Before processing the data, the user has to select quite a few options related to it, which only require the names of each column in each dataset. It is very inconvenient for the user to have to wait, sometimes for minutes, until the data is loaded just to select the necessary options, and then let the program do the actual processing for another few minutes.
So, my question is: is there a way to load only the data header from an Excel file with Python? I think of it as an alternate version of the "skiprows" parameter in the read_excel Pandas function, where instead of skipping rows at the beginning of the data, I would like to skip rows at the end. I want to emphasize that my goal is to reduce the time Python takes to load the files. I also know there are ways to do this with csv files, but unfortunately that didn't help me.
Thank you for the help!
You can try to use the sxl module (https://pypi.org/project/sxl/). Here is the code I tried for a large excel file (around 75,000 rows) and the timing results:
from datetime import datetime

import pandas as pd
import sxl

startTime = datetime.now()
df = pd.read_excel('\\Big_Excel.xlsx')
print("Time taken to load whole data with pandas read excel is {}".format(datetime.now() - startTime))

startTime = datetime.now()
df = pd.read_excel('\\Big_Excel.xlsx', nrows=5)
print("Time taken with top 5 rows with pandas read excel is {}".format(datetime.now() - startTime))

startTime = datetime.now()
wb = sxl.Workbook('\\Big_Excel.xlsx')
ws = wb.sheets[1]
data = ws.head(5)
print("Time taken to load top 5 rows using sxl is {}".format(datetime.now() - startTime))
Pandas read_excel loads the whole data into memory either way, so there is not much of a difference in timing between the first two runs. Here are the outputs from the above:
Time taken to load whole data with pandas read excel is 0:00:49.174538
Time taken with top 5 rows with pandas read excel is 0:00:44.478523
Time taken to load top 5 rows using sxl is 0:00:00.671717
I hope this helps!!
You can use the skipfooter parameter or the nrows parameter with both .xlsx and .csv files. However, the two cannot be used together.
path = r'c:\users\abc\def\stack.xlsx'
df = pd.read_excel(path, skipfooter = 99999)
This means 99,999 rows will be skipped counting from the bottom of the sheet, and the remaining records below the header will be loaded.
path = r'c:\users\abc\def\stack.xlsx'
df = pd.read_excel(path, nrows= 5)
This means only the first 5 rows below the header will be loaded.
Also refer to this Stack Overflow question.
from dask import dataframe as dd
df = dd.read_csv("filename")
Trust me, it's fast; I am reading an 800 MB file with it.
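A slightly fuller sketch, since dask is lazy and only materialises a pandas DataFrame when you ask for one; the file and column names in the filter are just placeholders:
from dask import dataframe as dd

# dask splits the CSV into partitions and defers the actual work
ddf = dd.read_csv('data.csv')

# filtering stays lazy; .compute() pulls the (hopefully smaller) result into pandas
result = ddf[ddf['value'] > 0].compute()  # 'value' is a placeholder column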
This is another follow-up to an earlier question I posted: How can I merge these many csv files (around 130,000) using PySpark into one large dataset efficiently?
I have the following dataset https://fred.stlouisfed.org/categories/32263/downloaddata/INTRNTL_csv_2.zip
It contains a list of files (around 130,000) organised into sub-directories under the main directory, so the first cell might be A/AAAAA and the corresponding file would be located at /data/A/AAAAA.csv.
The files all have a similar format: the first column is called DATE and the second column is a series that is always named VALUE. So, first of all, the VALUE column needs to be renamed to the file name in each csv file. Second, the frames need to be full outer joined with each other, with DATE as the main index. Third, I want to save the result and be able to load and manipulate it. The file should be around N rows (number of dates) by roughly 130,001 columns.
I am trying to full outer join all the files into a single dataframe. I previously tried pandas but ran out of memory when concatenating the list of files, and someone recommended that I try PySpark instead.
In a previous post I was told that I could do this:
df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv", "date DATE, value DOUBLE")
But all the columns are named VALUE, so the frame just becomes two columns: the first is DATE and the second is VALUE. It loads quite fast, around 38 seconds, and ends up at around 3.8 million values by 2 columns, so I know it's not doing the full outer join; it's appending the files row-wise.
So I tried the following code:
import pandas as pd
import time
import os
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('spark-dataframe-demo').getOrCreate()
from pyspark.sql import *
from pyspark.sql.functions import col
from pyspark.sql import DataFrame
from pyspark.sql.types import *
filelist = pd.read_excel("/kaggle/input/list/BF_csv_2.xlsx") #list of filenames
firstname = min(filelist.File)
length = len(filelist.File)
dff = spark.read.csv(f"/kaggle/input/bf-csv-2/BF_csv_2/data/" + firstname, inferSchema = True, header = True).withColumnRenamed("VALUE",firstname) #read file and changes name of column to filename
for row in filelist.File.items():
    if row == firstname:
        continue
    print(row[1], length, end='', flush=True)
    df = spark.read.csv(f"/kaggle/input/bf-csv-2/BF_csv_2/data/" + row[1], inferSchema=True, header=True).withColumnRenamed("VALUE", row[1][:-4])
    #df = df.select(col("DATE").alias("DATE"), col("VALUE").alias(row[1][:-4]))
    dff = dff.join(df, ['DATE'], how='full')
    length -= 1

dff.write.save('/kaggle/working/whatever', format='parquet', mode='overwrite')
So to test it, I call df.show() after 3 columns are merged and it's quite fast. But when I try around 25 columns, it takes around 2 minutes. When I try 500 columns it's next to impossible.
I don't think I'm doing it right. The formatting and everything is correct. But why is it taking so long? How can I use PySpark properly? Are there any better libraries to achieve what I need?
Spark doesn't do anything magical compared to other software. The strength of Spark is parallel processing. Most of the time that means you can use multiple machines to do the work. If you are running Spark locally, you may have the same issues you did when using pandas.
That being said, there might be a way for you to run it locally using Spark because it can spill to disk under certain conditions and does not need to have everything in memory.
I'm not well versed in PySpark, but the approach I'd take is as follows (a rough sketch is shown after the steps):
Load all the files with a glob, like you did: /kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv
Use input_file_name from pyspark.sql.functions, which gives you the source path for each record in your DF (df.select("date", "value", input_file_name().alias("filename")) or similar)
Parse the path into the value you'd like to have as a column (e.g. extract the filename)
The schema should look like date, value, filename at this step
Use the PySpark equivalent of df.groupBy("date").pivot("filename").agg(first("value")). Note: I used first() because I think you have 1 or 0 records per (date, filename) pair
Also try setting the number of partitions equal to the number of dates you have
If you want output as a single file, do not forget to repartition(1) before df.write. This step might be problematic depending on data size. You do not need to do this if you plan to keep using Spark for your work as you could load the data using the same approach as in step 1 (/new_result_data/*.csv)
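Putting those steps together, here is a rough, untested PySpark sketch; the schema string reuses the DATE/VALUE columns from the question, and the regex that turns the file path into a series name is only illustrative:
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract, first

spark = SparkSession.builder.appName('pivot-series').getOrCreate()

# 1. Load every CSV in one pass with a fixed schema
df = spark.read.csv('/kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv',
                    schema='DATE DATE, VALUE DOUBLE', header=True)

# 2./3. Attach the source path and strip it down to the series name
df = df.withColumn('filename',
                   regexp_extract(input_file_name(), r'([^/]+)\.csv$', 1))

# 4./5. One column per series: pivot on the filename; first() is enough because
# there is at most one value per (DATE, filename) pair
wide = df.groupBy('DATE').pivot('filename').agg(first('VALUE'))

wide.write.parquet('/kaggle/working/wide_series', mode='overwrite')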
import csv
import numpy as np
import pandas as pd
import urllib.request
import time
x = urllib.request.urlopen("https://forex.1forge.com/1.0.3/quotes?pairs=EURUSD,EURJPY,GBPUSD,USDCAD,GBPJPY,USDJPY,AUDUSD,&api_key=KEY")
df = pd.read_csv(x,header=None, sep=',',
infer_datetime_format=True)
starttime=time.time()
while True:
    print(df)
    time.sleep(60.0 - ((time.time() - starttime) % 60.0))
I wrote this code with the intent of pulling data from the URL, placing it in a Pandas DataFrame, and then, minute by minute, updating the DataFrame with fresh information from the URL indexed by time. Currently I'm able to pull the raw data into the DataFrame, but when the information is called by the timer I made, it repeats what was pulled before instead of updating. The data I'm getting is also very convoluted and messy, so I haven't even been able to index it by time.
If I could be pointed in the direction of where I can learn how to clean the information in the dataframe and how to fetch updated data each time it is put into the dataframe, it would be much appreciated. Thanks for reading!
Looks like the data from the site is in JSON format.
Also your action of pulling the data was outside the while loop, so you only pulled once but printed every minute. Try this:
import pandas as pd
import time
while True:
    df = pd.read_json("https://forex.1forge.com/1.0.3/quotes?pairs=EURUSD,EURJPY,GBPUSD,USDCAD,GBPJPY,USDJPY,AUDUSD,&api_key=KEY")
    print(df)
    time.sleep(60)
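If you also want each pull stamped with the time it was fetched, one hedged way is to tag every batch and keep appending; the 'fetched_at' column is made up here, and the exact shape depends on what the JSON actually returns:
import pandas as pd
import time

history = pd.DataFrame()
while True:
    quotes = pd.read_json("https://forex.1forge.com/1.0.3/quotes?pairs=EURUSD,EURJPY,GBPUSD,USDCAD,GBPJPY,USDJPY,AUDUSD,&api_key=KEY")
    quotes['fetched_at'] = pd.Timestamp.now()  # stamp this batch of quotes
    history = pd.concat([history, quotes], ignore_index=True)
    print(history.tail())
    time.sleep(60)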
Firstly, that looked like a real API key, which you should not share.
The code you shared does not request the URL repeatedly. Only the lines in the while True loop will repeatedly execute. In your code, these are the lines that make a request and establish a DataFrame from the response:
x = urllib.request.urlopen("https://forex.1forge.com/1.0.3/quotes?pairs=EURUSD,EURJPY,GBPUSD,USDCAD,GBPJPY,USDJPY,AUDUSD,&api_key=KEY")
df = pd.read_csv(x,header=None, sep=',',
infer_datetime_format=True)
Edit: as regards your question about how to begin cleaning data, the pandas official cheat sheet is not bad in my opinion.
I'm using python 3.5 and pandas 0.19.2.
Using pandas.read_table, is there a way to filter when reading data?
In my example below, I read in my initial data frame and then subset the rows I want based on a condition. Is there a way to do this, or any way to dramatically speed up the example below? I couldn't see anything in the pandas.read_table docs (link) that showed how to speed this up.
Currently it takes around 3 minutes.
import pandas as pd
from datetime import datetime
start_time = datetime.now()
# reading table
df = pd.read_table('https://download.bls.gov/pub/time.series/ce/ce.data.0.AllCESSeries', sep='\t', header=0)
# subsetting
df = df[df['series_id'].str.contains("CEU0000000001")]
end_time = datetime.now()
run_time = end_time-start_time
print(run_time)
Consider using an alternative storage format if you want to speed up reading from disk significantly.
I'd consider using HDF5 or Feather formats.
PS: HDFStore lets us index the data and read it by that index, so we read from disk only the data we need; there is no need to read everything into memory and filter it there.
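A hedged sketch of that idea with the BLS file from the question: convert once to HDF5 in table format with series_id declared as a data column, then read back only the matching rows (this needs the PyTables package installed, and the exact match assumes the series_id values carry no stray whitespace):
import pandas as pd

# One-time conversion: store the table so that 'series_id' is queryable on disk.
df = pd.read_table('https://download.bls.gov/pub/time.series/ce/ce.data.0.AllCESSeries',
                   sep='\t', header=0)
df.to_hdf('ce_data.h5', key='ce', format='table', data_columns=['series_id'])

# Later runs: only the rows whose series_id matches are read from disk.
subset = pd.read_hdf('ce_data.h5', key='ce',
                     where="series_id == 'CEU0000000001'")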