Reading HTML tables in pandas is fine for small files, but a big file (around 10 MB, roughly 10,000 rows/records in the HTML table) leaves me waiting 10 minutes with still no progress, whereas the same data in CSV is parsed quickly.
Kindly help me speed up reading the HTML table in pandas, or suggest a way to get it converted to CSV.
file='testfile.html'
dfdefault = pd.read_html(file, header = 0, match='Client Inventory Details')
#print(dfdefault)
df = dfdefault[0]
An HTML dataset is still a dataset. To read large datasets faster in pandas you can choose between several strategies, and they apply to read_html as well:
1. Sampling
2. Chunking
3. Optimising pandas dtypes
Sampling. The simplest option is to sample your dataset.
import pandas
import random

filename = "data.csv"
n = sum(1 for line in open(filename)) - 1             # number of data rows in the file (header excluded)
s = n // 10                                           # sample size of 10%
skip = sorted(random.sample(range(1, n + 1), n - s))  # rows to skip; row 0 (the header) is never skipped
df = pandas.read_csv(filename, skiprows=skip)
Chunks / Iteration
If you do need to process all of the data, you can split it into a number of chunks (each of which fits in memory on its own) and perform your data cleaning and feature engineering on each individual chunk:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

datafile = "data.csv"
chunksize = 100000
models = []
for chunk in pd.read_csv(datafile, chunksize=chunksize):
    # pre_process_and_feature_engineer: a function to clean the data and create the features
    chunk = pre_process_and_feature_engineer(chunk)
    model = LogisticRegression()
    model.fit(chunk[features], chunk['label'])
    models.append(model)

df = pd.read_csv("data_to_score.csv")
df = pre_process_and_feature_engineer(df)
predictions = np.mean([model.predict(df[features]) for model in models], axis=0)
Optimise data types
When loading data from a file, pandas automatically infers the datatypes. That is very convenient, but these inferred datatypes are often not optimal and take up more memory than needed. The three most common datatypes used by pandas are int, float and object, and each of them can usually be shrunk, as the small example below shows.
Another way to drastically reduce the size of your pandas DataFrame is to convert columns of dtype object to category.
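The following is a minimal sketch of both ideas; the DataFrame and column names are made up purely for illustration.
import pandas as pd

df = pd.DataFrame({
    "quantity": [1, 2, 3, 4],                      # inferred as int64
    "price": [9.99, 4.5, 3.25, 12.0],              # inferred as float64
    "status": ["open", "closed", "open", "open"],  # inferred as object
})

print(df.memory_usage(deep=True))

df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")  # int64 -> int8
df["price"] = pd.to_numeric(df["price"], downcast="float")          # float64 -> float32
df["status"] = df["status"].astype("category")                      # object -> category

print(df.memory_usage(deep=True))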
Related
I wrote code for point generation that produces a dataframe every second, and it keeps generating them. Each dataframe has 1000 rows and 7 columns. It is implemented with a while loop, so on every iteration one dataframe is generated and must be appended to a file. Which file format should I use to manage memory efficiently? Which file format takes the least space? Can anyone give me a suggestion? Is it okay to use CSV? If so, what datatype should I prefer? Currently my dataframe has int16 values. Should I append them as they are, or should I convert them to a binary or byte format?
numpy arrays can be stored in binary format. Since you have a single int16 data type, you can create a numpy array and write that. You would have 2 bytes per int16 value, which is fairly good for size. The trick is that you need to know the dimensions of the stored data when you read it later; in this example it's hard-coded. This is a bit fragile: if you change your mind and start using different dimensions later, old data would have to be converted.
Assuming you want to read a bunch of 1000x7 dataframes later, you could do something like the example below. The writer keeps appending 1000x7 int16s and the reader chunks them back into dataframes. If you don't use anything specific to pandas itself, you would be better off just sticking with numpy for all of your operations and skip the demonstrated conversions.
import pandas as pd
import numpy as np
def write_df(filename, df):
    with open(filename, "ab") as fp:
        np.array(df, dtype="int16").tofile(fp)

def read_dfs(filename, dim=(1000, 7)):
    """Sequentially reads dataframes from a file formatted as raw int16
    with dimension 1000x7"""
    size = dim[0] * dim[1]
    with open(filename, "rb") as fp:
        while True:
            arr = np.fromfile(fp, dtype="int16", count=size)
            if not len(arr):
                break
            yield pd.DataFrame(arr.reshape(*dim))

import os

# ready for test
test_filename = "test123"
if os.path.exists(test_filename):
    os.remove(test_filename)

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# write test file
for _ in range(5):
    write_df(test_filename, df)

# read and verify test file
return_data = [df for df in read_dfs(test_filename, dim=(3, 2))]
assert len(return_data) == 5
I have the following workflow in a Python notebook
Load data into a pandas dataframe from a table (around 200K rows) --> I will call this orig_DF moving forward
Manipulate orig_DF to get a DF that has columns <Feature1, Feature2, ..., FeatureN, Label> --> I will call this derived DF ML_input DF moving forward. This DF is used to train an ML model
To get ML_input DF, I need to do some complex processing on each row in orig_DF. In particular, each row in orig_DF gets converted into multiple "rows" (number unknown before processing a row) in ML_input DF
Currently, I am doing (code below)
orig_df.iterrows() to loop through each row
Apply a function on each row. This returns a list.
Accumulate results from multiple rows into one list
Convert this list into ML_input DF after the loop ends
This works, but I want to speed it up by parallelizing the work on each row and accumulating the results. I would appreciate pointers from pandas experts on how to do this. An example would be greatly appreciated.
Current code is below.
Note: I have looked into using df.apply(). But two issues seem to be
apply in itself does not seem to parallelize things.
I don't know how to make apply handle the one-row-converted-to-multiple-rows issue (any pointers here will also help)
Current code
def get_training_dataframe(dfin):
    X = []
    for index, row in dfin.iterrows():
        ts_frame_dict = ast.literal_eval(row["sample_dictionary"])
        for ts, frame in ts_frame_dict.items():
            features = get_features(frame)
            if features != None:
                X += [features]
    return pd.DataFrame(X, columns=FEATURE_NAMES)
It's difficult to know what optimizations are possible without having example data and without knowing what get_features() does.
The following code ought to be equivalent (I think) to your code, but it attempts to "vectorize" each step instead of performing it all within the for-loop. Perhaps that will offer you a chance to more easily measure the time taken by each step, and optimize the bottlenecks.
In particular, I wonder if it's faster to combine the calls to ast.literal_eval() into a single call. That's what I've done here, but I have no idea if it's truly faster.
I recommend trying line profiler if you can.
import ast
from itertools import chain

import pandas as pd

def get_training_dataframe(dfin):
    frame_dicts = ast.literal_eval('[' + ','.join(dfin['sample_dictionary']) + ']')
    frames = chain(*(d.values() for d in frame_dicts))
    features = map(get_features, frames)
    features = [f for f in features if f is not None]
    return pd.DataFrame(features, columns=FEATURE_NAMES)
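If profiling shows that get_features() itself dominates the runtime, one option is to parallelize that step with the standard-library multiprocessing pool. This is only a sketch: it assumes get_features is a top-level, picklable function and CPU-bound enough to outweigh the process overhead, and the processes count is an arbitrary placeholder.
import ast
from itertools import chain
from multiprocessing import Pool

import pandas as pd

def get_training_dataframe_parallel(dfin, processes=4):
    # Same vectorised parsing as above, then fan the per-frame work out to worker processes.
    frame_dicts = ast.literal_eval('[' + ','.join(dfin['sample_dictionary']) + ']')
    frames = list(chain(*(d.values() for d in frame_dicts)))
    with Pool(processes=processes) as pool:
        features = pool.map(get_features, frames)
    features = [f for f in features if f is not None]
    return pd.DataFrame(features, columns=FEATURE_NAMES)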
I have a data frame with 13000 rows and 3 columns:
('time', 'rowScore', 'label')
I want to read subset by subset:
[[1..360], [360..712], ..., [12640..13000]]
I also tried using a list, but it's not working:
import pandas as pd
import math
import datetime

result = "data.csv"
dataSet = pd.read_csv(result)

TP = 0
count = 0
x = 0

df = pd.DataFrame(dataSet, columns=['rawScore', 'label'])

for i, row in df.iterrows():
    data = row.to_dict()
    ScoreX = data['rawScore']
    labelX = data['label']
    for i in range(1, 13000, 360):
        x = x + 1
        for j in range(i, 360 * x, 1):
            if (ScoreX > 0.3) and (labelX == 0):
                count = count + 1
print("count=", count)
You can also use the parameters nrows or skiprows to break the file up into chunks as you read it; a sketch follows below. I would recommend against using iterrows, since that is typically very slow. If you split the data while reading it in and save those chunks separately, you can skip the iterrows section entirely. This covers the file-reading side, if splitting into chunks is an intermediate step in what you're trying to do.
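A sketch of the nrows/skiprows idea, reusing the file name, the chunk size of 360 and the column names from the question; treat the stopping condition as an assumption about your file.
import pandas as pd

result = "data.csv"
chunk_size = 360
total_rows = 13000

count = 0
for start in range(0, total_rows, chunk_size):
    subset = pd.read_csv(
        result,
        skiprows=range(1, start + 1),  # skip the data rows before this chunk, keep the header
        nrows=chunk_size,              # read at most one chunk
    )
    if subset.empty:
        break
    count += ((subset['rawScore'] > 0.3) & (subset['label'] == 0)).sum()
print("count=", count)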
Another way is to subset using generators by seeing if the values belong to each set:
[[1..360], [360..712], ..., [12640..13000]]
So write a function that yields the chunk boundaries in steps of 360 and, when an index falls in the range you care about, selects that particular subset (see the sketch below).
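A sketch of that generator approach, assuming the full DataFrame is already loaded and again borrowing the column names from the question; each yielded (start, stop) pair selects one 360-row subset via iloc.
import pandas as pd

def subset_bounds(n_rows, width=360):
    """Yield (start, stop) index pairs covering 0..n_rows in steps of width."""
    for start in range(0, n_rows, width):
        yield start, min(start + width, n_rows)

df = pd.read_csv("data.csv")
for start, stop in subset_bounds(len(df), width=360):
    subset = df.iloc[start:stop]
    # operate on just this subset, e.g. the count from the question:
    print(start, stop, ((subset['rawScore'] > 0.3) & (subset['label'] == 0)).sum())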
I just wrote these approaches down as alternative ideas you might want to play around with, since in some cases you may only want a subset and not all of the chunks for calculation purposes.
There is a data file consisting of 100K rows, where each row stores a data point. I would like to randomly select 10K rows and save them into a validation file, and save the remaining rows into a training file. Instead of writing code to do this myself, is there an existing function in scikit-learn, pandas, or plain Python to efficiently split a data file into two?
About the only reason not to use sklearn's train_test_split is that you don't want to separate the labels from the features; you simply want to split the DataFrame into two sections without splitting features and labels apart.
If you don't want to use train_test_split from sklearn, you can do it in pandas too:
In [11]: df = pd.DataFrame(np.random.randn(100, 2))
In [12]: msk = np.random.rand(len(df)) < 0.8
In [13]: train = df[msk]
In [14]: test = df[~msk]
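Note that the boolean mask above gives an approximately 80/20 split. If you need exactly 10K validation rows out of 100K, a sketch using DataFrame.sample would look like the following; the file names are placeholders.
import pandas as pd

df = pd.read_csv("data_file.csv")                  # the 100K-row data file (placeholder name)

validation = df.sample(n=10_000, random_state=42)  # exactly 10K random rows
training = df.drop(validation.index)               # the remaining 90K rows

validation.to_csv("validation.csv", index=False)
training.to_csv("training.csv", index=False)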
I have a large SPSS-file (containing a little over 1 million records, with a little under 150 columns) that I want to convert to a Pandas DataFrame.
It takes a few minutes to convert the file to a list, then another couple of minutes to convert it to a dataframe, then another few minutes to set the column headers.
Are there any optimizations possible, that I'm missing?
import pandas as pd
import numpy as np
import savReaderWriter as spss
raw_data = spss.SavReader('largefile.sav', returnHeader = True) # This is fast
raw_data_list = list(raw_data) # this is slow
data = pd.DataFrame(raw_data_list) # this is slow
data = data.rename(columns=data.loc[0]).iloc[1:] # setting column headers, this is slow too.
You can use rawMode=True to speed up things a bit, as in:
raw_data = spss.SavReader('largefile.sav', returnHeader=True, rawMode=True)
This way, datetime variables (if any) won't be converted to ISO strings, SPSS $sysmis values won't be converted to None, and a few other conversions are skipped.
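A sketch that combines rawMode=True with building the DataFrame in one step; it mirrors the question's pattern where, with returnHeader=True, the first record is the header row. Whether this saves much time in practice is something to measure, and with rawMode the header values may come back in raw (bytes) form.
import pandas as pd
import savReaderWriter as spss

with spss.SavReader('largefile.sav', returnHeader=True, rawMode=True) as reader:
    records = list(reader)  # first element is the header row

data = pd.DataFrame(records[1:], columns=records[0])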