Estimate pandas dataframe size without loading into memory

Estimate pandas dataframe size without loading into memory - python

Is there a way to estimate the size a dataframe would be without loading it into memory? I already know that I do not have enough memory for the dataframe that I am trying to create but I do not know how much more memory would be required to fully create it.

You can calculate for one row, and estimate based on it:
data = {'name': ['Bill'],
'year': [2012],
'num_sales': [4]}
df = pd.DataFrame(data, index = ['sales'])
df.memory_usage(index=True).sum() #-> 32

I believe you're looking for df.memory_usage, which would tell you how much each column will occupy.
Altogether it would go something like:
df.memory_usage().sum()
Output:
123123000
You can do more specifics things like including Index (Index = True) or using the Deep feature which will "introspect the data deeply". Feel free to check the documentation!
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.memory_usage.html

Related

Increasing efficiency for zero shot classification

I am using zero shot classification to label large amounts of data. I have written a simple function to assist me with this and am wondering if there is a better way for this to run. My current logic was to take the highest score and label and append this label into a dataframe.
def labeler(input_df,output_df):
labels = ['Fruit','Vegetable','Meat','Other']
for i in tqdm(range(len(input_df))):
temp = classifier(input_df['description'][i],labels)
output ={'work_order_num':input_df['order_num'][i],
'work_order_desc':input_df['description'][i],
'label':temp['labels'][0],
'score':temp['scores'][0]}
output_df.append(output)
In terms of speed and resources would it be better to shape this function with lambda?

Your problem boils down to iteration over the pandas dataframe input_df. Doing that with a for loop is not the most efficient way (see: How to iterate over rows in a DataFrame in Pandas).
I suggest doing something like this:
output_df['work_order_num', 'work_order_desc'] = input_df['order_num', 'description'] # these columns can be copied as whole.
def classification(df_desc):
temp = classifier(df_desc, labels)
return temp['labels'][0], temp['scores'][0]
output_df['label'], output_df['score'] = zip(*input_df.apply(classification))
classification function returns tuples of values that need to be unpacked so I used the zip trick from this question.
Also, building a dataframe by concatenation is a very slow process too. So with the solution above you omit two potentially prohibitively slow operations: slow for-loop and appending rows to a dataframe.

Pandas dataframe sum of row won't let me use result in equation

Anybody wish to help me understand why below code doesn't work?
start_date = '1990-01-01'
ticker_list = ['SPY', 'QQQ', 'IWM','GLD']
tickers = yf.download(ticker_list, start=start_date)['Close'].dropna()
ticker_vol_share = (tickers.pct_change().rolling(20).std()) \
/ ((tickers.pct_change().rolling(20).std()).sum(axis=1))
Both (tickers.pct_change().rolling(20).std()) and ((tickers.pct_change().rolling(20).std()).sum(axis=1)) runs fine by themselves, but when ran together they form a dataframe with thousands of columns all filled with nan

Try this.
rolling_std = tickers.pct_change().rolling(20).std()
ticker_vol_share = rolling_std.apply(lambda row:row/sum(row),axis = 1)
You will get

Why its not working as expected:
Your tickers object is a DataFrame, as is the tickers.pct_change(), tickers.pct_change().rolling(20) and tickers.pct_change().rolling(20).std(). The tickers.pct_change().rolling(20).std().sum(axis=1) is probably a Series.
You're therefore doing element-wise division of a DataFrame by a Series. This yields a DataFrame.
Without seeing your source data, it's hard to say for sure why the output DF is filled with nan, but that can certainly happen if some of the things you're dividing by are 0. It might also happen if each series is only one element long after taking the rolling average. It might also happen if you're actually evaluating a Series tickers rather than a DataFrame, since Series.sum(axis=1) doesn't make a whole lot of sense. It is also suspicious that your top and bottom portions of the division are probably different shapes, since sum() collapses an axis.
It's not clear to me what your expected output is, so I'll defer to others or wait for an update before answering that part.

Speeding up derived feature calculation in Pandas dataframe

I have the following workflow in a Python notebook
Load data into a pandas dataframe from a table (around 200K rows) --> I will call this orig_DF moving forward
Manipulate orig_DF to get into a DF that has columns <Feature1, Feature 2,...,Feature N, Label> --> I will call this derived DF ```ML_input DF`` moving forward. This DF is used to train a ML model
To get ML_input DF, I need to do some complex processing on each row in orig_DF. In particular, each row in orig_DF gets converted into multiple "rows" (number unknown before processing a row) in ML_input DF
Currently, I am doing (code below)
orig_df.iterrows() to loop through each row
Apply a function on each row. This returns a list.
Accumulate results from multiple rows into one list
Convert this list into ML_input DF after the loop ends
This works but I want speed this up by parallelizing the work on each row and accumulating the results. Would appreciate pointers from Pandas experts on how to do this. An example would be greatly appreciated
Current code is below.
Note: I have looked into using df.apply(). But two issues seem to be
apply in itself does not seem to parallelize things.
I don't how to make apply handle this one row converted to multiple row issue (any pointers here will also help)
Current code
def get_training_dataframe(dfin):
X = []
for index, row in dfin.iterrows():
ts_frame_dict = ast.literal_eval(row["sample_dictionary"])
for ts, frame in ts_frame_dict.items():
features = get_features(frame)
if features != None:
X += [features]
return pd.DataFrame(X, columns=FEATURE_NAMES)

It's difficult to know what optimizations are possible without having example data and without knowing what get_features() does.
The following code ought to be equivalent (I think) to your code, but it attempts to "vectorize" each step instead of performing it all within the for-loop. Perhaps that will offer you a chance to more easily measure the time taken by each step, and optimize the bottlenecks.
In particular, I wonder if it's faster to combine the calls to ast.literal_eval() into a single call. That's what I've done here, but I have no idea if it's truly faster.
I recommend trying line profiler if you can.
import ast
import pandas as pd
def get_training_dataframe(dfin):
frame_dicts = ast.literal_eval('[' + ','.join(dfin['sample_dictionary']) + ']')
frames = chain(*(d.values() for d in frame_dicts))
features = map(get_features, frames)
features = [f for f in features if f is not None]
return pd.DataFrame(features, columns=FEATURE_NAMES)

Quickly search through a numpy array and sum the corresponding values

I have an array with around 160k entries which I get from a CSV-file and it looks like this:
data_arr = np.array(['ID0524', 1.0]
['ID0965', 2.5]
.
.
['ID0524', 6.7]
['ID0324', 3.0])
I now get around 3k unique ID's from some database and what I have to do is look up each of these IDs in the array and sum the corresponding numbers.
So if I would need to look up "ID0524", the sum would be 7.7.
My current working code looks something like this (I'm sorry that it's pretty ugly, I'm very new to numpy):
def sumValues(self, id)
sub_arr = data_arr[data_arr[0:data_arr.size, 0] == id]
sum_arr = sub_arr[0:sub_arr.size, 1]
return sum_arr.sum()
And it takes around ~18s to do this for all 3k IDs.
I wondered if there is probably any faster way to this as the current runtime seems a bit too long for me. I would appreciate any guidance and hints on this. Thank you!

You could try the using builtin numpy methods.
numpy.intersect1d to find the unique IDs
numpy.sum to sum them up

A convenient tool to do your task is Pandas, with its grouping mechanism.
Start from the necessary import:
import pandas as pd
Then convert data_arr to a pandasonic DataFrame:
df = pd.DataFrame({'Id': data_arr[:, 0], 'Amount': data_arr[:, 1].astype(float)})
The reason for some complication in the above code is that:
elements of your input array are of a single type (in this case
object),
so there is necessary to convert the second column to float.
Then you can get the expected result in a single instruction:
result = df.groupby('Id').sum()
The result, for your data sample, is:
Amount
Id
ID0324 3.0
ID0524 7.7
ID0965 2.5
Another approach is that you could read your CSV file directly
into a DataFrame (see read_csv method), so there is no need to use
any Numpy array.
The advantage is that read_csv is clever enough to recognize the data
type of each column separately, at least it is able to tell apart numbers
from strings.

Create new dataframe with condtion per groupby in pandas

I am trying to create new dataframe based on condition per groupby.
Suppose, I have dataframe with Name, Flag and Month.
import pandas as pd
import numpy as np
data = {'Name':['A', 'A', 'B', 'B'], 'Flag':[0, 1, 0, 1], 'Month':[1,2,1,2]}
df = pd.DataFrame(data)
need = df.loc[df['Flag'] == 0].groupby(['Name'], as_index = False)['Month'].min()
My condition is to find minimum month where flag equal to 0 per name.
I have used .loc to define my condition, it works fine but I found that it quite poor performance when applying with 10 million of rows.
Any more efficient way to do so?
Thank you!

Just had this same scenario yesterday, where I took a 90 second process down to about 3 seconds. Because speed is your concern (like mine was), and not using solely Pandas itself, I would recommend using Numba and NumPy. The catch is you're going to have to brush up on your data structures and types to get a good grasp on what Numba is really doing with JIT. Once you do though, it rocks.
I would recommend finding a way to get every value in your DataFrame to an integer. For your name column, try unique ID's. Flag and month already look good.
name_ids = []
for i, name in enumerate(np.unique(df["Name"])):
name_ids.append({i: name})
Then, create a function and loop the old-fashioned way:
#njit
def really_fast_numba_loop(data):
for row in data:
# do stuff
return data
new_df = really_fast_numba_loop(data)
The first time your function is called in your file, it will be about the same speed as it would elsewhere, but all the other times it will be lightning fast. So, the trick is finding the balance between what to put in the function and what to put in its outside loop.
In either case, when you're done processing your values, convert your name_ids back to strings and wrap your data in pd.DataFrame.
Et voila. You just beat Pandas iterrows/itertuples.
Comment back if you have questions!

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Estimate pandas dataframe size without loading into memory - python

Is there a way to estimate the size a dataframe would be without loading it into memory? I already know that I do not have enough memory for the dataframe that I am trying to create but I do not know how much more memory would be required to fully create it.

You can calculate for one row, and estimate based on it: data = {'name': ['Bill'], 'year': [2012], 'num_sales': [4]} df = pd.DataFrame(data, index = ['sales']) df.memory_usage(index=True).sum() #-> 32

Related

Increasing efficiency for zero shot classification

Pandas dataframe sum of row won't let me use result in equation

Speeding up derived feature calculation in Pandas dataframe

Quickly search through a numpy array and sum the corresponding values

Create new dataframe with condtion per groupby in pandas

Categories

Resources