In pandas 1.4.0: append() was deprecated, and the docs say to use concat() instead.
FutureWarning: The frame.append method is deprecated and will be
removed from pandas in a future version. Use pandas.concat instead.
Codeblock in question:
def generate_features(data, num_samples, mask):
"""
The main function for generating features to train or evaluate on.
Returns a pd.DataFrame()
"""
logger.debug("Generating features, number of samples", num_samples)
features = pd.DataFrame()
for count in range(num_samples):
row, col = get_pixel_within_mask(data, mask)
input_vars = get_pixel_data(data, row, col)
features = features.append(input_vars)
print_progress(count, num_samples)
return features
These are the two options I've tried, but did not work:
features = pd.concat([features],[input_vars])
and
pd.concat([features],[input_vars])
This is the line that is deprecated and throwing the error:
features = features.append(input_vars)
You can store the DataFrames generated in the loop in a list and concatenate them with features once you finish the loop.
In other words, replace the loop:
for count in range(num_samples):
# .... code to produce `input_vars`
features = features.append(input_vars) # remove this `DataFrame.append`
with the one below:
tmp = [] # initialize list
for count in range(num_samples):
# .... code to produce `input_vars`
tmp.append(input_vars) # append to the list, (not DF)
features = pd.concat(tmp) # concatenate after loop
You can certainly concatenate in the loop but it's more efficient to do it only once.
This will "append" the blank df and prevent errors in the future by using the concat option
features= pd.concat([features, input_vars])
However, still, without having access to actually data and data structures this would be hard to test replicate.
For example, you have a list of dataframes called collector, e.g. for cryptocurrencies, and you want to harvest first rows from two particular columns from each datafarme in our 'collector'. You do as follows
pd.concat([cap[['Ticker', 'Market Cap']].iloc[:1] for cap in collector] )
Related
I have the following workflow in a Python notebook
Load data into a pandas dataframe from a table (around 200K rows) --> I will call this orig_DF moving forward
Manipulate orig_DF to get into a DF that has columns <Feature1, Feature 2,...,Feature N, Label> --> I will call this derived DF ```ML_input DF`` moving forward. This DF is used to train a ML model
To get ML_input DF, I need to do some complex processing on each row in orig_DF. In particular, each row in orig_DF gets converted into multiple "rows" (number unknown before processing a row) in ML_input DF
Currently, I am doing (code below)
orig_df.iterrows() to loop through each row
Apply a function on each row. This returns a list.
Accumulate results from multiple rows into one list
Convert this list into ML_input DF after the loop ends
This works but I want speed this up by parallelizing the work on each row and accumulating the results. Would appreciate pointers from Pandas experts on how to do this. An example would be greatly appreciated
Current code is below.
Note: I have looked into using df.apply(). But two issues seem to be
apply in itself does not seem to parallelize things.
I don't how to make apply handle this one row converted to multiple row issue (any pointers here will also help)
Current code
def get_training_dataframe(dfin):
X = []
for index, row in dfin.iterrows():
ts_frame_dict = ast.literal_eval(row["sample_dictionary"])
for ts, frame in ts_frame_dict.items():
features = get_features(frame)
if features != None:
X += [features]
return pd.DataFrame(X, columns=FEATURE_NAMES)
It's difficult to know what optimizations are possible without having example data and without knowing what get_features() does.
The following code ought to be equivalent (I think) to your code, but it attempts to "vectorize" each step instead of performing it all within the for-loop. Perhaps that will offer you a chance to more easily measure the time taken by each step, and optimize the bottlenecks.
In particular, I wonder if it's faster to combine the calls to ast.literal_eval() into a single call. That's what I've done here, but I have no idea if it's truly faster.
I recommend trying line profiler if you can.
import ast
import pandas as pd
def get_training_dataframe(dfin):
frame_dicts = ast.literal_eval('[' + ','.join(dfin['sample_dictionary']) + ']')
frames = chain(*(d.values() for d in frame_dicts))
features = map(get_features, frames)
features = [f for f in features if f is not None]
return pd.DataFrame(features, columns=FEATURE_NAMES)
To summarize: How to perform groupby operations in parallel for a limited number of groups, but writing the result of each group apply function to disk?
My problem: I'm trying to create a supervised structure for regression models from information of a lot of clients separated into years. From the same clients I have to build different models, with different inputs X and labels Y, thus my idea is to create a single X and Y dataframe holding all variables at once, and slicing each one according to the task. For example, X could hold information from the salary, age or sex, but model 1 would use only age and sex, while model 2 only use salary.
As clients are not present every year, I can only use clients that are present from one period to the next one.
Instead of selecting the intersection of clients for each pair of contigous years, I'm trying to concatenate the whole information and performing groupby operations by client ID (and then filtering by year sequence, for example using the rows where the difference of periods are 1). The problem of using Dask for this task is that distributed workers are running low on memory (even after increasing the limit to 30Gb each). Note that for each group I'm creating a new dataframe, so I'm not reducing calculation to a single number per group, thus the memory intensive operation.
What I'm currently doing is performing a groupby operation, then iterating over the groupby object and writing to disk sequentially: for example like:
x_file=open('X.csv', 'w')
for name, group in concatenated_data.groupby('ID'):
data_x=my_func(group) # In my real code, my_func returns x and y dataframes
data_x.to_csv(x_file, header=None)
x_file.close()
which write the data sequentially applying my_func which selects the x and y for each group.
What I want is to perform the operation for a controlled number of groups (lets say 3 at the time), and writing the result of each group to disk (maybe with data_x.to_csv(x_file, single_file=True)).
Of course I can do the same for a dask dataframe, and iterate over the groupbpy object using get_group(), but I don't believe it will run in parallel while also keeping the memory on check.
EDIT: Example
# Lets say I have 3 csv files:
data=['./data_2016', './data_2017', './data_2018'] # Each file contains millions of rows (1 per client ID) and like 85 columns
# and certains variables
x_vars=['x1', 'x2', 'x3'] # x variables
y_vars= ['y1', 'y2', 'x1'] # note than some variables can be among x and y (like using today's salary to predict tomorrows salary)
data=[pd.read_csv(x) for x in data]
def func1(df_):
# do some preprocessing stuff
return df_
data=map(func1, data) # Some preprocessing and adding some columns (for example adding column for year)
concatenated_data=pd.concat(data, axis=1) # Big file, all clients from 2016-2018
def my_func(df_): # function applied above
# order by year
df_['Diff']=df_.year.diff() # calculating the difference among years
df['shifted']=df.Diff.shift(-1) # calculate shift of difference
# For exammple, *client z* may be on 2016 and 2018, thus his year difference is 2.
# I can't use *clien z* x_vars to predict y (only a single period ahead regression)
x=df_.loc[df_['shifted']==1, x_vars] # select only contigous years
y=df_.loc[df_['Diff']==1, y_vars] # the same, but a year ahead of x
return (x, y)
# ... Iteration over groupby object
Instead of using groupby() to reduce, I'm expanding the single, big file into an x and y dataframes, on which y holds information a period ahead of x.
As you can see, using a dask dataframe groupby (omitted for simplicity) would parallelize my_func operation, but as I understand would also wait until all operations nodes are completed, thus depleting my memory. What I would like is to perform my_func for certain groups (ideally as most as memory could hold), finish them, save to disk (without problems related to paralell saving) and finally proceed to the next batch of groups.
Maybe I can use some dask delayed objects, but I don't think it will make good use of my memory if a set the batches manually.
I'm not sure if this is what you are looking for
Generate data
import pandas as pd
import numpy as np
import dask.dataframe as dd
import os
n = 200
df = pd.DataFrame({"grp":np.random.choice(list("abcd"), n),
"x":np.random.randn(n),
"y":np.random.randn(n),
"z":np.random.randn(n)})
df.to_csv("file.csv", index=False)
# we will need later on
df.to_parquet("file.parquet", index=False)
Pandas solution
# we save our files on a given folder
fldr = "output1"
os.makedirs(fldr, exist_ok=True)
# we read the columns we need only
cols2read = ["grp", "x", "y"]
df = pd.read_csv("file.csv")
df = df[cols2read]
def write_file(x, fldr):
name = x["grp"].iloc[0]
x.to_csv(f"{fldr}/{name}.csv", index=False)
df.groupby("grp")\
.apply(lambda x: write_file(x, fldr))
Dask solution
This is basically the same but we need to add meta to our apply and the compute
# we save our files on a given folder
fldr = "output2"
os.makedirs(fldr, exist_ok=True)
# we read the columns we need only
cols2read = ["grp", "x", "y"]
df = pd.read_csv("file.csv")
df = df[cols2read]
def write_file(x, fldr):
name = x["grp"].iloc[0]
x.to_csv(f"{fldr}/{name}.csv", index=False)
df.groupby("grp")\
.apply(lambda x: write_file(x, fldr), meta='f8')\
.compute()
Working with parquet
Here I suggest you to work with parquet as it's going to be ways more efficient
cols2read = ["grp", "x", "y"]
df = dd.read_parquet("file.parquet",
columns=cols2read)
df.to_parquet("output3/",
partition_on="grp")
Inside output3 you can find several folders called grp=a and so on. And each off them could eventually contain several files. but you can read all of them with pd.read_parquet("output3/grp=a)
Having a tough time finding an example of this, but I'd like to somehow use Dask to drop pairwise correlated columns if their correlation threshold is above 0.99. I CAN'T use Pandas' correlation function as my dataset is too large, and it eats up my memory in a hurry. What I have now is a slow, double for loop that starts with the first column, and finds the correlation threshold between it and all the other columns one by one, and if it's above 0.99, drop that 2nd comparative column, then starts at the new second column, and so on and so forth, KIND OF like the solution found here, however this is unbearably slow doing this in an iterative form across all columns, although it's actually possible to run it and not run into memory issues.
I've read the API here, and see how to drop columns using Dask here, but need some assistance in getting this figured out. I'm wondering if there's a faster, yet memory friendly, way of dropping highly correlated columns in a Pandas Dataframe using Dask? I'd like to feed in a Pandas dataframe to the function, and have it return a Pandas dataframe after the correlation dropping is done.
Anyone have any resources I can check out, or have an example of how to do this?
Thanks!
UPDATE
As requested, here is my current correlation dropping routine as described above:
print("Checking correlations of all columns...")
cols_to_drop_from_high_corr = []
corr_threshold = 0.99
for j in df.iloc[:,1:]: # Skip column 0
try: # encompass the below in a try/except, cuz dropping a col in the 2nd 'for' loop below will screw with this
# original list, so if a feature is no longer in there from dropping it prior, it'll throw an error
for k in df.iloc[:,1:]: # Start 2nd loop at first column also...
# If comparing the same column to itself, skip it
if (j == k):
continue
else:
try: # second try/except mandatory
correlation = abs(df[j].corr(df[k])) # Get the correlation of the first col and second col
if correlation > corr_threshold: # If they are highly correlated...
cols_to_drop_from_high_corr.append(k) # Add the second col to list for dropping when round is done before next round.")
except:
continue
# Once we have compared the first col with all of the other cols...
if len(cols_to_drop_from_high_corr) > 0:
df = df.drop(cols_to_drop_from_high_corr, axis=1) # Drop all the 2nd highly corr'd cols
cols_to_drop_from_high_corr = [] # Reset the list for next round
# print("Dropped all cols from most recent round. Continuing...")
except: # Now, if the first for loop tries to find a column that's been dropped already, just continue on
continue
print("Correlation dropping completed.")
UPDATE
Using the solution below, I'm running into a few errors and due to my limited dask syntax knowledge, I'm hoping to get some insight. Running Windows 10, Python 3.6 and the latest version of dask.
Using the code as is on MY dataset (the dataset in the link says "file not found"), I ran into the first error:
ValueError: Exactly one of npartitions and chunksize must be specified.
So I specify npartitions=2 in the from_pandas, then get this error:
AttributeError: 'Array' object has no attribute 'compute_chunk_sizes'
I tried changing that to .rechunk('auto'), but then got error:
ValueError: Can not perform automatic rechunking with unknown (nan) chunk sizes
My original dataframe is in the shape of 1275 rows, and 3045 columns. The dask array shape says shape=(nan, 3045). Does this help to diagnose the issue at all?
I'm not sure if this help but maybe it could be a starting point.
Pandas
import pandas as pd
import numpy as np
url = "https://raw.githubusercontent.com/dylan-profiler/heatmaps/master/autos.clean.csv"
df = pd.read_csv(url)
# we check correlation for these columns only
cols = df.columns[-8:]
# columns in this df don't have a big
# correlation coefficient
corr_threshold = 0.5
corr = df[cols].corr().abs().values
# we take the upper triangular only
corr = np.triu(corr)
# we want high correlation but not diagonal elements
# it returns a bool matrix
out = (corr != 1) & (corr > corr_threshold)
# for every row we want only the True columns
cols_to_remove = []
for o in out:
cols_to_remove += cols[o].to_list()
cols_to_remove = list(set(cols_to_remove))
df = df.drop(cols_to_remove, axis=1)
Dask
Here I comment only the steps are different from pandas
import dask.dataframe as dd
import dask.array as da
url = "https://raw.githubusercontent.com/dylan-profiler/heatmaps/master/autos.clean.csv"
df = dd.read_csv(url)
cols = df.columns[-8:]
corr_threshold = 0.5
corr = df[cols].corr().abs().values
# with dask we need to rechunk
corr = corr.compute_chunk_sizes()
corr = da.triu(corr)
out = (corr != 1) & (corr > corr_threshold)
# dask is lazy
out = out.compute()
cols_to_remove = []
for o in out:
cols_to_remove += cols[o].to_list()
cols_to_remove = list(set(cols_to_remove))
df = df.drop(cols_to_remove, axis=1)
I wrote a code to concatenate parts of a DataFrame to the same DataFrame as to normalize the occurrence of rows as per a certain column.
import random
def normalize(data, expectation):
"""Normalize data by duplicating existing rows"""
counts = data[expectation].value_counts()
max_count = int(counts.max())
for tag, group in data.groupby(expectation, sort=False):
array = pandas.DataFrame(columns=data.columns.values)
i = 0
while i < (max_count // int(counts[tag])):
array = pandas.concat([array, group])
i += 1
i = max_count % counts[tag]
if i > 0:
array = pandas.concat([array, group.ix[random.sample(group.index, i)]])
data = pandas.concat([data, array])
return data
and this is unbelievably slow. Is there a way to fast concatenate DataFrame without creating copies of it?
There are a couple of things that stand out.
To begin with, the loop
i = 0
while i < (max_count // int(counts[tag])):
array = pandas.concat([array, group])
i += 1
is going to be very slow. Pandas is not built for these dynamic concatenations, and I suspect the performance is quadratic for what you're doing.
Instead, perhaps you could try
pandas.concat([group] * (max_count // int(counts[tag]))
which just creates a list first, and then calls concat for a one-shot concatenation on the entire list. This should bring the complexity to being linear, and I suspect it will have lower constants in any case.
Another thing which would reduce these small concats is calling groupby-apply. Instead of iterating over the result of groupby, write the loop body as a function, and call apply on it. Let Pandas figure out best how to concat all of the results into a single DataFrame.
However, even if you prefer to keep the loop, I'd just append things into a list, and just concat everything at the end:
stuff = []
for tag, group in data.groupby(expectation, sort=False):
# Call stuff.append for any DataFrame you were going to concat.
pandas.concat(stuff)
I have a huge CSV with many tables with many rows. I would like to simply split each dataframe into 2 if it contains more than 10 rows.
If true, I would like the first dataframe to contain the first 10 and the rest in the second dataframe.
Is there a convenient function for this? I've looked around but found nothing useful...
i.e. split_dataframe(df, 2(if > 10))?
I used a List Comprehension to cut a huge DataFrame into blocks of 100'000:
size = 100000
list_of_dfs = [df.loc[i:i+size-1,:] for i in range(0, len(df),size)]
or as generator:
list_of_dfs = (df.loc[i:i+size-1,:] for i in range(0, len(df),size))
This will return the split DataFrames if the condition is met, otherwise return the original and None (which you would then need to handle separately). Note that this assumes the splitting only has to happen one time per df and that the second part of the split (if it is longer than 10 rows (meaning that the original was longer than 20 rows)) is OK.
df_new1, df_new2 = df[:10, :], df[10:, :] if len(df) > 10 else df, None
Note you can also use df.head(10) and df.tail(len(df) - 10) to get the front and back according to your needs. You can also use various indexing approaches: you can just provide the first dimensions index if you want, such as df[:10] instead of df[:10, :] (though I like to code explicitly about the dimensions you are taking). You can can also use df.iloc and df.ix to index in similar ways.
Be careful about using df.loc however, since it is label-based and the input will never be interpreted as an integer position. .loc would only work "accidentally" in the case when you happen to have index labels that are integers starting at 0 with no gaps.
But you should also consider the various options that pandas provides for dumping the contents of the DataFrame into HTML and possibly also LaTeX to make better designed tables for the presentation (instead of just copying and pasting). Simply Googling how to convert the DataFrame to these formats turns up lots of tutorials and advice for exactly this application.
There is no specific convenience function.
You'd have to do something like:
first_ten = pd.DataFrame()
rest = pd.DataFrame()
if df.shape[0] > 10: # len(df) > 10 would also work
first_ten = df[:10]
rest = df[10:]
A method based on np.split:
df = pd.DataFrame({ 'A':[2,4,6,8,10,2,4,6,8,10],
'B':[10,-10,0,20,-10,10,-10,0,20,-10],
'C':[4,12,8,0,0,4,12,8,0,0],
'D':[9,10,0,1,3,np.nan,np.nan,np.nan,np.nan,np.nan]})
listOfDfs = [df.loc[idx] for idx in np.split(df.index,5)]
A small function that uses a modulo could take care of cases where the split is not even (e.g. np.split(df.index,4) will throw an error).
(Yes, I am aware that the original question was somewhat more specific than this. However, this is supposed to answer the question in the title.)
Below is a simple function implementation which splits a DataFrame to chunks and a few code examples:
import pandas as pd
def split_dataframe_to_chunks(df, n):
df_len = len(df)
count = 0
dfs = []
while True:
if count > df_len-1:
break
start = count
count += n
#print("%s : %s" % (start, count))
dfs.append(df.iloc[start : count])
return dfs
# Create a DataFrame with 10 rows
df = pd.DataFrame([i for i in range(10)])
# Split the DataFrame to chunks of maximum size 2
split_df_to_chunks_of_2 = split_dataframe_to_chunks(df, 2)
print([len(i) for i in split_df_to_chunks_of_2])
# prints: [2, 2, 2, 2, 2]
# Split the DataFrame to chunks of maximum size 3
split_df_to_chunks_of_3 = split_dataframe_to_chunks(df, 3)
print([len(i) for i in split_df_to_chunks_of_3])
# prints [3, 3, 3, 1]
If you have a large data frame and need to divide into a variable number of sub data frames rows, like for example each sub dataframe has a max of 4500 rows, this script could help:
max_rows = 4500
dataframes = []
while len(df) > max_rows:
top = df[:max_rows]
dataframes.append(top)
df = df[max_rows:]
else:
dataframes.append(df)
You could then save out these data frames:
for _, frame in enumerate(dataframes):
frame.to_csv(str(_)+'.csv', index=False)
Hope this helps someone!
def split_and_save_df(df, name, size, output_dir):
"""
Split a df and save each chunk in a different csv file.
Parameters:
df : pandas df to be splitted
name : name to give to the output file
size : chunk size
output_dir : directory where to write the divided df
"""
import os
for i in range(0, df.shape[0],size):
start = i
end = min(i+size-1, df.shape[0])
subset = df.loc[start:end]
output_path = os.path.join(output_dir,f"{name}_{start}_{end}.csv")
print(f"Going to write into {output_path}")
subset.to_csv(output_path)
output_size = os.stat(output_path).st_size
print(f"Wrote {output_size} bytes")
You can use the DataFrame head and tail methods as syntactic sugar instead of slicing/loc here. I use a split size of 3; for your example use headSize=10
def split(df, headSize) :
hd = df.head(headSize)
tl = df.tail(len(df)-headSize)
return hd, tl
df = pd.DataFrame({ 'A':[2,4,6,8,10,2,4,6,8,10],
'B':[10,-10,0,20,-10,10,-10,0,20,-10],
'C':[4,12,8,0,0,4,12,8,0,0],
'D':[9,10,0,1,3,np.nan,np.nan,np.nan,np.nan,np.nan]})
# Split dataframe into top 3 rows (first) and the rest (second)
first, second = split(df, 3)
The method based on list comprehension and groupby, which stores all the split dataframes in a list variable and can be accessed using the index.
Example:
ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]***
ans[0]
ans[0].column_name