How to select n equally spaced rows from Dask dataframe? - python

I have a number of parquet files, where all of the chunks together are too big to fit into memory. I would like to load them into a Dask dataframe, compute some results (a cumsum) and then display the cumsum as a plot. For this reason I wanted to select an equally spaced subset of the data (some k rows) from the cumsum result, and then plot this subset. How would I do that?

You could try:
slices = 10  # or however many slices you need
slice_point = int(df.shape[0] / slices)  # note: with Dask, df.shape[0] is lazy; len(df) computes it
for i in range(slices):
    current_sliced_df = df.loc[i * slice_point:(i + 1) * slice_point]
    # do whatever you want with the current slice

I think that using df[series].sample(...) (see the docs) would allow you to avoid hand-coding the selection of a representative subset of rows.
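A rough sketch of the equally-spaced approach, shown here with plain pandas and a made-up column name value; with Dask you would typically .compute() the cumsum first, since a single cumsum column usually fits in memory even when the full table does not:

```python
import numpy as np
import pandas as pd

# toy stand-in for the real parquet data; "value" is a made-up column name
df = pd.DataFrame({"value": np.random.rand(100_000)})
cumsum = df["value"].cumsum()

k = 500                              # number of points you want to plot
step = max(len(cumsum) // k, 1)      # spacing between selected rows
subset = cumsum.iloc[::step]         # every step-th row, equally spaced

print(len(subset))
```

The same `[::step]` slicing works on the materialized result of a Dask cumsum; only the subset ever needs to reach the plotting library.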

Related

How to select a range of columns when using the replace function on a large dataframe?

I have a large dataframe that consists of around 19,000 rows and 150 columns. Many of these columns contain values of -1 and -2. When I try to replace the -1s and -2s with 0 using the following code, Jupyter times out on me and says there is no memory left. So, I am curious whether you can select a range of columns and apply the replace function to each batch. This way I can replace in batches, since I can't seem to replace in one pass with my available memory.
Here is the code I tried, which timed out when first replacing the -2s:
df.replace(to_replace=-2, value="0")
Thank you for any guidance!
Sean
Let's say you want to divide your columns into chunks of 10; then you could try something like this:
division_num = 10
columns = your_df.columns
for start in range(0, len(columns), division_num):
    cols = columns[start:start + division_num]
    # assign the result back: replace() returns a copy unless inplace=True
    your_df[cols] = your_df[cols].replace(to_replace=-2, value="0")
If your memory still overflows, you can try using .loc to divide the data by rows instead of columns.
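A row-wise sketch of that idea, with a toy frame and an arbitrary chunk size:

```python
import numpy as np
import pandas as pd

# toy frame containing the -1/-2 sentinel values
df = pd.DataFrame(np.random.choice([-2, -1, 0, 1], size=(1000, 5)))

chunk_rows = 250
for start in range(0, len(df), chunk_rows):
    block = df.iloc[start:start + chunk_rows]
    # replace both sentinels in this row chunk, then write the chunk back
    df.iloc[start:start + chunk_rows] = block.replace([-1, -2], 0)

print(int((df < 0).sum().sum()))
```

Each iteration only materializes one chunk's worth of replaced values, which keeps peak memory proportional to the chunk size rather than the whole frame.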

Iteratively subset and calculate means of pandas DataFrame

I have a rather large DataFrame (~30k rows, ~30k columns), for which I am trying to iteratively create two subsets based on every columns values, and store the ratio arrays for each column:
for col in df.columns:
    high_subset = df.query(f"{col} > {cutoff_vals['high'][col]}").mean(axis=0)
    low_subset = df.query(f"{col} < {cutoff_vals['low'][col]}").mean(axis=0)
    ratios = high_subset / low_subset
    # store_ratios_for_col
I have the low_cutoff and high_cutoff values precomputed and stored in a dictionary cutoff_vals. I would like to be able to store the ratio array for every column, which should result in an NxN array of ratios (N == number of columns).
Is there a more efficient method to iterate through the columns, subset them, and perform math/comparisons on the results Series?
I understand that using something like Dask or Ray-project may help, but thought there may be a clever vectorization or built in pandas trick first.
Use gt to compare all the columns, then .where to mask:
cutoffs = pd.DataFrame(cutoff_vals)
highs = df.where(df.gt(cutoffs['high'])).mean()
lows = df.where(df.lt(cutoffs['low'])).mean()
# ratios for all columns; get any single one with ratios[col_name]
ratios = highs / lows
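A self-contained sketch of this masking approach, with toy data and made-up cutoff values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 3)), columns=["a", "b", "c"])
cutoff_vals = {
    "high": {"a": 0.7, "b": 0.6, "c": 0.8},   # made-up cutoffs
    "low": {"a": 0.3, "b": 0.4, "c": 0.2},
}

cutoffs = pd.DataFrame(cutoff_vals)
highs = df.where(df.gt(cutoffs["high"])).mean()   # mean of values above each column's high cutoff
lows = df.where(df.lt(cutoffs["low"])).mean()     # mean of values below each column's low cutoff
ratios = highs / lows

print(ratios.index.tolist())
```

df.gt(series) aligns the Series index against the column labels, so each column is compared against its own cutoff in one vectorized pass instead of a per-column query() loop.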

Python scatter matrices from dataframe with too many columns

I am new to python and data science, and I am currently working on a project that is based on a very large dataframe, with 75 columns. I am doing some data exploration and I would like to check for possible correlations between the columns. For smaller dataframes I know I could use pandas plotting.scatter_matrix() on the dataframe in order to do so. However, in my case this produces a 75x75 matrix -- and I can't even visualize the individual plots.
An alternative would be creating lists of 5 columns and using scatter_matrix multiple times, but this method would produce too many scatter matrices. For instance, with 15 columns this would be:
import pandas as pd
df = pd.read_csv('dataset.csv')
list1 = df.columns[0:5]
list2 = df.columns[5:10]
list3 = df.columns[10:15]
pd.plotting.scatter_matrix(df[list1])
pd.plotting.scatter_matrix(df[list2])
pd.plotting.scatter_matrix(df[list3])
In order to use this same method with 75 columns, I'd have to go on until list15. This looks very inefficient. I wonder if there would be a better way to explore correlations in my dataset.
The problem here is, to a lesser extent, the technical part. Producing the plots (5,625 of them) will take quite a long time, and the plots will also take up a fair amount of memory.
So I would ask a few questions to get around the problems:
Is it really necessary to have all these scatter plots?
Can I reduce the dimensionality in advance?
Why do I have such a high number of dimensions?
If the plots are really useful, you could produce them yourself and stitch them together, or wait for the function to finish.
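If the underlying goal is exploring correlations rather than the scatter plots themselves, a single call to df.corr() is far cheaper than 5,625 panels. A sketch with random stand-in data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((200, 75)))   # stand-in for the real 75-column data

corr = df.corr()                            # 75 x 75 Pearson correlation matrix

# rank column pairs by absolute correlation instead of eyeballing every panel
mask = ~np.eye(len(corr), dtype=bool)       # drop the trivial diagonal (corr == 1)
pairs = corr.where(mask).abs().unstack().dropna()
top = pairs.sort_values(ascending=False).head(10)

print(corr.shape)
```

Only the handful of strongly correlated pairs found this way would then need individual scatter plots.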

Speed up iteration over DataFrame items

I wrote a function in which each cell of a DataFrame is divided by a number saved in another dataframe.
from tqdm import tqdm

def calculate_dfA(df_t, xout):
    df_A = df_t.copy()
    vector_x = xout.T
    # assumes a default RangeIndex, so labels and positions coincide
    for index_col, column in tqdm(df_A.items()):   # iteritems() is deprecated
        for index_row, row in df_A.iterrows():
            df_A.iloc[index_row, index_col] = df_A.iloc[index_row, index_col] / vector_x.iloc[0, index_col]
    return df_A
The DataFrame on which I apply the calculation has a size of 14839 rows x 14839 columns. According to tqdm the processing speed is roughly 4.5 s/it. Accordingly, the calculation would require approximately 50 days, which is not feasible for me. Is there a way to speed up my calculation?
You need to vectorize your division:
result = df_A.values/vector_x
This will broadcast along the row dimension and divide along the column dimension, as you seem to ask for.
Compared to your double for-loop, you are taking advantage of contiguity and homogeneity of the data in memory. This allows for a massive speedup.
Edit: Coming back to this answer today, I noticed that converting to a NumPy array first speeds up the computation. Locally I get a 10x speedup for an array of a size similar to the one in the question above. I have edited my answer accordingly.
I'm on mobile now, but you should try to avoid every for loop in Python - there's always a better way.
For one, I know you can divide a pandas column (Series) by another column element-wise to get your desired result.
I think that to divide every column by the matching column of another DataFrame you would still need to iterate (but only with one for loop => performance boost).
I would strongly recommend temporarily converting to a NumPy ndarray and working with that.
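A small sketch of the vectorized division described above, with toy sizes in place of the real 14839 x 14839 frames:

```python
import numpy as np
import pandas as pd

df_A = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4))
vector_x = pd.DataFrame([[1.0, 2.0, 4.0, 8.0]])   # one divisor per column

# NumPy broadcasts the (1, 4) divisor row across every row of the (3, 4) array
result = pd.DataFrame(df_A.values / vector_x.values,
                      index=df_A.index, columns=df_A.columns)

print(result.iloc[0, 1])   # 1.0 / 2.0
```

Dropping to .values sidesteps pandas' per-cell indexing entirely, which is where the double for-loop loses its 50 days.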

Computation between two large columns in a Pandas Dataframe

I have a dataframe that has 2 columns of zip codes, and I would like to add another column with their distance values. I am able to do this with a fairly low number of rows, but I am now working with a dataframe that has about 500,000 rows. The code I have works, but on my current dataframe it has been running for about 30 minutes without completing, so I suspect what I'm doing is extremely inefficient.
Here is the code
import pgeocode

dist = pgeocode.GeoDistance('us')

def distance_pairing(start, end):
    return dist.query_postal_code(start, end)

zips['distance'] = zips.apply(lambda x: distance_pairing(x['zipstart'], x['zipend']), axis=1)
zips
I know looping is out of the question, so is there something else I can do, efficiency wise that would make this better?
Whenever possible, use vectorized operations in pandas and numpy. In this case:
zips['distance'] = dist.query_postal_code(
    zips['zipstart'].values,
    zips['zipend'].values,
)
This won't always work, but in this case, the underlying pgeocode.haversine function is written (in numpy) to accommodate arrays of x and y coordinates. This should speed up your code by several orders of magnitude for a dataframe of this size.
