I have a number of parquet files, and all of the chunks together are too big to fit into memory. I would like to load them into a dask dataframe, compute some results (a cumsum), and then display the cumsum as a plot. For this I wanted to select an equally spaced subset of the data (some k rows) from the cumsum result, and then plot this subset. How would I do that?
You could try:

slices = 10  # or whatever
slice_point = len(df) // slices   # len() also works on a dask dataframe
for i in range(slices):
    current_sliced_df = df.loc[i * slice_point:(i + 1) * slice_point]

and do whatever you want with the current slice.
I think that using df[series].sample(...) (see the docs) would let you avoid writing your own logic for selecting a representative subset of rows.
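Putting the two suggestions together, a rough end-to-end sketch could look like this (the path data/*.parquet, the column name value, and the target size k are placeholders for your own data; note that dask's sample() takes a fraction rather than a row count):

import dask.dataframe as dd
import matplotlib.pyplot as plt

ddf = dd.read_parquet('data/*.parquet')     # lazy: nothing is loaded into memory yet
cumsum = ddf['value'].cumsum()              # still lazy

k = 10_000                                  # how many points to plot

# Option 1: a random, roughly representative subset via sample()
# (len(ddf) triggers a cheap row-count computation)
subset = cumsum.sample(frac=k / len(ddf)).compute().sort_index()

# Option 2: equally spaced rows - compute the cumsum (a single column usually fits
# in memory even when the whole dataframe does not) and keep every n-th row
full = cumsum.compute()
subset = full.iloc[::max(len(full) // k, 1)]

subset.plot()
plt.show()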
I have a large dataframe that consists of around 19,000 rows and 150 columns. Many of these columns contain -1s and -2s. When I try to replace the -1s and -2s with 0 using the following code, Jupyter times out on me and says there is no memory left. So I am curious whether you can select a range of columns and apply the replace function to just that range. That way I could replace in batches, since I can't seem to replace everything in one pass with my available memory.
Here is the code I tried that timed out on me when first replacing the -2s:

df.replace(to_replace=-2, value="0")
Thank you for any guidance!
Sean
Let's say you want to divide your columns into chunks of 10; then you could try something like this:

columns = your_df.columns
division_num = 10
chunks_num = -(-len(columns) // division_num)   # ceiling division, so a trailing partial chunk is included
index = 0
for i in range(chunks_num):
    chunk = columns[index: index + division_num]
    # replace() returns a new frame, so assign the result back to keep the changes
    your_df[chunk] = your_df[chunk].replace(to_replace=-2, value="0")
    index += division_num

If your memory still overflows, you can try the loc/iloc indexers to divide the data by rows instead of columns, as sketched below.
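For completeness, a row-chunk variant could look roughly like this (a sketch only; the chunk size of 5,000 rows is an arbitrary choice, and a numeric 0 is used so the columns stay numeric instead of being upcast to object dtype):

chunk_rows = 5000
for start in range(0, len(your_df), chunk_rows):
    block = slice(start, start + chunk_rows)
    # replace -1 and -2 in this block of rows and write the block back in place
    your_df.iloc[block] = your_df.iloc[block].replace(to_replace=[-1, -2], value=0)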
I have a rather large DataFrame (~30k rows, ~30k columns), for which I am trying to iteratively create two subsets based on every column's values, and store the ratio arrays for each column:
ratios_by_col = {}
for col in df.columns:
    high_subset = df.query(f"{col} > {cutoff_vals['high'][col]}").mean(axis=0)
    low_subset = df.query(f"{col} < {cutoff_vals['low'][col]}").mean(axis=0)
    # store the ratio Series for this column
    ratios_by_col[col] = high_subset / low_subset
I have the low_cutoff and high_cutoff values precomputed and stored in a dictionary cutoff_vals. I would like to be able to store the ratio array for every column, which should result in an NxN array of ratios (N == number of columns).
Is there a more efficient way to iterate through the columns, subset them, and perform math/comparisons on the resulting Series?
I understand that something like Dask or Ray may help, but I thought there might be a clever vectorization or built-in pandas trick first.
Use gt to compare all the columns, then .where to mask:

cutoffs = pd.DataFrame(cutoff_vals)
highs = df.where(df.gt(cutoffs['high'])).mean()
lows = df.where(df.lt(cutoffs['low'])).mean()

# ratios for all columns; get any single one with ratios[col_name]
ratios = highs / lows
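For reference, this assumes cutoff_vals is a nested dict keyed first by 'high'/'low' and then by column name, matching the cutoff_vals['high'][col] lookups in the question, for example:

# hypothetical cutoffs for two columns
cutoff_vals = {
    'high': {'colA': 0.9, 'colB': 1.2},
    'low':  {'colA': 0.1, 'colB': 0.3},
}
# pd.DataFrame(cutoff_vals) then has the column names as its index and
# 'high'/'low' as its columns, so cutoffs['high'] aligns with df's columns in gt/lt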
I am new to Python and data science, and I am currently working on a project based on a very large dataframe with 75 columns. I am doing some data exploration and would like to check for possible correlations between the columns. For smaller dataframes I know I could use pandas' plotting.scatter_matrix() to do so. However, in my case this produces a 75x75 matrix, and I can't even visualize the individual plots.
An alternative would be creating lists of 5 columns and using scatter_matrix multiple times, but this method would produce too many scatter matrices. For instance, with 15 columns this would be:
import pandas as pd

df = pd.read_csv('dataset.csv')
list1 = [df.columns[i] for i in range(5)]
list2 = [df.columns[i + 5] for i in range(5)]
list3 = [df.columns[i + 10] for i in range(5)]
pd.plotting.scatter_matrix(df[list1])
pd.plotting.scatter_matrix(df[list2])
pd.plotting.scatter_matrix(df[list3])
In order to use this same method with 75 columns, I'd have to go on until list15. This looks very inefficient. I wonder if there would be a better way to explore correlations in my dataset.
The problem here is, to a lesser extent, the technical part. Producing the plots (5625 of them, since 75 x 75 = 5625) will take quite a long time, and they will also take up a fair amount of memory.
So I would ask a few questions to get around the problems:
Is it really necessary to have all these scatter plots?
Can I reduce the dimensionality in advance?
Why do I have such a high number of dimensions?
If the plots are really useful, you could produce them yourself in chunks and stitch them together (a sketch follows below), or just wait until the function finishes.
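A chunked version along the lines of the lists in the question might look like this (only a sketch: it plots the pairs within each block of 5 columns, not the cross-block pairs, and the output file names are made up):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('dataset.csv')
chunk = 5
for start in range(0, len(df.columns), chunk):
    cols = df.columns[start:start + chunk]
    axes = pd.plotting.scatter_matrix(df[cols], figsize=(8, 8))
    fig = axes[0, 0].get_figure()
    fig.savefig(f'scatter_{start}.png')   # hypothetical output file
    plt.close(fig)                        # free the figure before the next chunk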
I wrote a function in which each cell of a DataFrame is divided by a number saved in another dataframe.
from tqdm import tqdm

def calculate_dfA(df_t, xout):
    df_A = df_t.copy()
    vector_x = xout.T
    for index_col, column in tqdm(df_A.iteritems()):
        for index_row, row in df_A.iterrows():
            df_A.iloc[index_row, index_col] = df_A.iloc[index_row, index_col] / vector_x.iloc[0, index_col]
    return df_A
The DataFrame on which I apply the calculation has a size of 14839 rows x 14839 columns. According to tqdm, the processing speed is roughly 4.5 s/it. At that rate, the calculation would take approximately 50 days, which is not feasible for me. Is there a way to speed up my calculation?
You need to vectorize your division:
result = df_A.values / vector_x
This will broadcast along the row dimension and divide along the column dimension, as you seem to ask for.
Compared to your double for-loop, you are taking advantage of contiguity and homogeneity of the data in memory. This allows for a massive speedup.
Edit: Coming back to this answer today, I noticed that converting to a numpy array first speeds up the computation. Locally I get a 10x speedup for an array of a size similar to the one in the question above. I have edited my answer accordingly.
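Applied to the function from the question, a minimal sketch could look like this (assuming xout holds exactly one divisor per column of df_t):

import numpy as np
import pandas as pd

def calculate_dfA(df_t, xout):
    # one divisor per column, shaped (1, n_cols) so it broadcasts over the rows
    vector_x = np.asarray(xout).reshape(1, -1)
    result = df_t.values / vector_x
    # wrap the numpy result back into a DataFrame with the original labels
    return pd.DataFrame(result, index=df_t.index, columns=df_t.columns)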
I'm on mobile now, but you should try to avoid every for loop in Python; there's almost always a better way.
For one, I know you can multiply a pandas column (Series) by another column to get your desired result.
I think that to multiply every column by the matching column of another DataFrame you would still need to iterate, but only with a single for loop, which is already a performance boost.
I would strongly recommend that you temporarily convert to a numpy ndarray and work with that; a pandas-native alternative is sketched below.
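As a concrete illustration of the column-matching idea, a sketch (assuming xout is a Series with one divisor per column of df_A, labelled by the same column names):

# pandas-native, fully vectorized: each column of df_A is divided by
# the xout entry carrying the same label
result = df_A.div(xout, axis='columns')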
I have a dataframe that has 2 columns of zipcodes, and I would like to add another column with their distance values. I am able to do this with a fairly low number of rows, but I am now working with a dataframe that has about 500,000 rows. The code I have works, but on my current dataframe it has been running for about 30 minutes with no completion in sight, so I feel that what I'm doing is extremely inefficient.
Here is the code:

import pgeocode

dist = pgeocode.GeoDistance('us')

def distance_pairing(start, end):
    return dist.query_postal_code(start, end)

zips['distance'] = zips.apply(lambda x: distance_pairing(x['zipstart'], x['zipend']), axis=1)
zips
I know looping is out of the question, so is there something else I can do, efficiency-wise, that would make this better?
Whenever possible, use vectorized operations in pandas and numpy. In this case:

zips['distance'] = dist.query_postal_code(
    zips['zipstart'].values,
    zips['zipend'].values,
)
This won't always work, but in this case, the underlying pgeocode.haversine function is written (in numpy) to accommodate arrays of x and y coordinates. This should speed up your code by several orders of magnitude for a dataframe of this size.
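A self-contained version for reference (the zip codes below are made-up examples; pgeocode returns the distances in kilometres):

import pandas as pd
import pgeocode

zips = pd.DataFrame({
    'zipstart': ['10001', '94105'],
    'zipend':   ['60601', '98101'],
})

dist = pgeocode.GeoDistance('us')
zips['distance'] = dist.query_postal_code(
    zips['zipstart'].values,
    zips['zipend'].values,
)
print(zips)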