Odd dropping of pandas rows based on conditions - python

I use the function:
def df_proc(df, n):
    print(list(df.lab).count(0))   # control label to see if it changes after conditional dropping
    print('C:', list(df.lab).count(1))

    df = df.drop(df[df.lab.eq(0)].sample(n).index)

    print(list(df.lab).count(0))
    print('C:', list(df.lab).count(1))

    return df
to drop pandas rows based on certain conditions (where df.lab == 0). This works fine on a small df (e.g. n = 100); however, when I increase the number of rows in the df something odd happens: the counts of the other labels (!= 0) also begin to decrease and are affected by the condition.
For example:
# dummy example:
import random
import pandas as pd

list2 = [random.randrange(0, 6, 1) for i in range(1500000)]
list1 = [random.randrange(0, 100, 1) for i in range(1500000)]

dft = pd.DataFrame(list(zip(list1, list2)), columns=['A', 'lab'])
dftest = df_proc(dft, 100000)
gives...
249797
C: 249585
149797
C: 249585
But when I run this on my actual df:
dftest = df_proc(S1,100000)
I get a change in my control labels which is weird.
467110
C: 70434
260616
C: 49395
I'm not sure where the error could have come from. I have tried using frac and df.query('lab == 0') but still run into the same error. The other thing I noticed is that with a small n the control labels are unchanged; it's only when I increase n that they change.
dftest = df_proc(S1,1)
gives:
467110
C: 70434
467107
C: 70434
Which doesn't add up, as 3 samples have been removed, not 1.

If it's only about filtering, why not use:
dft = dft[dft['lab'] != 0]
This will filter out all rows with lab=0.

The error was that drop eliminates rows based on index labels; my df was a concatenation of several dataframes, so the index contained duplicate labels, and I had to use reset_index to overcome the problem.
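For illustration, a minimal sketch (with made-up toy frames standing in for the concatenated pieces) of why duplicate index labels make drop remove extra rows, and how reset_index fixes it:
import pandas as pd

df_a = pd.DataFrame({'lab': [0, 1, 2]})
df_b = pd.DataFrame({'lab': [3, 4, 5]})

s1 = pd.concat([df_a, df_b])        # index is 0, 1, 2, 0, 1, 2 (duplicate labels)
print(len(s1.drop(index=[0])))      # 4 -> both rows labelled 0 are dropped, not one

s1 = s1.reset_index(drop=True)      # index is 0..5, labels unique again
print(len(s1.drop(index=[0])))      # 5 -> exactly one row dropped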

Related

How to compute occurrences of a specific value and its percentage for each column based on a condition in a pandas dataframe?

I have the following dataframe df, in which I highlighted in green the cells with values of interest:
[image: the dataframe, with the cells of interest highlighted in green]
and I would like to obtain, for each column (therefore considering the whole dataframe), the following statistics: the count of values less than or equal to 0.5 (the green cells in the dataframe), NaN values excluded, and its percentage within the column, so that I can use, say, 50% as a benchmark.
I tried value_counts, like (df['A'].value_counts()/df['A'].count())*100, but this returns a partial result, not in the form I want, and only for specific columns; I also thought about using filter or a lambda function like df.loc[lambda x: x <= 0.5], but clearly that is not the result I wanted.
The goal/output will be a dataframe, as shown below, displaying just the columns that "beat" the benchmark (recall: at least half, i.e. 50%, of their values <= 0.5).
[image: the expected output dataframe]
e.g. in column A the count would be 2 and the percentage 2/3 * 100 = 66%, while in column B the count would be 4 and the percentage 4/8 * 100 = 50% (the same goes for columns X, Y and Z). On the other hand, column C, where 2/8 * 100 = 25%, doesn't beat the benchmark and is therefore not included in the output.
Is there a suitable way to achieve this? Apologies in advance if this is a somewhat duplicated question, but I found no other questions able to help me out, and thanks to any saviour.
I believe I have understood your ask in the code below...
It would be good if you could provide an expected output in your question so that it is easier to follow.
Anyway, the first part of the code below is just set-up and can be ignored, as you already have your data.
Basically, I have created a quick function for you that returns the percentage of values under a threshold that you can define.
This function is called in a loop over all the columns of your dataframe, and if the percentage is more than the output threshold (again, you can define it) the column is kept for the actual output.
import pandas as pd
import numpy as np
import random
import datetime

### SET UP ###
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(10)]

def rand_num_list(length):
    peak = [round(random.uniform(0, 1), 1) for i in range(length)] + [0] * (10 - length)
    random.shuffle(peak)
    return peak

df = pd.DataFrame(
    {
        'A': rand_num_list(3),
        'B': rand_num_list(5),
        'C': rand_num_list(7),
        'D': rand_num_list(2),
        'E': rand_num_list(6),
        'F': rand_num_list(4)
    },
    index=date_list
)

df = df.replace({0: np.nan})
##############

print(df)

def less_than_threshold(thresh_df, thresh_col, threshold):
    if len(thresh_df[thresh_col].dropna()) == 0:
        return 0
    return len(thresh_df.loc[thresh_df[thresh_col] <= threshold]) / len(thresh_df[thresh_col].dropna())

output_dict = {'cols': []}
col_threshold = 0.5
output_threshold = 0.5

for col in df.columns:
    if less_than_threshold(df, col, col_threshold) >= output_threshold:
        output_dict['cols'].append(col)

df_output = df.loc[:, output_dict.get('cols')]
print(df_output)
Hope this achieves your goal!
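If you prefer to avoid the explicit column loop, here is a minimal vectorized sketch of the same computation, assuming the df and the two thresholds defined above; comparisons against NaN are False, so le() only counts non-NaN values.
col_threshold = 0.5
output_threshold = 0.5

# fraction of non-NaN values <= col_threshold, per column
share_below = df.le(col_threshold).sum() / df.notna().sum()

# keep only the columns that meet the output threshold
df_output = df.loc[:, share_below >= output_threshold]
print(df_output)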

Python - Sampling rows from a data frame without replacement

I want to sample rows from a pandas data frame without replacement. What I mean is this. In each iteration of the for loop, I sample a certain number of rows from COMBINED without replacement. I want to ensure that over 50,000 iterations, I do not ever sample the same row again. My code below tries to solve this sampling problem, but I get errors.
COMBINED, TEMP, MERGED, SAMPLE, SAMPLE_2 and PROBABILITY_GENERATED_POISSON are data frames; lst is a list.
Please see my code below:
#FOR LOOP TO SAMPLE FROM COMBINED BASED ON NUMBER OF EVENTS PER YEAR
#AVOIDING REPEATED SAMPLING OF SAME EVENTS
for i in range(50000):

    #IF THERE ARE NO EVENTS FOR THAT PARTICULAR YEAR, THERE WILL BE NO EVENT NUMBER AND NO LOSS
    if PROBABILITY_GENERATED_POISSON.iloc[i,:].item == 0:
        lst.append(0)

    #IF THERE ARE MORE THAN 0 EVENTS FOR THAT YEAR, FOLLOW THE BELOW PROCESS
    else:
        SAMPLE = COMBINED.sample(n = PROBABILITY_GENERATED_POISSON.iloc[i,:],
                                 replace = False,
                                 weights = LOSS_EVENT_SAMPLE_PROBABILITY,
                                 axis = 0)
        SAMPLE['Sample'] = i

        #CREATE TEMP DATA FRAME WHICH CONSISTS OF ALL ROWS SAMPLED IN PREVIOUS ITERATIONS
        #except FUNCTION IS FOR ERROR HANDLING - IT PREVENTS THE LOOP FROM STOPPING MIDWAY
        try:
            TEMP = pd.DataFrame(lst)

            #PERFORM AN INNER JOIN - SELECTING COMMON ROWS FROM TEMP AND SAMPLE
            MERGED = TEMP.merge(SAMPLE, how = "inner")

            #AVOIDING DUPLICATION WITHIN LIST
            #IF THERE ARE NO COMMON ROWS (nrow(MERGED) == 0), THEN INPUT SAMPLE INTO lst
            if MERGED.shape[0] == 0:
                lst.append(SAMPLE)
            else:
                #IF THERE ARE COMMON ROWS (nrow(MERGED) > 0), THEN SAMPLE AGAIN, BUT AFTER EXCLUDING THE COMMON ROWS FROM
                #THE COMBINED DATA FRAME. BY EXCLUDING THE COMMON ROWS, WE ENSURE THAT WE ARE NOT SAMPLING ROWS WHICH
                #WERE SAMPLED IN PREVIOUS ITERATIONS.
                COMBINED_2 = COMBINED.subtract(SAMPLE)
                SAMPLE_2 = COMBINED_2.sample(n = PROBABILITY_GENERATED_POISSON.iloc[i,:],
                                             replace = False,
                                             weights = LOSS_EVENT_SAMPLE_PROBABILITY,
                                             axis = 0)
                SAMPLE_2['Sample'] = i
                lst.append(SAMPLE_2)
        except:
            continue

    print(i)
The error I get is attached with the picture.
I would like some feedback on my question.
Thank you.
Here are two ways to solve this:
solution using pandas .sample function
n = 50000
COMBINED.sample(n, replace=False)
solution using a simple algorithm that does the same thing as .sample()
# use the diamonds dataset to illustrate and test the algorithm
import seaborn as sns
import pandas as pd

df_input = sns.load_dataset('diamonds')
df = df_input.loc[[]]              # empty frame with the same columns, collects the samples
df_temp = df_input.copy()          # this is where we're sampling from (copy so df_input stays intact)
n_samples = 1000

for _ in range(n_samples):
    sample = df_temp.sample(1)
    df_temp.drop(index=sample.index, inplace=True)  # remove the drawn row so it can't be drawn again
    df = pd.concat([df, sample])   # df.append is deprecated/removed in recent pandas

# no index appears more than once, i.e. no row was sampled twice
assert (df.index.value_counts() > 1).sum() == 0
df
I fixed the error. PROBABILITY_GENERATED_POISSON needs to be a list.

Use Dask to Drop Highly Correlated Pairwise Features in Dataframe?

Having a tough time finding an example of this, but I'd like to somehow use Dask to drop pairwise correlated columns if their correlation is above 0.99. I CAN'T use Pandas' correlation function, as my dataset is too large and it eats up my memory in a hurry. What I have now is a slow double for loop that starts with the first column and finds the correlation between it and all the other columns one by one; if it's above 0.99, it drops that second comparative column, then starts at the new second column, and so on and so forth, KIND OF like the solution found here. However, doing this iteratively across all columns is unbearably slow, although it is actually possible to run it without hitting memory issues.
I've read the API here, and see how to drop columns using Dask here, but need some assistance in getting this figured out. I'm wondering if there's a faster, yet memory-friendly, way of dropping highly correlated columns in a Pandas Dataframe using Dask? I'd like to feed a Pandas dataframe into the function, and have it return a Pandas dataframe after the correlation dropping is done.
Anyone have any resources I can check out, or have an example of how to do this?
Thanks!
UPDATE
As requested, here is my current correlation dropping routine as described above:
print("Checking correlations of all columns...")
cols_to_drop_from_high_corr = []
corr_threshold = 0.99
for j in df.iloc[:,1:]: # Skip column 0
try: # encompass the below in a try/except, cuz dropping a col in the 2nd 'for' loop below will screw with this
# original list, so if a feature is no longer in there from dropping it prior, it'll throw an error
for k in df.iloc[:,1:]: # Start 2nd loop at first column also...
# If comparing the same column to itself, skip it
if (j == k):
continue
else:
try: # second try/except mandatory
correlation = abs(df[j].corr(df[k])) # Get the correlation of the first col and second col
if correlation > corr_threshold: # If they are highly correlated...
cols_to_drop_from_high_corr.append(k) # Add the second col to list for dropping when round is done before next round.")
except:
continue
# Once we have compared the first col with all of the other cols...
if len(cols_to_drop_from_high_corr) > 0:
df = df.drop(cols_to_drop_from_high_corr, axis=1) # Drop all the 2nd highly corr'd cols
cols_to_drop_from_high_corr = [] # Reset the list for next round
# print("Dropped all cols from most recent round. Continuing...")
except: # Now, if the first for loop tries to find a column that's been dropped already, just continue on
continue
print("Correlation dropping completed.")
UPDATE
Using the solution below, I'm running into a few errors, and due to my limited Dask syntax knowledge I'm hoping to get some insight. I'm running Windows 10, Python 3.6 and the latest version of dask.
Using the code as is on MY dataset (the dataset in the link says "file not found"), I ran into the first error:
ValueError: Exactly one of npartitions and chunksize must be specified.
So I specify npartitions=2 in the from_pandas, then get this error:
AttributeError: 'Array' object has no attribute 'compute_chunk_sizes'
I tried changing that to .rechunk('auto'), but then got error:
ValueError: Can not perform automatic rechunking with unknown (nan) chunk sizes
My original dataframe is in the shape of 1275 rows, and 3045 columns. The dask array shape says shape=(nan, 3045). Does this help to diagnose the issue at all?
I'm not sure if this helps, but maybe it could be a starting point.
Pandas
import pandas as pd
import numpy as np

url = "https://raw.githubusercontent.com/dylan-profiler/heatmaps/master/autos.clean.csv"
df = pd.read_csv(url)

# we check correlation for these columns only
cols = df.columns[-8:]

# columns in this df don't have a big
# correlation coefficient
corr_threshold = 0.5

corr = df[cols].corr().abs().values

# we take the upper triangular only
corr = np.triu(corr)

# we want high correlation but not diagonal elements
# it returns a bool matrix
out = (corr != 1) & (corr > corr_threshold)

# for every row we want only the True columns
cols_to_remove = []
for o in out:
    cols_to_remove += cols[o].to_list()
cols_to_remove = list(set(cols_to_remove))

df = df.drop(cols_to_remove, axis=1)
Dask
Here I comment only the steps that are different from pandas.
import dask.dataframe as dd
import dask.array as da

url = "https://raw.githubusercontent.com/dylan-profiler/heatmaps/master/autos.clean.csv"
df = dd.read_csv(url)

cols = df.columns[-8:]
corr_threshold = 0.5

corr = df[cols].corr().abs().values

# with dask we need to rechunk
corr = corr.compute_chunk_sizes()
corr = da.triu(corr)

out = (corr != 1) & (corr > corr_threshold)
# dask is lazy
out = out.compute()

cols_to_remove = []
for o in out:
    cols_to_remove += cols[o].to_list()
cols_to_remove = list(set(cols_to_remove))

df = df.drop(cols_to_remove, axis=1)
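If, as in the update above, .values gives a Dask array with unknown chunk sizes (or compute_chunk_sizes is unavailable in your Dask version), one possible workaround, assuming a reasonably recent Dask, is to request the chunk lengths while converting to an array:
import dask.array as da

# Assumes df, cols and corr_threshold are set up as in the Dask example above.
# to_dask_array(lengths=True) computes the partition lengths up front,
# so the resulting array has known chunk sizes.
corr = df[cols].corr().abs().to_dask_array(lengths=True)
corr = da.triu(corr)
out = ((corr != 1) & (corr > corr_threshold)).compute()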

How to speed up this task in Python

I have a large Pandas dataframe, 24'000'000 rows × 6 columns plus index.
I need to read an integer in column 1 (which is = 1 or 2), then force the value in column 3 to be negative if column 1 = 1, or positive if = 2. I use the following code in Jupyter notebook:
for i in range(1000):
    if df.iloc[i, 1] == 1:
        df.iloc[i, 3] = abs(df.iloc[i, 3]) * (-1)
    if df.iloc[i, 1] == 2:
        df.iloc[i, 3] = abs(df.iloc[i, 3])
The code above takes 2 min 30 sec to run for just 1'000 rows. For the 24M rows, it would take 41 days to complete!
Something is not right. The code runs in Jupyter Notebook/Chrome/Windows on a pretty high end PC.
The Pandas dataframe is created with pd.read_csv and then sorted and indexed this way:
df.sort_values(by = "My_time_stamp", ascending=True,inplace = True)
df = df.reset_index(drop=True)
The creation and sorting of the dataframe just takes a few seconds. I have other calculations to perform on this dataframe, so I clearly need to understand what I'm doing wrong.
np.where
import numpy as np

a = np.where(df.iloc[:, 1].to_numpy() == 1, -1, 1)
b = np.abs(df.iloc[:, 3].to_numpy())
df.iloc[:, 3] = a * b
Vectorize it:
df.iloc[:, 3] = df.iloc[:, 3].abs() * (2 * (df.iloc[:, 1] != 1) - 1)
Explanation:
Treated as int, boolean series df.iloc[:, 1] != 1 gets converted to ones and zeroes. Multiplied by 2, it gets twos and zeroes. After subtracting one, it gets -1 where the first column is 1, and 1 otherwise. Finally, it is multiplied by the absolute value of the third column, which enforces the sign.
Vectorization typically provides a speedup of an order of magnitude or two compared to for loops.
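As a tiny worked example of that trick (the frame and column names below are made up, with the 1/2 flag in column 1 and the value in column 3):
import pandas as pd

# Made-up frame: column 1 holds the 1/2 flag, column 3 holds the value to re-sign
df = pd.DataFrame({'a': 0, 'flag': [1, 2, 1, 2], 'b': 0, 'val': [-3.0, -5.0, 7.0, 9.0]})

signs = 2 * (df.iloc[:, 1] != 1) - 1        # -1 where the flag is 1, +1 otherwise
df.iloc[:, 3] = df.iloc[:, 3].abs() * signs
print(df['val'].tolist())                   # [-3.0, 5.0, -7.0, 9.0]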
Use
df.iloc[:, 3] = df.iloc[:, 3].abs().mul(df.iloc[:, 1].map({2: 1, 1: -1}))
Another way to do this:
import pandas as pd
Take an example data set:
df = pd.DataFrame({'x1':[1,2,1,2], 'x2':[4,8,1,2]})
Make new column, code values as -1 and +1:
df['nx1'] = df['x1'].replace({1:-1, 2:1})
Multiply columnwise:
df['nx1'] * df['x2']

Split pandas dataframe in two if it has more than 10 rows

I have a huge CSV with many tables with many rows. I would like to simply split each dataframe into 2 if it contains more than 10 rows.
If true, I would like the first dataframe to contain the first 10 and the rest in the second dataframe.
Is there a convenient function for this? I've looked around but found nothing useful...
i.e. split_dataframe(df, 2(if > 10))?
I used a List Comprehension to cut a huge DataFrame into blocks of 100'000:
size = 100000
list_of_dfs = [df.loc[i:i+size-1,:] for i in range(0, len(df),size)]
or as generator:
list_of_dfs = (df.loc[i:i+size-1,:] for i in range(0, len(df),size))
This will return the split DataFrames if the condition is met, and otherwise return the original plus None (which you would then need to handle separately). Note that this assumes the splitting only has to happen one time per df, and that the second part of the split may itself be longer than 10 rows (which happens whenever the original is longer than 20 rows).
df_new1, df_new2 = (df.iloc[:10, :], df.iloc[10:, :]) if len(df) > 10 else (df, None)
Note you can also use df.head(10) and df.tail(len(df) - 10) to get the front and back according to your needs. You can also use various indexing approaches: you can just provide the first dimension's index if you want, such as df.iloc[:10] instead of df.iloc[:10, :] (though I like to be explicit about the dimensions being taken). You can also use df.loc (and, in very old pandas versions, df.ix) to index in similar ways.
Be careful about using df.loc however, since it is label-based and the input will never be interpreted as an integer position. .loc would only work "accidentally" in the case when you happen to have index labels that are integers starting at 0 with no gaps.
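For instance, a small sketch (with a made-up integer index) of how .loc and .iloc can disagree:
import pandas as pd

# Small illustration of the label vs. position difference
df = pd.DataFrame({'x': range(5)}, index=[10, 11, 12, 13, 14])

print(len(df.iloc[:2]))   # 2 -> positional: the first two rows
print(len(df.loc[:2]))    # 0 -> label-based: no label in this index is <= 2
print(len(df.loc[:12]))   # 3 -> labels 10, 11 and 12 (endpoint included)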
But you should also consider the various options that pandas provides for dumping the contents of the DataFrame into HTML and possibly also LaTeX to make better designed tables for the presentation (instead of just copying and pasting). Simply Googling how to convert the DataFrame to these formats turns up lots of tutorials and advice for exactly this application.
There is no specific convenience function.
You'd have to do something like:
first_ten = pd.DataFrame()
rest = pd.DataFrame()

if df.shape[0] > 10:  # len(df) > 10 would also work
    first_ten = df[:10]
    rest = df[10:]
A method based on np.split:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 4, 6, 8, 10, 2, 4, 6, 8, 10],
                   'B': [10, -10, 0, 20, -10, 10, -10, 0, 20, -10],
                   'C': [4, 12, 8, 0, 0, 4, 12, 8, 0, 0],
                   'D': [9, 10, 0, 1, 3, np.nan, np.nan, np.nan, np.nan, np.nan]})

listOfDfs = [df.loc[idx] for idx in np.split(df.index, 5)]
A small function that uses a modulo could take care of cases where the split is not even (e.g. np.split(df.index,4) will throw an error).
(Yes, I am aware that the original question was somewhat more specific than this. However, this is supposed to answer the question in the title.)
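A minimal sketch of that modulo idea (the helper name split_uneven is made up; np.array_split(df.index, n) is a ready-made alternative that also tolerates uneven splits):
import numpy as np
import pandas as pd

def split_uneven(df, n_chunks):
    # Split df into n_chunks consecutive parts, even when len(df) % n_chunks != 0.
    base, remainder = divmod(len(df), n_chunks)   # the modulo step
    sizes = [base + 1 if i < remainder else base for i in range(n_chunks)]
    bounds = np.cumsum([0] + sizes)
    return [df.iloc[bounds[i]:bounds[i + 1]] for i in range(n_chunks)]

# e.g. 10 rows into 4 chunks -> sizes 3, 3, 2, 2 (np.split(df.index, 4) would raise here)
chunks = split_uneven(pd.DataFrame({'A': range(10)}), 4)
print([len(c) for c in chunks])   # [3, 3, 2, 2]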
Below is a simple function implementation which splits a DataFrame to chunks and a few code examples:
import pandas as pd

def split_dataframe_to_chunks(df, n):
    df_len = len(df)
    count = 0
    dfs = []

    while True:
        if count > df_len - 1:
            break

        start = count
        count += n
        # print("%s : %s" % (start, count))
        dfs.append(df.iloc[start:count])

    return dfs


# Create a DataFrame with 10 rows
df = pd.DataFrame([i for i in range(10)])

# Split the DataFrame to chunks of maximum size 2
split_df_to_chunks_of_2 = split_dataframe_to_chunks(df, 2)
print([len(i) for i in split_df_to_chunks_of_2])
# prints: [2, 2, 2, 2, 2]

# Split the DataFrame to chunks of maximum size 3
split_df_to_chunks_of_3 = split_dataframe_to_chunks(df, 3)
print([len(i) for i in split_df_to_chunks_of_3])
# prints [3, 3, 3, 1]
If you have a large data frame and need to divide it into a variable number of sub-dataframes, for example so that each sub-dataframe has a maximum of 4500 rows, this script could help:
max_rows = 4500
dataframes = []

while len(df) > max_rows:
    top = df[:max_rows]
    dataframes.append(top)
    df = df[max_rows:]
else:
    dataframes.append(df)
You could then save out these data frames:
for i, frame in enumerate(dataframes):
    frame.to_csv(str(i) + '.csv', index=False)
Hope this helps someone!
def split_and_save_df(df, name, size, output_dir):
    """
    Split a df and save each chunk in a different csv file.
    Parameters:
        df : pandas df to be split
        name : name to give to the output files
        size : chunk size
        output_dir : directory where to write the divided df
    """
    import os
    for i in range(0, df.shape[0], size):
        start = i
        end = min(i + size - 1, df.shape[0])
        subset = df.loc[start:end]
        output_path = os.path.join(output_dir, f"{name}_{start}_{end}.csv")
        print(f"Going to write into {output_path}")
        subset.to_csv(output_path)
        output_size = os.stat(output_path).st_size
        print(f"Wrote {output_size} bytes")
You can use the DataFrame head and tail methods as syntactic sugar instead of slicing/loc here. I use a split size of 3; for your example use headSize=10
import numpy as np
import pandas as pd

def split(df, headSize):
    hd = df.head(headSize)
    tl = df.tail(len(df) - headSize)
    return hd, tl

df = pd.DataFrame({'A': [2, 4, 6, 8, 10, 2, 4, 6, 8, 10],
                   'B': [10, -10, 0, 20, -10, 10, -10, 0, 20, -10],
                   'C': [4, 12, 8, 0, 0, 4, 12, 8, 0, 0],
                   'D': [9, 10, 0, 1, 3, np.nan, np.nan, np.nan, np.nan, np.nan]})

# Split dataframe into top 3 rows (first) and the rest (second)
first, second = split(df, 3)
A method based on a list comprehension and groupby stores all the split dataframes in a list variable, which can then be accessed by index.
Example:
ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]

ans[0]
ans[0].column_name
