I have highly unbalanced data (binary labels: zeros are 96% of the data, while ones are just 4%). To balance it I have decided to delete some of the rows with label zero. However, iterating over the whole dataframe and deleting rows one by one with the pandas.DataFrame.drop() method would take several hours. What is the most time-efficient way to delete the data?
I have tried sorting the data and then just clearing out a bunch of rows with label 0, but unfortunately I must not change the order of the data.
I have selected the indexes of rows with label 0 and chosen random indexes from that list to delete, like so:
drops = random.sample(zero_indexes, X)
(where X is the number of rows I want to delete), but I am not sure how to delete the rows with those indexes in an acceptable time. Any help would be appreciated.
Get a list of indices you want to chuck
bad_labels = df[df['label'] == 0].sample(500).index
Then filter df to rows not in there
df1 = df[~df.index.isin(bad_labels)]
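Putting that together with the sampling idea from the question, a minimal sketch (assuming df has a binary 'label' column and X is the number of zero-label rows to drop, names taken from the question):
# pick X random label-0 rows to drop; the boolean mask keeps the remaining
# rows in their original order, with no per-row drop() calls
n_ones = (df['label'] == 1).sum()
n_zeros = (df['label'] == 0).sum()
X = n_zeros - n_ones  # for example, enough zeros removed to reach a 50/50 split
drop_idx = df[df['label'] == 0].sample(n=X, random_state=0).index
df_balanced = df[~df.index.isin(drop_idx)]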
I have a dataframe containing duplicates that are flagged by a specific variable. The df looks like this:
[image of the dataframe]
The idea is that the row to keep and its duplicates are stacked in batches (a pair or more if there are many duplicates) and identified by the "duplicate" column. For each batch, I would like to keep the row based on one condition: it has to be the row with the smallest number of empty cells. For Alice, for instance, it should be the second row (and not the one flagged "keep").
The difficulty also lies in the fact that I cannot group by the "name", "lastname" or "phone" columns, because they are not always filled (the duplicates are computed on these 3 concatenated columns by an ML algorithm).
Unlike the questions already posted that I've seen (how do I remove rows with duplicate values of columns in pandas data frame?), here the condition for selecting the row to keep is not fixed (like keeping the first or the last row within the batch of duplicates) but depends on how complete the rows are in each batch.
How can I parse the dataframe according to this "duplicate" column, and within each batch extract the row I want?
I tried to assign a unique label to each batch, in order to iterate over these labels, but it failed.
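One possible sketch, under an assumption the question does not spell out: that every batch starts with the row flagged 'keep' in the "duplicate" column, so a cumulative count of those flags can serve as the batch label:
# batch id: increases by 1 each time a 'keep' flag starts a new batch
df['batch'] = (df['duplicate'] == 'keep').cumsum()
# within each batch, keep the row with the most filled (non-empty) cells
best_rows = df.notna().sum(axis=1).groupby(df['batch']).idxmax()
deduped = df.loc[best_rows].drop(columns='batch')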
My code currently looks like this:
df1 = pd.DataFrame(statsTableList)
df2 = pd.read_csv('StatTracker.csv')
result = pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
I get an error and I'm not sure why.
The goal of my program is to pull data from an API and then write it all to a file for analysis. df1 is, let's say, the first 100 games written to the csv file as the first version. df2 is me reading those first 100 games back the second time around and comparing them to df1 (the new data, the next 100 games) to check for duplicates and delete them.
The part that is not working is the drop_duplicates part. It gives me an "unhashable list" error; I assume that's because the two dataframes are built from lists of dictionaries. The goal is to pull 100 games of data, then pull the next 50, but if I pull number 100 again, to drop that one and just add 101-150, and then add it all to my csv file. Then if I run it again, to pull 150-200, but drop 150 if it's a duplicate, and so on.
Based on your explanation, you can use this one-liner to find the values unique to df1:
df_diff = df1[~df1.apply(tuple, 1).isin(df2.apply(tuple, 1))]
This code checks whether each row exists in the other dataframe. To do the comparison it converts each row to a tuple (applying the tuple conversion along axis 1, the row axis).
This solution is slow on large frames because the row-wise apply runs in plain Python, converting and checking one row of df1 at a time against the rows of df2.
If you want a more optimised version, try the pandas built-in compare method (note that it requires the two frames to have identical labels and shape):
df1.compare(df2)
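Another option worth sketching, hash-based and without the row-wise apply, is an indicator merge on all columns (assuming df1 and df2 share the same column names):
# the left-merge flags each df1 row as 'both' or 'left_only'; keeping
# 'left_only' gives the rows of df1 that do not appear in df2
merged = df1.merge(df2, how='left', indicator=True)
df_diff = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')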
I have a dataframe which is similar to this
d1 = pd.DataFrame({'name': ['xyz', 'abc', 'dfg'],
                   'age': [15, 34, 22],
                   'sex': ['s1', 's2', 's3'],
                   'w-1(6)': [96, 66, 74],
                   'w-2(5)': [55, 86, 99],
                   'w-3(4)': [11, 66, 44]})
Note that in my original DataFrame the week numbers are generated dynamically, i.e. the columns w-1(6), w-2(5) and w-3(4) are generated dynamically and change every week. I want to sort all three week columns in descending order of their values.
But the names of the columns cannot be used as they change every week.
Is there any possible way to achieve this?
Edit: The numbers might not always be present for all three weeks, in the sense that if w-1 has no data, I won't have that column in the dataset at all. So that would mean only two week columns and not three.
You can use the column indices.
d1.sort_values(by=[d1.columns[3], d1.columns[4], d1.columns[5]], ascending=False)
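Since the question says the week columns come and go, a variant that picks them up by name pattern instead of by position may be safer (assuming the names always follow the w-<n>(<m>) pattern shown above):
# collect whichever week columns are present this week, then sort by them
week_cols = [c for c in d1.columns if c.startswith('w-')]
d1_sorted = d1.sort_values(by=week_cols, ascending=False)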
I have a data frame with a categorical variable where the group sizes vary.
Within every group of the categorical variable, I want to assign random numbers between 1 and 10: I create as many random numbers between 1 and 10 as there are entries in that group.
To assign a random number I made a simple function called createrandomnum.
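For reference, a hypothetical version of createrandomnum consistent with that description (not the asker's actual code):
import random
# stand-in helper: one random number between 1 and 10 per entry in the group
def createrandomnum(group):
    return [random.randint(1, 10) for _ in group]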
Then I used this line of code:
grouped_values = data.groupby("categories").categories.agg(newnumber=createrandomnum)
The output is then a data frame where every row represents a category. The column named 'newnumber' contains lists of numbers between 1 and 10; the length of each list corresponds to the group size in the original data frame.
I would like to add these numbers to my original data frame. Which number is allocated to which entry is not that important, as long as the category is the same.
I figured I probably have to sort my original data frame:
data.sort_values("categories")
But then...
Anyone that could help me? Thanks in advance!
P.S. I just started learning Python, so maybe the code I provided here is not the most efficient. Tips are welcome of course :)
I believe you can use the GroupBy.transform function to return a new column (Series) with the same size as the original DataFrame:
data['new'] = data.groupby("categories").categories.transform(createrandomnum)
One way to add the random number directly, here one number per group (broadcast to all of that group's rows):
import random
data['new'] = data.groupby('categories')['categories'].transform(lambda group: random.randint(1, 10))
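If instead every row should get its own random number, a sketch with an array-returning lambda (numpy assumed to be available):
import numpy as np
# returning an array the size of each group lets transform align one
# independent random number (1 to 10) to every row
data['new'] = data.groupby('categories')['categories'].transform(lambda g: np.random.randint(1, 11, size=len(g)))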
I'm organizing a new dataframe in order to easily insert data into a Bokeh visualization code snippet. I think my problem is due to differing row lengths, but I am not sure.
Below, I organized the dataset in alphabetical order by country name and created an alphabetical list of the individual countries. In new_data.tail(), Zimbabwe is listed last (there are 80336 rows in total), so the sorting worked.
df_ind_data = pd.DataFrame(ind_data)
new_data = df_ind_data.sort_values(by=['country'])
new_data = new_data.reset_index(drop=True)
country_list = list(ind_data['country'])
new_country_set = sorted(set(country_list))
My goal is to create a new DataFrame with 76 columns (country names), with the specific 'trust' data in the rows underneath each country column.
df = pd.DataFrame()
for country in new_country_set:
    pink = new_data.loc[new_data['country'] == country]
    df[country] = pink.trust
The data does not get included for any of the columns after the first. I believe this is because the number of rows of 'trust' data varies for each country. While the first column has 1000 rows, some countries have as many as 2500 data points and as few as 500.
I have attempted a few different methods to specify the number of rows in 'df', but to no avail.
The visualization code snippet I have uses this exact same data structure for its template data, which is why I'm attempting to put the data in a dataframe. Besides, I simply can't get it to work, so I want to know how to do it.
Yes, I can put it in a dictionary, but I want to put it in a dataframe.
You should use combine_first when you add a new column so that the dataframe index gets extended. Instead of
df[country] = pink.trust
you should use
df = pink.trust.combine_first(df)
which ensures that your index is always the union of the indexes of all added columns.
I think in this case pd.pivot(columns='var', values='val') will work for you, especially since you already have a dataframe. This function turns the values of a particular column into column names. You could see the documentation for additional info. I hope that helps.
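A sketch of how that pivot could look here, assuming the long frame is new_data with 'country' and 'trust' columns (names taken from the question); cumcount gives each country's observations their own row positions, so shorter columns simply pad with NaN:
wide = (new_data
        .assign(obs=new_data.groupby('country').cumcount())
        .pivot(index='obs', columns='country', values='trust'))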