I have a DataFrame containing many NaN values. I want to delete rows that contain too many NaN values; specifically: 7 or more.
I tried using the dropna function several ways, but it seems to greedily delete any column or row that contains a NaN value.
This question (Slice Pandas DataFrame by Row) shows me that if I can just compile a list of the rows that have too many NaN values, I can delete them all with a simple
df.drop(rows)
I know I can count non-null values using the count function, which I could then subtract from the total to get the NaN count (Is there a direct way to count NaN values in a row?). But even so, I am not sure how to write a loop that goes through a DataFrame row by row.
Here's a rough loop-based sketch that I think is on the right track:
# For each row, count its NaN values and collect the rows to drop
rows_to_drop = []
for index, row in df.iterrows():
    m = len(df.columns) - row.count()   # number of NaN values in this row
    if m >= 7:
        rows_to_drop.append(index)
df = df.drop(rows_to_drop)
I am still new to Pandas so I'm very open to other ways of solving this problem; whether they're simpler or more complex.
Basically, the way to do this is to determine the number of columns, set the minimum number of non-NaN values a row must keep, and drop the rows that don't meet that threshold. To drop rows with 7 or more NaNs, keep rows that have at most 6 NaNs, i.e. at least len(df.columns) - 6 non-NaN values:
df.dropna(thresh=len(df.columns) - 6)
See the docs
The optional thresh argument of df.dropna is the minimum number of non-NA values a row needs in order to be kept, so to drop rows with 7 or more NaNs:
df.dropna(thresh=df.shape[1] - 6)
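For a quick sanity check, here is a tiny made-up frame (hypothetical data) showing which rows survive the thresh rule:
import numpy as np
import pandas as pd

# 8 columns: row 0 has 7 NaNs, row 1 has 6
df = pd.DataFrame([[1] + [np.nan] * 7,
                   [1, 2] + [np.nan] * 6])

# thresh = 8 - 6 = 2 non-NA values required: row 0 is dropped, row 1 is kept
print(df.dropna(thresh=df.shape[1] - 6))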
I am attempting to use pandas to create a new df based on a set of conditions that compares the rows within the original df against one another. I am new to using pandas and feel comfortable comparing two dfs to each other and doing basic column comparisons, but for some reason the row-by-row comparison is stumping me. My specific conditions and problem are found below:
Cosine_i_ start_time fid_ Shape_Area
0 0.820108 2022-08-31T10:48:34Z emit20220831t104834_o24307_s000 0.067763
1 0.962301 2022-08-27T12:25:06Z emit20220827t122506_o23908_s000 0.067763
2 0.811369 2022-08-19T15:39:39Z emit20220819t153939_o23110_s000 0.404882
3 0.788322 2023-01-29T13:23:39Z emit20230129t132339_o02909_s000 0.404882
4 0.811369 2022-08-19T15:39:39Z emit20220819t153939_o23110_s000 0.108256
^^Above is my original df that I will be working with.
Goal: I am hoping to create a new df that contains only the FIDs that meet the following conditions: If the shape area is equal, the cosi values have a difference greater than 0.1, and the start time has a difference greater than 5 days. This is going to be applied to a large dataset, the df displayed is just a small sample one I made to help write the code.
For example: Rows 2 & 3 have the same shape area, so then looking at the cosi values, they have a difference in values greater than 0.1, and lastly they have a difference in their start times that is greater than 5 days. They meet all set conditions, so I would then like to take the FID values for BOTH of these rows and append it to a new df.
So essentially I want to compare every row with the other rows and that's where I am having trouble.
I am looking for as much guidance as possible on how to set this up as I am very very new to coding and am hoping to get a tutorial of some sort!
Thanks in advance.
Group by Shape_Area and filter each pair (groups with a single Shape_Area row are omitted) by the required conditions:
# start_time must be a datetime column for .dt.days to work, e.g.:
# df['start_time'] = pd.to_datetime(df['start_time'])
fids = df.groupby('Shape_Area').filter(
    lambda x: x.index.size > 1
    and x['Cosine_i_'].diff().abs().values[-1] > 0.1
    and x['start_time'].diff().abs().dt.days.values[-1] > 5,
    dropna=True)['fid_'].tolist()
print(fids)
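Note that the lambda above only looks at the difference between the last two rows of each group. If a Shape_Area group can contain more than two rows, a pairwise self-merge is one alternative sketch that compares every row with every other row in its group (column names follow the sample above; the suffixed names are just what merge produces):
import pandas as pd

df['start_time'] = pd.to_datetime(df['start_time'])

# Pair every row with every other row sharing the same Shape_Area
pairs = df.reset_index().merge(df.reset_index(), on='Shape_Area', suffixes=('_a', '_b'))
pairs = pairs[pairs['index_a'] < pairs['index_b']]   # keep each unordered pair once

# Keep pairs that also meet the cosine and start-time conditions
match = pairs[
    (pairs['Cosine_i__a'] - pairs['Cosine_i__b']).abs().gt(0.1)
    & (pairs['start_time_a'] - pairs['start_time_b']).abs().gt(pd.Timedelta(days=5))
]

# Collect the FIDs from both sides of every matching pair
fids = pd.unique(pd.concat([match['fid__a'], match['fid__b']]))
new_df = df[df['fid_'].isin(fids)]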
Background: I have a list of several hundred departments to which I would like to allocate budget as follows:
Each DEPT has an AMT_TOTAL budget to spend within a given number of months. They also have a monthly limit LIMIT_MONTH that they cannot exceed.
As each DEPT plans to spend their budget as fast as possible, we assume they will spend up to their monthly limit until AMT_TOTAL runs out. The amount they are forecast to spend each month under this assumption is AMT_ALLOC_MONTH.
My objective is to calculate the AMT_ALLOC_MONTH column, given the LIMIT_MONTH and AMT_TOTAL columns. Based on what I've read and searched, I believe a combination of fillna and cumsum() can do the job. So far, the Python dataframe I've managed to generate is as follows:
I planned to fill the NaN using the following line:
table['AMT_ALLOC_MONTH'] = min((table['AMT_TOTAL'] - table.groupby('DEPT')['AMT_ALLOC_MONTH'].cumsum()).ffill, table['LIMIT_MONTH'])
My objective is to take AMT_TOTAL minus the cumulative sum of AMT_ALLOC_MONTH (excluding the NaN values), grouped by DEPT; the result is then compared with the value in column LIMIT_MONTH, and the smaller of the two is filled into the NaN cell. The process is repeated until all NaN cells of each DEPT are filled.
Needless to say, the result did not come out as I expected; the line only works for the first NaN after a cell with a value; subsequent NaN cells just copy the value above them. If there is a way to fix the issue, or a new and more intuitive way to do this, please help. Truly appreciated!
Try this:
for department in table['DEPT'].unique():
    subset = table[table['DEPT'] == department]
    for index, row in subset.iterrows():
        # Refresh the subset so the cumulative sum sees the allocations filled in so far
        subset = table[table['DEPT'] == department]
        # Everything already allocated to this DEPT (assumes a default integer index)
        cumsum = subset.loc[:index - 1, 'AMT_ALLOC_MONTH'].sum()
        limit = row['LIMIT_MONTH']
        remaining = row['AMT_TOTAL'] - cumsum
        table.at[index, 'AMT_ALLOC_MONTH'] = min(remaining, limit)
It's not very elegant, I guess, but it seems to work.
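For comparison, here is a vectorized sketch. It assumes each DEPT's rows appear in month order and that AMT_TOTAL repeats the department's total budget on every row of that department (both are assumptions about the data, not stated in the post):
import numpy as np

def allocate(group):
    # What would have been spent before each month if the full limit were used every month
    cum_limit_before = group['LIMIT_MONTH'].cumsum().shift(fill_value=0)
    # Budget left at the start of each month, floored at zero
    remaining = (group['AMT_TOTAL'] - cum_limit_before).clip(lower=0)
    # Spend the monthly limit or whatever budget is left, whichever is smaller
    return np.minimum(group['LIMIT_MONTH'], remaining)

table['AMT_ALLOC_MONTH'] = (table.groupby('DEPT', group_keys=False)
                                 .apply(allocate))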
I want to remove records which have 2 or more NaNs, but so far all the code I have found online is either for removing one NaN or not relevant to this situation (e.g. Thresh, any and all).
The code I use for at least 1 NaN is df_exercise.isnull().any(axis=1).
I’m not sure how to adapt this specifically to 2 or more NaNs.
You count the number of empty fields per row and only keep those with fewer than 2 empty fields.
keep = df_exercise.isnull().sum(axis=1).lt(2)
df_exercise[keep]
The key to success is to count the NaN values in each row and check whether the count is less than your threshold. You can get it by running:
df_exercise.isnull().sum(axis=1) < 2
And to drop the rows exceeding this threshold (i.e. keep the rows within it), run:
df_exercise = df_exercise[df_exercise.isnull().sum(axis=1) < 2]
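As a side note, the thresh argument mentioned in the question can express the same rule, assuming you want to keep rows with at most one NaN:
df_exercise = df_exercise.dropna(thresh=df_exercise.shape[1] - 1)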
I have a Pandas DataFrame extracted from Estespark Weather for the dates between Sep-2009 and Oct-2018, and the mean of the Average windspeed column is 4.65. I am taking a challenge where there is a sanity check where the mean of this column needed to be 4.64. How can I modify the values of this column so that the mean of this column becomes 4.64? Is there any code solution for this, or do we have to do it manually?
I can see two solutions:
1. Subtract 0.01 (4.65 - 4.64) from every value of that column:
df['AvgWS'] -= 0.01
2. If you don't want to alter all rows: find which rows you can remove to give you the desired mean (if there are any):
current_mean = 4.65
desired_mean = 4.64
n_rows = len(df['AvgWS'])
# Mean of the remaining rows if x were removed; round to avoid float-precision misses
df['can_remove'] = df['AvgWS'].map(
    lambda x: round((current_mean * n_rows - x) / (n_rows - 1), 2) == desired_mean)
This will create a new boolean column in your dataframe that is True for the rows which, if removed, make the rest of the column's mean 4.64. If there is more than one, you can analyse them to choose which one seems least important and remove that one.
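A hypothetical follow-up, showing how the flag could be used: drop the first removable row (if any) and check the new mean:
candidates = df[df['can_remove']].index
if len(candidates) > 0:
    df = df.drop(candidates[0])
print(round(df['AvgWS'].mean(), 2))   # should print 4.64 if a row was removed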
I have a dataframe like this:
df.head()
day time resource_record
0 27 00:00:00 AAAA
1 27 00:00:00 A
2 27 00:00:00 AAAA
3 27 00:00:01 A
4 27 00:00:02 A
and want to find out how many occurrences of certain resource_records exist.
My first try was using the Series returned by value_counts(), which seems great, but does not allow me to exclude some labels afterwards, because there is no drop() implemented in dask.Series.
So I tried just to not print the undesired labels:
for row in df.resource_record.value_counts().iteritems():
    if row[0] in ['AAAA']:
        continue
    print('\t{0}\t{1}'.format(row[1], row[0]))
This works fine, but what if I ever want to work further with this data and really want it 'cleaned'? So I searched the docs a bit more and found mask(), but this feels a bit clumsy as well:
records = df.resource_record.mask(df.resource_record.map(lambda x: x in ['AAAA'])).value_counts()
I looked for a method which would allow me to just count individual values, but count() does count all values that are not NaN.
Then I found str.contains(), but I don't know how to handle the undocumented Scalar type I get returned with this code:
print(df.resource_record.str.contains('A').sum())
Output:
dd.Scalar<series-..., dtype=int64>
But even after looking at Scalar's code in dask/dataframe/core.py I didn't find a way of getting its value.
How would you efficiently count the occurrences of a certain set of values in your dataframe?
In most cases pandas syntax will work with dask as well, with the necessary addition of .compute() (or dask.compute) to actually perform the action. Until the compute, you are merely constructing the graph which defines the action.
I believe the simplest solution to your question is this:
df[df.resource_record!='AAAA'].resource_record.value_counts().compute()
Where the expression in the selector square brackets could be some mapping or function.
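If you need to exclude a whole set of labels rather than a single one, isin fits the same pattern (the label list here is just illustrative):
excluded = ['AAAA']
counts = df[~df.resource_record.isin(excluded)].resource_record.value_counts().compute()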
One quite nice method I found is this:
counts = df.resource_record.mask(df.resource_record.isin(['AAAA'])).dropna().value_counts()
First we mask all entries we'd like to get removed, which replaces the value with NaN. Then we drop all rows with NaN and last count the occurrences of unique values.
This requires df to have no pre-existing NaN values in resource_record, since those rows would otherwise be removed as well.
I expect something like
df.resource_record.drop(df.resource_record.isin(['AAAA']))
would be faster, because I believe drop would run through the dataset once, while mask + dropna runs through the dataset twice. But drop is only implemented for axis=1, and here we need axis=0.
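And if all you need is the total number of rows whose resource_record falls in a given set, a single isin plus sum avoids value_counts entirely (a sketch):
n_matching = df.resource_record.isin(['A', 'AAAA']).sum().compute()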