I have a dataframe like this:
df.head()
day time resource_record
0 27 00:00:00 AAAA
1 27 00:00:00 A
2 27 00:00:00 AAAA
3 27 00:00:01 A
4 27 00:00:02 A
and want to find out how many occurrences of certain resource_records exist.
My first try was using the Series returned by value_counts(), which seems great, but does not allow me to exclude some labels afterwards, because there is no drop() implemented in dask.Series.
So I tried just to not print the undesired labels:
for row in df.resource_record.value_counts().iteritems():
    if row[0] in ['AAAA']:
        continue
    print('\t{0}\t{1}'.format(row[1], row[0]))
This works fine, but what if I ever want to work further with this data and really want it 'cleaned'? So I searched the docs a bit more and found mask(), but this feels a bit clumsy as well:
records = df.resource_record.mask(df.resource_record.map(lambda x: x in ['AAAA'])).value_counts()
I looked for a method which would let me count only selected values, but count() counts all values that are not NaN.
Then I found str.contains(), but I don't know how to handle the undocumented Scalar type that this code returns:
print(df.resource_record.str.contains('A').sum())
Output:
dd.Scalar<series-..., dtype=int64>
But even after looking at Scalar's code in dask/dataframe/core.py I didn't find a way of getting its value.
How would you efficiently count the occurrences of a certain set of values in your dataframe?
In most cases pandas syntax will work with dask as well, with the necessary addition of .compute() (or dask.compute) to actually perform the action. Until the compute, you are merely constructing the graph which defines the action.
I believe the simplest solution to your question is this:
df[df.resource_record!='AAAA'].resource_record.value_counts().compute()
Where the expression in the selector square brackets could be some mapping or function.
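This is also how you get the value of the Scalar from the question: calling .compute() on it evaluates the graph and returns a plain number. A minimal sketch, assuming df is the dask dataframe shown above:
# The Scalar is just a lazy result; .compute() evaluates it to a plain int.
n_records_with_a = df.resource_record.str.contains('A').sum().compute()
print(n_records_with_a)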
One quite nice method I found is this:
counts = df.resource_record.mask(df.resource_record.isin(['AAAA'])).dropna().value_counts()
First we mask all entries we'd like to remove, which replaces their values with NaN. Then we drop all rows with NaN and finally count the occurrences of unique values. This requires df to have no pre-existing NaN values, since rows that already contained NaN would otherwise be removed as well.
I expect something like
df.resource_record.drop(df.resource_record.isin(['AAAA']))
would be faster, because I believe drop would run through the dataset once, while mask + dropna runs through the dataset twice. But drop is only implemented for axis=1, and here we need axis=0.
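A hedged alternative sketch that avoids both the NaN caveat and the double pass is plain boolean indexing with isin, which dask supports just like pandas:
# Keep only the rows whose resource_record is not in the exclusion set,
# then count the remaining values.
counts = df[~df.resource_record.isin(['AAAA'])].resource_record.value_counts()
print(counts.compute())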
Related
I want to remove records which have 2 or more NaNs, but so far all the code I have found online is either for removing one NaN or not relevant to this situation (e.g. thresh, any and all).
The code I use for at least 1 NaN is df_exercise.isnull().any(axis=1).
I’m not sure how to adapt this specifically to 2 or more NaNs.
You count the number of empty fields per row and only keep those with fewer than 2 empty fields.
keep = df_exercise.isnull().sum(axis=1).lt(2)
df_exercise[keep]
The key is to count the NaN values in each row and check whether that count is below your threshold. You can get this check with:
df_exercise.isnull().sum(axis=1) < 2
And to drop rows exceeding this threshold (keep rows within the threshold), run:
df_exercise = df_exercise[df_exercise.isnull().sum(axis=1) < 2]
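An equivalent one-liner, sketched under the assumption that df_exercise is a plain pandas DataFrame, uses dropna's thresh argument (the minimum number of non-NA values a row must have to be kept):
# Keep rows with at least ncols - 1 non-NA values,
# i.e. drop every row containing 2 or more NaNs.
df_exercise = df_exercise.dropna(thresh=df_exercise.shape[1] - 1)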
I have a dataframe that looks like this
I need to adjust the time_in_weeks column for the entry with the value 34. When there is a duplicate uniqueid with a different rma_created_date, that means some failure occurred. The 34 needs to be recalculated as the number of weeks between the new, most recent rma_created_date (2020-10-15 in this case) and the rma_processed_date of the row above (2020-06-28).
I hope that makes sense in terms of what I am trying to do.
So far I did this
def clean_df(df):
    '''
    This function will fix the time_in_weeks column to calculate the correct
    number of weeks when there are multiple failures for an item.
    '''
    # Sort by rma_created_date
    df = df.sort_values(by=['rma_created_date'])
Now I need to perform what I described above, but I am a little confused about how to do this, especially considering we could have multiple failures and not just two.
I should get something like this returned as output
As you can see, the 34 got changed to the number of weeks between 2020-10-15 and 2020-06-26.
Here is another example with more rows
Using the expression suggested
df['time_in_weeks'] = np.where(
    df.uniqueid.duplicated(keep='first'),
    df.rma_processed_date.dt.isocalendar().week.sub(df.rma_processed_date.dt.isocalendar().week.shift(1)),
    df.time_in_weeks
)
I get this
Final note: if there is a date of 1/1/1900 then don't perform any calculation.
The question is not very clear, so I am happy to correct this if I have interpreted it wrongly.
Try np.where(condition, value if condition is True, value if condition is False):
# Coerce dates into datetime
df['rma_processed_date'] = pd.to_datetime(df['rma_processed_date'])
df['rma_created_date'] = pd.to_datetime(df['rma_created_date'])

# Solution
df['time_in_weeks'] = np.where(df.uniqueid.duplicated(keep='first'), df.rma_created_date.sub(df.rma_processed_date), df.time_in_weeks)
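Note that the subtraction above yields a Timedelta rather than a number of weeks. A hedged variant, assuming the columns uniqueid, rma_created_date, rma_processed_date and time_in_weeks shown in the question, divides by a one-week Timedelta and shifts within each uniqueid so it also copes with more than two failures (the 1/1/1900 sentinel from the question would still need its own guard):
import numpy as np
import pandas as pd

# For repeated uniqueids, replace time_in_weeks with the number of weeks
# between this row's rma_created_date and the previous row's
# rma_processed_date; leave first occurrences untouched.
df = df.sort_values(['uniqueid', 'rma_created_date'])
prev_processed = df.groupby('uniqueid')['rma_processed_date'].shift(1)
weeks_since_prev = (df['rma_created_date'] - prev_processed) / pd.Timedelta(weeks=1)
df['time_in_weeks'] = np.where(
    df['uniqueid'].duplicated(keep='first'),  # only the later (failure) rows
    weeks_since_prev,
    df['time_in_weeks'],
)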
I have a time series (array of values) and I would like to find the starting points where a long drop in values begins (at least X consecutive values going down). For example:
Having a list of values
[1,2,3,4,3,4,5,4,3,4,5,4,3,2,1,2,3,2,3,4,3,4,5,6,7,8]
I would like to find a drop of at least 5 consecutive values. So in this case I would find the segment 5,4,3,2,1.
However, in a real scenario, there is noise in the data, so the actual drop includes a lot of little ups and downs.
I could write an algorithm for this. But I was wondering whether there is an existing library or standard signal processing method for this type of analysis.
You can do this pretty easily with pandas (which I know you have). Convert your list to a series, and then perform a groupby + count to find consecutively declining values:
v = pd.Series([1,2,3,4,3,4,5,4,3,4,5,4,3,2,1,2,3,2,3,4,3,4,5,6,7,8])
v[v.groupby(v.diff().gt(0).cumsum()).transform('size').ge(5)]
10 5
11 4
12 3
13 2
14 1
dtype: int64
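If you only need the starting index of each long drop, here is a small follow-up sketch reusing the same grouping idea (the variable names are mine):
import pandas as pd

v = pd.Series([1,2,3,4,3,4,5,4,3,4,5,4,3,2,1,2,3,2,3,4,3,4,5,6,7,8])
groups = v.diff().gt(0).cumsum()                # a new group id starts at each rise
sizes = v.groupby(groups).transform('size')     # length of each non-increasing run
mask = sizes.ge(5)                              # runs of at least 5 values
starts = v[mask].groupby(groups[mask]).head(1)  # first element of each long drop
print(starts.index.tolist())                    # [10] for this example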
So I have a function replaceMonth(string), which is just a series of if statements that returns a string derived from a column in a pandas dataframe. Then I need to replace the original string with the derived one.
The dataframe is defined like this:
Index ID Year DSFS DrugCount
0 111111 Y1 3- 4 months 1
There are around 80K rows in the dataframe. What I need to do is to replace what is in column DSFS with the result from the replaceMonth(string) function.
So if, for example, the value in the first row of DSFS was '3- 4 months', running that string through replaceMonth() would give me '_3_4' as the return value. Then I need to change the value in the dataframe from '3- 4 months' to '_3_4'.
I've been trying to use apply on the dataframe but I'm either getting the syntax wrong or not understanding what it's doing correctly, like this:
dataframe['DSFS'].apply(replaceMonth(dataframe['DSFS']))
That doesn't seem right to me, but I'm not sure where I'm messing up. I'm fairly new to Python, so it's probably the syntax. :)
Any help is greatly appreciated!
When you call apply, you pass the function itself (not the result of calling it); apply then calls it on each element.
Try
dataframe['DSFS'].apply(replaceMonth)
Reassign to the dataframe to preserve the changes
dataframe['DSFS'] = dataframe['DSFS'].apply(replaceMonth)
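A minimal end-to-end sketch, using a hypothetical stand-in for replaceMonth (the real function is the question's series of if statements):
import pandas as pd

def replaceMonth(value):
    # Hypothetical stand-in: map '3- 4 months' style strings to '_3_4' style tokens.
    if value == '3- 4 months':
        return '_3_4'
    # ...remaining if statements for the other ranges...
    return value

dataframe = pd.DataFrame({'DSFS': ['3- 4 months']})
dataframe['DSFS'] = dataframe['DSFS'].apply(replaceMonth)
print(dataframe['DSFS'].iloc[0])  # '_3_4'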
I have a DataFrame containing many NaN values. I want to delete rows that contain too many NaN values; specifically: 7 or more.
I tried using the dropna function several ways but it seems clear that it greedily deletes columns or rows that contain any NaN values.
This question (Slice Pandas DataFrame by Row) shows me that if I can just compile a list of the rows that have too many NaN values, I can delete them all with a simple
df.drop(rows)
I know I can count non-null values using the count function, which I could then subtract from the total to get the NaN count (Is there a direct way to count NaN values in a row?). But even so, I am not sure how to write a loop that goes through a DataFrame row-by-row.
Here's some pseudo-code that I think is on the right track:
### Loop over each row and drop it if it has too many NaN values:
rows_to_drop = []
for idx, row in df.iterrows():
    m = df.shape[1] - row.count()   # number of NaN values in this row
    if m >= 7:                      # 7 or more NaNs
        rows_to_drop.append(idx)
df = df.drop(rows_to_drop)
I am still new to Pandas so I'm very open to other ways of solving this problem; whether they're simpler or more complex.
Basically, the way to do this is to determine the number of columns, set the minimum number of non-NaN values, and drop the rows that don't meet this criterion:
df.dropna(thresh=(len(df.columns) - 7))
See the docs
The optional thresh argument of df.dropna lets you give it the minimum number of non-NA values in order to keep the row.
df.dropna(thresh=df.shape[1]-7)
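Note that thresh is the number of non-NA values a row must have to survive, so it is worth double-checking the boundary against "delete rows with 7 or more NaNs". A hedged alternative that states the condition directly in terms of NaN counts, mirroring the isnull().sum(axis=1) pattern used earlier:
# Keep only rows with fewer than 7 NaN values,
# i.e. drop every row that has 7 or more NaNs.
df = df[df.isnull().sum(axis=1) < 7]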