What is a good method to count all Null / NaN values, by column, when using Koalas?
Or, stated another way:
How might I return a list, by column, of total null value counts? I am trying to avoid converting the dataframe to Spark or pandas if possible.
NOTE: .sum() omits null values in Koalas (skipna: boolean, default True, and it cannot be changed to False), so running df1.isnull().sum() is out of the question.
numpy was suggested as an alternative, but because the dataframe is in Koalas I observed that .sum() was still omitting the NaN values.
Disclaimer: I get that I can run pandas on Spark, but I understand that is counterproductive resource-wise. I also hesitate to compute the sums from a Spark or pandas dataframe and then convert that dataframe into Koalas (again wasting resources, in my opinion). I'm working with a dataset that contains 73 columns and 4 million rows.
You can actually use df.isnull(). The reason is that it returns an "array" of booleans indicating whether a value is missing; therefore, if you first call isnull and then sum, you will get the correct count.
Example:
import databricks.koalas as ks

df = ks.DataFrame(
    [[1, 3, 9],
     [2, 3, 7],
     [3, None, 3]],
    columns=["c1", "c2", "c3"],
)
df.isnull().sum()
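On the small example above, the per-column counts should come out roughly like this (c2 is the only column containing a missing value):
c1    0
c2    1
c3    0
dtype: int64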
When grouping a Pandas DataFrame, when should I use transform and when should I use aggregate? How do they differ with respect to their application in practice, and which one do you consider more important?
Consider the dataframe df:
df = pd.DataFrame(dict(A=list('aabb'), B=[1, 2, 3, 4], C=[0, 9, 0, 9]))
groupby is the standard aggregation use case:
df.groupby('A').mean()
Maybe you want these values broadcast across the whole group, returning something with the same index as what you started with.
Use transform:
df.groupby('A').transform('mean')
df.set_index('A').groupby(level='A').transform('mean')
agg is used when you have specific things you want to run on different columns, or more than one thing to run on the same column.
df.groupby('A').agg(['mean', 'std'])
df.groupby('A').agg(dict(B='sum', C=['mean', 'prod']))
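To make the shape difference concrete, here is a minimal sketch on the same df; the commented output is what the calls return (agg collapses to one row per group, transform keeps the original index):
import pandas as pd

df = pd.DataFrame(dict(A=list('aabb'), B=[1, 2, 3, 4], C=[0, 9, 0, 9]))

df.groupby('A').mean()               # one row per group
#      B    C
# A
# a  1.5  4.5
# b  3.5  4.5

df.groupby('A').transform('mean')    # same length and index as df
#      B    C
# 0  1.5  4.5
# 1  1.5  4.5
# 2  3.5  4.5
# 3  3.5  4.5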
Situation:
1. all_task_usage_10_19
all_task_usage_10_19 is a file consisting of 29,229,472 rows × 20 columns.
There are multiple rows with the same ID in the machine_ID column, with different values in the other columns.
Columns:
'start_time_of_the_measurement_period','end_time_of_the_measurement_period', 'job_ID', 'task_index','machine_ID', 'mean_CPU_usage_rate','canonical_memory_usage', 'assigned_memory_usage','unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage', 'maximum_memory_usage','mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage','maximum_disk_IO_time', 'cycles_per_instruction_(CPI)', 'memory_accesses_per_instruction_(MAI)', 'sample_portion',
'aggregation_type', 'sampled_CPU_usage'
2. clustering code
I am trying to cluster multiple machine_ID records using the following code, referencing: How to combine multiple rows into a single row with pandas
3. Output
Output is displayed using with option_context, as it makes the content easier to visualise.
My Aim:
I am trying to cluster multiple rows with the same machine_ID into a single record, so I can apply algorithms like Moving averages, LSTM and HW for predicting cloud workloads.
Something like this.
Maybe a Multi-Index is what you're looking for?
df.set_index(['machine_ID', df.index])
Note that by default set_index returns a new dataframe, and does not change the original.
To change the original (and return None) you can pass an argument inplace=True.
Example:
import pandas as pd

df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
                   'a': [1, 2, 3, 4, 5],
                   'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index]) # not in-place
df.set_index(['machine_ID', df.index], inplace=True) # in-place
For me, it does create a multi-index: first level is 'machine_ID', second one is the previous range index:
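For the example above, new_df should look roughly like this (exact spacing varies by pandas version):
              a   b
machine_ID
1          0  1  10
           1  2  20
2          2  3  30
           3  4  40
3          4  5  50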
The below code worked for me:
all_task_usage_10_19.groupby('machine_ID')[[
    'start_time_of_the_measurement_period', 'end_time_of_the_measurement_period',
    'job_ID', 'task_index', 'mean_CPU_usage_rate', 'canonical_memory_usage',
    'assigned_memory_usage', 'unmapped_page_cache_memory_usage',
    'total_page_cache_memory_usage', 'maximum_memory_usage',
    'mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage',
    'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
    'memory_accesses_per_instruction_(MAI)', 'sample_portion',
    'aggregation_type', 'sampled_CPU_usage']].agg(list).reset_index()
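For intuition, here is a minimal sketch of what agg(list) produces, using two of the real column names on made-up values (a toy frame, not the actual trace data):
import pandas as pd

toy = pd.DataFrame({'machine_ID': [5, 5, 6],
                    'mean_CPU_usage_rate': [0.1, 0.3, 0.2],
                    'canonical_memory_usage': [0.5, 0.7, 0.4]})

# Each machine_ID collapses into a single row; the other columns become lists.
toy.groupby('machine_ID')[['mean_CPU_usage_rate', 'canonical_memory_usage']].agg(list).reset_index()
#    machine_ID mean_CPU_usage_rate canonical_memory_usage
# 0           5          [0.1, 0.3]             [0.5, 0.7]
# 1           6               [0.2]                  [0.4]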
I have a large dataframe with over 4 million rows and multiple columns. Column X may have a value of NaN. I want to first filter any row where the X column has a value, then split the dataframe into smaller segments for processing. However, if I use both loc and iloc, a SettingWithCopyWarning is raised. How can I code around this problem?
The reason for segmenting is to extract the dataframe to CSV every time a segment is processed, to prevent extensive data loss if an error occurs.
My code is the following:
filtered_df = initdf.loc[initdf['x'].isnull(), :]
for i in range(0, len(filtered_df.index), 2000):
    filtered_df_chunk = filtered_df.iloc[i:i+2000]
    # Code to edit the chunk
    initdf.update(filtered_df_chunk, overwrite=False)
Is there any better way to avoid the SettingWithCopyWarning while still being able to filter and segment the initial dataframe?
Edit: An initial omission, although I don't think it changes the answer: the exported dataframe is the initial one, once the chunk changes have been integrated into it using df.update.
Many thanks!
Here's my initial take on this, using a simplified example.
import numpy as np
import pandas as pd

list_a = {
    "a": [1, 7, 3, np.nan, 8, 3, 9, 9, 3, np.nan, 4, 3],
    "b": np.arange(12)
}  # Creating a random DataFrame with NaN values
df = pd.DataFrame(list_a)
df_no_nan = df[df["a"].isna() == False]  # Keep only rows where column "a" is not NaN

def chunk_operation(df, chunk_size):
    # Start index of each chunk: 0, chunk_size, 2*chunk_size, ...
    split_points = np.arange(0, len(df), chunk_size)
    for chunk in [df.iloc[split:split + chunk_size] for split in split_points]:
        chunk["a"] * 5                # placeholder for whatever edit you apply to the chunk
        chunk.to_csv(r"\some_path")   # export each processed chunk to CSV

chunk_operation(df_no_nan, 3)
While I was working on a column, I came across something very interesting. There are two ways in which I was using pd.DataFrame.isna, for a single column and for multiple columns. When I use the double-bracket form, pd.DataFrame.isna returns the entire dataframe back to me.
override[override.ORIGINAL_CREDITOR_ID.notna()].shape
override[override[['ORIGINAL_CREDITOR_ID']].notna()].shape
So the first line returns 3880 rows and runs in 2.5 ms, whereas the second one returns all the rows present in the override dataframe and takes 3.08 s.
Is there a reason why that is happening? How can I avoid this? I have to make it configurable for passing multiple columns in the second query.
The first line of code is selection with a Boolean Series, while the second is selection with a Boolean DataFrame, and these are handled very differently as DataFrames are 2D and there are 2 axes to align. There's a section dedicated to illustrating this difference in the pandas docs.
In the first case, selection with a Boolean Series, you return all columns for only the rows that are True in the Boolean Series.
In the case of selection with a Boolean DataFrame, you return an object the same shape as the original where the True values in the Boolean DataFrame are kept and any False values are replaced with NaN. (It's actually implemented as DataFrame.where) For rows and columns that don't appear in your Boolean DataFrame mask, those become NaN by default.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                   'b': [10, 11, 12, 13]})
# Boolean Series, return all columns only for rows where the condition is True
df[df['a'] == 2]
# a b
#1 2.0 11
# Boolean DataFrame, equivalent to df.where(df[['a']] == 2)
df[df[['a']] == 2]
# a b
#0 NaN NaN
#1 2.0 NaN
#2 NaN NaN
#3 NaN NaN
So I found a way: once I've got the dataframe of True and False values, I then reduce it by using all or any. You can refer to:
override[override[['ORIGINAL_CREDITOR_ID']].notna().all(1)].shape
This filters the results I want, and much faster too, i.e. in 8 ms.
I found this answer on here, so I hope you find it useful. Let me know if you need more details.
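To make it configurable for multiple columns (as mentioned in the question), the same pattern extends to a list of column names; the second column name below is a placeholder, not one from the actual dataset:
cols = ['ORIGINAL_CREDITOR_ID', 'SOME_OTHER_COLUMN']  # hypothetical column list

# Rows where every listed column is non-null:
override[override[cols].notna().all(axis=1)].shape

# Rows where at least one listed column is non-null:
override[override[cols].notna().any(axis=1)].shape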
data_c["dropoff_district"] = "default value"
data_c["distance"] = "default value" #Formed a new column named distance for geocoder
data_c["time_of_day"] = "default value" #Formed a new column named time of the day for timestamps
So I created these columns at the start of the project for plotting and data manipulation. After I edited and filled these columns with certain values, I wanted to perform a groupby operation on data_c.
avg_d = data_c.groupby(by = 'distance').sum().reset_index()
However, when I perform a groupby on data_c, I somehow lose my 'time_of_day' and 'dropoff_district' columns in avg_d. How can I solve this issue?
The problem is that Pandas doesn't know how to add date/time objects together. Thus, when you tell Pandas to groupby and then sum, it throws out the columns it doesn't know what to do with. Example,
df = pd.DataFrame([['2019-01-01', 2, 3], ['2019-02-02', 2, 4], ['2019-02-03', 3, 5]],
                  columns=['day', 'distance', 'duration'])
df.day = pd.to_datetime(df.day)
If I just run your query, I'd get,
>>> df.groupby('distance').sum()
duration
distance
2 7
3 5
You can fix this by telling Pandas you want to do something different with those columns, for example by taking the first value,
df.groupby('distance').agg({
    'duration': 'sum',
    'day': 'first'
})
which brings them back,
duration day
distance
2 7 2019-01-01
3 5 2019-02-03
Groupby does not remove your columns. The sum() call does. If those columns are not numeric, you will not retain them after sum().
So how would you like to retain the columns 'time_of_day' and 'dropoff_district'? Assuming you still want to keep them as distinct values, put them into the groupby:
data_c.groupby(['distance','time_of_day','dropoff_district']).sum().reset_index()
Otherwise, you will have multiple different 'time_of_day' values for the same 'distance', and you will need to massage your data first.
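A minimal sketch of the multi-key groupby on made-up data (the values and the 'fare' column are illustrative, not from data_c):
import pandas as pd

data_c = pd.DataFrame({'distance': [1.2, 1.2, 3.4],
                       'time_of_day': ['morning', 'morning', 'evening'],
                       'dropoff_district': ['A', 'A', 'B'],
                       'fare': [10, 12, 30]})  # hypothetical numeric column

# Non-numeric columns survive because they are part of the group keys.
avg_d = data_c.groupby(['distance', 'time_of_day', 'dropoff_district']).sum().reset_index()
#    distance time_of_day dropoff_district  fare
# 0       1.2     morning                A    22
# 1       3.4     evening                B    30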