Noob question, but I have a column in a pandas DataFrame that I want to aggregate (a running total) into a new column.
I'm also trying to create a column that takes n * (average value), and a column for the difference between the two.
How...?
I've added a link to a picture of the dataset to illustrate. VERY new to Python/Jupyter Notebook!
Thanks in advance! :)
import pandas as pd

n = pd.Series([1, 2, 3, 4, 5])
a = pd.Series([1, 2, 4, 6, 11])
cumsum = a.cumsum()        # running total of the data
average_n = n * a.mean()   # n times the overall mean
diff = average_n - cumsum  # difference between the two
df = pd.concat([n, a, cumsum, average_n, diff], axis=1)
df.columns = ["n", "data", "cumsum", "average_n", "diff"]
df
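Running this produces the table below; the mean of the data column is 24 / 5 = 4.8, so average_n is n * 4.8 and diff is average_n minus the running total:

   n  data  cumsum  average_n  diff
0  1     1       1        4.8   3.8
1  2     2       3        9.6   6.6
2  3     4       7       14.4   7.4
3  4     6      13       19.2   6.2
4  5    11      24       24.0   0.0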
Related
I want to filter out ids that do not appear exactly 3 times in the dataset below.
I thought of using groupby and transform('size'), but that doesn't work.
Why?
import pandas as pd

data = pd.DataFrame({'id':   [0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
                     'info': [23, 22, 12, 12, 14, 23, 11, 2, 98, 76, 46, 341, 12]})
data[data.groupby(['id']).transform('size') == 3]  # doesn't work
Specify the column after groupby:
df = data[data.groupby(['id'])['id'].transform('size') == 3]
Alternative:
df = data[data['id'].map(data['id'].value_counts()) == 3]
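To see why the first attempt fails: without selecting a column, transform returns a DataFrame (one column per non-grouping column, here just 'info'), and indexing a dataframe with a boolean DataFrame masks cells element-wise instead of selecting rows. Selecting ['id'] first makes transform return a Series, which acts as a row mask. A quick check on the data from the question:

data.groupby(['id']).transform('size')        # DataFrame, only the 'info' column
data.groupby(['id'])['id'].transform('size')  # Series, one group size per row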
Situation:
1. all_task_usage_10_19
all_task_usage_10_19 is a file consisting of 29,229,472 rows × 20 columns.
The machine_ID column contains multiple rows with the same ID but different values in the other columns.
Columns:
'start_time_of_the_measurement_period', 'end_time_of_the_measurement_period',
'job_ID', 'task_index', 'machine_ID', 'mean_CPU_usage_rate',
'canonical_memory_usage', 'assigned_memory_usage',
'unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage',
'maximum_memory_usage', 'mean_disk_I/O_time', 'mean_local_disk_space_used',
'maximum_CPU_usage', 'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
'memory_accesses_per_instruction_(MAI)', 'sample_portion',
'aggregation_type', 'sampled_CPU_usage'
2. clustering code
I am trying to cluster multiple machine_ID records using the following code, referencing: How to combine multiple rows into a single row with pandas
3. Output
The output is displayed using pd.option_context, as it makes the content easier to visualise.
My Aim:
I am trying to cluster multiple rows with the same machine_ID into a single record so that I can apply algorithms like moving averages, LSTM, and Holt-Winters (HW) to predict cloud workloads.
Something like this.
Maybe a Multi-Index is what you're looking for?
df.set_index(['machine_ID', df.index])
Note that by default set_index returns a new dataframe, and does not change the original.
To change the original (and return None) you can pass an argument inplace=True.
Example:
import pandas as pd

df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
                   'a': [1, 2, 3, 4, 5],
                   'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index])       # not in-place
df.set_index(['machine_ID', df.index], inplace=True)  # in-place
For me, it does create a multi-index: first level is 'machine_ID', second one is the previous range index:
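Printed, the example above should look roughly like this:

              a   b
machine_ID
1          0  1  10
           1  2  20
2          2  3  30
           3  4  40
3          4  5  50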
The below code worked for me:
all_task_usage_10_19.groupby('machine_ID')[
    ['start_time_of_the_measurement_period', 'end_time_of_the_measurement_period',
     'job_ID', 'task_index', 'mean_CPU_usage_rate', 'canonical_memory_usage',
     'assigned_memory_usage', 'unmapped_page_cache_memory_usage',
     'total_page_cache_memory_usage', 'maximum_memory_usage',
     'mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage',
     'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
     'memory_accesses_per_instruction_(MAI)', 'sample_portion',
     'aggregation_type', 'sampled_CPU_usage']
].agg(list).reset_index()
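A toy version of the same pattern may make the effect clearer. This is a sketch on a made-up two-column frame, not your real data: agg(list) collapses all rows sharing a machine_ID into a single record of lists, roughly like so.

import pandas as pd

df = pd.DataFrame({'machine_ID': [1, 1, 2],
                   'mean_CPU_usage_rate': [0.1, 0.3, 0.2]})
df.groupby('machine_ID')[['mean_CPU_usage_rate']].agg(list).reset_index()
#    machine_ID mean_CPU_usage_rate
# 0           1          [0.1, 0.3]
# 1           2               [0.2]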
I have a large dataframe with over 4 million rows and multiple columns. Column X may contain NaN. I want to first filter down to the rows where column X is NaN, then split the result into smaller segments for processing. However, when I use both loc and iloc, a SettingWithCopyWarning is raised. How can I code around this problem?
The reason for segmenting is to export the dataframe to CSV every time a segment is processed, to prevent extensive data loss if an error occurs.
My code is the following:
filtered_df = initdf.loc[initdf['x'].isnull(), :]
for i in range(0, len(filtered_df.index), 2000):
    filtered_df_chunk = filtered_df.iloc[i:i+2000]
    # Code to edit the chunk
    initdf.update(filtered_df_chunk, overwrite=False)
Is there a better way to avoid the SettingWithCopyWarning while still being able to filter and segment the initial dataframe?
Edit: an initial omission, although I don't think it changes the answer: the exported dataframe is the initial one, after the chunk changes have been integrated into it using df.update.
Many thanks!
Here's my initial take on this, using a simplified example.
import numpy as np
import pandas as pd

data = {
    "a": [1, 7, 3, np.nan, 8, 3, 9, 9, 3, np.nan, 4, 3],
    "b": np.arange(12),
}  # a random DataFrame with some NaN values
df = pd.DataFrame(data)
df_no_nan = df[df["a"].notna()]  # drop the rows where "a" is NaN

def chunk_operation(df, chunk_size):
    split_points = np.arange(0, len(df), chunk_size)
    for i, split in enumerate(split_points):
        # .copy() gives the chunk its own data, so editing it below
        # does not raise a SettingWithCopyWarning
        chunk = df.iloc[split:split + chunk_size].copy()
        chunk["a"] = chunk["a"] * 5
        chunk.to_csv(rf"\some_path_{i}.csv")  # one file per chunk

chunk_operation(df_no_nan, 3)
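If, as in your edit, the frame you ultimately export is the original one, note that each chunk keeps its original index, so the edits can be folded back with the same df.update call you already use. A sketch of the loop body, assuming the original full frame is passed in as full_df (a hypothetical name, as is the checkpoint path):

    chunk = df.iloc[split:split + chunk_size].copy()
    chunk["a"] = chunk["a"] * 5
    full_df.update(chunk)               # aligns rows on the original index
    full_df.to_csv(r"\checkpoint.csv")  # checkpoint the whole frame after each chunk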
I need to create a duplicate of each row in a dataframe, apply some basic operations to the duplicated row, and then combine these duplicated rows along with the originals back into a dataframe.
I'm trying to use apply for this, and the print shows that it's working correctly, but when I return these 2 rows from the function and the dataframe is assembled I get the error message "cannot copy sequence with size 7 to array axis with dimension 2". It is as if it's trying to fit these 2 new rows back into the original 1-row slot. Any insight on how I can achieve this within apply (and not by iterating over every row in a loop)?
def f(x):
    x_cpy = x.copy()
    x_cpy['A'] = x['B']
    print(pd.concat([x, x_cpy], axis=1).T.reset_index(drop=True))
    #return pd.concat([x, x_cpy], axis=1).T.reset_index(drop=True)

hld_pos.apply(f, axis=1)
The apply function of pandas operates along an axis. With axis=1, it is called once for every row. To do something like what you're trying to do, think about how you would construct a new row from your existing row. Something like this should work:
import pandas as pd

my_df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 6]})

def f(row):
    """Return a new row with the items of the old row squared"""
    return pd.Series({'a': row['a'] ** 2, 'b': row['b'] ** 2})

new_df = my_df.apply(f, axis=1)
combined = pd.concat([my_df, new_df], axis=0)
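If you want each squared duplicate to sit right next to its original rather than after all of them, a stable sort on the shared index should interleave the two frames (a sketch):

combined = pd.concat([my_df, new_df]).sort_index(kind='mergesort').reset_index(drop=True)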
I would like to know how to efficiently add a new row to a dataframe.
Assuming I have an empty dataframe with columns "A" and "B":
columns = ['A','B']
user_list = pd.DataFrame(columns=columns)
I want to add one row like {A: 3, B: 4} to the dataframe. What is the most efficient way to do that?
import numpy as np
import pandas as pd

columns = ['A', 'B']
user_list = pd.DataFrame(np.full((1000, 2), np.nan), columns=columns)  # preallocate 1000 NaN rows
user_list.iloc[0] = [3, 4]
user_list.iloc[1] = [4, 5]
Pandas doesn't have built-in resizing, but it will ignore NaNs pretty well. You'll have to manage your own resizing, though :/
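If you can't preallocate because the final size is unknown, a common pattern is to collect the rows in a plain Python list and build the DataFrame once at the end; growing a DataFrame row by row re-copies its data every time. A minimal sketch:

rows = []
rows.append({'A': 3, 'B': 4})  # append as many rows as needed
rows.append({'A': 4, 'B': 5})
user_list = pd.DataFrame(rows, columns=['A', 'B'])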