Faster way to assign new values in pandas dataframe - python

I was wondering if there is a faster way to assign new values to cells in a pandas dataframe conditional on the value of another cell. For example, take this df:
df = pd.DataFrame({'rank':[1, 1, 1, 1, 2, 2, 2, 2], 'condition':[.01, .01, .01, .01, .01, .01, .01, .01]})
The following code works:
def changerank(row):
    if (row['condition'] == 0) & (row['rank'] > 1):
        row['rank'] = 1
    return row

df = df.apply(changerank, axis=1)
But it is rather slow on my real dataframe, which contains millions of rows. I feel like there must be another way to change the values of 'rank' depending on other values in the row.
Thanks for any thoughts!

You can use boolean indexing with .loc:
df.loc[(df['condition'] == 0) & (df['rank'] > 1), 'rank'] = 1
Note that the column has to be accessed as df['rank'] here, because df.rank is the DataFrame.rank method rather than the column. (The older .ix indexer worked the same way, but it is deprecated and has been removed in recent pandas versions.)
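If you prefer numpy, np.where gives an equivalent vectorized assignment. A minimal sketch (I've zeroed the last few condition values of the sample frame so the mask actually matches something):
import numpy as np
import pandas as pd

df = pd.DataFrame({'rank': [1, 1, 1, 1, 2, 2, 2, 2],
                   'condition': [.01, .01, .01, .01, 0, 0, 0, 0]})

# One vectorized pass instead of a Python-level function call per row,
# which is what makes this fast on millions of rows.
df['rank'] = np.where((df['condition'] == 0) & (df['rank'] > 1), 1, df['rank'])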


How to combine multiple rows into a single row with many columns in pandas using an id (clustering multiple records with same id into one record)

Situation:
1. all_task_usage_10_19
all_task_usage_10_19 is the file, which consists of 29229472 rows × 20 columns.
There are multiple rows with the same ID in the machine_ID column, with different values in the other columns.
Columns:
'start_time_of_the_measurement_period', 'end_time_of_the_measurement_period', 'job_ID', 'task_index', 'machine_ID', 'mean_CPU_usage_rate', 'canonical_memory_usage', 'assigned_memory_usage', 'unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage', 'maximum_memory_usage', 'mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage', 'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)', 'memory_accesses_per_instruction_(MAI)', 'sample_portion', 'aggregation_type', 'sampled_CPU_usage'
2. clustering code
I am trying to cluster multiple machine_ID records using the following code, referencing: How to combine multiple rows into a single row with pandas
3. Output
The output is displayed using with pd.option_context(...), as it allows the content to be visualised better.
My Aim:
I am trying to cluster multiple rows with the same machine_ID into a single record, so I can apply algorithms like Moving averages, LSTM and HW for predicting cloud workloads.
Something like this.
Maybe a Multi-Index is what you're looking for?
df.set_index(['machine_ID', df.index])
Note that by default set_index returns a new dataframe, and does not change the original.
To change the original (and return None) you can pass an argument inplace=True.
Example:
df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
                   'a': [1, 2, 3, 4, 5],
                   'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index])       # not in-place
df.set_index(['machine_ID', df.index], inplace=True)  # in-place
For me, it does create a multi-index: first level is 'machine_ID', second one is the previous range index:
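With the example frame above, the printed result looks roughly like this:
              a   b
machine_ID
1          0  1  10
           1  2  20
2          2  3  30
           3  4  40
3          4  5  50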
The below code worked for me:
all_task_usage_10_19.groupby('machine_ID')[[
    'start_time_of_the_measurement_period', 'end_time_of_the_measurement_period',
    'job_ID', 'task_index', 'mean_CPU_usage_rate', 'canonical_memory_usage',
    'assigned_memory_usage', 'unmapped_page_cache_memory_usage',
    'total_page_cache_memory_usage', 'maximum_memory_usage',
    'mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage',
    'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
    'memory_accesses_per_instruction_(MAI)', 'sample_portion',
    'aggregation_type', 'sampled_CPU_usage']].agg(list).reset_index()
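As a small illustration of what agg(list) produces, here is a toy frame with made-up values standing in for the real trace:
import pandas as pd

df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
                   'mean_CPU_usage_rate': [0.1, 0.2, 0.3, 0.4, 0.5]})

out = df.groupby('machine_ID')[['mean_CPU_usage_rate']].agg(list).reset_index()
print(out)
#    machine_ID mean_CPU_usage_rate
# 0           1          [0.1, 0.2]
# 1           2          [0.3, 0.4]
# 2           3               [0.5]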

Filter a dataframe by both column value and row number

I have a large dataframe with over 4 million rows and multiple columns. Column X may have a value of NaN. I want to first filter the rows on column X, then split the dataframe into smaller segments for processing. However, if I use both loc and iloc, a SettingWithCopyWarning is raised. How can I code around this problem?
The reason for segmenting is to extract the dataframe in CSV every time a segment is processed to prevent extensive data loss if an error occurs.
My code is the following:
filtered_df = initdf.loc[initdf['x'].isnull(), :]
for i in range(0, len(filtered_df.index), 2000):
    filtered_df_chunk = filtered_df.iloc[i:i+2000]
    # Code to edit the chunk
    initdf.update(filtered_df_chunk, overwrite=False)
Is there a better way to avoid the SettingWithCopyWarning while still being able to filter and segment the initial dataframe?
Edit: an initial omission, although I don't think it changes the answer: the exported dataframe is the initial one, after the chunk changes have been integrated into it using df.update.
Many thanks!
Here's my initial take on this. Using a simplified example.
import numpy as np
import pandas as pd

# Creating a random DataFrame with NaN values
df = pd.DataFrame({
    "a": [1, 7, 3, np.nan, 8, 3, 9, 9, 3, np.nan, 4, 3],
    "b": np.arange(12),
})
df_no_nan = df[df["a"].notna()]  # Removing rows where column "a" is NaN

def chunk_operation(df, chunk_size):
    split_points = np.arange(0, len(df), chunk_size)
    for chunk in (df.iloc[split:split + chunk_size] for split in split_points):
        chunk["a"] * 5               # placeholder for the real per-chunk edits
        chunk.to_csv(r"\some_path")  # export each chunk as it is processed

chunk_operation(df_no_nan, 3)
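A further note beyond the answer above: the warning itself usually goes away if the filtered frame is an explicit copy, since edits to its chunks are then unambiguously writes to a new object rather than to a view of initdf:
filtered_df = initdf.loc[initdf['x'].isnull()].copy()  # explicit copy, so chunk edits
                                                       # no longer look like writes to a view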

Is there a way to use a method/function as an expression for .loc() in pandas?

I've been going crazy trying to figure this out. I'm trying to avoid using df.iterrows() to iterate through the rows of a dataframe, as it's quite time consuming and .loc() is better from what I've seen.
I know this works:
df = df.loc[df.number == 3, :]
And that'll basically set df to the rows where the "number" column is equal to 3.
But, I get an error when I try something like this:
df = df.loc[someFunction(df.number), :]
What I want is to get every row where someFunction() returns True whenever the "number" value of said row is set as the parameter.
For some reason, it's passing the entire column (the dataframe's entire "number" column, in this example) instead of each row's value as it iterates through the rows, like in the previous example.
Again, I know I can just use a for loop and .iterrows(), but I'm working with around 280,000 rows and it just takes longer than I'd like. Also have tried using a lambda function among other things.
Apply is slow. If you can, put the vectorized logic in a function that takes whole Series as arguments:
import pandas as pd
df = pd.DataFrame()
df['a'] = [7, 6, 5, 4, 3, 2]
df['b'] = [1, 2, 3, 4, 5, 6]
def my_func(series1, series2):
    return (series2 > 3) | (series1 == series2)
df.loc[my_func(df.b, df.a), 'new_column_name'] = True
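For reference, printing df after the assignment shows the flag set only where the mask holds, roughly:
   a  b new_column_name
0  7  1             NaN
1  6  2             NaN
2  5  3             NaN
3  4  4            True
4  3  5            True
5  2  6            True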
I think this is what you need:
import pandas as pd
df = pd.DataFrame({"number": [x for x in range(10)]})
def someFunction(row):
    if row > 5:
        return True
    else:
        return False

df = df.loc[df.number.apply(someFunction)]
print(df)
Output:
   number
6       6
7       7
8       8
9       9
You can use an anonymous function with .loc
x refers to the dataframe you are indexing
df.loc[lambda x: x.number > 5, :]
Two options I can think of:
1. Create a new column using the pandas apply() method and a lambda function that returns either True or False depending on someFunction(). Then use loc to filter on the new column you just created.
2. Use a for loop and df.itertuples(), as it is far faster than iterrows(). Make sure to look up the documentation, as the syntax is slightly different for itertuples; see the sketch below.
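A minimal sketch of the itertuples option, assuming someFunction is a simple scalar predicate (the name is taken from the question):
import pandas as pd

df = pd.DataFrame({'number': range(10)})

def someFunction(n):  # stand-in for the asker's predicate
    return n > 5

# itertuples yields lightweight namedtuples, much cheaper than the Series
# that iterrows builds for every row
keep = [row.Index for row in df.itertuples() if someFunction(row.number)]
df = df.loc[keep]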
Just using something like this will work:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['number'] = np.arange(10)
display(df[df['number'] > 5])
display(df[(df['number'] > 2) & (df['number'] < 7)])
Note that display() comes from IPython/Jupyter (use print() in a plain script), and that the two conditions are combined with & rather than chained, since chained boolean indexing triggers a reindexing warning.

Multiplying columns by another column in a dataframe

(Full disclosure that this is related to another question I asked, so bear with me if I should have appended it to what I wrote previously, even though the problem is different.)
I have a dataframe consisting of a column of weights and columns containing binary values of 0 and 1. I'd like to multiply every column within the dataframe by the weights column. However, I seem to be replacing every column within the dataframe with the weight column. I'm sure I'm missing something incredibly stupid/basic here--I'm rather new to pandas and python as a whole. What am I doing wrong?
celebfile = pd.read_csv(celebcsv)
celebframe = pd.DataFrame(celebfile)
behaviorfile = pd.read_csv(behaviorcsv)
behaviorframe = pd.DataFrame(behaviorfile)
celebbehavior = pd.merge(celebframe, behaviorframe, how ='inner', on = 'RespID')
celebbehavior2 = celebbehavior.copy()
def multiplycolumns(column):
    for column in celebbehavior:
        return celebbehavior[column]*celebbehavior['WEIGHT']

celebbehavior2 = celebbehavior2.apply(lambda column: multiplycolumns(column), axis=0)
print(celebbehavior2.head())
You have a return statement inside the for loop, which means the loop executes only once. To multiply a data frame by a column, you can use the mul method with the correct axis parameter:
celebbehavior.mul(celebbehavior['WEIGHT'], axis=0)
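On a toy frame (made-up numbers) the effect looks like this; note that WEIGHT itself gets multiplied too:
import pandas as pd

df = pd.DataFrame({'WEIGHT': [0.5, 2.0], 'x': [1, 1], 'y': [0, 1]})
print(df.mul(df['WEIGHT'], axis=0))
#    WEIGHT    x    y
# 0    0.25  0.5  0.0
# 1    4.00  2.0  2.0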
read_csv
read_csv already returns a pd.DataFrame, so it is not necessary to wrap it in pd.DataFrame again.
mul with axis=0
You can use apply, but that is awkward. Use mul(axis=0); this should be all you need:
df = pd.read_csv(celebcsv).merge(pd.read_csv(behaviorcsv), on='RespID')
df = df.mul(df.WEIGHT, 0)
You said that it looks like you are just replacing every column with the weights column? Are your other columns all ones?
You can use the mul method to multiply the columns. However, just FYI, if you do want to use apply, bear the following in mind:
The apply function passes each series in the dataframe to the function; this looping is inherent to apply, so the loop inside your function is redundant. The return statement inside that loop is what causes the behaviour you do not want.
Since each column is passed as the argument automatically, all you need to do is tell the function what to multiply it by: in this case, your weights series.
Here is an implementation using apply. The undesirable part, of course, is that the weights are also multiplied by themselves:
df = pd.DataFrame({'1': [1, 1, 0, 1],
                   '2': [0, 0, 1, 0],
                   'weights': [0.5, 0.25, 0.1, 0.05]})

def multiply_columns(column, weights):
    return column * weights

df.apply(lambda x: multiply_columns(x, df['weights']))
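If multiplying the weights by themselves is unwanted, one option (an addition beyond the answer above) is to drop that column before multiplying:
df.drop(columns='weights').mul(df['weights'], axis=0)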

Pandas DataFrame: duplicate each row, apply changes to the duplicate and combine back into a dataframe

I need to create a duplicate for each row in a dataframe, apply some basic operations to the duplicate row and then combine these dupped rows along with the originals back into a dataframe.
I'm trying to use apply for this, and the print shows that it's working correctly, but when I return these 2 rows from the function and the dataframe is assembled, I get the error "cannot copy sequence with size 7 to array axis with dimension 2". It is as if it's trying to fit these 2 new rows back into the original 1-row slot. Any insight on how I can achieve this within apply (and not by iterating over every row in a loop)?
def f(x):
    x_cpy = x.copy()
    x_cpy['A'] = x['B']
    print(pd.concat([x, x_cpy], axis=1).T.reset_index(drop=True))
    #return pd.concat([x, x_cpy], axis=1).T.reset_index(drop=True)

hld_pos.apply(f, axis=1)
The apply function of pandas operates along an axis. With axis=1, it operates along every row. To do something like what you're trying to do, think of how you would construct a new row from your existing row. Something like this should work:
import pandas as pd
my_df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 6]})
def f(row):
    """Return a new row with the items of the old row squared"""
    return pd.Series({'a': row['a'] ** 2, 'b': row['b'] ** 2})

new_df = my_df.apply(f, axis=1)
combined = pd.concat([my_df, new_df], axis=0)
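For the specific transformation in the question (copying column B into A on the duplicate), here is a sketch that skips apply entirely; the hld_pos frame below is stand-in data with made-up columns:
import pandas as pd

hld_pos = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})  # stand-in data

dup = hld_pos.copy()
dup['A'] = dup['B']  # the change applied to each duplicated row
combined = pd.concat([hld_pos, dup], ignore_index=True)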
