I keep writing code like this when I want a specific column in a row, where I select the row first based on another column:
my_col = "Value I am looking for"
df.loc[df["Primary Key"] == "blablablublablablalblalblallaaaabblablalblabla"].iloc[0][
my_col
]
I don't know why, but it seems weird. Is there a more beautiful solution to this?
It would be helpful to have a complete minimal working example, since it is not clear what your data structure looks like. You could use the example given here:
import pandas as pd
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
If you are then trying, for example, to select the viper row based on its max_speed and then obtain its shield value, like so:
my_col = "shield"
df.loc[df["max_speed"] == 4].iloc[0][my_col]
then I guess that is the way to do that - not a lot of fat in that command.
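If you would rather avoid the chained indexing, an equivalent one-liner - just a sketch against the example frame above - passes the row mask and the column label to a single .loc call and then takes the first match:

my_col = "shield"
# one indexing step: row mask and column label together
value = df.loc[df["max_speed"] == 4, my_col].iloc[0]
print(value)  # 5

Same result, one lookup instead of three.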
Situation:
1. all_task_usage_10_19
all_task_usage_10_19 is a file consisting of 29,229,472 rows × 20 columns.
There are multiple rows with the same ID in the machine_ID column, with different values in the other columns.
Columns:
'start_time_of_the_measurement_period', 'end_time_of_the_measurement_period',
'job_ID', 'task_index', 'machine_ID', 'mean_CPU_usage_rate',
'canonical_memory_usage', 'assigned_memory_usage',
'unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage',
'maximum_memory_usage', 'mean_disk_I/O_time', 'mean_local_disk_space_used',
'maximum_CPU_usage', 'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
'memory_accesses_per_instruction_(MAI)', 'sample_portion',
'aggregation_type', 'sampled_CPU_usage'
2. clustering code
I am trying to cluster multiple machine_ID records using the following code, referencing: How to combine multiple rows into a single row with pandas
3. Output
Output displayed using pd.option_context, as it allows better visualisation of the content.
My Aim:
I am trying to cluster multiple rows with the same machine_ID into a single record, so that I can apply algorithms like moving averages, LSTM and HW for predicting cloud workloads.
Something like this.
Maybe a Multi-Index is what you're looking for?
df.set_index(['machine_ID', df.index])
Note that by default set_index returns a new dataframe, and does not change the original.
To change the original (and return None) you can pass an argument inplace=True.
Example:
import pandas as pd

df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
                   'a': [1, 2, 3, 4, 5],
                   'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index]) # not in-place
df.set_index(['machine_ID', df.index], inplace=True) # in-place
For me, it does create a multi-index: first level is 'machine_ID', second one is the previous range index:
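With the example above, the frame prints roughly like this:

              a   b
machine_ID
1          0  1  10
           1  2  20
2          2  3  30
           3  4  40
3          4  5  50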
The below code worked for me:
all_task_usage_10_19.groupby('machine_ID')[[
    'start_time_of_the_measurement_period', 'end_time_of_the_measurement_period',
    'job_ID', 'task_index', 'mean_CPU_usage_rate', 'canonical_memory_usage',
    'assigned_memory_usage', 'unmapped_page_cache_memory_usage',
    'total_page_cache_memory_usage', 'maximum_memory_usage',
    'mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage',
    'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
    'memory_accesses_per_instruction_(MAI)', 'sample_portion',
    'aggregation_type', 'sampled_CPU_usage'
]].agg(list).reset_index()
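As a sanity check on a toy frame (hypothetical values, same shape of problem), agg(list) collapses each machine_ID group into a single row of lists:

import pandas as pd

# toy stand-in for all_task_usage_10_19
df = pd.DataFrame({'machine_ID': [1, 1, 2],
                   'mean_CPU_usage_rate': [0.1, 0.2, 0.3]})

out = df.groupby('machine_ID')[['mean_CPU_usage_rate']].agg(list).reset_index()
print(out)
#    machine_ID mean_CPU_usage_rate
# 0           1          [0.1, 0.2]
# 1           2               [0.3]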
I have a problem: I would like to split a column of a pandas dataframe into several columns that contain only binary values (0 or 1), so that I can continue working with them. The problem is as follows (very simplified):
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3, 4, 5, 6],
                   'c2': [0, "A", "B", "A, B", "B", "C"]},
                  columns=['c1', 'c2'])
However, I cannot work with the second column in this form and would therefore like to split c2 into several columns: c2_A, c2_B and c2_C. If, for example, the second column contains "A", then c2_A should contain a 1 and otherwise a 0. If, as in the fourth row, it contains "A, B", then there should be a 1 in both c2_A and c2_B and a 0 in c2_C.
I have already tried a lot, for example with if/else, but I failed as soon as there is more than one letter in a row (e.g. "A, B").
It would be great if someone could help me, because I'm really running out of ideas.
PS: I am also rather a Python newbie.
Thanks in advance!
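One way to get exactly those indicator columns is pandas' built-in str.get_dummies - a minimal sketch, assuming the multi-valued entries are separated by ", " as in the example:

import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3, 4, 5, 6],
                   'c2': [0, "A", "B", "A, B", "B", "C"]})

# astype(str) makes the 0 entry safe to split; get_dummies then
# creates one 0/1 column per distinct value
dummies = df['c2'].astype(str).str.get_dummies(sep=', ')
# the "0" column is not a real category, so drop it and prefix the rest
dummies = dummies.drop(columns='0', errors='ignore').add_prefix('c2_')
result = pd.concat([df, dummies], axis=1)
print(result)

If your real data uses a bare comma without the space, pass sep=',' and strip the whitespace first.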
I want to filter a pandas data frame based on an exact string match.
I have a data frame as below:
import pandas as pd

df1 = pd.DataFrame({'vals': [1, 2, 3, 4, 5],
                    'ids': ['aball', 'bball', 'cnut', 'fball', 'aballl']})
I want to filter out all the rows except the one that has 'aball'. As you can see, I have one more entry with ids == 'aballl'; I want that filtered out too. Hence the code below does not work:
df1[df1['ids'].str.contains("aball")]
Even str.match does not work:
df1[df1['ids'].str.match("aball")]
Any help would be greatly appreciated.
Keeping it simple, this should work:
df1[df1['ids'] == "aball"]
You can try this:
df1[~(df1['ids'] == "aball")]
Essentially it will find all entries matching "aball" and then negate it.
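As an aside, contains and match fail here because they do substring and prefix matching respectively. If you do want a string method (for example, to ignore case), str.fullmatch - available since pandas 1.1 - anchors the pattern to the whole string:

import pandas as pd

df1 = pd.DataFrame({'vals': [1, 2, 3, 4, 5],
                    'ids': ['aball', 'bball', 'cnut', 'fball', 'aballl']})

# fullmatch requires the whole string to match, so 'aballl' is excluded
print(df1[df1['ids'].str.fullmatch('aball')])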
Is there any way in Excel or in DAX to check whether the values of a single column exist in another column?
Example - I have a column called Column 1 with some values, like 4, 5, 2, 1. Now I want to check how many of those values exist in Column 2.
As an output, I expect a cell to go green if its value exists, else red.
I have looked in a lot of places, but the only useful results I have found check for a single value, not for all the values in a column.
Does anyone know a way of doing this?
Since you mention Python, this is possible programmatically with the Pandas library:
import numpy as np
import pandas as pd

# define dataframe, or read in via df = pd.read_excel('file.xlsx')
df = pd.DataFrame({'col1': [4, 5, 2, 1] + [np.nan]*4,
                   'col2': [6, 8, 3, 4, 1, 6, 3, 4]})

# define highlighting logic: green if the value appears in col2, red otherwise
def highlight_cols(x):
    res = []
    for i in x:
        if np.isnan(i):
            res.append('')
        elif i in set(df['col2']):
            res.append('background: green')
        else:
            res.append('background: red')
    return res

# apply highlighting logic to the first column only
res = df.style.apply(highlight_cols, subset=pd.IndexSlice[:, ['col1']])
Result:
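If you need the highlighting in an actual Excel file rather than a notebook, the styled object can be written out directly - a sketch assuming openpyxl is installed (the Excel writer may want the explicit 'background-color' CSS property rather than the 'background' shorthand used above):

# hypothetical output path; requires openpyxl
res.to_excel('highlighted.xlsx', engine='openpyxl')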
Create an (optionally hidden) column adjacent to your search column (in my example, column C next to column B):
=IF(ISERROR(VLOOKUP(B1,$A$1:$A$4, 1, 0)), FALSE, TRUE)
This will determine whether the value is contained within the first data list (it returns TRUE if it is).
And then just use simple conditional formatting
Provides the result as expected:
You can do this easily without adding hidden columns, as below. This will update any time you change the numbers in column A.
Select column B
Conditional Formatting -> New Rule -> Use a formula to determine which cells to format
insert the formula =OR(B2=$A$2,B2=$A$3,B2=$A$4,B2=$A$5)=TRUE and format the cell as you wish (here in green)
Repeat steps 1 to 2
insert the formula =OR(B2=$A$2,B2=$A$3,B2=$A$4,B2=$A$5)=FALSE and format the cells as you wish (here in red)
Select the column name cell (to remove the formatting from the column heading)
Conditional Formatting -> Clear Rule -> Clear Rules from selected cells
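A more scalable variant of the same rule, in case the list in column A grows, is a single COUNTIF condition (the ranges match the example above and should be adjusted to your data):

=COUNTIF($A$2:$A$5,B2)>0

for the green rule, and

=COUNTIF($A$2:$A$5,B2)=0

for the red one.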