Split one pandas dataframe column into several so that they are binary - python

I have a problem: I would like to split a column of a pandas dataframe into several columns containing only binary values (0 or 1), so that I can continue working with them. The problem is as follows (very simplified):
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3, 4, 5, 6],
                   'c2': [0, "A", "B", "A, B", "B", "C"]},
                  columns=['c1', 'c2'])
However, I cannot work with the second column in this form and would therefore like to split c2 into several columns: c2_A, c2_B and c2_C. If, for example, a row contains "A", then column c2_A should contain a 1 and otherwise a 0. If, as in the fourth row, "A, B" is written, then there should be a 1 in both c2_A and c2_B and a 0 in c2_C.
I have already tried a lot, for example with if/else, but failed as soon as a row contains more than one letter (e.g. "A, B").
It would be great if someone could help me, because I'm really running out of ideas.
PS: I am also rather a Python newbie.
Thanks in advance!
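One way to get there (a minimal sketch, assuming the cells are separated by ", " as in the example) is pandas' Series.str.get_dummies, which does exactly this kind of one-hot split:

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3, 4, 5, 6],
                   'c2': [0, "A", "B", "A, B", "B", "C"]})

# get_dummies splits each cell on the separator and emits one 0/1
# indicator column per distinct value; add_prefix renames them to
# c2_A, c2_B, ... The astype(str) handles the non-string 0.
dummies = df['c2'].astype(str).str.get_dummies(sep=', ').add_prefix('c2_')
result = df.join(dummies)
```

Note that the literal 0 in c2 becomes its own c2_0 column, which can simply be dropped if it is not wanted.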

Related

Convert a pandas DataFrame to a dict with row and column name for each value

I want to convert a dataframe (pandas) to a dictionary with the labels in column 0 and the column names for each value. Additionally, I need to change the column names to 1-100 (there are 100 columns).
The dataframe looks like this:
I would like to get an output like this:
{(1,'DC-Vienna'):8831, (2, 'DC-Vienna'):10174, ...
(1, 'DC-Valencia'):3012, (2, 'DC-Valencia'):2276, ...
...}
I was able to convert it to a dictionary, of course, but I need it in a specific format: without the row indices and with changed column names. I'm quite new to Python, so all these things are unfamiliar to me. There might be a few basic steps involved here that I just completely missed, such as renaming the columns of the dataframe.
Hope someone can help! Thanks in advance!
I created a minified dataset that represents your problem, and here is code that does what you need.
The key step is to move the Distribution column into the index and only then apply the .to_dict method. The index is then transformed into two-item tuples, as you wanted.
import pandas as pd

df = pd.DataFrame({
    "Distribution": ["Viena", "Kaunas"],
    "1": [2, 3],
    "2": [4, 5],
})
rdf = df.set_index("Distribution").stack().swaplevel().to_dict()
rdf
Results in:
{('1', 'Viena'): 2,
 ('2', 'Viena'): 4,
 ('1', 'Kaunas'): 3,
 ('2', 'Kaunas'): 5}
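If the first element of each key should be an integer 1-100 rather than the string column label (as the question's rename step suggests), the column labels can be replaced with a range before stacking. A sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "Distribution": ["Viena", "Kaunas"],
    "1": [2, 3],
    "2": [4, 5],
})

# Rename the value columns to integers 1..n, then stack as before.
df = df.set_index("Distribution")
df.columns = range(1, len(df.columns) + 1)
rdf = df.stack().swaplevel().to_dict()
```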

How to combine multiple rows into a single row with many columns in pandas using an id (clustering multiple records with same id into one record)

Situation:
1. all_task_usage_10_19
all_task_usage_10_19 is a file consisting of 29,229,472 rows × 20 columns.
There are multiple rows with the same ID inside the column machine_ID with different values in other columns.
Columns:
'start_time_of_the_measurement_period', 'end_time_of_the_measurement_period', 'job_ID', 'task_index', 'machine_ID', 'mean_CPU_usage_rate', 'canonical_memory_usage', 'assigned_memory_usage', 'unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage', 'maximum_memory_usage', 'mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage', 'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)', 'memory_accesses_per_instruction_(MAI)', 'sample_portion', 'aggregation_type', 'sampled_CPU_usage'
2. clustering code
I am trying to cluster multiple machine_ID records using the following code, referencing: How to combine multiple rows into a single row with pandas
3. Output
The output is displayed using with option_context, as it allows the content to be visualised better.
My Aim:
I am trying to cluster multiple rows with the same machine_ID into a single record, so I can apply algorithms like Moving averages, LSTM and HW for predicting cloud workloads.
Something like this.
Maybe a Multi-Index is what you're looking for?
df.set_index(['machine_ID', df.index])
Note that by default set_index returns a new dataframe and does not change the original.
To change the original (and return None), pass the argument inplace=True.
Example:
df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
                   'a': [1, 2, 3, 4, 5],
                   'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index])  # not in-place
df.set_index(['machine_ID', df.index], inplace=True)  # in-place
For me, it does create a MultiIndex: the first level is 'machine_ID', the second one is the previous range index.
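A small sketch of what the MultiIndex buys you: all rows recorded for one machine_ID can then be pulled out with a single .loc lookup.

```python
import pandas as pd

df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
                   'a': [1, 2, 3, 4, 5],
                   'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index])

# Selecting on the first index level returns every row for that machine.
machine_1 = new_df.loc[1]
```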
The below code worked for me:
all_task_usage_10_19.groupby('machine_ID')[[
    'start_time_of_the_measurement_period', 'end_time_of_the_measurement_period',
    'job_ID', 'task_index', 'mean_CPU_usage_rate', 'canonical_memory_usage',
    'assigned_memory_usage', 'unmapped_page_cache_memory_usage',
    'total_page_cache_memory_usage', 'maximum_memory_usage',
    'mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage',
    'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
    'memory_accesses_per_instruction_(MAI)', 'sample_portion',
    'aggregation_type', 'sampled_CPU_usage']].agg(list).reset_index()
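The snippet above needs the full 29-million-row file to run; the same groupby/agg(list) pattern can be sketched on toy data (column names here are stand-ins for the real ones):

```python
import pandas as pd

df = pd.DataFrame({'machine_ID': [1, 1, 2],
                   'job_ID': [10, 11, 12],
                   'mean_CPU_usage_rate': [0.1, 0.2, 0.3]})

# agg(list) collapses each group into one row, with every other
# column holding the group's values as a list.
combined = (df.groupby('machine_ID')[['job_ID', 'mean_CPU_usage_rate']]
              .agg(list)
              .reset_index())
```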

Get column value of row with condition based on another column

I keep writing code like this when I want a specific column of a row, where I select the row first based on another column.
my_col = "Value I am looking for"
df.loc[df["Primary Key"] == "blablablublablablalblalblallaaaabblablalblabla"].iloc[0][my_col]
I don't know why, but it seems weird. Is there a more beautiful solution to this?
It would be helpful to have a complete minimal working example, since it is not clear what your data structure looks like. You could use the example given here:
import pandas as pd
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
If you are then trying to, e.g., select the viper row based on its max_speed, and then obtain its shield value like so:
my_col = "shield"
df.loc[df["max_speed"] == 4].iloc[0][my_col]
then I guess that is the way to do that - not a lot of fat in that command.
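One small variant worth knowing (a sketch on the same example): .loc accepts a column label as its second argument, so the row condition and the column selection can go into a single .loc call, avoiding the intermediate frame.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])

# The boolean mask selects matching rows, 'shield' selects the column;
# .iloc[0] then takes the first match as a scalar.
value = df.loc[df['max_speed'] == 4, 'shield'].iloc[0]
```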

Pandas: How to Populate Column in df A, only where first four columns in df A match df B, from value in df B

Let's say I have a Dataframe in Pandas, 'A'
A = pd.DataFrame(columns=['A','B','C','D','Code'])
I have another Dataframe 'B'
B = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
I need to add and populate the 'Code' column in 'B', only where the first four columns of BOTH dataframes match.
So, say, Dataframe 'C' would look like
C = pd.DataFrame(columns=['A','B','C','D','E','F','G','Code'])
I've tried using an inner merge, but it doesn't seem to work. Or maybe I did it wrong? Below is the code I used:
key_column = ["A", "B", "C", "D"]
C = pd.merge(A, B, on='key_column', how='left', suffixes=('',''))
Is the above supposed to work? Not sure how else I would go about this with Pandas.
EDIT: The above code may not be one hundred percent accurate, as I'm writing it from memory, but would this work, or is there another way to solve my problem?
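There look to be two issues in the attempt: `on` receives the string 'key_column' rather than the list itself, and merging A into B (not B into A) is what keeps B's rows and columns. A sketch with toy frames standing in for the question's A and B:

```python
import pandas as pd

key_cols = ['A', 'B', 'C', 'D']

# Toy data: the first row of B matches A on all four key columns.
A = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': [4], 'Code': ['X']})
B = pd.DataFrame({'A': [1, 9], 'B': [2, 9], 'C': [3, 9], 'D': [4, 9],
                  'E': [5, 5], 'F': [6, 6], 'G': [7, 7]})

# Pass the list to `on`; a left merge keeps every row of B and fills
# Code only where all four keys match, NaN elsewhere.
C = B.merge(A[key_cols + ['Code']], on=key_cols, how='left')
```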

Split a dataframe column's list into two dataframe columns

I'm effectively trying to do a text-to-columns (from MS Excel) action, but in Pandas.
I have a dataframe that contains values like: 1_1, 2_1, 3_1, and I only want to take the values to the right of the underscore. I figured out how to split the string, which gives me a list of the broken up string, but I don't know how to break that out into different dataframe columns.
Here is my code:
import pandas as pd
test = pd.DataFrame(['1_1','2_1','3_1'])
test.columns = ['values']
test = test['values'].str.split('_')
I get something like: [1, 1], [2, 1], [3, 1].
What I'm trying to get is two separate columns:
col1: 1, 2, 3
col2: 1, 1 ,1
Thoughts? Thanks in advance for your help!
Use expand=True when doing the split to get multiple columns:
test['values'].str.split('_', expand=True)
If there's only one underscore, and you only care about the value to the right, you could use:
test['values'].str.split('_').str[1]
You are close. Instead of just splitting, try this:
test2 = pd.DataFrame(test['values'].str.split('_').tolist(), columns = ['c1','c2'])
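If you want the two new columns attached to the original frame rather than in a separate one, the expand=True result can be assigned back directly (col1/col2 are illustrative names):

```python
import pandas as pd

test = pd.DataFrame({'values': ['1_1', '2_1', '3_1']})

# expand=True returns a DataFrame; assigning it to a list of labels
# writes both pieces back as new columns in one step.
test[['col1', 'col2']] = test['values'].str.split('_', expand=True)
```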
