I have a problem: I would like to split a column of a pandas DataFrame into several columns containing only binary values (0 or 1), so that I can continue working with them. The problem is as follows (very simplified):
df = pd.DataFrame({'c1': [1, 2, 3, 4, 5, 6],
                   'c2': [0, "A", "B", "A, B", "B", "C"]},
                  columns=['c1', 'c2'])
However, I cannot work with the second column in this form and would therefore like to split c2 into several columns: c2_A, c2_B and c2_C. If c2 contains "A", then c2_A should contain a 1 and otherwise a 0. If, as in the fourth row, "A, B" is written, then there should be a 1 in both c2_A and c2_B and a 0 in c2_C.
I have already tried a lot, for example with if/else, but failed as soon as there is more than one letter in a row (e.g. "A, B").
It would be great if someone could help me, because I'm really running out of ideas.
PS: I am also rather a Python newbie.
Thanks in advance!
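One way to sketch this, assuming pandas' `Series.str.get_dummies` and the `", "` separator used in the example:

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3, 4, 5, 6],
                   'c2': [0, "A", "B", "A, B", "B", "C"]})

# split each entry on ", " and one-hot encode the parts;
# astype(str) handles the non-string 0 in the first row
dummies = df['c2'].astype(str).str.get_dummies(sep=', ').add_prefix('c2_')
result = df.join(dummies)
```

Note that this also produces a `c2_0` column for the 0 entry, which can be dropped if it is not wanted.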
I want to convert a dataframe (pandas) to a dictionary with the labels in column 0 and the column names for each value. Additionally, I need to change the column names to 1-100 (there are 100 columns).
The dataframe looks like this:
I would like to get an output like this:
{(1,'DC-Vienna'):8831, (2, 'DC-Vienna'):10174, ...
(1, 'DC-Valencia'):3012, (2, 'DC-Valencia'):2276, ...
...}
I was able to convert it to a dictionary of course, but I need it in a specific format without the row indices and with changed column names. I'm quite new to Python so all these things are unfamiliar to me. There might be a few basic steps involved here that I just completely missed as well. Such as renaming the columns of the dataframe.
Hope someone can help! Thanks in advance!
I created a minimal dataset that represents your problem, and here is code that does what you need.
The key was that you needed to move the Distribution column into the index and only then apply the .to_dict method. The index is then transformed into the two-item tuples you wanted.
import pandas as pd
df = pd.DataFrame({
    "Distribution": ["Viena", "Kaunas"],
    "1": [2, 3],
    "2": [4, 5],
})
rdf = df.set_index("Distribution").stack().swaplevel().to_dict()
rdf
Results in:
{('1', 'Viena'): 2,
 ('2', 'Viena'): 4,
 ('1', 'Kaunas'): 3,
 ('2', 'Kaunas'): 5}
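The question also mentions renaming the columns to integers 1-100. A variant that relabels the value columns before stacking (assuming the same toy frame) could look like:

```python
import pandas as pd

df = pd.DataFrame({
    "Distribution": ["Viena", "Kaunas"],
    "1": [2, 3],
    "2": [4, 5],
})

indexed = df.set_index("Distribution")
# relabel the value columns as integers 1..N
indexed.columns = range(1, len(indexed.columns) + 1)
rdf = indexed.stack().swaplevel().to_dict()
```

This yields integer first elements in the tuple keys, matching the desired output.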
Situation:
1. all_task_usage_10_19
all_task_usage_10_19 is a file consisting of 29,229,472 rows × 20 columns.
There are multiple rows with the same ID in the machine_ID column, each with different values in the other columns.
Columns:
'start_time_of_the_measurement_period','end_time_of_the_measurement_period', 'job_ID', 'task_index','machine_ID', 'mean_CPU_usage_rate','canonical_memory_usage', 'assigned_memory_usage','unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage', 'maximum_memory_usage','mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage','maximum_disk_IO_time', 'cycles_per_instruction_(CPI)', 'memory_accesses_per_instruction_(MAI)', 'sample_portion',
'aggregation_type', 'sampled_CPU_usage'
2. clustering code
I am trying to cluster multiple machine_ID records using the following code, referencing: How to combine multiple rows into a single row with pandas
3. Output
The output is displayed using with option_context, as it makes the content easier to visualise.
My Aim:
I am trying to combine multiple rows with the same machine_ID into a single record, so I can apply algorithms like moving averages, LSTM and HW for predicting cloud workloads.
Something like this.
Maybe a Multi-Index is what you're looking for?
df.set_index(['machine_ID', df.index])
Note that by default set_index returns a new dataframe, and does not change the original.
To change the original (and return None) you can pass an argument inplace=True.
Example:
df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
'a': [1, 2, 3, 4, 5],
'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index]) # not in-place
df.set_index(['machine_ID', df.index], inplace=True) # in-place
For me, it does create a multi-index: first level is 'machine_ID', second one is the previous range index:
The below code worked for me:
cols = ['start_time_of_the_measurement_period',
        'end_time_of_the_measurement_period', 'job_ID', 'task_index',
        'mean_CPU_usage_rate', 'canonical_memory_usage',
        'assigned_memory_usage', 'unmapped_page_cache_memory_usage',
        'total_page_cache_memory_usage', 'maximum_memory_usage',
        'mean_disk_I/O_time', 'mean_local_disk_space_used',
        'maximum_CPU_usage', 'maximum_disk_IO_time',
        'cycles_per_instruction_(CPI)',
        'memory_accesses_per_instruction_(MAI)', 'sample_portion',
        'aggregation_type', 'sampled_CPU_usage']
all_task_usage_10_19.groupby('machine_ID')[cols].agg(list).reset_index()
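On a toy frame (with made-up values), the same pattern collapses all rows sharing a machine_ID into lists:

```python
import pandas as pd

df = pd.DataFrame({'machine_ID': [1, 1, 2],
                   'mean_CPU_usage_rate': [0.2, 0.4, 0.1]})

# each selected column becomes one list per machine_ID
combined = df.groupby('machine_ID')[['mean_CPU_usage_rate']].agg(list).reset_index()
```

Each machine_ID now occupies exactly one row, with its measurements gathered into a list.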
I keep writing code like this when I want a specific column of a row, selecting the row based on another column first.
my_col = "Value I am looking for"
df.loc[df["Primary Key"] == "blablablublablablalblalblallaaaabblablalblabla"].iloc[0][
my_col
]
I don't know why, but it seems weird. Is there a more beautiful solution to this?
It would be helpful to have a complete minimal working example, since it is not clear what your data structure looks like. You could use the example given here:
import pandas as pd
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
index=['cobra', 'viper', 'sidewinder'],
columns=['max_speed', 'shield'])
If you are then trying to e.g. select the viper-row based on its max_speed, and then obtain its shield-value like so:
my_col = "shield"
df.loc[df["max_speed"] == 4].iloc[0][my_col]
then I guess that is the way to do that - not a lot of fat in that command.
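A slightly tighter variant, assuming at least one row matches, is to do the row filter and the column selection in a single `.loc` call:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])

# boolean row filter and column label in one .loc, then the first match
value = df.loc[df["max_speed"] == 4, "shield"].iloc[0]
```

This avoids materialising the full filtered frame before picking the column.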
Let's say I have a Dataframe in Pandas, 'A'
A = pd.DataFrame(columns=['A','B','C','D','Code'])
I have another Dataframe 'B'
B = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
I need to add and populate the 'Code' column in 'B', only where the first four columns of BOTH dataframes match.
So, say, Dataframe 'C' would look like
C = pd.DataFrame(columns=['A','B','C','D','E','F','G','Code'])
I've tried using an inner merge, but it doesn't seem to work. Or maybe I did it wrong? Below is the code I used:
key_column = ["A", "B", "C", "D"]
C = pd.merge(A, B, on='key_column', how='left', suffixes=('',''))
Is the above supposed to work? Not sure how else I would go about this with Pandas.
EDIT: The above code may not be hundred percent accurate as I'm dictating it from memory, but would this work or is there another way to solve my problem?
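A sketch of a merge that should do this, assuming the intent was to join on the list of key columns (the quoted code passes the string 'key_column' rather than the variable); the sample values here are made up:

```python
import pandas as pd

A = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': [4], 'Code': ['X']})
B = pd.DataFrame({'A': [1, 9], 'B': [2, 9], 'C': [3, 9], 'D': [4, 9],
                  'E': [5, 5], 'F': [6, 6], 'G': [7, 7]})

key_columns = ['A', 'B', 'C', 'D']
# keep every row of B; Code is filled only where all four keys match
C = B.merge(A[key_columns + ['Code']], on=key_columns, how='left')
```

Rows of B with no match in A get NaN in Code, which matches the "only where the first four columns match" requirement.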
I'm effectively trying to do a text-to-columns (from MS Excel) action, but in Pandas.
I have a dataframe that contains values like: 1_1, 2_1, 3_1, and I only want to take the values to the right of the underscore. I figured out how to split the string, which gives me a column of lists, but I don't know how to break that out into separate dataframe columns.
Here is my code:
import pandas as pd
test = pd.DataFrame(['1_1','2_1','3_1'])
test.columns = ['values']
test = test['values'].str.split('_')
I get something like: [1, 1], [2, 1], [3, 1].
What I'm trying to get is two separate columns:
col1: 1, 2, 3
col2: 1, 1 ,1
Thoughts? Thanks in advance for your help!
Use expand=True when doing the split to get multiple columns:
test['values'].str.split('_', expand=True)
If there's only one underscore, and you only care about the value to the right, you could use:
test['values'].str.split('_').str[1]
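To attach the two pieces back as named columns, one option is to assign the expanded result directly:

```python
import pandas as pd

test = pd.DataFrame({'values': ['1_1', '2_1', '3_1']})

# expand=True returns a DataFrame, which can be assigned to new columns
test[['col1', 'col2']] = test['values'].str.split('_', expand=True)
```

Note the resulting values are still strings; cast with `astype(int)` if numbers are needed.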
You are close. Instead of just splitting, try this:
test2 = pd.DataFrame(test['values'].str.split('_').tolist(), columns = ['c1','c2'])