Split a dataframe column's list into two dataframe columns

I'm effectively trying to do a text-to-columns (from MS Excel) action, but in Pandas.
I have a dataframe that contains values like: 1_1, 2_1, 3_1, and I only want to take the values to the right of the underscore. I figured out how to split the string, which gives me a list of the broken up string, but I don't know how to break that out into different dataframe columns.
Here is my code:
import pandas as pd
test = pd.DataFrame(['1_1','2_1','3_1'])
test.columns = ['values']
test = test['values'].str.split('_')
I get something like: [1, 1], [2, 1], [3, 1].
What I'm trying to get is two separate columns:
col1: 1, 2, 3
col2: 1, 1, 1
Thoughts? Thanks in advance for your help

Use expand=True when doing the split to get multiple columns:
test['values'].str.split('_', expand=True)
If there's only one underscore, and you only care about the value to the right, you could use:
test['values'].str.split('_').str[1]
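To keep the split pieces next to the original data, something like the following sketch may help. The column names col1 and col2 are taken from the question's desired output; they are not produced by pandas automatically. Note that it skips the question's line that rebinds test to the split result, since after that line test is a Series and test['values'] would no longer work.

```python
import pandas as pd

test = pd.DataFrame({'values': ['1_1', '2_1', '3_1']})

# expand=True returns a DataFrame with one column per piece
parts = test['values'].str.split('_', expand=True)
parts.columns = ['col1', 'col2']  # names chosen for illustration

# Attach the pieces to the original frame
result = test.join(parts)
```

The pieces are still strings at this point; if numbers are needed, a conversion such as result['col2'].astype(int) would be a reasonable follow-up.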

You are close:
Instead of just splitting try this:
test2 = pd.DataFrame(test['values'].str.split('_').tolist(), columns = ['c1','c2'])

Related

How to combine multiple rows into a single row with many columns in pandas using an id (clustering multiple records with same id into one record)

Situation:
1. all_task_usage_10_19
all_task_usage_10_19 is the file which consists of 29229472 rows × 20 columns.
There are multiple rows with the same ID inside the column machine_ID with different values in other columns.
Columns:
'start_time_of_the_measurement_period','end_time_of_the_measurement_period', 'job_ID', 'task_index','machine_ID', 'mean_CPU_usage_rate','canonical_memory_usage', 'assigned_memory_usage','unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage', 'maximum_memory_usage','mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage','maximum_disk_IO_time', 'cycles_per_instruction_(CPI)', 'memory_accesses_per_instruction_(MAI)', 'sample_portion',
'aggregation_type', 'sampled_CPU_usage'
2. clustering code
I am trying to cluster multiple machine_ID records using the following code, referencing: How to combine multiple rows into a single row with pandas
3. Output
The output is displayed using option_context, as it allows the content to be visualised better.
My Aim:
I am trying to cluster multiple rows with the same machine_ID into a single record, so I can apply algorithms like Moving averages, LSTM and HW for predicting cloud workloads.
Something like this.

Maybe a Multi-Index is what you're looking for?
df.set_index(['machine_ID', df.index])
Note that by default set_index returns a new dataframe, and does not change the original.
To change the original (and return None) you can pass an argument inplace=True.
Example:
df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
'a': [1, 2, 3, 4, 5],
'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index]) # not in-place
df.set_index(['machine_ID', df.index], inplace=True) # in-place
For me, it does create a multi-index: first level is 'machine_ID', second one is the previous range index:

The below code worked for me:
all_task_usage_10_19.groupby('machine_ID')[['start_time_of_the_measurement_period','end_time_of_the_measurement_period','job_ID', 'task_index','mean_CPU_usage_rate', 'canonical_memory_usage',
'assigned_memory_usage', 'unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage', 'maximum_memory_usage',
'mean_disk_I/O_time', 'mean_local_disk_space_used','maximum_CPU_usage',
'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
'memory_accesses_per_instruction_(MAI)', 'sample_portion',
'aggregation_type', 'sampled_CPU_usage']].agg(list).reset_index()
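On a small frame the same pattern is easier to see. This is only a sketch with made-up values and two of the question's column names; it assumes every non-key column should be collected into a list.

```python
import pandas as pd

# Toy data standing in for all_task_usage_10_19 (values are invented)
df = pd.DataFrame({
    'machine_ID': [1, 1, 2, 2, 3],
    'mean_CPU_usage_rate': [0.1, 0.2, 0.3, 0.4, 0.5],
    'maximum_memory_usage': [10, 20, 30, 40, 50],
})

# One row per machine_ID; each other column holds the list of that machine's values
per_machine = df.groupby('machine_ID').agg(list).reset_index()
```

Each cell then holds a Python list, which is convenient for inspection but may need to be unpacked again before being fed to a forecasting model.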

Python: Extract unique index values and use them in a loop

I would like to apply the loop below where for each index value the unique values of a column called SERIAL_NUMBER will be returned. Essentially I want to confirm that for each index there is a unique serial number.
index_values = df.index.levels
for i in index_values:
    x = df.loc[[i]]
    x["SERIAL_NUMBER"].unique()
The problem, however, is that my dataset has a multi-index and as you can see below it is stored in a frozen list. I am just interested in the index values that contain a long number. The word "vehicle" also as an index can be removed as it is repeated all over the dataset.
How can I extract these values into a list so I can use them in the loop?
index_values
>>
FrozenList([['0557bf98-c3e0-4955-a23f-2394635ab531', '074705a3-a96a-418c-9bfe-14c37f5c4e6f', '0f47e260-0fa2-40ba-a417-7c00ea74248c', '17342ca2-6246-4150-8080-96d6125cf2b5', '26c6c0d1-0134-4b3a-a149-61dd93afab3b', '7600be43-5d0a-49b3-a1ee-fd107db5822f', 'a07f2b0c-447c-4143-a361-d7ddbffdcc77', 'b929801c-2f32-4a95-bfc4-48a05b48ee01', 'cc912023-0113-42cd-8fe7-4df4005127c2', 'e424bd02-e188-462e-a1a6-2f4ed8fe0a2d'], ['vehicle']])

Without an example it's hard to judge, but I think you need:
df.index.get_level_values(0).unique() # add .tolist() if you want a list
import pandas as pd
df = pd.DataFrame({'A' : [5]*5, 'B' : [6]*5})
df = df.set_index('A',append=True)
df.index.get_level_values(0).unique()
Int64Index([0, 1, 2, 3, 4], dtype='int64')
df.index.get_level_values(1).unique()
Int64Index([5], dtype='int64', name='A')
To drop duplicates from an index level, use the .duplicated() method:
df[~df.index.get_level_values(1).duplicated(keep='first')]
     B
  A
0 5  6
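If the goal is only to confirm that each first-level index value maps to a single serial number, an explicit loop may not be needed. The sketch below assumes a two-level index like the one in the question; the index labels and serial numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical data: first level is an ID, second level is always 'vehicle'
index = pd.MultiIndex.from_tuples(
    [('id-a', 'vehicle'), ('id-a', 'vehicle'), ('id-b', 'vehicle')]
)
df = pd.DataFrame({'SERIAL_NUMBER': ['S1', 'S1', 'S2']}, index=index)

# Number of distinct serial numbers for each first-level index value
serial_counts = df.groupby(level=0)['SERIAL_NUMBER'].nunique()

# First-level values that map to more than one serial number
conflicts = serial_counts[serial_counts > 1].index.tolist()
```

An empty conflicts list would mean every ID has exactly one serial number.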

Python.pandas: how to select rows where objects start with letters 'PL'

I have a specific problem with pandas: I need to select rows in a dataframe whose values start with specific letters.
Details: I've imported my data into a dataframe and selected the columns that I need. I've also narrowed it down to the row index I need. Now I also need to select rows in another column where the values START with the letters 'pl'.
Is there any way to select rows based only on the first two characters of a value?
I was thinking about
pl = df['Code'] == pl*
but it won't work due to row indexing. Advice appreciated!

Use startswith for this:
df = df[df['Code'].str.startswith('pl')]

Fully reproducible example for those who want to try it.
import pandas as pd
df = pd.DataFrame([["plusieurs", 1], ["toi", 2], ["plutot", 3]])
df.columns = ["Code", "number"]
df = df[df.Code.str.startswith("pl")] # alternative is df = df[df["Code"].str.startswith("pl")]

A string method applied to the Series returns a boolean (True/False) result for each row. You can use that result as a filter with .loc to create your data subset.
new_df = df.loc[df['Code'].str.startswith('pl')].copy()

The condition is just a filter; you then need to apply it to the dataframe. As the filter you can use the method Series.str.startswith:
df_pl = df[df['Code'].str.startswith('pl')]
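The title asks about 'PL' while the question body uses 'pl', and startswith is case-sensitive. A minimal sketch that handles both cases and missing values, assuming the column is called Code as in the answers above (the sample values are invented):

```python
import pandas as pd

df = pd.DataFrame({'Code': ['PL123', 'pl456', 'DE789', None]})

# Normalise case first; na=False treats missing values as non-matches
mask = df['Code'].str.upper().str.startswith('PL', na=False)
df_pl = df[mask]
```

Without na=False, a missing value yields a missing entry in the mask, and indexing with such a mask can raise an error.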

Pass a list to str.contains - Pandas

I have a pandas-related question: I need to filter a column (approx. 40k entries) based on substrings included (or not) in the column. Each of the entries in the column is basically a very long list of attributes (text) which I need to be able to filter individually. This line of code works, but it is not scalable (I have hundreds of attributes I have to filter for):
df[df['Product Lev 1'].str.contains('W1 Rough wood', na=False) & df['Product Lev 1'].str.contains('W1.2', na=False)]
Is there a possibility to insert all the items I have to filter and pass them as a list? Or any similar solution?
THANK YOU!

Like this:
import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': ['aaaaDB', 'bbbbbbCB', 'cccccEB', 'ddddddUB']}
df = pd.DataFrame.from_dict(data)
lst = ['DB', 'CB']  # replace with your list
rstr = '|'.join(lst)  # regex alternation: matches rows containing ANY item
df[df['col_2'].str.upper().str.contains(rstr)]
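Joining with '|' keeps rows that contain any of the items and treats each item as a regular expression, so the '.' in 'W1.2' would match any character. The question's original line combines its conditions with &, which requires all substrings. A sketch of that variant, assuming literal (non-regex) matching is wanted; the sample rows are invented:

```python
import pandas as pd

df = pd.DataFrame({'Product Lev 1': [
    'W1 Rough wood; W1.2 Sawn',
    'W1 Rough wood; W1.3 Planed',
    None,
]})

required = ['W1 Rough wood', 'W1.2']  # replace with your list

# Start with every row selected, then narrow with one literal test per substring
mask = pd.Series(True, index=df.index)
for text in required:
    mask &= df['Product Lev 1'].str.contains(text, regex=False, na=False)

filtered = df[mask]
```

With hundreds of attributes this runs one pass over the column per attribute, which should still be manageable for about 40k rows.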

Python: for cycle in range over files

I am trying to create a list that takes values from different files.
I have three dataframes called, for example, "df1", "df2", "df3".
Each file contains two columns of data, so for example "df1" looks like this:
0, 1
1, 4
7, 7
I want to create a list that takes the value from the first row of the second column in each file, so it should look like this:
F=[1,value from df2,value from df3]
My attempt:
import pandas as pd
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df3 = pd.read_csv(file3)
F=[]
for i in range(3):
    F.append(df{"i"}[1][0])
That is probably not how to iterate over them, but I cannot figure out the correct way.

You can use iloc and list comprehension
vals = [df.iloc[0, 1] for df in [df1,df2,df3]]
iloc will get value from first row (index 0) and second column (index 1). If you wanted, say, value from third row and fourth column, you'd do .iloc[2, 3] and so forth.
As suggested by @jpp, you may use iat instead:
vals = [df.iat[0, 1] for df in [df1,df2,df3]]
For difference between them, check this and this question
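A self-contained version of the same idea, with in-memory frames standing in for the three files. The values for df2 and df3 are invented; df1 matches the sample in the question.

```python
import pandas as pd

# Stand-ins for the frames that would come from pd.read_csv
df1 = pd.DataFrame([[0, 1], [1, 4], [7, 7]])
df2 = pd.DataFrame([[2, 5], [3, 6]])
df3 = pd.DataFrame([[4, 9], [8, 8]])

# Keep the frames in a list instead of building names like "df" + str(i)
frames = [df1, df2, df3]

# iat[0, 1] is purely positional: first row, second column
F = [df.iat[0, 1] for df in frames]
```

If the CSV files have no header row, pd.read_csv needs header=None; otherwise the first line is consumed as column names and the first data row is the file's second line.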
