Create multi-level index pandas dataframe from delimited string column - python

I have the following DataFrame:
import pandas as pd
df = pd.DataFrame({
'col1': ['a, b'],
'col2': [100]
}, index=['A'])
What I'd like to achieve, by "exploding" col1, is a multi-level index with the values of col1 as the 2nd level, while retaining the value of col2 for the original index, e.g.:
idx_1,idx_2,val
A,a,100
A,b,100
I'm sure I need a col1.str.split(', ') in there, but I'm at a complete loss as to how to create the desired result - maybe I need a pivot_table but can't see how I can get that to get the required index.
I've spent a good hour and a half looking at the docs on re-shaping and pivoting etc... I'm sure it's straight-forward - I just have no idea of the terminology needed to find the "right thing".

Adapting the first answer here, this is one way. You might want to play around with the names to get those that you'd like.
If your eventual aim is to do this for very large dataframes, there may be more efficient ways to do this.
import pandas as pd
from pandas import Series
# Create test dataframe
df = pd.DataFrame({'col1': ['a, b'], 'col2': [100]}, index=['A'])
# split the values in col1 and then stack them up into one long Series
s = df.col1.str.split(', ').apply(Series).stack()
# get rid of the last level of the *index* of this stack
# (it is just meaningless position numbers if you look at it)
s.index = s.index.droplevel(-1)
# just give it a name - I've picked yours from OP
s.name = 'idx_2'
del df['col1']
df = df.join(s)
# At this point you're more or less there
# If you truly want 'idx_2' as part of the index - do this
indexed_df = df.set_index('idx_2', append=True)
Using your original dataframe as input, the code gives this as output:
>>> indexed_df
         col2
  idx_2
A a       100
  b       100
Further manipulations
If you want to give the indices some meaningful names - you can use
indexed_df.index.names = ['idx_1','idx_2']
Giving output
             col2
idx_1 idx_2
A     a       100
      b       100
If you really want the indices as flattened into columns use this
indexed_df.reset_index(inplace=True)
Giving output
>>> indexed_df
  idx_1 idx_2  col2
0     A     a   100
1     A     b   100
More complex input
If you try a slightly more interesting example input - e.g.
>>> df = pd.DataFrame({
... 'col1': ['a, b', 'c, d'],
... 'col2': [100,50]
... }, index = ['A','B'])
You get out:
>>> indexed_df
         col2
  idx_2
A a       100
  b       100
B c        50
  d        50
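On pandas 0.25 or later, the same reshaping can probably be written more directly with DataFrame.explode, which did not exist when the answer above was written. This is only a sketch under that version assumption; the variable name exploded is made up for illustration, and the index names are the ones from the question.

```python
import pandas as pd

# Same "more complex" input as above
df = pd.DataFrame({'col1': ['a, b', 'c, d'], 'col2': [100, 50]}, index=['A', 'B'])

exploded = (
    df.assign(idx_2=df['col1'].str.split(', '))  # each cell becomes a list of parts
      .explode('idx_2')                          # one row per list element, index repeated
      .drop(columns='col1')
      .set_index('idx_2', append=True)           # add the parts as a second index level
)
exploded.index.names = ['idx_1', 'idx_2']
print(exploded)
```

The split, stack and join steps collapse into one chain here, and col2 is repeated for every exploded row automatically.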

Why is pandas broadcast formula skipping rows

I have a dataframe and I'm doing tons (20+) of calculations creating new columns etc.
All the calculations work well, including the calculation in question, except for 2 rows out of roughly 1,000. The rows are not adjacent to one another, and I can't find anything remarkable about the two specific rows the calculation seems to be skipping. The data is read from a csv file and an xlsx file; the trouble rows come from the part of the data read from the csv file.
The calculation is:
df['c'] = df['b'] - df['a']
The data for the two trouble rows looks like this:
['a'] ['b'] ['c']
0 30.6427984591421 0
0 9584.28792256921 0
The data for the rest of the df where the calculation works fine looks similar but is processing correctly:
['a'] ['b'] ['c']
102411.4521 37008.6603 -65402.7918
202244.75895 211200.2304295 8955.4714795
Example code:
a = [0, 0, 102411.4521, 202244.75895]
b = [30.6427984591421, 9584.28792256921, 37008.6603, 211200.2304295]
df = pd.DataFrame(zip(a, b), columns=['a', 'b'])
df['c'] = df['b'] - df['a']
Why would the calculation seemingly skip these rows?
You could try resetting the index before doing the operation.
df = df.reset_index(drop=True)
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html#pandas-dataframe-reset-index
Based on the information you supplied, CPython 3.10.8 does not reproduce the error.
import pandas as pd
df = pd.DataFrame(
[
dict(a=0, b= 30.6427984591421),
dict(a=0, b= 9584.28792256921),
dict(a=102411.4521, b= 37008.6603),
dict(a=202244.75895, b=211200.2304295),
]
)
df["c"] = df.b - df.a
print(pd.__version__)
print(df)
output
1.5.2
a b c
0 0.00000 30.642798 30.642798
1 0.00000 9584.287923 9584.287923
2 102411.45210 37008.660300 -65402.791800
3 202244.75895 211200.230429 8955.471480
The issue turned out to be that I needed to fill the NaNs before my calculations:
df = df.fillna(0, inplace=False)
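To make that resolution concrete, here is a small sketch of how a missing value behaves in the subtraction. It assumes the trouble rows actually held NaN in column 'a' after loading (the question displays them as 0, so this is a guess based on the fix), and the names with_nan and filled are invented for the example.

```python
import numpy as np
import pandas as pd

# Assumption: the first row's 'a' was really NaN after reading the csv
with_nan = pd.DataFrame({
    'a': [np.nan, 102411.4521],
    'b': [30.6427984591421, 37008.6603],
})
with_nan['c'] = with_nan['b'] - with_nan['a']  # NaN in 'a' propagates into 'c'

# Filling the NaNs first lets the arithmetic produce a number for every row
filled = with_nan[['a', 'b']].fillna(0)
filled['c'] = filled['b'] - filled['a']

print(with_nan)
print(filled)
```

Nothing is skipped in the first frame: the row is computed, but any arithmetic with NaN yields NaN, which can look like an untouched row.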

Create Dataframe based on one column Pandas

Apologies if this is a repeat question; I searched SO for a while and, simple as the question is, I couldn't find a similar one. I am looking to create one data frame (5x3 in my case) based on one column in my pandas dataframe. I've tried both pd.DataFrame and pd.concat, and neither seems to work. Example below:
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
#using pd.DataFrame
table_data = {'Column1': df.iloc[0:5,0],
'Column2': df.iloc[5:10,0],
'Column3': df.iloc[10:15,0]}
pd.DataFrame(table_data)
#different method using pd.DataFrame
pd.DataFrame([df.iloc[0:5,0],
df.iloc[5:10,0],
df.iloc[10:15,0]],
columns = ['Column1', 'Column2', 'Column3'])
#using pd.concat
pd.concat([df.iloc[0:5,0], df.iloc[5:10,0], df.iloc[10:15,0]],
axis=1, keys=['Column1', 'Column2', 'Column3'])
Note that my actual starting data frame has more than just 1 column. The issues seem to be happening when I use indexing as opposed to simply hard coding the numbers that should be in each column. This seems like such a simple thing to do yet I can't seem to find anywhere how to solve it. Any help appreciated.
Like this:
In [591]: import numpy as np
In [585]: d = pd.DataFrame()
In [553]: df_split = np.array_split(df, 3) ## Split df into 3 equal parts of 5 rows each
In [586]: for i in df_split:
...:     d = pd.concat([d, i.reset_index(drop=True)], axis=1)
...:
In [588]: d.columns = ['Col1', 'Col2', 'Col3']
In [589]: d
Out[589]:
Col1 Col2 Col3
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
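The attempts in the question most likely fail because pandas aligns the three slices on their index labels (0-4, 5-9, 10-14), so nothing lines up. A loop-free sketch, assuming the column length is an exact multiple of the chunk size, is to drop the labels first, either by reshaping the raw values or by resetting each slice's index. The names values, table, parts and aligned are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(list(range(1, 16)))
names = ['Column1', 'Column2', 'Column3']

# Option 1: reshape the raw values; each row of the (3, 5) array is one chunk,
# so transposing turns the chunks into columns
values = df.iloc[:, 0].to_numpy()
table = pd.DataFrame(values.reshape(3, 5).T, columns=names)

# Option 2: keep pd.concat, but reset each slice's index so the rows align
parts = [df.iloc[i:i + 5, 0].reset_index(drop=True) for i in range(0, 15, 5)]
aligned = pd.concat(parts, axis=1, keys=names)

print(table)
```

Both options give the same 5x3 frame; the reshape version avoids building the intermediate Series.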

Python Pandas Selecting a Single Dataframe Column and Keeping as Dataframe instead of Series [duplicate]

When selecting a single column from a pandas DataFrame(say df.iloc[:, 0], df['A'], or df.A, etc), the resulting vector is automatically converted to a Series instead of a single-column DataFrame. However, I am writing some functions that takes a DataFrame as an input argument. Therefore, I prefer to deal with single-column DataFrame instead of Series so that the function can assume say df.columns is accessible. Right now I have to explicitly convert the Series into a DataFrame by using something like pd.DataFrame(df.iloc[:, 0]). This doesn't seem like the most clean method. Is there a more elegant way to index from a DataFrame directly so that the result is a single-column DataFrame instead of Series?
As @Jeff mentions there are a few ways to do this, but I recommend using loc/iloc to be more explicit (and raise errors early if you're trying something ambiguous):
In [10]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
In [11]: df
Out[11]:
A B
0 1 2
1 3 4
In [12]: df[['A']]
In [13]: df[[0]] # older pandas only; current versions raise KeyError because 0 is not a column label
In [14]: df.loc[:, ['A']]
In [15]: df.iloc[:, [0]]
Out[12-15]: # they all return the same thing:
A
0 1
1 3
The latter two choices remove ambiguity in the case of integer column names (precisely why loc/iloc were created). For example:
In [16]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 0])
In [17]: df
Out[17]:
A 0
0 1 2
1 3 4
In [18]: df[[0]] # ambiguous (output below is from an older pandas; current versions select the column labelled 0)
Out[18]:
A
0 1
1 3
As Andy Hayden recommends, using .iloc/.loc to index out a single-column dataframe is the way to go; another point to note is how to express the index positions.
Pass the labels or positions as a list to get a DataFrame back; a bare label or position returns a 'pandas.core.series.Series'.
Input:
A_1 = train_data.loc[:,'Fraudster']
print('A_1 is of type', type(A_1))
A_2 = train_data.loc[:, ['Fraudster']]
print('A_2 is of type', type(A_2))
A_3 = train_data.iloc[:,12]
print('A_3 is of type', type(A_3))
A_4 = train_data.iloc[:,[12]]
print('A_4 is of type', type(A_4))
Output:
A_1 is of type <class 'pandas.core.series.Series'>
A_2 is of type <class 'pandas.core.frame.DataFrame'>
A_3 is of type <class 'pandas.core.series.Series'>
A_4 is of type <class 'pandas.core.frame.DataFrame'>
These three approaches have been mentioned:
pd.DataFrame(df.loc[:, 'A']) # Approach of the original post
df.loc[:, ['A']] # Approach 2 (note: use iloc for positional indexing)
df[['A']] # Approach 3
pd.Series.to_frame() is another approach.
Because it is a method, it can be used in situations where the second and third approaches above do not apply. In particular, it is useful when applying some method to a column in your dataframe and you want to convert the output into a dataframe instead of a series. For instance, in a Jupyter Notebook a series will not have pretty output, but a dataframe will.
# Basic use case:
df['A'].to_frame()
# Use case 2 (this will give you pretty output in a Jupyter Notebook):
df['A'].describe().to_frame()
# Use case 3:
df['A'].str.strip().to_frame()
# Use case 4:
def some_function(num):
    ...
df['A'].apply(some_function).to_frame()
You can use df.iloc[:, 0:1]; because the column selector is a slice rather than a single integer, the result is a DataFrame and not a Series.
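A quick way to check which selections keep the DataFrame type is to compare them side by side. The sketch below reuses the two-column frame from the earlier answer; the variable names are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

as_series = df.iloc[:, 0]           # single integer -> Series
by_slice = df.iloc[:, 0:1]          # slice -> single-column DataFrame
by_list = df[['A']]                 # list of labels -> single-column DataFrame
by_method = df['A'].to_frame()      # Series converted back to a DataFrame

for name, obj in [('as_series', as_series), ('by_slice', by_slice),
                  ('by_list', by_list), ('by_method', by_method)]:
    print(name, type(obj).__name__)
```

All three DataFrame results hold the same single column 'A', so which one to use is mostly a question of whether you are selecting by position, by label, or from an existing Series.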
(Talking about pandas 1.3.4)
I'd like to add a little more context to the answers involving .to_frame(). If you select a single row of a data frame and execute .to_frame() on that, then the index will be made up of the original column names and you'll get numeric column names. You can just tack on a .T to the end to transpose that back into the original data frame's format (see below).
import pandas as pd
print(pd.__version__) #1.3.4
df = pd.DataFrame({
"col1": ["a", "b", "c"],
"col2": [1, 2, 3]
})
# series
df.loc[0, ["col1", "col2"]]
# dataframe (column names are along the index; not what I wanted)
df.loc[0, ["col1", "col2"]].to_frame()
# 0
# col1 a
# col2 1
# looks like an actual single-row dataframe.
# To me, this is the true answer to the question
# because the output matches the format of the
# original dataframe.
df.loc[0, ["col1", "col2"]].to_frame().T
# col1 col2
# 0 a 1
# this works really well with .to_dict(orient="records") which is
# what I'm ultimately after by selecting a single row
df.loc[0, ["col1", "col2"]].to_frame().T.to_dict(orient="records")
# [{'col1': 'a', 'col2': 1}]
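As an alternative to .to_frame().T, passing the row label as a list should return a single-row DataFrame directly. This is a sketch rather than the approach of the answer above; one difference worth noting is that it keeps each column's dtype, whereas the transposed frame ends up with object columns. The names single_row and transposed are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "b", "c"],
    "col2": [1, 2, 3]
})

# A list of row labels keeps the result two-dimensional
single_row = df.loc[[0], ["col1", "col2"]]
records = single_row.to_dict(orient="records")

# For comparison: the Series -> to_frame -> transpose route
transposed = df.loc[0, ["col1", "col2"]].to_frame().T

print(single_row)
print(records)
print(transposed.dtypes)
```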

How to convert data of type Panda to Panda.Dataframe?

I have an object whose type is Pandas, and print(object) gives the output below:
print(type(recomen_total))
print(recomen_total)
Output is
<class 'pandas.core.frame.Pandas'>
Pandas(Index=12, instrument_1='XXXXXX', instrument_2='XXXX', trade_strategy='XXX', earliest_timestamp='2016-08-02T10:00:00+0530', latest_timestamp='2016-08-02T10:00:00+0530', xy_signal_count=1)
I want to convert this object into a pd.DataFrame. How can I do it?
I tried pd.DataFrame(object) and from_dict as well, but both throw errors.
Interestingly, it will not convert to a dataframe directly but to a series. Once this is converted to a series use the to_frame method of series to convert it to a DataFrame
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
index=['a', 'b'])
for row in df.itertuples():
    print(pd.Series(row).to_frame())
Hope this helps!!
EDIT
In case you want to save the column names use the _asdict() method like this:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
index=['a', 'b'])
for row in df.itertuples():
    d = dict(row._asdict())
    print(pd.Series(d).to_frame())
Output:
0
Index a
col1 1
col2 0.1
0
Index b
col1 2
col2 0.2
To create new DataFrame from itertuples namedtuple you can use list() or Series too:
import pandas as pd
# source DataFrame
df = pd.DataFrame({'a': [1,2], 'b':[3,4]})
# empty DataFrame
df_new_fromAppend = pd.DataFrame(columns=['x','y'], data=None)
for r in df.itertuples():
    # create new DataFrame from itertuples() via list() ([1:] for skipping the index):
    df_new_fromList = pd.DataFrame([list(r)[1:]], columns=['c','d'])
    # or create new DataFrame from itertuples() via Series (drop(0) to remove index, T to transpose column to row)
    df_new_fromSeries = pd.DataFrame(pd.Series(r).drop(0)).T
    # or use .loc to append a row to the existing DataFrame ([1:] for skipping the index):
    df_new_fromAppend.loc[df_new_fromAppend.shape[0]] = list(r)[1:]
print('df_new_fromList:')
print(df_new_fromList, '\n')
print('df_new_fromSeries:')
print(df_new_fromSeries, '\n')
print('df_new_fromAppend:')
print(df_new_fromAppend, '\n')
Output:
df_new_fromList:
c d
0 2 4
df_new_fromSeries:
1 2
0 2 4
df_new_fromAppend:
x y
0 1 3
1 2 4
To omit index, use param index=False (but I mostly need index for the iteration)
for r in df.itertuples(index=False):
    # the [1:] needn't be used, for example:
    df_new_fromAppend.loc[df_new_fromAppend.shape[0]] = list(r)
The following works for me:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
for row in df.itertuples():
    row_as_df = pd.DataFrame.from_records([row], columns=row._fields)
    print(row_as_df)
The result is:
Index col1 col2
0 a 1 0.1
Index col1 col2
0 b 2 0.2
Sadly, AFAIU, there's no simple way to keep column names without using underscore-prefixed attributes such as _fields (for namedtuples these are documented API; the underscore only avoids clashes with field names).
With some tweaks to @Igor's answer,
I ended up with this code, which preserves column names and uses as little pandas code as possible.
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]})
# Or initialize another dataframe above
# Get list of column names
column_names = df.columns.values.tolist()
filtered_rows = []
for row in df.itertuples(index=False):
    # Some code logic to filter rows
    filtered_rows.append(row)
# Convert pandas.core.frame.Pandas to pandas.core.frame.DataFrame
# Combine filtered rows into a single dataframe
concatenated_df = pd.DataFrame.from_records(filtered_rows, columns=column_names)
concatenated_df.to_csv("path_to_csv", index=False)
The result is a csv containing:
col1,col2
1,0.1
2,0.2
To convert a list of objects returned by Pandas .itertuples to a DataFrame, while preserving the column names:
# Example source DF
data = [['cheetah', 120], ['human', 44.72], ['dragonfly', 54]]
source_df = pd.DataFrame(data, columns=['animal', 'top_speed'])
animal top_speed
0 cheetah 120.00
1 human 44.72
2 dragonfly 54.00
Since pandas does not recommend building DataFrames by adding single rows in a for loop, we will iterate and build the DataFrame at the end:
WOW_THAT_IS_FAST = 50
list_ = list()
for animal in source_df.itertuples(index=False, name='animal'):
    if animal.top_speed > 50:
        list_.append(animal)
Now build the DF in a single command and without manually recreating the column names.
filtered_df = pd.DataFrame(list_)
animal top_speed
0 cheetah 120.0
1 dragonfly 54.0
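For reference, the steps above can be put together into one runnable sketch, alongside a boolean-mask filter that avoids the Python loop altogether. The mask version is an alternative that is not part of the answer above, and the names rows and masked_df are made up for the example.

```python
import pandas as pd

data = [['cheetah', 120], ['human', 44.72], ['dragonfly', 54]]
source_df = pd.DataFrame(data, columns=['animal', 'top_speed'])

# Loop version: collect the namedtuples, then build the frame once at the end;
# the column names are taken from the namedtuple fields
rows = [row for row in source_df.itertuples(index=False) if row.top_speed > 50]
filtered_df = pd.DataFrame(rows)

# Vectorised alternative: filter with a boolean mask and renumber the rows
masked_df = source_df[source_df['top_speed'] > 50].reset_index(drop=True)

print(filtered_df)
print(masked_df)
```

The loop form is useful when the per-row logic is too involved for a vectorised expression; otherwise the mask is usually shorter and faster.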
