For every two rows in my df, I would like to concatenate them into one.
Starting with this:
and ending with this:
I've been able to apply this to one column, but have not been able to apply it across all of them. I would also like to loop this for every two rows for the entire df.
This is my actual df:
Team Spread
0 Wagner Seahawks (-11.5, -118)
1 Fairleigh Dickinson Knights (11.5, -110)
I know this isn't the best way to format a table, but for my needs it is the best option. Thank you
If I were to do this in excel - I would use this:
=TEXTJOIN(CHAR(10),TRUE,A1:A2)
Does this work for you?
>>> df = pd.DataFrame({
"Col1": ["A", "B", "C", "D"],
"Col2": [(-11.5, -118), (11.5, -110), (-11.5, -118), (11.5, -110)],
})
>>> df
Col1 Col2
0 A (-11.5, -118)
1 B (11.5, -110)
2 C (-11.5, -118)
3 D (11.5, -110)
If you have non-string columns, you'll need to transform them to string first:
>>> df["Col2"] = df["Col2"].astype(str)
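Now group the rows with floor division on the index (df.index // 2), so that each consecutive pair of rows shares one group key, and aggregate each pair using "\n".join. This assumes the default 0, 1, 2, ... index.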
>>> df = df.groupby(df.index // 2).agg("\n".join)
>>> df
Col1 Col2
0 A\nB (-11.5, -118)\n(11.5, -110)
1 C\nD (-11.5, -118)\n(11.5, -110)
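Applied to the Team/Spread frame from the question, the same idea can be sketched as below. This is a minimal sketch under assumptions: the second pair of teams ("Team C", "Team D") and their spreads are made up for illustration, and the group key is built with np.arange(len(df)) // 2 instead of df.index // 2 (my substitution) so that it does not depend on the index being 0, 1, 2, ...

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Team": ["Wagner Seahawks", "Fairleigh Dickinson Knights", "Team C", "Team D"],  # last two are placeholders
    "Spread": [(-11.5, -118), (11.5, -110), (-3.0, -105), (3.0, -115)],
})

# "\n".join only accepts strings, so convert every column first
df = df.astype(str)

# 0, 0, 1, 1, ... : one key per consecutive pair of rows, based on position
pair_id = np.arange(len(df)) // 2

paired = df.groupby(pair_id).agg("\n".join)
print(paired.loc[0, "Team"])
```

Each cell of paired then holds the two original values separated by a newline, which is the same result TEXTJOIN(CHAR(10), ...) gives in Excel.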
Keep in mind that to get this into Excel in the format that you want, you would need to write the Excel file yourself when dumping the dataframe (as described in this SO answer).
I have seen this question asked multiple times, but the solutions from the other questions did not work!
I have a data frame like
df = pd.DataFrame({
"date": ["20180920"] * 3 + ["20180921"] * 3,
"id": ["A12","A123","A1234","A12345","A123456","A0"],
"mean": [1,2,3,4,5,6],
"std" :[7,8,9,10,11,12],
"test": ["a", "b", "c", "d", "e", "f"],
"result": [70, 90, 110, "(-)", "(+)", 0.3],})
using pivot_table
df_sum_table = (pd.pivot_table(df,index=['id'], columns = ['date'], values = ['mean','std']))
I got
df_sum_table.columns
MultiIndex([('mean', '20180920'),
('mean', '20180921'),
( 'std', '20180920'),
( 'std', '20180921')],
names=[None, 'date'])
So I want to shift the date level one row down and remove the id row, but keep the id name there.
by following these past solutions
ValueError when trying to have multi-index in DataFrame.pivot
Removing index name from df created with pivot_table()
Resetting index to flat after pivot_table in pandas
pandas pivot_table keep index
df_sum_table = (pd.pivot_table(df,index=['id'], columns = ['date'], values = ['mean','std'])).reset_index().rename_axis(None, axis=1)
but getting error
TypeError: Must pass list-like as names.
How can I remove date but keep the id in the first column ?
The desired output
@jezrael
Try with rename_axis:
df = df.pivot_table(index=['id'], columns = ['date'], values = ['mean', 'std']).rename_axis(columns={'date': None}).fillna('').reset_index().T.reset_index(level=1).T.reset_index(drop=True).reset_index(drop=True)
df.index = df.pop('id').replace('', 'id').tolist()
print(df)
Output:
             mean      mean       std       std
id       20180920  20180921  20180920  20180921
A0                        6                  12
A12             1                   7
A123            2                   8
A1234           3                   9
A12345                    4                  10
A123456                   5                  11
You can use rename_axis to rename one specific name on the column axis by passing a dictionary mapping. The columns argument tells rename_axis that the mapping applies to the column axis names.
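To see the renaming step in isolation, here is a minimal sketch that uses only the date, id, mean and std columns from the question's frame. It shows just the rename_axis call, not the full chain from the answer, and the variable names table and renamed are my own.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["20180920"] * 3 + ["20180921"] * 3,
    "id": ["A12", "A123", "A1234", "A12345", "A123456", "A0"],
    "mean": [1, 2, 3, 4, 5, 6],
    "std": [7, 8, 9, 10, 11, 12],
})

table = df.pivot_table(index=["id"], columns=["date"], values=["mean", "std"])
print(list(table.columns.names))    # level names before renaming

# Map the level name 'date' to None; the unnamed first level is left alone
renamed = table.rename_axis(columns={"date": None})
print(list(renamed.columns.names))  # level names after renaming
print(renamed.index.name)           # the row index name is untouched
```

Because the pivoted columns are a MultiIndex with two levels, a single scalar name (as in rename_axis(None, axis=1)) is not enough, which is consistent with the "Must pass list-like as names" error in the question.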
Apologies if this is a repeat question, I searched SO for awhile and, as simple as a question that it is, I couldn't find a similar one. I am looking to simply create one data frame (5x3 in my case) based off of one column in my Pandas dataframe. I've tried both pd.DataFrame and pd.concat and neither have seemed to work. Example below:
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
#using pd.DataFrame
table_data = {'Column1': df.iloc[0:5,0],
'Column2': df.iloc[5:10,0],
'Column3': df.iloc[10:15,0]}
pd.DataFrame(table_data)
#different method using pd.DataFrame
pd.DataFrame([df.iloc[0:5,0],
df.iloc[5:10,0],
df.iloc[10:15,0]],
columns = ['Column1', 'Column2', 'Column3'])
#using pd.concat
pd.concat([df.iloc[0:5,0], df.iloc[5:10,0], df.iloc[10:15,0]],
axis=1, keys=['Column1', 'Column2', 'Column3'])
Note that my actual starting data frame has more than just 1 column. The issues seem to be happening when I use indexing as opposed to simply hard coding the numbers that should be in each column. This seems like such a simple thing to do yet I can't seem to find anywhere how to solve it. Any help appreciated.
Like this:
In [591]: import numpy as np
In [585]: d = pd.DataFrame()
In [553]: df_split = np.array_split(df, 3) ## Split df into 3 equal parts of 5 rows each
In [586]: for i in df_split:
...: d = pd.concat([d,i.reset_index(drop=True)], axis=1)
...:
In [588]: d.columns = ['Col1', 'Col2', 'Col3']
In [589]: d
Out[589]:
Col1 Col2 Col3
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
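The attempts in the question misbehave because each slice keeps its original row labels (0-4, 5-9, 10-14), and pandas aligns on those labels instead of placing the slices side by side. Resetting the index, as above, is one way around that. Another option, sketched here as an assumption-laden alternative rather than the answer's method, is to drop to NumPy and reshape the column; the column names are the ones used in the question.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])

# Plain NumPy array: no row labels, so nothing to align on
values = df.iloc[:, 0].to_numpy()

# order="F" fills column by column: first 5 values -> first column, and so on
table = pd.DataFrame(
    values.reshape((5, 3), order="F"),
    columns=["Column1", "Column2", "Column3"],
)
print(table)
```

This only works when the number of values is exactly rows x columns (15 = 5 x 3 here); reshape raises an error otherwise.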
When selecting a single column from a pandas DataFrame (say df.iloc[:, 0], df['A'], or df.A, etc.), the resulting vector is automatically converted to a Series instead of a single-column DataFrame. However, I am writing some functions that take a DataFrame as an input argument. Therefore, I prefer to deal with a single-column DataFrame instead of a Series so that the function can assume, say, that df.columns is accessible. Right now I have to explicitly convert the Series into a DataFrame by using something like pd.DataFrame(df.iloc[:, 0]). This doesn't seem like the cleanest method. Is there a more elegant way to index from a DataFrame directly so that the result is a single-column DataFrame instead of a Series?
As @Jeff mentions there are a few ways to do this, but I recommend using loc/iloc to be more explicit (and raise errors early if you're trying something ambiguous):
In [10]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
In [11]: df
Out[11]:
A B
0 1 2
1 3 4
In [12]: df[['A']]
In [13]: df[[0]]
In [14]: df.loc[:, ['A']]
In [15]: df.iloc[:, [0]]
Out[12-15]: # they all return the same thing:
A
0 1
1 3
The latter two choices remove ambiguity in the case of integer column names (precisely why loc/iloc were created). For example:
In [16]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 0])
In [17]: df
Out[17]:
A 0
0 1 2
1 3 4
In [18]: df[[0]] # ambiguous
Out[18]:
A
0 1
1 3
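The output above reflects the pandas version this answer was written against. To my knowledge, recent pandas releases treat df[[0]] as a label lookup, so the result may differ on your installation. Spelling out the intent with loc (labels) or iloc (positions) avoids relying on either behaviour. A small sketch reusing the frame from In [16]; the variable names are mine:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", 0])

by_label = df.loc[:, [0]]      # the column whose label is 0
by_position = df.iloc[:, [0]]  # the first column, which is "A"

print(type(by_label).__name__, list(by_label.columns))
print(type(by_position).__name__, list(by_position.columns))
```

Both results are single-column DataFrames because the selector is a list in each case.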
As Andy Hayden recommends, using .loc/.iloc to select a single-column DataFrame is the way to go; the other point to note is how you express the labels or positions.
Pass the column label or position inside a list to get a DataFrame back; passing it as a bare scalar returns a 'pandas.core.series.Series'.
Input:
A_1 = train_data.loc[:,'Fraudster']
print('A_1 is of type', type(A_1))
A_2 = train_data.loc[:, ['Fraudster']]
print('A_2 is of type', type(A_2))
A_3 = train_data.iloc[:,12]
print('A_3 is of type', type(A_3))
A_4 = train_data.iloc[:,[12]]
print('A_4 is of type', type(A_4))
Output:
A_1 is of type <class 'pandas.core.series.Series'>
A_2 is of type <class 'pandas.core.frame.DataFrame'>
A_3 is of type <class 'pandas.core.series.Series'>
A_4 is of type <class 'pandas.core.frame.DataFrame'>
These three approaches have been mentioned:
pd.DataFrame(df.loc[:, 'A']) # Approach of the original post
df.loc[:, ['A']] # Approach 2 (note: use iloc for positional indexing)
df[['A']] # Approach 3
pd.Series.to_frame() is another approach.
Because it is a method, it can be used in situations where the second and third approaches above do not apply. In particular, it is useful when applying some method to a column in your dataframe and you want to convert the output into a dataframe instead of a series. For instance, in a Jupyter Notebook a series will not have pretty output, but a dataframe will.
# Basic use case:
df['A'].to_frame()
# Use case 2 (this will give you pretty output in a Jupyter Notebook):
df['A'].describe().to_frame()
# Use case 3:
df['A'].str.strip().to_frame()
# Use case 4:
def some_function(num):
...
df['A'].apply(some_function).to_frame()
You can use df.iloc[:, 0:1]; because the column selector is a slice, the result will be a DataFrame and not a Series.
As you can see:
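A minimal sketch of the difference, using a small made-up frame: an integer position returns a Series, while a one-column slice keeps the DataFrame.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

as_series = df.iloc[:, 0]   # scalar position -> Series
as_frame = df.iloc[:, 0:1]  # slice of width one -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(list(as_frame.columns))
```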
(Talking about pandas 1.3.4)
I'd like to add a little more context to the answers involving .to_frame(). If you select a single row of a data frame and execute .to_frame() on that, then the index will be made up of the original column names and you'll get numeric column names. You can just tack on a .T to the end to transpose that back into the original data frame's format (see below).
import pandas as pd
print(pd.__version__) #1.3.4
df = pd.DataFrame({
"col1": ["a", "b", "c"],
"col2": [1, 2, 3]
})
# series
df.loc[0, ["col1", "col2"]]
# dataframe (column names are along the index; not what I wanted)
df.loc[0, ["col1", "col2"]].to_frame()
# 0
# col1 a
# col2 1
# looks like an actual single-row dataframe.
# To me, this is the true answer to the question
# because the output matches the format of the
# original dataframe.
df.loc[0, ["col1", "col2"]].to_frame().T
# col1 col2
# 0 a 1
# this works really well with .to_dict(orient="records") which is
# what I'm ultimately after by selecting a single row
df.loc[0, ["col1", "col2"]].to_frame().T.to_dict(orient="records")
# [{'col1': 'a', 'col2': 1}]
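As an aside that is not part of the answer above: passing the row label as a list, df.loc[[0], ...], also returns a single-row DataFrame directly. One difference I would expect is that this route keeps each column's dtype, whereas going through a Series and .to_frame().T leaves the columns as object. A sketch with the same frame; the variable names are mine:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "b", "c"],
    "col2": [1, 2, 3],
})

# List of row labels -> the result stays a DataFrame
row_df = df.loc[[0], ["col1", "col2"]]

# Route from the answer above, for comparison
via_series = df.loc[0, ["col1", "col2"]].to_frame().T

print(row_df.dtypes)
print(via_series.dtypes)
print(row_df.to_dict(orient="records"))
```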
I have the following DataFrame:
import pandas as pd
df = pd.DataFrame({
'col1': ['a, b'],
'col2': [100]
}, index=['A'])
What I'd like to achieve is to "explode" col1 to create a multi-level index with the values of col1 as the 2nd level, while retaining the value of col2 from the original row, e.g.:
idx_1,idx_2,val
A,a,100
A,b,100
I'm sure I need a col1.str.split(', ') in there, but I'm at a complete loss as to how to create the desired result - maybe I need a pivot_table but can't see how I can get that to get the required index.
I've spent a good hour and a half looking at the docs on re-shaping and pivoting etc... I'm sure it's straight-forward - I just have no idea of the terminology needed to find the "right thing".
Adapting the first answer here, this is one way. You might want to play around with the names to get those that you'd like.
If your eventual aim is to do this for very large dataframes, there may be more efficient ways to do this.
import pandas as pd
from pandas import Series
# Create test dataframe
df = pd.DataFrame({'col1': ['a, b'], 'col2': [100]}, index=['A'])
#split the values in column 1 and then stack them up in a big column
s = df.col1.str.split(', ').apply(Series).stack()
# get rid of the last column from the *index* of this stack
# (it was all meaningless numbers if you look at it)
s.index = s.index.droplevel(-1)
# just give it a name - I've picked yours from OP
s.name = 'idx_2'
del df['col1']
df = df.join(s)
# At this point you're more or less there
# If you truly want 'idx_2' as part of the index - do this
indexed_df = df.set_index('idx_2', append=True)
Using your original dataframe as input, the code gives this as output:
>>> indexed_df
col2
idx_2
A a 100
b 100
Further manipulations
If you want to give the indices some meaningful names - you can use
indexed_df.index.names = ['idx_1','idx_2']
Giving output
col2
idx_1 idx_2
A a 100
b 100
If you really want the indices as flattened into columns use this
indexed_df.reset_index(inplace=True)
Giving output
>>> indexed_df
idx_1 idx_2 col2
0 A a 100
1 A b 100
>>>
More complex input
If you try a slightly more interesting example input - e.g.
>>> df = pd.DataFrame({
... 'col1': ['a, b', 'c, d'],
... 'col2': [100,50]
... }, index = ['A','B'])
You get out:
>>> indexed_df
col2
idx_2
A a 100
b 100
B c 50
d 50
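On pandas 0.25 or later, DataFrame.explode can do the same reshaping more directly. This is an alternative to the answer above rather than a restatement of it; the idx_1 / idx_2 names follow the question, and the variable name exploded is mine.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a, b", "c, d"],
    "col2": [100, 50],
}, index=["A", "B"])

exploded = (
    df.assign(idx_2=df["col1"].str.split(", "))  # each cell becomes a list
      .drop(columns="col1")
      .explode("idx_2")                          # one row per list element
      .set_index("idx_2", append=True)           # add it as a second index level
)
exploded.index.names = ["idx_1", "idx_2"]
print(exploded)
```

explode repeats the original index label for every element it produces, which is why appending idx_2 afterwards yields the desired two-level index.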