My current code is shown below - I'm importing a MAT file and trying to create a DataFrame from variables within it:
from scipy.io import loadmat  # mat-file reader
import pandas as pd

mat = loadmat(file_path)   # load mat-file
Variables = mat.keys()     # identify variable names
df = pd.DataFrame          # Initialise DataFrame
for name in Variables:
    B = mat[name]
    s = pd.Series(B[:, 1])
So within the loop, I can create a series of each variable (they're arrays with two columns - so the values I need are in column 2)
My question is how do I append the series to the dataframe? I've looked through the documentation and none of the examples seem to fit what I'm trying to do.
Here is how to create a DataFrame where each series is a row.
For a single Series (resulting in a single-row DataFrame):
series = pd.Series([1,2], index=['a','b'])
df = pd.DataFrame([series])
For multiple series with identical indices:
cols = ['a','b']
list_of_series = [pd.Series([1,2],index=cols), pd.Series([3,4],index=cols)]
df = pd.DataFrame(list_of_series, columns=cols)
For multiple series with possibly different indices:
list_of_series = [pd.Series([1,2],index=['a','b']), pd.Series([3,4],index=['a','c'])]
df = pd.concat(list_of_series, axis=1).transpose()
To create a DataFrame where each series is a column, see the answers by others. Alternatively, one can create a DataFrame where each series is a row, as above, and then use df.transpose(). However, the latter approach is inefficient if the columns have different data types.
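To see why the row-then-transpose route can be inefficient with mixed dtypes, here is a minimal sketch (the variable names are illustrative): building columns directly preserves each column's dtype, while building rows first upcasts everything to object.

```python
import pandas as pd

# Two series with different dtypes: one integer, one string.
ints = pd.Series([1, 2, 3], name='ints')
strs = pd.Series(['x', 'y', 'z'], name='strs')

# Column-wise concat preserves each column's dtype.
df_cols = pd.concat([ints, strs], axis=1)
print(df_cols.dtypes)  # ints: int64, strs: object

# Building rows first and transposing mixes dtypes within each
# intermediate column, so everything ends up as object.
df_rows = pd.DataFrame([ints, strs]).transpose()
print(df_rows.dtypes)  # both object
```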
No need to initialize an empty DataFrame (you weren't even doing that, you'd need pd.DataFrame() with the parens).
Instead, to create a DataFrame where each series is a column,
make a list of Series, series, and
concatenate them horizontally with df = pd.concat(series, axis=1)
Something like this (skipping the metadata keys such as '__header__' that scipy.io.loadmat adds to the dict):
series = [pd.Series(mat[name][:, 1]) for name in Variables if not name.startswith('__')]
df = pd.concat(series, axis=1)
Nowadays there is a pandas.Series.to_frame method:
Series.to_frame(name=NoDefault.no_default)
Convert Series to DataFrame.
Parameters
    name : object, optional
        The passed name should substitute for the series name (if it has one).
Returns
    DataFrame
        DataFrame representation of Series.
Examples
s = pd.Series(["a", "b", "c"], name="vals")
s.to_frame()
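A short usage sketch of the `name` parameter quoted in the signature above (the column label "letters" is just an example):

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], name="vals")

# Default: the resulting column is named after the series.
df1 = s.to_frame()
print(df1.columns.tolist())  # ['vals']

# Passing name= overrides the column label.
df2 = s.to_frame(name="letters")
print(df2.columns.tolist())  # ['letters']
```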
I guess another way, possibly faster, to achieve this is:
1) Use a dict comprehension to get the desired dict (i.e., taking the 2nd column of each array).
2) Then use pd.DataFrame to create an instance directly from the dict, without looping over each column and concatenating.
Assuming your mat looks like this (you can ignore this since your mat is loaded from file):
In [135]: mat = {'a': np.random.randint(5, size=(4,2)),
.....: 'b': np.random.randint(5, size=(4,2))}
In [136]: mat
Out[136]:
{'a': array([[2, 0],
[3, 4],
[0, 1],
[4, 2]]), 'b': array([[1, 0],
[1, 1],
[1, 0],
[2, 1]])}
Then you can do:
In [137]: df = pd.DataFrame({name: mat[name][:,1] for name in mat})
In [138]: df
Out[138]:
a b
0 0 0
1 4 1
2 1 0
3 2 1
[4 rows x 2 columns]
Related
I have a data frame like below. It has two columns, column1 and column2; from these two columns I want to filter a few values (combinations of two lists) and get the index. Though I wrote the logic for it, it will be too slow for filtering a larger data frame. Is there any faster way to filter the data and get the list of indexes?
Data frame:
import pandas as pd
d = {'col1': [11, 20,90,80,30], 'col2': [30, 40,50,60,90]}
df = pd.DataFrame(data=d)
print(df)
col1 col2
0 11 30
1 20 40
2 90 50
3 80 60
4 30 90
l1=[11,90,30]
l2=[30,50,90]
final_result=[]
for i, j in zip(l1, l2):
    res = df[(df['col1'] == i) & (df['col2'] == j)]
    final_result.append(res.index[0])
print(final_result)
[0, 2, 4]
You can just use the underlying NumPy array and create a boolean index:
mask=(df[['col1', 'col2']].values[:,None]==np.vstack([l1,l2]).T).all(-1).any(1)
# mask
# array([ True, False, True, False, True])
df.index[mask]
# prints
# Int64Index([0, 2, 4], dtype='int64')
You can use:
condition_1 = df['col1'].astype(str).str.contains('|'.join(map(str, l1)))
condition_2 = df['col2'].astype(str).str.contains('|'.join(map(str, l2)))
final_result = df.loc[condition_1 & condition_2].index.to_list()
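One caveat with the str.contains approach above: it matches substrings, and the two conditions are checked independently rather than as pairs, so it can select extra rows. A small sketch of the pitfall, with illustrative data (the value 110 is chosen so that '11' matches it as a substring):

```python
import pandas as pd

df = pd.DataFrame({'col1': [11, 110, 30], 'col2': [30, 30, 50]})
l1, l2 = [11], [30]

# Substring matching: '11' also matches '110', so row 1 is
# selected even though col1 == 110, not 11.
c1 = df['col1'].astype(str).str.contains('|'.join(map(str, l1)))
c2 = df['col2'].astype(str).str.contains('|'.join(map(str, l2)))
print(df.loc[c1 & c2].index.to_list())  # [0, 1]

# Exact pairwise matching avoids both problems.
pairs = set(zip(l1, l2))
mask = [t in pairs for t in zip(df['col1'], df['col2'])]
print(df.index[mask].to_list())  # [0]
```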
Here is one way to do it: merge the two DataFrames and keep the rows whose values exist in both.
# create a DF of the list you like to match with
df2=pd.DataFrame({'col1': l1, 'col2': l2})
# merge the two DF
df3=df.merge(df2, how='left',
on=['col1', 'col2'], indicator='foundIn')
# keep only the rows found in both DataFrames
out=df3[df3['foundIn'].eq('both')].index.to_list()
out
[0, 2, 4]
Suppose we have a dataframe
df = pd.DataFrame({'A': [1, 2], 'A_values': [3, 4], 'B': [5, 6], 'B_values': [7, 8]})
I want to sort of pivot this dataframe so that we have columns A and B suffixed with their values as index and column names in a new dataframe (i.e. index = ['A_1', 'A_2'], columns = ['B_5', 'B_6']), and the values of this new dataframe would be a result of a function on columns A_values and B_values. Suppose the function is a simple sum. For A = 1 we have A_values = 3, for B = 5 we have B_values = 7, therefore in the new dataframe at the intersection of A_1 and B_5 we would have 3+7=10. Complete resulting dataframe below:
df_pivoted = pd.DataFrame([[3+7, 3+8], [4+7, 4+8]], index = ['A_1', 'A_2'], columns = ['B_5', 'B_6'])
After searching for a while I did not find the functionality in .pivot_table() that would allow to pass a function of multiple columns as values. Maybe there is a more suitable method for this situation? Any help is appreciated.
We can use filter to split out the A and B columns, then merge with a constant key (also known as a cross join in SQL) and pivot. Here I am using groupby with unstack, which is equivalent to pivot_table.
A = df.filter(like='A').assign(key=1)
B = df.filter(like='B').assign(key=1)
s = A.merge(B)
s = s.assign(value=s.A_values + s.B_values).groupby(['A', 'B'])['value'].sum().unstack()
s
B 5 6
A
1 10 11
2 11 12
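As a side note, pandas 1.2+ supports an explicit cross join via merge(how='cross'), which avoids the key=1 trick. A sketch of the same computation:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'A_values': [3, 4],
                   'B': [5, 6], 'B_values': [7, 8]})

# Split the A and B columns, then pair every A row with every B row.
A = df.filter(like='A')
B = df.filter(like='B')
s = A.merge(B, how='cross')  # requires pandas >= 1.2

# Sum the value columns and pivot A/B into index/columns.
out = (s.assign(value=s.A_values + s.B_values)
        .groupby(['A', 'B'])['value'].sum().unstack())
print(out)
# B   5   6
# A
# 1  10  11
# 2  11  12
```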
When selecting a single column from a pandas DataFrame (say df.iloc[:, 0], df['A'], or df.A, etc.), the resulting vector is automatically converted to a Series instead of a single-column DataFrame. However, I am writing some functions that take a DataFrame as an input argument. Therefore, I prefer to deal with a single-column DataFrame instead of a Series, so that the function can assume, say, that df.columns is accessible. Right now I have to explicitly convert the Series into a DataFrame using something like pd.DataFrame(df.iloc[:, 0]). This doesn't seem like the cleanest method. Is there a more elegant way to index into a DataFrame directly so that the result is a single-column DataFrame instead of a Series?
As #Jeff mentions there are a few ways to do this, but I recommend using loc/iloc to be more explicit (and raise errors early if you're trying something ambiguous):
In [10]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
In [11]: df
Out[11]:
A B
0 1 2
1 3 4
In [12]: df[['A']]
In [13]: df[[0]]
In [14]: df.loc[:, ['A']]
In [15]: df.iloc[:, [0]]
Out[12-15]: # they all return the same thing:
A
0 1
1 3
The latter two choices remove ambiguity in the case of integer column names (precisely why loc/iloc were created). For example:
In [16]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 0])
In [17]: df
Out[17]:
A 0
0 1 2
1 3 4
In [18]: df[[0]] # ambiguous
Out[18]:
A
0 1
1 3
As Andy Hayden recommends, using .iloc/.loc to index out a (single-column) DataFrame is the way to go; another point to note is how to express the index positions.
Pass the index labels/positions as a list to get a DataFrame back; passing a scalar instead will return a 'pandas.core.series.Series'.
Input:
A_1 = train_data.loc[:,'Fraudster']
print('A_1 is of type', type(A_1))
A_2 = train_data.loc[:, ['Fraudster']]
print('A_2 is of type', type(A_2))
A_3 = train_data.iloc[:,12]
print('A_3 is of type', type(A_3))
A_4 = train_data.iloc[:,[12]]
print('A_4 is of type', type(A_4))
Output:
A_1 is of type <class 'pandas.core.series.Series'>
A_2 is of type <class 'pandas.core.frame.DataFrame'>
A_3 is of type <class 'pandas.core.series.Series'>
A_4 is of type <class 'pandas.core.frame.DataFrame'>
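Since train_data is not defined in the snippet above, here is a self-contained sketch reproducing the same Series-vs-DataFrame distinction with a tiny stand-in frame:

```python
import pandas as pd

# A small stand-in for train_data.
df = pd.DataFrame({'A': [1, 3], 'B': [2, 4]})

print(type(df.loc[:, 'A']))    # scalar label    -> Series
print(type(df.loc[:, ['A']]))  # list of labels  -> DataFrame
print(type(df.iloc[:, 0]))     # scalar position -> Series
print(type(df.iloc[:, [0]]))   # list of positions -> DataFrame
```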
These three approaches have been mentioned:
pd.DataFrame(df.loc[:, 'A']) # Approach of the original post
df.loc[:, ['A']]              # Approach 2 (use iloc for positional indexing)
df[['A']] # Approach 3
pd.Series.to_frame() is another approach.
Because it is a method, it can be used in situations where the second and third approaches above do not apply. In particular, it is useful when applying some method to a column in your dataframe and you want to convert the output into a dataframe instead of a series. For instance, in a Jupyter Notebook a series will not have pretty output, but a dataframe will.
# Basic use case:
df['A'].to_frame()
# Use case 2 (this will give you pretty output in a Jupyter Notebook):
df['A'].describe().to_frame()
# Use case 3:
df['A'].str.strip().to_frame()
# Use case 4:
def some_function(num):
    ...
df['A'].apply(some_function).to_frame()
You can use df.iloc[:, 0:1]; in this case the result will be a DataFrame and not a Series (tested on pandas 1.3.4).
I'd like to add a little more context to the answers involving .to_frame(). If you select a single row of a data frame and execute .to_frame() on that, then the index will be made up of the original column names and you'll get numeric column names. You can just tack on a .T to the end to transpose that back into the original data frame's format (see below).
import pandas as pd
print(pd.__version__) #1.3.4
df = pd.DataFrame({
"col1": ["a", "b", "c"],
"col2": [1, 2, 3]
})
# series
df.loc[0, ["col1", "col2"]]
# dataframe (column names are along the index; not what I wanted)
df.loc[0, ["col1", "col2"]].to_frame()
# 0
# col1 a
# col2 1
# looks like an actual single-row dataframe.
# To me, this is the true answer to the question
# because the output matches the format of the
# original dataframe.
df.loc[0, ["col1", "col2"]].to_frame().T
# col1 col2
# 0 a 1
# this works really well with .to_dict(orient="records") which is
# what I'm ultimately after by selecting a single row
df.loc[0, ["col1", "col2"]].to_frame().T.to_dict(orient="records")
# [{'col1': 'a', 'col2': 1}]
I have a dataframe with a lot of columns using the suffix '_o'. Is there a way to drop all the columns that has '_o' in the end of its label?
In this post I've seen a way to drop the columns that start with something using the filter function. But how to drop the ones that end with something?
Pandonic
df = df.loc[:, ~df.columns.str.endswith('_o')]
df = df[df.columns[~df.columns.str.endswith('_o')]]
List comprehensions
df = df[[x for x in df if not x.endswith('_o')]]
df = df.drop(columns=[x for x in df if x.endswith('_o')])
To use df.filter() properly here you could use it with a lookbehind:
>>> df = pd.DataFrame({'a': [1, 2], 'a_o': [2, 3], 'o_b': [4, 5]})
>>> df.filter(regex=r'.*(?<!_o)$')
a o_b
0 1 4
1 2 5
This can be done by re-assigning the dataframe with only the needed columns
df = df.iloc[:, [not o.endswith('_o') for o in df.columns]]