Fill in empty header with previous column name - pandas - python

I have a dataframe where each second column name is skipped:
eg
Step_1.
The idea is to fill unnamed columns with previous name to get:
Step_2.
To sum up "in" and "out" in each class, to get final result like this
The intermediary Step_1 is important and cannot be skipped to get the final result.
I appreciate any help and apologize for not being clear enough when asking question at the first attempt.
Thank you

Idea is convert columns to Series, so possible replace missing values instead values starting by Unnamed with forward filling:
df.columns = df.columns.to_series().mask(lambda x: x.str.startswith('Unnamed')).ffill()
print (df)
Column_1 Column_1 Column_2 Column_2
0 a d f g
EDIT:
If missing values in index:
df.columns = df.columns.to_series().ffill()
MultiIndex solution is necessary, if second row is header too - first use header=[0,1] for MultiIndex:
import pandas as pd
temp=u"""Column_1;Unnamed_column;Column_2;Unnamed_column
a;d;f;g
1;5;5;6
7;8;9;4"""
#after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp), sep=";", header=[0,1])
print (df)
Column_1 Unnamed_column Column_2 Unnamed_column
a d f g
0 1 5 5 6
1 7 8 9 4
a = df.columns.get_level_values(0)
b = df.columns.get_level_values(1)
df.columns = [a.to_series().mask(lambda x: x.str.startswith('Unnamed')).ffill(), b]
print (df)
Column_1 Column_2
a d f g
0 1 5 5 6
1 7 8 9 4

I tried this,
t = pd.DataFrame(df.columns)
t.loc[t[0].str.startswith('Unnamed: '),0] = np.NaN
t[0].bfill(inplace=True)
df.columns = t[0].values
Create temp dataframe with column of original dataframe. apply ffill or bfill as per your wish. assign back the values again to original dataframe.

You can rewrite the df.index with a list comprehension.
from itertools import chain
df = pd.DataFrame(
{"Column_1": [1], "Unnamed_column1": [2], "Column_2": [3], "Unnamed_column2": [4]})
cols = [[c, c] for c in df.columns[::2]]
df.columns = [_ for _ in chain(*cols)]
Having said that it might be better to assign unique names to columns as they will be used keys/indices, i.e .
cols = [[c, c+"_new"] for c in df.columns[::2]]

Related

Python : How do you filter out columns from a dataset based on substring match in Column names

df_train = pd.read_csv('../xyz.csv')
headers = df_train.columns
I want to filter out those columns in headers which have _pct in their substring.
Use df.filter
df = pd.DataFrame({'a':[1,2,3], 'b_pct':[1,2,3],'c_pct':[1,2,3],'d':[1]*3})
print(df.filter(items=[i for i in df.columns if '_pct' not in i]))
## or as jezrael suggested
# print(df[[i for i in df.columns if '_pct' not in i]])
Output:
a d
0 1 1
1 2 1
2 3 1
Use:
#data from AkshayNevrekar answer
df = df.loc[:, ~df.columns.str.contains('_pct')]
print (df)
Filter solution is not trivial:
df = df.filter(regex=r'^(?!.*_pct).*$')
a d
0 1 1
1 2 1
2 3 1
Thank you, #IanS for another solutions:
df[df.columns.difference(df.filter(like='_pct').columns).tolist()]
df.drop(df.filter(like='_pct').columns, axis=1)
As df.columns returns a list of the column names, you can use list comprehension and build your new list with a simple condition:
new_headers = [x for x in headers if '_pct' not in x]

Changing multiple column names

Let's say I have a data frame with such column names:
['a','b','c','d','e','f','g']
And I would like to change names from 'c' to 'f' (actually add string to the name of column), so the whole data frame column names would look like this:
['a','b','var_c_equal','var_d_equal','var_e_equal','var_f_equal','g']
Well, firstly I made a function that changes column names with the string i want:
df.rename(columns=lambda x: 'or_'+x+'_no', inplace=True)
But now I really want to understand how to implement something like this:
df.loc[:,'c':'f'].rename(columns=lambda x: 'var_'+x+'_equal', inplace=True)
You can a use a list comprehension for that like:
Code:
new_columns = ['var_{}_equal'.format(c) if c in 'cdef' else c for c in columns]
Test Code:
import pandas as pd
df = pd.DataFrame({'a':(1,2), 'b':(1,2), 'c':(1,2), 'd':(1,2)})
print(df)
df.columns = ['var_{}_equal'.format(c) if c in 'cdef' else c
for c in df.columns]
print(df)
Results:
a b c d
0 1 1 1 1
1 2 2 2 2
a b var_c_equal var_d_equal
0 1 1 1 1
1 2 2 2 2
One way is to use a dictionary instead of an anonymous function. Both the below variations assume the columns you need to rename are contiguous.
Contiguous columns by position
d = {k: 'var_'+k+'_equal' for k in df.columns[2:6]}
df = df.rename(columns=d)
Contiguous columns by name
If you need to calculate the numerical indices:
cols = df.columns.get_loc
d = {k: 'var_'+k+'_equal' for k in df.columns[cols('c'):cols('f')+1]}
df = df.rename(columns=d)
Specifically identified columns
If you want to provide the columns explicitly:
d = {k: 'var_'+k+'_equal' for k in 'cdef'}
df = df.rename(columns=d)

reshape a pandas dataframe

suppose a dataframe like this one:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]], columns = ['A', 'B', 'A1', 'B1'])
I would like to have a dataframe which looks like:
what does not work:
new_rows = int(df.shape[1]/2) * df.shape[0]
new_cols = 2
df.values.reshape(new_rows, new_cols, order='F')
of course I could loop over the data and make a new list of list but there must be a better way. Any ideas ?
The pd.wide_to_long function is built almost exactly for this situation, where you have many of the same variable prefixes that end in a different digit suffix. The only difference here is that your first set of variables don't have a suffix, so you will need to rename your columns first.
The only issue with pd.wide_to_long is that it must have an identification variable, i, unlike melt. reset_index is used to create a this uniquely identifying column, which is dropped later. I think this might get corrected in the future.
df1 = df.rename(columns={'A':'A1', 'B':'B1', 'A1':'A2', 'B1':'B2'}).reset_index()
pd.wide_to_long(df1, stubnames=['A', 'B'], i='index', j='id')\
.reset_index()[['A', 'B', 'id']]
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
You can use lreshape, for column id numpy.repeat:
a = [col for col in df.columns if 'A' in col]
b = [col for col in df.columns if 'B' in col]
df1 = pd.lreshape(df, {'A' : a, 'B' : b})
df1['id'] = np.repeat(np.arange(len(df.columns) // 2), len (df.index)) + 1
print (df1)
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
EDIT:
lreshape is currently undocumented, but it is possible it might be removed(with pd.wide_to_long too).
Possible solution is merging all 3 functions to one - maybe melt, but now it is not implementated. Maybe in some new version of pandas. Then my answer will be updated.
I solved this in 3 steps:
Make a new dataframe df2 holding only the data you want to be added to the initial dataframe df.
Delete the data from df that will be added below (and that was used to make df2.
Append df2 to df.
Like so:
# step 1: create new dataframe
df2 = df[['A1', 'B1']]
df2.columns = ['A', 'B']
# step 2: delete that data from original
df = df.drop(["A1", "B1"], 1)
# step 3: append
df = df.append(df2, ignore_index=True)
Note how when you do df.append() you need to specify ignore_index=True so the new columns get appended to the index rather than keep their old index.
Your end result should be your original dataframe with the data rearranged like you wanted:
In [16]: df
Out[16]:
A B
0 1 2
1 5 6
2 9 10
3 3 4
4 7 8
5 11 12
Use pd.concat() like so:
#Split into separate tables
df_1 = df[['A', 'B']]
df_2 = df[['A1', 'B1']]
df_2.columns = ['A', 'B'] # Make column names line up
# Add the ID column
df_1 = df_1.assign(id=1)
df_2 = df_2.assign(id=2)
# Concatenate
pd.concat([df_1, df_2])

How to (re)name an empty column header in a pandas dataframe without exporting to csv

I have a pandas dataframe df1 with an index column and an unnamed series of values. I want to assign a name to the unnamed series.
The only way to do this that I know so far is to export to df1.csv using:
df1.to_csv("df1.csv", header = ["Signal"])
and then re-import using:
pd.read_csv("df1.csv", sep=",")
However, this costs time and storage space. How to do this in-memory?
When I do df2 = df1.rename(columns = {"" : "Signal"}, inplace = True)
I yield:
AttributeError: "Series" object has no attribute "Signal".
I think inplace=True has to be removed, because it return None:
df2 = df1.rename(columns = {"" : "Signal"})
df1.rename(columns = {"" : "Signal"}, inplace = True)
Another solution is asign new name by position:
df.columns.values[0] = 'Signal'
Sample:
df1 = pd.DataFrame({'':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
print (df1)
B C
0 1 4 7
1 2 5 8
2 3 6 9
df2 = df1.rename(columns = {"" : "Signal"})
print (df2)
Signal B C
0 1 4 7
1 2 5 8
2 3 6 9
You can use this if there are multiple empty columns. This will generate an empty column with cols and i (for the column position)
df.columns = ["cols_"+str(i) if a == "" else a for i, a in enumerate(df.columns)]
#cols -> just rename the column name just as you want
#i -> count the column number

pandas convert grouped rows into columns

I have a dataframe such as:
label column1
a 1
a 2
b 6
b 4
I would like to make a dataframe with a new column, with the opposite value from column1 where the labels match. Such as:
label column1 column2
a 1 2
a 2 1
b 6 4
b 4 6
I know this is probably very simple to do with a groupby command but I've been searching and can't find anything.
The following uses groupby and apply and seems to work okay:
x = pd.DataFrame({ 'label': ['a','a','b','b'],
'column1': [1,2,6,4] })
y = x.groupby('label').apply(
lambda g: g.assign(column2 = np.asarray(g.column1[::-1])))
y = y.reset_index(drop=True) # optional: drop weird index
print(y)
you can try the code block below:
#create the Dataframe
df = pd.DataFrame({'label':['a','a','b','b'],
'column1':[1,2,6,4]})
#Group by label
a = df.groupby('label').first().reset_index()
b = df.groupby('label').last().reset_index()
#Concat those groups to create columns2
df2 = (pd.concat([b,a])
.sort_values(by='label')
.rename(columns={'column1':'column2'})
.reset_index()
.drop('index',axis=1))
#Merge with the original Dataframe
df = df.merge(df2,left_index=True,right_index=True,on='label')[['label','column1','column2']]
Hope this helps
Assuming their are only pairs of labels, you could use the following as well:
# Create dataframe
df = pd.DataFrame(data = {'label' :['a', 'a', 'b', 'b'],
'column1' :[1,2, 6,4]})
# iterate over dataframe, identify matching label and opposite value
for index, row in df.iterrows():
newvalue = int(df[(df.label == row.label) & (df.column1 != row.column1)].column1.values[0])
# set value to new column
df.set_value(index, 'column2', newvalue)
df.head()
You can use groupby with apply where create new Series with back order:
df['column2'] = df.groupby('label')["column1"] \
.apply(lambda x: pd.Series(x[::-1].values)).reset_index(drop=True)
print (df)
column1 label column2
0 1 a 2
1 2 a 1
2 6 b 4
3 4 b 6

Categories

Resources