How to rename the rows in a dataframe using pandas read_csv (Python)?

I want to rename rows in a Python program (Spyder 3, Python 3.6). At this point I have something like this:
import pandas as pd
data = pd.read_csv(filepath, delim_whitespace = True, header = None)
Before that I renamed my columns:
data.columns = ['A', 'B', 'C']
That gave me something like this:
A B C
0 1 n 1
1 1 H 0
2 2 He 1
3 3 Be 2
But now I want to rename the rows. I want:
A B C
n 1 n 1
H 1 H 0
He 2 He 1
Be 3 Be 2
How can I do it? The main idea is to relabel every row created by pd.read_csv with the value from column B. I tried something like this:
for rows in data:
    data.rename(index={0: data.loc[rows, 'B'], 1: 'one'})
but it's not working.
Any ideas? Maybe just replace the row labels with the data in column B? How?

I think you need set_index with rename_axis; drop=False keeps B as a regular column too, matching the desired output:
df1 = df.set_index('B', drop=False).rename_axis(None)
A solution with rename and a dictionary:
df1 = df.rename(dict(zip(df.index, df['B'])))
print (dict(zip(df.index, df['B'])))
{0: 'n', 1: 'H', 2: 'He', 3: 'Be'}
If the index is the default RangeIndex, the solution can be:
df1 = df.rename(dict(enumerate(df['B'])))
print (dict(enumerate(df['B'])))
{0: 'n', 1: 'H', 2: 'He', 3: 'Be'}
Output:
print (df1)
A B C
n 1 n 1
H 1 H 0
He 2 He 1
Be 3 Be 2
EDIT:
If you don't want column B at all, the solution is read_csv with the index_col parameter:
import pandas as pd
temp=u"""1 n 1
1 H 0
2 He 1
3 Be 2"""
from io import StringIO  # pd.compat.StringIO was removed in pandas 1.0
#after testing replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), delim_whitespace=True, header=None, index_col=[1])
print (df)
0 2
1
n 1 1
H 1 0
He 2 1
Be 3 2
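The leftover 1 printed under the header is the index name (the positional column label, since header=None). If it bothers you, it can be cleared the same way as above:
df = df.rename_axis(None)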

I normally rename the rows in my dataset by following these steps.
import pandas as pd
df = pd.read_csv("zzzz.csv")
# in a dataframe it is hard to change the names of the rows directly, so
df = df.transpose()
# this changes all the rows to columns (transpose returns a new frame, so assign it back)
df.columns = ["", "", .....]
# make sure the length of this list and the number of columns are the same, i.e. don't skip any names
# once you are done renaming them, transpose back:
df = df.transpose()
# we get our original dataset with changed row names.
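A more direct route, if the goal is just to relabel the rows, is to assign to df.index without transposing at all. A minimal sketch, assuming (as in the question) the labels live in column B:
# set the row labels directly; the list must have one label per row
df.index = df['B'].values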

Just put the column names into names when reading:
import pandas as pd
df = pd.read_csv('filename.csv', names=["colname A", "colname B"])
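If the row labels should also come straight from the file, names can be combined with index_col. A sketch using the question's whitespace-delimited data (note index_col consumes B as the index, so it no longer appears as a regular column):
import pandas as pd
df = pd.read_csv('filename.csv', delim_whitespace=True, header=None,
                 names=['A', 'B', 'C'], index_col='B')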

pandas dataframe restructuring [duplicate]

This question was closed as a duplicate of: Pandas Melt Function.
I have a dataframe like so:
Input:
df = pd.DataFrame({'a': range(3), 'b': np.arange(3)-1})
Desired output:
df_rearranged = pd.DataFrame({'data': [0,1,2,-1,0,1], 'origin': ['a', 'a', 'a', 'b', 'b', 'b']})
I have found a (hacky) way of doing this:
Attempt:
subset_1 = df[['a']]
subset_1['origin'] = 'a'
subset_1.rename(columns={'a':'data'}, inplace=True)
subset_2 = df[['b']]
subset_2['origin'] = 'b'
subset_2.rename(columns={'b':'data'}, inplace=True)
df_rearranged = subset_1.append(subset_2)
This works, but it quickly becomes impractical when I want to pool larger numbers of columns. Also, I feel that there should be a function in pandas that does this by default, but I am lacking the keywords to find it. Help is greatly appreciated!
Use DataFrame.melt and reorder the columns with DataFrame.reindex:
df1 = df.melt(var_name='origin', value_name='data').reindex(['data','origin'], axis=1)
print (df1)
data origin
0 0 a
1 1 a
2 2 a
3 -1 b
4 0 b
5 1 b
Or use the DataFrame constructor with numpy.ravel and numpy.repeat, which should perform better. Note order='F' so the values are taken column by column, matching the melt output:
import numpy as np

df1 = pd.DataFrame({'data':df.values.ravel(order='F'), 'origin':np.repeat(df.columns, len(df))})
print (df1)
   data origin
0     0      a
1     1      a
2     2      a
3    -1      b
4     0      b
5     1      b
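As a quick sanity check (just a sketch), the two approaches should now produce identical frames, so this should print True:
print (df1.equals(df.melt(var_name='origin', value_name='data').reindex(['data','origin'], axis=1)))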

How to fix reading a CSV where column MultiIndex header row(s) have missing values?

I am trying to read in several CSV files with an unfortunate structure; here's a simplified example:
[empty], A, A, B, B
time , X, Y, X, Y
0.0 , 0, 0, 0, 0
1.0 , 2, 5, 7, 0
... , ., ., ., .
...using pandas.read_csv with the header=[0,1] argument I can access the values fine:
>>> df = pd.read_csv('file.csv', header=[0,1])
>>> df.A.X
0 0
1 2
...
But the empty field above the time header results in an ugly Unnamed: 0_level_0 level:
>>> df.columns
MultiIndex(levels=[['Unnamed: 0_level_0', 'A', 'B'], ...
Is there any way to fix this, so I can access the time data with df.Time again?
EDIT:
This is a snippet of the actual data set:
,,Bone,Bone,Bone
,,Skeleton1_Hip,Skeleton1_Hip,Skeleton1_Hip
,,"1","1","1"
,,Rotation,Rotation,Rotation
Frame,Time,X,Y,Z
0,0.000000,0.009332,0.999247,0.021044
1,0.008333,0.009572,0.999217,0.020468
3,0.016667,0.009871,0.999183,0.019797
(see also: https://gist.github.com/fhaust/25ba612f99420d366f0597b15dbf43e7 for a more complete example)
read via:
pd.read_csv(file, skiprows=2, header=[0,1,3,4], index_col=[1])
I don't really care about the Frame column, as it's given implicitly with the row index.
Add the index_col parameter to convert the first column to the index:
import pandas as pd
temp=u""",A,A,B,B
time,X,Y,X,Y
0.0,0,0,0,0
1.0,2,5,7,0"""
from io import StringIO  # pd.compat.StringIO was removed in pandas 1.0
#after testing replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), header=[0,1], index_col=[0])
print (df)
A B
time X Y X Y
0.0 0 0 0 0
1.0 2 5 7 0
Or rename the first header level (in this variant time stays a regular column):
df = df.rename(columns={'Unnamed: 0_level_0':'val'})
print (df)
val A B
time X Y X Y
0 0.0 0 0 0 0
1 1.0 2 5 7 0
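For the real multi-level data set, every 'Unnamed: ...' placeholder can be blanked out in one pass. A hedged sketch rebuilding the column MultiIndex from tuples (not part of the original answer):
df.columns = pd.MultiIndex.from_tuples(
    [tuple('' if str(level).startswith('Unnamed') else level for level in col)
     for col in df.columns])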

Changing multiple column names

Let's say I have a data frame with such column names:
['a','b','c','d','e','f','g']
And I would like to change the names from 'c' through 'f' (actually add a string to each column's name), so the whole data frame's column names would look like this:
['a','b','var_c_equal','var_d_equal','var_e_equal','var_f_equal','g']
Well, first I wrote a call that changes all the column names by adding the string I want:
df.rename(columns=lambda x: 'or_'+x+'_no', inplace=True)
But now I really want to understand how to implement something like this:
df.loc[:,'c':'f'].rename(columns=lambda x: 'var_'+x+'_equal', inplace=True)
You can use a list comprehension for that:
Code:
new_columns = ['var_{}_equal'.format(c) if c in 'cdef' else c for c in df.columns]
Test Code:
import pandas as pd
df = pd.DataFrame({'a':(1,2), 'b':(1,2), 'c':(1,2), 'd':(1,2)})
print(df)
df.columns = ['var_{}_equal'.format(c) if c in 'cdef' else c
for c in df.columns]
print(df)
Results:
a b c d
0 1 1 1 1
1 2 2 2 2
a b var_c_equal var_d_equal
0 1 1 1 1
1 2 2 2 2
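One caveat: c in 'cdef' is a substring test, so a column named 'cd' would also match. If that's a risk, a set is safer (a small sketch):
df.columns = ['var_{}_equal'.format(c) if c in {'c', 'd', 'e', 'f'} else c
              for c in df.columns]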
One way is to use a dictionary instead of an anonymous function. The first two variations below assume the columns you need to rename are contiguous.
Contiguous columns by position
d = {k: 'var_'+k+'_equal' for k in df.columns[2:6]}
df = df.rename(columns=d)
Contiguous columns by name
If you need to calculate the numerical indices:
cols = df.columns.get_loc
d = {k: 'var_'+k+'_equal' for k in df.columns[cols('c'):cols('f')+1]}
df = df.rename(columns=d)
Specifically identified columns
If you want to provide the columns explicitly:
d = {k: 'var_'+k+'_equal' for k in 'cdef'}
df = df.rename(columns=d)
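And to stay close to the 'c':'f' label slicing the question tried, the slice itself can supply the keys (a sketch; note .loc label slices include both endpoints):
d = {k: 'var_' + k + '_equal' for k in df.loc[:, 'c':'f'].columns}
df = df.rename(columns=d)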

reshape a pandas dataframe

suppose a dataframe like this one:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]], columns = ['A', 'B', 'A1', 'B1'])
I would like to have a dataframe where the A1/B1 pair is stacked under the A/B pair, with an id marking the origin:
   A   B  id
0  1   2   1
1  5   6   1
2  9  10   1
3  3   4   2
4  7   8   2
5 11  12   2
What does not work:
new_rows = int(df.shape[1]/2) * df.shape[0]
new_cols = 2
df.values.reshape(new_rows, new_cols, order='F')
Of course I could loop over the data and build a new list of lists, but there must be a better way. Any ideas?
The pd.wide_to_long function is built almost exactly for this situation, where you have many of the same variable prefixes that end in a different digit suffix. The only difference here is that your first set of variables don't have a suffix, so you will need to rename your columns first.
The only issue with pd.wide_to_long is that, unlike melt, it must have an identification variable, i. reset_index is used to create this uniquely identifying column, which is dropped later. I think this might get corrected in the future.
df1 = df.rename(columns={'A':'A1', 'B':'B1', 'A1':'A2', 'B1':'B2'}).reset_index()
pd.wide_to_long(df1, stubnames=['A', 'B'], i='index', j='id')\
.reset_index()[['A', 'B', 'id']]
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
You can use lreshape and, for the id column, numpy.repeat:
a = [col for col in df.columns if 'A' in col]
b = [col for col in df.columns if 'B' in col]
df1 = pd.lreshape(df, {'A' : a, 'B' : b})
df1['id'] = np.repeat(np.arange(len(df.columns) // 2), len(df.index)) + 1
print (df1)
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
EDIT:
lreshape is currently undocumented, and it is possible it will be removed (along with pd.wide_to_long).
A possible solution is merging all three functions into one, maybe melt, but that is not implemented yet. Maybe it will arrive in some new version of pandas; this answer will be updated then.
I solved this in 3 steps:
Make a new dataframe df2 holding only the data you want to be added to the initial dataframe df.
Delete the data from df that will be added below (and that was used to make df2).
Append df2 to df.
Like so:
# step 1: create new dataframe (copy so the rename below doesn't touch df)
df2 = df[['A1', 'B1']].copy()
df2.columns = ['A', 'B']
# step 2: delete that data from the original
df = df.drop(['A1', 'B1'], axis=1)
# step 3: append (df.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df, df2], ignore_index=True)
Note that you need to specify ignore_index=True so the appended rows get a fresh index rather than keeping their old one.
Your end result should be your original dataframe with the data rearranged like you wanted:
In [16]: df
Out[16]:
A B
0 1 2
1 5 6
2 9 10
3 3 4
4 7 8
5 11 12
Use pd.concat() like so:
# split into separate tables
df_1 = df[['A', 'B']]
df_2 = df[['A1', 'B1']].copy()  # copy so the rename below doesn't warn
df_2.columns = ['A', 'B']  # make column names line up
# Add the ID column
df_1 = df_1.assign(id=1)
df_2 = df_2.assign(id=2)
# Concatenate
pd.concat([df_1, df_2])
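If there are many column pairs, the same idea generalizes with a loop. A sketch, assuming the groups share known suffixes like '', '1', '2', ...:
frames = []
for i, s in enumerate(['', '1'], start=1):  # extend the suffix list as needed
    part = df[['A' + s, 'B' + s]].copy()
    part.columns = ['A', 'B']  # make column names line up
    part['id'] = i
    frames.append(part)
out = pd.concat(frames, ignore_index=True)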

pandas convert grouped rows into columns

I have a dataframe such as:
label column1
a 1
a 2
b 6
b 4
I would like to make a dataframe with a new column, with the opposite value from column1 where the labels match. Such as:
label column1 column2
a 1 2
a 2 1
b 6 4
b 4 6
I know this is probably very simple to do with a groupby command but I've been searching and can't find anything.
The following uses groupby and apply and seems to work okay:
import numpy as np
import pandas as pd

x = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                  'column1': [1, 2, 6, 4]})
y = x.groupby('label').apply(
    lambda g: g.assign(column2=np.asarray(g.column1[::-1])))
y = y.reset_index(drop=True)  # optional: drop the weird MultiIndex
print(y)
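With the sample data, this should print something like:
  label  column1  column2
0     a        1        2
1     a        2        1
2     b        6        4
3     b        4        6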
You can try the code block below:
# create the dataframe
df = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                   'column1': [1, 2, 6, 4]})
# group by label
a = df.groupby('label').first().reset_index()
b = df.groupby('label').last().reset_index()
# concat those groups to create column2
df2 = (pd.concat([b, a])
         .sort_values(by='label')
         .rename(columns={'column1': 'column2'})
         .reset_index()
         .drop('index', axis=1))
# merge with the original dataframe on the index
# (merge can't combine on= with left_index/right_index, so merge on the index only)
df = df.merge(df2[['column2']], left_index=True, right_index=True)[['label', 'column1', 'column2']]
Hope this helps
Assuming there are only pairs of labels, you could use the following as well:
# create dataframe
df = pd.DataFrame(data={'label': ['a', 'a', 'b', 'b'],
                        'column1': [1, 2, 6, 4]})
# iterate over the dataframe, identify the matching label and the opposite value
for index, row in df.iterrows():
    newvalue = int(df[(df.label == row.label) & (df.column1 != row.column1)].column1.values[0])
    # set the value in the new column (set_value was removed in pandas 1.0; use .at)
    df.at[index, 'column2'] = newvalue
df.head()
You can use groupby with apply to create a new Series in reversed order:
df['column2'] = df.groupby('label')['column1'] \
    .apply(lambda x: pd.Series(x[::-1].values)).reset_index(drop=True)
print (df)
column1 label column2
0 1 a 2
1 2 a 1
2 6 b 4
3 4 b 6
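A similar one-liner with groupby.transform keeps the original index without the reset (a sketch of the same reversal idea):
df['column2'] = df.groupby('label')['column1'].transform(lambda s: s.values[::-1])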
