How am I supposed to remove the index column in the first row? I know it is not counted as a column, but when I transpose the data frame it no longer lets me use my headers.
In[297] df = df.transpose()
print(df)
df = df.drop('RTM',1)
df = df.drop('Requirements', 1)
df = df.drop('Test Summary Report', 1)
print(df)
This throws me the error "labels ['RTM'] not contained in axis".
RTM is contained in an axis, and the drops do work if I use index_col=0:
df = xl.parse(sheet_name, header=1, index_col=0, usecols="A:E", nrows=6)
but then I lose my (0,0) value "Artifact name" as a header. Any help will be appreciated.
You can do this with .iloc, assigning the first row to the column names after transposing. Then you can drop that row and clean up the names.
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': list('ABCDE'),
                   'val1': np.arange(1, 6, 1),
                   'val2': np.arange(11, 16, 1)})
id val1 val2
0 A 1 11
1 B 2 12
2 C 3 13
3 D 4 14
4 E 5 15
Transpose and clean up the names
df = df.T
df.columns = df.iloc[0]               # first row ('id') becomes the header
df = df.drop(df.iloc[0].index.name)   # drop that row by its label ('id')
df.columns.name = None                # remove the leftover axis name
df is now:
A B C D E
val1 1 2 3 4 5
val2 11 12 13 14 15
Alternatively, just create a new DataFrame to begin with, specifying which column you want to be the header column.
header_col = 'id'
cols = [x for x in df.columns if x != header_col]
pd.DataFrame(df[cols].values.T, columns=df[header_col], index=cols)
Output:
id A B C D E
val1 1 2 3 4 5
val2 11 12 13 14 15
Using the setup from #ALollz:
df.set_index('id').rename_axis(None).T
A B C D E
val1 1 2 3 4 5
val2 11 12 13 14 15
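Applied back to the original question, the same set_index-before-transpose pattern makes the row labels usable as real column headers for drop. This is only a sketch: the 'Artifact name' column and the Count values are assumed stand-ins for the parsed sheet, since the real layout isn't shown.

```python
import pandas as pd

# Assumed stand-in for the parsed sheet: 'Artifact name' holds the labels
# that should become column headers after transposing.
df = pd.DataFrame({'Artifact name': ['RTM', 'Requirements', 'Test Summary Report'],
                   'Count': [10, 20, 30]})

# Promote 'Artifact name' to the index before transposing, so its values
# become real column labels instead of an anonymous first row.
df = df.set_index('Artifact name').rename_axis(None).T

# Now the labels are genuine columns and can be dropped by name.
df = df.drop(columns=['RTM', 'Requirements'])
print(df)
```

Using drop(columns=...) instead of the positional axis argument also avoids the deprecated df.drop('RTM', 1) form from the question.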
I have my input in a Pandas dataframe in the following format.
I would like to convert it into the below format
What I have managed to do so far:
I managed to extract the col values A and B from the column names and have cross-joined them with the name column to obtain the following dataframe. I am not sure if my approach is correct.
I am not sure how I should go about it. Any help would be appreciated. Thanks
I agree with the earlier comment about posting data/code, but in this case it's simple enough to type in an example:
df = pd.DataFrame({'Name': ['AA', 'BB', 'CC'],
                   'col1_A': [5, 2, 5],
                   'col2_A': [10, 3, 6],
                   'col1_B': [15, 4, 7],
                   'col2_B': [20, 6, 21],
                   })
print(df)
Name col1_A col2_A col1_B col2_B
0 AA 5 10 15 20
1 BB 2 3 4 6
2 CC 5 6 7 21
You can create a pd.MultiIndex to replace the column names to match the structure of the table:
df = df.set_index('Name')
df.columns = pd.MultiIndex.from_product([['A','B'],['val_1','val_2']], names=('col', None))
print(df)
col A B
val_1 val_2 val_1 val_2
Name
AA 5 10 15 20
BB 2 3 4 6
CC 5 6 7 21
Then stack() the 'col' column index, and reset both indices to be columns:
df = df.stack('col').reset_index()
print(df)
Name col val_1 val_2
0 AA A 5 10
1 AA B 15 20
2 BB A 2 3
3 BB B 4 6
4 CC A 5 6
5 CC B 7 21
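For completeness, pandas also ships pd.wide_to_long, which is built for exactly this stub_suffix column pattern. A sketch on the same sample data (the val_1/val_2 renames and the final sort are just to match the layout above):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['AA', 'BB', 'CC'],
                   'col1_A': [5, 2, 5],
                   'col2_A': [10, 3, 6],
                   'col1_B': [15, 4, 7],
                   'col2_B': [20, 6, 21]})

# Split 'col1_A' into stub 'col1' and suffix 'A'; the suffixes become
# the values of a new 'col' column.
out = (pd.wide_to_long(df, stubnames=['col1', 'col2'],
                       i='Name', j='col', sep='_', suffix=r'\w+')
         .rename(columns={'col1': 'val_1', 'col2': 'val_2'})
         .reset_index()
         .sort_values(['Name', 'col'], ignore_index=True))
print(out)
```

The non-default suffix=r'\w+' is needed because the suffixes here are letters, not the digits wide_to_long expects by default.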
Example code:
import pandas as pd
import re
# Dummy dataframe
d = {'Name': ['AA', 'BB'], 'col1_A': [5, 4], 'col1_B': [10, 9], 'col2_A': [15, 14], 'col2_B': [20, 19]}
df = pd.DataFrame(d)
# Extract the numeric part of each 'col*' column name
col_idx = [re.findall(r'\d+', name)[0] for name in list(df.columns[df.columns.str.contains('col')])]
# Extract the letter suffix at the end of each 'col*' column name
col_sfx = [name.split('_')[-1] for name in list(df.columns[df.columns.str.contains('col')])]
# Deduplicate while preserving order
col_idx = list(dict.fromkeys(col_idx))
col_sfx = list(dict.fromkeys(col_sfx))
# Build the new df with each 'Name' repeated once per suffix, paired with the 'col' letters
new_d = {'Name': [name for name in df['Name'] for i in range(len(col_sfx))], 'col': col_sfx * len(df.index)}
new_df = pd.DataFrame(new_d)
all_sub_df = []
all_sub_df.append(new_df)
print("Name and col:\n{}\n".format(new_df))
# Create new df for each val columns
for i_c in col_idx:
    df_coli = df.filter(like='col' + i_c, axis=1)
    df_coli = df_coli.stack().reset_index()
    df_coli = df_coli[df_coli.columns[-1:]]
    df_coli.columns = ['val_' + i_c]
    print("df_col{}:\n{}\n".format(i_c, df_coli))
    all_sub_df.append(df_coli)
# Concatenate all columns for result
new_df = pd.concat(all_sub_df, axis=1)
new_df
Outputs:
Name and col:
Name col
0 AA A
1 AA B
2 BB A
3 BB B
df_col1:
val_1
0 5
1 10
2 4
3 9
df_col2:
val_2
0 15
1 20
2 14
3 19
Name col val_1 val_2
0 AA A 5 15
1 AA B 10 20
2 BB A 4 14
3 BB B 9 19
Let's say I have a DataFrame and don't know the names of all columns. However, I know there's a column called "N_DOC" and I want this to be the first column of the DataFrame - (while keeping all other columns, regardless its order).
How can I do this?
You can reorder the columns of a dataframe with reindex:
cols = df.columns.tolist()
cols.remove('N_DOC')
df.reindex(['N_DOC'] + cols, axis=1)
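A runnable sketch of the same idea, with made-up column names (note that reindex returns a new frame, so the result needs to be assigned back):

```python
import pandas as pd

# Hypothetical frame whose column names we don't know in advance,
# except that one of them is 'N_DOC'.
df = pd.DataFrame({'A': [1], 'B': [2], 'N_DOC': [3], 'C': [4]})

cols = df.columns.tolist()
cols.remove('N_DOC')
df = df.reindex(['N_DOC'] + cols, axis=1)
print(df.columns.tolist())
```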
Use DataFrame.insert with DataFrame.pop for extract column:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'N_DOC':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
c = 'N_DOC'
df.insert(0, c, df.pop(c))
Or:
df.insert(0, 'N_DOC', df.pop('N_DOC'))
print (df)
N_DOC A B C E F
0 1 a 4 7 5 a
1 3 b 5 8 3 a
2 5 c 4 9 6 a
3 7 d 5 4 9 b
4 1 e 5 2 2 b
5 0 f 4 3 4 b
Here's a simple one-line solution using column selection:
import pandas as pd
# Building sample dataset.
cols = ['N_DOCa', 'N_DOCb', 'N_DOCc', 'N_DOCd', 'N_DOCe', 'N_DOC']
df = pd.DataFrame(columns=cols)
# Re-order columns.
df = df[['N_DOC'] + df.columns.drop('N_DOC').tolist()]
Before:
Index(['N_DOCa', 'N_DOCb', 'N_DOCc', 'N_DOCd', 'N_DOCe', 'N_DOC'], dtype='object')
After:
Index(['N_DOC', 'N_DOCa', 'N_DOCb', 'N_DOCc', 'N_DOCd', 'N_DOCe'], dtype='object')
I have a pandas dataframe with more than 100 columns.
For example in the following df:
df has columns ['A','B','C','D','E','date','G','H','F','I']
How can I move date to be the last column, assuming the dataframe is large and I can't write out all the column names manually?
You can try this:
new_cols = [col for col in df.columns if col != 'date'] + ['date']
df = df[new_cols]
Test data:
cols = ['A','B','C','D','E','date','G','H','F','I']
df = pd.DataFrame([np.arange(len(cols))],
columns=cols)
print(df)
# A B C D E date G H F I
# 0 0 1 2 3 4 5 6 7 8 9
Output of the code:
A B C D E G H F I date
0 0 1 2 3 4 6 7 8 9 5
Use pandas.DataFrame.pop and pandas.concat:
print(df)
col1 col2 col3
0 1 11 111
1 2 22 222
2 3 33 333
s = df.pop('col1')
new_df = pd.concat([df, s], axis=1)
print(new_df)
Output:
col2 col3 col1
0 11 111 1
1 22 222 2
2 33 333 3
This way:
df_new=df.loc[:,df.columns!='date']
df_new['date']=df['date']
Simple reindexing should do the job (delete 'date' from the column index, then append it at the end):
original = df.columns
new_cols = original.delete(original.get_loc('date'))
df.reindex(columns=new_cols.tolist() + ['date'])
You can use reindex and union:
df.reindex(df.columns[df.columns != 'date'].union(['date']), axis=1)
Note that Index.union sorts its result, so this also reorders the other columns alphabetically; 'date' lands last here only because lowercase letters sort after the uppercase column names.
Let's work only with the column headers, not the complete dataframe, and then use reindex to reorder the columns.
Output using #QuangHoang setup:
A B C D E F G H I date
0 0 1 2 3 4 8 6 7 9 5
You can use movecolumn package in Python to move columns:
pip install movecolumn
Then you can write your code as:
import movecolumn as mc
mc.MoveToLast(df,'date')
Hope that helps.
P.S.: The package can be found at https://pypi.org/project/movecolumn/
I need to merge 1 df with 1 csv.
df1 contains only one column (the list of product ids I want to update)
df2 contains two columns (the ids of all products, and quantity)
df1=pd.read_csv(id_file, header=0, index_col=False)
df2 = pd.DataFrame(data=result_q)
df3=pd.merge(df1, df2)
What I want: a dataframe containing only the ids from the csv/df1, merged with the quantities from df2 for the same ids.
If you want only the products that you have in the first dataframe, you can use this:
df_1
Out[11]:
id
0 1
1 2
2 4
3 5
df_2
Out[12]:
id prod
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
5 6 f
6 7 g
7 8 h
df_3 = df_1.merge(df_2,on='id')
df_3
Out[14]:
id prod
0 1 a
1 2 b
2 4 d
3 5 e
You need to use the parameter on='column' so the merge generates a new df containing only the corresponding rows that have the same id.
you can use new_df= pd.merge(df1,df2, on=['Product_id'])
I've found the solution. I needed to reset the index for my df2
df1=pd.read_csv(id_file)
df2 = pd.DataFrame(data=result_q).reset_index()
df1['id'] = pd.to_numeric(df1['id'], errors = 'coerce')
df2['id'] = pd.to_numeric(df2['id'], errors = 'coerce')
df3=df1.merge(df2, on='id')
Thank you everyone!
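The likely reason the to_numeric conversion helped: merge needs the key columns to have compatible dtypes, and ids read from a CSV often come in as strings while the other frame holds integers (depending on the pandas version, such a merge either raises a ValueError or matches nothing). A minimal sketch with made-up ids and quantities:

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['1', '2', '4']})           # ids read from CSV as strings
df2 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'quantity': [10, 20, 30, 40]})    # ids as integers

# Normalise both key columns to the same numeric dtype before merging.
df1['id'] = pd.to_numeric(df1['id'], errors='coerce')
df2['id'] = pd.to_numeric(df2['id'], errors='coerce')

df3 = df1.merge(df2, on='id')
print(df3)
```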
I have a dataframe containing (record formatted) json strings as follows:
In[9]: pd.DataFrame( {'col1': ['A','B'], 'col2': ['[{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"25.0"}]',
'[{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"15.0"}]']})
Out[9]:
col1 col2
0 A [{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"2...
1 B [{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"1...
I would like to extract the json and for each record add a new row to the dataframe:
co1 t v
0 A 05:15:00 20
1 A 05:20:00 25
2 B 05:15:00 10
3 B 05:20:00 15
I've been experimenting with the following code:
def json_to_df(x):
df2 = pd.read_json(x.col2)
return df2
df.apply(json_to_df, axis=1)
but the resulting dataframes are assigned as tuples, rather than creating new rows. Any advice?
The problem with apply is that you need to return multiple rows while it expects only one. A possible solution:
from functools import reduce  # needed on Python 3

def json_to_df(row):
    _, row = row                       # unpack the (index, Series) pair from iterrows
    df_json = pd.read_json(row.col2)
    col1 = pd.Series([row.col1] * len(df_json), name='col1')
    return pd.concat([col1, df_json], axis=1)

dfs = map(json_to_df, df.iterrows())              # an iterable of dataframes
df = reduce(lambda x, y: pd.concat([x, y]), dfs)  # glues them together
df
col1 t v
0 A 05:15 20
1 A 05:20 25
0 B 05:15 10
1 B 05:20 15
Ok, taking a little inspiration from hellpanderrr's answer above, I came up with the following:
In [92]:
pd.DataFrame( {'X': ['A','B'], 'Y': ['fdsfds','fdsfds'], 'json': ['[{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"25.0"}]',
'[{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"15.0"}]']},)
Out[92]:
X Y json
0 A fdsfds [{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"2...
1 B fdsfds [{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"1...
In [93]:
dfs = []

def json_to_df(row, json_col):
    json_df = pd.read_json(row[json_col])
    dfs.append(json_df.assign(**row.drop(json_col)))

# `_` is the DataFrame from Out[92] above
_.apply(json_to_df, axis=1, json_col='json')
pd.concat(dfs)
Out[93]:
t v X Y
0 05:15 20 A fdsfds
1 05:20 25 A fdsfds
0 05:15 10 B fdsfds
1 05:20 15 B fdsfds
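On recent pandas (1.1 or later, for explode's ignore_index), the same unpacking can be sketched without apply at all: parse the JSON, explode the resulting lists into rows, then flatten the dicts with json_normalize. Note t and v stay strings here, exactly as they appear in the raw JSON:

```python
import json
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B'],
                   'col2': ['[{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"25.0"}]',
                            '[{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"15.0"}]']})

df['col2'] = df['col2'].apply(json.loads)       # JSON text -> list of dicts
rows = df.explode('col2', ignore_index=True)    # one row per dict
out = pd.concat([rows.drop(columns='col2'),
                 pd.json_normalize(rows['col2'].tolist())], axis=1)
print(out)
```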