Is there an easy way to convert a pandas DataFrame to a list by column instead of by row? For the example below, can we get [['Apple','Orange','Kiwi','Mango'],[220,200,1000,800],['a','o','k','m']]?
I'd appreciate any advice on this. Thanks.
import pandas as pd
data = {'Brand': ['Apple', 'Orange', 'Kiwi', 'Mango'],
        'Price': [220, 200, 1000, 800],
        'Type': ['a', 'o', 'k', 'm']}
df = pd.DataFrame(data, columns = ['Brand', 'Price', 'Type'])
df.head()
df.values.tolist()
#[['Apple', 220, 'a'], ['Orange', 200, 'o'], ['Kiwi', 1000, 'k'], ['Mango', 800, 'm']]
# Any way to have the following instead?
#[['Apple','Orange','Kiwi','Mango'],[220,200,1000,800],['a','o','k','m']]
Just use the T (transpose) attribute:
lst=df.T.values.tolist()
OR
use the transpose() method:
lst=df.transpose().values.tolist()
If you print lst you will get:
[['Apple', 'Orange', 'Kiwi', 'Mango'], [220, 200, 1000, 800], ['a', 'o', 'k', 'm']]
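An alternative that avoids the transpose altogether is a plain list comprehension over the columns, calling Series.tolist() on each one; a minimal sketch with the same data:

import pandas as pd

df = pd.DataFrame({'Brand': ['Apple', 'Orange', 'Kiwi', 'Mango'],
                   'Price': [220, 200, 1000, 800],
                   'Type': ['a', 'o', 'k', 'm']})

# Build one Python list per column, in column order.
cols_as_lists = [df[col].tolist() for col in df.columns]
print(cols_as_lists)
# [['Apple', 'Orange', 'Kiwi', 'Mango'], [220, 200, 1000, 800], ['a', 'o', 'k', 'm']]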
Suppose I have a dataframe like this:
data = [['A', 'HIGH', 120, 200],
        ['A', 'MID', 350, 200],
        ['B', 'HIGH', 130, 100],
        ['B', 'HIGH', 70, 100],
        ['A', 'MID', 130, 200]]
df = pd.DataFrame(data, columns=['Category', 'Range', 'Total', 'Avg'])
Now, I want to create a group-by such that when the Category is A, it groups by Category and Range, while when it is B, it groups only by Category.
Is it possible to do this?
Thanks!
Check the code below. It will also work when B has multiple Range values.
import pandas as pd
import numpy as np
data = [['A', 'HIGH', 120, 200],
        ['A', 'MID', 350, 200],
        ['A', 'MID', 130, 200],
        ['B', 'HIGH', 130, 100],
        ['B', 'MID', 70, 100],
        ['B', 'MID', 70, 100]]
df = pd.DataFrame(data, columns=['Category', 'Range', 'Total', 'Avg'])
df[['Total_New', 'Avg_New']] = (
    df.assign(group_col=np.where(df['Category'] == 'A', df.Category + df.Range, df.Category))
      .groupby('group_col')[['Total', 'Avg']]
      .transform('sum')
)
df
Output:
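  Category Range  Total  Avg  Total_New  Avg_New
0        A  HIGH    120  200        120      200
1        A   MID    350  200        480      400
2        A   MID    130  200        480      400
3        B  HIGH    130  100        270      300
4        B   MID     70  100        270      300
5        B   MID     70  100        270      300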
Suppose I have a dataframe
data = {'Date': ['22-08-2021', '12-09-2021', '02-10-2021', '22-11-2021'], 'ID': ['A', 'B', 'C', 'O'], 'Item':['Apple','Banana','Carrot', 'Orange'], 'Cost':[10, 12, 15, 13]}
dataframe = pd.DataFrame(data)
dataframe
And a list of indices,
index_list = ['A', 'A', 'B', 'B', 'O', 'C', 'C']
And I want to select rows based on their IDs, multiple times, as they appear in this list, so that the above dataframe would become
data2 = {'Date':['22-08-2021', '22-08-2021', '12-09-2021', '12-09-2021', '22-11-2021', '02-10-2021', '02-10-2021'], 'ID': ['A', 'A', 'B', 'B', 'O', 'C', 'C'], 'Item':['Apple', 'Apple', 'Banana', 'Banana', 'Orange', 'Carrot', 'Carrot'], 'Cost':[10, 10, 12, 12, 13, 15, 15]}
dataframe2 = pd.DataFrame(data2)
dataframe2
What's the best way to do this using Pandas?
My approach:
I wrote the following for loop to achieve this, but I think there should be built-in pandas functions that do this in a much more elegant and efficient way.
dataframe2 = pd.DataFrame(columns=dataframe.columns)
for i in index_list:
    idx = dataframe.index[dataframe['ID'] == i]
    dataframe2 = pd.concat([dataframe2, dataframe.loc[idx]])
dataframe2
Any help will be appreciated.
You can use .reset_index() to add the index as a normal column, then set_index() and .loc[] to fetch rows by ID. Once you know the original indexes of the rows you want, you can use .loc[] again to get them.
>>> orig_indexes = dataframe.reset_index().set_index('ID').loc[index_list, 'index']
>>> dataframe.loc[orig_indexes]
Date ID Item Cost
0 22-08-2021 A Apple 10
0 22-08-2021 A Apple 10
1 12-09-2021 B Banana 12
1 12-09-2021 B Banana 12
3 22-11-2021 O Orange 13
2 02-10-2021 C Carrot 15
2 02-10-2021 C Carrot 15
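A shorter variant (my own sketch, assuming the 'ID' values are unique) is to make 'ID' the index directly, since .loc[] with a list of labels returns one row per entry, repeats included, in the order given:

>>> dataframe.set_index('ID').loc[index_list].reset_index()

If you need to keep the original integer index, the reset_index()/set_index() approach above is the way to go.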
Input DataFrame:
df = pd.DataFrame({'Loc': ['Hyd', 'Hyd', 'Bang', 'Bang'],
                   'Item': ['A', 'B', 'A', 'B'],
                   'Month': ['May', 'May', 'June', 'June'],
                   'Sales': [100, 100, 200, 200],
                   'Values': [1000, 1000, 2000, 2000]})
My expected output
df = pd.DataFrame({'Loc': ['Hyd', 'Hyd', 'Hyd', 'Hyd', 'Bang', 'Bang', 'Bang', 'Bang'],
                   'Item': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
                   'VAR': ['Sales', 'Values', 'Sales', 'Values', 'Sales', 'Values', 'Sales', 'Values'],
                   'May': [100, 1000, 100, 1000, 100, 1000, 100, 1000],
                   'June': [200, 2000, 200, 2000, 200, 2000, 200, 2000]})
I have tried multiple solutions using melt and pivot, but nothing seems to work. Not sure what I am missing.
Here's my code
dem.melt(['Part','IBU','Date1']).pivot_table(index=['Part','IBU','variable'],columns=['Date1'])
Any help would be much appreciated
You can use the melt and pivot_table functions in pandas:
df_melted = pd.melt(df, id_vars=["Loc", "Item", "Month"], value_vars=["Sales", "Values"])
This will result in:
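    Loc Item Month variable  value
0   Hyd    A   May    Sales    100
1   Hyd    B   May    Sales    100
2  Bang    A  June    Sales    200
3  Bang    B  June    Sales    200
4   Hyd    A   May   Values   1000
5   Hyd    B   May   Values   1000
6  Bang    A  June   Values   2000
7  Bang    B  June   Values   2000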
And then:
df_pivot = df_melted.pivot_table(index=["Loc", "Item", "variable"], columns="Month")
So, the final output will be:
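                     value        
Month                 June     May
Loc  Item variable                
Bang A    Sales      200.0     NaN
          Values    2000.0     NaN
     B    Sales      200.0     NaN
          Values    2000.0     NaN
Hyd  A    Sales        NaN   100.0
          Values       NaN  1000.0
     B    Sales        NaN   100.0
          Values       NaN  1000.0

Note that with this sample input each Loc appears in only one Month, so the other Month's cell is NaN; with data covering both months for each Loc, both columns would be filled as in the expected output.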
I have two DataFrames:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'MS': [1000, 1005, 1007, np.nan, 1010, 1012, 1020],
                    'Command': ['RD', 'RD', 'WR', '---', 'RD', 'RD', 'WR'],
                    'Data1': [100, 110, 120, np.nan, 130, 140, 150],
                    'Data2': ['A', 'A', 'B', '--', 'A', 'B', 'B'],
                    'Data3': [1, 0, 0, np.nan, 1, 1, 0]})
df2 = pd.DataFrame({'MS': [1001, 1006, 1010, np.nan, 1003, 1015, 1020, 1030],
                    'Command': ['WR', 'RD', 'WR', '---', 'RD', 'RD', 'WR', 'RD'],
                    'Data1': [120, 110, 120, np.nan, 140, 130, 150, 110],
                    'Data2': ['B', 'A', 'B', '--', 'B', 'A', 'B', 'A'],
                    'Data3': [0, 0, 1, np.nan, 1, 0, 0, 0]})
I want to compare every row in df1 with df2, except the 'MS' column, where 'MS' is time in milliseconds. Both DataFrames have identical columns. The 'MS' column might contain NaN, in which case it needs to be ignored.
By comparing, I want to:
1. Print matching rows in df1 and df2, one below the other, with a new column 'Diff' holding the difference between their 'MS' values; from the above example, row 3 in df1 matches row 1 of df2, so print,
MS Diff Command Data1 Data2 Data3
0 1007 NaN WR 120 B 0
1 1001 6 WR 120 B 0
2. Print all unmatched rows in df1 and df2.
3. The compare function should be generic enough to accept an argument with the columns of choice and compare only those columns to decide match or no-match. For example, on each iteration I may pass a different column list,
itr1_comp_col = ['Command', 'Data1', 'Data3']
itr2_comp_col = ['Command', 'Data2', 'Data3']
For each iteration, it should compare only the values in the columns the user chose.
So far I have not been able to produce any satisfactory code; I am a beginner with pandas.
I have tried grouping the frames by the 'Command' column and concatenating two identical groups while dropping duplicates, as discussed in this thread.
I have also manually looped through the values in every row and compared them, which is absolutely inefficient, as the data is very large, some million entries.
Please suggest an efficient way to handle above case. Thanks in advance.
I will answer my own question, based on what @Ankur said in his comments:
Even though this doesn't print matching rows one below the other, it partially fulfils the requirement.
Referring to this page, merge can be used to find the difference between DataFrames. In particular, the how= argument does the work. Below is the function:
def find_diff(df1: pd.DataFrame, df2: pd.DataFrame, col_itr):
    res = pd.merge(df1, df2, on=col_itr, how='outer')
    res['Diff'] = res['MS_x'] - res['MS_y']
    print(res)
Usage:
import pandas as pd
import numpy as np
d1 = {'MS': [1000, 1005, 1007, np.nan, 1010, 1012, 1020],
      'Command': ['RD', 'RD', 'WR', '-', 'RD', 'RD', 'WR'],
      'Data1': [100, 110, 120, np.nan, 130, 140, 150],
      'Data2': ['A', 'A', 'B', '-', 'A', 'B', 'B'],
      'Data3': [1, 0, 0, np.nan, 1, 1, 0]}
d2 = {'MS': [1001, 1006, 1010, np.nan, 1003, 1015, 1020, 1030],
      'Command': ['WR', 'RD', 'WR', '-', 'RD', 'RD', 'WR', 'RD'],
      'Data1': [120, 110, 120, np.nan, 140, 130, 150, 110],
      'Data2': ['B', 'A', 'B', '-', 'B', 'A', 'B', 'A'],
      'Data3': [0, 0, 1, np.nan, 1, 0, 0, 0]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
itr1_comp_col = ['Command', 'Data1', 'Data2', 'Data3']
itr2_comp_col = ['Command', 'Data2', 'Data3']
find_diff(df1, df2, itr1_comp_col)
find_diff(df1, df2, itr2_comp_col)
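As a small extension (not part of the original answer): merge's indicator=True argument tags every row as 'both', 'left_only' or 'right_only', which makes it easy to split matched from unmatched rows and so cover requirement 2; a sketch:

def find_diff_split(df1: pd.DataFrame, df2: pd.DataFrame, col_itr):
    # indicator=True adds a '_merge' column marking each row as
    # 'both', 'left_only' or 'right_only'.
    res = pd.merge(df1, df2, on=col_itr, how='outer', indicator=True)
    res['Diff'] = res['MS_x'] - res['MS_y']
    matched = res[res['_merge'] == 'both']
    unmatched = res[res['_merge'] != 'both']
    return matched, unmatched

matched, unmatched = find_diff_split(df1, df2, itr2_comp_col)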
I have a dataframe with something like the structure below:
I need to make it look like this:
Can anyone help, please?
You can use the groupby() function with a list and append summarising functions with agg().
import pandas as pd
df = pd.DataFrame({'customer': [1, 2, 1, 3, 1, 2, 3],
                   'group_code': ['111', '111', '222', '111', '111', '111', '333'],
                   'ind_code': ['A', 'B', 'AA', 'A', 'AAA', 'C', 'BBB'],
                   'amount': [100, 200, 140, 400, 225, 125, 600],
                   'card': ['XXX', 'YYY', 'YYY', 'XXX', 'XXX', 'YYY', 'XXX']})
# Select the numeric 'amount' column so that 'mean' is not applied to the string 'card' column
df_groupby = df.groupby(['customer', 'group_code', 'ind_code'])['amount'].agg(['count', 'mean'])
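With the sample data above, every (customer, group_code, ind_code) combination is unique, so each group has a count of 1 and its mean is simply the single amount:

                              count   mean
customer group_code ind_code              
1        111        A             1  100.0
                    AAA           1  225.0
         222        AA            1  140.0
2        111        B             1  200.0
                    C             1  125.0
3        111        A             1  400.0
         333        BBB           1  600.0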