I have a dataframe with many rows.
How can I turn the source dataframe below into the target dataframe, which has only one row per ID?
import pandas as pd
# source dataframe
df_source = pd.DataFrame({
'ID': ['A01', 'A01'],
'Code': ['101', '102'],
'amount for code': [10000, 20000],
'count for code': [4, 3]
})
# target dataframe
df_target = pd.DataFrame({
'ID': ['A01'],
'Code101': [1],
'Code102': [1],
'Code103': [0],
'amount for code101': [10000],
'count for code101': [4],
'amount for code102': [20000],
'count for code102': [3],
'amount for code103': [None],
'count for code103': [None],
'count for code': [None],
'sum of amount': [30000],
'sum of count': [7]
})
I tried the get_dummies method, but it only tells me whether a given code was present or not.
How can I reshape the dataframe to build my dataset?
You can iterate through the rows of your existing dataframe and populate your new dataframe (df2) using .at or .loc. df2 will be indexed by ID, which is now unique.
import pandas as pd
df = pd.DataFrame({
'ID': ['A01', 'A01'],
'Code': ['101', '102'],
'amount for code': [10000, 20000],
'count for code': [4, 3]
})
df2 = pd.DataFrame()
for idx, row in df.iterrows():
    for col in df.columns:
        if col != 'ID' and col != 'Code':
            df2.at[row['ID'], col + row['Code']] = row[col]
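If you also need the Code indicator and sum columns from your df_target, a minimal follow-up sketch (assuming '101', '102' and '103' really are the full set of possible codes):
all_codes = ['101', '102', '103']   # assumption: full list of possible codes
for code in all_codes:
    amount_col = 'amount for code' + code
    count_col = 'count for code' + code
    if amount_col not in df2.columns:      # code never appeared for any ID
        df2[amount_col] = float('nan')
        df2[count_col] = float('nan')
    df2['Code' + code] = df2[amount_col].notna().astype(int)
df2['sum of amount'] = df2.filter(like='amount for code').sum(axis=1)
df2['sum of count'] = df2.filter(like='count for code').sum(axis=1)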
You can use pivot_table:
df_result = df.pivot_table(index='ID', columns='Code', values=['amount for code', 'count for code'])
This will return a data frame with a multi-level column index, for example ('amount for code', '101').
Then you can add other calculated columns, such as sum of amount, and so on.
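For example, continuing from df_result (a sketch only; the flattened column names are just one possible choice, taken from your target layout):
df_result.columns = [value + code for value, code in df_result.columns]   # ('amount for code', '101') -> 'amount for code101'
df_result['sum of amount'] = df_result.filter(like='amount for code').sum(axis=1)
df_result['sum of count'] = df_result.filter(like='count for code').sum(axis=1)
print(df_result.reset_index())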
I need to divide each row by a specific row, for each column in my data frame. In this case, I need to divide every row by the Revenue row for each time period, to get what percentage of Revenue each account represents. I would also like to figure out how to make it dynamic for any number of columns.
My current DataFrame:
data = {'202112 YTD': {'Gross Margin': 200000,
'Other (Income) & Expense': -100000,
'Revenue': 5000000,
'SG&A Expense': 150000,
'Segment EBITDA': 200000},
'202212 YTD': {'Gross Margin': 2850000,
'Other (Income) & Expense': -338000,
'Revenue': 6000000,
'SG&A Expense': 15000,
'Segment EBITDA': 200000}}
df = pd.DataFrame.from_dict(data)
df
Desired Output:
outdata = {'202112 YTD': {'Gross Margin': 0.040,
'Other (Income) & Expense': -0.020,
'Revenue': 1,
'SG&A Expense': 0.030,
'Segment EBITDA': 0.040},
'202212 YTD': {'Gross Margin': 0.475,
'Other (Income) & Expense': -0.056,
'Revenue': 1,
'SG&A Expense': 0.003,
'Segment EBITDA': 0.033}}
outdf = pd.DataFrame.from_dict(outdata)
outdf
Help would be appreciated. My original attempt was to structure the solution like this example:
import copy
import pandas as pd
original_table = [
{'name': 'Alice', 'age': 25, 'gender': 'Female'},
{'name': 'Bob', 'age': 32, 'gender': 'Male'},
{'name': 'Charlie', 'age': 40, 'gender': 'Male'},
{'name': 'Daisy', 'age': 22, 'gender': 'Female'},
{'name': 'Eve', 'age': 18, 'gender': 'Female'},
]
# Duplicate the table using copy.deepcopy()
duplicate_table = copy.deepcopy(original_table)
# Choose a specific column to divide the rows by
column_name = 'age'
divisor_value = original_table[3][column_name]
# Iterate over the rows in the duplicate table and divide each column by the divisor value
for i, row in enumerate(duplicate_table):
    if column_name in row:
        duplicate_table[i][column_name] = row[column_name] / divisor_value
    else:
        print(f"column: {column_name} not found in table")
# Convert the duplicate table to a DataFrame
duplicate_df = pd.DataFrame(duplicate_table)
# Print the duplicate DataFrame
duplicate_df
Simply use:
outdf = df.div(df.loc['Revenue']).round(3)
Output:
202112 YTD 202212 YTD
Gross Margin 0.04 0.475
Other (Income) & Expense -0.02 -0.056
Revenue 1.00 1.000
SG&A Expense 0.03 0.002
Segment EBITDA 0.04 0.033
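If you want to keep it dynamic, you can wrap the same idea in a small helper (normalize_by_row below is just a hypothetical name, not a pandas function):
def normalize_by_row(frame, row_label, decimals=3):
    # divide every row of `frame` by the row at `row_label`
    return frame.div(frame.loc[row_label]).round(decimals)

outdf = normalize_by_row(df, 'Revenue')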
Please help with the logic for applying a merge only where a condition is met.
In the example below, the merge should only be applied where Name = 'John' (I was thinking of something like np.where); otherwise show 0.
df1 = pd.DataFrame({'Name': ['John', 'Tom', 'Simon', 'Jose'],
'Age': [5, 6, 4, 5]})
df2 = pd.DataFrame({'Name': ['John', 'Tom', 'Jose'],
'Class': ['Second', 'Third', 'Fifth']})
Expected result:
TIA
Use merge and select only the matching rows of df2:
df1.merge(df2[df2["Name"] == 'John'], how='left', on='Name')
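If you also want 0 instead of NaN for the non-matching rows, a small extension of the same idea (assuming a literal 0 in the Class column is acceptable):
result = df1.merge(df2[df2["Name"] == 'John'], how='left', on='Name')
result['Class'] = result['Class'].fillna(0)
print(result)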
A data frame like the one below: the names fall into 5 groups, linked by the common value in column A.
I want to group the names. I tried:
import pandas as pd
data = {'A': ["James","James","James","Edward","Edward","Thomas","Thomas","Jason","Jason","Jason","Brian","Brian"],
'B' : ["John","Michael","William","David","Joseph","Christopher","Daniel","George","Kenneth","Steven","Ronald","Anthony"]}
df = pd.DataFrame(data)
df_1 = df.groupby('A')['B'].apply(list)
df_1 = df_1.to_frame().reset_index()
for index, row in df_1.iterrows():
    print(row['A'], row['B'])
the outputs are:
('Brian', ['Ronald', 'Anthony'])
('Edward', ['David', 'Joseph'])
('James', ['John', 'Michael', 'William'])
('Jason', ['George', 'Kenneth', 'Steven'])
('Thomas', ['Christopher', 'Daniel'])
but I want one list for each group (it would be even better if there's an automatic way to assign a variable to each list), like:
['Brian', 'Ronald', 'Anthony']
['Edward', 'David', 'Joseph']
['James', 'John', 'Michael', 'William']
['Jason', 'George', 'Kenneth', 'Steven']
['Thomas', 'Christopher', 'Daniel']
I tried row['B'].append(row['A']) but it returns None.
What's the right way to group them? thank you.
You can add the value of the grouping column A inside GroupBy.apply, using the .name attribute:
s = df.groupby('A')['B'].apply(lambda x: [x.name] + list(x))
print (s)
A
Brian [Brian, Ronald, Anthony]
Edward [Edward, David, Joseph]
James [James, John, Michael, William]
Jason [Jason, George, Kenneth, Steven]
Thomas [Thomas, Christopher, Daniel]
Name: B, dtype: object
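If by an automatic way to assign a variable to each list you mean being able to address each group by name, one option (just a sketch) is to turn the Series into a dict:
groups = s.to_dict()
print(groups['James'])   # ['James', 'John', 'Michael', 'William']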
You can try this, using pd.Series.tolist():
for k, g in df.groupby('A')['B']:
    print([k] + g.tolist())
['Brian', 'Ronald', 'Anthony']
['Edward', 'David', 'Joseph']
['James', 'John', 'Michael', 'William']
['Jason', 'George', 'Kenneth', 'Steven']
['Thomas', 'Christopher', 'Daniel']
The reason you got None as output is that list.append returns None; it mutates the list in place.
Try the following:
import pandas as pd
data = {'A': ["James","James","James","Edward","Edward","Thomas","Thomas","Jason","Jason","Jason","Brian","Brian"],
'B' : ["John","Michael","William","David","Joseph","Christopher","Daniel","George","Kenneth","Steven","Ronald","Anthony"]}
df = pd.DataFrame(data)
#display(df)
df_1 = df.groupby(list('A'))['B'].apply(list)
df_1 = df_1.to_frame().reset_index()
for index, row in df_1.iterrows():
    # The value of column A is a string, not a list, so split it on a
    # delimiter that never occurs to get a one-element list, then
    # concatenate it with the list in column B.
    print(row['A'].split("delimiter") + row['B'])
output:
['Brian', 'Ronald', 'Anthony']
['Edward', 'David', 'Joseph']
['James', 'John', 'Michael', 'William']
['Jason', 'George', 'Kenneth', 'Steven']
['Thomas', 'Christopher', 'Daniel']
I am applying groupby and rolling to a dataframe.
If I have more than one group, the result is a pandas Series, but if I only have one group then the result is a pandas DataFrame. I replicated it below if you need to see what I am doing.
Is there a way to force pandas to return a series each time, even if there is only one group?
If you wish to recreate what I am seeing, you can run the below examples.
Example 1 (Series):
df = pd.DataFrame(data={'Name':['John', 'John', 'John', 'Jill', 'Jill', 'Jill', 'Jill'],'Score':[1,1, 1,2,2, 2, 2]})
df.groupby('Name', as_index=False, sort=False)['Score'].rolling(2,min_periods=0).sum()
Example 2 (DataFrame):
df = pd.DataFrame(data={'Name':['John', 'John', 'John', 'John', 'John', 'John', 'John'],'Score':[1,1, 1,2,2, 2, 2]})
df.groupby('Name', as_index=False, sort=False)['Score'].rolling(2,min_periods=0).sum()
To get a Series every time, you can wrap the values:
pd.Series(df.groupby('Name', as_index=False, sort=False)['Score'].rolling(2,min_periods=0).sum().values)
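An alternative sketch (assuming the one-group case really comes back as a one-column DataFrame, as in Example 2) is to flatten the underlying values first, so the wrapping works in both cases:
import numpy as np
rolled = df.groupby('Name', as_index=False, sort=False)['Score'].rolling(2, min_periods=0).sum()
rolled = pd.Series(np.asarray(rolled).ravel())   # 1-D either way, so the result is always a Series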
I want to find out the difference between two data frames in terms of column names.
This is sample table1
d1 = {'row_num': [1, 2, 3, 4, 5], 'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df1 = pd.DataFrame(data=d1)
df1['month'] = pd.DatetimeIndex(df1['DoB']).month
df1['year'] = pd.DatetimeIndex(df1['DoB']).year
This is sample table2
d2 = {'row_num': [1, 2, 3, 4, 5], 'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df2 = pd.DataFrame(data=d2)
Table 2 (df2) does not have the month and year columns that df1 has. I want to find out which columns of df1 are missing in df2.
I know there's EXCEPT in SQL, but how can I do this using pandas/Python? Any suggestions?
There's a function meant just for this purpose: pd.Index.difference
df1.columns.difference(df2.columns)
Index(['month', 'year'], dtype='object')
And the corresponding columns:
df1[df1.columns.difference(df2.columns)]
month year
0 1 2010
1 1 2012
2 11 2014
3 11 2014
4 9 2016
You can do:
[col for col in df1.columns if col not in df2.columns]
to find the columns of df1 that are not in df2; the output gives you a list of column names.
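A quick usage sketch with the sample frames above:
missing_in_df2 = [col for col in df1.columns if col not in df2.columns]
print(missing_in_df2)   # ['month', 'year']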