I am applying groupby and rolling to a dataframe.
If I have more than one group, the result is a pandas Series, but if I have only one group, the result is a pandas DataFrame. I have replicated it below if you need to see what I am doing.
Is there a way to force pandas to return a Series each time, even if there is only one group?
If you wish to recreate what I am seeing, you can run the examples below.
Example 1 (Series):
import pandas as pd

df = pd.DataFrame(data={'Name': ['John', 'John', 'John', 'Jill', 'Jill', 'Jill', 'Jill'],
                        'Score': [1, 1, 1, 2, 2, 2, 2]})
df.groupby('Name', as_index=False, sort=False)['Score'].rolling(2, min_periods=0).sum()
Example 2 (Dataframe):
df = pd.DataFrame(data={'Name': ['John', 'John', 'John', 'John', 'John', 'John', 'John'],
                        'Score': [1, 1, 1, 2, 2, 2, 2]})
df.groupby('Name', as_index=False, sort=False)['Score'].rolling(2, min_periods=0).sum()
To get a Series in both cases, wrap the result's values:
pd.Series(df.groupby('Name', as_index=False, sort=False)['Score'].rolling(2, min_periods=0).sum().values)
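Note that if the single-group case hands you a one-column DataFrame, its .values is 2-D and pd.Series() will refuse it. A minimal sketch that covers both shapes (assuming the setup above; not verified against every pandas version):

import numpy as np

result = df.groupby('Name', as_index=False, sort=False)['Score'].rolling(2, min_periods=0).sum()
# np.ravel leaves a 1-D array alone and flattens an (n, 1) array,
# so this works whether result is a Series or a one-column DataFrame
forced = pd.Series(np.ravel(result.values))
print(forced)

Be aware this drops the original group/index labels; keep result.index around if you need to align back to df.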
Please help with the logic for applying a merge only where a condition is met.
In the example below, the merge should apply only where Name equals 'John' (as with np.where); otherwise show 0.
df1 = pd.DataFrame({'Name': ['John', 'Tom', 'Simon', 'Jose'],
                    'Age': [5, 6, 4, 5]})
df2 = pd.DataFrame({'Name': ['John', 'Tom', 'Jose'],
                    'Class': ['Second', 'Third', 'Fifth']})
Expected result:
TIA
Use merge, and select only the matching rows of df2 first:
df1.merge(df2[df2["Name"] == 'John'], how='left', on='Name')
First dataset: dim(d) = (70856886, 12); second dataset: dim(e) = (354, 6).
Both datasets have a common variable, subject, and I want to merge them on subject. I used this Python code:
# Merging both dataset:
data=pd.merge(d, e, on='subject')
When I do that I lose some data: my new merged dataset has 62611728 rows.
My question is: why am I losing those observations? [70856886 - 62611728 = 8245158]
As the documentation states, pd.merge() "Merge[s] DataFrame or named Series objects with a database-style join."
In general, it's a good idea to try something on a small dataset to see if you understand its function correctly and then to apply it to a large dataset.
Here's an example for pd.merge():
import pandas as pd
df1 = pd.DataFrame([
    {'subject': 'a', 'value': 1},
    {'subject': 'a', 'value': 2},
    {'subject': 'b', 'value': 3},
    {'subject': 'c', 'value': 4},
    {'subject': 'c', 'value': 5},
])
df2 = pd.DataFrame([
    {'subject': 'a', 'other': 6},
    {'subject': 'b', 'other': 7},
    {'subject': 'b', 'other': 8},
    {'subject': 'd', 'other': 9},
])
df = pd.merge(df1, df2, on='subject')
print(df)
What output do you expect? It should be this:
  subject  value  other
0       a      1      6
1       a      2      6
2       b      3      7
3       b      3      8
In your case, we can only assume that, when combined, only 62611728 records could actually be constructed with a matching 'subject'; the remaining records in either d or e had subjects with no match in the other.
You only see records that combine values from both dataframes, i.e. those that share a value for 'subject'. Any non-matching 'subject' records are left out, on either side (by default it's an 'inner' join).
Look at the documentation for the other variants. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
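If you want to keep every row of the left frame and inspect which ones failed to match, a left join with indicator=True is one way (a sketch using the df1/df2 example above):

merged = pd.merge(df1, df2, on='subject', how='left', indicator=True)
# rows of df1 whose subject had no match in df2 ('c' in this example);
# in your data these would be the subjects of d that are absent from e
print(merged[merged['_merge'] == 'left_only'])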
I have a dataframe with many rows.
How can I turn the source dataframe below into the target dataframe, which has one row?
import pandas as pd
# source dataframe
df_source = pd.DataFrame({
    'ID': ['A01', 'A01'],
    'Code': ['101', '102'],
    'amount for code': [10000, 20000],
    'count for code': [4, 3]
})
# target dataframe
df_target = pd.DataFrame({
    'ID': ['A01'],
    'Code101': [1],
    'Code102': [1],
    'Code103': [0],
    'amount for code101': [10000],
    'count for code101': [4],
    'amount for code102': [20000],
    'count for code102': [3],
    'amount for code103': [None],
    'count for code103': [None],
    'count for code': [None],
    'sum of amount': [30000],
    'sum of count': [7]
})
I tried the get_dummies method, but it only captures whether a given code was present or not.
How can I transform the dataframe to build my target dataset?
You can iterate through the rows of your existing dataframe and populate (using .at or .loc) a new dataframe (df2). df2 will be indexed by ID, which is now unique.
import pandas as pd
df = pd.DataFrame({
    'ID': ['A01', 'A01'],
    'Code': ['101', '102'],
    'amount for code': [10000, 20000],
    'count for code': [4, 3]
})

df2 = pd.DataFrame()
for idx, row in df.iterrows():
    for col in df.columns:
        if col != 'ID' and col != 'Code':
            # e.g. 'amount for code' + '101' -> 'amount for code101'
            df2.at[row['ID'], col + row['Code']] = row[col]
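With the sample df above, the loop should leave df2 indexed by 'A01' with one column per value/code pair; a quick sanity check (exact dtypes may vary, since enlargement via .at can upcast):

print(df2.columns.tolist())
# ['amount for code101', 'count for code101', 'amount for code102', 'count for code102']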
You can use pivot_table:
df_result = df.pivot_table(index='ID', columns='Code', values=['amount for code', 'count for code'])
This returns a data frame with a multi-level column index, for example ('amount for code', '101').
Then you can add other calculated columns like sum of amount and so on.
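A minimal sketch of those steps, assuming the sample df defined in the previous answer (the flattened column names are illustrative, not required):

pivoted = df.pivot_table(index='ID', columns='Code',
                         values=['amount for code', 'count for code'])
# flatten ('amount for code', '101') -> 'amount for code101'
pivoted.columns = [f'{val}{code}' for val, code in pivoted.columns]
pivoted['sum of amount'] = df.groupby('ID')['amount for code'].sum()
pivoted['sum of count'] = df.groupby('ID')['count for code'].sum()
print(pivoted.reset_index())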
I have a data frame like the one below. The names fall into 5 groups, linked by the common value in column A.
I want to group the names. I tried:
import pandas as pd
data = {'A': ["James","James","James","Edward","Edward","Thomas","Thomas","Jason","Jason","Jason","Brian","Brian"],
'B' : ["John","Michael","William","David","Joseph","Christopher","Daniel","George","Kenneth","Steven","Ronald","Anthony"]}
df = pd.DataFrame(data)
df_1 = df.groupby('A')['B'].apply(list)
df_1 = df_1.to_frame().reset_index()
for index, row in df_1.iterrows():
print (row['A'], row['B'])
The outputs are:
('Brian', ['Ronald', 'Anthony'])
('Edward', ['David', 'Joseph'])
('James', ['John', 'Michael', 'William'])
('Jason', ['George', 'Kenneth', 'Steven'])
('Thomas', ['Christopher', 'Daniel'])
but I want one list for each group (it would be even better if there's an automatic way to assign a variable to each list), like:
['Brian', 'Ronald', 'Anthony']
['Edward', 'David', 'Joseph']
['James', 'John', 'Michael', 'William']
['Jason', 'George', 'Kenneth', 'Steven']
['Thomas', 'Christopher', 'Daniel']
I tried row['B'].append(row['A']), but it returns None.
What's the right way to group them? Thank you.
You can prepend the value of the grouping column A in GroupBy.apply via the group's .name attribute:
s = df.groupby('A')['B'].apply(lambda x: [x.name] + list(x))
print (s)
A
Brian [Brian, Ronald, Anthony]
Edward [Edward, David, Joseph]
James [James, John, Michael, William]
Jason [Jason, George, Kenneth, Steven]
Thomas [Thomas, Christopher, Daniel]
Name: B, dtype: object
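As for automatically assigning a variable to each list: creating real variables dynamically is discouraged; a dict keyed by group name is the usual substitute (a sketch building on s above):

groups = s.to_dict()
print(groups['Brian'])  # ['Brian', 'Ronald', 'Anthony']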
You can try this, using pd.Series.tolist():
for k, g in df.groupby('A')['B']:
    print([k] + g.tolist())
['Brian', 'Ronald', 'Anthony']
['Edward', 'David', 'Joseph']
['James', 'John', 'Michael', 'William']
['Jason', 'George', 'Kenneth', 'Steven']
['Thomas', 'Christopher', 'Daniel']
The reason you got None as output is that list.append returns None; it mutates the list in place.
Try the following:
import pandas as pd
data = {'A': ["James","James","James","Edward","Edward","Thomas","Thomas","Jason","Jason","Jason","Brian","Brian"],
'B' : ["John","Michael","William","David","Joseph","Christopher","Daniel","George","Kenneth","Steven","Ronald","Anthony"]}
df = pd.DataFrame(data)
#display(df)
df_1 = df.groupby(list('A'))['B'].apply(list)
df_1 = df_1.to_frame().reset_index()
for index, row in df_1.iterrows():
''' The value of column A is not a list,
so need to split the string and store in to a list and then concatenate with column B '''
print(row['A'].split("delimiter") + row['B'])
output:
['Brian', 'Ronald', 'Anthony']
['Edward', 'David', 'Joseph']
['James', 'John', 'Michael', 'William']
['Jason', 'George', 'Kenneth', 'Steven']
['Thomas', 'Christopher', 'Daniel']
I want to find out the difference between two data frames in terms of column names.
This is sample table 1:
d1 = {'row_num': [1, 2, 3, 4, 5], 'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df1 = pd.DataFrame(data=d1)
df1['month'] = pd.DatetimeIndex(df1['DoB']).month
df1['year'] = pd.DatetimeIndex(df1['DoB']).year
This is sample table 2:
d2 = {'row_num': [1, 2, 3, 4, 5], 'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df2 = pd.DataFrame(data=d2)
Table 2 (df2) does not have the month and year columns that df1 has. I want to find out which columns of df1 are missing in df2.
I know there's EXCEPT in SQL, but how do I do this using pandas/Python? Any suggestions?
There's a function meant just for this purpose: pd.Index.difference
df1.columns.difference(df2.columns)
Index(['month', 'year'], dtype='object')
And the corresponding columns:
df1[df1.columns.difference(df2.columns)]
   month  year
0      1  2010
1      1  2012
2     11  2014
3     11  2014
4      9  2016
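If you also need the reverse direction, or the columns that appear in exactly one of the two frames, Index has related methods:

df2.columns.difference(df1.columns)            # columns of df2 not in df1
df1.columns.symmetric_difference(df2.columns)  # columns in exactly one frame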
You can do:
[col for col in df1.columns if col not in df2.columns]
to find the columns of df1 that are not in df2; the output is a list of column names.
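An equivalent using sets, if column order doesn't matter (sets don't preserve it):

missing = set(df1.columns) - set(df2.columns)
print(missing)  # {'month', 'year'} (order may vary)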