Pandas merge only on where condition - python

Please help with logic on applying merge function only where condition is met.
In below example: merge should be applicable only when, np.where name = John, else show 0
df1 = pd.DataFrame({'Name': ['John', 'Tom', 'Simon', 'Jose'],
'Age': [5, 6, 4, 5]})
df2 = pd.DataFrame({'Name': ['John', 'Tom', 'Jose'],
'Class': ['Second', 'Third', 'Fifth']})
Expected result:
TIA

use merge and select the good rows of your df2.
df1.merge(df2[df2["Name"] == 'John'] , how = 'left' , on = 'Name')

Related

Python - Pandas - how to extract content of specific row without header column

I would like to get the content of specific row without header column , I'm going to use df.iloc[row number] , but it didn't give me an expected result ?
my code as below:
import pandas as pd
df = pd.DataFrame({
'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
'age': [34, 29, 37, 52, 26, 32]},
)
df.head()
df_temp = df.loc[2]
The result i get is:
first_name Marry
last_name Jackson
age 37
Name: 2, dtype: object
I expected it could give me a list , sth like below:
['Marry', 'Jackson','37']
Any idea to do this, could you please advise for my case?
Well there are many functions in pandas that could help you do this. to_String() or values are a few among them.
So if you do something like
import pandas as pd
df = pd.DataFrame({
'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
'age': [34, 29, 37, 52, 26, 32]},
)
df.head()
df_temp = df.loc[2].to_String()
print(df_temp)
you will get an output like this for your given code:
first_name Marry
last_name Jackson
age 37
however in your case because you want a list you can just call values and get it as you want. Here's your updated code below:
import pandas as pd
df = pd.DataFrame({
'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
'age': [34, 29, 37, 52, 26, 32]},
)
df.head()
df_temp = df.loc[2].values
print(df_temp)
which will give you the output you probably want as
['Marry' 'Jackson' 37]

How to split pandas dataframe into list of dataframes by id?

I have a big pandas dataframe (about 150000 rows). I have tried method groupby('id') but in returns group tuples. I need just a list of dataframes, and then I convert them into np array batches to put into an autoencoder (like this https://www.datacamp.com/community/tutorials/autoencoder-keras-tutorial but 1D)
So I have a pandas dataset :
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John', 'John', 'John', 'John', 'Krish'], 'Age': [20, 21, 19, 18, 18, 18, 18, 18],'id': [1, 1, 2, 2, 3, 3, 3, 3]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df.head(10)
I need the same output (just a list of pandas dataframe). Also, i need a list of unsorted lists, it is important, because its time series.
data1 = {'Name': ['Tom', 'Joseph'], 'Age': [20, 21],'id': [1, 1]}
data2 = {'Name': ['Krish', 'John', ], 'Age': [19, 18, ],'id': [2, 2]}
data3 = {'Name': ['John', 'John', 'John', 'Krish'], 'Age': [18, 18, 18, 18],'id': [3, 3, 3, 3]}
pd_1 = pd.DataFrame(data1)
pd_2 = pd.DataFrame(data2)
pd_3 = pd.DataFrame(data3)
array_list = [pd_1,pd_2,pd_3]
array_list
How can I split dataframe ?
Or you can TRY:
array_list = df.groupby(df.id.values).agg(list).to_dict('records')
Output:
[{'Name': ['Tom', 'Joseph'], 'Age': [20, 21], 'id': [1, 1]},
{'Name': ['Krish', 'John'], 'Age': [19, 18], 'id': [2, 2]},
{'Name': ['John', 'John', 'John', 'Krish'],
'Age': [18, 18, 18, 18],
'id': [3, 3, 3, 3]}]
UPDATE:
If you need a dataframe list:
df_list = [g for _,g in df.groupby('id')]
#OR
df_list = [pd.DataFrame(i) for i in df.groupby(df.id.values).agg(list).to_dict('records')]
To reset the index of each dataframe:
df_list = [g.reset_index(drop=True) for _,g in df.groupby('id')]
Let us group on id and using to_dict with orientation list prepare records per id
[g.to_dict('list') for _, g in df.groupby('id', sort=False)]
[{'Name': ['Tom', 'Joseph'], 'Age': [20, 21], 'id': [1, 1]},
{'Name': ['Krish', 'John'], 'Age': [19, 18], 'id': [2, 2]},
{'Name': ['John', 'John', 'John', 'Krish'], 'Age': [18, 18, 18, 18], 'id': [3, 3, 3, 3]}]
I am not sure about your need but does something like this works for you?
df = df.set_index("id")
[df.loc[i].to_dict("list") for i in df.index.unique()]
or if you really want to keep your index in your list:
[df.query(f"id == {i}").to_dict("list") for i in df.id.unique()]
If you want to create new DataFrames storing the values:
(Previous answers are more relevant if you want to create a list)
This can be solved by iterating over each id using a for loop and create a new dataframe every loop.
I refer you to #40498463 and the other answers for the usage of the groupby() function. Please note that I have changed the name of the id column to Id.
for Id, df in df.groupby("Id"):
str1 = "df"
str2 = str(Id)
new_name = str1 + str2
exec('{} = pd.DataFrame(df)'.format(new_name))
Output:
df1
Name Age Id
0 Tom 20 1
1 Joseph 21 1
df2
Name Age Id
2 Krish 19 2
3 John 18 2
df3
Name Age Id
4 John 18 3
5 John 18 3
6 John 18 3
7 Krish 18 3

Python Pandas to group values in 2 columns

A data frame like below. the names are in 5 groups, linking by the common in column A.
I want to group the names. I tried:
import pandas as pd
data = {'A': ["James","James","James","Edward","Edward","Thomas","Thomas","Jason","Jason","Jason","Brian","Brian"],
'B' : ["John","Michael","William","David","Joseph","Christopher","Daniel","George","Kenneth","Steven","Ronald","Anthony"]}
df = pd.DataFrame(data)
df_1 = df.groupby('A')['B'].apply(list)
df_1 = df_1.to_frame().reset_index()
for index, row in df_1.iterrows():
print (row['A'], row['B'])
the outputs are:
('Brian', ['Ronald', 'Anthony'])
('Edward', ['David', 'Joseph'])
('James', ['John', 'Michael', 'William'])
('Jason', ['George', 'Kenneth', 'Steven'])
('Thomas', ['Christopher', 'Daniel'])
but I want one list for each group (it would be even better if there's an automatic way to assign a variable to each list), like:
['Brian', 'Ronald', 'Anthony']
['Edward', 'David', 'Joseph']
['James', 'John', 'Michael', 'William']
['Jason', 'George', 'Kenneth', 'Steven']
['Thomas', 'Christopher', 'Daniel']
I tried row['B'].append(row['A']) but it returns None.
What's the right way to group them? thank you.
You can add values of A grouping column in GroupBy.apply with .name attribute:
s = df.groupby('A')['B'].apply(lambda x: [x.name] + list(x))
print (s)
A
Brian [Brian, Ronald, Anthony]
Edward [Edward, David, Joseph]
James [James, John, Michael, William]
Jason [Jason, George, Kenneth, Steven]
Thomas [Thomas, Christopher, Daniel]
Name: B, dtype: object
You can try this. Use pd.Series.tolist()
for k,g in df.groupby('A')['B']:
print([k]+g.tolist())
['Brian', 'Ronald', 'Anthony']
['Edward', 'David', 'Joseph']
['James', 'John', 'Michael', 'William']
['Jason', 'George', 'Kenneth', 'Steven']
['Thomas', 'Christopher', 'Daniel']
The reason you got None as output is list.append returns None it mutates the list in-place.
try the following:
import pandas as pd
data = {'A': ["James","James","James","Edward","Edward","Thomas","Thomas","Jason","Jason","Jason","Brian","Brian"],
'B' : ["John","Michael","William","David","Joseph","Christopher","Daniel","George","Kenneth","Steven","Ronald","Anthony"]}
df = pd.DataFrame(data)
#display(df)
df_1 = df.groupby(list('A'))['B'].apply(list)
df_1 = df_1.to_frame().reset_index()
for index, row in df_1.iterrows():
''' The value of column A is not a list,
so need to split the string and store in to a list and then concatenate with column B '''
print(row['A'].split("delimiter") + row['B'])
output:
['Brian', 'Ronald', 'Anthony']
['Edward', 'David', 'Joseph']
['James', 'John', 'Michael', 'William']
['Jason', 'George', 'Kenneth', 'Steven']
['Thomas', 'Christopher', 'Daniel']

Pandas rolling groupby with one group

I am doing groupby and rolling to a dataframe.
If I have more than 1 group, the result is a pandas series but if I only have 1 group then the result is a pandas dataframe. I replicated it below if you need to see what I am doing.
Is there a way to force pandas to return a series each time, even if there is only one group?
If you wish to recreate what I am seeing, you can run the below examples.
Example 1 (Series):
df = pd.DataFrame(data={'Name':['John', 'John', 'John', 'Jill', 'Jill', 'Jill', 'Jill'],'Score':[1,1, 1,2,2, 2, 2]})
df.groupby('Name', as_index=False, sort=False)['Score'].rolling(2,min_periods=0).sum()
Example 2 (Dataframe):
df = pd.DataFrame(data={'Name':['John', 'John', 'John', 'John', 'John', 'John', 'John'],'Score':[1,1, 1,2,2, 2, 2]})
df.groupby('Name', as_index=False, sort=False)['Score'].rolling(2,min_periods=0).sum()
Series:
pd.Series(df.groupby('Name', as_index=False, sort=False)['Score'].rolling(2,min_periods=0).sum().values)

How to find out difference of two dataframes in terms of column name using Python

I want to find out the difference between two data frame in terms of column names.
This is sample table1
d1 = {'row_num': [1, 2, 3, 4, 5], 'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df1 = pd.DataFrame(data = d)
df1['month'] = pd.DatetimeIndex(df['DoB']).month
df1['year'] = pd.DatetimeIndex(df['DoB']).year
This is sample table2
d2 = {'row_num': [1, 2, 3, 4, 5], 'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df2 = pd.DataFrame(data = d)
table 2 or df2 does not have the month and year column like df1. I want to find out which columns of df1 are missing in df2.
I know there's 'EXCEPT' in sql but how to do it using pandas/python , Any suggestions ?
There's a function meant just for this purpose: pd.Index.difference
df1.columns.difference(df2.columns)
Index(['month', 'year'], dtype='object')
And, the corresponding columns;
df1[df1.columns.difference(df2.columns)]
month year
0 1 2010
1 1 2012
2 11 2014
3 11 2014
4 9 2016
You can do:
[col for col in df1.columns if col not in df2.columns] to find the columns of df1 not in df2 and the output gives you a list of columns name

Categories

Resources