How to join or merge in python - python

We have two Dataframes.
We want all the columns form Dataframe1, but just one column (target_name) from the Dataframe2.
However, when we do this, it gives us duplicate values.
Dataframe1 values:
user_id subject_id x y w h g
0 858580 23224814 58.133331 57.466675 181.000000 42.000000 1
1 858580 23224814 293.133331 176.466675 80.000000 34.000000 2
2 313344 28539152 834.049316 37.493195 63.005920 36.444595 1
3 313344 28539152 104.003235 45.072937 242.956024 26.754082 2
4 313344 28539152 635.436829 80.038574 108.716065 35.240089 3
5 313344 28539152 351.910156 80.162117 201.371887 32.738373 4
6 861687 28539165 125.313393 39.836521 231.202873 43.087811 1
7 861687 28539165 623.450500 44.040207 151.332825 34.680435 2
8 1254304 28539165 128.893204 45.765110 225.686691 35.547726 1
Dataframe2 Values:
Unnamed: 0 user_id subject_id good x y w h T0 T1 T2 T3 T4 T5 T6 T7 T8 target_name target_name_length target_name3
0 0 858580 23224814 1 58.133331 57.466675 181.000000 42.000000 NaN 1801 No, there are still more names to be marked Male 1881 John Abbott NaN NaN NaN John Abbott 11 John Abbott
1 1 858580 23224814 1 293.133331 176.466675 80.000000 34.000000 NaN NaN Yes, I've marked all the names Female NaN NaN Edith Joynt Edith Abbot NaN Edith Joynt 11 Edith Joynt
2 2 340348 30629031 1 152.968750 26.000000 224.000000 41.000000 NaN 1852 No, there are still more names to be marked Male 1924 William Sparrow NaN NaN NaN William Sparrow 15 William Sparrow
3 3 340348 30629031 1 497.968750 325.000000 87.000000 29.000000 NaN NaN Yes, I've marked all the names Female NaN NaN Minnie NaN NaN Minnie 6 Minnie
4 4 340348 28613182 1 103.968750 31.000000 162.000000 38.000000 NaN 1819 No, there are still more names to be marked Male 1876 Albert [unclear]Gles[/unclear] NaN NaN NaN Albert Gles 30 Albert Gles
5 5 340348 28613182 1 107.968750 76.000000 72.000000 25.000000 NaN 1819 Yes, I've marked all the names Female 1884 NaN Eliza [unclear]Gles[/unclear] NaN NaN Eliza Gles 29 Eliza Gles
6 6 340348 30628864 1 172.968750 29.000000 192.000000 41.000000 NaN 1840 No, there are still more names to be marked Male 1918 John Slaltery NaN NaN NaN John Slaltery 13 John Slaltery
7 7 340348 30628864 1 115.968750 214.000000 149.000000 31.000000 NaN NaN No, there are still more names to be marked Male NaN [unclear]P.[/unclear] Slaltery NaN NaN NaN P. Slaltery 30 unclear]P. Slaltery
8 8 340348 30628864 1 537.968750 218.000000 64.000000 26.000000 NaN NaN Yes, I've marked all the names Female 1901 NaN Elizabeth Slaltery NaN NaN Elizabeth Slaltery 18 Elizabeth Slaltery
Here is the code we are trying to use:

If you want to blindly add the target column to dataframe1 then
dataframe1['target'] = dataframe2['target']
Just make sure that both the dataframes have same number of rows and they are sorted by any given common column. Eg: user_id is found in both the dataframes

Related

Transform DataFrame in Pandas

I am struggling with the following issue.
My DF is:
df = pd.DataFrame(
[
['7890-1', '12345N', 'John', 'Intermediate'],
['7890-4', '30909N', 'Greg', 'Intermediate'],
['3300-1', '88117N', 'Mark', 'Advanced'],
['2502-2', '90288N', 'Olivia', 'Elementary'],
['7890-2', '22345N', 'Joe', 'Intermediate'],
['7890-3', '72245N', 'Ana', 'Elementary']
],
columns=['Id', 'Code', 'Person', 'Level'])
print(df)
I would like to get such a result:
Id
Code 1
Person 1
Level 1
Code 2
Person 2
Level 2
Code 3
Person 3
Level 3
Code 4
Person 4
Level 4
0
7890
12345N
John
Intermediate
22345N
Joe
Intermediate
72245N
Ana
Elementary
30909N
Greg
Intermediate
1
3300
88117N
Mark
Advanced
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
2
2502
NaN
NaN
NaN
90288N
Olivia
Elementary
NaN
NaN
NaN
NaN
NaN
NaN
I'd start with the same approach as #Andrej Kesely but then sort by index after unstacking and map over the column names with ' '.join.
df[["Id", "No"]] = df["Id"].str.split("-", expand=True)
df_wide = df.set_index(["Id", "No"]).unstack(level=1).sort_index(axis=1,level=1)
df_wide.columns = df_wide.columns.map(' '.join)
Output
Code 1 Level 1 Person 1 Code 2 Level 2 Person 2 Code 3 \
Id
2502 NaN NaN NaN 90288N Elementary Olivia NaN
3300 88117N Advanced Mark NaN NaN NaN NaN
7890 12345N Intermediate John 22345N Intermediate Joe 72245N
Level 3 Person 3 Code 4 Level 4 Person 4
Id
2502 NaN NaN NaN NaN NaN
3300 NaN NaN NaN NaN NaN
7890 Elementary Ana 30909N Intermediate Greg
Try:
df[["Id", "Id2"]] = df["Id"].str.split("-", expand=True)
x = df.set_index(["Id", "Id2"]).unstack(level=1)
x.columns = [f"{a} {b}" for a, b in x.columns]
print(
x[sorted(x.columns, key=lambda k: int(k.split()[-1]))]
.reset_index()
.to_markdown()
)
Prints:
Id
Code 1
Person 1
Level 1
Code 2
Person 2
Level 2
Code 3
Person 3
Level 3
Code 4
Person 4
Level 4
0
2502
nan
nan
nan
90288N
Olivia
Elementary
nan
nan
nan
nan
nan
nan
1
3300
88117N
Mark
Advanced
nan
nan
nan
nan
nan
nan
nan
nan
nan
2
7890
12345N
John
Intermediate
22345N
Joe
Intermediate
72245N
Ana
Elementary
30909N
Greg
Intermediate

How do I place NaN when computing the average rating for each movie in a DataFrame?

I am working with the MovieLens dataset, basically there are 2 files, a .csv file which contains movies and another .csv file which contains ratings given by n users to specific movies.
I did the following in order to get the average rating for each movie in the DF.
ratings_data.groupby('movieId').rating.mean()
however with that code I am getting 9724 movies whereas in the main DataFrame I have 9742 movies.
I think that there are movies that are not rated at all, however since I want to add the ratings to the main movies dataset how would I put NaN on the fields that have no ratings?!
Use Series.reindex by unique movieId form another column, for same order is add Series.sort_values:
movies_data = pd.read_csv('ml-latest-small/movies.csv')
ratings_data = pd.read_csv('ml-latest-small/ratings.csv')
mov = movies_data['movieId'].sort_values().drop_duplicates()
df = ratings_data.groupby('movieId').rating.mean().reindex(mov).reset_index()
print (df)
movieId rating
0 1 3.920930
1 2 3.431818
2 3 3.259615
3 4 2.357143
4 5 3.071429
... ...
9737 193581 4.000000
9738 193583 3.500000
9739 193585 3.500000
9740 193587 3.500000
9741 193609 4.000000
[9742 rows x 2 columns]
df1 = df[df['rating'].isna()]
print (df1)
movieId rating
816 1076 NaN
2211 2939 NaN
2499 3338 NaN
2587 3456 NaN
3118 4194 NaN
4037 5721 NaN
4506 6668 NaN
4598 6849 NaN
4704 7020 NaN
5020 7792 NaN
5293 8765 NaN
5421 25855 NaN
5452 26085 NaN
5749 30892 NaN
5824 32160 NaN
5837 32371 NaN
5957 34482 NaN
7565 85565 NaN
EDIT:
If need new column to movie_data DataFrame, use DataFrame.merge with left join:
movies_data = pd.read_csv('ml-latest-small/movies.csv')
ratings_data = pd.read_csv('ml-latest-small/ratings.csv')
df = ratings_data.groupby('movieId', as_index=False).rating.mean()
print (df)
movieId rating
0 1 3.920930
1 2 3.431818
2 3 3.259615
3 4 2.357143
4 5 3.071429
... ...
9719 193581 4.000000
9720 193583 3.500000
9721 193585 3.500000
9722 193587 3.500000
9723 193609 4.000000
[9724 rows x 2 columns]
df = movies_data.merge(df, on='movieId', how='left')
print (df)
movieId title \
0 1 Toy Story (1995)
1 2 Jumanji (1995)
2 3 Grumpier Old Men (1995)
3 4 Waiting to Exhale (1995)
4 5 Father of the Bride Part II (1995)
... ...
9737 193581 Black Butler: Book of the Atlantic (2017)
9738 193583 No Game No Life: Zero (2017)
9739 193585 Flint (2017)
9740 193587 Bungo Stray Dogs: Dead Apple (2018)
9741 193609 Andrew Dice Clay: Dice Rules (1991)
genres rating
0 Adventure|Animation|Children|Comedy|Fantasy 3.920930
1 Adventure|Children|Fantasy 3.431818
2 Comedy|Romance 3.259615
3 Comedy|Drama|Romance 2.357143
4 Comedy 3.071429
... ...
9737 Action|Animation|Comedy|Fantasy 4.000000
9738 Animation|Comedy|Fantasy 3.500000
9739 Drama 3.500000
9740 Action|Animation 3.500000
9741 Comedy 4.000000
[9742 rows x 4 columns]

Merge multiple rows to one row in a csv file using python pandas

I have a csv file with multiple rows as stated below
Id Name Marks1 Marks2 Marks3 Marks4 Marks5
1 ABC 10 NAN NAN NAN NAN
2 BCD 15 NAN NAN NAN NAN
3 CDE 17 NAN NAN NAN NAN
1 ABC NAN 18 NAN 17 NAN
2 BCD NAN 10 NAN 15 NAN
1 ABC NAN NAN 16 NAN NAN
3 CDE NAN NAN 19 NAN NAN
I want to merge the rows having the same id and name into a single row using pandas in python. The output should be :
Id Name Marks1 Marks2 Marks3 Marks4 Marks5
1 ABC 10 18 16 17 NAN
2 BCD 15 10 NAN 15 NAN
3 CDE 17 NAN 19 NAN NAN
IIUC, DataFrame.groupby + as_index=False with GroupBy.first to eliminate NaN.
#df = df.replace('NAN',np.nan) #If necessary
df.groupby(['Id','Name'],as_index=False).first()
if you think it could have a pair Id Name with non-null values ​​in some column you could use GroupBy.apply with Series.ffill and Series.bfill + DataFrame.drop_duplicates to keep all the information.
df.groupby(['Id','Name']).apply(lambda x: x.ffill().bfill()).drop_duplicates()
Output
Id Name Marks1 Marks2 Marks3 Marks4 Marks5
0 1 ABC 10 18 16 17 NaN
1 2 BCD 15 10 NaN 15 NaN
2 3 CDE 17 NaN 19 NaN NaN
Hacky answer:
pd.groupby(“Name”).mean().reset_index()
This will only work if for each column there is only one valid value for each Name.

Setting subset of a pandas DataFrame by a DataFrame

I feel like this question has been asked a millions times before, but I just can't seem to get it to work or find a SO-post answering my question.
So I am selecting a subset of a pandas DataFrame and want to change these values individually.
I am subselecting my DataFrame like this:
df.loc[df[key].isnull(), [keys]]
which works perfectly. If I try and set all values to the same value such as
df.loc[df[key].isnull(), [keys]] = 5
it works as well. But if I try and set it to a DataFrame it does not, however no error is produced either.
So for example I have a DataFrame:
data = [['Alex',10,0,0,2],['Bob',12,0,0,1],['Clarke',13,0,0,4],['Dennis',64,2],['Jennifer',56,1],['Tom',95,5],['Ellen',42,2],['Heather',31,3]]
df1 = pd.DataFrame(data,columns=['Name','Age','Amount_of_cars','cars_per_year','some_other_value'])
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.0 2.0
1 Bob 12 0 0.0 1.0
2 Clarke 13 0 0.0 4.0
3 Dennis 64 2 NaN NaN
4 Jennifer 56 1 NaN NaN
5 Tom 95 5 NaN NaN
6 Ellen 42 2 NaN NaN
7 Heather 31 3 NaN NaN
and a second DataFrame:
data = [[2/64,5],[1/56,1],[5/95,7],[2/42,5],[3/31,7]]
df2 = pd.DataFrame(data,columns=['cars_per_year','some_other_value'])
cars_per_year some_other_value
0 0.031250 5
1 0.017857 1
2 0.052632 7
3 0.047619 5
4 0.096774 7
and I would like to replace those nans with the second DataFrame
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2
Unfortunately this does not work as the index does not match. So how do I ignore the index, when setting values?
Any help would be appreciated. Sorry if this has been posted before.
It is possible only if number of mising values is same like number of rows in df2, then assign array for prevent index alignment:
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
print (df1)
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.000000 2.0
1 Bob 12 0 0.000000 1.0
2 Clarke 13 0 0.000000 4.0
3 Dennis 64 2 0.031250 5.0
4 Jennifer 56 1 0.017857 1.0
5 Tom 95 5 0.052632 7.0
6 Ellen 42 2 0.047619 5.0
7 Heather 31 3 0.096774 7.0
If not, get errors like:
#4 rows assigned to 5 rows
data = [[2/64,5],[1/56,1],[5/95,7],[2/42,5]]
df2 = pd.DataFrame(data,columns=['cars_per_year','some_other_value'])
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
ValueError: shape mismatch: value array of shape (4,) could not be broadcast to indexing result of shape (5,)
Another idea is set index of df2 by index of filtered rows in df1:
df2 = df2.set_index(df1.index[df1['cars_per_year'].isnull()])
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2
print (df1)
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.000000 2.0
1 Bob 12 0 0.000000 1.0
2 Clarke 13 0 0.000000 4.0
3 Dennis 64 2 0.031250 5.0
4 Jennifer 56 1 0.017857 1.0
5 Tom 95 5 0.052632 7.0
6 Ellen 42 2 0.047619 5.0
7 Heather 31 3 0.096774 7.0
Just add .values or .to_numpy() if using pandas v 0.24 +
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.000000 2.0
1 Bob 12 0 0.000000 1.0
2 Clarke 13 0 0.000000 4.0
3 Dennis 64 2 0.031250 5.0
4 Jennifer 56 1 0.017857 1.0
5 Tom 95 5 0.052632 7.0
6 Ellen 42 2 0.047619 5.0
7 Heather 31 3 0.096774 7.0

Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries

I have a pandas dataframe with a column named 'City, State, Country'. I want to separate this column into three new columns, 'City, 'State' and 'Country'.
0 HUN
1 ESP
2 GBR
3 ESP
4 FRA
5 ID, USA
6 GA, USA
7 Hoboken, NJ, USA
8 NJ, USA
9 AUS
Splitting the column into three columns is trivial enough:
location_df = df['City, State, Country'].apply(lambda x: pd.Series(x.split(',')))
However, this creates left-aligned data:
0 1 2
0 HUN NaN NaN
1 ESP NaN NaN
2 GBR NaN NaN
3 ESP NaN NaN
4 FRA NaN NaN
5 ID USA NaN
6 GA USA NaN
7 Hoboken NJ USA
8 NJ USA NaN
9 AUS NaN NaN
How would one go about creating the new columns with the data right-aligned? Would I need to iterate through every row, count the number of commas and handle the contents individually?
I'd do something like the following:
foo = lambda x: pd.Series([i for i in reversed(x.split(','))])
rev = df['City, State, Country'].apply(foo)
print rev
0 1 2
0 HUN NaN NaN
1 ESP NaN NaN
2 GBR NaN NaN
3 ESP NaN NaN
4 FRA NaN NaN
5 USA ID NaN
6 USA GA NaN
7 USA NJ Hoboken
8 USA NJ NaN
9 AUS NaN NaN
I think that gets you what you want but if you also want to pretty things up and get a City, State, Country column order, you could add the following:
rev.rename(columns={0:'Country',1:'State',2:'City'},inplace=True)
rev = rev[['City','State','Country']]
print rev
City State Country
0 NaN NaN HUN
1 NaN NaN ESP
2 NaN NaN GBR
3 NaN NaN ESP
4 NaN NaN FRA
5 NaN ID USA
6 NaN GA USA
7 Hoboken NJ USA
8 NaN NJ USA
9 NaN NaN AUS
Assume you have the column name as target
df[["City", "State", "Country"]] = df["target"].str.split(pat=",", expand=True)
Since you are dealing with strings I would suggest the amendment to your current code i.e.
location_df = df[['City, State, Country']].apply(lambda x: pd.Series(str(x).split(',')))
I got mine to work by testing one of the columns but give this one a try.

Categories

Resources