I have two pandas DataFrames, x and y.
x.head() gives:
user hotel rating id
0 1 1253 5 2783_1253
1 4 589 5 2783_589
2 5 1270 4 2783_1270
3 3 1274 4 2783_1274
4 2 741 5 2783_741
y.head() gives:
UserID Gender Age Occupation Zip Code
0 1.0 F 18.0 10.0 48067
1 2.0 M 56.0 16.0 70072
2 3.0 M 25.0 15.0 55117
3 4.0 M 45.0 7.0 2460
4 5.0 M 25.0 20.0 55455
What I need is to merge the columns of these two where user = UserID.
So, for example, my first row should look like:
user hotel rating id UserID Gender Age Occupation Zip Code
0 1 1253 5 2783_1253 1.0 F 18.0 10.0 48067
How can I achieve this?
I think you first need to convert the float column to int and then merge:
y['user'] = y.UserID.astype(int)
df = pd.merge(x,y, on='user')
print (df)
user hotel rating id UserID Gender Age Occupation Zip Code
0 1 1253 5 2783_1253 1.0 F 18.0 10.0 48067
1 4 589 5 2783_589 4.0 M 45.0 7.0 2460
2 5 1270 4 2783_1270 5.0 M 25.0 20.0 55455
3 3 1274 4 2783_1274 3.0 M 25.0 15.0 55117
4 2 741 5 2783_741 2.0 M 56.0 16.0 70072
Or create a float UserID column in x and merge on UserID:
x['UserID'] = x.user.astype(float)
df = pd.merge(x,y, on='UserID')
print (df)
user hotel rating id UserID Gender Age Occupation Zip Code
0 1 1253 5 2783_1253 1.0 F 18.0 10.0 48067
1 4 589 5 2783_589 4.0 M 45.0 7.0 2460
2 5 1270 4 2783_1270 5.0 M 25.0 20.0 55455
3 3 1274 4 2783_1274 3.0 M 25.0 15.0 55117
4 2 741 5 2783_741 2.0 M 56.0 16.0 70072
What you are looking for is a join. You will find your answer here: http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.join.html (it works just like in SQL).
However, there might be some additional casting and renaming needed if you want to keep both user as an integer and UserID as a float.
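For completeness, a rough sketch of that idea, using merge rather than join since merge handles differently named key columns directly; the astype cast and the optional drop at the end are assumptions about what you want to keep:
# Cast UserID to int so it lines up with the integer 'user' column, then merge
# on the differently named keys; no extra key column needs to be created first.
df = x.merge(y.assign(UserID=y['UserID'].astype(int)),
             left_on='user', right_on='UserID', how='left')
# Optionally drop the now-redundant UserID key column:
df = df.drop(columns='UserID')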
I have two DataFrames, x and y, with different numbers of rows. Columns "a" and "b" together act as a unique key.
I want the rows of the y DataFrame to replace the rows of the x DataFrame that share the same "a" and "b" values.
import numpy as np
import pandas as pd

x = pd.DataFrame({"a": [1, 2, 2, 3, 3, np.nan, 5],
                  "b": [12, 13, 14, 15, 16, 17, 18],
                  "c": ["japan", np.nan, "india", np.nan, np.nan, "france", "brazil"],
                  "d": [12, 15, 10, np.nan, 11, 6, 20]})
y = pd.DataFrame({"a": [2, 2, 3, 3],
                  "b": [13, 14, 15, 16],
                  "c": [np.nan, "india", "sweden", "spain"],
                  "d": [15, 10, 25, 11]})
Required output: x as it is, but with the "c" and "d" values taken from y wherever the ("a", "b") pair matches.
I tried multiple methods like merge() and update(), but they are not working. Please help.
Use merge and combine_first: the left merge pulls y's values into x's row order (NaN where there is no match on ['a', 'b']), and combine_first then falls back to x's original values for those gaps.
out = x[['a', 'b']].merge(y, on=['a', 'b'], how='left').combine_first(x)
Output:
a b c d
0 1.0 12 japan 12.0
1 2.0 13 NaN 15.0
2 2.0 14 india 10.0
3 3.0 15 sweden 25.0
4 3.0 16 spain 11.0
5 NaN 17 france 6.0
6 5.0 18 brazil 20.0
If a and b together act as unique keys, you can set them as the index and then use combine_first, as @mozway has suggested.
x = x.set_index(["a", "b"])
y = y.set_index(["a", "b"])
out = x.combine_first(y)
c d
a b
1.0 12 japan 12.0
2.0 13 NaN 15.0
14 india 10.0
3.0 15 sweden 25.0
16 spain 11.0
NaN 17 france 6.0
5.0 18 brazil 20.0
You can optionally reset the index afterwards:
out.reset_index()
a b c d
0 1.0 12 japan 12.0
1 2.0 13 NaN 15.0
2 2.0 14 india 10.0
3 3.0 15 sweden 25.0
4 3.0 16 spain 11.0
5 NaN 17 france 6.0
6 5.0 18 brazil 20.0
References
pd.DataFrame.set_index
pd.DataFrame.reset_index
In a pandas DataFrame, I want to transpose and group the datetime columns into rows, going from this (there are about 12 date columns):
Category Type 11/2021 12/2021
0 A 1 0.0 20
1 A 2 NaN 13
2 B 1 5.0 7
3 B 2 20.0 4
to one like this:
Date Category Type1 Type2
0 2021-11 A 0 NaN
1 2021-11 B 5 20.0
2 2021-12 A 20 13.0
3 2021-12 B 7 4.0
I thought about using pivot tables, but I wasn't able to make it work.
You could do:
(df.melt(['Category', 'Type'], var_name='Date')
   .pivot(index=['Date', 'Category'], columns='Type')
   .reset_index())
Date Category value
Type 1 2
0 11/2021 A 0.0 NaN
1 11/2021 B 5.0 20.0
2 12/2021 A 20.0 13.0
3 12/2021 B 7.0 4.0
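Note that this result has MultiIndex columns such as ('value', 1) and ('value', 2). If you want flat Type1/Type2 names as in the desired output, one possible follow-up (assuming the result above is assigned to a variable out) is:
out.columns = [f'Type{typ}' if col == 'value' else col for col, typ in out.columns]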
To be a little cleaner, you could use pyjanitor:
import janitor
(df.pivot_longer(['Category', 'Type'], names_to='Date', values_to='type')
   .pivot_wider(['Date', 'Category'], names_from='Type', names_sep=''))
Date Category type1 type2
0 11/2021 A 0.0 NaN
1 11/2021 B 5.0 20.0
2 12/2021 A 20.0 13.0
3 12/2021 B 7.0 4.0
Another solution:
x = (
    df.set_index(["Category", "Type"])
      .stack()            # the date columns become an inner index level
      .unstack("Type")    # the Type values (1, 2) become columns
      .add_prefix("Type")
      .reset_index()
)
x = x.rename(columns={"level_1": "Date"})
x.columns.name = None
print(x)
Prints:
Category Date Type1 Type2
0 A 11/2021 0.0 NaN
1 A 12/2021 20.0 13.0
2 B 11/2021 5.0 20.0
3 B 12/2021 7.0 4.0
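All of the answers above leave Date as the original 'MM/YYYY' strings. If you want the '2021-11' form shown in the desired output, a possible final step (applied here to x from the last snippet, and assuming every date column follows that format) is:
import pandas as pd

x['Date'] = pd.to_datetime(x['Date'], format='%m/%Y').dt.strftime('%Y-%m')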
I have the following dataset:
import numpy as np
import pandas as pd

my_df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                      'machine': ['A', 'A', 'A', 'B', 'B', 'A', 'B', 'B', 'A'],
                      'prod': ['button', 'tack', 'pin', 'button', 'tack', 'pin', 'clip', 'clip', 'button'],
                      'qty': [100, 50, 30, 70, 60, 15, 200, 180, np.nan],
                      'hours': [4, 3, 1, 3, 2, 0.5, 5, 6, np.nan],
                      'day': [1, 1, 1, 1, 1, 1, 2, 2, 2]})
my_df['prod_rate'] = my_df['qty'] / my_df['hours']
my_df
id machine prod qty hours day prod_rate
0 1 A button 100.0 4.0 1 25.000000
1 2 A tack 50.0 3.0 1 16.666667
2 3 A pin 30.0 1.0 1 30.000000
3 4 B button 70.0 3.0 1 23.333333
4 5 B tack 60.0 2.0 1 30.000000
5 6 A pin 15.0 0.5 1 30.000000
6 7 B clip 200.0 5.0 2 40.000000
7 8 B clip 180.0 6.0 2 30.000000
8 9 A button NaN NaN 2 NaN
And I want to count the daily activities per machine, except when there is a NaN (which means the machine was down due to a failure).
I tried this code:
my_df['activities']=my_df.groupby(['day','machine'])['machine']\
.transform(lambda x: x['machine'].count() if x['qty'].notna() else np.nan)
But it returns an error: KeyError: 'qty'
This is the expected result:
id machine prod qty hours day prod_rate activities
0 1 A button 100.0 4.0 1 25.000000 4
1 2 A tack 50.0 3.0 1 16.666667 4
2 3 A pin 30.0 1.0 1 30.000000 4
3 4 B button 70.0 3.0 1 23.333333 2
4 5 B tack 60.0 2.0 1 30.000000 2
5 6 A pin 15.0 0.5 1 30.000000 4
6 7 B clip 200.0 5.0 2 40.000000 2
7 8 B clip 180.0 6.0 2 30.000000 2
8 9 A button NaN NaN 2 NaN NaN
Please, could you help me fix my lambda expression? It will help me with this question and with other operations too.
Although I prefer the solution from @steele-farnsworth, here is what the OP requested, i.e. how to make the lambda work:
my_df['activities'] = my_df.groupby(['day','machine'])['qty']\
.transform(lambda x: x.count() if x.notna().all() else np.nan)
print(my_df)
Prints
id machine prod qty hours day prod_rate activities
0 1 A button 100.0 4.0 1 25.000000 4.0
1 2 A tack 50.0 3.0 1 16.666667 4.0
2 3 A pin 30.0 1.0 1 30.000000 4.0
3 4 B button 70.0 3.0 1 23.333333 2.0
4 5 B tack 60.0 2.0 1 30.000000 2.0
5 6 A pin 15.0 0.5 1 30.000000 4.0
6 7 B clip 200.0 5.0 2 40.000000 2.0
7 8 B clip 180.0 6.0 2 30.000000 2.0
8 9 A button NaN NaN 2 NaN NaN
You can do the calculation as normal, and then fill in the NaNs where they are wanted afterwards.
>>> my_df['activities'] = my_df.groupby(['day', 'machine'])['machine'].transform('count')
>>> my_df.loc[my_df['qty'].isna(), 'activities'] = np.nan
>>> my_df
id machine prod qty hours day prod_rate activities
0 1 A button 100.0 4.0 1 25.000000 4.0
1 2 A tack 50.0 3.0 1 16.666667 4.0
2 3 A pin 30.0 1.0 1 30.000000 4.0
3 4 B button 70.0 3.0 1 23.333333 2.0
4 5 B tack 60.0 2.0 1 30.000000 2.0
5 6 A pin 15.0 0.5 1 30.000000 4.0
6 7 B clip 200.0 5.0 2 40.000000 2.0
7 8 B clip 180.0 6.0 2 30.000000 2.0
8 9 A button NaN NaN 2 NaN NaN
You should avoid using lambdas as much as possible in the context of Pandas, as they are not vectorized (and will therefore run slower) and are less communicative than using existing, idiomatic Pandas methods.
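A rough illustration of that point (synthetic data, purely for comparison, not from the question): both transform calls below compute the same per-group counts, but the string alias dispatches to pandas' built-in, optimized code path, while the lambda drops back into Python for every group.
import numpy as np
import pandas as pd

# Synthetic data purely for illustration.
df = pd.DataFrame({'g': np.random.randint(0, 1000, 100_000),
                   'v': np.random.rand(100_000)})

fast = df.groupby('g')['v'].transform('count')               # built-in alias
slow = df.groupby('g')['v'].transform(lambda x: x.count())   # Python-level lambda
assert (fast == slow).all()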
I have a dataset as follows:
alldata.loc[:,["Age","Pclass"]].head(10)
Out[24]:
Age Pclass
0 22.0 3
1 38.0 1
2 26.0 3
3 35.0 1
4 35.0 3
5 NaN 3
6 54.0 1
7 2.0 3
8 27.0 3
9 14.0 2
Now I want to fill all the null values in Age with the mean of all the Age values for that respective Pclass type.
Example: in the snippet above, the null Age at index 5 has Pclass = 3, so it should be replaced with the mean of all ages where Pclass = 3, i.e. Age = 22.4.
I tried some solutions using groupby, but they only changed a specific Pclass value and converted the rest of the fields to null. How can I end up with zero null values in this case?
You can use
1] transform with a lambda function
In [41]: df.groupby('Pclass')['Age'].transform(lambda x: x.fillna(x.mean()))
Out[41]:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 22.4
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64
Or use
2] fillna with the group mean
In [46]: df['Age'].fillna(df.groupby('Pclass')['Age'].transform('mean'))
Out[46]:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 22.4
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64
Or use
3] loc to replace null values
In [47]: df.loc[df['Age'].isnull(), 'Age'] = df.groupby('Pclass')['Age'].transform('mean')
In [48]: df
Out[48]:
Age Pclass
0 22.0 3
1 38.0 1
2 26.0 3
3 35.0 1
4 35.0 3
5 22.4 3
6 54.0 1
7 2.0 3
8 27.0 3
9 14.0 2
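Whichever of the three variants you use, a quick check (assuming df now holds the filled frame) confirms there are no remaining nulls:
assert df['Age'].isna().sum() == 0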
How to match values from this DataFrame source:
car_id lat lon
0 100 10.0 15.0
1 100 12.0 10.0
2 100 09.0 08.0
3 110 23.0 12.0
4 110 18.0 32.0
5 110 21.0 16.0
5 110 12.0 02.0
And keep only the rows whose coordinates appear in this second DataFrame coords:
lat lon
0 12.0 10.0
1 23.0 12.0
3 18.0 32.0
So that the resulting DataFrame result is:
car_id lat lon
1 100 12.0 10.0
3 110 23.0 12.0
4 110 18.0 32.0
I can do that in an iterative way with apply, but I'm looking for a vectorized way. I tried the following with isin() with no success:
result = source[source[['lat', 'lon']].isin({
'lat': coords['lat'],
'lon': coords['lon']
})]
The above method returns:
ValueError: ('operands could not be broadcast together with shapes (53103,) (53103,2)
DataFrame.merge() by default merges on all columns with the same names (the intersection of the columns of both DataFrames):
In [197]: source.merge(coords)
Out[197]:
car_id lat lon
0 100 12.0 10.0
1 110 23.0 12.0
2 110 18.0 32.0
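One caveat (my observation, not part of the answer above): merge resets the row index, so the result is labelled 0..2 rather than 1, 3, 4 as in the expected output. If keeping source's original index matters, a possible workaround is to carry it through the merge as a column:
result = (source.reset_index()                 # keep the original index as a column
                .merge(coords, on=['lat', 'lon'])
                .set_index('index')            # restore it
                .rename_axis(None))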
Here's one approach with NumPy broadcasting -
a = source.values   # (n, 3) array: car_id, lat, lon
b = coords.values   # (m, 2) array: lat, lon
# Compare every source (lat, lon) pair against every coords pair and keep
# source rows that match at least one pair.
out = source[(a[:,1:]==b[:,None]).all(-1).any(0)]
Sample run -
In [74]: source
Out[74]:
car_id lat lon
0 100 10.0 15.0
1 100 12.0 10.0
2 100 9.0 8.0
3 110 23.0 12.0
4 110 18.0 32.0
5 110 21.0 16.0
5 110 12.0 2.0
In [75]: coords
Out[75]:
lat lon
0 12.0 10.0
1 23.0 12.0
3 18.0 32.0
In [76]: a = source.values
...: b = coords.values
...:
In [77]: source[(a[:,1:]==b[:,None]).all(-1).any(0)]
Out[77]:
car_id lat lon
1 100 12.0 10.0
3 110 23.0 12.0
4 110 18.0 32.0
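Since the question specifically tried isin(), here is a sketch of an isin-based variant as well (note that comparing floats for exact equality like this can be brittle):
import pandas as pd

# Build MultiIndexes from the (lat, lon) pairs and test membership row by row.
mask = pd.MultiIndex.from_frame(source[['lat', 'lon']]) \
         .isin(pd.MultiIndex.from_frame(coords[['lat', 'lon']]))
result = source[mask]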