Joining two pandas dataframes by latest date - python

I have two dataframes: d (containing date1, name) and d1 (containing date2, name, rank). I need to join them on name so that each row in the first dataframe is assigned the latest rank as of date1,
i.e. d.name = d1.name and d1.date2 is the latest date on or before d.date1.
What is the easiest way of doing this?
import pandas as pd
d = pd.DataFrame({'date': ['20070105', '20130105', '20150102', '20170106', '20190106'],
                  'name': ['a', 'b', 'a', 'b', 'a']})
d
date name
0 20070105 a
1 20130105 b
2 20150102 a
3 20170106 b
4 20190106 a
d1 = pd.DataFrame({'date': ['20140105', '20160105', '20180103', '20190106'],
                   'rank': [1, 2, 2, 1],
                   'name': ['a', 'b', 'a', 'b']})
d1
date name rank
0 20140105 a 1
1 20160105 b 2
2 20180103 a 2
3 20190106 b 1
I'm expecting 'rank' to be added to 'd' and have output like this:
date name rank
0 20070105 a NaN
1 20130105 b NaN
2 20150102 a 1
3 20170106 b 2
4 20190106 a 2

I assume this is what you require.
Sort your second dataframe in ascending order by date, then drop_duplicates with keep='last'. Now merge the first dataframe with the processed second dataframe using pd.merge.
df2 = df2.sort_values(by='date')
temp=df2.drop_duplicates(subset=['name'], keep='last')
print(pd.merge(df1, temp, on=['name'], how='left'))
Note: I have assumed column and variable names like the above; adjust them to match your actual data for the exact result.
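Note that the approach above assigns each name's overall latest rank from the second dataframe. If the "latest rank as of date1" requirement matters, pd.merge_asof handles it directly; a minimal sketch using the question's d and d1, assuming the date strings are parsed to datetimes and both frames are sorted by date first:
import pandas as pd

# Hedged sketch: as-of join, matching each row of d with the most recent
# d1 row (same name) whose date is on or before d's date.
d['date'] = pd.to_datetime(d['date'])
d1['date'] = pd.to_datetime(d1['date'])

out = pd.merge_asof(d.sort_values('date'), d1.sort_values('date'),
                    on='date', by='name', direction='backward')
print(out)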

Related

Missing columns when trying to groupby aggregate multiple rows in pandas

I have a dataframe with relevant info, and I want to groupby one column, say id, with the other columns of the same id joined by "|". However, when I run my code, most of my columns end up missing (only the first 3 appear), and I don't know what is going wrong.
My code is:
df = df.groupby('id').agg(lambda col: '|'.join(set(col))).reset_index()
For instance, my data starts like
id words ... (other columns here)
0 a asd
1 a rtr
2 b s
3 c rrtttt
4 c dsfd
and I want
id words ... (other columns here)
a asd|rtr
b s
c rrtttt|dsfd
but also with all the rest of my columns grouped similarly. Right now the rest of my columns just don't appear in my output dataset. Not sure what is going wrong. Thanks!
Convert everything to string beforehand: the lambda fails on non-string columns because '|'.join only accepts strings, and older pandas silently drops any column where the aggregation raises, which is why most of your columns disappear. You can then avoid the lambda by using agg(set) and applymap afterwards:
df.astype(str).groupby('id').agg(set).applymap('|'.join)
Minimal Verifiable Example
df = pd.DataFrame({
    'id': ['a', 'a', 'b', 'c', 'c'],
    'numbers': [1, 2, 2, 3, 3],
    'words': ['asd', 'rtr', 's', 'rrtttt', 'dsfd']})
df
id numbers words
0 a 1 asd
1 a 2 rtr
2 b 2 s
3 c 3 rrtttt
4 c 3 dsfd
df.astype(str).groupby('id').agg(set).applymap('|'.join)
numbers words
id
a 1|2 asd|rtr
b 2 s
c 3 rrtttt|dsfd
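If you prefer to keep the lambda, the same astype(str) fix applies; a quick sketch (assuming every column can be meaningfully cast to str), which behaves like the agg(set).applymap version above:
# equivalent result, keeping the original lambda
df.astype(str).groupby('id').agg(lambda col: '|'.join(set(col)))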

How to use an equality condition for manipulating a Pandas Dataframe based on another dataframe?

I have a dataframe in Python, say A, which has multiple columns, including columns named ECode and FG. I have another Pandas dataframe B, also with multiple columns, including columns named ECode, F Gping (note the space in the column name F Gping) and EDesc. What I would like to do is create a new column called EDesc in dataframe A based on the following conditions. (Note that EDesc, FG and F Gping contain string values (text), while the remaining columns are numeric/floating type. Also, dataframes A and B have different dimensions, with differing rows and columns, and I want to check equality on specific values in the dataframe columns.)
First, for all rows in dataframe A where the value in ECode matches a value of ECode in dataframe B, put the corresponding EDesc from B into the new EDesc column to be created in A.
Secondly, for all remaining rows in A where the value in FG matches an F Gping value in B, put the corresponding EDesc from B into A's EDesc column.
After this, if the newly created EDesc column in A still has missing values/NaNs, fill those rows with the string value MissingValue.
I have tried using for loops, as well as list comprehensions, but they haven't helped me accomplish this. Moreover, the space within the column name F Gping in B is creating problems when accessing it: although I can access it as B['F Gping'], that alone doesn't solve the problem. Any help in this regard is appreciated.
I'm assuming values are unique in B['ECode'] and B['F Gping'], otherwise we'll have to choose which value we give to A['EDesc'] when we find two matching values for ECode or FG.
There might be a smarter way but here's what I would do with joins:
Example DataFrames:
A = pd.DataFrame({'ECode': [1, 1, 3, 4, 6],
                  'FG': ['a', 'b', 'c', 'b', 'y']})
B = pd.DataFrame({'ECode': [1, 2, 3, 5],
                  'F Gping': ['b', 'c', 'x', 'x'],
                  'EDesc': ['a', 'b', 'c', 'd']})
So they look like:
A
ECode FG
0 1 a
1 1 b
2 3 c
3 4 b
4 6 y
B
ECode F Gping EDesc
0 1 b a
1 2 c b
2 3 x c
3 5 x d
First let's create A['EDesc'] by saying that it's the result of joining A and B on ECode. We'll temporarily use ECode as the index:
A.set_index('ECode', inplace=True, drop=False)
B.set_index('ECode', inplace=True, drop=False)
A['EDesc'] = A.join(B, lsuffix='A')['EDesc']
This works because the result of A.join(B, lsuffix='A') is:
ECodeA FG ECode F Gping EDesc
ECode
1 1 a 1.0 b a
1 1 b 1.0 b a
3 3 c 3.0 x c
4 4 b NaN NaN NaN
6 6 y NaN NaN NaN
Now let's fillna on A['EDesc'], using the match on FG. Same thing:
A.set_index('FG', inplace=True, drop=False)
B.set_index('F Gping', inplace=True, drop=False)
A['EDesc'].fillna(A.join(B, lsuffix='A')['EDesc'].drop_duplicates(), inplace=True)
This works because the result of A.join(B, lsuffix='A') is:
ECodeA FG EDescA ECode F Gping EDesc
a 1 a a NaN NaN NaN
b 1 b a 1.0 b a
b 4 b NaN 1.0 b a
c 3 c c 2.0 c b
y 6 y NaN NaN NaN NaN
Also we dropped the duplicates because as you can see there are two b's in our index.
Finally let's fillna with "Missing" and reset the index:
A['EDesc'].fillna('Missing', inplace=True)
A.reset_index(drop=True, inplace=True)
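A shorter alternative sketch, starting from the original A and B (before any set_index calls) and making the same uniqueness assumption as above: build lookup Series from B and use Series.map instead of joins on a temporary index.
# Hedged sketch: map the ECode match first, fall back to the FG match, then to 'Missing'.
ecode_map = B.drop_duplicates('ECode').set_index('ECode')['EDesc']
fg_map = B.drop_duplicates('F Gping').set_index('F Gping')['EDesc']

A['EDesc'] = (A['ECode'].map(ecode_map)
                        .fillna(A['FG'].map(fg_map))
                        .fillna('Missing'))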

loop through a single column in one dataframe compare to a column in another dataframe create new column in first dataframe using pandas

Right now I have two dataframes; they look like:
c = pd.DataFrame({'my_goal': [3, 4, 5, 6, 7],
                  'low_number': [0, 100, 1000, 2000, 3000],
                  'high_number': [100, 1000, 2000, 3000, 4000]})
and
a = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e'],
                  'Number': [50, 500, 1030, 2005, 3575]})
What I want to do is: if 'Number' falls between the low number and the high number, bring back the value in 'my_goal'. For example, if we look at row 'a', its Number is 50, which falls between 0 and 100, so I want it to bring back 3. I also want to create a dataframe that contains all the columns from dataframe a plus the 'my_goal' column from dataframe c. I want the output to look like:
I tried making my high and low numbers into a separate list and running a for loop from that, but all that gives me are 'my_goal' numbers:
low_number = [0, 100, 1000, 2000, 3000]
for i in a:
    if float(i) >= low_number:
        a = c['my_goal']
        print(a)
You can use pd.cut; when I see ranges, I first think of pd.cut:
dfa = pd.DataFrame(a)
dfc = pd.DataFrame(c)
dfa['my_goal'] = pd.cut(dfa['Number'],
                        bins=[0] + dfc['high_number'].tolist(),
                        labels=dfc['my_goal'])
Output:
a Number my_goal
0 a 50 3
1 b 500 4
2 c 1030 5
3 d 2005 6
4 e 3575 7
I changed row 4 slightly to include a test case where the condition is not met. You can concat a with rows of c where the condition is true.
a = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e'], 'Number': [50, 500, 1030, 1995, 3575]})
cond = a.Number.between(c.low_number, c.high_number)
pd.concat([a, c.loc[cond, ['my_goal']]], axis=1, join='inner')
Number a my_goal
0 50 a 3
1 500 b 4
2 1030 c 5
4 3575 e 7
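Another hedged sketch for the same range lookup: a cross join plus between (how='cross' needs pandas 1.2 or newer). It is simple and explicit, but it materialises len(a) * len(c) rows, so it only suits small frames:
# Pair every row of a with every row of c, then keep the pairs where
# Number falls inside [low_number, high_number] (inclusive on both ends).
merged = a.merge(c, how='cross')
out = merged.loc[merged['Number'].between(merged['low_number'], merged['high_number']),
                 ['a', 'Number', 'my_goal']]
print(out)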

Two dataframes into one

I am not sure if this is possible. I have two dataframes df1 and df2 which are presented like this:
df1            df2
id  value      id  value
a   5          a   3
c   9          b   7
d   4          c   6
f   2          d   8
               e   2
               f   1
They will have many more entries in reality than presented here. I would like to create a third dataframe df3 based on the values in df1 and df2. Any values in df1 would take precedence over values in df2 when writing to df3 (if the same id is present in both df1 and df2) so in this example I would return:
df3
id value
a 5
b 7
c 9
d 4
e 2
f 2
I have tried using df2 as the base (df2 will have all of the id's present for the whole universe) and then overwriting the value for id's that are present in df1, but cannot find the merge syntax to do this.
You could use combine_first, provided that you first make the DataFrame index id (so that the values get aligned by id):
In [80]: df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
Out[80]:
id value
0 a 5.0
1 b 7.0
2 c 9.0
3 d 4.0
4 e 2.0
5 f 2.0
Since you mentioned merging, you might be interested in seeing that
you could merge df1 and df2 on id, and then use fillna to replace NaNs in df1's value column with values from df2's value column:
df1 = pd.DataFrame({'id': ['a', 'c', 'd', 'f'], 'value': [5, 9, 4, 2]})
df2 = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e', 'f'], 'value': [3, 7, 6, 8, 2, 1]})
result = pd.merge(df2, df1, on='id', how='left', suffixes=('_x', ''))
result['value'] = result['value'].fillna(result['value_x'])
result = result[['id', 'value']]
print(result)
yields the same result, though the first method is simpler.
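A third option worth a mention, as a sketch: stack df1 on top of df2 with concat and keep only the first occurrence of each id, so df1's values win; unlike combine_first this also keeps the integer dtype.
df3 = (pd.concat([df1, df2])
         .drop_duplicates(subset='id', keep='first')
         .sort_values('id')
         .reset_index(drop=True))
print(df3)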

Merging and sum up several value-counts series in Pandas

I usually use value_counts() to get the number of occurrences of a value. However, I now deal with large database tables (I cannot load them fully into RAM) and query the data one month at a time.
Is there a way to store the result of value_counts() and merge it with / add it to the next results?
I want to count the number of user actions. Assume the following structure of
user-activity logs:
# month 1
id userId actionType
1 1 a
2 1 c
3 2 a
4 3 a
5 3 b
# month 2
id userId actionType
6 1 b
7 1 b
8 2 a
9 3 c
Using value_counts() on those produces:
# month 1
userId
1 2
2 1
3 2
# month 2
userId
1 2
2 1
3 1
Expected output:
# month 1+2
userId
1 4
2 2
3 3
Up until now, I just have found a method using groupby and sum:
# count users actions and remember them in new column
df1['count'] = df1.groupby(['userId'], sort=False)['id'].transform('count')
# delete not necessary columns
df1 = df1[['userId', 'count']]
# delete not necessary rows
df1 = df1.drop_duplicates(subset=['userId'])
# repeat
df2['count'] = df2.groupby(['userId'], sort=False)['id'].transform('count')
df2 = df2[['userId', 'count']]
df2 = df2.drop_duplicates(subset=['userId'])
# merge and sum up
print(pd.concat([df1, df2]).groupby(['userId'], sort=False).sum())
What is the pythonic / pandas' way of merging the information of several series' (and dataframes) efficiently?
Let me suggest "add" and specify a fill value of 0. This has an advantage over the previously suggested answer in that it will work when the two Dataframes have non-identical sets of unique keys.
# Create frames
df1 = pd.DataFrame(
    {'User_id': ['a', 'a', 'b', 'c', 'c', 'd'], 'a': [1, 1, 2, 3, 3, 5]})
df2 = pd.DataFrame(
    {'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
Now add the two sets of value_counts(). The fill_value argument handles any NaN values that would otherwise arise; in this example, the 'd' that appears in df1 but not in df2.
a = df1.User_id.value_counts()
b = df2.User_id.value_counts()
a.add(b, fill_value=0)
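For the month-by-month use case, you can keep a running total and add each new month's value_counts() to it as it is loaded; a sketch, where iter_monthly_chunks is a hypothetical generator yielding one month's log DataFrame at a time:
import pandas as pd

total = pd.Series(dtype='float64')
for chunk in iter_monthly_chunks():          # hypothetical loader, one month per DataFrame
    total = total.add(chunk['userId'].value_counts(), fill_value=0)
total = total.astype(int).sort_index()
print(total)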
You can sum the series generated by the value_counts method directly:
# create frames
df = pd.DataFrame({'User_id': ['a', 'a', 'b', 'c', 'c'], 'a': [1, 1, 2, 3, 3]})
df1 = pd.DataFrame({'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
sum the series:
df.User_id.value_counts() + df1.User_id.value_counts()
output:
a 4
b 3
c 5
dtype: int64
This is known as "Split-Apply-Combine". It can be done in one line and three quick steps, using a lambda function as follows.
1️⃣ paste this into your code:
df['total_for_this_label'] = df.groupby('label', as_index=False)['label'].transform(lambda x: x.count())
2️⃣ replace 3x label with the name of the column whose values you are counting (case-sensitive)
3️⃣ print(df.head()) to check it's worked correctly
