I am not sure if this is possible. I have two dataframes df1 and df2 which are presented like this:
df1
id value
a 5
c 9
d 4
f 2

df2
id value
a 3
b 7
c 6
d 8
e 2
f 1
They will have many more entries in reality than presented here. I would like to create a third dataframe df3 based on the values in df1 and df2. Any values in df1 would take precedence over values in df2 when writing to df3 (if the same id is present in both df1 and df2) so in this example I would return:
df3
id value
a 5
b 7
c 9
d 4
e 2
f 2
I have tried using df2 as the base (df2 will have all of the id's present for the whole universe) and then overwriting the value for id's that are present in df1, but cannot find the merge syntax to do this.
You could use combine_first, provided that you first make the DataFrame index id (so that the values get aligned by id):
In [80]: df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
Out[80]:
id value
0 a 5.0
1 b 7.0
2 c 9.0
3 d 4.0
4 e 2.0
5 f 2.0
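As the output shows, value comes back as float: the alignment by id introduces intermediate NaNs, which upcast the column. If you need integers, you can cast back afterwards. A self-contained sketch of the whole approach:

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'c', 'd', 'f'], 'value': [5, 9, 4, 2]})
df2 = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e', 'f'], 'value': [3, 7, 6, 8, 2, 1]})

# df1's values win wherever both frames have the same id
df3 = df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
df3['value'] = df3['value'].astype(int)  # undo the float upcast
print(df3)
```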
Since you mentioned merging, you might be interested in seeing that
you could merge df2 and df1 on id, and then use fillna to replace NaNs in df1's value column with values from df2's value column:
df1 = pd.DataFrame({'id': ['a', 'c', 'd', 'f'], 'value': [5, 9, 4, 2]})
df2 = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e', 'f'], 'value': [3, 7, 6, 8, 2, 1]})
result = pd.merge(df2, df1, on='id', how='left', suffixes=('_x', ''))
result['value'] = result['value'].fillna(result['value_x'])
result = result[['id', 'value']]
print(result)
yields the same result, though the first method is simpler.
I have a huge dataframe and I need to filter out the columns from the dataframe if the columns are present in a given list.
For example,
df = pd.DataFrame([[1,2,3,4,5],[6,7,8,9,10]], columns=list('ABCDE'))
This is the dataframe.
A B C D E
0 1 2 3 4 5
1 6 7 8 9 10
I have a list.
fil_lst = ['A', 'D', 'F']
The list may contain column names that are not present in the dataframe. I need only the columns that are present in the dataframe.
I need the resulting dataframe like,
A D
0 1 4
1 6 9
I know it can be done with the help of list comprehension like,
new_df = df[[col for col in fil_lst if col in df.columns]]
But as I have a huge dataframe, it is better if I don't use this computationally expensive process.
Is it possible to vectorize this in any way?
Use Index.isin to test membership in the columns, and DataFrame.loc to filter by columns; the : selects all rows, and the boolean mask selects the columns:
fil_lst = ['A', 'D', 'F']
df = df.loc[:, df.columns.isin(fil_lst)]
print(df)
A D
0 1 4
1 6 9
Or use Index.intersection:
fil_lst = ['A', 'D', 'F']
df = df[df.columns.intersection(fil_lst)]
print(df)
A D
0 1 4
1 6 9
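A third option worth knowing: DataFrame.filter with items= silently ignores labels that are not present in the frame, so no explicit membership test is needed, and the resulting columns follow the order of fil_lst:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], columns=list('ABCDE'))
fil_lst = ['A', 'D', 'F']

# filter keeps only the labels from fil_lst that actually exist in df
new_df = df.filter(items=fil_lst)
print(new_df)
```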
If you are dealing with large lists, and the focus is on performance more than order of columns, you can use set intersection:
In [2944]: fil_lst = ['A', 'D', 'F']
In [2945]: col_list = df.columns.tolist()
In [2947]: df = df[list(set(col_list) & set(fil_lst))]
In [2947]: df
Out[2947]:
D A
0 4 1
1 9 6
EDIT: If order of columns is important, then do this:
In [2953]: df = df[sorted(set(col_list) & set(fil_lst), key = col_list.index)]
In [2953]: df
Out[2953]:
A D
0 1 4
1 6 9
I have two dataframes: d (containing date1 and name) and d1 (containing date2, name and rank). I need to join these two on name such that for each row in the first dataframe I assign the latest rank as of date1,
i.e. d.name = d1.name and d1.date2 is the latest date on or before d.date1.
What is the easiest way of doing this?
import pandas as pd
d = pd.DataFrame({'date' : ['20070105', '20130105', '20150102',
'20170106', '20190106'], 'name': ['a', 'b', 'a', 'b', 'a']})
d
date name
0 20070105 a
1 20130105 b
2 20150102 a
3 20170106 b
4 20190106 a
d1 = pd.DataFrame({'date' : ['20140105', '20160105', '20180103',
    '20190106'], 'rank' : [1, 2, 2, 1], 'name': ['a', 'b', 'a', 'b']})
d1
date name rank
0 20140105 a 1
1 20160105 b 2
2 20180103 a 2
3 20190106 b 1
I'm expecting 'rank' to be added to 'd' and have output like this:
date name rank
0 20070105 a NaN
1 20130105 b NaN
2 20150102 a 1
3 20170106 b 2
4 20190106 a 2
I assume this is what you need:
Sort your second dataframe in ascending order by date, then call drop_duplicates with keep='last', and finally merge the first dataframe with the processed second dataframe:
df2 = df2.sort_values(by='date')
temp = df2.drop_duplicates(subset=['name'], keep='last')
print(pd.merge(df1, temp, on=['name'], how='left'))
Note: the column and variable names above are assumed; adjust them to match your actual input for an exact result.
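For this kind of "latest value on or before a date" lookup, pandas has a dedicated tool: pd.merge_asof matches each row of d with the most recent d1 row per name whose date is on or before it. A sketch using the sample data as shown in the printed tables (the date strings are converted to datetime, since merge_asof needs a sortable key, and both frames must be sorted by it):

```python
import pandas as pd

d = pd.DataFrame({'date': ['20070105', '20130105', '20150102', '20170106', '20190106'],
                  'name': ['a', 'b', 'a', 'b', 'a']})
d1 = pd.DataFrame({'date': ['20140105', '20160105', '20180103', '20190106'],
                   'rank': [1, 2, 2, 1], 'name': ['a', 'b', 'a', 'b']})

# merge_asof needs real dates (not strings) and both frames sorted by the key
d['date'] = pd.to_datetime(d['date'], format='%Y%m%d')
d1['date'] = pd.to_datetime(d1['date'], format='%Y%m%d')
d = d.sort_values('date')
d1 = d1.sort_values('date')

# for each row of d, take the latest d1 row with the same name
# whose date is on or before d's date
result = pd.merge_asof(d, d1, on='date', by='name')
print(result)
```

Rows with no earlier match (the first a and b entries) come out as NaN, exactly as in the expected output.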
What is the quickest way to merge two Python data frames in this manner?
I have two data frames with similar structures (both have a primary key id and some value columns).
What I want to do is merge the two data frames based on id. Are there any ways do this based on pandas operations? How I've implemented it right now is as coded below:
import pandas as pd
a = pd.DataFrame({'id': [1,2,3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1,3,4], 'letter': ['A', 'C', 'D']})
a_dict = {e['id']: e for e in a.to_dict('records')}
b_dict = {e['id']: e for e in b.to_dict('records')}
c_dict = a_dict.copy()
c_dict.update(b_dict)
c = pd.DataFrame(list(c_dict.values()))
Here, c would be equivalent to
pd.DataFrame({'id': [1,2,3,4], 'letter':['A','b', 'C', 'D']})
id letter
0 1 A
1 2 b
2 3 C
3 4 D
combine_first
If 'id' is your primary key, then use it as your index.
b.set_index('id').combine_first(a.set_index('id')).reset_index()
id letter
0 1 A
1 2 b
2 3 C
3 4 D
merge with groupby
a.merge(b, 'outer', 'id').groupby(lambda x: x.split('_')[0], axis=1).last()
id letter
0 1 A
1 2 b
2 3 C
3 4 D
One way may be as follows:
append dataframe a to dataframe b
drop duplicates based on id
sort the remaining rows by id
reset the index and drop the old index
You can try:
import pandas as pd
a = pd.DataFrame({'id': [1,2,3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1,3,4], 'letter': ['A', 'C', 'D']})
c = pd.concat([b, a]).drop_duplicates(subset='id').sort_values('id').reset_index(drop=True)
print(c)
Try this
c = pd.concat([a, b], axis=0).sort_values('letter').drop_duplicates('id', keep='first').sort_values('id')
c.reset_index(drop=True, inplace=True)
print(c)
id letter
0 1 A
1 2 b
2 3 C
3 4 D
This question was very hard to word..
Here is some sample code for a reproducible example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame([['a', 1, 10, 1], ['a', 2, 20, 1], ['b', 1, 4, 1], ['c', 1, 2, 1], ['e', 2, 10, 1]])
df2 = pd.DataFrame([['a', 1, 15, 2], ['a', 2, 20, 2], ['c', 1, 2, 2]])
df3 = pd.DataFrame([['d', 1, 10, 3], ['e', 2, 20, 3], ['f', 1, 15, 3]])
df1.columns = ['name', 'id', 'price', 'part']
df2.columns = ['name', 'id', 'price', 'part']
df3.columns = ['name', 'id', 'price', 'part']
result = pd.DataFrame([['a', 1, 10, 15, 'missing'],
['a', 2, 20, 20, 'missing'],
['b', 1, 4, 'missing', 'missing'],
['c', 1, 2, 2, 'missing'],
['e', 2, 10, 'missing', 20],
['d', 1, 'missing', 'missing', 10],
['f', 1, 'missing', 'missing', 15]])
result.columns = ['name', 'id', 'pricepart1', 'pricepart2', 'pricepart3']
So there are three DataFrames:
df1
name id price part
0 a 1 10 1
1 a 2 20 1
2 b 1 4 1
3 c 1 2 1
4 e 2 10 1
df2
name id price part
0 a 1 15 2
1 a 2 20 2
2 c 1 2 2
df3
name id price part
0 d 1 10 3
1 e 2 20 3
2 f 1 15 3
The name and id together act as a composite key. A pair may be present in all three DataFrames, in just two of them, or in just one. To indicate which DataFrame a name, id pair came from, a part column exists in df1, df2 and df3.
The result I'm looking for is given by the result DataFrame.
name id pricepart1 pricepart2 pricepart3
0 a 1 10 15 missing
1 a 2 20 20 missing
2 b 1 4 missing missing
3 c 1 2 2 missing
4 e 2 10 missing 20
5 d 1 missing missing 10
6 f 1 missing missing 15
Basically, I want EVERY name, id pair to be accounted for. Even if the SAME name, id pair appears in both df1 and df2, I want separate price columns for each part, even if the prices in both parts/DataFrames are the same.
In the results DataFrame, take row1, a 1 10 15 missing
What this represents is, the name, id pair a 1 had a price of 10 in df1, 15 in df2, and missing in df3.
If the row value is missing for a specific pricepart that means, the name, id pair did not appear in that particular DataFrame!
I've used the part column to represent the source DataFrame, so you can assume that part is ALWAYS 1 in df1, ALWAYS 2 in df2 and ALWAYS 3 in df3.
So far, I have literally just done pd.concat([df1, df2, df3]), and I'm not sure if this approach is going to lead to a dead end.
Keep in mind that the original three DataFrames are 62245 rows × 4 columns EACH. And each DataFrame may or may not contain the name, id pair. If the name, id pair is present in EVEN 1 of the DataFrames, and not the others, I wanted that to be accounted for with a missing for the other DataFrames.
You can use pd.merge with how='outer':
# Change column names and remove 'part' column
df1 = df1.rename(columns={'price':'pricepart1'}).drop('part', axis=1)
df2 = df2.rename(columns={'price':'pricepart2'}).drop('part', axis=1)
df3 = df3.rename(columns={'price':'pricepart3'}).drop('part', axis=1)
# Merge dataframes
df = pd.merge(df1, df2, on=['name', 'id'], how='outer')
df = pd.merge(df, df3, on=['name', 'id'], how='outer')
# Fill na values with 'missing'
df = df.fillna('missing')
Out[]:
name id pricepart1 pricepart2 pricepart3
0 a 1 10 15 missing
1 a 2 20 20 missing
2 b 1 4 missing missing
3 c 1 2 2 missing
4 e 2 10 missing 20
5 d 1 missing missing 10
6 f 1 missing missing 15
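The chain of merges above generalises to any number of part DataFrames with functools.reduce; a sketch using the question's sample data:

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame([['a', 1, 10, 1], ['a', 2, 20, 1], ['b', 1, 4, 1],
                    ['c', 1, 2, 1], ['e', 2, 10, 1]],
                   columns=['name', 'id', 'price', 'part'])
df2 = pd.DataFrame([['a', 1, 15, 2], ['a', 2, 20, 2], ['c', 1, 2, 2]],
                   columns=['name', 'id', 'price', 'part'])
df3 = pd.DataFrame([['d', 1, 10, 3], ['e', 2, 20, 3], ['f', 1, 15, 3]],
                   columns=['name', 'id', 'price', 'part'])

# rename the price column per part and drop the now-redundant part column
frames = [df.rename(columns={'price': f'pricepart{i}'}).drop(columns='part')
          for i, df in enumerate([df1, df2, df3], start=1)]

# fold all frames together with successive outer merges on the composite key
result = reduce(lambda l, r: pd.merge(l, r, on=['name', 'id'], how='outer'), frames)
result = result.fillna('missing')
print(result)
```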
I usually use value_counts() to get the number of occurrences of a value. However, I am now dealing with large database tables (which cannot be loaded fully into RAM) and query the data in fractions of one month.
Is there a way to store the result of value_counts() and merge it with / add it to the next results?
I want to count the number of user actions. Assume the following structure of user-activity logs:
# month 1
id userId actionType
1 1 a
2 1 c
3 2 a
4 3 a
5 3 b
# month 2
id userId actionType
6 1 b
7 1 b
8 2 a
9 3 c
Using value_counts() on those produces:
# month 1
userId
1 2
2 1
3 2
# month 2
userId
1 2
2 1
3 1
Expected output:
# month 1+2
userId
1 4
2 2
3 3
Up until now, I have only found a method using groupby and sum:
# count each user's actions and store the count in a new column
df1['count'] = df1.groupby(['userId'], sort=False)['id'].transform('count')
# drop the unnecessary columns
df1 = df1[['userId', 'count']]
# drop the unnecessary rows
df1 = df1.drop_duplicates(subset=['userId'])
# repeat for the second frame
df2['count'] = df2.groupby(['userId'], sort=False)['id'].transform('count')
df2 = df2[['userId', 'count']]
df2 = df2.drop_duplicates(subset=['userId'])
# merge and sum up
print(pd.concat([df1, df2]).groupby(['userId'], sort=False).sum())
What is the pythonic / pandas' way of merging the information of several series' (and dataframes) efficiently?
Let me suggest "add" and specify a fill value of 0. This has an advantage over the previously suggested answer in that it will work when the two DataFrames have non-identical sets of unique keys.
# Create frames
df1 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'c', 'c', 'd'], 'a': [1, 1, 2, 3, 3, 5]})
df2 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
Now add the two sets of value_counts(). The fill_value argument will handle any NaN values that would arise; in this example, the 'd' that appears in df1 but not in df2.
a = df1.User_id.value_counts()
b = df2.User_id.value_counts()
a.add(b,fill_value=0)
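The same add-with-fill_value pattern accumulates counts across any number of monthly chunks without holding them all in memory at once. A sketch with hypothetical chunk data matching the expected output above:

```python
import pandas as pd

# hypothetical monthly chunks of the user-activity log
month1 = pd.DataFrame({'userId': [1, 1, 2, 3, 3]})
month2 = pd.DataFrame({'userId': [1, 1, 2, 3]})

total = pd.Series(dtype='int64')
for chunk in (month1, month2):
    # fill_value=0 keeps users that appear in only one chunk
    total = total.add(chunk['userId'].value_counts(), fill_value=0)

total = total.astype(int).sort_index()  # the add upcasts to float; cast back
print(total)
```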
You can sum the series generated by the value_counts method directly:
#create frames
df= pd.DataFrame({'User_id': ['a','a','b','c','c'],'a':[1,1,2,3,3]})
df1= pd.DataFrame({'User_id': ['a','a','b','b','c','c','c'],'a':[1,1,2,2,3,3,4]})
sum the series:
df.User_id.value_counts() + df1.User_id.value_counts()
output:
a 4
b 3
c 5
dtype: int64
This is known as "Split-Apply-Combine". It can be done in one line, using a lambda function as follows.
1️⃣ paste this into your code:
df['total_for_this_label'] = df.groupby('label', as_index=False)['label'].transform(lambda x: x.count())
2️⃣ replace the three occurrences of label with the name of the column whose values you are counting (case-sensitive)
3️⃣ print df.head() to check it worked correctly
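For reference, the lambda can be replaced with the built-in 'count' aggregation name, which avoids a Python-level call per group and is typically faster on large frames. A sketch with a toy frame (df and the label column are stand-ins for your own names):

```python
import pandas as pd

# toy frame standing in for your data
df = pd.DataFrame({'label': ['x', 'x', 'y', 'x', 'y']})

# built-in aggregation name avoids calling a Python lambda per group
df['total_for_this_label'] = df.groupby('label')['label'].transform('count')
print(df.head())
```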