Missing columns when trying to groupby aggregate multiple rows in pandas - python

I have a dataframe with relevant info, and I want to group by one column, say id, with the values of the other columns for each id joined by "|". However, when I run my code, most of my columns end up missing from the output (only the first 3 appear), and I don't know what is going wrong.
My code is:
df = df.groupby('id').agg(lambda col: '|'.join(set(col))).reset_index()
For instance, my data starts like
   id  words    ... (other columns here)
0   a  asd
1   a  rtr
2   b  s
3   c  rrtttt
4   c  dsfd
and I want
id  words         ... (other columns here)
a   asd|rtr
b   s
c   rrtttt|dsfd
but also with all the rest of my columns grouped similarly. Right now the rest of my columns just don't appear in my output dataset. Not sure what is going wrong. Thanks!

The columns most likely disappear because '|'.join fails on non-string values, and older pandas versions silently drop columns whose aggregation raises an error. Convert everything to string beforehand; you can then avoid the lambda by using agg(set) followed by applymap:
df.astype(str).groupby('id').agg(set).applymap('|'.join)
Minimal Verifiable Example
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'a', 'b', 'c', 'c'],
    'numbers': [1, 2, 2, 3, 3],
    'words': ['asd', 'rtr', 's', 'rrtttt', 'dsfd']})
df
  id  numbers   words
0  a        1     asd
1  a        2     rtr
2  b        2       s
3  c        3  rrtttt
4  c        3    dsfd
df.astype(str).groupby('id').agg(set).applymap('|'.join)
   numbers        words
id
a      1|2      asd|rtr
b        2            s
c        3  rrtttt|dsfd
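Note that on recent pandas versions (2.1+), DataFrame.applymap is deprecated in favor of DataFrame.map; an equivalent sketch under that assumption is:
# Same idea as above, but DataFrame.map replaces the deprecated applymap (pandas >= 2.1).
out = df.astype(str).groupby('id').agg(set).map('|'.join)
print(out)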

Related

How to concatenate a pandas column by a partition?

I have a pandas data frame like this:
df = pd.DataFrame({"Id": [1, 1, 1, 2, 2, 2, 2],
"Letter": ['A', 'B', 'C', 'A', 'D', 'B', 'C']})
How can I efficiently add a new column, "Merge", that concatenates all the values of the "Letter" column within each "Id", so the final data frame looks like the output shown below?
You can group by the Id column and then use transform:
df['Merge'] = df.groupby('Id').transform(lambda x: '-'.join(x))
print(df)
Id Letter Merge
0 1 A A-B-C
1 1 B A-B-C
2 1 C A-B-C
3 2 A A-D-B-C
4 2 D A-D-B-C
5 2 B A-D-B-C
6 2 C A-D-B-C
Thanks to sammywemmy for pointing out that the lambda is needless here, so you can use a simpler form:
df['Merge'] = df.groupby('Id').transform('-'.join)
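If the frame had more columns than just Letter, a small variation (not part of the original answer) is to select the column explicitly so the transform is scoped to it:
# Hypothetical variation: scope the transform to the 'Letter' column only.
df['Merge'] = df.groupby('Id')['Letter'].transform('-'.join)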

Joining two pandas dataframes by latest date

I have two dataframes: d (containing date1, name) and d1 (containing date2, name, rank). I need to join these two on name such that each row in the first dataframe is assigned the latest rank as of date1,
i.e. d.name = d1.name and d1.date2 is the latest date on or before d.date1.
What is the easiest way of doing this?
import pandas as pd
d = pd.DataFrame({'date': ['20070105', '20130105', '20150102', '20170106', '20190106'],
                  'name': ['a', 'b', 'a', 'b', 'a']})
d
date name
0 20070105 a
1 20130105 b
2 20150102 a
3 20170106 b
4 20190106 a
d1 = pd.DataFrame({'date': ['20140105', '20160105', '20180103', '20190106'],
                   'rank': [1, 2, 2, 1],
                   'name': ['a', 'b', 'a', 'b']})
d1
date name rank
0 20140105 a 1
1 20160105 b 2
2 20180103 a 2
3 20190106 b 1
I'm expecting 'rank' to be added to 'd' and have output like this:
       date name  rank
0  20070105    a   NaN
1  20130105    b   NaN
2  20150102    a     1
3  20170106    b     2
4  20190106    a     2
I assume this is what you require.
Sort your second dataframe in ascending order by date, then drop_duplicates with keep='last', and finally pd.merge the first dataframe with the processed second dataframe.
df2 = df2.sort_values(by='date')
temp = df2.drop_duplicates(subset=['name'], keep='last')
print(pd.merge(df1, temp, on=['name'], how='left'))
Note: as you did not post sample input and output, I assumed column and variable names as above. For an exact result, provide sample input and output.
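As an alternative (not from the original answer), pd.merge_asof can match each row of d to the most recent d1 row on or before its date, per name, which reproduces the expected output above. A minimal sketch, assuming the frames d and d1 from the question:
import pandas as pd

d = pd.DataFrame({'date': ['20070105', '20130105', '20150102', '20170106', '20190106'],
                  'name': ['a', 'b', 'a', 'b', 'a']})
d1 = pd.DataFrame({'date': ['20140105', '20160105', '20180103', '20190106'],
                   'rank': [1, 2, 2, 1],
                   'name': ['a', 'b', 'a', 'b']})

# merge_asof needs real datetimes and both frames sorted on the merge key.
d['date'] = pd.to_datetime(d['date'], format='%Y%m%d')
d1['date'] = pd.to_datetime(d1['date'], format='%Y%m%d')
d = d.sort_values('date')
d1 = d1.sort_values('date')

# For each row of d, take the rank from the latest d1 row with date <= d.date,
# matching names via the `by` argument (direction='backward' is the default).
print(pd.merge_asof(d, d1, on='date', by='name', direction='backward'))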

loop through a single column in one dataframe compare to a column in another dataframe create new column in first dataframe using pandas

Right now I have two dataframes; they look like:
import pandas as pd

c = pd.DataFrame({'my_goal': [3, 4, 5, 6, 7],
                  'low_number': [0, 100, 1000, 2000, 3000],
                  'high_number': [100, 1000, 2000, 3000, 4000]})
and
a = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e'],
                  'Number': [50, 500, 1030, 2005, 3575]})
What I want to do is: if 'Number' falls between the low number and the high number, bring back the value in 'my_goal'. For example, if we look at row 'a', its 'Number' is 50, which falls between 0 and 100, so I want it to bring back 3. I also want to create a dataframe that contains all the columns from dataframe a plus the 'my_goal' column from dataframe c, as in the outputs shown in the answers below.
I tried making my high and low numbers into a separate list and running a for loop from that, but all that gives me are 'my_goal' numbers:
low_number = [0, 100, 1000, 2000, 3000]
for i in a:
    if float(i) >= low_number:
        a = c['my_goal']
        print(a)
You can use pd.cut; when I see ranges, I first think of pd.cut:
dfa = pd.DataFrame(a)
dfc = pd.DataFrame(c)
dfa['my_goal'] = pd.cut(dfa['Number'],
                        bins=[0] + dfc['high_number'].tolist(),
                        labels=dfc['my_goal'])
Output:
   a  Number my_goal
0  a      50       3
1  b     500       4
2  c    1030       5
3  d    2005       6
4  e    3575       7
I changed row 4 slightly to include a test case where the condition is not met. You can concat a with rows of c where the condition is true.
a = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e'], 'Number': [50, 500, 1030, 1995, 3575]})
cond = a.Number.between(c.low_number, c.high_number)
pd.concat([a, c.loc[cond, ['my_goal']]], axis=1, join='inner')
Number a my_goal
0 50 a 3
1 500 b 4
2 1030 c 5
4 3575 e 7
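Another option (not from the original answers) is to build a pd.IntervalIndex from the low/high bounds and look each Number up in it, which makes the boundary behaviour explicit. A sketch, assuming the frames a and c from the question:
import pandas as pd

c = pd.DataFrame({'my_goal': [3, 4, 5, 6, 7],
                  'low_number': [0, 100, 1000, 2000, 3000],
                  'high_number': [100, 1000, 2000, 3000, 4000]})
a = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e'],
                  'Number': [50, 500, 1030, 2005, 3575]})

# One interval per row of c; closed='right' means (low, high].
intervals = pd.IntervalIndex.from_arrays(c['low_number'], c['high_number'], closed='right')
pos = intervals.get_indexer(a['Number'])      # -1 where no interval contains the value

a['my_goal'] = c['my_goal'].to_numpy()[pos]
a.loc[pos == -1, 'my_goal'] = pd.NA           # mark numbers that fall outside every range
print(a)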

How to create a square dataframe/matrix given 3 columns - Python

I am struggling to figure out how to develop a square matrix given a format like
a a 0
a b 3
a c 4
a d 12
b a 3
b b 0
b c 2
...
To something like:
    a    b   c    d   e
a   0    3   4   12  ...
b   3    0   2    7  ...
c   4    3   0  ...
d  12  ...
e  ...
in pandas. I developed a method which I think works, but it takes forever to run because it iterates through every column and row for each value, starting from the beginning each time using for loops. I feel like I'm definitely reinventing the wheel here, and this also isn't realistic for my dataset given how many columns and rows there are. Is there something similar to R's cast function in Python which can do this significantly faster?
You could use df.pivot:
import pandas as pd
df = pd.DataFrame([['a', 'a', 0],
                   ['a', 'b', 3],
                   ['a', 'c', 4],
                   ['a', 'd', 12],
                   ['b', 'a', 3],
                   ['b', 'b', 0],
                   ['b', 'c', 2]], columns=['X', 'Y', 'Z'])
print(df.pivot(index='X', columns='Y', values='Z'))
yields
Y a b c d
X
a 0.0 3.0 4.0 12.0
b 3.0 0.0 2.0 NaN
Here, index='X' tells df.pivot to use the column labeled 'X' as the index, and columns='Y' tells it to use the column labeled 'Y' as the column index.
See the docs for more on pivot and other reshaping methods.
Alternatively, you could use pd.crosstab:
print(pd.crosstab(index=df.iloc[:, 0], columns=df.iloc[:, 1],
                  values=df.iloc[:, 2], aggfunc='sum'))
Unlike df.pivot, which expects each (X, Y) pair to be unique, pd.crosstab (with aggfunc='sum') will aggregate duplicate pairs by summing the associated values. Although there are no duplicate pairs in your posted example, specifying how duplicates should be aggregated is required when the values parameter is used.
Also, whereas df.pivot is passed column labels, pd.crosstab is passed
array-likes (such as whole columns of df). df.iloc[:, i] is the ith column
of df.
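Since the question asks for a square matrix, note that the pivoted result above is only as wide as the labels that actually appear in each column. A sketch (not from the original answer) that reindexes to the full label set and fills missing (y, x) cells from (x, y), assuming the distances are symmetric:
# Continue from the pivoted frame above.
m = df.pivot(index='X', columns='Y', values='Z')

labels = sorted(set(m.index) | set(m.columns))      # full, shared label set
m = m.reindex(index=labels, columns=labels)
m = m.combine_first(m.T)                            # fill (y, x) from (x, y) where missing
# diagonal cells with no data remain NaN
print(m)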

Merging and sum up several value-counts series in Pandas

I usually use value_counts() to get the number of occurrences of a value. However, I deal now with large database-tables (cannot load it fully into RAM) and query the data in fractions of 1 month.
Is there a way to store the result of value_counts() and merge it with / add it to the next results?
I want to count the number of user actions. Assume the following structure of
user-activity logs:
# month 1
id userId actionType
1 1 a
2 1 c
3 2 a
4 3 a
5 3 b
# month 2
id userId actionType
6 1 b
7 1 b
8 2 a
9 3 c
Using value_counts() on those produces:
# month 1
userId
1 2
2 1
3 2
# month 2
userId
1 2
2 1
3 1
Expected output:
# month 1+2
userId
1 4
2 2
3 3
Up until now, I have only found a method using groupby and sum:
# count users actions and remember them in new column
df1['count'] = df1.groupby(['userId'], sort=False)['id'].transform('count')
# delete not necessary columns
df1 = df1[['userId', 'count']]
# delete not necessary rows
df1 = df1.drop_duplicates(subset=['userId'])
# repeat
df2['count'] = df2.groupby(['userId'], sort=False)['id'].transform('count')
df2 = df2[['userId', 'count']]
df2 = df2.drop_duplicates(subset=['userId'])
# merge and sum up
print(pd.concat([df1, df2]).groupby(['userId'], sort=False).sum())
What is the pythonic / pandas' way of merging the information of several series' (and dataframes) efficiently?
Let me suggest "add" and specify a fill value of 0. This has an advantage over the previously suggested answer in that it will work when the two Dataframes have non-identical sets of unique keys.
# Create frames
df1 = pd.DataFrame(
    {'User_id': ['a', 'a', 'b', 'c', 'c', 'd'], 'a': [1, 1, 2, 3, 3, 5]})
df2 = pd.DataFrame(
    {'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
Now add the two sets of value_counts(). The fill_value argument will handle any NaN values that would arise; in this example, the 'd' that appears in df1 but not in df2.
a = df1.User_id.value_counts()
b = df2.User_id.value_counts()
a.add(b, fill_value=0)
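To extend this to many monthly fractions, the same add-with-fill can be folded over a list of partial counts. A sketch under that assumption, with hypothetical per-month frames standing in for the monthly query results:
from functools import reduce
import pandas as pd

# Hypothetical per-month frames (month 1 and month 2 from the question).
monthly_frames = [
    pd.DataFrame({'id': [1, 2, 3, 4, 5], 'userId': [1, 1, 2, 3, 3]}),
    pd.DataFrame({'id': [6, 7, 8, 9], 'userId': [1, 1, 2, 3]}),
]

# One value_counts() per fraction, then fold them together with fill_value=0.
counts = [frame['userId'].value_counts() for frame in monthly_frames]
total = reduce(lambda s1, s2: s1.add(s2, fill_value=0), counts)
print(total.astype(int))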
You can sum the series generated by the value_counts method directly:
#create frames
df = pd.DataFrame({'User_id': ['a', 'a', 'b', 'c', 'c'], 'a': [1, 1, 2, 3, 3]})
df1 = pd.DataFrame({'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
sum the series:
df.User_id.value_counts() + df1.User_id.value_counts()
output:
a 4
b 3
c 5
dtype: int64
This is known as "Split-Apply-Combine". It can be done in one line, using a lambda function as follows.
1️⃣ Paste this into your code:
df['total_for_this_label'] = df.groupby('label', as_index=False)['label'].transform(lambda x: x.count())
2️⃣ Replace the three occurrences of label with the name of the column whose values you are counting (case-sensitive).
3️⃣ Print df.head() to check it has worked correctly.
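For the question's log data, a worked instance of this pattern (column name userId, with the month-1 frame from the question reconstructed here as an assumption) might look like:
import pandas as pd

# Month-1 log from the question.
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'userId': [1, 1, 2, 3, 3],
                   'actionType': ['a', 'c', 'a', 'a', 'b']})

# Per-row count of how many times each userId appears in this fraction.
df['total_for_this_label'] = df.groupby('userId')['userId'].transform(lambda x: x.count())
print(df.head())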
