Pandas aggregation function: Merge text rows, but insert spaces between them? - python

I managed to group rows in a dataframe, given one column (id).
The problem is that one column consists of parts of sentences, and when I add them together, the spaces are missing.
An example probably makes it easier to understand...
My dataframe looks something like this:
import pandas as pd
#create dataFrame
df = pd.DataFrame({'id': [101, 101, 102, 102, 102],
                   'text': ['The government changed', 'the legislation on import control.', 'Politics cannot solve all problems', 'but it should try to do its part.', 'That is the reason why these elections are important.'],
                   'date': [1990, 1990, 2005, 2005, 2005]})
id text date
0 101 The government changed 1990
1 101 the legislation on import control. 1990
2 102 Politics cannot solve all problems 2005
3 102 but it should try to do its part. 2005
4 102 That is the reason why these elections are imp... 2005
Then I used the aggregation function:
aggregation_functions = {'id': 'first','text': 'sum', 'date': 'first'}
df_new = df.groupby(df['id']).aggregate(aggregation_functions)
which returns:
id text date
0 101 The government changedthe legislation on import control. 1990
2 102 Politics cannot solve all problemsbut it should try to... 2005
So, for example I need a space in between ' The government changed' and 'the legislation...'. Is that possible?

If you need to put a space between the two phrases/rows, use str.join (the dict.fromkeys here also drops duplicate fragments while preserving order):
ujoin = lambda s: " ".join(dict.fromkeys(s.astype(str)))

out = df.groupby(["id", "date"], as_index=False).agg(**{"text": ("text", ujoin)})[df.columns]
# Output:
print(out.to_string())
id text date
0 101 The government changed the legislation on import control. 1990
1 102 Politics cannot solve all problems but it should try to do its part. That is the reason why these elections are important. 2005
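If duplicate fragments don't need to be removed, a plain ' '.join works as the aggregation function too; a minimal sketch of the same idea:

```python
import pandas as pd

df = pd.DataFrame({'id': [101, 101, 102],
                   'text': ['The government changed',
                            'the legislation on import control.',
                            'Politics cannot solve all problems'],
                   'date': [1990, 1990, 2005]})

# join each group's text fragments with a single space
out = df.groupby(['id', 'date'], as_index=False).agg({'text': ' '.join})
print(out)
```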

Related

How to create a dataframe with aggregated categories?

I have a pandas dataframe (df) with the following fields:
id  name    category
01  Eddie   magician
01  Eddie   plumber
02  Martha  actress
03  Jeremy  dancer
03  Jeremy  actor
I want to create a dataframe (df2) like the following:
id  name    categories
01  Eddie   magician, plumber
02  Martha  actress
03  Jeremy  dancer, actor
So, first of all, I create df2 and add an additional column with the following commands:
df2 = df.groupby("id", as_index= False).count()
df2["categories"] = str()
(this also counts the occurrences of various categories, which is something useful for what I intend to do)
Then, I use this loop:
for i in df2.itertuples():
    for entries in df.itertuples():
        if i.id == entries.id:
            df2["categories"].iloc[i.Index] += entries.category
        else:
            pass
Using this code, I get the dataframe that I wanted. However, this implementation has several problems:
It doesn't look optimal.
If there are repeated entries (such as another row with "Eddie" and "magician"), the entry for Eddie in df2 would have "magician, plumber, magician" in categories.
Therefore I would like to ask the community: is there a better way to do this?
Also keep in mind that this is my first question on this website!
Thanks in advance!
You can groupby your id and name columns and apply a function to the category one like this:
import pandas as pd
data = {
    'id': ['01', '01', '02', '03', '03'],
    'name': ['Eddie', 'Eddie', 'Martha', 'Jeremy', 'Jeremy'],
    'category': ['magician', 'plumber', 'actress', 'dancer', 'actor']
}
df = pd.DataFrame(data)
df2 = df.groupby(['id', 'name'])['category'].apply(lambda x: ', '.join(x)).reset_index()
df2
Output:
id name category
0 01 Eddie magician, plumber
1 02 Martha actress
2 03 Jeremy dancer, actor
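If repeated rows are possible (the asker's second concern), deduplicating inside the join, for example with dict.fromkeys, keeps first-seen order while dropping repeats; a sketch assuming a duplicated 'magician' row:

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['01', '01', '01', '02'],
    'name': ['Eddie', 'Eddie', 'Eddie', 'Martha'],
    'category': ['magician', 'plumber', 'magician', 'actress']  # 'magician' repeated
})

# dict.fromkeys keeps the first occurrence of each value, in order
df2 = (df.groupby(['id', 'name'])['category']
         .apply(lambda x: ', '.join(dict.fromkeys(x)))
         .reset_index())
print(df2)
```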

Pandas - Concatenate rows that are truncated

I found a database online that contains, for a series of anonymous users, their degrees and the inverse sequence in which they completed them (last degree first).
For each user, I have:
Their UserID
The inverse sequence
The degree title
Basically my dataframe looks like this:
User_ID  Sequence  Degree
123      1         MSc in Civil
123      1         Engineering
123      2         BSc in Engineering
As you can see, my issue is that at times degree titles are truncated and split into two separate rows (User 123 has a MSc in Civil Engineering - notice the same value in sequence).
Ideally, my dataframe should look like this:
User_ID  Sequence  Degree
123      1         MSc in Civil Engineering
123      2         BSc in Engineering
I was wondering if anyone could help me out. I will be happy to provide any more insight that may be needed for assistance.
Thanks in advance!
Try with groupby aggregate:
df.groupby(['User_ID', 'Sequence'], as_index=False).aggregate(' '.join)
User_ID Sequence Degree
0 123 1 MSc in Civil Engineering
1 123 2 BSc in Engineering
Complete Working Example:
import pandas as pd
df = pd.DataFrame({
    'User_ID': [123, 123, 123],
    'Sequence': [1, 1, 2],
    'Degree': ['MSc in Civil', 'Engineering', 'BSc in Engineering']
})
df = df.groupby(['User_ID', 'Sequence'], as_index=False).aggregate(' '.join)
print(df)
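Note that aggregate(' '.join) is applied to every non-grouping column, so it would fail if the frame ever gains a non-string column; restricting the join to the text column sidesteps that. A sketch with a hypothetical extra Year column:

```python
import pandas as pd

df = pd.DataFrame({
    'User_ID': [123, 123, 123],
    'Sequence': [1, 1, 2],
    'Degree': ['MSc in Civil', 'Engineering', 'BSc in Engineering'],
    'Year': [2020, 2020, 2015]  # hypothetical non-string column
})

# join only the Degree column; ' '.join on the int Year column would raise
out = df.groupby(['User_ID', 'Sequence'], as_index=False).agg({'Degree': ' '.join})
print(out)
```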

Split a column in Python pandas

I'm sorry if I can't explain the issue properly, since I don't really understand it that much myself. I'm starting to learn Python, and to practice I try to redo projects from my day-to-day job in Python. Right now I'm stuck on a project and would like some help or guidance. I have a dataframe that looks like this:
Index Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
--------------------------------------------
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
(I apologize for not creating a proper table in this post, since the separator of the IDs is a |.) But you get the idea: every person has four IDs, and they are all in the same "cell" of the dataframe, each ID separated from its value by a pipe. I need to split those IDs from their values and put them in separate columns, so I get something like this:
index  Country  Name  PERSID  SSO      STARTDATE  WAVE
0      USA      John  12345   John123  20210101   WAVE39
1      UK       Jane  25478   Jane123  20210101   WAVE40
Now, adding to the complexity of the table itself, there are other issues: for example, the order of the IDs won't be the same for everyone, and some rows will be missing some of the IDs.
I honestly have no idea where to begin. The first thing I thought of was to split the IDs column by spaces, then split the result by pipes to create a dictionary, convert it to a dataframe, and join it to my original dataframe using the index.
But as I said, my knowledge of Python is quite limited, so that failed catastrophically. I only got to the first step of that plan with Client_ids = df.IDs.str.split(), which returns a series with the IDs separated from each other, like ['PERSID|12345', 'SSO|John123', 'STARTDATE|20210101', 'WAVE|Wave39'], but I can't find a way to split them again because I keep getting an error saying that the list object doesn't have the attribute 'split'.
How should I approach this? what alternatives do I have to do it?
Thank you in advance for any help or recommendation
You have a few options for doing this. Here's how I would do it.
I split the values in IDs on \n and |, build a key:value dictionary from each alternating key/value pair, then join the result back to the dataframe and drop the IDs and temp columns.
import pandas as pd
df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])
df['temp'] = df['IDs'].str.split('\n|\|').apply(lambda x: {k:v for k,v in zip(x[::2],x[1::2])})
df = df.join(pd.DataFrame(df['temp'].values.tolist(), df.index))
df = df.drop(columns=['IDs','temp'],axis=1)
print (df)
With this approach, it does not matter if a row of data is missing. It will sort itself out.
The output of this will be:
Original DataFrame:
Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
2 CA Jill PERSID|12345
STARTDATE|20210201
WAVE|WAVE41
Updated DataFrame:
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Note that Jill did not have a SSO value. It set the value to NaN by default.
First generate your dataframe
df1 = pd.DataFrame([["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """
PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""]], columns=['Country', 'Name', 'IDs'])
Then split the last cell using lambda
df2 = pd.DataFrame(list(df1.apply(lambda r: {p: q for p, q in [x.split("|") for x in r.IDs.split()]}, axis=1).values))
Lastly concat the dataframes together.
df = pd.concat([df1, df2], axis=1)
Quick solution
remove_word = ["PERSID", "SSO", "STARTDATE", "WAVE"]
for i, col in enumerate(remove_word):
    df[col] = df.IDs.str.replace('|'.join(remove_word), '', regex=True).str.split("|").str[i+1]
Use regex named capture groups with pd.String.str.extract
def ng(x):
    return f'(?:{x}\|(?P<{x}>[^\n]+))?\n?'
fields = ['PERSID', 'SSO', 'STARTDATE', 'WAVE']
pat = ''.join(map(ng, fields))
df.drop('IDs', axis=1).join(df['IDs'].str.extract(pat))
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Setup
Credit to @JoeFerndz for the sample df.
NOTE: this sample has missing values in some 'IDs'.
df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])
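As another option (not from the answers above), str.extractall with a key/value regex can build the same wide table, assuming every entry follows the KEY|value pattern:

```python
import pandas as pd

df = pd.DataFrame([
    ["USA", "John", "PERSID|12345\nSSO|John123\nSTARTDATE|20210101\nWAVE|WAVE39"],
    ["CA", "Jill", "PERSID|12345\nSTARTDATE|20210201\nWAVE|WAVE41"]],
    columns=['Country', 'Name', 'IDs'])

# capture every key|value pair, then pivot the keys into columns
pairs = df['IDs'].str.extractall(r'(?P<key>\w+)\|(?P<val>[^\n]+)')
wide = (pairs.reset_index(level='match', drop=True)
             .set_index('key', append=True)['val']
             .unstack())
out = df.drop(columns='IDs').join(wide)
print(out)
```

Rows with missing keys (Jill has no SSO) come out as NaN automatically.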

Update a column value based on filter python

I have two datasets say df1 and df:
df1
df1 = pd.DataFrame({'ids': [101,102,103],'vals': ['apple','java','python']})
ids vals
0 101 apple
1 102 java
2 103 python
df
df = pd.DataFrame({'TEXT_DATA': [u'apple a day keeps doctor away', u'apple tree in my farm', u'python is not new language', u'Learn python programming', u'java is second language']})
TEXT_DATA
0 apple a day keeps doctor away
1 apple tree in my farm
2 python is not new language
3 Learn python programming
4 java is second language
What I want to do is update the column values based on the filtered data and map the matched data to a new column, so that my output is:
TEXT_DATA NEW_COLUMN
0 apple a day keeps doctor away 101
1 apple tree in my farm 101
2 python is not new language 103
3 Learn python programming 103
4 java is second language 102
I tried matching using
df[df['TEXT_DATA'].str.contains("apple")]
Is there any way I can do this?
You could do something like this:
my_words = {'python': 103, 'apple': 101, 'java': 102}
for word in my_words.keys():
    df.loc[df['TEXT_DATA'].str.contains(word, na=False), 'NEW_COLUMN'] = my_words[word]
First, you need to extract the values in df1['vals']. Then, create a new column and add the extraction result to the new column. And finally, merge both dataframes.
extr = '|'.join(x for x in df1['vals'])
df['vals'] = df['TEXT_DATA'].str.extract('('+ extr + ')', expand=False)
newdf = pd.merge(df, df1, on='vals', how='left')
To select the fields in the result, type the column name in the header section:
newdf[['TEXT_DATA','ids']]
You could use a cartesian product of both dataframes and then select the relevant rows and columns.
tmp = df.assign(key=1).merge(df1.assign(key=1), on='key').drop(columns='key')
resul = (tmp.loc[tmp.apply(lambda x: x.vals in x.TEXT_DATA, axis=1)]
            .drop(columns='vals').reset_index(drop=True))
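On pandas 1.2+, merge(how='cross') replaces the dummy-key trick; a self-contained sketch of the same cartesian-product approach:

```python
import pandas as pd

df1 = pd.DataFrame({'ids': [101, 102, 103], 'vals': ['apple', 'java', 'python']})
df = pd.DataFrame({'TEXT_DATA': ['apple a day keeps doctor away',
                                 'python is not new language',
                                 'java is second language']})

# cross join every text row with every keyword, keep rows where the keyword occurs
tmp = df.merge(df1, how='cross')
result = (tmp[tmp.apply(lambda x: x['vals'] in x['TEXT_DATA'], axis=1)]
          .drop(columns='vals').reset_index(drop=True))
print(result)
```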

Printing yearwise popular movies from csv in Python

I have a collection of movie data in an Excel format. It has columns with year, title, and popularity. My goal is to create a dataframe with yearwise movies with top popularity. For now I am able to create only the year and the popularity rating. I want to add the movie title too.
df = pd.DataFrame(data)
xd = data.groupby(['release_year']).max()['popularity']
xf = pd.DataFrame(xd)
xd.head(100)
Output:
1960 2.610362
1961 2.631987
1962 3.170651
I also want the movie name along with this.
You just need to build a boolean mask with transform and use it to index the dataframe.
Let's say this is your data:
release_year, popularity, movie
1999, 5, a
1999, 4, c
2000, 3, b
2000, 4, d
Do the following:
import pandas as pd
data= pd.read_csv('data.csv')
idx = data.groupby(['release_year'])['popularity'].transform(max) == data['popularity']
The result of data[idx] will be:
release_year  popularity  movie
1999          5           a
2000          4           d
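Putting it together, the mask idx selects the most popular row(s) per year; a runnable sketch built from the sample data above:

```python
import pandas as pd

data = pd.DataFrame({'release_year': [1999, 1999, 2000, 2000],
                     'popularity': [5, 4, 3, 4],
                     'movie': ['a', 'c', 'b', 'd']})

# mark rows whose popularity equals their year's maximum, then select them
idx = data.groupby('release_year')['popularity'].transform('max') == data['popularity']
top = data[idx].reset_index(drop=True)
print(top)
```

Note that transform keeps the original row count, which is what lets the comparison produce a row-aligned mask (ties would yield more than one row per year).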
