Pandas - Concatenate rows that are truncated - python

I found a database online that contains, for a series of anonymous users, their degrees and the inverse sequence in which they completed them (last degree first).
For each user, I have:
Their UserID
The inverse sequence
The degree title
Basically my dataframe looks like this:
User_ID  Sequence  Degree
123      1         MSc in Civil
123      1         Engineering
123      2         BSc in Engineering
As you can see, my issue is that at times degree titles are truncated and split into two separate rows (User 123 has an MSc in Civil Engineering - notice the same Sequence value).
Ideally, my dataframe should look like this:
User_ID  Sequence  Degree
123      1         MSc in Civil Engineering
123      2         BSc in Engineering
I was wondering if anyone could help me out. I will be happy to provide any more insight that may be needed for assistance.
Thanks in advance!

Try with groupby aggregate:
df.groupby(['User_ID', 'Sequence'], as_index=False).aggregate(' '.join)
   User_ID  Sequence                    Degree
0      123         1  MSc in Civil Engineering
1      123         2        BSc in Engineering
Complete Working Example:
import pandas as pd
df = pd.DataFrame({
    'User_ID': [123, 123, 123],
    'Sequence': [1, 1, 2],
    'Degree': ['MSc in Civil', 'Engineering', 'BSc in Engineering']
})
df = df.groupby(['User_ID', 'Sequence'], as_index=False).aggregate(' '.join)
print(df)
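Note that ' '.join concatenates fragments in their current row order, so the merged title is only correct if the first fragment comes first. If the scraped rows might arrive shuffled, it is worth restoring the original order beforehand; a minimal sketch, assuming the scrape order is encoded in the index:
df = df.sort_index()  # fragments then join left-to-right in row order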

Related

Pandas aggregation function: Merge text rows, but insert spaces between them?

I managed to group rows in a dataframe, given one column (id).
The problem is that one column consists of parts of sentences, and when I add them together, the spaces are missing.
An example probably makes it easier to understand...
My dataframe looks something like this:
import pandas as pd

# create DataFrame
df = pd.DataFrame({'id': [101, 101, 102, 102, 102],
                   'text': ['The government changed', 'the legislation on import control.',
                            'Politics cannot solve all problems', 'but it should try to do its part.',
                            'That is the reason why these elections are important.'],
                   'date': [1990, 1990, 2005, 2005, 2005]})
id text date
0 101 The government changed 1990
1 101 the legislation on import control. 1990
2 102 Politics cannot solve all problems 2005
3 102 but it should try to do its part. 2005
4 102 That is the reason why these elections are imp... 2005
Then I used the aggregation function:
aggregation_functions = {'id': 'first', 'text': 'sum', 'date': 'first'}
df_new = df.groupby(df['id']).aggregate(aggregation_functions)
which returns:
id text date
0 101 The government changedthe legislation on import control. 1990
2 102 Politics cannot solve all problemsbut it should try to... 2005
So, for example, I need a space between 'The government changed' and 'the legislation...'. Is that possible?
If you need to put a space between the two phrases/rows, use str.join:
# dict.fromkeys keeps the first occurrence of each fragment, so this also
# drops duplicate fragments while preserving their order
ujoin = lambda s: " ".join(dict.fromkeys(s.astype(str)))

out = df.groupby(["id", "date"], as_index=False).agg(**{"text": ("text", ujoin)})[df.columns]
# Output:
print(out.to_string())
id text date
0 101 The government changed the legislation on import control. 1990
1 102 Politics cannot solve all problems but it should try to do its part. That is the reason why these elections are important. 2005
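If duplicate fragments are not a concern, a plain ' '.join aggregation would do the same job; a minimal sketch:
out = df.groupby(["id", "date"], as_index=False).agg({"text": " ".join})[df.columns]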

Split a column in Python pandas

I'm sorry if I can't explain the issue properly, since I don't fully understand it myself. I'm starting to learn Python, and to practice I try to redo projects I face in my day-to-day job using Python. Right now I'm stuck on a project and would like some help or guidance. I have a dataframe that looks like this:
Index Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
--------------------------------------------
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
(I apologize for not creating a proper table in this post, since the separator of the IDs is a |.) But you get the idea: every person has 4 IDs, all in the same "cell" of the dataframe, each ID separated from its value by a pipe. I need to split those IDs from their values and put them in separate columns, so I get something like this:
index  Country  Name  PERSID  SSO      STARTDATE  WAVE
0      USA      John  12345   John123  20210101   WAVE39
1      UK       Jane  25478   Jane123  20210101   WAVE40
Now, adding to the complexity of the table itself, I have another issue: the order of the IDs won't be the same for everyone, and some rows will be missing some of the IDs.
I honestly have no idea where to begin. The first thing I thought of trying was to split the IDs column by spaces and then split the result of that by pipes to create a dictionary, convert it to a dataframe, and then join it to my original dataframe using the index.
But as I said, my knowledge of Python is quite pathetic, so that failed catastrophically. I only got to the first step of that plan with Client_ids = df.IDs.str.split(), which returns a series with the IDs separated from one another, like ['PERSID|12345', 'SSO|John123', 'STARTDATE|20210101', 'WAVE|WAVE39'], but I can't find a way to split it again because I keep getting an error saying that the list object doesn't have the attribute 'split'.
How should I approach this? What alternatives do I have?
Thank you in advance for any help or recommendation
You have a few options to consider to do this. Here's how I would do it.
I will split the values in IDs by \n and |, build a dictionary from each alternating key/value pair, then join it back to the dataframe and drop the IDs and temp columns.
import pandas as pd

df = pd.DataFrame([
    ["USA", "John", """PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
    ["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
    ["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])

# split on newlines and pipes, then pair up alternating keys and values
df['temp'] = df['IDs'].str.split(r'\n|\|').apply(lambda x: {k: v for k, v in zip(x[::2], x[1::2])})
df = df.join(pd.DataFrame(df['temp'].values.tolist(), df.index))
df = df.drop(columns=['IDs', 'temp'])
print(df)
With this approach, it does not matter if a row is missing some of the IDs; it will sort itself out.
The output of this will be:
Original DataFrame:
Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
2 CA Jill PERSID|12345
STARTDATE|20210201
WAVE|WAVE41
Updated DataFrame:
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Note that Jill did not have a SSO value. It set the value to NaN by default.
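If empty strings are preferred over NaN in the final frame, a fillna afterwards would do it; a minimal sketch:
df = df.fillna('')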
First, generate your dataframe:
df1 = pd.DataFrame([
    ["USA", "John", """PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
    ["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""]], columns=['Country', 'Name', 'IDs'])
Then split the IDs cell of each row using a lambda:
df2 = pd.DataFrame(list(df1.apply(lambda r: {p: q for p, q in [x.split("|") for x in r.IDs.split()]}, axis=1).values))
Lastly, concat the dataframes together:
df = pd.concat([df1, df2], axis=1)
Quick solution
remove_word = ["PERSID", "SSO", "STARTDATE", "WAVE"]
for i, col in enumerate(remove_word):
    df[col] = df.IDs.str.replace('|'.join(remove_word), '', regex=True).str.split("|").str[i+1]
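Note that this relies on every row having all four IDs present and in the same order, which the question says is not guaranteed. A more order-robust sketch of my own (not part of the answer above), assuming keys and values never contain a pipe: parse the key|value pairs with extractall, then pivot them into columns.
pairs = df['IDs'].str.extractall(r'(?P<key>[^|\n]+)\|(?P<val>[^\n]+)')
wide = pairs.reset_index().pivot(index='level_0', columns='key', values='val')
out = df.drop(columns='IDs').join(wide)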
Use regex named capture groups with pandas.Series.str.extract:
def ng(x):
    # optional named group: matches "FIELD|value" when the field is present
    return rf'(?:{x}\|(?P<{x}>[^\n]+))?\n?'

fields = ['PERSID', 'SSO', 'STARTDATE', 'WAVE']
pat = ''.join(map(ng, fields))
df.drop('IDs', axis=1).join(df['IDs'].str.extract(pat))
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Setup
Credit to @JoeFerndz for the sample df.
NOTE: this sample has missing values in some 'IDs'.
df = pd.DataFrame([
    ["USA", "John", """PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
    ["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
    ["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])

Update a column value based on filter python

I have two datasets, say df1 and df:
df1
df1 = pd.DataFrame({'ids': [101,102,103],'vals': ['apple','java','python']})
ids vals
0 101 apple
1 102 java
2 103 python
df
df = pd.DataFrame({'TEXT_DATA': [u'apple a day keeps doctor away', u'apple tree in my farm', u'python is not new language', u'Learn python programming', u'java is second language']})
TEXT_DATA
0 apple a day keeps doctor away
1 apple tree in my farm
2 python is not new language
3 Learn python programming
4 java is second language
What I want to do is update the column values based on the filtered data and map each match to a new column, such that my output is:
TEXT_DATA NEW_COLUMN
0 apple a day keeps doctor away 101
1 apple tree in my farm 101
2 python is not new language 103
3 Learn python programming 103
4 java is second language 102
I tried matching using
df[df['TEXT_DATA'].str.contains("apple")]
Is there any way I can do this?
You could do something like this:
# 'my_column' and 'my_second_column' are placeholders; for the question's data
# they would be 'TEXT_DATA' and 'NEW_COLUMN' on df
my_words = {'python': 103, 'apple': 101, 'java': 102}
for word in my_words.keys():
    df1.loc[df1['my_column'].str.contains(word, na=False), ['my_second_column']] = my_words[word]
First, extract the values in df1['vals']. Then create a new column holding the extraction result, and finally merge both dataframes.
extr = '|'.join(x for x in df1['vals'])
df['vals'] = df['TEXT_DATA'].str.extract('(' + extr + ')', expand=False)
newdf = pd.merge(df, df1, on='vals', how='left')
To select just the relevant fields in the result, index with the column names:
newdf[['TEXT_DATA','ids']]
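If the column should be named NEW_COLUMN as in the desired output, a rename afterwards would do it; a minimal sketch:
newdf = newdf[['TEXT_DATA', 'ids']].rename(columns={'ids': 'NEW_COLUMN'})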
You could use a cartesian product of both dataframes and then select the relevant rows and columns.
tmp = df.assign(key=1).merge(df1.assign(key=1), on='key').drop(columns='key')
result = tmp.loc[tmp.apply(func=lambda x: x.vals in x.TEXT_DATA, axis=1)]\
            .drop(columns='vals').reset_index(drop=True)
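On pandas 1.2+ the dummy key can be dropped in favor of a cross merge; a minimal sketch of the same product:
tmp = df.merge(df1, how='cross')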

(Python) How to group unique values in column with total of another column

This is a sample of what my dataframe looks like:
company_name  country_code  state_code  software  finance  commerce  etc......
google        USA           CA          1         0        0
jimmy         GBR           unknown     0         0        1
I would like to be able to group the industry of a company with its state code. For example I would like to have the total number of software companies in a state etc. (e.g. 200 software companies in CA, 100 finance companies in NY).
I am currently just counting the total number of companies in each state using:
usa_df['state_code'].value_counts()
But I can't figure out how to group the number of each type of industry in each individual state.
If the 1s and 0s are boolean flags for each category, then you should just need sum.
df[df.country_code == 'USA'].groupby('state_code').sum().reset_index()
# state_code commerce finance software
#0 CA 0 0 1
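One caveat: on recent pandas versions a plain .sum() raises on non-numeric columns such as company_name, so it is safer to restrict the aggregation to the numeric flags; a minimal sketch:
df[df.country_code == 'USA'].groupby('state_code').sum(numeric_only=True).reset_index()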
df.groupby(['state_code']).agg({'software': 'sum', 'finance': 'sum', ...})
This will group by the state_code, and sum up the number of 'software', 'finance', etc in each grouping.
Could also do a pivot_table:
df.pivot_table(index='state_code', values=['software', 'finance', ...], aggfunc='sum')
This may help you:
result_dataframe = dataframe_name.groupby('state_code').sum()

Python Pandas - merge rows if some values are blank

I have a dataset that looks a little like this:
ID  Name             Address      Zip    Cost
1   Bob the Builder  123 Main St  12345
1   Bob the Builder                      $99,999.99
2   Bob the Builder  123 Sub St   54321  $74,483.01
3   Nigerian Prince  Area 51      33333  $999,999.99
3   Pinhead Larry    Las Vegas    31333  $11.00
4   Fox Mulder       Area 51             $0.99
where missing data is okay, unless it's obvious that rows can be merged. What I mean is that I want to merge the rows where both the ID and Name are the same and the other features fill in each other's blanks. For example, the dataset above would become:
ID  Name             Address      Zip    Cost
1   Bob the Builder  123 Main St  12345  $99,999.99
2   Bob the Builder  123 Sub St   54321  $74,483.01
3   Nigerian Prince  Area 51      33333  $999,999.99
3   Pinhead Larry    Las Vegas    31333  $11.00
4   Fox Mulder       Area 51             $0.99
I've thought about using df.groupby(["ID", "Name"]) and then concatenating the strings since the missing values are empty strings, but got no luck with it.
The data has been scraped off websites, so they've had to go through a lot of cleaning to end up here. I can't think of an elegant way of figuring this out!
This only works if the rows we are potentially merging are next to each other.
setup
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    ID=[1, 1, 2, 3, 3, 4],
    Name=['Bob the Builder'] * 3 + ['Nigerian Prince', 'Pinhead Larry', 'Fox Mulder'],
    Address=['123 Main St', '', '123 Sub St', 'Area 51', 'Las Vegas', 'Area 51'],
    Zip=['12345', '', '54321', '33333', '31333', ''],
    Cost=['', '$99,999.99', '$74,483.01', '$999,999.99', '$11.00', '$0.99']
))[['ID', 'Name', 'Address', 'Zip', 'Cost']]
fill up missing
replace('', np.nan) then forward fill then back fill
df_ = df.replace('', np.nan).ffill().bfill()
concat
take the last row of the filled-up df_ if it's a duplicated row
take the non-filled-up df if not duplicated
pd.concat([
    df_[df_.duplicated()],
    df.loc[df_.drop_duplicates(keep=False).index]
])
I'll describe an algorithm:
1. Put aside all the rows where all fields are populated. We don't need to touch these.
2. Create a boolean DataFrame shaped like the input, where empty fields are False and populated fields are True. This is df.notnull().
3. For each name in df.Name.unique():
   - Take df[df.Name == name] as the working set.
   - Sum each pair (or tuple) of boolean rows, giving a vector as wide as the input columns, excluding those which are always populated. In the example this means summing [True, True, False] and [False, False, True], so the sum is [1, 1, 1].
   - If the sum is equal to 1 everywhere, that pair (or tuple) of rows can be merged.
But there are a ton of possible edge cases here, such as what to do if you have three rows A,B,C and you could merge either A+B or A+C. It will help if you can narrow down the constraints that exist in the data before implementing the merging algorithm.
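A minimal sketch of this idea, assuming blanks are empty strings and that any group needing a merge contains exactly two complementary rows (the function name and key columns here are my own choices, not from the answer):
import numpy as np
import pandas as pd

def merge_complementary(df, keys=('ID', 'Name')):
    df = df.replace('', np.nan)
    pieces = []
    for _, grp in df.groupby(list(keys), sort=False):
        # True where a non-key field is populated
        filled = grp.drop(columns=list(keys)).notnull()
        if len(grp) == 2 and (filled.sum() == 1).all():
            # complementary pair: every field is populated in exactly one row
            pieces.append(grp.ffill().bfill().iloc[[0]])
        else:
            pieces.append(grp)
    return pd.concat(pieces, ignore_index=True)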
