Python - Group dataframe based on certain string

I am trying to combine strings across rows according to the following logic:
import numpy as np
import pandas as pd

s1 = ['abc.txt', 'abc.txt', 'ert.txt', 'ert.txt', 'ert.txt']
s2 = [1, 1, 2, 2, 2]
s3 = ['Harry Potter', 'Vol 1', 'Lord of the Rings - Vol 1', np.nan, 'Harry Potter']
df = pd.DataFrame(list(zip(s1, s2, s3)), columns=['file', 'id', 'book'])
df
Data preview:
file     id  book
abc.txt  1   Harry Potter
abc.txt  1   Vol 1
ert.txt  2   Lord of the Rings - Vol 1
ert.txt  2   NaN
ert.txt  2   Harry Potter
I have a bunch of file names with ids associated with them, and a 'book' column where 'Vol 1' appears in a separate row. I know that this 'Vol 1' is only associated with 'Harry Potter' in the given dataset.
Grouping by 'file' and 'id', how do I merge 'Vol 1' into the same row where the 'Harry Potter' string appears? Notice that some rows don't have a 'Vol 1' for Harry Potter; I only want to attach 'Vol 1' when it occurs within the same file & id group.
Two attempts:
First (doesn't work):
if (df['book'] == 'Harry Potter' and df['book'].str.contains('Vol 1', case=False) in df.groupby(['file', 'id'])):
    df.groupby(['file', 'id'], as_index=False).first()
Second (this applies to every 'Harry Potter' string, which is not what I want):
df.loc[df['book'].str.contains('Harry Potter', case=False, na=False), 'new_book'] = 'Harry Potter - Vol 1'
Here is the output I am looking for:
file     id  book
abc.txt  1   Harry Potter - Vol 1
ert.txt  2   Lord of the Rings - Vol 1
ert.txt  2   NaN
ert.txt  2   Harry Potter

Start with import re (you will need it), then create your DataFrame:
import re
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'file': ['abc.txt', 'abc.txt', 'ert.txt', 'ert.txt', 'ert.txt'],
    'id': [1, 1, 2, 2, 2],
    'book': ['Harry Potter', 'Vol 1', 'Lord of the Rings - Vol 1',
             np.nan, 'Harry Potter']})
The first processing step is to add a column, let's call it book2,
containing the book value from the next row:
df["book2"] = df.book.shift(-1).fillna('')
I added fillna('') to replace NaN values with an empty string.
Then define a function to be applied to each row:
def fn(row):
    return f"{row.book} - {row.book2}" if row.book == 'Harry Potter' \
        and re.match(r'^Vol \d+$', row.book2) else row.book
This function checks whether book == "Harry Potter" and book2 matches
"Vol " followed by a sequence of digits.
If so, it returns book and book2 joined by " - "; otherwise it returns just book.
Then we apply this function and save the result back under book:
df["book"] = df.apply(fn, axis=1)
And the only remaining things are to drop:
rows where book matches Vol \d+,
and the book2 column.
The code is:
df = df.drop(df[df.book.str.match(r'^Vol \d+$').fillna(False)].index) \
       .drop(columns=['book2'])
fillna(False) is needed because str.match returns NaN where the
source value is NaN.
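For reference, running the steps above end to end should reproduce the output requested in the question (row indices are kept from the original frame):
print(df)
#       file  id                       book
# 0  abc.txt   1       Harry Potter - Vol 1
# 2  ert.txt   2  Lord of the Rings - Vol 1
# 3  ert.txt   2                        NaN
# 4  ert.txt   2               Harry Potter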

Assuming that "Vol x" occurs on the row following the title, I would use an auxilliary Series obtained by shifting the book column by -1. It is then enough to combine that Series with the book column when it starts with "Vol " and drop lines where the books column starts with "Vol ". Code could be:
b2 = df.book.shift(-1).fillna('')
df['book'] = df.book + np.where(b2.str.match('Vol [0-9]+'), ' - ' + b2, '')
print(df.drop(df.loc[df.book.fillna('').str.match('Vol [0-9]+')].index))
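For the sample data this is expected to print:
      file  id                       book
0  abc.txt   1       Harry Potter - Vol 1
2  ert.txt   2  Lord of the Rings - Vol 1
3  ert.txt   2                        NaN
4  ert.txt   2               Harry Potter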
If the order in the dataframe is not guaranteed, but each "Vol x" row matches the other rows with the same file and id, you can split the dataframe into two parts, one containing the "Vol x" rows and one containing the rest, and update the latter from the former:
g = df.groupby(df.book.fillna('').str.match('Vol [0-9]+'))
for k, v in g:
    if k:
        df_vol = v
    else:
        df = v
for row in df_vol.iterrows():
    r = row[1]
    df.loc[(df.file == r.file) & (df.id == r.id), 'book'] += ' - ' + r['book']
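A merge-based sketch that avoids the iterrows loop (starting again from the original df, and assuming at most one "Vol x" row per file/id group):
import numpy as np

is_vol = df.book.fillna('').str.match('Vol [0-9]+')
# one 'Vol x' string per (file, id) group
vols = df[is_vol].set_index(['file', 'id'])['book'].rename('vol')
rest = df[~is_vol].join(vols, on=['file', 'id'])
# append the volume where the group has one
rest['book'] = rest.book + np.where(rest.vol.notna(), ' - ' + rest.vol, '')
rest = rest.drop(columns=['vol'])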

Utilizing merge, apply, update, and drop_duplicates.
set_index and merge on the (file, id) index between the 'Harry Potter' rows and the 'Vol 1' rows; join the strings and convert the result back to a dataframe:
df.set_index(['file', 'id'], inplace=True)
df1 = (df[df['book'] == 'Harry Potter']
       .merge(df[df['book'] == 'Vol 1'], left_index=True, right_index=True)
       .apply(' '.join, axis=1)
       .to_frame(name='book'))
Out[2059]:
                          book
file    id
abc.txt 1   Harry Potter Vol 1
Update the original df, drop duplicates, and reset_index:
df.update(df1)
df.drop_duplicates().reset_index()
Out[2065]:
      file  id                       book
0  abc.txt   1         Harry Potter Vol 1
1  ert.txt   2  Lord of the Rings - Vol 1
2  ert.txt   2                        NaN
3  ert.txt   2               Harry Potter
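Note that the question's expected output uses a dash separator; swapping ' '.join for ' - '.join above would produce 'Harry Potter - Vol 1'. A hedged sketch generalizing the merge step to any 'Vol <n>' row, not just the literal 'Vol 1' (run in place of the df1 line above):
is_vol = df['book'].str.match(r'Vol \d+$').fillna(False)
df1 = (df[df['book'] == 'Harry Potter']
       .merge(df[is_vol], left_index=True, right_index=True)
       .apply(' - '.join, axis=1)
       .to_frame(name='book'))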

Related

Replace string with np.nan if condition is met

I am trying to replace a string occurrence in a column if a condition is met.
My sample input dataset:
Series Name   Type
Food          ACG
Drinks        FEG
Food at Home  BON
I want to replace the strings in the Series Name column with nan (or blank) wherever the Type column is either ACG or BON. I tried the following code using conditions, without much success.
Code:
df.loc[((df['Type'] == 'ACG') | (df['Type'] == 'BON')),
df['Series Name'].replace(np.nan)]
Desired output:
Series Name   Type
              ACG
              FEG
Food at Home  BON
Since you want to set the whole cell to nan, just do this:
df.loc[((df['Type'] == 'ACG') | (df['Type'] == 'BON')), 'Series Name'] = np.nan
Output:
  Series Name Type
0         NaN  ACG
1      Drinks  FEG
2         NaN  BON
Update:
Regarding your question in the comments: if you only want to change part of the string, you can use replace like this:
# new input
df = pd.DataFrame({
    'Series Name': ['Food to go', 'Fast Food', 'Food at Home'],
    'Type': ['ACG', 'FEG', 'BON']
})
Series Name Type
0 Food to go ACG
1 Fast Food FEG
2 Food at Home BON
mask = df['Type'].isin(['ACG', 'BON'])
df.loc[mask, 'Series Name'] = (df.loc[mask, 'Series Name']
                               .replace(to_replace='Food', value='NEWVAL', regex=True))
print(df)
Series Name Type
0 NEWVAL to go ACG
1 Fast Food FEG
2 NEWVAL at Home BON
Another option is to use Series.mask:
mask = df['Type'].isin(['ACG', 'BON'])
df['Series Name'] = df['Series Name'].mask(mask)
Output:
Series Name Type
0 NaN ACG
1 Drinks FEG
2 NaN BON
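np.where is another option for setting the whole cell (a short sketch, equivalent in effect to Series.mask here):
import numpy as np

mask = df['Type'].isin(['ACG', 'BON'])
df['Series Name'] = np.where(mask, np.nan, df['Series Name'])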

Python Text File to Data Frame with Specific Pattern

I am trying to convert a bunch of text files into a data frame using Pandas.
Each text file contains simple text that starts with two relevant pieces of information: the Number and the Register variables.
Then, the text files have some random text that should not be taken into consideration.
Last, the text files contain information such as the share number, the person's name, birth date, address, and some additional rows that start with a lowercase letter. Each group contains this information, and the pattern is always the same: the first row of a group is a number (hereby id) followed by the word "SHARE".
Here is an example:
Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000
I need to transform the text into a data frame with the following output, where each group is stored in one row:
Number  Register  City    Id  Share    Name          Born        c          f          h          i
01600   4314      London  1   73/1284  John Smith    1960-01-01  NaN        4222/2001  1334/2000  5774/2000
01600   4314      London  4   58/1284  Boris Morgan  1965-01-01  4222/1988  4222/2000  NaN        NaN
My initial approach was to first import the text file and apply regular expression for each case:
import pandas as pd
import re
df = open(r'Test.txt', 'r').read()
for line in re.findall('SHARE.*', df):
    print(line)
But probably there is a better way to do it.
Any help is highly appreciated. Thanks in advance.
This can be done without regex with list comprehension and splitting strings:
import pandas as pd
text = '''Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000'''
text = [i.strip() for i in text.splitlines()]  # create a list of lines
data = []

# extract metadata from the first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]

# create a list of the index numbers of the lines where new items start
indices = [text.index(i) for i in text if 'SHARE' in i]

# split the list by the retrieved indexes to get a list of lists of items
items = [text[i:j] for i, j in zip([0] + indices, indices + [None])][1:]

for i in items:
    d = {'Number': number, 'Register': register, 'City': city,
         'Id': int(i[0].split()[0]), 'Share': i[0].split(': ')[1],
         'Name': i[1], 'Born': i[2].split()[1]}
    # split the lowercase-letter rows into (letter, value) pairs; a line that
    # does not start with a single letter is folded into the previous value
    rows = [s.split() for s in i[3:]]
    merged_rows = []
    for r in rows:
        if len(r[0]) == 1 and r[0].isalpha():
            merged_rows.append(r)
        else:
            merged_rows[-1][-1] = merged_rows[-1][-1] + r[0]
    d.update({name: value for name, value in merged_rows})
    data.append(d)

# load the list of dicts as a dataframe
df = pd.DataFrame(data)
Output:
  Number  Register    City  Id    Share          Name        Born          f          h          i          c
0  01600      4314  London   1  73/1284    John Smith  1960-01-01  4222/2001  1334/2000  5774/2000        nan
1  01600      4314  London   4  58/1284  Boris Morgan  1965-01-01  4222/2000        nan        nan  4222/1988
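For comparison, a hedged regex-based sketch of the same parsing, assuming the exact layout shown above (one header line, then blocks beginning with "<id> SHARE:"); here raw stands for the original multi-line string, before it is split into lines:
import re
import pandas as pd

# header line: "Number <number> <city> Register <register>"
number, city, register = re.match(
    r'Number (\d+) (\w+) Register (\d+)', raw).groups()

block_re = re.compile(
    r'(?m)^(\d+) SHARE: (\S+)\n'   # id and share
    r'(.+)\n'                      # name
    r'BORN: (\S+) ADR: .+\n?'      # birth date (address ignored here)
    r'((?:[a-z] .+\n?)*)')         # rows starting with a lowercase letter

records = []
for m in block_re.finditer(raw):
    rec = {'Number': number, 'Register': register, 'City': city,
           'Id': int(m.group(1)), 'Share': m.group(2),
           'Name': m.group(3), 'Born': m.group(4)}
    # each lowercase-letter row becomes its own column
    for line in m.group(5).splitlines():
        key, value = line.split(maxsplit=1)  # e.g. 'f 4222/2001'
        rec[key] = value
    records.append(rec)

df = pd.DataFrame(records)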

Apply a function on elements in a Pandas column, grouped on another column

I have a dataset with several columns.
What I want is basically to calculate a similarity score based on a particular column ("name"), grouped on the "id" column.
  _id    fName    lName   age
0 ABCD   Andrew   Schulz
1 ABCD   Andreww          23
2 DEFG   John     boy
3 DEFG   Johnn    boy     14
4 CDGH   Bob      TANNA   13
5 ABCD.  Peter    Parker  45
6 DEFGH  Clark    Kent    25
What I am looking for is whether, for the same id, I am getting similar entries, so that I can remove those entries based on a threshold score. For example, if I run it on column "fName", I should be able to reduce this dataframe to:
  _id   fName   lName   age
0 ABCD  Andrew  Schulz  23
2 DEFG  John    boy     14
4 CDGH  Bob     TANNA   13
5 ABCD  Peter   Parker  45
6 DEFG  Clark   Kent    25
I intend to use pyjarowinkler.
If I had two independent columns (without all the group by stuff) to check, this is how I use it.
df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'],df['name_2'])]
df = df[df['score'] > 0.87]
Can someone suggest a pythonic and fast way of doing this?
UPDATE
So, I have tried using the recordlinkage library for this, and I have ended up with a dataframe called 'matches' containing pairs of indexes that are similar. Now I just want to combine the data.
# Indexation step
indexer = recordlinkage.Index()
indexer.block(left_on='_id')
candidate_links = indexer.index(df)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.string('fName', 'fName', method='jarowinkler', threshold=threshold, label='full_name')
features = compare_cl.compute(candidate_links, df)
# Classification step
matches = features[features.sum(axis=1) >= 1]
print(len(matches))
This is what matches looks like:
index1 index2 fName
0 1 1.0
2 3 1.0
I need someone to suggest a way to combine the similar rows so that the result takes data from both of them.
Just wanted to clear some doubts regarding your question; I couldn't ask in the comments due to low reputation.
Like here if I run it for col "fName", I should be able to reduce this dataframe based on a score threshold.
So basically your function would return the DataFrame containing the first row of each group (by ID)? That would result in the following DataFrame:
  _id   fName   lName   age
0 ABCD  Andrew  Schulz  23
2 DEFG  John    boy     14
4 CDGH  Bob     TANNA   13
I hope this code answers your question:
r0 = ['ABCD', 'Andrew', 'Schulz', '']
r1 = ['ABCD', 'Andrew', '', '23']
r2 = ['DEFG', 'John', 'boy', '']
r3 = ['DEFG', 'John', 'boy', '14']
r4 = ['CDGH', 'Bob', 'TANNA', '13']
Rx = [r0, r1, r2, r3, r4]
print(Rx)
print()

Dict = dict()
for i in Rx:
    if i[0] in Dict:  # id already seen: fill in any non-empty fields
        if i[2] != '':
            Dict[i[0]][2] = i[2]
        if i[3] != '':
            Dict[i[0]][3] = i[3]
    else:             # first occurrence of this id
        Dict[i[0]] = i
Rx[:] = Dict.values()
print(Rx)
I am lost with the 'score' part of your question, but if what you need is to fill the gaps in data with values from other rows and then drop the duplicates by id, maybe this can help:
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
First make sure that empty values are replaced with nulls. Then use fillna to 'back fill' the data, and drop duplicates keeping the first occurrence of each Id. fillna fills each gap from the next value found in the column, which may belong to another Id, but since the duplicated rows are discarded afterwards, drop_duplicates keeping the first occurrence will do the job. (This assumes that at least one value is provided in every column for every Id.)
I've tested with this dataset and code:
data = [
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', 'Schulz'],
['AABBCC', 'Andrew', '', 23],
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', '',],
['DDEEFF', 'Karl', 'boy'],
['DDEEFF', 'Karl', ''],
['DDEEFF', 'Karl', '', 14],
['GGHHHH', 'John', 'TANNA', 13],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', 'Blob'],
['HLHLHL', 'Bob', 'Blob', 15],
['HLHLHL', 'Bob','', 15],
['JLJLJL', 'Nick', 'Best', 20],
['JLJLJL', 'Nick', '']
]
df = pd.DataFrame(data, columns=['Id', 'fName', 'lName', 'Age'])
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
Output:
Id fName lName Age
0 AABBCC Andrew Schulz 23.0
5 DDEEFF Karl boy 14.0
8 GGHHHH John TANNA 13.0
9 HLHLHL Bob Blob 15.0
14 JLJLJL Nick Best 20.0
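If the back-fill assumption is too fragile (values could leak in from a neighbouring Id), a groupby-based variant is a safer sketch under the same setup:
df_filled = (df.replace('', np.nan)
               .groupby('Id', as_index=False, sort=False)
               .first())
groupby(...).first() keeps the first non-null value per column within each Id, so nothing can be borrowed from another group.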
Hope this helps and apologies if I misunderstood the question.

Removing the rows from dataframe till the actual column names are found

I am reading tabular data from an email into a pandas dataframe.
There is no guarantee that the column names will be in the first row. Sometimes the data is in the following format (the actual column names are ID, Name, and Year):
dummy1        dummy2        dummy3
test_column1  test_column2  test_column3
ID            Name          Year
1             John          Sophomore
2             Lisa          Junior
3             Ed            Senior
Sometimes the column names come in the first row as expected.
ID  Name  Year
1   John  Sophomore
2   Lisa  Junior
3   Ed    Senior
Once I read the HTML table from the email, how do I remove the initial rows that don't contain the column names? In the first case I would need to remove the first 2 rows of the dataframe (and promote the column row to the header), and in the second case I wouldn't have to remove anything.
Also, the column names can be in any sequence.
Basically, I want to do the following:
1. Check whether one of the rows in the dataframe contains the column names.
2. Remove the rows above it.
if "ID" in row:
    remove the above rows
How can I achieve this?
You can first get the index of the row containing the valid columns, then set the header and filter accordingly.
df = pd.read_csv("d.csv",sep='\s+', header=None)
col_index = df.index[(df == ["ID","Name","Year"]).all(1)].item() # get columns index
df.columns = df.iloc[col_index].to_numpy() # set valid columns
df = df.iloc[col_index + 1 :] # filter data
df
ID Name Year
3 1 John Sophomore
4 2 Lisa Junior
5 3 Ed Senior
Or, if you want to set ID as the index:
df = df.iloc[col_index + 1 :].set_index('ID')
df
Name Year
ID
1 John Sophomore
2 Lisa Junior
3 Ed Senior
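Since the question notes the column names can come in any sequence, a hedged variant that finds the header row regardless of order might look like:
expected = {'ID', 'Name', 'Year'}
# the header row is the one whose values form exactly the expected set
col_index = df.index[df.apply(lambda r: set(r) == expected, axis=1)].item()
df.columns = df.iloc[col_index].to_numpy()
df = df.iloc[col_index + 1:]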
Ugly but effective quick try: keep the header row plus any row whose first column is all digits (a column-level dtype check won't work row by row):
id_name = df.columns[0]
df_clean = df[(df[id_name] == 'ID') | df[id_name].str.isdigit().fillna(False)]

Removing strings in a series of headers

I have a number of columns in a dataframe:
df = pd.DataFrame({'Date': [1990],
                   'State Income of Alabama': [1],
                   'State Income of Washington': [2],
                   'State Income of Arizona': [3]})
All headers have the same number of words and share exactly the same prefix, with exactly one white space between each word and the state's name.
I want to take out the string 'State Income of ' and leave the state name intact as the new header, so they all just read:
Alabama Washington Arizona
1 2 3
I've tried replacing the column names like:
df.columns = df.columns.str.replace('State Income of ', '')
But this isn't giving me the desired output.
Here is another solution, not in place:
df.rename(columns=lambda x: x.split()[-1])
or in place:
df.rename(columns=lambda x: x.split()[-1], inplace = True)
Your way works for me, but there are alternatives:
One way is to split your column names and take the last word:
df.columns = [i.split()[-1] for i in df.columns]
>>> df
Alabama Arizona Washington
0 1 3 2
You can use the re module for this:
>>> import pandas as pd
>>> df = pd.DataFrame({'State Income of Alabama':[1],
... 'State Income of Washington':[2],
... 'State Income of Arizona':[3]})
>>>
>>> import re
>>> df.columns = [re.sub('State Income of ', '', col) for col in df]
>>> df
Alabama Washington Arizona
0 1 2 3
re.sub('State Income of', '', col) will replace any occurrence of 'State Income of' with an empty string (with "nothing," effectively) in the string col.
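On recent pandas (1.4+, if I recall correctly), str.removeprefix is another option; unlike a plain replace, it can only ever strip the leading occurrence:
df.columns = df.columns.str.removeprefix('State Income of ')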
