Remove newline characters from pandas series of lists - python

I have a pandas DataFrame that contains two columns, one of tags containing numbers and the other with a list containing string elements.
Dataframe:
df = pd.DataFrame({
    'tags': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
    'elements': {
        0: ['\n☒', '\nANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 '],
        1: ['', ''],
        2: ['\n', '\nFor the Fiscal Year Ended June 30, 2020'],
        3: ['\n', '\n'],
        4: ['\n', '\nOR']
    }
})
I am trying to remove all instances of \n from every element of the lists in the elements column, but I'm really struggling to do so. My solution was to use a nested loop and re.sub() to try and replace these, but it has done nothing (granted, this is a horrible solution). This was my attempt:
for ls in range(len(page_table.elements)):
    for st in range(len(page_table.elements[i])):
        page_table.elements[i][st] = re.sub('\n', '', page_table.elements[i][st])
Is there a way to do this?

You can explode and then replace the \n values, grouping the elements back into lists afterwards.
You can leave out the .groupby(level=0).agg(list) to not put them back into lists, though the result will then have a different shape from the original DataFrame (see the sketch after the output below).
df["elements"] = (
df["elements"]
.explode()
.str.replace(r"\n", "", regex=True)
.groupby(level=0)
.agg(list)
)
Which outputs:
0    [☒, ANNUAL REPORT PURSUANT TO SECTION 13 OR 15...
1                                                [, ]
2          [, For the Fiscal Year Ended June 30, 2020]
3                                                [, ]
4                                              [, OR]
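For reference, here is a rough sketch of the intermediate result if you skip the regrouping (one row per list element; the original row index repeats, which is what groupby(level=0) later uses to rebuild the lists):
exploded = df["elements"].explode().str.replace(r"\n", "", regex=True)
print(exploded)  # index runs 0, 0, 1, 1, ... with one cleaned string per row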

Also possible:
df['elements'] = df['elements'].map(lambda x: [y.replace('\n', '') for y in x])
0    [☒, ANNUAL REPORT PURSUANT TO SECTION 13 OR 15...
1                                                [, ]
2          [, For the Fiscal Year Ended June 30, 2020]
3                                                [, ]
4                                              [, OR]

Related

Pandas aggregation function: Merge text rows, but insert spaces between them?

I managed to group rows in a dataframe, given one column (id).
The problem is that one column consists of parts of sentences, and when I add them together, the spaces are missing.
An example probably makes it easier to understand...
My dataframe looks something like this:
import pandas as pd
# create DataFrame
df = pd.DataFrame({'id': [101, 101, 102, 102, 102],
                   'text': ['The government changed', 'the legislation on import control.',
                            'Politics cannot solve all problems', 'but it should try to do its part.',
                            'That is the reason why these elections are important.'],
                   'date': [1990, 1990, 2005, 2005, 2005]})
    id                                                text  date
0  101                              The government changed  1990
1  101                 the legislation on import control.  1990
2  102                 Politics cannot solve all problems  2005
3  102                  but it should try to do its part.  2005
4  102  That is the reason why these elections are imp...  2005
Then I used the aggregation function:
aggregation_functions = {'id': 'first', 'text': 'sum', 'date': 'first'}
df_new = df.groupby(df['id']).aggregate(aggregation_functions)
which returns:
    id                                                       text  date
0  101  The government changedthe legislation on import control.  1990
2  102     Politics cannot solve all problemsbut it should try...  2005
So, for example, I need a space in between 'The government changed' and 'the legislation...'. Is that possible?
If you need to put a space between the two phrases/rows, use str.join:
ujoin = lambda s: " ".join(dict.fromkeys(s.astype(str)))

out = df.groupby(["id", "date"], as_index=False).agg(**{"text": ("text", ujoin)})[df.columns]
# Output :
print(out.to_string())
id text date
0 101 The government changed the legislation on import control. 1990
1 102 Politics cannot solve all problems but it should try to do its part. That is the reason why these elections are important. 2005
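Note that dict.fromkeys in ujoin also deduplicates repeated fragments while preserving order. If that isn't needed, a plain " ".join as the aggregator is a simpler sketch:
out = df.groupby(["id", "date"], as_index=False)["text"].agg(" ".join)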

why doesn't my python pandas dataframe strip method work for trailing whitespace? and how do I fix it?

I have this code to strip whitespace from the dataframe
# create a dataframe with 3 columns
dataFrame = pd.DataFrame({
    'Product Category': [' Computer', ' Mobile Phone', 'Electronics ', 'Appliances', ' Furniture', 'Stationery'],
    'Product Name': ['Keyboard', 'Charger', ' SmartTV', 'Refrigerators', ' Chairs', 'Diaries'],
    'Quantity': [10, 50, 10, 20, 25, 50]})
print ("Dataframe before removing whitespaces...\n",dataFrame)
# removing whitespace from more than 1 column
dataFrame['Product Category'].str.strip()
dataFrame['Product Name'].str.strip()
# dataframe
print ("Dataframe after removing whitespaces...\n",dataFrame)
The Dataframe before removing whitespace...
Product Category Product Name Quantity
0 Computer Keyboard 10
1 Mobile Phone Charger 50
2 Electronics SmartTV 10
3 Appliances Refrigerators 20
4 Furniture Chairs 25
5 Stationery Diaries 50
The Dataframe after removing whitespace...
Product Category Product Name Quantity
0 Computer Keyboard 10
1 Mobile Phone Charger 50
2 Electronics SmartTV 10
3 Appliances Refrigerators 20
4 Furniture Chairs 25
5 Stationery Diaries 50
The whitespace after "Electronics" is not stripped. Any ideas how I can fix this?
As Umar said, str.strip returns a new Series rather than modifying the column in place, so you need to assign the result back to each column:
# create a dataframe with 3 columns
dataFrame = pd.DataFrame({
    'Product Category': [' Computer', ' Mobile Phone', 'Electronics ', 'Appliances', ' Furniture', 'Stationery'],
    'Product Name': ['Keyboard', 'Charger', ' SmartTV', 'Refrigerators', ' Chairs', 'Diaries'],
    'Quantity': [10, 50, 10, 20, 25, 50]})
print ("Dataframe before removing whitespaces...\n",dataFrame)
# removing whitespace from more than 1 column
dataFrame['Product Category'] = dataFrame['Product Category'].str.strip()
dataFrame['Product Name'] = dataFrame['Product Name'].str.strip()
# dataframe
print ("Dataframe after removing whitespaces...\n",dataFrame)

Apply a function on elements in a Pandas column, grouped on another column

I have a dataset with several columns.
Now what I want is to basically calculate a similarity score based on a particular column ("name"), grouped on the "id" column.
   _id    fName    lName   age
0  ABCD   Andrew   Schulz
1  ABCD   Andreww          23
2  DEFG   John     boy
3  DEFG   Johnn    boy      14
4  CDGH   Bob      TANNA    13
5  ABCD.  Peter    Parker   45
6  DEFGH  Clark    Kent     25
So what I am looking for is whether, for the same id, there are similar entries, so that I can remove those entries based on a threshold score. For example, if I run it for the column "fName", I should be able to reduce this dataframe, based on a score threshold, to:
   _id   fName   lName   age
0  ABCD  Andrew  Schulz   23
2  DEFG  John    boy      14
4  CDGH  Bob     TANNA    13
5  ABCD  Peter   Parker   45
6  DEFG  Clark   Kent     25
I intend to use pyjarowinkler.
If I had two independent columns (without all the group by stuff) to check, this is how I use it.
df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'],df['name_2'])]
df = df[df['score'] > 0.87]
Can someone suggest a pythonic and fast way of doing this?
UPDATE
So, I have tried using the recordlinkage library for this, and I have ended up with a dataframe containing pairs of indexes that are similar, called 'matches'. Now I just want to combine the data.
# Indexation step
indexer = recordlinkage.Index()
indexer.block(left_on='_id')
candidate_links = indexer.index(df)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.string('fName', 'fName', method='jarowinkler', threshold=threshold, label='full_name')
features = compare_cl.compute(candidate_links, df)
# Classification step
matches = features[features.sum(axis=1) >= 1]
print(len(matches))
This is how matches looks:
index1  index2  fName
0       1         1.0
2       3         1.0
I need someone to suggest a way to combine the similar rows so that the data from those rows is merged.
I just wanted to clear up some doubts regarding your question; I couldn't ask in the comments due to low reputation.
For example, if I run it for the column "fName", I should be able to reduce this dataframe, based on a score threshold, to:
So basically your function would return the DataFrame containing the first row in each group (by ID)? That would result in the following DataFrame:
   _id   fName   lName   age
0  ABCD  Andrew  Schulz   23
2  DEFG  John    boy      14
4  CDGH  Bob     TANNA    13
I hope this code answers your question:
r0 = ['ABCD', 'Andrew', 'Schulz', '']
r1 = ['ABCD', 'Andrew', '', '23']
r2 = ['DEFG', 'John', 'boy', '']
r3 = ['DEFG', 'John', 'boy', '14']
r4 = ['CDGH', 'Bob', 'TANNA', '13']
Rx = [r0, r1, r2, r3, r4]
print(Rx)
print()

Dict = dict()
for i in Rx:
    if i[0] in Dict:
        # Fill in any missing lName/age fields from later rows with the same id.
        if i[2] != '':
            Dict[i[0]][2] = i[2]
        if i[3] != '':
            Dict[i[0]][3] = i[3]
    else:
        Dict[i[0]] = i
Rx[:] = Dict.values()
print(Rx)
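For what it's worth, a pandas sketch of the same merge, assuming the rows are already in a DataFrame with the question's columns: replace empty strings with NaN and take the first non-null value per column within each _id group (GroupBy.first skips nulls). This groups on exact _id values, so typo variants like 'ABCD.' would need normalizing first.
import numpy as np

merged = df.replace('', np.nan).groupby('_id', as_index=False).first()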
I am lost with the 'score' part of your question, but if what you need is to fill the gaps in data with values from other rows and then drop the duplicates by id, maybe this can help:
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
First make sure that empty values are replaced with nulls. Then use fillna to 'back fill' the data. Then drop duplicates, keeping the first occurrence of each Id. fillna will fill values from the next value found in the column, which may belong to another Id, but since you will discard the duplicated rows, I believe drop_duplicates keeping the first occurrence will do the job. (This assumes that at least one value is provided in every column for every Id.)
I've tested with this dataset and code:
import pandas as pd
import numpy as np

data = [
    ['AABBCC', 'Andrew', ''],
    ['AABBCC', 'Andrew', 'Schulz'],
    ['AABBCC', 'Andrew', '', 23],
    ['AABBCC', 'Andrew', ''],
    ['AABBCC', 'Andrew', ''],
    ['DDEEFF', 'Karl', 'boy'],
    ['DDEEFF', 'Karl', ''],
    ['DDEEFF', 'Karl', '', 14],
    ['GGHHHH', 'John', 'TANNA', 13],
    ['HLHLHL', 'Bob', ''],
    ['HLHLHL', 'Bob', ''],
    ['HLHLHL', 'Bob', 'Blob'],
    ['HLHLHL', 'Bob', 'Blob', 15],
    ['HLHLHL', 'Bob', '', 15],
    ['JLJLJL', 'Nick', 'Best', 20],
    ['JLJLJL', 'Nick', '']
]
df = pd.DataFrame(data, columns=['Id', 'fName', 'lName', 'Age'])
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
Output:
        Id   fName   lName   Age
0   AABBCC  Andrew  Schulz  23.0
5   DDEEFF    Karl     boy  14.0
8   GGHHHH    John   TANNA  13.0
9   HLHLHL     Bob    Blob  15.0
14  JLJLJL    Nick    Best  20.0
Hope this helps and apologies if I misunderstood the question.

Removing strings in a series of headers

I have a number of columns in a dataframe:
df = pd.DataFrame({'Date': [1990],
                   'State Income of Alabama': [1],
                   'State Income of Washington': [2],
                   'State Income of Arizona': [3]})
All headers share the exact same prefix, 'State Income of ', with exactly one space before the state's name.
I want to take out the string 'State Income of ' and leave the state intact as the new header for each column, so they all read:
Alabama Washington Arizona
1 2 3
I've tried replacing the column names like this:
df.columns = df.columns.str.replace('State Income of ', '')
But this isn't giving me the desired output.
Here is another solution, not in place:
df.rename(columns=lambda x: x.split()[-1])
or in place:
df.rename(columns=lambda x: x.split()[-1], inplace = True)
Your way works for me, but there are alternatives:
One way is to split your column names and take the last word:
df.columns = [i.split()[-1] for i in df.columns]
>>> df
Alabama Arizona Washington
0 1 3 2
You can use the re module for this:
>>> import pandas as pd
>>> df = pd.DataFrame({'State Income of Alabama':[1],
... 'State Income of Washington':[2],
... 'State Income of Arizona':[3]})
>>>
>>> import re
>>> df.columns = [re.sub('State Income of ', '', col) for col in df]
>>> df
Alabama Washington Arizona
0 1 2 3
re.sub('State Income of ', '', col) will replace any occurrence of 'State Income of ' with an empty string (with "nothing," effectively) in the string col.
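On Python 3.9+, str.removeprefix is a non-regex alternative; it strips the prefix only when it is actually present (a minimal sketch):
df.columns = [col.removeprefix('State Income of ') for col in df.columns]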

How to replace first two letters of a column value using index

I have a data frame
d = {'name': ['john', 'tom', 'bob', 'rock', None],
     'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'],
     'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df = pd.DataFrame(data=d)
df['month'] = pd.DatetimeIndex(df['DoB']).month
df['year'] = pd.DatetimeIndex(df['DoB']).year
What I want to do: replace the first two letters in the name column with 'XX' if year == 2014.
My code:
df.loc[ (df.year == 2014) , df.name.str[0:2] ] = 'XX'
First of all I get this error:
ValueError: cannot index with vector containing NA / NaN values
But even if there were a value instead of None - say 'jimy' - I get the following error: KeyError: "['jo' 'to' 'bo' 'ro' 'ji'] not in index"
I also thought of the replace method, but it only works if you want to replace a given string.
Any suggestions?
You are close. Note that pd.DataFrame.loc uses a column label as the second indexer.
mask = df['year'] == 2014
df.loc[mask, 'name'] = 'XX' + df.loc[mask, 'name'].str[2:]
print(df)
  Address         DoB  name  month  year
0      NY  01/02/2010  john      1  2010
1      NJ  01/02/2012   tom      1  2012
2      PA  11/22/2014   XXb     11  2014
3      NY  11/22/2014  XXck     11  2014
4      CA  09/25/2016  None      9  2016
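An equivalent sketch using a regular expression, which rewrites the first two characters only on the masked rows (the None value falls outside the mask here, since its year is 2016):
mask = df['year'] == 2014
df.loc[mask, 'name'] = df.loc[mask, 'name'].str.replace(r'^.{2}', 'XX', regex=True)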
