Removing empty words from column of tokenized sentences - python

I have a dataframe containing lists of words in each row in the same column. I'd like to remove what I guess are spaces. I managed to get rid of some by doing:
for i in processed.text:
    for x in i:
        if x == '' or x == " ":
            i.remove(x)
But some of them still remain.
>processed['text']
0 [have, month, #postdoc, within, on, chemical, ...
1 [hardworking, producers, iowa, so, for, state,...
2 [hardworking, producers, iowa, so, for, state,...
3 [today, time, is, to, sources, energy, much, p...
4 [thanks, gaetanos, club, c, oh, choosing, #rec...
...
130736 [gw, fossil, renewable, import, , , , , , , , ...
130737 [s, not, , go, ]
130738 [answer, deforestation, in, ]
130739 [plastic, regrind, any, and, grades, we, make,...
130740 [grid, generating, of, , , , gw]
Name: text, Length: 130741, dtype: object
>type(processed)
<class 'pandas.core.frame.DataFrame'>
Thank you very much.

Split on commas, remove the empty values, then join again with commas. If the cell is already a list, filter it directly:
def remove_empty(x):
    if isinstance(x, str):
        x = x.split(",")
        x = [y for y in x if y.strip()]
        return ",".join(x)
    elif isinstance(x, list):
        return [y for y in x if y.strip()]
    return x  # leave any other value (e.g. NaN) unchanged
processed['text'] = processed['text'].apply(remove_empty)
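As for why the loop in the question leaves some empty strings behind: list.remove shifts the remaining items to the left while the for loop keeps advancing its position, so the element right after each removed one is never examined. A minimal sketch of the effect, using a made-up token list (not taken from the question's data):

```python
tokens = ["gw", "fossil", "", "", "", "renewable", " "]

# Buggy approach: mutate the list while iterating over it.
buggy = list(tokens)
for tok in buggy:
    if tok == "" or tok == " ":
        buggy.remove(tok)  # shifts later items left; the next one is skipped

# Safer approach: build a new list and keep only non-blank tokens.
cleaned = [tok for tok in tokens if tok.strip()]

print(buggy)    # ['gw', 'fossil', '', 'renewable']
print(cleaned)  # ['gw', 'fossil', 'renewable']
```

Building a new list avoids the problem entirely, which is what the list comprehension inside remove_empty does.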

You can use split(expand=True) to do that. Note: you don't have to pass ' ' explicitly, as in split(' ', expand=True). With no separator, split breaks on any run of whitespace. You can also pass a different separator: for example, if your words are separated by , or -, split on that character instead.
import pandas as pd
df = pd.DataFrame({'Col1':['This is a long sentence',
                           'This is another long sentence',
                           'This is short',
                           'This is medium length',
                           'Wow. Tiny',
                           'Petite',
                           'Ok']})
print(df)
# keep the original df intact so the later examples can reuse df.Col1
split_df = df.Col1.str.split(' ', expand=True)
print(split_df)
The output of this will be:
Original dataframe:
Col1
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny
5 Petite
6 Ok
Dataframe split into columns
        0     1        2       3         4
0    This    is        a    long  sentence
1    This    is  another    long  sentence
2    This    is    short    None      None
3    This    is   medium  length      None
4    Wow.  Tiny     None    None      None
5  Petite  None     None    None      None
6      Ok  None     None    None      None
If you want to limit them to 3 columns only, then use n=2
split_df = df.Col1.str.split(' ', n=2, expand=True)
The output will be:
        0     1                      2
0    This    is        a long sentence
1    This    is  another long sentence
2    This    is                  short
3    This    is          medium length
4    Wow.  Tiny                   None
5  Petite  None                   None
6      Ok  None                   None
If you want to rename the columns to be more specific, then you can add rename to the end like this.
split_df = df.Col1.str.split(' ', n=2, expand=True).rename({0:'A',1:'B',2:'C'}, axis=1)
        A     B                      C
0    This    is        a long sentence
1    This    is  another long sentence
2    This    is                  short
3    This    is          medium length
4    Wow.  Tiny                   None
5  Petite  None                   None
6      Ok  None                   None
In case you want to replace all the None values with '' and also prefix the column names, you can do it as follows:
split_df = df.Col1.str.split(expand=True).add_prefix('Col').fillna('')
     Col0  Col1     Col2    Col3      Col4
0    This    is        a    long  sentence
1    This    is  another    long  sentence
2    This    is    short
3    This    is   medium  length
4    Wow.  Tiny
5  Petite
6      Ok
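The difference between split(' ') and split() with no separator matters for token lists, because splitting on a single space yields an empty string for every extra space. A small sketch of that behaviour, with sample strings invented for illustration:

```python
import pandas as pd

s = pd.Series(["grid  generating of", "answer deforestation in "])

# Explicit single-space separator: doubled and trailing spaces
# produce empty tokens.
on_space = s.str.split(" ")

# No separator: split on any run of whitespace, so no empty tokens.
on_whitespace = s.str.split()

print(on_space.tolist())
# [['grid', '', 'generating', 'of'], ['answer', 'deforestation', 'in', '']]
print(on_whitespace.tolist())
# [['grid', 'generating', 'of'], ['answer', 'deforestation', 'in']]
```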

Related

Can someone explain this please?

I have the following data:
0 Ground out of 2
1 1 out of 3
2 1 out of 3
Name: Floor, dtype: object
I want to modify this data so that I can create two columns named first floor and max floor.
Looking at the first item as an example:
0 Ground out of 2
the first floor would be 0 and max floor would be 2 etc...
This is the code I have written to extract the first floor items:
floor_location = []
lower_floors = ['Ground', 'Basement']
for data in df.Floor:
    for char in lower_floors:
        if char in data:
            floor_location.append('0')
        else:
            floor_location.append(data[:2])
When I do this, I get the following output:
['0', 'Gr', '1 ', '1 ']
I am expecting
['0', '1 ', '1 ']
Can someone explain where I am going wrong?
Thanks in advance.
Your loop is written in the wrong order.
In any case, don't use a loop; use vectorized string extraction and fillna instead:
df['Floor'].str.extract(r'^(\d+)', expand=False).fillna(0).astype(int)
Or for more flexibility (Ground -> 0 ; Basement -> -1…):
(df['Floor'].str.extract(r'^(\w+)', expand=False)
            .replace({'Ground': 0, 'Basement': -1})
            .astype(int)
)
output:
0 0
1 1
2 1
Name: Floor, dtype: int64
As list:
df['Floor'].str.extract(r'^(\d+)', expand=False).fillna(0).astype(int).tolist()
output : [0, 1, 1]
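First of all, the indentation of the else branch is wrong: attached to the if, it runs once for every entry of lower_floors that does not match. Attach it to the inner for loop and break on a match, so each cell appends exactly one value:
floor_location = []
lower_floors = ['Ground', 'Basement']
for data in df.Floor:
    for char in lower_floors:
        if char in data:
            floor_location.append('0')
            break  # stop at the first match so the else below is skipped
    else:
        # runs only when no entry of lower_floors matched
        floor_location.append(data[:2])
Second, because you are looping over the Floor column, data is a single cell (a string), not a row, so data[:2] keeps only its first two characters. That is why you see Gr.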
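Since the stated goal is two columns (first floor and max floor), both can be pulled out in one pass with named groups. This is only a sketch: the column names first_floor and max_floor, and the Basement -> -1 mapping, are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Floor": ["Ground out of 2", "1 out of 3", "1 out of 3"]})

# One capture group per target column; the group names become column names.
parts = df["Floor"].str.extract(r"^(?P<first_floor>\w+) out of (?P<max_floor>\d+)$")

# Map the named floors to numbers (as strings), then convert everything to int.
df["first_floor"] = (
    parts["first_floor"].replace({"Ground": "0", "Basement": "-1"}).astype(int)
)
df["max_floor"] = parts["max_floor"].astype(int)

print(df[["first_floor", "max_floor"]])
```

Rows that do not follow the "X out of Y" shape would come back as NaN from extract and make astype(int) fail, so real data may need a fillna or a check first.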

How to extract alphanumeric word from column values in excel with Python?

I need a way to extract all words that start with 'A' followed by a 6-digit numeric string right after (i.e. A112233, A000023).
Each cell contains sentences and there could potentially be a user error where they forget to put a space, so if you could account for that as well it would be greatly appreciated.
I've done research into using Python regex and Pandas, but I just don't know enough yet and am kind of on a time crunch.
Suppose your DataFrame is built with the following code:
import pandas as pd
df1 = pd.DataFrame(
    {
        "columnA": ["A194533","A4A556633 system01A484666","A4A556633","a987654A948323a882332A484666","A238B004867","pageA000023lol","a089923","something lol a484876A48466 emoji","A906633 A556633a556633"]
    }
)
print(df1)
Output:
columnA
0 A194533
1 A4A556633 system01A484666
2 A4A556633
3 a987654A948323a882332A484666
4 A238B004867
5 pageA000023lol
6 a089923
7 something lol a484876A48466 emoji
8 A906633 A556633a556633
Now let's fetch the matches for the regex pattern:
result = df1['columnA'].str.extractall(r'([A]\d{6})')
Output:
               0
  match
0 0      A194533
1 0      A556633
  1      A484666
2 0      A556633
3 0      A948323
  1      A484666
5 0      A000023
8 0      A906633
  1      A556633
And count them:
result.value_counts()
Output:
A556633 3
A484666 2
A000023 1
A194533 1
A906633 1
A948323 1
dtype: int64
Send the unique index into a list:
unique_list = [i[0] for i in result.value_counts().index.tolist()]
Output:
['A556633', 'A484666', 'A000023', 'A194533', 'A906633', 'A948323']
Value counts into a list:
unique_count_list = result.value_counts().values.tolist()
Output:
[3, 2, 1, 1, 1, 1]
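The pattern itself can be checked without pandas, which makes it easier to see how it copes with missing spaces. A minimal sketch using the standard re module on a few of the sample strings above:

```python
import re

# Capital 'A' followed by exactly six digits; no word boundary is required,
# so IDs glued to neighbouring text are still found.
pattern = re.compile(r"A\d{6}")

samples = [
    "A4A556633 system01A484666",
    "pageA000023lol",
    "a089923",        # lowercase 'a': not matched
    "A238B004867",    # 'A' is not followed by six digits: not matched
]

matches = {text: pattern.findall(text) for text in samples}
for text, found in matches.items():
    print(text, "->", found)
```

One caveat: in an ID with seven or more digits, this pattern matches just the first six. If such strings should be rejected instead, a negative lookahead such as r"A\d{6}(?!\d)" is one option.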

Converting list column to string in pandas

I have a DataFrame called df, shown below. Each value in tag_positions is either a string or a list, but I want them all to be strings. How can I do this? I also want to remove the whitespace at the start and end.
input
id tag_positions
1 center
2 right
3 ['left']
4 ['center ']
5 [' left']
6 ['right']
7 left
expected output
id tag_positions
1 center
2 right
3 left
4 center
5 left
6 right
7 left
You can explode and then strip:
df.tag_positions = df.tag_positions.explode().str.strip()
to get
id tag_positions
0 1 center
1 2 right
2 3 left
3 4 center
4 5 left
5 6 right
6 7 left
You can join:
df['tag_positions'].map(''.join)
Or:
df['tag_positions'].str.join('')
Try the str accessor chained with np.where (this needs import numpy as np):
df['tag_positions'] = np.where(df['tag_positions'].map(lambda x: isinstance(x, list)), df['tag_positions'].str[0], df['tag_positions'])
Also my favorite, explode:
df = df.explode('tag_positions')
You can convert with the apply method like this:
df.tag_positions = df.tag_positions.apply(lambda x: ''.join(x) if isinstance(x, list) else x)
If all the lists have a length of 1, you can also do this:
df.tag_positions = df.tag_positions.apply(lambda x: x[0] if isinstance(x, list) else x)
You can use apply and check if an item is an instance of a list, if yes, take the first element. and then you can just use str.strip to strip off the unwanted spaces.
df['tag_positions'].apply(lambda x: x[0] if isinstance(x, list) else x).str.strip()
OUTPUT
Out[42]:
0 center
1 right
2 left
3 center
4 left
5 right
6 left
Name: 0, dtype: object
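Putting the pieces together, here is a self-contained sketch that rebuilds the question's data and normalizes every cell to a stripped string. The helper name to_text is hypothetical, and joining multi-element lists with a space is an assumption, since the question only shows one-element lists.

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(1, 8),
    "tag_positions": ["center", "right", ["left"], ["center "],
                      [" left"], ["right"], "left"],
})

def to_text(value):
    """Turn a list into a single string, then trim surrounding whitespace."""
    if isinstance(value, list):
        value = " ".join(value)
    return value.strip()

df["tag_positions"] = df["tag_positions"].map(to_text)
print(df)
```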

Python replace entire string if it begin with certain character in dataframe

I have data that contains the string 'None ...' at random places. I am trying to replace a cell in the dataframe with an empty string only when it begins with 'None ...'. Here is what I tried, but I get errors like 'KeyError'.
df = pd.DataFrame({'id': [1,2,3,4,5],
                   'sub': ['None ... ','None ... test','math None ...','probability','chemistry']})
df.loc[df['sub'].str.replace('None ...','',1), 'sub'] = ''  # getting key error
output looking for: (I need to replace entire value in cell if 'None ...' is starting string. Notice, 3rd row shouldn't be replaced because 'None ...' is not starting character)
id sub
1
2
3 math None ...
4 probability
5 chemistry
You can use the below to identify the cells to replace and then assign them an empty value:
df.loc[df['sub'].str.startswith("None"), 'sub'] = ""
df.head()
id sub
0 1
1 2
2 3 math None ...
3 4 probability
4 5 chemistry
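You can simply replace 'None ...', using a regular expression anchored with ^ so that the replacement applies only to strings that start with None. Note that this removes only the 'None ...' prefix and keeps the rest of the cell. Pass regex=True explicitly, because from pandas 2.0 onwards str.replace treats the pattern as a literal string by default:
df['sub'] = df['sub'].str.replace(r'^None \.\.\.', '', n=1, regex=True)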
the output looks like this:
id sub
0 1
1 2 test
2 3 math None ...
3 4 probability
4 5 chemistry
df['sub'] = df['sub'].str.replace(r'[\w\s]*?(None \.\.\.)[\s\w]*?', '', n=1, regex=True)
Out:
sub
id
1
2 test
3
4 probability
5 chemistry
Use startswith to find the rows that need to be replaced, then blank them out with mask:
df['sub']=df['sub'].mask(df['sub'].str.startswith('None ... '),'')
df
Out[338]:
id sub
0 1
1 2
2 3 math None ...
3 4 probability
4 5 chemistry
First, str.replace returns the modified strings, and you are passing those strings to df.loc as row labels. That is why you get a KeyError.
Second, you can do this with:
df['sub']=df['sub'].apply(lambda x: '' if x.find('None')==0 else x)
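For reference, a complete sketch of the boolean-mask approach on the question's data. The indexer given to df.loc must be a Series of True/False values, not the replaced strings; na=False is added here as a precaution so that any missing value counts as "does not start with", which is an assumption beyond the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "sub": ["None ... ", "None ... test", "math None ...",
            "probability", "chemistry"],
})

# Boolean mask: True where the cell starts with the literal text 'None ...'
mask = df["sub"].str.startswith("None ...", na=False)

# Blank out the whole cell for the matching rows only.
df.loc[mask, "sub"] = ""

print(mask.tolist())       # [True, True, False, False, False]
print(df["sub"].tolist())  # ['', '', 'math None ...', 'probability', 'chemistry']
```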

Why does Python function return 1.0 (float) when `return 1` is specified?

I have a lot of strings, some of which consist of 1 sentence and some consisting of multiple sentences. My goal is to determine which one-sentence strings end with an exclamation mark '!'.
My code gives a strange result. Instead of returning '1' if found, it returns 1.0. I have tried: return int(1) but that does not help. I am fairly new to coding and do not understand, why is this and how can I get 1 as an integer?
'Sentences'
0 [This is a string., And a great one!]
1 [It's a wonderful sentence!]
2 [This is yet another string!]
3 [Strange strings have been written.]
4 etc. etc.
e = df['Sentences']
def Single(s):
    if len(s) == 1:  # Select the items with only one sentence
        count = 0
        for k in s:  # loop over every sentence
            if (k[-1]=='!'):  # check if sentence ends with '!'
                count = count+1
        if count == 1:
            return 1
        else:
            return ''
df['Single'] = e.apply(Single)
This returns the correct result, except that there should be '1' instead of '1.0'.
'Single'
0 NaN
1 1.0
2 1.0
3
4 etc. etc.
Why does this happen?
The reason is np.nan is considered float. This makes the series of type float. You cannot avoid this unless you want your column to be of type Object [i.e. anything]. This is inefficient and inadvisable, and I refuse to show you how to do this.
If there is an alternative value you can use instead of np.nan, e.g. 0, then there is a workaround. You can replace NaN values with 0 and then convert to int:
s = pd.Series([1, np.nan, 2, 3])
print(s)
# 0 1.0
# 1 NaN
# 2 2.0
# 3 3.0
# dtype: float64
s = s.fillna(0).astype(int)
print(s)
# 0 1
# 1 0
# 2 2
# 3 3
# dtype: int32
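Use astype(int), after replacing the missing values (astype(int) raises an error if the column still contains NaN or ''):
Ex:
df['Single'] = pd.to_numeric(e.apply(Single), errors='coerce').fillna(0).astype(int)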
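Since pandas 0.24 there is also a nullable integer dtype, spelled "Int64" with a capital I, which can hold missing values without falling back to float or object. A short sketch, assuming a reasonably recent pandas; the sample values are made up:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 1, np.nan])  # NaN forces float64
nullable = s.astype("Int64")           # integers, with <NA> for missing

print(s.dtype)         # float64
print(nullable.dtype)  # Int64
print(nullable)
```

With this dtype the column displays 1 rather than 1.0, and the missing entries stay missing instead of being turned into 0.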
