How do I make the pandas explode function work? - python

I was trying to make a project using Reddit submission (post) titles.
I'll try to give some context to make it more understandable.
I added submissions that meet a certain criterion to a list. Let's call that list "data".
This was what was happening: data.append([title, score])
"title" is a string and "score" is an integer.
df = pd.DataFrame(data)
df.columns = ["title", "score"]
dfr = clean_data(df, "comments", "cleaned titles")
The "cleaned titles" column just holds strings, so in order to use .explode() I tried to convert them into lists:
dfr['cleaned titles'] = dfr['cleaned titles'].str.split(",")
dfr['cleaned titles'] = dfr['cleaned titles'].explode()
And... explode does nothing?
Here is clean_data() in case it is needed:
def clean_data(df, col, clean_col):
    # change to lower case and remove spaces on either side
    df[clean_col] = df[col].apply(lambda x: x.lower().strip())
    # collapse extra spaces in between
    df[clean_col] = df[clean_col].apply(lambda x: re.sub(' +', ' ', x))
    # drop tokens containing '&' or '![' (HTML entities and image markup)
    df[clean_col] = df[clean_col].apply(lambda x: ' '.join([word for word in x.split() if '&' not in word]))
    df[clean_col] = df[clean_col].apply(lambda x: ' '.join([word for word in x.split() if '![' not in word]))
    # remove punctuation
    df[clean_col] = df[clean_col].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))
    # remove stopwords and very short words
    df[clean_col] = df[clean_col].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
    df[clean_col] = df[clean_col].apply(lambda x: ' '.join([word for word in x.split() if len(word) > 2]))
    # return the frame so that dfr = clean_data(...) gets the result
    return df
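For what it's worth, a likely reason explode "does nothing": after cleaning, the titles contain no commas (punctuation was stripped), so str.split(",") yields one-element lists. A minimal sketch with made-up data, splitting on whitespace instead and calling explode() on the DataFrame so the index can grow:

```python
import pandas as pd

# Minimal sketch with made-up data (not the Reddit data): split the
# cleaned strings on whitespace, then call explode() on the DataFrame.
df = pd.DataFrame({"title": ["hello world", "foo bar"], "score": [10, 20]})
df["words"] = df["title"].str.split()   # one list of words per row
exploded = df.explode("words")          # one row per word, score repeated
print(exploded["words"].tolist())       # → ['hello', 'world', 'foo', 'bar']
```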

Related

Avoid writing multiple lines (same piece of code) by using a one line function

I would like to write the below piece of code in one line.
All 9 lines are the same (except for the column name, e.g. Two, Three, Four, etc.).
Below is my code:
Note: 'df' is my data frame's name.
df['Two'] = df['Two'].map(lambda x: re.sub(r'\W+', ' ', x))
df['Three'] = df['Three'].map(lambda x: re.sub(r'\W+', ' ', x))
df['Four'] = df['Four'].map(lambda x: re.sub(r'\W+', ' ', x))
df['Five'] = df['Five'].map(lambda x: re.sub(r'\W+', ' ', x))
df['Six'] = df['Six'].map(lambda x: re.sub(r'\W+', ' ', x))
df['Seven'] = df['Seven'].map(lambda x: re.sub(r'\W+', ' ', x))
df['Eight'] = df['Eight'].map(lambda x: re.sub(r'\W+', ' ', x))
df['Nine'] = df['Nine'].map(lambda x: re.sub(r'\W+', ' ', x))
df['Ten'] = df['Ten'].map(lambda x: re.sub(r'\W+', ' ', x))
I tried a for loop, but I was only able to loop over integers; I could not get the column names in the loop.
I need one line of code to execute all these lines, because in the future the columns may increase and I cannot keep adding lines.
df.columns holds the column names of the df object. So iterate over the column names, dynamically select each column with df[col], and apply your map function:
for col in df.columns:
    df[col] = df[col].map(lambda x: re.sub(r'\W+', ' ', x))
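If only some columns should be cleaned (the loop above touches every column), the same idea can run over an explicit column list instead of df.columns; a small sketch with made-up data:

```python
import re
import pandas as pd

# Made-up frame; only the columns named in cols are cleaned.
df = pd.DataFrame({"Two": ["a,b", "c;d"], "Three": ["e.f", "g-h"], "Keep": ["x!y", "z?w"]})
cols = ["Two", "Three"]
for col in cols:
    df[col] = df[col].map(lambda x: re.sub(r'\W+', ' ', x))
print(df["Two"].tolist())   # → ['a b', 'c d']
print(df["Keep"].tolist())  # → ['x!y', 'z?w'] (untouched)
```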

How to remove common words from list of lists in Python?

I have a large number of "groups" of words. If any of the words from one group appears in both column A and column B, I want to remove that group's words from the two columns. How do I loop over all the groups (i.e. over the sublists in the list)?
The flawed code below only removes the common words from the last group, not from all three groups (lists) in stuff. [I first create an indicator for whether one of the words from the group is in the string, and then create another indicator for whether both strings have a word from the group. Only for the pairs of A and B where both have a word from the group do I remove that group's words.]
How do I correctly specify the loop?
EDIT:
In my suggested code, each loop restarted from the original columns instead of looping over the columns with words removed by the previous group(s).
The suggested solutions are more elegant and neat, but they remove the words even when they are part of another word (e.g. the word 'foo' is correctly removed from 'foo hello' but incorrectly also removed from 'foobar').
# Input data:
data = {'A': ['summer time third grey abc', 'yellow sky hello table', 'fourth autumnwind'],
        'B': ['defg autumn times fourth table', 'not red skies second garnet', 'first blue chair winter']}
df = pd.DataFrame(data, columns=['A', 'B'])
A B
0 summer time third grey abc defg autumn times fourth table
1 yellow sky hello table not red skies second garnet
2 fourth autumnwind first blue chair winter
# Groups of words to be removed:
colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue', 'black', 'yellow', 'green', 'grey']
seasons = ['summer times', 'summer time', 'autumn times', 'autumn time', 'spring', 'summer', 'winter', 'autumn']
numbers = ['first', 'second', 'third', 'fourth']
stuff = [colors, seasons, numbers]
# Code below only removes the last list in stuff (numbers):
def fA(S, y):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            y = 1
    return y

def fB(T, y):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', T):
            y = 1
    return y

def fARemove(S):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            S = re.sub(r'\b{}\b'.format(re.escape(word)), ' ', S)
    return S

def fBRemove(T):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', T):
            T = re.sub(r'\b{}\b'.format(re.escape(word)), ' ', T)
    return T

for listed in stuff:
    df['A_Ind'] = 0
    df['B_Ind'] = 0
    df['A_Ind'] = df.apply(lambda x: fA(x.A, x.A_Ind), axis=1)
    df['B_Ind'] = df.apply(lambda x: fB(x.B, x.B_Ind), axis=1)
    df['inboth'] = 0
    df.loc[((df.A_Ind == 1) & (df.B_Ind == 1)), 'inboth'] = 1
    df['A_new'] = df['A']
    df['B_new'] = df['B']
    df.loc[df.inboth == 1, 'A_new'] = df.apply(lambda x: fARemove(x.A), axis=1)
    df.loc[df.inboth == 1, 'B_new'] = df.apply(lambda x: fBRemove(x.B), axis=1)
    del df['inboth']
    del df['A_Ind']
    del df['B_Ind']

df['A_new'] = df['A_new'].str.replace(r'\s{2,}', ' ', regex=True)
df['A_new'] = df['A_new'].str.strip()
df['B_new'] = df['B_new'].str.replace(r'\s{2,}', ' ', regex=True)
df['B_new'] = df['B_new'].str.strip()
Expected output is:
A_new B_new
0 grey abc defg table
1 hello table not second garnet
2 autumnwind blue chair winter
import re

flatten_list = lambda l: [item for subl in l for item in subl]

def remove_recursive(s, l):
    while len(l) > 0:
        s = s.replace(l[0], '')
        l = l[1:]
    return re.sub(r' +', ' ', s).strip()

df['A_new'] = df.apply(lambda x: remove_recursive(x.A, flatten_list(
    [l for l in stuff if (len([e for e in l if e in x.A]) > 0 and len([e for e in l if e in x.B]) > 0)])), axis=1)
df['B_new'] = df.apply(lambda x: remove_recursive(x.B, flatten_list(
    [l for l in stuff if (len([e for e in l if e in x.A]) > 0 and len([e for e in l if e in x.B]) > 0)])), axis=1)
df.head()
# A_new B_new
# 0 time grey abc defg table
# 1 hello table not second garnet
# 2 wind blue chair
This is similar to the code in the comments: it uses a helper that strips the matching words one by one and a flattened list to figure out which word lists match in both columns.
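As the question's EDIT points out, plain str.replace also strips 'foo' out of 'foobar'. A hedged sketch of a word-boundary variant of the removal helper (the helper name is made up):

```python
import re

# Remove whole words only: \b ensures 'foo' matches in 'foo hello'
# but not inside 'foobar'.
def remove_words(s, words):
    for w in words:
        s = re.sub(r'\b{}\b'.format(re.escape(w)), ' ', s)
    return re.sub(r' +', ' ', s).strip()

print(remove_words('foo hello foobar', ['foo']))  # → 'hello foobar'
```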
This needs Python 3.7+ to work (otherwise more code is needed). Based on your list of keywords, I think you are trying to prioritize multi-word matching.
dummy = 0

def splitter(text):
    global dummy
    text = text.strip()
    if not text:
        return []
    for n, s in enumerate(stuff):
        for keyword in s:
            p = text.find(keyword)
            if p >= 0:
                return splitter(text[:p]) + [((dummy, keyword), n)] + splitter(text[p + len(keyword):])
    else:
        return [((dummy, text), -1)]

def remover(row):
    A = dict(splitter(row['A']))
    B = dict(splitter(row['B']))
    s = set(A.values()).intersection(set(B.values()))
    return [' '.join([k[1] for k, v in A.items() if v < 0 or v not in s]),
            ' '.join([k[1] for k, v in B.items() if v < 0 or v not in s])]

pd.concat([df, pd.DataFrame(df.apply(remover, axis=1).to_list(), columns=['newA', 'newB'])], axis=1)
Below is the code from the original question using the regex r'\b{}\b', corrected to loop over the latest strings rather than the original strings.
# Groups of words to be removed:
colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue', 'black', 'yellow', 'green', 'grey']
seasons = ['summer times', 'summer time', 'autumn times', 'autumn time', 'spring', 'summer', 'winter', 'autumn']
numbers = ['first', 'second', 'third', 'fourth']
stuff = [colors, seasons, numbers]

df['A_new'] = df['A']
df['B_new'] = df['B']

def f_indicator(S, y):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            y = 1
    return y

def fRemove(S):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            S = re.sub(r'\b{}\b'.format(re.escape(word)), ' ', S)
    return S

for listed in stuff:
    df['A_Ind'] = 0
    df['B_Ind'] = 0
    df['A_Ind'] = df.apply(lambda x: f_indicator(x.A_new, x.A_Ind), axis=1)
    df['B_Ind'] = df.apply(lambda x: f_indicator(x.B_new, x.B_Ind), axis=1)
    df['inboth'] = 0
    df.loc[((df.A_Ind == 1) & (df.B_Ind == 1)), 'inboth'] = 1
    df.loc[df.inboth == 1, 'A_new'] = df.apply(lambda x: fRemove(x.A_new), axis=1)
    df.loc[df.inboth == 1, 'B_new'] = df.apply(lambda x: fRemove(x.B_new), axis=1)
    del df['inboth']
    del df['A_Ind']
    del df['B_Ind']

df['A_new'] = df['A_new'].str.replace(r'\s{2,}', ' ', regex=True)
df['A_new'] = df['A_new'].str.strip()
df['B_new'] = df['B_new'].str.replace(r'\s{2,}', ' ', regex=True)
df['B_new'] = df['B_new'].str.strip()
del df['A']
del df['B']
print(df)
Output:
A_new B_new
0 grey abc defg table
1 hello table not second garnet
2 autumnwind blue chair winter

Split a Python string (sentence) with appended white spaces

Is it possible to split a Python string (sentence) so that the whitespace between words is retained in the output, appended to the preceding word in each split substring?
For example:
given_string = 'This is my string!'
output = ['This ', 'is ', 'my ', 'string!']
I avoid regexes most of the time, but here it makes it really simple:
import re
given_string = 'This is my string!'
res = re.findall(r'\w+\W?', given_string)
# res ['This ', 'is ', 'my ', 'string!']
Maybe this will help?
>>> given_string = 'This is my string!'
>>> l = given_string.split(' ')
>>> l = [item + ' ' for item in l[:-1]] + l[-1:]
>>> l
['This ', 'is ', 'my ', 'string!']
Just split and add the whitespace back:
a = " "
output = [e + a for e in given_string.split(a) if e]
output[-1] = output[-1][:-1]
The last line removes the trailing space that was added after the final word.
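Another possible one-liner, assuming a word is any run of non-whitespace: capture each word together with whatever whitespace follows it.

```python
import re

given_string = 'This is my string!'
res = re.findall(r'\S+\s*', given_string)  # word plus its trailing whitespace
print(res)  # → ['This ', 'is ', 'my ', 'string!']
```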

How to replace strings that are similar

I am creating some code that will replace spaces.
I want a double space to turn into a single space and a single space to become nothing.
Example:
string = "t e s t  t e s t"
string = string.replace('  ', ' ').replace(' ', '')
print(string)
The output is "testtest" because it ends up replacing all the spaces.
How can I make the output "test test"?
Thanks
A regular expression approach is doubtless possible, but for a quick solution, first split on the double space, then rejoin on a single space after using a comprehension to remove the single spaces in each of the elements of the split:
>>> string = "t e s t  t e s t"
>>> ' '.join(word.replace(' ', '') for word in string.split('  '))
'test test'
Just another idea:
>>> s = 't e s t  t e s t'
>>> s.replace(' ', '  ').replace('   ', '').replace('  ', '')
'test test'
Seems to be faster:
>>> timeit(lambda: s.replace(' ', '  ').replace('   ', '').replace('  ', ''))
2.7822862677683133
>>> timeit(lambda: ' '.join(w.replace(' ', '') for w in s.split('  ')))
7.702567737466012
And regex (at least this one) is shorter but a lot slower:
>>> timeit(lambda: re.sub(' ( ?)', r'\1', s))
37.2261058654488
I like this regex solution because you can easily read what's going on:
>>> import re
>>> string = "t e s t  t e s t"
>>> re.sub(' {1,2}', lambda m: '' if m.group() == ' ' else ' ', string)
'test test'
We search for one or two spaces, and substitute one space with the empty string but two spaces with a single space.
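For clarity, the same substitution as a self-contained runnable snippet (the input deliberately contains a double space between the two words):

```python
import re

string = "t e s t  t e s t"  # double space between the two words
# one space → '', two spaces → ' ' (the quantifier is greedy, so a
# double space is matched as one two-space run)
result = re.sub(' {1,2}', lambda m: '' if m.group() == ' ' else ' ', string)
print(result)  # → 'test test'
```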

Python: Splitting a string into words, saving separators

I have a string:
'Specified, if char, else 10 (default).'
I want to split it into two tuples:
words = ('Specified', 'if', 'char', 'else', '10', 'default')
separators = (',', ' ', ',', ' ', ' (', ').')
Does anyone have a quick solution for this?
PS: the symbol '-' is a word separator, not part of the word.
import re
line = 'Specified, if char, else 10 (default).'
words = re.split(r'\)?[, .]\(?', line)
# words = ['Specified', '', 'if', 'char', '', 'else', '10', 'default', '']
separators = re.findall(r'\)?[, .]\(?', line)
# separators = [',', ' ', ' ', ',', ' ', ' ', ' (', ').']
If you really want tuples, pass the results to tuple(). If you do not want words to contain the empty entries (from between the commas and spaces), use the following:
words = [x for x in re.split(r'\)?[, .]\(?', line) if x]
or
words = tuple(x for x in re.split(r'\)?[, .]\(?', line) if x)
You can use a regex for that.
>>> a = 'Specified, if char, else 10 (default).'
>>> from re import split
>>> split(r",? ?\(?\)?\.?", a)
['Specified', 'if', 'char', 'else', '10', 'default', '']
But with this solution you have to write the pattern yourself. If you want to use that separators tuple, you would have to convert its contents into a regex pattern.
Regex to find all separators (assuming a separator is anything that's not alphanumeric):
import re
re.findall(r'[^\w]', string)
I would probably first .split() on spaces into a list, then iterate through the list, using a regex to check for characters after the word boundary.
import re

s = 'Specified, if char, else 10 (default).'
words = s.split()
separators = []
finalwords = []
for word in words:
    match = re.search(r'(\w+)\b(.*)', word)
    sep = '' if match is None else match.group(2)
    finalwords.append(match.group(1))
    separators.append(sep)
To get both separators and words in one pass, you could use findall as follows:
import re

line = 'Specified, if char, else 10 (default).'
words = []
seps = []
for w, s in re.findall(r"(\w*)([), .(]+)", line):
    words.append(w)
    seps.append(s)
Here's my crack at it:
>>> p = re.compile(r'(\)? *[,.]? *\(?)')
>>> tmp = p.split('Specified, char, else 10 (default).')
>>> words = tmp[::2]
>>> separators = tmp[1::2]
>>> print words
['Specified', 'char', 'else', '10', 'default', '']
>>> print separators
[', ', ', ', ' ', ' (', ').']
The only problem is you can have a '' at the end or the beginning of words if there is a separator at the beginning/end of the sentence without anything before/after it. However, that is easily checked for and eliminated.
