Related
I have a function that is able to create triples and relationships from text. However, when I create a list of a column that contains text and pass it through the function, it only processes the first row, or item of the list. Therefore, I am wondering how the whole list can be processed within this function. Maybe a for loop would work?
The following line contains the list
rez_dictionary = {'Decent Little Reader, Poor Tablet',
'Ok For What It Is',
'Too Heavy and Poor weld quality,',
'difficult mount',
'just got it installed'}
from transformers import pipeline
triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')
# We need to use the tokenizer manually since we need special tokens.
extracted_text = triplet_extractor.tokenizer.batch_decode([triplet_extractor(rez_dictionary, return_tensors=True, return_text=False)[0]["generated_token_ids"]])
print(extracted_text[0])
If anyone has a suggestion, I am looking forward to it.
Would it also be possible to get the output adjusted to the following format:
# Function to parse the generated text and extract the triplets
def extract_triplets(text):
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
    return triplets
extracted_triplets = extract_triplets(extracted_text[0])
print(extracted_triplets)
You are discarding all but the first entry of rez_dictionary by indexing [0] inside the batch_decode call:
triplet_extractor(rez_dictionary, return_tensors=True, return_text=False)[0]["generated_token_ids"]
Use a list comprehension instead:
from transformers import pipeline
rez = ['Decent Little Reader, Poor Tablet',
'Ok For What It Is',
'Too Heavy and Poor weld quality,',
'difficult mount',
'just got it installed']
triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')
model_output = triplet_extractor(rez, return_tensors=True, return_text=False)
extracted_text = triplet_extractor.tokenizer.batch_decode([x["generated_token_ids"] for x in model_output])
print("\n".join(extracted_text))
Output:
<s><triplet> Decent Little Reader <subj> Poor Tablet <obj> different from <triplet> Poor Tablet <subj> Decent Little Reader <obj> different from</s>
<s><triplet> Ok For What It Is <subj> film <obj> instance of</s>
<s><triplet> Too Heavy and Poor <subj> weld quality <obj> subclass of</s>
<s><triplet> difficult mount <subj> mount <obj> subclass of</s>
<s><triplet> 2008 Summer Olympics <subj> 2008 <obj> point in time</s>
Regarding the extension of the OP's question, OP wanted to know how to run the function extract_triplets. OP can simply do that via a for-loop:
for text in extracted_text:
    print(extract_triplets(text))
Output:
[{'head': 'Decent Little Reader', 'type': 'different from', 'tail': 'Poor Tablet'}, {'head': 'Poor Tablet', 'type': 'different from', 'tail': 'Decent Little Reader'}]
[{'head': 'Ok For What It Is', 'type': 'instance of', 'tail': 'film'}]
[{'head': 'Too Heavy and Poor', 'type': 'subclass of', 'tail': 'weld quality'}]
[{'head': 'difficult mount', 'type': 'subclass of', 'tail': 'mount'}]
[{'head': '2008 Summer Olympics', 'type': 'point in time', 'tail': '2008'}]
So I wrote this but it doesn't accomplish what I want to do. Basically, I want to replace the number in the second index with whatever the word is at that index in the content_list list.
content_list= ['abstract', 'bow', 'button', 'chiffon', 'collar', 'cotton', 'crepe', 'crochet', 'crop', 'embroidered', 'floral', 'floralprint', 'knit', 'lace', 'longsleeve', 'peasant', 'pink', 'pintuck', 'plaid', 'pleated', 'polkadot', 'printed', 'red', 'ruffle', 'sheer', 'shirt', 'sleeve', 'sleeveless', 'split', 'striped', 'summer', 'trim', 'tunic', 'v-neck', 'woven', '']
max=[['Img/img/Sheer_Pleated-Front_Blouse/img_00000001.jpg', '24'],['Img/img/Sheer_Pleated-Front_Blouse/img_00000001.jpg', '19,15,24']]
for l in max:
    e = l[1]
    f = e.split(",")
    for s in f:
        intt = int(s)
        rep = content_list[intt]
        #print(rep)
        e.replace(s, rep)
        #print(z)
print(max)
This is the output that I get:
[['Img/img/Sheer_Pleated-Front_Blouse/img_00000001.jpg', '24'], ['Img/img/Sheer_Pleated-Front_Blouse/img_00000001.jpg', '19,15,24']]
But this is what I want:
[['Img/img/Sheer_Pleated-Front_Blouse/img_00000001.jpg', 'sheer'], ['Img/img/Sheer_Pleated-Front_Blouse/img_00000001.jpg', 'pleated,peasant,sheer']]
First of all, max is a built-in function. I would highly recommend you check how to name variables in the future; shadowing a built-in may cause some big problems for you :).
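To illustrate the risk (hypothetical snippet): once the name max is rebound, the built-in function is unreachable until the new binding is deleted:

```python
max = [['Img/a.jpg', '24']]  # rebinds the built-in name `max` to a list

try:
    max(3, 7)  # fails now: the list is not callable
except TypeError as exc:
    print(exc)

del max           # removes our binding, making the built-in visible again
print(max(3, 7))  # 7
```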
You can also brute-force your way out here with something like this:
arr = [
['Img/img/Sheer_Pleated-Front_Blouse/img_00000001.jpg', '24'],
['Img/img/Sheer_Pleated-Front_Blouse/img_00000001.jpg', '19,15,24'],
]
for inner in arr:
    indexes = inner[1]
    words = []
    for number in indexes.split(","):
        words.append(content_list[int(number)])
    inner[1] = ",".join(words)  # join with commas to match the desired output
    print(inner)
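Under the same assumptions, the whole transformation can also be written as a single comprehension (word list and indices abbreviated here for illustration):

```python
content_list = ['abstract', 'bow', 'button', 'chiffon']  # abbreviated for the sketch
arr = [['img1.jpg', '2'], ['img2.jpg', '1,3']]

# map each comma-separated index string to its words, keeping the image path
arr = [[img, ",".join(content_list[int(i)] for i in idx.split(","))]
       for img, idx in arr]
print(arr)  # [['img1.jpg', 'button'], ['img2.jpg', 'bow,chiffon']]
```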
I would do something like this. I think there must be better options, but it works... so it's better than nothing:
content_list = ['abstract', 'bow', 'button', 'chiffon',
'collar', 'cotton', 'crepe', 'crochet',
'crop', 'embroidered', 'floral', 'floralprint',
'knit', 'lace', 'longsleeve', 'peasant', 'pink',
'pintuck', 'plaid', 'pleated', 'polkadot', 'printed',
'red', 'ruffle', 'sheer', 'shirt', 'sleeve', 'sleeveless',
'split', 'striped', 'summer', 'trim', 'tunic',
'v-neck', 'woven', '']
target_array = [['Img/img/Sheer_Pleated-Front_Blouse/img_00000001.jpg', '24'],
['Img/img/Sheer_Pleated-Front_Blouse/img_00000001.jpg', '19,15,24']]
for id_target, el in enumerate(target_array):
    for num in el[1].split(','):
        target_array[id_target][1] = target_array[id_target][1].replace(num,
                                                                        content_list[int(num)])
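One caveat with this replace-based variant: str.replace substitutes every occurrence, so it can corrupt the result when one index is a substring of another (hypothetical indices for illustration). Splitting on commas and rejoining, as in the other approaches, avoids this:

```python
indexes = '1,15'
# replacing '1' first also hits the '1' inside '15'
print(indexes.replace('1', 'bow'))  # 'bow,bow5'
```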
Here is the solution:
import numpy as np
c= ['abstract', 'bow', 'button', 'chiffon', 'collar', 'cotton', 'crepe', 'crochet', 'crop', 'embroidered', 'floral', 'floralprint', 'knit', 'lace', 'longsleeve', 'peasant', 'pink', 'pintuck', 'plaid', 'pleated', 'polkadot', 'printed', 'red', 'ruffle', 'sheer', 'shirt', 'sleeve', 'sleeveless', 'split', 'striped', 'summer', 'trim', 'tunic', 'v-neck', 'woven', '']
m=[['Img/img/Sheer_Pleated-Front_Blouse/img_00000001.jpg', '24'],['Img/img/Sheer_Pleated-Front_Blouse/img_00000001.jpg', '19,15,24']]
n = np.copy(m)
for i in range(np.size(m, 0)):  # iterate over the rows (axis 0)
    for j in range(np.size(m[i][1].split(','))):
        idx = m[i][1].split(',')[j]
        if j == 0:
            n[i][1] = c[int(idx)]
        else:
            n[i][1] += ',' + c[int(idx)]
print(n)
The output:
[['Img/img/Sheer_Pleated-Front_Blouse/img_00000001.jpg' 'sheer']
['Img/img/Sheer_Pleated-Front_Blouse/img_00000001.jpg'
'pleated,peasant,sheer']]
I am completely new to Python. I have been creating a vocabulary program, but I want it to mix the words, so also the ones behind the colon. So far it keeps showing only the ones in front of the colon. How can I achieve this?
print()
print('Welcome to german vocabulary quiz!')
import random
answer = input('Ready? ')
print('')
while answer == 'y' or 'yes':
    vocabDictionary = {
        'e Arbeit':'pracovat', 'oder':'nebo', 'r Abend':'večer', 'als':'jako', 'bitten':'prosit',
        'buchstabieren':'hláskovat','wessen':'čí','r Koffer':'kufr','wer':'kdo','wem':'komu',
        'wen':'koho','sehen':'vidět','e Tochter':'dcera','gruSen':'zdravit','warten':'čekat','sagen':'říkat',
        'e Lehrerin':'učitelka','r Lehrer':'učitel','schreiben':'napsat','zeigen':'ukázat','stehen':'stát','suchen':'hledat',
        'fahren':'jet','abfahren':'odjet','kommen':'přijít','hier'and'da':'tady','s Buch':'kniha',
        'r Zug':'vlak','offnen':'otevřít','schlieSen':'zavřít','ab/biegen':'odbočit','e Ampel':'semafor',
        'denn':'pak'and'potom','dorthin':'tam','až'and'dokud':'bis','zu':'k'and'ke','druben'and'gegenuber':'naproti','fremd':'cizí',
        'r FuSganger':'chodec','gerade':'právě','geradeaus':'rovně','e Halstestelle':'zastávka','r Hauptbahnhof':'hlavní nádraží',
        'ihnen':'vám','e Kreuzung':'křižovatka','links':'vlevo','nach links':'doleva','mit':'se'or's','nach':'do'or'po',
        'rechts':'vpravo','e StraSe':'ulice'and'silnice','uberqueren':'přejít','ungefahr':'přibližně'or'asi',
        'von hier':'odsud','weiter':'dál','zu FuS':'pěšky','aber':'ale','alles':'všechno','e Blume':'květina',
        'brav':'hodný','ein bisschen':'trochu','faul':'líný','fleiSig':'pilný','e Freizeit':'volný čas','r FuSball':'fotbal',
        'gern(e)':'rád','groS':'velký','haben':'mít','horen':'poslouchat','hubsch'and'schon':'hezký'or'pěkný','jetzt':'teď'or'nyní',
        'e Journalistin':'novinářka','s Kaninchen':'králík','lernen':'učit se','lieb':'milý','lustig':'veselý',
        'manchmal':'někdy'or'občas','nett':'milý'or'vlídný'or'hezký','noch':'ještě','nur':'jen','oft':'často',
        'recht':'skutečně'or'opravdu'or'velmi','sauber':'čistý','sauber machen':'uklízet','schauen':'dívat se'or'podívat se',
        'schlank':'štíhlý','sehr':'velmi','zehn':'deset','r Spaziergang':'procházka','einen Spaziergang machen':'jít na procházku',
        'spielen':'hrát','studieren':'studovat','s Tier':'zvíře','treiben':'zabývat se'or'provozovat','e Zeit':'čas',
        'Sport treiben':'sportovat','verheiratet':'ženatý'or'vdaná','r Unternhehmer':'podnikatel','zu Hause':'doma',
        'ziemlich':'pořádně'or'značně','zwanzig':'dvacet','aus':'z','dann':'potom','dich':'tebe'or'tě',
        'dir':'ti'or'tobě','e Entschuldigung':'omluva'or'prominutí','finden':'nacházet'or'shledávat','gehen':'jít',
        'geil':'báječný'or'skvělý'or'super','heiSen':'jmenovat se','r Herr':'pán','e Frau':'paní','r Nachname':'příjmení',
        'leider':'bohužel','r Tag':'den','viel':'hodně'and'hodně','was':'co','wie':'jak','woher':'odkud','wohnen':'bydlet',
        'Tschechien':'Česko'
    }
    keyword_list = list(vocabDictionary.keys())
    random.shuffle(keyword_list)
    score = 0
    for keyword in keyword_list:
        display = '{}'
        print(display.format(keyword))
        userInputAnswer = input(': ')
        print('')
vocabDictionary.keys() only returns the keys of the dictionary, which are the words before the colon.
To create a list containing both the keys and the values, you can use .values() to build a second list and then concatenate the two lists:
keyword_list1=list(vocabDictionary.keys())
keyword_list2= list(vocabDictionary.values())
keyword_list = keyword_list1 + keyword_list2
Full codes below:
print('Welcome to german vocabulary quiz!')
import random
answer = input('Ready? ')
print('')
while answer == 'y' or 'yes':
    vocabDictionary = {
        'e Arbeit':'pracovat', 'oder':'nebo', 'r Abend':'večer', 'als':'jako', 'bitten':'prosit',
        'buchstabieren':'hláskovat','wessen':'čí','r Koffer':'kufr','wer':'kdo','wem':'komu',
        'wen':'koho','sehen':'vidět','e Tochter':'dcera','gruSen':'zdravit','warten':'čekat','sagen':'říkat',
        'e Lehrerin':'učitelka','r Lehrer':'učitel','schreiben':'napsat','zeigen':'ukázat','stehen':'stát','suchen':'hledat',
        'fahren':'jet','abfahren':'odjet','kommen':'přijít','hier'and'da':'tady','s Buch':'kniha',
        'r Zug':'vlak','offnen':'otevřít','schlieSen':'zavřít','ab/biegen':'odbočit','e Ampel':'semafor',
        'denn':'pak'and'potom','dorthin':'tam','až'and'dokud':'bis','zu':'k'and'ke','druben'and'gegenuber':'naproti','fremd':'cizí',
        'r FuSganger':'chodec','gerade':'právě','geradeaus':'rovně','e Halstestelle':'zastávka','r Hauptbahnhof':'hlavní nádraží',
        'ihnen':'vám','e Kreuzung':'křižovatka','links':'vlevo','nach links':'doleva','mit':'se'or's','nach':'do'or'po',
        'rechts':'vpravo','e StraSe':'ulice'and'silnice','uberqueren':'přejít','ungefahr':'přibližně'or'asi',
        'von hier':'odsud','weiter':'dál','zu FuS':'pěšky','aber':'ale','alles':'všechno','e Blume':'květina',
        'brav':'hodný','ein bisschen':'trochu','faul':'líný','fleiSig':'pilný','e Freizeit':'volný čas','r FuSball':'fotbal',
        'gern(e)':'rád','groS':'velký','haben':'mít','horen':'poslouchat','hubsch'and'schon':'hezký'or'pěkný','jetzt':'teď'or'nyní',
        'e Journalistin':'novinářka','s Kaninchen':'králík','lernen':'učit se','lieb':'milý','lustig':'veselý',
        'manchmal':'někdy'or'občas','nett':'milý'or'vlídný'or'hezký','noch':'ještě','nur':'jen','oft':'často',
        'recht':'skutečně'or'opravdu'or'velmi','sauber':'čistý','sauber machen':'uklízet','schauen':'dívat se'or'podívat se',
        'schlank':'štíhlý','sehr':'velmi','zehn':'deset','r Spaziergang':'procházka','einen Spaziergang machen':'jít na procházku',
        'spielen':'hrát','studieren':'studovat','s Tier':'zvíře','treiben':'zabývat se'or'provozovat','e Zeit':'čas',
        'Sport treiben':'sportovat','verheiratet':'ženatý'or'vdaná','r Unternhehmer':'podnikatel','zu Hause':'doma',
        'ziemlich':'pořádně'or'značně','zwanzig':'dvacet','aus':'z','dann':'potom','dich':'tebe'or'tě',
        'dir':'ti'or'tobě','e Entschuldigung':'omluva'or'prominutí','finden':'nacházet'or'shledávat','gehen':'jít',
        'geil':'báječný'or'skvělý'or'super','heiSen':'jmenovat se','r Herr':'pán','e Frau':'paní','r Nachname':'příjmení',
        'leider':'bohužel','r Tag':'den','viel':'hodně'and'hodně','was':'co','wie':'jak','woher':'odkud','wohnen':'bydlet',
        'Tschechien':'Česko'
    }
    keyword_list1 = list(vocabDictionary.keys())
    keyword_list2 = list(vocabDictionary.values())
    keyword_list = keyword_list1 + keyword_list2
    random.shuffle(keyword_list)
    score = 0
    for keyword in keyword_list:
        display = '{}'
        print(display.format(keyword))
        userInputAnswer = input(': ')
        print('')
        try:
            if userInputAnswer == vocabDictionary[keyword]:
                score += 1
        except KeyError:
            try:
                if keyword == vocabDictionary[userInputAnswer]:
                    score += 1
            except KeyError:
                pass
    print(score)
Currently, you are only picking words from the keys (so the words before the colon).
You could try this:
keyword_list_keys=list(vocabDictionary.keys())
keyword_list_values=list(vocabDictionary.values())
keyword_list = keyword_list_keys + keyword_list_values
random.shuffle(keyword_list)
(Note that random.shuffle shuffles in place and returns None, so shuffling a temporary concatenation would be lost; assign the combined list to a name first.)
Then you would have to differentiate depending on the two cases, to find the matching key/value.
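A minimal sketch of that differentiation, assuming a vocabDictionary shaped like the one in the question (abbreviated here):

```python
import random

vocabDictionary = {'wer': 'kdo', 'wem': 'komu'}  # abbreviated for the sketch

prompts = list(vocabDictionary.keys()) + list(vocabDictionary.values())
random.shuffle(prompts)

for prompt in prompts:
    if prompt in vocabDictionary:
        # German prompt: the expected answer is the Czech value
        expected = vocabDictionary[prompt]
    else:
        # Czech prompt: the expected answer is the matching German key
        expected = next(k for k, v in vocabDictionary.items() if v == prompt)
    print(prompt, '->', expected)
```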
I previously used course_rolls(records) to turn the data into the dictionary below:
{'EMT001': {2286560}, 'FMKT01': {2547053}, 'CSC001': {2955520, 2656583}, 'MGM001': {2928707, 2606735}, 'MTH002': {2786372}, 'FCOM03': {2762453, 2955520, 2564885, 2606735}, 'FMCM02': {2955520, 2928707, 2656583}, 'ENG001': {2571096, 2564885}, 'MKT001': {2571096, 2656583},'AWA001': {2286560}, 'ACC002': {2762453}, 'FMTH01': {2571096}, 'EMT003': {2762453, 2656583}, 'MEA001': {2564885, 2606735}, 'FPHY01': {2564885}, 'FBIO01': {2547053}, 'MTH001': {2286560}, 'ECO002': {2928707, 2786372}, 'FCHM01': {2286560}, 'FCOM01': {2786372}, 'ENG002': {2762453}}
Records variable contains:
[(('EMT001', 'Engineering Mathematics 1'), (2286560, 'Dayton', 'Archambault')), (('FMKT01', 'Marketing'), (2547053, 'Vladimir', 'Zemanek')), (('CSC001', 'Computer Programming'), (2656583, 'Ronny', 'Ridley')), (('MGM001', 'Fundamentals of Management'), (2928707, 'Susanne', 'Eastland')), (('MTH002', 'Mathematics 2'), (2786372, 'Danella', 'Crabe')), (('FCOM03', 'Introduction to Computing'), (2564885, 'Hpone', 'Ganadry')), (('FCOM03', 'Introduction to Computing'), (2762453, 'Phelia', 'Pottle')), (('FMCM02', 'Mass Communication II (Film Studies)'), (2656583, 'Ronny', 'Ridley')), (('ENG001', 'Foundations of Engineering'), (2564885, 'Hpone', 'Ganadry')), (('MKT001', 'Principles of Marketing'), (2571096, 'Shoshanna', 'Shupe')), (('AWA001', 'Engineering Writing Skills'), (2286560, 'Dayton', 'Archambault')), (('FCOM03', 'Introduction to Computing'), (2606735, 'Aaren', 'Enns')), (('ACC002', 'Financial Accounting'), (2762453, 'Phelia', 'Pottle')), (('FMTH01', 'Advanced Mathematics I'), (2571096, 'Shoshanna', 'Shupe')), (('FCOM03', 'Introduction to Computing'), (2955520, 'Bjorn', 'Kakou')), (('EMT003', 'Mathematical Modelling and Computation'), (2762453, 'Phelia', 'Pottle')), (('MEA001', 'Mixed English Programme'), (2564885, 'Hpone', 'Ganadry')), (('MGM001', 'Fundamentals of Management'), (2606735, 'Aaren', 'Enns')), (('MEA001', 'Mixed English Programme'), (2606735, 'Aaren', 'Enns')), (('FPHY01', 'Physics'), (2564885, 'Hpone', 'Ganadry')), (('FBIO01', 'Introduction to Biology'), (2547053, 'Vladimir', 'Zemanek')), (('ENG001', 'Foundations of Engineering'), (2571096, 'Shoshanna', 'Shupe')), (('MKT001', 'Principles of Marketing'), (2656583, 'Ronny', 'Ridley')), (('MTH001', 'Mathematics 1'), (2286560, 'Dayton', 'Archambault')), (('ECO002', 'Introduction to Macroeconomics'), (2786372, 'Danella', 'Crabe')), (('FCHM01', 'Chemistry'), (2286560, 'Dayton', 'Archambault')), (('FCOM01', 'Communication Skills II'), (2786372, 'Danella', 'Crabe')), (('FMCM02', 'Mass Communication II (Film 
Studies)'), (2928707, 'Susanne', 'Eastland')), (('CSC001', 'Computer Programming'), (2955520, 'Bjorn', 'Kakou')), (('ENG002', 'Engineering Mechanics and Materials'), (2762453, 'Phelia', 'Pottle')), (('EMT003', 'Mathematical Modelling and Computation'), (2656583, 'Ronny', 'Ridley')), (('FMCM02', 'Mass Communication II (Film Studies)'), (2955520, 'Bjorn', 'Kakou')), (('ECO002', 'Introduction to Macroeconomics'), (2928707, 'Susanne', 'Eastland'))]
The program input is: clashes CSC001
Expected output (Command: clashes CSC001):
Clashes with CSC001:
In CSC001 and EMT003
2656583:
In CSC001 and FCOM03
2955520:
In CSC001 and FMCM02
2656583:
2955520:
In CSC001 and MKT001
2656583:
The result I get (Command: clashes CSC001):
Clashes with CSC001:
In CSC001 and FCOM03
2955520:
In CSC001 and FMCM02
2955520:
In CSC001 and EMT003
2656583:
In CSC001 and FMCM02
2656583:
In CSC001 and MKT001
2656583:
Program code (the clashes function needs fixing):
def parse_course_code(commands, records):
    if commands in course_rolls(records).keys():
        print('Clashes with {}:'.format(commands))
        return clashes(commands, records)
    else:
        print("Unknown course code")

def clashes(course_code, records):
    unformatted_course_rolls = course_rolls(records)
    for course, prior_student_number in sorted(unformatted_course_rolls.items()):
        for prior_student_number_single in list(prior_student_number):
            if course_code == course:
                for code, student_number in sorted(unformatted_course_rolls.items()):
                    for num in list(student_number):
                        if prior_student_number_single == num and course != code:
                            print("In {} and {} ".format(course, code))
                            print(" {}: \n".format(num))

def main():
    while True:
        command = input('Command: ')
        if command == 'quit':
            print('Adios')
            break
        else:
            parse_course_code(commands, records)

main()
The code you provided is very complicated to follow. You should try to use fewer for loops: deep nesting makes the code harder to read and takes longer to execute (if you were performing something at a larger scale). Take a look at my solution below; it prints the result as you would expect and is much easier to read and understand.
Here is the refactored code (note: since you didn't provide the course_rolls function, I replaced course_rolls(records) with the variable course_rolls_records so I could run it locally):
def parse_course_code(commands):
    if commands in course_rolls_records.keys():
        print('Clashes with {}:'.format(commands))
        return clashes(commands)
    else:
        print("Unknown course code")

def clashes(course_code):
    numbers_in_course = course_rolls_records[course_code]
    collisions = {}
    for course_check in course_rolls_records.keys():
        numbers_in_course_check = course_rolls_records[course_check]
        if course_check == course_code:
            # This is the course we are querying, we don't need to do anything
            pass
        elif numbers_in_course_check.intersection(numbers_in_course):
            # We enter here if there are common student numbers between courses
            # Note: if there is no intersection, bool of empty set is False
            collisions[course_check] = list(numbers_in_course_check.intersection(numbers_in_course))
            print("In {} and {} ".format(course_code, course_check))
            for num in collisions[course_check]:
                print(" {}: ".format(num))
            print('\n')

def main():
    while True:
        command = input('Command: ')
        if command == 'quit':
            print('Adios')
            break
        else:
            parse_course_code(command)

main()
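The heart of this refactor is set intersection; a minimal illustration using the CSC001 and FCOM03 rolls from the question's dictionary:

```python
csc001 = {2955520, 2656583}
fcom03 = {2762453, 2955520, 2564885, 2606735}

common = csc001.intersection(fcom03)  # equivalently: csc001 & fcom03
print(common)  # {2955520}

# an empty intersection is falsy, which is what the elif branch relies on
print(bool(csc001.intersection({111, 222})))  # False
```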
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Like https://stackoverflow.com/questions/1521646/best-profanity-filter, but for Python — and I’m looking for libraries I can run and control myself locally, as opposed to web services.
(And whilst it’s always great to hear your fundamental objections of principle to profanity filtering, I’m not specifically looking for them here. I know profanity filtering can’t pick up every hurtful thing being said. I know swearing, in the grand scheme of things, isn’t a particularly big issue. I know you need some human input to deal with issues of content. I’d just like to find a good library, and see what use I can make of it.)
I didn't find any Python profanity library, so I made one myself.
Parameters
filterlist
A list of regular expressions that match forbidden words. Please do not use \b; it will be inserted automatically depending on inside_words.
Example:
[r'bad', r'un\w+']
ignore_case
Default: True
Self-explanatory.
replacements
Default: "$#%-?!"
A string with characters from which the replacement strings will be randomly generated.
Examples: "%&$?!" or "-" etc.
complete
Default: True
Controls if the entire string will be replaced or if the first and last chars will be kept.
inside_words
Default: False
Controls if words are searched for inside other words too. Disabling this restricts matching to whole words (\b word-boundary anchors are added around the patterns).
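The difference comes down to the \b word-boundary anchors in the compiled pattern; a small standalone illustration:

```python
import re

# inside_words=False: \b anchors restrict matches to whole words
print(re.sub(r'\b(bad)\b', '---', 'bad badlike'))  # '--- badlike'

# inside_words=True: no anchors, so matches inside words too
print(re.sub(r'(bad)', '---', 'bad badlike'))      # '--- ---like'
```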
Module source
(examples at the end)
"""
Module that provides a class that filters profanities
"""
__author__ = "leoluk"
__version__ = '0.0.1'
import random
import re
class ProfanitiesFilter(object):
def __init__(self, filterlist, ignore_case=True, replacements="$#%-?!",
complete=True, inside_words=False):
"""
Inits the profanity filter.
filterlist -- a list of regular expressions that
matches words that are forbidden
ignore_case -- ignore capitalization
replacements -- string with characters to replace the forbidden word
complete -- completely remove the word or keep the first and last char?
inside_words -- search inside other words?
"""
self.badwords = filterlist
self.ignore_case = ignore_case
self.replacements = replacements
self.complete = complete
self.inside_words = inside_words
def _make_clean_word(self, length):
"""
Generates a random replacement string of a given length
using the chars in self.replacements.
"""
return ''.join([random.choice(self.replacements) for i in
range(length)])
def __replacer(self, match):
value = match.group()
if self.complete:
return self._make_clean_word(len(value))
else:
return value[0]+self._make_clean_word(len(value)-2)+value[-1]
def clean(self, text):
"""Cleans a string from profanity."""
regexp_insidewords = {
True: r'(%s)',
False: r'\b(%s)\b',
}
regexp = (regexp_insidewords[self.inside_words] %
'|'.join(self.badwords))
r = re.compile(regexp, re.IGNORECASE if self.ignore_case else 0)
return r.sub(self.__replacer, text)
if __name__ == '__main__':
f = ProfanitiesFilter(['bad', 'un\w+'], replacements="-")
example = "I am doing bad ungood badlike things."
print f.clean(example)
# Returns "I am doing --- ------ badlike things."
f.inside_words = True
print f.clean(example)
# Returns "I am doing --- ------ ---like things."
f.complete = False
print f.clean(example)
# Returns "I am doing b-d u----d b-dlike things."
arrBad = [
'2g1c',
'2 girls 1 cup',
'acrotomophilia',
'anal',
'anilingus',
'anus',
'arsehole',
'ass',
'asshole',
'assmunch',
'auto erotic',
'autoerotic',
'babeland',
'baby batter',
'ball gag',
'ball gravy',
'ball kicking',
'ball licking',
'ball sack',
'ball sucking',
'bangbros',
'bareback',
'barely legal',
'barenaked',
'bastardo',
'bastinado',
'bbw',
'bdsm',
'beaver cleaver',
'beaver lips',
'bestiality',
'bi curious',
'big black',
'big breasts',
'big knockers',
'big tits',
'bimbos',
'birdlock',
'bitch',
'black cock',
'blonde action',
'blonde on blonde action',
'blow j',
'blow your l',
'blue waffle',
'blumpkin',
'bollocks',
'bondage',
'boner',
'boob',
'boobs',
'booty call',
'brown showers',
'brunette action',
'bukkake',
'bulldyke',
'bullet vibe',
'bung hole',
'bunghole',
'busty',
'butt',
'buttcheeks',
'butthole',
'camel toe',
'camgirl',
'camslut',
'camwhore',
'carpet muncher',
'carpetmuncher',
'chocolate rosebuds',
'circlejerk',
'cleveland steamer',
'clit',
'clitoris',
'clover clamps',
'clusterfuck',
'cock',
'cocks',
'coprolagnia',
'coprophilia',
'cornhole',
'cum',
'cumming',
'cunnilingus',
'cunt',
'darkie',
'date rape',
'daterape',
'deep throat',
'deepthroat',
'dick',
'dildo',
'dirty pillows',
'dirty sanchez',
'dog style',
'doggie style',
'doggiestyle',
'doggy style',
'doggystyle',
'dolcett',
'domination',
'dominatrix',
'dommes',
'donkey punch',
'double dong',
'double penetration',
'dp action',
'eat my ass',
'ecchi',
'ejaculation',
'erotic',
'erotism',
'escort',
'ethical slut',
'eunuch',
'faggot',
'fecal',
'felch',
'fellatio',
'feltch',
'female squirting',
'femdom',
'figging',
'fingering',
'fisting',
'foot fetish',
'footjob',
'frotting',
'fuck',
'fucking',
'fuck buttons',
'fudge packer',
'fudgepacker',
'futanari',
'g-spot',
'gang bang',
'gay sex',
'genitals',
'giant cock',
'girl on',
'girl on top',
'girls gone wild',
'goatcx',
'goatse',
'gokkun',
'golden shower',
'goo girl',
'goodpoop',
'goregasm',
'grope',
'group sex',
'guro',
'hand job',
'handjob',
'hard core',
'hardcore',
'hentai',
'homoerotic',
'honkey',
'hooker',
'hot chick',
'how to kill',
'how to murder',
'huge fat',
'humping',
'incest',
'intercourse',
'jack off',
'jail bait',
'jailbait',
'jerk off',
'jigaboo',
'jiggaboo',
'jiggerboo',
'jizz',
'juggs',
'kike',
'kinbaku',
'kinkster',
'kinky',
'knobbing',
'leather restraint',
'leather straight jacket',
'lemon party',
'lolita',
'lovemaking',
'make me come',
'male squirting',
'masturbate',
'menage a trois',
'milf',
'missionary position',
'motherfucker',
'mound of venus',
'mr hands',
'muff diver',
'muffdiving',
'nambla',
'nawashi',
'negro',
'neonazi',
'nig nog',
'nigga',
'nigger',
'nimphomania',
'nipple',
'nipples',
'nsfw images',
'nude',
'nudity',
'nympho',
'nymphomania',
'octopussy',
'omorashi',
'one cup two girls',
'one guy one jar',
'orgasm',
'orgy',
'paedophile',
'panties',
'panty',
'pedobear',
'pedophile',
'pegging',
'penis',
'phone sex',
'piece of shit',
'piss pig',
'pissing',
'pisspig',
'playboy',
'pleasure chest',
'pole smoker',
'ponyplay',
'poof',
'poop chute',
'poopchute',
'porn',
'porno',
'pornography',
'prince albert piercing',
'pthc',
'pubes',
'pussy',
'queaf',
'raghead',
'raging boner',
'rape',
'raping',
'rapist',
'rectum',
'reverse cowgirl',
'rimjob',
'rimming',
'rosy palm',
'rosy palm and her 5 sisters',
'rusty trombone',
's&m',
'sadism',
'scat',
'schlong',
'scissoring',
'semen',
'sex',
'sexo',
'sexy',
'shaved beaver',
'shaved pussy',
'shemale',
'shibari',
'shit',
'shota',
'shrimping',
'slanteye',
'slut',
'smut',
'snatch',
'snowballing',
'sodomize',
'sodomy',
'spic',
'spooge',
'spread legs',
'strap on',
'strapon',
'strappado',
'strip club',
'style doggy',
'suck',
'sucks',
'suicide girls',
'sultry women',
'swastika',
'swinger',
'tainted love',
'taste my',
'tea bagging',
'threesome',
'throating',
'tied up',
'tight white',
'tit',
'tits',
'titties',
'titty',
'tongue in a',
'topless',
'tosser',
'towelhead',
'tranny',
'tribadism',
'tub girl',
'tubgirl',
'tushy',
'twat',
'twink',
'twinkie',
'two girls one cup',
'undressing',
'upskirt',
'urethra play',
'urophilia',
'vagina',
'venus mound',
'vibrator',
'violet blue',
'violet wand',
'vorarephilia',
'voyeur',
'vulva',
'wank',
'wet dream',
'wetback',
'white power',
'women rapping',
'wrapping men',
'wrinkled starfish',
'xx',
'xxx',
'yaoi',
'yellow showers',
'yiffy',
'zoophilia']
def profanityFilter(text):
    brokenStr1 = text.split()
    badWordMask = '!##$%!##$%^~!#%^~##$%!##$%^~!'
    for word in brokenStr1:
        if word in arrBad:
            print(word + ' <--Bad word!')
            text = text.replace(word, badWordMask[:len(word)])
    return text

print(profanityFilter("this thing sucks sucks sucks fucking stuff"))
You can add or remove entries from the bad words list, arrBad, as you please.
WebPurify is a Profanity Filter Library for Python
You could probably combine http://spambayes.sourceforge.net/ and http://www.cs.cmu.edu/~biglou/resources/bad-words.txt.
Profanity? What the f***'s that? ;-)
It will still take a couple of years before a computer will really be able to recognize swearing and cursing and it is my sincere hope that people will have understood by then that profanity is human and not "dangerous."
Instead of a dumb filter, have a smart human moderator who can balance the tone of discussion as appropriate. A moderator who can detect abuse like:
"If you were my husband, I'd poison your tea." - "If you were my wife, I'd drink it."
(that was from Winston Churchill, btw.)
It's possible for users to work around this, of course, but it should do a fairly thorough job of removing profanity:
import re

def remove_profanity(s):
    def repl(word):
        m = re.match(r"(\w+)(.*)", word)
        if not m:
            return word
        word = "Bork" if m.group(1)[0].isupper() else "bork"
        word += m.group(2)
        return word
    return " ".join([repl(w) for w in s.split(" ")])

print(remove_profanity("You just come along with me and have a good time. The Galaxy's a fun place. You'll need to have this fish in your ear."))