String replace with multiple items - python

I have two pandas dataframes. One contains text, the other a set of terms I'd like to search for and replace within the text. I have created a loop that replaces each term in the text; however, it's very slow, especially since it works over a large corpus.
My question is:
Is there a more efficient solution that replicates my method below?
Example text dataframe:
import pandas as pd

d = {'ID': [1, 2, 3], 'Text': ['here is some random text', 'random text here', 'more random text']}
text_df = pd.DataFrame(data=d)
Example terms dataframe:
d = {'Replace_item': ['<RANDOM_REPLACED>', '<HERE_REPLACED>', '<SOME_REPLACED>'], 'Text': ['random', 'here', 'some']}
replace_terms_df = pd.DataFrame(data=d)
Example of current solution:
def find_replace(text, terms):
    for _, row in terms.iterrows():
        term = row['Text']
        item = row['Replace_item']
        text.Text = text.Text.str.replace(term, item)
    return text

find_replace(text_df, replace_terms_df)
Please let me know if anything above requires clarification. Thank you.

Using zip + str.replace on the three columns, and assigning the results to the column at once, reduced the time by 50% (~400us to ~200us using %timeit):
text_df['Text'] = [z.replace(x, y) for (x, y, z) in zip(replace_terms_df.Text, replace_terms_df.Replace_item, text_df.Text)]
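Note that zip pairs row i of the terms with row i of the texts, so each text only has its one paired term replaced. If instead every term should be replaced in every text, a combined regex keeps it vectorized (a sketch, not the answer's method):
import re

# Map each search term to its replacement, build one alternation pattern,
# and replace all terms in every row in a single pass.
mapping = dict(zip(replace_terms_df.Text, replace_terms_df.Replace_item))
pattern = '|'.join(map(re.escape, mapping))
text_df['Text'] = text_df['Text'].str.replace(pattern, lambda m: mapping[m.group()], regex=True)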


Is it possible to change cell value by dictionary in Pandas DataFrame by iteration over list in the cell

UPDATED
Pandas DataFrame: I have a column that contains lists like the below in its cells
df_lost['Article']
Out[6]:
37774 186-2, 185-3, 185-2
37850 358-1, 358-4
37927
38266 111-2
38409 111-2
38508
38519 185-1
41161 185-4, 357-1
42948 185-1
Name: Article, dtype: object
For each entry like '182-2', '111-2' etc. I have a dictionary like
aDict = {'111-2': 'Text-1', '358-1': 'Text-2', ...}
Is it possible to iterate over the list in the df cells and change each value to the corresponding value from the dictionary?
Expected result:
37774 ['Text 1, Text 2, Text -5']
....
I have tried to use the map function
df['Article'] = df['Article'].map(aDict)
but it doesn't work with the list in a cell. As a temp solution, I have created the dictionary
aDict = {'186-2, 185-3, 185-2': 'Test - 1, test -2, test -3', ...}
This works, but the number of combinations is extremely big.
You need to split the string at the comma delimiters, and then look up each element in the dictionary. You also have to index the list to get the string out of the first element, and wrap the result string back into a list.
def convert_string(string_list, mapping):
    items = string_list[0].split(', ')
    new_items = [mapping.get(i, i) for i in items]
    return [', '.join(new_items)]

df['Article'] = df['Article'].map(lambda cell: convert_string(cell, aDict))
I would use a regex and str.replace here:
import re

aDict = {'111-2': 'Text1', '358-1': 'Text 2'}
pattern = '|'.join(map(re.escape, aDict))
df['Article'] = df['Article'].str.replace(pattern, lambda m: aDict[m.group()], regex=True)
NB. If the dictionary keys can overlap (ab/abc), then they should be sorted by decreasing length to generate the pattern.
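For example, a minimal sketch of the longest-first ordering:
# Sort keys by decreasing length so 'abc' is tried before 'ab'.
pattern = '|'.join(map(re.escape, sorted(aDict, key=len, reverse=True)))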
Output:
Article
37774 186-2, 185-3, 185-2
37850 Text 2, 358-4
37927
38266 Text1
38409 Text1
38508
38519 185-1
41161 185-4, 357-1
42948 185-1

Isolating the Sentence in which a Term appears

I have the following script that does the following:
Extracts all text from a PowerPoint (all separated by a ":::")
Compares each term in my search term list to the text and isolates just those lines of text that contain one or more of the terms
Creates a dataframe for the term + file which that term appeared
Iterates through each PowerPoint for the given folder
I am hoping to adjust this to also include the sentence in which the term appears (i.e. the entire content between the ::: before and the ::: after the term).
import os
import pandas as pd
from pptx import Presentation

end = r'C:\Users\xxx\Table Lookup.xlsx'
rfps = r'C:\Users\xxx\Folder1'
ls = os.listdir(rfps)
ppt = [s for s in ls if '.ppt' in s]
files = []
text = []
for p in ppt:
    try:
        prs_text = []
        prs = Presentation(os.path.join(rfps, p))
        for slide in prs.slides:
            for shape in slide.shapes:
                if hasattr(shape, "text"):
                    prs_text.append(shape.text)
        prs_text = ':::'.join(prs_text)
        files.append(p)
        text.append(prs_text)
    except Exception:
        print("Failed: " + str(p))
agg = pd.DataFrame()
agg['File'] = files
agg['Unstructured'] = text
agg['Unstructured'] = agg['Unstructured'].str.lower()
terms = ['test', 'testing']
a = [(x, z, i) for x, z, y in zip(agg['File'], agg['Unstructured'], agg['Unstructured']) for i in terms if i in y]
# how do I also include the sentence where this term appears
onepager = pd.DataFrame(a, columns=['File', 'Unstructured', 'Term'])  # will need to add a column here
onepager = onepager.drop_duplicates(keep="first")
1 line sample of agg:
File | Unstructured
File1.pptx | competitive offerings:::real-time insights and analyses for immediate use:::disruptive “moves”:::deeper strategic insights through analyses generated and assessed over time:::launch new business models:::enter new markets::::::::::::internal data:::external data:::advanced computing capabilities:::insights & applications::::::::::::::::::machine learning
write algorithms that continue to “learn” or test and improve themselves as they ingest data and identify patterns:::natural language processing
allow interactions between computers and human languages using voice and/or text. machines directly interact, analyze, understand, and reproduce information:::intelligent automation
Adjustment based on input:
onepager = pd.DataFrame(a, columns=['File', 'Unstructured', 'Term'])
for t in terms:
    onepager['Sentence'] = onepager["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find(t)) + 3: x.find(":::", x.find(t))])
To find the sentence containing the word "test", try:
>>> agg["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find("test")) + 3: x.find(":::", x.find("test"))])
Looping through your terms:
onepager = pd.DataFrame(a, columns=['File', 'Unstructured', 'Term'])
for t in terms:
    onepager[t] = onepager["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find(t)) + 3: x.find(":::", x.find(t))])
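A more readable alternative (a hedged sketch with a hypothetical helper, not part of the answer above) is to split on the delimiter and return the first segment containing the term:
# Hypothetical helper: return the first ':::'-separated segment that
# contains the term, or '' if the term never appears.
def segment_for_term(unstructured, term):
    for segment in unstructured.split(':::'):
        if term in segment:
            return segment
    return ''

for t in terms:
    onepager[t] = onepager['Unstructured'].apply(lambda x: segment_for_term(x, t))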

How to avoid for loops and iterate through pandas dataframe properly?

I have this code that I've been struggling for a while to optimize.
My dataframe comes from a CSV file with 2 columns; the second column contains the texts.
I have a function summarize(text, n) that needs a single text and an integer as input.
from collections import defaultdict
from nltk.tokenize import sent_tokenize, word_tokenize

def summarize(text, n):
    sents = sent_tokenize(text)  # split the text into tokenized sentences
    # Check that the review has at least as many sentences as the requested summary length
    assert n <= len(sents)
    list_sentences = [word_tokenize(s.lower()) for s in sents]  # word-tokenized sentences
    frequency = calculate_freq(list_sentences)  # word frequencies across all the sentences
    ranking = defaultdict(int)
    for i, sent in enumerate(list_sentences):
        for w in sent:
            if w in frequency:
                ranking[i] += frequency[w]
    # Call the rank function to get the indices of the highest-ranking sentences
    sents_idx = rank(ranking, n)
    # Return the best choices
    return [sents[j] for j in sents_idx]
To summarize() all the texts, I first iterate through my dataframe to create a list of all the texts, which I then iterate over again to send them one by one to the summarize() function and collect each summary. These for loops make my code really, really slow, but I haven't been able to figure out a more efficient way, and I would greatly appreciate any suggestions.
data = pd.read_csv('dataframe.csv')
text = data.iloc[:, 2]  # select the text column
list_of_strings = []
for t in text:
    list_of_strings.append(t)  # creating a list of all the texts
our_summary = []
for s in list_of_strings:
    for f in summarize(s, 1):
        our_summary.append(f)
ours = pd.DataFrame({"our_summary": our_summary})
EDIT:
The other two functions are:
def calculate_freq(list_sentences):
    # our_stopwords, min_freq and max_freq are defined elsewhere in the asker's code
    frequency = defaultdict(int)
    for sentence in list_sentences:
        for word in sentence:
            if word not in our_stopwords:
                frequency[word] += 1
    # We want to filter out the words with frequency below 0.1 or above 0.9 (once normalized)
    if frequency.values():
        max_word = float(max(frequency.values()))
    else:
        max_word = 1
    for w in list(frequency.keys()):  # copy the keys: deleting while iterating raises a RuntimeError
        frequency[w] = frequency[w] / max_word  # normalize
        if frequency[w] <= min_freq or frequency[w] >= max_freq:
            del frequency[w]  # filter
    return frequency
from heapq import nlargest

def rank(ranking, n):
    # return the indices of the n sentences with the highest ranking
    return nlargest(n, ranking, key=ranking.get)
Input text: Recipes are easy and the dogs love them. I would buy this book again and again. Only thing is that the recipes don't tell you how many treats they make, but I suppose that's because you could make them all different sizes. Great buy!
Output text: I would buy this book again and again.
Have you tried something like this?
import pandas as pd

# Test data
df = pd.DataFrame({'ASIN': [0, 1], 'Summary': ['This is the first text', 'Second text']})

# Example function
def summarize(text, n=5):
    """A very basic summary"""
    return (text[:n] + '..') if len(text) > n else text

# Applying the function to the text column
df['Result'] = df['Summary'].map(summarize)
#    ASIN                 Summary   Result
# 0     0  This is the first text  This ..
# 1     1             Second text  Secon..
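The same pattern works with the asker's real summarize (a sketch: summarize returns a list of sentences, so take the first element when n=1, and the texts are assumed to be in the third column, as in the question):
data = pd.read_csv('dataframe.csv')
data['our_summary'] = data.iloc[:, 2].apply(lambda t: summarize(t, 1)[0])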
Such a long story...
I'm going to assume that, since you are performing a text frequency analysis, the order of reviewText doesn't matter. If that is the case:
Mega_String = ' '.join(data['reviewText'])
This concatenates all the strings in the reviewText column into one big string, with each review separated by a space.
You can then feed this result straight to your functions.
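For instance, a minimal sketch (assuming the column is named reviewText, and reusing the asker's helpers):
from nltk.tokenize import sent_tokenize, word_tokenize

# Tokenize the combined string once, then compute frequencies in one pass.
mega_string = ' '.join(data['reviewText'])
sentences = [word_tokenize(s.lower()) for s in sent_tokenize(mega_string)]
frequency = calculate_freq(sentences)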

Using conditionals with variable strings in python

I'm pretty new to python, but I think I catch on fast.
Anyways, I'm making a program (not for class, but to help me) and have come across a problem.
I'm trying to document a list of things, and by things I mean close to a thousand of them, with some repeating. So my problem is this:
I would rather not add redundant names to the list; instead I would just like to add a 2x or 3x before (or after, whichever is simpler) the name, and then write that to a txt document.
I'm fine with reading and writing from text documents, but my only problem is the conditional statement: I don't know how to write it, nor can I find it online.
for lines in list_of_things:
    if lines == "XXXX x (name of object here)":
And then whatever is needed under the if statement. My only problem is that the "XXXX" can be any number as a string, and I don't know how to include a variable within a string, if that makes any sense. Even if it's turned into an int, I still don't know how to use a variable within a conditional.
The only thing I can think of is making multiple if statements, which would be really long.
Any suggestions? I apologize for the wall of text.
I'd suggest looping over the lines in the input file, inserting a key into a dictionary the first time you see each line and incrementing the value at that key by one for each instance you find thereafter, then generating your output file from that dictionary.
catalog = {}
for line in input_file:
    if line in catalog:
        catalog[line] += 1
    else:
        catalog[line] = 1
Alternatively:
from collections import defaultdict

catalog = defaultdict(int)
for line in input_file:
    catalog[line] += 1
Then just run through that dict and print it out to a file.
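For example, a small sketch ('counted.txt' is a hypothetical output path):
# Write each name once, prefixed with a count when it repeats.
with open('counted.txt', 'w') as out:
    for name, count in catalog.items():
        name = name.rstrip('\n')
        if count > 1:
            out.write('{0}x {1}\n'.format(count, name))
        else:
            out.write(name + '\n')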
You may be looking for regular expressions and something like
import re

for line in text:
    match = re.match(r'(\d+) x (.*)', line)
    if match:
        count = int(match.group(1))
        object_name = match.group(2)
        ...
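A fuller sketch of how the parsed pieces might be used (the sample lines and the counting scheme are hypothetical):
import re

counts = {}
for line in ['2 x sword', 'shield', '3 x sword']:
    match = re.match(r'(\d+) x (.*)', line)
    if match:
        # '2 x sword' contributes 2 to the 'sword' total
        counts[match.group(2)] = counts.get(match.group(2), 0) + int(match.group(1))
    else:
        counts[line] = counts.get(line, 0) + 1
print(counts)  # {'sword': 5, 'shield': 1}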
Something like this?
list_of_things = ['XXXX 1', 'YYYY 1', 'ZZZZ 1', 'AAAA 1', 'ZZZZ 2']
for line in list_of_things:
    for e in ['ZZZZ', 'YYYY']:
        if e in line:
            print(line)
Output:
YYYY 1
ZZZZ 1
ZZZZ 2
You can also use if line.startswith(e): or a regex (if I am understanding your question...)
To include a variable in a string, use format():
>>> i = 123
>>> s = "This is an example {0}".format(i)
>>> s
'This is an example 123'
In this case, the {0} indicates that you're going to put a variable there. If you have more variables, use "This is an example {0} and more {1}".format(i, j) (so a number for each variable, starting from 0).
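Applied to the conditional from the question, a minimal sketch (count, name and the sample list are hypothetical values):
count, name = 2, 'sword'
list_of_things = ['2 x sword', '1 x shield']
for lines in list_of_things:
    if lines == "{0} x {1}".format(count, name):
        print("found a repeated item:", lines)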
This should do it (note that groupby only groups consecutive equal items, so sort the list first if it isn't already sorted):
from itertools import groupby

a = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5]
print(["%dx %s" % (len(list(group)), key) for key, group in groupby(a)])
# ['4x 1', '4x 2', '2x 3', '1x 4', '2x 5']
There are two ways to approach this: 1) something like the following, using a dictionary to capture the count of each item and then a list to format each item with its count,
list_of_things = ['sun', 'moon', 'green', 'grey', 'sun', 'grass', 'green']
listItemCount = {}
countedList = []
for lines in list_of_things:
    if lines in listItemCount:
        listItemCount[lines] += 1
    else:
        listItemCount[lines] = 1
for id in listItemCount:
    if listItemCount[id] > 1:
        countedList.append(id + ' - x' + str(listItemCount[id]))
    else:
        countedList.append(id)
for item in countedList:
    print(item)
the output of the above would be
sun - x2
moon
green - x2
grey
grass
or 2) using collections to make things simpler as shown below
import collections

list_of_things = ['sun', 'moon', 'green', 'grey', 'sun', 'grass', 'green']
listItemCount = collections.Counter(list_of_things)
listItemCountDict = dict(listItemCount)
countedList = []
for id in listItemCountDict:
    if listItemCountDict[id] > 1:
        countedList.append(id + ' - x' + str(listItemCountDict[id]))
    else:
        countedList.append(id)
for item in countedList:
    print(item)
the output of the above would be
sun - x2
moon
green - x2
grey
grass

Finding a small list of strings in a large list of strings (Python)

Hi, I'm new to Python, so this may come across as a simple problem, but I've searched through Google many times and can't seem to find a way to overcome it.
Basically I have a list of strings, taken from a CSV file. And I have another list of strings in a text file. My job is to see if the words from my text file are in the CSV file.
Let's say this is what the CSV file looks like (it's made up):
name,author,genre,year
Private Series,Kate Brian,Romance,2003
Mockingbird,George Orwell,Romance,1956
Goosebumps,Mary Door,Horror,1990
Geisha,Mary Door,Romance,2003
And let's say the text file looks like this:
Romance
2003
What I'm trying to do is create a function that returns the names of the books which have the words "Romance" and "2003" in them. So in this case it should return "Private Series" and "Geisha", but not "Mockingbird". My problem is that it doesn't seem to return them. However, when I change my input to just "Romance", it returns all three books with Romance in them. I assume it's because "Romance" and "2003" aren't matched together, because if I change my input to "Mary Door", both "Goosebumps" and "Geisha" show up. So how can I overcome this?
Also, how do I make my function case insensitive?
Any help would be much appreciated :)
import csv

def read_input(filename):
    f = open(filename)
    return csv.DictReader(f, delimiter=',')

def search_filter(src, term):
    term = term.lower()
    for s in src:
        if term in map(str.lower, s.values()):
            yield s

def query(src, terms):
    terms = terms.split()
    for t in terms:
        src = search_filter(src, t)
    return src

def print_query(q):
    for row in q:
        print(row)
I tried to split the logic into small, re-usable functions.
First, we have read_input which takes a filename and returns the lines of a CSV file as an iterable of dicts.
The search_filter filters a stream of rows with the given term. Both the search term and the row values are lowercased for the comparison, to achieve case-insensitive matching.
The query function takes a query string, splits it into search terms and then makes a chain of filters based on the terms and returns the final, filtered iterable.
>>> src = read_input("input.csv")
>>> q = query(src, "Romance 2003")
>>> print_query(q)
{'name': 'Private Series', 'author': 'Kate Brian', 'genre': 'Romance', 'year': '2003'}
{'name': 'Geisha', 'author': 'Mary Door', 'genre': 'Romance', 'year': '2003'}
Note that the above solution only returns exact full-cell matches. If you want partial matches, e.g. the query "Roman 2003" returning the rows above, you can use this alternative version of search_filter:
def search_filter(src, term):
    term = term.lower()
    for s in src:
        if any(term in v.lower() for v in s.values()):
            yield s
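Usage is the same as before; a short sketch with the substring-matching version:
# Partial, lowercase terms now match too.
src = read_input("input.csv")
print_query(query(src, "roman 2003"))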
