I have some JSON data converted into a Pandas DataFrame. I am looking to find all rows whose string content matches any of a list of multi-word phrases.
I am working with a massive amount of Twitter JSON data already downloaded for public use (so Twitter API usage is not applicable). This JSON is converted into a Pandas DataFrame. One of the available columns is text, which is the body of the tweet. An example is
We’re kicking off the first portion of a citywide traffic calming project to make residential streets more safe & pedestrian-friendly, next week!
Tuesday, July 30 at 10:30 AM
Nautilus Drive and 42 Street
I want to have a list of phrases, phrases = ["We're kicking off", "we're starting", "we're initiating"], and do something like df[df['text'].str.contains(phrases)] to obtain the DataFrame rows whose text column contains one of the phrases.
This is perhaps asking too much, but ideally I would also be able to match something like phrases = ["(We're| we are) kicking off", "(we're | we are) starting", "(we're| we are) initiating"]
Make a list of the keywords or phrases you want to match. I have used logic for an exact match; you can change this by changing the regex. The code also records which keyword caught each text.
Here is the code:
import re
import pandas as pd

commentlist, keywordlist = [], []
for keyword in mustkeywords:
    for comment in text:
        # Match the keyword as a whole word or phrase
        result = re.search(r'\s*\b' + keyword + r'\W\s*', comment)
        if result:
            commentlist.append(comment)
            keywordlist.append(keyword)
# Temporary DataFrame holding the matched comments
tempmustkeywordsdf = pd.DataFrame(columns=["Comments"], data=commentlist)
# Add the keyword that matched each comment
tempmustkeywordsdf["Keywords"] = keywordlist
Here mustkeywords is a list that contains your phrases or keywords,
text is a list of the strings that you want to search for those keywords,
and tempmustkeywordsdf is a DataFrame that contains the matched strings and the keywords that matched them.
I hope this helps.
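As an alternative to the nested loops, the same filtering can be sketched with a single vectorized call. This is a minimal sketch, not the answerer's method: it assumes the phrases can be joined into one regex alternation, that matching should be case-insensitive, and that curly apostrophes (as in the example tweet) should be normalized first. The sample rows and the names text_norm, pattern, and grouped_pattern are made up for illustration.

```python
import re
import pandas as pd

phrases = ["we're kicking off", "we're starting", "we're initiating"]

# Hypothetical sample data standing in for the tweet DataFrame
df = pd.DataFrame({
    "text": [
        "We’re kicking off the first portion of a citywide project next week!",
        "Road closed on Nautilus Drive.",
        "Next month we're starting phase two.",
        "We are initiating a review.",
    ]
})

# Tweets often use a curly apostrophe; normalize it so the phrases can match
text_norm = df["text"].str.replace("’", "'", regex=False)

# Literal phrases: escape each one and OR them together
pattern = "|".join(re.escape(p) for p in phrases)
matches = df[text_norm.str.contains(pattern, case=False, regex=True)]

# Regex phrases: non-capturing groups avoid the pandas warning about match groups
grouped_pattern = r"(?:we're|we are) (?:kicking off|starting|initiating)"
matches2 = df[text_norm.str.contains(grouped_pattern, case=False, regex=True)]

print(matches.index.tolist())
print(matches2.index.tolist())
```

If the text column can contain missing values, passing na=False to str.contains keeps the mask boolean.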
I would like to perform text analysis, such as a word cloud and n-grams, on one of the text columns. I have broken the sentences down into tokens and want to join them back to the original table.
For example here are my two rows:
Code Text
ST-441 Purpose of your visit mentioned
ST-432 Describe how and where it happened
After cleaning the text column, I applied the following function:
def cleans(text):
    # Split each sentence into a list of words
    tokens = list(map(lambda sentence: sentence.split(' '), text))
    return tokens
After applying the split function, each sentence is broken down into words, so one row becomes n rows. I want to add these back to the original table using a unique identifier column.
Now I have a list of tokens like 'purpose', 'your', 'visit', 'mentioned', 'described' ...
I am looking for the below output
Code Text
ST-441 Purpose
ST-441 your
ST-441 Visit
ST-441 mentioned
ST-432 Describe
ST-432 how
Any help would be much appreciated.
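One possible way to get that shape is to split inside the DataFrame and then explode the resulting lists, so the Code value travels with each token. This is a sketch under the assumption that pandas 0.25 or later is available (for DataFrame.explode); the variable name tokens is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Code": ["ST-441", "ST-432"],
    "Text": [
        "Purpose of your visit mentioned",
        "Describe how and where it happened",
    ],
})

# Split each sentence into a list of words, then give every word its own row.
# The Code column is repeated automatically for each exploded element.
tokens = (
    df.assign(Text=df["Text"].str.split())
      .explode("Text")
      .reset_index(drop=True)
)

print(tokens)
```

Because the identifier stays attached to every token, no separate join back to the original table is needed.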
I am using the docx library to read a Word document, and I am trying to extract only the questions using regex search and match. I have tried many ways of doing it, but I keep getting a "TypeError".
The data I am trying to extract is this:
Will my financial aid pay for housing?
Off Campus Housing - After financial aid applies toward your tuition and fees, any remaining funds will be sent to you as a refund that will either be directly deposited (which can be set up through your account) or mailed to you as a paper check. You can then use the refund to pay your rent. It is important to note that financial aid may not be available when rent is due, so make sure to have a plan in place to pay your rent. Will my financial aid pay for housing?
"financial" "help" "house"
funds "univ oak"
"money" "chisho"
"pay" "chap"
"grant" "laurel"
What are the requirements to receive a room and grant?
How do I pay for my housing?
How do I pay for housing?
If there's also an easier method of exporting the word doc into a different type of file, that'll be great to know for feedback. Thank you
I am using regex 101, I've tried the following regex expressions to match only the sentences that end in a question mark
".*[?=?]$"
"^(W|w).*[?=?]$"
"^[A-Za-z].*[?=?]$"
import re
import sys
from docx import Document

wordDoc = Document('botDoc.docx')
result = re.search('.*[?=?]$', wordDoc)
print(result)
if result:
    print(result.group(0))
for table in wordDoc.tables:
    for row in table.rows:
        for cell in row.cells:
            print("test")
I expect to save the matching patterns into directories so I can export the data to a csv file
Your error:
result = re.search('.*[?=?]$', wordDoc)
I believe that this line is the cause of the problem. search() is expecting a string as a second parameter, but is receiving a Document object.
What you should do is use the findall() function. search() only finds the first match for a pattern; findall() finds all the matches and returns them as a list of strings, with each string representing one match.
Since you are working with docx, you would have to extract the contents of the docx and use them as second parameter of the findall() method. If I remember correctly, this is done by first extracting all the paragraphs, and then extracting the text of the individual paragraphs. Refer to this question.
FYI, the way you would do this for a simple text file is the following:
# Open the file and feed its text into findall(); it returns a list of all the found strings
with open('test.txt', 'r') as f:
    strings = re.findall(r'your pattern', f.read())
Your Regex:
Unfortunately, your regex is not quite correct. Although it makes sense logically to match only sentences that end in a ?, one of your matches is "place to pay your rent. Will my financial aid pay for housing?", for example. Only the second part of that is an actual question. So do not let a match start with a lowercase letter. Your regex should be something like:
[A-Z].*\?$
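Note that a greedy .* can still span several sentences when a paragraph begins with a capital letter and ends with a question. One way to isolate only the question sentence is to forbid sentence-ending punctuation inside the match. The following is a sketch on plain strings, not a tested docx pipeline: the pattern and the helper name extract_questions are assumptions, and with python-docx the input list would presumably come from something like [p.text for p in wordDoc.paragraphs].

```python
import re

# An uppercase letter, then any run of characters that are not sentence
# terminators, then a question mark
QUESTION = re.compile(r"[A-Z][^.?!]*\?")

def extract_questions(paragraphs):
    """Return every question-like sentence found in a list of strings."""
    questions = []
    for text in paragraphs:
        questions.extend(match.strip() for match in QUESTION.findall(text))
    return questions

# Hypothetical paragraph texts modelled on the document in the question
paragraphs = [
    "Will my financial aid pay for housing?",
    "It is important to have a plan in place to pay your rent. "
    "Will my financial aid pay for housing?",
    "funds univ oak",
    "How do I pay for my housing?",
]

questions = extract_questions(paragraphs)
print(questions)
```

This simple pattern would stop at abbreviations that contain a period (for example "U.S."), so it is only a starting point.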
I'm setting up integration between a webflow store and shippo to assist with creating labels and managing shipping. Webflow passes the data as one huge object for address information, however to create a new order in shippo, I need the information parsed, separated as individual line items. I have attempted to use formatter which allows one to extract text, split text, use regex to match data and more.
import re
details = re.search(r'(?<=city:\s).*$', input_data['All Addresses'])
Regex in Python is my best option, yet the result will not find and/or display the data.
Please any experts in Zapier integrations, I need assistance in figuring out a way to parse the incoming data from webflow, pass it to the 'create a order' action with shippo.
Structure of Data:
addressee: string
city: string
country: string
more....
You can try this one:
Combine all the data in one whole string
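import re
# re.MULTILINE lets $ match at the end of every line, not only at the end of the string
details = re.findall(r'(?<=city:\s).*$', all_addresses, re.MULTILINE)
return details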
It will give you a list of all the matches in the text.
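If every field is needed rather than only the city, the whole block can be parsed into a dictionary in one pass. This is a sketch that assumes the incoming text has one "key: value" pair per line, as in the structure shown in the question; the function name parse_address and the sample values are hypothetical.

```python
import re

def parse_address(raw):
    """Parse 'key: value' lines into a dict; assumes one pair per line."""
    # [ \t]* is used instead of \s* so an empty value cannot swallow the next line
    pairs = re.findall(r"^[ \t]*(\w+):[ \t]*(.*)$", raw, flags=re.MULTILINE)
    return {key: value.strip() for key, value in pairs}

# Hypothetical sample in the shape described in the question
raw = "addressee: Jane Doe\ncity: Springfield\ncountry: US"
parsed = parse_address(raw)
print(parsed["city"])
```

In a Zapier code step, the raw text would presumably come from input_data['All Addresses'], and returning the dictionary would expose each key as a separate field for the next action.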
Based on this question, How to create a word cloud from a corpus in Python?, I built a word cloud using amueller's library. However, I fail to see how I can feed the cloud with more than one set of text. Here is what I have tried so far:
wc = WordCloud(background_color="white", max_words=2000, mask=alice_mask,
stopwords=STOPWORDS.add("said"))
wc.generate(set_of_words)
wc.generate("foo") # this overwrites the previous line of code
# but I would like this to be appended to the set of words
I can not find any manual for the library, so I have no idea about how to proceed, do you? :)
In reality, as you see here: Dictionary with array of different types as value in Python, I have this data structure:
category = { "World news": [2, "foo bla content of", "content of 2nd article"],
"Politics": [1, "only 1 article here"],
...
}
and I would like to append "foo bla content of" and "content of 2nd article" to the word cloud.
The easiest solution would be to regenerate the wordcloud with the updated corpus.
To build a corpus with the text contained in your category data structure (for all topics) you could use this comprehension:
# Update the corpus
corpus = " ".join([" ".join(value[1:]) for value in category.values()])
# Regenerate the word cloud
wc.generate(corpus)
To build the word cloud for a single key in your data structure (eg Politics):
# Update the corpus
corpus = " ".join(category["Politics"][1:])
# Regenerate the word cloud
wc.generate(corpus)
Explanation:
join glues multiple strings together, separated by a given delimiter
[1:] takes all the elements of a list except the first one
dict.values() gives all the values in the dictionary (a list in Python 2, a view object in Python 3)
The expression " ".join([" ".join(value[1:]) for value in category.values()]) thus can be translated as:
First glue together all the elements per key except the first one (as it is a counter). Then glue together all the resulting strings.
From a brief skim over the class in https://github.com/amueller/word_cloud/blob/master/wordcloud/wordcloud.py there isn't an update method, so you would need either to regenerate the wordcloud or add an update method.
Easiest way would probably be to maintain the original source text, and add to the end of this, then regenerate.
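The "keep the source text and regenerate" idea can be sketched with a small accumulator. The Corpus class below is hypothetical and not part of the wordcloud library; the call to wc.generate is left as a comment because it assumes the WordCloud object from the question.

```python
class Corpus:
    """Hypothetical helper that keeps every text added so far."""

    def __init__(self):
        self._parts = []

    def add(self, text):
        # Store the new text; nothing is overwritten
        self._parts.append(text)
        return self

    def text(self):
        # Join everything into the single string that generate() expects
        return " ".join(self._parts)

corpus = Corpus()
corpus.add("foo bla content of")
corpus.add("content of 2nd article")
combined = corpus.text()

# Regenerate the cloud from the full text each time something is added:
# wc.generate(combined)
print(combined)
```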
I have a dictionary of the following structure:
keywords={topic_1:{category_1:['\"phrase_1\"','\"phrase_2\"'],
                   category_2:['\"phrase_1\"','\"phrase_2\"']},
          topic_2:{category_1:['\"phrase_1\"','\"phrase_2\"','\"phrase_3\"']}}
I have a bunch of documents in MongoDB that I want to tag with a [category, topic] tag as long as they match any one of the phrases in keywords[topic][category]. However, I currently need to iterate phrase by phrase, as follows (PyMongo):
for topic in keywords:
    for category in keywords[topic]:
        for phrase in keywords[topic][category]:
            docs=db.collection.find({'$text':{'$search':phrase}},{'_id':1})
Instead, I want to pass the whole list of phrases and get back the documents that match any one phrase, for every [topic][category] list. Is this possible in PyMongo? Is OR-ing of phrases possible, and if so, how do I go about it? I tried concatenating the phrases into a single string, but that didn't work. My actual Mongo collection has a million documents and the dictionary would be very large as well, so performance suffers if I iterate phrase by phrase.
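One possible direction, offered only as a sketch and not as a confirmed answer, is to build a single $regex alternation per [topic][category] list so that each list needs one query instead of one query per phrase. The field name "text", the helper build_filters, and the sample phrases are assumptions. A $regex filter does not use the text index, so whether this is actually faster on a million documents would need to be measured.

```python
import re

# Hypothetical dictionary in the same shape as the one in the question
keywords = {
    "topic_1": {
        "category_1": ['"phrase one"', '"phrase two"'],
        "category_2": ['"another phrase"'],
    },
}

def build_filters(keywords, field="text"):
    """Build one MongoDB filter per (topic, category) that ORs all its phrases."""
    filters = {}
    for topic, categories in keywords.items():
        for category, phrases in categories.items():
            # Drop the surrounding quotes used for $text phrase search
            cleaned = [phrase.strip('"') for phrase in phrases]
            pattern = "|".join(re.escape(phrase) for phrase in cleaned)
            filters[(topic, category)] = {
                field: {"$regex": pattern, "$options": "i"}
            }
    return filters

filters = build_filters(keywords)

# Each filter could then be passed to a single find call, for example:
# docs = db.collection.find(filters[("topic_1", "category_1")], {"_id": 1})
print(filters[("topic_1", "category_1")])
```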