Data frame text filtering with a word list - python

I need some help with running a filter on some data. I have a data set made up of text, and I also have a list of words. I would like to filter each row of my data so that the remaining text in each row is made up of only words from the list object.
words
(cell, CDKs, lung, mutations, monomeric, Casitas, Background, acquired, evidence, kinases, small, evidence, Oncogenic)
data
ID Text
0 Cyclin-dependent kinases CDKs regulate a
1 Abstract Background Non-small cell lung
2 Abstract Background Non-small cell lung
3 Recent evidence has demonstrated that acquired
4 Oncogenic mutations in the monomeric Casitas
So after my filter I would like the data frame to look like this:
data
ID Text
0 kinases CDKs
1 Background cell lung
2 Background small cell lung
3 evidence acquired
4 Oncogenic mutations monomeric Casitas
I tried using iloc and similar functions, but I don't seem to get it. Any help with that?

You can simply use apply() along with a list comprehension:
>>> df['Text'].apply(lambda x: ' '.join([i for i in x.split() if i in words]))
0 kinases CDKs
1 Background cell lung
2 Background cell lung
3 evidence acquired
4 Oncogenic mutations monomeric Casitas
Also, I made words a set to improve performance (O(1) average lookup time); I recommend you do the same.
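For reference, a minimal end-to-end sketch with words as a set (the DataFrame below is just a shortened version of the one in the question):
import pandas as pd

words = {"cell", "CDKs", "lung", "mutations", "monomeric", "Casitas",
         "Background", "acquired", "evidence", "kinases", "small", "Oncogenic"}

df = pd.DataFrame({"Text": [
    "Cyclin-dependent kinases CDKs regulate a",
    "Abstract Background Non-small cell lung",
    "Oncogenic mutations in the monomeric Casitas",
]})

# keep only the tokens that appear in the word set
df["Text"] = df["Text"].apply(
    lambda x: " ".join(w for w in x.split() if w in words)
)
print(df)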

I'm not certain this is the most elegant of solutions, but you could do:
to_remove = ['foo', 'bar']

df = pd.DataFrame({'Text': [
    'spam foo& eggs',
    'foo bar eggs bacon and lettuce',
    'spam and foo eggs'
]})

df['Text'].str.replace('|'.join(to_remove), '')
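Note that in recent pandas versions str.replace no longer treats the pattern as a regular expression by default, so depending on your version you may need to pass regex=True for the alternation to work:
df['Text'].str.replace('|'.join(to_remove), '', regex=True)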

Related

How to list all documents/words per topic in bert topic modelling?

I read the docs, but I can see the topics only show 3 or 4 documents per topic, whereas the count is 2000+. Is there a way I can see all the assigned documents, instead of three or four documents per topic?
For example, I want to see all 2555 documents in the picture below, and get all the words under the Name column, not just the first 3 or 4 words. I tried many things, but it doesn't work.
As I understand it, you want to see the n_words of the model and also the documents representing specific topics. First of all, you can list all topics with the following code:
import pandas as pd

with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.precision', 3):
    print(freq1)
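Here freq1 is presumably the topic frequency table; with BERTopic that is typically obtained via:
freq1 = model1.get_topic_info()  # one row per topic, with its Count and Name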
With this, you will see all the topics you have in your model.
In order to get the n_words of topic 3, you can run this command:
model1.get_topic(3)
and you will get
[('hood', 0.08646080854070591),
('fort', 0.07903592661513956),
('terrorist', 0.04050269806508548),
('muslim', 0.0404965762204116),
('ft', 0.04046989026050265),
('militari', 0.03581303765985982),
('armi', 0.025703775870144486),
('base', 0.025620172464129863),
('islam', 0.024491265378088094),
('attack', 0.02280540444895898)]
output like this, the n_words with their c-TF-IDF scores.
You can also get all topics with their own n_words by running:
model1.get_topics()
If you want to get the documents that represent a topic, you can run:
model1.get_representative_docs(3)
and the output will look like:
['wouldnt fort hood islam terror attack radic jihadist',
'wasnt fort hood consid terrorist attack',
'kind gun use fort hood']
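If you want every document assigned to a topic rather than only its representative ones, here is a minimal sketch; it assumes you still have the list of documents the model was fitted on (called docs below) and that your BERTopic version provides get_document_info():
all_docs = model1.get_document_info(docs)        # one row per document, including its assigned Topic
docs_topic_3 = all_docs[all_docs['Topic'] == 3]  # every document assigned to topic 3
print(len(docs_topic_3))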

Applying function to pandas dataframe: is there a more efficient way of doing this?

I have a dataframe that has a small number of columns but many rows (about 900K right now, and it's going to get bigger as I collect more data). It looks like this:
   Author             Title            Date        Category            Text                                               url
0  Amira Charfeddine  Wild Fadhila 01  2019-01-01  novel               الكتاب هذا نهديه لكل تونسي حس إلي الكتاب يحكي ...  NaN
1  Amira Charfeddine  Wild Fadhila 02  2019-01-01  novel               في التزغريت، والعياط و الزمامر، ليوم نتيجة الب...  NaN
2  253826             1515368_7636953  2010-12-28  /forums/forums/91/  هذا ما ينص عليه إدوستور التونسي لا رئاسة مدى ا...  https://www.tunisia-sat.com/forums/threads/151...
3  250442             1504416_7580403  2010-12-21  /forums/sports/     \n\n\n\n\n\nاعلنت الجامعة التونسية لكرة اليد ا...  https://www.tunisia-sat.com/forums/threads/150...
4  312628             1504416_7580433  2010-12-21  /forums/sports/     quel est le résultat final\n,,,,????               https://www.tunisia-sat.com/forums/threads/150...
The "Text" Column has a string of text that may be just a few words (in the case of a forum post) or it may a portion of a novel and have tens of thousands of words (as in the two first rows above).
I have code that constructs the dataframe from various corpus files (.txt and .json), then cleans the text and saves the cleaned dataframe as a pickle file.
I'm trying to run the following code to analyze how variable the spelling of different words is in the corpus. The functions seem simple enough: one counts the occurrences of a particular spelling variant in each Text row; the other takes a list of such frequencies and computes a Gini coefficient for each lemma (which is just a numerical measure of how heterogeneous the spelling is). It references a spelling_var dictionary that has a lemma as its key and the various ways of spelling that lemma as values (like {'color': ['color', 'colour']}, except not in English).
This code works, but it uses a lot of CPU time. I'm not sure how much, but I use PythonAnywhere for my coding and this code sends me into the tarpit (in other words, it makes me exceed my daily allowance of CPU seconds).
Is there a way to do this so that it's less CPU intensive? Preferably without me having to learn another package (I've spent the past several weeks learning pandas and am liking it, and need to just get on with my analysis). Once I have the code and have finished collecting the corpus, I'll only run it a few times; I won't be running it every day or anything (in case that matters).
Here's the code:
import pickle
import pandas as pd
import re

with open('1_raw_df.pkl', 'rb') as pickle_file:
    df = pickle.load(pickle_file)

spelling_var = {
    'illi': ["الي", "اللي"],
    'besh': ["باش", "بش"],
    ...
}

spelling_df = df.copy()

def count_word(df, word):
    pattern = r"\b" + re.escape(word) + r"\b"
    return df['Text'].str.count(pattern)

def compute_gini(freq_list):
    proportions = [f/sum(freq_list) for f in freq_list]
    squared = [p**2 for p in proportions]
    return 1 - sum(squared)

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
    gini = compute_gini(count_list)
    spelling_df[w] = gini
I rewrote two lines in the last double loop; see the comments in the code below. Does this solve your issue?
gini_lst = []

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
    # gini = compute_gini(count_list)  # don't think you need to compute this at every iteration of the inner loop, right?
    # spelling_df[w] = gini  # having this inside of the loop creates a new column at each iteration, which could crash your CPU
    gini_lst.append(compute_gini(count_list))

# this creates a df with a row for each lemma and its associated gini value
df_lemma_gini = pd.DataFrame(data={"lemma_column": list(spelling_var.keys()),
                                   "gini_column": gini_lst})
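If the regex passes themselves are the bottleneck, here is a minimal sketch of an alternative, assuming a simple whitespace tokenisation is acceptable for your corpus (it treats punctuation differently from the \b regex): tokenise the corpus once into a Counter and look the spelling variants up in it, instead of running one str.count pass per spelling.
from collections import Counter

token_counts = Counter()
for text in spelling_df['Text']:
    token_counts.update(str(text).split())   # single pass over the corpus

gini_by_lemma = {
    lemma: compute_gini([token_counts[v] for v in variants])
    for lemma, variants in spelling_var.items()
    if any(token_counts[v] for v in variants)   # skip unseen lemmas to avoid dividing by zero
}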

How to efficiently process a large file with a grouping variable in Python

I've got a dataset that looks something like the following:
ID Group
1001 2
1006 2
1008 1
1027 2
1013 1
1014 4
So basically, a long list of unsorted IDs with a grouping variable as well.
At the moment, I want to take subsets of this list based on the generation of a random number (imagine they're being drafted, or won the lottery, etc.). Right now, this is the code I'm using to just process it row-by-row, by ID.
reader = csv.reader(open(inputname), delimiter=' ')
out1 = open(output1name, 'wb')
out2 = open(output2name, 'wb')

for row in reader:
    assignment = gcd(1, p, marg_rate, rho)
    if assignment[0,0] == 1:
        out1.write(row[0])
        out1.write("\n")
    if assignment[0,1] == 1:
        out2.write(row[0])
        out2.write("\n")
Basically, if the gcd() function goes one way, you get written to one file; if it goes another way, to a second; and then some get tossed out. The issue is I'd now like to do this by Group rather than ID - basically, I'd like to assign values the first time a member of the group appears, and then apply them to all members of that group (so, for example, if 1001 goes to File 2, so do 1006 and 1027).
Is there an efficient way to do this in Python? The file's large enough that I'm a little wary of my first thought, which was to do the assignments in a dictionary or list and then have the program look it up for each line.
I used random.randint to generate a random number, but this can be easily replaced.
The idea is to use a defaultdict to have a single score (dict keys are unique) for a group from the moment it's created:
import csv
import random
from collections import defaultdict

reader = csv.DictReader(open(inputname), delimiter=' ')
out1 = open(output1name, 'wb')
out2 = open(output2name, 'wb')

# create a dictionary with a random default integer value [0, 1] for
# keys that are accessed for the first time
group_scores = defaultdict(lambda: random.randint(0, 1))

for row in reader:
    # set a score for the current row according to its group;
    # if none is found, the defaultdict will call its lambda for the new key
    # and create a score for this row and all that follow
    score = group_scores[row['Group']]
    if score == 0:
        out1.write(row['ID'])
        out1.write("\n")
    if score == 1:
        out2.write(row['ID'])
        out2.write("\n")

out1.close()
out2.close()
I've also used DictReader, which I find nicer for CSV files with headers.
Tip: you may want to use the with context manager to open files.
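For example, a minimal sketch of the same file handling with a context manager (text mode here, since plain strings are being written):
with open(output1name, 'w') as out1, open(output2name, 'w') as out2:
    for row in reader:
        score = group_scores[row['Group']]
        (out1 if score == 0 else out2).write(row['ID'] + "\n")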
Example output:
reut@sharabani:~/python/ran$ cat out1.txt
1001
1006
1008
1027
1013
reut@sharabani:~/python/ran$ cat out2.txt
1014
Sounds like you're looking for a mapping. You can use dicts for that.
Once you've first decided that 1001 goes to file 2, you can add it to your mapping dict:
fileMap={}
fileMap[group]="fileName"
And then, when you need to check if the group has been decided yet, you just
>>>group in fileMap
True
This is instead of mapping every ID to a filename. Just map the groups.
Also, I'm wondering whether it's worth considering batching the writes with writelines(aListOfLines).
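A minimal sketch of that mapping idea, assuming the DictReader and output files from the other answer and a hypothetical draw_score() that does the random 0/1 draw once per group:
file_map = {}  # group -> output file handle
for row in reader:
    group = row['Group']
    if group not in file_map:
        # first time this group appears: draw once and remember which file it goes to
        file_map[group] = out1 if draw_score() == 0 else out2
    file_map[group].write(row['ID'] + "\n")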

pandas and "re" - search for total and partial strings

This is an extended question from this topic. I would like to search strings for total and partial matches, using the following keywords Series "w":
rigour*
*demeanour*
centre*
*arbour
fulfil
This obviously means that I want to search for words like rigour and rigours, endemeanour and demeanours, centre and centres, harbour and arbour, and fulfil. So the keywords list I have is a mix of complete and partial strings to find. I would like to apply the search to this DataFrame "df":
ID;name
01;rigour
02;rigours
03;endemeanour
04;endemeanours
05;centre
06;centres
07;encentre
08;fulfil
09;fulfill
10;harbour
11;arbour
12;harbours
What I tried so far is the following:
r = re.compile(r'.*({}).*'.format('|'.join(w.values)), re.IGNORECASE)
then I built a mask to filter the DataFrame:
mask = [m.group(1) if m else None for m in map(r.search, df['name'])]
in order to get a new column with the Keyword found:
df['keyword'] = mask
What I'm expecting is the following resulting DataFrame:
ID;name;keyword
01;rigour;rigour
02;rigours;rigour
03;endemeanour;demeanour
04;endemeanours;demeanour
05;centre;centre
06;centres;centre
07;encentre;None
08;fulfil;fulfil
09;fulfill;None
10;harbour;arbour
11;arbour;arbour
12;harbours;None
This works with a w list that contains no *. But I've had several issues formatting the keyword list w with the * conditions so that the re.compile function runs correctly.
Any help would be really appreciated.
It looks like your input Series w needs to be adjusted to be used as a regex pattern, like this:
rigour.*
.*demeanour.*
centre.*
\\b.*arbour\\b
\\bfulfil\\b
Note that * in regex goes after something; it does not work on its own. It means that whatever it follows can be repeated 0 or more times.
Note also that fulfil is a part of fulfill, and if you want a strict match you need to tell the regex this, for example by using the word boundary \b, which catches the string only as a whole word.
Here is how your regex might look to give you the results you need:
s = '({})'.format('|'.join(w.values))
r = re.compile(s, re.IGNORECASE)
r
re.compile(r'(rigour.*|.*demeanour.*|centre.*|\b.*arbour\b|\bfulfil\b)', re.IGNORECASE)
And the replacement in your code can be done with the pandas .where method like this:
df['keyword'] = df.name.where(df.name.str.match(r), None)
df
ID name keyword
0 1 rigour rigour
1 2 rigours rigours
2 3 endemeanour endemeanour
3 4 endemeanours endemeanours
4 5 centre centre
5 6 centres centres
6 7 encentre None
7 8 fulfil fulfil
8 9 fulfill None
9 10 harbour harbour
10 11 arbour arbour
11 12 harbours None
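If you would rather build those patterns programmatically from the original wildcard notation instead of rewriting them by hand, here is a minimal sketch (the wildcard_to_regex helper is illustrative, not part of pandas or re): a * on either side becomes .*, and a side without * gets a \b boundary so partial matches on that side are rejected.
import re
import pandas as pd

def wildcard_to_regex(kw):
    # '*' at an end -> '.*'; no '*' at that end -> a \b word boundary
    left = '.*' if kw.startswith('*') else r'\b'
    right = '.*' if kw.endswith('*') else r'\b'
    return left + re.escape(kw.strip('*')) + right

w = pd.Series(['rigour*', '*demeanour*', 'centre*', '*arbour', 'fulfil'])
patterns = w.map(wildcard_to_regex)
r = re.compile('({})'.format('|'.join(patterns)), re.IGNORECASE)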

Two different blocks of text are merging together. Can I separate them if I know what one is?

I've used a number of pdf-to-text methods to extract text from PDF documents. For one particular type of PDF I have, neither pyPDF nor pdfMiner does a good job extracting the text. However, http://www.convertpdftotext.net/ does it (almost) perfectly.
I discovered that the pdf I'm using has some transparent text in it, and it is getting merged into the other text.
Some examples of the blocks of text I get back are:
12324 35th Ed. 01-MAR-12 Last LNM: 14/12 NAD 83 14/12 Corrective Object of Corrective
ChartTitle: Intracoastal Waterway Sandy Hook to Little Egg Harbor Position
C HAActRionT N Y -NJ - S A N D Y H OO K ATcO tionLI T TLE EGG HARBOR. Page/Side: N/A
(Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are givCGenD 0in 1 degrees clockwise from 000 true.
Bearings RoEf LlighOCtAT seEc tors aSrehre towwsbuardry th Re ivligher Ct fhroanmn seel Lawighartde.d B Theuoy 5no minal range of lights is expressedf roin mna 4u0tic-24al -mi46les.56 0(NNM ) unless othe0r7w4is-00e n-o05te.d8.8 0 W
to 40-24-48.585N 074-00-05.967W
and
12352 33rd Ed. 01-MAR-11 Last LNM: 03/12 NAD 83 04/12 . . l . . . . Corrective Object of Corrective ChartTitle: Shinnecock Bay to East Rockaway Inlet Position C HAActRionT S H IN N E C OC K B A Y TO AcEtionAS T ROCKAWAY INLET. Page/Side: N/A (Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are givCGenD 0in 1 degrees clockwise from 000 true. (BTeeamringp) s DoEf LlighETtE s ectors aSretat toew Baoratd Ctheh anlighnet lf Droaym beseacoawanr 3d. The nominal range of lights is expressedf roin mna 4u0tic-37al -mi11les.52 0(NNM ) unless othe0r7w3is-29e n-5o3te.d76. 0 W
and I have discovered that the "ghost text" is ALWAYS the following:
Corrective Object of Corrective Position
Action Action
(Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are given in degrees clockwise from 000 true.
Bearings of light sectors are toward the light from seaward. The nominal range of lights is expressed in nautical miles (NM) unless otherwise noted.
In the 2nd example I posted, the text I want (with the ghost text removed) is:
12352 33rd Ed. 01-Mar-11 Last LNM:03/12 NAD 83 04/12
Chart Title:Shinnecock Bay to East Rockaway Inlet. Page/Side:N/A
CGD01
(Temp) DELETE State Boat Channel Daybeacon 3 from 40-37-11.520N 073-29-53.760W
This problem occurs just once per document, and does not appear to be totally consistent (as seen above). I am wondering if one of you wizards could think of a way to remove the ghosted text (I don't need/want it) using python. If I had been using pyPDF, I would have used a regex to rip it out during the conversion to text. Unfortunately, since I'm starting out with a text file from the website listed above, the damage has already been done. I'm at a bit of a loss.
Thanks for reading.
EDIT:
The solution to this problem looks like it be more complex than the rest of the application, so I'm going to withdraw my request for help.
I very much appreciate the thought put into it by those who have contributed.
Given that the ghost text can be split up in seemingly unpredictable ways, I don't think there is a simple automatic way of removing it that would not have false positives. What you need is almost human-level pattern recognition. :-)
What you could try is exploiting the format of these kinds of messages. Roughly:
<number> <number>[rn]d Ed. <date> Last LNM:<mm>/<yy> NAD <date2>
Chart Title:<text>. Page/Side:<N/A or number(s)> CGD<number> <text>
<position>
Using this you could pluck out the nonsense from the predictable elements, and then, if you have a list of chart names ('Shinnecock Bay to East Rockaway Inlet') and descriptive words (like 'State', 'Boat', 'Daybeacon'), you might be able to reconstruct the original words by finding the smallest Levenshtein distance between mangled words in the two text blocks and those in your word lists.
If you can install the poppler software, you could try using pdftotext with the -layout option to keep the formatting from the original PDF as much as possible. That might make your problem disappear.
You could recursively find all possible ways that your pattern "Corrective Object of Corrective Position Action ..." can be contained within your mangled text.
Then you can unmangle the text for each of these possible paths, run some sort of spellcheck over them, and choose the one with the fewest spelling mistakes. Or since you know roughly where each substring should appear, you can use that as a heuristic.
Or you could simply use the first path.
Some pseudocode (untested):
def findAllOccurences(mangledText, letter):
    # all indices in mangledText that contain letter
    return [i for i, c in enumerate(mangledText) if c == letter]

def findPaths(mangledText, pattern, path, offset=0):
    if len(pattern) == 0:  # end of pattern
        return [path]
    else:
        nextLetter = pattern[0]
        locations = findAllOccurences(mangledText, nextLetter)  # all indices in mangledText that contain nextLetter
        allPaths = []
        for loc in locations:
            # recurse on the remaining text, carrying an offset so the stored
            # indices stay relative to the original (unsliced) text
            paths = findPaths(mangledText[loc+1:], pattern[1:],
                              path + (offset + loc,), offset + loc + 1)
            allPaths.extend(paths)
        return allPaths  # if no locations for the next letter exist, allPaths will be empty
Then you can call it like this (optionally remove all spaces from your search pattern, unless you are certain they are all included in the mangled text):
allPossiblePaths = findPaths(YourMangledText, "Corrective Object...", ())
then allPossiblePaths should contain a list of all possible ways your pattern could be contained in your mangled text.
Each entry is a tuple with the same length as the pattern, containing the index at which the corresponding letter of the pattern occurs in the search text.
