clustering of a text file - python

Original Question:
I have a flat file with each row representing text associated with an application. I would like to cluster applications based on the words associated with each application. Is there free code available for text mining a single flat file? Thank you.
Update 1:
There are 30,000 applications. I am trying to figure out what customer behaviors are associated with each cluster. I don't have a predefined set of words to start with. I could inspect a random few and determine some words, but that would not give me an exhaustive list; I would like to capture the majority of the behaviors in a systematic way.
I tried converting the text file into an XML file and clustering it with the Carrot2 Workbench, but that didn't work. I haven't used Carrot2 before, so I may be doing something wrong there.

My understanding is that you have a file like:
game Solitaire
productivity OpenOffice
game MineSweeper
...
And you want to categorize everything based on its tag word, i.e. put applications into buckets based on their associated tag/description/...
I think you can use a dictionary of lists for this purpose, e.g.:
out = {}
with open('input.txt') as f:
    for inline in f:
        # skip lines that have no tag/name separator
        if ' ' not in inline:
            continue
        tag, appname = inline.strip('\n').split(' ', 1)
        if tag not in out:
            out[tag] = []
        out[tag].append(appname)
print(out['game'])
This iterates through the input once and groups application names by their tags very efficiently.
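If what is needed is true clustering over free text (the 30,000-application case, with no predefined word list) rather than grouping by an existing tag, a bag-of-words approach works. Below is a minimal pure-Python TF-IDF plus nearest-seed sketch; the toy data and helper names are my own, and in practice a library such as scikit-learn would do this more robustly:

```python
import math
from collections import Counter

# toy stand-in for "one row of text per application"
docs = {
    "app1": "poker cards casino game",
    "app2": "spreadsheet office documents",
    "app3": "cards solitaire game fun",
    "app4": "text documents editor office",
}

tokenized = {name: text.split() for name, text in docs.items()}

# document frequency of each word
df = Counter()
for words in tokenized.values():
    df.update(set(words))

N = len(docs)

def tfidf(words):
    # term frequency weighted by inverse document frequency
    tf = Counter(words)
    return {w: (c / len(words)) * math.log(N / df[w]) for w, c in tf.items()}

vecs = {name: tfidf(words) for name, words in tokenized.items()}

def cosine(a, b):
    # cosine similarity between two sparse word->weight vectors
    num = sum(wt * b.get(w, 0.0) for w, wt in a.items())
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    na, nb = norm(a), norm(b)
    return num / (na * nb) if na and nb else 0.0

# assign every document to the most similar of two seed documents
seeds = ["app1", "app2"]
clusters = {s: [s] for s in seeds}
for name in docs:
    if name in seeds:
        continue
    best = max(seeds, key=lambda s: cosine(vecs[name], vecs[s]))
    clusters[best].append(name)

print(clusters)  # the "game" apps end up together, as do the "office" apps
```

The seed choice here is hard-coded for the toy data; a real run would use k-means or similar to pick centroids automatically.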

Related

Python: How to replace every space in a dictionary to an underscore?

I'm still a beginner, so maybe the answer is very easy, but I could not find a solution (at least one I could understand) online.
Currently I am learning famous works of art through the app "Anki", so I imported a deck for it from online containing over 700 pieces.
Sadly the names of the pieces are in English and I would like to learn them in my mother tongue (German). So I wanted to write a script to automate translating all the names inside the app. I started out by creating a dictionary with every artist and their art pieces (filling this dictionary automatically by reading the app is a task for another time).
art_dictionary = {
"Wassily Kandinsky": "Composition VIII",
"Zhou Fang": "Ladies Wearing Flowers in Their Hair",
}
My plan is to access Wikipedia (or any other database for artworks) that stores the German name of the painting (because translating it with an English-German dictionary often returns wrong results, since the German translation can vary drastically):
replacing every space character inside the name with an underscore
letting python access the wikipedia page of said painting:
import re
from urllib.request import urlopen
painting_name = "Composition_VIII"  # this is manual input, of course
url = "https://en.wikipedia.org/wiki/" + painting_name  # urlopen needs a full URL with scheme
page = urlopen(url)
somehow accessing the German version of the site and extracting the German name of the painting:
html = page.read().decode("utf-8")
pattern = "<title.*?>.*?</title.*?>"  # the page title lives in the <title> tag
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title)
storing it in a list or variable
inserting it into the Anki app
Maybe this is impossible or "over-engineering", but I'm learning a lot along the way.
I tried to search for a solution online, but could not find anything similar to my problem.
You can use a dictionary comprehension with the replace method to update all the values (the names of the art pieces in this case) of the dictionary.
art_dictionary = {
"Wassily Kandinsky": "Composition VIII",
"Zhou Fang": "Ladies Wearing Flowers in Their Hair",
}
art_dictionary = {key:value.replace(' ', '_') for key,value in art_dictionary.items()}
print(art_dictionary)
# Output: {'Wassily Kandinsky': 'Composition_VIII', 'Zhou Fang': 'Ladies_Wearing_Flowers_in_Their_Hair'}
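For the "access the German version" step of the original plan, Wikipedia's MediaWiki API exposes inter-language links directly, which is more robust than scraping the HTML title. A sketch, with helper names of my own (the actual lookup needs network access):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def langlinks_url(title, lang="de"):
    # build a MediaWiki API query for the inter-language link of a page
    params = {
        "action": "query",
        "titles": title,
        "prop": "langlinks",
        "lllang": lang,
        "format": "json",
    }
    return API + "?" + urlencode(params)

def german_title(title):
    # needs network access; returns None if no German article exists
    with urlopen(langlinks_url(title)) as resp:
        data = json.load(resp)
    page = next(iter(data["query"]["pages"].values()))
    links = page.get("langlinks")
    return links[0]["*"] if links else None
```

With this, `german_title("Composition VIII")` should return the German article title without any regex work on raw HTML, and spaces in the title are percent-encoded by `urlencode`, so the underscore step becomes unnecessary.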

What is the best way to extract the body of an article with Python?

Summary
I am building a text summarizer in Python. The kind of documents I am mainly targeting are scholarly papers, which are usually in PDF format.
What I Want to Achieve
I want to effectively extract the body of the paper (abstract to conclusion), excluding the title, publisher names, images, equations, and references.
Issues
I have tried looking for effective ways to do this, but I was not able to find anything tangible and useful. The current code I have splits the PDF document into sentences and then filters out entries with fewer than the average number of characters per sentence. Below is the code:
from pdfminer import high_level

# input: string (path to the file)
# output: list of sentences
def pdf2sentences(pdf):
    article_text = high_level.extract_text(pdf)
    sents = article_text.split('.')  # splitting on '.' roughly splits on every sentence
    run_ave = 0
    for s in sents:
        run_ave += len(s)
    run_ave /= len(sents)
    sents_strip = []
    for sent in sents:
        if len(sent.strip()) >= run_ave:
            sents_strip.append(sent)
    return sents_strip
Note: I am using this article as input.
The above code seems to work fine, but I am still not able to effectively filter out things like the title and publisher names that come before the abstract section, or the references section that comes after the conclusion. Moreover, images cause gibberish characters to show up in the text, which messes up the overall quality of the output. Due to the weird Unicode characters I am not able to write the output to a txt file.
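As an aside on that last point: the write failure is usually an encoding mismatch rather than the characters themselves. Opening the output file with an explicit UTF-8 encoding, plus an error handler for anything genuinely unencodable, avoids it. A minimal sketch (the sample string is my own):

```python
# "\ud800" is a lone surrogate, the kind of junk PDF extraction can produce,
# which a default open()/write() may refuse to encode
text = "Normal text \ud800 more text"

with open("out.txt", "w", encoding="utf-8", errors="replace") as f:
    f.write(text)

with open("out.txt", encoding="utf-8") as f:
    print(f.read())  # the unencodable character becomes '?'
```

`errors="ignore"` drops the offending characters entirely instead of substituting `?`, which may be preferable for a summarizer.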
Appeal
Are there ways I can improve the performance of this parser and make it more consistent?
Thank you for your answers!

Filter sentences based on search words using Python?

I have a text file containing many sentences. I want to query the file with search words and return the sentences that contain those words.
Effort so far:
h = input("Enter search word: ")
with open("file.txt") as openfile:
    for line in openfile:
        for part in line.split():
            if h in part:
                print(part)
file.txt contains these sentences
On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document.
You can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks.
When you create pictures, charts, or diagrams, they also coordinate with your current document look.
You can easily change the formatting of selected text in the document text by choosing a look for the selected text from the Quick Styles gallery on the Home tab.
You can also format text directly by using the other controls on the Home tab.
Most controls offer a choice of using the look from the current theme or using a format that you specify directly.
To change the overall look of your document, choose new Theme elements on the Page Layout tab. To change the looks available in the Quick Style gallery, use the Change Current Quick Style Set command.
Both the Themes gallery and the Quick Styles gallery provide reset commands so that you can always restore the look of your document to the original contained in your current template.
Output: for the 'galleries' search it returns 'galleries' twice, but I need it to return the sentences.
How can I search for multiple words and return the sentences containing that combination (not necessarily as an n-gram or in order)? For example, if I type 'overall' as one search word and 'Layout' as another, it should return the following sentence. Search words are case-insensitive.
To change the overall look of your document, choose new Theme elements on the Page Layout tab.
HELP!
This handles multiple search words; lowercasing both the line and the terms makes the match case-insensitive, and all() requires every term to be present:
myfile = "queryfile.txt"
search_wordlist = input("Enter search words, separated by a comma\n")
mylist = [term.strip().lower() for term in search_wordlist.split(",")]
with open(myfile) as openfile:
    for line in openfile:
        if all(term in line.lower() for term in mylist):
            print(line)

Extracting Fasta Moonlight Protein Sequences with Python

I want to extract the FASTA files that have the amino acid sequences from the Moonlighting Protein Database ( www.moonlightingproteins.org/results.php?search_text= ) via Python, since it's an iterative process, which I'd rather learn how to program than do manually, b/c come on, we're in 2016. The problem is I don't know how to write the code, because I'm a rookie programmer :( . The basic pseudocode would be:
for protein_name in site: www.moonlightingproteins.org/results.php?search_text=:
    go to the uniprot option
    download the fasta file
    store it in a .txt file inside a given folder
Thanks in advance!
I would strongly suggest asking the authors for the database. From the FAQ:
I would like to use the MoonProt database in a project to analyze the
amino acid sequences or structures using bioinformatics.
Please contact us at bioinformatics#moonlightingproteins.org if you are
interested in using MoonProt database for analysis of sequences and/or
structures of moonlighting proteins.
Assuming you find something interesting, how are you going to cite it in your paper or your thesis?
"The sequences were scraped from a public webpage without the consent of the authors". Much better to give credit to the original researchers.
There are good introductions to scraping online. But back to your original question:
import requests
from lxml import html

# let's download one protein at a time; change 3 to any other number
page = requests.get('http://www.moonlightingproteins.org/detail.php?id=3')

# convert the HTML document to something we can parse in Python
tree = html.fromstring(page.content)

# get all table cells
cells = tree.xpath('//td')
for i, cell in enumerate(cells):
    if cell.text:
        # if we get something which looks like a FASTA sequence, print it
        if cell.text.startswith('>'):
            print(cell.text)
    # if we find a table cell which has UniProt in it,
    # let's print the link from the next cell
    if 'UniProt' in cell.text_content():
        if cells[i + 1].find('a') is not None and 'href' in cells[i + 1].find('a').attrib:
            print(cells[i + 1].find('a').attrib['href'])

Extracting data from MS Word

I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia.
I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed.
Which is the best way to do this:
VBA macro from inside Word to create CSV and then upload to the DB?
VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
Python script via win32com then upload to DB?
The last one is attractive to me as the web-interface is being built with Django, but I've never used win32com or tried scripting Word from python.
EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word object model. I am having a problem though: all the text is in tables, and when I pull the strings out of the cells I want, I get a strange little box character at the end of each string. My code looks like:
sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
    Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
    Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
    If Target = "" Then
        ExportText = ""
    Else
        ExportText = Descr & Chr(44) & Assign & Chr(44) & _
            Target & Chr(13) & Chr(10)
        Print #fnum, ExportText
    End If
Next n
Close #fnum
What's up with the little control character box? Is some kind of character code coming across from Word?
Word has a little marker thingy that it puts at the end of every cell of text in a table.
It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.
Just use the Left() function to strip it out, i.e.
Left(Target, Len(Target) - 1)
By the way, instead of
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
Try this:
For Each row In Application.ActiveDocument.Tables(2).Rows
    Descr = row.Cells(2).Range.Text
Well, I've never scripted Word, but it's pretty easy to do simple stuff with win32com. Something like:
from win32com.client import Dispatch
word = Dispatch('Word.Application')
doc = word.Documents.Open('d:\\stuff\\myfile.doc')
doc.SaveAs(FileName='d:\\stuff\\text\\myfile.txt', FileFormat=?) # not sure what to use for ?
This is untested, but I think something like that will just open the file and save it as plain text (provided you can find the right fileformat) – you could then read the text into python and manipulate it from there. There is probably a way to grab the contents of the file directly, too, but I don't know it off hand; documentation can be hard to find, but if you've got VBA docs or experience, you should be able to carry them across.
Have a look at this post from a while ago: http://mail.python.org/pipermail/python-list/2002-October/168785.html Scroll down to COMTools.py; there's some good examples there.
You can also run makepy.py (part of the pythonwin distribution) to generate python "signatures" for the COM functions available, and then look through it as a kind of documentation.
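If you go the win32com route and read table cells directly, note that a cell's Range.Text comes back with Word's cell-end marker attached, which is the same "little box" seen from VBA. It is a carriage return plus a BEL character (Chr(13) & Chr(7)), so stripping it in Python is a one-liner. A sketch (the sample string is my own):

```python
def clean_cell(text):
    # Word table-cell text ends with CR + BEL (Chr(13) & Chr(7))
    return text.rstrip("\r\x07")

print(clean_cell("Order more widgets\r\x07"))  # the trailing marker is removed
```

This mirrors the Left() trick from the VBA answer above without needing to count characters.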
You could use OpenOffice. It can open word files, and also can run python macros.
I'd say look at the related questions on the right -->
The top one seems to have some good ideas for going the python route.
How about saving the file as XML? Then use Python or something else to pull the data out of Word and into the database.
It is possible to programmatically save a Word document as HTML and to import the table(s) it contains into Access. This requires very little effort.
