String count - identify key words and dismissing compound words - python

I have created a program for an assignment which reads a txt file and returns key words. My program returns the key words, but there's one issue with one of the words, 'data'. I should only get 6 results for this but I am getting 7. The reason, I assume, is that there is a compound word present in the text, 'data - analytics'. The program seems to be picking this up and counting it in the final result. Is there anything I could insert at the end of my code to dismiss this?
import string
text = open('news1.txt').read()+open ('news2.txt').read()
print 'data:', string.count(text, 'data')

It's hard to be sure without seeing your actual input files, but there's one obvious possibility:
news1.txt:
data data data dat
news2.txt:
a data data data
There are only 6 instances of the word "data" in the files. But if you concatenate the files, you get this:
data data data data data data data
… and you will count 7 instead of 6.
And it's perfectly plausible that your teacher gave you files that look like that in order to catch exactly this kind of bug. Edge cases that don't come up very often in the wild, and that you didn't think to test for, are exactly the kinds of things that cost you months of frustration—trying to drag repro information out of users, debugging the program, etc. It's a good lesson to learn early on in your programming life.
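The effect can be reproduced without any files. The sketch below uses the two example strings from above in place of the real news1.txt and news2.txt (whose contents are unknown), and a hypothetical helper, count_word, that counts whole-word matches. Counting each file separately and adding the totals avoids the word that only appears at the join.

```python
import re


def count_word(text, word):
    # \b anchors the match at word boundaries, so "dat" or "database"
    # would not be counted as "data".
    return len(re.findall(r"\b" + re.escape(word) + r"\b", text))


# Stand-ins for the contents of news1.txt and news2.txt (assumed, as above).
news1 = "data data data dat"
news2 = "a data data data"

joined_count = count_word(news1 + news2, "data")
separate_count = count_word(news1, "data") + count_word(news2, "data")

print("joined:", joined_count)      # 7, because "dat" + "a" forms an extra "data"
print("separate:", separate_count)  # 6
```

Joining the two texts with a newline between them would have the same effect as counting them separately.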

Python Generic Data Engine

I have been working with Python for about 1.5 years and am looking for some direction. This is the first time I can't find what I need after a lot of searching, so I must be missing something, most likely searching for the wrong terms.
Problem: I am working on an app that has many processes (could be hundreds or even thousands). Each process may have a unique input and output data format: multiline strings, comma-separated strings, Excel or CSV with or without varying headers, and many others. I need something that will format the input correctly and handle the output based on the process. New processes also need to be easy to add and define. I am open to whatever is the best approach, but my thought is to use a database that stores the template/data definition and use that to look up the format for a given process. However, I'm struggling to work out exactly how to do this, and whether it really is the best approach; it needs to be a solution that is scalable. Any direction would be appreciated. Thank you.
A couple simple examples of data
Process 1 example data (multi line string with Header)
Input of
[ABC123, XYZ453, CDE987]
and the resulting data input below would be created:
Barcode
ABC123
XYZ453
CDE987
The code below works, but is not reusable for example 2.
barcodes = ['ABC123', 'XYZ453', 'CDE987']
data = "Barcode\r\n"
for barcode in barcodes:
    data = data + barcode + '\r\n'
Process 2 example input template (comma separated with Header):
Barcode,Location,Param1,Param2
Item1,L1,11,A
Item1,L1,22,B
Item2,L1,33,C
Item2,L2,44,F
Item3,L2,55,B
Item3,L2,66,P
Process 2 example resulting input data (comma separated with Header):
Input of
{'Barcode':['ABC123', 'XYZ453', 'CDE987', 'FGH487', 'YTR123'], 'Location':['Shelf1', 'Shelf2']}
and using the template to create the input data below:
Barcode,Location,Param1,Param2
ABC123,Shelf1,11,A
ABC123,Shelf1,22,B
XYZ453,Shelf1,33,C
XYZ453,Shelf2,44,F
CDE987,Shelf2,55,B
CDE987,Shelf2,66,P
FGH487,Shelf1,11,A
FGH487,Shelf1,22,B
YTR123,Shelf1,33,C
YTR123,Shelf2,44,F
I know how to handle each process with a hardcoded loop, dataframe merge, etc. I've done some abstraction in other cases with dicts. However, how to define and store formats that vary so much, and how to create reusable, abstracted code, is where I am stuck.
Maybe you could have each function return a tuple with the fields "datatype" and "output" (the actual output).
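The question's idea of storing one format definition per process could be sketched roughly as below. This is only an illustration under assumptions: FORMATS and render are hypothetical names, the definitions live in a dict here rather than a database, and only delimiter-separated text with a header is covered. Expanding a template such as the one in process 2 is left out.

```python
import csv
import io

# Hypothetical format definitions, one per process. In a real system these
# could be rows loaded from a database instead of a hardcoded dict.
FORMATS = {
    "process1": {"header": ["Barcode"], "delimiter": ",", "newline": "\r\n"},
    "process2": {
        "header": ["Barcode", "Location", "Param1", "Param2"],
        "delimiter": ",",
        "newline": "\r\n",
    },
}


def render(process, rows):
    """Format rows (a list of lists) according to the process's definition."""
    spec = FORMATS[process]
    buffer = io.StringIO()
    writer = csv.writer(
        buffer, delimiter=spec["delimiter"], lineterminator=spec["newline"]
    )
    writer.writerow(spec["header"])
    writer.writerows(rows)
    return buffer.getvalue()


process1_input = render("process1", [["ABC123"], ["XYZ453"], ["CDE987"]])
print(repr(process1_input))
```

Adding a new process then means adding a definition rather than writing a new loop, which is the kind of reuse the question asks about.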

Adding text between 2 html tags

I am a second-year student and I am working on text mining.
To give a general idea of the code: it first accepts PDF text and converts it into a doc.txt file. I then process that data for a couple of hundred lines, after which I store all the sentences in the text in a list called all_text (for future use). I also select some sentences and store them in a list called summary.
The problem is in this part:
The summary list looks like this:
summary=['Artificial Intelligence (AI) is a science and a set of computational technologies that are inspired by—but typically operate quite differently from—the ways people use their nervous systems and bodies to sense, learn, reason, and take action.','In reality, AI is already changing our daily lives, almost entirely in ways that improve human health, safety,and productivity.','AI is also changing how people interact with technology.']
What I want is to read doc.txt sentence by sentence and, if a sentence is in the summary list, wrap it in a bold tag, "<b>the sentence</b>", for every sentence in the summary list. Here is the small piece of code I tried for that specific part. It is not very helpful, but here it is:
while i < len(lis):
    if lis[i] in txt:
        txt = txt.replace(lis[i], "<b>" + lis[i] + "</b>")
        print(lis[i])
    i += 1
This code did not work as I expected. It works for some short sentences, but it doesn't work for sentences like those above, and I have no idea why. Can you help me, please?
For that purpose you might use a list comprehension, for example:
summary = ['sentenceE','sentenceA']
text = ['sentenceA','sentenceB','sentenceC','sentenceD','sentenceE']
output = ['<b>'+i+'</b>' if (i in summary) else i for i in text]
print(output) #prints ['<b>sentenceA</b>', 'sentenceB', 'sentenceC', 'sentenceD', '<b>sentenceE</b>']
Note that summary and text should be lists of strs.
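If longer sentences still fail to match, one possible cause is that text extracted from a PDF contains line breaks or doubled spaces inside a sentence. This is an assumption, since the contents of doc.txt are not shown. Under that assumption, a minimal sketch is to compare whitespace-normalized copies of the sentences; normalize and bold_summary are hypothetical helper names.

```python
def normalize(sentence):
    # Collapse any run of whitespace (spaces, tabs, line breaks) to one space.
    return " ".join(sentence.split())


def bold_summary(sentences, summary):
    # Compare normalized forms, but keep the original sentence text in the output.
    wanted = {normalize(s) for s in summary}
    return ["<b>" + s + "</b>" if normalize(s) in wanted else s for s in sentences]


text = [
    "AI is also changing\nhow people interact with technology.",
    "Another sentence.",
]
summary = ["AI is also changing how people interact with technology."]

output = bold_summary(text, summary)
print(output)
```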

Python convert list into split lists

I have been given the task of using an API to pull student records and learner IDs to put into an in-house application. The JSON formatting is dreadful, and the only successful way I have managed to split students individually is by the last value.
Now I am at the next stumbling block. I need to split these student lists into smaller sections, so I implemented a for loop like so:
student = request.text.split('"SENMajorNeedsDetails"')
for students in student:
    r = str(student).split(',')
    print (student[0], student[1])
    print (r[0], r[1])
This works perfectly, except that it puts it all into a single list again, and each student record isn't a set length (some have more values/fields than others).
What I am looking to do is have a list for each student, split on the comma, so student1 would equal [learnerID,personID,name,etc...]
This way, when I want to reference the learnerID, I can call learner1[0].
It is also very possible that I am going about this the wrong way and should be doing some other form of list comprehension.
The step-by-step process that I am aiming towards is:
pull data from system - DONE
split data into individual students - DONE
take learnerID,name,group of each student and add database entry
I have split step 3 into two stages, where one involves my issue above and the second is creating the database records.
Below is a shortened example of the list item student[0], followed by student[1]. If more is needed, then say.
:null},{"LearnerId":XXXXXX,"PersonId":XXXXXX,"LearnerCode":"XXXX-XXXXXX","UPN":"XXXXXXXXXXX","ULN":"XXXXXXXXXX","Surname":"XXXXX","Forename":"XXXXX","LegalSurname":"XXXXX","LegalForename":"XXXXXX","DateOfBirth":"XX/XX/XXXX 00:00:00","Year":"XX","Course":"KS5","DateOfEntry":"XX/XX/XXXX 00:00:00","Gender":"X","RegGroup":"1XX",],
:null},{"LearnerId":YYYYYYY,"PersonId":YYYYYYYY,"LearnerCode":"XXXX-YYYYYYYY","UPN":"YYYYYYYYYY","ULN":"YYYYYYYYYY","Surname":"YYYYYYYY","Forename":"YYYYYY","LegalSurname":"YYYYYY","LegalForename":"YYYYYYY","DateOfBirth":"XX/XX/XXXX 00:00:00","Year":"XX","Course":"KS5","DateOfEntry":"XX/XX/XXXX 00:00:00","Gender":"X","RegGroup":"1YY",],
Sorry, it doesn't like putting it on separate lines.
EDIT* changed wording at the end and added a redacted student record
Just to clarify, the resolution to my issue was to learn how to parse JSON properly. This was pointed out by @Patrick Haugh, and all credit should go to him for pointing me in the right direction. The second most helpful person was @ArndtJonasson.
The problem was that I was manually trying to do the job of the JSON library, and I am nowhere near that level of competency yet. As stated originally, it was totally likely that I was going about it in completely the wrong way.
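For readers who land here with the same problem, the resolution could look roughly like the sketch below. The payload is hypothetical: the real API response is only shown redacted, so a top-level list of student objects is an assumption, and only the field names (LearnerId, Forename, Surname, RegGroup) are taken from the example above. With a live response from the requests library, calling the response's json() method would do the same parsing.

```python
import json

# Hypothetical payload; the real response shape is not shown in full.
response_text = """
[
  {"LearnerId": 101, "PersonId": 5001, "Surname": "Smith",
   "Forename": "Ann", "RegGroup": "12A"},
  {"LearnerId": 102, "PersonId": 5002, "Surname": "Jones",
   "Forename": "Bob", "RegGroup": "12B", "Year": "12"}
]
"""

# json.loads turns the text into a list of dicts, one per student,
# regardless of how many fields each record has.
students = json.loads(response_text)

# Pick out only the fields needed for the database entry.
records = [
    (s["LearnerId"], s["Forename"] + " " + s["Surname"], s["RegGroup"])
    for s in students
]
print(records)
```

Looking fields up by name also removes the need to rely on positions such as learner1[0], which breaks when records have different numbers of fields.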

Generating organized excel spreadsheet from a word document

I have a Microsoft Word document which we want to transfer to Excel. Every sentence needs to be separated and then pasted into the next appropriate cell in Excel. Each sentence also needs to be classified as a heading, a requirement, or informational.
I will recreate what the typical Word format looks like:
2.3.4 Lightening Transient Response
The device shall meet spec 24532. Voltage must resemble figure.
Figure 1.
which translates to
<numbering> <Heading>
<Requirements/information>
In Excel, that is almost exactly how I would like the document to look, except that the second requirement sentence should be in the row just below the previous requirement sentence.
2.3.4 | Lightening Transient Response | Heading
| The device shall meet spec 24532. | Requirement
|Voltage must resemble figure | Requirement
|figure 1 | Informational
I have attempted this project with Python using the openpyxl and docx modules. I have code that can go into Word and get sentences, and code that can analyze a sentence. I'm retrieving runs from paragraphs. I am having problems because not all sentences are coming back, due to how the Word document is formatted. I am typically only getting the headings back. The heading numbers are not stored in runs. The requirements underneath the headings are stored in tables. I have written some code to get into the tables and extract the text from cells, so that is one way to get the requirements; however, that snippet of code is giving problems (it gives me the same sentence three times in a row).
I'm looking for other possible ways to do this. I'm thinking of a format switch. XML has been mentioned, and PDF with Python's PDF module may also be possible.
Any thoughts or advice would be greatly appreciated.
-Chris
XML is going to be harder, not easier. You're closer than you seem to think. I recommend attacking each problem separately until you crack it.
The sentence-three-times problem in the table is because of merged cells. The way python-docx works on tables, there is an underlying table layout of x rows and y columns. If two side-by-side cells are merged, you get the same results for both those cells. You can detect this by comparing the two cells for equality. Roughly like "if this_cell == last_cell skip this cell".
There's no way around the heading problem. Heading numbers only exist inside a running instance of Word; they are generated at display (or print) time. To get those you need to use the same rules to generate your own numbers. So you'd need to keep track of the number of headings you've passed through etc. and form your own dot-separated numbering.
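Generating your own numbers, as described above, could be sketched like this. It is a minimal illustration, not Word's actual numbering rules: number_headings is a hypothetical helper, and it assumes you can already obtain each heading's level in document order (for instance from a paragraph style name such as "Heading 2") and that numbering starts at 1 at every level.

```python
def number_headings(levels):
    """Return dot-separated numbers for heading levels given in document order.

    levels: iterable of ints, where 1 is a top-level heading.
    """
    counters = []
    numbers = []
    for level in levels:
        # Forget counters of deeper levels: a new heading resets its children.
        counters = counters[:level]
        # Pad with zeros if this heading is deeper than anything seen so far.
        while len(counters) < level:
            counters.append(0)
        counters[level - 1] += 1
        numbers.append(".".join(str(c) for c in counters))
    return numbers


numbers = number_headings([1, 2, 2, 3, 1, 2])
print(numbers)  # ['1', '1.1', '1.2', '1.2.1', '2', '2.1']
```

A document whose numbering starts elsewhere, such as the 2.3.4 in the example, would need the initial counters seeded accordingly.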
Why are you using Python for this? Just use VBA, since you are working with Excel and Word.
Something like this should get you pretty close to where you want to be. It may need some tweaking...
Sub Demo()
    Dim wdApp As Word.Application
    Set wdApp = Word.Application
    Dim wdDoc As Word.Document
    Set wdDoc = wdApp.ActiveDocument
    wdDoc.Range.Copy
    ActiveSheet.Paste Destination:=ActiveSheet.Range("A1")
    With ActiveSheet
        .Paste Destination:=Range("A" & .Cells.SpecialCells(xlCellTypeLastCell).Row + 1)
    End With
    Set myRange = Range("A1:A100")
    For i = 1 To myRange.Rows.Count
        If InStr(myRange.Cells(i, "A").Value, "Voltage") > 0 Then
            myRange.Cells(i, "A").Offset(1, 0).Select
            ActiveCell.EntireRow.Insert
            ActiveCell.Offset(-1, 0).Select
            If InStr(myRange.Cells(i, "A").Value, "Voltage") > 0 Then
                position1 = InStr(1, ActiveCell.Value, "Voltage")
                myRange.Cells(i + 1, "A").Value = Mid(ActiveCell.Value, position1, 99)
                ActiveCell.Value = Left(ActiveCell.Value, position1 - 2)
                i = i + 2
            End If
        End If
    Next i
End Sub
So, copy the text from your Word doc, which should be open and active, and you're good to go. There are other ways to do this too.

Reading text and assigning a class to data in Python

I've been searching around, and had no luck finding anything answering my question.
Essentially I have a file with the following data:
Title - 19
Artist - Adele
Year released - 2008
1 - Daydreamer, 3:41, 1
2 - Best for Last, 4:19, 5
3 - Chasing Pavements, 3:31, 7
4 - Cold Shoulder, 3:12, 3
Title - El Camino
Artist - The Black Keys
Year released - 2011
1 - Lonely Boy, 3:13, 1
2 - Run Right Back, 3:17, 10
EOF
I know how to create classes, and how to assign an object to a class and values to that object, but I am just about ready to tear my hair out over how I'm supposed to process the text. From the text, I need to create a title for the album, and assign the album's information to it. There's more besides that which needs to be done, and there are more lines to be read, and I just don't know where to start on this. I've found two "album.py" files via Google, and I've been unable to make heads or tails of how to apply the solution to my case.
And yes, this is for a school assignment. I've done some digging around and found some things relevant, but I'm just not understanding it. I'm new to programming in general, and I've made progress but this seems too far over my head.
I know I could reduce this to lists using split (\n\n) and operating on a series of progressively smaller lists, but I am trying to avoid this method at all costs.
EDIT:
For the time being, it's best to assume I know nothing. Though, to answer the question below: I can open the file and read it. If it's a consistent CSV-formatted file, I can write code to process the enclosed data, and create a class structure that uses that data. Right now I'm just having trouble with the first three lines, and the digits immediately below.
APRIL 4 2012:
Okay, I have some code, I've left the comments with respect to it underneath.
def getInput():
    global albums
    raw = open("album.txt","r")
    infile = raw
    raw.close
    text=""
    line = infile.readline()
    while (line != "EOF\n" ):
        text += line
        line=infile.readline()
    text=text.rstrip("\n\n")
    albums=[str(n) for n in text.split("\n\n")]
    return albums
class Album():
    def __init__(self, title, artist, date):
        self.title=title
        self.artist=artist
        self.date=date
        self.track={}
    def addSong(self, TrackID, title, time, ranking):
        self.track+={self}
    def getAlbumLength(self):
        asdf=0
    def getRanking(self):
        asdf=0
def labels(x): #establishes labels per item to be used for Album Classifier
    title=""
    artist=""
    date=""
    for i in range(0,len(albums),1):
        sublist=[str(n) for n in albums[i].split("\n")]
        RANDUMB=len(albums[i])
        title=sublist[0]
        artist=sublist[1]
        date=sublist[2]
        for j in range(0,len(sublist),1):
            song_info = [str(k) for k in sublist[3:].split("," and " - ")]
            TrackID=song_info[0]
            title=song_info[1]
            time=song_info[2]
            ranking=song_info[3]
getInput()
labels(albums)
Personal comments on code:
I was trying to avoid getting it into lists because I anticipated this problem. As the functions are concerned, I have to use every single bloody one, because it's in the assignment requirements... I am displeased because I could probably get around using them. The code is working sufficiently enough, except for the last part of it where I am trying to take the song information. I want to split the song information into lists, which are nested into the album information list. Something like:
[Album title, Artist, Date released,[01,Song,3:44,2],[02,Song,0:01,9]....]
The current code gives me index out of range error as of right now... I am using python3.
TLDR: The substance of my problem has thus changed from one of trying to solve how to go about starting the solution to how to take items in a list and convert them into nested lists.
If you end up editing your question to contain some more specific examples of what is giving you trouble, I will edit this answer. But to address your general question, there are some steps involved to achieving your goal.
Like you said, you need to write a class that reflects the structure you intend to have from this data.
You will need to parse this file, probably line by line. So you have to determine whether this file format is consistent. If it is, then you need to determine:
What is the delimiter between each set of data, which will be conformed into a class instance?
What is the delimiter between each field of each line?
When you are looping over each line, you will know that you need to start a new album object whenever you encounter a blank line.
When you know you are starting a new album, you can assume that the first line will be a title, the second an artist, the third, the year, etc.
For each of these lines you will also have to have rules of how to split each one into the data you want. At a basic level it can be a simple set of splits. At a more advanced level you might define regular expressions for each type of line.
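The steps above can be sketched as follows. This is one possible reading rather than the assignment's required design: parse_albums and add_song are hypothetical names, the sample lines are taken from the question, and the sketch starts a new album when it sees a "Title - " line (an assumption) so that it works whether or not blank lines separate the albums.

```python
class Album:
    def __init__(self, title, artist, year):
        self.title = title
        self.artist = artist
        self.year = year
        self.tracks = []  # each entry: [track_id, name, length, ranking]

    def add_song(self, track_id, name, length, ranking):
        self.tracks.append([track_id, name, length, ranking])


def parse_albums(lines):
    albums, header, current = [], {}, None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank separator lines carry no data
        if line == "EOF":
            break
        # Every data line has the form "<key> - <value>".
        key, _, value = line.partition(" - ")
        if key == "Title":
            header = {"title": value}
        elif key == "Artist":
            header["artist"] = value
        elif key == "Year released":
            current = Album(header["title"], header["artist"], int(value))
            albums.append(current)
        else:
            # Track line: split from the right so commas in a song name survive.
            name, length, ranking = [p.strip() for p in value.rsplit(",", 2)]
            current.add_song(int(key), name, length, int(ranking))
    return albums


sample = """Title - 19
Artist - Adele
Year released - 2008
1 - Daydreamer, 3:41, 1
2 - Best for Last, 4:19, 5
3 - Chasing Pavements, 3:31, 7
4 - Cold Shoulder, 3:12, 3

Title - El Camino
Artist - The Black Keys
Year released - 2011
1 - Lonely Boy, 3:13, 1
2 - Run Right Back, 3:17, 10
EOF
"""

albums = parse_albums(sample.splitlines())
for album in albums:
    print(album.title, album.artist, album.year, album.tracks)
```

Each album's tracks attribute ends up as the nested list of song details that the question asks for.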
