Compare two files and make a list

I have two files that I want to compare with each other to form a list. Each file has its own class: Book and Person. Each class has different attributes. The ones I want to compare are: person.personalcode == book.borrowed. From this I want a list of all the borrowed books. I have started like this:
for person in person_list:
    for book in booklibrary_list:
        if person.personalcode == book.borrowed:
            person.books.append(book, person)
for person in person_list:
    if len(person.books) > 0:
        print(person.personalcode + "," + person.firstname + person.lastname + "have borrowed the following books: ")
        for book in person.books:
            print(book)
for person in person_list:
    person.books = []
But it does not work, what have I missed or done wrong?

Posting as an answer as this is too long for a comment.
First: improve your question. Show how you construct the Person and the Book class, and how you populate them. Describe what the personalcode is and how come personalcode would be the same as a book code. Some sample data and a bit more code would make this easier to answer.
Second: reading your other question, you seem to be storing your data in a text file, loading and querying, modifying and saving the data directly. This will lead you to problems and instead you should consider going down one of two lines:
Use an SQL database, possibly the easiest to start with is SQLite as it does not need a server to be set up and there is a module in the standard library that is very easy to use. Store your data there and you will find it easier in the long run.
Use Python objects (e.g. three classes: Person, Book, and BorrowedBook), manage lists of them within the program, and use shelve from the standard library to store and retrieve these lists of objects between queries.
The use of shelve would be easier if you have not used SQL before, and I hope you will forgive the pun when I say that it might be very appropriate for a book-related application!
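As a rough illustration of the shelve route, here is a minimal sketch. It assumes plain dict records rather than the three classes (instances of your own classes can be shelved too, as long as the class is importable when the shelf is read back). The sample people, titles, and the helper names borrowed_by_person, save, and load are all made up for this example; only the personalcode and borrowed fields come from the question. Note that list.append takes exactly one argument, which is one likely reason the loop in the question fails.

```python
import os
import shelve
import tempfile

# Hypothetical sample data; only the field names personalcode and borrowed
# are taken from the question.
people = [
    {"personalcode": "P1", "firstname": "Ada", "lastname": "Berg"},
    {"personalcode": "P2", "firstname": "Carl", "lastname": "Dahl"},
]
books = [
    {"title": "Dune", "borrowed": "P1"},
    {"title": "Emma", "borrowed": ""},  # not on loan
    {"title": "Ulysses", "borrowed": "P1"},
]


def borrowed_by_person(people, books):
    """Map each personal code to the titles that person has borrowed."""
    loans = {person["personalcode"]: [] for person in people}
    for book in books:
        if book["borrowed"] in loans:
            # append() takes a single argument: the item to add
            loans[book["borrowed"]].append(book["title"])
    return loans


def save(path, people, books):
    """Store both lists in a shelf file."""
    with shelve.open(path) as db:
        db["people"] = people
        db["books"] = books


def load(path):
    """Read both lists back from the shelf file."""
    with shelve.open(path) as db:
        return db["people"], db["books"]


# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "library")
    save(path, people, books)
    stored_people, stored_books = load(path)

loans = borrowed_by_person(stored_people, stored_books)
for person in stored_people:
    titles = loans[person["personalcode"]]
    if titles:
        name = person["firstname"] + " " + person["lastname"]
        print(person["personalcode"] + ", " + name + " has borrowed: " + ", ".join(titles))
```

In a real program the shelf would live at a fixed path instead of a temporary directory, so the data survives between runs.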

Related

Scraping data from a http & javaScript site

I currently want to scrape some data from an amazon page and I'm kind of stuck.
For example, let's take this page.
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I wanted to scrape every variant of shoe size and color. That data can be found opening the source code and searching for 'variationValues'.
There we can see a sort of dictionary containing all the sizes and colors and, below that, in 'asinToDimensionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimensionIndexMap we can see
"B01KWIUH5M":[0,0]
Which means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in variationValues size_name section) and the color 'Teal' (same idea as before)
I want to scrape both the variationValues and the asinToDimensionIndexMap, so I can associate the IndexMap numbers with the variationValues ones.
Another person in the site (thanks for the help btw) suggested doing it this way.
import json
import re

script = response.xpath('//script/text()').extract_first()
# capture everything between {}
data = re.findall(r'(\{.+?\})', script)
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part. We get everything that's a 'script' as a string and then get everything between {}. The issue is what happens after that. My knowledge of json is not that great and reading some stuff about it didn't help that much.
Is there a way to get, from that data, 2 dictionaries or lists with the variationValues and asinToDimensionIndexMap? (maybe using some regular expressions in the middle to get some data out of a big string). Or could someone explain a little bit what happens with the json part?
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap

I think you are close Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a text format for exchanging structured data. json.loads parses a JSON string into Python objects (dicts, lists, strings, numbers), regardless of the platform that produced the string.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming the part you may be finding challenging is errors when accessing a particular "box" inside your JSON object.
Your code format looks correct, but your access within "each box" may look different.
E.g., if your 'asinToDimensionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimensionIndexMap']
I've hacked and slashed a little bit so you can better understand the structure of your particular JSON file. Take a look at the link below. On the right-hand side, you will see "which boxes are within one another", which is precisely what you need to know for accessing what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in and look at the structure this way to truly find what you're looking for.
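If an online viewer is not handy, the same "which boxes are within one another" view can be produced in Python. This is only a sketch: the raw string below is a tiny made-up stand-in shaped like the accesses discussed above, not the real page data, and key_paths is a hypothetical helper name.

```python
import json

# Made-up stand-in for data[0]; the real scraped string is far larger.
raw = (
    '{"products": [{"asinToDimensionIndexMap": {"B01KWIUH5M": [0, 0]}}],'
    ' "updateDivLists": {"full": [{"divToUpdate":'
    ' "companyCompliancePolicies_feature_div"}]}}'
)
d = json.loads(raw)  # JSON text -> nested dicts and lists


def key_paths(node, prefix=""):
    """Yield the subscript path to every leaf value in the parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from key_paths(value, f"{prefix}[{key!r}]")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from key_paths(value, f"{prefix}[{index}]")
    else:
        yield prefix


paths = list(key_paths(d))
for path in paths:
    print("d" + path)
```

Each printed line is an expression you can paste back after d to reach that value, which makes it easy to spot where a key such as asinToDimensionIndexMap actually sits.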

variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert them with json.loads and combine them as you wish.
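A hedged sketch of that last step, run on a made-up two-variant stand-in for the script text (the second ASIN and the extra size and color are invented for illustration; extract_json is a hypothetical helper). It assumes that the numbers in asinToDimensionIndexMap follow the key order of variationValues, which matches the example in the question but should be checked against the real page.

```python
import json
import re

# Made-up stand-in for the joined script text; real pages are far larger.
script_text = '''
"variationValues" : {"size_name":["8M US","9M US"],"color_name":["Teal","Black"]},
"asinToDimensionIndexMap" : {"B01KWIUH5M":[0,0],"B01KWIUHAM":[1,1]},
'''


def extract_json(key, text):
    """Return the parsed object stored under key, or None if it is absent.

    The non-greedy match stops at the first closing brace, so it only
    works for values that do not contain nested objects.
    """
    match = re.search(r'"%s"\s*:\s*(\{.*?\})' % re.escape(key), text)
    return json.loads(match.group(1)) if match else None


variation_values = extract_json("variationValues", script_text)
index_map = extract_json("asinToDimensionIndexMap", script_text)

# Assumption: index positions follow the key order of variationValues.
dimensions = list(variation_values)

variants = {
    asin: {dim: variation_values[dim][i] for dim, i in zip(dimensions, indexes)}
    for asin, indexes in index_map.items()
}
print(variants["B01KWIUH5M"])
```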

Pass many-to-many object to variable

I have two classes with a many-to-many relationship. My goal is to fill an 'item' list with data from those two models. Here are my models:
class Bakery(models.Model):
    title = models.CharField('restaurant_name', max_length=100)

class DeliveryService(models.Model):
    title = models.CharField('deliveryservice_name', max_length=100)
    bakery = models.ManyToManyField(Bakery)
Here is the logic on my 'views' file:
item = []
bakerys = Bakery.objects.all()
for i in bakerys:
    item.append(i.title)
    item.append(i.deliveryservice.title)
I hope it is clear what I want to accomplish. I know my current 'views' logic is wrong; I just don't know how to solve the problem. Thank you for your time.

The following seems to do what you're asking for. But it seems odd that you want to create a list with all the titles for different objects all mixed together and likely have duplicates (if a delivery service is linked to more than one bakery it'll be added twice).
item = []
bakerys = Bakery.objects.all()
for i in bakerys:
    item.append(i.title)
    for j in i.deliveryservice_set.all():
        item.append(j.title)
You should really read up on the many-to-many functionality of the ORM. The documentation is pretty clear on how to do these things.
Sayse had a good answer too if you really just want all the titles. Their answer also groups everything in tuples and accomplishes it with more efficiency by using fewer db queries. Their answer was: Bakery.objects.values('title', 'deliveryservice__title')

Better Database Design for a Hierarchical Structure?

I've created a bilingual dictionary app [1], and it's currently very simple, but we're going to start developing the entries more fully and I'm trying to figure out the best database structure for it. Previous dictionary projects I've worked on have used XML (since dictionary entries are largely hierarchical), but I need to do it using a database. [2]
This is what a typical, medium-complexity entry would look like (simplified a bit):
dar
/dār/
noun
    house, dwelling, abode
        ar-rājl dkhul ad-dār, "The man entered the house."
    home
        rjaƷna lid-dār, "We returned home."
verb
    to turn
        dūr li-yamīn, "Turn right."
    to turn around/about
As you can see, one word can have multiple parts of speech, so "part of speech" can't simply be an attribute of Entry, it has to be related to the senses. Each pos can have multiple senses (numbered), and of course each sense could have multiple near-synonymous translations. Senses may also have example sentences (possibly more than one), but not always. Thinking of how the entry parts relate to each other, I came up with the following structure, using five tables:
Entry
-id
-headword
-pronunciation
-...
PartOfSpeech
-id
-entry (ForeignKey)
-pos
Sense
-id
-sense_number
-part_of_speech (ForeignKey)
-...
Translation
-id
-tr
-sense (ForeignKey)
-...
Example
-id
-ex
-ex_tr
-sense (ForeignKey)
-...
Or, in other words:
                                  |-- Translation
Entry -- PartOfSpeech -- Sense --|
                                  |-- Example
This seems simple and makes sense to me, but I'm wondering if it will be too complicated in the execution. For instance, to display a selection of entries, I would need to write several nested for loops (for e in entries → for p in pos → for s in senses → for tr in translations) — and all with reverse lookups!
And I don't think I could even edit a whole entry in the Django admin (unless it lets you somehow do an Inline of an Inline of an Inline). I'm going to build an editor interface anyway, but it's nice to be able to check things on the admin site when you want to.
Is there a better way to do this? I feel like there must be something clever that I'm missing.
Thanks,
Karen
[1] If you're curious: tunisiandictionary.org. In its simple, current form it only has two tables (Entry, Sense), with the translations just comma-delimited in a single field. Which is bad.
[2] For two reasons: 1) because it's a web app I've written with Python/Django, and 2) because I hate XML.
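For reference, the five-table layout described in the question can be tried out with the standard-library sqlite3 module before committing to Django models. This is a sketch only: the snake_case table and column names are assumptions, and the sample rows are taken from the "dar" entry above. It also suggests that the nested loops can collapse into one joined query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE entry (id INTEGER PRIMARY KEY, headword TEXT, pronunciation TEXT);
CREATE TABLE part_of_speech (
    id INTEGER PRIMARY KEY, entry_id INTEGER REFERENCES entry(id), pos TEXT);
CREATE TABLE sense (
    id INTEGER PRIMARY KEY,
    part_of_speech_id INTEGER REFERENCES part_of_speech(id),
    sense_number INTEGER);
CREATE TABLE translation (
    id INTEGER PRIMARY KEY, sense_id INTEGER REFERENCES sense(id), tr TEXT);
CREATE TABLE example (
    id INTEGER PRIMARY KEY, sense_id INTEGER REFERENCES sense(id),
    ex TEXT, ex_tr TEXT);
""")

# Sample rows from the "dar" entry (examples omitted to keep this short).
conn.execute("INSERT INTO entry VALUES (1, 'dar', '/dār/')")
conn.executemany("INSERT INTO part_of_speech VALUES (?, ?, ?)",
                 [(1, 1, "noun"), (2, 1, "verb")])
conn.executemany("INSERT INTO sense VALUES (?, ?, ?)",
                 [(1, 1, 1), (2, 1, 2), (3, 2, 1)])
conn.executemany("INSERT INTO translation VALUES (?, ?, ?)",
                 [(1, 1, "house"), (2, 1, "dwelling"),
                  (3, 2, "home"), (4, 3, "to turn")])

# One joined query walks entry -> part of speech -> sense -> translation.
rows = conn.execute("""
SELECT e.headword, p.pos, s.sense_number, t.tr
FROM entry e
JOIN part_of_speech p ON p.entry_id = e.id
JOIN sense s ON s.part_of_speech_id = p.id
JOIN translation t ON t.sense_id = s.id
ORDER BY p.id, s.sense_number, t.id
""").fetchall()

for row in rows:
    print(row)
```

The flat rows can then be grouped in Python for display, which keeps the number of database round trips at one per page of entries.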

You can emulate saving dictionaries in sql databases as well. Someone wrote this awesome helper already:
Django Dictionary Model
I use it in my project as well.

Why not use the python dictionary data structure (or json/bson) along with mongodb?
In python, it is much more convenient than xml.
For example you can simply have a list of python dict objects to represent the entire dictionary. Each element can be a structured as follows:
[{
    "_id": "1",
    "word": "étudier",
    "definitions": [
        (
            "v",
            "to study",
            "j'étudie français",
            "I study french"
        ), ...
    ]
}, ...]
where definitions is a list of tuples (first element is the part of speech, second element is the definition, third element is an example in the first language, fourth element is the translation of that example).
You can then easily index it within a mongodb database.
This is a very simple structure and you don't need to deal with an over-complicated database with foreign keys. Using mongodb, retrieving definitions for a word is as easy as
record = db.collection.find({'word': 'étudier'})

How do I create a list or set object in a class in Python?

For my project, the role of the Lecturer (defined as a class) is to offer projects to students. Project itself is also a class. I have some global dictionaries, keyed by the unique numeric id's for lecturers and projects that map to objects.
Thus for the "lecturers" dictionary (currently):
lecturer[id] = Lecturer(lec_name, lec_id, max_students)
I'm currently reading in a white-space delimited text file that has been generated from a database. I have no direct access to the database so I haven't much say on how the file is formatted. Here's a fictionalised snippet that shows how the text file is structured. Please pardon the cheesiness.
0001 001 "Miyamoto, S." "Even Newer Super Mario Bros"
0002 001 "Miyamoto, S." "Legend of Zelda: Skies of Hyrule"
0003 002 "Molyneux, P." "Project Milo"
0004 002 "Molyneux, P." "Fable III"
0005 003 "Blow, J." "Ponytail"
The structure of each line is basically proj_id, lec_id, lec_name, proj_name.
Now, I'm currently reading the relevant data into the relevant objects. Thus, proj_id is stored in class Project whereas lec_name is a class Lecturer object, et al. The Lecturer and Project classes are not currently related.
However, as I read in each line from the text file, for that line, I wish to read in the project offered by the lecturer into the Lecturer class; I'm already reading the proj_id into the Project class. I'd like to create an object in Lecturer called offered_proj which should be a set or list of the projects offered by that lecturer. Thus whenever, for a line, I read in a new project under the same lec_id, offered_proj will be updated with that project. If I wanted to display a list of projects offered by a lecturer I'd ideally just want to use print lecturers[lec_id].offered_proj.
My Python isn't great and I'd appreciate it if someone could show me a way to do that. I'm not sure if it's better as a set or a list, as well.
Update
After the advice from Alex Martelli and Oddthinking I went back and made some changes and tried to print the results.
Here's the code snippet:
for line in csv_file:
    proj_id = int(line[0])
    lec_id = int(line[1])
    lec_name = line[2]
    proj_name = line[3]
    projects[proj_id] = Project(proj_id, proj_name)
    lecturers[lec_id] = Lecturer(lec_id, lec_name)
    if lec_id in lecturers.keys():
        lecturers[lec_id].offered_proj.add(proj_id)
    print lec_id, lecturers[lec_id].offered_proj
The print lecturers[lec_id].offered_proj line prints the following output:
001 set([0001])
001 set([0002])
002 set([0003])
002 set([0004])
003 set([0005])
It basically feels like the set is being overwritten or somesuch. So if I try to print for a specific lecturer with print lec_id, lecturers[001].offered_proj, all I get is the last proj_id that has been read in.

A set is better since you don't care about order and have no duplicates.
You can parse the file easily with the csv module (with a delimiter of ' ').
Once you have the lec_name you must check if that lecturer's already known; for that purpose, keep a dictionary from lec_name to lecturer objects (that's just another reference to the same lecturer object which you also refer to from the lecturer dictionary). On finding a lec_name that's not in that dictionary you know it's a lecturer not previously seen, so make a new lecturer object (and stick it in both dicts) in that case only, with an empty set of offered courses. Finally, just .add the course to the current lecturer's offered_proj. It's really a pretty smooth flow.
Have you tried implementing this flow? If so, what problems have you had? Can you show us the relevant code -- should be a dozen lines or so, at most?
Edit: since the OP has posted code now, I can spot the bug -- it's here:
lecturers[lec_id] = Lecturer(lec_id, lec_name)
if lec_id in lecturers.keys():
    lecturers[lec_id].offered_proj.add(proj_id)
this is unconditionally creating a new lecturer object (trampling over the old one in the lecturers dict, if any) so of course the previous set gets tossed away. This is the code you need: first check, and create only if needed! (also, minor bug, don't check in....keys(), that's horribly inefficient - just check for presence in the dict). As follows:
if lec_id in lecturers:
    thelec = lecturers[lec_id]
else:
    thelec = lecturers[lec_id] = Lecturer(lec_id, lec_name)
thelec.offered_proj.add(proj_id)
You could express this in several different ways, but I hope this is clear enough. Just for completeness, the way I would normally phrase it (to avoid two lookups into the dictionary) is as follows:
thelec = lecturers.get(lec_id)
if thelec is None:
    thelec = lecturers[lec_id] = Lecturer(lec_id, lec_name)
thelec.offered_proj.add(proj_id)
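Putting the pieces together, here is a self-contained Python 3 sketch of the whole flow, using the sample lines from the question. The Lecturer class is a minimal stand-in (its real definition is not shown in the question), and the ids are kept as strings so the leading zeros survive.

```python
import csv
import io


class Lecturer:
    """Minimal stand-in for the question's Lecturer class."""

    def __init__(self, lec_id, lec_name):
        self.lec_id = lec_id
        self.lec_name = lec_name
        self.offered_proj = set()  # starts empty, filled as lines are read


# The sample lines from the question, wrapped so they read like a file.
sample = io.StringIO(
    '0001 001 "Miyamoto, S." "Even Newer Super Mario Bros"\n'
    '0002 001 "Miyamoto, S." "Legend of Zelda: Skies of Hyrule"\n'
    '0003 002 "Molyneux, P." "Project Milo"\n'
    '0004 002 "Molyneux, P." "Fable III"\n'
    '0005 003 "Blow, J." "Ponytail"\n'
)

lecturers = {}
# delimiter=" " splits on spaces; the quoted names stay in one piece.
for proj_id, lec_id, lec_name, proj_name in csv.reader(sample, delimiter=" "):
    lec = lecturers.get(lec_id)
    if lec is None:  # first time this lecturer is seen: create once
        lec = lecturers[lec_id] = Lecturer(lec_id, lec_name)
    lec.offered_proj.add(proj_id)

for lec_id, lec in lecturers.items():
    print(lec_id, lec.lec_name, sorted(lec.offered_proj))
```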

Sets are useful when you want to guarantee you only have one instance of each item. They are also faster than a list at calculating whether an item is present in the collection.
Lists are faster at adding items, and also have an ordering.
This sounds like you would like a set. You sound like you are very close already.
In Lecturer.__init__, add a line:
self.offered_proj = set()
That will make an empty set.
When you read in the project, you can simply add to that set:
lecturer.offered_proj.add(project)
And you can print, just as you suggest (although you may like to pretty it up.)

Thanks for the help Alex and Oddthinking! I think I've figured out what was going on:
I modified the code snippet that I added to the question. Basically, every time it read the line I think it was recreating the lecturer object. Thus I put in another if statement that checks if lec_id already exists in the dictionary. If it does, then it skips the object creation and simply moves onto adding projects to the offered_proj set.
The change I made is:
if not lec_id in lecturers.keys():
    projects[proj_id] = Project(proj_id, proj_name)
    lecturers[lec_id] = Lecturer(lec_id, lec_name)
lecturers[lec_id].offered_proj.add(proj_id)
I only recently discovered the concept behind if not thanks to my friend Samir.
Now I get the following output:
001 set([0001])
001 set([0001, 0002])
002 set([0003])
002 set([0003, 0004])
003 set([0005])
If I print for a chosen lec_id I get the fully updated set. Glee.

Extracting data from MS Word

I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia.
I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed.
Which is the best way to do this:
VBA macro from inside Word to create CSV and then upload to the DB?
VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
Python script via win32com then upload to DB?
The last one is attractive to me as the web-interface is being built with Django, but I've never used win32com or tried scripting Word from python.
EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word Object Model. I am having a problem though - all the text is in Tables, and when I pull the strings out of the CELLS I want, I get a strange little box character at the end of each string. My code looks like:
sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
    Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
    Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
    If Target = "" Then
        ExportText = ""
    Else
        ExportText = Descr & Chr(44) & Assign & Chr(44) & _
            Target & Chr(13) & Chr(10)
        Print #fnum, ExportText
    End If
Next n
Close #fnum
What's up with the little control character box? Is some kind of character code coming across from Word?

Word has a little marker thingy that it puts at the end of every cell of text in a table.
It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.
Just use the Left() function to strip it out, i.e.
Left(Target, Len(Target) - 1)
By the way, instead of
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
Try this:
For Each row in Application.ActiveDocument.Tables(2).Rows
    Descr = row.Cells(2).Range.Text
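If the cell text ends up in Python instead (for example via win32com), the same clean-up can be sketched as below. It assumes the end-of-cell marker arrives as a carriage return followed by the bell character (Chr(13) and Chr(7)), which is worth confirming against your own documents. The helper names and the sample rows are made up for illustration.

```python
import csv
import io

# Assumed end-of-cell marker: carriage return plus bell character.
CELL_END = "\r\x07"


def clean_cell(text):
    """Strip trailing end-of-cell marker characters from a cell's text."""
    # rstrip removes any trailing mix of these characters, including a
    # trailing carriage return that belonged to the cell's own text.
    return text.rstrip(CELL_END)


def rows_to_csv(rows):
    """Turn (description, assignee, target) cell triples into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)  # quotes commas inside cells for us
    for row in rows:
        descr, assignee, target = (clean_cell(cell) for cell in row)
        if target:  # skip rows without a target, as the macro intends
            writer.writerow([descr, assignee, target])
    return buffer.getvalue()


# Made-up sample rows, shaped like raw cell text.
sample_rows = [
    ("Send quote\r\x07", "Ann\r\x07", "Friday\r\x07"),
    ("No date yet\r\x07", "Bob\r\x07", "\r\x07"),
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

Cleaning the cells first also makes the empty-target check meaningful, since a cell that looks empty still carries the marker.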

Well, I've never scripted Word, but it's pretty easy to do simple stuff with win32com. Something like:
from win32com.client import Dispatch
word = Dispatch('Word.Application')
doc = word.Documents.Open('d:\\stuff\\myfile.doc')
doc.SaveAs(FileName='d:\\stuff\\text\\myfile.txt', FileFormat=?) # not sure what to use for ?
This is untested, but I think something like that will just open the file and save it as plain text (provided you can find the right fileformat) – you could then read the text into python and manipulate it from there. There is probably a way to grab the contents of the file directly, too, but I don't know it off hand; documentation can be hard to find, but if you've got VBA docs or experience, you should be able to carry them across.
Have a look at this post from a while ago: http://mail.python.org/pipermail/python-list/2002-October/168785.html Scroll down to COMTools.py; there's some good examples there.
You can also run makepy.py (part of the pythonwin distribution) to generate python "signatures" for the COM functions available, and then look through it as a kind of documentation.

You could use OpenOffice. It can open word files, and also can run python macros.

I'd say look at the related questions on the right -->
The top one seems to have some good ideas for going the python route.

How about saving the file as XML, then using Python or something else to pull the data out of the Word file and into the database?

It is possible to programmatically save a Word document as HTML and to import the table(s) contained into Access. This requires very little effort.
