I'm trying to add values to a key after making a dictionary.
This is what I have so far:
movie_list = "movies.txt" # a file whose first line is a header: Title, year, genre, director, actor
in_file = open(movie_list, 'r')
in_file.readline() # skip the header line
def list_maker(in_file):
    movie1 = str(input("Enter in a movie: "))
    movie2 = str(input("Enter in another movie: "))
    d = {}
    for line in in_file:
        l = line.split(",")
        title_year = (l[0], l[1]) # only then making the tuple ('Title', 'year')
        for i in range(4, len(l)):
            d = {title_year: l[i]}
            if movie1 or movie2 == l[0]:
                print(d.values())
The output I get is:
Enter in a movie: 13 B
Enter in another movie: 1920
{('13 B', '(2009)'): 'R. Madhavan'}
{('13 B', '(2009)'): 'Neetu Chandra'}
{('13 B', '(2009)'): 'Poonam Dhillon\n'}
{('1920', '(2008)'): 'Rajneesh Duggal'}
{('1920', '(2008)'): 'Adah Sharma'}
{('1920', '(2008)'): 'Anjori Alagh\n'}
{('1942 A Love Story', '(1994)'): 'Anil Kapoor'}
{('1942 A Love Story', '(1994)'): 'Manisha Koirala'}
{('1942 A Love Story', '(1994)'): 'Jackie Shroff\n'}
... and so on and so forth; I get the whole list of movies.
How would I go about it if I only wanted the two movies I entered (any 2 movies), with all of the values gathered together under each movie's key?
Example:
{('13 B', '(2009)'): 'R. Madhavan', 'Neetu Chandra', 'Poonam Dhillon'}
{('1920', '(2008)'): 'Rajneesh Duggal', 'Adah Sharma', 'Anjori Alagh'}
Sorry if the output isn't completely what you want, but here's how you should do it:
d = {}
for line in in_file:
    l = line.strip().split(",")
    title_year = (l[0], l[1])
    people = []
    for i in range(4, len(l)):
        people.append(l[i]) # we append items to the list...
    d[title_year] = people # ...and then put the list into the dict under that key
    if l[0] in (movie1, movie2): # 'movie1 or movie2 == l[0]' is always truthy; test each title
        print(d[title_year])
Basically, what we are doing here is building a list of the actors and then setting that list as the value of the key inside the dict.
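If you want to keep adding values to a key after the dict already exists (as the question title asks), you don't need to rebuild anything; dict.setdefault, or a collections.defaultdict, appends onto the existing value. A minimal sketch, assuming the same file layout as above:
from collections import defaultdict

d = defaultdict(list)  # a missing key starts out as an empty list
for line in in_file:
    l = line.strip().split(",")
    title_year = (l[0], l[1])
    for actor in l[4:]:
        d[title_year].append(actor)  # appends accumulate, nothing is overwritten
With a plain dict, d.setdefault(title_year, []).append(actor) does the same thing.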
This is the sample data in a file. I want to split each line in the file and add it to a dataframe. In some cases a parent has more than one child, and whenever that happens a new set of columns (Child2_Name and DOB) has to be added.
(P322) Rashmika Chadda 15/05/1995 – Rashmi C 12/02/2024
(P324) Shiva Bhupati 01/01/1994 – Vinitha B 04/08/2024
(P356) Karthikeyan chandrashekar 22/02/1991 – Kanishka P 10/03/2014
(P366) Kalyani Manoj 23/01/1975 - Vandana M 15/05/1995 - Chandana M 18/11/1998
This is the code I have tried, but it only takes "–" into consideration when splitting:
with open("text.txt") as read_file:
    file_contents = read_file.readlines()

content_list = []
temp = []
for each_line in file_contents:
    temp = each_line.replace("–", " ").split()
    content_list.append(temp)

print(content_list)
Current output:
[['(P322)', 'Rashmika', 'Chadda', '15/05/1995', 'Rashmi', 'Chadda', 'Teega', '12/02/2024'], ['(P324)', 'Shiva', 'Bhupati', '01/01/1994', 'Vinitha', 'B', 'Sahu', '04/08/2024'], ['(P356)', 'Karthikeyan', 'chandrashekar', '22/02/1991', 'Kanishka', 'P', '10/03/2014'], ['(P366)', 'Kalyani', 'Manoj', '23/01/1975', '-', 'Vandana', 'M', '15/05/1995', '-', 'Chandana', 'M', '18/11/1998']]
The final output should be like below:
Code   Parent_Name                DOB         Child1_Name  DOB         Child2_Name  DOB
P322   Rashmika Chadda            15/05/1995  Rashmi C     12/02/2024
P324   Shiva Bhupati              01/01/1994  Vinitha B    04/08/2024
P356   Karthikeyan chandrashekar  22/02/1991  Kanishka P   10/03/2014
P366   Kalyani Manoj              23/01/1975  Vandana M    15/05/1995  Chandana M   18/11/1998
I'm not sure if you want it as a list or something else.
To get lists:
import re

# 'text' holds the lines of the file, e.g. text = open("text.txt").readlines()
result = []
for t in text:
    # remove the \n at the end of each line
    t = t.strip()
    # remove the parentheses you don't want
    t = t.replace("(", "")
    t = t.replace(")", "")
    # split on the dash separators (the data mixes "–" and "-")
    t = re.split(r" [–-] ", t)
    # reconstruct
    for i, person in enumerate(t):
        person = person.split(" ")
        # the first chunk starts with the code, e.g. P322
        if i == 0:
            res = [person.pop(0)]
        res.extend([" ".join(person[:2]), person[2]])
    result.append(res)
print(result)
Which would give the below output:
[['P322', 'Rashmika Chadda', '15/05/1995', 'Rashmi C', '12/02/2024'], ['P324', 'Shiva Bhupati', '01/01/1994', 'Vinitha B', '04/08/2024'], ['P356', 'Karthikeyan chandrashekar', '22/02/1991', 'Kanishka P', '10/03/2014'], ['P366', 'Kalyani Manoj', '23/01/1975', 'Vandana M', '15/05/1995', 'Chandana M', '18/11/1998']]
You can organise the data a bit more using a dictionary:
import re

result = {}
for t in text:
    # remove the \n at the end of each line
    t = t.strip()
    # remove the parentheses you don't want
    t = t.replace("(", "")
    t = t.replace(")", "")
    # split on the dash separators (the data mixes "–" and "-")
    t = re.split(r" [–-] ", t)
    for i, person in enumerate(t):
        # split the name into words
        person = person.split(" ")
        if i == 0:
            # the first chunk carries the code and the parent
            code = person.pop(0)
            result[code] = {"parent_name": " ".join(person[:2]),
                            "parent_DOB": person[2],
                            "children": []}
        else:
            result[code]['children'].append({f"child{i}_name": " ".join(person[:2]),
                                             f"child{i}_DOB": person[2]})
print(result)
Which would give this output:
{'P322': {'children': [{'child1_DOB': '12/02/2024', 'child1_name': 'Rashmi C'}],
          'parent_DOB': '15/05/1995',
          'parent_name': 'Rashmika Chadda'},
 'P324': {'children': [{'child1_DOB': '04/08/2024', 'child1_name': 'Vinitha B'}],
          'parent_DOB': '01/01/1994',
          'parent_name': 'Shiva Bhupati'},
 'P356': {'children': [{'child1_DOB': '10/03/2014', 'child1_name': 'Kanishka P'}],
          'parent_DOB': '22/02/1991',
          'parent_name': 'Karthikeyan chandrashekar'},
 'P366': {'children': [{'child1_DOB': '15/05/1995', 'child1_name': 'Vandana M'},
                       {'child2_DOB': '18/11/1998', 'child2_name': 'Chandana M'}],
          'parent_DOB': '23/01/1975',
          'parent_name': 'Kalyani Manoj'}}
In the end, to have an actual table, you would need to use pandas, but that will require you to fix the maximum number of children so that you can pad the empty cells.
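A hedged sketch of that last step, assuming the result dict built above and capping the table at two children (the column names here are my own; pick whatever suits you):
import pandas as pd

MAX_CHILDREN = 2  # fix the maximum number of children up front
rows = []
for code, info in result.items():
    row = [code, info["parent_name"], info["parent_DOB"]]
    for i in range(MAX_CHILDREN):
        if i < len(info["children"]):
            child = info["children"][i]
            row += [child[f"child{i+1}_name"], child[f"child{i+1}_DOB"]]
        else:
            row += ["", ""]  # pad the empty cells
    rows.append(row)

columns = ["Code", "Parent_Name", "DOB", "Child1_Name", "Child1_DOB", "Child2_Name", "Child2_DOB"]
df = pd.DataFrame(rows, columns=columns)
print(df)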
I have many text documents that I want to compare to one another, removing all text that is exactly the same between them. This is to find and remove boilerplate text that is consistent across documents, so it can be stripped before NLP.
The best way I figured to do this is to find Longest Common Substrings that exist or are mostly present in all the documents. However, doing this has been incredibly slow.
Here is an example of what I am trying to accomplish:
DocA:
Title: To Kill a Mocking Bird
Author: Harper Lee
Published: July 11, 1960
DocB:
Title: 1984
Author: George Orwell
Published: June 1949
DocC:
Title: The Great Gatsby
Author: F. Scott Fitzgerald
The output would show something like:
{
'Title': 3,
'Author': 3,
'Published': 2,
}
The results would then be used to strip out the commonalities between documents.
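The stripping step itself isn't shown in the question; a minimal sketch of it, assuming files is the list of document strings used below and results is the substring-count dict the code builds:
# remove every substring that was found in more than one comparison
common = [s for s, count in results.items() if count > 1]
# strip longer matches first so the shorter ones still match afterwards
for s in sorted(common, key=len, reverse=True):
    files = [doc.replace(s, "") for doc in files]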
Here is some code I have tested in Python. It's incredibly slow with any significant number of permutations:
import itertools
from difflib import SequenceMatcher
import pandas as pd

file_perms = list(itertools.permutations(files, 2))
results = {}
for doc_a, doc_b in file_perms:
    while True:
        seq_match = SequenceMatcher(a=doc_a, b=doc_b)
        match = seq_match.find_longest_match(0, len(doc_a), 0, len(doc_b))
        if match.size >= 5:
            doc_a_start, doc_a_stop = match.a, match.a + match.size
            doc_b_start, doc_b_stop = match.b, match.b + match.size
            match_word = doc_a[doc_a_start:doc_a_stop]
            if match_word in results:
                results[match_word] += 1
            else:
                results[match_word] = 1
            # cut the match out of both docs and look for the next one
            doc_a = doc_a[:doc_a_start] + doc_a[doc_a_stop:]
            doc_b = doc_b[:doc_b_start] + doc_b[doc_b_stop:]
        else:
            break

df = pd.DataFrame(
    {
        'Value': list(results.keys()),
        'Count': list(results.values())
    }
)
print(df)
1. create a set from each document
2. build a counter of how many times every word appears
3. iterate over every document; when you find a word that appears in 70%-90% of documents, append it and the word after it as a tuple to a new counter
4. and again...
from collections import Counter

one_word = Counter()
for doc in docs:
    word_list = doc.split(" ")
    word_set = set(word_list)
    for word in word_set:
        one_word[word] += 1

two_word = Counter()
threshold = len(docs) * 0.7
for doc in docs:
    word_list = doc.split(" ")
    for i in range(len(word_list) - 1):
        if one_word[word_list[i]] > threshold:
            key = (word_list[i], word_list[i + 1])
            two_word[key] += 1
You can play with the threshold and continue as long as the counter is not empty.
The docs here are the lyrics of the songs "Believer", "By the River of Babylon", "I Could Stay Awake", and "Rattlin Bog":
from collections import Counter
import os
import glob

TR = 1  # threshold

dir = r"D:\docs"
path = os.path.join(dir, "*.txt")
files = glob.glob(path)

one_word = {}
all_docs = {}
for file in files:
    one_word[file] = set()
    all_docs[file] = []
    with open(file) as doc:
        for row in doc:
            for word in row.split():
                one_word[file].add(word)
                all_docs[file].append(word)
# now one_word is a dict where the key is the file name and the value is the set of words in it
# all_docs is a dict where the file name is the key and the value is the complete doc, stored as a list word by word

common_Frase = Counter()
for key in one_word:
    for word in one_word[key]:
        common_Frase[word] += 1
# common_Frase contains a count of every word's appearances across all files (every file can add a word once)

two_word = {}
for key in all_docs:
    two_word[key] = set()
    doc = all_docs[key]
    for index in range(len(doc) - 1):
        if common_Frase[doc[index]] > TR:
            val = (doc[index], doc[index + 1])
            two_word[key].add(val)

for key in two_word:
    for word in two_word[key]:
        common_Frase[word] += 1
# now common_Frase also contains a count of all two-word phrases

three_word = {}
for key in all_docs:
    three_word[key] = set()
    doc = all_docs[key]
    for index in range(len(doc) - 2):
        val2 = (doc[index], doc[index + 1])
        if common_Frase[val2] > TR:
            val3 = (doc[index], doc[index + 1], doc[index + 2])
            three_word[key].add(val3)

for key in three_word:
    for word in three_word[key]:
        common_Frase[word] += 1

for k in common_Frase:
    if common_Frase[k] > 1:
        print(k)
This is the output:
when like all Don't And one the my hear and feeling Then your of I'm in me The you away I never to be what a ever thing there from By down Now words that was ('all', 'the') ('And', 'the') ('the', 'words') ('By', 'the') ('and', 'the') ('in', 'the')
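The two_word and three_word blocks are the same step with the phrase length bumped by one, so the "continue as long as the counter is not empty" idea can be written as a loop. A sketch (extend_phrases is a hypothetical helper, reusing all_docs, common_Frase and TR from the code above):
def extend_phrases(n):
    # collect (n+1)-word phrases whose n-word prefix already passed the threshold
    added = False
    for key in all_docs:
        doc = all_docs[key]
        seen = set()  # each file counts a phrase once, as above
        for index in range(len(doc) - n):
            # single words are keyed as strings, longer phrases as tuples
            prefix = doc[index] if n == 1 else tuple(doc[index:index + n])
            if common_Frase[prefix] > TR:
                seen.add(tuple(doc[index:index + n + 1]))
        for phrase in seen:
            common_Frase[phrase] += 1
            added = True
    return added

n = 1
while extend_phrases(n):  # stop once no new phrase shows up anywhere
    n += 1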
I have this text:
/** Goodmorning
Alex
Dog
House
Red
*/
/** Goodnight
Maria
Cat
Office
Green
*/
I would like to have Alex, Dog, House and Red in one list and Maria, Cat, Office, Green in another list.
I have this code:
with open(filename) as f:
    for i in f:
        if i.startswith("/** Goodmorning"):
            # add lines to one list
        elif i.startswith("/** Goodnight"):
            # add lines to the other list
So, is there any way to write the script so it understands that Alex belongs to the part of the text that has Goodmorning?
I'd recommend you to use a dict, where the section name will be a key:
with open(filename) as f:
    result = {}
    current_list = None
    for line in f:
        if line.startswith("/**"):
            current_list = []
            result[line[3:].strip()] = current_list
        elif line.strip() != "*/":  # lines read from a file keep their "\n", so strip before comparing
            current_list.append(line.strip())
Result:
{'Goodmorning': ['Alex', 'Dog', 'House', 'Red'], 'Goodnight': ['Maria', 'Cat', 'Office', 'Green']}
To find which key one of the values belongs to, you can use the following code:
search_value = "Alex"
for key, values in result.items():
    if search_value in values:
        print(search_value, "belongs to", key)
        break
I would recommend using regular expressions. In Python there is a module for this called re.
import re
s = """/** Goodmorning
Alex
Dog
House
Red
*/
/** Goodnight
Maria
Cat
Office
Green
*/"""
pattern = r'/\*\*([\w \n]+)\*/'
word_groups = re.findall(pattern, s, re.MULTILINE)
d = {}
for word_group in word_groups:
    words = word_group.strip().split('\n')  # the words are separated by single newlines
    d[words[0]] = words[1:]
print(d)
Output:
{'Goodmorning': ['Alex', 'Dog', 'House', 'Red'], 'Goodnight': ['Maria', 'Cat', 'Office', 'Green']}
Expanding on Olvin Roght's answer (sorry, can't comment - not enough reputation), I would keep a second dictionary for the reverse lookup:
with open(filename) as f:
    key_to_list = {}
    name_to_key = {}
    current_list = None
    current_key = None
    for line in f:
        if line.startswith("/**"):
            current_list = []
            current_key = line[3:].strip()
            key_to_list[current_key] = current_list
        elif line.strip() != "*/":
            current_name = line.strip()
            name_to_key[current_name] = current_key
            current_list.append(current_name)

print(key_to_list)
print(name_to_key['Alex'])
An alternative is to convert the dictionary afterwards:
name_to_key = {n: k for k in key_to_list for n in key_to_list[k]}
(e.g. if you want to go with the regex version from ashwani)
The limitation is that either way, this only permits one membership per name.
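If a name can legitimately appear in several sections, a dict of lists removes that limitation; a sketch, reusing key_to_list from above:
from collections import defaultdict

name_to_keys = defaultdict(list)
for key, names in key_to_list.items():
    for name in names:
        name_to_keys[name].append(key)

print(name_to_keys['Alex'])  # every section 'Alex' appears in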
Hi, I'm in the process of learning, so you may have to bear with me. I have 2 lists I'd like to compare, keeping any matches and appending them to one output list, while appending any non-matches to another output list.
Here's my code:
def EntryToFieldMatch(Entry, Fields):
    valid = []
    invalid = []
    for c in Entry:
        count = 0
        for s in Fields:
            count += 1
            if s in c:
                valid.append(c)
            elif count == len(Entry):
                invalid.append(s)
                Fields.remove(s)
    print valid
    print "-"*50
    print invalid

def main():
    vEntry = ['27/04/2014', 'Hours = 28', 'Site = Abroad', '03/05/2015', 'Date = 28-04-2015', 'Travel = 2']
    Fields = ['Week_Stop', 'Date', 'Site', 'Hours', 'Travel', 'Week_Start', 'Letters']
    EntryToFieldMatch(vEntry, Fields)

if __name__ == "__main__":
    main()
The output seems fine, except it's not returning all the fields in the 2 output lists. This is the output I receive:
['Hours = 28', 'Site = Abroad', 'Date = 28-04-2015', 'Travel = 2']
--------------------------------------------------
['Week_Start', 'Letters']
I just have no idea why the second list doesn't include "Week_Stop". I've run the debugger and followed the code through a few times to no avail. I've read about sets, but I didn't see any way to return fields that match and discard fields that don't.
Also, I'm open to suggestions if anybody knows of a way to simplify this whole process; I'm not asking for free code, just a nod in the right direction.
Python 2.7, thanks.
You only have two conditions: either the field is in the string, or the count is equal to the length of Entry. Neither of these catches the first element, 'Week_Stop'; the length of Fields goes from 7 to 6 to 5, catching Week_Start, but the count never gets far enough to reach Week_Stop.
A more efficient way would be to use sets or a collections.OrderedDict if you want to keep order:
from collections import OrderedDict

def EntryToFieldMatch(Entry, Fields):
    valid = []
    # create an OrderedDict from the words in Fields
    # dict lookups are O(1)
    st = OrderedDict.fromkeys(Fields)
    # iterate over Entry
    for word in Entry:
        # split the words once on whitespace
        spl = word.split(None, 1)
        # if the first word appears in our dict keys
        if spl[0] in st:
            # add to valid list
            valid.append(word)
            # remove the key
            del st[spl[0]]
    print valid
    print "-"*50
    # only invalid words will be left
    print st.keys()
Output:
['Hours = 28', 'Site = Abroad', 'Date = 28-04-2015', 'Travel = 2']
--------------------------------------------------
['Week_Stop', 'Week_Start', 'Letters']
For large lists this would be significantly faster than your quadratic approach. Having O(1) dict lookups means your code goes from quadratic to linear; every time you do "in Fields" on a list, that is an O(n) operation.
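If you want to see the gap yourself, a quick sketch with timeit (the names and sizes here are made up for illustration):
import timeit

fields_list = ['Field%d' % i for i in range(1000)]
fields_set = set(fields_list)

# a list membership test scans every element; a set hashes once
print(timeit.timeit("'Field999' in fields_list",
                    setup="from __main__ import fields_list", number=10000))
print(timeit.timeit("'Field999' in fields_set",
                    setup="from __main__ import fields_set", number=10000))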
Using a set the approach is similar:
def EntryToFieldMatch(Entry, Fields):
    valid = []
    st = set(Fields)
    for word in Entry:
        spl = word.split(None, 1)
        if spl[0] in st:
            valid.append(word)
            st.remove(spl[0])
    print valid
    print "-"*50
    print st
The difference when using sets is that order is not maintained.
Using list comprehension:
def EntryToFieldMatch(Entries, Fields):
    # using list comprehensions
    # (typically they go on one line, but they can be multiline
    # so they look more like their for loop equivalents)
    valid = [entry for entry in Entries
             if any([field in entry
                     for field in Fields])]
    invalidEntries = [entry for entry in Entries
                      if not any([field in entry
                                  for field in Fields])]
    missedFields = [field for field in Fields
                    if not any([field in entry
                                for entry in Entries])]
    print 'valid entries:', valid
    print '-' * 80
    print 'invalid entries:', invalidEntries
    print '-' * 80
    print 'missed fields:', missedFields

vEntry = ['27/04/2014', 'Hours = 28', 'Site = Abroad', '03/05/2015', 'Date = 28-04-2015', 'Travel = 2']
Fields = ['Week_Stop', 'Date', 'Site', 'Hours', 'Travel', 'Week_Start', 'Letters']
EntryToFieldMatch(vEntry, Fields)
valid entries: ['Hours = 28', 'Site = Abroad', 'Date = 28-04-2015', 'Travel = 2']
--------------------------------------------------------------------------------
invalid entries: ['27/04/2014', '03/05/2015']
--------------------------------------------------------------------------------
missed fields: ['Week_Stop', 'Week_Start', 'Letters']
Currently my code looks as follows:
import re

data = ""
pattern1 = re.compile('')
pattern2 = re.compile('')
pattern3 = re.compile('')
items = re.findall(pattern1, data)
mainlist = []
for item in items:
    forename = re.findall(pattern2, item)
    surname = re.findall(pattern3, item)
    mainlist.append(surname)
The only problem with this layout is that I am getting lists like:
[['Smith', 'Patricks', 'Clark'], ['Austin', 'Hamilton', 'Day', 'Sidders'], ['Bennet']]
I'm wanting my lists to come out as follows:
['Smith', 'Patricks', 'Clark', 'Austin', 'Hamilton', 'Day', 'Sidders', 'Bennet']
Any ideas?
Thanks in advance
- Hy
Use extend:
for item in items:
    forename = re.findall(pattern2, item)
    surname = re.findall(pattern3, item)
    mainlist.extend(surname)
myList.extend(L) adds the individual elements of L onto myList. It's similar to:
for element in L:
    myList.append(element)
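If you already have the nested list, you can also flatten it after the fact; a sketch using the list-of-lists from the question:
import itertools

nested = [['Smith', 'Patricks', 'Clark'], ['Austin', 'Hamilton', 'Day', 'Sidders'], ['Bennet']]
flat = list(itertools.chain.from_iterable(nested))
print(flat)  # ['Smith', 'Patricks', 'Clark', 'Austin', 'Hamilton', 'Day', 'Sidders', 'Bennet']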