I have run into a problem with my Python code.
I am creating a movie filter after scraping IMDB for certain movies.
The problem is that a movie with multiple genres shows up several times, as identical entries, in my movie_filter.
My code is as follows:
def create_movies_drop_down(self):
    movies = []
    if "genres" in self.mappingQuery:
        mapping = [row for row in dataLakeDB["nordisk-film-movie-mapping"].find(dict(genres = self.mappingQuery["genres"]))]
    else:
        mapping = [row for row in dataLakeDB["nordisk-film-movie-mapping"].find()]
    for row in mapping:
        if row["title"] not in movies:
            movies.append(dict(movie = row["title"][0].upper()+row["title"][1:],
                               imdbPageID = row["imdbPageID"]))
    return movies
For example, because the movie "DRUK" has the genres "comedy" and "drama", it shows up twice with the same title and IMDB page ID.
I have tried several variations, but I can't work out why this happens.
Can anyone help here?
Edit: The mapping for one movie looks like this:
[{'_id': '6028139039cba4ae2722f8d9',
  'castList': '[Rosa Salazar, Christoph Waltz, Jennifer Connelly, Mahershala Ali, Ed Skrein]',
  'clientID': 'FILM',
  'dcmCampaignID': [''],
  'director': 'Robert Rodriguez',
  'dv360InsertionOrderID': ['7675053', '7675055', '7675065', '7683006', '7863461'],
  'genres': ['action', 'adventure', 'sci-fi'],
  'imdbPageID': '0437086',
  'imdbPageURL': 'https://www.imdb.com/title/tt0437086',
  'imdbRating': '7.3',
  'marathonCountryID': 'PMDK',
  'posterURL': 'https://m.media-amazon.com/images/M/MV5BMTQzYWYwYjctY2JhZS00NTYzLTllM2UtZWY5ZTk0NmYwYzIyXkEyXkFqcGdeQXVyMzgxODM4NjM#.V1_UX182_CR0,0,182,268_AL.jpg',
  'title': 'alita: battle angel\xa0(2019)'}]
Since movies is a list of dictionaries (which are unhashable), converting it to a set to get rid of duplicates will not work. Instead, you have to iterate and append each movie to the movies list only if it is not already there. You have already tried to do this with the if statement inside the for loop. The problem is that the condition is always True: you are checking whether the title string is in movies, but movies contains dictionaries, and a string never equals a dictionary. Check for the whole dictionary object instead:
def create_movies_drop_down(self):
    movies = []
    if "genres" in self.mappingQuery:
        mapping = [row for row in
                   dataLakeDB["nordisk-film-movie-mapping"].find(dict(genres=self.mappingQuery["genres"]))]
    else:
        mapping = [row for row in dataLakeDB["nordisk-film-movie-mapping"].find()]
    for row in mapping:
        movie_dic = dict(movie=row["title"][0].upper() + row["title"][1:],
                         imdbPageID=row["imdbPageID"])
        if movie_dic not in movies:
            movies.append(movie_dic)
    return movies
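On a long list, testing a dictionary against movies rescans the whole list for every row. A minimal sketch of an alternative, assuming imdbPageID uniquely identifies a movie: remember the IDs already seen in a set. The helper name dedupe_movies and the sample rows (including the ID given to "druk") are hypothetical stand-ins for the rows returned by find().

```python
def dedupe_movies(rows):
    """Return one {movie, imdbPageID} dict per distinct imdbPageID, in first-seen order."""
    seen_ids = set()
    movies = []
    for row in rows:
        page_id = row["imdbPageID"]
        if page_id in seen_ids:
            continue  # this movie was already added under another genre
        seen_ids.add(page_id)
        title = row["title"]
        movies.append({"movie": title[:1].upper() + title[1:], "imdbPageID": page_id})
    return movies


# Hypothetical sample rows; the real ones would come from the collection query.
sample_rows = [
    {"title": "druk", "imdbPageID": "0000001"},
    {"title": "druk", "imdbPageID": "0000001"},
    {"title": "alita: battle angel", "imdbPageID": "0437086"},
]
unique_movies = dedupe_movies(sample_rows)
print(unique_movies)
```

Strings are hashable, so the set lookup works even though the dictionaries themselves cannot be put in a set.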
I have a dictionary setup like this:
company = {'Honda': {}
,'Toyota':{}
,'Ford':{}
}
I have a list containing data like this:
years = ['Year 1', 'Year 2', 'Year 3']
Finally, I also have a list of lists containing data like this:
sales = [[55,9,90],[44,22,67],[83,13,91]]
I am trying to achieve a final result that looks like this:
{'Honda': {'Year 1':55,'Year 2':9,'Year 3':90}
,'Toyota':{'Year 1':44,'Year 2':22,'Year 3':67}
,'Ford':{'Year 1':83,'Year 2':13,'Year 3':91}
}
I can access the sub-lists of sales like this:
for i in sales:
    for j in i:
        print(j)  # this gives me every sales figure, one after another
I can't seem to wrap my head around constructing the final dictionary that would tie everything together.
Any help is appreciated!
You can use a dict comprehension with zip.
res = {k : dict(zip(years, sale)) for k, sale in zip(company, sales)}
You can use zip to pair corresponding information together. First, zip the brand names with the values in sales. Next, zip years with a particular brand's sales numbers.
company = {brand: dict(zip(years, sales_nums))
           for brand, sales_nums in zip(["Honda", "Toyota", "Ford"], sales)}
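To see what each zip contributes, the sketch below prints the intermediate pairs using the data from the question. It relies on dictionaries preserving insertion order (guaranteed from Python 3.7), so that iterating over company yields the brands in the same order as the rows of sales.

```python
company = {"Honda": {}, "Toyota": {}, "Ford": {}}
years = ["Year 1", "Year 2", "Year 3"]
sales = [[55, 9, 90], [44, 22, 67], [83, 13, 91]]

# Outer zip: pairs each brand with its row of sales figures.
brand_rows = list(zip(company, sales))
print(brand_rows[0])  # ('Honda', [55, 9, 90])

# Inner zip: pairs each year label with one figure from that row.
print(list(zip(years, sales[0])))  # [('Year 1', 55), ('Year 2', 9), ('Year 3', 90)]

result = {brand: dict(zip(years, row)) for brand, row in brand_rows}
print(result["Ford"])  # {'Year 1': 83, 'Year 2': 13, 'Year 3': 91}
```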
You can use zip and a double for-loop to zip all 3 lists. Here you are:
final_dict = {}
for i, (name, sub_dict) in enumerate(company.items()):
    for year, sale in zip(years, sales[i]):
        sub_dict[year] = sale
    final_dict[name] = sub_dict
import csv

def partytoyear():
    party_in_power = {}
    with open("presidents.txt") as f:
        reader = csv.reader(f)
        for row in reader:
            party = row[1]
            for year in row[2:]:
                party_in_power[year] = party
    print(party_in_power)
    return party_in_power

partytoyear()

def statistics():
    with open("BLS_private.csv") as f:
        statistics = {}
        reader = csv.DictReader(f)
        for row in reader:
            statistics = row
        print(statistics)
        return statistics

statistics()
These two functions return two dictionaries.
Here is a sample of the first dictionary:
'Democrat', '1981': 'Republican', '1982': 'Republican', '1983'
Sample of the second dictionary:
'2012', '110470', '110724', '110871', '110956', '111072', '111135', '111298', '111432', '111560', '111744'
The first dictionary associates a year with the political party. The second dictionary associates the year with job statistics.
I need to combine these two dictionaries so that the party appears alongside the job statistics.
I would like the dictionary to look like this:
'Democrat', '2012', '110470', '110724', '110871', '110956', '111072', '111135', '111298', '111432', '111560', '111744'
How would I go about doing this? I've looked at the syntax for update(), but that didn't work for my program.
You can't have a dictionary in that form in Python; it is syntactically invalid. You can, however, have each value be a collection such as a list. Here's a comprehension that does just that using dict lookups. It keys the result by year, because party names repeat and later years would otherwise overwrite earlier ones:
first_dict = {'1981': 'Republican', '1982': 'Republican', ..., '2012': 'Democrat', ...}
second_dict = {'2012': ['110470', '110724', '110871', '110956', '111072', '111135', '111298', '111432', '111560', '111744'], ...}
result = {year: [party, *second_dict[year]] for year, party in first_dict.items() if year in second_dict}
Pseudo result dict structure:
{'year': ['Party Name', stats, ...], ...}
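A runnable sketch of this kind of lookup-based merge on a tiny sample taken from the question. It assumes the first dictionary maps each year to a party and the second maps each year to a list of figures (truncated here to three), and it keys the result by year; the names party_by_year, stats_by_year and combined are illustrative only.

```python
party_by_year = {"1981": "Republican", "1982": "Republican", "2012": "Democrat"}
stats_by_year = {"2012": ["110470", "110724", "110871"]}

combined = {
    year: [party_by_year[year], *figures]
    for year, figures in stats_by_year.items()
    if year in party_by_year  # skip years with no known party
}
print(combined)  # {'2012': ['Democrat', '110470', '110724', '110871']}
```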
I have a dataframe full of scientific paper information.
My Dataframe:
database authors title
0 sciencedirect [{'surname': 'Sharafaldin', 'first_name': 'Iman'}, An eval...
{'surname': 'Lashkari', 'first_name': 'Arash Habibi'}]
1 sciencedirect [{'surname': 'Srinivas', 'first_name': 'Jangirala'}, Governmen...
{'surname': 'Das', 'first_name': 'Ashok Kumar'}]
2 sciencedirect [{'surname': 'Bongiovanni', 'first_name': 'Ivano'}] The last...
3 ieeexplore [Igor Kotenko, Andrey Chechulin] Cyber Attac...
As you can see, the authors column contains a list of dictionaries, but only where the database is sciencedirect. In order to perform some analysis, I need to clean my data. Therefore, my goal is to put the names into plain lists, as in the last row (index 3).
What I want:
# From:
[{'surname': 'Sharafaldin', 'first_name': 'Iman'}, {'surname': 'Lashkari', 'first_name': 'Arash Habibi'}]
# To:
[Iman Sharafaldin, Arash Habibi Lashkari]
My approach is to create two masks: one for the database column, selecting only sciencedirect papers, and one for the whole authors column. From these masks a new dataframe is created, and on its "authors" column I run the code shown below. It extracts the author names of each cell and stores them in a list, just as I want:
scidir_mask = df["database"] == 'sciencedirect'
# element-wise check; `df["authors"] is not None` yields a single bool, not a mask
authors_col = df["authors"].notna()
only_scidir = df[authors_col & scidir_mask]
for cell in only_scidir["authors"]:
    # get each list from cell
    cell_list = []
    for dictionary in cell:
        # get the values from dict and reverse into list
        name_as_list = [*dictionary.values()][::-1]
        # make list from first and surname a string
        author = ' '.join(name_as_list)
        cell_list.append(author)
So at the end of the above code, cell_list contains the authors' names in the way I want.
But I can't work out how to store these cell_lists back into the original dataframe.
So, how do I take each authors cell where the database is sciencedirect, apply my little function, and store its output back into the cell?
The idea is to create a custom function using f-strings and apply it only to the filtered rows:
scidir_mask = df["database"] == 'sciencedirect'
f = lambda x: [f"{y['first_name']} {y['surname']}" for y in x]
df.loc[scidir_mask, 'authors'] = df.loc[scidir_mask, 'authors'].apply(f)
print (df)
database authors title
0 sciencedirect [Iman Sharafaldin, Arash Habibi Lashkari] An eval
1 sciencedirect [Jangirala Srinivas, Ashok Kumar Das] Governmen
2 sciencedirect [Ivano Bongiovanni] The last
3 ieeexplore [Igor Kotenko, Andrey Chechulin] Cyber Attac
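If the mask cannot be fully trusted (for example, some rows were already cleaned), a more defensive variant is to let the function itself tolerate both shapes. The sketch below is plain Python with no pandas dependency; format_authors is a hypothetical helper name, and it assumes every cell is a list whose items are either dictionaries with 'first_name' and 'surname' keys or ready-made strings.

```python
def format_authors(cell):
    """Turn author dicts into 'First Last' strings; leave other items as they are."""
    return [
        f"{author['first_name']} {author['surname']}" if isinstance(author, dict) else author
        for author in cell
    ]


raw = [
    {"surname": "Sharafaldin", "first_name": "Iman"},
    {"surname": "Lashkari", "first_name": "Arash Habibi"},
]
print(format_authors(raw))  # ['Iman Sharafaldin', 'Arash Habibi Lashkari']
print(format_authors(["Igor Kotenko", "Andrey Chechulin"]))  # returned unchanged
```

A function like this could then be passed to apply on the whole authors column instead of only the masked rows.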
I want to create a dictionary that has predefined keys, like this:
dict = {'state':'', 'county': ''}
and read through and get values from a spreadsheet, like this:
for row in range(rowNum):
    for col in range(colNum):
and update the values for the keys 'state' (sheet.cell_value(row, 1)) and 'county' (sheet.cell_value(row, 1)) like this:
dict[{}]
I am confused on how to get the state value with the state key and the county value with the county key. Any suggestions?
Desired outcome would look like this:
>>>print dict
[
{'state':'NC', 'county': 'Nash County'},
{'state':'VA', 'county': 'Albemarle County'},
{'state':'GA', 'county': 'Cook County'},....
]
I made a few assumptions regarding your question. You mentioned in the comments that State is at index 1 and County is at index 3; what is at index 2? I assumed that they occur sequentially. In addition to that, there needs to be a way in which you can map the headings to the data columns, hence I used a list to do that as it maintains order.
# A list containing the headings that you are interested in, in the order in which you expect them in your spreadsheet
list_of_headings = ['state', 'county']
# Simulating your spreadsheet
spreadsheet = [['NC', 'Nash County'], ['VA', 'Albemarle County'], ['GA', 'Cook County']]
list_of_dictionaries = []
for i in range(len(spreadsheet)):
    dictionary = {}
    for j in range(len(spreadsheet[i])):
        dictionary[list_of_headings[j]] = spreadsheet[i][j]
    list_of_dictionaries.append(dictionary)
print(list_of_dictionaries)
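The inner index loop can also be replaced by zip, which pairs each heading with the value in the same position. A short sketch using the same simulated spreadsheet:

```python
list_of_headings = ["state", "county"]
spreadsheet = [["NC", "Nash County"], ["VA", "Albemarle County"], ["GA", "Cook County"]]

# dict(zip(...)) builds one {heading: value} dictionary per row.
list_of_dictionaries = [dict(zip(list_of_headings, row)) for row in spreadsheet]
print(list_of_dictionaries)
```

Note that zip stops at the shorter input, so a row with extra columns is silently truncated rather than raising an IndexError.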
Raqib's answer is partially correct but had to be modified for use with an actual spreadsheet with rows and columns and the xlrd module. What I did was first use xlrd methods to grab the cell values that I wanted and put them into a list (similar to the spreadsheet variable raqib has shown above). Note that the parameters sI and cI are the column index values I picked out in a previous step: sI = StateIndex and cI = CountyIndex.
# named rows rather than list, to avoid shadowing the built-in
rows = []
for row in range(rowNum):
    # one entry per spreadsheet row; no loop over the columns is needed here
    rows.append([str(sheet.cell_value(row, sI)), str(sheet.cell_value(row, cI))])
Now that I have a list of the states and counties, I can apply raqib's solution:
list_of_headings = ['state', 'county']
fipsDic = []
print(len(rows))
for i in range(len(rows)):
    temp = {}
    for j in range(len(rows[i])):
        temp[list_of_headings[j]] = rows[i][j]
    fipsDic.append(temp)
The result is a nice list of dictionaries that looks like this:
[{'county': 'Minnehaha County', 'state': 'SD'}, {'county': 'Minnehaha County', 'state': 'SD'}, ...]
I have CSV file that looks like the following,
1994, Category1, Something Happened 1
1994, Category2, Something Happened 2
1995, Category1, Something Happened 3
1996, Category3, Something Happened 4
1998, Category2, Something Happened 5
I want to create two lists,
Category = [Category1, Category2, Category3]
and
Year = [1994, 1995, 1996, 1998]
I want to omit the duplicates in each column. I am reading the file as follows,
DataCaptured = csv.reader(DataFile, delimiter=',')
DataCaptured.next()
and looping through it,
for Column in DataCaptured:
You can do:
DataCaptured = csv.reader(DataFile, delimiter=',', skipinitialspace=True)
Category, Year = [], []
for row in DataCaptured:
    if row[0] not in Year:
        Year.append(row[0])
    if row[1] not in Category:
        Category.append(row[1])
print(Category, Year)
# ['Category1', 'Category2', 'Category3'] ['1994', '1995', '1996', '1998']
As stated in the comments, if order does not matter, using a set would be easier and faster:
Category, Year = set(), set()
for row in DataCaptured:
    Year.add(row[0])
    Category.add(row[1])
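If the order does matter but the repeated list scan feels heavy, dict.fromkeys keeps the first occurrence of each value in order (dictionaries preserve insertion order from Python 3.7). A sketch that reads the sample rows from an in-memory list rather than a file; the variable names are illustrative:

```python
import csv

lines = [
    "1994, Category1, Something Happened 1",
    "1994, Category2, Something Happened 2",
    "1995, Category1, Something Happened 3",
    "1996, Category3, Something Happened 4",
    "1998, Category2, Something Happened 5",
]
rows = list(csv.reader(lines, skipinitialspace=True))

# dict.fromkeys drops repeats while keeping first-seen order.
years = list(dict.fromkeys(row[0] for row in rows))
categories = list(dict.fromkeys(row[1] for row in rows))
print(categories, years)
# ['Category1', 'Category2', 'Category3'] ['1994', '1995', '1996', '1998']
```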
A very concise way to do this is to use pandas. The benefits are that it has a faster CSV parser and that it works on columns, so a single df.apply(set) gets you there:
In [244]:
#Suppose the CSV is named temp.csv
df=pd.read_csv('temp.csv',header=None)
df.apply(set)
Out[244]:
0 set([1994, 1995, 1996, 1998])
1 set([ Category2, Category3, Category1])
2 set([ Something Happened 4, Something Happene...
dtype: object
The downside is that it returns a pandas.Series, so to access each list you need to do something like list(df.apply(set)[0]).
Edit
If the order has to be preserved, that can also be done very easily, for example:
# df.items() replaces df.iteritems(), which was removed in pandas 2.0
for i, item in df.items():
    print(item.unique())
item.unique() returns NumPy arrays instead of lists.
dawg pointed out one of the greatest tricks in Python: using set() to remove duplicates from a list. dawg shows how to build the unique list from scratch by adding each item to a set, which is perfect. But here's another equivalent way to do it, generating a list with duplicates and a list without duplicates using a list(set()) approach:
import csv

in_str = [
    'year, category, event',
    '1994, Category1, Something Happened 1',
    '1994, Category2, Something Happened 2',
    '1995, Category1, Something Happened 3',
    '1996, Category3, Something Happened 4',
    '1998, Category2, Something Happened 5'
]
cdr = csv.DictReader(in_str, skipinitialspace=True)
col = []
for i in cdr:
    col.append(i['category'])
# all items in the column...
print(col)
# only unique items in the column...
print(list(set(col)))