I have a large data set of about 30,000 records. I would like to extract words such as "Animation", "Comedy", and "Family". I managed to extract the words and delete the id, but I do not know how to group the words back together by their original row.
My current code:
import ast, json
import pandas as pd
from csv import reader
file_name = 'xx.csv'
data = []
with open(file_name, 'r', encoding='unicode_escape') as read_obj:
    csv_reader = reader(read_obj)
    headings = next(csv_reader)  # skip the header row
    for i in csv_reader:
        # column 7 holds the list of dicts as a string; extend() flattens all rows into one list
        data.extend(ast.literal_eval(i[7]))
df = pd.DataFrame(data)
del df["id"]
print(df)
It produces this result:
name
0 Animation
1 Comedy
2 Family
3 Adventure
4 Fantasy
...
40060 Drama
40061 Thriller
40062 Action
40063 Drama
40064 Thriller
The data set is a CSV file, but each cell in this column holds a JSON-like list of dictionaries.
Sample data:
[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]
[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]
[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]
[{'id': 35, 'name': 'Comedy'}]
[{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]
[{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]
[{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]
[{'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name': 'Romance'}]
[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 18, 'name': 'Drama'}, {'id': 10751, 'name': 'Family'}]
I think this does everything you need:
import json
import pandas as pd
# assumes file_name is defined as in the question and the CSV has a column called 'name'
df = pd.read_csv(file_name, encoding='unicode_escape', usecols=['name'])
result = df.to_json(orient='records')
parsed = json.loads(result)
print(json.dumps(parsed, indent=4))
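To keep the names grouped by their original row, one option is to build one list per row instead of extending a single flat list. The following is only a minimal sketch: the inline sample CSV and the column names `title` and `genres` are stand-ins for the real file, and it assumes each genre cell is a Python-literal list of dicts.

```python
import ast
import csv
import io

# Hypothetical stand-in for the real CSV file: the second column holds a
# Python-literal list of {'id': ..., 'name': ...} dicts.
sample_csv = """title,genres
A,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"
B,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]"
C,"[{'id': 35, 'name': 'Comedy'}]"
"""

names_per_row = []
for row in csv.DictReader(io.StringIO(sample_csv)):
    genres = ast.literal_eval(row['genres'])  # string -> list of dicts
    # append (not extend) so each CSV row yields exactly one list of names
    names_per_row.append([g['name'] for g in genres])

print(names_per_row)
```

Because the result has one entry per row, it lines up with the rows of the original file and could be assigned back as a column of the same length.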
I have a column in which each cell is a list of dictionaries:
movies['genres'].head()
where each line looks like:
0 [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]
1 [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]
2 [{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]
3 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]
4 [{'id': 35, 'name': 'Comedy'}]
Name: genres, dtype: object
I would like to store it in a data frame with two columns: 'id', holding the id values, and 'name', holding the name values. I tried:
pd.DataFrame(movies['genres'])
However when I ran it I obtained:
genres
0 [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]
1 [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]
2 [{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]
Could you help me?
Regards
You should use the .from_dict() method:
df = pd.DataFrame.from_dict(movies["genres"])
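If the cells are still nested lists after that, another possible route is to flatten the column into a plain list of records first. This is a sketch under the assumption that each cell is already a real list of dicts (not a string); `genres_column` is a hypothetical stand-in for `movies['genres']`. Passing `records` to `pd.DataFrame` would then give one 'id' column and one 'name' column, the same way the first question builds a frame from a list of dicts.

```python
# Hypothetical stand-in for movies['genres']: one list of dicts per row.
genres_column = [
    [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}],
    [{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}],
    [{'id': 35, 'name': 'Comedy'}],
]

# Flatten: every dict from every row becomes one record.
records = [genre for row in genres_column for genre in row]

# Optionally keep each id only once, in first-seen order
# (dicts preserve insertion order of their keys).
unique_records = list({g['id']: g for g in records}.values())

print(records)
print(unique_records)
```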
This is my data set; it is the column I extracted from the CSV file:
0 [{'id': 16, 'name': 'Animation'}, {'id': 35, '...
1 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
2 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...
3 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...
4 [{'id': 35, 'name': 'Comedy'}]
How to get just a list with the content ['Animation', 'Adventure', 'Romance', 'Comedy', 'Comedy'] as output?
I guess you want something like this:
list_of_items = [[{'id': 16, 'name': 'Animation'}, {'id': 16, 'name': 'Animation2'}], [{'id': 16, 'name': 'Animation3'}, {'id': 16, 'name': 'Animation4'}]]
output_list = []
for item in list_of_items:
    for entry in item:  # not named `dict`, which would shadow the built-in
        output_list.append(entry['name'])
Output:
>>> print(output_list)
['Animation', 'Animation2', 'Animation3', 'Animation4']
I don't know whether it was a typo, but there are some errors with the ' characters in what you wrote.
Nevertheless, from what I can see you have a list of dictionaries. So we loop through that list, pick the value we want from each dictionary, and append it to a new list:
d = [{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]
list_1 = []
for el in d:
    list_1.append(el['name'])
print(list_1)
The output will be: ['Romance', 'Comedy']
It's unclear if you have a list of lists or just one list.
For a single list you can use a list comprehension:
dict_list = [{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]
[dict_item['name'] for dict_item in dict_list]
Otherwise, you can unnest the outer list inside the same list comprehension:
dict_list = [[{'id': 1, 'name': 'Animation'}, {'id': 2, 'name': 'Comedy'}], [{'id': 3, 'name': 'Romance'}, {'id': 4, 'name': 'Comedy'}]]
[dict_item['name'] for sublist in dict_list for dict_item in sublist]
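The output requested in the question contains only the first name from each row. If that is really what is wanted, a sketch along these lines might do it; `rows` is a hypothetical stand-in for the column after it has been parsed into real lists.

```python
# Hypothetical stand-in for the parsed column: one list of dicts per row.
rows = [
    [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}],
    [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}],
    [{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}],
    [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}],
    [{'id': 35, 'name': 'Comedy'}],
]

# Take the first dict of each row; `if row` skips rows with an empty list.
first_names = [row[0]['name'] for row in rows if row]

print(first_names)
```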
I have a csv file which looks like this -
id genres
1 [{'id': 35, 'name': 'Comedy'}]
2 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10751, 'name': 'Family'}, {'id': 10749, 'name': 'Romance'}]
3 [1,2,3]
4 [{'id':31, 'name':'Comedy'}]
When I import the csv as dataframe, the lists in genres column are loaded as strings. For example - "[{'id': 35, 'name': 'Comedy'}]"
How do I load the lists without the quotes?
Use:
import ast, json
df['genres'] = df['genres'].apply(ast.literal_eval)
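Or, if the cells hold valid JSON (double-quoted strings; the single-quoted sample above would raise an error):
df['genres'] = df['genres'].apply(json.loads)
A strip() + split() approach also exists, but it only produces lists of string fragments, so it suits simple cells like [1,2,3] rather than lists of dictionaries:
df['genres'] = [x.strip("[]").split(',') for x in df['genres']]
or,
df['genres'] = df['genres'].apply(lambda x: x.strip("[]").split(','))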
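To see how these options behave on a single cell, here is a small comparison sketch. The variable names are made up for illustration; the cell value is the example string from the question.

```python
import ast
import json

# One cell as it arrives from the CSV: a plain string, single-quoted inside.
cell = "[{'id': 35, 'name': 'Comedy'}]"

# ast.literal_eval understands Python literals, including single quotes.
parsed = ast.literal_eval(cell)

# json.loads expects double-quoted strings, so this cell is rejected.
try:
    json.loads(cell)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# strip() + split() only cuts the text at commas; the pieces stay strings.
fragments = cell.strip("[]").split(',')

print(parsed)
print(json_ok)
print(fragments)
```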
I have list of dictionaries as follows:
[
{'id': 16419, 'name': 'Audi'},
{'id': 13, 'name': 'BMW'},
{'id': 31, 'name': 'Honda'},
{'id': 50060, 'name': 'KTM'},
{'id': 54, 'name': 'Opel'},
{'id': 55, 'name': 'Peugeot'},
{'id': 50083, 'name': 'PGO'},
{'id': 16350, 'name': 'Skoda'},
{'id': 68, 'name': 'Suzuki'},
{'id': 2120, 'name': 'Triumph'},
{'id': 16328, 'name': 'Others'},
{'id': 16396, 'name': 'Seat'},
{'id': 14979, 'name': 'Opel'},
{'id': 6, 'name': 'Volkswagen'}
]
I want to reorder it so that dictionaries with certain name values appear at the beginning of the list.
For example, Volkswagen, Audi, BMW, Opel, and Peugeot should come first.
The desired result would be something like this:
[
{'id': 6, 'name': 'Volkswagen'},
{'id': 16419, 'name': 'Audi'},
{'id': 13, 'name': 'BMW'},
{'id': 54, 'name': 'Opel'},
{'id': 55, 'name': 'Peugeot'},
{'id': 31, 'name': 'Honda'},
{'id': 50060, 'name': 'KTM'},
{'id': 50083, 'name': 'PGO'},
{'id': 16350, 'name': 'Skoda'},
{'id': 68, 'name': 'Suzuki'},
{'id': 2120, 'name': 'Triumph'},
{'id': 16328, 'name': 'Others'},
{'id': 16396, 'name': 'Seat'},
{'id': 14979, 'name': 'Opel'},
]
Any idea how to do that?
You can use an appropriate key function for your sorting. This one puts the given names first (in the given order). All other brands come after that and, because sorted() is stable, keep their original relative order:
>>> rank = {x: i for i, x in enumerate(['Volkswagen', 'Audi', 'BMW', 'Opel', 'Peugeot'])}
# {'Volkswagen': 0, 'Audi': 1, ...}
>>> sorted(lst, key=lambda x: rank.get(x['name'], len(rank)))
[{'id': 6, 'name': 'Volkswagen'},
{'id': 16419, 'name': 'Audi'},
{'id': 13, 'name': 'BMW'},
{'id': 54, 'name': 'Opel'},
{'id': 14979, 'name': 'Opel'},
{'id': 55, 'name': 'Peugeot'},
{'id': 31, 'name': 'Honda'},
{'id': 50060, 'name': 'KTM'},
{'id': 50083, 'name': 'PGO'},
{'id': 16350, 'name': 'Skoda'},
{'id': 68, 'name': 'Suzuki'},
{'id': 2120, 'name': 'Triumph'},
{'id': 16328, 'name': 'Others'},
{'id': 16396, 'name': 'Seat'}]
You can use a dictionary to define a custom sorting order.
dicts = [
{'id': 16419, 'name': 'Audi'},
{'id': 13, 'name': 'BMW'},
{'id': 31, 'name': 'Honda'},
{'id': 50060, 'name': 'KTM'},
{'id': 54, 'name': 'Opel'},
{'id': 55, 'name': 'Peugeot'},
{'id': 50083, 'name': 'PGO'},
{'id': 16350, 'name': 'Skoda'},
{'id': 68, 'name': 'Suzuki'},
{'id': 2120, 'name': 'Triumph'},
{'id': 16328, 'name': 'Others'},
{'id': 16396, 'name': 'Seat'},
{'id': 14979, 'name': 'Opel'},
{'id': 6, 'name': 'Volkswagen'}
]
brand_order = ['Volkswagen', 'Audi', 'BMW', 'Opel', 'Peugeot']
order = dict(zip(brand_order, range(len(brand_order))))
dicts_sorted = sorted(dicts, key=lambda d: order.get(d['name'], float('inf')))
print(dicts_sorted)
Output:
[{'id': 6, 'name': 'Volkswagen'},
{'id': 16419, 'name': 'Audi'},
{'id': 13, 'name': 'BMW'},
{'id': 54, 'name': 'Opel'},
{'id': 14979, 'name': 'Opel'},
{'id': 55, 'name': 'Peugeot'},
{'id': 31, 'name': 'Honda'},
{'id': 50060, 'name': 'KTM'},
{'id': 50083, 'name': 'PGO'},
{'id': 16350, 'name': 'Skoda'},
{'id': 68, 'name': 'Suzuki'},
{'id': 2120, 'name': 'Triumph'},
{'id': 16328, 'name': 'Others'},
{'id': 16396, 'name': 'Seat'}]
Falling back to float('inf') ensures that whatever is not in order comes last.
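If the remaining entries should be alphabetical rather than left in input order, one possible variation is a tuple key: rank first, name second. This is a sketch on a shortened, made-up list; `brands`, `priority`, and `sort_key` are illustrative names.

```python
# Shortened sample list for illustration.
brands = [
    {'id': 31, 'name': 'Honda'},
    {'id': 16419, 'name': 'Audi'},
    {'id': 16396, 'name': 'Seat'},
    {'id': 6, 'name': 'Volkswagen'},
    {'id': 50060, 'name': 'KTM'},
]

priority = ['Volkswagen', 'Audi', 'BMW', 'Opel', 'Peugeot']
rank = {name: i for i, name in enumerate(priority)}

def sort_key(d):
    # Listed brands sort by their rank; unlisted brands all share the
    # fallback rank len(rank) and are then ordered by name.
    return (rank.get(d['name'], len(rank)), d['name'])

ordered = sorted(brands, key=sort_key)

print([d['name'] for d in ordered])
```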