pandas - list of dicts inside a dataframe, keeping their index - python

I have a dataframe whose column values are lists of dictionaries, like this:
id comments
1 [{'review': {'review_id': 8987, 'review_text': 'wonderful'}}, {'review': {'review_id': 8988, 'review_text': 'good'}}]
2 [{'review': {'review_id': 9098, 'review_text': 'not good'}}, {'review': {'review_id': 9895, 'review_text': 'terrible'}}]
I figured out how to flatten the comments of one specific row by doing:
pd.io.json.json_normalize(json.loads(df['comments'].iloc[0].replace("'", '"')))
It makes a new dataframe from the column value, which is good, but what I actually need is for the id to extend along with it, like so:
id review_id review_text
1 8987 wonderful
1 8988 good
2 9098 not good
2 9895 terrible
Notice that the id is repeated along with its reviews. How do I implement a solution to this?
as reference, here is a small sample of the dataset: https://aimedu-my.sharepoint.com/:x:/g/personal/matthewromero_msds2021_aim_edu/EfhdrrlYJy1KmGWhECf91goB7jpHuPFKyz8L3UTfyCSDiA?e=pYcap3

Based on the file you provided and the result you describe, you can try this code:
import pandas as pd
import ast

# import data
df = pd.read_excel('./restaurants_reviews_sample.xlsx', usecols=[1, 2])
# change the column to a list of dictionaries
df.user_reviews = df.user_reviews.apply(lambda x: list(ast.literal_eval(x)))
# explode the reviews
df = df.explode('user_reviews')
# reset the index
df.reset_index(inplace=True, drop=True)
# unnest the review dictionary
df.user_reviews = df.user_reviews.apply(lambda x: x['review'])
# create the new columns (only the ones we need)
df = df.assign(id='', review_text='')
# populate the columns from the dictionary in user_reviews
cols = list(df.columns[2:4])
for i in range(len(df)):
    for c in cols:
        df.loc[i, c] = df.user_reviews[i][c]
# clean up the columns
df.drop(columns='user_reviews', inplace=True)
df.rename(columns={'id': 'review_id',
                   'index': 'id'}, inplace=True)
The new dataframe looks like this:
id review_id review_text
0 6301456 46743270
1 6301456 41974132 A yuppies place, meaning for the young urban poor the place is packed with the young crowd of the early 20's and mid 20's to early 30's that can still take a loud music pumping on the background with open space where you can check out the girls for a possible get to know and possible pick up. Quite affordable for the combo bucket with pulutan for the limited budget crowd but is there to look for a hook up.
2 6301456 38482279 I celebrated my birthday here and it was awesome! My team enjoyed the place, foods and drinks. *tip: if you will be in a group, consider getting the package with cocktail tower and beers plus the platter. It is worth your penny! Kudos to Dylan and JP for the wonderful service that they have provided us and for making sure that my celebration will be a really good one, and it was! Thank you guys! See you again soon! 😁😁
3 6301456 35971612 Sa lahat nang Central na napuntahan ko, dito ko mas bet! Unang una sa lahat, masarap yung foods and yung pagka gawa ng drinks nila. Hindi pa masyado pala away yung mga customers dito. 😂
4 6301456 35714330 Good place to chill and hang out. Not to mention the comfort room is clean. The staff are quite busy to attend us immediately but they are polite and courteous. Would definitely comeback! Cheers! 🍺😊
5 6301475 47379863 Underrated chocolate cake under 500 pesos! I definitely recommend this Cloud 9 cake!!! I'm not into chocolate but this one is good. This cake has a four layers, i loved the creamy white moose part. I ordered it via Grab Food and it was hassle free! 😀 The packaging was bad, its just white plastic container, Better handle it with care.
6 6301475 42413329 We loved the Cloud9 cake, its just right taste. We ordered it for our office celebration. However, we went back there to try other food. We get to try a chocolate cake that's too sweet, a cheese cake that's just right, and sansrival that's kind weird and i didnt expect that taste it's sweet and have a lot of nuts and.. i don't know i just didnt feel it. We also hand a lasagna, which is too saucey for is, it's like a soup of tomato. it's a bit disappointing, honestly. Other ordered from our next table looks good, and a lot of serving. They ordered rice meal, maybe you should try that .
7 6301475 42372938 Best cake i've eaten vs cakes from known brands such as Caramia and the like. Lots of white chocolate on top, not so sweet and similar to brazo de mercedes texture and, the merengue is the best!
8 6301475 41699036 This freaking piece of chicken costs 220Php. Chicken Cacciatore. Remember the name. DO NOT ORDER! This was my first time ordering something from your resto and I can tell you I AM NOT HAPPY!
9 6301475 40973213 Heard a lot about their famous chocolate cake. Bought a slice to try but found it quite sweet for my taste. Hope to try their other cakes though.
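As a side note (just a sketch, not part of the answer above), the same unnesting can be done without the explicit loop by using pd.json_normalize, assuming df right after the explode step (so user_reviews still holds the {'review': {...}} dictionaries):
# Flatten the nested review dicts straight into review_id / review_text columns
flat = pd.json_normalize(df['user_reviews'].apply(lambda x: x['review']).tolist())
# Keep the id column from the original rows next to the flattened review fields
out = pd.concat([df.drop(columns='user_reviews').reset_index(drop=True), flat], axis=1)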

Related

Google colab: read txt files and convert them to pandas

I am using Google Colab and there is a folder called 'examples' containing three txt files.
I am using the following code to read them and convert them to pandas:
dataset_filepaths = glob.glob('examples/*.txt')
for filepath in tqdm.tqdm(dataset_filepaths):
    df = pd.read_csv(filepath)
If you print the dataset_filepaths you will see
['examples/kate_middleton.txt',
'examples/jane_doe.txt',
'examples/daniel_craig.txt']
which is correct. However, the df contains only the first document. Could you please let me know how we can create a pandas DataFrame in the following form:
index text
-----------------
0 text0
1 text1
. .
. .
. .
Update: @Steven Rumbalski, using your code:
dfs = [pd.read_csv(filepath) for filepath in tqdm.tqdm(dataset_filepaths)]
dfs
The output looks like this
[Empty DataFrame
Columns: [Kate Middleton is the wife of Prince William. She is a mother of 3 children; 2 boys and a girl. Kate is educated to university level and that is where she met her future husband. Kate dresses elegantly and is often seen carrying out charity work. However, she is a mum first and foremost and the interactions we see with her children are adorable. Kate's sister, Pippa, has followed Kate into the public eye. She was born in 1982 and will soon turn 40. When pregnant, Kate suffers from a debilitating illness called Hyperemesis Gravidarum, which was little known about until it was reported that Kate had it.]
Index: [], Empty DataFrame
Columns: [Jane Doe was born in December 1978 and is currently living in London, United Kingdom.]
Index: [], Empty DataFrame
Columns: [He is an English film actor known for playing James Bond in the 007 series of films. Since 2005, he has been playing the character but he confirmed that No Time to Die would be his last James Bond film. He was born in Chester on 2nd of March in 1968. He moved to Liverpool when his parents divorced and lived there until he was sixteen years old. He auditioned and was accepted into the National Youth Theatre and moved down to London. He studied at Guildhall School of Music and Drama. He has appeared in many films.]
Index: []]
How can I convert it into the form that I want?
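A minimal sketch of one way to get that shape, assuming each .txt file is a single plain-text document (so reading the raw text instead of pd.read_csv, which treats the first line as a header and is why the DataFrames above came back empty):
import glob
import pandas as pd

dataset_filepaths = sorted(glob.glob('examples/*.txt'))

# Read each file as plain text rather than parsing it as a CSV
texts = []
for filepath in dataset_filepaths:
    with open(filepath, encoding='utf-8') as f:
        texts.append(f.read().strip())

# One row per document
df = pd.DataFrame({'text': texts})
print(df)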

How can I get the max value from groupby object with multiple values?

Sorry if the question is confusing; I was not sure how to word it. Please let me know if this is a duplicate question.
I have a groupby object that looks like this:
us.groupby(['category_id', 'title']).sum()[['views']]
us
category_id title views
Autos & Vehicle 1980 toyota corolla liftback commercial 13061
1992 Chevy Lumina Euro commercial 18470406
2019 Chevrolet Silverado First Look 13061
Music Backyard Boys 133
Eminem - Song 1223
Cardi B - Wap 1111122
Travel & Events Welcome to Winter PUNderland 437576
What Spring Looks Like Around The World 17554672
And I want to get only max value for each category, such as:
category_id title views
Autos & Vehicle 1992 Chevy Lumina Euro commercial 18470406
Music Cardi B - Wap 1111122
Travel & Events What Spring Looks Like Around The World 17554672
How can I do this?
I tried the .first() method, and also something like us.groupby(['category_id', 'title']).sum()[['views']].sort_values(by='views', ascending=False)[:1], but it only gives the first row of the entire dataframe. Is there any function I can use to keep only the max value of each group in a groupby object?
Thank you!
You can try:
us_group = us.groupby(['category_id', 'title']).sum()[['views']]
(us_group.reset_index().sort_values(['views'])
         .drop_duplicates('category_id', keep='last'))
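As a possible alternative (a sketch, not from the answer above), the same rows can be picked with idxmax on the summed views, assuming the us frame from the question:
summed = us.groupby(['category_id', 'title'])['views'].sum().reset_index()
# For each category, keep the row whose summed views are largest
top = summed.loc[summed.groupby('category_id')['views'].idxmax()]
print(top)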

by changing dataframe some columns are duplicated

I have a dataset:
,target,text
0,0,awww thats bummer shoulda got david carr third day
1,0,upset cant update facebook texting might cry result school today also blah
2,0,dived many times ball managed save 50 rest go bounds
3,0,whole body feels itchy like fire
4,0,behaving im mad cant see
5,0,whole crew
6,0,need hug
I wanted to split my csv and bring all data which has target = 0 into another .csv:
data_neg = df['target'] == '0'
df_neg = df[data_neg]
df_neg.to_csv("negative.csv")
And after doing this, the unnamed column in negative.csv is duplicated:
,Unnamed: 0,target,text
0,0,0,awww thats bummer shoulda got david carr third day
1,1,0,upset cant update facebook texting might cry result school today also blah
2,2,0,dived many times ball managed save 50 rest go bounds
3,3,0,whole body feels itchy like fire
4,4,0,behaving im mad cant see
5,5,0,whole crew
Why does it happen, and how can I avoid duplicating it? It only happens with the first, unnamed id column.
Create a copy and specify which column is your index when reading the CSV file:
# ...
df_neg = df[data_neg].copy()
df_neg.to_csv("negative.csv")
# For reading it
df_neg = pd.read_csv("negative.csv", index_col=0)
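Alternatively (not part of the answer above, just a common option), the extra unnamed column can be avoided at write time by not saving the index at all:
# Don't write the RangeIndex into the file, so no unnamed column appears
df_neg.to_csv("negative.csv", index=False)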

How do I find frequency of Authors and plot this using Python?

Here ABC News is observed 5 times, but the Times column shows 1 for each row. The expected output keeps ABC News in each row but shows a total of 5 in Times, since ABC News has published 5 titles overall.
That way, when plotting, the author is on the x-axis and the number of times they have been published is on the y-axis.
Code for the dataframe below, which needs to be changed as described above:
a=df1.groupby(['author','title'])['title'].count().reset_index(name="Time")
a.head()
author title Time
0 ABC News WATCH: How to get the most bang for your buck ... 1
1 ABC News WATCH: Man who confessed to killing wife, chil... 1
2 ABC News WATCH: Nearly 1,000 still missing 11 days afte... 1
3 ABC News WATCH: Teen hockey player skates after brain i... 1
4 ABC News WATCH: Trump: Will not do in-person interview ... 1
5 Ali Dukakis and Mike Levine Mueller 'has no eff... 1
The following will update your Times column with the appropriate numbers. You may opt to wrap the loop in a function to reuse it later.
import pandas as pd

df = pd.DataFrame(
    data=[['ABC News', 'WATCH: How to get the most bang for your buck...', '1'],
          ['ABC News', 'WATCH: Man who confessed to killing wife, chil...', '1'],
          ['ABC News', 'WATCH: Nearly 1,000 still missing 11 days afte...', '1'],
          ['ABC News', 'WATCH: Teen hockey player skates after brain i...', '1'],
          ['ABC News', 'WATCH: Trump: Will not do in-person interview ...', '1'],
          ['Ali Dukakis and Mike Levine', "Mueller 'has no eff...", '1']],
    columns=['author', 'title', 'Times'])

word_count = dict(df['author'].value_counts())
for i, v in df["author"].items():
    if v in word_count.keys():
        df.loc[i, "Times"] = word_count[v]
print(df)
This will give you the desired result.
Plotting author against Times shouldn't be an issue now, I believe. Kindly accept the answer if it meets your requirement, or let me know if this doesn't work for you.
The problem is that you are grouping on 'title' when you want to only group by 'author', it seems. Remove 'title' from groupby.
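Building on that second suggestion (a sketch, assuming the original df1 with author and title columns), the per-author counts and the plot could look like this:
import matplotlib.pyplot as plt

# Count how many titles each author has published
counts = df1.groupby('author')['title'].count().sort_values(ascending=False)

# Authors on the x-axis, number of published titles on the y-axis
counts.plot(kind='bar')
plt.xlabel('author')
plt.ylabel('Times published')
plt.tight_layout()
plt.show()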

Create Pandas dataframe using List

I'm trying to place a list that I created from reading a text file into a pandas dataframe, but it's not working for some reason. Below you will find some test data and my functions. The first piece of code does some checking and splitting, and the second part appends each record to a list called data. Here is some test data:
product/productId: B001E4KFG0
review/userId: A3SGXH7AUHU8GW
review/profileName: delmartian
review/helpfulness: 1/1
review/score: 5.0
review/time: 1303862400
review/summary: Good Quality Dog Food
review/text: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.
product/productId: B00813GRG4
review/userId: A1D87F6ZCVE5NK
review/profileName: dll pa
review/helpfulness: 0/0
review/score: 1.0
review/time: 1346976000
review/summary: Not as Advertised
review/text: Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
Here is my code:
import pandas as pd
import numpy as np

def grab_next_entry(food_file):
    record = {'id': -1, 'helpfulness': '', 'number rated': '', 'score': '', 'review': ''}
    line = food_file.readline()
    #food_dataframe=pd.DataFrame(columns=column_names)
    while line:
        if 'product/productId' in line:
            split_product_id = line.split(':')
            record['id'] = split_product_id[1]
        if 'review/helpfulness' in line:
            split_helpfulness = line.split(':')
            split_helpfulness = split_helpfulness[1].split('/')
            record['helpfulness'] = eval(split_helpfulness[0])
            record['number rated'] = eval(split_helpfulness[-1])
        if 'review/score' in line:
            split_score = line.split(':')
            record['score'] = eval(split_score[1])
        if 'review/text' in line:
            split_review_text = line.split('review/text:')
            record['review'] = split_review_text[1:]
        if line == '\n':
            return record
        line = food_file.readline()
The next piece of code creates the list and tries to put it into a pandas dataframe.
import os

fileLoc = "/Users/brawdyll/Documents/ds710fall2017assignment11/finefoods_test.txt"
column_names = ('Product ID', 'People who voted Helpful', 'Total votes', 'Rating', 'Review')
food_dataframe = []
data = []
with open(fileLoc, encoding="ISO 8859-1") as food_file:
    fs = os.fstat(food_file.fileno()).st_size
    num_read = 0
    while not food_file.tell() == fs:
        data.append(grab_next_entry(food_file))
        num_read += 1
Food_dataframe = pd.DataFrame(data, column_names)
print(Food_dataframe)
There are a lot of improvements that could be made in this code, but the reason your program isn't working is that you're passing column_names as the index. Running:
pd.DataFrame(data)
will work just fine, and then setting:
df.columns = column_names
will give you the results you want.
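Put together, a minimal sketch of that fix (assuming data is the list of record dicts built by the loop above):
# Columns come from the dict keys ('id', 'helpfulness', 'number rated', 'score', 'review')
Food_dataframe = pd.DataFrame(data)
# Rename them to the friendlier names afterwards
Food_dataframe.columns = column_names
print(Food_dataframe)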
