I'm trying to place a list that I created from reading in a text file into a pandas DataFrame, but it's not working for some reason. Below you will find some test data and my functions. The first piece of code does some checking and splitting, and the second part appends each record to a list called data. Here is some test data:
product/productId: B001E4KFG0
review/userId: A3SGXH7AUHU8GW
review/profileName: delmartian
review/helpfulness: 1/1
review/score: 5.0
review/time: 1303862400
review/summary: Good Quality Dog Food
review/text: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.
product/productId: B00813GRG4
review/userId: A1D87F6ZCVE5NK
review/profileName: dll pa
review/helpfulness: 0/0
review/score: 1.0
review/time: 1346976000
review/summary: Not as Advertised
review/text: Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
Here is my code:
import pandas as pd
import numpy as np

def grab_next_entry(food_file):
    record={'id':-1,'helpfulness':'','number rated':'','score':'','review':''}
    line=food_file.readline()
    #food_dataframe=pd.DataFrame(columns=column_names)
    while line:
        if 'product/productId' in line:
            split_product_id=line.split(':')
            record['id']=split_product_id[1]
        if 'review/helpfulness' in line:
            split_helpfulness=line.split(':')
            split_helpfulness=split_helpfulness[1].split('/')
            record['helpfulness']=eval(split_helpfulness[0])
            record['number rated']=eval(split_helpfulness[-1])
        if 'review/score' in line:
            split_score = line.split(':')
            record['score']=eval(split_score[1])
        if 'review/text' in line:
            split_review_text=line.split('review/text:')
            record['review']=split_review_text[1:]
        if line=='\n':
            return record
        line=food_file.readline()
The next piece of code creates the list and tries to put it into a pandas DataFrame:
import os

fileLoc = "/Users/brawdyll/Documents/ds710fall2017assignment11/finefoods_test.txt"
column_names=('Product ID', 'People who voted Helpful','Total votes','Rating','Review')
food_dataframe=[]
data=[]

with open(fileLoc,encoding = "ISO 8859-1") as food_file:
    fs=os.fstat(food_file.fileno()).st_size
    num_read = 0
    while not food_file.tell()==fs:
        data.append(grab_next_entry(food_file))
        num_read+=1

Food_dataframe = pd.DataFrame(data,column_names)
print(Food_dataframe)
There are a lot of improvements that could be made in this code, but the reason your program isn't working is that you're setting the indices to be column_names: the second positional argument of pd.DataFrame is index, not columns. Running:
pd.DataFrame(data)
will work just fine, and then setting:
df.columns = column_names
will give you the results you want.
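Put together, the corrected construction looks like this (data and column_names as defined in the question):

Food_dataframe = pd.DataFrame(data)
Food_dataframe.columns = column_names
print(Food_dataframe)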
I am trying to import a dataset from a text file, which looks like this:
id book author
1 Cricket World Cup: The Indian Challenge Ashis Ray
2 My Journey Dr. A.P.J. Abdul Kalam
3 Making of New India Dr. Bibek Debroy
4 Whispers of Time Dr. Krishna Saksena
When I use this for importing:
df = pd.read_csv('book.txt', sep=' ')
it results in:
and when I use:
df = pd.read_csv('book.txt')
it results in:
Is there a way to get something like:
Any help on this will be appreciated. Thank you.
Try with tab as a separator:
df = pd.read_csv('book.txt', sep='\t')
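If the file turns out to be aligned with runs of spaces rather than real tabs, a regex separator is a possible fallback; a sketch, assuming at least two spaces between columns (the python engine is needed for regex separators):

df = pd.read_csv('book.txt', sep=r'\s{2,}', engine='python')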
I have a dataframe with a column whose values are lists of dictionaries, like this:
id comments
1 [{'review': {'review_id': 8987, 'review_text': 'wonderful'}}, {'review': {'review_id': 8988, 'review_text': 'good'}}]
2 [{'review': {'review_id': 9098, 'review_text': 'not good'}}, {'review': {'review_id': 9895, 'review_text': 'terrible'}}]
I figured out how to flatten a specific comments cell by doing:
pd.io.json.json_normalize(json.loads(df['comments'].iloc[0].replace("'", '"')))
It makes a new dataframe from the column value, which is good, but what I actually need is for the id to extend as well, like so:
id review_id review_text
1 8987 wonderful
1 8988 good
2 9098 not good
2 9895 terrible
Notice that the id extended along with the reviews. How do I implement a solution to this?
As a reference, here is a small sample of the dataset: https://aimedu-my.sharepoint.com/:x:/g/personal/matthewromero_msds2021_aim_edu/EfhdrrlYJy1KmGWhECf91goB7jpHuPFKyz8L3UTfyCSDiA?e=pYcap3
Based on the file you provided and the result you describe, you can try this code:
import pandas as pd
import ast

#import data
df = pd.read_excel('./restaurants_reviews_sample.xlsx', usecols = [1,2])
#change column to list of dictionaries
df.user_reviews = df.user_reviews.apply(lambda x: list(ast.literal_eval(x)))
#explode the reviews
df = df.explode('user_reviews')
#resetting index
df.reset_index(inplace = True, drop = True)
#unnesting the review dictionary
df.user_reviews = df.user_reviews.apply(lambda x: x['review'])
#creating new columns (only the ones we need)
df = df.assign(id='', review_text='')
#populate the columns from the dictionary in user_reviews
cols = list(df.columns[2:4])
for i in list(range(0, len(df))):
    for c in cols:
        df[c][i] = df.user_reviews[i][c]
#cleaning columns
df.drop(columns = 'user_reviews' , inplace = True)
df.rename(columns = {'id':'review_id',
                     'index':'id'}, inplace = True)
The new dataframe looks like this:
id review_id review_text
0 6301456 46743270
1 6301456 41974132 A yuppies place, meaning for the young urban poor the place is packed with the young crowd of the early 20’s and mid 20’s to early 30’s that can still take a loud music pumping on the background with open space where you can check out the girls for a possible get to know and possible pick up. Quite affordable for the combo bucket with pulutan for the limited budget crowd but is there to look for a hook up.
2 6301456 38482279 I celebrated my birthday here and it was awesome! My team enjoyed the place, foods and drinks. *tip: if you will be in a group, consider getting the package with cocktail tower and beers plus the platter. It is worth your penny! Kudos to Dylan and JP for the wonderful service that they have provided us and for making sure that my celebration will be a really good one, and it was! Thank you guys! See you again soon!
3 6301456 35971612 Sa lahat nang Central na napuntahan ko, dito ko mas bet! Unang una sa lahat, masarap yung foods and yung pagka gawa ng drinks nila. Hindi pa masyado pala away yung mga customers dito. 😂
4 6301456 35714330 Good place to chill and hang out. Not to mention the comfort room is clean. The staff are quite busy to attend us immediately but they are polite and courteous. Would definitely comeback! Cheers! 😊
5 6301475 47379863 Underrated chocolate cake under 500 pesos! I definitely recommend this Cloud 9 cake!!! I’m not into chocolate but this one is good. This cake has a four layers, i loved the creamy white moose part. I ordered it via Grab Food and it was hassle free! 😀 The packaging was bad, its just white plastic container, Better handle it with care.
6 6301475 42413329 We loved the Cloud9 cake, its just right taste. We ordered it for our office celebration. However, we went back there to try other food. We get to try a chocolate cake that's too sweet, a cheese cake that's just right, and sansrival that's kind weird and i didnt expect that taste it's sweet and have a lot of nuts and.. i don't know i just didnt feel it. We also hand a lasagna, which is too saucey for is, it's like a soup of tomato. it's a bit disappointing, honestly. Other ordered from our next table looks good, and a lot of serving. They ordered rice meal, maybe you should try that .
7 6301475 42372938 Best cake i’ve eaten vs cakes from known brands such as Caramia and the like. Lots of white chocolate on top, not so sweet and similar to brazo de mercedes texture and, the merengue is the best!
8 6301475 41699036 This freaking piece of chicken costs 220Php. Chicken Cacciatore. Remember the name. DO NOT ORDER! This was my first time ordering something from your resto and I can tell you I AM NOT HAPPY!
9 6301475 40973213 Heard a lot about their famous chocolate cake. Bought a slice to try but found it quite sweet for my taste. Hope to try their other cakes though.
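For what it's worth, the row-by-row copy above can also be avoided: apply(pd.Series) expands the nested dicts in one vectorized step. A rough sketch under the same assumptions about the file and key names ('user_reviews', 'review', 'id', 'review_text'):

import pandas as pd
import ast

df = pd.read_excel('./restaurants_reviews_sample.xlsx', usecols=[1, 2])
df.user_reviews = df.user_reviews.apply(ast.literal_eval)
df = df.explode('user_reviews').reset_index(drop=True)
# expand each nested {'review': {...}} dict into its own columns
reviews = df.user_reviews.apply(lambda x: x['review']).apply(pd.Series)
reviews = reviews.rename(columns={'id': 'review_id'})
df = df.drop(columns='user_reviews').join(reviews[['review_id', 'review_text']])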
I have a dataset:
,target,text
0,0,awww thats bummer shoulda got david carr third day
1,0,upset cant update facebook texting might cry result school today also blah
2,0,dived many times ball managed save 50 rest go bounds
3,0,whole body feels itchy like fire
4,0,behaving im mad cant see
5,0,whole crew
6,0,need hug
I wanted to separate my CSV and bring all data which has target = 0 into another .csv:
data_neg = df['target'] == '0'
df_neg = df[data_neg]
df_neg.to_csv("negative.csv")
And after doing this, the unnamed first column in negative.csv is duplicated:
,Unnamed: 0,target,text
0,0,0,awww thats bummer shoulda got david carr third day
1,1,0,upset cant update facebook texting might cry result school today also blah
2,2,0,dived many times ball managed save 50 rest go bounds
3,3,0,whole body feels itchy like fire
4,4,0,behaving im mad cant see
5,5,0,whole crew
Why does this happen, and how can I avoid the duplication? It only happens with the first column, the unnamed id column.
Create a copy and specify which column is your index when reading the CSV file:
# ...
df_neg = df[data_neg].copy()
df_neg.to_csv("negative.csv")
# For reading it
df_neg = pd.read_csv("negative.csv", index_col=0)
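The duplicate appears because to_csv writes the row index as an extra unnamed column each time the file round-trips. If you never need the index in the file, you can also suppress it at the writing end:

df_neg.to_csv("negative.csv", index=False)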
I want to merge two CSV files with soccer data. They hold different data for the same and for different games (partial overlap). Normally I would do a merge with df.merge, but the problem is that the nomenclature differs for some teams in the two datasets. E.g. "Atletic Bilbao" is called "Club Atletic" in the second set.
Therefore I would like to normalize the team naming in the two datasets in order to be able to do a simple df.merge operation on dates and team names. At the moment this would result in extra lines whenever a team has different names in the two sets.
So my main question is: how can I normalize the team names in the two sets easily, without having to analyse all the differences "by hand" and hardcode "replace" operations on one of the sets?
Dataset1 is downloadable here: https://data.fivethirtyeight.com/#soccer-spi
Dataset2 is not available freely, but it looks like this:
hometeam awayteam date homeproba drawproba awayproba homexg awayxg
Manchester United Leicester 2018-08-10 22:00:00 0.2812 0.3275 0.3913 1.5137 1.73813
--Edit after first comments--
So the main question is: how could I automatically analyse the naming differences between the two datasets? Helpful facts:
As the sets hold whole seasons, the overlap per team name is at least 30+ games.
Most of the teams have the same names; the differences affect only a minority of team names.
Most name differences have at least a common substring.
Both datasets have date information for the games.
We know a team plays only one game a day.
So if Dataset1 says:
1.1.2018 Real - Atletic Club
And Dataset2 says:
1.1.2018 Real - Atletic Bilbao
We should be able to deduce that: {'Atletic Club': 'Atletic Bilbao'}
So this is how I could finally solve it:
import pandas as pd
df_teamnames = pd.merge(dataset1,dataset2,on=['hometeam','date'])
df_teamnames = df_teamnames[['awayteam_x','awayteam_y']]
df_teamnames = df_teamnames.drop_duplicates()
This gives you a dataframe holding each team's name as it appears in both datasets, like this:
1 Marseille Marseille
2 Atletic Club Atletic Bilbao
...
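To actually apply this, you could turn the pairs into a dict and rewrite one dataset's names before the real merge. A minimal sketch, assuming dataset2's names should be mapped onto dataset1's nomenclature and the column names used above:

# build {name_in_dataset2: name_in_dataset1} from the matched away teams
name_map = dict(zip(df_teamnames['awayteam_y'], df_teamnames['awayteam_x']))
# rewrite dataset2's team names, then merge on dates and team names directly
dataset2[['hometeam', 'awayteam']] = dataset2[['hometeam', 'awayteam']].replace(name_map)
merged = pd.merge(dataset1, dataset2, on=['date', 'hometeam', 'awayteam'])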
Assuming your dates are compatible (and correct), this should probably work to generate a translation dictionary. This kind of thing is always super fragile, though, and you shouldn't rely on it blindly.
import pandas as pd

names_1 = dataset1['hometeam'].unique().tolist()
names_2 = dataset2['hometeam'].unique().tolist()

mapping_dict = dict()
for common_name in set(names_1).intersection(set(names_2)):
    mapping_dict[common_name] = common_name

unknown_1 = set(names_1).difference(set(names_2))
unknown_2 = set(names_2).difference(set(names_1))

trim_df1 = dataset1.loc[:, ['hometeam', 'awayteam', 'date']]
trim_df2 = dataset2.loc[:, ['hometeam', 'awayteam', 'date']]
# merge on the columns (DataFrame.join would try to match trim_df2's index,
# so an explicit merge with suffixes is used here)
aligned_data = pd.merge(trim_df1, trim_df2, on=['hometeam', 'date'], suffixes=('_1', '_2'))

for unknown_name in unknown_1:
    matching_name = aligned_data.loc[aligned_data['awayteam_1'] == unknown_name, 'awayteam_2'].unique()
    if len(matching_name) != 1:
        raise ValueError("Couldn't find a unique match")
    mapping_dict[unknown_name] = matching_name[0]
    unknown_2.remove(matching_name[0])

if len(unknown_2) != 0:
    raise ValueError("We have extra team names for some reason")
My code is working, which is good lol, but the output needs to be viewed differently.
UPDATED CODE SINCE RECEIVING ANSWER
import pandas as pd
# Import File
YMM = pd.read_excel('C:/Users/PCTR261010/Desktop/OMIX_YMM_2016.xlsx').groupby(['Make','Model']).agg({'StartYear':'min', 'EndYear':'max'})
print(YMM)
The output looks like Make | Model | StartYear | EndYear, with the models listed down the Model column, but the makes are displayed like a pivot table: each make appears only once instead of being repeated on every row.
Here is a screen shot:
I need American Motors next to every American Motors model, every Buick next to every Buick model, and so on.
Here is the link to sample data:
http://jmp.sh/KLZKWVZ
Try this:
res = YMM.groupby(['Make','Model'], as_index=False).agg({'StartYear':'min', 'EndYear':'max'})
or
res = YMM.groupby(['Make','Model']).agg({'StartYear':'min', 'EndYear':'max'}).reset_index()
With your own code, resetting the index turns the grouped results back into flat columns:
Min = YMM.groupby(['Make','Model']).StartYear.min().reset_index()
Max = YMM.groupby(['Make','Model']).EndYear.max().reset_index()
Min['EndYear'] = Max.EndYear
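Putting the accepted fix together with the question's own read step, the whole pipeline might look like this (same file path and column names as above):

YMM = pd.read_excel('C:/Users/PCTR261010/Desktop/OMIX_YMM_2016.xlsx')
flat = YMM.groupby(['Make', 'Model'], as_index=False).agg({'StartYear': 'min', 'EndYear': 'max'})
print(flat)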