Google colab: read txt files and convert them to pandas - python

I am using Google Colab and there is a folder called 'examples' containing three txt files.
I am using the following code to read them and convert them to pandas:
import glob
import pandas as pd
import tqdm

dataset_filepaths = glob.glob('examples/*.txt')
for filepath in tqdm.tqdm(dataset_filepaths):
    df = pd.read_csv(filepath)  # note: df is overwritten on every iteration
If you print dataset_filepaths you will see:
['examples/kate_middleton.txt',
'examples/jane_doe.txt',
'examples/daniel_craig.txt']
which is correct. However, df ends up holding only one of the documents. Could you please let me know how we can create a pandas DataFrame in the following form:
index  text
-----  -----
0      text0
1      text1
...    ...
Update: @Steven Rumbalski, using your code:
dfs = [pd.read_csv(filepath) for filepath in tqdm.tqdm(dataset_filepaths)]
dfs
The output looks like this
[Empty DataFrame
Columns: [Kate Middleton is the wife of Prince William. She is a mother of 3 children; 2 boys and a girl. Kate is educated to university level and that is where she met her future husband. Kate dresses elegantly and is often seen carrying out charity work. However, she is a mum first and foremost and the interactions we see with her children are adorable. Kate’s sister, Pippa, has followed Kate into the public eye. She was born in 1982 and will soon turn 40. When pregnant, Kate suffers from a debilitating illness called Hyperemesis Gravidarum, which was little known about until it was reported that Kate had it.]
Index: [], Empty DataFrame
Columns: [Jane Doe was born in December 1978 and is currently living in London, United Kingdom.]
Index: [], Empty DataFrame
Columns: [He is an English film actor known for playing James Bond in the 007 series of films. Since 2005, he has been playing the character but he confirmed that No Time to Die would be his last James Bond film. He was born in Chester on 2nd of March in 1968. He moved to Liverpool when his parents divorced and lived there until he was sixteen years old. He auditioned and was accepted into the National Youth Theatre and moved down to London. He studied at Guildhall School of Music and Drama. He has appeared in many films.]
Index: []]
How can I convert it in the form that I want?
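The empty DataFrames give a hint: pd.read_csv treats the first line of each file as the header row, which is why each result comes back with no rows and the whole text as a column name. A minimal sketch of one way to get the shape above, assuming each .txt file is a single block of plain text rather than a comma-separated table:

import glob
import pandas as pd

dataset_filepaths = sorted(glob.glob('examples/*.txt'))

texts = []
for filepath in dataset_filepaths:
    # read each file as raw text instead of parsing it as CSV
    with open(filepath, encoding='utf-8') as f:
        texts.append(f.read())

df = pd.DataFrame({'text': texts})  # one row per file, index 0..n-1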

Related

How to count the occurrences of a value on a dataframe and plot it while considering the time

I have a dataframe that looks like this but much larger:
  title of the novel       author            publishing year  mentionned cities
0 Beasts and creatures     Bruno Ivory       1850             New York
0 Monsters                 Renata Mcniar     1866             New York
0 At risk                  Charles Dobi      1870             New York
0 Manuela and Ricardo      Lucas Zacci       1889             New York
0 War against the machine  Angelina Trotter  1854             New York
My objective is to create a line chart that shows the decades in which the city (in this case, "New York") was mentioned in a novel. I have tried several things, as you can see in a previous post about the same problem. I thought I had solved it, but it turned out I had not.
(How to count the occurrences of a value in a data frame?)
Here is an image I made in Excel. This should exemplify the desired outcome.
Update:
Someone tried to help me, but then deleted the answer. Fortunately, I had already copied it. However, it did not work.
I think the code is worth mentioning:
counts = df[['publishing year', 'mentionned cities']].value_counts().reset_index(name='counts').sort_values('publishing year')
counts[counts['mentionned cities'] == 'New York'][['publishing year', 'counts']].set_index('publishing year').plot()
You can try groupby.count() then use Series.plot()
import matplotlib.pyplot as plt
axes = df.groupby('publishing year')['title of the novel'].count().plot()
axes.set_ylim(0, 5)
axes.set_xlabel('19th Century')
axes.set_ylabel('Number of novels')
axes.set_title('City of New York')
plt.show()
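If the x-axis should show decades rather than individual years, a small sketch (assuming the column names from the question, including the 'mentionned cities' spelling) could bucket the years first:

import matplotlib.pyplot as plt

# keep only the rows that mention New York, then bucket years into decades
ny = df[df['mentionned cities'] == 'New York']
decades = (ny['publishing year'] // 10) * 10  # e.g. 1866 -> 1860

axes = ny.groupby(decades)['title of the novel'].count().plot(marker='o')
axes.set_xlabel('Decade')
axes.set_ylabel('Number of novels')
axes.set_title('Mentions of New York per decade')
plt.show()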

Skipping spaces in words of a given column while importing text file in pandas

I am trying to import a dataset from a text file, which looks like this.
id  book                                      author
1   Cricket World Cup: The Indian Challenge  Ashis Ray
2   My Journey                                Dr. A.P.J. Abdul Kalam
3   Making of New India                       Dr. Bibek Debroy
4   Whispers of Time                          Dr. Krishna Saksena
When I use for importing:
df = pd.read_csv('book.txt', sep=' ')
the multi-word titles and author names get split on every space, and when I use:
df = pd.read_csv('book.txt')
everything ends up in a single column. Is there a way to get a DataFrame with the three columns id, book and author?
Any help on this will be appreciated. Thank you.
Try with tab as a separator:
df = pd.read_csv('book.txt', sep='\t')
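If the file turns out to be separated by runs of spaces rather than tabs, a hedged alternative is a regex separator (engine='python' is required for multi-character regex separators):

import pandas as pd

# split on 2+ consecutive spaces so multi-word titles stay intact
df = pd.read_csv('book.txt', sep=r'\s{2,}', engine='python')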

Pandas: Remove all words from specific list within dataframe strings in large dataset

So I have three pandas DataFrames (train_original, train_augmented, test); overall it is about 700k rows. I would like to remove all cities from a cities list, common_cities. But tqdm in the notebook cell suggests that it would take about 24 hrs to replace everything from a list of 33,000 cities.
dataframe example (train_original):
id  name_1                            name_2
0   sun blinds decoration paris inc.  indl de cuautitlan sa cv
1   eih ltd. dongguan wei shi         plastic new york product co., ltd.
2   jsh ltd. (hk) mexico city         arab shipbuilding seoul and repair yard madrid c
common_cities list example
common_cities = ['moscow', 'madrid', 'san francisco', 'mexico city']
what is supposed to be output:
id  name_1                      name_2
0   sun blinds decoration inc.  indl de sa cv
1   eih ltd. wei shi            plastic product co., ltd.
2   jsh ltd. (hk)               arab shipbuilding and repair yard c
My solution worked well on a small list of filter words, but when the list is large, the performance is poor.
%%time
for city in tqdm(common_cities):
    train_original.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
    train_augmented.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
    test.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
P.S.: I presume it's not great to split each string and substitute city names with a list comprehension, because a city name can be more than one word.
Any suggestions, ideas on approach to make a quick replacement on Pandas Dataframes in such situations?
Instead of iterating over the huge DataFrames for each pass, remember that pandas replace accepts a dictionary with all the replacements to be done in a single go.
Therefore we can start by creating the dictionary and then using it with replace:
replacements = {fr'\b{city}\b': '' for city in common_cities}
train_original = train_original.replace(replacements, regex=True)  # regex=True so cities inside longer strings match
train_augmented = train_augmented.replace(replacements, regex=True)
test = test.replace(replacements, regex=True)
Edit: Reading the documentation, it might be even easier, because replace also accepts a list of values to be replaced:
train_original = train_original.replace(common_cities,'')
train_augmented = train_augmented.replace(common_cities,'')
test = test.replace(common_cities,'')
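One caveat: with a plain list (no regex=True), replace only matches entire cell values, while these cells merely contain city names inside longer strings. A sketch of a single-pass alternative, assuming the name_1/name_2 columns from the example: compile one alternation pattern over all the cities and apply Series.str.replace once per column, instead of 33,000 full passes over the data:

import re

# one pattern matching any city as a whole word; longer names first so
# multi-word cities like 'mexico city' win over shorter overlaps
pattern = r'\b(?:' + '|'.join(
    map(re.escape, sorted(common_cities, key=len, reverse=True))
) + r')\b'

for frame in (train_original, train_augmented, test):
    for col in ('name_1', 'name_2'):
        frame[col] = (
            frame[col]
            .str.replace(pattern, '', regex=True)
            .str.replace(r'\s{2,}', ' ', regex=True)  # collapse leftover double spaces
            .str.strip()
        )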

pandas - list of dicts inside a dataframe, keeping their index

I have a dataframe with column values list of dictionaries that looks like this:
id  comments
1   [{'review': {'review_id': 8987, 'review_text': 'wonderful'}}, {'review': {'review_id': 8988, 'review_text': 'good'}}]
2   [{'review': {'review_id': 9098, 'review_text': 'not good'}}, {'review': {'review_id': 9895, 'review_text': 'terrible'}}]
I figured out how to flatten the comments of one specific row by doing:
pd.io.json.json_normalize(json.loads(df['comments'].iloc[0].replace("'", '"')))
It makes a new DataFrame from the column value, which is good, but what I actually need is for the id to extend along with it, like so:
id review_id review_text
1 8987 wonderful
1 8988 good
2 9098 not good
2 9895 terrible
Notice that the id extended along with the reviews. How do I implement a solution to this?
as reference, here is a small sample of the dataset: https://aimedu-my.sharepoint.com/:x:/g/personal/matthewromero_msds2021_aim_edu/EfhdrrlYJy1KmGWhECf91goB7jpHuPFKyz8L3UTfyCSDiA?e=pYcap3
Based on the file you provided and the result you describe, you can try this code:
import pandas as pd
import ast

# import data
df = pd.read_excel('./restaurants_reviews_sample.xlsx', usecols=[1, 2])
# change column to list of dictionaries
df.user_reviews = df.user_reviews.apply(lambda x: list(ast.literal_eval(x)))
# explode the reviews, one per row
df = df.explode('user_reviews')
# reset the index
df.reset_index(inplace=True, drop=True)
# unnest the review dictionary
df.user_reviews = df.user_reviews.apply(lambda x: x['review'])
# create new columns (only the ones we need)
df = df.assign(id='', review_text='')
# populate the columns from the dictionary in user_reviews
cols = list(df.columns[2:4])
for i in range(len(df)):
    for c in cols:
        df.loc[i, c] = df.user_reviews[i][c]  # .loc avoids chained assignment
# clean up columns
df.drop(columns='user_reviews', inplace=True)
df.rename(columns={'id': 'review_id',
                   'index': 'id'}, inplace=True)
The new dataframe looks like this:
id review_id review_text
0 6301456 46743270
1 6301456 41974132 A yuppies place, meaning for the young urban poor the place is packed with the young crowd of the early 20’s and mid 20’s to early 30’s that can still take a loud music pumping on the background with open space where you can check out the girls for a possible get to know and possible pick up. Quite affordable for the combo bucket with pulutan for the limited budget crowd but is there to look for a hook up.
2 6301456 38482279 I celebrated my birthday here and it was awesome! My team enjoyed the place, foods and drinks. *tip: if you will be in a group, consider getting the package with cocktail tower and beers plus the platter. It is worth your penny! Kudos to Dylan and JP for the wonderful service that they have provided us and for making sure that my celebration will be a really good one, and it was! Thank you guys! See you again soon!
3 6301456 35971612 Sa lahat nang Central na napuntahan ko, dito ko mas bet! Unang una sa lahat, masarap yung foods and yung pagka gawa ng drinks nila. Hindi pa masyado pala away yung mga customers dito. 😂
4 6301456 35714330 Good place to chill and hang out. Not to mention the comfort room is clean. The staff are quite busy to attend us immediately but they are polite and courteous. Would definitely comeback! Cheers! 🍺😊
5 6301475 47379863 Underrated chocolate cake under 500 pesos! I definitely recommend this Cloud 9 cake!!! I’m not into chocolate but this one is good. This cake has a four layers, i loved the creamy white moose part. I ordered it via Grab Food and it was hassle free! 😀 The packaging was bad, its just white plastic container, Better handle it with care.
6 6301475 42413329 We loved the Cloud9 cake, its just right taste. We ordered it for our office celebration. However, we went back there to try other food. We get to try a chocolate cake that's too sweet, a cheese cake that's just right, and sansrival that's kind weird and i didnt expect that taste it's sweet and have a lot of nuts and.. i don't know i just didnt feel it. We also hand a lasagna, which is too saucey for is, it's like a soup of tomato. it's a bit disappointing, honestly. Other ordered from our next table looks good, and a lot of serving. They ordered rice meal, maybe you should try that .
7 6301475 42372938 Best cake i’ve eaten vs cakes from known brands such as Caramia and the like. Lots of white chocolate on top, not so sweet and similar to brazo de mercedes texture and, the merengue is the best!
8 6301475 41699036 This freaking piece of chicken costs 220Php. Chicken Cacciatore. Remember the name. DO NOT ORDER! This was my first time ordering something from your resto and I can tell you I AM NOT HAPPY!
9 6301475 40973213 Heard a lot about their famous chocolate cake. Bought a slice to try but found it quite sweet for my taste. Hope to try their other cakes though.
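As an aside, the manual column-populating loop above can be replaced by pd.json_normalize, which flattens the nested 'review' dicts in one call. A sketch assuming the same file and column layout as the answer:

import ast
import pandas as pd

df = pd.read_excel('./restaurants_reviews_sample.xlsx', usecols=[1, 2])
df['user_reviews'] = df['user_reviews'].apply(ast.literal_eval)
df = df.explode('user_reviews', ignore_index=True)

# flatten {'review': {'review_id': ..., 'review_text': ...}} into columns
reviews = pd.json_normalize(df['user_reviews'].tolist())
reviews.columns = [c.replace('review.', '') for c in reviews.columns]
df = df.drop(columns='user_reviews').join(reviews)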

How do I find frequency of Authors and plot this using Python?

Here ABC News is observed 5 times, but the Times column shows 1 for each row. The expected output keeps ABC News once in each row but puts the total in Times as 5, since ABC News has published 5 titles overall, so that when plotting, author is on the x-axis and the number of times published is on the y-axis.
Code for the dataframe below, which needs to be changed as described above:
a=df1.groupby(['author','title'])['title'].count().reset_index(name="Time")
a.head()
author title Time
0 ABC News WATCH: How to get the most bang for your buck ... 1
1 ABC News WATCH: Man who confessed to killing wife, chil... 1
2 ABC News WATCH: Nearly 1,000 still missing 11 days afte... 1
3 ABC News WATCH: Teen hockey player skates after brain i... 1
4 ABC News WATCH: Trump: Will not do in-person interview ... 1
5 Ali Dukakis and Mike Levine Mueller 'has no eff... 1
The following will keep updating your Times column with the appropriate numbers. You may opt to declare the loop within a function to reuse later on.
import pandas as pd

df = pd.DataFrame(
    data=[['ABC News', 'WATCH: How to get the most bang for your buck...', '1'],
          ['ABC News', 'WATCH: Man who confessed to killing wife, chil...', '1'],
          ['ABC News', 'WATCH: Nearly 1,000 still missing 11 days afte...', '1'],
          ['ABC News', 'WATCH: Teen hockey player skates after brain i...', '1'],
          ['ABC News', 'WATCH: Trump: Will not do in-person interview ...', '1'],
          ['Ali Dukakis and Mike Levine', "Mueller 'has no eff...", '1']],
    columns=['author', 'title', 'Times'])
word_count = dict(df['author'].value_counts())
for i, v in df['author'].items():  # .items() replaces the deprecated .iteritems()
    if v in word_count:
        df.loc[i, 'Times'] = word_count[v]
print(df)
This should get your desired result.
Plotting author against Times now shouldn't be an issue, I believe. Kindly accept the answer if it meets your requirement or else please let me know if this doesn't work for you.
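A loop-free alternative, assuming the same df as in the snippet above, is groupby/transform, which writes the per-author count into every row in one pass:

import matplotlib.pyplot as plt

# per-author row count broadcast back onto each row
df['Times'] = df.groupby('author')['author'].transform('count')

# one bar per author, height = number of titles published
df.drop_duplicates('author').plot(x='author', y='Times', kind='bar', legend=False)
plt.ylabel('Times published')
plt.show()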
The problem is that you are grouping by 'title' when you want to group only by 'author', it seems. Remove 'title' from the groupby.
