Python speech recognition error converting mp3 file

This is my first try at audio-to-text:
import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("/path/to/.mp3") as source:
    audio = r.record(source)
When I execute the above code, the following error occurs:
<ipython-input-10-72e982ecb706> in <module>()
----> 1 with sr.AudioFile("/home/yogaraj/Documents/Python workouts/Python audio to text/show_me_the_meaning.mp3") as source:
2 audio = sr.record(source)
3
/usr/lib/python2.7/site-packages/speech_recognition/__init__.pyc in __enter__(self)
197 aiff_file = io.BytesIO(aiff_data)
198 try:
--> 199 self.audio_reader = aifc.open(aiff_file, "rb")
200 except aifc.Error:
201 assert False, "Audio file could not be read as WAV, AIFF, or FLAC; check if file is corrupted"
/usr/lib64/python2.7/aifc.pyc in open(f, mode)
950 mode = 'rb'
951 if mode in ('r', 'rb'):
--> 952 return Aifc_read(f)
953 elif mode in ('w', 'wb'):
954 return Aifc_write(f)
/usr/lib64/python2.7/aifc.pyc in __init__(self, f)
345 f = __builtin__.open(f, 'rb')
346 # else, assume it is an open file object already
--> 347 self.initfp(f)
348
349 #
/usr/lib64/python2.7/aifc.pyc in initfp(self, file)
296 self._soundpos = 0
297 self._file = file
--> 298 chunk = Chunk(file)
299 if chunk.getname() != 'FORM':
300 raise Error, 'file does not start with FORM id'
/usr/lib64/python2.7/chunk.py in __init__(self, file, align, bigendian, inclheader)
61 self.chunkname = file.read(4)
62 if len(self.chunkname) < 4:
---> 63 raise EOFError
64 try:
65 self.chunksize = struct.unpack(strflag+'L', file.read(4))[0]
I don't know what I'm doing wrong. Can someone tell me what is wrong in the above code?

speech_recognition's AudioFile supports WAV, AIFF, and FLAC formats; it cannot read MP3 directly.
Here is a sample WAV-to-text program using speech_recognition:
Sample code (Python 3):
import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("woman1_wb.wav") as source:
    audio = r.record(source)

try:
    s = r.recognize_google(audio)
    print("Text: " + s)
except Exception as e:
    print("Exception: " + str(e))
Output:
Text: to administer medicine to animals is frequency of very difficult matter and yet sometimes it's necessary to do so
Used WAV File URL: http://www-mobile.ecs.soton.ac.uk/hth97r/links/Database/woman1_wb.wav

This is what was wrong: speech_recognition cannot read MP3 files directly.
But here is a more complete answer on how to get from MP3 to text:
The processing function below uses speech_recognition and pydub to convert MP3 to WAV and then to text using Google's Speech API. It splits the MP3 file into 60-second chunks to fit inside Google's limits, which lets you run about 50 minutes of audio in a day; the free endpoint will block you after about 50 API calls.
from pydub import AudioSegment  # requires FFmpeg
import speech_recognition as sr

def process(filepath, chunksize=60000):
    # 0: load mp3
    sound = AudioSegment.from_mp3(filepath)

    # 1: split file into 60s chunks
    def divide_chunks(sound, chunksize):
        # step through the audio chunksize milliseconds at a time
        for i in range(0, len(sound), chunksize):
            yield sound[i:i + chunksize]

    chunks = list(divide_chunks(sound, chunksize))
    print(f"{len(chunks)} chunks of {chunksize/1000}s each")

    r = sr.Recognizer()
    # 2: per chunk, save to wav, then read and run through recognize_google()
    string_index = {}
    for index, chunk in enumerate(chunks):
        # TODO: io.BytesIO()
        chunk.export('/Users/mmaxmeister/Downloads/test.wav', format='wav')
        with sr.AudioFile('/Users/mmaxmeister/Downloads/test.wav') as source:
            audio = r.record(source)
        # passing key=API_KEY resulted in a broken pipe for me, so the default key is used
        s = r.recognize_google(audio, language="en-US")
        print(s)
        string_index[index] = s
    return string_index

text = process('/Users/mmaxmeister/Downloads/UUCM.mp3')
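The TODO above can be addressed by exporting each chunk to an in-memory buffer instead of a temp file; both pydub's export and sr.AudioFile accept file-like objects. A minimal sketch of a helper (my own addition, not part of the original answer) that would replace the export/record steps inside the loop:
import io

def transcribe_chunk(r, chunk):
    # export the pydub AudioSegment to an in-memory WAV instead of a temp file
    buf = io.BytesIO()
    chunk.export(buf, format='wav')
    buf.seek(0)
    with sr.AudioFile(buf) as source:  # sr.AudioFile accepts file-like objects
        audio = r.record(source)
    return r.recognize_google(audio, language="en-US")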
My test MP3 file was a sermon from archive.org:
https://ia801008.us.archive.org/24/items/UUCMService20190602IfWeBuildIt/UUCM%20Service%202019-06-02%20-%20If%20We%20Build%20It.mp3
And this is the text returned (each line is 60s of audio):
13 chunks of 60.0s each
please join me in a spirit of prayer Spirit of Life known in many ways by a million names gracious Spirit of Life unfolding never known in its fullness be with us hear our cries for deliverance dance with us in exultation hold us when we fall keep before us the reality that every day is a gift to be unwrapped a gift to help discover why we live why we are cast Here and Now
Austin teaches us that the days come and go like muffled veiled figures Sent From A Distant friendly party but they say nothing and if we do not use the gifts they bring us they will carry them away as silently as they came through buying source of all Bend us towards gratitude and compassion Modern Life demands much misery and woe get created all around us but there is more much more show us that much more belongs to us to light Dawns on those who live love and sing the truth Joy John's on those who humbly toiled
do what is just so you who can shine pass on your light when it Dawns on you and let us all find the space to see Life as a gift to see our days as blessings and let us return life's gift and promise with grateful hearts and acts of kindness in the name of all the each of us teams holiest within our hearts we pray Amon
my character at least when I was younger I'm sure I don't really do this anymore the most challenging aspect of my character is that I want wisdom yesterday I don't want to have to learn something now I should have known it already right I used to drive my poor parents crazy as they tried to help me with my homework my father with my math my mother with my spelling if I didn't know the answer as soon as the problem was in front of me I would get angry frustrated with myself how come I didn't already know that I'm supposed to be wise only child I wonder if that has anything to do around the room has been throughout my life
but I still see it manifest in one particular aspects of my being I want us all to know how to love one another perfectly with wisdom already we should have learned that yesterday I want the Beloved Community right now and it frustrates me to no end that it isn't here what was that song that we saying after the prayer response how could anyone ever tell you you were anything less than beautiful how do we do that how do we tell ourselves that and others that we are not all the time how do we do that how come we haven't figured that out yesterday there's been a great Salve and corrective
to this challenge of my personality that I found in Community First the bomb the South I find the in community in this started when I was a youth in my youth group when we were ten people sitting on a floor together on pillows telling one another about what we've been through that week and how much pain we were carrying and how much we needed one another I found in that youth group with just 10 of us are sitting on the floor that we could be the Beloved Community for one another if only just for one hour a week sometimes just for 5 minutes sometimes just for a moment that was the Sal that was the bomb I realize that maybe we can't do it all the time but we can do it in moments and in spaces and that only happen for me in the space
community that community that we created with one another and the corrective to my need to have things everything done yesterday that also happens in community because Community is the slowest place on Earth We're going to have our annual meeting later let's see how slow that's going to be but the truth of the matter is that even in that slowness when you're working really hard to set up or cleanup connection Cafe when you're trying to figure out how to set up membership so that we actually do talk to everybody who comes through the doors right when you're doing that work of the Care team and that big list of all the different people that we need to reach out to and and we have to figure out how we reached out to all of them and who's done it
when you're waiting for the sermon to be over in all of these waiting times and all of these phases of process what I've learned in that slowness something amazing something remarkable is happening we are dedicating ourselves over and over again to still being together cuz it's not always easy because we're all broken and we're all whole because sometimes this is incredibly difficult and painful but when we're in those times what we're doing is we're saying it's worth it something about this matters to me and to all of us and Becca's got a great story to illustrate the this comes from
used to have a radio show in Boston maybe you heard him at some point Unitarian Universalist Minister from Boston 66 driving lessons and five road test she was 70 years old and had never driven a car but on July 25th 1975 she went to the Rockland County driving school and took her first lesson her husband had already had heart trouble and might someday be unable to drive if that happened she wanted to be the one to do the shopping and Shake him to the doctor she began the slow and painful process of learning to start stop turn into traffic back up after 5 difficult month she took the driving test
before ever she wrote in her diary and was not nervous she just a test a month later and slumped again I did everything wrong she told her diary demonic in August of 1976 she resumed the lessons with the Eaton driving school and took her third road test in October with half the world praying for me
she took a double driving lesson the next day and parallel park 6 times after three more lessons she took her Fifth and final test on January 21st 1977 and passed she had spent $860 on 66 plus 5 road test and at the age of 71 she had her license good three years later he did for several months she was the one who drove to the hospital Supermarket Pharmacy in church when we were children
someone rafter do another instructed by the spider's persistence Robert brute Robert Bruce left the Hut gathered his men and defeated the dance my mother and body of the story but it was not just persistence that moved her but love for the man who was her other self do you want to know what love is it's 66 driving lessons and five road tests and a very tough lady
who won't give up because her love is that great thank you all for bringing this to this beloved in moments community
That's pretty good for FREE. Unfortunately, the Google Cloud API version is prohibitively expensive if you want to transcribe hours of content.

Related

gTTS stutters after 2 sentences

I'm trying to make two gTTS voices, Sarah and Mary, talk to each other by reading a scripted dialog. After two sentences with the same voice, it starts repeating the last word for the remaining duration of the sentence.
from moviepy.editor import AudioFileClip, concatenate_audioclips
from gtts import gTTS

dialog = """Mary: Mom, can we get a dog?
Sarah: What kind of dog do you want?
Mary: I’m not sure. I’ve been researching different breeds and I think I like corgis.
Sarah: Corgis? That’s a pretty popular breed. What do you like about them?
Mary: Well, they’re small, so they won’t take up too much room. They’re also very loyal and friendly. Plus, they’re really cute!
Sarah: That’s true. They do seem like a great breed. Have you done any research on their care and grooming needs?
Mary: Yes, I have. They don’t need a lot of grooming, but they do need regular brushing and occasional baths. They’re also very active, so they need plenty of exercise.
Sarah: That sounds like a lot of work. Are you sure you’re up for it?
Mary: Yes, I am. I’m willing to put in the effort to take care of a corgi.
Sarah: Alright, if you’re sure. Let’s look into getting a corgi then.
Mary: Yay! Thank you, Mom!"""

lines = dialog.split("\n")
combined = AudioFileClip(r"Z:\Programming Stuff\Music\Type_Beat__BPM105.wav").set_duration(10)  # add intro music
for line in lines:
    if "Sarah:" in line:
        # Use a voice for Person 1
        res = line.split(' ', 1)[1]  # removes the speaker name
        tts = gTTS(text=str(res), lang='en')  # accent changer
        tts.save("temp6.mp3")  # temp save file because the audio clips must be mixed
        combined = concatenate_audioclips([combined, AudioFileClip("temp6.mp3")])
    elif "Mary:" in line:
        # Use a voice for Person 2
        res = line.split(' ', 1)[1]  # removes the speaker name
        tts = gTTS(text=str(res), lang='en', tld='com.au')  # accent changer
        tts.save("temp6.mp3")  # temp save file because the audio clips must be mixed
        combined = concatenate_audioclips([combined, AudioFileClip("temp6.mp3")])
combined.write_audiofile("output3.mp3")  # final file name
OUTPUT:
The audio file is almost exactly the intended output, except that after "They're also very loyal and friendly." it keeps repeating "plus". It also repeats at "Yes, I have. They don't need a lot of grooming, but they do need regular brushing and occasional baths.", where it repeats "baths" many times.
It appears it just repeats after saying 2 sentences, and I have no idea why.
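No answer was given in the thread, but one plausible cause worth checking (an assumption on my part, not a confirmed diagnosis): AudioFileClip reads its file lazily, so saving every line to the same temp6.mp3 leaves earlier clips pointing at a file whose contents keep changing, which can produce exactly this kind of repeated-audio artifact. A sketch that writes each line to its own file instead:
from gtts import gTTS
from moviepy.editor import AudioFileClip, concatenate_audioclips

# `lines` as in the question's code
clips = [AudioFileClip(r"Z:\Programming Stuff\Music\Type_Beat__BPM105.wav").set_duration(10)]
for i, line in enumerate(lines):
    speaker, text = line.split(' ', 1)
    tld = 'com.au' if speaker.startswith('Mary') else 'com'  # same voices as the question
    tts = gTTS(text=text, lang='en', tld=tld)
    fname = f"temp_{i}.mp3"  # a distinct file per line, never overwritten
    tts.save(fname)
    clips.append(AudioFileClip(fname))

combined = concatenate_audioclips(clips)
combined.write_audiofile("output3.mp3")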

Analysing English text with some French name

I'm dealing with Victor Hugo's well-known novel "Les Misérables".
Part of my project is to detect each of the novel's characters in a sentence and count their occurrences. This can be done easily with something like this:
from collections import OrderedDict

def character_frequency(character_per_sentences_dict):
    characters_frequency = OrderedDict()
    for k, v in character_per_sentences_dict.items():
        if len(v) != 0:
            characters_frequency[k] = len(v)
    return characters_frequency
This piece of code works well for all of the characters except "Èponine".
I also read the text with the following piece of code:
from nltk.tokenize import sent_tokenize

with open(path_to_volume + '.txt', 'r', encoding='latin1') as fp:
    novel = ' '.join(fp.readlines())

# Tokenize sentences and calculate the number of sentences
sentences = sent_tokenize(novel)
num_volume = path_to_volume.split("-v")[-1]
I should add that the spelling of "Èponine" is the same everywhere.
Any idea what's going on?
Here is a sample in which this name appears:
" ONE SHOULD ALWAYS BEGIN BY ARRESTING THE VICTIMS
At nightfall, Javert had posted his men and had gone into ambush himself between the trees of the Rue de la Barriere-des-Gobelins which faced the Gorbeau house, on the other side of the boulevard. He had begun operations by opening his pockets, and dropping into it the two young girls who were charged with keeping a watch on the approaches to the den. But he had only caged Azelma. As for Èponine, she was not at her post, she had disappeared, and he had not been able to seize her. Then Javert had made a point and had bent his ear to waiting for the signal agreed upon. The comings and goings of the fiacres had greatly agitated him. At last, he had grown impatient, and, sure that there was a nest there, sure of being in luck, having recognized many of the ruffians who had entered, he had finally decided to go upstairs without waiting for the pistol-shot."
I agree with @BoarGules that there is likely a more efficient and effective way to approach this problem. That said, I'm not sure what your problem is here: Python fully supports Unicode. You can "just do it", using Unicode in your program logic with Python's standard string operations and libraries.
For example, this works:
#!/usr/bin/env python
import requests

names = [
    u'Éponine',
    u'Cosette',
]

# Retrieve Les Misérables from Project Gutenberg
t = requests.get("http://www.gutenberg.org/files/135/135-0.txt").text

for name in names:
    c = t.count(name)
    print("{}: {}".format(name, c))
Results:
Éponine: 81
Cosette: 1004
I obviously don't have the text you have, so I don't know whether the problem is how it is encoded or how it is being read; I can't test that without the file. In this code, I fetch the source text from the internet. My point is simply that non-ASCII characters should pose no impediment as long as your inputs are reasonable.
Nearly all of the runtime is spent downloading the text; even with dozens of names, the counting itself would not add a noticeable delay on any decent computer. So this method works just fine.
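Two more things worth checking, as assumptions on my part since the questioner's file isn't available: reading a UTF-8 file with encoding='latin1' (as in the question) silently mangles 'È' into two characters, and the accent may also be stored in decomposed form (a plain E followed by a combining grave accent), which renders identically but never matches a composed 'È'. Normalizing both sides handles the latter:
import unicodedata

composed = 'Èponine'          # 'È' as a single code point (U+00C8)
decomposed = 'E\u0300ponine'  # 'E' plus a combining grave accent (U+0300)

print(composed == decomposed)  # False: same rendering, different code points
print(unicodedata.normalize('NFC', decomposed) == composed)  # True after normalizing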

I'm a newb in Python and data mining, and have tokenizer & data type issues

Hi~ I am having a problem while trying to tokenize Facebook comments that are in CSV format. I have my CSV data ready, and I have finished reading the file.
I am using Anaconda3 with Python 3.5. (My CSV data has about 20k rows and 1 column.)
The codes are,
import csv
from nltk import sent_tokenize, word_tokenize

with open('facebook_comments_samsung.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)

print(your_list)
What comes, as a result, is something like this:
[['comment_message'], ['b"Yet again been told a pack of lies by Samsung Customer services who have lost my daughters phone and couldn\'t care less. ANYONE WHO PURCHASES ANYTHING FROM THIS COMPANY NEEDS THEIR HEAD TESTED"'], ["b'You cannot really blame an entire brand worldwide for a problem caused by a branch. It is a problem yes, but address your local problem branch'"], ["b'Haha!! Sorry if they lost your daughters phone but I will always buy Samsung products no matter what.'"], ["b'Salim Gaji BEST REPLIE EVER \\xf0\\x9f\\x98\\x8e'"], ["b'<3 Bewafa zarge <3 \\r\\n\\n \\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x93\\r\\n\\xf0\\x9f\\x8e\\xad\\xf0\\x9f\\x91\\x89 AQIB-BOT.ML \\xf0\\x9f\\x91\\x88\\xf0\\x9f\\x8e\\xadMANUAL\\xe2\\x99\\xaaKing.Bot\\xe2\\x84\\xa2 \\r\\n\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x93\\xe2\\x80\\x94'"], ["b'\\xf0\\x9f\\x8c\\x90 LATIF.ML \\xf0\\x9f\\x8c\\x90'"], ['b"I\'m just waiting here patiently for you guys to say that you\'ll be releasing the s8 and s8+ a week early, for those who pre-ordered. Wishful thinking \\xf0\\x9f\\x98\\x86. Can\'t wait!"'], ['b"That\'s some good positive thinking there sir."'], ["b'(y) #NextIsNow #DoWhatYouCant'"], ["b'looking good'"], ['b"I\'ve always thought that when I first set eyes on my first born that I\'d like it to be on the screen of a cameraphone at arms length rather than eye-to-eye while holding my child. Thank you Samsung for improving our species."'], ["b'cool story'"], ["b'I believe so!'"], ["b'superb'"], ["b'Nice'"], ["b'thanks for the share'"], ["b'awesome'"], ["b'How can I talk to Samsung'"], ["b'Wow'"], ["b'#DoWhatYouCant siempre grandes innovadores Samsung Mobile'"], ["b'I had a problem with my s7 edge when I first got it all fixed now. However when I went to the Samsung shop they were useless and rude they refused to help and said there is nothing they could do no wonder the shop was dead quiet'"], ["b'Zeeshan Khan Masti Khel'"], ["b'I dnt had any problem wd my phn'"], ["b'I have maybe just had a bad phone to start with until it got fixed eventually. I had to go to carphone warehouse they were very helpful'"], ["b'awesome'"], ["b'Ch Shuja Uddin'"], ["b'akhheeerrr'"], ["b'superb'"], ["b'nice story'"], ["b'thanks for the share'"], ["b'superb'"], ["b'thanks for the share'"], ['b"On February 18th 2017 I sent my phone away to with a screen issue. The lower part of the screen was flickering bright white. The phone had zero physical damage to the screen\\n\\nI receive an email from Samsung Quotations with a picture of my SIM tray. Upon phoning I was told my SIM tray was stuck inside the phone and was handed a \\xc2\\xa392.14 repair bill. There is no way that my SIM tray was stuck in the phone as I removed my SIM and memory card before sending the phone away.\\n\\nAfter numerous calls I finally gave in and agreed to pay the \\xc2\\xa392.14 on the understanding that my screen repair would also be covered in this cost. This was confirmed to me by the person on the phone.\\n\\nOn
Sorry for the inconvenience in reading the result. My bad.
To continue, I added:
tokens = [word_tokenize(i) for i in your_list]
for i in tokens:
    print(i)
print(tokens)
This is the part where I get the following error:
C:\Program Files\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text) in line 1278 TypeError: expected string or bytes-like object
What I want to do next is,
import nltk
en = nltk.Text(tokens)
print(len(en.tokens))
print(len(set(en.tokens)))
en.vocab()
en.plot(50)
en.count('galaxy s8')
And finally, I want to draw a word cloud based on the data.
Being aware that every second of your time is precious, I am terribly sorry to ask for your help. I have been working on this for a couple of days and cannot find the right solution to my problem. Thank you for reading.
The error you're getting is because your CSV file is turned into a list of lists, one for each row in the file. The file only contains one column, so each of these inner lists has one element: the string containing the message you want to tokenize. To get past the error, unpack the sublists by using this line instead:
tokens = [word_tokenize(row[0]) for row in your_list]
After that, you'll need to learn some more Python, and how to examine your program and your variables.
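Building on that fix, note that the first row in the sample output is the column header ('comment_message'), which would be tokenized along with the comments. A minimal sketch, assuming the same file, that skips the header while reading:
import csv
from nltk import word_tokenize

with open('facebook_comments_samsung.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skip the 'comment_message' header row
    tokens = [word_tokenize(row[0]) for row in reader]

print(tokens[:2])  # first two tokenized comments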

How to recognize a music sample using Python and Gracenote?

I recently discovered the GNSDK (Gracenote SDK), which seems to provide examples in several programming languages for recognizing music samples by fingerprinting them and then querying Gracenote's audio database for the corresponding artist and song title.
But the documentation is horrible.
How can I, using Python and the GNSDK, recognize an audio sample file? There aren't any examples or tutorials in the provided docs.
Edit: I really want to use the GNSDK with Python. Please don't post anything unrelated; you'll waste your time.
I ended up using ACRCloud, which works very well.
Python example:
from acrcloud.recognizer import ACRCloudRecognizer

config = {
    'host': 'eu-west-1.api.acrcloud.com',
    'access_key': 'access key',
    'access_secret': 'secret key',
    'debug': True,
    'timeout': 10,
}

acrcloud = ACRCloudRecognizer(config)
print(acrcloud.recognize_by_file('sample of a track.wav', 0))
https://github.com/acrcloud/acrcloud_sdk_python
Keywords here are beat spectrum analysis and rhythm detection.
This well-known Python library may contain a solution for your question:
https://github.com/aubio/aubio
I also recommend checking this page for other libraries:
https://wiki.python.org/moin/PythonInMusic
Lastly, this project is a more Python-friendly solution and an easy way to start:
https://github.com/librosa/librosa
Here is an example from librosa that estimates the tempo (beats per minute) of a song:
# Beat tracking example
from __future__ import print_function
import librosa
# 1. Get the file path to the included audio example
filename = librosa.util.example_audio_file()
# 2. Load the audio as a waveform `y`
# Store the sampling rate as `sr`
y, sr = librosa.load(filename)
# 3. Run the default beat tracker
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
print('Estimated tempo: {:.2f} beats per minute'.format(tempo))
# 4. Convert the frame indices of beat events into timestamps
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
print('Saving output to beat_times.csv')
librosa.output.times_csv('beat_times.csv', beat_times)
But I have to mention that this is a very young field in computer science, and new papers come out all the time. So it will also be useful to follow the scholarly literature for recent discoveries.
ADDITION:
Web API Wrappers mentioned in Gracenote's official docs:
https://developer.gracenote.com/web-api#python
For Python:
https://github.com/cweichen/pygn
But as you can see, this wrapper is not well documented and immature. Because of that, I suggest using this Ruby wrapper instead of the Python one.
For Ruby:
https://github.com/JDiPierro/tmsapi
require 'tmsapi'

# Create an instance of the API
tms = TMSAPI::API.new :api_key => 'API_KEY_HERE'

# Get all movie showtimes for Austin, Texas
movie_showings = tms.movies.theatres.showings({ :zip => "78701" })

# Print out the movie name, theatre name, and date/time of the showing.
movie_showings.each do |movie|
  movie.showtimes.each do |showing|
    puts "#{movie.title} is playing at '#{showing.theatre.name}' at #{showing.date_time}."
  end
end
# 12 Years a Slave is playing at 'Violet Crown Cinema' at 2013-12-23T12:45.
# A Christmas Story is playing at 'Alamo Drafthouse at the Ritz' at 2013-12-23T16:00.
# American Hustle is playing at 'Violet Crown Cinema' at 2013-12-23T11:00.
# American Hustle is playing at 'Violet Crown Cinema' at 2013-12-23T13:40.
# American Hustle is playing at 'Violet Crown Cinema' at 2013-12-23T16:20.
# American Hustle is playing at 'Violet Crown Cinema' at 2013-12-23T19:00.
# American Hustle is playing at 'Violet Crown Cinema' at 2013-12-23T21:40.
If you are not comfortable with Ruby or Ruby on Rails, then the only option is developing your own Python wrapper.
Just reading your headline question: since there are no examples or tutorials for the GNSDK, try looking at other options. For one:
dejavu
Audio fingerprinting and recognition algorithm implemented in Python,
see the explanation here:
Dejavu can memorize audio by listening to it once and fingerprinting
it. Then by playing a song and recording microphone input, Dejavu
attempts to match the audio against the fingerprints held in the
database, returning the song being played.
https://github.com/worldveil/dejavu
seems about right.
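For orientation, basic usage adapted from dejavu's README looked roughly like the sketch below; the database settings are placeholders and the module paths have changed between versions, so treat this as an assumption-laden sketch rather than a drop-in:
from dejavu import Dejavu
from dejavu.recognize import FileRecognizer  # path differs in newer releases

# MySQL connection settings (placeholders)
config = {
    "database": {
        "host": "127.0.0.1",
        "user": "root",
        "passwd": "password",
        "db": "dejavu",
    }
}

djv = Dejavu(config)
djv.fingerprint_directory("mp3", [".mp3"])  # index a folder of known songs
song = djv.recognize(FileRecognizer, "sample of a track.wav")  # match a sample
print(song)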

Parsing a txt file in Python where it is hard to split by delimiter

I am new to Python and am wondering if anyone can help me with some file loading.
The situation: I have some text files and I'm trying to do sentiment analysis. Each line of the file is split into three categories: <department>, <user>, <review>.
Here are some sample data:
men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working
I want to turn it into this:
<category> <user> <review>
I have 50k lines of this data.
I have tried loading it directly into numpy, but I get an "empty separator" error. I looked on Stack Overflow, but I couldn't find a situation that applies to a varying number of delimiters: I will never know in advance how many spaces there are in a given line.
My biggest problem is how to handle the varying number of delimiters and assign the fields to columns. Is there a way to split each line into the three categories <department>, <user>, <review>? Bear in mind that the review text can contain arbitrary commas and spaces that I can't control, so the system must be smart enough to handle that.
Any ideas? Is there a way to tell Python that after it has read the user field, everything that follows falls under review?
With data like this, I'd just use split() with the maxsplit argument:
If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).
Example:
from StringIO import StringIO  # Python 2; on Python 3 use io.StringIO

s = StringIO("""men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working""")

for line in s:
    category, user, review = line.split(None, 2)
    print("category: {} - user: {} - review: '{}'".format(category, user, review.strip()))
The output is:
category: men - user: peter123 - review: 'the pants are too tight for my liking!'
category: kids - user: georgel - review: 'i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it'
category: health - user: kksd1 - review: 'the health pills is drowsy by nature, please take care and do not drive after you eat the pills'
category: office - user: ty7d1 - review: 'the printer came on time, the only problem with it is with the duplex function which i suspect its not really working'
For reference:
https://docs.python.org/2/library/stdtypes.html#str.split
What about doing it sorta manually:
data = []
for line in input_data:
    tmp_split = line.split(" ")
    # Get the first part (dept)
    dept = tmp_split[0]
    # Get the second part (user)
    user = tmp_split[1]
    # Everything after is the review; rejoin the pieces with spaces
    review = " ".join(tmp_split[2:])
    data.append([dept, user, review])
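If the end goal is numpy/pandas (the question mentions trying to load into numpy), the same maxsplit idea feeds a DataFrame cleanly. A minimal sketch, with the filename reviews.txt assumed for illustration:
import pandas as pd

rows = []
with open("reviews.txt", encoding="utf-8") as f:  # hypothetical filename
    for line in f:
        parts = line.rstrip("\n").split(None, 2)  # split on whitespace, at most 3 fields
        if len(parts) == 3:
            rows.append(parts)

df = pd.DataFrame(rows, columns=["department", "user", "review"])
print(df.head())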
