pickle data was truncated - python

I created a corpus and stored it in a pickle file.
My messages file is a DataFrame containing different news articles.
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re
ps = PorterStemmer()
corpus = []
for i in range(0, len(messages)):
    review = re.sub('[^a-zA-Z]', ' ', messages['text'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stopwords.words('english')]
    review = ' '.join(review)
    # print(i)
    corpus.append(review)

import pickle
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
I ran the same code on my laptop (Jupyter notebook) and on Google Colab.
corpus.pkl => created on Google Colab and downloaded with the following code:
from google.colab import files
files.download('corpus.pkl')
corpus1.pkl => saved from jupyter notebook code.
Now, when I run this code:
import pickle
with open('corpus.pkl', 'rb') as f:  # google colab
    corpus = pickle.load(f)
I get the following error:
UnpicklingError: pickle data was truncated
But this works fine:
import pickle
with open('corpus1.pkl', 'rb') as f:  # jupyter notebook saved
    corpus = pickle.load(f)
The only difference between the two is that corpus1.pkl was created and saved through the Jupyter notebook (locally), while corpus.pkl was saved on Google Colab and then downloaded.
Could anybody tell me why this is happening?
For reference:
corpus.pkl => 36 MB
corpus1.pkl => 50.5 MB

I would use only the pickle file created on my local machine; that one works properly.
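If you do need the Colab-generated file, a quick sanity check is to compare sizes before and after the download (a minimal sketch, not from the original answer): the size reported inside Colab should match the size of the file that lands on your machine, otherwise the download was truncated.
import os
import pickle

with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)

# Size in bytes of the file Colab actually wrote; if the copy fetched via
# files.download() is smaller than this, the download was cut short.
print(os.path.getsize('corpus.pkl'))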

The problem occurs due to a partial download of the GloVe vectors. I uploaded the data through Colab's upload to session storage and after that simply ran:
with open('/content/glove_vectors', 'rb') as f:
    model = pickle.load(f)
glove_words = set(model.keys())

Related

Why am I getting 'BadZipFile' error when trying to load saved model

import pickle
import streamlit as st
from streamlit_option_menu import option_menu
# loading models
breast_cancer_model = pickle.load(open('C:/Users/Jakub/Desktop/webap/breast_cancer_classification_nn_model.sav', 'rb'))  # here is the error: BadZipFile
wine_quality_model = pickle.load(open('wine_nn_model.sav', 'rb'))  # BadZipFile
Since it's not a zip file, I tried zipping it and moving it to a different location; nothing I could think of worked.

NER using spaCy & Transformers - different result when running inside and outside of a loop

I am using NER (spaCy & Transformers) for finding and anonymizing personal information. I noticed that the output I get when giving an input line directly is different from the output when the input line is read from a file (see screenshot below). Does anyone have suggestions on how to fix this?
Here is my code:
import pandas as pd
import csv
import spacy
from spacy import displacy
from transformers import pipeline
import re
!python -m spacy download en_core_web_trf
nlp = spacy.load('en_core_web_trf')
sent = nlp('Yesterday I went out with Andrew, johanna and Jonathan Sparow.')
displacy.render(sent, style = 'ent')
with open('Synth_dataset_raw.txt', 'r') as fd:
    reader = csv.reader(fd)
    for row in reader:
        sent = nlp(str(row))
        displacy.render(sent, style = 'ent')
You are using the csv module to read your file and then trying to convert each row (aka line) of the file to a string with str(row). Since csv.reader splits each line on commas and returns it as a list of fields, str(row) wraps the text in brackets and quotes, and that extra punctuation changes what the NER model sees.
If your file just has one sentence per line, then you do not need the csv module at all. You could just do
with open('Synth_dataset_raw.txt', 'r') as fd:
    for line in fd:
        # Remove the trailing newline
        line = line.rstrip()
        sent = nlp(line)
        displacy.render(sent, style = 'ent')
If you in fact have a CSV (with presumably multiple columns and a header), you could do
with open('Synth_dataset_raw.txt', 'r') as fd:
    reader = csv.reader(fd)
    header = next(reader)
    text_column_index = 0
    for row in reader:
        sent = nlp(row[text_column_index])
        displacy.render(sent, style = 'ent')

How to access/use Google's pre-trained Word2Vec model without manually downloading the model?

I want to analyse some text on a Google Compute server on Google Cloud Platform (GCP) using the Word2Vec model.
However, the uncompressed word2vec model from https://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/ is over 3.5 GB, and it will take time to download it manually and upload it to a cloud instance.
Is there any way to access this (or any other) pre-trained Word2Vec model on a Google Compute server without uploading it myself?
You can also use Gensim to download the model through its downloader API:
import gensim.downloader as api
path = api.load("word2vec-google-news-300", return_path=True)
print(path)
or from the command line:
python -m gensim.downloader --download <dataname> # same as api.load(dataname, return_path=True)
For a list of available datasets, check: https://github.com/RaRe-Technologies/gensim-data
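To actually use the file at the returned path, it can be loaded with gensim's KeyedVectors (a minimal sketch, assuming the word2vec-google-news-300 download completed and that the cached file is in the binary word2vec format, as this dataset is distributed):
import gensim.downloader as api
from gensim.models import KeyedVectors

# Option 1: let gensim return the vectors directly
wv = api.load("word2vec-google-news-300")

# Option 2: download only, then load the cached file yourself
path = api.load("word2vec-google-news-300", return_path=True)
wv = KeyedVectors.load_word2vec_format(path, binary=True)

print(wv.most_similar('king', topn=3))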
As an alternative to manually downloading it, you can use the pre-packaged version (third-party, not from Google) from a Kaggle dataset.
First sign up for Kaggle and get the credentials: https://github.com/Kaggle/kaggle-api#api-credentials
Then, do this on the command line:
pip3 install kaggle
mkdir -p $HOME/.kaggle/
echo '{"username":"****","key":"****"}' > $HOME/.kaggle/kaggle.json
chmod 600 /root/.kaggle/kaggle.json
kaggle datasets download alvations/vegetables-google-word2vec
unzip $HOME/content/vegetables-google-word2vec.zip
Finally, in Python:
import pickle
import numpy as np
import os
home = os.environ["HOME"]
embeddings = np.load(os.path.join(home, 'content/word2vec.news.negative-sample.300d.npy'))
with open(os.path.join(home, 'content/word2vec.news.negative-sample.300d.txt')) as fp:
    tokens = [line.strip() for line in fp]
embeddings[tokens.index('hello')]
Full example on Colab: https://colab.research.google.com/drive/178WunB1413VE2SHe5d5gc0pqAd5v6Cpl
P/S: To access other pre-packed word embeddings, see https://github.com/alvations/vegetables
The following code will do the job on Colab (or any other Jupyter notebook) in about 10 sec:
result = !wget --save-cookies cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p'
code = result[-1]
arg =' --load-cookies cookies.txt "https://docs.google.com/uc?export=download&confirm=%s&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM" -O GoogleNews-vectors-negative300.bin.gz' % code
!wget $arg
If you need it inside a Python script, replace the wget calls with the requests library:
import requests
import re
import shutil
url1 = 'https://docs.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM'
resp = requests.get(url1)
code = re.findall('.*confirm=([0-9A-Za-z_]+).*', str(resp.content))
url2 = "https://docs.google.com/uc?export=download&confirm=%s&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM" % code[0]
with requests.get(url2, stream=True, cookies=resp.cookies) as r:
    with open('GoogleNews-vectors-negative300.bin.gz', 'wb') as f:
        shutil.copyfileobj(r.raw, f)

In Google Colab, I created a file with Pickle Library but I can't find it in Google Drive

#Saving the best model with Pickle (Neural %83.43)
import pickle
pickle.dump(classifier, open("NeuralNews", 'wb'))
loading = pickle.load(open("NeuralNews", 'rb'))
predictionPickleNeural = loading.predict(testResult2)
predictionPickleNeural = (predictionPickleNeural > 0.5)
acScorePickleNeural = accuracy_score(lb.fit_transform(testDataForComparison), predictionPickleNeural)
print("Accuracy Pickle Neural : " + str(acScorePickleNeural))
I can't find the 'NeuralNews' file that I created in Google Drive.
Is there a way to find out where it is?
It's inside the current directory of the Google Cloud VM. You can try:
import os
os.listdir('.')
If you get some output like
['.config', 'sample_data']
then you can list a folder's contents by issuing a command like
!ls sample_data
to look inside the sample_data folder. In any case, you can also save the file to your Google Drive or download it to your local machine.
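For the Drive option, one approach is to mount Drive in the Colab session and copy the file over (a sketch, not from the original answer; it assumes the default /content/drive mount point and the standard MyDrive folder):
import shutil
from google.colab import drive

# Mount Google Drive into the Colab filesystem (asks for authorization)
drive.mount('/content/drive')

# Copy the pickled model from the VM's working directory into Drive
shutil.copy('NeuralNews', '/content/drive/MyDrive/NeuralNews')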

Import GoogleNews-vectors-negative300.bin

I am working on code using gensim and having a tough time troubleshooting a ValueError in my code. I was finally able to unzip the GoogleNews-vectors-negative300.bin.gz file so I could use it in my model; I also tried gzip, but the results were unsuccessful. The error occurs on the last line of the code. I would like to know what can be done to fix the error. Are there any workarounds? Finally, is there a website that I could reference?
Thank you respectfully for your assistance!
import gensim
from keras import backend
from keras.layers import Dense, Input, Lambda, LSTM, TimeDistributed
from keras.layers.merge import concatenate
from keras.layers.embeddings import Embedding
from keras.models import Model
pretrained_embeddings_path = "GoogleNews-vectors-negative300.bin"
word2vec = gensim.models.KeyedVectors.load_word2vec_format(pretrained_embeddings_path, binary=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-23bd96c1d6ab> in <module>()
      1 pretrained_embeddings_path = "GoogleNews-vectors-negative300.bin"
----> 2 word2vec = gensim.models.KeyedVectors.load_word2vec_format(pretrained_embeddings_path, binary=True)

C:\Users\green\Anaconda3\envs\py35\lib\site-packages\gensim\models\keyedvectors.py in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
    244                 word.append(ch)
    245             word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
--> 246             weights = fromstring(fin.read(binary_len), dtype=REAL)
    247             add_word(word, weights)
    248         else:
ValueError: string size must be a multiple of element size
Edit: The S3 URL has stopped working. You can download the data from Kaggle or use this Google Drive link (be careful when downloading files from Google Drive).
The commands below no longer work.
brew install wget
wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
This downloads the GZIP compressed file that you can uncompress using:
gzip -d GoogleNews-vectors-negative300.bin.gz
You can then use the command below to load the word vectors:
from gensim import models
w = models.KeyedVectors.load_word2vec_format(
    '../GoogleNews-vectors-negative300.bin', binary=True)
You have to write the complete path.
Use this path:
https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Try this:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
vec_king = wv['king']
Also, visit this link: https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py
Here is what worked for me. I loaded only part of the model, not the entire model, as it's huge.
!pip install wget
import wget
import gzip

url = 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz'
filename = wget.download(url)

# Decompress the downloaded archive to get the raw .bin file
f_in = gzip.open('GoogleNews-vectors-negative300.bin.gz', 'rb')
f_out = open('GoogleNews-vectors-negative300.bin', 'wb')
f_out.writelines(f_in)
import gensim
from gensim.models import Word2Vec, KeyedVectors
from sklearn.decomposition import PCA
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=100000)
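Once loaded, the KeyedVectors object can be queried directly (a quick check, assuming the queried word falls within the 100,000 most frequent entries kept by limit):
# Vector for a single word: a 300-dimensional numpy array
vec = model['king']
print(vec.shape)

# Nearest neighbours by cosine similarity
print(model.most_similar('king', topn=5))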
You can use this URL that points to Google Drive's download of the bin.gz file:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g
Alternative mirrors (including the S3 mentioned here) seem to be broken.
