Import GoogleNews-vectors-negative300.bin - python

I am working with gensim and having a tough time troubleshooting a ValueError in my code. I was finally able to extract the GoogleNews-vectors-negative300.bin.gz file so I could use it in my model; I also tried gzip, but without success. The error occurs in the last line. What can be done to fix it? Are there any workarounds? Finally, is there a website I could reference?
Thank you respectfully for your assistance!
import gensim
from keras import backend
from keras.layers import Dense, Input, Lambda, LSTM, TimeDistributed
from keras.layers.merge import concatenate
from keras.layers.embeddings import Embedding
from keras.models import Model

pretrained_embeddings_path = "GoogleNews-vectors-negative300.bin"
word2vec = gensim.models.KeyedVectors.load_word2vec_format(pretrained_embeddings_path,
                                                           binary=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-23bd96c1d6ab> in <module>()
      1 pretrained_embeddings_path = "GoogleNews-vectors-negative300.bin"
----> 2 word2vec = gensim.models.KeyedVectors.load_word2vec_format(pretrained_embeddings_path, binary=True)

C:\Users\green\Anaconda3\envs\py35\lib\site-packages\gensim\models\keyedvectors.py in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
    244                     word.append(ch)
    245                 word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
--> 246                 weights = fromstring(fin.read(binary_len), dtype=REAL)
    247                 add_word(word, weights)
    248             else:

ValueError: string size must be a multiple of element size
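This ValueError typically means the .bin file on disk is truncated, for example because the download was interrupted or the gzip extraction did not finish, so gensim hits end-of-file partway through reading a vector. As a rough diagnostic, a minimal sketch (assuming the file is in the working directory; the limit value is arbitrary):

import os
from gensim.models import KeyedVectors

path = "GoogleNews-vectors-negative300.bin"

# The fully decompressed file should be over 3.5 GB; a much smaller size
# suggests an incomplete download or extraction.
print(os.path.getsize(path) / 1e9, "GB on disk")

# Loading only the first N vectors reads just the start of the file,
# so it can succeed even when the end of the file is damaged.
w2v = KeyedVectors.load_word2vec_format(path, binary=True, limit=50000)
print(w2v.vector_size)      # 300
print(w2v['king'].shape)    # (300,)

If the limited load succeeds but the full load does not, re-downloading the archive is usually the fix.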

Edit: The S3 URL has stopped working. You can download the data from Kaggle or use this Google Drive link (be careful when downloading files from Google Drive).
The commands below no longer work.
brew install wget
wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
This downloads the GZIP compressed file that you can uncompress using:
gzip -d GoogleNews-vectors-negative300.bin.gz
You can then use the command below to load the word vectors:
from gensim import models
w = models.KeyedVectors.load_word2vec_format(
    '../GoogleNews-vectors-negative300.bin', binary=True)
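Once the vectors are loaded you can query them directly; a short usage sketch (the words are arbitrary examples):

vec = w['king']                        # 300-dimensional numpy vector
print(vec.shape)                       # (300,)
print(w.similarity('woman', 'man'))    # cosine similarity between two words
print(w.most_similar('king', topn=5))  # nearest neighbours in the vocabulary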

You have to write the complete path.
Use this path:
https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
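For instance, a minimal sketch that builds the complete path before loading (the directory is a placeholder for wherever you extracted the .bin file):

import os
from gensim.models import KeyedVectors

model_dir = os.path.expanduser('~/models')   # adjust to your own location
model_path = os.path.join(model_dir, 'GoogleNews-vectors-negative300.bin')
w2v = KeyedVectors.load_word2vec_format(model_path, binary=True)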

Try this:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
vec_king = wv['king']
Also, visit this link: https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py

Here is what worked for me. I loaded part of the model rather than the entire model, as it's huge.
!pip install wget
import wget
import gzip

url = 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz'
filename = wget.download(url)

# Decompress the downloaded archive into a plain .bin file
with gzip.open('GoogleNews-vectors-negative300.bin.gz', 'rb') as f_in, \
        open('GoogleNews-vectors-negative300.bin', 'wb') as f_out:
    f_out.writelines(f_in)

import gensim
from gensim.models import Word2Vec, KeyedVectors
from sklearn.decomposition import PCA

# limit=100000 loads only the first 100,000 word vectors
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=100000)
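As a rough sense of why the limit helps: each vector is 300 float32 values (1,200 bytes), so 100,000 vectors are on the order of 120 MB, versus roughly 3.6 GB for the full 3 million words. A quick check on what was actually loaded (the .vectors attribute is present in recent gensim versions):

print(model.vectors.shape)         # (100000, 300)
print(model.vectors.nbytes / 1e6)  # ~120 MB of raw vector data
print(model.most_similar('king', topn=3))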

You can use this URL that points to Google Drive's download of the bin.gz file:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g
Alternative mirrors (including the S3 mentioned here) seem to be broken.
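If you prefer to script that Google Drive download rather than use the browser, the third-party gdown package handles Drive's confirmation page; a sketch assuming gdown is installed and the file is still publicly shared:

# pip install gdown
import gdown

file_id = '0B7XkCwpI5KDYNlNUTTlSS21pQmM'
gdown.download('https://drive.google.com/uc?id=' + file_id,
               'GoogleNews-vectors-negative300.bin.gz', quiet=False)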

Related

Why am I getting 'BadZipFile' error when trying to load saved model

import pickle
import streamlit as st
from streamlit_option_menu import option_menu
#loading models
breast_cancer_model = pickle.load(open('C:/Users/Jakub/Desktop/webap/breast_cancer_classification_nn_model.sav', 'rb')) #here is the error #BadZipFile
wine_quality_model = pickle.load(open('wine_nn_model.sav', 'rb')) #BadZipFile
Since it's not a zip file, I tried zipping it and moving it to a different location; nothing I could think of worked.
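One way to narrow this down is to check what the .sav file actually contains before unpickling it, since zip archives and pickle files start with distinctive bytes; a small diagnostic sketch using the path from the question:

import zipfile

path = 'C:/Users/Jakub/Desktop/webap/breast_cancer_classification_nn_model.sav'

with open(path, 'rb') as f:
    head = f.read(4)
print(head)                      # b'PK\x03\x04' would mean a zip archive, b'\x80...' a pickle
print(zipfile.is_zipfile(path))  # True only if the file really is zip-based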

Error loading .h5 model from Google Drive

I am confused. The file exists in the directory; I have checked it with two methods from Python. Why can't I load the model? Is there another method to load the .h5 file? I think this screenshot will explain it all.
Code:
from keras.models import Sequential, load_model
import os.path

model_path = "./drive/MyDrive/1117002_Code Skripsi/Epoch-Train/300-0.0001-train-file.h5"
print(os.path.exists(model_path))
if os.path.isfile(model_path):
    print("File exist")
else:
    print("File not exist")
model = load_model(model_path)
File in the Drive folder:
In response to Experience_In_AI's answer, I made the file look like this:
and this is the structure:
The problem reproduced and solved:
import tensorflow as tf
from tensorflow import keras
from keras.models import load_model

try:
    #model_path = "drive/MyDrive/1117002_Code_Skripsi/Epoch-Train/300-0.001-train-file.h5"
    model_path = r".\drive\MyDrive\1117002_Code_Skripsi\Epoch-Train\300-0.001-train-file.h5"
    model = load_model(model_path)
except:
    model_path = r".\drive\MyDrive\1117002_Code_Skripsi\Epoch-Train\experience_in_ai.h5"
    model = load_model(model_path)
    print("...it seems to be better to use more simple naming with the .h5 file!")

model.summary()
...note that the .h5 files in the simulated location are exact copies, differing only in name.
I think this will work:
model = keras.models.load_model('path/to/location')

pickle data was truncated

I created a corpus and then stored it in a pickle file.
My messages file is a DataFrame containing various news articles.
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re

ps = PorterStemmer()
corpus = []
for i in range(0, len(messages)):
    review = re.sub('[^a-zA-Z]', ' ', messages['text'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
    review = ' '.join(review)
    #print(i)
    corpus.append(review)

import pickle
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
I ran the same code on my laptop (Jupyter notebook) and on Google Colab.
corpus.pkl => Google Colab, downloaded with the following code:
from google.colab import files
files.download('corpus.pkl')
corpus1.pkl => saved from jupyter notebook code.
Now when I run this code:
import pickle
with open('corpus.pkl', 'rb') as f:  # google colab
    corpus = pickle.load(f)
I get the following error:
UnpicklingError: pickle data was truncated
But this works fine:
import pickle
with open('corpus1.pkl', 'rb') as f:  # jupyter notebook saved
    corpus = pickle.load(f)
The only difference between the two is that corpus1.pkl was created and saved through the Jupyter notebook (locally), while corpus.pkl was saved on Google Colab and then downloaded.
Could anybody tell me why this is happening?
For reference:
corpus.pkl => 36 MB
corpus1.pkl => 50.5 MB
For now I use only the pickle file created on my local machine, since that one works properly.
The problem occurs due to a partial download of the glove vectors. I uploaded the data through Colab's upload to session storage and after that simply ran this:
with open('/content/glove_vectors', 'rb') as f:
    model = pickle.load(f)
glove_words = set(model.keys())
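Given that the Colab copy was 36 MB while the locally saved one was 50.5 MB, the downloaded pickle was clearly cut short. A simple guard is to compare the file size against the size of the complete file before unpickling; a sketch using the numbers reported above:

import os
import pickle

expected_bytes = 50_500_000                  # size of the complete corpus1.pkl
actual_bytes = os.path.getsize('corpus.pkl')
print(actual_bytes, 'bytes on disk')

if actual_bytes < expected_bytes:
    print('File looks truncated - download it again before unpickling')
else:
    with open('corpus.pkl', 'rb') as f:
        corpus = pickle.load(f)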

How to access/use Google's pre-trained Word2Vec model without manually downloading the model?

I want to analyse some text on a Google Compute server on Google Cloud Platform (GCP) using the Word2Vec model.
However, the un-compressed word2vec model from https://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/ is over 3.5GB and it will take time to download it manually and upload it to a cloud instance.
Is there any way to access this (or any other) pre-trained Word2Vec model on a Google Compute server without uploading it myself?
You can also use Gensim to download them through the downloader api:
import gensim.downloader as api
path = api.load("word2vec-google-news-300", return_path=True)
print(path)
or from the command line:
python -m gensim.downloader --download <dataname> # same as api.load(dataname, return_path=True)
For a list of available datasets, check: https://github.com/RaRe-Technologies/gensim-data
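You can also list what the downloader offers from Python instead of browsing the GitHub page; a short sketch using the same api module:

import gensim.downloader as api

info = api.info()                            # metadata for all corpora and models
print(sorted(info['models']))                # 'word2vec-google-news-300' is among them
print(api.info('word2vec-google-news-300'))  # metadata for that single dataset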
As an alternative to manually downloading things, you can use the pre-packaged version (third-party, not from Google) from a Kaggle dataset.
First sign up for Kaggle and get the credentials https://github.com/Kaggle/kaggle-api#api-credentials
Then, do this on the command line:
pip3 install kaggle
mkdir -p /content/.kaggle/
echo '{"username":"****","key":"****"}' > $HOME/.kaggle/kaggle.json
chmod 600 /root/.kaggle/kaggle.json
kaggle datasets download alvations/vegetables-google-word2vec
unzip $HOME/content/vegetables-google-word2vec.zip
Finally, in Python:
import pickle
import numpy as np
import os

home = os.environ["HOME"]
embeddings = np.load(os.path.join(home, 'content/word2vec.news.negative-sample.300d.npy'))
with open(os.path.join(home, 'content/word2vec.news.negative-sample.300d.txt')) as fp:
    tokens = [line.strip() for line in fp]
embeddings[tokens.index('hello')]
Full example on Colab: https://colab.research.google.com/drive/178WunB1413VE2SHe5d5gc0pqAd5v6Cpl
P/S: To access other pre-packed word embeddings, see https://github.com/alvations/vegetables
The following code will do the job on Colab (or any other Jupyter notebook) in about 10 sec:
result = !wget --save-cookies cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p'
code = result[-1]
arg =' --load-cookies cookies.txt "https://docs.google.com/uc?export=download&confirm=%s&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM" -O GoogleNews-vectors-negative300.bin.gz' % code
!wget $arg
If you need it inside a Python script, replace the wget calls with the requests library:
import requests
import re
import shutil
url1 = 'https://docs.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM'
resp = requests.get(url1)
code = re.findall('.*confirm=([0-9A-Za-z_]+).*', str(resp.content))
url2 = "https://docs.google.com/uc?export=download&confirm=%s&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM" % code[0]
with requests.get(url2, stream=True, cookies=resp.cookies) as r:
    with open('GoogleNews-vectors-negative300.bin.gz', 'wb') as f:
        shutil.copyfileobj(r.raw, f)
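Either way you end up with the .gz archive; decompressing it from Python needs only the standard library, for example:

import gzip
import shutil

# Stream-decompress the archive into the plain .bin file gensim expects
with gzip.open('GoogleNews-vectors-negative300.bin.gz', 'rb') as f_in, \
        open('GoogleNews-vectors-negative300.bin', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

(Recent gensim versions can usually read the .bin.gz file directly as well, so this step may be optional.)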

How to parse, edit and generate object_detection/pipeline.config files using Google Protobuf

I'm training multiple models in a common ensemble learning paradigm. Currently I'm working with a few detectors, and each time I train I have to edit the config file of each detector. This obviously causes confusion, and a few times I started training with the wrong config files.
As a solution, I'm trying to build an editor for the Google Object Detection API config files. The config files use Google Protocol Buffers.
Link to the files I use: pipeline.proto, object_detection/protos, example .config file
I've tried the following code:
from object_detection.protos import input_reader_pb2
with open('/models/research/object_detection/samples/configs/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync.config', 'rb') as f:
    config = f.read()

read = input_reader_pb2.InputReader().ParseFromString(config)
And I get the following error:
exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-19-8043e6bb108f>", line 1, in <module>
    input_reader_pb2.InputReader().ParseFromString(txt)
google.protobuf.message.DecodeError: Error parsing message
What am I missing here? What is the appropriate way to parse and edit the config file?
Thanks,
Hod
Using the following code I was able to parse a config file:
import tensorflow as tf
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2


def get_configs_from_pipeline_file(pipeline_config_path, config_override=None):
    '''
    Read .config and convert it to a proto buffer object.
    '''
    pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
    with tf.gfile.GFile(pipeline_config_path, "r") as f:
        proto_str = f.read()
        text_format.Merge(proto_str, pipeline_config)
    if config_override:
        text_format.Merge(config_override, pipeline_config)
    #print(pipeline_config)
    return pipeline_config


def create_configs_from_pipeline_proto(pipeline_config):
    '''
    Returns the configurations as a dictionary.
    '''
    configs = {}
    configs["model"] = pipeline_config.model
    configs["train_config"] = pipeline_config.train_config
    configs["train_input_config"] = pipeline_config.train_input_reader
    configs["eval_config"] = pipeline_config.eval_config
    configs["eval_input_configs"] = pipeline_config.eval_input_reader
    # Keep eval_input_config only for backwards compatibility. All clients should
    # read eval_input_configs instead.
    if configs["eval_input_configs"]:
        configs["eval_input_config"] = configs["eval_input_configs"][0]
    if pipeline_config.HasField("graph_rewriter"):
        configs["graph_rewriter_config"] = pipeline_config.graph_rewriter
    return configs


configs = get_configs_from_pipeline_file('faster_rcnn_resnet101_pets.config')
config_as_dict = create_configs_from_pipeline_proto(configs)
Referred from here.
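Once the config has been parsed into a TrainEvalPipelineConfig message, you can also edit its fields and serialize it back to text, which covers the "edit and generate" part of the question. A sketch, assuming pipeline_config is the message returned by get_configs_from_pipeline_file above and the values are placeholders:

from google.protobuf import text_format

pipeline_config.train_config.batch_size = 8
pipeline_config.train_input_reader.tf_record_input_reader.input_path[0] = 'path/to/train.record'

# Write the edited message back out in the .config text format
with open('edited_pipeline.config', 'w') as f:
    f.write(text_format.MessageToString(pipeline_config))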
Since you have the object_detection API installed, you can just do the following:
from object_detection.utils import config_util
pipeline_config = config_util.get_configs_from_pipeline_file('/path/to/config/file')
This is what I've found to be a useful approach to overriding the object detection pipeline.config:
from object_detection.utils import config_util
from object_detection import model_lib_v2

PIPELINE_CONFIG_PATH = 'path_to_your_pipeline.config'

# Load the pipeline config as a dictionary
pipeline_config_dict = config_util.get_configs_from_pipeline_file(PIPELINE_CONFIG_PATH)

# OVERRIDE EXAMPLES
# Example 1: Override the train tfrecord path
pipeline_config_dict['train_input_config'].tf_record_input_reader.input_path[0] = 'your/override/path/to/train.record'
# Example 2: Override the eval tfrecord path
pipeline_config_dict['eval_input_config'].tf_record_input_reader.input_path[0] = 'your/override/path/to/test.record'

# Convert the pipeline dict back to a protobuf object
pipeline_config = config_util.create_pipeline_proto_from_configs(pipeline_config_dict)

# EXAMPLE USAGE:
# Example 1: Run the object detection train loop with your overrides (has to be a string representation)
model_lib_v2.train_loop(config_override=str(pipeline_config))
# Example 2: Save the pipeline config to disk
config_util.save_pipeline_config(pipeline_config, 'path/to/save/new/pipeline.config')
