Generate 2d images of molecules from PubChem FTP data - python

Rather than crawl PubChem's website, I'd prefer to be nice and generate the images locally from the PubChem ftp site:
ftp://ftp.ncbi.nih.gov/pubchem/specifications/
The only problem is that I'm limited to OS X and Linux, and I can't seem to find a way to programmatically generate the 2D images they have on their site. See this example:
https://pubchem.ncbi.nlm.nih.gov/compound/6#section=Top
Under the heading "2D Structure" we have this image here:
https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=6&t=l
That is what I'm trying to generate.

If you want something that works out of the box, I would suggest molconvert from ChemAxon's Marvin (https://www.chemaxon.com/products/marvin/), which is free for academics. It is easy to use from the command line and supports plenty of input and output formats. So for your example it would be:
molconvert "png" -s "C1=CC(=C(C=C1[N+](=O)[O-])[N+](=O)[O-])Cl" -o cdnb.png
Resulting in the following image:
It also allows you to set parameters such as width, height, quality, background color and so on.
However, if you are a programmer I would definitely recommend RDKit. Below is some code that generates images for a pair of compounds given as SMILES.
from rdkit import Chem
from rdkit.Chem import Draw

ms_smis = [["C1=CC(=C(C=C1[N+](=O)[O-])[N+](=O)[O-])Cl", "cdnb"],
           ["C1=CC(=CC(=C1)N)C(=O)N", "3aminobenzamide"]]
ms = [[Chem.MolFromSmiles(x[0]), x[1]] for x in ms_smis]

for m in ms:
    Draw.MolToFile(m[0], m[1] + ".svg", size=(800, 800))
This gives you the following images:
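If you want PNG output instead of SVG (to match the PubChem-style image in the question), RDKit can do that too. A minimal sketch, assuming pillow is installed for raster output:
from rdkit import Chem
from rdkit.Chem import Draw

mol = Chem.MolFromSmiles("C1=CC(=C(C=C1[N+](=O)[O-])[N+](=O)[O-])Cl")
# MolToFile picks the output format from the file extension
Draw.MolToFile(mol, "cdnb.png", size=(300, 300))
# Or get a PIL image object you can manipulate before saving
img = Draw.MolToImage(mol, size=(300, 300))
img.save("cdnb_pil.png")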

So I also emailed the PubChem guys and they got back to me very quickly with this response:
The only bulk access we have to images is through the download
service: https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi
You can request up to 50,000 images at a time.
Which is better than I was expecting, but still not amazing since it requires downloading things that I in theory could generate locally. So I'm leaving this question open until some kind soul writes an open source library to do the same.
Edit:
I figure I might as well save people some time if they are doing the same thing as I am. I've created a Ruby gem backed by Mechanize to automate the downloading of images. Please be kind to their servers and only download what you need.
https://github.com/zachaysan/pubchem
gem install pubchem
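If you'd rather stay in Python, a minimal sketch of the same download step using requests and the imgsrv URL pattern from the question above (again, please be gentle with their servers and only fetch what you need):
import requests

def download_pubchem_image(cid, path):
    # URL pattern taken from the question above
    url = "https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi"
    resp = requests.get(url, params={"cid": cid, "t": "l"}, timeout=30)
    resp.raise_for_status()
    with open(path, "wb") as f:
        f.write(resp.content)

download_pubchem_image(6, "cid6.png")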

An open source option is the Indigo Toolkit, which also has pre-compiled packages for Linux, Windows, and macOS, plus language bindings for Python, Java, .NET, and C. I chose the 1.4.0 beta.
I had a similar interest to yours in converting SMILES to 2D structures, and I adapted my Python code to address your question and to capture timing information. It uses the PubChem FTP (Compound/Extras) download of CID-SMILES.gz. The following script is an implementation of a local SMILES-to-2D-structure converter that reads a range of rows from the PubChem CID-SMILES file of isomeric SMILES (which contains over 102 million compound records) and converts the SMILES to PNG images of the 2D structures. In three tests of 1000 SMILES-to-structure conversions each, it took 35, 50, and 60 seconds at file row offsets of 0, 100,000, and 10,000,000 on my Windows 10 laptop (Intel i7-7500U CPU, 2.70GHz) with a solid state drive and running Python 3.7.4. The 3000 files totaled 100 MB in size.
from indigo import *
from indigo.renderer import *
import subprocess
import datetime

def timerstart():
    # start timer and print time, return start time
    start = datetime.datetime.now()
    print("Start time =", start)
    return start

def timerstop(start):
    # end timer and print time and elapsed time, return elapsed time
    endtime = datetime.datetime.now()
    elapsed = endtime - start
    print("End time =", endtime)
    print("Elapsed time =", elapsed)
    return elapsed

numrecs = 1000
recoffset = 0  # 10000000 # record offset

starttime = timerstart()

indigo = Indigo()
renderer = IndigoRenderer(indigo)

# set render options
indigo.setOption("render-atom-color-property", "color")
indigo.setOption("render-coloring", True)
indigo.setOption("render-comment-position", "bottom")
indigo.setOption("render-comment-offset", "20")
indigo.setOption("render-background-color", 1.0, 1.0, 1.0)
indigo.setOption("render-output-format", "png")

# set data path (including data file) and output file path
datapath = r'../Download/CID-SMILES'
pngpath = r'./2D/'

# read subset of rows from data file
mycmd = "head -" + str(recoffset+numrecs) + " " + datapath + " | tail -" + str(numrecs)
print(mycmd)
(out, err) = subprocess.Popen(mycmd, stdout=subprocess.PIPE, shell=True).communicate()
lines = str(out.decode("utf-8")).split("\n")

count = 0
for line in lines:
    try:
        cols = line.split("\t")  # split on tab
        key = cols[0]            # cid in cols[0]
        smiles = cols[1]         # smiles in cols[1]
        mol = indigo.loadMolecule(smiles)
        s = "CID=" + key
        indigo.setOption("render-comment", s)
        #indigo.setOption("render-image-size", 200, 250)
        #indigo.setOption("render-image-size", 400, 500)
        renderer.renderToFile(mol, pngpath + key + ".png")
        count += 1
    except:
        print("Error processing line after", str(count), ":", line)
        pass

elapsedtime = timerstop(starttime)
print("Converted", str(count), "SMILES to PNG")

Related

How to get only 30 seconds of a song using python

When I play a song in Python, I want it to play only the first 30 seconds, no matter how long the song is. I want this to be done without creating a new mp3 file. I have pydub, but what it actually does is create a new file with the 30-second cut, which takes too long to load on my Django page.
from pydub import AudioSegment

startMin = 9
startSec = 50
endMin = 13
endSec = 30
# Time to milliseconds
startTime = startMin*60*1000 + startSec*1000
endTime = endMin*60*1000 + endSec*1000
# Opening file and extracting segment
song = AudioSegment.from_mp3(item_media.file.path)
extract = song[startTime:endTime]
extract.export('Zoo' + '-extract.mp3', format="mp3")
I want this to be done without creating a new mp3 file.
Unless you're willing for the client to have the full file, and then manipulate e.g. an <audio> element to only play a part of it, that's not easily possible. (Technically, for an MP3 file, you might be able to seek to the MP3 frame for a given time offset and write MP3 frames until you've found the end offset, and browsers should be able to play a file mangled like that, but it's probably not worth the technical effort.)
If the issue is only that it takes too long for the extraction, only do it once for a given file and cache the result:
import os
import tempfile
import hashlib

from pydub import AudioSegment
from django.http import FileResponse

def get_extract_mp3(original_path, startMin, startSec, endMin, endSec):
    # Time to milliseconds
    startTime = int(startMin * 60 * 1000 + startSec * 1000)
    endTime = int(endMin * 60 * 1000 + endSec * 1000)
    original_path_hash = hashlib.sha256(original_path.encode("utf-8")).hexdigest()
    cache_file = os.path.join(
        tempfile.gettempdir(), f"extract-{original_path_hash}-{startTime}-{endTime}.mp3"
    )
    if not os.path.isfile(cache_file):  # No cache file yet
        # Opening file and extracting segment
        song = AudioSegment.from_mp3(original_path)
        extract = song[startTime:endTime]
        # Write the extract to a temporary file...
        tmp_cache_file = cache_file + ".tmp"
        extract.export(tmp_cache_file, format="mp3")
        # ... and replace it into place
        os.replace(tmp_cache_file, cache_file)
    return cache_file

...

def my_view(...):
    extract_path = get_extract_mp3(item_media.file.path, 9, 50, 13, 50)
    return FileResponse(open(extract_path, 'rb'))
This way only the first client to request an extract for a given file and a given time will have to wait for it to happen.
You may wish to put the cache files somewhere other than gettempdir(), maybe a subdirectory thereof, and you will want to make sure to clean it every once in a while too (e.g. delete the oldest files).
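For the cleanup step, a minimal sketch of the "delete the oldest files" idea, assuming the cache files stay in gettempdir() with the extract-*.mp3 naming scheme used above:
import os
import glob
import tempfile

def prune_extract_cache(max_files=100):
    # Collect the cached extracts, oldest first
    pattern = os.path.join(tempfile.gettempdir(), "extract-*.mp3")
    cached = sorted(glob.glob(pattern), key=os.path.getmtime)
    # Delete the oldest files until at most max_files remain
    excess = len(cached) - max_files
    for path in cached[:max(excess, 0)]:
        os.remove(path)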

MoviePy write_videofile taking hours

I am trying to concatenate every clip in a directory as well as add an intro and outro. In the future I will also be adding editing functions such as zooms and rotations, which is why I am not directly calling ffmpeg, but instead using MoviePy.
Everything runs smoothly, until final_vid.write_videofile(). It first renders the audio at a fairly good speed.
chunk: 55%|█████▍ | 8721/15977 [00:14<00:10, 672.54it/s, now=None]
When it then tries to render video, the speed slows down massively, to an expected rendering time of 72 hours. This is running on a ryzen 2600 with 16 gigs of ram, so I doubt the hardware is the bottleneck.
t: 0%| | 5/43473 [00:28<72:17:35, 5.99s/it, now=None]
I have tried this with different codecs, fps settings, the logger turned off, and multiple other settings. How would I go about speeding this up, since this surely cannot be the maximum speed of MoviePy?
Full code below:
import os
from datetime import datetime
from moviepy.editor import VideoFileClip, concatenate_videoclips

def edit(game):
    intro = VideoFileClip("intro.mp4")
    final_vid = intro
    game = game.replace(" ", "")
    game_treated = game.replace(" ", "%20")
    for clip_name in os.listdir("current_vids"):
        new_clip = VideoFileClip(os.path.join("current_vids", clip_name), target_resolution=(1920, 1080))
        final_vid = concatenate_videoclips(clips=[final_vid, new_clip], method="compose")
    outro = VideoFileClip("outro.mp4")
    final_vid = concatenate_videoclips(clips=(final_vid, outro), method="compose")
    final_vid.write_videofile(game + datetime.today().strftime("%Y-%m-%d") + ".mp4")
    for clip_name in os.listdir("current_vids"):
        os.remove(os.path.join("current_vids", clip_name))
    return game_treated + datetime.today().strftime("%Y-%m-%d") + ".mp4"

Split audio on timestamps librosa

I have an audio file and I want to split it every 2 seconds. Is there a way to do this with librosa?
So if I had a 60 seconds file, I would split it into 30 two second files.
librosa is first and foremost a library for audio analysis, not audio synthesis or processing. It does offer basic support for writing audio files (see here), but the documentation also states:
This function is deprecated in librosa 0.7.0. It will be removed in 0.8. Usage of write_wav should be replaced by soundfile.write.
Given this information, I'd rather use a tool like sox to split audio files.
From "Split mp3 file to TIME sec each using SoX":
You can run SoX like this:
sox file_in.mp3 file_out.mp3 trim 0 2 : newfile : restart
It will create a series of files with a 2-second chunk of the audio each.
If you'd rather stay within Python, you might want to use pysox for the job.
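And if you do go the pysox route, a minimal sketch of the same 2-second trim idea (this assumes the SoX binary is installed, that the input is file_in.mp3 as above, and that one output file per chunk is acceptable):
import math
import sox

in_file = "file_in.mp3"
chunk_seconds = 2
total = sox.file_info.duration(in_file)

for i in range(math.ceil(total / chunk_seconds)):
    start = i * chunk_seconds
    end = min(start + chunk_seconds, total)
    tfm = sox.Transformer()
    tfm.trim(start, end)  # keep only the [start, end) window
    tfm.build(in_file, "file_out{:03d}.mp3".format(i))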
You can split your file using librosa by running the following code. I have added the necessary comments so that you understand the steps carried out.
import librosa

# First load the file
audio, sr = librosa.load(file_name)

# Get number of samples for 2 seconds; replace 2 by any number
buffer = 2 * sr

samples_total = len(audio)
samples_wrote = 0
counter = 1

while samples_wrote < samples_total:
    # check if the buffer is not exceeding total samples
    if buffer > (samples_total - samples_wrote):
        buffer = samples_total - samples_wrote
    block = audio[samples_wrote : (samples_wrote + buffer)]
    out_filename = "split_" + str(counter) + "_" + file_name
    # Write 2 second segment
    librosa.output.write_wav(out_filename, block, sr)
    counter += 1
    samples_wrote += buffer
[Update]
librosa.output.write_wav() has been removed from librosa, so now we have to use soundfile.write()
Import the required library:
import soundfile as sf
and replace
librosa.output.write_wav(out_filename, block, sr)
with
sf.write(out_filename, block, sr)

MODIS AQUA Data - Stacking / Mosaic data with python GDAL

I know how to access and plot subdatasets using gdal and python. However, I'm wondering if there's a way to use the GEO data contained in the HDF4 file so I could look at the same area over many years.
And if possible, can an area be cut out of the data and how?
UPDATE:
To be more specific: I plotted MODIS data, and as you can see below the river moves downwards (rectangular structure, top left corner). So over a whole year it's not the same location that I'm observing.
There's a directory in the subdatasets called Geolocation Fields with Long and Alt directories. So is it possible to access this information, or lay it over the data, to cut out a specific area?
If we, for example, take a look at the NASA picture below, would it be possible to cut it between 10-15 alt. and -5 to 0 long.?
You can download a sample file by copying the url below:
https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/6/MYD021KM/2009/034/MYD021KM.A2009034.1345.006.2012058160107.hdf
UPDATE:
I ran
x0, dx, dxdy, y0, dydx, dy = hdf_file.GetGeoTransform()
which gave me the following output:
x0: 0.0
dx: 1.0
dxdy: 0.0
y0: 0.0
dydx: 0.0
dy: 1.0
As well as
gdal.Warp(workdir2+"/output.tif",workdir1+"/MYD021KM.A2009002.1345.006.2012058153105.hdf")
which gave me the following error:
ERROR 1: Input file /Volumes/Transcend/Master_Thesis/Data/AQUA_002_1345/MYD021KM.A2009002.1345.006.2012058153105.hdf has no raster bands.
UPDATE 2:
Here's my code for opening and reading my hdf files:
all_files is a list containing file names like:
MYD021KM.A2008002.1345.006.2012066153213.hdf
MYD021KM.A2008018.1345.006.2012066183305.hdf
MYD021KM.A2008034.1345.006.2012067035823.hdf
MYD021KM.A2008050.1345.006.2012067084421.hdf
etc .....
for fe in all_files:
    print "\nopening file: ", fe
    try:
        hdf_file = gdal.Open(workdir1 + "/" + fe)
        print "getting subdatasets..."
        subDatasets = hdf_file.GetSubDatasets()
        Emissiv_Bands = gdal.Open(subDatasets[2][0])
        print "getting bands..."
        Bands = Emissiv_Bands.ReadAsArray()
        print "unit conversion ... "
        get_name_tag = re.findall(".A(\d{7}).", all_files[i])[0]
        print "name tag of current file: ", get_name_tag

        # Code for 1 Band:
        L_B_1 = radiance_scales[specific_band] * (Bands[specific_band] - radiance_offsets[specific_band])  # Source: MODIS Level 1B Product User's Guide Page 36 MOD_PR02 V6.1.12 (TERRA)/V6.1.15 (AQUA)
        data_1_band['%s' % get_name_tag] = L_B_1
        L_B_1_mean['%s' % get_name_tag] = L_B_1.mean()

        # Code for many different Bands:
        data_all_bands["%s" % get_name_tag] = []
        for k in Band_nrs[lowest_band:highest_band]:  # Bands 8-11
            L_B = radiance_scales[k] * (Bands[k] - radiance_offsets[k])  # List with all bands
            print "Appending mean value of {} for band {} out of {}".format(L_B.mean(), Band_nrs[k], len(Band_nrs))
            data_all_bands['%s' % get_name_tag].append(L_B.mean())  # Mean radiance values

        i = i + 1
        print "data added. Adding i+1 = ", i
    except AttributeError:
        print "\n*******************************"
        print "Can't open file {}".format(workdir1 + "/" + fe)
        print "Skipping this file..."
        print "*******************************"
        broken_files.append(workdir1 + "/" + fe)
        i = i + 1
Without knowing your exact data source and desired output etc. it is hard to give you a specific answer. With that said, it appears that you have the native .hdf format of MODIS images and wish to do some subsetting to get the images referenced to the same area, then plot etc.
It might help for you to look at gdal.Warp() from the gdal module. This method is able to take a .hdf file and subset a series of images to the same bounding box with the same resolution/number of rows and columns.
You can then analyse and plot these images/compare pixels etc.
I hope this gives you a good starting point.
gdal.Warp docs: https://gdal.org/python/osgeo.gdal-module.html#Warp
More general warp help: https://www.gdal.org/gdalwarp.html
Something like this:
import gdal

# Set up the gdal.Warp options such as desired spatial resolution,
# resampling algorithm to use and output format.
# See: https://gdal.org/python/osgeo.gdal-module.html#WarpOptions
# for other options that can be specified.
warp_options = gdal.WarpOptions(format="GTiff",
                                outputBounds=[min_x, min_y, max_x, max_y],
                                xRes=res,
                                yRes=res,
                                # PROBABLY NEED TO SET t_srs TOO
                                )

# Apply the warp.
# (output_file, input_file, options)
gdal.Warp("/path/to/output_file.tif",
          "/path/to/input_file.hdf",
          options=warp_options)
Exact code to write:
# Apply the warp.
# (output_file, input_file, options)
gdal.Warp('/path/to/output_file.tif',
          '/path/to/HDF4_EOS:EOS_SWATH:"MYD021KM.A2009034.1345.006.2012058160107.hdf":MODIS_SWATH_Type_L1B:EV_1KM_RefSB',
          options=warp_options)
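As a quick sanity check that the warp produced a georeferenced raster you can compare across years (a sketch, using the output path assumed above), open the result and look at its geotransform; after warping it should no longer be the identity transform (0, 1, 0, 0, 0, 1) you saw on the raw swath file:
import gdal

ds = gdal.Open("/path/to/output_file.tif")
print(ds.RasterXSize, ds.RasterYSize, ds.RasterCount)
print(ds.GetGeoTransform())       # no longer (0, 1, 0, 0, 0, 1) after warping
arr = ds.ReadAsArray()            # numpy array; bands first if RasterCount > 1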

Split speech audio file on words in python

I feel like this is a fairly common problem but I haven't yet found a suitable answer. I have many audio files of human speech that I would like to break on words, which can be done heuristically by looking at pauses in the waveform, but can anyone point me to a function/library in python that does this automatically?
An easier way to do this is using the pydub module. Its recently added silence utilities do all the heavy lifting, such as setting up the silence threshold and the silence length, and they simplify the code significantly compared to the other methods mentioned.
Here is a demo implementation, with inspiration taken from here.
Setup:
I had an audio file with spoken English letters from A to Z in the file "a-z.wav". A sub-directory splitAudio was created in the current working directory. Upon executing the demo code, the audio was split into 26 separate files, each one storing one syllable.
Observations:
Some of the syllables were cut off, possibly requiring modification of the following parameters:
min_silence_len=500
silence_thresh=-16
One may want to tune these to one's own requirement.
Demo Code:
from pydub import AudioSegment
from pydub.silence import split_on_silence

sound_file = AudioSegment.from_wav("a-z.wav")
audio_chunks = split_on_silence(sound_file,
                                # must be silent for at least half a second
                                min_silence_len=500,
                                # consider it silent if quieter than -16 dBFS
                                silence_thresh=-16
                                )

for i, chunk in enumerate(audio_chunks):
    out_file = ".//splitAudio//chunk{0}.wav".format(i)
    print "exporting", out_file
    chunk.export(out_file, format="wav")
Output:
Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
exporting .//splitAudio//chunk0.wav
exporting .//splitAudio//chunk1.wav
exporting .//splitAudio//chunk2.wav
exporting .//splitAudio//chunk3.wav
exporting .//splitAudio//chunk4.wav
exporting .//splitAudio//chunk5.wav
exporting .//splitAudio//chunk6.wav
exporting .//splitAudio//chunk7.wav
exporting .//splitAudio//chunk8.wav
exporting .//splitAudio//chunk9.wav
exporting .//splitAudio//chunk10.wav
exporting .//splitAudio//chunk11.wav
exporting .//splitAudio//chunk12.wav
exporting .//splitAudio//chunk13.wav
exporting .//splitAudio//chunk14.wav
exporting .//splitAudio//chunk15.wav
exporting .//splitAudio//chunk16.wav
exporting .//splitAudio//chunk17.wav
exporting .//splitAudio//chunk18.wav
exporting .//splitAudio//chunk19.wav
exporting .//splitAudio//chunk20.wav
exporting .//splitAudio//chunk21.wav
exporting .//splitAudio//chunk22.wav
exporting .//splitAudio//chunk23.wav
exporting .//splitAudio//chunk24.wav
exporting .//splitAudio//chunk25.wav
exporting .//splitAudio//chunk26.wav
>>>
You could look at Audiolab. It provides a decent API to convert the voice samples into numpy arrays.
The Audiolab module uses the libsndfile C++ library to do the heavy lifting.
You can then parse the arrays to find the lower values and locate the pauses.
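A minimal sketch of that pause-finding idea; I'm loading with soundfile here rather than Audiolab (which is quite old), but the thresholding on the numpy array is the same either way, and the threshold and minimum-pause-length values are arbitrary assumptions you would tune:
import numpy as np
import soundfile as sf

data, samplerate = sf.read("speech.wav")
if data.ndim > 1:                        # mix down to mono if needed
    data = data.mean(axis=1)

threshold = 0.02 * np.abs(data).max()    # "low" amplitude, relative to the peak
min_pause = int(0.2 * samplerate)        # ignore pauses shorter than 200 ms

quiet = np.abs(data) < threshold
# Find runs of consecutive quiet samples long enough to count as a pause
pauses = []
start = None
for i, q in enumerate(quiet):
    if q and start is None:
        start = i
    elif not q and start is not None:
        if i - start >= min_pause:
            pauses.append((start, i))
        start = None

print("Candidate word boundaries (seconds):",
      [round((s + e) / 2 / samplerate, 2) for s, e in pauses])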
Use IBM STT. With timestamps=true you will get the word break-up along with the times at which the system detects each word to have been spoken.
There are a lot of other cool features, like word_alternatives_threshold to get other possibilities for words and word_confidence to get the confidence with which the system predicts the word. Set word_alternatives_threshold to somewhere between 0.1 and 0.01 to get a real idea.
This needs a sign-up, after which you can use the generated username and password.
The IBM STT is already part of the speechrecognition module mentioned, but to get the word timestamps you will need to modify the function.
An extracted and modified form looks like:
import base64
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

import speech_recognition as sr

# IBM_USERNAME and IBM_PASSWORD must be defined with your own credentials
def extracted_from_sr_recognize_ibm(audio_data, username=IBM_USERNAME, password=IBM_PASSWORD, language="en-US", show_all=False, timestamps=False,
                                    word_confidence=False, word_alternatives_threshold=0.1):
    assert isinstance(username, str), "``username`` must be a string"
    assert isinstance(password, str), "``password`` must be a string"

    flac_data = audio_data.get_flac_data(
        convert_rate=None if audio_data.sample_rate >= 16000 else 16000,  # audio samples should be at least 16 kHz
        convert_width=None if audio_data.sample_width >= 2 else 2         # audio samples should be at least 16-bit
    )
    url = "https://stream-fra.watsonplatform.net/speech-to-text/api/v1/recognize?{}".format(urlencode({
        "profanity_filter": "false",
        "continuous": "true",
        "model": "{}_BroadbandModel".format(language),
        "timestamps": "{}".format(str(timestamps).lower()),
        "word_confidence": "{}".format(str(word_confidence).lower()),
        "word_alternatives_threshold": "{}".format(word_alternatives_threshold)
    }))
    request = Request(url, data=flac_data, headers={
        "Content-Type": "audio/x-flac",
        "X-Watson-Learning-Opt-Out": "true",  # prevent requests from being logged, for improved privacy
    })
    authorization_value = base64.standard_b64encode("{}:{}".format(username, password).encode("utf-8")).decode("utf-8")
    request.add_header("Authorization", "Basic {}".format(authorization_value))

    try:
        response = urlopen(request, timeout=None)
    except HTTPError as e:
        raise sr.RequestError("recognition request failed: {}".format(e.reason))
    except URLError as e:
        raise sr.RequestError("recognition connection failed: {}".format(e.reason))
    response_text = response.read().decode("utf-8")
    result = json.loads(response_text)

    # return results
    if show_all: return result
    if "results" not in result or len(result["results"]) < 1 or "alternatives" not in result["results"][0]:
        raise Exception("Unknown Value Exception")

    transcription = []
    for utterance in result["results"]:
        if "alternatives" not in utterance:
            raise Exception("Unknown Value Exception. No Alternatives returned")
        for hypothesis in utterance["alternatives"]:
            if "transcript" in hypothesis:
                transcription.append(hypothesis["transcript"])
    return "\n".join(transcription)
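A hedged usage sketch, assuming the speechrecognition package is installed, a speech.wav file of the recording, and that IBM_USERNAME / IBM_PASSWORD hold your own credentials:
import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("speech.wav") as source:
    audio = r.record(source)  # read the entire file into an AudioData object

# show_all=True returns the raw JSON, which contains the per-word timestamps
result = extracted_from_sr_recognize_ibm(audio, username=IBM_USERNAME, password=IBM_PASSWORD,
                                         timestamps=True, show_all=True)
for utterance in result["results"]:
    # each timestamp entry is [word, start_seconds, end_seconds]
    print(utterance["alternatives"][0].get("timestamps", []))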
pyAudioAnalysis can segment an audio file if the words are clearly separated (this is rarely the case in natural speech). The package is relatively easy to use:
python pyAudioAnalysis/pyAudioAnalysis/audioAnalysis.py silenceRemoval -i SPEECH_AUDIO_FILE_TO_SPLIT.mp3 --smoothing 1.0 --weight 0.3
More details on my blog.
My variant of the function, which will probably be easier to modify for your needs:
from scipy.io.wavfile import write as write_wav
import numpy as np
import librosa

def zero_runs(a):
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return ranges

def split_in_parts(audio_path, out_dir):
    # Some constants
    min_length_for_silence = 0.01  # seconds
    percentage_for_silence = 0.01  # eps value for silence
    required_length_of_chunk_in_seconds = 60  # Chunk will be around this value, not exact
    sample_rate = 16000  # Set to None to use default

    # Load audio
    waveform, sampling_rate = librosa.load(audio_path, sr=sample_rate)

    # Create mask of silence
    eps = waveform.max() * percentage_for_silence
    silence_mask = (np.abs(waveform) < eps).astype(np.uint8)

    # Find where silence starts and ends
    runs = zero_runs(silence_mask)
    lengths = runs[:, 1] - runs[:, 0]

    # Keep only large silence ranges
    min_length_for_silence = min_length_for_silence * sampling_rate
    large_runs = runs[lengths > min_length_for_silence]
    lengths = lengths[lengths > min_length_for_silence]

    # Mark only the center of each silence
    silence_mask[...] = 0
    for start, end in large_runs:
        center = (start + end) // 2
        silence_mask[center] = 1

    min_required_length = required_length_of_chunk_in_seconds * sampling_rate
    chunks = []
    prev_pos = 0
    for i in range(min_required_length, len(waveform), min_required_length):
        start = i
        end = i + min_required_length
        next_pos = start + silence_mask[start:end].argmax()
        part = waveform[prev_pos:next_pos].copy()
        prev_pos = next_pos
        if len(part) > 0:
            chunks.append(part)

    # Add last part of waveform
    part = waveform[prev_pos:].copy()
    chunks.append(part)
    print('Total chunks: {}'.format(len(chunks)))

    new_files = []
    for i, chunk in enumerate(chunks):
        out_file = out_dir + "chunk_{}.wav".format(i)
        print("exporting", out_file)
        write_wav(out_file, sampling_rate, chunk)
        new_files.append(out_file)

    return new_files
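A short usage sketch (the paths are hypothetical; note that out_dir is concatenated directly, so it should already exist and end with a path separator):
files = split_in_parts("long_speech.wav", "./chunks/")
print(files)  # e.g. ['./chunks/chunk_0.wav', './chunks/chunk_1.wav', ...]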
