Safe loading a YAML file on Stable Diffusion Deforum - Python

So the main goal is loading a safetensors checkpoint file in Stable Diffusion, and the secondary goal is safely loading the .yaml model config.
Here's the Google Colab I was working with:
https://colab.research.google.com/github/deforum-art/deforum-stable-diffusion/blob/main/Deforum_Stable_Diffusion.ipynb
GitHub page: https://github.com/deforum/deforum-stable-diffusion
With regular Stable Diffusion I'm able to drop in safetensors checkpoint files in place of the .ckpt files, but I think the issue I'm having with Deforum is the .yaml file. I've heard you're supposed to safe_load() these files, otherwise there's a possibility of them running arbitrary code, similar to the pickle issue with .ckpt files.
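(Loading the safetensors file itself isn't the scary part; for reference, the generic way to read one looks roughly like this. A minimal sketch with a placeholder filename, not Deforum's actual loading code:)
from safetensors.torch import load_file

# load_file returns a plain dict of tensors; nothing is unpickled,
# so a .safetensors file can't execute code just by being loaded.
state_dict = load_file("model.safetensors", device="cpu")
print(len(state_dict), "tensors loaded")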
I tried renaming the file to v1-inference.yaml.safe_load(), but that just looks wrong and, unsurprisingly, didn't work.
I also tried not giving a path to a custom .yaml file at all, but then the error is basically "where is this custom file?"
I can find examples of how to do this in general, but not any specific examples of how it works with Deforum.
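(The generic examples I keep finding look roughly like this; a minimal sketch, assuming the config file sits next to the script:)
import yaml

# safe_load only constructs plain dicts, lists, strings and numbers,
# so it can't instantiate arbitrary Python objects the way a full load can.
with open("v1-inference.yaml") as f:
    config = yaml.safe_load(f)

print(config["model"]["target"])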
Maybe I'm just freaking out over nothing. Here's the source of the v1-inference.yaml file; does anything in it look suspicious, or like it could run arbitrary remote code in the background?
model:
  base_learning_rate: 1.0e-04
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.0120
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: "jpg"
    cond_stage_key: "txt"
    image_size: 64
    channels: 4
    cond_stage_trainable: false   # Note: different from the one we trained before
    conditioning_key: crossattn
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    use_ema: False

    scheduler_config: # 10000 warmup steps
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [ 10000 ]
        cycle_lengths: [ 10000000000000 ] # incredibly large number to prevent corner cases
        f_start: [ 1.e-6 ]
        f_max: [ 1. ]
        f_min: [ 1. ]

    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 32 # unused
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions: [ 4, 2, 1 ]
        num_res_blocks: 2
        channel_mult: [ 1, 2, 4, 4 ]
        num_heads: 8
        use_spatial_transformer: True
        transformer_depth: 1
        context_dim: 768
        use_checkpoint: True
        legacy: False

    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity

    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder
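(From what I can tell, the yaml itself is just data; it's the loader code that turns those target: strings into classes, usually with an import-by-string pattern roughly like the sketch below. This is only a rough illustration of the pattern, not the exact ldm/Deforum code.)
import importlib

# The "target" string is split into module path and attribute name,
# imported dynamically, and then called with the "params" mapping.
def get_obj_from_str(string):
    module, cls = string.rsplit(".", 1)
    return getattr(importlib.import_module(module), cls)

def instantiate_from_config(config):
    return get_obj_from_str(config["target"])(**config.get("params", {}))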
Honestly, it seems harmless, but I don't know, maybe I'm missing something. Reading too much cybersecurity stuff makes you paranoid about everything.
Thanks!

Related

Split PDF into Images by Line (OCR Model Training)

I have a large collection of PDFs containing scanned text that I'd like to OCR.
No commercial (Abby, PhantomPDF, Acrobat Pro), service (Google Vision API), or open-source (pre-trained models using tesseract, kraken) tool has been able to OCR the text in a sufficiently accurate manner.
I have some of the PDFs in their original form (with the text intact), meaning I have a reasonable amount of exact, ground-truth training data with enormous overlap in fonts, page structure, etc.
It seems that every method to train your own OCR model requires your training data be set up line-by-line, meaning I need to cut each line of hundreds of pages in the training-PDFs into separate images (then I can simply split the text in the training-PDFs by line to create the corresponding gt.txt files for tesseract or kraken).
I've used tools to split PDFs by page and convert/save each page to an image file, but I have not been able to find a way to automate doing the same thing line-by-line. But, R's {pdftools} makes it seem like getting the y-coordinates of each line is possible...
pdftools::pdf_data(pdf_path)[[3]][1:4, ]
#> width height x y space text
#> 1 39 17 245 44 TRUE Table
#> 2 13 17 288 44 TRUE of
#> 3 61 17 305 44 FALSE Contents
#> 4 41 11 72 74 FALSE Overview
... but it's unclear to me how that can be adjusted to match the resolution scaling of any PDF-to-image routine.
All that being said...
Is there a tool out there that already does this?
If not, in what direction should I head to build my own?
It seems Magick is fully capable of this (as soon as I grok how to navigate the pixels), but that doesn't solve the question of how to translate the y-coordinates from something like {pdftools} to the pixel locations in an image generated using a DPI argument (like every? PDF-to-image conversion tool).
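(For what it's worth, my understanding is that PDF coordinates are in points, 72 per inch, so the translation to pixels should just be a scale factor; a minimal sketch of that arithmetic, shown in Python purely for illustration:)
# PDF user-space coordinates are in points (1 pt = 1/72 inch), so rendering
# a page at a given DPI scales every coordinate by dpi / 72.
def points_to_pixels(coord_pt, dpi=300):
    return round(coord_pt * dpi / 72)

print(points_to_pixels(44))  # y = 44 pt -> ~183 px at 300 DPI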
Edit # 1:
It turns out the coordinates are based on the PDF "object" locations, which doesn't necessarily mean that text that is supposed to be on the same line (and visually is) is always reflected as such. Text that is meant to be on the same row may be off by several pixels.
The next best thing is cropping boxes around each of the objects. In R, this does the trick:
build_training_data <- function(pdf_paths, out_path = "training-data") {
  out_path_mold <- "%s/%s-%d-%d.%s"
  for (pdf_path in pdf_paths) {
    prefix <- sub(".pdf", "", basename(pdf_path), fixed = TRUE)
    pdf_data <- pdftools::pdf_data(pdf_path)
    pdf_text <- pdftools::pdf_text(pdf_path)
    pdf_heights <- pdftools::pdf_pagesize(pdf_path)$height
    for (i_page in seq_along(pdf_data)) {
      page_text <- pdf_text[[i_page]]
      line_text <- strsplit(page_text, "\n")[[1L]]
      page_image <- magick::image_read_pdf(pdf_path, pages = i_page)
      image_stats <- magick::image_info(page_image)
      scale_by <- image_stats$height / pdf_heights[[i_page]]
      page_data <- pdf_data[[i_page]]
      for (j_object in seq_len(nrow(page_data))) {
        cat(sprintf("\r- year: %s, page: %d, object: %d      ",
                    prefix, i_page, j_object))
        # note: the template has five placeholders, so all five values are supplied
        image_path <- sprintf(out_path_mold, out_path, prefix, i_page, j_object, "png")
        text_path <- sprintf(out_path_mold, out_path, prefix, i_page, j_object, "gt.txt")
        geom <- magick::geometry_area(
          height = page_data$height[[j_object]] * scale_by * 1.2,
          width = page_data$width[[j_object]] * scale_by * 1.1,
          x_off = page_data$x[[j_object]] * scale_by,
          y_off = page_data$y[[j_object]] * scale_by
        )
        line_image <- magick::image_crop(page_image, geom)
        magick::image_write(line_image, format = "png", path = image_path)
        writeLines(page_data$text[[j_object]], text_path)
      }
    }
  }
}
This is definitely not optimal.
The University of Salford has a Pattern Recognition and Image Analysis (PRImA) research lab. It's part of their School of Computing, Science and Engineering. They've created some software called Aletheia, designed to help create ground-truth text from images. These can be used to train Tesseract versions 3 or 4.
https://www.primaresearch.org/tools/Aletheia

What is the way to ignore/skip some issues in a Python Bandit security report?

I've got a bunch of django_mark_safe errors
>> Issue: [B703:django_mark_safe] Potential XSS on mark_safe function.
Severity: Medium Confidence: High
Location: ...
More Info: https://bandit.readthedocs.io/en/latest/plugins/b703_django_mark_safe.html
54 return mark_safe(f'{title}')
>> Issue: [B308:blacklist] Use of mark_safe() may expose cross-site scripting vulnerabilities and should be reviewed.
Severity: Medium Confidence: High
Location: ...
More Info: https://bandit.readthedocs.io/en/latest/blacklists/blacklist_calls.html#b308-mark-safe
54 return mark_safe(f'{title}')
And I'm curious if there is a way to skip or ignore such lines. I understand that using mark_safe could be dangerous, but what if I want to take the risk? For example, this method is the only way I know of to display a custom link in the Django admin, so I don't see any other option besides mark_safe.
I've got an answer here:
Two ways:
You can skip the B703 and B308 checks using the --skip argument on the command line.
Or you can add a # nosec comment on the line you want to skip.
https://bandit.readthedocs.io/en/latest/config.html#exclusions
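For example, an invocation along these lines (the project path here is just a placeholder):
bandit -r myproject/ --skip B703,B308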
Heads up for annotating multi-line statements with # nosec:
given:
li_without_nosec = [
    "select * from %s where 1 = 1 "
    % "foo"
]

li_nosec_at_start_works = [  # nosec - ✅ and you can put a comment
    "select * from %s where 1 = 1 "
    % "foo"
]

# nosec - there's an enhancement request to support marking from the line above
li_nosec_on_top_doesntwork = [
    "select * from %s where 1 = 1 "
    % "foo"
]

li_nosec_at_end_doesntwork = [
    "select * from %s where 1 = 1 "
    % "foo"
]  # nosec
output:
>> Issue: [B608:hardcoded_sql_expressions] Possible SQL injection vector through string-based query construction.
Severity: Medium Confidence: Low
Location: test.py:3
More Info: https://bandit.readthedocs.io/en/latest/plugins/b608_hardcoded_sql_expressions.html
2 li_without_nosec = [
3 "select * from %s where 1 = 1 "
4 % "foo"
5 ]
--------------------------------------------------
>> Issue: [B608:hardcoded_sql_expressions] Possible SQL injection vector through string-based query construction.
Severity: Medium Confidence: Low
Location: test.py:15
More Info: https://bandit.readthedocs.io/en/latest/plugins/b608_hardcoded_sql_expressions.html
14 li_nosec_on_top_doesntwork = [
15 "select * from %s where 1 = 1 "
16 % "foo"
17 ]
--------------------------------------------------
>> Issue: [B608:hardcoded_sql_expressions] Possible SQL injection vector through string-based query construction.
Severity: Medium Confidence: Low
Location: test.py:21
More Info: https://bandit.readthedocs.io/en/latest/plugins/b608_hardcoded_sql_expressions.html
20 li_nosec_at_end_doesntwork = [
21 "select * from %s where 1 = 1 "
22 % "foo"
23 ] # nosec
Black
Here's hoping that black won't get involved and restructure the lines, moving the # nosec around.
So much for hope: black does move things around, just as it does with pylint directives, whenever the line gets too long, at which point # nosec ends up at the end.
You can either proactively break up the line and put # nosec on the first piece, or just wait black out and adjust if needed.
Just to complete the topic: in my case I had to get rid of the B322: input rule, and I didn't want to write # nosec every time the problem came up in the code, or to always run Bandit with a --skip flag.
So if you want to omit a certain rule for the whole project, you can create a .bandit file in the root of your project and list which rules should be skipped every time, for example:
[bandit]
skips: B322
Bandit will then skip this check by default, without the need for additional comments in the code.
If you want to suppress a rule only in a specific piece of code, you can do something like the following:
# this is my very basic python script
def foo():
    do_something_unsecure()  # nosec B703, B308
That way you skip those checks only on that line. In practice this is probably the most appropriate way to skip individual rules.
You can configure Bandit with a .bandit INI file (only if it is invoked with the -r option):
[bandit]
tests = B101,B102,B301
Or with pyproject.toml file:
[tool.bandit]
tests = ["B201", "B301"]
skips = ["B101", "B601"]
Or with a YAML file:
skips: ['B101', 'B601']

assert_used:
  skips: ["*/test_*.py", "*/test_*.py"]
See https://bandit.readthedocs.io/en/latest/config.html
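(If I remember correctly, the pyproject.toml and YAML configs are only picked up when you point Bandit at the config file explicitly, along the lines of:)
bandit -c pyproject.toml -r .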

Keras evaluate_generator computes more steps than the number indicated in the "steps" parameter

I am working with a custom generator that yields videos. I'm trying to do a simple evaluation with evaluate_generator on my model, but I realized that changing the batch_size was yielding different accuracy results. I decided to print the yielded video names at each generator step, and it turned out that, somehow, the generator is being called more times than what I indicate in the steps parameter of evaluate_generator.
From what I understand, evaluate_generator's steps parameter indicates the number of batches to yield. In my case, my generator has a batch_size of 10, and because there are 30 data points to evaluate, I set steps=3. I should then be evaluating my model on all available data points, in 3 steps of 10 points each. However, the generator yields more videos than that, looping back over already analyzed videos and thus affecting the final accuracy score. Here is what's happening in code:
First, my generator (simplified):
def video_generator(batch_size=1, files=None, shuffle=True, augment=None,
                    load_func='hmdb_npy_rgb', preprocess_func=None, is_train=True):
    L = len(files)
    print('Calling video_generator. Batch size: ', batch_size,
          '. Number of files received:', L, '. Augmentation: ', augment)
    ## This loop is just to make the generator infinite, keras needs that
    while True:
        ## Define starting idx for batch slicing
        batch_start = 0
        batch_end = batch_size
        c = 1
        ## Loop over the txt file while there are enough unseen files for one more batch
        while batch_start < L:
            ## LOAD DATA
            limit = min(batch_end, L)
            # DEBUG STRING
            print('STEP', c, ' - yielding', limit - batch_start, 'videos.')
            X = load_func(files[batch_start:limit])
            Y = load_labels(files[batch_start:limit])
            ## PREPROCESS DATA
            if preprocess_func is not None:
                X = preprocess_func(X, is_train=is_train)
            ## AUGMENT DATA
            if augment is not None:
                X = augment_video_batch(X, augment)
            ## YIELD DATA
            yield X, Y  # a tuple of two numpy arrays with batch_size samples
            ## Increasing idxs for next batch
            batch_start += batch_size
            batch_end += batch_size
            c += 1
As you can see, it's a pretty standard generator. I am printing a debug string at every step, to see how many times the main loop is called. This is my general code:
for k, v in classes_to_evaluate.items():
    # k is a class of videos, e.g. jump, run, etc
    # v is a list of video names corresponding to class k
    print('Evaluating', k, '. # steps:', 3)
    generator = video_generator(files=v, batch_size=batch_size, **gen_params)
    metrics = model.evaluate_generator(generator, steps=3,
                                       max_queue_size=10, workers=1,
                                       use_multiprocessing=False)
    print('Time elapsed on', k, ':', end - start)
Based on this, I should be seeing the following prints:
Evaluating shoot_gun . num steps: 3
Calling video_generator. Batch size: 10 . Number of files received: 30 . Augmentation: None
STEP 1 - yielding 10 videos.
STEP 2 - yielding 10 videos.
STEP 3 - yielding 10 videos.
Time elapsed on shoot_gun : # time elapsed here
However, this is what I'm seeing:
Evaluating shoot_gun . num steps: 3
Calling video_generator. Batch size: 10 . Number of files received: 30 . Augmentation: None
STEP 1 - yielding 10 videos.
STEP 2 - yielding 10 videos.
STEP 3 - yielding 10 videos.
STEP 1 - yielding 10 videos.
STEP 2 - yielding 10 videos.
STEP 3 - yielding 10 videos.
Time elapsed on shoot_gun : 7.721895118942484
And the number of steps gets kind of randomized for the other classes. As you can see, the generator enters the loop 6 times instead of 3 for class shoot_gun, but then for some classes I get 4 steps, for others 5 steps (and all classes have exactly 30 videos, and are called with a new instance of the same generator). For example:
Evaluating climb . num steps: 3
Calling video_generator. Batch size: 10 . Number of files received: 30 . Augmentation: None
STEP 1 - yielding 10 videos.
STEP 2 - yielding 10 videos.
STEP 3 - yielding 10 videos.
STEP 1 - yielding 10 videos.
STEP 2 - yielding 10 videos.
Time elapsed on climb : 7.923485273960978
Here, as you can see, I get 5 generator steps instead of the 3 that I want.
But then:
Evaluating kiss . num steps: 3
Calling video_generator. Batch size: 10 . Number of files received: 30 . Augmentation: None
STEP 1 - yielding 10 videos.
STEP 2 - yielding 10 videos.
STEP 3 - yielding 10 videos.
STEP 1 - yielding 10 videos.
Time elapsed on kiss : 9.742703797062859
Here, I get only 4 steps. I don't understand this, as there are no differences in batch size, number of videos, or any other parameter between classes. It is important to note that in NO CASE do I get only 3 steps, which is the intended behavior.
Does anyone know why this happens, and how to make it at least consistent?

Python multiple versions of constants

I am writing a program in Python 3 to work with several devices, and I have to store constants for each device. Some constants are general for all devices and permanently fixed, but others differ from version to version depending on the firmware version of the device. I have to store constants for all versions, not only the latest one. Please tell me the Pythonic way to define constants for different devices and multiple versions of them.
My current solution looks like this:
general = {
    'GENERAL_CONST_1': 1,
    'GENERAL_CONST_2': 2,
    ...
    'GENERAL_CONST_N': N
}

device_1 = dict()
device_1[FIRMWARE_VERSION_1] = {
    'DEVICE_1_CONST_1': 1,
    'DEVICE_1_CONST_2': 2,
    ...
    'DEVICE_1_CONST_N': N
}
device_1[FIRMWARE_VERSION_1].update(general)

device_1[FIRMWARE_VERSION_2] = {
    'DEVICE_1_CONST_1': 1,
    'DEVICE_1_CONST_2': 2,
    ...
    'DEVICE_1_CONST_N': N
}
device_1[FIRMWARE_VERSION_2].update(general)

device_2 = dict()
device_2[FIRMWARE_VERSION_1] = {
    'DEVICE_2_CONST_1': 1,
    'DEVICE_2_CONST_2': 2,
    ...
    'DEVICE_2_CONST_N': N
}
device_2[FIRMWARE_VERSION_1].update(general)

device_2[FIRMWARE_VERSION_2] = {
    'DEVICE_2_CONST_1': 1,
    'DEVICE_2_CONST_2': 2,
    ...
    'DEVICE_2_CONST_N': N
}
device_2[FIRMWARE_VERSION_2].update(general)
Thank you in advance! Or, if you could point me in the direction where I can read about the above, I would be grateful for that too.
UPD1:
Thanks to @languitar I decided to use one of the INI/JSON/YAML/TSON... formats, for example the formats supported by the library python-anyconfig. The INI format (proposed by @languitar via configparser) looks good for my purposes (TSON also seemed interesting), but unfortunately both of them don't support hex values. I was very surprised, because all my constants should be in hex format. So I decided to try the YAML format. Now the file with constants looks like this:
# General consts for all devices and all versions
general: &general
  GENERAL_CONST_1: 1
  GENERAL_CONST_2: 2
  ...
  GENERAL_CONST_N: N

# Particular consts for device_1 for different firmware versions
device_1: &device_1
  <<: *general
  # General consts for device_1 and all firmware versions
  DEVICE_1_CONST_1: 1

device_1:
  FIRMWARE_VERSION_1:
    <<: *device_1
    DEVICE_1_CONST_2: 2
    ...
    DEVICE_1_CONST_N: N
  FIRMWARE_VERSION_2:
    <<: *device_1
    DEVICE_1_CONST_2: 2
    ...
    DEVICE_1_CONST_N: N

# Particular consts for device_2 for different firmware versions
device_2: &device_2
  <<: *general
  # General consts for device_2 and all firmware versions
  DEVICE_2_CONST_1: 1

device_2:
  FIRMWARE_VERSION_1:
    <<: *device_2
    DEVICE_2_CONST_2: 2
    ...
    DEVICE_2_CONST_N: N
  FIRMWARE_VERSION_2:
    <<: *device_2
    DEVICE_2_CONST_2: 2
    ...
    DEVICE_2_CONST_N: N
But I am not sure whether this is the right way to store constants for devices and all their firmware versions.
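(For reference, reading this kind of file should only need safe_load; a minimal sketch, with constants.yaml as a placeholder filename and assuming the real file has the ... placeholders filled in:)
import yaml

# safe_load builds plain dicts; YAML anchors (&) and merge keys (<<) are
# resolved at load time, so the inherited constants appear directly in each
# firmware-version mapping. PyYAML also parses hex literals like 0x1F as ints.
with open("constants.yaml") as f:
    consts = yaml.safe_load(f)

print(consts["device_1"]["FIRMWARE_VERSION_1"]["GENERAL_CONST_1"])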
Just change your names to all capital letters, like GENERAL, DEVICE_1, etc.

Memory dump using Python

I have a small program, written for me in Python, to help me generate all combinations of passwords from different sets of numbers and words I know, so that I can recover a password I forgot. Since I know all the different words and sets of numbers I used, I just wanted to generate all possible combinations. The only problem is that the list seems to go on for hours and hours, so eventually I run out of memory and it doesn't finish.
I was told it needs to dump my memory so it can carry on, but I'm not sure if this is right. Is there any way I can get round this problem?
This is the program I am running:
#!/usr/bin/python
import itertools
gfname = "name"
tendig = "1234567890"
sixteendig = "1111111111111111"
housenum = "99"
Characterset1 = "&&&&"
Characterset2 = "££££"
daughternam = "dname"
daughtyear = "1900"
phonenum1 = "055522233"
phonenum2 = "3333333"
mylist = [gfname, tendig, sixteendig, housenum, Characterset1,
          Characterset2, daughternam, daughtyear, phonenum1, phonenum2]
for length in range(1, len(mylist)+1):
    for item in itertools.permutations(mylist, length):
        print "".join(item)
I have taken out a few sets and changed the numbers and words for obvious reasons, but this is roughly the program.
Another thing: I may be missing a particular word but didn't want to put it in the list, because I know it might need to go before all the generated passwords. Does anyone know how to add a prefix to my program?
Sorry for the bad grammar, and thanks for any help given.
I used guppy to understand the memory usage. I changed the OP's code slightly (marked # !!!):
import itertools
gfname = "name"
tendig = "1234567890"
sixteendig = "1111111111111111"
housenum = "99"
Characterset1 = "&&&&"
Characterset2 = u"££££"
daughternam = "dname"
daughtyear = "1900"
phonenum1 = "055522233"
phonenum2 = "3333333"
from guppy import hpy # !!!
h=hpy() # !!!
mylist = [gfname, tendig, sixteendig, housenum, Characterset1,
          Characterset2, daughternam, daughtyear, phonenum1, phonenum2]
for length in range(1, len(mylist)+1):
    print h.heap()  # !!!
    for item in itertools.permutations(mylist, length):
        print item  # !!!
Guppy outputs something like this every time h.heap() is called.
Partition of a set of 25914 objects. Total size = 3370200 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 11748 45 985544 29 985544 29 str
1 5858 23 472376 14 1457920 43 tuple
2 323 1 253640 8 1711560 51 dict (no owner)
3 67 0 213064 6 1924624 57 dict of module
4 199 1 210856 6 2135480 63 dict of type
5 1630 6 208640 6 2344120 70 types.CodeType
6 1593 6 191160 6 2535280 75 function
7 199 1 177008 5 2712288 80 type
8 124 0 135328 4 2847616 84 dict of class
9 1045 4 83600 2 2931216 87 __builtin__.wrapper_descriptor
Running python code.py > code.log and then fgrep Partition code.log shows:
Partition of a set of 25914 objects. Total size = 3370200 bytes.
Partition of a set of 25924 objects. Total size = 3355832 bytes.
Partition of a set of 25924 objects. Total size = 3355728 bytes.
Partition of a set of 25924 objects. Total size = 3372568 bytes.
Partition of a set of 25924 objects. Total size = 3372736 bytes.
Partition of a set of 25924 objects. Total size = 3355752 bytes.
Partition of a set of 25924 objects. Total size = 3372592 bytes.
Partition of a set of 25924 objects. Total size = 3372760 bytes.
Partition of a set of 25924 objects. Total size = 3355776 bytes.
Partition of a set of 25924 objects. Total size = 3372616 bytes.
Which I believe shows that the memory footprint stays fairly consistent.
Granted, I may be misinterpreting the results from guppy, although during my tests I deliberately added a new string to a list to see if the object count increased, and it did.
For those interested, I had to install guppy like this on OS X Mountain Lion:
pip install https://guppy-pe.svn.sourceforge.net/svnroot/guppy-pe/trunk/guppy
In summary, I don't think this is a running-out-of-memory issue, although granted we're not using the full OP dataset.
How about using IronPython and Visual Studio for its debug tools (which are pretty good)? You should be able to pause execution and look at the memory (essentially a memory dump).
Your program will run pretty efficiently by itself, as you now know. But make sure you don't just run it in IDLE, for example; that will slow it down to a crawl as IDLE updates the screen with more and more lines. Save the output directly into a file.
Even better: Have you thought about what you'll do when you have the passwords? If you can log on to the lost account from the command line, try doing that immediately instead of storing all the passwords for later use:
for length in range(1, len(mylist)+1):
    for item in itertools.permutations(mylist, length):
        password = "".join(item)
        try_to_logon(command, password)
To answer the above comment from @shaun: if you want the output to go to a text file (which you can open in Notepad), just run your script like so:
Myfile.py >output.txt
If the text file doesn't exist it will be created.
EDIT:
Replace the line in your code at the bottom which reads:
print "" .join(item)
with this:
with open("output.txt", "a") as f:
    f.write("".join(item) + "\n")
which will produce a file called output.txt.
Should work (haven't tested)
