My Spark configuration looks like this:
spark_conf = SparkConf().setAppName('app_name') \
.setMaster("local[4]") \
.set('spark.executor.memory', "8g") \
.set('spark.executor.cores', 4) \
.set('spark.task.cpus', 1)
sc = SparkContext.getOrCreate(conf=spark_conf)
sc.setCheckpointDir(dirName='checkpoint')
When I do not use any checkpoint in the Spark chain, my program looks like this:
result = sc.parallelize(group, 4) \
.map(func_read, preservesPartitioning=True) \
.map(func2, preservesPartitioning=True) \
.flatMap(df_to_dic, preservesPartitioning=True) \
.reduceByKey(func3) \
.map(func4, preservesPartitioning=True) \
.reduceByKey(func5) \
.map(write_to_db) \
.count()
Running time is about 8 hours.
But when I use a checkpoint and cache the RDD like this:
result = sc.parallelize(group, 4) \
.map(func_read, preservesPartitioning=True) \
.map(func2, preservesPartitioning=True) \
.flatMap(df_to_dic, preservesPartitioning=True) \
.reduceByKey(func3) \
.map(func4, preservesPartitioning=True) \
.reduceByKey(func5) \
.map(write_to_db)
result.cache()
result.checkpoint()
result.count()
The program runs in about 3 hours. Could you please explain how it is possible that, after caching the RDD and adding a checkpoint, the program runs faster?
Any help would be really appreciated.
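One way to see what actually changes between the two runs is to compare the RDD lineage before and after the checkpoint; checkpointing truncates the lineage, so later actions no longer recompute the whole chain. A minimal sketch for inspecting this (the toy pipeline below is only a stand-in for the real chain above):
rdd = sc.parallelize(range(100), 4) \
    .map(lambda x: (x % 10, x)) \
    .reduceByKey(lambda a, b: a + b)
print(rdd.toDebugString())   # full lineage, recomputed on every action unless cached
rdd.cache()
rdd.checkpoint()
rdd.count()                  # materializes the cache and writes the checkpoint
print(rdd.toDebugString())   # lineage is now cut off at the checkpoint
print(rdd.isCheckpointed())  # True once the checkpoint has been written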
When PySpark writeStream outputs in JSON format, the value it prints is as follows:
{"value":"{\"id\": 15, \"tarih\": \"xyz\", \"time\": \"23/01/2023 00:00:00\", \"temperature\": 31.99}"}
But I want it to look like this:
{"value":{"id": 15, "tarih": "xyz", "time":"23/01/2023 00:00:00", "temperature": 31.99}}
Is this possible?
import findspark
findspark.init()
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1 pyspark-shell'
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import functions
appName = "PySpark Datapipline KAFKA-CASANNDRA-REDIS"
master = "local"
spark = SparkSession \
.builder \
.appName(appName) \
.master(master) \
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
lines = spark \
.readStream \
.option('multiLine', True) \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "deneme4") \
.option("startingOffsets", "earliest") \
.option("includeHeaders", "true") \
.option("failOnDataLoss", False)\
.load()
df = lines.selectExpr("CAST(value AS STRING)")
query = df\
.writeStream\
.format("json") \
.option("path", "C:\\Users\\alper\\OneDrive\\Masaüstü\\fqndeneme\\fqn11") \
.option("checkpointLocation", "C:\\Users\\alper\\OneDrive\\Masaüstü\\fqndeneme11\\")\
.option("maxRecordsPerFile", 1)\
.start()
query.awaitTermination()
Unfortunately, I cannot define a fixed schema because the incoming data changes from time to time. I need to access the values inside value somehow; please help.
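One approach that seems to fit, since the keys vary, is to parse the value string into a map of strings with from_json; below is a minimal sketch, assuming a flat JSON payload (numeric values will come out as strings, and the path options are placeholders):
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import MapType, StringType

# Parse the Kafka value into a map so it is written as a nested JSON object
# rather than an escaped string. Assumes a flat payload; nested objects would
# need a different schema type.
parsed = lines.select(
    from_json(col("value").cast("string"),
              MapType(StringType(), StringType())).alias("value"))

query = parsed.writeStream \
    .format("json") \
    .option("path", "output_path") \
    .option("checkpointLocation", "checkpoint_path") \
    .start()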
Data: VOC 2012 augmented dataset. I am training on train_aug with 10,582 annotations.
Code: I am using the official DeepLabv3+ code. I didn't change anything except the bash script.
Pretrained weights: xception_65_imagenet_coco from the official model zoo.
So, this is my train.sh:
python "${WORK_DIR}"/train.py \
--logtostderr \
--train_split="train_aug" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size="513,513" \
--train_batch_size=15 \
--training_number_of_steps=30000 \
--fine_tune_batch_norm=False \
--num_clones=5 \
--base_learning_rate=0.007 \
--tf_initial_checkpoint="${COCO_PRE}/x65-b2u1s2p-d48-2-3x256-sc-cr300k_init.ckpt" \
--train_logdir="${TRAIN_LOGDIR}" \
--dataset_dir="${PASCAL_DATASET}"\
--initialize_last_layer=False
Result: I think it should be 82.2% with my configuration, but I got 80% on eval with OS=16 and 80.15% on eval with OS=8 after 30k steps.
So my question is: how can I get to 82.2%?
Edit (09.02.2019):
I have noticed that fine_tune_batch_norm=False, while the comment in train.py says:
"Set to True if one wants to fine-tune the batch norm parameters in DeepLabv3."
So I decided to try fine_tune_batch_norm=True, because training from scratch needs to update the BN parameters.
Edit (09/07):
Still not working with:
python "${WORK_DIR}"/train.py \
--logtostderr \
--train_split="train_aug" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size="513,513" \
--train_batch_size=15 \
--training_number_of_steps=100000 \
--fine_tune_batch_norm=true \
--num_clones=5 \
--base_learning_rate=0.007 \
--tf_initial_checkpoint="${COCO_PRE}/x65-b2u1s2p-d48-2-3x256-sc-cr300k_init.ckpt" \
--train_logdir="${TRAIN_LOGDIR}" \
--dataset_dir="${PASCAL_DATASET}"\
--initialize_last_layer=False
This time the result is even worse.
I reproduced the result and got 81.5%. I used 40,000 steps for the first round; I think 30,000 steps might be better.
Bash code:
python "${WORK_DIR}"/train.py \
--logtostderr \
--train_split="train_aug" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size="513,513" \
--train_batch_size=24 \
--base_learning_rate=0.007 \
--training_number_of_steps=30000 \
--fine_tune_batch_norm=true \
--num_clones=8 \
--tf_initial_checkpoint="${COCO_PRE}/x65-b2u1s2p-d48-2-3x256-sc-cr300k_init.ckpt" \
--train_logdir="${TRAIN_LOGDIR}" \
--dataset_dir="${PASCAL_DATASET}"\
--initialize_last_layer=true
python "${WORK_DIR}"/train.py \
--logtostderr \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size="513,513" \
--train_batch_size=24 \
--training_number_of_steps=60000 \
--fine_tune_batch_norm=false \
--num_clones=8 \
--base_learning_rate=0.01 \
--tf_initial_checkpoint="${COCO_PRE}/x65-b2u1s2p-d48-2-3x256-sc-cr300k_init.ckpt" \
--train_logdir="${TRAIN_LOGDIR}" \
--dataset_dir="${PASCAL_DATASET}"\
--initialize_last_layer=true
Reference: GitHub
I have trained DeepLab v3+ on the ADE20K dataset and have the trained checkpoints and event logs. But when I run eval.py and vis.py on ADE20K, I get the following error about shape:
Shape mismatch in tuple component 1. Expected [513,513,3], got [513,683,3]
These are my eval and vis scripts.
Eval script:
#!/bin/bash
cd ../
python deeplab/eval.py \
--logtostderr \
--eval_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--eval_crop_size=513 \
--eval_crop_size=513 \
--checkpoint_dir=deeplab/datasets/ADE20K/exp/train_on_train_set/train/ \
--eval_logdir=deeplab/datasets/ADE20K/exp/train_on_train_set/eval/ \
--dataset_dir=deeplab/datasets/ADE20K/tfrecord/ \
--max_number_of_iterations=1
Vis script:
#!/bin/bash
cd ../
python deeplab/vis.py \
--logtostderr \
--vis_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--vis_crop_size=513 \
--vis_crop_size=513 \
--checkpoint_dir=deeplab/datasets/ADE20K/exp/train_on_train_set/train/ \
--vis_logdir=deeplab/datasets/ADE20K/exp/train_on_train_set/vis/ \
--dataset_dir=deeplab/datasets/ADE20K/tfrecord/ \
--max_number_of_iterations=1
And my train script:
#!/bin/bash
cd ../
python deeplab/train.py \
--logtostderr \
--training_number_of_steps=150000 \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=513 \
--train_crop_size=513 \
--train_batch_size=2 \
--min_resize_value=513 \
--max_resize_value=513 \
--resize_factor=16 \
--dataset="ade20k" \
--tf_initial_checkpoint=deeplab/datasets/ADE20K/init_models/deeplabv3_xception_ade20k_train/model.ckpt.index \
--train_logdir=deeplab/datasets/ADE20K/exp/train_on_train_set/train \
--dataset_dir=deeplab/datasets/ADE20K/tfrecord/
Is there anything I set wrong?
Thanks for any help.
Make sure the arguments used in your sh-script match the arguments required by your current code version.
Not long ago you had to pass two separate values for the crop size, but the current implementation uses
--eval_crop_size="513,513" \
or
--vis_crop_size="513,513" \
(taken from here)
Hope this helps ;). If not, try printing the crop values in vis.py/eval.py and check whether they are passed correctly.
I have designed a Python SQLite API which interfaces with a GUI. The GUI allows the user to select a given column whose data will be summed for each month. From what I have learned from https://docs.python.org/2/library/sqlite3.html I know that the way I’ve written this makes my code vulnerable to an SQL injection attack; I’ve assembled my query using Python’s string operations. However, I am unable to make this module work doing it the “right” way; using the DB-API’s parameter substitution to put a “?” as a placeholder wherever you want to use a value. I’m guessing the issue is that I want to make a table column the variable and not a value. Please help me to restructure this module so that it is more secure and less vulnerable to an SQL injection attack.
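(For reference, this is the kind of value placeholder the docs describe; it substitutes values safely, but it cannot stand in for an identifier such as a column or table name. The variable name below is just illustrative.)
# '?' placeholders substitute values safely...
cursor = self.conn.execute(
    "SELECT SUBSTR(data_date,1,7), SUM(Installations_day_sum) "
    "FROM SS_Installations WHERE employee_clk_no = ?",
    (employee_clk_no,))
# ...but a '?' cannot replace a column or table name, which is why the query
# below cannot be parameterized in the same way.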
The code below works (it functions as I would like it to) I just know that it is not the correct/most secure way to do this.
def queryEntireCategoryAllEmployees(self, column):
table_column = 'Name_Data_AllDaySums.%s' % column
cursor = self.conn.execute("SELECT \
SUBSTR(data_date,1,7), \
SUM(%s) \
FROM ( \
SELECT \
SS_Installations.data_date AS 'data_date', \
SS_Installations.Installations_day_sum, \
SS_PM_Site_Visits.PM_Site_Visits_day_sum, \
SS_Rpr_Maint_Site_Visits.Inst_Repair_or_Maintenance_on_Site_day_sum, \
SS_Rmt_Hrdwr_Spt.Rmt_Hardware_Support_day_sum, \
SS_Rmt_Sftwr_Spt.Rmt_Software_Support_day_sum, \
SS_Rpr_Mant_RFB_in_House.Inst_Repair_Maint_Rfb_In_House_day_sum, \
Miscellaneous.Miscellaneous_day_sum, \
SS_Doc_Gen.Document_Generation_day_sum, \
SS_Inter_Dep_Spt.Inter_Dep_Spt_day_sum, \
SS_Online_Training.Online_Training_day_sum, \
SS_Onsite_Training.Onsite_Training_day_sum, \
SS_In_House_Training.In_House_Training_day_sum, \
Validation_Duties.Validation_Duties_day_sum \
FROM \
SS_Installations \
INNER JOIN SS_PM_Site_Visits ON \
SS_Installations.employee_clk_no = SS_PM_Site_Visits.employee_clk_no AND \
SS_Installations.data_date = SS_PM_Site_Visits.data_date \
INNER JOIN SS_Rpr_Maint_Site_Visits ON \
SS_Installations.employee_clk_no = SS_Rpr_Maint_Site_Visits.employee_clk_no AND \
SS_PM_Site_Visits.data_date = SS_Rpr_Maint_Site_Visits.data_date \
INNER JOIN SS_Rmt_Hrdwr_Spt ON \
SS_Installations.employee_clk_no = SS_Rmt_Hrdwr_Spt.employee_clk_no AND \
SS_Rpr_Maint_Site_Visits.data_date = SS_Rmt_Hrdwr_Spt.data_date \
INNER JOIN SS_Rmt_Sftwr_Spt ON \
SS_Installations.employee_clk_no = SS_Rmt_Sftwr_Spt.employee_clk_no AND \
SS_Rmt_Hrdwr_Spt.data_date = SS_Rmt_Sftwr_Spt.data_date \
INNER JOIN SS_Rpr_Mant_RFB_in_House ON \
SS_Installations.employee_clk_no = SS_Rpr_Mant_RFB_in_House.employee_clk_no AND \
SS_Rmt_Sftwr_Spt.data_date = SS_Rpr_Mant_RFB_in_House.data_date \
INNER JOIN Miscellaneous ON \
SS_Installations.employee_clk_no = Miscellaneous.employee_clk_no AND \
SS_Rpr_Mant_RFB_in_House.data_date = Miscellaneous.data_date \
INNER JOIN SS_Doc_Gen ON \
SS_Installations.employee_clk_no = SS_Doc_Gen.employee_clk_no AND \
Miscellaneous.data_date = SS_Doc_Gen.data_date \
INNER JOIN SS_Inter_Dep_Spt ON \
SS_Installations.employee_clk_no = SS_Inter_Dep_Spt.employee_clk_no AND \
SS_Doc_Gen.data_date = SS_Inter_Dep_Spt.data_date \
INNER JOIN SS_Online_Training ON \
SS_Installations.employee_clk_no = SS_Online_Training.employee_clk_no AND \
SS_Inter_Dep_Spt.data_date = SS_Online_Training.data_date \
INNER JOIN SS_Onsite_Training ON \
SS_Installations.employee_clk_no = SS_Onsite_Training.employee_clk_no AND \
SS_Online_Training.data_date = SS_Onsite_Training.data_date \
INNER JOIN SS_In_House_Training ON \
SS_Installations.employee_clk_no = SS_In_House_Training.employee_clk_no AND \
SS_Onsite_Training.data_date = SS_In_House_Training.data_date \
INNER JOIN Validation_Duties ON \
SS_Installations.employee_clk_no = Validation_Duties.employee_clk_no AND \
SS_In_House_Training.data_date = Validation_Duties.data_date \
WHERE \
(SS_Installations.Installations_day_sum != 0 OR \
SS_PM_Site_Visits.PM_Site_Visits_day_sum !=0 OR \
SS_Rpr_Maint_Site_Visits.Inst_Repair_or_Maintenance_on_Site_day_sum != 0 OR \
SS_Rmt_Hrdwr_Spt.Rmt_Hardware_Support_day_sum != 0 OR \
SS_Rmt_Sftwr_Spt.Rmt_Software_Support_day_sum != 0 OR \
SS_Rpr_Mant_RFB_in_House.Inst_Repair_Maint_Rfb_In_House_day_sum != 0 OR \
Miscellaneous.Miscellaneous_day_sum != 0 OR \
SS_Doc_Gen.Document_Generation_day_sum != 0 OR \
SS_Inter_Dep_Spt.Inter_Dep_Spt_day_sum != 0 OR \
SS_Online_Training.Online_Training_day_sum != 0 OR \
SS_Onsite_Training.Onsite_Training_day_sum != 0 OR \
SS_In_House_Training.In_House_Training_day_sum != 0 OR \
Validation_Duties.Validation_Duties_day_sum != 0)) Name_Data_AllDaySums \
GROUP BY SUBSTR(data_date,1,7) \
ORDER BY SUBSTR(data_date,1,7) ASC" % table_column)
dataList = cursor.fetchall()
return dataList
To start, I would read up on this incredibly informative SO post on preventing SQL injection in PHP, as many of the principles apply: How can I prevent SQL Injection in PHP?
Additionally, if you were working with a server-based database such as SQL Server, you could consider creating a stored procedure, running it with the EXEC command in T-SQL, and passing your column name as a parameter (since your query only changes dynamically based on the column), similar to this MSSQL Docs example, Execute a Stored Procedure, and this SO thread on dynamically changing a query based on a parameter, Can I Pass Column Name As Input...
Doing it this way helps obscure your code from prying eyes and also secures it against injection attacks, because you can validate that the input matches what you expect.
Finally, consider using a drop-down list of columns so that the end user can only pick from a pre-defined set of inputs, which makes your application even more secure. This approach, as well as keeping the query logic in a stored procedure, will also make it much easier to push out updates over time.
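Since SQLite itself has no stored procedures, a lightweight equivalent of that validation is to check the requested column name against a hard-coded whitelist before it is ever interpolated into the SQL text. A minimal sketch (the whitelist is abbreviated and the big subquery from the question is elided for brevity):
# Columns the GUI is allowed to request (abbreviated; extend with the full
# set of *_day_sum columns used in the query above).
ALLOWED_COLUMNS = {
    "Installations_day_sum",
    "PM_Site_Visits_day_sum",
    "Rmt_Hardware_Support_day_sum",
    "Miscellaneous_day_sum",
}

def queryEntireCategoryAllEmployees(self, column):
    # Reject anything not in the whitelist before it touches the SQL string;
    # actual values (if any) would still be passed with '?' placeholders.
    if column not in ALLOWED_COLUMNS:
        raise ValueError("Unknown column: %r" % column)
    query = ("SELECT SUBSTR(data_date,1,7), SUM(%s) "
             "FROM Name_Data_AllDaySums "   # stands in for the joined subquery above
             "GROUP BY SUBSTR(data_date,1,7) "
             "ORDER BY SUBSTR(data_date,1,7) ASC" % column)
    return self.conn.execute(query).fetchall()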
I have 5 item types that I have to parse thousands of files (approximately 20 KB to 75 KB each) for:
Item types:
SHA1 hashes
IP addresses
domain names
URLs (the full URL if possible)
email addresses
I currently use regex to find any items of these nature in thousands of files.
Python regex is taking a really long time, and I was wondering if there is a better method to identify these item types anywhere in my text-based flat files.
reSHA1 = r"([A-F]|[0-9]|[a-f]){40}"
reIPv4 = r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|\[\.\])){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
reURL = r"[A-Z0-9\-\.\[\]]+(\.|\[\.\])(XN--CLCHC0EA0B2G2A9GCD|XN--HGBK6AJ7F53BBA|" \
r"XN--HLCJ6AYA9ESC7A|XN--11B5BS3A9AJ6G|XN--MGBERP4A5D4AR|XN--XKC2DL3A5EE0H|XN--80AKHBYKNJ4F|" \
r"XN--XKC2AL3HYE2A|XN--LGBBAT1AD8J|XN--MGBC0A9AZCG|XN--9T4B11YI5A|XN--MGBAAM7A8H|XN--MGBAYH7GPA|" \
r"XN--MGBBH1A71E|XN--FPCRJ9C3D|XN--FZC2C9E2C|XN--YFRO4I67O|XN--YGBI2AMMX|XN--3E0B707E|XN--JXALPDLP|" \
r"XN--KGBECHTV|XN--OGBPF8FL|XN--0ZWM56D|XN--45BRJ9C|XN--80AO21A|XN--DEBA0AD|XN--G6W251D|XN--GECRJ9C|" \
r"XN--H2BRJ9C|XN--J6W193G|XN--KPRW13D|XN--KPRY57D|XN--PGBS0DH|XN--S9BRJ9C|XN--90A3AC|XN--FIQS8S|" \
r"XN--FIQZ9S|XN--O3CW4H|XN--WGBH1C|XN--WGBL6A|XN--ZCKZAH|XN--P1AI|MUSEUM|TRAVEL|AERO|ARPA|ASIA|COOP|" \
r"INFO|JOBS|MOBI|NAME|BIZ|CAT|COM|EDU|GOV|INT|MIL|NET|ORG|PRO|TEL|XXX|AC|AD|AE|AF|AG|AI|AL|AM|AN|AO|AQ|" \
r"AR|AS|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CC|CD|CF|CG|CH|CI|CK|" \
r"CL|CM|CN|CO|CR|CU|CV|CW|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|" \
r"GF|GG|GH|GI|GL|GM|GN|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|IO|IQ|IR|IS|IT|JE|JM|JO|" \
r"JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MK|ML|MM|MN|MO|" \
r"MP|MQ|MR|MS|MT|MU|MV|MW|MX|MY|MZ|NA|NC|NE|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|" \
r"PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SX|SY|SZ|TC|TD|TF|" \
r"TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|YE|YT|ZA|ZM|ZW)" \
r"(/\S+)"
reDomain = r"[A-Z0-9\-\.\[\]]+(\.|\[\.\])(XN--CLCHC0EA0B2G2A9GCD|XN--HGBK6AJ7F53BBA|XN--HLCJ6AYA9ESC7A|" \
r"XN--11B5BS3A9AJ6G|XN--MGBERP4A5D4AR|XN--XKC2DL3A5EE0H|XN--80AKHBYKNJ4F|XN--XKC2AL3HYE2A|" \
r"XN--LGBBAT1AD8J|XN--MGBC0A9AZCG|XN--9T4B11YI5A|XN--MGBAAM7A8H|XN--MGBAYH7GPA|XN--MGBBH1A71E|" \
r"XN--FPCRJ9C3D|XN--FZC2C9E2C|XN--YFRO4I67O|XN--YGBI2AMMX|XN--3E0B707E|XN--JXALPDLP|XN--KGBECHTV|" \
r"XN--OGBPF8FL|XN--0ZWM56D|XN--45BRJ9C|XN--80AO21A|XN--DEBA0AD|XN--G6W251D|XN--GECRJ9C|XN--H2BRJ9C|" \
r"XN--J6W193G|XN--KPRW13D|XN--KPRY57D|XN--PGBS0DH|XN--S9BRJ9C|XN--90A3AC|XN--FIQS8S|XN--FIQZ9S|" \
r"XN--O3CW4H|XN--WGBH1C|XN--WGBL6A|XN--ZCKZAH|XN--P1AI|MUSEUM|TRAVEL|AERO|ARPA|ASIA|COOP|INFO|JOBS|" \
r"MOBI|NAME|BIZ|CAT|COM|EDU|GOV|INT|MIL|NET|ORG|PRO|TEL|XXX|AC|AD|AE|AF|AG|AI|AL|AM|AN|AO|AQ|AR|AS|AT" \
r"|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CC|CD|CF|CG|CH|CI|CK|CL|CM|" \
r"CN|CO|CR|CU|CV|CW|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|" \
r"GH|GI|GL|GM|GN|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|IO|IQ|IR|IS|IT|JE|JM|JO|JP|" \
r"KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MK|ML|MM|MN|MO|MP" \
r"|MQ|MR|MS|MT|MU|MV|MW|MX|MY|MZ|NA|NC|NE|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|" \
r"PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SX|SY|SZ|TC|TD|TF" \
r"|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|YE|YT|ZA|" \
r"ZM|ZW)\b"
reEmail = r"\b[A-Za-z0-9._%+-]+(#|\[#\])[A-Za-z0-9.-]+(\.|\[\.\])(XN--CLCHC0EA0B2G2A9GCD|XN--HGBK6AJ7F53BBA|" \
r"XN--HLCJ6AYA9ESC7A|XN--11B5BS3A9AJ6G|XN--MGBERP4A5D4AR|XN--XKC2DL3A5EE0H|XN--80AKHBYKNJ4F|" \
r"XN--XKC2AL3HYE2A|XN--LGBBAT1AD8J|XN--MGBC0A9AZCG|XN--9T4B11YI5A|XN--MGBAAM7A8H|XN--MGBAYH7GPA|" \
r"XN--MGBBH1A71E|XN--FPCRJ9C3D|XN--FZC2C9E2C|XN--YFRO4I67O|XN--YGBI2AMMX|XN--3E0B707E|XN--JXALPDLP|" \
r"XN--KGBECHTV|XN--OGBPF8FL|XN--0ZWM56D|XN--45BRJ9C|XN--80AO21A|XN--DEBA0AD|XN--G6W251D|XN--GECRJ9C|" \
r"XN--H2BRJ9C|XN--J6W193G|XN--KPRW13D|XN--KPRY57D|XN--PGBS0DH|XN--S9BRJ9C|XN--90A3AC|XN--FIQS8S|" \
r"XN--FIQZ9S|XN--O3CW4H|XN--WGBH1C|XN--WGBL6A|XN--ZCKZAH|XN--P1AI|MUSEUM|TRAVEL|AERO|ARPA|ASIA|COOP|" \
r"INFO|JOBS|MOBI|NAME|BIZ|CAT|COM|EDU|GOV|INT|MIL|NET|ORG|PRO|TEL|XXX|AC|AD|AE|AF|AG|AI|AL|AM|AN|AO|AQ|" \
r"AR|AS|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CC|CD|CF|CG|CH|CI|CK" \
r"|CL|CM|CN|CO|CR|CU|CV|CW|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE" \
r"|GF|GG|GH|GI|GL|GM|GN|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|IO|IQ|IR|IS|IT|JE|JM|" \
r"JO|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MK|ML|MM|MN" \
r"|MO|MP|MQ|MR|MS|MT|MU|MV|MW|MX|MY|MZ|NA|NC|NE|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|PA|PE|PF|PG|PH|PK|PL|PM|" \
r"PN|PR|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SX|SY|SZ|TC" \
r"|TD|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|YE|YT|" \
r"ZA|ZM|ZW)\b"
I am using a
with open(file, 'r') as f:
    text = f.read()  # read the whole file so finditer can scan it
    for m in re.finditer(key, text, re.IGNORECASE):
        try:
            m = str(m).split('match=')[-1].split("'")[1]
            new_file.write(m + '\n')
        except:
            pass
method to open each file, find matches, and write them out to a new file.
Any assistance with speeding this up and making it more efficient would be greatly appreciated.
You probably want:
text = m.group(0)
print(text, file=new_file)
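Building on that, here is a minimal sketch of the whole loop with the patterns compiled once up front and m.group(0) used directly (file paths and variable names are illustrative):
import re

# Compile each pattern once, outside the per-file loop, so the engine does not
# re-parse these very long patterns for every file.
patterns = [re.compile(p, re.IGNORECASE)
            for p in (reSHA1, reIPv4, reURL, reDomain, reEmail)]

def extract_matches(path, out_path):
    with open(path, 'r') as f, open(out_path, 'a') as new_file:
        text = f.read()
        for pattern in patterns:
            for m in pattern.finditer(text):
                # m.group(0) is the matched text itself; no need to parse
                # the repr of the match object.
                print(m.group(0), file=new_file)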