I am a beginner in Python and would really appreciate it if someone could help me with the following:
I would like to run this script 10 times, changing the sub-batch (from 0 to 9) for each run.
E.g. the first run would be:
python $GWAS_TOOLS/gwas_summary_imputation.py \
-by_region_file $DATA/eur_ld.bed.gz \
-gwas_file $OUTPUT/harmonized_gwas/CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz \
-parquet_genotype $DATA/reference_panel_1000G/chr1.variants.parquet \
-parquet_genotype_metadata $DATA/reference_panel_1000G/variant_metadata.parquet \
-window 100000 \
-parsimony 7 \
-chromosome 1 \
-regularization 0.1 \
-frequency_filter 0.01 \
-sub_batches 10 \
-sub_batch 0 \
--standardise_dosages \
-output $OUTPUT/summary_imputation_1000G/CARDIoGRAM_C4D_CAD_ADDITIVE_chr1_sb0_reg0.1_ff0.01_by_region.txt.gz
The second run would be:
python $GWAS_TOOLS/gwas_summary_imputation.py \
-by_region_file $DATA/eur_ld.bed.gz \
-gwas_file $OUTPUT/harmonized_gwas/CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz \
-parquet_genotype $DATA/reference_panel_1000G/chr1.variants.parquet \
-parquet_genotype_metadata $DATA/reference_panel_1000G/variant_metadata.parquet \
-window 100000 \
-parsimony 7 \
-chromosome 1 \
-regularization 0.1 \
-frequency_filter 0.01 \
-sub_batches 10 \
-sub_batch 1 \
--standardise_dosages \
-output $OUTPUT/summary_imputation_1000G/CARDIoGRAM_C4D_CAD_ADDITIVE_chr1_sb1_reg0.1_ff0.01_by_region.txt.gz
I am sure this can be done with a loop, but I am not quite sure how to do it in Python.
Thank you so much for any advice,
Sally
While we can't show you how to retrofit a loop into the Python code without actually seeing it, you could simply use a shell loop to accomplish what you want without touching the Python code at all.
For bash shell, it would look like this:
for sub_batch in {0..9}; do
python $GWAS_TOOLS/gwas_summary_imputation.py \
-by_region_file $DATA/eur_ld.bed.gz \
-gwas_file $OUTPUT/harmonized_gwas/CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz \
-parquet_genotype $DATA/reference_panel_1000G/chr1.variants.parquet \
-parquet_genotype_metadata $DATA/reference_panel_1000G/variant_metadata.parquet \
-window 100000 \
-parsimony 7 \
-chromosome 1 \
-regularization 0.1 \
-frequency_filter 0.01 \
-sub_batches 10 \
-sub_batch $sub_batch \
--standardise_dosages \
-output $OUTPUT/summary_imputation_1000G/CARDIoGRAM_C4D_CAD_ADDITIVE_chr1_sb${sub_batch}_reg0.1_ff0.01_by_region.txt.gz
done
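Note that the output path interpolates sb${sub_batch} so each iteration writes to its own file; with a fixed file name, every iteration would overwrite the previous one.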
What you're looking for seems to be a way to run the same command on the command line a set number of times.
If you're on Linux using the bash shell, this can be done using a shell loop:
for i in {0..9}; do
python $GWAS_TOOLS/gwas_summary_imputation.py \
-by_region_file $DATA/eur_ld.bed.gz \
-gwas_file $OUTPUT/harmonized_gwas/CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz \
-parquet_genotype $DATA/reference_panel_1000G/chr1.variants.parquet \
-parquet_genotype_metadata $DATA/reference_panel_1000G/variant_metadata.parquet \
-window 100000 \
-parsimony 7 \
-chromosome 1 \
-regularization 0.1 \
-frequency_filter 0.01 \
-sub_batches 10 \
-sub_batch $i \
--standardise_dosages \
-output $OUTPUT/summary_imputation_1000G/CARDIoGRAM_C4D_CAD_ADDITIVE_chr1_sb${i}_reg0.1_ff0.01_by_region.txt.gz
done
If you're on Windows, something similar can be achieved using PowerShell (note that PowerShell continues lines with a backtick, not a backslash):
for ($i=0; $i -le 9; $i++) {
python $GWAS_TOOLS/gwas_summary_imputation.py `
-by_region_file $DATA/eur_ld.bed.gz `
-gwas_file $OUTPUT/harmonized_gwas/CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz `
-parquet_genotype $DATA/reference_panel_1000G/chr1.variants.parquet `
-parquet_genotype_metadata $DATA/reference_panel_1000G/variant_metadata.parquet `
-window 100000 `
-parsimony 7 `
-chromosome 1 `
-regularization 0.1 `
-frequency_filter 0.01 `
-sub_batches 10 `
-sub_batch $i `
--standardise_dosages `
-output $OUTPUT/summary_imputation_1000G/CARDIoGRAM_C4D_CAD_ADDITIVE_chr1_sb${i}_reg0.1_ff0.01_by_region.txt.gz
}
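(If GWAS_TOOLS, DATA, and OUTPUT are environment variables rather than PowerShell variables, reference them as $env:GWAS_TOOLS, $env:DATA, and $env:OUTPUT.)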
A loop in Python over the values 0 to 9 is very easy:
for i in range(0, 10):
    ...  # do stuff with i here
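If you would rather drive the existing script from Python itself, here is a minimal sketch using subprocess from the standard library (assuming the GWAS_TOOLS, DATA, and OUTPUT environment variables are set and the flag names are exactly as in your command):

import os
import subprocess

gwas_tools = os.environ["GWAS_TOOLS"]
data = os.environ["DATA"]
output = os.environ["OUTPUT"]

for sub_batch in range(10):
    # Build the argument list; each flag and its value are separate elements.
    cmd = [
        "python", f"{gwas_tools}/gwas_summary_imputation.py",
        "-by_region_file", f"{data}/eur_ld.bed.gz",
        "-gwas_file", f"{output}/harmonized_gwas/CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz",
        "-parquet_genotype", f"{data}/reference_panel_1000G/chr1.variants.parquet",
        "-parquet_genotype_metadata", f"{data}/reference_panel_1000G/variant_metadata.parquet",
        "-window", "100000",
        "-parsimony", "7",
        "-chromosome", "1",
        "-regularization", "0.1",
        "-frequency_filter", "0.01",
        "-sub_batches", "10",
        "-sub_batch", str(sub_batch),
        "--standardise_dosages",
        "-output", (f"{output}/summary_imputation_1000G/"
                    f"CARDIoGRAM_C4D_CAD_ADDITIVE_chr1_sb{sub_batch}_reg0.1_ff0.01_by_region.txt.gz"),
    ]
    subprocess.run(cmd, check=True)  # raises CalledProcessError if a run fails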
Related
I am trying to insert 30 million records from Databricks into Azure SQL. The SPN token is only valid for 65 minutes, so a single bulk insert does not complete in time. I am therefore inserting the data in batches of 2 million records and generating a new token for each batch, but I still get the same error after 4 batches (after inserting 6 million records; at around the 1:30 hr mark it fails).
Error: Token Expired
import datetime

table_name = "TABLE NAME"
if count <= 2000000:
    access_token, connection_string = Service_Principal()
    df.write.format("jdbc") \
        .mode("append") \
        .option("url", jdbcUrl) \
        .option("dbtable", table_name) \
        .option("accessToken", access_token) \
        .option("encrypt", "true") \
        .option("hostNameInCertificate", "") \
        .option("driver", "") \
        .save()
else:
    chunk = 2000000
    id1 = 0
    id2 = chunk
    c = count
    while id1 < c:
        print("Insertion STARTED at : " + str(datetime.datetime.now()))
        # select only the current 2M-row slice
        stop_df = final_df.filter((final_df.id_tmp < id2) & (final_df.id_tmp >= id1))
        access_token, connection_string = Service_Principal()  # new token per batch
        stop_df.write.format("jdbc") \
            .mode("append") \
            .option("url", jdbcUrl) \
            .option("dbtable", table_name) \
            .option("accessToken", access_token) \
            .option("encrypt", "true") \
            .option("hostNameInCertificate", "") \
            .option("driver", "") \
            .save()
        print("Insertion COMPLETED at : " + str(datetime.datetime.now()))
        id1 += chunk
        id2 += chunk
How can we close the JDBC connection after each batch, or refresh the SPN token for each batch?
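For what it's worth, Spark's built-in JDBC writer opens and closes its own connections for each save job, so there is usually no long-lived connection to close by hand; the important thing is to request the token immediately before each write so its 65-minute validity window covers only that batch. A minimal sketch of that pattern (assuming Service_Principal() really does mint a fresh token on every call, and that final_df, count, jdbcUrl, and table_name are defined as above):

chunk = 2000000
for id1 in range(0, count, chunk):
    # take only the current slice of rows
    batch_df = final_df.filter(
        (final_df.id_tmp >= id1) & (final_df.id_tmp < id1 + chunk))
    access_token, _ = Service_Principal()  # minted right before the write
    batch_df.write.format("jdbc") \
        .mode("append") \
        .option("url", jdbcUrl) \
        .option("dbtable", table_name) \
        .option("accessToken", access_token) \
        .option("encrypt", "true") \
        .save()

If the error persists even with a fresh token per batch, it is worth checking whether Service_Principal() caches and returns the same token across calls.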
I want to read data from a Postgres DB in one-hour intervals, and I want the process to run every hour. How can I do that? I have attached my code snippet. I am unable to use readStream with the JDBC source.
df = spark.read \
.format("jdbc") \
.option("url", URL) \
.option("dbtable", "tagpool_with_tag_raw") \
.option("user", "tsdbadmin") \
.option("password", "cgqu5qss2zy3i1") \
.option("driver", "org.postgresql.Driver") \
.load()
# Getting the current date and time
dt = datetime.datetime.now(timezone.utc)
utc_time = dt.replace(tzinfo=timezone.utc)
utc_timestamp = utc_time.timestamp()
epoch = round(utc_timestamp / 60) * 60
# epoch = epoch+3600
print("epoch ", epoch)
df.createOrReplaceTempView("tagpool_with_tag_raw")
x = spark.sql("""select * from tagpool_with_tag_raw""")
x.show()
query = spark.sql("select * from tagpool_with_tag_raw WHERE input_time = " + str(epoch)) # .format()
# query = spark.sql("select CAST(input_time AS bigint), CAST(orig_time AS bigint) , from tagpool_with_tag_raw WHERE input_time = "+ epoch) #.format()
query.show()
# df.selectExpr(("SELECT * FROM public.tagpool_raw WHERE input_time<= %s".format(epoch)))
df.printSchema()
query.write \
.format("jdbc") \
.option("url", URL) \
.option("dbtable", "tagpool_tag_raw") \
.option("user", USER) \
.option("password", PW) \
.option("driver", DRIVER).save()
readStream is not available for JDBC, as JDBC is a batch source. You will have to keep a process just like the one you have and use a scheduler (AutoSys, Oozie, or whatever your enterprise uses) to run it every hour.
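If no enterprise scheduler is available, a minimal in-process sketch (plain Python; job() is a hypothetical wrapper around the read, filter, and write code above) could be:

import time

def job():
    # re-run the JDBC read, the epoch filter, and the JDBC write from above
    ...

while True:
    job()
    time.sleep(3600)  # wake up once an hour

In practice a cron entry or an orchestration tool is more robust, since it survives process restarts.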
I'm trying to print the following image via Python:
print("""
____
.\ /
|\\ //\
/ \\// \
/ / \ \
/ / \ \
/ / \ \
/ /______^ \ \
/ ________\ \ \
/ / \ \
/\\ / \ //\
/__\\_\ /_//__\
""")
input()
Output:
____
.\ /
|\ // / \// / / \ / / \ / / \ \
/ /______^ \ / ________\ \ / / \ /\ / \ ///__\_\ /_//__
I hope someone can help me solve this problem.
The backslashes at the ends of the lines escape the newlines (and pairs like \\ collapse to a single \); change it to a raw string with r"...":
print(r"""
____
.\ /
|\\ //\
/ \\// \
/ / \ \
/ / \ \
/ / \ \
/ /______^ \ \
/ ________\ \ \
/ / \ \
/\\ / \ //\
/__\\_\ /_//__\
""")
input()
I have a bytearray, and when I list the array, I get the following data: (b'v10\xc73\x9a&\x9edv\x19\xc3B\xbf\x95\xc8\xd8\x9dN\x8f\xe9\x90J\xax>r1\x1d\xa7\x1fU\x90\xe2(|p\xf1\x02\xbdw\xb8\xb9\xf3\x0e\xb2n\xc7',).
And I need to decrypt this data. But the decryption function only accepts data like b'v10\xc73\x9a&\x9edv\x19\xc3B\xbf\x95\xc8\xd8\x9dN\x8f\xe9\x90J\xax>r1\x1d\xa7\x1fU\x90\xe2(|p\xf1\x02\xbdw\xb8\xb9\xf3\x0e\xb2n\xc7', without the parentheses and the comma.
What can I do?
Supposing we have
data = (b'foo',)
then this data is not a bytearray, nor is it a bytes object:
>>> type(data)
<class 'tuple'>
Because it is a tuple, we may extract that element:
>>> data[0]
b'foo'
>>> type(data[0])
<class 'bytes'>
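So in your case, pass data[0] (the bytes object itself) to the decryption function instead of the whole tuple.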
I am creating a population model featuring education.
I start with an initial picture of the population that gives the number of people in each age group (0 to 95) and at each level of education (0, no education, through 6, university).
This picture is treated as a column of a dataframe that is iteratively populated, one new column per forecast year.
To populate it there are assumptions such as the mortality rate of each age group, and the enrollment and success rates of each education level.
The way I solved the problem is by adding a new column and iterating through the rows, using the value for age-1 from the previous year to compute the new value (e.g. the number of males aged 5 is the number of males aged 4 in the previous year, less the ones that died).
The problem with this solution is that iterating through pandas dataframe rows using for loops and .loc is very inefficient, and it takes a lot of time to compute the forecast:
def add_year_temp(pop_table, time,
                  old_year, new_year,
                  enrollment_rate_primary,
                  success_rate_primary,
                  enrollment_rate_1st_cycle,
                  success_rate_1st_cycle,
                  enrollment_rate_2nd_cycle,
                  success_rate_2nd_cycle,
                  enrollment_rate_3rd_cycle,
                  success_rate_3rd_cycle,
                  enrollment_rate_university,
                  success_rate_university,
                  mortality_rate_0_1,
                  mortality_rate_2_14,
                  mortality_rate_15_64,
                  mortality_rate_65,
                  mortality_mf_ratio,
                  enrollment_mf_ratio,
                  success_mf_ratio):
    temp_table = pop_table
    temp_table['year_ts'] = pd.to_datetime(temp_table[time])
    temp_table['lag'] = temp_table.groupby(['sex', 'schooling'])[old_year].shift(+1)
    temp_table = temp_table.fillna(0)
    for age in temp_table['age'].unique():
        for sex in temp_table['sex'].unique():
            mortality_mf_ratio_temp = 1
            enrollment_mf_ratio_temp = 1
            success_mf_ratio_temp = 1
            if sex == 'F':
                mortality_mf_ratio_temp = mortality_mf_ratio
                enrollment_mf_ratio_temp = enrollment_mf_ratio
                success_mf_ratio_temp = success_mf_ratio
            if age <= 1:
                for schooling in [0]:
                    temp_table.loc[(temp_table['age'] == age)
                                   & (temp_table['sex'] == sex)
                                   & (temp_table['schooling'] == schooling), 'lag'] = \
                        float(temp_table[(temp_table['age'] == age)
                                         & (temp_table['sex'] == sex)
                                         & (temp_table['schooling'] == schooling)]['lag']) \
                        * (1 - mortality_rate_0_1 * mortality_mf_ratio_temp)
            elif 1 < age <= 5:
                for schooling in [0]:
                    temp_table.loc[(temp_table['age'] == age)
                                   & (temp_table['sex'] == sex)
                                   & (temp_table['schooling'] == schooling), 'lag'] = \
                        float(temp_table[(temp_table['age'] == age)
                                         & (temp_table['sex'] == sex)
                                         & (temp_table['schooling'] == schooling)]['lag']) \
                        * (1 - mortality_rate_2_14 * mortality_mf_ratio_temp)
A lot of lines later, you can see how, for example, I define the people that finish high school and enter university:
            elif 15 < age <= 17:
                for schooling in [0, 1, 2, 3, 4]:
                    temp_table.loc[(temp_table['age'] == age)
                                   & (temp_table['sex'] == sex)
                                   & (temp_table['schooling'] == schooling), 'lag'] = \
                        float(temp_table[(temp_table['age'] == age - 1)
                                         & (temp_table['sex'] == sex)
                                         & (temp_table['schooling'] == schooling)][old_year]) \
                        * (1 - mortality_rate_15_64 * mortality_mf_ratio_temp)
            elif age == 18:
                for schooling in [0, 1, 2, 3, 4, 5]:
                    if schooling == 0:
                        temp_table.loc[(temp_table['age'] == age)
                                       & (temp_table['sex'] == sex)
                                       & (temp_table['schooling'] == schooling), 'lag'] = \
                            float(temp_table[(temp_table['age'] == age)
                                             & (temp_table['sex'] == sex)
                                             & (temp_table['schooling'] == schooling)]['lag']) \
                            * (1 - mortality_rate_15_64 * mortality_mf_ratio_temp)
                    elif schooling == 1:
                        temp_table.loc[(temp_table['age'] == age)
                                       & (temp_table['sex'] == sex)
                                       & (temp_table['schooling'] == schooling), 'lag'] = \
                            float(temp_table[(temp_table['age'] == age - 1)
                                             & (temp_table['sex'] == sex)
                                             & (temp_table['schooling'] == schooling)][old_year]) \
                            * (1 - mortality_rate_15_64 * mortality_mf_ratio_temp)
                    elif schooling == 2:
                        temp_table.loc[(temp_table['age'] == age)
                                       & (temp_table['sex'] == sex)
                                       & (temp_table['schooling'] == schooling), 'lag'] = \
                            float(temp_table[(temp_table['age'] == age - 1)
                                             & (temp_table['sex'] == sex)
                                             & (temp_table['schooling'] == schooling)][old_year]) \
                            * (1 - mortality_rate_15_64 * mortality_mf_ratio_temp)
                    elif schooling == 3:
                        temp_table.loc[(temp_table['age'] == age)
                                       & (temp_table['sex'] == sex)
                                       & (temp_table['schooling'] == schooling), 'lag'] = \
                            float(temp_table[(temp_table['age'] == age - 1)
                                             & (temp_table['sex'] == sex)
                                             & (temp_table['schooling'] == schooling)][old_year]) \
                            * (1 - mortality_rate_15_64 * mortality_mf_ratio_temp)
                    elif schooling == 4:
                        temp_table.loc[(temp_table['age'] == age)
                                       & (temp_table['sex'] == sex)
                                       & (temp_table['schooling'] == schooling), 'lag'] = \
                            float(temp_table[(temp_table['age'] == age - 1)
                                             & (temp_table['sex'] == sex)
                                             & (temp_table['schooling'] == schooling)][old_year]) \
                            * (1 - mortality_rate_15_64 * mortality_mf_ratio_temp) \
                            * (1 - enrollment_rate_3rd_cycle * enrollment_mf_ratio_temp
                               * success_rate_3rd_cycle * success_mf_ratio_temp)
                    elif schooling == 5:
                        temp_table.loc[(temp_table['age'] == age)
                                       & (temp_table['sex'] == sex)
                                       & (temp_table['schooling'] == schooling), 'lag'] = \
                            float(temp_table[(temp_table['age'] == age - 1)
                                             & (temp_table['sex'] == sex)
                                             & (temp_table['schooling'] == schooling - 1)][old_year]) \
                            * (1 - mortality_rate_15_64 * mortality_mf_ratio_temp) \
                            * (enrollment_rate_3rd_cycle * enrollment_mf_ratio_temp
                               * success_rate_3rd_cycle * success_mf_ratio_temp)
And this continues for all age groups
As I said, it does work, but this is neither elegant nor fast...
Without a verifiable example of input and expected output (see https://stackoverflow.com/help/mcve) it is hard to be specific, but you can either use apply:
temp_table['mortality_mf_ratio'] = temp_table.apply(lambda row: some_function_per_row(row), axis=1)
Or you could use np.where (https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html):
temp_table['mortality_mf_ratio'] = np.where(temp_table['sex'] == 'F', 1, 0)
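To give a flavour of the vectorised approach, here is a minimal sketch of the mortality/ageing step only, building on the groupby(...).shift(+1) line you already have (add_year_vectorised is a hypothetical name, and the schooling transitions at ages 15-18 would still need their own vectorised step):

import numpy as np
import pandas as pd

def add_year_vectorised(pop, old_year, new_year,
                        mortality_rate_0_1, mortality_rate_2_14,
                        mortality_rate_15_64, mortality_rate_65,
                        mortality_mf_ratio):
    # Sort so that shift(1) within each (sex, schooling) group picks up age-1.
    pop = pop.sort_values(['sex', 'schooling', 'age']).reset_index(drop=True)
    # Sex-specific multiplier for every row at once (replaces the if sex == 'F' branch).
    mf = np.where(pop['sex'] == 'F', mortality_mf_ratio, 1.0)
    # Age-band mortality for every row at once (replaces the if/elif ladder on age).
    mortality = np.select(
        [pop['age'] <= 1, pop['age'] <= 14, pop['age'] <= 64],
        [mortality_rate_0_1, mortality_rate_2_14, mortality_rate_15_64],
        default=mortality_rate_65,
    )
    # Everyone ages one year: last year's count at age-1 in the same group.
    lag = pop.groupby(['sex', 'schooling'])[old_year].shift(1).fillna(0)
    pop[new_year] = lag * (1 - mortality * mf)
    return pop

The enrollment and graduation flows between schooling levels could be handled the same way, with an extra shift along the schooling level and the enrollment/success rates applied as additional vectors.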