Add new column to a HuggingFace dataset inside a dictionary - python

I have a tokenized dataset named tokenized_datasets.
I want to add a column named 'labels' that is a copy of 'input_ids' within the features. I'm aware of the following method from the post Add new column to a HuggingFace dataset:
new_dataset = dataset.add_column("labels", tokenized_datasets['input_ids'].copy())
But I first need to access the DatasetDict. This is what I have so far, but it doesn't seem to do the trick:
def new_column(example):
    example["labels"] = example["input_ids"].copy()
    return example

dataset_new = tokenized_datasets.map(new_column)
KeyError: 'input_ids'

Try one of the two options below:
# first option
def new_column(example):
    return {"labels": example["input_ids"]}

# second option
def new_column(example):
    example["labels"] = example["input_ids"]
    return example

dataset_new = tokenized_datasets.map(new_column)
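If the obstacle is that tokenized_datasets is a DatasetDict rather than a single Dataset, add_column can also be applied per split. A minimal sketch, assuming the usual split structure:

from datasets import DatasetDict

# apply add_column to every split; the split names depend on your data
dataset_new = DatasetDict({
    split: ds.add_column("labels", ds["input_ids"].copy())
    for split, ds in tokenized_datasets.items()
})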

Related

How to loop through a list of BQ tables in Python?

I've a list of BQ tables that I'd like to use one at a time. The purpose is to process each table individually, perform some action (in my example, score the dataset for a previously fitted model), then compute, append, and save the probabilities in the all_scores list.
Here's the entire code snippet:
# List of BQ tables
scoring_tables = ["`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_01`",
                  "`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_02`",
                  "`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_03`",
                  "`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_04`",
                  "`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_05`"]

# List to store probabilities/scores
all_scores = []

# Loop through each BQ table; calculate, append, and store the probabilities in all_scores
for t in scoring_tables:
    %%bigquery property_data_score_00
    SELECT * FROM t

    score_data = property_data_score_00.copy()
    score_data.set_index('HH_ID', inplace=True)

    # Fix HOUSE_INCOME = 0 and AGE = 0 based on means
    score_data['HOUSE_INCOME'] = np.where(score_data['HOUSE_INCOME'] == 0, 107, score_data['HOUSE_INCOME'])
    score_data['AGE'] = np.where(score_data['AGE'] == 0, 54, score_data['AGE'])

    # Recategorize PROP_EXTR_WALL_TYPE | PROP_GRG_TYPE | PROP_ROOF_TYPE
    condition = [(score_data['PROP_EXTR_WALL_TYPE'].str.contains("BRICK")),
                 (score_data['PROP_EXTR_WALL_TYPE'].str.contains("WOOD")),
                 (score_data['PROP_EXTR_WALL_TYPE'].str.contains("CONCRETE")),
                 (score_data['PROP_EXTR_WALL_TYPE'].str.contains("METAL")),
                 (score_data['PROP_EXTR_WALL_TYPE'].str.contains("STEEL"))]
    choice = ["BRICK", "WOOD", "CONCRETE", "METAL", "METAL"]
    score_data['PROP_EXTR_WALL_TYPE_MOD'] = np.select(condition, choice, default="OTHERS")

    condition = [(score_data['PROP_GRG_TYPE'].str.contains("ATTACHED")),
                 (score_data['PROP_GRG_TYPE'].str.contains("DETACHED")),
                 (score_data['PROP_GRG_TYPE'].str.contains("CARPORT")),
                 (score_data['PROP_GRG_TYPE'].str.contains("BASEMENT"))]
    choice = ["ATTACHED", "DETACHED", "CARPORT", "BASEMENT"]
    score_data['PROP_GRG_TYPE_MOD'] = np.select(condition, choice, default="OTHERS")

    condition = [(score_data['PROP_ROOF_TYPE'].str.contains("GABLE")),
                 (score_data['PROP_ROOF_TYPE'].str.contains("HIP")),
                 (score_data['PROP_ROOF_TYPE'].str.contains("GAMBREL"))]
    choice = ["GABLE", "HIP", "GAMBREL"]
    score_data['PROP_ROOF_TYPE_MOD'] = np.select(condition, choice, default="OTHERS")

    # One-hot encoding
    to_encode = ["IND_ETHNICITY", "IND_GENDER", "IND_MOVERS_FLAG", "IND_OCCUPATION", "IND_REGION",
                 "PROP_EXTR_WALL_TYPE", "PROP_GRG_TYPE", "PROP_ROOF_TYPE", "TSI"]
    score_data_dm = pd.get_dummies(data=score_data, columns=to_encode, drop_first=False)

    # Columns that pd.get_dummies might not produce if some categories are absent from the scoring data
    columns_not_in_score_data_dm = [c for c in train_X.columns if c not in score_data_dm.columns]
    score_data_dm[columns_not_in_score_data_dm] = 0  # initialize the missing columns as 0

    score_data_dm_filt = score_data_dm[select_columns]  # keep only the columns that are in train_X

    y_pred = xgb_prop_PV.predict_proba(score_data_dm_filt)[:, 1]  # final scoring
    all_scores.extend(y_pred)  # append this table's scores to the list
Inside the loop, I'm having trouble with the SELECT * FROM t step, which raises an error. I believe the indentation within the loop is causing the %%bigquery step to fail. I looked at itertools, but it appears to be useful only for conditional looping, which is not the case here.
Also, this appears to be a complex approach; is there a more elegant solution? The table was too large (600 GB) to process at once, so it needed to be split into smaller datasets, which is why we tried this method. PS: It works without the loop if run for one table at a time, but that's quite a manual effort.
Thanks,
Piyush
To answer the error message: if you are going to pass a variable into a query using the BigQuery magics command, you need to use the --params flag.
Your code should end up looking something like this:
t="table_name"
my_params = {"t": t}
%%bigquery --params $my_params
select #t
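Note that a cell magic such as %%bigquery must be the first line of its own notebook cell, which is also why it cannot sit inside a Python for loop. In addition, query parameters substitute values rather than identifiers, so a table name itself cannot be passed this way. A fuller sketch of the pattern, with placeholder names, might look like this (split across two cells):

# Cell 1 (plain Python): build the parameter dictionary
my_params = {"min_score": 0.5}

# Cell 2 (the magic must be the very first line of the cell):
#   %%bigquery results_df --params $my_params
#   SELECT * FROM `my-project.my_dataset.my_table`
#   WHERE score >= @min_score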
You may consider the approach below. Instead of using BigQuery magics, use the BigQuery Client Library for Python. From there, you can loop over your list of tables using an f-string, as shown in the sample code below.
from google.cloud import bigquery

bqclient = bigquery.Client()

scoring_tables = ["`your-project-id.your-dataset.test_table1`",
                  "`your-project-id.your-dataset.test_table2`",
                  "`your-project-id.your-dataset.test_table3`"]

for t in scoring_tables:
    # Download query results.
    query_string = f"""
    SELECT * FROM {t}
    """
    property_data_score_00 = (
        bqclient.query(query_string)
        .result()
        .to_dataframe(
            # Optionally, explicitly request to use the BigQuery Storage API. As of
            # google-cloud-bigquery version 1.26.0 and above, the BigQuery Storage
            # API is used by default.
            create_bqstorage_client=True,
        )
    )
    score_data = property_data_score_00.copy()
    score_data.set_index('HH_ID', inplace=True)
    print(score_data)
The code above prints each table's contents as a pandas DataFrame.
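To fold the question's scoring back in, each table's probabilities can be collected inside the same loop. A minimal sketch, where score_one_table is a hypothetical helper wrapping the question's preprocessing and the xgb_prop_PV.predict_proba call:

all_scores = []
for t in scoring_tables:
    frame = bqclient.query(f"SELECT * FROM {t}").result().to_dataframe()
    y_pred = score_one_table(frame)  # hypothetical helper: preprocessing + predict_proba(...)[:, 1]
    all_scores.extend(y_pred)        # extend appends the array's elements to the list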

Applying pre-trained facebook/bart-large-cnn for text summarization in Python on a dataframe column

I am working with Hugging Face transformers (summarizers) and have gained some insight into them. I am using the facebook/bart-large-cnn model to perform text summarization, and I am running the code below:
from transformers import pipeline
summarizer = pipeline("summarization")
text= "Good Morning team, I need a help in terms of one of the functions that needs to be written on the servers.. please let me know wen are you available.. Thanks , hgjhghjgjh, 193-6757-568"
print(summarizer(str(text), min_length=int(0.1 * len(str(text))), max_length=int(0.2 * len(str(text))), do_sample=False))
But my question is: how can I apply the same pre-trained model to a dataframe column? My dataframe looks like this:
ID Text
1 some long text here...
2 some long text here...
3 some long text here...
.... and so on for 100K rows
Now I want to apply the pre-trained model to the column Text to generate a new column df['Summary_Text'], and the resultant dataframe should look like:
ID Text Summary_Text
1 some long text here... Text summary goes here...
2 some long text here... Text summary goes here...
3 some long text here... Text summary goes here...
How can I get this? Any quick help would be highly appreciated.
I am working along the same lines, trying to summarize news articles.
You can pass either a string or a list of strings to the model. First convert your dataframe's 'Text' column to a list:
input_col = df['Text'].to_list()
Then feed it to your model:
from transformers import pipeline

summarizer = pipeline("summarization")
# fixed length bounds here; the question's per-string length ratios only make sense one text at a time
res = summarizer(input_col, min_length=30, max_length=130, do_sample=False)
print(res[0]['summary_text'])
This gives back a list and prints only its first element. You can iterate over the list (res[1]['summary_text'], res[2]['summary_text'], and so on), store the values, and add them back as a dataframe column:
df_res = []
for i in range(len(res)):
    df_res.append(res[i]['summary_text'])
df['Summary_Text'] = df_res
Pass truncation=True to the summarizer (where you pass min_length etc.) if your articles are long.
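For instance, the same call with truncation enabled might look like this (the fixed length bounds are placeholders):

res = summarizer(input_col, min_length=30, max_length=130, truncation=True, do_sample=False)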
This will take a long time on CPU. I myself am looking for faster alternatives; for me, XL_net is a usable option for now. Hope this helps!
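As an aside, if a GPU is available the pipeline can be placed on it, which usually speeds things up considerably. A minimal sketch, where device index 0 (the first GPU) is an assumption:

from transformers import pipeline

# device=0 runs the model on the first GPU; omit the argument to stay on CPU
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=0)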
This is my code to iterate through Excel rows, reading from one column and writing the summarization into another column; I hope this can help you:
from transformers import pipeline
import openpyxl

path = "workbook.xlsx"  # path to the Excel file (placeholder name)
wb = openpyxl.load_workbook(path, read_only=False)
ws = wb["sheet"]
bart_summarizer = pipeline("summarization")

for row in ws.iter_rows(min_col=8, min_row=2, max_col=8, max_row=5):
    for cell in row:
        TEXT_TO_SUMMARIZE = cell.value
        summary = bart_summarizer(TEXT_TO_SUMMARIZE, min_length=10, max_length=100)
        # the pipeline returns a list of dicts; store the summary text itself
        ws.cell(row=cell.row, column=10).value = summary[0]['summary_text']

wb.save(path)

How to loop through a few lines

I have a question about how to loop over a few lines:
get_sol is a function that takes two parameters: def get_sol(sub_dist_fil, fos_cnt)
banswara, palwal, and hathin are some values of a column named "sub-district".
The second argument, 1, is fixed.
I am writing it as:
out_1 = get_sol( "banswara",1)
out_1 = get_sol("palwal",1)
out_1 = get_sol("hathin",1)
How can I apply a for loop to these lines in order to get the results in one go? Help!
"FEW COMMENTS HAVE HELPED ME IN ACHIEVING MY RESULTS (THANKS ALOT)". THE RESULT IS AS FOLLOW :
NOW I HAVE A QUERY THAT HOW DO I DISPLAY/PRINT THE NAME OF RESPECTIVE DISTRICT FOR WHICH THE RESULTS ARE RUNNING???????
Well, in the general case you can do something like this:
data = ['banswara', 'palwal', 'hathin']
result = {}
for item in data:
    result[item] = get_sol(item, 1)
print(result)
This packs your results into a dictionary, letting you see which result was generated for which input.
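That also answers the follow-up about printing the district names, since they are the dictionary keys:

for district, res in result.items():
    print(district, res)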
here you go:
# save the values into a list (assuming the column lives in a dataframe df)
random_values = df["sub-district"]
# iterate through using a for loop
for random_value in random_values:
    # get the result
    result = get_sol(random_value, 1)
    # print the result or do whatever
    # you want with the result
    print(result)
Similar to the other answers, but using a list comprehension to make it more pythonic (and usually faster):
districts = ['banswara', 'palwal', 'hathin']
result = [get_sol(item, 1) for item in districts]
I think you are trying to apply get_sol to random values from the 'subdistrict' column.
For the purpose of illustration, let the dataframe be df (so the column is df['subdistrict']):
import numpy as np

# selecting 10 random values from the particular column
[print(get_sol(x, 1)) for x in np.random.choice(df['subdistrict'], 10)]
Here is the official documentation for numpy.random.choice.

Rename a data frame by adding the iteration value as a suffix in a for loop (Python)

I have run the following Python code:
array = ['AEM000', 'AID017']
USA_DATA_1D = USA_DATA10.loc[USA_DATA10['JOBSPECIALTYCODE'].isin(array)]
I run a regression model and extract the log-likelihood value for each item of this array in a for loop:
for item in array:
    USA_DATA_1D = USA_DATA10.loc[USA_DATA10['JOBSPECIALTYCODE'] == item]
    formula = "WEIGHTED_BASE_MEDIAN_FINAL_MEAN ~ YEAR"
    response, predictors = dmatrices(formula, USA_DATA_1D, return_type='dataframe')
    mod1 = sm.GLM(response, predictors, family=sm.genmod.families.family.Gaussian()).fit()
    LLF_NG = {'model': ['Standard Gaussian'],
              'llf_value': mod1.llf}
    df_llf = pd.DataFrame(LLF_NG, columns=['model', 'llf_value'])
Now I would like to rename the dataframe df_llf to df_llf_<name of the item>, i.e. df_llf_AEM000 when the loop runs on the first item and df_llf_AID017 when it runs on the second one.
I need some help with how to proceed.
If you want to rename the data frame, use the copy method so that the original data frame does not get altered:
df_llf_AEM000 = df_llf.copy()
If you want to iteratively save several different versions of the original data frame, you can do something like this:
allDataframes = []
for i in range(10):
    df = df_original.copy()
    allDataframes.append(df)
print(allDataframes[0])
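Since Python variable names cannot cleanly be created dynamically, a dictionary keyed by the item name is the usual alternative. A minimal sketch built on the question's loop (the model-fitting lines are elided):

df_llf_by_item = {}
for item in array:
    # ... fit the model and build df_llf as in the question ...
    df_llf_by_item[item] = df_llf.copy()

print(df_llf_by_item['AEM000'])  # the per-item dataframe, instead of a variable named df_llf_AEM000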

Create Forecasts Looping over SKUs and Export to CSV using Facebook Prophet

I am new to Python, so please bear with me.
I am trying to convert what I think may be a nested dictionary into a CSV that I can export. Below is my code:
import pandas as pd
import os
from fbprophet import Prophet

# Read in file
df1 = pd.read_csv('File_Path.csv')

# Create loop to forecast multiple SKUs
def get_prediction(df):
    prediction = {}
    df1 = df.rename(columns={'Date': 'ds', 'qty_ordered': 'y', 'item_no': 'item'})
    list_items = df1.item.unique()
    for item in list_items:
        item_df = df1.loc[df1['item'] == item]
        # set the uncertainty interval to 95% (the Prophet default is 80%)
        my_model = Prophet(yearly_seasonality=True, seasonality_prior_scale=1.0)
        my_model.fit(item_df)
        future_dates = my_model.make_future_dataframe(periods=12, freq='M')
        forecast = my_model.predict(future_dates)
        prediction[item] = forecast
    return prediction

# Save predictions to dictionary
df2 = get_prediction(df1)

# Convert dictionary
df3 = pd.DataFrame.from_dict(df3, index='columns)
The last part of the code is where I am struggling. I need to convert the df2 dictionary to a dataframe (df3) so I can export it to a CSV, but it looks as if it is a nested dictionary. I'm not sure whether I need to update my function or not.
This is what a snippet of the dictionary looks like
I need to export it so it will look like this
Any help would be greatly appreciated!
The following code should help flatten df2 (a dictionary of dataframes, if I understand correctly):
def flatten(dict_of_df):
    # insert column 'item'
    for key, value in dict_of_df.items():
        value['item'] = key
    # return vertically concatenated dataframe with all the items
    return pd.concat(dict_of_df.values())
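A short usage sketch, assuming df2 is the dictionary returned by get_prediction (the output filename is a placeholder):

df3 = flatten(df2)
df3.to_csv('forecasts.csv', index=False)  # 'forecasts.csv' is an assumed filename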
