Python Dataset package & looping / updating rows

I am trying to retrieve the contents of my sqlite3 database and update the data using a scraper in a for loop.
The intended flow is as follows:
1. Retrieve all rows from the dataset.
2. For each row, read the URL column and fetch some additional (updated) data.
3. Once this data has been obtained, upsert it (update, adding columns if they do not exist) into the row the URL was taken from.
I love the dataset package because of 'upsert', which dynamically adds any columns that do not yet exist in the database.
However, my code produces an error I can't explain:
'ResourceClosedError: This result object is closed.'
How can I achieve my goal without running into this? The following snippet reproduces my issue.
import dataset
db = dataset.connect('sqlite:///test.db')
# Add two dummy rows
testrow1 = {'TestID': 1}
testrow2 = {'TestID': 2}
db['test'].upsert(testrow1, ['TestID'])
db['test'].upsert(testrow2, ['TestID'])
print("Inserted testdata before loop")
# This works fine
testdata = db['test'].all()
for row in testdata:
    print(row)
# This gives me a 'ResourceClosedError: This result object is closed.' error?
i = 1 # 'i' here exemplifies data that I'll add through my scraper.
testdata = db['test'].all()
for row in testdata:
    data = {'TestID': i+1000}
    db['test'].upsert(data, ['TestID'])
    print("Upserted within loop (i = " + str(i) + ")")
    i += 1

The likely issue is that db['test'].all() does not return a list. It returns a result object that reads rows lazily from an open database cursor. The first loop (under 'This works fine') only reads, so it runs to completion. The second loop upserts into the same table while it is still iterating over the result, and that closes the underlying result, so the next attempt to read a row raises the ResourceClosedError. Results are closed automatically once they are exhausted or invalidated, which is a feature rather than a bug (see this answer about 'automatic closing' for more on the why and ways to get around it).
Given that result resources tend to get closed, my first suggestion was to fetch the results again right before your upsert loop:
i = 1 # 'i' here exemplifies data that I'll add through my scraper.
testdata = db['test'].all()
for row in testdata:
    data = {'TestID': i}
    db['test'].upsert(data, ['TestID'])
    print("Upserted within loop (i = " + str(i) + ")")
    i += 1
Edit: See the comment. The code above still modifies the table while iterating over testdata, so it gives the same error. A way around this is to read the data into a list first and then loop over that list to do the updates. Something like:
i = 1 # 'i' here exemplifies data that I'll add through my scraper.
testdata = [row for row in db['test'].all()]
for row in testdata:
    data = {'TestID': i}
    db['test'].upsert(data, ['TestID'])
    print("Upserted within loop (i = " + str(i) + ")")
    i += 1
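The same read-first, write-second pattern can be sketched with only the standard library. This is a minimal illustration that uses sqlite3 directly rather than the dataset package; the url and extra columns, the refresh_rows helper, and the fetch_extra callback (a stand-in for the scraper) are assumptions made for the example.

```python
import sqlite3


def refresh_rows(conn, fetch_extra):
    """Read every row into a list first, then update the rows one by one."""
    # fetchall() materializes the whole result, so no cursor is still being
    # iterated while the UPDATE statements run.
    rows = conn.execute("SELECT TestID, url FROM test").fetchall()
    for test_id, url in rows:
        conn.execute(
            "UPDATE test SET extra = ? WHERE TestID = ?",
            (fetch_extra(url), test_id),
        )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (TestID INTEGER PRIMARY KEY, url TEXT, extra TEXT)")
conn.executemany(
    "INSERT INTO test (TestID, url) VALUES (?, ?)",
    [(1, "http://example.com/a"), (2, "http://example.com/b")],
)

# str.upper stands in for the scraper call (hypothetical).
updated = refresh_rows(conn, lambda url: url.upper())
result = conn.execute("SELECT TestID, extra FROM test ORDER BY TestID").fetchall()
print(updated, result)
```

The point of the sketch is only the ordering: the read finishes before the first write starts.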

Related

How to loop through a list of BQ tables in Python?

I have a list of BQ tables that I'd like to use one at a time. The purpose is to process each table individually, perform some action (in my example, score the dataset with a previously fitted model), then compute the probabilities and append them to the all_scores list.
Here is the entire code snippet.
# List of BQ tables
scoring_tables = ["`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_01`",
                  "`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_02`",
                  "`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_03`",
                  "`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_04`",
                  "`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_05`"]
# List to store probabilities/scores
all_scores = []
# Loop through each BQ table, calculate the probabilities and store them in all_scores
for t in scoring_tables:
    # This is the step that fails inside the loop
    %%bigquery property_data_score_00
    SELECT * FROM t
    score_data = property_data_score_00.copy()
    score_data.set_index('HH_ID',inplace = True)
    # Fixing for HOUSE_INCOME = 0 & AGE = 0 based on means
    score_data['HOUSE_INCOME'] = np.where(score_data['HOUSE_INCOME']==0,107,score_data['HOUSE_INCOME'])
    score_data['AGE'] = np.where(score_data['AGE']==0,54,score_data['AGE'])
    # recategorize PROP_EXTR_WALL_TYPE | PROP_GRG_TYPE | PROP_ROOF_TYPE
    condition = [(score_data['PROP_EXTR_WALL_TYPE'].str.contains("BRICK")),
                 (score_data['PROP_EXTR_WALL_TYPE'].str.contains("WOOD")),
                 (score_data['PROP_EXTR_WALL_TYPE'].str.contains("CONCRETE")),
                 (score_data['PROP_EXTR_WALL_TYPE'].str.contains("METAL")),
                 (score_data['PROP_EXTR_WALL_TYPE'].str.contains("STEEL"))]
    choice = ["BRICK","WOOD","CONCRETE","METAL","METAL"]
    score_data['PROP_EXTR_WALL_TYPE_MOD'] = np.select(condition,choice,default="OTHERS")
    condition = [(score_data['PROP_GRG_TYPE'].str.contains("ATTACHED")),
                 (score_data['PROP_GRG_TYPE'].str.contains("DETACHED")),
                 (score_data['PROP_GRG_TYPE'].str.contains("CARPORT")),
                 (score_data['PROP_GRG_TYPE'].str.contains("BASEMENT"))]
    choice = ["ATTACHED","DETACHED","CARPORT","BASEMENT"]
    score_data['PROP_GRG_TYPE_MOD'] = np.select(condition,choice,default="OTHERS")
    condition = [(score_data['PROP_ROOF_TYPE'].str.contains("GABLE")),
                 (score_data['PROP_ROOF_TYPE'].str.contains("HIP")),
                 (score_data['PROP_ROOF_TYPE'].str.contains("GAMBREL"))]
    choice = ["GABLE","HIP","GAMBREL"]
    score_data['PROP_ROOF_TYPE_MOD'] = np.select(condition,choice,default="OTHERS")
    # one-hot encoding
    to_encode = ["IND_ETHNICITY","IND_GENDER","IND_MOVERS_FLAG","IND_OCCUPATION","IND_REGION","PROP_EXTR_WALL_TYPE","PROP_GRG_TYPE","PROP_ROOF_TYPE","TSI"]
    score_data_dm = pd.get_dummies(data = score_data, columns = to_encode, drop_first = False)
    # columns which might not get produced by pd.get_dummies if their categories are missing from the score data
    columns_not_in_score_data_dm = [c for c in train_X.columns if c not in score_data_dm.columns]
    score_data_dm[columns_not_in_score_data_dm] = 0 # initializing the above columns as 0
    score_data_dm_filt = score_data_dm[select_columns] # making sure to select only the columns which are in train_X
    y_pred = xgb_prop_PV.predict_proba(score_data_dm_filt)[:,1] # final scoring
    all_scores.extend(y_pred) # list + ndarray would not concatenate, so extend the list instead
Inside the loop, I'm having trouble with the SELECT * FROM t step, which raises an error. I believe the indentation within the loop is causing the %%bigquery step to fail. I looked at itertools, but it appears to be useful only for conditional looping, which is not my situation.
This also seems like a complex approach; is there a more elegant solution? The table was too large (600GB), so it had to be split into smaller datasets, which is why we tried this method. PS: It works without the loop if run for one table at a time, but that is quite a manual effort.
Thanks,
Piyush

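To address the error message: if you are going to pass a variable into a query using the BigQuery magics command, you need to use the --params flag, and the query must then refer to the parameter as @t. Note that %%bigquery is a cell magic, so it has to be the first line of its own cell and cannot sit inside a for loop. Also be aware that query parameters bind values, not identifiers, so a parameter cannot stand in for a table name in the FROM clause.
Your code should end up looking something like this, with the variables defined in one cell:
t = "table_name"
my_params = {"t": t}
and the query in the next cell:
%%bigquery --params $my_params
SELECT @t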

You may consider the approach below.
Instead of using BigQuery magics, you can use the BigQuery Client Library for Python. From there, you can loop over your list of tables, using an f-string to build each query, as shown in the sample code below.
from google.cloud import bigquery
bqclient = bigquery.Client()
scoring_tables = ["`your-project-id.your-dataset.test_table1`",
                  "`your-project-id.your-dataset.test_table2`",
                  "`your-project-id.your-dataset.test_table3`"]
for t in scoring_tables:
    # Download query results.
    query_string = f"""
    SELECT * FROM {t}
    """
    property_data_score_00 = (
        bqclient.query(query_string)
        .result()
        .to_dataframe(
            # Optionally, explicitly request to use the BigQuery Storage API. As of
            # google-cloud-bigquery version 1.26.0 and above, the BigQuery Storage
            # API is used by default.
            create_bqstorage_client=True,
        )
    )
    score_data = property_data_score_00.copy()
    score_data.set_index('HH_ID',inplace = True)
    print(score_data)
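To collect the scores from every table in one list, as the question asks, the loop could be wrapped roughly as follows. This is only a sketch: score_tables, run_query and score are hypothetical names, and the stand-ins at the bottom replace BigQuery and the fitted model so the example runs on its own. In practice run_query might wrap bqclient.query(sql).result().to_dataframe(), and score might wrap the preprocessing plus predict_proba.

```python
def score_tables(tables, run_query, score):
    """Query each table in turn and collect every score in one flat list."""
    all_scores = []
    for table in tables:
        # The table name is formatted into the SQL text, one query per table.
        sql = f"SELECT * FROM {table}"
        rows = run_query(sql)
        # extend() adds the individual scores, keeping the list flat.
        all_scores.extend(score(rows))
    return all_scores


# Stand-ins so the sketch runs without BigQuery or a model (assumptions).
fake_results = {
    "SELECT * FROM `ds.t1`": [1, 2],
    "SELECT * FROM `ds.t2`": [3],
}
scores = score_tables(
    ["`ds.t1`", "`ds.t2`"],
    run_query=fake_results.__getitem__,
    score=lambda rows: [r / 10 for r in rows],
)
print(scores)
```

Passing the query and scoring steps in as callables keeps the loop itself easy to test without a live connection.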

How do I loop column names in a pandas dataframe?

I am new to Python and have never really used Pandas, so forgive me if this doesn't make sense. I am trying to create a df based on frontend data I am sending to a flask route. The data is looped through and appended for each row. My only problem is that I don't know how to get the df columns to reflect that. Here is my code to build the rows and the current output:
claims = csv_data["claims"]
setups = csv_data["setups"]
for setup in setups:
    setup = setups[0]
    offerings = setup["currentOfferings"]
    considered = setup["considerationSet"]
    reach_dict = setup["reach"]
    favorite_dict = setup["favorite"]
    summary_dict = setup["summaryMetrics"]
rows = []
for i, claim in enumerate(claims):
    row = []
    row.append(i + 1)
    row.append(claim)
    for setup in setups:
        setup = setups[0]
        row.append("X") if claim in setup["currentOfferings"] else row.append(float('nan'))
        row.append("X") if claim in setup["considerationSet"] else row.append(float('nan'))
        if claim in setup["currentOfferings"]:
            reach_score = reach_dict[claim]
            reach_percentage = "{:.0%}".format(reach_score)
            row.append(reach_percentage)
        else:
            row.append(float('nan'))
        if claim in setup["currentOfferings"]:
            favorite_score = favorite_dict[claim]
            fav_percentage = "{:.0%}".format(favorite_score)
            row.append(fav_percentage)
        else:
            row.append(float('nan'))
    rows.append(row)
I know that I can pass columns = ["#", "Claims", "Setups", etc...] to the df, but that doesn't work because the rows loop through multiple setups, and the number of setups can change. If I don't specify the column names (as in the image), I just get numbers as column names. Ideally it should loop through the data it receives in the route, start with "#" and "Claims" as columns, and then add, for each setup, "Setup 1", "Consideration Set 1", "Reach", "Favorite", "Setup 2", "Consideration Set 2", and so on.
I tried to create a similar type of loop for the columns:
my_columns = []
for i, row in enumerate(rows):
    col = []
    if row[0] != None:
        col.append("#")
    else:
        pass
    if row[1] != None:
        col.append("Claims")
    else:
        pass
    if row[2] != None:
        col.append("Setup")
    else:
        pass
    if row[3] != None:
        col.append("Consideration Set")
    else:
        pass
    if row[4] != None:
        col.append("Reach")
    else:
        pass
    if row[5] != None:
        col.append("Favorite")
    else:
        pass
    my_columns.append(col)
df = pd.DataFrame(
    rows,
    columns = my_columns
)
But this didn't work because I have the same problem: the loop does not account for multiple setups, so 6 column names are passed for 10 data columns. I'm not sure if I am just not looping over the columns properly, or if I am making everything more complicated than it needs to be.
This is what I am trying to accomplish without having to explicitly name the columns, because this is just sample data. There could end up being 3, 4, or however many setups in the actual app.
(Image: what I would like the output to look like)

I don't know if this is the most efficient way of doing something like this, but I think this is what you want to achieve. The first two columns are fixed, and the remaining four repeat for every setup.
def create_columns(df):
    fixed = ['#', 'Claims']
    repeated_cols = 4  # number of columns repeated for every setup
    n_setups = (len(df.columns) - len(fixed)) // repeated_cols
    new_cols = list(fixed)
    for idx in range(1, n_setups + 1):
        new_cols += [f'Setup_{idx}', f'Consideration_Set_{idx}', 'Reach', 'Favorite']
    return new_cols
df.columns = create_columns(df)
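An alternative is to build the header list from the number of setups before the DataFrame is created, so it can be passed straight to pd.DataFrame(rows, columns=...). The sketch below is plain Python; build_columns is a hypothetical helper, and numbering the Reach and Favorite headers is an assumption made here to keep the column names unique.

```python
def build_columns(n_setups):
    """Return two fixed headers followed by four headers per setup."""
    columns = ["#", "Claims"]
    for idx in range(1, n_setups + 1):
        columns += [
            f"Setup {idx}",
            f"Consideration Set {idx}",
            f"Reach {idx}",      # numbered so names stay unique (assumption)
            f"Favorite {idx}",
        ]
    return columns


columns = build_columns(2)
print(columns)
```

With two setups this yields 10 headers, which matches the 10 data columns mentioned in the question, so len(build_columns(len(setups))) can be checked against the width of each row before building the frame.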

If your data comes as csv then try pd.read_csv() to create dataframe.

Rename a data frame name by adding the iteration value as suffix in a for loop (Python)

I have run the following Python code:
array = ['AEM000', 'AID017']
USA_DATA_1D = USA_DATA10.loc[USA_DATA10['JOBSPECIALTYCODE'].isin(array)]
I run a regression model and extract the log-likelihood value for each item of this array in a for loop:
for item in array:
    USA_DATA_1D = USA_DATA10.loc[USA_DATA10['JOBSPECIALTYCODE'] == item]
    formula = "WEIGHTED_BASE_MEDIAN_FINAL_MEAN ~ YEAR"
    response, predictors = dmatrices(formula, USA_DATA_1D, return_type='dataframe')
    mod1 = sm.GLM(response, predictors, family=sm.genmod.families.family.Gaussian()).fit()
    LLF_NG = {'model': ['Standard Gaussian'],
              'llf_value': mod1.llf
              }
    df_llf = pd.DataFrame(LLF_NG, columns = ['model', 'llf_value'])
Now I would like to rename the dataframe df_llf to df_llf_(name of the item), i.e. df_llf_AEM000 when running the loop on the first item and df_llf_AID017 when running it on the second one.
I need some help to know how to proceed.

If you want to keep the data frame under a new name, use the copy method so that later changes do not alter the original data frame.
df_llf_AEM000 = df_llf.copy()
If you want to iteratively save several different versions of the original data frame, you can do something like this:
allDataframes = []
for i in range(10):
    df = df_original.copy()
    allDataframes.append(df)
print(allDataframes[0])
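If the goal is to look each result up by item name, a dictionary keyed by that name may be simpler than creating a new variable per iteration. The sketch below uses plain dicts in place of data frames so it runs without pandas or statsmodels; collect_by_item and the fit callback are hypothetical stand-ins for the loop body and for fitting the GLM and reading mod1.llf.

```python
def collect_by_item(items, fit):
    """Store one result per item under a key derived from the item name."""
    results = {}
    for item in items:
        # In the real loop this value would be the df_llf data frame.
        results[f"df_llf_{item}"] = {
            "model": "Standard Gaussian",
            "llf_value": fit(item),
        }
    return results


# Dummy fit function: returns a made-up number, not a real log-likelihood.
results = collect_by_item(["AEM000", "AID017"], fit=lambda item: -float(len(item)))
print(results["df_llf_AEM000"])
```

Each result is then available as results["df_llf_AEM000"], results["df_llf_AID017"], and so on, however many items the array holds.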

counting entries yields a wrong dataframe

So I'm trying to automate the process of getting the number of entries a person has by using pandas.
Here's my code:
st = pd.read_csv('list.csv', na_values=['-'])
auto = pd.read_csv('data.csv', na_values=['-'])
comp = st.Component.unique()
eventname = st.EventName.unique()
def get_summary(ID):
    for com in comp:
        for event in eventname:
            arr = []
            for ids in ID:
                x = len(st.loc[(st.User == str(ids)) & (st.Component == str(com)) & (st.EventName == str(event))])
                arr.append(x)
            auto.loc[:, event] = pd.Series(arr, index=auto.index)
The output I get looks like this:
I ran some manual loops to check the entries for the first four columns, and I also counted them manually in the csv file. When I put a print call inside the loop, I can see that it counts the entries correctly, but at some point the values get overwritten with zeros.
What am I missing/doing wrong here?
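One possible reading of the symptom, judging only from the code above: the result column is keyed by event alone, so every pass of the outer loop over comp writes to the same columns again, and the counts for the last component (possibly zeros) are what remain. Below is a minimal sketch of counting per (component, event) pair instead, written in plain Python with made-up records; count_entries and the sample data are assumptions for illustration.

```python
from collections import Counter


def count_entries(records, user_ids):
    """Count entries per user for every (component, event) pair."""
    counts = Counter(
        (r["User"], r["Component"], r["EventName"]) for r in records
    )
    pairs = sorted({(r["Component"], r["EventName"]) for r in records})
    # One list per pair, so different components never share a key.
    return {
        (comp, event): [counts[(uid, comp, event)] for uid in user_ids]
        for comp, event in pairs
    }


# Made-up sample data (hypothetical).
records = [
    {"User": "1", "Component": "Quiz", "EventName": "viewed"},
    {"User": "1", "Component": "Quiz", "EventName": "viewed"},
    {"User": "2", "Component": "Forum", "EventName": "viewed"},
]
summary = count_entries(records, ["1", "2"])
print(summary)
```

In pandas terms, the equivalent idea would be to name each column after both the component and the event rather than the event alone.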

Google chart input data

I have a Python script that builds the inputs for a Google chart. It creates the correct column headers and the correct number of rows, but it repeats the data of the last row in every row. I tried explicitly setting the row indices rather than using a loop (which wouldn't work in practice, but should have worked in testing). It still gives me the same values for each entry. The code did work when it was on the same page as the HTML user form.
end1 = number of rows in the data table
end2 = number of columns in the data table, represented by a list of column headers
viewData = data stored in the database
c = connections['default'].cursor()
c.execute("SELECT * FROM {0}.\"{1}\"".format(analysis_schema, viewName))
viewData = c.fetchall()
curDesc = c.description
end1 = len(viewData)
end2 = len(curDesc)
Creates column headers:
colOrder = [curDesc[2][0]]
if activityOrCommodity == "activity":
    tableDescription = {curDesc[2][0] : ("string", "Activity")}
elif (activityOrCommodity == "commodity") or (activityOrCommodity == "aa_commodity"):
    tableDescription = {curDesc[2][0] : ("string", "Commodity")}
for i in range(3, end2):
    attValue = curDesc[i][0]
    tableDescription[curDesc[i][0]] = ("number", attValue)
    colOrder.append(curDesc[i][0])
Creates row data:
data = []
values = {}
for i in range(0, end1):
    for j in range(2, end2):
        if j == 2:
            values[curDesc[j][0]] = viewData[i][j].encode("utf-8")
        else:
            values[curDesc[j][0]] = viewData[i][j]
    data.append(values)
dataTable = gviz_api.DataTable(tableDescription)
dataTable.LoadData(data)
return dataTable.ToJSon(columns_order=colOrder)
An example javascript output:
var dt = new google.visualization.DataTable({cols:[{id:'activity',label:'Activity',type:'string'},{id:'size',label:'size',type:'number'},{id:'compositeutility',label:'compositeutility',type:'number'}],rows:[{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]},{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]},{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]},{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]},{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]}]}, 0.6);

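It seems you're appending values to data, but values is never reset between iterations. Because values is created once, outside the loops, every entry in data refers to the same dictionary, which ends up holding the last row's data.
I assume this is not intended, right? If so, just move values = {} inside the first for loop in your row-building code.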
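The effect can be reproduced without a database. The sketch below contrasts the two placements of the dictionary; build_rows, build_rows_shared and the sample tuples are made up for the illustration.

```python
def build_rows_shared(view_data, col_names):
    """Reuses one dict for all rows, as in the question's code."""
    data = []
    values = {}  # created once, so every appended entry is this same object
    for record in view_data:
        for name, cell in zip(col_names, record):
            values[name] = cell
        data.append(values)
    return data


def build_rows(view_data, col_names):
    """Creates a fresh dict for each row."""
    data = []
    for record in view_data:
        values = {}  # new object per row
        for name, cell in zip(col_names, record):
            values[name] = cell
        data.append(values)
    return data


sample = [("A", 1), ("B", 2)]
names = ["activity", "size"]
shared = build_rows_shared(sample, names)
fixed = build_rows(sample, names)
print(shared)
print(fixed)
```

In the shared version both list entries are the same object (shared[0] is shared[1]), which is why every row shows the last row's values.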
