Save values in a dataframe and replace them in a csv

I'm not quite sure how to phrase this question, as I'm fairly new to Python and to coding in general.
I have a GUI that displays information already available from a CSV file:
def updatetext(self):
    """adds information extracted from database already provided"""
    df_subj = Content.extract_saved_data(self.date)
    self.lineEditFirstDiagnosed.setText(str(df_subj["First_Diagnosed_preop"][0])) \
        if str(df_subj["First_Diagnosed_preop"][0]) != 'nan' else self.lineEditFirstDiagnosed.setText('')
    self.lineEditAdmNeurIndCheck.setText(str(df_subj['Admission_preop'][0])) \
This works great.
Now, if I change values in the GUI, I want them to be updated in the CSV file.
I started like this:
def onClickedSaveReturn(self):
    """closes GUI and returns to calling (main) GUI"""
    df_general = Clean.get_GeneralData()
    df_subj = {k: '' for k in Content.extract_saved_data(self.date).keys()}  # extract empty dictionary
    df_subj['ID'] = General.read_current_subj().id[0]
    df_subj['PID'] = df_general['PID_ORBIS'][0]
    df_subj['Gender'] = df_general['Gender'][0]
    df_subj['Diagnosis_preop'] = df_general['diagnosis'][0]
    df_subj["First_Diagnosed_preop"] = self.lineEditFirstDiagnosed.text()
    df_subj['Admission_preop'] = self.lineEditAdmNeurIndCheck.text()
    df_subj['Dismissal_preop'] = self.DismNeurIndCheckLabel.text()
And this is what my boss has now added:
    subj_id = General.read_current_subj().id[0]  # reads data from current_subj (saved in ./tmp)
    df = General.import_dataframe('{}.csv'.format(self.date), separator_csv=',')
    if df.shape[1] == 1:
        df = General.import_dataframe('{}.csv'.format(self.date), separator_csv=';')
    idx2replace = df.index[df['ID'] == subj_id][0]
    # TODO: you need to find a way to turn the dictionary df_subj into a dataframe and replace the data at
    # the index idx2replace of 'df' with df_subj. Later I would suggest to use line 322 to save everything to the
    # file
    df.iloc[idx2replace] = pds.DataFrame([df_subj])
    df.to_csv("preoperative.csv", index=False)
    # df.to_csv(os.path.join(FILEDIR, "preoperative.csv"), index=False)
    self.close()
I'm not really sure how to approach this or, to be honest, what to do at all.
I hope someone can help me.
Thank you.

You should load the file only once and keep the DataFrame as an attribute (self.df or similar). Display its values in the GUI, update the DataFrame whenever the user changes a value, and when the user clicks save, simply overwrite the existing file with the DataFrame currently in memory.
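
A minimal sketch of that idea, leaving the GUI out. The class name SubjectStore and its methods are hypothetical and not part of the asker's code base. It assumes the CSV has an ID column, and it reads every cell as text because the GUI fields hold text anyway. The fallback from ',' to ';' mirrors the check in the question.

```python
import pandas as pd


class SubjectStore:
    """Hypothetical helper: keeps the CSV contents in memory for the GUI."""

    def __init__(self, csv_path):
        self.csv_path = csv_path
        self.sep = ","
        # Load once. keep_default_na=False turns empty cells into ''
        # instead of NaN, so there is no 'nan' string to special-case.
        self.df = pd.read_csv(csv_path, sep=self.sep, dtype=str,
                              keep_default_na=False)
        if self.df.shape[1] == 1:  # same fallback as in the question
            self.sep = ";"
            self.df = pd.read_csv(csv_path, sep=self.sep, dtype=str,
                                  keep_default_na=False)

    def update_subject(self, subj_id, values):
        """Overwrite the given columns in the row whose ID equals subj_id."""
        mask = self.df["ID"] == str(subj_id)
        if not mask.any():
            raise KeyError(f"no row with ID {subj_id!r}")
        for column, value in values.items():
            self.df.loc[mask, column] = str(value)

    def save(self):
        """Write the whole in-memory frame back over the original file."""
        self.df.to_csv(self.csv_path, sep=self.sep, index=False)
```

In the GUI, the save handler would then reduce to something like store.update_subject(subj_id, {"Admission_preop": self.lineEditAdmNeurIndCheck.text()}) followed by store.save(), with store created once when the window opens.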

Related

Automatically overwriting .csv file from updated data frame?

I was wondering if there is any way to automatically overwrite a .csv file. Basically, a user inputs something in the app, then my function updates the data table with that input and lets the user know that the input has been received. However, I also want to update the underlying .csv file that the dataframe is read from, so it would be a loop: we have a .csv file, the dataframe reads it, the user inputs something in the app, the dataframe gets updated, and the .csv file gets updated as well. I have this so far:
def submit_reviews(n_clicks, claims_list, verdict, category, review):
    if n_clicks == 0:
        return (dash.no_update, dash.no_update, dash.no_update, dash.no_update, dash.no_update)
    if n_clicks and claims_list and verdict and category and review:
        new_df["Reviewed_Indicator"] = new_df.apply(lambda row: verdict
            if row["Claim_Number"] in claims_list else row["Reviewed_Indicator"], axis = 1)
        new_df["Reviewed_Category"] = new_df.apply(lambda row: category
            if row["Claim_Number"] in claims_list else row["Reviewed_Category"], axis = 1)
        new_df["Reviewed_Reason"] = new_df.apply(lambda row: review
            if row["Claim_Number"] in claims_list else row["Reviewed_Reason"], axis = 1)
        dcc.send_data_frame(new_df.to_csv, "results_test.csv", index=False)
        return ([], True, "success", "Thank you. Your review has been submitted.", 1)
    else:
        return (dash.no_update, True,
                "danger", "Error submitting review. Please review your submission.", dash.no_update)
However, the send_data_frame call does not seem to do anything. Is it possible to make that update there?

Calling to_csv(path) on the DataFrame (for example new_df.to_csv(path, index=False)) already overwrites the file. You just have to make sure that the file is not open in another program.
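
As a rough sketch of that suggestion: the helper name apply_review and the csv_path argument below are made up for illustration, while the column names come from the question. The update uses a boolean mask instead of apply, and to_csv writes straight to the path on the server. As far as I know, dcc.send_data_frame only builds a download payload that has to be returned to a dcc.Download component, which would explain why calling it on its own has no visible effect.

```python
import pandas as pd


def apply_review(df, claims_list, verdict, category, review, csv_path):
    """Update the reviewed columns for the selected claims and rewrite the CSV.

    Returns the number of rows that were updated.
    """
    mask = df["Claim_Number"].isin(claims_list)
    updates = {
        "Reviewed_Indicator": verdict,
        "Reviewed_Category": category,
        "Reviewed_Reason": review,
    }
    for column, value in updates.items():
        df.loc[mask, column] = value
    # to_csv replaces whatever is already stored at csv_path.
    df.to_csv(csv_path, index=False)
    return int(mask.sum())
```

Inside the callback, the three apply calls and the send_data_frame line would be replaced by a single apply_review(new_df, claims_list, verdict, category, review, "results_test.csv").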

Jupyter Dropdown ipywidget displays wrong dataframe

I am well and truly baffled. First time using the dropdown widget so forgive me if this is obvious and thank you for any help you can provide.
Here is the dataframe I want to display and how it was built:
def top_10_venues(data):
    num_top_venues = 10
    indicators = ['st', 'nd', 'rd']
    # create columns according to number of top venues
    columns = ['Neighborhood']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except:
            columns.append('{}th Most Common Venue'.format(ind+1))
    # create a new dataframe
    neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
    neighborhoods_venues_sorted['Neighborhood'] = data['Neighborhood']
    for ind in np.arange(denver_grouped.shape[0]):
        neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(data.iloc[ind, :], num_top_venues)
    neighborhoods_venues_sorted = neighborhoods_venues_sorted.set_index(['Neighborhood'])
top_10_venues(denver_grouped)
neighborhoods_venues_sorted
Here is my dropdown widget:
#Experimenting with Jupyter dropdown
filtered_df = None
dropdown = widgets.SelectMultiple(
    options=neighborhoods_venues_sorted.index,
    description='Venue',
    disabled=False,
    layout={'height':'100px', 'width':'40%'})
def max_density(widget):
    global filtered_df
    selection = list(widget['new'])
    with out:
        clear_output()
        display(neighborhoods_venues_sorted[selection])
    filtered_df = neighborhoods_venues_sorted[selection]
out = widgets.Output()
dropdown.observe(filter_dataframe, names='value')
display(dropdown)
display(out)
What I end up seeing is the unformatted dataframe I ran the function on.

Booyah, figured it out!
It seems my issue was a misunderstanding of what was happening in the cell that created neighborhoods_venues_sorted. I thought I was creating a dataframe. Instead, I had defined a function, and the dataframe built inside it never left the function.
First is the sort function:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
This is the new function, replacing the block of code in the cell:
#Function to create sorted data frame with top 10 most common venues
def top_ten_venues(df):
    num_top_venues = 10
    indicators = ['st', 'nd', 'rd']
    # create columns according to number of top venues
    columns = ['Neighborhood']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except:
            columns.append('{}th Most Common Venue'.format(ind+1))
    neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
    neighborhoods_venues_sorted['Neighborhood'] = df['Neighborhood']
    for ind in np.arange(denver_grouped.shape[0]):
        neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df.iloc[ind, :], num_top_venues)
    #important to have a return in a function, this is the output that can be attached to a variable
    return neighborhoods_venues_sorted
Next I ran it on my target dataframe and assigned the result to a variable. This fixed my issue. I'm still too new to fully understand why the same code, when run in a cell, refused to assign the result as a new dataframe.
#creating a variable to hold the df for later access
neighborhoods_venues_sorted = top_ten_venues(denver_grouped)
#reindexing because it's fun
neighborhoods_venues_sorted = neighborhoods_venues_sorted.set_index(['Neighborhood'])
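
The behaviour described above comes down to function scope: a name assigned inside a function is local to that call, so nothing reaches the notebook unless the function returns it. A small illustration with made-up data follows. The function names and the venue/count columns are invented for this example.

```python
import pandas as pd


def build_without_return(data):
    # 'result' is local to this call and is discarded when the function ends.
    result = data.sort_values("count", ascending=False)


def build_with_return(data):
    result = data.sort_values("count", ascending=False)
    return result  # hands the frame back to the caller


venues = pd.DataFrame({"venue": ["Cafe", "Bar", "Park"], "count": [2, 5, 1]})

nothing = build_without_return(venues)
ranked = build_with_return(venues)

print(nothing)                   # None
print(ranked["venue"].tolist())  # ['Bar', 'Cafe', 'Park']
```

In the original cell, any neighborhoods_venues_sorted visible afterwards would have been an older global of the same name rather than the frame built inside the function.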

Script keeps showing "SettingWithCopyWarning"

Hello, my problem is that my script keeps showing the message below:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
downcast=downcast
I searched Google for a while, and it seems my code is somehow assigning a sliced dataframe to a new variable, which is problematic.
The problem is that I can't find where my code goes wrong.
I tried the copy function and separating the nested functions, but neither worked.
I have attached my code below.
from operator import gt, lt

def case_sorting(file_get, col_get, methods_get, operator_get, value_get):
    ops = {">": gt, "<": lt}
    col_get = str(col_get)
    value_get = int(value_get)
    # compare strings with ==; 'is' tests object identity and is unreliable here
    if methods_get == "|x|":
        new_file = file_get[ops[operator_get](file_get[col_get], value_get)]
    else:
        new_file = file_get[ops[operator_get](file_get[col_get], np.percentile(file_get[col_get], value_get))]
    return new_file
Basically, what I am trying to do is build a Flask API that takes an Excel file as input and returns a CSV file with some filtering applied. So I defined some helper functions first.
def get_brandlist(df_input, brand_input):
    if brand_input == "default":
        final_list = (pd.unique(df_input["브랜드"])).tolist()
    else:
        final_list = brand_input.split("/")
    if '브랜드' in final_list:
        final_list.remove('브랜드')
    final_list = [x for x in final_list if str(x) != 'nan']
    return final_list
Then I defined the main function
def select_bestitem(df_data, brand_name, col_name, methods, operator, value):
    # // 2-1 // to remove unnecessary rows and columns with na values
    df_data = df_data.dropna(axis=0 & 1, how='all')
    df_data.fillna(method='pad', inplace=True)
    # // 2-2 // iterate over all rows to find which row contains brand value
    default_number = 0
    for row in df_data.itertuples():
        if '브랜드' in row:
            df_data.columns = df_data.iloc[default_number, :]
            break
        else:
            default_number = default_number + 1
    # // 2-3 // create the list contains all the target brand names
    brand_list = get_brandlist(df_input=df_data, brand_input=brand_name)
    # // 2-4 // subset the target brand into another dataframe
    df_data_refined = df_data[df_data.iloc[:, 1].isin(brand_list)]
    # // 2-5 // split the dataframe based on the "brand name", and apply the input condition
    df_per_brand = {}
    df_per_brand_modified = {}
    for brand_each in brand_list:
        df_per_brand[brand_each] = df_data_refined[df_data_refined['브랜드'] == brand_each]
        file = df_per_brand[brand_each].copy()
        df_per_brand_modified[brand_each] = case_sorting(file_get=file, col_get=col_name, methods_get=methods,
                                                         operator_get=operator, value_get=value)
    # // 2-6 // merge all the remaining dataframe
    df_merged = pd.DataFrame()
    for brand_each in brand_list:
        df_merged = df_merged.append(df_per_brand_modified[brand_each], ignore_index=True)
    final_df = df_merged.to_csv(index=False, sep=',', encoding='utf-8')
    return final_df
I am going to import this function in my app.py later.
I am quite new to coding, so I'm really sorry if my code is hard to understand, but I just really want to get rid of this annoying warning message. Thanks in advance for your help :)
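
One plausible trigger, offered as an assumption since the question has no confirmed answer: the downcast=downcast fragment in the warning looks like a line from inside pandas' own fillna, which points at the fillna(..., inplace=True) call made on the frame returned by dropna. A hedged sketch of step 2-1 without any in-place call is below. The function name clean_frame and the sample data are invented. Note also that 0 & 1 evaluates to 0, so the original dropna only ever drops rows.

```python
import numpy as np
import pandas as pd


def clean_frame(df_data):
    """Drop rows that are entirely NaN, then forward-fill the remaining gaps."""
    # Chaining and returning the result avoids modifying, in place, an object
    # that pandas may regard as derived from another frame.
    return df_data.dropna(axis=0, how="all").ffill()


# Made-up sample: the third row is completely empty.
raw = pd.DataFrame({
    "brand": ["A", np.nan, np.nan, "B"],
    "sales": [10.0, 12.0, np.nan, 7.0],
})

cleaned = clean_frame(raw)
print(cleaned)
```

If columns that are entirely NaN should be removed too, a second .dropna(axis=1, how="all") can be added to the chain. In recent pandas versions DataFrame.append is no longer available, so step 2-6 would need pd.concat(list_of_frames, ignore_index=True) instead.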

Check logs with Spark

I'm new to Spark and I'm trying to develop a python script that reads a csv file with some logs:
userId,timestamp,ip,event
13,2016-12-29 16:53:44,86.20.90.121,login
43,2016-12-29 16:53:44,106.9.38.79,login
66,2016-12-29 16:53:44,204.102.78.108,logoff
101,2016-12-29 16:53:44,14.139.102.226,login
91,2016-12-29 16:53:44,23.195.2.174,logoff
It then checks whether a user showed some strange behavior, for example two consecutive 'login' events without a 'logoff' in between. I've loaded the CSV as a Spark DataFrame, and I want to compare the log rows of a single user, ordered by timestamp, checking whether two consecutive events are of the same type (login - login, logoff - logoff). I'm looking for a 'map-reduce' way to do it, but at the moment I can't figure out how to use a reduce function that compares consecutive rows.
The code I've written works, but the performance is very bad.
sc = SparkContext("local","Data Check")
sqlContext = SQLContext(sc)
LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"
N_USERS = 10*1000
dataFrame = sqlContext.read.format("com.databricks.spark.csv").load(LOG_FILE_PATH)
dataFrame = dataFrame.selectExpr("C0 as userID","C1 as timestamp","C2 as ip","C3 as event")
wrongUsers = []
for i in range(0,N_USERS):
    userDataFrame = dataFrame.where(dataFrame['userId'] == i)
    userDataFrame = userDataFrame.sort('timestamp')
    prevEvent = ''
    for row in userDataFrame.rdd.collect():
        currEvent = row[3]
        if(prevEvent == currEvent):
            wrongUsers.append(row[0])
        prevEvent = currEvent
badUsers = sqlContext.createDataFrame(wrongUsers)
badUsers.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)

First (unrelated, but still): make sure the number of entries per user is not too large, because the collect() call in 'for row in userDataFrame.rdd.collect():' pulls every row onto the driver and is dangerous.
Second, you don't need to leave the DataFrame API and fall back to plain Python here; just stick to Spark.
Now, your problem. It is basically "for each line I want to know something from the previous line", which is what window functions are for, and the lag function in particular. There are two interesting articles about window functions in Spark: one from Databricks with code in Python, and one from Xinh with (I think easier to understand) examples in Scala.
I have a solution in Scala, but I think you'll manage to translate it to Python:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import sqlContext.implicits._

val LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
val RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"

val data = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true") // use the header from your csv
  .load(LOG_FILE_PATH)

val wSpec = Window.partitionBy("userId").orderBy("timestamp")

val badUsers = data
  .withColumn("previousEvent", lag($"event", 1).over(wSpec))
  .filter($"previousEvent" === $"event")
  .select("userId")
  .distinct

badUsers.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)
Basically, you retrieve the value from the previous line and compare it to the value on the current line; if they match, that is wrong behavior and you keep the userId. For the first line in each userId's "block" of lines, the previous value is null; comparing it with the current value does not evaluate to true, so that row is filtered out and causes no problem.
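
To make the lag logic concrete without a cluster, here is the same comparison in plain pandas, where groupby(...).shift(1) plays the role of lag over a window partitioned by userId. This is an illustration of the semantics on invented rows, not a PySpark translation. In PySpark the corresponding pieces would be Window.partitionBy("userId").orderBy("timestamp") and pyspark.sql.functions.lag.

```python
import pandas as pd

# Made-up log rows: user 13 logs in twice in a row.
logs = pd.DataFrame({
    "userId": [13, 43, 13, 66, 43, 13],
    "timestamp": pd.to_datetime([
        "2016-12-29 16:53:44", "2016-12-29 16:54:00", "2016-12-29 16:55:10",
        "2016-12-29 16:56:00", "2016-12-29 16:57:30", "2016-12-29 16:58:00",
    ]),
    "event": ["login", "login", "login", "logoff", "logoff", "logoff"],
})


def find_bad_users(df):
    """Return the ids of users with two identical consecutive events."""
    ordered = df.sort_values(["userId", "timestamp"])
    # Previous event of the same user; missing for each user's first row.
    previous = ordered.groupby("userId")["event"].shift(1)
    repeated = (ordered["event"] == previous) & previous.notna()
    return sorted(ordered.loc[repeated, "userId"].unique().tolist())


print(find_bad_users(logs))  # [13]
```

User 66 has a single row, so its previous event is missing and it is not flagged, which matches the null case described above.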

Checking HTTP Status (Python)

Is there a way to check the HTTP status code in the code below? I have not used the requests or urllib libraries, which would allow for this.
import pandas as pd
from pandas.io.excel import read_excel
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
#check the sheet number, spot: 9/9, short end 7/9
spot_curve = read_excel(url, sheetname=8) #Creates the dataframes
short_end_spot_curve = read_excel(url, sheetname=6)
# do some cleaning, keep NaN for now, as forward fill NaN is not recommended for yield curve
spot_curve.columns = spot_curve.loc['years:']
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities within 5 years as those are duplicated in short-end file
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]
#Providing correct names
short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]
# merge these two, time index are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)
def filter_func(group):
    return group.isnull().sum(axis=1) <= 50
combined_data = combined_data.groupby(level=0).filter(filter_func)

In pandas:
read_excel uses urllib2.urlopen (urllib.request.urlopen in Python 3) to open the URL and calls .read() on the response immediately, without keeping the HTTP response around, like this:
data = urlopen(url).read()
Even if you need only part of the Excel file, pandas will download the whole file each time. So I upvoted @jonnybazookatone's answer.
It is better to save the Excel file locally first; then you can check the status code and the file's MD5 to verify data integrity before parsing it.
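
A minimal sketch of that approach using only the standard library. The function names download and md5_of are made up for this example, and the expected checksum would have to come from a source you trust.

```python
import hashlib
import urllib.error
import urllib.request


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download(url, dest_path, timeout=30):
    """Save url to dest_path and return the HTTP status code.

    urlopen raises HTTPError for 4xx/5xx responses; in that case the
    error code is returned and nothing is written.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            body = response.read()
    except urllib.error.HTTPError as err:
        return err.code
    with open(dest_path, "wb") as handle:
        handle.write(body)
    return status
```

With this in place, the script would call download(url, "uknom05_mdaily.xls") once, continue only if the returned status is 200, optionally compare md5_of("uknom05_mdaily.xls") against a known value, and then pass the local path to read_excel for both sheets instead of fetching the URL twice.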
