Automatically overwriting .csv file from updated data frame? - python

I was wondering if there is any way to automatically overwrite a .csv file. Basically, a user inputs something in the app, then my function updates the data table with that input and lets the user know that their input has been received. However, I also want to update the base .csv that the dataframe is read from. So the loop would be: we have a .csv file, the dataframe reads it, the user inputs something in the app, the dataframe gets updated, and the .csv file gets updated as well. I have this so far:
def submit_reviews(n_clicks, claims_list, verdict, category, review):
    if n_clicks == 0:
        return (dash.no_update, dash.no_update, dash.no_update, dash.no_update, dash.no_update)
    if n_clicks and claims_list and verdict and category and review:
        new_df["Reviewed_Indicator"] = new_df.apply(lambda row: verdict
            if row["Claim_Number"] in claims_list else row["Reviewed_Indicator"], axis = 1)
        new_df["Reviewed_Category"] = new_df.apply(lambda row: category
            if row["Claim_Number"] in claims_list else row["Reviewed_Category"], axis = 1)
        new_df["Reviewed_Reason"] = new_df.apply(lambda row: review
            if row["Claim_Number"] in claims_list else row["Reviewed_Reason"], axis = 1)
        dcc.send_data_frame(new_df.to_csv, "results_test.csv", index=False)
        return ([], True, "success", "Thank you. Your review has been submitted.", 1)
    else:
        return (dash.no_update, True,
                "danger", "Error submitting review. Please review your submission.", dash.no_update)
However, the send_data_frame call does not seem to do anything. Is it possible to do that update in there?

Calling to_csv(path) on the DataFrame (for example new_df.to_csv(path)) already overwrites the file. You just have to make sure that the file is not in use by another program.
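As a minimal sketch of that read, update, overwrite loop outside of Dash: the column names follow the question, while the sample data, the temporary file path, and the helper name apply_review are assumptions made up for illustration. A boolean mask is used instead of the row-wise apply from the question.

```python
import os
import tempfile

import pandas as pd


def apply_review(df, claims_list, verdict, category, review):
    """Hypothetical helper: update the review columns for the selected claims."""
    mask = df["Claim_Number"].isin(claims_list)
    df.loc[mask, "Reviewed_Indicator"] = verdict
    df.loc[mask, "Reviewed_Category"] = category
    df.loc[mask, "Reviewed_Reason"] = review
    return df


# Made-up base file so the example is self-contained
csv_path = os.path.join(tempfile.mkdtemp(), "results_test.csv")
pd.DataFrame({
    "Claim_Number": [101, 102, 103],
    "Reviewed_Indicator": ["Pending", "Pending", "Pending"],
    "Reviewed_Category": ["None", "None", "None"],
    "Reviewed_Reason": ["None", "None", "None"],
}).to_csv(csv_path, index=False)

# Read the file, apply the user's input, and write back to the same path
new_df = pd.read_csv(csv_path)
apply_review(new_df, [102], "Yes", "Billing", "Checked manually")
new_df.to_csv(csv_path, index=False)  # overwrites the existing file

reloaded = pd.read_csv(csv_path)
print(reloaded)
```

Inside the callback, the same to_csv call on the server-side path would replace the dcc.send_data_frame line, which is meant for sending a file to the browser rather than for saving on disk.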

Related

Save values in a dataframe and replace them in a csv

I'm not quite sure how to formulate the question, as I'm quite new to Python and coding in general.
I have a GUI that displays already available information from a csv:
def updatetext(self):
    """adds information extracted from database already provided"""
    df_subj = Content.extract_saved_data(self.date)
    self.lineEditFirstDiagnosed.setText(str(df_subj["First_Diagnosed_preop"][0])) \
        if str(df_subj["First_Diagnosed_preop"][0]) != 'nan' else self.lineEditFirstDiagnosed.setText('')
    self.lineEditAdmNeurIndCheck.setText(str(df_subj['Admission_preop'][0])) \
This works great.
Now, if I change values in the GUI, I want them to be updated in the csv.
I started like this:
def onClickedSaveReturn(self):
    """closes GUI and returns to calling (main) GUI"""
    df_general = Clean.get_GeneralData()
    df_subj = {k: '' for k in Content.extract_saved_data(self.date).keys()}  # extract empty dictionary
    df_subj['ID'] = General.read_current_subj().id[0]
    df_subj['PID'] = df_general['PID_ORBIS'][0]
    df_subj['Gender'] = df_general['Gender'][0]
    df_subj['Diagnosis_preop'] = df_general['diagnosis'][0]
    df_subj["First_Diagnosed_preop"] = self.lineEditFirstDiagnosed.text()
    df_subj['Admission_preop'] = self.lineEditAdmNeurIndCheck.text()
    df_subj['Dismissal_preop'] = self.DismNeurIndCheckLabel.text()
and this is what my boss added now:
    subj_id = General.read_current_subj().id[0]  # reads data from curent_subj (saved in ./tmp)
    df = General.import_dataframe('{}.csv'.format(self.date), separator_csv=',')
    if df.shape[1] == 1:
        df = General.import_dataframe('{}.csv'.format(self.date), separator_csv=';')
    idx2replace = df.index[df['ID'] == subj_id][0]
    # TODO: you need to find a way to turn the dictionary df_subj into a dataframe and replace the data at
    # the index idx2replace of 'df' with df_subj. Later I would suggest to use line 322 to save everything to the
    # file
    df.iloc[idx2replace] = pds.DataFrame([df_subj])
    df.to_csv("preoperative.csv", index=False)
    # df.to_csv(os.path.join(FILEDIR, "preoperative.csv"), index=False)
    self.close()
I'm not really sure how to approach this or, to be honest, what to do at all.
I hope someone can help me.
Thank you.
You should load the file only once and keep the DataFrame around (as self.df or something similar). Then display it, and every time the user changes a value in the GUI, update the DataFrame. When the user clicks save, just overwrite the existing file with the current DataFrame in memory.
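A minimal sketch of that idea, without the GUI: the class name SubjectTable, its methods, and the sample data are hypothetical and only illustrate loading once, updating one row from a dictionary in memory, and writing the whole DataFrame back on save. The column names are taken from the question.

```python
import os
import tempfile

import pandas as pd


class SubjectTable:
    """Hypothetical helper: load the CSV once, edit in memory, write on save."""

    def __init__(self, path):
        self.path = path
        self.df = pd.read_csv(path, dtype=str)  # loaded once, kept in memory

    def update_subject(self, subj_id, values):
        """Copy the entries of the dict `values` into the row whose ID matches."""
        matches = self.df.index[self.df["ID"] == subj_id]
        if len(matches) == 0:
            raise KeyError(f"no row with ID {subj_id!r}")
        for column, value in values.items():
            self.df.loc[matches[0], column] = value

    def save(self):
        """Overwrite the file with the current in-memory DataFrame."""
        self.df.to_csv(self.path, index=False)


# Made-up file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "preoperative.csv")
pd.DataFrame({
    "ID": ["S1", "S2"],
    "First_Diagnosed_preop": ["2015", "2018"],
    "Admission_preop": ["2020-01-01", "2020-02-02"],
}).to_csv(path, index=False)

table = SubjectTable(path)
# In the GUI, these values would come from the line edits
table.update_subject("S2", {"First_Diagnosed_preop": "2019", "Admission_preop": "2021-03-04"})
table.save()

reloaded = pd.read_csv(path, dtype=str)
print(reloaded)
```

Assigning column by column with df.loc avoids the shape mismatch that comes from assigning a one-row DataFrame to df.iloc[idx2replace].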

Write to CSV cell by cell in a row using Python

I've written Python boto3 code to get the average EC2 CPU utilization per day for the last 2 days. Here's the code:
import boto3
import datetime
import csv

accountId = boto3.client('sts').get_caller_identity()['Account']
session = boto3.session.Session()
region = session.region_name
ec2 = session.resource('ec2',region_name=region)
s3 = session.resource('s3')
fields = ['Account' , 'Region' , 'InstanceID' , 'InstanceName']
start = datetime.datetime.today() - datetime.timedelta(days=2)
end = datetime.datetime.today()
instanceId = ''
instanceName = ''
rows = []
filename = 'CPUUtilization.csv'

def get_cpu_utilization(instanceId):
    cw = boto3.client('cloudwatch',region_name=region)
    res = cw.get_metric_statistics(
        Namespace = 'AWS/EC2',
        Period = 86400,
        StartTime = start,
        EndTime = end,
        MetricName = 'CPUUtilization',
        Statistics = ['Average'],
        Unit = 'Percent',
        Dimensions = [
            {
                'Name' : 'InstanceId',
                'Value' : instanceId
            }
        ]
    )
    return res

def lambda_handler(event, context):
    for instance in ec2.instances.all():
        if instance.tags != None:
            for tags in instance.tags:
                if tags['Key'] == 'Name':
                    instanceName = tags['Value']
                    break
        instanceId = str(instance.id)
        response = get_cpu_utilization(instanceId)
        rows.append([accountId, region, instanceId, instanceName])
        for r in response['Datapoints']:
            day = r['Timestamp'].date()
            week = day.strftime('%a')
            avg = r['Average']
            day_uti = ' '.join([str(day),week])
            fields.append(day_uti)
            rows.append([avg])
    with open("/tmp/"+filename, 'w+') as csvfile:
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow(fields)
        csvwriter.writerows(rows)
    csvfile.close()
    s3.Bucket('instances-cmdb').upload_file('/tmp/CPUUtilization.csv', 'CPUUtilization.csv')
The output written to the CSV file is like this:
The average CPU utilization value is written to cell A3, but it should be written to cell E2, under the date. All subsequent days should be written to the 1st row, and the corresponding values should go in the 2nd row, cell by cell, under their respective dates.
How can I achieve this?
I have a couple of other questions related to AWS CloudWatch metrics.
This particular instance was in the stopped state the whole day (1st April 2022). Still, this Lambda function reports a CPU utilization value for that day. When I checked the same metric in the console, I didn't see any data. How is this possible? Am I making a mistake?
When I ran this function multiple times, I got different CPU utilization values. The above attached image was from the 1st execution (Avg CPU utilization=0.110935...). Below is the result from the 2nd execution.
Here the avg CPU utilization for the same instance on the same day is different (0.53698..) from the previous result. Is this a mistake on my side?
Please help.
NOTE: There is only one instance in my account and it was in stopped state the whole day (1st April 2022) and started only on 2nd April 2022 at around 8:00PM IST.
You need to rethink your logic for adding columns for each datapoint returned.
The rows list contains one entry per row. It starts with this:
rows.append([accountId, region, instanceId, instanceName])
That creates one entry in the list that is a list with four values.
Later, the code attempts to add another column with:
rows.append([avg])
This results in rows having the value of [[accountId, region, instanceId, instanceName], [avg]].
This is adding another row, which is why it is appearing in the CSV file as a separate line. Rather than adding another row, the code needs to add another entry in the existing row.
The easiest way to do this would be to save the row in a list and only add the 'row' once you have all the information for the row.
So, you could replace this line:
rows.append([accountId, region, instanceId, instanceName])
with:
current_row = [accountId, region, instanceId, instanceName]
And you could later add to it with:
current_row.append(avg)
Then, after the for loop has completed adding all the columns, it can be stored with:
rows.append(current_row)
Also, be careful with this line:
fields.append(day_uti)
It adds the date to the fields list, but if there is more than one instance, each instance will add its own entries, so the header row gains duplicate date columns. I presume you want all instances to share the same date columns, so it won't work out like you expect.
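The suggestions above can be sketched as follows, with the CloudWatch call left out so the logic can be run on its own. The helper name build_table and the sample datapoints are made up for illustration; the datapoint dictionaries only mimic the 'Timestamp' and 'Average' keys used in the question. Collecting the dates once for all instances is one possible way to keep the header consistent, not the only one.

```python
import csv
import datetime
import io


def build_table(base_fields, instances):
    """Hypothetical helper: return (header, rows) with one row per instance.

    instances is a list of (base_values, datapoints) pairs, where datapoints
    is a list of dicts with 'Timestamp' and 'Average' keys.
    """
    # Collect each date once, across all instances, so the header has no duplicates
    all_days = sorted({dp["Timestamp"].date()
                       for _, datapoints in instances
                       for dp in datapoints})
    header = list(base_fields) + [" ".join([str(day), day.strftime("%a")])
                                  for day in all_days]

    rows = []
    for base_values, datapoints in instances:
        averages = {dp["Timestamp"].date(): dp["Average"] for dp in datapoints}
        current_row = list(base_values)
        for day in all_days:
            # Extend the existing row instead of appending a new row
            current_row.append(averages.get(day, ""))
        rows.append(current_row)  # stored once, after all columns are filled
    return header, rows


# Made-up sample data standing in for the EC2 and CloudWatch responses
sample = [(
    ["123456789012", "us-east-1", "i-0abc", "demo"],
    [
        {"Timestamp": datetime.datetime(2022, 4, 1), "Average": 0.11},
        {"Timestamp": datetime.datetime(2022, 4, 2), "Average": 0.54},
    ],
)]
fields = ["Account", "Region", "InstanceID", "InstanceName"]

header, rows = build_table(fields, sample)

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue())
```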

Pytrends - Interest over time - return column with None when there is no data

Pytrends for Google Trends data does not return a column if there is no data for a search parameter on a specific region.
The code below is from pytrends.request
def interest_over_time(self):
    """Request data from Google's Interest Over Time section and return a dataframe"""
    over_time_payload = {
        # convert to string as requests will mangle
        'req': json.dumps(self.interest_over_time_widget['request']),
        'token': self.interest_over_time_widget['token'],
        'tz': self.tz
    }
    # make the request and parse the returned json
    req_json = self._get_data(
        url=TrendReq.INTEREST_OVER_TIME_URL,
        method=TrendReq.GET_METHOD,
        trim_chars=5,
        params=over_time_payload,
    )
    df = pd.DataFrame(req_json['default']['timelineData'])
    if (df.empty):
        return df
    df['date'] = pd.to_datetime(df['time'].astype(dtype='float64'),
                                unit='s')
    df = df.set_index(['date']).sort_index()
From the code above, if there is no data, it just returns df, which will be empty.
My question is, how can I make it return a column with "No data" on every line and the search term as header, so that I can clearly see for which search terms there is no data?
Thank you.
I hit this problem, and then I found this page. My solution was to ask Google Trends for data on a search term it would definitely have data for, then rename the column and zero out the data.
I used the ".drop" method to get rid of the "isPartial" column and the ".rename" method to change the column name. To zero the data in the column, I created a function:
# Make value zero
def MakeZero(x):
    return x * 0
Then I used the ".apply" method on the dataframe to zero the column:
ThisYrRslt = BlankResult.apply(MakeZero)
But the question is, what search term can you ask Google Trends about that will always return a value? I chose "Google". :)
I'm sure you can think of some better ones, but it's hard to leave those words in commercial code.
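A different way to get a visible placeholder, sketched here without calling pytrends at all: post-process whatever DataFrame comes back and add a column for every search term that is missing. The function add_missing_terms, its parameters, and the sample data are hypothetical. When the result is completely empty there is no date index to fill, so the sketch assumes a date index can be supplied from elsewhere (for example from a query that did return data, as in the answer above).

```python
import pandas as pd


def add_missing_terms(df, search_terms, index=None, placeholder="No data"):
    """Hypothetical helper: add a placeholder column for each missing term."""
    result = df.copy()
    if result.empty and index is not None:
        # Nothing came back at all: start from the supplied date index
        result = pd.DataFrame(index=index)
    for term in search_terms:
        if term not in result.columns:
            result[term] = placeholder  # scalar is broadcast to every row
    return result


# Made-up data standing in for a pytrends result
dates = pd.date_range("2022-01-02", periods=3, freq="D")
found = pd.DataFrame({"python": [50, 60, 70]}, index=dates)

filled = add_missing_terms(found, ["python", "qwertyzxcv"])
nothing = add_missing_terms(pd.DataFrame(), ["qwertyzxcv"], index=dates)
print(filled)
print(nothing)
```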

How to build a dataframe from scratch while filling in missing data? (details included in question)

I have a dataframe which looks like the following (Name of the first dataframe(image below) is relevantdata in the code):
I want the dataframe to be transformed to the following format:
Essentially, I want to get the relevant confirmed number for each Key for all the dates that are available in the dataframe. If a particular date is not available for a Key, we make that value to be zero.
Currently my code is as follows. A try/except block is used because some Keys don't have the whole range of dates, so a KeyError occurs when countrydata.at[date,'Confirmed'] refers to a date that is missing for the respective Key; the except block then enters zero into the dictionary for that date:
import pandas

relevantdata = pandas.read_csv('https://raw.githubusercontent.com/open-covid-19/data/master/output/data_minimal.csv')
dates = relevantdata['Date'].unique().tolist()
covidcountries = relevantdata['Key'].unique().tolist()
data = dict()
data['Country'] = covidcountries
confirmeddata = relevantdata[['Date','Key','Confirmed']]
for country in covidcountries:
    for date in dates:
        countrydata = confirmeddata.loc[lambda confirmeddata: confirmeddata['Key'] == country].set_index('Date')
        try:
            if (date in data.keys()) == False:
                data[date] = list()
                data[date].append(countrydata.at[date,'Confirmed'])
            else:
                data[date].append(countrydata.at[date,'Confirmed'])
        except:
            if (date in data.keys()) == False:
                data[date].append(0)
            else:
                data[date].append(0)
finaldf = pandas.DataFrame(data = data)
While the above code produces the dataframe in the format I require, it is far too slow, because it loops through every key and date. I want to know if there is a better and faster way to do the same without a nested for loop. Thank you for all your help.
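One possible way to avoid the nested loop is to let pandas reshape the data with pivot_table, which turns each Date into a column and can fill missing combinations with zero. This is a sketch on a small made-up sample with the same three columns as in the question, not a tested replacement for the full data set; aggfunc="first" assumes there is at most one row per Key and Date.

```python
import pandas as pd

# Made-up sample with the same columns as relevantdata in the question
relevantdata = pd.DataFrame({
    "Date": ["2020-03-01", "2020-03-02", "2020-03-02"],
    "Key": ["AA", "AA", "BB"],
    "Confirmed": [1, 3, 5],
})

# One row per Key, one column per Date; missing Key/Date pairs become 0
pivoted = relevantdata.pivot_table(
    index="Key",
    columns="Date",
    values="Confirmed",
    aggfunc="first",
    fill_value=0,
)

# Match the layout built in the question: a 'Country' column followed by dates
finaldf = pivoted.reset_index().rename(columns={"Key": "Country"})
finaldf.columns.name = None
print(finaldf)
```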

Check logs with Spark

I'm new to Spark and I'm trying to develop a python script that reads a csv file with some logs:
userId,timestamp,ip,event
13,2016-12-29 16:53:44,86.20.90.121,login
43,2016-12-29 16:53:44,106.9.38.79,login
66,2016-12-29 16:53:44,204.102.78.108,logoff
101,2016-12-29 16:53:44,14.139.102.226,login
91,2016-12-29 16:53:44,23.195.2.174,logoff
The script should check whether a user showed some strange behavior, for example two consecutive 'login' events without a 'logoff' in between. I've loaded the csv as a Spark DataFrame and I want to compare the log rows of a single user, ordered by timestamp, checking whether two consecutive events are of the same type (login - login, logoff - logoff). I'm looking for a way to do it in a 'map-reduce' style, but at the moment I can't figure out how to use a reduce function that compares consecutive rows.
The code I've written works, but the performance is very bad.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local","Data Check")
sqlContext = SQLContext(sc)
LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"
N_USERS = 10*1000
dataFrame = sqlContext.read.format("com.databricks.spark.csv").load(LOG_FILE_PATH)
dataFrame = dataFrame.selectExpr("C0 as userID","C1 as timestamp","C2 as ip","C3 as event")
wrongUsers = []
for i in range(0,N_USERS):
    userDataFrame = dataFrame.where(dataFrame['userId'] == i)
    userDataFrame = userDataFrame.sort('timestamp')
    prevEvent = ''
    for row in userDataFrame.rdd.collect():
        currEvent = row[3]
        if(prevEvent == currEvent):
            wrongUsers.append(row[0])
        prevEvent = currEvent
badUsers = sqlContext.createDataFrame(wrongUsers)
badUsers.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)
First (not related, but still): make sure the number of entries per user is not too large, because the collect in for row in userDataFrame.rdd.collect(): pulls every row for that user onto the driver, which is dangerous.
Second, you don't need to leave the DataFrame API and fall back to plain Python here; just stick to Spark.
Now, your problem. It is basically "for each line I want to know something from the previous line": that is what window functions are for, and specifically the lag function. Here are two interesting articles about window functions in Spark: one from Databricks with code in Python and one from Xinh with (I think easier to understand) examples in Scala.
I have a solution in Scala, but I think you'll manage to translate it to Python:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import sqlContext.implicits._

val LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
val RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"

val data = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true") // use the header from your csv
  .load(LOG_FILE_PATH)

val wSpec = Window.partitionBy("userId").orderBy("timestamp")

val badUsers = data
  .withColumn("previousEvent", lag($"event", 1).over(wSpec))
  .filter($"previousEvent" === $"event")
  .select("userId")
  .distinct

badUsers.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)
Basically, you retrieve the value from the previous line and compare it to the value on the current line. If they match, that is a wrong behavior and you keep the userId. For the first line in the "block" of lines for each userId, the previous value will be null: the comparison with the current value then evaluates to null, which the filter does not treat as a match, so there is no problem here.
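To see what the lag-over-window step computes, here is a small analogue in pandas rather than Spark: groupby plus shift(1) plays the role of lag over a window partitioned by userId and ordered by timestamp. The sample rows are made up and this is only an illustration of the logic on data that fits in memory; it is not a PySpark translation of the Scala code above.

```python
import pandas as pd

# Made-up log rows with the same columns as the csv in the question
logs = pd.DataFrame({
    "userId": [13, 43, 13, 43, 13],
    "timestamp": pd.to_datetime([
        "2016-12-29 16:53:44",
        "2016-12-29 16:53:45",
        "2016-12-29 16:54:10",
        "2016-12-29 16:55:00",
        "2016-12-29 16:56:30",
    ]),
    "event": ["login", "login", "login", "logoff", "logoff"],
})

# Order each user's rows by time, then look one row back within each user
ordered = logs.sort_values(["userId", "timestamp"])
ordered["previousEvent"] = ordered.groupby("userId")["event"].shift(1)

# The first row of each user has no previous event, so it never matches
repeated = ordered["previousEvent"] == ordered["event"]
bad_users = ordered.loc[repeated, "userId"].drop_duplicates().tolist()
print(bad_users)
```

User 13 logs in twice in a row, so it is the only user flagged in this sample.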
