create new column in dataframe conditionally - python

Updated question:
Using the code below, I am able to access the dataframe only after the for loop completes, but I want to use the most recently created column of the dataframe at intermediate times, i.e. after every 5 minutes, whichever column is the last one in the dataframe. How can I achieve this?
@app.route("/sortbymax")
def sortbymax():
    df = updated_data()
    #### here I want to use the most recently created column
    df = create_links(df)
    df = df.sort_values(by=['perc_change'], ascending=False)
    return render_template('sortbymax.html', tables=[df.to_html(escape=False)], titles=df.columns.values)
def read_data():
    filename = r'c:\Users\91956\Desktop\bk.xlsm'
    df = pd.read_excel(filename)
    return df

def updated_data():
    df = read_data()
    for i in range(288):
        temp = read_data()
        x = datetime.datetime.now().strftime("%H:%M:%S")
        df['perc_change_' + x] = temp['perc_change']
        time.sleep(300)
    return df
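For reference, the last (most recently created) column of a dataframe can be read positionally; a minimal sketch, independent of the loop above:
latest_name = df.columns[-1]  # name of the most recently appended column
latest_col = df[latest_name]  # that column as a Series
latest_df = df.iloc[:, [-1]]  # or keep it as a one-column dataframe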

I see you have an .xlsm file, which means it is a macro-enabled Excel workbook. You can read it, but if you write it back with python you will most probably lose the macro part of the workbook.
For the python part:
This will copy the perc_change column every 5 minutes, with the timestamp in the column name. However, bear in mind that this will only work for one day (after that it will start replacing existing columns). If you want it to work for longer periods, let me know and I will add day-month-year (whatever you want) to the column names.
import datetime
import time

import pandas as pd

def read_data():
    filename = r'c:\Users\91956\Desktop\bk.xlsm'
    df = pd.read_excel(filename)
    return df

def write_data(df):
    filename = r'c:\Users\91956\Desktop\bk.xlsm'
    df.to_excel(filename)

df = read_data()  # read the excel file for the first time
for i in range(288):  # this will run for one day exactly
    temp = read_data()
    x = datetime.datetime.now().strftime("%H:%M")
    df['perc_change_' + x] = temp['perc_change']
    time.sleep(300)
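To run past a single day, a minimal variant is to include the date in the column name so names stay unique across days (this exact format string is just one choice):
x = datetime.datetime.now().strftime("%d-%m-%Y %H:%M")  # day-month-year keeps names unique across days
df['perc_change_' + x] = temp['perc_change']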

Related

Python Pandas Update Time in Cell Issues

Going nuts trying to update a column of time entries in a dataframe. I am opening a csv file that has a column of time entries in UTC. I can take these times, convert them to Alaska Standard Time, and print the new time out just fine. But when I attempt to put the time back into the dataframe, I get no errors, yet I also don't get the new time in the dataframe; the old UTC time is retained. The code is below; I'm curious what it is I am missing. Is there something special about times?
import glob
import os
import pandas as pd
from datetime import datetime
from statistics import mean

def main():
    AKST = 9
    allDirectories = os.listdir('c:\\MyDir\\')
    for directory in allDirectories:
        curDirectory = directory.capitalize()
        print('Gathering data from: ' + curDirectory)
        dirPath = 'c:\\MyDir\\' + directory + '\\*.csv'
        # Files are named by date, so sorting by name gives us a proper date order
        files = sorted(glob.glob(dirPath))
        df = pd.DataFrame()
        for i in range(0, len(files)):
            data = pd.read_csv(files[i], usecols=['UTCDateTime', 'current_humidity', 'pm2_5_cf_1', 'pm2_5_cf_1_b'])
            dfTemp = pd.DataFrame(data)  # Temporary dataframe to hold our new info
            df = pd.concat([df, dfTemp], axis=0)  # Add new info to end of dataframe
        print("Converting UTC to AKST, this may take a moment.")
        for index, row in df.iterrows():
            convertedDT = datetime.strptime(row['UTCDateTime'], '%Y/%m/%dT%H:%M:%Sz') - pd.DateOffset(hours=AKST)
            print("UTC: " + row['UTCDateTime'])
            df.at[index, 'UTCDateTime'] = convertedDT
            print("AKST: " + str(convertedDT))
            print("row['UTCDateTime'] = " + row['UTCDateTime'] + '\n')  # Should be updated with AKST, but is not!
Edit - Alternatively: is there a way to convert the date when it is first read into the dataframe? It seems like that would be faster than having two for loops.
From your code, it looks like the data is getting updated correctly in the dataframe, but you are printing row, which is not updated, because it was fetched from the dataframe before the update!
# You are updating df
df.at[index, 'UTCDateTime'] = convertedDT
# below you are printing row
print("row['UTCDateTime'] = " + row['UTCDateTime'])
See sample code below and its output for the explanation.
data = pd.DataFrame({'Year': [1982, 1983], 'Statut': ['Yes', 'No']})
for index, row in data.iterrows():
    data.at[index, 'Year'] = '5000' + str(index)
    print('Printing row which is unchanged : ', row['Year'])
print('Updated Dataframe\n', data)
Output:
Printing row which is unchanged :  1982
Printing row which is unchanged :  1983
Updated Dataframe
     Year Statut
0  50000    Yes
1  50001     No
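As for the edit in the question: a sketch of converting the timestamps in one vectorized pass right after reading, instead of a second loop (assuming the same '%Y/%m/%dT%H:%M:%Sz' format and the AKST offset defined in the question):
df['UTCDateTime'] = pd.to_datetime(df['UTCDateTime'], format='%Y/%m/%dT%H:%M:%Sz') - pd.DateOffset(hours=AKST)  # parse the whole column at once, then shift to AKST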

Python pandas dataframe set_name length issue expecting one argument but is given two

So I have this prep_data function and I am giving it the following csv data:
identifier,Hugo_Symbol,Tumor_Sample_Barcode,Variant_Classification,patient
1,patient,a,Silent,6
22,mutated,d,e,7
1,Hugo_Symbol,f,g,88
Within this prep_data function there is this line:
gene_mutation_df.index.set_names(['Hugo_Symbol', 'patient'], inplace=True)
However, it keeps erroring out when it gets to that line, saying:
ValueError: Length of new names must be 1, got 2
Is there something wrong with that line, or is something wrong with the function?
Here is the whole source code:
import pandas as pd
import numpy as np

PRIMARY_TUMOR_PATIENT_ID_REGEX = '^.{4}-.{2}-.{4}-01.*'
SHORTEN_PATIENT_REGEX = '^(.{4}-.{2}-.{4}).*'

def mutations_for_gene(df):
    mutated_patients = df['identifier'].unique()
    return pd.DataFrame({'mutated': np.ones(len(mutated_patients))}, index=mutated_patients)

def prep_data(mutation_path):
    df = pd.read_csv(mutation_path, low_memory=True, dtype=str, header=0)  # reads the csv file from the given path, casting the data to str
    df = df[~df['Hugo_Symbol'].str.contains('Hugo_Symbol')]  # drops any row whose 'Hugo_Symbol' value contains the literal header text 'Hugo_Symbol'
    df['Hugo_Symbol'] = '\'' + df['Hugo_Symbol'].astype(str)  # prepends '\'' to all the data remaining in that column
    df['Tumor_Sample_Barcode'] = df['Tumor_Sample_Barcode'].str.strip()  # strips whitespace from the data within this heading
    non_silent = df.where(df['Variant_Classification'] != 'Silent')  # masks rows where 'Variant_Classification' equals 'Silent' (turning them to NaN)
    df = non_silent.dropna(subset=['Variant_Classification'])  # drops the masked rows
    non_01_barcodes = df[~df['Tumor_Sample_Barcode'].str.contains(PRIMARY_TUMOR_PATIENT_ID_REGEX)]['Tumor_Sample_Barcode']  # barcodes that do not match PRIMARY_TUMOR_PATIENT_ID_REGEX
    # TODO: Double check that the extra ['Tumor_Sample_Barcode'] serves no purpose
    df = df.drop(non_01_barcodes.index)
    print(df)
    shortened_patients = df['Tumor_Sample_Barcode'].str.extract(SHORTEN_PATIENT_REGEX, expand=False)
    df['identifier'] = shortened_patients
    gene_mutation_df = df.groupby(['Hugo_Symbol']).apply(mutations_for_gene)
    gene_mutation_df.index.set_names(['Hugo_Symbol', 'patient'], inplace=True)
    gene_mutation_df = gene_mutation_df.reset_index()
    gene_patient_mutations = gene_mutation_df.pivot(index='Hugo_Symbol', columns='patient', values='mutated')
    return gene_patient_mutations.transpose().fillna(0)
Any help would be greatly appreciated. (I know this wasn't too specific; I'm still trying to work out what this function does exactly and how I could make data to test it.)
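One thing worth checking: Index.set_names needs exactly one name per index level, so this error means gene_mutation_df came back with a single-level index rather than the expected two-level (gene, patient) one. A minimal sketch of the check, using made-up data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Hugo_Symbol': ['A', 'A', 'B'], 'identifier': ['p1', 'p2', 'p1']})
grouped = df.groupby(['Hugo_Symbol']).apply(
    lambda g: pd.DataFrame({'mutated': np.ones(g['identifier'].nunique())}, index=g['identifier'].unique()))
print(grouped.index.nlevels)  # set_names(['Hugo_Symbol', 'patient']) requires this to be 2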

Pandas DataFrame takes too long

I am running the below code on a file with close to 300k lines. I know my code is not very efficient, as it takes forever to finish; can anyone advise me on how I can speed it up?
import sys
import numpy as np
import pandas as pd

file = sys.argv[1]
df = pd.read_csv(file, delimiter=' ', header=None)
df.columns = ["ts", "proto", "orig_bytes", "orig_pkts", "resp_bytes", "resp_pkts", "duration", "conn_state"]

orig_bytes = np.array(df['orig_bytes'])
resp_bytes = np.array(df['resp_bytes'])
size = np.array([])
ts = np.array([])

for i in range(len(df)):
    if orig_bytes[i] > resp_bytes[i]:
        size = np.append(size, orig_bytes[i])
        ts = np.append(ts, df['ts'][i])
    else:
        size = np.append(size, resp_bytes[i])
        ts = np.append(ts, df['ts'][i])
The aim is to record only the instances where one of the two (orig_bytes or resp_bytes) is the larger one.
Thank you all for your help.
I can't guarantee that this will run faster than what you have, but it is a more direct route to where you want to go. Also, I'm assuming based on your example that you don't want to keep instances where the two byte values are equal and that you want a separate DataFrame in the end, not a new column in the existing df:
After you've created your DataFrame and renamed the columns, you can use query to drop all the instances where orig_bytes and resp_bytes are the same, create a new column with the max value of the two, and then narrow the DataFrame down to just the two columns you want.
df = pd.read_csv(file, delimiter=' ', header=None)
df.columns = ["ts", "proto", "orig_bytes", "orig_pkts", "resp_bytes", "resp_pkts", "duration", "conn_state"]
df_new = df.query("orig_bytes != resp_bytes").copy()  # .copy() avoids a SettingWithCopyWarning on the next line
df_new['biggest_bytes'] = df_new[['orig_bytes', 'resp_bytes']].max(axis=1)
df_new = df_new[['ts', 'biggest_bytes']]
If you do want to include the entries where they are equal to each other, then just skip the query step.
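An equivalent sketch without query, using a boolean mask and numpy's elementwise maximum (standard pandas/numpy calls, shown for comparison):
import numpy as np

mask = df['orig_bytes'] != df['resp_bytes']  # drop rows where the two byte counts are equal
df_new = pd.DataFrame({
    'ts': df.loc[mask, 'ts'],
    'biggest_bytes': np.maximum(df.loc[mask, 'orig_bytes'], df.loc[mask, 'resp_bytes']),
})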

Appending multiple pandas dataframe using for loop but returns an empty dataframe

I am downloading a number of .csv files, which I convert to pandas dataframes and append to each other.
Each csv can be accessed via a url that is created each day; using datetime, the urls can easily be generated and put in a list.
I am able to open each of the urls in the list individually.
But when I try to open a number of them and append them together, I get an empty dataframe. The code looks like this:
# Imports
import datetime
import pandas as pd

# Testing that we can open a .csv file
data = pd.read_csv('https://promo.betfair.com/betfairsp/prices/dwbfpricesukwin01022018.csv')
data.iloc[:5]

# Taking headings to use to create the new dataframe
data_headings = list(data.columns.values)

# Setting up strings for the url
path_start = 'https://promo.betfair.com/betfairsp/prices/dwbfpricesukwin'
file = ".csv"

# Getting dates which are used in the url
start = datetime.datetime.strptime("01-02-2018", "%d-%m-%Y")
end = datetime.datetime.strptime("04-02-2018", "%d-%m-%Y")
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)]

# Creating the new dataframe which is appended to
for heading in data_headings:
    data = {heading: []}
df = pd.DataFrame(data, columns=data_headings)

# Creating the list of urls
date_list = []
for date in date_generated:
    date_string = date.strftime("%d%m%Y")
    x = path_start + date_string + file
    date_list.append(x)

# Opening and appending csv files from the list of urls
for full_path in date_list:
    data_link = pd.read_csv(full_path)
    df.append(data_link)
print(df)
I have checked whether they are just empty csvs, but they are not. Any help would be appreciated.
Cheers,
Sandy
You are never storing the appended dataframe. The line
df.append(data_link)
should be
df = df.append(data_link)
However, this may be the wrong approach. You really want to use the array of URLs and concatenate them. Check out this similar question and see if it can improve your code!
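A minimal sketch of that concat approach, reusing the date_list of urls from the question (pd.concat replaces the per-file append):
frames = [pd.read_csv(full_path) for full_path in date_list]  # read every csv once
df = pd.concat(frames, ignore_index=True)  # concatenate them all in a single call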
I really can't understand what you wanted to do here:
# Creating the new dataframe which is appended to
for heading in data_headings:
    data = {heading: []}
df = pd.DataFrame(data, columns=data_headings)
By the way, try this:
for full_path in date_list:
    data_link = pd.read_csv(full_path)
    df = df.append(data_link.copy())

add column to a dataframe in python pandas

How do I loop through my excel sheets and add each 'Adjusted Close' to a dataframe? I want to sum all the adjusted closes and make a stock index.
When I try the code below, the dataframe Percent_Change is empty.
xls = pd.ExcelFile('databas.xlsx')
countSheets = len(xls.sheet_names)
Percent_Change = pd.DataFrame()
x = 0
for x in range(countSheets):
    data = pd.read_excel('databas.xlsx', sheet_name=x, index_col='Date')
    # Calculate the percent change from day to day
    Percent_Change[x] = pd.Series(data['Adj Close'].pct_change()*100, index=Percent_Change.index)
    stock_index = data['Percent_Change'].cumsum()
Unfortunately I do not have the data to replicate your complete example. However, there appears to be a bug in your code.
You are looping over "x", and "x" is just an integer index. You probably want to loop over the sheet names and append them to your DF. If you want to do that, your code should be:
import pandas as pd

xls = pd.ExcelFile('databas.xlsx')
# pep8 unto thyself only: it is conventional to use "_" instead of camelCase and to avoid long names if at all possible
sheets = xls.sheet_names
Percent_Change = pd.DataFrame()
# using "sheet" instead of "x" is more "pythonic"
for sheet in sheets:
    data = pd.read_excel('databas.xlsx', sheet_name=sheet, index_col='Date')
    # Calculate the percent change from day to day
    Percent_Change[sheet] = pd.Series(data['Adj Close'].pct_change()*100, index=Percent_Change.index)
    stock_index = data['Percent_Change'].cumsum()
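Note that two issues from the question survive in this version: assigning with index=Percent_Change.index while Percent_Change is still empty reindexes every Series against an empty index (one reason the frame comes out empty), and the last line references a 'Percent_Change' column that data does not have. A sketch of the likely intent, assuming an equal-weight index (the mean() weighting is an assumption):
Percent_Change = pd.DataFrame()
for sheet in xls.sheet_names:
    data = pd.read_excel('databas.xlsx', sheet_name=sheet, index_col='Date')
    Percent_Change[sheet] = data['Adj Close'].pct_change() * 100  # let pandas align on the Date index

stock_index = Percent_Change.mean(axis=1).cumsum()  # equal-weight average across sheets, cumulated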
