Pandas dataframe values not changing - python

I am currently getting my toes wet with neural nets, using Colaboratory, pandas and Keras. To set up my data, I need to normalize everything (I'm scaling all values to between 0 and 1 by dividing by the largest value). However, I've run into two issues.
For some reason, the column "stroke_count" isn't being modified, or if it is, it's being rounded down to 0 no matter what.
I also saw that
df.fillna(7)
supposedly replaces all null or NaN values with the value inside the parentheses, but it isn't doing that.
# generating character dictionary & normalizing data
hanzi_dict = {}
hanzi_counter = 0
df.fillna(7)
for index, row in df.iterrows():
    hanzi_dict[str(hanzi_counter)] = row['charcter']
    hanzi_counter = hanzi_counter + 1
    df.at[index, 'radical_code'] = row['radical_code'] / 214.9  # max value of any radical
    df.at[index, 'stroke_count'] = row['stroke_count'] / 35.0   # max # of strokes
    df.at[index, 'hsk_levl'] = row['hsk_levl'] / 7               # max level + 1
print(hanzi_dict)
display(df)
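A likely explanation, offered here as a hedged sketch rather than a confirmed diagnosis: df.fillna(7) returns a new DataFrame rather than modifying df in place, and assigning a float back into an integer-typed column with df.at can silently cast the value to int, which turns anything below 1 into 0. Something along these lines (reusing the column names from the question) may behave as expected:
df = df.fillna(7)  # fillna is not in-place; keep the returned frame
df['stroke_count'] = df['stroke_count'].astype(float)  # avoid truncation back to int
df['radical_code'] = df['radical_code'] / 214.9  # max value of any radical
df['stroke_count'] = df['stroke_count'] / 35.0   # max # of strokes
df['hsk_levl'] = df['hsk_levl'] / 7              # max level + 1
hanzi_dict = {str(i): ch for i, ch in enumerate(df['charcter'])}
print(hanzi_dict)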

Related

Missing Value NaN distribution

I built some code to find the distribution of NaN values in a dataset that has about 350 subject IDs, roughly 140 trials per subject, and 6 eye-tracking variables. I wrote a function to calculate the missing values, their percentages and the distribution of NaN runs, and then looped over every subject ID, trial and time series. I have three problems: my code saves only the last iteration (so what is my mistake, and how do I save every iteration?), and I also want to obtain the histogram for every iteration and save all of them. Could anybody suggest where I'm going wrong? I'm new to Python.
Here is my code:
# put everything in a function:
data_missing_evalution_list = {}
mis_val_tot_list = {}
mis_val_percent_list = {}
#mis_val_groupby_na_list = {}
mis_val_distribution_list = {}

def missing_values_evaluation(df):
    mis_val_tot = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_groupby_na = df.isna().groupby(df.notna().cumsum()).sum()
    mis_val_distribution = mis_val_groupby_na.value_counts(dropna=True)
    ##############################
    data_missing_evalution_list['mis_val_tot_list'] = mis_val_tot
    data_missing_evalution_list['mis_val_percent_list'] = mis_val_percent
    #data_missing_evalution_list['mis_val_groupby_na_list'] = mis_val_groupby_na
    data_missing_evalution_list['mis_val_distribution_list'] = mis_val_distribution
    return data_missing_evalution_list
######
My loop:
# distribution of NaN missing values for eye_data_fixation:
subjects_list = list(eye_data_fixation['subject'].unique())
trial_list = list(eye_data_fixation['trial'].unique())
dict_for_ts_list = []
# different_blocks_list = []
for i in range(len(subjects_list)):
    data_id = eye_data_fixation.loc[eye_data_fixation['subject'] == subjects_list[i]]
    for j in range(len(trial_list)):
        for k in range(6, 12):
            time_series = data_id.loc[data_id['trial'] == trial_list[j]].iloc[:, k]
            missing_values_evaluation(time_series)
            dict_for_ts_list.append(missing_values_evaluation(time_series))
            # dict_for_ts_more.append(dict_for_ts)
# print(dict_for_ts_list)
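For what it's worth, the likely culprit is that the function writes into a single module-level dict and returns that same dict on every call, so each element appended to dict_for_ts_list is a reference to one object that keeps being overwritten. A minimal sketch of a fix, assuming the eye_data_fixation frame and column layout from the question, is to build a fresh dict inside the function and key the results by subject, trial and column:
def missing_values_evaluation(ts):
    # build a new dict on every call so earlier results are not overwritten
    mis_val_tot = ts.isnull().sum()
    mis_val_percent = 100 * ts.isnull().sum() / len(ts)
    mis_val_groupby_na = ts.isna().groupby(ts.notna().cumsum()).sum()
    mis_val_distribution = mis_val_groupby_na.value_counts(dropna=True)
    return {'mis_val_tot': mis_val_tot,
            'mis_val_percent': mis_val_percent,
            'mis_val_distribution': mis_val_distribution}

results = {}
for subject in eye_data_fixation['subject'].unique():
    data_id = eye_data_fixation.loc[eye_data_fixation['subject'] == subject]
    for trial in eye_data_fixation['trial'].unique():
        for k in range(6, 12):
            ts = data_id.loc[data_id['trial'] == trial].iloc[:, k]
            if ts.empty:
                continue
            results[(subject, trial, ts.name)] = missing_values_evaluation(ts)
            # one histogram per iteration could then be drawn from
            # results[(subject, trial, ts.name)]['mis_val_distribution'].plot(kind='bar')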

Efficient way to loop through GroupBy DataFrame

Since my last post lacked information, here is an example of my df (the important columns):
deviceID: unique ID of the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
position_timestamp_measure: Unix timestamp of the time the record was created.
deviceID  mileage  position_timestamp_measure
54672     10       1600696079
43423     20       1600696079
42342     3        1600701501
54672     3        1600702102
43423     2        1600702701
My goal is to validate the mileage by comparing it to the max speed of the vehicle (which is 80 km/h): I calculate the speed of the vehicle from the timestamp difference and the mileage, and the result should then be written back into the original dataset.
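As a quick worked example: if a vehicle reports 10 km of mileage and the previous message was 360 seconds (0.1 h) earlier, the implied speed is 10 / 0.1 = 100 km/h, which exceeds 80 km/h, so that row would not be marked valid.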
What I've done so far is the following:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
# create new col and set all values to false
df_ori['valid'] = 0
for group_name, group in df:
    # sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    # since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'validPosition'] = 1
    # iterate through each data point in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i] - group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec / 60) / 60
        # calculate speed
        if (group.mileage.values[i] / timeHours) < maxSpeedKMH:
            df_ori.loc[dataset.index == group.dataIndex.values[i], 'validPosition'] = 1
dataset.validPosition.value_counts()
It definitely works the way I want it to; however, it lacks performance a lot. The df contains nearly 700k rows (already cleaned). I am still a beginner and can't figure out a better solution. Would really appreciate any of your help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
# create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])
# Subtract the preceding value from the current value within each device group
df_ori['timeGoneSec'] = df_ori.groupby('device_id')['position_timestamp_measure'].diff()
# The operation above produces NaN for the first row of each group,
# so fill 'valid' with 1 there, as in the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
df_ori['timeHours'] = df_ori['timeGoneSec'] / 3600  # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1
# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for-loops, i.e. row-by-row iteration, which can be insanely slow.
Since I can't see the full context of your code, please double check the logic and make sure it works as desired.

Time Series Dataframe Groupby 3d Array - observation/row count - For LSTM

I have a time series with a structure like the below: an identifier column and two value columns (floats).
The dataframe is called just df:
Date        Id  Value1  Value2
2014-10-01  A   1.1     1.2
2014-10-01  B   1.3     1.4
2014-10-02  A   1.5     1.6
2014-10-02  B   1.7     1.8
2014-10-03  A   3.2     4.8
2014-10-03  B   8.2     10.1
2014-10-04  A   6.1     7.2
2014-10-04  B   4.3     4.1
What I am trying to do is turn it into an array grouped by the identifier column with a rolling 3-observation window, so I would end up with this:
[[[1.1 1.2]
[1.5 1.6] '----> ID A 10/1 to 10/3'
[3.2 4.8]]
[[1.3 1.4]
[1.7 1.8] '----> ID B 10/1 to 10/3'
[8.2 10.1]]
[[1.5 1.6]
[3.2 4.8] '----> ID A 10/2 to 10/4'
[6.1 7.2]]
[[1.7 1.8]
[8.2 10.1] '----> ID B 10/2 to 10/4'
[4.3 4.1]]]
Of course, ignore the quoted annotations in the array above, but hopefully you get the idea.
I have a larger dataset with more identifiers and may need to change the observation count, so I can't hard-code the row count. So far the direction I am leaning towards is taking the unique values of the Id column, iterating, and grabbing 3 values at a time by creating a temp df and iterating over that.
It seems there is probably a better and faster way to do this.
"pseudo code"
unique_ids = df['Id'].unique().tolist()
for id in unique_ids:
    temp_df = df.loc[df['Id'] == id]
Though the part I am stuck on is the best way to iterate over the temp_df as well.
The end output would be used in an LSTM model; however, most other solutions I've found are written without needing to handle the groupby aspect of the 'Id' column.
Here is what I ended up doing for the solution. It's not the prettiest or easiest, but then again my question wasn't winning any beauty contests to begin with.
id_list = array_steps_df['Id'].unique().tolist()
# change number of steps as needed
step = 3
column_list = ['Value1', 'Value2']
master_list = []
for id in id_list:
    master_dict = {}
    for column in column_list:
        array_steps_id_df = array_steps_df.loc[array_steps_df['Id'] == id]
        array_steps_id_df = array_steps_id_df[[column]].values
        master_dict[column] = []
        for obs in range(len(array_steps_id_df) - step + 1):
            start_obs = obs + step
            master_dict[column].append(array_steps_id_df[obs:start_obs, ])
    master_list.append(master_dict)
for idx, dic in enumerate(master_list):
    # init arrays here
    if idx == 0:
        value1_array_init = master_list[0]['Value1']
        value2_array_init = master_list[0]['Value2']
    else:
        value1_array_init += master_list[idx]['Value1']
        value2_array_init += master_list[idx]['Value2']
value1_array = np.array(value1_array_init)
value2_array = np.array(value2_array_init)
all_array = np.hstack((value1_array, value2_array)).reshape((len(array_steps_df) - (step + 1),
                                                             len(column_list),
                                                             step)).transpose(0, 2, 1)
Fixed: my mistake. I added a transpose at the end and redid the order of features and steps in the reshape.
Credit to this site for some extra help
https://www.mikulskibartosz.name/how-to-turn-pandas-data-frame-into-time-series-input-for-rnn/
I ended up redoing this a bit to make it more dynamic for the columns and to keep the time series in order; I also added a target array to keep the predictions in order. For anyone that needs this, here is the function:
def data_to_array_steps(array_steps_df, time_steps, columns_to_array, id_column):
    """
    https://www.mikulskibartosz.name/how-to-turn-pandas-data-frame-into-time-series-input-for-rnn/
    :param array_steps_df: the dataframe from the csv
    :param time_steps: how many time steps
    :param columns_to_array: what columns to convert to the array
    :param id_column: what is to be used for the identifier
    :return: data grouped in a # of observations by identifier and date
    """
    id_list = array_steps_df[id_column].unique().tolist()
    date_list = array_steps_df['date'].unique().tolist()
    master_list = []
    target_list = []
    missing_counter = 0
    total_counter = 0
    # grab date size = time steps at a time and iterate through all of them
    for date in range(len(date_list) - time_steps + 1):
        date_range_test = date_list[date:time_steps + date]
        date_range_df = array_steps_df.loc[(array_steps_df['date'] <= date_range_test[-1]) &
                                           (array_steps_df['date'] >= date_range_test[0])]
        # for each id do it separately so time series data doesn't get mixed up
        for identifier in id_list:
            # get the id here, then skip if it doesn't have the required time steps/observations
            date_range_id = date_range_df.loc[date_range_df[id_column] == identifier]
            master_dict = {}
            # if there aren't enough observations for the date range
            if len(date_range_id) != time_steps:
                # don't fully need the counter except in unusual circumstances when debugging; it causes no harm for now
                missing_counter += 1
            else:
                # add the target each loop through for the last date in the date range for the id or ticker
                target = array_steps_df['target'].\
                    loc[(array_steps_df['date'] == date_range_test[-1])
                        & (array_steps_df[id_column] == identifier)
                        ].iloc[0]
                target_list.append(target)
                total_counter += 1
                # loop through each column in the dataframe
                for column in columns_to_array:
                    date_range_id_value = date_range_id[[column]].values
                    master_dict[column] = []
                    master_dict[column].append(date_range_id_value)
                master_list.append(master_dict)
    # redo columns to arrays, after they have been ordered and grouped by Id above
    array_list = []
    # for each column, go through the values in the array, create an array for the column, then append to the list
    for column in columns_to_array:
        for idx, dic in enumerate(master_list):
            # init arrays here if the first value
            if idx == 0:
                value_array_init = master_list[0][column]
            else:
                value_array_init += master_list[idx][column]
        array_list.append(np.array(value_array_init))
    # for each value in the array list, horizontally stack each value
    all_array = np.hstack(array_list).reshape((total_counter,
                                               len(columns_to_array),
                                               time_steps)).transpose(0, 2, 1)
    target_array_all = np.array(target_list).reshape(len(target_list), 1)
    # should probably make this an if condition later after a few more tests
    print('check of length of arrays', len(all_array), len(target_array_all))
    return all_array, target_array_all
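A small usage sketch follows; the frame, column values and time_steps below are illustrative only, not taken from the original data:
import numpy as np
import pandas as pd

example_df = pd.DataFrame({
    'date': ['2014-10-01', '2014-10-01', '2014-10-02', '2014-10-02',
             '2014-10-03', '2014-10-03', '2014-10-04', '2014-10-04'],
    'Id': ['A', 'B'] * 4,
    'Value1': [1.1, 1.3, 1.5, 1.7, 3.2, 8.2, 6.1, 4.3],
    'Value2': [1.2, 1.4, 1.6, 1.8, 4.8, 10.1, 7.2, 4.1],
    'target': [0, 1, 0, 1, 1, 0, 1, 0],
})
X, y = data_to_array_steps(example_df, time_steps=3,
                           columns_to_array=['Value1', 'Value2'], id_column='Id')
print(X.shape, y.shape)  # with this toy data: (4, 3, 2) and (4, 1)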

How to style columns in pandas without overlapping and deleting previous work

I am doing some styling of pandas columns where I want to highlight in green or red the values more than 2*std above or below the mean of the corresponding column. But when I loop over to the next column, the previous work is essentially deleted and only the last column shows any changes.
Function:
def color_outliers(value):
    if value <= (mean - (2 * std)):
        # print(mean)
        # print(std)
        color = 'red'
    elif value >= (mean + (2 * std)):
        # print(mean)
        # print(std)
        color = 'green'
    else:
        color = 'black'
    return 'color: %s' % color
Code:
comp_holder = []
titles = []
i = 0
for value in names:
    titles.append(names[i])
    i += 1

# Number of articles and days of search
num_days = len(page_list[0]['items']) - 2
num_arts = len(titles)
arts = 0
days = 0
# print(num_days)
# print(num_arts)

# Sets index of dataframe to be timestamps of articles
for days in range(num_days):
    comp_dict = {}
    comp_dict = {'timestamp(YYYYMMDD)': int(int(page_list[0]['items'][days]['timestamp']) / 100)}
    # Adds each article from the current day in the loop to the dictionary for the row append
    for arts in range(num_arts):
        comp_dict[titles[arts]] = page_list[arts]['items'][days]['views']
    comp_holder.append(comp_dict)
comp_df = pd.DataFrame(comp_holder)

arts = 0
days = 0
outliers = comp_df
for arts in range(num_arts):
    mean = comp_df[titles[arts]].mean()
    std = comp_df[titles[arts]].std()
    outliers = comp_df.style.applymap(color_outliers, subset=[titles[arts]])
Each time I go through this for loop, the 'outliers' styling dataframe resets itself and only works on the current subset, and if I remove the subset, it uses one mean and std for the entire dataframe. I have tried style.apply with axis=0 but I can't get it to work.
My dataframe consists of 21 columns: the first is the timestamp and the next twenty are columns of ints based upon input files. I also have two lists, indexed from 0 to 19, of the means and stds of each column.
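One possible reason for the reset, offered as a sketch rather than a verified fix: comp_df.style creates a brand-new Styler on every loop iteration, so earlier styling is thrown away, and because the Styler applies its functions lazily, the global mean and std end up holding only the last column's values by the time the table renders. Chaining on a single Styler and binding each column's stats with functools.partial (the function is reworked here to take the stats as arguments, reusing comp_df and titles from the question) might look like this:
import functools

def color_outliers(value, mean, std):
    if value <= mean - 2 * std:
        return 'color: red'
    if value >= mean + 2 * std:
        return 'color: green'
    return 'color: black'

outliers = comp_df.style  # create the Styler once and keep chaining on it
for col in titles:
    col_mean = comp_df[col].mean()
    col_std = comp_df[col].std()
    # bind this column's stats now so lazy evaluation doesn't reuse the last loop's values
    outliers = outliers.applymap(functools.partial(color_outliers, mean=col_mean, std=col_std),
                                 subset=[col])
outliers  # in a notebook, this renders with all columns styled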
I would apply the function on the whole column instead of using applymap. I'm not sure I can follow your code since I don't know what your data looks like, but this is what I would do:
# sample data
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1, 100, [10, 3]))

# compute the statistics
stats = df.agg(['mean', 'std'])

# format function on columns
def color_outlier(col, thresh=2):
    # extract mean and std of the column
    mean, std = stats[col.name]
    return np.select((col <= mean - std * thresh, col >= mean + std * thresh),
                     ('color: red', 'color: green'),
                     'color: black')

# thresh changed for demonstration, remove when used
df.style.apply(color_outlier, thresh=0.5)

Finding the highest value

So I'm currently using a loop to search through my CSV data to find the "High" and "Low" values for a group of days and then calculate the average for each day. With those averages, I want to find the highest one among them, but I've been having trouble doing so. This is currently what I have:
for row in reversed(list(reader1)):
    openNAS, closeNAS = row['Open'], row['Close']
    highNAS, lowNAS = row['High'], row['Low']
    dateNAS = row['Date']
    averageNAS = (float(highNAS) + float(lowNAS)) / 2
    bestNAS = max(averageNAS)
I have indeed realized that max(averageNAS) doesn't work because averageNAS is not a list, and since the average isn't found in the CSV file, I can't do max(row['Average']) either.
When the highest average is found, I'd also like to be able to include its date, so my program can print out the date on which the highest average occurred. Thanks in advance.
One possible solution is to create a dictionary of average values where the date is the key and the average is the value:
averageNAS = {}
Then calculate the average and insert it into this dict:
for row in reversed(list(reader1)):
    highNAS, lowNAS = row['High'], row['Low']
    dateNAS = row['Date']
    averageNAS[dateNAS] = (float(highNAS) + float(lowNAS)) / 2  # Insertion
Now you can get the maximum by finding the highest value:
import operator
bestNAS = max(averageNAS.items(), key=operator.itemgetter(1))
The result will be a tuple like:
# (1, 8.0)
which means that day 1 had the highest average. And the average was 8.
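If you want both pieces separately afterwards, a small usage sketch (the unpacked names are mine) would be:
bestNASdate, bestNASvalue = bestNAS
print(bestNASdate, bestNASvalue)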
If you don't need the day then you could create a list instead of a dictionary and append to it. That makes finding the maximum a bit easier:
averageNAS = []
for ...
    averageNAS.append((float(highNAS) + float(lowNAS)) / 2)
bestNAS = max(averageNAS)
There are a few solutions that come to mind.
Solution 1
The method most similar to your existing solution would be to create a list of the averages as you calculate them, and then take the maximum from that list. The code, based on your example, looks something like this:
averageNAS = []
for row in reversed(list(reader1)):
    openNAS, closeNAS = row['Open'], row['Close']
    highNAS, lowNAS = row['High'], row['Low']
    dateNAS = row['Date']
    averageNAS.append((float(highNAS) + float(lowNAS)) / 2)

# the maximum of the list only needs to be found once (at the end)
bestNAS = max(averageNAS)
Solution 2
Instead of creating an entire list, you could just maintain a variable of the maximum average NAS that you've "seen" so far, and the dateNAS associated with it. That would look something like:
bestNAS = float('-inf')
bestNASdate = None
for row in reversed(list(reader1)):
    openNAS, closeNAS = row['Open'], row['Close']
    highNAS, lowNAS = row['High'], row['Low']
    dateNAS = row['Date']
    averageNAS = (float(highNAS) + float(lowNAS)) / 2
    if averageNAS > bestNAS:
        bestNAS = averageNAS
        bestNASdate = dateNAS
Solution 3
If you want to use a package as a solution, I'm fairly certain that the pandas package can do this easily and efficiently. I'm not 100% certain that the pandas syntax is exact, but the library has everything that you'd need to get this done. It's based on numpy, so the operations are faster/more efficient than a vanilla python loop.
from pandas import DataFrame, read_csv
import pandas as pd

df = pd.read_csv(r'file location')
df['averageNAS'] = df[["High", "Low"]].mean(axis=1)
bestNASindex = df['averageNAS'].argmax()  # 90% sure this is the right syntax
bestNAS = df['averageNAS'][bestNASindex]
bestNASdate = df['Date'][bestNASindex]
