How to append data to a dataframe whithout overwriting?

How to append data to a dataframe whithout overwriting? - python

I'm new to python but I need it for a personal project. And so I have this lump of code. The function is to create a table and update it as necessary. The problem is that the table keeps being overwritten and I don't know why. Also I'm struggling with correctly assigning the starting position of the new lines to append, and that's why total (ends up overwritten as well) and pos are there, but I haven't figured out how to correctly use them. Any tips?
import datetime
import pandas as pd
import numpy as np
total ={}
entryTable = pd.read_csv("Entry_Table.csv")
newEntries = int(input("How many new entries?\n"))
for i in range(newEntries):
ID = input ("ID?\n")
VQ = int (input ("VQ?\n"))
timeStamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
entryTable.loc[i] = [timeStamp, ID, VQ]
entryTable.to_csv("Inventory_Table.csv")
total[i] = 1
pos = sum(total.values())
print(pos)
inventoryTable = pd.read_csv("Inventory_Table.csv", index_col = 0)

Your variable 'i' runs from index 0 to the number of 'newEntries'. When you add new data to row 'i' in your Pandas dataframe, you are overwriting existing data in that row. If you want to add new data, try 'n+i' where n is the initial number of entries. You can determine n with either
n = len(entryTable)
or
n = entryTable.shape[0]

Related

pandas: while loop to simultaneously advance through multiple lists and call functions

I want my code to:
read data from a CSV and make a dataframe: "source_df"
see if the dataframe contains any columns specified in a list:
"possible_columns"
call a unique function to replace the values in each column whose header is found in the "possible_columns" the list, then insert the modified values in a new dataframe: "destination_df"
Here it is:
import pandas as pd
#creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)
#creates destination_df
blanklist = []
destination_df = pd.DataFrame(blanklist)
#create the column header lists for comparison in the while loop
columns = source_df.head(0)
possible_columns = ['yes/no','true/false']
#establish the functions list and define the functions to replace column values
fix_functions_list = ['yes_no_fix()','true_false_fix()']
def yes_no_fix():
destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No","0").replace("Yes","1")
def true_false_fix():
destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')
'''use the counter to call a unique function from the function list to replace the values in each column whose header is found in the "possible_columns" the list, insert the modified values in "destination_df, then advance the counter'''
counter = 0
while counter < len(possible_columns):
if possible_columns[counter] in columns:
destination_df.insert(counter, possible_columns[counter], source_df[possible_columns[counter]])
fix_functions_list[counter]
counter = counter + 1
#see if it works
print(destination_df.head(10))
When I print(destination_df), I see the unmodified column values from source_df. When I call the functions independently they work, which makes me think something is going wrong in my while loop.

Your issue is that you are trying to call a function that is stored in a list as a string.
fix_functions_list[cnt]
This will not actually run the function just access the string value.
I would try and find another way to run these functions.

def yes_no_fix():
destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No","0").replace("Yes","1")
def true_false_fix():
destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')
fix_functions_list = {0:yes_no_fix,1:true_false_fix}
and change the function calling to like below
fix_functions_list[counter]()

#creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)
possible_columns = ['yes/no','true/false']
mapping_dict={'yes/no':{"No":"0","Yes":"1"} ,'true/false': {'False':'1','True': '0'}
old_columns=[if column not in possible_columns for column in source_df.columns]
existed_columns=[if column in possible_columns for column in source_df.columns]
new_df=source_df[existed_columns]
for column in new_df.columns:
new_df[column].map(mapping_dict[column])
new_df[old_columns]=source_df[old_columns]

How to replace a loop that looks at multiple previous values with a formula in Python

My Problem
I have a loop that creates a column using either a formula based on values from other columns or the previous value in the column depending on a condition ("days from new low == 0"). It is really slow over a huge dataset so I wanted to get rid of the loop and find a formula that is faster.
Current Working Code
import numpy as np
import pandas as pd
csv1 = pd.read_csv('stock_price.csv', delimiter = ',')
df = pd.DataFrame(csv1)
for x in range(1,len(df.index)):
if df["days from new low"].iloc[x] == 0:
df["mB"].iloc[x] = (df["RSI on new low"].iloc[x-1] - df["RSI on new low"].iloc[x]) / -df["days from new low"].iloc[x-1]
else:
df["mB"].iloc[x] = df["mB"].iloc[x-1]
df
Input Data and Expected Output
RSI on new low,days from new low,mB
0,22,0
29.6,0,1.3
29.6,1,1.3
29.6,2,1.3
29.6,3,1.3
29.6,4,1.3
21.7,0,-2.0
21.7,1,-2.0
21.7,2,-2.0
21.7,3,-2.0
21.7,4,-2.0
21.7,5,-2.0
21.7,6,-2.0
21.7,7,-2.0
21.7,8,-2.0
21.7,9,-2.0
25.9,0,0.5
25.9,1,0.5
25.9,2,0.5
23.9,0,-1.0
23.9,1,-1.0
Attempt at Solution
def mB_calc (var1,var2,var3):
df[var3]= np.where(df[var1] == 0, df[var2].shift(1) - df[var2] / -df[var1].shift(1) , "")
return df
df = mB_calc('days from new low','RSI on new low','mB')
First, it gives me this "TypeError: can't multiply sequence by non-int of type 'float'" and second I dont know how to incorporate the "ffill" into the formula.
Any idea how I might be able to do it?
Cheers!

Try this one:
df["mB_temp"] = (df["RSI on new low"].shift() - df["RSI on new low"]) / -df["days from new low"].shift()
df["mB"] = df["mB"].shift()
df["mB"].loc[df["days from new low"] == 0]=df["mB_temp"].loc[df["days from new low"] == 0]
df.drop(["mB_temp"], axis=1)
And with np.where:
df["mB"] = np.where(df["days from new low"]==0, df["RSI on new low"].shift() - df["RSI on new low"]) / -df["days from new low"].shift(), df["mB"].shift())

Error "numpy.float64 object is not iterable" for CSV file creation in Python

I have some very noisy (astronomy) data in csv format. Its shape is (815900,2) with 815k points giving information of what the mass of a disk is at a certain time. The fluctuations are pretty noticeable when you look at it close up. For example, here is an snippet of the data where the first column is time in seconds and the second is mass in kg:
40023700,2.40896E+028
40145700,2.44487E+028
40267700,2.44487E+028
40389700,2.44478E+028
40511600,1.535E+028
40633500,2.19067E+028
40755400,2.44496E+028
40877200,2.44489E+028
40999000,2.44489E+028
41120800,2.34767E+028
41242600,2.40936E+028
So it looks like there is a 1.53E+028 data point of noise, and also probably the 2.19E+028 and 2.35E+028 points.
To fix this, I am trying to set a Python script that will read in the csv data, then put some restriction on it so that if the mass is e.g. < 2.35E+028, it will remove the whole row and then create a new csv file with only the "good" data points:
40023700,2.40896E+028
40145700,2.44487E+028
40267700,2.44487E+028
40389700,2.44478E+028
40755400,2.44496E+028
40877200,2.44489E+028
40999000,2.44489E+028
41242600,2.40936E+028
Following this old question top answer by n8henrie, I so far have:
import pandas as pd
import csv
# Here are the locations of my csv file of my original data and an EMPTY csv file that will contain my good, noiseless set of data
originaldata = '/Users/myname/anaconda2/originaldata.csv'
gooddata = '/Users/myname/anaconda2/gooddata.csv'
# I use pandas to read in the original data because then I can separate the columns of time as 'T' and mass as 'M'
originaldata = pd.read_csv('originaldata.csv',delimiter=',',header=None,names=['t','m'])
# Numerical values of the mass values
M = originaldata['m'].values
# Now to put a restriction in
for row in M:
new_row = []
for column in row:
if column > 2.35E+028:
new_row.append(column)
csv.writer(open(newfile,'a')).writerow(new_row)
print('\n\n')
print('After:')
print(open(newfile).read())
However, when I run this, I get this error:
TypeError: 'numpy.float64' object is not iterable
I know the first column (time) is dtype int64 and the second column (mass) is dtype float64... but as a beginner, I'm still not quite sure what this error means or where I'm going wrong. Any help at all would be appreciated. Thank you very much in advance.

You can select rows by a boolean operation. Example:
import pandas as pd
from io import StringIO
data = StringIO('''\
40023700,2.40896E+028
40145700,2.44487E+028
40267700,2.44487E+028
40389700,2.44478E+028
40511600,1.535E+028
40633500,2.19067E+028
40755400,2.44496E+028
40877200,2.44489E+028
40999000,2.44489E+028
41120800,2.34767E+028
41242600,2.40936E+028
''')
df = pd.read_csv(data,names=['t','m'])
good = df[df.m > 2.35e+28]
out = StringIO()
good.to_csv(out,index=False,header=False)
print(out.getvalue())
Output:
40023700,2.40896e+28
40145700,2.44487e+28
40267700,2.44487e+28
40389700,2.44478e+28
40755400,2.44496e+28
40877200,2.44489e+28
40999000,2.44489e+28
41242600,2.40936e+28

This returns a column: M = originaldata['m'].values
So when you do for row in M:, you get only one value in row, so you can't iterate on it again.

Storing Datetime in a matrix to be used to define points of interest (Python)

I have bunch of CSV files that contain rows of dates corresponding to data, with column headers Using pandas, I have been able to import the CSV files. Now, I made a CSV file that labels the points of interest by datetime. I have also used pandas to import this file. I need to store the start time and end time in a matrix/array/something to call later to parse with my data which is labeled with these dates. Currently, using pd.to_datetime I have been able to convert the strings in my CSVs to datetime, but I have no idea how to store this. This is my third day using Python, so I apologize for the newbie question. I am a relatively advanced user of Matlab. I will provide my code, but I will not be able to provide the data in question as it is not owned by me. Thanks guys!
NUMBER_OF_CLASSES = 4
SUBSPACE_DIMENSION = 3
from datetime import datetime
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib.pyplot as plt
import scipy.io as sio
PeriodList = pd.read_csv('IP_List.csv')
PeriodList = PeriodList.as_matrix()
# Pdata format:
# Pdata{hull, engine, 1}(:) - datetime array of hull and engine P data
# Pdata{hull, engine, 2}(:,:) - parametric data corrsponding to timestamps in datetime array
# Pdata{hull, engine, 3}(:) - array of parametric channel labels
Pdata_1 = pd.read_csv('LPD-17_1A.csv')
[list_m, list_n] = PeriodList.shape
Pdata_1 = Pdata_1.as_matrix()
startdatetime = []
enddatetime = []
#Up to line 27 done on MatLab script
for d in range (0, list_m):
Hull = PeriodList[d,0]
Engine = PeriodList[d,1]
startdatetime[d] = pd.to_datetime(PeriodList[d,2])
enddatetime[d] = pd.to_datetime(PeriodList[d,3])
#startdatetime = pd.to_datetime(PeriodList[d,2])

Instead of iterating through the dataframe you can store the start and ending dates in a new dataframe and convert the columns to timeseries and then you can access the data by iloc method :
dates = PeriodList[['START','END']]
dates['START'] = pd.to_datetime(dates['START'])
dates['END'] = pd.to_datetime(dates['END'])
# You can access the dates based on index using iloc
dates.iloc[3]
#If you Start date you can use the column name
dates.iloc[3]['START']
Incase you want to store specifically under existing data structure, you can use dictionary with key as index and values as dataframe values
start_end = dict(zip(dates.index, dates.values))
If you are looking for the difference of the end date and start date you can simply subtract the columns i.e
dates['Difference'] = dates['END']-dates['START']
I suggest you to go through pandas documentation for more info about accessing the data here
Edit :
You can also use dictionary in your code i.e
startdatetime = {}
enddatetime = {}
#Up to line 27 done on MatLab script
for d in range (0, list_m):
Hull = PeriodList[d,0]
Engine = PeriodList[d,1]
startdatetime[d] = pd.to_datetime(PeriodList[d,2])
enddatetime[d] = pd.to_datetime(PeriodList[d,3])
Hope this helps

Figured out a solution: Make empty strings, so then the loop stores the value each iteration. Since it is an empty string, there will not be a "cannot convert to float" error. Thanks for the help #Bharath Shetty
Code:
PeriodList = pd.read_csv('IP_List.csv')
PeriodList = PeriodList.as_matrix()
# Pdata format:
# Pdata{hull, engine, 1}(:) - datetime array of hull and engine P data
# Pdata{hull, engine, 2}(:,:) - parametric data corrsponding to timestamps in datetime array
# Pdata{hull, engine, 3}(:) - array of parametric channel labels
Pdata_1 = pd.read_csv('LPD-17_1A.csv')
[list_m, list_n] = PeriodList.shape
#Pdata_1 = Pdata_1.as_matrix()
startdatetime = ['' for x in range(list_m)]
enddatetime = ['' for x in range(list_m)]
#Up to line 27 done on MatLab script
for d in range (0, list_m):
Hull = PeriodList[d,0]
Engine = PeriodList[d,1]
startdatetime[d] = pd.to_datetime(PeriodList[d,2])
enddatetime[d] = pd.to_datetime(PeriodList[d,3])
#startdatetime = pd.to_datetime(PeriodList[d,2])

python: DataFrame.append does not append element

I am working one week with python and I need some help.
I want that if certain condition is fulfilled, it adds a value to a database.
My program doesn't give an error but it doesn't append an element to my database
import pandas as pd
noTEU = pd.DataFrame() # empty database
index_TEU = 0
for vessel in list:
if condition is fullfilled:
imo_vessel = pd.DataFrame({'imo': vessel}, index=[index_TEU])
noTEU.append(imo_vessel) # I want here to add an element to my database
index_TEU = index_TEU + 1
If I run this, at the end I still get an empty dataframe. I have no idea why it doesn't do what I want it to do

You should reassign the dataframe such as:
import pandas as pd
noTEU = pd.DataFrame() # empty database
index_TEU = 0
for vessel in list:
if condition is fullfilled:
imo_vessel = pd.DataFrame({'imo': vessel}, index=[index_TEU])
noTEU = noTEU.append(imo_vessel) # I want here to add an element to my database
index_TEU = index_TEU + 1
and don't use the keyword list for a List because it's included in the Python syntax.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to append data to a dataframe whithout overwriting? - python

Related

pandas: while loop to simultaneously advance through multiple lists and call functions

How to replace a loop that looks at multiple previous values with a formula in Python

Error "numpy.float64 object is not iterable" for CSV file creation in Python

Storing Datetime in a matrix to be used to define points of interest (Python)

python: DataFrame.append does not append element

Categories

Resources