Python dataframe loop row by row would not change value no matter what

I am trying to change a value in my pandas DataFrame, but it stubbornly refuses to change. I have used df.at, as suggested in other posts, and it is not working as a way to change/modify data in the DataFrame.
HOUSING_PATH = "datasets/housing"
csv_path = os.path.join(HOUSING_PATH, "property6.csv")
housing = pd.read_csv(csv_path)
headers = ['Sold Price', 'Longitude', 'Latitude', 'Land Size', 'Total Bedrooms', 'Total Bathrooms', 'Parking Spaces']
# housing.at[114, headers[6]] is 405; I want to change this to empty or 0 or None, as 405 parking spaces does not make sense.
for index in housing.index:
    # Total parking spaces in this cell
    row = housing.at[index, headers[6]]
    # if total parking spaces is greater than 20
    if row > 20:
        # change to nothing
        row = ''
print(housing.at[114, headers[6]])
# however, this is still 405
Why is this happening? Why can't I replace the value in the DataFrame? The values are <class 'numpy.float64'> (I have checked), so the if statement should work, and it does. It just isn't changing the value.

You cannot do it like this. When you assign the value of housing.at[index, headers[6]] to row, you create a new variable that contains a copy of this value. You then change the new variable, not the original data. Write back through .at instead:
for index in housing.index:
    # if total parking spaces is greater than 20
    if housing.at[index, headers[6]] > 20:
        # Set the value of original data to empty string
        housing.at[index, headers[6]] = ''

This can also be done without a for loop. Use .loc to filter the DataFrame on a condition and change the values:
import pandas as pd
import os
HOUSING_PATH = "datasets/housing"
csv_path = os.path.join(HOUSING_PATH, "property6.csv")
housing = pd.read_csv(csv_path)
housing.loc[housing["Parking Spaces"] > 20, "Parking Spaces"] = ""

There are several built-in functions for such tasks (where, mask, replace, etc.).
# Series.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', ...)
data2 = data.iloc[:, 6]
data2.where(data2 <= 20, '', inplace=True)
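As a minimal sketch of those three built-ins on a toy Series (using NaN rather than an empty string keeps the column numeric, which is usually preferable):

```python
import numpy as np
import pandas as pd

spaces = pd.Series([2.0, 405.0, 1.0, 38.0])

# where keeps values that satisfy the condition and replaces the rest
w = spaces.where(spaces <= 20, np.nan)

# mask is the mirror image: it replaces values that satisfy the condition
m = spaces.mask(spaces > 20, np.nan)

# replace substitutes specific known values
r = spaces.replace(405.0, np.nan)

print(w.tolist())  # [2.0, nan, 1.0, nan]
```

All three return a new Series unless told otherwise, so remember to assign the result back.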

Related

How to create new columns of the last 5 sale prices in a dataframe

I have a pandas data frame of sneaker sales, which looks like this.
I added columns last1, ..., last5 indicating the last 5 sale prices of each sneaker and made them all None. I'm trying to update the values of these new columns using the 'Sale Price' column. This is my attempt to do so:
for index, row in df.iterrows():
    if index == 0:
        continue
    for i in range(index - 1, -1, -1):
        if df['Sneaker Name'][index] == df['Sneaker Name'][i]:
            df['last5'][index] = df['last4'][i]
            df['last4'][index] = df['last3'][i]
            df['last3'][index] = df['last2'][i]
            df['last2'][index] = df['last1'][i]
            df['last1'][index] = df['Sale Price'][i]
            continue
    if index == 100:
        break
When I ran this, I got a warning,
A value is trying to be set on a copy of a slice from a DataFrame
and the result is also wrong.
Does anyone know what I did wrong?
Also, this is the expected output,
Use this instead of a for loop, if your rows are sorted:
df['last1'] = df['Sale Price'].shift(1)
df['last2'] = df['last1'].shift(1)
df['last3'] = df['last2'].shift(1)
df['last4'] = df['last3'].shift(1)
df['last5'] = df['last4'].shift(1)
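The chained shift above mixes histories when several sneakers are interleaved in the frame. To keep each sneaker's history separate, the same idea can be applied per group with groupby; a sketch with made-up data and the question's column names:

```python
import pandas as pd

df = pd.DataFrame({
    'Sneaker Name': ['A', 'B', 'A', 'A', 'B'],
    'Sale Price': [100, 200, 110, 120, 210],
})

# shift within each sneaker's own group so histories don't mix
g = df.groupby('Sneaker Name')['Sale Price']
for k in range(1, 6):
    df[f'last{k}'] = g.shift(k)

print(df[['Sneaker Name', 'Sale Price', 'last1', 'last2']])
```

Rows with fewer than k prior sales of the same sneaker get NaN, which plays the role of the None placeholders in the question.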

Using pandas to find a string in a column

I am a beginner in programming and am trying to learn to code, so please bear with my bad coding. I am using pandas to find a string in a column (the combinations column in the code below) and print every row containing that string. Find my code below. I am not able to figure out how to find those instances of the column and print them.
import pandas as pd
data = pd.read_csv("signallervalues.csv",index_col=False)
data.head()
data['col1'] = data['col1'].astype(str)
data['col2'] = data['col2'].astype(str)
data['col3'] = data['col3'].astype(str)
data['col4'] = data['col4'].astype(str)
data['col5']= data['col5'].astype(str)
data.head()
combinations = data['col1'] + data['col2'] + data['col3'] + data['col4'] + data['col5']
data['combinations']= combinations
print(data.head())
list_of_combinations = data['combinations'].to_list()
print(list_of_combinations)
for i in list_of_combinations:
    if data['combinations'].str.contains(i).any():
        print(i + ' data occurs in row')
        # I need to print the row containing the string here
    else:
        print(i + ' is occurring only once')
my data frame looks like this
import pandas as pd
data=pd.DataFrame()
# recreating your data (more or less)
data['signaller']= pd.Series(['ciao', 'ciao', 'ciao'])
data['col6']= pd.Series(['-1-11-11', '11', '-1-11-11'])
list_of_combinations=['11', '-1-11-11']
data.reset_index(inplace=True)
# group by the values of column 6 and count how many times they occur
g = data.groupby('col6')['index']
count = pd.DataFrame(g.count())
count = count.rename(columns={'index': 'occurrences'})
count.reset_index(inplace=True)
# keep only the rows whose col6 value is in 'list_of_combinations'
count[count['col6'].isin(list_of_combinations)]
My result
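If the goal is simply to print every row whose combinations column contains a given string, a boolean mask does it without any loop; a minimal sketch with made-up data:

```python
import pandas as pd

data = pd.DataFrame({'combinations': ['-1-11-11', '11', '-1-11-11']})

target = '-1-11-11'
# regex=False treats the target as a literal string, not a pattern
matches = data[data['combinations'].str.contains(target, regex=False)]
print(matches)
```

Note that contains does substring matching, so the target '11' would also match inside '-1-11-11'; use == for exact matches.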

Column name disappears after resampling dataframe

I have a dataframe in which I need to calculate the mean of ozone values for every 8 hours. The problem is that the column I am resampling on ('readable time') disappears and cannot be referenced after the resampling.
import pandas as pd
data = pd.read_csv("o3_new.csv")
del data['latitude']
del data['longitude']
del data['altitude']
sensor_name = "o3"
data['readable time'] = pd.to_datetime(data['readable time'], dayfirst=True)
data = data.resample('480min', on='readable time').mean() # 8h mean
data[str(sensor_name) + "_aqi"] = ""
for i in range(len(data)):
    data[str(sensor_name) + "_aqi"][i] = calculate_aqi(sensor_name, data[sensor_name][i])
print(data['readable time'])  # throws KeyError
Where o3_new.csv is like this:
,time,latitude,longitude,altitude,o3,readable time,day
0,1591037392,45.645893,25.599471,576.38,39.4,1/6/2020 21:49,1/6/2020
1,1591037452,45.645893,25.599471,576.64,48.4,1/6/2020 21:50,1/6/2020
2,1591037512,45.645893,25.599471,576.56,53.4,1/6/2020 21:51,1/6/2020
3,1591037572,45.645893,25.599471,576.64,36.4,1/6/2020 21:52,1/6/2020
4,1591037632,45.645893,25.599471,576.73,50.4,1/6/2020 21:53,1/6/2020
5,1591037692,45.645893,25.599471,577.09,37.4,1/6/2020 21:54,1/6/2020
What to do to keep referencing the 'readable time' column after resampling?
What would you like the column to contain? mean makes no particularly good sense for time columns. Also, the resampler makes your on column the index, so data.reset_index(inplace=True) may be all you need.
Alternatively, you can still access the values through data.index directly after the resample.
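A minimal sketch of both options, with made-up readings and the question's column names:

```python
import pandas as pd

df = pd.DataFrame({
    'readable time': pd.to_datetime(
        ['2020-06-01 21:49', '2020-06-01 21:50', '2020-06-02 06:00']),
    'o3': [39.4, 48.4, 53.4],
})

means = df.resample('480min', on='readable time').mean()
# after the resample, 'readable time' is the index, not a column:
# means.index holds the 8h bin starts, means.columns is just ['o3']

# bring it back as a regular column
means = means.reset_index()
print(means)
```

The first two readings fall into the same 8-hour bin, so they collapse into one averaged row.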

How to compare two string values in a pandas dataframe

I am trying to compare two different column values in a dataframe. I wasn't able to apply the questions/answers I've found.
import pandas as pd
# from datetime import timedelta
"""
read csv file
clean date column
convert date str to datetime
sort for equity options
replace date str column with datetime column
"""
trade_reader = pd.read_csv('TastyTrades.csv')
trade_reader['Date'] = trade_reader['Date'].replace({'T': ' ', '-0500': ''}, regex=True)
date_converter = pd.to_datetime(trade_reader['Date'], format="%Y-%m-%d %H:%M:%S")
options_frame = trade_reader.loc[(trade_reader['Instrument Type'] == 'Equity Option')]
clean_frame = options_frame.replace(to_replace=['Date'], value='date_converter')
# Separate opening transaction from closing transactions, combine frames
opens = clean_frame[clean_frame['Action'].isin(['BUY_TO_OPEN', 'SELL_TO_OPEN'])]
closes = clean_frame[clean_frame['Action'].isin(['BUY_TO_CLOSE', 'SELL_TO_CLOSE'])]
open_close_set = set(opens['Symbol']) & set(closes['Symbol'])
open_close_frame = clean_frame[clean_frame['Symbol'].isin(open_close_set)]
'''
convert Value to float
sort for trade readability
write
'''
ocf_float = open_close_frame['Value'].astype(float)
ocf_sorted = open_close_frame.sort_values(by=['Date', 'Call or Put'], ascending=True)
# for readability, revert back to ocf_sorted below
ocf_list = ocf_sorted.drop(
    ['Type', 'Instrument Type', 'Description', 'Quantity', 'Average Price', 'Commissions', 'Fees', 'Multiplier'],
    axis=1
)
ocf_list.reset_index(drop=True, inplace=True)
ocf_list['Strategy'] = ''
# ocf_list.to_csv('Sorted.csv')
# create strategy list
debit_single = []
debit_vertical = []
debit_calendar = []
credit_vertical = []
iron_condor = []
# shift columns
ocf_list['Symbol Shift'] = ocf_list['Underlying Symbol'].shift(1)
ocf_list['Symbol Check'] = ocf_list['Underlying Symbol'] == ocf_list['Symbol Shift']
# compare symbols, append depending on criteria met
for row in ocf_list:
    if row['Symbol Shift'] is row['Underlying Symbol']:
        debit_vertical.append(row)
print(type(ocf_list['Underlying Symbol']))
ocf_list.to_csv('Sorted.csv')
print(debit_vertical)
# delta = timedelta(seconds=10)
The error I get is:
line 51, in <module>
if row['Symbol Check'][-1] is row['Underlying Symbol'][-1]:
TypeError: string indices must be integers
I am trying to compare the newly created shifted column to the original and, if they match, append the row to a list. Is there a way to compare two string values at all in Python? I've tried checking whether Symbol Check is true and it still returns an error about string indices needing to be integers. .iterrows() didn't work either.
Here, you will actually iterate through the columns of your DataFrame, not the rows:
for row in ocf_list:
    if row['Symbol Shift'] is row['Underlying Symbol']:
        debit_vertical.append(row)
You can use one of the methods iterrows or itertuples to iterate through the rows. Note that iterrows yields (index, row) pairs where the row is a Series, so row['Symbol Shift'] works there, while itertuples yields namedtuples, whose fields are accessed as attributes (or positionally when a column name, like these with spaces, is not a valid identifier).
Second, you should use == instead of is since you are probably comparing values, not identities.
Lastly, I would skip iterating over the rows entirely, as pandas is made for selecting rows based on a condition. You should be able to replace the aforementioned code with this:
debit_vertical = ocf_list[ocf_list['Symbol Shift'] == ocf_list['Underlying Symbol']].values.tolist()
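For completeness, a sketch of how each iterator is actually indexed, on a toy frame with the question's column names:

```python
import pandas as pd

df = pd.DataFrame({'Symbol Shift': ['SPY', 'SPY'],
                   'Underlying Symbol': ['SPY', 'QQQ']})

# iterrows yields (index, Series): column-name indexing works
same_iterrows = [row['Underlying Symbol'] for _, row in df.iterrows()
                 if row['Symbol Shift'] == row['Underlying Symbol']]

# itertuples yields namedtuples: column names containing spaces become
# positional fields (_1, _2), so index them positionally
same_itertuples = [t for t in df.itertuples(index=False) if t[0] == t[1]]

print(same_iterrows)  # ['SPY']
```

Both loops find only the first row, where the shifted and original symbols match; the vectorized comparison above gives the same result without iterating.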

How to append data to a dataframe without overwriting?

I'm new to Python, but I need it for a personal project, and so I have this lump of code. The goal is to create a table and update it as necessary. The problem is that the table keeps being overwritten and I don't know why. I'm also struggling with correctly assigning the starting position of the new lines to append; that's why total (which also ends up overwritten) and pos are there, but I haven't figured out how to use them correctly. Any tips?
import datetime
import pandas as pd
import numpy as np
total ={}
entryTable = pd.read_csv("Entry_Table.csv")
newEntries = int(input("How many new entries?\n"))
for i in range(newEntries):
    ID = input("ID?\n")
    VQ = int(input("VQ?\n"))
    timeStamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    entryTable.loc[i] = [timeStamp, ID, VQ]
    entryTable.to_csv("Inventory_Table.csv")
    total[i] = 1
pos = sum(total.values())
print(pos)
inventoryTable = pd.read_csv("Inventory_Table.csv", index_col=0)
Your variable 'i' runs from 0 to newEntries - 1. When you add new data to row 'i' of your Pandas dataframe, you overwrite any existing data in that row. If you want to add new data, use 'n + i', where n is the initial number of entries. You can determine n with either
n = len(entryTable)
or
n = entryTable.shape[0]
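A sketch of that offset in context, with made-up data in place of the CSV and input() calls (column names are hypothetical, since the question never shows them):

```python
import datetime
import pandas as pd

# one pre-existing entry, standing in for the CSV contents
entryTable = pd.DataFrame({'Time': ['2020-01-01 00:00:00'],
                           'ID': ['a'], 'VQ': [1]})

n = len(entryTable)  # start appending after the existing rows
new_entries = [('x', 5), ('y', 7)]
for i, (ID, VQ) in enumerate(new_entries):
    timeStamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    entryTable.loc[n + i] = [timeStamp, ID, VQ]

print(len(entryTable))  # 3
```

Each new row lands at label n + i, so the original rows are left untouched instead of being overwritten.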
