Replacing cell of a dataframe in a loop - python

I was working on this question: filter iteration using FOR. I wonder how to replace the last cell of the year column in each csv file generated. Let's say I want to replace the last cell (of column year) of each csv file with the current year (2018). I wrote the following code:
for i, g in df.sort_values("year").groupby("Univers", sort=False):
    for y in g.iloc[-1, g.columns.get_loc('year')]:
        y = 2018
    g.to_csv('{}.xls'.format(i))
But I get the same column without any changes. Any ideas how to do this?

The problem is twofold: first find the index label of the last row, then replace the value at (last_row_idx, "year").
Try this:
for i, g in df.sort_values("year").groupby("Univers", sort=False):
    last_row_idx = g.tail(1).index.item() # first task: find index
    g.at[last_row_idx, "year"] = 2018 # replace
    g.to_csv('{}.xls'.format(i))
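In older pandas versions one could also use g.set_value(last_row_idx, "year", 2018) to set the value of a particular cell, but set_value was deprecated in pandas 0.21 and removed in 1.0, so .at is the option to rely on.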
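A minimal, self-contained sketch of the approach above. The sample data is made up for illustration, the groups are collected in a dict (a hypothetical `groups` variable) instead of being written to disk, and `g.index[-1]` is used as a shorter way to get the last row label:

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "Univers": ["A", "A", "B", "B", "B"],
    "year": [2015, 2016, 2014, 2016, 2017],
})

groups = {}
for name, g in df.sort_values("year").groupby("Univers", sort=False):
    g = g.copy()                       # work on an explicit copy of the group
    g.loc[g.index[-1], "year"] = 2018  # overwrite the last cell of 'year'
    groups[name] = g                   # or: g.to_csv('{}.csv'.format(name))

print(groups["A"])
print(groups["B"])
```

Because each group is modified and saved separately, the original df is left unchanged.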
Reference
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_value.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.at.html
Set value for particular cell in pandas DataFrame
Get index of a row of a pandas dataframe as an integer

Related

Python: Get last row from Pandas Data Frame

I'm trying to use some financial functions written in Python which return a pandas DataFrame.
The following function returns a pandas DataFrame:
from yahoo_fin import stock_info as si
data = si.get_data("ENEL.MI", start_date="01/21/2022 8:00", end_date="01/21/2022 16:30",index_as_date=False, interval="1d")
Here is what I get if I print data:
        date   open   high    low  close  adjclose    volume   ticker
0 2022-01-21  6.976  6.993  6.855  6.905     6.905  33639775  ENEL.MI
1 2022-01-21  6.976  6.993  6.855  6.905     6.905  35419140  ENEL.MI
I'd like to collect just the last row from the DataFrame (the row with number "1")
So, I've tried with:
lastrow = data.tail()
print(lastrow)
However I still get the same result (the whole DataFrame is printed).
I'm a bit puzzled. Is there a way to get just the last row?
Thanks a lot
With the iloc indexer you can select rows and columns by integer position; index -1 selects the last row and returns it as a Series.
lastrow = data.iloc[-1]
You have to specify the number of rows you want to get. In your case that is data.tail(1), which returns only the last row. By default tail() returns the last 5 rows, which is why you see your full two-row DataFrame.
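The difference between the two answers can be seen in a small sketch. The DataFrame below is a made-up stand-in for the yahoo_fin result, with only a few of its columns:

```python
import pandas as pd

# Stand-in for the DataFrame returned by si.get_data(...).
data = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-21", "2022-01-21"]),
    "close": [6.905, 6.905],
    "volume": [33639775, 35419140],
})

last_as_series = data.iloc[-1]    # one row as a Series (columns become the index)
last_as_frame = data.tail(1)      # one-row DataFrame
same_as_frame = data.iloc[[-1]]   # list indexer also keeps the DataFrame shape

print(last_as_series["volume"])
print(last_as_frame)
```

Use the Series form when you want to read individual values, and the one-row DataFrame form when the result should keep the same columns layout as the original.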

How to change DataFrame column values so that mean is modified accordingly?

I have a Pandas DataFrame extracted from Estespark Weather for the dates between Sep-2009 and Oct-2018, and the mean of the Average windspeed column is 4.65. I am taking a challenge where there is a sanity check where the mean of this column needed to be 4.64. How can I modify the values of this column so that the mean of this column becomes 4.64? Is there any code solution for this, or do we have to do it manually?
I can see two solutions:
1. Subtract 0.01 (4.65 - 4.64) from every value of that column:
df['AvgWS'] -= 0.01
2. If you don't want to alter all rows, find which rows you could remove to get the desired mean (if there are any):
current_sum = df['AvgWS'].sum()
desired_mean = 4.64
n_rows = len(df['AvgWS'])
# mean of the remaining rows if row x were dropped, compared at two decimals
df['can_remove'] = df['AvgWS'].map(lambda x: round((current_sum - x)/(n_rows-1), 2) == desired_mean)
This creates a new boolean column in your dataframe that is True for the rows which, if removed, bring the mean of the rest of the column to 4.64. If there is more than one, you can inspect them, decide which one seems least important, and remove that one.
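Both options can be checked on a toy column. The numbers below are invented so that the arithmetic is easy to follow, and np.isclose is used here instead of rounding as one possible way to compare floats:

```python
import numpy as np
import pandas as pd

# Invented values: the mean is 6.0 and we want it to become 5.0.
df = pd.DataFrame({"AvgWS": [4.0, 5.0, 6.0, 9.0]})
desired_mean = 5.0

# Option 1: shift every value by the gap between the current and desired mean.
shifted = df["AvgWS"] - (df["AvgWS"].mean() - desired_mean)

# Option 2: flag rows whose removal leaves the remaining rows at the desired mean.
total = df["AvgWS"].sum()
n_rows = len(df)
mean_without_row = (total - df["AvgWS"]) / (n_rows - 1)
df["can_remove"] = np.isclose(mean_without_row, desired_mean)

print(shifted.mean())
print(df)
```

Here only the row holding 9.0 is flagged, since (24 - 9) / 3 = 5.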

Python Pandas Splitting Strings and Storing the Remainder in New Row

I have a pandas dataframe where observations are broken out per every two days. The values in the 'Date' column each describe a range of two days (eg 2020-02-22 to 2020-02-23).
I want to split those Date values into individual days, with a row for each day. The closest I got was by doing newdf = df_day.set_index(df_day.columns.drop('Date',1).tolist()).Date.str.split(' to ', expand=True).stack().reset_index().loc[:, df_day.columns]
The problem here is that the new date values are returned as NaNs. Is there a way to achieve this data broken out by individual day?
I might not be understanding, but based on the image it's a single date per row as is, just poorly labeled -- I would manipulate the index strings, and if I can't do that I would create a new date column, or new df w/ clean date and merge it.
You should be able to chop off the first 14 characters with a lambda -- leaving you with second listed date in index.
I can't reproduce this, so bear with me.
df.rename(index=lambda s: s[14:])
#should remove first 14 characters from each row label.
#leaving just '2020-02-23' in row 2.
#If you must skip row 1, idx = df.index[1:]
#or df.iloc[1:].rename(index=lambda s: s[14:])
Otherwise, I would just replace it with a new datetime index.
didx = pd.date_range(start='2000-01-10', end='2020-02-26', freq='D')
#Make sure it has the same length as df
df.set_index(didx)
#Or
#df['new_date'] = didx.values
#df.set_index('new_date').drop(columns=['Date'])
#Or
#pd.concat([df, pd.Series(didx, index=df.index, name='new_date')], axis=1)
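If the ranges really are stored as strings in a regular 'Date' column, as the question describes, another possible route is to split each string and explode the resulting lists into rows. This is only a sketch: the 'value' column and the sample rows are made up, and DataFrame.explode needs pandas 0.25 or later.

```python
import pandas as pd

# Made-up data in the shape described in the question.
df_day = pd.DataFrame({
    "Date": ["2020-02-22 to 2020-02-23", "2020-02-24 to 2020-02-25"],
    "value": [10, 20],
})

per_day = (
    df_day.assign(Date=df_day["Date"].str.split(" to "))  # list of two dates per row
    .explode("Date")                                       # one row per list element
    .reset_index(drop=True)
)
per_day["Date"] = pd.to_datetime(per_day["Date"])

print(per_day)
```

Note that every other column is copied to both days as-is; if a two-day value should be divided between the days, that has to be done as a separate step.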

how to add specific two columns and get new column as a total using pandas library?

I'm trying to add two columns and display their total in a new column, and I also want to compute the following:
The total sum of sales in the month of Jan
The minimum sales amount in the month of Feb
The average (mean) sales for the month of Mar
and trying to create a data frame called d2 that only contains rows of data in d that don't have any missing (NaN) values
I have implemented the following code
import pandas as pd
new_val= pd.read_csv("/Users/mayur/574_repos_2019/ml-python-class/assignments/data/assg-01-data.csv")
new_val['total'] = 'total'
new_val.to_csv('output.csv', index=False)
display(new_val)
d.head(5)# it's not showing top file lines of the .csv data
# .CSV file sample data
#account name street city state postal-code Jan Feb Mar total
#0118 Kerl, 3St . Waily Texas 28752.0 10000 62000 35000 total
#0118 mkrt, 1Wst. con Texas 22751.0 12000 88200 15000 total
It's giving me a total as a word.
When you used new_val['total'] = 'total' you told pandas that you want a column called total in your DataFrame where every value is the string total.
What you want to fix is that assignment. Here is a quick and dirty solution that will hopefully make a cleaner solution clearer to you.
You can iterate through your DataFrame and add the monthly columns to get the value for the total column.
for i, row in new_val.iterrows():
    new_val.loc[i, 'total'] = row['Jan'] + row['Feb'] + row['Mar']
Note, that this requires column total to have already been defined. This also requires iterating through your entire data set, so if your data set is large this is not the best option.
As mentioned by @Cavenfish, new_val['total'] = 'total' creates a column total where the value of every cell is the string total.
You should rather use new_val['total'] = new_val['Jan']+new_val['Feb']+new_val['Mar']
To handle NA values you can use the mask new_val.isna() (or its inverse, new_val.notna()), which produces a boolean for every cell telling you whether it is NA or not. You can then apply any logic on top of it. For your example, the below should work:
new_val[['Jan','Feb','Mar','total']].notna().sum(axis=1)==4
Restricted to the four columns Jan, Feb, Mar and total, this returns False for any row in which at least one of them is NA. You can then apply this mask to new_val['total'] to assign a default value wherever NA is encountered in one of the columns of a row.
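Putting the pieces of the question together, a sketch could look like the following. The DataFrame d is a small made-up stand-in for the CSV (the third row and its missing Feb value are invented to show the NaN handling), and d2 is built with dropna, which is one straightforward way to keep only complete rows:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the CSV; the third row has a missing Feb value.
d = pd.DataFrame({
    "name": ["Kerl", "mkrt", "abc"],
    "Jan": [10000, 12000, 5000],
    "Feb": [62000, 88200, np.nan],
    "Mar": [35000, 15000, 7000],
})

# Row-wise total; skipna=False keeps the total as NaN when a month is missing.
d["total"] = d[["Jan", "Feb", "Mar"]].sum(axis=1, skipna=False)

jan_total = d["Jan"].sum()   # total sales in Jan
feb_min = d["Feb"].min()     # minimum sales in Feb (NaN is skipped)
mar_mean = d["Mar"].mean()   # average sales in Mar

# Only the rows without any missing values.
d2 = d.dropna()

print(jan_total, feb_min, mar_mean)
print(d2)
```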

How do I replace one cell in a data frame with another data in a different data frame?

I am trying to replace a value in the data frame dh based on the data frame larceny.
If the date in larceny exists, I want to find the corresponding date in dh and replace the corresponding 5th column entry with 1.
I am currently (somewhat successfully) doing it with the below code but, it is taking forever. Any help on this?
When I try to compare the dates, the code does not work, so I compare the .value of the dates and this seems to work.
import pandas as pd
from datetime import datetime
for i, row in dh.iterrows():
    for j in range(45314):
        if dh.iat[i,0].value==larceny.iat[j,0].value:
            dh.iat[i,5]=1
            print("Larceny")
            print(i,j)
            print(dh.iat[i,0],larceny.iat[j,0])
            print(dh.iat[i,0].value,larceny.iat[j,0].value,'\n\n')
Basically, dh has a cell for each hour of each day for 4 years. I want to populate the cell for each hour with a 1 in the "Is_larceny" column, if that corresponding year-month-day-hour appears in the larceny data frame.
Please help. I tried some pandas search methods but I was having a problem with comparing dates and searching and replacing properly.
Thanks.
dh.loc[dh['col1'].isin(larceny['col2']), 'Is_larceny'] = 1
This looks for any value in dh['col1'] that also appears in larceny['col2'], then sets Is_larceny to 1 in those rows of dh. You will have to replace col1 and col2 with the names of your respective date columns.
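A small runnable sketch of this vectorised lookup. The column name 'date' and the sample timestamps are assumptions made for illustration; only Is_larceny comes from the question:

```python
import pandas as pd

# One row per hour, as in dh; the timestamps are made up.
dh = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-01-01 00:00", "2018-01-01 01:00",
        "2018-01-01 02:00", "2018-01-01 03:00",
    ]),
})
dh["Is_larceny"] = 0

# Incidents, possibly several for the same hour.
larceny = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-01-01 01:00", "2018-01-01 03:00", "2018-01-01 03:00",
    ]),
})

# Flag every hour of dh that appears at least once in larceny.
dh.loc[dh["date"].isin(larceny["date"]), "Is_larceny"] = 1

print(dh)
```

This replaces the nested loop with a single membership test, so the cost no longer grows with the product of the two table sizes. It does assume both columns hold timestamps at the same resolution (here, whole hours).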
