Update all the rows in a particular column PANDAS - python

Hi, I am trying to edit rows in a particular column, 'PM2.5', using df.at['0', 'PM2.5'] = 10 for the first row of the PM2.5 column, but instead of editing the value, it adds a new row. The columns are labelled by title but my rows are numbered; how do I get around this? I want to do this for 18 rows and manually add data to the PM2.5 column. Thanks!

As clarified in the comments, the problem is that the index values are integers (RangeIndex), so the row label used to set a value must be an integer too.
So change '0' (string):
df.at['0', 'PM2.5'] = 10
to 0 (integer):
df.at[0, 'PM2.5'] = 10
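To make the difference concrete, here is a minimal sketch on made-up data (the column values and the `enlarged` name are assumptions, not taken from the question). It shows the string label adding a row, the integer label overwriting one, and a way to fill several rows at once with .loc:

```python
import pandas as pd

# Hypothetical data: three rows with the default integer RangeIndex.
df = pd.DataFrame({"PM2.5": [1.0, 2.0, 3.0], "PM10": [4.0, 5.0, 6.0]})

# An integer label matches an existing row, so the value is overwritten.
df.at[0, "PM2.5"] = 10

# A string label is not in the index, so pandas adds a new row instead.
enlarged = df.copy()
enlarged.at["0", "PM2.5"] = 10

# Several rows can be set in one assignment; .loc label slices include the end label.
df.loc[0:2, "PM2.5"] = [10, 11, 12]
```

For the 18 rows in the question, the same pattern would presumably be df.loc[0:17, 'PM2.5'] = values, with values being a list of 18 numbers.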

Related

Pandas dataframe- How to count the number of distinct rows for a given ID

I have this dataframe and I want to add a column to it with the total of distinct SalesOrderId for a given CustomerId
So, with what I am trying to do, there would be a new column with the value 3 for all these rows.
How can I do it?
I am trying this way but I get an error
data['TotalOrders'] = data.groupby([['CustomerID','SalesOrderID']]).size().reset_index(name='count')
Try using transform:
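data['TotalOrders'] = data.groupby('CustomerID')['SalesOrderID'].transform('nunique')
This returns one value per row of the original DataFrame (the group's distinct count, repeated for every row in that group), so it can be assigned directly as a new column. (thanks @Rodalm)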
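A small sketch of the transform approach on invented data (the IDs below are assumptions chosen so that one customer has three distinct orders):

```python
import pandas as pd

# Hypothetical sample: customer 1 has three distinct orders, customer 2 has one.
data = pd.DataFrame({
    "CustomerID":   [1, 1, 1, 1, 2],
    "SalesOrderID": [100, 100, 101, 102, 200],
})

# transform('nunique') broadcasts each group's distinct count back to its rows.
data["TotalOrders"] = data.groupby("CustomerID")["SalesOrderID"].transform("nunique")
```

Unlike size().reset_index(), the result has the same length and index as data, which is why the column assignment works.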

How to change DataFrame column values so that mean is modified accordingly?

I have a Pandas DataFrame extracted from Estespark Weather for the dates between Sep-2009 and Oct-2018, and the mean of the Average windspeed column is 4.65. I am taking a challenge where there is a sanity check where the mean of this column needed to be 4.64. How can I modify the values of this column so that the mean of this column becomes 4.64? Is there any code solution for this, or do we have to do it manually?
I can see two solutions:
1. Subtract 0.01 (4.65 - 4.64) from every value of that column:
df['AvgWS'] -= 0.01
2. If you don't want to alter all rows, find which rows you can remove to give you the desired mean (if there are any):
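# Use the unrounded mean; the reported 4.65 is already rounded to two decimals
current_mean = df['AvgWS'].mean()
desired_mean = 4.64
n_rows = len(df['AvgWS'])
# Mean of the remaining rows if row x were dropped, compared after rounding
# (exact float equality against 4.64 would almost never be True)
df['can_remove'] = df['AvgWS'].map(lambda x: round((current_mean*n_rows - x)/(n_rows-1), 2) == desired_mean)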
This will create a new boolean column in your dataframe with True in the rows that, if removed, make the mean of the rest of the column 4.64. If there is more than one, you can analyse them to choose which one seems least important and then remove that one.
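For the first option, a slightly more robust variant is to shift the column by the exact difference between the target and the unrounded mean, rather than by a hard-coded 0.01. A minimal sketch, assuming the column is called AvgWS as above and using invented values:

```python
import pandas as pd

# Hypothetical wind-speed values; the real column would come from the weather data.
df = pd.DataFrame({"AvgWS": [3.2, 4.9, 5.1, 5.4]})
desired_mean = 4.64

# Adding a constant to every value shifts the mean by exactly that constant.
df["AvgWS"] += desired_mean - df["AvgWS"].mean()
```

This keeps the spread of the values unchanged and moves only the mean.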

How to add rows to a specific location in a pandas DataFrame?

I am trying to add rows where there is a gap between month_count. For example, row 0 has month_count = 0 and row 1 has month_count = 7. How can I add extra 6 rows with month counts being 1,2,3,4,5,6? Also, same situation from row 3 to row 4. I would like to add 2 extra rows with month_count 10 and 11. What is the best way to go about this?
One way to do this would be to iterate over all of the rows and rebuild the DataFrame with the missing rows inserted. Pandas does not support inserting rows directly at a given position; however, you can put together a solution using pd.concat():
import pandas as pd

def pandas_insert(df, idx, row_contents):
    # Split the frame at position idx and put the new rows in between
    top = df.iloc[:idx]
    bot = df.iloc[idx:]
    inserted = pd.concat([top, row_contents, bot], ignore_index=True)
    return inserted
Here row_contents should be a DataFrame with one (or more) rows. We use ignore_index=True to update the index of the new DataFrame to be labeled 0,1, …, n-2, n-1
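A usage sketch for the helper on made-up data (the value column and the numbers are assumptions, since the original frame was only shown as an image). The helper is repeated so the snippet runs on its own:

```python
import pandas as pd

def pandas_insert(df, idx, row_contents):
    # Split the frame at position idx and put the new rows in between.
    top = df.iloc[:idx]
    bot = df.iloc[idx:]
    return pd.concat([top, row_contents, bot], ignore_index=True)

# Hypothetical frame with a gap between month_count 0 and 3.
df = pd.DataFrame({"month_count": [0, 3, 4], "value": [10.0, 20.0, 30.0]})

# Rows for the missing month counts; columns not supplied here become NaN.
missing = pd.DataFrame({"month_count": [1, 2]})
filled = pandas_insert(df, 1, missing)
```

With several gaps, each insertion shifts the positions of the rows after it, so it may be easier to work from the last gap back to the first.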

Python Pandas Splitting Strings and Storing the Remainder in New Row

I have a pandas dataframe where observations are broken out per every two days. The values in the 'Date' column each describe a range of two days (eg 2020-02-22 to 2020-02-23).
I want to split those Date values into individual days, with a row for each day. The closest I got was by doing newdf = df_day.set_index(df_day.columns.drop('Date',1).tolist()).Date.str.split(' to ', expand=True).stack().reset_index().loc[:, df_day.columns]
The problem here is that the new date values are returned as NaNs. Is there a way to achieve this data broken out by individual day?
I might not be understanding, but based on the image it's a single date per row as is, just poorly labeled -- I would manipulate the index strings, and if I can't do that I would create a new date column, or new df w/ clean date and merge it.
You should be able to chop off the first 14 characters with a lambda -- leaving you with second listed date in index.
I can't reproduce this, so bear with me.
df.rename(index=lambda s: s[14:])
#should remove first 14 characters from each row label.
#leaving just '2020-02-23' in row 2.
#If you must skip row 1, idx = df.index[1:]
#or df.iloc[1:].rename(index=lambda s: s[14:])
Otherwise, I would just replace it with a new datetime index.
didx = pd.date_range(start='2000-01-10', end='2020-02-26', freq='D')
#Make sure it has the same length as df
df.set_index(didx)
#Or
#df['new_date'] = didx.values
#df.set_index('new_date').drop(columns=['Date'])
#Or
#df.index = didx
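If 'Date' is an ordinary column holding strings like '2020-02-22 to 2020-02-23', as the question describes, another option (a different technique from the index manipulation above) is to split each string into a list and let explode produce one row per day. A minimal sketch, where the Count column and its values are invented for illustration:

```python
import pandas as pd

# Hypothetical frame: each row covers a two-day range in the 'Date' column.
df_day = pd.DataFrame({
    "Date": ["2020-02-22 to 2020-02-23", "2020-02-24 to 2020-02-25"],
    "Count": [5, 7],
})

# Split each range into a list of two dates, then give every date its own row.
newdf = (
    df_day.assign(Date=df_day["Date"].str.split(" to "))
          .explode("Date")
          .reset_index(drop=True)
)
newdf["Date"] = pd.to_datetime(newdf["Date"])
```

The other columns are simply repeated for both days, so any per-range totals would still need to be divided up by hand if that is what the data means.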

How can I remove the rows of a data frame where a certain value appears in that row in Python?

I want to remove every row in my 7000 x 10 data frame where one of the row entries takes a certain value. For example, if I had 600 rows where '20' appeared in the row, how can I delete all of those?
You usually are better off just making a new dataframe fulfilling the conditions you need than editing around with the old one. When in doubt you can just always assign it back to the same name, but here's a minimal example:
value = 20
df_filtered = df[(df != value).all(axis=1)]
Alternatively, find the relevant rows and then create new_df from the rest:
value = 20
rows_to_delete = (df == value).any(axis=1)
new_df = df.loc[~rows_to_delete, :]
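A quick check of the boolean-mask idea on a tiny invented frame (the column names and numbers are assumptions):

```python
import pandas as pd

# Hypothetical frame: the value 20 appears in the first two rows.
df = pd.DataFrame({"a": [1, 20, 3], "b": [20, 5, 6], "c": [7, 8, 9]})
value = 20

# True for every row that contains the value in at least one column.
has_value = (df == value).any(axis=1)

# Keep only the rows where the value never appears.
df_filtered = df[~has_value]
```

Comparing first and then calling any(axis=1) also behaves correctly when the value to remove is 0 or another falsy value.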
