Dropping rows and finding the average of a specific column - Python

I am trying to remove specific rows from the dataset and find the average of a specific column after the rows are removed, without changing the original dataset.
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\Users\User\Downloads\nba.CSV")
NBA = pd.read_csv(r"C:\Users\User\Downloads\nba.CSV")
NBA.drop([25,72,63],axis=0)
I need to find the average of a specific column like "Age".
However, this isn't working: NBA.drop([25,72,63],axis=0),['Age'].mean()
Neither is the query command or the .loc command.

Can you try this? I think there was a typo in your code:
NBA.drop([25,72,63],axis=0)['Age'].mean()

Your code to drop the rows is correct.
NBA_clean = NBA.drop([25,72,63],axis=0)
will give you a new DataFrame with those rows removed.
To find the average of a specific column, you can use index notation, which will return a Series containing that specific column:
NBA_Age = NBA_clean["Age"]
Finally, to return the mean, you simply call the mean() method with:
NBA_mean_age = NBA_Age.mean()
It is not clear what the specific mistake in your code is, but I will present two possibilities:
You are not saving the result of NBA.drop([25,72,63],axis=0) into a variable. This operation is not done in place; if you want it done in place, you must pass the inplace=True argument: NBA.drop([25,72,63], axis=0, inplace=True).
There is an unnecessary comma in NBA.drop([25,72,63],axis=0),['Age'].mean(). Remove it to get the correct syntax NBA.drop([25,72,63],axis=0)['Age'].mean(). I suspect the error message obtained when running this code would have hinted at the unnecessary comma.
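Putting it together, here is a minimal end-to-end sketch (the file path and the "Age" column are taken from the question):
import pandas as pd

NBA = pd.read_csv(r"C:\Users\User\Downloads\nba.CSV")

# drop() returns a new DataFrame, so NBA itself stays unchanged.
mean_age = NBA.drop([25, 72, 63], axis=0)['Age'].mean()
print(mean_age)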

Related

Python pandas extra 0 in numeric values

I have simple code that reads a CSV file. After that I change the names of the columns and print them. I found one weird issue: for some numeric columns it adds an extra .0. Here is my code:
v_df = pd.read_csv('csvfile', delimiter=';')
v_df = v_df.rename(columns={'Order No.': 'Order_Id'})
for index, csv_row in v_df.iterrows():
    print(csv_row.Order_Id)
Output is:
149545961155429.0
149632391661184.0
If I remove the empty row (2nd one in the above output) from the csv file, .0 does not appear in the ORDER_ID.
After doing some search, I found that converting this column to string will solve the problem. It does work if I change the first row of the above code to:
v_df = pd.read_csv('csvfile', delimiter=';', dtype={'Order No.': 'str'})
However, the issue is that the column name 'Order No.' is changed to Order_Id by the rename, so I cannot use 'Order No.'. For this reason I tried the following:
v_df[['Order_Id']] = v_df[['Order_Id']].values.astype('str')
But unfortunately it seems that astype is not changing the datatype and .0 is still appearing. My questions are:
1. Why is .0 appearing in the first place if there is an empty row in the CSV file?
2. Why is the datatype change not happening after the rename?
My aim is to just get rid of .0, I don't want to change the datatype if .0 can go away using any other method.
I am trying to emulate your df here; although it has some differences, I think it will work for you:
import pandas as pd
import numpy as np
v_df = pd.DataFrame(
    [['13-Oct-22', '149545961155429.0', '149545961255429', 'Delivered'],
     ['12-Oct-22', None, None, 'delivered'],
     ['15-Oct-22', '149632391661184.0', '149632391761184', 'Delivered']],
    columns=['Transaction Date', 'Order_Id', 'Order Item No.', 'Order Item Status'])
# go float -> int -> str so the trailing .0 is stripped from the values
v_df[['Order_Id']] = v_df[['Order_Id']].fillna(np.nan).values.astype('float').astype('int').astype('str')
Try it and let me know.
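As an alternative, note that dtype= in read_csv refers to the header as it appears in the file, before any rename, so specifying it at read time sidesteps the float conversion entirely. A minimal sketch with hypothetical inline data standing in for the real file:
import pandas as pd
from io import StringIO

# Hypothetical stand-in for the real file, which uses ';' as its delimiter.
csv_data = "Order No.;Status\n149545961155429;Delivered\n;delivered\n149632391661184;Delivered\n"

# Reading the column as a string up front means it is never parsed as a
# float, so no '.0' ever appears; the rename happens afterwards.
v_df = pd.read_csv(StringIO(csv_data), delimiter=';', dtype={'Order No.': 'str'})
v_df = v_df.rename(columns={'Order No.': 'Order_Id'})
print(v_df['Order_Id'].tolist())  # ['149545961155429', nan, '149632391661184']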

Using describe() method to exclude a column

I am new to using Python with data sets and am trying to exclude a column ("id") from being shown in the output. I am wondering how to go about this using the describe() and exclude functions.
describe() works on datatypes. You can include or exclude based on the datatype, not based on columns. If your column id is of a unique datatype, then
df.describe(exclude=[datatype])
or if you just want to remove the column(s) from describe, then try this:
cols = set(df.columns) - {'id'}
df1 = df[list(cols)]
df1.describe()
Ta-da, it's done. For more info, see the pandas documentation for describe.
You can do that by slicing your original DF to remove the 'id' column. One way is through .iloc. Suppose the column 'id' is the first column of your DF; then you could do this:
df.iloc[:,1:].describe()
The first colon represents the rows, the second the columns.
Although somebody responded with an example from the official docs which is more than enough, I'd just like to add this, since it might help a few people:
If your DataFrame is large (say, hundreds of columns), removing one or two might not be enough; instead, create a smaller DataFrame holding only what you're interested in and go from there.
Example of removing 2+ columns:
columns_you_want = set(your_bigger_data_frame.columns) - {'column_1', 'column_2', 'column_3', 'etc'}
your_new_smaller_data_frame = your_bigger_data_frame[list(columns_you_want)]
your_new_smaller_data_frame.describe()
If your DataFrame is medium/small, you already know every column, and you only need a few of them, just create a new DataFrame and then apply describe():
I'll give an example of reading a .csv file and then keeping a smaller portion of that DataFrame which holds only what you need:
df = pd.read_csv(r'.\docs\project\file.csv')
df = df[['column_1', 'column_2', 'column_3', 'etc']]
df.describe()
Use df.drop(columns=['id']).describe() (note that describe's exclude parameter takes dtypes, not column names, so exclude=['id'] would not work).
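A short sketch pulling these approaches together; the frame and column names here are made up for illustration:
import numpy as np
import pandas as pd

# Hypothetical frame where 'id' is the only integer column.
df = pd.DataFrame({'id': [1, 2, 3],
                   'height': [1.8, 1.7, 1.9],
                   'weight': [80.0, 72.5, 90.1]})

# Exclude by dtype: works because 'id' is the only column of its datatype.
print(df.describe(exclude=[np.int64]))

# Exclude by name: drop the column first, then describe the rest.
print(df.drop(columns=['id']).describe())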

Is there an equivalent to DataFrame.idxmax for when the DataFrame contains a string?

I'm using pandas to read a simple CSV file of election results:
constituency,anug,apnuafc,cg,ljp,pppc,...
Barima-Waini,0,3905,0,170,8022,...
Pomeroon-Supenaam,86,7343,149,120,18788,...
Essequibo Islands-West Demerara,310,23811,318,0,47855,...
...
I access this with election.votes in views.py:
results = pd.read_csv(election.votes)
For each row I want to add a new column for the winning party. I've tried:
results["winner"] = results.max(axis=1)
But this adds the highest value, not the corresponding column header. So I've tried:
results["winner"] = results.idxmax(axis=1)
I then get the error: reduction operation 'argmax' not allowed for this dtype.
Because the constituency column contains strings, I can't use to_numeric to make idxmax work.
Is there another efficient way to get the column header?
Use DataFrame.select_dtypes to get only the numeric columns:
import numpy as np
results["winner"] = results.select_dtypes(np.number).idxmax(axis=1)

Filter off NaNs in a large DataFrame with headers

I have a large number of time series, with blanks on certain dates for some of them. I read them with xlwings from an Excel sheet:
Y0 = xw.Range('SomeRangeinXLsheet').options(pd.DataFrame, index=True , header=3).value
I'm trying to create a filter so I can run regressions on those series, which means I have to take out the void dates. If I do:
print(Y0.iloc[:,[i]]==Y0.iloc[:,[i]])
I get a proper series of True/False for my column number i, fine.
I'm then stuck: I can't find a way to filter the whole df with the True/False values for that column, or even just to extract that clean series as a pd.Series.
I need them one by one, so I can adapt my independent variables' dates to those of each of these series separately.
Thank you for your help.
I believe you want to use df.dropna()
I am not sure I understood your problem, but if you want to check for NULLs in a specific column and drop those rows, you can try this:
import pandas as pd
df = df[pd.notnull(df['column_name'])]
For deleting NaNs, df.dropna() should work, as suggested in the previous answer. If it is not working, you can try replacing NaNs with a placeholder text and then deleting the rows that contain that placeholder:
import numpy as np
df['column_name'] = df['column_name'].replace(np.nan, 'delete-it', regex=True)
df = df[df['column_name'] != 'delete-it']
Hope this helps!
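Since the gaps fall on different dates for each series, a per-column boolean mask gives you one clean pd.Series at a time. A minimal sketch with made-up data standing in for the xlwings range:
import numpy as np
import pandas as pd

# Hypothetical stand-in for Y0: two series with blanks on different dates.
Y0 = pd.DataFrame(
    {'series_a': [1.0, np.nan, 3.0], 'series_b': [np.nan, 2.0, 5.0]},
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']))

# notna() gives the True/False mask for one column; .loc with that mask
# keeps only the dates where that particular series has a value.
for col in Y0.columns:
    clean = Y0.loc[Y0[col].notna(), col]  # a clean pd.Series for this column
    print(col, list(clean.index.date))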

Using if-then with Python pandas

I have a dataset in pandas in Python and would like to apply an if-then-else rule to a specific column.
If there is a missing value, it should be replaced with a specific value taken from another column in the same observation; otherwise, do nothing.
My dataset is generated by the following code:
results = df2.merge(df1,on="sku", how="left")
The column that needs to be filled, if empty, is "stock_y".
If it is empty, the value of the column "stock_x" should be copied to "stock_y". If stock_y is already filled, the code should skip that observation.
Check out Series.combine_first:
results['stock_y'] = results['stock_y'].combine_first(results['stock_x'])
Implementation with fillna:
results['stock_y'] = results['stock_y'].fillna(results['stock_x'])
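A minimal sketch with a made-up merge result, showing that both approaches fill only the missing entries:
import numpy as np
import pandas as pd

# Hypothetical merge result: stock_y is NaN where df1 had no matching sku.
results = pd.DataFrame({
    'sku': ['a', 'b', 'c'],
    'stock_x': [10, 20, 30],
    'stock_y': [5.0, np.nan, 7.0]})

# fillna copies stock_x into stock_y only where stock_y is NaN; rows that
# already have a stock_y value are left untouched.
results['stock_y'] = results['stock_y'].fillna(results['stock_x'])
print(results)  # row 'b' now has stock_y == 20.0, copied from stock_x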
