Replace a row in Pandas DataFrame with 'NaN' based on condition - python

I have a Pandas DataFrame called df with shape (378000, 82), and I would like to replace entire rows with NaN based on a specific condition: for any value in the column df.halon_gas that is > 20, I want to replace that whole row with NaN. I want to filter my data this way so I don't lose the index values.
Thanks!

First of all, get the indexes of all rows whose value is above 20:
idx = df[df.halon_gas > 20].index
Then set every column of those rows to NaN (set_value is deprecated and removed in recent pandas, so use .loc):
df.loc[idx, :] = np.nan   # or None; both end up as NaN in numeric columns
This writes NaN into every row whose halon_gas value is above 20, while keeping the index intact.
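As a runnable sketch on made-up numbers (the other column is just filler):

import pandas as pd
import numpy as np

# Made-up data; only halon_gas matters for the condition.
df = pd.DataFrame({"halon_gas": [5.0, 25.0, 12.0, 40.0], "other": [1.0, 2.0, 3.0, 4.0]})

idx = df[df.halon_gas > 20].index
df.loc[idx, :] = np.nan   # rows 1 and 3 become all-NaN; the index labels stay
print(df)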

If you're fine with the rows being gone then I suggest you do this:
df.reset_index(level=0, inplace=True)
df = df[df.halon_gas <= 20]
df.set_index("index", inplace=True)
What's happening here is the following (a small illustration on made-up data follows the list):
The index gets reset, so you get an extra column holding the index values from before the removal.
Only the rows where df.halon_gas <= 20 are kept.
The old index values are set as the index of the DataFrame again.
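For instance, on a tiny made-up frame (the index labels 10-12 are arbitrary):

import pandas as pd

df = pd.DataFrame({"halon_gas": [5, 25, 12]}, index=[10, 11, 12])
df.reset_index(level=0, inplace=True)   # the old index becomes a column named "index"
df = df[df.halon_gas <= 20]             # keep only the rows at or below 20
df.set_index("index", inplace=True)     # restore the old labels; 10 and 12 remain
print(df)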

Related

Python/Pandas: How to find rows that are duplicated in two df columns?

I have a df where I want to check for duplicate rows in only two of the columns, but if those columns are similar to the previous row, then I'd like to isolate/print them. So for example, if rows 12 - 89 have the same value in column 2 and column 3 as the previous row(s), then I want to know this range of rows.
See image 1 for example of df where 'pm10_ugm3' and 'pm25_ugm3' are duplicated but other columns are not:
Many thanks
Try the dataframe's duplicated function. It returns a boolean Series that you can use to slice/select those rows. Some variation of this will get you close:
dup_rows = df.duplicated(subset=['col1', 'col2'], keep='first')
print(df[dup_rows])
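If the goal is specifically rows that match the immediately previous row (rather than rows duplicated anywhere in the frame), a shift-based comparison keeps the row order in view. This is just a sketch on made-up numbers; only the column names are taken from the question:

import pandas as pd

df = pd.DataFrame({
    "pm10_ugm3": [10, 10, 10, 12, 12],
    "pm25_ugm3": [5, 5, 5, 6, 6],
    "other":     [1, 2, 3, 4, 5],
})

cols = ["pm10_ugm3", "pm25_ugm3"]
same_as_prev = (df[cols] == df[cols].shift()).all(axis=1)   # True where both columns repeat the previous row
print(df[same_as_prev])   # rows 1, 2 and 4 in this toy data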

Pandas dataframe set cell value from sum of rows with condition

I would like to set for each row, the value of ART_IN_TICKET to be the number of rows that have the same TICKET_ID as this row.
for example, for the first 5 rows of this dataframe, TICKET_ID is 35592159 and ART_IN_TICKET should be 5 since there are 5 rows with that same TICKET_ID.
There can be other solutions as well. A relatively simple solution would be to get the count of rows for each TICKET_ID and then merge the new df with this one to get the final result in ART_IN_TICKET. Assuming the above dataframe is in df.
count_df = df[['TICKET_ID', 'ART_IN_TICKET']].groupby("TICKET_ID").count().reset_index()
df = df[list(set(df.columns.tolist())-set(["ART_IN_TICKET"]))] # Removing ART_IN_TICKET column before merging
final_df = df.merge(count_df, on="TICKET_ID")
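A possibly simpler alternative is groupby().transform, which broadcasts the per-group count back onto every row, so no separate count DataFrame or merge is needed. A minimal sketch on TICKET_ID values loosely based on the question:

import pandas as pd

df = pd.DataFrame({"TICKET_ID": [35592159] * 5 + [35592160] * 2})
df["ART_IN_TICKET"] = df.groupby("TICKET_ID")["TICKET_ID"].transform("size")
print(df)   # 5 for the first five rows, 2 for the last two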

How to append a list to a specific dataframe column using for loop?

I'm attempting to append a value to a dataframe by looping through the dataframe's columns and comparing them to a list of columns. If a column from the list is found in the dataframe, then that particular column of the dataframe gets assigned a value.
Assuming this is my dataframe columns
itching skin_rash nodal_skin_eruptions continuous_sneezing shivering chills joint_pain stomach_pain
And this is my list of columns
list_columns = ['itching', 'continuous_sneezing', 'shivering']
How do I look the list_columns up in the dataframe and assign a value of one to each column in the dataframe that is found in list_columns?
So for instance, the result will have a 1 under itching, continuous_sneezing and shivering, and the other columns left blank:
itching  skin_rash  nodal_skin_eruptions  continuous_sneezing  shivering  chills  joint_pain
1                                         1                    1
You want to loop over the dataframe columns. Then, in the for loop, you can check if the column name is in list_columns. But the thing is, in a dataframe, all columns need to have a value in every row, so you can't leave something blank like you showed in the example output. But, you can do something like putting 0s for columns where you don't want a value, and 1s for columns where you do. Here's an example.
new_row = {}
for c in df.columns:
    if c in list_columns:
        new_row[c] = 1
    else:
        new_row[c] = 0
df = df.append(new_row, ignore_index=True)   # append returns a new DataFrame, so re-assign it
I know this doesn't look pretty, and isn't super efficient, but this should work, please comment if it doesn't :)
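Note that DataFrame.append was removed in pandas 2.0; on newer versions the same idea can be written with pd.concat. A self-contained sketch using the columns from the question:

import pandas as pd

cols = ['itching', 'skin_rash', 'nodal_skin_eruptions', 'continuous_sneezing',
        'shivering', 'chills', 'joint_pain', 'stomach_pain']
list_columns = ['itching', 'continuous_sneezing', 'shivering']

df = pd.DataFrame(columns=cols)                                   # empty frame with those columns
new_row = {c: (1 if c in list_columns else 0) for c in cols}
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)  # modern replacement for append
print(df)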
Your dataframe:
cols=['itching','skin_rash','nodal_skin_eruptions','continuous_sneezing','shivering','chills','joint_pain','stomach_pain']
list_columns = ['itching', 'continuous_sneezing', 'shivering']
Now check whether each column is present in list_columns and then assign the values:
df = pd.DataFrame(columns=cols)   # create an empty dataframe with these columns
df.loc[0]=df.columns.isin(list_columns).astype(int)
OR
df=pd.DataFrame([df.columns.isin(list_columns).astype(int)],columns=cols)
output of df:
itching skin_rash nodal_skin_eruptions continuous_sneezing shivering chills joint_pain stomach_pain
0 1 0 0 1 1 0 0 0

Find duplicated rows, multiply a certain column by number of duplicates, drop duplicated rows

I have a pandas dataframe of about 70000 rows, and 4500 of them are duplicates of an original. The columns are a mix of string columns and number columns. The column I'm interested in is the value column. I'd like to look through the entire dataframe to find rows that are completely identical, count the number of duplicated rows per row (inclusive of the original), and multiply the value in that row by the number of duplicates.
I'm not really sure how to go about this from the start, but I've tried using df[df.duplicated(keep = False)] to obtain a dataframe df1 of duplicated rows (inclusive of original rows). I appended a column of Trues to the end of df1. I tried to use .groupby with a combination of columns to sum up the number of Trues, but the result was unable to capture the true number of duplicates (I obtained about 3600 unique duplicated rows in this case).
Here's my actual code:
duplicate_bool = df.duplicated(keep = False)
df['duplicate_bool'] = duplicate_bool
df1= df[duplicate_bool]
f = {'duplicate_bool':'sum'}
df2= df1.groupby(['Date', 'Exporter', 'Buyer', \
'Commodity Description', 'Partner Code', \
'Quantity', 'Price per MT'], as_index = False).agg(f)
My idea here was to obtain a separate dataframe df2 with no duplicates, so I could multiply each entry in its value column by the number stored in the summed duplicate_bool column. Then I'd simply append df2 to my original dataframe after removing all the duplicates identified by .duplicated.
However, if I use groupby with all columns I get an empty dataframe. If I don't use all the columns, I don't get the true number of duplicates and I won't be able to append it in any way.
I think I'd like a better way to do this since I'm confusing myself.
I think this question is nothing more than figuring out how to get a count of the occurrences of each unique row. If a row occurs only once, this count is one; if it occurs more often, it will be > 1. You can then use this count to multiply, filter, etc.
This nice one-liner (taken from How to count duplicate rows in pandas dataframe?) creates an extra column with the number of occurrences of each row:
df = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'dup_count'})
To then calculate the true value of each row:
df['total_value'] = df['value'] * df['dup_count']
And to filter we can use the dup_count column to remove all duplicate rows:
dff = df[df['dup_count'] == 1]
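Putting the pieces together on a tiny made-up frame (column names are only for illustration; the last two rows are exact duplicates of each other):

import pandas as pd

df = pd.DataFrame({
    "Exporter": ["A", "B", "B"],
    "Quantity": [10, 20, 20],
    "value":    [1.0, 2.0, 2.0],
})

df = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'dup_count'})
df['total_value'] = df['value'] * df['dup_count']
print(df)   # one row per unique combination, with dup_count 1 and 2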

How to set in pandas the first column and row as index?

When I read in a CSV, I can say pd.read_csv('my.csv', index_col=3) and it sets the column at position 3 as the index.
How can I do the same if I have a pandas dataframe in memory? And how can I say to use the first row also as an index? The first column and row are strings, rest of the matrix is integer.
You can try this when reading the CSV, regardless of the number of rows:
df = pd.read_csv('data.csv', index_col=0)
Making the first (or n-th) column the index in increasing order of verboseness:
df.set_index(list(df)[0])
df.set_index(df.columns[0])
df.set_index(df.columns.tolist()[0])
Making the first (or n-th) row the index:
df.set_index(df.iloc[0].values)
You can use both if you want a multi-level index:
df.set_index([df.iloc[0], df.columns[0]])
Observe that using a column as index will automatically drop it as column. Using a row as index is just a copy operation and won't drop the row from the DataFrame.
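If "use the first row as an index" means promoting it to the column labels (an assumption about the intent), one possible sequence on made-up data is:

import pandas as pd

# Made-up frame: first column and first row are string labels, the rest integers.
df = pd.DataFrame([
    ["",   "c1", "c2"],
    ["r1",  1,    2],
    ["r2",  3,    4],
])

df.columns = df.iloc[0]              # promote the first row to column labels
df = df.iloc[1:]                     # drop that row from the data
df = df.set_index(df.columns[0])     # make the first column the index
print(df)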
Maybe try set_index()?
df = df.set_index([2])
Maybe try df = pd.read_csv('my.csv', header=0, index_col=0), which takes the first row as the column labels and the first column as the index.
