Finding multiple maximum values from a file using Python

I am working with a CSV file and I need to find the several largest values in a column. I was able to find the single top value just by looping through and comparing values.
My idea for getting the top few values is to store all of the values from that column in an array, sort it, and then pull the last three indices. However, I'm not sure that would be a good idea in terms of efficiency. I also need to pull other attributes associated with the top values, and it seems like separating out these column values would make everything messy.
Another thing I thought about doing is keeping three variables and doing a running top-value comparison, where every time I find something bigger I compare the "top three" amongst each other and reorder them. That also seems a bit complex and I'm not sure how I would implement it.
I would appreciate some ideas, or someone telling me if I'm missing something obvious. Let me know if you need to see my sample code (I felt it was probably unnecessary here).
Edit: To clarify, if the column values are something like [2,5,6,3,1,7] I would want to have the values first = 7, second = 6, third = 5

Pandas looks perfect for your task:
import pandas as pd
df = pd.read_csv('data.csv')
df.nlargest(3, 'column name')  # the 3 rows with the largest values in that column; the other columns (your associated attributes) stay attached
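If you would rather not use pandas, the standard library's heapq.nlargest does the "running top three" you describe and keeps each whole row attached. A minimal sketch, assuming the CSV has a header and the target column is called 'value':
import csv
import heapq

# Read the rows as dictionaries so the other attributes stay with each value.
with open('data.csv', newline='') as f:
    rows = list(csv.DictReader(f))

# Keep the 3 rows with the largest numeric 'value'; adjust the column name as needed.
top_three = heapq.nlargest(3, rows, key=lambda row: float(row['value']))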

Related

Resample().mean() in Python/Pandas and adding the results to my dataframe when the starting point is missing

I'm pretty new to coding and have a problem resampling my dataframe with Pandas. I need to resample my data ("value") to means for every 10 minutes (13:30, 13:40, etc.). The problem is: The data start around 13:36 and I can't access them by hand because I need to do this for 143 dataframes. Resampling adds the mean at the respective index (e.g. 13:40 for the second value), but because 13:30 is not part of my indices, that value gets lost.
I'm trying two different approaches here: First, I tried every option of resample() (offset, origin, convention, ...). Then I tried adding the missing values manually with a loop, which doesn't run properly because I didn't know how to access the correct spot on the list. The list does include all relevant values though. I also tried adding a row with 13:30 as the index on top of the dataframe but didn't manage to convince Python that my index is legit because it's a timestamp (this is not in the code).
Sorry for the very rough code, it just didn't work in several places which is why I'm asking here.
If you have a possible solution, please keep in mind that it has to function within an already long loop because of the many dataframes I have to work on simultaneously.
Thank you very much!
df["tenminavg"] = df["value"].resample("10Min").mean()
df["tenminavg"] = df["tenminavg"].ffill()
ls1 = df["value"].resample("10Min").mean() #my alternative: list the resampled values in order to eventually access the first relevant timespan
for i in df.index:  # this loop doesn't work. It should add the value for the first 10 min
    if df["tenminavg"][i] == "nan":
        if datetime.time(13,30) <= df.index.time < datetime.time(13,40):
            df["tenminavg"][i] = ls1.index.loc[i.floor("10Min")]["value"]  # tried to access the corresponding data point in the list
        else:
            continue
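A possible way around the loop (a sketch, assuming df has a DatetimeIndex): compute the 10-minute means as their own series and align them back onto the original index with reindex plus a forward fill, so the mean for the 13:30 bin is not dropped during the assignment.
tenmin = df["value"].resample("10Min").mean()               # indexed at 13:30, 13:40, ...
df["tenminavg"] = tenmin.reindex(df.index, method="ffill")  # each row gets the mean of its own 10-minute bin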

Is there a pandas function to merge 2 dfs so that repeating items in the second df are added as columns to the first df?

I have a hard time formulating this problem in abstract terms, so I will mostly try to explain it with examples.
I have 2 pandas dataframes (I get them from a sqlite DB).
First DF:
Second DF:
So the thing is: There are several images per "capture". I would like to add the images to the capture df as columns, so that each capture has 9 image columns, each with a path. There are always 9 images per capture.
I solved it in pandas with what I know in the following way:
cam_idxs = sorted(list(range(9)) * 2)
for cam_idx in cam_idxs:
    sub_df = images.loc[(images["camera_id"] == cam_idx)]
    captures = captures.merge(sub_df[["image", "capture_id"]], left_on="id",
                              right_on="capture_id")
I imagine, though, that there must be a better way; people probably stumble into this problem fairly often when getting data out of a SQL database.
Since I am getting the data into pandas from a SQL database, I am also open to SQL commands that get me this result. I'd also be grateful if someone could tell me what this kind of operation is called; I did not find a good way to google for it, which is why I am asking here. Excuse me if this question has already been asked somewhere, I did not find anything with my search terms.
So the question at the end is: Is there a better way to do this, especially a more efficient way to do this?
What you are looking for is the pivot table.
You just need to create a column containing the index of each image within its capture_id, which you will then use as the columns of the pivot table.
For example, this could be:
images['column_pivot'] = [x for x in range(1,10)]*int(images.shape[0]/9)
In your case 'column_pivot' would be [1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9...7,8,9] (i.e. cycling from 1 to 9).
Then you pivot:
pd.pivot_table(images, columns='column_pivot', index='capture_id', values='image', aggfunc='first')  # aggfunc='first' because the values are string paths, which the default mean aggregation cannot handle
This will give the expected result.
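A minimal, self-contained illustration of the idea (toy data with 3 images per capture instead of 9; the column names follow the question):
import pandas as pd

images = pd.DataFrame({
    'capture_id': [1, 1, 1, 2, 2, 2],
    'image': ['a.png', 'b.png', 'c.png', 'd.png', 'e.png', 'f.png'],
})
# Number the images within each capture, then use that number as the pivot columns.
images['column_pivot'] = list(range(1, 4)) * 2
wide = pd.pivot_table(images, columns='column_pivot', index='capture_id',
                      values='image', aggfunc='first')
# wide has one row per capture_id and one image column per position (1..3 here, 1..9 in your data).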

Python: create new columns from rows based on multiple conditions

I've been poking around a bit and can't seem to find a close solution to this one:
I'm trying to transform a dataframe from this:
To this:
Such that remark_code_names with similar denial_amounts are provided new columns based on their corresponding har_id and reason_code_name.
I've tried a few things, including a groupby function, which gets me halfway there.
denials.groupby(['har_id','reason_code_name','denial_amount']).count().reset_index()
But this obviously leaves out the reason_code_names that I need.
Here's a minimal example:
pd.DataFrame({'har_id': ['A','A','A','A','A','A','A','A','A'],
              'reason_code_name': [16,16,16,16,16,16,16,22,22],
              'remark_code_name': ['MA04','N130','N341','N362','N517','N657','N95','MA04','N341'],
              'denial_amount': [5402,8507,5402,8507,8507,8507,8507,5402,5402]})
Using groupby() is a good way to go. Use it along with transform() and overwrite the column named 'remark_code_name'. This solution puts all remark_code_names together in the same column.
denials['remark_code_name'] = denials.groupby(['har_id','reason_code_name','denial_amount'])['remark_code_name'].transform(lambda x : ' '.join(x))
denials.drop_duplicates(inplace=True)
If you really need to create each code in its own column, you could apply another function and use .split(). However, you will first need to set the number of columns based on the maximum number of codes you find in a single row, as sketched below.
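A possible version of that second step (a sketch; str.split with expand=True creates as many columns as the longest row needs, and the new column names here are made up for illustration):
codes = denials['remark_code_name'].str.split(' ', expand=True)
codes.columns = [f'remark_code_{i + 1}' for i in range(codes.shape[1])]
denials = denials.drop(columns='remark_code_name').join(codes)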

How to combine two columns in a df as a condition in np.where to check if nan in calculating new column

I want to check if either column contains a NaN and, if so, set the value in my new column to "na", else 1.
I was able to get my code to work when checking a single column using .isnull(), but I am unsure how to combine two columns. I tried using or, as seen below, but it did not work. I know I could make the code a bit messy by testing one condition, then checking the next, and producing the outcome I want from that, but I was hoping to keep the code cleaner by using some sort of any/or function instead; I just don't know how to do that.
temp_df["xl"] = np.where(temp_df["x"].isnull() or temp_df["y"].isnull(), "na",1)
Use any; for more detail on the situation you have, see this link.
temp_df["xl"] = np.where(temp_df[['x','y']].isnull().any(1), "na",1)

How to select all rows which contain values *in selected columns* greater than a threshold?

I'm trying to do the same thing as in this question, but I have a string-type column that I need to keep in the dataframe so I can identify which rows are which. (I guess I could do this by index, but I'd like to be able to save a step.) Is there a way to not count that column when using .any(), but keep it in the resulting dataframe? Thanks!
Here's the code that works on all columns:
df[(df > threshold).any(axis=1)]
Here's the hard coded version I'm working with right now:
df[(df[list_of__selected_columns] > 3).any(axis=1)]
This seems a little clumsy to me, so I'm wondering if there's a better way.
You can use .select_dtypes to choose, say, all numerical columns:
df[df.select_dtypes(include='number').gt(threshold).any(axis=1)]
Or a chunk of contiguous columns with iloc:
df[df.iloc[:,3:6].gt(threshold).any(axis=1)]
If you want to select an arbitrary list of columns, you're best off spelling them out in a hard-coded list, as in your example.
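A quick sketch with toy data (the column and variable names here are made up): the string 'name' column is ignored by the threshold check but kept in the result.
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'v1': [1, 5, 2],
                   'v2': [0, 1, 9]})
threshold = 3
result = df[df.select_dtypes(include='number').gt(threshold).any(axis=1)]
# Keeps the rows for 'b' and 'c' (each has a numeric value above 3), with 'name' still present.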
