How to sort a DataFrame using two columns' values: look at the 1st column's values first, and only if those values are duplicated break the tie using the 2nd column's values?
Use sort_values() on the DataFrame:
df.sort_values(by=['col1', 'col2'])
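As a quick, self-contained sketch (the data and column names here are made up for illustration):
import pandas as pd

df = pd.DataFrame({'col1': ['b', 'a', 'b', 'a'],
                   'col2': [2, 1, 1, 2]})

# Rows are ordered by col1 first; only rows sharing a col1 value
# are then ordered among themselves by col2.
print(df.sort_values(by=['col1', 'col2']))
#   col1  col2
# 1    a     1
# 3    a     2
# 2    b     1
# 0    b     2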
I have a dataframe 'raw' that looks like this -
It has many rows with duplicate values in each column.
I want to make a new dataframe 'new_df' which has the unique customer_code values along with the corresponding market_code.
The new_df should look like this -
It sounds like you simply want to create a DataFrame with unique customer_code values that also shows the corresponding market_code. Here's a way to do it:
df = df[['customer_code','market_code']].drop_duplicates('customer_code')
Output:
customer_code market_code
0 Cus001 Mark001
1 Cus003 Mark003
3 Cus004 Mark003
4 Cus005 Mark004
The df[['customer_code','market_code']] part gives us a DataFrame containing only the two columns of interest, and drop_duplicates('customer_code') eliminates all but the first occurrence of each duplicated value in the customer_code column. (You could instead keep the last occurrence of each duplicate by passing the keep='last' argument.)
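A self-contained sketch, with the raw data invented to mirror the output above:
import pandas as pd

df = pd.DataFrame({'customer_code': ['Cus001', 'Cus003', 'Cus003', 'Cus004', 'Cus005'],
                   'market_code': ['Mark001', 'Mark003', 'Mark003', 'Mark003', 'Mark004']})

# Keep only the first row seen for each customer_code.
new_df = df[['customer_code', 'market_code']].drop_duplicates('customer_code')
print(new_df)
#   customer_code market_code
# 0        Cus001     Mark001
# 1        Cus003     Mark003
# 3        Cus004     Mark003
# 4        Cus005     Mark004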
I have data like this in a CSV file, which I am importing into a pandas DataFrame.
I want to collapse the values of the Type column by concatenating its strings into one sentence and placing it in the first row next to the Date value, while keeping all the other rows and values the same.
As shown below.
Edit:
You can try ffill + transform:
df1 = df.copy()
# Forward-fill so every row carries its group's Number and Date.
df1[['Number', 'Date']] = df1[['Number', 'Date']].ffill()
# Replace NaN in Type with '' so the strings can be joined.
df1.Type = df1.Type.fillna('')
# Concatenate all Type strings within each (Number, Date) group.
s = df1.groupby(['Number', 'Date']).Type.transform(' '.join)
# Put the joined sentence on the first row of each group (where Date is
# present) and blank out Type on the remaining rows.
df.loc[df.Date.notnull(), 'Type'] = s
df.loc[df.Date.isnull(), 'Type'] = ''
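A minimal round trip with invented data, assuming the Number/Date/Type layout described above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Number': [1, np.nan, np.nan, 2],
                   'Date': ['2020-01-01', np.nan, np.nan, '2020-01-02'],
                   'Type': ['foo', 'bar', 'baz', 'qux']})

# After running the snippet above, df becomes:
#    Number        Date         Type
# 0     1.0  2020-01-01  foo bar baz
# 1     NaN         NaN
# 2     NaN         NaN
# 3     2.0  2020-01-02          qux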
I read some time series data and made a pd.DataFrame object out of it:
The dataframe is 1 row and 84 columns, each column's label is a datetime object so I can add more rows with different data to that date later. As you can see, the columns are out of order. This is causing my data to look incorrect when I print it in line graph form.
The only search results I'm seeing are about sorting an entire dataframe by the values of a single column. How can I sort my dataframe by the headers of every column, so that my columns are in chronological order?
You can sort your dataframe by multiple columns like this:
df.sort_values(by=['col1', 'col2'])
What it will do is sort your df by col1 first; then, wherever there are duplicate values in col1, it will sort those rows by their col2 values.
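Note that the question above is about ordering the columns themselves by their datetime headers, rather than ordering rows by values; the usual approach for that is sort_index along the column axis. A minimal sketch:
# Reorder the columns by their labels (chronologically here, since the
# labels are datetime objects).
df = df.sort_index(axis=1)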
I have to compute the RMSE between two columns that have different non-NaN values.
I have found the indices of the non-NaN values in the first column. Now I want to filter out the values of the 2nd column according to those indices.
This is the code I used to find the indices:
b = np.argwhere(y.notnull().values).tolist()
Here y is the column; the indices of its non-NaN values are stored in b.
I have another column x and need to match b against the values of x, filter out those values, and store them in another column.
If you're using pandas DataFrames, you can use pandas iloc:
df[x].iloc[b]
You can get just the values, as a NumPy array, using the values attribute:
df[x].iloc[b].values
Or, if you want a list:
print(df[x].iloc[b].values.tolist())
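For the original RMSE goal, a sketch that sidesteps the index bookkeeping by masking both columns at once (the column names x and y are placeholders; also note that np.argwhere(...).tolist() yields nested one-element lists, so you may need to flatten b, e.g. with np.ravel, before passing it to iloc):
import numpy as np
import pandas as pd

# Hypothetical columns with NaNs in different places.
df = pd.DataFrame({'x': [1.0, 2.0, np.nan, 4.0],
                   'y': [1.5, np.nan, 3.0, 3.5]})

# Keep only the rows where both columns have values.
mask = df['x'].notnull() & df['y'].notnull()
rmse = np.sqrt(np.mean((df.loc[mask, 'x'] - df.loc[mask, 'y']) ** 2))
print(rmse)  # 0.5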
Similar to pandas unique values multiple columns, I want to count the number of unique values per column. However, as the dtypes differ, I get the following error:
The data frame looks like
small[['TARGET', 'title']].apply(pd.Series.describe) gives me the result, but only for the category dtypes, and I am unsure how to filter the index for only the last row, which holds the unique values per column.
Use apply and np.unique to grab the unique values in each column and take its size:
small[['TARGET','title']].apply(lambda x: np.unique(x).size)
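If all you need is the count of unique values per column, pandas' built-in nunique should also work, and it avoids the sorting np.unique does, which can fail on mixed dtypes. A one-line sketch:
# Count distinct values in each column (NaN is excluded by default).
print(small[['TARGET', 'title']].nunique())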
Thanks!