How to find the top any % of a dataframe? - python

I want to find the top 1% in my dataframe and append all the values to a list. Then I can check the first value inside and use it as a filter in the dataframe. Any idea how to do it? Or if you have a simpler way to do it!
You can find the dataframe I use here:
https://raw.githubusercontent.com/srptwice/forstack/main/resultat_projet.csv
What I tried is to inspect my dataframe with a heatmap (from Seaborn) and use a filter like this:
df4 = df2[df2 > 50700]

You can use df.<column name>.quantile(<percentile>) to get the cutoff for the top percentage of a dataframe. For example, the code below gives you df2, the rows of df where the bfly column is in the top 10% (above the 90th percentile):
import pandas as pd
df = pd.read_csv('./resultat_projet.csv')
df.columns = df.columns.str.replace(' ', '')  # remove blank spaces in column names
df2 = df[df.bfly > df.bfly.quantile(0.9)]
print(df2)
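If you also want the top-1% values collected in a list, as the question describes, a small sketch (still using the bfly column from the answer above) could look like this:
threshold = df.bfly.quantile(0.99)                 # cutoff value for the top 1%
top_values = sorted(df.bfly[df.bfly > threshold])  # all top-1% values, smallest first
df_top = df[df.bfly > top_values[0]]               # reuse the smallest one as the filter
In practice the quantile value itself is usually enough as the filter, so the intermediate list is optional.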

Replace elements of a dataframe with a values at another dataframe elements

I want to replace elements of df2 with elements of df1 according to this rule: if the first row, first column of df2 holds a '1', then the first row, first column element of df1 goes there; if it is zero, a '0' stays. If the last column of any row of df2 is '1', then the last column element of that row of df1 goes there, and so on.
So I want to replace every '1' element of df2 with the corresponding df1 element according to that rule. df3 is going to be like:
abcde0000;
abcd0e000;
abcd00e00;...
We can use the apply function for this, but first you have to concat both frames along axis 1. I am using a dummy table with just three rows; the same approach works for any number of rows.
import pandas as pd
import numpy as np
# Dummy data
df1 = pd.DataFrame([['a','b','c','d','e'],['a','b','c','d','e'],['a','b','c','d','e']])
df2 = pd.DataFrame([[1,1,1,1,1,0,0,0,0],[1,1,1,1,0,1,0,0,0],[1,1,1,1,0,0,1,0,0]])
# Display dataframes. display() may not work in plain Python scripts; I used it in Jupyter notebooks
display(df1)
display(df2)
# Concat DFs
df3 = pd.concat([df1,df2],axis=1)
display(df3)
# Define function for replacing
def replace(letters, indexes):
    seek = 0
    for i in range(len(indexes)):
        if indexes[i] == 1:
            indexes[i] = letters[seek]
            seek += 1
    return ''.join(list(map(str, indexes)))
# Applying replace function to dataframe
df4 = df3.apply(lambda x: replace(x[:5],x[5:]),axis=1)
# Display df4
display(df4)
The result is
0 abcde0000
1 abcd0e000
2 abcd00e00
dtype: object
I think this will solve your problem
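If you would rather avoid the row-wise apply, a vectorized sketch of the same idea with NumPy, assuming the same dummy frames as above, could look like this:
import numpy as np
import pandas as pd
df1 = pd.DataFrame([['a','b','c','d','e'],['a','b','c','d','e'],['a','b','c','d','e']])
df2 = pd.DataFrame([[1,1,1,1,1,0,0,0,0],[1,1,1,1,0,1,0,0,0],[1,1,1,1,0,0,1,0,0]])
letters = df1.to_numpy()
mask = df2.to_numpy().astype(bool)
out = df2.to_numpy().astype(object)
r, c = np.nonzero(mask)                  # row/column of every 1, in row-major order
order = mask.cumsum(axis=1)[r, c] - 1    # rank of each 1 within its row
out[r, c] = letters[r, order]            # the k-th 1 in a row gets the k-th letter of that row
df4 = pd.Series([''.join(map(str, row)) for row in out])
print(df4)
The apply version above is easier to read; the NumPy version only starts to matter when the frames are large.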

Problem with a column in my groupby new object

So I have a dataframe and I applied this operation:
df1 = df1.groupby(['trip_departure_date']).agg(occ = ('occ', 'mean'))
The problem is that when I try to plot, it gives me an error saying that trip_departure_date doesn't exist!
I did this:
df1.plot(x = 'trip_departure_date', y = 'occ', figsize = (8,5), color = 'purple')
and I get this error:
KeyError: 'trip_departure_date'
Please help!
Your question is similar to this question: pandas groupby without turning grouped by column into index
When you group by a column, that column ceases to be a column and becomes the index of the result, and the index is not a column. If you set as_index=False, pandas keeps the column you are grouping over as a column instead of moving it to the index.
The second problem is that you don't need the named .agg() aggregation just to get the mean of occ grouped by trip_departure_date; a plain .mean() after the groupby will do.
import pandas as pd
df1 = pd.read_csv("trip_departures.txt")
df1_agg = df1.groupby(['trip_departure_date'],as_index=False).mean()
Or if you only want to aggregate the occ column:
df1_agg = df1.groupby(['trip_departure_date'],as_index=False)['occ'].mean()
df1_agg.plot(x = 'trip_departure_date', y = 'occ', figsize = (8,5), color = 'purple')
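Alternatively, if you want to keep your original .agg call, a sketch that simply moves the group key back into a column with reset_index() before plotting would be:
df1_agg = df1.groupby('trip_departure_date').agg(occ=('occ', 'mean')).reset_index()
df1_agg.plot(x='trip_departure_date', y='occ', figsize=(8, 5), color='purple')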

How do I groupby one column in a dataframe and combine on another column no matter if there is text or blanks?

I want to group by one column of a DataFrame and then combine the text in another column for each unique value of the first column (call_ID). The code should also delete any duplicates. My problem seems to be in deleting the duplicates. I tried the code below, but it is not successful.
# Dataframe
dftest0 = pd.DataFrame(data={'call_ID': [5423684, 5423684, 5423684, 5423684],
                             'other_comb_words': ['', '', 'inspection', 'inspection']})
# Change datatype
dftest0['call_ID'] = dftest0['call_ID'].astype(str)
# groupby and combine text
dftest0['other_comb_words'] = dftest0.groupby(['call_ID'], as_index=False)['other_comb_words'].transform(lambda x: ' '.join(x))
# remove duplicates
dftest0 = dftest0.drop_duplicates(subset='other_comb_words')
dftest0
Dataframe sample:
call_ID other_comb_words
5423684
5423684
5423684 inspection
5423684 inspection
Current output:
call_ID other_comb_words
5423684 inspection inspection
Desired output:
call_ID other_comb_words
5423684 inspection
Put the line that drops duplicates above your groupby line as well, but without parameters (I added more data as an example to confirm the expected result):
import pandas as pd
# Dataframe
dftest0 = pd.DataFrame(
    data={'call_ID': [5423684, 5423684, 5423684, 5423684, 5423684, 1234567],
          'other_comb_words': ['', '', 'inspection', 'inspection', 'example_1', 'example_2']})
# Change datatype
dftest0['call_ID'] = dftest0['call_ID'].astype(str)
# remove duplicates
dftest0 = dftest0.drop_duplicates()
# groupby and combine text
dftest0['other_comb_words'] = dftest0.groupby(
    ['call_ID'], as_index=False)['other_comb_words'].transform(lambda x: ' '.join(x))
# remove duplicates from subsets
dftest0 = dftest0.drop_duplicates()
print(dftest0)
Result:
call_ID other_comb_words
0 5423684 inspection example_1
5 1234567 example_2
This works because you need to remove the duplicates from the original dataset first, and then again after the groupby combination.
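As another option, you can skip the transform and drop_duplicates steps entirely and aggregate straight to one row per call_ID, joining only the unique, non-empty words. A sketch using the same dummy data:
import pandas as pd
dftest0 = pd.DataFrame(
    data={'call_ID': [5423684, 5423684, 5423684, 5423684, 5423684, 1234567],
          'other_comb_words': ['', '', 'inspection', 'inspection', 'example_1', 'example_2']})
dftest0['call_ID'] = dftest0['call_ID'].astype(str)
# join only the unique, non-empty words of each group
result = (dftest0.groupby('call_ID', as_index=False)['other_comb_words']
                 .agg(lambda x: ' '.join(sorted(set(x) - {''}))))
print(result)
Note that set() discards the original word order, so the joined words come out sorted alphabetically here.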

How to remove column with number as index name?

I have the following dataframe:
I tried to drop the -1 column by using
df = df.drop(columns=['-1'])
However, it is giving me the following error:
I was able to drop a column whose name was a text character using this same kind of code, but not one whose name is a number. What am I doing wrong?
You can check the real column names by converting them to a list:
print (df.columns.tolist())
I think you need to drop the number -1 instead of the string '-1':
df = df.drop(columns=[-1])
Or another solution with the same output:
df = df.drop(-1, axis=1)
EDIT:
If you need to select all columns except the first, use DataFrame.iloc to select by position: the first : selects all rows, and 1: selects all columns while omitting the first:
df = df.iloc[:, 1:]
If you are just trying to remove the first column, another approach that would be independent of the column name is this:
df = df[df.columns[1:]]
You can do it simply by using the following code. First, check the name of the column:
df.columns
Then, if the output is like:
Index(['-1', '0'], dtype='object')
use the drop command to delete the column:
df.drop(['-1'], axis =1, inplace = True)
I guess this should help in the future as well.
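If you are not sure whether the label is stored as the integer -1 or as the string '-1', a small sketch that handles both is to compare on the string form of each column name:
df = df.drop(columns=[c for c in df.columns if str(c) == '-1'])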

Extracting specific columns from pandas.dataframe

I'm trying to use Python to read my CSV file, extract specific columns into a pandas DataFrame, and show that DataFrame. However, I don't see the DataFrame; I receive Series([], dtype: object) as the output. Below is the code that I'm working with:
My document consists of:
product sub_product issue sub_issue consumer_complaint_narrative
company_public_response company state zipcode tags
consumer_consent_provided submitted_via date_sent_to_company
company_response_to_consumer timely_response consumer_disputed?
complaint_id
I want to extract:
sub_product issue sub_issue consumer_complaint_narrative
import pandas as pd
df=pd.read_csv("C:\\....\\consumer_complaints.csv")
df=df.stack(level=0)
df2 = df.filter(regex='[B-F]')
df[df2]
import pandas as pd
input_file = "C:\\....\\consumer_complaints.csv"
dataset = pd.read_csv(input_file)
df = pd.DataFrame(dataset)
cols = [1,2,3,4]
df = df[df.columns[cols]]
Specify the column numbers you want to select here. In a dataframe, columns start at index 0:
cols = []
You can also select columns by name. Just use the following line:
df = df[["Column Name","Column Name2"]]
A simple way to achieve this would be as follows:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df2 = df.loc[:,'B':'F']
Hope that helps.
This worked for me, using slicing:
df=pd.read_csv
df1=df[n1:n2]
Where n1 < n2 are both columns in the range, e.g.:
if you want columns 3-5, use
df1=df[3:5]
For the first column, use
df1=df[0]
Though I'm not sure how to select a discontinuous range of columns.
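For a discontinuous set of columns, one possible sketch (assuming df is the dataframe read from the CSV) is positional selection with iloc, using an explicit list or numpy.r_ for several ranges at once:
import numpy as np
df_a = df.iloc[:, [0, 1, 3]]          # columns 0, 1 and 3 by position
df_b = df.iloc[:, np.r_[1:3, 5:7]]    # columns 1-2 and 5-6 in one go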
We can also use .iloc. Given data in dataset2:
dataset2.iloc[:3, [1, 2]]
will give the top 3 rows of the second and third columns (remember that positional numbering starts at 0).
