I have a specific problem with pandas: I need to select the rows in a DataFrame whose values start with specific letters.
Details: I've imported my data into a DataFrame and selected the columns I need. I've also narrowed it down to the row index I need. Now I also need to select the rows in another column whose values START with the letters 'pl'.
Is there any way to select rows based only on their first two characters?
I was thinking about
pl = df['Code'] == pl*
but it won't work due to the row indexing. Advice appreciated!
Use startswith for this:
df = df[df['Code'].str.startswith('pl')]
A fully reproducible example for those who want to try it:
import pandas as pd
df = pd.DataFrame([["plusieurs", 1], ["toi", 2], ["plutot", 3]])
df.columns = ["Code", "number"]
df = df[df.Code.str.startswith("pl")] # alternative is df = df[df["Code"].str.startswith("pl")]
If you use a string method on the Series, it returns a true/false result. You can then use that as a filter combined with .loc to create your data subset.
new_df = df.loc[df['Code'].str.startswith('pl')].copy()
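Reusing the toy frame from the answer above, a minimal sketch of that pattern (passing na=False to startswith keeps the mask boolean if 'Code' contains missing values):

mask = df['Code'].str.startswith('pl', na=False)  # boolean Series: True, False, True
new_df = df.loc[mask].copy()  # independent copy of the matching rows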
The condition is just a filter; you then apply it to the DataFrame. As the filter, you can use the method Series.str.startswith:
df_pl = df[df['Code'].str.startswith('pl')]
I have a problem: I want to drop from my DataFrame all rows whose value in a column ends with "99".
I tried to create a list :
filteredvalues = [x for x in df['XX'] if x.endswith('99')]
This list contains all the matching values, but how do I apply it to my DataFrame to drop those rows?
I tried a few things, but nothing works.
Lately I tried this:
df = df[df['XX'] not in filteredvalues]
Any help on this?
Use the .str attribute, with corresponding string methods, to select such items. Then use ~ to negate the result, and filter your dataframe with that:
df = df[~df['XX'].str.endswith('99')]
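A quick runnable sketch with made-up data (the column name 'XX' comes from the question):

import pandas as pd

df = pd.DataFrame({'XX': ['1099', '1210', '2099', '3304']})
df = df[~df['XX'].str.endswith('99')]
print(df)  # keeps only '1210' and '3304'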
Is it possible to drop rows that contain the same strings but in a different order within the same column?
Example: dl3_hr_rank.r0 and hr_dl3_rank.r0
Code for the df before the drop:
import pandas as pd

data = {'item': ['dl3_hr_rank.r0', 'hr_dl3_rank.r0', 'hr_kl3_rank.r0',
                 'kl3_hr_rank.r0', 'hcrfr_hr_rank.r0',
                 'hr_hcrfr_rank.r0', 'hcfr_hkfr_rank.r0_wp', 'hkfr_hcfr_rank.r0_wp',
                 'hr_krl2_rank.r0_wp', 'krl2_hr_rank.r0_wp'],
        'result': [1.17, 1.17, 1.17, 1.17, 1.13, 1.13, 1, 1, 1, 1]}
df = pd.DataFrame(data)
df
Code for the df after the drop:
data = {'item': ['dl3_hr_rank.r0', 'hr_kl3_rank.r0',
                 'hcrfr_hr_rank.r0',
                 'hcfr_hkfr_rank.r0_wp',
                 'hr_krl2_rank.r0_wp'],
        'result': [1.17, 1.17, 1.13, 1, 1]}
df = pd.DataFrame(data)
df
P.S. I'm having trouble inserting tables here, hence the code above.
Many thanks, regards
Try:
df[~df.item.str.split('_').apply(frozenset).duplicated(keep='first')]
Result df:

                   item  result
0        dl3_hr_rank.r0    1.17
2        hr_kl3_rank.r0    1.17
4      hcrfr_hr_rank.r0    1.13
6  hcfr_hkfr_rank.r0_wp    1.00
8    hr_krl2_rank.r0_wp    1.00
Use pandas.Series.str.split to split by '_'.
Use apply(frozenset) to get a hashable set, so that duplicated can be used.
Use pandas.Series.duplicated with keep='first' to keep only the first occurrence of each duplicate set.
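To see the intermediate step on the question's data, here is the duplicated mask (a sketch; the frozensets themselves print in arbitrary order, so only the boolean mask is reproduced):

df.item.str.split('_').apply(frozenset).duplicated(keep='first')
# 0    False
# 1     True
# 2    False
# 3     True
# 4    False
# 5     True
# 6    False
# 7     True
# 8    False
# 9     True
# Name: item, dtype: bool

The ~ in the answer inverts this mask, so the rows marked False are kept.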
I need to sort a pandas DataFrame df by a datetime column my_date. Whenever I use .loc, the sorting does not apply.
df = df.loc[(df.some_column == 'filter'),]
df.sort_values(by=['my_date'])
print(df)
# ...
# Not sorted!
# ...
df = df.loc[(df.some_column == 'filter'),].sort_values(by=['my_date'])
# ...
# sorting WORKS!
What is the difference between these two uses? What am I missing about DataFrames?
In the first case, you didn't perform the operation in place: you should have used either df = df.sort_values(by=['my_date']) or df.sort_values(by=['my_date'], inplace=True).
In the second case, the result of .sort_values() was assigned back to df, hence printing df shows the sorted DataFrame.
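A minimal sketch of the difference, using made-up dates:

import pandas as pd

df = pd.DataFrame({'my_date': pd.to_datetime(['2021-03-01', '2021-01-01', '2021-02-01'])})

df.sort_values(by=['my_date'])       # returns a sorted copy, which is discarded
print(df)                            # still unsorted

df = df.sort_values(by=['my_date'])  # assign the result back
print(df)                            # now sorted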
In the first snippet, df = df.loc[(df.some_column == 'filter'),] is followed by a bare df.sort_values(by=['my_date']): that call returns a sorted copy that is never assigned, so print(df) shows unsorted data.
In the second snippet, you chain the calls, df.loc[...].sort_values(...), and assign the result back to df, which is the correct way. You don't have to write the df. part twice.
I'm trying to use Python to read my CSV file, extract specific columns into a pandas DataFrame, and show that DataFrame. However, I don't see the DataFrame; I receive Series([], dtype: object) as the output. Below is the code that I'm working with:
My document consists of:
product sub_product issue sub_issue consumer_complaint_narrative
company_public_response company state zipcode tags
consumer_consent_provided submitted_via date_sent_to_company
company_response_to_consumer timely_response consumer_disputed?
complaint_id
I want to extract:
sub_product issue sub_issue consumer_complaint_narrative
import pandas as pd
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df = df.stack(level=0)
df2 = df.filter(regex='[B-F]')
df[df2]
import pandas as pd
input_file = "C:\\....\\consumer_complaints.csv"
df = pd.read_csv(input_file)  # read_csv already returns a DataFrame
cols = [1, 2, 3, 4]  # positions of the columns to keep
df = df[df.columns[cols]]
Specify in cols the numbers of the columns you want to select; in a DataFrame, columns start from index 0.
You can also select columns by name. Just use the following line:
df = df[["Column Name","Column Name2"]]
A simple way to achieve this would be as follows:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df2 = df.loc[:,'B':'F']
Hope that helps.
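Note that .loc slices columns by label, so 'B':'F' only works if the columns are literally named that way. With the question's headers (assuming they appear in the order listed), the equivalent label slice would be:

df2 = df.loc[:, 'sub_product':'consumer_complaint_narrative']

Label slices in .loc include both endpoints.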
This worked for me, using positional slicing with iloc:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df1 = df.iloc[:, n1:n2]
where n1 < n2 are the positions of the columns in the range (the end is exclusive), e.g.:
if you want the columns at positions 3 and 4, use
df1 = df.iloc[:, 3:5]
For the first column, use
df1 = df.iloc[:, 0]
Though I'm not sure how to select a discontinuous set of columns this way.
We can also use iloc. Given data in dataset2:
dataset2.iloc[:3, [1, 2]]
This will output the top 3 rows of the second and third columns (remember, numbering starts at 0).
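A small runnable sketch of the same idea (dataset2 here is made-up data, just to show the shape of the result):

import pandas as pd

dataset2 = pd.DataFrame({'a': range(5), 'b': range(5, 10), 'c': range(10, 15)})
print(dataset2.iloc[:3, [1, 2]])  # first 3 rows, columns 'b' and 'c'
#    b   c
# 0  5  10
# 1  6  11
# 2  7  12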
This is really trivial, but I can't believe I've wandered around for an hour and still can't find the answer, so here you are:
df = pd.DataFrame({"cats":["a","b"], "vals":[1,2]})
df.cats = df.cats.astype("category")
df
My problem is how to select the rows whose "cats" column has the category "a". I know that df.loc[df.cats == "a"] will work, but that's based on element-wise equality. Is there a way to select based on the levels of the category?
This works:
df.cats[df.cats=='a']
UPDATE
The question was updated. New solution:
df[df.cats.cat.codes == 0]  # 0 is the integer code of the first category level, 'a'
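On the toy frame from the question, this selects the row whose category is 'a':

df[df.cats.cat.codes == 0]
#   cats  vals
# 0    a     1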
For those who are trying to filter rows based on a numerical categorical column:
df[df['col'] == pd.Interval(46, 53, closed='right')]
This would keep the rows where the col column has category (46, 53].
This kind of categorical column is common when you discretize numerical columns using the pd.qcut() method.
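A runnable sketch of that situation (the bin edges below are made up, and pd.cut is used instead of pd.qcut so the interval endpoints are known in advance):

import pandas as pd

df = pd.DataFrame({'col': [45, 47, 50, 53, 60]})
df['col'] = pd.cut(df['col'], bins=[40, 46, 53, 60])  # categories: (40, 46], (46, 53], (53, 60]
print(df[df['col'] == pd.Interval(46, 53, closed='right')])  # keeps the rows with 47, 50, 53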
You can query the category list using df.cats.cat.categories, which prints the output as
Index(['a', 'b'], dtype='object')
For this case, to select rows with the category 'a', which is df.cats.cat.categories[0], you just use:
df[df.cats == df.cats.cat.categories[0]]
Using the isin function to create a boolean index is an approach that will extend to multiple categories, similar to R's %in% operator.
# will return desired subset
df[df.cats.isin(['a'])]
# can be extended to multiple categories
df[df.cats.isin(['a', 'b'])]
df[df.cats == df.cats.cat.categories[0]]