Select by Column Values - python

I know this is possible in arcpy; I'm trying to find out whether it can be done in pandas.
I have the following data:
data = {'Species': ['P.PIN','P.PIN','V.FOG','V.KOP','E.MON','E.CLA','E.KLI','D.FGH','W.ERT','S.MIX','P.PIN'],
        'FY': ['2002','2016','2018','2010','2009','2019','2017','2016','2018','2018','2016']}
I need to select all the P.PIN, P.RAD and any other species starting with E that have an FY equal to or older than 2016, and put them into a new dataframe.
How can I get this done? All I am able to do is select P.PIN and P.RAD; I'm stuck on adding in all the others starting with E:
df3 = df[(df['FY'] >= 2016) & (df1['LastSpecies'].isin(['P.PIN','P.RAD']))]
Your help will be highly appreciated.

Here's a step-by-step way. You could also combine all the logic inside a single np.where(); I just want to show each condition separately.
Start by typecasting your df['FY'] values as int so we can use the greater-than (>) operator.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(data)
>>> df['FY'] = df['FY'].astype(int)
>>> df['flag'] = np.where(df['Species'].isin(['P.PIN', 'P.RAD']), 'Take', 'Remove')
>>> df
Species FY flag
0 P.PIN 2002 Take
1 P.PIN 2016 Take
2 V.FOG 2018 Remove
3 V.KOP 2010 Remove
4 E.MON 2009 Remove
5 E.CLA 2019 Remove
6 E.KLI 2017 Remove
7 D.FGH 2016 Remove
8 W.ERT 2018 Remove
9 S.MIX 2018 Remove
10 P.PIN 2016 Take
>>> df['flag'] = np.where((df['FY'] > 2016) & (df['Species'].str.startswith('E')), 'Take', df['flag'])
>>> df
Species FY flag
0 P.PIN 2002 Take
1 P.PIN 2016 Take
2 V.FOG 2018 Remove
3 V.KOP 2010 Remove
4 E.MON 2009 Remove
5 E.CLA 2019 Take
6 E.KLI 2017 Take
7 D.FGH 2016 Remove
8 W.ERT 2018 Remove
9 S.MIX 2018 Remove
10 P.PIN 2016 Take
>>> new_df = df[df['flag'].isin(['Take'])][['Species', 'FY']]
>>> new_df
Species FY
0 P.PIN 2002
1 P.PIN 2016
5 E.CLA 2019
6 E.KLI 2017
10 P.PIN 2016
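If you'd rather skip the intermediate flag column, the same selection fits in a single boolean mask (a minimal sketch; it assumes df['FY'] has already been cast to int as above):
>>> mask = df['Species'].isin(['P.PIN', 'P.RAD']) | ((df['FY'] > 2016) & df['Species'].str.startswith('E'))
>>> new_df = df.loc[mask, ['Species', 'FY']]
This returns the same rows as the flag-based version, without the extra column.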
Hope this helps :D

Related

How to fill dataframe's empty/nan cell with conditional column mean

I am trying to fill the (pandas) dataframe's null/empty value using the mean of that specific column.
The data looks like this:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019
.
.
I am trying to fill those empty cells with the mean of the Revenue column where Industry == 'Construction'.
To get our numerical mean value I did:
df.groupby(['Industry'], as_index = False).mean()
I am trying to do something like this to fill up that empty cell in-place:
(df[df['Industry'] == "Construction"]['Revenue']).fillna("$21212121.01", inplace = True)
...but it is not working. Can anyone tell me how to achieve this? Thanks a lot.
Expected Output:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013 $21212121.01
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019 $21212121.01
.
.
The two fills below use different numbers because there are two ways to compute the average: the normal average (NaN rows excluded) and an average whose denominator also counts the rows with NaN.
df['Revenue'] = df['Revenue'].replace({r'\$': '', ',': ''}, regex=True)
df['Revenue'] = df['Revenue'].astype(float)
df_mean = df.groupby(['Industry'], as_index = False)['Revenue'].mean()
df_mean
Industry Revenue
0 Construction 4.358071e+06
1 Financial Services 8.858420e+06
2 IT Services 1.175702e+07
df_mean_nan = df.groupby(['Industry'], as_index=False)['Revenue'].agg(Sum='sum', Size='size')
df_mean_nan['Mean_nan'] = df_mean_nan['Sum'] / df_mean_nan['Size']
df_mean_nan
Industry Sum Size Mean_nan
0 Construction 13074212.0 5.0 2614842.4
1 Financial Services 17716840.0 2.0 8858420.0
2 IT Services 11757018.0 1.0 11757018.0
Average that counts the NaN rows in the denominator:
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean_nan.loc[df_mean_nan['Industry'] == 'Construction',['Mean_nan']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5387469.0
1 2 Rednimdox Construction 2013 2614842.4
2 3 Lamtone IT Services 2009 11757018.0
3 4 Stripfind Financial Services 2010 12329371.0
4 5 Openjocon Construction 2013 4273207.0
5 6 Villadox Construction 2012 1097353.0
6 7 Sumzoomit Construction 2010 7703652.0
7 8 Abcddd Construction 2019 2614842.4
Normal average (NaN excluded):
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean.loc[df_mean['Industry'] == 'Construction',['Revenue']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5.387469e+06
1 2 Rednimdox Construction 2013 4.358071e+06
2 3 Lamtone IT Services 2009 1.175702e+07
3 4 Stripfind Financial Services 2010 1.232937e+07
4 5 Openjocon Construction 2013 4.273207e+06
5 6 Villadox Construction 2012 1.097353e+06
6 7 Sumzoomit Construction 2010 7.703652e+06
7 8 Abcddd Construction 2019 4.358071e+06
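A more general alternative (my own sketch, not part of the original answer): groupby().transform('mean') fills each industry's missing Revenue with that industry's own mean, so 'Construction' doesn't have to be hard-coded.
# Fill every NaN with the "normal average" of its own industry (NaN excluded).
df['Revenue'] = df['Revenue'].fillna(df.groupby('Industry')['Revenue'].transform('mean'))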

Python - Extract multiple values from string in pandas df

I've searched for an answer to the following question but haven't found one yet. I have a large dataset like this small example:
df =
A B
1 I bought 3 apples in 2013
3 I went to the store in 2020 and got milk
1 In 2015 and 2019 I went on holiday to Spain
2 When I was 17, in 2014 I got a new car
3 I got my present in 2018 and it broke down in 2019
What I would like is to extract all the year values > 1950 and have this as the end result:
A B C
1 I bought 3 apples in 2013 2013
3 I went to the store in 2020 and got milk 2020
1 In 2015 and 2019 I went on holiday to Spain 2015_2019
2 When I was 17, in 2014 I got a new car 2014
3 I got my present in 2018 and it broke down in 2019 2018_2019
I tried to extract values first, but didn't get further than:
df["C"] = df["B"].str.extract('(\d+)').astype(int)
df["C"] = df["B"].apply(lambda x: re.search(r'\d+', x).group())
But all I get are error messages (I only started Python and working with text a few weeks ago). Could someone help me?
With a single regex pattern (considering your comment that you "need the year it took place"):
In [268]: pat = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})')
In [269]: df['C'] = df['B'].apply(lambda x: '_'.join(pat.findall(x)))
In [270]: df
Out[270]:
A B C
0 1 I bought 3 apples in 2013 2013
1 3 I went to the store in 2020 and got milk 2020
2 1 In 2015 and 2019 I went on holiday to Spain 2015_2019
3 2 When I was 17, in 2014 I got a new car 2014
4 3 I got my present in 2018 and it broke down in ... 2018_2019
Here's one way using str.findall and joining those items from the resulting lists that are greater than 1950:
s = df["B"].str.findall(r'\d+')
df['C'] = s.apply(lambda x: '_'.join(i for i in x if int(i)> 1950))
A B C
0 1 I bought 3 apples in 2013 2013
1 3 I went to the store in 2020 and got milk 2020
2 1 In 2015 and 2019 I went on holiday to Spain 2015_2019
3 2 When I was 17, in 2014 I got a new car 2014
4 3 I got my present in 2018 and it broke down in ... 2018_2019
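You can also stay fully vectorized and skip apply by combining str.findall with str.join (a sketch reusing the compiled pattern pat from the first answer):
df['C'] = df['B'].str.findall(pat).str.join('_')
str.findall accepts a compiled regex, and str.join concatenates each row's list of matches with the separator.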

pandas groupby sum if count equals condition

I'm downloading data from FRED. I'm summing to get annual numbers, but I don't want incomplete years, so I need to sum only when the count of observations in a year is 12, because the series is monthly.
import pandas_datareader.data as web
mnemonic = 'RSFSXMV'
df = web.DataReader(mnemonic, 'fred', 2000, 2020)
df['year'] = df.index.year
new_df = df.groupby(["year"])[mnemonic].sum().reset_index()
print(new_df)
I don't want 2019 to show up.
In your case, we use transform with nunique to make sure each year has 12 unique months; if not, we drop those rows before doing the groupby sum.
df['Month'] = df.index.month
m = df.groupby('year').Month.transform('nunique') == 12
new_df = df.loc[m].groupby(["year"])[mnemonic].sum().reset_index()
Or with isin:
df['Month'] = df.index.month
m = df.groupby('year').Month.nunique()
new_df = df.loc[df.year.isin(m.index[m == 12])].groupby(["year"])[mnemonic].sum().reset_index()
You could use the aggregate function count while doing the groupby:
df['year'] = df.index.year
df = df.groupby('year').agg({'RSFSXMV': 'sum', 'year': 'count'})
which will give you:
RSFSXMV year
year
2000 2487790 12
2001 2563218 12
2002 2641870 12
2003 2770397 12
2004 2969282 12
2005 3196141 12
2006 3397323 12
2007 3531906 12
2008 3601512 12
2009 3393753 12
2010 3541327 12
2011 3784014 12
2012 3934506 12
2013 4043037 12
2014 4191342 12
2015 4252113 12
2016 4357528 12
2017 4561833 12
2018 4810502 12
2019 2042147 5
Then simply drop the rows with a year count less than 12, for example:
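A minimal sketch of that last step (using the column names produced by the agg above, where the year column now holds the month count):
df = df[df['year'] == 12]
Only the complete years remain.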

iterate over pandas dataframe and create another dataframe with repetitive records

I have a dataframe act with columns ['ids','start-yr','end-yr'].
I want to create another dataframe timeline with columns ['ids','years'], using the act df. So if act has fields like this:
ids start-yr end-yr
--------------------------------
'IAs728-ahe83j' 2014 2016
'J8273nbajsu-193h' 2012 2018
I want the timeline df to be populated like this:
ids years
------------------------
'IAs728-ahe83j' 2014
'IAs728-ahe83j' 2015
'IAs728-ahe83j' 2016
'J8273nbajsu-193h' 2012
'J8273nbajsu-193h' 2013
'J8273nbajsu-193h' 2014
'J8273nbajsu-193h' 2015
'J8273nbajsu-193h' 2016
'J8273nbajsu-193h' 2017
'J8273nbajsu-193h' 2018
My attempt so far:
timeline = pd.DataFrame(columns=['ids','years'])
cnt = 0
for ix, row in act.iterrows():
    for yr in range(int(row['start-yr']), int(row['end-yr']) + 1):
        timeline.loc[cnt, 'ids'] = row['ids']
        timeline.loc[cnt, 'years'] = yr
        cnt += 1
But this is a very costly, time-consuming operation (which is obvious, I know). So what is the best pythonic approach to populate a pandas df in a situation like this?
Any help is appreciated, thanks.
Use a list comprehension with range to build a list of tuples, then pass it to the DataFrame constructor:
a = [(i, x) for i, a, b in df.values for x in range(a, b + 1)]
df = pd.DataFrame(a, columns=['ids','years'])
print (df)
ids years
0 'IAs728-ahe83j' 2014
1 'IAs728-ahe83j' 2015
2 'IAs728-ahe83j' 2016
3 'J8273nbajsu-193h' 2012
4 'J8273nbajsu-193h' 2013
5 'J8273nbajsu-193h' 2014
6 'J8273nbajsu-193h' 2015
7 'J8273nbajsu-193h' 2016
8 'J8273nbajsu-193h' 2017
9 'J8273nbajsu-193h' 2018
If the DataFrame has more columns than these, filter them down with a list first:
c = ['ids','start-yr','end-yr']
a = [(i, x) for i, a, b in df[c].values for x in range(a, b + 1)]
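On pandas 0.25 or later, explode offers another route (my own sketch, not part of the original answer; it assumes the year columns are already integers):
# Turn each row's year span into a list, then unpack it into one row per year.
act['years'] = [list(range(a, b + 1)) for a, b in zip(act['start-yr'], act['end-yr'])]
timeline = act.explode('years')[['ids', 'years']].reset_index(drop=True)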

Column name in pandas dataframe resulting from groupby

I have the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'clif_cod': [1, 2, 3, 3, 4, 4, 4],
                   'peds_val_fat': [10.2, 15.2, 30.9, 14.8, 10.99, 39.9, 54.9],
                   'mes': [1, 2, 4, 5, 5, 6, 12],
                   'ano': [2016, 2016, 2016, 2016, 2016, 2016, 2016]})
vetor_valores = df.groupby(['mes','clif_cod']).sum()
which yields this output:
ano peds_val_fat
mes clif_cod
1 1 2016 10.20
2 2 2016 15.20
4 3 2016 30.90
5 3 2016 14.80
4 2016 10.99
6 4 2016 39.90
12 4 2016 54.90
How do I select rows based on mes and clif_cod?
When I do list(vetor_valores) I only get ano and peds_val_fat.
IIUC, you can just pass the argument as_index=False to your groupby. You can then access the result as you would any other dataframe:
vetor_valores = df.groupby(['mes','clif_cod'], as_index=False).sum()
>>> vetor_valores
mes clif_cod ano peds_val_fat
0 1 1 2016 10.20
1 2 2 2016 15.20
2 4 3 2016 30.90
3 5 3 2016 14.80
4 5 4 2016 10.99
5 6 4 2016 39.90
6 12 4 2016 54.90
To access values, you can now use iloc or loc as you would any dataframe:
# Select first row:
vetor_valores.iloc[0]
...
Alternatively, if you've already created your groupby and don't want to go back and re-make it, you can reset the index; the result is identical.
vetor_valores.reset_index()
By using pd.IndexSlice:
vetor_valores.loc[[pd.IndexSlice[1,1]],:]
Out[272]:
ano peds_val_fat
mes clif_cod
1 1 2016 10.2
You've got a dataframe with a two-level MultiIndex. Use both values to access rows, e.g., vetor_valores.loc[(4,3)].
Use the axis parameter in .loc:
vetor_valores.loc(axis=0)[1,:]
Output:
ano peds_val_fat
mes clif_cod
1 1 2016 10.2
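And once mes and clif_cod are plain columns (the as_index=False version above), an ordinary boolean mask works as well:
vetor_valores[(vetor_valores['mes'] == 5) & (vetor_valores['clif_cod'] == 3)]
This returns the single row for mes 5, clif_cod 3.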
