Selecting all values greater than a number in a pandas DataFrame - python

I have a DataFrame with more than 50 year columns (one per year from 1963 to 2016). I want to select all countries whose population exceeds a certain number (say 60 million). All the questions I found were about picking values from a single column, which is not the case here. I also tried
df[df.T[(df.T > 0.33)].any()] as was suggested in an answer, but it doesn't work. Any ideas?
The data frame looks like this:
Country Country_Code Year_1979 Year_1999 Year_2013
Aruba ABW 59980.0 89005 103187.0
Angola AGO 8641521.0 15949766 25998340.0
Albania ALB 2617832.0 3108778 2895092.0
Andorra AND 34818.0 64370 80788.0

First select only the columns with Year in their names using DataFrame.filter, compare every value against the threshold, and then use DataFrame.any with axis=1 to keep the rows that have at least one matching value:
df1 = df[(df.filter(like='Year') > 2000000).any(axis=1)]
print (df1)
Country Country_Code Year_1979 Year_1999 Year_2013
1 Angola AGO 8641521.0 15949766 25998340.0
2 Albania ALB 2617832.0 3108778 2895092.0
Or compare all columns except the first two, selected by position with DataFrame.iloc:
df1 = df[(df.iloc[:, 2:] > 2000000).any(axis=1)]
print (df1)
Country Country_Code Year_1979 Year_1999 Year_2013
1 Angola AGO 8641521.0 15949766 25998340.0
2 Albania ALB 2617832.0 3108778 2895092.0
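As a self-contained sketch of the same idea, the snippet below rebuilds the four sample rows from the question and contrasts any(axis=1) with all(axis=1). The threshold of 3,000,000 is an arbitrary value chosen here only so that the two masks give different results; it is not from the question.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Country": ["Aruba", "Angola", "Albania", "Andorra"],
    "Country_Code": ["ABW", "AGO", "ALB", "AND"],
    "Year_1979": [59980.0, 8641521.0, 2617832.0, 34818.0],
    "Year_1999": [89005, 15949766, 3108778, 64370],
    "Year_2013": [103187.0, 25998340.0, 2895092.0, 80788.0],
})

threshold = 3_000_000                  # arbitrary cutoff for illustration
year_cols = df.filter(like="Year")     # keep only the Year_* columns
mask = year_cols > threshold           # element-wise boolean DataFrame

any_year = df[mask.any(axis=1)]        # above the cutoff in at least one year
every_year = df[mask.all(axis=1)]      # above the cutoff in every year

print(any_year["Country"].tolist())
print(every_year["Country"].tolist())
```

Swapping any for all is the usual way to move from "ever exceeded the cutoff" to "always exceeded the cutoff".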

Related

Extract Data from DF into a new DF

I am not confident you can see the image. I am a student, last class before graduation, thought python would be fun. Stuck on an issue.
I have a dataframe called final_hgun_frame_raw that successfully lists every state plus DC, in alphabetical order. There is an index column that runs from 0 to 51. The column headings are STATE, 2010, 2011 ... 2019.
The table shows, for example, that index 0 is AL and under column 2010 there is a value 2.44, 2011 there is a value 2.72, etc. For every year and for every state is a value.
My assignment is to create another data frame with 4 columns: Index, State, Year and Value
I have created a null dataframe with STATE, YEAR and VALUE
I know that I should use .tolist and .append, but I am having trouble starting. The output should look something like:
State Year Value
AL 2010 2.44
AL 2011 2.72
Each row (state) plus each year (Year) plus each value (value) should not be its own table.
There should be a table that is 4 columns x 510 rows
How do I extract that information?
You can use pd.melt for this:
import pandas as pd
data = [{'State':'AL', 2010:2.44, 2011:2.72, 2012:3.68}, {'State':'AK', 2010:3.60, 2011:3.93, 2012:4.91}]
df = pd.DataFrame(data)
df = pd.melt(df, id_vars=['State'], var_name='Year', value_name='Value').sort_values(by=['State'])
Output:
State Year Value
1 AK 2010 3.60
3 AK 2011 3.93
5 AK 2012 4.91
0 AL 2010 2.44
2 AL 2011 2.72
4 AL 2012 3.68
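A possible variation, sketched under the assumption that the real frame has a STATE column and one column per year: sorting by both state and year and resetting the index gives the Index, State, Year, Value layout the assignment asks for. The name wide is a hypothetical stand-in for final_hgun_frame_raw, and only the AL and AK values shown above are used.

```python
import pandas as pd

# Hypothetical stand-in for final_hgun_frame_raw (two states, three years).
wide = pd.DataFrame({
    "STATE": ["AL", "AK"],
    "2010": [2.44, 3.60],
    "2011": [2.72, 3.93],
    "2012": [3.68, 4.91],
})

tidy = (
    wide.melt(id_vars="STATE", var_name="YEAR", value_name="VALUE")
        .sort_values(["STATE", "YEAR"])   # group each state's years together
        .reset_index(drop=True)           # fresh 0..n-1 index for the new frame
)

print(tidy)
```

With 51 rows and 10 year columns, the same call would produce the 510 rows mentioned in the question.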

Calculating new rows in a Pandas Dataframe on two different columns

So I'm a beginner at Python and I have a dataframe with Country, avgTemp and year.
What I want to do is calculate new rows on each country where the year adds 20 and avgTemp is multiplied by a variable called tempChange. I don't want to remove the previous values though, I just want to append the new values.
This is how the dataframe looks:
Preferably I would also want to create a loop that runs the code a certain number of times
Super grateful for any help!
If you need to copy the values from the dataframe as an example you can have it here:
Country avgTemp year
0 Afghanistan 14.481583 2012
1 Africa 24.725917 2012
2 Albania 13.768250 2012
3 Algeria 23.954833 2012
4 American Samoa 27.201417 2012
243 rows × 3 columns
If you want to repeat the rows, I'd create a new dataframe, perform any operation on it (add 20 years, multiply the temperature by a constant or an array, etc.) and then use concat() to append it to the original dataframe:
import pandas as pd
tempChange=1.15
data = {'Country':['Afghanistan','Africa','Albania','Algeria','American Samoa'],'avgTemp':[14,24,13,23,27],'Year':[2012,2012,2012,2012,2012]}
df = pd.DataFrame(data)
df_2 = df.copy()
df_2['avgTemp'] = df['avgTemp']*tempChange
df_2['Year'] = df['Year']+20
df = pd.concat([df,df_2]) #ignore_index=True if you wish to not repeat the index value
print(df)
Output:
Country avgTemp Year
0 Afghanistan 14.00 2012
1 Africa 24.00 2012
2 Albania 13.00 2012
3 Algeria 23.00 2012
4 American Samoa 27.00 2012
0 Afghanistan 16.10 2032
1 Africa 27.60 2032
2 Albania 14.95 2032
3 Algeria 26.45 2032
4 American Samoa 31.05 2032
where df is your data frame name:
df['tempChange'] = df['year']+ 20 * df['avgTemp']
This will add a new column to your df with the logic above. I'm not sure I understood your logic correctly, so the math may need some work.
I believe that what you're looking for is
dfName['newYear'] = dfName.apply(lambda x: x['year'] + 20,axis=1)
dfName['tempDiff'] = dfName.apply(lambda x: x['avgTemp']*tempChange,axis=1)
This is how you apply to each row.
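The question also asks for a loop that repeats the step a given number of times. One way to sketch that is to collect each projected block in a list and concatenate once at the end. This assumes that every step compounds on the previous one (each block is derived from the block before it); n_steps and the two sample rows are illustrative choices, not values from the question.

```python
import pandas as pd

temp_change = 1.15   # multiplier, same value as in the concat answer above
n_steps = 3          # hypothetical number of 20-year projections

df = pd.DataFrame({
    "Country": ["Afghanistan", "Africa"],
    "avgTemp": [14.0, 24.0],
    "year": [2012, 2012],
})

frames = [df]
for _ in range(n_steps):
    nxt = frames[-1].copy()                        # start from the latest block
    nxt["year"] = nxt["year"] + 20                 # move 20 years forward
    nxt["avgTemp"] = nxt["avgTemp"] * temp_change  # scale the temperature
    frames.append(nxt)

# Concatenate once at the end instead of growing the frame inside the loop.
result = pd.concat(frames, ignore_index=True)
print(result)
```

If each projection should instead be computed from the original 2012 rows, derive nxt from df and scale by temp_change raised to the step number.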

Adding a subindex to merged dataframes

I have 3 dataframes each with the same columns (years) and same indexes (countries).
Now I want to merge these 3 dataframes. But since all have the same columns it is appending those.
So I'd like to keep the country index and add a subindex for each dataframe, because each one represents different numbers for each year.
#dataframe 1
#CO2:
2005 2010 2015 2020
country
Afghanistan 169405 210161 259855 319447
Albania 762 940 1154 1408
Algeria 158336 215865 294768 400126
#dataframe 2
#Arrivals + Departures:
2005 2010 2015 2020
country
Afghanistan 977896 1326120 1794547 2414943
Albania 103132 154219 224308 319440
Algeria 3775374 5307448 7389427 10159656
#data frame 3
#Travel distance in km:
2005 2010 2015 2020
country
Afghanistan 9330447004 12529259781 16776152792 22337458954
Albania 63159063 82810491 107799357 139543748
Algeria 12254674181 17776784271 25782632480 37150057977
The result should be something like:
2005 2010 2015 2020
country
Afghanistan co2 169405 210161 259855 319447
flights 977896 1326120 1794547 2414943
traveldistance 9330447004 12529259781 16776152792 22337458954
Albania ....
How can I do this?
NOTE: The years are an input so these are not fixed. They could just be 2005,2010 for example.
Thanks in advance.
I solved this with concat and groupby, using your dataset.
First concat the 3 dfs, passing keys so that each one is labelled:
l=[df,df2,df3]
f=pd.concat(l,keys= ['CO2','Flights','traveldistance'],axis=0,).reset_index().rename(columns={'level_0':'Category'})
Then use groupby to get the values:
result_df=f.groupby(['country', 'Category'])[f.columns[2:]].first()
Hope this solves your problem.
Output looks like this
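An alternative sketch that skips the groupby step: concat with keys already builds a two-level index, so swapping the levels and sorting puts country first and the category second. The frames below use the first two countries and two of the years from the question; the variable names co2, flights and distance are made up for this example.

```python
import pandas as pd

years = [2005, 2010]   # the years are an input; two are used here
countries = pd.Index(["Afghanistan", "Albania"], name="country")

co2 = pd.DataFrame([[169405, 210161], [762, 940]],
                   index=countries, columns=years)
flights = pd.DataFrame([[977896, 1326120], [103132, 154219]],
                       index=countries, columns=years)
distance = pd.DataFrame([[9330447004, 12529259781], [63159063, 82810491]],
                        index=countries, columns=years)

combined = (
    pd.concat([co2, flights, distance],
              keys=["co2", "flights", "traveldistance"],
              names=["category", "country"])
      .swaplevel()     # index becomes (country, category)
      .sort_index()    # sorts both levels alphabetically
)

print(combined)
```

Because the year columns are never named in the code, the same call works for whatever years are passed in.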

How to output the top 5 of a specific column along with associated columns using python?

I've tried df2.nlargest(5, ['1960']), which gives me:
Country Name Country Code ... 2017 2018
0 IDA & IBRD total IBT ... 6335039629.0000 6412522234.0000
1 Low & middle income LMY ... 6306560891.0000 6383958209.0000
2 Middle income MIC ... 5619111361.0000 5678540888.0000
3 IBRD only IBD ... 4731120193.0000 4772284113.0000
6 Upper middle income UMC ... 2637690770.0000 2655635719.0000
This is somewhat right, but it's outputting all the columns. I just want it to include the column name "Country Name" and "1960" only, but sort by the column "1960."
So the output should look like this...
Country Name 1960
China 5000000000
India 499999999
USA 300000
France 100000
Germany 90000
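One way to get only those two columns is to select them before (or after) calling nlargest. The sketch below uses made-up placeholder rows; only the column names "Country Name", "Country Code" and "1960" follow the question.

```python
import pandas as pd

# Placeholder data for illustration only; the values are not real figures.
df2 = pd.DataFrame({
    "Country Name": ["A", "B", "C", "D", "E", "F"],
    "Country Code": ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"],
    "1960": [10, 60, 30, 50, 20, 40],
    "2018": [11, 61, 31, 51, 21, 41],
})

# Keep just the two columns of interest, then take the five largest by "1960".
top5 = df2[["Country Name", "1960"]].nlargest(5, "1960")

print(top5.to_string(index=False))
```

nlargest returns the rows already sorted in descending order of the chosen column, so no separate sort is needed.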

Pandas: Delete rows of a DataFrame if total count of a particular column occurs only 1 time

I'm looking to delete rows of a DataFrame if total count of a particular column occurs only 1 time
Example of raw table (values are arbitrary for illustrative purposes):
print(df)
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
9 Bolivia #McDonalds 3456
10 Kenya #Schools 3455
11 Ukraine #Cars 3456
12 US #Tshirts 3456789
Intended outcome:
print(df)
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
I know that df.Series.value_counts()>1 will identify which df.Series occur more than 1 time; and that the code returned will look something like the following:
Population True
GDP True
#McDonalds False
#Schools False
#Cars False
#Tshirts False
I want to write something like the following so that my new DataFrame drops column values from df.Series that occur only 1 time, but this doesn't work:
df.drop(df.Series.value_counts()==1,axis=1,inplace=True)
You can do this by creating a boolean list/array by either list comprehensions or using DataFrame's string manipulation methods.
The list comprehension approach is:
vc = df['Series'].value_counts()
u = [i not in set(vc[vc==1].index) for i in df['Series']]
df = df[u]
The other approach is to use the str.contains method to check whether the values of the Series column contain a given string or match a given regular expression (used in this case as we are using multiple strings):
vc = df['Series'].value_counts()
pat = r'|'.join(vc[vc==1].index) #Regular expression
df = df[~df['Series'].str.contains(pat)] #Tilde is to negate boolean
Using this regular expressions approach is a bit more hackish and may require some extra processing (character escaping, etc) on pat in case you have regex metacharacters in the strings you want to filter out (which requires some basic regex knowledge). However, it's worth noting this approach is about 4x faster than using the list comprehension approach (tested on the data provided in the question).
As a side note, I recommend avoiding using the word Series as a column name as that's the name of a pandas object.
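The escaping caveat above can be handled with re.escape from the standard library, and Series.isin offers an exact-match route that avoids regular expressions altogether. The sketch below uses a shortened version of the question's data and is an illustration, not a benchmark of either approach.

```python
import re

import pandas as pd

# Shortened version of the table in the question.
df = pd.DataFrame({
    "Country": ["Bolivia", "Kenya", "Bolivia", "Kenya", "Bolivia", "Kenya"],
    "Series": ["Population", "Population", "GDP", "GDP",
               "#McDonalds", "#Schools"],
})

vc = df["Series"].value_counts()
singles = vc[vc == 1].index          # labels that occur exactly once

# Regex route: escape each label so metacharacters are matched literally.
pat = "|".join(re.escape(s) for s in singles)
kept_regex = df[~df["Series"].str.contains(pat)]

# Exact-match route: no regex, so no escaping is required.
kept_isin = df[~df["Series"].isin(singles)]

print(kept_isin)
```

Note that str.contains matches substrings, so a singleton label that happens to be contained in a longer, repeated label would remove those rows too; isin compares whole values only.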
This is an old question, but the current answer does not scale to even moderately large dataframes. A much faster and more "dataframe" way is to add a value-count column and filter on that count.
Create the dataset:
df = pd.DataFrame({'Country': 'Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US'.split(),
'Series': 'Pop Pop Pop Pop GDP GDP GDP GDP McDonalds Schools Cars Tshirts'.split()})
Drop rows whose value occurs only once (count == 1) in the column ('Series' in this case):
# Group values for Series and add 'cnt' column with count
df['cnt'] = df.groupby(['Series'])['Country'].transform('count')
# Drop indexes for count value == 1, and dropping 'cnt' column
df.drop(df[df.cnt==1].index)[['Country','Series']]
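The same filter can also be written as a boolean mask, which avoids adding and then removing a helper column. This is a sketch on the dataset created above; filtered and filtered_dup are names chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": "Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US "
               "Bolivia Kenya Ukraine US".split(),
    "Series": "Pop Pop Pop Pop GDP GDP GDP GDP "
              "McDonalds Schools Cars Tshirts".split(),
})

# Size of each row's 'Series' group, aligned with the original rows.
counts = df.groupby("Series")["Country"].transform("size")
filtered = df[counts > 1]

# Equivalent shortcut: keep rows whose 'Series' value appears more than once.
filtered_dup = df[df["Series"].duplicated(keep=False)]

print(filtered)
```

Both variants leave df unchanged and return a new frame, so the result has to be assigned to a name to be kept.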
