Search for variable name using iloc function in pandas dataframe - python

I have a pandas dataframe that consist of 5000 rows with different countries and emission data, and looks like the following:
country    year    emissions
peru       2020    1000
           2019    900
           2018    800
The country label is an index.
e.g. df = emissions.loc[['peru']]
would give me a new dataframe consisting only of the emission data attached to Peru.
My goal is to use a variable name instead of 'peru' and store the country-specific emission data into a new dataframe.
What I am searching for is code that would work the same way as the code below:
country = 'zanzibar'
df = emissions.loc[[country]]
From what I can tell, the problem arises with the .loc indexer, which does not seem to accept a variable as input. Is there a way I could circumvent this problem?
In other words, I want to be able to create a new dataframe with country-specific emission data, based on a variable that matches one of the countries in my emissions index, all without having to change anything but the given variable.
One way could be to iterate through or maybe create a function in some way?
Thank you in advance for any help.
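For reference, the lookup described above already works with a plain variable; here is a minimal, self-contained sketch (the emission figures are invented) of a small helper function built around .loc:

```python
import pandas as pd

# Toy frame mimicking the question's layout, with 'country' as the index
emissions = pd.DataFrame(
    {'year': [2020, 2019, 2018, 2020],
     'emissions': [1000, 900, 800, 500]},
    index=pd.Index(['peru', 'peru', 'peru', 'zanzibar'], name='country'))

def emissions_for(country):
    # .loc accepts any label, including one held in a variable;
    # the double brackets keep the result a DataFrame
    return emissions.loc[[country]]

df = emissions_for('peru')
```

Calling emissions_for('zanzibar') instead returns that country's rows, with nothing changed but the argument.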

An alternative approach where you don't use a country name for your index:
emissions = pd.DataFrame({'Country': ['Peru', 'Peru', 'Peru', 'Chile', 'Chile', 'Chile'],
                          'Year': [2021, 2020, 2019, 2021, 2020, 2019],
                          'Emissions': [100, 200, 400, 300, 200, 100]})
country = 'Peru'
Then to filter:
df = emissions[emissions.Country == country]
or
df = emissions.loc[emissions.Country == country]
Giving:
Country Year Emissions
0 Peru 2021 100
1 Peru 2020 200
2 Peru 2019 400

You should be able to select by a certain string for your index. For example:
df = pd.DataFrame({'a':[1,2,3,4]}, index=['Peru','Peru','zanzibar','zanzibar'])
country = 'zanzibar'
df.loc[country]
This will return:
          a
zanzibar  3
zanzibar  4
In your case, removing one set of square brackets should work:
country = 'zanzibar'
df = emissions.loc[country]

I don't know if this is exactly what your question asks for; in this case I will give a solution that turns each country name into a variable.
Because a variable name can't contain a space (" ") character, you have to replace each space with an underscore ("_") character
(just in case some of your 'country' values are country names of more than one word).
Example:
United Kingdom becomes United_Kingdom
by using this code:
df['country'] = df['country'].replace(' ', '_', regex=True)
Once the country names are in the new format, you can get all the unique country names from the dataframe with .unique() and store them in a new variable:
country_name = df['country'].unique()
After running that code, every unique value in the 'country' column is stored in an array called 'country_name'.
Next, use a for loop to generate a new variable for each country name:
for i in country_name:
    locals()[i] = df[df['country'] == i]
Here locals() is used to turn each string in 'country_name' into a variable name, and df[df['country'] == i] subsets the dataframe to the rows whose country equals that name.
After that loop, a new variable exists for each country name in the 'country' column.
Hopefully this can help to solve your problem.
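A common alternative to creating variables with locals() is a dictionary of dataframes keyed by country name; this sketch (with invented data) uses groupby to do the same split without dynamic variable names:

```python
import pandas as pd

df = pd.DataFrame({'country': ['Peru', 'Chile', 'Peru'],
                   'emissions': [100, 300, 200]})

# One sub-dataframe per country, keyed by its name -- no locals() needed
frames = {name: group for name, group in df.groupby('country')}
```

Afterwards frames['Peru'] holds exactly the rows for Peru, and spaces in country names stop being a problem, because dictionary keys, unlike variable names, may contain them.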

Related

How to apply iloc in a Dataframe depending on a column value

I have a Dataframe with the follow columns:
"Country Name"
"Indicator Code"
"Porcentaje de Registros" (as shown in the image); for each country there are 32 indicator codes with their percentage values.
The values are ordered in descending order, and I need to keep the 15 highest values for each country; that means, for example, that for Ecuador I need to know which are the 15 indicators with the highest values. I was trying the following:
countries = gender['Country Name'].drop_duplicates().to_list()
for countries in countries:
    test = RelevantFeaturesByID[RelevantFeaturesByID['Country Name']==countries].set_index(["Country Name", "Indicator Code"]).iloc[0:15]
test
But it just returns the first 15 rows for one country.
What am I doing wrong?
There is a mistake in the loop statement: for countries in countries: reuses the name countries as the loop variable, which shadows your list. That for sure is a problem. Also, you overwrite test on every iteration, so only the last country's rows are kept.
I am not sure whether I understood your aim correctly; however, this seems to be a good basis to start from:
# sorting with respect to countries and their percentage
df = df.sort_values(by=[df.columns[0], df.columns[-1]], ascending=[True, False])
# choosing unique values of country names
countries = df[df.columns[0]].unique()
test = []
for country in countries:
    test.append(df.loc[df["Country Name"] == country].iloc[0:15])
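If the goal is just the 15 highest rows per country, the loop can also be replaced entirely by groupby().head(); a short sketch with invented data and the question's column names (head(2) stands in for head(15)):

```python
import pandas as pd

df = pd.DataFrame({'Country Name': ['Ecuador'] * 4 + ['Peru'] * 4,
                   'Indicator Code': list('abcdefgh'),
                   'Porcentaje de Registros': [9, 7, 5, 1, 8, 6, 4, 2]})

# Sort descending once, then keep the top rows of each country
top = (df.sort_values('Porcentaje de Registros', ascending=False)
         .groupby('Country Name')
         .head(2))
```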

How to access index of string value in a cell of pandas data frame?

I'm working with the Bureau of Labor Statistics data which looks like this:
series_id year period value
CES0000000001 2006 M01 135446.0
Characters 3 and 4 of series_id indicate the supersector; for example, CES10xxxxxx01 would be Mining & Logging. There are 15 supersectors that I'm concerned with, and hence I want to create 15 separate data frames, one per supersector, to perform time series analysis. So I'm trying to access each value as a list to achieve something like:
# pseudocode:
mining_and_logging = df[df.series_id[3]==1 and df.series_id[4]==0]
Can I avoid writing a for loop where I convert each value to a list then access by index and add the row to the new dataframe?
How can I achieve this?
One way to do what you want and recursively store the dataframes through a for loop could be:
First, create an auxiliary column to make your life easier:
df['id'] = df['series_id'].str[3:5]  # extract characters 3 and 4 of every string (counting from zero)
Then, you create an empty dictionary and populate it:
dict_df = {}
for unique_id in df.id.unique():
    dict_df[unique_id] = df[df.id == unique_id]
Now you'll have a dictionary with 15 dataframes inside. For example, if you want to call the dataframe associated with id = 01, you just do:
dict_df['01']
Hope it helps!
Solved it by combining answers from Juan C and G. Anderson.
Select the 3rd and 4th character:
df['id'] = df.series_id.str.slice(start=3, stop=5)
And then the following to create dataframes:
dict_df = {}
for unique_id in df.id.unique():
    dict_df[unique_id] = df[df.id == unique_id]
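The loop above scans the frame once per id; pandas can also split the frame in a single pass, since iterating over a groupby yields (key, sub-frame) pairs. A sketch with made-up series ids:

```python
import pandas as pd

df = pd.DataFrame({'series_id': ['CES1000000001', 'CES2000000001', 'CES1000000002'],
                   'value': [1.0, 2.0, 3.0]})
df['id'] = df['series_id'].str.slice(start=3, stop=5)

# Iterating a groupby yields (key, group) pairs, so dict() builds the same mapping
dict_df = dict(tuple(df.groupby('id')))
```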

Basic Pandas data analysis: connecting data types

I loaded a dataframe with a variable called natvty, whose values are numeric codes from 50 to 600. Each number represents a country, and each country appears more than once. I did a count of the number of times each country appears in the list. Now I would like to replace each country number with the country's name, for example (57 = United States). I tried all kinds of for loops to no avail. Here's my code thus far. In the value counts table, the country number is on the left and the number of times it appears in the data is on the right. I need to replace the number on the left with the country name. The numbers which correspond to country names are in an external Excel sheet in two columns. Thanks.
I think there may be no need to REPLACE the country numbers with country names at first. Since you have now two tables, one is with columns ["country_number", "natvty"] and the other (your excel table, can be exported as .csv file and read by pandas) is with columns ["country_number", "country_name"], so you can simply join them both and keep them all. The resulted table would have 3 columns: ["country_number", "natvty", "country_name"], respectively.
import pandas as pd
df_nav = pd.read_csv("my_natvty.csv")
df_cnames = pd.read_csv("excel_country_names.csv")  # or use pd.read_excel("country_names.xlsx") directly on Excel files
# join-on-column expects the key to be the index of the other frame
df_nav_with_cnames = df_nav.join(df_cnames.set_index('country_number'), on='country_number')
Make sure they both have a column "country_number". You can modify the table heads in the data source files manually, or treat them as index columns and apply join accordingly. The concept is a little bit like SQL joins in relational databases.
Documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html
For this sort of thing, I always prefer the map function, which eats a dictionary, or a function for that matter.
import pandas as pd
import numpy.random as npr

# generate data
df = pd.DataFrame(data={'natvty': npr.randint(low=20, high=500, size=10),
                        'country': pd.Series([1,2,3,3,3,2,1,1,2,3])})
df
country natvty
0 1 24
1 2 310
2 3 88
3 3 459
4 3 38
5 2 63
6 1 194
7 1 384
8 2 281
9 3 360
Then, the dict. Here I just type it, but you could load it from a csv or excel file. Then you'd want to set the key as the index and turn the resulting series into a dict (to_dict()).
countrymap = {1:'US',2:'Canada',3:'Mexico'}
Then you can simply map the value labels.
df.country.map(countrymap)
0        US
1    Canada
2    Mexico
3    Mexico
4    Mexico
5    Canada
6        US
7        US
8    Canada
9    Mexico
Name: country, dtype: object
Note: The basic idea here is the same as Shellay's answer. I just wanted to demonstrate how to handle different column names in the two data frames, and how to retrieve the per-country frequencies you wanted.
You have one data frame containing country codes, and another data frame which maps country codes to country names. You simply need to join them on the country code columns. You can read more about merging in Pandas and SQL joins.
import pandas as pd
# this is your nativity frame
nt = pd.DataFrame([
[123],
[123],
[456],
[789],
[456],
[456]
], columns=('natvty',))
# this is your country code map
# in reality, use pd.read_excel
cc = pd.DataFrame([
[123, 'USA'],
[456, 'Mexico'],
[789, 'Canada']
], columns=('country_code', 'country_name'))
# perform a join
# now each row has an associated country_name
df = nt.merge(cc, left_on='natvty', right_on='country_code')
# now you can get frequencies on country names instead of country codes
print(df.country_name.value_counts(sort=False))
The output from the above is
Canada 1
USA 2
Mexico 3
Name: country_name, dtype: int64
I think a dictionary would be your best bet. If you had a dict of the countries and their codes e.g.
country_dict = {333: 'United States', 123: 'Canada', 456: 'Cuba', ...}
You presumably have a key of the countries and their codes, so you could make the dict really easily with a loop:
country_dict = {}
for i in country_list:
    country = i[0]  # if you had a list of countries and their numbers
    number = i[1]
    country_dict[number] = country
Adding a column to your DataFrame once you have this should be straightforward:
import pandas as pd
df = pd.read_csv('my_data.csv', header=None)
df['country'] = [country_dict[df[0][i]] for i in df.index]
This should work if the country codes column has index 0
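Once the dictionary exists, the list comprehension can also be written as a single vectorized map call; a sketch assuming, as above, that the codes live in column 0:

```python
import pandas as pd

country_dict = {333: 'United States', 123: 'Canada', 456: 'Cuba'}
df = pd.DataFrame([[333], [456], [123]])  # column 0 holds the country codes

# map() looks every code up in the dict and returns the matching names
df['country'] = df[0].map(country_dict)
```

Codes missing from the dict come back as NaN rather than raising a KeyError, which is often the safer behavior for messy data.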

Trying to parse string and create new columns in data frame in Python pandas

I have the following data frame.
Team Opponent Detail
Redskins Rams Kirk Cousins .... Penaltyon Bill Smith, Holding:10 yards
What I want to do is create THREE columns using pandas which would give me the name (in this case Bill Smith), the type of infraction (Holding), and how much it cost the team (10 yards). So it would look like this:
Team Opponent Detail Name Infraction Yards
Redskins Rams Bill Smith Holding 10 yards
I used some string manipulation to actually extract the fields, but don't know how to create the new columns. I have looked through some old posts, but cannot seem to get it to work. Thanks!
Your function should return 3 values, such as...
def extract(r):
    return r[28:38], r[-8:], r[-16:-9]
First create empty columns:
df["Name"] = df["Infraction"] = df["Yards"] = ""
... and then cast the result of "apply" to a list.
df[["Name", "Infraction", "Yards"]] = list(df.Detail.apply(extract))
You could be interested in this more specific but more extended answer.
In order to create a new column, you can simply do:
your_df['new column'] = something
For example, imagine you want a new column that contains the first word of the column Details
#toy dataframe
my_df = pd.DataFrame.from_dict({'Team':['Redskins'], 'Oponent':['Rams'],'Detail':['Penaltyon Bill Smith, Holding:10 yards ']})
#apply a function that retrieves the first word
my_df['new_word'] = my_df.apply(lambda x: x.Detail.split(' ')[0], axis=1)
This creates a column that contains "Penaltyon".
Now, imagine I now want to have two new columns, one for the first word and another one for the second word. I can create a new dataframe with those two columns:
new_df = my_df.apply(lambda x: pd.Series({'first': x.Detail.split(' ')[0],
                                          'second': x.Detail.split(' ')[1]}), axis=1)
and now I simply have to concatenate the two dataframes:
pd.concat([my_df, new_df], axis=1)
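For this particular Detail string, a single regex with named groups can also produce all three columns at once; this is a sketch that assumes the "Penaltyon &lt;name&gt;, &lt;infraction&gt;:&lt;n&gt; yards" shape shown in the question:

```python
import pandas as pd

df = pd.DataFrame({'Detail': ['Kirk Cousins .... Penaltyon Bill Smith, Holding:10 yards']})

# Named groups become column names; the pattern is tailored to the sample string
parts = df['Detail'].str.extract(
    r'Penaltyon (?P<Name>[^,]+), (?P<Infraction>[^:]+):(?P<Yards>\d+ yards)')
df = pd.concat([df, parts], axis=1)
```

Rows whose Detail does not match the pattern simply get NaN in the three new columns.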

Pandas: how to change all the values of a column?

I have a data frame with a column called "Date" and want all the values from this column to have the same value (the year only). Example:
City Date
Paris 01/04/2004
Lisbon 01/09/2004
Madrid 2004
Pekin 31/2004
What I want is:
City Date
Paris 2004
Lisbon 2004
Madrid 2004
Pekin 2004
Here is my code:
fr61_70xls = pd.ExcelFile('AMADEUS FRANCE 1961-1970.xlsx')
# Here we import the individual sheets and clean the sheets
years = ['1961','1962','1963','1964','1965','1966','1967','1968','1969','1970']
fr = {}
header = ['City','Country','NACE','Cons','Last_year','Op_Rev_EUR_Last_avail_yr',
          'BvD_Indep_Indic','GUO_Name','Legal_status','Date_of_incorporation','Legal_status_date']
for year in years:
    # save every sheet in variable fr['1961'], fr['1962'] and so on
    fr[year] = fr61_70xls.parse(year, header=0, parse_cols=10)
    fr[year].columns = header
    # drop the entire Legal status date column
    fr[year] = fr[year].drop(['Legal_status_date','Date_of_incorporation'], axis=1)
    # drop every row where GUO Name is empty
    fr[year] = fr[year].dropna(axis=0, how='all', subset=['GUO_Name'])
    fr[year] = fr[year].set_index(['GUO_Name','Date_of_incorporation'])
It happens that in my DataFrames, called for example fr['1961'] the values of Date_of_incorporation can be anything (strings, integer, and so on), so maybe it would be best to completely erase this column and then attach another column with only the year to the DataFrames?
As #DSM points out, you can do this more directly using the vectorised string methods:
df['Date'].str[-4:].astype(int)
Or using extract (assuming there is only one set of digits of length 4 somewhere in each string):
df['Date'].str.extract(r'(?P<year>\d{4})').astype(int)
An alternative slightly more flexible way, might be to use apply (or equivalently map) to do this:
df['Date'] = df['Date'].apply(lambda x: int(str(x)[-4:]))
# converts the last 4 characters of the string to an integer
The lambda function, is taking the input from the Date and converting it to a year.
You could (and perhaps should) write this more verbosely as:
def convert_to_year(date_in_some_format):
    date_as_string = str(date_in_some_format)  # cast to string
    year_as_string = date_as_string[-4:]       # last four characters
    return int(year_as_string)

df['Date'] = df['Date'].apply(convert_to_year)
Perhaps 'Year' is a better name for this column...
You can do a column transformation by using apply
Define a clean function to remove the dollar and commas and convert your data to float.
def clean(x):
    x = x.replace("$", "").replace(",", "").replace(" ", "")
    return float(x)
Next, call it on your column like this.
data['Revenue'] = data['Revenue'].apply(clean)
Or, if you want to use a lambda function inside apply:
data['Revenue'] = data['Revenue'].apply(lambda x: float(x.replace("$", "").replace(",", "").replace(" ", "")))
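The same cleaning can also be done without apply at all, using pandas' vectorized string methods; a sketch with invented revenue strings:

```python
import pandas as pd

data = pd.DataFrame({'Revenue': ['$1,200 ', '$300', '$45,000']})

# One regex pass removes dollar signs, commas and whitespace, then cast to float
data['Revenue'] = (data['Revenue']
                   .str.replace(r'[\$,\s]', '', regex=True)
                   .astype(float))
```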
