Ordering Headers in a DataFrame using Python

How do I order the headers of a DataFrame?
import pandas as pd

df2 = pd.DataFrame({'ISO': ['DE', 'CH', 'AT', 'FR', 'US'],
                    'Country': ['Germany', 'Switzerland', 'Austria', 'France', 'United States']})
print(df2)
The result I get by default is this:
         Country ISO
0        Germany  DE
1    Switzerland  CH
2        Austria  AT
3         France  FR
4  United States  US
But I thought ISO would come before Country, since that was the order in which I created the DataFrame. It looks like it was sorted alphabetically?
How can I set up this simple table in memory, in my preferred column order, to be used later in relational queries? I don't want to have to reorder the columns every time I reference the DataFrame.
My first coding post ever, ever.

A dict has no ordering; you can use the columns argument to enforce one. If columns is not provided, the default ordering is indeed alphabetical.
In [2]: df2 = pd.DataFrame({'ISO': ['DE', 'CH', 'AT', 'FR', 'US'],
   ...:                     'Country': ['Germany', 'Switzerland', 'Austria', 'France', 'United States']},
   ...:                    columns=['ISO', 'Country'])

In [3]: df2
Out[3]:
  ISO        Country
0  DE        Germany
1  CH    Switzerland
2  AT        Austria
3  FR         France
4  US  United States

A Python dict is unordered: the keys are not stored in the order in which you declare or append them. The dict you pass to the DataFrame constructor therefore has an arbitrary order, which the DataFrame takes for granted. (Note that in Python 3.7+ dicts do preserve insertion order, and modern pandas respects it, so this mainly affects older versions.)
You have several options to circumvent the issue:
Use an OrderedDict instead of a plain dict if you really need a dictionary as input:
from collections import OrderedDict

df2 = pd.DataFrame(OrderedDict([('ISO', ['DE', 'CH', 'AT', 'FR', 'US']),
                                ('Country', ['Germany', 'Switzerland', 'Austria', 'France', 'United States'])]))
If you don't rely on a dictionary in the first place, call the DataFrame constructor with the columns argument to declare the order:
df2 = pd.DataFrame({'ISO': ['DE', 'CH', 'AT', 'FR', 'US'],
                    'Country': ['Germany', 'Switzerland', 'Austria', 'France', 'United States']},
                   columns=['ISO', 'Country'])
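If the DataFrame already exists in the wrong order, you can also reorder it once and for all by selecting the columns in the order you prefer; a minimal sketch:

# Select the columns in the desired order and keep the result
df2 = df2[['ISO', 'Country']]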

Related

Search for variable name using iloc function in pandas dataframe

I have a pandas dataframe that consists of 5000 rows with different countries and emission data. It looks like the following:
country  year  emissions
peru     2020       1000
         2019        900
         2018        800
The country label is an index.
e.g. df = emissions.loc[['peru']]
would give me a new dataframe consisting only of the emission data attached to Peru.
My goal is to use a variable name instead of 'peru' and store the country-specific emission data into a new dataframe.
What I am looking for is code that would work the same way as the code below:
country = 'zanzibar'
df = emissions.loc[[{country}]]
From what I can tell, the problem arises with the loc function, which does not seem to accept variables as input. Is there a way I could circumvent this problem?
In other words, I want to be able to create a new dataframe with country-specific emission data, based on a variable that matches one of the countries in my emissions index, all without having to change anything but the given variable.
One way could be to iterate through or maybe create a function in some way?
Thank you in advance for any help.
An alternative approach where you don't use a country name for your index:
emissions = pd.DataFrame({'Country': ['Peru', 'Peru', 'Peru', 'Chile', 'Chile', 'Chile'],
                          'Year': [2021, 2020, 2019, 2021, 2020, 2019],
                          'Emissions': [100, 200, 400, 300, 200, 100]})
country = 'Peru'
Then to filter:
df = emissions[emissions.Country == country]
or
df = emissions.loc[emissions.Country == country]
Giving:
  Country  Year  Emissions
0    Peru  2021        100
1    Peru  2020        200
2    Peru  2019        400
You should be able to select by a certain string for your index. For example:
df = pd.DataFrame({'a': [1, 2, 3, 4]}, index=['Peru', 'Peru', 'zanzibar', 'zanzibar'])
country = 'zanzibar'
df.loc[country]
This will return:
          a
zanzibar  3
zanzibar  4
In your case, dropping the braces and one set of square brackets, and passing the variable directly, should work:
country = 'zanzibar'
df = emissions.loc[country]
I don't know if this solution is quite what your question asks for; in this case I will give a solution that turns each country name into a variable.
Because a variable name can't contain a space (" "), you first have to replace the space characters with underscores ("_"), just in case some of your 'country' values consist of more than one word. For example, United Kingdom becomes United_Kingdom. You can do this with:
df['country'] = df['country'].replace(' ', '_', regex=True)
Once the country names are in this format, you can collect all the unique country names from the dataframe with .unique() and store them in a new variable:
country_name = df['country'].unique()
After that, all the unique values in the 'country' column are stored in an array called country_name.
Next, use a for loop to generate a new variable for each country name:
for i in country_name:
    locals()[i] = df[df['country'] == i]
Here locals() is used to create a variable named after each country string in country_name, and df[df['country'] == i] subsets the dataframe to the rows whose country matches that name. After the loop runs, there is a separate variable for each country name in the 'country' column.
Hopefully this can help to solve your problem.
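As an aside, creating variables through locals() is fragile (for instance, it is not guaranteed to work inside a function). A dictionary keyed by country name is a safer variant; a minimal sketch, using the same df:

# Safer alternative: keep the per-country frames in a dict instead of locals()
frames = {c: df[df['country'] == c] for c in df['country'].unique()}
peru_df = frames['peru']  # look up one country's data by name (assuming 'peru' exists)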

Can't sort dataframe column, 'numpy.ndarray' object has no attribute 'sort_values', can't separate numbers with commas

I am working with this csv
https://drive.google.com/file/d/1o3Nna6CTdCRvRhszA01xB9chawhngGV7/view?usp=sharing
I am trying to sort by the 'Taxes' column, but when I use
import pandas as pd
df = pd.read_csv('statesFedTaxes.csv')
df.Taxes.values.sort_values()
I get
AttributeError: 'numpy.ndarray' object has no attribute 'sort_values'
This is baffling to me and I cannot find a similar problem online. How can I sort the data by the "Taxes" column?
EDIT: I should explain that my real problem is that when I use
df.sort_values('Taxes')
I get this output:
             State        Taxes
48      Washington  100,609,767
24       Minnesota  102,642,589
25     Mississippi   11,273,202
13           Idaho   11,343,181
30   New Hampshire   12,208,656
54   International   12,611,648
22   Massachusetts  120,035,203
40    Rhode Island   14,325,645
31      New Jersey  140,258,435
Therefore, I assume the commas are getting in the way of sorting properly. How do I get around this?
import pandas as pd
df = pd.DataFrame({"Taxes": ["1,000", "100", "100,000"]})
Your dataframe looks fine when we print it.
>>> df.sort_values(by="Taxes")
Taxes
0 1,000
1 100
2 100,000
But the dtype is all wrong: these are strings (stored as objects), not numbers. When you call .values you get an array of... more strings, not numbers.
>>> df.dtypes
Taxes object
So turn them into numbers:
>>> df['Taxes'] = df['Taxes'].str.replace(",", "").astype(int)
>>> df.sort_values(by="Taxes")
Taxes
1 100
0 1000
2 100000
Now it's fine.
Another option is to read the file in with the thousands separator explicitly defined, which fixes the typing problem at the source:
df = pd.read_csv('statesFedTaxes.csv', thousands=",")
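For example, a quick check (assuming the statesFedTaxes.csv file from the question):

import pandas as pd

# thousands=',' strips the separators while parsing, so Taxes arrives numeric
df = pd.read_csv('statesFedTaxes.csv', thousands=',')
print(df.dtypes)                       # Taxes should now be int64, not object
print(df.sort_values('Taxes').head())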
You have the operations in the wrong order: sort the column first, then extract the values to an array:
df.sort_values("Taxes")["Taxes"].values
df.Taxes is a Series object, and df.Taxes.values is an ndarray object. In this case, you're not calling sort_values on the data frame df; you're trying to call it on the raw array behind the Taxes column, which has no such method.
df.sort_values('Taxes') will give you df sorted on that column.
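Note that sort_values returns a new, sorted DataFrame by default rather than modifying df in place, so assign the result back if you want to keep it:

df = df.sort_values('Taxes')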

Can I combine groupby data?

I have two columns, home and away. So one row will be England vs Brazil and the next row will be Brazil vs England. How can I count occurrences of England vs Brazil and Brazil vs England as a single count?
Based on previous solutions, I have tried
results.groupby(["home_team", "away_team"]).size()
results.groupby(["away_team", "home_team"]).size()
however this does not give me the outcome that I am looking for.
Undesired output:
home_team  away_team
England    Brazil       1

away_team  home_team
Brazil     England      1

I would like to see:
England    Brazil       2
Maybe you need something like this:
df = pd.DataFrame({
    'home': ['England', 'Brazil', 'Spain'],
    'away': ['Brazil', 'England', 'Germany']
})

pd.Series('-'.join(sorted(tup)) for tup in zip(df['home'], df['away'])).value_counts()
Output:
Brazil-England 2
Germany-Spain 1
dtype: int64
PS: If you do not like the - between team names, you can use:
pd.Series(' '.join(sorted(tup)) for tup in zip(df['home'], df['away'])).value_counts()
You can sort the values with numpy.sort, create a new DataFrame from the result, and then use your original solution:
df1 = (pd.DataFrame(np.sort(df[['home', 'away']], axis=1), columns=['home', 'away'])
         .groupby(['home', 'away'])
         .size())
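For reference, a self-contained version of this approach, using the sample data from the first answer:

import numpy as np
import pandas as pd

df = pd.DataFrame({'home': ['England', 'Brazil', 'Spain'],
                   'away': ['Brazil', 'England', 'Germany']})

# Sorting each row makes (England, Brazil) and (Brazil, England) identical,
# so the groupby counts them together
counts = (pd.DataFrame(np.sort(df[['home', 'away']], axis=1), columns=['home', 'away'])
            .groupby(['home', 'away'])
            .size())
print(counts)  # expect: (Brazil, England) -> 2, (Germany, Spain) -> 1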
Option 1
You can sort the values of the dataframe with ndarray.sort. However, as that sorts in place, it is better to work on a copy of the data:
dfTeams = pd.DataFrame(data=df.values.copy(), columns=['team1', 'team2'])
dfTeams.values.sort()
(I changed the column names, because the sorting changes their meaning.)
After having done this, you can use your groupby:
dfTeams.groupby(['team1', 'team2']).size()
Option 2
Since a more general title for your question would be something like "how can I count combinations of values in multiple columns, independently of their order", you could use a set.
A set object is an unordered collection of distinct hashable objects.
More precisely, create a Series of frozensets (unlike a plain set, a frozenset is itself hashable, so its values can be counted), and then count the values:
pd.Series(map(lambda home, away: frozenset({home, away}),
              df['home'],
              df['away'])).value_counts()
Note: I use the dataframe from Harv Ipan's answer.

Create multiple, new columns from dict values by mapping against a single column

This question is related to posts on creating a new column by mapping/lookup using a dictionary (Adding a new pandas column with mapped value from a dictionary and pandas - add new column to dataframe from dictionary). However, what if I want to create multiple new columns from dictionary values?
For argument's sake, let's say I have the following df:
country
0 bolivia
1 canada
2 ghana
And in a different dataframe, I have country mappings:
country country_id category color
0 canada 11 north red
1 bolivia 12 central blue
2 ghana 13 south green
I've been using pd.merge to merge the mapping dataframe into my df, using country and another index as keys. It basically does the job and gives me my desired output:
country country_id category color
0 bolivia 12 central blue
1 canada 11 north red
2 ghana 13 south green
But lately I've been wanting to experiment with dictionaries. I suppose a related question is how one decides between pd.merge and dictionaries for this task.
For one-off columns, I'll create a new column by mapping against a dictionary:
country_dict = dict(zip(country, country_id))
df['country_id'] = df['country'].map(country_dict)
It seems impractical to define a function that takes in different dictionaries and to create each new column separately (e.g., dict(zip(key, value1)), dict(zip(key, value2))). I'm stuck on how to proceed in creating multiple columns at the same time. I started over, and tried creating the country mapping excel worksheet as a dictionary:
entity_dict = entity.set_index('country').T.to_dict('list')
and then from there, converting the dict values to columns:
entity_mapping = pd.DataFrame.from_dict(entity_dict, orient = 'index')
entity_mapping.columns = ['col1', 'col2', 'col3']
And I've been stuck going around in circles for the past few days. Any help/feedback would be appreciated!
OK, after tackling this some more... I guess pd.merge makes the most sense.
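For completeness, the dictionary route is workable too. A minimal sketch, assuming the mapping dataframe is called entity as in the question: build a lookup indexed by country, then map each new column in a loop (Series.map accepts a Series as well as a dict).

# One lookup per new column, keyed by country, built from the mapping frame
lookup = entity.set_index('country')
for col in ['country_id', 'category', 'color']:
    df[col] = df['country'].map(lookup[col])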

Basic Pandas data analysis: connecting data types

I loaded in a dataframe with a variable called natvty, which holds numbers from 50 to 600. Each number represents a country, and each country appears more than once. I did a count of the number of times each country appears in the list. Now I would like to replace the number of the country with the name of the country, for example (57 = United States). I tried all kinds of for loops to no avail. Here's my code thus far. In the value counts table, the country number is on the left and the number of times it appears in the data is on the right. I need to replace the number on the left with the country name. The numbers corresponding to country names are in an external Excel sheet, in two columns. Thanks.
I think there may be no need to REPLACE the country numbers with country names at first. You now have two tables: one with columns ["country_number", "natvty"], and the other (your Excel table, which can be exported as a .csv file and read by pandas) with columns ["country_number", "country_name"]. You can simply join them and keep all the columns. The resulting table would have three columns: ["country_number", "natvty", "country_name"].
import pandas as pd

df_nav = pd.read_csv("my_natvty.csv")
df_cnames = pd.read_csv("excel_country_names.csv")  # or use pd.read_excel("country_names.xlsx") directly on Excel files

# join matches df_nav's country_number column against the other frame's index,
# so set that index first
df_nav_with_cnames = df_nav.join(df_cnames.set_index('country_number'), on='country_number')
Make sure they both have a "country_number" column. You can rename the headers in the data source files manually, or treat these columns as index columns, as in the set_index call above. The concept is much like a join in a relational database.
Documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html
For this sort of thing, I always prefer the map function, which accepts a dictionary (or a function, for that matter).
import pandas as pd
import numpy as np

# generate data
df = pd.DataFrame(data={'natvty': np.random.randint(low=20, high=500, size=10),
                        'country': pd.Series([1, 2, 3, 3, 3, 2, 1, 1, 2, 3])})
df
   country  natvty
0        1      24
1        2     310
2        3      88
3        3     459
4        3      38
5        2      63
6        1     194
7        1     384
8        2     281
9        3     360
Then, the dict. Here I just type it out, but you could load it from a csv or excel file; in that case, set the code column as the index and turn the resulting series into a dict with to_dict().
countrymap = {1:'US',2:'Canada',3:'Mexico'}
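For example, loading the mapping from a file instead (a sketch, assuming a hypothetical country_codes.csv with columns code and name):

# Hypothetical file with columns 'code' and 'name'
codes = pd.read_csv('country_codes.csv')
countrymap = codes.set_index('code')['name'].to_dict()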
Then you can simply map the value labels.
df.country.map(countrymap)
Out[10]:
0 US
1 Canada
2 Mexico
3 Mexico
4 Mexico
5 Canada
6 US
7 US
8 Canada
9 Mexico
Name: country, dtype: object
Note: The basic idea here is the same as Shellay's answer. I just wanted to demonstrate how to handle different column names in the two data frames, and how to retrieve the per-country frequencies you wanted.
You have one data frame containing country codes, and another data frame which maps country codes to country names. You simply need to join them on the country code columns. You can read more about merging in Pandas and SQL joins.
import pandas as pd

# this is your nativity frame
nt = pd.DataFrame([
    [123],
    [123],
    [456],
    [789],
    [456],
    [456]
], columns=('natvty',))

# this is your country code map
# in reality, use pd.read_excel
cc = pd.DataFrame([
    [123, 'USA'],
    [456, 'Mexico'],
    [789, 'Canada']
], columns=('country_code', 'country_name'))

# perform a join
# now each row has an associated country_name
df = nt.merge(cc, left_on='natvty', right_on='country_code')

# now you can get frequencies on country names instead of country codes
print(df.country_name.value_counts(sort=False))
The output from the above is
Canada    1
USA       2
Mexico    3
Name: country_name, dtype: int64
I think a dictionary would be your best bet. If you had a dict of the countries and their codes e.g.
country_dict = {333: 'United States', 123: 'Canada', 456: 'Cuba', ...}
You presumably have a list of the countries and their codes, so you could build the dict easily with a loop:
country_dict = {}
for i in country_list:
    country = i[0]  # if you had a list of (country, number) pairs
    number = i[1]
    country_dict[number] = country
Adding a column to your DataFrame once you have this should be straightforward:
import pandas as pd

df = pd.read_csv('my_data.csv', header=None)
df['country'] = [country_dict[code] for code in df[0]]
This should work if the country codes column has index 0.
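Equivalently, Series.map does the lookup in one step, and codes missing from the dict become NaN instead of raising a KeyError:

df['country'] = df[0].map(country_dict)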
