Although I found multiple questions on this topic, I could not find a solution for this one in particular.
I am playing around with this CSV file, which contains a subselection of TBC data from the WHO:
http://dign.eu/temp/tbc.csv
import pandas as pd
df = pd.read_csv('tbc.csv', index_col=['country', 'year'])
This gives a nicely formatted DataFrame, sorted on country and year, showing one of the parameters.
Now, for this case I would like the mean value of "param" for each country over all available years. Using df.mean() gives me one overall value, and df.mean(axis=1) drops all indices, which makes the result useless.
Obviously I can do this using a loop, but I guess there is a smarter way. But how?
If I understand you correctly, you want to pass the level to the mean function:
In [182]:
df.mean(level='country')
Out[182]:
param
country
Afghanistan 8391.312500
Albania 183.888889
Algeria 8024.588235
American Samoa 1.500000
....
West Bank and Gaza Strip 12.538462
Yemen 4029.166667
Zambia 13759.266667
Zimbabwe 12889.666667
[219 rows x 1 columns]
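Note that on recent pandas versions the level= argument to mean() is gone (deprecated in pandas 1.3, removed in 2.0); the equivalent spelling is a groupby on the index level:

df.groupby(level='country').mean()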
So I have this problem: because of the size of the dataframe I am working on, I clearly cannot upload it, but it has the following structure:
    country  coastline  EU   highest
1   Norway   yes        yes  1500
2   Turkey   yes        no   20100
..  ...      ...        ...  ...
41  Bolivia  no         no   999
42  Japan    yes        no   89
I have to solve several exercises with pandas. Among them is, for example, showing the country with the maximum, minimum and average "highest" value, but only among the countries that belong to the EU. I have already solved the maximum and the minimum, but for the average I thought about creating a new dataframe containing only the rows with a "yes" in the EU column. I've tried a few things, but they haven't worked.
I think this is the best way to solve it, but if anyone has another idea, I'm looking forward to reading it.
By the way, this is one of the examples that I said I was able to solve:
print('Minimum outside the EU')
paises[(paises.EU == "no")].sort_values(by=['highest'], ascending=[True]).head(1)
Which gives me this:
   country  coastline  EU  highest
3  Belarus  no         no  345
As a last condition, this must be solved using pandas, since that is basically the chapter we are working on in class.
If you want to create a new dataframe based on a filter of your first one, you can do this:
new_df = df[df['EU'] == 'yes'].copy()
This will look at the 'EU' column in the original dataframe df and only return the rows where it is 'yes'. I think it is good practice to add the .copy(), since we can sometimes get strange side-effects if we later make changes to new_df (it probably wouldn't matter here).
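Applied to the paises dataframe from the question, the average exercise then becomes a one-liner on the filtered frame, a minimal sketch:

eu = paises[paises.EU == 'yes'].copy()
print('Average inside the EU')
print(eu['highest'].mean())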
I have a dataset with five columns.
Dataset:
Country Population Tourism Mean_Age Employed
Afghanistan 37172386 14000 17.3 Fulltime
Albania 2866376 5340000 36.2 Parttime
There are almost 1000 rows like this, where Employed is a categorical column. I want to represent the Employed column as a numerical column using one-hot encoding.
My code is
from sklearn.preprocessing import OneHotEncoder
Employed_Status = data["Employed"]
encoder = OneHotEncoder()
encoder.fit(Employed_Status.values.reshape(-1, 1))
encoder.transform(Employed_Status.head().values.reshape(-1, 1)).todense()
Here data is the name of my data frame.
When I inspect the dataset after executing the lines above, I get the original dataset back.
However, I thought I would get something like this:
Country Population Tourism Mean_Age Employed
Afghanistan 37172386 14000 17.3 1
Albania 2866376 5340000 36.2 0
since I have applied one-hot encoding to the Employed column.
Can anyone tell me why I got the same result and not the desired one?
You can do something like this:
data['Employed'] = data['Employed'].replace('Fulltime',1).replace('Parttime',0)
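Note that the desired output shown in the question is really a single numeric column (label encoding) rather than one column per category; a dict does the same replacement in one call:

data['Employed'] = data['Employed'].replace({'Fulltime': 1, 'Parttime': 0})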
You're not saving the output.
out = encoder.transform(...).todense()
data['employed'] = out
It may take some wrangling to get the datasets to go together. I have found pd.concat([numerical_in, categorical_encoded_in], axis=1) was needed in the past, but you might find it simply works once you save the dense output.
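For a true one-hot result, a minimal sketch that saves the encoder output and joins it back onto the dataframe (get_feature_names_out assumes scikit-learn >= 1.0; on older versions use get_feature_names):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it dense
encoded = encoder.fit_transform(data[['Employed']]).toarray()
encoded_df = pd.DataFrame(encoded,
                          columns=encoder.get_feature_names_out(['Employed']),
                          index=data.index)
# drop the original categorical column and attach the encoded ones
data = pd.concat([data.drop(columns='Employed'), encoded_df], axis=1)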
I have two columns, home and away. So one row will be England vs Brazil and the next row will be Brazil vs England. How can I count occurrences of Brazil vs England and England vs Brazil as a single count?
Based on previous solutions, I have tried
results.groupby(["home_team", "away_team"]).size()
results.groupby(["away_team", "home_team"]).size()
however this does not give me the outcome that I am looking for.
Undesired output:
home_team  away_team
England    Brazil       1

away_team  home_team
Brazil     England      1
I would like to see:
England Brazil 2
Maybe you need something like the following:
df = pd.DataFrame({
    'home': ['England', 'Brazil', 'Spain'],
    'away': ['Brazil', 'England', 'Germany']
})
pd.Series('-'.join(sorted(tup)) for tup in zip(df['home'], df['away'])).value_counts()
Output:
Brazil-England 2
Germany-Spain 1
dtype: int64
PS: If you do not like the - between team names, you can use:
pd.Series(' '.join(sorted(tup)) for tup in zip(df['home'], df['away'])).value_counts()
You can sort the values with numpy.sort, create a DataFrame and use your original solution:
import numpy as np

df1 = (pd.DataFrame(np.sort(df[['home', 'away']], axis=1), columns=['home', 'away'])
         .groupby(['home', 'away'])
         .size())
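With the sample dataframe from the previous answer, this produces:

home     away
Brazil   England    2
Germany  Spain      1
dtype: int64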
Option 1
You can use numpy.sort to sort the values of the dataframe
However, as that sorts in place, maybe it is better to create a copy of the dataframe.
dfTeams = pd.DataFrame(data=df.values.copy(), columns=['team1','team2'])
dfTeams.values.sort()
(I changed the column names, because with the sorting you are changing their meaning)
After having done this, you can use your groupby.
dfTeams.groupby(['team1', 'team2']).size()
Option 2
Since a more general title for your question would be something like "how can I count combinations of values in multiple columns of a dataframe, independently of their order?", you could use a set.
A set object is an unordered collection of distinct hashable objects.
More precisely, create a Series of frozen sets, and then count values.
pd.Series(map(lambda home, away: frozenset({home, away}),
              df['home'],
              df['away'])).value_counts()
Note: I use the dataframe in #Harv Ipan's answer.
I have a dataframe that holds 2,865,044 entries with a 3-level MultiIndex
MultiIndex.levels.names = ['year', 'country', 'productcode']
I am trying to reshape the dataframe to produce a wide dataframe but I am getting the error:
ReshapeError: Index contains duplicate entries, cannot reshape
I have used:
data[data.duplicated()]
to identify the lines causing the error but the data that it lists doesn't seem to contain any duplicates.
This led me to export my dataframe using to_csv(), open the data in Stata, and use the duplicates list command, which found that the dataset doesn't hold duplicates (according to Stata).
An Example from the sorted csv file:
year country productcode duplicate
1962 MYS 711 FALSE
1962 MYS 712 TRUE
1962 MYS 721 FALSE
I know it's a long shot, but any ideas what might be causing this? The data types of the index columns are {'year': int, 'country': str, 'productcode': str}. Could it be how pandas defines the unique groups? Any better ways to list the offending index lines?
Update:
I have tried resetting the index
temp = data.reset_index()
dup = temp[temp.duplicated(cols=['year', 'country', 'productcode'])]
and I get a completely different list!
year country productcode
1994 HKG 9710
1994 USA 9710
1995 HKG 9710
1995 USA 9710
Update 2 [28JUNE2013]:
It appears to have been a strange memory issue during my IPython session.
This morning's fresh instance seems to work fine and reshapes the data without any adjustments to yesterday's code! I will debug further if the issue returns and let you know. Anyone know of a good debugger for IPython sessions?
perhaps try
cleaned = df.reset_index().drop_duplicates(df.index.names)
cleaned.set_index(df.index.names, inplace=True)
I think there ought to be a duplicated method on the index, but there is not yet:
https://github.com/pydata/pandas/issues/4060
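Later pandas versions did add this: the index itself now exposes duplicated(), so the offending rows can be listed directly. A brief sketch against the question's data frame:

# keep=False flags every member of a duplicated group, not just the repeats
offending = data[data.index.duplicated(keep=False)]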
How do I order the headers of a dataframe?
from pandas import *
import pandas
import numpy as np
df2 = DataFrame({'ISO': ['DE','CH','AT','FR','US'],
                 'Country': ['Germany','Switzerland','Austria','France','United States']})
print df2
The result I get on default is this:
Country ISO
0 Germany DE
1 Switzerland CH
2 Austria AT
3 France FR
4 United States US
But I thought ISO would come before Country, as that was the order in which I created it in the dataframe. It looks like it sorted the columns alphabetically?
How can I set up this simple table in memory, in my preferred column order, to be used later in relational queries? I don't want to have to reorder the columns every time I reference the dataframe.
My first coding post ever, ever.
A dict has no ordering; you can use the columns argument to enforce one. If columns is not provided, the default ordering is indeed alphabetical.
In [2]: df2 = DataFrame({'ISO':['DE','CH','AT','FR','US'],
...: 'Country': ['Germany','Switzerland','Austria','France','United States']},
...: columns=['ISO', 'Country'])
In [3]: df2
Out[3]:
ISO Country
0 DE Germany
1 CH Switzerland
2 AT Austria
3 FR France
4 US United States
A Python dict is unordered. The keys are not stored in the order you declare them or append to it. The dict you give to the DataFrame as an argument has an arbitrary order, which the DataFrame takes for granted.
You have several options to circumvent the issue:
Use an OrderedDict object instead of a dict if you really need a dictionary as input:
from collections import OrderedDict

df2 = DataFrame(OrderedDict([('ISO', ['DE','CH','AT','FR','US']),
                             ('Country', ['Germany','Switzerland','Austria','France','United States'])]))
If you don't rely on a dictionary in the first place, then call the DataFrame with arguments declaring the columns:
df2 = DataFrame({'ISO': ['DE','CH','AT','FR','US'],
                 'Country': ['Germany','Switzerland','Austria','France','United States']},
                columns=['ISO', 'Country'])
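Note that this answer predates Python 3.7: on 3.7+ plain dicts preserve insertion order, and modern pandas respects it, so the original constructor call now keeps 'ISO' before 'Country' without any extra arguments:

df2 = DataFrame({'ISO': ['DE','CH','AT','FR','US'],
                 'Country': ['Germany','Switzerland','Austria','France','United States']})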