How to transpose or pivot a table? Selecting specific columns - python
beginner here!
I have a dataframe similar to this:
df = pd.DataFrame({'Country_Code':['FR','FR','FR','USA','USA','USA','BR','BR','BR'],'Indicator_Name':['GPD','Pop','birth','GPD','Pop','birth','GPD','Pop','birth'],'2005':[14,34,56, 25, 67, 68, 55, 8,99], '2006':[23, 34, 34, 43,34,34, 65, 34,45]})
Index Country_Code Indicator_Name 2005 2006
0 FR GPD 14 23
1 FR Pop 34 34
2 FR birth 56 34
3 USA GPD 25 43
4 USA Pop 67 34
5 USA birth 68 34
6 BR GPD 55 65
7 BR Pop 8 34
8 BR birth 99 45
I need to pivot or transpose it, keeping the Country Code, the years, and the indicators names as columns, like this:
index Country_Code year GPD Pop Birth
0 FR 2005 14 34 56
1 FR 2006 23 34 34
3 USA 2005 25 67 68
4 USA 2006 43 34 34
...
I used the transpose function like this:
df.set_index(['Indicator_Name']).transpose()
The result is nice, but the countries end up as a row, like this:
Indicator_Name GPD Pop birth GPD Pop birth GPD Pop birth
Country_Code FR FR FR USA USA USA BR BR BR
2005 14 34 56 25 67 68 55 8 99
2006 23 34 34 43 34 34 65 34 45
I also tried the "pivot" and "pivot_table" functions, but the result was not what I need. Could you please give me some advice?
import pandas as pd
df = pd.DataFrame({'Country_Code':['FR','FR','FR','USA','USA','USA','BR','BR','BR'],'Indicator_Name':['GPD','Pop','birth','GPD','Pop','birth','GPD','Pop','birth'],'2005':[14,34,56, 25, 67, 68, 55, 8,99], '2006':[23, 34, 34, 43,34,34, 65, 34,45]})
df
#%% Pivot longer columns `'2005'` and `'2006'` to `'Year'`
df1 = df.melt(id_vars=["Country_Code", "Indicator_Name"],
              var_name="Year",
              value_name="Value")
#%% Pivot wider by values in `'Indicator_Name'`
df2 = (df1.pivot_table(index=['Country_Code', 'Year'],
                       columns=['Indicator_Name'],
                       values=['Value'],
                       aggfunc='first'))
Output:
Value
Indicator_Name GPD Pop birth
Country_Code Year
BR 2005 55 8 99
2006 65 34 45
FR 2005 14 34 56
2006 23 34 34
USA 2005 25 67 68
2006 43 34 34
The simplest approach, in my opinion, is to pivot and then stack:
(df.pivot(index='Country_Code', columns='Indicator_Name')
   .rename_axis(columns=['year', None]).stack(0).reset_index()
)
output:
Country_Code year GPD Pop birth
0 BR 2005 55 8 99
1 BR 2006 65 34 45
2 FR 2005 14 34 56
3 FR 2006 23 34 34
4 USA 2005 25 67 68
5 USA 2006 43 34 34
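The two answers above can also be combined into a melt followed by a plain pivot. The following is only a sketch: it assumes every (Country_Code, year) pair has exactly one row per indicator (pivot raises an error on duplicates) and a pandas version that accepts a list for `index` in `pivot`. The names `long_df` and `result` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Country_Code': ['FR', 'FR', 'FR', 'USA', 'USA', 'USA', 'BR', 'BR', 'BR'],
    'Indicator_Name': ['GPD', 'Pop', 'birth'] * 3,
    '2005': [14, 34, 56, 25, 67, 68, 55, 8, 99],
    '2006': [23, 34, 34, 43, 34, 34, 65, 34, 45],
})

# Wide -> long: one row per (country, indicator, year)
long_df = df.melt(id_vars=['Country_Code', 'Indicator_Name'],
                  var_name='year', value_name='value')

# Long -> wide: one column per indicator, then flatten the index
result = (long_df.pivot(index=['Country_Code', 'year'],
                        columns='Indicator_Name', values='value')
                 .reset_index()
                 .rename_axis(None, axis=1))  # drop the 'Indicator_Name' label
print(result)
```

Because the year labels come from column names, the `year` column holds strings such as '2005'; convert with `result['year'].astype(int)` if numbers are needed.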
Related
Switch Header and Column in a DataFrame
Economy Year Indicator1 Indicator2 Indicator3 Indicator4 .
UK      1    23         45         56         78
UK      2    24         87         32         42
UK      3    22         87         32         42
UK      4    2          87         32         42
FR      .    .          .          .          .
This is my data, which continues beyond what is shown and is held as a DataFrame. I want to switch the header (the indicators) and the Year column; it seems like a pivot. There are hundreds of indicators and 20 years.
Use DataFrame.melt with DataFrame.pivot:
df = (df.melt(['Economy','Year'], var_name='Ind')
        .pivot(index=['Economy','Ind'], columns='Year', values='value')
        .reset_index()
        .rename_axis(None, axis=1))
print (df)
  Economy         Ind   1   2   3   4
0      UK  Indicator1  23  24  22   2
1      UK  Indicator2  45  87  87  87
2      UK  Indicator3  56  32  32  32
3      UK  Indicator4  78  42  42  42
Another option is to set the Year column as the index and then use transpose. Consider the code below:
import pandas as pd
df = pd.DataFrame(columns=['Economy', 'Year', 'Indicator1', 'Indicator2', 'Indicator3', 'Indicator4'],
                  data=[['UK', 1, 23, 45, 56, 78],
                        ['UK', 2, 24, 87, 32, 42],
                        ['UK', 3, 22, 87, 32, 42],
                        ['UK', 4, 2, 87, 32, 42],
                        ['FR', 1, 22, 33, 11, 35]])
# Make Year column as index
df = df.set_index('Year')
# Transpose columns to rows and vice-versa
df = df.transpose()
print(df)
gives you
Year         1   2   3   4   1
Economy     UK  UK  UK  UK  FR
Indicator1  23  24  22   2  22
Indicator2  45  87  87  87  33
Indicator3  56  32  32  32  11
Indicator4  78  42  42  42  35
You can use transpose like this:
df = df.set_index('Year')
df = df.transpose()
print (df)
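With more than one economy, a plain transpose repeats the year labels across columns. One possible way to keep the economy as a key is to stack the indicators and unstack the years. This is a sketch on a cut-down frame: the FR values for year 2 are invented for illustration, and `swapped` is a hypothetical name.

```python
import pandas as pd

df = pd.DataFrame(
    columns=['Economy', 'Year', 'Indicator1', 'Indicator2'],
    data=[['UK', 1, 23, 45], ['UK', 2, 24, 87],
          ['FR', 1, 22, 33], ['FR', 2, 21, 30]],  # FR year 2 is made up
)

swapped = (df.set_index(['Economy', 'Year'])
             .stack()           # indicators move from columns into the index
             .unstack('Year'))  # years move from the index into columns
swapped.index.names = ['Economy', 'Ind']
print(swapped)
```

Each row is now one (Economy, Ind) pair and each column is one year, so the year labels stay unique.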
How to take mean across row in Pandas pivot table Dataframe? [duplicate]
I have a pandas dataframe, shown below, which is a pivot table. I would like to print Africa in 2007 and also compute the mean of the entire Americas row. Any ideas how to do this? I have been trying combinations of stack/unstack for a while now to no avail.
year       1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
continent
Africa       12   13   15   20   39   25   81   12   22   23   25   44
Americas     12   14   65   10  119   15   21   42   47   84   15   89
Asia         12   13   89   20   39   25   81   29   77   23   25   89
Europe       12   13   15   20   39   25   81   29   23   32   15   89
Oceania      12   13   15   20   39   25   81   27   32   85   25   89
import pandas as pd
df = pd.read_csv('dummy_data.csv')
# handy to see the continent name against the value rather than '0' or '3'
df.set_index('continent', inplace=True)
# print mean for all rows - see how the continent name helps here
print(df.mean(axis=1))
print('---')
print()
# print the mean for just the 'Americas' row
print(df.mean(axis=1)['Americas'])
print('---')
print()
# print the value of the 'Africa' row for the year (column) 2007
print(df.query('continent == "Africa"')['2007'])
print('---')
print()
Output:
continent
Africa      27.583333
Americas    44.416667
Asia        43.500000
Europe      32.750000
Oceania     38.583333
dtype: float64
---

44.416666666666664
---

continent
Africa    44
Name: 2007, dtype: int64
---
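Label-based indexing with `.loc` is another way to reach a single cell or a single row. The sketch below uses a small stand-in for the pivot table (two continents, three years) rather than the CSV, and it assumes integer year labels; after `read_csv` the labels would be strings such as '2007'. The name `pivot` is chosen only for this example.

```python
import pandas as pd

# Stand-in for the pivot table in the question (subset of rows and years)
pivot = pd.DataFrame(
    {1997: [23, 84], 2002: [25, 15], 2007: [44, 89]},
    index=pd.Index(['Africa', 'Americas'], name='continent'),
)
pivot.columns.name = 'year'

# Single cell: row label first, then column label
africa_2007 = pivot.loc['Africa', 2007]

# Single row as a Series, then its mean over the years present
americas_mean = pivot.loc['Americas'].mean()

print(africa_2007)
print(americas_mean)
```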
Python: How to exclude specific parts of a row when reading from CSV file
I'm very new to Python and am trying to read a CSV file:
1980,Mark,Male,Student,L,90,56,78,44,88
1982,Cindy,Female,Student,S,45,76,22,42,90
1984,Kevin,Male,Student,L,67,83,52,55,59
1986,Michael,Male,Student,M,94,63,73,60,43
1988,Anna,Female,Student,S,66,50,59,57,33
1990,Jessica,Female,Student,S,72,34,29,69,27
1992,John,Male,Student,L,80,67,90,89,68
1994,Tom,Male,Student,M,23,60,89,78,39
1996,Nick,Male,Student,S,56,98,84,44,50
1998,Oscar,Male,Student,M,64,61,74,59,63
2000,Andy,Male,Student,M,11,50,93,69,90
I'd like to save only specific attributes of this data into a dictionary or a list of lists. For example, I'd only like to keep the year, the name and the five numbers in each row. I'm not sure how to exclude only the middle three columns. This is the code I have now:
def read_data(filename):
    f = open("myfile.csv", "rt")
    import csv
    data = {}
    for line in f:
        row = line.rstrip().split(',')
        data[row[0]] = [e for e in row[5:]]
    return data
I only know how to keep contiguous chunks of columns, not how to pick specific columns one by one.
You could use pd.read_csv() and pass in your desired column names:
import pandas as pd
df = pd.read_csv('csv1.csv', names=['Year','Name','Gender','ID1','ID2','Val1','Val2','Val3','Val4','Val5'])
desired = df[['Year','Name','Val1','Val2','Val3','Val4','Val5']]
Yields:
    Year     Name  Val1  Val2  Val3  Val4  Val5
0   1980     Mark    90    56    78    44    88
1   1982    Cindy    45    76    22    42    90
2   1984    Kevin    67    83    52    55    59
3   1986  Michael    94    63    73    60    43
4   1988     Anna    66    50    59    57    33
5   1990  Jessica    72    34    29    69    27
6   1992     John    80    67    90    89    68
7   1994      Tom    23    60    89    78    39
8   1996     Nick    56    98    84    44    50
9   1998    Oscar    64    61    74    59    63
10  2000     Andy    11    50    93    69    90
Another option would be to pass the column index locations up front with usecols, like so:
df = pd.read_csv('csv1.csv', header=None, usecols=[0,1,5,6,7,8,9])
Notice that this returns a dataframe whose columns are named by their index locations:
       0        1   5   6   7   8   9
0   1980     Mark  90  56  78  44  88
1   1982    Cindy  45  76  22  42  90
2   1984    Kevin  67  83  52  55  59
3   1986  Michael  94  63  73  60  43
4   1988     Anna  66  50  59  57  33
5   1990  Jessica  72  34  29  69  27
6   1992     John  80  67  90  89  68
7   1994      Tom  23  60  89  78  39
8   1996     Nick  56  98  84  44  50
9   1998    Oscar  64  61  74  59  63
10  2000     Andy  11  50  93  69  90
You could do this with a simple list comprehension:
def read_data(filename):
    data = {}
    col_nums = [0, 1, 5, 6, 7, 8, 9]
    # open the file that was passed in, and close it when done
    with open(filename, "rt") as f:
        for line in f:
            row = line.rstrip().split(',')
            data[row[0]] = [row[i] for i in col_nums]
    return data
You could also consider using Pandas to help you read and wrangle the data (read_csv takes the column names through its names parameter):
import pandas as pd
df = pd.read_csv("myfile.csv", names=['year', 'name', 'gender', 'kind', 'size', 'num1', 'num2', 'num3', 'num4', 'num5'])
data = df[['year', 'name', 'num1', 'num2', 'num3', 'num4', 'num5']]
You could split each line and assign the fields explicitly to variables, then simply ignore the variables you will not use (I named them _, so it's obvious that they will not be used). This raises a ValueError (on the line that calls split()) if a line has fewer or more fields than expected.
def read_data(filename):
    data = {}
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if len(line) > 0:
                year, name, _, _, _, n1, n2, n3, n4, n5 = line.split(',')
                data[year] = [n1, n2, n3, n4, n5]
    return data
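The standard-library csv module can do the splitting as well, which also copes with quoted fields. The sketch below reads from an in-memory sample instead of a file so that it runs on its own; `read_rows` and `KEEP` are hypothetical names, and the column positions are taken from the sample data in the question.

```python
import csv
import io

# Two sample lines from the question, held in memory instead of a file
sample = io.StringIO(
    "1980,Mark,Male,Student,L,90,56,78,44,88\n"
    "1982,Cindy,Female,Student,S,45,76,22,42,90\n"
)

KEEP = [0, 1, 5, 6, 7, 8, 9]  # year, name and the five numbers

def read_rows(handle, keep=KEEP):
    """Return a list of lists holding only the columns listed in keep."""
    return [[row[i] for i in keep] for row in csv.reader(handle) if row]

rows = read_rows(sample)
print(rows)
```

For a real file, pass an open handle instead, for example `with open(filename, newline='') as f: rows = read_rows(f)`. All values stay strings; convert them with `int()` where numbers are needed.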
How can I get combined result of a column values in a DataFrame?
I have the data below in a DataFrame.
city      age
mumbai    12 33 5 55
delhi     24 56 78 23 43 55 67
kal       12 43 55 78 34
 mumbai   14 56 78 99   # has a leading space
MUMbai    34 59         # has capital letters
kal       11
I want to convert it into the format below:
city      age
mumbai    12 33 5 55 14 56 78 99 34 59
delhi     24 56 78 23 43 55 67
kal       12 43 55 78 34 11
How can I achieve this? Note: I have edited the data; some city names are now in capital letters and some have leading spaces. How can we apply the strip() and lower() functions to them?
We use groupby with sort=False to ensure we present cities in the same order they first appear. We use ' '.join to concatenate the strings together. Lastly, we reset_index to move the city values that have been placed in the index back into the dataframe proper.
df.groupby('city', sort=False).age.apply(' '.join).reset_index()
     city                            age
0  mumbai  12 33 5 55 14 56 78 99 34 59
1   delhi          24 56 78 23 43 55 67
2     kal             12 43 55 78 34 11
Response to Edit
df.age.str.strip().groupby(
    df.city.str.strip().str.lower(), sort=False
).apply(' '.join).reset_index()
     city                            age
0  mumbai  12 33 5 55 14 56 78 99 34 59
1   delhi          24 56 78 23 43 55 67
2     kal             12 43 55 78 34 11
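For reference, here is a self-contained version of the same idea that can be pasted and run. The frame is rebuilt by hand from the question's data, so the exact string values are an assumption, and `out` is a name chosen for this sketch. It normalizes the city column first and then groups on it.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['mumbai', 'delhi', 'kal', ' mumbai', 'MUMbai', 'kal'],
    'age': ['12 33 5 55', '24 56 78 23 43 55 67', '12 43 55 78 34',
            '14 56 78 99', '34 59', '11'],
})

out = (df.assign(city=df['city'].str.strip().str.lower())  # normalize keys
         .groupby('city', sort=False)['age']               # keep first-seen order
         .agg(' '.join)                                    # concatenate the strings
         .reset_index())
print(out)
```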
Convert dictionary of dictionaries into DataFrame Python
I am trying to convert the following dictionary of dictionaries into a pandas DataFrame. My dictionary looks like this:
mydata = {1965:{1:52, 2:54, 3:67, 4:45},
          1966:{1:34, 2:34, 3:35, 4:76},
          1967:{1:56, 2:56, 3:54, 4:34}}
And I need to get a resulting dataframe that looks like this:
Sector  1965  1966  1967
1         52    34    56
2         54    34    56
3         67    35    54
4         45    76    34
I was using something like this, but I'm not getting the result that I need.
df = pd.DataFrame([[col1,col2,col3] for col1, d in test.items() for col2, col3 in d.items()])
Thanks a lot for your help!
You can use DataFrame.from_records:
import pandas as pd
ydata = {1965:{1:52, 2:54, 3:67, 4:45},
         1966:{1:34, 2:34, 3:35, 4:76},
         1967:{1:56, 2:56, 3:54, 4:34}}
print (pd.DataFrame.from_records(ydata))
   1965  1966  1967
1    52    34    56
2    54    34    56
3    67    35    54
4    45    76    34
print (pd.DataFrame.from_records(ydata).reset_index().rename(columns={'index':'Sector'}))
   Sector  1965  1966  1967
0       1    52    34    56
1       2    54    34    56
2       3    67    35    54
3       4    45    76    34
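The plain DataFrame constructor also accepts a dictionary of dictionaries: the outer keys become columns and the inner keys become the index. A minimal sketch, with `df` as an illustrative name:

```python
import pandas as pd

mydata = {1965: {1: 52, 2: 54, 3: 67, 4: 45},
          1966: {1: 34, 2: 34, 3: 35, 4: 76},
          1967: {1: 56, 2: 56, 3: 54, 4: 34}}

# Outer keys -> columns, inner keys -> index; then name the index
# 'Sector' and turn it into an ordinary column.
df = pd.DataFrame(mydata).rename_axis('Sector').reset_index()
print(df)
```

To go the other way and have the years as rows, `pd.DataFrame.from_dict(mydata, orient='index')` is available.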