My dataset has some information by location for n dates. The CSV looks like:
Country   year2018        year2019        year2020
          saleA   saleB   saleA   saleB   saleA   saleB
USA       22      23      323     32      31      65
china     12      12      2       66      66      78
I want my data to be of the form:
Country   year      saleA   saleB
USA       year2018  22      23
USA       year2019  323     32
USA       year2020  31      65
china     year2018  12      12
...
How can I do this using pandas?
I tried pd.melt but couldn't figure it out.
You can reshape your dataframe with set_index and stack:
out = (df.set_index('Country')
         .rename_axis(columns=['year', None])
         .stack('year')
         .reset_index())
  Country      year  saleA  saleB
0     USA  year2018     22     23
1     USA  year2019    323     32
2     USA  year2020     31     65
3   china  year2018     12     12
4   china  year2019      2     66
5   china  year2020     66     78
Another solution with melt and pivot_table:
>>> out = (df.melt(id_vars='Country', var_name=['year', 'sale'])
             .pivot_table(index=['Country', 'year'], columns='sale', values='value')
             .reset_index())
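Both solutions assume df already carries the two header rows as a MultiIndex on the columns. If you are starting from the CSV itself, here is a minimal sketch of how you might build that frame, assuming a comma-separated file (sales.csv is just a placeholder name) in which each year label is repeated over both of its sale columns; index_col=0 then takes the place of set_index('Country'):
import pandas as pd

# read both header rows as a column MultiIndex and take the first column as the index
df = pd.read_csv('sales.csv', header=[0, 1], index_col=0)
df.index.name = 'Country'   # label the index explicitly, whatever the header cells contained

out = (df.rename_axis(columns=['year', None])
         .stack('year')
         .reset_index())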
I have two dataframes, Big and Small, and I want to update Big based on the data in Small, only in specific columns.
This is Big:
>>> ID name country city hobby age
0 12 Meli Peru Lima eating 212
1 15 Saya USA new-york drinking 34
2 34 Aitel Jordan Amman riding 51
3 23 Tanya Russia Moscow sports 75
4 44 Gil Spain Madrid paella 743
and this is Small:
>>> ID name country city hobby age
0 12 Melinda Peru Lima eating 24
4 44 Gil Spain Barcelona friends 21
I would like to update the rows in Big based on info from Small, matching on the ID number. I would also like to change only specific columns, the age and the city, and not the name/country/hobby.
So the result table should look like this:
>>> ID name country city hobby age
0 12 Meli Peru Lima eating *24*
1 15 Saya USA new-york drinking 34
2 34 Aitel Jordan Amman riding 51
3 23 Tanya Russia Moscow sports 75
4 44 Gil Spain *Barcelona* paella *21*
I know to use update, but in this case I don't want to change all the columns in each row, only specific ones. Is there a way to do that?
Use DataFrame.update with ID converted to the index, selecting only the columns to process - here age and city:
# align both frames on ID
df11 = df1.set_index('ID')
# keep only the columns that should be updated
df22 = df2.set_index('ID')[['age','city']]

# update matching rows in place
df11.update(df22)

df = df11.reset_index()
print (df)
ID name country city hobby age
0 12 Meli Peru Lima eating 24.0
1 15 Saya USA new-york drinking 34.0
2 34 Aitel Jordan Amman riding 51.0
3 23 Tanya Russia Moscow sports 75.0
4 44 Gil Spain Barcelona paella 21.0
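Note that age comes back as 24.0, 34.0, ... because update aligns the smaller frame on Big's index, introducing NaN for the missing IDs, which promotes the column to float. If you want the integer dtype back, a simple cast afterwards should do it:
# cast age back to integer after the update
df['age'] = df['age'].astype(int)
print (df)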
How can we import Excel data with merged cells?
Please find the Excel sheet image.
The last column has 3 sub-columns. How can we import it without making changes to the Excel sheet?
You could try this:
import pandas as pd

# Store the file name in a variable
dataset = 'Merged_Column_Data.xlsx'

# Import the dataset, skipping the first (merged) header row
df = pd.read_excel(dataset, skiprows=1)
Unnamed: 0 Unnamed: 1 Unnamed: 2 Gold Silver Bronze
0 Great Britain GBR 2012 29 17 19
1 China CHN 2012 38 28 22
2 Russia RUS 2012 24 25 32
3 United States US 2012 46 28 29
4 Korea KOR 2012 13 8 7
# Create dictionary to handle unnamed columns
col_dict = {'Unnamed: 0': 'Country', 'Unnamed: 1': 'Country',
            'Unnamed: 2': 'Year'}
# Rename columns with dictionary
df.rename(columns=col_dict)
Country Country Year Gold Silver Bronze
0 Great Britain GBR 2012 29 17 19
1 China CHN 2012 38 28 22
2 Russia RUS 2012 24 25 32
3 United States US 2012 46 28 29
4 Korea KOR 2012 13 8 7
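Note that rename returns a copy by default, so assign it back if you want to keep working with the renamed frame. You could also give the second unnamed column its own name instead of having two columns both called Country ('Code' below is only an example name, since that column holds the country codes):
# assign the rename back; 'Code' is just an example name for the second column
df = df.rename(columns={'Unnamed: 0': 'Country',
                        'Unnamed: 1': 'Code',
                        'Unnamed: 2': 'Year'})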
Here is my problem:
Below you will find a Pandas DataFrame. I would like to group by Date and then filter within the subgroups, but I am having a lot of difficulty doing it (I have spent 3 hours on this and haven't found any solution).
This is what I am looking for:
I first have to group everything by date, then sort the scores from highest to lowest within each subgroup, and then select the two best scores, but they have to come from different countries.
(For example, if the two best scores are from the same country, then we keep the higher one and take the next best score from a country different from the first.)
This is the DataFrame:
Date Name Score Country
2012 Paul 65 France
2012 Silvia 81 Italy
2012 David 80 UK
2012 Alphonse 46 France
2012 Giovanni 82 Italy
2012 Britney 53 UK
2013 Paul 32 France
2013 Silvia 59 Italy
2013 David 92 UK
2013 Alphonse 68 France
2013 Giovanni 23 Italy
2013 Britney 78 UK
2014 Paul 46 France
2014 Silvia 87 Italy
2014 David 89 UK
2014 Alphonse 76 France
2014 Giovanni 53 Italy
2014 Britney 90 UK
The result I am looking for is something like this:
Date Name Score Country
2012 Giovanni 82 Italy
2012 David 80 UK
2013 David 92 UK
2013 Alphonse 68 France
2014 Britney 90 UK
2014 Silvia 87 Italy
Here is the code I started with:
df = pd.DataFrame(
    {'Date': ["2012"] * 6 + ["2013"] * 6 + ["2014"] * 6,
     'Name': ["Paul", "Silvia", "David", "Alphonse", "Giovanni", "Britney"] * 3,
     'Score': [65, 81, 80, 46, 82, 53, 32, 59, 92, 68, 23, 78, 46, 87, 89, 76, 53, 90],
     'Country': ["France", "Italy", "UK"] * 6})

df = (df.set_index('Name')
        .groupby('Date')[['Score', 'Country']]
        .apply(lambda _df: _df.sort_values('Score', ascending=False)))
And this is what I get:
But as you can see, for example in 2012 the two best scores are from the same country (Italy), so what I still have to do is:
1. Select the maximum score per country for each year
2. Select only the two best scores (and the countries have to be different).
I would be really thankful, because I really don't know how to do it.
If somebody has some ideas on that, please share them :)
PS: please don't hesitate to tell me if this wasn't clear enough
First use DataFrame.sort_values on Date and Score (scores descending), then remove duplicates on Date and Country with DataFrame.drop_duplicates, and finally select the top two rows per group with GroupBy.head:
df1 = (df.sort_values(['Date', 'Score'], ascending=[True, False])
         .drop_duplicates(['Date', 'Country'])
         .groupby('Date')
         .head(2))
print (df1)
Date Name Score Country
4 2012 Giovanni 82 Italy
2 2012 David 80 UK
8 2013 David 92 UK
9 2013 Alphonse 68 France
17 2014 Britney 90 UK
13 2014 Silvia 87 Italy
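If you prefer more explicit steps, another way to get the best row per country within each year is GroupBy.idxmax on Score; for this data it gives the same result as the chain above:
# index labels of the highest Score per (Date, Country)
idx = df.groupby(['Date', 'Country'])['Score'].idxmax()
best_per_country = df.loc[idx]

# keep the two best of those per year, ordered by year then score
df1 = (best_per_country.sort_values(['Date', 'Score'], ascending=[True, False])
                       .groupby('Date')
                       .head(2))
print (df1)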
I have a Pandas DataFrame which looks like the following (df_olympic).
I would like the values of the column Type to be transformed into independent columns (df_olympic_table).
Original dataframe
In [3]: df_olympic
Out[3]:
Country Type Num
0 USA Gold 46
1 USA Silver 37
2 USA Bronze 38
3 GB Gold 27
4 GB Silver 23
5 GB Bronze 17
6 China Gold 26
7 China Silver 18
8 China Bronze 26
9 Russia Gold 19
10 Russia Silver 18
11 Russia Bronze 19
Transformed dataframe
In [5]: df_olympic_table
Out[5]:
Country N_Gold N_Silver N_Bronze
0 USA 46 37 38
1 GB 27 23 17
2 China 26 18 26
3 Russia 19 18 19
What would be the most convenient way to achieve this?
You can use DataFrame.pivot:
df = df.pivot(index='Country', columns='Type', values='Num')
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 19 19 18
USA 38 46 37
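If you also want the exact df_olympic_table layout from the question (Country as a regular column, medal columns prefixed with N_ and ordered Gold/Silver/Bronze), one option is to chain a few cosmetic steps on top of the pivot:
df_olympic_table = (df_olympic.pivot(index='Country', columns='Type', values='Num')
                              .reindex(columns=['Gold', 'Silver', 'Bronze'])
                              .add_prefix('N_')
                              .rename_axis(columns=None)
                              .reset_index())
Note that the rows come back in alphabetical Country order, as in the pivot output above, rather than in the original row order.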
Another solution with DataFrame.set_index and Series.unstack:
df = df.set_index(['Country','Type'])['Num'].unstack()
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 19 19 18
USA 38 46 37
But if you get:
ValueError: Index contains duplicate entries, cannot reshape
you need pivot_table with some aggregate function; by default it is np.mean, but you can use sum, first, ...
# add a new row with duplicate values in 'Country' and 'Type'
print (df)
Country Type Num
0 USA Gold 46
1 USA Silver 37
2 USA Bronze 38
3 GB Gold 27
4 GB Silver 23
5 GB Bronze 17
6 China Gold 26
7 China Silver 18
8 China Bronze 26
9 Russia Gold 19
10 Russia Silver 18
11 Russia Bronze 20 <- changed value to 20
11 Russia Bronze 100 <- added new row with duplicate keys
import numpy as np

df = df.pivot_table(index='Country', columns='Type', values='Num', aggfunc=np.mean)
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 60 19 18 <- Russia gets (100 + 20) / 2 = 60
USA 38 46 37
Or use groupby with mean aggregation and reshape with unstack:
df = df.groupby(['Country','Type'])['Num'].mean().unstack()
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 60 19 18 <- Russia gets (100 + 20) / 2 = 60
USA 38 46 37
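For example, starting again from the long frame with the duplicated row, aggfunc='sum' adds the Russia/Bronze values up instead of averaging them, giving 120 (20 + 100) rather than 60:
df_sum = df.pivot_table(index='Country', columns='Type', values='Num', aggfunc='sum')
print (df_sum)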