How to use the melt function in pandas for a large table?

I currently have data which looks like this:
   Afghanistan_co2  Afghanistan_income  Year  Afghanistan_population  Albania_co2
1              NaN                 603  1801                 3280000          NaN
2              NaN                 603  1802                 3280000          NaN
3              NaN                 603  1803                 3280000          NaN
4              NaN                 603  1804                 3280000          NaN
and I would like to use melt to turn it into a long format, with the labels 'Year', 'Country', 'population Value', 'co2 Value', 'income value'.
It is a large dataset with many rows and columns, so I am not sure how to proceed. This is all I have so far:
pd.melt(merged_countries_final, id_vars=['Year'])
I've done this since there does exist a column in the dataset titled 'Year'.
What should I do?

Just split your column names with str.split:
df.set_index('Year', inplace=True)
# 'Afghanistan_co2' -> ('Afghanistan', 'co2')
df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('_').map(tuple))
# move the country level of the columns into the index
df = df.stack(level=0).reset_index().rename(columns={'level_1': 'Country'})
df
   Year      Country  co2  income  population
0  1801  Afghanistan  NaN   603.0   3280000.0
1  1802  Afghanistan  NaN   603.0   3280000.0
2  1803  Afghanistan  NaN   603.0   3280000.0
3  1804  Afghanistan  NaN   603.0   3280000.0
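Since the question asked about melt specifically, an equivalent melt-based route also works. A minimal sketch, assuming the frame is named merged_countries_final as in the question:
import pandas as pd

# melt every country_measure column down to long format
long_df = merged_countries_final.melt(id_vars='Year')

# 'Afghanistan_co2' -> Country='Afghanistan', measure='co2'
# (rsplit from the right keeps country names that themselves contain '_' intact)
long_df[['Country', 'measure']] = long_df['variable'].str.rsplit('_', n=1, expand=True)

# pivot the measure names back out into co2/income/population columns
result = (long_df.set_index(['Year', 'Country', 'measure'])['value']
                 .unstack('measure')
                 .reset_index())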

Related

Calculating mean values yearly in a dataframe with a new value daily

I have this dataframe, which contains average temps for all the summer days:
DATE TAVG
0 1955-06-01 NaN
1 1955-06-02 NaN
2 1955-06-03 NaN
3 1955-06-04 NaN
4 1955-06-05 NaN
... ... ...
5805 2020-08-27 2.067854
5806 2020-08-28 3.267854
5807 2020-08-29 3.067854
5808 2020-08-30 1.567854
5809 2020-08-31 4.167854
I want to calculate the yearly mean value so I can plot it. How could I do that?
If I understand correctly, you can convert DATE to datetime and group by year:
df['DATE'] = pd.to_datetime(df['DATE'])        # make DATE a real datetime
df.groupby(df['DATE'].dt.year)['TAVG'].mean()  # mean TAVG per calendar year
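Since the stated goal is a plot, a minimal follow-up sketch (assuming matplotlib is available):
import pandas as pd
import matplotlib.pyplot as plt

df['DATE'] = pd.to_datetime(df['DATE'])
yearly = df.groupby(df['DATE'].dt.year)['TAVG'].mean()  # one mean per year

yearly.plot()  # year on the x-axis, mean TAVG on the y-axis
plt.xlabel('Year')
plt.ylabel('Mean TAVG')
plt.show()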

Rename a hundred or more columns in a pandas dataframe

I am working with the Johns Hopkins Covid data for personal use to create charts. The data shows cumulative deaths by country; I want deaths per day. It seems to me the easiest way is to create two dataframes and subtract one from the other. But the file has column names as dates, and code such as df3 = df2 - df1 subtracts the columns with matching dates. So I want to rename all the columns with some easy index, for example 1, 2, 3, ....
I cannot figure out how to do this.
new_names = list(range(data.shape[1]))
data.columns = new_names
This renames the columns of data from 0 upwards.
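As a side note, pandas also accepts a range directly, so the two lines collapse into one:
data.columns = range(data.shape[1])  # same effect as the list above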
You could re-shape the data: use dates as row labels, and use (country, province) as column labels.
import pandas as pd
covid_csv = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
df_raw = (pd.read_csv(covid_csv)
            .set_index(['Country/Region', 'Province/State'])
            .drop(columns=['Lat', 'Long'])
            .transpose())
df_raw.index = pd.to_datetime(df_raw.index)  # row labels become dates
print(df_raw.iloc[-5:, 0:5])
Country/Region Afghanistan Albania Algeria Andorra Angola
Province/State NaN NaN NaN NaN NaN
2020-07-27 1269 144 1163 52 41
2020-07-28 1270 148 1174 52 47
2020-07-29 1271 150 1186 52 48
2020-07-30 1271 154 1200 52 51
2020-07-31 1272 157 1210 52 52
Now, you can use the rich set of pandas tools for time-series analysis. For example, use diff() to go from cumulative deaths to per-day rates. Or, you could compute N-day moving averages, create time-series plots, ...
print(df_raw.diff().iloc[-5:, 0:5])
Country/Region Afghanistan Albania Algeria Andorra Angola
Province/State NaN NaN NaN NaN NaN
2020-07-27 10.0 6.0 8.0 0.0 1.0
2020-07-28 1.0 4.0 11.0 0.0 6.0
2020-07-29 1.0 2.0 12.0 0.0 1.0
2020-07-30 0.0 4.0 14.0 0.0 3.0
2020-07-31 1.0 3.0 10.0 0.0 1.0
Finally, df_raw.sum(level='Country/Region', axis=1) will aggregate all Provinces within a Country.
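Note that in recent pandas versions the level argument of sum has been removed; a minimal equivalent sketch is to transpose, group on the index level, and transpose back:
df_raw.T.groupby(level='Country/Region').sum().T  # aggregate provinces per country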
Thanks for the time and effort but I figured out a simple way.
for i, row in enumerate(df):
    df.rename(columns={row: str(i)}, inplace=True)
to change the column names, and then
for i, row in enumerate(df):
    df.rename(columns={row: str(i + 43853)}, inplace=True)
to change them back to the dates I want.
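For what it's worth, each loop can also be a single assignment; a sketch using the same 43853 offset:
df.columns = [str(i) for i in range(df.shape[1])]          # 0, 1, 2, ...
df.columns = [str(i + 43853) for i in range(df.shape[1])]  # back to the date serials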

Switching year columns to one column in Python Eurostat

I have a large dataset from Eurostat. I am trying to move all year columns 1960-2019 to a single column called "year". How can I do this?
A sample of the data:
  unit sex    age geo\time       2019       2018       2017       2016       2015       2014  ... 1969 1968 1967 1960
0   NR   F  TOTAL       AD    37388.0        NaN        NaN        NaN        NaN        NaN  ...  NaN  NaN  NaN  NaN
1   NR   F  TOTAL       AL  1432833.0  1431715.0  1423050.0  1417141.0  1424597.0  1430827.0  ...  NaN  NaN  NaN  NaN
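This is the same wide-to-long reshape as the main question; a minimal sketch with pd.melt, assuming the frame is named df and the identifier columns are the four shown in the sample:
import pandas as pd

id_cols = ['unit', 'sex', 'age', 'geo\\time']  # non-year columns (note the escaped backslash)
long_df = df.melt(id_vars=id_cols, var_name='year', value_name='value')
long_df['year'] = long_df['year'].astype(int)  # year labels arrive as strings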

How to extract data from a pandas dataframe based upon values of other columns?

I have a df:
period  store  item
     1     32  'A'
     1     34  'A'
     1     32  'B'
     1     34  'B'
     2     42  'X'
     2     44  'X'
     2     42  'Y'
     2     44  'Y'
I want to find all the stores for each item in each period,
preferably in a dictionary like this:
dicta = {1: {'A': (32, 34),'B': (32, 34)}, 2: {'X': (42, 44),'Y': (42, 44)}}
EDIT for @jezrael
Actual df
RTYPE PERIOD_ID STORE_ID MKT MTYPE RGROUP RZF RXF
0 MKT 317 13178 Kiosks_11 CELL NaN NaN NaN
1 MKT 306 11437 Kiosks_11 CELL NaN NaN NaN
2 MKT 306 12236 Kiosks_11 CELL NaN NaN NaN
3 MKT 312 11024 Kiosks_11 CELL NaN NaN NaN
4 MKT 307 13010 Kiosks_11 CELL NaN NaN NaN
5 MKT 307 12723 Kiosks_11 CELL NaN NaN NaN
6 MKT 306 14218 Kiosks_11 CELL NaN NaN NaN
7 MKT 306 13547 Kiosks_11 CELL NaN NaN NaN
8 MKT 316 12396 Kiosks_11 CELL NaN NaN NaN
9 MKT 306 10778 Cafes_638 CELL NaN NaN NaN
10 MKT 317 11230 Kiosks_11 CELL NaN NaN NaN
11 MKT 315 13630 Kiosks_11 CELL NaN NaN NaN
12 MKT 314 14113 Bars_13 CELL NaN NaN NaN
13 MKT 314 12089 Kiosks_11 CELL NaN NaN NaN
Here, PERIOD_ID, STORE_ID, and MKT are the periods, stores, and items respectively.
The edit suggested by @jezrael returns this for the above df:
d1={306L: (8207L, 8209L .... 8210L, 8211L),307L:( 8215L, 8219L ... 8233L, 8235L), 308: (8238L, 8239L....8244L, 8252L) ..k:(v) ..}
(Note: Edited to make it look small as the original dictionary is huge)
For the sample data it works fine as expected, but for this dataframe it doesn't.
Edit for @jezrael as a Minimal, Reproducible Example.
df=
RTYPE PERIOD_ID STORE_ID MKT MTYPE RGROUP RZF RXF
0 MKT 20171411 3102300001 PM KA+PM PROV+SMKT+PETRO CELL NaN NaN NaN
1 MKT 20171411 3102300002 PM KA+PM PROV+SMKT+PETRO CELL NaN NaN NaN
2 MKT 20171411 3104001193 PM Provision CELL NaN NaN NaN
3 MKT 20171411 3104001193 PM KA+PM PROV+SMKT+PETRO CELL NaN NaN NaN
4 MKT 20171411 3104001193 Provision including MM CELL NaN NaN NaN
5 MKT 20171411 3104001641 PM Provision CELL NaN NaN NaN
6 MKT 20171411 3104001641 PM KA+PM PROV+SMKT+PETRO CELL NaN NaN NaN
7 MKT 20171411 3104001641 Provision including MM CELL NaN NaN NaN
8 MKT 20171411 3104001682 PM Provision CELL NaN NaN NaN
9 MKT 20171411 3104001682 PM KA+PM PROV+SMKT+PETRO CELL NaN NaN NaN
10 MKT 20171411 3104001682 Provision including MM CELL NaN NaN NaN
11 MKT 20171412 3104001682 Alcohol CELL NaN NaN NaN
12 MKT 20171412 3104001682 Fish CELL NaN NaN NaN
13 MKT 20171412 3104001684 Alcohol CELL NaN NaN NaN
14 MKT 20171412 3104001684 Fish CELL NaN NaN NaN
Current output as per @jezrael's code:
{20171411L: ('Provision including MM', 'PM Provision', 'PM KA+PM PROV+SMKT+PETRO'), 20171412L: ('Fish', 'Alcohol')}
Expected output:
{20171411L: ('Provision including MM', 'PM Provision'), 20171412L: ('Fish', 'Alcohol')}
For period 20171411L, the MKTs 'Provision including MM' and 'PM Provision' are duplicates because they have the same set of store_ids; likewise for period 20171412L, 'Fish' and 'Alcohol' are duplicates because they share the same set of store_ids.
I am new to pandas but have some basic knowledge of Python.
I am really not sure how to achieve this.
Any help would be great.
You can do this with a dict comprehension:
dicta = {p: g.groupby('item')['store'].apply(tuple).to_dict()
for p, g in df.groupby('period')}
[out]
{1: {"'A'": (32, 34), "'B'": (32, 34)}, 2: {"'X'": (42, 44), "'Y'": (42, 44)}}
Create a MultiIndex Series, then build the nested dictionary in a dictionary comprehension:
s = df.groupby(['period','item'])['store'].apply(tuple)
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
print (d)
{1: {'A': (32, 34), 'B': (32, 34)}, 2: {'X': (42, 44), 'Y': (42, 44)}}
EDIT: You can group by period and convert item to sets, then to tuples:
d1 = {k:tuple(set(v)) for k, v in df.groupby('period')['item']}
print (d1)
{1: ('A', 'B'), 2: ('X', 'Y')}
d1 = df.groupby('period')['item'].apply(lambda x: tuple(set(x))).to_dict()
print (d1)
{1: ('A', 'B'), 2: ('X', 'Y')}
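For the follow-up in the edits (keep, per period, only the MKTs that share an identical set of STORE_IDs with another MKT), a minimal sketch assuming the column names shown above:
# one frozenset of stores per (period, MKT) pair
sets_ = (df.groupby(['PERIOD_ID', 'MKT'])['STORE_ID']
           .apply(frozenset)
           .reset_index(name='stores'))

# keep only MKTs whose store set occurs more than once within their period
dup = sets_[sets_.duplicated(['PERIOD_ID', 'stores'], keep=False)]
d1 = dup.groupby('PERIOD_ID')['MKT'].apply(tuple).to_dict()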

How to skip lines at the end of the xls in a pandas dataframe

I have a dataframe:
Energy Supply Energy Supply per Capita % Renewable
Country
Afghanistan 3.210000e+08 10 78.669280
Albania 1.020000e+08 35 100.000000
British Virgin Islands 2.000000e+06 85 0.000000
...
Aruba 1.200000e+07 120 14.870690 ...
Excludes the overseas territories. NaN NaN NaN
Data exclude Hong Kong and Macao Special Admini... NaN NaN NaN
Data on kerosene-type jet fuel include aviation... NaN NaN NaN
For confidentiality reasons, data on coal and c... NaN NaN NaN
Data exclude Greenland and the Danish Faroes. NaN NaN NaN
I used df = pd.read_excel(filelink, skiprows=16) to cut unwanted information at the very beginning of the file, but how can I get rid of the "noise" rows at the end of df?
I tried passing a list to skiprows, but it messed up the results.
It seems you need the parameter skip_footer = 5 in read_excel (spelled skipfooter in newer pandas versions):
skip_footer : int, default 0
Rows at the end to skip (0-indexed)
Sample:
df = pd.read_excel('myfile.xlsx', skip_footer = 5)
print (df)
Country Energy Supply Energy Supply per Capita \
0 Afghanistan 321000000.0 10
1 Albania 102000000.0 35
2 British Virgin Islands 2000000.0 85
3 Aruba 12000000.0 120
% Renewable
0 78.66928
1 100.00000
2 0.00000
3 14.87069
Another solution is to remove all rows where certain columns are all NaN, using dropna:
df = pd.read_excel('myfile.xlsx')
cols = ['Energy Supply','Energy Supply per Capita','% Renewable']
df = df.dropna(subset=cols, how='all')
print (df)
Country Energy Supply Energy Supply per Capita \
0 Afghanistan 321000000.0 10.0
1 Albania 102000000.0 35.0
2 British Virgin Islands 2000000.0 85.0
3 Aruba 12000000.0 120.0
% Renewable
0 78.66928
1 100.00000
2 0.00000
3 14.87069
