Pandas: Repeating list in column does not work - python

I want to turn a dataframe from this
to this:
It took me a while to figure out the melt and transpose function to get to this
But I could not manage to repeat the years from 1990 to 2019 for each of the 189 countries.
I tried:
year_list = []
for year in range(1990, 2020, 1):
    year_list.append(year)
years = pd.Series(year_list)
years
and then
df['year'] = years.repeat(30)
(I need to repeat it 30 times, because the frame consists of 5670 rows = 189 countries * 30 years)
I got this error message:
ValueError: cannot reindex on an axis with duplicate labels
Googling this error does not help.

One approach could be as follows:
Sample data
import pandas as pd
import numpy as np
data = {'country': ['Afghanistan','Angola']}
data.update({k: np.random.rand() for k in range(1990,1993)})
df = pd.DataFrame(data)
print(df)
country 1990 1991 1992
0 Afghanistan 0.103589 0.950523 0.323925
1 Angola 0.103589 0.950523 0.323925
Code
res = (df.set_index('country')
         .unstack()
         .sort_index(level=1)
         .reset_index(drop=False)
         .rename(columns={'country': 'geo',
                          'level_0': 'time',
                          0: 'hdi_human_development_index'})
       )
print(res)
time geo hdi_human_development_index
0 1990 Afghanistan 0.103589
1 1991 Afghanistan 0.950523
2 1992 Afghanistan 0.323925
3 1990 Angola 0.103589
4 1991 Angola 0.950523
5 1992 Angola 0.323925
Explanation
Use df.set_index on column country and apply df.unstack to add the years from the column names to the index.
Now, we use df.sort_index on level=1 to get the countries in alphabetical order.
Finally, we use df.reset_index with drop parameter set to False to get the index back as columns, and we chain df.rename to customize the column names.
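As for why the original attempt fails: Series.repeat repeats each element consecutively and keeps the original index labels, so the result carries duplicate labels. Assigning it to a column makes pandas align it with the frame's index, which raises the ValueError. A minimal sketch of one alternative follows, assuming the rows are already ordered country by country with the years ascending within each country. The three-country frame is made up for illustration. It tiles a plain NumPy array, which is assigned by position rather than by label:

```python
import numpy as np
import pandas as pd

# Hypothetical long frame: 3 countries x 3 years, ordered country by country.
countries = ["Afghanistan", "Albania", "Angola"]
years = np.arange(1990, 1993)
df = pd.DataFrame({"country": np.repeat(countries, len(years))})

# np.tile repeats the whole sequence (1990, 1991, 1992, 1990, ...),
# whereas repeat would give 1990, 1990, 1990, 1991, ...
# A NumPy array carries no index, so no label alignment takes place.
df["year"] = np.tile(years, len(countries))
print(df)
```

For the real data the same idea would be np.tile(np.arange(1990, 2020), 189), provided the row order matches that assumption.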

Related

How to get calendar years as column names and month and day as index for one timeseries

I have looked for solutions but seem to find none that point me in the right direction, hopefully, someone on here can help. I have a stock price data set, with a frequency of Month Start. I am trying to get an output where the calendar years are the column names, and the day and month will be the index (there will only be 12 rows since it is monthly data). The rows will be filled with the stock prices corresponding to the year and month. I, unfortunately, have no code since I have looked at for loops, groupby, etc but can't seem to figure this one out.
You might want to split the date into month and year and to apply a pivot:
s = pd.to_datetime(df.index)
out = (df
       .assign(year=s.year, month=s.month)
       .pivot_table(index='month', columns='year', values='Close', fill_value=0)
       )
output:
year 2003 2004
month
1 0 2
2 0 3
3 0 4
12 1 0
Used input:
df = pd.DataFrame({'Close': [1,2,3,4]},
index=['2003-12-01', '2004-01-01', '2004-02-01', '2004-03-01'])
You need multiple steps to do that.
First split your column into the right format.
Then convert this column into two separate columns.
Then pivot the table accordingly.
import pandas as pd
# Test Dataframe
df = pd.DataFrame({'Date': ['2003-12-01', '2004-01-01', '2004-02-01', '2004-12-01'],
'Close': [6.661, 7.053, 6.625, 8.999]})
# Split datestring into list of form [year, month-day]
df = df.assign(Date=df.Date.str.split(pat='-', n=1))
# Separate date-list column into two columns
df = pd.DataFrame(df.Date.to_list(), columns=['Year', 'Date'], index=df.index).join(df.Close)
# Pivot the table
df = df.pivot(columns='Year', index='Date')
df
Output:
Close
Year 2003 2004
Date
01-01 NaN 7.053
02-01 NaN 6.625
12-01 6.661 8.999

How could I transform the numpy array to pandas dataframe?

I am new to data analysis with Python, and I wonder how I can transform the left table into the format of the right one. My initial thought is to create a nested for loop.
The desired table
First, I read the required CSV file.
Imported csv
Then I count the number of countries in the column 'country' and the number of names in the list of new columns.
`countries = len(test['country'])`
`columns = len(['Year', 'Values'])`
After that, I should write the nested for loop, but I have no idea how to write the code. What I have come up with so far is as follows:
`for i in countries:`
`for j in columns:`
You can use df.melt here:
In [3575]: df = pd.DataFrame({'country':['Afghanistan', 'Albania'], '1970':[1.36, 6.1], '1971':[1.39, 6.22], '1972':[1.43, 6.34]})
In [3576]: df
Out[3576]:
country 1970 1971 1972
0 Afghanistan 1.36 1.39 1.43
1 Albania 6.10 6.22 6.34
In [3609]: df = df.melt('country', var_name='Year', value_name='Values').sort_values('country')
In [3610]: df
Out[3610]:
country Year Values
0 Afghanistan 1970 1.36
2 Afghanistan 1971 1.39
4 Afghanistan 1972 1.43
1 Albania 1970 6.10
3 Albania 1971 6.22
5 Albania 1972 6.34
Not sure of what you want to do, but:
If you want to convert a column into a NumPy array, you can use the following example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"foo": [1,2,3], "bar": [10,20,30]})
print(df)
foo_array = np.array(df["foo"])
print(foo_array)
and then iterate over foo_array.
You can also loop over your data frame using:
for row in df.iterrows():
    print(row)
But this is not recommended, since you can often use built-in pandas functions to do the same job.
Iterating over the data frame directly yields only its column names:
for d in df:
    print(d)
# output:
# foo
# bar
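To illustrate the point about built-in functions, here is a small sketch on the same foo/bar frame. The total column is invented for the example. Both versions compute the same values, but the second one works on whole columns and avoids the Python-level loop:

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [10, 20, 30]})

# Row-by-row version: iterrows yields (index, row) pairs.
totals_loop = []
for idx, row in df.iterrows():
    totals_loop.append(row["foo"] + row["bar"])

# Vectorized version: add the two columns in a single operation.
df["total"] = df["foo"] + df["bar"]

print(totals_loop)
print(df["total"].tolist())
```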

creating a function for mathematical data imputation python

I am performing a number of similar operations and would like to write a function, but I am not sure how to approach this. I am calculating replacement values for the 0 entries in the following series:
the formula is 2 * value in 2001 - value in 2002
I currently do it one by one in Python:
print(full_data.loc['Croatia', 'fertile_age_pct'])
print(full_data.loc['Croatia', 'working_age_pct'])
print(full_data.loc['Croatia', 'young_age'])
print(full_data.loc['Croatia', 'old_age'])
full_data.replace(to_replace={'fertile_age_pct': {0:(2*46.420061-46.326103)}}, inplace=True)
full_data.replace(to_replace={'working_age_pct': {0:(2*67.038157-66.889212)}}, inplace=True)
full_data.replace(to_replace={'young_age': {0:(2*0.723475-0.715874)}}, inplace=True)
full_data.replace(to_replace={'old_age': {0:(2*0.692245-0.709597)}}, inplace=True)
Data frame (full_data):
geo_full year fertile_age_pct working_age_pct young_age old_age
Croatia 2000 0 0 0 0
Croatia 2001 46.420061 67.038157 0.723475 0.692245
Croatia 2002 46.326103 66.889212 0.715874 0.709597
Croatia 2003 46.111822 66.771187 0.706091 0.72444
Croatia 2004 45.929829 66.782133 0.694854 0.735333
Croatia 2005 45.695932 66.742514 0.686534 0.747083
So you are trying to fill the 0 values in year 2000 with your formula. If you have any other country in the DataFrame then it can get messy.
Assuming the year with 0's is always the first year for each country, try this:
full_data.set_index('year', inplace=True)
fixed_data = {}
for country, df in full_data.groupby('geo_full')[full_data.columns[1:]]:
    df = df.copy()  # work on a copy of the group, not a view
    if df.iloc[0].sum() == 0:
        # 2 * (second year) - (third year), i.e. 2 * value in 2001 - value in 2002
        df.iloc[0] = df.iloc[1] * 2 - df.iloc[2]
    fixed_data[country] = df
fixed_data = pd.concat(list(fixed_data.values()), keys=fixed_data.keys(), names=['geo_full'], axis=0)
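Since the question asks for a reusable function, one possible sketch is shown here. The name impute_first_year and its parameters are hypothetical. It assumes each country's rows are sorted by year, that the row to fill is the first one and is zero in all listed columns, and that at least three rows exist per country. It applies the formula from the question, 2 * value in 2001 - value in 2002, to two of the columns from the sample data:

```python
import pandas as pd

def impute_first_year(data, group_col, value_cols):
    """Fill an all-zero first row per group with 2 * second row - third row."""
    pieces = []
    for _, group in data.groupby(group_col, sort=False):
        group = group.copy()
        vals = group[value_cols]
        if len(group) >= 3 and (vals.iloc[0] == 0).all():
            # Assign by label so the values align with the column names.
            group.loc[group.index[0], value_cols] = 2 * vals.iloc[1] - vals.iloc[2]
        pieces.append(group)
    return pd.concat(pieces)

# First three Croatia rows from the question, two columns only.
full_data = pd.DataFrame({
    "geo_full": ["Croatia"] * 3,
    "year": [2000, 2001, 2002],
    "fertile_age_pct": [0.0, 46.420061, 46.326103],
    "young_age": [0.0, 0.723475, 0.715874],
})

result = impute_first_year(full_data, "geo_full", ["fertile_age_pct", "young_age"])
print(result)
```

Countries whose first row is not all zeros pass through unchanged, so the function can be applied to the whole frame.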

Looping this code to get new dataframe based on previous calculated dataframe? [duplicate]

df2 = df.copy()
df2['avgTemp'] = df2['avgTemp'] * tempchange
df2['year'] = df2['year'] + 20
df_final = pd.concat([df,df2])
OUTPUT:
Country avgTemp year
0 Afghanistan 14.481583 2012
0 Afghanistan 15.502164 2032
1 Africa 24.725917 2012
1 Africa 26.468460 2032
2 Albania 13.768250 2012
... ... ... ...
240 Zambia 21.697750 2012
241 Zimbabwe 23.038036 2032
241 Zimbabwe 21.521333 2012
242 Åland 6.063917 2012
242 Åland 6.491267 2032
Currently I'm trying to write a loop so that I can do the same calculations on "df_2" to get "df_3", and keep going until I have a certain number of new dataframes that I can then concatenate. Thank you for your help! :)
The end result should be df_1, df_2, df_3 and so on, which I can then concat together into one big dataset.
Yes, I would use a loop to solve the issue. The x I'm passing in the function range represents the number of loops you wish to do:
lists_of_dfs = []
for i in range(x):
    df_aux = df.copy()
    df_aux['avgTemp'] = df['avgTemp'] * (tempchange ** i)
    df_aux['year'] = df['year'] + (20 * i)
    lists_of_dfs.append(df_aux)
Finally, with the full list of dataframes:
final_df = pd.concat(lists_of_dfs)
The only condition is that tempchange must be the growth factor (1 + the percentage change, e.g. 1.05 for a 5% increase), not the percentage change alone; otherwise the formula will fail.
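A self-contained sketch of the loop, for anyone who wants to run it: the two-row frame, the 5% factor, and the number of steps are made-up values for illustration, not the asker's data.

```python
import pandas as pd

# Hypothetical starting frame (values invented for the example).
df = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "avgTemp": [14.0, 13.0],
    "year": [2012, 2012],
})
tempchange = 1.05  # assumed growth factor: a 5% increase per 20-year step
n_steps = 3        # plays the role of x in the loop above

frames = []
for i in range(n_steps):
    step = df.copy()
    # i = 0 reproduces the original frame; each later step compounds the factor.
    step["avgTemp"] = df["avgTemp"] * tempchange ** i
    step["year"] = df["year"] + 20 * i
    frames.append(step)

final_df = pd.concat(frames, ignore_index=True)
print(final_df)
```

Each step is computed from the original frame with the factor raised to the power i, which gives the same result as repeatedly transforming the previous frame.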

How to set the column name for first column in python pandas? Weird error

I have an xls with the title row as :
AZ-Phoenix CA-Los Angeles CA-San Diego
YEAR PHXR LXXR SDXR
January 1987 59.33 54.67 77
February 1987 59.65 54.89 78
March 1987 59.99 55.16 79
Note: the first row has no name above the "YEAR" column. How can I set the name to YEAR there?
I have tried: data_xls = data_xls.rename(columns={data_xls.columns[0]: 'YEAR'})
But it replaces AZ-Phoenix with YEAR, and I can't change the column I actually want to.
How can I change this?
YEAR is not a column, it's an index here.
try:
df.index.name = 'foobar'
or:
df = df.reset_index()
In this case, YEAR becomes a normal column and you can rename it.
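A small sketch of both options, using a made-up frame that mimics the layout in the question (only two of the value columns are reproduced):

```python
import pandas as pd

# Hypothetical frame: the month labels live in the index, not in a column.
df = pd.DataFrame(
    {"PHXR": [59.33, 59.65, 59.99], "LXXR": [54.67, 54.89, 55.16]},
    index=["January 1987", "February 1987", "March 1987"],
)

# Option 1: give the index a name.
df.index.name = "YEAR"

# Option 2: turn the index into an ordinary column, which takes the index name.
flat = df.reset_index()
print(flat.columns.tolist())
```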
If the text you pasted was the format of the Excel file which looked like this:
you can handle this in a couple of ways:
You can pretend that the two lines are multilevel indexes:
df = pandas.read_excel('test.xlsx', header=[0,1])
This results in a DataFrame which you can index like this:
df['AZ-Phoenix']
resulting in
YEAR PHXR
1987-01-01 59.33
1987-02-01 59.65
1987-03-01 59.99
If the first row is actually superfluous (it seems like the airport is already uniquely defined by the three-letter airport code in there with an R tacked on), you can simply ignore that row when importing and get a "flatter" DataFrame:
df_flat = pandas.read_excel('test.xlsx', skiprows=1, index_col=0)
This gives you something you can index by the airport code:
df_flat.PHXR
gives
YEAR
1987-01-01 59.33
1987-02-01 59.65
1987-03-01 59.99
Name: PHXR, dtype: float64
By using rename_axis
df.rename_axis('YEAR', axis=1).rename_axis('YEAR', axis=0) # change YEAR to whatever you need for rename :)
Out[754]:
YEAR value timestamp
YEAR
0 1 2017-10-03 14:33:52
1 Water 2017-10-04 14:33:48
2 1 2017-10-04 14:33:45
3 1 2017-10-05 14:33:30
4 Water 2017-10-03 14:33:40
5 Water 2017-10-05 14:32:13
6 Water 2017-10-04 14:32:01
7 1 2017-10-03 14:31:55
