I am new to data analysis with Python, and I wonder how I can transform the format of the left table into the right one. My initial thought is to create a nested for loop.
The desired table
First, I read the required CSV file.
Imported csv
Then, I count the number of countries in the column 'country' and the length of the new column-names list.
`countries = len(test['country'])`
`columns = len(['Year', 'Values'])`
After that, I should go for the nested for loop; however, I have no idea how to write the code. What I have come up with is as follows:
`for i in range(countries):`
`    for j in range(columns):`
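For reference, here is a minimal sketch of how such a nested loop could be completed (assuming test holds the wide table with a default integer index and every column other than 'country' is a year; the melt answer below makes the loop unnecessary):
import pandas as pd

rows = []
year_cols = [c for c in test.columns if c != 'country']  # assumes every other column is a year
for i in range(countries):                               # countries = number of rows, counted above
    for col in year_cols:
        rows.append({'country': test.loc[i, 'country'],  # assumes a default 0..n-1 index
                     'Year': col,
                     'Values': test.loc[i, col]})
long_df = pd.DataFrame(rows)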
You can use df.melt here:
In [3575]: df = pd.DataFrame({'country':['Afghanistan', 'Albania'], '1970':[1.36, 6.1], '1971':[1.39, 6.22], '1972':[1.43, 6.34]})
In [3576]: df
Out[3576]:
       country  1970  1971  1972
0  Afghanistan  1.36  1.39  1.43
1      Albania  6.10  6.22  6.34
In [3609]: df = df.melt('country', var_name='Year', value_name='Values').sort_values('country')
In [3610]: df
Out[3610]:
       country  Year  Values
0  Afghanistan  1970    1.36
2  Afghanistan  1971    1.39
4  Afghanistan  1972    1.43
1      Albania  1970    6.10
3      Albania  1971    6.22
5      Albania  1972    6.34
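If you also want a clean 0..n row index after the sort, you can chain reset_index:
df = (df.melt('country', var_name='Year', value_name='Values')
        .sort_values('country')
        .reset_index(drop=True))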
Not sure what you want to do, but:
If you want to transform a column into a NumPy array, you can use the following example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"foo": [1,2,3], "bar": [10,20,30]})
print(df)
foo_array = np.array(df["foo"])
print(foo_array)
and then iterate on foo_array
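For example, with the frame above:
for value in foo_array:
    print(value)   # prints 1, then 2, then 3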
You can also loop over your data frame using:
for row in df.iterrows():
    print(row)
But it's not recommended, since you can often use built-in pandas functions to do the same job.
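For example, a column-wise (vectorized) operation does in one step what a row loop would do, using the frame above:
# no Python-level loop: pandas adds the columns element-wise
df["baz"] = df["foo"] + df["bar"]
print(df["baz"].tolist())   # [11, 22, 33]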
Your data frame is also an iterable object, which yields only the column names:
for d in df:
    print(d)
# output:
# foo
# bar
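If you want each column name together with its data, df.items() yields (name, Series) pairs:
for name, col in df.items():
    print(name, col.tolist())
# output:
# foo [1, 2, 3]
# bar [10, 20, 30]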
I want to turn a dataframe from this
to this:
It took me a while to figure out the melt and transpose functions to get to this point.
But I did not manage to apply the years from 1990 to 2019 in a repeating manner for every one of the 189 countries.
I tried:
year_list = []
for year in range(1990, 2020):
    year_list.append(year)
years = pd.Series(year_list)
years
and then
df['year'] = years.repeat(30)
(I need to repeat it, because the frame consists of 5670 rows = 189 countries * 30 years)
I got this error message:
ValueError: cannot reindex on an axis with duplicate labels
Googling this error does not help.
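The error comes from index alignment: years.repeat(30) keeps the original index labels, so the repeated Series has duplicate labels and pandas cannot align it with the frame's index. Building the column from a plain NumPy array sidesteps alignment entirely. A sketch, assuming the rows are grouped by country (each country's 30 years contiguous):
import numpy as np

years = np.arange(1990, 2020)        # the 30 years 1990..2019
df['year'] = np.tile(years, 189)     # 189 repeats of the full run -> 5670 values
# if rows were grouped by year instead, np.repeat(years, 189) would give the right pattern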
One approach could be as follows:
Sample data
import pandas as pd
import numpy as np
data = {'country': ['Afghanistan','Angola']}
data.update({k: np.random.rand() for k in range(1990,1993)})
df = pd.DataFrame(data)
print(df)
       country      1990      1991      1992
0  Afghanistan  0.103589  0.950523  0.323925
1       Angola  0.103589  0.950523  0.323925
Code
res = (df.set_index('country')
         .unstack()
         .sort_index(level=1)
         .reset_index(drop=False)
         .rename(columns={'country': 'geo',
                          'level_0': 'time',
                          0: 'hdi_human_development_index'})
      )
print(res)
   time          geo  hdi_human_development_index
0  1990  Afghanistan                     0.103589
1  1991  Afghanistan                     0.950523
2  1992  Afghanistan                     0.323925
3  1990       Angola                     0.103589
4  1991       Angola                     0.950523
5  1992       Angola                     0.323925
Explanation
Use df.set_index on column country and apply df.unstack to add the years from the column names to the index.
Now, we use df.sort_index on level=1 to get the countries in alphabetical order.
Finally, we use df.reset_index with drop parameter set to False to get the index back as columns, and we chain df.rename to customize the column names.
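As an aside, the same reshape can be sketched with df.melt, which came up earlier in this thread (the column order differs slightly from res above):
res = (df.melt('country', var_name='time',
               value_name='hdi_human_development_index')
         .rename(columns={'country': 'geo'})
         .sort_values(['geo', 'time'])     # country-major order, as in the output above
         .reset_index(drop=True))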
grouped1 = exp.groupby("country")["value"].sum().reset_index()
grouped2 = imp.groupby("country")["value"].sum().reset_index()
grouped = grouped1.merge(grouped2, on="country")
grouped.rename(columns={"value_x": "export_to", "value_y": "import_from"}, inplace=True)
grouped
output:
I want to sort the dataframe by the sum of export_to and import_from.
I tried this:
grouped.sort_values(grouped.export_to+grouped.import_from)
In my opinion, there's nothing wrong with adding a new column to sort by. But of course, we can replace the index with the series of sums as an option:
(
    grouped
    .set_index(grouped.export_to + grouped.import_from)
    .sort_index()
    .reset_index(drop=True)
)
We can hide all this machinery deeper inside sort_index:
grouped.sort_index(
    ignore_index=True,
    key=lambda index: grouped.loc[index, ['export_to', 'import_from']].sum('columns')
)
The process is the same: the index is replaced here by a key function and then sorted. After that, the reordered index is dropped because of ignore_index=True.
You can create a temporary column for total and drop it after sorting by it.
df.assign(total=df['export_to']+df['import_from']).sort_values('total').drop(columns='total')
   country  export_to  import_from
3   ANDORA       6.28         5.82
1  ALBANIA     196.51       524.18
0      AFG    4790.19      2682.23
2  ALGERIA    8232.24     10185.12
Even if you don't drop the total column, it will not be added to the DataFrame permanently (assign returns a modified copy), and the result will still be the same, though total shows in the output:
df.assign(total=df['export_to']+df['import_from']).sort_values('total')
   country  export_to  import_from     total
3   ANDORA       6.28         5.82     12.10
1  ALBANIA     196.51       524.18    720.69
0      AFG    4790.19      2682.23   7472.42
2  ALGERIA    8232.24     10185.12  18417.36
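Another option that avoids the temporary column entirely is to reorder the rows by the argsort of the sum (a sketch):
# positions of the rows in ascending order of the sum
order = (grouped['export_to'] + grouped['import_from']).argsort().to_numpy()
grouped.iloc[order]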
I want to read a table from Wikipedia:
import pandas as pd
caption="Edit section: 2019 inequality-adjusted HDI (IHDI) (2020 report)"
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index',match=caption)
df
But I got this error:
"ValueError: No tables found matching pattern 'Edit section: 2019 inequality-adjusted HDI (IHDI) (2020 report)'"
This method worked for a table like the one below:
caption = "Average daily maximum and minimum temperatures for selected cities in Minnesota"
df = pd.read_html('https://en.wikipedia.org/wiki/Minnesota', match=caption)
df
But I am confused by this one; how can I solve the problem?
You have multiple problems here.
Fetching the https page directly with pandas can fail in some environments, and no table on the page matches the caption text you're searching for.
Try this:
import pandas as pd
import requests
caption = "Table of countries by IHDI"
df = pd.read_html(
    requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index").text,
    match=caption,
)
print(df[0].head())
Output:
Rank Country ... 2019 estimates (2020 report)[4][5][6]
Rank Country ... Overall loss (%) Growth since 2010
0 1 Norway ... 6.1 0.021
1 2 Iceland ... 5.8 0.055
2 3 Switzerland ... 6.9 0.015
3 4 Finland ... 5.3 0.040
4 5 Ireland ... 7.3 0.066
[5 rows x 6 columns]
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index')
df[2]
Or, if you wish to use the match argument:
import pandas as pd
caption="Table of countries by IHDI"
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index',match=caption)
df[0]
Returns the matching table.
I'm trying to figure out how to sort through rows in a spreadsheet read with pandas and save values to variables.
Here is my code so far:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
df = pd.read_excel('data_file.xlsx', sheet_name='Sheet 1')
for line in df:
    if line.startswith(line):
The data is formatted the following way:
Column 1 has runner numbers, column 2 has 100-meter sprint times, and column 3 has 400-meter sprint times.
Here's an example of the data:
Runner 100m 400m
1 43.7 93.5
1 37.5 87.6
1 39.2 82.5
2 28.9 67.9
2 26.2 69.9
2 33.3 60.25
2 34.2 60.65
3 19.9 45.5
3 19.8 44.0
4 18.7 50.0
4 19.0 52.4
How could I store the contents of all the rows starting with 1 in a unique variable, all the rows starting with 2 in another variable, 3, etc.? I know this has to involve a loop of some sort but I'm not sure about how to approach this problem.
Generally, you want to avoid trying to programmatically set unique variables. This problem is best approached with a dictionary data structure that stores the contents of the rows, keyed by each "Runner" ID (runner IDs need to be unique).
You can quickly iterate through the data for each runner using pandas groupby. In the loop, i is the "Runner" ID and tdf is the sub-dataframe holding just that runner's rows. This stores a NumPy array of each runner's data in the dict d.
d = {}
for i, tdf in df.groupby('Runner'):
    d[i] = tdf[['100m', '400m']].values
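With the sample data above, d[1] would then hold a 3x2 NumPy array of runner 1's times:
print(d[1])
# [[43.7 93.5]
#  [37.5 87.6]
#  [39.2 82.5]]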
EDIT:
If you really want to iterate line by line you can use df.iterrows() method.
d = {}
for i, x in df.iterrows():
    runner = x['Runner']
    data = x[['100m', '400m']].tolist()
    d.setdefault(runner, []).append(data)   # setdefault, not d.get(...).append(...): list.append returns None
I have the following dataframe in Pandas. My question is how to operate on series where one has a time lag. For example, I would like to calculate the result of dividing the GDP of a period by the population of the next period.
GDP Population
1950 3.31 1
1951 3.5 1
...
2000 15.2 2
To do that you can just use shift(-1), which pulls each value up from the next row:
df['new_col'] = df['GDP'] / df['Population'].shift(-1)
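To see what shift(-1) does to the alignment, here is a tiny sketch (the 1952 values are illustrative):
import pandas as pd

df = pd.DataFrame({'GDP': [3.31, 3.5, 3.8], 'Population': [1, 1, 2]},
                  index=[1950, 1951, 1952])
print(df['Population'].shift(-1))
# 1950    1.0
# 1951    2.0
# 1952    NaN   <- the last period has no "next" population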
Have you considered using shift?
import pandas as pd
import numpy as np
df = pd.DataFrame({"GDP": np.random.normal(3,1,51),
"Pop":np.random.randint(1,10,51)},
index=np.arange(1950,2001))
df["res"] = df.GDP.shift(1)/df.Pop