Drawing point to another point in R?

I have migration data for different countries into Mexico. I want to represent it in a Sankey diagram. I am new to R and am having a difficult time producing it. Can someone please help me achieve this? The destination is Mexico, so all the flows will converge on one point.
df
Country   2013 2014 2015 2016 Dest
UK        1200 1200 1207 1400 Mexico
China      630  700  800  940 Mexico
Canada    1000 1000  950  920 Mexico
Brazil     820  670  550  230 Mexico
France     400  200  700  700 Mexico
Australia  440  350  340  780 Mexico
Sankey diagram example:

Here is a different way to visualise your data. We use plotly to create a dynamic animation of the flow of migrants into Mexico over time.
# Need to reshape data from wide to long and prepare it for plotly
library(dplyr)
library(tidyr)

data_long <- data %>%
  rename(source = Country, target = Dest) %>%
  pivot_longer(matches("\\d{4}"), names_to = "year") %>%
  pivot_longer(c(source, target), values_to = "country") %>%
  mutate(node_id = as.factor(country))
# Plotly Sankey diagrams need a link list that gives 0-based indices
# (instead of node names)
link_list <- data_long %>%
  select(-country) %>%
  mutate(node_id = as.integer(node_id) - 1) %>%
  pivot_wider(names_from = "name", values_from = "node_id") %>%
  arrange(year) %>%
  as.list()
# Now we're ready to plot
library(plotly)

plot_ly(
  type = "sankey",
  orientation = "h",
  node = list(
    label = levels(data_long$node_id),
    pad = 15,
    thickness = 20),
  link = link_list,
  frame = ~link_list$year) %>%
  animation_slider(currentvalue = list(prefix = "Year: "))
Sample data
data <- read.table(text = "Country 2013 2014 2015 2016 Dest
UK 1200 1200 1207 1400 Mexico
China 630 700 800 940 Mexico
Canada 1000 1000 950 920 Mexico
Brazil 820 670 550 230 Mexico
France 400 200 700 700 Mexico
Australia 440 350 340 780 Mexico", header = TRUE, check.names = FALSE)

A different option could be using ggalluvial to create an alluvial plot of the values per country over the years. Here is a reproducible example:
# install.packages("ggalluvial")
library(ggalluvial)
library(dplyr)
library(ggplot2)
library(tidyr)
library(scales)
df %>%
  pivot_longer(cols = `2013`:`2016`) %>%
  mutate(value = as.numeric(gsub(",", "", value))) %>%
  ggplot(aes(x = name, y = value, alluvium = Country)) +
  geom_alluvium(aes(fill = Country, colour = Country),
                alpha = .75, decreasing = FALSE) +
  ggtitle("Migration to Mexico") +
  scale_y_continuous(breaks = pretty_breaks()) +
  theme_bw()
Created on 2022-09-04 with reprex v2.0.2

Related

fill column with value of a column from another dataframe, depending on conditions

I have a dataframe that looks like this (my input database on COVID cases)
data:
date state cases
0 20200625 NY 300
1 20200625 CA 250
2 20200625 TX 200
3 20200625 FL 100
5 20200624 NY 290
6 20200624 CA 240
7 20200624 TX 100
8 20200624 FL 80
...
Worth noting: the "date" column in the data above is a number (not a datetime).
I want to make it a timeseries like this (desired output), with dates as index and each state's COVID cases as columns
NY CA TX FL
20200625 300 250 200 100
20200626 290 240 100 80
...
As of now I have managed to create only the skeleton of the output with the following code:
states = ['NY', 'CA', 'TX', 'FL']
days = [20200625, 20200626]
columns = states
positives = pd.DataFrame(columns=columns)
i = 0
for day in days:
    positives.loc[i, "date"] = day
    i = i + 1
positives.set_index('date', inplace=True)
positives = positives.rename_axis(None)
print(positives)
which returns:
NY CA TX FL
20200625.0 NaN NaN NaN NaN
20200626.0 NaN NaN NaN NaN
How can I get from the "data" dataframe the value of column "cases" where:
(i) the value in data["state"] equals a column header of "positives", and
(ii) the value in data["date"] equals a row index of "positives"?
You can do:
df = df.set_index(['date', 'state']).unstack().reset_index()
# fix column names
df.columns = df.columns.get_level_values(1)
state CA FL NY TX
0 20200624 240.0 NaN 290.0 NaN
1 20200625 250.0 100.0 300.0 200.0
Later, to set the index again, we need to set its name explicitly:
df = df.set_index("")
df.index.name = "date"
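For completeness, here is a minimal, self-contained sketch of the unstack approach on a small made-up subset of the question's data (the values are illustrative):

```python
import pandas as pd

# Made-up subset of the COVID data from the question
data = pd.DataFrame({
    'date': [20200625, 20200625, 20200624, 20200624],
    'state': ['NY', 'CA', 'NY', 'CA'],
    'cases': [300, 250, 290, 240],
})

# Move (date, state) into the index, then unstack state into columns
wide = data.set_index(['date', 'state'])['cases'].unstack()
print(wide)
```

Selecting the `cases` column before unstacking avoids the MultiIndex columns entirely, so no column-name cleanup is needed afterwards.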
The transformation you are interested in is called a pivot. You can achieve this in Pandas as follows:
# Reproduce part of the data
data = pd.DataFrame({'date': [20200625, 20200625, 20200624, 20200624],
                     'state': ['NY', 'CA', 'NY', 'CA'],
                     'cases': [300, 250, 290, 240]})
data
data
# date state cases
# 0 20200625 NY 300
# 1 20200625 CA 250
# 2 20200624 NY 290
# 3 20200624 CA 240
# Pivot
data.pivot(index='date', columns='state', values='cases')
# state CA NY
# date
# 20200624 240 290
# 20200625 250 300
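One caveat worth knowing: pivot raises a ValueError when a (date, state) pair appears more than once. In that case pivot_table with an aggregation function can be used instead; a sketch with made-up duplicate rows, assuming summing is the desired behaviour:

```python
import pandas as pd

# Made-up data with a duplicate (date, state) pair
dup = pd.DataFrame({'date': [20200625, 20200625, 20200625],
                    'state': ['NY', 'NY', 'CA'],
                    'cases': [100, 200, 250]})

# pivot_table aggregates the duplicates instead of raising
out = dup.pivot_table(index='date', columns='state', values='cases', aggfunc='sum')
print(out)
```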

How to add conditional row to pandas dataframe

I tried looking for a succinct answer and nothing helped. I am trying to add a row to a dataframe that takes a string for the first column and then, for each remaining column, the column sum. I ran into a scalar issue, so I tried to make the desired row into a Series and then convert it to a dataframe, but apparently I was adding four rows with one column value each instead of one row with the four column values.
My code:
def country_csv():
    # loop through absolute paths of each file in source
    for filename in os.listdir(source):
        filepath = os.path.join(source, filename)
        if not os.path.isfile(filepath):
            continue
        df = pd.read_csv(filepath)
        df = df.groupby(['Country']).sum()
        df.reset_index()
        print(df)
        # df.to_csv(os.path.join(path1, filename))
Sample dataframe:
Confirmed Deaths Recovered
Country
Afghanistan 299 7 10
Albania 333 20 99
Would like to see this as the first row
World 632 27 109
import pandas as pd
import datetime as dt
df
Confirmed Deaths Recovered
Country
Afghanistan 299 7 10
Albania 333 20 99
df.loc['World'] = [df['Confirmed'].sum(),df['Deaths'].sum(),df['Recovered'].sum()]
df.sort_values(by=['Confirmed'], ascending=False)
Confirmed Deaths Recovered
Country
World 632 27 109
Albania 333 20 99
Afghanistan 299 7 10
IIUC, you can create a dict and then pass it back into a dataframe to concat:
data = df.sum(axis=0).to_dict()
data.update({'Country': 'World'})
df2 = pd.concat([pd.DataFrame(data, index=[0]).set_index('Country'), df], axis=0)
print(df2)
Confirmed Deaths Recovered
Country
World 632 27 109
Afghanistan 299 7 10
Albania 333 20 99
Or a one-liner using assign and a transpose:
df2 = pd.concat(
    [df.sum(axis=0).to_frame().T.assign(Country="World").set_index("Country"), df],
    axis=0,
)
print(df2)
Confirmed Deaths Recovered
Country
World 632 27 109
Afghanistan 299 7 10
Albania 333 20 99
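Since every column here is numeric, the row of sums can also be written without listing each column: df.sum() returns a Series indexed by column name, and .loc aligns it when assigning. A small sketch on the sample data:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'Confirmed': [299, 333], 'Deaths': [7, 20], 'Recovered': [10, 99]},
                  index=pd.Index(['Afghanistan', 'Albania'], name='Country'))

# df.sum() is a Series keyed by column name; .loc aligns it column-wise
df.loc['World'] = df.sum()
print(df.sort_values('Confirmed', ascending=False))
```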

Find average of row and column groups pandas

I want to find the states with the highest average total revenue, and to be able to see the states with the 40th-45th highest averages, the 35th-40th, and so on, for all states from 1992-2016.
The data is organized in a dataframe as shown below, so ideally I could add another column like the following. I think this is what I am trying to do.
STATE // YEAR // TOTAL_REVENUE // AVG_TOTAL_REVENUE
ALABAMA // 1992 // 5000 // 6059
ALABAMA // 1993 // 4000 // 6059
ALASKA // 1992 // 3000 // 2059
ALABAMA // 1996 // 6019 // 6059
Is this possible to do? I am not sure if I am stating what I want to do correctly and not sure what I am looking for google wise to figure out a way forward.
Assuming your input looks like:
STATE YEAR TOTAL_REVENUE
Michigan 2001 1000
Michigan 2002 2000
California 2003 3000
California 2004 4000
Michigan 2005 5000
Then just do:
import numpy as np

df['AVG_TOTAL_REVENUE'] = np.nan
states = df['STATE'].tolist()
states = list(set(states))
for state in states:
    state_values = df[df['STATE'] == state]
    revenues = state_values['TOTAL_REVENUE'].tolist()
    revenues = [float(x) for x in revenues]
    avg = sum(revenues) / len(revenues)
    # df.loc[rows, col] avoids the chained-assignment warning
    df.loc[state_values.index, 'AVG_TOTAL_REVENUE'] = avg
which gives you:
STATE YEAR TOTAL_REVENUE AVG_TOTAL_REVENUE
0 Michigan 2001 1000 2666.666667
1 Michigan 2002 2000 2666.666667
2 California 2003 3000 3500.000000
3 California 2004 4000 3500.000000
4 Michigan 2005 5000 2666.666667
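The same per-state average can be computed without an explicit loop: groupby(...).transform('mean') broadcasts each group's mean back onto the original rows. A sketch on the same made-up input:

```python
import pandas as pd

# Same made-up input as above
df = pd.DataFrame({'STATE': ['Michigan', 'Michigan', 'California', 'California', 'Michigan'],
                   'YEAR': [2001, 2002, 2003, 2004, 2005],
                   'TOTAL_REVENUE': [1000, 2000, 3000, 4000, 5000]})

# transform returns one value per original row, aligned by index
df['AVG_TOTAL_REVENUE'] = df.groupby('STATE')['TOTAL_REVENUE'].transform('mean')
print(df)
```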
If your data is stored in a pandas dataframe called df with STATE as a column, then you can try:
df.set_index("STATE",inplace=True)
avg_revenue = df.groupby(level=0)["TOTAL_REVENUE"].agg("mean")
df["AVG_TOTAL_REVENUE"] = avg_revenue.loc[df.index]
df = df.sort_values(by="AVG_TOTAL_REVENUE",ascending=False)
Regarding the "40-45th highest average", I'm not sure exactly what you're looking for. But you could do this for instance:
import numpy as np

bounds = (np.array([0.40, 0.45]) * len(df)).astype(int)  # renamed to avoid shadowing the built-in `bin`
df.iloc[bounds[0]:bounds[1], :]
# Or with quantiles
min_q, max_q = (0.40, 0.45)
avg = df.AVG_TOTAL_REVENUE
df.loc[(avg >= avg.quantile(min_q)) & (avg <= avg.quantile(max_q)), :]
Or maybe you want to bin your data every 5 states in order of AVG_TOTAL_REVENUE?
df_grouped = df.groupby("STATE")["AVG_TOTAL_REVENUE"].agg("first")
n_bins = int(df_grouped.shape[0] / 5)
bins = (pd.cut(df_grouped, bins=n_bins)
        .reset_index()
        .groupby("AVG_TOTAL_REVENUE")
        .agg(list))

Bar plot by grouping values in python

I want to plot a bar chart where I need to compare sales of two regions with respect to Region and Tier.
I implemented below code:
df.groupby(['Region','Tier'],sort=True).sum()[['Sales2015','Sales2016']].unstack().plot(kind="bar",width = .8)
But I want the 2015 and 2016 sales shown side by side for each Tier,
e.g., on the x-axis the ticks should read like "High", with the 2015 and 2016 sales bars next to each other, and so on.
Data generation: I randomly generated data matching yours using the code below:
import numpy as np
import pandas as pd
# The number of demo data count
demo_num = 20
# Regions
regions = ['central', 'east', 'west']
np.random.seed(9)
regions_r = np.random.choice(regions, demo_num)
# Tiers
tiers = ['hi', 'lo', 'mid']
np.random.seed(99)
tiers_r = np.random.choice(tiers, demo_num)
# Sales
sales2015 = np.array(range(demo_num)) * 100
sales2016 = np.array(range(demo_num)) * 200
# Dataframe `df` to store all above
df = pd.DataFrame({'Region': regions_r, 'Tier': tiers_r, 'Sales2015': sales2015, 'Sales2016': sales2016})
Data: Now input data looks like this
Region Sales2015 Sales2016 Tier
0 west 0 0 lo
1 central 100 200 lo
2 west 200 400 hi
3 east 300 600 lo
4 west 400 800 hi
5 central 500 1000 mid
6 west 600 1200 hi
7 east 700 1400 lo
8 east 800 1600 hi
9 west 900 1800 lo
10 central 1000 2000 mid
11 central 1100 2200 lo
12 west 1200 2400 lo
13 east 1300 2600 hi
14 central 1400 2800 lo
15 east 1500 3000 mid
16 east 1600 3200 hi
17 east 1700 3400 mid
18 central 1800 3600 hi
19 central 1900 3800 hi
Code for visualization:
import matplotlib.pyplot as plt
import pandas as pd
# Summary statistics
df = df.groupby(['Tier', 'Region'], sort=True).sum()[['Sales2015', 'Sales2016']].reset_index(level=1, drop=False)
# Loop over Regions and visualize graphs side by side
regions = df.Region.unique().tolist()
fig, axes = plt.subplots(ncols=len(regions), nrows=1, figsize=(10, 5), sharex=False, sharey=True)
for region, ax in zip(regions, axes.ravel()):
    df.loc[df['Region'] == region].plot(ax=ax, kind='bar', title=region)
plt.tight_layout()
plt.show()
Result: the graphs now look like this. I haven't optimized font sizes, etc.
Hope this helps.
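The summary table that each subplot draws from can also be inspected directly; here is a small sketch of just the groupby step on a tiny made-up frame:

```python
import pandas as pd

# Tiny made-up frame with the same columns as the question
df = pd.DataFrame({'Region': ['west', 'east', 'west', 'east'],
                   'Tier': ['hi', 'lo', 'hi', 'lo'],
                   'Sales2015': [100, 200, 300, 400],
                   'Sales2016': [150, 250, 350, 450]})

# One row per (Tier, Region) pair, summing both sales columns
summary = df.groupby(['Tier', 'Region'])[['Sales2015', 'Sales2016']].sum()
print(summary)
```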

How to convert list to pandas DataFrame?

I use BeautifulSoup to get some data from a webpage:
import pandas as pd
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
soup = BeautifulSoup(res.content,'html5lib')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
df.head()
But df is a list, not the pandas DataFrame as I expected from using pd.read_html.
How can I get pandas DataFrame out of it?
You can use read_html with your url:
df = pd.read_html("http://www.nationmaster.com/country-info/stats/Media/Internet-users")[0]
And then, if necessary, remove the GRAPH and HISTORY columns and fill the NaNs in column # by forward filling:
df = df.drop(['GRAPH','HISTORY'], axis=1)
df['#'] = df['#'].ffill()
print(df.head())
# COUNTRY AMOUNT DATE
0 1 China 389 million 2009
1 2 United States 245 million 2009
2 3 Japan 99.18 million 2009
3 3 Group of 7 countries (G7) average (profile) 80.32 million 2009
4 4 Brazil 75.98 million 2009
print(df.tail())
# COUNTRY AMOUNT DATE
244 214 Niue 1100 2009
245 =215 Saint Helena, Ascension, and Tristan da Cunha 900 2009
246 =215 Saint Helena 900 2009
247 217 Tokelau 800 2008
248 218 Christmas Island 464 2001
