In my research, I employ a regression-based difference-in-differences specification. To conduct a placebo test, I want to randomly assign a placebo treatment year to every treated group, drawn from a uniform distribution. For example, my original data looks like this:
treatment_group_dummy treated_year group_number
1 1996 1
1 2005 3
1 2001 5
1 2006 5
1 2007 5
1 2002 5
and I want to randomly assign treated years to all treatment groups, drawn uniformly from 1996 to 2007. For example:
treatment_group_dummy treated_year group_number
1 2007 1
1 1996 3
1 2004 5
1 2005 5
1 2001 5
1 2006 5
Here is my preliminary code, but I think it does not work at all:
import random
import numpy as np
import pandas as pd
import itertools as it
random.seed(0)
numGroups=5
numYears=1996 ~ 2007
data = list(it.product(range(numGroups),range(numMembers)))
df = pd.DataFrame(data=data,columns=['group','years'])
Does anyone have some thoughts about this?
Thanks in advance.
I don't see any initialization of numMembers in your code, so I am not sure about the size of the list you want. But the following is a possible implementation:
import numpy as np
import pandas as pd
# set a random seed
np.random.seed(2021)
numGroups = 5
# number of rows in the dataset
size = 10
data = {
    'group': np.random.randint(1, numGroups+1, size),
    # randint's upper bound is exclusive, so 2008 covers 1996..2007
    'years': np.random.randint(1996, 2008, size)
}
df = pd.DataFrame(data)
Edit 1: Based on the author's additional explanation, if we want to randomize treated_year only:
df['treated_year'] = np.random.randint(1996, 2008, df.shape[0])
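Putting it together with the example data from the question (a minimal sketch; the column names follow the question's layout):

```python
import numpy as np
import pandas as pd

np.random.seed(2021)

# Example data following the question's layout
df = pd.DataFrame({
    'treatment_group_dummy': [1, 1, 1, 1, 1, 1],
    'treated_year': [1996, 2005, 2001, 2006, 2007, 2002],
    'group_number': [1, 3, 5, 5, 5, 5],
})

# Overwrite treated_year with placebo years drawn uniformly from
# 1996..2007 (randint's upper bound is exclusive), leaving the
# group structure untouched
df['treated_year'] = np.random.randint(1996, 2008, df.shape[0])
```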
I want to turn a dataframe from this
to this:
It took me a while to figure out the melt and transpose functions to get this far, but I did not manage to assign the years from 1990 to 2019, in a repeating manner, to each of the 189 countries.
I tried:
year_list = []
for year in range(1990, 2020, 1):
    year_list.append(year)
years = pd.Series(year_list)
years
and then
df['year'] = years.repeat(30)
(I need to repeat it 30 times, because the frame consists of 5670 rows = 189 countries * 30 years)
I got this error message:
ValueError: cannot reindex on an axis with duplicate labels
Googling this error did not help.
One approach could be as follows:
Sample data
import pandas as pd
import numpy as np
data = {'country': ['Afghanistan','Angola']}
data.update({k: np.random.rand() for k in range(1990,1993)})
df = pd.DataFrame(data)
print(df)
country 1990 1991 1992
0 Afghanistan 0.103589 0.950523 0.323925
1 Angola 0.103589 0.950523 0.323925
Code
res = (df.set_index('country')
         .unstack()
         .sort_index(level=1)
         .reset_index(drop=False)
         .rename(columns={'country': 'geo',
                          'level_0': 'time',
                          0: 'hdi_human_development_index'})
      )
print(res)
time geo hdi_human_development_index
0 1990 Afghanistan 0.103589
1 1991 Afghanistan 0.950523
2 1992 Afghanistan 0.323925
3 1990 Angola 0.103589
4 1991 Angola 0.950523
5 1992 Angola 0.323925
Explanation
Use df.set_index on column country and apply df.unstack to add the years from the column names to the index.
Now, we use df.sort_index on level=1 to get the countries in alphabetical order.
Finally, we use df.reset_index with drop parameter set to False to get the index back as columns, and we chain df.rename to customize the column names.
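As an aside, the original ValueError arises because Series.repeat keeps its (now duplicated) index labels, which pandas then refuses to align on assignment. Handing over the raw values instead sidesteps the alignment; a sketch with 2 countries and 3 years standing in for 189 and 30:

```python
import numpy as np
import pandas as pd

n_countries, n_years = 2, 3   # stand-ins for 189 and 30
years = pd.Series(range(1990, 1990 + n_years))

# Country-major frame: each country occupies a contiguous block of rows
df = pd.DataFrame({'country': np.repeat(['Afghanistan', 'Angola'], n_years)})

# years.repeat(...) keeps duplicate index labels, which pandas cannot
# align on assignment; tiling the underlying values avoids the reindex
df['year'] = np.tile(years.to_numpy(), n_countries)
```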
I have a dataframe with a list of houses, a column 'GROSSAREA' with the size of each house, and a column 'YEARBUILT' with the year it was constructed.
I need to find the average house size for each year.
df[df['YEARBUILT'] == 1991].mean()
Would you just loop from the lowest to the highest year?
It's a little hard to parse your question, but I think what you are asking for is the mean GROSSAREA per YEARBUILT. If that's not the correct understanding, then please edit your question and add an example set of data with the desired output.
If I'm correct then you want to use groupby.
import pandas as pd
df = pd.DataFrame({'YEARBUILT': [1999, 1999, 2000, 2000], 'GROSSAREA': [10, 20, 50, 60]})
df.groupby(by='YEARBUILT').mean()
GROSSAREA
YEARBUILT
1999 15
2000 55
That will give you the mean per each group of YEARBUILT.
I think of groupby like merging cells in a spreadsheet.
# Your original dataframe:
YEARBUILT GROSSAREA
1999 10
1999 20
2000 50
2000 60
# Your dataframe after df.groupby(by='YEARBUILT')
YEARBUILT GROSSAREA
1999 10
20
2000 50
60
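For reference, selecting the column before aggregating (or using named aggregation) keeps the result tidy; a small sketch with the same toy data:

```python
import pandas as pd

df = pd.DataFrame({'YEARBUILT': [1999, 1999, 2000, 2000],
                   'GROSSAREA': [10, 20, 50, 60]})

# Mean area per construction year, as a plain Series
avg_area = df.groupby('YEARBUILT')['GROSSAREA'].mean()

# Named aggregation gives a DataFrame with an explicit column name
summary = df.groupby('YEARBUILT').agg(avg_area=('GROSSAREA', 'mean'))
```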
I'm trying to make a DataFrame from this website: http://mcubed.net/ncaab/seeds.shtml. I want to turn its lists into a DataFrame so I can see the history of each seed in the NCAA tournament.
I'm not familiar with web scraping, and entering the data manually would take a while. So I'm wondering: is there an easier way to create this DataFrame than doing it by hand?
I've tried testing it by making my own dataframe and manually inputting data from the website, but it is a very long process:
import pandas as pd
data= {"History of 1 Seed":["1 seed versus 1 seed"],
"History of 2 Seed":["2 seed versus 1 seed"],
"History of 3 Seed":["3 seed versus 1 seed"],
"History of 4 Seed":["4 seed versus 1 seed"],
"History of 5 Seed":["5 seed versus 1 seed"],
"History of 6 Seed":["6 seed versus 1 seed"],
"History of 7 Seed":["7 seed versus 1 seed"],
"History of 8 Seed":["8 seed versus 1 seed"],
"History of 9 Seed":["9 seed versus 1 seed"],
"History of 10 Seed":["10 seed versus 1 seed"],
"History of 11 Seed":["11 seed versus 1 seed"],
"History of 12 Seed":["12 seed versus 1 seed"],
"History of 13 Seed":["13 seed versus 1 seed"],
"History of 14 Seed":["14 seed versus 1 seed"],
"History of 15 Seed":["15 seed versus 1 seed"],
"History of 16 Seed":["16 seed versus 1 seed"]
}
df1= pd.DataFrame(data)
df1
I can create my dataframe, but I'm not sure how to input values into it, and I'm hoping there is an easier way to do this. Thanks!
Parsing the Website
The first step is to parse the website and put the information into a DataFrame or a series of DataFrames. Here we use a combination of requests to get the text and BeautifulSoup to parse the html. The difficult aspect of your specific website is that the tables are just text, not specific html elements, so we have to go about this slightly differently than we normally would.
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from io import StringIO
url = 'http://mcubed.net/ncaab/seeds.shtml'
#Getting the website text
data = requests.get(url).text
#Parsing the website
soup = BeautifulSoup(data, "html5lib")
#Create an empty list
dflist = []
# If we look at the html, we don't want the tag b, but what's next to it;
# StringIO(b.next.next) takes the correct text and makes it readable to pandas
for b in soup.findAll("b")[2:-1]:
    dflist.append(pd.read_csv(StringIO(b.next.next), sep=r'\s+', header=None))
dflist[0]
0 1 2 3
0 vs. #1 (23-23) 50.0%
1 vs. #2 (40-35) 53.3%
2 vs. #3 (25-15) 62.5%
Cleaning, and Combining the DataFrames
Next we need to format all the dataframes in the list. I've also decided to combine all the dataframes, making the team name one column and the opposing seed another column. This will allow easy filtering to get whatever information we need.
# We need a new list, because the melt we are about to do cannot replace
# the dataframes in dflist in place
meltedDF = []
# The second item in the loop is the team number, starting from 1
for df, teamnumber in zip(dflist, np.arange(len(dflist)) + 1):
    # Creating the team name
    name = "Team " + str(teamnumber)
    # Making the team name a column, holding the values of df[0] and df[1] combined
    df[name] = df[0] + df[1]
    # Melting the dataframe to make the team name its own column
    meltedDF.append(df.melt(id_vars=[0, 1, 2, 3]))
# Concat all the melted DataFrames
allTeamStats = pd.concat(meltedDF)
# Final cleaning of our new single DataFrame
allTeamStats = allTeamStats.rename(columns={0: name, 2: 'Record', 3: 'Win Percent',
                                            'variable': 'Team', 'value': 'VS'})\
                           .reindex(['Team', 'VS', 'Record', 'Win Percent'], axis=1)
allTeamStats.head()
allTeamStats.head()
Team VS Record Win Percent
0 Team 1 vs.#1 (23-23) 50.0%
1 Team 1 vs.#2 (40-35) 53.3%
2 Team 1 vs.#3 (25-15) 62.5%
3 Team 1 vs.#4 (53-22) 70.7%
4 Team 1 vs.#5 (45-9) 83.3%
Querying our new DF
Now that we have all the information in a single DataFrame we can filter it to pull the information we want!
allTeamStats[allTeamStats['VS'] == 'vs.#1'].head()
Team VS Record Win Percent
0 Team 1 vs.#1 (23-23) 50.0%
0 Team 2 vs.#1 (35-40) 46.7%
0 Team 3 vs.#1 (15-25) 37.5%
0 Team 4 vs.#1 (22-53) 29.3%
0 Team 5 vs.#1 (9-45) 16.7%
If you wanted an easier way to investigate the wins and losses of a team, we could further create two new columns holding the wins and losses separately, extracted from Record.
allTeamStats['Win'] = allTeamStats['Record'].str.extract(r'\((\d+)')
allTeamStats['Lose'] = allTeamStats['Record'].str.extract(r'\(\d+-(\d+)')
allTeamStats.head()
Team VS Record Win Percent Win Lose
0 Team 1 vs.#1 (23-23) 50.0% 23 23
1 Team 1 vs.#2 (40-35) 53.3% 40 35
2 Team 1 vs.#3 (25-15) 62.5% 25 15
3 Team 1 vs.#4 (53-22) 70.7% 53 22
4 Team 1 vs.#5 (45-9) 83.3% 45 9
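If numeric work on the records is needed, the extracted Win/Lose strings can be cast to integers; a sketch on a stand-in frame with the same columns (the scraped data itself isn't reproduced here, and expand=False makes str.extract return a Series):

```python
import pandas as pd

# Stand-in for allTeamStats after the extraction step above
allTeamStats = pd.DataFrame({'Team': ['Team 1', 'Team 1'],
                             'VS': ['vs.#1', 'vs.#2'],
                             'Record': ['(23-23)', '(40-35)']})

# expand=False returns a Series, which casts cleanly to int
allTeamStats['Win'] = allTeamStats['Record'].str.extract(r'\((\d+)', expand=False).astype(int)
allTeamStats['Lose'] = allTeamStats['Record'].str.extract(r'\(\d+-(\d+)', expand=False).astype(int)

# Win percentage recomputed from the counts
allTeamStats['WinPct'] = allTeamStats['Win'] / (allTeamStats['Win'] + allTeamStats['Lose'])
```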
I am new to pandas and matplotlib. I have a csv file covering the years 2012 to 2018, with rain data for each month of each year. I want to analyze, with a histogram, which month of the year has the maximum rainfall. Here is my dataset:
year month Temp Rain
2012 1 10 100
2012 2 20 200
2012 3 30 300
.. .. .. ..
2012 12 40 400
2013 1 50 300
2013 2 60 200
.. .. .. ..
2018 12 70 400
I was not able to plot it with a histogram; I tried plotting with a bar chart but did not get the desired result. Here is what I have tried:
import pandas as pd
import numpy as npy
import matplotlib.pyplot as plt
df2=pd.read_csv('Monthly.csv')
df2.groupby(['year','month'])['Rain'].count().plot(kind="bar",figsize=(20,10))
Here is the output I got:
Please suggest an approach for plotting a histogram to analyze which month, grouped by year, has the maximum rainfall.
Probably you don't want to see the count per group, but rather:
df2.groupby(['year','month'])['Rain'].first().plot(kind="bar",figsize=(20,10))
or maybe
df2.groupby(['month'])['Rain'].sum().plot(kind="bar",figsize=(20,10))
You are close to the solution; use max() and not count():
df2.groupby(['year','month'])['Rain'].max().plot(kind="bar",figsize=(20,10))
First group by year and month as you already did, but keep only the maximum rainfall.
series_df2 = df2.groupby(['year','month'], sort=False)['Rain'].max()
Then unstack the series, transpose it and plot it.
series_df2.unstack().T.plot(kind='bar', subplots=False, layout=(2,2))
This will give you an output that looks like this for your sample data:
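To answer the underlying question ("which month has the maximum rainfall?") directly, idxmax on the aggregated series names the month; a sketch with toy data (the real file would come from read_csv as above):

```python
import pandas as pd

# Toy data in the same shape as the question's Monthly.csv
df2 = pd.DataFrame({'year':  [2012, 2012, 2013, 2013],
                    'month': [1, 2, 1, 2],
                    'Rain':  [100, 200, 300, 250]})

# Mean rainfall per calendar month, averaged across all years
monthly_mean = df2.groupby('month')['Rain'].mean()

# Month with the highest average rainfall
wettest_month = monthly_mean.idxmax()
```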
I am trying to create a plot from a csv file. In the csv file, the first column is a timestamp and the second through sixth columns are different parties. I want to create a graph where the x axis is the year (e.g. 2004) and the y axis shows each party's value in percent.
The csv file looks like:
date,CSU/CDU,SPD,Gruene,FDP,Linke
891468000.0,34,44,6,5,6
891986400.0,34,44,6,5,6
892677600.0,35,43,6,5,5
894405600.0,32,46,6,6,5
895010400.0,33,46,5,5,5
I have tried the code below:
import numpy as np
import matplotlib.pyplot as plt
with open('polldata.csv') as f:
    names = f.readline().strip().split(',')
    data = np.loadtxt(f, delimiter=',')
cols = data.shape[1]
for n in range(1, cols):
    plt.plot(data[:, 0], data[:, n], label=names[n])
plt.xlabel('year',fontsize=14)
plt.ylabel('parties',fontsize=14)
plt.show()
From the first column of my csv file, I want to convert the timestamp to a year. Also, I would like to display a bar chart so that the parties can be easily distinguished by color.
I want the graph to look similar to the fifth one on this page:
(https://moderndata.plot.ly/elections-analysis-in-r-python-and-ggplot2-9-charts-from-4-countries/)
Thanks in advance!
You can use the csv reader from pandas. Documentation is here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
It looks like this:
import pandas as pd
import matplotlib.pyplot as plt
import datetime
df = pd.read_csv("polldata.csv", delimiter=',')
df['date'] = df['date'].apply(lambda ts: datetime.datetime.utcfromtimestamp(ts).strftime('%Y'))
print(df)
cols = df.columns
# skip column 0 ('date'); plot each party against the year
for n in range(1, len(cols)):
    plt.plot(df['date'], df[cols[n]], label=cols[n])
plt.xlabel('year', fontsize=14)
plt.ylabel('parties', fontsize=14)
plt.legend()
plt.show()
it will print:
date CSU/CDU SPD Gruene FDP Linke
0 1998 34 44 6 5 6
1 1998 34 44 6 5 6
2 1998 35 43 6 5 5
3 1998 32 46 6 6 5
4 1998 33 46 5 5 5
does that get you started?
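Since a bar chart was requested, grouping by the derived year and plotting the per-year means is one option; a sketch on inline data matching the csv layout (matplotlib's Agg backend is selected so it runs headless; the timestamps here all fall in 1998):

```python
import datetime
import matplotlib
matplotlib.use('Agg')  # headless backend, no display needed
import pandas as pd

# Inline stand-in for polldata.csv
df = pd.DataFrame({'date': [891468000.0, 891986400.0, 894405600.0],
                   'SPD': [44, 44, 46],
                   'FDP': [5, 5, 6]})

# Unix timestamp -> year string
df['date'] = df['date'].apply(
    lambda ts: datetime.datetime.utcfromtimestamp(ts).strftime('%Y'))

# Average support per party per year, drawn as grouped bars
yearly = df.groupby('date').mean()
ax = yearly.plot(kind='bar')
```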