Stata .dta files include labels/descriptions for each column, which can be viewed in Stata using the describe command. For example, the adults and kids variables in this online dataset have the descriptions "number of adults in household" and "number of children in household", respectively:
clear
use http://www.principlesofeconometrics.com/stata/alcohol.dta
describe
Contains data from http://www.principlesofeconometrics.com/stata/alcohol.dta
obs: 1,000
vars: 4 10 Nov 2007 11:33
size: 5,000 (_dta has notes)
-------------------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------------------------------------------------------------
adults byte %8.0g number of adults in household
kids byte %8.0g number of children in household
income int %8.0g weekly income
consume byte %8.0g =1 if consume alcohol, =0 otherwise
-------------------------------------------------------------------------------------------------------------------------------------
Sorted by:
These descriptions do not show up in pandas, for example in the output of describe():
df = pd.read_stata('http://www.principlesofeconometrics.com/stata/alcohol.dta')
df
adults kids income consume
0 2 2 758 1
1 2 3 1785 1
2 3 0 1200 1
.. ... ... ... ...
997 2 0 1383 1
998 2 2 816 0
999 2 2 387 0
df.describe()
adults kids income consume
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 2.012000 0.722000 649.528000 0.766000
std 0.815181 1.078833 460.657826 0.423584
min 1.000000 0.000000 12.000000 0.000000
25% 2.000000 0.000000 295.000000 1.000000
50% 2.000000 0.000000 562.500000 1.000000
75% 2.000000 1.000000 887.500000 1.000000
max 6.000000 5.000000 3846.000000 1.000000
Is there a way to view this information after loading the file into a pandas DataFrame with read_stata()?
Using Stata's toy dataset auto as an example:
sysuse auto, clear
describe
Contains data from auto.dta
obs: 74 1978 Automobile Data
vars: 12 13 Apr 2014 17:45
size: 3,182 (_dta has notes)
-------------------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------------------------------------------------------------
make str18 %-18s Make and Model
price int %8.0gc Price
mpg int %8.0g Mileage (mpg)
rep78 int %8.0g Repair Record 1978
headroom float %6.1f Headroom (in.)
trunk int %8.0g Trunk space (cu. ft.)
weight int %8.0gc Weight (lbs.)
length int %8.0g Length (in.)
turn int %8.0g Turn Circle (ft.)
displacement int %8.0g Displacement (cu. in.)
gear_ratio float %6.2f Gear Ratio
foreign byte %8.0g origin Car type
-------------------------------------------------------------------------------------------------------------------------------------
Sorted by: foreign
The following works for me:
import pandas as pd
data = pd.read_stata('auto.dta', iterator = True)
labels = data.variable_labels()
labels
Out[5]:
{'make': 'Make and Model',
'price': 'Price',
'mpg': 'Mileage (mpg)',
'rep78': 'Repair Record 1978',
'headroom': 'Headroom (in.)',
'trunk': 'Trunk space (cu. ft.)',
'weight': 'Weight (lbs.)',
'length': 'Length (in.)',
'turn': 'Turn Circle (ft.) ',
'displacement': 'Displacement (cu. in.)',
'gear_ratio': 'Gear Ratio',
'foreign': 'Car type'}
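To try the same thing without Stata or a network connection, here is a minimal sketch. It first writes a small labelled .dta file with pandas (the file name and the three rows are made up for illustration), then reads the labels back through the reader returned by read_stata(..., iterator=True) and shows them next to the column dtypes:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a labelled .dta file: write one with pandas first.
df = pd.DataFrame({"adults": [2, 2, 3], "kids": [2, 3, 0]})
path = os.path.join(tempfile.mkdtemp(), "labelled.dta")
df.to_stata(
    path,
    write_index=False,
    variable_labels={
        "adults": "number of adults in household",
        "kids": "number of children in household",
    },
)

# Open the file as a StataReader, then pull both the labels and the data.
with pd.read_stata(path, iterator=True) as reader:
    labels = reader.variable_labels()
    data = reader.read()

# Show each label next to its column's dtype.
overview = pd.DataFrame(
    {"dtype": data.dtypes.astype(str), "label": pd.Series(labels)}
)
print(overview)
```

Using the reader as a context manager makes sure the file handle is closed once the labels and the data have been read.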
Edit:
If I understand you correctly, you are looking for frequency counts?
If so, .value_counts() should do the trick.
df = pd.read_stata("http://www.principlesofeconometrics.com/stata/alcohol.dta")
adults_values = df.adults.value_counts().sort_index().to_frame()
print(adults_values)
adults
1 247
2 562
3 133
4 49
5 8
6 1
kids_values = df.kids.value_counts().sort_index()
print(kids_values)
kids
0 626
1 133
2 158
3 61
4 20
5 2
Variable Descriptions
.info() gives you the data type of each column (int8, int64, etc.):
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 4 columns):
adults 1000 non-null int8
kids 1000 non-null int8
income 1000 non-null int16
consume 1000 non-null int8
dtypes: int16(1), int8(3)
memory usage: 12.7 KB
Hope this helps.
Here is my problem:
Let's say you have to buy and sell two objects under the following conditions:
You buy object A or B if its price goes below 150 (<150), assuming that you can buy a fraction of an object (so decimals are allowed)
If the object is still below 150 the following day, you keep it and do nothing
If the object's price is greater than or equal to 150, you sell it and take profits
You start the game with 10000$
Here is the DataFrame with all the prices
df=pd.DataFrame({'Date':['2017-05-19','2017-05-22','2017-05-23','2017-05-24','2017-05-25','2017-05-26','2017-05-29'],
'A':[153,147,149,155,145,147,155],
'B':[139,152,141,141,141,152,152],})
df['Date']=pd.to_datetime(df['Date'])
df = df.set_index('Date')
The goal is to return a DataFrame with the number of objects A and B you hold and the amount of cash you have left.
If the conditions are met, the allocation for each object is half of your cash if you don't hold any object (weight=1/2), and all the remaining cash if you already hold one object (weight=1)
Let's look at df first. I will also build up the new data frame that I'm trying to create (let's call it df_end):
On 2017-05-19, object A is 153$ and B is 139$ : You buy 35.97 object B (=5000/139) as the price is <150 —> You have 5000$ left in cash.
On 2017-05-22, object A is 147$ and B is 152$ : You buy 34.01 object A (=5000/147) as the price is <150 + You sell 35.97 object B at 152$ as it is >=150 --> You have now 5467.44$ left in cash thanks to the selling of B.
On 2017-05-23, object A is 149$ and B is 141$ : You keep your position on Object A (34.01 object) as it’s still below 150 and you buy 38.77 Object B (=5467.44/141) as the price is <150 —> You have now 0$ left in cash.
On 2017-05-24, object A is 155$ and B is 141$ : You sell 34.01 object A at 155$ as it’s above 150$ and you keep 38.77 Object B as it’s still below 150 —> You have now 5271.55$ left in cash thanks to the selling of A
On 2017-05-25, object A is 145$ and B is 141$: You buy 36.35 object A (5271.55/145) as it’s below 150 and you keep 38.77 Object B as it’s still below 150 —> You have now 0$ in cash
On 2017-05-26, object A is 147$ and B is 152$: You sell 38.77 object B at 152 as it’s above 150 and you keep 36.35 Object A as it’s still below 150 —> You have now 5893.04$ in cash thanks to the selling of Object B
On 2017-05-29, object A is 155$ and B is 152$: You sell 36.35 object A at 155 as it's above 150 and you do nothing else as B is not below 150 --> You have now 11527.29$ in cash thanks to the selling of Object A.
Hence, the new dataframe df_end should look like this (this is the Result I am looking for)
A B Cash
Date
2017-05-19 0 35.97 5000
2017-05-22 34.01 0 5467.64
2017-05-23 34.01 38.77 0
2017-05-24 0 38.77 5272.11
2017-05-25 36.35 38.77 0
2017-05-26 36.35 0 5893.04
2017-05-29 0 0 11527.29
My main problem is that we have to iterate over both rows and columns, and this is the most difficult part.
I have been trying to find a solution for a week without success, which is why I have tried to explain the problem as clearly as possible.
So if somebody has an idea on this issue, you are very welcome.
Thank you so much
You could try this:
import pandas as pd
df=pd.DataFrame({'Date':['2017-05-19','2017-05-22','2017-05-23','2017-05-24','2017-05-25','2017-05-26','2017-05-29'],
'A':[153,147,149,155,145,147,155],
'B':[139,152,141,141,141,152,152],})
df['Date']=pd.to_datetime(df['Date'])
df = df.set_index('Date')
print(df)
#Values before iterations
EntryCash=10000
newdata=[]
holding=False
#First iteration (Initial conditions)
firstrow=df.to_records()[0]
possibcash=EntryCash if holding else EntryCash/2
#buy only when the price is strictly below 150, as stated in the question
prevroa=possibcash/firstrow[1] if firstrow[1]<150 else 0
prevrob=possibcash/firstrow[2] if firstrow[2]<150 else 0
holding=any(i!=0 for i in [prevroa,prevrob])
newdata.append([df.to_records()[0][0],prevroa,prevrob,possibcash])
#other iterations
for row in df.to_records()[1:]:
    possibcash=possibcash if holding else possibcash/2
    a=row[1]
    b=row[2]
    #object A: sell at 150 or above, otherwise buy or keep
    if a>=150:
        if prevroa>0:
            possibcash+=prevroa*a
            a=0
        else:
            a=prevroa
    else:
        if prevroa==0:
            a=possibcash/a
            possibcash=0
        else:
            a=prevroa
    #object B: same rules as object A
    if b>=150:
        if prevrob>0:
            possibcash+=prevrob*b
            b=0
        else:
            b=prevrob
    else:
        if prevrob==0:
            b=possibcash/b
            possibcash=0
        else:
            b=prevrob
    prevroa=a
    prevrob=b
    newdata.append([row[0],a,b,possibcash])
    holding=any(i!=0 for i in [a,b])
df_end=pd.DataFrame(newdata, columns=[df.index.name]+list(df.columns)+['Cash']).set_index('Date')
print(df_end)
Output:
df
A B
Date
2017-05-19 153 139
2017-05-22 147 152
2017-05-23 149 141
2017-05-24 155 141
2017-05-25 145 141
2017-05-26 147 152
2017-05-29 155 152
df_end
A B Cash
Date
2017-05-19 0.000000 35.971223 5000.000000
2017-05-22 34.013605 0.000000 5467.625899
2017-05-23 34.013605 38.777489 0.000000
2017-05-24 0.000000 38.777489 5272.108844
2017-05-25 36.359371 38.777489 0.000000
2017-05-26 36.359371 0.000000 5894.178274
2017-05-29 0.000000 0.000000 11529.880831
If you want it rounded to two decimals, you can add:
df_end=df_end.round(decimals=2)
df_end:
A B Cash
Date
2017-05-19 0.00 35.97 5000.00
2017-05-22 34.01 0.00 5467.63
2017-05-23 34.01 38.78 0.00
2017-05-24 0.00 38.78 5272.11
2017-05-25 36.36 38.78 0.00
2017-05-26 36.36 0.00 5894.18
2017-05-29 0.00 0.00 11529.88
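The loop above is written for exactly two columns. As a rough sketch of how the same rules could be replayed for any number of price columns, the hypothetical helper below (simulate is my own name, not part of pandas) keeps the holdings in a dict. It assumes the budget for a purchase is fixed at the start of each day (half the cash when nothing is held, all of it otherwise) and that columns are processed from left to right, which is one possible reading of the rules rather than the only one:

```python
import pandas as pd


def simulate(prices, cash=10000.0, threshold=150):
    """Hypothetical helper: replay the buy/sell rules over all price columns."""
    held = {col: 0.0 for col in prices.columns}
    rows = []
    for _, row in prices.iterrows():
        # Assumption: the budget per purchase is fixed at the start of the day.
        holding = any(qty > 0 for qty in held.values())
        budget = cash if holding else cash / 2
        for col in prices.columns:
            price = row[col]
            if price >= threshold and held[col] > 0:
                cash += held[col] * price  # sell and take profits
                held[col] = 0.0
            elif price < threshold and held[col] == 0:
                spend = min(budget, cash)  # never spend more than is left
                if spend > 0:
                    held[col] = spend / price
                    cash -= spend
        rows.append({**held, "Cash": cash})
    return pd.DataFrame(rows, index=prices.index)


df = pd.DataFrame({
    "Date": ["2017-05-19", "2017-05-22", "2017-05-23", "2017-05-24",
             "2017-05-25", "2017-05-26", "2017-05-29"],
    "A": [153, 147, 149, 155, 145, 147, 155],
    "B": [139, 152, 141, 141, 141, 152, 152],
})
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date")

df_end = simulate(df)
print(df_end.round(2))
```

On the sample prices this should reproduce the table above. For other price paths it can differ from the two-column loop, because the loop halves the cash again whenever nothing is held.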
Slight differences in the final values
The result is slightly different from your desired output because you sometimes rounded the values to two decimals and sometimes didn't. For example:
In your second row you put:
#second row
2017-05-22 34.01 0 5467.64
That means you used the unrounded quantity of object B from the first row, that is 35.971223, not 35.97:
35.97*152
Out[120]: 5467.44
35.971223*152
Out[121]: 5467.6258960000005 #---->closest to 5467.64
And in the fourth row, again you used the unrounded value, not the rounded one:
#fourth row
2017-05-24 0 38.77 5272.11
#Values
34.013605*155
Out[122]: 5272.108775
34.01*155
Out[123]: 5271.549999999999
And finally, at the last two rows you used the rounded value, I guess, because:
#last two rows
2017-05-26 36.35 0 5893.04
2017-05-29 0 0 11527.29
#cash values
#penultimate row, cash value
38.777489*152
Out[127]: 5894.178328
38.77*152
Out[128]: 5893.040000000001
#last row, cash value
5894.04+(155*36.35)
Out[125]: 11528.29 #---->closest to 11527.29
5894.04+(155*36.359371)
Out[126]: 11529.742505
I want to add an extra column with the maximum relative difference [-] between the values in each row and the mean of that row:
The df is filled with energy use data for several years.
The theoretical formula that should get me this is as follows:
df['max_rel_dif'] = MAX [ ABS(highest energy use – mean energy use), ABS(lowest energy use – mean energy use)] / mean energy use
Initial dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014
0 23 22631 21954.0 22314.0 22032 21843
1 43 27456 29654.0 28159.0 28654 2000
2 36 61200 NaN NaN 31895 1600
3 87 87621 86542.0 87542.0 88456 86961
4 90 58951 57486.0 2000.0 0 0
5 98 24587 25478.0 NaN 24896 25461
Desired dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014 max_rel_dif
0 23 22631 21954.0 22314.0 22032 21843 0.02149
1 43 27456 29654.0 28159.0 28654 2000 0.91373
2 36 61200 NaN NaN 31895 1600 0.94931
3 87 87621 86542.0 87542.0 88456 86961 0.01179
4 90 58951 57486.0 2000.0 0 0 1.48870
5 98 24587 25478.0 NaN 24896 25461 0.02065
Code I tried:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [23,43,36,87,90,98],
"y_2010": [22631,27456,61200,87621,58951,24587],
"y_2011": [21954,29654,np.nan,86542,57486,25478],
"y_2012": [22314,28159,np.nan,87542,2000,np.nan],
"y_2013": [22032,28654,31895,88456,0,24896,],
"y_2014": [21843,2000,1600,86961,0,25461]})
print(df)
a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]
# calculate mean
mean = a.mean(1)
# calculate max_rel_dif
df['max_rel_dif'] = (((df.max(axis=1).sub(mean)).abs(),(df.min(axis=1).sub(mean)).abs()).max()).div(mean)
# AttributeError: 'tuple' object has no attribute 'max'
-> I'm obviously doing the wrong thing with the tuple. I just don't know how to get the maximum values
from the tuple and then divide them by the mean in a proper Pythonic way.
I think the whole calculation can be written as:
s=df.filter(like='y')
s.sub(s.mean(1),axis=0).abs().max(1)/s.mean(1)
0 0.021494
1 0.913736
2 0.949311
3 0.011800
4 1.488707
5 0.020653
dtype: float64
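As a quick sanity check, here is a sketch that assigns the result as a new column and compares it with the two-term formula from the question (largest of |max - mean| and |min - mean|, divided by the mean). The names years, row_mean and two_term are mine; np.maximum does the element-wise maximum that the tuple in the attempted code could not:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [23, 43, 36, 87, 90, 98],
                   "y_2010": [22631, 27456, 61200, 87621, 58951, 24587],
                   "y_2011": [21954, 29654, np.nan, 86542, 57486, 25478],
                   "y_2012": [22314, 28159, np.nan, 87542, 2000, np.nan],
                   "y_2013": [22032, 28654, 31895, 88456, 0, 24896],
                   "y_2014": [21843, 2000, 1600, 86961, 0, 25461]})

years = df.filter(like="y_")    # only the yearly columns, not ID
row_mean = years.mean(axis=1)   # NaN values are skipped by default

# Largest absolute deviation from the row mean, relative to that mean.
df["max_rel_dif"] = years.sub(row_mean, axis=0).abs().max(axis=1).div(row_mean)

# The two-term formula from the question, evaluated element-wise.
two_term = np.maximum((years.max(axis=1) - row_mean).abs(),
                      (years.min(axis=1) - row_mean).abs()) / row_mean

print(df)
```

Both forms agree because the value farthest from the mean is always either the row maximum or the row minimum.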
I have a data set with categorical values for Family size (1 to 4) and Loan taken (0 and 1). I want to know whether there is a significant difference in loans taken between family sizes.
I used groupby to get the count of loans by family size:
gp = df.groupby(["Family", "Personal Loan"])["Personal Loan"].count()
with output
Family Personal Loan
1 0 1365
1 107
2 0 1190
1 106
3 0 877
1 133
4 0 1088
1 134
Now I need to apply a two-way ANOVA to see if there is a significant difference between the loan taken and family size. I need help with how to go about it.
You cannot do a two-way ANOVA with count data and a binary response. What you can do is a chi-squared test, to test whether the proportions of loan==1 are equal across all family sizes:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
First I have to reconstruct something like your original df:
df = pd.DataFrame({
'Family':np.repeat(np.arange(1,5),[1472,1296,1010,1222]),
'Personal Loan':np.repeat([0,1,0,1,0,1,0,1],
[1365,107,1190,106,877,133,1088,134]),
})
gp = df.groupby(["Family","Personal Loan"])["Personal Loan"].count()
gp
Family Personal Loan
1 0 1365
1 107
2 0 1190
1 106
3 0 877
1 133
4 0 1088
1 134
Name: Personal Loan, dtype: int64
Now we do the chi-sq using crosstab:
contingency = pd.crosstab(df['Family'],df['Personal Loan'])
test = chi2_contingency(contingency)
test
(29.676116414854746, 1.6144121228248757e-06, 3, array([[1330.688, 141.312],
[1171.584, 124.416],
[ 913.04 , 96.96 ],
[1104.688, 117.312]]))
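The second value, 1.614e-06, is the p-value for the null hypothesis that the proportion of loan == 1 is the same across all family sizes. Since it is far below 0.05, the proportions differ significantly.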
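If you only have the counts from the groupby output, you do not need to rebuild the row-level data. A minimal sketch that passes the 4x2 table of counts straight to chi2_contingency and unpacks the result (the variable names are mine):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: family size 1 to 4; columns: Personal Loan == 0, Personal Loan == 1.
observed = np.array([[1365, 107],
                     [1190, 106],
                     [877, 133],
                     [1088, 134]])

chi2, p, dof, expected = chi2_contingency(observed)

# Share of loan takers per family size, to see where the difference comes from.
loan_share = observed[:, 1] / observed.sum(axis=1)

print(f"chi2={chi2:.3f}, p={p:.3g}, dof={dof}")
print(loan_share.round(3))
```

The loan shares make the test result easier to interpret: they show which family sizes take loans more often than the overall rate.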
I have a dataframe with multiple columns
df = pd.DataFrame({"cylinders":[2,2,1,1],
"horsepower":[120,100,80,70],
"weight":[5400,6200,7200,1200]})
cylinders horsepower weight
0 2 120 5400
1 2 100 6200
2 1 80 7200
3 1 70 1200
I would like to create a new dataframe with two subcolumns of weight, holding the median and the mean, while grouping by cylinders.
example:
weight
cylinders horsepower median mean
0 1 100 5299 5000
1 1 120 5100 5200
2 2 70 7200 6500
3 2 80 1200 1000
For my example tables I have used random values. I can't manage to achieve that.
I know how to get the median and mean (it is described in this Stack Overflow question):
df.weight.median()
df.weight.mean()
df.groupby('cylinders') #groupby cylinders
But how to create this subcolumn?
The following code fragment adds the two requested columns. It groups the rows by cylinders, calculates the mean and median of weight, and joins the result back to the original dataframe on the cylinders column:
result = df.join(df.groupby('cylinders')['weight']\
    .agg(['mean', 'median']), on='cylinders')
#   cylinders  horsepower  weight    mean  median
#0          2         120    5400  5800.0  5800.0
#1          2         100    6200  5800.0  5800.0
#2          1          80    7200  4200.0  4200.0
#3          1          70    1200  4200.0  4200.0
You cannot have "subcolumns" for select columns in pandas. If a column has "subcolumns," all other columns must have "subcolumns," too. It is called multiindexing.
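For illustration, this is roughly what such a column MultiIndex looks like when you aggregate per group instead of joining back. The sketch uses the small dataframe from the question; summary is just a name I picked:

```python
import pandas as pd

df = pd.DataFrame({"cylinders": [2, 2, 1, 1],
                   "horsepower": [120, 100, 80, 70],
                   "weight": [5400, 6200, 7200, 1200]})

# Passing a dict of lists to agg yields two-level column labels:
# ('weight', 'median') and ('weight', 'mean').
summary = df.groupby("cylinders").agg({"weight": ["median", "mean"]})
print(summary)

# A single "subcolumn" is selected with a tuple.
print(summary[("weight", "mean")])
```

Here every column has two levels, which is why a flat horsepower column cannot sit next to weight/median and weight/mean without also getting a second level.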
I'm working with a pandas DataFrame with the following properties.
df.info() prints this before entering the outlier function:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6661 entries, 0 to 6660
Data columns (total 4 columns):
currency 6661 non-null object
port 6661 non-null object
supplier_id 6661 non-null int64
value 6661 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 260.2+ KB
None
df.columns.values prints,
[u'currency' u'port' u'supplier_id' u'value']
Before adding the country and outlier columns, the data looked like this:
currency port supplier_id value
0 USD CNAQG 35 118.66
1 USD CNAQG 19 120.83
2 USD CNAQG 49 86.83
3 USD CNAQG 54 112.15
4 USD CNAQG 113 113.60
5 USD CNAQG 5 114.32
6 USD CNAQG 55 111.43
7 USD CNAQG 81 117.22
8 USD CNAQG 2 111.43
9 USD CNAQG 10 119.39
10 USD CNAQG 56 104.91
11 USD CNAQG 14 119.39
12 USD CNAQG 4 115.77
13 USD CNAQG 7 119.39
14 USD CNAQG 74 127.34
15 USD CNAQG 15 112.15
16 USD CNAQG 149 88.27
17 USD CNAQG 20 144.71
18 USD CNAQG 231 119.39
19 USD CNBIH 19 140.00
I use the 0.05 and 0.95 quantiles as the lower and upper bounds, respectively, and use the following code to mark the outliers:
CURRENCIES_DIC = {'CN':'CHINA', 'US':'USA'}
LOW_Q = 0.05
HIGH_Q = 0.95
# mark the data for respective country as outlier
def calculate_outliers(df):
    df['country'] = df.port.str[:2].map(CURRENCIES_DIC)
    df['outlier'] = 0
    for c in df.country.unique():
        q = df.value[df.country==c].quantile([LOW_Q, HIGH_Q])
        df.loc[df.index[df.country==c], 'outlier'] = (df.value[df.country==c].apply(lambda x: 1 if x<q[LOW_Q] or x>q[HIGH_Q] else 0))
    return df
How do I choose the correct values for defining the outliers?
Is there a better way to define the outliers for this purpose?
I see another formula df[np.abs(df.Data-df.Data.mean())<=(3*df.Data.std())] #keep only the ones that are within +3 to -3 standard deviations in the column 'Data'. Would it be better to use that?
In my opinion, there is no general rule for defining correct thresholds that distinguish between normal values and outliers. In fact, there are plenty of different methods to detect outliers. Wikipedia has good coverage here.
What counts as an outlier is highly dependent on your context. Plotting your data helps to get a visual impression of how it behaves. If your data is not normally distributed, the standard deviation criterion might not be well suited, for example when the data is not symmetrically distributed.
See those two topics on Cross Validated here and here for more.
You can slightly modify your code to make it a bit more readable:
CURRENCIES_DIC = {'CN':'CHINA', 'US':'USA'}
LOW_Q = 0.05
HIGH_Q = 0.95
def outlier_quant(series):
    # True where the value lies outside the [LOW_Q, HIGH_Q] quantile range
    lower = series < series.quantile(LOW_Q)
    upper = series > series.quantile(HIGH_Q)
    return lower | upper
def mark_outliers(sub_df):
    sub_df["country"] = sub_df["port"].str[:2].map(CURRENCIES_DIC)
    sub_df["outlier"] = sub_df.groupby("country")["value"].transform(outlier_quant)
    return sub_df
print(mark_outliers(df))
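As one example of the other methods mentioned above, here is a sketch of the interquartile-range rule (Tukey's fences) plugged into the same groupby/transform pattern. This is a different criterion from the 0.05/0.95 quantile cut, not a drop-in equivalent. The factor k=1.5 is the conventional default, the helper name outlier_iqr is mine, and the data is the 20-row sample from the question:

```python
import pandas as pd

COUNTRY_BY_PREFIX = {"CN": "CHINA", "US": "USA"}


def outlier_iqr(series, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    spread = q3 - q1
    return (series < q1 - k * spread) | (series > q3 + k * spread)


values = [118.66, 120.83, 86.83, 112.15, 113.60, 114.32, 111.43, 117.22,
          111.43, 119.39, 104.91, 119.39, 115.77, 119.39, 127.34, 112.15,
          88.27, 144.71, 119.39, 140.00]
df = pd.DataFrame({"port": ["CNAQG"] * 19 + ["CNBIH"], "value": values})

df["country"] = df["port"].str[:2].map(COUNTRY_BY_PREFIX)
# astype(bool) guards against pandas versions that cast the result to float.
df["outlier"] = (df.groupby("country")["value"]
                   .transform(outlier_iqr)
                   .astype(bool))

print(df[df["outlier"]])
```

Unlike a fixed quantile cut, which always flags about 10% of each group, this rule flags nothing when the values are tightly clustered.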