Using a NumPy random number generator, generate arrays of the heights and weights of the 88,000 people living in Utah.
The average height is 1.75 metres and the average weight is 70 kg. Assume a standard deviation of 3.
Combine these two arrays using the column_stack method and convert the result into a pandas DataFrame with the first column named 'height' and the second column named 'weight'.
I've gotten the randomly generated data. However, I can't seem to convert the array to a DataFrame:
import numpy as np
import pandas as pd
height = np.round(np.random.normal(1.75, 3, 88000), 2)
weight = np.round(np.random.normal(70, 3, 88000), 2)
np_height = np.array(height)
np_weight = np.array(weight)
Utah = np.round(np.column_stack((np_height, np_weight)), 2)
print(Utah)
df = pd.DataFrame(
    [[np_height],
     [np_weight]],
    index=[0, 1],
    columns=['height', 'weight'])
print(df)
You want 2 columns, yet the data you passed, [[np_height], [np_weight]], is interpreted as two rows of one column each. You can pass the data as a dict instead:
df = pd.DataFrame({'height': np_height,
                   'weight': np_weight},
                  columns=['height', 'weight'])
print(df)
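For intuition, here is a minimal sketch (my own illustration, not part of the original answer) of the shape mismatch:

import numpy as np
import pandas as pd

a = np.arange(3)

# Nested lists: each inner list becomes a row, so this is 2 rows x 1 column
bad = pd.DataFrame([[a], [a]])
print(bad.shape)  # (2, 1) -- adding columns=['height', 'weight'] here raises ValueError

# Dict of arrays: each key becomes a column and each array fills it
good = pd.DataFrame({'height': a, 'weight': a})
print(good.shape)  # (3, 2)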
The data in Utah is already in a suitable shape. Why not use that?
import numpy as np
import pandas as pd
height = np.round(np.random.normal(1.75, 3, 88000), 2)
weight = np.round(np.random.normal(70, 3, 88000), 2)
np_height = np.array(height)
np_weight = np.array(weight)
Utah = np.round(np.column_stack((np_height, np_weight)), 2)
df = pd.DataFrame(
data=Utah,
columns=['height', 'weight']
)
print(df.head())
height weight
0 3.57 65.32
1 -0.15 66.22
2 5.65 73.11
3 2.00 69.59
4 2.67 64.95
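As a quick sanity check (an illustrative addition, not part of the original answer), the column means should land near the parameters the data was generated with:

print(df.mean())  # expect roughly 1.75 for height and 70 for weight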
Is there an easy and straightforward way to load the output from sp.stats.describe() into a DataFrame, including the value names? It doesn't seem to be a dictionary or anything similar. Of course I can manually attach the relevant column names (see below), but I was wondering whether it is possible to load it directly into a DataFrame with named columns.
import pandas as pd
import scipy as sp
import scipy.stats  # needed so that sp.stats is available

data = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5]})
a = sp.stats.describe(data['a'])
pd.DataFrame(a)
pd.DataFrame(a).transpose().rename(columns={0: 'N', 1: 'Min,Max',
                                            2: 'Mean', 3: 'Var',
                                            4: 'Skewness',
                                            5: 'Kurtosis'})
You can use _fields for the column names from the named tuple:
a = sp.stats.describe(data['a'])
df = pd.DataFrame([a], columns=a._fields)
print (df)
nobs minmax mean variance skewness kurtosis
0 5 (1, 5) 3.0 2.5 0.0 -1.3
It is also possible to create a dictionary from the named tuple with _asdict:
d = sp.stats.describe(data['a'])._asdict()
df = pd.DataFrame([d], columns=d.keys())
print (df)
nobs minmax mean variance skewness kurtosis
0 5 (1, 5) 3.0 2.5 0.0 -1.3
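Building on that, here is a hedged sketch (my own extension, not from the answer) that summarises every column of data in a single DataFrame, one row per column:

rows = {col: sp.stats.describe(data[col])._asdict() for col in data.columns}
summary = pd.DataFrame.from_dict(rows, orient='index')
print(summary)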
Suppose I have the following dataframe.
df = pd.DataFrame({"a": [1, 0, 0, 2, 0]})
I want to construct a new dataframe based on df such that
newdf[0] = 1 or nan
newdf[1] = 0 + newdf[0] * exp(-alpha) # Alpha is some value.
newdf[2] = 0 + newdf[1] * exp(-alpha)
newdf[3] = 2 + newdf[2] * exp(-alpha)
newdf[4] = 0 + newdf[3] * exp(-alpha)
Basically, I want to construct a new dataframe that accepts an instantaneous change and decays its own value.
Is there an elegant way to achieve this using pd.rolling or pd.ewm?
I'd like to avoid any for-loop because dataframe has many rows and columns.
Thanks
Use -
alpha = 2
df['new'] = 1 or np.nan  # note: `1 or np.nan` simply evaluates to 1
df['new'] = df['a'] + df['a'].shift(-1) * np.exp(-alpha)  # shift(-1) pulls in the next row's value
import numpy as np is a dependency.
The last row in the df will be np.nan because of the shift.
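Note that the answer above applies only one step of the decay rather than the full recursion from the question. For the recurrence newdf[i] = a[i] + newdf[i-1] * exp(-alpha) without a Python loop, one option (my suggestion, not part of the original answer, and assuming newdf[0] = a[0] rather than nan) is scipy.signal.lfilter, which implements exactly this kind of first-order recursive filter:

import numpy as np
import pandas as pd
from scipy.signal import lfilter

alpha = 2
df = pd.DataFrame({"a": [1, 0, 0, 2, 0]})

# y[i] = x[i] + exp(-alpha) * y[i-1]  <=>  filter with b=[1], a=[1, -exp(-alpha)]
df['new'] = lfilter([1.0], [1.0, -np.exp(-alpha)], df['a'].to_numpy())
print(df['new'])
# 0    1.000000
# 1    0.135335
# 2    0.018316
# 3    2.002479
# 4    0.271006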
I want to generate a new column in my dataframe df which can take only two values, i.e. 0 or 1. My dataframe currently has 1000 rows along with other columns. I want to generate the 0s and 1s in such a way that 60% of the values in the column are 0 and the remaining 40% are 1.
I did the following:
generated_data = []
for index, row in df.iterrows():
    if index <= len(df) * 0.6:
        generated_data.append(0)
    else:
        generated_data.append(1)
The question is: how can this be achieved randomly? In my code the top 60% of the rows are 0 and the rest are 1; I want the assignment itself to be random.
Thanks
In case you want precisely 60% 0s and 40% 1s, you could first create the column with np.ones and np.zeros, and then shuffle it:
import numpy as np
generated_data = np.concatenate([np.zeros(600), np.ones(400)])
np.random.shuffle(generated_data)
print(generated_data)
Use numpy.random.choice with the p parameter if each value should independently have a 60% chance of being 0 and a 40% chance of being 1.
For exactly 60% 0s and 40% 1s, use numpy.random.shuffle with all the values generated beforehand:
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'a': range(1000)})
#print (df)
arr = np.ones(len(df))
arr[:int(len(df) * 0.6)] = 0
np.random.shuffle(arr)
df['new1'] = arr
df['new2'] = np.random.choice([0, 1], size=len(df), p=(0.6, 0.4))
print (df['new1'].value_counts())
0.0 600
1.0 400
Name: new1, dtype: int64
print (df['new2'].value_counts())
0 601
1 399
Name: new2, dtype: int64
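As a further variation (my own sketch, not from either answer), np.repeat plus a permutation expresses the exact 60/40 split in one line:

rng = np.random.default_rng(123)
df['new3'] = rng.permutation(np.repeat([0, 1], [600, 400]))
print(df['new3'].value_counts())  # exactly 600 zeros and 400 ones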
I have a list of Numpy arrays that looks like this:
[400.31865662]
[401.18514808]
[404.84015554]
[405.14682194]
[405.67735105]
[273.90969447]
[274.0894528]
When I try to convert it to a pandas DataFrame with the following code,
y = pd.DataFrame(data)
print(y)
I get the following output when printing it. Why do I get all those zeros?
0
0 400.318657
0
0 401.185148
0
0 404.840156
0
0 405.146822
0
0 405.677351
0
0 273.909694
0
0 274.089453
I would like to get a single column dataframe which looks like that:
400.31865662
401.18514808
404.84015554
405.14682194
405.67735105
273.90969447
274.0894528
You could flatten the numpy array:
import numpy as np
import pandas as pd
data = [[400.31865662],
[401.18514808],
[404.84015554],
[405.14682194],
[405.67735105],
[273.90969447],
[274.0894528]]
arr = np.array(data)
df = pd.DataFrame(data=arr.flatten())
print(df)
Output
0
0 400.318657
1 401.185148
2 404.840156
3 405.146822
4 405.677351
5 273.909694
6 274.089453
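As an aside (my own note, not part of the answer), ravel via a pandas Series gives the same single column:

s = pd.Series(np.asarray(data).ravel(), name='value')  # 1-D Series from the nested list
df = s.to_frame()                                      # back to a one-column DataFrame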
Since I assume many visitors of this post aren't here for the OP's specific and unreproducible issue, here's a general answer:
df = pd.DataFrame(array)
A strength of pandas is being easy on the eye (like Excel), so it's important to use column names.
import numpy as np
import pandas as pd
array = np.random.rand(5, 5)
array([[0.723, 0.177, 0.659, 0.573, 0.476],
[0.77 , 0.311, 0.533, 0.415, 0.552],
[0.349, 0.768, 0.859, 0.273, 0.425],
[0.367, 0.601, 0.875, 0.109, 0.398],
[0.452, 0.836, 0.31 , 0.727, 0.303]])
columns = [f'col_{num}' for num in range(5)]
index = [f'index_{num}' for num in range(5)]
Here's where the magic happens:
df = pd.DataFrame(array, columns=columns, index=index)
col_0 col_1 col_2 col_3 col_4
index_0 0.722791 0.177427 0.659204 0.572826 0.476485
index_1 0.770118 0.311444 0.532899 0.415371 0.551828
index_2 0.348923 0.768362 0.858841 0.273221 0.424684
index_3 0.366940 0.600784 0.875214 0.108818 0.397671
index_4 0.451682 0.836315 0.310480 0.727409 0.302597
I just figured out my mistake. (data) was a list of arrays:
[array([400.0290173]), array([400.02253235]), array([404.00252113]), array([403.99466754]), array([403.98681395]), array([271.97896036]), array([271.97110677])]
So I used np.vstack(data) to concatenate it:
conc = np.vstack(data)
[[400.0290173 ]
[400.02253235]
[404.00252113]
[403.99466754]
[403.98681395]
[271.97896036]
[271.97110677]]
Then I converted the concatenated array into a pandas DataFrame using:
newdf = pd.DataFrame(conc)
0
0 400.029017
1 400.022532
2 404.002521
3 403.994668
4 403.986814
5 271.978960
6 271.971107
Et voilà!
There is another way, which isn't mentioned in the other answers. If you have a NumPy array which is essentially a row vector (or column vector), i.e. with shape (n,), then you could do the following:
import numpy as np
import pandas as pd

# sample array
x = np.zeros(20)

# empty dataframe
df = pd.DataFrame()

# add the array to df as a column
df['column_name'] = x
This way you can add multiple arrays as separate columns.
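For example (an illustrative sketch with made-up column names):

df['heights'] = np.random.normal(1.75, 0.1, 20)  # hypothetical second column
df['weights'] = np.random.normal(70, 5, 20)      # hypothetical third column
print(df.head())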
I have a dataset that maps continuous values to discrete categories. I want to display a histogram with the continuous values as x and categories as y, where bars are stacked and normalized. Example:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
df = pd.DataFrame({
    'score': np.random.rand(1000),
    'category': np.random.choice(list('ABCD'), 1000)
}, columns=['score', 'category'])
print(df.head(10))
Output:
score category
0 0.649371 B
1 0.042309 B
2 0.689487 A
3 0.433064 B
4 0.978859 A
5 0.789140 C
6 0.215758 D
7 0.922389 B
8 0.105364 D
9 0.010274 C
If I try to plot this as a histogram using df.hist(by='category'), I get 4 separate graphs, one per category.
I managed to get the graph I wanted but I had to do a lot of manipulation.
# One column per category: 1 if the row maps to that category, 0 otherwise
df2 = pd.DataFrame({
    'score': df.score,
    'A': (df.category == 'A').astype(float),
    'B': (df.category == 'B').astype(float),
    'C': (df.category == 'C').astype(float),
    'D': (df.category == 'D').astype(float)
}, columns=['score', 'A', 'B', 'C', 'D'])
# select "bins" of .1 width, and sum for each category
df3 = pd.DataFrame([df2[(df2.score >= (n/10.0)) & (df2.score < ((n+1)/10.0))].iloc[:, 1:].sum() for n in range(10)])
# Sum over series for weights
df4 = df3.sum(1)
bars = pd.DataFrame(df3.values / np.tile(df4.values, [4, 1]).transpose(), columns=list('ABCD'))
bars.plot.bar(stacked=True)
I expect there is a more straightforward way to do this, easier to read and understand and more optimized with less intermediate steps. Any solutions?
I don't know if this is really that much more compact or readable than what you already have, but it is a suggestion (a late one, as such :)).
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'score': np.random.rand(1000),
    'category': np.random.choice(list('ABCD'), 1000)
}, columns=['score', 'category'])
# Set the range of the score as a category using pd.cut
df.set_index(pd.cut(df['score'], np.linspace(0, 1, 11)), inplace=True)
# Count all entries for all scores and all categories
a = df.groupby([df.index, 'category']).size()
# Normalize
b = df.groupby(df.index)['category'].count()
df_a = a.div(b, axis=0, level=0)
# Plot
df_a.unstack().plot.bar(stacked=True)
Consider assigning bins with cut, calculating grouped percentages with a couple of groupby().transform calls, and then aggregating and reshaping with pivot_table:
# CREATE BIN INDICATORS
df['plot_bins'] = pd.cut(df['score'], bins=np.arange(0, 1.1, 0.1),
                         labels=np.arange(0, 1, 0.1)).round(1)

# CALCULATE PCT OF CATEGORY OUT OF BINS
df['pct'] = (df.groupby(['plot_bins', 'category'])['score'].transform('count')
             .div(df.groupby(['plot_bins'])['score'].transform('count')))

# PIVOT TO AGGREGATE + RESHAPE
agg_df = (df.pivot_table(index='plot_bins', columns='category', values='pct', aggfunc='max')
          .reset_index(drop=True))

# PLOT
agg_df.plot(kind='bar', stacked=True, rot=0)
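For what it's worth, a shorter route (my own sketch, not from the answers above) is pd.crosstab with normalize='index', which bins, counts, and normalises in one call:

# assumes df with 'score' and 'category' as defined in the question
ct = pd.crosstab(pd.cut(df['score'], np.linspace(0, 1, 11)),
                 df['category'], normalize='index')
ct.plot.bar(stacked=True)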