Pandas: Randomly splitting index values between 2 index values - python

This is an "ISO week 53 problem".
I have a pandas Series instance with index values representing the ISO week number:
import pandas as pd
ts = pd.Series([1,1,1,2,3,1,2], index=[1,1,2,2,52,53,53])
I want to randomly and equally replace all of the index = 53 indices with either index = 52 or index = 1.
For the above, this could be:
import pandas as pd
ts = pd.Series([1,1,1,2,3,1,2], index=[1,1,2,2,52,52,1])
or
import pandas as pd
ts = pd.Series([1,1,1,2,3,1,2], index=[1,1,2,2,52,1,52])
for example. How do I do this, please?
Thanks for any help.
EDIT
In numpy I used the following to achieve this:
from numpy import where
from numpy.random import shuffle
indices = where(timestamps == 53)[0]
number_of_indices = len(indices)
if number_of_indices == 0:
    return  # no iso week number 53 to fix.
shuffle(indices) # randomly shuffle the indices.
midway_index = number_of_indices // 2
timestamps[indices[midway_index:]] = 52 # precedence if only 1 timestamp.
timestamps[indices[: midway_index]] = 1
where timestamps is the array of pandas index values.

List comprehension should work if I understand you correctly:
ts = pd.Series([1,1,1,2,3,1,2], index=[1,1,2,2,52,53,53])
import numpy as np
ts.index = [i if i != 53 else np.random.choice([1, 52]) for i in ts.index]
1 1
1 1
2 1
2 2
52 3
52 1
1 2
dtype: int64
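If the 53s need to be split exactly 50/50 rather than chosen independently per element, the shuffle-and-split idea from the numpy snippet in the question carries over directly. Here is a minimal sketch, assuming ts as defined above (the names idx, positions and midway are mine):
import numpy as np
import pandas as pd

ts = pd.Series([1, 1, 1, 2, 3, 1, 2], index=[1, 1, 2, 2, 52, 53, 53])

idx = ts.index.to_numpy(copy=True)    # copy, since Index objects are immutable
positions = np.where(idx == 53)[0]    # positions of the week-53 entries
np.random.shuffle(positions)          # randomise which half goes where
midway = len(positions) // 2
idx[positions[midway:]] = 52          # second half (including a lone entry) becomes 52
idx[positions[:midway]] = 1           # first half becomes 1
ts.index = idx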

Related

Pandas: convert a b-tree of Series objects, (with name set), into a single DataFrame with multi-index

I've created a 4-level b-tree-like structure in which each leaf is a Pandas Series and each level is a Series with two values indexed by True and False. Each level's Series is named after that level. The result is not a very useful object, but it is convenient to create.
The code below shows how to create a similar (but simpler) object that has the same essential properties.
What I really want is a MultiIndex DataFrame, where each index level inherits its name from the Series at that level.
import random
import pandas as pd
def sertree(names):
    if len(names) <= 1:
        ga = pd.Series([random.randint(0, 100) for x in range(5)], name='last')
        gb = pd.Series([random.randint(0, 100) for x in range(5)], name='last')
        return pd.Series([ga, gb], index=[True, False], name=names[0])
    else:
        xa = sertree(names[1:])
        xb = sertree(names[1:])
        return pd.Series([xa, xb], index=[True, False], name=names[0])
pp = sertree(['top', 'next', 'end'])
n = 4
while True:
    print(f"{'':>{n}s}{pp.name}")
    n += 4
    if len(pp) > 2:
        break
    pp = pp[True]
    top
        next
            end
                last
What I want is something like this...
top=[True,False]
nxt=[True,False]
end=[True,False]
last=range(5)
midx = pd.MultiIndex.from_product([top, nxt, end, last], names=['top', 'next', 'end', 'last'])
midf = pd.DataFrame([random.randint(0,100) for x in range(len(midx))], index=midx, columns=['name'])
In [593]: midf.head(12)
Out[593]:
                      name
top  next  end   last
True True  True  0      99
                 1      74
                 2      16
                 3      61
                 4       3
           False 0      44
                 1      46
                 2      59
                 3      14
                 4      82
     False True  0      98
                 1      93
Any ideas how to transform my 'pp' abomination into a nice multi-index DataFrame, using some Pandas method I'm failing to see? The essential requirement is to keep each level's Series name as the MultiIndex level name.
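One way to do this, sketched under the assumption that the tree always has Series values at every level except the leaves (as produced by sertree above), is to walk it recursively and stitch each level together with pd.concat, carrying each level's Series name over as the MultiIndex level name; tree_to_frame is a hypothetical helper name:
import pandas as pd

def tree_to_frame(node):
    # Branch node: its values are themselves Series. Recurse into each child,
    # stack the results, and name the new outer level after this node's Series name.
    if isinstance(node.iloc[0], pd.Series):
        parts = [tree_to_frame(child) for child in node]
        stacked = pd.concat(parts, keys=list(node.index))
        stacked.index = stacked.index.set_names(node.name, level=0)
        return stacked
    # Leaf node: a flat Series of numbers; keep its name ('last') as the index name.
    return node.rename_axis(node.name)

midf = tree_to_frame(pp).to_frame(name='name')
print(midf.index.names)   # ['top', 'next', 'end', 'last']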

Populate pandas dataframe using column and row indices as variables

Overview
How do you populate a pandas DataFrame using arithmetic that uses the column and row indices as variables?
Setup
import pandas as pd
import numpy as np
df = pd.DataFrame(index = range(5), columns = ['Combo_Class0', 'Combo_Class1', 'Combo_Class2', 'Combo_Class3', 'Combo_Class4'])
Objective
Each cell in df = row index * (column index + 2)
Attempt 1
You can use this solution to produce the following code:
row = 0
for i in range(5):
    row = row + 1
    df.loc[i] = [(row)*(1+2), (row)*(2+2), (row)*(3+2), (row)*(4+2), (row)*(5+2)]
Attempt 2
This solution seemed relevant as well, although I believe I've read you're not supposed to loop through dataframes. Besides, I'm not seeing how to loop through rows and columns:
for i, j in df.iterrows():
    df.loc[i] = i
You can leverage broadcasting for a more efficient approach:
ix = (df.index + 1).to_numpy()  # use .values instead for pandas < 0.24
df[:] = ix[:,None] * (ix+2)
print(df)
   Combo_Class0  Combo_Class1  Combo_Class2  Combo_Class3  Combo_Class4
0             3             4             5             6             7
1             6             8            10            12            14
2             9            12            15            18            21
3            12            16            20            24            28
4            15            20            25            30            35
Using np.multiply.outer:
df[:] = np.multiply.outer(np.arange(5) + 1, np.arange(5) + 3)

Finding the indexes of the N maximum values across an axis in Pandas

I know that there is a method .argmax() that returns the indexes of the maximum values across an axis.
But what if we want to get the indexes of the 10 highest values across an axis?
How could this be accomplished?
E.g.:
data = pd.DataFrame(np.random.random_sample((50, 40)))
You can use argsort:
s = pd.Series(np.random.permutation(30))
sorted_indices = s.argsort()
top_10 = sorted_indices[sorted_indices < 10]
print(top_10)
Output:
3 9
4 1
6 0
8 7
13 4
14 2
15 3
19 8
20 5
24 6
dtype: int64
IIUC, say, if you want to get the index of the top 10 largest numbers of column col:
data[col].nlargest(10).index
Give this a try. This will take the 10 largest values in each row and put them into a DataFrame.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random_sample((50, 40)))
df2 = pd.DataFrame(np.sort(df.values)[:,-10:])
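The answer above returns the 10 largest values per row; if what you want is the column labels of those values, one option is np.argsort on the underlying array. A sketch using the same df (top10_pos and top10_cols are illustrative names):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random_sample((50, 40)))

top10_pos = np.argsort(df.values, axis=1)[:, -10:]   # column positions of the 10 largest per row, ascending
top10_cols = pd.DataFrame(df.columns.values[top10_pos], index=df.index)   # map positions back to column labels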

Pandas: convert column from minutes (type object) to number

I want to convert a column of a Pandas DataFrame from an object to a number (e.g., float64). The DataFrame is the following:
import pandas as pd
import numpy as np
import datetime as dt
df = pd.read_csv('data.csv')
df
       ID       MIN
0  201167  32:59:00
1  203124     14:23
2  101179      8:37
3  200780      5:22
4  202699       NaN
5  203117       NaN
6  202331  36:05:00
7    2561  30:43:00
I would like to convert the MIN column from type object to a number (e.g., float64). For example, 32:59:00 should become 32.983333.
I'm not sure if it's necessary as an initial step, but I can convert each NaN to 0 via:
df['MIN'] = np.where(pd.isnull(df['MIN']), '0', df['MIN'])
How can I efficiently convert the entire column? I've tried variations of dt.datetime.strptime(), df['MIN'].astype('datetime64'), and pd.to_datetime(df['MIN']) with no success.
Defining a converter function:
def str_to_number(time_str):
    if not isinstance(time_str, str):
        return 0
    minutes, sec, *_ = [int(x) for x in time_str.split(':')]
    return minutes + sec / 60
and applying it to the MIN column:
df.MIN = df.MIN.map(str_to_number)
works.
Before:
   ID       MIN
0   1  32:59:00
1   2       NaN
2   3     14:23
After:
   ID        MIN
0   1  32.983333
1   2   0.000000
2   3  14.383333
The above is for Python 3. This works for Python 2:
def str_to_number(time_str):
    if not isinstance(time_str, str):
        return 0
    entries = [int(x) for x in time_str.split(':')]
    minutes = entries[0]
    sec = entries[1]
    return minutes + sec / 60.0
Note the 60.0. Alternatively, add from __future__ import division to avoid the integer-division problem.
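A vectorized alternative is also possible. This is a sketch that assumes every non-null MIN entry has at least minutes and seconds separated by colons, as in the sample data, and mirrors the converter above by mapping NaN to 0:
parts = df['MIN'].str.split(':', expand=True)   # NaN rows stay NaN in every split column
df['MIN'] = (parts[0].astype(float) + parts[1].astype(float) / 60).fillna(0)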

How do I convert a row from a pandas DataFrame from a Series back to a DataFrame?

I am iterating through the rows of a pandas DataFrame, expanding each one out into N rows with additional info on each one (for simplicity I've made it a random number here):
from pandas import DataFrame
import pandas as pd
from numpy import random, arange

N = 3
x = DataFrame.from_dict({'farm': ['A', 'B', 'A', 'B'],
                         'fruit': ['apple', 'apple', 'pear', 'pear']})
out = DataFrame()
for i, row in x.iterrows():
    rows = pd.concat([row] * N).reset_index(drop=True)  # requires row to be a DataFrame
    out = out.append(rows.join(DataFrame({'iter': arange(N), 'value': random.uniform(size=N)})))
In this loop, row is a Series object, so the call to pd.concat doesn't work. How do I convert it to a DataFrame? (e.g., the difference between x.ix[0:0] and x.ix[0])
Thanks!
Given what you commented, I would try
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

results = x.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
This should give you a separate results DataFrame. I have assumed that every farm-fruit combination is unique... there might be other ways if we knew more about your data.
Update
Running code example
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

N = 3
df = pd.DataFrame(arange(0, 8).reshape(4, 2), columns=['low', 'high'])
df['farm'] = 'a'
df['fruit'] = arange(0, 4)
results = df.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
df
   low  high farm  fruit
0    0     1    a      0
1    2     3    a      1
2    4     5    a      2
3    6     7    a      3
results
farm  fruit
a     0        [0.176124290969, 0.459726835079, 0.999564934689]
      1          [2.42920143009, 2.37484506501, 2.41474002256]
      2          [4.78918572452, 4.25916442343, 4.77440617104]
      3          [6.53831891152, 6.23242754976, 6.75141668088]
If instead you want a dataframe, you can update the function to
def giveMeSomeRows(group):
    return pd.DataFrame(random.uniform(low=group.low, high=group.high, size=N))
results
                      0
farm fruit
a    0     0   0.281088
           1   0.020348
           2   0.986269
     1     0   2.642676
           1   2.194996
           2   2.650600
     2     0   4.545718
           1   4.486054
           2   4.027336
     3     0   6.550892
           1   6.363941
           2   6.702316
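As for the narrower question in the title, a single row r taken from a DataFrame (a Series) can be turned back into a one-row DataFrame with r.to_frame().T, a small sketch:
r = x.iloc[0]             # one row of x, as a Series
row_df = r.to_frame().T   # back to a 1-row DataFrame with the original column names
Note that the transpose upcasts mixed dtypes to object.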
