I'm using Python pandas and have a data frame that is pulled from my CSV file:
ID Value
123 10
432 14
213 12
214 2
999 43
I want to randomly select some rows with the condition that the sum of the selected values = 30% of the total value.
Please advise how I should write this condition.
You can first shuffle the rows with sample, then filter with loc, keeping rows whose cumsum is less than or equal to 30% of the total:
out = df.sample(frac=1).loc[lambda d: d['Value'].cumsum().le(d['Value'].sum()*0.3)]
Example output:
ID Value
0 123 10
3 214 2
2 213 12
Intermediates:
ID Value cumsum ≤30%
0 123 10 10 True
3 214 2 12 True
2 213 12 24 True
1 432 14 38 False
4 999 43 81 False
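For reference, a self-contained version of this approach (a sketch; the sample frame is rebuilt from the question, and random_state is my addition, set only to make the shuffle reproducible):

import pandas as pd

df = pd.DataFrame({'ID': [123, 432, 213, 214, 999],
                   'Value': [10, 14, 12, 2, 43]})

# Shuffle all rows, then keep the leading rows whose running total
# stays at or below 30% of the grand total (30% of 81 is 24.3 here).
out = (df.sample(frac=1, random_state=1)
         .loc[lambda d: d['Value'].cumsum().le(d['Value'].sum() * 0.3)])
print(out)

Which rows survive depends on the shuffle order, so rerunning with a different seed gives a different subset.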
Related
I have the following problem and do not know how to solve it in a performant way:
Input Pandas DataFrame:
timestep  article  volume
35        1        20
37        2        5
123       2        12
155       3        10
178       2        23
234       1        17
478       1        28
Output Pandas DataFrame:
timestep  volume
35        20
37        25
123       32
178       53
234       50
478       61
Calculation Example for timestep 478:
28 (last article 1 volume) + 23 (last article 2 volume) + 10 (last article 3 volume) = 61
What is the best way to do this in pandas?
Try with ffill:
#sort if needed
df = df.sort_values("timestep")
df["volume"] = (df["volume"].where(df["article"].eq(1)).ffill().fillna(0) +
df["volume"].where(df["article"].eq(2)).ffill().fillna(0))
output = df.drop("article", axis=1)
>>> output
timestep volume
0 35 20.0
1 37 25.0
2 123 32.0
3 178 43.0
4 234 40.0
5 478 51.0
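The answer above hard-codes articles 1 and 2, so a volume from any other article (such as article 3 in the question's sample) is not carried into the totals. A hedged generalization (my sketch, not part of the original answer) pivots to one column per article, forward-fills each article's last known volume, and sums across the row:

import pandas as pd

df = pd.DataFrame({'timestep': [35, 37, 123, 155, 178, 234, 478],
                   'article': [1, 2, 2, 3, 2, 1, 1],
                   'volume': [20, 5, 12, 10, 23, 17, 28]})

# One column per article; ffill carries each article's last known volume
# forward, and the row-wise sum combines all articles per timestep.
wide = (df.pivot(index='timestep', columns='article', values='volume')
          .ffill()
          .fillna(0))
out = wide.sum(axis=1).reset_index(name='volume')

This reproduces the question's expected totals (53, 50, 61 for timesteps 178, 234, 478), though it also emits a row for timestep 155 (volume 42), which the example output omits.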
Group by article, take the last element of each group, and sum. Note that this yields just one number (the combined volume at the final timestep), not the per-timestep column:
df.groupby(['article']).tail(1)["volume"].sum()
You can assign a group number to each run of consecutive article values with .cumsum(). Then get the last value of the previous group via .map() with GroupBy.last(). Finally, add volume to this previous last value, as follows:
# Get group number of consecutive `article`
g = df['article'].ne(df['article'].shift()).cumsum()
# Add `volume` to previous group last
df['volume'] += g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
Result:
print(df)
timestep article volume
0 35 1 20
1 37 2 25
2 123 2 32
3 178 2 43
4 234 1 40
5 478 1 51
Breakdown of steps
Previous group last values (note that on this sample g is 1, 2, 2, 2, 3, 3, so g.sub(1) looks up the group just before each row's own):
g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
0     0
1    20
2    20
3    20
4    23
5    23
Name: article, dtype: int64
Try:
df["new_volume"] = (
df.loc[df["article"] != df["article"].shift(-1), "volume"]
.reindex(df.index, method='ffill')
.shift()
+ df["volume"]
).fillna(df["volume"])
df
Output:
timestep article volume new_volume
0 35 1 20 20.0
1 37 2 5 25.0
2 123 2 12 32.0
3 178 2 23 43.0
4 234 1 17 40.0
5 478 1 28 51.0
Explained:
Find the last record of each group by comparing each row's 'article' with the next row's, reindex that series to align with the original dataframe using forward fill, and shift it down one row so every row sees the previous group's closing 'volume'. Add this to the current row's 'volume', and fill the first row (which has no previous group) with its original 'volume' value.
I have a pandas data frame like this:
Subset Position Value
1 1 2
1 10 3
1 15 0.285714
1 43 1
1 48 0
1 89 2
1 132 2
1 152 0.285714
1 189 0.133333
1 200 0
2 1 0.133333
2 10 0
2 15 2
2 33 2
2 36 0.285714
2 72 2
2 132 0.133333
2 152 0.133333
2 220 3
2 250 8
2 350 6
2 750 0
How can I get the mean of the values for every x rows, with a step size of y, per subset in pandas?
For example, the mean of every 5 rows (step size = 2) for the value column in each subset, like this:
Subset Start_position End_position Mean
1 1 48 1.2571428
1 15 132 1.0571428
1 48 189 0.8838094
2 1 36 0.8838094
2 15 132 1.2838094
2 36 220 1.110476
2 132 350 3.4533332
Is this what you were looking for:
import pandas as pd

df = pd.DataFrame({'Subset': [1]*10 + [2]*12,
                   'Position': [1, 10, 15, 43, 48, 89, 132, 152, 189, 200,
                                1, 10, 15, 33, 36, 72, 132, 152, 220, 250, 350, 750],
                   'Value': [2, 3, .285714, 1, 0, 2, 2, .285714, .1333333, 0,
                             .133333, 0, 2, 2, .285714, 2, .133333, .133333, 3, 8, 6, 0]})
window = 5
step_size = 2

rows = []
for subset in df.Subset.unique():
    subset_df = df[df.Subset == subset].reset_index(drop=True)
    # Slide over this subset only (len(subset_df), not len(df))
    for i in range(0, len(subset_df), step_size):
        window_rows = subset_df.iloc[i:i + window]
        if len(window_rows) < window:
            continue
        rows.append({'Subset': window_rows.Subset.iloc[0],
                     'Start_position': window_rows.Position.iloc[0],
                     'End_position': window_rows.Position.iloc[-1],
                     'Mean': window_rows.Value.mean()})
# DataFrame.append was removed in pandas 2.0, so collect dicts and build once
averaged_df = pd.DataFrame(rows, columns=['Subset', 'Start_position', 'End_position', 'Mean'])
Some notes about the code:
It assumes all subsets are in order in the original df (1,1,2,1,2,2 will behave as if it was 1,1,1,2,2,2)
If there is a group left that's smaller than a window, it will skip it (e.g. the partial window Subset 1, positions 132-200, mean 0.60476 is not included)
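As a further sketch (my addition, assuming pandas >= 1.5, which added the step argument to rolling), the stepped windows can also be computed without the inner Python loop:

import pandas as pd

window, step = 5, 2

def stepped_means(g):
    # Rolling mean over `window` rows, evaluated every `step` rows;
    # leading NaNs mark windows that are not yet full and are dropped.
    m = g['Value'].rolling(window, step=step).mean().dropna()
    idx = m.index  # labels of the rows where each window ends
    return pd.DataFrame({
        'Subset': g.loc[idx, 'Subset'].to_numpy(),
        'Start_position': g['Position'].shift(window - 1).loc[idx].astype(int).to_numpy(),
        'End_position': g.loc[idx, 'Position'].to_numpy(),
        'Mean': m.to_numpy()})

out = pd.concat([stepped_means(g.reset_index(drop=True))
                 for _, g in df.groupby('Subset')], ignore_index=True)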
One version-specific answer would be, using pandas.api.indexers.FixedForwardWindowIndexer, introduced in pandas 1.1.0:
>>> window=5
>>> step=2
>>> indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=window)
>>> df2 = df.join(df.Position.shift(-(window-1)), lsuffix='_start', rsuffix='_end')
>>> df2 = df2.assign(Mean=df2.pop('Value').rolling(window=indexer).mean()).iloc[::step]
>>> df2 = df2[df2.Position_start.lt(df2.Position_end)].dropna()
>>> df2['Position_end'] = df2['Position_end'].astype(int)
>>> df2
Subset Position_start Position_end Mean
0 1 1 48 1.257143
2 1 15 132 1.057143
4 1 48 189 0.883809
10 2 1 36 0.883809
12 2 15 132 1.283809
14 2 36 220 1.110476
16 2 132 350 3.453333
Note that the Position_start.lt(Position_end) filter also discards windows that spill across subset boundaries, since Position restarts at a lower value at the start of each subset.
I am new to numpy and need some help in solving my problem.
I read records from a binary file using dtypes, then select 3 columns:
df = pd.DataFrame(np.array([(124,90,5),(125,90,5),(126,90,5),(127,90,0),(128,91,5),(129,91,5),(130,91,5),(131,91,0)]), columns = ['atype','btype','ctype'] )
which gives
atype btype ctype
0 124 90 5
1 125 90 5
2 126 90 5
3 127 90 0
4 128 91 5
5 129 91 5
6 130 91 5
7 131 91 0
'atype' is of no interest to me for now.
But what I want is the row numbers when
(x,90,5) appears in 2nd and 3rd columns
(x,90,0) appears in 2nd and 3rd columns
when (x,91,5) appears in 2nd and 3rd columns
and (x,91,0) appears in 2nd and 3rd columns
etc
There are 7 variables like 90,91,92,93,94,95,96 and correspondingly there will be values of either 5 or 0 in the 3rd column.
There are about 1 million entries, so is there any way to find these without a for loop?
Using pandas, you could try the following:
df[(df['btype'].between(90, 96)) & (df['ctype'].isin([0, 5]))]
Using your example, if some of the values are changed so that df is
atype btype ctype
0 124 90 5
1 125 90 5
2 126 0 5
3 127 90 100
4 128 91 5
5 129 0 5
6 130 91 5
7 131 91 0
then using the solution above, the following is returned.
atype btype ctype
0 124 90 5
1 125 90 5
4 128 91 5
6 130 91 5
7 131 91 0
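If the goal is the actual row numbers for each (btype, ctype) pair rather than a filtered frame, a small sketch using GroupBy.indices (the sample frame is rebuilt from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([(124, 90, 5), (125, 90, 5), (126, 90, 5),
                            (127, 90, 0), (128, 91, 5), (129, 91, 5),
                            (130, 91, 5), (131, 91, 0)]),
                  columns=['atype', 'btype', 'ctype'])

# Dict mapping each (btype, ctype) pair to the integer positions where
# it occurs; no explicit Python loop over the rows.
positions = df.groupby(['btype', 'ctype']).indices
print(positions[(90, 5)])  # array([0, 1, 2])

This covers every combination such as (90, 0), (91, 5), and so on in a single pass.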
I have the below pivot table which I created from a dataframe using the following code:
table = pd.pivot_table(df, values='count', index=['days'], columns=['movements'], aggfunc=np.sum)
movements 0 1 2 3 4 5 6 7
days
0 2777 51 2
1 6279 200 7 3
2 5609 110 32 4
3 4109 118 101 8 3
4 3034 129 109 6 2 2
5 2288 131 131 9 2 1
6 1918 139 109 13 1 1
7 1442 109 153 13 10 1
8 1085 76 111 13 7 1
9 845 81 86 8 8
10 646 70 83 1 2 1 1
As you can see, the pivot table has 8 columns, 0-7, and now I want to plot some specific columns instead of all of them. I could not manage to select the columns. Let's say I want to plot column 0 and column 2 against the index; what should I use for y to select column 0 and column 2?
plt.plot(x=table.index, y=??)
I tried with y = table.value['0', '2'] and y=table['0','2'] but nothing works.
You cannot select an ndarray of both columns for y. If you need those two columns in a single plot, you can use:
plt.plot(table['0'])
plt.plot(table['2'])
If the column names are integers, then:
plt.plot(table[0])
plt.plot(table[2])
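Alternatively, pandas can draw several columns in one call; a minimal sketch (assuming integer column labels, as the pivot table above produces):

import matplotlib.pyplot as plt

# Select the two movement columns and plot both against the days index.
table[[0, 2]].plot()
plt.xlabel('days')
plt.show()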
Suppose we have a Pandas DataFrame like the following:
df=pd.DataFrame({'name':['Ind','Chn','SG','US','SG','US','Ind','Chn','Fra','Fra'],'a':[5,6,3,4,7,12,66,78,65,100]})
I would like to sum the values of column 'a' for each distinct values of column 'name'.
I tried this code:
for i in df['name'].unique():
df['tot']=df[(df.name==i)]['a'].sum()
In the resulting new column 'tot', every row contains only the sum for the last distinct value of 'name' (i.e. 'Fra'), rather than a separate value for each of Ind, US, Fra, etc. I would like each cell in the new column 'tot' to hold the sum for that row's unique 'name' value, and ultimately I want to sort the whole dataframe df by these per-name sums.
I tried using dictionary,
dc={}
for i in df['name'].unique():
dc[i]=dc.get(i,0)+(df[(df.name==i)]['a'].sum())
I get the desired result in the dictionary, but I don't know how to sort df from there based on the values of the dictionary 'dc'.
{'Ind': 71, 'Chn': 84, 'SG': 10, 'US': 16, 'Fra': 165}
Could anybody please explain how to work out such a scenario, in as many ways as possible? Which would be the most efficient when dealing with huge data? Thanks!
Edit: my expected output is just the dataframe df sorted by the values of the new column 'tot', or something like finding the rows associated with the maximum or minimum values in 'tot'.
You are looking for groupby
df=pd.DataFrame({'name':['Ind','Chn','SG','US','SG','US','Ind','Chn','Fra','Fra'],'a':[5,6,3,4,7,12,66,78,65,100]})
df.groupby('name').a.sum()
Out[950]:
name
Chn 84
Fra 165
Ind 71
SG 10
US 16
Name: a, dtype: int64
Edit:
df.assign(total=df.name.map(df.groupby('name').a.sum())).sort_values(['name','total'])
Out[964]:
a name total
1 6 Chn 84
7 78 Chn 84
8 65 Fra 165
9 100 Fra 165
0 5 Ind 71
6 66 Ind 71
2 3 SG 10
4 7 SG 10
3 4 US 16
EDIT 2:
df.groupby('name').a.sum().sort_values(ascending=True)
Out[1111]:
name
SG 10
US 16
Ind 71
Chn 84
Fra 165
Name: a, dtype: int64
df.groupby('name').a.sum().sort_values(ascending=False)
Out[1112]:
name
Fra 165
Chn 84
Ind 71
US 16
SG 10
Name: a, dtype: int64
(df.groupby('name').a.sum().sort_values(ascending=False)).index.values
Out[1119]: array(['Fra', 'Chn', 'Ind', 'US', 'SG'], dtype=object)
IIUIC, use groupby and transform
In [3716]: df['total'] = df.groupby('name')['a'].transform('sum')
In [3717]: df
Out[3717]:
a name total
0 5 Ind 71
1 6 Chn 84
2 3 SG 10
3 4 US 16
4 7 SG 10
5 12 US 16
6 66 Ind 71
7 78 Chn 84
8 65 Fra 165
9 100 Fra 165
And use sort_values:
In [3719]: df.sort_values(by='total', ascending=False)
Out[3719]:
a name total
8 65 Fra 165
9 100 Fra 165
1 6 Chn 84
7 78 Chn 84
0 5 Ind 71
6 66 Ind 71
3 4 US 16
5 12 US 16
2 3 SG 10
4 7 SG 10
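To address the edit in the question (finding the rows tied to the largest or smallest totals), a small follow-up sketch building on the total column created above:

# Rows belonging to the name(s) with the maximum / minimum total
max_rows = df[df['total'] == df['total'].max()]
min_rows = df[df['total'] == df['total'].min()]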