I have files (A, B, C, etc.), each with 12,000 data points. I have divided each file into batches of 1,000 points and computed a value for each batch. So now each file has 12 values, which are loaded into a pandas DataFrame (shown below).
file value_1 value_2
0 A 1 43
1 A 1 89
2 A 1 22
3 A 1 87
4 A 1 43
5 A 1 89
6 A 1 22
7 A 1 87
8 A 1 43
9 A 1 89
10 A 1 22
11 A 1 87
12 B 0 99
13 B 0 23
14 B 0 29
15 B 0 34
16 B 0 99
17 B 0 23
18 B 0 29
19 B 0 34
20 B 0 99
21 B 0 23
22 B 0 29
23 B 0 34
24 C 1 62
- - - -
- - - -
Now, as the next step, I need to randomly select a file, and for that file randomly select a sequence of 4 batches of value_1. The latter, I believe, can be done with df.sample(), but I'm not sure how to randomly select a file. I tried np.random.choice(data['file'].unique()), but it doesn't seem correct.
Thanks for the help in advance. I'm pretty new to pandas and python in general.
If I understand what you are trying to get at, the following should be of help:
# Test dataframe
import numpy as np
import pandas as pd

data = pd.DataFrame({'file': np.repeat(['A', 'B', 'C'], 12),
                     'value_1': np.repeat([1, 0, 1], 12),
                     'value_2': np.random.randint(20, 100, 36)})

# Select a file at random
data1 = data[data.file == np.random.choice(data['file'].unique())].reset_index(drop=True)

# Get a random starting index from data1, leaving room for a run of 4 rows
start_ix = np.random.choice(data1.index[:-3])

# Get the sequence of 4 batches starting at the random index
print(data1.loc[start_ix:start_ix + 3])
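For reference, the two steps can be wrapped into a single helper. This is just a sketch; the function name is mine, and it assumes the test dataframe data defined above (with numpy/pandas already imported):

def random_sequence(data, n_batches=4):
    # pick a file at random, then a run of n_batches consecutive rows for it
    file_choice = np.random.choice(data['file'].unique())
    sub = data[data.file == file_choice].reset_index(drop=True)
    # highest allowed start still leaves n_batches rows
    start = np.random.randint(0, len(sub) - n_batches + 1)
    return sub.loc[start:start + n_batches - 1]

print(random_sequence(data, n_batches=4))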
Here's a rather long-winded answer that has a lot of flexibility and uses some random data I generated. I also added a field to the dataframe to denote whether a row has already been used.
Generating Data
import pandas as pd
from string import ascii_lowercase
import random
random.seed(44)
files = [ascii_lowercase[i] for i in range(4)]
value_1 = random.sample(range(1, 10), 8)
files_df = files*len(value_1)
value_1_df = value_1*len(files)
value_1_df.sort()
value_2_df = random.sample(range(100, 200), len(files_df))
df = pd.DataFrame({'file': files_df,
                   'value_1': value_1_df,
                   'value_2': value_2_df,
                   'used': 0})
Randomly Selecting Files
len_to_run = 3    # change to run for however long you'd like
batch_to_pull = 4
updated_files = df.loc[df.used == 0, 'file'].unique()
for i in range(len_to_run):    # not needed if you only want to run once
    file_to_pull = ''.join(random.sample(list(updated_files), 1))
    print('file ' + file_to_pull)
    for j in range(batch_to_pull):    # pulling 4 values
        updated_value_1 = df.loc[(df.used == 0) & (df.file == file_to_pull), 'value_1'].unique()
        value_1_to_pull = random.sample(list(updated_value_1), 1)
        print('value_1 ' + str(value_1_to_pull))
        df.loc[(df.file == file_to_pull) & (df.value_1 == value_1_to_pull[0]), 'used'] = 1
file a
value_1 [1]
value_1 [7]
value_1 [5]
value_1 [4]
file d
value_1 [3]
value_1 [2]
value_1 [1]
value_1 [5]
file d
value_1 [7]
value_1 [4]
value_1 [6]
value_1 [9]
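A more compact, pandas-only variant of the same idea is sketched below. This is not part of the answer above; it assumes the df with the used column defined earlier and draws one file and 4 of its value_1 batches without replacement:

# pick one file at random from the rows not yet used
file_to_pull = df.loc[df.used == 0, 'file'].drop_duplicates().sample(1).iloc[0]
# pick 4 of its unused value_1 batches without replacement
batches = df.loc[(df.used == 0) & (df.file == file_to_pull), 'value_1'].drop_duplicates().sample(4)
# mark those batches as used
df.loc[(df.file == file_to_pull) & (df.value_1.isin(batches)), 'used'] = 1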
I have a data frame containing three columns, where Col_1 and Col_2 contain some arbitrary data:
data = {"Height": range(1, 20, 1), "Col_1": range(2, 40, 2), "Col_2": range(3, 60, 3)}
df = pd.DataFrame(data)
Height Col_1 Col_2
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
4 5 10 15
5 6 12 18
6 7 14 21
7 8 16 24
8 9 18 27
9 10 20 30
10 11 22 33
11 12 24 36
12 13 26 39
13 14 28 42
14 15 30 45
15 16 32 48
16 17 34 51
17 18 36 54
18 19 38 57
and another data frame containing height values that should be used to segment the Height column of df:
data_segments = {"Section Height" : [1, 10, 20]}
df_segments = pd.DataFrame(data_segments)
Section Height
0 1
1 10
2 20
I want to create two new data frames: df_segment_0 should contain all columns of the initial df, but only the rows whose Height falls between the first two entries of df_segments. The same approach applies to df_segment_1. They should look like:
df_segment_0
Height Col_1 Col_2
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
4 5 10 15
5 6 12 18
6 7 14 21
7 8 16 24
8 9 18 27
df_segment_1
Height Col_1 Col_2
9 10 20 30
10 11 22 33
11 12 24 36
12 13 26 39
13 14 28 42
14 15 30 45
15 16 32 48
16 17 34 51
17 18 36 54
18 19 38 57
I tried the following code using the .loc method and added the suggestion of C Hecht to create a list of data frames:
df_segment_list = []
try:
    for index in df_segments.index:
        df_segment = df[["Height", "Col_1", "Col_2"]].loc[(df["Height"] >= df_segments["Section Height"][index]) & (df["Height"] < df_segments["Section Height"][index + 1])]
        df_segment_list.append(df_segment)
except KeyError:
    pass
The try-except is used only to ignore the error on the last entry, since there is no next height for index=2. The data frames in this list can be accessed as C Hecht suggested:
df_segment_0 = df_segment_list[0]
Height Col_1 Col_2
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
4 5 10 15
5 6 12 18
6 7 14 21
7 8 16 24
8 9 18 27
However, I would like to automate the naming of the final data frames. I tried:
for i in range(0, len(df_segment_list)):
    name = "df_segment_" + str(i)
    name = df_segment_list[i]
I expected this code to simply automate df_segment_0 = df_segment_list[0]; instead I receive the error name 'df_segment_0' is not defined.
The reason I need separate data frames is that I will perform many subsequent operations using Col_1 and Col_2, so I need row-wise access to each one of them, for example:
df_segment_0 = df_segment_0.assign(col_3=df_segment_0["Col_1"] / df_segment_0["Col_2"])
How do I achieve this?
EDIT 1: Clarified question with the suggestion from C Hecht.
If you want to get all entries that are smaller than the current segment height in your segmentation data frame, here you go :)
import pandas as pd

df1 = pd.DataFrame({"Height": range(1, 20, 1), "Col_1": range(2, 40, 2), "Col_2": range(3, 60, 3)})
df_segments = pd.DataFrame({"Section Height": [1, 10, 20]})

def segment_data_frame(data_frame: pd.DataFrame, segmentation_plan: pd.DataFrame):
    df = data_frame.copy()  # making a safety copy because we mutate the df !!!
    for sh in segmentation_plan["Section Height"]:  # sh is the new maximum "Height"
        df_new = df[df["Height"] < sh]  # select all entries below the maximum "Height"
        df.drop(df_new.index, inplace=True)  # remove them from the working copy
        yield df_new

# ATTENTION: segment_data_frame() will calculate each segment at runtime!
# So if you don't want to iterate over it but rather have one list to contain
# them all, you must use list(segment_data_frame(...)) or [x for x in segment_data_frame(...)]

for segment in segment_data_frame(df1, df_segments):
    print(segment)
    print()

print(list(segment_data_frame(df1, df_segments)))
If you want to execute certain steps on those segments, you can just loop over them like so:
for segment in segment_data_frame(df1, df_segments):
    do_stuff_with(segment)
If you want to keep track of the individual frames and name them, you can use a dictionary.
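For example, a minimal sketch (the key names are just illustrative, reusing segment_data_frame, df1 and df_segments from above):

# collect the segments into a dict keyed by generated names
segments = {f"df_segment_{i}": seg
            for i, seg in enumerate(segment_data_frame(df1, df_segments))}

# row-wise operations then work on each named segment, e.g.:
segments["df_segment_0"] = segments["df_segment_0"].assign(
    col_3=segments["df_segment_0"]["Col_1"] / segments["df_segment_0"]["Col_2"])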
Unfortunately I don't 100% understand what you have in mind, but I hope the following helps you find the answer:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Section Height': [20, 90, 111, 232, 252, 3383, 3768, 3826, 3947, 4100], 'df_names': [f'df_section_{i}' for i in range(10)]})
df['shifted'] = df['Section Height'].shift(-1)
new_dfs = []
for index, row in df.iterrows():
    if np.isnan(row['shifted']):
        # Don't know what you want to do here
        pass
    else:
        new_df = pd.DataFrame({'heights': [i for i in range(int(row['Section Height']), int(row['shifted']))]})
        new_df.name = row['df_names']
        new_dfs.append(new_df)
The contents of new_dfs are data frames that look like this:
heights
0 20
1 21
2 22
3 23
4 24
.. ...
65 85
66 86
67 87
68 88
69 89
[70 rows x 1 columns]
If you clarify your questions given this input, we could help you all the way, but this should hopefully point you in the right direction.
Edit: A small comment on using df.name: This is not really stable and if you do stuff like dropping a column, pickling/unpickling, etc. the name will likely be lost. But you can surely find a good solution to maintain the name depending on your needs.
I want to find the number of times each number appears at each index position in a list of 6-number sets. I do not know what the numbers will be, but they will only range from 0-99.
Example list:
data = [['22', '45', '6', '72', '1', '65'], ['2', '65', '67', '23', '98', '1'], ['13', '45', '98', '4', '12', '65']]
Eventually I will be putting the resulting counts into a pandas DataFrame to look something like this:
num numofoccurances position numoftimesinposition
01 02 04 01
01 02 05 01
02 01 00 01
04 02 03 01
06 01 02 01
12 01 04 01
13 01 00 01
and so on...
The resulting data will be a little different due to the num repeating for each time it appears in a different index position, but hopefully this helps you understand what I'm looking for.
So far, this is what I've started:
data = json.load(f)
numbers = []
contains = []

'''
This section is simply taking the data from the json file and putting it all
into a list of lists containing the 6 elements I need in each list
'''
for i in data['data']:
    item = [i[9], i[10]]
    # print(item)
    item = [words for segments in item for words in segments.split()]
    numbers.append(item)

'''
This is my attempt to count the number of occurrences for each number in the
range, then add it to a list.
'''
x = range(1, 99)
for i in numbers:
    if x in i and not contains:
        contains.append(x)
import pandas as pd

num_pos = [(num, pos) for i in data for pos, num in enumerate(i)]
df = pd.DataFrame(num_pos, columns=['number', 'position']).assign(numoftimesinposition=1)
df = df.astype(int).groupby(['number', 'position']).count().reset_index()
df1 = df.groupby('number').numoftimesinposition.sum().reset_index().\
    rename(columns={'numoftimesinposition': 'numofoccurences'}).\
    merge(df, on='number')
print(df1)
number numofoccurences position numoftimesinposition
0 1 2 4 1
1 1 2 5 1
4 2 1 0 1
7 4 1 3 1
9 6 1 2 1
2 12 1 4 1
3 13 1 0 1
5 22 1 0 1
6 23 1 3 1
8 45 2 1 2
10 65 3 1 1
11 65 3 5 2
12 67 1 2 1
13 72 1 3 1
14 98 2 2 1
15 98 2 4 1
If the code above feels slow, then use Counter from collections:
import pandas as pd
from collections import Counter
num_pos = [(int(num),pos) for i in data for pos,num in enumerate(i)]
count_data = [(num,pos,occurence) for (num,pos), occurence in Counter(num_pos).items()]
df = pd.DataFrame(count_data, columns = ['num','pos','occurence']).sort_values(by='num')
df['total_occurence'] = [Counter(df.num).get(num) for num in df.num]
print(df)
This should solve your query, and for larger data it should be faster than the approach above, which needs two groupby calls and several other pandas operations:
import numpy as np
import pandas as pd

# get the list of lists into a 2d numpy array
dd = np.array(data).astype(int)
# get vocab of all unique numbers
vocab = np.unique(dd.flatten())
# loop through vocab and get the count of occurrences in each index position
df = pd.DataFrame([[i] + list(np.sum((dd == i).astype(int), axis=0)) for i in vocab])
# rename cols
df.columns = ['num', 0, 1, 2, 3, 4, 5]
# create the total occurrences of each number
df['numoccurances'] = df.iloc[:, 1:].sum(axis=1)
# stack the position counts and rename cols
stats = pd.DataFrame(df.set_index(['num', 'numoccurances']).stack()).reset_index().\
    set_axis(['num', 'numoccurances', 'position', 'numtimesinposition'], axis=1)
# keep only rows with at least one occurrence
stats = stats[stats['numtimesinposition'] > 0].reset_index(drop=True)
stats
num numoccurances position numtimesinposition
0 1 2 4 1
1 1 2 5 1
2 2 1 0 1
3 4 1 3 1
4 6 1 2 1
5 12 1 4 1
6 13 1 0 1
7 22 1 0 1
8 23 1 3 1
9 45 2 1 2
10 65 3 1 1
11 65 3 5 2
12 67 1 2 1
13 72 1 3 1
14 98 2 2 1
15 98 2 4 1
As the results show, 1 appears a total of 2 times in the sample data you shared, once each in the 5th and 6th positions. Similarly, 2 appears once in total, and that is in the 1st position.
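For completeness, a compact stack/groupby variant (a sketch of my own, not from either answer above) produces the same long-format counts from the question's data list:

import pandas as pd

# data is the example list of lists of strings from the question
long_df = pd.DataFrame(data).astype(int).stack().rename('num').reset_index()
long_df = long_df.rename(columns={'level_1': 'position'})
counts = (long_df.groupby(['num', 'position']).size()
                 .rename('numtimesinposition').reset_index())
counts['numoccurances'] = counts.groupby('num')['numtimesinposition'].transform('sum')
print(counts)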
I am trying to figure out how to get some sort of running total using multiple columns, and I can't figure out where to even start. I've used cumsum before, but only on a single column, and that won't work here.
I have this table :
Index A B C
1 10 12 20
2 10 14 20
3 10 6 20
I am trying to build out this table that looks like this:
Index A B C D
1 10 12 20 10
2 10 14 20 18
3 10 6 20 24
The formula for D is as follows:
D2 = ( D1 - B1 ) + C1
D1 = Column A (the first value of D is just the first value of column A)
For example, D2 = (10 - 12) + 20 = 18 and D3 = (18 - 14) + 20 = 24.
Any ideas on how I could do this? I am totally out of ideas on this.
This should work:
df.loc[0, 'New_Inventory'] = df.loc[0, 'Inventory']
for i in range(1, len(df)):
    df.loc[i, 'New_Inventory'] = df.loc[i-1, 'Inventory'] - df.loc[i-1, 'Booked'] - abs(df.loc[i-1, 'New_Inventory'])
df.New_Inventory = df.New_Inventory.astype(int)
df
# Index Inventory Booked New_Inventory
#0 1/1/2020 10 12 10
#1 1/2/2020 10 14 -12
#2 1/3/2020 10 6 -16
You can get your answer by using shift; reference the answer here:
import pandas as pd

raw_data = {'Index': ['1/1/2020', '1/2/2020', '1/3/2020', '1/4/2020', '1/5/2020'],
            'Inventory': [10, 10, 10, 10, 10],
            'Booked': [12, 14, 6, 3, 5]}
df = pd.DataFrame(raw_data)
df['New_Inventory'] = 10  # need to initialize
df['New_Inventory'] = df['Inventory'] - df['Booked'].shift(1) - df['New_Inventory'].shift(1)
df
Your requested output seems wrong; the calculation above for New_Inventory is what was requested.
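For reference, the recurrence stated in the question (D1 = A1, then Dn = Dn-1 - Bn-1 + Cn-1) can also be written as a running sum and vectorized. A minimal sketch, assuming columns named A, B, C as in the question's table:

import pandas as pd

df = pd.DataFrame({'A': [10, 10, 10], 'B': [12, 14, 6], 'C': [20, 20, 20]})
# D starts at the first value of A, then accumulates (C - B) from the previous rows
df['D'] = df['A'].iloc[0] + (df['C'] - df['B']).cumsum().shift(fill_value=0)
print(df)   # D comes out as 10, 18, 24, matching the requested table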
I have a pandas df which is more or less like
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
.....
This DF contains a couple of million points. I am now trying to generate some descriptors that incorporate the time nature of the data. The idea is that for each line I should create a window of length x going back in the data and count the occurrences of that particular key in the window. I did an implementation, but by my estimation the calculation for 23 different windows would run for 32 days. Here is the code:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
There are multiple windows of different lengths. I have the uneasy feeling, however, that iteration is probably not the smartest way to do this aggregation. Is there a way to make it run faster?
On a toy example data frame, you can achieve about a 7x speedup by using apply() instead of iterrows().
Here's some sample data, expanded a bit from OP to include multiple key values:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
7 8 94 1
8 9 94 1
9 10 38 1
import pandas as pd
df = pd.read_clipboard()
Based on these data, and the counting criteria defined by OP, we expect the output to be:
key dist window
ID
1 57 1 0
2 22 1 0
3 12 1 0
4 45 1 0
5 94 1 0
6 36 1 0
7 38 1 0
8 94 1 1
9 94 1 2
10 38 1 1
Using OP's approach:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
print('old solution: ')
%timeit features_wind2(df)
old solution:
10 loops, best of 3: 25.6 ms per loop
Using apply():
def compute_window(row):
    # when using apply(), .name gives the row index
    # pandas label indexing is inclusive, so take index-1 as cut_idx
    cut_idx = row.name - 1
    key = row.key
    # count the number of instances key appears in df, prior to this row
    return sum(df.loc[:cut_idx, 'key'] == key)
print('new solution: ')
%timeit df['window1'] = df.apply(compute_window, axis='columns')
new solution:
100 loops, best of 3: 3.71 ms per loop
Note that with millions of records, this will still take a while, and the relative performance gains will likely be diminished somewhat compared to this small test case.
UPDATE
Here's an even faster solution, using groupby() and cumsum(). I made some sample data that seems roughly in line with the provided example, but with 10 million rows. The computation finishes in well under a second, on average:
# sample data
import numpy as np
import pandas as pd
N = int(1e7)
idx = np.arange(N)
keys = np.random.randint(1,100,size=N)
dists = np.ones(N).astype(int)
df = pd.DataFrame({'ID':idx,'key':keys,'dist':dists})
df = df.set_index('ID')
Now performance testing:
%timeit df['window'] = df.groupby('key').cumsum().subtract(1)
1 loop, best of 3: 755 ms per loop
Here's enough output to show that the computation is working:
dist key window
ID
0 1 83 0
1 1 4 0
2 1 87 0
3 1 66 0
4 1 31 0
5 1 33 0
6 1 1 0
7 1 77 0
8 1 49 0
9 1 49 1
10 1 97 0
11 1 36 0
12 1 19 0
13 1 75 0
14 1 4 1
Note: To revert ID from index to column, use df.reset_index() at the end.
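Note also that both solutions above count every prior occurrence of a key; the window limit from the question (index > lid - 200 in the posted code) is not enforced. If that limit matters, a rough sketch of one way to enforce it (my own, assuming a default 0..N-1 positional index as in the generated sample data, and a column named key):

import numpy as np
import pandas as pd

def windowed_key_count(df, width=200):
    # For each row, count earlier rows with the same key whose position lies
    # strictly inside (pos - width, pos)
    out = np.zeros(len(df), dtype=int)
    pos = pd.Series(np.arange(len(df)))
    for _, p in pos.groupby(df['key'].to_numpy()):
        p = p.to_numpy()                                    # positions of this key, ascending
        left = np.searchsorted(p, p - width, side='right')  # first same-key position inside the window
        out[p] = np.arange(len(p)) - left                   # earlier same-key rows within the window
    return out

# df['window1'] = windowed_key_count(df, width=200)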
If I generate two columns of data per iteration of a for-loop and want to save them to a CSV file, how can the two columns from each subsequent iteration be stacked side by side in the same CSV file (no overwriting)? I found the mode='a' option for appending, but it only appends the columns vertically (by rows). I have looked into concatenating with pd.concat; however, I don't know how to use it in a for loop for more than two dataframes. Do you have some sample code or ideas to share?
import numpy as np, pandas as pd

for i in xrange(0, 4):
    x = pd.DataFrame(np.arange(10).reshape((5,1)))
    y = pd.DataFrame(np.arange(10).reshape((5,1)))
    data = np.array([x, y])
    df = pd.DataFrame(data.T, columns=['X', 'Y'])
A file is a one-dimensional object that only grows in length; the rows are separated only by a \n character. So it is impossible to add columns to an existing file without rewriting the whole thing.
You can load the file into memory, concatenate the dataframes, and then write the result back to (some other) file. Here:
import numpy as np, pandas as pd
a = pd.DataFrame(np.arange(10).reshape((5,2)))
b = pd.DataFrame(np.arange(20).reshape((5,4)))
pd.concat([a,b],axis=1)
Is that what you want?
In [84]: %paste
df = pd.DataFrame(np.arange(10).reshape((5,2)))
for i in range(0, 4):
    new = pd.DataFrame(np.random.randint(0, 100, (5,2)))
    df = pd.concat([df, new], axis=1)
## -- End pasted text --
In [85]: df
Out[85]:
0 1 0 1 0 1 0 1 0 1
0 0 1 50 82 24 53 84 65 59 48
1 2 3 26 37 83 28 86 59 38 33
2 4 5 12 25 19 39 1 36 26 9
3 6 7 35 17 46 27 53 5 97 52
4 8 9 45 17 3 85 55 7 94 97
An alternative:
import numpy as np
import pandas as pd

def iter_stack(n, shape):
    df = pd.DataFrame(np.random.choice(range(10), shape)).T
    for _ in range(n - 1):
        # df.append() is deprecated in recent pandas, so use pd.concat instead
        df = pd.concat([df, pd.DataFrame(np.random.choice(range(10), shape)).T])
    return df.T

iter_stack(5, (5, 2))
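To tie this back to writing the result to a CSV file, one rough sketch (the column names and file name are just illustrative) is to collect each iteration's columns in a list, concatenate once, and write a single file at the end:

import numpy as np
import pandas as pd

pieces = []
for i in range(4):
    x = pd.Series(np.arange(5), name='X_%d' % i)
    y = pd.Series(np.arange(5), name='Y_%d' % i)
    pieces.append(pd.concat([x, y], axis=1))   # this iteration's two columns

result = pd.concat(pieces, axis=1)             # stack all iterations side by side
result.to_csv('stacked.csv', index=False)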