I want to sample a Pandas dataframe using values in a certain column, but I want to keep all rows with values that are in the sample.
For example, in the dataframe below I want to randomly sample some fraction of the values in b, but keep all corresponding rows in a and c.
d = pd.DataFrame({'a': range(1, 101, 1), 'b': list(range(0, 100, 4))*4, 'c': list(range(0, 100, 2))*2})
Desired example output from a 16% sample:
Out[66]:
a b c
0 1 0 0
1 26 0 50
2 51 0 0
3 76 0 50
4 4 12 6
5 29 12 56
6 54 12 6
7 79 12 56
8 18 68 34
9 43 68 84
10 68 68 34
11 93 68 84
12 19 72 36
13 44 72 86
14 69 72 36
15 94 72 86
I've tried sampling the series and merging back to the main data, like this:
In [66]: pd.merge(d, d.b.sample(int(.16 * d.b.nunique())))
This creates the desired output, but it seems inefficient. My real dataset has millions of values in b and hundreds of millions of rows. I know I could also use some version of isin, but that is also slow.
Is there a more efficient way to do this?
I really doubt that isin is slow:
uniques = df.b.unique()
# this may be the bottleneck: sampling the unique values
samples = np.random.choice(uniques, replace=False, size=int(0.16 * len(uniques)))
# filter rows whose b is in the sample
df[df.b.isin(samples)]
You can profile the steps above. In case the samples = ... line is slow, you can try:
idx = np.random.rand(len(uniques))
samples = uniques[idx < 0.16]
Both took about 100 ms on my system with 10 million rows.
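If you want to reproduce that measurement, here is a minimal timing sketch (the column name b and the sizes are illustrative assumptions, not your real data):
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({'b': np.random.randint(0, 1_000_000, size=10_000_000)})
uniques = df.b.unique()
t0 = time.perf_counter()
samples = np.random.choice(uniques, replace=False, size=int(0.16 * len(uniques)))
t1 = time.perf_counter()
out = df[df.b.isin(samples)]  # the actual filtering step
t2 = time.perf_counter()
print(f"choice: {t1 - t0:.3f}s, isin filter: {t2 - t1:.3f}s")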
Note: d.b.sample(int(.16 * d.b.nunique())) does not sample 16% of the unique values in b; it draws from the rows of b, duplicates included.
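To see the difference, compare sampling the column directly with sampling its unique values; a minimal sketch:
import pandas as pd

d = pd.DataFrame({'b': [0, 0, 4, 4, 8, 8]})
# sampling the column draws from all six rows, duplicates included
print(d.b.sample(2).tolist())                      # may print e.g. [4, 4]
# sampling the unique values instead
print(pd.Series(d.b.unique()).sample(2).tolist())  # e.g. [0, 8]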
I have a data frame containing three columns, where Col_1 and Col_2 contain some arbitrary data:
data = {"Height": range(1, 20, 1), "Col_1": range(2, 40, 2), "Col_2": range(3, 60, 3)}
df = pd.DataFrame(data)
Height Col_1 Col_2
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
4 5 10 15
5 6 12 18
6 7 14 21
7 8 16 24
8 9 18 27
9 10 20 30
10 11 22 33
11 12 24 36
12 13 26 39
13 14 28 42
14 15 30 45
15 16 32 48
16 17 34 51
17 18 36 54
18 19 38 57
and another data frame containing height values that should be used to segment the Height column of df.
data_segments = {"Section Height" : [1, 10, 20]}
df_segments = pd.DataFrame(data_segments)
Section Height
0 1
1 10
2 20
I want to create two new data frames: df_segment_0 should contain all columns of the initial df, but only the rows whose Height lies between the first two entries of df_segments. The same approach applies to df_segment_1. They should look like:
df_segment_0
Height Col_1 Col_2
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
4 5 10 15
5 6 12 18
6 7 14 21
7 8 16 24
8 9 18 27
df_segment_1
Height Col_1 Col_2
9 10 20 30
10 11 22 33
11 12 24 36
12 13 26 39
13 14 28 42
14 15 30 45
15 16 32 48
16 17 34 51
17 18 36 54
18 19 38 57
I tried the following code using the .loc method, adding C Hecht's suggestion to create a list of data frames:
df_segment_list = []
try:
    for index in df_segments.index:
        df_segment = df[["Height", "Col_1", "Col_2"]].loc[
            (df["Height"] >= df_segments["Section Height"][index])
            & (df["Height"] < df_segments["Section Height"][index + 1])
        ]
        df_segment_list.append(df_segment)
except KeyError:
    pass
The try-except is used only to ignore the KeyError for the last entry, since there is no upper height for index=2. The data frames in this list can be accessed as C Hecht suggested:
df_segment_0 = df_segment_list[0]
Height Col_1 Col_2
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
4 5 10 15
5 6 12 18
6 7 14 21
7 8 16 24
8 9 18 27
However, I would like to automate the naming of the final data frames. I tried:
for i in range(0, len(df_segment_list)):
    name = "df_segment_" + str(i)
    name = df_segment_list[i]
I expected this code to simply automate df_segment_0 = df_segment_list[0]; instead I receive the error: name 'df_segment_0' is not defined.
The reason I need separate data frames is that I will perform many subsequent operations using Col_1 and Col_2, so I need row-wise access to each one of them, for example:
df_segment_0 = df_segment_0.assign(col_3=df_segment_0["Col_1"] / df_segment_0["Col_2"])
How do I achieve this?
EDIT 1: Clarified question with the suggestion from C Hecht.
If you want to get all entries that are smaller than the current segment height in your segmentation data frame, here you go :)
import pandas as pd
df1 = pd.DataFrame({"Height": range(1, 20, 1), "Col_1": range(2, 40, 2), "Col_2": range(3, 60, 3)})
df_segments = pd.DataFrame({"Section Height": [1, 10, 20]})
def segment_data_frame(data_frame: pd.DataFrame, segmentation_plan: pd.DataFrame):
    df = data_frame.copy()  # making a safety copy because we mutate the df !!!
    for sh in segmentation_plan["Section Height"]:  # sh is the new maximum "Height"
        df_new = df[df["Height"] < sh]  # select all entries below the maximum "Height"
        df.drop(df_new.index, inplace=True)  # remove them from the original DataFrame
        yield df_new
# ATTENTION: segment_data_frame() will calculate each segment at runtime!
# So if you don't want to iterate over it but rather have one list to contain
# them all, you must use list(segment_data_frame(...)) or [x for x in segment_data_frame(...)]
for segment in segment_data_frame(df1, df_segments):
    print(segment)
    print()
print(list(segment_data_frame(df1, df_segments)))
If you want to execute certain steps on those segments, you can just iterate over them like so:
for segment in segment_data_frame(df1, df_segments):
    do_stuff_with(segment)
If you want to keep track of the individual frames by name, you can use a dictionary:
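A minimal sketch of that dictionary approach (the key format df_segment_{i} is an assumption):
segments = {f"df_segment_{i}": seg
            for i, seg in enumerate(segment_data_frame(df1, df_segments))}
# row-wise operations then work on each named segment as usual
segments["df_segment_0"] = segments["df_segment_0"].assign(
    col_3=lambda s: s["Col_1"] / s["Col_2"])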
Unfortunately I don't 100% understand what you have in mind, but I hope the following helps you find the answer:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Section Height': [20, 90, 111, 232, 252, 3383, 3768, 3826, 3947, 4100],
                   'df_names': [f'df_section_{i}' for i in range(10)]})
df['shifted'] = df['Section Height'].shift(-1)
new_dfs = []
for index, row in df.iterrows():
    if np.isnan(row['shifted']):
        # Don't know what you want to do here
        pass
    else:
        new_df = pd.DataFrame({'heights': list(range(int(row['Section Height']), int(row['shifted'])))})
        new_df.name = row['df_names']
        new_dfs.append(new_df)
The contents of new_dfs are data frames that look like this:
heights
0 20
1 21
2 22
3 23
4 24
.. ...
65 85
66 86
67 87
68 88
69 89
[70 rows x 1 columns]
If you clarify your questions given this input, we could help you all the way, but this should hopefully point you in the right direction.
Edit: A small comment on using df.name: This is not really stable and if you do stuff like dropping a column, pickling/unpickling, etc. the name will likely be lost. But you can surely find a good solution to maintain the name depending on your needs.
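If losing .name is a concern, one alternative worth a look is the DataFrame.attrs metadata dict (available since pandas 1.0, still marked experimental, and likewise not propagated by every operation); a small sketch:
import pandas as pd

df = pd.DataFrame({'heights': range(20, 90)})
df.attrs['name'] = 'df_section_0'  # a plain metadata dict attached to the frame
print(df.attrs['name'])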
I have a Pandas DataFrame with the following hypothetical data:
ID Time X-coord Y-coord
0 1 5 68 5
1 2 8 72 78
2 3 1 15 23
3 4 4 81 59
4 5 9 78 99
5 6 12 55 12
6 7 5 85 14
7 8 7 58 17
8 9 13 91 47
9 10 10 29 87
For each row (or ID), I want to find the ID with the closest proximity in time and space (X & Y) within this dataframe. Bonus: Time should have priority over XY.
Ideally, in the end I would like to have a new column called "Closest_ID" containing the most proximal ID within the dataframe.
I'm having trouble coming up with a function for this.
I would really appreciate any help or hint that points me in the right direction!
Thanks a lot!
Let's denote df as our dataframe. Then you can do something like:
import numpy as np
from sklearn.metrics import pairwise_distances
space_vals = df[['X-coord', 'Y-coord']]
time_vals = df[['Time']]  # pairwise_distances expects a 2-D input
space_distance = pairwise_distances(space_vals)
time_distance = pairwise_distances(time_vals)
space_distance[space_distance == 0] = 1e9  # arbitrary large number, masks self-distances
time_distance[time_distance == 0] = 1e9  # again
closest_space_id = np.argmin(space_distance, axis=0)
closest_time_id = np.argmin(time_distance, axis=0)
Then, you can store the last 2 results in 2 columns, or somehow decide which one is closer.
Note: this code hasn't been checked, and it might have a few bugs...
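For the bonus (time having priority over XY), one hedged option is to weight the time distance heavily before combining the two matrices; the weight of 1000 is an arbitrary assumption, not a tested value:
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

space_distance = pairwise_distances(df[['X-coord', 'Y-coord']])
time_distance = pairwise_distances(df[['Time']])
combined = 1000 * time_distance + space_distance  # time dominates the ranking
np.fill_diagonal(combined, np.inf)                # a row is never its own neighbour
df['Closest_ID'] = df['ID'].to_numpy()[np.argmin(combined, axis=1)]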
I need to check df.head() and df.tail() many times.
When using df.head() and df.tail() separately, the Jupyter notebook displays two chunks of ugly output.
Is there a single-line command to select only the first 5 and last 5 rows,
something like:
df.iloc[:5 | -5:] ?
Test example:
df = pd.DataFrame(np.random.rand(20,2))
df.iloc[:5]
Update
Ugly but working ways:
df.iloc[np.where((df.index < 5) | (df.index >= len(df) - 5))[0]]
or,
df.iloc[np.r_[np.arange(5), np.arange(df.shape[0]-5, df.shape[0])]]
Take a look at numpy.r_:
df.iloc[np.r_[0:5, -5:0]]
Out[358]:
0 1
0 0.899673 0.584707
1 0.443328 0.126370
2 0.203212 0.206542
3 0.562156 0.401226
4 0.085070 0.206960
15 0.082846 0.548997
16 0.435308 0.669673
17 0.426955 0.030303
18 0.327725 0.340572
19 0.250246 0.162993
Also, head + tail is not a bad solution:
df.head(5).append(df.tail(5))
Out[362]:
0 1
0 0.899673 0.584707
1 0.443328 0.126370
2 0.203212 0.206542
3 0.562156 0.401226
4 0.085070 0.206960
15 0.082846 0.548997
16 0.435308 0.669673
17 0.426955 0.030303
18 0.327725 0.340572
19 0.250246 0.162993
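Note that DataFrame.append was removed in pandas 2.0; the equivalent with pd.concat:
pd.concat([df.head(5), df.tail(5)])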
df.query("index < 5 or index >= " + str(len(df) - 5))
Here's a way to query the index. You can change the values to whatever you want.
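The same idea parameterized with query's @-variable syntax (n and cutoff are illustrative names):
n = 5
cutoff = len(df) - n
df.query("index < @n or index >= @cutoff")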
Another approach (per this SO post)
uses only Pandas .isin()
Generate some dummy/demo data
df = pd.DataFrame({'a':range(10,100)})
print(df.head())
a
0 10
1 11
2 12
3 13
4 14
print(df.tail())
a
85 95
86 96
87 97
88 98
89 99
print(df.shape)
(90, 1)
Generate list of required indexes
ls = list(range(5)) + list(range(len(df)-5, len(df)))
print(ls)
[0, 1, 2, 3, 4, 85, 86, 87, 88, 89]
Slice DataFrame using list of indexes
df_first_last_5 = df[df.index.isin(ls)]
print(df_first_last_5)
a
0 10
1 11
2 12
3 13
4 14
85 95
86 96
87 97
88 98
89 99
Here is the problem:
import numpy
import pandas
dfl = pandas.DataFrame(numpy.random.randn(30,10))
now, I want the following cells put in a data frame:
For row 1: columns 3 to 6 (length = 4 cells),
For row 2: columns 4 to 7 (length = 4 cells),
For row 3: columns 1 to 4 (length = 4 cells),
etc.
Each of these ranges is always 4 cells wide, but the start and end are different columns.
The row-wise start points are in a list [3, 4, 1, ...], and so are the row-wise end points. The list of rows I'm interested in is also a list [1, 2, 3].
Finally, dfl has a datetime index which I would like to preserve
(meaning the end result should be a data frame with index dfl.index[[1, 2, 3]]).
Edit: when the range exceeds the frame
Some of the entries in the vector of row-wise start points are too large (say, a row-wise start point of 9 in the example matrix above). In those cases, I just want all the columns from the row-wise start point onward, followed by as many NaNs as necessary to get the right shape (since 9+4 > 10, the corresponding row of the result data frame should be [9, 10, NaN, NaN]).
Using NumPy broadcasting to create all those column indices and then advanced-indexing into the array data -
import numpy as np
import pandas as pd

def extract_rows(dfl, starts, L, fillval=np.nan):
    a = dfl.values
    # broadcast each start against 0..L-1 to build all column indices at once
    idx = np.asarray(starts)[:, None] + np.arange(L)
    valid_mask = idx < dfl.shape[1]
    idx[~valid_mask] = 0  # clip out-of-range indices to a valid placeholder
    val = a[np.arange(len(idx))[:, None], idx]  # advanced indexing, row by row
    return pd.DataFrame(np.where(valid_mask, val, fillval))
Sample runs -
In [541]: np.random.seed(0)
In [542]: dfl = pandas.DataFrame(numpy.random.randint(11,99,(3,10)))
In [543]: dfl
Out[543]:
0 1 2 3 4 5 6 7 8 9
0 55 58 75 78 78 20 94 32 47 98
1 81 23 69 76 50 98 57 92 48 36
2 88 83 20 31 91 80 90 58 75 93
In [544]: extract_rows(dfl, starts=[3,4,8], L=4, fillval=np.nan)
Out[544]:
0 1 2 3
0 78.0 78.0 20.0 94.0
1 50.0 98.0 57.0 92.0
2 75.0 93.0 NaN NaN
In [545]: extract_rows(dfl, starts=[3,4,8], L=4, fillval=-1)
Out[545]:
0 1 2 3
0 78 78 20 94
1 50 98 57 92
2 75 93 -1 -1
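The question also asked to preserve dfl's datetime index. A minimal variant, under the assumption that starts has one entry per row of dfl, simply passes the original index through to the result:
import numpy as np
import pandas as pd

def extract_rows_keep_index(dfl, starts, L, fillval=np.nan):
    a = dfl.values
    idx = np.asarray(starts)[:, None] + np.arange(L)
    valid_mask = idx < dfl.shape[1]
    idx[~valid_mask] = 0
    val = a[np.arange(len(idx))[:, None], idx]
    # keep dfl's own index instead of the default RangeIndex
    return pd.DataFrame(np.where(valid_mask, val, fillval), index=dfl.index)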
Or we can use .iloc and enumerate:
l = [3, 4, 1]
pd.DataFrame(data=[df.iloc[x:x+1, y:y+4].values[0] for x, y in enumerate(l)])
Out[107]:
0 1 2 3
0 1.224124 -0.938459 -1.114081 -1.128225
1 -0.445288 0.445390 -0.154295 -1.871210
2 0.784677 0.997053 2.144286 -0.179895
In a for loop I generate two columns of data per iteration and want to save them to a CSV file. How can I make each iteration's two new columns stack side by side in the same CSV file (no overwriting)? I have searched for pandas.DataFrame.to_csv(mode='a'), but it only appends vertically (by rows). I have looked into pd.concat, however, I don't know how to implement it in a for loop for more than two dataframes. Do you have some sample code for this, or some ideas to share?
import numpy as np, pandas as pd
for i in range(0, 4):
    x = np.arange(5)  # first generated column
    y = np.arange(5)  # second generated column
    data = np.array([x, y])
    df = pd.DataFrame(data.T, columns=['X', 'Y'])  # df is overwritten on every iteration
A file is a one-dimensional object that only grows in length; rows are separated only by a \n character. So it is impossible to add columns to existing rows without rewriting the whole file.
You can instead load the data into memory, concatenate the data frames there, and write the result back to (some other) file. Here:
import numpy as np, pandas as pd
a = pd.DataFrame(np.arange(10).reshape((5,2)))
b = pd.DataFrame(np.arange(20).reshape((5,4)))
pd.concat([a,b],axis=1)
Is that what you want?
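And since the goal was a CSV, the combined frame can then be written out ('out.csv' is a placeholder name):
pd.concat([a, b], axis=1).to_csv('out.csv', index=False)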
In [84]: %paste
df = pd.DataFrame(np.arange(10).reshape((5,2)))
for i in range(0, 4):
    new = pd.DataFrame(np.random.randint(0, 100, (5,2)))
    df = pd.concat([df, new], axis=1)
## -- End pasted text --
In [85]: df
Out[85]:
0 1 0 1 0 1 0 1 0 1
0 0 1 50 82 24 53 84 65 59 48
1 2 3 26 37 83 28 86 59 38 33
2 4 5 12 25 19 39 1 36 26 9
3 6 7 35 17 46 27 53 5 97 52
4 8 9 45 17 3 85 55 7 94 97
An alternative:
def iter_stack(n, shape):
    df = pd.DataFrame(np.random.choice(range(10), shape)).T
    for _ in range(n - 1):
        # DataFrame.append was removed in pandas 2.0, so concatenate instead
        df = pd.concat([df, pd.DataFrame(np.random.choice(range(10), shape)).T])
    return df.T

iter_stack(5, (5, 2))