Setting values in pandas df using location - python

I have two dataframes as such:
import numpy as np
import pandas as pd

df_pos = pd.DataFrame(
    data=[[5, 4, 3, 6, 0, 7, 1, 2], [2, 5, 3, 6, 4, 7, 1, 0]]
)
df_value = pd.DataFrame(
    data=[np.arange(10 + i, 50 + i, 5) for i in range(0, 2)]
)
and I want a new dataframe df_final, where df_pos gives the destination column positions and df_value the corresponding values.
I can do it like this:
df_value_copy = df_value.copy()
for i in range(len(df_pos)):
    df_value_copy.iloc[i, df_pos.iloc[i, :]] = df_value.iloc[i].values
df_final = df_value_copy
However, I have very large dataframes for which this loop would be way too slow. Therefore I want to see whether there is a smarter way to do it.

We can also try np.put_along_axis to place the values of df_value into df_final at the positions given by df_pos:
df_final = df_value.copy()
np.put_along_axis(
    arr=df_final.values,     # destination array
    indices=df_pos.values,   # indices
    values=df_value.values,  # source values
    axis=1                   # along axis
)
The arguments do not need to be keyword arguments; they can be passed positionally:
df_final = df_value.copy()
np.put_along_axis(df_final.values, df_pos.values, df_value.values, 1)
df_final:
    0   1   2   3   4   5   6   7
0  30  40  45  20  15  10  25  35
1  46  41  11  21  31  16  26  36

You can try setting values with numpy advanced indexing:
df_final = df_value.copy()
df_final.values[np.arange(len(df_pos))[:,None], df_pos.values] = df_value.values
df_final
    0   1   2   3   4   5   6   7
0  30  40  45  20  15  10  25  35
1  46  41  11  21  31  16  26  36
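
Since the whole motivation here is speed, a rough timing harness can confirm that the vectorized versions win on larger inputs. This is a minimal sketch, assuming 100,000 rows of random 8-wide permutations; all names and sizes are illustrative, not from the question:

import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_cols = 100_000, 8
# Each row of df_pos is a random permutation of 0..7, as in the question.
df_pos = pd.DataFrame(np.argsort(rng.random((n_rows, n_cols)), axis=1))
df_value = pd.DataFrame(rng.integers(10, 50, (n_rows, n_cols)))

t0 = time.perf_counter()
out = df_value.to_numpy().copy()
out[np.arange(n_rows)[:, None], df_pos.to_numpy()] = df_value.to_numpy()
df_final = pd.DataFrame(out)
print(f"advanced indexing: {time.perf_counter() - t0:.4f} s")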

Related

Chunk a large dataset by using a string in the column name using pandas

I have a dataset containing 3000 columns.
Every column name looks like 'abc_dummy0', 'dfg_dummy0', 'asd_dummy0', and each group is 130 columns long before the names move on to the 'dummy1' group, and so on until 'lkj_dummy39'.
I can use
cols = [col for col in df.columns if 'dummy1' in col]
and it lists all 130 columns whose names contain 'dummy1'.
My question is: how can I create smaller chunks of data, each one containing 'dummy0', 'dummy1', 'dummy2', and so on all the way to 'dummy39', without doing this 40 times?
I think it has something to do with
dummy{i} for i in range(0,39)
But I am not quite sure how to approach this in a memory-efficient and code-efficient way (otherwise I would just write 40 blocks of code, one for each respective 'dummy' group).
Here's what I can do, one at the time:
group_0 = [col for col in df.columns if 'dummy0' in col]
group_0 = df[group_0]
but how do I do this for the other 39 groups (both the '_i' at the end of the variable name and the dummy{i} part)?
Try with groupby on axis=1 and create a dict entry for each group:
dfs = {group_name: frame
       for group_name, frame in
       df.groupby(df.columns.str.split('_').str[-1], axis=1)}
dfs:
{'dummy0':    a_dummy0  b_dummy0  c_dummy0
 0        79        62        17
 1        28        31        81,
 'dummy1':    d_dummy1  e_dummy1  f_dummy1
 0        74         9        63
 1         8        77        16}
Sample Data:
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame(np.random.randint(1, 100, (2, 6)),
                  columns=[f'{chr(3 * i + j + 97)}_dummy{i}'
                           for i in range(2)
                           for j in range(3)])
df:
   a_dummy0  b_dummy0  c_dummy0  d_dummy1  e_dummy1  f_dummy1
0        79        62        17        74         9        63
1        28        31        81         8        77        16
The groups are based on the values after the '_':
df.columns.str.split('_').str[-1]
Index(['dummy0', 'dummy0', 'dummy0', 'dummy1', 'dummy1', 'dummy1'], dtype='object')
Then each group can be accessed like:
dfs['dummy0']
   a_dummy0  b_dummy0  c_dummy0
0        79        62        17
1        28        31        81
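
One caveat, not in the original answer: groupby(..., axis=1) is deprecated in pandas 2.x. A version-independent sketch that builds the same dict with plain boolean column selection:

# Group columns by the suffix after the last '_' without groupby(axis=1).
suffixes = df.columns.str.split('_').str[-1]
dfs = {name: df.loc[:, suffixes == name] for name in suffixes.unique()}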
Alternatively with a loop + filter:
for i in range(0, 2):
    print(f'dummy{i}')
    curr_df = df.filter(like=f'_dummy{i}')
    # Do something with curr_df
    print(curr_df)
Output:
dummy0
   a_dummy0  b_dummy0  c_dummy0
0        79        62        17
1        28        31        81
dummy1
   d_dummy1  e_dummy1  f_dummy1
0        74         9        63
1         8        77        16
You can also build the per-group column lists with a single nested list comprehension. Note that a plain substring test such as str(i) in col would let 'dummy1' also match 'dummy10' through 'dummy19', so it is safer to anchor on the full suffix:
groups = [[col for col in df.columns if col.endswith(f'dummy{i}')] for i in range(40)]
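
To go from column lists to the actual sub-frames without writing 40 near-identical blocks, a dict comprehension over df.filter is a compact option. A minimal sketch, assuming the suffixes run from dummy0 to dummy39; the regex end anchor prevents 'dummy1' from also matching 'dummy10':

# Hypothetical dict of 40 sub-frames, keyed by suffix.
chunks = {f'dummy{i}': df.filter(regex=f'_dummy{i}$') for i in range(40)}
group_0 = chunks['dummy0']  # same frame as the manual group_0 above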

Creating a data frame named after values from another data frame

I have a data frame containing three columns, where Col_1 and Col_2 contain some arbitrary data:
data = {"Height": range(1, 20, 1), "Col_1": range(2, 40, 2), "Col_2": range(3, 60, 3)}
df = pd.DataFrame(data)
    Height  Col_1  Col_2
0        1      2      3
1        2      4      6
2        3      6      9
3        4      8     12
4        5     10     15
5        6     12     18
6        7     14     21
7        8     16     24
8        9     18     27
9       10     20     30
10      11     22     33
11      12     24     36
12      13     26     39
13      14     28     42
14      15     30     45
15      16     32     48
16      17     34     51
17      18     36     54
18      19     38     57
and another data frame containing height values that should be used to segment the Height column of df.
data_segments = {"Section Height" : [1, 10, 20]}
df_segments = pd.DataFrame(data_segments)
   Section Height
0               1
1              10
2              20
I want to create two new data frames: df_segment_0, containing all columns of the initial df but only the rows whose Height lies between the first two values in df_segments, and likewise df_segment_1. They should look like:
df_segment_0
   Height  Col_1  Col_2
0       1      2      3
1       2      4      6
2       3      6      9
3       4      8     12
4       5     10     15
5       6     12     18
6       7     14     21
7       8     16     24
8       9     18     27
df_segment_1
    Height  Col_1  Col_2
9       10     20     30
10      11     22     33
11      12     24     36
12      13     26     39
13      14     28     42
14      15     30     45
15      16     32     48
16      17     34     51
17      18     36     54
18      19     38     57
I tried the following code using the .loc method and added the suggestion of C Hecht to create a list of data frames:
df_segment_list = []
try:
    for index in df_segments.index:
        df_segment = df[["Height", "Col_1", "Col_2"]].loc[
            (df["Height"] >= df_segments["Section Height"][index])
            & (df["Height"] < df_segments["Section Height"][index + 1])
        ]
        df_segment_list.append(df_segment)
except KeyError:
    pass
The try-except is used only to ignore the error on the last entry, since there is no upper height for index=2. The data frames in this list can be accessed as C Hecht suggested:
df_segment_0 = df_segment_list[0]
   Height  Col_1  Col_2
0       1      2      3
1       2      4      6
2       3      6      9
3       4      8     12
4       5     10     15
5       6     12     18
6       7     14     21
7       8     16     24
8       9     18     27
However, I would like to automate the naming of the final data frames. I tried:
for i in range(0, len(df_segment_list)):
    name = "df_segment_" + str(i)
    name = df_segment_list[i]
I expected this code to simply automate df_segment_0 = df_segment_list[0]; instead I receive the error name 'df_segment_0' is not defined.
The reason I need separate data frames is that I will perform many subsequent operations using Col_1 and Col_2, so I need row-wise access to each one of them, for example:
df_segment_0 = df_segment_0.assign(col_3=df_segment_0["Col_1"] / df_segment_0["Col_2"])
How do I achieve this?
EDIT 1: Clarified question with the suggestion from C Hecht.
If you want to get all entries that are smaller than the current segment height in your segmentation data frame, here you go :)
import pandas as pd

df1 = pd.DataFrame({"Height": range(1, 20, 1), "Col_1": range(2, 40, 2), "Col_2": range(3, 60, 3)})
df_segments = pd.DataFrame({"Section Height": [1, 10, 20]})

def segment_data_frame(data_frame: pd.DataFrame, segmentation_plan: pd.DataFrame):
    df = data_frame.copy()  # making a safety copy because we mutate the df!
    for sh in segmentation_plan["Section Height"]:  # sh is the new maximum "Height"
        df_new = df[df["Height"] < sh]  # select all entries below the maximum "Height"
        df.drop(df_new.index, inplace=True)  # remove them from the original DataFrame
        yield df_new

# ATTENTION: segment_data_frame() calculates each segment lazily at runtime!
# If you don't want to iterate over it but rather have one list containing them
# all, use list(segment_data_frame(...)) or [x for x in segment_data_frame(...)].
for segment in segment_data_frame(df1, df_segments):
    print(segment)
    print()

print(list(segment_data_frame(df1, df_segments)))
If you want to execute certain operations on those segments, you can iterate over the generator like so:
for segment in segment_data_frame(df1, df_segments):
    do_stuff_with(segment)
If you want to keep track of the individual frames and name them, you can use a dictionary.
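For example, a sketch building on segment_data_frame above; the key format df_segment_{i} is just an illustrative choice, not something the question prescribes:

# Name each segment with a dict key instead of a dynamic variable name.
segments = {f'df_segment_{i}': seg
            for i, seg in enumerate(segment_data_frame(df1, df_segments))}
# Row-wise operations then work per segment, e.g. the asker's col_3:
segments['df_segment_0'] = segments['df_segment_0'].assign(
    col_3=lambda d: d['Col_1'] / d['Col_2'])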
Unfortunately I don't 100% understand what you have in mind, but I hope the following helps you find the answer:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Section Height': [20, 90, 111, 232, 252, 3383, 3768, 3826, 3947, 4100],
                   'df_names': [f'df_section_{i}' for i in range(10)]})
df['shifted'] = df['Section Height'].shift(-1)
new_dfs = []
for index, row in df.iterrows():
    if np.isnan(row['shifted']):
        # Don't know what you want to do here
        pass
    else:
        new_df = pd.DataFrame({'heights': list(range(int(row['Section Height']), int(row['shifted'])))})
        new_df.name = row['df_names']
        new_dfs.append(new_df)
The contents of new_dfs are dataframes that look like this:
    heights
0        20
1        21
2        22
3        23
4        24
..      ...
65       85
66       86
67       87
68       88
69       89

[70 rows x 1 columns]
If you clarify your question given this input, we can help you all the way, but this should hopefully point you in the right direction.
Edit: a small comment on using df.name: it is not really stable, and if you do things like dropping a column or pickling/unpickling, the name will likely be lost. But you can surely find a good solution to maintain the name, depending on your needs.

Loop logic to calculate % change

My dataframe:
 A   B   C  A_Q  B_Q  C_Q
27  40  41    2    1  etc
28  39  40    1    5
30  28  29    3    6
28  27  28    4    1
15  10  11    5    4
17  13  14    1    5
16  60  17    8   10
14  21  18    9    1
20  34  23   10    2
21  45  34    7    4
I want to iterate through each row in every column with a _Q suffix, starting with A_Q, and do the following:
1. If the row value is 1, grab the corresponding value in col A.
2. Assign that value to a variable; call it x.
3. Keep looping down the col A_Q.
4. If the row value is 1 through 9, ignore it (x is already set).
5. If the value is 10, get the corresponding value in col A and assign it to a variable y.
6. Calculate the % change, call it chg, between y and x: ((y/x) - 1) * 100.
7. Append chg to the dataframe.
8. Keep going down the column, repeating steps 1-7, until the end.
Then do the same for the other columns B_Q, C_Q, etc.
So for example, in the above, the first "1" that appears corresponds to 28 in col A, so x = 28. Then keep iterating, ignoring values 1 through 9, until you get a 10, which corresponds to 20 in col A. Calculate the % change = ((20/28) - 1) * 100 ≈ -28.6% and append that to the df in a newly created col A_S. Then resume from that point on with the same steps until the end of the file. And finally, do the same for the rest of the columns.
So then the df would look like:
 A   B   C  A_Q  B_Q  C_Q    A_S  B_S  C_S  etc
27  40  41    2    1  etc
28  39  40    1    5
30  28  29    3    6
28  27  28    4    1
15  10  11    5    4
17  13  14    1    5
16  60  17    8   10                50
14  21  18    9    1
20  34  23   10    2  -28.6
21  45  34    7    4
I thought to create a function and then do something like df['_S'] = df.apply(function, axis=1), but I am stuck on the implementation of steps 1-8 above. Thanks!
Do you need to append the results as a new column? You're going to end up with nearly empty columns, each holding just one value per cycle. Could you just append all of the results at the bottom of the '_Q' columns? Anyway, here's my stab at a function that does all you asked:
def func(col1, col2):
    l = []
    x = None
    for index in range(0, len(col1)):
        if x is None and col1[index] == 1:
            x = col2[index]  # remember the base value
            l.append(0)
        elif x is not None and col1[index] == 10:
            y = col2[index]
            l.append(((float(y) / x) - 1) * 100)  # % change vs. the base
            x = None  # reset, ready for the next 1
        else:
            l.append(0)
    return l
You'd then pass A_Q to this function as col1 and A as col2, and it should return what you want. Assuming every A, B, C column has an associated _Q column, you can apply it to all of them like:
q = [col for col in df.columns if '_Q' in col]
for col in q:
    df[col[:len(col) - 2] + '_S'] = func(df[col], df[col[:len(col) - 2]])
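
As a quick end-to-end check, here is the function applied to a small frame shaped like the question's A columns (data copied from the question; the commented result follows from the function, not from a posted run):

import pandas as pd

df = pd.DataFrame({'A':   [27, 28, 30, 28, 15, 17, 16, 14, 20, 21],
                   'A_Q': [2, 1, 3, 4, 5, 1, 8, 9, 10, 7]})
df['A_S'] = func(df['A_Q'], df['A'])
# Row 8 (the first 10 after a 1) gets ((20/28) - 1) * 100 ≈ -28.57; all other rows get 0.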

How to stack dataframes side by side (one per iteration) in one csv file in python pandas?

If I generate two columns of data per iteration of a for-loop and want to save them to a csv file, how can I make the two columns from each subsequent iteration stack side by side in the same csv file (no overwriting)? The same goes for the following iterations. I have searched for pandas' to_csv(mode='a'), but it only appends vertically (by rows). I have looked into pd.concat, but I don't know how to implement it in a for loop for more than two dataframes. Do you have some sample code for this, or some ideas to share?
import numpy as np, pandas as pd

for i in range(0, 4):
    x = pd.DataFrame(np.arange(5).reshape((5, 1)))
    y = pd.DataFrame(np.arange(5).reshape((5, 1)))
    data = np.hstack([x, y])
    df = pd.DataFrame(data, columns=['X', 'Y'])
A file is a one-dimensional object that only grows in length; the rows are only separated by \n characters. So it is impossible to add columns without rewriting the whole file.
You can load the file into memory, concatenate the dataframes, and then write the result back to (some other) file. Here:
import numpy as np, pandas as pd

a = pd.DataFrame(np.arange(10).reshape((5, 2)))
b = pd.DataFrame(np.arange(20).reshape((5, 4)))
pd.concat([a, b], axis=1)
Is that what you want?
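
Applied to the original per-iteration CSV question, a read-concat-rewrite loop is one way to widen the file on each pass. A minimal sketch; the file name and column labels are purely illustrative:

import numpy as np
import pandas as pd

# Write the first pair of columns, then widen the same file on each iteration.
pd.DataFrame(np.arange(10).reshape((5, 2)), columns=['X0', 'Y0']).to_csv('results.csv', index=False)
for i in range(1, 4):
    existing = pd.read_csv('results.csv')
    new_cols = pd.DataFrame(np.random.randint(0, 100, (5, 2)), columns=[f'X{i}', f'Y{i}'])
    pd.concat([existing, new_cols], axis=1).to_csv('results.csv', index=False)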
In [84]: %paste
df = pd.DataFrame(np.arange(10).reshape((5,2)))
for i in range(0, 4):
    new = pd.DataFrame(np.random.randint(0, 100, (5, 2)))
    df = pd.concat([df, new], axis=1)
## -- End pasted text --
In [85]: df
Out[85]:
   0  1   0   1   0   1   0   1   0   1
0  0  1  50  82  24  53  84  65  59  48
1  2  3  26  37  83  28  86  59  38  33
2  4  5  12  25  19  39   1  36  26   9
3  6  7  35  17  46  27  53   5  97  52
4  8  9  45  17   3  85  55   7  94  97
An alternative that builds the same side-by-side stack in one function (df.append was removed in pandas 2.0, so pd.concat is used here, and the call now matches the function name):
def iter_stack(n, shape):
    frames = [pd.DataFrame(np.random.choice(range(10), shape)) for _ in range(n)]
    return pd.concat(frames, axis=1)

iter_stack(5, (5, 2))

Python: how to subset from a large dataframe based upon a maximum value in a column and nrows before and nrows after including the self

Suppose:
df['Column_Name'].max()  # the maximum value in a particular column of a dataframe
Then you want to select the 10 rows before the row holding the maximum value in that column and the 10 rows after it (i.e. 10 + 1 + 10 = 21 rows total). How can this be done in Python?
Here is an addition to 2rs2ts's solution, to account for the max value being near the beginning or end of your series or dataframe:
df['a'][max(0, index_of_max_value - 10):min(len(df['a']), index_of_max_value + 11)]
You want to get the index of the row that has the maximum value. Assuming you're using Pandas, this would be done by using idxmax().
>>> from pandas import DataFrame
>>> data = [{'a':x} for x in range(40)]
>>> from random import shuffle
>>> shuffle(data)
>>> df = DataFrame(data)
>>> index_of_max_value = df['a'].idxmax()
>>> df['a'][max(0,index_of_max_value-10):min(len(df['a']), index_of_max_value+11)]
19    16
20    36
21     8
22    20
23    14
24    31
25     6
26    18
27    17
28    23
29    39
30     5
31    25
32     4
33    12
34    35
35    26
36     0
37    27
38    21
39    30
Name: a, dtype: int64
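
To keep all columns of the dataframe rather than just the 'a' series, the same window can be taken positionally with iloc. A minimal sketch, assuming a default RangeIndex so the label returned by idxmax equals the row position:

i = df['a'].idxmax()                      # row position of the maximum (RangeIndex assumed)
window = df.iloc[max(0, i - 10): i + 11]  # 10 before + the row itself + 10 after; iloc clips the end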
