split column header number and string in python

Here is my toy example. My question is how to create a new column called trial with values {2, 3}, where 2 and 3 come from the number part of the column names 2.0__sum_values, 3.0__sum_values, etc.
my code is:
import pandas as pd
before_spliting = {"ID": [1, 2,3], "2.0__sum_values": [33,28,40],"2.0__mediane": [33,70,20],"2.0__root_mean_square":[33,4,30],"3.0__sum_values": [33,28,40],"3.0__mediane": [33,70,20],"3.0__root_mean_square":[33,4,30]}
before_spliting = pd.DataFrame(before_spliting)
print(before_spliting)
   ID  2.0__sum_values  2.0__mediane  2.0__root_mean_square  3.0__sum_values  \
0   1               33            33                     33               33
1   2               28            70                      4               28
2   3               40            20                     30               40

   3.0__mediane  3.0__root_mean_square
0            33                     33
1            70                      4
2            20                     30
after_spliting = { "ID": [1,1,2, 2,3,3], "trial": [2, 3,2,3,2,3],"sum_values": [33,33,28,28,40,40],"mediane": [33,33,70,70,20,20],"root_mean_square":[33,33,4,4,30,30]}
after_spliting = pd.DataFrame(after_spliting)
print(after_spliting)
   ID  trial  sum_values  mediane  root_mean_square
0   1      2          33       33                33
1   1      3          33       33                33
2   2      2          28       70                 4
3   2      3          28       70                 4
4   3      2          40       20                30
5   3      3          40       20                30

You could try:
res = df.melt(id_vars="ID")
res[["trial", "columns"]] = res["variable"].str.split("__", expand=True)
res = (
    res
    .pivot_table(
        index=["ID", "trial"], columns="columns", values="value", aggfunc=list
    )
    .explode(sorted(set(res["columns"])))
    .reset_index()
)
Result for the following input dataframe
data = {
    "ID": [1, 2, 3],
    "2.0__sum_values": [33, 28, 40], "2.0__mediane": [43, 80, 30], "2.0__root_mean_square": [37, 4, 39],
    "3.0__sum_values": [34, 29, 41], "3.0__mediane": [44, 81, 31], "3.0__root_mean_square": [38, 5, 40],
}
df = pd.DataFrame(data)
is
columns  ID trial  mediane  root_mean_square  sum_values
0         1   2.0       43                37          33
1         1   3.0       44                38          34
2         2   2.0       80                 4          28
3         2   3.0       81                 5          29
4         3   2.0       30                39          40
5         3   3.0       31                40          41
Alternative solution with the same output:
res = df.melt(id_vars="ID")
res[["trial", "columns"]] = res["variable"].str.split("__", expand=True)
res = res.set_index(["ID", "trial"]).drop(columns="variable").sort_index()
res = pd.concat(
    (group[["value"]].rename(columns={"value": key})
     for key, group in res.groupby("columns")),
    axis=1
).reset_index()
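If you prefer a built-in reshape, pd.wide_to_long can do the same job once the trial number and the stat name are swapped around the "__" separator. A minimal sketch, assuming the same df as above:
# wide_to_long expects stub-first names like "sum_values__2.0", so flip each name
flipped = df.rename(columns=lambda c: "__".join(c.split("__")[::-1]) if "__" in c else c)
res = pd.wide_to_long(
    flipped,
    stubnames=["sum_values", "mediane", "root_mean_square"],
    i="ID", j="trial", sep="__", suffix=r"[\d.]+",
).reset_index()
# note: "trial" holds the raw suffixes "2.0"/"3.0"; cast with astype(float) if needed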

Because you are using curly brackets {} (a dictionary), you cannot have duplicate keys inside them, so you can instead use square brackets [] (a list) to build the new column:
trial = []
for i in range(len(d1)):  # d1 here stands for the wide dataframe (before_spliting above)
    trial.append([d1['2.0__sum_values'][i], d1['3.0__sum_values'][i]])
d1['trial'] = trial
Best of Luck

Related

Select columns and create new dataframe

I have a dataframe with more than 5000 columns but here is an example what it looks like:
data = {'AST_0-1': [1, 2, 3],
        'AST_0-45': [4, 5, 6],
        'AST_0-135': [7, 8, 20],
        'AST_10-1': [10, 20, 32],
        'AST_10-45': [47, 56, 67],
        'AST_10-135': [48, 57, 64],
        'AST_110-1': [100, 85, 93],
        'AST_110-45': [100, 25, 37],
        'AST_110-135': [44, 55, 67]}
I want to create multiple new dataframes based on the numbers after the "-" in the column names. For example, one dataframe with all the columns that end with "1" [df1 = (AST_0-1; AST_10-1; AST_110-1)], another with those that end with "45", and another with those that end with "135". I know I will need a loop to do that, but I am having trouble selecting the columns to then create the dataframes.
You can use str.extract on the column names to get the wanted IDs, then groupby on axis=1.
Here, creating a dictionary of dataframes:
group = df.columns.str.extract(r'(\d+)$', expand=False)
out = dict(list(df.groupby(group, axis=1)))
Output:
{'1':    AST_0-1  AST_10-1  AST_110-1
 0             1        10        100
 1             2        20         85
 2             3        32         93,
 '135':    AST_0-135  AST_10-135  AST_110-135
 0                 7          48           44
 1                 8          57           55
 2                20          64           67,
 '45':    AST_0-45  AST_10-45  AST_110-45
 0               4         47        100
 1               5         56         25
 2               6         67         37}
Accessing ID 135:
out['135']
   AST_0-135  AST_10-135  AST_110-135
0          7          48           44
1          8          57           55
2         20          64           67
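Note that groupby(..., axis=1) is deprecated in recent pandas versions; a plain dictionary comprehension over the extracted suffixes gives the same result. A minimal sketch:
group = df.columns.str.extract(r'(\d+)$', expand=False)
out = {key: df.loc[:, group == key] for key in group.unique()}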
Use:
df = pd.DataFrame(data)
dfs = dict(list(df.groupby(df.columns.str.rsplit('-', n=1).str[1], axis=1)))
Output:
>>> dfs
{'1':    AST_0-1  AST_10-1  AST_110-1
 0             1        10        100
 1             2        20         85
 2             3        32         93,
 '135':    AST_0-135  AST_10-135  AST_110-135
 0                 7          48           44
 1                 8          57           55
 2                20          64           67,
 '45':    AST_0-45  AST_10-45  AST_110-45
 0               4         47        100
 1               5         56         25
 2               6         67         37}
I know it's strongly discouraged but maybe you want to create dataframes like df1, df135, df45. In this case, you can use:
for name, df in dfs.items():
    locals()[f'df{name}'] = df
>>> df1
   AST_0-1  AST_10-1  AST_110-1
0        1        10        100
1        2        20         85
2        3        32         93
>>> df135
   AST_0-135  AST_10-135  AST_110-135
0          7          48           44
1          8          57           55
2         20          64           67
>>> df45
   AST_0-45  AST_10-45  AST_110-45
0         4         47         100
1         5         56          25
2         6         67          37
data = {'AST_0-1': [1, 2, 3],
        'AST_0-45': [4, 5, 6],
        'AST_0-135': [7, 8, 20],
        'AST_10-1': [10, 20, 32],
        'AST_10-45': [47, 56, 67],
        'AST_10-135': [48, 57, 64],
        'AST_110-1': [100, 85, 93],
        'AST_110-45': [100, 25, 37],
        'AST_110-135': [44, 55, 67]}
import pandas as pd
df = pd.DataFrame(data)
value_list = ["1", "45", "135"]
for value in value_list:
    interest_columns = [col for col in df.columns if col.split("-")[1] == value]
    df_filtered = df[interest_columns]
    print(df_filtered)
Output:
   AST_0-1  AST_10-1  AST_110-1
0        1        10        100
1        2        20         85
2        3        32         93
   AST_0-45  AST_10-45  AST_110-45
0         4         47         100
1         5         56          25
2         6         67          37
   AST_0-135  AST_10-135  AST_110-135
0          7          48           44
1          8          57           55
2         20          64           67
I assume your problem is with the keys of the dictionary. You can get the list of keys with data.keys() and then iterate over it,
for example:
df1 = pd.DataFrame()
df45 = pd.DataFrame()
df135 = pd.DataFrame()
for i in list(data.keys()):
    the_key = i.split('-')
    if the_key[1] == '1':
        df1[i] = data[i]
    elif the_key[1] == '45':
        df45[i] = data[i]
    elif the_key[1] == '135':
        df135[i] = data[i]
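As a side note, DataFrame.filter can express the same suffix-based selection with a regex, without any loop. A minimal sketch:
df = pd.DataFrame(data)
df1 = df.filter(regex=r'-1$')      # only columns ending in "-1"
df45 = df.filter(regex=r'-45$')    # only columns ending in "-45"
df135 = df.filter(regex=r'-135$')  # only columns ending in "-135"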

pythonic way to conditionally count over multiple columns

I'm trying to make an ordinary loop under specific conditions.
I want to iterate over rows, checking conditions, and then iterate over columns, counting how many times the condition was met.
This count should generate a new column in my dataframe indicating the total count for each row.
I tried to use apply and applymap with no success.
I did manage to reach my goal with the following code,
but I bet there are more efficient ways, or even built-in pandas functions, to do it.
Does anyone know how?
sample code:
import pandas as pd
df = pd.DataFrame({'1column': [11, 22, 33, 44],
                   '2column': [32, 42, 15, 35],
                   '3column': [33, 77, 26, 64],
                   '4column': [99, 11, 110, 22],
                   '5column': [20, 64, 55, 33],
                   '6column': [10, 77, 77, 10]})
check_columns = ['3column', '5column', '6column']
df1 = df.copy()
df1['bignum_count'] = 0
for column in check_columns:
    inner_loop_count = []
    bigseries = df[column] >= 50
    for big in bigseries:
        if big:
            inner_loop_count.append(1)
        else:
            inner_loop_count.append(0)
    df1['bignum_count'] += inner_loop_count
# View the dataframe
df1
results:
   1column  2column  3column  4column  5column  6column  bignum_count
0       11       32       33       99       20       10             0
1       22       42       77       11       64       77             3
2       33       15       26      110       55       77             2
3       44       35       64       22       33       10             1
Index into the columns of interest and check which values are greater than or equal (ge) to a threshold:
df['bignum_count'] = df[check_columns].ge(50).sum(1)
print(df)
   1column  2column  3column  4column  5column  6column  bignum_count
0       11       32       33       99       20       10             0
1       22       42       77       11       64       77             3
2       33       15       26      110       55       77             2
3       44       35       64       22       33       10             1
Use DataFrame.ge for >= and count the True values with sum:
df['bignum_count'] = df[check_columns].ge(50).sum(axis=1)
#alternative
#df['bignum_count'] = (df[check_columns]>=50).sum(axis=1)
print(df)
   1column  2column  3column  4column  5column  6column  bignum_count
0       11       32       33       99       20       10             0
1       22       42       77       11       64       77             3
2       33       15       26      110       55       77             2
3       44       35       64       22       33       10             1
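If the check ever needs to scale to many columns, the same count can also be done on the underlying NumPy array, which avoids building an intermediate boolean DataFrame. A minimal sketch:
import numpy as np

# count entries >= 50 per row across the selected columns
df['bignum_count'] = np.count_nonzero(df[check_columns].to_numpy() >= 50, axis=1)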

Pandas first 5 and last 5 rows in single iloc operation

I need to check df.head() and df.tail() many times.
When using df.head() and df.tail(), Jupyter Notebook displays two separate, ugly outputs.
Is there any single-line command so that we can select only the first 5 and last 5 rows,
something like:
df.iloc[:5 | -5:] ?
Test example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20, 2))
df.iloc[:5]
Update
Ugly but working ways:
df.iloc[np.where((df.index < 5) | (df.index >= len(df) - 5))[0]]
or,
df.iloc[np.r_[np.arange(5), np.arange(df.shape[0]-5, df.shape[0])]]
Try taking a look at numpy.r_:
df.iloc[np.r_[0:5, -5:0]]
Out[358]:
           0         1
0   0.899673  0.584707
1   0.443328  0.126370
2   0.203212  0.206542
3   0.562156  0.401226
4   0.085070  0.206960
15  0.082846  0.548997
16  0.435308  0.669673
17  0.426955  0.030303
18  0.327725  0.340572
19  0.250246  0.162993
Also head + tail is not a bad solution
df.head(5).append(df.tail(5))
Out[362]:
           0         1
0   0.899673  0.584707
1   0.443328  0.126370
2   0.203212  0.206542
3   0.562156  0.401226
4   0.085070  0.206960
15  0.082846  0.548997
16  0.435308  0.669673
17  0.426955  0.030303
18  0.327725  0.340572
19  0.250246  0.162993
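One caveat: DataFrame.append was removed in pandas 2.0, so on current versions the same head-plus-tail idea is spelled with pd.concat:
pd.concat([df.head(5), df.tail(5)])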
df.query("index<5 | index>"+str(len(df)-5))
Here's a way to query the index. You can change the values to whatever you want.
Another approach (per this SO post) uses only pandas .isin().
Generate some dummy/demo data:
df = pd.DataFrame({'a': range(10, 100)})
print(df.head())
    a
0  10
1  11
2  12
3  13
4  14
print(df.tail())
     a
85  95
86  96
87  97
88  98
89  99
print(df.shape)
(90, 1)
Generate list of required indexes
ls = list(range(5)) + list(range(len(df)-5, len(df)))
print(ls)
[0, 1, 2, 3, 4, 85, 86, 87, 88, 89]
Slice DataFrame using list of indexes
df_first_last_5 = df[df.index.isin(ls)]
print(df_first_last_5)
     a
0   10
1   11
2   12
3   13
4   14
85  95
86  96
87  97
88  98
89  99

Creating a dictionary column using combination of two columns in dataframe, and then computing ratio of values of two columns with common keys

I have a dataframe of the following format:
Id  Name_prev    Weight_prev        Name_now   Weight_now
1   [1,3,4,5]    [10,34,67,37]      [1,3,5]    [45,76,12]
2   [10,3,40,5]  [100,134,627,347]  [10,40,5]  [34,56,78]
3   [1,30,4,50]  [11,22,45,67]      [1,30,50]  [12,45,78]
4   [1,7,8,9]    [32,54,76,98]      [7,8,9]    [34,12,32]
I want to create two new variables:
Union of Name_prev and Name_now: this is the intersection of the Name_prev and Name_now fields and can be done using set operations on the two columns; I am able to compute this.
Ratio of Name_prev and Name_now: this is the ratio of the values (Weight_prev / Weight_now) corresponding to the common names in Name_prev and Name_now.
Expected Output:
Id  Union of Name_prev and Name_now  Ratio of Name_prev and Name_now
1   [1,3,5]                          [10/45, 34/76, 37/12]
2   [10,40,5]                        [100/34, 627/56, 347/78]
3   [1,30,50]                        [11/12, 22/45, 67/78]
4   [7,8,9]                          [54/34, 76/12, 98/32]
I am trying to create a dictionary-like structure by combining Name_prev and Weight_prev as key-value pairs, doing the same for Name_now and Weight_now, and then taking the ratio for common keys, but am stuck...
Use:
a, b = [], []
for n1, n2, w1, w2 in zip(df['Name_prev'], df['Name_now'],
                          df['Weight_prev'], df['Weight_now']):
    # get intersection of lists
    n = [val for val in n1 if val in n2]
    # get indices by enumerate and select weights
    w3 = [w1[i] for i, val in enumerate(n1) if val in n2]
    w4 = [w2[i] for i, val in enumerate(n2) if val in n1]
    # divide each value in list
    w = [i / j for i, j in zip(w3, w4)]
    a.append(n)
    b.append(w)

df = df.assign(name=a, weight=b)
print(df)
   Id       Name_prev           Weight_prev     Name_now    Weight_now  \
0   1    [1, 3, 4, 5]      [10, 34, 67, 37]    [1, 3, 5]  [45, 76, 12]
1   2  [10, 3, 40, 5]  [100, 134, 627, 347]  [10, 40, 5]  [34, 56, 78]
2   3  [1, 30, 4, 50]      [11, 22, 45, 67]  [1, 30, 50]  [12, 45, 78]
3   4    [1, 7, 8, 9]      [32, 54, 76, 98]    [7, 8, 9]  [34, 12, 32]

          name                                             weight
0    [1, 3, 5]  [0.2222222222222222, 0.4473684210526316, 3.083...
1  [10, 40, 5]  [2.9411764705882355, 11.196428571428571, 4.448...
2  [1, 30, 50]  [0.9166666666666666, 0.4888888888888889, 0.858...
3    [7, 8, 9]    [1.588235294117647, 6.333333333333333, 3.0625]
If need remove original columns use DataFrame.pop:
a, b = [], []
for n1, n2, w1, w2 in zip(df.pop('Name_prev'), df.pop('Name_now'),
                          df.pop('Weight_prev'), df.pop('Weight_now')):
    n = [val for val in n1 if val in n2]
    w3 = [w1[i] for i, val in enumerate(n1) if val in n2]
    w4 = [w2[i] for i, val in enumerate(n2) if val in n1]
    w = [i / j for i, j in zip(w3, w4)]
    a.append(n)
    b.append(w)

df = df.assign(name=a, weight=b)
print(df)
   Id         name                                             weight
0   1    [1, 3, 5]  [0.2222222222222222, 0.4473684210526316, 3.083...
1   2  [10, 40, 5]  [2.9411764705882355, 11.196428571428571, 4.448...
2   3  [1, 30, 50]  [0.9166666666666666, 0.4888888888888889, 0.858...
3   4    [7, 8, 9]    [1.588235294117647, 6.333333333333333, 3.0625]
EDIT:
Working with lists in pandas is never vectorized, so it is better to flatten the lists first, merge, and, if necessary, aggregate back into lists at the end:
from itertools import chain

df_prev = pd.DataFrame({
    'Name': list(chain.from_iterable(df['Name_prev'].values.tolist())),
    'Weight_prev': list(chain.from_iterable(df['Weight_prev'].values.tolist())),
    'Id': df['Id'].values.repeat(df['Name_prev'].str.len())
})
print(df_prev)
    Name  Weight_prev  Id
0      1           10   1
1      3           34   1
2      4           67   1
3      5           37   1
4     10          100   2
5      3          134   2
6     40          627   2
7      5          347   2
8      1           11   3
9     30           22   3
10     4           45   3
11    50           67   3
12     1           32   4
13     7           54   4
14     8           76   4
15     9           98   4
df_now = pd.DataFrame({
    'Name': list(chain.from_iterable(df['Name_now'].values.tolist())),
    'Weight_now': list(chain.from_iterable(df['Weight_now'].values.tolist())),
    'Id': df['Id'].values.repeat(df['Name_now'].str.len())
})
print(df_now)
    Name  Weight_now  Id
0      1          45   1
1      3          76   1
2      5          12   1
3     10          34   2
4     40          56   2
5      5          78   2
6      1          12   3
7     30          45   3
8     50          78   3
9      7          34   4
10     8          12   4
11     9          32   4
df = df_prev.merge(df_now, on=['Id','Name'])
df['Weight'] = df['Weight_prev'] / df['Weight_now']
print (df)
    Name  Weight_prev  Id  Weight_now     Weight
0      1           10   1          45   0.222222
1      3           34   1          76   0.447368
2      5           37   1          12   3.083333
3     10          100   2          34   2.941176
4     40          627   2          56  11.196429
5      5          347   2          78   4.448718
6      1           11   3          12   0.916667
7     30           22   3          45   0.488889
8     50           67   3          78   0.858974
9      7           54   4          34   1.588235
10     8           76   4          12   6.333333
11     9           98   4          32   3.062500
df = df.groupby('Id')[['Name', 'Weight']].agg(list).reset_index()
print (df)
   Id         Name                                             Weight
0   1    [1, 3, 5]  [0.2222222222222222, 0.4473684210526316, 3.083...
1   2  [10, 40, 5]  [2.9411764705882355, 11.196428571428571, 4.448...
2   3  [1, 30, 50]  [0.9166666666666666, 0.4888888888888889, 0.858...
3   4    [7, 8, 9]    [1.588235294117647, 6.333333333333333, 3.0625]
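For completeness, the dictionary idea sketched in the question also works row by row: zip each name list with its weight list into a dict, then divide over the shared keys. A minimal sketch, assuming the original dataframe from the question (before the reshaping above):
def ratio_row(row):
    prev = dict(zip(row['Name_prev'], row['Weight_prev']))
    now = dict(zip(row['Name_now'], row['Weight_now']))
    common = [k for k in row['Name_prev'] if k in now]  # keep original order
    return pd.Series({'name': common,
                      'weight': [prev[k] / now[k] for k in common]})

df[['name', 'weight']] = df.apply(ratio_row, axis=1)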

new python pandas dataframe column based on value of variable, using function

I have a variable, 'ImageName', which ranges from 0 to 1600. I want to create a new variable, 'LocationCode', based on the value of 'ImageName'.
If 'ImageName' is less than or equal to 70, I want 'LocationCode' to be 1. If 'ImageName' is between 71 and 90, I want 'LocationCode' to be 2. I have 13 different codes in all. I'm not sure how to write this in python pandas. Here's what I tried:
def spatLoc(ImageName):
    if ImageName <= 70:
        LocationCode = 1
    elif ImageName > 70 and ImageName <= 90:
        LocationCode = 2
    return LocationCode

df['test'] = df.apply(spatLoc(df['ImageName'])
but it returned an error. I'm clearly not defining things the right way but I can't figure out how to.
You can just use 2 boolean masks:
df.loc[df['ImageName'] <= 70, 'Test'] = 1
df.loc[(df['ImageName'] > 70) & (df['ImageName'] <= 90), 'Test'] = 2
By using the masks you only set the value where the boolean condition is met; for the second mask you need to use the & operator to combine the conditions and enclose each condition in parentheses due to operator precedence.
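With 13 codes, chaining 13 masks gets verbose; numpy.select takes the whole list of conditions and the matching codes in one call. A minimal sketch covering only the first two codes:
import numpy as np

conditions = [df['ImageName'] <= 70,
              (df['ImageName'] > 70) & (df['ImageName'] <= 90)]
codes = [1, 2]  # extend both lists with the remaining 11 cases
df['LocationCode'] = np.select(conditions, codes, default=0)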
Actually I think it would be better to define your bin values and call cut, example:
In [20]:
df = pd.DataFrame({'ImageName': np.random.randint(0, 100, 20)})
df
Out[20]:
    ImageName
0          48
1          78
2           5
3           4
4           9
5          81
6          49
7          11
8          57
9          17
10         92
11         30
12         74
13         62
14         83
15         21
16         97
17         11
18         34
19         78
In [22]:
df['group'] = pd.cut(df['ImageName'], range(0, 105, 10), right=False)
df
Out[22]:
    ImageName      group
0          48   [40, 50)
1          78   [70, 80)
2           5    [0, 10)
3           4    [0, 10)
4           9    [0, 10)
5          81   [80, 90)
6          49   [40, 50)
7          11   [10, 20)
8          57   [50, 60)
9          17   [10, 20)
10         92  [90, 100)
11         30   [30, 40)
12         74   [70, 80)
13         62   [60, 70)
14         83   [80, 90)
15         21   [20, 30)
16         97  [90, 100)
17         11   [10, 20)
18         34   [30, 40)
19         78   [70, 80)
Here the bin values were generated using range but you could pass your list of bin values yourself, once you have the bin values you can define a lookup dict:
In [32]:
d = dict(zip(df['group'].unique(), range(len(df['group'].unique()))))
d
Out[32]:
{'[0, 10)': 2,
 '[10, 20)': 4,
 '[20, 30)': 9,
 '[30, 40)': 7,
 '[40, 50)': 0,
 '[50, 60)': 5,
 '[60, 70)': 8,
 '[70, 80)': 1,
 '[80, 90)': 3,
 '[90, 100)': 6}
You can now call map and add your new column:
In [33]:
df['test'] = df['group'].map(d)
df
Out[33]:
    ImageName      group  test
0          48   [40, 50)     0
1          78   [70, 80)     1
2           5    [0, 10)     2
3           4    [0, 10)     2
4           9    [0, 10)     2
5          81   [80, 90)     3
6          49   [40, 50)     0
7          11   [10, 20)     4
8          57   [50, 60)     5
9          17   [10, 20)     4
10         92  [90, 100)     6
11         30   [30, 40)     7
12         74   [70, 80)     1
13         62   [60, 70)     8
14         83   [80, 90)     3
15         21   [20, 30)     9
16         97  [90, 100)     6
17         11   [10, 20)     4
18         34   [30, 40)     7
19         78   [70, 80)     1
The above can be modified to suit your needs but it's just to demonstrate an approach which should be fast and without the need to iterate over your df.
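Since the real cutoffs are known (70, 90, and so on up to 1600), pd.cut can also assign the codes directly through its labels argument, skipping the lookup-dict step. A sketch with hypothetical remaining cutoffs:
bins = [-1, 70, 90, 1600]  # hypothetical edges; 13 codes need 14 edges
df['LocationCode'] = pd.cut(df['ImageName'], bins=bins, labels=[1, 2, 3])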
In Python, you use dictionary lookup notation to find a field within a row. The field name is ImageName. In the spatLoc() function below, the parameter row is a dictionary containing the entire row, and you find an individual column by using the field name as the key to that dictionary.
def spatLoc(row):
    if row['ImageName'] <= 70:
        LocationCode = 1
    elif row['ImageName'] > 70 and row['ImageName'] <= 90:
        LocationCode = 2
    return LocationCode

df['test'] = df.apply(spatLoc, axis=1)
