Say I have a dataframe with a MultiIndex column header: ('Season', 'Season'), ('First team', 'Players'), ('First team', 'Teams').
I know I can drop levels so that I have columns: ['Season', 'Players', 'Teams']. Is there a function in pandas where I can collapse 'First team' name into a column so that the entire column says 'First team'?
IIUC, you can do a few different things; here are two ways.
First, create a dummy dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, (5, 3)),
                  columns=pd.MultiIndex.from_tuples([('Season', 'Season'),
                                                     ('First team', 'Players'),
                                                     ('First team', 'Teams')]))
Input dummy dataframe:
Season First team
Season Players Teams
0 28 41 53
1 62 87 87
2 43 94 4
3 23 12 93
4 14 43 62
Then use droplevel:
df = df.droplevel(0, axis=1)
Output:
Season Players Teams
0 28 41 53
1 62 87 87
2 43 94 4
3 23 12 93
4 14 43 62
Or, starting again from the original two-level df, flatten the MultiIndex column header using a list comprehension:
df.columns = [f'{i}_{j}' for i, j in df.columns]
#Also, can use df.columns = df.columns.map('_'.join)
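# Note: '_'.join assumes every level is a string; for mixed types use
# df.columns.map(lambda t: '_'.join(map(str, t)))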
df
Output:
Season_Season First team_Players First team_Teams
0 28 41 53
1 62 87 87
2 43 94 4
3 23 12 93
4 14 43 62
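Equivalently, starting from the original two-level df, you can keep just the lower level with get_level_values (same effect as droplevel(0)):
df.columns = df.columns.get_level_values(1)  # ['Season', 'Players', 'Teams']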
In my dataframe I have links with UTM parameters:
utm_content=keys_{gbid}|cid|{campaign_id}|aid|{keyword}|{phrase_id}|src&utm_term={keyword}
The dataframe also has several id columns: CampaignId, AdGroupId, Keyword, Keyword ID.
I need to replace the values in curly brackets in the link with the values from these columns.
For example, I need to replace {campaign_id} with values from the CampaignId column, and do the same for each placeholder in the link.
The result should look like this:
utm_content=keys_3745473327|cid|31757442|aid|CRM|38372916231|src&utm_term=CRM
You can try this:
import pandas as pd
import numpy as np
# create some sample data
df = pd.DataFrame(columns=['CampaignId', 'AdGroupId', 'Keyword', 'Keyword ID'],
                  data=np.random.randint(low=0, high=100, size=(10, 4)))
df['url'] = 'utm_content=keys_{gbid}|cid|{campaign_id}|aid|{keyword}|{phrase_id}|src&utm_term={keyword}'
df
Output:
CampaignId AdGroupId Keyword Keyword ID url
0 21 13 26 41 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
1 28 9 19 3 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
2 11 17 37 43 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
3 25 13 17 54 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
4 32 19 17 48 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
5 26 92 80 90 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
6 25 17 1 54 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
7 81 7 68 85 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
8 75 55 37 56 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
9 14 53 34 84 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
Then write a custom function that fills the placeholders via an f-string and eval, and apply it to the dataframe to create a new column (you can also overwrite the url column if you prefer):
def fill_link(CampaignId, AdGroupId, Keyword, KeywordID, url):
    # Bind the row values to the names used inside the {placeholders}
    campaign_id = CampaignId
    keyword = Keyword
    gbid = AdGroupId
    phrase_id = KeywordID
    # Evaluate url as an f-string so the local names above get substituted.
    # Caution: eval runs arbitrary code, so only use it on trusted input.
    return eval("f'" + f"{url}" + "'")

df['url_filled'] = df.apply(lambda row: fill_link(row['CampaignId'], row['AdGroupId'],
                                                  row['Keyword'], row['Keyword ID'],
                                                  row['url']), axis=1)
df
CampaignId AdGroupId Keyword Keyword ID url url_filled
0 21 13 26 41 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_13|cid|21|aid|26|41|src&utm_t...
1 28 9 19 3 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_9|cid|28|aid|19|3|src&utm_ter...
2 11 17 37 43 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_17|cid|11|aid|37|43|src&utm_t...
3 25 13 17 54 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_13|cid|25|aid|17|54|src&utm_t...
4 32 19 17 48 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_19|cid|32|aid|17|48|src&utm_t...
5 26 92 80 90 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_92|cid|26|aid|80|90|src&utm_t...
6 25 17 1 54 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_17|cid|25|aid|1|54|src&utm_te...
7 81 7 68 85 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_7|cid|81|aid|68|85|src&utm_te...
8 75 55 37 56 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_55|cid|75|aid|37|56|src&utm_t...
9 14 53 34 84 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_53|cid|14|aid|34|84|src&utm_t...
I am not sure the variable names match yours exactly, as they are not quite the same, but it should be straightforward to rename them as needed.
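As a safer sketch that avoids eval (assuming the same sample columns as above), map the placeholder names to columns and let str.format do the substitution:
mapping = {'gbid': 'AdGroupId', 'campaign_id': 'CampaignId',
           'keyword': 'Keyword', 'phrase_id': 'Keyword ID'}

def fill_link_safe(row):
    # substitute each {placeholder} with the matching column's value
    return row['url'].format(**{name: row[col] for name, col in mapping.items()})

df['url_filled'] = df.apply(fill_link_safe, axis=1)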
I have a dataframe that is the result of various pivot operations, with float numbers (this example uses integers for simplicity):
import numpy as np
import pandas as pd
np.random.seed(365)
rows = 10
cols = {'col_a': [np.random.randint(100) for _ in range(rows)],
        'col_b': [np.random.randint(100) for _ in range(rows)],
        'col_c': [np.random.randint(100) for _ in range(rows)]}
data = pd.DataFrame(cols)
data
col_a col_b col_c
0 82 36 43
1 52 48 12
2 33 28 77
3 91 99 11
4 44 95 27
5 5 94 64
6 98 3 88
7 73 39 92
8 26 39 62
9 56 74 50
I want to detect the two minimum values in each row and put their difference in a new column.
For example, in the first row the two minimum values are 36 and 43, so the difference is 7.
I've tried this:
data['difference']=data[data.apply(lambda x: x.nsmallest(2).astype(float), axis=1).isna()].subtract(axis=1)
but I get:
TypeError: f() missing 1 required positional argument: 'other'
Better to use numpy:
a = np.sort(data)                     # sorts each row ascending (axis=-1 by default)
data['difference'] = a[:,1] - a[:,0]  # second smallest minus smallest
Output:
col_a col_b col_c difference
0 82 36 43 7
1 52 48 12 36
2 33 28 77 5
3 91 99 11 80
4 44 95 27 17
5 5 94 64 59
6 98 3 88 85
7 73 39 92 34
8 26 39 62 13
9 56 74 50 6
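If the frame were wide and you only ever need the two smallest values per row, a sketch with np.partition avoids the full sort (using the question's columns):
part = np.partition(data[['col_a', 'col_b', 'col_c']].to_numpy(), 1, axis=1)
data['difference'] = part[:, 1] - part[:, 0]  # index 1 holds the second-smallest after partitioning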
Or follow your own idea, with nsmallest applied row-wise:
data['difference'] = data.apply(lambda x: x.nsmallest(2).tolist(), axis=1, result_type='expand').diff(axis=1)[1]
# or
data['difference'] = data.apply(lambda x: x.nsmallest(2).diff().iloc[-1], axis=1)
print(data)
col_a col_b col_c difference
0 82 36 43 7
1 52 48 12 36
2 33 28 77 5
3 91 99 11 80
4 44 95 27 17
5 5 94 64 59
6 98 3 88 85
7 73 39 92 34
8 26 39 62 13
9 56 74 50 6
Here is a way using rank():
(data.where(
     data.rank(axis=1, method='first')
         .le(2))
     .stack()
     .sort_values()
     .groupby(level=0)
     .agg(lambda x: x.diff().sum()))
This returns a Series aligned to the row index, so it can be assigned straight back to data['difference']. If your frame were larger and you wanted to potentially use more than the 2 smallest, this should work:
(data.where(
     data.rank(axis=1, method='first')
         .le(2))
     .stack()
     .sort_values(ascending=False)
     .groupby(level=0)
     .agg(lambda x: x.mul(-1).cumsum().add(x.max()*2).iloc[-1]))
I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum, per row, only the columns whose names match a string (in this case, all columns ending in _CAP), and store the result in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried the solution recommended in this question:
Summing columns in Dataframe that have matching column headers
However, that solution doesn't work for me: it sums columns that have exactly the same name, which a simple groupby can handle, whereas I want to sum columns matching a specific string.
Code to recreate the sample dataset above:
import pandas as pd

data1 = [['1', 33, 25, 16, 50], ['2', 34, 22, 11, 66],
         ['3', 22, 12, 15, 42], ['4', 46, 45, 66, 54],
         ['5', 16, 6, 23, 75], ['6', 21, 42, 433, 50]]
df = pd.DataFrame(data1, columns=['ID', 'Length', 'Width', 'Range_CAP', 'Capacity_CAP'])
Let's use filter:
df['CAP_SUM'] = df.filter(like='CAP').sum(axis=1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If other column names contain CAP elsewhere, anchor the match with a regex:
df.filter(regex='_CAP$').sum(axis=1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(axis=1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask that is True if and only if the column name ends with _CAP. As an alternative, use filter with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(axis=1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
You may try this:
columnstxt = df.columns
df['sum'] = 0
for col in columnstxt:
    # add every column whose name contains '_CAP'
    if '_CAP' in col:
        df['sum'] = df['sum'] + df[col]
I have some function that takes a DataFrame and an integer as arguments:
func(df, int)
The function returns a new DataFrame, e.g.:
df2 = func(df,2)
I'd like to write a loop for integers 2-10, resulting in 9 DataFrames. If I do this manually it would look like this:
df2 = func(df,2)
df3 = func(df2,3)
df4 = func(df3,4)
df5 = func(df4,5)
df6 = func(df5,6)
df7 = func(df6,7)
df8 = func(df7,8)
df9 = func(df8,9)
df10 = func(df9,10)
Is there a way to write a loop that does this?
This type of thing is what lists are for.
data_frames = [df]
for i in range(2, 11):
    data_frames.append(func(data_frames[-1], i))
It's a sign of brittle code when you see variable names like df1, df2, df3, etc. Use lists when you have a sequence of related objects to build.
To clarify, this data_frames is a list of DataFrames that can be concatenated with data_frames = pd.concat(data_frames, sort=False), resulting in one DataFrame that combines the original df with everything that results from the loop, correct?
Yup, that's right. If your goal is one final data frame, you can concatenate the entire list at the end to combine the information into a single frame.
Do you mind explaining why data_frames[-1], which takes the last item of the list, returns a DataFrame? Not clear on this.
Because as you're building the list, at all times each entry is a data frame. data_frames[-1] evaluates to the last element in the list, which in this case, is the data frame you most recently appended.
You may try itertools.accumulate as follows, with sample data df:
   a   b   c
0  75  18  17
1  48  56   3

import itertools

def func(x, y):
    return x + y

dfs = list(itertools.accumulate([df] + list(range(2, 11)), func))
[ a b c
0 75 18 17
1 48 56 3, a b c
0 77 20 19
1 50 58 5, a b c
0 80 23 22
1 53 61 8, a b c
0 84 27 26
1 57 65 12, a b c
0 89 32 31
1 62 70 17, a b c
0 95 38 37
1 68 76 23, a b c
0 102 45 44
1 75 83 30, a b c
0 110 53 52
1 83 91 38, a b c
0 119 62 61
1 92 100 47, a b c
0 129 72 71
1 102 110 57]
dfs is the list of result dataframes, where each frame is the previous result plus the next integer from 2-10.
If you want to concatenate them all into one dataframe, use pd.concat:
pd.concat(dfs)
Out[29]:
a b c
0 75 18 17
1 48 56 3
0 77 20 19
1 50 58 5
0 80 23 22
1 53 61 8
0 84 27 26
1 57 65 12
0 89 32 31
1 62 70 17
0 95 38 37
1 68 76 23
0 102 45 44
1 75 83 30
0 110 53 52
1 83 91 38
0 119 62 61
1 92 100 47
0 129 72 71
1 102 110 57
You can use exec with a formatted string:
for i in range(2, 11):
    exec("df{0} = func(df{1}, {0})".format(i, i - 1 if i > 2 else ''))
What is the nicest way to see which rows in a DataFrame are duplicated, with the duplicate rows sorted and stacked on top of each other? I know I can filter for duplicates with df.duplicated() or something like df[df.duplicated()==True], but I need to produce a dataframe containing the duplicates and then sort it so that matching records sit together. I do not need a col subset argument for this. Thank you.
One idea is to sort by all columns. Not sure how efficient that is, though.
In [20]: df = pd.DataFrame(np.random.randint(100, size=(3,3)), columns=list('ABC'))
In [21]: df = pd.concat([df, df], ignore_index=True)
In [22]: df
Out[22]:
A B C
0 23 71 65
1 63 0 47
2 47 13 44
3 23 71 65
4 63 0 47
5 47 13 44
In [23]: df.sort_values(df.columns.tolist())
Out[23]:
A B C
0 23 71 65
3 23 71 65
2 47 13 44
5 47 13 44
1 63 0 47
4 63 0 47
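To show only the rows that actually have duplicates (dropping singletons first), a sketch combining duplicated(keep=False) with the same sort:
dupes = df[df.duplicated(keep=False)]           # keep every copy of each duplicated row
dupes = dupes.sort_values(df.columns.tolist())  # stack matching rows next to each other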