In my dataframe I have links with UTM parameters:
utm_content=keys_{gbid}|cid|{campaign_id}|aid|{keyword}|{phrase_id}|src&utm_term={keyword}
The dataframe also has several ID columns: CampaignId, AdGroupId, Keyword, Keyword ID.
I need to replace the placeholders in curly brackets in the link with the values from these columns.
For example, {campaign_id} should be replaced with the values from the CampaignId column, and so on for each placeholder in the link.
The result should be like this -
utm_content=keys_3745473327|cid|31757442|aid|CRM|38372916231|src&utm_term=CRM
You can try this:
import pandas as pd
import numpy as np
# create some sample data
df = pd.DataFrame(columns=['CampaignId', 'AdGroupId', 'Keyword', 'Keyword ID'],
                  data=np.random.randint(low=0, high=100, size=(10, 4)))
df['url'] = 'utm_content=keys_{gbid}|cid|{campaign_id}|aid|{keyword}|{phrase_id}|src&utm_term={keyword}'
df
Output:
CampaignId AdGroupId Keyword Keyword ID url
0 21 13 26 41 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
1 28 9 19 3 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
2 11 17 37 43 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
3 25 13 17 54 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
4 32 19 17 48 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
5 26 92 80 90 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
6 25 17 1 54 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
7 81 7 68 85 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
8 75 55 37 56 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
9 14 53 34 84 utm_content=keys_{gbid}|cid|{campaign_id}|aid|...
Then write a custom function that fills in the variables via an f-string, and apply it to the dataframe to create a new column (you can also overwrite the url column if you want):
def fill_link(CampaignId, AdGroupId, Keyword, KeywordID, url):
    # bind local names that match the placeholders inside the template
    campaign_id = CampaignId
    keyword = Keyword
    gbid = AdGroupId
    phrase_id = KeywordID
    # evaluate the template as an f-string so the locals fill in the braces
    return eval("f'" + f"{url}" + "'")
df['url_filled'] = df.apply(lambda row: fill_link(row['CampaignId'], row['AdGroupId'], row['Keyword'], row['Keyword ID'], row['url']), axis=1)
df
CampaignId AdGroupId Keyword Keyword ID url url_filled
0 21 13 26 41 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_13|cid|21|aid|26|41|src&utm_t...
1 28 9 19 3 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_9|cid|28|aid|19|3|src&utm_ter...
2 11 17 37 43 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_17|cid|11|aid|37|43|src&utm_t...
3 25 13 17 54 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_13|cid|25|aid|17|54|src&utm_t...
4 32 19 17 48 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_19|cid|32|aid|17|48|src&utm_t...
5 26 92 80 90 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_92|cid|26|aid|80|90|src&utm_t...
6 25 17 1 54 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_17|cid|25|aid|1|54|src&utm_te...
7 81 7 68 85 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_7|cid|81|aid|68|85|src&utm_te...
8 75 55 37 56 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_55|cid|75|aid|37|56|src&utm_t...
9 14 53 34 84 utm_content=keys_{gbid}|cid|{campaign_id}|aid|... utm_content=keys_53|cid|14|aid|34|84|src&utm_t...
I am not sure whether the variable names are assigned correctly, as yours are not named exactly the same. But it shouldn't be a problem for you to reassign them as you wish.
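Note that eval executes whatever text happens to be in the url column, which is risky if the links come from an external source. A safer sketch of the same idea uses str.format, which fills the identical {placeholder} syntax without executing code (assuming the column names from the sample above):
def fill_link_safe(row):
    # str.format substitutes each {name} placeholder with the matching keyword argument
    return row['url'].format(
        gbid=row['AdGroupId'],
        campaign_id=row['CampaignId'],
        keyword=row['Keyword'],
        phrase_id=row['Keyword ID'],
    )
df['url_filled'] = df.apply(fill_link_safe, axis=1)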
Related
I have the following code.
I am trying to sort the values of the first column of the 'happydflist' dataframe in ascending order.
However, the output includes values such as '2', '3' and '8' that do not fit the ascending order.
happydflist = happydflist[happydflist.columns[0]]
happydflistnew = happydflist.sort_values(ascending=True)
print(happydflistnew)
12 13
10 19
13 2
11 24
15 3
6 33
24 35
8 36
5 37
25 49
17 49
20 50
26 51
22 52
16 52
18 52
19 52
28 53
27 54
23 54
21 59
9 74
7 75
14 8
Name: 0_happy, dtype: object
I would be so grateful for a helping hand!
'happydflist' looks like this:
5 37
6 33
7 75
8 36
9 74
10 19
11 24
12 13
13 2
14 8
15 3
16 52
17 49
18 52
19 52
20 50
21 59
22 52
23 54
24 35
25 49
26 51
27 54
28 53
Name: 0_happy, dtype: object
The dtype: object in your output suggests the values are strings, and strings sort lexicographically ('19' comes before '2'), so convert them to int instead:
happydflist.astype('int').sort_values()
If you need str dtype afterwards, use astype once more:
happydflist.astype('int').sort_values().astype('str')
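A quick sketch of the difference between the two sort orders:
import pandas as pd
s = pd.Series(['2', '13', '24', '19'])
print(s.sort_values().tolist())                # ['13', '19', '2', '24'] - lexicographic
print(s.astype('int').sort_values().tolist())  # [2, 13, 19, 24] - numeric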
I managed to resolve the issue by using the .str.strip() method to remove whitespace around the text in the dataframe, combined with the .dropna() method.
happydflistnew = happydflist[happydflist.columns[0]].str.strip()  # strip surrounding whitespace
happydflistnew = happydflistnew.dropna()                          # drop rows left empty
happydflistsorted = happydflistnew.astype('int').sort_values(ascending=True)
maxvalue = len(happydflistsorted)
minhappiness = happydflistsorted.iloc[0]              # smallest value
maxhappiness = happydflistsorted.iloc[maxvalue - 1]   # largest value
I have 12 columns filled with wages. I want to calculate the mean, but my output is 12 different means, one per column; I want a single mean calculated over the whole dataset.
This is how my df looks:
Month 1 Month 2 Month 3 Month 4 ... Month 9 Month 10 Month 11 Month 12
0 1429.97 2816.61 2123.29 2123.29 ... 2816.61 2816.61 1429.97 1776.63
1 3499.53 3326.20 3499.53 2112.89 ... 1939.56 2806.21 2632.88 2459.55
2 2599.95 3119.94 3813.26 3466.60 ... 3466.60 3466.60 2946.61 2946.61
3 2599.95 2946.61 3466.60 2773.28 ... 2253.29 3119.94 1906.63 2773.28
I used this code to calculate the mean:
mean = df.mean()
Do I have to convert these 12 columns into one column, or how else can I calculate a single mean?
Just call mean again to get the mean of those 12 column means:
df.mean().mean()
Use numpy.mean after converting the values to a 2D array:
import numpy as np
mean = np.mean(df.to_numpy())
print (mean)
2914.254166666667
Or use DataFrame.melt:
mean = df.melt()['value'].mean()
print (mean)
2914.254166666666
You can also use stack:
df.stack().mean()
Suppose this dataframe:
>>> df
A B C D E F G H
0 60 1 59 25 8 27 34 43
1 81 48 32 30 60 3 90 22
2 66 15 21 5 23 36 83 46
3 56 42 14 86 41 64 89 56
4 28 53 89 89 52 13 12 39
5 64 7 2 16 91 46 74 35
6 81 81 27 67 26 80 19 35
7 56 8 17 39 63 6 34 26
8 56 25 26 39 37 14 41 27
9 41 56 68 38 57 23 36 8
>>> df.stack().mean()
41.6625
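Note that the approaches are not interchangeable once missing values appear: df.mean().mean() averages the per-column means, stack and melt average over all non-missing values, and np.mean returns NaN if any value is missing. A small hypothetical example:
import numpy as np
import pandas as pd
df2 = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, np.nan, np.nan]})
print(df2.mean().mean())   # 6.0: mean of the column means (2.0 and 10.0)
print(df2.stack().mean())  # 4.0: mean of all four non-missing values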
I have a process whose end product is a pandas DataFrame. The output, which varies in data and length, is structured like this example:
9 80340796
10 80340797
11 80340798
12 80340799
13 80340800
14 80340801
15 80340802
16 80340803
17 80340804
18 80340805
19 80340806
20 80340807
21 80340808
22 80340809
23 80340810
24 80340811
25 80340812
26 80340813
27 80340814
28 80340815
29 80340816
30 80340817
31 80340818
32 80340819
33 80340820
34 80340821
35 80340822
36 80340823
37 80340824
38 80340825
39 80340826
40 80340827
41 80340828
42 80340829
43 80340830
44 80340831
45 80340832
46 80340833
I need to get the numbers in the second column above into the following grid format, based on the numbers in the first column above.
1 2 3 4 5 6 7 8 9 10 11 12
A 1 9 17 25 33 41 49 57 65 73 81 89
B 2 10 18 26 34 42 50 58 66 74 82 90
C 3 11 19 27 35 43 51 59 67 75 83 91
D 4 12 20 28 36 44 52 60 68 76 84 92
E 5 13 21 29 37 45 53 61 69 77 85 93
F 6 14 22 30 38 46 54 62 70 78 86 94
G 7 15 23 31 39 47 55 63 71 79 87 95
H 8 16 24 32 40 48 56 64 72 80 88 96
So the end result in this example would be that grid layout with each position number replaced by its corresponding value from the second column.
Any advice on how to go about this would be much appreciated. I've been asked for this by a colleague so the data is easy to read for their team (it matches the layout of a physical test), but I have no idea how to produce it.
A pandas pivot table can do what you want, but first you have to create two auxiliary columns: one determining which column each value goes in, and another determining the row. You can get them as shown in the following example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'num': list(range(9, 28)), 'val': list(range(80001, 80020))})
max_rows = 8
df['row'] = (df['num'] - 1) % max_rows                 # 0-7, i.e. grid rows A-H
df['col'] = np.ceil(df['num'] / max_rows).astype(int)  # the grid column, 1-12
df.pivot_table(values=['val'], columns=['col'], index=['row'])
val
col 2 3 4
row
0 80001.0 80009.0 80017.0
1 80002.0 80010.0 80018.0
2 80003.0 80011.0 80019.0
3 80004.0 80012.0 NaN
4 80005.0 80013.0 NaN
5 80006.0 80014.0 NaN
6 80007.0 80015.0 NaN
7 80008.0 80016.0 NaN
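To reproduce the letter labels A-H from the desired layout, one option (a sketch building on the row and col columns created above) is to relabel the 0-7 row index after pivoting:
grid = df.pivot_table(values='val', columns='col', index='row')
grid.index = [chr(ord('A') + r) for r in grid.index]  # 0 -> A, 1 -> B, ..., 7 -> H
print(grid)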
I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum the row values of only those columns whose names match a string (in this case, all columns ending in _CAP), and store the result in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried the solution recommended in this question:
Summing columns in Dataframe that have matching column headers
However, that solution doesn't work for me: it sums columns that have exactly the same name, so a simple groupby can accomplish the result, whereas I am trying to sum columns that only match a specific string.
Code to recreate above sample dataset:
import pandas as pd
data1 = [['1', 33, 25, 16, 50], ['2', 34, 22, 11, 66],
         ['3', 22, 12, 15, 42], ['4', 46, 45, 66, 54],
         ['5', 16, 6, 23, 75], ['6', 21, 42, 433, 50]]
df = pd.DataFrame(data1, columns=['ID', 'Length', 'Width', 'Range_CAP', 'Capacity_CAP'])
Let us do filter, which selects columns whose names contain the given substring:
df['CAP_SUM'] = df.filter(like='CAP').sum(1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If other columns could have CAP elsewhere in the name, anchor the match to the _CAP suffix with a regex:
df.filter(regex='_CAP$').sum(1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask whose values are True if and only if the column name ends with _CAP. As an alternative, use filter with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
    if i.find('_CAP') != -1:
        df['sum'] = df['sum'] + df[i]
    else:
        pass
I'm collecting data over the years 1981-2018 from a website, where this link shows the 2018 data:
If one changes 2018 to a year from 1981-2018 in the aforementioned link, one obtains the rest of the dataset.
Using Pandas and urllib.request I collect the data as follows:
url = ['ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/' +
       '{}'.format(i) + '/Population.Heating.txt' for i in range(1981, 2019)]
data_url = [pd.read_csv(url[i], sep=" ", header=None) for i in range(len(url))]
Questions
First, is there a cleaner and more efficient way of collecting the data from the links than the above list comprehension? Second, how would I export the entire list comprehension to an Excel spreadsheet?
I tried the following method for exporting; however, the code only exported the year 2018:
from pandas import ExcelWriter
writer = ExcelWriter('PythonExport.xlsx')
for i in range(len(data_url)):
    data_url[i].to_excel(writer, 'Sheet1')
writer.save()
To address the question of why I didn't directly import the data to Excel: ultimately, I would like the data for each year in a DataFrame, with one column containing the 'Region' data and the other containing the 'Conus' data. In trying to construct this DataFrame, it seemed easier to munge the data in Excel than to work with the list comprehension data_url from above, then use that data to build the DataFrame.
Here is a way to parse that data into a single dataframe. (Your export attempt likely kept only 2018 because every loop iteration wrote to the same sheet 'Sheet1', so each year's write replaced the previous one.)
Code:
url = [
    'ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/'
    '{}'.format(i) + '/Population.Heating.txt' for i in range(1981, 2018)]
# skip the 3 header lines, split on '|', and transpose so that dates become
# the rows and regions become the columns
data_url = [pd.read_csv(url[i], sep="|", skiprows=3, index_col=0).T
            for i in range(len(url))]
df = pd.concat(data_url)
print(df.head())
print(df.tail())
Results:
Region 1 2 3 4 5 6 7 8 9 CONUS
19810101 51 45 36 33 24 24 14 22 14 28
19810102 46 42 43 40 23 29 17 22 16 29
19810103 55 50 51 46 26 28 17 23 14 33
19810104 66 59 62 55 27 30 18 23 15 37
19810105 62 56 59 47 34 42 22 24 14 38
Region 1 2 3 4 5 6 7 8 9 CONUS
20171227 53 49 62 64 22 35 28 29 15 37
20171228 59 54 60 57 27 37 28 26 13 38
20171229 59 53 54 54 26 33 23 24 11 35
20171230 57 50 54 62 24 32 19 27 12 34
20171231 59 55 60 68 29 39 27 30 15 40
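Once the years are concatenated into a single DataFrame like this, exporting is one call: df.to_excel('PythonExport.xlsx'). If you would rather keep one sheet per year, a sketch along these lines should work, assuming the data_url list built above:
with pd.ExcelWriter('PythonExport.xlsx') as writer:
    # one sheet per year, so later writes do not replace earlier ones
    for year, frame in zip(range(1981, 2018), data_url):
        frame.to_excel(writer, sheet_name=str(year))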