I am trying to write my own function in Python 3.5, but am not having much luck.
I have a data frame that is 17 columns by 1,200 rows (tiny).
One of the columns is called "placement". Within this column, I have text contained in each row. The naming convention is as follows:
Campaign_Publisher_Site_AdType_AdSize_Device_Audience_Tactic_
The following code works perfectly and does exactly what I need it to do; I just don't want to do this for every data set I have:
df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
df_detailed = df.join(df_detailed)
new_columns = [...]  # then I rename the columns labelled 0, 1, 2, etc.
df_detailed.columns = new_columns
df_detailed.head()
What I'm trying to do is build a function that takes any column with _ as the delimiter and splits it across new columns.
I have tried the following (but unfortunately defining my own functions is something I'm horrible at):
def text_to_column(df):
    df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
    headings = df_detailed.columns
    headings.replace(" ", "_")
    df_detailed = df.join(df_detailed)
    df_detailed.columns = headings
    return (df)
and I get the following error "AttributeError: 'RangeIndex' object has no attribute 'replace'"
The end goal here is to write a function where I can pass the column name into the function, it separates the values contained within the column into new columns and then joins this back to my original Data Frame.
If I'm being ridiculous, please let me know. If someone can help me, it would be greatly appreciated.
Thanks,
Adrian
You need the rename function to replace the column names:
headings = df_detailed.columns
headings.replace(" ", "_")
change to:
df_detailed = df_detailed.rename(columns=lambda x: x.replace(" ", "_"))
Or convert the columns with to_series, because replace does not work with an Index (the column names):
headings.replace(" ", "_")
change to:
headings = headings.to_series().replace(" ", "_")
Also:
df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
can be changed to:
df_detailed = df['Placement'].str.rstrip('_').str.split('_', expand=True).astype(str)
EDIT:
Sample:
df = pd.DataFrame({'a': [1, 2], 'Placement': ['Campaign_Publisher_Site_AdType_AdSize_Device_Audience_Tactic_', 'a_b_c_d_f_g_h_i_']})
print (df)
                                           Placement  a
0  Campaign_Publisher_Site_AdType_AdSize_Device_A...  1
1                                   a_b_c_d_f_g_h_i_  2
# input is DataFrame and column name
def text_to_column(df, col):
    df_detailed = df[col].str.rstrip('_').str.split('_', expand=True).astype(str)
    # replace columns names if necessary
    df_detailed.columns = df_detailed.columns.to_series().replace(" ", "_")
    # remove column and join new df
    df_detailed = df.drop(col, axis=1).join(df_detailed)
    return df_detailed
df = text_to_column(df, 'Placement')
print (df)
   a         0          1     2       3       4       5         6       7
0  1  Campaign  Publisher  Site  AdType  AdSize  Device  Audience  Tactic
1  2         a          b     c       d       f       g         h       i
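If you also want to assign the real headings (as in your manual rename step), here is a small sketch extending the same function; the new_names list is something you supply yourself and must match the number of split parts:

def text_to_column(df, col, new_names=None):
    # split the delimited column into one new column per part
    df_detailed = df[col].str.rstrip('_').str.split('_', expand=True).astype(str)
    if new_names is not None:
        # e.g. ['Campaign', 'Publisher', 'Site', ...] supplied by the caller
        df_detailed.columns = new_names
    # remove the original column and join the new columns
    return df.drop(col, axis=1).join(df_detailed)

df = text_to_column(df, 'Placement',
                    ['Campaign', 'Publisher', 'Site', 'AdType',
                     'AdSize', 'Device', 'Audience', 'Tactic'])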
Related
This question has been asked multiple times in this community, but I couldn't find the correct answer since I am a beginner in Python. I actually have 2 questions:
I want to concatenate 3 columns (A, B, C) and their values into 1 column; the header would be ABC.
import os
import pandas as pd

directory = 'C:/Path'
ext = ('.csv')

for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    if f.endswith(ext):
        head_tail = os.path.split(f)
        head_tail1 = 'C:/Output'
        k = head_tail[1]
        r = k.split(".")[0]
        p = head_tail1 + "/" + r + " - Revised.csv"
        mydata = pd.read_csv(f)
        new = mydata[["A", "B", "C", "D"]]
        new = new.rename(columns={'D': 'Total'})
        new['Total'] = 1
        new.to_csv(p, index=False)
Once concatenated, is it possible to count each unique ID and put the total in column D? Basically, I want the total count per unique ID (column ABC); the data can be found on a link when you click that unique ID. For example: column ABC - uniqueid1 -> click -> go to the next page, which shows the total for that unique ID.
On the linked page, you can get the total number for each unique ID by serial ID.
I have no idea how to do this, but I would really appreciate it if someone could help me with this project; I would learn a lot from it.
Thank you very much. God Bless
I searched Google, YouTube, and Stack Overflow, but couldn't find the correct answer.
I'm not sure that I understand your question correctly. However, if you know exactly the column names (e.g., A, B, and C) that you want to concatenate, you can do something like the code below.
''.join(merge_columns) is to concatenate column names.
new[merge_columns].apply(lambda x: ''.join(x), axis=1) is to concatenate their values.
Then, you can count unique values of the new column using groupby().count().
new = mydata[["A","B","C","D"]]
new = new.rename(columns={'D': 'Total'})
new['Total'] = 1
# added lines
merge_columns = ['A', 'B', 'C']
merged_col = ''.join(merge_columns)
new[merged_col] = new[merge_columns].apply(lambda x: ''.join(x), axis=1)
new.drop(merge_columns, axis=1, inplace=True)
new = new.groupby(merged_col).count().reset_index()
new.to_csv(p, index=False)
example:
# before
> new
   A  B  C  Total
0  a  b  c      1
1  x  y  z      1
# after executing the added lines
> new
   ABC  Total
0  abc      2
1  xyz      1
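One side note on the code above: new = mydata[["A","B","C","D"]] takes a slice of mydata, so the later new['Total'] = 1 can trigger pandas' SettingWithCopyWarning. Taking an explicit copy avoids it:

new = mydata[["A", "B", "C", "D"]].copy()  # explicit copy, so the assignments below are safe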
Next time, try to specify your issues and give a minimal reproducible example.
This is just an example of how to use pd.melt and pd.groupby.
I hope it helps with your question.
import pandas as pd
### example dataframe
df = pd.DataFrame([['first', 1, 2, 3], ['second', 4, 5, 6], ['third', 7, 8, 9]], columns=['ID', 'A', 'B', 'C'])
### directly sum up A, B and C
df['total'] = df.sum(axis=1, numeric_only=True)
print(df)
### how to create a so called long dataframe with melt
df_long = pd.melt(df, id_vars='ID', value_vars=['A', 'B', 'C'], var_name='ABC')
print(df_long)
### group long dataframe by column and sum up all values with this ID
df_group = df_long.groupby(by='ID').sum()
print(df_group)
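One version note: df_long still contains the string column ABC, and groupby(...).sum() over a frame with non-numeric columns behaves differently across pandas releases, so selecting the melted value column explicitly is the safer spelling:

df_group = df_long.groupby(by='ID')['value'].sum()  # 'value' is melt's default value column
print(df_group)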
I have the following format of a csv file:
id a_mean_val_1 a_mean_val_2 a_var_val_1 a_var_val_2 b_mean_val_1 b_mean_val_2 b_var_val_1 b_var_val_2
I would like to melt the columns 1 and 2 for all a and b features into rows as follows:
id  a_mean  a_var  b_mean  b_var
 1    val1   val1    val1   val1
 1    val2   val2    val2   val2
I am unsure how to achieve this with the melt function in pandas, where I would basically need an expression that keeps the base name (e.g. a_mean) as the root column and melts everything with a suffix for that variable into rows.
Is there another method I could use to specify these rules?
Thank you
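For reference, pandas also ships pd.wide_to_long for exactly this wide-to-long reshape; below is a minimal sketch, assuming the columns follow the id, a_mean_val_1, ..., b_var_val_2 layout above (the file name mycsv.csv is a placeholder):

import pandas as pd

df = pd.read_csv('mycsv.csv')

long_df = (
    pd.wide_to_long(df,
                    stubnames=['a_mean_val', 'a_var_val', 'b_mean_val', 'b_var_val'],
                    i='id', j='n', sep='_', suffix=r'\d+')
      .rename(columns=lambda c: c.replace('_val', ''))  # a_mean_val -> a_mean
      .reset_index()
)
print(long_df)  # one row per (id, n) pair with columns a_mean, a_var, b_mean, b_var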
Like this:
import pandas

# assumes each line holds just the 8 value fields (no id column; see below)
rows = []
for line in open('mycsv.csv'):
    fields = line.strip().split(',')
    rows.append(fields[0::2])  # the *_val_1 fields
    rows.append(fields[1::2])  # the *_val_2 fields

df = pandas.DataFrame(rows, columns=['a_mean', 'a_var', 'b_mean', 'b_var'])
That doesn't provide an ID number. Is the ID part of the CSV file?
I went through the columns and, if a column was part of a base column, appended its values to a list. Finally, I converted those lists to a dataframe.
So this code will work regardless of the order of the columns.
[UPDATED WITH ID]
Since we're adding the entire columns one after the other, the ids will always start from the top, go to the end, and then repeat. So we can take "id" of the original df and multiply that by the number of rows to get the "id" for the new df.
Here's the CSV I used:
id,a_mean_val_1,a_mean_val_2,a_var_val_1,a_var_val_2,b_mean_val_1,b_mean_val_2,b_var_val_1,b_var_val_2
1,a_mean_val_1, a_mean_val_2, a_var_val_1, a_var_val_2, b_mean_val_1 ,b_mean_val_2, b_var_val_1, b_var_val_2
2,a_mean_val_5, a_mean_val_6, a_var_val_5, a_var_val_6, b_mean_val_5 ,b_mean_val_6, b_var_val_5, b_var_val_6
df = pd.read_csv('data_csv.csv')

# Ignore ID
columns = df.columns.tolist()[1:]

df_dict = {}
base = ['a_mean', 'a_var', 'b_mean', 'b_var']
for bas in base:
    df_dict[bas] = []
    for col in columns:
        # for example, "a_mean" in "a_mean_val_1", then append
        if(bas in col):
            df_dict[bas] = df_dict[bas] + df[col].tolist()

ids = df['id'].tolist()
df_new = pd.DataFrame(df_dict)
df_new['id'] = ids*df.shape[0]
      a_mean        a_var       b_mean        b_var  id
a_mean_val_1  a_var_val_1  b_mean_val_1  b_var_val_1   1
a_mean_val_5  a_var_val_5  b_mean_val_5  b_var_val_5   2
a_mean_val_2  a_var_val_2  b_mean_val_2  b_var_val_2   1
a_mean_val_6  a_var_val_6  b_mean_val_6  b_var_val_6   2
I have a file full of URL paths like below spanning across 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string, which I defined as "string1", and I would like to loop through all 4 columns in the dataframe, defined as "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd

df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
    # Objective: replace "&FolderCTID" and delete all of the string after it
    string1 = "&FolderCTID"
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I searched and googled and found similar solutions, but none of them work.
Can any guru shed some light on this? Any assistance is appreciated.
Added below are a few example rows in columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it:
First, declare a variable with your target columns.
Then use stack() and str.split to get your target output.
Finally, unstack and reapply the output to your original df.
cols_to_slice = ['ColumnA', 'ColumnB', 'ColumnC', 'ColumnD']
string1 = "&FolderCTID"

# keep the part before string1 (index 0 of the split)
df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
If you want to replace these columns in your target df, then simply do:
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
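If the stack/unstack round trip is hard to follow, a regex replace over the same columns is an alternative sketch with the same effect; re.escape handles the & in the pattern:

import re

pattern = re.escape(string1) + '.*'  # matches string1 and everything after it
df[cols_to_slice] = df[cols_to_slice].replace(pattern, '', regex=True)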
You should first get the index of the string using:
indexes = len(string1) + df_MasterData[i].str.find(string1)
# this points just past the end of string1 in each value;
# if you don't want string1 itself in the result, use this instead:
indexes = df_MasterData[i].str.find(string1)
Now slice each value up to its index (.str slicing does not accept a per-row Series, so use a comprehension):
df_MasterData[i] = [s[:idx] for s, idx in zip(df_MasterData[i], indexes)]
I'm trying to add rows and columns to pandas incrementally. I have a lot of data stored across multiple datastores and a heuristic to determine a value. As I navigate across this datastore, I'd like to be able to incrementally update a dataframe, where in some cases, either names or days will be missing.
def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            df = df.append({col: value, 'name': name}, ignore_index=True)
    df.set_index('name', inplace=True, drop=True)
    print(df.loc['Bill'])
This produces the following results:
      2016_1  2016_2  2016_3
name
Bill    15.0     NaN     NaN
Bill     NaN    12.0     NaN
I've created a heatmap of the data and it's blocky due to duplicate names, so the output I'm looking for is:
      2016_1  2016_2  2016_3
name
Bill    15.0    12.0     NaN
How can I combine these rows?
Is there a more efficient means of creating this dataframe?
Try this:
df.groupby('name')[df.columns.values].sum()
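One caveat with sum() here: a group whose column is all NaN sums to 0 instead of staying NaN. If you want the NaN preserved, as in the desired output above, pass min_count=1:

df.groupby('name').sum(min_count=1)  # keeps NaN where a name has no value for a day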
Try this:
df.pivot_table(index='name', aggfunc='sum', dropna=False)
After you run your foo() function, you can use any aggregation function (if you have only one value per column and all the others are null) and groupby on df.
First, use reset_index to get back your name column.
Then use groupby and apply. Here I propose a custom function which checks that there is only one value per column, and raises a ValueError if not.
df.reset_index(inplace=True)

def aggdata(x):
    if all([i <= 1 for i in x.count()]):
        return x.mean()
    else:
        raise ValueError

ddf = df.groupby('name').apply(aggdata)
If all the values of the column are null but one, x.mean() will return that value (actually, you can use almost any aggregator; since there is only one value, that is the one returned).
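Under the same one-value-per-cell assumption, a shorter route is groupby('name').first(), which takes the first non-null entry per column and collapses the duplicate rows the same way (just without the explicit ValueError check):

ddf = df.groupby('name').first()  # first non-null value per column for each name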
It would be easier to have the name as column and date as index instead. Plus, you can work within the loop with lists and afterwards create the pd.DataFrame.
e.g.
year = 2016
names = ['Bill', 'Bob', 'Ryan']
index = []
valueBill = []
valueBob = []
valueRyan = []

for day in range(1, 4):
    if random.choice([True, False]):  # sometimes a name will be missing
        valueBill.append(random.randrange(0, 20))
        valueBob.append(random.randrange(0, 90))
        valueRyan.append(random.randrange(0, 200))
        index.append('{}-0{}'.format(year, day))  # column name
    else:
        valueBill.append(np.nan)
        valueBob.append(np.nan)
        valueRyan.append(np.nan)
        index.append(np.nan)

df = pd.DataFrame({})
for name, value in zip(names, [valueBill, valueBob, valueRyan]):
    df[name] = value
df = df.set_index(pd.to_datetime(index))  # set_index returns a new frame, so assign it back
You can append the entries with new names if they do not already exist, and then do an update to update the existing entries.
import pandas as pd
import random
def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            new_df = pd.DataFrame({col: value, 'name': name}, index=[1]).set_index('name')
            df = pd.concat([df, new_df[~new_df.index.isin(df.index)].dropna()])
            df.update(new_df)
    # df.set_index('name', inplace=True, drop=True)
    print(df)
I need to give names to previously defined dataframes.
I have a list of dataframes:
liste_verif = ( dffreesurfer,total,qcschizo)
And I would like to give them a name by doing something like:
for h in liste_verif:
    h.name = str(h)
Would that be possible?
When I test this code, it doesn't work: instead of considering h as a DataFrame, Python considers each column of my DataFrame.
I would like the names of my dataframes to be 'dffreesurfer', 'total', etc...
You can use a dict comprehension and map the DataFrames to the names in list L:
dffreesurfer = pd.DataFrame({'col1': [7,8]})
total = pd.DataFrame({'col2': [1,5]})
qcschizo = pd.DataFrame({'col2': [8,9]})
liste_verif = (dffreesurfer,total,qcschizo)
L = ['dffreesurfer','total','qcschizo']
dfs = {L[i]:x for i,x in enumerate(liste_verif)}
print (dfs['dffreesurfer'])
   col1
0     7
1     8
print (dfs['total'])
   col2
0     1
1     5
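A dict built this way also gives you what the .name attribute was presumably for: iterating over name/DataFrame pairs. A small usage sketch:

for name, frame in dfs.items():
    print(name, frame.shape)  # e.g. dffreesurfer (2, 1)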