Count of one specific item in a pandas DataFrame - Python

I have used the following lines to get the count of the number of "READ"s from a specific column (containing READ, WRITE, NOP) of a file. It is not a CSV file but a .out file with \t as the delimiter.
data = pd.read_csv('xaa', usecols=[1], header=None, delimiter='\t')
df2 = df1.iloc[start:end,]
count = df2.str.count("R").sum()
I am getting this error:
AttributeError: 'DataFrame' object has no attribute 'str'
But when I use
if filename.endswith(".csv"):
    data = pd.read_csv(filename)
df1 = data.loc[:, "operation"]
df2 = df1.iloc[start:end,]
count = df2.str.count("R").sum()
there is no error. But here I have to go into each CSV file: I have to open the file and insert "operation" as the header of the column I need. Kindly suggest a solution.

I believe you need to select column 1 to get a Series; otherwise you get a one-column DataFrame:
count = df2[1].str.count("R").sum()
Or compare with eq and sum the Trues:
count = df2[1].eq("R").sum()
EDIT:
Another solution is to return a Series directly from read_csv with the squeeze parameter (note that squeeze was deprecated in pandas 1.4 and removed in 2.0, where calling .squeeze("columns") on the one-column DataFrame does the same):
s = pd.read_csv('xaa', usecols=[1], header=None, delimiter='\t', squeeze=True)
count = s.iloc[start:end].str.count("R").sum()
#for another solution
#count = s.iloc[start:end].eq("R").sum()
Sample:
df2 = pd.DataFrame({1: ['R', 'RR', 'Q']})
print (df2)
    1
0   R
1  RR
2   Q

#count all substrings
count = df2[1].str.count("R").sum()
print (count)
3

#count only whole strings
count = df2[1].eq("R").sum()
print (count)
1

Just add 0 to the df2 assignment to select the first column as a Series:
data = pd.read_csv('xaa', usecols=[1], header=None, delimiter='\t')
df2 = df1.iloc[start:end, 0]
count = df2.str.count("R").sum()
And I think it should be:
df2 = data.iloc[start:end, 0]
But maybe you have some other steps that create df1.
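Putting it together, a minimal end-to-end sketch (assuming the file 'xaa' and hypothetical start/end bounds from the question; note that "WRITE" also contains an "R", so comparing against the full string "READ" is safer than counting "R" substrings):
import pandas as pd

# read only column 1 of the tab-delimited file; the .out extension does not
# matter, read_csv only cares about the delimiter
data = pd.read_csv('xaa', usecols=[1], header=None, delimiter='\t')

start, end = 0, 100  # hypothetical slice bounds

# iloc[start:end, 0] selects the first (and only) column as a Series,
# so .str and .eq are available
ops = data.iloc[start:end, 0]

count = ops.eq("READ").sum()  # counts rows whose value is exactly "READ"
print(count)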

Related

Python remove everything after specific string and loop through all rows in multiple columns in a dataframe

I have a file full of URL paths like below spanning across 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string, which I defined as "string1", and I would like to loop through all 4 columns in the dataframe, defined as "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
    # Objective: find "&FolderCTID" and delete everything after it
    string1 = "&FolderCTID"
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I searched Google and found similar solutions, but none of them worked.
Can any guru shed some light on this? Any assistance is appreciated.
Added below are a few example rows in columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it:
first, declare a variable with your target columns.
Then use stack() and str.split to get your target output.
Finally, unstack and reassign the output to your original df.
cols_to_slice = ['ColumnA', 'ColumnB', 'ColumnC', 'ColumnD']
string1 = "&FolderCTID"
df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
(take column 0 of the split, the part before string1, since the goal is to drop everything after it)
If you want to replace these columns in your target df, then simply do
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
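A quick sanity check of this pattern on a hypothetical two-column frame (column 0 of the expand=True split is the text before string1, which is what the question wants to keep):
import pandas as pd

string1 = "&FolderCTID"
df = pd.DataFrame({'ColumnA': ['u1' + string1 + 'x', 'u2' + string1 + 'y'],
                   'ColumnB': ['v1' + string1 + 'x', 'v2' + string1 + 'y']})

# stack to one column, split once, keep the left part, unstack back
out = df[['ColumnA', 'ColumnB']].stack().str.split(string1, expand=True)[0].unstack(1)
print(out)
#   ColumnA ColumnB
# 0      u1      v1
# 1      u2      v2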
You should first get the index of the string using
indexes = len(string1) + df_MasterData[i].str.find(string1)
# This selects the position just past the end of string1;
# if you don't want to keep string1 itself in the result, drop the len() term:
indexes = df_MasterData[i].str.find(string1)
Now slice each value up to its index (note that .str[:n] does not accept a Series of stop positions, so slice row-wise):
df_MasterData[i] = [s[:n] for s, n in zip(df_MasterData[i], indexes)]
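For reference, a minimal sketch of the whole loop on a made-up stand-in for df_MasterData, using Method 1 from the question (which does work once the columns contain plain strings):
import pandas as pd

string1 = "&FolderCTID"
df_MasterData = pd.DataFrame({
    'Column_A': ['https://x/a?RootFolder=r' + string1 + '=0x01&View=v'],
    'Column_B': ['https://x/b?RootFolder=s' + string1 + '=0x02&View=w'],
})

for i in ['Column_A', 'Column_B']:
    # keep everything before string1; n=1 stops after the first match
    df_MasterData[i] = df_MasterData[i].str.split(string1, n=1).str[0]

print(df_MasterData)
#                   Column_A                  Column_B
# 0  https://x/a?RootFolder=r  https://x/b?RootFolder=s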

Python - keep rows in dataframe based on partial string match

I have 2 dataframes:
df1 is a list of mailboxes and email ids
df2 shows a list of approved domains
I read both dataframes from an Excel sheet:
xls = pd.ExcelFile(input_file_shared_mailbox)
df = pd.read_excel(xls, sheet_name = sheet_name_shared_mailbox)
I want to keep only the records in df1 where df1['Email_Id'] contains one of df2['approved_domain']:
print(df1)
    Mailbox       Email_Id
0  mailbox1  abc@gmail.com
1  mailbox2  def@yahoo.com
2  mailbox3    ghi@msn.com
print(df2)
  approved_domain
0         msn.com
1       gmail.com
and I want df3, which basically shows
print(df3)
    Mailbox       Email_Id
0  mailbox1  abc@gmail.com
1  mailbox3    ghi@msn.com
This is the code I have right now, which I think is close, but I can't figure out the exact problem in the syntax:
df3 = df1[df1['Email_Id'].apply(lambda x: [item for item in x if item in df2['Approved_Domains'].tolist()])]
But I get this error:
TypeError: unhashable type: 'list'
I spent a lot of time researching the forum for a solution but could not find what I was looking for. I appreciate all the help.
So these are the steps you will need to follow to do what you want done for your two data frames:
1. Split your email_address column into two separate columns:
df1['add'], df1['domain'] = df1['email_address'].str.split('@', 1).str
2. Then drop your add column to keep your data frame clean:
df1 = df1.drop('add', axis=1)
3. Get a new data frame with only the rows you want by keeping only the values in the 'domain' column that match the 'approved_domain' column:
df_new = df1[df1['domain'].isin(df2['approved_domain'])]
4. Drop the 'domain' column in df_new:
df_new = df_new.drop('domain', axis=1)
This is what the result will be:
    mailbox  email_address
0  mailbox1  abc@gmail.com
2  mailbox3    ghi@msn.com
You can use a dynamically created regular expression to search for the valid domains in the list and keep only the rows that match.
Here is the code for our reference.
# -*- coding: utf-8 -*-
import pandas as pd
import re

mailbox_list = [
    ['mailbox1', 'abc@gmail.com'],
    ['mailbox2', 'def@yahoo.com'],
    ['mailbox3', 'ghi@msn.com']]
valid_domains = ['msn.com', 'gmail.com']

df1 = pd.DataFrame(mailbox_list, columns=['Mailbox', 'EmailID'])
df2 = pd.DataFrame(valid_domains)

valid_list = []
for index, row in df1.iterrows():
    for idx, record in df2.iterrows():
        # re.escape makes the dot in the domain match literally
        if re.search(rf"@{re.escape(record[0])}$", row[1], re.IGNORECASE):
            valid_list.append([row[0], row[1]])

df3 = pd.DataFrame(valid_list, columns=['Mailbox', 'EmailID'])
print(df3)
The output of this is:
    Mailbox        EmailID
0  mailbox1  abc@gmail.com
1  mailbox3    ghi@msn.com
Solution
df1 = {'MailBox': ['mailbox1', 'mailbox2', 'mailbox3'], 'Email_Id': ['abc@gmail.com', 'def@yahoo.com', 'ghi@msn.com']}
df2 = {'approved_domain': ['msn.com', 'gmail.com']}

mailboxes, emails = zip(  # unzip the columns
    *filter(  # filter
        lambda i: any([  # i = ('mailbox1', 'abc@gmail.com')
            approved_domain in i[1] for approved_domain in df2['approved_domain']
        ]),
        zip(df1['MailBox'], df1['Email_Id'])  # zip the columns
    )
)

df3 = {
    'MailBox': mailboxes,
    'Email_Id': emails
}
print(df3)
Output:
> {'MailBox': ('mailbox1', 'mailbox3'), 'Email_Id': ('abc@gmail.com', 'ghi@msn.com')}
Some notes:
A big chunk of this code is basically just parsing the data structure. The zipping and unzipping are only there to convert the list of columns to a list of rows and back. If you already have a list of rows, you only have to do the filtering part.
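A vectorized pandas alternative (a sketch, using the column names from the question): pull the domain out of each address once with str.split and test membership with isin, which avoids both the list-returning apply and the row-wise loops:
import pandas as pd

df1 = pd.DataFrame({'Mailbox': ['mailbox1', 'mailbox2', 'mailbox3'],
                    'Email_Id': ['abc@gmail.com', 'def@yahoo.com', 'ghi@msn.com']})
df2 = pd.DataFrame({'approved_domain': ['msn.com', 'gmail.com']})

# the domain is everything after the first '@'
domains = df1['Email_Id'].str.split('@', n=1).str[1]
df3 = df1[domains.isin(df2['approved_domain'])]
print(df3)
#     Mailbox       Email_Id
# 0  mailbox1  abc@gmail.com
# 2  mailbox3    ghi@msn.com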

Eliminate duplicates for a column value in a DataFrame - the column holds multiple URLs

So I have a column called "URL" in my DataFrame Pd1:
URL
row 1 : url1,url1,url2
row 2 : url2,url2,url3
Desired output:
URL
row 1 : url1,url2
row 2 : url2,url3
I assume that your column contains only the URL list.
One possible solution is to apply a function to the URL column, containing the following steps:
split the source string on each comma (the result is a list of fragments),
create a set from this list (thus eliminating repetitions),
join the keys from this set using a comma,
save the result back into the source column.
Something like:
df.URL = df.URL.apply(lambda x: ','.join(set(re.split(',', x))))
As this code uses the re module, you have to import re first.
split and apply set:
d = {"url": ["url1,url1,url2",
             "url2,url2,url3"]}
df = pd.DataFrame(d)
df.url.str.split(",").apply(set)
If the row label is part of the value ("row 1 : url1,url1,url2"), split it off first and strip the stray spaces:
df['URL'] = df.URL.str.split(':').apply(lambda x: [x[0].strip(), ','.join(sorted(set(x[1].strip().split(','))))]).apply(' : '.join)
                 URL
0  row 1 : url1,url2
1  row 2 : url2,url3
If the data is just the URL lists:
              URL
0  url1,url1,url2
1  url2,url2,url3
then
df['URL'] = df.URL.str.split(',').apply(lambda x: ','.join(sorted(set(x))))
##print(df)
         URL
0  url1,url2
1  url2,url3
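Note that set() does not preserve the original order of the URLs (hence the sorted() above). A sketch of an order-preserving variant, using dict.fromkeys to keep each URL's first occurrence:
import pandas as pd

df = pd.DataFrame({"URL": ["url1,url1,url2", "url2,url2,url3"]})

# dict.fromkeys drops duplicates while keeping insertion order (Python 3.7+)
df['URL'] = df['URL'].str.split(',').apply(lambda x: ','.join(dict.fromkeys(x)))
print(df)
#          URL
# 0  url1,url2
# 1  url2,url3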

Text To Column Function

I am trying to write my own function in Python 3.5, but not having much luck.
I have a data frame that is 17 columns by 1,200 rows (tiny).
One of the columns is called "placement". Within this column, I have text contained in each row. The naming convention is as follows:
Campaign_Publisher_Site_AdType_AdSize_Device_Audience_Tactic_
The following code works perfectly and does exactly what I need it to do; I just don't want to do this for every data set I have:
df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
df_detailed = df.join(df_detailed)
new_columns = [...]  # then I rename the columns labelled 0, 1, 2, etc.
df_detailed.columns = new_columns
df_detailed.head()
What I'm trying to do is build a function that takes any column with _ as the delimiter and splits it across new columns.
I have tried the following (but unfortunately, defining my own functions is something I'm horrible at):
def text_to_column(df):
    df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
    headings = df_detailed.columns
    headings.replace(" ", "_")
    df_detailed = df.join(df_detailed)
    df_detailed.columns = headings
    return (df)
and I get the following error: "AttributeError: 'RangeIndex' object has no attribute 'replace'".
The end goal here is to write a function where I can pass the column name in, have it separate the values contained within the column into new columns, and then join these back to my original data frame.
If I'm being ridiculous, please let me know. If someone can help me, it would be greatly appreciated.
Thanks,
Adrian
You need the rename function to replace the column names:
headings = df_detailed.columns
headings.replace(" ", "_")
change to:
df_detailed = df_detailed.rename(columns=lambda x: x.replace(" ", "_"))
Or convert the columns to_series, because replace does not work with an Index (the column names):
headings.replace(" ", "_")
change to:
headings = headings.to_series().replace(" ", "_")
Also:
df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
can be changed to:
df_detailed = df['Placement'].str.rstrip('_').str.split('_', expand=True).astype(str)
EDIT:
Sample:
df = pd.DataFrame({'a': [1, 2], 'Placement': ['Campaign_Publisher_Site_AdType_AdSize_Device_Audience_Tactic_', 'a_b_c_d_f_g_h_i_']})
print (df)
                                           Placement  a
0  Campaign_Publisher_Site_AdType_AdSize_Device_A...  1
1                                   a_b_c_d_f_g_h_i_  2

#input is DataFrame and column name
def text_to_column(df, col):
    df_detailed = df[col].str.rstrip('_').str.split('_', expand=True).astype(str)
    #replace spaces in column names if necessary
    df_detailed.columns = df_detailed.columns.to_series().replace(" ", "_")
    #remove the source column and join the new df
    df_detailed = df.drop(col, axis=1).join(df_detailed)
    return df_detailed

df = text_to_column(df, 'Placement')
print (df)
   a         0          1     2       3       4       5         6       7
0  1  Campaign  Publisher  Site  AdType  AdSize  Device  Audience  Tactic
1  2         a          b     c       d       f       g         h       i
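If you also want meaningful names for the new columns in the same call, a small sketch of a variant (the names parameter is an assumption for illustration, not part of the answer above; assume df is the original sample frame, before the first text_to_column call):
#variant that optionally renames the split columns; names is hypothetical
def text_to_column(df, col, names=None):
    df_detailed = df[col].str.rstrip('_').str.split('_', expand=True).astype(str)
    if names is not None:
        #names is a list like ['Campaign', 'Publisher', ...], one per split part
        df_detailed.columns = names
    return df.drop(col, axis=1).join(df_detailed)

df = text_to_column(df, 'Placement',
                    names=['Campaign', 'Publisher', 'Site', 'AdType',
                           'AdSize', 'Device', 'Audience', 'Tactic'])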

Create unique ID from two existing columns, Python

My question is: how do I efficiently assign unique ID numbers built from existing ID columns? For example: I have two columns, [household_id] and [person_no]. I am trying to make a new column; the query would be: household_id + '_' + person_no.
Here is a sample:
hh_id   pno
682138    1
365348    1
365348    2
Trying to get:
unique_id
682138_1
365348_1
365348_2
and add this unique_id as a new column.
I am using Python. My data is very large. Any efficient way to do it would be great. Thanks!
You can use pandas.
Assuming your data is in a csv file, read in the data:
import pandas as pd
df = pd.read_csv('data.csv', delim_whitespace=True)
Create the new id column:
df['unique_id'] = df.hh_id.astype(str) + '_' + df.pno.astype(str)
Now df looks like this:
    hh_id  pno unique_id
0  682138    1  682138_1
1  365348    1  365348_1
2  365348    2  365348_2
Write back to a csv file:
df.to_csv('out.csv', index=False)
The file content looks like this:
hh_id,pno,unique_id
682138,1,682138_1
365348,1,365348_1
365348,2,365348_2
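An equivalent alternative (a sketch; Series.str.cat concatenates element-wise with a separator and saves one intermediate string Series compared to the + version):
# same result via str.cat instead of chained +
df['unique_id'] = df['hh_id'].astype(str).str.cat(df['pno'].astype(str), sep='_')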
