Deleting rows with certain keywords of csv file - python

I have a large data file and I need to delete rows that have certain keywords.
Here is an example of the file I'm using:
User Name DN
MB31212 CN=MB31212,CN=Users,DC=prod,DC=trovp,DC=net
MB23423 CN=MB23423 ,OU=Generic Mailbox,DC=prod,DC=trovp,DC=net
MB23424 CN=MB23424 ,CN=Users,DC=prod,DC=trovp,DC=net
MB23423 CN=MB23423,OU=DNA,DC=prod,DC=trovp,DC=net
MB23234 CN=MB23234 ,OU=DNA,DC=prod,DC=trovp,DC=net
This is how I import the file:
import pandas as pd
df = pd.read_csv('sample.csv', sep=',', encoding='latin1')
How can I:
1. Delete all rows that contain 'OU=DNA' in the DN column, for example?
2. Delete the first attribute 'CN=x' in the DN column without deleting the rest of the data in the column?
I would like to get something like what is posted below, with the 2 rows that contained 'OU=DNA' deleted and the 'CN=x' deleted from every row:
User Name DN
MB31212 CN=Users,DC=prod,DC=trovp,DC=net
MB23423 OU=Generic Mailbox,DC=prod,DC=trovp,DC=net
MB23424 CN=Users,DC=prod,DC=trovp,DC=net

You can do this in two steps: use the str.contains method to filter out rows containing OU=DNA, then use the str.replace method with a regular expression to trim the leading CN=x:
newDf = df.loc[~df.DN.str.contains("OU=DNA")].copy()  # .copy() avoids SettingWithCopyWarning on the next line
newDf.DN = newDf.DN.str.replace("^CN=[^,]*,", "", regex=True)  # regex=True is required in newer pandas
newDf
UserName DN
0 MB31212 CN=Users,DC=prod,DC=trovp,DC=net
1 MB23423 OU=Generic Mailbox,DC=prod,DC=trovp,DC=net
2 MB23424 CN=Users,DC=prod,DC=trovp,DC=net
A quick breakdown of the regular expression: ^ anchors the match at the beginning of the string, CN= matches literally, and [^,]*, matches everything up to and including the first comma.
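If it helps to see the whole thing end to end, here is a minimal, self-contained sketch of the same two-step approach using an in-memory DataFrame built from the sample rows above (the column names 'User Name' and 'DN' are assumed from the sample):
import pandas as pd

# Build a small DataFrame from the sample rows (assumed column names)
df = pd.DataFrame({
    "User Name": ["MB31212", "MB23423", "MB23424", "MB23423", "MB23234"],
    "DN": [
        "CN=MB31212,CN=Users,DC=prod,DC=trovp,DC=net",
        "CN=MB23423 ,OU=Generic Mailbox,DC=prod,DC=trovp,DC=net",
        "CN=MB23424 ,CN=Users,DC=prod,DC=trovp,DC=net",
        "CN=MB23423,OU=DNA,DC=prod,DC=trovp,DC=net",
        "CN=MB23234 ,OU=DNA,DC=prod,DC=trovp,DC=net",
    ],
})

# Step 1: drop rows whose DN contains 'OU=DNA'
out = df.loc[~df["DN"].str.contains("OU=DNA")].copy()

# Step 2: strip the leading 'CN=...,' attribute from DN
out["DN"] = out["DN"].str.replace(r"^CN=[^,]*,", "", regex=True)
print(out)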

To read the sample file you gave I used:
df = pd.read_csv('sample.csv', sep=' ', encoding='latin1', engine="python")
and then:
df = df.drop(df[df.DN.str.contains("OU=DNA")].index)
df.DN = df.DN.str.replace(r'(CN=MB[0-9]{5}\s*,)', '', regex=True)  # raw string and regex=True keep newer pandas/Python happy
df
gave the desired result:
User Name DN
0 MB31212 CN=Users,DC=prod,DC=trovp,DC=net
1 MB23423 OU=Generic Mailbox,DC=prod,DC=trovp,DC=net
2 MB23424 CN=Users,DC=prod,DC=trovp,DC=net

Related

How to find those rows which don't exist in another CSV file using python 3.7.5

I have a file ua.csv which has 2 rows and another file pr.csv which has 4 rows. I would like to know which rows are present in pr.csv but not in ua.csv. I also need the count of the extra rows in pr.csv in the output.
ua.csv
Name|Address|City|Country|Pincode
Jim Smith|123 Any Street|Boston|US|02134
Jane Lee|248 Another St.|Boston|US|02130
pr.csv
Name|Address|City|Country|Pincode
Jim Smith|123 Any Street|Boston|US|02134
Smoet|coffee shop|finland|Europe|3453335
Jane Lee|248 Another St.|Boston|US|02130
Jack|long street|malasiya|Asia|585858
Below is the expected output:
pr.csv has 2 rows extra
Name|Address|City|Country|Pincode
Smoet|coffee shop|finland|Europe|3453335
Jack|long street|malasiya|Asia|585858
I guess you could use the set data structure:
ua_set = set()
pr_set = set()

# Populate the sets from the csv files (each row is stored as its raw stripped line)
with open("ua.csv") as f:
    next(f)  # skip the header row
    for line in f:
        ua_set.add(line.strip())
with open("pr.csv") as f:
    next(f)  # skip the header row
    for line in f:
        pr_set.add(line.strip())

# Find the rows that are in pr.csv but not in ua.csv
diff = pr_set.difference(ua_set)
print(f"pr.csv has {len(diff)} rows extra")

# It would be better not to hardcode the column names in the output,
# but getting that info depends on the package you use to read the csv files
print("Name|Address|City|Country|Pincode")
for row in diff:
    print(row)
A better solution uses the pandas module:
import pandas as pd

df_ua = pd.read_csv("ua.csv", sep="|")  # adjust the path to ua.csv if needed
df_pr = pd.read_csv("pr.csv", sep="|")  # adjust the path to pr.csv if needed

# Keep only the rows of pr.csv that have no match in ua.csv
df_diff = (
    df_pr.merge(df_ua, how="outer", indicator=True)
         .loc[lambda x: x["_merge"] == "left_only"]
         .drop("_merge", axis=1)
)
print(f"pr.csv has {len(df_diff)} rows extra")
print(df_diff)
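For context, indicator=True adds a _merge column that records whether each row came from the left frame only, the right frame only, or both; filtering on "left_only" keeps exactly the rows unique to pr.csv. A tiny sketch on in-memory frames (hypothetical data, just to show the mechanics):
import pandas as pd

left = pd.DataFrame({"Name": ["Jim", "Smoet", "Jack"]})
right = pd.DataFrame({"Name": ["Jim"]})

merged = left.merge(right, how="outer", indicator=True)
print(merged)
#     Name     _merge
# 0    Jim       both
# 1  Smoet  left_only
# 2   Jack  left_only

only_left = merged.loc[merged["_merge"] == "left_only"].drop("_merge", axis=1)
print(only_left)  # rows present only in the left frame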
import csv

# Plain csv-module approach: remember every row of ua.csv, then keep the pr.csv rows not seen
ua_dic = {}
with open('ua.csv') as ua:
    data = csv.reader(ua, delimiter='|')
    for i in data:
        if str(i) not in ua_dic:
            ua_dic[str(i)] = 1

output = []
with open('pr.csv') as pr:
    data = csv.reader(pr, delimiter='|')
    for j in data:
        if str(j) not in ua_dic:
            output.append(j)
print(output)

Python remove everything after specific string and loop through all rows in multiple columns in a dataframe

I have a file full of URL paths like below spanning across 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string, which I defined as "string1", and I would like to loop through all 4 columns in the dataframe, defined as "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
    # Objective: Replace "&FolderCTID", delete all string after
    string1 = "&FolderCTID"
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I searched on Google and found similar solutions, but none of them work.
Can any guru shed some light on this? Any assistance is appreciated.
Added below are a few example rows in columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it:
first, declare a variable with your target columns.
Then use stack() and str.split to get your target output (keeping the part before string1).
Finally, unstack and reapply the output to your original df.
cols_to_slice = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
string1 = "&FolderCTID"
df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
If you want to replace these columns in your target df then simply do -
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
You could first get the index of the string using
indexes = df_MasterData[i].str.find(string1)
# str.find returns the position of string1 in each row (-1 if it is absent);
# add len(string1) to the positions if you want to keep string1 itself in the result
Since the cut position differs per row, .str[:indexes] will not accept a Series of indexes; slice each value row-wise instead:
df_MasterData[i] = [s[:idx] if idx != -1 else s for s, idx in zip(df_MasterData[i], indexes)]
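For a quick sanity check, here is a minimal, self-contained sketch of the vectorized split approach (Method 1 above) on a small hypothetical Series; everything from "&FolderCTID" onwards is dropped:
import pandas as pd

string1 = "&FolderCTID"
# Hypothetical shortened URLs, just to illustrate the behaviour
urls = pd.Series([
    "https://example.com/a?RootFolder=x&FolderCTID=0x0120&View=1",
    "https://example.com/b?RootFolder=y&FolderCTID=0x0121&View=2",
])
print(urls.str.split(string1).str[0])
# 0    https://example.com/a?RootFolder=x
# 1    https://example.com/b?RootFolder=y
# dtype: object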

change string in a big list with another string

I have a list called t_pre_eks_tfberita.
In the row whose first element is "Label", I want to replace strings that contain "BUKAN HOAX (1)" with "BUKAN HOAX" and strings that contain "HOAX (1)" with "HOAX".
But I get an error when I use this code.
for i in range(len(t_pre_eks_tfberita)):
    if t_pre_eks_tfberita[i][0] == "Label":
        j = 1
        while j in range(len(t_pre_eks_tfberita[i])):
            cek = re.search("BUKAN", t_pre_eks_tfberita[i][j])
            if cek:
                t_pre_eks_tfberita[i][j] = "BUKANHOAX"
            else:
                t_pre_eks_tfberita[i][j] = "HOAX"
            j += 1

dfr_eks_tfberita = pd.DataFrame(list(map(list, zip(*t_pre_eks_tfberita))))
new_header = dfr_eks_tfberita.iloc[0]  # grab the first row for the header
dfr_eks_tfberita = dfr_eks_tfberita[1:]  # take the data less the header row
dfr_eks_tfberita.columns = new_header
for i in range(len(new_header)):
    if new_header[i] != 'Label' and new_header[i] != 'Isi_Dokumen':
        dfr_eks_tfberita[new_header[i]] = dfr_eks_tfberita[new_header[i]].astype('int')
dfr_eks_tfberita
When I run it, I get an error.
Any solution for this problem?
Using re is overkill here.
You just need to traverse the values and check whether they contain "BUKAN HOAX (1)" or "HOAX (1)":
if "HOAX (1)" in t_pre_eks_tfberita[i][j]:
    dosomething()
But you can actually do it inside the DataFrame using pandas' own functions, such as iterrows().
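A minimal sketch of that iterrows idea, assuming the DataFrame from the question has been built and has a Label column (as the header row suggests); note that "BUKAN HOAX" has to be checked first, because it also contains "HOAX":
for idx, row in dfr_eks_tfberita.iterrows():
    label = row['Label']
    if "BUKAN HOAX" in label:
        dfr_eks_tfberita.at[idx, 'Label'] = "BUKAN HOAX"
    elif "HOAX" in label:
        dfr_eks_tfberita.at[idx, 'Label'] = "HOAX"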
IIUC, try pandas.Series.str.replace with strip:
import pandas as pd
s = pd.Series(['HOAX', 'HOAX (1)', 'BUKAN HOAX', 'BUKAN HOAX (1000)'])
# Sample input
new_s = s.str.replace(r'\(\d+\)', '', regex=True).str.strip()  # regex=True is required in newer pandas
print(new_s)
Output:
0 HOAX
1 HOAX
2 BUKAN HOAX
3 BUKAN HOAX
dtype: object
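If the goal is to clean the DataFrame built in the question, the same idea can be applied there (assuming the column really is named Label, as in the header row):
dfr_eks_tfberita['Label'] = dfr_eks_tfberita['Label'].str.replace(r'\(\d+\)', '', regex=True).str.strip()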

Python pandas setting NaN values fails

I am trying to clean a dataset in pandas. The information is stored in a csv file and is imported using:
tester = pd.read_csv('date.csv')
Every column contains a '?' where the value is missing. For example, there is an age column that contains 9 question marks (?).
I am trying to set all the question marks to NaN. I have tried:
tester = pd.read_csv('date.csv', na_values=["?"])
tester['age'].replace("?", np.NaN)
tester.replace('?', np.NaN)
for col in tester:
    print tester[col].value_counts(dropna=False)
This still returns 0 for age when I know there are 9 question marks. I assume the check is failing because the value is never seen as ?.
I have looked at the csv file in Notepad and there is no space etc. around the character.
Is there any way of forcing this so that it is recognised?
sample data:
read_csv has a na_values parameter; see the pandas documentation.
df = pd.read_csv('date.csv', na_values='?')
You are very close:
# It looks like the file has spaces after the comma, so use `sep`
tester = pd.read_csv('date.csv', sep=', ', engine='python')
tester['age'] = tester['age'].replace('?', np.nan)
There seems to be a problem with the data somewhere, so for debugging:
pd.read_csv('file', error_bad_lines=False)  # in newer pandas versions use on_bad_lines='skip' instead
tester = tester [~(tester == '?').any(axis=1)]
OR
pd.read_csv('file', sep='delimiter', header=None)
OR
pd.read_csv('file',header=None,sep=', ')
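A minimal, self-contained sketch of the na_values fix, assuming the file really is comma-plus-space separated as in the answer above (io.StringIO stands in for the csv file):
import io
import pandas as pd

# Stand-in for date.csv: comma-plus-space separated, '?' marks missing values
csv_text = "age, name\n25, Alice\n?, Bob\n31, Carol\n"

tester = pd.read_csv(io.StringIO(csv_text), sep=', ', engine='python', na_values='?')
print(tester['age'].value_counts(dropna=False))
# NaN is now counted instead of the literal '?'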

count of one specific item of panda dataframe

I have used the following lines to get the count of "READ"s from a specific column (containing READ, WRITE, NOP) of a file, which is not a csv file but a .out file with \t as the delimiter.
data = pd.read_csv('xaa',usecols=[1], header=None,delimiter='\t')
df2=df1.iloc[start:end,]
count=df2.str.count("R").sum()
I am getting error
AttributeError:
'DataFrame' object has no attribute 'str'
But when I use
if filename.endswith(".csv"):
    data = pd.read_csv(filename)
    df1 = data.loc[:, "operation"]
    df2 = df1.iloc[start:end, ]
    count = df2.str.count("R").sum()
there is no error. But here I have to edit each csv file: I have to open the file and insert "operation" as the header of the column I need. Kindly give a solution.
I believe you need to select column 1 to get a Series, otherwise you get a one-column DataFrame:
count=df2[1].str.count("R").sum()
Or compare with eq and sum the Trues:
count=df2[1].eq("R").sum()
EDIT:
Another solution is to return a Series from read_csv with the squeeze parameter:
s = pd.read_csv('xaa',usecols=[1], header=None,delimiter='\t', squeeze=True)
count=s.iloc[start:end].str.count("R").sum()
#for another solution
#count=s.iloc[start:end].eq("R").sum()
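Note that the squeeze parameter of read_csv has been removed in recent pandas versions; if that applies to you, an equivalent is to squeeze the one-column DataFrame after reading it:
# same idea without the deprecated squeeze parameter
s = pd.read_csv('xaa', usecols=[1], header=None, delimiter='\t').squeeze("columns")
count = s.iloc[start:end].str.count("R").sum()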
Sample:
df2 = pd.DataFrame({1:['R','RR','Q']})
print (df2)
1
0 R
1 RR
2 Q
#count all substrings
count=df2[1].str.count("R").sum()
print (count)
3
#count only strings
count=df2[1].eq("R").sum()
print (count)
1
Just add 0 to the df2 assignment:
data = pd.read_csv('xaa',usecols=[1], header=None,delimiter='\t')
df2=df1.iloc[start:end, 0]
count=df2.str.count("R").sum()
And I think it should be:
df2 = data.iloc[start:end, 0]
But maybe you have some other steps that create df1.
