Wrong boolean result in the main program (Python)

I'm trying to write this simple code in Python: if the second element of a line of a CSV file contains one of the families specified in the malware_list list, the main program should print True. However, the program always prints False.
Each line in the file is in the form:
"NAME,FAMILY"
This is the code:
malware_list = ["FakeInstaller","DroidKungFu", "Plankton",
"Opfake", "GingerMaster", "BaseBridge",
"Iconosys", "Kmin", "FakeDoc", "Geinimi",
"Adrd", "DroidDream", "LinuxLotoor", "GoldDream"
"MobileTx", "FakeRun", "SendPay", "Gappusin",
"Imlog", "SMSreg"]
def is_malware(line):
    line_splitted = line.split(",")
    family = line_splitted[1]
    if family in malware_list:
        return True
    return False
def main():
    with open("datset_small.csv", "r") as f:
        for i in range(1, 100):
            line = f.readline()
            print(is_malware(line))

if __name__ == "__main__":
    main()

line = f.readline()
readline doesn't strip the trailing newline off of the result, so most likely line here looks something like "STEVE,FakeDoc\n". Then family becomes "FakeDoc\n", which is not a member of malware_list, so your function returns False.
Try stripping out the whitespace after reading:
line = f.readline().strip()
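Applied to the question's code, a minimal sketch of the corrected main() (also iterating over the file directly instead of calling readline() in a counted loop):

def main():
    with open("datset_small.csv", "r") as f:
        for line in f:
            # strip() removes the trailing "\n" before the membership test
            print(is_malware(line.strip()))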

Python has a package called pandas. Using pandas, we can read a CSV file into a DataFrame.
import pandas as pd
df = pd.read_csv("datset_small.csv")
Please post the contents of your CSV file so that I can help you further.

This can easily be achieved using a DataFrame. Example code is as follows:
import pandas as pd
malware_list = ["FakeInstaller","DroidKungFu", "Plankton",
"Opfake", "GingerMaster", "BaseBridge",
"Iconosys", "Kmin", "FakeDoc", "Geinimi",
"Adrd", "DroidDream", "LinuxLotoor", "GoldDream"
"MobileTx", "FakeRun", "SendPay", "Gappusin",
"Imlog", "SMSreg"]
# read csv into dataframe
df = pd.read_csv('datset_small.csv')
print(df['FAMILY'].isin(malware_list))
The output is:
0    True
1    True
2    True
The sample CSV used is:
NAME,FAMILY
090b5be26bcc4df6186124c2b47831eb96761fcf61282d63e13fa235a20c7539,Plankton
bedf51a5732d94c173bcd8ed918333954f5a78307c2a2f064b97b43278330f54,DroidKungFu
149bde78b32be3c4c25379dd6c3310ce08eaf58804067a9870cfe7b4f51e62fe,Plankton
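If you want the matching rows themselves rather than just the boolean mask, the mask can be used for boolean indexing, e.g.:

# keep only the rows whose FAMILY is a known malware family
matches = df[df['FAMILY'].isin(malware_list)]
print(matches)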

I would use a set instead of a list for speed, and pandas is definitely better here due to its speed and the simplicity of the code. You can use x in y membership logic to get the results ;)
import io #not needed in your case
import pandas as pd
data = io.StringIO('''090b5be26bcc4df6186124c2b47831eb96761fcf61282d63e13fa235a20c7539,Plankton
bedf51a5732d94c173bcd8ed918333954f5a78307c2a2f064b97b43278330f54,DroidKungFu
149bde78b32be3c4c25379dd6c3310ce08eaf58804067a9870cfe7b4f51e62fe,Plankton''')
df = pd.read_csv(data,sep=',',header=None)
malware_set = ("FakeInstaller","DroidKungFu", "Plankton",
"Opfake", "GingerMaster", "BaseBridge",
"Iconosys", "Kmin", "FakeDoc", "Geinimi",
"Adrd", "DroidDream", "LinuxLotoor", "GoldDream"
"MobileTx", "FakeRun", "SendPay", "Gappusin",
"Imlog", "SMSreg")
df.columns = ['id','software']
df['malware'] = df['software'].apply(lambda x: x.strip() in malware_set)
print(df)
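Since pandas is already loaded, the same column-wide test can also be vectorized with isin instead of apply; a small sketch, equivalent to the line above:

# vectorized alternative to the apply() above
df['malware'] = df['software'].str.strip().isin(malware_set)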

Related

Converting CSV to JSON and wrapping the JSON in another variable

I have CSV data that looks like the following:
test_subject  confidence_score
maths         0.41
english       0.51
I used pandas to create a json file using the following code.
tt1.to_json(orient = "records", lines = True)
The output of the above code is as follows:
{"test_subject":"maths","confidence_score":0.41}
{"test_subject":"english","confidence_score":0.51}
Now, I want to add "source" to all the rows, and possibly backslashes on all the variables, as follows:
{"source":"{\"test_subject\":maths,\"confidence_score\":0.41}}
{"source":"{\"test_subject\":english,\"confidence_score\":0.51}}
How can I achieve this?
Using regex (could read/write from/to a file etc. if required), try:
import re
data = '''
{"test_subject":"maths","confidence_score":0.41}
{"test_subject":"english","confidence_score":0.51}
{"test_subject":"maths","confidence_score":0.41}
{"test_subject":"english","confidence_score":0.51}
'''
data2 = ''
for line in data.splitlines():
    data2 = data2 + re.sub(r'{\"(.*?)\":\"(.*?)\",\"(.*?)\":(.*?)}', r'{"source":"{\\"\1\\":\2,\\"\3\\":\4}}\n', line)
print(data2)
{"source":"{\"test_subject\":maths,\"confidence_score\":0.41}}
{"source":"{\"test_subject\":english,\"confidence_score\":0.51}}
{"source":"{\"test_subject\":maths,\"confidence_score\":0.41}}
{"source":"{\"test_subject\":english,\"confidence_score\":0.51}}
For writing to a file (example):
f = open("myFile.txt", "a")
for line in data.splitlines():
f.writelines([re.sub(r'{\"(.*?)\":\"(.*?)\",\"(.*?)\":(.*?)}', r'{"source":"{\\"\1\\":\2,\\"\3\\":\4}}', line)])
f.close()
The regex used in re.sub is shown here: https://regex101.com/r/g64mzx/2. If there is more than 'test_subject', 'maths', 'confidence_score', and a float, the regex would need to be updated to match the new string.
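If strictly valid JSON is acceptable (note that the question's desired output is not itself valid JSON, since the inner values are unquoted), a sketch using the json module instead of regex, with the escaping handled automatically; the two-row frame mirrors the question's tt1:

import json
import pandas as pd

tt1 = pd.DataFrame({"test_subject": ["maths", "english"],
                    "confidence_score": [0.41, 0.51]})
for rec in tt1.to_dict(orient="records"):
    # json.dumps on the outer dict inserts the backslash escapes automatically
    print(json.dumps({"source": json.dumps(rec)}))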

Ignoring commas in string literals while reading in a .csv file without using any outside libraries

I am trying to read in a .csv file that has a line that looks something like this:
"Red","Apple, Tomato".
I want to read that line into a dictionary, using "Red" as the key and "Apple, Tomato" as the definition. I also want to do this without using any libraries or modules that need to be imported.
The issue I am facing is that it is trying to split that line into 3 separate pieces because there is a comma between "Apple" and "Tomato" that the code is splitting on. This is what I have right now:
import sys

file_folder = sys.argv[1]
file_path = open(file_folder + "/food_colors.csv", "r")
food_dict = {}
for line in file_path:
    (color, description) = line.rstrip().split(',')
    print(f"{color}, {description}")
But this gives me an error because the split produces 3 pieces of data while I only provide 2 variables to store them in. How can I make this ignore the comma inside the string literal?
You can collect the remaining strings into a list, like so
color, *description = line.rstrip().split(',')
You can then join the description strings back together to make the value for your dict
Another way:
color, description = line.rstrip().split(',', 1)
This performs the split operation only once, and the rest of the string remains unsplit.
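Putting it together, a sketch of the dictionary-building loop (assuming every line has exactly two double-quoted fields and the key field never contains a comma):

food_dict = {}
with open(file_folder + "/food_colors.csv", "r") as f:
    for line in f:
        color, description = line.rstrip().split(',', 1)
        # strip the surrounding double quotes from each field
        food_dict[color.strip('"')] = description.strip().strip('"')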
You can use the pandas package and its read_csv function.
For example, this works:
from io import StringIO
import pandas as pd
TESTDATA = StringIO('"Red","Apple, Tomato"')
df = pd.read_csv(TESTDATA, sep=",", header=None)
print(df)
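To get the dictionary the question asks for from that frame (read_csv has already stripped the quotes, and with header=None the columns are named 0 and 1):

food_dict = dict(zip(df[0], df[1]))
print(food_dict)  # {'Red': 'Apple, Tomato'}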

string manipulation with python pandas and replacement function

I'm trying to write code that checks the sentences in a CSV file, searches for words given in a second CSV file, and replaces them. The code below doesn't return any errors, but it is not replacing any words and prints back the same sentences without any replacement.
import string
import pandas as pd
text=pd.read_csv("sentences.csv")
change=pd.read_csv("replace.csv")
for row in text:
    print(text.replace(change['word'], change['replacement']))
(The sentences CSV and the replace CSV were shown as screenshots in the original post.)
Try:
text=pd.read_csv("sentences.csv")
change=pd.read_csv("replace.csv")
toupdate = dict(zip(change.word, change.replacement))
text = text['sentences'].replace(toupdate, regex=True)
print(text)
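One caveat with regex=True: any regex metacharacters in the words are interpreted as patterns. If the words may contain such characters, a sketch that escapes them first, continuing from the snippet above:

import re
# escape each word so it is matched literally, not as a regex pattern
toupdate = {re.escape(w): r for w, r in zip(change.word, change.replacement)}
text = text['sentences'].replace(toupdate, regex=True)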
DataFrame.replace(x, y) by default replaces a value equal to the whole of x, not part of x.
You have to use regex or a custom function to do what you want, for example:
change_dict = dict(zip(change.word, change.replacement))

def replace_word(txt):
    for key, val in change_dict.items():
        txt = txt.replace(key, val)
    return txt

print(text['sentences'].apply(replace_word))
# Create one more additional column to avoid any change to the original column
text["new_sentence"] = text["sentences"]
for changeInd in change.index:
    for eachTextid in text.index:
        # .at avoids the chained-assignment pitfall of text[col][row] = ...
        text.at[eachTextid, "new_sentence"] = text.at[eachTextid, "new_sentence"].replace(change['word'][changeInd], change['replacement'][changeInd])

python code only reading partial csv file from gcs bucket

I am facing a strange issue while reading a CSV file from a Google Cloud Storage bucket and writing it to a file in a different folder in the same bucket.
I have a CSV file named test.csv with 1000001 lines in it. I am trying to strip the " characters from each line and write the result to a file called cleansed_test.csv.
I tested my code locally and it works as expected.
Below is the code I am using locally:
import pandas as pd
import csv
import re

new_lines = []
new_lines_error_less_cols = []
new_lines_error_more_cols = []
with open('c:\\Users\\test_file.csv', 'r') as f:
    lines = f.readlines()
    print(len(lines))
    for line in lines:
        new_line = re.sub('["]', '', line)
        new_line = new_line.strip()
        new_lines.append(new_line)
        # elif line.count('|') < 295:
        #     new_line_error_less = re.sub('["]','inches',line)
        #     new_line_error_less = new_line_error_less.strip()
        #     new_lines_error_less_cols.append(new_line_error_less)
        # else:
        #     new_line_error_more = re.sub('["]','inches',line)
        #     new_line_error_more = new_line_error_more.strip()
        #     new_lines_error_more_cols.append(new_line_error_more)

new_data = pd.DataFrame(new_lines)
print(new_data.info())
# new_data.to_csv('c:\\cleansed_file.csv', header=None, index=False, encoding='utf-8')
But when I try the same file in the GCS bucket, only 67514 rows are being read.
The code I am using in Composer:
def replace_quotes(project, bucket, **context):
    import pandas as pd
    import numpy as np
    import csv
    import os
    import re
    import gcsfs
    import io
    fs = gcsfs.GCSFileSystem(project='project_name')
    updated_file_list = fs.ls('bucketname/FULL')
    updated_file_list = [x for x in updated_file_list if "filename" in x]
    new_lines = []
    new_lines_error_less_cols = []
    new_lines_error_more_cols = []
    for f in updated_file_list:
        file_name = os.path.splitext(f)[0]
        parse_names = file_name.split('/')
        filename = parse_names[2]
        bucketname = parse_names[0]
        with fs.open("gs://" + f, 'r') as pf:
            lines = pf.readlines()
            print("length of lines----->", len(lines))  # even here it shows 67514
            for line in lines:
                new_line = re.sub('["]', '', line)
                new_line = new_line.strip()
                new_lines.append(new_line)
    new_data = pd.DataFrame(new_lines)
    # new_data.to_csv("gs://"+bucketname+"/ERROR_FILES/cleansed_"+filename+".csv", escapechar='', header=None, index=False, encoding='utf-8', quoting=csv.QUOTE_NONE)
Also, in the bucket I see that the sizes of the files test.csv and cleansed_test.csv are the same.
The only thing I can think of is that since files are compressed in GCS buckets, should I be opening the files in a different way? When I download the files locally, they are a lot larger than what I see in the bucket.
Please advise.
Thanks.
I think you can achieve what you want by using the replace method of the DataFrame column object and specifying the regex=True parameter (otherwise the field string must perfectly match the character being replaced). This way you can simply iterate over each column, replace the unwanted string, and rewrite each column with the newly modified one afterwards.
I modified your code a bit and ran it on my VM in GCP. As you can see, I preferred to use the pandas.read_csv() method, as the gcsfs one was throwing me some errors. The code seems to do its job; I initially tested it by replacing a dummy common string, and it worked smoothly.
This is your modified code. Please also note that I refactored the reading part a bit, as it did not properly match the path of the CSV in my bucket.
from pandas.api.types import is_string_dtype
import pandas as pd
import numpy as np
import csv
import os
import re
import gcsfs
import io

fs = gcsfs.GCSFileSystem(project='my-project')
updated_file_list = fs.ls('test-bucket/')
updated_file_list = [x for x in updated_file_list if "simple.csv" in x]
new_lines = []
new_lines_error_less_cols = []
new_lines_error_more_cols = []
for f in updated_file_list:
    file_name = os.path.splitext(f)[0]
    print(f, file_name)
    parse_names = file_name.split('/')
    filename = parse_names[1]
    bucketname = parse_names[0]
    with fs.open("gs://" + f) as pf:
        df = pd.read_csv(pf)
        # print(df.head(len(df)))  # to check results
        for col in df:
            if is_string_dtype(df[col]):
                df[col] = df[col].replace(to_replace=['"'], value='', regex=True)
        # print(df.head(len(df)))  # to check results
new_data = pd.DataFrame(df)
# new_data.to_csv("gs://"+bucketname+"/ERROR_FILES/cleansed_"+filename+".csv", escapechar='', header=None, index=False, encoding='utf-8', quoting=csv.QUOTE_NONE)
Hope my efforts solved your issue!
For anyone curious, this is how to inflate a file that has the extension .csv but is actually compressed with gzip:
gsutil cat gs://BUCKET/File_Name.csv | zcat | gsutil cp - gs://BUCKET/Newfile.csv
The only issue I see here is that I can't use wildcards; to put it plainly, we have to give the destination file name explicitly.
The downside is that since I have to specify the destination file name, I cannot use it in a bash operator in Airflow (this is what I think, I may be wrong).
Thanks.
Anyway, hope this helps.
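For completeness, a sketch of doing the same decompression in Python with gcsfs and the standard gzip module instead of shelling out to gsutil (the project, bucket, and object names are placeholders). Opening the object in binary mode and decompressing explicitly avoids readlines() counting lines in raw gzip bytes:

import gzip
import gcsfs

fs = gcsfs.GCSFileSystem(project='project_name')  # placeholder project
# open the raw bytes, then let gzip decompress them on the fly
with fs.open("gs://bucketname/FULL/test.csv", 'rb') as pf:
    with gzip.open(pf, mode='rt', encoding='utf-8') as gz:
        lines = gz.readlines()
print(len(lines))  # with the decompressed text, the full line count should appear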

Python Programming Error for DataScience DataFrame

I am reading my data from a CSV file using pandas, and it works well with a range of 700. But as soon as I go above 700 and try to append to a list in Python, it shows list index out of range. The CSV has around 500K rows.
Can anyone help me with why this is happening?
Thanks in advance.
import pandas as pd

df_email = pd.read_csv('emails.csv', nrows=800)
test_email = df_email.iloc[:, -1]
list_of_emails = []
for i in range(len(test_email)):
    # split one email on newlines, giving a list of all the strings in the email
    var_email = test_email[i].split("\n")
    email = {}
    message_body = ''
    for _ in var_email:
        if ":" in _:
            var_sentence = _.split(":")  # use ":" to find the elements in the list that contain ":"
            for j in range(len(var_sentence)):
                if var_sentence[j].lower().strip() == "from":
                    email['from'] = var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip()
                elif var_sentence[j].lower().strip() == "to":
                    email['to'] = var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip()
                elif var_sentence[j].lower().strip() == 'subject':
                    if var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip() == 're':
                        email['subject'] = var_sentence[var_sentence.index(var_sentence[j+2])].lower().strip()
                    else:
                        email['subject'] = var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip()
        elif ":" not in _:
            message_body += _.strip()
    email['body'] = message_body
    list_of_emails.append(email)
I am not sure what you are trying to say here (you might as well put example inputs and outputs here), but I came across a problem some weeks ago that might be of the same nature.
CSV files are comma-separated, which means every comma in a line is taken as a column separator. If some dirty input is present in the strings in your CSV file, it will mess up the columns you are expecting to have.
The best solution here is to have some code clean up your CSV file, change its delimiter to another character (probably '|', '&', or anything else that doesn't clash with the data), and revise your code to reflect those changes, as sketched below.
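The re-delimiting described above can be done with the standard csv module, which honors quoted commas; a minimal sketch (the output file name is an assumption):

import csv

# rewrite the file with '|' as the delimiter; csv.reader keeps quoted commas inside fields intact
with open('emails.csv', newline='') as src, open('emails_pipe.csv', 'w', newline='') as dst:
    csv.writer(dst, delimiter='|').writerows(csv.reader(src))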
Use the pandas library to read the file. It is very efficient and saves you the time of writing the code yourself, e.g.:
import pandas as pd
training_data = pd.read_csv("train.csv", sep=",", header=None)
