I have CSV data that looks like the following:
test_subject confidence_score
maths 0.41
english 0.51
I used pandas to create a json file using the following code.
tt1.to_json(orient = "records", lines = True)
The output of the above code is as follows:
{"test_subject":"maths","confidence_score":0.41}
{"test_subject":"english","confidence_score":0.51}
Now I want to wrap every row in a "source" key and, if possible, backslash-escape the quotes around the inner keys, like this:
{"source":"{\"test_subject\":maths,\"confidence_score\":0.41}}
{"source":"{\"test_subject\":english,\"confidence_score\":0.51}}
How can I achieve this?
Using regex (could read/write from/to a file etc. if required), try:
import re
data = '''
{"test_subject":"maths","confidence_score":0.41}
{"test_subject":"english","confidence_score":0.51}
{"test_subject":"maths","confidence_score":0.41}
{"test_subject":"english","confidence_score":0.51}
'''
data2 = ''
for line in data.splitlines():
    data2 = data2 + re.sub(r'{\"(.*?)\":\"(.*?)\",\"(.*?)\":(.*?)}', r'{"source":"{\\"\1\\":\2,\\"\3\\":\4}}\n', line)
print(data2)
{"source":"{\"test_subject\":maths,\"confidence_score\":0.41}}
{"source":"{\"test_subject\":english,\"confidence_score\":0.51}}
{"source":"{\"test_subject\":maths,\"confidence_score\":0.41}}
{"source":"{\"test_subject\":english,\"confidence_score\":0.51}}
For writing to a file (example):
with open("myFile.txt", "a") as f:
    for line in data.splitlines():
        if line:  # skip the blank first line of data
            # append "\n" so that each record lands on its own line
            f.write(re.sub(r'{\"(.*?)\":\"(.*?)\",\"(.*?)\":(.*?)}', r'{"source":"{\\"\1\\":\2,\\"\3\\":\4}}', line) + "\n")
The regex used in re.sub is shown here: https://regex101.com/r/g64mzx/2. If there is more than 'test_subject', 'maths', 'confidence_score', and a float, the regex would need to be updated to match the new string.
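If valid JSON is acceptable, a regex-free alternative is to let the json module do the escaping. This is only a minimal sketch, not the approach above: the helper name wrap_in_source is made up for illustration, and unlike the desired output in the question it keeps the quotes around string values such as "maths", so every output line is itself valid JSON.

```python
import json

# The two records produced by to_json(orient="records", lines=True)
lines = [
    '{"test_subject":"maths","confidence_score":0.41}',
    '{"test_subject":"english","confidence_score":0.51}',
]

def wrap_in_source(line):
    """Re-serialize one JSON record as a string stored under a "source" key."""
    record = json.loads(line)
    # Compact separators keep the inner text free of extra spaces
    inner = json.dumps(record, separators=(",", ":"))
    # Dumping the outer dict escapes the inner quotes as \"
    return json.dumps({"source": inner}, separators=(",", ":"))

wrapped = [wrap_in_source(line) for line in lines]
for row in wrapped:
    print(row)
# first line printed:
# {"source":"{\"test_subject\":\"maths\",\"confidence_score\":0.41}"}
```

Because the inner value is a real JSON string, it can be read back with two json.loads calls, which is not possible with the unquoted form produced by the regex.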
How to convert tuple
text = ('John', '"n"', '"ABC 123\nDEF, 456GH\nijKl"\r\n', '"Johny\nIs\nHere"')
to csv format
out = '"John", """n""", """ABC 123\\nDEF, 456\\nijKL\\r\\n""", """Johny\\nIs\\nHere"""'
or even omitting the special chars at the end
out = '"John", """n""", """ABC 123\\nDEF, 456\\nijKL""", """Johny\\nIs\\nHere"""'
I came up with this monster
out1 = ','.join(f'""{t}""' if t.startswith('"') and t.endswith('"')
else f'"{t}"' for t in text)
out2 = out1.replace('\n', '\\n').replace('\r', '\\r')
You can get pretty close to what you want with the csv and io modules from the standard library:
use csv to correctly encode the delimiters and handle the quoting rules; it only writes to a file handle
use io.StringIO for that file handle to get the resulting CSV as a string
import csv
import io
f = io.StringIO()
text = ("John", '"n"', '"ABC 123\nDEF, 456GH\nijKl"\r\n', '"Johny\nIs\nHere"')
writer = csv.writer(f)
writer.writerow(text)
csv_str = f.getvalue()
csv_repr = repr(csv_str)
print("CSV_STR")
print("=======")
print(csv_str)
print("CSV_REPR")
print("========")
print(csv_repr)
and that prints:
CSV_STR
=======
John,"""n""","""ABC 123
DEF, 456GH
ijKl""
","""Johny
Is
Here"""
CSV_REPR
========
'John,"""n""","""ABC 123\nDEF, 456GH\nijKl""\r\n","""Johny\nIs\nHere"""\r\n'
csv_str is what you'd see in a file if you wrote directly to a file opened for writing; it is true CSV.
csv_repr is roughly what you asked for when you showed us out, but not quite. Your example included "doubly escaped" newlines (\\n) and carriage returns (\\r\\n). CSV doesn't need those characters escaped because the entire field is quoted. If you need that anyway, you'll have to do it yourself with something like:
csv_repr.replace(r"\r", r"\\r").replace(r"\n", r"\\n")
but again, that's not necessary for valid CSV.
Also, I don't know how to make the writer include an initial space before every field after the first field, like the spaces you show between "John" and "n" and then after "n" in:
out = 'John, """n""", ...'
The reader can be configured to expect and ignore an initial space, with Dialect.skipinitialspace, but I don't see any options for the writer.
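If the main goal is to have every field quoted, including plain ones like John, the writer's quoting option may be enough. Here is a small sketch, assuming csv.QUOTE_ALL and an empty line terminator suit your use case; row_to_csv is a hypothetical helper name, and this still does not add a space after each comma.

```python
import csv
import io

def row_to_csv(fields):
    """Return one CSV line with every field quoted and no trailing line terminator."""
    buf = io.StringIO()
    # QUOTE_ALL quotes every field; embedded quotes are doubled by the writer
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="")
    writer.writerow(fields)
    return buf.getvalue()

line = row_to_csv(("John", '"n"', '"Johny\nIs\nHere"'))
print(repr(line))
```

The newlines inside the last field are kept as real newline characters inside the quotes, which is how CSV represents them.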
I am trying to read in a .csv file that has a line that looks something like this:
"Red","Apple, Tomato".
I want to read that line into a dictionary, using "Red" as the key and "Apple, Tomato" as the definition. I also want to do this without using any libraries or modules that need to be imported.
The issue I am facing is that it is trying to split that line into 3 separate pieces because there is a comma between "Apple" and "Tomato" that the code is splitting on. This is what I have right now:
file_folder = sys.argv[1]
file_path = open(file_folder+ "/food_colors.csv", "r")
food_dict = {}
for line in file_path:
    (color, description) = line.rstrip().split(',')
    print(f"{color}, {description}")
But this gives me an error because it has 3 pieces of data, but I am only giving it 2 variables to store the info in. How can I make this ignore the comma inside the string literal?
You can collect the remaining strings into a list, like so:
color, *description = line.rstrip().split(',')
You can then join the description strings back together with ','.join(description) to make the value for your dict.
Another way:
color, description = line.rstrip().split(',', 1)
This splits only at the first comma, so the rest of the string remains unsplit. In both cases the surrounding double quotes are still part of the strings.
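If the quotes also need to be removed and commas inside quotes respected in any column, a small hand-written splitter can do it without imports. This is a sketch under assumptions: split_csv_line is a hypothetical helper, it handles doubled quotes ("") inside a quoted field, and it does not handle fields that span several lines.

```python
def split_csv_line(line, delimiter=",", quote='"'):
    """Split one CSV line, ignoring delimiters that appear inside quotes."""
    fields, current, in_quotes = [], [], False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == quote:
            if in_quotes and i + 1 < len(line) and line[i + 1] == quote:
                current.append(quote)  # doubled quote means a literal quote
                i += 1
            else:
                in_quotes = not in_quotes  # opening or closing quote
        elif ch == delimiter and not in_quotes:
            fields.append("".join(current))  # end of a field
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))  # last field on the line
    return fields

food_dict = {}
# A list stands in for the open file from the question
for line in ['"Red","Apple, Tomato"\n']:
    color, description = split_csv_line(line.rstrip())
    food_dict[color] = description

print(food_dict)  # {'Red': 'Apple, Tomato'}
```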
You can use the pandas package and its pandas.read_csv function.
For example, this works:
from io import StringIO
import pandas as pd
TESTDATA = StringIO('"Red","Apple, Tomato"')
df = pd.read_csv(TESTDATA, sep=",", header=None)
print(df)
I have a CSV file in which a single row is getting split into multiple rows.
The source file contents are:
"ID","BotName"
"1","ABC"
"2","CDEF"
"3","AAA
XYZ"
"4",
"ABCD"
As we can see, IDs 3 and 4 are getting split into multiple rows. So, is there any way in Python to join those rows with the previous line?
Desired output:
"ID","BotName"
"1","ABC"
"2","CDEF"
"3","AAAXYZ"
"4","ABCD"
This is the code I have:
data = open(r"C:\Users\suksengupta\Downloads\BotReport_V01.csv","r")
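It looks like your CSV file has control characters (such as newlines) embedded in the field contents. If that is the case, you need to strip them out so that the contents of each field are joined together.
With that in mind, something like this will fix the problem (note that the pattern also removes ordinary spaces that follow a word character or a comma, so it only suits data like the sample):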
import re
src = r'C:\Users\suksengupta\Downloads\BotReport_V01.csv'
with open(src) as f:
    data = re.sub(r'([\w|,])\s+', r'\1', f.read())
print(data)
The above code will result in the output below printed to console:
"ID","BotName"
"1","ABC"
"2","CDEF"
"3","AAAXYZ"
"4","ABCD"
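An alternative that avoids touching spaces inside fields is to parse the text with the csv module and stitch short rows back together. This is only a sketch under stated assumptions: repair_rows is a hypothetical helper, every record is assumed to have exactly two columns, and no cell is assumed to be legitimately empty. The sample text is inlined instead of read from the file.

```python
import csv
import io

# Sample content from the question, inlined for the example
raw = '"ID","BotName"\n"1","ABC"\n"2","CDEF"\n"3","AAA\nXYZ"\n"4",\n"ABCD"\n'

def repair_rows(text, width=2):
    """Rebuild records of `width` cells from rows that were split across lines."""
    repaired, pending = [], []
    for row in csv.reader(io.StringIO(text)):
        # Drop empty cells and remove line breaks embedded in quoted cells
        cells = [c.replace("\r", "").replace("\n", "") for c in row if c != ""]
        pending.extend(cells)
        if len(pending) >= width:
            repaired.append(pending[:width])
            pending = pending[width:]
    return repaired

rows = repair_rows(raw)

# Write the repaired rows back out with every field quoted
out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n").writerows(rows)
print(out.getvalue())
```

The csv reader already treats the newline inside "AAA\nXYZ" as part of the quoted field, so only the row for ID 4 needs to be merged by hand.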
I'm trying to write code that checks the sentences in one CSV file, searches them for words listed in a second CSV file, and replaces those words. My code is below. It doesn't raise any errors, but it doesn't replace any words either; it just prints the same sentences back without any replacement.
import string
import pandas as pd
text=pd.read_csv("sentences.csv")
change=pd.read_csv("replace.csv")
for row in text:
    print(text.replace(change['word'],change['replacement']))
the sentences csv file looks like
and the change csv file looks like
Try:
text=pd.read_csv("sentences.csv")
change=pd.read_csv("replace.csv")
toupdate = dict(zip(change.word, change.replacement))
text = text['sentences'].replace(toupdate, regex=True)
print(text)
DataFrame.replace(x, y) replaces a cell only when its entire value equals x; it does not replace part of a value.
You have to use regex=True or a custom function to do what you want. For example:
change_dict = dict(zip(change.word,change.replacement))
def replace_word(txt):
    for key,val in change_dict.items():
        txt = txt.replace(key,val)
    return txt
print(text['sentences'].apply(replace_word))
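Plain str.replace also changes matches inside longer words. If only whole words should be replaced, a regex with word boundaries is one option. The following is a rough sketch with made-up data: the contents of change_dict and the name replace_whole_words are assumptions for illustration, not taken from the question's files.

```python
import re

# Hypothetical word -> replacement mapping; in the question this would come
# from dict(zip(change.word, change.replacement))
change_dict = {"cat": "dog", "red": "blue"}

# One pattern for all words; re.escape protects special characters and
# \b restricts matches to whole words (longer words are tried first)
words = sorted(change_dict, key=len, reverse=True)
pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")

def replace_whole_words(txt):
    """Replace every whole-word match with its mapped replacement."""
    return pattern.sub(lambda m: change_dict[m.group(1)], txt)

result = replace_whole_words("the red cat was scared")
print(result)  # the blue dog was scared
```

Note that "scared" is left alone even though it contains "red". With pandas, the same function could be applied with text['sentences'].apply(replace_whole_words).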
# create an additional column so that the original column is left unchanged
text["new_sentence"] = text["sentences"]
for changeInd in change.index:
    for eachTextid in text.index:
        # .loc avoids chained assignment, which may not update the DataFrame
        text.loc[eachTextid, "new_sentence"] = text.loc[eachTextid, "new_sentence"].replace(change.loc[changeInd, 'word'], change.loc[changeInd, 'replacement'])
I'm trying to write this simple code in Python: if the second element of a line in a CSV file is one of the families listed in malware_list, the main program should print True. However, the program always prints False.
Each line in the file is in the form:
"NAME,FAMILY"
This is the code:
malware_list = ["FakeInstaller","DroidKungFu", "Plankton",
"Opfake", "GingerMaster", "BaseBridge",
"Iconosys", "Kmin", "FakeDoc", "Geinimi",
"Adrd", "DroidDream", "LinuxLotoor", "GoldDream"
"MobileTx", "FakeRun", "SendPay", "Gappusin",
"Imlog", "SMSreg"]
def is_malware(line):
    line_splitted = line.split(",")
    family = line_splitted[1]
    if family in malware_list:
        return True
    return False
def main():
    with open("datset_small.csv", "r") as f:
        for i in range(1,100):
            line = f.readline()
            print(is_malware(line))
if __name__ == "__main__":
    main()
line = f.readline()
readline doesn't strip the trailing newline off of the result, so most likely line here looks something like "STEVE,FakeDoc\n". Then family becomes "FakeDoc\n", which is not a member of malware_list, so your function returns False.
Try stripping out the whitespace after reading:
line = f.readline().strip()
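Putting that together, here is a minimal sketch of the check using the csv module and a set. Some assumptions: an in-memory sample replaces datset_small.csv, the row with BOB and Harmless is invented for illustration, and the set is shortened to a few of the families from the question.

```python
import csv
import io

# A few of the family names from the question, stored in a set for fast lookups
MALWARE_FAMILIES = {"FakeInstaller", "DroidKungFu", "Plankton", "FakeDoc"}

def is_malware(row):
    """Return True if the second column, stripped of whitespace, is a known family."""
    return len(row) > 1 and row[1].strip() in MALWARE_FAMILIES

# Stand-in for open("datset_small.csv"); the second row is made up
sample = io.StringIO("STEVE,FakeDoc\nBOB,Harmless\n")

# csv.reader splits each line into fields and drops the line terminator
results = [is_malware(row) for row in csv.reader(sample)]
print(results)  # [True, False]
```

Iterating over the reader also removes the need for the fixed range(1,100) loop, because it stops at the end of the file.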
Python has a package called pandas. With pandas you can read a CSV file into a DataFrame.
import pandas as pd
df=pd.read_csv("datset_small.csv")
Please post the contents of your CSV file so that I can help you out.
This can easily be achieved using a DataFrame.
Example code:
import pandas as pd
# note the comma after "GoldDream"; without it Python joins the two
# adjacent string literals into "GoldDreamMobileTx"
malware_list = ["FakeInstaller","DroidKungFu", "Plankton",
"Opfake", "GingerMaster", "BaseBridge",
"Iconosys", "Kmin", "FakeDoc", "Geinimi",
"Adrd", "DroidDream", "LinuxLotoor", "GoldDream",
"MobileTx", "FakeRun", "SendPay", "Gappusin",
"Imlog", "SMSreg"]
# read csv into dataframe
df = pd.read_csv('datset_small.csv')
print(df['FAMILY'].isin(malware_list))
The output is:
0 True
1 True
2 True
The sample CSV used is:
NAME,FAMILY
090b5be26bcc4df6186124c2b47831eb96761fcf61282d63e13fa235a20c7539,Plankton
bedf51a5732d94c173bcd8ed918333954f5a78307c2a2f064b97b43278330f54,DroidKungFu
149bde78b32be3c4c25379dd6c3310ce08eaf58804067a9870cfe7b4f51e62fe,Plankton
I would use a set instead of a list for speed, and pandas is a good fit here because it is fast and keeps the code simple. You can use the x in y membership test to get the results:
import io #not needed in your case
import pandas as pd
data = io.StringIO('''090b5be26bcc4df6186124c2b47831eb96761fcf61282d63e13fa235a20c7539,Plankton
bedf51a5732d94c173bcd8ed918333954f5a78307c2a2f064b97b43278330f54,DroidKungFu
149bde78b32be3c4c25379dd6c3310ce08eaf58804067a9870cfe7b4f51e62fe,Plankton''')
df = pd.read_csv(data,sep=',',header=None)
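# a set literal (curly braces) gives fast membership tests;
# note the comma after "GoldDream", without which Python joins the adjacent literals
malware_set = {"FakeInstaller","DroidKungFu", "Plankton",
"Opfake", "GingerMaster", "BaseBridge",
"Iconosys", "Kmin", "FakeDoc", "Geinimi",
"Adrd", "DroidDream", "LinuxLotoor", "GoldDream",
"MobileTx", "FakeRun", "SendPay", "Gappusin",
"Imlog", "SMSreg"}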
df.columns = ['id','software']
df['malware'] = df['software'].apply(lambda x: x.strip() in malware_set)
print(df)