String manipulation with Python pandas and a replacement function

I'm trying to write code that reads the sentences in a CSV file, searches for the words given in a second CSV file, and replaces them. My code is below; it doesn't raise any errors, but for some reason it doesn't replace any words and just prints back the same sentences without any replacement.
import string
import pandas as pd

text = pd.read_csv("sentences.csv")
change = pd.read_csv("replace.csv")
for row in text:
    print(text.replace(change['word'], change['replacement']))
The sentences CSV file has a sentences column, and the change CSV file (replace.csv) has word and replacement columns.

Try:
text = pd.read_csv("sentences.csv")
change = pd.read_csv("replace.csv")
# build a {word: replacement} mapping and apply it as regex substitutions
toupdate = dict(zip(change.word, change.replacement))
text = text['sentences'].replace(toupdate, regex=True)
print(text)
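As a quick sanity check, here is a minimal sketch with made-up data (the column names sentences, word and replacement match the code above; the values themselves are hypothetical):
import pandas as pd

# hypothetical stand-ins for sentences.csv and replace.csv
text = pd.DataFrame({"sentences": ["I like cats", "dogs are loud"]})
change = pd.DataFrame({"word": ["cats", "dogs"], "replacement": ["birds", "fish"]})

toupdate = dict(zip(change.word, change.replacement))
print(text["sentences"].replace(toupdate, regex=True))
# 0    I like birds
# 1    fish are loud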

DataFrame.replace(x, y) replaces a value that is exactly x with y; it does not replace x when it is only part of a string.
You have to use regex=True or a custom function to do what you want. For example:
change_dict = dict(zip(change.word, change.replacement))

def replace_word(txt):
    # replace every listed word wherever it appears inside the sentence
    for key, val in change_dict.items():
        txt = txt.replace(key, val)
    return txt

print(text['sentences'].apply(replace_word))

# create an additional column so the original column stays unchanged
text["new_sentence"] = text["sentences"]
for changeInd in change.index:
    for eachTextid in text.index:
        # .at avoids the chained-assignment warning that text[...][...] = ... would trigger
        text.at[eachTextid, "new_sentence"] = text.at[eachTextid, "new_sentence"].replace(
            change['word'][changeInd], change['replacement'][changeInd])

Related

"Replace" from central file?

I am trying to extend the replace function. Instead of doing the replacements on individual lines or individual commands, I would like to use the replacements from a central text file.
That's the source:
import os
import feedparser
import pandas as pd
pd.set_option('max_colwidth', -1)
RSS_URL = "https://techcrunch.com/startups/feed/"
feed = feedparser.parse(RSS_URL)
entries = pd.DataFrame(feed.entries)
entries = entries[['title']]
entries = entries.to_string(index=False, header=False)
entries = entries.replace(' ', '\n')
entries = os.linesep.join([s for s in entries.splitlines() if s])
print(entries)
I want to be able to replace words from an RSS feed using a central "replacement" file. The source file should have two columns: old word, new word, like the replace function replace('old','new').
Output/Print Example:
truck
rental
marketplace
D’Amelio
family
launches
to
invest
up
to
$25M
...
In most cases I want to delete words that are unnecessary for me, e.g. replace('to',''). But I also want to be able to change special names, e.g. replace("D'Amelio",'DAmelio'). The goal is to reduce the number of words and build up a kind of keyword radar.
Is this possible? I can't find any help by googling, but it could well be that I don't know the right terms or can't formulate the question properly.
with open('<filepath>', 'r') as r:
    # if you remove the ' marks from around your words, you can remove the [1:-1] part of the below code
    words_to_replace = [word.strip()[1:-1] for word in r.read().split(',')]

def replace_words(original_text, words_to_replace):
    for word in words_to_replace:
        original_text = original_text.replace(word, '')
    return original_text
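If the replacement file really has two columns (old word, new word), a dict-driven variant of the same idea might look like the sketch below; the file name replacements.csv and the one-pair-per-line layout are my assumptions, not from the question:
# hypothetical replacements.csv, one "old,new" pair per line, e.g.
#   to,
#   D'Amelio,DAmelio
replacements = {}
with open('replacements.csv', 'r') as r:
    for line in r:
        old, new = line.rstrip('\n').split(',', 1)
        replacements[old] = new

def replace_words(original_text, replacements):
    for old, new in replacements.items():
        original_text = original_text.replace(old, new)
    return original_text

print(replace_words(entries, replacements))  # entries is the string built from the RSS feed above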
I wasn't able to understand your question completely, but as far as I can tell you have strings like cat, dog, etc. and a file containing the data you want to replace them with. If that is your requirement, I have given a solution below; try running it and see whether it does what you need.
If that's not what you meant, please comment below.
TXT file (don't use '' around the strings in the text file):
papa, papi
dog, dogo
cat, kitten
Python File:
your_string = input("Type a string here: ")  # string you want to replace
with open('textfile.txt', "r") as file1:  # open your file
    lines = file1.readlines()
for line in lines:  # take the lines of the file one by one
    string1 = line.split()  # split the line into a list like ['cat,', 'kitten']
    if your_string == string1[0][:-1]:  # compare your string against the first word (minus its trailing comma)
        your_string = your_string.replace(your_string, string1[1])  # if it matches, e.g. input cat, replace it with kitten
        print(your_string)
    else:
        pass

trying to convert txt data to csv columns

I am having a small issue. The code below works, but when I put /test1 and /test2 into the file and change the pattern to '/test1 (.*?) /test2', it doesn't find the match and the except branch runs.
import re

with open('test.txt') as f:
    fin = f.read()
try:
    print(re.search('test1 (.*?) test2', fin).group(1))
except:
    print("Didn't find test")
My goal is to extract data from a list of text files and push it into CSV columns. The files contain text like the sample below, where I would extract everything from /J6 to /K6 as a value range. There are multiple different /J6 to /K6 spans, and each value should go into a separate column in the CSV.
/J60000,0000,0819,0016,0356,-13,0363/K60013
,0012,0013,0875,-0021,00465,0120/L60089,0002,
I just want to understand whether there is a syntax problem with detecting the /. I am trying to extract the values between one marker and the next. Thank you.
You can use the re.split function.
Something like this...
In [84]: import re
In [85]: inp = "/J60000,0000,0819,0016,0356,-13,0363/K60013 ,0012,0013,0875,-0021,00465,0120/L60089,0002,"
In [86]: re.split(r"/[JKL]\d", inp)
Out[86]:
['',
'0000,0000,0819,0016,0356,-13,0363',
'0013 ,0012,0013,0875,-0021,00465,0120',
'0089,0002,']
Disclaimer: I'm not good with regex at all. I used this link as a reference.
https://www.dataquest.io/blog/regex-cheatsheet/
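To push those pieces into CSV columns, a minimal follow-up sketch using the standard csv module (the output file name out.csv is a placeholder) could be:
import csv
import re

inp = "/J60000,0000,0819,0016,0356,-13,0363/K60013 ,0012,0013,0875,-0021,00465,0120/L60089,0002,"
# drop the leading empty string that re.split produces before the first marker
parts = [p for p in re.split(r"/[JKL]\d", inp) if p]
with open("out.csv", "w", newline="") as f:
    csv.writer(f).writerow(parts)  # one column per extracted value range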

Ignoring commas in string literals while reading in .csv file without using any outside libraries

I am trying to read in a .csv file that has a line that looks something like this:
"Red","Apple, Tomato".
I want to read that line into a dictionary, using "Red" as the key and "Apple, Tomato" as the definition. I also want to do this without using any libraries or modules that need to be imported.
The issue I am facing is that it is trying to split that line into 3 separate pieces because there is a comma between "Apple" and "Tomato" that the code is splitting on. This is what I have right now:
import sys

file_folder = sys.argv[1]
file_path = open(file_folder + "/food_colors.csv", "r")
food_dict = {}
for line in file_path:
    (color, description) = line.rstrip().split(',')
    print(f"{color}, {description}")
But this gives me an error because it has 3 pieces of data, but I am only giving it 2 variables to store the info in. How can I make this ignore the comma inside the string literal?
You can collect the remaining strings into a list, like so
color, *description = line.rstrip().split(',')
You can then join the description strings back together to make the value for your dict
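A minimal sketch of that idea, using the sample line from the question:
line = '"Red","Apple, Tomato"'
color, *description = line.rstrip().split(',')
description = ','.join(description)  # re-join the pieces that the extra comma split apart
print(color, description)            # "Red" "Apple, Tomato"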
Another way
color, description = line.rstrip().split(',', 1)
Would mean you only perform the split operation once and the rest of the string remains unsplit.
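Put into the question's loop, and also stripping the surrounding double quotes (the quote stripping is an extra step, not part of the answer above), that might look like:
food_dict = {}
with open(file_folder + "/food_colors.csv", "r") as file_path:
    for line in file_path:
        color, description = line.rstrip().split(',', 1)
        # strip the surrounding double quotes before storing the pair
        food_dict[color.strip('"')] = description.strip().strip('"')
print(food_dict)  # {'Red': 'Apple, Tomato'}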
You can use the pandas package and its pandas.read_csv function.
For example, this works:
from io import StringIO
import pandas as pd
TESTDATA = StringIO('"Red","Apple, Tomato"')
df = pd.read_csv(TESTDATA, sep=",", header=None)
print(df)
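To get the dictionary the question asks for out of that DataFrame, a small follow-up (still assuming one key column and one value column) could be:
food_dict = dict(zip(df[0], df[1]))
print(food_dict)  # {'Red': 'Apple, Tomato'}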

Remove embedded line-feeds from fields of a CSV file

I have a CSV file in which a single row is getting split into multiple rows.
The source file contents are:
"ID","BotName"
"1","ABC"
"2","CDEF"
"3","AAA
XYZ"
"4",
"ABCD"
As we can see, IDs 3 and 4 are getting split into multiple rows. So, is there any way in Python to join those rows with the previous line?
Desired output:
"ID","BotName"
"1","ABC"
"2","CDEF"
"3","AAAXYZ"
"4","ABCD"
This is the code I have:
data = open(r"C:\Users\suksengupta\Downloads\BotReport_V01.csv","r")
It looks like your CSV file has control characters embedded in the field contents. If that is the case, you need to strip them out in order to have each field's contents joined back together.
With that in mind, something like this will fix the problem:
import re

src = r'C:\Users\suksengupta\Downloads\BotReport_V01.csv'
with open(src) as f:
    # remove whitespace (including embedded line feeds) that immediately follows the captured character, re-joining the split rows
    data = re.sub(r'([\w|,])\s+', r'\1', f.read())
print(data)
The above code will result in the output below printed to console:
"ID","BotName"
"1","ABC"
"2","CDEF"
"3","AAAXYZ"
"4","ABCD"

Wrong boolean result in the main program (python)

I'm trying to write this simple piece of code in Python: if the second element of a line of a CSV file contains one of the families specified in the malware_list list, the main program should print True. However, the program always prints False.
Each line in the file is in the form:
"NAME,FAMILY"
This is the code:
malware_list = ["FakeInstaller","DroidKungFu", "Plankton",
"Opfake", "GingerMaster", "BaseBridge",
"Iconosys", "Kmin", "FakeDoc", "Geinimi",
"Adrd", "DroidDream", "LinuxLotoor", "GoldDream"
"MobileTx", "FakeRun", "SendPay", "Gappusin",
"Imlog", "SMSreg"]
def is_malware(line):
    line_splitted = line.split(",")
    family = line_splitted[1]
    if family in malware_list:
        return True
    return False

def main():
    with open("datset_small.csv", "r") as f:
        for i in range(1, 100):
            line = f.readline()
            print(is_malware(line))

if __name__ == "__main__":
    main()
line = f.readline()
readline doesn't strip the trailing newline off of the result, so most likely line here looks something like "STEVE,FakeDoc\n". Then family becomes "FakeDoc\n", which is not a member of malware_list, so your function returns False.
Try stripping out the whitespace after reading:
line = f.readline().strip()
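Applied to the loop in main(), that one-line change is all that is needed (a sketch of the corrected loop, everything else unchanged):
def main():
    with open("datset_small.csv", "r") as f:
        for i in range(1, 100):
            line = f.readline().strip()  # drop the trailing newline before checking
            print(is_malware(line))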
Python has a package called pandas. With pandas we can read the CSV file into a DataFrame:
import pandas as pd
df = pd.read_csv("datset_small.csv")
Please post the contents of your CSV file so that I can help you further.
It can easily be achieved with a DataFrame. Example code is as follows:
import pandas as pd
malware_list = ["FakeInstaller", "DroidKungFu", "Plankton",
                "Opfake", "GingerMaster", "BaseBridge",
                "Iconosys", "Kmin", "FakeDoc", "Geinimi",
                "Adrd", "DroidDream", "LinuxLotoor", "GoldDream",
                "MobileTx", "FakeRun", "SendPay", "Gappusin",
                "Imlog", "SMSreg"]
# read csv into dataframe
df = pd.read_csv('datset_small.csv')
print(df['FAMILY'].isin(malware_list))
output is
0 True
1 True
2 True
sample csv used is
NAME,FAMILY
090b5be26bcc4df6186124c2b47831eb96761fcf61282d63e13fa235a20c7539,Plankton
bedf51a5732d94c173bcd8ed918333954f5a78307c2a2f064b97b43278330f54,DroidKungFu
149bde78b32be3c4c25379dd6c3310ce08eaf58804067a9870cfe7b4f51e62fe,Plankton
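To keep only the rows that are flagged, the same boolean Series can be used as a filter (a small extension of the answer above):
malware_rows = df[df['FAMILY'].isin(malware_list)]
print(malware_rows)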
I would use a set instead of a list for speed, and pandas is definitely better here for both speed and simplicity of the code. You can use x in y logic to get the results ;)
import io #not needed in your case
import pandas as pd
data = io.StringIO('''090b5be26bcc4df6186124c2b47831eb96761fcf61282d63e13fa235a20c7539,Plankton
bedf51a5732d94c173bcd8ed918333954f5a78307c2a2f064b97b43278330f54,DroidKungFu
149bde78b32be3c4c25379dd6c3310ce08eaf58804067a9870cfe7b4f51e62fe,Plankton''')
df = pd.read_csv(data,sep=',',header=None)
malware_set = {"FakeInstaller", "DroidKungFu", "Plankton",
               "Opfake", "GingerMaster", "BaseBridge",
               "Iconosys", "Kmin", "FakeDoc", "Geinimi",
               "Adrd", "DroidDream", "LinuxLotoor", "GoldDream",
               "MobileTx", "FakeRun", "SendPay", "Gappusin",
               "Imlog", "SMSreg"}
df.columns = ['id','software']
df['malware'] = df['software'].apply(lambda x: x.strip() in malware_set)
print(df)
