Remove embedded line-feeds from fields of a CSV file - python

I have a CSV file in which a single row is getting split into multiple rows.
The source file contents are:
"ID","BotName"
"1","ABC"
"2","CDEF"
"3","AAA
XYZ"
"4",
"ABCD"
As we can see, IDs 3 and 4 are split across multiple rows. Is there any way in Python to join those continuation lines back onto the previous line?
Desired output:
"ID","BotName"
"1","ABC"
"2","CDEF"
"3","AAAXYZ"
"4","ABCD"
This is the code I have:
data = open(r"C:\Users\suksengupta\Downloads\BotReport_V01.csv","r")

It looks like your CSV file has stray line breaks embedded in the field contents. If that is the case, you need to strip them out so that each field's contents end up joined back together.
With that in mind, something like this will fix the problem:
import re

src = r'C:\Users\suksengupta\Downloads\BotReport_V01.csv'
with open(src) as f:
    # Remove any whitespace (including line feeds) that directly follows
    # a word character or a comma, re-joining the broken rows.
    data = re.sub(r'([\w,])\s+', r'\1', f.read())
print(data)
The above code will result in the output below printed to console:
"ID","BotName"
"1","ABC"
"2","CDEF"
"3","AAAXYZ"
"4","ABCD"

Related

How to read a delimited file using Spark RDD, if the actual data is embedded with same delimiter

I am trying to read a text file into an RDD.
My sample data is below:
"1" "Hai How are you!" "56"
"2" "0213"
Three columns with a tab delimiter. My data also has that same delimiter embedded inside a field (Hai\tHow are you!). Can someone help me parse the data properly in PySpark?
my_Rdd = sc.textFile("Mytext.txt").map(lambda line: line.split('\t'))
When I run the above code I get the output below:
ColA,ColB,Colc
"1","Hai,How are you!"
"2","0123"
The 2nd column spills into the 3rd because the actual data contains the same delimiter, and in the 2nd row the 3rd value ends up mapped to the 2nd column.
My expected output is
ColA,ColB,Colc
"1","Hai How are you!","56"
"2",,"0123"
In Dataframe I can keep quote options, but how can we do the same in RDD?
Use shlex.split(), which does not split on delimiters inside quotes:
import shlex
sc.textFile('Mytext.txt').map(lambda line: shlex.split(line))
Another example, with a plain string:
import shlex
rdd = sc.parallelize(['"1"\t"Hai\tHow are you!"\t"56"']).map(lambda line: shlex.split(line))
>>> rdd.collect()
[['1', 'Hai\tHow are you!', '56']]
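One caveat worth noting (my addition, not part of the original answer): shlex treats a run of whitespace as a single separator, so the empty middle field in the second expected row would be dropped. If empty fields matter, a sketch using the stdlib csv module inside the map should preserve them, since it only honours the tab delimiter outside quotes:
import csv

def parse_tsv_line(line):
    # csv keeps the tab inside "Hai\tHow are you!" because it is quoted,
    # and an empty field between two consecutive tabs survives as ''.
    return next(csv.reader([line], delimiter='\t'))

rdd = sc.parallelize(['"1"\t"Hai\tHow are you!"\t"56"',
                      '"2"\t\t"0123"']).map(parse_tsv_line)
>>> rdd.collect()
[['1', 'Hai\tHow are you!', '56'], ['2', '', '0123']]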

Ignoring commas in string literals while reading in .csv file without using any outside libraries

I am trying to read in a .csv file that has a line that looks something like this:
"Red","Apple, Tomato".
I want to read that line into a dictionary, using "Red" as the key and "Apple, Tomato" as the definition. I also want to do this without using any libraries or modules that need to be imported.
The issue I am facing is that it is trying to split that line into 3 separate pieces because there is a comma between "Apple" and "Tomato" that the code is splitting on. This is what I have right now:
import sys

file_folder = sys.argv[1]
file_path = open(file_folder + "/food_colors.csv", "r")
food_dict = {}
for line in file_path:
    (color, description) = line.rstrip().split(',')
    print(f"{color}, {description}")
But this gives me an error because it has 3 pieces of data, but I am only giving it 2 variables to store the info in. How can I make this ignore the comma inside the string literal?
You can collect the remaining strings into a list, like so
color, *description = line.rstrip().split(',')
You can then join the description strings back together to make the value for your dict
Another way:
color, description = line.rstrip().split(',', 1)
Would mean you only perform the split operation once and the rest of the string remains unsplit.
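Putting that together, a minimal sketch of the whole loop (the path handling and column layout follow the question; stripping the surrounding quotes is my addition):
import sys

file_folder = sys.argv[1]
food_dict = {}
with open(file_folder + "/food_colors.csv", "r") as file_path:
    for line in file_path:
        # Split only on the first comma, so commas inside the quoted
        # description are left alone.
        color, description = line.rstrip().split(',', 1)
        food_dict[color.strip('"')] = description.strip('"')

print(food_dict)  # {'Red': 'Apple, Tomato'}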
You can use the pandas package and its pandas.read_csv function.
For example, this works:
from io import StringIO
import pandas as pd
TESTDATA = StringIO('"Red","Apple, Tomato"')
df = pd.read_csv(TESTDATA, sep=",", header=None)
print(df)
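If the end goal is still the dictionary from the question, the parsed frame can be folded into one; a short follow-up assuming the same two-column layout:
# Column 0 holds the colors, column 1 the descriptions (header=None above).
food_dict = dict(zip(df[0], df[1]))
print(food_dict)  # {'Red': 'Apple, Tomato'}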

string manipulation with python pandas and replacement function

I'm trying to write code that checks the sentences in a CSV file, searches for the words given in a second CSV file, and replaces them. My code is below. It doesn't return any errors, but for some reason it is not replacing any words and prints back the same sentences without any replacement.
import string
import pandas as pd

text = pd.read_csv("sentences.csv")
change = pd.read_csv("replace.csv")
for row in text:
    print(text.replace(change['word'], change['replacement']))
(The sentences CSV and the replace CSV were shown as screenshots in the original post; the code implies a 'sentences' column in the first file and 'word'/'replacement' columns in the second.)
Try:
text=pd.read_csv("sentences.csv")
change=pd.read_csv("replace.csv")
toupdate = dict(zip(change.word, change.replacement))
text = text['sentences'].replace(toupdate, regex=True)
print(text)
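Since the original screenshots are missing, here is a hypothetical round trip showing what that does (the column names come from the code above; the sample rows are invented for illustration):
import pandas as pd

# Hypothetical stand-ins for the two CSV files from the question.
text = pd.DataFrame({'sentences': ['the cat sat', 'a dog ran']})
change = pd.DataFrame({'word': ['cat', 'dog'], 'replacement': ['feline', 'canine']})

toupdate = dict(zip(change.word, change.replacement))
print(text['sentences'].replace(toupdate, regex=True))
# 0    the feline sat
# 1      a canine ran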
DataFrame.replace(x, y) changes a complete value x to y, not part of x.
You have to use regex or a custom function to do what you want. For example:
change_dict = dict(zip(change.word, change.replacement))

def replace_word(txt):
    for key, val in change_dict.items():
        txt = txt.replace(key, val)
    return txt

print(text['sentences'].apply(replace_word))
# Create one more column so the original column is left unchanged.
text["new_sentence"] = text["sentences"]
for changeInd in change.index:
    for eachTextid in text.index:
        # .loc avoids chained-indexing assignment, which may not write back.
        text.loc[eachTextid, "new_sentence"] = text.loc[eachTextid, "new_sentence"].replace(
            change['word'][changeInd], change['replacement'][changeInd])

Extracting gene location from FASTA file

I am trying to extract the gene location from a FASTA file using Biopython, but the .location attribute is not working. I would like to avoid regex because this function will have to work with different files, and they all have slightly different headers.
The header looks like:
>X dna:chromosome chromosome:GRCh38:X:111410060:111411807:-1
I would like the output to be:
start = 111410060
end = 111411807
If the headers of your different FASTA files always end with '...chr:start:end:strand' and the parts are separated by ':', you could split .description with .split(":") and take the antepenultimate and penultimate items of the resulting list (start and end, respectively).
The following works for me with your example header:
from Bio import SeqIO
path = 'fasta_test.fasta'
records = SeqIO.parse(open(path), 'fasta')
record = next(records)
parts = record.description.split(":")
print('start =', parts[-3], 'end =', parts[-2])
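If the file holds several records and you want numeric coordinates, the same split can be applied per record; a small sketch (the int conversion and the dict keyed by record.id are my additions):
from Bio import SeqIO

path = 'fasta_test.fasta'
coords = {}
for record in SeqIO.parse(path, 'fasta'):
    parts = record.description.split(':')
    # start is the antepenultimate field, end the penultimate one
    coords[record.id] = (int(parts[-3]), int(parts[-2]))

print(coords)  # e.g. {'X': (111410060, 111411807)}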

Copy number file format issue (Need to modify the structure)

I have a file in a special format, .cns, which is a segmented file used to analyze copy number. It is a text file that looks like this (header plus first line):
head -1 copynumber.cns
chromosome,start,end,gene,log2
chr1,13402,861395,"LOC102725121,DDX11L1,OR4F5,LOC100133331,LOC100132062,LOC100132287,LOC100133331,LINC00115,SAMD11",-0.28067
We transformed it to a .csv so we could separate it by tab (but that didn't work well). The .cns is comma-separated, but the genes are a single string delimited by quotes. I hope this is useful. The output I need is something like this:
gene log2
LOC102725121 -0.28067
DDX11L1 -0.28067
OR4F5 -0.28067
PIK3CA 0.35475
NRAS 3.35475
The first step would be to separate everything by commas, then transpose columns, and finally print the log2 value for each gene that was contained in that quoted string. If you could help me with an R or Python script it would help a lot. Perhaps awk would work too.
I am using Linux Ubuntu 16.04.
I'm not sure if I am being clear, let me know if this is useful.
Thank you!
Hope the following Python code helps:
import csv

list1 = []
with open('copynumber.cns', 'r') as file:
    exampleReader = csv.reader(file)
    for row in exampleReader:
        list1.append(row)

for row in list1:
    # Fourth column (gene) holds a comma-separated list; split it up.
    strings = row[3].split(',')
    for string in strings:  # loop through each gene name
        print(string + ' ' + str(row[4]))
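An equivalent pandas sketch for the same reshaping (my alternative; it assumes pandas 0.25+ for explode and that the file really is plain CSV with a header):
import pandas as pd

df = pd.read_csv('copynumber.cns')
# Split the quoted gene string into a list, then give each gene its
# own row carrying the segment's log2 value.
out = df.assign(gene=df['gene'].str.split(',')).explode('gene')[['gene', 'log2']]
print(out.to_string(index=False))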
