Copy number file format issue (Need to modify the structure) - python

I have a file in a special format, .cns, which is a segmented file used to analyze copy number. It is a text file that looks like this (header plus first line):
head -1 copynumber.cns
chromosome,start,end,gene,log2
chr1,13402,861395,"LOC102725121,DDX11L1,OR4F5,LOC100133331,LOC100132062,LOC100132287,LOC100133331,LINC00115,SAMD11",-0.28067
We transformed it to a .csv so we could separate it by tab (but it didn't work well). The .cns is separated by commas but genes are a single string delimited by quotes. I hope this is useful. The output I need is something like this:
gene log2
LOC102725121 -0.28067
DDX11L1 -0.28067
OR4F5 -0.28067
PIK3CA 0.35475
NRAS 3.35475
The first step would be to separate everything by commas, then transpose columns, and finally print the log2 value for each gene contained in that quoted string. If you could help me with an R or Python script it would help a lot. Perhaps awk would work too.
I am using Linux Ubuntu 16.04.
I'm not sure if I am being clear, let me know if this is useful.
Thank you!

Hope the following Python code helps. Note that csv.reader handles the quoted gene field for you:
import csv

list1 = []
with open('copynumber.cns', 'r') as file:
    exampleReader = csv.reader(file)
    for row in exampleReader:
        list1.append(row)

for row in list1:
    strings = row[3].split(',')  # Get fourth column in CSV, i.e. the gene column, and split on each occurrence of a comma
    for string in strings:  # Loop through each gene name
        print(string + ' ' + str(row[4]))
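If pandas is available, the same reshaping can be done in a few lines — a minimal sketch, assuming pandas 0.25+ (for explode) and the header shown above:
import pandas as pd

# csv-style quoting in the gene column is handled automatically by read_csv;
# explode then gives each gene in the split list its own row.
df = pd.read_csv('copynumber.cns')
df = df.assign(gene=df['gene'].str.split(',')).explode('gene')
print(df[['gene', 'log2']].to_string(index=False))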

commas in between data cells not quoted load to dataframe in pandas

How do I read a comma-separated CSV file whose cells contain commas but no quotes, in Python? For example, the CSV file is in the format below:
product,unit,count,alter,denom
(any name or id) xyz,kg,1,000,volume,1
1142,KG,1,000,L,910
1143,v,1,000,L,910
11144,K,1,EA,1
11529,KG,1,EA,1
11548,V,1,EA,10
11551,V,1,EA,4
11562,K,1,000,TO,1
11567,K,28,EA,100
11569,v,1,000,TO,1
Here the count value is 1,000, but its thousands separator is a comma, so it is read as two values. This should be rectified when loading the data into dataframes. The output should look like:
product unit count alter denom
xyz kg 1,000 volume 1
1142 KG 1,000 L 910
I have used:
df=pd.read_csv("filename.csv",sep=",")
The fundamental problem is that your input is not a valid .csv file. Either a comma is part of the data or it is a field delimiter. It can't be both.
The simplest approach is to go back to whoever or whatever supplied the file and complain that the format is invalid.
The producer of the file has several, usually easy, options to fix this: (1) suppress the thousands separator; (2) quote the field containing the comma, for example "1,000"; (3) choose a different field delimiter, such as ;. The last is a very common approach in Europe, where , frequently means a decimal point and is therefore a poor field delimiter.
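For illustration, here is option (2) on the producer's side — a minimal sketch with made-up rows, using the standard csv module:
import csv

# QUOTE_ALL wraps every field in quotes, so a value like "1,000"
# survives as a single cell instead of splitting into two.
rows = [["xyz", "kg", "1,000", "volume", "1"]]
with open("fixed.csv", "w", newline="") as f:
    csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)
# fixed.csv now contains: "xyz","kg","1,000","volume","1"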
You should not be in the position of having to clean up someone else's sloppy export.
However, since you have the file that you have, and don't seem in a position to take this advice, your only option is to reprocess the file so that it is valid.
The approach is to read the defective input file, check each row to see how many fields it has, and if it has one too many and the cause is a thousands separator comma masquerading as a field delimiter, then glue the two halves of the number back together; and then write out the modified file.
# fixit.py
# Program to accept an invalid csv file with an unescaped comma in column 3 and regularize it
# Use like this: python fixit.py < wrongfile.csv > rightfile.csv
import sys
import csv

def fix(row: list[str]) -> list[str]:
    """
    If there are 5 columns:
        return unchanged.
    If there are 6 columns
    and columns 2 and 3 can be interpreted as a number with a thousands separator:
        combine columns 2 and 3 and return the row.
    Otherwise return an empty list.
    """
    if len(row) == 5:
        return row
    if len(row) == 6 and row[2].isdigit() and row[3].isdigit():
        return row[:2] + [row[2] + row[3]] + row[4:]
    return []

def main(infile, outfile):
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        if fixed_row := fix(row):
            writer.writerow(fixed_row)
        else:
            print(f"Line {reader.line_num} could not be fixed", file=sys.stderr)

if __name__ == '__main__':
    # The csv module does its own thing with end-of-line handling and requires newline=""
    sys.stdout.reconfigure(newline="")
    main(sys.stdin, sys.stdout)
Given this input:
product,unit,count,alter,denom
(any name or id) xyz,kg,1,000,volume,11142,KG,1,000,L,910
1143,v,1,000,L,910
11144,K,1,EA,1
11529,KG,1,EA,1
11548,V,1,EA,10
11551,V,1,EA,4
11562,K,1,000,TO,1
11567,K,28,EA,100
11569,v,1,000,TO,1
you will see this output:
product,unit,count,alter,denom
1143,v,1000,L,910
11144,K,1,EA,1
11529,KG,1,EA,1
11548,V,1,EA,10
11551,V,1,EA,4
11562,K,1000,TO,1
11567,K,28,EA,100
11569,v,1000,TO,1
along with a warning written to the console about line 2.
Your question shows the data with a blank line between each row of data. I'm assuming that your data is not really like that and the blank lines are an artifact of how the question was formatted. But if your data really is like that, the program will still work; you will just get a warning for each blank line, and there won't be any blank lines in the output, which is fine because pandas.read_csv() doesn't need them.
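Once the file is regularized, loading it into a dataframe as you originally tried should just work — a sketch, assuming the output name from the usage comment in fixit.py:
import pandas as pd

# The repaired file has exactly 5 fields per row, so no special options are needed.
df = pd.read_csv("rightfile.csv")
print(df.head())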

Reading columns of a txt file in Python

I am working with a .txt file. It has 100 rows and 5 columns. I need to divide it into five vectors of length 100, one for each column. I am trying to follow this: Reading specific columns from a text file in python.
However, when I implement it as:
token = open('token_data.txt','r')
linestoken = token.readlines()
resulttoken = []
for x in linestoken:
    resulttoken.append(x.split(' ')[1])
token.close()
I don't know how this is stored. If I write print('resulttoken'), nothing appears on my screen.
Can someone please tell me what I am doing wrong?
Thanks.
(screenshot of part of the text file omitted)
x.split(' ') is not useful here, because the columns of your text file are separated by more than one space. Use x.split() with no argument to split on any whitespace:
token = open('token_data.txt','r')
linestoken = token.readlines()
tokens_column_number = 1
resulttoken = []
for x in linestoken:
    resulttoken.append(x.split()[tokens_column_number])
token.close()
print(resulttoken)
Well, the file looks like it is split by tabs rather than spaces, so try this:
token = open('token_data.txt','r')
linestoken = token.readlines()
tokens_column_number = 1
resulttoken = []
for x in linestoken:
    resulttoken.append(x.split('\t')[tokens_column_number])
token.close()
print(resulttoken)
You want a list of five distinct lists, and append to each in turn.
columns = [[] for _ in range(5)]  # five distinct lists; [[]] * 5 would create five references to one list
with open('token_data.txt','r') as token:
    for line in token:
        for field, value in enumerate(line.split()):
            columns[field].append(value)
Now, you will find the first value from the first line in columns[0][0], the second value from the first line in columns[1][0], the first value from the second line in columns[0][1], etc.
To print the value of a variable, don't put quotes around it. Quotes create a literal string.
print(columns[0][0])
prints the value of columns[0][0] whereas
print('columns[0][0]')
simply prints the literal text "columns[0][0]".
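As an aside, the same column-wise split can be written more compactly with zip — a sketch assuming whitespace-separated columns and equal-length rows:
with open('token_data.txt','r') as token:
    columns = [list(col) for col in zip(*(line.split() for line in token))]
# columns[0] is the first column, columns[1] the second, and so on.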
You can use the data_py package to read column-wise data, FORTRAN-style.
Install this package using
pip install data-py
Usage Example
from data_py import datafile

NoOfLines = 0
lineNumber = 2  # Line number to read (excluding lines starting with '#')
df1 = datafile("C:/Folder/SubFolder/data-file-name.txt")
df1.separator = ","  # No need to specify if the separator is a space (" "); for tab-separated values use '\t'
NoOfLines = df1.lines  # Total number of lines in the data file (excluding lines starting with '#')
[Col1, Col2, Col3, Col4, Col5] = ["", "", "", "", ""]  # Initial values
[Col1, Col2, Col3, Col4, Col5] = df1.read([Col1, Col2, Col3, Col4, Col5], lineNumber)
print(Col1, Col2, Col3, Col4, Col5)  # In str format
For details please follow the link https://www.respt.in/p/python-package-datapy.html

How to organise a DataFrame like this, in Python

I have a file which has some information:
1. Movie ID (the characters before a ":")
2. User ID
3. User Rating
4. Date
All elements are separated by a ",", except Movie ID, which is followed by a colon.
If I create a dataframe like this:
df=pd.read_csv('combined_data_1.txt',header=None,names=['Movie_ID','User_ID','Rating','Date'])
and print the dataframe, the result is obviously not correct (it was shown as a screenshot).
If you look at the "Movie_ID" column, in the first row there is 1:1488844. Only the number "1" (just before the colon) should be in the "Movie_ID" column, not "1:1488844". The rest (1488844) should be in the User_ID column.
Another problem is that not every row of the "Movie_ID" column has its correct ID; in this case it should be "1" until I find another movie ID, which again will be the first number before a colon.
I know that the IDs of all the movies follow a sequence: 1,2,3,4,...
Another problem I saw was that when I read the file, for some reason a split occurs when there is a colon, so after the first row (which doesn't get split), when a colon appears, a row in "Movie_ID" is created containing only, for example, "2:", not something like the first row.
In the end, I would like a dataframe where every row carries its correct Movie_ID, but I don't know how to organise it like this.
Thanks for the help!
Use shift with axis=1 and simply modify the columns:
df = df.shift(axis=1)
df['Movie_ID'] = df['User_ID'].str[0]
df['User_ID'] = df['User_ID'].str[2:]
And now:
print(df)
shows the desired result.
I believe the issue comes from how your data is stored and parsed: your Movie ID is followed by a : (colon) rather than separated by a , (comma) as a CSV requires.
If you are able to preprocess the text so that it is delimited by commas exclusively before it is opened as a CSV, you may be able to eliminate this issue. I only note this because pandas does not permit multiple delimiters.
Here is what I was able to come up with for splitting the file by colons and commas as you describe. While I know this isn't your ultimate goal, hopefully it gets you on the right path.
import pandas as pd

with open("combined_data_1.txt") as file:
    lines = file.readlines()

# Splitting the data into a list delineated by colons
data = []
for line in lines:
    if ":" in line:
        data.append([])
    else:
        # Using else here prevents the line containing the colon from being saved.
        data[len(data) - 1].append(line)

for x in range(len(data)):
    print("Section " + str(x + 1) + ":\n")
    print(str(data[x]) + "\n\n")
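To go all the way to the dataframe you describe, one approach is to forward-fill the movie IDs — a minimal, untested sketch assuming the layout described above (lines like "1:" introduce a movie, followed by User_ID,Rating,Date rows; column names are the question's):
import pandas as pd

# Read everything as strings; marker lines like "1:" parse as a lone User_ID field.
df = pd.read_csv('combined_data_1.txt', header=None,
                 names=['User_ID', 'Rating', 'Date'], dtype=str)

is_movie_row = df['User_ID'].str.endswith(':')  # marker rows like "1:"
# Take the ID from each marker row and carry it forward over the rows below it.
df['Movie_ID'] = df['User_ID'].where(is_movie_row).str.rstrip(':').ffill()
df = df[~is_movie_row].reset_index(drop=True)   # drop the marker rows
df = df[['Movie_ID', 'User_ID', 'Rating', 'Date']]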

Writing out comma separated values in a single cell in spreadsheet

I am cataloging the attribute fields for each feature class in the input list below, and then writing to a spreadsheet each attribute's occurrence in one or more of the feature classes.
import arcpy, collections, re

arcpy.env.overwriteOutput = True
input = [...]  # list of feature classes
outfile = ...  # path to csv file
f = open(outfile, 'w')
f.write('ATTRIBUTE,FEATURE CLASS\n\n')

mydict = collections.defaultdict(list)
for fc in input:
    cmp = []
    lstflds = arcpy.ListFields(fc)
    for fld in lstflds:
        cmp.append(fld.name)
    for item in cmp:
        mydict[item].append(fc)

for keys, vals in mydict.items():
    # remove these characters
    char_removal = ["[", "'", ",", "]"]
    new_char = '[' + re.escape(''.join(char_removal)) + ']'
    v = re.sub(new_char, '', str(vals))
    line = ','.join([keys, v]) + '\n'
    print line
    f.write(line)
f.close()
This code gets me 90% of the way to the intended solution. I still cannot get the feature classes (values) to be separated by commas within the same cell; being comma-delimited, each value shifts over to the next column, as I mentioned. In this particular code the variable v (the feature class names) is output to the spreadsheet with the values separated by a space (" ") in the same cell. Not a huge deal, because replacing " " with "," can be done very quickly in the spreadsheet itself, but it would be nice to work this into the code to improve reusability.
For a CSV file, use double-quotes around the cell content to preserve interior commas within, like this:
content1,content2,"content3,contains,commas",content4
Generally speaking, many libraries that output CSV just put all contents in quotes, like this:
"content1","content2","content3,contains,commas","content4"
As a side note, I'd strongly recommend using an existing library to create CSV files instead of reinventing the wheel. One such library, csv, is built into Python.
As they say, "Good coders write. Great coders reuse."
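To see that quoting behaviour in isolation, here is a quick standalone sketch (illustrative values only):
import csv
import sys

# csv.writer adds the quotes automatically whenever a field contains the delimiter.
csv.writer(sys.stdout).writerow(
    ["content1", "content2", "content3,contains,commas", "content4"])
# prints: content1,content2,"content3,contains,commas",content4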
import arcpy, collections, re, csv

arcpy.env.overwriteOutput = True
input = [...]  # list of feature classes
outfile = ...  # path to output csv file
f = open(outfile, 'wb')
csv_write = csv.writer(f)
csv_write.writerow(['Field', 'Feature Class'])
csv_write.writerow('')

mydict = collections.defaultdict(list)
for fc in input:
    cmp = []
    lstflds = arcpy.ListFields(fc)
    for fld in lstflds:
        cmp.append(fld.name)
    for item in cmp:
        mydict[item].append(fc)

for keys, vals in mydict.items():
    # remove these characters
    char_removal = ["[", "'", "]"]
    new_char = '[' + re.escape(''.join(char_removal)) + ']'
    v = re.sub(new_char, '', str(vals))
    csv_write.writerow([keys, v])
f.close()

Looping a write command to output many different indices from a list separately in Python

I'm trying to get an output like:
KPLR003222854-2009131105131
in a text file. The way I am attempting to derive that output is as such:
with open('Processed_Data.txt', 'r') as file_P, open('KIC_list.txt', 'w') as namelist:
    nameData = []
    for line in file_P:
        splt_file_P = line.split()
        nameData.append(splt_file_P[0])
    for key in nameData:
        namelist.write('\n' 'KPLR00' + "".join(str(w) for w in nameData) + '-2009131105131')
However, I am having an issue: all the numbers in the nameData list appear at once in the output, instead of one ID per line as shown above. The output is something like this:
KPLR00322285472138721382172198371823798123781923781237819237894676472634973256279234987-2009131105131
So my question is how do I loop the write command in a way that will allow me to get each separate ID (each has a specific index value, but there are over 150) to be properly outputted.
EDIT:
Also, some of the IDs in the list are not the same length, so I want to add 0's to the front of each key to make them all 9 digits. I cheated by adding the 0's into the 'KPLR00' string in quotes, but not all of the IDs need exactly two 0's. The question is: could I add 0's between KPLR and the key in a way that matches the 9-digit format?
Your code looks like it's working as one would expect: "".join(str(w) for w in nameData) makes a string composed of the concatenation of every item in nameData.
Chances are you want:
for key in nameData:
    namelist.write('\n' 'KPLR00' + key + '-2009131105131')
Or even better:
for key in nameData:
    namelist.write('\nKPLR%09i-2009131105131' % int(key))  # no string concatenation
String concatenation tends to be slower, and if you're not only operating on strings, will involve explicit calls to str. Here's a pair of ideone snippets showing the difference: http://ideone.com/RR5RnL and http://ideone.com/VH2gzx
Also, the above form with the format string '%09i' will pad with 0s to make the number up to 9 digits. Because the format is '%i', I've added an explicit conversion to int. See here for full details: http://docs.python.org/2/library/stdtypes.html#string-formatting-operations
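For instance, taking the ID from the desired output at the top of the question:
>>> 'KPLR%09i-2009131105131' % int('3222854')
'KPLR003222854-2009131105131'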
Finally, here's a single line version (excepting the with statement, which you should of course keep):
namelist.write("\n".join("KPLR%09i-2009131105131"%int(line.split()[0]) for line in file_P))
You can change this:
"".join(str(w) for w in nameData)
to this:
",".join(str(w) for w in nameData)
Basically, the "," will comma delimit the elements in your nameData list. If you use "", then there will be nothing to separate the elements, so they appear all at once. You can change the delimiter to suit your needs.
Just for kicks:
with open('Processed_Data.txt', 'r') as file_P, open('KIC_list.txt', 'w') as namelist:
    nameData = [line.split()[0] for line in file_P]
    namelist.write("\n".join("KPLR00" + str(key) + '-2009131105131' for key in nameData))
I think that will work, but I haven't tested it. You can make it even smaller/uglier by not using nameData at all, and just use that list comprehension right in its place.
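For completeness, the even-smaller version without nameData might look like this (an equally untested sketch):
with open('Processed_Data.txt', 'r') as file_P, open('KIC_list.txt', 'w') as namelist:
    namelist.write("\n".join("KPLR00" + line.split()[0] + '-2009131105131' for line in file_P))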
