How to define column headers when reading a csv file in Python

I have a comma-separated value table that I want to read in Python. I first need to tell Python not to skip the first row, because it contains the column headers. Then I need it to read each row in as a list rather than a string, because I want to build an array out of the data and the first column is non-integer (row headers).
There are a total of 11 columns and 5 rows.
Here is the format of the table (except there are no blank lines between rows):
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11
w0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
w1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
w2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
w3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Is there a way to do this? Any help is greatly appreciated!

You can use the csv module for this sort of thing. It will read in each row as a list of strings representing the different fields.
How exactly you'd want to use it depends on how you're going to process the data afterwards, but you might consider making a Reader object (from the csv.reader() function), calling next() on it once to get the first row, i.e. the headers, and then iterating over the remaining lines in a for loop.
r = csv.reader(...)
headers = next(r)
for fields in r:
    # do stuff with the fields of each row
If you're going to wind up putting the fields into a dict, you'd use DictReader instead (and that class will automatically take the field names from the first row, so you can just construct it and use it in a loop).
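For example, here is a minimal sketch of the whole pattern (assuming the table above is saved as data.csv; that filename is mine, not from the question):
import csv

with open("data.csv", newline="") as f:
    r = csv.reader(f)
    headers = next(r)  # first row: the column headers
    rows = list(r)     # remaining rows, each a list of strings

# Keep the row header (first column) as a string; convert the rest to int.
data = [[row[0].strip()] + [int(x) for x in row[1:]] for row in rows]
print(headers)
print(data)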

Related

How to (log) transform *args arguments without losing structure

I am attempting to apply statistical tests to some datasets with variable numbers of groups. This causes a problem when I try to perform a log transformation for said groups while maintaining the ability to perform the test function (in this case scipy's kruskal()), which takes a variable number of arguments, one for each group of data.
The code below is an idea of what I want. Naturally stats.kruskal([np.log(i) for i in args]) does not work, as kruskal() does not expect a list of arrays, but one argument for each array. How do I perform log transformation (or any kind of alteration, really), while still being able to use the function?
import scipy.stats as stats
import numpy as np
def t(*args):
    test = stats.kruskal([np.log(i) for i in args])
    return test
a = [11, 12, 4, 42, 12, 1, 21, 12, 6]
b = [1, 12, 4, 3, 14, 8, 8, 6]
c = [2, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8]
print(t(a, b, c))
IIUC, * in front of the list you are forming while calling kruskal should do the trick:
test = stats.kruskal(*[np.log(i) for i in args])
The asterisk unpacks the list and passes each entry as a separate positional argument to the function being called, i.e. kruskal here.
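Putting it together, the corrected function with the sample data from the question:
import scipy.stats as stats
import numpy as np

def t(*args):
    # Unpack the list so kruskal() receives one positional
    # argument per log-transformed group.
    return stats.kruskal(*[np.log(i) for i in args])

a = [11, 12, 4, 42, 12, 1, 21, 12, 6]
b = [1, 12, 4, 3, 14, 8, 8, 6]
c = [2, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8]
print(t(a, b, c))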

Dataframe with fixed length (over writing)

I have code that generates a massive amount of data in each round, so I only need to store the data for the last 10 rounds. How can I create a dataframe that erases the oldest object when I add a new object (over-writing)? The order of observations, from old to new, should be maintained. Is there a simple function or data format for this?
Thanks in advance!
You could use this function:
def ins(arr, item):
    if len(arr) < 10:
        arr.insert(0, item)
    else:
        arr.pop()
        arr.insert(0, item)

ex = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ins(ex, 'a')
print(ex)
# ['a', 1, 2, 3, 4, 5, 6, 7, 8, 9]
ins(ex, 'b')
print(ex)
# ['b', 'a', 1, 2, 3, 4, 5, 6, 7, 8]
In order for this to work you MUST pass a list as the argument to ins(), so that the new item is inserted and the 10th is removed (if there is one).
(I treated the question as not being pandas-specific, but rather as asking for a way to store a maximum number of items in an array.)
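If a plain Python sequence is acceptable, the standard library's collections.deque with maxlen does the same bookkeeping automatically (this alternative is my suggestion, not part of the original answer) and keeps old-to-new order:
from collections import deque

# maxlen=10 makes the deque silently discard the oldest item
# once it already holds 10 elements.
rounds = deque(maxlen=10)
for i in range(15):
    rounds.append(i)  # newest goes on the right, old-to-new order

print(list(rounds))
# [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]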

connecting two dictionaries and storing it into an RDD

I have a dictionary users with 1748 elements (showing only the first 12):
defaultdict(int,
{'470520068': 1,
'2176120173': 1,
'145087572': 3,
'23047147': 1,
'526506000': 1,
'326311693': 1,
'851106379': 4,
'161900469': 1,
'3222966471': 1,
'2562842034': 1,
'18658617': 1,
'73654065': 4,})
and another dictionary partition with 452743 elements (showing the first 42):
{'609232972': 4,
'975151075': 4,
'14247572': 4,
'2987788788': 4,
'3064695250': 2,
'54097674': 3,
'510333371': 0,
'34150587': 4,
'26170001': 0,
'1339755391': 3,
'419536996': 4,
'2558131184': 2,
'23068646': 6,
'2781517567': 3,
'701206260771905541': 4,
'754263126': 4,
'33799684': 0,
'1625984816': 4,
'4893416104': 3,
'263520530': 3,
'60625681': 4,
'470528618': 3,
'4512063372': 6,
'933683112': 3,
'402379005': 4,
'1015823005': 2,
'244673821': 0,
'3279677882': 4,
'16206240': 4,
'3243924564': 6,
'2438275574': 6,
'205941266': 3,
'330723222': 1,
'3037002897': 0,
'75454729': 0,
'3033154947': 6,
'67475302': 3,
'922914019': 6,
'2598199242': 6,
'2382444216': 3,
'1388012203': 4,
'3950452641': 5,}
The keys in users (all unique) all appear in partition, possibly repeated with different values (partition also contains some extra keys that are of no use to us). What I want is a new dictionary final that matches the keys of users against those of partition and takes the values from partition, i.e. if '145087572' is a key in users and the same key appears two or three times in partition with different values, such as {'145087572': 2, '145087572': 3, '145087572': 7}, then I should get all three of those elements in the new dictionary final. I also have to store this dictionary as a key:value RDD.
Here's what I tried:
user_key = list(users.keys())
final = []
for x in user_key:
    s = {x: partition.get(x) for x in partition}
    final.append(s)
After running this code my laptop stops responding (the cell still shows [*]) and I have to restart it. Is there a problem with my code, and is there a more efficient way to do this?
First, a dictionary cannot hold duplicate keys; a duplicate key's value will be overwritten by the last value assigned to that key.
Now let's analyze your code:
user_key = list(users.keys())  # here you get all the keys, say (1, 2, 3)
final = []
for x in user_key:  # you are iterating over the keys, so x will be 1, 2, 3
    s = {x: partition.get(x) for x in partition}  # this is the reason for the halting
    '''
    Breaking the above line down, this is what it looks like:
        s = {}
        for x in partition:
            s[x] = partition.get(x)
    The outer for loop and the dict comprehension use the same variable x,
    so instead of looking up the single key from users, you iterate over
    all the keys of the partition table (x is rebound inside the
    comprehension, which walks every key of partition).
    '''
    final.append(s)
Now, the reason for the halting: say you have 10 keys in the users dictionary. The outer for loop will iterate 10 times, and on each of those iterations the inner loop will walk all the partition keys and build a complete copy of partition. That exhausts memory, and eventually your system hangs because it runs out of it.
I think this will work for you: store the partition data in a Python defaultdict(list).
from collections import defaultdict

user_keys = users.keys()
part_dict = defaultdict(list)

# First reshape your partition data as a list of [key, value] pairs
# (list inside list): partition = [[key1, value], [key2, value], ...]
for key, value in partition:
    # defaultdict(list) creates the list on first use, so appending
    # works for new and repeated keys alike.
    part_dict[key].append(value)

# part_dict = {key1: [1, 2, 3], key2: [1, 2, 3], key3: [4, 5], ...}
final = []
for x in user_keys:
    for value in part_dict[x]:
        final.append([x, value])

# final = [[key1, 1], [key1, 2], [key1, 3], ...]
# If you want the results in dictionary format instead (I don't think
# it's required), use final.append({x: value}) in the loop:
# final = [{key1: 1}, {key2: 2}, ...]
The above code is not tested; some minor changes may be required.
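To then store the result as a key:value RDD, as the question asks, here is a minimal sketch. It assumes a running SparkContext named sc (my assumption; the original answer does not cover this part):
# Assumption: sc is an existing pyspark.SparkContext.
# final is a list of [key, value] pairs, so parallelizing it
# yields a pair RDD directly.
rdd = sc.parallelize(final)
print(rdd.take(5))  # peek at the first few (key, value) pairs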

creating nested dictionary to loop over my text files and folders to create multiple key dictionary

I have counts.txt files in 50 folders, each related to one sample. There are two columns in counts.txt: the first is a string and the other is a number. I am trying to build a nested dictionary from them. The goal is to use the first column of counts.txt together with the folder names as the keys of the dictionary, and the second column of counts.txt as the value. Unfortunately, the loop over the list of folders that should give me the proper shape is not working, and I am facing a problem!
import os
from natsort import natsorted

path1 = "/home/ali/Desktop/SAMPLES/"
data_ali = {}
samples_name = natsorted(os.listdir(path1))
for i in samples_name:
    with open(path1 + i[0:] + "/counts.txt", "rt") as fin:
        for l in fin.readlines():
            l = l.strip().split()
            if l[0][:4] == 'ENSG':
                gene = l[0]
                data_ali[gene] = {}
                reads = int(l[1])
                data_ali[gene][samples_name] = reads
print(data_ali)
I expect output like this:
'ENSG00000120659': {
'Sample_1-Leish_011_v2': 14,
'Sample_2-leish_011_v3': 7,
'Sample_3-leish_012_v2': 6,
'Sample_4-leish_012_v3': 1,
'Sample_5-leish_015_v2': 9,
'Sample_6-leish_015_v3': 3,
'Sample_7-leish_016_v2': 4,
'Sample_8-leish_016_v3': 8,
'Sample_9-leish_017_v2': 8,
'Sample_10-leish_017_v3': 2,
'Sample_11-leish_018_v2': 4,
'Sample_12-leish_018_v3': 4,
'Sample_13-leish_019_v2': 7,
'Sample_14-leish_019_v3': 4,
'Sample_15-leish_021_v2': 12,
'Sample_16-leish_021_v3': 5,
'Sample_17-leish_022_v2': 4,
'Sample_18-leish_022_v3': 2,
'Sample_19-leish_023_v2': 9,
'Sample_20-leish_023_v3': 6,
'Sample_21-leish_024_v2': 22,
'Sample_22-leish_024_v3': 10,
'Sample_23-leish026_v2': 9,
'Sample_24-leish026_v3': 5,
'Sample_25-leish027_v2': 4,
'Sample_26-leish027_v3': 1,
'Sample_27-leish028_v2': 7,
'Sample_28-leish028_v3': 5,
'Sample_29-leish032_v2': 8,
'Sample_30-leish032_v3': 2
}
Try this:
if l[0][:4] == 'ENSG':
    gene = l[0]
    reads = int(l[1])
    data_ali.setdefault(gene, {})[i] = reads
Two important changes:
Your code data_ali[gene]={} always cleared what was previously there and made a new empty dictionary instead. setdefault only creates the dictionary if the key gene is not already present.
The second key should be i, not the list samples_name.
Full code cleanup:
import os
from natsort import natsorted

root = "/home/ali/Desktop/SAMPLES/"
data_ali = {}
for sample_name in natsorted(os.listdir(root)):
    with open(os.path.join(root, sample_name, "counts.txt"), "r") as fin:
        for line in fin.readlines():
            gene, reads = line.split()
            reads = int(reads)
            if gene.startswith('ENSG'):
                data_ali.setdefault(gene, {})[sample_name] = reads
print(data_ali)
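As a side note (my addition, not part of the original answer): collections.defaultdict(dict) would let you drop the setdefault call entirely, since the inner dictionary is created automatically on first access:
from collections import defaultdict

# Equivalent bookkeeping with defaultdict(dict); the sample keys and
# values here are taken from the expected output above.
data_ali = defaultdict(dict)
data_ali['ENSG00000120659']['Sample_1-Leish_011_v2'] = 14
data_ali['ENSG00000120659']['Sample_2-leish_011_v3'] = 7
print(dict(data_ali))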

Piping a pipe-delimited flat file into python for use in Pandas and Stats

I have searched a lot, but haven't found an answer to this.
I am trying to pipe a flat file of data into Python, into something Python can read and that I can run analysis on (for instance, a t-test).
First, I created a simple pipe delimited flat file:
1|2
3|4
4|5
1|6
2|7
3|8
8|9
and saved it as "simpledata".
Then I created a script in nano as follows:
#!/usr/bin/env python
import sys
from scipy import stats
A = sys.stdin.read()
print A
paired_sample = stats.ttest_rel(A[:,0],A[:,1])
print "The t-statistic is %.3f and the p-value is %.3f." % paired_sample
Then I save the script as pairedttest.sh and run it as
cat simpledata | pairedttest.sh
The error I get is
TypeError: string indices must be integers, not tuple
Thanks for your help in advance
Are you trying to call this?:
paired_sample = stats.ttest_rel([1,3,4,1,2,3,8], [2,4,5,6,7,8,9])
If so, you can't do it the way you're trying. A is just a string when you read it from stdin, so you can't index it the way you're trying. You need to build the two lists from the string. The most obvious way is like this:
left = []
right = []
for line in A.splitlines():
    l, r = line.split("|")
    left.append(int(l))
    right.append(int(r))
print left
print right
This will output:
[1, 3, 4, 1, 2, 3, 8]
[2, 4, 5, 6, 7, 8, 9]
So you can call stats.ttest_rel(left, right)
Or to be really clever and make a (nearly impossible to read) one-liner out of it:
z = zip(*[map(int, line.split("|")) for line in A.splitlines()])
This will output:
[(1, 3, 4, 1, 2, 3, 8), (2, 4, 5, 6, 7, 8, 9)]
So you can call stats.ttest_rel(*z)
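For reference, here is the whole script with that fix folded in (a sketch, kept in Python 2 to match the print statements in the original script):
#!/usr/bin/env python
import sys
from scipy import stats

A = sys.stdin.read()

# Build the two columns from the pipe-delimited input.
left = []
right = []
for line in A.splitlines():
    l, r = line.split("|")
    left.append(int(l))
    right.append(int(r))

paired_sample = stats.ttest_rel(left, right)
print "The t-statistic is %.3f and the p-value is %.3f." % paired_sample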
