Adding list elements to empty list - python

I'm trying to take the first element of a list created within a loop and add it to an empty list.
Easy, right? But my code isn't working...
The empty list is index.
The list I'm pulling the numbers from is data.
import pandas as pd
import numpy as np

ff = open("/Users/me/Documents/ff_monthly.txt", "r")
data = list(ff)
f = []
index = []
for i in range(1, len(data)):
    t = data[i].split()
    index.append(int(t[0]))
    for j in range(1, 5):
        k = float(t[j])
        f.append(k / 100)
n = len(f)
f1 = np.reshape(f, [n / 4, 4])
ff = pd.DataFrame(f1, index=index, columns=['Mkt_Rf', 'SMB', 'HML', 'Rf'])
ff.to_pickle("ffMonthly.pickle")
print ff.head()
I've checked that the list t is being created correctly inside the loop:
len(t) = 5
print t[0] = 192607
print t = ['192607', '2.95', '-2.54', '-2.65', '0.22']
The code:
index.append(t[0])
should add the first element of the list t to index ... correct?
However, I get this error:
IndexError: list index out of range
What am I missing here?
Edit:
Posted the entire code above.
Here are the first few rows of "ff_monthly.txt"
Mkt-RF SMB HML RF
192607 2.95 -2.54 -2.65 0.22
192608 2.64 -1.22 4.25 0.25
192609 0.37 -1.21 -0.39 0.23
192610 -3.24 -0.09 0.26 0.32
192611 2.55 -0.15 -0.54 0.31
192612 2.62 -0.13 -0.08 0.28

Check whether there is an empty line in your txt file: "".split() returns an empty list, so t[0] raises an IndexError.

Quick and ugly:
for i in range(1, len(data)):
    t = data[i].split()
    if t:
        index.append(int(t[0]))
        for j in range(1, 5):
            k = float(t[j])
            f.append(k / 100)
Reason:
You have empty lines at the end of the file. Reproduced locally.
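For reference, a minimal sketch of the same fix written as a filter over the lines, just an alternative phrasing of the guard above:
for line in data[1:]:  # skip the header row
    t = line.split()
    if not t:  # "".split() == [] on blank lines
        continue
    index.append(int(t[0]))
    for j in range(1, 5):
        f.append(float(t[j]) / 100)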

Related

Calculate running total based off original value in pandas

I wish to take an initial value of 1000, multiply it by the first value in the 'Change' column, then take that result and multiply it by the second value in the 'Change' column, and so on.
I could do this with a loop as follows:
changes = [0.97, 1.02, 1.1, 0.88, 1.01]
df = pd.DataFrame()
df['Change'] = changes
df['Total'] = np.nan
df['Total'][0] = 1000 * df['Change'][0]
for i in range(1, len(df)):
    df['Total'][i] = df['Total'][i-1] * df['Change'][i]
Output:
Change Total
0 0.97 970.000000
1 1.02 989.400000
2 1.10 1088.340000
3 0.88 957.739200
4 1.01 967.316592
But this will be too slow for a large dataset. Is there any way to do this without loops?
Thanks
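For reference, a loop-free sketch: the running total is just a cumulative product of 'Change' scaled by the initial 1000, which Series.cumprod computes in one vectorized step:
import pandas as pd

changes = [0.97, 1.02, 1.1, 0.88, 1.01]
df = pd.DataFrame({'Change': changes})

# cumprod() multiplies down the column; scaling by 1000 seeds the total
df['Total'] = 1000 * df['Change'].cumprod()
print(df)
This reproduces the looped output above without the per-row Python overhead.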

Using pandas.DataFrame.apply to look up and replace values with values from a different DataFrame

I have two pandas DataFrames with the same DateTime index.
The first one is J:
A B C
01/01/10 100 400 200
01/02/10 300 200 400
01/03/10 200 100 300
The second one is K:
100 200 300 400
01/01/10 0.05 -0.42 0.61 -0.12
01/02/10 -0.23 0.11 0.82 0.34
01/03/10 -0.55 0.24 -0.01 -0.73
I would like to use J to reference K and create a third DataFrame L that looks like:
A B C
01/01/10 0.05 -0.12 -0.42
01/02/10 0.82 0.11 0.34
01/03/10 0.24 -0.55 -0.01
To do so, I need to take each value in J and look up the corresponding value in K where the column name is that value for the same date.
I tried to do:
L = J.apply( lambda x: K.loc[ x.index, x ], axis='index' )
but get:
ValueError: If using all scalar values, you must pass an index
I would ideally like to use this so that any NaN values contained in J will remain as is, and will not be looked up in K. I had unsuccessfully tried this:
L = J.apply( lambda x: np.nan if np.isnan( x.astype( float ) ) else K.loc[ x.index, x ] )
Use DataFrame.melt and DataFrame.stack with DataFrame.join to map the new values, then return the DataFrame to its original shape with DataFrame.pivot:
# if necessary
# K = K.rename(columns=int)
L = (J.reset_index()
      .melt('index')
      # K.stack() is indexed by (row label, column label),
      # which is exactly what the join keys match
      .join(K.stack().rename('new_values'), on=['index', 'value'])
      .pivot(index='index',
             columns='variable',
             values='new_values')
      .rename_axis(columns=None, index=None)
     )
print(L)
Or with DataFrame.lookup:
L = J.reset_index().melt('index')
L['value'] = K.lookup(L['index'], L['value'])
# pivot(*L) unpacks L's column names, i.e. pivot('index', 'variable', 'value')
L = L.pivot(*L).rename_axis(columns=None, index=None)
print(L)
Output
A B C
01/01/10 0.05 -0.12 -0.42
01/02/10 0.82 0.11 0.34
01/03/10 0.24 -0.55 -0.01
I think that apply could be a good option, but I'm not sure;
I recommend you read When should I use apply in my code.
Use DataFrame.apply with DataFrame.lookup for label-based indexing.
# if needed, convert the columns of K to integers
# K.columns = K.columns.astype(int)
L = J.apply(lambda x: K.lookup(x.index, x))
A B C
01/01/10 0.05 -0.12 -0.42
01/02/10 0.82 0.11 0.34
01/03/10 0.24 -0.55 -0.01
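Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A rough equivalent with NumPy positional indexing (a sketch, assuming J and K share the same row order, J contains no NaNs, and K's columns match the dtype of J's values):
import numpy as np

rows = np.arange(len(J))  # positional row indices, identical in J and K
L = pd.DataFrame(
    {c: K.to_numpy()[rows, K.columns.get_indexer(J[c])] for c in J.columns},
    index=J.index,
)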

Pandas dataframe to nested counter dictionary

I've seen a lot of questions on how to convert pandas dataframes to nested dictionaries, but none of them deal with aggregating the information. I may even be able to do what I need within pandas, but I'm stuck.
Input
I have a dataframe that looks like this:
FeatureID gene Target pos bc_count
0 1_1_1 NRAS_3 TAGCAC 0 0.42
1 1_1_1 NRAS_3 TGCACA 1 1.00
2 1_1_1 NRAS_3 GCACAA 2 0.50
3 1_1_1 NRAS_3 CACAAA 3 2.00
4 1_1_1 NRAS_3 CAGAAA 3 0.42
# create df as below
import pandas as pd

df = pd.DataFrame([
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "TAGCAC", "pos": 0, "bc_count": 0.42},
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "TGCACA", "pos": 1, "bc_count": 1.00},
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "GCACAA", "pos": 2, "bc_count": 0.50},
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "CACAAA", "pos": 3, "bc_count": 2.00},
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "CAGAAA", "pos": 3, "bc_count": 0.42},
])
The problem
I need to break apart the Target column of each row into a list of (position, letter, count) tuples, where the starting position comes from the "pos" column, each subsequent letter of the string increments the position by one, and the count is that row's "bc_count" value.
For example, in the first row, the desired list of tuples would be:
[(0, "T", 0.42), (1,"A", 0.42), (2,"G", 0.42), (3,"C", 0.42), (4,"A", 0.42), (5,"C", 0.42)]
What I've tried
I've created code that breaks up the Target column into the positions found, returning tuples of position, nucleotide (letter), and count for that letter, and adds them as a column to the dataframe:
def index_target(row):
    return [(row.pos + x, y, row.bc_count)
            for x, y in enumerate(row.Target)]

df['pos_count'] = df.apply(index_target, axis=1)
Which returns a list of tuples for each row based on that row's target column.
I need to take every row in df, for each target, and sum the counts. Which is why I thought of using a dictionary as a counter:
position[letter] += bc_count
I've tried creating a defaultdict, but it is appending each list of tuples separately instead of summing the counts for each position:
from collections import defaultdict

d = defaultdict(dict)  # also tried defaultdict(list) here
for x, y, z in row.pos_count:
    d[x][y] += z
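For what it's worth, a counter that sums correctly needs a numeric leaf default so that d[x][y] += z starts from 0.0; a minimal sketch:
from collections import defaultdict

d = defaultdict(lambda: defaultdict(float))  # missing keys start at 0.0
for pos_count in df['pos_count']:
    for pos, letter, count in pos_count:
        d[pos][letter] += count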
Desired Output
For each feature in the dataframe, the numbers below are the sums of the individual bc_count values found at each position, and in the consensus an X marks positions where ties were found and no single letter can be returned as the max:
pos A T G C
0 25 80 25 57
1 32 19 100 32
2 27 18 16 27
3 90 90 90 90
4 10 42 37 18
consensus= TGXXT
This may not be the most elegant solution, but I think it might accomplish what you need:
new_df = pd.DataFrame(
    df.apply(
        # this lambda is basically the same thing you're doing,
        # but we create a pd.Series with it
        lambda row: pd.Series(
            [(row.pos + i, c, row.bc_count) for i, c in enumerate(row.Target)]
        ),
        axis=1)
    .stack().tolist(),
    columns=["pos", "nucl", "count"]
)
Where new_df looks like this:
pos nucl count
0 0 T 0.42
1 1 A 0.42
2 2 G 0.42
3 3 C 0.42
4 4 A 0.42
5 5 C 0.42
6 1 T 1.00
7 2 G 1.00
8 3 C 1.00
9 4 A 1.00
Then I would pivot this to get the aggregated counts:
nucleotide_count_by_pos = new_df.pivot_table(
    index="pos",
    columns="nucl",
    values="count",
    aggfunc="sum",
    fill_value=0
)
Where nucleotide_count_by_pos looks like:
nucl A C G T
pos
0 0.00 0.00 0.00 0.42
1 0.42 0.00 0.00 1.00
2 0.00 0.00 1.92 0.00
3 0.00 4.34 0.00 0.00
4 4.34 0.00 0.00 0.00
And then to get the consensus:
def get_consensus(row):
    max_value = row.max()
    nuc = row.idxmax()
    if (row == max_value).sum() == 1:
        return nuc
    else:
        return "X"

consensus = ''.join(nucleotide_count_by_pos.apply(get_consensus, axis=1).tolist())
Which in the case of your example data would be:
'TTGCACAAA'
Unsure how to get your desired output, but I created the list d, which contains the tuples you wanted for a dataframe. Hopefully it provides some direction toward what you want to create:
d = []
for t, c, p in zip(df.Target, df.bc_count, df.pos):
    # enumerate so each letter gets its own offset position
    d.extend([(p + j, c, ch) for j, ch in enumerate(t)])

df_new = pd.DataFrame(d, columns=['pos', 'count', 'val'])
df_new = df_new.groupby(['pos', 'val']).agg({'count': 'sum'}).reset_index()
df_new.pivot(index='pos', columns='val', values='count')

Nested for loops seem to be overlapping

I'm writing a program that has two nested for loops. I wrote my idea down on paper, drew a diagram for it, and thought my logic was fine, but when I executed it I got something totally different from what I wanted.
Then I created a small version of my program with only those two loops, but the problem remains. So here is the smaller version I was testing:
lopt = 0
dwl = 0.04
Dt = 0.01
for i in numpy.arange(0, 5):
    for t in numpy.arange(lopt, dwl, Dt):  # inner loop (time)
        print 't = ', t, 'row = ', i
    lopt = t + 0.001
the output I got was
t = 0.0 row = 0
t = 0.01 row = 0
t = 0.02 row = 0
t = 0.03 row = 0
t = 0.031 row = 1
t = 0.032 row = 2
t = 0.033 row = 3
t = 0.034 row = 4
but what I want it to be is
t = 0.0 row = 0
t = 0.01 row = 0
t = 0.02 row = 0
t = 0.03 row = 0
t = 0.04 row = 0
t = 0.041 row = 1
t = 0.051 row = 1
t = 0.061 row = 1
t = 0.071 row = 1
t = 0.081 row = 1
t = 0.082 row = 2
t = 0.092 row = 2
t = 0.102 row = 2
t = 0.112 row = 2
t = 0.122 row = 2
t = 0.123 row = 3
t = 0.133 row = 3
t = 0.143 row = 3
t = 0.153 row = 3
t = 0.163 row = 3
t = 0.164 row = 4
t = 0.174 row = 4
t = 0.184 row = 4
t = 0.194 row = 4
t = 0.204 row = 4
t = 0.205 row = 5
t = 0.215 row = 5
t = 0.225 row = 5
t = 0.265 row = 5
t = 0.275 row = 5
My logic is:
The outer loop starts at 0, then the inner loop goes from 0 to 0.04. Then control returns to the end of the outer loop body, which increments lopt by 0.001, and the outer loop takes another step. Now the inner loop runs one more time, starting at 0.041 and going to 0.081. This keeps repeating until the outer loop has executed 6 times.
Another question I have: in my current output the inner loop seems to execute successfully on the first pass, but it only goes to 0.03. Since my range goes from 0 to 0.04 with an increment of 0.01, shouldn't the loop run 0-0.01-0.02-0.03-0.04? Also, shouldn't the outer loop run 0-1-2-3-4-5?
I'm sorry if I'm bugging you with this question, but I honestly drew like 8 diagrams for this problem and also worked it through by hand, and I think it should run as intended. I feel really dumb, because when I did it by hand I just couldn't catch any flaw in the logic.
Firstly, the range returned by numpy.arange excludes the stop value, so
for i in numpy.arange(0, 5):
    print i
will print 0 1 2 3 4 but not 5. In order to get it to print the 5, you want to increase the stop by the step, which is 1 in the case of the outer loop (by default) and 0.01 in the case of the inner one.
After the first execution of the inner loop, lopt is 0.031, which means that the arguments to the inner numpy.arange are (0.031, 0.04, 0.01), which results in only one iteration. What you really want is
lopt = 0
dwl = 0.04
Dt = 0.01
for i in numpy.arange(0, 6):  # stop is exclusive, so 6 gives rows 0-5
    for t in numpy.arange(lopt, dwl + lopt + Dt, Dt):
        print 't = ', t, 'row = ', i
    lopt = t + 0.001
because that way the second evaluation of that inner numpy.arange will have arguments of (0.041, 0.091, 0.01), the third will be (0.082, 0.132, 0.01), and so on.
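A side note, not raised in the thread: chaining float arange ranges like this accumulates rounding error. A sketch that derives each row's times from integer counters instead, so the within-row steps stay exact:
steps = int(round(dwl / Dt))  # 4 steps after the start value
t0 = 0.0
for i in range(6):  # rows 0-5
    for k in range(steps + 1):
        t = t0 + k * Dt
        print 't = ', t, 'row = ', i
    t0 = t + 0.001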

Efficient Pandas Dataframe insert

I'm trying to add float values like [[(1,0.44),(2,0.5),(3,0.1)],[(2,0.63),(1,0.85),(3,0.11)],[...]]
to a Pandas dataframe that looks like a matrix built from the first value of the tuples:
df =    1     2     3
     1  0.44  0.50  0.10
     2  0.85  0.63  0.11
     3  ...   ...   ...
I tried this:
for key, value in enumerate(outer_list):
    for tuplevalue in value:
        df.ix[key][tuplevalue[0]] = tuplevalue[1]
The problem is that my NxN matrix contains about 10000x10000 elements, so my approach takes really long. Is there another way to speed this up?
(Unfortunately the values in the list are not ordered by the first tuple element.)
Use list comprehensions to first sort and extract your data. Then create your dataframe from the sorted and cleaned data.
data = [[(1, 0.44), (2, 0.50), (3, 0.10)],
        [(2, 0.63), (1, 0.85), (3, 0.11)]]

# First, sort each row in place by the tuples' first element.
_ = [row.sort() for row in data]

# Then extract the second element of each tuple.
new_data = [[t[1] for t in row] for row in data]

# Now create a dataframe from your data.
>>> pd.DataFrame(new_data)
0 1 2
0 0.44 0.50 0.10
1 0.85 0.63 0.11
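For the 10000x10000 case, a sketch of a NumPy-based alternative (assuming integer, 1-based column keys as in the example): preallocate the array and assign positionally instead of writing one dataframe cell at a time.
import numpy as np
import pandas as pd

data = [[(1, 0.44), (2, 0.50), (3, 0.10)],
        [(2, 0.63), (1, 0.85), (3, 0.11)]]

n_rows = len(data)
n_cols = max(k for row in data for k, _ in row)

arr = np.zeros((n_rows, n_cols))
for i, row in enumerate(data):
    for col, val in row:
        arr[i, col - 1] = val  # tuple keys are 1-based

df = pd.DataFrame(arr, index=range(1, n_rows + 1), columns=range(1, n_cols + 1))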
This works using a dictionary (if you need to preserve your column order, or if the column names are strings). Maybe Alexander will update his answer to account for that; I'm nearly certain he'll have a better solution than my proposed one :)
Here's an example:
from collections import defaultdict

a = [[(1, 0.44), (2, 0.5), (3, 0.1)], [(2, 0.63), (1, 0.85), (3, 0.11)]]
b = [[('A', 0.44), ('B', 0.5), ('C', 0.1)], [('B', 0.63), ('A', 0.85), ('C', 0.11)]]
First on a:
row_to_dic = [{str(y[0]): y[1] for y in x} for x in a]
dd = defaultdict(list)
for d in row_to_dic:
    for key, value in d.iteritems():
        dd[key].append(value)
pd.DataFrame.from_dict(dd)
1 2 3
0 0.44 0.50 0.10
1 0.85 0.63 0.11
and b:
row_to_dic = [{str(y[0]): y[1] for y in x} for x in b]
dd = defaultdict(list)
for d in row_to_dic:
    for key, value in d.iteritems():
        dd[key].append(value)
pd.DataFrame.from_dict(dd)
A B C
0 0.44 0.50 0.10
1 0.85 0.63 0.11
