Converting pandas dataframe into list of tuples with index - python

I'm currently trying to convert a pandas dataframe into a list of tuples. However, I'm having difficulty getting the index (which is the date) into each tuple as well. My first step was the question below, but its answers do not add any index to the tuple.
Pandas convert dataframe to array of tuples
My only problem is accessing the index for each row in the numpy array. I have one solution shown below, but it uses an additional counter indexCounter and it looks sloppy. I feel like there should be a more elegant solution to retrieving an index from a particular numpy array.
def get_Quandl_daily_data(ticker, start, end):
    prices = []
    symbol = format_ticker(ticker)
    try:
        data = quandl.get("WIKI/" + symbol, start_date=start, end_date=end)
    except Exception as e:
        print("Could not download QUANDL data: %s" % e)
    subset = data[['Open', 'High', 'Low', 'Close', 'Adj. Close', 'Volume']]
    indexCounter = 0
    for row in subset.values:
        dateIndex = subset.index.values[indexCounter]
        tup = (dateIndex, "%.4f" % row[0], "%.4f" % row[1], "%.4f" % row[2],
               "%.4f" % row[3], "%.4f" % row[4], row[5])
        prices.append(tup)
        indexCounter += 1
Thanks in advance for any help!

You can iterate over the result of to_records(index=True).
Say you start with this:
In [6]: df = pd.DataFrame({'a': range(3, 7), 'b': range(1, 5), 'c': range(2, 6)}).set_index('a')
In [7]: df
Out[7]:
   b  c
a
3  1  2
4  2  3
5  3  4
6  4  5
then this works, except that it does not include the index (a):
In [8]: [tuple(x) for x in df.to_records(index=False)]
Out[8]: [(1, 2), (2, 3), (3, 4), (4, 5)]
However, if you pass index=True, then it does what you want:
In [9]: [tuple(x) for x in df.to_records(index=True)]
Out[9]: [(3, 1, 2), (4, 2, 3), (5, 3, 4), (6, 4, 5)]
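Alternatively, df.itertuples gives the same result directly without going through a records array; a minimal sketch with the same example frame:

```python
import pandas as pd

df = pd.DataFrame({'a': range(3, 7), 'b': range(1, 5), 'c': range(2, 6)}).set_index('a')

# name=None yields plain tuples; the index is included as the first element
tuples = list(df.itertuples(index=True, name=None))
print(tuples)  # [(3, 1, 2), (4, 2, 3), (5, 3, 4), (6, 4, 5)]
```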

Related

python pandas pulling two values out of the same column

What I have is a basic dataframe that I want to pull two values out of, based on index position. So for this:
first_column  second_column
1             1
2             2
3             3
4             4
5             5
I want to extract the values in row 1 and row 2 (1 2) from first_column, then the values in row 2 and row 3 (2 3) from first_column, and so on until I've iterated over the entire column. I ran into an issue with the for loop and am stuck on getting the next index value.
I have code like below:
import pandas as pd

data = {'first_column': [1, 2, 3, 4, 5],
        'second_column': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

for index, row in df.iterrows():
    print(index, row['first_column'])  # value1
    print(index + 1, row['first_column'].values(index + 1))  # value2 <-- error in logic here
Ignoring the prints, which will eventually become variables that are returned, how can I improve this to return (1 2), (2 3), (3 4), (4 5), etc.?
Also, is this easier done with iteritems() method instead of iterrows?
Not sure if this is what you want to achieve:
temp = (df.assign(second_column=df.second_column.shift(-1))
          .dropna()
          .assign(second_column=lambda df: df.second_column.astype(int))
        )
[*zip(temp.first_column.array, temp.second_column.array)]
[(1, 2), (2, 3), (3, 4), (4, 5)]
A simpler solution from @HenryEcker:
list(zip(df['first_column'], df['first_column'].iloc[1:]))
I don't know if this answers your question, but maybe you can try this:
for i, val in enumerate(df['first_column']):
    if val + 1 > 5:
        break
    else:
        print(val, ", ", val + 1)
If you want to take the items pairwise like this, you should consider using iloc instead of iterrows.
out = []
for i in range(len(df) - 1):
    print(i, df.iloc[i]["first_column"])
    print(i + 1, df.iloc[i + 1]["first_column"])
    out.append((df.iloc[i]["first_column"],
                df.iloc[i + 1]["first_column"]))
print(out)
[(1, 2), (2, 3), (3, 4), (4, 5)]

How to random assign 0 or 1 for x rows depend upon y column value in excel

I'm trying to generate the sample data below in Excel. There are 3 columns, and I want output similar to the IsShade column. I've tried =RANDARRAY(20,1,0,1,TRUE), but it doesn't do exactly what I need.
Within each block of NoOfCells rows, I want to place random 1s in IsShade, where the number of 1s equals that block's Shading value.
NoOfCells Shading IsShade(o/p)
5 2 0
5 2 0
5 2 1
5 2 0
5 2 1
--------------------
4 3 1
4 3 1
4 3 0
4 3 1
--------------------
4 1 0
4 1 0
4 1 0
4 1 1
Appreciate it if anyone can help me out. Python code will also work, since I will read the Excel file as CSV and try to generate the IsShade output column. Thank you!!
A small snippet of Python that writes your Excel file. This code does not use Pandas or NumPy, only the standard library, to keep it simple if you want to use Python with Excel.
import random
import itertools
import csv

cols = ['NoOfCells', 'Shading', 'IsShade(o/p)']
data = [(5, 2), (4, 3), (4, 1)]  # (c, s)
lst = []
for c, s in data:  # e.g. c=5, s=2
    l = [0]*(c-s) + [1]*s  # 3x[0], 2x[1] -> [0, 0, 0, 1, 1]
    random.shuffle(l)      # shuffle -> e.g. [1, 0, 0, 0, 1]
    lst.append(zip([c]*c, [s]*c, l))
# flatten the list
lst = list(itertools.chain(*lst))
with open('shade.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    writer.writerow(cols)
    writer.writerows(lst)
>>> lst
[(5, 2, 1),
(5, 2, 0),
(5, 2, 0),
(5, 2, 0),
(5, 2, 1),
(4, 3, 1),
(4, 3, 0),
(4, 3, 1),
(4, 3, 1),
(4, 1, 0),
(4, 1, 0),
(4, 1, 1),
(4, 1, 0)]
$ cat shade.csv
NoOfCells,Shading,IsShade(o/p)
5,2,0
5,2,0
5,2,1
5,2,0
5,2,1
4,3,1
4,3,1
4,3,1
4,3,0
4,1,0
4,1,1
4,1,0
4,1,0
You can count the number of rows for RANDARRAY to return using COUNTA. Also, to exclude the dividing lines, test with ISNUMBER:
=LET(Data,FILTER(B:B,(B:B<>"")*(ROW(B:B)>1)),IF(ISNUMBER(Data),RANDARRAY(COUNTA(Data),1,0,1,TRUE),""))
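Since the question mentions Python is also acceptable, here is a sketch of the same idea with pandas and numpy: build each block as a shuffled array of 0s and 1s, then concatenate. The block list is taken from the question's sample; the output filename is arbitrary.

```python
import numpy as np
import pandas as pd

# (NoOfCells, Shading) per block, as in the question's sample
blocks = [(5, 2), (4, 3), (4, 1)]

rng = np.random.default_rng()
frames = []
for n_cells, n_shaded in blocks:
    shade = np.zeros(n_cells, dtype=int)
    shade[:n_shaded] = 1
    rng.shuffle(shade)  # randomly position the 1s within the block
    frames.append(pd.DataFrame({'NoOfCells': n_cells,
                                'Shading': n_shaded,
                                'IsShade(o/p)': shade}))

df = pd.concat(frames, ignore_index=True)
df.to_csv('shade.csv', index=False)
```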

Pandas: return index values for first instance and last instance of value

I have the following DataFrame:
df = pd.DataFrame({'index':[0,1,2,3,4,5,6,7,8,9,10], 'X':[0,0,1,1,0,0,1,1,1,0,0]})
df.set_index('index', inplace = True)
X
index
0 0
1 0
2 1
3 1
4 0
5 0
6 1
7 1
8 1
9 0
10 0
What I need is to return a list of tuples showing the index value for the first and last instances of the 1s for each sequence of 1s (sorry if that's confusing). i.e.
Want:
[(2,3), (6,8)]
The first instance of the first 1 occurs at index point 2, then the last 1 in that sequence occurs at index point 3. The next 1 occurs at index point 6, and the last 1 in that sequence occurs at index point 8.
What I've tried:
I can grab the first one using numpy's argmax function. i.e.
x1 = np.argmax(df.values)
y1 = np.argmin(df.values[x1:])
(x1, x1 + y1 - 1)
Which will give me the first tuple, but iterating through seems messy and I feel like there's a better way.
You need more_itertools.consecutive_groups
import more_itertools as mit

def find_ranges(iterable):
    """Yield ranges of consecutive numbers."""
    for group in mit.consecutive_groups(iterable):
        group = list(group)
        if len(group) == 1:
            yield group[0]
        else:
            yield group[0], group[-1]

list(find_ranges(df['X'][df['X'] == 1].index))
Output:
[(2, 3), (6, 8)]
You can use a third party library, more_itertools, combining loc with mit.consecutive_groups:
x = [list(group) for group in mit.consecutive_groups(df.loc[df['X'] == 1].index)]
# [[2, 3], [6, 7, 8]]
Then a simple list comprehension collapses each group to its endpoints:
x = [(i[0], i[-1]) for i in x]
# [(2, 3), (6, 8)]
An approach using numpy, adapted from a great answer by @Warren Weckesser:
def runs(a):
    isone = np.concatenate(([0], np.equal(a, 1).view(np.int8), [0]))
    absdiff = np.abs(np.diff(isone))
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return [(i, j - 1) for i, j in ranges]

runs(df['X'].values)
# [(2, 3), (6, 8)]
Here's a pure pandas solution:
df.groupby(df['X'].eq(0).cumsum().mask(df['X'].eq(0)))\
.apply(lambda x: (x.first_valid_index(),x.last_valid_index()))\
.tolist()
Output:
[(2, 3), (6, 8)]
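Plain pandas with shift also works: a run of 1s starts where a 1 follows a 0, and ends where a 1 precedes a 0. A minimal sketch on the question's frame:

```python
import pandas as pd

df = pd.DataFrame({'index': range(11),
                   'X': [0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0]}).set_index('index')

s = df['X']
# start: current value is 1 and previous value is 0 (treat before-start as 0)
starts = s.index[(s == 1) & (s.shift(fill_value=0) == 0)]
# end: current value is 1 and next value is 0 (treat past-end as 0)
ends = s.index[(s == 1) & (s.shift(-1, fill_value=0) == 0)]
print(list(zip(starts, ends)))  # [(2, 3), (6, 8)]
```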

Finding the min or max sum of a row in an array

How can I quickly find the min or max sum of the elements of a row in an array?
For example:
1, 2
3, 4
5, 6
7, 8
The minimum sum would be row 0 (1 + 2), and the maximum sum would be row 3 (7 + 8)
print(mat.shape)
(8, 1, 2)
print(mat)
[[[-995.40045 -409.15112]]
[[-989.1511 3365.3267 ]]
[[-989.1511 3365.3267 ]]
[[1674.5447 3035.3523 ]]
[[ 0. 0. ]]
[[ 0. 3199. ]]
[[ 0. 3199. ]]
[[2367. 3199. ]]]
In native Python, min and max have key functions:
>>> LoT=[(1, 2), (3, 4), (5, 6), (7, 8)]
>>> min(LoT, key=sum)
(1, 2)
>>> max(LoT, key=sum)
(7, 8)
If you want the index of the first min or max in Python, you would do something like:
>>> min(((i, t) for i, t in enumerate(LoT)), key=lambda it: sum(it[1]))
(0, (1, 2))
And then peel that tuple apart to get what you want. You also could use that in numpy, but at unknown (to me) performance cost.
In numpy, you can do:
>>> a=np.array(LoT)
>>> a[a.sum(axis=1).argmin()]
array([1, 2])
>>> a[a.sum(axis=1).argmax()]
array([7, 8])
To get the index only:
>>> a.sum(axis=1).argmax()
3
x = np.sum(x,axis=1)
min_x = x.min()
max_x = x.max()
Presuming x is a (4, 2) array, use np.sum to sum across the rows; then .min() returns the minimum value of the result and .max() returns the maximum.
You can do this using np.argmin and np.sum:
array_minimum_index = np.argmin([np.sum(x, axis=1) for x in mat])
array_maximum_index = np.argmax([np.sum(x, axis=1) for x in mat])
For your array, this results in array_minimum_index = 0 and array_maximum_index = 7, as your sums at those indices are -1404.55157 and 5566.0
To simply print out the values of the min and max sum, you can do this:
array_sum_min = min([np.sum(x,axis=1) for x in mat])
array_sum_max = max([np.sum(x,axis=1) for x in mat])
You can use min and max and use sum as their key.
lst = [(1, 2), (3, 4), (5, 6), (7, 8)]
min(lst, key=sum) # (1, 2)
max(lst, key=sum) # (7, 8)
If you want the sum directly and you do not care about the tuple itself, then map can be of help.
min(map(sum, lst)) # 3
max(map(sum, lst)) # 15
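For the (8, 1, 2) array shown in the question, numpy can sum over both trailing axes at once by passing a tuple to axis, avoiding the per-element Python loop. A sketch using the question's values:

```python
import numpy as np

# The (8, 1, 2) array from the question
mat = np.array([[[-995.40045, -409.15112]],
                [[-989.1511, 3365.3267]],
                [[-989.1511, 3365.3267]],
                [[1674.5447, 3035.3523]],
                [[0., 0.]],
                [[0., 3199.]],
                [[0., 3199.]],
                [[2367., 3199.]]])

row_sums = mat.sum(axis=(1, 2))  # one sum per row -> shape (8,)
print(row_sums.argmin(), row_sums.argmax())  # 0 7
```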

Select specific index, column pairs from pandas dataframe

I have a dataframe x:
x = pd.DataFrame(np.random.randn(3,3), index=[1,2,3], columns=['A', 'B', 'C'])
x
A B C
1 0.256668 -0.338741 0.733561
2 0.200978 0.145738 -0.409657
3 -0.891879 0.039337 0.400449
and I would like to select a bunch of index column pairs to populate a new Series. For example, I could select [(1, 'A'), (1, 'B'), (1, 'A'), (3, 'C')] which would generate a list or array or series with 4 elements:
[0.256668, -0.338741, 0.256668, 0.400449]
Any idea of how I should do that?
I think get_value() and lookup() are faster:
import numpy as np
import pandas as pd
x = pd.DataFrame(np.random.randn(3,3), index=[1,2,3], columns=['A', 'B', 'C'])
locations = [(1, "A"), (1, "B"), (1, "A"), (3, "C")]
print(x.get_value(1, "A"))
row_labels, col_labels = zip(*locations)
print(x.lookup(row_labels, col_labels))
If your pairs are positions instead of index/column names,
row_position = [0,0,0,2]
col_position = [0,1,0,2]
x.values[row_position, col_position]
Or get the position from np.searchsorted
row_position = np.searchsorted(x.index,row_labels,sorter = np.argsort(x.index))
Using ix should be able to locate the elements in the data frame, like this:
import pandas as pd
# using your data sample
df = pd.read_clipboard()
df
Out[170]:
A B C
1 0.256668 -0.338741 0.733561
2 0.200978 0.145738 -0.409657
3 -0.891879 0.039337 0.400449
# note that you cannot use bare A, B, C... as they are undefined names
l = [(1, 'A'), (1, 'B'), (1, 'A'), (3, 'C')]
# you can also use a for loop: simply iterate the list and locate each element
list(map(lambda x: df.ix[x[0], x[1]], l))
Out[172]: [0.25666800000000001, -0.33874099999999996, 0.25666800000000001, 0.400449]
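Note that get_value(), lookup(), and ix have since been removed from pandas (lookup in pandas 2.0); the documented replacement is to translate the labels into positions with get_indexer and index into the underlying array. A sketch, assuming the same frame and pairs as above:

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randn(3, 3), index=[1, 2, 3], columns=['A', 'B', 'C'])
locations = [(1, 'A'), (1, 'B'), (1, 'A'), (3, 'C')]

row_labels, col_labels = zip(*locations)
# map labels to integer positions, then fancy-index the ndarray
values = x.to_numpy()[x.index.get_indexer(row_labels),
                      x.columns.get_indexer(col_labels)]
print(values)  # one value per (index, column) pair
```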
