What I have is a basic dataframe that I want to pull two values out of, based on index position. So for this:
   first_column  second_column
0             1              1
1             2              2
2             3              3
3             4              4
4             5              5
I want to extract the values in row 1 and row 2 (1 2) out of first_column, then the values in row 2 and row 3 (2 3), and so on until I've iterated over the entire column. I ran into an issue with the for loop and am stuck on getting the next index value.
I have code like below:
import pandas as pd

data = {'first_column': [1, 2, 3, 4, 5],
        'second_column': [1, 2, 3, 4, 5],
        }
df = pd.DataFrame(data)

for index, row in df.iterrows():
    print(index, row['first_column'])  # value1
    print(index + 1, row['first_column'].values(index + 1))  # value2 <-- error in logic here
Ignoring the prints, which will eventually become variables that are returned, how can I improve this to return (1 2), (2 3), (3 4), (4 5), etc.?
Also, is this easier done with iteritems() method instead of iterrows?
Not sure if this is what you want to achieve:
temp = (df.assign(second_column=df.second_column.shift(-1))
          .dropna()
          .assign(second_column=lambda df: df.second_column.astype(int))
       )
[*zip(temp.first_column.array, temp.second_column.array)]
[(1, 2), (2, 3), (3, 4), (4, 5)]
A simpler solution from @HenryEcker:
list(zip(df['first_column'], df['first_column'].iloc[1:]))
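Run against the sample frame above, this pairing produces exactly the requested tuples:

```python
import pandas as pd

df = pd.DataFrame({'first_column': [1, 2, 3, 4, 5],
                   'second_column': [1, 2, 3, 4, 5]})

# zip the column with itself shifted one position; the shorter
# iterable (the slice) stops the pairing at the last full pair
pairs = list(zip(df['first_column'], df['first_column'].iloc[1:]))
print(pairs)  # [(1, 2), (2, 3), (3, 4), (4, 5)]
```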
I don't know if this answers your question, but maybe you can try this :
for i, val in enumerate(df['first_column']):
    if val + 1 > 5:
        break
    else:
        print(val, ", ", val + 1)
If you want to take these items in the same fashion, you should consider using iloc instead of iterrows.
out = []
for i in range(len(df) - 1):
    print(i, df.iloc[i]["first_column"])
    print(i + 1, df.iloc[i + 1]["first_column"])
    out.append((df.iloc[i]["first_column"],
                df.iloc[i + 1]["first_column"]))
print(out)
[(1, 2), (2, 3), (3, 4), (4, 5)]
I need to do something very similar to this question: Pandas convert dataframe to array of tuples
The difference is I need to get not only a single list of tuples for the entire DataFrame, but a list of lists of tuples, sliced based on some column value.
Supposing this is my data set:
   t_id  A     B
0  AAAA  1   2.0
1  AAAA  3   4.0
2  AAAA  5   6.0
3  BBBB  7   8.0
4  BBBB  9  10.0
...
I want to produce as output:
[[(1,2.0), (3,4.0), (5,6.0)],[(7,8.0), (9,10.0)]]
That is, one list for 'AAAA', another for 'BBBB' and so on.
I've tried with two nested for loops. It seems to work, but it is taking too long (actual data set has ~1M rows):
result = []
for t in df['t_id'].unique():
    tuple_list = []
    for x in df[df['t_id'] == t].iterrows():
        row = x[1][['A', 'B']]
        tuple_list.append(tuple(row))
    result.append(tuple_list)
Is there a faster way to do it?
You can groupby column t_id, iterate through groups and convert each sub dataframe into a list of tuples:
[g[['A', 'B']].to_records(index=False).tolist() for _, g in df.groupby('t_id')]
# [[(1, 2.0), (3, 4.0), (5, 6.0)], [(7, 8.0), (9, 10.0)]]
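As a rough sketch of the same idea without the records round-trip (the sample frame below reproduces the data from the question), zipping the two columns inside each group also avoids the per-row Python loop:

```python
import pandas as pd

df = pd.DataFrame({'t_id': ['AAAA', 'AAAA', 'AAAA', 'BBBB', 'BBBB'],
                   'A': [1, 3, 5, 7, 9],
                   'B': [2.0, 4.0, 6.0, 8.0, 10.0]})

# one list of (A, B) tuples per t_id group
result = [list(zip(g['A'], g['B'])) for _, g in df.groupby('t_id')]
print(result)  # [[(1, 2.0), (3, 4.0), (5, 6.0)], [(7, 8.0), (9, 10.0)]]
```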
I think this should work too:
import pandas as pd
import itertools
df = pd.DataFrame({"A": [1, 2, 3, 1], "B": [2, 2, 2, 2], "C": ["A", "B", "C", "B"]})
tuples_in_df = sorted(tuple(df.to_records(index=False)), key=lambda x: x[0])
output = [[tuple(x)[1:] for x in group] for _, group in itertools.groupby(tuples_in_df, lambda x: x[0])]
print(output)
Out:
[[(2, 'A'), (2, 'B')], [(2, 'B')], [(2, 'C')]]
I'm trying to generate the sample data below in Excel. There are 3 columns, and I want output similar to the IsShade column. I've tried =RANDARRAY(20,1,0,1,TRUE), but it doesn't do exactly what I need.
Within each block of NoOfCells rows, I want to place random 1s in the IsShade column, with the number of 1s equal to that block's Shading value.
NoOfCells Shading IsShade(o/p)
5 2 0
5 2 0
5 2 1
5 2 0
5 2 1
--------------------
4 3 1
4 3 1
4 3 0
4 3 1
--------------------
4 1 0
4 1 0
4 1 0
4 1 1
I'd appreciate it if anyone could help me out. Python code will also work, since I will read the Excel data as CSV and try to generate the IsShade output column. Thank you!!
A small snippet of Python that writes your Excel file. This code does not use pandas or NumPy, only the standard library, to keep things simple if you want to use Python with Excel.
import random
import itertools
import csv

cols = ['NoOfCells', 'Shading', 'IsShade(o/p)']
data = [(5, 2), (4, 3), (4, 1)]   # (c, s)

lst = []
for c, s in data:                 # e.g. c=5, s=2
    l = [0]*(c-s) + [1]*s         # 3x[0], 2x[1] -> [0, 0, 0, 1, 1]
    random.shuffle(l)             # shuffle -> e.g. [1, 0, 0, 0, 1]
    lst.append(zip([c]*c, [s]*c, l))

# flatten the list
lst = list(itertools.chain(*lst))

with open('shade.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    writer.writerow(cols)
    writer.writerows(lst)
>>> lst
[(5, 2, 1),
(5, 2, 0),
(5, 2, 0),
(5, 2, 0),
(5, 2, 1),
(4, 3, 1),
(4, 3, 0),
(4, 3, 1),
(4, 3, 1),
(4, 1, 0),
(4, 1, 0),
(4, 1, 1),
(4, 1, 0)]
$ cat shade.csv
NoOfCells,Shading,IsShade(o/p)
5,2,0
5,2,0
5,2,1
5,2,0
5,2,1
4,3,1
4,3,1
4,3,1
4,3,0
4,1,0
4,1,1
4,1,0
4,1,0
You can count the number of rows for RANDARRAY to return using COUNTA. Also, to exclude the dividing lines, test with ISNUMBER:
=LET(Data,FILTER(B:B,(B:B<>"")*(ROW(B:B)>1)),IF(ISNUMBER(Data),RANDARRAY(COUNTA(Data),1,0,1,TRUE),""))
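If you go the Python route instead, here is a rough pandas/NumPy sketch. The sample frame and the block-detection rule (a new block starts whenever NoOfCells or Shading changes) are assumptions on my part:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

df = pd.DataFrame({'NoOfCells': [5]*5 + [4]*4 + [4]*4,
                   'Shading':   [2]*5 + [3]*4 + [1]*4})

# start a new block whenever either column changes value
key_cols = df[['NoOfCells', 'Shading']]
block = key_cols.ne(key_cols.shift()).any(axis=1).cumsum()

# within each block, set exactly `Shading` randomly chosen rows to 1
parts = []
for _, g in df.groupby(block, sort=False):
    flags = np.zeros(len(g), dtype=int)
    flags[rng.choice(len(g), size=g['Shading'].iat[0], replace=False)] = 1
    parts.append(pd.Series(flags, index=g.index))
df['IsShade'] = pd.concat(parts)
print(df)
```

Each block then contains exactly `Shading` ones, placed at random positions.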
I have a csv file:
a,b,c
1,2,3
4,5,6
7
as you can see, the last row doesn't match the header count. So I want an error like ValueError: Shape of passed values is (3, 1), indices imply (3, 3) to be raised when I read the file with the pd.read_csv() function.
But if I do it in this way:
pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7]]), columns=['a', 'b', 'c'])
I get this error.
Any thoughts?
Thanks in advance!
Using the approach from remove NaN from array, you can try to create a new dataframe with NaNs filtered out of the values of the dataframe returned by read_csv():
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""a,b,c
1,2,3
4,5,6
7"""))  # .astype("Int64")

try:
    pd.DataFrame(np.array([r[~np.isnan(r)] for r in df.values]), columns=df.columns)
except Exception as e:
    print(str(e))
output
Shape of passed values is (3, 1), indices imply (3, 3)
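An alternative sketch that validates field counts up front with the standard csv module, before pandas ever pads short rows with NaN (the helper name and error wording here are my own):

```python
import csv
import io

def check_width(text):
    """Return the 1-based line numbers of rows whose field count
    differs from the header's."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    # data lines start at line 2 because line 1 is the header
    return [n for n, r in enumerate(body, start=2) if len(r) != len(header)]

raw = "a,b,c\n1,2,3\n4,5,6\n7\n"
bad = check_width(raw)
if bad:
    print(f"rows with wrong field count at lines: {bad}")  # lines: [4]
```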
I have the following DataFrame:
df = pd.DataFrame({'index':[0,1,2,3,4,5,6,7,8,9,10], 'X':[0,0,1,1,0,0,1,1,1,0,0]})
df.set_index('index', inplace = True)
X
index
0 0
1 0
2 1
3 1
4 0
5 0
6 1
7 1
8 1
9 0
10 0
What I need is to return a list of tuples showing the index value for the first and last instances of the 1s for each sequence of 1s (sorry if that's confusing). i.e.
Want:
[(2,3), (6,8)]
The first instance of the first 1 occurs at index point 2, then the last 1 in that sequence occurs at index point 3. The next 1 occurs at index point 6, and the last 1 in that sequence occurs at index point 8.
What I've tried:
I can grab the first one using numpy's argmax function. i.e.
x1 = np.argmax(df.values)
y1 = np.argmin(df.values[x1:])
(x1, x1 + y1 - 1)
Which will give me the first tuple, but iterating through seems messy and I feel like there's a better way.
You need more_itertools.consecutive_groups
import more_itertools as mit

def find_ranges(iterable):
    """Yield ranges of consecutive numbers."""
    for group in mit.consecutive_groups(iterable):
        group = list(group)
        if len(group) == 1:
            yield group[0]
        else:
            yield group[0], group[-1]

list(find_ranges(df['X'][df['X'] == 1].index))
Output:
[(2, 3), (6, 8)]
You can use a third party library: more_itertools
loc with mit.consecutive_groups
[list(group) for group in mit.consecutive_groups(df.loc[df['X'] == 1].index)]
# [[2, 3], [6, 7, 8]]
A simple list comprehension then collapses each group to its endpoints:
x = [(i[0], i[-1]) for i in x]
# [(2, 3), (6, 8)]
An approach using numpy, adapted from a great answer by @Warren Weckesser:
import numpy as np

def runs(a):
    # pad with zeros so runs touching either end still produce edges
    isone = np.concatenate(([0], np.equal(a, 1).view(np.int8), [0]))
    absdiff = np.abs(np.diff(isone))
    # edges come in (start, stop) pairs
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return [(i, j - 1) for i, j in ranges]

runs(df['X'].values)
# [(2, 3), (6, 8)]
Here's a pure pandas solution:
df.groupby(df['X'].eq(0).cumsum().mask(df['X'].eq(0)))\
.apply(lambda x: (x.first_valid_index(),x.last_valid_index()))\
.tolist()
Output:
[(2, 3), (6, 8)]
I'm currently trying convert a pandas dataframe into a list of tuples. However I'm having difficulties getting the Index (which is the Date) for the values in the tuple as well. My first step was going here, but they do not add any index to the tuple.
Pandas convert dataframe to array of tuples
My only problem is accessing the index for each row in the numpy array. I have one solution shown below, but it uses an additional counter indexCounter and it looks sloppy. I feel like there should be a more elegant solution to retrieving an index from a particular numpy array.
def get_Quandl_daily_data(ticker, start, end):
    prices = []
    symbol = format_ticker(ticker)
    try:
        data = quandl.get("WIKI/" + symbol, start_date=start, end_date=end)
    except Exception as e:
        print("Could not download QUANDL data: %s" % e)
    subset = data[['Open', 'High', 'Low', 'Close', 'Adj. Close', 'Volume']]
    indexCounter = 0
    for row in subset.values:
        dateIndex = subset.index.values[indexCounter]
        tup = (dateIndex, "%.4f" % row[0], "%.4f" % row[1], "%.4f" % row[2],
               "%.4f" % row[3], "%.4f" % row[4], row[5])
        prices.append(tup)
        indexCounter += 1
    return prices
Thanks in advance for any help!
You can iterate over the result of to_records(index=True).
Say you start with this:
In [6]: df = pd.DataFrame({'a': range(3, 7), 'b': range(1, 5), 'c': range(2, 6)}).set_index('a')
In [7]: df
Out[7]:
b c
a
3 1 2
4 2 3
5 3 4
6 4 5
then this works, except that it does not include the index (a):
In [8]: [tuple(x) for x in df.to_records(index=False)]
Out[8]: [(1, 2), (2, 3), (3, 4), (4, 5)]
However, if you pass index=True, then it does what you want:
In [9]: [tuple(x) for x in df.to_records(index=True)]
Out[9]: [(3, 1, 2), (4, 2, 3), (5, 3, 4), (6, 4, 5)]
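A variant sketch using itertuples, which also carries the index as the first field; passing name=None yields plain tuples instead of namedtuples:

```python
import pandas as pd

df = pd.DataFrame({'a': range(3, 7), 'b': range(1, 5), 'c': range(2, 6)}).set_index('a')

# name=None yields plain tuples; the index value is the first element
result = list(df.itertuples(index=True, name=None))
print(result)  # [(3, 1, 2), (4, 2, 3), (5, 3, 4), (6, 4, 5)]
```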