I have a dataframe df with a column months_to_maturity, where several rows share each months_to_maturity value (1, 2, etc.). I am trying to keep only the first 3 rows for each months_to_maturity value: for months_to_maturity = 1 I would like only 3 associated rows, for months_to_maturity = 2 another 3 rows, and so on. I tried the code below, but I get the error IndexError: index 21836 is out of bounds for axis 0 with size 4412, so I am wondering if there is a better way to do this. pairwise yields the current and next row of the dataframe, and the values of months_to_maturity are sorted.
count = 0
for (i1, row1), (i2, row2) in pairwise(df.iterrows()):
    if row1.months_to_maturity == row2.months_to_maturity:
        count = count + 1
        if count == 3:
            df.drop(df.index[i1])
            df = df.reset_index()
    elif row1.months_to_maturity != row2.months_to_maturity:
        count = 0
Thank You
You can do:
df.groupby('months_to_maturity').head(3)
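For example, with a small made-up frame (the column names here are just for illustration), this keeps at most the first three rows per months_to_maturity value:

import pandas as pd

df = pd.DataFrame({'months_to_maturity': [1, 1, 1, 1, 2, 2, 2, 2],
                   'price': [10, 11, 12, 13, 20, 21, 22, 23]})

print(df.groupby('months_to_maturity').head(3))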
I have a data frame. Under the same index I have "early_date" and "late_date" values, both stored as int. I want to generate the values in between "early_date" and "late_date" for each index, and stack the generated values as new rows between them.
Here is how I did it,
import pandas as pd

df = pd.DataFrame({'index': [1, 1, 2, 2, 3, 3],
                   'variable': ['early_date', 'late_date'] * 3,
                   'value': [201952, 202001, 202002, 202004, 202006, 202012]})

# This is what your data looks like unmelted
df_p = df.pivot('index', 'variable', 'value').reset_index()
df_p.columns.name = ''

df_p['new'] = [list(range(x, y + 1))
               for x, y in zip(df_p.pop('early_date'), df_p.pop('late_date'))]
This is the result
In the column "new", the filling between "201952" & "202001" in index 1 has became 201952, 201953, 201954...201999, 202001.
However, since the "new" column is actually representing the year and weeks. In index 1 case,
It shall not be filling anything between 201952 & 202001, and the result should be [201952, 202001]. Since week 52 is the end of the year.
What can I do to handling these cases?
IIUC, you can add a condition in your list comprehension:
df_p['new'] = [list(range(x,y+1)) if str(x)[-2:]!='52' else [x,y]
for x, y in zip(df_p.pop('early_date'), df_p.pop('late_date'))]
print(df_p)
index new
0 1 [201952, 202001]
1 2 [202002, 202003, 202004]
2 3 [202006, 202007, 202008, 202009, 202010, 20201...
I have a big dataframe (~10 million rows). Each row has:
category
start position
end position
If two rows are in the same category and the start and end position overlap with a +-5 tolerance, I want to keep just one of the rows.
For example
1, cat1, 10, 20
2, cat1, 12, 21
3, cat2, 10, 25
I want to filter out 1 or 2.
What I'm doing right now isn't very efficient,
import pandas as pd

df = pd.read_csv('data.csv', sep='\t', header=None)

# split the dataframe by category up front
dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]

discard = []
rows = []
for index, row in df.iterrows():
    if index in discard:
        continue
    df_2 = dfs[row.category]
    # params['min_distance'] is the +-5 tolerance
    res = df_2[(abs(df_2.start - row.start) <= params['min_distance']) &
               (abs(df_2.end - row.end) <= params['min_distance'])]
    if len(res.index) > 1:
        discard.extend(res.index.values)
    rows.append(row)

df = pd.DataFrame(rows)
I've also tried a different approach making use of a sorted version of the dataframe.
my_index = 0
indexes = []
discard = []
count = 0
curr = 0
total_len = len(df.index)
while my_index < total_len - 1:
    row = df.iloc[[my_index]]
    cond = True
    next_index = 1
    while cond:
        second_row = df.iloc[[my_index + next_index]]
        c1 = (row.iloc[0].category == second_row.iloc[0].category)
        c2 = (abs(second_row.iloc[0].sstart - row.iloc[0].sstart) <= params['min_distance'])
        c3 = (abs(second_row.iloc[0].send - row.iloc[0].send) <= params['min_distance'])
        cond = c1 and c2 and c3
        if cond and (c2 and c3):
            indexes.append(my_index)
            cond = True
            next_index += 1
    indexes.append(my_index)
    my_index += next_index
indexes.append(total_len - 1)
The problem is that this solution is not perfect: sometimes it misses a row, because the overlap can occur several rows ahead, not just in the next row.
I'm looking for ideas on how to approach this problem in a more pandas-friendly way, if one exists.
The approach here should be this:
pandas.groupby by category
agg(Func) on the groupby result
Func should implement the logic of finding the best range inside each category (sorted search, balanced trees or anything else)
Do you want to merge all similar rows or only 2 consecutive ones?
If all similar, I suggest you first order the rows by category, then by the 2 other columns, and squash similar rows into a single row.
If only 2 consecutive, then check whether the next value is in the range you set and, if yes, merge it. Here you can see how:
merge rows pandas dataframe based on condition
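A rough sketch of the sort-then-compare-consecutive idea, assuming the column names category, start and end and the +-5 tolerance from the question; it keeps only the first row of each run of consecutive near-duplicates:

df = df.sort_values(['category', 'start', 'end']).reset_index(drop=True)
prev = df.shift(1)

# True where a row is in the same category as the previous row and both its
# start and end are within 5 of the previous row's values
near_dup = ((df['category'] == prev['category'])
            & ((df['start'] - prev['start']).abs() <= 5)
            & ((df['end'] - prev['end']).abs() <= 5))

df = df[~near_dup]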
I don't believe the numeric comparisons can be made without a loop, but you can make at least part of this cleaner and more efficient:
dfs = []
for seq in df.category.unique():
dfs[seq] = df[df.category == seq]
Instead of this, use df.groupby('category').apply(drop_duplicates).droplevel(0), where drop_duplicates is a function containing your second loop. The function will then be called separately for each category, with a dataframe that contains only the rows of that category. The outputs will be combined back into a single dataframe. The result will have a MultiIndex with the value of "category" as an outer level; this can be removed with droplevel(0).
Secondly, within the category you could sort by the first of the two numeric columns for another small speed-up:
def drop_duplicates(df):
    df = df.sort_values("sstart")
    ...
This will allow you to stop the inner loop as soon as the sstart column value is out of range, instead of comparing every row to every other row.
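For what it's worth, here is a minimal sketch of what such a drop_duplicates function could look like, assuming the sstart/send column names and the params['min_distance'] tolerance from the question:

def drop_duplicates(group):
    group = group.sort_values('sstart')
    starts = group['sstart'].values
    ends = group['send'].values
    keep = []
    for i in range(len(group)):
        duplicate = False
        # only compare against rows already kept, newest first; once sstart is
        # out of range we can stop, because earlier kept rows are even further away
        for j in reversed(keep):
            if starts[i] - starts[j] > params['min_distance']:
                break
            if abs(ends[i] - ends[j]) <= params['min_distance']:
                duplicate = True
                break
        if not duplicate:
            keep.append(i)
    return group.iloc[keep]

result = df.groupby('category').apply(drop_duplicates).droplevel(0)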
I'm trying to code the following logic in pandas: for the first three rows of every group I want to create a variable that has the value 1 (1st row), 2 (2nd row), 3 (3rd row). I'm doing it like below. In the code below I'm not creating a new variable because I don't know how to do that, so I'm replacing a variable that's already present in the data set. Though my code doesn't throw an error, it's giving me very strange results.
def func(i):
    data.loc[data.groupby('ID').nth(i).index, 'date'] = i

func(1)
Any suggestions?
Thanks in Advance.
If you don't have a duplicated index, you can create a row id for each group, filter out ids larger than 3, and then assign the result back to the data frame:
data['date'] = (data.groupby('ID').cumcount() + 1)[lambda x: x <= 3]
This gives the first three rows of each ID the values 1, 2, 3; rows beyond the third will have NaN values.
data = pd.DataFrame({"ID":[1,1,1,1,2,2,3,3,3]})
data['date'] = (data.groupby('ID').cumcount() + 1)[lambda x: x <= 3]
data
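With that toy frame, the result should look roughly like this (rows beyond the third in each group get NaN, so the column is upcast to float):

   ID  date
0   1   1.0
1   1   2.0
2   1   3.0
3   1   NaN
4   2   1.0
5   2   2.0
6   3   1.0
7   3   2.0
8   3   3.0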
I want to add a column to a Dataframe that will contain a number derived from the number of NaN values in the row, specifically: one less than the number of non-NaN values in the row.
I tried:
for index, row in df.iterrows():
    count = row.value_counts()
    val = sum(count) - 1
    df['Num Hits'] = val
Which returns an error:
-c:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
and puts the first val value into every cell of the new column. I've tried reading about .loc and indexing in the Pandas documentation and failed to make sense of it. I gather that .loc wants a row_index and a column_index, but I don't know whether these are pre-defined in every dataframe and I just have to specify them somehow, or whether I need to "set" an index on the dataframe before telling the loop where to place the new value, val.
You can totally do it in a vectorized way without using a loop, which is likely to be faster than the loop version:
In [89]:
print df
0 1 2 3
0 0.835396 0.330275 0.786579 0.493567
1 0.751678 0.299354 0.050638 0.483490
2 0.559348 0.106477 0.807911 0.883195
3 0.250296 0.281871 0.439523 0.117846
4 0.480055 0.269579 0.282295 0.170642
In [90]:
#number of valid numbers - 1
df.apply(lambda x: np.isfinite(x).sum()-1, axis=1)
Out[90]:
0 3
1 3
2 3
3 3
4 3
dtype: int64
#DSM brought up a good point that the above solution is still not fully vectorized. A vectorized form can be simply (~df.isnull()).sum(axis=1) - 1.
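To put that into the new column the question asks for (assuming the column name 'Num Hits' from the question), the vectorized form can be assigned directly:

df['Num Hits'] = (~df.isnull()).sum(axis=1) - 1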
You can use the index variable that you define as part of the for loop as the row_index that .loc is looking for:
for index, row in df.iterrows():
    count = row.value_counts()
    val = sum(count) - 1
    df.loc[index, 'Num Hits'] = val
As part of trying to learn pandas I'm trying to reshape a spreadsheet. After removing non-zero values I need to get some data from a single column.
For the sample columns below, I want to find the most effective way of finding the row and column index of the cell that contains the value date and getting the value next to it (e.g. here it would be 38477).
In practice this would be a much bigger DataFrame, and the date row could change and may not always be in the first column.
What is the best way to find out where date is in the array and return the value in the adjacent cell?
Thanks
<bound method DataFrame.head of 0 1 2 4 5 7 8 10 \
1 some title
2 date 38477
5 cat1 cat2 cat3 cat4
6 a b c d e f g
8 Z 167.9404 151.1389 346.197 434.3589 336.7873 80.52901 269.1486
9 X 220.683 56.0029 73.73679 428.8939 483.7445 251.1877 243.7918
10 C 433.0189 390.1931 251.6636 418.6703 12.21859 113.093 136.28
12 V 226.0135 418.1141 310.2038 153.9018 425.7491 73.08073 277.5065
13 W 295.146 173.2747 2.187459 401.6453 51.47293 175.387 397.2021
14 S 306.9325 157.2772 464.1394 216.248 478.3903 173.948 328.9304
15 A 19.86611 73.11554 320.078 199.7598 467.8272 234.0331 141.5544
This really just reformats a lot of the iteration you are doing to make it clearer and take advantage of pandas' ability to easily select, etc.
First, we need a dummy dataframe (with date in the last row and explicitly ordered the way you have in your setup):
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, np.NaN],
                   "B": [5, 3, np.NaN, 3, "date"],
                   "C": [np.NaN, 2, 1, 3, 634]})[["A", "B", "C"]]
A clear way to do it is to find the row and then enumerate over the row to find date:
row = df[df.apply(lambda x: (x == "date").any(), axis=1)].values[0]  # will be an array
for i, val in enumerate(row):
    if val == "date":
        print row[i + 1]
        break
If your spreadsheet only has a few non-numeric columns, you could go by column, check for date and get a row and column index (this may be faster because it searches by column rather than by row, though I'm not sure)
# gives you column labels, which are `True` if at least one entry has `date` in it
# have to check `kind` otherwise you get an error.
col_result = df.apply(lambda x: x.dtype.kind == "O" and (x == "date").any())
# select only columns where True (this should be one entry) and get their index (for the label)
column = col_result[col_result].index[0]
col_index = df.columns.get_loc(column)
# will be True if it contains date
row_selector = df.icol(col_index) == "date"
print df[row_selector].icol(col_index + 1).values
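Note that icol has since been deprecated; the same column-based lookup can be written with iloc, for example:

row_selector = df.iloc[:, col_index] == "date"
print(df[row_selector].iloc[:, col_index + 1].values)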