Assume the following DataFrame:
id A
1 0
2 10
3 200
4 3000
I would like to apply a calculation between every row and every other row.
For example, if the calculation were lambda r1, r2: abs(r1-r2), then the output would be (in some order)
id col_name
1 10
2 200
3 3000
4 190
5 2990
6 2800
Questions:
1. How do I get only the above output?
2. How do I associate each result with the rows that produced it, in the most "pandas-like" way?
I would like to keep everything in a single table as much as possible, in a way that still supports reasonable lookup.
The size of my data is not large, and never will be.
EDIT1:
One way that would answer my question 2 would be
id col_name origin1 origin2
1 10 1 2
2 200 1 3
3 3000 1 4
4 190 2 3
5 2990 2 4
6 2800 3 4
I would like to know whether this is standard and has a built-in way of doing it, or whether there is another/better way.
IIUC itertools
import itertools

# all unordered pairs of row positions
s = list(itertools.combinations(df.index, 2))
pd.Series([df.A.loc[x[1]] - df.A.loc[x[0]] for x in s])
Out[495]:
0 10
1 200
2 3000
3 190
4 2990
5 2800
dtype: int64
Update
# keep the pair of row positions alongside the difference
s = list(itertools.combinations(df.index, 2))
pd.DataFrame([x + (df.A.loc[x[1]] - df.A.loc[x[0]],) for x in s])
Out[518]:
0 1 2
0 0 1 10
1 0 2 200
2 0 3 3000
3 1 2 190
4 1 3 2990
5 2 3 2800
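If you want named columns and the original id values rather than positional indices, a small variation of the above (a sketch, assuming the id column from the question's frame) reproduces the origin1/origin2 table from EDIT1:
s = list(itertools.combinations(df.index, 2))
out = pd.DataFrame(
    [(df.id.loc[a], df.id.loc[b], df.A.loc[b] - df.A.loc[a]) for a, b in s],
    columns=['origin1', 'origin2', 'col_name']
)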
Use broadcasted subtraction, then np.tril_indices_from to extract the lower triangle (the positive values).
import numpy as np

# pandas <= 0.23: u = df['A'].values
# pandas 0.24+:
u = df['A'].to_numpy()
u2 = u[:, None] - u
pd.Series(u2[np.tril_indices_from(u2, k=-1)])
0 10
1 200
2 190
3 3000
4 2990
5 2800
dtype: int64
Or, use subtract.outer to avoid the conversion to array beforehand.
u2 = np.subtract.outer(*[df.A]*2)
pd.Series(u2[np.tril_indices_from(u2, k=-1)])
If you need the index as well, use
idx = np.tril_indices_from(u2, k=-1)
pd.DataFrame({
    'val': u2[idx],
    'row': idx[0],
    'col': idx[1]
})
val row col
0 10 1 0
1 200 2 0
2 190 2 1
3 3000 3 0
4 2990 3 1
5 2800 3 2
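If you also want the original id values rather than positional row/col indices, one possible follow-up (a sketch, assuming the id column from the question and the idx/u2 defined above):
ids = df['id'].to_numpy()
pd.DataFrame({
    'col_name': u2[idx],
    'origin1': ids[idx[1]],
    'origin2': ids[idx[0]]
})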
For the following dataframe:
df = pd.DataFrame({'Rounds':[1000,1000,1000,1000,3000,3000,4000,5000,6000,6000]})
I would like a loop so that, if a value already exists in previous rows, a fixed int (in this case 25) is added to the value for each previous occurrence, creating:
df = pd.DataFrame({'Rounds':[1000,1025,1050,1075,3000,3025,4000,5000,6000,6025]})
Initially I tried
for i in df.index:
    if df.iat[i, 1] == df.iloc[i - 1, 1]:
        df.iat[i, 1] = df.iat[i - 1, 1] + 25
The problem is that it doesn't work for more than two equal values in a column, and I would also like to refer to the column by its name "Rounds" instead of by its position.
You need groupby.cumcount:
df['Rounds'] += df.groupby('Rounds').cumcount()*25
output:
Rounds
0 1000
1 1025
2 1050
3 1075
4 3000
5 3025
6 4000
7 5000
8 6000
9 6025
intermediate:
df.groupby('Rounds').cumcount()
0 0
1 1
2 2
3 3
4 0
5 1
6 0
7 0
8 0
9 1
dtype: int64
Use groupby + cumcount:
df["Rounds"] += df.groupby(df["Rounds"]).cumcount() * 25
print(df)
Output
Rounds
0 1000
1 1025
2 1050
3 1075
4 3000
5 3025
6 4000
7 5000
8 6000
9 6025
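For completeness, the original loop idea can also be made to work by counting prior occurrences of each original value and referring to the column by name (a sketch; the vectorized cumcount answers above are preferable):
seen = {}  # original value -> number of times seen so far
for i in df.index:
    val = df.loc[i, 'Rounds']
    n = seen.get(val, 0)
    seen[val] = n + 1
    df.loc[i, 'Rounds'] = val + n * 25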
Suppose we have two dataframes, df1 and df2, with an equal number of columns but a different number of rows, e.g.:
df1 = pd.DataFrame([(1,2),(3,4),(5,6),(7,8),(9,10),(11,12)], columns=['a','b'])
a b
0 1 2
1 3 4
2 5 6
3 7 8
4 9 10
5 11 12
df2 = pd.DataFrame([(100,200),(300,400),(500,600)], columns=['a','b'])
a b
0 100 200
1 300 400
2 500 600
I would like to add df2 to the tail of df1 (the last df2.shape[0] rows, i.e. df1.iloc[-df2.shape[0]:]), thus obtaining:
a b
0 1 2
1 3 4
2 5 6
3 107 208
4 309 410
5 511 612
Any idea?
Thanks!
If df1 has at least as many rows as df2, you can use DataFrame.iloc and convert the values to a numpy array to avoid index alignment (different indices would create NaNs):
df1.iloc[-df2.shape[0]:] += df2.to_numpy()
print (df1)
a b
0 1 2
1 3 4
2 5 6
3 107 208
4 309 410
5 511 612
For a general solution that works with any number of rows (assuming unique indices in both DataFrames), use rename with DataFrame.add:
df = df1.add(df2.rename(dict(zip(df2.index[::-1], df1.index[::-1]))), fill_value=0)
print (df)
a b
0 1.0 2.0
1 3.0 4.0
2 5.0 6.0
3 107.0 208.0
4 309.0 410.0
5 511.0 612.0
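Note that add with fill_value=0 upcasts to float. If the original integer dtype matters, you could cast back afterwards (a sketch):
df = df.astype(int)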
I've got a date-ordered dataframe that can be grouped. What I am attempting to do is group by a variable (Person), determine the maximum (Weight) for each group (Person), and then drop all rows that come after (by Date) that maximum.
Here's an example of the data:
df = pd.DataFrame({'Person': [1,1,1,1,1,2,2,2,2,2], 'Date': ['1/1/2015','2/1/2015','3/1/2015','4/1/2015','5/1/2015','6/1/2011','7/1/2011','8/1/2011','9/1/2011','10/1/2011'], 'MonthNo': [1,2,3,4,5,1,2,3,4,5], 'Weight': [100,110,115,112,108,205,210,211,215,206]})
Date MonthNo Person Weight
0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
3 4/1/2015 4 1 112
4 5/1/2015 5 1 108
5 6/1/2011 1 2 205
6 7/1/2011 2 2 210
7 8/1/2011 3 2 211
8 9/1/2011 4 2 215
9 10/1/2011 5 2 206
Here's what I want the result to look like:
Date MonthNo Person Weight
0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
5 6/1/2011 1 2 205
6 7/1/2011 2 2 210
7 8/1/2011 3 2 211
8 9/1/2011 4 2 215
I think it's worth noting that there can be disjoint start dates and the maximum may appear at different times for each person.
My idea was to find the maximum for each group, obtain the MonthNo that maximum occurred in for that group, and then discard any rows with a MonthNo greater than the max-Weight MonthNo. So far I've been able to obtain the max by group, but cannot get past doing a comparison based on that.
Please let me know if I can edit/provide more information, haven't posted many questions here! Thanks for the help, sorry if my formatting/question isn't clear.
Using idxmax with groupby
df.groupby('Person',sort=False).apply(lambda x : x.reset_index(drop=True).iloc[:x.reset_index(drop=True).Weight.idxmax()+1,:])
Out[131]:
Date MonthNo Person Weight
Person
1 0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
2 0 6/1/2011 1 2 205
1 7/1/2011 2 2 210
2 8/1/2011 3 2 211
3 9/1/2011 4 2 215
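The same idea written step by step, if the one-liner is hard to read (a sketch using the example frame from the question):
def upto_max(g):
    g = g.reset_index(drop=True)              # work with positions inside the group
    return g.iloc[:g['Weight'].idxmax() + 1]  # keep rows up to and including the group's max

df.groupby('Person', sort=False, group_keys=False).apply(upto_max)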
You can use groupby.transform with idxmax. The first 2 steps may not be necessary depending on how your dataframe is structured.
# convert Date to datetime
df['Date'] = pd.to_datetime(df['Date'])
# sort by Person and Date to make index usable for next step
df = df.sort_values(['Person', 'Date']).reset_index(drop=True)
# filter for index less than idxmax transformed by group
df = df[df.index <= df.groupby('Person')['Weight'].transform('idxmax')]
print(df)
Date MonthNo Person Weight
0 2015-01-01 1 1 100
1 2015-02-01 2 1 110
2 2015-03-01 3 1 115
5 2011-06-01 1 2 205
6 2011-07-01 2 2 210
7 2011-08-01 3 2 211
8 2011-09-01 4 2 215
I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({'A': ['286a2', '17', '286a1', '373', '200b', '150'], 'B': range(6)})
A B
0 286a2 0
1 17 1
2 286a1 2
3 373 3
4 200b 4
5 150 5
which I want to sort according to A. When I do this using
df.sort_values(by='A')
I obtain
A B
5 150 5
1 17 1
4 200b 4
2 286a1 2
0 286a2 0
3 373 3
which is almost correct: I would like 17 to come before 150, but I don't know how to achieve this, since the entries are not plain numbers but strings consisting of numeric values and letters. Is there a way to do this?
EDIT
About the pattern of the entries:
Each entry always starts with a numeric value of arbitrary length, which may be followed by letters, which may in turn be followed by numeric values again.
You can replace the letters with ., cast to float, and then use sort_index:
df.index = df['A'].str.replace('[a-zA-Z]+', '.', regex=True).astype(float)
df = df.sort_index().reset_index(drop=True)
print (df)
A B
0 17 1
1 150 5
2 200b 4
3 286a1 2
4 286a2 0
5 373 3
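In pandas 1.1+ you can also pass a key function to sort_values, which avoids overwriting the index (a sketch using the same letters-to-dot idea):
df.sort_values('A', key=lambda s: s.str.replace('[a-zA-Z]+', '.', regex=True).astype(float))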
Another variant of jezrael's:
In [1706]: df.assign(
    A_=df.A.str.replace('[/\D]', '.', regex=True).astype(float)  # or '[a-zA-Z]+'
).sort_values(by='A_').drop(columns='A_')
Out[1706]:
A B
1 17 1
5 150 5
4 200b 4
2 286a1 2
0 286a2 0
3 373 3
Or you can try natsort:
from natsort import natsorted, ns
df.set_index('A').reindex(natsorted(df.A, key=lambda y: y.lower())).reset_index()
Out[395]:
A B
0 17 1
1 150 5
2 200b 4
3 286a1 2
4 286a2 0
5 373 3
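natsort also provides index_natsorted, which can be combined with iloc to reorder the rows without touching the index (a sketch):
from natsort import index_natsorted
df.iloc[index_natsorted(df.A)]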
In Python, given a list of ratings such as:
import pandas as pd
path = 'ratings_ml100k.csv'
data = pd.read_csv(path,sep= ',')
print(data)
user_id item_id rating
28422 100 690 4
32020 441 751 4
15819 145 265 5
where the items are:
print(itemsTrain)
[ 690 751 265 ..., 1650 1447 1507]
For each item, I would like to compute the number of ratings. Is there any way to do this without resorting to a loop? All ideas are appreciated.
data is a pandas dataframe. The desired output should look like this:
pop =
item_id rating_count
690 120
751 10
265 159
... ...
Note that itemsTrain contains the unique item_ids in the ratings dataset data.
you can do it this way:
In [200]: df = pd.DataFrame(np.random.randint(0,8,(15,2)),columns=['id', 'rating'])
In [201]: df
Out[201]:
id rating
0 4 6
1 0 1
2 2 4
3 2 5
4 2 7
5 3 5
6 6 1
7 4 3
8 4 3
9 3 2
10 2 4
11 7 7
12 3 1
13 2 7
14 7 3
In [202]: df.groupby('id').rating.count()
Out[202]:
id
0 1
2 5
3 3
4 3
6 1
7 2
Name: rating, dtype: int64
If you want the result as a DataFrame (you can also name the count column as you wish):
In [206]: df.groupby('id').rating.count().to_frame('count').reset_index()
Out[206]:
id count
0 0 1
1 2 5
2 3 3
3 4 3
4 6 1
5 7 2
You can also count the number of unique ratings:
In [203]: df.groupby('id').rating.nunique()
Out[203]:
id
0 1
2 3
3 3
4 2
6 1
7 2
Name: rating, dtype: int64
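An equivalent shortcut is value_counts on the item column; it can then be shaped into the pop frame from the question (a sketch, assuming the data frame shown above):
pop = (data['item_id'].value_counts()
           .rename_axis('item_id')
           .reset_index(name='rating_count'))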
You can use df.groupby() to group the rows by item_id and then count() to count the ratings per item.
Do as follows:
# df is your dataframe
# .groupby('item_id') makes one group per distinct item_id
# .rating selects the rating column within each group
# .count() counts the non-null ratings in each group
df.groupby('item_id').rating.count()