I have the following list in Python:
movie_list = [11, 21, 31, 41, 51, 62, 55]
and the following movie dataframe:
userId movieId
1 11
1 21
1 31
2 62
2 55
Now what I want to do is generate a similar dataframe containing, for each user, the movieIds that are in movie_list but not already in the dataframe for that user.
My desired dataframe would be:
userId movieId
1 41
1 51
1 62
1 55
2 11
2 21
2 31
2 41
2 51
How can I do it in pandas?
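For reference, the example frame can be reconstructed like this (a minimal sketch, assuming plain integer columns):
import pandas as pd

movie_list = [11, 21, 31, 41, 51, 62, 55]
df = pd.DataFrame({'userId': [1, 1, 1, 2, 2],
                   'movieId': [11, 21, 31, 62, 55]})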
IIUC, we can do the agg with list, then find the difference between the original values in df and movie_list:
s = (df.groupby('userId').movieId.agg(list)
       .map(lambda x: list(set(movie_list) - set(x)))
       .explode().reset_index())
userId movieId
0 1 41
1 1 51
2 1 62
3 1 55
4 2 41
5 2 11
6 2 51
7 2 21
8 2 31
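Note that Series.explode requires pandas 0.25 or newer. On an older version, a hedged equivalent without explode is to build the rows directly:
# Same result without explode: collect the unseen movies per user into rows.
seen = df.groupby('userId')['movieId'].agg(set)
s = pd.DataFrame([(u, m) for u, movies in seen.items()
                  for m in set(movie_list) - movies],
                 columns=['userId', 'movieId'])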
One approach would be to use itertools.product to create all combinations of userId & movieId, then concat and drop_duplicates:
from itertools import product
movie_list = [11, 21, 31, 41, 51, 62, 55]
df_all = pd.DataFrame(product(df['userId'].unique(), movie_list), columns=df.columns)
df2 = pd.concat([df, df_all]).drop_duplicates(keep=False)
print(df2)
[out]
userId movieId
3 1 41
4 1 51
5 1 62
6 1 55
7 2 11
8 2 21
9 2 31
10 2 41
11 2 51
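Note that drop_duplicates(keep=False) drops every row that occurs in both df and df_all, so only the unseen (userId, movieId) combinations survive. This relies on df containing no duplicate rows and on every row of df also appearing in df_all (i.e., every movieId in df is in movie_list).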
prod = pd.MultiIndex.from_product([df.userId.unique().tolist(), movie_list]).tolist()
(
pd.DataFrame(set(prod).difference([tuple(e) for e in df.values]),
columns=['userId', 'movieId'])
.sort_values(by=['userId', 'movieId'])
)
userId movieId
7 1 41
6 1 51
2 1 55
8 1 62
5 2 11
4 2 21
3 2 31
1 2 41
0 2 51
I think you need:
df = df.groupby("userId")["movieId"].apply(list).reset_index()
df["movieId"] = df["movieId"].apply(lambda x: list(set(movie_list)-set(x)))
df = df.explode("movieId")
print(df)
Output:
userId movieId
0 1 41
0 1 51
0 1 62
0 1 55
1 2 41
1 2 11
1 2 51
1 2 21
1 2 31
I have this example df:
col1 = [1,1,1,2,2,1,1,1,2,2,2]
col2 = [20, 23, 12, 44, 14, 42, 44, 1, 42, 62, 11]
data = {"col1": col1, "col2": col2}
df = pd.DataFrame(data)
I need to add a column that sums col2 over each consecutive run of 1s in col1, and likewise over each run of 2s. I have tried grouping by col1, but that merges the separate runs whenever a 2 sits in between.
The expected output would be this.
col1 col2 col3
1 20 55
1 23 55
1 12 55
2 44 58
2 14 58
1 42 87
1 44 87
1 1 87
2 42 115
2 62 115
2 11 115
Please let me know how to approach this.
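For reference, the plain-groupby attempt pools the separate runs together (a quick sketch of the described attempt):
# Groups all 1s and all 2s regardless of position:
# every 1 row gets 142 (55 + 87) and every 2 row gets 173 (58 + 115).
print(df.groupby('col1')['col2'].transform('sum'))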
Use GroupBy.transform with a helper Series for consecutive runs, generated by comparing against the shifted values for inequality and taking the cumulative sum:
df['col3'] = df.groupby(df['col1'].ne(df['col1'].shift()).cumsum())['col2'].transform('sum')
print (df)
col1 col2 col3
0 1 20 55
1 1 23 55
2 1 12 55
3 2 44 58
4 2 14 58
5 1 42 87
6 1 44 87
7 1 1 87
8 2 42 115
9 2 62 115
10 2 11 115
You can do this by creating a column that marks every change in col1 and then summing per group with groupby:
i = df.col1
df['Var3'] = i.ne(i.shift()).cumsum()
df['sums'] = df.groupby(['Var3'])['col2'].transform('sum')
which gives
col1 col2 Var3 sums
0 1 20 1 55
1 1 23 1 55
2 1 12 1 55
3 2 44 2 58
4 2 14 2 58
5 1 42 3 87
6 1 44 3 87
7 1 1 3 87
8 2 42 4 115
9 2 62 4 115
10 2 11 4 115
Here I'm sharing sample data (I'm dealing with big data); the "counts" value varies from 1 to 3000+, sometimes more than that.
Sample data looks like:
ID counts
41 44 17 16 19 52 6
17 30 16 19 4
52 41 44 30 17 16 6
41 44 52 41 41 41 6
17 17 17 17 41 5
I was trying to split the "ID" column into multiple columns and get the counts:
data = pd.read_csv(csv_file)  # csv_file: path to the input file
split_data = data.ID.apply(lambda x: pd.Series(str(x).split(" "))) # separating columns
As I mentioned, I'm dealing with big data, so this method is not very effective, and I'm having trouble getting the "ID" counts.
I want to collect the total counts of each ID & map it to the corresponding ID column.
Expected output:
ID counts 16 17 19 30 41 44 52
41 44 17 16 19 52 6 1 1 1 0 1 1 1
17 30 16 19 4 1 1 1 1 0 0 0
52 41 44 30 17 16 6 1 1 0 1 1 1 1
41 44 52 41 41 41 6 0 0 0 0 4 1 1
17 17 17 17 41 5 0 4 0 0 1 0 0
If you have any idea, please let me know.
Thank you.
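For anyone reproducing this without the CSV, the sample can be sketched as follows (assuming ID is a single space-separated string column):
import pandas as pd

df = pd.DataFrame({'ID': ['41 44 17 16 19 52', '17 30 16 19', '52 41 44 30 17 16',
                          '41 44 52 41 41 41', '17 17 17 17 41'],
                   'counts': [6, 4, 6, 6, 5]})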
Use Counter to get the counts of the values split by space, in a list comprehension:
from collections import Counter
L = [{int(k): v for k, v in Counter(x.split()).items()} for x in df['ID']]
df1 = pd.DataFrame(L, index=df.index).fillna(0).astype(int).sort_index(axis=1)
df = df.join(df1)
print (df)
ID counts 16 17 19 30 41 44 52
0 41 44 17 16 19 52 6 1 1 1 0 1 1 1
1 17 30 16 19 4 1 1 1 1 0 0 0
2 52 41 44 30 17 16 6 1 1 0 1 1 1 1
3 41 44 52 41 41 41 6 0 0 0 0 4 1 1
4 17 17 17 17 41 5 0 4 0 0 1 0 0
Another idea, but I guess slower:
df1 = df.assign(a = df['ID'].str.split()).explode('a')
df1 = df.join(pd.crosstab(df1['ID'], df1['a']), on='ID')
print (df1)
ID counts 16 17 19 30 41 44 52
0 41 44 17 16 19 52 6 1 1 1 0 1 1 1
1 17 30 16 19 4 1 1 1 1 0 0 0
2 52 41 44 30 17 16 6 1 1 0 1 1 1 1
3 41 44 52 41 41 41 6 0 0 0 0 4 1 1
4 17 17 17 17 41 5 0 4 0 0 1 0 0
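Since the question stresses scale, one more option worth mentioning (a hedged alternative using scikit-learn, not covered by the answers above) is CountVectorizer, which builds the counts as a sparse matrix:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(token_pattern=r'\S+')  # treat each space-separated token as a term
mat = vec.fit_transform(df['ID'])            # sparse counts; this is the part that scales
counts = pd.DataFrame(mat.toarray(), columns=vec.get_feature_names_out(), index=df.index)
df1 = df.join(counts)                        # note: the new columns are strings, not ints
# get_feature_names_out needs scikit-learn 1.0+; older versions have get_feature_names.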
My data frame looks like:
id marital_status age city1 city2
1 Married 32 12 64
2 Married 34 8 39
3 Single 53 0 72
4 Divorce 37 2 83
5 Divorce 42 10 52
6 Single 29 3 82
7 Married 37 8 64
The data frame has 22.4 million records.
My objective is that, based on the conditional statement, my final data frame looks like this:
id marital_status age city1 city2 present
1 Married 32 12 64 1
2 Married 34 8 39 1
3 Single 53 0 72 0
4 Divorce 37 2 83 0
5 Divorce 42 10 52 0
6 Single 29 3 82 0
7 Married 37 8 64 1
What I have done so far:
test_df = pd.read_csv('city.csv')
condition = ((test_df['city1'] >= 5) &\
(test_df['marital_status'] == 'Married') &\
(test_df['age'] >= 32))
test_df.loc[:, 'present'] = test_df.where(condition, 1)
But I got NA values in the 'present' column.
Can anybody help me?
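To reproduce without the CSV, a minimal frame matching the sample (a sketch):
import pandas as pd
import numpy as np

test_df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                        'marital_status': ['Married', 'Married', 'Single', 'Divorce',
                                           'Divorce', 'Single', 'Married'],
                        'age': [32, 34, 53, 37, 42, 29, 37],
                        'city1': [12, 8, 0, 2, 10, 3, 8],
                        'city2': [64, 39, 72, 83, 52, 82, 64]})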
It is not the np.where function but DataFrame.where that you used in your solution, and the two behave differently.
I think you need to set values by the condition:
test_df['present'] = np.where(condition, 1, 0)
Or cast True/False to 1/0 by Series.astype:
test_df['present'] = condition.astype(int)
print (test_df)
id marital_status age city1 city2 present
0 1 Married 32 12 64 1
1 2 Married 34 8 39 1
2 3 Single 53 0 72 0
3 4 Divorce 37 2 83 0
4 5 Divorce 42 10 52 0
5 6 Single 29 3 82 0
6 7 Married 37 8 64 1
I have 2 data frames, df1 and df2:
df1
Out[66]:
A B
0 1 11
1 1 2
2 1 32
3 1 42
4 1 54
5 1 66
6 2 16
7 2 23
8 3 13
9 3 24
10 3 35
11 3 46
12 3 51
13 4 12
14 4 28
15 4 39
16 4 49
df2
Out[80]:
B
0 32
1 42
2 13
3 24
4 35
5 39
6 49
I want to merge the dataframes, but at the same time include the rows immediately before and after the matched values within each group in column A. This is an example of the desired outcome:
df3
Out[93]:
A B
0 1 2
1 1 32
2 1 42
3 1 54
4 3 13
5 3 24
6 3 35
7 3 46
8 4 28
9 4 39
10 4 49
I'm trying to use merge, but that only slices the portion of the data frames that coincides. Does someone have an idea of how to deal with this? Thanks!
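For reference, a minimal reconstruction of the two frames (a sketch):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4],
                    'B': [11, 2, 32, 42, 54, 66, 16, 23, 13, 24, 35, 46, 51, 12, 28, 39, 49]})
df2 = pd.DataFrame({'B': [32, 42, 13, 24, 35, 39, 49]})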
Here's one way to do it using merge with indicator, groupby, and rolling:
df1[df1.merge(df2, on='B', how='left', indicator='Ind').eval('Found = Ind == "both"')
    .groupby('A')['Found']
    .apply(lambda x: x.rolling(3, center=True, min_periods=2).max()).astype(bool)]
Output:
A B
1 1 2
2 1 32
3 1 42
4 1 54
8 3 13
9 3 24
10 3 35
11 3 46
14 4 28
15 4 39
16 4 49
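The rolling step is what pulls in the neighbours: a centered three-row window spreads each matched True onto the row before and after it, computed per group so the window never crosses an A boundary. A hedged look at the intermediate mask:
found = df1.merge(df2, on='B', how='left', indicator='Ind')['Ind'].eq('both')
# max over a centered window of 3 turns True at i into True at i-1, i and i+1
mask = found.groupby(df1['A']).transform(
    lambda x: x.rolling(3, center=True, min_periods=2).max()).astype(bool)
print(df1[mask])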
pd.concat([df1.groupby('A').min().reset_index(),
           pd.merge(df1, df2, on="B"),
           df1.groupby('A').max().reset_index()]).reset_index(drop=True).drop_duplicates().sort_values(['A', 'B'])
A B
0 1 2
4 1 32
5 1 42
11 1 66
1 2 16
12 2 23
2 3 13
7 3 24
8 3 35
13 3 51
3 4 12
9 4 39
10 4 49
Breaking down each part
#Get Minimum
df1.groupby('A').min().reset_index()
# Merge on B
pd.merge(df1,df2, on="B")
# Get Maximum
df1.groupby('A').max().reset_index()
# Reset the Index and drop duplicated rows since there may be similarities between the Merge and Min/Max. Sort values by 'A' then by 'B'
.reset_index(drop=True).drop_duplicates().sort_values(['A','B'])
I have a dataframe that holds column numbers for another dataframe. Is there a way to return the values from the other dataframe instead of just having the column indices?
I basically want to match up the index between the Push and df dataframes; the values in the Push dataframe say which column I want to return from the df dataframe.
Push dataframe:
0 1
0 1 2
1 0 3
2 0 3
3 1 3
4 0 2
df dataframe:
0 1 2 3 4
0 10 11 22 33 44
1 10 11 22 33 44
2 10 11 22 33 44
3 10 11 22 33 44
4 10 11 22 33 44
return:
0 1
0 11 22
1 10 33
2 10 33
3 11 33
4 10 22
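The two frames as shown can be sketched like this (assuming plain integer columns):
import pandas as pd
import numpy as np

push = pd.DataFrame({0: [1, 0, 0, 1, 0], 1: [2, 3, 3, 3, 2]})
df = pd.DataFrame([[10, 11, 22, 33, 44]] * 5)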
You can do it with np.take; however, this function works on the flattened array, so push must be shifted like this:
In [285]: push1 = push.values+np.arange(0,25,5)[:,None]
In [229]: pd.DataFrame(df.values.take(push1))
EDIT
Actually, I just reinvented np.choose:
In [24]: df
Out[24]:
0 1 2 3 4
0 0 1 2 3 4
1 10 11 12 13 14
2 20 21 22 23 24
3 30 31 32 33 34
4 40 41 42 43 44
In [25]: push
Out[25]:
0 1
0 1 2
1 0 3
2 0 3
3 1 3
4 0 2
In [27]: np.choose(push.T, df.T).T
Out[27]:
0 1
0 1 2
1 10 13
2 20 23
3 31 33
4 40 42
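A related idiom, not from the answers above, is NumPy fancy indexing, which picks one column per row directly, without flattening or transposing:
# out[j, i] = df.values[j, push.values[j, i]]
out = pd.DataFrame(df.values[np.arange(len(df))[:, None], push.values],
                   index=push.index, columns=push.columns)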
We can use melt and then replace. Note that df1 is your Push dataframe and df2 is your df:
df1.astype(str).replace(df2.melt().drop_duplicates().set_index('variable').value.to_dict())
Out[31]:
0 1
0 11 22
1 10 33
2 10 33
3 11 33
4 10 22