Let's say I have a dataframe like this:
name time
a 10
b 30
c 11
d 13
Now I want a new dataframe like this:
name1 name2 time_diff
a a 0
a b -20
a c -1
a d -3
b a 20
b b 0
b c 19
b d 17
.....
.....
d d 0
Nested for loops or a lambda function can be used, but once the number of elements goes above 200, the for loops take so long that I always have to interrupt the process. Does someone know a pandas way, or something quicker and easier? The shape of my dataframe is 1600x2.
Solution with itertools:
import itertools
import pandas as pd
d=pd.DataFrame(list(itertools.product(df.name,df.name)),columns=['name1','name2'])
dic = dict(zip(df.name,df.time))
d['time_diff']=d.name1.map(dic)-d.name2.map(dic)
print(d)
name1 name2 time_diff
0 a a 0
1 a b -20
2 a c -1
3 a d -3
4 b a 20
5 b b 0
6 b c 19
7 b d 17
8 c a 1
9 c b -19
10 c c 0
11 c d -2
12 d a 3
13 d b -17
14 d c 2
15 d d 0
Use a cross join first, via merge on a helper column, then take the difference and select only the necessary columns:
df = df.assign(A=1)
df = pd.merge(df, df, on='A', suffixes=('1','2'))
df['time_diff'] = df['time1'] - df['time2']
df = df[['name1','name2','time_diff']]
print (df)
name1 name2 time_diff
0 a a 0
1 a b -20
2 a c -1
3 a d -3
4 b a 20
5 b b 0
6 b c 19
7 b d 17
8 c a 1
9 c b -19
10 c c 0
11 c d -2
12 d a 3
13 d b -17
14 d c 2
15 d d 0
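On pandas 1.2 or later, the helper column can probably be skipped, since merge accepts how='cross'. A minimal sketch, assuming the same name/time frame as in the question (the variable name pairs is only for illustration):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"name": list("abcd"), "time": [10, 30, 11, 13]})

# how="cross" builds every (row, row) pair; requires pandas >= 1.2
pairs = df.merge(df, how="cross", suffixes=("1", "2"))
pairs["time_diff"] = pairs["time1"] - pairs["time2"]
pairs = pairs[["name1", "name2", "time_diff"]]
print(pairs)
```

The result has len(df) ** 2 rows, so for 1600 names it holds about 2.56 million rows.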
Another solution with MultiIndex.from_product and reindex by first and second level:
df = df.set_index('name')
mux = pd.MultiIndex.from_product([df.index, df.index], names=['name1','name2'])
df = (df['time'].reindex(mux, level=0)
.sub(df.reindex(mux, level=1)['time'])
.rename('time_diff')
.reset_index())
Another way would be df.apply:
df=pd.DataFrame({'col':['a','b','c','d'],'col1':[10,30,11,13]})
index = pd.MultiIndex.from_product([df['col'], df['col']], names = ["name1", "name2"])
res=pd.DataFrame(index = index).reset_index()
res['time_diff']=df.apply(lambda x: x['col1']-df['col1'],axis=1).values.flatten()
Output:
name1 name2 time_diff
0 a a 0
1 a b -20
2 a c -1
3 a d -3
4 b a 20
5 b b 0
6 b c 19
7 b d 17
8 c a 1
9 c b -19
10 c c 0
11 c d -2
12 d a 3
13 d b -17
14 d c 2
15 d d 0
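Since the question is about speed, the pairwise differences can also be computed with NumPy broadcasting and then flattened. This is a sketch rather than one of the answers above, and it assumes the column names name and time from the question:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"name": list("abcd"), "time": [10, 30, 11, 13]})

names = df["name"].to_numpy()
times = df["time"].to_numpy()
n = len(df)

# diff[i, j] = time of row i minus time of row j
diff = times[:, None] - times[None, :]

out = pd.DataFrame({
    "name1": np.repeat(names, n),   # a a a a b b b b ...
    "name2": np.tile(names, n),     # a b c d a b c d ...
    "time_diff": diff.ravel(),      # row-major order matches name1/name2
})
print(out)
```

The repeat/tile pairing relies on ravel() flattening row by row, so name1 varies slowest and name2 fastest.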
I have a DataFrame which contains more than 2000 rows.
Here is a part of my DataFrame:
In [2]: df
Out[2]:
A B C D
0 a b -1 3.5
1 a b -1 52
2 a b -1 2
3 a b -1 0
4 a b 0 15
5 a c -1 1612
6 a c 1 17
7 a e 1 52
8 a d -1 412
9 a d -1 532
Grouping by A, B and C, I would like to find the index of the value in column D that is closest to (next after) the group median, and add a new column Next_Med to label it.
Here is the expected result :
A B C D Next_Med
0 a b -1 3.5 1
1 a b -1 52 0
2 a b -1 2 0
3 a b -1 0 0
4 a b 0 15 1
5 a c -1 1612 1
6 a c 1 17 1
7 a e 1 52 1
8 a d -1 412 0
9 a d -1 532 1
For example, for the combination a, b and -1, the median value is 2.75, so I'd like to label 3.5 as Next_Med.
Try the following one-liner with groupby and transform with a lambda:
>>> df['Next_Med'] = df.sort_values([*'ABC']).groupby([*'ABC'])['D'].transform(lambda x: x == min(x, key=lambda y: abs(y - x.median()))).astype(int).reset_index(drop=True)
>>> df
A B C D Next_Med
0 a b -1 3.5 1
1 a b -1 52.0 0
2 a b -1 2.0 0
3 a b -1 0.0 0
4 a b 0 15.0 1
5 a c -1 1612.0 1
6 a c 1 17.0 1
7 a e 1 52.0 1
8 a d -1 412.0 0
9 a d -1 532.0 1
>>>
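One possible reading of "closest (next)" in the question is the smallest value in each group that is at or above the group median, which is what the expected output for the 412/532 pair suggests. A minimal sketch under that assumption (the helper name next_above_median is made up for illustration):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "A": ["a"] * 10,
    "B": ["b", "b", "b", "b", "b", "c", "c", "e", "d", "d"],
    "C": [-1, -1, -1, -1, 0, -1, 1, 1, -1, -1],
    "D": [3.5, 52, 2, 0, 15, 1612, 17, 52, 412, 532],
})

def next_above_median(s):
    # Index label of the smallest value that is >= the group's median
    return s[s >= s.median()].idxmin()

# One index label per (A, B, C) group
idx = df.groupby(["A", "B", "C"])["D"].apply(next_above_median)

# Flag the selected rows by index label, so the row order does not matter
df["Next_Med"] = df.index.isin(idx).astype(int)
print(df)
```

Because the rows are flagged by index label, this does not depend on the frame being sorted by the grouping columns.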
I have a table with scores for each product, covering the 10 days over which the products need to be sold, and the availability of each product (total number of products = 10).
A B C D
20 56 12 65
80 13 76 51
24 81 56 90
67 12 65 87
45 23 67 50
62 32 23 75
76 34 67 67
23 45 32 98
24 67 34 12
56 53 32 78
Product availability
A 3
B 2
C 3
D 2
First I had to rank each product and prioritize what I need to sell for each day. I was able to do that by
import pandas as pd
df = pd.read_csv('test.csv')
new_df = pd.DataFrame()
num = len(list(df))
for i in range(1, num + 1):
    new_df['Max' + str(i)] = df.T.apply(lambda x: x.nlargest(i).idxmin())
print(new_df)
That gives me
Max1 Max2 Max3 Max4
0 D B A C
1 A C D B
2 D B C A
3 D A C B
4 C D A B
5 D A B C
6 A C C B
7 D B C A
8 B C A D
9 D A B C
Now comes the hard part: how do I create a table that contains the product to be sold on each day, looking at the Max1 column but also keeping track of availability? If the product is not available, then choose the next maximum. The final df should look like this:
0 D
1 A
2 D
3 A
4 C
5 A
6 C
7 B
8 B
9 C
Breaking my head over this. Any help is appreciated. Thanks.
import pandas as pd
df1 = pd.read_csv('file1', sep=r'\s+', header=None, names=['product', 'available'])
print(df1)
df2 = pd.read_csv('file2', sep=r'\s+')
print(df2)
maxy = []
for i in range(len(df2)):
    # Try Max1, then Max2, Max3, Max4; take the first product still in stock.
    # .loc is used instead of chained indexing so the decrement changes df1 itself.
    if df1.loc[df1['product'] == df2['Max1'][i], 'available'].values[0] > 0:
        maxy.append(df2['Max1'][i])
        df1.loc[df1['product'] == df2['Max1'][i], 'available'] -= 1
    elif df1.loc[df1['product'] == df2['Max2'][i], 'available'].values[0] > 0:
        maxy.append(df2['Max2'][i])
        df1.loc[df1['product'] == df2['Max2'][i], 'available'] -= 1
    elif df1.loc[df1['product'] == df2['Max3'][i], 'available'].values[0] > 0:
        maxy.append(df2['Max3'][i])
        df1.loc[df1['product'] == df2['Max3'][i], 'available'] -= 1
    elif df1.loc[df1['product'] == df2['Max4'][i], 'available'].values[0] > 0:
        maxy.append(df2['Max4'][i])
        df1.loc[df1['product'] == df2['Max4'][i], 'available'] -= 1
    else:
        print("Check")
pd.DataFrame(maxy)
Output:
product available
0 A 3
1 B 2
2 C 3
3 D 2
Max1 Max2 Max3 Max4
0 D B A C
1 A C D B
2 D B C A
3 D A C B
4 C D A B
5 D A B C
6 A C C B
7 D B C A
8 B C A D
9 D A B C
0
0 D
1 A
2 D
3 A
4 C
5 A
6 C
7 B
8 B
9 C
I was able to do that for any number of products through this
cols = list(df2)
maxy=[]
for i in range(len(df2)):
    for x in cols:
        # take the first ranked product that still has stock, then move to the next day
        if df1.loc[df1['product'] == df2[x][i], 'available'].values[0] > 0:
            maxy.append(df2[x][i])
            df1.loc[df1['product'] == df2[x][i], 'available'] -= 1
            break
final=pd.DataFrame(maxy)
print(final)
Thanks
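The same greedy idea can be sketched with a plain dict for the stock, which avoids the repeated boolean lookups into df1. The function name allocate and the variable names are made up for illustration; the ranking table and availabilities are copied from the question:

```python
import pandas as pd

# Ranked products per day (Max1..Max4), copied from the question
ranked = pd.DataFrame({
    "Max1": list("DADDCDADBD"),
    "Max2": list("BCBADACBCA"),
    "Max3": list("ADCCABCCAB"),
    "Max4": list("CBABBCBADC"),
})
available = {"A": 3, "B": 2, "C": 3, "D": 2}

def allocate(ranked, available):
    stock = dict(available)          # copy, so the caller's dict is untouched
    picks = []
    for prefs in ranked.itertuples(index=False):
        # first product in ranking order that still has stock, else None
        choice = next((p for p in prefs if stock.get(p, 0) > 0), None)
        if choice is not None:
            stock[choice] -= 1
        picks.append(choice)
    return pd.Series(picks, index=ranked.index, name="product")

result = allocate(ranked, available)
print(result)
```

A day with no product left in stock gets None instead of being dropped, so the result stays aligned with the days.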
Suppose I have the following Pandas dataframe:
In[285]: df = pd.DataFrame({'Name':['A','B'], 'Start': [1,6], 'End': [4,12]})
In [286]: df
Out[286]:
Name Start End
0 A 1 4
1 B 6 12
Now I would like to construct the dataframe as follows:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
My biggest struggle is in getting the 'Name' column right. Is there a smart way to do this in Python?
I would do pd.concat on a generator expression (this needs import numpy as np and import pandas as pd):
pd.concat(pd.DataFrame({'Number': np.arange(s, e + 1)})
            .assign(Name=n)
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
Output:
Number Name
0 1 A
1 2 A
2 3 A
3 4 A
0 6 B
1 7 B
2 8 B
3 9 B
4 10 B
5 11 B
6 12 B
Update: As commented by @rafaelc:
pd.concat(pd.DataFrame({'Number': np.arange(s,e+1), 'Name': n})
for n,s,e in zip(df['Name'], df['Start'], df['End']))
works just fine.
Let us do it with this example (with 3 names):
import pandas as pd
df = pd.DataFrame({'Name':['A','B','C'], 'Start': [1,6,18], 'End': [4,12,20]})
You may create the target columns first, using list comprehensions:
name = [row.Name for i, row in df.iterrows() for _ in range(row.End - row.Start + 1)]
number = [k for i, row in df.iterrows() for k in range(row.Start, row.End + 1)]
And then you can create the target DataFrame:
expanded = pd.DataFrame({"Name": name, "Number": number})
You get:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
11 C 18
12 C 19
13 C 20
I'd take advantage of loc and index.repeat for a vectorized solution.
base = df.loc[df.index.repeat(df['End'] - df['Start'] + 1), ['Name', 'Start']]
base['Start'] += base.groupby(level=0).cumcount()
Name Start
0 A 1
0 A 2
0 A 3
0 A 4
1 B 6
1 B 7
1 B 8
1 B 9
1 B 10
1 B 11
1 B 12
Of course, we can rename the column and reset the index at the end, for nicer display.
base.rename(columns={'Start': 'Number'}).reset_index(drop=True)
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
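On pandas 0.25 or later, another option might be to build one list of numbers per row and let explode do the expansion. A minimal sketch, assuming the Name/Start/End frame from the question (the variable name expanded is only for illustration):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({'Name': ['A', 'B'], 'Start': [1, 6], 'End': [4, 12]})

# One list of consecutive integers per row, Start..End inclusive
numbers = pd.Series(
    [list(range(s, e + 1)) for s, e in zip(df['Start'], df['End'])],
    index=df.index,
)

expanded = (
    df.assign(Number=numbers)
      .explode('Number')              # one output row per list element
      .astype({'Number': int})        # explode leaves an object column
      .loc[:, ['Name', 'Number']]
      .reset_index(drop=True)
)
print(expanded)
```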
I have a DF such as the following:
df =
vid pos value sente
1 a A 21
2 b B 21
3 b A 21
3 a A 21
1 d B 22
1 a C 22
1 a D 22
2 b A 22
3 a A 22
Now I want to combine all rows with the same values for sente and vid into one row, with the values of pos and value each joined by a space:
df2 =
vid  pos    value  sente
1    a      A      21
2    b      B      21
3    b a    A A    21
1    d a a  B C D  22
2    b      A      22
3    a      A      22
I suppose a modification of this should do the trick:
df2 = df.groupby("sente").agg(lambda x: " ".join(x))
But I can't seem to figure out how to add the second column to the statement.
Groupers can be passed as lists. You can also simplify your solution a bit by dropping the lambda; it isn't needed.
df.groupby(['vid', 'sente'], as_index=False, sort=False).agg(' '.join)
   vid  sente    pos  value
0    1     21      a      A
1    2     21      b      B
2    3     21    b a    A A
3    1     22  d a a  B C D
4    2     22      b      A
5    3     22      a      A
Some other notes: specifying as_index=False means your groupers will be present as columns in the result (and not as the index, as is the default). Furthermore, sort=False will preserve the order in which the groups first appear in the data, instead of sorting the result by the group keys.
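For a self-contained version, here is a sketch that rebuilds the question's frame and spells out the aggregation per column with a dict. The dict form is just one way to write it; the result should be the same as passing ' '.join directly:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "vid":   [1, 2, 3, 3, 1, 1, 1, 2, 3],
    "pos":   list("abbadaaba"),
    "value": list("ABAABCDAA"),
    "sente": [21, 21, 21, 21, 22, 22, 22, 22, 22],
})

out = (
    df.groupby(["vid", "sente"], as_index=False, sort=False)
      .agg({"pos": " ".join, "value": " ".join})   # join the strings in each group
)
print(out)
```

The dict form also makes it easy to aggregate other columns differently, for example counting rows in one column while joining another.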
As of this edit, @cᴏʟᴅsᴘᴇᴇᴅ's answer is way better.
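Fun way! This only works because the values are single characters:
# sum(level=...) was removed in pandas 2.0; groupby(level=...).sum() is the equivalent
df.set_index(['sente', 'vid']).groupby(level=[0, 1]).sum().apply(lambda s: s.str.join(' ')).reset_index()
   sente  vid    pos  value
0     21    1      a      A
1     21    2      b      B
2     21    3    b a    A A
3     22    1  d a a  B C D
4     22    2      b      A
5     22    3      a      A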
A somewhat OK answer:
df.set_index(['sente', 'vid']).groupby(level=[0, 1]).apply(
    lambda d: pd.Series(d.to_dict('list')).str.join(' ')
).reset_index()
   sente  vid    pos  value
0     21    1      a      A
1     21    2      b      B
2     21    3    b a    A A
3     22    1  d a a  B C D
4     22    2      b      A
5     22    3      a      A
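Not recommended:
# append a space to every value, concatenate per group, then strip the trailing space
df.set_index(['sente', 'vid']).add(' ') \
    .groupby(level=[0, 1]).sum().apply(lambda s: s.str.strip()).reset_index()
   sente  vid    pos  value
0     21    1      a      A
1     21    2      b      B
2     21    3    b a    A A
3     22    1  d a a  B C D
4     22    2      b      A
5     22    3      a      A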
I have the dataframe df
import numpy as np
import pandas as pd
b=np.array([0,1,2,2,0,1,2,2,3,4,4,4,5,6,0,1,0,0]).reshape(-1,1)
c=np.array(['a','a','a','a','b','b','b','b','b','b','b','b','b','b','c','c','d','e']).reshape(-1,1)
df = pd.DataFrame(np.hstack([b,c]),columns=['Start','File'])
df
Out[22]:
Start File
0 0 a
1 1 a
2 2 a
3 2 a
4 0 b
5 1 b
6 2 b
7 2 b
8 3 b
9 4 b
10 4 b
11 4 b
12 5 b
13 6 b
14 0 c
15 1 c
16 0 d
17 0 e
I would like to rename the index as index_File, so that the index labels are 0_a, 1_a, ..., 17_e.
You can use set_index, with or without inplace=True:
df.set_index(df.File.radd(df.index.astype(str) + '_'))
Start File
File
0_a 0 a
1_a 1 a
2_a 2 a
3_a 2 a
4_b 0 b
5_b 1 b
6_b 2 b
7_b 2 b
8_b 3 b
9_b 4 b
10_b 4 b
11_b 4 b
12_b 5 b
13_b 6 b
14_c 0 c
15_c 1 c
16_d 0 d
17_e 0 e
At the expense of a few more characters of code, we can speed this up and get rid of the unnecessary index name:
df.set_index(df.File.values.__radd__(df.index.astype(str) + '_'))
Start File
0_a 0 a
1_a 1 a
2_a 2 a
3_a 2 a
4_b 0 b
5_b 1 b
6_b 2 b
7_b 2 b
8_b 3 b
9_b 4 b
10_b 4 b
11_b 4 b
12_b 5 b
13_b 6 b
14_c 0 c
15_c 1 c
16_d 0 d
17_e 0 e
You can assign directly to the index, first converting the default index to str using astype and then concatenating the strings as usual:
In[41]:
df.index = df.index.astype(str) + '_' + df['File']
df
Out[41]:
Start File
File
0_a 0 a
1_a 1 a
2_a 2 a
3_a 2 a
4_b 0 b
5_b 1 b
6_b 2 b
7_b 2 b
8_b 3 b
9_b 4 b
10_b 4 b
11_b 4 b
12_b 5 b
13_b 6 b
14_c 0 c
15_c 1 c
16_d 0 d
17_e 0 e
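If plain Python is acceptable, the labels can also be built with an f-string over the index and the File column. A small sketch on a shortened version of the question's frame (only four rows, chosen here for brevity):

```python
import pandas as pd

# Shortened sample in the shape of the question's frame
df = pd.DataFrame({"Start": [0, 1, 2, 0], "File": ["a", "a", "a", "b"]})

# Build "<index>_<File>" labels row by row and assign them as the new index
df.index = [f"{i}_{f}" for i, f in zip(df.index, df["File"])]
print(df)
```

Assigning a plain list leaves the index unnamed, so the extra File header line above the index does not appear.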