How to insert a column into a table in Pandas - python

I have read in some data with Pandas and I want to add a column after the last column. The problem is that the values in the added column start from zero, and I want them to start from one.
I have 12800 rows and want the added column to start at 1, count up to 100, then start over again from 1 to 100. I want this pattern for all the rows, so the 1-to-100 cycle repeats 128 times in total. Can anyone tell me how I can do this?
import numpy as np
import pandas as pd
df = pd.read_csv('...csv')
df1=pd.DataFrame(df.values.reshape(12800, -1))
df1['10'] = df1.index
The result (pictured in the original post) is not correct: I want the last column, number 10, to start from one and follow the pattern I described above.

To repeat a pattern of 1..100 and assign it to a column you can do:
df['1_to_100'] = np.tile(
    np.arange(1, 101), int(len(df) * 0.01) + 1)[:len(df)]
To add a pattern that is 100 per step you can do:
df['by_100'] = np.floor(df.index / 100)
Test Code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(2002, 3))
df['1_to_100'] = np.tile(
    np.arange(1, 101), int(len(df) * 0.01) + 1)[:len(df)]
df['by_100'] = np.floor(df.index / 100)
print(df.head())
print(df.tail())
Results:
0 1 2 1_to_100 by_100
0 0.301862 0.824019 0.267810 1 0.0
1 0.568186 0.040328 0.799634 2 0.0
2 0.887218 0.407702 0.351990 3 0.0
3 0.871072 0.583761 0.498725 4 0.0
4 0.169657 0.026824 0.446667 5 0.0
0 1 2 1_to_100 by_100
1997 0.370640 0.662019 0.541747 98 19.0
1998 0.545908 0.682259 0.970764 99 19.0
1999 0.416177 0.665771 0.926145 100 19.0
2000 0.207109 0.762653 0.813754 1 20.0
2001 0.711998 0.236817 0.025387 2 20.0
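An equivalent way to get both patterns uses integer arithmetic on the row position instead of tiling; a minimal sketch on hypothetical random data:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the question's 12800-row frame.
df = pd.DataFrame({'val': np.random.rand(12800)})

# Modulo gives the repeating 1..100 cycle; floor division gives the
# block number that increments every 100 rows.
df['1_to_100'] = np.arange(len(df)) % 100 + 1   # 1, 2, ..., 100, 1, 2, ...
df['by_100'] = np.arange(len(df)) // 100        # 0, 0, ..., 0, 1, 1, ...
```

This avoids allocating the tiled array and reads directly as "cycle of 100" and "block of 100".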


Improve efficiency of selecting values from dataframe by index

I have a simulation that uses pandas DataFrames to describe objects in a hierarchy. To achieve this, I have used a MultiIndex to show the route to a child object.
Parent df
par_val
a b
0 0.0 0.366660
1.0 0.613888
1 2.0 0.506531
3.0 0.327356
2 4.0 0.684335
0.0 0.013800
3 1.0 0.590058
2.0 0.179399
4 3.0 0.790628
4.0 0.310662
Child df
child_val
a b c
0 0.0 0 0.528217
1.0 0 0.515479
1 2.0 0 0.719221
3.0 0 0.785008
2 4.0 0 0.249344
0.0 0 0.455133
3 1.0 0 0.009394
2.0 0 0.775960
4 3.0 0 0.639091
4.0 0 0.150854
0 0.0 1 0.319277
1.0 1 0.571580
1 2.0 1 0.029063
3.0 1 0.498197
2 4.0 1 0.424188
0.0 1 0.572045
3 1.0 1 0.246166
2.0 1 0.888984
4 3.0 1 0.818633
4.0 1 0.366697
This implies that objects (0,0,0) and (0,0,1) in the child DataFrame are both characterised by the values at (0,0) in the parent DataFrame.
When a function is performed on the child DataFrame for a certain subset of 'a', it may therefore need to grab the corresponding value from the parent. My current solution locates the value in the parent DataFrame by index within the solution function:
import pandas as pd
import numpy as np
import time
from matplotlib import pyplot as plt
r = range(10, 1000, 10)
dt = []
for i in r:
    start = time.time()
    df_par = pd.DataFrame(
        {'a': np.repeat(np.arange(5), i // 5),
         'b': np.append(np.arange(i / 2), np.arange(i / 2)),
         'par_val': np.random.rand(i)
         }).set_index(['a', 'b'])
    df_child = pd.concat([df_par[[]]] * 2, keys=[0, 1], names=['c'])\
        .reorder_levels(['a', 'b', 'c'])
    df_child['child_val'] = np.random.rand(i * 2)
    df_child['solution'] = np.nan

    def solution(row, df_par, var):
        data_level = len(df_par.index.names)
        index_filt = tuple([row.name[i] for i in range(data_level)])
        sol = df_par.loc[index_filt, 'par_val'] / row.child_val
        return sol

    a_mask = df_child.index.get_level_values('a') == 0
    df_child.loc[a_mask, 'solution'] = df_child.loc[a_mask].apply(solution,
                                                                  df_par=df_par,
                                                                  var=10,
                                                                  axis=1)
    stop = time.time()
    dt.append(stop - start)
plt.plot(r, dt)
plt.show()
The solution function is becoming very costly for large numbers of iterations in the simulation (plot of iterations (x) vs. time in seconds (y) not reproduced here).
Is there a more efficient method of calculating this? I have considered including 'par_val' in the child df, but I was trying to avoid this because the very large number of repetitions reduces the number of simulations I can fit in RAM.
par_val is a float64, which takes 8 bytes per value. If the child data frame has 1 million rows, that's 8MB of memory (before the OS's Memory Compression feature kicks in). If it has 1 billion rows, then yes, I would worry about the memory impact.
The bigger performance bottleneck, though, is your df_child.loc[a_mask].apply(..., axis=1) line. This makes pandas use a slow Python-level loop instead of much faster vectorized code. In SQL, the loop approach is called row-by-agonizing-row, and it's an anti-pattern. You generally want to avoid .apply(..., axis=1) for this reason.
Here's one way to improve the performance without changing df_par or df_child:
a_mask = df_child.index.get_level_values('a') == 0
child_val = df_child.loc[a_mask, 'child_val'].droplevel(-1)
solution = df_par.loc[child_val.index, 'par_val'] / child_val
df_child.loc[a_mask, 'solution'] = solution.to_numpy()
(Before/after timing plots omitted.)
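A minimal, self-contained sketch of that vectorized lookup pattern, with tiny hand-built frames standing in for the question's data (values chosen arbitrarily for illustration):

```python
import pandas as pd

# Parent indexed by (a, b); child indexed by (a, b, c), as in the question.
df_par = pd.DataFrame({'par_val': [2.0, 3.0]},
                      index=pd.MultiIndex.from_tuples([(0, 0.0), (0, 1.0)],
                                                      names=['a', 'b']))
df_child = pd.DataFrame({'child_val': [4.0, 8.0, 2.0, 6.0]},
                        index=pd.MultiIndex.from_tuples(
                            [(0, 0.0, 0), (0, 0.0, 1), (0, 1.0, 0), (0, 1.0, 1)],
                            names=['a', 'b', 'c']))

# Drop the extra 'c' level so the child rows carry the same (a, b) labels as
# the parent, then let a single .loc call broadcast the parent values.
child_val = df_child['child_val'].droplevel('c')
df_child['solution'] = (df_par.loc[child_val.index, 'par_val'] / child_val).to_numpy()
```

No per-row Python function is invoked; the whole division happens in one vectorized operation.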

I need to create a dataframe were values reference previous rows

I am just starting to use Python and I'm trying to learn some general things about it. As I was playing around, I wanted to see if I could make a DataFrame that shows a starting number compounded by a return. Sorry if this description doesn't make much sense, but I basically want a DataFrame x rows long that shows me:
number * (1 + return)^(row number) in each row
So, for example, say the number is 10 and the return is 10%; I would like the DataFrame to give me the series
1 11
2 12.1
3 13.3
4 14.6
5 ...
6 ...
Thanks so much in advance!
Let us try:
import numpy as np
import pandas as pd
val = 10
det = 0.1
n = 4
out = val * ((1 + det) ** np.arange(n))
s = pd.Series(out)
s
Out[426]:
0 10.00
1 11.00
2 12.10
3 13.31
dtype: float64
Notice that here I am using an index starting from 0, since 1.1**0 yields the original value.
I think this does what you want:
df = pd.DataFrame({'returns': list(range(1, 10))})
df.index = df.index + 1
df.returns = df.returns.apply(lambda x: 10 * (1.1 ** x))
print(df)
Out:
returns
1 11.000000
2 12.100000
3 13.310000
4 14.641000
5 16.105100
6 17.715610
7 19.487171
8 21.435888
9 23.579477
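The two answers can be combined into a fully vectorized construction whose index starts at 1, as the question asks; a sketch with assumed variable names:

```python
import numpy as np
import pandas as pd

val, ret, n = 10, 0.10, 5  # starting number, return per period, rows wanted

# Build the whole compounded series in one shot; the index doubles as the
# period number, starting at 1.
periods = np.arange(1, n + 1)
s = pd.Series(val * (1 + ret) ** periods, index=periods)
```

This avoids both the per-element .apply call and the index shift, since the period numbers drive the exponent and the index directly.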

Multiply many columns pandas

I have a data frame like this, but with many more columns. I would like to multiply each pair of adjacent columns, put the product of the two in a new column beside them called Sub_pro, and at the end have the total sum of all Sub_pro columns in a column called F_Pro, with the precision reduced to 3 decimal places. I don't know how to get the Sub_pro columns. Below is my code.
import pandas as pd
df = pd.read_excel("C:dummy")
df['F_Pro'] = ("Result" * "Attribute").sum(axis=1)
df.round(decimals=3)
print (df)
Input
Id Result Attribute Result1 Attribute1
1 0.5621 0.56 536 0.005642
2 0.5221 0.5677 2.15 93
3 0.024564 5.23 6.489 8
4 11.564256 4.005 0.45556 5.25
5 0.6123 0.4798 0.6667 5.10
Desired Output
id Result Attribute Sub_Pro Result1 Attribute1 Sub_pro1 F_Pro
1 0.5621 0.56 0.314776 536 0.005642 3.024112 3.338888
2 0.5221 0.5677 0.29639617 2.15 93 199.95 200.2463962
3 0.024564 5.23 0.12846972 6.489 8 51.912 52.04046972
4 11.564256 4.005 46.31484528 0.45556 5.25 2.39169 48.70653528
5 0.6123 0.4798 0.29378154 0.6667 5.1 3.40017 3.69395154
Because you have several similarly named columns, here is one way using filter. To see how it works: on your df, df.filter(like='Result') gives the columns whose names contain 'Result':
Result Result1
0 0.562100 536.00000
1 0.522100 2.15000
2 0.024564 6.48900
3 11.564256 0.45556
4 0.612300 0.66670
You can create an array containing the 'Sub_Pro' columns:
import numpy as np
arr_sub_pro = np.round(df.filter(like='Result').values
                       * df.filter(like='Attribute').values, 3)
and you get the values of the sub_pro columns in arr_sub_pro:
array([[3.1500e-01, 3.0240e+00],
[2.9600e-01, 1.9995e+02],
[1.2800e-01, 5.1912e+01],
[4.6315e+01, 2.3920e+00],
[2.9400e-01, 3.4000e+00]])
Now you need to insert them at the right positions in the dataframe; I think a for loop is necessary:
for nb, col in zip(range(arr_sub_pro.shape[1]), df.filter(like='Attribute').columns):
    df.insert(df.columns.get_loc(col) + 1, 'Sub_pro{}'.format(nb), arr_sub_pro[:, nb])
Here I get the location of column Attribute{nb} and insert the values from column nb of arr_sub_pro at the next position.
To add the column 'F_Pro', you can do:
df.insert(len(df.columns), 'F_Pro', arr_sub_pro.sum(axis=1))
the final df looks like:
Id Result Attribute Sub_pro0 Result1 Attribute1 Sub_pro1 \
0 1 0.562100 0.5600 0.315 536.00000 0.005642 3.024
1 2 0.522100 0.5677 0.296 2.15000 93.000000 199.950
2 3 0.024564 5.2300 0.128 6.48900 8.000000 51.912
3 4 11.564256 4.0050 46.315 0.45556 5.250000 2.392
4 5 0.612300 0.4798 0.294 0.66670 5.100000 3.400
F_Pro
0 3.339
1 200.246
2 52.040
3 48.707
4 3.694
import pandas as pd
src = "/opt/repos/pareto/test/stack/data.csv"
df = pd.read_csv(src)

def multiply(x):
    res = x.copy()
    # Walk the row two columns at a time (skipping the Id column),
    # multiplying each Result/Attribute pair.
    idx = 1
    while idx + 1 < len(x):
        new_key = "sub_prod_{}".format(idx // 2)
        # Multiply and round to three decimal places.
        res[new_key] = round(x.iloc[idx] * x.iloc[idx + 1], 3)
        idx = idx + 2
    return res

res_df = df.apply(multiply, axis=1)
This solves the problem, but you then need to reorder the columns, and you could iterate over the keys instead of making a full copy of each row. I hope the code helps you.
Here's one way using NumPy and a dictionary comprehension:
# extract NumPy array for relevant columns
A = df.iloc[:, 1:].values
n = int(A.shape[1] / 2)
# calculate products and feed to pd.DataFrame
prods = pd.DataFrame({'Sub_Pro_' + str(i): np.prod(A[:, 2*i: 2*(i+1)], axis=1)
                      for i in range(n)})
# calculate sum of product rows
prods['F_Pro'] = prods.sum(axis=1)
# join to original dataframe
df = df.join(prods)
print(df)
Id Result Attribute Result1 Attribute1 Sub_Pro_0 Sub_Pro_1 \
0 1 0.562100 0.5600 536.00000 0.005642 0.314776 3.024112
1 2 0.522100 0.5677 2.15000 93.000000 0.296396 199.950000
2 3 0.024564 5.2300 6.48900 8.000000 0.128470 51.912000
3 4 11.564256 4.0050 0.45556 5.250000 46.314845 2.391690
4 5 0.612300 0.4798 0.66670 5.100000 0.293782 3.400170
F_Pro
0 3.338888
1 200.246396
2 52.040470
3 48.706535
4 3.693952
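If the columns really do alternate strictly as Id, Result, Attribute, Result1, Attribute1, ... (an assumption based on the sample input), positional slicing pairs them without any name matching at all; a small sketch on hypothetical values:

```python
import pandas as pd

# Small frame mirroring the question's layout: Id, then alternating pairs.
df = pd.DataFrame({'Id': [1, 2],
                   'Result': [0.5, 2.0], 'Attribute': [4.0, 3.0],
                   'Result1': [1.5, 0.5], 'Attribute1': [2.0, 8.0]})

# Columns 1, 3, 5, ... are the left factors; columns 2, 4, 6, ... the right.
left = df.iloc[:, 1::2].to_numpy()
right = df.iloc[:, 2::2].to_numpy()
sub = left * right                      # one Sub_pro column per pair

for i in range(sub.shape[1]):
    df[f'Sub_pro{i}'] = sub[:, i].round(3)
df['F_Pro'] = sub.sum(axis=1).round(3)
```

The strided iloc slices generalize to any even number of value columns, so nothing changes when more Result/Attribute pairs are added.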

Append string of column index to DataFrame columns

I am working on a project using Learning to Rank. Below is the example dataset format (taken from https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/). The first column is the rank, the second column is the query id, and the following columns are [feature number]:[feature value] pairs.
1008 qid:10 1:0.004356 2:0.080000 3:0.036364 4:0.000000 … 46:0.00000
1007 qid:10 1:0.004901 2:0.000000 3:0.036364 4:0.333333 … 46:0.000000
1006 qid:10 1:0.019058 2:0.240000 3:0.072727 4:0.500000 … 46:0.000000
Right now, I have successfully converted my data into the following format in a pandas DataFrame.
10 qid:354714443278337 3500 1 122.0 156.0 13.0 1698.0 1840.0 92.28260 ...
...
The first two columns are already fine. What I need next is to prepend the feature number to the remaining columns (e.g. the first feature, 3500, becomes 1:3500).
I know I can prepend a string to a column using this command:
df['col'] = 'str' + df['col'].astype(str)
Looking at the first feature, 3500: it is located at column index 2, so what I can think of is prepending column index - 1 to each column. How do I prepend the string based on the column number?
Any help would be appreciated.
I think you need DataFrame.radd to add the column names from the right side, and iloc to select from the second column to the end:
print (df)
0 1 2 3 4 5 6 7 8 \
0 10 qid:354714443278337 3500 1 122.0 156.0 13.0 1698.0 1840.0
1 10 qid:354714443278337 3500 1 122.0 156.0 13.0 1698.0 1840.0
9
0 92.2826
1 92.2826
df.iloc[:, 2:] = df.iloc[:, 2:].astype(str).radd(':').radd((df.columns[2:] - 1).astype(str))
print (df)
0 1 2 3 4 5 6 7 \
0 10 qid:354714443278337 1:3500 2:1 3:122.0 4:156.0 5:13.0 6:1698.0
1 10 qid:354714443278337 1:3500 2:1 3:122.0 4:156.0 5:13.0 6:1698.0
8 9
0 7:1840.0 8:92.2826
1 7:1840.0 8:92.2826
You can simply concatenate the columns:
df['new_col'] = df[df.columns[3]].astype(str) + ':' + df[df.columns[2]].astype(str)
This will output a new column in your df named new_col. You can then delete the unnecessary columns.
You can convert each string to a dictionary and then read it back as a pandas DataFrame.
import pandas as pd
import ast

df = pd.DataFrame({'rank': [1008, 1007, 1006],
                   'column': ['qid:10 1:0.004356 2:0.080000 3:0.036364 4:0.000000',
                              'qid:10 1:0.004901 2:0.000000 3:0.036364 4:0.333333',
                              'qid:10 1:0.019058 2:0.240000 3:0.072727 4:0.500000']})

def putquotes(x):
    x1 = x.split(":")
    return "'" + x1[0] + "':" + x1[1]

def putcommas(x):
    x1 = x.split()
    return "{" + ",".join([putquotes(t) for t in x1]) + "}"

df1 = [ast.literal_eval(putcommas(x)) for x in df['column'].tolist()]
df = pd.concat([df, pd.DataFrame(df1)], axis=1)
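If you just want the feature-number prefixes (rather than a dictionary round-trip), a plain loop over the positional columns also works; a sketch on a small hypothetical frame in the converted layout from the question:

```python
import pandas as pd

# Hypothetical frame: rank, qid, then raw feature values in positional columns.
df = pd.DataFrame([[10, 'qid:354714443278337', 3500, 1, 122.0],
                   [10, 'qid:354714443278337', 3500, 1, 122.0]])

# Prefix each feature value with its 1-based feature number (column index - 1).
for pos, col in enumerate(df.columns[2:], start=1):
    df[col] = f'{pos}:' + df[col].astype(str)
```

The string concatenation is vectorized per column, so only the (usually small) number of feature columns is looped over in Python.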

Python Pandas: Get row by median value

I'm trying to get the row of the median value for a column.
I'm using data.median() to get the median value for 'column'.
id 30444.5
someProperty 3.0
numberOfItems 0.0
column 70.0
And data.median()['column'] subsequently gives:
data.median()['column']
>>> 70.0
How can I get the row or index of the median value?
Is there anything similar to idxmax / idxmin?
I tried filtering but it's not reliable in cases multiple rows have the same value.
Thanks!
You can use rank and idxmin and apply it to each column:
import numpy as np
import pandas as pd

def get_median_index(d):
    ranks = d.rank(pct=True)
    close_to_median = abs(ranks - 0.5)
    return close_to_median.idxmin()
df = pd.DataFrame(np.random.randn(13, 4))
df
0 1 2 3
0 0.919681 -0.934712 1.636177 -1.241359
1 -1.198866 1.168437 1.044017 -2.487849
2 1.159440 -1.764668 -0.470982 1.173863
3 -0.055529 0.406662 0.272882 -0.318382
4 -0.632588 0.451147 -0.181522 -0.145296
5 1.180336 -0.768991 0.708926 -1.023846
6 -0.059708 0.605231 1.102273 1.201167
7 0.017064 -0.091870 0.256800 -0.219130
8 -0.333725 -0.170327 -1.725664 -0.295963
9 0.802023 0.163209 1.853383 -0.122511
10 0.650980 -0.386218 -0.170424 1.569529
11 0.678288 -0.006816 0.388679 -0.117963
12 1.640222 1.608097 1.779814 1.028625
df.apply(get_median_index, 0)
0 7
1 7
2 3
3 4
Maybe just: data[data.performance == data.median()['performance']].
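For a single column, a simpler variant of the rank-based idea is to take the index of the value with the smallest absolute distance to the median (ties go to the first occurrence, and for even-length columns this picks one of the two middle rows); a minimal sketch:

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 7.0, 5.0, 9.0])

# Index label of the value closest to the median.
median_idx = (s - s.median()).abs().idxmin()
```

Unlike boolean filtering on equality, this still returns a single row when the median falls between two values or when several rows share the same value.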
