right now I have a dataframe with two columns:
columnA columnB
:16R:AB NaN
:20C::XX S400500
:16X:AB NaN
:16R:AC NaN
:16X:AC NaN
:16R:AB NaN
:31X::BB Sema
:16R:AB NaN
I want to transpose the dataframe based on some sequences. :16R:AB until the next :16X:AB is a sequence, then from :16R:AC until :16X:AC, and so on. I also want to add a counter/ID, so that the final dataframe looks like:
Index/Counter :16R:AB :20C::XX :16X:AB :16R:AC :16X:AC :31X::BB
1 NaN S400500 NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN Sema
Is there any trick to do it, or does it require a manual loop?
So, rebuilding a prototype of your example:
import numpy as np
import pandas as pd

D = pd.DataFrame([np.random.randint(1, 5, 20), np.random.randn(20)]).T
D.columns = ["key", "value"]
key value
4.0 1.017081
4.0 -1.480929
3.0 -1.257809
1.0 -0.683207
...
now we can add a field that counts the occurrences of the start key:
D["occurrence"] = (D.key == 4.0).cumsum()
... and then we are able to pivot (this assumes each key appears at most once per occurrence group; otherwise use pivot_table):
D.pivot(index="occurrence", columns="key", values="value")
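For completeness, the same trick applied to the original tag data, as a sketch; I assume here that every tag starting with ":16R:" opens a new sequence:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "columnA": [":16R:AB", ":20C::XX", ":16X:AB", ":16R:AC", ":16X:AC",
                ":16R:AB", ":31X::BB", ":16R:AB"],
    "columnB": [np.nan, "S400500", np.nan, np.nan, np.nan,
                np.nan, "Sema", np.nan],
})

# every ":16R:..." tag opens a new sequence; cumsum numbers them 1, 2, 3, ...
df["seq"] = df["columnA"].str.startswith(":16R:").cumsum()

# pivot works here because each tag appears at most once per sequence
wide = df.pivot(index="seq", columns="columnA", values="columnB")
```

The counter column ("seq") then plays the role of the Index/Counter in the desired output.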
Given a dataframe with row and column multiindex, how would you copy a row index "object" and manipulate a specific index value on a chosen level? Ultimately I would like to add a new row to the dataframe with this manipulated index.
Taking this dataframe df as an example:
col_index = pd.MultiIndex.from_product([['A','B'], [1,2,3,4]], names=['cInd1', 'cInd2'])
row_index = pd.MultiIndex.from_arrays([['2010','2011','2009'],['a','r','t'],[45,34,35]], names=["rInd1", "rInd2", 'rInd3'])
df = pd.DataFrame(data=None, index=row_index, columns=col_index)
df
cInd1 A B
cInd2 1 2 3 4 1 2 3 4
rInd1 rInd2 rInd3
2010 a 45 NaN NaN NaN NaN NaN NaN NaN NaN
2011 r 34 NaN NaN NaN NaN NaN NaN NaN NaN
2009 t 35 NaN NaN NaN NaN NaN NaN NaN NaN
I would like to take the index of the first row, manipulate the "rInd2" value and use this index to insert another row.
Pseudo code would be something like this:
#Get Index
idx = df.index[0]
#Manipulate Value
idx[1] = "L" #or idx["rInd2"]
#Make new row with new index
df.loc[idx, slice(None)] = None
The desired output would look like this:
cInd1 A B
cInd2 1 2 3 4 1 2 3 4
rInd1 rInd2 rInd3
2010 a 45 NaN NaN NaN NaN NaN NaN NaN NaN
2011 r 34 NaN NaN NaN NaN NaN NaN NaN NaN
2009 t 35 NaN NaN NaN NaN NaN NaN NaN NaN
2010 L 45 NaN NaN NaN NaN NaN NaN NaN NaN
What would be the most efficient way to achieve this?
Is there a way to do the same procedure with column index?
Thanks
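A working version of the pseudo code above, as a sketch: index tuples are immutable, so convert the key to a list, change the level value, and assign with the full key tuple (pandas enlarges the frame when the key is missing):

```python
import pandas as pd

col_index = pd.MultiIndex.from_product([['A', 'B'], [1, 2, 3, 4]], names=['cInd1', 'cInd2'])
row_index = pd.MultiIndex.from_arrays([['2010', '2011', '2009'], ['a', 'r', 't'], [45, 34, 35]],
                                      names=['rInd1', 'rInd2', 'rInd3'])
df = pd.DataFrame(data=None, index=row_index, columns=col_index)

# get the first row's key as a mutable list
idx = list(df.index[0])                     # ['2010', 'a', 45]
idx[df.index.names.index('rInd2')] = 'L'    # manipulate the rInd2 level

# assigning with the complete key tuple appends a new row
df.loc[tuple(idx), :] = None
```

For the column index the same idea should work on the other axis, e.g. assigning to a new full column key tuple, or by applying the row-wise trick to df.T.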
Low S0.0 S1.0 S2.0 S3.0 S4.0 S5.0 S6.0 S7.0 S8.0 S9.0 S10.0 S11.0
0 55 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 60 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 78 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 77 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I have the following code to check if any of the "S" columns are near to "close":
level=0.035
cond = np.isclose(df.Low, df['S0.0'], rtol=level) | np.isclose(df.Low, df['S1.0'], rtol=level) | ...
df['ST'] = np.where(cond, 100, 0)
But this looks too manual. Is there some way to address all the S columns without naming each of them explicitly? Also, these columns keep changing, so referring to every column by name sometimes raises an error. Thanks!
I think a solution can be as follows:
import numpy as np

level = 0.035
# note: startswith is case-sensitive, and the columns are named 'S0.0', 'S1.0', ...
selected_columns = [c for c in df.columns if c.startswith('S')]
cond = None
for selected_column in selected_columns:
    close = np.isclose(df.Low, df[selected_column], rtol=level)
    cond = close if cond is None else (cond | close)
df['ST'] = np.where(cond, 100, 0)
You have to pay attention to the condition that selects the column names; c.startswith('S') is only an example.
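A more vectorized alternative, as a sketch (assuming all the S columns hold numeric data): compare Low against every S column at once via broadcasting, then collapse across columns with any:

```python
import numpy as np
import pandas as pd

# small made-up frame in the shape of the question's data
df = pd.DataFrame({
    'Low':  [55, 60, 78, 12, 77],
    'S0.0': [54.9, 10.0, 90.0, 12.1, 50.0],
    'S1.0': [10.0, 59.8, 20.0, 30.0, 76.9],
})

level = 0.035
s_cols = [c for c in df.columns if c.startswith('S')]

# shape (rows, 1) compared against shape (rows, n_S_cols) -> one boolean matrix
close = np.isclose(df['Low'].to_numpy()[:, None], df[s_cols].to_numpy(), rtol=level)
df['ST'] = np.where(close.any(axis=1), 100, 0)
```

This keeps working unchanged as S columns are added or removed, since the column list is computed from df.columns.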
So I have a dataframe with NaN values, and I transform all the rows in that dataframe into lists, which are then added to another list.
Index 1 2 3 4 5 6 7 8 9 10 ... 71 72 73 74 75 76 77 78 79 80
orderid
20000765 624380 nan nan nan nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan nan
20000766 624380 nan nan nan nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan nan
20000768 1305984 1305985 1305983 1306021 nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan nan
records = []
for i in range(0, 60550):  # number of rows
    records.append([str(dfpivot.values[i, j]) for j in range(0, 10)])
However, a lot of rows contain NaN values which I want to delete from the list, before I put it in the list of lists. Where do I need to insert that code and how do I do this?
I thought that this code would do the trick, but I guess it looks only to the direct values in the 'list of lists':
records = [x for x in records if str(x) != 'nan']
I'm new to Python, so I'm still figuring out the basics.
One way is to take advantage of the fact that stack removes NaNs to generate the nested list:
df.stack().groupby(level=0).apply(list).values.tolist()
# [[624380.0], [624380.0], [1305984.0, 1305985.0, 1305983.0, 1306021.0]]
If you want to keep the rows containing NaNs and drop only the columns that are entirely NaN, you can do it like this:
In [5457]: df.T.dropna(how='all').T
Out[5457]:
Index 1 2 3 4
0 20000765.000 624380.000 nan nan nan
1 20000766.000 624380.000 nan nan nan
2 20000768.000 1305984.000 1305985.000 1305983.000 1306021.000
If you don't want any columns containing NaNs, you can drop them like this:
In [5458]: df.T.dropna().T
Out[5458]:
Index 1
0 20000765.000 624380.000
1 20000766.000 624380.000
2 20000768.000 1305984.000
To create the array:
In [5464]: df.T.apply(lambda x: x.dropna().tolist()).tolist()
Out[5464]:
[[20000765.0, 624380.0],
[20000766.0, 624380.0],
[20000768.0, 1305984.0, 1305985.0, 1305983.0, 1306021.0]]
or
df.T[1:].apply(lambda x: x.dropna().tolist()).tolist()
Out[5471]: [[624380.0], [624380.0], [1305984.0, 1305985.0, 1305983.0, 1306021.0]]
depending on how you want the array
One way to do this would be with a nested list comprehension:
[[j for j in i if not pd.isna(j)] for i in dfpivot.values]
EDIT
It looks like you want strings, in which case:
[[str(j) for j in i if not pd.isna(j)] for i in dfpivot.values]
This is probably an easy question, but I couldn't find any simple way to do that. Imagine the following dataframe:
df = pd.DataFrame(index=range(10), columns=range(5))
and three lists that contain indices, columns, and values of the defined dataframe that I intend to change:
idx_list = [1,5,3,7] # the indices of the cells that I want to change
col_list = [1,4,3,1] # the columns of the cells that I want to change
value_list = [9,8,7,6] # the final values of those cells
I was wondering if there exists a function in pandas that does the following efficiently:
for i in range(len(idx_list)):
    df.loc[idx_list[i], col_list[i]] = value_list[i]
Thanks.
Using .values (writing through the underlying array, which works here because the frame holds a single dtype):
df.values[idx_list,col_list]=value_list
df
Out[205]:
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 NaN 9 NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN 7 NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN 8
6 NaN NaN NaN NaN NaN
7 NaN 6 NaN NaN NaN
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN
Or another, less efficient way:
updatedf=pd.Series(value_list,index=pd.MultiIndex.from_arrays([idx_list,col_list])).unstack()
df.update(updatedf)
You could also try the df.applymap() function, using a lambda to perform the required operations.
I'm trying to create a table that contains the percentage-difference calculation for values in a column. I want the labels (0-15M, 15-25M, etc.) down the left and across the top of the table, with each cell holding the calculated percentage difference for that combination. The values I'm trying to use are below:
label value
0-15M 18.274490
15-25M 21.338270
25-35M 22.607708
35-45M 24.078338
45-55M 25.545316
55-65M 26.951005
65-75M 27.765658
I have tried using pivot tables to do this; however, calculating from only one column doesn't seem to work. The code I've tried is below:
pd.pivot_table(test, index='label', columns='label', aggfunc=[np.mean])
And this is the result:
0-15M 15-25M 25-35M 35-45M 45-55M 55-65M 65-75M
0-15M 18.27 NaN NaN NaN NaN NaN NaN
15-25M NaN 21.33 NaN NaN NaN NaN NaN
25-35M NaN NaN 22.60 NaN NaN NaN NaN
35-45M NaN NaN NaN 24.07 NaN NaN NaN
45-55M NaN NaN NaN NaN 25.54 NaN NaN
55-65M NaN NaN NaN NaN NaN 26.95 NaN
65-75M NaN NaN NaN NaN NaN NaN 27.76
75-85M NaN NaN NaN NaN NaN NaN NaN
85-95M NaN NaN NaN NaN NaN NaN NaN
So you can see it's only echoing the values it already has, and not calculating new values from the combinations.
Any help with this would be great! Thanks.
Solution: I ended up using a nested loop over the value column and appended the results to lists, which I then assembled into dataframes:
solution = []
for x in df['value']:
    for y in df['value']:
        solution.append((x - y) / y)
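The same pairwise table can also be built without loops, as a sketch (assuming the data sits in a frame named test with label and value columns as above): broadcasting computes (x - y) / y for every label combination at once:

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'label': ['0-15M', '15-25M', '25-35M'],
    'value': [18.274490, 21.338270, 22.607708],
})

v = test.set_index('label')['value']
a = v.to_numpy()

# pct.iloc[i, j] = (value_i - value_j) / value_j
pct = pd.DataFrame((a[:, None] - a[None, :]) / a[None, :],
                   index=v.index, columns=v.index)
```

The diagonal is zero by construction, and the labels land on both axes without any manual appending.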