Converting pandas Matrix to DataFrame

I have created a matrix:
items = [0, 1, 2, 3]
item_to_item = pd.DataFrame(index=items, columns=items)
I've put values in it so that:
it is symmetric
its diagonal is all 0's
For example:
   0  1  2  3
0  0  4  5  9
1  4  0  3  7
2  5  3  0  3
3  9  7  3  0
I want to create a DataFrame of all possible pairs (from [0, 1, 2, 3]) such that there won't be pairs of (x, x), and if (x, y) is in, I don't want (y, x), because the matrix is symmetric and holds the same value.
In the end I will have the following DataFrame (or NumPy 2D array):
item  item  value
   0     1      4
   0     2      5
   0     3      9
   1     2      3
   1     3      7
   2     3      3

Here's a NumPy solution with np.triu_indices -
In [453]: item_to_item
Out[453]:
   0  1  2  3
0  0  4  5  9
1  4  0  3  7
2  5  3  0  3
3  9  7  3  0

In [454]: r, c = np.triu_indices(len(items), 1)

In [455]: pd.DataFrame(np.column_stack((r, c, item_to_item.values[r, c])))
Out[455]:
   0  1  2
0  0  1  4
1  0  2  5
2  0  3  9
3  1  2  3
4  1  3  7
5  2  3  3
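If you want the named columns from the desired output right away, a small variation of the same idea works (the column names are assumed, not part of the original answer; r and c are reused from above):
pairs = pd.DataFrame({'item_x': r, 'item_y': c,
                      'value': item_to_item.values[r, c]})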

NumPy's np.triu gives you the upper triangle with all other elements set to zero. You can use that to construct your DataFrame and replace the zeros with NaNs (so that they are dropped when you stack the columns):
pd.DataFrame(np.triu(df), index=df.index, columns=df.columns).replace(0, np.nan).stack()
Out:
0  1    4.0
   2    5.0
   3    9.0
1  2    3.0
   3    7.0
2  3    3.0
dtype: float64
You can use reset_index at the end to convert indices to columns.
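For example, a minimal sketch reusing df (the symmetric matrix) from above; the final column names are my own choice, and note that replace(0, np.nan) would also drop any genuine zeros in the upper triangle, which is fine for this data:
out = (pd.DataFrame(np.triu(df), index=df.index, columns=df.columns)
         .replace(0, np.nan)
         .stack()
         .reset_index())
out.columns = ['item_x', 'item_y', 'value']  # assumed names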
Another alternative would be stacking and then resetting the index, this time using a callable to slice the DataFrame:
df.stack().reset_index()[lambda x: x['level_0'] < x['level_1']]
Out:
    level_0  level_1  0
1         0        1  4
2         0        2  5
3         0        3  9
6         1        2  3
7         1        3  7
11        2        3  3
This one requires pandas 0.18.0.
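If you are on an older pandas without callable indexing, the same filter can be written with a temporary variable (a sketch of the equivalent):
out = df.stack().reset_index()
out = out[out['level_0'] < out['level_1']]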

Related

How to drop duplicates in pandas but keep more than the first

Let's say I have a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'a': [1,2,2,2,2,1,1,1,2,2]})
>> df
   a
0  1
1  2
2  2
3  2
4  2
5  1
6  1
7  1
8  2
9  2
I want to drop duplicates if they exceed a certain threshold n, reducing those runs to n rows. Let's say that n=3. Then, my target DataFrame is:
>> df
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2
EDIT: Each set of consecutive repetitions is considered separately. In this example, rows 8 and 9 should be kept.
You can create a unique value for each consecutive group, then use groupby and head:
import numpy as np

# label each run of consecutive equal values
group_value = np.cumsum(df.a.shift() != df.a)
df.groupby(group_value).head(3)
# result:
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2
Use boolean indexing with groupby.cumcount:
N = 3
df[df.groupby('a').cumcount().lt(N)]
Output:
   a
0  1
1  2
2  2
3  2
5  1
6  1
Note that this counts occurrences of each value across the whole column, not per consecutive run, so rows 7, 8 and 9 are dropped here as well.
For the last N:
df[df.groupby('a').cumcount(ascending=False).lt(N)]
apply on consecutive repetitions:
df[df.groupby(df['a'].ne(df['a'].shift()).cumsum()).cumcount().lt(3)]
Output:
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1   # this is #3 of the local group
8  2
9  2
advantages of boolean indexing
You can use it for many other operations, such as setting values or masking:
group = df['a'].ne(df['a'].shift()).cumsum()
m = df.groupby(group).cumcount().lt(N)
df.where(m)
     a
0  1.0
1  2.0
2  2.0
3  2.0
4  NaN
5  1.0
6  1.0
7  1.0
8  2.0
9  2.0
df.loc[~m] = -1

    a
0   1
1   2
2   2
3   2
4  -1
5   1
6   1
7   1
8   2
9   2
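Putting the pieces together, here is a reusable sketch; the function name and signature are mine, not from the answers above:
import pandas as pd

def keep_first_n_consecutive(df, col, n):
    # Keep at most the first n rows of each run of consecutive
    # duplicates in `col` (hypothetical helper).
    group = df[col].ne(df[col].shift()).cumsum()  # label each run
    return df[df.groupby(group).cumcount().lt(n)]

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
print(keep_first_n_consecutive(df, 'a', 3))  # drops only row 4 here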

how to add a DataFrame to some columns of another DataFrame

I want to add a DataFrame a (containing a load profile) to some of the columns of another DataFrame b (also containing one load profile per column). So some columns (load profiles) of b should be overlaid with the load profile of a.
So let's say my DataFrames look like:
a:
   P[kW]
0      0
1      0
2      0
3      8
4      8
5      0
b:
   P1[kW]  P2[kW]  ...  Pn[kW]
0       2       2            2
1       3       3            3
2       3       3            3
3       4       4            4
4       2       2            2
5       2       2            2
Now I want to overlay some columns of b:
b.iloc[:, [1]] += a.iloc[:, 0]
I would expect this:
b:
   P1[kW]  P2[kW]  ...  Pn[kW]
0       2       2            2
1       3       3            3
2       3       3            3
3       4      12            4
4       2      10            2
5       2       2            2
but what I actually get:
b:
   P1[kW]  P2[kW]  ...  Pn[kW]
0       2     NaN            2
1       3     NaN            3
2       3     NaN            3
3       4     NaN            4
4       2     NaN            2
5       2     NaN            2
That's not exactly what my code and data look like, but the principle is the same as in this abstract example.
Any guesses, what could be the problem?
Many thanks for any help in advance!
EDIT:
I actually have to overlay more than one column. Another example:
load = [0, 0, 0, 0, 0, 0, 0]
data = pd.DataFrame(load)
for i in range(1, 10):
    data[i] = data[0]
data

overlay = pd.DataFrame([0, 0, 0, 0, 6, 6, 0])
overlay

data.iloc[:, [1, 2, 4, 5, 7, 8]] += overlay.iloc[:, 0]
data
WHAT??! The result is completely crazy. Columns 1 and 2 aren't changed at all. Columns 4 and 5 are changed, but in every row. Columns 7 and 8 are nans. What am I missing?
That is what I would expect the result to look like:
Do not pass the column index 1 of DataFrame b as a list, but as a scalar. b.iloc[:, [1]] returns a one-column DataFrame, and when a Series is added to a DataFrame, pandas aligns the Series index against the DataFrame's columns, producing NaN everywhere. b.iloc[:, 1] returns a Series, which adds element-wise on the row index.
Code
b.iloc[:, 1] += a.iloc[:, 0]
b
Output
   P1[kW]  P2[kW]  Pn[kW]
0       2       2       2
1       3       3       3
2       3       3       3
3       4      12       4
4       2      10       2
5       2       2       2
Edit
It seems this is what we are looking for, i.e. adding the overlay df to certain columns of the data df.
Two Options
Option 1
cols=[1,2,4,5,7,8]
data[cols] = data[cols] + overlay.values
data
Option 2, if we want to use iloc
cols=[1,2,4,5,7,8]
data[cols] = data.iloc[:,cols] + overlay.iloc[:].values
data
Output
   0  1  2  3  4  5  6  7  8  9
0  0  0  0  0  0  0  0  0  0  0
1  0  0  0  0  0  0  0  0  0  0
2  0  0  0  0  0  0  0  0  0  0
3  0  0  0  0  0  0  0  0  0  0
4  0  6  6  0  6  6  0  6  6  0
5  0  6  6  0  6  6  0  6  6  0
6  0  0  0  0  0  0  0  0  0  0
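Both options work for the same reason: they strip the labels from overlay, so pandas cannot try to align them against data's columns. A sketch of the same idea written explicitly:
cols = [1, 2, 4, 5, 7, 8]
# .to_numpy() drops index/column labels, so the (7, 1) array broadcasts
# across all six selected columns instead of aligning (and producing NaN)
data[cols] = data[cols].to_numpy() + overlay.to_numpy()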

Padding and reshaping pandas dataframe

I have a dataframe with the following form:
data = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2, 2, 3, 3],
                     'Time': [0, 1, 2, 0, 1, 2, 3, 0, 1],
                     'sig': [2, 3, 1, 4, 2, 0, 2, 3, 5],
                     'sig2': [9, 2, 8, 0, 4, 5, 1, 1, 0],
                     'group': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A']})
print(data)
   ID  Time  sig  sig2 group
0   1     0    2     9     A
1   1     1    3     2     A
2   1     2    1     8     A
3   2     0    4     0     B
4   2     1    2     4     B
5   2     2    0     5     B
6   2     3    2     1     B
7   3     0    3     1     A
8   3     1    5     0     A
I want to reshape and pad it such that each 'ID' has the same number of Time values, sig and sig2 are padded with zeros (or the mean value within the ID), and group carries the same letter value. The output after padding would be:
data_pad = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                         'Time': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
                         'sig': [2, 3, 1, 0, 4, 2, 0, 2, 3, 5, 0, 0],
                         'sig2': [9, 2, 8, 0, 0, 4, 5, 1, 1, 0, 0, 0],
                         'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A']})
print(data_pad)
    ID  Time  sig  sig2 group
0    1     0    2     9     A
1    1     1    3     2     A
2    1     2    1     8     A
3    1     3    0     0     A
4    2     0    4     0     B
5    2     1    2     4     B
6    2     2    0     5     B
7    2     3    2     1     B
8    3     0    3     1     A
9    3     1    5     0     A
10   3     2    0     0     A
11   3     3    0     0     A
My end goal is to ultimately reshape this into something with shape (number of IDs, number of time points, number of signals; 2 here).
It seems that if I pivot data, it fills in with nan values, which is fine for the signal values, but not the groups. I am also hoping to avoid looping through data.groupby('ID'), since my actual data has a large number of groups and the looping would likely be very slow.
Here's one approach creating the new index with pd.MultiIndex.from_product and using it to reindex on the Time column:
df = data.set_index(['ID', 'Time'])

# define the new MultiIndex
ix = pd.MultiIndex.from_product([df.index.levels[0],
                                 df.index.levels[1]],
                                names=['ID', 'Time'])

# reindex using the above MultiIndex
df = df.reindex(ix, fill_value=0)

# forward fill the missing values in group
df['group'] = df.group.mask(df.group.eq(0)).ffill()

print(df.reset_index())
    ID  Time  sig  sig2 group
0    1     0    2     9     A
1    1     1    3     2     A
2    1     2    1     8     A
3    1     3    0     0     A
4    2     0    4     0     B
5    2     1    2     4     B
6    2     2    0     5     B
7    2     3    2     1     B
8    3     0    3     1     A
9    3     1    5     0     A
10   3     2    0     0     A
11   3     3    0     0     A
IIUC:
(data.pivot_table(columns='Time', index=['ID','group'], fill_value=0)
.stack('Time')
.sort_index(level=['ID','Time'])
.reset_index()
)
Output:
    ID group  Time  sig  sig2
0    1     A     0    2     9
1    1     A     1    3     2
2    1     A     2    1     8
3    1     A     3    0     0
4    2     B     0    4     0
5    2     B     1    2     4
6    2     B     2    0     5
7    2     B     3    2     1
8    3     A     0    3     1
9    3     A     1    5     0
10   3     A     2    0     0
11   3     A     3    0     0
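For the stated end goal, a shape of (number of IDs, number of time points, number of signals), either padded result can be reshaped at the end. A sketch assuming the reindexed MultiIndex df from the first answer (already sorted by ID, then Time):
import numpy as np

n_ids = df.index.get_level_values('ID').nunique()
n_time = df.index.get_level_values('Time').nunique()
arr = df[['sig', 'sig2']].to_numpy().reshape(n_ids, n_time, 2)
print(arr.shape)  # (3, 4, 2)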

Is it possible to obtain groupby style counts without collapsing Pandas DataFrame?

I have a DataFrame with 9 columns, and I'm trying to add a column of counts of unique values based on the first 3 columns (e.g. Cols A, B, and C must match to count as a unique value, but the remaining columns can vary). I attempted to do this with groupby:
df = pd.DataFrame(resultsFile500.groupby(['chr','start','end']).size().reset_index().rename(columns={0:'count'}))
This returns a DataFrame with 5 columns, and the counts are what I want. However, I also need values from the original data frame, so what I have been trying to do is somehow get those values of counts as a column in the original df. So, this would mean that if two rows in columns chr, start, and end, had identical values, the counts column would be 2 in both rows, but they would not be collapsed to one row. Is there an easy solution here that I'm missing, or do I need to hack something together?
You can use .transform to get non-collapsing behavior:
>>> df
   a  b  c  d  e
0  3  4  1  3  0
1  3  1  4  3  0
2  4  3  3  2  1
3  3  4  1  4  0
4  0  4  3  3  2
5  1  2  0  4  1
6  3  1  4  2  1
7  0  4  3  4  0
8  1  3  0  1  1
9  3  4  1  2  1
>>> df.groupby(['a','b','c']).transform('count')
   d  e
0  3  3
1  2  2
2  1  1
3  3  3
4  2  2
5  1  1
6  2  2
7  2  2
8  1  1
9  3  3
Note, I'll have to choose an arbitrary column from the .transform result, but then just do:
>>> df['unique_count'] = df.groupby(['a','b','c']).transform('count')['d']
>>> df
   a  b  c  d  e  unique_count
0  3  4  1  3  0             3
1  3  1  4  3  0             2
2  4  3  3  2  1             1
3  3  4  1  4  0             3
4  0  4  3  3  2             2
5  1  2  0  4  1             1
6  3  1  4  2  1             2
7  0  4  3  4  0             2
8  1  3  0  1  1             1
9  3  4  1  2  1             3
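A variation, to the best of my knowledge supported on any modern pandas, is to select one column up front and transform with 'size'; since 'size' counts rows per group rather than non-null values, the result no longer depends on which column you pick:
df['unique_count'] = df.groupby(['a', 'b', 'c'])['d'].transform('size')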

python pandas shift next rows for values

I use pandas.
Input:
import pandas as pd
a = pd.Series([0, 0, 1, 0, 0, 0, 0])
Output:
0    0
1    0
2    1
3    0
4    0
5    0
6    0
I want each nonzero value to also fill the next three rows, giving this output:
0    0
1    0
2    1
3    1
4    1
5    1
6    0
I currently use
a + a.shift(1) + a.shift(2) + a.shift(3)
but I think this is not a smart solution. Who has a smarter solution for this?
You can try this, assuming index 6 should be value 1 too:
a = pd.Series([0, 0, 1, 0, 0, 0, 0])
a.eq(1).cumsum()

Out[19]:
0    0
1    0
2    1
3    1
4    1
5    1
6    1
dtype: int32
Updated: more than one value not equal to 0.
a = pd.Series([0, 0, 1, 0, 1, 3, 0])
A = pd.DataFrame({'a': a, 'Id': a.ne(0).cumsum()})
A.groupby('Id').a.cumsum()

Out[58]:
0    0
1    0
2    1
3    1
4    1
5    3
6    3
Or you can use ffill:
import numpy as np

a[a.eq(0)] = np.nan
a.ffill().fillna(0)

Out[64]:
0    0.0
1    0.0
2    1.0
3    1.0
4    1.0
5    3.0
6    3.0
1. Filter the series for "your" value (SearchValue).
2. Re-index the series to a to-be-stated length (LengthOfIndex) and forward-fill the matches a given number of times (LengthOfFillRange).
3. Fill the remaining NaNs with zeros again.
import pandas as pd
import numpy as np

a = pd.Series([0, 0, 1, 0, 0, 0, 0])
SearchValue = 1
LengthOfIndex = 7
LengthOfFillRange = 3

a = (a[a == SearchValue]
     .reindex(np.arange(LengthOfIndex),  # keep the original 0..6 index
              method='ffill',
              limit=LengthOfFillRange)
     .fillna(0))
If you need to repeat values only up to some limit, use replace to turn 0 into NaN, then ffill (forward fill) with a limit, and a final fillna to convert the remaining NaNs back to the original values (converting back to int if necessary):
a = pd.Series([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0])
print(a)
0     0
1     0
2     1
3     0
4     0
5     0
6     0
7     1
8     0
9     0
10    0
dtype: int64
b = a.replace(0, np.nan).ffill(limit=2).fillna(0).astype(a.dtype)
print(b)
0     0
1     0
2     1
3     1
4     1
5     0
6     0
7     1
8     1
9     1
10    0
dtype: int64
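The same replace/ffill/fillna idiom as a small reusable sketch; the helper name and signature are mine, not from the answers:
import pandas as pd

def propagate_nonzero(s, limit):
    # Repeat each nonzero value over the next `limit` rows
    # (hypothetical helper; keeps the caller's dtype).
    return s.mask(s.eq(0)).ffill(limit=limit).fillna(0).astype(s.dtype)

a = pd.Series([0, 0, 1, 0, 0, 0, 0])
print(propagate_nonzero(a, 3))  # the 1 at index 2 is carried through index 5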
