Reshaping DataFrame with pandas - python

So I'm working with pandas in Python. I collect data indexed by timestamps, coming from multiple sources.
This means one index can have 2 features available (and NaN values for the others, which is normal) or all features; it depends.
My problem arises when I add data with multiple values for the same index; see the example below.
Imagine this is the set we're adding new data to:
Index col1 col2
1 a A
2 b B
3 c C
This is the data we will add:
Index new col
1 z
1 y
Then the result is this :
Index col1 col2 new col
1 a A NaN
1 NaN NaN z
1 NaN NaN y
2 b B NaN
3 c C NaN
So instead of that, I would like the result to be :
Index col1 col2 new col1 new col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN
In other words, instead of repeating the same index across multiple rows with one value each, I want a single index row holding multiple feature columns.
I don't know if this is understandable. Another way to say it: I want the number of values per timestamp to equal the number of features, not the number of index rows.

This solution assumes the data that you need to add is a series.
Original df:
df = pd.DataFrame(np.random.randint(0,3,size=(3,3)),columns = list('ABC'),index = [1,2,3])
Data to add (series):
s = pd.Series(['x','y'],index = [1,1])
Solution:
df.join(s.to_frame()
.assign(cc = lambda x: x.groupby(level=0)
.cumcount().add(1))
.set_index('cc',append=True)[0]
.unstack()
.rename('New Col{}'.format,axis=1))
Output:
A B C New Col1 New Col2
1 1 2 2 x y
2 0 1 2 NaN NaN
3 2 2 0 NaN NaN
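For readability, here is the same chain unrolled into named intermediate steps (a sketch equivalent to the one-liner above; the intermediate names t and wide are mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 3, size=(3, 3)),
                  columns=list('ABC'), index=[1, 2, 3])
s = pd.Series(['x', 'y'], index=[1, 1])

t = s.to_frame(name='val')
# number the duplicates within each index label: 1, 2, ...
t['cc'] = t.groupby(level=0).cumcount() + 1
# pivot the counter into columns, one per duplicate
wide = t.set_index('cc', append=True)['val'].unstack()
wide = wide.rename('New Col{}'.format, axis=1)
result = df.join(wide)
```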

Alternative answer (maybe simpler, probably less pythonic). In general I think you need to look at converting wide data to long data and back again (pivot and transpose are good things to look up for this), but I also think there is a possible problem in your question: you don't mention new col1 and new col2 in the declaration of the subsequent arrays.
Here's my declarations of your data frames:
d = {'index': [1, 2, 3],'col1': ['a', 'b', 'c'], 'col2': ['A', 'B', 'C']}
df = pd.DataFrame(data=d)
e1 = {'index': [1], 'new col1': ['z']}
dfe1 = pd.DataFrame(data=e1)
e2 = {'index': [1], 'new col2': ['y']}
dfe2 = pd.DataFrame(data=e2)
They look like this:
index new col1
1 z
and this:
index new col2
1 y
Notice that I declare your new columns as part of the data frames. Once they're declared like that, it's just a matter of merging:
dfr1 = pd.merge(df, dfe1, on='index', how="outer")
dfr2 = pd.merge(dfr1, dfe2, on='index', how="outer")
And the output looks like this:
index col1 col2 new col1 new col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN

I think one problem may arise in the way you first create your second data frame.
Actually, expanding the number of features depending on the content is what makes this reformatting a bit annoying here (as you could see for yourself when writing two new column names out of the bare assumption that this reflects the number of features observed at each timestamp).
Here is yet another solution; it tries to be a bit more explicit about the steps taken than rhug123's answer.
# Initial dataFrames
a = pd.DataFrame({'col1':['a', 'b', 'c'], 'col2':['A', 'B', 'C']}, index=range(1, 4))
b = pd.DataFrame({'new col':['z', 'y']}, index=[1, 1])
Now the only important step is basically transposing your second DataFrame, while here you also need to introduce two new column names.
We will do this by grouping the second dataframe according to its index, collecting the content (y, z, ...) into lists:
c = b.groupby(b.index)['new col'].apply(list) # this has also one index per timestamp, but all features are grouped in a list
# New column names:
cols = ['New col%d' % (k+1) for k in range(b.index.value_counts().max())]
# Expanding dataframe "c" for each new column
d = pd.DataFrame(c.to_list(), index=b.index.unique(), columns=cols)
# Merge
a.join(d, how='outer')
Output:
col1 col2 New col1 New col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN
Finally, one problem encountered with both my answer and the one from rhug123 is that, as written, they won't correctly deal with a feature appearing at a different timestamp. I'm not sure what the OP expects here.
For example if b is:
new col
1 z
1 y
2 x
The merged output will be:
col1 col2 New col1 New col2
1 a A z y
2 b B x None
3 c C NaN NaN
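For what it's worth, the groupby/cumcount idea does handle that extra timestamp, since the counter restarts per index label; here is a sketch (the names wide and k are mine):

```python
import pandas as pd

a = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': ['A', 'B', 'C']},
                 index=range(1, 4))
b = pd.DataFrame({'new col': ['z', 'y', 'x']}, index=[1, 1, 2])

# number each repeated index occurrence, then pivot that counter into columns
wide = (b.assign(k=b.groupby(level=0).cumcount() + 1)
          .set_index('k', append=True)['new col']
          .unstack()
          .rename(columns='New col{}'.format))
result = a.join(wide)
```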

Related

pandas: how to merge columns irrespective of index

I have two dataframes with meaningless indexes but carefully curated order, and I want to merge them while preserving that order. So, for example:
>>> df1
First
a 1
b 3
and
>>> df2
c 2
d 4
After merging, what I want to obtain is this:
>>> Desired_output
First Second
AnythingAtAll 1 2 # <--- Row Names are meaningless.
SeriouslyIDontCare 3 4 # <--- But the ORDER of the rows is critical and must be preserved.
The fact that I've got row indices "a/b" and "c/d" is irrelevant; what is crucial is the order in which the rows appear. Every version of "join" I've seen requires me to manually reset the indices, which seems really awkward, and I don't trust that it won't screw up the ordering. I thought concat would work, but I get this:
>>> pd.concat( [df1, df2] , axis = 1, ignore_index= True )
0 1
a 1.0 NaN
b 3.0 NaN
c NaN 2.0
d NaN 4.0
# ^ obviously not what I want.
Even when I explicitly declare ignore_index.
How do I "overrule" the indexing and force the columns to be merged with the rows kept in the exact order that I supply them?
Edit:
Note that if I assign another column, the results are all "NaN".
>>> df1["second"]=df2["Second"]
>>> df1
First second
a 1 NaN
b 3 NaN
This was screwing me up but thanks to the suggestion from jsmart and topsail, you can dereference the indices by directly accessing the values in the column:
df1["second"]=df2["Second"].values
>>> df1
First second
a 1 2
b 3 4
^ Solution
This should also work I think:
df1["second"] = df2["second"].values
It would keep the index from the first dataframe, but since you have values in there such as "AnyThingAtAll" and "SeriouslyIdontCare" I guess any index values whatsoever are acceptable.
Basically, we are just adding a the values from your series as a new column to the first dataframe.
Here's a test example similar to your described problem:
# -----------
# sample data
# -----------
df1 = pd.DataFrame(
{
'x': ['a','b'],
'First': [1,3],
})
df1.set_index("x", drop=True, inplace=True)
df2 = pd.DataFrame(
{
'x': ['c','d'],
'Second': [2, 4],
})
df2.set_index("x", drop=True, inplace=True)
# ---------------------------------------------
# Add series as a new column to first dataframe
# ---------------------------------------------
df1["Second"] = df2["Second"].values
Result is:
   First  Second
x
a      1       2
b      3       4
The goal is to combine data based on position (not by Index). Here is one way to do it:
import pandas as pd
# create data frames df1 and df2
df1 = pd.DataFrame(data = {'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame(data = {'Second': [2, 4]}, index = ['c', 'd'])
# add a column to df1 -- add by position, not by Index
df1['Second'] = df2['Second'].values
print(df1)
First Second
a 1 2
b 3 4
And you could create a completely new data frame like this:
data = {'1st': df1['First'].values, '2nd': df1['Second'].values}
print(pd.DataFrame(data))
1st 2nd
0 1 2
1 3 4
ignore_index controls whether the output keeps the original labels along the concatenation axis. If it is True, the original labels are discarded and replaced by 0 to n-1, which is exactly what the column headers 0, 1 show in your result (with axis=1, the concatenation axis is the columns).
You can try
out = pd.concat( [df1.reset_index(drop=True), df2.reset_index(drop=True)] , axis = 1)
print(out)
First Second
0 1 2
1 3 4
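To see concretely why ignore_index alone doesn't help here, compare the two calls side by side (a small sketch with made-up data):

```python
import pandas as pd

df1 = pd.DataFrame({'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame({'Second': [2, 4]}, index=['c', 'd'])

# ignore_index=True renumbers the labels along the concat axis -- here the
# COLUMNS (axis=1) -- so the mismatched row indices still produce NaNs.
bad = pd.concat([df1, df2], axis=1, ignore_index=True)

# resetting both row indices first aligns the rows by position instead
good = pd.concat([df1.reset_index(drop=True),
                  df2.reset_index(drop=True)], axis=1)
```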

How to record the "least occuring" item in a pandas DataFrame?

I have the following pandas DataFrame, with only three columns:
import pandas as pd
dict_example = {'col1': ['A', 'A', 'A', 'A', 'A'],
                'col2': ['A', 'B', 'A', 'B', 'A'],
                'col3': ['A', 'A', 'A', 'C', 'B']}
df = pd.DataFrame(dict_example)
print(df)
col1 col2 col3
0 A A A
1 A B A
2 A A A
3 A B C
4 A A B
For the rows with differing elements, I'm trying to write a function which will return the column names of the "minority" elements.
As an example, in row 1, there are 2 A's and 1 B. Given there is only one B, I consider this the "minority". If all elements are the same, there's naturally no minority (or majority). However, if each column has a different value, I consider these columns to be minorities.
Here is what I have in mind:
col1 col2 col3 min
0 A A A []
1 A B A ['col2']
2 A A A []
3 A B C ['col1', 'col2', 'col3']
4 A A B ['col3']
I'm stumped how to computationally efficiently calculate this.
Finding the majority items appears straightforward, either by using pandas.DataFrame.mode() or by finding the most common item in a list as follows:
lst = ['A', 'B', 'A']
max(lst,key=lst.count)
But I'm not sure how I could find either the least occurring items.
This solution is not simple, but I could not think of a pandas-native solution without apply, and numpy does not seemingly provide much help beyond the complex-number trick below for row-wise uniqueness and value counts.
If you are not fixed on adding this min column, we can use some numpy tricks to NaN out the non-least-occurring entries. First, given your dataframe, we can make a numpy array of integers to help.
v = pd.factorize(df.stack())[0].reshape(df.shape)
v = pd.factorize(df.values.flatten())[0].reshape(df.shape)
(should be faster, as stack is unnecessary)
Then we use some tricks for numpy row-wise unique elements: complex numbers mark each element as belonging to its row, we find the least-occurring elements, and mask them in. This method is mostly from user unutbu, used in several answers.
def make_mask(a):
    # give each row a distinct imaginary offset so np.unique works row-wise
    weight = 1j*np.linspace(0, a.shape[1], a.shape[0], endpoint=False)
    b = a + weight[:, np.newaxis]
    u, ind, c = np.unique(b, return_index=True, return_counts=True)
    b = np.full_like(a, np.nan, dtype=float)
    np.put(b, ind, c)
    m = np.nanmin(b, axis=1)
    # remove rows with a single unique value (no minority there)
    b[(~np.isnan(b)).sum(axis=1) == 1, :] = np.nan
    # keep only the entries with the minimum count in each row
    b[~(b == m.reshape(-1, 1))] = np.nan
    return b
m = np.isnan(make_mask(v))
df[m] = np.nan
Giving
col1 col2 col3
0 NaN NaN NaN
1 NaN B NaN
2 NaN NaN NaN
3 A B C
4 NaN NaN B
Hopefully this achieves what you want in a performant way (say, if this dataframe is quite large). If there is a faster way to achieve the first line (without using stack), I would imagine this would be quite fast even for very large dataframes.
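If an apply-based solution is acceptable after all, the min column from the question can also be built row by row with value_counts; a sketch (the helper name minority_cols is mine):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'A', 'A', 'A'],
                   'col2': ['A', 'B', 'A', 'B', 'A'],
                   'col3': ['A', 'A', 'A', 'C', 'B']})

def minority_cols(row):
    counts = row.value_counts()
    if counts.size == 1:
        return []                       # all values equal: no minority
    least = set(counts[counts == counts.min()].index)
    return [col for col, val in row.items() if val in least]

df['min'] = df.apply(minority_cols, axis=1)
```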

Delete elements after it appeared a certain times

I have a data frame similar like this:
df = pd.DataFrame([['a', 'b'],
                   ['c', 'a'],
                   ['c', 'd'],
                   ['a', 'e'],
                   ['p', 'g'],
                   ['d', 'a'],
                   ['c', 'g']], columns=['col1', 'col2'])
I need to delete rows after an element has appeared a certain number of times. For example, say I want each value to appear at most 2 times in this dataframe (across both columns); the final dataframe could look like this:
[['a','b'],
['a','c'],
['c','d'],
['p','g']
]
The order of rows to delete doesn't matter here; I just want to cap the number of times a value appears in my dataframe.
Many Thanks!
IIUC, try:
n=2
s=df.stack()
s[(s.groupby(s).cumcount()+1).le(n)].unstack().dropna()
col1 col2
0 a b
1 a c
2 c d
4 p g
Here is one way, using stack, then cumcount with all:
s=df.stack()
s=s.groupby(s).cumcount().unstack()
df[(s<=1).all(1)]
Out[206]:
col1 col2
0 a b
1 a c
2 c d
4 p g
You can stack the data, cumcount, and unstack back:
s = df.stack()
df[s.groupby(s).cumcount().unstack().lt(2).all(1)]
Output:
col1 col2
0 a b
1 a c
2 c d
4 p g
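Putting the stack/cumcount idea from the answers above together with the sample data as one runnable snippet (filtering with a boolean mask on the row level instead of unstacking the counts):

```python
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['c', 'a'], ['c', 'd'], ['a', 'e'],
                   ['p', 'g'], ['d', 'a'], ['c', 'g']],
                  columns=['col1', 'col2'])
n = 2

s = df.stack()
# occurrence number of each value, counted across both columns in row order
occ = s.groupby(s).cumcount()
# keep a row only if every value in it is within its first n occurrences
keep = occ.lt(n).groupby(level=0).all()
out = df[keep]
```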

How to use an equality condition for manipulating a Pandas Dataframe based on another dataframe?

I have a dataframe in Python, say A, which has multiple columns, including columns named ECode and FG. I have another Pandas dataframe B, also with multiple columns, including columns named ECode, F Gping (note the space in the column name for F Gping) and EDesc. What I would like to do is create a new column called EDesc in dataframe A, based on the following conditions. (Note that EDesc, FG and F Gping contain String type values (text), while the remaining columns are numeric/floating type. Also, dataframes A and B have different dimensions, with differing rows and columns, and I want to check equality of specific values in the dataframe columns.)
First, for all rows in dataframe A where the value of ECode matches a value of ECode in dataframe B, fill the new EDesc column in A with the corresponding EDesc value from B.
Secondly, for all rows in dataframe A where the value of FG matches an F Gping value, fill the new EDesc column in A with the corresponding EDesc value from B.
After this, if the newly created EDesc column in A still has missing values/NaNs, set the string value MissingValue in all those rows of A's EDesc column.
I have tried using for loops, as well as list comprehensions, but they don't help in accomplishing this. Moreover, the space within the column name F Gping in B has been creating problems when accessing it: although I can access it like B['F Gping'], that by itself doesn't solve the problem. Any help in this regard is appreciated.
I'm assuming values are unique in B['ECode'] and B['F Gping'], otherwise we'll have to choose which value we give to A['EDesc'] when we find two matching values for ECode or FG.
There might be a smarter way but here's what I would do with joins:
Example DataFrames:
A = pd.DataFrame({'ECode': [1, 1, 3, 4, 6],
'FG': ['a', 'b', 'c', 'b', 'y']})
B = pd.DataFrame({'ECode': [1, 2, 3, 5],
'F Gping': ['b', 'c', 'x', 'x'],
'EDesc': ['a', 'b', 'c', 'd']})
So they look like:
A
ECode FG
0 1 a
1 1 b
2 3 c
3 4 b
4 6 y
B
ECode F Gping EDesc
0 1 b a
1 2 c b
2 3 x c
3 5 x d
First let's create A['EDesc'] by saying that it's the result of joining A and B on ECode. We'll temporarily use ECode as the index:
A.set_index('ECode', inplace=True, drop=False)
B.set_index('ECode', inplace=True, drop=False)
A['EDesc'] = A.join(B, lsuffix='A')['EDesc']
This works because the result of A.join(B, lsuffix='A') is:
ECodeA FG ECode F Gping EDesc
ECode
1 1 a 1.0 b a
1 1 b 1.0 b a
3 3 c 3.0 x c
4 4 b NaN NaN NaN
6 6 y NaN NaN NaN
Now let's fillna on A['EDesc'], using the match on FG. Same thing:
A.set_index('FG', inplace=True, drop=False)
B.set_index('F Gping', inplace=True, drop=False)
A['EDesc'].fillna(A.join(B, lsuffix='A')['EDesc'].drop_duplicates(), inplace=True)
This works because the result of A.join(B, lsuffix='A') is:
ECodeA FG EDescA ECode F Gping EDesc
a 1 a a NaN NaN NaN
b 1 b a 1.0 b a
b 4 b NaN 1.0 b a
c 3 c c 2.0 c b
y 6 y NaN NaN NaN NaN
Also we dropped the duplicates because as you can see there are two b's in our index.
Finally let's fillna with "Missing" and reset the index:
A['EDesc'].fillna('Missing', inplace=True)
A.reset_index(drop=True, inplace=True)
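An equivalent variant that avoids mutating the indices in place is to build two lookup Series and use map; a sketch with the same example data (the names by_code and by_fg are mine, and I assume B['ECode'] values are unique, with duplicate F Gping values resolved by keeping the first):

```python
import pandas as pd

A = pd.DataFrame({'ECode': [1, 1, 3, 4, 6],
                  'FG': ['a', 'b', 'c', 'b', 'y']})
B = pd.DataFrame({'ECode': [1, 2, 3, 5],
                  'F Gping': ['b', 'c', 'x', 'x'],
                  'EDesc': ['a', 'b', 'c', 'd']})

by_code = B.set_index('ECode')['EDesc']
by_fg = B.drop_duplicates('F Gping').set_index('F Gping')['EDesc']

# match on ECode first, fall back to FG, then mark what's left as missing
A['EDesc'] = (A['ECode'].map(by_code)
                        .fillna(A['FG'].map(by_fg))
                        .fillna('Missing'))
```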

How to get non-NaN elements' index and the value from a DataFrame

I have a big data frame with lots of NaN, I want to store it into a smaller data frame which stores all the indexes and the values of the non-NaN, non-zero values.
dff = pd.DataFrame(np.random.randn(4,3), columns=list('ABC'))
dff.iloc[0:2,0] = np.nan
dff.iloc[2,2] = np.nan
dff.iloc[1:4,1] = 0
The data frame may look like this:
A B C
0 NaN -2.268882 0.337074
1 NaN 0.000000 1.340350
2 -1.526945 0.000000 NaN
3 -1.223816 0.000000 -2.185926
I want a data frame looks like this:
0 B -2.268882
0 C 0.337074
1 C 1.340350
2 A -1.526945
3 A -1.223816
4 C -2.185926
How can I do it quickly? I have a relatively big data frame, thousands by thousands...
Many Thanks!
Replace 0 with np.nan and .stack() the result (see docs).
If there's a chance that you have rows of all np.nan values after .replace(), you could do .dropna(how='all') before .stack() to reduce the number of rows to pivot. If that could apply to columns, do .dropna(how='all', axis=1).
df.replace(0, np.nan).stack()
0 B -2.268882
C 0.337074
1 C 1.340350
2 A -1.526945
3 A -1.223816
C -2.185926
Combine with .reset_index() as needed.
To select from a Series with MultiIndex use .loc[(level_0, level_1)]:
df.loc[(0, 'B')] = -2.268882
Details on slicing etc in the docs.
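As a complete runnable sketch of that answer (with a small fixed frame instead of random data; the column names row/col/value are mine):

```python
import numpy as np
import pandas as pd

dff = pd.DataFrame({'A': [np.nan, np.nan, -1.5, -1.2],
                    'B': [-2.2, 0.0, 0.0, 0.0],
                    'C': [0.3, 1.3, np.nan, -2.1]})

out = (dff.replace(0, np.nan)   # zeros become NaN ...
          .stack()              # ... and stack pivots columns into a MultiIndex
          .dropna()             # explicit, in case stack keeps NaNs in newer pandas
          .reset_index())
out.columns = ['row', 'col', 'value']
```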
I've come up with a somewhat ugly way of achieving this, but hey, it works. Note that with this solution the index runs from 0 and the original order of 'A', 'B', 'C' as in your question is not preserved, if that matters.
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(4,3), columns=list('ABC'))
dff.iloc[0:2,0] = np.nan
dff.iloc[2,2] = np.nan
dff.iloc[1:4,1] = 0
dff.iloc[2,1] = np.nan
# mask to do logical and for two lists
mask = lambda y,z: list(map(lambda x: x[0] and x[1], zip(y,z)))
# create new frame
new_df = pd.DataFrame()
types = []
vals = []
# iterate over columns
for col in dff.columns:
    # get the non empty and non zero values from current column
    data = dff[col][mask(dff[col].notnull(), dff[col] != 0)]
    # add corresponding original column name
    types.extend([col for x in range(len(data))])
    vals.extend(data)
# populate the dataframe
new_df['Types'] = pd.Series(types)
new_df['Vals'] = pd.Series(vals)
print(new_df)
# A B C
#0 NaN -1.167975 -1.362128
#1 NaN 0.000000 1.388611
#2 1.482621 NaN NaN
#3 -1.108279 0.000000 -1.454491
# Types Vals
#0 A 1.482621
#1 A -1.108279
#2 B -1.167975
#3 C -1.362128
#4 C 1.388611
#5 C -1.454491
I am looking forward to a more pandas/python-like answer myself!
