Pandas: how to reorder my table or dataframe - Python

I have a dataframe in pandas where each column has a different value range: a long-format table of (axis_x, axis_y, data) triples, as constructed in the answer below. My desired output is a wide table with one row per axis_x and the columns product, price, and country.

First, build a two-level index, then unstack on the MultiIndex and, finally, rename the columns:
import pandas as pd

df = pd.DataFrame({'axis_x': [0, 1, 2, 0, 1, 2, 0, 1, 2],
                   'axis_y': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'data': ['diode', 'switch', 'coil',
                            '$2.2', '$4.5', '$3.2',
                            'colombia', 'china', 'brazil']})

# set a 2-level index, unstack axis_y into the columns, then rename them
df = (df.set_index(['axis_x', 'axis_y'])
        .unstack()
        .rename(columns={0: 'product', 1: 'price', 2: 'country'}))
print(df)
Prints:
          data
axis_y  product price   country
axis_x
0         diode  $2.2  colombia
1        switch  $4.5     china
2          coil  $3.2    brazil
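For reference, an equivalent one-step sketch (my own, assuming the original long-format df, before set_index was applied): pivot does the set_index plus unstack in a single call, and the columns are then renamed directly.
out = df.pivot(index='axis_x', columns='axis_y', values='data')
out.columns = ['product', 'price', 'country']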

Related

If a row contains 1 in multiple columns, how to add a new column to the dataframe containing the column names where the value is 1

My dataframe has a Countries column plus one 0/1 column per location (Portugal, Newyork, Delhi, Osaka, Bangalore, Sydney, Mexico), as constructed in the answers below.
My question is: how can I add a Locations column holding the names of the columns where the row value is 1? For example, against Japan/US, "Newyork,Osaka" should be printed under the Locations column. Please advise how to solve this in Python.
Try this:
import pandas as pd

DF = pd.DataFrame
S = pd.Series

def construct_df() -> DF:
    data = {
        "Countries": ["Japan/US", "Australia & NZ", "America & India"],
        "Portugal": [0, 0, 0],
        "Newyork": [1, 0, 1],
        "Delhi": [0, 0, 0],
        "Osaka": [1, 0, 1],
        "Bangalore": [0, 0, 0],
        "Sydney": [0, 0, 0],
        "Mexico": [0, 0, 0],
    }
    return pd.DataFrame(data)

def calc_locations(x: DF) -> S:
    # keep only the integer (0/1) location columns
    x__location_cols_only = x.select_dtypes("integer")
    # replace each 1 with its column name and each 0 with ""
    x__ones_as_location_col_name = x__location_cols_only.apply(
        lambda ser: ser.replace({0: "", 1: ser.name})
    )
    location_cols = x__location_cols_only.columns.tolist()
    # concatenate the per-column strings row-wise, comma-separated
    ret = x__ones_as_location_col_name[location_cols[0]]
    for colname in location_cols[1:]:
        col = x__ones_as_location_col_name[colname]
        ret = ret.str.cat(col, sep=",")
    # collapse repeated commas and strip leading/trailing ones
    ret = ret.str.replace(r",+", ",", regex=True).str.strip(",")
    return ret

df_final = construct_df().assign(Locations=calc_locations)
assert df_final["Locations"].tolist() == ["Newyork,Osaka", "", "Newyork,Osaka"]
You can do:
data = {
    "Countries": ["Japan/US", "Australia & NZ", "America & India"],
    "Portugal": [0, 0, 0],
    "Newyork": [1, 0, 1],
    "Delhi": [0, 0, 0],
    "Osaka": [1, 0, 1],
    "Bangalore": [0, 0, 0],
    "Sydney": [0, 0, 0],
    "Mexico": [0, 0, 0],
}
df = pd.DataFrame(data)

# multiply each 0/1 cell by its column name: 1 -> name, 0 -> ""
df1 = df.iloc[:, 1:] * df.iloc[:, 1:].columns
df['Location'] = df1.values.tolist()
# join the non-empty names per row
df['Location'] = df['Location'].apply(lambda x: ','.join([y for y in x if len(y) > 1]))
Let's say your data is like below:
import pandas as pd

data = {'Countries': ['JP/US', 'Aus/NZ', 'America/India'],
        'Portugal': [0, 0, 0],
        'Newyork': [1, 0, 1],
        'Delhi': [0, 0, 1],
        'Osaka': [1, 0, 0],
        'Sydney': [0, 0, 0],
        'Mexico': [0, 0, 0],
        }
data_df = pd.DataFrame(data)
The DataFrame looks like this (please include a data sample like the above in your question so answers can reproduce your results):
       Countries  Delhi  Mexico  Newyork  Osaka  Portugal  Sydney
0          JP/US      0       0        1      1         0       0
1         Aus/NZ      0       0        0      0         0       0
2  America/India      1       0        1      0         0       0
If you execute the statements below:
data_df = data_df.set_index('Countries')
data_df['Locations'] = data_df.apply(lambda x: ", ".join(x[x!=0].index.tolist()), axis=1)
Your output will look like below:
               Delhi  Mexico  Newyork  Osaka  Portugal  Sydney       Locations
Countries
JP/US              0       0        1      1         0       0  Newyork, Osaka
Aus/NZ             0       0        0      0         0       0
America/India      1       0        1      0         0       0  Delhi, Newyork
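For reference, a compact alternative sketch (not from the answers above) folds the 0/1 matrix into the column names with DataFrame.dot; it assumes data_df as constructed above, before set_index was applied.
df_ind = data_df.set_index('Countries')
# 1 * name -> name, 0 * name -> ''; dot then concatenates across the columns
data_df['Locations'] = df_ind.dot(df_ind.columns + ',').str.rstrip(',').values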

Create new column (pandas dataframe) when duplicate ids have a payment date

I have a pandas dataframe:
pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
              'payment_count': [1, 2, 1, 2, 1, 2],
              'payment_date': ['2/2/2020', '4/6/2020', '3/20/2020', '3/29/2020', '5/1/2020', '5/30/2020']})
I want to take max('payment_count') for each 'id' and create new columns with the associated 'payment_date' values. Desired output:
pd.DataFrame({'id': [1, 2, 3],
              'payment_date_1': ['2/2/2020', '3/20/2020', '5/1/2020'],
              'payment_date_2': ['4/6/2020', '3/29/2020', '5/30/2020']})
You can try pivot with rename_axis, add_prefix and reset_index:
df.pivot(index='id', columns='payment_count', values='payment_date')\
  .rename_axis(None, axis=1)\
  .add_prefix('payment_date_')\
  .reset_index()
Output:
   id payment_date_1 payment_date_2
0   1       2/2/2020       4/6/2020
1   2      3/20/2020      3/29/2020
2   3       5/1/2020      5/30/2020
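A hedged aside: pivot raises if an (id, payment_count) pair is duplicated; pivot_table with aggfunc='first' is a duplicate-tolerant variant of the same chain.
(df.pivot_table(index='id', columns='payment_count',
                values='payment_date', aggfunc='first')
   .rename_axis(None, axis=1)
   .add_prefix('payment_date_')
   .reset_index())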
Another way, using groupby:
df['paydate'] = df.groupby('id')['payment_date'].cumcount() + 1
df['paydate'] = 'payment_date_' + df['paydate'].astype(str)
df = df.set_index(['paydate', 'id'])['payment_date']
df = df.unstack(0).rename_axis(None)
Ugly, but it does what you asked; pivot sounds better, though.
records = []
for gid, group in df.groupby('id'):
    # order each group's dates by payment_count
    ordered = group.sort_values('payment_count')['payment_date']
    payments = {f'payment_date_{i}': date
                for i, date in enumerate(ordered, start=1)}
    payments['id'] = gid
    records.append(payments)
_df = pd.DataFrame(records)

Pandas DataFrame - How to count consecutive values in rows across columns while ignoring NaNs

A bit stumped here and hoping the collective can assist!
Given the following DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'machine': ['A', 'B', 'C', 'D', 'E'],
    'test1': [1, 1, 0, np.nan, np.nan],
    'test2': [0, 0, 1, 1, np.nan],
    'test3': [1, 0, 1, np.nan, 0],
    'test4': [1, 1, np.nan, 1, 1],
    'test5': [1, 1, np.nan, 0, 0]
})
Imagine a 1 is a pass and a 0 is a fail; NaN means the machine was untested.
I would like to append two new columns to the end:
First - the maximum number of consecutive "1" values found, ignoring NaNs (a NaN is not treated as a 0; it is simply skipped and lets a run of consecutive "1" values continue through it).
Expected result:
max-cons-pass
3
2
2
2 (note how this ignores the NaN in-between the 1's)
1
Second - I would like the current number of consecutive "1" values counting back from the last column (test5 in this case), again ignoring NaNs.
Expected Results:
cur-cons-pass
3
2
2 (note how this ignores the NaNs in test4 and test5)
0
0
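No answer is recorded in this thread; below is a minimal row-wise sketch of one way to do it (my own, assuming the df above; the helper names are hypothetical), which matches the expected results:
tests = df.drop(columns='machine')

def max_cons_pass(row):
    # drop NaNs first so a run of 1's can continue through them
    best = cur = 0
    for v in row.dropna():
        cur = cur + 1 if v == 1 else 0
        best = max(best, cur)
    return best

def cur_cons_pass(row):
    # walk backwards from the last test, skipping NaNs, until a fail
    cur = 0
    for v in reversed(row.dropna().tolist()):
        if v != 1:
            break
        cur += 1
    return cur

df['max-cons-pass'] = tests.apply(max_cons_pass, axis=1)
df['cur-cons-pass'] = tests.apply(cur_cons_pass, axis=1)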

Creating columns based on values in the hierarchical row index

I have a pandas dataframe with a hierarchical row index:
import numpy as np
import pandas as pd

def stack_example():
    i = pd.DatetimeIndex(['2011-04-04', '2011-04-06',
                          '2011-04-12', '2011-04-13'])
    cols = pd.MultiIndex.from_product([['milk', 'honey'],
                                       [u'jan', u'feb'],
                                       [u'PRICE', 'LITERS']])
    df = pd.DataFrame(np.random.randint(12, size=(len(i), 8)),
                      index=i, columns=cols)
    df.columns.names = ['food', 'month', 'measure']
    df.index.names = ['when']
    # move the 'food' and 'month' column levels into the row index
    df = df.stack('food')
    df = df.stack('month')
    df['constant_col'] = "foo"
    df['liters_related_col'] = df['LITERS'] * 99
    return df
I can add new columns to this dataframe based on constants or based on calculations involving other columns.
I would like to add new columns based in part on calculations involving the index.
For example, just repeat the food name twice:
df.index
MultiIndex(levels=[[2011-04-04 00:00:00, 2011-04-06 00:00:00, 2011-04-12 00:00:00, 2011-04-13 00:00:00], [u'honey', u'milk'], [u'feb', u'jan']],
labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
names=[u'when', u'food', u'month'])
df.index.values[4][1]*2
'honeyhoney'
But I can't figure out the syntax for creating something like this:
df['xcol'] = df.index.values[2]*2
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\mds\Anaconda2\envs\bbg27\lib\site-packages\pandas\core\frame.py", line 2519, in __setitem__
self._set_item(key, value)
File "C:\Users\mds\Anaconda2\envs\bbg27\lib\site-packages\pandas\core\frame.py", line 2585, in _set_item
value = self._sanitize_column(key, value)
File "C:\Users\mds\Anaconda2\envs\bbg27\lib\site-packages\pandas\core\frame.py", line 2760, in _sanitize_column
value = _sanitize_index(value, self.index, copy=False)
File "C:\Users\mds\Anaconda2\envs\bbg27\lib\site-packages\pandas\core\series.py", line 3080, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index
I've also tried variations like df['xcol'] = df.index.values[:][2]*2
In the case of df.index.values[4][1] * 2, where the value is a string (honeyhoney), it's fine to assign that to a column:
df['col1'] = df.index.values[4][1] * 2
df.col1
when        food   month
2011-04-04  honey  feb      honeyhoney
                   jan      honeyhoney
            milk   feb      honeyhoney
                   jan      honeyhoney
In your second example, though, the one with the error, you're not actually performing an operation on a single value:
df.index.values[2]*2
(Timestamp('2011-04-04 00:00:00'),
'milk',
'feb',
Timestamp('2011-04-04 00:00:00'),
'milk',
'feb')
You could still smush all that into a string, or into some other format, depending on your needs:
df['col2'] = ''.join([str(x) for x in df.index.values[2]*2])
But the main issue is that the output of df.index.values[2]*2 gives you a multi-dimensional structure, which doesn't map to the existing structure of df.
New columns in df can either be a single value (in which case it's replicated automatically to fit the number of rows in df), or they can have the same number of entries as len(df).
UPDATE (per the comments):
IIUC, you can use get_level_values() to apply an operation to an entire level of a MultiIndex:
df.index.get_level_values(1).values*2
array(['honeyhoney', 'honeyhoney', 'milkmilk', 'milkmilk', 'honeyhoney',
'honeyhoney', 'milkmilk', 'milkmilk', 'honeyhoney', 'honeyhoney',
'milkmilk', 'milkmilk', 'honeyhoney', 'honeyhoney', 'milkmilk',
'milkmilk'], dtype=object)
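Since get_level_values() returns one entry per row of the frame, the result can be assigned straight back as a column (a small follow-up sketch reusing the question's xcol name):
df['xcol'] = df.index.get_level_values('food').values * 2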

Filter a pandas Dataframe based on specific month values and conditional on another column

I have a large dataframe with the following columns:
import pandas as pd
f = pd.DataFrame(columns=['month', 'Family_id', 'house_value'])
Months go from 0 to 239, Family_ids go up to 10900, and house values vary, so the dataframe has more than 2.5 million rows.
I want to filter the DataFrame to keep only the families whose final house value differs from their initial one.
Some sample data would look like this:
f = pd.DataFrame({'month': [0, 0, 0, 0, 0, 1, 1, 239, 239], 'family_id': [0, 1, 2, 3, 4, 0, 1, 0, 1], 'house_value': [10, 10, 5, 7, 8, 10, 11, 10, 11]})
And from that sample, the resulting dataframe would be:
g = pd.DataFrame({'month': [0, 1, 239], 'family_id': [1, 1, 1], 'house_value': [10, 11, 11]})
So I tried code along these lines:
ft = f[f.loc['month'==239, 'house_value'] > f.loc['month'==0, 'house_value']]
Also tried this:
g = f[f.house_value[f.month==239] > f.house_value[f.month==0] and f.family_id[f.month==239] == f.family_id[f.month==0]]
The above code gives KeyError: False and a ValueError. Any ideas? Thanks.
Use groupby.filter:
(f.sort_values('month')
.groupby('family_id')
.filter(lambda g: g.house_value.iat[-1] != g.house_value.iat[0]))
#    family_id  house_value  month
# 1          1           10      0
# 6          1           11      1
# 8          1           11    239
As @Bharath commented, your approach errors out because a boolean filter expects the boolean Series to have the same length as the original DataFrame, which is not true in either of your attempts due to the filtering you applied before the comparison.
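An alternative sketch (mine, not from the thread): groupby.transform returns a Series aligned with the original frame, so the boolean mask has the right length by construction.
fs = f.sort_values('month')
# first/last house value per family, broadcast back to every row
first = fs.groupby('family_id')['house_value'].transform('first')
last = fs.groupby('family_id')['house_value'].transform('last')
g = fs[first != last]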
