Strip index as Pandas column - python

I have a table with a single index column looking like this:
1 2 3
Monday_0 NaN NaN NaN
Monday_1 NaN NaN NaN
Tuesday_2 NaN NaN NaN
Tuesday_3 NaN NaN NaN
I want to keep the index, but want the first part of the index into a new column. In other words, it should look like this:
1 2 3 Day
Monday_0 NaN NaN NaN Monday
Monday_1 NaN NaN NaN Monday
Tuesday_2 NaN NaN NaN Tuesday
Tuesday_3 NaN NaN NaN Tuesday
So I have tried a number of different solutions:
df = df.reset_index()
df['Day'] = str(df['index']).split('_')
This gives me the whole series per row.
df['Day'] = str(df.index.split('_')[0])
Doesn't work, as the index does not have a split function.
df['Day'] = df.index.as_type('str').split('_')[0]
Doesn't work, as the index does not have an as_type function.
df.index.set_levels(df.index.get_level_values(level=1).str.split('_')[0],
                    level=1, inplace=True)
Doesn't work, as 'Index' object has no attribute 'set_levels'. I guess that only works with a MultiIndex?
And with that I am all out of ideas.

Try str.split
df['Day'] = df.index.str.split('_').str[0]
df
Out[219]:
1 2 3 Day
Monday_0 NaN NaN NaN Monday
Monday_1 NaN NaN NaN Monday
Tuesday_2 NaN NaN NaN Tuesday
Tuesday_3 NaN NaN NaN Tuesday
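For completeness, the reset_index route attempted in the question can also work if the split is done with the vectorized .str accessor on the new column instead of wrapping the whole Series in str(). A minimal sketch (note the restored index will end up named 'index'):
df = df.reset_index()                          # the former index becomes a column named 'index'
df['Day'] = df['index'].str.split('_').str[0]  # element-wise split, keep the first part
df = df.set_index('index')                     # put the labels back as the index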

Related

How to pull any cells from a table/dataframe into a column if they contain specific string?

I am using Python in Colab and I am trying to find something that will let me move any cells from a subset of a dataframe into a new/different column in the same dataframe, OR sort the cells of the dataframe into the correct columns.
The original column in the CSV held comma-separated entries such as Taxi (km)(20) or Motorbike (km)(500),Car (km)(500), and using
Users[['Motorbike', 'Car', 'Bus', 'Train', 'Tram', 'Taxi']] = Users['What distance did you travel in the last month by:'].str.split(',', expand=True)
I was able to split the column into 6 new series.
However, now I would like all the cells containing 'Motorbike' to end up in the Motorbike column, all the cells containing 'Car' in the Car column, and so on, without overwriting any other cells, OR, if this cannot be done, to just assign any occurrences of Motorbike, Car, etc. into the new columns 'Motorbike1', 'Car1', etc. that I have added to the dataframe. Can anyone help please?
I have tried copying the cells from the original columns into the new columns and then removing the values that are not, say, 'Car'. However, repeating this for the next original column into the same new column overwrites the earlier values.
There are no repeats of any mode of transport in any row, i.e. each mode of transport occurs at most once per row.
You can use a regex to extract the xxx (yyy)(yyy) parts, then reshape:
out = (df['col_name']
       .str.extractall(r'([^,]+) (\([^,]*\))')   # capture the mode and its "(km)(value)" part
       .set_index(0, append=True)[1]             # move the mode into the index, keep the value
       .droplevel('match')                       # drop extractall's match counter
       .unstack(0)                               # spread the modes out into columns
       )
output:
Bus Car Motorbike Taxi Train Tram
0 NaN NaN NaN (km)(20) NaN NaN
1 NaN (km)(500) (km)(500) NaN NaN NaN
2 NaN (km)(1000) NaN NaN NaN NaN
3 NaN (km)(100) NaN NaN (km)(20) NaN
4 (km)(150) NaN NaN (km)(25) (km)(700) NaN
5 (km)(40) (km)(0) (km)(0) NaN NaN NaN
6 NaN (km)(300) NaN (km)(100) NaN NaN
7 NaN (km)(300) NaN NaN NaN NaN
8 NaN NaN NaN NaN (km)(80) (km)(300)
9 (km)(50) (km)(700) NaN NaN NaN (km)(50)
If you only need the numbers, you can change the regex:
(df['col_name'].str.extractall(r'([^,]+)\s+\(km\)\((\d+)\)')
   .set_index(0, append=True)[1]
   .droplevel('match')
   .unstack(0).rename_axis(columns=None)
)
Output:
Bus Car Motorbike Taxi Train Tram
0 NaN NaN NaN 20 NaN NaN
1 NaN 500 500 NaN NaN NaN
2 NaN 1000 NaN NaN NaN NaN
3 NaN 100 NaN NaN 20 NaN
4 150 NaN NaN 25 700 NaN
5 40 0 0 NaN NaN NaN
6 NaN 300 NaN 100 NaN NaN
7 NaN 300 NaN NaN NaN NaN
8 NaN NaN NaN NaN 80 300
9 50 700 NaN NaN NaN 50
Use a list comprehension with split to build dictionaries, then pass them to the DataFrame constructor:
L = [dict([y.split() for y in x.split(',')])
     for x in df['What distance did you travel in the last month by:']]
df = pd.DataFrame(L)
print(df)
Taxi Motorbike Car Train Bus Tram
0 (km)(20) NaN NaN NaN NaN NaN
1 NaN (km)(500) (km)(500) NaN NaN NaN
2 NaN NaN (km)(1000) NaN NaN NaN
3 NaN NaN (km)(100) (km)(20) NaN NaN
4 (km)(25) NaN NaN (km)(700) (km)(150) NaN
5 NaN (km)(0) (km)(0) NaN (km)(40) NaN
6 (km)(100) NaN (km)(300) NaN NaN NaN
7 NaN NaN (km)(300) NaN NaN NaN
8 NaN NaN NaN (km)(80) NaN (km)(300)
9 NaN NaN (km)(700) NaN (km)(50) (km)(50)
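If only the numbers are wanted with this dictionary approach, a regex can pull (mode, distance) pairs directly. A hedged sketch, assuming every entry follows the 'Mode (km)(number)' pattern:
import re

# Build one {mode: distance} dict per row, then let the constructor align the columns
L = [{k.strip(): int(v)
      for k, v in re.findall(r'([^,]+?)\s*\(km\)\((\d+)\)', x)}
     for x in df['What distance did you travel in the last month by:']]
df_num = pd.DataFrame(L)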

How to manipulate the value of a pandas multiindex on a specific level?

Given a dataframe with row and column multiindex, how would you copy a row index "object" and manipulate a specific index value on a chosen level? Ultimately I would like to add a new row to the dataframe with this manipulated index.
Taking this dataframe df as an example:
col_index = pd.MultiIndex.from_product([['A','B'], [1,2,3,4]], names=['cInd1', 'cInd2'])
row_index = pd.MultiIndex.from_arrays([['2010','2011','2009'],['a','r','t'],[45,34,35]], names=["rInd1", "rInd2", 'rInd3'])
df = pd.DataFrame(data=None, index=row_index, columns=col_index)
df
cInd1 A B
cInd2 1 2 3 4 1 2 3 4
rInd1 rInd2 rInd3
2010 a 45 NaN NaN NaN NaN NaN NaN NaN NaN
2011 r 34 NaN NaN NaN NaN NaN NaN NaN NaN
2009 t 35 NaN NaN NaN NaN NaN NaN NaN NaN
I would like to take the index of the first row, manipulate the "rInd2" value and use this index to insert another row.
Pseudo code would be something like this:
#Get Index
idx = df.index[0]
#Manipulate Value
idx[1] = "L" #or idx["rInd2"]
#Make new row with new index
df.loc[idx, slice(None)] = None
The desired output would look like this:
cInd1 A B
cInd2 1 2 3 4 1 2 3 4
rInd1 rInd2 rInd3
2010 a 45 NaN NaN NaN NaN NaN NaN NaN NaN
2011 r 34 NaN NaN NaN NaN NaN NaN NaN NaN
2009 t 35 NaN NaN NaN NaN NaN NaN NaN NaN
2010 L 45 NaN NaN NaN NaN NaN NaN NaN NaN
What would be the most efficient way to achieve this?
Is there a way to do the same procedure with the column index?
Thanks
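No answer is recorded here, but a hedged sketch of one way to do it: MultiIndex entries are tuples and therefore immutable, so instead of mutating idx in place you can copy it into a list, change the desired level, and append a one-row frame built from the rebuilt index. The lookup of the level by name is a convenience assumption, not code from the question:
# Copy the first row's index entry into a mutable list
idx = list(df.index[0])

# Change the value on the 'rInd2' level
idx[df.index.names.index('rInd2')] = 'L'

# Build a one-row frame with the modified index and append it
new_index = pd.MultiIndex.from_tuples([tuple(idx)], names=df.index.names)
new_row = pd.DataFrame(data=None, index=new_index, columns=df.columns)
df = pd.concat([df, new_row])
The same kind of tuple manipulation works for df.columns as well.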

Forward fill column one year after last observation

I forward fill values in the following df using:
df = (df.resample('d')  # ensure data is daily time series
        .ffill()
        .sort_index(ascending=True))
df before forward fill
id a b c d
datadate
1980-01-31 NaN NaN NaN NaN
1980-02-29 NaN 2 NaN NaN
1980-03-31 NaN NaN NaN NaN
1980-04-30 1 NaN 3 4
1980-05-31 NaN NaN NaN NaN
... ... ... ...
2019-08-31 NaN NaN NaN NaN
2019-09-30 NaN NaN NaN NaN
2019-10-31 NaN NaN NaN NaN
2019-11-30 NaN NaN NaN NaN
2019-12-31 NaN NaN 20 33
However, I wish to forward fill only up to one year after the last observation (the index is a datetime) and leave the remaining rows as NaN. I am not sure of the best way to introduce this criterion. Any help would be super!
Thanks
If I understand you correctly, you want to forward-fill the values on Dec 31, 2019 to the next year. Try this:
end_date = df.index.max()
new_end_date = end_date + pd.offsets.DateOffset(years=1)
new_index = df.index.append(pd.date_range(end_date, new_end_date, closed='right'))
df = df.reindex(new_index)
df.loc[end_date:, :] = df.loc[end_date:, :].ffill()
Result:
a b c d
1980-01-31 NaN NaN NaN NaN
1980-02-29 NaN 2.0 NaN NaN
1980-03-31 NaN NaN NaN NaN
1980-04-30 1.0 NaN 3.0 4.0
1980-05-31 NaN NaN NaN NaN
2019-08-31 NaN NaN NaN NaN
2019-09-30 NaN NaN NaN NaN
2019-10-31 NaN NaN NaN NaN
2019-11-30 NaN NaN NaN NaN
2019-12-31 NaN NaN 20.0 33.0
2020-01-01 NaN NaN 20.0 33.0
2020-01-02 NaN NaN 20.0 33.0
...
2020-12-31 NaN NaN 20.0 33.0
One solution is to forward fill using a limit parameter, but this won't handle leap years:
df.fillna(method='ffill', limit=365)
The second solution is to define a more robust function to do the forward fill in the 1-year window:
from pandas.tseries.offsets import DateOffset

def fun(serie_df):
    serie = serie_df.copy()
    indexes = serie[~serie.isnull()].index
    for idx in indexes:
        # fill only inside the 1-year window that starts at this observation
        mask = (serie.index >= idx) & (serie.index < idx + DateOffset(years=1))
        serie.loc[mask] = serie[mask].fillna(method='ffill')
    return serie

df_filled = df.apply(fun, axis=0)
If a column has multiple non-NaN values in the same 1-year window, the first solution's fill simply stops once the more recent value is encountered, while the second solution treats the consecutive values as independent, each getting its own 1-year window.
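For the literal reading of the question (forward fill only up to one year after each column's last observation and leave everything later as NaN), a per-column sketch, assuming a DatetimeIndex; the function name is just for illustration:
import numpy as np
import pandas as pd

def ffill_one_year_after_last(col):
    # Forward-fill, then blank out anything more than one year
    # past this column's last recorded observation
    last = col.last_valid_index()
    if last is None:
        return col
    filled = col.ffill()
    filled[filled.index > last + pd.DateOffset(years=1)] = np.nan
    return filled

df_filled = df.apply(ffill_one_year_after_last)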

Transpose a dataframe based on column entries with sequences

right now I have a dataframe with two columns:
columnA columnB
:16R:AB NaN
:20C::XX S400500
:16X:AB NaN
:16R:AC NaN
:16X:AC NaN
:16R:AB NaN
:31X::BB Sema
:16R:AB NaN
I want to transpose the dataframe based on some sequences. The :16R:AB till the next :16X:AB is a sequence, then from :16R:AC till :16X:AC, and so on. I also want to add a counter/ID, so that the final dataframe looks like:
Index/Counter :16R:AB :20C::XX :16X:AB :16R:AC :16X:AC :31X:BB
1 NaN S400500 NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN Sema
Is there any trick to do it? Or is it a manual loop?
So, rebuilding a prototype of your example:
import numpy as np
import pandas as pd

D = pd.DataFrame([np.random.randint(1, 5, 20), np.random.randn(20)]).T
D.columns = ["key", "value"]
key value
4.0 1.017081
4.0 -1.480929
3.0 -1.257809
1.0 -0.683207
...
Now we can add a field that counts the occurrences of the same key:
D["occurrence"] = (D.key == 4.0).cumsum()
... and then we are able to pivot:
D.pivot(index="occurrence", columns="key", values=["value"])
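Translating the same trick to the question's columnA/columnB layout might look roughly like this. A hedged sketch: it assumes a new sequence starts at every tag beginning with ':16R:' and that no tag repeats within a sequence, since pivot needs unique index/column pairs:
# Number the blocks: each ':16R:...' tag opens a new sequence
df['seq'] = df['columnA'].str.startswith(':16R:').cumsum()

# Pivot each block into one row, with the tags as columns
out = df.pivot(index='seq', columns='columnA', values='columnB')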

assigning multiple values to different cells in a dataframe

This is probably an easy question, but I couldn't find any simple way to do that. Imagine the following dataframe:
df = pd.DataFrame(index=range(10), columns=range(5))
and three lists that contain indices, columns, and values of the defined dataframe that I intend to change:
idx_list = [1,5,3,7] # the indices of the cells that I want to change
col_list = [1,4,3,1] # the columns of the cells that I want to change
value_list = [9,8,7,6] # the final values of those cells
I was wondering if there exists a function in pandas that does the following efficiently:
for i in range(len(idx_list)):
    df.loc[idx_list[i], col_list[i]] = value_list[i]
Thanks.
Using .values
df.values[idx_list, col_list] = value_list
df
Out[205]:
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 NaN 9 NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN 7 NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN 8
6 NaN NaN NaN NaN NaN
7 NaN 6 NaN NaN NaN
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN
Or another, less efficient way:
updatedf = pd.Series(value_list, index=pd.MultiIndex.from_arrays([idx_list, col_list])).unstack()
df.update(updatedf)
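A more explicit variant of the .values approach, for cases where you would rather not rely on .values returning a view, is to do the fancy indexing on a plain NumPy array and rebuild the frame. A hedged sketch; like the answer above, it relies on the positional and label indices coinciding, as they do in this example:
arr = df.to_numpy()                   # plain array copy of the data
arr[idx_list, col_list] = value_list  # NumPy fancy indexing sets all cells at once
df = pd.DataFrame(arr, index=df.index, columns=df.columns)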
Try the df.applymap() function; you can use a lambda to do the required operations.
