Resetting index of columns in pivot table - python

I have written code to convert rows into columns in a particular order. Everything runs fine, but the index of the columns is not in the right order. Here is the code:
import pandas as pd
df = pd.read_csv('UNT_Data.csv', low_memory=False)
df.columns = df.columns.str.replace(' ', '_')
#making index for every change period
df['idx'] = df.groupby('GR_Key').cumcount()
#converting index column name to Change_Period_Start_
df['date_idx'] = 'Change_Period_Start_' + df.idx.astype(str)
#converted the columns to one row for one GR Key
date = df.pivot_table(index='GR_Key', columns='date_idx', values='Change_Period_Start', aggfunc='first')
Here is a screenshot of the result: [image not included]

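First, remove the line that converts the counter to strings with a prefix. String labels are sorted lexicographically in the pivoted columns, so Change_Period_Start_10 comes before Change_Period_Start_2:
df['date_idx'] = 'Change_Period_Start_' + df.idx.astype(str)
Then pivot on the integer idx column instead and add the prefix afterwards with DataFrame.add_prefix: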
date = (df.pivot_table(index='GR_Key',
                       columns='idx',
                       values='Change_Period_Start',
                       aggfunc='first')
          .add_prefix('Change_Period_Start_'))
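To see the difference between the two approaches, here is a small sketch. The data is a made-up stand-in for UNT_Data.csv (one hypothetical key, K1, with 12 change periods); only the column names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for UNT_Data.csv: one key with 12 change periods.
df = pd.DataFrame({
    "GR_Key": ["K1"] * 12,
    "Change_Period_Start": [f"2020-01-{day:02d}" for day in range(1, 13)],
})
df["idx"] = df.groupby("GR_Key").cumcount()

# String labels: the pivoted columns are sorted as text.
df["date_idx"] = "Change_Period_Start_" + df["idx"].astype(str)
by_string = df.pivot_table(index="GR_Key", columns="date_idx",
                           values="Change_Period_Start", aggfunc="first")

# Integer labels are sorted numerically; the prefix is added afterwards.
by_int = (df.pivot_table(index="GR_Key", columns="idx",
                         values="Change_Period_Start", aggfunc="first")
            .add_prefix("Change_Period_Start_"))

string_order = list(by_string.columns)
int_order = list(by_int.columns)
print(string_order[:4])
print(int_order[:4])
```

With the string labels, the suffix 10 lands right after 1; with the integer labels, the suffixes run 0, 1, 2, 3 and so on.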

replace NaNs with 0 for df columns where column name contains specific string (pandas)

I have a dataframe as a result of a pivot which has several thousand columns (representing time-boxed attributes). Below is a much shortened version for resemblance.
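import numpy as np
import pandas as pd
# np.nan marks real missing values; the string 'NaN' would not be touched by fillna
d = {'incount - 14:00': [1, np.nan, 1, 1, np.nan, np.nan, np.nan, np.nan, 1],
     'incount - 15:00': [2, 1, 2, np.nan, np.nan, np.nan, 1, 4, np.nan],
     'outcount - 14:00': [2, np.nan, 1, 1, 1, 1, 2, 2, 1],
     'outcount - 15:00': [2, 2, 1, 1, np.nan, 2, np.nan, 1, 1]}
df = pd.DataFrame(data=d)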
I want to replace the NaNs in columns that contain "incount" with 0 (leaving other columns untouched). I have tried the following but predictably it does not recognise the column name.
df['incount'] = df_all['incount'].fillna(0)
I need the ability to search the column names and only impact those containing a defined string.
Try this:
m = df.columns[df.columns.str.startswith('incount')]
df.loc[:, m] = df.loc[:, m].fillna(0)
print(df)
You can use:
loop_cols = list(df.columns[df.columns.str.contains('incount', na=False)])  # get columns containing incount as a list
# or
# loop_cols = [col for col in df.columns if 'incount' in col]
print(loop_cols)
# ['incount - 14:00', 'incount - 15:00']
for i in loop_cols:
    df[i] = df[i].fillna(0)
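Another option, sketched here on a shortened made-up frame, is to pass fillna a dict that maps only the matching column names to the fill value. The column names follow the question; the values and the variable name incount_cols are illustrative.

```python
import numpy as np
import pandas as pd

# Shortened, made-up version of the pivot result from the question.
df = pd.DataFrame({
    "incount - 14:00": [1, np.nan, 1],
    "incount - 15:00": [2, 1, np.nan],
    "outcount - 14:00": [2, np.nan, 1],
})

# Columns whose name contains the search string.
incount_cols = [col for col in df.columns if "incount" in col]

# fillna with a dict only fills the listed columns; the others keep their NaNs.
df = df.fillna({col: 0 for col in incount_cols})
print(df)
```

This avoids the explicit loop and leaves the outcount columns untouched.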

Formatting the data which increases monotonically in Python

I have formatted the data as needed, but my final dataframe is not monotonically increasing, whereas the input data increases monotonically in its first column (freq). Here is the link for Data_input_truncated.txt. My Python code is below:
import pandas as pd
#create DataFrame from csv with columns freq and v
df = pd.read_csv('Data_input.txt', sep=r"\s+", names=['freq','v'])
#boolean mask for identify columns of new df
m = df['v'].str.endswith(')')
#new column by replace NaNs by forward filling
df['g'] = df['v'].where(m).ffill()
#get original ordering for new columns
cols = df['g'].unique()
#remove rows with same values in v and g columns
df = df[df['v'] != df['g']]
#reshape by pivoting with change ordering of columns by reindex
df = df.pivot(index='freq', columns='g', values='v').rename_axis(None, axis=1).reindex(columns=cols).reset_index()
df.columns = [x.replace('(','').replace(')','').replace(',',':') for x in df.columns]
df.to_csv('target.txt', index=False, sep='\t')
Now the created target.txt is not monotonic. Here is the link for target.txt. How can I make it monotonic before saving as a file?
I am using Spyder 3.2.6 (Anaconda) where python 3.6.4 64-bit is embedded.
The problem is that your freq values are strings, not floats, so pivoting sorts them in alphabetical order rather than numerical order. One option is to convert the freq column to float before pivoting. If the scientific-notation formatting matters, you can set the float_format parameter of to_csv:
### same code as before
#remove rows with same values in v and g columns
df = df[df['v'] != df['g']]
# convert to float
df['freq']= df['freq'].astype(float)
#reshape by pivoting with change ordering of columns by reindex
df = df.pivot(index='freq', columns='g', values='v').rename_axis(None, axis=1).reindex(columns=cols).reset_index()
df.columns = [x.replace('(','').replace(')','').replace(',',':') for x in df.columns]
df.to_csv('target.txt', index=False, sep='\t', float_format='%.17E' ) # add float_format='%.17E'
Note that float_format='%.17E' means scientific notation with 17 digits after the decimal point, as in your input. You can change this number to whatever you want if that many digits are not important.
EDIT: I get this result in target.txt (first 5 rows and 3 columns)
freq R1:1 R1:2
0.00000000000000000E+00 4.07868642871600962E0 3.12094533520232087E-13
1.00000000000000000E+06 4.43516799439728793E0 4.58503433913467795E-3
2.00000000000000000E+06 4.54224931058591253E0 1.21517855438593236E-2
3.00000000000000000E+06 4.63952376349496909E0 2.10017318391844077E-2
4.00000000000000000E+06 4.74002677709486608E0 3.05258806632440871E-2
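A small sketch of why the conversion matters, using made-up frequency strings rather than the real input file: sorted as text, 1.0E+07 comes before 2.0E+06, while sorted as floats the order is numeric.

```python
import pandas as pd

# Made-up frequency values stored as text, as they are after reading the file.
freq_as_text = pd.Series(["0.0E+00", "1.0E+06", "1.0E+07", "2.0E+06"])

# Text sorting compares character by character.
text_sorted = freq_as_text.sort_values().tolist()

# Float sorting compares the numeric values.
float_sorted = freq_as_text.astype(float).sort_values().tolist()

# The same %-style format string that to_csv(float_format=...) accepts.
formatted = "%.17E" % 1e6

print(text_sorted)
print(float_sorted)
print(formatted)
```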

Set up MultiIndex DataFrame from multiple CSV files in DateTime series

I have a list of time series price data in CSV format that is read as follows:
asxList = ['ANZ', 'NAB', 'WBC']
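for asxCode in asxList:
    # DataFrame.from_csv was removed in pandas 1.0; read_csv with these arguments is the equivalent
    ohlcData = pd.read_csv(asxCode+'.CSV', header=0, index_col=0, parse_dates=True)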
How do I assemble all the ohlcData in a particular order: first by the DateTime index, second by the asxList ['ANZ', 'NAB', 'WBC'] index, followed by the data columns?
Create a list of dataframes, add a code column to each dataframe:
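dfs = []
for asxCode in asxList:
    # DataFrame.from_csv was removed in pandas 1.0; read_csv with these arguments is the equivalent
    df = pd.read_csv(asxCode+'.CSV', header=0, index_col=0, parse_dates=True)
    df['code'] = asxCode
    dfs.append(df)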
Concatenate the dataframes, add the code column to the index:
pd.concat(dfs).reset_index().set_index(['index', 'code'])
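Almost the same as Dyz's answer, but using the keys argument of concat:
asxList = ['ANZ', 'NAB', 'WBC']
l = []
for asxCode in asxList:
    # DataFrame.from_csv was removed in pandas 1.0; read_csv with these arguments is the equivalent
    l.append(pd.read_csv(asxCode+'.CSV', header=0, index_col=0, parse_dates=True))
pd.concat(l, keys=asxList)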
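Note that concat with keys puts the code on the outer index level. If the DateTime level should come first, as the question asks, one possible follow-up is swaplevel plus sort_index. The sketch below uses small in-memory frames instead of the CSV files; the Close column, the dates, and the level names are made up for illustration.

```python
import pandas as pd

asxList = ["ANZ", "NAB", "WBC"]
dates = pd.date_range("2017-01-02", periods=2, freq="D")

# Hypothetical in-memory stand-ins for the per-code CSV files.
frames = [pd.DataFrame({"Close": [10.0 + i, 11.0 + i]}, index=dates)
          for i, _ in enumerate(asxList)]

# keys= creates the outer "code" level; swaplevel moves DateTime to the front.
combined = (pd.concat(frames, keys=asxList, names=["code", "DateTime"])
              .swaplevel()
              .sort_index())
print(combined)
```

sort_index orders the codes alphabetically within each date, which happens to match the order of asxList here.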

Creating an empty Pandas DataFrame column with a fixed first value then filling it with a formula

I'd like to create an empty column in an existing DataFrame and set only its first value to 100. After that I'd like to iterate and fill the rest of the column with a formula, like row[C][t-1] * (1 + row[B][t])
very similar to:
Creating an empty Pandas DataFrame, then filling it?
But the difference is fixing the first value of column 'C' to 100 vs entirely formulas.
import datetime
import pandas as pd
import numpy as np
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')
columns = ['A','B','C']
df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0)
data = np.array([np.arange(10)]*3).T
df = pd.DataFrame(data, index=index, columns=columns)
df['B'] = df['A'].pct_change()
df['C'] = df['C'].shift() * (1+df['B'])
## how do I set 2016-10-03 in Column 'C' to equal 100 and then calculate consecutively from there?
df
Try this. Unfortunately, something similar to a for loop is likely needed, because each row is calculated from the prior row's value, which has to be saved to a variable as you move down the rows (c_column in my example):
c_column = []
c_column.append(100)
for x, i in enumerate(df['B']):
    if x > 0:
        c_column.append(c_column[x-1] * (1+i))
df['C'] = c_column
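For this particular formula, where each value is the previous one times (1 + B), a cumulative product may be an alternative to the loop. This is only a sketch on made-up numbers (the values in column A and the name growth are illustrative), and it assumes the first row's growth factor is treated as 1 so that C starts at 100.

```python
import pandas as pd

# Made-up data: A grows by 10% per row, so B is 0.1 after the first row.
df = pd.DataFrame({"A": [10.0, 11.0, 12.1, 13.31]})
df["B"] = df["A"].pct_change()

# Growth factor per row; the first row has no prior value, so use 1.
growth = 1 + df["B"]
growth.iloc[0] = 1.0

# C[t] = C[t-1] * (1 + B[t]) with C[0] = 100 is 100 times the running product.
df["C"] = 100 * growth.cumprod()
print(df)
```

A formula that is not a pure running product (for example, one with an additive term) would still need the loop.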

Pandas Re-indexing command

Re: "Add missing dates to pandas dataframe", a previously asked question.
import pandas as pd
import numpy as np
idx = pd.date_range('09-01-2013', '09-30-2013')
df = pd.DataFrame(data = [2,10,5,1], index = ["09-02-2013","09-03-2013","09-06-2013","09-07-2013"], columns = ["Events"])
df.index = pd.DatetimeIndex(df.index); #question (1)
df = df.reindex(idx, fill_value=np.nan)
print(df)
In the above script what does the command noted as question one do? If you leave this
command out of the script, the df will be re-indexed but the data portion of the
original df will not be retained. As there is no reference to the df data in the
DatetimeIndex command, why is the data from the starting df lost?
Short answer: df.index = pd.DatetimeIndex(df.index); converts the string index of df to a DatetimeIndex.
You have to make the distinction between different types of indexes. In
df = pd.DataFrame(data = [2,10,5,1], index = ["09-02-2013","09-03-2013","09-06-2013","09-07-2013"], columns = ["Events"])
you have an index containing strings. When using
df.index = pd.DatetimeIndex(df.index);
you convert this standard index with strings to an index with datetimes (a DatetimeIndex). So the values of these two types of indexes are completely different.
Now, you reindex with
idx = pd.date_range('09-01-2013', '09-30-2013')
df = df.reindex(idx)
where idx is also an index of datetimes. When you reindex the original df with a string index, there are no matching index values, so no column values of the original df are retained. When you reindex the second df (after converting the index to a datetime index), there are matching index values, so the column values at those index labels are retained.
See also http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.reindex.html
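The two cases can be compared side by side. This sketch reuses the data from the question; the names lost and kept are only for illustration.

```python
import pandas as pd

idx = pd.date_range("09-01-2013", "09-30-2013")
df = pd.DataFrame({"Events": [2, 10, 5, 1]},
                  index=["09-02-2013", "09-03-2013", "09-06-2013", "09-07-2013"])

# String labels never equal the Timestamp labels in idx, so nothing matches.
lost = df.reindex(idx)

# After the conversion the labels are Timestamps and line up with idx.
df.index = pd.DatetimeIndex(df.index)
kept = df.reindex(idx)

print(lost["Events"].notna().sum())
print(kept["Events"].notna().sum())
```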
