How to reshape dataframe with multi year data in Python - python

I believe my question can be solved with a loop but I haven't been able to create such. I have a data sample which looks like this
sample data
And I would like to have dataframe that would be organised by the year:
result data
I tried pivot-function by creating a year column with df['year'] = df.index.year and then reshaping with pivot but it will populate only the first year column because of the index.
I have managed to do this type of reshaping manually but with several years of data it is time consuming solution. Here is the example code for manual solution:
mydata = pd.DataFrame()
mydata2 = pd.DataFrame()
mydata3 = pd.DataFrame()
mydata1['1'] = df['data'].iloc[160:664]
mydata2['2'] = df['data'].iloc[2769:3273]
mydata3['3'] = df['data'].iloc[5583:6087]
mydata1.reset_index(drop=True, inplace=True)
mydata2.reset_index(drop=True, inplace=True)
mydata3.reset_index(drop=True, inplace=True)
mydata = pd.concat([mydata1, mydata2, mydata3],axis=1, ignore_index=True)
mydata.columns = ['78','88','00','05']

Welcome to StackOverflow! I think I understood what you were asking for from your question, but please correct me if I'm wrong. Basically, you want to reshape your current pandas.DataFrame using a pivot. I set up a sample dataset and solved the problem in the following way:
import pandas as pd
#test set
df = pd.DataFrame({'Index':['2.1.2000','3.1.2000','3.1.2001','4.1.2001','3.1.2002','4.1.2002'],
'Value':[100,101,110,111,105,104]})
#create a year column for yourself
#by splitting on '.' and selecting year element.
df['Year'] = df['Index'].str.split('.', expand=True)[2]
#pivot your table
pivot = pd.pivot_table(df, index=df.index, columns='Year', values='Value')
#now, in my pivoted test set there should be unwanted null values showing up so
#we can apply another function that drops null values in each column without losing values in other columns
pivot = pivot.apply(lambda x: pd.Series(x.dropna().values))
Result on my end
| Year | 2000 | 2001 | 2002 |
|------|------|------|------|
| 0 | 100 | 110 | 105 |
| 1 | 101 | 111 | 104 |
Hope this solves your problem!

Related

Merging dataframes with pandas with two keys

I have two datasets, one with individual reports and one with regional conditions. There are many more individual rows than regional, but I want to append the regional data onto each individual. The problem I am facing is that I must merge using two primary keys, e.g.
Individual - 5000 rows
Code | Time | Data1 | Data2 | Data3
Regional - 100 rows
Code | Time | RData1 | RData2
--I have attemped and failed using:
df = individual.merge(regional, how='left', on=['Code', 'Time'])
--Which leaves RData1,2 as null values in the new df, which does, to its credit look like
df - 5000 rows
Code | Time | Data1 | Data2 | Data3 | RData1 | RData2
but the null values don't help me...
Example Data
What I am seeing
Data
Generate random df
rng = pd.date_range('2015-02-24', periods=5, freq='T')
df = pd.DataFrame({ 'Time': rng, 'data1': np.random.randn(len(rng)),'code':[201, 897,345, 70,879] })
df.set_index(['Time','code'], inplace=True)
df
Generate random df1
df1 = pd.DataFrame({ 'Time': rng, 'data1': np.random.randn(len(rng)),'code':[201, 30,345, 70,879] })
df1.set_index(['Time','code'], inplace=True)
df1
merge on indexes can be done as follows
result =df1.merge(df, left_index=True, right_index=True, suffixes=('_Left','_Right'))
result
Or better
result =pd.merge(df, df1,left_index=True, right_index=True, suffixes=('_Left','_Right'))
result

Using regex to create new column in dataframe

I have a dataframe and in one of its columns i need to pull out specific text and place it into its own column. From the dataframe below i need to take elements of the LAUNCH column and add that into its own column next to it, specifically i need to extract the date in the rows which provide it, for example 'Mar-24'.
df =
|LAUNCH
0|Step-up Mar-24:x1.5
1|unknown
2|NTV:62.1%
3|Step-up Aug-23:N/A,
I would like the output to be something like this:
df =
|LAUNCH |DATE
0|Step-up Mar-24:x1.5 | Mar-24
1|unknown | nan
2|NTV:62.1% | nan
3|Step-up Aug-23:N/A, | Aug-23
And if this can be done, would it also be possible to display the date as something like 24-03-01 (yyyy-mm-dd) rather than Mar-24.
One way is to use str.extract, looking for any match on day of the month:
months = (pd.to_datetime(pd.Series([*range(1,12)]), format='%m')
.dt.month_name()
.str[:3]
.values.tolist())
pat = rf"((?:{'|'.join(months)})-\d+)"
# '((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov)-\\d+)'
df['DATE '] = df.LAUNCH.str.extract(pat)
print(df)
LAUNCH DATE
0 Step-up Mar-24:x1.5 Mar-24
1 unknown NaN
2 NTV:62.1% NaN
3 Step-up Aug-23:N/A Aug-23
Use str.extract with a named capturing group.
The code to add a new column with the extracting result can be e.g.:
df = pd.concat([df, df.LAUNCH.str.extract(
r'(?P<DATE>(?:Jan|Feb|Ma[ry]|Apr|Ju[nl]|Aug|Sep|Oct|Nov|Dec)-\d{2})')],
axis=1, sort=False)
The result, for your data, is:
LAUNCH DATE
0 Step-up Mar-24:x1.5 Mar-24
1 unknown NaN
2 NTV:62.1% NaN
3 Step-up Aug-23:N/A, Aug-23

DataFrame index is clearly a DateTimeIndex but .plot() can't graph it properly?

I made a very simple two-column excel file with date and random float to test some other stuff. I read_excel() with parse_dates=True on the first column, and verified type(df.index) = .
When I go to .plot(), the x-axis of the graph is a bunch of random years like 2016 then 1987 then 1678...and none of my values show up of course (y-axis is correct though).
What did I miss?
The xlsx:
The column A is formatted as Short-Date in excel; B is just General
_____| A | B |
0 | 2019-01-01 12.87
1 | 2019-01-02 15.20
..
90 | 2019-03-31 10.12
The code is:
fakedf = pd.read_excel('sampledata.xlsx', index_col=0, parse_dates=True, header=None)
fakedf.index.name = 'Date'
fakedf.columns = ['Price']
fakedf.head()
From there there is a decent plot but when I do this:
df = pd.DataFrame({'Date': [np.nan], 'Price': [np.nan]})
df.index = df.Date
del df['Date']
for x in range(len(fakedf)):
df = df.append(fakedf.iloc[x])
df.plot()
It messes up. I had the for loop there because I was trying to test something relating to time.sleep()

How to add new row in pandas dataframe? [duplicate]

I have an existing dataframe which I need to add an additional column to which will contain the same value for every row.
Existing df:
Date, Open, High, Low, Close
01-01-2015, 565, 600, 400, 450
New df:
Name, Date, Open, High, Low, Close
abc, 01-01-2015, 565, 600, 400, 450
I know how to append an existing series / dataframe column. But this is a different situation, because all I need is to add the 'Name' column and set every row to the same value, in this case 'abc'.
df['Name']='abc' will add the new column and set all rows to that value:
In [79]:
df
Out[79]:
Date, Open, High, Low, Close
0 01-01-2015, 565, 600, 400, 450
In [80]:
df['Name'] = 'abc'
df
Out[80]:
Date, Open, High, Low, Close Name
0 01-01-2015, 565, 600, 400, 450 abc
You can use insert to specify where you want to new column to be. In this case, I use 0 to place the new column at the left.
df.insert(0, 'Name', 'abc')
Name Date Open High Low Close
0 abc 01-01-2015 565 600 400 450
Summing up what the others have suggested, and adding a third way
You can:
assign(**kwargs):
df.assign(Name='abc')
access the new column series (it will be created) and set it:
df['Name'] = 'abc'
insert(loc, column, value, allow_duplicates=False)
df.insert(0, 'Name', 'abc')
where the argument loc ( 0 <= loc <= len(columns) ) allows you to insert the column where you want.
'loc' gives you the index that your column will be at after the insertion. For example, the code above inserts the column Name as the 0-th column, i.e. it will be inserted before the first column, becoming the new first column. (Indexing starts from 0).
All these methods allow you to add a new column from a Series as well (just substitute the 'abc' default argument above with the series).
Single liner works
df['Name'] = 'abc'
Creates a Name column and sets all rows to abc value
I want to draw more attention to a portion of #michele-piccolini's answer.
I strongly believe that .assign is the best solution here. In the real world, these operations are not in isolation, but in a chain of operations. And if you want to support a chain of operations, you should probably use the .assign method.
Here is an example using snowfall data at a ski resort (but the same principles would apply to say ... financial data).
This code reads like a recipe of steps. Both assignment (with =) and .insert make this much harder:
raw = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/alta-noaa-1980-2019.csv',
parse_dates=['DATE'])
def clean_alta(df):
return (df
.loc[:, ['STATION', 'NAME', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'DATE',
'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS']]
.groupby(pd.Grouper(key='DATE', freq='W'))
.agg({'PRCP': 'sum', 'TMAX': 'max', 'TMIN': 'min', 'SNOW': 'sum', 'SNWD': 'mean'})
.assign(LOCATION='Alta',
T_RANGE=lambda w_df: w_df.TMAX-w_df.TMIN)
)
clean_alta(raw)
Notice the line .assign(LOCATION='Alta', that creates a column with a single value in the middle of the rest of the operations.
One Line did the job for me.
df['New Column'] = 'Constant Value'
df['New Column'] = 123
You can Simply do the following:
df['New Col'] = pd.Series(["abc" for x in range(len(df.index))])
This single line will work.
df['name'] = 'abc'
The append method has been deprecated since Pandas 1.4.0
So instead use the above method only if using actual pandas DataFrame object:
df["column"] = "value"
Or, if setting value on a view of a copy of a DataFrame, use concat() or assign():
This way the new Series created has the same index as original DataFrame, and so will match on exact rows
# adds a new column in view `where_there_is_one` named
# `client` with value `display_name`
# `df` remains unchanged
df = pd.DataFrame({"number": ([1]*5 + [0]*5 )})
where_there_is_one = df[ df["number"] == 1]
where_there_is_one = pd.concat([
where_there_is_one,
pd.Series(["display_name"]*df.shape[0],
index=df.index,
name="client")
],
join="inner", axis=1)
# Or use assign
where_there_is_one = where_there_is_one.assign(client = "display_name")
Output:
where_there_is_one: df:
| 0 | number | client | | 0 | number |
| --- | --- | --- | |---| -------|
| 0 | 1 | display_name | | 0 | 1 |
| 1 | 1 | display_name | | 1 | 1 |
| 2 | 1 | display_name | | 2 | 1 |
| 3 | 1 | display_name | | 3 | 1 |
| 4 | 1 | display_name | | 4 | 1 |
| 5 | 0 |
| 6 | 0 |
| 7 | 0 |
| 8 | 0 |
| 9 | 0 |
Ok, all, I have a similar situation here but if i take this code to use: df['Name']='abc'
instead 'abc' the name for the new column I want to take from somewhere else in the csv file.
As you can see from the picture, df is not cleaned yet but I want to create 2 columns with the name "ADI dms rivoli" which will continue for every row, and the same for the "December 2019". Hope it is clear for you to understand, it was hard to explaine, sorry.

Pandas: Storing Dataframe in Dataframe

I am rather new to Pandas and am currently running into a problem when trying to insert a Dataframe inside a Dataframe.
What I want to do:
I have multiple simulations and corresponding signal files and I want all of them in one big DataFrame. So I want a DataFrame which has all my simulation parameters and also my signals as an nested DataFrame. It should look something like this:
SimName | Date | Parameter 1 | Parameter 2 | Signal 1 | Signal 2 |
Name 1 | 123 | XYZ | XYZ | DataFrame | DataFrame |
Name 2 | 456 | XYZ | XYZ | DataFrame | DataFrame |
Where SimName is my Index for the big DataFrame and every entry in Signal 1 and Signal 2 is an individuall DataFrame.
My idea was to implement this like this:
big_DataFrame['Signal 1'].loc['Name 1']
But this results in an ValueError:
Incompatible indexer with DataFrame
Is it possible to have this nested DataFrames in Pandas?
Nico
The 'pointers' referred to at the end of ns63sr's answer could be implemented as a class, e.g...
Definition:
class df_holder:
def __init__(self, df):
self.df = df
Set:
df.loc[0,'df_holder'] = df_holder(df)
Get:
df.loc[0].df_holder.df
the docs say that only Series can be within a DataFrame. However, passing DataFrames seems to work as well. Here is an exaple assuming that none of the columns is in MultiIndex:
import pandas as pd
signal_df = pd.DataFrame({'X': [1,2,3],
'Y': [10,20,30]} )
big_df = pd.DataFrame({'SimName': ['Name 1','Name 2'],
'Date ':[123 , 456 ],
'Parameter 1':['XYZ', 'XYZ'],
'Parameter 2':['XYZ', 'XYZ'],
'Signal 1':[signal_df, signal_df],
'Signal 2':[signal_df, signal_df]} )
big_df.loc[0,'Signal 1']
big_df.loc[0,'Signal 1'][X]
This results in:
out1: X Y
0 1 10
1 2 20
2 3 30
out2: 0 1
1 2
2 3
Name: X, dtype: int64
In case nested dataframes are not properly working, you may implement some sort of pointers that you store in big_df that allow you to access the signal dataframes stored elsewhere.
Instead of big_DataFrame['Signal 1'].loc['Name 1'] you should use
big_DataFrame.loc['Name 1','Signal 1']

Categories

Resources