I have a pandas dataframe with these columns (the important part is that I have every month from 1996-04 to 2016-08):
Index(['RegionID', 'RegionName', 'State', 'Metro', 'CountyName', 'SizeRank',
'1996-04', '1996-05', '1996-06', '1996-07',
...
'2015-11', '2015-12', '2016-01', '2016-02', '2016-03', '2016-04',
'2016-05', '2016-06', '2016-07', '2016-08'],
dtype='object', length=251)
I need to group the columns by three to represent financial quarters, e.g.:
| 1998-01 | 1998-02 | 1998-03 |
| 2 | 4 | 7 |
Needs to become
| 1998q1 |
|avg(2,4,7)|
Any hint about the right approach?
First move all the non-date columns into the index, convert the remaining column labels to quarterly periods, and aggregate across columns with mean:
df = df.set_index(['RegionID', 'RegionName', 'State', 'Metro', 'CountyName', 'SizeRank'])
# month labels -> quarter labels like '1998q1'
df.columns = pd.to_datetime(df.columns).to_period('Q').strftime('%Yq%q')
# columns sharing the same quarter label are averaged together
df = df.groupby(level=0, axis=1).mean().reset_index()
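A minimal runnable sketch of the same steps on a toy frame (the column names and values here are made up for illustration):
import pandas as pd

df = pd.DataFrame({'RegionID': [1, 2],
                   'RegionName': ['A', 'B'],
                   '1998-01': [2, 1],
                   '1998-02': [4, 3],
                   '1998-03': [7, 5]})

df = df.set_index(['RegionID', 'RegionName'])
# month labels -> quarter labels like '1998q1'
df.columns = pd.to_datetime(df.columns).to_period('Q').strftime('%Yq%q')
df = df.groupby(level=0, axis=1).mean().reset_index()
#    RegionID RegionName    1998q1
# 0         1          A  4.333333
# 1         2          B  3.000000
(On newer pandas, where groupby(..., axis=1) is deprecated, the transpose route df.T.groupby(level=0).mean().T is equivalent.)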
More adventures in dataframes :)
So, I pretty much have all the basics down; however, this one is stumping me. I have two dataframes (screenshots below). The first (techIndicator) has a ton of columns and rows, all filled properly. The second dataframe (social) has multiple columns, but only one row.
I need to add the columns (working as per the screenshots), but I want to duplicate the social dataframe's row all the way down to "fill in the NaNs".
Here is the code that I'm using to concatenate all of the dataframes into a single one (all work except for social):
techIndicator = pd.concat([inter_day, macd, rsi, ema, vwap, adx, dmi, social], axis = 1)
techIndicator.sort_index(ascending=False, inplace=True)
techIndicator.dropna()
techIndicator.reset_index(drop=True)
As per the screenshots below, the first three rows should look like:
datetime1 | 1 | 2 | 3 | 4 | ......| 9 | 8| 7 | 6
datetime2 | 2 | 1 | 4 | 3 | ......| 9 | 8| 7 | 6
datetime3 | 3 | 4 | 1 | 2 | ......| 9 | 8| 7 | 6
Instead, the concat above adds the columns but deletes the values (I already checked the data types; they're all float64).
Please help =) My google-fu isn't working for this >.<
With help from Alex below, I was able to solve the many issues that I was having!
dfTemp = pd.concat([inter_day, macd, rsi, ema, vwap, adx, dmi], axis = 1)
dfTemp.sort_index(ascending=False, inplace=True)
dfTemp.dropna(inplace=True)
dfTemp.reset_index(inplace = True)
long_social = social
for a in range(dfTemp.shape[0] - 1):
    long_social = pd.concat([long_social, social])
long_social.reset_index(inplace = True)
long_social.drop(columns = ['index'], inplace = True)
techIndicator = pd.concat([dfTemp, long_social], axis = 1)
techIndicator.rename(columns={'date': 'Date',
'1. open': 'Open',
'2. high': 'High',
'3. low': 'Low',
'4. close': 'Close',
'5. volume': 'Volume',
'DX': 'DMI'}, inplace=True)
techIndicator.dropna(inplace=True)
techIndicator.reset_index(drop=True, inplace=True)
techIndicator.set_index('Date', inplace=True)
techIndicator.sort_index(ascending=False, inplace=True)
So I have a solution for you which does not use concat to add the columns to your main dataset, but it gets the job done.
The flow: since the two dataframes are not the same length, we first make them the same length, then just loop through and add the columns by name.
# start from your social_df and append to it until it is as long as your main df
social_long = social_df
for a in range(main_df.shape[0] - 1):
    social_long = pd.concat([social_long, social_df])
# now you have a social_long with the same length as your main df
social_long.reset_index(drop=True, inplace=True)
social_vars = list(social_df.columns)  # get the column names from social_df for naming them as we add to the main df
for var in social_vars:
    main_df[var] = list(social_long[var])  # create a new column with the desired name and fill it with the social info as a list
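For what it's worth, since social has exactly one row, the loop can be skipped entirely by broadcasting each scalar value (a sketch, using the same main_df/social names as above):
for col in social.columns:
    main_df[col] = social[col].iloc[0]  # assigning a scalar fills every row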
I believe my question can be solved with a loop, but I haven't been able to write one. I have a data sample which looks like this:
sample data
And I would like to have dataframe that would be organised by the year:
result data
I tried the pivot function by creating a year column with df['year'] = df.index.year and then reshaping with pivot, but it only populates the first year column because of the index.
I have managed to do this type of reshaping manually, but with several years of data it is a time-consuming solution. Here is example code for the manual solution:
mydata1 = pd.DataFrame()
mydata2 = pd.DataFrame()
mydata3 = pd.DataFrame()
mydata1['1'] = df['data'].iloc[160:664]
mydata2['2'] = df['data'].iloc[2769:3273]
mydata3['3'] = df['data'].iloc[5583:6087]
mydata1.reset_index(drop=True, inplace=True)
mydata2.reset_index(drop=True, inplace=True)
mydata3.reset_index(drop=True, inplace=True)
mydata = pd.concat([mydata1, mydata2, mydata3], axis=1, ignore_index=True)
mydata.columns = ['78', '88', '00']
Welcome to StackOverflow! I think I understood what you were asking for from your question, but please correct me if I'm wrong. Basically, you want to reshape your current pandas.DataFrame using a pivot. I set up a sample dataset and solved the problem in the following way:
import pandas as pd
#test set
df = pd.DataFrame({'Index':['2.1.2000','3.1.2000','3.1.2001','4.1.2001','3.1.2002','4.1.2002'],
'Value':[100,101,110,111,105,104]})
#create a year column for yourself
#by splitting on '.' and selecting year element.
df['Year'] = df['Index'].str.split('.', expand=True)[2]
#pivot your table
pivot = pd.pivot_table(df, index=df.index, columns='Year', values='Value')
#now, in my pivoted test set there should be unwanted null values showing up so
#we can apply another function that drops null values in each column without losing values in other columns
pivot = pivot.apply(lambda x: pd.Series(x.dropna().values))
Result on my end:
| Year | 2000 | 2001 | 2002 |
|------|------|------|------|
| 0 | 100 | 110 | 105 |
| 1 | 101 | 111 | 104 |
Hope this solves your problem!
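An equivalent route, by the way (assuming the same test frame), is to build an explicit within-year counter and pivot on that, which avoids the NaN cleanup step entirely:
df['n'] = df.groupby('Year').cumcount()  # 0, 1, ... within each year
pivot = df.pivot(index='n', columns='Year', values='Value')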
I have an existing dataframe to which I need to add an additional column that will contain the same value for every row.
Existing df:
Date, Open, High, Low, Close
01-01-2015, 565, 600, 400, 450
New df:
Name, Date, Open, High, Low, Close
abc, 01-01-2015, 565, 600, 400, 450
I know how to append an existing series / dataframe column, but this is a different situation, because all I need is to add the 'Name' column and set every row to the same value, in this case 'abc'.
df['Name']='abc' will add the new column and set all rows to that value:
In [79]:
df
Out[79]:
Date, Open, High, Low, Close
0 01-01-2015, 565, 600, 400, 450
In [80]:
df['Name'] = 'abc'
df
Out[80]:
Date, Open, High, Low, Close Name
0 01-01-2015, 565, 600, 400, 450 abc
You can use insert to specify where you want the new column to be. In this case, I use 0 to place the new column at the left.
df.insert(0, 'Name', 'abc')
Name Date Open High Low Close
0 abc 01-01-2015 565 600 400 450
Summing up what the others have suggested, and adding a third way
You can:
assign(**kwargs):
df.assign(Name='abc')
access the new column series (it will be created) and set it:
df['Name'] = 'abc'
insert(loc, column, value, allow_duplicates=False)
df.insert(0, 'Name', 'abc')
where the argument loc ( 0 <= loc <= len(columns) ) allows you to insert the column where you want.
'loc' gives you the index that your column will be at after the insertion. For example, the code above inserts the column Name as the 0-th column, i.e. it will be inserted before the first column, becoming the new first column. (Indexing starts from 0).
All these methods allow you to add a new column from a Series as well (just substitute the 'abc' default argument above with the series).
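For instance (a small illustrative sketch; the values are made up), a Series fills by index alignment rather than broadcasting one value:
import pandas as pd

df = pd.DataFrame({'Open': [565, 570]})
names = pd.Series(['abc', 'xyz'], index=df.index)  # same index as df
df = df.assign(Name=names)  # or df['Name'] = names
#    Open Name
# 0   565  abc
# 1   570  xyz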
A one-liner works:
df['Name'] = 'abc'
It creates a Name column and sets every row to the value abc.
I want to draw more attention to a portion of @michele-piccolini's answer.
I strongly believe that .assign is the best solution here. In the real world, these operations are not in isolation, but in a chain of operations. And if you want to support a chain of operations, you should probably use the .assign method.
Here is an example using snowfall data at a ski resort (but the same principles would apply to say ... financial data).
This code reads like a recipe of steps. Both assignment (with =) and .insert make this much harder:
raw = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/alta-noaa-1980-2019.csv',
parse_dates=['DATE'])
def clean_alta(df):
return (df
.loc[:, ['STATION', 'NAME', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'DATE',
'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS']]
.groupby(pd.Grouper(key='DATE', freq='W'))
.agg({'PRCP': 'sum', 'TMAX': 'max', 'TMIN': 'min', 'SNOW': 'sum', 'SNWD': 'mean'})
.assign(LOCATION='Alta',
T_RANGE=lambda w_df: w_df.TMAX-w_df.TMIN)
)
clean_alta(raw)
Notice the line .assign(LOCATION='Alta', that creates a column with a single value in the middle of the rest of the operations.
One line did the job for me:
df['New Column'] = 'Constant Value'
df['New Column'] = 123
You can simply do the following:
df['New Col'] = pd.Series(["abc" for x in range(len(df.index))])
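One caveat with this construction: the new Series is built with a default 0..n-1 index, so if df has any other index (dates, say) the values will not align and the column will be all NaN. Passing the frame's own index sidesteps that (a small tweak of the same idea):
df['New Col'] = pd.Series(["abc"] * len(df), index=df.index)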
This single line will work.
df['name'] = 'abc'
The append method has been deprecated since pandas 1.4.0, so don't reach for it here. If you are setting the value on an actual pandas DataFrame object, plain assignment works:
df["column"] = "value"
But if you are setting the value on a view or copy of a DataFrame, use concat() or assign() instead.
That way the new Series has the same index as the original DataFrame, and so will match on exact rows:
# adds a new column in view `where_there_is_one` named
# `client` with value `display_name`
# `df` remains unchanged
df = pd.DataFrame({"number": ([1]*5 + [0]*5 )})
where_there_is_one = df[ df["number"] == 1]
where_there_is_one = pd.concat([
where_there_is_one,
pd.Series(["display_name"]*df.shape[0],
index=df.index,
name="client")
],
join="inner", axis=1)
# Or use assign
where_there_is_one = where_there_is_one.assign(client = "display_name")
Output:
where_there_is_one:
|   | number | client       |
|---|--------|--------------|
| 0 | 1      | display_name |
| 1 | 1      | display_name |
| 2 | 1      | display_name |
| 3 | 1      | display_name |
| 4 | 1      | display_name |
df:
|   | number |
|---|--------|
| 0 | 1      |
| 1 | 1      |
| 2 | 1      |
| 3 | 1      |
| 4 | 1      |
| 5 | 0      |
| 6 | 0      |
| 7 | 0      |
| 8 | 0      |
| 9 | 0      |
Ok, all, I have a similar situation here, but if I use this code: df['Name']='abc'
then instead of 'abc', I want to take the name for the new column from somewhere else in the csv file.
As you can see from the picture, df is not cleaned yet, but I want to create 2 columns: one with the name "ADI dms rivoli", repeated for every row, and the same for "December 2019". Hope it is clear enough to understand; it was hard to explain, sorry.
I have a data frame that contains the following columns:
ID Scheduled Date
241 10/9/2018
423 9/25/2018
126 9/30/2018
123 8/13/2018
132 8/16/2018
143 10/6/2018
I want to count the total number of IDs by week. Specifically, I want the week to always start on Monday and always end on Sunday.
I achieved this in Jupyter Notebook already:
weekly_count_output = df.resample('W-Mon', on='Scheduled Date', label='left', closed='left').sum().query('count_row > 0')
weekly_count_output = weekly_count_output.reset_index()
weekly_count_output = weekly_count_output[['Scheduled Date', 'count_row']]
weekly_count_output = weekly_count_output.rename(columns = {'count_row': 'Total Count'})
But I don't know how to write the above code in PySpark syntax. I want my resulting output to look like this:
Scheduled Date Total Count
8/13/2018 2
9/24/2018 2
10/1/2018 1
10/8/2018 1
Please note the Scheduled Date is always a Monday (indicating beginning of week) and the total count goes from Monday to Sunday of that week.
Thanks to Get Last Monday in Spark for defining the function previous_day.
First, import:
from pyspark.sql.functions import *
from datetime import datetime
Assuming your input data is as in my df (DataFrame):
cols = ['id', 'scheduled_date']
vals = [
(241, '10/09/2018'),
(423, '09/25/2018'),
(126, '09/30/2018'),
(123, '08/13/2018'),
(132, '08/16/2018'),
(143, '10/06/2018')
]
df = spark.createDataFrame(vals, cols)
This is the function defined:
def previous_day(date, dayOfWeek):
    # next_day finds the first dayOfWeek strictly after `date`, so stepping
    # back 7 days gives the most recent dayOfWeek on or before `date`
    return date_sub(next_day(date, dayOfWeek), 7)
# Converting the string column to timestamp.
df = df.withColumn('scheduled_date', date_format(unix_timestamp('scheduled_date', 'MM/dd/yyyy') \
    .cast('timestamp'), 'yyyy-MM-dd'))
df.show()
+---+--------------+
| id|scheduled_date|
+---+--------------+
|241| 2018-10-09|
|423| 2018-09-25|
|126| 2018-09-30|
|123| 2018-08-13|
|132| 2018-08-16|
|143| 2018-10-06|
+---+--------------+
# Returns the Monday of the week each date falls in
df_mon = df.withColumn("scheduled_date", previous_day('scheduled_date', 'monday'))
df_mon.show()
+---+--------------+
| id|scheduled_date|
+---+--------------+
|241| 2018-10-08|
|423| 2018-09-24|
|126| 2018-09-24|
|123| 2018-08-13|
|132| 2018-08-13|
|143| 2018-10-01|
+---+--------------+
# You can groupBy and do agg count of 'id'.
df_mon_grp = df_mon.groupBy('scheduled_date').agg(count('id')).orderBy('scheduled_date')
# Reformatting to match your resulting output.
df_mon_grp = df_mon_grp.withColumn('scheduled_date', date_format(unix_timestamp('scheduled_date', "yyyy-MM-dd") \
.cast('timestamp'), 'MM/dd/yyyy'))
df_mon_grp.show()
+--------------+---------+
|scheduled_date|count(id)|
+--------------+---------+
| 08/13/2018| 2|
| 09/24/2018| 2|
| 10/01/2018| 1|
| 10/08/2018| 1|
+--------------+---------+
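If you also want the header to read Total Count as in the desired output, an alias on the aggregate should do it (a sketch against the same df_mon as above):
df_mon_grp = df_mon.groupBy('scheduled_date') \
    .agg(count('id').alias('Total Count')) \
    .orderBy('scheduled_date')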
I have 3 dataframes for yearly data (one for 2014, 2015 and 2016), each having 3 columns named, 'PRACTICE', 'BNF NAME', 'ITEMS'.
BNF NAME refers to drug names, and I am picking out three: Ampicillin, Amoxicillin, and Co-Amoxiclav. This column has different strengths/dosages (e.g. Co-Amoxiclav 200mg or Co-Amoxiclav 300mg, etc.) that I want to ignore, so I have used str.contains() to select these 3 drugs.
ITEMS is the total number of prescriptions written for each drug.
I want to create a stacked bar chart with the x axis being year (2014, 2015, 2016) and the y axis being total number of prescriptions, with each of the 3 bars split up by the 3 drug names.
I am assuming I need to use df.groupby(), and maybe select a partial string; however, I am unsure how to combine the yearly data and then how to group it to create the stacked bar chart.
Any guidance would be much appreciated.
This is the line of code I am using to select the rows for the 3 drug names only.
frame=frame[frame['BNF NAME'].str.contains('Ampicillin' and 'Amoxicillin' and 'Co-Amoxiclav')]
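(One thing to watch in that line: 'Ampicillin' and 'Amoxicillin' and 'Co-Amoxiclav' is ordinary Python and between non-empty strings, which evaluates to just the last string, so this filter only matches Co-Amoxiclav rows. A regex alternation does what was intended:
frame = frame[frame['BNF NAME'].str.contains('Ampicillin|Amoxicillin|Co-Amoxiclav')]
)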
This is what each of the dataframes resembles:
PRACTICE | BNF NAME | ITEMS
Y00327 | Co-Amoxiclav_Tab 250mg/125mg | 23
Y00327 | Co-Amoxiclav_Susp 125mg/31mg/5ml S/F | 10
Y00327 | Co-Amoxiclav_Susp 250mg/62mg/5ml S/F | 6
Y00327 | Co-Amoxiclav_Susp 250mg/62mg/5ml | 1
Y00327 | Co-Amoxiclav_Tab 500mg/125mg | 50
There are likely going to be a few different ways in which you could accomplish this. Here's how I would do it. I'm using a jupyter notebook, so your matplotlib imports may be different.
import pandas as pd
%matplotlib
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
df = pd.DataFrame({'PRACTICE': ['Y00327', 'Y00327', 'Y00327', 'Y00327', 'Y00327'],
'BNF NAME': ['Co-Amoxiclav_Tab 250mg/125mg', 'Co-Amoxiclav_Susp 125mg/31mg/5ml S/F',
'Co-Amoxiclav_Susp 250mg/62mg/5ml S/F', 'Ampicillin 250mg/62mg/5ml',
'Amoxicillin_Tab 500mg/125mg'],
'ITEMS': [23, 10, 6, 1, 50]})
Out[52]:
BNF NAME ITEMS PRACTICE
0 Co-Amoxiclav_Tab 250mg/125mg 23 Y00327
1 Co-Amoxiclav_Susp 125mg/31mg/5ml S/F 10 Y00327
2 Co-Amoxiclav_Susp 250mg/62mg/5ml S/F 6 Y00327
3 Ampicillin 250mg/62mg/5ml 1 Y00327
4 Amoxicillin_Tab 500mg/125mg 50 Y00327
To simulate your three dataframes:
df1 = df.copy()
df2 = df.copy()
df3 = df.copy()
Set a column indicating what year the dataframe represents.
df1['YEAR'] = 2014
df2['YEAR'] = 2015
df3['YEAR'] = 2016
Combining the three dataframes:
combined_df = pd.concat([df1, df2, df3], ignore_index=True)
To set what drug each row represents:
combined_df['parsed_drug_name'] = "" # creates a blank column
amp_bool = combined_df['BNF NAME'].str.contains('Ampicillin', case=False)
combined_df.loc[amp_bool, 'parsed_drug_name'] = 'Ampicillin' # sets the row to Ampicillin if BNF NAME contains 'ampicillin'
amox_bool = combined_df['BNF NAME'].str.contains('Amoxicillin', case=False)
combined_df.loc[amox_bool, 'parsed_drug_name'] = 'Amoxicillin'
co_amox_bool = combined_df['BNF NAME'].str.contains('Co-Amoxiclav', case=False)
combined_df.loc[co_amox_bool, 'parsed_drug_name'] = 'Co-Amoxiclav'
Finally, perform a pivot on the data, and plot the results:
combined_df.pivot_table(index='YEAR', columns='parsed_drug_name', values='ITEMS', aggfunc='sum').plot.bar(rot=0, stacked=True)
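For reference, with the simulated frames above the intermediate pivot table should look like this (Co-Amoxiclav sums 23 + 10 + 6 = 39 per year), and plot.bar(rot=0, stacked=True) then draws one stacked bar per year:
parsed_drug_name  Amoxicillin  Ampicillin  Co-Amoxiclav
YEAR
2014                       50           1            39
2015                       50           1            39
2016                       50           1            39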