Adding a column to pandas dataframe conditionally - python

I am working on a personal project collecting the data on Covid-19 cases. The data set only shows the total number of Covid-19 cases per state cumulatively. I would like to add a column that contains the new cases added that day. This is what I have so far:
import pandas as pd
from datetime import date
from datetime import timedelta
import numpy as np
#read the CSV from github
hist_US_State = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
#some code to get yesterday's date and the day before which is needed later.
today = date.today()
yesterday = today - timedelta(days = 1)
yesterday = str(yesterday)
day_before_yesterday = today - timedelta(days = 2)
day_before_yesterday = str(day_before_yesterday)
#Extracting yesterday's and the day before cases and combine them in one dataframe
yesterday_cases = hist_US_State[hist_US_State["date"] == yesterday]
day_before_yesterday_cases = hist_US_State[hist_US_State["date"] == day_before_yesterday]
total_cases = pd.DataFrame()
total_cases = day_before_yesterday_cases.append(yesterday_cases)
#Adding a new column called "new_cases" and this is where I get into trouble.
total_cases["new_cases"] = yesterday_cases["cases"] - day_before_yesterday_cases["cases"]
Can you please point out what I am doing wrong?

Because you defined total_cases as a concatenation (via append) of yesterday_cases and day_before_yesterday_cases, its number of rows is equal to the sum of the other two dataframes. It looks like yesterday_cases and day_before_yesterday_cases both have 55 rows, and so total_cases has 110 rows. Thus your last line is trying to assign 55 values to a series of 110 values.
You may either want to reshape your data so that each date is its own column, or work in arrays of dataframes.

Related

How to find sales in previous n months using groupby

I have a dataframe of daily sales:
import pandas as pd
date = ['28-01-2017','29-01-2017','30-01-2017','31-01-2017','01-02-2017','02-02-2017']
sales = [1,2,3,4,1,2]
ym = [201701,201701,201701,201701,201702,201702]
prev_1_ym = [201612,201612,201612,201612,201701,201701]
prev_2_ym = [201611,201611,201611,201611,201612,201612]
df_test = pd.DataFrame({'date': date,'ym':ym,'prev_1_ym':prev_1_ym,'prev_2_ym':prev_2_ym,'sales':sales})
df_test['date'] = pd.to_datetime(df_test['date'],format = '%d-%m-%Y')
I am trying to find total sales in the previous 1m, previous 2m etc..
My current approach is to use a list comprehension:
df_test[prev_1m_sales] = [ sum(df_test.loc[df_test['ym'] == x].sales) for x in df_test[prev_1_ym] ]
However, this proves to be very slow.
Is there a way to speed it up by using .groupby()?
you can use the date column to group your data, first change its data-type to pandas TimeStamps,
df['dates']=pd.to_datetime(df['dates'])
then you can use it directly in grouping for example
df.groupby(df.data.month).sales.sum().cumsum()

How do I make this function iterable (getting indexerror)

I am fairly new to python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (300~).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
from datetime import datetime
import urllib
import datetime
def get_data(issue_date, stock_ticker):
df = pd.read_csv (r'D:\Project\Data\Short_Interest\exampledata.csv')
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
d = df
df = pd.DataFrame(d)
short = df.loc[df.Symbol.eq(stock_ticker)]
# get the index of the row of interest
ix = short[short.Date.eq(issue_date)].index[0]
# get the item row for that row's index
iloc_ix = short.index.get_loc(ix)
# get the +/-1 iloc rows (+2 because that is how slices work), basically +1 and -1 trading days
short_data = short.iloc[iloc_ix-10: iloc_ix+11]
return [short_data]
I want to create a script that iterates a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as following:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row; iterate over the csv file and pass each row's values to the function and accumulate the resulting Seriesin a list.
I modified your function a bit: removed the DataFrame creation so it is only done once and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like they are day/month/year so I converted them for that format.
import pandas as pd
import datetime, csv
def get_data(df, issue_date, stock_ticker):
'''Return a Series for the ticker centered on the issue date.
'''
short = df.loc[df.Symbol.eq(stock_ticker)]
# get the index of the row of interest
try:
ix = short[short.Date.eq(issue_date)].index[0]
# get the item row for that row's index
iloc_ix = short.index.get_loc(ix)
# get the +/-1 iloc rows (+2 because that is how slices work), basically +1 and -1 trading days
short_data = short.iloc[iloc_ix-10: iloc_ix+11]
except IndexError:
msg = f'no data for {stock_ticker} on {issue_date}'
#log.info(msg)
print(msg)
short_data = None
return short_data
df = pd.read_csv (datafile)
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
results = []
with open('issues.csv') as issues:
for ticker,date in csv.reader(issues):
day,month,year = map(int,date.split('/'))
# dt = datetime.datetime.strptime(date, r'%d/%m/%Y')
date = datetime.date(year,month,day)
s = get_data(df,date,ticker)
results.append(s)
# print(s)
Creating a single DataFrame or table for all that info may be problematic especially since the date ranges are all different. Probably should ask a separate question regarding that. Its mcve should probably just include a few minimal Pandas Series with a couple of different date ranges and tickers.

How to Merge a list of Multiple DataFrames and Tag each Column with a another list

I have a lisit of DataFrames that come from the census api, i had stored each year pull into a list.
So at the end of my for loop i have a list with dataframes per year and a list of years to go along side the for loop.
The problem i am having is merging all the DataFrames in the list while also taging them with a list of years.
So i have tried using the reduce function, but it looks like it only taking 2 of the 6 Dataframes i have.
concat just adds them to the dataframe with out tagging or changing anything
# Dependencies
import pandas as pd
import requests
import json
import pprint
import requests
from census import Census
from us import states
# Census
from config import (api_key, gkey)
year = 2012
c = Census(api_key, year)
for length in range(6):
c = Census(api_key, year)
data = c.acs5.get(('NAME', "B25077_001E","B25064_001E",
"B15003_022E","B19013_001E"),
{'for': 'zip code tabulation area:*'})
data_df = pd.DataFrame(data)
data_df = data_df.rename(columns={"NAME": "Name",
"zip code tabulation area": "Zipcode",
"B25077_001E":"Median Home Value",
"B25064_001E":"Median Rent",
"B15003_022E":"Bachelor Degrees",
"B19013_001E":"Median Income"})
data_df = data_df.astype({'Zipcode':'int64'})
filtervalue = data_df['Median Home Value']>0
filtervalue2 = data_df['Median Rent']>0
filtervalue3 = data_df['Median Income']>0
cleandata = data_df[filtervalue][filtervalue2][filtervalue3]
cleandata = cleandata.dropna()
yearlst.append(year)
datalst.append(cleandata)
year += 1
so this generates the two seperate list one with the year and other with dataframe.
So my output came out to either one Dataframe with missing Dataframe entries or it just concatinated all without changing columns.
what im looking for is how to merge all within a list, but datalst[0] to be tagged with yearlst[0] when merging if at all possible
No need for year list, simply assign year column to data frame. Plus avoid incrementing year and have it the iterator column. In fact, consider chaining your process:
for year in range(2012, 2019):
c = Census(api_key, year)
data = c.acs5.get(('NAME', "B25077_001E","B25064_001E", "B15003_022E","B19013_001E"),
{'for': 'zip code tabulation area:*'})
cleandata = (pd.DataFrame(data)
.rename(columns={"NAME": "Name",
"zip code tabulation area": "Zipcode",
"B25077_001E": "Median_Home_Value",
"B25064_001E": "Median_Rent",
"B15003_022E": "Bachelor_Degrees",
"B19013_001E": "Median_Income"})
.astype({'Zipcode':'int64'})
.query('(Median_Home_Value > 0) & (Median_Rent > 0) & (Median_Income > 0)')
.dropna()
.assign(year_column = year)
)
datalst.append(cleandata)
final_data = pd.concat(datalst, ignore_index = True)

Merge Data Frames By Date With Unequal Dates

My process is this:
Import csv of data containing dates, activations, and cancellations
subset the data by activated or cancelled
pivot the data with aggfunc 'sum'
convert back to data frames
Now, I need to merge the 2 data frames together but there are dates that exist in one data frame but not the other. Both data frames start Jan 1, 2017 and end Dec 31, 2017. Preferably, the output for any observation in which the index month needs to be filled with have a corresponding value of 0.
Here's the .head() from both data frames:
For reference, here's the code up to this point:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import datetime
%matplotlib inline
#import data
directory1 = "C:\python\Contracts"
directory_source = os.path.join(directory1, "Contract_Data.csv")
df_source = pd.read_csv(directory_source)
#format date ranges as times
#df_source["Activation_Month"] = pd.to_datetime(df_source["Activation_Month"])
#df_source["Cancellation_Month"] = pd.to_datetime(df_source["Cancellation_Month"])
df_source["Activation_Day"] = pd.to_datetime(df_source["Activation_Day"])
df_source["Cancellation_Day"] = pd.to_datetime(df_source["Cancellation_Day"])
#subset the data based on status
df_active = df_source[df_source["Order Status"]=="Active"]
df_active = pd.DataFrame(df_active[["Activation_Day", "Event_Value"]].copy())
df_cancelled = df_source[df_source["Order Status"]=="Cancelled"]
df_cancelled = pd.DataFrame(df_cancelled[["Cancellation_Day", "Event_Value"]].copy())
#remove activations outside 2017 and cancellations outside 2017
df_cancelled = df_cancelled[(df_cancelled['Cancellation_Day'] > '2016-12-31') &
(df_cancelled['Cancellation_Day'] <= '2017-12-31')]
df_active = df_active[(df_active['Activation_Day'] > '2016-12-31') &
(df_active['Activation_Day'] <= '2017-12-31')]
#pivot the data to aggregate by day
df_active_aggregated = df_active.pivot_table(index='Activation_Day',
values='Event_Value',
aggfunc='sum')
df_cancelled_aggregated = df_cancelled.pivot_table(index='Cancellation_Day',
values='Event_Value',
aggfunc='sum')
#convert pivot tables back to useable dataframes
activations_aggregated = pd.DataFrame(df_active_aggregated.to_records())
cancellations_aggregated = pd.DataFrame(df_cancelled_aggregated.to_records())
#rename the time columns so they can be referenced when merging into one DF
activations_aggregated.columns = ["index_month", "Activations"]
#activations_aggregated = activations_aggregated.set_index(pd.DatetimeIndex(activations_aggregated["index_month"]))
cancellations_aggregated.columns = ["index_month", "Cancellations"]
#cancellations_aggregated = cancellations_aggregated.set_index(pd.DatetimeIndex(cancellations_aggregated["index_month"]))
I'm aware there are many posts that address issues similar to this but I haven't been able to find anything that has helped. Thanks to anyone that can give me a hand with this!
You can try:
activations_aggregated.merge(cancellations_aggregated, how='outer', on='index_month').fillna(0)

Python: store a value in a variable so that you can recognize each reoccurence

If this question is unclear, I am very open to constructive criticism.
I have an excel table with about 50 rows of data, with the first column in each row being a date. I need to access all the data for only one date, and that date appears only about 1-5 times. It is the most recent date so I've already organized the table by date with the most recent being at the top.
So my goal is to store that date in a variable and then have Python look only for that variable (that date) and take only the columns corresponding to that variable. I need to use this code on 100's of other excel files as well, so it would need to arbitrarily take the most recent date (always at the top though).
My current code below simply takes the first 5 rows because I know that's how many times this date occurs.
import os
from numpy import genfromtxt
import pandas as pd
path = 'Z:\\folderwithcsvfile'
for filename in os.listdir(path):
file_path = os.path.join(path, filename)
if os.path.isfile(file_path):
broken_df = pd.read_csv(file_path)
df3 = broken_df['DATE']
df4 = broken_df['TRADE ID']
df5 = broken_df['AVAILABLE STOCK']
df6 = broken_df['AMOUNT']
df7 = broken_df['SALE PRICE']
print (df3)
#print (df3.head(6))
print (df4.head(6))
print (df5.head(6))
print (df6.head(6))
print (df7.head(6))
This is a relatively simple filtering operation. You state that you want to "take only the columns" that are the latest date, so I assume that an acceptable result will be a filter DataFrame with just the correct columns.
Here's a simple CSV that is similar to your structure:
DATE,TRADE ID,AVAILABLE STOCK
10/11/2016,123,123
10/11/2016,123,123
10/10/2016,123,123
10/9/2016,123,123
10/11/2016,123,123
Note that I mixed up the dates a little bit, because it's hacky and error-prone to just assume that the latest dates will be on the top. The following script will filter it appropriately:
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
# convert the DATE column to datetimes
df['DATE'] = pd.to_datetime(df['DATE'])
# find the latest datetime
latest_date = df['DATE'].max()
# use index filtering to only choose the columns that equal the latest date
latest_rows = df[df['DATE'] == latest_date]
print (latest_rows)
# now you can perform your operations on latest_rows
In my example, this will print:
DATE TRADE ID AVAILABLE STOCK
0 2016-10-11 123 123
1 2016-10-11 123 123
4 2016-10-11 123 123

Categories

Resources