I am working on a COVID-19 dataset with total cases and total deaths on the last day of each month for each city since March. I would like to create a column that tells me the number of new cases for every city in each of these months.
My logic is: if the value in the 'city_ibge_code' column at position p is the same as the value at position p-1, the new column should hold the difference between the number of cases in the two months. And if the values are different (which means they are different cities), just pass the value through to the new column.
casos_full: the dataframe with the cities and the number of cases and deaths in March, April, May, June, July, August and September.
city_ibge_code: the code for each city in the dataframe - each city has a unique code.
There is also a "date" column, which represents the last day of the month.
for rows in casos_full:
    if rows['city_ibge_code'] == rows['city_ibge_code'].shift(1):
        rows['New Cases'] = rows['last_available_confirmed'] - rows['last_available_confirmed'].shift(1)
    else:
        rows['New Cases'] = rows['last_available_confirmed']
rows here is a view of the row; you need to update the actual dataframe instead. If I understood your problem correctly, something like this row-by-row version would work:
prev = None  # (city_ibge_code, last_available_confirmed) from the previous row
for i, row in casos_full.iterrows():
    if prev is not None and row['city_ibge_code'] == prev[0]:
        casos_full.loc[i, 'New Cases'] = row['last_available_confirmed'] - prev[1]
    else:
        casos_full.loc[i, 'New Cases'] = row['last_available_confirmed']
    prev = (row['city_ibge_code'], row['last_available_confirmed'])
Please give more precision on your problem so we can help.
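That said, pandas can do this without an explicit loop. A minimal vectorized sketch, assuming the frame is sorted by city and date and using the column names from the question:

casos_full = casos_full.sort_values(['city_ibge_code', 'date'])
# diff() within each city gives the month-over-month change in the cumulative count
new_cases = casos_full.groupby('city_ibge_code')['last_available_confirmed'].diff()
# a city's first month has no previous value, so fall back to the cumulative total itself
casos_full['New Cases'] = new_cases.fillna(casos_full['last_available_confirmed'])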
I have a set of 11 DataFrames in Python/pandas with salary information for a group of employees over 11 years, which I have concatenated; they all share the column 'Weighted Salary'.
['Year'] ['ID Person'] ['Weighted Salary']
This column contains a salary that is weighted 50/50 between the present and previous year, if the previous year exists, or with the year two years prior otherwise. If no previous data exist, the actual salary and the weighted salary are the same.
What I want to do is to create a new column with the Actual salary for every year and employee.
For this, I need to iterate through the rows of the Weighted Salary column, check if there is a previous salary for the specific person 1 or 2 years prior, and compute the Actual Salary based on the previous values.
The formula for the Actual Salary would be:
df.Actual_Salary = 2 * Weighted(current year) - Weighted(year - 1)
I know the Weighted values.
The condition I've created is:
if len(df.Salary_Today[(df.ID_Person == ID_Person) & (df.Year_int == Year - 1)]) >= 1:
    Actual_Salary = 2 * Weighted(current year) - Weighted(year - 1)
elif len(df.Salary_Today[(df.ID_Person == ID_Person) & (df.Year_int == Year - 2)]) >= 1:
    Actual_Salary = 2 * Weighted(current year) - Weighted(year - 2)
else:
    Actual_Salary = Weighted(current year)
My problem is that I don't know how to properly iterate through each value and check the previous year's information for that person in order to pass it through the conditions and calculate the new column... Any suggestions?
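One way to avoid explicit iteration: sort by person and year, then use groupby and shift. A minimal sketch, assuming df is the concatenated frame with columns 'ID_Person', 'Year_int' and 'Weighted_Salary' (names adapted from the question):

df = df.sort_values(['ID_Person', 'Year_int'])
g = df.groupby('ID_Person')
prev_weighted = g['Weighted_Salary'].shift(1)
# how many years back this person's previous record is
year_gap = df['Year_int'] - g['Year_int'].shift(1)
# keep the previous weighted salary only if it is 1 or 2 years back
prev_weighted = prev_weighted.where(year_gap.isin([1, 2]))
# 2 * current weighted - previous weighted; otherwise fall back to the weighted value itself
df['Actual_Salary'] = (2 * df['Weighted_Salary'] - prev_weighted).fillna(df['Weighted_Salary'])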
Problem:
My dataframe shows forward contract settlements on a daily basis. I want to create a column that returns all the January values. So in November, M0 is the November contract and I want to return M2, which is the January contract; in December I would return M1. The normal code for doing this would be:
df['Jan'] = df.loc[df.Month == 12, 'M1']
This only works for one value of the month, though; I want to loop through the values in the Month column to pull the right 'Mx' column for each month and end up with a single column of January values.
I have tried various loops but continually get errors. The latest I have below with the dataframe:
[Screenshots of the dataframe, the attempted code, and the error message are omitted here.]
Any help appreciated; I have googled all day. Happy to hear any better ways of doing this.
You can apply a function row by row and access the columns within your function.
Try this.
def create_jan(x):
    if x['Month'] == 11:
        return x['M2']   # in November, the January contract is two months out
    elif x['Month'] == 12:
        return x['M1']   # in December, it is one month out
    else:
        return 0

df['Jan'] = df.apply(create_jan, axis=1)
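If the same logic is needed for every month (not just November and December), the offset to the next January contract can be computed directly. A sketch under the assumption that columns M0 through M11 all exist:

def next_jan(row):
    # months remaining until the next January; 0 in January itself
    offset = (13 - row['Month']) % 12
    return row[f'M{offset}']

df['Jan'] = df.apply(next_jan, axis=1)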
I have a dataset like this:
SKU,Date,Inventory,Sales,Incoming
2010,2017-01-01 0:00:00,85,126,252
2010,2017-02-01 0:00:00,382,143,252
2010,2017-03-01 0:00:00,414,139,216
2010,2017-04-01 0:00:00,468,120,216
7770,2017-01-01 0:00:00,7,45,108
7770,2017-02-01 0:00:00,234,64,216
7770,2017-03-01 0:00:00,160,69,36
7770,2017-04-01 0:00:00,150,50,72
7870,2017-01-01 0:00:00,41,29,36
7870,2017-02-01 0:00:00,95,18,36
7870,2017-03-01 0:00:00,112,16,36
7870,2017-04-01 0:00:00,88,19,0
Inventory Quantity is the "actual" recorded quantity, which may differ from the hypothetical remaining quantity - the value I am trying to calculate.
Sales Quantity actually extends much longer into the future. In those rows, the other two columns will have NA.
I want to create the following:
Take only the first Inventory value of each SKU
Use the first value to calculate the hypothetical remaining quantity by using a recursive formula [Earliest inventory] - [Sales for that month] - [Incoming qty for that month] (Note: Earliest inventory is a fixed quantity for each SKU). Store the output in a column called "End of Month Part 1".
Create another column called "Buy Quantity" with the following criteria: If remaining quantity is less than 50, then create a new column that indicates the buy amount (let's say it's 30 for all 3 SKUs) (i.e. increase the quantity by 30). If the remaining quantity is more than 50, then the buy amount is zero.
Create another column called "End of Month Part 2" that adds "End of Month Part 1" with "Buy Quantity"
I am able to obtain the first quantity of each SKU using the following code, and merge it into the dataset as a column called 'Earliest inventory':
first_qty_series = dataset.groupby(by=['SKU']).nth(0)['Inventory']
first_qty_df = first_qty_series.reset_index().rename(columns={'Inventory': 'Earliest inventory'})
dataset = pd.merge(dataset, first_qty_df, on='SKU')
As for the remaining quantity, I thought of using cumsum() on the dataset['Sales'] and dataset['Incoming'] columns, but I think it won't work because cumsum() would sum across ALL SKUs.
That's why I think I need to perform the calculation in groupby. But I don't know what else to do.
(Edit:) Expected output: [screenshot omitted]
Thank you guys!
Here is a way to create the 4 columns you want.
1 - An alternative method for 'Earliest inventory': use loc and drop_duplicates to fill the first row of each 'SKU' with the value from 'Inventory', then ffill to fill the following rows (your method is good too).
dataset.loc[dataset.drop_duplicates(['SKU']).index,'Earliest inventory'] = dataset['Inventory']
dataset['Earliest inventory'] = dataset['Earliest inventory'].ffill().astype(int)
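A more compact equivalent, assuming the rows are already sorted so that the earliest date comes first within each SKU:

# take the first 'Inventory' value within each SKU and broadcast it to every row
dataset['Earliest inventory'] = dataset.groupby('SKU')['Inventory'].transform('first')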
2 - Indeed you need cumsum and groupby to create the column 'End of Month Part 1' - applied to 'Sales' and 'Incoming', not to 'Earliest inventory', as that value is the same on every row for a given 'SKU'. Note: according to your expected result (and the logic), I changed the - to a + before the 'Incoming' column; if I misunderstood the problem, just flip the sign.
dataset['End of Month Part 1'] = (dataset['Earliest inventory']
- dataset.groupby('SKU')['Sales'].cumsum()
+ dataset.groupby('SKU')['Incoming'].cumsum())
3 - The column 'Buy Quantity' can be implemented using loc again, with the condition that the value in 'End of Month Part 1' is less than 50, then fillna with 0:
dataset.loc[dataset['End of Month Part 1'] < 50, 'Buy Quantity'] = 30
dataset['Buy Quantity'] = dataset['Buy Quantity'].fillna(0).astype(int)
4 - Finally, the last column is just the sum of the two previously created ones:
dataset['End of Month Part 2'] = dataset['End of Month Part 1'] + dataset['Buy Quantity']
If I understood the 4 points correctly, you should get the dataset with the new columns.
My dataframe1 contains the day column, which has numeric data from 1 to 7 for each day of the week: 1 - Monday, 2 - Tuesday, etc.
This day column is the day of Departure of a flight.
I need to create a new column dayOfBooking in a second dataframe2 which finds day of the week based on the number of days before a person books a flight and the day of departure of the flight.
For that I've written this function:
def findDay(dayOfDeparture, beforeDay):
    # reduce the number of days before departure to within one week
    beforeDay = int(beforeDay) % 7
    if (dayOfDeparture - beforeDay) > 0:
        dayAns = dayOfDeparture - beforeDay
    else:
        dayAns = 7 - abs(dayOfDeparture - beforeDay)
    return dayAns
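For example, with days encoded 1 = Monday through 7 = Sunday, booking 2 days before a Tuesday departure gives findDay(2, 2) = 7, i.e. Sunday.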
I want something like:
dataframe2["dayOfBooking"] = findDay(dataframe1["day"], i)
where i is the scalar value.
I can see that findDay takes the entire column day of dataframe1 instead of taking a single value for each row.
Is there an easy way to accomplish this like when we want a third column to be the sum of two other columns for each row, we can just write this:
dataframe["sum"] = dataframe2["val1"] + dataframe2["val2"]
EDIT: Figured it out. Answer and explanation below.
df2["colname"] = df.apply(lambda row: findDay(row['col'], i), axis = 1)
We have to use the apply function when we want to pass each row's value of a particular column to a user-defined function.
axis=1 means the function is applied to each row rather than to each column.
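Since findDay is pure modular arithmetic, the apply can also be replaced with a vectorized expression. A sketch, assuming days are encoded 1-7 and i is the scalar number of days before departure:

# equivalent to findDay applied row by row: wrap (day - i) into the range 1..7
dataframe2["dayOfBooking"] = (dataframe1["day"] - i - 1) % 7 + 1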
I have census data that looks like this for a full month, and I want to find out how many unique inmates there were during the month. The information is recorded daily, so there are duplicates.
_id,Date,Gender,Race,Age at Booking,Current Age
1,2016-06-01,M,W,32,33
2,2016-06-01,M,B,25,27
3,2016-06-01,M,W,31,33
My method now is to group them by day and then add the ones that are not yet accounted for into the DataFrame. My question is how to account for two people with the same info: the second one would not get added to the new DataFrame because an identical row already exists. I'm trying to figure out how many people in total were in the prison during this time.
_id is incremental; for example, here is some data from the second day:
2323,2016-06-02,M,B,20,21
2324,2016-06-02,M,B,44,45
2325,2016-06-02,M,B,22,22
2326,2016-06-02,M,B,38,39
link to the dataset here: https://data.wprdc.org/dataset/allegheny-county-jail-daily-census
You could use df.drop_duplicates(), which returns the DataFrame with only unique rows, then count the entries.
Something like this should work:
import pandas as pd

# _id becomes the index, so drop_duplicates() only compares the remaining columns
df = pd.read_csv('inmates_062016.csv', index_col=0, parse_dates=True)
uniqueDF = df.drop_duplicates()
countUniques = len(uniqueDF.index)
print(countUniques)
Result:
>> 11845
Pandas drop_duplicates Documentation
Inmates June 2016 CSV
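Note that this relies on _id being the index: if _id stayed a regular column, its incremental values would make every row unique and nothing would be dropped. A hedged variant that keeps _id as a column and deduplicates on the other fields explicitly:

import pandas as pd

df = pd.read_csv('inmates_062016.csv', parse_dates=['Date'])
# compare only the descriptive fields, ignoring the incremental _id
uniqueDF = df.drop_duplicates(subset=['Date', 'Gender', 'Race', 'Age at Booking', 'Current Age'])
print(len(uniqueDF))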
The problem with this approach / data is that there could be many distinct inmates with the same age / gender / race who would be filtered out.
I think the trick here is to groupby as much as possible and check the differences in those (small) groups through the month:
inmates = pd.read_csv('inmates.csv')
# group by everything except _id and count number of entries
grouped = inmates.groupby(
['Gender', 'Race', 'Age at Booking', 'Current Age', 'Date']).count()
# pivot the dates out and transpose - this give us the number of each
# combination for each day
grouped = grouped.unstack().T.fillna(0)
# get the difference between each day of the month - the assumption here
# being that a negative number means someone left, 0 means that nothing
# has changed and positive means that someone new has come in. As you
# mentioned yourself, that isn't necessarily true
diffed = grouped.diff()
# replace the first day of the month with the grouped numbers to give
# the number in each group at the start of the month
diffed.iloc[0, :] = grouped.iloc[0, :]
# sum only the positive numbers in each row to count those that have
# arrived but ignore those that have left
diffed['total'] = diffed.apply(lambda x: x[x > 0].sum(), axis=1)
# sum total column
diffed['total'].sum() # 3393