Getting data from an API and summing 2 values for each year - Python

I want to get data from an API where the structure looks like:
district - passed - gender - year - value
The API shows how many people from a country passed an exam, but it has two rows for the same year, one per gender (male and female).
I want to sum the value for each year, for example:
CountryA passed male 2019 12
CountryA passed female 2019 30
So the result should be 42.
Unfortunately my method returns only the female value.
I was trying something like this:
def task2(self, data, district=None):
    passed = {stat.year: stat.amount for stat in data
              if self.getPercentageAmountOfPassed(stat.status, stat.district, stat.year)}

def getPercentageAmountOfPassed(self, status, district, year):
    return status == 'passed' and district == 'CountryA' and year <= 2019
I'm pretty sure that I got the data from the API, because I was able to solve other examples with other parameters.
EXAMPLE OF DATA (przystąpiło = took the exam, mężczyźni = men, kobiety = women):
[year][district][amount][status][gender]
[Data(2010.0, Polska, 160988.0, przystąpiło, mężczyźni),
 Data(2010.0, Polska, 205635.0, przystąpiło, kobiety),
 Data(2011.0, Polska, 150984.0, przystąpiło, mężczyźni)]

This happens because the dict comprehension overwrites the previous entry whenever it sees the same year again, so only the last (female) row survives. You need to sum the values instead:
from collections import defaultdict

def sum_per_year(data):
    passed = defaultdict(int)
    for stat in data:
        if getPercentageAmountOfPassed(stat.status, stat.district, stat.year):
            # accumulate both gender rows into the same year bucket
            passed[int(stat.year)] += stat.amount
    return passed
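A quick usage sketch with the two CountryA rows from the question (the Data namedtuple and the module-level predicate below are hypothetical stand-ins for the asker's real classes, included only to make the example self-contained):

from collections import namedtuple

# hypothetical stand-in for the API record type
Data = namedtuple('Data', 'year district amount status gender')

def getPercentageAmountOfPassed(status, district, year):
    # same predicate as in the question, as a plain function
    return status == 'passed' and district == 'CountryA' and year <= 2019

rows = [Data(2019, 'CountryA', 12, 'passed', 'male'),
        Data(2019, 'CountryA', 30, 'passed', 'female')]

print(dict(sum_per_year(rows)))  # {2019: 42}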


How to find the maximum date value with conditions in python?

I have a three-column dataframe as follows. I want to calculate the three-month return per day for every fund, so I need to get the most recent date with recorded NAV data from three months earlier. Should I use the max() function with the filter() function to deal with this problem? If so, how? If not, could you please help me figure out a better way to do this?
fund code   date         NAV
fund 1      2021-01-04   1.0000
fund 1      2021-01-05   1.0001
fund 1      2021-01-06   1.0023
...         ...          ...
fund 2      2020-02-08   1.0000
fund 2      2020-02-09   0.9998
fund 2      2020-02-10   1.0001
...         ...          ...
fund 3      2022-05-04   2.0021
fund 3      2022-05-05   2.0044
fund 3      2022-05-06   2.0305
I tried to combine the max() function with filter() as follows:
max(filter(lambda x: x<=df['date']-timedelta(days=91)))
But it didn't work.
Were this in Excel, I know I could use the following array formulas to solve this problem:
{max(if(B:B<=B2-91,B:B))}
{max(if(B:B<=B3-91,B:B))}
{max(if(B:B<=B4-91,B:B))}
....
But with Python, I don't know what to do. I just learnt it three days ago. Please help me.
This picture is what I want if it were in Excel. The yellow area is the original data. The white part is the procedure I need for the calculation and the red part is the result I want. To get this result, I need to divide the 3rd column by the 5th column.
I know that I could use the pct_change(periods=7) function to get the same results as in this picture. But here is the tricky part: the row 7 rows earlier is not necessarily the data from 7 days earlier, and not all the funds are recorded daily. Some funds are recorded weekly, some monthly. So I need to check whether the data used for the division exists first.
What you need is an implementation of a sliding-window maximum (for your example, 1 week = 7 days).
I recreated your example as follows (to create the data frame you have):
import pandas as pd
import datetime
from random import randint

df = pd.DataFrame(columns=["fund code", "date", "NAV"])
date = datetime.datetime.strptime("2021-01-04", '%Y-%m-%d')
# fund 1: two blocks of dates with a gap between them
for i in range(10):
    df = df.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)}, ignore_index=True)
for i in range(20, 25):
    df = df.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)}, ignore_index=True)
# fund 2: a single block of dates
for i in range(20, 25):
    df = df.append({"fund code": 'fund 2', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)}, ignore_index=True)
This will look like your example, with non-continuous dates and two different funds.
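If you are on a newer pandas where DataFrame.append has been removed (it was deprecated in 1.4 and dropped in 2.0), an equivalent way to build the same demo frame is to collect the rows first; this is a sketch reusing the date, randint and imports above:

rows = []
for code, rng in [('fund 1', range(10)), ('fund 1', range(20, 25)), ('fund 2', range(20, 25))]:
    for i in rng:
        rows.append({"fund code": code, "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
df = pd.DataFrame(rows)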
The sliding-window maximum (for a variable window length in days) looks like this:
from collections import deque

class max_queue:
    def __init__(self, win=7):
        self.win = win
        self.queue = deque()  # holds (date, value) pairs with decreasing values
        self.date = None

    def append(self, date, value):
        # drop smaller values from the back; they can never be the window maximum again
        while self.queue and value > self.queue[-1][1]:
            self.queue.pop()
        # drop entries from the front that have fallen out of the time window
        while self.queue and date - self.queue[0][0] >= datetime.timedelta(self.win):
            self.queue.popleft()
        self.queue.append((date, value))
        self.date = date

    def get_max(self):
        return self.queue[0][1]
Now you can simply iterate over the rows and get the max value in the timeframe you are interested in:
mq = max_queue(7)
pre_code = ''
for idx, row in df.iterrows():
    code, date, nav, *_ = row
    if code != pre_code:
        mq = max_queue(7)
        pre_code = code
    mq.append(date, nav)
    df.at[idx, 'max'] = mq.get_max()
The results will look like this, with an added max column. This assumes that each fund's rows appear contiguously in the dataframe (the queue is reset whenever the fund code changes); you could also keep a separate max_queue per fund instead.
Using a max queue to keep track of only the maximum inside the window gives the correct O(n) complexity for a solution, which matters if you are dealing with huge datasets and especially bigger date ranges (instead of a week).
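For comparison, a pandas-native sketch of the same idea is a time-based rolling maximum grouped by fund. This is an alternative to the max_queue class, not part of the original answer, and it assumes the frame can be sorted by fund code and date and that NAV is numeric:

# make sure the types and row order support a time-based rolling window
df['date'] = pd.to_datetime(df['date'])
df['NAV'] = df['NAV'].astype(float)
df = df.sort_values(['fund code', 'date'])

rolling_max = (df.set_index('date')
                 .groupby('fund code')['NAV']
                 .rolling('7D')      # 7-day time window; gaps in the dates are handled
                 .max())
df['max'] = rolling_max.values      # aligns because both sides are ordered by fund code, then date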

Trying to get sums in lists to print out with strings in output

I have been working on a Python project analyzing a CSV file and cannot get the output to show sums next to my strings; instead of the sums I get all the individual values strung together.
Code I'm working with:
import pandas as pd
data = pd.read_csv('XML_projectB.csv')
#inserted column headers since the raw data doesn't have any
data.columns = ['name','email','category','amount','date']
data['date'] = pd.to_datetime(data['date'])
#Calculate the total budget by category
category_wise = data.groupby('category').agg({'amount':['sum']})
category_wise.reset_index(inplace=True)
category_wise.columns = ['category','total_budget']
#Determine which budget category people spent the most money in
max_budget = category_wise[category_wise['total_budget']==max(category_wise['total_budget'])]['category'].to_list()
#Tally the total amounts for each year-month (e.g., 2017-05)
months_wise = data.groupby([data.date.dt.year, data.date.dt.month])['amount'].sum()
months_wise = pd.DataFrame(months_wise)
months_wise.index.names = ['year','month']
months_wise.reset_index(inplace=True)
#Determine which person(s) spent the most money on a single item.
person = data[data['amount'] == max(data['amount'])]['name'].to_list()
#Tells user in Shell that text file is ready
print("Check your folder!")
#Get all this info into a text file
tfile = open('output.txt','a')
tfile.write(category_wise.to_string())
tfile.write("\n\n")
tfile.write("The type with most budget is " + str(max_budget) + " and the value for the same is " + str(max(category_wise['total_budget'])))
tfile.write("\n\n")
tfile.write(months_wise.to_string())
tfile.write("\n\n")
tfile.write("The person who spent most on a single item is " + str(person) + " and he/she spent " + str(max(data['amount'])))
tfile.close()
The CSV raw data looks like this (there are almost 1000 lines of it):
Walker Gore,wgore8i#irs.gov,Music,$77.98,2017-08-25
Catriona Driussi,cdriussi8j#github.com,Garden,$50.35,2016-12-23
Barbara-anne Cawsey,bcawsey8k#tripod.com,Health,$75.38,2016-10-16
Henryetta Hillett,hhillett8l#pagesperso-orange.fr,Electronics,$59.52,2017-03-20
Boyce Andreou,bandreou8m#walmart.com,Jewelery,$60.77,2016-10-19
My output in the txt file looks like this:
category total_budget
0 Automotive $53.04$91.99$42.66$1.32$35.07$97.91$92.40$21.28$36.41
1 Baby $93.14$46.59$31.50$34.86$30.99$70.55$86.74$56.63$84.65
2 Beauty $28.67$97.95$4.64$5.25$96.53$50.25$85.42$24.77$64.74
3 Books $4.03$17.68$14.21$43.43$98.17$23.96$6.81$58.33$30.80
4 Clothing $64.07$19.29$27.23$19.78$70.50$8.81$39.36$52.80$80.90
year month amount
0 2016 9 $97.95$67.81$80.64
1 2016 10 $93.14$6.08$77.51$58.15$28.31$2.24$12.83$52.22$48.72
2 2016 11 $55.22$95.00$34.86$40.14$70.13$24.82$63.81$56.83
3 2016 12 $13.32$10.93$5.95$12.41$45.65$86.69$31.26$81.53
I want the total_budget column to be the sum for each category, not the individual values you see here. It's the same problem for months_wise: it gives me the individual values, not the sums.
I tried the {} .format in the write lines, .apply(str), .format on its own, and just about every other Python permutation of the conversion to string from a list I could think of, but I'm stumped.
What am I missing here?
As @Barmar said, the source has values like $XX, so the amount column is not treated as numbers and sum() concatenates the strings instead of adding them. You should parse the values as floats instead of strings with $ in them.
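A minimal sketch of that parsing step, using the column names assigned in the question (treat this as illustrative rather than the exact fix):

# strip the leading "$" and convert to float so the groupby aggregations produce numeric totals
data['amount'] = data['amount'].str.replace('$', '', regex=False).astype(float)

With a numeric amount column, the existing category_wise and months_wise aggregations will write real sums to the text file instead of concatenated strings.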

Improve running time when using inflation method on pandas

I'm trying to get real (inflation-adjusted) prices for my data in pandas. Right now, I am just playing with one year's worth of data (3,962,050 rows) and it took me 443 seconds to inflate the values using the code below. Is there a quicker way to find the real value? Is it possible to use pooling? I have many more years and it would take too long to wait every time.
Portion of df:
   year  quarter    fare
0  1994        1  213.98
1  1994        1  214.00
2  1994        1  214.00
3  1994        1  214.50
4  1994        1  214.50
import time

import cpi
import pandas as pd

def inflate_column(data, column):
    """
    Adjust for inflation the series of values in column of the
    dataframe data. Using cpi library.
    """
    print('Beginning to inflate ' + column)
    start_time = time.time()
    df = data.apply(lambda x: cpi.inflate(x[column], x.year), axis=1)
    print("Inflating process took", time.time() - start_time, " seconds to run")
    return df

df['real_fare'] = inflate_column(df, 'fare')
You have multiple rows for each year, so you can call cpi.inflate just once per year, store the results in a dict, and then reuse those values instead of calling cpi.inflate for every row.
all_years = df["year"].unique()
dict_years = {}
for year in all_years:
    # inflating 1.0 gives the multiplier for that year
    dict_years[year] = cpi.inflate(1.0, year)
df['real_fare'] = ...  # apply here: dict_years[row['year']] * row['fare']
You can fill in the last line using apply, or do it some other way, like df['real_fare'] = df['fare'] * ... (see the sketch below).
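One vectorised way to fill in that last line (a sketch, assuming dict_years is built as above and the year column holds the same values used as dict keys):

# map each row's year to its precomputed multiplier, then scale the fare column
df['real_fare'] = df['fare'] * df['year'].map(dict_years)

This avoids the per-row Python-level call inside apply, which is typically where most of those 443 seconds go.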

Automatically calculating percentages and storing them in variables

I am currently working on a demographic project.
I have data from 3 different countries and their birth statistics for each month.
My question:
I want to calculate the percentage of people born in each month and plot it for each country. (x = month, y = percentage born)
Therefore, I want to calculate the percentages first.
I want to do this by iterating over all months to improve my code. So far:
EU = df2["European Union - 28 countries (2013-2020)"]
CZE = df2["Czechia"]
GER = df2["Germany including former GDR"]
EU_1 = EU[1] / EU[0] *100
EU_2 = EU[2] / EU[0] *100
etc.
for each month and all 3 countries.
How can I calculate all of this automatically by changing the country and the index [i], and store every value separately (a function, a for loop?)
Thank you very much!
You could do something like this:
EU = df2["European Union - 28 countries (2013-2020)"]
monthly_percentages = [num_born / EU[0] * 100 for num_born in EU[1:]]
This assumes that the first element of EU is the total births that year and the rest are the births for each month. If you wanted to do all the countries automatically, you could loop through each country and calculate the birth percentage for each month, then store it somewhere. It would look something like:
country_birth_percentages = []
for country in countries:
    country_birth_percentages.append([num_born / country[0] * 100 for num_born in country[1:]])
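For instance, countries could be wired up from the columns named in the question, storing each country's percentages under its name (a sketch; it assumes, as above, that row 0 of each column holds the total):

# hypothetical wiring of the loop above to the columns used in the question
column_names = ["European Union - 28 countries (2013-2020)", "Czechia", "Germany including former GDR"]
country_birth_percentages = {}
for name in column_names:
    country = df2[name]
    # first element = total births, remaining elements = births per month
    country_birth_percentages[name] = [num_born / country[0] * 100 for num_born in country[1:]]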

Adding a new permanent column to a data frame in Python

I am currently building a fake dataset to play with. I have one dataset, called patient_data, that has the patients' info:
patient_data = pd.DataFrame(np.random.randn(100,5),columns='id name dob sex state'.split())
This gives me a sample of 100 observations, with variables like name, birthday, etc.
Clearly, some of these (like name, sex and state) are categorical variables, and it makes no sense to have random numbers attached to them.
So for the "sex" column, I created a function that turns every random number < 0 into "male" and everything else into "female". I would like to create a new variable called "gender" and store this inside it:
def malefemale(x):
    if x < 0:
        print('male')
    else:
        print('female')
And then I wrote code to apply this function to the data frame to officially create a new variable, "gender":
patient_data.assign(gender = patient_data['sex'].apply(malefemale))
But when I type "patient_data" in the Jupyter notebook, I do not see the data frame updated to include this new variable. It seems like nothing was done.
Does anybody know what I can do to permanently add this new gender variable into my patient_data dataframe, with the function properly working?
I think you need to assign the result back, and for the new values use numpy.where:
import numpy as np

patient_data = patient_data.assign(gender=np.where(patient_data['sex'] < 0, 'male', 'female'))
print(patient_data.head(10))
         id      name       dob       sex     state  gender
0  0.588686  1.333191  2.559850  0.034903  0.232650  female
1  1.606597  0.168722  0.275342 -0.630618 -1.394375    male
2  0.912688 -1.273570  1.140656 -0.788166  0.265234    male
3 -0.372272  1.174600  0.300846  1.959095 -1.083678  female
4  0.413863  0.047342  0.279944  1.595921  0.585318  female
5 -1.147525  0.533511 -0.415619 -0.473355  1.045857    male
6 -0.602340 -0.379730  0.032407  0.946186  0.581590  female
7 -0.234415 -0.272176 -1.160130 -0.759835 -0.654381    male
8 -0.149291  1.986763 -0.675469 -0.295829 -2.052398    male
9  0.600571 -1.577449 -0.906590  1.042335 -2.104928  female
Alternatively, you need to change your custom function to return values instead of printing them:
def malefemale(x):
    if x < 0:
        return 'male'
    else:
        return 'female'
Then simply apply the custom function:
patient_data['gender'] = patient_data['sex'].apply(malefemale)
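As a quick sanity check after either approach (just an illustrative snippet), the new column should now show up next to the original sex values:

print(patient_data[['sex', 'gender']].head())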
