print the average of each year from txt.file - python

I need to print the average per year for an assignment. I have the following:
a text file that is like this with over 2000 lines:
Unit 42;2017;7.0
Love Your Garden;2011;8.0
Limmy's Show;2010;8.3
Nazi Megastructures;2013;8.0
Omniscient;2020;6.3
Green Frontier;2019;7.4
Los Briceño;2019;8.4
Aftermath;2014;
Sugar;2006;
Beyond Stranger Things;2017;
Men on a Mission;2018;
Click for Murder;2017;
As you can see, some movies don't have a grade, so these need to be ignored.
Now I need to output it like this:
2000: 1,1111
2001: 2,2222
etc up until 2020
I wrote the following code to extract the right parts from the txt file:
file = open("tv_shows.txt", "r", encoding='utf8')
#content = file.read()
result = {}
for line in file:
    year, number = line.split(';')[1], line.split(';')[2]
    if len(number) <3:
        continue
    year = int(year)
    number = float(number)
    try:
        result[year].append(number)
    except KeyError:
        result[year] = [number]
for k, v in sorted(result.items()):
    print('{}: {:.4f}'.format(k, sum(v) / len(v)))
It gives me this, which is a lot better, but it raises a new question for me: how can I remove the redundant zeros in the average numbers?
2000: 7.7000
2001: 7.4000
2002: 7.1000
2003: 7.0091
2004: 7.6667
2005: 7.7333
2006: 7.2579
2007: 7.5080
2008: 7.1630
2009: 7.3884
2010: 7.3904
2011: 7.3507
2012: 7.0787
2013: 7.0418
2014: 7.2427
2015: 7.2462
2016: 7.1730
2017: 7.1478
2018: 7.0034
2019: 7.1191
2020: 6.8130

If you are not allowed to use pandas:
file = open("tv_shows.txt", "r", encoding='utf8')
years = {}
for a in file:
    _, year, number = a.split(';')
    if len(number) <3:
        continue
    year = int(year)
    number = float(number)
    if year not in years:
        years[year] = [] # Add a new list to the years dict
    years[year].append(number) # Append the current number to the correct list.
avgyears = {}
for year, numberlist in years.items():
    # iterate over the dict, find the mean of each list
    avgyears[year] = sum(numberlist) / len(numberlist)
The question was edited while I was writing my answer. The modified question asks "How can I remove the redundant zeros in the average numbers?"
The extra zeros are added because you ask Python to format your number to four decimal places. To remove the zeros from the right side of the string, you can use str.rstrip(). A second rstrip(".") avoids leaving a dangling decimal point when the average is a whole number.
for year, numberlist in years.items():
    # iterate over the dict, find the mean of each list
    avgyears[year] = sum(numberlist) / len(numberlist)
    # drop trailing zeros, then a trailing "." if nothing is left after it
    num = f"{avgyears[year]:.4f}".rstrip("0").rstrip(".")
    print(f"{year}: {num}")

If you are allowed to use pandas, then:
import pandas as pd
df = pd.read_csv("tv_shows.txt", delimiter=";", header=None,
                 names=['name', 'year', 'rating'])
df = df.dropna()
df.groupby(['year'])['rating'].mean().reset_index()

How about you keep a dictionary whose keys are years and whose values are lists of scores in that year? Populate the dict as you loop (don't forget to convert str to float). Then at the end you can just average each list.
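The dictionary-of-lists idea could be sketched as follows. This is only an illustration: the function name average_by_year is made up, the sample rows are taken from the question, and it assumes titles never contain a semicolon.

```python
from collections import defaultdict

def average_by_year(lines):
    """Return {year: mean rating}, skipping rows that have no rating."""
    scores = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split(";")
        # Assumption: exactly three fields per row (name;year;rating)
        if len(parts) != 3 or not parts[2].strip():
            continue
        scores[int(parts[1])].append(float(parts[2]))
    return {year: sum(vals) / len(vals) for year, vals in sorted(scores.items())}

sample = [
    "Unit 42;2017;7.0\n",
    "Click for Murder;2017;\n",
    "Green Frontier;2019;7.4\n",
    "Los Briceño;2019;8.4\n",
]
averages = average_by_year(sample)
for year, avg in averages.items():
    print(f"{year}: {avg:.4f}")
```

With a real file, the open file object can be passed in place of the sample list, since iterating over a file yields its lines.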

Related

Looking for any matching terms from file

I have a file that has a large list of countries, years, and life expectancies. I cannot figure out how to make sure the user is only allowed to input a year that actually exists. After figuring this out, I will need to call only those years (with the corresponding country name, code, and life expectancy). How can I do this?
import pathlib
cwd = pathlib.Path(__file__).parent.resolve()
data_file = f'{cwd}/life-expectancy.csv'
with open(data_file) as f:
    while True:
        user_year = input('Enter the year of interest: ')
        for lines in f:
            cat = lines.strip().split(',')
            country = cat[0]
            code = cat[1]
            year = cat[2]
            age = cat[3]
            if any( [year in user_year for year in cat[2]] ):
                print(f'Your year is {user_year}. That is one of our known years.')
                print(year)
                print()
                continue
            else:
                print('Please enter a valid year (1751-2019)')
        print('test')
Solution 1
If all the years from 1751 to 2019 are in your file, then you don't need to read the file to check that. You can simply do this:
# Ask the user for the year
prompt_text = "Enter the year of interest: "
user_year = int(input(prompt_text))
while not 1751 <= user_year <= 2019:
    print("Please enter a valid year (1751-2019)")
    user_year = int(input(prompt_text))
After that you can read your file and store the data only if the years match:
# Get the data for the asked year
# Example of final data: [("France", "FR", 45), ("Espagne", "ES", 29)]
data = []
with open(data_file, "r", encoding="utf-8") as file:
    for line in file:
        country, code, year, age = line.strip().split(",")
        if int(year) == user_year:
            data.append((country, code, int(age)))
Solution 2
If you really need to check the year against your file, e.g. because 1845 is not in it, then read the file once, store all the data in a dictionary indexed by year, and return the data of the requested year if it is present:
data = {}
with open(data_file, "r", encoding="utf-8") as file:
    for line in file:
        country, code, year, age = line.strip().split(",")
        year = int(year)
        if year in data:
            data[year].append((country, code, int(age)))
        else:
            data[year] = [(country, code, int(age))]
prompt_text = "Enter the year of interest: "
user_year = int(input(prompt_text))
while user_year not in data:
    print("The year is not present in the file")
    user_year = int(input(prompt_text))
print(data[user_year])
One could use DataFrames to handle such cases. For more information on dataframes, take a look at pandas.DataFrame.
To select specific columns from the dataframe: df[[<col_1>, <col_2>]]
With the data loaded, the check could look like the following.
import pandas as pd
df = pd.read_csv("Life Expectancy Data.csv")
year = int(input("Enter the year of interest: "))
df = df[["Country", "Year", "Life expectancy "]]
if year in df["Year"].values:
    print(f'Your year is {year}. That is one of our known years.')
    display(df.loc[df["Year"] == year])  # display() is available in Jupyter; use print() in a plain script
else:
    print("Please enter a valid year (2000-2015)")
Your question includes two questions.
1. Question and answer
I cannot figure out how to make sure the user is only allowed to
input a year that actually exists.
Your range of accepted years is 1751-2019. You could create a list with these integers and check that the user input is within that range. Note that range() excludes its end value, so the upper bound has to be 2020. E.g.
allowed_answers = list(range(1751, 2020))
There are multiple ways to check the user input, and the one you want to use depends on how you want the user interaction to work. Here are a few examples:
1. Stop the program immediately if user input is invalid
user_year = int(input('Enter the year of interest: '))  # convert to int before comparing
allowed_answers = list(range(1751, 2020))
assert user_year in allowed_answers, "User input is invalid"
...
2. Ask user to input number until it is accepted
allowed_answers = list(range(1751, 2020))
user_year = 0
while int(user_year) not in allowed_answers:
    print('Please enter a valid year (1751-2019)')
    user_year = input('Enter the year of interest: ')
3. Combining the two solutions to have a limit of prompts.
allowed_answers = list(range(1751, 2020))
user_year = 0
for i in range(0,5):
    print('Please enter a valid year (1751-2019)')
    user_year = input('Enter the year of interest: ')
    if int(user_year) in allowed_answers:
        input_valid = True
        break
    else:
        input_valid = False
assert input_valid, "No correct input after five tries."
Note that all these solutions only handle inputs that can be converted into an integer. To get around that, you might need some try... except clauses around the conversion from string to integer, or you could transform the items of allowed_answers into strings.
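The try... except idea mentioned above might look like the sketch below. The helper names parse_year and ask_year are hypothetical, and the read parameter exists only so the prompt loop can be exercised without a keyboard.

```python
def parse_year(text, first=1751, last=2019):
    """Return the year as an int, or None if text is not a year in range."""
    try:
        year = int(text.strip())
    except ValueError:
        # Input such as "abc" or "" cannot be converted to an integer
        return None
    return year if first <= year <= last else None

def ask_year(prompt="Enter the year of interest: ", read=input):
    """Keep prompting until parse_year accepts the answer."""
    while True:
        year = parse_year(read(prompt))
        if year is not None:
            return year
        print("Please enter a valid year (1751-2019)")
```

Calling ask_year() with no arguments prompts on the console; invalid text and out-of-range years are both rejected by the same check.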
2. Question and answer
After figuring this out, I will need to call only those years (with corresponding country name, code, and living expectancies. How can I do this?
I would read the file only once and make it into a dictionary. Then you only need to do the indexing once and can search from there for as long as your program is running. See https://docs.python.org/3/tutorial/datastructures.html#dictionaries .
With these suggestions I would do the data reading and transformation into dictionary outside (and before) your while loop.

How to split the textfile

04-05-1993:1.068
04-12-1993:1.079
04-19-1993:1.079
06-06-1994:1.065
06-13-1994:1.073
06-20-1994:1.079
I have a text file of date:price entries for gas, and I want to calculate the average gas price per year. So I tried to split:
with open('c:/Gasprices.txt','r') as f:
    fullfile=[x.strip() for x in f.readlines()]
datesprices=[(x.split('-')[0], x.split(':')[1]) for x in fullfile]
print(datesprices)
But I don't get the year and price; I get data like this instead:
('04', '1.068'), ('04', '1.079')
Please let me know what I am missing.
Also, please show me how to use the split data to calculate the average price per year using a dictionary, if you can.
I see no need to split the input lines as they have a fixed format for the date - i.e., its length is known. Therefore we can just slice.
with open('gas.txt') as gas:
    td = dict()
    for line in gas:
        year = line[6:10]
        price = float(line[11:])
        td.setdefault(year, []).append(price)
    for k, v in td.items():
        print(f'{k} {sum(v)/len(v):.3f}')
Output:
1993 1.075
1994 1.072
Note:
There is no check here for blank lines. It is assumed that there are none and that the sample shown in the question is malformed.
Also, no need to strip the incoming lines as float() is impervious to leading/trailing whitespace
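If blank lines or a less rigid date format are a concern, a split-based variant is one option. The sketch below is an assumption-laden alternative to the slicing above, not a replacement for it; yearly_average is a made-up name and the sample reuses the lines from the question plus one blank line.

```python
from collections import defaultdict

def yearly_average(lines):
    """Return {year: average price} for lines shaped like MM-DD-YYYY:price."""
    prices = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        date, price = line.split(":")
        year = date.rsplit("-", 1)[1]  # text after the last hyphen
        prices[year].append(float(price))
    return {year: sum(v) / len(v) for year, v in prices.items()}

sample = [
    "04-05-1993:1.068", "04-12-1993:1.079", "04-19-1993:1.079", "",
    "06-06-1994:1.065", "06-13-1994:1.073", "06-20-1994:1.079",
]
result = yearly_average(sample)
for year, avg in result.items():
    print(f"{year} {avg:.3f}")
```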
As was already mentioned, to get the year you need a slightly more complex split. But since your format seems to be very consistent, you could probably go for:
datesprices=[(x[6:10], x[11:]) for x in fullfile]
But how do you get the average from it? You need to store a list of prices for each year somewhere.
from statistics import mean
my_dict = {} # could be defaultdict too
for year, price in datesprices:
    if year not in my_dict:
        my_dict[year] = []
    my_dict[year].append(float(price))  # convert to float so mean() can work with it
for year, prices in my_dict.items():
    print(year, mean(prices))
Try this:
with open('c:/Gasprices.txt','r') as f:
    fullfile=[x.strip() for x in f.readlines()]
datesprices=[(x.split('-')[0],x.split('-')[-1].split(':')[0], x.split(':')[1]) for x in fullfile]
print(datesprices)
Output:
[('04', '1993', '1.068'), ('04', '1993', '1.079'), ('04', '1993', '1.079'), ('06', '1994', '1.065'), ('06', '1994', '1.073'), ('06', '1994', '1.079')]
Or:
with open('c:/Gasprices.txt','r') as f:
    fullfile=[x.strip() for x in f.readlines()]
datesprices=[(x.split('-')[-1].split(':')[0], x.split(':')[1]) for x in fullfile]
print(datesprices)
Output:
[('1993', '1.068'), ('1993', '1.079'), ('1993', '1.079'), ('1994', '1.065'), ('1994', '1.073'), ('1994', '1.079')]
txt = ['04-05-1993:1.068', '04-12-1993:1.079', '04-19-1993:1.079', '06-06-1994:1.065', '06-13-1994:1.073', '06-20-1994:1.079']
price_per_year = {}
number_of_years = {}
for i in txt:
    x = i.split(':')
    date = x[0]
    price = float(x[1])  # convert so the prices can be summed
    year = date.split('-')[2]
    if year not in price_per_year:
        # first entry seen for this year
        price_per_year[year] = price
        number_of_years[year] = 1
    else:
        price_per_year[year] += price
        number_of_years[year] += 1
# the years are stored as strings, so the keys are strings too
av_price_1993 = price_per_year['1993'] / number_of_years['1993']
av_price_1994 = price_per_year['1994'] / number_of_years['1994']

How to find specific items in a CSV file using inputs?

I'm still new to Python, so forgive me if my code seems rather messy or out of place. However, I need help with an assignment for university. I was wondering how I can find specific items in a CSV file. Here is what the assignment says:
Allow the user to type in a year, then, find the average life expectancy for that year. Then find the country with the minimum and the one with the maximum life expectancies for that year.
import csv
country = []
digit_code = []
year = []
life_expectancy = []
count = 0
lifefile = open("life-expectancy.csv")
with open("life-expectancy.csv") as lifefile:
    for line in lifefile:
        count += 1
        if count != 1:
            line.strip()
            parts = line.split(",")
            country.append(parts[0])
            digit_code.append(parts[1])
            year.append(parts[2])
            life_expectancy.append(float(parts[3]))
highest_expectancy = max(life_expectancy)
country_with_highest = country[life_expectancy.index(max(life_expectancy))]
print(f"The country that has the highest life expectancy is {country_with_highest} at {highest_expectancy}!")
lowest_expectancy = min(life_expectancy)
country_with_lowest = country[life_expectancy.index(min(life_expectancy))]
print(f"The country that has the lowest life expectancy is {country_with_lowest} at {lowest_expectancy}!")
It looks like you only want the first and fourth tokens from each row in your CSV. Therefore, let's simplify it like this:
Hong Kong,,,85.29
Japan,,,85.03
Macao,,,84.68
Switzerland,,,84.25
Singapore,,,84.07
You can then process it like this:
FILE = 'life-expectancy.csv'
data = []
with open(FILE) as csv:
    for line in csv:
        tokens = line.split(',')
        data.append((float(tokens[3]), tokens[0]))
hi = max(data)
lo = min(data)
print(f'The country with the highest life expectancy {hi[0]:.2f} is {hi[1]}')
print(f'The country with the lowest life expectancy {lo[0]:.2f} is {lo[1]}')
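The assignment also asks for the average, minimum and maximum of one user-chosen year, which means filtering on the year column first. A rough sketch under assumptions: the column order is country, code, year, life expectancy (as in the question's code), the file has one header row, and the function name stats_for_year and the sample rows are invented for illustration.

```python
import csv
import io

def stats_for_year(rows, year):
    """rows: lists of [country, code, year, life_expectancy] strings.

    Returns (average, (min_value, country), (max_value, country)),
    or None when no row matches the year.
    """
    matches = [(float(r[3]), r[0]) for r in rows if r[2] == str(year)]
    if not matches:
        return None
    average = sum(value for value, _ in matches) / len(matches)
    return average, min(matches), max(matches)

# Made-up sample data standing in for life-expectancy.csv
sample = io.StringIO(
    "Entity,Code,Year,Life expectancy\n"
    "Alpha,ALP,2000,70.0\n"
    "Beta,BET,2000,80.0\n"
    "Gamma,GAM,2000,75.0\n"
    "Alpha,ALP,2001,71.0\n"
)
reader = csv.reader(sample)
next(reader)  # skip the header row
rows = list(reader)
result = stats_for_year(rows, 2000)
print(result)
```

Passing open("life-expectancy.csv", newline="") to csv.reader instead of the StringIO object would read the real file the same way.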

Convert list of different element types to list of integers

I have a list of different element types (extracted from a column from a dataframe) that I would like to convert to the same element type (integers). The dataframe looks like this:
Because some rows under column "Systemic Banking Crisis (starting date)" only have one year, while others have several, the extracted list ends up looking like this:
[1994,
1990,
nan,
'1980, 1989, 1995, 2001',
1994,
nan,
2008,
1995,
1987,
nan,
1995,
2008,
nan,...]
The countries that have multiple years (multiple banking crises) are stored as a string, while the countries with only one year are stored as an integer. I would like to turn the data into panel data by looping through each country and making a dummy variable running from 1970 to 2019 that takes the value 1 if there is a banking crisis and 0 if not. To do this I have run the following code:
data_banking = data['Systemic Banking Crisis (starting date)'].to_list()
data_currency = data['Currency Crisis (year)'].to_list()
countries = data['Country'].to_list()
#making lists
years = [1970]
for i in range(1971, 2020):
    years.append(i)
banking_crisis = []
currency_crisis = []
countries_long = []
for i in countries:
    country = [i for x in range(50)]
    countries_long.extend(country)
years_long = []
for i in range(166):
    years_long.extend(years)
for i in data_banking:
    for y in years:
        if y==i:
            banking_crisis.append(1)
        else:
            banking_crisis.append(0)
banking = pd.DataFrame(list(zip(countries_long, years_long, banking_crisis)))
This works for all the countries with only one banking crisis and returns a dataframe that looks like this:
However, for the countries with multiple banking crises, python doesn't understand the code because the years are in one string. How do I fix this?
I have tried to convert the list data_banking to a list of lists, convert all list elements to strings, then split the strings and convert each string element to integers, so that I could loop through each element in each (country)list of the data_banking list, but it won't work.
These are the different variations of what I have tried:
def list_of_lists(lst):
    list_1 = [[el] for el in lst]
    #listToStr = ' '.join(map(str, lists))
    return list_1
    #list_1 = listToString(lists)
    #for string in list_values:
    #    list_values = list_1.split(",")
    #    string = int(string)
    #return list_1
data_banking = list_of_lists(data_banking)
for lists in data_banking:
    for item in lists:
        item = float(item)
        # lists = [str(x) for x in lists]
What should I do?
I'd do this entire operation in two steps. (1) First, I iterate over the dataset and store a list of dictionaries containing the country and each single year it's associated with (dropping NaNs), via some string formatting. (2) I then compile these results into a new data frame, making sure that the year column is numeric. Here's the code:
# Step 1
bank = 'Systemic Banking Crisis (starting date)'
rows = []
for _, row in data.iterrows():
    country = row['Country']
    years = row[bank]
    if pd.isna(years):
        continue
    # str() so that rows holding a single integer year can be split too
    for year in str(years).split(','):
        rows.append({'Country': country, bank: pd.to_numeric(year.strip())})
# Step 2
df = pd.DataFrame(rows)
df[bank] = pd.to_numeric(df[bank])
Let me know if this doesn't work for you.
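To get from there to the 0/1 dummy per country and year that the question describes, one possible plain-Python sketch is below. The names crisis_years and dummy_rows are hypothetical, the country labels are placeholders, and it assumes each cell is an int, NaN, or a comma-separated string as in the list shown in the question.

```python
import math

def crisis_years(value):
    """Turn 1994, '1980, 1989' or nan into a set of integer years."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return set()
    return {int(float(part)) for part in str(value).split(",") if part.strip()}

def dummy_rows(countries, values, first=1970, last=2019):
    """One (country, year, flag) tuple per country and year; flag is 1 in crisis years."""
    rows = []
    for country, value in zip(countries, values):
        hits = crisis_years(value)
        for year in range(first, last + 1):
            rows.append((country, year, int(year in hits)))
    return rows

# Placeholder country names; values mimic the mixed list from the question
rows = dummy_rows(["A", "B", "C"], [1994, float("nan"), "1980, 1989, 1995, 2001"])
print(len(rows))
```

The resulting list of tuples can be handed to pd.DataFrame(rows, columns=["Country", "Year", "banking_crisis"]) if a data frame is needed; the column names there are only suggestions.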

Multiple lines of data, want to get an index or figure out how to get it in order

1. Reads and stores the data in this file.
2. Asks the user for two integers corresponding to start and end years, and finds and lists the year of publication, title, author, in that order, of all books published during that period.
3. Repeats the previous step until the user enters -1 when prompted for the start year.
This is what I have so far:
def main():
    file = open("resources.txt","r")
    myList = []
    year1 = int(input("Enter the first year:"))
    year2 = int(input("Enter the second year: "))
    for x in range(year1, year2):
        print(yearofpublication,title, author)
and the file is 1000 lines
I need help with #2 mainly.
Thank you
Here is a solution that doesn't use pandas. I have put comments in to break down the code according to the steps you requested. Step 1 imports the text file, gets rid of all tabs and newline characters, and splits each line on the semicolon to create a list of lists.
Step 2 iterates through all the books and compares index 3 (year) of each book to the specified years. Step 3 creates an infinite loop and breaks it only when the user enters -1.
#step 1
data = open('resources.txt', 'r')
book_list = []
for line in data:
    new_line = line.rstrip('\n').replace('\t', '').split(';')
    book_list.append(new_line)
#step 3
while True:
    year1 = int(input("Enter the first year:"))
    if year1 == -1:
        break
    year2 = int(input("Enter the second year: "))
    #step2
    for book in book_list:
        if year1 <= int(book[3]) <= year2:
            print(f'Publication Year: {book[3]}, Title: {book[1]}, Author: {book[2]}')
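Step 2 can also be pulled into a small function so it is easy to check on its own. This is a sketch that assumes the same field layout as the code above (index 1 title, index 2 author, index 3 year); books_between and the sample books are invented for illustration.

```python
def books_between(book_list, start, end):
    """Return (year, title, author) tuples for books published from start to end, inclusive."""
    found = []
    for book in book_list:
        year = int(book[3])
        if start <= year <= end:
            found.append((year, book[1].strip(), book[2].strip()))
    return sorted(found)  # oldest first

# Made-up rows in the assumed id;title;author;year layout
sample_books = [
    ["1", "Title A", "Author X", "1990"],
    ["2", "Title B", "Author Y", "2005"],
    ["3", "Title C", "Author Z", "1998"],
]
for year, title, author in books_between(sample_books, 1990, 2000):
    print(year, title, author)
```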
Assuming you have a txt file like the one below that is ;-separated, with a consistent format and no headers:
1 ; A ; X ;1220
2 ; B ; Y ;1245
You can load the file using pandas, which will allow you to easily filter the data on conditions.
import pandas
df = pandas.read_csv("data.txt", sep=";", names=["id", "author", "title", "year"])
Then for your step 2, you can filter the dataframe based on year1 and year2 (inclusive bounds, so books published in the start or end year are kept):
selected = df[(df['year'] >= year1) & (df['year'] <= year2)]
print(selected[['year', 'title', 'author']])
