I am new to Python and I am trying to figure out how to get my program to find the minimum after it reads specific columns from each row of the file. Can anyone help me with this?
This is an example of what my text file looks like:
05/01 80 2002 5 1966 19 2000 45 2010
06/22 77 1980 4 1945 22 1986 58 2000
---------------------------------------------------------------------------
Day Max Year Min Year Max Year Min Year
---------------------------------------------------------------------------
08/01 79 2002 8 1981 28 1900 54 1988
08/02 79 1989 5 1971 31 1994 60 1998
This is my code (below) that I have so far:
def main():
    file = open('file.txt', 'r')
    for num in file.read().splitlines():
        i = num.split()
        if len(i) > 5:
            print('Day:{}\n'.format(i[0]))
            print('Year:{}\n'.format(i[2]))
            print('Lowest Temperature:{}'.format(i[1]))
    file.close()

main()
This is the output I get from my code (it prints out the header text as well):
Day:Day
Year:Year
Lowest Temperature:Max
Day: 3/11
Year:1920
Lowest Temperature:78
Day:11/02
Year:1974
Lowest Temperature:80
Day:11/03
Year:1974
Lowest Temperature:74
I am trying to find the lowest temperature in my text file and print out the day and the year associated with that temperature. My output should look like this. Thanks to everyone who is willing to help me with this.
Day:10/02
Year:1994
Lowest Temperature:55
You can use your current method to read the file into lines, then split each line into individual columns.
You can then make use of min(), using the column containing the minimum temperature (in this case column index 3) as the key to min().
with open('test.txt') as f:
    data = f.read().splitlines()

data = [i.split() for i in data if any(j.isdigit() for j in i)]
data = min(data, key=lambda x: int(x[3]))

print('Day: {}\nYear: {}\nLowest Temperature: {}'.format(data[0], data[2], data[3]))
Output for your sample file:
Day: 06/22
Year: 1980
Lowest Temperature: 4
Here is my code:
import pandas as pd
from bs4 import BeautifulSoup

# links, urls, remove_list and the requests session `s` are defined earlier in my notebook
l_names = []
for l in links:
    l_names.append(l.get_text())

df = []
for u in urls:
    req = s.get(u)
    req_soup = BeautifulSoup(req.content, 'lxml')
    req_tables = req_soup.find_all('table', {'class': 'infobox vevent'})
    req_df = pd.read_html(str(req_tables), flavor='bs4', header=0)
    dfr = pd.concat(req_df)
    dfr = dfr.drop(index=0)
    dfr.columns = range(dfr.columns.size)
    dfr[1] = dfr[1].str.replace(r"([A-Z])", r" \1").str.strip().str.replace('  ', ' ')
    dfr = dfr[~dfr[0].isin(remove_list)]
    dfr = dfr.dropna()
    dfr = dfr.reset_index(drop=True)
    dfr.insert(loc=0, column='Title', value='Change')
    df.append(dfr)
Here is some info about l_names and df:
len(l_names)
83
len(df)
83
display(df)
[ Title 0 1
0 Change Genre Melodrama Revenge
1 Change Written by Kwon Soon-won Park Sang-wook
2 Change Directed by Yoon Sung-sik
3 Change Starring Park Si-hoo Jang Hee-jin
4 Change No. of episodes 16
5 Change Running time 60 minutes
6 Change Original network T V Chosun
7 Change Original release January 27 – March 24, 2019,
Title 0 1
0 Change Genre Romance Comedy
1 Change Written by Jung Do-yoon Oh Seon-hyung
2 Change Directed by Lee Jin-seo Lee So-yeon
3 Change Starring Jang Na-ra Choi Daniel Ryu Jin Kim Min-seo
4 Change No. of episodes 20
5 Change Running time Mondays and Tuesdays at 21:55 ( K S T)
6 Change Original network Korean Broadcasting System
7 Change Original release 2 May –5 July 2011,
Title 0 1
0 Change Genre Mystery Thriller Suspense
1 Change Directed by Kim Yong-soo
2 Change Starring Cho Yeo-jeong Kim Min-jun Shin Yoon-joo ...
3 Change No. of episodes 4
4 Change Running time 61-65 minutes
5 Change Original network K B S2
6 Change Original release March 14 – March 22, 2016,
Title 0 1
0 Change Genre Melodrama Comedy Romance
1 Change Written by Yoon Sung-hee
2 Change Directed by Lee Joon-hyung
3 Change Starring Ji Chang-wook Wang Ji-hye Kim Young-kwang P...
4 Change No. of episodes 24
5 Change Running time Wednesdays and Thursdays at 21:20 ( K S T)
6 Change Original network Channel A
7 Change Original release December 21, 2011 – March 8, 2012,
I want to replace 'Change' with TV show names which are stored in l_names.
For this example, only four TV shows will be given but I have 83 in total.
print(l_names)
['Babel', 'Baby Faced Beauty', 'Babysitter', "Bachelor's Vegetable Store"]
But when I try to plug l_names into my for loop code as the values, I get an error.
dfr.insert(loc=0, column='Title', value=l_names)
df.append(dfr)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [96], in <cell line: 19>()
29 dfr = dfr.dropna()
30 dfr = dfr.reset_index(drop=True)
---> 31 dfr.insert(loc=0, column='Title', value=l_names)
32 df.append(dfr)
File ~/anaconda3/envs/beans/lib/python3.9/site-packages/pandas/core/frame.py:4444, in DataFrame.insert(self, loc, column, value, allow_duplicates)
4441 if not isinstance(loc, int):
4442 raise TypeError("loc must be int")
-> 4444 value = self._sanitize_column(value)
4445 self._mgr.insert(loc, column, value)
File ~/anaconda3/envs/beans/lib/python3.9/site-packages/pandas/core/frame.py:4535, in DataFrame._sanitize_column(self, value)
4532 return _reindex_for_setitem(value, self.index)
4534 if is_list_like(value):
-> 4535 com.require_length_match(value, self.index)
4536 return sanitize_array(value, self.index, copy=True, allow_2d=True)
File ~/anaconda3/envs/beans/lib/python3.9/site-packages/pandas/core/common.py:557, in require_length_match(data, index)
553 """
554 Check the length of data matches the length of the index.
555 """
556 if len(data) != len(index):
--> 557 raise ValueError(
558 "Length of values "
559 f"({len(data)}) "
560 "does not match length of index "
561 f"({len(index)})"
562 )
ValueError: Length of values (83) does not match length of index (8)
I also tried adding a for loop inside my for loop:
for x in l_names:
    dfr.insert(loc=0, column='Title', value=x)
    df.append(dfr)
I get this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [97], in <cell line: 19>()
30 dfr = dfr.reset_index(drop=True)
31 for x in l_names:
---> 32 dfr.insert(loc=0, column='Title', value=x)
33 df.append(dfr)
File ~/anaconda3/envs/beans/lib/python3.9/site-packages/pandas/core/frame.py:4440, in DataFrame.insert(self, loc, column, value, allow_duplicates)
4434 raise ValueError(
4435 "Cannot specify 'allow_duplicates=True' when "
4436 "'self.flags.allows_duplicate_labels' is False."
4437 )
4438 if not allow_duplicates and column in self.columns:
4439 # Should this be a different kind of error??
-> 4440 raise ValueError(f"cannot insert {column}, already exists")
4441 if not isinstance(loc, int):
4442 raise TypeError("loc must be int")
ValueError: cannot insert Title, already exists
I also added allow_duplicates = True and all that did was just make the Titles and names repeat over and over again.
I also have tried other methods to add in the title name.
But my lack of skill in using pandas has led me to this dead end.
Thanks again for your help and expertise.
Solution 1: After you create df with the 83 dataframes in it, you can loop over df and update the Title column values.
for i, dfr in enumerate(df):
    dfr['Title'] = l_names[i]
Solution 2: In your loop you don't need an extra inner loop; just enumerate urls and use the index i to get the title and insert it.
for i, u in enumerate(urls):
    ...
    dfr.insert(loc=0, column='Title', value=l_names[i])
    df.append(dfr)
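For example, here is a sketch of Solution 2 dropped into your existing loop (assuming urls and l_names are aligned, i.e. the i-th URL corresponds to the i-th title):

df = []
for i, u in enumerate(urls):
    req = s.get(u)
    req_soup = BeautifulSoup(req.content, 'lxml')
    req_tables = req_soup.find_all('table', {'class': 'infobox vevent'})
    dfr = pd.concat(pd.read_html(str(req_tables), flavor='bs4', header=0))
    dfr = dfr.drop(index=0)
    dfr.columns = range(dfr.columns.size)
    # ... same column cleanup as in the original loop ...
    dfr = dfr.reset_index(drop=True)
    # a single title string per page, taken by position from l_names
    dfr.insert(loc=0, column='Title', value=l_names[i])
    df.append(dfr)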
I have a csv file in the following format:
30 1964 1 1
30 1962 3 1
30 1965 0 1
31 1959 2 1
31 1965 4 1
33 1958 10 1
33 1960 0 1
34 1959 0 2
34 1966 9 2
34 1958 30 1
34 1960 1 1
34 1961 10 1
34 1967 7 1
34 1960 0 1
35 1964 13 1
35 1963 0 1
The first column denotes the age and the last column denotes the survival rate (1 if the patient survived 5 years or longer; 2 if the patient died within 5 years).
I have to calculate which age has the highest survival rate. I am new to Python and I cannot figure out how to proceed. I was able to calculate the most repeated age using the mode function, but I cannot figure out how to check one column and print the corresponding other column. Please help.
I was able to find an answer for the part where I only had to analyze the first column.
import csv
from statistics import mode
import matplotlib.pyplot as plt
import numpy as np

df = open('Dataset.csv')
csv_df = csv.reader(df)
a = []
b = []
for row in csv_df:
    a.append(row[0])
    b.append(row[3])
print('The age that has maximum reported incidents of cancer is ' + mode(a))
I am not entirely sure whether I understood your logic for determining the age with the maximum survival rate. Assuming that the age with the highest number of 1s has the highest survival rate, the following code is written.
I have done the reading part a little differently, as the data set acted weird when I used csv. If the csv module works fine in your environment, use it. The idea is to retrieve the relevant values from each row; we are interested in the 0th and 3rd columns.
In the following code, we maintain a dictionary, survival_map, and count the frequency of a particular age being associated with a 1.
import operator

survival_map = {}

with open('Dataset.csv', 'r') as in_f:
    for row in in_f:
        row = row.rstrip()  # to remove the end-of-line character
        items = row.split(',')  # I converted the tab spaces to commas, had a problem otherwise
        age = int(items[0])
        survival_rate = int(items[3])
        if survival_rate == 1:
            if age in survival_map:
                survival_map[age] += 1
            else:
                survival_map[age] = 1
Once we build the dictionary, {33: 2, 34: 5, 35: 2, 30: 3, 31: 2}, it is sorted in reverse by value:
sorted_survival_map = sorted(survival_map.items(), key=operator.itemgetter(1), reverse = True)
max_survival = sorted_survival_map[0]
UPDATE:
For a single max value, OP's suggestion (in a comment) is preferred. Posting it here:
maximum = max(survival_map, key=survival_map.get)
print(maximum, survival_map[maximum])
For multiple max values:
max_keys = []
max_value = 0
for k, v in survival_map.items():
    if v > max_value:
        max_keys = [k]
        max_value = v
    elif v == max_value:
        max_keys.append(k)
print([(x, max_value) for x in max_keys])
Of course, this could also be written as a comprehension; however, for readability I am proposing the explicit loop. It makes a single pass over the items in the dictionary instead of going through them multiple times, so the solution has O(n) time complexity.
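For completeness, here is a more compact sketch of the same counting-and-max idea using collections.Counter (assuming the same comma-separated Dataset.csv layout as above):

from collections import Counter

with open('Dataset.csv') as in_f:
    rows = [line.rstrip().split(',') for line in in_f if line.strip()]

# count how many times each age appears together with survival status 1
survival_map = Counter(int(items[0]) for items in rows if int(items[3]) == 1)

max_value = max(survival_map.values())
max_keys = [age for age, count in survival_map.items() if count == max_value]
print([(age, max_value) for age in max_keys])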
I currently have a massive set of datasets, one for each year in the 2000s. I take a combination of three years and run my cleaning code on that combination.
The problem is that, due to the size, I can't run my cleaning code on it because my memory runs out.
I was thinking about splitting the data using something like:
df.ix[1,N/x]
where N is the total number of rows in my dataframe. I think I should then replace the dataframe to free up the memory being used. This does mean I have to load the full dataframe first for each chunk I create.
There are several problems:
How do I get N when N can be different for each year?
The operation requires that groups of data stay together.
Is there a way to make x vary with the size of N?
Is all of this highly inefficient, or is there an efficient built-in function for this?
Dataframe looks like:
ID Location year other variables
1 a 2006
1 a 2007
1 b 2006
1 a 2005
2 c 2005
2 c 2007
3 d 2005
What I need is for all rows with the same ID to stay together, and for the data to be cut into manageable chunks whose size depends on the yearly changing total amount of data.
In this case it would be:
ID Location year other variables
1 a 2006
1 a 2007
1 b 2006
1 a 2005
ID Location year other variables
2 c 2005
2 c 2007
3 d 2005
The data originates from a csv by year. So all 2005 data comes from 2005csv, 2006 data from 2006csv etc.
The csv's are loaded into memory and concatenated to form one set of three years.
The individual csv files have the same setup as indicated above. So each observation is stated with an ID, location and year, followed by a lot of other variables.
Running it on a group-by-group basis would be a bad idea, as there are thousands, if not millions, of these IDs. They can have dozens of locations and a maximum of three years. All of this needs to stay together.
Loops over this many rows take ages in my experience.
I was thinking maybe something along the lines of:
create a variable that counts the number of groups
use the maximum of this count variable and divide it by 4 or 5.
cut the data up in chunks this way
Not sure if this would be efficient, and even less sure how to execute it; a rough sketch of the idea is below.
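A minimal sketch of that idea (assuming the three concatenated years are already in a DataFrame df with an ID column as in the sample above; the function name and chunk count are just illustrative):

def split_into_group_chunks(df, n_chunks=4):
    """Split df into roughly equal chunks while keeping all rows of an ID together."""
    # number each ID group 0..G-1 in order of appearance
    group_no = df.groupby('ID', sort=False).ngroup()
    n_groups = group_no.max() + 1
    # assign each ID group to one of n_chunks buckets
    bucket = (group_no * n_chunks // n_groups).clip(upper=n_chunks - 1)
    return [chunk for _, chunk in df.groupby(bucket, sort=False)]

for chunk in split_into_group_chunks(df, n_chunks=4):
    pass  # run the cleaning code on each chunk, then discard it to free memory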
One way to achieve this would be as follows:
import numpy as np
import pandas as pd

# generating a random DF
num_rows = 100
locs = list('abcdefghijklmno')
df = pd.DataFrame(
    {'id': np.random.randint(1, 100, num_rows),
     'location': np.random.choice(locs, num_rows),
     'year': np.random.randint(2005, 2007, num_rows)})
df.sort_values('id', inplace=True)

print('**** sorted DF (first 10 rows) ****')
print(df.head(10))

# chopping the DF into chunks of `chunk_size` unique ids each ...
chunk_size = 5
chunks = [i for i in df.id.unique()[::chunk_size]]
chunk_margins = [(chunks[i-1], chunks[i]) for i in range(1, len(chunks))]
df_chunks = [df.loc[(df.id >= x[0]) & (df.id < x[1])] for x in chunk_margins]

print('**** first chunk ****')
print(df_chunks[0])
Output:
**** sorted DF (first 10 rows) ****
id location year
31 2 c 2005
85 2 e 2006
89 2 l 2006
70 2 i 2005
60 4 n 2005
68 7 g 2005
22 7 e 2006
73 10 i 2005
23 10 j 2006
47 16 n 2005
**** first chunk ****
id location year
31 2 c 2005
85 2 e 2006
89 2 l 2006
70 2 i 2005
60 4 n 2005
68 7 g 2005
22 7 e 2006
73 10 i 2005
23 10 j 2006
47 16 n 2005
6 16 k 2006
82 16 g 2005
Use chunked pandas by importing Blaze.
Instructions from http://blaze.readthedocs.org/en/latest/ooc.html
Naive use of Blaze triggers out-of-core systems automatically when called on large files.
from blaze import Data

d = Data('my-small-file.csv')
d.my_column.count()  # Uses Pandas

d = Data('my-large-file.csv')
d.my_column.count()  # Uses Chunked Pandas
How does it work?
Blaze breaks up the data resource into a sequence of chunks. It pulls one chunk into memory, operates on it, pulls in the next, and so on. After all chunks are processed, it often has to finalize the computation with another operation on the intermediate results.
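If you would rather stay in plain pandas, the same chunk-at-a-time pattern can be sketched with read_csv's chunksize argument (file and column names here are just illustrative):

import pandas as pd

total = 0
# pull one chunk of rows into memory at a time, operate on it, then move on
for chunk in pd.read_csv('my-large-file.csv', chunksize=100000):
    total += chunk['my_column'].count()

# finalize the computation on the intermediate results
print(total)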
So, I have a .csv file which looks like this:
station_id year january february ... december
210018 1916 nodata 221 417a
210018 1917 17b 98 44
....
210252 1910 54e 110 nodata
210252 1911 99d 24i 77
...
I need to extract the letters a to i (a-i) from the data. These letters denote the number of missing days in a month: a means 1 missing day and i means 9 missing days. Right now I don't care about 'nodata' cells. After extracting the letters from the data cells, I want to calculate the total number of missing days per row (per station and year):
station_id year january february ... december N_missingdays
210018 1916 nodata 221 417 1(a)
210018 1917 17 98 44 11(b+i)
....
210252 1910 54 110 nodata 8(e+c)
210252 1911 99 24 77 13(d+i)
Probably the best way to do it is to create a dictionary with station_id, year and the number of missing days. Here is what I was trying to do:
from csv import reader

with open('filepath') as file:
    file_reader = reader(file)
    for i, row in enumerate(file_reader):
        for j, item in enumerate(row):
            if item[len(item)-1] == 'a':
                file_reader[i][j] = ''
    print(file_reader)
But this code just deletes letters from the data and it doesn't work correctly. I don't know exactly how to extract the letters from the .csv file and compute what they mean.
The other thing I was trying to do is this:
from csv import reader

with open('filepath') as file:
    file_reader = reader(file)
    next(file_reader)
    letters_dict = {}
    for row in file_reader:
        station_id, year, months = row[1], row[2], row[4:]
        letters_dict[station_id, year] = months.count('[0-9][a]') + ... + months.count('[0-9][i]') + letters_dict.get(year, 0) + letters_dict.get(station_id, 0)
But this code only writes zeroes into the dictionary.
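A sketch of the letter-extraction idea (assuming the columns are in the order shown above, the values are comma-separated as in my reader code, and each month cell ends in at most one letter a-i) could look like this:

import csv
import re

missing_days = {}  # (station_id, year) -> total number of missing days

with open('filepath') as f:
    file_reader = csv.reader(f)
    next(file_reader)  # skip the header row
    for row in file_reader:
        station_id, year, months = row[0], row[1], row[2:]
        total = 0
        for cell in months:
            m = re.search(r'\d([a-i])$', cell)  # a digit followed by a trailing letter a-i; 'nodata' never matches
            if m:
                # a -> 1 missing day, b -> 2, ..., i -> 9
                total += ord(m.group(1)) - ord('a') + 1
        missing_days[(station_id, year)] = total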
I have a csv file with 2 columns, representing a distribution of items per year which looks like this:
A B
1900 10
1901 2
1903 5
1908 8
1910 25
1925 3
1926 4
1928 1
1950 10
etc, about 15000 lines.
When making a distribution diagram based on this data, there are too many points on the axis, which is not very pretty. I want to group the rows into blocks of 25 years, so that at the end I have fewer points on the axis.
So, for example, from 1900 till 1925 I would have the sum of produced items, one row in column A and one row in column B:
1925 53
1950 15
So far I have only figured out how to convert the data in the csv file to int:
import csv

o = open('/dates_dist.csv', 'rU')
mydata = csv.reader(o)

def int_wrapper(mydata):
    for v in reader:
        yield map(int, v)

reader = int_wrapper(mydata)
I can't figure out how to do it further...
You could use itertools.groupby:
import itertools as IT
import csv

def int_wrapper(mydata):
    for v in mydata:
        yield map(int, v)

with open('data', 'rU') as o:
    mydata = csv.reader(o)
    header = next(mydata)
    reader = int_wrapper(mydata)
    for key, group in IT.groupby(reader, lambda row: (row[0]-1)//25+1):
        year = key*25
        total = sum(row[1] for row in group)
        print(year, total)
yields
(1900, 10)
(1925, 43)
(1950, 15)
Note that 1900 to 1925 (inclusive) spans 26 years, not 25. So
if you want to group 25 years, given the way you are reporting the totals, you probably want the half-open interval (1900, 1925].
The expression row[0]//25 takes the year and integer divides by 25.
This number will be the same for all numbers in the range [1900, 1925).
To make the range half-open on the left, subtract and add 1: (row[0]-1)//25+1.
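To see the effect, here is a quick check of both expressions on a few boundary years:

for year in (1900, 1901, 1924, 1925, 1926, 1950):
    print(year, year // 25, (year - 1) // 25 + 1)

# 1900 76 76
# 1901 76 77
# 1924 76 77
# 1925 77 77
# 1926 77 78
# 1950 78 78

Years 1901 through 1925 all land in group 77 under the shifted expression, which is exactly the half-open interval (1900, 1925].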
Here is my approach. It's definitely not the most elegant Python code, but it could be a way to achieve the desired output.
if __name__ == '__main__':
    o = open('dates_dist.csv', 'rU')
    lines = o.read().split("\n")  # Create a list holding each line of the file
    out_dict = {}
    curr_count = 0
    chunk_sz = 25  # years
    if len(lines) > 0:
        line_split = lines[0].split(",")
        start_year = int(line_split[0])
        curr_count = 0
    # Iterate over each line of the file
    for line in lines:
        # Split at the comma to get the year and the count.
        # line_split[0] will be the year and line_split[1] will be the count.
        line_split = line.split(",")
        curr_year = int(line_split[0])
        time_delta = curr_year - start_year
        if time_delta < chunk_sz or time_delta == chunk_sz:
            curr_count = curr_count + int(line_split[1])
        else:
            out_dict[start_year + chunk_sz] = curr_count
            start_year = start_year + chunk_sz
            curr_count = int(line_split[1])
        # print(curr_year, curr_count)
    out_dict[start_year + chunk_sz] = curr_count
    print(out_dict)
You could create a dummy column and group by it after doing some integer division:
df['temp'] = df['A'] // 25
>>> df
A B temp
0 1900 10 76
1 1901 2 76
2 1903 5 76
3 1908 8 76
4 1910 25 76
5 1925 3 77
6 1926 4 77
7 1928 1 77
8 1950 10 78
>>> df.groupby('temp').sum()
A B
temp
76 9522 50
77 5779 8
78 1950 10
My numbers are slightly different from yours since I am technically grouping from 1900-1924, 1925-1949, and 1950-1974, but the idea is the same.
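If you want the group labels to read as the last year of each block rather than the raw quotient, one possible follow-up (sticking with the same integer-division idea; the end-year labelling is just one choice) is:

import pandas as pd

df = pd.DataFrame({'A': [1900, 1901, 1903, 1908, 1910, 1925, 1926, 1928, 1950],
                   'B': [10, 2, 5, 8, 25, 3, 4, 1, 10]})

# label each 25-year block by its last year, e.g. 1900-1924 -> 1924
grouped = df.groupby(df['A'] // 25 * 25 + 24)['B'].sum()
print(grouped)
# A
# 1924    50
# 1949     8
# 1974    10
# Name: B, dtype: int64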