Analyze data using Python

I have a csv file in the following format:
30 1964 1 1
30 1962 3 1
30 1965 0 1
31 1959 2 1
31 1965 4 1
33 1958 10 1
33 1960 0 1
34 1959 0 2
34 1966 9 2
34 1958 30 1
34 1960 1 1
34 1961 10 1
34 1967 7 1
34 1960 0 1
35 1964 13 1
35 1963 0 1
The first column denotes the age and the last column denotes the survival status (1 if the patient survived 5 years or longer; 2 if the patient died within 5 years).
I have to calculate which age has the highest survival rate. I am new to python and I cannot figure out how to proceed. I was able to calculate the most repeated age using the mode function but I cannot figure out how to check one column and print the corresponding other column. Please help.
I was able to find an answer where I had to analyze just the first row.
import csv
from statistics import mode

df = open('Dataset.csv')
csv_df = csv.reader(df)
a = []
b = []
for row in csv_df:
    a.append(row[0])
    b.append(row[3])
print('The age that has maximum reported incidents of cancer is ' + mode(a))

I am not entirely sure I understood your logic for determining the age with the maximum survival rate. Assuming that the age with the highest number of 1s has the highest survival rate, the following code does that.
I have done the reading part a little differently, as the data set acted weird when I used csv. If the csv module works fine in your environment, use it. The idea is to retrieve each value in each row; we are interested in the 0th and 3rd columns.
In the following code we maintain a dictionary, survival_map, and count how often a particular age is associated with a 1.
import operator

survival_map = {}
with open('Dataset.csv', 'r') as in_f:
    for row in in_f:
        row = row.rstrip()       # remove the trailing newline character
        items = row.split(',')   # I converted the tab spacing to commas; I had a problem otherwise
        age = int(items[0])
        survival_rate = int(items[3])
        if survival_rate == 1:
            if age in survival_map:
                survival_map[age] += 1
            else:
                survival_map[age] = 1
Once we have built the dictionary, {33: 2, 34: 5, 35: 2, 30: 3, 31: 2}, it is sorted in reverse by value:
sorted_survival_map = sorted(survival_map.items(), key=operator.itemgetter(1), reverse = True)
max_survival = sorted_survival_map[0]
UPDATE:
For a single max value, OP's suggestion (in a comment) is preferred. Posting it here:
maximum = max(survival_map, key=survival_map.get)
print(maximum, survival_map[maximum])
For multiple max values:
max_keys = []
max_value = 0
for k, v in survival_map.items():
    if v > max_value:
        max_keys = [k]
        max_value = v
    elif v == max_value:
        max_keys.append(k)
print([(x, max_value) for x in max_keys])
Of course, this could be achieved by a dictionary comprehension; however, for readability I am proposing this. It also makes a single pass over the dictionary rather than several, so the solution runs in O(n) time.
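As a side note, the whole count-and-lookup can also be written in one pass with `collections.Counter` from the standard library. A small sketch, using a subset of the sample rows above in place of reading the file:

```python
from collections import Counter

# A subset of the sample rows above, standing in for reading Dataset.csv.
rows = [
    (30, 1964, 1, 1), (30, 1962, 3, 1), (30, 1965, 0, 1),
    (31, 1959, 2, 1), (31, 1965, 4, 1),
    (34, 1959, 0, 2), (34, 1966, 9, 2), (34, 1958, 30, 1),
]

# Count how often each age appears with a survival status of 1.
survivors = Counter(age for age, year, nodes, status in rows if status == 1)

# most_common(1) returns [(key, count)] for the highest count.
best_age, count = survivors.most_common(1)[0]
print(best_age, count)  # -> 30 3
```

Counter handles both the frequency counting and the "which key has the largest value" question, so no explicit sorting is needed.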


Extracting Data from non-Structured String in Python

I'm working with a Job Board dataset and I intend to turn the "Salary Offered" column into something I can use for calculations, comparisons and predictions. I have five different cases for the data in the column:
- Yearly salaries within a range (YSWR)
i.e. £15,000 - £17,000 per annum
- Hourly salaries within a range (HSWR)
i.e. £22.00 - £26.00 per hour
- Yearly salaries with specific values (YSWSV)
i.e. £18,323 per annum
- Hourly salaries with specific values (HSWSV)
i.e. £26.00 per hour
- Salary not specified / Salary negotiable / Competitive salary
I need to preprocess this field into:
- One column that indicates whether the salary is yearly or hourly
- Two columns indicating the minimum/maximum salary (0 for non-specified values, and equal values for the cases that are not a range)
Any idea where to start? I am working with Python and pandas. I am a beginner when it comes to data preprocessing.
Thanks in advance.
Felix
You can use a regular expression to get the values and then implement the logic using if/else:
import re
import pandas as pd

df = pd.DataFrame([['£16,000 per annum'],
                   ['£25.0 per annum'],
                   ['£19,000 per annum'],
                   ['£26.0 per annum'],
                   ['Salary not specified'],
                   ['Competetive salary']], columns=['salary_offered'])

def apply_conditions(s):
    v = re.findall(r'^£(\d+,?\d+\.?\d*)', s)
    if len(v) == 0:  # salary not specified
        return [0, 0, 'not specified']  # [min, max, value]
    else:
        v = v[0]
        # replace ',' with '' so that we can parse the number
        v = v.replace(',', '')
        v = float(v)
        if 15000 < v < 17000:
            return [15000, 17000, v]
        elif 22 < v < 26:
            return [22, 26, v]
        elif v == 18323:
            return [18323, 18323, v]
        elif v == 26:
            return [26, 26, v]
        else:
            return [0, 0, 'not in range']

df['salary_offered'] = df['salary_offered'].apply(apply_conditions)
df = pd.DataFrame(df['salary_offered'].to_list(), columns=['minimum', 'maximum', 'value'])
from
salary_offered
0 £16,000 per annum
1 £25.0 per annum
2 £19,000 per annum
3 £26.0 per annum
4 Salary not specified
5 Competetive salary
to
minimum maximum value
0 15000 17000 16000
1 22 26 25
2 0 0 not in range
3 26 26 26
4 0 0 not specified
5 0 0 not specified
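For the general problem (rather than hard-coded ranges), pandas string methods can extract the amounts and the pay period directly. This is only a sketch under my own column names (minimum, maximum, period), not part of the original answer:

```python
import pandas as pd

# Hypothetical sample covering the five cases described in the question.
df = pd.DataFrame({'salary_offered': [
    '£15,000 - £17,000 per annum',
    '£22.00 - £26.00 per hour',
    '£18,323 per annum',
    '£26.00 per hour',
    'Salary not specified',
]})

# Strip thousands separators, then pull every £-amount out of each cell.
# extractall gives one row per match; unstack spreads the matches into
# columns 0 (first amount) and 1 (second amount, NaN when absent).
amounts = (df['salary_offered']
           .str.replace(',', '', regex=False)
           .str.extractall(r'£(\d+(?:\.\d+)?)')[0]
           .astype(float)
           .unstack()
           .reindex(df.index))   # bring back rows that had no match at all

df['minimum'] = amounts[0].fillna(0)
df['maximum'] = amounts.max(axis=1).fillna(0)   # equals minimum for single values
df['period'] = (df['salary_offered']
                .str.extract(r'per (annum|hour)', expand=False)
                .fillna('not specified'))
print(df[['minimum', 'maximum', 'period']])
```

Because single amounts produce only one match, minimum and maximum naturally coincide for them, and rows with no match fall through to 0 / "not specified".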

Summarising features with multiple values in Python for Machine Learning model

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed amount of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements, however this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. It is a bit of a complicated query, but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
   .apply(lambda d: d.assign(tm=(d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                     .groupby('tm')['abdomCirc']
                     .apply(max))
   .unstack()
)
produces
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 315.0 350.0
2 2 180.0 NaN NaN
Let's unpick this a bit. First we group by MotherID and PregnancyID. Then we apply a function to each grouped dataframe (d).
For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we group by 'tm' and apply max. For each sub-dataframe d we then obtain a Series mapping tm to max(abdomCirc).
Then unstack() moves tm into the column names.
You may want to rename these columns later, but I did not bother.
Solution 2
Come to think of it you can simplify the above a bit:
(df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns='gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
similar idea, same output.
There is a handy method called query. This should do your work for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (without manually changing the values of your IDs, MotherID and PregnancyID, for each different group of rows), you have to combine it with groupby (as you did on your own).
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
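Another option, not from the answers above, is to let pd.cut label each measurement with its trimester using explicit week boundaries, then pivot. A sketch on the question's sample data:

```python
import pandas as pd

# The question's sample data.
df = pd.DataFrame({
    'PregnancyID': [0, 0, 1, 1, 1, 2, 2, 2],
    'MotherID':    [0, 0, 1, 1, 1, 2, 2, 2],
    'gestationalAgeInWeeks': [14, 21, 20, 25, 30, 8, 9, 18],
    'abdomCirc':   [150, 200, 294, 315, 350, 170, 180, None],
})

# pd.cut with explicit bin edges labels each row with its trimester
# (weeks 1-13, 14-26, 27-40); bins are right-closed by default.
df['trimester'] = pd.cut(df['gestationalAgeInWeeks'],
                         bins=[0, 13, 26, 40],
                         labels=['abdomCirc1st', 'abdomCirc2nd', 'abdomCirc3rd'])

# observed=False keeps a column even for a trimester with no measurements.
out = df.pivot_table(index=['MotherID', 'PregnancyID'],
                     columns='trimester',
                     values='abdomCirc',
                     aggfunc='max',
                     observed=False)
print(out)
```

The advantage over the integer-division trick is that the week boundaries are stated explicitly, so an off-by-one in the trimester definition is easy to spot and adjust.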

How can I search for and extract a particular value from a dataframe in python?

I have a dataframe called "oat" - here is a piece of it:
Name Age Year T Neigh One Neigh Two
0 Carl P 31 1998 0.1 5454 657
1 Tyler A 26 2012 3.9 578 98
2 Antoine G 20 1997 1.7 17 9878
3 Travis A 23 2008 3.2 199 398
4 Geoff H 22 1980 -0.3 901 7650
5 David C 28 2014 4.5 8001 72
6 Antoine G 21 1998 2.3 5678 9800
7 Tyler A 25 2011 3.1 2245 450
I'm trying to run a for loop through each row. The values in the "Neigh One" column refer to the index of another row, which, based on particular variables, leads to another row from which I'd like to extract a value.
Here's what I've tried:
for index, row in oat.iterrows():
    indice = row['Neigh One']
    name = oat.iloc[indice]["Name"]
    age = oat.iloc[indice]["Age"]
    age_plus_one = age + 1
    new = oat.loc[(oat.Name == name) & (oat.Age == age_plus_one), 'T'].tolist()[0]
    print(new)
I am getting an error message from the last variable I try, "new." Basically I am looping through each row, and based on the "Neigh One" value, it will go to that index, and extract the name and age and then add 1. From there, I am looking to find the new row with that same name, but with one added to the age.
Note: There is either zero rows that will match this, or only one row. It would be impossible to have more than one match.
All I want to do is, for each loop, simply return the value of 'T' that comes back based on my boolean filter.
I have also tried the following for the final variable, with the error messages that each returns:
new= oat[(oat['Name'] == name) & (oat['Age'] == age_plus_one)].T.item()
ValueError: can only convert an array of size 1 to a Python scalar
new = oat[(oat['Name'] == name) & (oat['Age'] == age_plus_one),'T'].values[0]
This is not an error, but it returns a boolean True/False list for the entire dataframe rather than the actual values.
new = oat.loc[(oat.Name == name) & (oat['Age'] == age_plus_one),'T'].values[0]
IndexError: index 0 is out of bounds for axis 0 with size 0
new = oat.loc[(oat.name == name) & (oat.Age == age_plus_one),'T'].tolist()[0]
IndexError: list index out of range
for index, row in oat.iterrows():
    indice = row['Neigh One']
    name = oat.iloc[indice]["Name"]
    age = oat.iloc[indice]["Age"]
    age_plus_one = age + 1
    # -------- below is revised ---------
    mask = (oat.Name == name) & (oat.Age == age_plus_one)
    if mask.sum() == 0:
        new = None
    else:
        new = oat.loc[mask, 'T'].tolist()[0]
    print(new)
As you mentioned, there might be no match for (oat.Name == name) & (oat.Age == age_plus_one), so an if/else is needed to handle both cases.
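If the frame is large, the row-by-row loop can also be replaced by a single self-merge. A sketch on a small hypothetical frame in the shape of oat (the values below are made up, not the real data):

```python
import pandas as pd

# A small hypothetical frame in the shape of "oat".
oat = pd.DataFrame({
    'Name': ['Carl P', 'Tyler A', 'Antoine G', 'Antoine G', 'Tyler A'],
    'Age':  [31, 26, 20, 21, 25],
    'T':    [0.1, 3.9, 1.7, 2.3, 3.1],
    'Neigh One': [3, 2, 0, 1, 4],
})

# Take each row's neighbour by position, bump the neighbour's age by one,
# then left-merge back onto the frame to find the (Name, Age + 1) row, if any.
neigh = (oat.iloc[oat['Neigh One'].to_numpy()][['Name', 'Age']]
            .reset_index(drop=True))
neigh['Age'] += 1
result = neigh.merge(oat[['Name', 'Age', 'T']], on=['Name', 'Age'], how='left')
print(result['T'])  # NaN wherever no matching row exists
```

The left merge preserves one output row per input row, so rows with no (Name, Age + 1) match simply carry NaN in 'T' instead of raising an IndexError, which is exactly the zero-or-one-match situation described in the question.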

find minimum from text file

I am new to Python and I am trying to figure out how to get my program to find the minimum after it reads specific columns and rows from the file. Can anyone help me with this?
This is an example of what my text file looks like:
05/01 80 2002 5 1966 19 2000 45 2010
06/22 77 1980 4 1945 22 1986 58 2000
---------------------------------------------------------------------------
Day Max Year Min Year Max Year Min Year
---------------------------------------------------------------------------
08/01 79 2002 8 1981 28 1900 54 1988
08/02 79 1989 5 1971 31 1994 60 1998
This is my code(below) that I have so far.
def main():
    file = open('file.txt', 'r')
    for num in file.read().splitlines():
        i = num.split()
        if len(i) > 5:
            print('Day:{}\n'.format(i[0]))
            print('Year:{}\n'.format(i[2]))
            print('Lowest Temperature:{}'.format(i[1]))
This is the output I get from my code (it prints out the header text as well):
Day:Day
Year:Year
Lowest Temperature:Max
Day: 3/11
Year:1920
Lowest Temperature:78
Day:11/02
Year:1974
Lowest Temperature:80
Day:11/03
Year:1974
Lowest Temperature:74
I am trying to find the lowest temperature from my text file and print out the day and the year associated with that temp. My output should look like this. Thanks to everyone who is willing to help me with this.
Day:10/02
Year:1994
Lowest Temperature:55
You can use your current method to read the file into lines, then split each line into individual columns.
You can then make use of min(), using the column containing the minimum temperature (in this case column 3) as the key to min().
with open('test.txt') as f:
    data = f.read().splitlines()

data = [i.split() for i in data if any(j.isdigit() for j in i)]
data = min(data, key=lambda x: int(x[3]))
print('Day: {}\nYear: {}\nLowest Temperature: {}'.format(data[0], data[2], data[3]))
Output for your sample file:
Day: 06/22
Year: 1980
Lowest Temperature: 4

Group rows in a CSV by blocks of 25

I have a csv file with 2 columns, representing a distribution of items per year which looks like this:
A B
1900 10
1901 2
1903 5
1908 8
1910 25
1925 3
1926 4
1928 1
1950 10
etc, about 15000 lines.
When making a distribution diagram based on this data, there are too many points on the axis, which is not very pretty. I want to group the rows into blocks of 25 years, so that at the end I have fewer points on the axis.
So, for example, from 1900 till 1925 I would have the sum of produced items: one row in column A and one in column B:
1925 53
1950 15
So far I only figured how to convert the data in csv file to int:
import csv

o = open('/dates_dist.csv', 'r')
mydata = csv.reader(o)

def int_wrapper(mydata):
    for v in mydata:
        yield map(int, v)

reader = int_wrapper(mydata)
Can't find how to do it further...
You could use itertools.groupby:
import itertools as IT
import csv

def int_wrapper(mydata):
    for v in mydata:
        yield [int(x) for x in v]   # use a list so rows stay subscriptable in Python 3

with open('data', 'r') as o:
    mydata = csv.reader(o)
    header = next(mydata)
    reader = int_wrapper(mydata)
    for key, group in IT.groupby(reader, lambda row: (row[0] - 1) // 25 + 1):
        year = key * 25
        total = sum(row[1] for row in group)
        print(year, total)
yields
1900 10
1925 43
1950 15
Note that 1900 to 1925 (inclusive) spans 26 years, not 25. So
if you want to group 25 years, given the way you are reporting the totals, you probably want the half-open interval (1900, 1925].
The expression row[0]//25 takes the year and integer divides by 25.
This number will be the same for all numbers in the range [1900, 1925).
To make the range half-open on the left, subtract and add 1: (row[0]-1)//25+1.
Here is my approach. It is definitely not the most engaging Python code, but it could be a way to achieve the desired output.
if __name__ == '__main__':
    o = open('dates_dist.csv', 'r')
    lines = o.read().split("\n")  # create a list holding each line of the file
    out_dict = {}
    curr_count = 0
    chunk_sz = 25  # years
    if len(lines) > 0:
        line_split = lines[0].split(",")
        start_year = int(line_split[0])
        curr_count = 0
    # Iterate over each line of the file
    for line in lines:
        # Split at the comma to get the year and the count.
        # line_split[0] is the year and line_split[1] is the count.
        line_split = line.split(",")
        curr_year = int(line_split[0])
        time_delta = curr_year - start_year
        if time_delta <= chunk_sz:
            curr_count += int(line_split[1])
        else:
            out_dict[start_year + chunk_sz] = curr_count
            start_year += chunk_sz
            curr_count = int(line_split[1])
    out_dict[start_year + chunk_sz] = curr_count
    print(out_dict)
You could create a dummy column and group by it after doing some integer division:
df['temp'] = df['A'] // 25
>>> df
A B temp
0 1900 10 76
1 1901 2 76
2 1903 5 76
3 1908 8 76
4 1910 25 76
5 1925 3 77
6 1926 4 77
7 1928 1 77
8 1950 10 78
>>> df.groupby('temp').sum()
A B
temp
76 9522 50
77 5779 8
78 1950 10
My numbers are slightly different from yours since I am technically grouping from 1900-1924, 1925-1949, and 1950-1974, but the idea is the same.
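If you want the group labels to be the block-end years from the question (1925, 1950, ...) rather than opaque bucket numbers, the integer-division trick from the itertools answer carries over to pandas directly:

```python
import pandas as pd

# The sample distribution from the question.
df = pd.DataFrame({'A': [1900, 1901, 1903, 1908, 1910, 1925, 1926, 1928, 1950],
                   'B': [10, 2, 5, 8, 25, 3, 4, 1, 10]})

# Map each year to the end of its 25-year block, half-open on the left,
# so 1901-1925 -> 1925 and 1926-1950 -> 1950.
block_end = ((df['A'] - 1) // 25 + 1) * 25
out = df.groupby(block_end)['B'].sum()
print(out)
```

Grouping by the derived Series keeps the index of the result as the block-end years, which can then be plotted directly.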
