Group rows in a CSV by blocks of 25 - python

I have a csv file with 2 columns, representing a distribution of items per year which looks like this:
A B
1900 10
1901 2
1903 5
1908 8
1910 25
1925 3
1926 4
1928 1
1950 10
etc, about 15000 lines.
When making a distribution diagram based on this data, there are too many points on the axis, which is not very pretty. I want to group rows by blocks of 25 years, so that I end up with fewer points on the axis.
So, for example, from 1900 till 1925 I would have a sum of produced items, 1 row in A column and 1 row in B column:
1925 53
1950 15
So far I only figured how to convert the data in csv file to int:
o=open('/dates_dist.csv', 'rU')
mydata = csv.reader(o)
def int_wrapper(mydata):
    for v in reader:
        yield map(int, v)
reader = int_wrapper(mydata)
Can't find how to do it further...

You could use itertools.groupby:
import itertools as IT
import csv

def int_wrapper(mydata):
    # Convert every field of each row to int
    for v in mydata:
        yield [int(x) for x in v]

with open('data', newline='') as o:
    mydata = csv.reader(o)
    header = next(mydata)  # skip the header row
    reader = int_wrapper(mydata)
    # Consecutive rows with the same key are summed into one block
    for key, group in IT.groupby(reader, lambda row: (row[0]-1)//25+1):
        year = key*25
        total = sum(row[1] for row in group)
        print(year, total)
yields
1900 10
1925 43
1950 15
Note that 1900 to 1925 (inclusive) spans 26 years, not 25. So if you want to group 25 years, given the way you are reporting the totals, you probably want the half-open interval (1900, 1925].
The expression row[0]//25 takes the year and integer divides by 25.
This number will be the same for all numbers in the range [1900, 1925).
To make the range half-open on the left, subtract and add 1: (row[0]-1)//25+1.
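To make the difference between the two keys concrete, here is a small sketch that applies both expressions to a few years around the boundary. The helper names closed_left_key and closed_right_key are made up for this illustration and are not part of the answer above.

```python
def closed_left_key(year, size=25):
    # Years in [1900, 1925) share a key; 1925 starts the next block
    return year // size

def closed_right_key(year, size=25):
    # Years in (1900, 1925] share a key; 1925 still belongs to the block ending in 1925
    return (year - 1) // size + 1

years = [1900, 1901, 1924, 1925, 1926]
left = [closed_left_key(y) for y in years]
right = [closed_right_key(y) for y in years]
print(left)   # 1925 changes key here
print(right)  # 1901 changes key here instead
```

Multiplying the second key by 25 gives the last year of each block, which is why the loop above reports the totals under 1900, 1925 and 1950.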

Here is my approach. It's definitely not the most engaging Python code, but it could be a way to achieve the desired output.
if __name__ == '__main__':
    # Assumes a comma-separated file with no header row, sorted by year
    with open('dates_dist.csv') as o:
        # Create a list having each non-empty line of the file
        lines = [line for line in o.read().split("\n") if line.strip()]
    out_dict = {}
    curr_count = 0
    chunk_sz = 25  # years
    if lines:
        start_year = int(lines[0].split(",")[0])
        # Iterate over each line of the file
        for line in lines:
            # Split at comma to get the year and the count.
            # line_split[0] will be the year and line_split[1] will be the count.
            line_split = line.split(",")
            curr_year = int(line_split[0])
            time_delta = curr_year - start_year
            if time_delta <= chunk_sz:
                curr_count = curr_count + int(line_split[1])
            else:
                # Close the current block and start the next one
                out_dict[start_year+chunk_sz] = curr_count
                start_year = start_year+chunk_sz
                curr_count = int(line_split[1])
        # Store the last, still open block
        out_dict[start_year+chunk_sz] = curr_count
    print(out_dict)

You could create a dummy column and group by it after doing some integer division:
df['temp'] = df['A'] // 25
>>> df
      A   B  temp
0  1900  10    76
1  1901   2    76
2  1903   5    76
3  1908   8    76
4  1910  25    76
5  1925   3    77
6  1926   4    77
7  1928   1    77
8  1950  10    78
>>> df.groupby('temp').sum()
         A   B
temp
76    9522  50
77    5779   8
78    1950  10
My numbers are slightly different from yours since I am technically grouping from 1900-1924, 1925-1949, and 1950-1974, but the idea is the same.
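As a variation on the same idea, here is a minimal sketch that groups by a computed Series instead of a temporary column and sums only B, so the years in A are not added up. It assumes pandas is installed and rebuilds the sample data by hand; the name block_start is mine.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1900, 1901, 1903, 1908, 1910, 1925, 1926, 1928, 1950],
    "B": [10, 2, 5, 8, 25, 3, 4, 1, 10],
})

# Label each row with the first year of its 25-year block: 1900, 1925, 1950, ...
block_start = (df["A"] // 25) * 25

# Group by the computed labels and sum only the item counts
totals = df.groupby(block_start)["B"].sum()
print(totals)
```

The blocks are still 1900-1924, 1925-1949 and 1950-1974, but the index now shows a year rather than the raw quotient.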

Related

write rows in pandas dataframe and append it to existing dataframe

I have the output of my script as year and the count of word from an article in that particular year :
abcd
2013
118
2014
23
xyz
2013
1
2014
45
I want to have each year added as a new column to my existing dataframe which contains only words.
Expected output:
Terms  2013  2014  2015
abc     118    76    90
xyz      23     0    36
The input for my script was a csv file :
Terms
xyz
abc
efg
The script I wrote is :
import urllib.request
import xml.etree.ElementTree as ET
import pandas as pd

df = pd.read_csv('a.csv', header = None)
for row in df.itertuples():
    term = (str(row[1]))
    u = "http: term=%s&mindate=%d/01/01&maxdate=%d/12/31"
    print(term)
    startYear = 2013
    endYear = 2018
    for year in range(startYear, endYear+1):
        url = u % (term.replace(" ", "+"), year, year)
        page = urllib.request.urlopen(url).read()
        doc = ET.XML(page)
        count = doc.find("Count").text
        print(year)
        print(count)
The df.head is :
0
0 1,2,3-triazole
1 16s rrna gene amplicons
Any help will be greatly appreciated, thanks in advance !!
I would read the CSV into a NumPy array, reshape it with NumPy, and then convert the resulting 2D array to a DataFrame.
Something like this should do it:
#!/usr/bin/env python
def mkdf(filename):
    def combine(term, l):
        d = {"term": term}
        d.update(dict(zip(l[::2], l[1::2])))
        return d
    term = None
    other = []
    with open(filename) as I:
        n = 0
        for line in I:
            line = line.strip()
            try:
                int(line)
            except Exception as e:
                # not an int
                if term: # if we have one, create the record
                    yield combine(term, other)
                term = line
                other = []
                n = 0
            else:
                if n > 0:
                    other.append(line)
            n += 1
    # and the last one
    yield combine(term, other)
if __name__ == "__main__":
    import pandas as pd
    import sys
    df = pd.DataFrame([r for r in mkdf(sys.argv[1])])
    print(df)
usage: python scriptname.py /tmp/IN ( or other file with your data)
Output:
  2013 2014  term
0  118   23  abcd
1    1   45   xyz
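Another option is to skip the printed output entirely and collect the counts inside the original loop. The sketch below is only an illustration: collect_counts and fake_fetch_count are hypothetical names, and fake_fetch_count stands in for the real web request, returning the numbers from the sample output in the question.

```python
def collect_counts(terms, years, fetch_count):
    """Return one dict per term: {"Terms": term, year: count, ...}."""
    records = []
    for term in terms:
        record = {"Terms": term}
        for year in years:
            record[year] = fetch_count(term, year)
        records.append(record)
    return records

# Stand-in for the real lookup; values taken from the sample output above
sample = {("abcd", 2013): 118, ("abcd", 2014): 23,
          ("xyz", 2013): 1, ("xyz", 2014): 45}

def fake_fetch_count(term, year):
    return sample[(term, year)]

records = collect_counts(["abcd", "xyz"], [2013, 2014], fake_fetch_count)
print(records)
```

Passing the list to pd.DataFrame(records) should then give one row per term and one column per year.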

find minimum from text file

I am new to Python and I am trying to figure out how to get my program to find the minimum after it reads specific columns and each rows from the file. Can anyone help me with this?
This is an example of what my text file looks like:
05/01 80 2002 5 1966 19 2000 45 2010
06/22 77 1980 4 1945 22 1986 58 2000
---------------------------------------------------------------------------
Day Max Year Min Year Max Year Min Year
---------------------------------------------------------------------------
08/01 79 2002 8 1981 28 1900 54 1988
08/02 79 1989 5 1971 31 1994 60 1998
This is my code(below) that I have so far.
def main():
    file = open('file.txt', 'r')
    for num in file.read().splitlines():
        i = num.split()
        if len(i) > 5:
            print('Day:{}\n'.format(i[0]))
            print('Year:{}\n'.format(i[2]))
            print('Lowest Temperature:{}'.format(i[1]))
This is the output I get from my code. (it prints out text as well) :
Day:Day
Year:Year
Lowest Temperature:Max
Day: 3/11
Year:1920
Lowest Temperature:78
Day:11/02
Year:1974
Lowest Temperature:80
Day:11/03
Year:1974
Lowest Temperature:74
I am trying to find the lowest temperature from my text file and print out the day and the year associated with that temp. My output should look like this. Thanks to everyone who is willing to help me with this.
Day:10/02
Year:1994
Lowest Temperature:55
You can use your current method to read the file into lines, then split each line into individual columns.
You can then use min() with a key function that reads the minimum-temperature field (index 3, i.e. the fourth field of each row).
with open('test.txt') as f:
    data = f.read().splitlines()
data = [i.split() for i in data if any(j.isdigit() for j in i)]
data = min(data, key=lambda x: int(x[3]))
print('Day: {}\nYear: {}\nLowest Temperature: {}' .format(data[0], data[2], data[3]))
Output for your sample file:
Day: 06/22
Year: 1980
Lowest Temperature: 4
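One thing to double-check is which year column goes with the minimum. Going by the header row (Day Max Year Min Year ...), the year right after Min would be index 4 rather than index 2. Here is a sketch under that assumption; the function name lowest and its parameters are made up for the example, and the sample text is copied from the question.

```python
sample = """\
05/01 80 2002 5 1966 19 2000 45 2010
06/22 77 1980 4 1945 22 1986 58 2000
---------------------------------------------------------------------------
Day Max Year Min Year Max Year Min Year
---------------------------------------------------------------------------
08/01 79 2002 8 1981 28 1900 54 1988
08/02 79 1989 5 1971 31 1994 60 1998
"""

def lowest(lines, temp_idx=3, year_idx=4):
    rows = []
    for line in lines:
        fields = line.split()
        # Keep only data rows: nine fields with a numeric temperature field
        if len(fields) == 9 and fields[temp_idx].lstrip("-").isdigit():
            rows.append(fields)
    best = min(rows, key=lambda f: int(f[temp_idx]))
    return best[0], best[year_idx], int(best[temp_idx])

day, year, temp = lowest(sample.splitlines())
print('Day: {}\nYear: {}\nLowest Temperature: {}'.format(day, year, temp))
```

If the year you want really is the one in the third field, pass year_idx=2 instead.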

Analyze data using python

I have a csv file in the following format:
30 1964 1 1
30 1962 3 1
30 1965 0 1
31 1959 2 1
31 1965 4 1
33 1958 10 1
33 1960 0 1
34 1959 0 2
34 1966 9 2
34 1958 30 1
34 1960 1 1
34 1961 10 1
34 1967 7 1
34 1960 0 1
35 1964 13 1
35 1963 0 1
The first column denotes the age and the last column denotes the survival rate (1 if the patient survived 5 years or longer; 2 if the patient died within 5 years).
I have to calculate which age has the highest survival rate. I am new to python and I cannot figure out how to proceed. I was able to calculate the most repeated age using the mode function but I cannot figure out how to check one column and print the corresponding other column. Please help.
I was able to find an answer where I had to analyze just the first row.
import csv
import matplotlib.pyplot as plt
import numpy as np
df = open('Dataset.csv')
csv_df=csv.reader(df)
a=[]
b=[]
for row in csv_df:
    a.append(row[0])
    b.append(row[3])
print('The age that has maximum reported incidents of cancer is '+ mode(a))
I am not entirely sure whether I understood your logic for determining the age with the maximum survival rate. The following code assumes that the age with the highest number of 1s has the highest survival rate.
I have done the reading part a little differently, as the data set acted weird when I used csv. If the csv module works fine in your environment, use it. The idea is to retrieve each value in each row; we are interested in the 0th and 3rd columns.
In the following code, we maintain a dictionary, survival_map, and count how often a particular age is associated with a 1.
import operator
survival_map = {}
with open('Dataset.csv') as in_f:
    for row in in_f:
        row = row.rstrip()  # remove the trailing newline character
        if not row:
            continue  # skip blank lines
        items = row.split(',')  # I converted the tab space to a comma, had a problem otherwise
        age = int(items[0])
        survival_rate = int(items[3])
        if survival_rate == 1:
            # Count one more survivor for this age
            if age in survival_map:
                survival_map[age] += 1
            else:
                survival_map[age] = 1
Once we build the dictionary, {33: 2, 34: 5, 35: 2, 30: 3, 31: 2}, it is sorted by count (the value, not the key) in descending order:
sorted_survival_map = sorted(survival_map.items(), key=operator.itemgetter(1), reverse = True)
max_survival = sorted_survival_map[0]
UPDATE:
For a single max value, OP's suggestion (in a comment) is preferred. Posting it here:
maximum = max(survival_map, key=survival_map.get)
print(maximum, survival_map[maximum])
For multiple max values
max_keys = []
max_value = 0
for k, v in survival_map.items():
    if v > max_value:
        # New maximum: start a fresh list of keys
        max_keys = [k]
        max_value = v
    elif v == max_value:
        max_keys.append(k)
print([(x, max_value) for x in max_keys])
Of course, this could be written as a comprehension; however, I am proposing the explicit loop for readability. It also makes a single pass over the dictionary instead of several, so it runs in O(n) time.
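If "survival rate" is meant as a proportion rather than a raw count, a possible variant is to divide the number of 1s by the number of patients at each age. This is only a sketch of that reading of the question: survival_rates is a made-up name, and the sample is a few rows from the question rewritten with commas.

```python
import csv
import io
from collections import defaultdict

# A few rows from the question, assumed to be comma-separated
sample = io.StringIO(
    "30,1964,1,1\n"
    "30,1962,3,1\n"
    "34,1959,0,2\n"
    "34,1966,9,2\n"
    "34,1958,30,1\n"
    "35,1964,13,1\n"
)

def survival_rates(rows):
    survived = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        age, status = int(row[0]), int(row[3])
        total[age] += 1
        if status == 1:
            survived[age] += 1
    # Fraction of patients at each age who survived 5 years or longer
    return {age: survived[age] / total[age] for age in total}

rates = survival_rates(csv.reader(sample))
best = max(rates.values())
best_ages = sorted(age for age, rate in rates.items() if rate == best)
print(rates)
print(best_ages)
```

With a proportion, ages with very few patients can easily tie at 1.0, so you may also want to look at the totals before drawing conclusions.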

Compare some columns from some tables using python

I need to compare two values MC and JT from 2 tables:
EID MolIdx TEStart TEEnd TE TZone TBulkBE TBulkAE MC JT zavg vabs vzavg xyd.x xyd.y xydist nnbw vabsprev midhb
0 370 36700 36800 110 20 36150 37090 0 0 -8.25705 0.219113 -0.000800014 20.8926 41.4347 5.75852 0 4.13067 0
1 423 17950 18150 210 180 17400 18430 1 0 -4.26426 0.586578 -0.053 77.22 85.2104 22.0534 0 3.551 0
2 468 41790 42020 240 50 41360 42380 0 0 7.82681 0.181248 -0.00269566 90.0646 92.7698 5.0841 0 4.19304 0
and
EID MolIdx TEStart TEEnd TE TZone TBulkBE TBulkAE MC JT zavg vabs vzavg xyd.x xyd.y xydist nnbw vabsprev midhb
0 370 36700 36800 110 20 36150 37090 0 0 -0.846655 0.0218695 2.59898e-05 2.0724 4.1259 0.583259 10 0.412513 0
1 423 17950 18150 210 180 17400 18780 1 0 -0.453311 0.058732 -0.00526783 7.7403 8.52544 2.19627 0 0.354126 0
2 468 41790 42020 240 70 41360 42380 0 0 0.743716 0.0181613 -0.000256186 9.08777 9.21395 0.502506 0 0.419265 0
I need to do it using the csv module. I know how to do it using pandas and xlrd, but I don't know how to do it with csv.
Desired output:
Number_of_strings MC JT
And print the strings where the values are different.
import csv
old = csv.reader(open('old.csv', 'rb'), delimiter=',')
row1 = old.next()
new = csv.reader(open('new.csv', 'rb'), delimiter=',')
row2 = new.next()
if (row1[8] == row2[8]) and (row1[9] == row2[9]):
    continue
else:
    print row1[0] + ':' + row1[8] + '!=' + row2[8]
You can try something like the following:
import csv
with open('old.csv', newline='') as f:
    old = list(csv.reader(f, delimiter=','))
with open('new.csv', newline='') as f:
    new = list(csv.reader(f, delimiter=','))
# Transpose rows into columns
old = list(zip(*old))
new = list(zip(*new))
print(['%s-%s-%s' % (a, b, c) for a, b, c in zip(old[0], new[8], old[8]) if b != c])
First, we get a list of lists. zip(*x) will transpose a list of lists. The rest should be easy to decipher ...
You can actually put whatever you want within the string ...
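If you would rather refer to the columns by name and check both MC and JT, csv.DictReader may be easier to read. Below is a rough sketch; diff_columns is a hypothetical helper, and the two tiny in-memory files are invented for the demo (in your real tables the MC and JT values shown are identical).

```python
import csv
import io

def diff_columns(old_file, new_file, columns=("MC", "JT"), delimiter=","):
    old_rows = csv.DictReader(old_file, delimiter=delimiter)
    new_rows = csv.DictReader(new_file, delimiter=delimiter)
    differences = []
    # Walk both files in parallel, assuming the rows are in the same order
    for old_row, new_row in zip(old_rows, new_rows):
        for col in columns:
            if old_row[col] != new_row[col]:
                differences.append((old_row["EID"], col, old_row[col], new_row[col]))
    return differences

# Invented sample data, only to exercise the function
old_csv = io.StringIO("EID,MC,JT\n0,0,0\n1,1,0\n2,0,0\n")
new_csv = io.StringIO("EID,MC,JT\n0,0,0\n1,0,0\n2,0,1\n")

diffs = diff_columns(old_csv, new_csv)
for eid, col, old_val, new_val in diffs:
    print("{}: {} {} != {}".format(eid, col, old_val, new_val))
```

For real files you would pass open('old.csv', newline='') and open('new.csv', newline='') instead of the StringIO objects, and set delimiter to whatever actually separates your columns.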

forming a new data frame using zip, but getting an error

I have a numpy_array called playoff_teams:
playoff_teams = np.sort(playoff_seeds['team'])
playoff_teams[:7]
array([1115, 1124, 1139, 1140, 1143, 1155, 1165], dtype=int64)
I have a data_frame called reg:
        season  daynum  wteam  wscore  lteam  lscore wloc  numot
108122    2010       7   1143      75   1293      70    H      0
108123    2010       7   1314      88   1198      72    H      0
108124    2010       7   1326     100   1108      60    H      0
108125    2010       7   1393      75   1107      43    H      0
108126    2010       9   1143      95   1178      61    H      0
I then loop over the teams and perform the following action:
for teams in playoff_teams:
    games = reg[(reg['wteam'] == teams) | (reg['lteam']== teams)]
    last_six = sum(games.tail(6)['wteam'] == teams)
    zipped = zip(team, last_six)
I get an error
TypeError: zip argument #1 must support iteration
I need to form a new data frame in following format:
col_1 col_2
team_1 last_six
team_2 last_six
team_3 last_six
How do I do that?
sum() returns a number, not something you can iterate over, while zip() needs iterables, so I think your problem is there.
last_six = sum(games.tail(6)['wteam'] == teams) # Number
zipped = zip(team, last_six) # Error because last_six is not iterable
You could store the results in a list (that might be a dict too) for instance :
new_data = []
for teams in playoff_teams:
    games = reg[(reg['wteam'] == teams) | (reg['lteam']== teams)]
    last_six = sum(games.tail(6)['wteam'] == teams)
    new_data.append((teams, last_six))
Then build your data frame with pd.DataFrame(new_data, columns=['col_1', 'col_2']) or DataFrame.from_dict (if you chose a dict and not a list). Note that DataFrame.from_items was removed in pandas 1.0.
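Putting the pieces together, a minimal end-to-end sketch could look like this. It assumes pandas is installed, uses a cut-down copy of reg with only the two team columns, and a shortened playoff_teams list chosen so that both teams appear in the sample rows.

```python
import pandas as pd

# Cut-down version of reg: only the columns the loop needs
reg = pd.DataFrame({
    "wteam": [1143, 1314, 1326, 1393, 1143],
    "lteam": [1293, 1198, 1108, 1107, 1178],
})
playoff_teams = [1143, 1293]  # shortened sample, not the real playoff list

new_data = []
for team in playoff_teams:
    games = reg[(reg["wteam"] == team) | (reg["lteam"] == team)]
    # Number of wins among the team's last six games
    last_six = int((games.tail(6)["wteam"] == team).sum())
    new_data.append((team, last_six))

result = pd.DataFrame(new_data, columns=["col_1", "col_2"])
print(result)
```

Each tuple in new_data becomes one row, so no zip() is needed at all.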
