Making various groupings - Python

My data set is a list of people either working together or alone.
I have a row for each project, and columns with the names of all the people who worked on that project. If column 2 is the first empty column in a row, it was a solo job; if column 4 is the first empty column in a row, there were 3 people working together.
I have the code to find all pairs. In the output data set, a square N x N matrix is created, with every actor labelling the columns and rows. Cells (A,B) and (B,A) contain how many times that pair worked together; A working with B is treated the same as B working with A.
An example of the input data, in a comma delimited fashion:
A,.,.
A,B,.
B,C,E
B,F,.
D,F,.
A,B,C
D,B,.
E,C,B
X,D,A
F,D,.
B,.,.
F,.,.
F,X,C
C,F,D
I am using Python 3.2. Here is the code that does this:
import csv
import collections
import itertools

grid = collections.Counter()
with open("connect.csv", "r") as fp:
    reader = csv.reader(fp)
    for line in reader:
        # clean empty names
        line = [name.strip() for name in line if name.strip()]
        # count single works
        if len(line) == 1:
            grid[line[0], line[0]] += 1
        # do pairwise counts
        for pair in itertools.combinations(line, 2):
            grid[pair] += 1
            grid[pair[::-1]] += 1

actors = sorted(set(pair[0] for pair in grid))
with open("connection_grid.csv", "w") as fp:
    writer = csv.writer(fp)
    writer.writerow([''] + actors)
    for actor in actors:
        line = [actor,] + [grid[actor, other] for other in actors]
        writer.writerow(line)
My questions are:
If I had a column with months and years, would it be possible to make a matrix spreadsheet for each month/year (i.e., for 2011 I would have 12 matrices)?
For whatever breakdown I use, is it possible to make a variable such that the variable name is a combination of all the people who worked together? E.g. 'ABD' would mean a project that Person A, Person B, and Person D worked on together, and it would equal how many times ABD worked as a group of three, in whatever order. Projects can hold up to 20 people, so it would have to be able to make groups of 2 to 20. Also, it would be easiest if the variables were in alphabetical order.

1) Sort your projects by month & year, then create a new 'grid' for every month. e.g.:
Pull the month & year from every row, remove them from the row, then add the remaining data to a dictionary. In the end you get something like {(month, year): [line, line, ...]}. From there, it's easy to loop through each month/year, create a grid, output a spreadsheet, etc.
2) ''.join(sorted(list)).replace('.','') gives you the people who worked together, listed alphabetically.
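A minimal sketch of both ideas, using collections.defaultdict for the {(month, year): [line, ...]} structure (the rows and the "month/year" date format here are made up for illustration):

```python
import collections

# hypothetical rows: names first, a "month/year" date in the last column
rows = [
    ["A", "B", ".", "3/2011"],
    ["E", "C", "B", "3/2011"],
    ["A", ".", ".", "4/2011"],
]

# 1) group the rows by (month, year)
rows_by_date = collections.defaultdict(list)
for row in rows:
    month, year = row[-1].split("/")              # pull the date off the row
    rows_by_date[(month, year)].append(row[:-1])  # keep only the names

# 2) build an alphabetical group name for one row
group = "".join(sorted(["E", "C", "B"])).replace(".", "")
print(group)  # "BCE" -- sorted alphabetically, "." placeholders stripped
```

The full code below folds both steps into a single pass over the file.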
import csv
import collections
import itertools

grids = dict()
groups = dict()
with open("connect.csv", "r") as fp:
    reader = csv.reader(fp)
    for line in reader:
        # extract month/year from the last column
        date = line.pop(-1)
        month, year = date.split('/')
        # clean empty names
        line = [name.strip() for name in line if name.strip()]
        # generate group name
        group = ''.join(sorted(line)).replace('.', '')
        # increment group count
        if group in groups:
            groups[group] += 1
        else:
            groups[group] = 1
        # if a grid exists for the month, update it; else create it
        if (month, year) in grids:
            grid = grids[(month, year)]
        else:
            grid = collections.Counter()
            grids[(month, year)] = grid
        # count single works
        if len(line) == 1:
            grid[line[0], line[0]] += 1
        # do pairwise counts
        for pair in itertools.combinations(line, 2):
            grid[pair] += 1
            grid[pair[::-1]] += 1

for date, grid in grids.items():
    actors = sorted(set(pair[0] for pair in grid))
    # filename from date
    filename = "connection_grid_%s_%s.csv" % date
    with open(filename, "w") as fp:
        writer = csv.writer(fp)
        writer.writerow([''] + actors)
        for actor in actors:
            line = [actor,] + [grid[actor, other] for other in actors]
            writer.writerow(line)

with open('groups.csv', 'w') as fp:
    writer = csv.writer(fp)
    for item in sorted(groups.items()):
        writer.writerow(item)

Related

How to get a running total for a column in a csv file while depending on a unique variable in a different column?

import csv

def getDataFromFile(filename, dataList):
    file = open(filename, "r")
    csvReader = csv.reader(file)
    for aList in csvReader:
        dataList.append(aList)
    file.close()

def getTotalByYear(expendDataList):
    total = 0
    for row in expendDataList:
        expenCount = float(row[2])
        total += expenCount
    Rtotal = input("Enter 'every' or a particular year. ")
    if Rtotal == 'every' or Rtotal == 'Every':
        print(total)
As you can see, I got the running total for column 2 if you type every or Every, but I don't understand how to make the running total for column 2 depend on a certain variable in column one.
In this case my CSV file has three columns of data. A year field, an item field, and an expenditure field. How do I get a running total of the expenditure field based on a certain year?
expendDataList = []
fname = "expenditures.csv"
getDataFromFile(fname, expendDataList)
getTotalByYear(expendDataList)
Producing a running total is a good task for a generator function. This example uses the filter built-in function to filter out unwanted years (a generator expression or list comprehension could be used instead). Then it iterates over the selected rows to produce the results.
import csv

def running_totals(year):
    with open('year-item-expenditure.csv') as f:
        reader = csv.DictReader(f)
        predicate = None if year.lower() == 'every' else lambda row: row['Year'] == year
        total = 0
        for row in filter(predicate, reader):
            total += float(row['Expenditure'])
            yield total

totals = running_totals('2019')
for total in totals:
    print(total)
Another approach would be to use itertools.accumulate, though you still have to perform all of the filtering, so there's not much benefit to doing this unless you need performance.
import csv
import itertools

def running_totals(year):
    with open('year-item-expenditure.csv') as f:
        reader = csv.DictReader(f)
        predicate = None if year.lower() == 'every' else lambda row: row['Year'] == year
        # create a generator expression that yields expenditures as floats
        expenditures = (float(row['Expenditure']) for row in filter(predicate, reader))
        for total in itertools.accumulate(expenditures):
            yield total
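On its own, itertools.accumulate just yields running sums, which is exactly the shape of output both versions produce (the numbers here are made up):

```python
import itertools

expenditures = [100.0, 250.5, 49.5]  # made-up expenditure values
print(list(itertools.accumulate(expenditures)))  # [100.0, 350.5, 400.0]
```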

Split a row into multiple cells and keep the maximum of the second value for each gene

I am new to Python, and I prepared a script that should modify the following csv file as follows:
1) Each row that contains multiple Gene entries separated by ///, such as:
C16orf52 /// LOC102725138 1.00551
should be transformed to:
C16orf52 1.00551
LOC102725138 1.00551
2) The same gene may have different ratio values
AASDHPPT 0.860705
AASDHPPT 0.983691
and we want to keep only the pair with the highest ratio value (delete the pair AASDHPPT 0.860705)
Here is the script I wrote, but it does not assign the correct ratio values to the genes:
import csv
import pandas as pd

with open('2column.csv', 'rb') as f:
    reader = csv.reader(f)
    a = list(reader)

gene = []
ratio = []
for t in range(len(a)):
    if '///' in a[t][0]:
        s = a[t][0].split('///')
        gene.append(s[0])
        gene.append(s[1])
        ratio.append(a[t][1])
        ratio.append(a[t][1])
    else:
        gene.append(a[t][0])
        ratio.append(a[t][1])
    gene[t] = gene[t].strip()

newgene = []
newratio = []
for i in range(len(gene)):
    g = gene[i]
    r = ratio[i]
    if g not in newgene:
        newgene.append(g)
        for j in range(i + 1, len(gene)):
            if g == gene[j]:
                if ratio[j] > r:
                    r = ratio[j]
        newratio.append(r)

for i in range(len(newgene)):
    print newgene[i] + '\t' + newratio[i]

if len(newgene) > len(set(newgene)):
    print 'missionfailed'
Thank you very much for any help or suggestion.
Try this:
import csv

with open('2column.csv') as f:
    lines = f.read().splitlines()

new_lines = {}
for line in lines:
    cols = line.split(',')
    for part in cols[0].split('///'):
        part = part.strip()
        if part not in new_lines:
            new_lines[part] = cols[1]
        else:
            if float(cols[1]) > float(new_lines[part]):
                new_lines[part] = cols[1]

with open('clean_2column.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ',
                        quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for k, v in new_lines.items():
        writer.writerow([k, v])
First of all, if you're importing Pandas, know that it has I/O tools to read CSV files.
So first, let's import the file that way:
import pandas as pd

df = pd.read_csv('2column.csv')
Then, you can extract the indexes where you have your '///' pattern:
l = list(df[df['Gene Symbol'].str.contains('///')].index)
Then, you can create your new rows:
for i in l:
    for sub in df['Gene Symbol'][i].split('///'):
        df = df.append(pd.DataFrame([[sub, df['Ratio(ifna vs. ctrl)'][i]]], columns=df.columns))
Then, drop the old ones:
df = df.drop(df.index[l])
Then, I'll do a little trick to remove your lowest duplicate values. First, I'll sort them by 'Ratio(ifna vs. ctrl)', then I'll drop all the duplicates but the first one:
df = df.sort_values('Ratio(ifna vs. ctrl)', ascending=False).drop_duplicates('Gene Symbol', keep='first')
If you want to keep your sorting by Gene Symbol and reset the indexes to simpler ones, simply do:
df = df.sort_values('Gene Symbol').reset_index(drop=True)
If you want to re-export your modified data to your csv, do:
df.to_csv('2column.csv')
EDIT: I edited my answer to correct some syntax errors. I've tested this solution with your csv and it worked perfectly :)
This should work.
It uses the dictionary suggestion of Peter.
import csv

with open('2column.csv', 'r') as f:
    reader = csv.reader(f)
    original_file = list(reader)

# get rid of the header
original_file = original_file[1:]

# create an empty dictionary
genes_ratio = {}

# loop over every row in the original file
for row in original_file:
    gene_name = row[0]
    gene_ratio = row[1]
    # check if /// is in the string; if so, split the string
    if '///' in gene_name:
        gene_names = gene_name.split('///')
        # loop over all the resulting components
        for gene in gene_names:
            # if the component is not in the dictionary, store gene_ratio
            if gene not in genes_ratio:
                genes_ratio[gene] = gene_ratio
            # otherwise compare as numbers and keep the larger ratio
            elif float(genes_ratio[gene]) < float(gene_ratio):
                genes_ratio[gene] = gene_ratio
    else:
        if gene_name not in genes_ratio:
            genes_ratio[gene_name] = gene_ratio
        elif float(genes_ratio[gene_name]) < float(gene_ratio):
            genes_ratio[gene_name] = gene_ratio

# loop over the dictionary and print gene names and their ratio values
for key in genes_ratio:
    print key, genes_ratio[key]

Counting an NCAA basketball team's wins

I am trying to count the wins of certain college basketball teams; I have a csv file containing that data. When I run this code, no matter what I have tried, it always returns 0.
import csv

f = open("data.csv", 'r')
data = list(csv.reader(f))

def ncaa(team):
    count = 0
    for row in data:
        if row[2] == team:
            count += 1
    return count

airforce_wins = ncaa("Air force")
akron_wins = ncaa("Akron")
print(akron_wins)
This will give you "1".
import csv

f = open("C:\\users/alex/desktop/data.csv", 'r')
data = list(csv.reader(f))

def ncaa(team):
    count = 0
    for row in data:
        if row[1] == team:  # corrected index here
            count += 1
    return count

airforce_wins = ncaa("Air force")
akron_wins = ncaa("Akron")
print(akron_wins)
However, I don't think you are counting the wins correctly. You are counting occurrences of a row in the file, but since each team only has one row, you will always get "1" for any team. Perhaps your wins are in another column, and that's the value you need to look up when you find your team.
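A minimal sketch of that idea, assuming the team name is in column 1 and the win count in column 2 (both column positions and the file layout are guesses; adjust them to the real data):

```python
import csv

def total_wins(filename, team):
    """Sum the (assumed) wins column over every row matching the given team."""
    total = 0
    with open(filename, newline="") as f:
        for row in csv.reader(f):
            if row[1] == team:        # team name column (assumed)
                total += int(row[2])  # wins column (assumed)
    return total
```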
Try this instead before the function definition:
import csv

with open("data1.csv", 'r') as f:
    data = csv.reader(f, delimiter=',')
I don't think materialising the whole file with list(reader_object) is necessary; iterating the reader directly avoids loading everything into memory (just keep the loop inside the with block, since the reader reads lazily from the open file).

How can I read and sort a csv file in python?

I'm new to python and have a csv file with names and scores in like this:
Andrew,10
Steve,14
Bob,17
Andrew,11
I need to know how to read this file, and I need to display the two entries with the same name, for instance Andrew,10 and Andrew,11, as Andrew,10,11. I also need to be able to sort by name, highest score, or average score. If possible, I'd also like it to use only the last 3 entries for each name.
This is the code I've tried to use to read and sort by name:
import csv

with open("Class1.csv", "r") as f:
    Reader = csv.reader(f)
    Data = list(Reader)

Data.sort()
print(Data)
Pandas is very nice for this:
import pandas as pd

df = pd.read_csv("<pathToFileIN>", index_col=None, header=None)
df.columns = ["name", "x"]
n = df.groupby("name").apply(lambda x: ",".join([str(_) for _ in x["x"].values[-3:]])).values
df.drop_duplicates(subset="name", inplace=True)
df["x"] = n
df.sort_values("name", inplace=True)
df.to_csv("<pathToFileOUT>", index=None, sep=";")
To combine scores, use collections.defaultdict:
import collections

scores_by_name = collections.defaultdict(list)
for row in Reader:
    name = row[0]
    score = int(row[1])
    scores_by_name[name].append(score)
To keep the last three scores, take a 3-item slice:
scores_by_name = {name: scores[-3:] for name, scores in scores_by_name.items()}
To iterate alphabetically:
for name, scores in sorted(scores_by_name.items()):
    ...  # whatever
To iterate by highest score:
for name, scores in sorted(scores_by_name.items(), key=lambda item: max(item[1])):
    ...
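Putting those pieces together, a minimal end-to-end sketch (the file is replaced by an in-memory io.StringIO holding the sample data from the question, so this runs as-is):

```python
import collections
import csv
import io

# sample data from the question, inlined instead of reading Class1.csv
data = io.StringIO("Andrew,10\nSteve,14\nBob,17\nAndrew,11\n")

scores_by_name = collections.defaultdict(list)
for name, score in csv.reader(data):
    scores_by_name[name].append(int(score))

# keep only the last three scores per name
scores_by_name = {name: scores[-3:] for name, scores in scores_by_name.items()}

# print alphabetically in the requested "name,score,score" form
for name, scores in sorted(scores_by_name.items()):
    print(",".join([name] + [str(s) for s in scores]))
# Andrew,10,11
# Bob,17
# Steve,14
```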

Selecting rows in a csv file and writing them to another csv file

I have a csv file with 2 columns (the titles are value and image). The value column contains values in ascending order (0, 25, 30, ...), and the image column contains pathways to images (e.g. X.jpg). There are 81 lines in total, including the titles (that is, 80 values and 80 images).
What I want is to divide this list 4 ways. Basically the idea is to have a spread of pairs of images.
For the first group I took the image part of every two adjacent rows (2+3, 4+5, ...) and wrote them into a new csv file, with each image in a different column. Here's the code:
import csv

f = open('random_sorted.csv')
csv_f = csv.reader(f)

i = 0
prev = ""
# open csv file for writing
with open('first_group.csv', 'wb') as test_file:
    csv_writer = csv.writer(test_file)
    csv_writer.writerow(["image1"] + ["image2"])
    for row in csv_f:
        if i % 2 == 0 and i != 0:
            # print prev + "," + row[1]
            csv_writer.writerow([prev] + [row[1]])
        else:
            prev = row[1]
        i = i + 1
Here's the output of this:
I want to keep the concept similar for the remaining 3 groups (write the paired images into a new csv file, in two columns), but just increase the spread. That is, pair together rows that are 5 apart (i.e. 2+7 etc.), 7 apart (i.e. 2+9 etc.), and 9 apart. I would love to get some directions as to how to execute this. I was lucky with the first group (I had just learned about the remainder/divisor option in the Codecademy course), but I can't think of ideas for the other groups.
First collect all the rows in the csv file in a list:
import csv

with open('random_sorted.csv') as csvfile:
    csv_reader = csv.reader(csvfile, delimiter=';')
    headers = next(csv_reader)
    rows = [row for row in csv_reader]
Then set your required step size (5, 7 or 9) and identify the rows on the basis of their index in the list of rows:
with open('first_group.csv', 'wb') as test_file:
    csv_writer = csv.writer(test_file)
    csv_writer.writerow(["image1"] + ["image2"])
    step_size = 7  # set step size here
    seen = set()   # here we remember images we've already seen
    for x in range(0, len(rows) - step_size):
        img1 = rows[x][1]
        img2 = rows[x + step_size][1]
        if not (img1 in seen or img2 in seen):
            csv_writer.writerow([img1, img2])
            seen.add(img1)
            seen.add(img2)
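A quick dry run of the pairing logic on made-up image names shows how the seen set keeps every image in at most one pair (step size 3 here, and the rows are fabricated):

```python
# fabricated (value, image) rows standing in for random_sorted.csv
rows = [[i, "img%d.jpg" % i] for i in range(10)]

step_size = 3
seen = set()
pairs = []
for x in range(0, len(rows) - step_size):
    img1, img2 = rows[x][1], rows[x + step_size][1]
    if not (img1 in seen or img2 in seen):
        pairs.append((img1, img2))
        seen.add(img1)
        seen.add(img2)

print(pairs)
# [('img0.jpg', 'img3.jpg'), ('img1.jpg', 'img4.jpg'),
#  ('img2.jpg', 'img5.jpg'), ('img6.jpg', 'img9.jpg')]
```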
