I'm stuck on creating a list of columns. I'm trying to avoid using defaultdict.
Thanks for any help!
Here is my code:
import csv

# Read CSV file
with open('input.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    #-----------#
    row_list = []
    column_list = []
    year = []
    suburb = []
    for each in reader:
        row_list = row_list + [each]
        year = year + [each[0]]      # create list of years
        suburb = suburb + [each[2]]  # create list of suburbs
        for (i, v) in enumerate(each[3:-1]):
            column_list[i].append(v)  # <-- IndexError is raised here
            #print i, v
    #print column_list[0]
My error message:
19 suburb = suburb + [each[2]]#create list of suburb
20 for i,v in enumerate(each[3:-1]):
---> 21 column_list[i].append(v)
22 #print i,v
23 #print column_list[0]
IndexError: list index out of range
The printed result of (i, v):
0 10027
1 14513
2 3896
3 23362
4 77966
5 5817
6 24699
7 9805
8 62692
9 33466
10 38792
0 0
1 122
2 0
3
4 137
5 0
6 0
7
8
9 77
10
Basically, I want the lists to look like this:
column[0]=['10027','0']
column[1]=['14513','122']
A sample of my CSV file is shown in the attached screenshot.
Yes, as Alex mentioned, the problem is indeed due to trying to access an index before creating/initializing it. As an alternative solution you can also consider this:
for (i, v) in enumerate(each[3:-1]):
    if len(column_list) < i + 1:
        # grow column_list on the fly until index i exists
        column_list.append([])
    column_list[i].append(v)
Hope it helps!
The error happens because column_list is empty, so column_list[i] doesn't exist. It doesn't matter that you only want to append to it: you can't append to something that doesn't exist, and appending doesn't create it from scratch.
column_list = defaultdict(list) would indeed solve this, but since you don't want that, the simplest fix is to make sure column_list starts out with enough empty lists to append to. Like this:
column_list = [[] for _ in range(size)]
where size is the number of columns, i.e. the length of each[3:-1], which is apparently 11 according to your output.
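For example, here is a minimal sketch of that approach, sizing column_list from the first data row instead of hard-coding 11 (an illustration only, not your exact file layout):
import csv

with open('input.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    row_list = []
    year = []
    suburb = []
    column_list = None
    for each in reader:
        row_list.append(each)
        year.append(each[0])     # list of years
        suburb.append(each[2])   # list of suburbs
        values = each[3:-1]
        if column_list is None:
            # one empty list per data column, sized from the first row
            column_list = [[] for _ in range(len(values))]
        for i, v in enumerate(values):
            column_list[i].append(v)

print(column_list[0])   # e.g. ['10027', '0'] for the sample output above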
Related
I have a text file that contains the following:
n 1 id 10 12:17:32 type 6 is transitioning
n 2 id 10 12:16:12 type 5 is active
n 2 id 10 12:18:45 type 6 is transitioning
n 3 id 10 12:16:06 type 6 is transitioning
n 3 id 10 12:17:02 type 6 is transitioning
...
I need to sort these lines in Python by the timestamp. I can read the file line by line, collect all the timestamps, and sort them using sorted(timestamps), but then I need to rearrange the lines according to the sorted timestamps.
How do I get the indices of the sorted timestamps?
Is there a more elegant solution (I'm sure there is)?
import time

nID = []
mID = []
ts = []
ntype = []
comm = []

with open('changes.txt') as fp:
    while True:
        line = fp.readline()
        if not line:
            break
        lx = line.split(' ')
        nID.append(lx[1])
        mID.append(lx[3])
        ts.append(lx[4])
        ntype.append(lx[6])
        comm.append(lx[7:])
So now I can use sorted(ts) to sort the timestamps, but I don't get the indices of the sorted timestamp values.
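For reference, a minimal sketch of one way around this (assuming the space-separated format shown above): sort the split lines directly by the timestamp field, so no index bookkeeping is needed. If you do still want the indices, sorted(range(len(ts)), key=ts.__getitem__) gives them.
with open('changes.txt') as fp:
    lines = [line.split() for line in fp if line.strip()]

# lines[i][4] is the HH:MM:SS timestamp; zero-padded strings sort correctly as text
lines.sort(key=lambda lx: lx[4])

for lx in lines:
    print(' '.join(lx))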
I have a list of tuples in this format:
[("25.00", u"A"), ("44.00", u"X"),("17.00", u"E"),("34.00", u"Y")]
I want to count the number of times each letter appears.
I already created a sorted list with all the letters, and now I want to count them.
First of all, I have a problem with the u before the second item of each tuple; I don't know how to remove it. I guess it's something about encoding.
Here is my code
# coding=utf-8
from collections import Counter
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

df = pd.read_excel('test.xlsx', sheet_name='Essais', skiprows=1)

groupes = []
students = []
group_of_each_letter = []
number_of_students_per_group = []
final_list = []

def print_a_list(list):
    for items in list:
        print(items)

for i in df.index:
    groupes.append(df['GROUPE'][i])
    students.append(df[u'ÉTUDIANT'][i])

groupes = groupes[1:]
students = students[1:]

group_of_each_letter = list(set(groupes))
group_of_each_letter = sorted(group_of_each_letter)

z = zip(students, groupes)
z = list(set(z))
final_list = list(zip(*z))

for j in group_of_each_letter:
    number_of_students_per_group.append(final_list.count(j))

print_a_list(number_of_students_per_group)
group_of_each_letter is a list of the group letters without duplicates.
The problem is that I get the right number of values with the for loop at the end, but the list is filled with 0s.
The screenshot below is a sample of the Excel file. The column "ÉTUDIANT" means "student number", but I can't edit the file; I have to deal with it. GROUPE obviously means GROUP. The goal is to count the number of students per group. I think I'm on the right track, even if there are easier ways to do it.
Thanks in advance for your help, even if I know my question is a bit ambiguous.
Building off of kerwei's answer:
Use groupby() and then nunique()
This will give you the number of unique Student IDs in each Group.
import pandas as pd
df = pd.read_excel('test.xlsx', sheet_name='Essais', skiprows=1)
# Drop the empty row, which is actually the subheader
df.drop(0, axis=0, inplace=True)
# Now we get a count of unique students by group
student_group = df.groupby('GROUPE')[u'ÉTUDIANT'].nunique()
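As a quick usage follow-up to the snippet above (same student_group variable; nothing new beyond printing and reshaping the result):
# student_group is a pandas Series indexed by GROUPE
print(student_group)

# optionally turn it back into a DataFrame with a named count column
counts = student_group.reset_index(name='n_students')
print(counts)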
I think a groupby.count() should be sufficient. It'll count the number of occurrences of your GROUPE letter in the dataframe.
import pandas as pd
df = pd.read_excel('test.xlsx', sheet_name='Essais', skiprows=1)
# Drop the empty row, which is actually the subheader
df.drop(0, axis=0, inplace=True)
# Now we get a count of students by group
sub_student_group = df.groupby(['GROUPE','ETUDIANT']).count().reset_index()
>>>sub_student_group
GROUPE ETUDIANT
0 B 29
1 L 88
2 N 65
3 O 27
4 O 29
5 O 34
6 O 35
7 O 54
8 O 65
9 O 88
10 O 99
11 O 114
12 O 122
13 O 143
14 O 147
15 U 122
student_group = sub_student_group.groupby('GROUPE').count()
>>>student_group
ETUDIANT
GROUPE
B 1
L 1
N 1
O 12
U 1
I am trying to extract elements from a list.
I've looked things up a lot, but I can't work it out.
This is my test.txt (text file):
[ left column of the table = time, right column = value ]
0 81
1 78
2 76
3 74
4 81
5 79
6 80
7 81
8 83
9 83
10 83
11 82
.
.
22 81
23 80
If the current hour is equal to a time in the table, I want to extract the value for that time.
This is my demo.py (Python file):
import datetime

now = datetime.datetime.now()
current_hour = now.hour

with open('test.txt') as f:
    lines = f.readlines()
    time = [int(line.split()[0]) for line in lines]
    value = [int(line.split()[1]) for line in lines]

>>> time = [0, 1, 2, 3, 4, 5, ..., 23]
>>> value = [81, 78, 76, ..., 80]
You could make a loop that iterates over the list, looking for the current hour at every position.
Starting at position 0, it compares each entry with the current hour. If it's the same value, it assigns the value at the matching position of time to the variable extractedValue and breaks the loop.
If it isn't the same value, it increases the pos variable, which we use to index into the list, by 1. So it keeps searching until the if is True or the list ends.
pos = 0
for i in time:
    if current_hour == time[pos]:
        extractedValue = value[pos]
        break
    else:
        pos += 1
Feel free to ask if you don't understand something :)
Assuming unique values for the time column:
from datetime import datetime

with open('test.txt') as f:
    lines = f.readlines()

# this builds a dictionary with the time value from test.txt as the key
time_data_dict = {l.split()[0]: l.split()[1] for l in lines}

current_hour = datetime.now().hour
print(time_data_dict[str(current_hour)])
import datetime
import csv

data = {}
with open('hour.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        k, v = row
        data[k] = v

hour = str(datetime.datetime.now().hour)
print(data[hour])
I want to read data from a database and convert it into a list of dictionaries to put into an XLS file for reporting.
I tried writing the report in Python since it's easier for me to write code with minimal programming knowledge.
I want to write the list of dictionaries nested within a list of dictionaries to an XLS file.
I tried to generate the XLS file, but I'm not getting the correct result.
data1 = [{'a':1,'b':2,'c':[{'d':4,'e':5},{'d':8,'e':9}]},{'a':5,'b':3,'c':[{'d':8,'e':7},{'d':1,'e':3}]}]
# Output needs to be printed like this in Excel:
A    B    D    E
1    2
          4    5
          8    9
5    3
          8    7
          1    3
Here is the code I tried:
try:
    import xlwt
except Exception, e:
    raise osv.except_osv(_('User Error'), _('Please Install xlwt Library.!'))

filename = 'Report.xls'
string = 'enquiry'
# wb (the xlwt workbook) and other_tstyle1 (a cell style) are created elsewhere in the report code
worksheet = wb.add_sheet(string)

data1 = [{'a': 1, 'b': 2, 'c': [{'d': 4, 'e': 5}, {'d': 8, 'e': 9}]},
         {'a': 5, 'b': 3, 'c': [{'d': 8, 'e': 7}, {'d': 1, 'e': 3}]}]

i = 0; j = 0; m = 0
if data1:
    columns = sorted(list(data1[0].keys()))
    worksheet.write_merge(0, 0, 0, 9, 'Report')
    worksheet.write(2, 0, "A")
    worksheet.write(2, 1, "B")
    worksheet.write(2, 2, "D")
    worksheet.write(2, 3, "E")
    for i, row in enumerate(data1, 3):
        for j, col in enumerate(columns):
            if type(row[col]) != list:
                worksheet.write(i + m, j, row[col], other_tstyle1)
            else:
                # if list then loop and group it in new cells
                if row[col] != []:
                    row_columns = sorted(list(row[col][0].keys()))
                    for k, row1 in enumerate(row[col], 1):
                        for l, col1 in enumerate(row_columns):
                            worksheet.write(k + m + 1, l + 3, row1[col1])
                            # iteration of m for new row
                            m += 1
                    #m += 1
I got output like this
A B D E
1 2
4
5
8
9
5 3
8
7
1
3
I think it's because you have
m += 1
inside your inner for loop. So, for every element in c, you are putting it down one more row. (Your commented out line at the end was right.)
By the way, it's better to use meaningful variable names than just letters for variables (e.g. row_offset).
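For what it's worth, here is a sketch of building the same layout with a single running row counter instead of the i + m offset arithmetic (it assumes a plain xlwt workbook rather than the osv/OpenERP wrappers in the question):
import xlwt

data1 = [{'a': 1, 'b': 2, 'c': [{'d': 4, 'e': 5}, {'d': 8, 'e': 9}]},
         {'a': 5, 'b': 3, 'c': [{'d': 8, 'e': 7}, {'d': 1, 'e': 3}]}]

wb = xlwt.Workbook()
worksheet = wb.add_sheet('enquiry')
worksheet.write_merge(0, 0, 0, 9, 'Report')
for col, header in enumerate(['A', 'B', 'D', 'E']):
    worksheet.write(2, col, header)

row_idx = 3                       # one counter tracks the next free row
for row in data1:
    worksheet.write(row_idx, 0, row['a'])
    worksheet.write(row_idx, 1, row['b'])
    row_idx += 1
    for sub in row['c']:          # each sub-dict gets its own row under D/E
        worksheet.write(row_idx, 2, sub['d'])
        worksheet.write(row_idx, 3, sub['e'])
        row_idx += 1

wb.save('Report.xls')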
I have a file with AA sequences in column 1 and, in column 2, the number of times they appear, which I created using Counter(). In column 3 I have numerical values, which are all different. The items in columns 1 and 2 can be identical.
Example input file:
ADVAEDY 28 0.17805
ADVAEDY 28 0.17365
ADVAEDY 28 0.16951
...
ARYLGYNSNWYPFDY 23 4.16148
ARYLGYNSNWYPFDY 23 3.17716
ARYLGYNSNWYPFDY 23 1.74919
...
ARHLGYNSAWYPFDY 21 10.6038
ARHLGYNSAWYPFDY 21 2.3498
ARHLGYNSAWYPFDY 21 1.68818
...
AGIAFDY 20 0.457553
AGIAFDY 20 0.416321
AGIAFDY 20 0.286349
...
ATIEDH 4 2.45283
ATIEDH 4 0.553351
ATIEDH 4 0.441266
So there are 197 lines in this file. There are only 48 unique AA sequences in column 1. The code that generated this file:
import sys
from collections import Counter

input_fh = sys.argv[1]   # File containing all CDR(x)
cdr_spec = sys.argv[2]   # File containing CDR(x) in one column and specificities in the second

with open(input_fh, "r") as f1:
    cdr = [line.strip() for line in f1]

with open(cdr_spec, "r") as f2:
    cdr_spec_list = [line.strip().split() for line in f2]

cdr_spec_out = open("CDR" + c + "_counts_spec.txt", "w")  # c is set earlier in the script

counter_cdr = Counter(cdr)
countermc_cdr = counter_cdr.most_common()
print len(countermc_cdr)

#This one might work:
for k, v in countermc_cdr:
    for x, y in cdr_spec_list:
        if k == x:
            print >> cdr_spec_out, k, '\t', v, '\t', y

cdr_spec_out.close()
The output I want to generate, using the example above, removes the duplicates in columns 1 and 2 but keeps all matching values from column 3 on one line:
ADVAEDY 28 0.17805, 0.17365, 0.16951
...
ARYLGYNSNWYPFDY 23 4.16148, 3.17716, 1.74919
...
ARHLGYNSAWYPFDY 21 10.6038, 2.3498, 1.68818
...
AGIAFDY 20 0.457553, 0.416321, 0.286349
...
ATIEDH 4 2.45283, 0.553351, 0.441266
Also, the comma-separated values in the "new" column 3 need to be in order from largest to smallest. I would prefer to stay away from modules, as I'm still learning Python and the "pythonic" way of doing things.
Any help is appreciated.
What causes the same AA to be printed additional times is the second for loop:
for x,y in cdr_spec_list:
Try loading cdr_spec_list as a dictionary from the start:
from collections import defaultdict

with open(cdr_spec, "r") as f2:
    cdr_spec_dic = defaultdict(list)  # a dictionary whose default value is an empty list
    for ln in f2:
        k, v = ln.strip().split()
        cdr_spec_dic[k].append(v)
Now you have a dictionary mapping each AA sequence to its numerical values.
So now we don't need the second for loop, and we can sort (and comma-separate) while we're at it:
for k, v in countermc_cdr:
    print >> cdr_spec_out, k, '\t', v, '\t', ', '.join(sorted(cdr_spec_dic[k], key=float, reverse=True))
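For reference, a compact end-to-end sketch of the same idea in Python 3, with hypothetical file names and no command-line handling, in case it helps to see the pieces together:
from collections import Counter, defaultdict

# cdr.txt: one AA sequence per line (hypothetical name)
with open('cdr.txt') as f1:
    cdr = [line.strip() for line in f1]

# cdr_spec.txt: an AA sequence and a numeric value per line (hypothetical name)
spec = defaultdict(list)
with open('cdr_spec.txt') as f2:
    for line in f2:
        seq, val = line.split()
        spec[seq].append(val)

with open('cdr_counts_spec.txt', 'w') as out:
    for seq, n in Counter(cdr).most_common():
        vals = sorted(spec[seq], key=float, reverse=True)   # largest to smallest
        out.write('{}\t{}\t{}\n'.format(seq, n, ', '.join(vals)))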