Python: iterating in multiple levels

-------------2000--------------
1 17824
2 20131125192004.9
3 690714s1969 dcu 000 0 eng
4 a 75601809
4 a DLC
4 b eng
4 c DLC
5 a WA 750
-------------2001--------------
1 3224
2 20w125192004.9
3 690714s1969 dcu 000 0 eng
5 a WA 120
-------------2002--------------
2 2013341524626245.9
3 484914s1969 dcu 000 0 eng
4 a 75601809
4 a eng
4 c DLC
5 a WA 345
I want to iterate through both the years and the fields under each year (e.g. 1, 2, 3, 4, and 5). a, b, and other alphabet letters after some fields are subfields.
The lines with dashes in my data indicate the year of the entry. Each record group starts at ---year--- and ends at the line before the next ---year---.
Also, fields is a list:
fields=["1", "2", "3", "4", "5"].
I'm eventually trying to retrieve the values next to the fields for each entry/year. For example, if my current field is 1, which is equivalent to fields[0], I would iterate through all the years (2000, 2001, and 2002) to get the values for the field 1. The output would be
17824
3224
(Blank space for Year 2002)
How can I iterate through the years (indicated by the dashes)? I can't seem to think of a code to generate the desired output.

You can first use a regex to split your text, then use itertools.izip_longest (Python 2; it is itertools.zip_longest in Python 3) within a nested list comprehension to get your expected columns:
>>> import re
>>> blocks=re.split(r'-+\d+-+',s)
>>> from itertools import izip_longest
>>> z=[list(izip_longest(*[k for k in sub if k])) for sub in izip_longest(*[[j.split() for j in i.split('\n')] for i in blocks])]
>>> z
[[], [('1', '1', '2'), ('17824', '3224', '2013341524626245.9')], [('2', '2', '3'), ('20131125192004.9', '20w125192004.9', '484914s1969'), (None, None, 'dcu'), (None, None, '000'), (None, None, '0'), (None, None, 'eng')], [('3', '3', '4'), ('690714s1969', '690714s1969', 'a'), ('dcu', 'dcu', '75601809'), ('000', '000', None), ('0', '0', None), ('eng', 'eng', None)], [('4', '5', '4'), ('a', 'a', 'a'), ('75601809', 'WA', 'eng'), (None, '120', None)], [('4', '4'), ('a', 'c'), ('DLC', 'DLC')], [('4', '5'), ('b', 'a'), ('eng', 'WA'), (None, '345')], [('4',), ('c',), ('DLC',)], [('5',), ('a',), ('WA',), ('750',)], []]
Each sublist represents a specific line position within each block; for example, the first sublist holds the first line of each block:
>>> z=[i for i in z if i] # remove the empty lists
>>> z[0]
[('1', '1', '2'), ('17824', '3224', '2013341524626245.9')]
>>> z[0][1]
('17824', '3224', '2013341524626245.9')

So I'm writing a pretty involved answer that uses a helper function, but I think you'll find it pretty flexible. It uses an itertools-style helper function that I wrote called groupby. The groupby function accepts a key function that specifies which group each item belongs to. In your case the key function is a little fancy because it has to maintain state to know which year each line belongs to. The code below is fully runnable: just copy and paste it into a script and let me know what you think.
EDIT
Turns out the groupby function is already implemented in the itertools module and I've been missing it forever. I edited the code to use the itertools version
#!/usr/bin/env python
import io
import re
import itertools as it
data = '''-------------2000--------------
1 17824
2 20131125192004.9
3 690714s1969 dcu 000 0 eng
4 a 75601809
4 a DLC
4 b eng
4 c DLC
5 a WA 750
-------------2001--------------
1 3224
2 20w125192004.9
3 690714s1969 dcu 000 0 eng
5 a WA 120
-------------2002--------------
2 2013341524626245.9
3 484914s1969 dcu 000 0 eng
4 a 75601809
4 a eng
4 c DLC
5 a WA 345'''
def group_year():
    '''
    A stateful closure to group the year blobs together
    '''
    # Hack to update a variable from the closure
    g = [0]
    def closure(e):
        if re.findall(r'-----[0-9]{4}------', e):
            g[0] += 1
        return g[0]
    return closure
if __name__ == "__main__":
    f = io.BytesIO(data)
    gy = group_year()
    for k,group in it.groupby(f, key=gy):
        # group is now an iter of lines for each year group in the data
        # Now you can iterate on each group like so:
        for line in group:
            rec = line.strip().split()
            if rec[0] == '1':
                print rec[1]
        # You could also use nested groupby's at this point to perform
        # further grouping on the different columns or whatever

Related

How to find duplicates from a Pandas dataframe based upon the values in other columns?

I have a Pandas Df-
A=
[period store item
1 32 'A'
1 34 'A'
1 32 'B'
1 34 'B'
2 42 'X'
2 44 'X'
2 42 'Y'
2 44 'Y']
I need to implement something like this:
If an item has the same set of stores as any other item for that particular period then those items are duplicate.
So in this case A and B are duplicates as they have the same stores for the respective periods.
I have tried converting this into a nested dictionary using this:
dicta = {p: g.groupby('item')['store'].apply(tuple).to_dict()
         for p, g in A.groupby('period')}
Which is returning me a dictionary like this:
dicta = {1: {'A': (32, 34),'B': (32, 34)}, 2: {'X': (42, 44),'Y': (42, 44)}}
...
So in the end I want a dictionary like this.
{1:(A,B),2:(X,Y)}
However, I can't work out the logic for finding the duplicate items.
Is there another method I could use to find them?
You can simply use .duplicated. Pass ['period', 'store'] as subset and set keep=False so that all of the duplicated rows are returned.
print(A[A.duplicated(subset=['period', 'store'], keep=False)])
Outputs
period store item
0 1 32 A
1 1 34 A
2 1 32 B
3 1 34 B
4 2 42 X
5 2 44 X
6 2 42 Y
7 2 44 Y
Note that according to the logic you specified all the rows are duplicates.
EDIT After OP elaborated on the expected format, I suggest
duplicates = A[A.duplicated(subset=['period', 'store'], keep=False)]
output = {g: tuple(df['item'].unique()) for g, df in duplicates.groupby('period')}
Then output is {1: ('A', 'B'), 2: ('X', 'Y')}.
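As a cross-check on the "same set of stores" rule stated in the question, here is a plain-Python sketch (no pandas) that groups items by the exact set of stores they appear in for each period. This is a different technique from .duplicated, offered only as an illustration. The function name duplicate_items and the extra row for item 'Z' are hypothetical additions, not part of the original data.

```python
from collections import defaultdict

# (period, store, item) rows from the question, plus one hypothetical
# extra row for item 'Z' that shares only one store with X and Y.
rows = [
    (1, 32, "A"), (1, 34, "A"), (1, 32, "B"), (1, 34, "B"),
    (2, 42, "X"), (2, 44, "X"), (2, 42, "Y"), (2, 44, "Y"),
    (2, 42, "Z"),
]

def duplicate_items(rows):
    """Return {period: [tuple of items sharing an identical store set, ...]}."""
    # Collect the set of stores for every (period, item) pair.
    stores = defaultdict(set)
    for period, store, item in rows:
        stores[(period, item)].add(store)

    # Group items whose store sets are identical within a period.
    groups = defaultdict(list)
    for (period, item), store_set in stores.items():
        groups[(period, frozenset(store_set))].append(item)

    # Keep only groups with more than one item.
    result = defaultdict(list)
    for (period, _), items in groups.items():
        if len(items) > 1:
            result[period].append(tuple(sorted(items)))
    return dict(result)

print(duplicate_items(rows))
```

Each period maps to a list of tuples because a period could in principle contain several separate groups of duplicates. In this sketch 'Z' is left out, since its store set {42} differs from {42, 44}. A check on (period, store) pairs alone would flag the 'Z' row as well, because the pair (2, 42) repeats.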

How to enumerate a list within a list that was enumerated

The lists are made up as I do not have my code in front of me.
I have one list that has two items: listA = [Region1, Region2]
Each of those regions (originally objects with embedded information) has its own set of items. Region1 has three items, say layer1 = [VA, GA, NC], and Region2 has two items, layer2 = [WI, MI].
So, hierarchically:
Region1:
VA
GA
NC
Region2:
WI
MI
What I am trying to do is enumerate the parent layers to be:
0 Region1
1 Region2
While also enumerating from each respective sublist so the result would look something like:
0 Region1
0 VA
1 GA
2 NC
1 Region2
0 WI
1 MI
The first enumeration is simple, but I am lost on how to enumerate properly using any nested loop. I keep getting all 5 layers for each frame no matter how I've gone about it.
Any tips, tricks, ideas? I specifically need this to work with the basic enumerate function; not the Enum module.
Following the example shared by @Reblochon Masque,
>>> listA = [['abc', 'def'], ['ghi', 'klm'], ['nop', 'qrs']]
>>> [(index,region) for index,region in enumerate(listA)]
[(0, ['abc', 'def']), (1, ['ghi', 'klm']), (2, ['nop', 'qrs'])]
>>> [(index,[(ind,it) for ind,it in enumerate(region)]) for index,region in enumerate(listA)]
[(0, [(0, 'abc'), (1, 'def')]), (1, [(0, 'ghi'), (1, 'klm')]), (2, [(0, 'nop'), (1, 'qrs')])]
>>> [("Region %d"%index,[(ind,it) for ind,it in enumerate(region)]) for index,region in enumerate(listA)]
[('Region 0', [(0, 'abc'), (1, 'def')]), ('Region 1', [(0, 'ghi'), (1, 'klm')]), ('Region 2', [(0, 'nop'), (1, 'qrs')])]
>>> # format printing the list comprehension
>>> print "\n".join("Region %d \n%s"%(index,"\n".join("%d %s"%(ind,it) for ind,it in enumerate(region))) for index,region in enumerate(listA))
Region 0
0 abc
1 def
Region 1
0 ghi
1 klm
Region 2
0 nop
1 qrs
You iterate over both the outer and the inner enumerations:
listA = [['abc', 'def'], ['ghi', 'klm'], ['nop', 'qrs']]
for idx, inner in enumerate(listA):
    print('Region', idx)
    for jdx, elt in enumerate(inner):
        print(jdx, elt)
output:
Region 0
0 abc
1 def
Region 1
0 ghi
1 klm
Region 2
0 nop
1 qrs
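The same nested enumerate pattern can be applied to the Region1/Region2 layout from the question. The sketch below assumes each region is stored as a (name, layers) pair; that structure and the function name enumerate_nested are illustrative guesses, since the real objects were not shown.

```python
# Assumed structure: each region is a (name, list_of_layers) pair.
regions = [
    ("Region1", ["VA", "GA", "NC"]),
    ("Region2", ["WI", "MI"]),
]

def enumerate_nested(regions):
    """Return output lines with an index for each region and each layer."""
    lines = []
    for i, (name, layers) in enumerate(regions):
        lines.append(f"{i} {name}")
        # The inner enumerate restarts at 0 for every region because it
        # only sees this region's own layers.
        for j, layer in enumerate(layers):
            lines.append(f"    {j} {layer}")
    return lines

print("\n".join(enumerate_nested(regions)))
```

Getting all five layers under every region usually means the inner loop runs over a combined list rather than over the current region's own sublist.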

Tuple to dictionary in python by using text file

I have code that should return a dictionary with names as the keys and tuples of numbers as the corresponding values. The first number after the name controls how many numbers are in the corresponding tuple (the numbers included in the tuple are taken from left to right). For example, the line of text "Ali 6 7 6 5 12 31 61 9" has 6 as the first number after the name, so this line becomes the dictionary entry with the key "Ali" and a value that is a tuple made up of the next six integers: "Ali": (7, 6, 5, 12, 31, 61).
This is the file I'm taking the data from:
Bella 5 2 6 2 2 30 4 8 9 2
Gill 2 9 7 54 67
Jin 3 26 51 3 344 23
Elmo 4 3 8 6 8
Ali 6 7 6 5 12 31 61 9
the expected output is
Ali : (7, 6, 5, 12, 31, 61)
Bella : (2, 6, 2, 2, 30)
Elmo : (3, 8, 6, 8)
Gill : (9, 7)
Jin : (26, 51, 3)
So I've written this:
def get_names_num_tuple_dict(filename):
    file_in = open(filename, 'r')
    contents = file_in.read()
    file_in.close()
    emty_dict = {}
    for line in contents:
        data = line.strip().split()
        key = data[0]
        length = int(data[1])
        data = tuple(data[2:length + 2])
        emty_dict[key] = data
    return emty_dict
But I'm having this error
length = int(data[1])
IndexError: list index out of range
Can anyone please help? That would be really helpful. I'm a bit weak with dictionaries as I'm learning them for the first time.
Use the following code:
def get_names_num_tuple_dict(filename):
    emty_dict = {}
    with open(filename) as f:
        for line in f:
            data = line.strip().split()
            key = data[0]
            length = int(data[1])
            data = tuple(data[2:length + 2])
            emty_dict[key] = data
    return emty_dict
print(get_names_num_tuple_dict('my_filename'))
Output:
{'Bella': ('2', '6', '2', '2', '30'), 'Gill': ('9', '7'), 'Jin': ('26', '51', '3'), 'Elmo': ('3', '8', '6', '8'), 'Ali': ('7', '6', '5', '12', '31', '61')}
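Here is what happens:
contents = file_in.read()
This reads your whole file into a single string. When you loop over that string you get one character at a time, so line.strip().split() returns a list with at most one element and data[1] raises IndexError: list index out of range.
Instead, loop over the file object itself, which yields one line at a time:
for line in file_in:
    data = line.strip().split()
    ...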
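The expected output in the question shows tuples of integers, while slicing the split line yields strings. Below is a small sketch that converts each number with int. It takes any iterable of lines so that it can be tried without a file; the name names_to_tuples and the use of io.StringIO to stand in for the file are choices made for this illustration.

```python
import io

# Sample data copied from the question.
SAMPLE = """\
Bella 5 2 6 2 2 30 4 8 9 2
Gill 2 9 7 54 67
Jin 3 26 51 3 344 23
Elmo 4 3 8 6 8
Ali 6 7 6 5 12 31 61 9
"""

def names_to_tuples(lines):
    """Map each name to a tuple of the first `count` integers after it."""
    result = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # assumption: silently skip blank or malformed lines
        name, count = parts[0], int(parts[1])
        # Take `count` numbers after the count itself and convert them.
        result[name] = tuple(int(n) for n in parts[2:2 + count])
    return result

# io.StringIO stands in for an open file; an open file object works the same.
table = names_to_tuples(io.StringIO(SAMPLE))
for name in sorted(table):
    print(name, ":", table[name])
```

With a real file this would be called as names_to_tuples(f) inside a with open(filename) as f: block.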

How to create tuples from a single list with alpha-numeric characters?

I have the following list with 2 elements:
['AGCTT 6 6 35 25 10', 'AGGGT 7 7 28 29 2']
I need to make a list or zip such that each letter corresponds to its number further along in the string. For example, for list[0] the list/zip should read
{"A":"6", "G":"6", "C":"35","T":"25","T":"10"}
Can I make a list of such lists/zips that stores the corresponding values for list[0], list[1], ..., list[n]?
Note: The letters can only be A, G, C or T, and the numbers can take any value.
Edit 1: Previously, I thought I could use a dictionary, but several members pointed out that this cannot be done. So I just want to make a list or zip or anything else recommended to pair each letter with its corresponding number.
Use tuples: split once to separate the letters from the numbers, then split the numbers and zip the two together:
l =['AGCTT 6 6 35 25 10', 'AGGGT 7 7 28 29 2']
pairs = [zip(a,b.split()) for a,b in (sub.split(None,1) for sub in l)]
Which would give you:
[[('A', '6'), ('G', '6'), ('C', '35'), ('T', '25'), ('T', '10')], [('A', '7'), ('G', '7'), ('G', '28'), ('G', '29'), ('T', '2')]]
Or using a for loop with list.append:
l = ['AGCTT 6 6 35 25 10', 'AGGGT 7 7 28 29 2']
out = []
for a,b in (sub.split(None,1) for sub in l):
    out.append(zip(a, b.split()))
If you want to convert any letter to Z where the digit is < 10, you just need another loop where we check the digit in each pairing:
pairs = [[("Z", i) if int(i) < 10 else (c, i) for c, i in zip(a, b.split())]
         for a, b in (sub.split(None, 1) for sub in l)]
print(pairs)
Which would give you:
[[('Z', '6'), ('Z', '6'), ('C', '35'), ('T', '25'), ('T', '10')], [('Z', '7'), ('Z', '7'), ('G', '28'), ('G', '29'), ('Z', '2')]]
To break it into a regular loop:
pairs = []
for a, b in (sub.split(None, 1) for sub in l):
    pairs.append([("Z", i) if int(i) < 10 else (c, i) for c, i in zip(a, b.split())])
print(pairs)
[("Z", i) if int(i) < 10 else (c, i) for c, i in zip(a, b.split())] sets the letter to Z if the corresponding digit i is < 10 or else we just leave the letter as is.
If you want to get back to the original format afterwards, you just need to transpose with zip:
In [13]: l = ['AGCTT 6 6 35 25 10', 'AGGGT 7 7 28 29 2']
In [14]: pairs = [[("Z", i) if int(i) < 10 else (c, i) for c, i in zip(a, b.split())] for a, b in
....: (sub.split(None, 1) for sub in l)]
In [15]: pairs
Out[15]:
[[('Z', '6'), ('Z', '6'), ('C', '35'), ('T', '25'), ('T', '10')],
[('Z', '7'), ('Z', '7'), ('G', '28'), ('G', '29'), ('Z', '2')]]
In [16]: unzipped = [["".join(a), " ".join(b)] for a, b in (zip(*tup) for tup in pairs)]
In [17]: unzipped
Out[17]: [['ZZCTT', '6 6 35 25 10'], ['ZZGGZ', '7 7 28 29 2']]
zip(*...) will give you the original elements back into a tuple of their own, we then just need to join the strings back together. If you wanted to get back to the total original state you could just join again:
In [18]: [" ".join(["".join(a), " ".join(b)]) for a, b in (zip(*tup) for tup in pairs)]
Out[18]: ['ZZCTT 6 6 35 25 10', 'ZZGGZ 7 7 28 29 2']
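The list outputs shown for the first snippets suggest Python 2, where zip returns a list. In Python 3, zip returns a lazy iterator, so it needs to be wrapped in list() to get the same printed result. Here is a minimal Python 3 sketch of the pairing and the round trip back to strings; the function names pair_letters and unpair are made up for this example.

```python
def pair_letters(entries):
    """Pair each letter with the number at the same position."""
    pairs = []
    for entry in entries:
        # Split once: the letters come first, the numbers are the rest.
        letters, numbers = entry.split(None, 1)
        # list() materialises the lazy zip object in Python 3.
        pairs.append(list(zip(letters, numbers.split())))
    return pairs

def unpair(pairs):
    """Rebuild the original 'LETTERS n n n' strings from the pairs."""
    restored = []
    for group in pairs:
        # zip(*group) transposes the pairs back into letters and numbers.
        letters, numbers = zip(*group)
        restored.append("".join(letters) + " " + " ".join(numbers))
    return restored

entries = ['AGCTT 6 6 35 25 10', 'AGGGT 7 7 28 29 2']
pairs = pair_letters(entries)
print(pairs[0])
print(unpair(pairs) == entries)
```

Note that zip stops at the shorter input, so a line with more letters than numbers (or the reverse) is truncated silently rather than raising an error.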
If you consider using tuples to pair the items, then this works:
>>> from pprint import pprint
>>> lst = ['AGCTT 6 6 35 25 10', 'AGGGT 7 7 28 29 2']
>>> new_lst = [list(zip(sub[0], sub[1:])) for sub in [i.split() for i in lst]]
>>> pprint(new_lst)
[[('A', '6'), ('G', '6'), ('C', '35'), ('T', '25'), ('T', '10')],
[('A', '7'), ('G', '7'), ('G', '28'), ('G', '29'), ('T', '2')]]
[i.split() for i in lst]: an initial split of each string on whitespace.
zip(sub[0], sub[1:]): zips the string of letters with the list of numbers.
Iterate through the list, split each item into its letters and its numbers, and then build the list of tuples:
alphanum = ['AGCTT 6 6 35 25 10', 'AGGGT 7 7 28 29 2']
list_of_tuple = []
for s in alphanum:
    ints = []
    chars = []
    for i in s.split():
        if i.isdigit():
            ints.append(i)
        else:
            chars.append(i)
    new_tuple = []
    for (n, item) in enumerate(list(chars[0])):
        new_tuple.append((item, ints[n]))
    list_of_tuple.append(new_tuple)
print list_of_tuple
This code would work, assuming the elements in the list are correctly formed.
This means the number of letters and numbers must match!
And it will overwrite the value if the key already exists.
list = ['AGCTT 6 6 35 25 10', 'AGGGT 7 7 28 29 2']
dictionary = {}
for line in list:
    split_line = line.split()
    letters = split_line[0]
    iterator = 1
    for letter in letters:
        dictionary[letter] = split_line[iterator]
        iterator += 1
print dictionary
This modified one will check if the key exists and add it to a list with that key:
list = ['AGCTT 6 6 35 25 10', 'AGGGT 7 7 28 29 2']
dictionary = {}
for line in list:
    split_line = line.split()
    letters = split_line[0]
    iterator = 1
    for letter in letters:
        if letter in dictionary.keys():
            dictionary[letter].append(split_line[iterator])
        else:
            dictionary[letter] = [split_line[iterator]]
        iterator += 1
print dictionary

How to read a text file and group them in tuple?

I am new to python and trying to do the following in python 3
I have a text file like this
1 2 3
4 5 6
7 8 9
.
.
I wanted this to be converted into groups of tuple like this
((1,2,3),(4,5,6),(7,8,9),...)
I have tried using
f = open('text.txt', 'r')
f.readlines()
but this is giving me a list of individual words.
Could anyone help me with this?
A method using csv module -
>>> import csv
>>> f = open('a.txt','r')
>>> c = csv.reader(f,delimiter='\t') #Use the delimiter from the file , if a single space, use a single space, etc.
>>> l = []
>>> for row in c:
... l.append(tuple(map(int, row)))
...
>>> l = tuple(l)
>>> l
((1, 2, 3), (4, 5, 6), (7, 8, 9))
Though if you do not really need tuples, do not use them; it may be better to just leave the data as lists.
Both row and l in above code are initially lists.
You may try this,
>>> s = '''1 2 3
4 5 6
7 8 9'''.splitlines()
>>> tuple(tuple(int(j) for j in i.split()) for i in s)
((1, 2, 3), (4, 5, 6), (7, 8, 9))
For your case,
tuple(tuple(int(j) for j in i.split()) for i in f.readlines())
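Here is a small sketch of the same idea wrapped in a function that also skips blank lines. It accepts any iterable of lines so that it can be tried without a file on disk. The name read_tuples and the io.StringIO sample are illustrative; the file name text.txt in the usage note comes from the question.

```python
import io

def read_tuples(lines):
    """Turn lines of whitespace-separated integers into a tuple of tuples."""
    return tuple(
        tuple(int(token) for token in line.split())
        for line in lines
        if line.strip()  # skip blank lines so no empty tuples appear
    )

# io.StringIO stands in for an open file here.
sample = io.StringIO("1 2 3\n4 5 6\n\n7 8 9\n")
result = read_tuples(sample)
print(result)
```

With a real file, opening it in a with block ensures it is closed afterwards: with open('text.txt') as f: result = read_tuples(f).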
