Python - extract lines between multiple instances of the same delimiter - python

I have a file like this
===
aa
bb
===
aa
cc
dd
==
11
I need to extract the lines between the "===" and put them in different variables (a list maybe).
Can you please help me?
Thank you

with open('input.txt') as input_file:
    result = input_file.read().split('===\n')
print result

You can use itertools.groupby to group lines between the === and add them to a dictionary.
from itertools import groupby, count
with open("in.txt") as f:
    cn = count()
    d = {}
    for k, v in groupby(f, lambda x: not x.startswith("=")):
        if k:
            d[next(cn)] = "".join(v)
{0: 'aa\nbb\n', 1: 'aa\ncc\ndd\n', 2: '11'}
This presumes that each section is separated by a line starting with at least one =.
Or use a defaultdict changing the key when we find a line starting with =:
from collections import defaultdict
from itertools import count
with open("in.txt") as f:
    cn = count()
    d = defaultdict(str)
    for line in f:
        if line.startswith("="):
            key = next(cn)
        else:
            d[key] += line
print(d)
defaultdict(<type 'str'>, {0: 'aa\nbb\n', 1: 'aa\ncc\ndd\n', 2: '11\n'})
Either way avoids reading the whole file into memory at once. If you want to remove the trailing newline, use line.rstrip().
If you want each line as an individual element in the lists:
from itertools import groupby, count
with open("in.txt") as f:
    cn = count()
    d = {}
    for k, v in groupby(f, lambda x: not x.startswith("=")):
        if k:
            d[next(cn)] = list(map(str.rstrip, v))
print(d)
{0: ['aa', 'bb'], 1: ['aa', 'cc', 'dd'], 2: ['11']}
And finally if you want a list of lists:
with open("in.txt") as f:
    print [list(map(str.rstrip, v)) for k,v in groupby(f, lambda x: not x.startswith("=")) if k]
[['aa', 'bb'], ['aa', 'cc', 'dd'], ['11']]
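To try the groupby approach without a file on disk, it can be wrapped in a small helper. This is only a minimal sketch in Python 3: the name split_sections and the in-memory sample are made up for illustration, and, as in the answers above, any line starting with = is assumed to be a delimiter.

```python
from itertools import groupby


def split_sections(lines):
    """Return a list of sections; each section is a list of lines without newlines.

    Assumption (same as the answers above): a line starting with "=" is a delimiter.
    """
    return [
        [line.rstrip("\n") for line in group]
        for is_content, group in groupby(lines, lambda l: not l.startswith("="))
        if is_content  # skip the groups made of delimiter lines
    ]


# In-memory stand-in for the sample file from the question
sample = "===\naa\nbb\n===\naa\ncc\ndd\n==\n11\n"
sections = split_sections(sample.splitlines(keepends=True))
print(sections)
```

Because the helper accepts any iterable of lines, an open file object can be passed in directly, which keeps the line-by-line reading of the original answer.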

Related

How to print values from a file?

I have a text file and its content is something like this:
A:3
B:5
C:7
A:8
C:6
I need to print:
A numbers: 3, 8
B numbers: 5
C numbers: 7, 6
I'm a beginner so if you could give some help I would appreciate it. I have made a dictionary but that's pretty much all I know.
You could use an approach that keeps the values in a dictionary:
d = {}  # create an empty dictionary
for line in open(filename):  # iterate over the lines of the file
    k, v = line.strip().split(':')  # drop the newline, then split into the parts before and after ':'
    if k in d:  # add the value to the dictionary
        d[k].append(v)
    else:
        d[k] = [v]
This gives you a dictionary containing your file in a format that you can utilize to get the desired output:
for key, values in sorted(d.items()):
    print(key, 'numbers:', ', '.join(values))
The sorted() call is needed to get alphabetical output, because a dictionary does not sort its keys.
Note that using collections.defaultdict instead of a normal dict could simplify the approach somewhat. The:
d = {}
...
    if k in d:  # add the values to the dictionary
        d[k].append(v)
    else:
        d[k] = [v]
could then be replaced by:
from collections import defaultdict
d = defaultdict(list)
...
    d[k].append(v)
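Put together, the defaultdict variant might look like the following. It is a sketch in Python 3 that uses a hard-coded list of lines instead of a real file; the names sample_lines, groups and report are made up for the example.

```python
from collections import defaultdict

# Stand-in for the lines of the file shown in the question
sample_lines = ["A:3\n", "B:5\n", "C:7\n", "A:8\n", "C:6\n"]

groups = defaultdict(list)  # missing keys start out as an empty list
for line in sample_lines:
    key, value = line.strip().split(":")
    groups[key].append(value)

# Build one output line per key, in alphabetical key order
report = ["{} numbers: {}".format(k, ", ".join(v)) for k, v in sorted(groups.items())]
print("\n".join(report))
```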
Short version (Which should sort in alphabetic order)
d = {}
lines = [line.rstrip('\n') for line in open('filename.txt')]
[d.setdefault(line[0], []).append(line[2]) for line in lines]
[print(key, 'numbers:', ', '.join(values)) for key,values in sorted(d.items())]
Or if you want to maintain the order as they appear in file (file order)
from collections import OrderedDict
d = OrderedDict() # Empty dict
lines = [line.rstrip('\n') for line in open('filename.txt')] # Get the lines
[d.setdefault(line[0], []).append(line[2]) for line in lines] # Add lines to dictionary
[print(key, 'numbers:', ', '.join(values)) for key,values in d.items()] # Print lines
Tested with Python 3.5.
You can treat your file as a CSV (comma-separated values) file with ":" as the delimiter, so the csv module can parse it in one line. Then use a defaultdict with list passed to the constructor, so that an empty list is created whenever a key does not exist yet. Finally, use OrderedDict, because a standard dictionary is not guaranteed to keep the order of your keys before Python 3.7.
import csv
from collections import defaultdict, OrderedDict
values = list(csv.reader(open('your_file_name'), delimiter=":"))  # [['A', '3'], ['B', '5'], ['C', '7'], ['A', '8'], ['C', '6']]
dct_values = defaultdict(list)
for k, v in values:
    dct_values[k].append(v)
dct_values = OrderedDict(sorted(dct_values.items()))
Then you can simply print iterating the dictionary.
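The printing step is not shown in this answer. One way it could look, as a sketch in Python 3 with an io.StringIO object standing in for the real file (fake_file, grouped and lines_out are names invented for this example):

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for the file; a real file object would work the same way
fake_file = io.StringIO("A:3\nB:5\nC:7\nA:8\nC:6\n")

grouped = defaultdict(list)
for key, value in csv.reader(fake_file, delimiter=":"):
    grouped[key].append(value)

# Iterate over the keys in sorted order and format one line per key
lines_out = [f"{key} numbers: {', '.join(grouped[key])}" for key in sorted(grouped)]
for line in lines_out:
    print(line)
```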
A very easy way to group by key is to use an external library; if you are interested, try PyFunctional.

Replace values in Python dict

I have 2 files, The first only has 2 columns
A 2
B 5
C 6
And the second has the letters as a first column.
A cat
B dog
C house
I want to replace the letters in the second file with the numbers that correspond to them in the first file so I would get.
2 cat
5 dog
6 house
I created a dict from the first and read the second. I tried a few things but none worked. I can't seem to replace the values.
import csv
with open('filea.txt','rU') as f:
    reader = csv.reader(f, delimiter="\t")
    for i in reader:
        print i[0] #reads only first column
        a_data = (i[0])
dictList = []
with open('file2.txt', 'r') as d:
    for line in d:
        elements = line.rstrip().split("\t")[0:]
        dictList.append(dict(zip(elements[::1], elements[0::1])))
for key, value in dictList.items():
    if value == "A":
        dictList[key] = "cat"
The issue appears to be on your last lines:
for key, value in dictList.items():
    if value == "A":
        dictList[key] = "cat"
This should be:
for key, value in list(dictList.items()):  # iterate over a copy, because the dict is modified in the loop
    if key in a_data:
        dictList[a_data[key]] = dictList[key]
        del dictList[key]
For example, with two dictionaries:
d1 = {'A': 2, 'B': 5, 'C': 6}
d2 = {'A': 'cat', 'B': 'dog', 'C': 'house', 'D': 'car'}
for key, value in list(d2.items()):  # iterate over a copy, because d2 is modified in the loop
    if key in d1:
        d2[d1[key]] = d2[key]
        del d2[key]
>>> d2
{2: 'cat', 5: 'dog', 6: 'house', 'D': 'car'}
Notice that this method allows for items in the second dictionary which don't have a key from the first dictionary.
Wrapped up in a conditional dictionary comprehension format:
>>> {d1[k] if k in d1 else k: d2[k] for k in d2}
{2: 'cat', 5: 'dog', 6: 'house', 'D': 'car'}
I believe this code will get you your desired result:
import csv
with open('filea.txt', 'rU') as f:
    reader = csv.reader(f, delimiter="\t")
    d1 = {}
    for line in reader:
        if line[1] != "":
            d1[line[0]] = int(line[1])
with open('fileb.txt', 'rU') as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)  # Skip header row.
    d2 = {}
    for line in reader:
        d2[line[0]] = [float(i) for i in line[1:]]
d3 = {d1[k] if k in d1 else k: d2[k] for k in d2}
You could use dictionary comprehension:
d1 = {'A':2,'B':5,'C':6}
d2 = {'A':'cat','B':'dog','C':'house'}
In [23]: {d1[k]:d2[k] for k in d1.keys()}
Out[23]: {2: 'cat', 5: 'dog', 6: 'house'}
If the two dictionaries are called a and b, you can construct a new dictionary this way:
composed_dict = {a[k]:b[k] for k in a}
This will take all the keys in a, and read the corresponding values from a and b to construct a new dictionary.
Regarding your code:
The variable a_data has no purpose. You read the first file, print the first column, and do nothing else with the data in it.
zip(elements[::1], elements[0::1]) just pairs each element with itself, e.g. [1,2,3] -> [(1,1),(2,2),(3,3)]. I think that's not what you want.
In the end you have a list of dictionaries, and in the last lines you just put strings into that list. I think that is not intentional.
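Building on the remarks above, here is one possible end-to-end version. It is a sketch in Python 3 under a few assumptions: both files are tab-separated, io.StringIO objects stand in for the real files, and letters missing from the first file are left unchanged. The names file_a, file_b, numbers and output_rows are made up for the example.

```python
import csv
import io

# Hypothetical in-memory stand-ins for the two tab-separated files
file_a = io.StringIO("A\t2\nB\t5\nC\t6\n")
file_b = io.StringIO("A\tcat\nB\tdog\nC\thouse\n")

# letter -> number, taken from the first file (empty rows are skipped)
numbers = {row[0]: row[1] for row in csv.reader(file_a, delimiter="\t") if row}

# Replace the first column of the second file; unknown letters stay as they are
output_rows = [
    [numbers.get(row[0], row[0])] + row[1:]
    for row in csv.reader(file_b, delimiter="\t")
    if row
]
for row in output_rows:
    print("\t".join(row))
```

Keeping the rows in a list rather than a dictionary preserves the order of the second file and tolerates repeated letters.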
import re
d1 = dict()
with open('filea.txt', 'r') as fl:
    for f in fl:
        key, val = re.findall(r'\w+', f)
        d1[key] = val
d2 = dict()
with open('file2.txt', 'r') as fl:
    for f in fl:
        key, val = re.findall(r'\w+', f)
        d2[key] = val
with open('file3.txt', 'w') as f:  # text mode, because strings are written
    for k, v in d1.items():
        f.write("{a}\t{b}\n".format(a=v, b=d2[k]))

Find duplicates of two columns from csv

I want to find duplicate values in one column of a CSV file that has multiple columns and replace them with the value from another column. First I put two columns from the CSV into a dictionary. Then I want to find the duplicate values in that dictionary, which has string keys and values. I tried solutions for removing duplicates from a dictionary, but got a "not hashable" error or no result. Here is the first part of the code.
import csv
from collections import defaultdict
import itertools as it
mydict = {}
index = 0
reader = csv.reader(open(r"computing.csv", "rb"))
for i, rows in enumerate(reader):
    if i == 0:
        continue
    if len(rows) == 0:
        continue
    k = rows[3].strip()
    v = rows[2].strip()
    if k in mydict:
        mydict[k].append(v)
    else:
        mydict[k] = [v]
#mydict = hash(frozenset(mydict))
print mydict
d = {}
while True:
    try:
        d = defaultdict(list)
        for k,v in mydict.iteritems():
            #d[frozenset(mydict.items())]
            d[v].append(k)
    except:
        continue
    writer = csv.writer(open(r"OLD.csv", 'wb'))
    for key, value in d.items():
        writer.writerow([key, value])
Your question is unclear. So I hope I got it right.
Please give an example of input columns and the desired output columns.
Please give a printout of the error and let us know which line caused the error.
If column1=[1,2,3,1,4] and column2=[a,b,c,d,e], do you want the output to be n_column1=[a,2,3,d,4] and column2=[1,b,c,d,e]?
I imagine the exception was raised in d[v].append(k), since v is clearly a list. You cannot use a list as a key in a dictionary.
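The point about list keys can be reproduced in isolation. The sketch below (Python 3) uses made-up data shaped like mydict in the question and shows one common workaround, converting each list to a tuple before using it as a key. Whether that inverted mapping is what the question actually needs is an assumption.

```python
from collections import defaultdict

# Hypothetical data shaped like mydict in the question: string keys, list values
mydict = {"k1": ["x", "y"], "k2": ["x", "y"], "k3": ["z"]}

# Using a list as a dictionary key fails, because lists are not hashable
try:
    {}[["x", "y"]] = "k1"
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Converting each list to a tuple gives a hashable key
inverted = defaultdict(list)
for key, values in mydict.items():
    inverted[tuple(values)].append(key)
print(dict(inverted))
```

Keys that share the same list of values end up grouped under the same tuple, which is one way to spot duplicates.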
In [1]: x = [1,2,3,1,4]
In [2]: y = ['a','b','c','d','e']
In [5]: from collections import defaultdict
In [6]: d = defaultdict(int)
In [7]: for a in x:
   ...:     d[a] += 1
In [8]: d
Out[8]: defaultdict(<type 'int'>, {1: 2, 2: 1, 3: 1, 4: 1})
In [9]: x2 = []
In [10]: for a,b in zip(x,y):
   ....:     x2.append(a if d[a]==1 else b)
   ....:
In [11]: x
Out[11]: [1, 2, 3, 1, 4]
In [12]: x2
Out[12]: ['a', 2, 3, 'd', 4]
In that case, if I had to change your code to fit, I'd do something like this:
import csv
from collections import defaultdict
import itertools as it
mydict = {}
index = 0
reader = csv.reader(open(r"computing.csv", "rb"))
histogram = defaultdict(int)
k = []
v = []
for i, rows in enumerate(reader):
    if i == 0:
        continue
    if len(rows) == 0:
        continue
    k.append(rows[3].strip())
    v.append(rows[2].strip())
    item = k[-1]
    histogram[item] += 1
output_column = []
for first_item, second_item in zip(k,v):
    output_column.append(first_item if histogram[first_item]==1 else second_item)
writer = csv.writer(open(r"OLD.csv", 'wb'))
for c1, c2 in zip(output_column, v):
    writer.writerow([c1, c2])

Concatenate strings by groups python

I would like to concatenate a list of strings into new strings grouped over values in a list. Here is an example of what I mean:
Input
key = ['1','2','2','3']
data = ['a','b','c','d']
Result
newkey = ['1','2','3']
newdata = ['a','b c','d']
I understand how to join text. But I don't know how to iterate correctly over the values of the list to aggregate the strings that are common to the same key value.
Any help or suggestions appreciated. Thanks.
from collections import defaultdict
d = defaultdict(list)
for k, v in zip(key, data):
    d[k].append(v)
print [(k, ' '.join(v)) for k, v in d.items()]
Output:
[('1', 'a'), ('3', 'd'), ('2', 'b c')]
And how to get new lists:
newkey, newvalue = d.keys(), [' '.join(v) for v in d.values()]
And with saved order:
newkey, newvalue = zip(*[(k, ' '.join(d.pop(k))) for k in key if k in d])
Use the itertools.groupby() function to combine elements; zip will let you group two input lists into two output lists:
import itertools
import operator
newkey, newdata = [], []
for key, items in itertools.groupby(zip(key, data), key=operator.itemgetter(0)):
    # key is the grouped key, items an iterable of key, data pairs
    newkey.append(key)
    newdata.append(' '.join(d for k, d in items))
You can turn this into a list comprehension with a bit more zip() magic:
from itertools import groupby
from operator import itemgetter
newkey, newdata = zip(*[(k, ' '.join(d for _, d in it)) for k, it in groupby(zip(key, data), key=itemgetter(0))])
Note that this does require the input to be sorted; groupby only groups elements based on the consecutive keys being the same. On the other hand, it does preserve that initial sorted order.
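The sorting requirement is easy to see with keys that are not adjacent. The following sketch (Python 3) compares a plain groupby on unsorted input with a version that sorts the pairs by key first; concat_by_key and the sample lists are names invented for this illustration.

```python
from itertools import groupby
from operator import itemgetter


def concat_by_key(keys, values):
    """Group values by key after sorting the pairs by key (the sort is stable)."""
    pairs = sorted(zip(keys, values), key=itemgetter(0))
    grouped = [
        (k, " ".join(v for _, v in grp))
        for k, grp in groupby(pairs, key=itemgetter(0))
    ]
    return [k for k, _ in grouped], [v for _, v in grouped]


unsorted_keys = ["2", "1", "2", "3"]
unsorted_data = ["b", "a", "c", "d"]

# Without sorting, the two "2" entries are not adjacent, so they form separate groups
naive = [
    (k, " ".join(v for _, v in grp))
    for k, grp in groupby(zip(unsorted_keys, unsorted_data), key=itemgetter(0))
]
print(naive)

# Sorting by key first brings equal keys together
print(concat_by_key(unsorted_keys, unsorted_data))
```

Sorting only on the key keeps the values of each group in their original relative order, at the cost of losing the original order of the keys.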
You can use itertools.groupby() on zip(key,data):
In [128]: from itertools import *
In [129]: from operator import *
In [133]: lis=[(k," ".join(x[1] for x in g)) for k,g in groupby(zip(key,data),key=itemgetter(0))]
In [134]: newkey,newdata=zip(*lis)
In [135]: newkey
Out[135]: ('1', '2', '3')
In [136]: newdata
Out[136]: ('a', 'b c', 'd')
If you don't feel like importing collections, you can always use a regular dictionary.
key = ['1','2','2','3']
data = ['a','b','c','d']
newkeydata = {}
for k,d in zip(key,data):
    # list.append() returns None, so append in place instead of assigning the result back
    newkeydata.setdefault(k, []).append(d)
Just for the sake of variety, here is a solution that works without any external libraries and without dictionaries:
def group_vals(keys, vals):
    new_keys, new_vals = [], []
    for key, val in zip(keys, vals):
        if new_keys and new_keys[-1] == key:
            # same key as the previous item: extend the current group
            new_vals[-1] += ' ' + val
        else:
            # a different key starts a new group
            new_keys.append(key)
            new_vals.append(val)
    return new_keys, new_vals
group_vals([1,2,2,3], ['a','b','c','d'])
# --> ([1, 2, 3], ['a', 'b c', 'd'])
But I know that it's quite ugly and probably not as performant as the other solutions. Just for demonstration purposes. :)

create a dict of lists from a string

I want to convert a string such as 'a=b,a=c,a=d,b=e' into a dict of lists {'a': ['b', 'c', 'd'], 'b': ['e']} in Python 2.6.
My current solution is this:
def merge(d1, d2):
    for k, v in d2.items():
        if k in d1:
            if type(d1[k]) != type(list()):
                d1[k] = list(d1[k])
            d1[k].append(v)
        else:
            d1[k] = list(v)
    return d1
record = 'a=b,a=c,a=d,b=e'
print reduce(merge, map(dict,[[x.split('=')] for x in record.split(',')]))
which I'm sure is unnecessarily complicated.
Any better solutions?
d = {}
for i in 'a=b,a=c,a=d,b=e'.split(","):
    k,v = i.split("=")
    d.setdefault(k,[]).append(v)
print d
Or, if you're using Python 2.5 or later, you can use defaultdict:
from collections import defaultdict
d = defaultdict(list)
for i in 'a=b,a=c,a=d,b=e'.split(","):
    k,v = i.split("=")
    d[k].append(v)
print d
>>> result={}
>>> mystr='a=b,a=c,a=d,b=e'
>>> for k, v in [s.split('=') for s in mystr.split(',')]:
...     result[k] = result.get(k, []) + [v]
...
>>> result
{'a': ['b', 'c', 'd'], 'b': ['e']}
How about...
STR = "a=c,b=d,a=x,a=b"
d = {} # An empty dictionary to start with.
# We split the string at the commas first, and each substr at the '=' sign
pairs = (subs.split('=') for subs in STR.split(','))
# Now we add each pair to a dictionary of lists.
for key, value in pairs:
    d[key] = d.get(key, []) + [value]
Using a regex lets you do the work of the two splits in a single pass:
import re
ch ='a=b,a=c ,a=d, b=e'
dic = {}
for k,v in re.findall(r'(\w+)=(\w+)\s*(?:,|\Z)',ch):
    dic.setdefault(k,[]).append(v)
print dic
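If stray whitespace is the only reason for the regex, plain string methods can cope with it as well. This is a sketch in Python 3, not the approach of any answer above; parse_pairs is a made-up name, and fragments without an = sign are assumed to be skippable.

```python
def parse_pairs(record):
    """Turn 'a=b,a=c' style text into a dict mapping each key to a list of values."""
    result = {}
    for item in record.split(","):
        key, sep, value = item.partition("=")
        if not sep:
            continue  # assumption: fragments without "=" are ignored
        # strip() tolerates spaces around keys and values
        result.setdefault(key.strip(), []).append(value.strip())
    return result


parsed = parse_pairs("a=b,a=c ,a=d, b=e")
print(parsed)
```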
