Why is my file seemingly being read incorrectly?

Why is my file seemingly being read incorrectly? - python

In Python I want to read from a large file:
def aggregate(file_input):
import fileinput
reviews = []
with open(file_input.replace(".txt", "_aggregated.txt"), "w") as outp:
currComp = ""
outp.write("Business;Stars_In_Sequence")
for line in fileinput.input(file_input):
reviews.append(MyReview(line))
if(currComp != reviews[-1].getCompany()):
currComp = reviews[-1].getCompany()
outp.write("\n" + currComp + ";" + reviews[-1].getStars())
outp.flush()
else:
outp.write(reviews[-1].getStars())
outp.flush()
The file looks like this:
Business;User;Review_Stars;Date;Length;Votes_Cool;Votes_Funny;Votes_Useful;
0DI8Dt2PJp07XkVvIElIcQ;jkrzTC5P5QGJRoKECzcleQ;5;2014-03-11;421;0;1;0
0DI8Dt2PJp07XkVvIElIcQ;cK78PTjb65kdmRL9BnEdoQ;5;2014-03-29;190;0;1;0
and works fine if I use only a small part of the file, returning the right output:
Business;Stars_In_Sequence
Business;R
0DI8Dt2PJp07XkVvIElIcQ;55555455555555515
LTlCaCGZE14GuaUXUGbamg;555555555
EDqCEAGXVGCH4FJXgqtjqg;3324133
However, if I use the original file it returns this, and I cant figure out why
Business;Stars_In_Sequence
ÿþB u s i n e s s ;
0 D I 8 D t 2 P J p 0 7 X k V v I E l I c Q ;
L T l C a C G Z E 1 4 G u a U X U G b a m g ;
E D q C E A G X V G C H 4 F J X g q t j q g ;

Related

How to create a table using a list of lists

I'm trying to write a file where you have 2 rows, with the first row being numbers and the 2nd row being letters. As an example, I was trying to do this with the alphabet.
list1=['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
list2=list1+list1
abcList = [[],[]]
for i in range(len(list2)):
i+=1
if i % 5 == 0:
if i>=10:
abcList[0].append(str(i) + ' ')
else:
abcList[0].append(str(i) + ' ')
elif i<=1:
abcList[0].append(str(i) + ' ')
else:
abcList[0].append(' ')
for i,v in enumerate(list2):
i+=1
if i > 10:
abcList[1].append(' '+v+' ')
else:
abcList[1].append(v+' ')
print(''.join(abcList[0]))
print(''.join(abcList[1]))
with open('file.txt','w') as file:
file.write(''.join(abcList[0]))
file.write('\n')
file.write(''.join(abcList[1]))
The problem with the above setup is its very "hacky" (I don't know if its the right word). It "works", but its really just modifying 2 lists to make sure they stack on top of one another properly. The problem is if your list becomes too long, then the text wraps around, and stacks on itself instead of the numbers. I'm looking for something a bit less "hacky" that would work for any size list (trying to do this without external libraries, so I don't want to use pandas or numpy).
Edit: The output would be:
1 5 10
A B C D E F G H I J...etc.
Edit 2:
Just thought I'd add, I've gotten this far with it so far, but I've only been able to make columns, not rows.
list1=['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
list2=list1*2
abcList = [[],[]]
for i in range(len(list2)):
i+=1
if i % 5 == 0:
if i>=5:
abcList[0].append(str(i))
elif i<=1:
abcList[0].append(str(i))
else:
abcList[0].append('')
for i,letter in enumerate(list2):
abcList[1].append(letter)
for number, letters in zip(*abcList):
print(number.ljust(5), letters)
However, this no longer has the wrapping issues, and the numbers line up with the letters perfectly. The only thing now is to get them from columns to rows.
Output of above is:
1 A
B
C
D
5 E
F
G
H
I
10 J

I mean, you could do something like this:
file_contents = """...""" # The file contents. I not the best at file manipulation
def parser(document): # This function will return a nested list
temp = str(document).split('\n')
return [[line] for line in temp] # List comprehension
parsed = parser(file_contents)
# And then do what you want with that

Your expected output is a bit inconsistent, since in the first one, you have 1, 6, 11, 16... and in the second: 1, 5, 10, 15.... So I have a couple of possible solutions:
print(''.join([' ' if n%5 else str(n+1).ljust(2) for n in range(len(list2))]))
print(''.join([c.ljust(2) for c in list2]))
Output:
1 6 11 16 21 26 31 36 41 46 51
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
print(''.join([' ' if n%5 else str(n).ljust(2) for n in range(len(list2))]))
print(''.join([c.ljust(2) for c in list2]))
Output:
0 5 10 15 20 25 30 35 40 45 50
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
print(''.join(['1 ']+[' ' if n%5 else str(n).ljust(2) for n in range(len(list2))][1:]))
print(''.join([c.ljust(2) for c in list2]))
Output:
1 5 10 15 20 25 30 35 40 45 50
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

If you are wanting to keep variable width strings aligned, you could use string formatting with a width equal to the maximum of the widths of the individual items in that position. (This example will work with more than any number of lists, by the way.)
list1 = ["", "5", "", "10", "", "4"]
list2 = ["A", "B", "C", "D", "EE", "F"]
lists = [list1, list2]
widths = [max(map(len, t)) for t in zip(*lists)]
for lst in lists:
line = " ".join("{val:{width}s}".format(val=val, width=width)
for val, width in zip(lst, widths))
print(line)
gives:
5 10 4
A B C D EE F

How to leave only one defined sub-string in a string in Python

Say I have one of the strings:
"a b c d e f f g" || "a b c f d e f g"
And I want there to be only one occurrence of a substring (f in this instance) throughout the string so that it is somewhat sanitized.
The result of each string would be:
"a b c d e f g" || "a b c d e f g"
An example of the use would be:
str = "a b c d e f g g g g g h i j k l"
str.leaveOne("g")
#// a b c d e f g h i j k l

If it doesn't matter which instance you leave, you can use str.replace, which takes a parameter signifying the number of replacements you want to perform:
def leave_one_last(source, to_remove):
return source.replace(to_remove, '', source.count(to_remove) - 1)
This will leave the last occurrence.
We can modify it to leave the first occurrence by reversing the string twice:
def leave_one_first(source, to_remove):
return source[::-1].replace(to_remove, '', source.count(to_remove) - 1)[::-1]
However, that is ugly, not to mention inefficient. A more elegant way might be to take the substring that ends with the first occurrence of the character to find, replace occurrences of it in the rest, and finally concatenate them together:
def leave_one_first_v2(source, to_remove):
first_index = source.index(to_remove) + 1
return source[:first_index] + source[first_index:].replace(to_remove, '')
If we try this:
string = "a b c d e f g g g g g h i j k l g"
print(leave_one_last(string, 'g'))
print(leave_one_first(string, 'g'))
print(leave_one_first_v2(string, 'g'))
Output:
a b c d e f h i j k l g
a b c d e f g h i j k l
a b c d e f g h i j k l
If you don't want to keep spaces, then you should use a version based on split:
def leave_one_split(source, to_remove):
chars = source.split()
first_index = chars.index(to_remove) + 1
return ' '.join(chars[:first_index] + [char for char in chars[first_index:] if char != to_remove])
string = "a b c d e f g g g g g h i j k l g"
print(leave_one_split(string, 'g'))
Output:
'a b c d e f g h i j k l'

If I understand correctly, you can just use a regex and re.sub to look for groups of two or more of your letter with or without a space and replace it by a single instance:
import re
def leaveOne(s, char):
return re.sub(r'((%s\s?)){2,}' % char, r'\1' , s)
leaveOne("a b c d e f g g g h i j k l", 'g')
# 'a b c d e f g h i j k l'
leaveOne("a b c d e f ggg h i j k l", 'g')
# 'a b c d e f g h i j k l'
leaveOne("a b c d e f g h i j k l", 'g')
# 'a b c d e f g h i j k l'
EDIT
If the goal is to get rid of all occurrences of the letter except one, you can still use a regex with a lookahead to select all letters followed by the same:
import re
def leaveOne(s, char):
return re.sub(r'(%s)\s?(?=.*?\1)' % char, '' , s)
print(leaveOne("a b c d e f g g g h i j k l g", 'g'))
# 'a b c d e f h i j k l g'
print(leaveOne("a b c d e f ggg h i j k l gg g", 'g'))
# 'a b c d e f h i j k l g'
print(leaveOne("a b c d e f g h i j k l", 'g'))
# 'a b c d e f g h i j k l'
This should even work with more complicated patterns like:
leaveOne("a b c ffff d e ff g", 'ff')
# 'a b c d e ff g'

Given String
mystr = 'defghhabbbczasdvakfafj'
cache = {}
seq = 0
for i in mystr:
if i not in cache:
cache[i] = seq
print (cache[i])
seq+=1
mylist = []
Here I have ordered the dictionary with values
for key,value in sorted(cache.items(),key=lambda x : x[1]):
mylist.append(key)
print ("".join(mylist))

How to extract data recursively on Linux?

I'm attempting to work on a large dataset, however, the format structure of the data has been split up into hundreds of directories.
data/:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s t u v w x y z
data/0:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/1:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/2:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/3:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/4:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/5:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/6:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/7:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/8:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/9:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/a:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
Furthermore, the file types are also completely random.
0: UTF-8 Unicode text
1: UTF-8 Unicode text
2: UTF-8 Unicode text
3: UTF-8 Unicode text
4: UTF-8 Unicode text
5: Non-ISO extended-ASCII text, with LF, NEL line terminators
6: UTF-8 Unicode text
7: UTF-8 Unicode text
8: UTF-8 Unicode text
9: UTF-8 Unicode text
a: UTF-8 Unicode text
...
z: UTF-8 Unicode text
The files contain a email:password format.
How can I get all of the content into a JSON file, or CSV file?
I'm looking to import the data to MongoDB.
Thanks.

I'm sure someone will help you better than I can but if I can point you in right direction I will.
Have you tried making a perl script? Ie
opendir(DIR, ".");
#files = grep(/\.cnf$/,readdir(DIR));
closedir(DIR);
foreach $file (#files) {
//shuv in a JSON file
}
Something like that?

The question was tagged with python, so I would recommend os.walk() (documentation) for recursively reading files. Something like:
# path is the path to the data
for subdir, dirs, files in os.walk(path):
for file in files:
file_path = os.path.join(subdir, file)
try:
read_file(file_path) # This is where you read the files and push to mongo etc
except:
continue
For the second part about reading Non-ISO extended-ASCII English text, there are some answers that might be helpful here: File encoding from English text to UTF-8

regex pattern won't return in python script

Why does the first snippet return digits, but the latter does not? I have tried more complicated expressions without success. The expressions I use are valid according to pythex.org, but do not work in the script.
(\d{6}-){7}\d{6}) is one such expression. I've tested it against this string: 123138-507716-007469-173316-534644-033330-675057-093280
import re
pattern = re.compile('(\d{1})')
load_file = open('demo.txt', 'r')
search_file = load_file.read()
result = pattern.findall(search_file)
print(result)
==============
import re
pattern = re.compile('(\d{6})')
load_file = open('demo.txt', 'r')
search_file = load_file.read()
result = pattern.findall(search_file)
print(result)
When I put the string into a variable and then search the variable it works just fine. This should work as is. But it doesn't help if I want to read a text file. I've tried to read each line of the file and that seems to be where the script breaks down.
import re
pattern = re.compile('((\d{6}-){7})')
#pattern = re.compile('(\d{6})')
#load_file = open('demo.txt', 'r')
#search_file = load_file.read()
test_string = '123138-507716-007469-173316-534644-033330-675057-093280'
result = pattern.findall(test_string)
print(result)
=========
printout,
Search File:
ÿþB i t L o c k e r D r i v e E n c r y p t i o n R e c o v e r y K e y
T h e r e c o v e r y k e y i s u s e d t o r e c o v e r t h e d a t a o n a B i t L o c k e r p r o t e c t e d d r i v e .
T o v e r i f y t h a t t h i s i s t h e c o r r e c t r e c o v e r y k e y c o m p a r e t h e i d e n t i f i c a t i o n w i t h w h a t i s p r e s e n t e d o n t h e r e c o v e r y s c r e e n .
R e c o v e r y k e y i d e n t i f i c a t i o n : f f s d f a - f s d f - s f
F u l l r e c o v e r y k e y i d e n t i f i c a t i o n : 8 8 8 8 8 8 8 8 - 8 8 8 8 - 8 8 8 8 - 8 8 8 8 - 8 8 8 8 8 8 8 8 8 8 8
B i t L o c k e r R e c o v e r y K e y :
1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1
6 6 6 6 6 6
Search Results:
[]
Process finished with exit code 0
================
This is where I ended up. It finds the string just fine and without the commas.
import re
pattern = re.compile('(\w{6}-\w{6}-\w{6}-\w{6}-\w{6}-\w{6}-\w{6}-\w{6})')
load_file = open('demo3.txt', 'r')
for line in load_file:
print(pattern.findall(line))

Python: Split a mixed String

I read some lines from a file in the following form:
line = a b c d,e,f g h i,j,k,l m n
What I want is lines without the ","-separated elements, e.g.,
a b c d g h i m n
a b c d g h j m n
a b c d g h k m n
a b c d g h l m n
a b c e g h i m n
a b c e g h j m n
a b c e g h k m n
a b c e g h l m n
. . . . . . . . .
. . . . . . . . .
First I would split line
sline = line.split()
Now I would iterate over sline and look for elements that can be splited with "," as separator. The Problem is I don't know always how much from those elements I have to expect.
Any ideas?

Using regex, itertools.product and some string formatting:
This solution preserves the initial spacing as well.
>>> import re
>>> from itertools import product
>>> line = 'a b c d,e,f g h i,j,k,l m n'
>>> items = [x[0].split(',') for x in re.findall(r'((\w+,)+\w)',line)]
>>> strs = re.sub(r'((\w+,)+\w+)','{}',line)
>>> for prod in product(*items):
... print (strs.format(*prod))
...
a b c d g h i m n
a b c d g h j m n
a b c d g h k m n
a b c d g h l m n
a b c e g h i m n
a b c e g h j m n
a b c e g h k m n
a b c e g h l m n
a b c f g h i m n
a b c f g h j m n
a b c f g h k m n
a b c f g h l m n
Another example:
>>> line = 'a b c d,e,f g h i,j,k,l m n q,w,e,r f o o'
>>> items = [x[0].split(',') for x in re.findall(r'((\w+,)+\w)',line)]
>>> strs = re.sub(r'((\w+,)+\w+)','{}',line)
for prod in product(*items):
print (strs.format(*prod))
...
a b c d g h i m n q f o o
a b c d g h i m n w f o o
a b c d g h i m n e f o o
a b c d g h i m n r f o o
a b c d g h j m n q f o o
a b c d g h j m n w f o o
a b c d g h j m n e f o o
a b c d g h j m n r f o o
a b c d g h k m n q f o o
a b c d g h k m n w f o o
a b c d g h k m n e f o o
a b c d g h k m n r f o o
a b c d g h l m n q f o o
a b c d g h l m n w f o o
a b c d g h l m n e f o o
a b c d g h l m n r f o o
a b c e g h i m n q f o o
a b c e g h i m n w f o o
a b c e g h i m n e f o o
a b c e g h i m n r f o o
a b c e g h j m n q f o o
a b c e g h j m n w f o o
a b c e g h j m n e f o o
a b c e g h j m n r f o o
a b c e g h k m n q f o o
a b c e g h k m n w f o o
a b c e g h k m n e f o o
a b c e g h k m n r f o o
a b c e g h l m n q f o o
a b c e g h l m n w f o o
a b c e g h l m n e f o o
a b c e g h l m n r f o o
a b c f g h i m n q f o o
a b c f g h i m n w f o o
a b c f g h i m n e f o o
a b c f g h i m n r f o o
a b c f g h j m n q f o o
a b c f g h j m n w f o o
a b c f g h j m n e f o o
a b c f g h j m n r f o o
a b c f g h k m n q f o o
a b c f g h k m n w f o o
a b c f g h k m n e f o o
a b c f g h k m n r f o o
a b c f g h l m n q f o o
a b c f g h l m n w f o o
a b c f g h l m n e f o o
a b c f g h l m n r f o o

Your question is not really clear. If you want to strip off any part after commas (as your text suggests), then a fairly readable one-liner should do:
cleaned_line = " ".join([field.split(",")[0] for field in line.split()])
If you want to expand lines containing comma-separated fields into multiple lines (as your example suggests), then you should use the itertools.product function:
import itertools
line = "a b c d,e,f g h i,j,k,l m n"
line_fields = [field.split(",") for field in line.split()]
for expanded_line_fields in itertools.product(*line_fields):
print " ".join(expanded_line_fields)
This is the output:
a b c d g h i m n
a b c d g h j m n
a b c d g h k m n
a b c d g h l m n
a b c e g h i m n
a b c e g h j m n
a b c e g h k m n
a b c e g h l m n
a b c f g h i m n
a b c f g h j m n
a b c f g h k m n
a b c f g h l m n
If it's important to keep the original spacing, for some reason, then you can replace line.split() by re.findall("([^ ]*| *)", line):
import re
import itertools
line = "a b c d,e,f g h i,j,k,l m n"
line_fields = [field.split(",") for field in re.findall("([^ ]+| +)", line)]
for expanded_line_fields in itertools.product(*line_fields):
print "".join(expanded_line_fields)
This is the output:
a b c d g h i m n
a b c d g h j m n
a b c d g h k m n
a b c d g h l m n
a b c e g h i m n
a b c e g h j m n
a b c e g h k m n
a b c e g h l m n
a b c f g h i m n
a b c f g h j m n
a b c f g h k m n
a b c f g h l m n

If I have understood your example correctly You need following
import itertools
sss = "a b c d,e,f g h i,j,k,l m n d,e,f "
coma_separated = [i for i in sss.split() if ',' in i]
spited_coma_separated = [i.split(',') for i in coma_separated]
symbols = (i for i in itertools.product(*spited_coma_separated))
#use generator statement to save memory
for s in symbols:
st = sss
for part, symb in zip(coma_separated, s):
st = st.replace(part, symb, 1) # To prevent replacement of the
# same coma separated group replace once
# for first occurance
print (st.split()) # for python3 compatibility

Most other answers only produce one line instead of the multiple lines you seem to want.
To achieve what you want, you can work in several ways.
The recursive solution seems the most intuitive to me:
def dothestuff(l):
for n, i in enumerate(l):
if ',' in i:
# found a "," entry
items = i.split(',')
for j in items:
for rest in dothestuff(l[n+1:]):
yield l[:n] + [j] + rest
return
yield l
line = "a b c d,e,f g h i,j,k,l m n"
for i in dothestuff(line.split()): print i

for i in range(len(line)-1):
if line[i] == ',':
line = line.replace(line[i]+line[i+1], '')

import itertools
line_data = 'a b c d,e,f g h i,j,k,l m n'
comma_fields_indices = [i for i,val in enumerate(line_data.split()) if "," in val]
comma_fields = [i.split(",") for i in line_data.split() if "," in i]
all_comb = []
for val in itertools.product(*comma_fields):
sline_data = line_data.split()
for index,word in enumerate(val):
sline_data[comma_fields_indices[index]] = word
all_comb.append(" ".join(sline_data))
print all_comb

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why is my file seemingly being read incorrectly? - python

Related

How to create a table using a list of lists

How to leave only one defined sub-string in a string in Python

How to extract data recursively on Linux?

regex pattern won't return in python script

Python: Split a mixed String

Categories

Resources