remove the 'u' and brackets from text file output - pyspark - python

I need to output the values from my spark program into a text file in the following format:
'ADDRESS', VALUE
However, my current output is:
(u'ADDRESS', VALUE)
Is there a way to reformat the output so that, when it is written to the text file, it is in the first format shown above?
Here is my code below:
import pyspark
import re
from operator import *
sc = pyspark.SparkContext()
sc.setLogLevel("ERROR")
def good_line(line):
    try:
        fields = line.split(',')
        if len(fields)!=7:
            return False
        if int(fields[3]) == 0:
            return False
        str(fields[2])
        int(fields[3])
        return True
    except:
        return False
lines = sc.textFile("/user/ae306/transactions.csv")
clean_lines = lines.filter(good_line)
transactions = clean_lines.map(lambda transaction: (transaction.split(',')[2] ,int(transaction.split(',')[3])))
result = transactions.reduceByKey(add)
print(result)
result.saveAsTextFile("CompEvalSparkPartBJob1TestFile")
Thank you for your time.

Each result is a tuple, and printing a tuple prints its string representation, which in turn uses the repr of each element inside it. That is where the parentheses and the u prefix come from.
The representation of the elements is different when you print them separately, and you have no control over how the tuple formats them.
Good old format gives you that control:
print("'{}', {}".format(*result))
To put this in a text file instead, a file handle must be opened somewhere during initialization (deliberately not using the with syntax here; look it up if needed):
f = open("textfile.txt","w")
Then, instead of print, call f.write with a trailing linefeed, as many times as needed (if there are several results):
f.write("'{}', {}\n".format(*result))
At the end, close the file:
f.close()
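In the Spark program from the question, result is an RDD of (address, value) pairs, so one option is to format each pair into a string before saving. The sketch below is only an illustration: format_pair is a hypothetical helper name, and a plain Python list stands in for the RDD so the example runs without Spark.

```python
def format_pair(pair):
    """Turn an (address, value) tuple into the line "'ADDRESS', VALUE"."""
    address, value = pair
    return "'{}', {}".format(address, value)

# Stand-in data (an assumption); in the Spark job these pairs come from reduceByKey.
sample_pairs = [(u"ADDRESS_A", 10), (u"ADDRESS_B", 25)]
lines = [format_pair(pair) for pair in sample_pairs]
print("\n".join(lines))

# In PySpark the same helper could be applied to every record before saving:
# result.map(format_pair).saveAsTextFile("CompEvalSparkPartBJob1TestFile")
```

Because saveAsTextFile writes the string form of each record, mapping the tuples to ready-made strings first avoids both the parentheses and the u prefix.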

Related

modified textfile python script

I am totally new to the Python world and am looking for suggestions about my problem. I have three text files: the original file, a file with updates for the original, and a new output file that receives the result, so that the original is not modified. file1.txt looks like this:
$ego_vel=x
$ped_vel=2
$mu=3
$ego_start_s=4
$ped_start_x=5
file2.txt looks like this:
$ego_vel=5
$ped_vel=5
$mu=6
$to_decel=5
outputfile.txt should look like this:
$ego_vel=5
$ped_vel=5
$mu=6
$ego_start_s=4
$ped_start_x=5
$to_decel=5
The code I have tried so far is given below:
import sys
import os
def update_testrun(filename1: str, filename2: str, filename3: str):
    testrun_path = os.path.join(sys.argv[1] + "\\" + filename1)
    list_of_testrun = []
    with open(testrun_path, "r") as reader1:
        for line in reader1.readlines():
            list_of_testrun.append(line)
    # print(list_of_testrun)
    design_path = os.path.join(sys.argv[3] + "\\" + filename2)
    list_of_design = []
    with open(design_path, "r") as reader2:
        for line in reader2.readlines():
            list_of_design.append(line)
    print(list_of_design)
    for i, x in enumerate(list_of_testrun):
        for test in list_of_design:
            if x[:9] == test[:9]:
                list_of_testrun[i] = test
                # list_of_updated_testrun=list_of_testrun
                break
    updated_testrun_path = os.path.join(sys.argv[5] + "\\" + filename3)
def main():
    update_testrun(sys.argv[2], sys.argv[4], sys.argv[6])
if __name__ == "__main__":
    main()
With this code I get output like this:
$ego_vel=5
$ped_vel=5
$mu=3
$ego_start_s=4
$ped_start_x=5
$to_decel=5
All the values are correct except the $mu value.
Can anyone tell me where I am going wrong, and is it possible to share a Python script for my task?
Looks like your problem comes from the if statement:
if x[:9] == test[:9]:
Here you're comparing the first 9 characters of each string. In the other cases this is fine because the names are long enough that you're not comparing past the '=' character, but for $mu this means you're evaluating:
if '$mu=3' == '$mu=6'
This evaluates to False, so the mu value is not updated.
You could shorten the slice to if x[:4] == test[:4]: as a quick fix for $mu, but a fixed-length prefix is fragile (with four characters, $ego_vel and $ego_start_s both start with $ego and would match each other), so consider another method such as the .split() string method. This lets you split a string around a specific character, which in your case could be '='. For example:
if x.split('=')[0] == test.split('=')[0]:
This would evaluate as:
if '$mu' == '$mu':
which is True, and it works for the other lines too, regardless of the string length before the '=' sign.
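The expected output in the question also contains $to_decel, which exists only in the update file, so replacing matching lines is not quite enough. One way to cover both cases, sketched here under the assumption that every line has the form name=value, is to key the lines by the text before '='. The name merge_settings is a hypothetical helper, and the lists stand in for the lines read from file1.txt and file2.txt.

```python
def merge_settings(original_lines, update_lines):
    """Merge name=value lines; updates override originals, new names are appended."""
    merged = {}  # dicts keep insertion order (Python 3.7+), so the original order survives
    for line in list(original_lines) + list(update_lines):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        key, _, value = line.partition("=")
        merged[key] = value  # overwriting an existing key keeps its position
    return ["{}={}".format(key, value) for key, value in merged.items()]

file1 = ["$ego_vel=x\n", "$ped_vel=2\n", "$mu=3\n", "$ego_start_s=4\n", "$ped_start_x=5\n"]
file2 = ["$ego_vel=5\n", "$ped_vel=5\n", "$mu=6\n", "$to_decel=5\n"]
merged_lines = merge_settings(file1, file2)
print("\n".join(merged_lines))
```

Writing "\n".join(merged_lines) to the third file would leave the original file untouched.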

How to make translating function in python?

I want to ask about translating a string using Python. I have a CSV file that contains an abbreviation dictionary like this:
before, after
ROFL, Rolling on floor laughing
STFU, Shut the freak up
LMK, Let me know
...
I want to translate any word in a string that appears in the "before" column into the corresponding text in the "after" column. I tried the following code, but it doesn't change anything.
import unicodedata
import pandas as pd
def replace_abbreviation(tweet):
    dictionary = pd.read_csv("dict.csv", encoding='latin1')
    dictionary['before'] = dictionary['before'].apply(lambda val: unicodedata.normalize('NFKD', val).encode('ascii', 'ignore').decode())
    tmp = dictionary.set_index('before').to_dict('split')
    tweet = tweet.translate(tmp)
    return tweet
For example:
Input = "lmk your test result please"
Output = "let me know your test result please"
You can read the contents to a dict and then use the following code.
res = {}
with open('dict.csv') as file:
    next(file) # skip the first line "before, after"
    for line in file:
        k, v = line.strip().split(', ')
        res[k] = v
def replace(tweet):
    return ' '.join(res.get(x.upper(), x) for x in tweet.split())
print(replace('stfu and lmk your test result please'))
Output
Shut the freak up and Let me know your test result please
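The original attempt changed nothing because str.translate maps single characters, not whole words. Splitting on whitespace handles whole words but misses an abbreviation that touches punctuation, such as "lmk,". A possible variation, shown only as a sketch, uses a regular expression with word boundaries. The table below is hard-coded from the question's sample rows as an assumption, and expand_abbreviations is a hypothetical name.

```python
import re

# Sample rows from the question, hard-coded here instead of read from dict.csv.
ABBREVIATIONS = {
    "ROFL": "Rolling on floor laughing",
    "STFU": "Shut the freak up",
    "LMK": "Let me know",
}

def expand_abbreviations(tweet, table=ABBREVIATIONS):
    """Replace whole-word abbreviations, ignoring case and adjacent punctuation."""
    alternatives = "|".join(re.escape(key) for key in table)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda match: table[match.group(0).upper()], tweet)

print(expand_abbreviations("stfu and lmk, please"))
```

The word boundaries keep the pattern from firing inside longer words, and the lookup upper-cases the match so the keys can stay in the form used by the CSV file.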

Extract text element separated by hash & comma and store it in separate variable

I have a text file with this content:
group11#,['631', '1051']#,ADD/H/U_LS_FR_U#,group12#,['1', '1501']#,ADD/H/U_LS_FR_U#,group13#,['31', '28']#,ADD/H/UC_DT_SS#,group14#,['18', '27', '1017', '1073']#,AN/H/UC_HR_BAN#,group15#,['13']#,AD/H/U_LI_NW#,group16#,['1031']#,AN/HE/U_LE_NW_IES#
The requirement is to pull out each element separated by "#," and store it in a separate variable. The text file above does not have a fixed length, so if there are 200 values separated by "#,", they should be stored in 200 variables.
So the expected output would be
a = group11, b = [631, 1051] c = ADD/H/U_LS_FR_U, d = group12, e = [1, 1501] f = ADD/H/U_LS_FR_U and so on
I'd use those a,b,c,d further as
url = (url+c)
rjson = {"reqparam":{"ids":[str(b)]+str(b)}]}
freq = json.dumps(rjson)
resp = request.request("Post",url,rjson)
In reqparam, 'b' actually has to supply values like 631 and 1051.
I am not sure how to achieve this.
I've started with:
with open("filename.txt", "r") as f:
    data = f.readlines()
    for line in data:
        value = line.strip().split('#')
        print(value)
You should not use a new variable for each object; there are containers for this, e.g. a list.
To parse this string into a list, you can split it using "#," as the divider, after cutting the last symbol (which is "#") from the source (src here is the text read from the file, without a trailing newline):
result = src[:-1].split("#,")
But in your output sample you show that the items that contain a list should be converted into a list. You can do this using ast.literal_eval():
import ast
result = [ast.literal_eval(s) if "[" in s else s for s in src[:-1].split("#,")]
I used a list comprehension in the previous example, but you can write it using a regular for loop:
import ast
result = []
for s in src[:-1].split("#,"):
    if "[" in s:
        try:
            converted = ast.literal_eval(s) # string repr of list into a list
        except Exception as e:
            print(f"\"{s}\" throws an error: {e}")
        else:
            result.append(converted)
    else:
        result.append(s)
You can also use str.strip() to cut "#" and "," from the end of the string (and from the start):
src.strip("#,").split("#,")
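Since the question goes on to use the values three at a time (group name, ids, URL suffix), it may help to group the flat list into records. This is a sketch under the assumption that the file always repeats that three-item pattern; parse_groups is a hypothetical helper name and src is a shortened copy of the question's sample.

```python
import ast

def parse_groups(src):
    """Split '#,'-separated text into (name, ids, path) records."""
    items = src.strip().strip("#,").split("#,")
    records = []
    # Walk the flat list three items at a time; an incomplete tail is ignored.
    for i in range(0, len(items) - 2, 3):
        name, ids_text, path = items[i:i + 3]
        ids = [int(value) for value in ast.literal_eval(ids_text)]  # "['631', '1051']" -> [631, 1051]
        records.append((name, ids, path))
    return records

src = "group11#,['631', '1051']#,ADD/H/U_LS_FR_U#,group15#,['13']#,AD/H/U_LI_NW#"
for name, ids, path in parse_groups(src):
    print(name, ids, path)
```

Each record can then be used in a loop to build the URL and the request body, instead of relying on separately named variables.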

How to read strings as integers when reading from a file in python

I have the following line of code reading in a specific part of a text file. The problem is these are numbers not strings so I want to convert them to ints and read them into a list of some sort.
The sample below is not wholly representative; I have uploaded the full set of data as a text file here: http://s000.tinyupload.com/?file_id=08754130146692169643
A sample of the data from the text file is as follows:
*NSET, NSET=Nodes_Pushed_Back_IB
99915527, 99915529, 99915530, 99915532, 99915533, 99915548, 99915549, 99915550,
99915551, 99915552, 99915553, 99915554, 99915555, 99915556, 99915557, 99915558,
99915562, 99915563, 99915564, 99915656, 99915657, 99915658, 99915659, 99915660,
99915661, 99915662, 99915663, 99915664, 99915665, 99915666, 99915667, 99915668,
99915669, 99915670, 99915885, 99915886, 99915887, 99915888, 99915889, 99915890,
99915891, 99915892, 99915893, 99915894, 99915895, 99915896, 99915897, 99915898,
99915899, 99915900, 99916042, 99916043, 99916044, 99916045, 99916046, 99916047,
99916048, 99916049, 99916050
*NSET, NSET=Nodes_Pushed_Back_OB
Any help would be much appreciated.
Hi, I am still stuck with this issue. Any more suggestions? The latest code and error message are below. Thanks!
import tkinter as tk
from tkinter import filedialog
file_path = filedialog.askopenfilename()
print(file_path)
data = []
data2 = []
data3 = []
flag= False
with open(file_path,'r') as f:
    for line in f:
        if line.strip().startswith('*NSET, NSET=Nodes_Pushed_Back_IB'):
            flag= True
        elif line.strip().endswith('*NSET, NSET=Nodes_Pushed_Back_OB'):
            flag= False #loop stops when condition is false i.e if false do nothing
        elif flag: # as long as flag is true append
            data.append([int(x) for x in line.strip().split(',')])
The result is the following error:
ValueError: invalid literal for int() with base 10: ''
Instead of reading these as strings, I would like each to be a number in a list, i.e. [98932850 98932852 98932853 98932855 98932856 98932871 98932872 98932873]
In such cases I use regular expressions together with string methods. I would solve this problem like so:
import re
with open(filepath) as f:
    txt = f.read()
g = re.search(r'NSET=Nodes_Pushed_Back_IB(.*?)(?=\*NSET|\Z)', txt, re.S)
snums = g.group(1).replace(',', ' ').split()
numbers = [int(num) for num in snums]
I read the entire text into txt.
Next I use a regular expression with the last portion of your header as an anchor, and capture with capturing parentheses everything that follows, up to the next *NSET header or the end of the file (the re.S flag means that a dot also matches newlines). I access all the numbers as one unit of text via g.group(1).
Next, I remove all the commas (actually replace them with spaces), because on the resulting text I use split(), which is an excellent function for items separated by whitespace: the amount of whitespace doesn't matter, it just splits as you would intend.
The rest is just converting the text to numbers using a list comprehension.
Your line contains more than one number, and some separating characters. You could parse that format by judicious application of split and perhaps strip, or you could minimize string handling by having re extract specifically the fields you care about:
ints = list(map(int, re.findall(r'-?\d+', line)))
This regular expression will find each group of digits, optionally prefixed by a minus sign, and then map will apply int to each such group found.
Using a sample of your string:
strings = ' 98932850, 98932852, 98932853, 98932855, 98932856, 98932871, 98932872, 98932873,\n'
I'd just split the string, strip the commas, and return a list of numbers:
numbers = [ int(s.strip(',')) for s in strings.split() ]
Based on your comment, and regarding the larger context of your code, I'd suggest a few things:
from itertools import groupby
number_groups = []
with open('data.txt', 'r') as f:
    for k, g in groupby(f, key=lambda x: x.startswith('*NSET')):
        if k:
            pass
        else:
            number_groups += list(filter('\n'.__ne__, list(g))) #remove newlines in list
data = []
for group in number_groups:
    for str_num in group.strip('\n').split(','):
        if str_num.strip(): # skip the empty piece left by a trailing comma
            data.append(int(str_num))
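The ValueError in the question comes from the trailing comma on most data lines: splitting "1, 2," on ',' leaves an empty last piece, and int('') fails. A line-by-line variant that avoids this and keeps only one named block could look like the sketch below. It assumes every header line starts with '*', as in the sample; read_nset is a hypothetical helper name.

```python
import re

def read_nset(lines, name):
    """Collect the integers listed under '*NSET, NSET=<name>' until the next '*' header."""
    numbers = []
    inside = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("*"):
            # Any header either opens the wanted block or closes it.
            inside = stripped.replace(" ", "") == "*NSET,NSET=" + name
            continue
        if inside:
            # findall ignores commas, spaces and empty pieces between numbers.
            numbers.extend(int(token) for token in re.findall(r"-?\d+", stripped))
    return numbers

sample = [
    "*NSET, NSET=Nodes_Pushed_Back_IB\n",
    "99915527, 99915529, 99915530,\n",
    "99916048, 99916049, 99916050\n",
    "*NSET, NSET=Nodes_Pushed_Back_OB\n",
    "1, 2, 3\n",
]
print(read_nset(sample, "Nodes_Pushed_Back_IB"))
```

Passing an open file object instead of the sample list works the same way, since both yield one line at a time.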

Workarounds when a string is too long for a .join. OverflowError occurs

I'm working through some python problems on pythonchallenge.com to teach myself python and I've hit a roadblock, since the string I am to be using is too large for python to handle. I receive this error:
my-macbook:python owner1$ python singleoccurrence.py
Traceback (most recent call last):
File "singleoccurrence.py", line 32, in <module>
myString = myString.join(line)
OverflowError: join() result is too long for a Python string
What alternatives do I have for this issue? My code looks like this:
#open file testdata.txt
#for each character, check if already exists in array of checked characters
#if so, skip.
#if not, character.count
#if count > 1, repeat recursively with first character stripped off of page.
# if count = 1, add to valid character array.
#when string = 0, print valid character array.
valid = []
checked = []
myString = ""
def recursiveCount(bigString):
    if len(bigString) == 0:
        print "YAY!"
        return valid
    myChar = bigString[0]
    if myChar in checked:
        return recursiveCount(bigString[1:])
    if bigString.count(myChar) > 1:
        checked.append(myChar)
        return recursiveCount(bigString[1:])
    checked.append(myChar)
    valid.append(myChar)
    return recursiveCount(bigString[1:])
fileIN = open("testdata.txt", "r")
line = fileIN.readline()
while line:
    line = line.strip()
    myString = myString.join(line)
    line = fileIN.readline()
myString = recursiveCount(myString)
print "\n"
print myString
string.join doesn't do what you think. join is used to combine a list of words into a single string with the given separator, i.e.:
>>> ",".join(('foo', 'bar', 'baz'))
'foo,bar,baz'
The code snippet you posted will attempt to insert myString between every character in the variable line. You can see how that will get big quickly :-). Are you trying to read the entire file into a single string, myString? If so, the way you want to concatenate the strings is like this:
myString = myString + line
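To see why the original loop grows so fast, here is a small illustration (the variable names and sample strings are made up for the example). Calling join on a string inserts that string between every character of the argument, whereas plain concatenation just appends.

```python
separator = "ab"
line = "xyz"
# join puts the separator between each character of line: 'x' + 'ab' + 'y' + 'ab' + 'z'
joined = separator.join(line)
print(joined)

# Concatenation simply appends each stripped line to the accumulated string.
my_string = ""
for text in ["first line\n", "second line\n"]:
    my_string = my_string + text.strip()
print(my_string)
```

In the question's loop the separator is the accumulated string itself, so each new line multiplies its length by roughly the number of characters in that line.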
While I'm here... since you're learning Python here are some other suggestions.
There are easier ways to read an entire file into a variable. For instance:
fileIN = open("testdata.txt", "r")
myString = fileIN.read()
(This won't have the exact behaviour of your existing strip() code, but may in fact do what you want.)
Also, I would never recommend that practical Python code use recursion to iterate over a string. Your code makes a function call (and a stack entry) for every character in the string, so a long input will hit Python's recursion limit. In addition, every bigString[1:] slice creates a new string in memory that is a copy of the original without the first character. The simplest way to process every character in a string is:
for mychar in bigString:
    ... do your stuff ...
Finally, you are using the list named "checked" to see if you've ever seen a particular character before. But the membership test on lists ("if myChar in checked") is slow because it scans the whole list. In Python you're better off using a dictionary:
checked = {}
...
if myChar not in checked:
    checked[myChar] = True
...
This exercise you're doing is a great way to learn several Python idioms.
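Putting those suggestions together, one non-recursive way to find the characters that occur exactly once is to count everything in a single pass and then filter. This is only a sketch in Python 3 syntax, not the challenge's intended solution; single_occurrence_chars is a hypothetical name and the sample text is made up.

```python
from collections import Counter

def single_occurrence_chars(text):
    """Return the characters that appear exactly once, in order of appearance."""
    counts = Counter(text)  # one pass over the text, dictionary-style lookups afterwards
    return [char for char in text if counts[char] == 1]

sample_text = "aabcdd"
print(single_occurrence_chars(sample_text))
```

Because there is no recursion and no slicing, the memory use stays proportional to the input and the length of the text does not matter.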
