How to convert the Record Separator character into a line break - python

Hello, I'm using PySpark for this purpose.
I have a txt file that contains this information:
c-234r4|Julio|38|Madrida-533r2|Ana|32|Madrida-543r4|Sonia|33|Bilbaob-654r4|Jorge|23|Barcelona
As you can see, all records are concatenated using the Record Separator character (see this link).
I'm trying this, but without success:
df = spark.read.load("s3://my-bucket/txt_file/data.txt", format="csv", sep="|", inferSchema="true", encoding="UTF-8", escape='U+001E')
df.show(10, False)
Error:
Py4JJavaError: An error occurred while calling o496.load.
: java.lang.RuntimeException: escape cannot be more than one character
at org.apache.spark.sql.execution.datasources.csv.CSVOptions.getChar(CSVOptions.scala:52)
The final result should look like this:
+-------+-----+---+--------------+
|_c0 |_c1 |_c2|_c3 |
+-------+-----+---+--------------+
|c-234r4|Julio|38 |Madrid |
|a-533r2|Ana |32 |Madrid |
|a-543r4|Sonia|33 |Bilbao |
|b-654r4|Jorge|23 |Barcelona |
+-------+-----+---+--------------+
Options tested:
option-1 --> This is totally wrong
option-2 --> This shows rows as columns ... which is also wrong
Can somebody give me some advice? I need an idea to resolve this in my current role.
I'll appreciate it.
Thanks
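The escape option only accepts a single character, which is why passing the six-character string 'U+001E' fails. Below is a minimal sketch of two possible approaches, assuming Spark 3.0+ for the CSV reader's lineSep option and keeping the bucket path from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("record-separator-demo").getOrCreate()
path = "s3://my-bucket/txt_file/data.txt"

# Option A (Spark 3.0+): use the Record Separator itself as the row delimiter.
df = (spark.read
      .option("sep", "|")
      .option("lineSep", "\u001e")   # the actual U+001E character, not the literal string 'U+001E'
      .option("inferSchema", "true")
      .option("encoding", "UTF-8")
      .csv(path))

# Option B: read the raw text, split records on U+001E, then split fields on "|".
raw = spark.sparkContext.textFile(path)
rows = raw.flatMap(lambda chunk: chunk.split("\u001e")) \
          .map(lambda rec: rec.split("|"))
df = spark.createDataFrame(rows, ["_c0", "_c1", "_c2", "_c3"])

df.show(10, False)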


Insert variable inside a check_call / check_output in python 2.7?

Good morning everyone,
I'm trying to write a small program that reads a CSV, stores it in a variable, and then passes that variable to a check_call function.
The CSV is a list of databases:
cat test_db.csv
andreadb
billing
fabiodb
And this is what I have written so far:
from subprocess import *
import csv

# Load the CSV into the variable data
with open('test_db.csv', 'r') as csvfile:
    data = list(csv.reader(csvfile))

# For each database, show its tables and save the output to risultato.txt
for line in data:
    database = line
    check_call[("beeline", "-e", "\"SHOW TABLES FROM \"", database, ";" , ">>" , "risultato.txt")]
When I execute it I get the following error:
Traceback (most recent call last):
File "test_query.py", line 10, in <module>
check_call[("beeline", "-e", "\"SHOW TABLES FROM \"", database, ";")]
TypeError: 'function' object has no attribute '__getitem__'
I'm relatively new to Python and this is my first project, so any help would be great.
If I didn't explain something correctly, please tell me and I'll edit the post.
Thanks!!
You have mistyped the function call. It should be
check_call(["beeline", "-e", "\"SHOW TABLES FROM \"", database, ";" , ">>" , "risultato.txt"])
The ( was placed after [ in your question. It should be ( first, followed by a list of the command and its parameters.
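Note that even with the brackets fixed, the ">>" element is just another argument passed to beeline; it is not interpreted as shell redirection when check_call receives a list. A minimal sketch of letting Python handle the output file instead (the beeline flags are taken from the question, everything else is an assumption):

from subprocess import check_call
import csv

with open('test_db.csv', 'r') as csvfile:
    databases = [row[0] for row in csv.reader(csvfile)]

with open('risultato.txt', 'a') as out:
    for database in databases:
        # Write beeline's stdout straight to the file instead of relying on ">>"
        check_call(["beeline", "-e", "SHOW TABLES FROM " + database], stdout=out)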
After a lot of tinkering I found a way to concatenate the variable in this check_call:
for line in data:
    database = str(line[0])
    # shell=True is needed so the shell interprets the ">> risultato.txt" redirection
    check_call("beeline -e \"SHOW TABLES FROM " + database + "\" >> risultato.txt", shell=True)
After execution it produces the correct output, saved in risultato.txt:
+-----------+
| tab_name |
+-----------+
| result |
| results |
+-----------+
+------------------------------------+
| tab_name |
+------------------------------------+
| tab_example_1 |
| tab_example_2 |
+------------------------------------+
+---------------------------------------+
| tab_name |
+---------------------------------------+
| tab_example_3 |
| tab_example_4 |
+---------------------------------------+

CSV file with pipe (|) delimiter, but pipes appear in the middle of fields

I have a CSV file generated by an ERP system, and the delimiter is a pipe (|). But some columns are defined as free text in the ERP, and on many lines users have put a pipe (|) in the middle of the text.
For example:
|100019391 |99806354 |EV | RES: Consulta COBRO VVISTA - Chile |31|24.06.2021|
The part EV | RES*** is the field where the user put a pipe.
My error is that when pandas reads these lines, it gives me an error:
Skipping line 46: Expected 28 fields in line 46, saw 29
Is there an option to fix it?
Tks
Assuming that no space exists after the pipe separator, we can use the regex r"\|(?!\s)" as the sep argument.
Sample input:
col|col1|col2|col3|col4|
100019391 |99806354 |EV | RES: Consulta COBRO VVISTA - Chile |31|24.06.2021|
100019392 |99806777 |TEST - Chile |31|25.06.2021|
100019393 |99806779 |TE | ST - Chile |31|25.06.2021|
Then, we can import the above csv as follows:
import pandas as pd

df = pd.read_csv(csv_filename,
                 usecols=range(5),
                 sep=r"\|(?!\s)",
                 lineterminator='\r',
                 engine='python')
Adjust usecols according to the number of columns you have, and adjust lineterminator according to the line terminator used in your file. engine='python' is required because only the python engine supports regex separators; if it is omitted, pandas falls back to it and emits a warning.
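For a quick self-contained check of the regex, the sample above can be parsed from a string; a minimal sketch, assuming '\n' line endings so lineterminator can be left at its default:

import io
import pandas as pd

sample = (
    "col|col1|col2|col3|col4|\n"
    "100019391 |99806354 |EV | RES: Consulta COBRO VVISTA - Chile |31|24.06.2021|\n"
    "100019392 |99806777 |TEST - Chile |31|25.06.2021|\n"
    "100019393 |99806779 |TE | ST - Chile |31|25.06.2021|\n"
)

# Split only on pipes NOT followed by whitespace, so "EV | RES..." stays in one field
df = pd.read_csv(io.StringIO(sample),
                 usecols=range(5),
                 sep=r"\|(?!\s)",
                 engine="python")
print(df)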

Modify MinimalWordCount example to read from BigQuery

I am trying to modify Apache Beam's MinimalWordCount Python example to read from a BigQuery table. I have made the following modifications, and the query appears to run, but the pipeline then fails.
Original Example Here:
with beam.Pipeline(options=pipeline_options) as p:
    # Read the text file[pattern] into a PCollection.
    lines = p | ReadFromText(known_args.input)

    # Count the occurrences of each word.
    counts = (
        lines
        | 'Split' >> (beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
                      .with_output_types(unicode))
        | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))

    # Format the counts into a PCollection of strings.
    output = counts | 'Format' >> beam.Map(lambda (w, c): '%s: %s' % (w, c))

    # Write the output using a "Write" transform that has side effects.
    # pylint: disable=expression-not-assigned
    output | WriteToText(known_args.output)
Rather than ReadFromText I am trying to adjust this to read from a column in a BigQuery table. To do this I have replaced lines = p | ReadFromText(known_args.input) with the following code:
query = 'SELECT text_column FROM `bigquery.table.goes.here` '
lines = p | 'ReadFromBigQuery' >> beam.io.Read(beam.io.BigQuerySource(query=query, use_standard_sql=True))
When I re-run the pipeline, I get the error: "WARNING:root:A task failed with exception. expected string or buffer [while running 'Split']"
I recognize that the 'Split' operation expects a string and it is clearly not getting one. How can I modify 'ReadFromBigQuery' so that it passes strings? Do I need to provide a table schema or something to convert the results of 'ReadFromBigQuery' into strings?
This is because BigQuerySource returns a PCollection of dictionaries (dict), where every key in the dictionary represents a column. For your case, the simplest thing to do is to apply a beam.Map after beam.io.Read(beam.io.BigQuerySource(query=query, use_standard_sql=True)), like this:
lines = (p
         | "ReadFromBigQuery" >> beam.io.Read(beam.io.BigQuerySource(query=query, use_standard_sql=True))
         | "Extract text column" >> beam.Map(lambda row: row.get("text_column"))
         )
If you encounter a problem with the column name, try changing it to u"text_column".
Alternatively, you can modify your Split transform to extract the value of the column there:
'Split' >> (beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x.get("text_column")))
            .with_output_types(unicode))
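Putting it together, the front of the pipeline would look roughly like this; a sketch that keeps the Python 2 / older Beam API used above and the placeholder table name from the question:

with beam.Pipeline(options=pipeline_options) as p:
    query = 'SELECT text_column FROM `bigquery.table.goes.here`'
    lines = (p
             | 'ReadFromBigQuery' >> beam.io.Read(
                   beam.io.BigQuerySource(query=query, use_standard_sql=True))
             | 'ExtractTextColumn' >> beam.Map(lambda row: row.get('text_column')))

    # From here on, 'Split' receives plain strings, as it expects.
    counts = (
        lines
        | 'Split' >> (beam.FlatMap(lambda x: re.findall(r"[A-Za-z\']+", x))
                      .with_output_types(unicode))
        | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))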

Clean up string extracted from csv file

I am extracting certain data from a CSV file using Ruby and I want to clean up the extracted string by removing the unwanted characters.
This is how I extract the data so far:
CSV.foreach(data_file, :encoding => 'windows-1251:utf-8', :headers => true) do |row|
  # Create an array for each page
  page_data = []
  # For each page, get the data we are interested in and save it to page_data
  page_data.push(row['dID'])
  page_data.push(row['xTerm'])
  pages_to_import.push(page_data)
end
Then I output the CSV file with the extracted data.
The extracted output is exactly as it appears in the CSV data file:
| ID | Term |
|-------|-----------------------------------------|
| 13241 | ##106#107#my##106#term## |
| 13345 | ##63#hello## |
| 11436 | ##55#rock##20#my##10015#18#world## |
However, the desired result I want to achieve is:
| ID | Term |
|-------|-----------------------------------------|
| 13241 | my, term |
| 13345 | hello |
| 11436 | rock, my, world |
Any suggestions on how to achieve this?
Libraries that I'm using:
require 'nokogiri'
require 'cgi'
require 'csv'
Using a regular expression, I'd do:
%w[
  ##106#107#term1##106#term2##
  ##63#term1##
  ##55#term1##20#term2##10015#18#term3##
  ##106#107#my##106#term##
  ##63#hello##
  ##55#rock##20#my##10015#18#world##
].map{ |str|
  str.scan(/[^##]+?(?=#)/)
}
# => [["term1", "term2"], ["term1"], ["term1", "term2", "term3"], ["my", "term"], ["hello"], ["rock", "my", "world"]]
My str is the equivalent of the contents of your row['xTerm'].
The regular expression /[^##]+?(?=#)/ searches for patterns in str that don't contain # or # and end with #.
From the garbage in the string, and your comment that you're using Nokogiri and CSV, and because you didn't show your input data as CSV or HTML, I have to wonder if you're not mangling the incoming data somehow, and trying to wiggle out of it in post-processing. If so, show us what you're actually doing and maybe we can help you get clean data to start.
I'm assuming your terms are bookended and separated by ## and consist of one or more numbers followed by the actual term separated by #. To get the terms into an array:
row['xTerm'].split('##')[1..-1].map { |term| term.split(?#)[-1] }
Then you can join or do whatever you want with it.

Python -- how to read and change specific fields from file? (specifically, numbers)

I just started learning python scripting yesterday and I've already gotten stuck. :(
So I have a data file with a lot of different information in various fields.
Formatted basically like...
Name (tab) Start# (tab) End# (tab) A bunch of fields I need but do not do anything with
Repeat
I need to write a script that takes the start and end numbers and adds/subtracts a number depending on whether another field says + or -.
I know that I can replace words with something like this:
x = open("infile")
y = open("outfile", "a")
while 1:
    line = x.readline()
    if not line: break
    line = line.replace("blah", "blahblahblah")
    y.write(line + "\n")
y.close()
But I've looked at all sorts of different places and I can't figure out how to extract specific fields from each line, read one field, and change other fields. I read that you can read the lines into arrays, but can't seem to find out how to do it.
Any help would be great!
EDIT:
Example of a line from the data here: (Each | represents a tab character)
chr21 | 33025905 | 33031813 | ENST00000449339.1 | 0 | **-** | 33031813 | 33031813 | 0 | 3 | 1835,294,104, | 0,4341,5804,
chr21 | 33036618 | 33036795 | ENST00000458922.1 | 0 | **+** | 33036795 | 33036795 | 0 | 1 | 177, | 0,
The second and third columns (the start and end numbers) are the ones I'd need to read/change.
You can use csv to do the splitting, although for these sorts of problems, I usually just use str.split:
with open(infile) as fin, open('outfile', 'w') as fout:
    for line in fin:
        # Use line.split('\t', 3) if the name field can contain spaces
        name, start, end, rest = line.split(None, 3)
        # Do something to change start and end here.
        # Note that `start` and `end` are strings, but they can easily be converted
        # using the `int` or `float` builtins.
        fout.write('\t'.join((name, start, end, rest)))
csv is nice if you want to split lines like this:
this is a "single argument"
into:
['this','is','a','single argument']
but it doesn't seem like you need that here.
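To connect this back to the question, here is a minimal sketch of the "change start and end" step, assuming tab-separated fields, that the +/- flag is the sixth column as in the sample lines, and a placeholder offset of 100:

OFFSET = 100  # placeholder: replace with the number you actually need to add/subtract

with open('infile') as fin, open('outfile', 'w') as fout:
    for line in fin:
        fields = line.rstrip('\n').split('\t')
        start, end, strand = int(fields[1]), int(fields[2]), fields[5]
        if strand == '+':
            fields[1], fields[2] = str(start + OFFSET), str(end + OFFSET)
        else:
            fields[1], fields[2] = str(start - OFFSET), str(end - OFFSET)
        fout.write('\t'.join(fields) + '\n')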
