Using Pyspark how to convert plain text to csv file - python

When I created a hive table, the data is as follows.
data file
<__name__>abc
<__code__>1
<__value__>1234
<__name__>abcdef
<__code__>2
<__value__>12345
<__name__>abcdef
<__code__>2
<__value__>12345
1234156321
<__name__>abcdef
<__code__>2
<__value__>12345
...
Can I create a table right away without converting the file?
It's a plain text file, three columns are repeated.
How to convert dataframe? or csv file?
I want
| name | code | value
| abc | 1 | 1234
| abcdef | 2 | 12345
...
or
abc,1,1234
abcdef,2,12345
...

I solved my problem like this.
data = spark.read.text(path)
rows = data.rdd.zipWithIndex().map(lambda x: Row(x[0].value, int(x[1]/3)))
schema = StructType() \
.add("col1",StringType(), False) \
.add("record_pos",IntegerType(), False)
df = spark.createDataFrame(rows, schema)
df1 = df.withColumn("key", regexp_replace(split(df["col1"], '__>')[0], '<|__', '')) \
.withColumn("value", regexp_replace(regexp_replace(split(df["col1"], '__>')[1], '\n', '<NL>'), '\t', '<TAB>'))
dataframe = df1.groupBy("record_pos").pivot("key").agg(first("value")).drop("record_pos")
dataframe.show()

val path = "file:///C:/stackqustions/data/stackq5.csv"
val data = sc.textFile(path)
import spark.implicits._
val rdd = data.zipWithIndex.map {
case (records, index) => Row(records, index / 3)
}
val schema = new StructType().add("col1", StringType, false).add("record_pos", LongType, false)
val df = spark.createDataFrame(rdd, schema)
val df1 = df
.withColumn("key", regexp_replace(split($"col1", ">")(0), "<|__", ""))
.withColumn("value", split($"col1", ">")(1)).drop("col1")
df1.groupBy("record_pos").pivot("key").agg(first($"value")).drop("record_pos").show
result:
+----+------+-----+
|code| name|value|
+----+------+-----+
| 1| abc| 1234|
| 2|abcdef|12345|
| 2|abcdef|12345|
| 2|abcdef|12345|
+----+------+-----+

Related

ValueError: Cannot convert [] to Excel

I'm trying to run a program that connects to a postrgres database, runs a query and saves this result in an excel spreadsheet.
The problem I'm facing is at the time I run the query. The result of the 'memberof' column has an empty {}.
rolname | rolsuper | rolinherit | rolcreaterole | rolcreatedb | rolcanlogin | rolconnlimit | rolvaliduntil | memberof | rolreplication | rolbypassrls
------------+----------+------------+---------------+-------------+-------------+--------------+---------------+----------+----------------+--------------
joe | f | t | f | f | t | -1 | | {} | f | f
postgres | t | t | t | t | t | -1 | | {} | t | t
test_joe | f | t | f | f | t | -1 | | {} | f | f
Because of this, my code returns this error output.
Traceback (most recent call last):
File "/path/to/file.py", line 55, in <module>
GetPostgresqlData()
File "/path/to/file.py", line 52, in GetPostgresqlData
ws.append(row)
File "/path/to/.local/lib/python3.9/site-packages/openpyxl/worksheet/worksheet.py", line 665, in append
cell = Cell(self, row=row_idx, column=col_idx, value=content)
File "/path/to/.local/lib/python3.9/site-packages/openpyxl/cell/cell.py", line 116, in __init__
self.value = value
File "/path/to/.local/lib/python3.9/site-packages/openpyxl/cell/cell.py", line 215, in value
self._bind_value(value)
File "/path/to/.local/lib/python3.9/site-packages/openpyxl/cell/cell.py", line 184, in _bind_value
raise ValueError("Cannot convert {0!r} to Excel".format(value))
ValueError: Cannot convert [] to Excel
What should I do so that the result of this column and the others is written in the excel spreadsheet?
Here is the query I'm running against my testing database.
SELECT r.rolname, r.rolsuper, r.rolinherit,
r.rolcreaterole, r.rolcreatedb, r.rolcanlogin,
r.rolconnlimit, r.rolvaliduntil,
ARRAY(SELECT b.rolname
FROM pg_catalog.pg_auth_members m
JOIN pg_catalog.pg_roles b ON (m.roleid = b.oid)
WHERE m.member = r.oid) as memberof
, r.rolreplication
, r.rolbypassrls
FROM pg_catalog.pg_roles r
WHERE r.rolname !~ '^pg_'
ORDER BY 1;
Here is my code.
def GetPostgresqlData():
wb = Workbook()
wb.remove(wb['Sheet'])
ws = wb.create_sheet(0)
ws.title = 'Titulo 01'
hosts = open('serverlist.csv', 'r')
hosts = hosts.readlines()[1:]
for i in range(len(filequery[1:])):
with open(filequery[i+1], 'r') as postgresql_query:
query = postgresql_query.read()
for row in hosts:
command_Array = row.strip('\n').split(',')
db_host = command_Array[0]
db_port = command_Array[1]
db_name = command_Array[2]
db_user = command_Array[3]
db_password = command_Array[4]
postgresql_connection = psycopg2.connect(user=db_user,
password=db_password,
host=db_host,
port=db_port,
database=db_name)
cursor = postgresql_connection.cursor()
cursor.execute(query)
colnames = [desc[0] for desc in cursor.description]
records = cursor.fetchall()
ws.append(colnames)
for row in records:
ws.append(row)
wb.save("postgresql.xlsx")
GetPostgresqlData()
The problem I was facing that was pointed out by Charlie Chuck in the first comment is that I can't save a list in a single cell. So I solved the problem by converting the array to string as shown below.
SELECT r.rolname, r.rolsuper, r.rolinherit,
r.rolcreaterole, r.rolcreatedb, r.rolcanlogin,
r.rolconnlimit, r.rolvaliduntil,
array_to_string(ARRAY(SELECT b.rolname
FROM pg_catalog.pg_auth_members m
JOIN pg_catalog.pg_roles b ON (m.roleid = b.oid)
WHERE m.member = r.oid), ',', '*') as memberof
, r.rolreplication
, r.rolbypassrls
FROM pg_catalog.pg_roles r
WHERE r.rolname !~ '^pg_'
ORDER BY 1

Use Pyspark df.select() to stack the output

I have the following code: . .
base_curr = 'BTC,XRP,ETH'
uri = f"https://min-api.cryptocompare.com/data/pricemultifull?fsyms={base_curr}&tsyms=JPY"
r = requests.get(uri)rdd = sc.parallelize([r.text])
json_df = sqlContext.read.json(rdd)
json_df = json_df.select(
col('RAW.BTC.JPY.FROMSYMBOL').alias('FROMSYMBOL'),
F.col('RAW.BTC.JPY.PRICE').alias('PRICE'),
F.col('RAW.BTC.JPY.LASTUPDATE').alias('LASTUPDATE'),
F.col('RAW.XRP.JPY.FROMSYMBOL').alias('FROMSYMBOL'),
F.col('RAW.XRP.JPY.PRICE').alias('PRICE'),
F.col('RAW.XRP.JPY.LASTUPDATE').alias('LASTUPDATE'),
F.col('RAW.ETH.JPY.FROMSYMBOL').alias('FROMSYMBOL'),
F.col('RAW.ETH.JPY.PRICE').alias('PRICE'),
F.col('RAW.ETH.JPY.LASTUPDATE').alias('LASTUPDATE')
)
print(json_df)
json_df.show()
json_df.printSchema()
The output is:
+----------+----------+----------+----------+-----+----------+----------+--------+----------+
|FROMSYMBOL| PRICE|LASTUPDATE|FROMSYMBOL|PRICE|LASTUPDATE|FROMSYMBOL| PRICE|LASTUPDATE|
+----------+----------+----------+----------+-----+----------+----------+--------+----------+
| BTC|1994390.75|1607158504| XRP|61.85|1607158490| ETH|61874.24|1607158509|
+----------+----------+----------+----------+-----+----------+----------+--------+----------+
But the desired output is:
+----------+----------+----------+
|FROMSYMBOL| PRICE|LASTUPDATE|
+----------+----------+----------+
| BTC|1994390.75|1607158504|
+----------+----------+----------+
| XRP|61.85 |1607158490|
+----------+----------+----------+
| ETH|61874.24 |1607158509|
+----------+----------+----------+
Any help would be so great. I tried several method without success.
Does this work for you?
base_curr = 'BTC,XRP,ETH'
uri = f"https://min-api.cryptocompare.com/data/pricemultifull?fsyms={base_curr}&tsyms=JPY"
r = requests.get(uri)
rdd = sc.parallelize([r.text])
json_df = sqlContext.read.json(rdd)
json_df = json_df.select(
F.explode(
F.array(
F.array(F.col('RAW.BTC.JPY.FROMSYMBOL').alias('FROMSYMBOL'),
F.col('RAW.BTC.JPY.PRICE').alias('PRICE'),
F.col('RAW.BTC.JPY.LASTUPDATE').alias('LASTUPDATE')
),
F.array(F.col('RAW.XRP.JPY.FROMSYMBOL').alias('FROMSYMBOL'),
F.col('RAW.XRP.JPY.PRICE').alias('PRICE'),
F.col('RAW.XRP.JPY.LASTUPDATE').alias('LASTUPDATE')
),
F.array(F.col('RAW.ETH.JPY.FROMSYMBOL').alias('FROMSYMBOL'),
F.col('RAW.ETH.JPY.PRICE').alias('PRICE'),
F.col('RAW.ETH.JPY.LASTUPDATE').alias('LASTUPDATE')
)
)
).alias('arrays')
).select(
F.col('arrays')[0].alias('FROMSYMBOL'),
F.col('arrays')[1].alias('PRICE'),
F.col('arrays')[2].alias('LASTUPDATE'),
)

split a full name to first name and last name in pyspark?

so basically i'm learning pyspark
and i know how to split full name to first name and last name in python
name = "sun moon"
FName = name.split()[0]
LName = name.split()[1]
i want to do this in pyspark file json
{"l":"santee, california, united states","t":"161xxxx","caseN":"888548748565","caseL":"CA","n":"sun moon"}
my code
df = spark.read.json("cases.json")
df.select("l","t","caseN","caseL","n")
df \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('cases')
i want to split n to FName and Lname
from pyspark.sql.functions import split
df = spark.read.json("cases.json")
df.select("l","t","caseN","caseL","n")\
.withColumn("FName", split(col("n"), " ").getItem(0))\
.withColumn("LName", split(col("n"), " ").getItem(1))\
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('cases')
No you can just do it this way:
sname = name.split(" ")
The above line chops the name whenever it encounters space.
Fname = sname[0]
Lname = sname[-1]
from pyspark.sql.functions import split,size,col
df2=(spark
.createDataFrame(
[
("f1 f2","l1 l2 l3"),
("f1","l1 l2"),
("f1 f2 f3","l1 l2 l3 l4")
],
["first_name","last_name"]
)
)
(df2
.withColumn("last_name_size",size(split(col("last_name")," ")))
.withColumn("first_name2",split(col("first_name")," ").getItem(0))
.withColumn("last_name2",split(col("last_name")," ").getItem(col("last_name_size")-1)).show())
The result is
+----------+-----------+--------------+-----------+----------+
|first_name| last_name|last_name_size|first_name2|last_name2|
+----------+-----------+--------------+-----------+----------+
| f1 f2| l1 l2 l3| 3| f1| l3|
| f1| l1 l2| 2| f1| l2|
| f1 f2 f3|l1 l2 l3 l4| 4| f1| l4|
+----------+-----------+--------------+-----------+----------+
what I did was simply create an auxiliary column with last_name split size
and with that it is possible to get the last item in the array
If df is the data frame where customer's full name is in column CUST_NAME (For ex:- VIKRANT R THAKUR) then the following code format will work :-
df1 = df.withColumn('FIRST_NAME', initcap(substring_index('CUST_NAME', ' ', 1)) )
-> VIKRANT
df1 = df.withColumn('LAST_NAME', initcap(substring_index('CUST_NAME',' ',-1)) )
-> THAKUR

Read file to determine the rules

I have an excel file with user defined business rules as below:
Column_Name|Operator|Column_Value1|Operand|RuleID|Result
ABC | Equal| 12| and| 1| 1
CDE | Equal| 10| and| 1| 1
XYZ | Equal| AD| | 1| 1.5
ABC | Equal| 11| and| 2| 1
CDE | Equal| 10| | 2| 1.2
and so on. (just for formatting purpose have put | symbol).
Input file (CSV) will look like below:
ABC,CDE,XYZ
12,10,AD
11,10,AD
Goal here is to derive an output column called Result which needs to be looked up to the user defined business rule excel.
Output Expected:
ABC,CDE,XYZ,Result
12,10,AD,1.5
11,10,AD,1.2
I have so far tried to generate an if statement and trying to assign the entire if/elif statement to a function. So that I can pass it to below statement to apply the rules.
ouput_df['result'] = input_df.apply(result_func, axis=1)
When I have the function with manually coding the rules it works as shown below:
def result_func(input_df):
if (input_df['ABC'] == 12):
return '1.25'
elif (ip_df['ABC'] == 11):
return '0.25'
else:
return '1'
Is this the right way of handling this scenario? If so how do I pass the entire dynamically generated if/elif to the function?
Code
import pandas as pd
import csv
# Load rules table
rules_table = []
with open('rules.csv') as csvfile:
reader = csv.DictReader(csvfile, delimiter='|')
for row in reader:
rules_table.append([x.strip() for x in row.values()])
# Load CSV file into DataFrame
df = pd.read_csv('data.csv', sep=",")
def rules_eval(row, rules):
" Steps through rules table for appropriate value "
def operator_eval(op, col, value):
if op == 'Equal':
return str(row[col]) == str(value)
else:
# Curently only Equal supported
raise ValueError(f"Unsupported Operator Value {op}, only Equal allowed")
prev_rule = '~'
for col, op, val, operand, rule, res in rules:
# loop through rows of rule table
if prev_rule != rule:
# rule ID changed so we can follow rule chains again
ignore_rule = False
if not ignore_rule:
if operator_eval(op, col, val):
if operand != 'and':
return res
else:
# Rule didn't work for an item in group
# ignore subsequent rules with this id
ignore_rule = True
prev_rule = rule
return None
df['results'] = df.apply(lambda row: rules_eval(row, rules_table), axis=1)
print(df)
Output
ABC CDE XYZ results
0 12 10 AD 1.5
1 11 10 AD 1.2
Explanation
df.apply - applies the rules_eval function to each row of the DataFrame.
The output is placed into column 'result' via
df['result'] = ...
Handling Rule Priority
Change
Added a Priority column to the rules_table so rules with the same RuleID are processed in order of priority.
Priority order decided by tuple ordering added to heap, currently
Priority, Column_Name, Operator, Column_Value, Operand, RuleID, Result
Code
import pandas as pd
import csv
from collections import namedtuple
from heapq import (heappush, heappop)
# Load CSV file into DataFrame
df = pd.read_csv('data.csv', sep=",")
class RulesEngine():
###########################################
# Static members
###########################################
# Named tuple for rules
fieldnames = 'Column_Name|Operator|Column_Value1|Operand|RuleID|Priority|Result'
Rule = namedtuple('Rule', fieldnames.replace('|', ' '))
number_fields = fieldnames.count('|') + 1
###########################################
# members
###########################################
def __init__(self, table_file):
# Load rules table
rules_table = []
with open(table_file) as csvfile:
reader = csv.DictReader(csvfile, delimiter='|')
for row in reader:
fields = [self.convert(x.strip()) for x in row.values() if x is not None]
if len(fields) != self.number_fields:
# Incorrect number of values
error = f"Rules require {self.number_fields} fields per row, was given {len(fields)}"
raise ValueError(error)
rules_table.append([self.convert(x.strip()) for x in row.values()])
#rules_table.append([x.strip() for x in row.values()])
self.rules_table = rules_table
def convert(self, s):
" Convert string to (int, float, or leave current value) "
try:
return int(s)
except ValueError:
try:
return float(s)
except ValueError:
return s
def operator_eval(self, row, rule):
" Determines value for a rule "
if rule.Operator == 'Equal':
return str(row[rule.Column_Name]) == str(rule.Column_Value1)
else:
# Curently only Equal supported
error = f"Unsupported Operator {rule.Operator}, only Equal allowed"
raise ValueError(error)
def get_rule_value(self, row, rule_queue):
" Value of a rule or None if no matching rule "
found_match = True
while rule_queue:
priority, rule_to_process = heappop(rule_queue)
if not self.operator_eval(row, rule_to_process):
found_match = False
break
return rule_to_process.Result if found_match else None
def rules_eval(self, row):
" Steps through rules table for appropriate value "
rule_queue = []
for index, r in enumerate(self.rules_table):
# Create named tuple with current rule values
current_rule = self.Rule(*r)
if not rule_queue or \
rule_queue[-1][1].RuleID == current_rule.RuleID:
# note: rule_queue[-1][1].RuleID is previous rule
# Within same rule group or last rule of group
priority = current_rule.Priority
# heap orders rules by pririty
# (lowest numbers are processed first)
heappush(rule_queue, (priority, current_rule))
if index < len(self.rules_table)-1:
continue # not at last rule, so keep accumulating
# Process rules in the rules queue
rule_value = self.get_rule_value(row, rule_queue)
if rule_value:
return rule_value
else:
# Starting over with new rule group
rule_queue = []
priority = current_rule.Priority
heappush(rule_queue, (priority, current_rule))
# Process Final queue if not empty
return self.get_rule_value(row, rule_queue)
# Init rules engine with rules from CSV file
rules_engine = RulesEngine('rules.csv')
df['results'] = df.apply(rules_engine.rules_eval, axis=1)
print(df)
Data Table
ABC,CDE,XYZ
12,10,AD
11,10,AD
12,12,AA
Rules Table
Column_Name|Operator|Column_Value1|Operand|RuleID|Priority|Result
ABC | Equal| 12| and| 1| 2|1
CDE | Equal| 10| and| 1| 1|1
XYZ | Equal| AD| and| 1| 3|1.5
ABC | Equal| 11| and| 2| 1|1
CDE | Equal| 10| foo| 2| 2|1.2
ABC | Equal| 12| foo| 3| 1|1.8
Output
ABC CDE XYZ results
0 12 10 AD 1.5
1 11 10 AD 1.2
2 12 12 AA 1.8

How to split a comma-delimited list in IronPython (Spotfire)?

I have a existing data table with two columns, one is a ID and one is a list of IDs, separated by comma.
For example
ID | List
---------
1 | 1, 4, 5
3 | 2, 12, 1
I would like to split the column List so that I have a table like this:
ID | List
---------
1 | 1
1 | 4
1 | 5
3 | 2
3 | 12
3 | 1
I figured this out now:
tablename='Querysummary Data'
table=Document.Data.Tables[tablename]
topiccolname='TOPIC_ID'
topiccol=table.Columns[topiccolname]
topiccursor=DataValueCursor.Create[str](topiccol)
docscolname='DOC_IDS'
doccol=table.Columns[docscolname]
doccursor=DataValueCursor.Create[str](doccol)
myPanel = Document.ActivePageReference.FilterPanel
idxSet = myPanel.FilteringSchemeReference.FilteringSelectionReference.GetSelection(table).AsIndexSet()
keys=dict()
topdoc=dict()
for row in table.GetRows(idxSet,topiccursor,doccursor):
keys[topiccursor.CurrentValue]=doccursor.CurrentValue
for key in keys:
str = keys[key].split(",")
for i in str:
topdoc[key]=i
print key + " " +i
now I can print the topic id with the corresponding id.
How can I create a new data table in Spotfire using this dict()?
I solved it myself finally..maybe there is some better code but it works:
tablename='Querysummary Data'
table=Document.Data.Tables[tablename]
topiccolname='TOPIC_ID'
topiccol=table.Columns[topiccolname]
topiccursor=DataValueCursor.Create[str](topiccol)
docscolname='DOC_IDS'
doccol=table.Columns[docscolname]
doccursor=DataValueCursor.Create[str](doccol)
myPanel = Document.ActivePageReference.FilterPanel
idxSet = myPanel.FilteringSchemeReference.FilteringSelectionReference.GetSelection(table).AsIndexSet()
# build a string representing the data in tab-delimited text format
textData = "TOPIC_ID;DOC_IDS\r\n"
keys=dict()
topdoc=dict()
for row in table.GetRows(idxSet,topiccursor,doccursor):
keys[topiccursor.CurrentValue]=doccursor.CurrentValue
for key in keys:
str = keys[key].split(",")
for i in str:
textData += key + ";" + i + "\r\n"
dataSet = DataSet()
dataTable = DataTable("DOCIDS")
dataTable.Columns.Add("TOPIC_ID", System.String)
dataTable.Columns.Add("DOC_IDS", System.String)
dataSet.Tables.Add(dataTable)
# make a stream from the string
stream = MemoryStream()
writer = StreamWriter(stream)
writer.Write(textData)
writer.Flush()
stream.Seek(0, SeekOrigin.Begin)
# set up the text data reader
readerSettings = TextDataReaderSettings()
readerSettings.Separator = ";"
readerSettings.AddColumnNameRow(0)
readerSettings.SetDataType(0, DataType.String)
readerSettings.SetDataType(1, DataType.String)
readerSettings.SetDataType(2, DataType.String)
# create a data source to read in the stream
textDataSource = TextFileDataSource(stream, readerSettings)
# add the data into a Data Table in Spotfire
if Document.Data.Tables.Contains("Querysummary Mapping"):
Document.Data.Tables["Querysummary Mapping"].ReplaceData(textDataSource)
else:
newTable = Document.Data.Tables.Add("Querysummary Mapping", textDataSource)
tableSettings = DataTableSaveSettings (newTable, False, False)
Document.Data.SaveSettings.DataTableSettings.Add(tableSettings)

Categories

Resources