writerow() with lists from JSON - python

I have the following information extracted from a JSON and saved in variables. The variables and their contents are:
tickets = ['DC-830', 'DC-463', 'DC-631']
duration = ['5h 3m', '1h 7m', '3h 4m']
When I use writerow() and the JSON has only one value, for example tickets = 'DC-830', I am able to save the information in a CSV file. However, if it has 2 or more values, it writes all the information in the same row.
This is what I get:
Ticket | Duration
['DC-830', 'DC-463', 'DC-631'] | ['5h 3m', '1h 7m', '3h 4m']
Instead I need something like this:
Ticket | Duration
DC-830 | 5h 3m
DC-463 | 1h 7m
DC-631 | 3h 4m
This is the code:
import csv

issues_test = s_json['issues']  # ['issues'][0] is just 'issues', so index directly
tickets, duration = [], []
for item in issues_test:
    tickets.append(item['key'])
    duration.append(item['fields']['customfield_154'])
header = ['Ticket', 'Duration']
with open('P1_6.7.csv', 'a') as arc:
    writer = csv.writer(arc)
    writer.writerow(header)
    writer.writerow([tickets, duration])

As the singular name suggests, writerow() just writes one row. The argument should be a list of strings or numbers. You're giving it a 2-dimensional list, and expecting it to pair them up and write each pair to a separate row.
To write multiple rows, use writerows() (notice the plural in the name). You can use zip() to pair up the elements of the two lists.
writer.writerows(zip(tickets, duration))
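Applied to the question's code, a minimal sketch (the newline='' argument is the csv module's usual recommendation on Python 3 to avoid blank lines on Windows):

import csv

tickets = ['DC-830', 'DC-463', 'DC-631']
duration = ['5h 3m', '1h 7m', '3h 4m']

header = ['Ticket', 'Duration']
with open('P1_6.7.csv', 'a', newline='') as arc:
    writer = csv.writer(arc)
    writer.writerow(header)                   # one row for the header
    writer.writerows(zip(tickets, duration))  # one row per (ticket, duration) pair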

Related

How to use writeStream from a pyspark streaming dataframe to chop the values into different columns?

I am trying to ingest some files, and each one is being read as a single-column string (which is expected, since they are fixed-width files). I have to split that single value into different columns, but I must use writeStream since it is a streaming dataframe. This is an example of the input:
"64 Apple 32.32128Orange12.1932 Banana 2.45"
Expected dataframe:
64, Apple, 32.32
128, Orange, 12.19
32, Banana, 2.45
Notice how every column has a fixed width (3, 6, 5 characters); these widths are what META_SIZES holds. Therefore each row has 14 characters (the sum of the column widths).
I tried using foreach as in the following example, but it is not doing anything:
two_d = []

streamingDF = (
    spark.readStream.format("cloudFiles")
    .option("encoding", sourceEncoding)
    .option("badRecordsPath", badRecordsPath)
    .options(**cloudfiles_config)
    .load(sourceBasePath)
)

def process_row(string):
    rows = round(len(string) / chars_per_row)
    for i in range(rows):
        current_index = 0
        two_d.append([])
        for j in range(len(META_SIZES)):
            two_d[i].append(string[(i * chars_per_row + current_index):(i * chars_per_row + current_index + META_SIZES[j])].strip())
            current_index += META_SIZES[j]
        print(two_d[i])

query = streamingDF.writeStream.foreach(process_row).start()
I will probably use withColumn to add them as columns instead of building the list, or turn that list into a streaming dataframe if that is possible and better.
Edit: I added an input example and explained META_SIZES
Assuming the inputs are something like the following.
...
"64 Apple 32.32"
"128 Orange 12.19"
"32 Banana 2.45"
...
You can do this.
stream_lines = (
    spark.readStream.format("cloudFiles")
    .option("encoding", sourceEncoding)
    .option("badRecordsPath", badRecordsPath)
    .options(**cloudfiles_config)
    .load(sourceBasePath)
)

# remove this line if strings are already utf-8
lines = stream_lines.select(stream_lines['value'].cast('string'))

lengths = (lines.withColumn('Count', functions.split(lines['value'], ' ').getItem(0))
                .withColumn('Fruit', functions.split(lines['value'], ' ').getItem(1))
                .withColumn('Price', functions.split(lines['value'], ' ').getItem(2)))
Note that "value" is set as the default column name when reading a string using readStream. If clouds_config contains anything changing the column name of the input you will have to alter the column name in the code above.

Formatting the data correctly from text file using pandas Python

I have data in my .txt file
productname1
7,64
productname2
6,56
4.73
productname3
productname4
12.58
10.33
So the data is explained here: the product name is on the first line, and the price is on the following line. But for the 2nd product we have both the original price and the discounted price. Also, the prices sometimes contain '.' and ',' to represent cents. I want to format the data in the following way:
Product o_price d_price
productname1 7.64 -
productname2 6.56 4.73
productname3 - -
productname4 12.58 10.33
My current approach is a bit naive, but it works for 98% of the cases:
import pandas as pd

data = {}
tempKey = []
with open("myfile.txt", encoding="utf-8") as file:
    arr_content = file.readlines()
    for val in arr_content:
        if not val[0].isdigit():  # check whether the starting letter is a digit or text
            val = ' '.join(val.split())  # remove extra spaces
            data.update({val: []})  # add the key to the dict, initialized with a list that will hold the prices
            tempKey.append(val)  # keep track of the last key added, because dicts are not sequential
        else:
            data[str(tempKey[-1])].append(val)  # use the last added key and update its price list
df = pd.DataFrame(list(data.items()), columns=['Product', 'Pricelist'])
df[['o_price', 'd_price']] = pd.DataFrame([x for x in df.Pricelist])
df = df.drop('Pricelist', axis=1)  # the column is named 'Pricelist', not 'Prices'
So this technique does not work when a product name starts with a digit. Any suggestions for a better approach?
Use a regular expression to check whether the line contains only digits, periods, and commas (the sample prices use both '.' and ',').
import re

if re.match(r"^[0-9.,]+$", val.strip()):
    ...  # this is a product price
else:
    ...  # this is a product name
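A minimal sketch of how that check could replace the isdigit() test in the question's loop (same variable names as above; commas are included since the sample prices use them):

import re

PRICE_RE = re.compile(r"^[0-9.,]+$")

for val in arr_content:
    stripped = val.strip()
    if PRICE_RE.match(stripped):  # only digits, '.' and ',': a price
        data[tempKey[-1]].append(stripped)
    else:                         # anything else is a product name
        name = ' '.join(stripped.split())
        data[name] = []
        tempKey.append(name)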

Reading specific columns from CSV Python

I am trying to parse through a CSV file and extract a few columns from it.
ID | Code | Phase | FBB | AM | Development status | AN REMARKS | stem | year | IN-NAME | IN Year | Company
L2106538 | Rs124 | 4 | | | Unknown | | -pre- | 1982 | Domoedne | 1982 | XYZ
I would like to group and extract a few columns for uploading them to different models.
For example, I would like to group the first 3 columns into one model, the next two into a different model, the first column plus columns 6 and 7 into another model, and so on.
I also need to keep the header of the file and store the data as key-value pairs, so that I know which column should go to a particular field in a model.
This is what I have so far.
def group_header_value(file):
    reader = csv.DictReader(open(file, 'r'))  # to have the header and get the data as key-value pairs
    all_result = []
    for row in reader:
        print(row)
        all_result.append(row)
    return all_result

def group_by_models(all_results):
    MD = range(1, 3)  # to get the required cols
    for every_row in all_results:
        contents = [(every_row[i] for i in MD)]
        print(contents)

def handle(self, *args, **options):
    database = options.get('database')
    filename = options.get('filename')
    all_results = group_header_value(filename)
    print('grouped_bymodel', group_by_models(all_results))
This is what I get when I try to get the contents
grouped_by model: <generator object <genexpr> at 0x7f9f5382e0f0>
<generator object <genexpr> at 0x7f9f5382e0a0>
<generator object <genexpr> at 0x7f9f5382e0f0>
Is there a different approach to extracting particular columns with DictReader? How else can I extract the required columns using DictReader? Thanks.
(every_row[i] for i in MD) is a generator expression. The syntax for a generator expression is (mostly) the same as that for a list comprehension, except that a generator expression is enclosed by parentheses, (...), while a list comprehension uses brackets, [...].
[(every_row[i] for i in MD)] is a list containing one element, the generator expression.
To fix your code with minimal changes, remove the parentheses:
def group_by_models(all_results):
    MD = range(1, 3)  # to get the required cols
    for every_row in all_results:
        contents = [every_row[i] for i in MD]
        print(contents)
You could also make group_by_models more reusable by making MD a parameter:
def group_by_models(all_results, MD=range(3)):
    for every_row in all_results:
        contents = [every_row[i] for i in MD]
        print(contents)
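Note that DictReader yields dicts keyed by the header names, not by position, so selecting columns by name may be closer to what is needed here. A sketch, with column names taken from the question's sample header:

def group_by_models(all_results, wanted=('ID', 'Code', 'Phase')):
    for every_row in all_results:  # every_row is a dict keyed by the CSV header names
        contents = {name: every_row[name] for name in wanted}
        print(contents)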

How to output vertical bar format in textfile

I am trying to save my contents to the text file in this format:
|Name|Description|
|QY|Hello|
But for now mine is returning this output:
|2014 1Q 2014 2Q 2014 3Q 2014 4Q|
|96,368.3 96,808.3 97,382 99,530.5|
Code:
def write_to_txt(header, data):
    with open('test.txt', 'a+') as f:
        f.write("|{}|\n".format('\t'.join(str(field) for field in header)))  # write header
        f.write("|{}|\n".format('\t'.join(str(field) for field in data)))  # write items
Any idea how to change my code so that I would have the ideal output?
This is happening because you are joining the fields inside the data iterable using \t; use | in the join as well. Example -
def write_to_txt(header, data):
    with open('test.txt', 'a+') as f:
        f.write("|{}|\n".format('|'.join(str(field) for field in header)))  # write header
        f.write("|{}|\n".format('|'.join(str(field) for field in data)))  # write items
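A quick usage sketch with the sample values from the question:

write_to_txt(['Name', 'Description'], ['QY', 'Hello'])
# test.txt now contains:
# |Name|Description|
# |QY|Hello|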

How to easily filter from csv with python?

Assuming I have the following CSV file:
User ID Name Application
001 Ajohns ABI
002 Fjerry Central
900 Xknight RFC
300 JollK QDI
078 Demik Central
Is there some easy way to import this into some data structure, and/or to easily perform the following operations in Python?
1) Get all user IDs with Application=Central
2) Get the row where Name="Fjerry", then extract, say, the "User ID" value from it
3) Give me the filtered rows with Application=Central so I can write them out to a CSV
4) Give me all the rows where Name="Ajohns" and Application="ABI", such that I could easily do a len() to count them
Is there some Python library for this, or what is the easiest way to accomplish the above?
Trivial using DictReader. You need to pass excel-tab as the dialect, since your fields are tab-delimited.
rows is a list of dictionaries.
>>> import csv
>>> with open(r'd:\file.csv', 'rb') as f:
...     reader = csv.DictReader(f, dialect='excel-tab')
...     rows = list(reader)
...
>>> [x['User ID'] for x in rows if x['Application'] == 'Central']
['002', '078']
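The remaining operations follow the same pattern; a sketch reusing the rows list above (column names taken from the question's header, expected values from the sample data):

>>> # 2) find the row for a given name, then pull one field from it
>>> next(x for x in rows if x['Name'] == 'Fjerry')['User ID']
'002'
>>> # 3) filter rows, ready to write back out with csv.DictWriter
>>> central = [x for x in rows if x['Application'] == 'Central']
>>> # 4) count rows matching several fields
>>> len([x for x in rows if x['Name'] == 'Ajohns' and x['Application'] == 'ABI'])
1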
