csv.writer: separators create unwanted columns between the actual values - Python

I am trying to create a CSV file that I can open with Excel from an API data extraction (I don't know how to include it here), using csv.writer. However, for now the separators are treated as values and added as extra columns between the actual values.
My code looks like this:
import csv

resources_list = "resources_list.csv"
list_keys_res = list(res['items'][0].keys())
list_items_res = list(res['items'])

resources = open(resources_list, 'w', newline='')
with resources:
    # identifying header
    writer = csv.writer(resources, delimiter=';')
    header_res = writer.writerow([list_keys_res[1]] + [";"] + [list_keys_res[0]] + [";"] + [list_keys_res[2]] + [";"] + [list_keys_res[4]])
    resource = None
    # loop to write the data into each row of the csv file - a row describes 1 resource
    for i in list_items_res:
        resource = list_items_res.index(i)
        list_values_res = list(list_items_res[resource].values())
        writer.writerow([list_values_res[1]] + [";"] + [list_values_res[0]] + [';'] + [list_values_res[2]] + [";"] + [list_values_res[4]])
        i = resource + 1
My goal is to have:
name ; id ; userName ; email
"Resource 1" ; 1 ; res1 ; myaddress1@host.com
"Resource 2" ; 2 ; res2 ; myaddress2@host.com
"Resource 3" ; 3 ; res3 ; myaddress3@host.com
...
And for now I have:
name;";";id;";";userName;";";email
Resource 1;";";1;";"; res1;";"; myaddress1@host.com
Resource 2;";";2;";"; res2;";"; myaddress2@host.com
Resource 3;";";3;";"; res3;";"; myaddress3@host.com
...
Here is how it looks in Excel for now.
The code runs just fine, I just can't seem to get the format right. Thanks in advance, I hope this will help others as well!
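For reference, a minimal sketch of the usual fix (assuming `res` has the structure used above): pass only the actual values to writerow and let delimiter=';' insert the separators, instead of adding ";" strings as extra list items.
import csv

# Sketch only: assumes res['items'] is a list of dicts, as in the question.
resources_list = "resources_list.csv"
keys = list(res['items'][0].keys())

with open(resources_list, 'w', newline='') as resources:
    writer = csv.writer(resources, delimiter=';')
    # header: just the field names, no literal ";" items
    writer.writerow([keys[1], keys[0], keys[2], keys[4]])
    for item in res['items']:
        values = list(item.values())
        writer.writerow([values[1], values[0], values[2], values[4]])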

Related

How to Convert PDF file into CSV file using Python Pandas

I have a PDF file that I need to convert into a CSV file. This is an example of my PDF file: https://online.flippingbook.com/view/352975479/ The code used is:
import re
import parse
import pdfplumber
import pandas as pd
from collections import namedtuple

file = "Battery Voltage.pdf"
lines = []
total_check = 0

with pdfplumber.open(file) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            print(line)
With the above script I am not getting proper output; for the Time column, "AM" ends up on the next line. The output I am getting looks like this:
It may help to see how the surface of a PDF is painted onto the screen: one string of plain text is placed part by part on the display. (Here I highlight where the first AM is placed.)
As a side note, that first AM in the file is, I think, encoded as this block:
BT
/F1 12 Tf
1 0 0 1 224.20265 754.6322 Tm
[<001D001E>] TJ
ET
In that area, 1D = A and 1E = M.
So if you wish to extract each LINE as it is displayed, by far the simplest way is to use a tool such as pdftotext, which outputs each row of text as it is seen on the page.
Thus, using an approach such as tabular comma-separated extraction, you can expect each AM to be given its own row, which should by logic be " ",AM," "," ", though some extractors will report it as nan,AM,nan,nan.
As text, it looks like this from just one scriptable command:
pdftotext -layout "Battery Voltage.pdf"
That will output "Battery Voltage.txt" in the same working folder.
Placing that in a spreadsheet gives the layout shown below, and we can then export it in a couple of clicks as "proper output" CSV, along with all the oddities that CSV entails:
,,Battery Vo,ltage,
Sr No,DateT,Ime,Voltage (v),Ignition
1,01/11/2022,00:08:10,47.15,Off
,AM,,,
2,01/11/2022,00:23:10,47.15,Off
,AM,,,
3,01/11/2022,00:38:10,47.15,Off
,AM,,,
4,01/11/2022,00:58:10,47.15,Off
,AM,,,
5,01/11/2022,01:18:10,47.15,Off
,AM,,,
6,01/11/2022,01:33:10,47.15,Off
,AM,,,
7,01/11/2022,01:48:10,47.15,Off
,AM,,,
8,01/11/2022,02:03:10,47.15,Off
,AM,,,
9,01/11/2022,02:18:10,47.15,Off
,AM,,,
10,01/11/2022,02:37:12,47.15,Off
,AM,,,
So, if the edits were not done before CSV generation, it is simpler to post-process in an editor such as this HTML page (no need for more apps):
,,Battery,Voltage,
Sr No,Date,Time,Voltage (v),Ignition
1,01/11/2022,00:08:10,47.15,Off,AM,,,
2,01/11/2022,00:23:10,47.15,Off,AM,,,
3,01/11/2022,00:38:10,47.15,Off,AM,,,
4,01/11/2022,00:58:10,47.15,Off,AM,,,
5,01/11/2022,01:18:10,47.15,Off,AM,,,
6,01/11/2022,01:33:10,47.15,Off,AM,,,
7,01/11/2022,01:48:10,47.15,Off,AM,,,
8,01/11/2022,02:03:10,47.15,Off,AM,,,
9,01/11/2022,02:18:10,47.15,Off,AM,,,
10,01/11/2022,02:37:12,47.15,Off,AM,,,
Then on re-import it looks more like it was generated by hand.
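If you would rather do that post-processing in Python than in an editor, here is a rough sketch (my own addition; it assumes the raw CSV shown above is saved as battery_raw.csv, where each AM continuation line starts with a comma):
# Sketch: merge ",AM,,," continuation lines onto the previous data row.
# Assumes the raw CSV output shown above is in "battery_raw.csv" (placeholder name).
merged = []
with open("battery_raw.csv") as infile:
    for line in infile:
        line = line.rstrip("\n")
        if line.startswith(",") and merged:
            # continuation line (e.g. ",AM,,,"): append it to the previous row
            merged[-1] += line
        else:
            merged.append(line)

with open("battery_merged.csv", "w") as outfile:
    outfile.write("\n".join(merged) + "\n")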
In discussion it was confirmed that all that's desired is a means to a structured list, and a first parse using
pdftotext -layout -nopgbrk -x 0 -y 60 -W 800 -H 800 -fixed 6 "Battery Voltage.pdf" &type "battery voltage.txt"|findstr "O">battery.txt
will output regular data columns for framing, with a fixed heading, splitting, or otherwise using the cleaned data:
1 01-11-2022 00:08:10 47.15 Off
2 01-11-2022 00:23:10 47.15 Off
3 01-11-2022 00:38:10 47.15 Off
4 01-11-2022 00:58:10 47.15 Off
5 01-11-2022 01:18:10 47.15 Off
...
32357 24-11-2022 17:48:43 45.40 On
32358 24-11-2022 17:48:52 44.51 On
32359 24-11-2022 17:48:55 44.51 On
32360 24-11-2022 17:48:58 44.51 On
32361 24-11-2022 17:48:58 44.51 On
At this stage we can use text handling to produce CSV or add JSON brackets:
for /f "tokens=1,2,3,4,5 delims= " %%a In ('Findstr /C:"O" battery.txt') do echo csv is "%%a,%%b,%%c,%%d,%%e">output.txt
...
csv is "32357,24-11-2022,17:48:43,45.40,On"
csv is "32358,24-11-2022,17:48:52,44.51,On"
csv is "32359,24-11-2022,17:48:55,44.51,On"
csv is "32360,24-11-2022,17:48:58,44.51,On"
csv is "32361,24-11-2022,17:48:58,44.51,On"
So the request is for JSON (not my forte, so you may need to improve on my code, as I don't know what Mongo expects).
Here I drop a PDF onto a battery.bat:
{"line_id":1,"created":{"date":"01-11-2022"},{"time":"00:08:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":2,"created":{"date":"01-11-2022"},{"time":"00:23:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":3,"created":{"date":"01-11-2022"},{"time":"00:38:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":4,"created":{"date":"01-11-2022"},{"time":"00:58:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":5,"created":{"date":"01-11-2022"},{"time":"01:18:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":6,"created":{"date":"01-11-2022"},{"time":"01:33:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":7,"created":{"date":"01-11-2022"},{"time":"01:48:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":8,"created":{"date":"01-11-2022"},{"time":"02:03:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":9,"created":{"date":"01-11-2022"},{"time":"02:18:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":10,"created":{"date":"01-11-2022"},{"time":"02:37:12"},{"Voltage":"47.15"},{"State","Off"}}
It is a bit slow running in a pure console, so let's run it blind by adding @. It will still take time as we are working in plain text, so expect a significant delay for 32,000+ lines: about 2.5 minutes on my kit.
pdftotext -layout -nopgbrk -x 0 -y 60 -W 700 -H 800 -fixed 8 "%~1" battery.txt
echo Heading however you wish it for json perhaps just opener [ but note only one redirect chevron >"%~dpn1.txt"
for /f "tokens=1,2,3,4,5 delims= " %%a In ('Findstr /C:"O" battery.txt') do @echo "%%a": { "Date": "%%b", "Time": "%%c", "Voltage": %%d, "Ignition": "%%e" },>>"%~dpn1.txt"
REM another json style could be { "Line_Id": %%a, "Date": "%%b", "Time": "%%c", "Voltage": %%d, "Ignition": "%%e" },
REM another for an array can simply be [%%a,"%%b","%%c",%%d,"%%e" ],
echo Tailing however you wish it for json perhaps just final closer ] but note double chevron >>"%~dpn1.txt"
To see progress change @echo { to @echo %%a&echo {
Thus, it is done after a minute or so. However, showing progress tends to add an extra minute for all that display activity before the window closes as a sign of completion.
For cases like these, build a parser that converts the unusable data into something you can use.
The logic below converts that exact file to a CSV, but it will only work with that specific file's contents.
Note that for this specific file you can ignore the AM/PM as the time is in 24h format.
import pdfplumber

file = "Battery Voltage.pdf"
skiplines = [
    "Battery Voltage",
    "AM",
    "PM",
    "Sr No DateTIme Voltage (v) Ignition",
    ""
]

with open("output.csv", "w") as outfile:
    header = "serialnumber;date;time;voltage;ignition\n"
    outfile.write(header)
    with pdfplumber.open(file) as pdf:
        for page in pdf.pages:
            for line in page.extract_text().split('\n'):
                if line.strip() in skiplines:
                    continue
                outfile.write(";".join(line.split()) + "\n")
EDIT
So, JSON files in Python are basically just a list of dict items (yes, that's an oversimplification).
The only thing you need to change is the way you actually process the lines. The actual meat of the logic doesn't change...
import pdfplumber
import json

file = "Battery Voltage.pdf"
skiplines = [
    "Battery Voltage",
    "AM",
    "PM",
    "Sr No DateTIme Voltage (v) Ignition",
    ""
]

result = []
with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        for line in page.extract_text().split("\n"):
            if line.strip() in skiplines:
                continue
            serialnumber, date, time, voltage, ignition = line.split()
            result.append(
                {
                    "serialnumber": serialnumber,
                    "date": date,
                    "time": time,
                    "voltage": voltage,
                    "ignition": ignition,
                }
            )

with open("output.json", "w") as outfile:
    json.dump(result, outfile)

Split large CSV file based on row value

The problem
I have a csv file called data.csv. On each row I have:
timestamp: int
account_id: int
data: float
for instance:
timestamp,account_id,value
10,0,0.262
10,0,0.111
13,1,0.787
14,0,0.990
This file is ordered by timestamp.
The number of rows is too big to store them all in memory.
Order of magnitude: 100 M rows, number of accounts: 5 M.
How can I quickly get all rows for a given account_id? What would be the best way to make the data accessible by account_id?
Things I tried
To generate a sample:
import os
import random
import shutil

import tqdm

N_ROW = 10**6
N_ACCOUNT = 10**5

# Generate data to split
with open('./data.csv', 'w') as csv_file:
    csv_file.write('timestamp,account_id,value\n')
    for timestamp in tqdm.tqdm(range(N_ROW), desc='writing csv file to split'):
        account_id = random.randint(1, N_ACCOUNT)
        data = random.random()
        csv_file.write(f'{timestamp},{account_id},{data}\n')

# Clean result folder
if os.path.isdir('./result'):
    shutil.rmtree('./result')
os.mkdir('./result')
Solution 1
Write a script that creates a file for each account: read rows one by one from the original CSV and write each row to the file that corresponds to its account (opening and closing a file for each row).
Code:
# Split the data
p_bar = tqdm.tqdm(total=N_ROW, desc='splitting csv file')
with open('./data.csv') as data_file:
    next(data_file)  # skip header
    for row in data_file:
        account_id = row.split(',')[1]
        account_file_path = f'result/{account_id}.csv'
        file_opening_mode = 'a' if os.path.isfile(account_file_path) else 'w'
        with open(account_file_path, file_opening_mode) as account_file:
            account_file.write(row)
        p_bar.update(1)
Issues:
It is quite slow (I think it is inefficient to open and close a file on each row); it takes around 4 minutes for 1 M rows. Even if it works, will lookups be fast? Given an account_id I know the name of the file I should read, but the system has to look through 5 M files to find it. Should I create some kind of tree of folders with the files as leaves?
Solution 2 (works on a small example, not on the large CSV file)
Same idea as solution 1, but instead of opening/closing a file for each row, keep the open files in a dictionary.
Code:
# A dict that will contain all files
account_file_dict = {}

# Given an account id, return the file to write to (create a new file if it does not exist)
def get_account_file(account_id):
    file = account_file_dict.get(account_id, None)
    if file is None:
        file = open(f'./result/{account_id}.csv', 'w')
        account_file_dict[account_id] = file
        file.__enter__()
    return file

# Split the data
p_bar = tqdm.tqdm(total=N_ROW, desc='splitting csv file')
with open('./data.csv') as data_file:
    next(data_file)  # skip header
    for row in data_file:
        account_id = row.split(',')[1]
        account_file = get_account_file(account_id)
        account_file.write(row)
        p_bar.update(1)
Issues:
I am not sure it is actually faster.
I have to keep 5 M files open simultaneously (one per account), and I get an error: OSError: [Errno 24] Too many open files: './result/33725.csv'.
Solution 3 (works on a small example, not on the large CSV file)
Use the awk command; solution from: split large csv text file based on column value
code:
After generating the file, run:
awk -F, 'NR==1 {h=$0; next} {f="./result/"$2".csv"} !($2 in p) {p[$2]; print h > f} {print >> f}' ./data.csv
Issues:
I get the following error: input record number 28229, file ./data.csv source line number 1 (28229 is just an example; it usually fails around 28 k). I assume it is also because I am opening too many files.
@VinceM:
While not quite 15 GB, I do have a 7.6 GB one with 3 columns:
-- 148 mn prime numbers, their base-2 log, and their hex
in0: 7.59GiB 0:00:09 [ 841MiB/s] [ 841MiB/s] [========>] 100%
148,156,631 lines 7773.641 MB ( 8151253694) /dev/stdin
|
f="$( grealpath -ePq ~/master_primelist_19d.txt )"
( time ( for __ in '12' '34' '56' '78' '9'; do
( gawk -v ___="${__}" -Mbe 'BEGIN {
___="^["(___%((_+=_^=FS=OFS="=")+_*_*_)^_)"]"
} ($_)~___ && ($NF = int(($_)^_))^!_' "${f}" & ) done |
gcat - ) ) | pvE9 > "${DT}/test_primes_squared_00000002.txt"
|
out9: 13.2GiB 0:02:06 [98.4MiB/s] [ 106MiB/s] [ <=> ]
( for __ in '12' '34' '56' '78' '9'; do; ( gawk -v ___="${__}" -Mbe "${f}" &)
0.36s user 3 out9: 13.2GiB 0:02:06 [ 106MiB/s] [ 106MiB/s]
Using only 5 instances of gawk with the big-integer package GNU GMP, each with a designated subset of leading digit(s) of the prime numbers, it managed to calculate the full-precision squaring of those primes in just 2 minutes 6 seconds, yielding an unsorted 13.2 GB output file.
If it can square that quickly, then merely grouping by account_id should be a walk in the park.
Have a look at https://docs.python.org/3/library/sqlite3.html
You could import the data, create the required indexes and then run queries normally. No dependencies except for Python itself.
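A minimal sketch of that sqlite3 approach (my own illustration, assuming the data.csv layout from the question and a throwaway data.db file):
import csv
import sqlite3

# Sketch: load data.csv into SQLite once, then query by account_id.
con = sqlite3.connect("data.db")
con.execute("CREATE TABLE IF NOT EXISTS rows (timestamp INTEGER, account_id INTEGER, value REAL)")

with open("data.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    con.executemany("INSERT INTO rows VALUES (?, ?, ?)", reader)
con.execute("CREATE INDEX IF NOT EXISTS idx_account ON rows(account_id)")
con.commit()

# All rows for one account, without rescanning the whole file each time
rows = con.execute("SELECT * FROM rows WHERE account_id = ?", (42,)).fetchall()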
https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_csv.html
If you have to query the raw data every time and you are limited to plain Python, then you can either write code to read it manually and yield matching rows, or use a helper like this:
from convtools.contrib.tables import Table
from convtools import conversion as c

iterable_of_matched_rows = (
    Table.from_csv("tmp/in.csv", header=True)
    .filter(c.col("account_id") == "1")
    .into_iter_rows(dict)
)
However, this won't be faster than reading a 100 M row CSV file with csv.reader.
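For completeness, the "read it manually and yield matching rows" route mentioned above might look roughly like this (a sketch; the filename and account id are placeholders):
import csv

def rows_for_account(path, account_id):
    """Yield every row of the CSV whose account_id column matches."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            if row["account_id"] == str(account_id):
                yield row

# usage: collect all rows for account 1 from data.csv
matched = list(rows_for_account("data.csv", 1))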

Python csv merge multiple files with different columns

I hope somebody can help me with this issue.
I have about 20 CSV files (each with its own header), and each of these files has hundreds of columns.
My problem is related to merging those files, because a couple of them have extra columns.
I was wondering if there is a way to merge all those files into one, adding all the new columns with their data, without corrupting the other files.
So far I used the awk terminal command:
awk '(NR == 1) || (FNR > 1)' *.csv > file.csv
to merge them, removing the headers from all the files except the first one.
I got this from my previous question
Merge multiple csv files into one
But this does not solve the issue with the extra column.
EDIT:
Here are some of the CSV files in plain text with their headers.
file 1
"#timestamp","#version","_id","_index","_type","ad.(fydibohf23spdlt)/cn","ad.</o","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","ad.totalComputers","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationUserPrivileges","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-07-27 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
file2
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-07-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
file3
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-08-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
file4
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-08-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Those are 4 of the 20 files; I included all the headers but no rows because they contain sensitive data.
When I run the script on those files, I can see that it writes the timestamp value. But when I run it against the original files (with a lot of data), all it does is write the header and that's it. Please let me know if you need more info.
When I run the script on the original files, this is what I get back.
There are 20 rows (one for each file), but it doesn't write the content of each file. Could this be related to the sniffing of the first line? I think it checks only the first line of each file and then moves on, as in the script. So how come, with small files, it manages to merge the content as well?
Your question isn't clear; I don't know if you want a solution in awk or Python or either, and it doesn't have any sample input/output we can test with, so it's a guess, but is this what you're trying to do (using any awk in any shell on every Unix box)?
$ head file{1..2}.csv
==> file1.csv <==
1,2
a,b
c,d
==> file2.csv <==
1,2,3
x,y,z
$ cat tst.awk
BEGIN {
    FS = OFS = ","
    for (i=1; i<ARGC; i++) {
        if ( (getline < ARGV[i]) > 0 ) {
            if ( NF > maxNF ) {
                maxNF = NF
                hdr = $0
            }
        }
    }
}
NR == 1 { print hdr }
FNR > 1 { NF=maxNF; print }
$ awk -f tst.awk file{1..2}.csv
1,2,3
a,b,
c,d,
x,y,z
See http://awk.freeshell.org/AllAboutGetline for details on when/how to use getline and its associated caveats.
Alternatively with an assist from GNU head for -q:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR {
if ( NF > maxNF ) {
maxNF = NF
hdr = $0
}
next
}
!doneHdr++ { print hdr }
FNR > 1 { NF=maxNF; print }
$ head -q -n 1 file{1..2}.csv | awk -f tst.awk - file{1..2}.csv
1,2,3
a,b,
c,d,
x,y,z
As already explained in your original question, you can easily extend the columns in Awk if you know how many to expect.
awk -F ',' -v cols=5 'BEGIN { OFS=FS }
FNR == 1 && NR > 1 { next }
NF<cols { for (i=NF+1; i<=cols; ++i) $i = "" }
1' *.csv >file.csv
I slightly refactored this to skip the unwanted lines with next rather than vice versa; this simplifies the rest of the script slightly. I also added the missing comma separator.
You can easily print the number of columns in each file, and just note the maximum:
awk -F , 'FNR==1 { print NF, FILENAME }' *.csv
If you don't know how many fields there are going to be in files you do not yet have, or if you need to cope with complex CSV with quoted fields, maybe switch to Python for this. It's not too hard to do the field number sniffing in Awk, but coping with quoting is tricky.
import csv
import sys

# Sniff just the first line from every file
fields = 0
for filename in sys.argv[1:]:
    with open(filename) as raw:
        for row in csv.reader(raw):
            # If the line is longer than current max, update
            if len(row) > fields:
                fields = len(row)
                titles = row
            # Break after first line, skip to next file
            break

# Now do the proper reading
writer = csv.writer(sys.stdout)
writer.writerow(titles)
for filename in sys.argv[1:]:
    with open(filename) as raw:
        for idx, row in enumerate(csv.reader(raw)):
            if idx == 0:
                continue  # skip each file's header row
            row.extend([''] * (fields - len(row)))
            writer.writerow(row)
This simply assumes that the additional fields go at the end. If the files could have extra columns between other columns, or columns in a different order, you need a more complex solution (though not by much; the csv module's DictReader class could do most of the heavy lifting; see the sketch after the demo link).
Demo: https://ideone.com/S998l4
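As an illustration of that DictReader idea (my own sketch, not part of the demo above; it assumes extra or reordered columns can simply be matched by name): collect the union of all headers first, then let DictWriter fill in the blanks.
import csv
import sys

files = sys.argv[1:]

# First pass: collect the union of all column names, keeping first-seen order
fieldnames = []
for filename in files:
    with open(filename, newline="") as f:
        for name in csv.DictReader(f).fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)

# Second pass: write every row; missing columns are left empty
writer = csv.DictWriter(sys.stdout, fieldnames=fieldnames, restval="")
writer.writeheader()
for filename in files:
    with open(filename, newline="") as f:
        for row in csv.DictReader(f):
            writer.writerow(row)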
If you wanted to do the same type of sniffing in Awk, you basically have to specify the names of the input files twice, or do some nontrivial processing in the BEGIN block to read all the files before starting the main script.

How to Remove Pipe character from a data filed in a pipe delimited file

Experts, I have a simple pipe-delimited file from a source system which has a free-flow text field, and for one of the records I see that a "|" character is coming in as part of the data. This breaks my file unevenly and it does not get parsed into the correct number of fields. I want to replace the "|" in the data field with a "#".
Record coming in from the source system (there are 9 fields in total in the file):
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456
If you notice the 4th field, Free"flow|Text, this is the complete value from the source, which has a pipe in it.
I want to change it to Free"flow#Text and then read the file with a pipe delimiter.
Desired outcome:
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
I tried a few awk/sed combinations, but didn't get the desired output.
Thanks
Since you know there are 9 fields, and the 4th is a problem: take the first 3 fields and the last 5 fields and whatever is left over is the 4th field.
You did tag shell, so here's some bash; I'm sure the Python equivalent is close:
line='OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456'
IFS='|'
read -ra fields <<<"$line"
first3=( "${fields[@]:0:3}" )
last5=( "${fields[@]: -5}" )
tmp=${line#"${first3[*]}$IFS"} # remove the first 3 joined with pipe
field4=${tmp%"$IFS${last5[*]}"} # remove the last 5 joined with pipe
data=( "${first3[@]}" "$field4" "${last5[@]}" )
newline="${first3[*]}$IFS${field4//$IFS/#}$IFS${last5[*]}"
# .......^^^^^^^^^^^^....^^^^^^^^^^^^^^^^^....^^^^^^^^^^^
printf "%s\n" "$line" "$newline"
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
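For the record, a rough Python equivalent of the same first-3/last-5 idea (my own sketch, not from the original answer):
line = 'OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456'

fields = line.split("|")
first3 = fields[:3]
last5 = fields[-5:]
# whatever is left in the middle is the 4th field; re-join it with "#"
field4 = "#".join(fields[3:len(fields) - 5])
newline = "|".join(first3 + [field4] + last5)

print(newline)
# OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456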
With awk, it's simpler: if there are 10 fields, join fields 4 and 5, and shift the rest down one.
echo "$line" | awk '
    BEGIN { FS = OFS = "|" }
    NF == 10 {
        $4 = $4 "#" $5
        for (i=5; i<NF; i++)
            $i = $(i+1)
        NF--
    }
    1
'
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
You tagged your question with Python so I assume a Python-based answer is acceptable.
I assume that not all records in your file have the additional "|"; only some records have the "|" in the free-text column.
For a more realistic example, I create an input with some correct records and some erroneous records.
I use StringIO to simulate the file; in your environment, read the real file with open().
from io import StringIO

sample = 'OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456\nOutboundManualCall|J|LALALA HTREDFST|FreeHalalText|20191029|X|X|X|3456\nOutboundManualCall|J|LALALA HTREDFST|FrulaalText|20191029|X|X|X|3456\nOutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456'

infile = StringIO(sample)
outfile = StringIO()

for line in infile.readlines():
    cols = line.split("|")
    if len(cols) > 9:
        print(f"bad column {cols[3:5]}")
        # rejoin: first 3 fields, then fields 4 and 5 merged with "#", then the rest
        line = "|".join(cols[:3]) + "|" + "#".join(cols[3:5]) + "|" + "|".join(cols[5:])
    outfile.write(line)

print("Corrected file:")
print(outfile.getvalue())
Results in:
> bad column ['Free"flow', 'Text']
> bad column ['Free"flow', 'Text']
> Corrected file:
> OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
> OutboundManualCall|J|LALALA HTREDFST|FreeHalalText|20191029|X|X|X|3456
> OutboundManualCall|J|LALALA HTREDFST|FrulaalText|20191029|X|X|X|3456
> OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456

CSV Writer - Trying to write a variable in each column

I am trying to write some data to a CSV file but I can't get the values into different columns.
import csv

car = ["car 11"]
finish = ["Landhaus , Nord"]
time = ["['05:36']", "['06:06']", "['06:36']", "['07:06']", "['07:36']", "['08:06']", "['08:36']", "['09:06']", "['09:36']", "['10:06']", "['10:36']", "['11:06']", "['11:36']", "['12:06']", "['12:36']", "['13:06']", "['13:36']", "['14:06']", "['14:36']", "['15:06']", "['15:36']", "['16:06']", "['16:36']", "['17:06']", "['17:36']", "['18:06']", "['18:36']", "['19:06']", "['19:36']", "['20:06']", "['20:36']"]

myfile = open("Informationen.csv", "wb")
writer = csv.writer(myfile, dialect='excel', delimiter=' ')
bla = [car, finish, time]
writer.writerow(bla)
Output:
car 11 Landhaus , Nord "['05:36']", "['06:06']", [..]
All in one row and in column 1.
But I want it like this:
car 11 (in row 1, column 1) | "Landhaus , Nord" (in row 1, column 2) | ['05:36'] (in row 1, column 3) | ['06:06'] (in row 1, column 4), and so on up to column n
Thanks for any help!
Edit
One more example of how it should look:
Line 1: car 11 (column 1) Landhaus, Nord (column 2) ['05:36'] (column 3) ['05:36'] (column 4) [...]
Example: http://img13.imageshack.us/img13/4964/unbenanntvilw.png
Solution so far (but still having problems with the time list):
car=["car 11"]
trenn=[';']
finish=['Landhaus , Nord']
time=["['05:36']", "['06:06']", "['06:36']", "['07:06']", "['07:36']", "['08:06']", "['08:36']", "['09:06']", "['09:36']", "['10:06']", "['10:36']", "['11:06']", "['11:36']", "['12:06']", "['12:36']", "['13:06']", "['13:36']", "['14:06']", "['14:36']", "['15:06']", "['15:36']", "['16:06']", "['16:36']", "['17:06']", "['17:36']", "['18:06']", "['18:36']", "['19:06']", "['19:36']", "['20:06']", "['20:36']"]
myfile = open("Informationen2.csv", "wb")
writer = csv.writer(myfile,delimiter=' ')
bla = car + trenn + finish + trenn + time
writer.writerow(bla)
myfile.close()
The Python documentation states that for the csv.writer() function the ...
"optional dialect parameter can be given which is used to define
a set of parameters specific to a particular CSV dialect".
... and that ...
"the other optional fmtparams keyword arguments can be given to override individual formatting parameters in the current dialect".
The problem you are experiencing is a consequence of writing a string representation of the time list to file and writing to file using a whitespace delimiter. If you were to view Informationen.csv as a plain text file the problem becomes apparent.
Firstly, writing to file having passed a whitespace delimiter as an argument in csv.writer(myfile, dialect='excel', delimiter=' ') overrides the default delimiter as defined in the Excel dialect and results in the elements of list bla being written to file with the format element1 element2 element3 as opposed to element1,element2,element3.
Secondly, although the majority of the elements in the time list are allocated their own columns in a spreadsheet as desired, writing the list to file as a string representation of itself has contributed to the overall undesired formatting.
When you open the file created with your script as an Excel file, Excel reads in two initial values based on the first two commas it finds which happen to be in the center of the string 'Landhaus , Nord' and within the string representation of the time list.
You can achieve the column separation you require firstly by appending the elements of the time list to the bla list, as opposed to nesting the former within the latter. You then need to omit delimiter=' ' in csv.writer(myfile, dialect='excel', delimiter=' '), thus avoiding the delimiter overriding effect when writing to file:
import csv

car = ['car 11']
finish = ['Landhaus , Nord']
time = ["['05:36']", "['06:06']", "['06:36']", "['07:06']", "['07:36']"]

try:
    with open('Informationen.csv', 'w') as myfile:
        writer = csv.writer(myfile, dialect='excel')
        bla = [car, finish]
        for each_time in time:
            bla.append(each_time)
        writer.writerow(bla)
except IOError as ioe:
    print('Error: ' + str(ioe))
producing the following output in Excel:
http://imageshack.us/a/img839/8061/screenshotkn.png
import csv
car=["car 11"]
finish=['Landhaus , Nord']
time=["['05:36']", "['06:06']", "['06:36']", "['07:06']", "['07:36']", "['08:06']", "['08:36']", "['09:06']", "['09:36']", "['10:06']", "['10:36']", "['11:06']", "['11:36']", "['12:06']", "['12:36']", "['13:06']", "['13:36']", "['14:06']", "['14:36']", "['15:06']", "['15:36']", "['16:06']", "['16:36']", "['17:06']", "['17:36']", "['18:06']", "['18:36']", "['19:06']", "['19:36']", "['20:06']", "['20:36']"]
myfile = open("derp.csv", "wb")
writer = csv.writer(myfile)
bla = car + finish + time
writer.writerow(bla)
myfile.close()
Here is the output I get from Excel.