How to Convert PDF file into CSV file using Python Pandas - python

I have a PDF file, I need to convert it into a CSV file this is my pdf file example as link https://online.flippingbook.com/view/352975479/ the code used is
import re
import parse
import pdfplumber
import pandas as pd
from collections import namedtuple
file = "Battery Voltage.pdf"
lines = []
total_check = 0
with pdfplumber.open(file) as pdf:
pages = pdf.pages
for page in pdf.pages:
text = page.extract_text()
for line in text.split('\n'):
print(line)
with the above script I am not getting proper output, For Time column "AM" is getting in the next line. The output I am getting is like this

It may help you to see how the surface of a pdf is displayed to the screen. so that one string of plain text is placed part by part on the display. (Here I highlight where the first AM is to be placed.
As a side issue that first AM in the file is I think at first glance encoded as this block
BT
/F1 12 Tf
1 0 0 1 224.20265 754.6322 Tm
[<001D001E>] TJ
ET
Where in that area 1D = A and 1E = M
So If you wish to extract each LINE as it is displayed, by far the simplest way is to use a library such as pdftotext that especially outputs each row of text as seen on page.
Thus using an attack such as tabular comma separated you can expect each AM will be given its own row. Which should by logic be " ",AM," "," " but some extractors should say nan,AM,nan,nan
As text it looks like this from just one programmable line
pdftotext -layout "Battery Voltage.pdf"
That will output "Battery Voltage.txt" in the same work folder
Then placing that in a spreadsheet becomes
Now we can export in a couple of clicks (no longer) as "proper output" csv along with all its oddities that csv entails.
,,Battery Vo,ltage,
Sr No,DateT,Ime,Voltage (v),Ignition
1,01/11/2022,00:08:10,47.15,Off
,AM,,,
2,01/11/2022,00:23:10,47.15,Off
,AM,,,
3,01/11/2022,00:38:10,47.15,Off
,AM,,,
4,01/11/2022,00:58:10,47.15,Off
,AM,,,
5,01/11/2022,01:18:10,47.15,Off
,AM,,,
6,01/11/2022,01:33:10,47.15,Off
,AM,,,
7,01/11/2022,01:48:10,47.15,Off
,AM,,,
8,01/11/2022,02:03:10,47.15,Off
,AM,,,
9,01/11/2022,02:18:10,47.15,Off
,AM,,,
10,01/11/2022,02:37:12,47.15,Off
,AM,,,
So, if the edits were not done before csv generation it is simpler to post process in an editor, like this html page (no need for more apps)
,,Battery,Voltage,
Sr No,Date,Time,Voltage (v),Ignition
1,01/11/2022,00:08:10,47.15,Off,AM,,,
2,01/11/2022,00:23:10,47.15,Off,AM,,,
3,01/11/2022,00:38:10,47.15,Off,AM,,,
4,01/11/2022,00:58:10,47.15,Off,AM,,,
5,01/11/2022,01:18:10,47.15,Off,AM,,,
6,01/11/2022,01:33:10,47.15,Off,AM,,,
7,01/11/2022,01:48:10,47.15,Off,AM,,,
8,01/11/2022,02:03:10,47.15,Off,AM,,,
9,01/11/2022,02:18:10,47.15,Off,AM,,,
10,01/11/2022,02:37:12,47.15,Off,AM,,,
Then on re-import it looks more human generated
In discussions it was confirmed all that's desired is a means to a structured list and first parse using
pdftotext -layout -nopgbrk -x 0 -y 60 -W 800 -H 800 -fixed 6 "Battery Voltage.pdf" &type "battery voltage.txt"|findstr "O">battery.txt
will output regulated data columns for framing, with a fixed headline or splitting or otherwise using cleaned data.
1 01-11-2022 00:08:10 47.15 Off
2 01-11-2022 00:23:10 47.15 Off
3 01-11-2022 00:38:10 47.15 Off
4 01-11-2022 00:58:10 47.15 Off
5 01-11-2022 01:18:10 47.15 Off
...
32357 24-11-2022 17:48:43 45.40 On
32358 24-11-2022 17:48:52 44.51 On
32359 24-11-2022 17:48:55 44.51 On
32360 24-11-2022 17:48:58 44.51 On
32361 24-11-2022 17:48:58 44.51 On
At this stage we can use text handling such as csv or add json brackets
for /f "tokens=1,2,3,4,5 delims= " %%a In ('Findstr /C:"O" battery.txt') do echo csv is "%%a,%%b,%%c,%%d,%%e">output.txt
...
csv is "32357,24-11-2022,17:48:43,45.40,On"
csv is "32358,24-11-2022,17:48:52,44.51,On"
csv is "32359,24-11-2022,17:48:55,44.51,On"
csv is "32360,24-11-2022,17:48:58,44.51,On"
csv is "32361,24-11-2022,17:48:58,44.51,On"
So the request is for JSON (not my forte so you may need to improve on my code as I dont know what mongo expects)
here I drop a pdf onto a battery.bat
{"line_id":1,"created":{"date":"01-11-2022"},{"time":"00:08:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":2,"created":{"date":"01-11-2022"},{"time":"00:23:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":3,"created":{"date":"01-11-2022"},{"time":"00:38:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":4,"created":{"date":"01-11-2022"},{"time":"00:58:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":5,"created":{"date":"01-11-2022"},{"time":"01:18:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":6,"created":{"date":"01-11-2022"},{"time":"01:33:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":7,"created":{"date":"01-11-2022"},{"time":"01:48:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":8,"created":{"date":"01-11-2022"},{"time":"02:03:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":9,"created":{"date":"01-11-2022"},{"time":"02:18:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":10,"created":{"date":"01-11-2022"},{"time":"02:37:12"},{"Voltage":"47.15"},{"State","Off"}}
it is a bit slow as running in pure console so lets run it blinder by add #, it will still take time as we are working in plain text, so do expect a significant delay for 32,000+ lines = 2+1/2 minutes on my kit
pdftotext -layout -nopgbrk -x 0 -y 60 -W 700 -H 800 -fixed 8 "%~1" battery.txt
echo Heading however you wish it for json perhaps just opener [ but note only one redirect chevron >"%~dpn1.txt"
for /f "tokens=1,2,3,4,5 delims= " %%a In ('Findstr /C:"O" battery.txt') do #echo "%%a": { "Date": "%%b", "Time": "%%c", "Voltage": %%d, "Ignition": "%%e" },>>"%~dpn1.txt"
REM another json style could be { "Line_Id": %%a, "Date": "%%b", "Time": "%%c", "Voltage": %%d, "Ignition": "%%e" },
REM another for an array can simply be [%%a,"%%b","%%c",%%d,"%%e" ],
echo Tailing however you wish it for json perhaps just final closer ] but note double chevron >>"%~dpn1.txt"
To see progress change #echo { to #echo %%a&echo {
Thus, after a minute or so
however, it tends to add an extra minute for all that display activity! before the window closes as a sign of completion.

For cases like these, build a parser that converts the unusable data into something you can use.
Logic below converts that exact file to a CSV, but will only work with that specific file contents.
Note that for this specific file you can ignore the AM/PM as the time is in 24h format.
import pdfplumber
file = "Battery Voltage.pdf"
skiplines = [
"Battery Voltage",
"AM",
"PM",
"Sr No DateTIme Voltage (v) Ignition",
""
]
with open("output.csv", "w") as outfile:
header = "serialnumber;date;time;voltage;ignition\n"
outfile.write(header)
with pdfplumber.open(file) as pdf:
for page in pdf.pages:
for line in page.extract_text().split('\n'):
if line.strip() in skiplines:
continue
outfile.write(";".join(line.split())+"\n")
EDIT
So, JSON files in python are basically just a list of dict items (yes, that's oversimplification).
The only thing you need to change is the way you actually process the lines. The actual meat of the logic doesn't change...
import pdfplumber
import json
file = "Battery Voltage.pdf"
skiplines = [
"Battery Voltage",
"AM",
"PM",
"Sr No DateTIme Voltage (v) Ignition",
""
]
result = []
with pdfplumber.open(file) as pdf:
for page in pdf.pages:
for line in page.extract_text().split("\n"):
if line.strip() in skiplines:
continue
serialnumber, date, time, voltage, ignition = line.split()
result.append(
{
"serialnumber": serialnumber,
"date": date,
"time": time,
"voltage": voltage,
"ignition": ignition,
}
)
with open("output.json", "w") as outfile:
json.dump(result, outfile)

Related

Split large CSV file based on row value

The porblem
I have a csv file called data.csv. On each row I have:
timestamp: int
account_id: int
data: float
for instance:
timestamp,account_id,value
10,0,0.262
10,0,0.111
13,1,0.787
14,0,0.990
This file is ordered by timestamp.
The number of row is too big to store all rows in memory.
order of magnitude: 100 M rows, number of account: 5 M
How can I quickly get all rows of a given account_id ? What would be the best way to make the data accessible by account_id ?
Things I tried
to generate a sample:
N_ROW = 10**6
N_ACCOUNT = 10**5
# Generate data to split
with open('./data.csv', 'w') as csv_file:
csv_file.write('timestamp,account_id,value\n')
for timestamp in tqdm.tqdm(range(N_ROW), desc='writing csv file to split'):
account_id = random.randint(1,N_ACCOUNT)
data = random.random()
csv_file.write(f'{timestamp},{account_id},{data}\n')
# Clean result folder
if os.path.isdir('./result'):
shutil.rmtree('./result')
os.mkdir('./result')
Solution 1
Write a script that creates a file for each account, read rows one by one on the original csv, write the row on on the file that corresponds to the account (open and close a file for each row).
Code:
# Split the data
p_bar = tqdm.tqdm(total=N_ROW, desc='splitting csv file')
with open('./data.csv') as data_file:
next(data_file) # skip header
for row in data_file:
account_id = row.split(',')[1]
account_file_path = f'result/{account_id}.csv'
file_opening_mode = 'a' if os.path.isfile(account_file_path) else 'w'
with open(account_file_path, file_opening_mode) as account_file:
account_file.write(row)
p_bar.update(1)
Issues:
It is quite slow (i think it is inefficient to open and close a file on each row). It takes around 4 minutes for 1 M rows. Even if it works, will it be fast ? Given an account_id I know the name of the file I should read but the system has to look over 5M files to find it. Should I create some kind of binary tree with folders with the leafs being the files ?
Solution 2 (works on small example not on large csv file)
Same idea as solution 1 but instead of opening / closing a file for each row, store files in a dictionary
Code:
# A dict that will contain all files
account_file_dict = {}
# A function given an account id, returns the file to write in (create new file if do not exist)
def get_account_file(account_id):
file = account_file_dict.get(account_id, None)
if file is None:
file = open(f'./result/{account_id}.csv', 'w')
account_file_dict[account_id] = file
file.__enter__()
return file
# Split the data
p_bar = tqdm.tqdm(total=N_ROW, desc='splitting csv file')
with open('./data.csv') as data_file:
next(data_file) # skip header
for row in data_file:
account_id = row.split(',')[1]
account_file = get_account_file(account_id)
account_file.write(row)
p_bar.update(1)
Issues:
I am not sure it is actually faster.
I have to open simultaneously 5M files (one per account). I get an error OSError: [Errno 24] Too many open files: './result/33725.csv'.
Solution 3 (works on small example not on large csv file)
Use awk command, solution from: split large csv text file based on column value
code:
after generating the file, run: awk -F, 'NR==1 {h=$0; next} {f="./result/"$2".csv"} !($2 in p) {p[$2]; print h > f} {print >> f}' ./data.csv
Issues:
I get the following error: input record number 28229, file ./data.csv source line number 1 (number 28229 is an example, it usually fails around 28k). I assume It is also because i am opening too many files
#VinceM :
While not quite 15 GB, I do have a 7.6 GB one with 3 columns :
-- 148 mn prime numbers, their base-2 log, and their hex
in0: 7.59GiB 0:00:09 [ 841MiB/s] [ 841MiB/s] [========>] 100%
148,156,631 lines 7773.641 MB ( 8151253694) /dev/stdin
|
f="$( grealpath -ePq ~/master_primelist_19d.txt )"
( time ( for __ in '12' '34' '56' '78' '9'; do
( gawk -v ___="${__}" -Mbe 'BEGIN {
___="^["(___%((_+=_^=FS=OFS="=")+_*_*_)^_)"]"
} ($_)~___ && ($NF = int(($_)^_))^!_' "${f}" & ) done |
gcat - ) ) | pvE9 > "${DT}/test_primes_squared_00000002.txt"
|
out9: 13.2GiB 0:02:06 [98.4MiB/s] [ 106MiB/s] [ <=> ]
( for __ in '12' '34' '56' '78' '9'; do; ( gawk -v ___="${__}" -Mbe "${f}" &)
0.36s user 3 out9: 13.2GiB 0:02:06 [ 106MiB/s] [ 106MiB/s]
Using only 5 instances of gawk with big-integer package gnu-GMP, each with a designated subset of leading digit(s) of the prime number,
—- it managed to calculate the full precision squaring of those primes in just 2 minutes 6 seconds, yielding an unsorted 13.2 GB output file
if it can square that quickly, then merely grouping by account_id should be a walk in the park
Have a look at https://docs.python.org/3/library/sqlite3.html
You could import the data, create required indexes and then run queries normally. No dependencies except for the python itself.
https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_csv.html
If you have to query raw data every time and you are limited by simple python only, then you can either write a code to read it manually and yield matched rows or use a helper like this:
from convtools.contrib.tables import Table
from convtools import conversion as c
iterable_of_matched_rows = (
Table.from_csv("tmp/in.csv", header=True)
.filter(c.col("account_id") == "1")
.into_iter_rows(dict)
)
However this won't be faster than reading 100M row csv file with csv.reader.

Python csv merge multiple files with different columns

I hope somebody can help me with this issue.
I have about 20 csv files (each file with its headers), each of this files has hundreds of columns.
My problem is related to merging those files, because a couple of them have extra columns.
I was wondering if there is an option to merge all those files in one adding all the new columns with related data without corrupting the other files.
So far I used I used the awk terminal command:
awk '(NR == 1) || (FNR > 1)' *.csv > file.csv
to merge removing the headers from all the files expect from the first one.
I got this from my previous question
Merge multiple csv files into one
But this does not solve the issue with the extra column.
EDIT:
Here are some file csv in plain text with the headers.
file 1
"#timestamp","#version","_id","_index","_type","ad.(fydibohf23spdlt)/cn","ad.</o","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","ad.totalComputers","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationUserPrivileges","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-07-27 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
file2
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-07-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
file3
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-08-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
file4
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-08-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Those are 4 of the 20 files, I included all the headers but no rows because they contain sensitive data.
When I run the script on those files, I can see that it writes the timestamp value. But when I run it against the original files (with a lot of data) all what it does, is writing the header and that's it.Please if you need some more info just let me know.
Once I run the script on the original file. This is what I get back
There are 20 rows (one for each file) but it doesn't write the content of each file. This could be related to the sniffing of the first line? because I think that is checking only the first line of the files and moves forward as in the script. So how is that in a small file, it manage to copy merge also the content?
Your question isn't clear, idk if you really want a solution in awk or python or either, and it doesn't have any sample input/output we can test with so it's a guess but is this what you're trying to do (using any awk in any shell on every Unix box)?
$ head file{1..2}.csv
==> file1.csv <==
1,2
a,b
c,d
==> file2.csv <==
1,2,3
x,y,z
$ cat tst.awk
BEGIN {
FS = OFS = ","
for (i=1; i<ARGC; i++) {
if ( (getline < ARGV[i]) > 0 ) {
if ( NF > maxNF ) {
maxNF = NF
hdr = $0
}
}
}
}
NR == 1 { print hdr }
FNR > 1 { NF=maxNF; print }
$ awk -f tst.awk file{1..2}.csv
1,2,3
a,b,
c,d,
x,y,z
See http://awk.freeshell.org/AllAboutGetline for details on when/how to use getline and it's associated caveats.
Alternatively with an assist from GNU head for -q:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR {
if ( NF > maxNF ) {
maxNF = NF
hdr = $0
}
next
}
!doneHdr++ { print hdr }
FNR > 1 { NF=maxNF; print }
$ head -q -n 1 file{1..2}.csv | awk -f tst.awk - file{1..2}.csv
1,2,3
a,b,
c,d,
x,y,z
As already explained in your original question, you can easily extend the columns in Awk if you know how many to expect.
awk -F ',' -v cols=5 'BEGIN { OFS=FS }
FNR == 1 && NR > 1 { next }
NF<cols { for (i=NF+1; i<=cols; ++i) $i = "" }
1' *.csv >file.csv
I slightly refactored this to skip the unwanted lines with next rather than vice versa; this simplifies the rest of the script slightly. I also added the missing comma separator.
You can easily print the number of columns in each file, and just note the maximum:
awk -F , 'FNR==1 { print NF, FILENAME }' *.csv
If you don't know how many fields there are going to be in files you do not yet have, or if you need to cope with complex CSV with quoted fields, maybe switch to Python for this. It's not too hard to do the field number sniffing in Awk, but coping with quoting is tricky.
import csv
import sys
# Sniff just the first line from every file
fields = 0
for filename in sys.argv[1:]:
with open(filename) as raw:
for row in csv.reader(raw):
# If the line is longer than current max, update
if len(row) > fields:
fields = len(row)
titles = row
# Break after first line, skip to next file
break
# Now do the proper reading
writer = csv.writer(sys.stdout)
writer.writerow(titles)
for filename in sys.argv[1:]:
with open(filename) as raw:
for idx, row in enumerate(csv.reader(raw)):
if idx == 0:
next
row.extend([''] * (fields - len(row)))
writer.writerow(row)
This simply assumes that the additional fields go at the end. If the files could have extra columns between other columns, or columns in different order, you need a more complex solution (though not by much; the Python CSV DictReader subclass could do most of the heavy lifting).
Demo: https://ideone.com/S998l4
If you wanted to do the same type of sniffing in Awk, you basically have to specify the names of the input files twice, or do some nontrivial processing in the BEGIN block to read all the files before starting the main script.

How to solve problem decoding from wrong json format

everyone. Need help opening and reading the file.
Got this txt file - https://yadi.sk/i/1TH7_SYfLss0JQ
It is a dictionary
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}
But it was written using json into txt file.
#This is how I dump the data into a txt
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
So, the file structure is
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}{"id2":"url2", "id3":"url3", ..., "id4":"url4"}{"id5":"url5", "id6":"url6", ..., "id7":"url7"}
And it is all a string....
I need to open it and check repeated ID, delete and save it again.
But getting - json.loads shows ValueError: Extra data
Tried these:
How to read line-delimited JSON from large file (line by line)
Python json.loads shows ValueError: Extra data
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 190)
But still getting that error, just in different place.
Right now I got as far as:
with open('111111111.txt', 'r') as log:
before_log = log.read()
before_log = before_log.replace('}{',', ').split(', ')
mu_dic = []
for i in before_log:
mu_dic.append(i)
This eliminate the problem of several {}{}{} dictionaries/jsons in a row.
Maybe there is a better way to do this?
P.S. This is how the file is made:
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
Your file size is 9,5M, so it'll took you a while to open it and debug it manually.
So, using head and tail tools (found normally in any Gnu/Linux distribution) you'll see that:
# You can use Python as well to read chunks from your file
# and see the nature of it and what it's causing a decode problem
# but i prefer head & tail because they're ready to be used :-D
$> head -c 217 111111111.txt
{"1933252590737725178": "https://instagram.fiev2-1.fna.fbcdn.net/vp/094927bbfd432db6101521c180221485/5CC0EBDD/t51.2885-15/e35/46950935_320097112159700_7380137222718265154_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net",
$> tail -c 219 111111111.txt
, "1752899319051523723": "https://instagram.fiev2-1.fna.fbcdn.net/vp/a3f28e0a82a8772c6c64d4b0f264496a/5CCB7236/t51.2885-15/e35/30084016_2051123655168027_7324093741436764160_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net"}
$> head -c 294879 111111111.txt | tail -c 12
net"}{"19332
So the first guess is that your file is a malformed series ofJSON data, and the best guess is to seperate }{ by a \n for further manipulations.
So, here is an example of how you can solve your problem using Python:
import json
input_file = '111111111.txt'
output_file = 'new_file.txt'
data = ''
with open(input_file, mode='r', encoding='utf8') as f_file:
# this with statement part can be replaced by
# using sed under your OS like this example:
# sed -i 's/}{/}\n{/g' 111111111.txt
data = f_file.read()
data = data.replace('}{', '}\n{')
seen, total_keys, to_write = set(), 0, {}
# split the lines of the in memory data
for elm in data.split('\n'):
# convert the line to a valid Python dict
converted = json.loads(elm)
# loop over the keys
for key, value in converted.items():
total_keys += 1
# if the key is not seen then add it for further manipulations
# else ignore it
if key not in seen:
seen.add(key)
to_write.update({key: value})
# write the dict's keys & values into a new file as a JSON format
with open(output_file, mode='a+', encoding='utf8') as out_file:
out_file.write(json.dumps(to_write) + '\n')
print(
'found duplicated key(s): {seen} from {total}'.format(
seen=total_keys - len(seen),
total=total_keys
)
)
Output:
found duplicated key(s): 43836 from 45367
And finally, the output file will be a valid JSON file and the duplicated keys will be removed with their values.
The basic difference between the file structure and actual json format is the missing commas and the lines are not enclosed within [. So the same can be achieved with the below code snippet
with open('json_file.txt') as f:
# Read complete file
a = (f.read())
# Convert into single line string
b = ''.join(a.splitlines())
# Add , after each object
b = b.replace("}", "},")
# Add opening and closing parentheses and ignore last comma added in prev step
b = '[' + b[:-1] + ']'
x = json.loads(b)

Concatenate multiple text files of DNA sequences in Python or R?

I was wondering how to concatenate exon/DNA fasta files using Python or R.
Example files:
So far I really liked using R ape package for the cbind method, solely because of the fill.with.gaps=TRUE attribute. I really need gaps inserted when a species is missing an exon.
My code:
ex1 <- read.dna("exon1.txt", format="fasta")
ex2 <- read.dna("exon2.txt", format="fasta")
output <- cbind(ex1, ex2, fill.with.gaps=TRUE)
write.dna(output, "Output.txt", format="fasta")
Example:
exon1.txt
>sp1
AAAA
>sp2
CCCC
exon2.txt
>sp1
AGG-G
>sp2
CTGAT
>sp3
CTTTT
Output file:
>sp1
AAAAAGG-G
>sp2
CCCCCTGAT
>sp3
----CTTTT
So far I am having trouble trying to apply this technique when I have multiple exon files (trying to figure out a loop to open and execute the cbind method for all files ending with .fa in the directory), and sometimes not all files have exons that are all identical in length - hence DNAbin stops working.
So far I have:
file_list <- list.files(pattern=".fa")
myFunc <- function(x) {
for (file in file_list) {
x <- read.dna(file, format="fasta")
out <- cbind(x, fill.with.gaps=TRUE)
write.dna(out, "Output.txt", format="fasta")
}
}
However when I run this and I check my output text file, it misses many exons and I think that is because not all files have the same exon length... or my script is failing somewhere and I can't figure it out: (
Any ideas? I can also try Python.
If you prefer using Linux one liners you have
cat exon1.txt exon2.txt > outfile
if you want only the unique records from the outfile use
awk '/^>/{f=!d[$1];d[$1]=1}f' outfile > sorted_outfile
I just came out with this answer in Python 3:
def read_fasta(fasta): #Function that reads the files
output = {}
for line in fasta.split("\n"):
line = line.strip()
if not line:
continue
if line.startswith(">"):
active_sequence_name = line[1:]
if active_sequence_name not in output:
output[active_sequence_name] = []
continue
sequence = line
output[active_sequence_name].append(sequence)
return output
with open("exon1.txt", 'r') as file: # read exon1.txt
file1 = read_fasta(file.read())
with open("exon2.txt", 'r') as file: # read exon2.txt
file2 = read_fasta(file.read())
finaldict = {} #Concatenate the
for i in list(file1.keys()) + list(file2.keys()): #both files content
if i not in file1.keys():
file1[i] = ["-" * len(file2[i][0])]
if i not in file2.keys():
file2[i] = ["-" * len(file1[i][0])]
finaldict[i] = file1[i] + file2[i]
with open("output.txt", 'w') as file: # output that in file
for k, i in finaldict.items(): # named output.txt
file.write(">{}\n{}\n".format(k, "".join(i))) #proper formatting
It's pretty hard to comment and explain it completely, and it might not help you, but this is better than nothing :P
I used Łukasz Rogalski's code from answer to Reading a fasta file format into Python dict.

ODS file to JSON

I would like to convert my speadsheet of data to a JSON array of array.
This site do it: http://www.shancarter.com/data_converter/index.html
And I looked into the source code.
But what I would like is a macro / script / extension or any way to programm it to convert my .ods into a JSON file:
Like:
NAME VALUE COLOR DATE
Alan 12 blue Sep. 25, 2009
Shan 13 "green blue" Sep. 27, 2009
John 45 orange Sep. 29, 2009
Minna 27 teal Sep. 30, 2009
To:
[
["Alan",12,"blue","Sep. 25, 2009"],
["Shan",13,"green\tblue","Sep. 27, 2009"],
["John",45,"orange","Sep. 29, 2009"],
["Minna",27,"teal","Sep. 30, 2009"]
]
The answer might be again late but marcoconti83 has done exactly that: reading a ods file and return them as two dimensional arrays.
https://github.com/marcoconti83/read-ods-with-odfpy/blob/master/ODSReader.py
Once you have the data in arrays, it's not that difficult to get them into a json file. Here's example code:
import json
from odftoarray import ODSReader # renamed the file to odftoarray.py
r = ODSReader("your_file.ods")
arrays = r.getSheet("your_data_sheet_name")
json.dumps(arrays)
This may be a bit late but for those who come looking and want to do this it would likely be best to save the .ods file as .csv which nearly all spreadsheet programs can do. Then use something like this to convert it:
import csv
import sys
import json, os
def convert(csv_filename, fieldnames):
print ("Opening CSV file: ",csv_filename)
f=open(csv_filename, 'r')
csv_reader = csv.DictReader(f,fieldnames)
json_filename = csv_filename.split(".")[0]+".json"
print ("Saving JSON to file: ",json_filename)
jsonf = open(json_filename,'w')
data = json.dumps([r for r in csv_reader])
jsonf.write(data)
f.close()
jsonf.close()
csvfile = ('path/to/the/csv/file.csv')
field_names = [
"a",
"list",
"of",
"fieldnames"
]
convert(csvfile, field_names)
And a tip, csv is pretty human readable so just go through and make sure it saved in the format you want and then run this script to convert it to JSON. Check it out in a JSON viewer like JSONView and then you should be good to go!

Categories

Resources