When converting .dat into csv my code changes the output values - python

I have a .dat file and I am trying to convert it into a csv one.
I have found a piece of code that "somehow" solved my problem.
The thing is: that code gave me a messed-up output file as a result. In other words, it changed my values!
Can someone help me with that?
I am a total beginner at this.
Thanks a lot.
import csv

with open('f.dat') as input_file:
    lines = input_file.readlines()
    newLines = []
    for line in lines:
        # split each whitespace-separated line into a list of column values
        newLine = line.strip().split()
        newLines.append(newLine)

with open('f_out.csv', 'w', newline='') as output_file:
    file_writer = csv.writer(output_file)
    file_writer.writerows(newLines)
My input file looks like this:
"-18.7723311308 3166157043.25795 0 1006743187.3562
-18.8214122765 188717303.231381 0 57141624.5127759
-18.7022205742 399933910.540253 0 87142384.8698447
-18.5903166748 23045528.3797531 0 5841919.83133624
-18.3051499783 76457482.0309581 0 25326122.2381197"
(with more lines)
And the output file looks like this:
-21.5607314306,1200000000.0,0,500000000.0,MBH
-21.5607314306,1200000000.0,0,500000000.0,MBH
-21.5607314306,1200000000.0,0,500000000.0,MBH
What I simply want is an output file where my columns are separated by a comma, like:
"-18.7723311308, 3166157043.25795, 0, 1006743187.3562
-18.8214122765, 188717303.231381, 0 ,57141624.5127759"

.dat files are not very readable with plain file I/O operations. You can use the asammdf module to read the .dat file:
pip install asammdf
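If the .dat file really is an ASAM MDF measurement file, a minimal sketch of exporting it to CSV with asammdf could look like the following (this is only a guess about the file type, and assumes a recent asammdf version):
from asammdf import MDF

mdf = MDF('f.dat')        # open the measurement file
df = mdf.to_dataframe()   # all channels as one pandas DataFrame
df.to_csv('f_out.csv')    # write it out as CSV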

There is a module called pandas which can convert a list to a csv file easily. Try this code; it works:
import pandas as pd

with open('f.dat') as input_file:
    lines = input_file.readlines()
newLines = []
for line in lines:
    # split each line into its columns so the CSV gets one value per cell
    newLines.append(line.strip().split())
pd.DataFrame(newLines).to_csv('f_out.csv', index=False, header=False)
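Assuming the columns in f.dat are whitespace-separated as in the sample above, an even shorter pandas-only sketch would be to let read_csv do the splitting:
import pandas as pd

# sep=r'\s+' treats any run of spaces/tabs as a column separator
df = pd.read_csv('f.dat', sep=r'\s+', header=None)
df.to_csv('f_out.csv', index=False, header=False)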

Regarding the asammdf suggestion: .dat files in the context of asammdf are Measurement Data Files (MDF v2 or v3), typically used in the automotive domain. I don't think the OP has this kind of file.

Related

How to load huge JSON data in Jupyter on VSCode

I'm doing a sentiment analysis for my master's degree and I'm working with Jupyter Notebook in VSCode on Ubuntu 20.04. I have a problem: when I try to load my file (12 GB), my kernel dies. So I split the file into 6 pieces of 2 GB each, but even then I can't load each file to create a dataframe to work with. So I would like to ask: how can I load each file, create a dataframe from it, and then combine everything into one dataframe to work with?
I tried to load one file in this way:
import pandas as pd
filename = pd.read_json("xaa.json", lines=True, chunksize= 200000)
and in this case the kernel didn't die. From this point, how can I get this into a dataframe? I know that this way the file is read in chunks of 200000 lines each, but I don't know how to combine all these chunks into one dataframe.
Thank you for your attention, and I'm sorry for the basic question.
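A minimal sketch of one way to combine the chunks that read_json yields into a single dataframe (assuming the concatenated result actually fits in memory; the file name is the one from the question):
import pandas as pd

reader = pd.read_json("xaa.json", lines=True, chunksize=200000)
# read_json with chunksize returns an iterator of DataFrames;
# concat stitches them back into one DataFrame
df = pd.concat(reader, ignore_index=True)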
I want to post my solution: first of all, I chose to read all the data this way:
import glob
import json

files = list(glob.iglob('Tesi/Resources/Twitter/*.json'))
tweets_data = []
for file in files:
    tweets_file = open(file, "r", encoding='utf-8')
    # each line of the file is one JSON-encoded tweet
    for line in tweets_file:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    tweets_file.close()
Then I defined a function to flatten all the tweets in order to load them into one dataframe.
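For the flattening step, a hedged sketch using pandas' json_normalize (assuming a reasonably recent pandas version and that each tweet is a nested dict) could be:
import pandas as pd

# json_normalize flattens nested dicts into columns like "user.screen_name"
df = pd.json_normalize(tweets_data)
print(df.shape)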

Grab values from a separate csv file and replace the values of columns in a pipe-delimited file

Trying to whip this out in Python. Long story short, I have a csv file that contains column data I need to inject into another file that is pipe-delimited. My understanding is that Python can't replace values in place, so I have to re-write the whole file with the new values.
data file(csv):
value1,value2,iwantthisvalue3
source file(txt, | delimited)
value1|value2|iwanttoreplacethisvalue3|value4|value5|etc
fixed file(txt, | delimited)
samevalue1|samevalue2| replacedvalue3|value4|value5|etc
I can't figure out how to accomplish this. This is my latest attempt (broken code):
import csv

# collect the replacement values (third column) from the csv data file
result = []
with open(r"C:\data\generatedfixed.csv", "r") as data_file:
    for fields in csv.reader(data_file):
        result.append(fields[2])

# rewrite the pipe-delimited source file, swapping in the replacement
# value for the third field of each row
with open(r"C:\data\data.txt", "r") as source_file, open(r"C:\data\data_fixed.txt", "w") as fixed_file:
    for n, line in enumerate(source_file):
        fields = line.rstrip("\n").split("|")
        fields[2] = result[n]
        fixed_file.write("|".join(fields) + "\n")
I would highly suggest you use the pandas package here; it makes handling tabular data very easy and would help you a lot in this case. Once you have installed pandas, import it with:
import pandas as pd
To read the files simply use:
data_file = pd.read_csv(r"C:\data\generatedfixed.csv")
source_file = pd.read_csv(r"C:\data\data.txt", delimiter="|")
and after that, manipulating these two files is easy. I'm not exactly sure how many values or which ones you want to replace, but if both "iwantthisvalue3" and "iwanttoreplacethisvalue3" have the same number of rows, then this should do the trick:
source_file['iwanttoreplacethisvalue3'] = data_file['iwantthisvalue3']
Now all you need to do is save the dataframe (the table that we just updated) to a file. Since you want to save it to a .txt file with "|" as the delimiter, this is the line to do that (though you can customize how to save it in a lot of ways):
source_file.to_csv(r"C:\data\data_fixed.txt", sep='|', index=False)
Let me know if everything works and if this helped you. I would also encourage you to read up (or watch some videos) on pandas if you're planning to work with tabular data; it is an awesome library with great documentation and functionality.
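Putting the pieces above together, a minimal end-to-end sketch (assuming both files have header rows containing the column names used in the question) might be:
import pandas as pd

data_file = pd.read_csv(r"C:\data\generatedfixed.csv")
source_file = pd.read_csv(r"C:\data\data.txt", delimiter="|")

# copy the wanted column over the one to be replaced
source_file['iwanttoreplacethisvalue3'] = data_file['iwantthisvalue3']

# write the result back out, pipe-delimited, without the index column
source_file.to_csv(r"C:\data\data_fixed.txt", sep="|", index=False)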

How do I search a csv file for keywords stored in another csv file?

I'm trying to search a csv file having 150K+ rows using keywords stored in another csv file with several dozen rows. What's the best way to go about this? I've tried a few things, but nothing has gotten me very far.
Current Code:
import csv
import pandas as pd

data = pd.read_csv('mycsv.csv')
# note: iterating a DataFrame like this yields its column names, not its rows
for line in data:
    if 'Apple' in line:
        print(line)
This isn't what I want; it's just what I currently have. The for loop is my attempt at just getting output using one of the keywords from the smaller csv file. So far I'm either getting errors or there is no output.
Edit: I forgot to mention that the large csv file I'm trying to search comes from a web link, so I don't think with open is going to work.
Supposing that your keywords are stored in a file named keys.csv, with one keyword per row, like this:
Orange
Apple
...
then try this:
with open('mycsv.csv') as mycsv, open('keys.csv') as keys:
    # strip the trailing newline from each keyword
    keys = [key.strip() for key in keys.readlines()]
    for line in mycsv:
        if any(key in line for key in keys):
            print(line)
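Since the edit says the large file comes from a web link, one hedged alternative is to let pandas read the URL directly and filter the rows; the URL below and the keys.csv layout (one keyword per row, no header) are only assumptions for illustration:
import pandas as pd

url = "https://example.com/mycsv.csv"            # hypothetical location of the big file
keys = pd.read_csv("keys.csv", header=None)[0].tolist()

df = pd.read_csv(url)
# keep rows where any cell contains any of the keywords
mask = df.astype(str).apply(
    lambda row: any(key in cell for cell in row for key in keys),
    axis=1,
)
print(df[mask])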

CParserError: Error tokenizing data

I'm having some trouble reading a csv file
import pandas as pd
df = pd.read_csv('Data_Matches_tekha.csv', skiprows=2)
I get
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 526, saw 5
and when I add sep=None to read_csv I get another error:
Error: line contains NULL byte
I tried adding unicode='utf-8', and I even tried the csv reader; nothing works with this file.
The csv file is totally fine; I checked it and I see nothing wrong with it.
In your actual code, the line is:
>>> pandas.read_csv("Data_Matches_tekha.xlsx", sep=None)
You are trying to read an Excel file, and not a plain text CSV which is why things are not working.
Excel files (xlsx) are in a special binary format which cannot be read as simple text files (like CSV files).
You need to either convert the Excel file to a CSV file (note - if you have multiple sheets, each sheet should be converted to its own csv file), and then read those.
You can use read_excel, or you can use a library like xlrd which is designed to read the binary format of Excel files; see Reading/parsing Excel (xls) files with Python for more information on that.
Use read_excel instead of read_csv if it is an Excel file:
import pandas as pd
df = pd.read_excel("Data_Matches_tekha.xlsx")
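Following the note above about multiple sheets, a small sketch for dumping each sheet of the workbook to its own CSV (sheet_name=None makes read_excel return a dict of DataFrames keyed by sheet name):
import pandas as pd

sheets = pd.read_excel("Data_Matches_tekha.xlsx", sheet_name=None)
for name, frame in sheets.items():
    # one CSV per sheet, named after the sheet
    frame.to_csv(f"{name}.csv", index=False)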
I encountered the same error when I used to_csv to write some data and then read it in another script. I found an easy solution that doesn't go through pandas' read function: the pickle module.
pickle is part of Python's standard library, so there is nothing to install.
Then you can write your data (first) with the code below:
import pickle

with open(path, 'wb') as output:
    pickle.dump(variable_to_save, output)
And finally import your data in another script using
import pickle

with open(path, 'rb') as input_file:
    data = pickle.load(input_file)
Note that if, when reading your saved data, you want to use a different Python version than the one you saved it with, you can specify that in the writing step by using protocol=x, with x corresponding to the version (2 or 3) you intend to use for reading.
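For example, a minimal sketch of pinning the protocol at write time (assuming the goal is to keep the file readable from Python 2 as well):
import pickle

with open(path, 'wb') as output:
    # protocol=2 produces a file that Python 2 can also load
    pickle.dump(variable_to_save, output, protocol=2)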
I hope this can be of some use.

Opening and reading an excel .xlsx file in python

I'm trying to open an Excel .xlsx file with Python but am unable to find a way to do it. I've tried using pandas, but it wants to use a library called NumPy; I've tried to install NumPy, but pandas still can't find it.
I've also tried using the xlrd library but I get the following traceback:
Traceback (most recent call last):
File "C:\test.py", line 3, in <module>
book = open_workbook('test.xlsx')
File "C:\Python27\lib\site-packages\xlrd\__init__.py", line 370, in open_workbook
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "C:\Python27\lib\site-packages\xlrd\__init__.py", line 1323, in getbof
raise XLRDError('Expected BOF record; found 0x%04x' % opcode)
XLRDError: Expected BOF record; found 0x4b50
Which I assume is because XLRD can't read .xlsx files?
Anyone got any ideas?
EDIT:
import csv

with open('test.csv', 'rb') as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    for row in data:
        print "------------------"
        print row
        print "------------------"
        for cell in row:
            print cell
Maybe you could export your .xlsx to a .csv file?
Then you could try:
import csv

with open('file.csv', 'rb') as file:
    contents = csv.reader(file)
    [x for x in contents]
This may be useful:
http://docs.python.org/2/library/csv.html#csv.reader
Hope that helps!
EDIT:
If you want to locate a specific cell, such as F13, you could make a nested list like a matrix and then refer to each element:
import csv

with open('file.csv', 'rb') as file:
    contents = csv.reader(file)
    matrix = list()
    for row in contents:
        matrix.append(row)
And then access F13 with matrix[12][5] (row 13 is index 12, column F is index 5).
P.S.: I did not test this. If "row" is a list with each cell as an element, you keep appending all lines to the matrix, so the first index is the row number and the second is the column number.
It seems that you are on a Linux distro. I had the same problem too, and it does not happen with the "xlwt" library, only with "xlrd". What I did is not the right way to solve this problem, but it makes things work for the time being (hopefully that question gets a proper answer soon): I installed "xlrd" on Windows, took the folder, and pasted it into the directory on Linux where my Python code is, and it worked.
Since I know other people will also be reading this -
You can install the following module (it's not there automatically)
https://pypi.python.org/pypi/openpyxl
You can read the following to get a nice breakdown on how to use it
https://automatetheboringstuff.com/chapter12/
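As a rough sketch of what reading an .xlsx with openpyxl looks like (assuming a reasonably recent openpyxl version and the test.xlsx file from the question):
from openpyxl import load_workbook

wb = load_workbook('test.xlsx')
ws = wb.active                      # first (active) worksheet
print(ws['F13'].value)              # read a single cell, e.g. F13

# iterate over all rows as tuples of plain cell values
for row in ws.iter_rows(values_only=True):
    print(row)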
