pandas reading csv with one row spanning multiple lines - python

My csv starts out like this:
,index,spotify_id,artist_name,track_name,album_name,duration_ms,lyrics,lyrics_bert_embeddings
0,0,5Jk0vfT81ltt2rYyrWDzZ5,Hundred Waters,Xtalk - Kodak to Graph Remix,The Moon Rang Like a Bell,285327,not fetched,"[ 0.00722605 -0.23726921 0.15163635 -0.28774077 0.07081255 0.26606813
each row ends like this in a new line:
0.03439684 -0.29289168 0.13590978 0.2332756 -0.24305075 0.2034984 ]"
These values are from a big numpy array encoded with np.array2string() and span multiple lines in the csv.
When using pd.read_csv it throws a "ParserError: Error tokenizing data. C error: EOF inside string starting at row 90607". With the parameter engine="python" it throws a "ParserError: unexpected end of data". With the separator sep='\t+' it just puts each line in a new row of the dataframe. Reading the file with csv.reader inside with open(file_path) and iterating over each line gives the same result as sep='\t+'.
Is there a way to automatically append each row to the original row it belongs to or do I have to preprocess this by hand?

I tested this with your csv data. The code and its output are below.
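import pandas as pd
data_path = 'dt.csv'
# Default quoting keeps a quoted field together even when it spans several lines;
# quoting=csv.QUOTE_NONE would treat the quotes as ordinary characters and split the field.
df = pd.read_csv(data_path, header=None, encoding='utf-8')
dt_json = df.to_json()
print(dt_json)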
As an example, I converted the data from CSV to JSON using a pandas dataframe:
{"0":{"0":null,"1":0.0},
"1":{"0":"index","1":"0"},
"2":{"0":"spotify_id","1":"5Jk0vfT81ltt2rYyrWDzZ5"},
"3":{"0":"artist_name","1":"Hundred Waters"},
"4":{"0":"track_name","1":"Xtalk - Kodak to Graph Remix"},
"5":{"0":"album_name","1":"The Moon Rang Like a Bell"},
"6":{"0":"duration_ms","1":"285327"},
"7":{"0":"lyrics","1":"not fetched"},
"8":{"0":"lyrics_bert_embeddings","1":"[ 0.00722605 -0.23726921 0.15163635 -0.28774077 0.07081255 0.26606813\r\n 0.03439684 -0.29289168 0.13590978 0.2332756 -0.24305075 0.2034984 ]"}}
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html?highlight=read_csv
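The same behaviour can be checked without pandas. The following is a minimal sketch, assuming the embeddings column is wrapped in double quotes as in the question; the sample text is a shortened stand-in for the real file, and parse_embedding is a hypothetical helper name, not part of any library. It shows that the standard csv module keeps a quoted field together across line breaks, and that the np.array2string() text can then be turned back into numbers.

```python
import csv
import io

# Shortened stand-in for the real file: fewer columns and fewer numbers.
raw = (
    ',index,spotify_id,lyrics,lyrics_bert_embeddings\n'
    '0,0,5Jk0vfT81ltt2rYyrWDzZ5,not fetched,"[ 0.00722605 -0.23726921\n'
    ' 0.03439684 0.2034984 ]"\n'
)

# newline='' leaves line endings untouched so the csv module can handle
# the line break inside the quoted field itself.
reader = csv.reader(io.StringIO(raw, newline=''))
header = next(reader)
records = list(reader)


def parse_embedding(cell):
    """Hypothetical helper: turn '[ 0.1 -0.2\\n 0.3 ]' into a list of floats."""
    return [float(token) for token in cell.strip().strip('[]').split()]


embedding = parse_embedding(records[0][-1])
print(len(records), embedding)
```

If the real file still fails with "EOF inside string", one possible cause is a row whose closing quote is missing, which would have to be located and repaired before any parser can read it.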

Related

text file to csv conversion how to get rid of split lines in input file

I am trying to read a text file from a third party in which lines are randomly split at the 28th column.
When I convert it to csv it is fine, but when I feed the files to Athena, it is not able to read them because of the split.
Is there a way to find the CR here and put the line back together like the other lines?
Thanks,
SM
This is a code snippet :
import pandas as pd
add_columns = ["col1", "col2", "col3"...."col59"]
res = pd.read_csv("file_name.txt", names= add_columns, sep=',\s+', delimiter=',', encoding="utf-8", skipinitialspace=True)
df = pd.DataFrame(res)
df.to_csv('final_name.csv', index = None)
file_name.txt
99,999,00499013,X701,,,5669,5669,1232,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1232,LXA,,<<line is split on column 28>>
2,5669,,,,68,,,1,,,,,,,,,,,,71,
99,999,00499017,X701,,,5669,5669,1160,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1160,LXA,,1,5669,,,,,,,1,,,,,,,,,,,,71,
99,999,00499019,X701,,,5669,5669,1284,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1284,LXA,,2,5669,,,,66,,,1,,,,,,,,,,,,71,
I have tried str.split but, no luck.
If you are able to convert it successfully to CSV using pandas, you can try to save it as a CSV to feed into Athena.
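One way to repair the split lines before handing the file to pandas or Athena is to glue physical lines together until a record has the expected number of fields. This is only a sketch under assumptions: rejoin_split_lines is a hypothetical helper, every record is assumed to have the same number of fields, and no field is assumed to contain a quoted comma. The tiny sample below uses 5 fields; for the file in the question the count would be the real number of columns.

```python
def rejoin_split_lines(lines, expected_fields):
    """Yield logical records, joining physical lines until enough fields are seen."""
    buffer = ""
    for line in lines:
        buffer += line.rstrip("\r\n")
        # A record with n fields contains n - 1 commas.
        if buffer.count(",") + 1 >= expected_fields:
            yield buffer
            buffer = ""
    if buffer:
        # Trailing partial record, passed through for the caller to inspect.
        yield buffer


sample = [
    "a,b,c,\n",      # this record was split after the third comma
    "d,e\n",
    "f,g,h,i,j\n",   # this record is intact
]
records = list(rejoin_split_lines(sample, expected_fields=5))
print(records)
```

The repaired records can then be written back out one per line and read with pd.read_csv as before.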

Pandas error reading csv with double quotes

I've read all related topics - like this, this and this - but couldn't get a solution to work.
I have an input csv file like this:
ItemId,Content
i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I've tried several different approaches but couldn't get it to work. I want to read this csv file into a Dataframe like this:
ItemId Content
-------- -------------------------------------------------------------------------------
i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
With following code (Python 3.9):
df = pd.read_csv('test.csv', sep=',', skipinitialspace = True, quotechar = '"')
As far as I understand, commas inside dictionary column and commas inside quotation marks are being treated as regular separators, so it raises following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6
Is it possible to produce desired result? Thanks.
The problem is that the commas in the Content column are interpreted as separators. You can solve this by using pd.read_fwf to manually set the character positions at which to split:
df = pd.read_fwf('test.csv', colspecs=[(0, 8),(9,100)], header=0, names=['ItemId', 'Content'])
Result:
ItemId Content
0 i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1 i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I don't think you'll be able to read it normally with pandas because it has the delimiter used multiple times for a single value; however, reading it with python and doing some processing, you should be able to convert it to pandas dataframe:
import pandas as pd
def splitValues(x):
    # Split on the first comma only, so commas inside Content are preserved
    index = x.find(',')
    return x[:index], x[index+1:].strip()
with open('file.csv') as data:
    columns = next(data).strip().split(',')  # header row
    df = pd.DataFrame(columns=columns, data=(splitValues(row) for row in data))
OUTPUT:
ItemId Content
0 i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1 i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
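Since the Content column looks like JSON, the split-on-first-comma idea can be taken one step further by decoding each value. This is a sketch using only the standard library; the in-memory sample stands in for test.csv, and it assumes every Content value is valid JSON.

```python
import io
import json

# In-memory stand-in for test.csv from the question.
sample = io.StringIO(
    'ItemId,Content\n'
    'i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}\n'
    'i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}\n'
)

header = next(sample).strip().split(',')
records = []
for line in sample:
    line = line.strip()
    if not line:
        continue
    # partition splits at the first comma only, leaving the JSON text intact.
    item_id, _, content = line.partition(',')
    records.append({header[0]: item_id, header[1]: json.loads(content)})

titles = [record['Content']['Title'] for record in records]
print(titles)
```

The resulting list of dictionaries can be passed to pd.DataFrame(records) if a dataframe is needed.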

Reading the 2nd entry of a .json

I am trying to read the density entry of a list of arrays within a .json file. Here's a small portion of the file from the beginning:
["time_tag","density","speed","temperature"],["2019-04-14 18:20:00.000","4.88","317.9","11495"],["2019-04-14 18:21:00.000","4.89","318.4","11111"]
This is the code I have thus far:
with open('plasma.json', 'r') as myfile:
    data = myfile.read()
obj = json.loads(data)
print(str(obj['density']))
It should print everything under the density column but I'm getting an error saying that the file can't be opened
First, your JSON file is not valid. If you want to read it with a single call obj = json.loads(data), the JSON file should be:
[["time_tag","density","speed","temperature"],["2019-04-14 18:20:00.000","4.88","317.9","11495"],["2019-04-14 18:21:00.000","4.89","318.4","11111"]]
Notice the extra square bracket, making it a single list of sublists.
That said, since obj is a list of lists, print(str(obj['density'])) cannot work. You need to loop over the list to print what you want, or convert it to a dataframe first.
Looping directly
idx = obj[0].index('density')  # index of the density entry in the header (first list in obj)
for row in obj[1:]:  # loop over all sublists except the first one
    print(row[idx])  # print the density value
Using a dataframe (pandas)
import pandas as pd
df = pd.DataFrame(obj[1:], columns=obj[0])  # convert to a dataframe, using the first row as column header
print(df['density'])
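If the file cannot be edited, the missing outer brackets can also be added in code before parsing. This is a minimal sketch, assuming the whole file consists of comma-separated JSON arrays exactly as in the excerpt; the raw string stands in for the contents of plasma.json.

```python
import json

# Stand-in for the text read from plasma.json.
raw = (
    '["time_tag","density","speed","temperature"],'
    '["2019-04-14 18:20:00.000","4.88","317.9","11495"],'
    '["2019-04-14 18:21:00.000","4.89","318.4","11111"]'
)

# Wrapping the text in brackets turns it into one JSON list of lists.
obj = json.loads('[' + raw + ']')
header, rows = obj[0], obj[1:]

# Pair each value with its column name so columns can be looked up by key.
records = [dict(zip(header, row)) for row in rows]
densities = [float(record['density']) for record in records]
print(densities)
```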
Are you sure your data is valid JSON and not a CSV? The snippet of data provided above looks like a CSV file rather than JSON.
You will be able to read the density key of the csv with:
import csv
input_file = csv.DictReader(open("plasma.csv"))
for row in input_file:
    print(row['density'])
Data formatted as csv
["time_tag","density","speed","temperature"]
["2019-04-14 18:20:00.000","4.88","317.9","11495"]
["2019-04-14 18:21:00.000","4.89","318.4","11111"]
Result
4.88
4.89

Wrong row count for CSV file in python

I am processing a csv file and before that I am getting the row count using the below code.
total_rows=sum(1 for row in open(csv_file,"r",encoding="utf-8"))
The code has been written with the help given in this link.
However, total_rows doesn't match the actual number of rows in the csv file. I have found an alternative way to do it, but I would like to know why this is not working correctly.
In the CSV file, there are cells with huge text and I have to use the encoding to avoid errors reading the csv file.
Any help is appreciated!
Let's assume you have a csv file in which some cell contains multi-line text.
$ cat example.csv
colA,colB
1,"Hi. This is Line 1.
And this is Line2"
By the look of it, this has three lines, and wc -l agrees:
$ wc -l example.csv
3 example.csv
And so does open with sum:
sum(1 for row in open('./example.csv',"r",encoding="utf-8"))
# 3
But now if you read it with a csv parser such as pandas.read_csv:
import pandas as pd
df = pd.read_csv('./example.csv')
df
colA colB
0 1 Hi. This is Line 1.\nAnd this is Line2
The other alternative way to fetch the correct number of rows is given below:
import csv
with open(csv_file, "r", encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter=",")
    data = list(reader)
    row_count = len(data)  # includes the header row
Excluding the header, the csv contains 1 line, which I believe is what you expect.
This is because colB's first cell (a.k.a. huge text block) is now properly handled with the quotes wrapping the entire text.
I think the problem here is that you are not counting rows but newlines (either \r\n on Windows or \n on Linux). This breaks when a cell contains text with newline characters, for example:
1, "my huge text\n with many lines\n"
2, "other text"
Your method will return 4 for the data above when there are actually only 2 rows.
Try to use Pandas or other library for reading CSV files. Example:
import pandas as pd
df = pd.read_csv(pathToCsv, sep=',', header=None)
number_of_rows = len(df.index) # or df[0].count()
Note that len(df.index) and df[0].count() are not interchangeable as count excludes NaNs.
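The difference between counting newlines and counting records can be reproduced with the standard library alone. This is a small sketch using an in-memory copy of the example.csv content shown earlier in this thread; the variable names are only illustrative.

```python
import csv
import io

# Same content as example.csv: a header plus one record whose
# second cell contains a line break inside quotes.
text = 'colA,colB\n1,"Hi. This is Line 1.\nAnd this is Line2"\n'

# Iterating over a text stream yields physical lines, like wc -l.
physical_lines = sum(1 for _ in io.StringIO(text))

# csv.reader yields logical records and keeps the quoted cell together.
records = list(csv.reader(io.StringIO(text, newline='')))
data_rows = len(records) - 1  # subtract the header row

print(physical_lines, len(records), data_rows)
```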

Splitting this CSV file into a list

So I want to read in a csv file in Python3 and split it into a list, where each index (0, 1, and onwards) relates to one comma-separated value.
This is my csv file:
2017-04-1,14.9,30.1,0,8.2,10.8,NE,33,11:55,20.3,51,0,E,11,1023.9,29.5,25,0,ENE,7,1020.3
,2017-04-2,17.4,31.6,0,8.0,10.9,NE,35,08:56,23.5,34,0,NE,17,1021.4,30.7,20,0,SE,9,1018.6
2017-04-3,12.1,31.8,0,6.8,10.8,SSW,33,15:14,23.1,39,0,SSE,6,1022.7,29.3,34,0,SSW,19,1020.8
,2017-04-4,15.4,30.4,0,7.0,10.7,E,28,03:01,19.8,64,0,ESE,11,1024.9,29.7,29,0,S,9,1020.3
,2017-04-5,12.3,30.4,0,5.2,10.6,S,19,13:10,21.7,55,0,NE,6,1018.2,27.7,37,0,WSW,11,1013.5
,2017-04-6,13.7,24.4,0,5.2,8.1,SW,43,16:16,17.1,94,7,NE,2,1013.1,22.8,63,3,SSW,20,1011.7
,2017-04-7,14.8,22.3,0,5.4,8.9,SSE,35,06:26,16.4,56,5,SSE,17,1023.6,21.3,33,3,SSE,4,1021.6
,2017-04-8,8.7,23.6,0,5.0,10.5,SW,33,15:41,16.0,58,0,SE,7,1027.6,22.1,44,0,SW,17,1024.5
,2017-04-9,11.1,27.4,0,4.8,10.4,ESE,24,10:30,18.1,56,0,ENE,13,1027.4,26.8,26,0,NE,9,1023.1
,2017-04-10,10.0,30.1,0,5.6,10.4,SSW,24,16:38,19.3,51,4,E,9,1022.7,30.0,20,1,E,6,1018.4
,2017-04-11,13.1,28.1,0,6.6,10.5,SW,28,15:02,21.8,38,0,NE,9,1016.6,26.6,35,0,SW,13,1015.7
,2017-04-12,10.6,23.8,0,5.2,9.7,SW,35,16:19,17.4,69,6,ENE,9,1021.5,23.0,52,1,SW,15,1019.9
,2017-04-13,12.9,26.8,0,4.2,10.4,SSW,31,15:56,19.9,64,1,ESE,11,1024.3,25.5,49,1,SW,15,1020.1
,2017-04-14,12.8,29.0,0,5.8,6.2,SW,22,15:43,18.1,73,4,SSE,2,1019.3,27.6,42,5,SSW,11,1015.4
,2017-04-15,14.8,29.3,0,4.0,7.3,SSE,31,22:03,18.5,73,6,S,11,1015.7,28.2,38,7,SE,9,1011.7
,2017-04-16,17.2,25.1,0,5.4,7.0,SSE,35,00:43,18.8,66,4,ESE,11,1014.6,24.4,54,5,SW,11,1011.6
,2017-04-17,15.4,21.8,0,5.0,2.5,SSW,24,07:56,17.8,74,5,S,13,1015.3,21.4,59,8,SSW,11,1013.2
,2017-04-18,15.3,25.0,0,4.0,8.0,SSW,31,19:02,19.7,72,6,SSE,9,1013.0,22.8,63,1,SW,15,1010.4
Currently when I read it in, each element is being split at the end of the line. So if I accessed index[0] this would be the output.
2017-04-1,14.9,30.1,0,8.2,10.8,NE,33,11:55,20.3,51,0,E,11,1023.9,29.5,25,0,ENE,7,1020.3
What I need to understand is how to split the csv so that if I access index[0], I will be given 2017-04-1. And index[1] would give the next value after the comma.
Here is my code at the moment.
import matplotlib.pyplot as plt
# Opening and reading files
weatherdata = open('weather.csv', encoding='latin1')
data = weatherdata.readlines()
The encoding needs to be Latin-1 because the file has to handle degree symbols.
Thanks for the guidance.
You have already read all lines into data:
weatherdata = open('weather.csv')
data = weatherdata.readlines()
data
data will be a string list:
['2017-04-1,14.9,30.1,0,8.2,10.8,NE,33,11:55,20.3,51,0,E,11,1023.9,29.5,25,0,ENE,7,1020.3\n',
'2017-04-2,17.4,31.6,0,8.0,10.9,NE,35,08:56,23.5,34,0,NE,17,1021.4,30.7,20,0,SE,9,1018.6 \n',
'2017-04-3,12.1,31.8,0,6.8,10.8,SSW,33,15:14,23.1,39,0,SSE,6,1022.7,29.3,34,0,SSW,19,1020.8\n',
'2017-04-4,15.4,30.4,0,7.0,10.7,E,28,03:01,19.8,64,0,ESE,11,1024.9,29.7,29,0,S,9,1020.3\n',
'2017-04-5,12.3,30.4,0,5.2,10.6,S,19,13:10,21.7,55,0,NE,6,1018.2,27.7,37,0,WSW,11,1013.5\n',
'2017-04-6,13.7,24.4,0,5.2,8.1,SW,43,16:16,17.1,94,7,NE,2,1013.1,22.8,63,3,SSW,20,1011.7\n',
'2017-04-7,14.8,22.3,0,5.4,8.9,SSE,35,06:26,16.4,56,5,SSE,17,1023.6,21.3,33,3,SSE,4,1021.6\n',
'2017-04-8,8.7,23.6,0,5.0,10.5,SW,33,15:41,16.0,58,0,SE,7,1027.6,22.1,44,0,SW,17,1024.5\n',
'2017-04-9,11.1,27.4,0,4.8,10.4,ESE,24,10:30,18.1,56,0,ENE,13,1027.4,26.8,26,0,NE,9,1023.1\n',
'2017-04-10,10.0,30.1,0,5.6,10.4,SSW,24,16:38,19.3,51,4,E,9,1022.7,30.0,20,1,E,6,1018.4\n',
'2017-04-11,13.1,28.1,0,6.6,10.5,SW,28,15:02,21.8,38,0,NE,9,1016.6,26.6,35,0,SW,13,1015.7\n',
'2017-04-12,10.6,23.8,0,5.2,9.7,SW,35,16:19,17.4,69,6,ENE,9,1021.5,23.0,52,1,SW,15,1019.9\n',
'2017-04-13,12.9,26.8,0,4.2,10.4,SSW,31,15:56,19.9,64,1,ESE,11,1024.3,25.5,49,1,SW,15,1020.1\n',
'2017-04-14,12.8,29.0,0,5.8,6.2,SW,22,15:43,18.1,73,4,SSE,2,1019.3,27.6,42,5,SSW,11,1015.4\n',
'2017-04-15,14.8,29.3,0,4.0,7.3,SSE,31,22:03,18.5,73,6,S,11,1015.7,28.2,38,7,SE,9,1011.7\n',
'2017-04-16,17.2,25.1,0,5.4,7.0,SSE,35,00:43,18.8,66,4,ESE,11,1014.6,24.4,54,5,SW,11,1011.6\n',
'2017-04-17,15.4,21.8,0,5.0,2.5,SSW,24,07:56,17.8,74,5,S,13,1015.3,21.4,59,8,SSW,11,1013.2\n',
'2017-04-18,15.3,25.0,0,4.0,8.0,SSW,31,19:02,19.7,72,6,SSE,9,1013.0,22.8,63,1,SW,15,1010.4']
Then use data[0].split(',')[0], it will give you:
'2017-04-1'
and data[0].split(',')[1], will be:
'14.9'
and so on.
Simply read and then split:
weatherdata = open('weather.csv')
data = [line.split(',') for line in weatherdata.read().splitlines()]
Or you can use pandas, which does this for you. Pandas is very useful for reading and manipulating datasets: it reads the data all at once, and you can then access the individual columns.
import pandas as pd
df = pd.read_csv('weather.csv', header=None)  # the file has no header row
df.iloc[:, 0]  # this will get the first column
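The standard csv module is another option that avoids manual splitting. This is a minimal sketch; the in-memory sample is a shortened stand-in for weather.csv with only the first few values of two rows. With the real file, the stream would come from open('weather.csv', encoding='latin1', newline='') instead.

```python
import csv
import io

# Shortened stand-in for weather.csv.
sample = io.StringIO(
    '2017-04-1,14.9,30.1,0,8.2,10.8,NE,33,11:55\n'
    '2017-04-2,17.4,31.6,0,8.0,10.9,NE,35,08:56\n',
    newline='',
)

# Each row comes back as a list of strings, already split on commas.
rows = list(csv.reader(sample))

first_date = rows[0][0]           # first value of the first row
dates = [row[0] for row in rows]  # first value of every row
print(first_date, rows[0][1], dates)
```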
