Splitting this CSV file into a list - python

So I want to read in a csv file in Python 3 and split it into a list, where each index ([0], [1], and so on) corresponds to one comma-separated value.
This is my csv file:
2017-04-1,14.9,30.1,0,8.2,10.8,NE,33,11:55,20.3,51,0,E,11,1023.9,29.5,25,0,ENE,7,1020.3
2017-04-2,17.4,31.6,0,8.0,10.9,NE,35,08:56,23.5,34,0,NE,17,1021.4,30.7,20,0,SE,9,1018.6
2017-04-3,12.1,31.8,0,6.8,10.8,SSW,33,15:14,23.1,39,0,SSE,6,1022.7,29.3,34,0,SSW,19,1020.8
2017-04-4,15.4,30.4,0,7.0,10.7,E,28,03:01,19.8,64,0,ESE,11,1024.9,29.7,29,0,S,9,1020.3
2017-04-5,12.3,30.4,0,5.2,10.6,S,19,13:10,21.7,55,0,NE,6,1018.2,27.7,37,0,WSW,11,1013.5
2017-04-6,13.7,24.4,0,5.2,8.1,SW,43,16:16,17.1,94,7,NE,2,1013.1,22.8,63,3,SSW,20,1011.7
2017-04-7,14.8,22.3,0,5.4,8.9,SSE,35,06:26,16.4,56,5,SSE,17,1023.6,21.3,33,3,SSE,4,1021.6
2017-04-8,8.7,23.6,0,5.0,10.5,SW,33,15:41,16.0,58,0,SE,7,1027.6,22.1,44,0,SW,17,1024.5
2017-04-9,11.1,27.4,0,4.8,10.4,ESE,24,10:30,18.1,56,0,ENE,13,1027.4,26.8,26,0,NE,9,1023.1
2017-04-10,10.0,30.1,0,5.6,10.4,SSW,24,16:38,19.3,51,4,E,9,1022.7,30.0,20,1,E,6,1018.4
2017-04-11,13.1,28.1,0,6.6,10.5,SW,28,15:02,21.8,38,0,NE,9,1016.6,26.6,35,0,SW,13,1015.7
2017-04-12,10.6,23.8,0,5.2,9.7,SW,35,16:19,17.4,69,6,ENE,9,1021.5,23.0,52,1,SW,15,1019.9
2017-04-13,12.9,26.8,0,4.2,10.4,SSW,31,15:56,19.9,64,1,ESE,11,1024.3,25.5,49,1,SW,15,1020.1
2017-04-14,12.8,29.0,0,5.8,6.2,SW,22,15:43,18.1,73,4,SSE,2,1019.3,27.6,42,5,SSW,11,1015.4
2017-04-15,14.8,29.3,0,4.0,7.3,SSE,31,22:03,18.5,73,6,S,11,1015.7,28.2,38,7,SE,9,1011.7
2017-04-16,17.2,25.1,0,5.4,7.0,SSE,35,00:43,18.8,66,4,ESE,11,1014.6,24.4,54,5,SW,11,1011.6
2017-04-17,15.4,21.8,0,5.0,2.5,SSW,24,07:56,17.8,74,5,S,13,1015.3,21.4,59,8,SSW,11,1013.2
2017-04-18,15.3,25.0,0,4.0,8.0,SSW,31,19:02,19.7,72,6,SSE,9,1013.0,22.8,63,1,SW,15,1010.4
Currently, when I read it in, the data is only split at the end of each line, so if I access index [0] this is the output:
2017-04-1,14.9,30.1,0,8.2,10.8,NE,33,11:55,20.3,51,0,E,11,1023.9,29.5,25,0,ENE,7,1020.3
What I need to understand is how to split the csv so that accessing index [0] gives 2017-04-1, index [1] gives the next value after the comma, and so on.
Here is my code at the moment.
import matplotlib.pyplot as plt
# Opening and reading files
weatherdata = open('weather.csv', encoding='latin1')
data = weatherdata.readlines()
The encoding has to be latin1 because the file contains degree symbols.
Thanks for the guidance.

You have already read all lines into data:
weatherdata = open('weather.csv', encoding='latin1')
data = weatherdata.readlines()
data
data will be a list of strings:
['2017-04-1,14.9,30.1,0,8.2,10.8,NE,33,11:55,20.3,51,0,E,11,1023.9,29.5,25,0,ENE,7,1020.3\n',
'2017-04-2,17.4,31.6,0,8.0,10.9,NE,35,08:56,23.5,34,0,NE,17,1021.4,30.7,20,0,SE,9,1018.6 \n',
'2017-04-3,12.1,31.8,0,6.8,10.8,SSW,33,15:14,23.1,39,0,SSE,6,1022.7,29.3,34,0,SSW,19,1020.8\n',
'2017-04-4,15.4,30.4,0,7.0,10.7,E,28,03:01,19.8,64,0,ESE,11,1024.9,29.7,29,0,S,9,1020.3\n',
'2017-04-5,12.3,30.4,0,5.2,10.6,S,19,13:10,21.7,55,0,NE,6,1018.2,27.7,37,0,WSW,11,1013.5\n',
'2017-04-6,13.7,24.4,0,5.2,8.1,SW,43,16:16,17.1,94,7,NE,2,1013.1,22.8,63,3,SSW,20,1011.7\n',
'2017-04-7,14.8,22.3,0,5.4,8.9,SSE,35,06:26,16.4,56,5,SSE,17,1023.6,21.3,33,3,SSE,4,1021.6\n',
'2017-04-8,8.7,23.6,0,5.0,10.5,SW,33,15:41,16.0,58,0,SE,7,1027.6,22.1,44,0,SW,17,1024.5\n',
'2017-04-9,11.1,27.4,0,4.8,10.4,ESE,24,10:30,18.1,56,0,ENE,13,1027.4,26.8,26,0,NE,9,1023.1\n',
'2017-04-10,10.0,30.1,0,5.6,10.4,SSW,24,16:38,19.3,51,4,E,9,1022.7,30.0,20,1,E,6,1018.4\n',
'2017-04-11,13.1,28.1,0,6.6,10.5,SW,28,15:02,21.8,38,0,NE,9,1016.6,26.6,35,0,SW,13,1015.7\n',
'2017-04-12,10.6,23.8,0,5.2,9.7,SW,35,16:19,17.4,69,6,ENE,9,1021.5,23.0,52,1,SW,15,1019.9\n',
'2017-04-13,12.9,26.8,0,4.2,10.4,SSW,31,15:56,19.9,64,1,ESE,11,1024.3,25.5,49,1,SW,15,1020.1\n',
'2017-04-14,12.8,29.0,0,5.8,6.2,SW,22,15:43,18.1,73,4,SSE,2,1019.3,27.6,42,5,SSW,11,1015.4\n',
'2017-04-15,14.8,29.3,0,4.0,7.3,SSE,31,22:03,18.5,73,6,S,11,1015.7,28.2,38,7,SE,9,1011.7\n',
'2017-04-16,17.2,25.1,0,5.4,7.0,SSE,35,00:43,18.8,66,4,ESE,11,1014.6,24.4,54,5,SW,11,1011.6\n',
'2017-04-17,15.4,21.8,0,5.0,2.5,SSW,24,07:56,17.8,74,5,S,13,1015.3,21.4,59,8,SSW,11,1013.2\n',
'2017-04-18,15.3,25.0,0,4.0,8.0,SSW,31,19:02,19.7,72,6,SSE,9,1013.0,22.8,63,1,SW,15,1010.4']
Then data[0].split(',')[0] will give you:
'2017-04-1'
and data[0].split(',')[1] will give you:
'14.9'
and so on.
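Note that the last field of each line still carries the trailing '\n' left by readlines(); stripping it before splitting avoids that (a small sketch, where rows is just a hypothetical name):
# strip the trailing newline, then split every line into its fields
rows = [line.strip().split(',') for line in data]
print(rows[0][0])   # '2017-04-1'
print(rows[0][1])   # '14.9'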

Simply read and then split:
weatherdata = open('weather.csv', encoding='latin1')
data = [line.split(',') for line in weatherdata.read().splitlines()]
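A variant of the same idea, in case any field ever contains a quoted comma, is to let the csv module do the splitting. A minimal sketch, reusing the latin1 encoding from the question:
import csv

with open('weather.csv', encoding='latin1', newline='') as f:
    data = [row for row in csv.reader(f)]   # list of lists, one per line
print(data[0][0])   # '2017-04-1'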

Or you can use pandas, which does the splitting for you. Pandas is very useful for reading and manipulating datasets: it reads the data all at once and lets you pull out individual columns afterwards.
import pandas as pd
df = pd.read_csv('weather.csv')
df.iloc[:, 0]  # this will get the first column
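One caveat, based on the sample shown (which has no header row): read_csv treats the first line as the header by default, so you would probably also want header=None, plus the latin1 encoding mentioned in the question. A sketch:
import pandas as pd

df = pd.read_csv('weather.csv', header=None, encoding='latin1')
print(df.iloc[0, 0])   # '2017-04-1'
print(df.iloc[0, 1])   # 14.9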

Related

pandas reading csv with one row spanning multiple lines

My csv starts out like this:
,index,spotify_id,artist_name,track_name,album_name,duration_ms,lyrics,lyrics_bert_embeddings
0,0,5Jk0vfT81ltt2rYyrWDzZ5,Hundred Waters,Xtalk - Kodak to Graph Remix,The Moon Rang Like a Bell,285327,not fetched,"[ 0.00722605 -0.23726921 0.15163635 -0.28774077 0.07081255 0.26606813
Each row ends like this, on a new line:
0.03439684 -0.29289168 0.13590978 0.2332756 -0.24305075 0.2034984 ]"
These values are from a big numpy array encoded with np.array2string() and span multiple lines in the csv.
When using pd.read_csv it throws "ParserError: Error tokenizing data. C error: EOF inside string starting at row 90607". When using the parameter engine="python" it throws "ParserError: unexpected end of data". When using the separator sep='\t+' it just puts each line in a new row of the dataframe. When using csv.reader with open(file_path) and iterating through each line, the same happens as with sep='\t+'.
Is there a way to automatically append each row to the original row it belongs to or do I have to preprocess this by hand?
I tried it with your csv data to check; the code I used is below, along with the result.
import pandas as pd
import csv
data_path = 'dt.csv'
df = pd.read_csv(data_path, header = None, quoting=csv.QUOTE_NONE, encoding='utf-8')
dt_json = pd.DataFrame.to_json(df)
print(dt_json)
As an example, this just changes the data format from CSV to JSON using a pandas dataframe. The printed output is:
{"0":{"0":null,"1":0.0},
"1":{"0":"index","1":"0"},
"2":{"0":"spotify_id","1":"5Jk0vfT81ltt2rYyrWDzZ5"},
"3":{"0":"artist_name","1":"Hundred Waters"},
"4":{"0":"track_name","1":"Xtalk - Kodak to Graph Remix"},
"5":{"0":"album_name","1":"The Moon Rang Like a Bell"},
"6":{"0":"duration_ms","1":"285327"},
"7":{"0":"lyrics","1":"not fetched"},
"8":{"0":"lyrics_bert_embeddings","1":"[ 0.00722605 -0.23726921 0.15163635 -0.28774077 0.07081255 0.26606813\r\n 0.03439684 -0.29289168 0.13590978 0.2332756 -0.24305075 0.2034984 ]"}}
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html?highlight=read_csv
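One thing worth double-checking (a general note, not specific to this dataset): csv.reader does stitch quoted multi-line fields back into a single logical record, provided the file is opened with newline='' and you iterate over the reader object rather than the raw file. A minimal sketch, using the dt.csv name from the answer above:
import csv

# Each iteration yields one logical record, even when a quoted field
# (like lyrics_bert_embeddings) spans several physical lines.
with open('dt.csv', encoding='utf-8', newline='') as f:
    for record in csv.reader(f):
        print(len(record), record[2])   # field count and the spotify_id column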

text file to csv conversion: how to get rid of split lines in input file

I am trying to read a text file from a third party in which lines are randomly split at the 28th column.
When I convert it to csv it is fine, but when I feed the files to Athena, it cannot read them because of the split lines.
Is there a way to find the stray CR here and rejoin the line so it matches the other lines?
Thanks,
SM
This is a code snippet:
import pandas as pd
add_columns = ["col1", "col2", "col3"...."col59"]
res = pd.read_csv("file_name.txt", names= add_columns, sep=',\s+', delimiter=',', encoding="utf-8", skipinitialspace=True)
df = pd.DataFrame(res)
df.to_csv('final_name.csv', index = None)
file_name.txt
99,999,00499013,X701,,,5669,5669,1232,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1232,LXA,,<<line is split on column 28>>
2,5669,,,,68,,,1,,,,,,,,,,,,71,
99,999,00499017,X701,,,5669,5669,1160,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1160,LXA,,1,5669,,,,,,,1,,,,,,,,,,,,71,
99,999,00499019,X701,,,5669,5669,1284,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1284,LXA,,2,5669,,,,66,,,1,,,,,,,,,,,,71,
I have tried str.split, but no luck.
If you are able to convert it successfully to CSV using pandas, you can try to save it as a CSV to feed into Athena.
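If the split itself has to be repaired first, one rough approach is to re-join any physical line that does not yet contain a complete record before writing the CSV. A sketch assuming every complete record has 59 comma-separated fields (the length of add_columns in the question) and a hypothetical output file name:
EXPECTED_FIELDS = 59   # assumption: taken from add_columns above

def rejoin_split_lines(path):
    buffer = ''
    with open(path, encoding='utf-8') as f:
        for line in f:
            buffer += line.rstrip('\r\n')
            if buffer.count(',') >= EXPECTED_FIELDS - 1:   # record is complete
                yield buffer
                buffer = ''
    if buffer:   # flush anything left over
        yield buffer

with open('file_name_fixed.txt', 'w', encoding='utf-8') as out:
    for record in rejoin_split_lines('file_name.txt'):
        out.write(record + '\n')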

Reading the 2nd entry of a .json

I am trying to read the density entry of a list of arrays within a .json file. Here's a small portion of the file from the beginning:
["time_tag","density","speed","temperature"],["2019-04-14 18:20:00.000","4.88","317.9","11495"],["2019-04-14 18:21:00.000","4.89","318.4","11111"]
This is the code I have thus far:
import json

with open('plasma.json', 'r') as myfile:
    data = myfile.read()

obj = json.loads(data)
print(str(obj['density']))
It should print everything under the density column, but I'm getting an error saying that the file can't be opened.
First, your json file is not correct. If you want to read it with a single call obj = json.loads(data), the json file should be:
[["time_tag","density","speed","temperature"],["2019-04-14 18:20:00.000","4.88","317.9","11495"],["2019-04-14 18:21:00.000","4.89","318.4","11111"]]
Notice the extra square bracket, making it a single list of sublists.
That said, since obj is a list of lists, print(str(obj['density'])) cannot work. You need to loop over the list to print what you want, or convert it to a dataframe first.
Looping directly
idx = obj[0].index('density')  # get the index of the density entry from the first list in obj (the header)
for row in obj[1:]:  # loop over all sublists except the first one
    print(row[idx])
Using a dataframe (pandas)
import pandas as pd
df = pd.DataFrame(obj[1:], columns=obj[0])  # convert to a dataframe, using the first row as the column header
print(df['density'])
Are you sure your data is a valid json and not a csv?
The snippet of data provided above looks like a csv file rather than json.
You will be able to read the density key of the csv with:
import csv
input_file = csv.DictReader(open("plasma.csv"))
for row in input_file:
    print(row['density'])
Data formatted as csv
["time_tag","density","speed","temperature"]
["2019-04-14 18:20:00.000","4.88","317.9","11495"]
["2019-04-14 18:21:00.000","4.89","318.4","11111"]
Result
4.88
4.89

Splitting an array with \x00\t as the separator

I have imported all the data from a csv file using:
import pandas as pd
# Import data using panda
df = pd.read_csv('ML_Cancelaciones_20190301.csv','rb', engine='python')
x = df.values
I used 'rb' as it was impossible using utf-8 and other methods.
When I try utf-16 I get the following error:
ParserError: Expected 2 fields in line 6666, saw 3
I believe this might be due to a 'ñ' being present in this row.
Using 'rb' gives me an array with twice as many rows as the original csv file, with one row being empty and the other containing all the columns stacked together. The result looks like this:
array([['\x00'],
[
'\x001\x000\x004\x000\x001\x007\x00H\x001\x00\t\x006\x003\x00\t\x002\x000\x001\x006\x000\x008\x001\x002\x00\t\x009\x009\x009\x009\x009\x009\x009\x009\x00\t\x002\x000\x001\x006\x000\x008\x001\x002\x00\t\x001\x004\x00:\x005\x008\x00\t\x002\x000\x001\x006\x000\x008\x002\x006\x00\t\x002\x000\x001\x006\x000\x008\x002\x009\x00\t\x000\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x000\x00\t\x00F\x00A\x00-\x00R\x00S\x00\t\x00F\x00a\x00c\x00t\x00u\x00r\x00a\x00d\x00a\x00 \x00(\x00N\x00)\x00\t\x00A\x00c\x00t\x00i\x00v\x00a\x00\t\x00E\x00s\x00t\x00a\x00d\x00o\x00s\x00 \x00U\x00n\x00i\x00d\x00o\x00s\x00\t\x002\x00\t\x00E\x00s\x00t\x00a\x00d\x00o\x00s\x00 \x00U\x00n\x00i\x00d\x00o\x00s\x00\t\x002\x00\t\x00E\x00.\x00E\x00.\x00U\x00.\x00U\x00.\x00\t\x00E\x00s\x00p\x00a\x00ñ\x00o\x00l\x00\t\x00E\x00S\x00\t\x00P\x00T\x00G\x00\t\x00P\x00r\x00e\x00s\x00t\x00i\x00g\x00e\x00\t\x00P\x00T\x00G\x00\t\x00P\x00r\x00e\x00s\x00t\x00i\x00g\x00e\x00\t\x00E\x00D\x00\t\x00E\x00-\x00D\x00i\x00s\x00t\x00r\x00i\x00b\x00u\x00t\x00i\x00o\x00n\x00\t\x00B\x002\x00B\x00\t\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x00P\x00r\x00o\x00m\x00o\x00c\x00i\x00o\x00n\x00a\x00l\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x00T\x00o\x00d\x00o\x00 \x00I\x00n\x00c\x00l\x00u\x00i\x00d\x00o\x00\t\x001\x000\x000\x00%\x00\t\x00J\x00u\x00n\x00e\x00 \x00S\x00a\x00l\x00e\x00 \x00\t\x00J\x00u\x00n\x00e\x00 \x00S\x00a\x00l\x00e\x00 \x00\t\x00V\x00\t\x00N\x00u\x00e\x00v\x00a\x00\t\x00D\x00o\x00u\x00b\x00l\x00e\x00\t\x00N\x00o\x00\t\x00-\x009\x009\x00\t\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x00\t\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x00N\x00o\x00\t\x00N\x00o\x00\t\x00\t\x00\t\x00S\x00í\x00\t\x002\x00\t\x00P\x00e\x00n\x00d\x00i\x00e\x00n\x00t\x00e\x00 \x00d\x00e\x00 \x00C\x00o\x00b\x00r\x00o\x00\t\x00-\x009\x009\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x00C\x00R\x00\t\x00C\x00r\x00é\x00d\x00i\x00t\x00o\x00\t\x003\x002\x001\x000\x009\x00\t\x001\x005\x000\x003\x001\x005\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x00N\x00o\x00\t\x00E\x00.\x00E\x00.\x00U\x00.\x00U\x00.\x00\t\x00U\x00S\x00A\x00\t\x00P\x00A\x00B\x00L\x00O\x00\t\x00S\x00r\x00.\x00\t\x00\t\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x00R\x00e\x00p\x00ú\x00b\x00l\x00i\x00c\x00a\x00 \x00D\x00o\x00m\x00i\x00n\x00i\x00c\x00a\x00n\x00a\x00\t\x003\x00\t\x006\x00\t\x002\x00\t\x002\x00\t\x000\x00\t\x002\x00\t\x004\x000\x008\x00,\x009\x006\x000\x000\x00'],
['\x00'],
...,
[ '\x00V\x003\x000\x004\x000\x001\x00H\x001\x00\t\x006\x001\x00\t\x002\x000\x001\x005\x000\x004\x001\x005\x00\t\x009\x009\x009\x009\x009\x009\x009\x009\x00\t\x002\x000\x001\x005\x000\x004\x001\x005\x00\t\x001\x006\x00:\x000\x000\x00\t\x002\x000\x001\x005\x000\x004\x001\x008\x00\t\x002\x000\x001\x005\x000\x004\x002\x006\x00\t\x000\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x000\x00\t\x00F\x00A\x00-\x00R\x00S\x00\t\x00F\x00a\x00c\x00t\x00u\x00r\x00a\x00d\x00a\x00 \x00(\x00N\x00)\x00\t\x00A\x00c\x00t\x00i\x00v\x00a\x00\t\x00E\x00s\x00t\x00a\x00d\x00o\x00s\x00 \x00U\x00n\x00i\x00d\x00o\x00s\x00\t\x002\x00\t\x00E\x00s\x00t\x00a\x00d\x00o\x00s\x00 \x00U\x00n\x00i\x00d\x00o\x00s\x00\t\x002\x00\t\x00E\x00.\x00E\x00.\x00U\x00.\x00U\x00.\x00\t\x00E\x00n\x00g\x00l\x00i\x00s\x00h\x00\t\x00E\x00N\x00\t\x00B\x002\x00C\x00 \x00M\x00\t\x00W\x00e\x00b\x00 \x00C\x00a\x00l\x00l\x00 \x00C\x00e\x00n\x00t\x00e\x00r\x00\t\x00B\x002\x00C\x00\t\x00W\x00e\x00b\x00 \x00C\x00l\x00i\x00e\x00n\x00t\x00e\x00\t\x00B\x002\x00C\x00\t\x00B\x00u\x00s\x00i\x00n\x00e\x00s\x00s\x00-\x00t\x00o\x00-\x00C\x00u\x00s\x00t\x00o\x00m\x00e\x00r\x00\t\x00B\x002\x00C\x00\t\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x00P\x00r\x00o\x00m\x00o\x00c\x00i\x00o\x00n\x00a\x00l\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x00T\x00o\x00d\x00o\x00 \x00I\x00n\x00c\x00l\x00u\x00i\x00d\x00o\x00\t\x001\x000\x000\x00%\x00\t\x00S\x00P\x00R\x00I\x00N\x00G\x00 \x00S\x00U\x00P\x00E\x00R\x00 \x00S\x00A\x00L\x00E\x00\t\x00S\x00P\x00R\x00I\x00N\x00G\x00 \x00S\x00U\x00P\x00E\x00R\x00 \x00S\x00A\x00L\x00E\x00\t\x00M\x00\t\x00M\x00o\x00d\x00i\x00f\x00i\x00c\x00a\x00d\x00a\x00\t\x00J\x00u\x00n\x00i\x00o\x00r\x00 \x00S\x00u\x00i\x00t\x00e\x00\t\x00N\x00o\x00\t\x00-\x009\x009\x00\t\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x00\t\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x00N\x00o\x00\t\x00N\x00o\x00\t\x00\t\x00\t\x00N\x00o\x00\t\x002\x00\t\x00P\x00e\x00n\x00d\x00i\x00e\x00n\x00t\x00e\x00 \x00d\x00e\x00 \x00C\x00o\x00b\x00r\x00o\x00\t\x00-\x009\x009\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x00T\x00C\x00\t\x00T\x00a\x00r\x00j\x00e\x00t\x00a\x00 \x00C\x00r\x00é\x00d\x00i\x00t\x00o\x00\t\x00-\x009\x009\x00\t\x00-\x009\x008\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x00N\x00o\x00\t\x00D\x00e\x00s\x00c\x00o\x00n\x00o\x00c\x00i\x00d\x00o\x00\t\x00-\x009\x009\x00\t\x00D\x00I\x00L\x00L\x00O\x00N\x00\t\x00\t\x001\x009\x006\x001\x00-\x001\x002\x00-\x001\x007\x00 \x000\x000\x00:\x000\x000\x00:\x000\x000\x00\t\x00\t\x00E\x00.\x00E\x00.\x00U\x00.\x00U\x00.\x00\t\x00E\x00s\x00t\x00a\x00d\x00o\x00s\x00 \x00U\x00n\x00i\x00d\x00o\x00s\x00\t\x008\x00\t\x003\x002\x00\t\x004\x00\t\x003\x00\t\x001\x00\t\x000\x00\t\x003\x004\x001\x006\x00,\x000\x000\x000\x000\x00'],
['\x00'],
['\x00']], dtype=object)
I want to convert this array so that it has the shape nrows x ncolumns of the original csv file. I also want the entries to be the same as in the original file, i.e. words and numbers.
How could I do this?
An example of the data is here:
csv file opened with word pad
The row: '\x001\x000\x004\x000\x001\x007\x00H\x001\x00\t\x006\x003\x00\t\x002\x000\x001\x006\x000\x008\x001\x002\x00\t\...'
should look like:
['104017H1', '63', '20160812',...]
Therefore all values have a '\x00' before them and each column is separated by '\x00\t'. Is there a way I can do this transformation?
Thank you very much
You can try replace and split:
a = '\x001\x000\x004\x000\x001\x007\x00H\x001\x00\t\x006\x003\x00\t\x002\x000\x001\x006\x000\x008\x001\x002\x00\t'
a.replace('\x00','').split('\t')
OUTPUT :
['104017H1', '63', '20160812', '']
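As a side note, the interleaved '\x00' bytes are what UTF-16 text looks like when it is decoded byte by byte, and the fields appear to be tab-separated, so reading the file directly may also work. This is a guess based on the sample above; the earlier utf-16 attempt may have failed simply because the default comma separator was used:
import pandas as pd

# untested guess: UTF-16 encoding plus tab separator instead of the defaults
df = pd.read_csv('ML_Cancelaciones_20190301.csv', encoding='utf-16', sep='\t')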

How to extract specific data from a downloaded csv file and transpose into a new csv file?

I'm working with an online survey application that allows me to download survey results into a csv file. However, the downloaded csv puts each survey question and answer in a new column, whereas I need the csv file formatted with each survey question and answer on a new row. There is also a lot of data in the downloaded csv file that I want to ignore completely.
How can I parse out the desired rows and columns of the downloaded csv file and write them to a new csv file in a specific format?
For example, I download data and it looks like this:
V1,V2,V3,Q1,Q2,Q3,Q4....
null,null,null,item,item,item,item....
0,0,0,4,5,4,5....
0,0,0,2,3,2,3....
The first row contains the 'keys' that I will need, except that V1-V3 must be excluded. Row 2 must be excluded altogether. Row 3 is my first subject, so I need the values 4,5,4,5 to be paired with the keys Q1,Q2,Q3,Q4. Row 4 is a new subject, which needs to be excluded as well since my program only handles one subject at a time.
The csv file that I need to create in order for my script to function properly looks like this:
Q1,4
Q2,5
Q3,4
Q4,5
I've tried using izip to pivot the data, but I don't know how to select only the specific rows and columns I need:
from itertools import izip
a = izip(*csv.reader(open("CDI.csv", "rb")))
csv.writer(open("CDI_test.csv", "wb")).writerows(a)
Here is a simple python script that should do the job for you. It takes command-line arguments that designate the number of entries to skip at the beginning of each line, the entries to skip at the end, the input file, and the output file. So, for example, the command would look like:
python question.py 3:7 input.txt output.txt
You can also replace sys.argv[1], sys.argv[2], and so on with literal values inside the script if you don't want to pass the arguments every time. (Note that the text file version below takes a single leading count such as 3, while the CSV version takes a leading:ending pair such as 3:7.)
Text file version:
import sys

inputFile = open(sys.argv[2], "r")
outputFile = open(sys.argv[3], "w")
leadingRemoved = int(sys.argv[1])
# strip extra whitespace from each line in the file, then split by ","
lines = [x.strip().split(",") for x in inputFile.readlines()]
# zip all but the first x elements of the first and third rows
zipped = zip(lines[0][leadingRemoved:], lines[2][leadingRemoved:])
for tuples in zipped:
    # write each question/number pair to the file on its own line
    outputFile.write(",".join(tuples) + "\n")
inputFile.close()
outputFile.close()
# input from command line: python questions.py leadingRemoved pathToInput pathToOutput
CSV file version:
import sys
import csv

with open(sys.argv[2], "rb") as inputFile:
    # remove null bytes while reading
    reader = csv.reader((line.replace('\0', '') for line in inputFile), delimiter="\t")
    outputFile = open(sys.argv[3], "wb")
    leadingRemoved, endingremoved = [int(x) for x in sys.argv[1].split(":")]
    # create a 2d array of all the elements of each row
    lines = [x for x in reader]
    print lines
    # zip all but the first x elements of the first and third rows
    zipped = zip(lines[0][leadingRemoved:endingremoved], lines[2][leadingRemoved:endingremoved])
    writer = csv.writer(outputFile)
    writer.writerows(zipped)
    print zipped
    outputFile.close()
Here is something similar I did using multiple values; it could be changed to single values.
#!/usr/bin/env python
import csv
def dict_from_csv(filename):
    '''
    (file) -> list of dictionaries

    Function to read a csv file and format it into a list of dictionaries.
    The headers are the keys, with all other data becoming values.
    The format of the csv file and the headers included need to be known
    to extract the email addresses.
    '''
    # open the file and read it using csv.reader()
    # read the file; for each row that has content, add it to the list mf
    # the keys for our user dict are the first content line of the file, mf[0]
    # the values for our user dict are the other lines in the file, mf[1:]
    mf = []
    with open(filename, 'r') as f:
        my_file = csv.reader(f)
        for row in my_file:
            if any(row):
                mf.append(row)
    file_keys = mf[0]
    file_values = mf[1:]  # choose the row/rows you want
    # combine the two lists into a list of dictionaries, using the keys list
    # as the keys and each data row as the values
    my_list = []
    for value in file_values:
        my_list.append(dict(zip(file_keys, value)))  # pair the header keys with this row's values
    # return the list of dictionaries
    return my_list
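A hypothetical usage example for the helper above, assuming the survey download from the question is saved as CDI.csv:
rows = dict_from_csv("CDI.csv")
first_subject = rows[1]   # rows[0] is the null/item row, rows[1] the first subject
for key in ("Q1", "Q2", "Q3", "Q4"):
    print(key + "," + first_subject[key])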
I suggest you read up on pandas for this type of activity:
http://pandas.pydata.org/pandas-docs/stable/io.html
import pandas
input_dataframe = pandas.read_csv("input.csv")
transposed_df = input_dataframe.transpose()
# delete rows and edit data easily using pandas dataframe
# this is a good library to get some experience working with
transposed_df.to_csv("output.csv")
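As a concrete sketch of the specific reshaping asked for in the question, assuming the downloaded file is saved as input.csv and the columns are named as shown:
import pandas as pd

df = pd.read_csv("input.csv", skiprows=[1])           # drop the null/item row
subject = df.iloc[0].drop(labels=["V1", "V2", "V3"])  # first subject, Q columns only
subject.to_csv("output.csv", header=False)            # writes Q1,4 / Q2,5 / ...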
