Convert .dat into .csv in Python

I want to convert a data set from a .dat file into a CSV file. The data format is as follows:
Each row begins with the sentiment score, followed by the text associated with that rating.
I want the sentiment value (-1 or 1) in one column and the corresponding review text in a second column.
What I tried so far:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import csv
# read train.dat into a list of lists
datContent = [i.strip().split() for i in open("train.dat").readlines()]
# write it as a new CSV file
with open("train.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)
def your_func(row):
    return row['Sentiments'] / row['Review']
columns_to_keep = ['Sentiments', 'Review']
dataframe = pd.read_csv("train.csv", usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)
print dataframe
Sample screenshot of the resulting train.csv: it has a comma after every word in the review.

If all your rows follow that consistent format, you can use pd.read_fwf. This is a little safer than using read_csv, in the event that your second column also contains the delimiter you are attempting to split on.
df = pd.read_fwf('data.txt', header=None,
                 widths=[2, int(1e5)], names=['label', 'text'])
print(df)
   label                      text
0     -1  ieafxf rjzy xfxk ymi wuy
1      1     lqqm ceegjnbjpxnidygr
2     -1  zss awoj anxb rfw kgbvnl
Contents of data.txt:
-1 ieafxf rjzy xfxk ymi wuy
+1 lqqm ceegjnbjpxnidygr
-1 zss awoj anxb rfw kgbvnl

As mentioned in the comments, read_csv would be appropriate here.
df = pd.read_csv('train_csv.csv', sep='\t', names=['Sentiments', 'Review'])
Sentiments Review
0 -1 alskjdf
1 1 asdfa
2 1 afsd
3 -1 sdf
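If the file is separated by spaces rather than tabs, the comma-after-every-word problem in the question comes from calling split() with no limit. A minimal sketch of one way around it, using only the standard library, is to split each line at the first run of whitespace only. The function name dat_to_rows and the in-memory sample are assumptions for illustration; in practice the lines would come from open("train.dat") and the writer would target open("train.csv", "w", newline="").

```python
import csv
import io


def dat_to_rows(lines):
    """Split each line into (sentiment, review) at the first run of whitespace."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # maxsplit=1 keeps the review text in one piece
        parts = line.split(maxsplit=1)
        score = int(parts[0])  # int() accepts both "+1" and "-1"
        review = parts[1] if len(parts) > 1 else ""
        rows.append((score, review))
    return rows


# Sample lines taken from the data.txt example above.
sample = ["-1 ieafxf rjzy xfxk ymi wuy\n", "+1 lqqm ceegjnbjpxnidygr\n"]
rows = dat_to_rows(sample)

# Write to an in-memory buffer; swap in a real file handle as needed.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Sentiments", "Review"])
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Because csv.writer quotes fields when needed, a review that itself contains commas should still land in a single column.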

Related

Create dataframe from a string Python

How do I create a dataframe from a string that looks like this (part of the string)?
,file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,\r\n-938.0,-28.74816,\r\n-929.0,-27.53912,\r\n-920.0,-25.92707,\r\n-911.0,-24.31503,\r\n-900.0,-23.64334,\r\n-891.0,-22.29997,
I am trying to make a dataframe that looks like this:
In the code below s is the string:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(s)).dropna(axis=1)
df.rename(columns={df.columns[0]: ""}, inplace=True)
By the way, if the string comes from a csv file then it is simpler to read the file directly using pd.read_csv.
Edit: This code will create a multiindex of columns:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(s), header = None).dropna(how="all", axis=1).T
df[0] = df.loc[1, 0]
df = df.set_index([0, 1]).T
It looks like you want a multi-level dataframe from the string. Here's how I would do it.
Step 1: Split the string by '\r\n'. Then split each resulting value by ','.
Step 2: The above step creates a list of lists. Element #0 has 4 items and element #1 has 2 items. The remaining elements have 3 items each and hold the actual data.
Step 3: Convert the data into a dictionary from element #2 onwards. Use the values in element #1 as keys for the dictionary (namely x data and y data). To build each key's list of values, use dict.setdefault(key, []).append(value). This ensures the data is stored as a key: [list of values] dictionary.
Step 4: Create a normal dataframe from the dictionary, since all the values are now stored under their keys.
Step 5: Now that you have the dataframe, create the MultiIndex by converting its columns to a MultiIndex.
Putting all this together, the code is:
import pandas as pd
text = ',file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,\r\n-938.0,-28.74816,\r\n-929.0,-27.53912,\r\n-920.0,-25.92707,\r\n-911.0,-24.31503,\r\n-900.0,-23.64334,\r\n-891.0,-22.29997,'
line_text = [txt.split(',') for txt in text.split('\r\n')]
dct = {}
for x,y,z in line_text[2:]:
    dct.setdefault(line_text[1][0], []).append(x)
    dct.setdefault(line_text[1][1], []).append(y)
df = pd.DataFrame(dct)
df.columns = pd.MultiIndex.from_tuples([(line_text[0][i],line_text[1][i]) for i in [0,1]])
print(df)
Output of this will be:
file_05
x data y data
0 -970.0 -34.12164
1 -959.0 -32.37526
2 -949.0 -30.360199
3 -938.0 -28.74816
4 -929.0 -27.53912
5 -920.0 -25.92707
6 -911.0 -24.31503
7 -900.0 -23.64334
8 -891.0 -22.29997
First convert your raw data to a table in Python.
Then save it to a CSV file using the csv module.
from pandas import DataFrame
# s is the raw data
s = ",file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,\r\n-938.0,-28.74816,\r\n-929.0,-27.53912,\r\n-920.0,-25.92707,\r\n-911.0,-24.31503,\r\n-900.0,-23.64334,\r\n-891.0,-22.29997,"
# convert raw data to a table
table = [i.split(',') for i in s.split("\r\n")]
table = [i[:2] for i in table]
# table is like
"""
[['', 'file_05'],
['x data', 'y data'],
['-970.0', '-34.12164'],
['-959.0', '-32.37526'],
['-949.0', '-30.360199'],
...
['-891.0', '-22.29997']]
"""
# save to output.csv file
import csv
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(table)
# load the data rows into a DataFrame, using row 1 as the column names
df = DataFrame(table[2:], columns=table[1][:2])
print(df)
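One thing to watch with the split-based approaches above: every value ends up in the dataframe as a string. A short sketch of converting the columns to numbers, assuming all data rows hold valid floats (the string here is a shortened copy of the one in the question):

```python
import pandas as pd

# Shortened copy of the string from the question.
s = ",file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,"

# Split into rows, then keep only the first two fields of each row.
table = [row.split(",")[:2] for row in s.split("\r\n")]

# Row 1 holds the column names; rows 2 onwards hold the data as strings.
df = pd.DataFrame(table[2:], columns=table[1])

# Convert every column from str to float so arithmetic works as expected.
df = df.astype(float)
```

If some cells might not parse as numbers, pd.to_numeric with errors="coerce" applied per column is a more forgiving alternative.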

How to split CSV column data into two colums in Python

I have the following code (below) that grabs two CSV files and merges their data into one consolidated CSV file.
I now need to grab specific information from one of the columns and add that information to another column.
What I have now is one output.csv file with the following sample data:
ID,Name,Flavor,RAM,Disk,VCPUs
45fc754d-6a9b-4bde-b7ad-be91ae60f582,customer1-test1-dns,m1.medium,4096,40,2
83dbc739-e436-4c9f-a561-c5b40a3a6da5,customer2-test2,m1.tiny,128,1,1
ef68fcf3-f624-416d-a59b-bb8f1aa2a769,customer3-test3-dns-api,m1.medium,4096,40,2
What I need to do is open this CSV file and split the data in the Name column across two columns as follows:
ID,Name,Flavor,RAM,Disk,VCPUs,Customer,Misc
45fc754d-6a9b-4bde-b7ad-be91ae60f582,customer1-test1-dns,m1.medium,4096,40,2,customer1,test1-dns
83dbc739-e436-4c9f-a561-c5b40a3a6da5,customer2-test2,m1.tiny,128,1,1,customer2,test2
ef68fcf3-f624-416d-a59b-bb8f1aa2a769,customer3-test3-dns-api,m1.medium,4096,40,2,customer3,test3-dns-api
Note how the Misc column can have multiple values separated by one or more - characters.
How can I accomplish this in Python? Below is the code I have now:
import csv
import os
import pandas as pd
by_name = {}
with open('flavor.csv') as b:
    for row in csv.DictReader(b):
        name = row.pop('Name')
        by_name[name] = row
with open('output.csv', 'w') as c:
    w = csv.DictWriter(c, ['ID', 'Name', 'Flavor', 'RAM', 'Disk', 'VCPUs'])
    w.writeheader()
    with open('instance.csv') as a:
        for row in csv.DictReader(a):
            try:
                match = by_name[row['Flavor']]
            except KeyError:
                continue
            row.update(match)
            w.writerow(row)
Try this:
import pandas as pd
df = pd.read_csv('flavor.csv')
df[['Customer','Misc']] = df.Name.str.split('-', n=1, expand=True)
df
Output:
ID Name Flavor RAM Disk VCPUs Customer Misc
0 45fc754d-6a9b-4bde-b7ad-be91ae60f582 customer1-test1-dns m1.medium 4096 40 2 customer1 test1-dns
1 83dbc739-e436-4c9f-a561-c5b40a3a6da5 customer2-test2 m1.tiny 128 1 1 customer2 test2
2 ef68fcf3-f624-416d-a59b-bb8f1aa2a769 customer3-test3-dns-api m1.medium 4096 40 2 customer3 test3-dns-api
I would recommend switching over to pandas; the official Getting Started documentation is a good place to begin.
Let's first read in the csv.
import pandas as pd
df = pd.read_csv('input.csv')
print(df.head(1))
You should get something similar to:
ID Name Flavor RAM Disk VCPUs
0 45fc754d-6a9b-4bde-b7ad-be91ae60f582 customer1-test1-dns m1.medium 4096 40 2
After that, use string manipulation in the Pandas Series:
df[['Customer','Misc']] = df.Name.str.split('-', n=1, expand=True)
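Finally, you can save the CSV. Passing index=False keeps the row index out of the file, which matches the desired output.
df.to_csv('output.csv', index=False)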
This code would be much simpler and more elegant if you used pandas.
import pandas as pd
df = pd.read_csv('flavor.csv')
df[['Customer','Misc']] = df['Name'].str.split(pat='-',n=1,expand=True)
df.to_csv('output.csv',index=False)
Here is how I do it. The trick is in the split() method:
import pandas as pd
file = pd.read_csv(r"C:\...\yourfile.csv", sep=",")
file['Customer'] = None
file['Misc'] = None
for x in range(len(file)):
    # split on the first '-' only, so the rest of the name stays together
    temp = file.loc[x, 'Name'].split('-', maxsplit=1)
    # assign through .loc rather than chained indexing
    file.loc[x, 'Customer'] = temp[0]
    file.loc[x, 'Misc'] = temp[1]
file.to_csv(r"C:\...\yourfile_result.csv")
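If you would rather stay with the csv module from the question, the same split can be sketched with str.partition, which also copes with a name that contains no hyphen. The in-memory input below is a stand-in for output.csv; the second row's name is a made-up edge case, not data from the question.

```python
import csv
import io

# Stand-in for output.csv; the second Name has no hyphen (assumed edge case).
source = io.StringIO(
    "ID,Name,Flavor,RAM,Disk,VCPUs\n"
    "45fc754d-6a9b-4bde-b7ad-be91ae60f582,customer1-test1-dns,m1.medium,4096,40,2\n"
    "83dbc739-e436-4c9f-a561-c5b40a3a6da5,customer2,m1.tiny,128,1,1\n"
)

reader = csv.DictReader(source)
target = io.StringIO()
writer = csv.DictWriter(target, fieldnames=reader.fieldnames + ["Customer", "Misc"])
writer.writeheader()

rows = []
for row in reader:
    # partition splits at the first '-' only; misc is '' when there is none
    customer, _, misc = row["Name"].partition("-")
    row["Customer"] = customer
    row["Misc"] = misc
    writer.writerow(row)
    rows.append(row)
```

With real files, replace the two StringIO objects with open(..., newline="") handles for the input and output paths.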

Read text file of protein sequences in python

I am trying to read DNA sequences into a pandas DataFrame, but I am not getting the whole sequence in the DataFrame column.
I have tried the built-in open() function and a simple read_csv call, but neither helped much.
pd.read_csv('../input/data 1/non-cpp.txt', index_col=0, header=None)
Output:
0
>
GNNRPVYIPQPRPPHPRI
>
HGVSGHGQHGVHG
>
myfile = open("../input/data 1/non-cpp.txt")
for line in myfile:
    print(line)
myfile.close()
Output:
>
GNNRPVYIPQPRPPHPRI
>
HGVSGHGQHGVHG
>
QRFSQPTFKLPQGRLTLSRKF
>
FLPVLAGIAAKVVPALFCKITKKC
DataSet Source
Each record in the file consists of two lines:
Label of sequence
Long sequence (string)
I need the labels (the first line of each record) in one column and the whole sequence (the second line) in a second column, e.g.
Label, Sequence
This is rough and not a one-liner, but it will give you what you need: a Series with the DNA sequences.
import pandas as pd
data = pd.read_csv('cpp.txt', sep=">",header=None)
data[0].dropna()
I hope it helps
Let's say your file is something like:
>a1|b1|c1
a111
>a2|b2|c2
a222
>a3|b3|c3
a333
Note that here we have 6 lines.
Then, you can read the file, and store the data:
import pandas as pd
with open('filename.txt', 'r') as f:
    content = f.readlines()
n = len(content)
label = [content[i].strip() for i in range(0,n,2)]
seq = [content[i].strip() for i in range(1,n,2)]
df = pd.DataFrame({'label':label,
                   'sequence':seq})
and you get a pandas dataframe:
label sequence
0 >a1|b1|c1 a111
1 >a2|b2|c2 a222
2 >a3|b3|c3 a333
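The even/odd indexing above relies on every sequence sitting on exactly one line. If a sequence can wrap over several lines, a sketch along these lines may be more robust: treat each line starting with ">" as the start of a new record and join everything up to the next ">". The function name read_fasta_like and the sample text are assumptions for illustration.

```python
import io

import pandas as pd


def read_fasta_like(handle):
    """Collect (label, sequence) pairs; a line starting with '>' opens a new record."""
    records = []
    label, chunks = None, []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # close the previous record before starting a new one
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:], []
        elif label is not None:
            # sequence lines before the first '>' are ignored
            chunks.append(line)
    if label is not None:
        records.append((label, "".join(chunks)))
    return pd.DataFrame(records, columns=["label", "sequence"])


# Sample input: the first record's sequence spans two lines.
sample = io.StringIO(">a1|b1|c1\na111\na112\n>a2|b2|c2\na222\n")
df = read_fasta_like(sample)
```

For a file on disk, pass an open file handle instead of the StringIO object. Labels that are just ">" with nothing after it, as in the question's sample, come out as empty strings.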

How do I add a column header in the second row of a pandas dataframe?

I have a data frame from pandas and now I want to add column names, but only for the second row. Here is an example of my previous output:
Desired output:
My code:
import pandas as pd
data_line=open("file1.txt", mode="r")
lines=[]
for line in data_line:
    lines.append(line)
for i, line in enumerate(lines):
    # print('{}={}'.format(i+1, line.strip()))
    pass
file1_header=lines[0]
num_line=1
Dictionary_File1={}
# note: data_type is not defined anywhere in the posted snippet
Value_File1= data_type[0:6]
Value_File1_short=[]
i=1
for element in Value_File1:
    type=element.split(',')
    Value_File1_short.append(type[0] + ", " + type[1] + ", " + type[4])
    i += 1
Dictionary_File1[ file1_header]=Value_File1_short
pd_file1=pd.DataFrame.from_dict(Dictionary_File1)
You should have a look at pandas.read_csv. The header keyword parameter lets you indicate which line in the file to use for the column names.
You could probably do it with something like:
pd.read_csv("file1.txt", header=1)
From my Python shell I tested it out with:
>>> from io import StringIO  # I use python3
>>> import pandas as pd
>>> data = """Type Type2 Type3
... A B C
... 1 2 3
... red blue green"""
>>> # StringIO below allows us to use "data" as input to read_csv
>>> # "sep" keyword is used to indicate how columns are separated in data
>>> df = pd.read_csv(StringIO(data), header=1, sep=r'\s+')
>>> df
A B C
0 1 2 3
1 red blue green
You can write a row using the csv module before writing your dataframe to the same file. Note that this won't help when reading the file back into pandas, which doesn't work with "duplicate headers". You can create MultiIndex columns, but this isn't necessary for your desired output.
import pandas as pd
import csv
from io import StringIO
# input file
x = """A,B,C
1,2,3
red,blue,green"""
# replace StringIO(x) with 'file.txt'
df = pd.read_csv(StringIO(x))
with open('file.txt', 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(['Type', 'Type2', 'Type3'])
    df.to_csv(fout, index=False)
# read file to check output is correct
df = pd.read_csv('file.txt')
print(df)
# Type Type2 Type3
# 0 A B C
# 1 1 2 3
# 2 red blue green
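For completeness, here is a rough sketch of the MultiIndex alternative mentioned above, in case you want the type labels to live in the dataframe itself rather than only in the file. The types list is an assumption standing in for whatever labels you have, one per existing column.

```python
import io

import pandas as pd

x = """A,B,C
1,2,3
red,blue,green"""

df = pd.read_csv(io.StringIO(x))

# Assumed type labels, one per existing column.
types = ["Type", "Type2", "Type3"]

# Stack the type labels above the existing column names as a second header level.
df.columns = pd.MultiIndex.from_arrays([types, df.columns])
```

Columns are then addressed with tuples, for example df[("Type", "A")].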
So if I understand properly, you have a file "file.txt" containing your data, and a list containing the types of your data.
You want to add the list of types to the pandas.DataFrame of your data. Correct?
If so, you can read the data from the txt file into a pandas.DataFrame using pandas.read_csv(), and then define the column headers using df.columns.
So it would look something like:
df = pd.read_csv("file1.txt", header=None)
df.columns = data_type[0:6]
I hope this helps!
Cheers

How to read and write to CSV Files in Python

I have a CSV file with only a single column, which acts as my input.
I use that input to compute my outputs. There are multiple outputs, and I need to write them to another CSV file.
Can anyone please suggest how to do this?
Here is the code:
import urllib.request
jd = {input 1}
//
Some Codes to find output - a,b,c,d,e
//
** Code to write output to a csv file.
** Repeat the code with next input of input csv file.
Input CSV File has only a single column and is represented below:
1
2
3
4
5
The output would go in a separate CSV file, in the format given below.
It would have multiple rows and multiple columns.
a b c d e
Here is a simple example:
data.csv is a CSV file with one column and multiple rows.
results.csv contains the mean and median of the input; it is a CSV file with 1 row and 2 columns (mean in the 1st column, median in the 2nd).
Example:
import numpy as np
import pandas as pd
import csv
# load the data
data = pd.read_csv("data.csv", header=None)
# calculate the statistics for the 1st column, which holds the data
calculate_mean = np.mean(data.loc[:, 0])
calculate_median = np.median(data.loc[:, 0])
# one row: mean in the 1st column, median in the 2nd
row = [calculate_mean, calculate_median]
# write results to csv (text mode with newline="" for the csv module in Python 3)
with open("results.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(row)
In pseudo code, you'll do something like this:
for each_file in a_folder_that_contains_csv:  # go through all the `inputs` - csv files
    with open(each_file) as csv_file, open(other_file) as output_file:  # open each csv file, and a new csv file
        process_the_input_from_each_csv  # process the data you read from the csv_file
        export_to_output_file  # export the data to the new csv file
Now, I won't write a full-working example because it's better for you to start digging and ask specific questions when you have some. You're now just asking: write this for me because I don't know Python.
The official Python documentation covers both the csv module and the os module.
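As a starting point, the loop in the question (read one input value, compute several outputs, write them as one row) might be sketched as below. compute_outputs is a made-up placeholder for the real processing step, and the StringIO objects stand in for the input and output files.

```python
import csv
import io


def compute_outputs(value):
    """Hypothetical placeholder for the real processing step."""
    n = int(value)
    return [n, n * 2, n * 3, n * 4, n * 5]


# In-memory stand-ins for the input and output CSV files.
input_file = io.StringIO("1\n2\n3\n")
output_file = io.StringIO()

reader = csv.reader(input_file)
writer = csv.writer(output_file)
writer.writerow(["a", "b", "c", "d", "e"])  # header row

for record in reader:
    if not record:
        continue  # skip empty lines
    # one input value in, one output row with five columns out
    writer.writerow(compute_outputs(record[0]))

result_lines = output_file.getvalue().splitlines()
```

With real files, use open("input.csv", newline="") and open("output.csv", "w", newline="") in a with statement instead of the StringIO objects; the file names here are only examples.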
I think you need read_csv to read the file into a Series, and to_csv to write each output DataFrame to a file while looping with Series.items.
#file content
1
3
5
import numpy as np
import pandas as pd
# squeeze('columns') replaces the squeeze=True argument, which was removed in pandas 2.0
s = pd.read_csv('file', names=['a']).squeeze('columns')
print (s)
0    1
1    3
2    5
Name: a, dtype: int64
# Series.items replaces Series.iteritems, which was removed in pandas 2.0
for i, val in s.items():
    #print (val)
    #some operation with scalar value val
    df = pd.DataFrame({'a':np.arange(val)})
    df['a'] = df['a'] * 10
    print (df)
    #write to csv, file name by val
    df.to_csv(str(val) + '.csv', index=False)
a
0 0
a
0 0
1 10
2 20
a
0 0
1 10
2 20
3 30
4 40
