I have a list of tuples. Each tuple contains two values, together with the result of an operation between the two values. Here is an example:
my_list = [(1,1,1.0), (1,2,0.8), (1,3,0.3), (2,1,0.8), (2,2,1.0), (2,3,0.5), (3,1,0.3), (3,2,0.5), (3,3,1.0)]
I need to store these values in a CSV file so that they look like this:
0 1 2 3
1 1 0.8 0.3
2 0.8 1 0.5
3 0.3 0.5 1
In other words, I need to start a new row every time the first number of the tuple changes.
This is the function I am currently using, which writes each tuple in a new row (not what I want):
import csv

def write_csv(my_list, fname=''):
    with open(fname, mode='a+', newline='') as f:
        f_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for x in my_list:
            f_writer.writerow([str(x[0]), str(x[1]), str(x[2])])
Any suggestions on how to modify it (or rewrite it from scratch)?
You could use a combination of the NumPy and pandas libraries.
import pandas as pd
import numpy as np
my_list = [(1,1,1.0), (1,2,0.8), (1,3,0.3), (2,1,0.8), (2,2,1.0), (2,3,0.5), (3,1,0.3), (3,2,0.5), (3,3,1.0)]
new_list = [cell[2] for cell in my_list] # Extract cell values
np_array = np.array(new_list).reshape(3,3) # Create 3x3 matrix
df = pd.DataFrame(np_array) # Create a dataframe
df.to_csv("test.csv") # Write to a csv
For clarity, the dataframe will look like:
df
0 1 2
0 1.0 0.8 0.3
1 0.8 1.0 0.5
2 0.3 0.5 1.0
And the csv file will look like:
,0,1,2
0,1.0,0.8,0.3
1,0.8,1.0,0.5
2,0.3,0.5,1.0
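If you would rather stay with the csv module from your original function, you can group the tuples by their first element. Below is a sketch using itertools.groupby; it assumes my_list is sorted by the first value of each tuple, as in your example, and the filename out.csv is just a placeholder:

```python
import csv
from itertools import groupby

my_list = [(1,1,1.0), (1,2,0.8), (1,3,0.3), (2,1,0.8), (2,2,1.0), (2,3,0.5), (3,1,0.3), (3,2,0.5), (3,3,1.0)]

def write_csv(my_list, fname='out.csv'):
    # collect the distinct second values to build the header row: 0 1 2 3
    cols = sorted({b for _, b, _ in my_list})
    with open(fname, mode='w', newline='') as f:
        f_writer = csv.writer(f, delimiter=',')
        f_writer.writerow([0] + cols)
        # start a new row every time the first element of the tuple changes
        for a, group in groupby(my_list, key=lambda t: t[0]):
            f_writer.writerow([a] + [t[2] for t in group])

write_csv(my_list)
```

groupby only merges consecutive items with the same key, which is why the sorted-input assumption matters here.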
Related
I want to read a text file with the values of a matrix. Let's say you have a .txt file looking like this:
0 0 4.0
0 1 5.2
0 2 2.1
1 0 2.1
1 1 2.9
1 2 3.1
Here, the first column gives the indices of the matrix on the x-axis and the second column gives the indices on the y-axis. The third column is the value at that position in the matrix. Missing entries are simply zero.
I am well aware of the fact that data formats like the .mtx format exist, but I would like to create a scipy sparse matrix or numpy array from this txt file alone, instead of adjusting it to the .mtx file format. Is there a Python function out there that does this for me and that I am missing?
import numpy

with open('filename.txt', 'r') as f:
    lines = f.readlines()

data = [line.split() for line in lines]
z = list(zip(*data))
row_indices = list(map(int, z[0]))
column_indices = list(map(int, z[1]))
values = list(map(float, z[2]))
m = max(row_indices) + 1
n = max(column_indices) + 1
p = max([m, n])
A = numpy.zeros((p, p))
A[row_indices, column_indices] = values
print(A)
If you instead want the maximum of column 1 to give the number of rows and the maximum of column 2 to give the number of columns, you can remove p = max([m,n]) and replace A = numpy.zeros((p,p)) with A = numpy.zeros((m,n)).
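Since the question explicitly mentions a scipy sparse matrix: the same three lists can be fed straight to scipy.sparse.coo_matrix, which also treats missing entries as zero. A sketch with the sample values inlined; in practice you would parse them from the file as above:

```python
import numpy as np
from scipy.sparse import coo_matrix

# index/value lists as parsed from the sample .txt file
row_indices = [0, 0, 0, 1, 1, 1]
column_indices = [0, 1, 2, 0, 1, 2]
values = [4.0, 5.2, 2.1, 2.1, 2.9, 3.1]

# shape is inferred from the maximum indices; pass shape=(m, n) to force it
A = coo_matrix((values, (row_indices, column_indices)))
dense = A.toarray()  # convert to a plain numpy array if needed
```

The COO format is convenient for construction from index/value triplets; convert with A.tocsr() if you need fast row slicing or arithmetic afterwards.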
Starting from the array (a) sorted on the first column (major) and second (minor) as in your example, you can reshape:
import numpy as np

# a = np.loadtxt('filename.txt')
x = len(np.unique(a[:, 0]))
y = len(np.unique(a[:, 1]))
a[:, 2].reshape(x, y).T
Output:
array([[4. , 2.1],
[5.2, 2.9],
[2.1, 3.1]])
How can I convert the text data below into a pandas DataFrame?
(-9.83334315,-5.92063135,-7.83228037,5.55314146), (-5.53137301,-8.31010785,-3.28062536,-6.86067081),
(-11.49239039,-1.68053601,-4.14773043,-3.54143976), (-22.25802006,-10.12843806,-2.9688831,-2.70574665), (-20.3418791,-9.4157625,-3.348587,-7.65474665)
I want to convert this to a DataFrame with 4 rows and 5 columns. For example, the first row contains the first element of each parenthesis.
Thanks for your contribution.
Try this:
import pandas as pd

with open("file.txt") as f:
    file = f.read()

df = pd.DataFrame([{f"name{id}": val.replace("(", "").replace(")", "") for id, val in enumerate(row.split(",")) if val} for row in file.split()])
import re
import pandas as pd

with open('file.txt') as f:
    data = [re.findall(r'([\-\d.]+)', line) for line in f.readlines()]

df = pd.DataFrame(data).T.astype(float)
Output:
0 1 2 3 4
0 -9.833343 -5.531373 -11.492390 -22.258020 -20.341879
1 -5.920631 -8.310108 -1.680536 -10.128438 -9.415762
2 -7.832280 -3.280625 -4.147730 -2.968883 -3.348587
3 5.553141 -6.860671 -3.541440 -2.705747 -7.654747
Your data is basically a tuple of tuples, so you can simply pass a list of tuples instead of a tuple of tuples and get a DataFrame out of it.
Your Sample Data:
text_data = ((-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665))
Result:
As you see, pandas by default displays up to 6 decimal places while your values have more, so you can set pd.options.display.float_format accordingly.
pd.options.display.float_format = '{:,.8f}'.format
To get your desired layout, simply transpose the DataFrame:
pd.DataFrame(list(text_data)).T
0 1 2 3 4
0 -9.83334315 -5.53137301 -11.49239039 -22.25802006 -20.34187910
1 -5.92063135 -8.31010785 -1.68053601 -10.12843806 -9.41576250
2 -7.83228037 -3.28062536 -4.14773043 -2.96888310 -3.34858700
3 5.55314146 -6.86067081 -3.54143976 -2.70574665 -7.65474665
OR
Alternatively, you can use the below as well, creating a DataFrame directly from the tuples.
data = (-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665)
# data = [(-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665)]
pd.DataFrame(data).T
0 1 2 3 4
0 -9.83334315 -5.53137301 -11.49239039 -22.25802006 -20.34187910
1 -5.92063135 -8.31010785 -1.68053601 -10.12843806 -9.41576250
2 -7.83228037 -3.28062536 -4.14773043 -2.96888310 -3.34858700
3 5.55314146 -6.86067081 -3.54143976 -2.70574665 -7.65474665
Wrap the tuples in a list:
data=[(-9.83334315,-5.92063135,-7.83228037,5.55314146),
(-5.53137301,-8.31010785,-3.28062536,-6.86067081),
(-11.49239039,-1.68053601,-4.14773043,-3.54143976),
(-22.25802006,-10.12843806,-2.9688831,-2.70574665),
(-20.3418791,-9.4157625,-3.348587,-7.65474665)]
df=pd.DataFrame(data, columns=['A','B','C','D'])
print(df)
output:
A B C D
0 -9.833343 -5.920631 -7.832280 5.553141
1 -5.531373 -8.310108 -3.280625 -6.860671
2 -11.492390 -1.680536 -4.147730 -3.541440
3 -22.258020 -10.128438 -2.968883 -2.705747
4 -20.341879 -9.415762 -3.348587 -7.654747
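Since the text is already valid Python tuple syntax, another option is ast.literal_eval: wrapping the whole text in brackets turns it into a single list literal (the brackets also make the embedded newline legal). A sketch with the text inlined; in practice you would read it with f.read():

```python
import ast
import pandas as pd

text = """(-9.83334315,-5.92063135,-7.83228037,5.55314146), (-5.53137301,-8.31010785,-3.28062536,-6.86067081),
(-11.49239039,-1.68053601,-4.14773043,-3.54143976), (-22.25802006,-10.12843806,-2.9688831,-2.70574665), (-20.3418791,-9.4157625,-3.348587,-7.65474665)"""

tuples = ast.literal_eval('[' + text + ']')  # list of five 4-tuples
df = pd.DataFrame(tuples).T  # transpose: 4 rows x 5 columns
```

Unlike eval, ast.literal_eval only accepts Python literals, so it is safe to run on untrusted file contents.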
I have a data frame (a .txt from R) that looks like this:
my_sample my_coord1 my_coord2 my_cl
A 0.34 0.12 1
B 0.2 1.11 1
C 0.23 0.10 1
D 0.9 0.34 2
E 0.21 0.6 2
... ... ... ...
Using Python, I would like to extract columns 2 and 3 into one variable and column 4 into another variable. In R this is: my_var1 = mydf[,c(2:3)] and my_var2 = mydf[,4]. I don't know how to do this in Python, but I tried:
mydf = open("mydf.txt", "r")
print(mydf.read())
lines = mydf.readlines()
for line in lines:
    sline = line.split(' ')
    print(sline)
mydf.close()
But I don't know how to save into a variable each subsetting.
I know it seems a quite simple question but I'm a newbie in the field.
Thank you in advance
You can use read_table from pandas to deal with tabular data files. The code:
import pandas as pd
mydf = pd.read_table('mydf.txt',delim_whitespace = True)
my_var1 = mydf[['my_coord1','my_coord2']]
my_var2 = mydf['my_cl']
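If you would rather select by position, exactly like R's mydf[,c(2:3)] and mydf[,4], pandas iloc takes integer positions (remember R is 1-based and Python is 0-based). A sketch with the sample rows inlined via io.StringIO instead of the real file:

```python
import io
import pandas as pd

# sample rows from the question; in practice: pd.read_csv('mydf.txt', sep=r'\s+')
txt = """my_sample my_coord1 my_coord2 my_cl
A 0.34 0.12 1
B 0.2 1.11 1
C 0.23 0.10 1
D 0.9 0.34 2
E 0.21 0.6 2"""
mydf = pd.read_csv(io.StringIO(txt), sep=r'\s+')

my_var1 = mydf.iloc[:, 1:3]  # columns 2 and 3, like mydf[,c(2:3)] in R
my_var2 = mydf.iloc[:, 3]    # column 4, like mydf[,4] in R
```

Selecting by name (as in the answer above) is more robust to column reordering; iloc is closer to the R idiom you quoted.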
This is the code I've tried. SS of the CSV file and the error are attached below.
import pandas as pd
import matplotlib.pyplot as plt

def displayGraph():
    with open('seats.csv', newline='') as csvfile:
        s = pd.read_csv(csvfile)
    x = 0
    y = 0
    for p in range(48):
        if s[p] == 0:
            x += 1
    for g in range(48):
        if s[g] == 1:
            y += 1
    a = x/(x+y) * 100
    b = y/(x+y) * 100
    graph = [a, b]
    plt.pie(graph, labels=['Empty Seats', 'Booked Seats'])
https://imgur.com/XCMJMal Error (part 1)
https://imgur.com/UsJmrVb Error (part 2)
https://imgur.com/a/6WbeRGV CSV File
Edit
CSV file provided below in text format:
0,0,1,1,0,1,1,1
0,1,1,0,0,1,0,1
1,0,0,1,0,1,1,0
0,1,1,1,0,0,0,1
0,0,1,1,0,1,0,0
1,1,1,1,0,0,1,1
The csv module and pandas are two ways of processing CSV files, but apart from that they are unrelated and are used differently. Here you have loaded your file with pandas but used it as if you had used a csv.reader. You should choose one.
csv module way
import csv

with open('seats.csv', newline='') as csvfile:
    s = csv.reader(csvfile)
    x = 0
    y = 0
    for row in s:
        for val in row:
            if val == '1':  # the csv module reads values as strings...
                y += 1
            else:
                x += 1

a = x/(x+y) * 100
b = y/(x+y) * 100
graph = [a, b]
plt.pie(graph, labels=['Empty Seats', 'Booked Seats'])
Pandas way (more magic here...)
df = pd.read_csv('seats.csv', header = None)
y = df.sum().sum()
x = len(df) * len(df.columns) - y
a=x/(x+y) *100
b=y/(x+y) *100
graph=[a,b]
plt.pie(graph, labels=['Empty Seats', 'Booked Seats'])
Have you tried actually looking at the content of the DataFrame?
You access DataFrame like this in Python:
table.csv:
A | B | C
-----------------
0.1 | 0.2 | 0.3
------------------
0.4 | 0.5 | 0.6
------------------
script:
import pandas as pd

df = pd.read_csv("table.csv")
print(list(df["A"]))
OUTPUT
[0.1,0.4]
Your csv does not have any column names, so you cannot just do s[p], which is why it throws KeyError: 0 (a column called "0" does not exist in your csv).
You need to use iloc:
(if the p is referring to the column index)
s.iloc[:,p]
or if p is referring to the row index
s.iloc[p,:]
I have a file with some data that looks like
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
I can process this data and do math on it just fine:
import sys
import numpy as np
import pandas as pd
def main():
    if len(sys.argv) != 2:
        print("Takes one filename as argument")
        sys.exit()
    file_name = sys.argv[1]
    data = pd.read_csv(file_name, sep=" ", header=None)
    data.columns = ["timestep", "mux", "muy", "muz"]
    t = data["timestep"].count()
    c = np.zeros(t)
    for i in range(0, t):
        for j in range(0, i+1):
            c[i-j] += data["mux"][i-j] * data["mux"][i]
            c[i-j] += data["muy"][i-j] * data["muy"][i]
            c[i-j] += data["muz"][i-j] * data["muz"][i]
    for i in range(t):
        print(c[i]/(t-i))
The expected result for my sample input above is
42.5
62.0
84.5
110.0
This math is finding the time correlation function for my data, which is the time-average of all permutations of the pairs of products in each column.
I would like to generalize this program to
work on n number of columns (in the i/j loop for example), and
be able to read in the column names from the file, so as to not have them hard-coded in
Which numpy or pandas methods can I use to accomplish this?
We can reduce it to one loop by using array slicing and the sum ufunc along the rows of the dataframe, which in the process makes it generic to any number of columns, like so:
a = data.values
t = data["timestep"].count()
c = np.zeros(t)
for i in range(t):
    c[:i+1] += (a[:i+1, 1:] * a[i, 1:]).sum(axis=1)
Explanation
1) a[:i+1,1:] is the slice of all rows up to and including the i-th row, and all columns from the second column onward, i.e. mux, muy and so on.
2) Similarly, a[i,1:] is the i-th row and all columns from the second column onward.
To keep it "pandas-way", simply replace a[ with data.iloc[.
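For the second part of the question, reading the column names from the file instead of hard-coding them: if the file carries a header line, read_csv picks the names up automatically. The header line below is hypothetical, since the sample file has none; combined with the one-loop version above:

```python
import io
import numpy as np
import pandas as pd

# hypothetical file contents with a header line naming the columns
txt = """timestep mux muy muz
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7"""
data = pd.read_csv(io.StringIO(txt), sep=" ")  # the header row is used by default

a = data.values
t = len(data)
c = np.zeros(t)
for i in range(t):
    c[:i+1] += (a[:i+1, 1:] * a[i, 1:]).sum(axis=1)

result = [c[i] / (t - i) for i in range(t)]
```

data.columns now holds whatever names the file declares, so nothing needs to be hard-coded, and the slice a[:, 1:] covers however many value columns follow the timestep.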