plot specific columns from a text file - python

If I have a text file, data.txt, which contains many columns, how can I read this file in Python and plot only two chosen columns?
For example:
10 -22.82215289 0.11s
12 -22.81978265 0.14s
15 -22.82359691 0.14s
20 -22.82464363 0.16s
25 -22.82615348 0.17s
30 -22.82641815 0.19s
35 -22.82649347 0.21s
40 -22.82655376 0.22s
50 -22.82661407 0.28s
60 -22.82663535 0.34s
70 -22.82664864 0.42s
80 -22.82665962 0.46s
90 -22.82666308 0.51s
100 -22.82666662 0.56s
and I need to plot only the first and second columns.
Note the space before the first column.
Edit
I used the following code:
import matplotlib.pyplot as plt
from matplotlib import rcParamsDefault
import numpy as np
plt.rcParams["figure.dpi"]=150
plt.rcParams["figure.facecolor"]="white"
x, y = np.loadtxt('./calc.dat', delimiter=' ')
plt.plot(x, y, "o-", markersize=5, label='Etot')
plt.xlabel('ecut')
plt.ylabel('Etot')
plt.legend(frameon=False)
plt.savefig("fig.png")
but I have to modify my data to contain only two columns that I need to plot without any spaces before the first column, as follows
10 -22.82215289
12 -22.81978265
15 -22.82359691
20 -22.82464363
25 -22.82615348
30 -22.82641815
35 -22.82649347
40 -22.82655376
50 -22.82661407
60 -22.82663535
70 -22.82664864
80 -22.82665962
90 -22.82666308
100 -22.82666662
So, how can I modify the code so that I do not have to modify the data every time?

You can create a DataFrame from a text file using pandas read_csv, which can simplify future processing of the data, besides plotting it.
In this case, the tricky part is the whitespace, which can be handled by setting the optional parameter sep to r'\s+':
import pandas as pd
df = pd.read_csv('data.txt', sep=r'\s+', header=None, names=['foo', 'bar', 'baz'])
>>> df
    foo           bar    baz
0    10  -22.82215289  0.11s
1    12  -22.81978265  0.14s
2    15  -22.82359691  0.14s
3    20  -22.82464363  0.16s
4    25  -22.82615348  0.17s
5    30  -22.82641815  0.19s
6    35  -22.82649347  0.21s
7    40  -22.82655376  0.22s
8    50  -22.82661407  0.28s
9    60  -22.82663535  0.34s
10   70  -22.82664864  0.42s
11   80  -22.82665962  0.46s
12   90  -22.82666308  0.51s
13  100  -22.82666662  0.56s
And then just your code:
plt.rcParams["figure.dpi"]=150
plt.rcParams["figure.facecolor"]="white"
plt.plot(df['foo'], df['bar'], "o-", markersize=5, label='Etot')
plt.xlabel('ecut')
plt.ylabel('Etot')
plt.legend(frameon=False)
plt.savefig("fig.png")
I set the names of the columns to arbitrary strings. You can avoid that by omitting the names argument and just referring to the columns as df[0], df[1].
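If you would rather stay with NumPy, as in the question, here is a minimal sketch. It assumes the columns are separated only by whitespace, as in the sample. The inline text stands in for data.txt, and the variable names are just for illustration.

```python
import io
import numpy as np

# Inline stand-in for data.txt, including the leading space on each line
sample = """ 10 -22.82215289 0.11s
 12 -22.81978265 0.14s
 15 -22.82359691 0.14s
"""

# With no delimiter argument, loadtxt splits on any run of whitespace,
# so the leading space is ignored; usecols leaves out the "0.11s" column,
# which could not be parsed as a float anyway.
x, y = np.loadtxt(io.StringIO(sample), usecols=(0, 1), unpack=True)

print(x)
print(y)
```

With a real file you would pass the path instead of the StringIO object, and x and y can go straight into plt.plot.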

You could first read your file data.txt and preprocess it by stripping the whitespace on the left of each line, then save the preprocessed data to data_processed.txt. After that, load it with pd.read_csv and plot the two columns of choice, col1 and col2, against each other with plt.plot, as follows:
import pandas as pd
import matplotlib.pyplot as plt
s = """ 10 -22.82215289 0.11s
12 -22.81978265 0.14s
15 -22.82359691 0.14s
20 -22.82464363 0.16s
25 -22.82615348 0.17s
30 -22.82641815 0.19s
35 -22.82649347 0.21s
40 -22.82655376 0.22s
50 -22.82661407 0.28s
60 -22.82663535 0.34s
70 -22.82664864 0.42s
80 -22.82665962 0.46s
90 -22.82666308 0.51s
100 -22.82666662 0.56s"""
with open('data.txt', 'w') as f:
    f.write(s)
with open('data.txt', 'r') as f:
    data = f.read()
data_processed = '\n'.join([l.lstrip() for l in data.split('\n')])
with open('data_processed.txt', 'w') as f:
    f.write(data_processed)
df = pd.read_csv('data_processed.txt', sep=' ', header=None)
col1 = 0
col2 = 1
plt.plot(df[col1], df[col2]);
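A variation on the same idea, sketched under the assumption that the file is small enough to hold in memory: the stripped text can be handed to pd.read_csv through io.StringIO, so no intermediate file is written. The raw string below is a stand-in for the contents of data.txt.

```python
import io
import pandas as pd

# Stand-in for the text read from data.txt (note the leading spaces)
raw = " 10 -22.82215289 0.11s\n 12 -22.81978265 0.14s\n 15 -22.82359691 0.14s\n"

# Strip the leading whitespace of every line, as in the answer above
cleaned = "\n".join(line.lstrip() for line in raw.splitlines())

# Parse the cleaned text directly from memory instead of a second file
df = pd.read_csv(io.StringIO(cleaned), sep=" ", header=None)

col1, col2 = 0, 1
print(df[[col1, col2]])
```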

Related

Sort rows of curve shaped data in python

I have a dataset that consists of 5 rows that are formed like a curve. I want to separate the inner row from the other or if possible each row and store them in a separate array. Is there any way to do this, like somehow flatten the curved data and sorting it afterwards based on the x and y values?
I would like to assign each row from left to right numbers from 0 to the max of the row. Right now the labels for each dot are not useful for me and I can't change the labels.
Here are the first 50 data points of my data set:
x y
0 -6.4165 0.3716
1 -4.0227 2.63
2 -7.206 3.0652
3 -3.2584 -0.0392
4 -0.7565 2.1039
5 -0.0498 -0.5159
6 2.363 1.5329
7 -10.7253 3.4654
8 -8.0621 5.9083
9 -4.6328 5.3028
10 -1.4237 4.8455
11 1.8047 4.2297
12 4.8147 3.6074
13 -5.3504 8.1889
14 -1.7743 7.6165
15 1.1783 6.9698
16 4.3471 6.2411
17 7.4067 5.5988
18 -2.6037 10.4623
19 0.8613 9.7628
20 3.8054 9.0202
21 7.023 8.1962
22 9.9776 7.5563
23 0.1733 12.6547
24 3.7137 11.9097
25 6.4672 10.9363
26 9.6489 10.1246
27 12.5674 9.3369
28 3.2124 14.7492
29 6.4983 13.7562
30 9.2606 12.7241
31 12.4003 11.878
32 15.3578 11.0027
33 6.3128 16.7014
34 9.7676 15.6557
35 12.2103 14.4967
36 15.3182 13.5166
37 18.2495 12.5836
38 9.3947 18.5506
39 12.496 17.2993
40 15.3987 16.2716
41 18.2212 15.1871
42 21.1241 14.0893
43 12.3548 20.2538
44 15.3682 18.9439
45 18.357 17.8862
46 21.0834 16.6258
47 23.9992 15.4145
48 15.3776 21.9402
49 18.3568 20.5803
50 21.1733 19.3041
It seems that your curves follow a pattern, so you could select the curve of interest using slicing. I had to offset the selection slightly to get the five curves, because the first 8 points are not in the same order as the rest of the data. The initial 8 data points are therefore discarded, but they can be added back in afterwards if required.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({ 'x': [-6.4165, -4.0227, -7.206, -3.2584, -0.7565, -0.0498, 2.363, -10.7253, -8.0621, -4.6328, -1.4237, 1.8047, 4.8147, -5.3504, -1.7743, 1.1783, 4.3471, 7.4067, -2.6037, 0.8613, 3.8054, 7.023, 9.9776, 0.1733, 3.7137, 6.4672, 9.6489, 12.5674, 3.2124, 6.4983, 9.2606, 12.4003, 15.3578, 6.3128, 9.7676, 12.2103, 15.3182, 18.2495, 9.3947, 12.496, 15.3987, 18.2212, 21.1241, 12.3548, 15.3682, 18.357, 21.0834, 23.9992, 15.3776, 18.3568, 21.1733],
'y': [0.3716, 2.63, 3.0652, -0.0392, 2.1039, -0.5159, 1.5329, 3.4654, 5.9083, 5.3028, 4.8455, 4.2297, 3.6074, 8.1889, 7.6165, 6.9698, 6.2411, 5.5988, 10.4623, 9.7628, 9.0202, 8.1962, 7.5563, 12.6547, 11.9097, 10.9363, 10.1246, 9.3369, 14.7492, 13.7562, 12.7241, 11.878, 11.0027, 16.7014, 15.6557, 14.4967, 13.5166, 12.5836, 18.5506, 17.2993, 16.2716, 15.1871, 14.0893, 20.2538, 18.9439, 17.8862, 16.6258, 15.4145, 21.9402, 20.5803, 19.3041]})
# Generate the 5 dataframes
df_list = [df.iloc[i+8::5, :] for i in range(5)]
# Generate the plot
fig = plt.figure()
for frame in df_list:
    plt.scatter(frame['x'], frame['y'])
plt.show()
# Print the data of the innermost curve
print(df_list[4])
OUTPUT:
The 5th dataframe df_list[4] contains the data of the innermost plot.
x y
12 4.8147 3.6074
17 7.4067 5.5988
22 9.9776 7.5563
27 12.5674 9.3369
32 15.3578 11.0027
37 18.2495 12.5836
42 21.1241 14.0893
47 23.9992 15.4145
You can then add the missing data like this:
# Retrieve the two missing points of the inner curve
inner_curve = pd.concat([df_list[4], df[5:7]]).sort_index(ascending=True)
print(inner_curve)
# Plot the inner curve only
fig2 = plt.figure()
plt.scatter(inner_curve['x'], inner_curve['y'], color = '#9467BD')
plt.show()
OUTPUT: inner curve
x y
5 -0.0498 -0.5159
6 2.3630 1.5329
12 4.8147 3.6074
17 7.4067 5.5988
22 9.9776 7.5563
27 12.5674 9.3369
32 15.3578 11.0027
37 18.2495 12.5836
42 21.1241 14.0893
47 23.9992 15.4145
Complete Inner Curve
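The same slicing can also be written as a group-by on the row position modulo the number of curves. This is only a sketch on made-up data; offset and n_curves are taken from the answer above (8 and 5), and the names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data with 18 points; the real x/y values are in the answer above
df = pd.DataFrame({"x": np.arange(18.0), "y": np.arange(18.0) * 2})

offset, n_curves = 8, 5          # skip the 8 unordered points, 5 curves
tail = df.iloc[offset:]

# Position within the repeating pattern: 0, 1, 2, 3, 4, 0, 1, ...
curve_id = np.arange(len(tail)) % n_curves

# One DataFrame per curve, keyed by its position in the pattern
curves = {int(k): group for k, group in tail.groupby(curve_id)}

print(curves[4])
```

Here curves[4] holds the same rows as df.iloc[4+8::5], so it corresponds to df_list[4] in the answer.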

How to obtain the first 4 rows for every 20 rows from a CSV file

I've read the CSV file using pandas and have managed to print the 1st, 2nd, 3rd and 4th row for every 20 rows using .iloc.
Prem_results = pd.read_csv("../data sets analysis/prem/result.csv")
Prem_results.iloc[:320:20,:]
Prem_results.iloc[1:320:20,:]
Prem_results.iloc[2:320:20,:]
Prem_results.iloc[3:320:20,:]
Is there a way using iloc to print the first 4 rows of every 20 lines together rather than separately like I do now? Apologies if this is worded badly, I'm fairly new to both Python and pandas.
Using groupby.head:
Prem_results.groupby(np.arange(len(Prem_results)) // 20).head(4)
You can concat slices together like this:
pd.concat([df[i::20] for i in range(4)]).sort_index()
MCVE:
df = pd.DataFrame({'col1':np.arange(1000)})
pd.concat([df[i::20] for i in range(4)]).sort_index().head(20)
Output:
col1
0 0
1 1
2 2
3 3
20 20
21 21
22 22
23 23
40 40
41 41
42 42
43 43
60 60
61 61
62 62
63 63
80 80
81 81
82 82
83 83
Start at 0 get every 20 rows
Start at 1 get every 20 rows
Start at 2 get every 20 rows
And, start at 3 get every 20 rows.
You can also do this while reading the csv itself.
df = pd.DataFrame()
for chunk in pd.read_csv(file_name, chunksize = 20):
    df = pd.concat((df, chunk.head(4)))
More resources:
You can read more about the usage of chunksize in Pandas official documentation here.
I also have a post about its usage here.
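Another way to express "the first 4 rows of every 20" is a boolean mask on the row position. A small sketch on made-up data, where block_size and keep are names chosen here for illustration:

```python
import numpy as np
import pandas as pd

# Made-up frame with 100 rows, similar to the MCVE above
df = pd.DataFrame({"col1": np.arange(100)})

block_size, keep = 20, 4

# True for positions 0-3, 20-23, 40-43, ... within the frame
mask = np.arange(len(df)) % block_size < keep
first_rows = df[mask]

print(first_rows.head(8))
```

This works on row position rather than index labels, so it does not depend on the DataFrame having a default integer index.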

Using if statements to filter data?

Let's say I have an Excel document with the following format. I'm reading said Excel doc with pandas and plotting data using matplotlib and numpy. Everything is great!
Buttttt..... I want more constraints. Now I want to constrain my data so that I can filter for only specific zenith angles and azimuth angles. More specifically: I only want zenith when it is between 30 and 90, and I only want azimuth when it is between 30 and 330.
Air Quality Data
Azimuth Zenith Ozone Amount
230 50 12
0 81 10
70 35 7
110 90 17
270 45 23
330 45 13
345 47 6
175 82 7
220 7 8
This is an example of the sort of constraint I'm looking for.
Air Quality Data
Azimuth Zenith Ozone Amount
230 50 12
70 35 7
110 90 17
270 45 23
330 45 13
175 82 7
The following is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
P_file = file1
out_file = file2
out_file2 = file3
data = pd.read_csv(file1,header=None,sep=' ')
df=pd.DataFrame(data=data)
df.to_csv(file2,sep=',',header = [19 headers. The three that matter for this question are 'DateTime', 'Zenith', 'Azimuth', and 'Ozone Amount'.]
df=pd.read_csv(file2,header='infer')
mask = df[df['DateTime'].str.contains('20141201')] ## In this line I'm sorting for anything containing the locator for the given day.
mask.to_csv(file2) ##I'm now updating file 2 so that it only has the data I want sorted for.
data2 = pd.read_csv(file2,header='infer')
df2=pd.DataFrame(data=data2)
def tojuliandate(date):
return.... ##give a function that changes normal date of format %Y%m%dT%H%M%SZ to julian date format of %y%j
def timeofday(date):
changes %Y%m%dT%H%M%SZ to %H%M%S for more narrow views of data
df2['Time of Day'] = df2['DateTime'].apply(timeofday)
df2.to_csv(file2) ##adds a column for "timeofday" to the file
So basically at this point this is all the code that goes into making the csv I want to sort. How would I go about sorting
'Zenith' and 'Azimuth'
If they met the criteria I specified above?
I know that I will need if statements to do this.
I tried something like this but it didn't work and I was looking for a bit of help:
df[(df["Zenith"]>30) & (df["Zenith"]<90) & (df["Azimuth"]>30) & (df["Azimuth"]<330)]
Basically a duplicate of Efficient way to apply multiple filters to pandas DataFrame or Series
You can use series between:
df[(df['Zenith'].between(30, 90)) & (df['Azimuth'].between(30, 330))]
Yields:
Azimuth Zenith Ozone Amount
0 230 50 12
2 70 35 7
3 110 90 17
4 270 45 23
5 330 45 13
7 175 82 7
Note that by default these upper and lower bounds are inclusive (inclusive='both' in recent pandas versions; older versions used inclusive=True).
You can write only those rows of the dataframe that meet your boundary conditions to your file:
# replace the line df.to_csv(...) in your example with
df[((df['Zenith'] >= 30) & (df['Zenith'] <= 90)) &
   ((df['Azimuth'] >= 30) & (df['Azimuth'] <= 330))].to_csv('my_csv.csv')
Using pd.DataFrame.query:
df_new = df.query('30 <= Zenith <= 90 and 30 <= Azimuth <= 330')
print(df_new)
Azimuth Zenith OzoneAmount
0 230 50 12
2 70 35 7
3 110 90 17
4 270 45 23
5 330 45 13
7 175 82 7
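To see why the bounds matter, here is a small sketch using the sample table from the question. The strict comparisons from the question's attempt drop the rows with Zenith 90 and Azimuth 330, which the expected output keeps; the variable names are just for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Azimuth": [230, 0, 70, 110, 270, 330, 345, 175, 220],
    "Zenith": [50, 81, 35, 90, 45, 45, 47, 82, 7],
    "Ozone Amount": [12, 10, 7, 17, 23, 13, 6, 7, 8],
})

# Inclusive bounds: combine the boolean Series with &, not the keyword "and"
inclusive = df[df["Zenith"].between(30, 90) & df["Azimuth"].between(30, 330)]

# Strict bounds, as in the question's attempt
strict = df[(df["Zenith"] > 30) & (df["Zenith"] < 90)
            & (df["Azimuth"] > 30) & (df["Azimuth"] < 330)]

print(inclusive.index.tolist())  # rows kept with inclusive bounds
print(strict.index.tolist())     # rows kept with strict bounds
```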

Error in using knn for multidimensional data

I am a beginner in Machine Learning and I am trying to classify multi-dimensional data into two classes. Each data point is 40x6 float values. To begin with, I have read my csv file. In this file, the shot number represents a data point.
https://docs.google.com/spreadsheets/d/1tW1xJqnNZa1PhVDAE-ieSVbcdqhT8XfYGy8ErUEY_X4/edit?usp=sharing
Here is the code in python:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plot

from sklearn.neighbors import KNeighborsClassifier

# Read csv data into pandas data frame
data_frame = pd.read_csv('data.csv')

extract_columns = ['LinearAccX', 'LinearAccY', 'LinearAccZ', 'Roll', 'pitch', 'compass']

# Number of sample in one shot
samples_per_shot = 40

# Calculate number of shots in dataframe
count_of_shots = len(data_frame.index)/samples_per_shot

# Initialize Empty data frame
training_index = range(count_of_shots)
training_data_list = []

# flag for backward compatibility
make_old_data_compatible_with_new = 0

if make_old_data_compatible_with_new:
    # Convert 40 shot data to 25 shot data
    # New logic takes 25 samples/shot
    # old logic takes 40 samples/shot
    start_shot_sample_index = 9
    end_shot_sample_index = 34
else:
    # Start index from 1 and continue till lets say 40
    start_shot_sample_index = 1
    end_shot_sample_index = samples_per_shot

# Extract each shot into pandas series
for shot in range(count_of_shots):
    # Extract current shot
    current_shot_data = data_frame[data_frame['shot_no']==(shot+1)]

    # Select only the following column
    selected_columns_from_shot = current_shot_data[extract_columns]

    # Select columns from selected rows
    # Find start and end row indexes
    current_shot_data_start_index = shot * samples_per_shot + start_shot_sample_index
    current_shot_data_end_index = shot * samples_per_shot + end_shot_sample_index
    selected_rows_from_shot = selected_columns_from_shot.ix[current_shot_data_start_index:current_shot_data_end_index]

    # Append to list of lists
    # Convert selected short into multi-dimensional array
    training_data_list.append([selected_columns_from_shot[extract_columns[index]].values.tolist() for index in range(len(extract_columns))])

# Append each sliced shot into training data
training_data = pd.DataFrame(training_data_list, columns=extract_columns)
training_features = [1 for i in range(count_of_shots)]
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(training_data, training_features)
After running the above code, I am getting an error
ValueError: setting an array element with a sequence.
for the line
knn.fit(training_data, training_features)
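A likely cause, offered as an assumption since no answer is recorded here: every cell of training_data holds a list of 40 values, so the frame cannot be converted into the 2-D numeric array that fit expects. One possible workaround is to flatten each shot into a single feature row. The sketch below uses made-up data with 4 samples per shot and 2 columns instead of 40 and 6.

```python
import numpy as np
import pandas as pd

samples_per_shot = 4                      # 40 in the question
cols = ["LinearAccX", "LinearAccY"]       # subset of the question's columns

# Made-up stand-in for data.csv: 3 shots of 4 samples each
frame = pd.DataFrame({
    "shot_no": np.repeat([1, 2, 3], samples_per_shot),
    "LinearAccX": np.arange(12.0),
    "LinearAccY": np.arange(12.0) * 10,
})

# Flatten every shot (samples x columns) into one 1-D feature vector,
# giving a plain 2-D array with one row per shot
features = np.stack([
    shot[cols].to_numpy().ravel() for _, shot in frame.groupby("shot_no")
])

print(features.shape)
```

An array of this shape, one row per shot, is what you would pass as the first argument to knn.fit, together with one label per shot. Note that the labels in the question are all 1, so the classifier would only ever see one class.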

How to write values to a csv file from another csv file

For index.csv file, its fourth column has ten numbers ranging from 1-5. Each number can be regarded as an index, and each index corresponds with an array of numbers in filename.csv.
The row number of filename.csv represents the index, and each row has three numbers. My question is about using a nested loop to transfer the numbers in filename.csv to index.csv.
from numpy import genfromtxt
import numpy as np
import csv
data1 = genfromtxt('filename.csv', delimiter=',')
data2 = genfromtxt('index.csv', delimiter=',')
f = open('index.csv','wb')
write = csv.writer(f, delimiter=',',quoting=csv.QUOTE_ALL)
for row in data2:
for ch_row in data1:
if ( data2[row,3] == ch_row ):
write.writerow(data1[data2[row,3],:])
For example, the fourth column of index.csv contains 1,2,5,3,4,1,4,5,2,3 and filename.csv contains:
# filename.csv
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
What I need is to write the indexed row from filename.csv to index.csv and store these number in 5th, 6th and 7th column:
# index.csv
# 4 5 6 7
... 1 20 30 50
... 2 70 60 45
... 5 13 08 55
... 3 35 26 77
... 4 93 37 68
... 1 20 30 50
... 4 93 37 68
... 5 13 08 55
... 2 70 60 45
... 3 35 26 77
Can anyone help me solve this problem?
You need to indent your last 2 lines. Also, it looks like you are writing to the file from which you are reading.
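As an alternative to the nested loop, the lookup can be done with NumPy integer indexing. This is only a sketch using the example values from the question as in-memory arrays; it assumes the indices are 1-based, as the example suggests, and the array names are illustrative.

```python
import numpy as np

# Rows of filename.csv from the question
lookup = np.array([
    [20, 30, 50],
    [70, 60, 45],
    [35, 26, 77],
    [93, 37, 68],
    [13, 8, 55],
])

# Fourth column of index.csv from the question (1-based row numbers)
index_col = np.array([1, 2, 5, 3, 4, 1, 4, 5, 2, 3])

# Integer-array indexing picks one lookup row per index; subtract 1 for 0-based rows
rows = lookup[index_col - 1]

# Put the index next to the three looked-up values
combined = np.column_stack([index_col, rows])

print(combined)
```

The result could then be written with np.savetxt to a new file, rather than to the index.csv that is being read.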
