Choosing the correct values in excel in Python - python

General Overview:
I am creating a graph of a large data set; however, I have created a sample text document so that the problem is easier to work through.
The data is from an Excel document that will be saved as a CSV.
Problem:
I am able to compile the data and it will graph (see below). However, the way I pull the data will not work for all of the different Excel sheets I am going to pull from.
More detail on the problem:
The Y-values (labeled 'Value' and 'Value1') are pulled from the Excel sheet using the numbers 26 and 31 (see picture and code).
This is a problem because the values 26 and 31 will not be the same for each graph.
Let's take a look so that this makes more sense.
Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
pd.read_csv('CSV_GM_NB_Test.csv').T.to_csv('GM_NB_Transpose_Test.csv', header=False)
df = pd.read_csv('GM_NB_Transpose_Test.csv', skiprows = 2)
DID = df['SN']
Value = df['26']
Value1 = df['31']
x= (DID[16:25])
y= (Value[16:25])
y1= (Value1[16:25])
"""
print(x,y)
print(x,y1)
"""
plt.plot(x.astype(int), y.astype(int))
plt.plot(x.astype(int), y1.astype(int))
plt.show()
Output:
Data Set:
Below in the comments you will find the 0bin link to my data set, because I do not have enough reputation to post two links.
As you can see from the data set:
X- DID = Blue
Y-Value = Green
Y-Value1 = Grey
Troublesome Values = Red
The problem, again, is that the data for the Y-values is pulled from rows 10 and 11 using the values 26 and 31 under SN.
Let me know if more information is needed.
Thank you

Not sure why you are creating the transposed CSV version. It is also possible to work directly from your original data. For example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('CSV_GM_NB_Test.csv', skiprows=8)
# Select columns 19 onwards by integer position, then transpose
data = df.iloc[:, 19:].T
data.columns = df['SN']
data.plot()
plt.show()
This would give you:
You can use pandas.DataFrame.iloc to get a sliced version of your data using integer positions (the older .ix indexer was removed in pandas 1.0). The [:, 19:] says to give you columns 19 onwards. The final .T transposes it. You can then apply the values of the SN column as column headings by assigning them to .columns.
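The slice, transpose, and rename steps described above can be sketched on a small made-up frame. The column names, the values, and the first_data_col offset below are assumptions for illustration only, not the layout of the real spreadsheet:

```python
import pandas as pd

# Hypothetical stand-in for the CSV: one row per series, with an 'SN' label
# column, one metadata column, and then the data columns.
df = pd.DataFrame({
    "SN": [26, 31],
    "meta": ["a", "b"],
    "1": [10, 20],
    "2": [11, 21],
    "3": [12, 22],
})

first_data_col = 2  # assumed position of the first data column (19 in the answer)

# Keep only the data columns (by integer position) and transpose so that
# each original row becomes a column.
data = df.iloc[:, first_data_col:].T

# Label the new columns with whatever values the SN column holds, so the
# code no longer depends on the literal numbers 26 and 31.
data.columns = df["SN"]

print(data)
```

Because the column labels are read from SN rather than typed in, the same code should work when a different sheet uses different SN values, provided the data columns start at the same position.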

Related

I am trying to correlate one column of a dataset with all columns in another dataset in Python

I have two CSV files, one called training_data and another called target data, and I have read both of them. The training data contains around 30 columns and the target data has one. I am trying to correlate the one column in the target data with all the columns of the training data.
import pandas as pd
import tarfile
import numpy as np
import csv
#reading in the data
training_data = pd.read_csv(training_data_path)
training_target = pd.read_csv(training_targets_path)
%matplotlib inline
import matplotlib.pyplot as plt
#plotting histogram
training_data.hist(bins=60,figsize=(30,25))
#after reviewing the histograms it can be seen in the histogram of the average household sizes that around 50 counties have a AvgHousehold size of almost 0
#PctSomeCol18_24, PctEmployed16_Over, PctPrivateCoverageAlone all have missing data
display(training_data)
display(training_target)
TARGET_deathRate = training_target["TARGET_deathRate"]
corr_matrix=training_data.corr(training_target)
I've tried using the corr function, but it is not working.
It is better to compute the correlation within one data set, so first join the two datasets and then use the correlation function. For joining you can use concat, append, or join (note that DataFrame.append was removed in pandas 2.0); I prefer join:
df = training_data.join(training_target) #joining datasets
corr_matrix=df.corr()['TARGET_deathRate']
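As a hedged illustration of the join approach, the sketch below uses synthetic data (the feature_a and feature_b columns are invented; only TARGET_deathRate comes from the question). It also shows DataFrame.corrwith, an alternative not mentioned in the answer, which correlates every column with a single Series without joining first:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two CSV files (assumed shapes, made-up values).
training_data = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
})
training_target = pd.DataFrame({
    "TARGET_deathRate": 2.0 * training_data["feature_a"]
    + rng.normal(scale=0.1, size=100)
})

# Approach from the answer: join on the index, then read one column of the
# full correlation matrix (dropping the target's correlation with itself).
df = training_data.join(training_target)
corr_joined = df.corr()["TARGET_deathRate"].drop("TARGET_deathRate")

# Alternative: correlate each training column with the target Series directly.
corr_direct = training_data.corrwith(training_target["TARGET_deathRate"])

print(corr_joined)
print(corr_direct)
```

Both approaches rely on the two frames sharing the same row index, which holds when the files are read in the same row order with the default index.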

How to iterate distance calculation for different vehicles from coordinates

I am new to coding and need help developing a Time Space Diagram (TSD) from a CSV file that I got as the result of a VISSIM simulation.
A general TSD looks like this: TSD. I have a CSV which looks like this: CSV.
I want to use "VEHICLE:SIMSEC", the simulation time, as the X axis of the TSD. "NO" is the vehicle number (there are 185 different vehicles and I want to plot all 185 of them), and each vehicle should be one line on the TSD. "COORDFRONTX" and "COORDFRONTY" are the x and y coordinates in the simulation; they give the positions, which should go on the Y axis of the TSD.
I have tried the following code but did not get the result I want.
import pandas as pd
import matplotlib.pyplot as mp
# take data
data = pd.read_csv(r"C:\Users\hk385\Desktop\VISSIM_DATA_CSV.csv")
df = pd.DataFrame(data, columns=["VEHICLE:SIMSEC", "NO", "DISTTRAVTOT"])
# plot the dataframe
df.plot(x="NO", y=["DISTTRAVTOT"], kind="scatter")
# show the plot
mp.show()
The plot came out uninterpretable because there were too many dots. The diagram looks like this: Time Space Diagram. Would you be able to help me or guide me in getting a TSD from the CSV I have?
As suggested by mitoRibo, the top 20 rows of the CSV are the following:
VEHICLE:SIMSEC,NO,LANE\LINK\NO,LANE\INDEX,POS,POSLAT,COORDFRONTX,COORDFRONTY,COORDREARX,COORDREARY,DISTTRAVTOT
5.9,1,1,1,2.51,0.5,-1.259,-3.518,-4.85,-1.319,8.42
6.0,1,1,1,10.94,0.5,0.932,-4.86,-2.659,-2.661,16.86
6.1,1,1,1,19.37,0.5,3.125,-6.203,-0.466,-4.004,25.29
6.2,1,1,1,27.82,0.5,5.319,-7.547,1.728,-5.348,33.73
6.3,1,1,1,36.26,0.5,7.515,-8.892,3.924,-6.693,42.18
6.4,1,1,1,44.72,0.5,9.713,-10.238,6.122,-8.039,50.64
6.5,1,1,1,53.18,0.5,11.912,-11.585,8.321,-9.386,59.1
6.6,1,1,1,61.65,0.5,14.112,-12.933,10.521,-10.734,67.56
6.7,1,1,1,70.12,0.5,16.314,-14.282,12.724,-12.082,76.04
6.8,1,1,1,78.6,0.5,18.518,-15.632,14.927,-13.432,84.51
6.9,1,1,1,87.08,0.5,20.723,-16.982,17.132,-14.783,93.0
7.0,1,1,1,95.57,0.5,22.93,-18.334,19.339,-16.135,101.49
7.1,1,1,1,104.07,0.5,25.138,-19.687,21.547,-17.487,109.99
7.2,1,1,1,112.57,0.5,27.348,-21.04,23.757,-18.841,118.49
7.3,1,1,1,121.08,0.5,29.56,-22.395,25.969,-20.195,127.0
7.4,1,1,1,129.59,0.5,31.773,-23.75,28.182,-21.551,135.51
7.5,1,1,1,138.11,0.5,33.987,-25.107,30.396,-22.907,144.03
7.6,1,1,1,146.64,0.5,36.203,-26.464,32.612,-24.264,152.56
7.7,1,1,1,155.17,0.5,38.421,-27.822,34.83,-25.623,161.09
Thank you.
You can groupby and iterate through different vehicles, adding each one to your plot. I changed your example data so there were 2 different vehicles.
import pandas as pd
import io
import matplotlib.pyplot as plt
df = pd.read_csv(io.StringIO("""
VEHICLE:SIMSEC,NO,LANE_LINK_NO,LANE_INDEX,POS,POSLAT,COORDFRONTX,COORDFRONTY,COORDREARX,COORDREARY,DISTTRAVTOT
5.9,1,1,1,2.51,0.5,-1.259,-3.518,-4.85,-1.319,0
6.0,1,1,1,10.94,0.5,0.932,-4.86,-2.659,-2.661,16.86
6.1,1,1,1,19.37,0.5,3.125,-6.203,-0.466,-4.004,25.29
6.2,1,1,1,27.82,0.5,5.319,-7.547,1.728,-5.348,33.73
6.3,1,1,1,36.26,0.5,7.515,-8.892,3.924,-6.693,42.18
6.4,1,1,1,44.72,0.5,9.713,-10.238,6.122,-8.039,50.64
6.5,1,1,1,53.18,0.5,11.912,-11.585,8.321,-9.386,59.1
6.6,1,1,1,61.65,0.5,14.112,-12.933,10.521,-10.734,67.56
6.7,1,1,1,70.12,0.5,16.314,-14.282,12.724,-12.082,76.04
6.8,1,1,1,78.6,0.5,18.518,-15.632,14.927,-13.432,84.51
6.9,1,1,1,87.08,0.5,20.723,-16.982,17.132,-14.783,90
6.0,2,1,1,95.57,0.5,22.93,-18.334,19.339,-16.135,0
6.1,2,1,1,104.07,0.5,25.138,-19.687,21.547,-17.487,30
6.2,2,1,1,112.57,0.5,27.348,-21.04,23.757,-18.841,40
6.3,2,1,1,121.08,0.5,29.56,-22.395,25.969,-20.195,50
6.4,2,1,1,129.59,0.5,31.773,-23.75,28.182,-21.551,60
6.5,2,1,1,138.11,0.5,33.987,-25.107,30.396,-22.907,70
6.6,2,1,1,146.64,0.5,36.203,-26.464,32.612,-24.264,80
6.7,2,1,1,155.17,0.5,38.421,-27.822,34.83,-25.623,90
"""),sep=',')
fig = plt.figure()
# Iterate through each vehicle, adding it to the plot
for vehicle_no, vehicle_df in df.groupby('NO'):
    plt.plot(vehicle_df['VEHICLE:SIMSEC'], vehicle_df['DISTTRAVTOT'], label=vehicle_no)
plt.legend()  # comment this out if you don't want a legend
plt.show()
plt.close()
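The answer above plots the DISTTRAVTOT column. If the position should instead be derived from COORDFRONTX and COORDFRONTY, as the question asks, one possible sketch is to accumulate the straight-line distance between consecutive points of each vehicle. The coordinates below are made up, and the DIST_FROM_COORDS column name is an invention for this example:

```python
import numpy as np
import pandas as pd

# Made-up sample with two vehicles (same column names as the VISSIM export).
df = pd.DataFrame({
    "VEHICLE:SIMSEC": [5.9, 6.0, 6.1, 6.0, 6.1, 6.2],
    "NO": [1, 1, 1, 2, 2, 2],
    "COORDFRONTX": [0.0, 3.0, 6.0, 0.0, 0.0, 0.0],
    "COORDFRONTY": [0.0, 4.0, 8.0, 0.0, 1.0, 3.0],
})

# Make sure each vehicle's rows are in time order before differencing.
df = df.sort_values(["NO", "VEHICLE:SIMSEC"])

# Per-vehicle change in x and y between consecutive time steps
# (the first row of each vehicle has no predecessor, so it is NaN).
dx = df.groupby("NO")["COORDFRONTX"].diff()
dy = df.groupby("NO")["COORDFRONTY"].diff()

# Euclidean length of each step, then a running total per vehicle.
step = np.hypot(dx, dy).fillna(0.0)
df["DIST_FROM_COORDS"] = step.groupby(df["NO"]).cumsum()

print(df[["NO", "VEHICLE:SIMSEC", "DIST_FROM_COORDS"]])
```

The resulting column could then be passed to plt.plot in place of DISTTRAVTOT inside the groupby loop. This assumes straight-line movement between samples, which is only an approximation on curved links.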
If you don't mind, could you please try this?
mp.scatter(x="NO", y="DISTTRAVTOT", data=df)
If it still does not work, please attach your data so that I can test it on my side.

How to export 3D array into a single row in excel using python

I am attempting to export a large array of 3D points into excel.
import numpy as np
import pandas as pd
d = np.asarray(data)
df = pd.DataFrame(d)
df.to_csv("C:/Users/Fred/Desktop/test.csv")
This exports the data into rows as below:
3.361490011 -27.39559937 -2.934410095
4.573401244 -26.45699201 -3.845634521
.....
Each line represents the x, y, z coordinates of one point. However, for my analysis, I would like the 2nd row to be moved into the columns beside the 1st row, and so on, so that all the coordinates for one shape are on a single row of the Excel sheet. I tried turning the data into a string, but this returned the above too.
The reason is so that I can add some population characteristics to the row for each 3D shape. Thanks for any help that anyone can give.
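You can use x = df.to_numpy().flatten() to flatten your data and then save it to CSV using np.savetxt. Note that np.savetxt writes a 1-D array as one value per line, so reshape it to a single row first, for example with x.reshape(1, -1).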
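A minimal sketch of the flatten-and-save idea, using the two points from the question as sample data. It writes to an in-memory buffer so the result can be inspected; a file path could be passed to np.savetxt instead. The fmt and delimiter choices are assumptions:

```python
import io

import numpy as np

# Two sample 3D points (taken from the output shown in the question).
points = np.array([
    [3.361490011, -27.39559937, -2.934410095],
    [4.573401244, -26.45699201, -3.845634521],
])

# Flatten to x1, y1, z1, x2, y2, z2 and keep it 2-D with a single row.
# A plain 1-D array would be written by np.savetxt as one value per line.
row = points.reshape(1, -1)

# Write to an in-memory text buffer instead of a file for demonstration.
buffer = io.StringIO()
np.savetxt(buffer, row, delimiter=",", fmt="%.9f")
csv_text = buffer.getvalue()

print(csv_text)
```

With several shapes, each shape's flattened coordinates could be stacked as one row of a 2-D array before saving, so that extra per-shape characteristics can be added as further columns.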

Plotting data with matplotlib takes forever & plot crashes with higher number of samples

I have an issue with plotting x,y data. x is the time series and y is the value y(x). Data3.txt is a text file that simply contains all the data, with no headers, in a matrix with 20 columns and 65534 rows.
Here is the code
import csv
import numpy as np
import matplotlib.pyplot as plt
dates = []
with open('Data3.txt') as csvDataFile:
    csvReader = csv.reader(csvDataFile,quoting=csv.QUOTE_NONNUMERIC)
    for row in csvReader:
        dates.append(row)
np.array(map(float, dates))
time=[]
value=[]
samples=8000
for row in dates:
    time.append(row[0])
    value.append(row[1])
print(len(time))
print(len(time[:samples]))
plt.plot(time[:samples], value[:samples])
plt.ylim(0,40)
plt.xlim(0,1200)
plt.show()
The plot is shown correctly with samples set up to 7000 (see attached Figure_1). As soon as I set samples to 8000, it takes a lot longer to plot and the outcome is Figure_2.
print(len(time))
gives 65543
print(len(time[:samples]))
gives 8000
I am really confused by that. Can anyone explain where this error might come from? Any hint concerning a smarter way of plotting is much appreciated as well. My next step would be to assign every column to its own list name and make the plots.

Plotting pandas dataframe after doing pandas melt is slow and creates strange y-axis

This could be caused by me not understanding how pandas.melt works, but I get strange behaviour when plotting "melted" dataframes using plotnine. Both frames have been converted from wide to long format: one frame with a column containing string values (df_slow) and another with only numerical values (df_fast).
The following code gives different behaviour. Plotting df_slow is slow and gives a strange looking y-axis. Plotting df_fast looks ok and is fast. My guess is that pandas melt is doing something strange with the data which causes this behaviour. See example plots.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotnine as p9
SIZE = 200
value = np.random.rand(SIZE, 1)
# Create test data, one with a column containing strings, one with only numeric values
df_slow = pd.DataFrame({'value': value.flatten(), 'string_column': ['A']*SIZE})
df_fast = pd.DataFrame({'value': value.flatten()})
# Set index
df_slow = df_slow.reset_index()
df_fast = df_fast.reset_index()
# Convert 'df_slow', 'df_fast' to long format
df_slow = pd.melt(df_slow, id_vars='index')
df_fast = pd.melt(df_fast, id_vars='index')
print(df_slow.head())
print(df_fast.head())
df_slow = df_slow[df_slow.variable == 'value']
# This is slow and has many breaks on y-axis
p = (p9.ggplot(df_slow, p9.aes(x='index', y='value')) + p9.geom_point())
p.draw()
# This is much faster and y-axis looks good
p = (p9.ggplot(df_fast, p9.aes(x='index', y='value')) + p9.geom_point())
p.draw()
slow and strange plot
fast and good looking plot
Possible fix
Changing the type of the "value" column in df_slow makes it behave like df_fast when plotting.
# This makes df_slow behave like df_fast when plotting
df_slow['value'] = df_slow.value.astype(np.float64)
Question
Is this a bug in plotnine (or pandas) or am I doing something wrong?
Answer
When pivoting two columns with different data types, in this case string and float, I guess it makes sense that the resulting column containing both strings and floats will have the type object. As ALollz pointed out, this probably makes plotnine interpret the values as strings, which causes this behavior.
