I have a model that produces its output as a CSV file. The columns are as follows (just a fictitious example):
| Car | Price | Year |
The Car column holds different car manufacturers, and the Price column holds the average car price for each year.
Example
| Car | Price | Year |
| BMW | 34000 | 1990 |
| BMW | 35000 | 1991 |
| BMW | 37000 | 1993 |
| AUDI | 32000 | 1991 |
| AUDI | 33500 | 1992 |
| AUDI | 34000 | 1993 |
| AUDI | 35500 | 1994 |
| SEAT | 25600 | 1994 |
...
I would like to be able to plot:
An area chart with all the prices for each car manufacturer in the years that the prices are available, within a 20-year period (for example 1990-2010).
In some years, no price is available for some of the car manufacturers, so not every manufacturer has 20 rows of data in the CSV; the output simply skips the row for that year. See BMW in the example, which lacks 1992.
Since I run the model with different inputs, the actual names of the "Cars" change (and so do the prices), so I need the code to pick up a certain car name and then plot the available values for each run.
This is just an example for simplification, but the layout of the actual data is the same. Would much appreciate some help on this one!
Try this; I think it might work. Note that I am a beginner, not a pro.
import pandas as pd
import matplotlib.pyplot as plt

med_path = "path for csv file"
med = pd.read_csv(med_path)
# Reshape to one row per year and one column per manufacturer
prices = med.pivot(index="Year", columns="Car", values="Price")
fig, ax = plt.subplots(dpi=120)
area = pd.DataFrame(prices, columns=["BMW", "AUDI", "SEAT"])  # replace with the manufacturers in your file
area.plot(kind="area", ax=ax)
plt.title("Graph for Area plot")
plt.show()
Hardcoding the values like this is probably not ideal; you could instead use a for loop to iterate through the CSV file's contents.
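A minimal sketch of one way to handle the skipped years, assuming the CSV has exactly the Car, Price and Year columns shown in the question: pivot the rows so that each manufacturer becomes a column, then reindex over the full 1990-2010 range so that missing years show up as explicit gaps. The names prices_by_year and plot_area are made up for this illustration, and the inline sample stands in for the real file.

```python
import io

import pandas as pd
import matplotlib.pyplot as plt

# Sample rows copied from the question; a real run would use pd.read_csv(path).
CSV_TEXT = """Car,Price,Year
BMW,34000,1990
BMW,35000,1991
BMW,37000,1993
AUDI,32000,1991
AUDI,33500,1992
AUDI,34000,1993
AUDI,35500,1994
SEAT,25600,1994
"""


def prices_by_year(df, first_year=1990, last_year=2010):
    """Return one row per year and one column per car; missing prices are NaN."""
    wide = df.pivot(index="Year", columns="Car", values="Price")
    # Reindex so that skipped years (e.g. BMW in 1992) appear as explicit gaps.
    return wide.reindex(range(first_year, last_year + 1))


def plot_area(wide):
    """Draw a stacked area chart; gaps are filled with 0 for drawing only."""
    fig, ax = plt.subplots(dpi=120)
    wide.fillna(0).plot(kind="area", ax=ax)
    ax.set_xlabel("Year")
    ax.set_ylabel("Price")
    ax.set_title("Average price per manufacturer")
    return ax


df = pd.read_csv(io.StringIO(CSV_TEXT))
wide = prices_by_year(df)
ax = plot_area(wide)
# plt.show()  # uncomment when running interactively
```

Because the column names come from the data itself, the same code should pick up whatever manufacturers a given run produces.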
Related
I have a csv with several columns including string data, here are the first rows out of about 2000
| | Title | FormeJuridique | Siren | TVA | NAFAPE | TypeAct | DateCrea | DateClo | DureeExe | Adresse | Coordonee |
| 0 | AGILIS COMPTABILITE | Société à responsabilité limitée (sans autre indication) | 902252782 | FR33902252782 | 6920Z | activités comptables | 01-09-2021 | 30-04-2022 | 241 days, 0:00:00 | Mâcon | (46.3036683, 4.8322266) |
| 1 | ALD VOLAILLES | SAS, société par actions simplifiée | 877535864 | FR56877535864 | 4639B | commerce de gros | 20-09-2019 | 19-04-2022 | 942 days, 0:00:00 | Montceau-les-Mines | (46.6740455, 4.3631681) |
First, I would like to group the data into sub-families. For example, for the NAFAPE variable, I would group all the rows whose code starts with 45 into a "Restaurant" family. Is that possible? Another example would be grouping the Adresse variable by city, with one group per city.
Another point is making graphs from string data, whether histograms or pie charts; I have trouble making them. Here is an example of one of my attempts.
import pandas as pd
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
data_pie = brg.groupby("FormeJuridique").count()['NAFAPE']
explode = (0,0,0,0.05,0,0,0,0,0.05)
plt.pie(x=data_pie, autopct="%.1f%%", explode=explode,
        pctdistance=1.1, labels=data_pie.keys())
plt.title("FormeJuridique", fontsize=14)
plt.legend(formju,
           loc="center left",
           bbox_to_anchor=(1, 0, 0.5, 1))
plt.setp(formju, size=8, weight="bold")
It's not really readable: I have problems with the legend, the category names around the pie, and so on.
For graphs like histograms it's a disaster. I think that grouping the values into sub-families could make it easier, because for the NAFAPE variable, for example, almost every value occurs only once, and this makes the graph unreadable.
Thanks for your help!
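One possible approach, sketched under some assumptions: derive a family from the first two characters of the NAFAPE code, count rows per family or per city, and move the category names from around the pie into a side legend. The DataFrame below is a small stand-in (only its first two NAFAPE codes and cities come from the question), and add_family, plot_pie and FAMILY_NAMES are hypothetical names chosen for this example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the real table: the first two rows echo the question,
# the rest are made up so that the grouping has something to show.
brg = pd.DataFrame({
    "FormeJuridique": ["SARL", "SAS", "SAS", "SARL", "SAS"],
    "NAFAPE": ["6920Z", "4639B", "4511Z", "4520A", "6920Z"],
    "Adresse": ["Mâcon", "Montceau-les-Mines", "Mâcon", "Mâcon", "Autun"],
})

# Prefix -> family label. Only the "45" entry comes from the question.
FAMILY_NAMES = {"45": "Restaurant"}


def add_family(df):
    """Add a Family column based on the first two characters of NAFAPE."""
    out = df.copy()
    out["Family"] = out["NAFAPE"].str[:2].map(FAMILY_NAMES).fillna("Other")
    return out


def plot_pie(counts, title):
    """Pie chart with the category names in a side legend, not around the pie."""
    fig, ax = plt.subplots()
    wedges = ax.pie(counts.values, autopct="%.1f%%", pctdistance=0.75)[0]
    ax.legend(wedges, list(counts.index), loc="center left",
              bbox_to_anchor=(1, 0.5))
    ax.set_title(title)
    return ax


with_family = add_family(brg)
family_counts = with_family["Family"].value_counts()
city_counts = brg.groupby("Adresse").size()  # one group per city
ax = plot_pie(brg["FormeJuridique"].value_counts(), "FormeJuridique")
# plt.show()  # uncomment when running interactively
```

Counting with value_counts before plotting also works for bar charts, for example family_counts.plot(kind="bar"), which tends to be easier to read than a histogram of raw strings.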
I have a problem with one of my school projects: I am trying to rearrange my data.
You can see how the data is currently arranged in this picture, which contains a sample of the data I am referring to.
This is the format I am attempting to reach:
Company name | activity description | year | variable 1 | variable 2 |......
company 1 | | 2011 | | |
company 1 | | 2012 | | |
..... (one row for every year ( from 2014 to 2015 inclusive))
company 2 | | 2011 | | |
company 2 | | 2012 | | |
..... (one row for every year ( from 2014 to 2015 inclusive))
for every single one of the 10 companies. This is a sample of my whole dataset, which contains more than 15,000 companies. I tried creating a dataframe of the size I want, but I have problems filling it with the data in the format I want. I am fairly new to Python. Could anyone help me, please?
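Since the picture of the original layout is not available here, the following is only a sketch under a loud assumption: that the data is "wide", with one row per company and one column per variable and year, named like "variable 1_2011". Under that assumption, pandas.wide_to_long can produce one row per company and year. The column names and the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical wide layout: one row per company, one column per variable/year.
wide = pd.DataFrame({
    "Company name": ["company 1", "company 2"],
    "activity description": ["", ""],
    "variable 1_2011": [1.0, 3.0],
    "variable 1_2012": [2.0, 4.0],
    "variable 2_2011": [10.0, 30.0],
    "variable 2_2012": [20.0, 40.0],
})

# Split "<stub>_<year>" columns into a year column plus one column per stub.
long = pd.wide_to_long(
    wide,
    stubnames=["variable 1", "variable 2"],
    i=["Company name", "activity description"],
    j="year",
    sep="_",
).reset_index()

# Put the columns and rows in the order asked for in the question.
long = long[["Company name", "activity description", "year",
             "variable 1", "variable 2"]]
long = long.sort_values(["Company name", "year"], ignore_index=True)
print(long)
```

If the real columns follow a different naming pattern, the stubnames and sep arguments would need to change accordingly.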
I have the following data in a DataFrame:
+----------------------+--------------+-------------------+
| Physician Profile Id | Program Year | Value Of Interest |
+----------------------+--------------+-------------------+
| 1004777 | 2013 | 83434288.00 |
| 1004777 | 2014 | 89237990.00 |
| 1004777 | 2015 | 96321258.00 |
| 1004777 | 2016 | 186993309.00 |
| 1004777 | 2017 | 205274459.00 |
| 1315076 | 2013 | 127454475.84 |
| 1315076 | 2014 | 156388338.20 |
| 1315076 | 2015 | 199733425.11 |
| 1315076 | 2016 | 242766959.37 |
+----------------------+--------------+-------------------+
I want to plot a trend graph with the Program year on the x-axis and Value of Interest on the y-axis and different lines for each Physician Profile ID. What is the best way to get this done?
Two routes I'd consider going with this:
Basic, fast, easy: matplotlib, which would look something like this:
install it, like pip install matplotlib
use it, like import matplotlib.pyplot as plt and this cheatsheet
Graphically compelling and you can drop your pandas dataframe right into it: Bokeh
I hope that helps you get started!
I tried a few things and was able to implement it:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# df is the DataFrame from the question
years = df["Program_Year"].unique()
PhysicianIds = sorted(df["Physician_Profile_ID"].unique())
pd.options.mode.chained_assignment = None
for ID in PhysicianIds:
    df_filter = df[df["Physician_Profile_ID"] == ID]
    for year in years:
        found = False
        for index, row in df_filter.iterrows():
            if row["Program_Year"] == year:
                found = True
                break
            else:
                found = False
        if not found:
            # Add a zero-valued row for any year this physician is missing
            df_filter.loc[index+1] = [ID, year, 0]
    VoI = list(df_filter["Value_of_Interest"])
    sns.lineplot(x=years, y=VoI, label=ID, linestyle='-')
plt.ylabel("Value of Interest (in 100,000,000)")
plt.xlabel("Year")
plt.title("Top 10 Physicians")
plt.legend(title="Physician Profile ID")
plt.show()
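As an alternative sketch, the same chart can be produced without explicit loops by pivoting the table so that each physician becomes a column and letting pandas draw one line per column. This uses the column names and values from the table in the question; a missing year simply stays NaN, so that line ends early instead of dropping to zero.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Data copied from the table in the question.
df = pd.DataFrame({
    "Physician Profile Id": [1004777] * 5 + [1315076] * 4,
    "Program Year": [2013, 2014, 2015, 2016, 2017, 2013, 2014, 2015, 2016],
    "Value Of Interest": [
        83434288.00, 89237990.00, 96321258.00, 186993309.00, 205274459.00,
        127454475.84, 156388338.20, 199733425.11, 242766959.37,
    ],
})

# One row per year, one column per physician.
trend = df.pivot(index="Program Year", columns="Physician Profile Id",
                 values="Value Of Interest")

ax = trend.plot(marker="o")          # one line per column
ax.set_xticks(list(trend.index))     # show whole years only
ax.set_xlabel("Program Year")
ax.set_ylabel("Value Of Interest")
ax.legend(title="Physician Profile Id")
# plt.show()  # uncomment when running interactively
```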
I previously made a database of products.
Its headings appeared as shown below.
Product | Code Product | Product Price | Current Stock | Threshold Level |
One of the products I have is a Pepsi can, which is shown below.
Pepsi | 30994663 | 0.50 | 30 | 5 |
The entire text file looks exactly like this -
Product | Code Product | Product Price | Current Stock | Threshold Level | Pepsi | 30994663 | 0.50 | 30 | 5 |
iPhone | 12345670 | 5.00 | 34 | 8 |
I want to change the value of current stock after a user has purchased 1 Pepsi can.
My code appears as such so far:
stockID = 30994663
stockNew = 29
with open('SODatabaseProduct.txt', 'r+') as stockDatabase:
    for line in stockDatabase:
        if str(stockID) in line:
            product = line.strip().split('|')
            product[3] = str(stockNew)
This changes the 'current stock' value to 29 in the list.
I want to transfer the change to the textfile using file.write.
I tried this method.
stockDatabase.write("<Br>" + str(product))
This, however, writes the list in its Python code form to the text file.
I want the following result.
Product | Code Product | Product Price | Current Stock | Threshold Level | Pepsi | 30994663 | 0.50 | 29 | 5 |
iPhone | 12345670 | 5.00 | 34 | 8 |
Of course, the line I will need to edit will need to be the one with the correct Code Product, but I can't seem to find a method that fits what I need. Can someone help?
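A minimal sketch of one common way to do this: read every line, change the matching one, and write the whole file back, since a text file cannot easily be edited in place. The function name set_stock is made up for this example, and it assumes the stock field always sits two fields after the product code, as in the sample. The demo writes to a temporary file with the header on its own line.

```python
import tempfile
from pathlib import Path


def set_stock(path, code, new_stock):
    """Rewrite the file so the product with this code has the new stock level."""
    path = Path(path)
    updated = []
    for line in path.read_text().splitlines():
        fields = line.split("|")
        stripped = [field.strip() for field in fields]
        if str(code) in stripped:
            # Assumed layout: ... | code | price | stock | threshold |
            stock_pos = stripped.index(str(code)) + 2
            fields[stock_pos] = f" {new_stock} "
        updated.append("|".join(fields))
    path.write_text("\n".join(updated) + "\n")


# Demo on a temporary copy of the sample data.
demo = Path(tempfile.mkdtemp()) / "SODatabaseProduct.txt"
demo.write_text(
    "Product | Code Product | Product Price | Current Stock | Threshold Level |\n"
    "Pepsi | 30994663 | 0.50 | 30 | 5 |\n"
    "iPhone | 12345670 | 5.00 | 34 | 8 |\n"
)
set_stock(demo, 30994663, 29)
print(demo.read_text())
```

Matching on the exact code field, rather than testing whether the code appears anywhere in the line, avoids accidentally updating a product whose price or name happens to contain the same digits.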
I would like to store some multidimensional/nested data in a pandas dataframe or panel so that I can return, for example:
All the times for Runner A, Race A
All the times(and names) for Race A for a certain year say 2015
Split 2 Time for Race A 2015 for all runners
Example data would look something like this, note that not all runners will have data for all years or all races. I have a fair amount of data in the Runner Profile which I'd prefer not to store on every line.
In addition I have another level of data for certain races. So for Race A/2015 for example I would like to have another level of data for split times, average paces etc.
Could anyone suggest a good way to do this with Pandas or any other way?
Name | Gender | Age
Runner A | Male | 35
Race A
Year | Time
2015 | 2:35:09
Split 1 Distance | Split 1 Time | Split 1 Pace | etc...
2014 | 2:47:34
2013 | 2:50:12
Race B
Year | Time
2013 | 1:32:07
Runner B | Male | 29
Race A
Year | Time
2015 | 3:05:56
Split 1 Distance | Split 1 Time | Split 1 Pace | etc...
Runner C | Female | 32
Race B
Year | Time
1998 | 1:29:43
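One way this could be laid out, sketched as three flat tables rather than one nested structure: a runner profile table, a results table indexed by (Name, Race, Year), and a splits table with an extra Split level. The profile is stored once and joined on demand. Names, ages and finish times are taken from the example above; the split distances and split times are placeholders invented for illustration, since the example does not give them.

```python
import pandas as pd

# Profile data, stored once per runner.
runners = pd.DataFrame(
    {"Gender": ["Male", "Male", "Female"], "Age": [35, 29, 32]},
    index=pd.Index(["Runner A", "Runner B", "Runner C"], name="Name"),
)

# One row per (runner, race, year).
results = pd.DataFrame(
    [("Runner A", "Race A", 2015, "2:35:09"),
     ("Runner A", "Race A", 2014, "2:47:34"),
     ("Runner A", "Race A", 2013, "2:50:12"),
     ("Runner A", "Race B", 2013, "1:32:07"),
     ("Runner B", "Race A", 2015, "3:05:56"),
     ("Runner C", "Race B", 1998, "1:29:43")],
    columns=["Name", "Race", "Year", "Time"],
)
results["Time"] = pd.to_timedelta(results["Time"])
results = results.set_index(["Name", "Race", "Year"]).sort_index()

# Extra level of detail, only for races that have it (placeholder values).
splits = pd.DataFrame(
    [("Runner A", "Race A", 2015, 1, 21.1, "1:15:30"),
     ("Runner A", "Race A", 2015, 2, 42.2, "2:35:09"),
     ("Runner B", "Race A", 2015, 1, 21.1, "1:28:00"),
     ("Runner B", "Race A", 2015, 2, 42.2, "3:05:56")],
    columns=["Name", "Race", "Year", "Split", "Distance", "Time"],
)
splits["Time"] = pd.to_timedelta(splits["Time"])
splits = splits.set_index(["Name", "Race", "Year", "Split"]).sort_index()

# All the times for Runner A, Race A.
runner_a_race_a = results.loc[("Runner A", "Race A")]

# All the times (and names) for Race A in 2015, with profile data joined in.
race_a_2015 = results.xs(("Race A", 2015), level=["Race", "Year"]).join(runners)

# Split 2 time for Race A 2015, for all runners.
split2 = splits.xs(("Race A", 2015, 2), level=["Race", "Year", "Split"])["Time"]
print(runner_a_race_a, race_a_2015, split2, sep="\n\n")
```

Runners without data for a race or year simply have no row in results, so nothing has to be padded.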