Handling data inside dataframe pandas - python

I have a problem with one of my school projects: I am trying to rearrange my data.
The picture shows how the data is currently arranged; it contains a sample of the data I am referring to.
This is the format I am trying to reach:
Company name | activity description | year | variable 1 | variable 2 |......
company 1 | | 2011 | | |
company 1 | | 2012 | | |
..... (one row for every year ( from 2014 to 2015 inclusive))
company 2 | | 2011 | | |
company 2 | | 2012 | | |
..... (one row for every year ( from 2014 to 2015 inclusive))
for every single one of the 10 companies. This is a sample of my whole data set, which contains more than 15000 companies. I tried creating a dataframe of the size I want, but I have problems filling it with the data I want in the format I want. I am fairly new to Python. Could anyone help me, please?
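Since the picture of the source layout is not available, the following is only a sketch under a loud assumption: the data is "wide", with one row per company and one column per variable and year, named like `variable 1_2011`. Those column names and all values are hypothetical. Under that assumption, `melt` followed by `pivot_table` gives one row per company and year:

```python
import pandas as pd

# Hypothetical wide layout: one row per company, one column per variable/year.
wide = pd.DataFrame({
    "Company name": ["company 1", "company 2"],
    "activity description": ["desc 1", "desc 2"],
    "variable 1_2011": [1.0, 2.0],
    "variable 1_2012": [3.0, 4.0],
    "variable 2_2011": [5.0, 6.0],
    "variable 2_2012": [7.0, 8.0],
})

id_cols = ["Company name", "activity description"]

# Stack every variable/year column into rows.
long_df = wide.melt(id_vars=id_cols, var_name="key", value_name="value")

# Split "variable 1_2011" into the variable name and the year.
long_df[["variable", "year"]] = long_df["key"].str.rsplit("_", n=1, expand=True)
long_df["year"] = long_df["year"].astype(int)

# Spread the variables back out into columns: one row per company and year.
tidy = long_df.pivot_table(
    index=id_cols + ["year"], columns="variable", values="value", aggfunc="first"
).reset_index()
tidy.columns.name = None
print(tidy)
```

If the real column names follow a different pattern, only the `rsplit` step should need to change.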

Related

Updating a column in a dataframe with latest value from the latest year

Let's say I have a dataframe:
df =
|ID | year | value |
|----|------|----------|
|123 | 2011 | Mango |
|232 | 2010 | Pineapple|
|123 | 2022 | Orange |
|232 | 2021 | Apple |
|221 | 2021 | Banana |
I want to update the dataframe's value column with the latest year's value. I am expecting a final df like this:
|ID | year | value |
|----|------|----------|
|123 | 2011 | Orange |
|232 | 2010 | Apple |
|123 | 2022 | Orange |
|232 | 2021 | Apple |
|221 | 2021 | Banana |
Basically, we want to update the values with the latest year's values.
In this case, ID 123 appears twice in the same df, with the different values "Mango" in 2011 and "Orange" in 2022. We wish to create a new df with the same columns and the same repetitions, but with the latest year's values.
I need this to be done without any loops, as the original df is extremely large and any loop takes a very long time to run.
You can use rank and merge as below; this gives the required output:
df = pd.DataFrame({'ID':[123,232,123,232,221],'Year':[2011,2010,2022,2021,2021],'Value':['Mango','Pineapple','Orange','Apple','Banana']})
df['ID_Year_Rank'] = df.groupby(['ID'])['Year'].rank(method='first', ascending=False)
df
This assigns rank 1 to the row with the latest year within each ID.
After this, a simple merge of the dataframe with its rank-1 rows gives the required result:
pd.merge(df[['ID','Year']], df[df['ID_Year_Rank']==1][['ID','Value']], left_on='ID', right_on = 'ID')
Try this: take the index of each ID's most recent year and use it to look up the value column with the loc[] accessor.
# indices of last years of each ID
indx = df.groupby('ID')['year'].transform('idxmax')
# assign values corresponding to the last years back to value
df['value'] = df.loc[indx, 'value'].tolist()
df

Creating area chart from csv file containing multiple values in one column

I have a model that produces its output as a csv file. The columns are as follows (just a fictitious example):
| Car | Price | Year |
The Car column holds the different car manufacturers, and the Price column holds the average car price for each year.
Example
| Car | Price | Year |
| BMW | 34000 | 1990 |
| BMW | 35000 | 1991 |
| BMW | 37000 | 1993 |
| AUDI | 32000 | 1991 |
| AUDI | 33500 | 1992 |
| AUDI | 34000 | 1993 |
| AUDI | 35500 | 1994 |
| SEAT | 25600 | 1994 |
...
I would like to be able to plot:
An area chart with all the prices for each car manufacturer in the years for which prices are available, within a 20-year period (for example 1990-2010).
For some years there is no price available for some of the car manufacturers, so not every manufacturer has 20 rows of data in the csv; the output just skips the whole year and row. See BMW in the example, which lacks 1992.
Since I run the model with different inputs, the actual names of the "Cars" change (and so do the prices), so I need the code to pick up whatever car names are present and plot the available values for each run.
This is just a simplified example, but the layout of the actual data is the same. I would much appreciate some help on this one!
Try this; I think it should work. I am a beginner, not a pro.
import pandas as pd
import matplotlib.pyplot as plt
med_path = "path for csv file"
med = pd.read_csv(med_path)
# reshape so each car is a column and each year is a row; missing years become NaN
area = med.pivot(index='Year', columns='Car', values='Price')
fig, ax = plt.subplots(dpi=120)
area.plot(kind='area', ax=ax)  # pandas fills NaN with 0 in area plots
plt.title('Graph for Area plot')
plt.show()
Because the columns are built from whatever car names appear in the csv, nothing has to be hardcoded and no loop over the file's contents is needed.

Best way to update prices using pandas

First off, I'm writing this post on my phone while on the road. Sorry for the lack of info; I'm just trying to get a head start for when I get home.
I have 2 csv files with different numbers of columns and records. The first file has about 150k records and the second has about 1.2 million. The first column of the first file contains some values that also appear in a column of the second file and some that do not. What I intend to do is check whether the value in column one of the first file is in the first column of the second file. If so, check whether the first file's second column is less than or greater than the value of a different column in the second file on the row where the first columns match. If so, update the first file's second column to the new value.
Side note: I don't need the fastest or most efficient way, just a working solution for now. I will optimize later. The code will be run once a month to update the csv file.
Currently I am trying to do this with pandas, loading each file into a dataframe, and I am struggling to make it work. If this is the best way, could you help me do it? Once I figure this part out I can work out the rest; I'm just stuck.
Before posting this question, I thought I might make a third dataframe containing the Material and DCost values for the rows where the Item and Material columns match, then loop through that dataframe and, wherever Item and Material match, update the Cost column in the csv file.
I don't know whether uploading the csv files to a database and running queries would be easier.
Would converting the dataframes to dicts work with this much data?
File 1
+--------+-------+--------+
| Item | Cost | Price |
+--------+-------+--------+
| Labor | 0 | 100.00 |
| 785342 | 12.54 | 24.76 |
| 620388 | 15.78 | 36.99 |
+--------+-------+--------+
File 2
+----------+--------+-----------+
| Material | DCost | List Cost |
+----------+--------+-----------+
| 10C0024 | .24 | 1.56 |
| 785342 | 12.54 | 23.76 |
| 620388 | 16.99 | 36.99 |
| 2020101 | 100.76 | 267.78 |
+----------+--------+-----------+
Intended result to export to csv.
+--------+-------+--------+
| Item   | Cost  | Price  |
+--------+-------+--------+
| Labor | 0 | 100.00 |
| 785342 | 12.54 | 23.76 |
| 620388 | 16.99 | 36.99 |
+--------+-------+--------+
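One way to get from File 1 and File 2 to the intended result without an explicit loop is to look up each Item in File 2 and keep the old value wherever there is no match. The sketch below rebuilds the sample tables in memory; it assumes Item and Material are compared as strings, that each Material appears once in File 2 (duplicates are dropped, keeping the first), and that matching rows take both DCost and List Cost, as the intended result suggests.

```python
import pandas as pd

# Sample data from the question; in practice these would come from pd.read_csv,
# reading the key columns as strings so "Labor" and "785342" compare consistently.
file1 = pd.DataFrame({
    "Item": ["Labor", "785342", "620388"],
    "Cost": [0.0, 12.54, 15.78],
    "Price": [100.00, 24.76, 36.99],
})
file2 = pd.DataFrame({
    "Material": ["10C0024", "785342", "620388", "2020101"],
    "DCost": [0.24, 12.54, 16.99, 100.76],
    "List Cost": [1.56, 23.76, 36.99, 267.78],
})

# Lookup table keyed by Material (assumption: first occurrence wins on duplicates).
lookup = file2.drop_duplicates("Material").set_index("Material")

# map() returns NaN for items missing from File 2; fillna() keeps the old value there.
file1["Cost"] = file1["Item"].map(lookup["DCost"]).fillna(file1["Cost"])
file1["Price"] = file1["Item"].map(lookup["List Cost"]).fillna(file1["Price"])

print(file1)
# file1.to_csv(...) would then write the updated table back out.
```

Checking whether the old value is less than or greater than the new one is the same as checking that they differ, so overwriting every matched row gives the same result.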

Writing to a table in a text file issue

I previously made a database of products.
Its headings appeared as shown below.
Product | Code Product | Product Price | Current Stock | Threshold Level |
One of the products I have is a Pepsi can, which is shown below.
Pepsi | 30994663 | 0.50 | 30 | 5 |
The entire text file looks exactly like this -
Product | Code Product | Product Price | Current Stock | Threshold Level | Pepsi | 30994663 | 0.50 | 30 | 5 |
iPhone | 12345670 | 5.00 | 34 | 8 |
I want to change the value of current stock after a user has purchased 1 Pepsi can.
My code so far:
stockID = 30994663
stockNew = 29
with open('SODatabaseProduct.txt', 'r+') as stockDatabase:
    for line in stockDatabase:
        if str(stockID) in line:
            product = line.strip().split('|')
            product[3] = str(stockNew)  # index 3 is the 'Current Stock' field
This changes the fourth value, 'Current Stock', to 29 in the list.
I want to transfer the change to the text file using file.write.
I tried this method.
stockDatabase.write("<Br>" + str(product))
This, however, writes the list in code form to the text file.
I want the following result.
Product | Code Product | Product Price | Current Stock | Threshold Level | Pepsi | 30994663 | 0.50 | 29 | 5 |
iPhone | 12345670 | 5.00 | 34 | 8 |
Of course, the line I will need to edit will need to be the one with the correct Code Product, but I can't seem to find a method that fits what I need. Can someone help?
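A common approach is to read every line into memory, change the matching line, and write the whole file back. The sketch below assumes one product per line with the five pipe-separated fields shown above; `update_stock` is a hypothetical helper name, and the demo writes to a temporary copy rather than the real file.

```python
import os
import tempfile


def update_stock(path, product_code, new_stock):
    """Rewrite the file, replacing Current Stock on the row whose code matches."""
    with open(path, "r") as f:
        lines = f.readlines()

    for i, line in enumerate(lines):
        fields = [field.strip() for field in line.strip().split("|")]
        # Expected fields: name, code, price, stock, threshold (plus '' after the last '|').
        if len(fields) >= 5 and fields[1] == str(product_code):
            fields[3] = str(new_stock)
            lines[i] = " | ".join(fields[:5]) + " |\n"

    with open(path, "w") as f:
        f.writelines(lines)


# Demo on a temporary file laid out one product per line (an assumption).
demo_path = os.path.join(tempfile.mkdtemp(), "SODatabaseProduct.txt")
with open(demo_path, "w") as f:
    f.write("Product | Code Product | Product Price | Current Stock | Threshold Level |\n")
    f.write("Pepsi | 30994663 | 0.50 | 30 | 5 |\n")
    f.write("iPhone | 12345670 | 5.00 | 34 | 8 |\n")

update_stock(demo_path, 30994663, 29)

with open(demo_path) as f:
    result = f.read().splitlines()
print("\n".join(result))
```

Comparing the code field exactly, rather than testing `str(stockID) in line`, avoids touching a row where the same digits happen to appear in another field.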

Multi-dimensional/Nested DataFrame in Pandas

I would like to store some multidimensional/nested data in a pandas dataframe or panel so that I can retrieve, for example:
All the times for Runner A, Race A
All the times (and names) for Race A for a certain year, say 2015
Split 2 Time for Race A 2015 for all runners
Example data would look something like this, note that not all runners will have data for all years or all races. I have a fair amount of data in the Runner Profile which I'd prefer not to store on every line.
In addition I have another level of data for certain races. So for Race A/2015 for example I would like to have another level of data for split times, average paces etc.
Could anyone suggest a good way to do this with Pandas or any other way?
Name | Gender | Age
Runner A | Male | 35
    Race A
        Year | Time
        2015 | 2:35:09
            Split 1 Distance | Split 1 Time | Split 1 Pace | etc...
        2014 | 2:47:34
        2013 | 2:50:12
    Race B
        Year | Time
        2013 | 1:32:07
Runner B | Male | 29
    Race A
        Year | Time
        2015 | 3:05:56
            Split 1 Distance | Split 1 Time | Split 1 Pace | etc...
Runner C | Female | 32
    Race B
        Year | Time
        1998 | 1:29:43
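One possible layout, offered as a sketch rather than the only answer, is to keep three flat tables (runner profiles, race results, splits) and link them by runner, race and year, so the profile is stored once per runner. The runner and result values below come from the example above; the split times are made-up placeholders because the example does not give any, and the table and column names are assumptions. Panel has since been removed from pandas, so a MultiIndex takes its place here.

```python
import pandas as pd

# Profile data is stored once per runner.
runners = pd.DataFrame(
    {"Name": ["Runner A", "Runner B", "Runner C"],
     "Gender": ["Male", "Male", "Female"],
     "Age": [35, 29, 32]}
).set_index("Name")

# One row per runner, race and year.
results = pd.DataFrame(
    [("Runner A", "Race A", 2015, "2:35:09"),
     ("Runner A", "Race A", 2014, "2:47:34"),
     ("Runner A", "Race A", 2013, "2:50:12"),
     ("Runner A", "Race B", 2013, "1:32:07"),
     ("Runner B", "Race A", 2015, "3:05:56"),
     ("Runner C", "Race B", 1998, "1:29:43")],
    columns=["Name", "Race", "Year", "Time"],
)
results["Time"] = pd.to_timedelta(results["Time"])
results = results.set_index(["Name", "Race", "Year"]).sort_index()

# Extra level of detail, only for races that have it (hypothetical split times).
splits = pd.DataFrame(
    [("Runner A", "Race A", 2015, 1, "0:50:00"),
     ("Runner A", "Race A", 2015, 2, "0:52:30"),
     ("Runner B", "Race A", 2015, 1, "1:00:00"),
     ("Runner B", "Race A", 2015, 2, "1:03:10")],
    columns=["Name", "Race", "Year", "Split", "SplitTime"],
)
splits["SplitTime"] = pd.to_timedelta(splits["SplitTime"])

# All the times for Runner A, Race A.
a_times = results.loc[("Runner A", "Race A")]

# All the times (and names) for Race A in 2015, with profiles attached.
race_a_2015 = results.xs(("Race A", 2015), level=["Race", "Year"])
with_profile = race_a_2015.join(runners)

# Split 2 time for Race A 2015 for all runners.
split2 = splits[(splits["Race"] == "Race A")
                & (splits["Year"] == 2015)
                & (splits["Split"] == 2)]

print(a_times, with_profile, split2, sep="\n\n")
```

Runners without data for a given race or year simply have no row, so nothing needs to be padded.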
