I have an Excel file containing multiple worksheets. Each worksheet contains price and inventory data for individual item codes for a particular month.
For example...
sheetname = 201509
code price inventory
5001 5 92
5002 7 50
5003 6 65
sheetname = 201508
code price inventory
5001 8 60
5002 10 51
5003 6 61
Using a pandas DataFrame, what is the best way to import this data, organized by time and item code?
I need this DataFrame to eventually be able to graph changes in price and inventory for item code 5001, for example.
I would appreciate your help. I am still new to Python/pandas.
Thanks.
My solution...
Here is a solution I found to my problem.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# note: older pandas used sheetname=; current versions use sheet_name=
D201509 = pd.read_excel('ExampleSpreadsheet.xlsx', sheet_name='201509', index_col='Code')
D201508 = pd.read_excel('ExampleSpreadsheet.xlsx', sheet_name='201508', index_col='Code')
D201507 = pd.read_excel('ExampleSpreadsheet.xlsx', sheet_name='201507', index_col='Code')
D201506 = pd.read_excel('ExampleSpreadsheet.xlsx', sheet_name='201506', index_col='Code')
D201505 = pd.read_excel('ExampleSpreadsheet.xlsx', sheet_name='201505', index_col='Code')
total = pd.concat(dict(D201509=D201509, D201508=D201508, D201507=D201507, D201506=D201506, D201505=D201505), axis=1)
total.head()
which nicely produces a DataFrame with hierarchical columns.
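(A side note, not part of my original solution: read_excel can also load every sheet at once by passing sheet_name=None, which returns a dict of DataFrames that can go straight into pd.concat. A minimal sketch:)
import pandas as pd

# sheet_name=None reads all worksheets into a dict keyed by sheet name
sheets = pd.read_excel('ExampleSpreadsheet.xlsx', sheet_name=None, index_col='Code')
total = pd.concat(sheets, axis=1)   # hierarchical columns: (sheet name, field)
total.head()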
Now my new question is: how would you plot the change in price for every item code with this DataFrame?
I want to see five lines (5001, 5002, 5003, 5004, 5005), with the x axis being the time (D201505, D201506, etc.) and the y axis being the price value.
Thanks.
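One possible sketch (assuming the second column level ends up labelled 'Price' after the read; adjust to 'price' if needed): select that level across all months and transpose so each item code becomes its own line.
import matplotlib.pyplot as plt

# assumed label 'Price' at level 1 of the hierarchical columns
prices = total.xs('Price', axis=1, level=1)   # index: item codes, columns: months
prices.T.plot()                               # one line per item code, months on the x axis
plt.ylabel('Price')
plt.show()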
This will get your data into a DataFrame and do a scatter plot for 5001:
import pandas as pd
import matplotlib.pyplot as plt
import xlrd
file = r'C:\dickster\data.xlsx'
list_dfs = []
xls = xlrd.open_workbook(file, on_demand=True)   # note: xlrd >= 2.0 no longer opens .xlsx files
for sheet_name in xls.sheet_names():
    df = pd.read_excel(file, sheet_name)
    df['time'] = sheet_name            # tag each sheet's rows with the sheet name
    list_dfs.append(df)
dfs = pd.concat(list_dfs, axis=0)
dfs = dfs.sort_values(['time', 'code'])   # df.sort() was removed in later pandas
which looks like:
code price inventory time
0 5001 8 60 201508
1 5002 10 51 201508
2 5003 6 61 201508
0 5001 5 92 201509
1 5002 7 50 201509
2 5003 6 65 201509
And now the plot for 5001, price vs. inventory:
dfs[dfs['code']==5001].plot(x='price',y='inventory',kind='scatter')
plt.show()
which produces the scatter plot of price against inventory for code 5001.
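To get the line plot asked for in the question (one line per item code with time on the x axis), a rough sketch building on the dfs frame above is to pivot price by code:
# pivot long-format data so each item code becomes its own column
price_by_code = dfs.pivot(index='time', columns='code', values='price')
price_by_code.plot()     # one line per code, time on the x axis
plt.ylabel('price')
plt.show()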
Related
If I have a text file, data.txt, which contains many columns, how can I read this file in Python and plot only two chosen columns?
For example:
10 -22.82215289 0.11s
12 -22.81978265 0.14s
15 -22.82359691 0.14s
20 -22.82464363 0.16s
25 -22.82615348 0.17s
30 -22.82641815 0.19s
35 -22.82649347 0.21s
40 -22.82655376 0.22s
50 -22.82661407 0.28s
60 -22.82663535 0.34s
70 -22.82664864 0.42s
80 -22.82665962 0.46s
90 -22.82666308 0.51s
100 -22.82666662 0.56s
and I need to plot only the first and second columns.
Note the space before the first column.
Edit
I used the following code:
import matplotlib.pyplot as plt
from matplotlib import rcParamsDefault
import numpy as np
plt.rcParams["figure.dpi"]=150
plt.rcParams["figure.facecolor"]="white"
x, y = np.loadtxt('./calc.dat', delimiter=' ', unpack=True)
plt.plot(x, y, "o-", markersize=5, label='Etot')
plt.xlabel('ecut')
plt.ylabel('Etot')
plt.legend(frameon=False)
plt.savefig("fig.png")
but I have to modify my data to contain only the two columns that I need to plot, without any spaces before the first column, as follows:
10 -22.82215289
12 -22.81978265
15 -22.82359691
20 -22.82464363
25 -22.82615348
30 -22.82641815
35 -22.82649347
40 -22.82655376
50 -22.82661407
60 -22.82663535
70 -22.82664864
80 -22.82665962
90 -22.82666308
100 -22.82666662
So, how can I modify the code so that I do not have to modify the data every time?
You can create a DataFrame from a text file using pandas read_csv, which can simplify future processing of the data as well as plotting it.
In this case, the tricky part is the whitespace, which can be handled by setting the optional parameter sep to '\s+':
import pandas as pd

df = pd.read_csv('data.txt', sep=r'\s+', header=None, names=['foo', 'bar', 'baz'])
>>> df
    foo          bar    baz
0    10 -22.82215289  0.11s
1    12 -22.81978265  0.14s
2    15 -22.82359691  0.14s
3    20 -22.82464363  0.16s
4    25 -22.82615348  0.17s
5    30 -22.82641815  0.19s
6    35 -22.82649347  0.21s
7    40 -22.82655376  0.22s
8    50 -22.82661407  0.28s
9    60 -22.82663535  0.34s
10   70 -22.82664864  0.42s
11   80 -22.82665962  0.46s
12   90 -22.82666308  0.51s
13  100 -22.82666662  0.56s
And then just your code:
plt.rcParams["figure.dpi"]=150
plt.rcParams["figure.facecolor"]="white"
plt.plot(df['foo'], df['bar'], "o-", markersize=5, label='Etot')
plt.xlabel('ecut')
plt.ylabel('Etot')
plt.legend(frameon=False)
plt.savefig("fig.png")
I set the names of the columns to arbitrary strings. You can avoid that and just refer to the columns as df[0], df[1], etc.
You could first read your file data.txt and preprocess it by stripping the whitespace at the start of each line, save the preprocessed data to data_processed.txt, then load it with pd.read_csv and plot the two columns of your choice, col1 and col2, against each other with plt.plot, as follows:
import pandas as pd
import matplotlib.pyplot as plt
s = """ 10 -22.82215289 0.11s
12 -22.81978265 0.14s
15 -22.82359691 0.14s
20 -22.82464363 0.16s
25 -22.82615348 0.17s
30 -22.82641815 0.19s
35 -22.82649347 0.21s
40 -22.82655376 0.22s
50 -22.82661407 0.28s
60 -22.82663535 0.34s
70 -22.82664864 0.42s
80 -22.82665962 0.46s
90 -22.82666308 0.51s
100 -22.82666662 0.56s"""
with open('data.txt', 'w') as f:
    f.write(s)

with open('data.txt', 'r') as f:
    data = f.read()

data_processed = '\n'.join([l.lstrip() for l in data.split('\n')])

with open('data_processed.txt', 'w') as f:
    f.write(data_processed)

df = pd.read_csv('data_processed.txt', sep=' ', header=None)
col1 = 0
col2 = 1
plt.plot(df[col1], df[col2]);
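If you prefer to stay closer to the np.loadtxt call from the question, a small sketch (not from the answers above): the default delimiter already splits on any run of whitespace, so the leading spaces are harmless, and usecols limits the parsing to the first two columns.
import numpy as np
import matplotlib.pyplot as plt

# default delimiter=None splits on any whitespace, so leading spaces are fine;
# usecols=(0, 1) ignores the third, non-numeric column ("0.11s" etc.)
x, y = np.loadtxt('data.txt', usecols=(0, 1), unpack=True)
plt.plot(x, y, "o-", markersize=5, label='Etot')
plt.legend(frameon=False)
plt.show()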
I am trying to read data from http://dummy.restapiexample.com/api/v1/employees and put it into tabular format.
I am getting output, but the columns are not created from the JSON file.
How can I do this the right way?
Code:
import pandas as pd
import json
df1 = pd.read_json('http://dummy.restapiexample.com/api/v1/employees')
df1.to_csv('try.txt',sep='\t',index=False)
Expected Output:
employee_name employee_salary employee_age profile_image
Tiger Nixon 320800 61
(along with other rows)
You can read the data directly from the web, like you're doing, but you need to help pandas interpret your data with the orient parameter:
df = pd.read_json('http://dummy.restapiexample.com/api/v1/employees', orient='index')
Then there's a second step to focus on the data you want:
df1 = pd.DataFrame(df.loc['data', 0])
Now you can write your csv.
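For example, reusing the line from the question:
df1.to_csv('try.txt', sep='\t', index=False)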
Here are the different steps (note: the records are in the 'data' array of the JSON response):
import json
import pandas as pd
import requests
res = requests.get('http://dummy.restapiexample.com/api/v1/employees')
data_str = res.content
data_dict = json.loads(data_str)
data_df = pd.DataFrame(data_dict['data'])
data_df.to_csv('try.txt', sep='\t', index=False)
You have to parse your JSON first.
import pandas as pd
import json
import requests
r = requests.get('http://dummy.restapiexample.com/api/v1/employees')
j = json.loads(r.text)
df = pd.DataFrame(j['data'])
Output:
id employee_name employee_salary employee_age profile_image
0 1 Tiger Nixon 320800 61
1 2 Garrett Winters 170750 63
2 3 Ashton Cox 86000 66
3 4 Cedric Kelly 433060 22
4 5 Airi Satou 162700 33
5 6 Brielle Williamson 372000 61
6 7 Herrod Chandler 137500 59
7 8 Rhona Davidson 327900 55
8 9 Colleen Hurst 205500 39
When creating a dataframe as below (instructions from here), the order of the columns changes from "Day, Visitors, Bounce Rate" to "Bounce Rate, Day, Visitors"
import pandas as pd
web_stats = {'Day': [1,2,3,4,5,6],
             'Visitors': [43,34,65,56,29,76],
             'Bounce Rate': [65,67,78,65,45,52]}
df = pd.DataFrame(web_stats)
Gives:
Bounce Rate Day Visitors
0 65 1 43
1 67 2 34
2 78 3 65
3 65 4 56
4 45 5 29
5 52 6 76
How can the order be kept intact? (i.e. Day, Visitors, Bounce Rate)
One approach is to use the columns parameter.
Ex:
import pandas as pd
web_stats = {'Day': [1,2,3,4,5,6],
             'Visitors': [43,34,65,56,29,76],
             'Bounce Rate': [65,67,78,65,45,52]}
df = pd.DataFrame(web_stats, columns = ['Day', 'Visitors', 'Bounce Rate'])
print(df)
Output:
Day Visitors Bounce Rate
0 1 43 65
1 2 34 67
2 3 65 78
3 4 56 65
4 5 29 45
5 6 76 52
Dictionaries are not considered to be ordered in Python <3.7.
You can use collections.OrderedDict instead:
from collections import OrderedDict
web_stats = OrderedDict([('Day', [1,2,3,4,5,6]),
                         ('Visitors', [43,34,65,56,29,76]),
                         ('Bounce Rate', [65,67,78,65,45,52])])
df = pd.DataFrame(web_stats)
If you don't want to write out the column names, which becomes really inconvenient if you have many keys, you may use:
df = pd.DataFrame(web_stats, columns = web_stats.keys())
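As a closing note (not part of the answers above): on Python 3.7+, where plain dicts preserve insertion order, recent pandas versions respect that order when building the frame, so on current setups the plain constructor already keeps Day, Visitors, Bounce Rate.
# on Python 3.7+ with a recent pandas, dict insertion order is preserved
df = pd.DataFrame(web_stats)
print(df.columns.tolist())   # ['Day', 'Visitors', 'Bounce Rate']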
I have an Excel file that I want to group based on the column name 'Step No.' and get the corresponding values. Here is a piece of code I wrote:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
fpath = '/Users/Anil/Desktop/Test data.xlsx'
df = pd.read_excel(fpath)
data = df.loc[:, ['Step No.', 'Parameter', 'Values']]
grp_data = pd.DataFrame(data.groupby(['Step No.', 'Values']).size().reset_index())
grp_data.to_excel('/Users/Anil/Desktop/Test1 data.xlsx')
The data gets grouped just as I want it to.
Step No. Values
1 62
1 62.5
1 63
1 66.5
1 68
1 70
1 72
1 76.5
1 77
2 66.5
2 67
2 69
3 75.5
3 77
But I want the data corresponding to each Step No. in a different Excel sheet, i.e. all values corresponding to Step No. 1 in one sheet, Step No. 2 in another sheet, and so on. I think I should use some sort of iteration, but I don't know what kind exactly.
This should do it:
from pandas import ExcelWriter

steps = df['Step No.'].unique()
dfs = [df.loc[df['Step No.'] == step] for step in steps]

def save_xls(list_dfs, xls_path):
    writer = ExcelWriter(xls_path)
    for n, df in enumerate(list_dfs):
        df.to_excel(writer, 'sheet%s' % n)
    writer.save()   # in newer pandas, use writer.close()

save_xls(dfs, 'YourFile.xlsx')
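A slightly different sketch of the same idea, using groupby directly so each sheet name carries the step number (same 'YourFile.xlsx' placeholder as above):
import pandas as pd

# one sheet per step, named after the step number
with pd.ExcelWriter('YourFile.xlsx') as writer:
    for step, group in df.groupby('Step No.'):
        group.to_excel(writer, sheet_name='Step %s' % step)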
I have some csv data in the following format.
Ln Dr Tag Lab 0:01 0:02 0:03 0:04 0:05 0:06 0:07 0:08 0:09
L0 St vT 4R 0 0 0 0 0 0 0 0 0
L2 Tx st 4R 8 8 8 8 8 8 8 8 8
L2 Tx ss 4R 1 1 9 6 1 0 0 6 7
I want to plot a time series graph using the columns (Ln, Dr, Tag, Lab) as the keys and the 0:0n fields as the values.
I have the following code.
#!/usr/bin/env python
import matplotlib.pyplot as plt
import datetime
import numpy as np
import csv
import sys
with open("test.csv", 'r', newline='') as fin:
reader = csv.DictReader(fin)
for row in reader:
key = (row['Ln'], row['Dr'], row['Tg'],row['Lab'])
#code to extract the values and plot a timeseries.
How do I extract all the values in the 0:0n columns without individually specifying each one of them? I want all the series to be plotted on a single time series graph.
I'd suggest using pandas:
import pandas as pd
import matplotlib.pyplot as plt

a = pd.read_csv('yourfile.txt', delim_whitespace=True)
for x in a.iterrows():
    # x[1] is the row as a Series; the first four entries are the key columns
    x[1][4:].plot(label=str(x[1][0]) + str(x[1][1]) + str(x[1][2]) + str(x[1][3]))

plt.ylim(-1, 10)
plt.legend()
I'm not really sure exactly what you want to do, but np.loadtxt is the way to go here. Make sure to set the delimiter correctly for your file:
data = np.loadtxt(fname="test.csv", delimiter=',', skiprows=1)
Now the n-th column of data is the n-th column of the file, and the same for rows.
You can access data by row, data[n], or by column, data[:, n].
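As a rough sketch (assuming the file really is comma separated with the four key columns Ln, Dr, Tag, Lab first, as in the sample above), usecols can skip those non-numeric columns so loadtxt only has to parse the sample values:
import numpy as np
import matplotlib.pyplot as plt

# usecols skips the four non-numeric key columns; columns 4..12 hold the 0:01 .. 0:09 samples
data = np.loadtxt(fname="test.csv", delimiter=',', skiprows=1, usecols=range(4, 13))
plt.plot(data.T)   # one line per row of the file
plt.show()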