Combine csv values and output to csv - python

I'm trying to read a CSV file, combine the values for duplicate products, and then write the result out to a CSV again.
Iterate through each line in the file. The first line contains headers, so it should be skipped.
Separate the three values found in each line. Each line contains the product name, quantity sold, and unit price (the price of a single product), separated by commas.
Keep a running total of the quantity sold for each product; for example, the total quantity sold for ‘product b’ is 12.
Keep a record of the unit price of each product.
Write the result to sales-report.csv; the summary should include the name of each product, the sales volume (total quantity sold), and the sales revenue (total quantity sold multiplied by the unit price).
What I intend:
Input Data:
product name,quantity,unit price
product c,2,22.5
product a,1,10
product b,5,19.7
product a,3,10
product f,1,45.9
product d,4,34.5
product e,1,9.99
product c,3,22.5
product d,2,34.5
product e,4,9.99
product f,5,45.9
product b,7,19.7
Output Data:
product name,sales volume,sales revenue
product c,5,112.5
product a,4,40
product b,12,236.4
product f,6,275.4
product d,6,207
product e,5,49.95
This is what I have so far. I've looked around, and it isn't entirely clear how I'm supposed to use a list comprehension or otherwise combine the values.
Most of the answers I found were more complicated than this probably needs to be; it is a relatively simple task...
record = []
with open("items.csv", "r") as f:
    next(f)
    for values in f:
        split = values.rstrip().split(',')
        record.append(split)
print(record)

You can use pandas for this:
import pandas as pd
df = pd.read_csv('path/to/file')
Then calculate the sales revenue per row, group by product, and sum the quantity and revenue columns:
df = (df.assign(sales_revenue=lambda x: x['quantity'] * x['unit price'])
        .groupby('product name')[['quantity', 'sales_revenue']].sum()
        .reset_index())
  product name  quantity  sales_revenue
0    product a         4          40.00
1    product b        12         236.40
2    product c         5         112.50
3    product d         6         207.00
4    product e         5          49.95
5    product f         6         275.40
You can then save the result to a CSV file:
df.to_csv('new_file_name.csv', index=False)
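If you also want the headers to match the expected output exactly, you could rename the columns before saving (names taken from the question's expected output, file name from the question):
df = df.rename(columns={'quantity': 'sales volume', 'sales_revenue': 'sales revenue'})
df.to_csv('sales-report.csv', index=False)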

pandas is the way to go for this problem. If you don't already use it, it's worth learning: it applies operations across entire tables so you don't have to iterate yourself. Notice that entire columns can be multiplied in a single step. groupby groups the dataframe by product, and then it's easy to sum.
import pandas as pd

df = pd.read_csv("f.csv")
# compute the revenue for every row in a single vectorised step
df["sales revenue"] = df["quantity"] * df["unit price"]
# the unit price should not be summed, so drop it before grouping
del df["unit price"]
# group by product and sum the remaining numeric columns
outdf = df.groupby("product name").sum()
# rename returns a new frame, so assign the result back
outdf = outdf.rename(columns={"quantity": "sales volume"})
outdf.to_csv("f-out.csv")
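If you would rather stay with the csv module and the approach you started, here is a minimal sketch of the same aggregation (assuming the input is comma-separated as in the sample data, the file is named items.csv, and the report is written to sales-report.csv):
import csv

totals = {}  # product name -> [total quantity, unit price]

with open("items.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                              # skip the header row
    for name, qty, price in reader:
        entry = totals.setdefault(name, [0, float(price)])
        entry[0] += int(qty)                  # running total of quantity sold

with open("sales-report.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["product name", "sales volume", "sales revenue"])
    for name, (qty, price) in totals.items():
        writer.writerow([name, qty, round(qty * price, 2)])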

Related

Querying deeply nested and complex JSON data with multiple levels

I am struggling to break down the method required to extract data from deeply nested complex JSON data. I have the following code to obtain the JSON.
import requests
import pandas as pd
import json
import pprint
import seaborn as sns
import matplotlib.pyplot as plt
base_url="https://data.sec.gov/api/xbrl/companyfacts/CIK0001627475.json"
headers={'User-Agent': 'Myheaderdata'}
first_response=requests.get(base_url,headers=headers)
response_dic=first_response.json()
print(response_dic)
base_df=pd.DataFrame(response_dic)
base_df.head()
Which provides an output showing the JSON and a Pandas DataFrame. The dataframe has two columns, with the third (FACTS) containing a lot of nested data.
What I want to understand is how to navigate into that nested structure, to retrieve certain data. For example, I may want to go to the DEI level, or the US GAAP level and retrieve a particular attribute. Let's say DEI > EntityCommonStockSharesOutstanding and obtain the "label", "value" and "FY" details.
When I try to use the get function as follows:
data = []
for response in response_dic:
    data.append({"EntityCommonStockSharesOutstanding": response.get('EntityCommonStockSharesOutstanding')})
new_df = pd.DataFrame(data)
new_df.head()
I end up with the following attribute error:
AttributeError Traceback (most recent call last)
<ipython-input-15-15c1685065f0> in <module>
1 data=[]
2 for response in response_dic:
----> 3 data.append({"EntityCommonStockSharesOutstanding":response.get('EntityCommonStockSharesOutstanding')})
4 base_df=pd.DataFrame(data)
5 base_df.head()
AttributeError: 'str' object has no attribute 'get'
Iterating over response_dic iterates over its top-level keys, which are strings; that is why .get raises the AttributeError. Instead, index into the nested dictionary you want and use pd.json_normalize.
For example:
entity1 = response_dic['facts']['dei']['EntityCommonStockSharesOutstanding']
entity2 = response_dic['facts']['dei']['EntityPublicFloat']
df1 = pd.json_normalize(entity1, record_path=['units', 'shares'],
meta=['label', 'description'])
df2 = pd.json_normalize(entity2, record_path=['units', 'USD'],
meta=['label', 'description'])
>>> df1
end val accn ... frame label description
0 2018-10-31 106299106 0001564590-18-028629 ... CY2018Q3I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
1 2019-02-28 106692030 0001627475-19-000007 ... NaN Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
2 2019-04-30 107160359 0001627475-19-000015 ... CY2019Q1I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
3 2019-07-31 110803709 0001627475-19-000025 ... CY2019Q2I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
4 2019-10-31 112020807 0001628280-19-013517 ... CY2019Q3I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
5 2020-02-28 113931825 0001627475-20-000006 ... NaN Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
6 2020-04-30 115142604 0001627475-20-000018 ... CY2020Q1I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
7 2020-07-31 120276173 0001627475-20-000031 ... CY2020Q2I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
8 2020-10-31 122073553 0001627475-20-000044 ... CY2020Q3I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
9 2021-01-31 124962279 0001627475-21-000015 ... CY2020Q4I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
10 2021-04-30 126144849 0001627475-21-000022 ... CY2021Q1I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
[11 rows x 10 columns]
>>> df2
end val accn fy fp form filed frame label description
0 2018-10-03 900000000 0001627475-19-000007 2018 FY 10-K 2019-03-07 CY2018Q3I Entity Public Float The aggregate market value of the voting and n...
1 2019-06-28 1174421292 0001627475-20-000006 2019 FY 10-K 2020-03-02 CY2019Q2I Entity Public Float The aggregate market value of the voting and n...
2 2020-06-30 1532720862 0001627475-21-000015 2020 FY 10-K 2021-02-24 CY2020Q2I Entity Public Float The aggregate market value of the voting and n...
I came across this same issue. While the solution provided meets the requirements of your question, it might be better to flatten the entire dictionary and have all the columns represented in one long data frame.
That data frame can be used as a building block for a DB, or it can simply be queried as you wish.
The facts key can contain more sub-keys than just dei and us-gaap.
Also, within the us-gaap dictionary, extracting multiple XBRL tags at a time is quite difficult.
The solution below might not be the prettiest or most efficient, but it gets all the levels of the dictionary along with all the facts and values.
import requests
import pandas as pd
import json
from flatten_json import flatten
headers= {'User-Agent':'My User Agent 1.0', 'From':'something somethin'}
file = 'https://data.sec.gov/api/xbrl/companyfacts/CIK0001627475.json'
data = json.loads(requests.get(file, headers = headers).text)
#get the cik and name of the entity
Cik_Name = dict(list(data.items())[0: 2])
Cik_Name_df = pd.DataFrame(Cik_Name,index=[0])
#Flatten file
f = flatten(data['facts'],'|')
#drop into a dataframe and transpose
f = pd.DataFrame(f,index=[0]).T
#reset index
f = f.reset_index(level=0)
#rename columns
f.rename(columns={'index': 'Col_split', 0:'values'}, inplace= True)
#split Col_split column by delimiter
f = f.join(f['Col_split'].str.split(pat='|',expand=True).add_prefix('Col_split'))
#drop original Col_split column
f = f.drop(['Col_split','Col_split4'],axis = 1)
#move values column to the end
f = f[[c for c in f if c not in ['values']] + ['values']]
#create groups based on Col_split2 containing the value label
f['groups'] = f["Col_split2"].eq('label').cumsum()
df_list = []
#loop to break df by group and create new columns for label & description
for i, g in f.groupby('groups'):
    label = g['values'].iloc[0]
    description = g['values'].iloc[1]
    g.drop(index = g.index[:2], axis = 0, inplace = True)
    g['label'] = label
    g['description'] = description
    df_list.append(g)
final_df = pd.concat(df_list)
final_df.rename(columns={'Col_split0':'facts', 'Col_split1':'tag','Col_split3':'units'}, inplace=True)
final_df = final_df[['facts','tag','label','description','units','Col_split5','values']]
final_df['cum _ind'] = final_df["Col_split5"].eq('end').cumsum()
final_df = final_df.pivot(index = ['facts','tag','label','description','units','cum _ind'] , columns = 'Col_split5' ,values='values').reset_index()
final_df['cik'] = Cik_Name_df['cik'].iloc[0]
final_df['entityName'] = Cik_Name_df['entityName'].iloc[0]
final_df = final_df[['cik','entityName','facts','tag','label','description','units','accn','start','end','filed','form','fp','frame','fy','val']]
print(final_df)
Please feel free to make improvements as you see fit and share them with the community.

Python pandas group by column and return most recent modal value

I have the following two dataframes:
One containing a list of all UserIDs
Another containing user web activity. It has the columns UserID, ProductID and Datetime.
Essentially, each row in the second dataframe pertains to an instance of a user viewing a product page on the given datetime.
Feel free to generate sample data with the following code:
import pandas as pd
from datetime import datetime
df1 = pd.DataFrame({'UserID': [f'UID0{i}' for i in range(1, 10)]})
df2 = pd.DataFrame({'UserID': ['UID04', 'UID02', 'UID09', 'UID02', 'UID04', 'UID02', 'UID07', 'UID07', 'UID07', 'UID04', 'UID07', 'UID07'],
'ProductID': ['P017', 'P008', 'P241', 'P340', 'P363', 'P340', 'P166', 'P042', 'P042', 'P042', 'P166', 'P017'],
'Datetime': ['2017-09-10 15:48:09', '2018-05-26 04:52:35', '2017-09-29 18:26:42', '2017-03-06 15:04:58', '2017-09-07 18:44:24', '2016-03-11 05:06:32', '2016-04-11 18:22:19', '2017-09-04 04:44:23', '2018-12-19 07:34:06', '2018-04-09 04:39:55', '2017-04-11 18:22:19','2019-02-11 15:06:32']})
df2['Datetime'] = pd.to_datetime(df2['Datetime'], format='%Y-%m-%d %H:%M:%S')
I would like to obtain the most frequently viewed product by each user. If there are multiple modes, i.e. multiple products with the same highest number of views, the modal product with the most recent view (based on the Datetime column) must be considered. If a user has not viewed any product, we can have a default string like 'NoProduct'.
So for the given sample data, the expected output would be something like this:
UserID
UID01 NoProduct
UID02 P340
UID03 NoProduct
UID04 P042
UID05 NoProduct
UID06 NoProduct
UID07 P042
UID08 NoProduct
UID09 P241
I have only been able to obtain all the modes using the code:
pd.merge(df1, df2.groupby(['UserID'])['ProductID'].agg(pd.Series.mode).to_frame().reset_index(), how='left').fillna('NoProduct')
giving the output:
UserID ProductID
0 UID01 NoProduct
1 UID02 P340
2 UID03 NoProduct
3 UID04 [P017, P042, P363]
4 UID05 NoProduct
5 UID06 NoProduct
6 UID07 [P042, P166]
7 UID08 NoProduct
8 UID09 P241
But I have not been able to figure out how to return only a single mode based on the latest date of all modal products for each user. Please suggest the best way to accomplish this.
Try:
df2["tmp"] = df2.groupby(["UserID", "ProductID"], as_index=False)["ProductID"].transform("count")
df2 = df2.sort_values(by=["tmp", "Datetime"], ascending=[False, False])
x = (
df1.merge(
df2.drop_duplicates(subset=["UserID"], keep="first"),
on="UserID",
how="left",
)
.drop(columns=["Datetime", "tmp"])
.fillna("No Product")
)
print(x)
Prints:
UserID ProductID
0 UID01 No Product
1 UID02 P340
2 UID03 No Product
3 UID04 P042
4 UID05 No Product
5 UID06 No Product
6 UID07 P042
7 UID08 No Product
8 UID09 P241
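An equivalent formulation, continuing from the sample-data code in the question, that makes the tie-breaking explicit (count the views per user/product, then keep the most recent modal product) might look like this sketch:
# views per (user, product) plus the time of the most recent view
modal = (
    df2.groupby(["UserID", "ProductID"])
       .agg(views=("Datetime", "size"), last_view=("Datetime", "max"))
       .reset_index()
       .sort_values(["views", "last_view"], ascending=False)
       .drop_duplicates("UserID")            # most-viewed product, ties broken by recency
)
result = (
    df1.merge(modal[["UserID", "ProductID"]], on="UserID", how="left")
       .fillna("NoProduct")
)
print(result)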

How can I manipulate all items in a specific index in a CSV using Python?

I would like to increase the value at index [3] for all items by 10% of its current value.
Nothing I have found online has produced the result I want. The closest I got is below, but I couldn't see how to implement it.
for line in finalList:
    price = line[3]
    line[3] = float(line[3]) * 1.1
    line[3] = round(line[3], 2)
    print(line[3])
My products.csv looks like so:
Hardware,Hammer,10,10.99
Hardware,Wrench,12,5.75
Food,Beans,32,1.99
Paper,Plates,100,2.59
My code:
import csv

def getProductsData():
    productsfile = open('products.csv', 'r')
    # read products file
    productsreader = csv.reader(productsfile)
    products = []
    # break each line apart into its columns and display them
    for row in productsreader:
        for column in row:
            print(column, end = '|')
        print()
        # create and append list of the above data within a larger list
        products.append(row)
    # add 10% to price
    # add another product to your list of lists
    products.append(['Toddler', 'Millie', '2017', '8.25'])
    # loop through the list and display its contents
    print(products)
    # write the contents to a new file
    updatedfile = open('updated_products.csv', 'w')
    for line in products:
        updatedfile.write(','.join(line))
        updatedfile.write('\n')
    updatedfile.close()

getProductsData()
To add 10% to the price, you first need to make sure the column is numeric, and then you can carry out the calculation. This uses a pandas DataFrame read from the file without a header row, so the columns are referred to by number.
df:
Hardware,Hammer,10,10.99
Hardware,Wrench,12,5.75
Food,Beans,32,1.99
Paper,Plates,100,2.59
Make sure column is numeric:
df[3]=pd.to_numeric(df[3])
Overwrite column with 10% added to values, and round to 2 decimal places:
df[3]=round(df[3]+(df[3]/10),2)
df:
0 1 2 3
0 Hardware Hammer 10 12.09
1 Hardware Wrench 12 6.32
2 Food Beans 32 2.19
3 Paper Plates 100 2.85
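For completeness, a minimal end-to-end sketch of that pandas approach, assuming the data lives in products.csv with no header row and the result is written to a new file:
import pandas as pd

df = pd.read_csv('products.csv', header=None)          # no header row in the sample data
df[3] = pd.to_numeric(df[3])                           # make sure the price column is numeric
df[3] = round(df[3] + (df[3] / 10), 2)                 # add 10% and round to 2 decimal places
df.to_csv('updated_products.csv', index=False, header=False)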

How to append a value in excel based on cell values from multiple columns in Python and/or R

I'm new to openpyxl and the other similar Excel packages in Python (even pandas). What I want to achieve is to append, for each product, the lowest possible price that I can keep, based on the expense formula. The expense formula is in the code below, and the data looks like this in Excel:
Product   Cost   Price   Lowest_Price
ABC       32     66
XYZ       15     32
DEF       22     44
JML       60     120
I have the below code on Python 3.5, which works; however, it might not be the most optimized solution. I need to know how to append the lowest value to the Lowest_Price column:
cost = 32    #Cost of product from cost column
price = 66   #Price of product from price column
Net = cost + 5   #minimum profit required
lowest = 0   #lowest price that I can keep on the website for each product
#iterating over each row's values and posting the lowest value adjacent to each product
for i in range(Net, price):
    expense = (i*0.15) + 15   #expense formula
    if i - expense >= Net:
        lowest = i
        break
print(lowest)   #this value should be printed adjacent to the price, in the Lowest_Price column
Now, if someone could help me do that in Python and/or R. The reason I want it in both Python and R is that I want to compare the time complexity, as I have a huge set of data to deal with.
I'm fine with code that works with either of the Excel formats, i.e. xls or xlsx, as long as it is fast.
I worked it out this way.
import pandas as pd

df = pd.read_excel('path/file.xls')
row = list(df.index)
for x in row:
    cost = df.loc[x, 'Cost']
    price = df.loc[x, 'Price']
    Net = cost + 5
    df.loc[x, 'Lowest_Price'] = 0
    for i in range(Net, price):
        expense = (i*0.15) + 15   #expense formula
        if i - expense >= Net:
            df.loc[x, 'Lowest_Price'] = i
            break

#If you want to save it back to an excel file
df.to_excel('path/file.xls', index=False)
It gave this output:
Product Cost Price Lowest_Price
0 ABC 32 66 62.0
1 XYZ 15 32 0.0
2 DEF 22 44 0.0
3 JML 60 120 95.0
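Since the question asks about speed on a large dataset: the inner loop can also be removed entirely by solving the expense inequality in closed form (i - (0.15*i + 15) >= Net implies i >= (Net + 15) / 0.85). A vectorized sketch under that assumption, using the same column and file names as above:
import numpy as np
import pandas as pd

df = pd.read_excel('path/file.xls')

net = df['Cost'] + 5                                   # minimum profit required
candidate = np.ceil((net + 15) / 0.85).astype(int)     # smallest i with i - expense >= net
# the candidate is only valid if it stays below the current price, otherwise keep 0
df['Lowest_Price'] = np.where(candidate < df['Price'], candidate, 0)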

Python - time series alignment and "to date" functions

I have a dataset whose first three columns are Basket ID (a unique identifier), Sale amount (in dollars), and the date of the transaction. I want to calculate the following columns for each row of the dataset, and I would like to do it in Python:
Previous Sale of the same basket (if any); Sale Count to date for the current basket; Mean To Date for the current basket (if available); Max To Date for the current basket (if available)
Basket  Sale  Date        PrevSale  SaleCount  MeanToDate  MaxToDate
88      $15   3/01/2012             1
88      $30   11/02/2012  $15       2          $23         $30
88      $16   16/08/2012  $30       3          $20         $30
123     $90   18/06/2012            1
477     $77   19/08/2012            1
477     $57   11/12/2012  $77       2          $67         $77
566     $90   6/07/2012             1
I'm pretty new to Python, and I'm really struggling to find a clean way to do this. I've sorted the data (as above) by BasketID and Date, so I can get the previous sale in bulk by shifting forward by one within each basket. I have no clue how to get the MeanToDate and MaxToDate efficiently apart from looping... any ideas?
This should do the trick:
from pandas import concat
from pandas.stats.moments import expanding_mean, expanding_count
def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    # se is the (ordered) time series of sales restricted to a single basket
    # we can now create a dataframe by combining different metrics
    # pandas has a function for each of the ones you are interested in!
    return concat(
        {
            'MeanToDate': expanding_mean(se),   # cumulative mean
            'MaxToDate': se.cummax(),           # cumulative max
            'SaleCount': expanding_count(se),   # cumulative count
            'Sale': se,                         # simple copy
            'PrevSale': se.shift(1)             # previous sale
        },
        axis=1
    )
# we then apply this handler to all the groups and pandas combines them
# back into a single dataframe indexed by (Basket, Date)
# we simply need to reset the index to get the shape you mention in your question
new_df = df.groupby('Basket').apply(handler).reset_index()
You can read more about grouping and aggregating in the pandas groupby documentation.
Note that pandas.stats.moments was removed in later pandas releases; the following is the same approach updated for a newer version (0.24.2), using the expanding() API:
import pandas as pd
pd.__version__ # u'0.24.2'
from pandas import concat
def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    return concat(
        {
            'MeanToDate': se.expanding().mean(),    # cumulative mean
            'MaxToDate': se.expanding().max(),      # cumulative max
            'SaleCount': se.expanding().count(),    # cumulative count
            'Sale': se,                             # simple copy
            'PrevSale': se.shift(1)                 # previous sale
        },
        axis=1
    )
###########################
from datetime import datetime

df = pd.DataFrame({'Basket': [88, 88, 88, 123, 477, 477, 566],
                   'Sale': [15, 30, 16, 90, 77, 57, 90],
                   'Date': [datetime.strptime(ds, '%d/%m/%Y')
                            for ds in ['3/01/2012', '11/02/2012', '16/08/2012', '18/06/2012',
                                       '19/08/2012', '11/12/2012', '6/07/2012']]})
#########
new_df = df.groupby('Basket').apply(handler).reset_index()
