Turning stocktake data into a dictionary - python

I am automating stocktake comparisons: we receive a stock update daily and it needs to be compared to our own stock data to see if there are differences. I think the easiest way of doing this would be to get both stock reports into a dictionary format of {item: quantity}. I have done this for our own stock, but the stock from the warehouse comes in an Excel file that separates each item out by batch number.
I have read this using xlrd using the following:
import xlrd

data = []
file = file_name
wb = xlrd.open_workbook(file)
sh = wb.sheet_by_index(0)
row_numbers = range(6, sh.nrows)
for row in row_numbers:
    if str(sh.row_values(row)[0]).startswith('Sku'):
        data.append(sh.row_values(row)[1:3])
print(data)
and have it in the format of a list of lists. For reference, this would look like [['item a', 1200], ['item a', 4000]], etc. The number of entries per item is not consistent; it goes up to 6 but can also be 1. What would be the best method for creating a final dictionary with only one entry per item, holding the grand total across all of the original lines?

What you'll want to do is iterate through your list of lists, and for each one find whether the item is in the dictionary. If it's not, add it to the dictionary with the quantity as the mapped value. If it already is, look up the mapped value and add the quantity to it.
For example:
final_dict = {}
for entry in list_of_lists:
    final_dict[entry[0]] = final_dict.get(entry[0], 0) + entry[1]
Note that here final_dict.get(x, y) means look up the key x and return y as a default if x isn't in the dictionary.
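If you prefer not to write the lookup yourself, collections.Counter does the same accumulation. This is just a minimal sketch assuming the same list-of-lists shape produced by the xlrd loop above:
from collections import Counter

# data is the list of [item, quantity] pairs read from the spreadsheet
data = [['item a', 1200], ['item a', 4000], ['item b', 150]]

totals = Counter()
for item, quantity in data:
    totals[item] += quantity  # missing keys start at 0, so this builds the grand total

final_dict = dict(totals)
print(final_dict)  # {'item a': 5200, 'item b': 150}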

Related

How do I find the count of a column of lists and display by date?

My dataset looks like this
Using Python and pandas, I want to display the count of each unique item in the coverage column, where the items are stored in lists as shown in the table.
I want to display that count by device and by date.
Example output would be:
the unique coverage count, being the count of each unique list value in the "coverage" column.
You can use the apply method to iterate over the rows and apply a custom function that returns the length of the list. For example:
df["coverage_count"] = df["coverage"].apply(lambda x: len(x))
Here's how I solved it using for loops:
coverage_list = []
for item in list(df["coverage"]):
    if item == '[]':
        item = ''
    else:
        item = list(item.split(","))
    coverage_list.append(len(item))
    # print(len(item))
df["coverage_count"] = coverage_list

How do I find the largest float in a list array?

I'm trying to create a Python script for the Huntington-Hill apportionment method. For this, I need the population of each state, the number of seats currently assigned to each state, and the value a, where a = p/sqrt(s*(s+1)). I need to identify which state has the largest a value, add one seat to that state, and repeat until the state with the smallest population has the largest a. I've created row[3] in my list to store the a values, but I'm unable to have Python identify the largest one.
I've tried to sort row[3] or simply find the max value, but am told 'float' object is not iterable. If I convert it to a string and then sort, it sorts each digit, giving me a list of 9s and then 8s etc.
import math
import csv

file_csv = open("statepop2.csv")
data_csv = csv.reader(file_csv)
list_csv = list(data_csv)
for row in list_csv:
    row.append(0)
    row[3] = int(row[1])/math.sqrt(int(row[2])*(int(row[2])+1))
print(sorted(row[3]))
I'm very new to all this, so any help is much appreciated
Edit: It seems this is a problem with the CSV, not with the sort. I'm not sure what's wrong, nor how to upload my CSV file.
You are trying to sort the value of a itself, hence giving a float as the argument to the sorted() function. It is the same as typing:
sorted(4.3)
As 4.3 is not a list but a float, it is not iterable.
I suggest simply creating a list, appending the values to it, and then printing the sorted list.
import math, csv

file_csv = open("statepop2.csv")
data_csv = csv.reader(file_csv)
list_csv = list(data_csv)
rows = []
for row in list_csv:
    row.append(0)
    row[3] = int(row[1])/math.sqrt(int(row[2])*(int(row[2])+1))
    rows.append(row[3])
print(sorted(rows))
If you need some other data from your CSV (e.g. the state name) to be displayed along with the a value, just make rows a dict:
rows = {}
Then you can add a dict entry, having the other piece of info as key:
countryName = row[x]
rows[countryName] = row[3]
The problem is that row[3] is a single float, not a list. You cannot sort a single number.
I am not sure exactly what you want to do, but try this:
for row in list_csv:
    row.append(0)
    row[3] = int(row[1])/math.sqrt(int(row[2])*(int(row[2])+1))
print(max([row[3] for row in list_csv]))
Simply use max().
a = [1.2, 3.8, 4.9]
b = max(a)
print(b)
# 4.9
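If you need the state itself rather than just the largest a value, max() also accepts a key function. A minimal sketch, assuming rows in the [state, population, seats] layout described in the question (the sample values are made up):
import math

list_csv = [
    ['Alaska', '733391', '1'],
    ['Texas', '29145505', '36'],
]
for row in list_csv:
    row.append(int(row[1]) / math.sqrt(int(row[2]) * (int(row[2]) + 1)))

# pick the row whose a value (index 3) is largest
winner = max(list_csv, key=lambda row: row[3])
print(winner[0])  # state with the largest a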

unhashable type: 'dict'

I am new here and want to ask about removing duplicate entries. I'm working on a face recognition project and am stuck on removing duplicate rows from the data that I send to Google Sheets. This is the code that I use:
if (confidence < 100):
    id = names[id]
    confidence = "{0}%".format(round(100 - confidence))
    row = (id, datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    index = 2
    sheet.insert_row(row, index)
data = sheet.get_all_records()
result = list(set(data))
print(result)
The error message is "unhashable type: 'dict'".
I want each entry posted to the Google Sheet only once.
You can't add dictionaries to sets.
What you can do is convert each dictionary's items to a tuple and add those tuples to a set, like so:
s = set(tuple(d.items()) for d in data)
If you need to convert these back to dictionaries afterwards, you can do:
for t in s:
    new_dict = dict(t)
According to the gspread documentation, get_all_records() returns a list of dicts, where each dict uses the header row values as keys and the cell values as values. So you need to iterate through this list and compare your ids to find and skip the repeating items. Sample code:
visited = []
filtered = []
for row in data:
    if row['id'] not in visited:
        visited.append(row['id'])
        filtered.append(row)
Now filtered should contain only unique items. Instead of 'id' you should put the name of the column that contains the repeating value.
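Another option is to avoid writing duplicates in the first place by checking the sheet before each insert. This is a rough sketch that only uses the gspread calls already in your code; the 'id' column name and the sheet/id variables are taken from the question and assumed to exist:
import datetime

records = sheet.get_all_records()                    # list of dicts keyed by the header row
existing_ids = {record['id'] for record in records}  # 'id' header is an assumption

if id not in existing_ids:
    row = (id, datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    sheet.insert_row(row, 2)  # only insert when this id is not already on the sheet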

Append pandas dataframe with column of averaged values from string matches in another dataframe

Right now I have two dataframes (let's call them A and B) built from Excel imports. Both have different dimensions as well as some empty/NaN cells. Let's say A is data for individual model numbers and B is a set of order information. For every row (unique item) in A, I want to search B for the (possibly) multiple orders for that item number, average the corresponding prices, and append A with a column containing the average price for each item.
The item numbers are alphanumeric, so they have to be strings. Not every item will have orders/pricing information for it, and I'll be removing those at the next step. This is a large amount of data, so efficiency matters and iterrows probably isn't the right choice. Thank you in advance!
Here's what I have so far:
avgPrice = []
for index, row in dfA.iterrows():
    def avg_unit_price(item_no, unit_price):
        matchingOrders = []
        for item, price in zip(item_no, unit_price):
            if item == row['itemNumber']:
                matchingOrders.append(price)
        avgPrice.append(np.mean(matchingOrders))
    avg_unit_price(dfB['item_no'], dfB['unit_price'])
dfA['avgPrice'] = avgPrice
In general, avoid loops as they perform poorly. If you can't vectorise easily, then as a last resort you can try pd.Series.apply. In this case, neither was necessary.
import pandas as pd
# B: pricing data
df_b = pd.DataFrame([['I1', 34.1], ['I2', 541.31], ['I3', 451.3], ['I2', 644.3], ['I3', 453.2]],
                    columns=['item_no', 'unit_price'])
# create avg price dictionary
item_avg_price = df_b.groupby('item_no', as_index=False).mean().set_index('item_no')['unit_price'].to_dict()
# A: product data
df_a = pd.DataFrame([['I1'], ['I2'], ['I3'], ['I4']], columns=['item_no'])
# map price info to product data
df_a['avgPrice'] = df_a['item_no'].map(item_avg_price)
# remove unmapped items
df_a = df_a[pd.notnull(df_a['avgPrice'])]
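As a side note, the same mapping can be written slightly more directly by mapping against the grouped Series; a minimal equivalent sketch, assuming the same df_a and df_b as above:
# equivalent to the dictionary + map approach
avg_price = df_b.groupby('item_no')['unit_price'].mean()
df_a['avgPrice'] = df_a['item_no'].map(avg_price)
df_a = df_a.dropna(subset=['avgPrice'])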

Dynamically parsing research data in python

The long (winded) version:
I'm gathering research data using Python. My initial parsing is ugly (but functional) code which gives me some basic information and turns my raw data into a format suitable for heavy duty statistical analysis using SPSS. However, every time I modify the experiment, I have to dive into the analysis code.
For a typical experiment, I'll have 30 files, each for a unique user. The field count is fixed for each experiment (but can vary from one experiment to another, 10-20). Files are typically 700-1000 records long with a header row. The record format is tab-separated (see the sample, which is 4 integers, 3 strings, and 10 floats).
I need to sort my list into categories. In a 1000 line file, I could have 4-256 categories. Rather than trying to pre-determine how many categories each file has, I'm using the code below to count them. The integers at the beginning of each line dictate what category the float values in the row correspond to. Integer combinations can be modified by the string values to produce wildly different results, and multiple combinations can sometimes be lumped together.
Once they're in categories, number crunching begins. I get statistical info (mean, sd, etc. for each category for each file).
The essentials:
I need to parse data like the sample below into categories. Categories are combos of the non-float fields in each record. I'm also trying to come up with a dynamic (graphical) way to associate column combinations with categories; I will make a new post for this.
I'm looking for suggestions on how to do both.
# data is a list of tab separated records
# fields is a list of my field names
# get a list of fieldtypes via gettype on our first row
# gettype is a function to get type from string without changing data
fieldtype = [gettype(n) for n in data[1].split('\t')]
# get the indexes for fields that aren't floats
mask = [i for i, field in enumerate(fieldtype) if field!="float"]
# for each row of data[skipping first and last empty lists] we split(on tabs)
# and take the ith element of that split where i is taken from the list mask
# which tells us which fields are not floats
records = [[row.split('\t')[i] for i in mask] for row in data[1:-1]]
# we now get a unique set of combos
# since set doesn't happily take a list of lists, we join each row of values
# together in a comma seperated string. So we end up with a list of strings.
uniquerecs = set([",".join(row) for row in records])
print len(uniquerecs)
quit()
def gettype(s):
    try:
        int(s)
        return "int"
    except ValueError:
        pass
    try:
        float(s)
        return "float"
    except ValueError:
        return "string"
Sample Data:
field0 field1 field2 field3 field4 field5 field6 field7 field8 field9 field10 field11 field12 field13 field14 field15
10 0 2 1 Right Right Right 5.76765674196 0.0310912272139 0.0573603238282 0.0582901376612 0.0648936500524 0.0655294305058 0.0720571099855 0.0748289246137 0.446033755751
3 1 3 0 Left Left Right 8.00982745764 0.0313840132052 0.0576521406854 0.0585844966069 0.0644905497442 0.0653386429438 0.0712603578765 0.0740345755708 0.2641076191
5 19 1 0 Right Left Left 4.69440026591 0.0313852052224 0.0583165354345 0.0592403274967 0.0659404609478 0.0666070804916 0.0715314027001 0.0743022054775 0.465994962101
3 1 4 2 Left Right Left 9.58648184552 0.0303649003017 0.0571579895338 0.0580911765412 0.0634304670863 0.0640132919609 0.0702920967445 0.0730697946335 0.556525293
9 0 0 7 Left Left Left 7.65374257547 0.030318719717 0.0568551744109 0.0577785415066 0.0640577002605 0.0647226582655 0.0711459854908 0.0739256050784 1.23421547397
Not sure if I understand your question, but here are a few thoughts:
For parsing the data files, you usually use the Python csv module.
For categorizing the data you could use a defaultdict with the non-float fields joined as a key for the dict. Example:
from collections import defaultdict
import csv

reader = csv.reader(open('data.file', 'rb'), delimiter='\t')
data_of_category = defaultdict(list)
lines = [line for line in reader]
mask = [i for i, n in enumerate(lines[1]) if gettype(n) != "float"]
for line in lines[1:]:
    category = ','.join([line[i] for i in mask])
    data_of_category[category].append(line)
This way you don't have to calculate the categories in the first place and can process the data in one pass.
And I didn't understand the part about "a dynamic (graphical) way to associate column combinations with categories".
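Once the rows are grouped this way, the per-category number crunching is straightforward. A minimal sketch using Python 3's statistics module, assuming data_of_category was built as above and that the float fields start at column index 7 as in the sample data:
from statistics import mean, stdev

for category, rows in data_of_category.items():
    # transpose so we get one sequence of values per float column
    float_columns = zip(*[[float(v) for v in row[7:]] for row in rows])
    for col_index, values in enumerate(float_columns, start=7):
        col_sd = stdev(values) if len(values) > 1 else 0.0
        print(category, col_index, mean(values), col_sd)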
For at least part of your question, have a look at Named Tuples
Step 1: Use something like csv.DictReader to turn the text file into an iterable of rows.
Step 2: Turn that into a dict of first entry: rest of entries.
with open("...", "rb") as data_file:
    lines = csv.reader(data_file, some_custom_dialect)
    categories = {line[0]: line[1:] for line in lines}
Step 3: Iterate over the items() of the data and do something with each line.
for category, line in categories.items():
    do_stats_to_line(line)
Some useful answers already but I'll throw mine in as well. Key points:
Use the csv module
Use collections.namedtuple for each row
Group the rows using a tuple of int field values as the key
If your source rows are sorted by the keys (the integer column values), you could use itertools.groupby. This would likely reduce memory consumption. Given your example data, and the fact that your files contain >= 1000 rows, this is probably not an issue to worry about.
import csv
from collections import defaultdict, namedtuple

def coerce_to_type(value):
    _types = (int, float)
    for _type in _types:
        try:
            return _type(value)
        except ValueError:
            continue
    return value

def parse_row(row):
    return [coerce_to_type(field) for field in row]

with open(datafile) as srcfile:
    data = csv.reader(srcfile, delimiter='\t')

    ## Read headers, create namedtuple
    headers = srcfile.next().strip().split('\t')
    datarow = namedtuple('datarow', headers)

    ## Wrap with parser and namedtuple
    data = (parse_row(row) for row in data)
    data = (datarow(*row) for row in data)

    ## Group by the leading integer columns
    grouped_rows = defaultdict(list)
    for row in data:
        integer_fields = [field for field in row if isinstance(field, int)]
        grouped_rows[tuple(integer_fields)].append(row)

    ## DO SOMETHING INTERESTING WITH THE GROUPS
    import pprint
    pprint.pprint(dict(grouped_rows))
EDIT You may find the code at https://gist.github.com/985882 useful.
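For completeness, here is the itertools.groupby variant mentioned above. It is only a sketch and assumes the namedtuple generator (data) from the previous block; groupby only merges consecutive rows, so the data must be sorted by the key first:
from itertools import groupby

def int_key(row):
    # same grouping key as above: all integer-typed fields in the row
    return tuple(field for field in row if isinstance(field, int))

rows = sorted(data, key=int_key)  # data: the namedtuple generator built above
for key, group in groupby(rows, key=int_key):
    group = list(group)
    print(key, len(group))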
