I'm an absolute novice here, but I know my way around some NetSuite formulas and data concepts. I have a report that lists all PACK SKUs (kit/bundle items) and their component SKUs, along with the quantity of each component that makes up the PACK. This is similar to a BOM.
I have used the DECODE function in other saved searches to transpose data into columns, but I'm super confused about how to achieve this for packs/components.
There is no individual identifier such as component (a), component (b) that I can use. These are also not under {memberitems}; it looks like the company restructured this data to sit under a PACK/Component structure.
Is there any way I can get the many components automatically listed in columns, with their quantities next to them? Or is there a Python script or an Excel macro that might be able to assist with 800k rows? Any help would be highly appreciated.
[data snapshot]
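If exporting the report and post-processing it outside NetSuite is acceptable, pandas can pivot the components into columns, and 800k rows is well within its comfort zone. A minimal sketch; the file name and the column names ("pack_sku", "component_sku", "quantity") are assumptions to rename to match the actual export:

import pandas as pd

df = pd.read_csv("pack_components.csv")  # assumed columns: pack_sku, component_sku, quantity

# Number each component within its pack (1, 2, 3, ...) so the position
# can serve as a column header.
df["slot"] = df.groupby("pack_sku").cumcount() + 1

# Pivot to one row per pack, with component/quantity pairs side by side.
wide = df.pivot(index="pack_sku", columns="slot", values=["component_sku", "quantity"])
wide.columns = [f"{field}_{slot}" for field, slot in wide.columns]
wide.reset_index().to_csv("packs_wide.csv", index=False)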
I have a folder of 310 .csv files. Here is a sample of what the contents look like.
I need to create a program that goes through all the files, lists each file name, and then lists the top 4 values from the table along with the x-value associated with each. Ideally this would all be saved to a text file, but as long as it prints in a readable format, that would work.
So, what is stopping you? Loop through the files and use pandas.read_csv to import each CSV file, merging/joining them into one DataFrame if needed. Use slicing to select the top 4 rows, and you can always print/visualize anything directly in a Jupyter Notebook. Exporting can be done using df.to_csv or any other method, and if you need a short introduction to pandas, look here.
Keep in mind that it is always a good idea to include a Minimal, Reproducible Example, especially for a complicated merge operation between many DataFrames; this could help you a lot. However, there is no way around some research.
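To make that concrete, here is a minimal sketch; the folder name and the column names ("x", "value") are assumptions to adjust to the real files:

import pandas as pd
from pathlib import Path

with open("summary.txt", "w") as out:
    for path in sorted(Path("csv_folder").glob("*.csv")):
        df = pd.read_csv(path)
        top4 = df.nlargest(4, "value")  # rows holding the four largest values
        out.write(path.name + "\n")
        for _, row in top4.iterrows():
            out.write(f"  x={row['x']}, value={row['value']}\n")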
I am a beginner Python programmer and have been using a module called "pybaseball" to analyze sabermetrics data. When using this module, I came across a problem when trying to retrieve information. The module reads a CSV file from a baseball stats site and outputs it as a table for ease of use, but the problem is that some of the information is not shown and is instead replaced with a "...". An example of this is shown:
from pybaseball import batting_stats_range
data = batting_stats_range('2017-05-01', '2017-05-08')
print(data.head())
I should be getting:
https://github.com/jldbc/pybaseball#batting-stats-hitting-stats-for-players-within-seasons-or-during-a-specified-time-period
But the information is cut off from 'TM' all the way to 'CS' and replaced with a "..." in my output. Can someone explain to me why this happens and how I can prevent it?
As the docs state, head() is meant for "quickly testing if your object has the right type of data in it." So it is expected that some data may not show, because the display is collapsed.
If you need to analyze the data in more detail, you can access specific columns with other methods.
For example, using .iloc. You can read more about it here, but essentially you can "ask" for a slice of those columns and then apply a new slice to get only n rows.
Another example would be .loc (docs here). The main difference is that .loc uses labels (column names) to filter data instead of the numerical positions of columns. You can filter down to a subset of specific columns and then take a sample of rows from that.
So, to answer your question: "..." is pandas's way of collapsing data in order to give a prettier view of the results.
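A minimal sketch of both selection styles, plus the display options that stop the collapsing in the first place; the column labels 'Name' and 'HR' are assumptions taken from the pybaseball README, so check data.columns for the real ones:

import pandas as pd
from pybaseball import batting_stats_range

data = batting_stats_range('2017-05-01', '2017-05-08')

# Positional selection with .iloc: first 5 rows, first 10 columns.
print(data.iloc[:5, :10])

# Label-based selection with .loc (assumed column names).
print(data.loc[:, ['Name', 'HR']].head())

# Or raise pandas's display limits so wide frames are no longer
# collapsed into "..." when printed.
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
print(data.head())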
I have multiple Excel files with different sheets in each file. These files were made by people, so each one has a different format, a different number of columns, and a different structure for representing the data.
For example, in one sheet the dataframe/table starts at the 8th row, second column; in another it starts at row 122, etc.
I want to retrieve something these Excel files have in common: variable names and information.
However, I don't know how I could retrieve all this information without parsing each individual file by hand. That is not an option, because there are a lot of these files, with lots of sheets in each one.
I have been thinking about using regex as well as edit distance between words, but I don't know if that is the best option.
Any help is appreciated.
I will divide my answer into what I think you can do now, and suggestions for the future (if feasible).
An attempt to "solve" the problem you have with existing files.
Without regularity in your input files (such as at least a common column name), I think what you're describing is among the best solutions. Having said that, perhaps a "fancier" similarity metric between column names would be more useful than regular expressions.
If you believe that there will be some regularity in the column names, you could look at string distances such as the Hamming distance or the Levenshtein distance, and use a threshold on the distance that works for you. As an example, say you have a function d(a: str, b: str) -> float that calculates a distance between column names; you could do something like this:
# this variable is a small sample of "expected" column names
plausible_columns = [
    'interesting column',
    'interesting',
    'interesting-column',
    'interesting_column',
]

for f in excel_files:
    # process the file until you find columns;
    # I'm assuming you can put the column names into
    # a variable `columns` here
    for c in columns:
        for p in plausible_columns:
            if d(c, p) < threshold:
                # do something to process the column:
                # add to a pandas DataFrame, calculate the mean, etc.
                ...
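For a concrete d, a stand-in from the standard library works; difflib's ratio is not the Levenshtein distance proper, but it behaves comparably for this kind of fuzzy column matching, and the threshold below is just a starting point to tune on a manually inspected sample:

import difflib

def d(a: str, b: str) -> float:
    # Distance in [0, 1]: 0.0 for identical strings, 1.0 for no similarity.
    return 1.0 - difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

threshold = 0.25

print(d("interesting column", "interesting_column"))  # small distance: match
print(d("interesting column", "total revenue"))       # large distance: skip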
If the data itself can tell you something about whether you should process it (such as having a particular distribution, or falling in a particular range), you can use such features to decide whether to use that column or not. Even better, you can combine several of these characteristics to make a finer decision.
Having said this, I don't think a fully automated solution exists without manually inspecting some of the data and studying the distribution of the data, the variability in the column names, etc.
For the future
Even with fancy methods to calculate features and do some analysis on the data you have right now, I think it would be impossible to ensure that you always get the data you need (by the very nature of the problem). A reasonable way to solve this, in my opinion (and if it is feasible in whatever context you're working in), is to impose a stricter format at the data-generation end; I suppose this is a manual process, with people entering data into Excel directly. I would argue that the best solution is to get rid of the problem at the root: create a unified form or Excel sheet format and distribute it to the people who will fill it with data, so that the data can be ingested automatically and the risk of errors is minimized.
In my Excel file, I have a list of some 7,000-8,000 binary chemical compounds (each consisting of 2 elements only).
I have segregated them into their component elements, i.e., I have 2 columns of elements, namely First Element and Second Element.
I have attached a screenshot below:
Now I want to fill in the respective Atomic Number and Atomic Weight beside every element as per a predefined list using Python.
How do I do that?
I have attached a screenshot of my predefined list below, as well:
People have told me things like "use the csv package" or "use the pandas package", but I would appreciate some more procedural help with respect to those packages, or any other method you might suggest.
Also, if it cannot be done via Python, I am open to other languages as well.
I noticed that your task does not actually require Python programming, for two reasons:
You already have a predefined list of items stored in an Excel sheet.
Excel already has a built-in function (VLOOKUP) for this task.
We just have to use the VLOOKUP function in the Atomic Number and Atomic Weight columns (you have to create those columns in the data2 sheet), and it will take care of searching for a particular element's atomic number and weight and returning it in the active cell. For example, something like =VLOOKUP(A2, Sheet2!$A:$C, 2, FALSE) looks up the value in A2 in the first column of the range and returns the matching value from its second column (the sheet and range here are placeholders for wherever your predefined list lives).
Next, use the fill handle to apply the function to all the cells (or, if the data is in a table, great: no need for the fill handle, because a table automatically applies the formula to the whole column range).
I expect that you already know how to work with Excel formulas and functions; if not, comment down below for further assistance. Kindly upvote the answer if you liked it.
NOTE: If you need automation, then be sure to check out Excel VBA, Google Sheets, and Apps Script.
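If you do want to stay in Python as originally asked, pandas can do the same lookup with a merge. A minimal sketch; the file, sheet, and column names are all assumptions to adjust to the real workbook:

import pandas as pd

compounds = pd.read_excel("compounds.xlsx", sheet_name="data2")     # First Element, Second Element
elements = pd.read_excel("compounds.xlsx", sheet_name="elements")   # Element, Atomic Number, Atomic Weight

# Join the predefined list onto each element column in turn,
# renaming so the new columns say which element they describe.
for col in ["First Element", "Second Element"]:
    lookup = elements.rename(columns={
        "Element": col,
        "Atomic Number": col + " Atomic Number",
        "Atomic Weight": col + " Atomic Weight",
    })
    compounds = compounds.merge(lookup, on=col, how="left")

compounds.to_excel("compounds_filled.xlsx", index=False)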
I am currently using
tCursors = tuple(cursors)
for row in myTable.GetRows(tCursors):
    for i in range(len(cursors)):
        curValue = cursors[i].CurrentValue
to get values from a data table in Spotfire. The tuple contains the cursors for all of the columns I am interested in. Is there a faster way to obtain multiple values from the data table, possibly by loading them into an array, or by loading the entire data table into a 2D array?
Sorry that this is vague; I am just not sure what options I have for getting the data from the Spotfire table into Python in a usable format. Thank you.
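For the 2D-array part, one option that stays within the same API calls already used above is to collect everything in a single pass; a minimal sketch, assuming myTable and cursors are set up as in the snippet:

# Build a list of lists (rows x columns) in one pass over the table.
tCursors = tuple(cursors)
data = [[cur.CurrentValue for cur in cursors]
        for row in myTable.GetRows(tCursors)]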
If you want to process a large data set, you should use R (i.e., Spotfire data functions).
IronPython is good for manipulating Spotfire elements, thanks to the API's access to the underlying C# objects.
R is good for manipulating data: you can easily perform calculations, aggregations, pivots, etc. It is much faster than IronPython for operations on data.
Here are some great resources to help you start with R:
The official R documentation
Documentation to manipulate data
R is widely used, so you can easily find chunks of code on the web to achieve what you want to do.