How to identify a pattern using Pandas on similar row names - python

I am importing an Excel file with somewhat similar vendor names, using the agg function to sum the spend, and then sorting by spend. Eventually, this DataFrame feeds a dynamic Bokeh plot.
Some vendor names differ only slightly because of text formatting, and my pandas DataFrame does not recognize them as the same vendor when summing the spend. Even though it is the same vendor, I am not getting a holistic view of spend: some data goes missing and the totals in the Bokeh plot end up wrong.
Data
Vendor Site Spend
ABC INC A 300
ABC,Inc B 100
ABC,Inc. C 50
ABC,INC. D 10
Expected Result
All the data should add up to 460.

You could normalize punctuation, spaces, and case before computing the sum, but note that this changes the name of your Vendor in the output:
df.groupby([x.upper().replace(' ', '').replace(',','').replace('.','') for x in df['Vendor']])['Spend'].sum()
ABCINC 460
You could also modify the Vendor column in place before calling groupby (regex=False keeps the '.' from being treated as a regular expression on older pandas versions):
df['Vendor'] = df['Vendor'].str.upper().str.replace(' ', '').str.replace(',', '').str.replace('.', '', regex=False)
print(df.groupby('Vendor')['Spend'].sum())
The df now looks like:
Vendor Site Spend
0 ABCINC A 300
1 ABCINC B 100
2 ABCINC C 50
3 ABCINC D 10
and the output:
ABCINC 460
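A more compact variant (a sketch, not part of the original answer) strips everything that is not a letter or digit with a single regex and groups on the normalized name without overwriting the original column:
import pandas as pd

df = pd.DataFrame({'Vendor': ['ABC INC', 'ABC,Inc', 'ABC,Inc.', 'ABC,INC.'],
                   'Site': ['A', 'B', 'C', 'D'],
                   'Spend': [300, 100, 50, 10]})

# normalize: uppercase, then drop anything that is not A-Z or 0-9
clean = df['Vendor'].str.upper().str.replace(r'[^A-Z0-9]', '', regex=True)
print(df.groupby(clean)['Spend'].sum())
# ABCINC    460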

Related

Processing dataframe with conditionals, using df.apply

I have a catalog of trees, which I've imported into a dataframe. It looks like this:
>>> df
ID Tree Zone Temp_Limit Grade
0 1 Apple 1 21 A
1 2 Apple 1 21 B
2 3 Orange 3 28 B
3 4 Pear 2 26 A
4 5 Apple 4 24 C
The idea is that depending on the type of tree, zone, and temp_limit, the dosage for irrigation, fertilizers, estimated transplant date, etc. would be calculated. Those would be additional columns in the dataframe.
The problem is that the formulas are conditional. It's not just "multiply temp limit by 5 and add 4", more like "if it's an apple tree in zone 2, apply this formula, if it's an orange tree in zone 1, formula goes like this... etc"
And to make things a bit more complicated, there might be rows that have an ID and a Tree type but no other data; those correspond to trees that haven't been delivered yet.
My current solution is to use df.apply and have a function to do the conditionals and skip the blank rows:
def calculate_irrigation(species, zone, templimit, grade):
    if species.lower() == "apple":
        if zone == 3:
            # etc etc etc
            ...

df['irrigation'] = df.apply(lambda x: calculate_irrigation(x['Tree'], x['Zone'], x['Temp_Limit'], x['Grade']), axis=1)
Question: is a Dataframe and df.apply the best solution for this? I used a df because it adapts very well to the data I'm working with and getting the data in there is pretty straightforward. Plus exporting the final results is easy. But when you have to do different operations based on values, and have to start putting functions in there, it makes you wonder if there's a better way you're not seeing.
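One common alternative worth considering (a sketch on my part, not from the original thread) is to express each rule as a boolean mask and combine them with numpy.select, which keeps the branching vectorized and leaves unmatched or incomplete rows as NaN; the formulas below are purely illustrative:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Tree': ['Apple', 'Apple', 'Orange', 'Pear', 'Apple'],
                   'Zone': [1, 1, 3, 2, 4],
                   'Temp_Limit': [21, 21, 28, 26, 24],
                   'Grade': ['A', 'B', 'B', 'A', 'C']})

# one boolean mask per rule (hypothetical rules, for illustration only)
conditions = [
    (df['Tree'].str.lower() == 'apple') & (df['Zone'] == 1),
    (df['Tree'].str.lower() == 'orange') & (df['Zone'] == 3),
]
# the formula that applies when the matching mask is True
choices = [
    df['Temp_Limit'] * 5 + 4,
    df['Temp_Limit'] * 3,
]
# rows matching no rule (e.g. undelivered trees) fall back to NaN
df['irrigation'] = np.select(conditions, choices, default=np.nan)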

How do I merge or attach a characteristic when the key I'm merging on isn't unique?

I have two different CSVs with different information that I need. The first has an account number, a ticker (mutual funds), and a dollar amount. The second has a list of Tickers and their classification (Stock, bond, etc.) I want to merge the two on the Ticker so that I have the account number, Ticker, classification, and dollar amount all together. Several of the account numbers hold the same funds, meaning the ticker will be used multiple times. When I try merging I get rows that duplicate and a lot of missing information.
I tried merging with inner and on the left. I tried making the second CSV a dictionary to reference. I attempted a for loop with a lambda, but I'm pretty new to this so that didn't go well. I also tried to groupby account number and ticker before merging, but that didn't work either. The columns I'm trying to merge on have the same datatype; I tried with both object and string types.
pd.merge(df1, df2, on='Ticker', how='inner')
expected: (each account number may have 5 unique tickers)
A B C D
1 a bond 500
1 b stock 100
1 c bond 250
2 a bond 300
2 b stock 400
what I get:
A B C D
1 a bond 500
1 a bond 500
1 a bond 500
2 a bond 300
2 a bond 300
It seems to overwrite all the unique rows for the account number with the first row.
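As a minimal sketch of what the merge should look like (column names here are placeholders, and I'm assuming the classification file should have exactly one row per ticker), the usual fix for the duplicated rows described above is to de-duplicate the lookup table before merging:
import pandas as pd

df1 = pd.DataFrame({'Account': [1, 1, 1, 2, 2],
                    'Ticker': ['a', 'b', 'c', 'a', 'b'],
                    'Amount': [500, 100, 250, 300, 400]})
df2 = pd.DataFrame({'Ticker': ['a', 'a', 'b', 'c'],
                    'Class': ['bond', 'bond', 'stock', 'bond']})

# make the lookup table unique per ticker so the merge does not multiply rows
df2 = df2.drop_duplicates(subset='Ticker')
print(pd.merge(df1, df2, on='Ticker', how='left'))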

Adding the quantities of products in a dataframe column in Python

I'm trying to calculate the sum of weights in a column of an excel sheet that contains the product title with the help of Numpy/Pandas. I've already managed to load the sheet into a dataframe, and isolate the rows that contain the particular product that I'm looking for:
dframe = xlsfile.parse('Sheet1')
dfFent = dframe[dframe['Product:'].str.contains("ABC") == True]
But I can't seem to find a way to sum up the weights, due to the complexity of the problem (as shown below). For example, the column 'Product Title' contains values like -
1 gm ABC
98% pure 12 grams ABC
0.25 kg ABC Powder
ABC 5gr
where ABC is the product whose weight I'm looking to add up. Is there any way that I can add these weights all up to get a total of 268 gm? Any help or resources pointing to the solution would be highly appreciated. Thanks! :)
You can use extractall for values with units or percentage:
(?P<a>\d+\.\d+|\d+) means extract float or int to column a
\s* - is zero or more spaces between number and unit
(?P<b>[a-z%]+) is extract lowercase unit or percentage after number to b
#add all possible units to dictionary
d = {'gm':1,'gr':1,'grams':1,'kg':1000,'%':.01}
df1 = df['Product:'].str.extractall('(?P<a>\d+\.\d+|\d+)\s*(?P<b>[a-z%]+)')
print (df1)
            a      b
  match
0 0         1     gm
1 0        98      %
  1        12  grams
2 0      0.25     kg
3 0         5     gr
Then convert the first column to numeric and map the second by the dictionary of units. Then reshape with unstack, multiply the columns together with prod, and finally sum:
a = df1['a'].astype(float).mul(df1['b'].map(d)).unstack().prod(axis=1).sum()
print (a)
267.76
Similar solution:
a = df1['a'].astype(float).mul(df1['b'].map(d)).prod(level=0).sum()
You need to do some data wrangling to get the column into a consistent format. You could do some matching to get the Product column aligned and consistent, similar to date-time formatting.
For example, you could do the following (a sketch follows below):
1. Make a separate column with only the numeric values (as floats)
2. Convert % values to decimals and multiply them by the quantity
3. Convert kg values to grams
4. Sum the resulting float-only column, with no strings left in it
Pandas can handle this problem well.
Note: there is no shortcut here; you need to get rid of the strings mixed in with the decimal values before you can calculate the sum.
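A rough sketch of that wrangling, using the same unit set as the dictionary above (column and variable names are just illustrative):
import pandas as pd

df = pd.DataFrame({'Product:': ['1 gm ABC',
                                '98% pure 12 grams ABC',
                                '0.25 kg ABC Powder',
                                'ABC 5gr']})

# step 1: pull every number/unit pair into its own columns
parts = df['Product:'].str.extractall(r'(?P<qty>\d+\.?\d*)\s*(?P<unit>[a-z%]+)')
parts['qty'] = parts['qty'].astype(float)

# steps 2 and 3: % becomes a decimal factor, kg becomes 1000 grams, gm/gr/grams stay as-is
factor = {'gm': 1, 'gr': 1, 'grams': 1, 'kg': 1000, '%': 0.01}
parts['grams'] = parts['qty'] * parts['unit'].map(factor)

# step 4: combine the factors within each original row, then total the float-only column
total = parts['grams'].groupby(level=0).prod().sum()
print(total)   # 267.76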

python pandas - map using 2 columns as reference

I have 2 txt files I'd like to read into python: 1) a map file, 2) a data file. I'd like a lookup table or dictionary that reads the values from TWO COLUMNS of the map and determines which value to put in the 3rd column, using something like the pandas map function. The real map file is ~700,000 lines, and the real data file is ~10 million lines.
Toy Dataframe (or I could recreate as a dictionary) - Map
Chr Position Name
1 1000 SNPA
1 2000 SNPB
2 1000 SNPC
2 2000 SNPD
Toy Dataframe - Data File
Chr Position
1 1000
1 2000
2 1000
2 2001
Resulting final table:
Chr Position Name
1 1000 SNPA
1 2000 SNPB
2 1000 SNPC
2 2001 NaN
I found several questions about this with only a one-column lookup: Adding a new pandas column with mapped value from a dictionary. But I can't seem to find a way to use two columns. I'm also open to other packages that may handle genomic data.
As a bonus second question, it'd also be nice if there was a way to map the 3rd column if the position is within a certain amount of the mapped value. In other words, row 4 of the resulting table above would map to SNPD, as it's only 1 away. But I'd be happy to just get the solution for the above.
I would do it this way:
read your map data so that the first two columns become the index:
dfm = pd.read_csv('/path/to/map.csv', delim_whitespace=True, index_col=[0,1])
change delim_whitespace=True to sep=',' if you have , as a delimiter
read your data file into a DataFrame (setting the same index):
df = pd.read_csv('/path/to/data.csv', delim_whitespace=True, index_col=[0,1])
join your DFs:
df.join(dfm)
Output:
In [147]: df.join(dfm)
Out[147]:
Name
Chr Position
1 1000 SNPA
2000 SNPB
2 1000 SNPC
2001 NaN
PS: for the bonus question, try something like the sketch below.
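This is my guess at what was intended: a sketch using pandas.merge_asof to match the nearest Position within the same Chr, with a tolerance of 1:
import pandas as pd

dfm = pd.DataFrame({'Chr': [1, 1, 2, 2],
                    'Position': [1000, 2000, 1000, 2000],
                    'Name': ['SNPA', 'SNPB', 'SNPC', 'SNPD']})
df = pd.DataFrame({'Chr': [1, 1, 2, 2],
                   'Position': [1000, 2000, 1000, 2001]})

# both frames must be sorted by the merge key for merge_asof
dfm = dfm.sort_values('Position')
df = df.sort_values('Position')

# match the nearest Position within the same Chr, allowing it to be off by at most 1
result = pd.merge_asof(df, dfm, on='Position', by='Chr',
                       direction='nearest', tolerance=1)
print(result.sort_values(['Chr', 'Position']))
# the (2, 2001) row now picks up SNPD, since it is within the tolerance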

random sampling with pandas dataframe

I'm relatively new to pandas (and python... and programming) and I'm trying to do a Monte Carlo simulation, but I have not been able to find a solution that runs in a reasonable amount of time.
The data is stored in a data frame called "YTDSales" which has sales per day, per product
Date Product_A Product_B Product_C Product_D ... Product_XX
01/01/2014 1000 300 70 34500 ... 780
02/01/2014 400 400 70 20 ... 10
03/01/2014 1110 400 1170 60 ... 50
04/01/2014 20 320 0 71300 ... 10
...
15/10/2014 1000 300 70 34500 ... 5000
and what I want to do is simulate different scenarios, using for the rest of the year (from October 15 to year end) the historical distribution that each product had. For example, with the data presented, I would like to fill the rest of the year with sales between 20 and 1100.
What I've done is the following
# creates range of "future dates"
last_historical = YTDSales.index.max()
year_end = dt.datetime(2014,12,30)
DatesEOY = pd.date_range(start=last_historical,end=year_end).shift(1)
# function that obtains a random sales number per product, between max and min
f = lambda x:np.random.randint(x.min(),x.max())
# create all the "future" dates and fill it with the output of f
for i in DatesEOY:
    YTDSales.loc[i] = YTDSales.apply(f)
The solution works, but takes about 3 seconds, which is a lot if I plan to run 1,000 iterations... Is there a way to avoid iterating?
Thanks
Use the size option for np.random.randint to get a sample of the needed size all at once.
One approach that I would consider, briefly, is as follows.
Allocate the space you'll need into a new array that will have index values from DatesEOY, columns from the original DataFrame, and all NaN values. Then concatenate onto the original data.
Now that you know the length of each random sample you'll need, use the extra size keyword in numpy.random.randint to sample all at once, per column, instead of looping.
Overwrite the data with this batch sampling.
Here's what this could look like:
new_df = pandas.DataFrame(index=DatesEOY, columns=YTDSales.columns)
num_to_sample = len(new_df)
f = lambda x: np.random.randint(x[1].min(), x[1].max(), num_to_sample)
output = pandas.concat([YTDSales, new_df], axis=0)
# materialize the per-column samples into an array, then write them into the new rows
output.iloc[len(YTDSales):] = np.asarray([f(col) for col in YTDSales.items()]).T
Along the way, I choose to make a totally new DataFrame, by concatenating the old one with the new "placeholder" one. This could obviously be inefficient for very large data.
Another way to approach is setting with enlargement as you've done in your for-loop solution.
I did not play around with that approach long enough to figure out how to "enlarge" batches of indexes all at once. But if you figure that out, you can just "enlarge" the original data frame with NaN values (at the index values from DatesEOY), and then apply the function above to YTDSales instead of bringing output into it at all. A sketch of that idea follows.
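Here is a rough sketch of that enlargement idea, under my own assumptions (toy stand-ins for YTDSales and DatesEOY, and reindex used to grow the frame in one step):
import numpy as np
import pandas as pd

# toy stand-ins for the question's YTDSales and DatesEOY
YTDSales = pd.DataFrame({'Product_A': [1000, 400, 1110, 20],
                         'Product_B': [300, 400, 320, 300]},
                        index=pd.date_range('2014-01-01', periods=4))
DatesEOY = pd.date_range(start=YTDSales.index.max(), end='2014-12-30').shift(1)

# grow the frame to cover the future dates in a single reindex call
enlarged = YTDSales.reindex(YTDSales.index.union(DatesEOY))

# fill each product's future block in one shot using randint's size argument
for col in YTDSales.columns:
    low, high = YTDSales[col].min(), YTDSales[col].max()
    enlarged.loc[DatesEOY, col] = np.random.randint(low, high, size=len(DatesEOY))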
