Pandas dataframe query not finding value - python

I have an issue where a value I'm searching for doesn't appear to exist in the dataframe, and I can't understand why.
I started by combining a couple of numerical columns in my original dataframe and assigning the result to a new column. Then I extracted a list of unique values from that combined column:
~snip
self.df['Flitch'] = self.df['Blast'].map(
    str) + "-" + self.df['Bench'].map(str)
self.flitches = self.df['Flitch'].unique()
~snip
Now, slightly further on in the code, I need to get the earliest date values corresponding to these unique identifiers, so I run a query on the dataframe:
~snip
def get_dates(self):
    '''Extracts mining and loading dates from filtered dataframe'''
    loading_date, mining_date = [], []
    # loop through all unique flitch ids and get their mining
    # and loading dates
    for flitch in self.flitches:
        temp = self.df.query('Activity=="MINE"')
        temp = temp.query(f'Flitch=={flitch}')
        mining = temp['PeriodStartDate'].min()
        mining_date.append(mining)
~snip
...and I get nothing, and I can't understand why. I mean, I'm comparing data extracted from a column against that same column, and I'm not getting any matches.
I've manually checked that the list of unique ids is populated correctly.
I've checked that the dataframe I'm running the query on does indeed have those same flitch ids.
I've manually tested several random values from the self.flitches list, and the comparison comes back False every time.
Before I combined those two columns, when I used only 'Blast' as the identifier, everything worked perfectly, but now I'm not sure what is happening.
Here for example I've printed the self.flitches list:
['5252-528' '5251-528' '3030-492' '8235-516' '7252-488' '7251-488'
'2351-588' '5436-588' '1130-624' '5233-468' '1790-516' '6301-552'
'6302-552' '5444-576' '2377-564' '2380-552' '2375-564' '5253-528'
'2040-468' '2378-564' '1132-624' '1131-624' '6314-540' '7254-488'
'7253-488' '8141-480' '7250-488']
And here is data from self.df['Flitch'] column:
173 5252-528
174 5251-528
175 5251-528
176 5251-528
177 5251-528
178 5251-528
180 3030-492
181 3030-492
182 3030-492
183 3030-492
...
It looks like they have to match but they don't...
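One thing worth checking (my observation about the code above, not a confirmed fix): with an f-string and no quotes, temp.query(f'Flitch=={flitch}') expands to Flitch==5252-528, and query evaluates the right-hand side as integer subtraction rather than a string, so nothing can match. A minimal sketch of the difference, assuming string-valued ids like the ones printed above:

import pandas as pd

df = pd.DataFrame({'Flitch': ['5252-528', '5251-528'],
                   'Activity': ['MINE', 'MINE']})
flitch = '5252-528'

# Unquoted interpolation: query sees Flitch==5252-528 and evaluates the
# right-hand side as 5252 minus 528, so the string column never matches.
print(df.query(f'Flitch=={flitch}'))  # empty frame

# Quoting the interpolated value, or using @-substitution, compares strings.
print(df.query(f'Flitch=="{flitch}"'))
print(df.query('Flitch == @flitch'))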

How to append from a dataframe to a list?

I have a dataframe called pop_emoj that has two columns (one for the emoji, and one for the emoji count) as seen below.
☕ 585
🌭 193
🌮 186
🌯 85
🌰 53
🌶 124
🌽 138
🍄 46
🍅 170
🍆 506
I have sorted the df based on the counts in descending order as seen below.
emoji_updated = pop_emoj.head(105).sort_values(ascending=False)
🍻 1809
🎂 1481
🍔 1382
🍾 1078
🥂 1028
And I'm trying to append the top n emojis to a new list called top_list, but I am getting stuck. Here is my code so far:
def top_number_of_emojis(n):
    top_list = []
    top_list = emoji_updated[0].tolist()
    return top_list
I want to take all of column 0 (the emojis) and append them to top_list. The output of top_number_of_emojis should look like this:
top_number_of_emojis(1) == ['🍻']
top_number_of_emojis(2) == ['🍻', '🎂']
top_number_of_emojis(3) == ['🍻', '🎂', '🍔']
If you already have the top 5 emojis, you just need to save them to a list. An option for that is iterrows:
top_list = []
for idx, row in emoji_updated.iterrows():  # iterrows yields (index, row) pairs
    top_list.append(row[0])  # column 0 holds the emoji
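Separately (my observation, not part of the original answer), head(105) runs before sort_values in the question's code, which trims the data before ranking it; sorting first avoids dropping high-count emojis. A sketch, assuming pop_emoj has integer column labels with the emoji in column 0 and the count in column 1:

def top_number_of_emojis(n):
    # rank by the count column first, keep the n largest rows,
    # then return just the emoji column as a list
    top = pop_emoj.sort_values(by=1, ascending=False).head(n)
    return top[0].tolist()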

Trouble setting table width using Python Docx

I have to create numerous PDFs of tables every year, so I was trying to write a script with python-docx to create the tables in Word, with each column having its own set width and left or right alignment. Right now I am working on two tables: one with 262 rows, the other with 1036 rows.
The code works great for the table with 262 rows, but will not set the column widths correctly for the table with 1036 rows. Since the code is identical for both tables, I am thinking it is a problem with the data itself, or possibly the size of the table. I tried creating the second, larger table without any data, and the widths were correct. I then tried creating the table below with a subset of 7 rows out of the 1036, including the rows with the largest numbers of characters, in case a column was not wrapping the text but instead forcing the column widths to change. It runs fine and the widths are correct. I use the exact same code on the full data set of 1036 rows, and the widths change. Any ideas?
Below is the code for only 7 rows of data. It works correctly: the first column is 3.5 inches, the second and third columns are 1.25 inches.
from docx import Document
from docx.shared import Cm, Pt, Inches
import docx
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.enum.table import WD_ALIGN_VERTICAL

year = '2019'
set_width_dic = {
    'PUR' + year + '_subtotals_indexed_by_site.txt': (Inches(3.50), Inches(1.25), Inches(1.25)),
    'PUR' + year + '_subtotals_indexed_by_chemical.txt': (Inches(3.50), Inches(1.25), Inches(1.25))}
set_alignment_dic = {
    'PUR' + year + '_subtotals_indexed_by_site.txt': [
        [0, WD_ALIGN_PARAGRAPH.LEFT, 'Commodity or site'],
        [1, WD_ALIGN_PARAGRAPH.RIGHT, 'Pounds Applied'],
        [2, WD_ALIGN_PARAGRAPH.RIGHT, 'Agricultural Applications']],
    'PUR' + year + '_subtotals_indexed_by_chemical.txt': [
        [0, WD_ALIGN_PARAGRAPH.LEFT, 'Chemical'],
        [1, WD_ALIGN_PARAGRAPH.RIGHT, 'Pounds Applied'],
        [2, WD_ALIGN_PARAGRAPH.RIGHT, 'Agricultural Applications']]}
# the data
list_of_lists = [
    ['Chemical', 'Pounds Applied', 'Agricultural Applications'],
    ['ABAMECTIN', '51,276.54', '69,659'],
    ['ABAMECTIN, OTHER RELATED', '0.03', 'N/A'],
    ['S-ABSCISIC ACID', '1,856.38', '230'],
    ['ACEPHATE', '158,054.76', '11,082'],
    ['SULFUR', '49,038,554.00', '170,396'],
    ['BACILLUS SPHAERICUS 2362, SEROTYPE H5A5B, STRAIN ABTS 1743 FERMENTATION SOLIDS, SPORES AND INSECTICIDAL TOXINS', '11,726.29', 'N/A']]

doc = docx.Document()  # create an instance of a Word document
col_ttl = 3  # as many columns as headings in the first list in list_of_lists
row_ttl = 7  # as many rows as there are lists in list_of_lists

# create the table object
table = doc.add_table(rows=row_ttl, cols=col_ttl)
table.style = 'Light Grid Accent 1'
for r in range(len(list_of_lists)):
    row = table.rows[r]
    widths = set_width_dic[file]  # 'file' is set earlier in the full script; e.g. (Inches(3.50), Inches(1.25), Inches(1.25))
    for c, cell in enumerate(table.rows[r].cells):  # c is an index, cell is the empty cell of the table
        table.cell(r, c).vertical_alignment = WD_ALIGN_VERTICAL.BOTTOM
        table.cell(r, c).width = widths[c]
        par = cell.add_paragraph(str(list_of_lists[r][c]))
        for l in set_alignment_dic[file]:
            if l[0] == c:
                par.alignment = l[1]
doc.save(path + 'try.docx')
When I try the exact same code on the entire list_of_lists (a list of 1036 lists), the widths are incorrect: column 1 = 4.23", column 2 = 1.04", and column 3 = 0.89".
I printed the full 1036-row list_of_lists in my cmd box, then pasted it into a text file thinking I might be able to include it here. However, when I attempted to run the full list it would not paste back into the cmd box; it gave an EOL error and only showed the first 65 lists in the list_of_lists. python-docx is able to make the full table, it just won't set the correct widths. I am baffled. I have looked through every Stack Exchange python-docx table-width post I can find, and many other googled sites. Any thoughts much appreciated.
Just figured out the issue. I needed to add autofit = False. Now the code works for the longer table as well
table.autofit = False
table.allow_autofit = False
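For context, a sketch of where those two lines slot into the code above (the placement is my assumption; the original answer doesn't spell it out): disable autofit right after creating the table, before any widths are assigned.

table = doc.add_table(rows=row_ttl, cols=col_ttl)
table.style = 'Light Grid Accent 1'
table.autofit = False        # keep Word from recalculating the column widths
table.allow_autofit = False  # also clear the table-layout autofit flag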

How to delete or drop rows from a dataframe using specific values in a column?

I'm using the code below, but when I group to show the results, I expected 'No Entry' and 'Out of Business' not to appear, yet they do.
data2 = pd.DataFrame(data)
data2 = data2[(data2['results'] != 'No Entry') | (data2['results'] != 'Out of Business')]
data2.groupby('results').size().sort_values(ascending=False)
results
Pass 417
Pass w/ Conditions 233
Fail 192
No Entry 69
Out of Business 55
Not Ready 28
Thanks in advance.
Simply use the following code to drop certain rows from a df:
df = df.loc[df['results'] != 'No Entry']
df = df.loc[df['results'] != 'Out of Business']
The code works by selecting the df without those two kinds of rows. Note that the original version filters nothing because (data2['results'] != 'No Entry') | (data2['results'] != 'Out of Business') is always True: every row differs from at least one of the two strings, so the conditions need to be combined with & instead of | (or replaced with isin, as sketched below).
I hope this helps.
Take care
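The isin variant mentioned above (a one-line sketch, not part of the original answer):

# keep only the rows whose result is outside the unwanted set
df = df[~df['results'].isin(['No Entry', 'Out of Business'])]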

Question on h2o hanging when attempting to slice rows

I have, I guess, a moderately sized dataframe of ~500k rows and 200 columns, on a machine with 8 GB of memory.
My problem is that when I go to slice my data, even when it is trimmed down to a very small dataset of 6k rows and 200 columns, it just hangs for 10-15 minutes or more. Then, if I hit the STOP button for Python Interactive and retry, the operation completes in 2-3 seconds.
I don't understand why the row slicing can't run in those 2-3 seconds in the first place. It is making it impossible to run programs, as things just hang and have to be manually stopped before they work.
I am following the approach laid out on the h2o webpage:
import h2o
h2o.init()
# Import the iris with headers dataset
path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv"
df = h2o.import_file(path=path)
# Slice 1 row by index
c1 = df[15,:]
c1.describe()
# Slice a range of rows
c1_1 = df[range(25,50,1),:]
c1_1.describe()
# Slice using a boolean mask. The output dataset will include rows with a sepal length
# less than 4.6.
mask = df["sepal_len"] < 4.6
cols = df[mask,:]
cols.describe()
# Filter out rows that contain missing values in a column. Note the use of '~' to
# perform a logical not.
mask = df["sepal_len"].isna()
cols = df[~mask,:]
cols.describe()
The traceback from the console is as follows; I see this same message repeated several times:
/opt/anaconda3/lib/python3.7/site-packages/h2o/expr.py in <listcomp>(.0)
149 return self._cache._id # Data already computed under ID, but not cached
150 assert isinstance(self._children,tuple)
--> 151 exec_str = "({} {})".format(self._op, " ".join([ExprNode._arg_to_expr(ast) for ast in self._children]))
152 gc_ref_cnt = len(gc.get_referrers(self))
153 if top or gc_ref_cnt >= ExprNode.MAGIC_REF_COUNT:
~/opt/anaconda3/lib/python3.7/site-packages/h2o/expr.py in _arg_to_expr(arg)
161 return "[]" # empty list
162 if isinstance(arg, ExprNode):
--> 163 return arg._get_ast_str(False)
164 if isinstance(arg, ASTId):

Slicing my data frame is returning unexpected results

I have 13 CSV files that contain billing information in an unusual format. Multiple readings are recorded every 30 minutes of the day. Five days are recorded beside each other (columns), then the next five days are recorded under them. To make things more complicated, the day of the week, date, and billing day are shown over the first recording of KVAR each day.
The image below shows a small example. However, imagine that KW, KVAR, and KVA repeat 3 more times before continuing some 50 rows later.
My goal was to create a simple Python script that would turn the data into a data frame with the columns DATE, TIME, KW, KVAR, KVA, and DAY.
The problem is that my script returns NaN data for the KW, KVAR, and KVA columns after the first five days (which correlates with a new iteration of a for loop). What is weird to me is that when I try to print out the same ranges, I get the data that I expect.
My code is below. I have included comments to help further explain things. I also have an example of sample output of my function.
def make_df(df):
    # starting values
    output = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
    time = df1.loc[3:50, 0]
    val_start = 3
    val_end = 51
    date_val = [0, 2]
    day_type = [1, 2]
    # There are 7 row movements that need to take place.
    for row_move in range(1, 8):
        day = [1, 2, 3]
        date_val[1] = 2
        day_type[1] = 2
        # There are 5 column movements that take place.
        # The basic idea is that I cycle through the five days, grab their data
        # into a temporary dataframe, and then append that dataframe onto the
        # output dataframe.
        for col_move in range(1, 6):
            temp_df = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
            temp_df['TIME'] = time
            # These are the 3 values that stop working after the first column change.
            # I get the values that I expect for the first 5 days.
            temp_df['KW'] = df.iloc[val_start:val_end, day[0]]
            temp_df['KVAR'] = df.iloc[val_start:val_end, day[1]]
            temp_df['KVA'] = df.iloc[val_start:val_end, day[2]]
            # These 2 values work perfectly for the entire data set.
            temp_df['DAY'] = df.iloc[day_type[0], day_type[1]]
            temp_df["DATE"] = df.iloc[date_val[0], date_val[1]]
            # troubleshooting
            print(df.iloc[val_start:val_end, day[0]])
            print(temp_df)
            output = output.append(temp_df)
            # Increase values for each iteration of the column loop.
            # Seems to work perfectly when I print the data.
            day = [x + 3 for x in day]
            date_val[1] = date_val[1] + 3
            day_type[1] = day_type[1] + 3
        # Increase values for each iteration of the row loop.
        # Seems to work perfectly when I print the data.
        date_val[0] = date_val[0] + 55
        day_type[0] = day_type[0] + 55
        val_start = val_start + 55
        val_end = val_end + 55
    return output

test = make_df(df1)
Below is some sample output. It shows where the data starts to break down after the fifth day (or first instance of the column shift in the for loop). What am I doing wrong?
This could be pandas' index alignment requiring matched row indices when assigning numerical values:
import pandas as pd
import numpy as np

output = pd.DataFrame(np.random.rand(5, 2), columns=['a', 'b'])  # fake data
output['c'] = list('abcde')  # add a column of non-numerical entries
tmp = pd.DataFrame(columns=['a', 'b', 'c'])
tmp['a'] = output.iloc[0:2, 2]  # sets tmp's index to [0, 1]
tmp['b'] = output.iloc[3:5, 2]  # index [3, 4] does not align, generates NaN
tmp['c'] = output.iloc[0:2, 2]
output = output.append(tmp)
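One way to sidestep the alignment (my sketch, extending the demo above; not from the original answer): assign raw values instead of an indexed Series, for example via .to_numpy().

tmp2 = pd.DataFrame(columns=['a', 'b', 'c'])
tmp2['a'] = output.iloc[0:2, 2]             # Series assignment sets tmp2's index to [0, 1]
tmp2['b'] = output.iloc[3:5, 2].to_numpy()  # raw values are assigned positionally, so no NaN
tmp2['c'] = output.iloc[0:2, 2].to_numpy()
print(tmp2)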
(initial response)
What does df1 look like? Does df.iloc[val_start:val_end, day[0]] have any issue past the fifth day? The code doesn't show how you read from the csv files, or df1 itself.
My guess: if val_start:val_end gives invalid indices on the sixth day, or df1 happens to be malformed past the fifth day, df.iloc[val_start:val_end, day[0]] will return an empty Series object and possibly make its way into temp_df. iloc does not report out-of-range row slices, though similar column indices would trigger an IndexError:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3), columns=['a', 'b', 'c'], index=np.arange(5))  # fake data
df.iloc[0:2, 1]      # returns the subset
df.iloc[100:102, 1]  # returns: Series([], Name: b, dtype: float64)
A little off topic, but I would recommend preprocessing the csv files rather than wrestling with the indexing in a pandas DataFrame, as the original format is quite complex. Slice the data by date and later use pd.melt or pd.groupby to shape it into the format you like, or alternatively try a multi-index if you stick with pandas I/O.
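For illustration (a generic sketch, not tied to the original billing files), pd.melt turns wide measurement columns into long rows, which makes the later reshaping straightforward:

import pandas as pd

# toy wide-format block: one row per time slot, one column per measurement
wide = pd.DataFrame({'TIME': ['00:00', '00:30'],
                     'KW': [1.2, 1.4],
                     'KVAR': [0.3, 0.4],
                     'KVA': [1.3, 1.5]})
# melt to long format: (TIME, measure, reading) triples
long_df = wide.melt(id_vars='TIME', var_name='measure', value_name='reading')
print(long_df)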
