Is it possible to get only a limited number of columns for a column family from a row? Let's say I just want to fetch the first 10 values for ['cf1': 'col1'] for a particular row.
This is the same question as https://github.com/wbolster/happybase/issues/93
The answer is:
I think the only way to do this is a scan with a server side filter. I think the one you're after is the ColumnCountGetFilter:
ColumnCountGetFilter - takes one argument, a limit. It returns the first limit number of columns in the table. Syntax: ColumnCountGetFilter('<limit>') Example: ColumnCountGetFilter(4)
Source: http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/admin_hbase_filtering.html
With Happybase that would look like this (untested):
for row_key, data in table.scan(columns=['cf1'], filter='ColumnCountGetFilter(10)'):
    print(row_key, data)
You can also pass limit to a scan to cap the number of rows returned by HBase (note that limit restricts rows, not columns):
table.scan(limit=int(limit))
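Since the question asks about one particular row, here is a rough sketch (untested; the connection details, table name my_table, and row key row1 are placeholders) that restricts the scan to a single row key so the server-side filter only applies to that row:

import happybase

connection = happybase.Connection('localhost')   # adjust host/port for your cluster
table = connection.table('my_table')             # placeholder table name

# Scan exactly one row (row_stop is exclusive, so append a zero byte) and let the
# server-side filter trim the result to the first 10 columns.
for row_key, data in table.scan(row_start=b'row1',
                                row_stop=b'row1\x00',
                                columns=['cf1'],
                                filter='ColumnCountGetFilter(10)',
                                limit=1):
    print(row_key, data)   # data holds at most the first 10 cf1 columns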
I am new to Python and I want to access some rows for an already grouped dataframe (used groupby).
However, I am unable to select the row I want and would like your help.
The code I used for groupby is shown below:
language_conversion = house_ads.groupby(['date_served','language_preferred']).agg({'user_id':'nunique',
'converted':'sum'})
language_conversion
Result shows:
For example, I want to access the number of Spanish-speaking users who received house ads using:
language_conversion[('user_id','Spanish')]
gives me KeyError('user_id','Spanish')
This is the same when I try to create a new column, which gives me the same error.
Thanks for your help
Use this,
language_conversion.loc[(slice(None), 'Arabic'), 'user_id']
You can see the indices (in this case, tuples of length 2) using language_conversion.index
You should use this:
language_conversion.loc[(slice(None),'Spanish'), 'user_id']
slice(None) here includes all rows in the date index.
If you have one particular date in mind, just replace slice(None) with that specific date.
The error you are getting is because you accessed the columns before the index levels, which is not the correct way of doing it. Follow the link to learn more about indexing.
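To make the answers above concrete, here is a small self-contained sketch; the dates and counts are made up, only the shape mirrors the grouped frame in the question:

import pandas as pd

house_ads = pd.DataFrame({
    'date_served': ['2018-01-01', '2018-01-01', '2018-01-02'],
    'language_preferred': ['Spanish', 'English', 'Spanish'],
    'user_id': ['u1', 'u2', 'u3'],
    'converted': [1, 0, 1],
})

# Same groupby as in the question: rows are indexed by (date_served, language_preferred).
language_conversion = house_ads.groupby(['date_served', 'language_preferred']).agg(
    {'user_id': 'nunique', 'converted': 'sum'})

# All dates, Spanish only:
print(language_conversion.loc[(slice(None), 'Spanish'), 'user_id'])

# One specific date instead of slice(None):
print(language_conversion.loc[('2018-01-01', 'Spanish'), 'user_id'])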
I am dealing with a table that has roughly 50k rows, each of which contains a timestamp and an array of smallints of length 25920. What I am trying to do is pull a single value from each array, given a list of timestamps that is passed in. For example, I would pass 25920 timestamps and would want the first element of the array for the first timestamp, then the second element for the second timestamp, and so on. By now I have tunnel vision and cannot seem to find a solution to what is probably a trivial problem.
I either end up pulling the full 25920 rows which consumes too much memory or execute 25920 queries that take way too long for obvious reasons.
I am using Python 3.8 with the psycopg2 module.
Thanks in advance!
You need to generate an index into the array for every row you extract with your query. In this specific case (diagonal) you want an index based on the row number. Something along the lines of:
SELECT ts, val[row_number() over (order by ts)] FROM ... ORDER BY ts
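For completeness, a hedged psycopg2 sketch built around that query; the table and column names (readings, ts, vals) and the DSN are placeholders for your actual schema, and it assumes the timestamps you pass correspond, in ascending order, to positions 1, 2, 3, ... in the array:

import psycopg2

# row_number() generates the 1-based array index, so the earliest requested
# timestamp gets vals[1], the next gets vals[2], and so on (the "diagonal").
QUERY = """
    SELECT ts, vals[row_number() OVER (ORDER BY ts)] AS picked
    FROM readings
    WHERE ts = ANY(%s)
    ORDER BY ts
"""

timestamps = []  # fill with your 25920 timestamps, in ascending order

with psycopg2.connect("dbname=mydb user=me") as conn:   # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(QUERY, (timestamps,))
        for ts, value in cur.fetchall():
            print(ts, value)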
I've been poking around a bit and can't seem to find a close solution to this one:
I'm trying to transform a dataframe from this:
To this:
Such that remark_code_names with similar denial_amounts are provided new columns based on their corresponding har_id and reason_code_name.
I've tried a few things, including a groupby function, which gets me halfway there.
denials.groupby(['har_id','reason_code_name','denial_amount']).count().reset_index()
But this obviously leaves out the reason_code_names that I need.
Here's a minimal example:
pd.DataFrame({'har_id':['A','A','A','A','A','A','A','A','A'],'reason_code_name':[16,16,16,16,16,16,16,22,22],
'remark_code_name':['MA04','N130','N341','N362','N517','N657','N95','MA04','N341'],
'denial_amount':[5402,8507,5402,8507,8507,8507,8507,5402,5402]})
Using groupby() is a good way to go. Use it along with transform() and overwrite the column named 'remark_code_name'. This solution puts all remark_code_names together in the same column.
denials['remark_code_name'] = denials.groupby(['har_id','reason_code_name','denial_amount'])['remark_code_name'].transform(lambda x : ' '.join(x))
denials.drop_duplicates(inplace=True)
If you really need each code in its own column, you could apply another function and use .split(). However, you will first need to set the number of columns depending on the maximum number of codes you find in a single row.
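A hedged continuation of the snippet above (the remark_code_N column names are made up): rebuild the sample frame, join the codes per group as shown, then split them back out into one column per code with str.split(expand=True), which sizes the columns to the longest row automatically.

import pandas as pd

denials = pd.DataFrame({'har_id': ['A'] * 9,
                        'reason_code_name': [16, 16, 16, 16, 16, 16, 16, 22, 22],
                        'remark_code_name': ['MA04', 'N130', 'N341', 'N362', 'N517', 'N657', 'N95', 'MA04', 'N341'],
                        'denial_amount': [5402, 8507, 5402, 8507, 8507, 8507, 8507, 5402, 5402]})

# Join all codes per (har_id, reason_code_name, denial_amount) group, as above.
denials['remark_code_name'] = denials.groupby(
    ['har_id', 'reason_code_name', 'denial_amount'])['remark_code_name'].transform(' '.join)
denials = denials.drop_duplicates().reset_index(drop=True)

# Split the joined codes into one column per code; shorter rows are padded with None.
split_codes = denials['remark_code_name'].str.split(' ', expand=True)
split_codes.columns = [f'remark_code_{i + 1}' for i in split_codes.columns]
denials = denials.drop(columns='remark_code_name').join(split_codes)
print(denials)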
So what I want to do is print only the rows where, for example, the price (or the value under any other column title) is greater than or equal to, let's say, 50.
I haven't been able to find the answer elsewhere and couldn't do it myself with the API documentation.
I'm using Google Sheets API v4, and my goal is, based on a sheet that contains information on mobile subscriptions, to allow the user to select what they want for price, GB, etc.
Here is what my sheets look like:
Also, here is some unofficial documentation which I found great even though it didn't contain the answer I need; maybe someone here will succeed?
I tried running the following code but it didn't work:
val_list = col5
d = wks.findall(>50) if cell.value >50 :
print (val_list)
I hope you will be able to help me. I'm new to Python.
I think you had the right idea, but it looks like findall is for strings or regex, not an arbitrary boolean condition. Also, some of the syntax is a bit off, but that's to be expected when you are just starting out.
Here is how I would approach this with just what I could find in your attached document. I doubt this is the fastest or cleanest way to do this, but I think it's at least conceptually clear:
#list of all values in 4th/price column
prices=wks.col_values(4)
#Remove nonnumeric characters from prices
prices=[p.replace('*','') for p in prices[1:]]
#Get indices of rows with price >=50
##i+2 to account for one indexing and removing header row
indices=[i+2 for i,p in enumerate(prices) if float(p)>=50]
#Print these rows
for i in indices:
    row = wks.row_values(i)
    print(row)
Going forward with this project, you may want to put these row values into a dataframe rather than just printing them so you can do further analysis on this subset of the data.
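For example, a hedged continuation of the snippet above that collects the matching rows into a pandas DataFrame, using the sheet's header row for column names (the padding step is there because gspread drops trailing empty cells):

import pandas as pd

header = wks.row_values(1)                          # first row holds the column titles
matching_rows = [wks.row_values(i) for i in indices]

# Pad short rows so every row lines up with the header before building the frame.
matching_rows = [r + [''] * (len(header) - len(r)) for r in matching_rows]
df = pd.DataFrame(matching_rows, columns=header)
print(df)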
Hey all,
I have two databases. One has 145000 rows and approx. 12 columns. The other has around 40000 rows and 5 columns. I am trying to compare them based on two column values. For example, if in CSV#1 column one says 100-199 and column two says Main St (meaning that this row is contained within the 100 block of Main Street), how would I go about comparing that with a similar two columns in CSV#2? I need to compare every row in CSV#1 to each single row in CSV#2. If there is a match, I need to append the 5 columns of each matching row to the end of the row of CSV#2. Thus CSV#2's number of columns will grow significantly and have repeat entries; it doesn't matter how the columns are ordered. Any advice on how to compare two columns with another two columns in a separate database and then iterate across all rows? I've been using Python and the csv module so far for the rest of the work, but this part of the problem has me stumped.
Thanks in advance
-John
1. A csv file is NOT a database. A csv file is just rows of text-chunks; a proper database (like PostgreSQL or MySQL or SQL Server or SQLite or many others) gives you proper data types and table joins and indexes and row iteration and proper handling of multiple matches and many other things which you really don't want to rewrite from scratch.
2. How is it supposed to know that Address("100-199")==Address("Main Street")? You will have to come up with some sort of knowledge base which transforms each bit of text into a canonical address or address range which you can then compare; see Where is a good Address Parser, but be aware that it deals with singular addresses (not address ranges).
Edit:
Thanks to Sven; if you were using a real database, you could do something like
SELECT
User.firstname, User.lastname, User.account, Order.placed, Order.fulfilled
FROM
User
INNER JOIN Order ON
User.streetnumber=Order.streetnumber
AND User.streetname=Order.streetname
if streetnumber and streetname are exact matches; otherwise you still need to consider point #2 above.
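If installing a database server is too much, SQLite ships with Python and gives you the same kind of join. A hedged sketch along those lines; the file names, table names, and join columns (block_range, street_name) are placeholders, and it only handles exact matches, so the range text must already be normalized identically in both files (point #2 above):

import csv
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

def load_csv(path, table):
    # Create a table whose columns are the CSV header and bulk-insert the rows.
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ', '.join(f'"{h}"' for h in header)
        marks = ', '.join('?' for _ in header)
        cur.execute(f'CREATE TABLE {table} ({cols})')
        cur.executemany(f'INSERT INTO {table} VALUES ({marks})', reader)

load_csv('csv1.csv', 'csv1')   # ~145000 rows, 12 columns
load_csv('csv2.csv', 'csv2')   # ~40000 rows, 5 columns

# Index the join columns in the big file so the join is not 145000 x 40000 comparisons.
cur.execute('CREATE INDEX idx_csv1 ON csv1(block_range, street_name)')

# Every csv2 row, extended with the matching csv1 columns; repeats when there are multiple matches.
cur.execute('''
    SELECT csv2.*, csv1.*
    FROM csv2
    JOIN csv1 ON csv1.block_range = csv2.block_range
             AND csv1.street_name = csv2.street_name
''')

with open('merged.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow([d[0] for d in cur.description])
    writer.writerows(cur)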