I was trying to do this in Python: I have multiple prefixes to query in Bigtable, but I only want the first result of each row set defined by a prefix. In essence, applying a limit of 1 for each row set, not for the entire scan.
Imagine you have records with the following row keys:
collection_1#item1#reversed_timestamp1
collection_1#item1#reversed_timestamp2
collection_1#item2#reversed_timestamp3
collection_1#item2#reversed_timestamp4
What if I want to retrieve just the latest entries for collection_1#item1# and collection_1#item2# at the same time?
The expected output should be the rows corresponding to:
collection_1#item1#reversed_timestamp1
collection_1#item2#reversed_timestamp3
Can this be done in Bigtable?
Thanks!
Is collection_1#item1#reversed_timestamp1 the row key, or is reversed_timestamp1 actually a timestamp?
If it is not part of the row key, you could use a filter like cells per column
https://cloud.google.com/bigtable/docs/using-filters#cells-per-column-limit e.g.
rows = table.read_rows(filter_=row_filters.CellsColumnLimitFilter(2))
or cells per row
https://cloud.google.com/bigtable/docs/using-filters#cells-per-row-limit e.g.
rows = table.read_rows(filter_=row_filters.CellsRowLimitFilter(2))
depending on how your data is laid out.
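If the timestamp is part of the row key, then as far as I know a single ReadRows call only supports one overall limit, so a per-prefix limit needs one limited read per prefix. A minimal sketch, assuming the google-cloud-bigtable client (the project, instance, and table names are placeholders):

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

def first_row_for_prefix(table, prefix):
    # Scan the key range covered by the prefix; the end key is the prefix
    # with its last byte incremented (assumes that byte is not 0xFF).
    end_key = prefix[:-1] + bytes([prefix[-1] + 1])
    rows = table.read_rows(start_key=prefix, end_key=end_key, limit=1)
    return next(iter(rows), None)

for prefix in (b"collection_1#item1#", b"collection_1#item2#"):
    row = first_row_for_prefix(table, prefix)
    if row is not None:
        print(row.row_key)

Because the timestamps are reversed, the first row in each range is the latest entry for that prefix.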
I have a pandas data frame in which the values of one of its columns look like this:
print(VCF['INFO'].iloc[0])
Results (sorry, I can't copy and paste this data as I am working from a cluster without an internet connection)
I need to create new columns with the names END, SVTYPE and SVLEN, with their info as the values of those columns. Following the example, this would be:
END        SVTYPE  SVLEN
224015456  DEL     -223224913
I do not need the rest of the info contained in the INFO column so far.
The information contained in this column is huge, but as far as I can read there are no more something=value pairs, as you can see in the picture.
Simply use .str.extract:
extracted = df['INFO'].str.extract('END=(?P<END>.+?);SVTYPE=(?P<SVTYPE>.+?);SVLEN=(?P<SVLEN>.+?);')
Output:
>>> extracted
         END SVTYPE       SVLEN
0  224015456    DEL  -223224913
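For a self-contained illustration, here is the same call run on a made-up INFO string modeled on the question (the real data will differ):

import pandas as pd

df = pd.DataFrame({'INFO': ['END=224015456;SVTYPE=DEL;SVLEN=-223224913;CIPOS=0,0;']})

extracted = df['INFO'].str.extract(
    r'END=(?P<END>.+?);SVTYPE=(?P<SVTYPE>.+?);SVLEN=(?P<SVLEN>.+?);')
print(extracted)
#          END SVTYPE       SVLEN
# 0  224015456    DEL  -223224913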
I have a table which contains over 600,000 records and a column named implementer_userid, whose value may be repeated across more than one record. Now I want to store how many times each distinct value occurs in that column. COUNTIF (Excel), GROUP BY (SQL) and similar functions won't work, as I don't want the count of one specific value; instead I want to replace all distinct values with their frequencies. Help me by doing so in any one of the three frameworks: Excel, Pandas (Python) or SQL.
If I understand your problem correctly, you can just construct a frequency table using the value_counts() function, and then go through your column, replacing each key (row value) with the respective frequency retrieved from the table you constructed earlier. For example:
frequencies = your_pandas_dataframe['Your column'].value_counts()
your_pandas_dataframe['Result column'] = your_pandas_dataframe['Your column'].apply(lambda x: frequencies[x])
If you don't want this extra column, you can probably do something like this instead:
# ...
your_pandas_dataframe['Your column'] = your_pandas_dataframe['Your column'].apply(lambda x: frequencies[x])
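As a self-contained illustration with made-up data (.map(frequencies) does the same job as the apply/lambda above):

import pandas as pd

df = pd.DataFrame({'implementer_userid': ['a', 'b', 'a', 'c', 'a', 'b']})

# Count how often each distinct value occurs, then replace each value
# with its count.
frequencies = df['implementer_userid'].value_counts()
df['implementer_userid'] = df['implementer_userid'].map(frequencies)
print(df['implementer_userid'].tolist())  # [3, 2, 3, 1, 3, 2]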
Does this answer your question?
I'm using Python to query a SQL database. I'm fairly new to databases. I've tried looking up this question, but I can't find a similar enough question to get the right answer.
I have a table with multiple columns/rows. I want to find the MAX of a single column, I want ALL columns returned (the entire ROW), and I want only one instance of the MAX. Right now I'm getting ten ROWS returned, because the MAX is repeated ten times. I only want one ROW returned.
The query strings I've tried so far:
sql = 'select max(f) from cbar'
# this returns one ROW, but only a single COLUMN (a single value)
sql = 'select * from cbar where f = (select max(f) from cbar)'
# this returns all COLUMNS, but it also returns multiple ROWS
I've tried a bunch more, but they returned nothing. They weren't right somehow. That's the problem, I'm too new to find the middle ground between my two working query statements.
In SQLite 3.7.11 or later, you can just retrieve all columns together with the maximum value:
SELECT *, max(f) FROM cbar;
But the SQLite version bundled with your Python might be too old. In the general case, you can sort the table by that column and then just read the first row:
SELECT * FROM cbar ORDER BY f DESC LIMIT 1;
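Run from Python, that might look like this (the database file name is a placeholder):

import sqlite3

conn = sqlite3.connect('your_database.db')
cur = conn.cursor()
cur.execute('SELECT * FROM cbar ORDER BY f DESC LIMIT 1')
row = cur.fetchone()  # one row, all columns, with the maximum f
print(row)
conn.close()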
I want to shade every other column, excluding the 1st row/header, with grey. I read through the documentation for XlsxWriter and was unable to find any example of this; I also searched through the tag here and couldn't find anything.
Why not set it up as a conditional format?
http://xlsxwriter.readthedocs.org/example_conditional_format.html
You could just declare a condition like "if the cell's row number % 2 == 0".
I wanted to post the details on how I did this, and how I was able to do it dynamically. It's kinda hacky, but I'm new to Python and I just needed this to work for right now.
import pandas as pd

xlsW = pd.ExcelWriter(finalReportFileName)
rptMatchingDoe.to_excel(xlsW, 'Room Counts Not Matching', index=False)
workbook = xlsW.book
rptMatchingSheet = xlsW.sheets['Room Counts Not Matching']
formatShadeRows = workbook.add_format({'bg_color': '#a9c9ff',
                                       'font_color': 'black'})
# Formula-based conditional format: shade every even-numbered row, so the
# header (row 1) stays unshaded. str() guards the range concatenation,
# since matchingCount is numeric.
rptMatchingSheet.conditional_format(
    'A1:' + xlsAlpha[rptMatchingDoeColCount] + str(matchingCount),
    {'type': 'formula',
     'criteria': '=MOD(ROW(),2) = 0',
     'format': formatShadeRows})
xlsW.save()
xlsAlpha is a list that covers the maximum number of columns my report could possibly have. My first three columns are always consistent, so I just set rptMatchingDoeColCount equal to 2, and then when I loop through the list to build my query I increment the count. The matchingCount variable is just a fetchone() result from a count(*) query on the view I'm pulling from in the database.
Eventually I think I will write a function to replace the hardcoded list assigned to xlsAlpha, so that it can handle a virtually unlimited number of columns.
If anyone has any suggestions on how I could improve this feel free to share.
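For what it's worth, XlsxWriter already ships a utility that converts a zero-based column index into an Excel column name, which could replace the hardcoded xlsAlpha list:

from xlsxwriter.utility import xl_col_to_name

# Zero-based column index -> Excel column string.
print(xl_col_to_name(0))   # 'A'
print(xl_col_to_name(27))  # 'AB'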
I have a dataset in a relational database format (linked by IDs over various .csv files).
I know that each data frame contains only one value of an ID, and I'd like to know the simplest way to extract values from that row.
What I'm doing now:
# the group has only one element
purchase_group = purchase_groups.get_group(user_id)
price = list(purchase_group['Column_name'])[0]
The third line is bothering me, as it seems ugly; however, I'm not sure what the workaround is. The grouping (I guess) assumes that there might be multiple values and returns a <class 'pandas.core.frame.DataFrame'> object, while I'd like just a row returned.
If you want just the value and not a df/series, then call values and index the first element [0], so just:
price = purchase_group['Column_name'].values[0]
will work.
If purchase_group has a single row, then doing purchase_group = purchase_group.squeeze() would make it into a Series, so you could simply call purchase_group['Column_name'] to get your value.
Late to the party here, but purchase_group['Column_name'].item() is now available and is cleaner than some other solutions.
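A quick illustration of all three approaches on a made-up one-row frame (column names invented here):

import pandas as pd

purchase_group = pd.DataFrame({'Column_name': [9.99], 'other': ['x']})

print(purchase_group['Column_name'].values[0])  # 9.99
row = purchase_group.squeeze()                  # one row -> a Series
print(row['Column_name'])                       # 9.99
print(purchase_group['Column_name'].item())     # 9.99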
This method is intuitive; for example, to get the first row of values (a list from a list of lists) from the dataframe:
import numpy as np
np.array(df)[0]