Cassandra buffered read of millions of columns - python

I've got a cassandra cluster with a small number of rows (< 100). Each row has about 2 million columns. I need to get a full row (all 2 million columns), but things start failing all over the place before I can finish my read. I'd like to do some kind of buffered read.
Ideally I'd like to do something like this using Pycassa (no this isn't the proper way to call get, it's just so you can get the idea):
results = {}
start = 0
while True:
    # Fetch blocks of size 500
    buffer = column_family.get(key, column_offset=start, column_count=500)
    if len(buffer) == 0:
        break
    # Merge these results into the main one
    results.update(buffer)
    # Update the offset
    start += len(buffer)
Pycassa (and by extension Cassandra) doesn't let you do this. Instead you need to specify a column name for column_start and column_finish. This is a problem since I don't actually know what the start or end column names will be. The special value "" can indicate the start or end of the row, but that doesn't work for any of the values in the middle.
So how can I accomplish a buffered read of all the columns in a single row? Thanks.

From the pycassa 1.0.8 documentation, it would appear that you could use something like the following [pseudocode]:
results = {}
startColumn = ""
while True:
    # Fetch pages of up to 100 columns
    buffer = get(key, column_start=startColumn, column_finish="", column_count=100)
    # iterate the returned values
    # set startColumn to the last column name returned
Remember that on each subsequent call you'll only get 99 new results back, because the page also includes startColumn, which you've already seen. I'm not skilled enough in Python yet to iterate on buffer to extract the column names.
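A minimal sketch of that loop, assuming (as pycassa's documentation describes) that column_family.get() returns an OrderedDict of {column_name: value}, so the last key of each page becomes the start of the next:
results = {}
start_column = ""
while True:
    buffer = column_family.get(key, column_start=start_column,
                               column_finish="", column_count=100)
    results.update(buffer)
    if len(buffer) < 100:
        break  # a short page means the row is exhausted
    # The last column name seen starts the next page; it comes back once
    # more on the next call, and results.update() simply absorbs it.
    start_column = list(buffer.keys())[-1]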

In v1.7.1+ of pycassa you can use xget and get a row as wide as 2**63-1 columns.
for col in cf.xget(key, column_count=2**63-1):
    # do something with the column.
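Since xget returns an iterator rather than materializing the whole row, the buffered read collapses to one line. A hedged sketch, assuming xget lazily yields (column_name, value) pairs and pages through the row internally:
# Build the full results dict without fetching all columns at once
results = dict(cf.xget(key))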

Related

How can I use Python and Pandas to parse through text and return the strings I want in separate data cells?

So I have compiled a list of NFL game projections from the 2020 season for fantasy relevant players. Each row contains the team names, score, relevant players and their stats like in the text below. The problem is that each of the player names and stats are either different lengths or written out in slightly different ways.
Bears 24-17 Jaguars
M.Trubisky- 234/2TDs
D.Montgomery- 113 scrim yards/1 rush TD/4 rec
A.Robinson- 9/114/1
C.Kmet- 3/35/0
G.Minshew- 183/1TD/2int
J.Robinson- 77 scrim yards/1 rush TD/4 rec
DJ.Chark- 3/36
I'm trying to create a data frame that will split the player name, receptions, yards, and touchdowns into separate columns. Then I'll be able to compare these numbers to their actual game numbers and see how close the predictions were. Does anyone have an idea for a solution in Python? Even if you could point me in the right direction I'd greatly appreciate it!
You can split the full string using '-' (the dash/minus sign) as the separator, then use indexing to get the different parts.
Using str.split(sep='-')[0] gives you the name. Here, the str would be the row, for example M.Trubisky- 234/2TDs.
Similarly, str.split(sep='-')[1] gives you everything but the name.
As for splitting anything after the name, there is no way of doing it unless they are in a certain order. If you are able to somehow achieve this, there is a way of splitting into columns.
I am going to assume that the trend here is yards / touchdowns / receptions, in which case we can again use the str.split() method. I am also assuming that the 'rows' only belong to one team. You might have to run this script once for each team to create a dataframe, and then join all the dataframes with a new feature called 'team_name' (a sketch of that merge follows the snippet below).
You can define lists and append values to them, and then use the lists to create a dataframe. This snippet should help you.
import re
import pandas as pd

names, scrim_yards, touchdowns, receptions = [], [], [], []

for row in rows:
    # sample name: M.Trubisky
    names.append(row.split(sep='-')[0])
    stats = row.split(sep='-')[1]  # sample stats: "234/2TDs"
    # Since we only want the numbers from each stat, we can filter
    # them out with a regular expression.
    numerical_stats = re.findall(r'\d+', stats)  # sample result: ['234', '2']
    # Now we use indexing again to get the desired values,
    # assuming the order is scrim yards / touchdowns / receptions.
    scrim_yards.append(numerical_stats[0])
    touchdowns.append(numerical_stats[1])
    receptions.append(numerical_stats[2])

# You can then create a pandas dataframe
nfl_player_stats = pd.DataFrame({'names': names, 'scrim_yards': scrim_yards,
                                 'touchdowns': touchdowns, 'receptions': receptions})
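For the per-team merge mentioned above, a hedged sketch (the per-team frames and their contents are illustrative, as if the snippet had been run once per team):
import pandas as pd

bears_df = pd.DataFrame({'names': ['M.Trubisky'], 'scrim_yards': ['234'],
                         'touchdowns': ['2'], 'receptions': ['0']})
jaguars_df = pd.DataFrame({'names': ['G.Minshew'], 'scrim_yards': ['183'],
                           'touchdowns': ['1'], 'receptions': ['0']})

# Tag each frame with its team, then stack them into one dataframe
bears_df['team_name'] = 'Bears'
jaguars_df['team_name'] = 'Jaguars'
all_stats = pd.concat([bears_df, jaguars_df], ignore_index=True)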
As you point out, often the hardest part of processing a data file like this is handling all the variability and inconsistency in the file itself. There are a lot of things that can vary inside the file, and sometimes the file also contains silly errors (typos, missing whitespace, and the like). Depending on the size of the data file, you might be better off simply hand-editing it to make it easier to read into Python!
If you tackle this directly with Python code, then it's a very good idea to be very careful to verify the actual data matches your expectations of it. Here are some general concepts on how to handle this:
First off, make sure to strip every line of whitespace and ignore blank lines:
for curr_line in file_lines:
    curr_line = curr_line.strip()
    if len(curr_line) > 0:
        # Process the line...
Once you have your stripped, non-blank line, make sure to handle the "game" line (the matchup between two teams) differently from the lines denoting players:
TEAM_NAMES = ["Cardinals", "Falcons", "Panthers", "Bears", "Cowboys", "Lions",
              "Packers", "Rams", "Vikings"]  # and 23 more; you get the idea

# ...down in the code where we are processing the lines...
if any(tn in curr_line for tn in TEAM_NAMES):
    ...  # handle as a "matchup"
else:
    ...  # handle as a "player"
When handling a player and their stats, we can use "- " as a separator. (You must include the space, otherwise players such as Clyde Edwards-Helaire will split the line in a way you did not want.) Here we unpack into exactly two variables, which gives us a nice error check since the code will raise an exception if the line doesn't split into exactly two parts.
p_name, p_stats = curr_line.split("- ")
Handling the stats will be the hardest part. It will all depend on what assumptions you can safely make about your input data. I would recommend being very paranoid about validating that the input data agrees with the assumptions in your code. Here is one notional idea -- an over-engineered solution, but that should help to manage the hassle of finding all the little issues that are probably lurking in that data file:
if "scrim yards" in p_stats:
# This is a running back, so "scrim yards" then "rush TD" then "rec:
rb_stats = p_stats.split("/")
# To get the number, just split by whitespace and grab the first one
scrim_yds = int(rb_stats[0].split()[0])
if len(rb_stats) >= 2:
rush_tds = int(rb_stats[1].split()[0])
if len(rb_stats) >= 3:
rec = int(rb_stats[2].split()[0])
# Always check for unexpected data...
if len(rb_stats) > 3:
raise Exception("Excess data found in rb_stats: {}".format(rb_stats))
elif "TD" in p_stats:
# This is a quarterback, so "yards"/"TD"/"int"
qb_stats = p_stats.split("/")
qb_yards = int(qb_stats[0]) # Or store directly into the DF; you get the idea
# Handle "TD" or "TDs". Personal preference is to avoid regexp's
if len(qb_stats) >= 2:
if qb_stats[1].endswidth("TD"):
qb_td = int(qb_stats[1][:-2])
elif qb_stats[1].endswith("TDs"):
qb_td = int(qb_stats[1][:-3])
else:
raise Exception("Unknown qb_stats: {}".format(qb_stats))
# Handle "int" if it's there
if len(qb_stats) >= 3:
if qb_stats[2].endswidth("int"):
qb_int = int(qb_stats[2][:-3])
else:
raise Exception("Unknown qb_stats: {}".format(qb_stats))
# Always check for unexpected data...
if len(qb_stats) > 3:
raise Exception("Excess data found in qb_stats: {}".format(qb_stats))
else:
# Must be a running back: receptions/yards/TD
rb_rec, rb_yds, rb_td = p_stats.split("/")

Unable to change value of dataframe at specific location

So I'm trying to go through my dataframe in pandas, and if the values of two columns match certain conditions, I change a value at that location. Here is a simplified version of the loop I've been using (I changed the if/else conditions because the originals used regex and were quite complicated):
pro_cr = ["IgA", "IgG", "IgE"]  # CRs considered productive
rows_changed = 0
prod_to_unk = 0
unk_to_prod = 0
changed_ids = []
for index in df_sample.index:
    if num == 1 and color == "red":
        pass
    elif num == 2 and color == "blue":
        prod_to_unk += 1
        changed_ids.append(df_sample.loc[index, "Sequence ID"])
        df_sample.at[index, "Functionality"] = "unknown"
        rows_changed += 1
    elif num == 3 and color == "green":
        unk_to_prod += 1
        changed_ids.append(df_sample.loc[index, "Sequence ID"])
        df_sample.at[index, "Functionality"] = "productive"
        rows_changed += 1
    else:
        pass
print("Number of productive columns changed to unknown: {}".format(prod_to_unk))
print("Number of unknown columns changed to productive: {}".format(unk_to_prod))
print("Total number of rows changed: {}".format(rows_changed))
So the main problem is the changing code:
df_sample.at[index, "Functionality"] = "unknown" # or productive
If I run this code without those lines, it works properly: it finds all the correct locations, tells me how many were changed and what their IDs are, which I can use to validate against the CSV file.
If I use df_sample["Functionality"][index] = "unknown" # or productive the code runs, but checking the rows that have been changed shows that they were not changed at all.
When I use df.at[row, column] = value I get "AttributeError: 'BlockManager' object has no attribute 'T'"
I have no idea why this is showing up. There are no duplicate columns. Hope this was clear (if not let me know and I'll try to clarify it). Thanks!
To be honest, I've never used df.at - but try using df.loc instead:
df_sample.loc[index, "Functionality"] = "unknown"
You can also use iat, which indexes by integer position rather than by label.
Example: df.iat[i, j] sets the value at the i-th row and j-th column.
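A small hedged sketch contrasting the two (the frame and its labels here are hypothetical):
import pandas as pd

df = pd.DataFrame({"Functionality": ["productive", "unknown"]},
                  index=["seq1", "seq2"])

# .loc indexes by label: row label "seq1", column label "Functionality"
df.loc["seq1", "Functionality"] = "unknown"

# .iat indexes by integer position: second row, first column
df.iat[1, 0] = "productive"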

SQLite3 How to Select first 100 rows from database, then the next 100

Currently I have a database filled with thousands of rows.
I want to SELECT the first 100 rows, and then select the next 100, then the next 100 and so on...
So far I have:
c.execute('SELECT words FROM testWords')
data = c.fetchmany(100)
This allows me to get the first 100 rows, however, I can't find the syntax for selecting the next 100 rows after that, using another SELECT statement.
I've seen it is possible with other coding languages, but haven't found a solution with Python's SQLite3.
When you are using cursor.fetchmany() you don't have to issue another SELECT statement. The cursor is keeping track of where you are in the series of results, and all you need to do is call c.fetchmany(100) again until that produces an empty result:
c.execute('SELECT words FROM testWords')
while True:
    batch = c.fetchmany(100)
    if not batch:
        break
    # each batch contains up to 100 rows
or using the iter() function (which can be used to repeatedly call a function until a sentinel result is reached):
c.execute('SELECT words FROM testWords')
for batch in iter(lambda: c.fetchmany(100), []):
    # each batch contains up to 100 rows
If you can't keep hold of the cursor (say, because you are serving web requests), then using cursor.fetchmany() is the wrong interface. You'll instead have to tell the SELECT statement to return only a selected window of rows, using the LIMIT syntax. LIMIT has an optional OFFSET clause; together these specify at what row to start and how many rows to return.
Note that you want to make sure that your SELECT statement is ordered so you get a stable result set you can then slice into batches.
batchsize = 1000
offset = 0
while True:
    c.execute(
        'SELECT words FROM testWords ORDER BY somecriteria LIMIT ? OFFSET ?',
        (batchsize, offset))
    batch = list(c)
    offset += batchsize
    if not batch:
        break
Pass the offset value along to the next call of your code if you need to send these batches elsewhere and resume later on.
sqlite3 has nothing to do with Python. It is a standalone database; Python just supplies an interface to it.
As a normal database, sqlite supports standard SQL. In SQL, you can use LIMIT and OFFSET to determine the start and end for your query. Note that if you do this, you should really use an explicit ORDER BY clause, to ensure that your results are consistently ordered between queries.
c.execute('SELECT words FROM testWords ORDER BY ID LIMIT 100')
...
c.execute('SELECT words FROM testWords ORDER BY ID LIMIT 100 OFFSET 100')
You can create an iterator and call it multiple times:
def ResultIter(cursor, arraysize=100):
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result
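Hypothetical usage, assuming an open sqlite3 cursor c as in the question:
c.execute('SELECT words FROM testWords')
for row in ResultIter(c, arraysize=100):
    print(row[0])  # rows arrive one at a time, fetched 100 per round trip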
Or simply like this for returning the first 5 rows:
num_rows = 5
cursor = dbconn.execute("SELECT words FROM testWords")
for row in cursor.fetchmany(num_rows):
    print("Words= " + str(row[0]) + "\n")

appending array breaks program

I am writing a program to analyze some of our invoice data. Basically, I need to take an array containing each individual invoice we sent out over the past year and break it down into twelve arrays, each of which contains the invoices for one month, using the dateSeperate() function, so that monthly_transactions[0] returns January's transactions, monthly_transactions[1] returns February's, and so forth.
I've managed to get it working so that dateSeperate returns monthly_transactions[0] as the January transactions. However, once all of the January data is entered, I attempt to append to the monthly_transactions array using line 44. This just causes the program to break and become unresponsive. The code still executes and doesn't return an error, but Python hangs and I have to force quit out of it.
I've been writing to the global array monthly_transactions. dateSeperate runs fine as long as I don't include the last else statement; without it, monthly_transactions[0] returns an array containing all of the January invoices. The issue arises in my last else statement, which, when added, causes Python to freeze.
Can anyone help me shed any light on this?
I have written a program that defines all of the arrays I'm going to be using (yes, I know global arrays aren't good; I'm a marketer trying to learn programming, so any input you could give me on how to improve this would be much appreciated):
import csv
line_items = []
monthly_transactions = []
accounts_seperated = []
Then I import all of my data and place it into the line_items array
def csv_dict_reader(file_obj):
    global board_info
    reader = csv.DictReader(file_obj, delimiter=',')
    for line in reader:
        item = []
        item.append(line["company id"])
        item.append(line["user id"])
        item.append(line["Amount"])
        item.append(line["Transaction Date"])
        item.append(line["FIrst Transaction"])
        line_items.append(item)

if __name__ == "__main__":
    with open("ChurnTest.csv") as f_obj:
        csv_dict_reader(f_obj)
# Formats the transaction date data to make it more readable
def dateFormat():
    for i in range(len(line_items)):
        ddmmyyyy = line_items[i][3]
        yyyymmdd = ddmmyyyy[6:] + "-" + ddmmyyyy[:2] + "-" + ddmmyyyy[3:5]
        line_items[i][3] = yyyymmdd

# Takes the line_items array and splits it into new array monthly_transactions,
# where each value holds one month of data
def dateSeperate():
    for i in range(len(line_items)):
        # if there are no values in the monthly transactions, add the first line item
        if len(monthly_transactions) == 0:
            test = []
            test.append(line_items[i])
            monthly_transactions.append(test)
        # check to see if the line item's year & month match a value already
        # in the monthly_transactions array.
        else:
            for j in range(len(monthly_transactions)):
                line_year = line_items[i][3][:2]
                line_month = line_items[i][3][3:5]
                array_year = monthly_transactions[j][0][3][:2]
                array_month = monthly_transactions[j][0][3][3:5]
                # print(line_year, array_year, line_month, array_month)
                # If it does, add that line item to that month
                if line_year == array_year and line_month == array_month:
                    monthly_transactions[j].append(line_items[i])
                # Otherwise, create a new sub array for that month
                else:
                    monthly_transactions.append(line_items[i])

dateFormat()
dateSeperate()
print(monthly_transactions)
I would really, really appreciate any thoughts or feedback you guys could give me on this code.
Based on the comments on the OP, your csv_dict_reader function seems to do exactly what you want it to do, at least inasmuch as it appends data from its argument csv file to the top-level variable line_items. You said yourself that if you print out line_items, it shows the data that you want.
"But appending doesn't work." I take it you mean that appending the line_items to monthly_transactions isn't being done. The reason for that is that you didn't tell the program to do it! The appending that you're talking about is done as part of your dateSeparate function, however you still need to call the function.
I'm not sure exactly how you want to use your dateFormat and dateSeparate functions, but in order to use them, you need to include them in the main function somehow as calls, i.e. dateFormat() and dateSeparate().
EDIT: You've created the potential for an endless loop in the last else: section, which extends monthly_transactions by 1 if the line/array year/month aren't equal. This is problematic because it's within the loop for j in range(len(monthly_transactions)):. This loop will never get to the end if the length of monthly_transactions is increased by 1 every time through.
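A minimal sketch of one way out of that endless loop: decide whether any existing month bucket matches only after scanning them all, and append the new month outside the j-loop. This assumes dateFormat() has already rewritten dates as yyyy-mm-dd, so the first seven characters form a "YYYY-MM" key; it also wraps a new month in its own sub-array, matching the structure the rest of the code expects:
def dateSeperate():
    for item in line_items:
        year_month = item[3][:7]  # e.g. "2016-01"
        for bucket in monthly_transactions:
            if bucket[0][3][:7] == year_month:
                bucket.append(item)
                break
        else:
            # No existing bucket matched: start a new sub-array for this month
            monthly_transactions.append([item])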

How to populate a CSV column by column in a loop with Python?

Solved: A friend of mine helped me add code that takes the output csv files and combines them into a single new file. I will add the code after the weekend in case anyone else with a similar issue wants to see it in the future!
Let me start by sharing my existing, working code. This code takes some raw data from a csv file and generates new csv files from it. The data consists of two columns, one representing voltage and one representing current. If the voltage value is not changing, the current values are sent to a new csv file whose name reflects the constant voltage. Once a new stable voltage is reached, another csv is made for that voltage and so on. Here it is:
for x in range(1,6):
    input = open('signalnoise(%d).csv' % x, "r")  # Opens raw data as readable
    v1 = 0
    first = True
    for row in csv.reader(input, delimiter='\t'):
        v2 = row[0]
        if v1 == v2:
            voltage = float(v1) * 1000
            if first:
                print("Writing spectra file for " + str(voltage) + "mV")
                first = False
            output = open('noisespectra(%d)' % x + str(voltage) + 'mV.csv', "a")
            current = [row[1]]
            writer = csv.writer(output, delimiter='\t')
            writer.writerow(current)
        else:
            v1 = row[0]
            first = True
One note: for some reason the print command doesn't seem to fire until the entire script is done running, but it prints the correct thing. This could just be my computer hanging while the script runs.
I would like to change this so that instead of having a bunch of files, I just have one output file with multiple columns. Each column would have its first entry be the voltage value followed by all the currents recorded for that voltage. Here is my idea so far but I'm stuck:
for x in range(1,6):
    input = open('signalnoise(%d).csv' % x, "r")  # Opens raw data as readable
    v1 = 0
    first = True
    for row in csv.reader(input, delimiter='\t'):
        v2 = row[0]
        if v1 == v2:
            voltage = float(v1) * 1000
            if first:
                column = ['voltage']
                print("Writing spectra file for " + str(voltage) + "mV")
                first = False
            column = column + [row[1]]  # Adds the current onto the column
            saved = True  # Means that a column is being saved
        elif saved:  # Executed if a column is waiting to be appended and the voltage has changed
            # I get stuck here...
At this point I think I need to somehow use item.append() like the example here but I'm not entirely sure how to implement it. Then I would set saved = False and v1 = row[0] and have the same else statement as the original working code so that on the next iteration things would proceed as desired.
Here is some simple sample data to work with (although mine is actually tab delimited):
.1, 1
.2, 2
.2, 2
.2, 2.1
.2, 2
.3, 3
.4, 4
.5, 5.1
.5, 5.2
.5, 5
.5, 5.1
My working code would take this and give me two files named 'noisespectra(#)200.0mV.csv' and 'noisespectra(#)500.0mV.csv' which are single columns '2,2,2.1,2' and '5.1,5.2,5,5.1' respectively. I would like code which makes a single file named 'noisespectra(#).csv' which is two columns, '200.0mV,2,2,2.1,2' and '500.0mV,5.1,5.2,5,5.1'. In general, a particular voltage will not have the same number of currents and I think this could be a potential problem in using the item.append() technique, particularly if the first voltage has fewer corresponding currents than future voltages.
Feel free to disregard the 'for x in range()'; I am looping through files with similar names but that's not important for my problem.
I greatly appreciate any help anyone can give me! If there are any questions, I will try to address them as quickly as I can.
Keep track of the two sets of values in two lists, then do ...
combined = map(None, list_1, list_2)
And then output the combined list to csv.
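Note that map(None, list_1, list_2) only pads the shorter list with None in Python 2. A hedged Python 3 sketch of the same idea using itertools.zip_longest (the file and column names here are illustrative, built from the sample data above):
import csv
from itertools import zip_longest

col_200mV = ['200.0mV', '2', '2', '2.1', '2']
col_500mV = ['500.0mV', '5.1', '5.2', '5', '5.1']

with open('noisespectra.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    # zip_longest pairs entries row by row, padding the shorter column
    for line in zip_longest(col_200mV, col_500mV, fillvalue=''):
        writer.writerow(line)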
