How to aggregate records in Python?

I have the following csv file (each line has a dynamic number of characters, but the columns are fixed... hope I am making sense):
**001** Math **02/20/2013** A
**001** Literature **03/02/2013** B
**002** Biology **01/01/2013** A
**003** Biology **04/08/2013** A
**001** Biology **05/01/2013** B
**002** Math **03/10/2013** C
I am trying to write the results to another csv file in the following format, grouped by student id and ordered by date in ascending order:
001,#Math;A;02/20/2013#Biology;B;05/01/2013#Literature;B;03/02/2013
002,#Biology;A;01/01/2013#Math;C;03/10/2013
003,#Biology;A;04/08/2013
There is one constraint though: the input file is huge, around 200 million rows. I tried using C#, storing the data in a database, and writing SQL queries. It was very slow and not acceptable. After googling, I hear Python is very powerful for these operations. I am new to Python and have started playing with the code. I would really appreciate the Python gurus helping me get the results I described above.

content='''
**001** Math **02/20/2013** A
**001** Literature **03/02/2013** B
**002** Biology **01/01/2013** A
**003** Biology **04/08/2013** A
**001** Biology **05/01/2013** B
**002** Math **03/10/2013** C
'''
from collections import defaultdict
lines = content.split("\n")
items_iter = (line.split() for line in lines if line.strip())
aggregated = defaultdict(list)
for items in items_iter:
    stud, class_, date, grade = (t.strip('*') for t in items)
    aggregated[stud].append((class_, grade, date))

for stud, data in aggregated.iteritems():
    full_grades = [';'.join(items) for items in data]
    print '{},#{}'.format(stud, '#'.join(full_grades))
Output:
003,#Biology;A;04/08/2013
002,#Biology;A;01/01/2013#Math;C;03/10/2013
001,#Math;A;02/20/2013#Literature;B;03/02/2013#Biology;B;05/01/2013
Of course, this is ugly, hackish code, just to show how it can be done in Python. When working with large streams of data, use generators and iterators; don't use file.readlines(), just iterate over the file. Iterators do not read all the data at once but chunk by chunk as you iterate, and not earlier.
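For an actual file rather than an in-memory string, a minimal Python 3 sketch of the same approach might look like the following (the file names records.txt and aggregated.csv are placeholders, and it also sorts each student's courses by date, since the question asks for ascending date order). Note that the aggregated dict still has to fit in memory, which is what the bucketing suggestion below addresses.
from collections import defaultdict
from datetime import datetime

aggregated = defaultdict(list)

with open('records.txt') as fin:
    # the file object is a lazy iterator: one line at a time, never the whole file
    for line in fin:
        if not line.strip():
            continue
        stud, class_, date, grade = (t.strip('*') for t in line.split())
        aggregated[stud].append((class_, grade, date))

with open('aggregated.csv', 'w') as fout:
    for stud in sorted(aggregated):
        # ascending date order within each student
        courses = sorted(aggregated[stud],
                         key=lambda c: datetime.strptime(c[2], '%m/%d/%Y'))
        fout.write('{},#{}\n'.format(stud, '#'.join(';'.join(c) for c in courses)))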
If you are concerned that 200M records will not fit in memory, then do the following:
sort the records into separate "buckets" (as in bucket sort) by student id
cat all_records.txt | grep 001 > stud_001.txt # do the same for the other students
do the processing per bucket
merge the results
grep is just an example; write a fancier script (awk, or Python again) that filters by student ID and, for example, keeps all records with ID < 1000, then 1000 <= ID < 2000, and so on. You can do this safely because the records per student are disjoint. A rough Python sketch of such a bucketing filter follows.
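A rough sketch of such a bucketing pass in Python (not a polished script; the bucket width and file names are arbitrary choices):
# Split the big input into ID-range buckets; each bucket can then be aggregated
# independently with the code above and the per-bucket outputs concatenated.
BUCKET_WIDTH = 1000                      # arbitrary choice

out_files = {}
with open('records.txt') as fin:
    for line in fin:
        if not line.strip():
            continue
        stud = int(line.split()[0].strip('*'))
        bucket = stud // BUCKET_WIDTH    # IDs 0-999 -> bucket 0, 1000-1999 -> bucket 1, ...
        if bucket not in out_files:
            out_files[bucket] = open('bucket_{}.txt'.format(bucket), 'w')
        out_files[bucket].write(line)

for f in out_files.values():
    f.close()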

Related

Hierarchical dictionary (reducing memory footprint or using a database)

I am working with extremely high dimensional biological count data (single cell RNA sequencing where rows are cell ID and columns are genes).
Each dataset is a separate flat file (AnnData format). Each flat file can be broken down by various metadata attributes, including by cell type (eg: muscle cell, heart cell), subtypes (eg: a lung dataset can be split into normal lung and cancerous lung), cancer stage (eg: stage 1, stage 2), etc.
The goal is to pre-compute aggregate metrics for a specific metadata column, sub-group, dataset, cell-type, gene combination and keep that readily accessible such that when a person queries my web app for a plot, I can quickly retrieve results (refer to Figure below to understand what I want to create). I have generated Python code to assemble the dictionary below and it has sped up how quickly I can create visualizations.
Only issue now is that the memory footprint of this dictionary is very high (there are ~10,000 genes per dataset). What is the best way to reduce the memory footprint of this dictionary? Or, should I consider another storage framework (briefly saw something called Redis Hashes)?
One option to reduce your memory footprint while keeping fast lookups is to use an HDF5 file as a database. This will be a single large file that lives on disk instead of in memory, but it is structured the same way as your nested dictionaries and allows rapid lookups by reading in only the data you need. Writing the file will be slow, but you only have to do it once and can then upload it to your web app.
To test this idea, I've created two test nested dictionaries in the format of the diagram you shared. The small one has 1e5 metadata/group/dataset/celltype/gene entries, and the other is 10 times larger.
Writing the small dict to hdf5 took ~2 minutes and resulted in a file 140 MB in size while the larger dict-dataset took ~14 minutes to write to hdf5 and is a 1.4 GB file.
Querying the small and large hdf5 files took similar amounts of time, showing that the queries scale well to more data.
Here's the code I used to create the test dict-datasets, write them to hdf5, and query them:
import h5py
import numpy as np
import time
def create_data_dict(level_counts):
    """
    Create test data in the same nested-dict format as the diagram you show
    The Agg_metric values are random floats between 0 and 1
    (you shouldn't need this function since you already have real data in dict format)
    """
    if not level_counts:
        return {f'Agg_metric_{i+1}': np.random.random() for i in range(num_agg_metrics)}

    level, num_groups = level_counts.popitem()
    return {f'{level}_{i+1}': create_data_dict(level_counts.copy()) for i in range(num_groups)}

def write_dict_to_hdf5(hdf5_path, d):
    """
    Write the nested dictionary to an HDF5 file to act as a database
    only have to create this file once, but can then query it any number of times
    (unless the data changes)
    """
    def _recur_write(f, d):
        for k, v in d.items():
            #check if the next level is also a dict
            sk, sv = v.popitem()
            v[sk] = sv

            if type(sv) == dict:
                #this is a 'node', move on to next level
                _recur_write(f.create_group(k), v)
            else:
                #this is a 'leaf', stop here
                leaf = f.create_group(k)
                for sk, sv in v.items():
                    leaf.attrs[sk] = sv

    with h5py.File(hdf5_path, 'w') as f:
        _recur_write(f, d)

def query_hdf5(hdf5_path, search_terms):
    """
    Query the hdf5_path with a list of search terms
    The search terms must be in the order of the dict, and have a value at each level
    Output is a dict of agg stats
    """
    with h5py.File(hdf5_path, 'r') as f:
        k = '/'.join(search_terms)
        try:
            f = f[k]
        except KeyError:
            print('oh no! at least one of the search terms wasnt matched')
            return {}

        return dict(f.attrs)
################
# start #
################
#this "small_level_counts" results in an hdf5 file of size 140 MB (took < 2 minutes to make)
#all possible nested dictionaries are made,
#so there are 40*30*10*3*3 = ~1e5 metadata/group/dataset/celltype/gene entries
num_agg_metrics = 7
small_level_counts = {
    'Gene': 40,
    'Cell_Type': 30,
    'Dataset': 10,
    'Unique_Group': 3,
    'Metadata': 3,
}

#"large_level_counts" results in an hdf5 file of size 1.4 GB (took 14 mins to make)
#has 400*30*10*3*3 = ~1e6 metadata/group/dataset/celltype/gene combinations
num_agg_metrics = 7
large_level_counts = {
    'Gene': 400,
    'Cell_Type': 30,
    'Dataset': 10,
    'Unique_Group': 3,
    'Metadata': 3,
}

#Determine which test dataset to use
small_test = True
if small_test:
    level_counts = small_level_counts
    hdf5_path = 'small_test.hdf5'
else:
    level_counts = large_level_counts
    hdf5_path = 'large_test.hdf5'
np.random.seed(1)
start = time.time()
data_dict = create_data_dict(level_counts)
print('created dict in {:.2f} seconds'.format(time.time()-start))
start = time.time()
write_dict_to_hdf5(hdf5_path,data_dict)
print('wrote hdf5 in {:.2f} seconds'.format(time.time()-start))
#Search terms in order of most broad to least
search_terms = ['Metadata_1','Unique_Group_3','Dataset_8','Cell_Type_15','Gene_17']
start = time.time()
query_result = query_hdf5(hdf5_path,search_terms)
print('queried in {:.2f} seconds'.format(time.time()-start))
direct_result = data_dict['Metadata_1']['Unique_Group_3']['Dataset_8']['Cell_Type_15']['Gene_17']
print(query_result == direct_result)
Although Python dictionaries themselves are fairly efficient in terms of memory usage, you are likely storing multiple copies of the strings you use as dictionary keys. From your description of your data structure, it is likely that you have 10,000 copies of "Agg_metric_1", "Agg_metric_2", etc. for every gene in your dataset. These duplicate strings are probably taking up a significant amount of memory. They can be deduplicated with sys.intern so that, although you still have just as many references to the string in your dictionary, they all point to a single copy in memory. You would only need a minimal adjustment to your code, simply changing the assignment to data[sys.intern('Agg_metric_1')] = value. I would do this for all of the keys used at all levels of your dictionary hierarchy.
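To illustrate the effect (the key and gene names below are made up, not the asker's real data), a minimal sketch:
import sys
import numpy as np

# Without sys.intern, every gene dict would carry its own copies of the
# "Agg_metric_N" key strings; interning makes them all share one object each.
num_genes = 10_000
data = {}
for i in range(num_genes):
    gene = {}
    for j in range(7):
        gene[sys.intern(f'Agg_metric_{j+1}')] = np.random.random()
    data[f'Gene_{i+1}'] = gene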

Applying function to pandas dataframe: is there a more efficient way of doing this?

I have a dataframe that has a small number of columns but many rows (about 900K right now, and it's going to get bigger as I collect more data). It looks like this:
   Author             Title            Date        Category            Text                                               url
0  Amira Charfeddine  Wild Fadhila 01  2019-01-01  novel               الكتاب هذا نهديه لكل تونسي حس إلي الكتاب يحكي ...  NaN
1  Amira Charfeddine  Wild Fadhila 02  2019-01-01  novel               في التزغريت، والعياط و الزمامر، ليوم نتيجة الب...  NaN
2  253826             1515368_7636953  2010-12-28  /forums/forums/91/  هذا ما ينص عليه إدوستور التونسي لا رئاسة مدى ا...  https://www.tunisia-sat.com/forums/threads/151...
3  250442             1504416_7580403  2010-12-21  /forums/sports/     \n\n\n\n\n\nاعلنت الجامعة التونسية لكرة اليد ا...  https://www.tunisia-sat.com/forums/threads/150...
4  312628             1504416_7580433  2010-12-21  /forums/sports/     quel est le résultat final\n,,,,????               https://www.tunisia-sat.com/forums/threads/150...
The "Text" Column has a string of text that may be just a few words (in the case of a forum post) or it may a portion of a novel and have tens of thousands of words (as in the two first rows above).
I have code that constructs the dataframe from various corpus files (.txt and .json), then cleans the text and saves the cleaned dataframe as a pickle file.
I'm trying to run the following code to analyze how variable the spelling of different words is in the corpus. The functions seem simple enough: one counts the occurrences of a particular spelling variant in each Text row; the other takes a list of such frequencies and computes a Gini coefficient for each lemma (which is just a numerical measure of how heterogeneous the spelling is). It references a spelling_var dictionary that has a lemma as its key and the various ways of spelling that lemma as values (like {'color': ['color', 'colour']}, except not in English).
This code works, but it uses a lot of CPU time. I'm not sure how much, but I use PythonAnywhere for my coding and this code sends me into the tarpit (in other words, it makes me exceed my daily allowance of CPU seconds).
Is there a way to do this so that it's less CPU intensive? Preferably without me having to learn another package (I've spent the past several weeks learning Pandas and am liking it, and need to just get on with my analysis). Once I have the code and have finished collecting the corpus, I'll only run it a few times; I won't be running it everyday or anything (in case that matters).
Here's the code:
import pickle
import pandas as pd
import re
with open('1_raw_df.pkl', 'rb') as pickle_file:
    df = pickle.load(pickle_file)

spelling_var = {
    'illi': ["الي", "اللي"],
    'besh': ["باش", "بش"],
    # ...
}

spelling_df = df.copy()

def count_word(df, word):
    pattern = r"\b" + re.escape(word) + r"\b"
    return df['Text'].str.count(pattern)

def compute_gini(freq_list):
    proportions = [f/sum(freq_list) for f in freq_list]
    squared = [p**2 for p in proportions]
    return 1-sum(squared)

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
        gini = compute_gini(count_list)
        spelling_df[w] = gini
I rewrote two lines in the last double loop; see the comments in the code below. Does this solve your issue?
gini_lst = []

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
        #gini = compute_gini(count_list) # don't think you need to compute this at every iteration of the inner loop, right?
        #spelling_df[w] = gini # having this inside of the loop creates a new column at each iteration, which could crash your CPU
    gini_lst.append(compute_gini(count_list))

# this creates a df with a row for each lemma with its associated gini value
df_lemma_gini = pd.DataFrame(data={"lemma_column": list(spelling_var.keys()), "gini_column": gini_lst})

Efficient query on a sorted csv

I have a .csv with several million rows. The first column is the id of each entry, and each id only occurs one time. The first column is sorted. Intuitively I'd say that it might be pretty easy to query this file efficiently using a divide and conquer algorithm. However, I couldn't find anything related to this.
Sample .csv file:
+----+------------------+-----+
| id | name | age |
+----+------------------+-----+
| 1 | John Cleese | 34 |
+----+------------------+-----+
| 3 | Mary Poppins | 35 |
+----+------------------+-----+
| .. | ... | .. |
+----+------------------+-----+
| 87 | Barry Zuckerkorn | 45 |
+----+------------------+-----+
I don't want to load the file in memory (too big), and I prefer to not use databases. I know I can just import this file in sqlite, but then I have multiple copies of this data, and I'd prefer to avoid that for multiple reasons.
Is there a good package I'm overlooking? Or is it something that I'd have to write myself?
OK, my understanding is that you want some of the functionality of a light database but are constrained to use a csv text file to hold the data. IMHO, this is probably a questionable design: past several hundred rows, I would only see a csv file as an intermediate or exchange format.
As it is a very uncommon design, it is unlikely that a package for it already exists; for my part, I know of none. So I can imagine two possible ways: scan the file once and build an index of id -> row_position, then use that index for your queries. Depending on the actual length of your rows, you could index only every n-th row to trade memory for speed. But it costs an index file.
An alternative is a direct divide-and-conquer algorithm: use stat/fstat to get the file size, and search for the next end of line starting at the middle of the file. You immediately get an id after it. If the id you want is that one, fine, you have won; if it is greater, recurse into the upper part; if it is less, recurse into the lower part. But because of the need to search for ends of lines, be prepared for corner cases like never finding an end of line in the expected range, or finding it at the very end.
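A rough sketch of the first idea (build the index once, then every lookup is a single seek; the file name and the comma separator are assumptions):
def build_index(fname):
    """One pass over the csv: map each id to the byte offset of its row."""
    index = {}
    with open(fname, 'rb') as fin:
        offset = fin.tell()
        line = fin.readline()
        while line:
            first_field = line.split(b',')[0]
            if first_field.isdigit():        # skip the header row, if any
                index[int(first_field)] = offset
            offset = fin.tell()
            line = fin.readline()
    return index

def lookup(fname, index, row_id):
    """Jump straight to the stored offset and read one line."""
    with open(fname, 'rb') as fin:
        fin.seek(index[row_id])
        return fin.readline().decode().rstrip('\n')

# index = build_index('data.csv')        # slow, done once
# print(lookup('data.csv', index, 505))  # fast, repeated as often as needed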
After Serge's answer, I decided to write my own implementation; here it is. It doesn't allow newlines inside fields and doesn't deal with a lot of details regarding the .csv format. It assumes that the .csv is sorted on the first column and that the first column holds integer values.
import os

def query_sorted_csv(fname, id):
    filesize = os.path.getsize(fname)
    with open(fname) as fin:
        row = look_for_id_at_location(fin, 0, filesize, id)
        if not row:
            raise Exception('id not found!')
        return row

def look_for_id_at_location(fin, location_lower, location_upper, id, sep=',', id_column=0):
    location = int((location_upper + location_lower) / 2)
    if location_upper - location_lower < 2:
        return False
    fin.seek(location)
    next(fin)
    try:
        full_line = next(fin)
    except StopIteration:
        return False
    id_at_location = int(full_line.split(sep)[id_column])
    if id_at_location == id:
        return full_line
    if id_at_location > id:
        return look_for_id_at_location(fin, location_lower, location, id)
    else:
        return look_for_id_at_location(fin, location, location_upper, id)

row = query_sorted_csv('data.csv', 505)
You can look up about 4000 ids per second in a 2 million row 250MB .csv file. In comparison, you can look up 3 ids per second whilst looping over the entire file line by line.

pandas algorithm slow: for loops and lambda

summary: I am searching for misspellings between a bunch of data and it is taking forever
I am iterating through a few CSV files (a million lines total?); in each, I am iterating through a json sub-value that has maybe 200 strings to search for. For each loop over the json values, I am adding a column to each dataframe, then using a lambda function with Levenshtein's distance algorithm to find misspellings. I then output any row that contains a potential misspelling.
code:
for file in file_list:  #20+ files
    df = pd.read_csv(file, usecols=["search column", "a bunch of other columns..."])  #50k lines each-ish
    for v in json_data.values():  #30 ish json values
        for row in v["json_search_string"]:  #200 ish substrings
            df_temp = df
            df_temp['jelly'] = row
            df_temp['difference'] = df_temp.apply(lambda x: jellyfish.levenshtein_distance(x['search column'], x['jelly']), axis=1)
            df_agg = df_temp[df_temp['difference'] < 3]
            if os.path.isfile(filepath+"levenshtein.csv"):
                with open(filepath+"levenshtein.csv", 'a') as f:
                    df_agg.to_csv(f, header=False)
            else:
                df_agg.to_csv(filtered_filepath+"levenshtein.csv")
I've tried the same algorithm before, but just to keep it short, instead of iterating through all JSON values for each CSV, I just did a single JSON value, like this:
for file in file_list:  #20+ files
    df = pd.read_csv(file, usecols=["search column", "a bunch of other columns..."])  #50k lines each-ish
    for row in data['z']['json_search_string']:
        #levenshtein algorithm above
The above loop took about 100 minutes to run through! (Edit: it takes about 1-3 seconds for the lambda function to run each time) And there are about 30 of them in the JSON file. Any ideas on how I can condense the algorithm and make it faster? I've thought maybe I could take all 200ish json sub strings and add them each as a column to each df and somehow run a lambda function that searches all columns at once, but I am not sure how to do that yet. This way I would only iterate the 20 files 30 times each, as opposed to however many thousand iterations that the 3rd layer for loop is adding on. Thoughts?
Notes:
Here is an example of what the data might look like:
JSON data
{
    "A": {
        "email": "blah",
        "name": "Joe Blah",
        "json_search_string": [
            "Company A",
            "Some random company name",
            "Company B",
            "etc",
            "..."
And the csv columns:
ID, Search Column, Other Columns
1, Clompany A, XYZ
2, Company A, XYZ
3, Some misspelled company, XYZ
etc
Well, it is really hard to answer a performance-enhancement question.
Depending on the effort and the expected gain, here are some suggestions.
Small tweak by re-arranging your code logic. Effort: small. Expected enhancement: small. Going through your code, I see that you compare words from the files (20+ of them) with a fixed JSON file (only one). Instead of reading the JSON file for every CSV file, why not first prepare the fixed word list from the JSON file and use it for all the following comparisons? The logic is like:
# prepare fixed words from JSON DATA
fixed_words = []
for v in json_data.values():
    fixed_words += v["json_search_string"]

# loop over each file and compare it against the words in fixed_words
for f in file_list:
    # do the comparison and save.
Using multiprocessing. Effort: small. Expected enhancement: medium. Since all your work is similar, why not try multiprocessing? You could apply multiprocessing per file OR when doing dataframe.apply. There are lots of resources on multiprocessing; please have a look. It is easy to implement for your case (a minimal sketch appears after this list).
Using another language to implement the Levenshtein distance. The bottleneck of your code is the computation of the Levenshtein distance. You used the jellyfish python package, which is pure Python (so, of course, performance is not good for a large set). Here are some other options:
a. An existing python package with a C/C++ implementation. Effort: small. Expected enhancement: high. Thanks to the comment from @Corley Brigman, editdistance is one option you can use.
b. Self-implementation in Cython. Effort: medium. Enhancement: medium or high. Check the pandas documentation on enhancing performance.
c. Self-implementation in C/C++ as a wrapper. Effort: high; expected enhancement: high. Check "Wrapping with C/C++".
You could combine several of my suggestions to gain higher performance.
Hope this is helpful.
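A minimal sketch of the per-file multiprocessing idea from suggestion 2 (the column name, file paths, and fixed_words are placeholders taken from the question and suggestion 1; this is not a drop-in replacement for the full script):
import multiprocessing as mp
from functools import partial

import jellyfish
import pandas as pd

def process_file(file, fixed_words):
    """Return the rows of one csv whose 'search column' is within distance 2 of any fixed word."""
    df = pd.read_csv(file, usecols=["search column"])
    hits = []
    for word in fixed_words:
        mask = df["search column"].apply(
            lambda s: jellyfish.levenshtein_distance(s, word) < 3)
        hits.append(df[mask])
    return pd.concat(hits) if hits else pd.DataFrame()

if __name__ == "__main__":
    file_list = ["file1.csv", "file2.csv"]       # placeholder paths (20+ in reality)
    fixed_words = ["Company A", "Company B"]     # built from the JSON as in suggestion 1
    with mp.Pool() as pool:                      # one worker per CPU core by default
        results = pool.map(partial(process_file, fixed_words=fixed_words), file_list)
    pd.concat(results).to_csv("levenshtein.csv", index=False)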
You could change your code to :
for file in file_list:  #20+ files
    df = pd.read_csv(file, usecols=["search column", "a bunch of other columns..."])  #50k lines each-ish
    x_search = df['search column']
    for v in json_data.values():  #30 ish json values
        for row in v["json_search_string"]:  #200 ish substrings
            # plain list comprehension instead of DataFrame.apply
            mask = [jellyfish.levenshtein_distance(s1, row) < 3 for s1 in x_search]
            df_agg = df[mask]
            if os.path.isfile(filepath+"levenshtein.csv"):
                with open(filepath+"levenshtein.csv", 'a') as f:
                    df_agg.to_csv(f, header=False)
            else:
                df_agg.to_csv(filtered_filepath+"levenshtein.csv")
apply returns a copy of a Series, which can be more expensive:
a = range(10**4)
b = range(10**4,2*(10**4))
%timeit [ (x*y) <3 for x,y in zip(a,b)]
%timeit pd.DataFrame([a,b]).apply(lambda x: x[0]*x[1] < 3 )
1000 loops, best of 3: 1.23 ms per loop
1 loop, best of 3: 668 ms per loop

Python: fast iteration through file

I need to iterate through two files many millions of times,
counting the number of appearances of word pairs throughout the files
(in order to build a contingency table of two words and calculate a Fisher's Exact Test score).
I'm currently using:
from itertools import izip

src = tuple(open('src.txt', 'r'))
tgt = tuple(open('tgt.txt', 'r'))
w1count = 0
w2count = 0
w1 = 'someword'
w2 = 'anotherword'
for x, y in izip(src, tgt):
    if w1 in x:
        w1count += 1
    if w2 in y:
        w2count += 1
.....
While this is not bad, I want to know if there is any faster way to iterate through two files, hopefully significantly faster.
I appreciate your help in advance.
I still don't quite get what exactly you are trying to do, but here's some example code that might point you in the right direction.
We can use a dictionary or a collections.Counter instance to count all occurring words and pairs in a single pass through the files. After that, we only need to query the in-memory data.
import collections
import itertools
import re

def find_words(line):
    for match in re.finditer(r"\w+", line):
        yield match.group().lower()

counts1 = collections.Counter()
counts2 = collections.Counter()
counts_pairs = collections.Counter()

with open("src.txt") as f1, open("tgt.txt") as f2:
    for line1, line2 in itertools.izip(f1, f2):
        words1 = list(find_words(line1))
        words2 = list(find_words(line2))
        counts1.update(words1)
        counts2.update(words2)
        counts_pairs.update(itertools.product(words1, words2))

print counts1["someword"]
print counts1["anotherword"]
print counts_pairs["someword", "anotherword"]
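If the end goal is the Fisher's exact test mentioned in the question, the same single pass can also build the 2x2 contingency table for a word pair. A rough sketch (Python 3 syntax; scipy is an extra dependency, and the word pair and file names are just the question's examples):
from scipy.stats import fisher_exact

w1, w2 = 'someword', 'anotherword'
both = only_w1 = only_w2 = neither = 0

with open('src.txt') as f1, open('tgt.txt') as f2:
    for x, y in zip(f1, f2):          # itertools.izip on Python 2
        has_w1, has_w2 = w1 in x, w2 in y
        if has_w1 and has_w2:
            both += 1
        elif has_w1:
            only_w1 += 1
        elif has_w2:
            only_w2 += 1
        else:
            neither += 1

# 2x2 contingency table of line-level co-occurrence
table = [[both, only_w1], [only_w2, neither]]
odds_ratio, p_value = fisher_exact(table)
print(table, odds_ratio, p_value)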
In general, if your data is small enough to fit into memory, then your best bet is to:
Pre-process the data into memory
Iterate over the in-memory structures
If the files are large, you may be able to pre-process them into data structures, such as your zipped data, and save them into a separate file in a format such as pickle that is much faster to load and work with, then process that.
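A rough sketch of that pre-process-then-pickle idea (Python 3 syntax; the file names are placeholders):
import pickle

# one-time, slow step: parse the raw text files and save the paired structure
with open('src.txt') as f1, open('tgt.txt') as f2:
    pairs = [(x.split(), y.split()) for x, y in zip(f1, f2)]   # itertools.izip on Python 2
with open('pairs.pkl', 'wb') as out:
    pickle.dump(pairs, out, protocol=pickle.HIGHEST_PROTOCOL)

# every later run: fast load, then iterate over the in-memory structure many times
with open('pairs.pkl', 'rb') as f:
    pairs = pickle.load(f)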
Just as an out-of-the-box solution:
Have you tried turning the files into Pandas data frames? I.e., I assume you already make a word list out of the input (by removing punctuation such as . and ,) using input.split(' ') or something similar. You can then turn those into DataFrames, perform a word count, and then make a cartesian join:
import pandas as pd
df_1 = pd.DataFrame(src, columns=['word_1'])
df_1['count_1'] = 1
df_1 = df_1.groupby(['word_1']).sum()
df_1 = df_1.reset_index()
df_2 = pd.DataFrame(trg, columns=['word_2'])
df_2['count_2'] = 1
df_2 = df_2.groupby(['word_2']).sum()
df_2 = df_2.reset_index()
df_1['link'] = 1
df_2['link'] = 1
result_df = pd.merge(left=df_1, right=df_2, left_on='link', right_on='link')
del result_df['link']
I use stuff like this for basket analysis, works really well.
