efficient way of computing a dataframe using concat and split - python

I am new to python/pandas/numpy and I need to create the following Dataframe:
DF = pd.concat([pd.Series(x[2]).apply(lambda r: pd.Series(re.split('\#|/',r))).assign(id=x[0]) for x in hDF])
where hDF is a dataframe that has been created by:
hDF=pd.DataFrame(h.DF)
and h.DF is a list whose elements looks like this:
['5203906',
['highway=primary',
'maxspeed=30',
'oneway=yes',
'ref=N 22',
'surface=asphalt'],
['3655224911#1.735928/42.543651',
'3655224917#1.735766/42.543561',
'3655224916#1.735694/42.543523',
'3655224915#1.735597/42.543474',
'4817024439#1.735581/42.543469']]
However, in some cases the list is very long (O(10^7)) and also the list in h.DF[*][2] is very long, so I run out of memory.
I can obtain the same result, avoiding the use of the lambda function, like so:
DF = pd.concat([pd.Series(x[2]).str.split('\#|/', expand=True).assign(id=x[0]) for x in hDF])
But I am still running out of memory in the cases where the lists are very long.
Can you think of a possible solution to obtain the same results without starving resources?

I managed to make it work using the following code:
bl = []
for x in h.DF:
data = np.loadtxt(
np.loadtxt(x[2], dtype=str, delimiter="#")[:, 1], dtype=float, delimiter="/"
).tolist()
[i.append(x[0]) for i in data]
bl.append(data)
bbl = list(itertools.chain.from_iterable(bl))
DF = pd.DataFrame(bbl).rename(columns={0: "lon", 1: "lat", 2: "wayid"})
Now it's super fast :)

Related

Using Multiprocessing on a Collection of DataFrames

Setup
I have a multiple datasets each with their own DataFrame. I'm running calculations within them before comparing my results to a separate DataFrame which we can think of as constraints.
For example lets say 2 sets of data in a dictionary:
df_data_1 = pd.DataFrame(np.random.randint(0,50,size=(10, 4)), columns=list('ABCD'))
df_data_2 = pd.DataFrame(np.random.randint(0,50,size=(10, 4)), columns=list('ABCD'))
data_sets = {'data_1': df_data, 'data_2': df_data_2}
and one set of constraints:
df_constraints = pd.DataFrame([['a', 10, 20, 10000000],
['b', 100, 200, 20000000],
['c', 1000, 2000, 30000000]])
df_constraints.columns = ['index', 'sumMin', 'sumMax', 'productMax']
df_constraints.set_index('index', inplace=True)
Visually:
data_set_1
data_set_2
constraints
Function
I'm making calculations within each set of data and then comparing them to a set of constraints. For the sake of simplifying my question I am only comparing the data to the first row of constraints here, but in reality I have to compare the results of my calculations within each data-set to up to 20 sets of constraints.
Here is a simplified version of the function that I am trying to have run in parallel:
def test_func(df_data, df_constraints):
# Run some calculations
df = df_data.copy()
df['sum'] = df.sum(axis=1)
df['product'] = df.product(axis=1)
# Compare results to constraints
df['sumFit'] = ((df['sum'] > df_constraints.loc['a', 'sumMin']) &
(df['sum'] < df_constraints.loc['a', 'sumMax']))
df['productFit'] = df['product'] < df_constraints.loc['a', 'productMax']
# Analyze results
count_sumFits = df['sumFit'].sum()
count_productFits = df['productFit'].sum()
df_results = pd.DataFrame([['data_set_1', count_sumFits, count_productFits]],
columns=['DataSet', 'FittingSums', 'FittingProducts'])
df_results.set_index('DataSet', inplace=True)
return df_results
Sequential Version
I can run this function sequentially through each set of data; iterating through the dictionary with a while loop and then append the results as shown here, but with increased complexity this is taking way longer than I would like. (It's ugly but it works)
n=0
while n < len(data_sets):
data_set_names = list(data_sets.keys())
df_temp = test_func(data_sets[data_set_names[n]], df_constraints)
df_all_results.loc[n, 'FittingSums'] = df_temp.loc[0, 'FittingSums']
df_all_results.loc[n, 'FittingProducts'] = df_temp.loc[0, 'FittingProducts']
n+=1
The Problem
When I have 25 data-sets and I'm running more complex analysis with more calculations, the run time ends up being minutes long. Leading me to pursue concurrency/multiprocessing. I'm hoping to make this significantly faster as it is one step of many that I'm trying to optimize and then run them all a few thousand times.
So, Multiprocessing...
Due to the need to pass two arguments to the function I've been looking at mp.Pool.starmap, and pool.map(partial(test_func, b=df_constraints), data_sets, but I haven't been able to get either method to work.
ex.1) mp.Pool.starmap
if __name__ == '__main__':
pool = mp.Pool(processes = 8)
output = pool.starmap(test_file.test_func, zip(data_sets, itertools.repeat(df_contraints)
This is as far as I've been able to get. Is it possible to process data concurrently like this and then append results to a dataframe? I don't need them to be in any particular order I just want to get the data into the right format.
I don't fully understand your code and your logic but replace data_sets by data_sets.values():
if __name__ == '__main__':
pool = mp.Pool(processes = 8)
output = pool.starmap(test_file.test_func, zip(data_sets.values(),itertools.repeat(df_contraints)))

Efficient way to loop with if statement

I have a sample data look like this (real dataset has more columns):
data = {'stringID':['AB CD Efdadasfd','RFDS EDSfdsadf dsa','FDSADFDSADFFDSA'],'IDct':[1,3,4]}
data = pd.DataFrame(data)
data['Index1'] = [[3,6],[7,9],[5,6]]
data['Index2'] = [[4,8],[10,13],[8,9]]
What i want to achieve is i want to slice stringID column based on second elment in Index1 and Index2 (both are list), only if IDct value is bigger than 1, otherwise return NaN.
I tried this, it works as Output1 column, but there must be a better way (i mean faster when apply to a large dataset) to do it, please kindly advise, thanks!
data['pos'] = data.Index1.map(lambda x: x[1])
data['pos1'] = data.Index2.map(lambda x: x[1])
def cal(m):
if m['IDct'] > 1:
return m['stringID'][m['pos']:m['pos1']]
else:
return 'NaN'
data['Output1'] = data.apply(cal,axis=1)
I love pandas - but realistically speaking it's just one of many tools that belong in your tool belt.
pandas and numpy really shine for computation and analysis. It's okay to use pandas to visualize and analyze your data - but that doesn't mean it's the right tool for the job.
This kind of problem is better suited for regular python. Assuming we can, let's move StringID and IDct out of the dict and back into lists. If we assume the result is regular in shape (all lists are of equal length)
StringID = ['AB CD Efdadasfd','RFDS EDSfdsadf dsa','FDSADFDSADFFDSA'],
IDct = [1,3,4]
Index1 = [[3,6],[7,9],[5,6]]
Index2 = [[4,8],[10,13],[8,9]]
for stringID, IDct, Index1, Index2 in zip(stringID, IDct, Index1, Index2):
result = []
if IDct > 1:
result.append(your_indexing_goes_here())
else:
result.append(None)
You can then blend the result data back in as you see fit.
data = {
'StringID': StringID,
'IDct': IDct,
'Index1': Index1,
'Index2': Index2,
'Result': result
}
pd.DataFrame(data)

Can this pandas workflow be converted to dask?

Please be nice - I'm not a proper programmer, I'm a scientist and I've read as many docs on this as I can find (they're a bit sparse).
I'm trying to convert this pandas code into dash because my input file is ~0.5TB with gz and it loads too slowly in native pandas. I have a 3 TB machine, btw.
This is an example of what I'm doing with pandas:
df = pd.DataFrame([['chr1',33329,17,'''33)'6'4?1&AB=?+..''','''X%&=E&!%,0("&"Y&!'''],
['chr1',33330,15,'''6+'/7=1#><C1*'*''','''X%=E!%,("&"Y&&!'''],
['chr1',33331,13,'''2*3A#/9#CC3--''','''X%E!%,("&"Y&!'''],
['chr1',33332,1,'''4**(,:3)+7-#<(0-''','''X%&E&!%,0("&"Y&!'''],
['chr1',33333,2,'''66(/C=*42A:.&*''','''X%=&!%0("&"&&!''']],
columns = ['chrom','pos','depth','phred','map'])
df.loc[:,'phred'] = [(sum(map(ord,i))-len(i)*33)/len(i) for i in df.loc[:,"phred"]]
df.loc[:,"map"] = [(sum(map(ord,i)))/len(i) for i in df.loc[:,"map"]]
df = df.astype({'phred': 'int32', 'map': 'int32'})
df.query('(depth < 10) | (phred < 7) | (map < 10)', inplace=True)
for chrom, df_tmp in df.groupby('chrom'):
df_end = df_tmp[~((df_tmp.pos.shift(0) == df_tmp.pos.shift(-1)-1))]
df_start = df_tmp[~((df_tmp.pos.shift(0) == df_tmp.pos.shift(+1)+1))]
for start, end in zip(df_start.pos, df_end.pos):
print (start, end)
Gives
33332 33333
This works (to find regions of a cancer genome with no data) and it's optimised as much as I know how.
I load the real thing like:
df = pd.read_csv(
'/Users/liamm/Downloads/test_head33333.tsv.gz',
sep='\t',
header=None,
index_col=None,
usecols=[0,1,3,5,6],
names = ['chrom','pos','depth','phred','map']
)
and I can do the same with Dask (way faster!):
df = dd.read_csv(
'/Users/liamm/Downloads/test_head33333.tsv.gz',
sep='\t',
header=None,
usecols=[0,1,3,5,6],
compression='gzip',
blocksize=None,
names = ['chrom','pos','depth','phred','map']
)
but i'm stuck here:
ff=[(sum(map(ord,i))-len(i)*33)/len(i) for i in df.loc[:,"phred"]]
df['phred'] = ff
Error: Column assignment doesn't support type list
Question - is this sort of thing possible? If so are there good tutes somewhere? I need to convert the whole block of pandas code above.
Thanks in advance!
You created list comprehensions to transform 'Fred' and 'map'; I converted these list comps to functions, and wrapped the functions in np.vectorize().
def func_p(p):
return (sum(map(ord, p)) - len(p) * 33) / len(p)
def func_m(m):
return (sum(map(ord, m))) / len(m)
vec_func_p = np.vectorize(func_p)
vec_func_m = np.vectorize(func_m)
np.vectorize() does not make code faster, but it does let you write a function with scalar inputs and outputs, and convert it to a function that takes array inputs and outputs.
The benefit is that we can now pass pandas Series to these functions (I also added the type conversion to this step):
df.loc[:, 'phred'] = vec_func_p( df.loc[:, 'phred']).astype(np.int32)
df.loc[:, 'map'] = vec_func_m( df.loc[:, 'map']).astype(np.int32)
Replacing the list comprehensions with these new functions gives the same results as your version (33332 33333).
#rpanai noted that you could eliminate the for loops. The following example uses groupby() (and a couple helper columns) to find the start and end position for each contiguous sequence of positions.
Using only pandas built-in functions should be compatible with Dask (and fast).
First, create demo data frame with multiple chromosomes and multiple contiguous blocks of positions:
data1 = {
'chrom' : 'chrom_1',
'pos' : [1000, 1001, 1002,
2000, 2001, 2002, 2003]}
data2 = {
'chrom' : 'chrom_2',
'pos' : [30000, 30001, 30002, 30003, 30004,
40000, 40001, 40002, 40003, 40004, 40005]}
df = pd.DataFrame(data1).append( pd.DataFrame(data2) )
Second, create two helper functions:
rank is a sequential counter for each group;
key is constant for positions in a contiguous 'run' of positions.
df['rank'] = df.groupby('chrom')['pos'].rank(method='first')
df['key'] = df['pos'] - df['rank']
Third, group by chrom and key to create a groupby object for each contiguous block of positions, then use min and max to find start and end value for the positions.
result = (df.groupby(['chrom', 'key'])['pos']
.agg(['min', 'max'])
.droplevel('key')
.rename(columns={'min': 'start', 'max': 'end'})
)
print(result)
start end
chrom
chrom_1 1000 1002
chrom_1 2000 2003
chrom_2 30000 30004
chrom_2 40000 40005

Python, loops with changeable parts of filenames

I have a bunch of very similar commands which all look like this (df means pandas dataframe):
df1_part1=...
df1_part2=...
...
df1_part5=...
df2_part1=...
I would like to make a loop for it, as follows:
for i in range(1,5):
for j in range(1,5):
df%i_part%j=...
Of course, it doesn't work with %. But is has to be some easy way to do it, I suppose.
Could You help me please?
You can try one of the following options:
Create a dictionary which maps the your df and access it by the name of the dataframe:
mapping = {"df1_part1": df1_part1, "df1_part2": df1_part2}
for i in range(1,5):
for j in range(1,5):
mapping[f"df{i}_part{j}"] = ...
Use globals to access dynamically your variables:
df1_part1=...
df1_part2=...
...
df1_part5=...
df2_part1=...
for i in range(1,5):
for j in range(1,5):
globals()[f"df{i}_part{j}"] = ...
One way would be to collect your pandas dataframes in a list of lists and iterate over that list instead of trying dynamically parse your python code.
df1_part1=...
df1_part2=...
...
df1_part5=...
df2_part1=...
dflist = [[df1_part1, df1_part2, df1_part3, df1_part4, df1_part5],
[df2_part1, df2_part2, df2_part3, df2_part4, df2_part5]]
for df in dflist:
for df_part in df:
# do something with df_part
Assuming that this process is part of data preparation, I would like to mention that you should try to work with "data preparation pipelines" whenever it is possible. Otherwise, the code will be a huge mess to read after a couple of months.
There are several ways to deal with this problem.
A dictionary is the most straightforward way to deal with this.
df_parts = {
'df1' : {'part1': df1_part1, 'part2': df1_part2,...,'partN': df1_partN},
'df2' : {'part1': df1_part1, 'part2': df1_part2,...,'partN': df2_partN},
'...' : {'part1': ..._part1, 'part2': ..._part2,...,'partN': ..._partN},
'dfN' : {'part1': dfN_part1, 'part2': dfN_part2,...,'partN': dfN_partN},
}
# print parts from `dfN`
for val in for df_parts['dfN'].values():
print(val)
# print part1 for all dfs
for df in df_parts.values():
print(df['part1'])
# print everything
for df in df_parts:
for val in df_parts[df].values():
print(val)
The good thing with this approach is that you can iterate through the whole dictionary, but you don't include range which may be confusing later. Also, it is better to assign every df_part directly to a dict instead of assigning N*N variables which may be used once or twice. In this case you can just use 1 variable and re-assign it as you progress:
# code using df1_partN
df1 = df_parts['df1']['partN']
# stuff to do
# happy? checkpoint
df_parts['df1']['partN'] = df1

pandas - drop row with list of values, if contains from list

I have a huge set of data. Something like 100k lines and I am trying to drop a row from a dataframe if the row, which contains a list, contains a value from another dataframe. Here's a small time example.
has = [['#a'], ['#b'], ['#c, #d, #e, #f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
tweet user
0 [#a] 1
1 [#b] 2
2 [#c, #d, #e, #f] 3
3 [#g] 5
z
0 #d
1 #a
The desired outcome would be
tweet user
0 [#b] 2
1 [#g] 5
Things i've tried
#this seems to work for dropping #a but not #d
for a in range(df.tweet.size):
for search in df2.z:
if search in df.loc[a].tweet:
df.drop(a)
#this works for my small scale example but throws an error on my big data
df['tweet'] = df.tweet.apply(', '.join)
test = df[~df.tweet.str.contains('|'.join(df2['z'].astype(str)))]
#the error being "unterminated character set at position 1343770"
#i went to check what was on that line and it returned this
basket.iloc[1343770]
user_id 17060480
tweet [#IfTheyWereBlackOrBrownPeople, #WTF]
Name: 4612505, dtype: object
Any help would be greatly appreciated.
is ['#c, #d, #e, #f'] 1 string or a list like this ['#c', '#d', '#e', '#f'] ?
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
simple solution would be
screen = set(df2.z.tolist())
to_delete = list() # this will speed things up doing only 1 delete
for id, row in df.iterrows():
if set(row.tweet).intersection(screen):
to_delete.append(id)
df.drop(to_delete, inplace=True)
speed comparaison (for 10 000 rows):
st = time.time()
screen = set(df2.z.tolist())
to_delete = list()
for id, row in df.iterrows():
if set(row.tweet).intersection(screen):
to_delete.append(id)
df.drop(to_delete, inplace=True)
print(time.time()-st)
2.142000198364258
st = time.time()
for a in df.tweet.index:
for search in df2.z:
if search in df.loc[a].tweet:
df.drop(a, inplace=True)
break
print(time.time()-st)
43.99799990653992
For me, your code works if I make several adjustments.
First, you're missing the last line when putting range(df.tweet.size), either increase this or (more robust, if you don't have an increasing index), use df.tweet.index.
Second, you don't apply your dropping, use inplace=True for that.
Third, you have #d in a string, the following is not a list: '#c, #d, #e, #f' and you have to change it to a list so it works.
So if you change that, the following code works fine:
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
for a in df.tweet.index:
for search in df2.z:
if search in df.loc[a].tweet:
df.drop(a, inplace=True)
break # so if we already dropped it we no longer look whether we should drop this line
This will provide the desired result. Be aware of this potentially being not optimal due to missing vectorization.
EDIT:
you can achieve the string being a list with the following:
from itertools import chain
df.tweet = df.tweet.apply(lambda l: list(chain(*map(lambda lelem: lelem.split(","), l))))
This applies a function to each line (assuming each line contains a list with one or more elements): Split each element (should be a string) by comma into a new list and "flatten" all the lists in one line (if there are multiple) together.
EDIT2:
Yes, this is not really performant But basically does what was asked. Keep that in mind and after having it working, try to improve your code (less for iterations, do tricks like collecting the indices and then drop all of them).

Categories

Resources