compare values in different chunks using pandas

compare values in different chunks using pandas - python

Say I have in memory a large file, loaded using chunksize in pandas. Now I have to compare every value with the ones ajdacent to it. My problem is that I can't seem to select at the same time the extreme values (in first and last position) of two different chunks.
Example:
print(df)
a
0 102
1 101
2 104
3 110
4 104
5 105
count = 0
for i in range(len(df)-1):
if df.iloc[i+1]['a']>df.iloc[i]['a']:
count+=1
count would be equal to 3 in this example. But say I have loaded df from a .csv with chunksize=1, how would I achieve a similar result, considering that values will be in different chunks? In practice chunksize is 10000 and so the problem would be limited to the first and last value for each chunk.

EDIT:
Here is an example where I store the last_chunk_value to update the value when running the next loop.
I've tested a 'brut force' method to compare with the 'chunk script'. The results are the same with both methods.
By the way, I've simplified the 'brut force' method.
import pandas as pd
import numpy as np
import random
# 'data' generation as csv file
file = open("data.csv", 'w')
file.write('rand_int' + '\n')
for i in range(0, 10000):
file.write(str(random.randint(80,120)) + '\n')
file.close()
# "brute force method"
df = pd.read_csv("data.csv")
length = int( (df.shift(-1) - df > 0).sum() )
print('number=', length)
# chunksize method
chunksize = 33
length = 0
last_chunk_value = np.nan
for chunk in pd.read_csv("data.csv", chunksize=chunksize):
chunk['shift'] = chunk.shift(1)
chunk.iloc[0, 1] = last_chunk_value
length += (chunk['rand_int'] - chunk['shift'] > 0).sum()
last_chunk_value = chunk.iloc[-1, 0]
print('number=', length)

Related

Reading from a .dat file Dataframe in Python

I have a .dat file which looks something like the below....
#| step | Channel| Mode | Duration|Freq.| Amplitude | Phase|
0 1 AWG Pi/2 100 2 1
1 1 SIN^2 100 1 1
2 1 SIN^2 200 0.5 1
3 1 REC 50 100 1 1
100 0 REC Pi/2 150 1 1
I had created a data frame and I wanted to read extract data from the data frame but I have an error
TypeError: expected str, bytes or os.PathLike object, not DataFrame
My code is below here,
import pandas as pd
import numpy as np
path = "updated.dat"
datContent = [i.strip().split() for i in open(path).readlines()]
#print(datContent)
column_names = datContent.pop(0)
print(column_names)
df = pd.DataFrame(datContent)
print(df)
extract_column = df.iloc[:,2]
with open (df, 'r') as openfile :
for line in openfile:
for column_search in line:
column_search = df.iloc[:,2]
if "REC" in column_search:
print ("Rec found")
Any suggestion would be appreciated

Since your post does not have any clear question, I have to guess based on your code. I am assuming that what you want to get is to find all rows in DataFrame where column Mode contains value REC.
Based on that, I prepared a small, self contained example that works on your data.
In your situation, the only line that you should use is the last one. Assuming that your DataFrame is created and filled correctly, your code below print(df) can be exchanged by this single line.
I would really recommend you reading the official documentation about indexing and selecting data from DataFrames. https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
import pandas as pd
from io import StringIO
data = StringIO("""
no;step;Channel;Mode;Duration;Freq.;Amplitude;Phase
;0;1;AWG;Pi/2;100;2;1
;1;1;SIN^2;;100;1;1
;2;1;SIN^2;;200;0.5;1
;3;1;REC;50;100;1;1
;100;0;REC;Pi/2;150;1;1
""")
df = pd.read_csv(data, sep=";")
df.loc[df.loc[:, 'Mode'] == "REC", :]

Python divide dataframe into chunks

I have a 1 column df with 37365 rows. I would need to separate it in chunks like the below:
df[0:2499]
df[2500:4999]
df[5000:7499]
...
df[32500:34999]
df[35000:37364]
The idea would be to use this in a loop like the below (process_operation does not work for dfs larger than 2500 rows)
while chunk <len(df):
process_operation(df[lower:upper])
EDIT:
I will be having different dataframes as inputs. Some of them will be smaller than 2500. What would be the best approach to also capture these?
Ej: df[0:1234] because 1234<2500

The range function is enough here:
for start in range(0, len(df), 2500):
process_operation(df[start:start+2500])

Do you mean something like that?
lower = 0
upper = 2499
while upper <= len(df):
process_operation(df[lower:upper])
lower += 2500
upper += 2500

I would use
import numpy as np
import math
chunk_max_size = 2500
chunks = int(math.ceil(len(df) / chunk_max_size))
for df_chunk in np.array_split(df, chunks):
#where: len(df_chunk) <= 2500

Adding values from a CSV file

I am beginning to learn python and am struggling with Syntax.
I have a simple CSV file that looks like this
0.01,10,20,0.35,40,50,60,70,80,90,100
2,22,32,42,52,62,72,82,92,102,112
3,33,43,53,63,5647,83,93,103,113,123
I want to look for the highest and lowest value in all the data in the csv file except in the first value of each row.
So effectively the answer here would be
highestValue=5647
lowestValue=0.35
because the data that is looked at is as follows (it ignored the first value of each row)
10,20,0.35,40,50,60,70,80,90,100
22,32,42,52,62,72,82,92,102,112
33,43,53,63,73,5647,93,103,113,123
I would like my code to work for ANY row length.
I really have to admit I'm struggling but here's what I've tried. I usually program PHP so this is all new to me. I have been working on this simple task for a day and can't fathom it out. I think I'm getting confused with terminology 'lists' for example.
import numpy
test_data_file = open ("Anaconda3JamesData/james_test_3.csv","r")
test_data_list = test_data_file.readlines()
test_data_file.close()
for record in test_data_list:
all_values = record.split(',')
maxvalue = np.max(numpy.asfarray(all_values[1:])
print (maxvalue)
With the test data (the CSV file shown at the very top of this question) I would expect the answer to be
highestValue=5647
lowestValue=0.35

If you're using numpy, you can read your csv file as a numpy.ndarray using numpy.genfromtxt() and then use the array's .max() and .min() methods
import numpy
array = numpy.genfromtxt('Anaconda3JamesData/james_test_3.csv', delimiter=',')
array[:, 1:].max()
array[:, 1:].min()
The [:, 1:] part is using numpy's array indexing. It's saying take all the rows (the first [:, part), and for each row take all but the first column (the 1:] part) . This doesn't work with Python's built in lists.

You're overwriting maxvalue each time through the loop, so you're just getting the max value from the last line, not the whole file. You need to compare with the previous maximum.
maxvalue = None
for record in test_data_list:
all_values = record.split(',')
if maxvalue is None:
maxvalue = np.max(numpy.asfarray(all_values[1:])
else:
maxvalue = max(maxvalue, np.max(numpy.asfarray(all_values[1:]))

You do not need the power of numpy for this problem. A simple CSV reader is good enough:
with open("Anaconda3JamesData/james_test_3.csv") as infile:
r = csv.reader(infile)
rows = [list(map(float, line))[1:] for line in r]
max(map(max, rows))
# 5647.0
min(map(min, rows))
# 0.35

I think using numpy is unneeded for this task. First of all, this:
test_data_file = open ("Anaconda3JamesData/james_test_3.csv","r")
test_data_list = test_data_file.readlines()
test_data_file.close()
for record in test_data_list:
can be simplified into this:
with open("Anaconda3JamesData/james_test_3.csv","r") as test_data_file:
for record in test_data_file:
We can use a list comprehension to read in all of the values:
with open("Anaconda3JamesData/james_test_3.csv","r") as test_data_file:
values = [float(val) for line in test_data_file for val in line.split(",")[1:]]
values now contains all relevant numbers, so we can just do:
highest_value = max(values)
lowest_value = min(values)

Here's a pandas solution that can give the desired results:
import pandas as pd
df = pd.read_csv('test1.csv', header=None)
# df:
# 0 1 2 3 4 5 6 7 8 9 10
# 0 0.01 10 20 0.35 40 50 60 70 80 90 100
# 1 2.00 22 32 42.00 52 62 72 82 92 102 112
# 2 3.00 33 43 53.00 63 5647 83 93 103 113 123
df = df.iloc[:, 1:]
print("Highest value: {}".format(df.values.max()))
print("Lowest value: {}".format(df.values.min()))
#Output:
Highest value: 5647.0
Lowest value: 0.35

Repartition Dask DataFrame to get even partitions

I have a Dask DataFrames that contains index which is not unique (client_id). Repartitioning and resetting index ends up with very uneven partitions - some contains only a few rows, some thousands. For instance the following code:
for p in range(ddd.npartitions):
print(len(ddd.get_partition(p)))
prints out something like that:
55
17
5
41
51
1144
4391
75153
138970
197105
409466
415925
486076
306377
543998
395974
530056
374293
237
12
104
52
28
My DataFrame is one-hot encoded and has over 500 columns. Larger partitions don't fit in memory. I wanted to repartition the DataFrame to have partitions even in size. Do you know an efficient way to do this?
EDIT 1
Simple reproduce:
df = pd.DataFrame({'x':np.arange(0,10000),'y':np.arange(0,10000)})
df2 = pd.DataFrame({'x':np.append(np.arange(0,4995),np.arange(5000,10000,1000)),'y2':np.arange(0,10000,2)})
dd_df = dd.from_pandas(df, npartitions=10).set_index('x')
dd_df2= dd.from_pandas(df2, npartitions=5).set_index('x')
new_ddf=dd_df.merge(dd_df2, how='right')
#new_ddf = new_ddf.reset_index().set_index('x')
#new_ddf = new_ddf.repartition(npartitions=2)
new_ddf.divisions
for p in range(new_ddf.npartitions):
print(len(new_ddf.get_partition(p)))
Note the last partitions (one single element):
1000
1000
1000
1000
995
1
1
1
1
1
Even when we uncomment the commented lines, partitions remain uneven in the size.
Edit II: Walkoround
Simple wlakoround can be achieved by the following code.
Is there a more elgant way to do this (more in a Dask way)?
def repartition(ddf, npartitions=None):
MAX_PART_SIZE = 100*1024
if npartitions is None:
npartitions = ddf.npartitions
one_row_size = sum([dt.itemsize for dt in ddf.dtypes])
length = len(ddf)
requested_part_size = length/npartitions*one_row_size
if requested_part_size <= MAX_PART_SIZE:
np = npartitions
else:
np = length*one_row_size/MAX_PART_SIZE
chunksize = int(length/np)
vc = ddf.index.value_counts().to_frame(name='count').compute().sort_index()
vsum = 0
divisions = [ddf.divisions[0]]
for i,v in vc.iterrows():
vsum+=v['count']
if vsum > chunksize:
divisions.append(i)
vsum = 0
divisions.append(ddf.divisions[-1])
return ddf.repartition(divisions=divisions, force=True)

You're correct that .repartition won't do the trick since it doesn't handle any of the logic for computing divisions and just tries to combine the existing partitions wherever possible. Here's a solution I came up with for the same problem:
def _rebalance_ddf(ddf):
"""Repartition dask dataframe to ensure that partitions are roughly equal size.
Assumes `ddf.index` is already sorted.
"""
if not ddf.known_divisions: # e.g. for read_parquet(..., infer_divisions=False)
ddf = ddf.reset_index().set_index(ddf.index.name, sorted=True)
index_counts = ddf.map_partitions(lambda _df: _df.index.value_counts().sort_index()).compute()
index = np.repeat(index_counts.index, index_counts.values)
divisions, _ = dd.io.io.sorted_division_locations(index, npartitions=ddf.npartitions)
return ddf.repartition(divisions=divisions)
The internal function sorted_division_locations does what you want already, but it only works on an actual list-like, not a lazy dask.dataframe.Index. This avoids pulling the full index in case there are many duplicates and instead just gets the counts and reconstructs locally from that.
If your dataframe is so large that even the index won't fit in memory then you'd need to do something even more clever.

How to generalize this calculation with a pandas DataFrame to any number of columns?

I have a file with some data that looks like
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
I can process this data and do math on it just fine:
import sys
import numpy as np
import pandas as pd
def main():
if(len(sys.argv) != 2):
print "Takes one filename as argument"
sys.exit()
file_name = sys.argv[1]
data = pd.read_csv(file_name, sep=" ", header=None)
data.columns = ["timestep", "mux", "muy", "muz"]
t = data["timestep"].count()
c = np.zeros(t)
for i in range(0,t):
for j in range(0,i+1):
c[i-j] += data["mux"][i-j] * data["mux"][i]
c[i-j] += data["muy"][i-j] * data["muy"][i]
c[i-j] += data["muz"][i-j] * data["muz"][i]
for i in range(t):
print c[i]/(t-i)
The expected result for my sample input above is
42.5
62.0
84.5
110.0
This math is finding the time correlation function for my data, which is the time-average of all permutations of the pairs of products in each column.
I would like to generalize this program to
work on n number of columns (in the i/j loop for example), and
be able to read in the column names from the file, so as to not have them hard-coded in
Which numpy or pandas methods can I use to accomplish this?

We can reduce it to one loop, as we would make use of array-slicing and use sum ufunc to operate along the rows of the dataframe and thus in the process make it generic to cover any number of columns, like so -
a = data.values
t = data["timestep"].count()
c = np.zeros(t)
for i in range(t):
c[:i+1] += (a[:i+1,1:]*a[i,1:]).sum(axis=1)
Explanation
1) a[:i+1,1:] is the slice of all rows until the i+1-th row and all columns starting from the second column, i.e mux, muy and so on.
2) Similarly, for [i,1:], that's the i-th row and all columns from second column onwards.
To keep it "pandas-way", simply replace a[ with data.iloc[.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

compare values in different chunks using pandas - python

Related

Reading from a .dat file Dataframe in Python

Python divide dataframe into chunks

Adding values from a CSV file

Repartition Dask DataFrame to get even partitions

How to generalize this calculation with a pandas DataFrame to any number of columns?

Categories

Resources