Merging large data in Python in local machine

Merging large data in Python in local machine - python

I have 140 csv files. Each file has 3 variables and is about 750 GB. Number of observation varies from 60 to 90 million.
I also have another small file, treatment_data - with 138000 row (for each unique ID) and 21 column (01 column for ID and 20 columns of 1s and 0s indicating whether the ID was given a particular treatment or not.
The variables are,
ID_FROM: A Numeric ID
ID_TO: A Numeric ID
DISTANCE: A numeric variable of physical distance between ID_FROM and ID_TO
(So in total, I have 138000*138000 (= 19+ Billion)rows - for every possible bi-lateral combination all ID, divided across these 140 files.
Research Question: Given a distance, how many neighbors (of each treatment type) an ID has.
So I need help with a system (preferably in Pandas) where
the researcher will input a distance
the program will look over all the files and filter out the the
rows wither DISTANCE between ID_FROM and ID_TO is less than
the given distance
output a single dataframe. (DISTANCE can be dropped at this
point)
merge the dataframe with the treatment_data by matching ID_TO
with ID. (ID_TO can be dropped at this point)
collapse the data by ID_FROM (group_by and sum the 1s, across
20 treatment variable.
(In the Final output dataset, I will have 138000 row and 21 column. 01 column for ID. 20 column for each different treatment type. So, for example, I will be able to answer the question, "Within '2000' meter, How many neighbors of '500' (ID) is in 'treatment_media' category?"
IMPORTANT SIDE NOTE:
The DISTANCE variable range between 0 to roughly the radius of an
average sized US State (in meter). Researcher is mostly interested to
see what happens with in 5000 meter. Which usually drops 98% of
observations. But sometimes, he/she will check for longer distance
measure too. So I have to keep all the observations available.
Otherwise, I could have simply filtered out the DISTANCE more than
5000 from the raw input files and made my life easier. The reason I
think this is important is that, the data are sorted based in
ID_FROM across 140 files. If I could somehow rearrange these 19+
billion rows based on DISTANCE and associate them have some kind of
dictionary system, then the program does not need to go over all the
140 files. Most of the time, the researcher will be looking into only
2 percentile of DISTANCE range. It seems like a colossal waste of
time that I have to loop over 140 files. But this is a secondary
thought. Please do provide answer even if you can't use this
additional side-note.
I tried looping over 140 files for a particular distance in Stata, It
takes 11+ hour to complete the task. Which is not acceptable as the
researcher will want to vary the distance with in 0 to 5000 range.
But, most of the computation time is wasted on reading each dataset
on memory (that is how Stata do it). That is why I am seeking help in
Python.

Is there a particular reason that you need to do the whole thing in Python?
This seems like something that a SQL database would be very good at. I think a basic outline like the following could work:
TABLE Distances {
Integer PrimaryKey,
String IdFrom,
String IdTo,
Integer Distance
}
INDEX ON Distances(IdFrom, Distance);
TABLE TreatmentData {
Integer PrimaryKey,
String Id,
String TreatmentType
}
INDEX ON TreatmentData(Id, TreatmentType);
-- How many neighbors of ID 500 are within 2000 meters and have gotten
-- the "treatment_media" treatment?
SELECT
d.IdFrom AS Id,
td.Treatment,
COUNT(*) AS Total
FROM Distances d
JOIN TreatmentData td ON d.IdTo = td.Id
WHERE d.IdFrom = "500"
AND d.Distance <= 2000
AND td.TreatmentType = "treatment_media"
GROUP BY 1, 2;
There's probably some other combination of indexes that would give better performance, but this seems like it would at least answer your example question.

Related

Splitting pandas group into subgroups based on relation between consecutive data points

Suppose I have a dataset that records camera sightings of some object over time, and I groupby date so that each group represents sightings within the same day. I'd then like to break one group into 'subgroups' based on the time between sightings -- if the gap is too large, then I want them to be in different groups.
Consider the following as one group.
(Camera). (Time)
A 6
B 12
C 17
D 21
E 47
F 50
Suppose I had a cutoff matrix that told me how close the next sighting had to be for two adjacent cameras to be in the same group. For example, we might have cutoff_mat[d, e] = 10 which means that since cameras D and E are more than 10 units apart in time, I should break the group into two after D and before E. I would like to do so in a way that allows for efficient iteration over each of the resulting groups since my real goal is to compute some other matrix using values within each sub-group, and need to potentially break one group into many and not just two. How do I do this? The dataset is large (>100M points) so something fast would be appreciated.
I am thinking I could do this by creating another column in the original dataset that represents time between consecutive sightings on the same day, and somehow groupby both date AND this new column, but I'm not quite sure how that'd work. I also don't think pd.df.cut() works here since I don't have pre-determined bins.

How to iterate pandas dataframe without memory error

I have a csv file having 140K rows. Working with pandas library.
Now the problem is I have to compare each rows with every other rows.
Now the problem is it's taking too much time.
At the same time, I am creating another column where I am appending many data for each row based on the comparison. Here I am getting memory error.
What is the optimal solution for atleast Memory error?
I am working on 12GB RAM, Google Colaboratory.
Dataframe sample:
ID x_coordinate y_coordinate
1 2 3
2 3 4
............
X 1 5
Now, I need to find distance each row with other rows and if the distance in certain threshold, I am assigning a new id for that two row which are in certain distance. So, if in my case ID 1 and ID 2 is in a certain distance I assigned a for both. And ID 2 and ID X is in certain distance I am assigning b as new matched id like below
ID x_coordinate y_coordinate Matched ID
1 2 3 [a]
2 3 4 [a, b]
............
X 1 5 [b]
For distance I am using √{(xi − xj)2 + (yi − yj)2}
Threshold can be anything. Say m unit.

This reads like you attempt to hold the complete square distance matrix in memory, which obviously doesn't scale very well, as you have noticed.
I'd suggest you to read up on how DBSCAN clustering approaches the problem, compared to e.g., hierarchical clustering:
https://en.wikipedia.org/wiki/DBSCAN#Complexity
Instead of computing all the pairwise distances at once, they seem to
put the data into a spatial database (for efficient neighborhood queries with a threshold) and then
iterate the points to identify the neighbors and the relevant distances on the fly.
Unfortunately I can't point you to readily available code or pandas functionality to support this though.

Divide a number to date in Excel or Python

Probably a naive question but new to this :
I have a column with 100000 entries having dates from Jan 1, 2018 to August 1, 2019.( repeated entries as well) I want to create a new column wherein I want to divide a number lets say 3500 in such a way that sum(new_column) for a particular day is less than or equal to 3500.
For example lets say 01-01-2018 has 40 entries in the dataset, then 3500 is to be distributed randomly between 40 entries in such a way that the total of these 40 rows is less than or equal to 3500 and it needs to be done for all the dates in the dataset.
Can anyone advise me as to how to achieve that.
EDIT : The excel file is Here
Thanks

My answer is not the best but may work for you. But because you have 100000 entries, it will probably slow down performance, so use it and paste values, because the solution uses function RANDBETWEEN and it keeps recalculating every time you make a change in a cell.
So I made a data test like this:
First column ID would be the dates, and second column would be random numbers.
And bottom right corner shows totals, so as you can see, totals for each number sum up 3500.
The formula I've used is:
=IF(COUNTIF($A$2:$A$7;A2)=1;3500;IF(COUNTIF($A$2:A2;A2)=COUNTIF($A$2:$A$7;A2);3500-SUMIF($A$1:A1;A2;$B$1:B1);IF(COUNTIF($A$2:A2;A2)=1;RANDBETWEEN(1;3500);RANDBETWEEN(1;3500-SUMIF($A$1:A1;A2;$B$1:B1)))))
And it works pretty good. Just pressing F9 to recalculate the worksheet, gives random numbers, but all of them sum up 3500 all the time.
Hope you can adapt this to your needs.
UPDATE: You need to know that my solution will always force the numbers to sum up 3500. In any case the sum of all values would be less than 3500. You'll need to adapt that part. As i said, not my best answer...
UPDATE 2: Uploaded a sample file to my Gdrive in case you want to check how it works. https://drive.google.com/open?id=1ivW2b0b05WV32HxcLc11gP2JWvdYTa84

You will need 2 columns
I to count the number of dates and then one for the values
Formula in B2 is =COUNTIF($A$2:$A$51,A2)
Formula in C2 is =RANDBETWEEN(1,3500/B2)
Column B is giving the count of repetition for each date
Column C is giving a random number whose sum will be at maximum 3500 for each count
The range in formula in B column is $A$2:$A$51, which you can change according to your data
EDIT
For each date in your list you can apply a formula like below
The formula in D2 is =SUMIF(B:B,B2,C:C)
For the difference value for each unique date you can use a pivot and apply the formula on sum of each date like below
Formula in J2 is =3500-I2

Sorry - a little late to the party but this looked like a fun challenge!
The simplest way I could think of is to add a rand() column (then hard code, if required) and then another column which calculates the 3500 split per date, based on the rand() column.
Here's the function:
=ROUNDDOWN(3500*B2/SUMIF($A$2:$A$100000,A2,$B$2:$B$100000),0)
Illustrated here:

How to find the alignment of two data sets in pandas

Presented as an example.
Two data sets. One collected over a 1 hour period. One collected over a 20 min period within that hour.
Each data set contains instances of events that can transformed into single columns of true (-) or false (_), representing if the event is occurring or not.
DS1.event:
_-__-_--___----_-__--_-__---__
DS2.event:
__--_-__--
I'm looking for a way to automate the correlation (correct me if the terminology is incorrect) of the two data sets and find the offset(s) into DS1 at which DS2 is most (top x many) likely to have occurred. This will probably end up with some matching percentage that I can then threshold to determine the validity of the match.
Such that
_-__-_--___----_-__--_-__---__
__--_-__--
DS1.start + 34min ~= DS2.start
Additional information:
DS1 was recorded at roughly 1 Hz. DS2 at roughly 30 Hz. This makes it less likely that there will be a 100% clean match.
Alternate methods (to pandas) will be appreciated, but python/pandas are what I have at my disposal.

Sounds like you just want something like a cross correlation?
I would first convert the string to a numeric representation, so replace your - and _ with 1 and 0
You can do that using a strings replace method (e.g. signal.replace("-", "1"))
Convert them to a list or a numpy array:
event1 = [int(x) for x in signal1]
event2 = [int(x) for x in signal2]
Then calculate the cross correlation between them:
xcor = np.correlate(event1, event2, "full")
That will give you the cross correlation value at each time lag. You just want to find the largest value, and the time lag at which it happens:
nR = max(xcor)
maxLag = np.argmax(xcor) # I imported numpy as np here
Giving you something like:
Cross correlation value: 5
Lag: 20
It sounds like you're more interested in the lag value here. What the lag tells you is essentially how many time/positional shifts are required to get the maximum cross correlation value (degree of match) between your 2 signals
You might want to take a look at the docs for np.correlate and np.convolve to determine the method (full, same, or valid) you want to use as thats determined by the length of your data and what you want to happen if your signals are different lengths

Python - Zero-Order Hold Interpolation (Nearest Neighbor)

I will be shocked if there isn't some standard library function for this especially in numpy or scipy but no amount of Googling is providing a decent answer.
I am getting data from the Poloniex exchange - cryptocurrency. Think of it like getting stock prices - buy and sell orders - pushed to your computer. So what I have is timeseries of prices for any given market. One market might get an update 10 times a day while another gets updated 10 times a minute - it all depends on how many people are buying and selling on the market.
So my timeseries data will end up being something like:
[1 0.0003234,
1.01 0.0003233,
10.0004 0.00033,
124.23 0.0003334,
...]
Where the 1st column is the time value (I use Unix timestamps to the microsecond but didn't think that was necessary in the example. The 2nd column would be one of the prices - either the buy or sell prices.
What I want is to convert it into a matrix where the data is "sampled" at a regular time frame. So the interpolated (zero-order hold) matrix would be:
[1 0.0003234,
2 0.0003233,
3 0.0003233,
...
10 0.0003233,
11 0.00033,
12 0.00033,
13 0.00033,
...
120 0.00033,
125 0.0003334,
...]
I want to do this with any reasonable time step. Right now I use np.linspace(start_time, end_time, time_step) to create the new time vector.
Writing my own, admittedly crude, zero-order hold interpolator won't be that hard. I'll loop through the original time vector and use np.nonzero to find all the indices in the new time vector which fit between one timestamp (t0) and the next (t1) then fill in those indices with the value from time t0.
For now, the crude method will work. The matrix of prices isn't that big. But I have to think there a faster method using one of the built-in libraries. I just can't find it.
Also, for the example above I only use a matrix of Nx2 (column 1: times, column 2: price) but ultimately the market has 6 or 8 different parameters that might get updated. A method/library function that could handled multiple prices and such in different columns would be great.
Python 3.5 via Anaconda on Windows 7 (hopefully won't matter).
TIA

For your problem you can use scipy.interpolate.interp1d. It seems to be able to do everything that you want. It is able to do a zero order hold interpolation if you specify kind="zero". It can also simultaniously interpolate multiple columns of a matrix. You will just have to specify the appropriate axis. f = interp1d(xData, yDataColumns, kind='zero', axis=0) will then return a function that you can evaluate at any point in the interpolation range. You can then get your normalized data by calling f(np.linspace(start_time, end_time, time_step).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.