How to work with aggregated data in pandas? - python

I have a dataset which looks like this:
val
1
1
3
4
6
6
9
...
I can't load it into a pandas DataFrame due to its huge size, so I aggregate the data using Spark to form:
val occurrences
1 2
3 1
4 1
6 2
9 1
...
and load that into a pandas DataFrame. The "val" column never goes above 100, so the aggregated table doesn't take much memory.
My problem is that I can't easily operate on such a structure: I can't find the mean or median using pandas, nor plot a boxplot with seaborn. I can only do it with explicit formulas I write myself, not with the ready-made built-in methods. Is there a pandas structure, or any other approach, that lets me cope with such data?
For example:
1,1,3,4,6,6,9
would be:
df = pd.DataFrame({'val': [1,3,4,6,9], "occurrences" : [2,1,1,2,1]})
Median is 4. I'm looking for a method to extract median directly from given df.

No, pandas does not operate on such objects the way you would expect. Elsewhere on Stack Overflow, even computing a median for that table structure takes at least a few lines of code.
If you wanted to make your own seaborn hooks/wrappers, a good place to start would probably be an efficient percentiles(df, p) method. The median is then just percentiles(df, [50]), a box plot is just percentiles(df, [0, 25, 50, 75, 100]), and so on. Your development time could then be fairly minimal (depending on how complicated the statistics you need are).
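For example, here is a minimal sketch of such a percentiles(df, p) helper (nearest-rank method), assuming the column names 'val' and 'occurrences' from your example. It walks the cumulative counts instead of re-expanding the data, so it stays cheap no matter how many rows the original, unaggregated dataset had:
import numpy as np
import pandas as pd

def percentiles(df, p):
    # Sort by value, then find where each requested rank falls in the
    # cumulative occurrence counts (nearest-rank percentile definition).
    d = df.sort_values('val')
    vals = d['val'].to_numpy()
    cum = d['occurrences'].to_numpy().cumsum()
    total = cum[-1]
    ranks = np.ceil(np.asarray(p, dtype=float) / 100 * total).clip(1, total)
    return vals[np.searchsorted(cum, ranks)]

df = pd.DataFrame({'val': [1, 3, 4, 6, 9], 'occurrences': [2, 1, 1, 2, 1]})
print(percentiles(df, [50]))                  # [4]  (the median)
print(percentiles(df, [0, 25, 50, 75, 100]))  # five-number summary for a box plot
The weighted mean is similarly direct: (df['val'] * df['occurrences']).sum() / df['occurrences'].sum().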

Related

Manually interpolating NAs in large dataset in Pandas

I am currently working on a project that uses pandas, with a large dataset (~42K rows x 1K columns).
My dataset has many missing values which I want to interpolate to obtain a better result when training an ML model on this data. My method of interpolating is to take the average of the previous and the next value and use that as the value for any NaN. Example:
TRANSACTION PAYED MONDAY TUESDAY WEDNESDAY
D8Q3ML42DS0 1 123.2 NaN 43.12
So in the above example the NaN would be replaced with the average of 123.2 and 43.12, which is 83.16. If a value can't be interpolated, a 0 is put in its place. I was able to implement this in a number of ways, but I always run into the issue of it taking a very long time to process all the rows in the dataset, despite running on an Intel Core i9. The following are approaches I've tried and found to take too long:
Interpolating the data and then only replacing the elements that need to be replaced instead of replacing the entire row.
Replacing the entire row with a new pd.Series that has the old and the interpolated values. It seems like my code is able to execute reasonably well on a Numpy Array but the slowness comes from the assignment.
I'm not quite sure why the performance of my code comes nowhere close to df.interpolate() despite it being the same idea. Here is some of my code responsible for the interpolation:
import math
import statistics
import numpy as np

def interpolate(array: np.ndarray):
    arr_len = len(array)
    for i in range(arr_len):
        if math.isnan(array[i]):
            # Edges and cells whose neighbors are NaN can't be averaged: use 0.
            if i == 0 or i == arr_len - 1 or math.isnan(array[i - 1]) or math.isnan(array[i + 1]):
                array[i] = 0
            else:
                array[i] = statistics.mean([array[i - 1], array[i + 1]])
    return array

for transaction_id in df.index:
    df.loc[transaction_id, df.columns[2:]] = interpolate(df.loc[transaction_id, df.columns[2:]].to_numpy())
My understanding is that pandas has some parallelized techniques and functions it can use to do this quickly. How can I speed this process up, even a little?
Try this; it might help:
df.interpolate(method='linear', limit_direction='forward', axis=0)
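Alternatively, the neighbor-average rule itself can be vectorized with column shifts instead of a Python-level loop. A minimal sketch, assuming (as in your code) that the value columns start at position 2; note that unlike the sequential loop, it only ever averages original values, never already-filled ones:
import numpy as np
import pandas as pd

# Hypothetical one-row frame mirroring the example; the real data is ~42K x 1K.
df = pd.DataFrame({'TRANSACTION': ['D8Q3ML42DS0'], 'PAYED': [1],
                   'MONDAY': [123.2], 'TUESDAY': [np.nan], 'WEDNESDAY': [43.12]})

vals = df.iloc[:, 2:]                # value columns only
left = vals.shift(1, axis=1)         # neighbor to the left of each cell
right = vals.shift(-1, axis=1)       # neighbor to the right of each cell
neighbor_mean = (left + right) / 2   # NaN wherever either neighbor is NaN
# Fill NaNs with the neighbor average where defined, otherwise with 0.
df.iloc[:, 2:] = vals.fillna(neighbor_mean).fillna(0)
print(df)  # TUESDAY becomes (123.2 + 43.12) / 2 = 83.16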

Reshaping a material science Dataset (probably using melt() )

I'm dealing with a materials science dataset and I'm in the following situation,
I have data organized like this:
Chemical_Formula   Property_name        Property_Scalar
He                 Electrical conduc.   1
NO_2               Resistance           50
CuO3               Hardness
...                ...                  ...
CuO3               Fluorescence         300
He                 Toxicity             39
NO2                Hardness             80
...                ...                  ...
As you can see, it is really messy, because the same chemical formula appears more than once throughout the dataset, each occurrence referring to a different property. My question is: how can I easily split the dataset into smaller ones, matching every formula with its descriptors in order? (I used fictional names and values just to explain my problem.)
I'm on Jupyter Notebook and I'm using Pandas.
I'm editing my question trying to be more clear:
My goal would be to plot some histograms of (for example) the number of materials vs. conductivity at different temperatures (100K, 200K, 300K). So I need both conductivity and temperature for each material to be clearly comparable. For example, I guess a more convenient structure to obtain would be:
Chemical formula Conductivity Temperature
He 5 10K
NO_2 7 59K
CuO_3 10 300K
... ... ...
He 14 100K
NO_2 5 70K
... ... ...
I think that this issue can be related to reshaping the dataset but I should also have each formula to MATCH exactly the temperature and conductivity. Thank you for your help!
If you want to plot Conductivity versus Temperature for a given formula, you can simply select the rows that match this condition:
import pandas as pd
import matplotlib.pyplot as plt
formula = 'NO_2'
subset = df.loc[df['Chemical_Formula'] == formula].sort_values('Temperature')
x = subset['Temperature'].values
y = subset['Conductivity'].values
plt.plot(x, y)
Here, we define the formula you want to extract. Then we select only the rows of the DataFrame where the value in the 'Chemical_Formula' column matches your specified formula, using df.loc[]. This returns a new DataFrame that is a subset of your original one, containing only the rows where the condition is satisfied. We sort this subset by 'Temperature' (I assume you want to plot Temperature on the x-axis) and store it as subset. We then select the 'Temperature' and 'Conductivity' columns, which return pandas.Series objects; we convert them to numpy arrays by calling .values, store them in the x and y variables, and pass them to matplotlib's plot function.
EDIT:
To get from the first DataFrame to the second DataFrame described in your post, you can use the pivot function (assuming your first DataFrame is named df):
df = df.pivot(index='Chemical_Formula', columns='Property_name', values='Property_Scalar')
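For instance, a minimal sketch on a made-up long-format frame (column names assumed from your post):
import pandas as pd

df = pd.DataFrame({
    'Chemical_Formula': ['He', 'He', 'NO_2', 'NO_2'],
    'Property_name': ['Conductivity', 'Temperature', 'Conductivity', 'Temperature'],
    'Property_Scalar': [5, 10, 7, 59],
})
# One row per formula, with one column per property.
wide = df.pivot(index='Chemical_Formula', columns='Property_name',
                values='Property_Scalar').reset_index()
print(wide)
One caveat: pivot raises an error if a (formula, property) pair occurs more than once; in that case pivot_table with an aggregation function (e.g. aggfunc='mean') is the usual fallback.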

Print and write values in a file greater and lesser than 10 and make a plot of it by using Python

I have an enormous dataset in .csv format which consists of various columns; the ones of interest to me are columns 3 and 7. I want to print both columns.
Sample Data: {Only Col 3 and 7 are displayed}
Names Numbers
John 12
Kim 5
Alex 16
mike 2
giki 8
David 18
Desired Output #values greater than 10:
John 12
Alex 16
David 18
Desired Output #values lesser than 10:
Kim 5
mike 2
giki 8
Rhea, I'm not sure I understand what you are trying to accomplish there, so I'll try to help by going through some basic stuff:
a) Do you already have your data on a DataFrame format? Or it is in some form of tabular data such as a csv or Excel file?
Dataframe = Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
Either way, you will have to import pandas to read or manipulate this file. Then you can transform it into a DataFrame using one of pandas' reading functions, such as pandas.read_csv or pandas.read_excel.
import pandas as pd
# if your data is in a dictionary
df = pd.DataFrame(data=d)
# csv
df = pd.read_csv('file name and path')
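Since only columns 3 and 7 matter here, a possible refinement (assuming 0-based column positions 2 and 6; the renamed labels are just for the example) is to let read_csv keep only those columns:
# keep only the two columns of interest while reading
df = pd.read_csv('file name and path', usecols=[2, 6])
df.columns = ['Names', 'Numbers']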
b) Then you can slice through it using pandas, and create new DataFrames
output1 = df.loc[df['Numbers'] > 10]
output2 = df.loc[df['Numbers'] < 10]
c) The most basic way to plot is to call the pandas plot method on your new DataFrames (you can get a lot fancier than that using matplotlib or seaborn). Although you should probably think about what kind of information you want to visualize, which is not clear to me.
output1.plot()
#histogram
output2.hist()
d) You can save your new DataFrames using pandas as well. Here is an example with CSV files:
output1.to_csv('greater_than_10.csv', index=False)
output2.to_csv('less_than_10.csv', index=False)
I hope I could shed some light on your doubts ;)

In pandas, how to plot with multiple index?

I have double index in Panda's dataframe like the example below.
c d
a b
1 3 1.519970 -0.493662
2 4 0.600178 0.274230
3 5 0.132885 -0.023688
4 6 2.410179 1.450520
How do I plot column 'c' on the y-axis and index 'b' on the x-axis? With a single index it is easy to plot, but I'm having trouble with multi-index plotting. Thank you for any help!
Option 1
Two options have been provided (in the comments) involving reset_index. They are
df.reset_index().plot(x="b",y="c")
Or,
df.reset_index(level=0, drop=True).c.plot()
Both of these should work as expected, but will become expensive for large dataframes.
Option 2
If you are worried about memory, here's an option that does not involve resetting the index:
import matplotlib.pyplot as plt
plt.plot(df.index.get_level_values(1), df.c)
reset_index generates copies of the data; this approach is more efficient because it avoids them.
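For completeness, a self-contained sketch (the frame below reconstructs the example shape with random values):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# MultiIndex (a, b) with columns c and d, like the example.
idx = pd.MultiIndex.from_tuples([(1, 3), (2, 4), (3, 5), (4, 6)], names=['a', 'b'])
df = pd.DataFrame(np.random.randn(4, 2), index=idx, columns=['c', 'd'])

plt.plot(df.index.get_level_values('b'), df['c'])  # level 'b' on x, column 'c' on y
plt.xlabel('b')
plt.ylabel('c')
plt.show()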

Any existing methods to find a drop in a noisy time series?

I have a time series (array of values) and I would like to find the starting points where a long drop in values begins (at least X consecutive values going down). For example:
Having a list of values
[1,2,3,4,3,4,5,4,3,4,5,4,3,2,1,2,3,2,3,4,3,4,5,6,7,8]
I would like to find a drop of at least 5 consecutive values. So in this case I would find the segment 5,4,3,2,1.
However, in a real scenario, there is noise in the data, so the actual drop includes a lot of little ups and downs.
I could write an algorithm for this. But I was wondering whether there is an existing library or standard signal processing method for this type of analysis.
You can do this pretty easily with pandas (which I know you have). Convert your list to a Series, then group consecutive non-increasing runs and keep those of size at least 5:
import pandas as pd
v = pd.Series([1,2,3,4,3,4,5,4,3,4,5,4,3,2,1,2,3,2,3,4,3,4,5,6,7,8])
v[v.groupby(v.diff().gt(0).cumsum()).transform('size').ge(5)]
10 5
11 4
12 3
13 2
14 1
dtype: int64
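Since the real data is noisy, one option (my suggestion, not part of the answer above) is to smooth the series first and apply the same grouping trick to the smoothed values:
import pandas as pd

v = pd.Series([1,2,3,4,3,4,5,4,3,4,5,4,3,2,1,2,3,2,3,4,3,4,5,6,7,8])
# A rolling mean irons out small ups and downs; the window size (3 here)
# is a knob to tune against the noise level of the real data.
smooth = v.rolling(3, center=True).mean()
mask = smooth.groupby(smooth.diff().gt(0).cumsum()).transform('size').ge(5)
v[mask]  # original values inside long smoothed declines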
