Create Matrix from a DataFrame - python

I have a dataframe with several columns, including Department and ICA. I need to create a matrix where the rows are the departments and the columns are the ICA values (the possible values are bad, acceptable, and good).
Position (r, c) should be the number of ICA observations of category c recorded for department r.
For example, if Amazonas is row 1 and Acceptable is column 3, position (1, 3) would be the number of acceptable observations for Amazonas.
Thanks!

You can get values from your DataFrame using integer-based indexing with the DataFrame.iloc method. This seems to do what you need.
For example, if df is your DataFrame, then df.iloc[0, 2] will give you the value at the first row and third column.
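The answer above covers reading individual cells; to build the count matrix itself (departments as rows, ICA categories as columns), pd.crosstab is one option not mentioned there. A minimal sketch with toy data, where the column names Department and ICA and the category labels are assumptions taken from the question:
import pandas as pd
# toy data standing in for the real DataFrame
df = pd.DataFrame({'Department': ['Amazonas', 'Amazonas', 'Boyaca', 'Amazonas'],
                   'ICA': ['good', 'acceptable', 'bad', 'acceptable']})
counts = pd.crosstab(df['Department'], df['ICA'])    # rows: departments, columns: ICA categories
counts = counts.reindex(columns=['bad', 'acceptable', 'good'], fill_value=0)  # fix the column order
print(counts.iloc[0, 2])    # first row, third column: 'good' count for Amazonas -> 1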

Related

Combine two Pandas DataFrames where "Dates" match

I have two Pandas DataFrames with one column in common, namely "Dates". I need to merge the two wherever "Dates" matches. With pd.merge() it does what I expect, but it drops the non-matching rows; I want to keep those values too.
Example: I have historical 1-minute data for a stock and an indicator calculated on 5-minute data, i.e. for every 5 rows of the 1-minute DataFrame I have one new indicator value.
I know the Series.dt.floor method may prove useful here, but I couldn't figure it out.
I concatenated the respective "Dates" to the calculated indicator Series so that I could merge where the column matches. I got the right result, but with missing values. I need continuity in the 1-minute values, i.e. the same indicator value must be valid for the next 5 entries, and then it is the second indicator value's turn to be merged.
df1.merge(df2, left_on='Dates', right_on='Dates')
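A minimal sketch of the kind of merge described above, using toy frames with hypothetical Price and Indicator column names: a left merge keeps every 1-minute row, and ffill() keeps each indicator value valid until the next 5-minute value arrives.
import pandas as pd
# toy frames standing in for the real data; 'Price' and 'Indicator' are made-up column names
df1 = pd.DataFrame({'Dates': pd.date_range('2021-01-01 09:00', periods=10, freq='1min'),
                    'Price': range(10)})
df2 = pd.DataFrame({'Dates': pd.date_range('2021-01-01 09:00', periods=2, freq='5min'),
                    'Indicator': [1.5, 2.5]})
merged = df1.merge(df2, on='Dates', how='left')     # keep all 1-minute rows, not only the matches
merged['Indicator'] = merged['Indicator'].ffill()   # carry each indicator forward over the next 5 entries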

How could I create a column with matching values from different datasets with different lengths

I want to create a new column in the dataset in which a ZipCode is assigned to a specific Region.
There are 5 Regions in total, and every Region consists of some number of ZipCodes. I would like to use the two different datasets to create the new column.
I tried some code already, but I failed because the series are not identically labeled. How should I tackle this problem?
I have two datasets: one has 1518 rows x 3 columns and the other has 46603 rows x 3 columns.
As you can see in the picture:
df1 is the first dataset, with the Postcode and Regio columns, i.e. the ZipCodes assigned to the corresponding Regio.
df2 is the second dataset, where the Regio column is missing. I would like to add a new column to df2 that contains the corresponding Regio.
I hope someone can help me out.
Kind regards.
I believe you need to map the ZipCode in dataframe 2 to the Regio column from the first dataframe, assuming Postcode and ZipCode refer to the same thing.
First create a dictionary from df1, then translate the ZipCode values in df2 through that dictionary and store the result in a new column:
zip_dict = dict(zip(df1.Postcode, df1.Regio))   # Postcode -> Regio lookup
df2['Regio'] = df2.ZipCode.replace(zip_dict)    # assign the translated values to a new Regio column
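A tiny self-contained run of the same idea, with made-up Postcode, ZipCode, and Regio values:
import pandas as pd
df1 = pd.DataFrame({'Postcode': [1011, 1012, 2511], 'Regio': ['Noord', 'Noord', 'Zuid']})
df2 = pd.DataFrame({'ZipCode': [2511, 1011, 9999]})
zip_dict = dict(zip(df1.Postcode, df1.Regio))
df2['Regio'] = df2.ZipCode.replace(zip_dict)   # 9999 has no match and keeps its original value;
                                               # use df2.ZipCode.map(zip_dict) to get NaN for unmatched codes instead
print(df2)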

How to cluster values of continuous time series

In the picture I plot the values from an array of shape (400, 8).
I wish to reorganize the points so as to obtain 8 series of "continuous" points. Let's call them a(t), b(t), ..., h(t), with a(t) being the series with the smallest values and h(t) the series with the largest values. They are unknown and I am trying to recover them.
I have some missing values, which are replaced by 0.
When there is a 0, I do not know which series it belongs to. The zeros are always stored at the high indices of the array.
For instance, at time t=136 I have only 4 valid values: array[t, i] > 0 for i <= 3 and array[t, i] = 0 for i > 3.
How can I cluster the points in a way that yields "continuous" time series, i.e. at time t=136, array[136, 0] should go into d, array[136, 1] into e, array[136, 2] into f and array[136, 3] into g?
I tried AgglomerativeClustering and DBSCAN with scikit-learn with no success.
Data are available at https://drive.google.com/file/d/1DKgx95FAqAIlabq77F9f-5vO-WPj7Puw/view?usp=sharing
My interpretation is that you mean that you have the data in 400 columns and 8 rows. The data values are assigned to the correct columns, but not necessarily to the correct rows. Your figure shows that the 8 signals do not cross each other, so you should be able to simply sort each column individually. But now the missing data is the problem, because the zeros representing missing data will all sort to the bottom rows, forcing the real data into the wrong rows.
I don't know if this is a good answer, but my first hunch is to start by sorting each column individually, then begin in a place where there are several adjacent columns with full spans of real data, and work away from that location, first to the left and then to the right, one column at a time. If a column contains no zeros, it is OK. If it contains zeros, compute local row averages of the immediately adjacent columns, using only non-zero values (how many adjacent columns to use depends on the density of missing data and on the separation between the signals), then put each valid value in the current column into the row with the closest local row average, and put zeros in the remaining rows.
How to code that depends on what you have done so far. If you are using numpy, it would be convenient to first convert the zeros to NaNs, because numpy.nanmean() will ignore the NaNs.
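A rough numpy sketch of that heuristic, under a few assumptions of my own: the array is handled in the (8, 400) orientation (the original (400, 8) array transposed), the window size is a guess, and instead of working outward column by column it simply uses the nearby fully populated columns as the reference, which simplifies the bookkeeping.
import numpy as np

def untangle(arr, half_window=3):
    # arr: shape (8, 400) with 0 marking a missing value.
    # Returns the same shape, each row being one 'continuous' series,
    # with missing entries set to NaN.
    n_rows, n_cols = arr.shape
    work = arr.astype(float)
    work[work == 0] = np.nan               # treat zeros as missing, as suggested above
    work = np.sort(work, axis=0)           # sort each column; NaNs sink to the bottom rows

    out = np.full_like(work, np.nan)
    full = ~np.isnan(work).any(axis=0)     # columns with a complete span of 8 real values
    out[:, full] = work[:, full]           # those columns are already in the right rows

    for c in np.where(~full)[0]:           # now place the values of the incomplete columns
        lo, hi = max(0, c - half_window), min(n_cols, c + half_window + 1)
        nearby = np.where(full[lo:hi])[0] + lo
        if nearby.size == 0:               # no complete column nearby: fall back to all of them
            nearby = np.where(full)[0]
        ref = work[:, nearby].mean(axis=1) # local row averages of the adjacent complete columns
        for v in work[~np.isnan(work[:, c]), c]:
            r = np.argmin(np.abs(ref - v)) # row whose local average is closest to this value
            out[r, c] = v
            ref[r] = np.inf                # a row can only receive one value per column
    return out

# usage, assuming `data` is the (400, 8) array from the question:
# series = untangle(data.T)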

How can I obtain the top n groups in pandas?

I have a pandas dataframe. Its final column is the max value of the RelAb column for each unique group (in this case, a species assignment), obtained by:
df_melted['Max'] = df_melted.groupby('Species')['RelAb'].transform('max')
As you can see, the max value is repeated in all rows of the group. Each group contains a large number of rows. I have the df sorted by the max values, and there are about 100 rows per max value. My goal is to obtain the top 20 groups based on the max value (i.e. a df of about 100 x 20 = 2000 rows). I do not want to drop individual rows from groups in the dataframe, but rather entire groups.
I am pasting a subset of the dataframe where the max for a group changes from one "Max" value to the next:
My feeling is that I need to collapse the max so that a single value represents the entire group, and then sort based on that column, perhaps as such?
For context, I am doing this because I am planning to make a stacked bar chart of the most abundant species in the table for each sample. Right now there are just way too many species, which makes the stacked bar chart uninformative.
One way to do it:
aux = (df_melted.groupby('Species')['RelAb']
                .max()
                .nlargest(20, keep='all')
                .to_list())
top20 = df_melted.loc[df_melted['Max'].isin(aux), :].copy()
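One thing to keep in mind: nlargest(20, keep='all') keeps every group tied with the 20th-largest max, so the result can contain a few more than 20 groups. A quick sanity check on the asker's column names:
print(top20['Species'].nunique())   # number of groups actually kept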

Use Groupby to construct a dataframe with value counts of another column

I have a dataframe with two column features: startneighborhood and hour
hour can take any value from 1-24, i.e., [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]
startneighborhood can take 37 different neighborhood values.
I want to count, for every neighborhood, the number of observations in each hour, using "hour" as the index.
So my matrix would be 24 rows x 37 columns, with the hours 1-24 as my index and the 37 neighborhoods as the column names.
How can I use Pandas to perform this computation? I'm a bit lost on the fastest way.
I've constructed the dataframe, with the index and the neighborhood names as the column names. I now just need to add the values.
I'm a little bit confused by the question, but I think what you want to do is a crosstab.
import pandas as pd
df = <...>  # construct your dataframe
table = pd.crosstab(index=df.hour, columns=df.startneighborhood)
This will give you a 24x37 table where each element is the count of the number of occurrences of that combination of hour and startneighborhood.
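If some hour never occurs in the data, its row will simply be missing from the table; a small follow-up, assuming the hours are labelled 1 through 24, forces the full 24-row index:
table = table.reindex(range(1, 25), fill_value=0)   # absent hours get a row of zero counts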
