How can I handle time series forecasting analysis with multiple series - python

OK, I'll ask in more detail. I'm updating the question and adding an image as well. I have job vacancy data for 18 sectors, as in the picture: the first column is dates and serves as the index, and the other 18 columns hold the job vacancy figures per sector.
Now my question is: when I chart calculations such as seasonality and moving average, I get three tables of output for each of the 18 sectors (for example, healthcare or mining).
I have exactly 18 of each of those three tables, and by the end of the data preprocessing stage I will have hundreds of tables in total. I wanted to describe them table by table in the README.md when I upload the project to my GitHub profile, but that won't be feasible this way. Do you think I'm going about this the right way, or am I making things harder for myself?
Is there another way to analyze them? Can't I merge them? I'm open to suggestions at this point; this is my first time doing time series analysis.
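One way to avoid hundreds of hand-built tables is to compute the diagnostics programmatically, looping over the sector columns or reshaping to long format. A minimal sketch, assuming monthly data and a hypothetical file name job_vacancies.csv with the dates as the index:

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical file name and layout: date index plus 18 sector columns.
df = pd.read_csv("job_vacancies.csv", index_col=0, parse_dates=True)

# Long format: one row per (date, sector) pair, convenient for faceted plots
# and for keeping everything in a single table instead of 18 separate ones.
long_df = (df.rename_axis("date")
             .reset_index()
             .melt(id_vars="date", var_name="sector", value_name="vacancies"))

# Per-sector diagnostics computed in a loop rather than built by hand.
results = {}
for sector in df.columns:
    series = df[sector].dropna()
    results[sector] = {
        "rolling_mean": series.rolling(window=12).mean(),  # 12-month window
        "decomposition": seasonal_decompose(series, model="additive", period=12),
    }
```

With the long format you can also summarize all 18 sectors in a single chart or table, which is much easier to present in a README than a table-by-table write-up.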

Related

Finding a big enough sample size by expanding search categories. Algorithmic clustering?

I'm interested in finding 50+ similar samples within a dataset of 3M+ rows and 200 columns.
Consider a .csv database of vehicles. Every row is one car, and the columns hold features like brand, mileage, engine size, etc.:
| brand | year bin  | engine bin | mileage |
| ----- | --------- | ---------- | ------- |
| Ford  | 2014-2016 | 1-2        | 20K-30K |
The procedure to automate:
When I receive a new sample, I want to find 50+ similar ones. If I can't find exact matches, I can drop or broaden some of the criteria. For example, the same Ford model between 2012 and 2016 is nearly the same car, so I would expand the search with a bigger year bin. I expect that if I broaden enough categories, I will always find the required population.
After this, I end up with a "search query" like the one below, which returns 50+ samples, so it is maximally precise while still big enough to observe the mean, variance, etc.:
| brand | year bin  | engine bin | mileage |
| ----- | --------- | ---------- | ------- |
| Ford  | 2010-2018 | 1-2        | 10K-40K |
Is there anything like this already implemented?
I've tried k-means clustering the vehicles by these features, but it isn't precise enough and isn't easily interpretable for people without a data science background. I think distance-based metrics can't learn "hard" constraints, such as never searching across different brands. But maybe there is a way of weighting the features?
I'm happy to receive any suggestion!
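One interpretable alternative, sketched below: keep brand as a hard filter and widen the numeric bins step by step until at least 50 rows match. The column names, bin widths, and the ladder of relaxations are all assumptions to adapt to your data:

```python
import pandas as pd

# Hypothetical relaxation ladder, from strictest to loosest.
RELAXATIONS = [
    {},                                           # exact year and mileage
    {"year_slack": 2},                            # widen year by +/- 2
    {"year_slack": 4, "mileage_slack": 10_000},   # also widen mileage
]

def find_similar(df: pd.DataFrame, sample: dict, min_rows: int = 50):
    """Broaden the query step by step until at least min_rows match."""
    matches = df.iloc[0:0]
    for relax in RELAXATIONS:
        mask = df["brand"] == sample["brand"]  # hard constraint, never relaxed
        ys = relax.get("year_slack", 0)
        mask &= df["year"].between(sample["year"] - ys, sample["year"] + ys)
        ms = relax.get("mileage_slack", 0)
        mask &= df["mileage"].between(sample["mileage"] - ms,
                                      sample["mileage"] + ms)
        matches = df[mask]
        if len(matches) >= min_rows:
            break
    return matches
```

Because the relaxations are an explicit, ordered list, the resulting query stays readable for people without a data science background, and hard constraints like brand are trivially enforced, something distance-based clustering cannot guarantee.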

NLP with multiple text columns? How to approach this problem? (Python)

I have a healthcare dataset that includes columns with different text (such as medical history, doctor notes, etc.). I want to use these notes to help build a "criteria list" for the patients who stayed at the hospital for less than 2 days (I have that flagged in the dataset).
I'm new to NLP and have only done coursework projects where a single column of text is used, but this dataset has multiple columns. How do I go about it? Do I combine all the columns into one big string and then do all the text cleaning and processing, or is there another option?
Here's a screenshot of the dataset; I couldn't find any other way to display it:
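Both routes you mention are workable; a sketch of each, under assumed column names (medical_history, doctor_notes, stay_under_2_days) that you would swap for your real ones:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

TEXT_COLS = ["medical_history", "doctor_notes"]  # assumed names

# Option A: concatenate all text columns into one big string per patient,
# then clean and vectorize that single column.
# df["all_text"] = df[TEXT_COLS].fillna("").agg(" ".join, axis=1)

# Option B: vectorize each column separately, so the model can weight
# doctor notes differently from medical history.
preprocess = ColumnTransformer(
    [(col, TfidfVectorizer(max_features=5000), col) for col in TEXT_COLS]
)
model = Pipeline([
    ("tfidf", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
# model.fit(df[TEXT_COLS], df["stay_under_2_days"])
```

Option A is simpler; Option B preserves which column a term came from, which often helps when the columns carry different kinds of information.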

How do I plot a bar graph from the column created by groupby

I'm an environmental engineer trying to make the leap to data science, which interests me more.
I'm new to Python. I work at a company that evaluates air quality data, and I think that if I automate the analysis I will save some time.
I imported the CSV files with the past month's environmental data, did some filtering just to make sure the data were okay, and did a groupby to analyse the data day by day (I need that for my report to the regulatory agency).
Step by step, here is what I did:
```python
medias = tabela.groupby(by=["Data"]).mean()
display(medias)
```
As you can see, there is a column named Data, but when I run the info check it does not recognize Data as a column.
```python
medias.info()
```
How can I solve this? I need to plot some graphs with the concentration of rain and dust per day.
After grouping, reset the index so Data becomes a regular column again:
```python
medias = tabela.groupby(by=["Data"]).mean()
medias = medias.reset_index()  # turns the Data index back into a column
```
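With Data back as a regular column, the daily bar plot is straightforward. A sketch, where "Chuva" and "Poeira" are guesses at your rain and dust column names; substitute whatever medias.columns actually shows:

```python
import matplotlib.pyplot as plt

# "Chuva" and "Poeira" are assumed column names; replace with yours.
medias.plot(x="Data", y=["Chuva", "Poeira"], kind="bar", figsize=(12, 4))
plt.ylabel("Daily mean concentration")
plt.tight_layout()
plt.show()
```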

How can we apply slicing (Python) in real-life cases? Can someone provide some industrial examples?

I'm about to learn Python in order to work with data analysis, and I would like to know how slicing is applied in real-life cases, if someone has examples from real life.
Slicing is used to separate out or get a specified part of something, right?
So let's say I have some purchase data from my website. This data includes satisfaction level, average income, amount spent, and time spent looking before purchasing an item.
If I wanted to see whether there was any correlation between income and amount spent on my website, I could use slicing to retrieve only the data from incomes above a certain point.
All in all, slicing is used to manipulate data, and it's crucial for data preparation in data analysis.
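A small illustration of both kinds of slicing, with made-up purchase data:

```python
import pandas as pd

# Made-up data mirroring the website example above.
purchases = pd.DataFrame({
    "income": [32_000, 58_000, 75_000, 41_000, 90_000],
    "amount_spent": [120, 340, 510, 150, 620],
})

first_three = purchases[:3]                             # positional slice
high_income = purchases[purchases["income"] > 50_000]   # boolean slice

# Correlation between income and spending within the sliced subset.
print(high_income["income"].corr(high_income["amount_spent"]))
```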

KMeans: Extracting the parameters/rules that fill up the clusters

I have created a 4-cluster k-means customer segmentation in scikit-learn (Python). The idea is that every month the business gets an overview of the shifts in the size of each customer cluster.
My question is how to make these clusters "durable". If I rerun my script with updated data, the "boundaries" of the clusters may shift slightly, but I want to keep the old clusters (even though they fit the new data slightly worse).
My guess is that there should be a way to extract the parameters that decide which case goes to which cluster, but I haven't found the solution yet.
Got the answer in a different topic:
Just record the cluster means. Then when new data comes in, compare it to each mean and put it in the one with the closest mean.
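A sketch of that idea with scikit-learn and NumPy, where X_train and X_new are placeholder feature arrays standing in for your customer data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))  # placeholder: this month's customers
X_new = rng.normal(size=(50, 5))     # placeholder: next month's customers

# Fit once and persist the centers so the cluster definitions never drift.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)
np.save("cluster_centers.npy", kmeans.cluster_centers_)

# Later: assign updated data to the nearest saved center instead of
# refitting, which would shift the boundaries.
centers = np.load("cluster_centers.npy")
distances = np.linalg.norm(X_new[:, None, :] - centers[None, :, :], axis=2)
labels = distances.argmin(axis=1)
```

Equivalently, you can pickle the fitted KMeans object and call predict on new data; predict also assigns each point to its nearest stored center without moving the centers.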
