I'm trying to convert a whole column (actually four columns) from object to datetime.
The dataframe contains some NaN values that are making my life more difficult, so I took the following strategy:
Fill NaN values with a specific date using df[column].fillna('2030-1-1')
Convert to date time using pd.to_datetime
The curious thing is that this strategy worked for only two of the columns, while the other two still raise the error, as shown in the picture below.
As you can see, the first two columns have been converted successfully.
Although all columns received the same value from the fillna command, the last two columns cannot be converted, as shown in this picture.
I also observed that these columns originally differ in data type:
as you can see, the "error column" originally holds an array of datetime.datetime objects.
I have searched the NumPy array documentation, but I have not yet been able to solve this problem. Would you help me?
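One possible variation on the strategy above, offered as a sketch rather than a confirmed fix for the error in the question: pd.to_datetime turns missing values into NaT on its own, so the fillna step may not be needed. The column names and sample values below are invented for illustration; one column holds strings and the other holds datetime.datetime objects, both with NaN.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Hypothetical object-dtype columns: one of strings, one of datetime objects
df = pd.DataFrame(
    {
        "start": ["2021-03-04", np.nan, "2021-05-06"],
        "end": [datetime(2021, 3, 5), np.nan, datetime(2021, 5, 7)],
    },
    dtype=object,
)

for column in ["start", "end"]:
    # errors="coerce" turns anything unparseable into NaT instead of raising
    df[column] = pd.to_datetime(df[column], errors="coerce")

print(df.dtypes)
```

If a placeholder date is still wanted, it can be filled in after the conversion with df[column].fillna(pd.Timestamp("2030-01-01")), so the column never mixes strings with datetime objects.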
I have a pandas dataframe with an index and just one column. The index holds dates and the column holds values. I would like to find the NewValue for a NewDate that is not in the index. To do that, I suppose I could use an interpolation function such as NewValue = InterpolationFunction(NewDate, Index, Column, method, etc.). So what is the InterpolationFunction? It seems that most interpolation functions are used for padding, filling in missing values, etc. This is not what I want. I just want the NewValue, not to build a new dataframe.
Thank you very very much for taking the time to read this.
I am afraid that you cannot find the missing values without constructing a base for your data. Here is the answer to your question if you make a base dataframe:
You need to construct a panel in order to set up your data for proper interpolation. For instance, in the case of the date, let's say that you have yearly data and you want to add information for a missing year in between or generate data for quarterly intervals.
You need to construct a base for the time series i.e.:
dates = pd.date_range(start="1987-01-01",end="2021-01-01", freq="Q").values
panel = pd.DataFrame({'date_Q' :dates})
Now you can join your data to this structure and it will have dates without information as missing. Now you need to use a proper interpolation algorithm to fill the missing values. Pandas .interpolate() method has some basic interpolation methods such as polynomial and linear which you can find here.
However, many more interpolation methods are offered by SciPy, which you can find in the tutorials here.
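The steps described in this answer could be sketched roughly as follows. The yearly values are made up, the quarter-start frequency "QS" is my choice rather than the "Q" used above, and method="time" is just one of the options .interpolate() accepts.

```python
import pandas as pd

# Hypothetical yearly data indexed by date
yearly = pd.Series(
    [10.0, 20.0, 40.0],
    index=pd.to_datetime(["2019-01-01", "2020-01-01", "2021-01-01"]),
)

# Base grid of quarter starts covering the same period
base = pd.date_range(start="2019-01-01", end="2021-01-01", freq="QS")

# Put the data on the combined grid; dates without data become NaN
on_grid = yearly.reindex(yearly.index.union(base))

# Fill the gaps linearly, weighted by the time between observations
filled = on_grid.interpolate(method="time")

new_value = filled.loc[pd.Timestamp("2020-07-01")]
print(new_value)
```

To get the value for a single new date, that date can be added to the grid in the same way instead of a whole date_range.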
I have a column in a pyspark dataframe for the age of an electronic device, and these values are given in milliseconds. Is there an easy way to convert that column's values to years? I am not well versed in Spark.
EDIT: I understand that you can convert milliseconds to years pretty easily with basic math, I'm trying to take a column of a pyspark dataframe and iterate through it and convert all column values to a different value. Is there a specific pyspark function that makes this easier or no? I have a column where all values are very large integers with time in milliseconds, and I am trying to filter out values which are too small or large to make sense based on the lifespan of the device.
table.filter(F.col("age")>0).filter(F.col("age")<yearsToSeconds(20))
where yearsToSeconds is a very basic function converting years to seconds. I'd prefer being able to convert the column values to years, but I haven't worked with spark before and I don't know an optimal way to do that.
Well, one way is to use withColumn.
Here I'm demonstrating adding a new column called "ageinMin" to the dataframe, calculated from the "age" column by dividing it by 60000 (60 seconds × 1000 milliseconds) to get the equivalent minutes:
from pyspark.sql.functions import col
df = df.withColumn("ageinMin", col("age") / 60000)
I am new to coding and I cannot seem to find the answer to this problem online.
I am trying to compare two sets of data for a data science project.
I am comparing one column, which can take one of two categories, 'recurrence-events' or 'no-recurrence-events', to another column, 'Age bracket', which has five categories.
I would like to find the relationship between the two and find the number of rows that fall under each category.
Below I have posted a screenshot of the outcome from a website that has done a similar project to the one I am doing, using a dataset from the UCI Repository.
It would be helpful if you could post a sample of the data. Otherwise, it is difficult to understand what your problem is and what your desired result is.
However, I can understand that you have two categorical columns and you would want to get the number of rows in each of the combinations.
The following might help you.
import pandas as pd
# assuming that the DataFrame name is df and the columns are ['column1', 'column2']
pd.crosstab(df['column1'], df['column2'])
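As a small illustration of what that call returns, here is a sketch on invented sample rows (the column names and values are placeholders, not the real UCI data). Passing margins=True, an optional argument of pd.crosstab, adds row and column totals labelled "All".

```python
import pandas as pd

# Made-up sample rows standing in for the real dataset
df = pd.DataFrame(
    {
        "recurrence": [
            "no-recurrence-events",
            "recurrence-events",
            "no-recurrence-events",
            "no-recurrence-events",
            "recurrence-events",
        ],
        "age_bracket": ["30-39", "30-39", "40-49", "40-49", "40-49"],
    }
)

# Count the rows for every combination of the two categorical columns
counts = pd.crosstab(df["recurrence"], df["age_bracket"], margins=True)
print(counts)
```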
I have one dataframe of readings that come in a particular arrangement due to the nature of the experiment. I also have another dataframe, that contains information about each point on the dataframe and what each point corresponds to, in terms of what chemical was at that point. Note, there are only a few different chemicals over the dataframe, but they are arranged all over the dataframe.
What I want to do is to create a new, reorganised dataframe, where the columns are the type of chemical. My initial thought was to compare the data and information dataframes to produce a dictionary, which I could then transform into a new dataframe. I could not figure out how to do this, and it might not actually be the best approach either!
I have previously achieved it by manually rearranging the points on the dataframe to match the pattern I want, but I'm not happy with this approach and there must be a better way.
Thanks in advance for any help!
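A rough sketch of one way this might be approached, assuming the readings dataframe and the information dataframe have exactly the same shape and labels. The names readings and layout and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical plate of readings and a matching layout of chemical names
readings = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=["A", "B"], columns=[1, 2])
layout = pd.DataFrame(
    [["water", "acid"], ["acid", "water"]], index=["A", "B"], columns=[1, 2]
)

# Flatten both grids in the same row-major order so positions line up
long = pd.DataFrame(
    {
        "chemical": layout.to_numpy().ravel(),
        "reading": readings.to_numpy().ravel(),
    }
)

# One column per chemical; shorter columns are padded with NaN
by_chemical = pd.DataFrame(
    {
        name: group["reading"].reset_index(drop=True)
        for name, group in long.groupby("chemical")
    }
)
print(by_chemical)
```

Building each column as a Series rather than a list means chemicals with different numbers of points can still share one dataframe.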
In pandas you can replace the default integer-based index with an index made up of any number of columns using set_index().
What confuses me, though, is when you would want to do this. Regardless of whether the series is a column or part of the index, you can filter values in the series using boolean indexing for columns, or xs() for rows. You can sort on the columns or index using either sort_values() or sort_index().
The only real difference I've encountered is that indexes have issues when there are duplicate values, so it seems that using an index is more restrictive, if anything.
Why then, would I want to convert my columns into an index in Pandas?
In my opinion custom indexes are good for quickly selecting data.
They're also useful for aligning data for mapping, for arithmetic operations where the index is used for data alignment, for joining data, and for getting minimal or maximal rows per group.
DatetimeIndex is nice for partial string indexing, for resampling.
But you are right, a duplicate index is problematic, especially for reindexing.
Docs:
Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display
Enables automatic and explicit data alignment
Allows intuitive getting and setting of subsets of the data set
Also you can check Modern pandas - Indexes, direct link.
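A brief sketch of two of the points from this answer, quick selection by label and partial string indexing on a DatetimeIndex. All names and values are invented for illustration.

```python
import pandas as pd

# Selecting by label once a column has been promoted to the index
prices = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Cairo"], "price": [3.5, 1.2, 0.8]}
).set_index("city")
lima_price = prices.loc["Lima", "price"]

# Partial string indexing: a month string selects every day in that month
days = pd.date_range("2021-01-01", periods=90, freq="D")
daily = pd.Series(range(90), index=days)
february = daily.loc["2021-02"]

print(lima_price, len(february))
```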
As of 0.20.2, some methods, such as .unstack(), only work with indices.
Custom indices, especially indexing by time, can be particularly convenient. Besides resampling and aggregating over any time interval, both of which require a DatetimeIndex (the latter is done using .groupby() with pd.TimeGrouper(), which later pandas versions replaced with pd.Grouper(freq=...)), you can call the .plot() method on a column, e.g. df['column'].plot(), and immediately get a time series plot.
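To make the resampling and time-based grouping concrete, here is a minimal sketch on made-up daily data. It uses pd.Grouper(freq=...) in place of pd.TimeGrouper, which newer pandas releases no longer provide, and the month-start frequency "MS" is my own choice.

```python
import pandas as pd

# Hypothetical daily series covering January and February 2021
days = pd.date_range("2021-01-01", periods=59, freq="D")
daily = pd.Series(1, index=days)

# Two ways to aggregate by calendar month on a DatetimeIndex
monthly_resampled = daily.resample("MS").sum()
monthly_grouped = daily.groupby(pd.Grouper(freq="MS")).sum()

print(monthly_resampled)
print(monthly_grouped)
```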
The most useful feature, though, is alignment: for example, suppose you had two sets of data that you want to add; they're labeled consistently, but sorted in a different order. If you set their labels to be the index of their dataframe, you can simply add the dataframes together and not worry about the ordering of the data.
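The alignment behaviour described here can be seen in a tiny sketch. The labels and numbers are invented.

```python
import pandas as pd

# Same labels in both frames, but stored in a different row order
a = pd.DataFrame({"label": ["x", "y", "z"], "value": [1, 2, 3]}).set_index("label")
b = pd.DataFrame({"label": ["z", "x", "y"], "value": [30, 10, 20]}).set_index("label")

# Addition matches rows by index label, not by position
total = a + b
print(total)
```

Without set_index, the same addition would pair rows by their default integer positions and add mismatched labels together.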