Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I have a dataframe with one column that holds multiple values:
```
                                        Mat Header
0  TURBINE , GAS ; MAKE: M/S HITACHI ; MODEL: H-25
1  TURBINE , GAS ; MAKE: M/S HITACHI ; MODEL: H-25
[43823 rows x 1 columns]
```
How can I split the values into separate columns, like this:
```
Item     ???  Make         Model
Turbine  Gas  M/S Hitachi  H-25
```
You can split on the semicolon like this (replace the column names with your own):
df[['col1', 'col2', 'col3']] = df['col'].str.split(';', expand=True)
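Splitting on `;` alone yields three columns, and the `MAKE:` and `MODEL:` labels stay inside the values. A minimal sketch of one way to get closer to the four-column layout from the question follows. It assumes every row has the same `ITEM , TYPE ; MAKE: ... ; MODEL: ...` shape, and the column name `Type` is a hypothetical stand-in for the `???` column.

```python
import pandas as pd

# Two sample rows copied from the question
df = pd.DataFrame(
    {"Mat Header": ["TURBINE , GAS ; MAKE: M/S HITACHI ; MODEL: H-25"] * 2}
)

# Split on ';' into three pieces: item text, make, model
parts = df["Mat Header"].str.split(";", expand=True)
parts.columns = ["Item", "Make", "Model"]

# The first piece holds two values separated by ','
item = parts["Item"].str.split(",", expand=True)

result = pd.DataFrame({
    "Item": item[0].str.strip(),
    "Type": item[1].str.strip(),  # hypothetical name for the '???' column
    # Drop the 'MAKE:' / 'MODEL:' labels and surrounding whitespace
    "Make": parts["Make"].str.replace(r"^\s*MAKE:\s*", "", regex=True).str.strip(),
    "Model": parts["Model"].str.replace(r"^\s*MODEL:\s*", "", regex=True).str.strip(),
})
print(result)
```

If some rows have more or fewer `;`-separated parts, the column assignment will fail, so it may be worth checking the shape of `parts` first.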
My data (stock data) is structured as follows:
my_data = [date, (price:volume), ..., (price:volume)]
For example:
my_data =
["2022-12-01", (2000:157), (2005:5), (2010:23654), (2050:132)]
["2022-12-02", (1990:4), (2000:123)]
["2022-12-03", (2010:11), (2005:12100), (2050:342)]
["2022-12-04", (2080:1230), (2090:55), (3010:34212354), (3050:29873)]
As you can see, the number of (price:volume) pairs per date is arbitrary.
What is the best way to create a database for data shaped like this?
You can create a table with one row per (date, price, volume) entry:
date        price  volume
2022-12-01   2000     157
2022-12-01   2005       5
2022-12-02   1990       4
This allows you to query based on the date individually, so you can easily filter a day's records or records in between a range of dates.
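As a rough sketch of this layout, the snippet below flattens the first two days from the question into (date, price, volume) rows and loads them into an in-memory SQLite database. SQLite, the table name `trades`, and the Python representation of `my_data` as (date, list of pairs) tuples are all assumptions made for illustration; any relational database would follow the same idea.

```python
import sqlite3

# Assumed Python representation of the question's data: (date, [(price, volume), ...])
my_data = [
    ("2022-12-01", [(2000, 157), (2005, 5), (2010, 23654), (2050, 132)]),
    ("2022-12-02", [(1990, 4), (2000, 123)]),
]

# Flatten to one row per (date, price, volume)
rows = [
    (date, price, volume)
    for date, pairs in my_data
    for price, volume in pairs
]

conn = sqlite3.connect(":memory:")
# 'trades' is a hypothetical table name
conn.execute("CREATE TABLE trades (date TEXT, price INTEGER, volume INTEGER)")
conn.executemany("INSERT INTO trades VALUES (?, ?, ?)", rows)

# All records for a single day
one_day = conn.execute(
    "SELECT price, volume FROM trades WHERE date = ? ORDER BY price",
    ("2022-12-02",),
).fetchall()

# Total volume per day within a date range (ISO dates compare correctly as text)
per_day = conn.execute(
    "SELECT date, SUM(volume) FROM trades "
    "WHERE date BETWEEN ? AND ? GROUP BY date ORDER BY date",
    ("2022-12-01", "2022-12-02"),
).fetchall()
print(one_day, per_day)
```

Because each pair becomes its own row, the varying number of pairs per date needs no special handling.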
I'm using pandas and have a dataset with 20 columns and 65 rows. I want to measure data completeness, that is, the percentage of NaN values relative to the whole dataset. For example, the output I need is: The percentage of NaNs in the dataset is: 40%
I've counted the NaNs with comp_df.isna().sum().sum() and got 776, but I don't know what to do next.
For example, with this sample data:
import numpy as np
import pandas as pd

comp_df = pd.DataFrame(dict(a=[np.nan, 1, 1],
                            b=[np.nan, np.nan, np.nan]))
print(comp_df)
     a   b
0  NaN NaN
1  1.0 NaN
2  1.0 NaN
In your solution, you can divide the NaN count by DataFrame.size, the total number of values:
print(comp_df.isna().sum().sum() / comp_df.size * 100)
66.66666666666666
Or reshape the values into a Series with DataFrame.stack and use mean, which is sum/count by definition:
print(comp_df.isna().stack().mean() * 100)
66.66666666666666
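For the numbers in the question, a 20-column, 65-row frame has a size of 1300, so 776 NaNs would come to roughly 59.7%. The sketch below, which reuses the small sample frame, shows one possible way to print the message in the requested format and to break the figure down per column. The variable names `pct_missing` and `per_column` are illustrative only.

```python
import numpy as np
import pandas as pd

comp_df = pd.DataFrame({"a": [np.nan, 1, 1],
                        "b": [np.nan, np.nan, np.nan]})

# Mean of the boolean NaN mask over all cells = share of missing values
pct_missing = comp_df.isna().to_numpy().mean() * 100
print(f"The percentage of NaNs in the dataset is: {pct_missing:.0f}%")

# Same idea per column: the mean of each boolean column
per_column = comp_df.isna().mean() * 100
print(per_column)
```

The per-column view can help locate which columns drive the overall incompleteness.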
Date,hrs,Count,Status
2018-01-02,4,15,SFZ
2018-01-03,5,16,ACZ
2018-01-04,3,14,SFZ
2018-01-05,5,15,SFZ
2018-01-06,5,18,ACZ
This is a fraction of the data I've been working on. The actual data is in the same format, with around 1000 entries for each date. I take start_date and end_date as inputs from the user. In this case, consider:
start_date:2018-01-02
end_date:2018-01-06
I have to display the totals of hrs and Count within the selected date range in the output. I also want to do it using an @app.callback in Dash (plot.ly). Can someone help, please?
Use Series.between to build a boolean mask, filter the rows and select the hrs and Count columns with DataFrame.loc, then sum:
totals = df.loc[df['Date'].between('2018-01-02', '2018-01-06'), ['hrs', 'Count']].sum()
print(totals)
hrs      22
Count    78
dtype: int64
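A minimal sketch of the same filter wrapped in a reusable function follows. It assumes the Date column is parsed to datetimes first, and the helper name `totals_between` is hypothetical. A Dash callback could call such a helper with the user's two dates, but the Dash wiring itself is not shown here.

```python
import io

import pandas as pd

csv_text = """Date,hrs,Count,Status
2018-01-02,4,15,SFZ
2018-01-03,5,16,ACZ
2018-01-04,3,14,SFZ
2018-01-05,5,15,SFZ
2018-01-06,5,18,ACZ
"""

# Parse Date as datetime so comparisons do not depend on string ordering
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])


def totals_between(frame, start_date, end_date):
    """Sum hrs and Count for rows whose Date lies in [start_date, end_date]."""
    mask = frame["Date"].between(pd.Timestamp(start_date), pd.Timestamp(end_date))
    return frame.loc[mask, ["hrs", "Count"]].sum()


totals = totals_between(df, "2018-01-02", "2018-01-06")
narrow = totals_between(df, "2018-01-03", "2018-01-04")
print(totals)
```

Both bounds are inclusive, which matches the default behavior of Series.between.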
I have a dataframe with a 'genre' column in which each entry holds several values. For example, the movie 'Harry Potter' could have fantasy,adventure in the genre column. I am doing data analysis and exploration, and I have no idea how to represent this multi-valued column so that it shows relationships between movies and/or genres.
I have thought of using graph analysis to show the relationships, but what other approaches could I consider?
You can use str.get_dummies to create one indicator column per genre:
import pandas as pd

df = pd.DataFrame({'Movies': ['Harry Potter', 'Toy Story'],
                   'Genres': ['fantasy,adventure',
                              'adventure,animation,children,comedy,fantasy']})
# print(df)
# Index by movie title, then split each genre string on ',' into 0/1 columns
df = df.set_index('Movies')['Genres'].str.get_dummies(',')
print(df)
              adventure  animation  children  comedy  fantasy
Movies
Harry Potter          1          0         0       0        1
Toy Story             1          1         1       1        1
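Since the question asks about relationships between genres, one possible follow-up (not part of the answer above) is a genre co-occurrence matrix, obtained by multiplying the transposed indicator table by itself. This is only a sketch; the names `dummies` and `co_occurrence` are illustrative.

```python
import pandas as pd

movies = pd.DataFrame({
    "Movies": ["Harry Potter", "Toy Story"],
    "Genres": ["fantasy,adventure",
               "adventure,animation,children,comedy,fantasy"],
})

# One 0/1 indicator column per genre, indexed by movie title
dummies = movies.set_index("Movies")["Genres"].str.get_dummies(",")

# Entry (g1, g2) counts the movies tagged with both g1 and g2;
# the diagonal counts the movies per genre
co_occurrence = dummies.T.dot(dummies)
print(co_occurrence)
```

The resulting square table can be fed to a heatmap or used as edge weights if a graph view is still wanted.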
I have a data frame with 1000 rows and 10 columns.
3 of these columns are 'total_2013', 'total_2014' and 'total_2015'
I would like to create a new column, containing the average of total over these 3 years for each row, but ignoring any 0 values.
If you are using pandas:
Use DataFrame.mean with its skipna parameter.
First replace 0 with NaN, keeping the result, since replace returns a new DataFrame:
import numpy as np

columns = ['total_2013', 'total_2014', 'total_2015']
nonzero = df[columns].replace(0, np.nan)
Then compute the mean:
df["total"] = nonzero.mean(
    axis=1,       # mean across the columns, one value per row
    skipna=True   # skip NaN values (the former zeros)
)
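A small worked example of this approach, using made-up numbers rather than data from the question, may help to check the behavior. It uses NaN as the placeholder for zeros; a row that contains only zeros ends up with NaN as its average.

```python
import numpy as np
import pandas as pd

# Made-up sample values for illustration
df = pd.DataFrame({
    "total_2013": [10, 0, 3],
    "total_2014": [0, 0, 6],
    "total_2015": [20, 0, 9],
})
columns = ["total_2013", "total_2014", "total_2015"]

# Turn zeros into NaN on a copy, then average each row while skipping NaN
df["total"] = df[columns].replace(0, np.nan).mean(axis=1, skipna=True)
print(df)
```

The original columns keep their zeros because the replacement is applied to a temporary copy, not written back to `df`.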