I am working from this dataset and I would like to combine yr_built and yr_renovated into one column, preferably yr_built, based on this rule: if the value in yr_renovated is bigger than 0, I would like to take that value, otherwise yr_built's value.
Can you please help me on this?
Thank you!
Here you go. You need pandas for the dataframe; then create the new column with numpy's np.where: take 'yr_renovated' where it is greater than zero, otherwise fall back to 'yr_built'.
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/Jonasyao/Machine-Learning-Specialization-University-of-Washington-/master/Regression/Assignment_four/kc_house_data.csv', on_bad_lines='skip')
df = df[['yr_built', 'yr_renovated', 'date', 'bedrooms']]
# take yr_renovated where it is > 0, otherwise fall back to yr_built
df['MyYear'] = np.where(df['yr_renovated'] > 0, df['yr_renovated'], df['yr_built'])
df
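In case it helps, here is a minimal self-contained sketch of the same np.where logic on mock data (the column names match the dataset, but the values below are made up):

```python
import pandas as pd
import numpy as np

# mock data with the same column names as kc_house_data.csv
df = pd.DataFrame({'yr_built': [1955, 1987, 2001],
                   'yr_renovated': [0, 1999, 0]})

# where yr_renovated > 0 take it, otherwise fall back to yr_built
df['MyYear'] = np.where(df['yr_renovated'] > 0, df['yr_renovated'], df['yr_built'])
print(df['MyYear'].tolist())  # -> [1955, 1999, 2001]
```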
I have the following two 'Series' objects, which I want to write into a normal csv file with two columns (date and value):
However, with my code (see below) I only managed to include the value, but not the date with it:
Help would be greatly appreciated!
Best,
Lena
You could create a pandas DataFrame from the series and use the DataFrame's to_csv method, which writes the index (the dates) as the first column. Like this:
import pandas as pd
hailarea_south = pd.DataFrame(hailarea_south)
hailarea_south.to_csv("hailarea_south.csv")
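To make this runnable on its own, here is the same approach on a mock series (the name hailarea_south and the values are stand-ins for the asker's data); the date index ends up as the first csv column:

```python
import pandas as pd

# mock stand-in for hailarea_south; the real values come from the asker's data
hailarea_south = pd.Series({'2002-04-01': 0, '2021-09-28': 167})

df = pd.DataFrame(hailarea_south)
df.to_csv("hailarea_south.csv")  # the date index becomes the first csv column
```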
I suggest using pandas.DataFrame.to_csv, since it handles most of the things you're currently doing manually. For this approach, however, a DataFrame is easiest to handle, so we need to convert first.
import numpy as np
import pandas as pd
# setup mockdata for example
north = pd.Series({'2002-04-01': 0, '2002-04-02': 0, '2021-09-28': 167})
south = pd.Series({'2002-04-01': 0, '2002-04-02': 0, '2021-09-28': 0})
# convert series to DataFrame
df = pd.DataFrame(data={'POH>=80_south': south, 'POH>=80_north': north})
# save as csv
df.to_csv('timeseries_POH.csv', index_label='date')
output:
date,POH>=80_south,POH>=80_north
2002-04-01,0,0
2002-04-02,0,0
2021-09-28,0,167
In case you want different separators, quotes or the like, please refer to the to_csv documentation for further reading.
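For instance, a small sketch of non-default formatting (the filename and data are made up): sep changes the field separator and quoting controls quoting via the stdlib csv constants.

```python
import csv
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# semicolon-separated, with every field quoted
df.to_csv('out.csv', sep=';', quoting=csv.QUOTE_ALL)
```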
I am trying to output a filtered list based on the input of the index.
In my case, I want to make the Location the index, and only show all the results whose location is 'Switzerland'. I am using jupyter-notebook
I have an xlsx file called Book1 containing [this data.][1]
I type this in:
import pandas as pd
from pandas import Series, DataFrame
from scipy import stats
substats=pd.read_excel('Book1.xlsx', index_col=1) #index_col=1 makes Location the index
I am stuck, but I am expecting [the output to be like this.][2]
Notice that in the second image the index is not 4, 6, but 1, 2.
Can you help me with this?
[1]: https://i.stack.imgur.com/NIbKx.png
[2]: https://i.stack.imgur.com/whSEP.png
I believe this is an off-by-one error.
pandas' index_col parameter is 0-based, so index_col=1 selects the second column of the sheet, not the first. Double-check which position Location actually has in Book1.xlsx and pass that 0-based position to index_col.
Once Location is the index, you can pull all the Swiss rows with substats.loc['Switzerland'].
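A sketch of the lookup on made-up data (the real columns come from Book1.xlsx, so the names below are assumptions):

```python
import pandas as pd

# made-up stand-in for Book1.xlsx, with Location already set as the index
substats = pd.DataFrame(
    {'Name': ['A', 'B', 'C'], 'Score': [10, 20, 30]},
    index=pd.Index(['Switzerland', 'Germany', 'Switzerland'], name='Location'))

swiss = substats.loc[['Switzerland']].reset_index()  # double brackets keep a DataFrame
swiss.index = range(1, len(swiss) + 1)  # renumber the rows 1, 2, ... as in the expected output
print(swiss)
```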
I am completely new to Python. Could you help me fix my code?
I can't make it work because, for some reason, it only calculates the mean of a column.
import numpy as np
import pandas as pd
rainfall = pd.read_csv('rainfall.csv', low_memory=False, parse_dates=True, header=None)
mean_rainfall = rainfall[0].mean()
print(mean_rainfall)
the picture of my csv
In the pandas DataFrame mean function you can provide the axis parameter to tell it whether to take the mean over rows or columns.
Check here: pandas.DataFrame.mean.
It seems it takes the default axis value of 0, so it is calculating the mean of each column. Pass axis=1 to get the mean of each row, or select a single row first and take its mean.
Try this:
mean_rainfall = rainfall.iloc[0].mean()  # mean of the first row; use rainfall.mean(axis=1) for every row
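To see the axis behaviour concretely, here is a toy stand-in for rainfall.csv (the real values come from the asker's file):

```python
import pandas as pd

# toy stand-in for rainfall.csv: 2 rows x 3 columns, no header
rainfall = pd.DataFrame([[1.0, 2.0, 3.0],
                         [4.0, 5.0, 6.0]])

print(rainfall.mean())          # default axis=0: one mean per column
print(rainfall.mean(axis=1))    # axis=1: one mean per row -> 2.0 and 5.0
print(rainfall.iloc[0].mean())  # mean of just the first row -> 2.0
```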
I am trying to create a dataframe where the column lengths are not equal. How can I do this?
I was trying to use groupby. But I think this will not be the right way.
import pandas as pd
data = {'filename':['file1','file1'], 'variables':['a','b']}
df = pd.DataFrame(data)
grouped = df.groupby('filename')
print(grouped.get_group('file1'))
Above is my sample code. The output of which is:
What can I do to just have one entry of 'file1' under 'filename'?
Eventually I need to write this to a csv file.
Thank you
If a column has only one entry, the other rows will be NaN. So you could just filter out the NaNs by doing something like df = df[df["filename"].notnull()]
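A short sketch of that filter, assuming the repeated filename is left empty (which pandas reads as NaN):

```python
import pandas as pd
import numpy as np

# assumed shape: the second occurrence of 'file1' was left empty -> NaN
df = pd.DataFrame({'filename': ['file1', np.nan], 'variables': ['a', 'b']})

# keep only the rows that actually carry a filename
filenames_only = df[df['filename'].notnull()]
print(filenames_only)
```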
I have a Pandas dataframe in Python (3.6) with numeric and categorical attributes. I want to pull a list of numeric columns for use in other parts of my code. My question is what is the most efficient way of doing this?
This seems to be the standard answer:
num_cols = df.select_dtypes([np.number]).columns.tolist()
But I'm worried that select_dtypes() can be slow, and this seems to add a middle step that I'm hoping isn't necessary (subsetting the data before pulling back the column names of just the numeric attributes).
Any ideas on a more efficient way of doing this? (I know there is a private method _get_numeric_data() that could also be used, but wasn't able to find out how that works and I don't love using a private method as a long-term solution).
df.select_dtypes is for selecting data: it makes a copy of your data, which you essentially discard by then only keeping the column names. This is an inefficient way. Just use something like:
df.columns[[np.issubdtype(dt, np.number) for dt in df.dtypes]]
Two ways (without using df.select_dtypes which unnecessarily creates a temporary intermediate dataframe):
import numpy as np
[c for c in df.columns if np.issubdtype(df[c].dtype, np.number)]
from pandas.api.types import is_numeric_dtype
[c for c in df.columns if is_numeric_dtype(df[c])]
Or if you want the result to be a pd.Index rather than just a list of column name strings as above, here are three ways (the first is from @juanpa.arrivillaga):
import numpy as np
df.columns[[np.issubdtype(dt, np.number) for dt in df.dtypes]]
from pandas.api.types import is_numeric_dtype
df.columns[[is_numeric_dtype(df[c]) for c in df.columns]]
from pandas.api.types import is_numeric_dtype
df.columns[list(map(is_numeric_dtype, df.dtypes))]
Note that the np.issubdtype solutions do not consider a bool column to be numeric, while pandas' is_numeric_dtype does (tested with numpy 1.22.3 / pandas 1.4.2).
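To make the bool caveat concrete, a small sketch (the column names are made up):

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({'i': [1, 2], 'f': [0.5, 1.5],
                   'b': [True, False], 's': ['x', 'y']})

# numpy's type lattice: bool is not a subtype of np.number
by_numpy = [c for c in df.columns if np.issubdtype(df[c].dtype, np.number)]
print(by_numpy)   # -> ['i', 'f']

# pandas' is_numeric_dtype treats bool as numeric
by_pandas = [c for c in df.columns if is_numeric_dtype(df[c])]
print(by_pandas)  # -> ['i', 'f', 'b']
```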