Problem when adding a column to a pandas dataframe - python

I am adding a pandas Series as a new column in a DataFrame and observed unintended changes.
X_test_col["Valor real ppna"] = y_test_pd
print(np.all(y_test_pd.iloc[0].dtype == "float64"), np.all(X_test_col["Valor real ppna"].dtype == "float64"))
Output
True True
Then
print (len(y_test_pd), len(X_test_col["Valor real ppna"]))
Output
9000 9000
And then
print (round(X_test_col["Valor real ppna"].count()))
Output
256
Finally
print (round(y_test_pd.count()))
Output:
ppna (kg/ha.dia) 9000
dtype: int64
As you can see, the column X_test_col["Valor real ppna"] assigned as a new column appears to have the same length and the same dtype as the original series.
BUT when I try to use the values, only 256 of them are there, NOT the 9000 I expected.
I have been working on this problem for some hours, trying to assign the dtype and so on, and I need some help.
What should I do to correctly assign the 9000 values to the new dataframe column?
FYI: y_test_pd comes from sklearn's train_test_split() function.
To explain better, I have exported the dataframes to Excel: screenshots of the original dataset, the column to add, and the desired output were attached here.
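A likely cause is index alignment: y_test_pd keeps the shuffled row labels produced by train_test_split, so when it is assigned as a column, pandas aligns on those labels and fills every non-matching row with NaN; count() then only sees the 256 rows whose labels happen to match. A minimal sketch of the usual fix, assuming that is what is happening here:
# Assign the raw values to side-step index alignment entirely
# (ravel() flattens the one-column DataFrame to a 1-D array):
X_test_col["Valor real ppna"] = y_test_pd.to_numpy().ravel()

# Or give both objects the same fresh 0..n-1 index before assigning:
# X_test_col = X_test_col.reset_index(drop=True)
# X_test_col["Valor real ppna"] = y_test_pd.reset_index(drop=True).iloc[:, 0]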

Related

Pycharm problem set (Stuck from step 3 onwards)

Using the ff_monthly.csv data set (https://github.com/alexpetralia/fama_french):
1. Use the first column as an index (this contains the year and month of the data as a string).
2. Create a new column ‘Mkt’ as ‘Mkt-RF’ + ‘RF’.
3. Create two new columns in the loaded DataFrame, ‘Month’ and ‘Year’, to contain the year and month of the dataset extracted from the index column.
4. Create a new DataFrame with columns ‘Mean’ and ‘Standard Deviation’ and the full set of years from (b) above.
5. Write a function which accepts (r_m, s_m), the monthly mean and standard deviation of a return series, and returns a tuple (r_a, s_a), the annualised mean and standard deviation. Use the formulae: r_a = (1 + r_m)^12 - 1 and s_a = s_m * 12^0.5.
6. Loop through each year in the data and calculate the annualised mean and standard deviation of the new ‘Mkt’ column, storing each in the newly created DataFrame. Note that the values in the input file are % returns and need to be divided by 100 to return decimals (i.e. the value for August 2022 represents a return of -3.78%).
7. Print the DataFrame and output it to a csv file.
Workings so far:
import pandas as pd

# Load the data, using the first column (the date string) as the index.
ff_monthly = pd.read_csv(r"file path", index_col=0)

# Create the 'Mkt' column as 'Mkt-RF' + 'RF'.
Mkt = ff_monthly['Mkt-RF'] + ff_monthly['RF']
ff_monthly = ff_monthly.assign(Mkt=Mkt)
df = ff_monthly  # already a DataFrame; wrapping it in pd.DataFrame() again is redundant
There are a few things to pay attention to.
The Date is the index of your DataFrame. The index is treated in a special way compared to normal columns, which is why df.Date raises an AttributeError: Date is not an attribute but the index. Try df.index instead.
df.Date.str.split("_", expand=True) would work if your Date looked like 22_10. However, according to your picture it doesn't contain an underscore and it also contains the day, so this cannot work.
In fact, the format you have doesn't follow any standard. The best way to deal with it is to parse it into a proper datetime64[ns] type that pandas understands, with df.index = pd.to_datetime(df.index, format='%y%m%d'). See the Python documentation for supported format strings.
If all this works, it should be rather straightforward to create the columns:
df['year'] = df.index.year
In fact, this part has been asked before
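Putting it together, a minimal end-to-end sketch of the assignment (the %y%m%d format and the 'Mkt-RF'/'RF' column names are taken from the discussion above; treat them as assumptions until checked against the actual file):
import pandas as pd

# Load the file and parse the index strings into real datetimes.
df = pd.read_csv("ff_monthly.csv", index_col=0)
df.index = pd.to_datetime(df.index.astype(str), format="%y%m%d")

df["Mkt"] = df["Mkt-RF"] + df["RF"]
df["Year"] = df.index.year
df["Month"] = df.index.month

def annualise(r_m, s_m):
    # r_a = (1 + r_m)^12 - 1, s_a = s_m * 12^0.5
    return (1 + r_m) ** 12 - 1, s_m * 12 ** 0.5

rows = []
for year, grp in df.groupby("Year"):
    # Input values are % returns, so divide by 100 first.
    r_a, s_a = annualise(grp["Mkt"].mean() / 100, grp["Mkt"].std() / 100)
    rows.append({"Year": year, "Mean": r_a, "Standard Deviation": s_a})

out = pd.DataFrame(rows).set_index("Year")
print(out)
out.to_csv("annualised_mkt.csv")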

pandas dataframe aggregate adding index in some cases

I have a pandas DataFrame with an id column and relatively large text in another column. I want to group by the id column and concatenate all the large texts into one single text whenever the id repeats. It works great in a simple toy example, but when I run it on my real data it adds the index of the added rows to the final concatenated text. Here is my example code:
import pandas as pd

data = {"A": [1, 2, 2, 3], "B": ['asdsa', 'manish', 'shukla', 'wfs']}
testdf = pd.DataFrame(data)
testdf = testdf.groupby(['A'], as_index=False).agg({'B': " ".join})
As you can see, this code works great, but when I run it on my real data it adds indexes at the beginning of column B, so it says something like "1 manish \n 2 shukla" for A=2. It obviously works here, but I have no idea why it misbehaves when I have larger text with real data. Any pointers? I tried to search, but apparently no one else has run into this issue.
OK, I figured out the answer: if any rows in the dataframe are NA or null, it does that. Once I removed the NAs and nulls, it worked.
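For reference, a small sketch of that fix, dropping the missing rows before the join (the None here stands in for whatever NA values the real data contains):
import pandas as pd

data = {"A": [1, 2, 2, 3], "B": ['asdsa', None, 'shukla', 'wfs']}
testdf = pd.DataFrame(data)

# Drop rows where B is missing so " ".join only ever sees real strings.
clean = testdf.dropna(subset=["B"])
result = clean.groupby(["A"], as_index=False).agg({"B": " ".join})
print(result)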

Python Pandas: Series and getting value from data frame counts null entries?

I have a csv file with 7000 rows and 5 cols.
I have an array of 5000 words that I want to add to the same CSV file in a new column. I added a column 'originalWord' and used the pd.Series function, which added the 5000 words in a single column as I want.
allWords = ['x'] * 5000  # placeholder for the 5000 actual words
df['originalWord'] = pd.Series(allWords)
My problem now is that I want to get the data in the column 'originalWord' - whether by putting it in an array or accessing the column directly - even though it's only 5000 rows and the file has 7000 rows (with the last 2000 being null values).
print(len(df['originalWord']))
7000
Any idea how to make it reflect the original length of 5000? Thank you.
If I understand you correctly, what you're asking for isn't possible. From what I can gather, you have a DataFrame with 7000 rows and 5 columns, meaning that the index has size 7000. To this DataFrame you would like to add a column that has 5000 rows. Since there are 7000 rows in total, the appended column will have 2000 missing values that are assigned NaN. That's why you see the length as 7000.
In short, there is no way to access df['originalWord'] and automatically exclude all missing values, as even that Series has an index of size 7000. The closest you could get is to write a function that includes dropna(), if the issue is that you find it bothersome to call it repeatedly.
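For example, dropna() gives you back a Series of just the 5000 real values:
import pandas as pd

df = pd.DataFrame(index=range(7000))             # stands in for the 7000-row file
df["originalWord"] = pd.Series(["word"] * 5000)  # only the first 5000 rows are filled

print(len(df["originalWord"]))           # 7000 - includes the 2000 NaNs
print(len(df["originalWord"].dropna()))  # 5000 - the values you actually set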

Pandas dataframe returns incorrect sort using two float columns

I am playing with some geo data. Given a point, I am trying to map to an object. So for each connection, I generate two distances, both floats. To find the closest, I want to sort my dataframe by both distances and pick the top row.
Unfortunately, when I run a sort (df.sort_values(by=['direct distance', 'pt_to_candidate'])) I get the following out-of-order result.
I would expect the top two rows, but flipped. If I run the sort on either column solo, I get the expected results. If I flip the order of the sort (['pt_to_candidate', 'direct distance']) I get a correct result, though not the one I necessarily want for my function.
Both columns are type float64.
Why is this sort returning oddly?
For completeness, I should state that I have more columns and rows. From the main dataframe, I filter first and then sort. Also, I cannot recreate this by manually entering data into a new dataframe, so I suspect the float precision is the issue.
Edit
Adding a value_counts on 'direct distance':
4.246947    7
3.147303    2
2.875081    1
2.875081    1
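Note the two 2.875081 rows with a count of 1 each: that suggests the underlying floats differ beyond the six digits that get printed, in which case the sort is actually correct and only looks wrong. A small sketch of the effect (the exact digits are made up for illustration):
import pandas as pd

# Two floats that print identically at pandas' default display precision
# but differ further out, so sort_values has a real order to apply.
s = pd.Series([2.87508112, 2.87508109])
print(s)                # both display as 2.875081
print(s.sort_values())  # index 1 sorts first: 2.87508109 < 2.87508112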

Applying a function to a column in a pandas dataframe

So I have a function replaceMonth(string), which is just a series of if statements that returns a string derived from a column in a pandas dataframe. Then I need to replace the original string with the derived one.
The dataframe is defined like this:
Index  ID      Year  DSFS         DrugCount
0      111111  Y1    3- 4 months  1
There are around 80K rows in the dataframe. What I need to do is to replace what is in column DSFS with the result from the replaceMonth(string) function.
So if, for example, the value in the first row of DSFS was '3- 4 months', running that string through replaceMonth() would give me '_3_4' as the return value. Then I need to change the value in the dataframe from '3- 4 months' to '_3_4'.
I've been trying to use apply on the dataframe but I'm either getting the syntax wrong or not understanding what it's doing correctly, like this:
dataframe['DSFS'].apply(replaceMonth(dataframe['DSFS']))
That doesn't ring right to me, but I'm not sure where I'm messing up. I'm fairly new to Python, so it's probably the syntax. :)
Any help is greatly appreciated!
When you call apply, you pass the function that you want applied to each element.
Try:
dataframe['DSFS'].apply(replaceMonth)
Reassign to the column to preserve the changes:
dataframe['DSFS'] = dataframe['DSFS'].apply(replaceMonth)
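A self-contained sketch, with a stand-in replaceMonth (the real one is described in the question as a series of if statements mapping strings like '3- 4 months' to '_3_4'):
import pandas as pd

def replaceMonth(s):
    # Stand-in for the real series of if statements.
    if s == "3- 4 months":
        return "_3_4"
    return s

dataframe = pd.DataFrame({"DSFS": ["3- 4 months"]})
dataframe["DSFS"] = dataframe["DSFS"].apply(replaceMonth)
print(dataframe)  # DSFS now holds '_3_4'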
