I have a csv file with 7000 rows and 5 cols.
I have an array of 5000 words that I want to add to the same CSV file in a new column. I added a column 'originalWord' and used pd.Series, which added the 5000 words in a single column as I wanted.
allWords = ['x'] * 5000          # a list of 5000 words
df['originalWord'] = pd.Series(allWords)
My problem now is that I want to get the data in the column 'originalWord' - whether by putting it in an array or accessing the column directly - even though it is only 5000 rows while the file has 7000 rows (with the last 2000 being null values).
print(len(df['originalWord']))
7000
Any idea how to make it reflect the original length 5000 ? Thank you.
If I understand you correctly, what you're asking for isn't possible. From what I can gather, you have a DataFrame that has 7000 rows and 5 columns, meaning that the index is of size 7000. To this DataFrame, you would like to add a column that has 5000 rows. Since there are in total 7000 rows in the DataFrame, the appended column will have 2000 missing values that would thus be assigned NaN. That's why you see the length as 7000.
In short, there is no way of accessing df['originalWord'] that automatically excludes all missing values, as even that Series has an index of size 7000. The closest you could get is to write a function that includes dropna(), if the issue is that you find it bothersome to call it repeatedly.
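For illustration, a minimal self-contained sketch of that idea, scaled down to 7 rows and 5 words (the column name 'a' is just a placeholder):

import pandas as pd

# scaled-down stand-in: 7 rows in the frame, 5 real words
df = pd.DataFrame({'a': range(7)})
df['originalWord'] = pd.Series(['w1', 'w2', 'w3', 'w4', 'w5'])

words = df['originalWord'].dropna()          # excludes the trailing NaN rows
print(len(df['originalWord']), len(words))   # 7 5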
My code currently looks like this:
df1 = pd.DataFrame(statsTableList)
df2 = pd.read_csv('StatTracker.csv')
result = pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
I get an error and I'm not sure why.
The goal of my program is to pull data from an API and then write it all to a file for analyzing. df1 is, let's say, the first 100 games written to the csv file as the first version. df2 is me reading those first 100 games back the second time around and comparing them against df1 (the new data, the next 100 games) to check for duplicates and delete them.
The part that is not working is the drop-duplicates part. It gives me an "unhashable type: list" error; I would assume that's because the two dataframes are built from lists of dictionaries. The goal is to pull 100 games of data, then pull the next 50, but if I pull game 100 again, drop that one and only add games 101-150, and then add it all to my csv file. Then if I run it again, pull games 150-200 but drop 150 if it's a duplicate, and so on.
Based on your explanation, you can use this one-liner to find the values in df1 that are not in df2:
df_diff = df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
This code checks whether each row of df1 exists in the other dataframe. To do the comparison it converts each row to a tuple (applying the tuple conversion along axis 1, i.e. row-wise).
This solution is admittedly slow because it compares each row of df1 against all rows of df2, so it has O(n^2) time complexity.
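For a rough, self-contained illustration of the tuple/isin approach on toy data (the column names here are made up, not your StatTracker fields):

import pandas as pd

# toy stand-in for the new API pull (df1) and the games already in the csv (df2)
df1 = pd.DataFrame({'game_id': [99, 100, 101, 102], 'score': [3, 1, 2, 5]})
df2 = pd.DataFrame({'game_id': [98, 99, 100], 'score': [4, 3, 1]})

# keep only the rows of df1 that do not already appear (as a whole row) in df2
df_diff = df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]

# append just the new games to what was already saved
result = pd.concat([df2, df_diff], ignore_index=True)
print(result)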
If you want a more optimised version, try the built-in pandas compare method (note that it requires the two DataFrames to have identical labels and shape):
df1.compare(df2)
In the picture I plot the values from an array of shape (400,8)
I wish to reorganize the points in order to get 8 series of "continuous" points. Let's call them a(t), b(t), ..., h(t), with a(t) being the series with the smallest values and h(t) the series with the largest values. They are unknown and I am trying to obtain them.
I have some missing values replaced by 0.
When there is a 0, I do not know which series it belongs to. The zeros are always stored at the highest indices of the array.
For instance at time t=136 I have only 4 values that are valid. Then array[t,i] > 0 for i <=3 and array[t,i] = 0 for i > 3
How can I cluster the points so that I get "continuous" time series, i.e. at time t=136, array[136,0] should go into d, array[136,1] into e, array[136,2] into f and array[136,3] into g?
I tried AgglomerativeClustering and DBSCAN with scikit-learn with no success.
Data are available at https://drive.google.com/file/d/1DKgx95FAqAIlabq77F9f-5vO-WPj7Puw/view?usp=sharing
My interpretation is that you mean that you have the data in 400 columns and 8 rows. The data values are assigned to the correct columns, but not necessarily to the correct rows. Your figure shows that the 8 signals do not cross each other, so you should be able to simply sort each column individually. But now the missing data is the problem, because the zeros representing missing data will all sort to the bottom rows, forcing the real data into the wrong rows.
I don't know if this is a good answer, but my first hunch is to start by sorting each column individually, then to begin in a place where several adjacent columns have full spans of real data and work away from that location, first to the left and then to the right, one column at a time. If the column contains no zeros, it is OK. If it contains zeros, compute local row averages from the immediately adjacent columns, using only non-zero values (how many columns to include depends on the density of missing data and the resolution between the signals), then put each valid value in the current column into the row with the closest local row average, and put zeros in the remaining rows. How to code that depends on what you have done so far. If you are using numpy, it would be convenient to first convert the zeros to NaNs, because numpy.nanmean() will ignore the NaNs. A simplified sketch of the idea follows below.
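A rough, self-contained sketch of that idea in numpy, kept in the question's (400, 8) layout. It is simplified: it uses a symmetric window of neighbouring time steps instead of the left-then-right sweep described above, and the function name assign_to_series is made up:

import numpy as np

def assign_to_series(arr, window=3):
    # arr has shape (400, 8); zeros mark missing values
    data = arr.astype(float)
    data[data == 0] = np.nan             # zeros -> NaN so nanmean/nanargmin can ignore them
    data = np.sort(data, axis=1)         # sort each time step; NaNs end up last
    out = np.full_like(data, np.nan)

    for t in range(data.shape[0]):
        row = data[t]
        valid = row[~np.isnan(row)]
        if valid.size == data.shape[1]:  # complete time step: nothing to reassign
            out[t] = row
            continue
        # estimate the local level of each of the 8 series from neighbouring time steps
        lo, hi = max(0, t - window), min(data.shape[0], t + window + 1)
        local_avg = np.nanmean(data[lo:hi], axis=0)
        for v in valid:
            k = np.nanargmin(np.abs(local_avg - v))   # closest series (collisions not handled)
            out[t, k] = v
    return out

# usage: series_data = assign_to_series(array)   # 'array' is the (400, 8) array from the question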
So I have a dataframe with 300 thousand rows and 12 columns, including one column called value. Basically all the values in this column are of a type that can be converted to float, except for a row or two containing the headers, which found their way in there as a result of the consolidation process. How can I find this troublesome row?
Thank you!
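One way to locate such rows (a minimal sketch on toy data, assuming the DataFrame is called df and the column is named value) is to coerce the column to numeric and look at the entries that fail to parse:

import pandas as pd

# toy data: one stray header string hiding in an otherwise numeric column
df = pd.DataFrame({'value': ['1.5', '2.0', 'value', '3.25']})

# coerce non-numeric entries to NaN and show the offending rows
bad_rows = df[pd.to_numeric(df['value'], errors='coerce').isna() & df['value'].notna()]
print(bad_rows)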
I am adding a pandas series as a new column in a dataframe and observed some unintended changes.
X_test_col["Valor real ppna"]=y_test_pd
print (np.all(y_test_pd.iloc[0].dtype=="float64"), np.all(X_test_col["Valor real ppna"].dtype=="float64"))
Output
True True
Then
print (len(y_test_pd), len(X_test_col["Valor real ppna"]))
Output
9000 9000
And then
print (round(X_test_col["Valor real ppna"].count()))
Output
256
Finally
print (round(y_test_pd.count()))
Output:
ppna (kg/ha.dia) 9000
dtype: int64
As you can see, the column X_test_col["Valor real ppna"] assigned as a new column effectively appears to have the same length and the same dtype as the original series.
BUT when I try to use the values, it appears that they are 256 and NOT 9000 as I expected.
I have been working on this problem for some hours, trying to set the dtype and so on, and I need some help.
What should I do to correctly assign the 9000 values to the new dataframe column?
FYI: the y_test_pd comes from the sklearn train_test_split() function.
In order to explain better, I have exported the dataframes to Excel.
This is the original dataset:
This is the column to add
And this is the desired output:
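From the symptoms (length 9000 but a count() of 256), the usual culprit is index alignment: pandas assigns a Series or DataFrame column by matching index labels, and after train_test_split the indices of X_test_col and y_test_pd are shuffled differently, so only the labels that happen to coincide get values and the rest become NaN. A small self-contained sketch of the effect and of the usual fix, assigning by position instead of by label:

import pandas as pd

# toy demonstration of the alignment effect
X = pd.DataFrame({'a': [10, 20, 30]}, index=[7, 2, 5])   # shuffled index, as after train_test_split
y = pd.Series([0.1, 0.2, 0.3], index=[3, 2, 9])          # a different shuffled index

X['target'] = y                  # assignment aligns on index labels: only label 2 matches
print(X['target'].count())       # 1, not 3

X['target'] = y.to_numpy()       # positional assignment: all 3 values land
print(X['target'].count())       # 3

# for the question's variables, the equivalent would be something like:
# X_test_col["Valor real ppna"] = y_test_pd.to_numpy().ravel()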
I'm trying to add two columns and display their total in a new column, and also to compute the following:
The total sum of sales in the month of Jan
The minimum sales amount in the month of Feb
The average (mean) sales for the month of Mar
I am also trying to create a data frame called d2 that only contains the rows of data in d that don't have any missing (NaN) values.
I have implemented the following code
import pandas as pd
new_val = pd.read_csv("/Users/mayur/574_repos_2019/ml-python-class/assignments/data/assg-01-data.csv")
new_val['total'] = 'total'
new_val.to_csv('output.csv', index=False)
display(new_val)
d.head(5)  # it's not showing the top lines of the .csv data
# .CSV file sample data
#account name street city state postal-code Jan Feb Mar total
#0118 Kerl, 3St . Waily Texas 28752.0 10000 62000 35000 total
#0118 mkrt, 1Wst. con Texas 22751.0 12000 88200 15000 total
It's giving me the word 'total' instead of an actual total.
When you used new_val['total'] = 'total' you basically told pandas that you want a column in your DataFrame called total where every value is the string 'total'.
What you want to fix is the value assignment. For this I can give you a quick and dirty solution that will hopefully make a more appealing solution clearer to you.
You can iterate through your DataFrame and add the two columns to get the variable for the third.
for i, row in new_val.iterrows():
    new_val.loc[i, 'total'] = row['Jan'] + row['Feb'] + row['Mar']
Note that this requires the total column to already exist (which it does in your code). It also iterates through your entire data set, so if your data set is large this is not the best option.
As mentioned by @Cavenfish, new_val['total'] = 'total' creates a column total where the value of every cell is the string 'total'.
You should rather use new_val['total'] = new_val['Jan']+new_val['Feb']+new_val['Mar']
For treatment of NA values you can use the mask new_val.isna(), which generates a boolean for every cell indicating whether it is NA. You can then apply any logic on top of it. For your example, the below should work:
new_val.isna().sum(axis=1) == 0
This counts the missing cells per row and is True only for rows that contain no NA at all, so you can use it as a mask on new_val to build d2 and drop every row that has a missing value in one of its columns.
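Putting it together, a self-contained sketch of the vectorized approach on toy data (the toy frame only stands in for assg-01-data.csv; the aggregates follow the requirements stated in the question):

import pandas as pd

# toy data standing in for assg-01-data.csv
d = pd.DataFrame({
    'account': ['0118', '0119', '0120'],
    'Jan': [10000, 12000, None],
    'Feb': [62000, 88200, 4000],
    'Mar': [35000, 15000, 7000],
})

d['total'] = d['Jan'] + d['Feb'] + d['Mar']   # row-wise sum, not the string 'total'

print(d['Jan'].sum())    # total sales in January
print(d['Feb'].min())    # minimum sales in February
print(d['Mar'].mean())   # average sales in March

d2 = d[d.isna().sum(axis=1) == 0]             # rows without any missing values (same as d.dropna())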