I have a dataframe whose length does not correspond - python

This is my dataframe, but in the output the number of rows does not correspond to the length. I am so confused.
df['clean']
113 apc started it so lets finish what they started
235 upon all these votes from katsina apc governm...
1796 when two or more people are contesting for an...
1798 deji said peter obi is jumping from church t...
1850 before amnesia set in this was you and lemme s...
...
378726 nan
378727 nan
378728 nan
378729 nan
378730 nan
Name: clean, Length: 63664, dtype: object

Your left column is the index. It doesn't have to count your rows; it can be basically anything, even a DatetimeIndex. If you want the index to count the rows, you can use df.reset_index(drop=True). Otherwise, please provide your code from the declaration of df to this point.
With a Minimal, Reproducible Example, someone could tell you where you changed the index, unless you imported df that way.
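For illustration, a minimal sketch (with a made-up frame standing in for df) of how filtering keeps the original labels, so the index no longer matches the row count, and how reset_index(drop=True) renumbers it:
import pandas as pd

# Hypothetical stand-in for df: after filtering, the remaining rows keep their old labels.
df = pd.DataFrame({'clean': ['a', 'b', 'c', 'd', 'e']})
filtered = df[df['clean'] != 'c']        # index is now 0, 1, 3, 4

print(len(filtered))                     # 4 -> the real number of rows
print(filtered.index[-1])                # 4 -> the last label, not a row count

filtered = filtered.reset_index(drop=True)
print(filtered.index.tolist())           # [0, 1, 2, 3]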

Related

Pandas's dataframe becoming empty after removing empty rows

I have the following data set:
Survived Not Survived
0 NaN 22.0
1 38.0 NaN
2 26.0 NaN
3 35.0 NaN
4 NaN 35.0
.. ... ...
886 NaN 27.0
887 19.0 NaN
888 NaN NaN
889 26.0 NaN
890 NaN 32.0
I want to remove all the rows which contain NaN, so I wrote the following code (the dataset's name is titanic_feature_data):
titanic_feature_data = titanic_feature_data.dropna()
And when I try to display the new dataset I get the following result:
Empty DataFrame
Columns: [Survived, Not Survived]
Index: []
What's the problem, and how can I fix it?
By using titanic_feature_data.dropna(), you are removing all rows with at least one missing value. From the data you printed in your question, it looks like every row contains at least one missing value. If that is the case, it makes total sense that your dataframe is empty after dropna(), right?
Having said that, perhaps you are looking to drop rows that have a missing value for one particular column, for example column Not Survived. Then you could use:
titanic_feature_data.dropna(subset=['Not Survived'])
Also, if you are confused about why certain rows are dropped, I recommend checking for missing values explicitly first, without dropping them. That way you can see which instances would have been dropped:
incomplete_rows = titanic_feature_data.isnull().any(axis=1)
incomplete_rows is a boolean series, which indicates whether a row contains any missing value or not. You can use this series to subset your dataframe and see which rows contain missing values (presumably all of them, given your example):
titanic_feature_data.loc[incomplete_rows, :]
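As a quick sanity check before dropping anything, here is a small sketch with a stand-in frame (the numbers below are just the first few rows of your printout, not your real data):
import numpy as np
import pandas as pd

titanic_feature_data = pd.DataFrame({
    'Survived':     [np.nan, 38.0, 26.0, 35.0, np.nan],
    'Not Survived': [22.0, np.nan, np.nan, np.nan, 35.0],
})

print(titanic_feature_data.isnull().sum())                    # missing values per column
print(titanic_feature_data.dropna())                          # empty: every row has a NaN
print(titanic_feature_data.dropna(subset=['Not Survived']))   # keeps rows 0 and 4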

Reorganize Dataframe to Multi Index

I have the following dataframe after I appended the data from different sources of files:
Owed Due Date
Input NaN 51.83 08012019
Net NaN 35.91 08012019
Output NaN -49.02 08012019
Total -1.26 38.72 08012019
Input NaN 58.43 09012019
Net NaN 9.15 09012019
Output NaN -57.08 09012019
Total -3.48 10.50 09012019
Input NaN 66.50 10012019
Net NaN 9.64 10012019
Output NaN -64.70 10012019
Total -5.16 11.44 10012019
I have been trying to figure out how to reorganize this dataframe into a multi-index layout like this:
I have tried to use melt and pivot, but with limited success in reshaping anything. I would appreciate some guidance!
P.S.: When using print(df), the date shows DD for the day (e.g. 08). However, if I write this to a csv file, it becomes 8 instead of 08 for single-digit days. Hope someone can guide me on this too, thanks.
Here you go:
df.set_index('Date', append=True).unstack(0).dropna(axis=1)
set_index() moves Date to become an additional index level. Then unstack(0) moves the original index labels to become column names. Finally, drop the NaN columns and you have your desired result.
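For reference, a runnable sketch on a cut-down version of the frame above (assuming the repeated Input/Net/Output/Total labels are the index, as in your printout):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Owed': [np.nan, np.nan, np.nan, -1.26, np.nan, np.nan, np.nan, -3.48],
     'Due':  [51.83, 35.91, -49.02, 38.72, 58.43, 9.15, -57.08, 10.50],
     'Date': ['08012019'] * 4 + ['09012019'] * 4},
    index=['Input', 'Net', 'Output', 'Total'] * 2)

out = df.set_index('Date', append=True).unstack(0).dropna(axis=1)
print(out)
The result has Date as the index and a two-level column index: the all-NaN Owed columns for Input, Net and Output are dropped, leaving ('Owed', 'Total') plus the four Due columns.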

Pandas - finding anomaly in paired column values in large Dataframe

I've been banging my head against a wall on this for a couple of hours, and would appreciate any help I could get.
I'm working with a large data set (over 270,000 rows), and am trying to find an anomaly within two columns that should have paired values.
From the snippet of output below, I'm looking at the Alcohol_Category_ID and Alcohol_Category_Name columns. The ID column has a numeric string value that should pair up 1:1 with a string descriptor in the Name column (e.g., "1031100.0" == "100 PROOF VODKA").
As you can see, both columns have the same count of non-null values. However, there are 72 unique IDs and only 71 unique Names. I take this to mean that one Name is incorrectly associated with two different IDs.
County Alcohol_Category_ID Alcohol_Category_Name Vendor_Number \
count 269843 270288 270288 270920
unique 99 72 71 116
top Polk 1031080.0 VODKA 80 PROOF 260
freq 49092 35366 35366 46825
first NaN NaN NaN NaN
last NaN NaN NaN NaN
mean NaN NaN NaN NaN
std NaN NaN NaN NaN
min NaN NaN NaN NaN
25% NaN NaN NaN NaN
50% NaN NaN NaN NaN
75% NaN NaN NaN NaN
max NaN NaN NaN NaN
My trouble is in actually isolating out where this duplication is occurring so that I can hopefully replace the erroneous ID with its correct value. I am having a dog of a time with this.
My dataframe is named i_a.
I've been trying to examine the pairings of values between these two columns with groupby and count statements like this:
i_a.groupby(["Alcohol_Category_Name", "Alcohol_Category_ID"]).Alcohol_Category_ID.count()
However, I'm not sure how to whittle it down from there. And there are too many pairings to make this easy to do visually.
Can someone recommend a way to isolate out the Alcohol_Category_Name associated with more than one Alcohol_Category_ID?
Thank you so much for your consideration!
EDIT: After considering the advice of Dmitry, I found the solution by continually paring down duplicates until I homed in on the value of interest, like so:
#Finding all unique pairings of Category IDs and Names
subset = i_a.drop_duplicates(["Alcohol_Category_Name", "Alcohol_Category_ID"])
#Now, determine which of the category names appears more than once (thus paired with more than one ID)
subset[subset["Alcohol_Category_Name"].duplicated()]
Thank you so much for your help. It seems really obvious in retrospect, but I could not figure it out for the life of me.
I think this snippet meets your needs:
> df = pd.DataFrame({'a':[1,2,3,1,2,3], 'b':[1,2,1,1,2,1]})
So df.a has 3 unique values mapping to 2 uniques in df.b.
> df.groupby('b')['a'].nunique()
b
1 2
2 1
That shows that df.b=1 maps to 2 uniques in a (and that df.b=2 maps to only 1).
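Applied to the columns from the question, the same idea isolates the Name tied to more than one ID. A sketch with a tiny made-up stand-in for i_a (the duplicate pairing below is hypothetical):
import pandas as pd

i_a = pd.DataFrame({
    'Alcohol_Category_ID':   ['1031080.0', '1031080.0', '1031100.0', '1031090.0'],
    'Alcohol_Category_Name': ['VODKA 80 PROOF', 'VODKA 80 PROOF',
                              '100 PROOF VODKA', 'VODKA 80 PROOF'],
})

ids_per_name = i_a.groupby('Alcohol_Category_Name')['Alcohol_Category_ID'].nunique()
bad_names = ids_per_name[ids_per_name > 1].index     # names mapped to more than one ID

# All (Name, ID) pairings involved, so the wrong ID can be corrected
print(i_a.loc[i_a['Alcohol_Category_Name'].isin(bad_names),
              ['Alcohol_Category_Name', 'Alcohol_Category_ID']].drop_duplicates())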

Adjusting Monthly Time Series Data in Pandas

I have a pandas DataFrame like this.
As you can see, the data corresponds to end-of-month data. The problem is that the end-of-month date is not the same for all the columns. (The underlying reason is that the last trading day of the month does not always coincide with the end of the month.)
Currently, the end of January 2016 has two rows, "2016-01-29" and "2016-01-31." It should be just one row. For example, the end of January 2016 should just be 451.1473 1951.218 1401.093 for Index A, Index B and Index C.
Another point is that even though each row almost always corresponds to end-of-month data, the data might not be nice enough and can conceivably include middle-of-the-month data for a random column. In that case, I don't want to make any adjustment, so that any prior data-collection error would be caught.
What is the most efficient way to achieve this goal?
EDIT:
Index A Index B Index C
DATE
2015-03-31 2067.89 1535.07 229.1
2015-04-30 2085.51 1543 229.4
2015-05-29 2107.39 NaN NaN
2015-05-31 NaN 1550.39 229.1
2015-06-30 2063.11 1534.96 229
2015-07-31 2103.84 NaN 228.8
2015-08-31 1972.18 1464.32 NaN
2015-09-30 1920.03 1416.84 227.5
2015-10-30 2079.36 NaN NaN
2015-10-31 NaN 1448.39 227.7
2015-11-30 2080.41 1421.6 227.6
2015-12-31 2043.94 1408.33 227.5
2016-01-29 1940.24 NaN NaN
2016-01-31 NaN 1354.66 227.5
2016-02-29 1932.23 1355.42 227.3
So, in this case, I need to combine the rows at the end of 2015-05, 2015-10 and 2016-01. However, the rows at 2015-07 and 2015-08 simply do not have data. In that case, I would like to leave 2015-07 and 2015-08 as NaN, while merging the end-of-month rows at 2015-05, 2015-10 and 2016-01. Hopefully, this provides more insight into what I am trying to do.
You can use:
df = df.groupby(pd.Grouper(freq='M')).ffill()
df = df.resample('M').last()
to create a new DatetimeIndex ending on the last day of each month and take the last available data point for each month. The per-month forward fill (ffill) ensures that, for columns missing data on the last available date of a month, you use the prior available value from that month.
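A sketch on a cut-down version of the posted data (just the two split months), assuming a DatetimeIndex:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Index A': [2107.39, np.nan, 2079.36, np.nan],
     'Index B': [np.nan, 1550.39, np.nan, 1448.39]},
    index=pd.to_datetime(['2015-05-29', '2015-05-31', '2015-10-30', '2015-10-31']))

filled = df.groupby(pd.Grouper(freq='M')).ffill()   # fill within each calendar month
monthly = filled.resample('M').last()               # one row per month-end
print(monthly)
The split rows for 2015-05 and 2015-10 collapse into single month-end rows; months with no data at all (June through September here) stay as NaN rather than being filled, which matches the requirement of not papering over genuinely missing months.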

Combining two columns from two dataframes; same indices but different lengths

Please be advised, I am a beginning programmer and a beginning python/pandas user. I'm a behavioral scientist and learning to use pandas to process and organize my data. As a result, some of this might seem completely obvious and it may seem like a question not worthy of the forum. Please have tolerance! To me, this is days of work, and I have indeed spent hours trying to figure out the answer to this question already. Thanks in advance for any help.
My data look like this. The "real" Actor and Recipient data are always 5-digit numbers, and the "Behavior" data are always letter codes. My problem is that I also use this format for special lines, denoted by markers like "date" or "s" in the Actor column. These markers indicate that the "Behavior" column holds this special type of data, and not actual Behavior data. So, I want to replace the markers in the Actor column with NaN values, and grab the special data from the behavior column to put in another column (in this example, the empty Activity column).
follow Activity Actor Behavior Recipient1
0 1 NaN date 2.1.3.2012 NaN
1 1 NaN s ss.hx NaN
2 1 NaN 50505 vo 51608
3 1 NaN 51608 vr 50505
4 1 NaN s ss.he NaN
So far, I have written some code in pandas to select out the "s" lines into a new dataframe:
def get_act_line(group):
    return group.loc[group.Actor == 's']
result = trimdata.groupby('follow').apply(get_act_line)
I've copied over the Behavior column in this dataframe to the Activity column, and replaced the Actor and Behavior values with NaN:
result.Activity = result.Behavior
result.Behavior = np.nan
result.Actor = np.nan
result.head()
So my new dataframe looks like this:
follow follow Activity Actor Behavior Recipient1
1 2 1 ss.hx NaN NaN NaN
34 1 hf.xa NaN NaN f.53702
74 1 hf.fe NaN NaN NaN
10 1287 10 ss.hf NaN NaN db
1335 10 fe NaN NaN db
What I would like to do now is to combine this dataframe with the original, replacing all of the values in these selected rows, but maintaining values for the other rows in the original dataframe.
This may seem like a simple question with an obvious solution, or perhaps I have gone about it all wrong to begin with!
I've worked through Wes McKinney's book, I've read the documentation on different types of merges, mapping, joining, transformations, concatenations, etc. I have browsed the forums and have not found an answer that helps me to figure this out. Your help will be very much appreciated.
One way you can do this (though there may be more optimal or elegant ways) is:
mask = (df['Actor'] == 's')
df['Activity'] = df[mask]['Behavior']
df.loc[mask, 'Behavior'] = np.nan
where df is your original dataframe. This should return (my column order is slightly different):
Activity Actor Behavior Recipient1 follow
0 NaN date 2013-04-01 00:00:00 NaN 1
1 ss.hx NaN ss.hx NaN 1
2 NaN 50505 vo 51608 1
3 NaN 51608 vr 50505 1
4 ss.he NaN ss.hx NaN 1
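Putting it together on the sample rows from the question, a sketch that also blanks the Actor markers and handles both 's' and 'date' lines (the question asks for that; the snippet above handles only 's' and leaves Actor in place):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'follow':     [1, 1, 1, 1, 1],
    'Activity':   pd.Series([np.nan] * 5, dtype=object),   # empty, will hold strings
    'Actor':      ['date', 's', '50505', '51608', 's'],
    'Behavior':   ['2.1.3.2012', 'ss.hx', 'vo', 'vr', 'ss.he'],
    'Recipient1': [np.nan, np.nan, '51608', '50505', np.nan],
})

mask = df['Actor'].isin(['s', 'date'])               # special marker rows
df.loc[mask, 'Activity'] = df.loc[mask, 'Behavior']  # move the special data over
df.loc[mask, ['Actor', 'Behavior']] = np.nan         # blank out the markers
print(df)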
References:
Explanation of df.ix (since superseded by df.loc) in another SO post.
