Pandas - Add column containing metadata about the row - python

I want to add a column to a Dataframe that will contain a number derived from the number of NaN values in the row, specifically: one less than the number of non-NaN values in the row.
I tried:
for index, row in df.iterrows():
    count = row.value_counts()
    val = sum(count) - 1
    df['Num Hits'] = val
Which returns an error:
-c:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
and puts the first val value into every cell of the new column. I've tried reading about .loc and indexing in the Pandas documentation and failed to make sense of it. I gather that .loc wants a row_index and a column_index but I don't know if these are pre-defined in every dataframe and I just have to specify them somehow or if I need to "set" an index on the dataframe somehow before telling the loop where to place the new value, val.

You can totally do it in a vectorized way without using a loop, which is likely to be faster than the loop version:
In [89]:
print df
0 1 2 3
0 0.835396 0.330275 0.786579 0.493567
1 0.751678 0.299354 0.050638 0.483490
2 0.559348 0.106477 0.807911 0.883195
3 0.250296 0.281871 0.439523 0.117846
4 0.480055 0.269579 0.282295 0.170642
In [90]:
# number of valid (non-NaN) values - 1; assumes numpy is imported as np
df.apply(lambda x: np.isfinite(x).sum() - 1, axis=1)
Out[90]:
0 3
1 3
2 3
3 3
4 3
dtype: int64
@DSM brought up a good point that the above solution is still not fully vectorized. A fully vectorized form is simply (~df.isnull()).sum(axis=1) - 1.
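A minimal, self-contained sketch of the vectorized approach, using a small made-up frame with a few NaNs:

```python
import numpy as np
import pandas as pd

# Toy frame with some NaNs (hypothetical data for illustration).
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [4.0, 5.0, np.nan],
                   "c": [7.0, 8.0, 9.0]})

# Count non-NaN values per row, minus one; no Python-level loop.
df["Num Hits"] = df.notnull().sum(axis=1) - 1
```

`df.count(axis=1) - 1` (computed before adding the new column) is an equivalent spelling.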

You can use the index variable that you define as part of the for loop as the row_index that .loc is looking for:
for index, row in df.iterrows():
    count = row.value_counts()
    val = sum(count) - 1
    df.loc[index, 'Num Hits'] = val

Related

Python Return all Columns [duplicate]

Using Python Pandas I am trying to find the Country & Place with the maximum value.
This returns the maximum value:
data.groupby(['Country','Place'])['Value'].max()
But how do I get the corresponding Country and Place name?
Assuming df has a unique index, this gives the row with the maximum value:
In [34]: df.loc[df['Value'].idxmax()]
Out[34]:
Country US
Place Kansas
Value 894
Name: 7
Note that idxmax returns index labels. So if the DataFrame has duplicates in the index, the label may not uniquely identify the row, so df.loc may return more than one row.
Therefore, if df does not have a unique index, you must make the index unique before proceeding as above. Depending on the DataFrame, sometimes you can use stack or set_index to make the index unique. Or, you can simply reset the index (so the rows become renumbered, starting at 0):
df = df.reset_index()
df[df['Value']==df['Value'].max()]
This will return the entire row with max value
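With a small made-up frame, the idxmax route looks like this (data is hypothetical, chosen to match the sample output above):

```python
import pandas as pd

df = pd.DataFrame({"Country": ["US", "UK", "US"],
                   "Place": ["Kansas", "London", "NewYork"],
                   "Value": [894, 778, 562]})

# idxmax returns the index *label* of the maximum; .loc looks that row up.
row = df.loc[df["Value"].idxmax()]
```

`row` is a Series holding the whole winning row, so `row["Country"]` and `row["Place"]` give the names you want.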
I think the easiest way to return the row with the maximum value is to get its position. On modern pandas, Series.argmax() returns the integer position of the largest value (older versions treated it as an alias of idxmax and returned the label):
index = df.Value.argmax()
Now that position can be used to get the features of that particular row:
df.iloc[df.Value.argmax(), 0:2]
The country and place are the index of the series; if you don't need the index, you can set as_index=False:
df.groupby(['country','place'], as_index=False)['value'].max()
Edit:
It seems that you want the place with the max value for every country; the following code will do what you want (irow is long deprecated, so use iloc instead):
df.groupby("country").apply(lambda df: df.iloc[df.value.argmax()])
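A sketch of the same per-country lookup that works on modern pandas, using idxmax on the grouped column instead of apply (data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "US", "UK"],
                   "place": ["Kansas", "NewYork", "London"],
                   "value": [894, 562, 778]})

# idxmax per group gives the index label of each group's maximum;
# .loc then pulls those full rows out in one shot.
out = df.loc[df.groupby("country")["value"].idxmax()]
```

The result keeps one row per country, carrying along all the other columns.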
Use the index attribute of DataFrame. Note that I don't type all the rows in the example.
In [14]: df = data.groupby(['Country','Place'])['Value'].max()
In [15]: df.index
Out[15]:
MultiIndex
[('Spain', 'Manchester'), ('UK', 'London'), ('US', 'Mchigan'), ('US', 'NewYork')]
In [16]: df.index[0]
Out[16]: ('Spain', 'Manchester')
In [17]: df.index[1]
Out[17]: ('UK', 'London')
You can also get the value by that index:
In [21]: for index in df.index:
   ....:     print index, df[index]
   ....:
('Spain', 'Manchester') 512
('UK', 'London') 778
('US', 'Mchigan') 854
('US', 'NewYork') 562
Edit
Sorry for misunderstanding what you want; try the following:
In [52]: s=data.max()
In [53]: print '%s, %s, %s' % (s['Country'], s['Place'], s['Value'])
US, NewYork, 854
In order to print the Country and Place with maximum value, use the following line of code.
print(df[['Country', 'Place']][df.Value == df.Value.max()])
You can use:
print(df[df['Value']==df['Value'].max()])
Using DataFrame.nlargest.
The dedicated method for this is nlargest, which uses algorithm.SelectNFrame under the hood; it is a performant way of doing sort_values().head(n).
x y a b
0 1 2 a x
1 2 4 b x
2 3 6 c y
3 4 1 a z
4 5 2 b z
5 6 3 c z
df.nlargest(1, 'y')
x y a b
2 3 6 c y
import pandas as pd
Assuming df is the DataFrame you created, use:
df1 = df[['Country','Place']][df.Value == df['Value'].max()]
This will display the country and place whose value is maximum.
My solution for finding maximum values in columns (.ix is deprecated, so use .loc):
df.loc[df.idxmax()]
and likewise for minimums:
df.loc[df.idxmin()]
I'd recommend using nlargest for better performance and shorter code:
df[col_name].value_counts().nlargest(n=1)
I encountered a similar error while trying to import data using pandas. The first column of my dataset had spaces before the start of the words. I removed the spaces and it worked like a charm!

python for loop using index to create values in dataframe

I have a very simple for loop problem and I haven't found a solution in any of the similar questions on Stack. I want to use a for loop to create values in a pandas dataframe. I want the values to be strings that contain a numerical index. I can make the correct value print, but I can't make this value get saved in the dataframe. I'm new to python.
# reproducible example
import pandas as pd
df1 = pd.DataFrame({'x':range(5)})
# for loop to add a row with an index
for i in range(5):
    print("data_{i}.txt".format(i=i))  # this prints the value that I want
    df1['file'] = "data_{i}.txt".format(i=i)
This loop prints the exact value that I want to put into the 'file' column of df1, but when I look at df1, it only uses the last value for the index.
x file
0 0 data_4.txt
1 1 data_4.txt
2 2 data_4.txt
3 3 data_4.txt
4 4 data_4.txt
I have tried using enumerate, but can't find a solution with this. I assume everyone will yell at me for posting a duplicate question, but I have not found anything that works and if someone points me to a solution that solves this problem, I'll happily remove this question.
There are better ways to create a DataFrame, but to answer your question:
Replace the last line in your code:
df1['file'] = "data_{i}.txt".format(i=i)
with:
df1.loc[i, 'file'] = "data_{0}.txt".format(i)
For more information, read about the .loc here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
On the same page, you can read about accessors like .at and .iloc as well.
You can do list-comprehension:
df1['file'] = ["data_{i}.txt".format(i=i) for i in range(5)]
print(df1)
Prints:
x file
0 0 data_0.txt
1 1 data_1.txt
2 2 data_2.txt
3 3 data_3.txt
4 4 data_4.txt
OR at the creating of DataFrame:
df1 = pd.DataFrame({'x':range(5), 'file': ["data_{i}.txt".format(i=i) for i in range(5)]})
print(df1)
OR:
df1 = pd.DataFrame([{'x':i, 'file': "data_{i}.txt".format(i=i)} for i in range(5)])
print(df1)
I've found success with the .at method
for i in range(5):
    print("data_{i}.txt".format(i=i))  # this prints the value that I want
    df1.at[i, 'file'] = "data_{i}.txt".format(i=i)
Returns:
x file
0 0 data_0.txt
1 1 data_1.txt
2 2 data_2.txt
3 3 data_3.txt
4 4 data_4.txt
When you assign a scalar to a DataFrame column the way you do, with df['colname'] = val, pandas broadcasts that single value across all rows. That is why you are seeing only the last value.
Change your code to:
import pandas as pd
df1 = pd.DataFrame({'x':range(5)})
# for loop to add a row with an index
to_assign = []
for i in range(5):
    print("data_{i}.txt".format(i=i))  # this prints the value that I want
    to_assign.append("data_{i}.txt".format(i=i))
# outside of the loop - only once - assign to all dataframe rows
df1['file'] = to_assign
As a thought, pandas has a great API for performing these types of actions without for loops. You should start practicing it.
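For instance, a loop-free sketch of the same column build, mapping a format call over the default index (this assumes the default RangeIndex, as in the question's example):

```python
import pandas as pd

df1 = pd.DataFrame({"x": range(5)})

# Vectorized: format each index label into a filename, no Python loop.
df1["file"] = df1.index.map("data_{}.txt".format)
```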

how to increment each row value in round robin method till some extent of given integer

trying to split an integer value to each row in dataframe
I have a pandas dataframe with 4 rows and an integer, say 5. The end result should have 2 for the first row and 1 for the remaining 3 rows.
df=pd.DataFrame(['a','b','c','d'],columns=['name'])
df['val']=0
no=5
while no > 0:
    for row in df['val']:
        df['val'] = row + 1
        no -= 1
Eventually one count has to be taken from 'no' and added to each row in the dataframe.
i need to iterate through rows in a dataframe and increment the cell value till any said integer count.
Expected output
after 1 iteration the df will look like this and no will decrement to 4
df=pd.DataFrame(list(zip(['a','b','c','d'],[1,0,0,0])),columns=['name','val'])
after 2nd iteration df will be as below and no will decrement to 3
df=pd.DataFrame(list(zip(['a','b','c','d'],[1,1,0,0])),columns=['name','val'])
and this iteration goes till the any given integer. so for this case 4 th iteration will be as below and and no will decrement to 1
df=pd.DataFrame(list(zip(['a','b','c','d'],[1,1,1,1])),columns=['name','val'])
and the final 5th iteration will be as below and no will decrement to 0 and loop ends
df=pd.DataFrame(list(zip(['a','b','c','d'],[2,1,1,1])),columns=['name','val'])
From what I understand, you want the value 1 in each row other than the first one, and [no - (number of non-first rows)] in the first row. You can get that by:
df['col'] = 1
df.loc[0, 'col'] = no - (len(df) - 1)
It is worth mentioning that loops over rows are almost always to be avoided when working with pandas. You can do almost everything without looping over rows (and much faster).
EDIT:
This is the best way to achieve what you requested in the edited question, if you have to loop on no for some other reason:
df['val'] = 0
no = 5
for i in range(no):
    df.iloc[i % len(df), -1] += 1
But if you don't really need to loop, and just get the output in val, use:
df.loc[df.reset_index().index, 'val'] = no // len(df) + 1 * (df.reset_index().index < no % len(df))
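The same round-robin split can be written with no loop at all, via divmod: every row gets the base share, and the first `no % len(df)` rows get one extra. A sketch, assuming the default RangeIndex:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": list("abcd")})
no = 5

# base share for every row; the first `extra` rows get one more.
base, extra = divmod(no, len(df))
df["val"] = base + (np.arange(len(df)) < extra).astype(int)
```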

Assign values to a dataframe by considering values in 2 columns of different dataframe as range

The following code explains the scenario,
I have a dataframe(df_ticker) with 3 columns
import pandas as pd
df_ticker = pd.DataFrame({'Min_val': [22382.729,36919.205,46735.164,62247.61], 'Max_val': [36901.758,46716.06,62045.06,182727.05],
'Ticker':['$','$$','$$$','$$$$']})
My second dataframe contains 2 columns
df_values = pd.DataFrame({'Id':[1,2,3,4,5,6],'sal_val': [3098,45639.987,65487.4,56784.8,8,736455]})
For every value in df_values ['sal_val'], I want to check in which range it falls in df_ticker [Max_val] and df_ticker [min_val] and assign df_ticker [ticker] accordingly.
Sample output would be something like this:
In the sample output, since sal_val=3098 is greater than or equal to Min_val=22382.729 and less than or equal to Max_val=36901.758, it is assigned ticker=$.
I tried the following,
df_values['ticker']=df_ticker.\
loc[((df_values['sal_val']>=df_ticker['Min_val'])| (df_values['sal_val']<=df_ticker['Max_val']))]['Ticker']
df_values
It failed with error "ValueError: Can only compare identically-labeled Series objects"
Any solutions for this issue?
One way is to define a custom mapping function and use pd.Series.apply.
def mapper(x, t):
    if x < t['Min_val'].min():
        index = 0
    elif x >= t['Max_val'].max():
        index = -1
    else:
        index = next((idx for idx, (i, j) in enumerate(zip(t['Min_val'], t['Max_val']))
                      if i <= x < j), None)
    return t['Ticker'].iloc[index] if index is not None else None
df_values['Ticker'] = df_values['sal_val'].apply(mapper, t=df_ticker)
Result
Id sal_val Ticker
0 1 3098.000 $
1 2 45639.987 $$
2 3 65487.400 $$$$
3 4 56784.800 $$$
4 5 8.000 $
5 6 736455.000 $$$$
Explanation
pd.Series.apply accepts a custom mapping function as an input.
The mapping function takes each entry in sal_val and compares it to values in df_ticker via an if / else structure.
The first 2 if statements deal with minimum and maximum boundaries.
The final else statement uses a generator, which cycles through each row in df_ticker and finds the index of values where the input is within the range of Min_val and Max_val.
Finally, we use the index and feed it into df_ticker['Ticker'] via .iloc integer accessor.
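An alternative sketch using pd.cut, which is the idiomatic tool for binning a column into labeled ranges. It assumes the bands in df_ticker are contiguous and sorted, so the lower bounds can serve as bin edges, with open ends below the first band and above the last:

```python
import numpy as np
import pandas as pd

df_ticker = pd.DataFrame({"Min_val": [22382.729, 36919.205, 46735.164, 62247.61],
                          "Max_val": [36901.758, 46716.06, 62045.06, 182727.05],
                          "Ticker": ["$", "$$", "$$$", "$$$$"]})
df_values = pd.DataFrame({"Id": [1, 2, 3, 4, 5, 6],
                          "sal_val": [3098, 45639.987, 65487.4, 56784.8, 8, 736455]})

# Bin edges: open-ended at both extremes, lower bounds in between.
edges = [-np.inf] + list(df_ticker["Min_val"][1:]) + [np.inf]
df_values["Ticker"] = pd.cut(df_values["sal_val"], bins=edges,
                             labels=df_ticker["Ticker"], right=False)
```

pd.cut does one vectorized pass instead of a Python-level scan per value, which matters once df_values grows large.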

Python pandas an efficient way to see which column contains a value and use its coordinates as an offset

As part of trying to learn pandas I'm trying to reshape a spreadsheet. After removing non zero values I need to get some data from a single column.
For the sample columns below, I want to find the most effective way of finding the row and column index of the cell that contains the value date and get the value next to it (e.g. here it would be 38477).
In practice this would be a much bigger DataFrame and the date row could change and it may not always be in the first column.
What is the best way to find out where date is in the array and return the value in the adjacent cell?
Thanks
<bound method DataFrame.head of 0 1 2 4 5 7 8 10 \
1 some title
2 date 38477
5 cat1 cat2 cat3 cat4
6 a b c d e f g
8 Z 167.9404 151.1389 346.197 434.3589 336.7873 80.52901 269.1486
9 X 220.683 56.0029 73.73679 428.8939 483.7445 251.1877 243.7918
10 C 433.0189 390.1931 251.6636 418.6703 12.21859 113.093 136.28
12 V 226.0135 418.1141 310.2038 153.9018 425.7491 73.08073 277.5065
13 W 295.146 173.2747 2.187459 401.6453 51.47293 175.387 397.2021
14 S 306.9325 157.2772 464.1394 216.248 478.3903 173.948 328.9304
15 A 19.86611 73.11554 320.078 199.7598 467.8272 234.0331 141.5544
This really just reformats a lot of the iteration you are doing to make it clearer and take advantage of pandas ability to easily select, etc.
First, we need a dummy dataframe (with date in the last row and explicitly ordered the way you have in your setup)
import numpy as np
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3, 4, np.NaN],
                   "B": [5, 3, np.NaN, 3, "date"],
                   "C": [np.NaN, 2, 1, 3, 634]})[["A", "B", "C"]]
A clear way to do it is to find the row and then enumerate over the row to find date:
row = df[df.apply(lambda x: (x == "date").any(), axis=1)].values[0]  # will be an array
for i, val in enumerate(row):
    if val == "date":
        print(row[i + 1])
        break
If your spreadsheet only has a few non-numeric columns, you could go by column, check for date and get a row and column index (this may be faster because it searches by column rather than by row, though I'm not sure)
# gives you column labels, which are `True` if at least one entry has `date` in it
# have to check `kind`, otherwise you get an error.
col_result = df.apply(lambda x: x.dtype.kind == "O" and (x == "date").any())
# select only columns where True (this should be one entry) and get their index (for the label)
column = col_result[col_result].index[0]
col_index = df.columns.get_loc(column)
# will be True if it contains date (icol is deprecated; use iloc)
row_selector = df.iloc[:, col_index] == "date"
print(df[row_selector].iloc[:, col_index + 1].values)
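On modern pandas the same "find the cell, read its neighbour" step can be sketched with a whole-frame comparison and np.argwhere, which returns (row, column) positions directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, np.nan],
                   "B": [5, 3, np.nan, 3, "date"],
                   "C": [np.nan, 2, 1, 3, 634]})

# Boolean mask of cells equal to "date"; argwhere yields (row, col) positions.
r, c = np.argwhere(df.eq("date").values)[0]
value = df.iat[r, c + 1]  # cell immediately to the right
```

df.eq("date") compares every cell at once, so this stays a handful of vectorized operations regardless of where the marker sits.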
