I've spent a lot of time trying to insert data into a pandas DataFrame, but I just cannot get the result I expect.
There are two index levels:
1. current_time
2. company_name
After I use data.ix[] to insert a row,
the DataFrame creates another column (named after the company_name).
Can anyone give me some advice, please?
import datetime

import pandas

data = pandas.DataFrame(columns=['Date', 'Name', 'd1'])
data.set_index(['Date', 'Name'], inplace=True)
now = datetime.datetime.now()
data.ix[now, 'ACompany'] = [1]
To let pandas know that now, 'ACompany' are the levels of the index, you have to use some extra parentheses:
data.ix[(now, 'ACompany'), :] = 1
By just doing data.ix[now, 'ACompany'], pandas will by default try to interpret this as index=now, column='ACompany' (in the sense of .ix[rows, columns]).
Further, it is recommended to use .loc instead of .ix if you want to index solely by the labels (.ix has since been removed from pandas).
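Since .ix is gone in current pandas, here is a minimal sketch of the same tuple-key pattern with .loc (the dates are kept as plain strings just to keep the example small):

```python
import pandas as pd

# Build a small two-level frame
df = pd.DataFrame(
    [['2024-01-01', 'ACompany', 1]],
    columns=['Date', 'Name', 'd1'],
).set_index(['Date', 'Name'])

# A tuple key addresses one row across both index levels
print(df.loc[('2024-01-01', 'ACompany'), 'd1'])

# The same tuple form also works for setting-with-enlargement,
# so no spurious 'ACompany' column is created
df.loc[('2024-01-02', 'BCompany'), :] = 2
print(df)
```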
I have a mock pandas dataframe 'df' where I want to create a new column 'fruit_cost', and I was wondering about the easiest way to do this. The new column 'fruit_cost' will take the integer from the 'cost' column where the item type is equal to 'fruit'. What would the standard way of doing this in pandas be? Should I use conditional logic, or is there a simpler way? If anyone has any good practice tutorials for this type of thing, that would also be beneficial.
In SQL I would create it using a case:
SQL
case
when item_type = 'fruit' then cost
else 0
end
as fruit_cost
Python
import pandas as pd
list_of_customers =[
['patrick','lemon','fruit',10],
['paul','lemon','fruit',20],
['frank','lemon','fruit',10],
['jim','lemon','fruit',20],
['wendy','watermelon','fruit',39],
['greg','watermelon','fruit',32],
['wilson','carrot','vegetable',34],
['maree','carrot','vegetable',22],
['greg', '', '', None],
['wilmer','sprite','drink',22]
]
df = pd.DataFrame(list_of_customers,columns = ['customer','item','item_type','cost'])
print(df)
#create new field 'fruit_cost' (pseudo-code of what I want; this is not valid Python):
df[fruit_cost] = if df[item_type] == 'fruit':
                     df[cost]
                 else:
                     0
df["fruit_cost"] = df["cost"].where(df["item_type"] == "fruit", other=0)
Here are some solutions:
np.where (requires import numpy as np)
df['fruit_cost'] = np.where(df['item_type'] == 'fruit', df['cost'], 0)
Series.where
df['fruit_cost'] = df['cost'].where(df['item_type'] == 'fruit', 0)
There isn't really a standard since there are so many ways to do this; it's a matter of preference. I suggest you take a look at these links:
Pandas: Column that is dependent on another value
Set Pandas Conditional Column Based on Values of Another Column
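Extending the SQL CASE analogy: when a CASE has more than one branch, np.select takes parallel lists of conditions and choices, with the first matching condition winning. A small sketch (the 'adjusted_cost' column and the doubling rule for vegetables are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'item_type': ['fruit', 'vegetable', 'drink'],
    'cost': [10, 34, 22],
})

# One entry per CASE branch; the first matching condition wins
conditions = [
    df['item_type'] == 'fruit',
    df['item_type'] == 'vegetable',
]
choices = [df['cost'], df['cost'] * 2]

# default plays the role of the ELSE clause
df['adjusted_cost'] = np.select(conditions, choices, default=0)
print(df)
```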
I am quite new to Python programming.
I am working with the following dataframe:
[Before: screenshot of the dataframe]
Note that in the column "FBgn", there is a mix of FBgn and FBtr string values. I would like to replace the FBtr-containing values with the FBgn values provided in the adjacent column called "## FlyBase_FBgn". However, I want to keep the existing FBgn values in the column "FBgn". Keep in mind that I am showing only a portion of the dataframe (in reality it has 1432 rows). How would I do that? I tried the replace() method from pandas, but it did not work.
This is actually what I would like to have:
[After: screenshot of the desired dataframe]
Thanks a lot!
With pandas, you could try:
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
Welcome to Stack Overflow. Next time, please provide more info, including your code; it is always helpful.
Please see the code below; I think you need something similar.
import pandas as pd
#ignore the dict1, I just wanted to recreate your df
dict1= {"FBgn": ['FBtr389394949', 'FBgn3093840', 'FBtr000025'], "FBtr": ['FBgn546466646', '', 'FBgn15565555']}
df = pd.DataFrame(dict1) #recreating your dataframe
#print df
print(df)
#function to replace the values
def replace_values(df):
    # len(df) rather than df.size//2 (df.size counts all cells, not rows);
    # .loc avoids chained-assignment warnings
    for i in range(len(df)):
        if 'tr' in df.loc[i, 'FBgn']:
            df.loc[i, 'FBgn'] = df.loc[i, 'FBtr']
    return df
df = replace_values(df)
#print new df
print(df)
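The same replacement can also be done without a Python loop, using a boolean mask on the toy frame above:

```python
import pandas as pd

df = pd.DataFrame({
    'FBgn': ['FBtr389394949', 'FBgn3093840', 'FBtr000025'],
    'FBtr': ['FBgn546466646', '', 'FBgn15565555'],
})

# Boolean mask of the rows whose FBgn value is really an FBtr id
mask = df['FBgn'].str.contains('FBtr')

# Copy the replacement values only into the masked rows
df.loc[mask, 'FBgn'] = df.loc[mask, 'FBtr']
print(df)
```

This scales much better than a row-by-row loop on a 1432-row frame.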
My dataframe until now: [screenshot]
and I am trying to convert cols, which is a list of all columns from 0 to 188 (cols = list(hdata.columns[range(0, 188)])) in the format yyyy-mm, to a DatetimeIndex. There are a few other columns as well which are string names and can't be converted to datetime, so I tried doing this:
hdata[cols].columns = pd.to_datetime(hdata[cols].columns) #convert columns to **datetimeindex**
But this is not working.
Can you please figure out what is wrong here?
Edit:
A better way to work on this type of data is to use the Split-Apply-Combine method. (The original attempt fails because hdata[cols] returns a copy, so assigning to its .columns never modifies hdata.)
Step 1: Split off the data on which you want to perform the specific operation.
nonReqdf = hdata.iloc[:, 188:].sort_index()
reqdf = hdata.drop(['CountyName', 'Metro', 'RegionID', 'SizeRank'], axis=1)
Step 2: Do the operations. In my case that was converting the dataframe columns with year and month to a DatetimeIndex, and resampling quarterly.
reqdf.columns = pd.to_datetime(reqdf.columns)
reqdf = reqdf.resample('Q', axis=1).mean()
reqdf = reqdf.rename(columns=lambda x: str(x.to_period('Q')).lower()).sort_index()  # rename so each column is a string like 2012q1, 2012q2, ...
Step 3: Combine the two split dataframes (merge could also be used, depending on what you want).
reqdf = pd.concat([reqdf,nonReqdf],axis=1)
In order to modify some of the labels from an Index (be it for rows or columns), you need to use df.rename, as in
for i in range(188):
    df.rename({df.columns[i]: pd.to_datetime(df.columns[i])},
              axis=1, inplace=True)
Or you can avoid looping by building a full sized index to cover all the columns with
df.columns = (
pd.to_datetime(cols) # pass the list with strings to get a partial DatetimeIndex
.append(df.columns.difference(cols)) # complete the index with the rest of the columns
)
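On a toy frame, the non-loop variant looks like this (the column names are made up for illustration; note it relies on the date columns coming first, since .difference returns the remaining labels in sorted order):

```python
import pandas as pd

# Two yyyy-mm columns followed by one non-date column
df = pd.DataFrame([[1, 2, 'a'], [3, 4, 'b']],
                  columns=['2000-01', '2000-02', 'RegionName'])

date_cols = ['2000-01', '2000-02']

# Convert only the date-like labels, then append the rest unchanged
df.columns = pd.to_datetime(date_cols).append(df.columns.difference(date_cols))
print(df.columns)
```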
I have specific problem with pandas: I need to select rows in dataframe which start with specific letters.
Details: I've imported my data into a dataframe and selected the columns that I need. I've also narrowed it down to the row index I need. Now I also need to select rows in another column where the objects START with the letters 'pl'.
Is there any solution to select row only based on first two characters in it?
I was thinking about
pl = df['Code'] == 'pl*'
but it won't work due to row indexing. Advice appreciated!
Use startswith for this:
df = df[df['Code'].str.startswith('pl')]
Here is a fully reproducible example for those who want to try it:
import pandas as pd
df = pd.DataFrame([["plusieurs", 1], ["toi", 2], ["plutot", 3]])
df.columns = ["Code", "number"]
df = df[df.Code.str.startswith("pl")] # alternative is df = df[df["Code"].str.startswith("pl")]
If you use a string method on the Series that should return you a true/false result. You can then use that as a filter combined with .loc to create your data subset.
new_df = df.loc[df['Code'].str.startswith('pl')].copy()
The condition is just a filter; you then need to apply it to the dataframe. As the filter you may use the method Series.str.startswith and do
df_pl = df[df['Code'].str.startswith('pl')]
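One caveat worth knowing: if the column can contain missing values, str.startswith returns NaN for them, which cannot be used as a boolean mask; passing na=False treats them as non-matches. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'Code': ['plus', None, 'toi', 'pluto']})

# na=False turns the NaN produced for the missing value into False
filtered = df[df['Code'].str.startswith('pl', na=False)]
print(filtered)
```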
You can specify a format for each column by using df.style.format(); however, I want this behavior index-based instead of column-based. I realise it's a bit more tricky because a column has a specific datatype, whereas a row can be mixed.
Is there a workaround to get it anyway? The df.style.apply() method has the flexibility, but I don't think it supports number formatting, only (CSS) styling.
Some sample data:
import pandas as pd
df = pd.DataFrame([[150.00, 181.00, 186.00],
[ 5.85, 3.73, 2.12]],
index=['Foo', 'Bar'],
columns=list('ABC'))
If I transpose the DataFrame, it is easy:
mapper = {'Foo': '{:.0f}',
'Bar': '{:.1f}%'}
df.T.style.format(mapper)
But I want this formatting without transposing, something like:
df.style.format(mapper, axis=1)
You may not need to use the Styler class for this if the goal is to re-format row values. You can use that mapper dictionary to match the formats you want, through a map and apply combination by row. The following should be a decent start:
df.apply(lambda s: s.map(mapper.get(s.name).format), axis=1)
Thanks!