How can we apply conditions to a dataset in Python and fetch a column name as the output?
Let's say the dataframe below is our data. My question is: how can we retrieve a column name (say, "name") as an output by applying conditions to this dataframe?
         | name     | salary | jobDes
---------|----------|--------|--------------
store1   | daniel   | 50k    | datascientist
store2   | paladugu | 55k    | datascientist
store3   | theodor  | 53k    | dataEngineer
I want to fetch a column name as a result, say "name".
Elaborated:
import pandas as pd
data = {'name':['daniel', 'paladugu', 'theodor'], 'jobDes':['datascientist', 'datascientist', 'dataEngineer']}
df = pd.DataFrame(data)
print(df['name']) # just that easy
OUTPUT:
0 daniel
1 paladugu
2 theodor
Name: name, dtype: object
Presuming you are using either pandas or dask, you should be able to get the column names with df.columns.
This means that if you wish to know what the first column is called, you can index it as usual with df.columns[0] (indices start at 0 for the first element, as in Python/C).
If you then wish to access all the data in that column, you can use df[df.columns[0]] or the actual column name, df['name'].
Note that if your data frame is named df, df.columns returns an Index of all the column names, not a plain list.
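A minimal sketch pulling these pieces together, including retrieving column names by a condition (the condition value 'datascientist' is just illustrative):

import pandas as pd

df = pd.DataFrame({'name': ['daniel', 'paladugu', 'theodor'],
                   'salary': ['50k', '55k', '53k'],
                   'jobDes': ['datascientist', 'datascientist', 'dataEngineer']})

print(df.columns)         # Index(['name', 'salary', 'jobDes'], dtype='object')
print(df.columns[0])      # 'name'
print(df[df.columns[0]])  # same data as df['name']

# Column names whose values satisfy a condition:
mask = df.eq('datascientist').any()      # True for each column containing that value
print(df.loc[:, mask].columns.tolist())  # ['jobDes']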
Related
I have a lot of datasets that I need to iterate through, searching for a specific value and returning some values based on the search outcome.
Datasets are stored as dictionary:
key   type        size       value
df1   DataFrame   (89, 10)   Column names:
df2   DataFrame   (89, 10)   Column names:
df3   DataFrame   (89, 10)   Column names:
Each dataset looks something like this, and I am trying to check whether the value in column A, row 1, has 035 in it, and if so return column B.
A           B     C
02 la 035   nan   nan
Target      7     5
Warning     3     6
If I try to search for a specific value in it, I get an error:
TypeError: first argument must be string or compiled pattern
I have tried:

import re
import pandas as pd

something = []
for key in df:
    text = df[key]
    if re.search('035', text):  # re.search expects strings, not a DataFrame — hence the TypeError
        something.append(text['B'])

something = pd.concat(something, axis=1)
You can use .str.contains(): https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html
df = pd.DataFrame({
"A":["02 la 035", "Target", "Warning"],
"B":[0,7,3],
"C":[0, 5, 6]
})
df[df["A"].str.contains("035")] # Returns first row only.
Also works for regex.
df[df["A"].str.contains("[TW]ar")] # Returns last two rows.
EDIT to answer additional details.
The dataframe I set up looks like this:

           A  B  C
0  02 la 035  0  0
1     Target  7  5
2    Warning  3  6
To extract column B for those rows which match the last regex pattern I used, amend the last line of code to:
df[df["A"].str.contains("[TW]ar")]["B"]
This returns a series. Output:

1    7
2    3
Name: B, dtype: int64
Edit 2: I see you want a dataframe at the end. Just use:
df[df["A"].str.contains("[TW]ar")]["B"].to_frame()
Let's suppose I have the following two dataframes:
int = pd.DataFrame({'domain1':['ABC.6','GF53.7','SDC78.12','GGH7T.64'], 'domain2': ['UI76.89','76TH3.2','YU1QW.45','BY76.12']})
domain1 domain2
ABC.6 UI76.89
GF53.7 76TH3.2
SDC78.12 YU1QW.45
GGH7T.64 BY76.12
And another dataframe:
doms = pd.DataFrame({'domains':['GF53','VB96','UI76','GGH7T','BY76','ABC','SDC78']})
domains
GF53
VB96
UI76
GGH7T
BY76
ABC
SDC78
I want to create a new dataframe that includes a row from the 'int' dataframe only if the values in both its 'domain1' and 'domain2' columns contain substrings from the 'domains' column of the 'doms' dataframe.
For example in this case the result should look like:
domain1 domain2
ABC.6 UI76.89
GGH7T.64 BY76.12
Just str.contains combined with a regex joined from the domains:

int[int.domain1.str.contains('|'.join(doms.domains)) &
    int.domain2.str.contains('|'.join(doms.domains))]
domain1 domain2
0 ABC.6 UI76.89
3 GGH7T.64 BY76.12
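One caveat: '|'.join(doms.domains) treats every domain as a regular expression. A small hedged variant, only needed if the domain strings can contain regex metacharacters (e.g. '.'), escapes them first:

import re

pattern = '|'.join(map(re.escape, doms.domains))
int[int.domain1.str.contains(pattern) & int.domain2.str.contains(pattern)]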
Try this: DataFrame.stack reshapes the columns into rows, str.contains then filters the values, DataFrame.unstack restores the original shape, and .all(axis=1) keeps only the rows where every column matched.
df[df.stack().str.contains("|".join(doms.domains)).unstack().all(axis=1)]
domain1 domain2
0 ABC.6 UI76.89
3 GGH7T.64 BY76.12
I have a dataframe, df, with a standard wide format:
df:
'state' | 'population' | 'region'
0 'CA' | 10000 | 'west'
1 'UT' | 6000 | 'west'
2 'NY' | 8500 | 'east'
I need to be able to rename certain values in the state column that match some conditions I've set. For example, I need to rename cases of 'NY' to 'New York' if the region variable matches 'east'. I'd like to avoid slicing and concatenating the dataframe back together.
I've tried subsetting the dataframe using the code below, but the rename doesn't seem to apply properly.
region_filter = df['region'] == 'east'
df[region_filter] = df.loc[region_filter, 'state'].rename({'NY': 'New York'})
rename() should only be applied when trying to change axis labels; pandas' replace() function is the one meant for mapping dataframe values.
Also, the second line should read df.loc[region_filter, 'state'] = df.loc[region_filter, 'state'].replace({'NY': 'New York'}) so that you assign back to the 'state' column only, avoiding a shape mismatch error.
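Putting both fixes together, a minimal sketch on the example data:

import pandas as pd

df = pd.DataFrame({'state': ['CA', 'UT', 'NY'],
                   'population': [10000, 6000, 8500],
                   'region': ['west', 'west', 'east']})

region_filter = df['region'] == 'east'

# replace() maps values; rename() only maps axis labels.
df.loc[region_filter, 'state'] = df.loc[region_filter, 'state'].replace({'NY': 'New York'})
print(df)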
I have a dataframe with the following data:
+----------+------------+-------------+---------------+----------+
|id |name |predicted |actual |yyyy_mm_dd|
+----------+------------+-------------+---------------+----------+
| 215| NirPost| null|100.10023 |2020-01-10|
| null| NirPost| 57145|null |2020-01-10|
+----------+------------+-------------+---------------+----------+
I want to merge these two rows into one, based on the name. This df is the result of a query which I've restricted to one company and a single day. In the real dataset, there are ~70 companies with daily data. I want to write this data into a new table as single rows.
This is the output I'd like:
+----------+------------+-------------+---------------+----------+
|id |name |predicted | actual |yyyy_mm_dd|
+----------+------------+-------------+---------------+----------+
| 215| NirPost| 57145 |100.10023 |2020-01-10|
+----------+------------+-------------+---------------+----------+
I've tried this:
df.replace('null','').groupby('name',as_index=False).agg(''.join)
However, this outputs my original df but with NaN instead of null.
`df.dtypes`:
id float64
name object
predicted float64
actual float64
yyyy_mm_dd object
dtype: object
How about explicitly passing all the columns to the groupby aggregation with max, so that the null values are eliminated?
import pandas as pd
import numpy as np
data = {'id':[215,np.nan],'name':['nirpost','nirpost'],'predicted':[np.nan,57145],'actual':[100.12,np.nan],'yyyy_mm_dd':['2020-01-10','2020-01-10']}
df = pd.DataFrame(data)
df = df.groupby('name').agg({'id':'max','predicted':'max','actual':'max','yyyy_mm_dd':'max'}).reset_index()
print(df)
Returns:
name id predicted actual yyyy_mm_dd
0 nirpost 215.0 57145.0 100.12 2020-01-10
Of course, since you have more data, you should probably consider adding something else to your groupby (see the sketch below) so as to not collapse too many rows; but for the example data you provide, I believe this solves the issue.
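For instance, to keep one row per company per day rather than collapsing across dates, a minimal sketch that also groups on the date column:

df = df.groupby(['name', 'yyyy_mm_dd'], as_index=False).max()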
EDIT:
If all columns are coming out named original_column_name_max, then you can simply use this:
df.columns = [x[:-4] for x in list(df)]
The list comprehension builds a list that strips the last 4 characters (the _max suffix) from each value in list(df), which is the list of column names. Finally, you assign it back with df.columns = ....
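Alternatively, assuming the suffix really is always _max, the same cleanup can be done with the Index's string methods, which won't mangle a column name that happens to be shorter than four characters:

df.columns = df.columns.str.replace('_max$', '', regex=True)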
I set up a pandas dataframe that, besides my data, stores the respective units using a MultiIndex, like this:
Name         Relative_Pressure  Volume_STP
Unit         -                  ccm/g
Description  p/p0
0            0.042691           29.3601
1            0.078319           30.3071
2            0.129529           31.1643
3            0.183355           31.8513
4            0.233435           32.3972
5            0.280847           32.8724
Now I can, for example, extract only the Volume_STP data with df.Volume_STP:
Unit          ccm/g
Description
0           29.3601
1           30.3071
2           31.1643
3           31.8513
4           32.3972
5           32.8724
With .values I can obtain a numpy array of the data. But how can I get the stored unit? I can't figure out what I need to do to retrieve the stored ccm/g string.
EDIT: Added an example of how the data frame is generated.
Let's say I have a string that looks like this:
Relative Volume # STP
Pressure
cc/g
4.26910e-02 29.3601
7.83190e-02 30.3071
1.29529e-01 31.1643
1.83355e-01 31.8513
2.33435e-01 32.3972
2.80847e-01 32.8724
3.34769e-01 33.4049
3.79123e-01 33.8401
I then use this function:
from io import StringIO
import pandas as pd

def read_result(contents, columns, units, descr):
    # skip the header lines of the raw text and parse whitespace-separated values
    df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,
                     index_col=False, header=None)
    df.drop(df.index[-1], inplace=True)  # drop the trailing row
    # build a three-level column index: Name / Unit / Description
    index = pd.MultiIndex.from_arrays((columns, units, descr))
    df.columns = index
    df.columns.names = ['Name', 'Unit', 'Description']
    df = df.apply(pd.to_numeric)
    return df
like this
def isotherm(contents):
columns = ['Relative_Pressure','Volume_STP']
units = ['-','ccm/g']
descr = ['p/p0','']
df = read_result(contents, columns, units, descr)
return df
to generate the DataFrame at the beginning of my question.
As df has a MultiIndex as columns, df.Volume_STP is still a pandas DataFrame. So you can still access its columns attribute, and the relevant item will be at index 0 because the dataframe contains only 1 Series.
So, you can extract the names that way:
print(df.Volume_STP.columns[0])
which should give: ('ccm/g', '')
In the end, you extract the unit with .columns[0][0] and the description with .columns[0][1].
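Concretely, that last step looks like this:

unit = df.Volume_STP.columns[0][0]         # 'ccm/g'
description = df.Volume_STP.columns[0][1]  # ''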
You can do something like this:
df.xs('Volume_STP', axis=1).columns.remove_unused_levels().get_level_values(0).tolist()[0]
Output:
'ccm/g'
Slice the dataframe at 'Volume_STP' using xs, take its columns, remove the unused levels from the column headers, then get the values of the topmost remaining level, which is the Unit. Convert to a list and select the first value.
A generic way of accessing values in a multi-index/multi-column frame is via the index.get_level_values or columns.get_level_values functions of a data frame.
In your example, try df.columns.get_level_values(1) to access the second level of the multi-level columns, "Unit". If you have already selected a column, say "Volume_STP", then you have removed the top level, and in that case your units will be in the 0th level.
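A minimal sketch of both cases, using the dataframe built by isotherm above:

# On the full frame, level 1 of the column MultiIndex holds the units:
print(df.columns.get_level_values(1))                # Index(['-', 'ccm/g'], dtype='object', name='Unit')

# After selecting Volume_STP, the Unit level has moved to position 0:
print(df.Volume_STP.columns.get_level_values(0)[0])  # 'ccm/g'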