How to analyze a dataframe with multiple headers? - python

For example, I have a df with 3 headers. I want to analyze data from one of the columns in the first header and one of the columns in the second header. How do I do that?

It's hard to know if this will work because you haven't provided your data, but you can try the following.
First, access the column names:
data.columns
Then isolate the columns you would like to analyze:
data = data[['column_1', 'column_2']]
Index the columns using the names exactly as they currently appear in data.columns; header levels you are not using can be ignored, as long as the names you index with match.
You can then rename the columns:
data.columns = ['new_column_1_name', 'new_column_2_name']
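A minimal sketch putting those steps together, assuming the three header rows are read with header=[0, 1, 2]; the file name and the column tuples below are placeholders, use whatever data.columns actually shows:
import pandas as pd

# Read a file whose first three rows are headers; the columns become a MultiIndex
data = pd.read_csv("data.csv", header=[0, 1, 2])

# Inspect the full (level 0, level 1, level 2) names first
print(data.columns.tolist())

# Select two columns by their full tuples (placeholder names)
data = data[[("header1_col", "header2_col", "header3_col"),
             ("header1_other", "header2_other", "header3_other")]]

# Flatten to simple names for the analysis
data.columns = ["new_column_1_name", "new_column_2_name"]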

You can pull them out as tuples:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=[["A", "B"], ["a", "b"]])
In [12]: df
Out[12]:
   A  B
   a  b
0  1  2
1  3  4
In [13]: df[[("A", "a")]]
Out[13]:
   A
   a
0  1
1  3
In your case it might be:
df[[("Year", "All ages")]]
See the advanced section of the docs for multi-index indexing and slicing.
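For reference, a couple of hedged examples of that kind of slicing on the same toy frame; xs and pd.IndexSlice are standard pandas tools, and which level you slice on depends on your real headers:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=[["A", "B"], ["a", "b"]])

# Cross-section: every column whose second header level is "a"
print(df.xs("a", axis=1, level=1))

# IndexSlice: everything under first-level header "A"
idx = pd.IndexSlice
print(df.loc[:, idx["A", :]])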

Related

Change a dataframe with one header to two headers

I have a dataframe and I want to change it into the one I attached. The names of the columns are not important, and the new one does not have the index 0, 1, 2, .... This seems rather obvious, but I can't seem to figure it out. Note: I don't want to change the values or the columns; I only want the same structure. I've added a simple dataframe as an example:
import pandas as pd
df = pd.DataFrame()
df['timestamp'] = [1, 2]
df['cons_id'] = [2, 3]
df['value'] = [4, 5]
The dataframe I have, and the layout I want to change it to, were shown as attached images (not reproduced here).
Are you looking for set_index?
df = df.set_index('timestamp').rename_axis(columns='timestamp')
print(df)
# Output
timestamp  cons_id  value
timestamp
1                2      4
2                3      5
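Putting the question's example frame and the answer together, a runnable sketch:
import pandas as pd

df = pd.DataFrame({'timestamp': [1, 2], 'cons_id': [2, 3], 'value': [4, 5]})

# Move 'timestamp' into the index, then name the columns axis so that
# 'timestamp' is printed on its own header line, mimicking a second header
df = df.set_index('timestamp').rename_axis(columns='timestamp')
print(df)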

get value from dataframe based on row values without using column names

I am trying to get a value situated on the third column from a pandas dataframe by knowing the values of interest on the first two columns, which point me to the right value to fish out. I do not know the row index, just the values I need to look for on the first two columns. The combination of values from the first two columns is unique, so I do not expect to get a subset of the dataframe, but only a row. I do not have column names and I would like to avoid using them.
Consider the dataframe df:
a 1 bla
b 2 tra
b 3 foo
b 1 bar
c 3 cra
I would like to get tra from the second row, based on the b and 2 combination that I know beforehand. I've tried subsetting with
df = df.loc['b', :]
which returns all the rows with b in the first column (provided I've read the data with index_col=0), but I am not able to pass multiple conditions to it without crashing or without knowing the index of the row of interest. I tried both df.loc and df.iloc.
In other words, ideally I would like to get tra without even using row indexes, by doing something like:
df[(df[,0] == 'b' & df[,1] == 2)][2]
Any suggestions? It is probably something simple, but I tend to use the same syntax as in R, which apparently is not compatible here.
Thank you in advance.
As #anky suggested, a way to do this without knowing the column names or the row index of your value of interest is to read the file into a pandas dataframe using a multi-level index on the first two columns.
For the provided example, knowing at least the column positions, that would be:
df = pd.read_csv(path, sep='\t', index_col=[0, 1])
then, you can use:
# get_loc returns the integer position of the ("b", 2) index entry;
# slice from there and take the first row
df = df.iloc[df.index.get_loc(("b", 2)):]
df.iloc[0]
to get the value of interest.
Thanks again #anky for your help. If you found this question useful, please upvote #anky's comment in the posted question.
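A self-contained sketch of the same idea, using an in-memory string instead of a file, and header=None because the question says there are no column names; .loc with the full index tuple is an alternative to the get_loc slicing above:
import io
import pandas as pd

raw = "a\t1\tbla\nb\t2\ttra\nb\t3\tfoo\nb\t1\tbar\nc\t3\tcra\n"

# Read the first two columns as a MultiIndex; the data has no header row
df = pd.read_csv(io.StringIO(raw), sep='\t', index_col=[0, 1], header=None)

# Look the row up by its (first column, second column) pair
print(df.loc[("b", 2)].iloc[0])   # -> 'tra'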
I'd probably use DataFrame.query for that:
import pandas as pd
df = pd.DataFrame(index=['a', 'b', 'b', 'b', 'c'], data={"col1": [1, 2, 3, 1, 3], "col2": ['bla', 'tra', 'foo', 'bar', 'cra']})
df
   col1 col2
a     1  bla
b     2  tra
b     3  foo
b     1  bar
c     3  cra
df.query('col1 == 2 and col2 == "tra"')
   col1 col2
b     2  tra
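If, as in the question, you only know the values in the first two columns ('b' and 2) and want to recover 'tra', query can also reference an unnamed index via the index keyword; a small sketch reusing the df defined above:
print(df.query('index == "b" and col1 == 2')['col2'].iloc[0])   # -> 'tra'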

pandas groupby dataframes, calculate diffs between consecutive rows

Using pandas, I open some csv files in a loop and set the index to the cycleID column, except the cycleID column is not unique. See below:
for filename in all_files:
    abfdata = pd.read_csv(filename, index_col=None, header=0)
    abfdata = abfdata.set_index("cycleID", drop=False)
    for index, row in abfdata.iterrows():
        print(row['cycleID'], row['mean'])
This prints the 2 columns (cycleID and mean) of the dataframe I am interested in for further computations:
1 1.5020712104685252e-11
1 6.56683605063102e-12
2 1.3993315187144084e-11
2 -8.670502467042485e-13
3 7.0270625256163566e-12
3 9.509995221868016e-12
4 1.2901435995915644e-11
4 9.513106448422182e-12
The objective is to use the rows corresponding to the same cycleID and calculate the difference between the mean column values. So, if there are 8 rows in the table, the final array or list would store 4 values.
I want to make it scalable as well where there can be 3 or more rows with the same cycleIDs. In that case, each cycleID could have 2 or more mean differences.
Update: Instead of creating a new question about it, I thought I'd add it here.
I used the diff and groupby approach as mentioned in the solution. It works great but I have this extra need to save one of the mean values (odd row or even row doesn't matter) in a new column and make that part of the new data frame as well. How do I do that?
You can use groupby with diff:
s2 = df.groupby(['cycleID'])['mean'].diff()
s2.dropna(inplace=True)
Output:
1   -8.453876e-12
3   -1.486037e-11
5    2.482933e-12
7   -3.388330e-12
UPDATE
d = [[1, 1.5020712104685252e-11],
[1, 6.56683605063102e-12],
[2, 1.3993315187144084e-11],
[2, -8.670502467042485e-13],
[3, 7.0270625256163566e-12],
[3, 9.509995221868016e-12],
[4, 1.2901435995915644e-11],
[4, 9.513106448422182e-12]]
df = pd.DataFrame(d, columns=['cycleID', 'mean'])
# per-cycleID differences; drop the NaN produced for the first row of each group
df2 = df.groupby(['cycleID']).diff().dropna().rename(columns={'mean': 'difference'})
# pull the corresponding original 'mean' values back in alongside the differences
df2['mean'] = df['mean'].iloc[df2.index]
     difference          mean
1 -8.453876e-12  6.566836e-12
3 -1.486037e-11 -8.670502e-13
5  2.482933e-12  9.509995e-12
7 -3.388330e-12  9.513106e-12
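An equivalent sketch that keeps everything in one frame, reusing the df built above; because diff works within each group, it also scales to three or more rows per cycleID:
out = (df.assign(difference=df.groupby('cycleID')['mean'].diff())
         .dropna(subset=['difference']))
print(out[['cycleID', 'difference', 'mean']])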

Map a dataframe to a column of cartesian products by column name

Note: "Cartesian product" might not be the right term, since we are working with data, not sets. It is more like a "free product" or "words".
There is more than one way to turn a dataframe into a list of lists.
Here is one way
In that case, the list of lists actually represents a list of columns, where the list index is the row index.
What I want to do is take a dataframe, select specific columns by name, then produce a new list where the inner lists are cartesian products of the elements from the selected columns. A simplified example is given here:
import pandas as pd
df = pd.DataFrame([[1,2,3],[3,4,5]])
magicMap(df)
df = [[1,3],[2,4],[3,5]]
With column names:
df # full of columns with names
magicMap(df, listOfCollumnNames)
df = [[c1r1,c2r1...],[c1r2, c2r2....], [c1r3, c2r3....]...]
Note: "cirj" is column i row j.
Is there a simple way to do this?
The code
import pandas as pd
df = pd.DataFrame([[1,2,3],[3,4,5]])
df2 = df.transpose()
goes from df
   0  1  2
0  1  2  3
1  3  4  5
to df2
   0  1
0  1  3
1  2  4
2  3  5
which looks like what you need:
df2.values.tolist()
[[1, 3], [2, 4], [3, 5]]
To get the column order you want, reorder df2's rows (the original columns) with df3 = df2.reindex(index=column_names), where column_names lists the columns in the order you want.
You can also convert the dataframe to a NumPy array with:
df.T.to_numpy()
array([[1, 3],
       [2, 4],
       [3, 5]], dtype=int64)
If it must be a list, then use the other answer provided or use:
df.T.to_numpy().tolist()
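Wrapping the answers above into the magicMap helper named in the question (the function name and the optional column_names argument come from the question; this is just a sketch):
import pandas as pd

def magicMap(df, column_names=None):
    # Optionally keep only the requested columns, in the requested order,
    # then return one inner list per selected column (the transpose trick above)
    if column_names is not None:
        df = df[column_names]
    return df.T.values.tolist()

df = pd.DataFrame([[1, 2, 3], [3, 4, 5]])
print(magicMap(df))   # [[1, 3], [2, 4], [3, 5]]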

Filter dataframe based on value_counts of other dataframe [duplicate]

I'm working in Python with a pandas DataFrame of video games, each with a genre. I'm trying to remove any video game with a genre that appears less than some number of times in the DataFrame, but I have no clue how to go about this. I did find a StackOverflow question that seems to be related, but I can't decipher the solution at all (possibly because I've never heard of R and my memory of functional programming is rusty at best).
Help?
Use groupby filter:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [12]: df
Out[12]:
   A  B
0  1  2
1  1  4
2  5  6
In [13]: df.groupby("A").filter(lambda x: len(x) > 1)
Out[13]:
   A  B
0  1  2
1  1  4
I recommend reading the split-apply-combine section of the docs.
A solution with better performance is GroupBy.transform with 'size', which returns the per-group counts as a Series of the same length as the original df, so you can filter with boolean indexing:
df1 = df[df.groupby("A")['A'].transform('size') > 1]
Or use Series.map with Series.value_counts:
df1 = df[df['A'].map(df['A'].value_counts()) > 1]
#jezrael's solution works very well. Here is a different approach to filter based on value counts:
For example, if the dataset is :
df = pd.DataFrame({'a': [1,2,3,3,1,6], 'b': [11,2,33,4,55,6]})
Convert and save the counts as a dictionary:
count_freq = dict(df['a'].value_counts())
Create a new column as a copy of the target column, then map the dictionary onto it:
df['count_freq'] = df['a']
df['count_freq'] = df['count_freq'].map(count_freq)
Now we have a new column with the count frequency; you can define a threshold and filter easily on this column:
df[df.count_freq>1]
Additionally, in case one wants to filter and also keep a 'count' column:
attr = 'A'
limit = 10
df2 = df.groupby(attr)[attr].agg(count='count')
df2 = df2.loc[df2['count'] > limit].reset_index()
print(df2)
#outputs rows with grouped 'A' count > 10 and columns ==> index, count, A
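To then drop the non-qualifying rows from the original frame, one hedged follow-up (not part of the answer above) is to keep only the values of A that survived the count filter:
# keep only rows whose 'A' value appears in the filtered counts
df_filtered = df[df[attr].isin(df2[attr])]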
I might be a little late to this party but:
df = pd.DataFrame(df_you_have.groupby(['IdA', 'SomeOtherA'])['theA_you_want_to_count'].count())
df.reset_index(inplace=True)
This is how you create a new dataframe and then just filter it...
df[df['A']>100]
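Applied to the video-game scenario from the question, a hedged sketch; the column names and the threshold are made up for illustration:
import pandas as pd

games = pd.DataFrame({
    'title': ['g1', 'g2', 'g3', 'g4', 'g5'],
    'genre': ['RPG', 'RPG', 'Puzzle', 'RPG', 'Racing'],
})

min_count = 2  # keep genres that appear at least this many times
filtered = games[games.groupby('genre')['genre'].transform('size') >= min_count]
print(filtered)   # only the RPG rows remain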
