Print tab-separated rows of a Python DataFrame

I want to print the following dataframe as a tab-delimited string:
sku  ids  output
1    a    0.1
2    b    0.2
3    d    0.4
Output:
1    a    0.1
2    b    0.2
3    d    0.4
It should be an iterative process that prints all the rows.
I have tried str.join() but it is not giving me the output that I am looking for. Any help would be appreciated. Thanks.

Apply a function on each row:
def applytab(row):
    print('\t'.join(map(str, row.values)))

# print('\t'.join(map(str, df.columns)))  # to print the column names if required
df.apply(applytab, axis=1)
Output:
a    0.1    1
b    0.2    2
d    0.4    3

I am very new to Pandas/DataFrames and my answer can certainly be improved, but one way to achieve your required result is the following:
def printDataFrame(df):
    for i in range(len(df.index)):
        row = list(df.iloc[i])
        print("\t".join(map(str, row)))

printDataFrame(df)
This function loops through all the rows; for each row it joins the elements with tabs and prints the result as a string.
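For completeness, pandas can also produce the tab-separated text in a single call with DataFrame.to_csv; a minimal sketch, with the question's frame built by hand:

import pandas as pd

df = pd.DataFrame({'sku': [1, 2, 3],
                   'ids': ['a', 'b', 'd'],
                   'output': [0.1, 0.2, 0.4]})

# With no path argument, to_csv returns the whole frame as one tab-delimited string.
print(df.to_csv(sep='\t', index=False, header=False))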

Related

Shift data groups

I am a newbie in Python and I want to perform a sort of shifting based on a shift unit that I have in a column.
My data looks like the following:
Group  Rate
1      0.1
1      0.2
1      0.3
2      0.9
2      0.12
The shift unit of the first group is 2 and of the second group is 1.
The desired output is the following:
Group  Shifted_Rate
1      0
1      0
1      0.1
2      0
2      0.9
I tried to do the following, but it is not working:
df['Shifted_Rate'] = df['Rate'].shift(df['Shift_Unit'])
Is there another way to do it without the shift() method?
I think this might be the first time I've worked with pandas, so this might not be helpful, but from what I've found in the documentation for pandas.DataFrame.shift(), the periods parameter (the "number of periods to shift") is an int. Because it is an int rather than something like a list or dict, I have the feeling you need to approach this type of problem by building individual DataFrames and then putting them together. I tried this out and used pandas.DataFrame.append() to combine the individual DataFrames. There might be a more efficient way to do this with pandas, but for now, I hope this helps with your immediate situation.
Here is the code I used to approach your situation (in a file called q11.py in my case):
import pandas as pd

# The periods used for the shifting of each group
# (e.g., '1' is for group 1, '2' is for group 2).
# You can add more items here later if need be.
periods = {
    '1': 2,
    '2': 1
}

# Building the first DataFrame
df1 = pd.DataFrame({
    'Rate': pd.Series([0.1, 0.2, 0.3], index=[1, 1, 1]),
})

# Building the second DataFrame
df2 = pd.DataFrame({
    'Rate': pd.Series([0.9, 0.12], index=[2, 2]),
})

# Shift each frame by its group's period
df1['Shifted_Rate'] = df1['Rate'].shift(
    periods=periods['1'],
    fill_value=0
)
df2['Shifted_Rate'] = df2['Rate'].shift(
    periods=periods['2'],
    fill_value=0
)

# Append the df2 DataFrame to df1 and save the result to a new DataFrame df3.
# Note: DataFrame.append() was removed in pandas 2.0; on current pandas,
# use df3 = pd.concat([df1, df2]) instead.
# ref: https://pythonexamples.org/pandas-append-dataframe/
# ref: https://stackoverflow.com/a/51953935/1167750
# ref: https://stackoverflow.com/a/40014731/1167750
# ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html
df3 = df1.append(df2, ignore_index=False)

# ref: https://stackoverflow.com/a/18023468/1167750
df3.index.name = 'Group'
print("\n", df3, "\n")

# Optional: if you only want to keep the Shifted_Rate column:
del df3['Rate']
print(df3)
When running the program, the output should look like this:
$ python3 q11.py
        Rate  Shifted_Rate
Group
1       0.10           0.0
1       0.20           0.0
1       0.30           0.1
2       0.90           0.0
2       0.12           0.9

       Shifted_Rate
Group
1               0.0
1               0.0
1               0.1
2               0.0
2               0.9
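A more compact alternative keeps everything in one frame and shifts each group by its own period via groupby; a minimal sketch, assuming the per-group shift units live in a dict keyed by group:

import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 1, 2, 2],
                   'Rate': [0.1, 0.2, 0.3, 0.9, 0.12]})
shift_units = {1: 2, 2: 1}  # shift unit per group, from the question

# Shift each group's Rate by that group's own number of periods.
parts = [grp.shift(shift_units[key], fill_value=0)
         for key, grp in df.groupby('Group')['Rate']]
df['Shifted_Rate'] = pd.concat(parts)
print(df[['Group', 'Shifted_Rate']])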

Multiply each column by its relative factor using the column name prefix

I have a matrix like:
id | v1_m1  v2_m1  v3_m1  f_m1  v1_m2  v2_m2  v3_m2  f_m2
1  |   0     .5     .5     4     0.1    0.3    0.6    4
2  |  0.3    .3     .4     8     0.2    0.4    0.4    7
What I want is to multiply each v column with the suffix "_m1" by the f_m1 column, and all the v columns with the suffix "_m2" by the f_m2 column.
The output that I expect is something like this:
id | v1_m1  v2_m1  v3_m1  v1_m2  v2_m2  v3_m2
1  |   0     2      2      0.4    1.2    2.4
2  |  2.4    2.4    3.2    1.4    2.8    2.8
for m in range(1, maxm):        # assuming maxm is one past the last m index
    for i in range(1, maxv):    # assuming maxv is one past the last v index
        df["v{}_m{}".format(i, m)] = df["v{}_m{}".format(i, m)] * df["f_m{}".format(m)]
for m in range(1, maxm):
    df = df.drop(columns=["f_m{}".format(m)])  # drop() returns a copy, so reassign
You could do this with some fancy dataframe reshaping:
df.columns = pd.MultiIndex.from_arrays(list(zip(*df.columns.str.split('_'))))
df = df.stack()
df_mul = df.filter(like='v').mul(df.filter(like='f').squeeze(), axis=0)
df_mul = df_mul.unstack().sort_index(level=1, axis=1)
df_mul.columns = [f'{i}_{j}' for i, j in df_mul.columns]
df_mul
Output:
v1_m1 v2_m1 v3_m1 v1_m2 v2_m2 v3_m2
id
1 0.0 2.0 2.0 0.4 1.2 2.4
2 2.4 2.4 3.2 1.4 2.8 2.8
Details:
- Create MultiIndex column headers by splitting the column names on '_'.
- Reshape the dataframe, stacking the m# level to rows, leaving four columns: one f and three v's.
- Using filter, select the v columns and multiply by the f series, created by selecting the single f column and using squeeze to turn a one-column dataframe into a pd.Series.
- unstack the m# level back to columns.
- Flatten the MultiIndex column header back to a single level using an f-string in a list comprehension.
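If the reshaping feels heavy, a plain loop over the factor columns does the same job; a minimal sketch, assuming every factor column is named f_<suffix> and every value column v<n>_<suffix> as in the question:

import pandas as pd

df = pd.DataFrame({
    'v1_m1': [0.0, 0.3], 'v2_m1': [0.5, 0.3], 'v3_m1': [0.5, 0.4], 'f_m1': [4, 8],
    'v1_m2': [0.1, 0.2], 'v2_m2': [0.3, 0.4], 'v3_m2': [0.6, 0.4], 'f_m2': [4, 7],
}, index=pd.Index([1, 2], name='id'))

# For each factor column, multiply the v columns that share its suffix.
for f_col in [c for c in df.columns if c.startswith('f_')]:
    suffix = f_col.split('_', 1)[1]  # e.g. 'm1'
    v_cols = [c for c in df.columns
              if c.startswith('v') and c.endswith('_' + suffix)]
    df[v_cols] = df[v_cols].mul(df[f_col], axis=0)

df = df.drop(columns=[c for c in df.columns if c.startswith('f_')])
print(df)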
Assuming that your matrix is a pandas dataframe called df, I would like to nominate a list-comprehension approach, if you enjoy them:
import itertools

items = [(i[0][0], i[0][1].multiply(i[1][1]))
         for i in itertools.product(df.items(), repeat=2)
         if i[0][0][-2:] == i[1][0][-2:]
         and i[1][0][:1] == 'f'
         and i[0][0][:1] != 'f']
df_mul = pd.DataFrame.from_dict({i[0]: i[1] for i in items})
It should still be fast on larger versions of this problem, though note that the cross product grows quadratically with the number of columns.
Explanation:
- itertools.product creates a generator for the cross product between the columns, as (c1, c2) tuples.
- The filter keeps only the pairs where the last two characters of c1 and c2 are the same AND c2 starts with 'f' AND c1 doesn't start with 'f', leaving just the column pairs you want to operate on, e.g. [('v1_m1', 'f_m1'), ('v2_m1', 'f_m1'), ('v1_m2', 'f_m2')].
- Each surviving pair of columns is multiplied, given a column name, and saved in items (a structure similar to df.items()).
- Finally, items is turned back into a dataframe.
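A quick usage example of that snippet on the question's data (the frame construction is mine, and the (name, series) tuples are unpacked for readability):

import itertools
import pandas as pd

df = pd.DataFrame({
    'v1_m1': [0.0, 0.3], 'v2_m1': [0.5, 0.3], 'v3_m1': [0.5, 0.4], 'f_m1': [4, 8],
    'v1_m2': [0.1, 0.2], 'v2_m2': [0.3, 0.4], 'v3_m2': [0.6, 0.4], 'f_m2': [4, 7],
}, index=pd.Index([1, 2], name='id'))

# Same filter as above, written with the tuples unpacked.
items = [(c1, s1.multiply(s2))
         for (c1, s1), (c2, s2) in itertools.product(df.items(), repeat=2)
         if c1[-2:] == c2[-2:] and c2.startswith('f') and not c1.startswith('f')]
df_mul = pd.DataFrame.from_dict(dict(items))
print(df_mul)  # the six v columns, each scaled by its matching f column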

Only want to consider a dataframe up to the present point

I have a dataframe and I am trying to do something along the lines of
df['foo'] = np.where(myfunc(df) == 1, 10, 20)
but I only want to consider the dataframe up to the present point. For example, if my dataframe looked like
     A    B    C
1  0.3  0.3  1.6
2  0.6  0.6  0.4
3  0.9  0.9  1.2
4  1.2  1.2  0.8
and I was generating the value of 'foo' for the third row, I would be looking at the dataframe's first through third rows, but not the fourth row. Is it possible to accomplish this?
It is certainly possible. The dataframe up to the present row is given by
df.iloc[:present]
and you can do whatever you want with it; in particular, use where, as described here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html
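To make that concrete, here is a small sketch that evaluates a stand-in myfunc on each growing slice; myfunc itself is hypothetical, since the question doesn't define it:

import pandas as pd

df = pd.DataFrame({'A': [0.3, 0.6, 0.9, 1.2],
                   'B': [0.3, 0.6, 0.9, 1.2],
                   'C': [1.6, 0.4, 1.2, 0.8]},
                  index=[1, 2, 3, 4])

def myfunc(partial):
    # Hypothetical stand-in: 1 if the mean of C so far exceeds 1, else 0.
    return int(partial['C'].mean() > 1)

# For the i-th row, pass only the rows up to and including it.
df['foo'] = [10 if myfunc(df.iloc[:i + 1]) == 1 else 20
             for i in range(len(df))]
print(df)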

Manual Feature Engineering in Pandas - Mean of 1 Column vs All Other Columns

Hard to describe this one, but: for every pair of columns in a dataframe, create a new column that contains the mean of the two. Running Python 3.6.
For example, given this dataframe:
   a  b  c
0  1  1  0
1  0  1  0
2  1  1  0
I would like to get the original columns plus one Mean_of_ column per pair, as in the answer's output below.
The exact order of the added columns at the end isn't important, but it needs to handle every possible combination of means between all columns, with a depth of 2 (i.e., compare one column to another). Ideally, I would like to have the depth set as a separate variable, so with a depth of 3 it would do this while comparing 3 columns to one another (see the sketch after the answer below).
Ideas? Thanks!
UPDATE
I got this to work, but I'm wondering if there's a computationally faster way of doing it. I basically just created two of the same loops (a loop within a loop) to compare one column vs the rest, skipping comparisons of a column with itself:
eng_features = pd.DataFrame()
for col in df.columns:
    for col2 in df.columns:
        # Don't compare same columns, or reversed same columns
        if col == col2 or (str(col2) + '_' + str(col)) in eng_features:
            continue
        eng_features[str(col) + '_' + str(col2)] = df[[col, col2]].mean(axis=1)

df = pd.concat([df, eng_features], axis=1)
Use itertools, a built-in Python utility package for iterators:
from itertools import permutations

for col1, col2 in permutations(df.columns, r=2):
    df[f'Mean_of_{col1}-{col2}'] = df[[col1, col2]].mean(axis=1)
and you will get what you need:
a b c Mean_of_a-b Mean_of_a-c Mean_of_b-a Mean_of_b-c Mean_of_c-a \
0 1 1 0 1.0 0.5 1.0 0.5 0.5
1 0 1 0 0.5 0.0 0.5 0.5 0.0
2 1 1 0 1.0 0.5 1.0 0.5 0.5
Mean_of_c-b
0 0.5
1 0.5
2 0.5
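Regarding the configurable depth from the question: since a mean is order-independent, combinations (rather than permutations) avoids duplicate columns such as Mean_of_a-b vs Mean_of_b-a, and takes the group size directly; a short sketch with the depth as a variable:

from itertools import combinations

depth = 3  # how many columns to average together
base_cols = list(df.columns)  # snapshot so newly added Mean_of_ columns aren't reused
for cols in combinations(base_cols, r=depth):
    df['Mean_of_' + '-'.join(map(str, cols))] = df[list(cols)].mean(axis=1)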

Counting the weighted intersection of equivalent rows in two tables

The following question is a generalization to the question posted here:
Counting the intersection of equivalent rows in two tables
I have two FITS files. For example, the first file has 100 rows and 2 columns. The second file has 1000 rows and 3 columns.
FITS FILE 1        FITS FILE 2
A  B               C  D  E
1  2               1  2  0.1
1  3               1  2  0.3
2  4               1  2  0.9
I need to take the first row of the first file, i.e. 1 and 2, and check how many rows in the second file have C = 1 and D = 2, weighting each pair (C,D) with respect to the corresponding value in column E.
In the example, I have 3 rows in the second file that have C = 1 and D = 2. They have weights E = 0.1, 0.3, and 0.9, respectively. Weighting with respect to the values in E, I need to associate the value 0.1 + 0.3 + 0.9 = 1.3 to the pair (A,B) = (1,2) of the first file. Then I need to do the same for the second row (first file), i.e. 1 and 3, and find out how many rows in the second file have 1 and 3, again weighting with respect to the value in column E, and so on.
The first file does not have duplicates (all the rows have different pairs, none are identical); only file 2 has many identical pairs, which I need to find.
I finally need the weighted number of rows in the second file that have the same values as the rows of the first FITS file.
The result should be:
A  B  Number
1  2  1.3    # 1 and 2 occurs 1.3 times
1  3  4.5    # 1 and 3 occurs 4.5 times
and so on for all pairs in A and B columns.
I know from the post cited above that the solution when the weights in column E are all equal to 1 involves Counter, as follows:
from collections import Counter

# Create frequency table of (C,D) column pairs
file2freq = Counter(zip(C, D))

# Look up the frequency value for each row of file 1
# and print out the row and frequency data.
for a, b in zip(A, B):
    print(a, b, file2freq[a, b])
To answer the question I need to include the weights in E when I use Counter:
file2freq = Counter(zip(C, D))
I was wondering if it is possible to do that.
Thank you very much for your help!
I'd follow up on the suggestion made by Iguananaut in the comments to that question. I believe numpy is an ideal tool for this.
import numpy as np

fits1 = np.genfromtxt('fits1.csv')
fits2 = np.genfromtxt('fits2.csv')

summed = np.zeros(fits1.shape[0])
for ind, row in enumerate(fits1):
    condition = (fits2[:, :2] == row).all(axis=1)
    # Change the assignment to += if the rows in fits1 are not unique.
    summed[ind] = fits2[condition, -1].sum()
After the import, the first two lines load the data from the files. genfromtxt returns an array of floats, which comes with a warning: comparing one float to another is prone to bugs. In this case it works, though, because both columns in fits1.csv and the first two columns in fits2.csv are integers and are parsed in the same manner by genfromtxt.
Then, in the for loop, the variable condition is created: it is True wherever the first two columns in fits2 match the current row of fits1 (the result is a boolean array).
Finally, for the current row index ind, the value of the array summed is set to the sum of all the values in the last column of fits2 where condition was True.
For a mini example I made, I got this:
oliver@armstrong:/tmp/sto$ cat fits1.csv
1 2
1 3
2 4
oliver@armstrong:/tmp/sto$ cat fits2.csv
1 2 .1
1 2 .3
1 2 .9
2 4 .3
1 5 .5
2 4 .7
# Running the above code, summed is:
# array([ 1.3,  0. ,  1. ])
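As for the Counter question itself: yes, the weights can be folded in by accumulating into a dict instead of counting; a sketch, assuming A, B, C, D, E are the column sequences from the question (filled here with the mini example's data):

from collections import defaultdict

C = [1, 1, 1, 2, 1, 2]
D = [2, 2, 2, 4, 5, 4]
E = [0.1, 0.3, 0.9, 0.3, 0.5, 0.7]
A = [1, 1, 2]
B = [2, 3, 4]

# Accumulate the weight E of every (C,D) pair instead of counting the pairs.
weights = defaultdict(float)
for c, d, e in zip(C, D, E):
    weights[c, d] += e

# Look up the weighted count for each row of file 1.
for a, b in zip(A, B):
    print(a, b, weights[a, b])  # -> 1.3, 0.0, and 1.0 (up to float rounding)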
