Is there a built-in function in pandas that does the following? - python

I have two dataframes, df1 and df2. They have a column with common values, but in df1['comon_values_column'] every value comes up only once, while in df2['comon_values_column'] each value can come up more than once. I want to see if I can do the following with a single line and without a loop:
for value in df2['comon_values_column']:
    df2['empty_column'].loc[df2['comon_values_column'] == value] = df1['other_column'].loc[df1['comon_values_column'] == value]
I have tried to use merge, but because of the size of the dataframes it is very difficult to make sure that it does exactly what I want.
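Since each value appears only once in df1['comon_values_column'], one loop-free sketch (using the column names from the question; rows of df2 with no match in df1 would come out as NaN) is to build a lookup Series and map it:

# Build a lookup Series keyed by the common column (keys are unique in df1),
# then map it onto df2's common column in one vectorized step.
lookup = df1.set_index('comon_values_column')['other_column']
df2['empty_column'] = df2['comon_values_column'].map(lookup)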

Related

Mapping values from one df to another when dtypes are Objects

I have 2 dataframes that I am trying to map to each other - one I created from a csv file and the other from an Excel file. I am trying to map (something like a VLOOKUP in Excel) the name in the one df to the respective code in the other.
When trying to do the same task with another dataframe, where the values in the 'key' column were of type integer, the process worked well. In this case both columns are of type object, and I am able to map only one value; for the others I receive NaN. The one value that is successfully mapped is a value with 2 decimal points (for example 9.5.5), which is presumably why pandas treats the column as object and not integer.
I have attempted the following:
Changing the dtypes of both columns to strings and then trying to map them:
df_1['code'] = df_1['code'].astype(str)
df_2['code'] = df_2['code'].astype(str)
I adjusted the index of both so that the map function works with index instead of columns:
df_1.set_index('code', inplace=True)
df_2.set_index('code', inplace=True)
The mapping was done using the following code:
df_1['code_name'] = df_1.index.map(df_2['code_name'])
Not too sure what else is possible. I cannot use the .apply or .applymap functions, as Python states that my data is of DataFrame type. I also attempted to use the .squeeze() function to convert df_2 to a Series, and I still got the same result: NaN for the majority, with only certain values mapped. If it helps, the values that were mapped are aligned to the left (when opened in Excel) while those unmapped with NaN are aligned to the right (perhaps seen as numbers).
If possible, I would prefer to use the .map function over .merge, as it is faster.
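The left/right alignment hint suggests that the unmapped keys are stored as numbers on one side and as text on the other, in which case astype(str) can still produce strings that differ ('9' vs '9.0'). A hedged sketch that normalizes both key columns before mapping (column names taken from the question; the trailing-'.0' assumption may not match your data):

# Normalize both key columns to stripped strings without a trailing '.0',
# assuming the mismatch comes from numeric vs. text representations.
for df in (df_1, df_2):
    df['code'] = (df['code'].astype(str)
                            .str.strip()
                            .str.replace(r'\.0$', '', regex=True))
df_1['code_name'] = df_1['code'].map(df_2.set_index('code')['code_name'])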

Create a new interdependent df column using itertuples()

I am well aware of the strong priority and preference for scalar (vectorized) operations in the world of the pandas dataframe. However, I have so far exhausted all options on an exercise where I need to create a new df column that depends on the previous row's value of another column, while that other column in turn depends on the previous row's value of the new column. The two columns are thus inter-dependent (not a case of circular referencing, because each of the two columns depends only on the previous values of the other).
A for-loop approach (event-driven type) seems most relevant here, and I learned of the itertuples() function as a starting point:
for row in df.itertuples():
However, what is the best way to build the new column from such a loop? Should I first create a new/blank df column and then use some sort of append?
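One common pattern is to accumulate the new values in a plain Python list inside the loop and assign the whole list to the DataFrame once at the end, which is far cheaper than growing the DataFrame row by row. A minimal sketch with made-up columns 'a' (existing) and 'b' (new), where each row's 'b' depends on the previous row's 'a' and 'b' (the update rule below is hypothetical):

import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0]})

b_values = []
prev_a, prev_b = 0.0, 0.0              # seed values for the first row
for row in df.itertuples(index=False):
    b = prev_a + prev_b                # hypothetical inter-dependent rule
    b_values.append(b)
    prev_a, prev_b = row.a, b          # remember this row for the next one

df['b'] = b_values                     # assign once, after the loop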

How to use DataFrame.isin without the constraint of having to match both index and value?

So, I have two files, one with 6 million entries and the other with around 5 million. I want to compare the values of a particular column in both dataframes. This is the code that I have used:
print(df1['Col1'].isin(df2['col3']).value_counts())
This is essential for me, as I want to see the number of True (same) and False (different) entries. I am getting about 95% of the entries as True, but some 5% come back as False. I extracted this data using to_csv and compared the columns using vimdiff, and they are all identical, so why is the code labelling them as False (different)? Is there a better, more foolproof method?
Note: I have checked for whitespace in the columns as well. There is no whitespace.
PS: The DataFrame.isin documentation states that both index and value have to match. Since I have more entries in one file, the index does not match for those entries; how do I remove that constraint?
First, convert the column you use as the parameter inside your isin() call into a list; a plain list is checked by value only, with no index involved. Then filter your df1 dataframe with the resulting mask, because you need to take the value counts on the same column you filtered.
From your example:
print(df1[df1['Col1'].isin(df2['col3'].values.tolist())]['Col1'].value_counts())
Try running that again.
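If you only need the True/False counts rather than the matching rows, a hedged variant that also rules out dtype mismatches (a common cause of unexpected False results when one file is parsed as numbers and the other as text) is to cast both columns to str first:

# Cast both sides to str so that e.g. 123 (int) and '123' (str) compare
# equal; this assumes the 5% mismatches stem from mixed dtypes.
print(df1['Col1'].astype(str).isin(df2['col3'].astype(str).tolist()).value_counts())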

how to write an empty column in a csv based on other columns in the same csv file

I don't know whether this is a very simple question, but I would like to write a conditional statement based on two other columns.
I have two columns, age and SES, plus another, empty column that should be filled based on these two. For example, when a person is 65 years old and their corresponding socio-economic status is high, then a value of 1 is written in the third column (empty column = vitality class). I have an idea of what I want to achieve, but I have no idea how to implement it in Python itself. I know I could use a for loop and I know how to write conditions, but because two columns together determine what is written in the empty column, I have no idea how to express that in a function - and, furthermore, how to write the result back into the same csv (into the respective empty column).
Use the pandas module to import the csv as a DataFrame object. Then you can do logical statements to fill empty columns:
import pandas as pd

df = pd.read_csv('path_to_file.csv')
# Fill vitality_class with 1 wherever both conditions hold
df.loc[(df['age'] == 65) & (df['SES'] == 'high'), 'vitality_class'] = 1
df.to_csv('path_to_new_file.csv', index=False)
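If several age/SES combinations should map to different classes, a hedged extension using numpy.select (the cutoffs and class labels below are made up for illustration):

import numpy as np

conditions = [
    (df['age'] >= 65) & (df['SES'] == 'high'),   # hypothetical class 1
    (df['age'] >= 65) & (df['SES'] == 'low'),    # hypothetical class 2
    (df['age'] < 65) & (df['SES'] == 'high'),    # hypothetical class 3
]
df['vitality_class'] = np.select(conditions, [1, 2, 3], default=0)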

What is the best way to modify (e.g., perform math functions on) a column in a Dask DataFrame?

I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header; otherwise it adds another column, using the index number as a string-type column name. Is there something akin to the pandas idiom of, say, df.iloc[:, -1] = df.iloc[:, -1] * -1 that I can use with a Dask dataframe?
Edit: I have also tried df = df.applymap(lambda x: x * -1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
First question
If something works for string-named columns and not for numeric-named columns, then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
Second question
but is there a way to apply a function over just one column?
You can't directly apply a single Python function over a dask dataframe that is stored in many pieces; however, methods like .map_partitions or .reduction may help you achieve the same result with some cleverness.
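For the single-column case here, a hedged sketch with .map_partitions (the column name 'x' is made up; map_partitions runs a pandas-level function on each partition of the column):

import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'x': [1.0, -2.0, 3.0], 'y': [4.0, 5.0, 6.0]})
df = dd.from_pandas(pdf, npartitions=2)

# Flip the sign of positive values in 'x' only, leaving 'y' untouched
df['x'] = df['x'].map_partitions(lambda s: s.where(s <= 0, -s))
print(df.compute())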
In the future, we recommend asking separate questions separately on Stack Overflow.
