DataFrame from variable and filtering data

I have a DataFrame and want to extract 3 columns from it, one of which is chosen by the user. I put the columns in a list, but I need something I can iterate over in a for loop.
So far I have managed with a dictionary built from 2 of the columns, converting each to a list and zipping them, but I really need all 3 columns.
My code:
Data=pd.read_csv(----------)
selec=input("What month would you want to show?")
NewData=[(Data['Country']),(Data['City']),(Data[selec].astype('int64'))]
#here I try to iterate:
iteration=[i for i in NewData if NewData[i]<=25]
print (iteration)
*TypeError: list indices must be integers or slices, not Series*
My CSV is the following:
I want to be able to choose the month with the variable "selec" and filter the results of the month I've chosen... so the output for selec="Feb" would be:
I also tried loc/iloc, but had no luck (unhashable type: 'list').

See the example below for how you can:
- select specific columns from a DataFrame by providing a list of column names between the selection brackets
- select specific rows from a DataFrame by providing a condition between the selection brackets
- iterate over the rows of a DataFrame, although I don't think you need to: if you'd like to keep working with the DataFrame after filtering it, the condition-based selection above is the better choice (you won't have to put the rows back together, and it will likely be faster because pandas is optimized for bulk operations)
import pandas as pd
# this is just for testing, instead of pd.read_csv(...)
df = pd.DataFrame([
    dict(Country="Spain", City="Madrid", Jan="15", Feb="16", Mar="17", Apr="18", May=""),
    dict(Country="Spain", City="Galicia", Jan="1", Feb="2", Mar="3", Apr="4", May=""),
    dict(Country="France", City="Paris", Jan="0", Feb="2", Mar="3", Apr="4", May=""),
    dict(Country="Algeria", City="Argel", Jan="20", Feb="28", Mar="29", Apr="30", May=""),
])
print("---- Original df:")
print(df)
selec = "Feb" # let's pretend this comes from input()
print("\n---- Just the 3 columns:")
# narrow down the df to just the 3 columns; .copy() avoids a SettingWithCopyWarning on the next line
df = df[["Country", "City", selec]].copy()
df[selec] = df[selec].astype("int64") # convert the selec column to proper type
print(df)
print("\n---- Filtered dataframe:")
df1 = df[df[selec] <= 25]
print(df1)
print("\n---- Iterated & filtered rows:")
for row in df.itertuples():
    # we could also use row[3] instead of getattr(...)
    if getattr(row, selec) <= 25:
        print(row)
Output:
---- Original df:
Country City Jan Feb Mar Apr May
0 Spain Madrid 15 16 17 18
1 Spain Galicia 1 2 3 4
2 France Paris 0 2 3 4
3 Algeria Argel 20 28 29 30
---- Just the 3 columns:
Country City Feb
0 Spain Madrid 16
1 Spain Galicia 2
2 France Paris 2
3 Algeria Argel 28
---- Filtered dataframe:
Country City Feb
0 Spain Madrid 16
1 Spain Galicia 2
2 France Paris 2
---- Iterated & filtered rows:
Pandas(Index=0, Country='Spain', City='Madrid', Feb=16)
Pandas(Index=1, Country='Spain', City='Galicia', Feb=2)
Pandas(Index=2, Country='France', City='Paris', Feb=2)
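The steps above can also be wrapped in a small function so that a mistyped month or a non-numeric cell does not crash the script. This is only a sketch under assumptions: the helper name `filter_by_month`, the default threshold of 25 and the use of `pd.to_numeric(..., errors="coerce")` are illustrative choices, not part of the answer above.

```python
import pandas as pd

def filter_by_month(df, month, threshold=25):
    """Hypothetical helper: keep Country, City and `month` where month <= threshold."""
    if month not in df.columns:
        raise KeyError(f"Unknown column: {month!r}")
    out = df[["Country", "City", month]].copy()
    # Cells that cannot be parsed as numbers become NaN instead of raising
    out[month] = pd.to_numeric(out[month], errors="coerce")
    # NaN never satisfies the comparison, so unparseable rows are dropped
    return out[out[month] <= threshold]

# Same test data as in the answer (only two month columns kept for brevity)
df = pd.DataFrame([
    dict(Country="Spain", City="Madrid", Jan="15", Feb="16"),
    dict(Country="Spain", City="Galicia", Jan="1", Feb="2"),
    dict(Country="France", City="Paris", Jan="0", Feb="2"),
    dict(Country="Algeria", City="Argel", Jan="20", Feb="28"),
])

feb = filter_by_month(df, "Feb")  # in a real script, "Feb" would come from input()
print(feb)
```

The result is still a DataFrame, so it can be filtered further or iterated with itertuples() if a loop is really needed.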

Related

Add intermediate rows in a dataframe based on the previous record

Given the following dataframe:
ID  direction  country  time
0   IN         USA      12:10
0   OUT        FRA      14:20
0   OUT        ESP      16:11
1   IN         GER      11:13
1   OUT        USA      10:29
2   OUT        USA      09:21
2   OUT        ESP      21:33
I would like to add the following behaviour to the above dataframe:
If two consecutive rows for the same ID both have "direction" equal to OUT, an intermediate row is inserted between them. It copies the data of the first OUT row, with the direction changed to IN.
Here is an example applied to the above dataframe:
ID  direction  country  time
0   IN         USA      12:10
0   OUT        FRA      14:20
0   IN         FRA      14:20
0   OUT        ESP      16:11
1   IN         GER      11:13
1   OUT        USA      10:29
2   OUT        USA      09:21
2   IN         USA      09:21
2   OUT        ESP      21:33
Thank you for your help.

Maintain a new dataframe
dfNew = pd.DataFrame(columns=dfOld.columns)
and loop through each row of the existing dataframe.
for index, row in dfOld.iterrows():
Look at the value under direction on every iteration and, if it is IN, append that entire row to the new dataframe.
dfNew.loc[len(dfNew.index)] = row
If it is OUT, add the entire row as above and, when the next row has the same ID and is also OUT, also create a new row
dfNew.loc[len(dfNew.index)] = [value1, value2, value3, ...]
or edit a copy of the existing row (contained in row) and add it to the new dataframe as well.
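For comparison, the same result can be obtained without an explicit loop. The sketch below swaps in a different technique from the answer above: it compares each row with the next row of the same ID using groupby().shift(-1), copies the matching OUT rows with direction set to IN, and relies on a stable sort of the original index to place each copy right after its source row. The variable names are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [0, 0, 0, 1, 1, 2, 2],
    "direction": ["IN", "OUT", "OUT", "IN", "OUT", "OUT", "OUT"],
    "country": ["USA", "FRA", "ESP", "GER", "USA", "USA", "ESP"],
    "time": ["12:10", "14:20", "16:11", "11:13", "10:29", "09:21", "21:33"],
})

# Direction of the following row within the same ID (NaN for the last row of each ID)
next_dir = df.groupby("ID")["direction"].shift(-1)

# An OUT row directly followed by another OUT row needs an intermediate IN row
needs_in = (df["direction"] == "OUT") & (next_dir == "OUT")
extra = df[needs_in].assign(direction="IN")

# The copies keep their original index labels; mergesort is stable, so each
# original row stays ahead of its copy when both share the same label
result = (
    pd.concat([df, extra])
    .sort_index(kind="mergesort")
    .reset_index(drop=True)
)
print(result)
```

This assumes the rows are already in the order in which they should be compared, as in the example.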

How to spread the data in pandas?

I'm working on an equivalent of R's spread in pandas. My dataframe looks like this:
Name age Language year Period
Nik 18 English 2018 Beginer
John 19 French 2019 Intermediate
Kane 33 Russian 2017 Advanced
xi 44 Thai 2015 Beginer
and I'm looking for output like this:
Name age Language Beginer Intermediate Advanced
Nik 18 English 2018
John 19 French 2019
Kane 33 Russian 2017
xi 44 Thai 2015
My code:
pd.pivot(x1,values='year', columns=['Period'])
I'm getting only the columns Beginer, Intermediate and Advanced, not the entire dataframe.
While reshaping I tried using index, but it complains about duplicates in the index.
So I created a new index column, but I'm still not getting the entire dataframe.

If I understood correctly you could do something like this:
import numpy as np
import pandas as pd
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
# put each row's year in the column of its Period (all other columns stay 0)
res = res.mul(df['year'], axis=0)
# concat and drop columns
output = pd.concat((df.drop(columns=['year', 'Period']), res), axis=1)
print(output)
Output
Name age Language Advanced Beginner Intermediate
0 Nik 18 English 0 2018 0
1 John 19 French 0 0 2019
2 Kane 33 Russian 2017 0 0
3 xi 44 Thai 0 2015 0
If you want to control the column order (for example, to get closer to the desired output), convert the column to categorical first, and specify the order:
# encode as categorical (the categories must match the spelling used in your Period column)
df['Period'] = pd.Categorical(df['Period'], ['Beginner', 'Advanced', 'Intermediate'], ordered=True)
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
# put each row's year in the column of its Period (all other columns stay 0)
res = res.mul(df['year'], axis=0)
# concat and drop columns
output = pd.concat((df.drop(columns=['year', 'Period']), res), axis=1)
print(output)
Output
Name age Language Beginner Advanced Intermediate
0 Nik 18 English 2018 0 0
1 John 19 French 0 0 2019
2 Kane 33 Russian 0 2017 0
3 xi 44 Thai 2015 0 0
Finally, if you want to replace the 0s with missing values, add a third step:
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
# put each row's year in the column of its Period (all other columns stay 0)
res = res.mul(df['year'], axis=0)
# replace the remaining zeros with missing values
res = res.replace(0, np.nan)
Output (with missing values)
Name age Language Beginner Advanced Intermediate
0 Nik 18 English 2018.0 NaN NaN
1 John 19 French NaN NaN 2019.0
2 Kane 33 Russian NaN 2017.0 NaN
3 xi 44 Thai 2015.0 NaN NaN

One way to get the equivalent of R's spread function is pd.pivot_table.
If you don't need the pivot columns to stay in the index, you can call reset_index() on the newly created df:
new_df = (pd.pivot_table(df, index=['Name','age','Language'],columns='Period',values='year',aggfunc='sum')).reset_index()
which will get you:
Period Name age Language Advanced Beginer Intermediate
0 John 19 French NaN NaN 2019.0
1 Kane 33 Russian 2017.0 NaN NaN
2 Nik 18 English NaN 2018.0 NaN
3 xi 44 Thai NaN 2015.0 NaN
EDIT
If you have many columns in your dataframe and you want to include them in the reshaped dataset:
1. Put the columns used by the pivot itself (i.e. Period and year) in a list.
2. Put all the other columns of your dataframe in a second list (using not in).
3. Use that second list, index_cols, as the index in the pd.pivot_table() call.
non_index_cols = ['Period','year'] # SPECIFY THE 2 COLUMNS IN THE PIVOT TABLE TO BE USED
index_cols = [i for i in df.columns if i not in non_index_cols] # GET ALL THE REST IN A LIST
new_df = (pd.pivot_table(df, index=index_cols,columns='Period',values='year',aggfunc='sum')).reset_index()
new_df will then include all the columns of your initial dataframe.
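If no aggregation is wanted at all, a similar reshape can be sketched with set_index() and unstack(). This is an alternative technique, not the pivot_table call above, and it assumes that each combination of the index columns and Period occurs only once (unstack raises an error on duplicates). The sample frame reproduces the question's data, including its "Beginer" spelling.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Nik", "John", "Kane", "xi"],
    "age": [18, 19, 33, 44],
    "Language": ["English", "French", "Russian", "Thai"],
    "year": [2018, 2019, 2017, 2015],
    "Period": ["Beginer", "Intermediate", "Advanced", "Beginer"],
})

# Every column except the two being spread identifies a row
index_cols = [c for c in df.columns if c not in ("Period", "year")]

wide = (
    df.set_index(index_cols + ["Period"])["year"]  # Series with a MultiIndex
      .unstack("Period")                           # Period values become columns
      .reset_index()                               # identifying columns back to normal columns
)
wide.columns.name = None  # drop the leftover "Period" label on the column axis
print(wide)
```

Cells with no matching Period are NaN, so the year columns come out as floats, just like in the pivot_table output.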

Calculating new rows in a Pandas Dataframe on two different columns

So I'm a beginner at Python and I have a dataframe with Country, avgTemp and year.
What I want to do is calculate new rows on each country where the year adds 20 and avgTemp is multiplied by a variable called tempChange. I don't want to remove the previous values though, I just want to append the new values.
This is how the dataframe looks:
Preferably I would also want to create a loop that runs the code a certain number of times
Super grateful for any help!
If you need to copy the values from the dataframe as an example you can have it here:
Country avgTemp year
0 Afghanistan 14.481583 2012
1 Africa 24.725917 2012
2 Albania 13.768250 2012
3 Algeria 23.954833 2012
4 American Samoa 27.201417 2012
243 rows × 3 columns

If you want to repeat the rows, I'd create a new dataframe, perform any operation on it (add 20 years, multiply the temperature by a constant or an array, etc.) and then use concat() to append it to the original dataframe:
import pandas as pd
tempChange=1.15
data = {'Country':['Afghanistan','Africa','Albania','Algeria','American Samoa'],'avgTemp':[14,24,13,23,27],'Year':[2012,2012,2012,2012,2012]}
df = pd.DataFrame(data)
df_2 = df.copy()
df_2['avgTemp'] = df['avgTemp']*tempChange
df_2['Year'] = df['Year']+20
df = pd.concat([df,df_2]) #ignore_index=True if you wish to not repeat the index value
print(df)
Output:
Country avgTemp Year
0 Afghanistan 14.00 2012
1 Africa 24.00 2012
2 Albania 13.00 2012
3 Algeria 23.00 2012
4 American Samoa 27.00 2012
0 Afghanistan 16.10 2032
1 Africa 27.60 2032
2 Albania 14.95 2032
3 Algeria 26.45 2032
4 American Samoa 31.05 2032
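The question also asks for a loop that repeats this a number of times. A minimal sketch of that idea follows. It assumes that each step is computed from the previous step (so the temperature change compounds) and that `steps = 3`; both are guesses about the intent, and `temp_change` simply mirrors tempChange from the code above.

```python
import pandas as pd

temp_change = 1.15  # assumed multiplier per 20-year step
steps = 3           # assumed number of repetitions

df = pd.DataFrame({
    "Country": ["Afghanistan", "Africa"],
    "avgTemp": [14.0, 24.0],
    "Year": [2012, 2012],
})

frames = [df]
for _ in range(steps):
    nxt = frames[-1].copy()  # start from the most recent projection
    nxt["avgTemp"] = nxt["avgTemp"] * temp_change
    nxt["Year"] = nxt["Year"] + 20
    frames.append(nxt)

# Concatenate once at the end instead of growing the dataframe inside the loop
result = pd.concat(frames, ignore_index=True)
print(result)
```

Collecting the pieces in a list and calling concat() once avoids copying the growing dataframe on every iteration.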

where df is your data frame name:
df['tempChange'] = df['year']+ 20 * df['avgTemp']
This will add a new column to your df with the logic above. I'm not sure I understood your logic correctly, so the math may need some work.

I believe that what you're looking for is
dfName['newYear'] = dfName.apply(lambda x: x['year'] + 20,axis=1)
dfName['tempDiff'] = dfName.apply(lambda x: x['avgTemp']*tempChange,axis=1)
This is how you apply to each row.

How to extract info from original dataframe after doing some analysis on it?

So I had a dataframe and I had to do some cleansing to minimize the duplicates. In order to do that I created a dataframe that had only 8 of the original 40 columns. Now I need two columns from the original dataframe for further analysis, but they would have distorted the outcome if I had used them in my previous analysis. Does anyone have an idea how to "extract" these columns based on the new "clean" dataframe I have?

You can merge the new "clean" dataframe with the other two variables by using the indexes. Let me use a practical example. Suppose the "initial" dataframe, called "df", is:
df
name year reports location
0 Jason 2012 4 Cochice
1 Molly 2012 24 Pima
2 Tina 2013 31 Santa Cruz
3 Jake 2014 2 Maricopa
4 Amy 2014 3 Yuma
while the "clean" dataframe is:
d1
year location
0 2012 Cochice
2 2013 Santa Cruz
3 2014 Maricopa
The remaining columns are saved in dataframe "d2" ( d2 = df[['name','reports']] ):
d2
name reports
0 Jason 4
1 Molly 24
2 Tina 31
3 Jake 2
4 Amy 3
By using the inner join on the indexes, d1.merge(d2, how='inner', left_index=True, right_index=True), you get the following result (the columns of d1 come first, followed by those of d2):
year location name reports
0 2012 Cochice Jason 4
2 2013 Santa Cruz Tina 31
3 2014 Maricopa Jake 2
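The index-based approach can be tried end to end with the sketch below. It uses DataFrame.join, which aligns on the index by default, in place of the merge call. The way d1 is built here is only a stand-in for the real cleaning step, and the whole idea assumes that the cleaned dataframe still carries the original index labels (that is, reset_index() was not called during cleaning).

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
    "year": [2012, 2012, 2013, 2014, 2014],
    "reports": [4, 24, 31, 2, 3],
    "location": ["Cochice", "Pima", "Santa Cruz", "Maricopa", "Yuma"],
})

# Stand-in for the cleaned dataframe: fewer rows and columns, original index kept
d1 = df.loc[[0, 2, 3], ["year", "location"]]

# Left join on the index: only the rows present in d1 are kept
result = d1.join(df[["name", "reports"]])
print(result)
```

If the index was reset during cleaning, this alignment no longer works and a key column shared by both dataframes would be needed.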

You can make a new dataframe with the specified columns:
import pandas as pd
# If your columns are named a, b, c, d etc.
df1 = df[['a','b']]
# This will extract columns 0 and 1 based on their position
# (remember that pandas indexes columns from zero, and the end of the slice is excluded)
df2 = df.iloc[:,0:2]
If you could provide a sample of your data, that would make it easier for us to help you.

Delete rows based on values in column in python

I am cleaning the data in a .csv file before running analytics on it. I am trying to delete the rows that have null values in any of their columns in Python.
Sample file:
Unnamed: 0 2012 2011 2010 2009 2008 2005
0 United States of America 760739 752423 781844 812514 843683 862220
1 Brazil 732913 717185 715702 651879 649996 NaN
2 Germany 520005 513458 515853 519010 518499 494329
3 United Kingdom (England and Wales) 310544 336997 367055 399869 419273 541455
4 Mexico 211921 212141 230687 244623 250932 239166
5 France 193081 192263 192906 193405 187937 148651
6 Sweden 87052 89457 87854 86281 84566 72645
7 Romania 17219 12299 12301 9072 9457 8898
8 Nigeria 15388 NaN 18093 14075 14692 NaN
What I have so far:
from pandas import read_csv
link = "https://docs.google.com/spreadsheets......csv"
data = read_csv(link)
data.head(100000)
How can I delete these rows?

Once you have your data loaded you just need to figure out which rows to remove (isna() also works on text columns such as the country names, where np.isnan would raise a TypeError):
bad_rows = data.isna().any(axis=1)
Then:
data[~bad_rows].head(100)

You need to use the dropna method to remove these values. Passing in how='any' into the method as an argument will remove the row if any of the values is null and how='all' will only remove the row if all of the values are null.
cleaned_data = data.dropna(how='any')
Edit 1.
It's worth noting that you may not want to create a copy of your cleaned data (i.e. cleaned_data = data.dropna(how='any')).
To avoid keeping both DataFrames around, you can pass in the inplace option, which modifies your original DataFrame and returns None.
data.dropna(how='any', inplace=True)
data.head(100)
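The effect of the different dropna() arguments can be checked on a small frame. The sketch below uses a three-row subset of the sample data, and the column name "Country" is an assumption (the sample file calls that column "Unnamed: 0"). Besides how, it shows the subset and thresh parameters, which are useful when only some columns matter.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Country": ["Brazil", "Germany", "Nigeria"],  # assumed column name
    "2012": [732913, 520005, 15388],
    "2011": [717185, 513458, np.nan],
    "2005": [np.nan, 494329, np.nan],
})

# Drop a row if any of its values is missing
complete = data.dropna(how="any")

# Only look at the 2011 column when deciding what to drop
has_2011 = data.dropna(subset=["2011"])

# Keep rows that have at least 3 non-missing values
at_least_3 = data.dropna(thresh=3)

print(complete, has_2011, at_least_3, sep="\n\n")
```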
