I have a CSV file with 4 columns and n lines.
I want the value in the 4th column of every row moved onto its own line, directly below that row.
ex :
[LN],[cb],[I], [LS]
to
[LN],[cb],[I]
[LS]
that is, if my file is:
[LN1],[cb1],[I1], [LS1]
[LN2],[cb2],[I2], [LS2]
[LN3],[cb2],[I3], [LS3]
[LN4],[cb4],[I4], [LS4]
the output file will look like
[LN1],[cb1],[I1]
[LS1]
[LN2],[cb2],[I2]
[LS2]
[LN3],[cb2],[I3]
[LS3]
[LN4],[cb4],[I4]
[LS4]
Test file:
101 Xavier Mexico City 41 88.0
102 Ann Toronto 28 79.0
103 Jana Prague 33 81.0
104 Yi Shanghai 34 80.0
105 Robin Manchester 38 68.0
Output required:
101 Xavier Mexico City 41
88.0
102 Ann Toronto 28
79.0
103 Jana Prague 33
81.0
104 Yi Shanghai 34
80.0
105 Robin Manchester 38
68.0
Split the dataframe into two dataframes, one with the first 3 columns and the other with the last column. Add a helper column to both so you can order them afterwards. Then combine them again and sort first by index (which is identical for entries that were previously in the same row) and then by the helper column.
Since there is no test data, this answer is untested:
from io import StringIO
import pandas as pd
s = """col1,col2,col3,col4
101 Xavier,Mexico City,41,88.0
102 Ann,Toronto,28,79.0
103 Jana,Prague,33,81.0
104 Yi,Shanghai,34,80.0
105 Robin,Manchester,38,68.0"""
df = pd.read_csv(StringIO(s), sep=',')
df1 = df[['col1', 'col2', 'col3']].copy()
df2 = df[['col4']].rename(columns={'col4':'col1'}).copy()
df1['ranking'] = 1
df2['ranking'] = 2
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way
df_out = pd.concat([df1, df2])
df_out = df_out.rename_axis('index_name').sort_values(by=['index_name', 'ranking'], ascending=[True, True])
df_out = df_out.drop(['ranking'], axis=1)
Another solution to this is to convert the table to a list, then rearrange the list to reconstruct the table.
import pandas as pd
df = pd.read_csv(r"test_file.csv")
df_list = df.values.tolist()
new_list = []
for x in df_list:
    # Remove the last element from the list and save it to a variable 'last_x'.
    # This action also modifies the original list.
    last_x = x.pop()
    # Append the modified list and the last value to the new list.
    new_list.append(x)
    new_list.append([last_x])
# Use the new list to create the new table...
new_df = pd.DataFrame(new_list)
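If the goal is only to go from one file to another, the same rearrangement can probably be done without pandas. The following is a minimal sketch using the standard csv module; the function name split_last_column and the in-memory sample data are made up for illustration, and it assumes the fields are comma-separated with an optional space after each comma, as in the question.

```python
import csv
import io

def split_last_column(src, dst):
    """Copy CSV rows from src to dst, writing each row's last field on its own line."""
    writer = csv.writer(dst, lineterminator="\n")
    # skipinitialspace drops the blank that follows a comma, e.g. ", LS1"
    for row in csv.reader(src, skipinitialspace=True):
        if not row:  # ignore blank lines
            continue
        writer.writerow(row[:-1])  # every field except the last
        writer.writerow(row[-1:])  # the last field alone

# In-memory stand-ins for the input and output files (hypothetical sample data)
src = io.StringIO("LN1,cb1,I1, LS1\nLN2,cb2,I2, LS2\n")
dst = io.StringIO()
split_last_column(src, dst)
print(dst.getvalue())
```

With real files, src and dst would be the objects returned by open(..., newline="") for the input and output paths.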
I would like to make my data frame more aesthetically appealing and drop what I believe are the unnecessary first row and column from the multi-index. I would like the column headers to be: 'Rk', 'Team','Conf','G','Rec','ADJOE',.....,'WAB'
Any help is much appreciated.
import pandas as pd
url = 'https://www.barttorvik.com/#'
df = pd.read_html(url)
df = df[0]
df
You only have to iterate over the existing columns and select the second value. Then you can set the list of values as new columns:
import pandas as pd
url = 'https://www.barttorvik.com/#'
df = pd.read_html(url)[0]  # read_html returns a list of DataFrames; take the first table
df.columns = [x[1] for x in df.columns]
df.head()
Output:
Rk Team Conf G Rec AdjOE AdjDE Barthag EFG% EFGD% ... ORB DRB FTR FTRD 2P% 2P%D 3P% 3P%D Adj T. WAB
0 1 Gonzaga WCC 24 22-211–0 122.42 89.05 .97491 60.21 421 ... 30.2120 2318 30.4165 21.710 62.21 41.23 37.821 29.111 73.72 4.611
1 2 Houston Amer 25 21-410–2 117.39 89.06 .95982 53.835 42.93 ... 37.26 27.6141 28.2242 33.3247 54.827 424 34.8108 29.418 65.2303 3.416
When you read from HTML, specify the row number you want as header:
df = pd.read_html(url, header=1)[0]
print(df.head())
output:
Rk Team Conf G Rec ... 2P%D 3P% 3P%D Adj T. WAB
0 1 Gonzaga WCC 24 22-211–0 ... 41.23 37.821 29.111 73.72 4.611
1 2 Houston Amer 25 21-410–2 ... 424 34.8108 29.418 65.2303 3.416
2 3 Kentucky SEC 26 21-510–3 ... 46.342 35.478 29.519 68.997 4.89
3 4 Arizona P12 25 23-213–1 ... 39.91 33.7172 31.471 72.99 6.24
4 5 Baylor B12 26 21-59–4 ... 49.2165 35.966 30.440 68.3130 6.15
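Both answers can be tried without hitting the website. The sketch below builds a small two-level header by hand (the "Unnamed: ..." top-level labels and the two data rows are assumptions standing in for the scraped table) and keeps only the second level using get_level_values, which should be equivalent to the list comprehension over the column tuples.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table: a two-level column header
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 1_level_0", "Team"),
    ("Unnamed: 2_level_0", "Conf"),
])
df = pd.DataFrame(
    [[1, "Gonzaga", "WCC"], [2, "Houston", "Amer"]],
    columns=columns,
)

# Keep only the second header level as the flat column names
df.columns = df.columns.get_level_values(1)
print(list(df.columns))
```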
I think I am overthinking this. I am trying to compute rolling averages from the columns of an existing pandas data frame without overwriting the original data. I iterate over the columns and, for each one, compute a 7-day rolling mean that should be stored as a new column with the suffix _ma. I then want to compare the existing data to the 7-day MA and see how many standard deviations the data is from it, which I can figure out myself. I am just trying to save the MA data as a new data frame.
I have
for column in original_data[ma_columns]:
    ma_df = pd.DataFrame(original_data[ma_columns].rolling(window=7).mean(), columns = str(column)+'_ma')
and getting the error : Index(...) must be called with a collection of some kind, 'Carrier_AcctPswd_ma' was passed
But if I am iterating with
for column in original_data[ma_columns]:
    print('Column Name : ', str(column)+'_ma')
    print('Contents : ', original_data[ma_columns].rolling(window=7).mean())
I get the data I need :
My issue is just saving this as a new data frame, which I can concatenate to the old, and then do my analysis.
EDIT
I have now been able to make a bunch of data frames, but I want to concatenate them together and this is where the issue is:
for column in original_data[ma_columns]:
    MA_data = pd.DataFrame(original_data[column].rolling(window=7).mean())
    for i in MA_data:
        new = pd.concat(i)
        print(i)
<ipython-input-75-7c5e5fa775b3> in <module>
17 # print(type(MA_data))
18 for i in MA_data:
---> 19 new = pd.concat(i)
20 print(i)
21
~\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
279 verify_integrity=verify_integrity,
280 copy=copy,
--> 281 sort=sort,
282 )
283
~\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
307 "first argument must be an iterable of pandas "
308 "objects, you passed an object of type "
--> 309 '"{name}"'.format(name=type(objs).__name__)
310 )
311
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "str"
You should iterate over column names and assign the resulting pandas series as a new named column, for example:
import pandas as pd
original_data = pd.DataFrame({'A': range(100), 'B': range(100, 200)})
ma_columns = ['A', 'B']
for column in ma_columns:
    new_column = column + '_ma'
    # rolling(...).mean() already returns a Series, so it can be assigned directly
    original_data[new_column] = original_data[column].rolling(window=7).mean()
print(original_data)
Output dataframe:
A B A_ma B_ma
0 0 100 NaN NaN
1 1 101 NaN NaN
2 2 102 NaN NaN
3 3 103 NaN NaN
4 4 104 NaN NaN
.. .. ... ... ...
95 95 195 92.0 192.0
96 96 196 93.0 193.0
97 97 197 94.0 194.0
98 98 198 95.0 195.0
99 99 199 96.0 196.0
[100 rows x 4 columns]
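If the original frame should stay untouched, as the question asks, one option is to compute all the moving averages in one go and concatenate them column-wise. This is only a sketch on the same toy data; the _z suffix and the choice of a rolling standard deviation for the "how many standard deviations" comparison are assumptions, not something stated in the question.

```python
import pandas as pd

original_data = pd.DataFrame({"A": range(100), "B": range(100, 200)})
ma_columns = ["A", "B"]

# One rolling object covering all selected columns
rolling = original_data[ma_columns].rolling(window=7)

# New frame holding only the moving averages, with suffixed column names
ma_df = rolling.mean().add_suffix("_ma")

# Column-wise concatenation; original_data itself is not modified
combined = pd.concat([original_data, ma_df], axis=1)

# Assumed metric: distance from the moving average in rolling standard deviations
z = ((original_data[ma_columns] - rolling.mean()) / rolling.std()).add_suffix("_z")

print(combined.tail())
```

Note that pd.concat expects a list of pandas objects, which is why passing a single column name (a str) raised the TypeError above.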
This is my data,
prakash 101
Ram 107
akash 103
sakshi 115
vidushi 110
aman 106
lakshay 99
I want to select all rows from akash to vidushi, or all rows from Ram to aman. In real scenarios there will be thousands of rows and multiple columns, and I will get multiple queries that each select a range of rows based on some column value. How can I do that?
Here's one way to do it:
start = 'akash'
end = 'vidushi'
l = list(df['names']) #ordered list of names
subl = l[l.index(start):l.index(end)+1] #list of names between the start and end
df[df['names'].isin(subl)] #filter dataset for list of names
2 akash 103
3 sakshi 115
4 vidushi 110
Create some variables (which you can adjust), then use .loc and .index[0]. Note: df[0] can be replaced with the name of your header, so if your header is called Names, change all instances of df[0] to df['Names']:
var1 = 'Ram'
var2 = 'aman'
a = df.loc[df[0]==var1].index[0]
b = df.loc[df[0]==var2].index[0]
c = df.iloc[a:b+1,:]
c
output:
0 1
1 Ram 107
2 akash 103
3 sakshi 115
4 vidushi 110
5 aman 106
Try set_index, then use loc (label slices with loc include both endpoints):
df = pd.DataFrame({"name":["prakash","Ram","akash","sakshi","vidushi","aman","lakshay"],"val":[101,107,103,115,110,106,99]})
(df.set_index(['name']).loc["akash":"vidushi"]).reset_index()
output:
name val
0 akash 103
1 sakshi 115
2 vidushi 110
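Since the question mentions many such queries, the lookup could be wrapped in a small helper. The sketch below is one possible shape; select_between is a made-up name, and it assumes the range runs from the first occurrence of the start value to the first occurrence of the end value in the frame's current row order.

```python
import pandas as pd

def select_between(df, column, start, end):
    """Return the rows from the first `start` to the first `end` in `column`, inclusive."""
    values = df[column].tolist()
    first = values.index(start)  # raises ValueError if start is absent
    last = values.index(end)     # raises ValueError if end is absent
    return df.iloc[first:last + 1]

df = pd.DataFrame({
    "name": ["prakash", "Ram", "akash", "sakshi", "vidushi", "aman", "lakshay"],
    "val": [101, 107, 103, 115, 110, 106, 99],
})

result = select_between(df, "name", "Ram", "aman")
print(result)
```

Because it slices by position, the helper does not require the column to be unique or sorted, but it returns an empty frame if the end value appears before the start value.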
You can use a positional slice to select rows (the end position is exclusive):
print(x[2:6])
#output
akash 103
sakshi 115
vidushi 110
aman 106
If you want to fill the values based on a specific column you can use np.where
First, I want to skip a row if more than 2 of its columns are empty, so that rows with more than 2 missing values are filtered out.
Then, since some of the remaining rows still have 1 or 2 empty columns, I want to fill each empty cell with the mean value of its row.
I can run the second step with my code below, but I am not sure how to filter out the rows with more than 2 missing values.
I have tried using dropna, but it deleted all the columns of the table.
My code:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as pp
%matplotlib inline
# high-technology exports as a percentage of manufactured exports
hightech_export = pd.read_csv('hightech_export_1.csv')
# skip a row if more than 2 of its columns are empty
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
# Fill in data with mean value.
m = hightech_export.mean(axis=1)
for i, col in enumerate(hightech_export):
    hightech_export.iloc[:, i] = hightech_export.iloc[:, i].fillna(m)
My dataset:
Country Name 2001 2002 2003 2004
Philippines 71
Malta 62 58 60 58
Singapore 60 56
Malaysia 58 57 55
Ireland 47 41 34 34
Georgia 38 41 24 38
Costa Rica
You can use the .isnull() method for your first task.
Replace this:
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
with:
hightech_export = hightech_export.loc[hightech_export.isnull().sum(axis=1) <= 2]
Ok try this ...
import pandas as pd
import numpy as np
# np.NaN was removed in NumPy 2.0; np.nan is the supported spelling
data1={'Name':['Tom',np.nan,'Mary','Jane'],'Age':[20,np.nan,40,30],'Pay':[np.nan,np.nan,20,25]}
data2={'Name':['Tom','Bob','Mary'],'Age':[40,30,20]}
df1=pd.DataFrame.from_records(data1)
Check the df
df1
Age Name Pay
0 20.0 Tom NaN
1 NaN NaN NaN
2 40.0 Mary 20.0
3 30.0 Jane 25.0
The record with index 1 has 3 missing values.
Replace the missing values with None (pd.np was removed in pandas 2.0, so use np.nan directly):
df1 = df1.replace({np.nan: None})
Now write a function to count the missing values per row, and build a list from it:
def count_na(lst):
    # pd.isna catches both None and NaN and does not count legitimate falsy values such as 0
    missing = [n for n in lst if pd.isna(n)]
    return len(missing)
missing_data=[]
for index,n in df1.iterrows():
    missing_data.append(count_na(list(n)))
Use this list as a new column in the dataframe:
df1['missing']=missing_data
df1 should look like this
Age Name Pay missing
0 20 Tom None 1
1 None None None 3
2 40 Mary 20 0
3 30 Jane 25 0
So filtering becomes easy....
# Now only take records with <2 missing
df1[df1.missing<2]
Hope that helps...
A simple way is to compare, row by row, the count of non-missing values with the number of columns in the dataframe. You can then replace the remaining NaN values with the column means of the dataframe.
Code could be:
result = df.loc[df.count(axis=1) >= (len(df.columns) - 2)].fillna(
    df.mean(numeric_only=True))
With your example data, it gives as expected:
Country Name 2001 2002 2003 2004
1 Malta 62.0 58.00 60.000000 58.0
2 Singapore 60.0 49.25 39.333333 56.0
3 Malaysia 58.0 57.00 39.333333 55.0
4 Ireland 47.0 41.00 34.000000 34.0
5 Georgia 38.0 41.00 24.000000 38.0
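Try this (thresh is the minimum number of non-missing values a row must have to be kept; with 5 columns and at most 2 missing values allowed, that is 3):
hightech_export.dropna(thresh=3, inplace=True)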
in place of the line of code
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
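Putting the two steps of the question together, a sketch could look like the following. The frame is typed in by hand from the question's dataset; where the flattened listing does not show which year a value belongs to (for example the single Philippines value), the placement is an assumption. Missing cells are filled with the mean of their own row, as the question describes.

```python
import numpy as np
import pandas as pd

# Hand-typed copy of the question's data; the position of some values is assumed
hightech_export = pd.DataFrame({
    "Country Name": ["Philippines", "Malta", "Singapore", "Malaysia",
                     "Ireland", "Georgia", "Costa Rica"],
    "2001": [71, 62, 60, 58, 47, 38, np.nan],
    "2002": [np.nan, 58, np.nan, 57, 41, 41, np.nan],
    "2003": [np.nan, 60, np.nan, np.nan, 34, 24, np.nan],
    "2004": [np.nan, 58, 56, 55, 34, 38, np.nan],
})
year_cols = ["2001", "2002", "2003", "2004"]

# Step 1: keep only rows with at most 2 missing values
kept = hightech_export[hightech_export.isna().sum(axis=1) <= 2].copy()

# Step 2: fill each remaining gap with the mean of its own row
row_means = kept[year_cols].mean(axis=1)
kept[year_cols] = kept[year_cols].apply(lambda col: col.fillna(row_means))

print(kept)
```

The fillna call aligns row_means with each column on the row index, so every gap receives the mean of the row it sits in.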
I have a 4x4 dataframe (df). I created two child dataframes (4x1 and 4x2) and updated both. In the first case the parent is updated; in the second it is not. How can I ensure that the parent dataframe is updated when a child dataframe is updated?
I have a 4x4 dataframe (df). Using it as the parent, I created two child dataframes: dfA with a single column (4x1) and dfB with two columns (4x2). Both subsets contain NaN values. When I call fillna on dfA and dfB, the NaN values in each are replaced with the given value, as expected. However, when I then check the parent dataframe, the updated values show up in the first case (4x1) but not in the second (4x2). Why is that, and what should I do so that changes in a child dataframe are reflected in the parent dataframe?
studentnames = ['Maths','English','Soc.Sci', 'Hindi', 'Science']
semisteronemarks = [15, 50, np.NaN, 50, np.NaN]
semistertwomarks = [25, 53, 45, 45, 54]
semisterthreemarks = [20, 50, 45, 15, 38]
semisterfourmarks = [26, 33, np.NaN, 35, 34]
semisters = ['Rakesh','Rohit', 'Sam', 'Sunil']
df1 = pd.DataFrame([semisteronemarks,semistertwomarks,semisterthreemarks,semisterfourmarks],semisters, studentnames)
# case 1
dfA = df['Soc.Sci']
dfA.fillna(value = 98, inplace = True)
print(dfA)
print(df)
# case 2
dfB = df[['Soc.Sci', 'Science']]
dfB.fillna(value = 99, inplace = True)
print(dfB)
print(df)
'''
## contents of parent df ->>
## Actual Output -
# case 1
Maths English Soc.Sci Hindi Science
Rakesh 15 50 98.0 50 NaN
Rohit 25 53 45.0 45 54.0
Sam 20 50 45.0 15 38.0
Sunil 26 33 98.0 35 34.0
# case 2
Maths English Soc.Sci Hindi Science
Rakesh 15 50 NaN 50 NaN
Rohit 25 53 45.0 45 54.0
Sam 20 50 45.0 15 38.0
Sunil 26 33 NaN 35 34.0
## Expected Output -
# case 1
Maths English Soc.Sci Hindi Science
Rakesh 15 50 98.0 50 NaN
Rohit 25 53 45.0 45 54.0
Sam 20 50 45.0 15 38.0
Sunil 26 33 98.0 35 34.0
# case 2
Maths English Soc.Sci Hindi Science
Rakesh 15 50 99.0 50 NaN
Rohit 25 53 45.0 45 54.0
Sam 20 50 45.0 15 38.0
Sunil 26 33 99.0 35 34.0
# note the difference in output for column Soc.Sci in case 2.
'''
In your code, df1 is defined but df is not.
With the approach you are using:
# case 1
dfA = df1['Soc.Sci'] # changed df to df1
dfA.fillna(value = 98, inplace = True)
df1['Soc.Sci'] = dfA # Because dfA is not a dataframe but a series
# If you would rather write
#     df1['Soc.Sci'] = dfA['Soc.Sci']
# then dfA must be created as a DataFrame instead of a Series:
#     dfA = df1[['Soc.Sci']]  # double brackets make it a DataFrame
# case 2
dfB = df1[['Soc.Sci', 'Science']] # changed df to df1
dfB.fillna(value = 99, inplace = True)
df1[['Soc.Sci','Science']] = dfB[['Soc.Sci','Science']]
print(df1)
I would suggest just using fillna on the parent df and assigning the result back:
df1['Soc.Sci'] = df1['Soc.Sci'].fillna(value=99)
You should have seen a warning:
Warning (from warnings module):
...
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
It means that dfB may be a copy instead of a view, and according to the results it is. There is little that can be done here; specifically, you cannot force pandas to return a view. Whether you get one depends on internal details known only to pandas and its developers.
But it is always possible to assign to the columns of the parent DataFrame:
# case 2
df = pd.DataFrame([semisteronemarks,semistertwomarks,semisterthreemarks,semisterfourmarks],semisters, studentnames)
df[['Soc.Sci', 'Science']] = df[['Soc.Sci', 'Science']].fillna(value = 99)
print(df)
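A variation on the same idea, sketched here on a cut-down version of the question's data (only three of the columns, typed in by hand): fillna also accepts a dict mapping column names to fill values, so the parent can be updated in one assignment without creating a child frame at all. The two different fill values are only there to show that each column can get its own.

```python
import numpy as np
import pandas as pd

# Cut-down, hand-typed version of the question's frame
df = pd.DataFrame(
    {
        "Maths": [15, 25, 20, 26],
        "Soc.Sci": [np.nan, 45, 45, np.nan],
        "Science": [np.nan, 54, 38, 34],
    },
    index=["Rakesh", "Rohit", "Sam", "Sunil"],
)

# Fill only the listed columns, each with its own value, and assign the result back
df = df.fillna({"Soc.Sci": 98, "Science": 99})
print(df)
```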