parsing complex csv file - python

I have a CSV file that maps each country to some value, but it is not well formed: its header has a repetitive pattern: Countries, Amount, Countries, Amount, ... (the Amounts measure different things, for example suicide rate, alcohol consumption, etc., and for some countries data is missing). Please see the input DataFrame, df_in.
I would like to get the countries as the index and those 'Amounts' as columns; please see the output DataFrame, df_out:
import pandas as pd

df_in = pd.read_csv('https://dl.dropboxusercontent.com/u/40513206/input.csv', sep=';', header=0,
                    index_col=None, na_values=[''], mangle_dupe_cols=False)
df_out = pd.read_csv('https://dl.dropboxusercontent.com/u/40513206/output.csv', sep=';', header=0,
                     index_col=None, na_values=[''], mangle_dupe_cols=False)
I was thinking that I would first get all the unique countries from the input and make them the index of a new, empty DataFrame, for example:
col_pat = df_in.columns[df_in.columns.to_series().str.contains('Countries')]
cntry = df_in.loc[:, col_pat]
un_elm = pd.Series(map(str, pd.unique(cntry.values.ravel())))
countries = un_elm[un_elm != 'nan']
and then start splitting the main DataFrame (Countries as index, Amount as column) and joining the pieces cumulatively onto the empty DataFrame.
Any other ideas, thanks?

First use .iloc to select columns by position (.ix is deprecated):
df_in = pd.read_csv('https://dl.dropboxusercontent.com/u/40513206/input.csv', sep=';', header=0,
                    index_col=None, na_values=[''], mangle_dupe_cols=False)
df1 = df_in.iloc[:, :2].dropna().set_index('Countries1')
df2 = df_in.iloc[:, 2:4].dropna().set_index('Countries2')
df3 = df_in.iloc[:, 4:].dropna().set_index('Countries3')
Then concatenate on axis=1:
pd.concat([df1,df2,df3], axis=1)
Amount Amount Amount
Austria NaN 5 NaN
Denmark 6 NaN NaN
France 3 NaN NaN
Ireland NaN NaN 6
Norway NaN 2 NaN
Russia NaN NaN 5
Slovenia NaN NaN 4
Spain NaN 3 3
Sweden 5 1 2
Switzerland 4 4 NaN
U.K. 1 NaN NaN
United States 2 NaN 1
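The same idea generalizes to any number of column pairs. A minimal sketch, assuming the header keeps alternating Countries<i>/Amount pairs exactly as in df_in:
# Split df_in into (Countries, Amount) column pairs, index each pair
# by its country column, then concatenate the pieces side by side.
pairs = [df_in.iloc[:, i:i + 2] for i in range(0, df_in.shape[1], 2)]
frames = []
for chunk in pairs:
    chunk = chunk.dropna()
    # the first column of each pair holds the country names
    frames.append(chunk.set_index(chunk.columns[0]))
result = pd.concat(frames, axis=1)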

Related

Creation of DataFrame with specific conditions on rows

Given the following pandas DataFrame:
ID      country  money  other  money_add
832932  France   12131  19     82932
217#8#  NaN      NaN    NaN    NaN
1329T2  NaN      NaN    NaN    NaN
832932  France   NaN    30     NaN
31728#  NaN      NaN    NaN    NaN
I would like to make the following modifications for each row:
If the ID value contains a '#', the row is left unchanged.
If the ID value contains no '#' and country is NaN, "Other" is written to the country column and 0 to the other column.
Finally, only if the money column is NaN and the other column has a value, we assign money and money_add from the following lookup table:
other_ID  money  money_add
19        4532   723823
50        1213   238232
18        1813   273283
30        1313   83293
0         8932   3920
Example of the resulting table:
ID      country  money  other  money_add
832932  France   12131  19     82932
217#8#  NaN      NaN    NaN    NaN
1329T2  Other    8932   0      3920
832932  France   1313   30     83293
31728#  NaN      NaN    NaN    NaN
First set both columns from a list where the two conditions match; then align the lookup table (df1) on the main frame's other values, masking out the '#' rows, and let DataFrame.update fill only the matched rows:
m1 = df['ID'].str.contains('#')          # rows whose ID contains '#'
m2 = df['country'].isna()                # rows with missing country
df.loc[~m1 & m2, ['country', 'other']] = ['Other', 0]
df1 = df1.set_index(df1['other_ID'])     # align lookup table on other_ID
df = df.set_index(df['other'].mask(m1))  # align main frame on other, skipping '#' rows
df.update(df1, overwrite=False)          # fill only missing money/money_add
df = df.reset_index(drop=True)
print(df)
ID country money other money_add
0 832932 France 12131 19.0 82932.0
1 217#8# NaN NaN NaN NaN
2 1329T2 Other 8932.0 0.0 3920.0
3 832932 France 1313.0 30.0 83293.0
4 31728# NaN NaN NaN NaN
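For reference, a self-contained sketch of the whole pipeline; the constructor literals simply re-enter the two example tables as df and df1:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['832932', '217#8#', '1329T2', '832932', '31728#'],
                   'country': ['France', np.nan, np.nan, 'France', np.nan],
                   'money': [12131, np.nan, np.nan, np.nan, np.nan],
                   'other': [19, np.nan, np.nan, 30, np.nan],
                   'money_add': [82932, np.nan, np.nan, np.nan, np.nan]})
df1 = pd.DataFrame({'other_ID': [19, 50, 18, 30, 0],
                    'money': [4532, 1213, 1813, 1313, 8932],
                    'money_add': [723823, 238232, 273283, 83293, 3920]})

m1 = df['ID'].str.contains('#')
m2 = df['country'].isna()
df.loc[~m1 & m2, ['country', 'other']] = ['Other', 0]
df1 = df1.set_index(df1['other_ID'])
df = df.set_index(df['other'].mask(m1))
df.update(df1, overwrite=False)   # fills only NaN cells in money/money_add
df = df.reset_index(drop=True)
print(df)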

Insert new data to dataframe

I have a dataframe
import pandas as pd

employees = [('Jack', 34, 'Sydney'),
             ('Riti', 31, 'Delhi'),
             ('Aadi', 16, 'London'),
             ('Mark', 18, 'Delhi')]
dataFrame = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])
I would like to extend this DataFrame with some new columns. I did it with:
data = ['Height', 'Weight', 'Eyecolor']
duduFrame = pd.DataFrame(columns=data)
and, after concatenating the two frames, this results in:
Name Age City Height Weight Eyecolor
0 Jack 34.0 Sydney NaN NaN NaN
1 Riti 31.0 Delhi NaN NaN NaN
2 Aadi 16.0 London NaN NaN NaN
3 Mark 18.0 Delhi NaN NaN NaN
So far so good.
Now I have new Data about Height, Weight and Eyecolor for "Riti":
Riti_data = [(172, 74, 'Brown')]
This I would like to add to dataFrame.
I tried it with
dataFrame.loc['Riti', [duduFrame]] = Riti_data
But I get the error
ValueError: Buffer has wrong number of dimensions (expected 1, got 3)
What am I doing wrong?
Try this:
dataFrame.loc[dataFrame['Name'] == 'Riti', ['Height', 'Weight', 'Eyecolor']] = Riti_data
Your mistake, I think, was not specifying the columns: you passed duduFrame (a whole DataFrame) instead of the list of column names you want to assign the new values to.
You can do this :
df = pd.concat([dataFrame, duduFrame])
df = df.set_index('Name')
df.loc['Riti',data] = [172,74,'Brown']
Resulting in :
Age City Height Weight Eyecolor
Name
Jack 34.0 Sydney NaN NaN NaN
Riti 31.0 Delhi 172 74 Brown
Aadi 16.0 London NaN NaN NaN
Mark 18.0 Delhi NaN NaN NaN
Pandas has a pd.concat function whose role is to concatenate dataframes, either vertically (axis=0) or, in your case, horizontally (axis=1).
However, I personally see merging horizontally more as a pd.merge use case, which gives you more flexibility over how exactly the merge happens.
In your case, you want to match on the Name column, right?
So I would do it in 2 steps:
Build both dataframes with column Name and their respective data
Merge both dataframes with pd.merge(df1, df2, on = 'Name', how = 'outer')
The how='outer' parameter makes sure that you don't lose any data from df1 or df2 in case some Name has data in only one of the two dataframes. This makes it easier to catch errors in your data, and it makes you think in terms of SQL JOINs, which is a useful way of thinking :).
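A minimal sketch of that approach, assuming the question's dataFrame and a hypothetical second frame df2 holding Riti's measurements:
import pandas as pd

df1 = pd.DataFrame([('Jack', 34, 'Sydney'), ('Riti', 31, 'Delhi'),
                    ('Aadi', 16, 'London'), ('Mark', 18, 'Delhi')],
                   columns=['Name', 'Age', 'City'])
# the new data, keyed by Name
df2 = pd.DataFrame([('Riti', 172, 74, 'Brown')],
                   columns=['Name', 'Height', 'Weight', 'Eyecolor'])
# outer join keeps names that appear in only one of the frames
result = pd.merge(df1, df2, on='Name', how='outer')
print(result)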

How to create multiple dataframes from an Excel data table

I have extracted this data frame from an Excel spreadsheet using the pandas library. After getting the needed columns, I have a table formatted like this:
REF PLAYERS
0 103368 Andrés Posada Sanmiguel
1 300552 Diego Posada Sanmiguel
2 103304 Roberto Motta Stanziola
3 NaN NaN
4 REF PLAYERS
5 1047012 ANABELLA EISMANN DE AMAYA
6 104701 FERNANDO ENRIQUE AMAYA CASTRO
7 103451 AUGUSTO ANTONIO ALVARADO AZCARRAGA
8 103484 Kevin Adrian Villarreal Kam
9 REF PLAYERS
10 NaN NaN
11 NaN NaN
12 NaN NaN
13 NaN NaN
14 REF PLAYERS
15 NaN NaN
16 NaN NaN
17 NaN NaN
18 NaN NaN
19 REF PLAYERS
I want to create multiple dataframes, using each ['REF', 'PLAYERS'] row as the column header of a new dataframe.
Suggestions are welcome; I also need to preserve the blank rows. I'm a pandas newbie.
For this to work, you must first read the dataframe from the file differently: pass header=None to your pd.read_excel() call, because right now your columns are called "REF" and "PLAYERS", and we want those header rows kept as data so we can group by them.
The first column will then be named 0, and the first line of the solution becomes the following, where df is the name of your dataframe:
# Set a unique index for each group
df["group_id"] = (df[0] == "REF").cumsum()
Solution:
# Set a unique index for each group
df["group_id"] = (df["name_of_first_column"] == "REF").cumsum()
# Iterate over the groups
dataframes = []
for name, group in df.groupby("group_id"):
    df_ = group
    # promote the 1st row to column names
    df_.columns = df_.iloc[0]
    # and drop it
    df_ = df_.iloc[1:]
    # keep only the two data columns (this drops group_id)
    df_ = df_[["REF", "PLAYERS"]]
    # append to the list of dataframes
    dataframes.append(df_)
All your dataframes are now stored in the list dataframes.
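A hypothetical end-to-end run, condensing the steps above and assuming the sheet lives in players.xlsx (the file name is illustrative):
import pandas as pd

df = pd.read_excel("players.xlsx", header=None)   # keep the "REF PLAYERS" rows as data
df["group_id"] = (df[0] == "REF").cumsum()
dataframes = []
for name, group in df.groupby("group_id"):
    block = group
    block.columns = block.iloc[0]                 # promote the header row
    block = block.iloc[1:][["REF", "PLAYERS"]]    # drop it and the group_id column
    dataframes.append(block)
print(dataframes[0])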
You can split your dataframe into equal lengths (in your case 4 rows for each df) using np.split.
Since you want 4 rows per dataframe, you can split it into 5 different dfs:
import numpy as np
dfs = [df.loc[idx] for idx in np.split(df.index,5)]
And then create your individual dataframes:
df1 = dfs[1]
df1
REF PLAYERS
4 REF PLAYERS
5 1047012 ANABELLA EISMANN DE AMAYA
6 104701 FERNANDO ENRIQUE AMAYA CASTRO
7 103451 AUGUSTO ANTONIO ALVARADO AZCARRAGA
df2 = dfs[2]
df2
REF PLAYERS
8 103484 Kevin Adrian Villarreal Kam
9 REF PLAYERS
10 NaN NaN
11 NaN NaN
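Note that np.split raises a ValueError when the index does not divide evenly into the requested number of chunks; np.array_split relaxes this and yields near-equal chunks. A sketch under the same setup:
import numpy as np

# near-equal chunks even when len(df) is not a multiple of 5
dfs = [df.loc[idx] for idx in np.array_split(df.index, 5)]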

Python : Remodelling a dataframe and regrouping data from a specific column with predefined rows

Let's say that I have this dataframe with four columns : "Name", "Value", "Ccy" and "Group" :
import pandas as pd
Name = ['ID', 'Country', 'IBAN','Dan_Age', 'Dan_city', 'Dan_country', 'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country' ]
Value = ['TAMARA_CO', 'GERMANY','FR56','18', 'Berlin', 'GER', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP']
Ccy = ['','','','EUR','EUR','USD','USD','','CHF', '','DKN','']
Group = ['0','0','0','1','1','1','1','2','2','2','3','3']
df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})
print(df)
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 Dan_Age 18 EUR 1
4 Dan_city Berlin EUR 1
5 Dan_country GER USD 1
6 Dan_sex M USD 1
7 Dan_Age 22 2
8 Dan_country FRA CHF 2
9 Dan_sex M 2
10 Dan_city Madrid DKN 3
11 Dan_country ESP 3
I want to represent this data differently before saving it to a csv. I would like to group the duplicates in the column "Name" with the associated values in "Value" and "Ccy". I want the data in the "Value" and "Ccy" columns to be stored in the row (index) given by the "Group" column, so that data from different groups is not mixed.
Then, if the name is in group 0, it is general data, so I would like all the rows for that "Name" to be filled with the same value.
So I would like to get this result :
ID_Value Country_Value IBAN_Value Dan_age Dan_age_Ccy Dan_city_Value Dan_city_Ccy Dan_sex_Value
1 TAMARA GER FR56 18 EUR Berlin EUR M
2 TAMARA GER FR56 22 M
3 TAMARA GER FR56 Madrid DKN
I cannot find how to do the first part. With the code below, I do not get what I want, even after removing the empty columns:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
Can anyone help me? Thank you!
You can use the following. See comments in code for each step:
import numpy as np

s = df.loc[df['Group'] == '0', 'Name'].tolist() # this variable will be used later according to Condition 2
df['Name'] = pd.Categorical(df['Name'], categories=df['Name'].unique(), ordered=True) # this preserves order before pivoting
df = df.pivot(index='Group', columns='Name') # transforms long-to-wide per expected output
for col in df.columns:
    if col[1] in s:
        df[col] = df[col].shift().ffill() # Condition 2
df = df.iloc[1:].replace('', np.nan).dropna(axis=1, how='all').fillna('') # dataframe cleanup
df.columns = ['_'.join(col) for col in df.columns.swaplevel()] # column name cleanup
df
df
Out[1]:
ID_Value Country_Value IBAN_Value Dan_Age_Value Dan_city_Value \
Group
1 TAMARA_CO GERMANY FR56 18 Berlin
2 TAMARA_CO GERMANY FR56 22
3 TAMARA_CO GERMANY FR56 Madrid
Dan_country_Value Dan_sex_Value Dan_Age_Ccy Dan_city_Ccy \
Group
1 GER M EUR EUR
2 FRA M
3 ESP DKN
Dan_country_Ccy Dan_sex_Ccy
Group
1 USD USD
2 CHF
3
From there, you can drop columns you don't want, change strings from "TAMARA_CO" to "TAMARA" and "GERMANY" to "GER", use reset_index(drop=True), etc., as sketched below.
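A sketch of that cleanup, assuming df is the pivoted result above; the string-shortening rules are illustrative assumptions:
# keep only the columns from the expected output
keep = ['ID_Value', 'Country_Value', 'IBAN_Value', 'Dan_Age_Value',
        'Dan_Age_Ccy', 'Dan_city_Value', 'Dan_city_Ccy', 'Dan_sex_Value']
df = df[keep]
df['ID_Value'] = df['ID_Value'].str.split('_').str[0]   # 'TAMARA_CO' -> 'TAMARA' (assumed rule)
df['Country_Value'] = df['Country_Value'].str[:3]       # 'GERMANY' -> 'GER' (assumed rule)
df = df.reset_index(drop=True)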
You can do this quite easily with only 3 steps:
Split your data frame into 2 parts: the "general data" (which we want as a series) and the more specific data. Each data frame now contains the same kinds of information.
The key part of your problem: reorganizing the data. All you need is the pandas pivot function. It does exactly what you need!
Add the general information and the pivoted data back together.
# Split Data
general = df[df.Group == "0"].set_index("Name")["Value"].copy()
main_df = df[df.Group != "0"]
# Pivot Data
result = main_df.pivot(index="Group", columns=["Name"],
                       values=["Value", "Ccy"]).fillna("")
result.columns = [f"{c[1]}_{c[0]}" for c in result.columns]
# Create a data frame that has an identical row for each group
general_df = pd.DataFrame([general]*3, index=result.index)
general_df.columns = [c + "_Value" for c in general_df.columns]
# Merge the data back together
result = general_df.merge(result, on="Group")
The result given above does not give the exact column order you want, so you'd have to specify that manually with
final_cols = ["ID_Value", "Country_Value", "IBAN_Value",
              "Dan_Age_Value", "Dan_Age_Ccy", "Dan_city_Value",
              "Dan_city_Ccy", "Dan_sex_Value"]
result = result[final_cols]

Self creating columns based on value in another

This is very similar to the question I asked yesterday. The aim is to add functionality that creates a column based on the value found in another. For example, when the script finds a country code in a specified file, I would like it to create a column named '<country code> Total' and sum the number of units for every row with that same country code.
This is what my script outputs at the moment: [screenshot omitted]
What I want to see: [screenshot omitted]
My script:
df['Sum of Revenue'] = df['Units Sold'] * df['Dealer Price']
df = df.sort_values(['End Consumer Country', 'Currency Code'])
# Sets the first value of the index by position
df.loc[df.index[0], 'Unit Total'] = df['Units Sold'].sum()
df.loc[df.index[0], 'Total Revenue'] = df['Sum of Revenue'].sum()
# Sums the amount of Units with the End Consumer Country AR
df['AR Total'] = df.loc[df['End Consumer Country'] == 'AR', 'Units Sold'].sum()
# Sums the amount of Units with the End Consumer Country AU
df['AU Total'] = df.loc[df['End Consumer Country'] == 'AU', 'Units Sold'].sum()
# Sums the amount of Units with the End Consumer Country NZ
df['NZ Total'] = df.loc[df['End Consumer Country'] == 'NZ', 'Units Sold'].sum()
However, as I know the countries that will come up in this file, I have added them to my script explicitly. How would I write my script so that, if it finds another country code, for example GB, it creates a column called 'GB Total' and sums the units for every row with the country code GB?
Any help would be greatly appreciated!
If you truly need that format, then here is how I would proceed (starting data below):
# Get those first two columns
d = {'Sum of Revenue': 'Total Revenue', 'Units Sold': 'Total Sold'}
for col, newcol in d.items():
    df.loc[df.index[0], newcol] = df[col].sum()
# Add the rest for every country:
s = df.groupby('End Consumer Country')['Units Sold'].sum().to_frame().T.add_suffix(' Total')
s.index = [df.index[0]]
df = pd.concat([df, s], axis=1, sort=False)
Output: df:
End Consumer Country Sum of Revenue Units Sold Total Revenue Total Sold AR Total AU Total NZ Total US Total
a AR 13.486216 1 124.007334 28.0 3.0 7.0 11.0 7.0
b AR 25.984073 2 NaN NaN NaN NaN NaN NaN
c AU 21.697871 3 NaN NaN NaN NaN NaN NaN
d AU 10.962232 4 NaN NaN NaN NaN NaN NaN
e NZ 16.528398 5 NaN NaN NaN NaN NaN NaN
f NZ 29.908619 6 NaN NaN NaN NaN NaN NaN
g US 5.439925 7 NaN NaN NaN NaN NaN NaN
As you can see, pandas added a bunch of NaN values, since we only assigned something to the first row and a DataFrame must be rectangular.
It's far simpler to have a separate DataFrame that summarizes the totals overall and within each country. If that is acceptable, everything simplifies to a single .pivot_table:
df.pivot_table(index='End Consumer Country',
               values=['Sum of Revenue', 'Units Sold'],
               margins=True,
               aggfunc='sum').T.add_suffix(' Total')
Output:
End Consumer Country AR Total AU Total NZ Total US Total All Total
Sum of Revenue 39.470289 32.660103 46.437018 5.439925 124.007334
Units Sold 3.000000 7.000000 11.000000 7.000000 28.000000
Same information, much simpler to code.
Sample data:
import pandas as pd
import numpy as np

np.random.seed(123)
df = pd.DataFrame({'End Consumer Country': ['AR', 'AR', 'AU', 'AU', 'NZ', 'NZ', 'US'],
                   'Sum of Revenue': np.random.normal(20, 6, 7),
                   'Units Sold': np.arange(1, 8, 1)},
                  index=list('abcdefg'))
End Consumer Country Sum of Revenue Units Sold
a AR 13.486216 1
b AR 25.984073 2
c AU 21.697871 3
d AU 10.962232 4
e NZ 16.528398 5
f NZ 29.908619 6
g US 5.439925 7
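For the dynamic behaviour the question asks about (a '<code> Total' column for whatever country codes appear), a minimal loop-based sketch in the spirit of the original script, using the sample df above; note it fills the whole column with the scalar total rather than only the first row:
# create one '<country> Total' column per distinct country code
for country in df['End Consumer Country'].unique():
    mask = df['End Consumer Country'] == country
    df[f'{country} Total'] = df.loc[mask, 'Units Sold'].sum()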
